Filtering Spam Email with a Naive Bayes Classifier

GoodPanpan     2022-08-26

Key points:

1. Building word vectors from text

Split each document into words with python and build a word vector. This requires a vocabulary first; to keep things simple, we build it directly from the given documents by collecting every word that appears.
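A minimal sketch of this step (the tokenizer and the two toy documents here are illustrative; the full version appears in the code section below):

```python
import re

def tokenize(text):
    # split on non-word characters, lowercase, keep tokens longer than 2 chars
    return [tok.lower() for tok in re.split(r"\W+", text) if len(tok) > 2]

docs = ["Buy cheap meds now", "Meeting tomorrow about the project"]
token_lists = [tokenize(d) for d in docs]

# the vocabulary is simply the union of all words seen in the corpus
vocab = sorted(set(w for doc in token_lists for w in doc))
print(vocab)
```

Each document can then be represented as a fixed-length vector over this vocabulary.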

2. Computing the probability p(x|y) from word vectors

When we attempt to classify a document, we multiply a lot of probabilities together to
get the probability that a document belongs to a given class. This will look something
like p(w0|1)p(w1|1)p(w2|1). If any of these numbers are 0, then when we multiply
them together we get 0. To lessen the impact of this, we’ll initialize all of our occurrence
counts to 1, and we’ll initialize the denominators to 2.
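The effect of that initialization (add-one, or Laplace, smoothing) is easy to see on made-up numbers; the toy counts below are purely illustrative:

```python
import numpy as np

# toy word counts for one class: the third word was never seen in this class
counts = np.array([3.0, 2.0, 0.0])

unsmoothed = counts / counts.sum()            # contains an exact zero
smoothed = (counts + 1) / (counts.sum() + 2)  # start counts at 1, denominator at 2

print(unsmoothed)  # the zero would wipe out any product it appears in
print(smoothed)    # every word now has a small nonzero probability
```

With smoothing, a word unseen in one class no longer forces that class's probability to zero.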

 

Another problem is underflow: doing too many multiplications of small numbers.
When we go to calculate the product p(w0|ci)p(w1|ci)p(w2|ci)...p(wN|ci) and many
of these numbers are very small, we’ll get underflow, or an incorrect answer. (Try to
multiply many small numbers in Python. Eventually it rounds off to 0.) One solution
to this is to take the natural logarithm of this product. If you recall from algebra,
ln(a*b) = ln(a)+ln(b). Doing this allows us to avoid the underflow or round-off
error problem. Do we lose anything by using the natural log of a number rather than
the number itself? No: the natural log is monotonically increasing, so whichever class
had the higher probability also has the higher log-probability, and the classification
decision is unchanged.
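The underflow, and the log fix, can be demonstrated directly (the probabilities here are made up for illustration):

```python
import math

# multiplying many small probabilities underflows to exactly 0.0 ...
probs = [1e-5] * 80
product = 1.0
for p in probs:
    product *= p
print(product)  # 1e-400 is below the smallest representable positive float

# ... but summing logarithms keeps the value comfortably representable
log_product = sum(math.log(p) for p in probs)
print(log_product)
```

Since only the comparison between classes matters, working entirely in log space is safe.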

 

3. Using a bag-of-words model

Up until this point we’ve treated the presence or absence of a word as a feature. This
could be described as a set-of-words model. If a word appears more than once in a
document, that may convey information about the document that mere presence or
absence does not. Recording word counts instead is known as a bag-of-words model.
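The difference between the two vectorizations is one line (vocabulary and document below are illustrative):

```python
vocab = ["cheap", "meds", "meeting", "now"]
doc = ["cheap", "meds", "cheap", "now"]

# set-of-words: record only the presence or absence of each vocabulary word
set_vec = [1 if w in doc else 0 for w in vocab]

# bag-of-words: record how many times each vocabulary word occurs
bag_vec = [doc.count(w) for w in vocab]

print(set_vec)  # [1, 1, 0, 1]
print(bag_vec)  # [2, 1, 0, 1]
```

In the code below this corresponds to `returnVec[...] = 1` versus `returnVec[...] += 1` in `setOfWords2Vec`.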

 

4. Code

# -*- coding: utf-8 -*-
"""
Created on Tue Mar 28 17:22:48 2017

@author: MyHome
"""
# Split each document into words with Python and build word vectors,
# then use naive Bayes to classify documents probabilistically.
import numpy as np
import re
from random import shuffle


def createVocabList(dataSet):
    """Build the vocabulary: the union of all words seen in the corpus."""
    vocabSet = set()
    for document in dataSet:
        vocabSet = vocabSet | set(document)
    return list(vocabSet)


def setOfWords2Vec(vocabList, inputSet):
    """Convert a document into a count vector over the vocabulary."""
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            # returnVec[vocabList.index(word)] = 1   # set-of-words model
            returnVec[vocabList.index(word)] += 1    # bag-of-words model
        else:
            print("the word: %s is not in the vocabulary" % word)
    return returnVec


def trainNB(trainMatrix, trainCategory):
    """Estimate log p(w|class) for every word and the prior p(class=1)."""
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    p = sum(trainCategory) / float(numTrainDocs)  # prior probability of class 1
    # Laplace smoothing: initialize counts to 1 and denominators to 2
    p0Num = np.ones(numWords)
    p1Num = np.ones(numWords)
    p0Denom = 2.0
    p1Denom = 2.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    # take logs so long products of small probabilities don't underflow
    p1_vec = np.log(p1Num / p1Denom)
    p0_vec = np.log(p0Num / p0Denom)
    return p0_vec, p1_vec, p


def classifyNB(inputVec, p0_vec, p1_vec, p):
    """Compare the two log-posteriors and return the more probable class."""
    p1 = sum(inputVec * p1_vec) + np.log(p)
    p0 = sum(inputVec * p0_vec) + np.log(1.0 - p)
    if p1 > p0:
        return 1
    return 0


def textParse(bigString):
    """Tokenize: split on non-word characters, lowercase, drop short tokens."""
    listOfTokens = re.split(r"\W+", bigString)
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]


def spamTest():
    """Spam classification: 25 spam + 25 ham emails, train on 40, test on 10."""
    docList = []
    classList = []
    for i in range(1, 26):
        wordList = textParse(open('email/spam/%d.txt' % i).read())
        docList.append(wordList)
        classList.append(1)
        wordList = textParse(open('email/ham/%d.txt' % i).read())
        docList.append(wordList)
        classList.append(0)

    vocabList = createVocabList(docList)
    dataSet = list(zip(docList, classList))  # zip returns an iterator in Python 3
    shuffle(dataSet)                         # shuffles in place and returns None
    data, labels = zip(*dataSet)
    trainMat = []
    trainClass = []
    testData = data[40:]
    testLabels = labels[40:]
    for index in range(40):
        trainMat.append(setOfWords2Vec(vocabList, data[index]))
        trainClass.append(labels[index])

    p0, p1, p = trainNB(np.array(trainMat), np.array(trainClass))
    errorCount = 0
    for index in range(len(testData)):
        wordVector = setOfWords2Vec(vocabList, testData[index])
        if classifyNB(np.array(wordVector), p0, p1, p) != testLabels[index]:
            errorCount += 1
    print("the error rate is:", float(errorCount) / len(testData))


if __name__ == "__main__":
    spamTest()

5. Summary

  Using probabilities can sometimes be more effective than using hard rules for classification.
Bayesian probability and Bayes’ rule give us a way to estimate unknown probabilities
from known values.
  You can reduce the need for a lot of data by assuming conditional independence
among the features in your data. The assumption we make is that the probability of
one word doesn’t depend on any other words in the document. We know this assumption
is a little simple. That’s why it’s known as naive Bayes. Despite its simplistic
assumptions, naive Bayes is effective at classification.
  There are a number of practical considerations when implementing naive Bayes in
a modern programming language. Underflow is one problem that can be addressed
by using the logarithm of probabilities in your calculations. The bag-of-words model is
an improvement on the set-of-words model when approaching document classification.
There are a number of other improvements, such as removing stop words, and
you can spend a long time optimizing a tokenizer.
