Theory
Bayes' Theorem
Bayes' theorem is the law that relates conditional probabilities to one another:
$$P(A|B) = \cfrac{P(B|A) \cdot P(A)}{P(B)}$$
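As a quick numeric check of the formula, here is a minimal sketch. All probabilities are made-up values for illustration, and `p_b_given_not_a` is an extra assumption, needed only to obtain P(B) through the law of total probability.

```python
# Hypothetical numbers, chosen only to illustrate the formula.
p_a = 0.01               # P(A): prior probability of A
p_b_given_a = 0.9        # P(B|A): probability of B when A holds
p_b_given_not_a = 0.05   # P(B|not A): probability of B when A does not hold

# Law of total probability: P(B) = P(B|A)P(A) + P(B|not A)P(not A)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 4))  # 0.1538
```

Even though P(B|A) is high, the small prior P(A) keeps the posterior P(A|B) at about 15%.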
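Naive Bayes Classifier
A naive Bayes classifier is a probabilistic classifier. We use the following definitions:
- B: the sample has feature vector B
- A: the sample belongs to class A
With these definitions, the terms of Bayes' formula read as follows:
- P(A|B): the probability that a sample with feature vector B belongs to class A (the quantity we want to compute)
- P(B|A): the probability that feature vector B occurs within class A (estimated from the training samples)
- P(A): the probability of class A (its frequency in the training samples)
- P(B): the probability of feature vector B (its frequency in the training samples)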
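For a naive Bayes classifier we further assume that the components B_i of the feature vector are conditionally independent given the class. The naive Bayes formula can then be written as
$$P(A|B) = \cfrac{P(A)\prod_{i} P(B_{i}|A)}{P(B)}$$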
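Every quantity on the right-hand side can be estimated from the training samples. To make a prediction, compute this probability for each class and pick the class with the highest value. Since P(B) is the same for every class, it can be dropped when comparing classes.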
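The procedure can be sketched in plain Python for categorical features. This is a minimal illustration and not the scikit-learn implementation used later: the toy samples and the helper names `posterior_scores` and `predict` are made up, and no smoothing is applied, so a feature value never seen in a class drives that class's score to zero.

```python
from collections import Counter, defaultdict

# Hypothetical toy training set: (feature vector B, class A)
samples = [
    (("sunny", "hot"), "no"),
    (("sunny", "mild"), "no"),
    (("rainy", "mild"), "yes"),
    (("rainy", "cool"), "yes"),
    (("sunny", "cool"), "yes"),
]

# Number of training samples per class, used for P(A)
class_counts = Counter(label for _, label in samples)

# feature_counts[label][i][value]: how many samples of class `label`
# have `value` as their i-th feature, used for P(B_i|A)
feature_counts = defaultdict(lambda: defaultdict(Counter))
for features, label in samples:
    for i, value in enumerate(features):
        feature_counts[label][i][value] += 1


def posterior_scores(features):
    """Return P(A) * prod_i P(B_i|A) for every class A (P(B) is omitted)."""
    scores = {}
    for label, n in class_counts.items():
        score = n / len(samples)  # P(A)
        for i, value in enumerate(features):
            score *= feature_counts[label][i][value] / n  # P(B_i|A)
        scores[label] = score
    return scores


def predict(features):
    """Pick the class with the highest unnormalized posterior."""
    scores = posterior_scores(features)
    return max(scores, key=scores.get)


print(posterior_scores(("sunny", "cool")))
print(predict(("sunny", "cool")))  # yes
```

For `("sunny", "cool")` the class `no` scores 0 because `cool` never appears with `no` in the toy data, while `yes` scores 0.6 * 1/3 * 2/3.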
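Naive Bayes Classifiers with Continuous Features
Continuous values can be handled in two ways:
- Discretize the continuous values into intervals
- Assume the feature follows a normal distribution or some other distribution (a strong prior assumption), estimate its parameters from the samples, and plug the probability density into Bayes' formula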
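Both options can be sketched with the standard library alone. The data, the bin edges, and the helper names (`to_bin`, `gaussian_pdf`, `predict`) are assumptions made up for this illustration, and only a single continuous feature is used.

```python
import bisect
import math
from statistics import mean, pstdev

# Hypothetical one-dimensional continuous feature, grouped by class
values_by_class = {
    "a": [1.0, 1.2, 0.8, 1.1],
    "b": [3.0, 2.8, 3.3, 2.9],
}


# Option 1: discretize into intervals (the bin edges are an arbitrary choice)
def to_bin(x, edges=(1.5, 2.5)):
    """Map x to the index of the interval it falls into."""
    return bisect.bisect_right(edges, x)


# Option 2: assume a normal distribution per class and use its density
def gaussian_pdf(x, mu, sigma):
    """Density of the normal distribution N(mu, sigma^2) at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))


# Estimate mean and standard deviation of each class from the samples
params = {c: (mean(v), pstdev(v)) for c, v in values_by_class.items()}
total = sum(len(v) for v in values_by_class.values())
priors = {c: len(v) / total for c, v in values_by_class.items()}


def predict(x):
    """Pick the class maximizing P(A) * density(x | A)."""
    scores = {c: priors[c] * gaussian_pdf(x, *params[c]) for c in params}
    return max(scores, key=scores.get)


print(to_bin(1.05), to_bin(2.0), to_bin(3.1))  # 0 1 2
print(predict(1.05), predict(3.1))             # a b
```

After discretization the bin index can be treated like any other categorical feature value. In the Gaussian variant the density replaces the frequency P(B_i|A).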
Code Implementation
Loading the data: newsgroup text data
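# Alternative: download the dataset through scikit-learn instead of reading local files
# from sklearn.datasets import fetch_20newsgroups
# news = fetch_20newsgroups(subset='all')
# print(len(news.data))
# print(news.data[0])
from sklearn import datasets
# load_files expects one sub-folder per category; the folder names become the class labels
train = datasets.load_files("./20newsbydate/20news-bydate-train")
test = datasets.load_files("./20newsbydate/20news-bydate-test")
print(train.DESCR)      # None, because no description was passed to load_files
print(len(train.data))  # number of training documents
print(train.data[0])    # raw bytes of the first document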
None
11314
b"From: cubbie@garnet.berkeley.edu ( )\nSubject: Re: Cubs behind Marlins? How?\nArticle-I.D.: agate.1pt592$f9a\nOrganization: University of California, Berkeley\nLines: 12\nNNTP-Posting-Host: garnet.berkeley.edu\n\n\ngajarsky@pilot.njin.net writes:\n\nmorgan and guzman will have era's 1 run higher than last year, and\n the cubs will be idiots and not pitch harkey as much as hibbard.\n castillo won't be good (i think he's a stud pitcher)\n\n This season so far, Morgan and Guzman helped to lead the Cubs\n at top in ERA, even better than THE rotation at Atlanta.\n Cubs ERA at 0.056 while Braves at 0.059. We know it is early\n in the season, we Cubs fans have learned how to enjoy the\n short triumph while it is still there.\n"
Processing the data: feature extraction (text vectorization)
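from sklearn.feature_extraction.text import CountVectorizer
# Bag-of-words counts; drop English stop words and ignore bytes that cannot be decoded
vec = CountVectorizer(stop_words="english", decode_error='ignore')
# Learn the vocabulary on the training set only, then reuse it for the test set
train_vec = vec.fit_transform(train.data)
test_vec = vec.transform(test.data)
print(train_vec.shape)  # (number of documents, vocabulary size)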
(11314, 129782)
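To make the shape above concrete, here is a small sketch of what `CountVectorizer` produces on a made-up two-sentence corpus. The sentences and the names `toy_vec`, `docs`, and `vocab` are hypothetical, and stop-word removal is left out so the output is easy to follow.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus
docs = ["the cubs win the game", "the braves lose the game"]

toy_vec = CountVectorizer()
counts = toy_vec.fit_transform(docs)  # sparse matrix: one row per document

# vocabulary_ maps each term to its column index
vocab = sorted(toy_vec.vocabulary_, key=toy_vec.vocabulary_.get)
print(vocab)             # ['braves', 'cubs', 'game', 'lose', 'the', 'win']
print(counts.toarray())  # [[0 1 1 0 2 1]
                         #  [1 0 1 1 2 0]]

# transform() reuses the learned vocabulary; the unseen word "mets" is ignored
print(toy_vec.transform(["the mets win"]).toarray())  # [[0 0 0 0 1 1]]
```

Each column corresponds to one vocabulary term, which is why the real training matrix has 129782 columns. Calling `transform` rather than `fit_transform` on the test set keeps the columns aligned with those seen during training.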
Model training
from sklearn.naive_bayes import MultinomialNB
# Multinomial naive Bayes works directly on the word-count features
bays = MultinomialNB()
bays.fit(train_vec, train.target)
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
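The printed parameters show `alpha=1.0` and `fit_prior=True`. The following sketch, on a made-up count matrix, checks by hand what these settings are understood to mean: class priors are taken from class frequencies, and each P(B_i|A) is a smoothed relative count (N_Ai + alpha) / (N_A + alpha * n_features). The toy data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical count matrix: 3 documents, 3 vocabulary terms
X_toy = np.array([[2, 0, 1],
                  [1, 0, 2],
                  [0, 3, 1]])
y_toy = np.array([0, 0, 1])

model = MultinomialNB(alpha=1.0, fit_prior=True).fit(X_toy, y_toy)

# Manual estimate for class 0: add alpha to every term count, then normalize
alpha = 1.0
counts_c0 = X_toy[y_toy == 0].sum(axis=0)  # term counts in class 0: [3, 0, 3]
manual_c0 = (counts_c0 + alpha) / (counts_c0.sum() + alpha * X_toy.shape[1])

print(manual_c0)                          # smoothed P(B_i|A=0)
print(np.exp(model.feature_log_prob_[0])) # the model's estimate
print(np.exp(model.class_log_prior_))     # class frequencies 2/3 and 1/3
```

With smoothing, the second term still gets probability 1/9 in class 0 even though it never occurs there, so a single unseen word no longer zeroes out the whole product.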
Model evaluation
Using the classifier's built-in score method
bays.score(test_vec,test.target)
0.80244291024960168
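For a classifier, `score` returns the mean accuracy, i.e. the fraction of samples whose predicted label equals the true label. A minimal sketch on made-up data (the arrays and the names `toy_clf` and `manual_accuracy` are hypothetical) compares it with a manual computation:

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy count data: 4 documents, 2 vocabulary terms, 2 classes
X_toy = np.array([[2, 0], [3, 1], [0, 2], [1, 3]])
y_toy = np.array([0, 0, 1, 1])

toy_clf = MultinomialNB().fit(X_toy, y_toy)
pred = toy_clf.predict(X_toy)

# Fraction of correct predictions, computed by hand
manual_accuracy = float(np.mean(pred == y_toy))

# All three values should agree
print(manual_accuracy, toy_clf.score(X_toy, y_toy), accuracy_score(y_toy, pred))
```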
Using classification_report from sklearn.metrics
from sklearn.metrics import classification_report
# Predict the test set, then report precision, recall, and F1 per newsgroup
y = bays.predict(test_vec)
print(classification_report(test.target, y, target_names=test.target_names))
                         precision    recall  f1-score   support

             alt.atheism      0.80      0.81      0.80       319
           comp.graphics      0.65      0.80      0.72       389
 comp.os.ms-windows.misc      0.80      0.04      0.08       394
comp.sys.ibm.pc.hardware      0.55      0.80      0.65       392
   comp.sys.mac.hardware      0.85      0.79      0.82       385
          comp.windows.x      0.69      0.84      0.76       395
            misc.forsale      0.89      0.74      0.81       390
               rec.autos      0.89      0.92      0.91       396
         rec.motorcycles      0.95      0.94      0.95       398
      rec.sport.baseball      0.95      0.92      0.93       397
        rec.sport.hockey      0.92      0.97      0.94       399
               sci.crypt      0.80      0.96      0.87       396
         sci.electronics      0.79      0.70      0.74       393
                 sci.med      0.88      0.87      0.87       396
               sci.space      0.84      0.92      0.88       394
  soc.religion.christian      0.81      0.95      0.87       398
      talk.politics.guns      0.72      0.93      0.81       364
   talk.politics.mideast      0.93      0.94      0.94       376
      talk.politics.misc      0.68      0.62      0.65       310
      talk.religion.misc      0.88      0.44      0.59       251

             avg / total      0.81      0.80      0.78      7532
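To read the columns of the report: precision is the share of predictions for a class that are correct, recall is the share of that class's true samples that were found, and F1 is their harmonic mean. The sketch below uses hypothetical helper names and toy labels, then recomputes two F1 values from the rounded precision and recall in the report. It does not guard against classes with no predictions or no true samples.

```python
def precision_recall(y_true, y_pred, label):
    """Precision and recall of one class, computed from paired label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # true positives
    fp = sum(t != label and p == label for t, p in pairs)  # false positives
    fn = sum(t == label and p != label for t, p in pairs)  # false negatives
    return tp / (tp + fp), tp / (tp + fn)


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy labels: class "a" is predicted rarely but always correctly
y_true = ["a", "a", "a", "b", "b"]
y_pred = ["a", "b", "b", "b", "b"]
p, r = precision_recall(y_true, y_pred, "a")
print(p, r)  # precision 1.0, recall 1/3

# Rounded values taken from the report above
print(round(f1(0.80, 0.04), 2))  # 0.08 (comp.os.ms-windows.misc)
print(round(f1(0.88, 0.44), 2))  # 0.59 (talk.religion.misc)
```

The comp.os.ms-windows.misc row shows why F1 is useful: its precision of 0.80 looks fine, but with a recall of 0.04 the classifier misses almost all documents of that group, and F1 drops to 0.08.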