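Naive Bayes

Supervised Learning
Supervised means that you have a set of examples and you know the correct answer for each one.
We give the model many examples, each with a number of features (attributes). If the model can pick out the right features, it can classify new examples.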

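Exercise
Which of the following problems can be solved with supervised classification?
□ √ Given an album of labeled photos, try to recognize a particular person in a photo
□ × Analyze bank data to find unusual transactions and flag them as suspected fraud
No clear definition of an unusual transaction is given, and there are no examples showing what it means
□ √ Given someone's music preferences and the characteristics of the music they love, such as tempo or genre, recommend a song they might like
□ × Divide Udacity students into groups based on their learning styles
We don't know which type each student belongs to, and there is no large set of samples with known correct answers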

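Features and Labels
In machine learning, we usually take features as the input and try to produce a label as the output.
Taking song recognition as an example, the input features are intensity, tempo, genre, and gender, and the output label is like / don't like.

Given a brand-new data point, what can you say about it?

Feature Visualization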

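A machine learning algorithm can define a decision surface (also called a decision boundary).
The decision surface usually lies somewhere between two different classes.
On one side, the algorithm predicts that every possible data point belongs to one class;
on the other side, it predicts that every possible data point belongs to the other class.
The decision surface separates one class from the other, and it generalizes to data points that have not been seen before.

When the decision surface is a straight line, we call it a linear decision surface.

What a machine learning algorithm does is take data and turn it into a decision surface. That surface applies to all future cases and helps us determine which class a new data point belongs to.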
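
To make the idea of a decision surface concrete, here is a minimal sketch. It assumes scikit-learn's GaussianNB and a small hand-made two-class data set; the probe points and the names `probe_points` and `side_of_surface` are illustrative choices, not part of the course material. The classifier is trained once and then asked about points it has never seen, and each answer depends only on which side of the learned surface the point falls.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy training data (an assumption for illustration): two classes that sit
# on opposite sides of the origin.
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
y = np.array([1, 1, 1, 2, 2, 2])

clf = GaussianNB()
clf.fit(X, y)  # turn the data into a decision surface

# Points the classifier has never seen. Each one is assigned to the class
# on whose side of the surface it lies.
probe_points = np.array([[-1.0, -1.0], [1.0, 1.0], [2.0, -0.5], [-2.0, 0.5]])
side_of_surface = clf.predict(probe_points)
print(side_of_surface)  # [1 2 2 1]
```

For this particular toy data both classes have the same per-feature spread, so the surface is close to a straight line through the origin, which makes it a linear decision surface in the sense described above.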

An algorithm for finding the decision surface: Naive Bayes

sklearn Naive Bayes / Gaussian Naive Bayes

import numpy as np
X=np.array([[-1,-1],[-2,-1],[-3,-2],[1,1],[2,1],[3,2]])
Y=np.array([1,1,1,2,2,2])
from sklearn.naive_bayes import GaussianNB
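clf = GaussianNB()               # create the classifier
clf.fit(X,Y)                     # supply the training data: X holds the features, Y holds the labels
print(clf.predict([[-0.8,-1]]))  # ask the trained classifier for the label of this particular point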

Calling fit is how the classifier learns the patterns in the data; it can then use the learned patterns to make predictions.
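
As a rough illustration of what "learning the patterns" means for GaussianNB, the sketch below inspects a few attributes that scikit-learn fills in during fit. The training data is the same toy set used above; which attributes to print is my own choice for illustration.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
Y = np.array([1, 1, 1, 2, 2, 2])

clf = GaussianNB()
clf.fit(X, Y)

print(clf.classes_)      # the class labels seen during fit
print(clf.class_prior_)  # estimated probability of each class
print(clf.theta_)        # per-class mean of each feature

# predict_proba reports how strongly the model favors each class for a point.
probabilities = clf.predict_proba([[-0.8, -1]])
print(probabilities)
```

The prediction for a new point is simply the class with the highest probability under these learned per-class statistics.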

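Computing GaussianNB Accuracy
To evaluate a classifier, we calculate how accurately it classifies.
The metric used to judge how well the classifier performs is called accuracy.
Accuracy:
the number of points that are classified correctly divided by the total number of points in the test set.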
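
The definition can be written out directly. This is a small sketch with made-up predictions and labels; the helper name `manual_accuracy` is hypothetical and only there to mirror the formula.

```python
import numpy as np

def manual_accuracy(predictions, true_labels):
    """Fraction of test points whose predicted label matches the true label."""
    predictions = np.asarray(predictions)
    true_labels = np.asarray(true_labels)
    # Comparing element-wise gives True/False per point; the mean of those
    # values is (number correct) / (total number of points).
    return float(np.mean(predictions == true_labels))

example_pred = [1, 1, 2, 2, 1]  # made-up predictions
example_true = [1, 2, 2, 2, 1]  # made-up true labels
example_accuracy = manual_accuracy(example_pred, example_true)
print(example_accuracy)  # 0.8 (4 of 5 points are correct)
```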

Method 1:

def NBAccuracy(features_train, labels_train, features_test, labels_test):
    """ compute the accuracy of your Naive Bayes classifier """
    ### import the sklearn module for GaussianNB
    from sklearn.naive_bayes import GaussianNB

    ### create classifier
    clf = GaussianNB()

    ### fit the classifier on the training features and labels
    clf.fit(features_train,labels_train)

    ### use the trained classifier to predict labels for the test features
    pred = clf.predict(features_test)


    ### calculate and return the accuracy on the test data
    ### this is slightly different than the example, 
    ### where we just print the accuracy
    ### you might need to import an sklearn module
    accuracy = clf.score(features_test,labels_test)
    return accuracy

Method 2:

from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
clf.fit(features_train,labels_train)
pred = clf.predict(features_test)

from sklearn.metrics import accuracy_score
print(accuracy_score(labels_test, pred))

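In machine learning, you want the algorithm to generalize to new data that differs in some respects from the data it was trained on. That is why the data is split into a training set and a test set.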

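In this course, this is what we should always do:
Hold out 10% of the data and use it as the test set. This is the data that really tells you how much progress you have made in learning the patterns in the data.
When reporting results, report performance on the test set, because it gives a better and fairer picture of how well the classifier generalizes beyond the data it was trained on.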
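
One possible way to hold out 10% of the data is scikit-learn's `train_test_split`. The sketch below uses synthetic two-class data that I generate purely for illustration; the course provides its own splitting code, so treat this as an assumption about how you might do it yourself.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic data (illustrative only): two well-separated clusters.
rng = np.random.RandomState(0)
features = np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
labels = np.array([0] * 100 + [1] * 100)

# Keep 10% of the points aside as the test set.
features_train, features_test, labels_train, labels_test = train_test_split(
    features, labels, test_size=0.1, random_state=42)

clf = GaussianNB()
clf.fit(features_train, labels_train)  # learn only from the training set

# Report the accuracy on the held-out test set, not on the training set.
test_accuracy = clf.score(features_test, labels_test)
print(len(features_train), len(features_test))  # 180 20
print(test_accuracy)
```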

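Naive Bayes lets us infer labels from text sources.
It is called naive because it ignores word order: each word is treated as independent evidence for a label, no matter where it appears in the text.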
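
A quick way to see what ignoring word order means is to turn sentences into word counts. The sketch below assumes scikit-learn's `CountVectorizer` as the word-counting step, and the two sentences are made up. They mean opposite things, yet they produce identical features.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up sentences with the same words in a different order.
sentences = ["the dog chased the cat", "the cat chased the dog"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(sentences).toarray()

# Each row holds one count per vocabulary word; word position is lost.
same_features = bool((counts[0] == counts[1]).all())
print(same_features)  # True
```

A classifier that only sees these counts cannot tell the two sentences apart.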

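Naive Bayes:
Strengths:
Easy to implement
Copes with very large feature spaces
Simple to run and very efficient

Weaknesses:
It can break in odd ways
Phrases made up of several words, whose combined meaning is not obvious from the individual words, are handled poorly by Naive Bayes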

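So, depending on the problem you want to solve and the data set you have, Naive Bayes should not be treated as a black box. You need a theoretical understanding of how the algorithm works and whether it suits your problem. Then test it: split the data into a training set and a test set, and check how the classifier performs on the test set. If the results are disappointing, you may have the wrong algorithm or the wrong parameters.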

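Mini-Project
We have a set of emails, half written by one person and half by another person at the same company. Our goal is to tell the two authors apart based only on the text of each email. We start with Naive Bayes in this mini-project and extend to other algorithms in later projects.
We will start by giving you a list of strings. Each string is the text of one email after preprocessing. We then provide code to split the data set into training and test sets (you will learn how to do the preprocessing and splitting in the next lesson; for now, use the code we provide).
What makes Naive Bayes special is that it is very well suited to text classification. When working with text, it is common to treat each word as a feature, which leads to a large number of features. The relative simplicity of the algorithm and the independent-features assumption of Naive Bayes make it a strong performer for classifying text. In this mini-project, you will download and install sklearn on your computer and use Naive Bayes to classify emails by author.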
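
The word-as-feature idea can be sketched on a tiny scale. Everything here is an assumption for illustration: the texts and the author labels `author_a` and `author_b` are invented, and the classifier is `MultinomialNB`, a Naive Bayes variant commonly used with word counts, rather than the GaussianNB used elsewhere in these notes.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented training emails and hypothetical author labels.
train_texts = [
    "meeting schedule budget report",
    "budget review meeting agenda",
    "lunch pizza weekend game",
    "game tickets weekend plans",
]
train_authors = ["author_a", "author_a", "author_b", "author_b"]

# CountVectorizer makes one feature per distinct word;
# MultinomialNB then learns which words each author tends to use.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_authors)

new_emails = ["please send the budget report", "pizza after the game"]
predicted_authors = list(model.predict(new_emails))
print(predicted_authors)  # ['author_a', 'author_b']
```

Words that never appeared in training (such as "please") are simply ignored, so the prediction rests on the words the model has seen for each author.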

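Install the required Python packages with pip:
- Install sklearn: pip install scikit-learn
- For reference, see the sklearn installation instructions
- Install the Natural Language Toolkit: pip install nltk
- Get the Intro to Machine Learning source code. You will need git to clone the repository: git clone https://github.com/udacity/ud120-projects.git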

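You only need to do this once; the code base contains the starter code for all the mini-projects. Go into the tools/ directory and run startup.py. It first checks for the required Python modules, then downloads and unzips a large data set that we will use heavily later. The download and unzipping take a while, but you do not have to wait for them to finish before starting Part 1.
The code of startup.py is as follows (it uses Python 2 print statements):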

#!/usr/bin/python

print
print "checking for nltk"
try:
    import nltk
except ImportError:
    print "you should install nltk before continuing"

print "checking for numpy"
try:
    import numpy
except ImportError:
    print "you should install numpy before continuing"

print "checking for scipy"
try:
    import scipy
except:
    print "you should install scipy before continuing"

print "checking for sklearn"
try:
    import sklearn
except:
    print "you should install sklearn before continuing"

print
print "downloading the Enron dataset (this may take a while)"
print "to check on progress, you can cd up one level, then execute <ls -lthr>"
print "Enron dataset should be last item on the list, along with its current size"
print "download will complete at about 423 MB"
import urllib
url = "https://www.cs.cmu.edu/~./enron/enron_mail_20150507.tar.gz"
urllib.urlretrieve(url, filename="../enron_mail_20150507.tar.gz") 
print "download complete!"


print
print "unzipping Enron dataset (this may take a while)"
import tarfile
import os
os.chdir("..")
tfile = tarfile.open("enron_mail_20150507.tar.gz", "r:gz")
tfile.extractall(".")

print "you're ready to go!"
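
The script above is written for Python 2. If you only want to repeat its module check under Python 3, one possible sketch is shown below. The helper name `missing_modules` is hypothetical, and this covers only the checking step, not the download.

```python
import importlib.util

def missing_modules(module_names):
    """Return the names from module_names that cannot be found for import."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]

# The same four modules that startup.py checks for.
required = ["nltk", "numpy", "scipy", "sklearn"]
for name in missing_modules(required):
    print("you should install", name, "before continuing")
```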

Author Identification Accuracy and Timing the NB Classifier

import sys
from time import time
sys.path.append("../tools/")
from email_preprocess import preprocess


'''
features_train and features_test are the features for the training
and testing datasets, respectively
labels_train and labels_test are the corresponding item labels
'''

features_train, features_test, labels_train, labels_test = preprocess()

### your code goes here ###

from sklearn.naive_bayes import GaussianNB
clf=GaussianNB()

t0 = time()
clf.fit(features_train,labels_train)
print "training time:", round(time()-t0, 3), "s"  # time needed to train the classifier

t0 = time()
clf.predict(features_test)
print "predicting time:", round(time()-t0, 3), "s" # time needed for the classifier to predict

print clf.score(features_test,labels_test)
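
The script above needs the Enron data and the course's `preprocess` helper. To try the same timing pattern without them, here is a self-contained Python 3 sketch on synthetic data. The data, its size, and the 90/10 split are my own assumptions, so the timings say nothing about the real email data set.

```python
from time import time

import numpy as np
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the email features (illustrative only).
rng = np.random.RandomState(0)
n_per_class, n_features = 500, 50
features = np.vstack([rng.normal(-1, 1, (n_per_class, n_features)),
                      rng.normal(1, 1, (n_per_class, n_features))])
labels = np.array([0] * n_per_class + [1] * n_per_class)

# Shuffle, then keep the last 10% as the test set.
order = rng.permutation(len(labels))
features, labels = features[order], labels[order]
split = int(0.9 * len(labels))
features_train, features_test = features[:split], features[split:]
labels_train, labels_test = labels[:split], labels[split:]

clf = GaussianNB()

t0 = time()
clf.fit(features_train, labels_train)
training_time = time() - t0
print("training time:", round(training_time, 3), "s")

t0 = time()
pred = clf.predict(features_test)
predicting_time = time() - t0
print("predicting time:", round(predicting_time, 3), "s")

accuracy = clf.score(features_test, labels_test)
print(accuracy)
```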
