Principles of the Gaussian naive Bayes classifier
There is plenty of material online; the main ideas are the following:
The central limit theorem (the model's training process)
The assumption is that for any phenomenon in nature, a measured quantity approaches a Gaussian distribution as the number of observations tends to infinity, so the parameters of its expression can be obtained directly by computing the variance and mean of the sequence of feature values.
If a sequence $X$ has standard deviation $\sigma$ and mean $\mu$, then its distribution satisfies

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

Training the model thus amounts to computing the mean and standard deviation of each feature (per class) and storing them in the model.
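As a minimal sketch of this training step (the function name fit_gaussian_params and the dictionary layout are my own illustration, not part of the implementation below):

import numpy as np

def fit_gaussian_params(X, y):
    # For each class k, estimate the mean and variance of every feature column.
    params = {}
    for k in np.unique(y):
        X_k = X[y == k]                 # samples belonging to class k
        params[k] = (X_k.mean(axis=0),  # per-feature mean
                     X_k.var(axis=0))   # per-feature variance
    return params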
Bayes' formula (the model's prediction process)

$$P(y \mid F) = \frac{P(y)\,P(F \mid y)}{P(F)}$$

Here the feature vector $F$ contains multiple features, and the "naive" in naive Bayes means that the features are assumed to be mutually independent, so

$$P(F \mid y) = \prod_i P(f_i \mid y)$$

Each probability $P(f_i \mid y)$ can be evaluated from the Gaussian form given in the central-limit-theorem part, while $P(y)$ is the prior probability, which is easy to obtain from the raw data.
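A minimal sketch of this prediction step (gaussian_pdf and predict_one are illustrative helpers continuing the sketch above, not the implementation given later):

import numpy as np

def gaussian_pdf(x, mean, var):
    # Gaussian density with the given per-feature mean and variance
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def predict_one(x, params, priors):
    # Score each class by prior * product of per-feature likelihoods;
    # P(F) is the same for every class, so it can be dropped for the argmax.
    scores = {k: priors[k] * np.prod(gaussian_pdf(x, mean, var))
              for k, (mean, var) in params.items()}
    return max(scores, key=scores.get)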
Main reference for this post:
李小文:高斯朴素贝叶斯的原理及Python实现
Python implementation
I implemented the classifier as a class, mainly using numpy, as shown in the following code:
# naive_bayes.py
import numpy as np

class Gaussian_naive_bayes:
    def __init__(self):
        # Parameters are filled in by fit(); None means "not trained yet".
        self.gaussian_param_mat_3d = None
        self.feature_prob_vec_2d = None
        self.label_prob_vec_1d = None

    def fit(self, X: np.ndarray, y: np.ndarray) -> None:
        # data preparation
        X_ndarray_2d = np.array(X)
        y_ndarray_1d = np.array(y)
        self.point_num = np.shape(y_ndarray_1d)[0]
        self.KL = y_ndarray_1d.max() + 1      # kinds of label
        self.KF = np.shape(X_ndarray_2d)[1]   # kinds of feature
        self.gaussian_param_mat_3d = np.ndarray((self.KL, self.KF, 2), dtype=np.float64)  # element: (var, avg) per label and feature
        self.feature_prob_vec_2d = np.ndarray((self.KF, 2), dtype=np.float64)             # element: (var, avg) per feature, for P(F)
        self.label_prob_vec_1d = np.ndarray((self.KL,), dtype=np.float64)                 # element: prior prob of each label
        # fill data structures
        # P(i|F) = P(i)*P(F|i)/P(F)
        # P(i)
        for k in range(self.KL):
            is_label_k_tfarray_1d = (k == y_ndarray_1d)
            self.label_prob_vec_1d[k] = np.sum(is_label_k_tfarray_1d) / self.point_num
        # P(F)
        var_feature = np.var(X_ndarray_2d, axis=0)
        avg_feature = np.average(X_ndarray_2d, axis=0)
        self.feature_prob_vec_2d = np.vstack([var_feature, avg_feature]).T
        # P(F|i)
        for kl in range(self.KL):
            data_idx = (y_ndarray_1d == kl)
            var_feature_from_label = np.var(X_ndarray_2d[data_idx], axis=0)
            avg_feature_from_label = np.average(X_ndarray_2d[data_idx], axis=0)
            self.gaussian_param_mat_3d[kl] = np.vstack([var_feature_from_label, avg_feature_from_label]).T

    def print_params(self):
        if self.gaussian_param_mat_3d is None:
            print("[-] model not trained yet")
            return
        print("[+] gaussian parameters for P(label|feature)")
        print(self.gaussian_param_mat_3d)
        print("[+] gaussian parameters for P(feature)")
        print(self.feature_prob_vec_2d)
        print("[+] prior probabilities as P(label)")
        print(self.label_prob_vec_1d)

    def predict_prob(self, X: np.ndarray) -> np.ndarray:
        if self.gaussian_param_mat_3d is None:
            print("[-] model not trained yet")
            return None
        ret_prob = np.ndarray((self.KL,), dtype=np.float64)
        for kl in range(self.KL):
            P_feature_from_label = 1.
            P_label = self.label_prob_vec_1d[kl]
            P_feature = 1.
            for kf in range(self.KF):
                # Gaussian density of feature kf under label kl: N(avg, var)
                P_feature_from_label *= np.exp(-(X[kf] - self.gaussian_param_mat_3d[kl, kf, 1])**2 / (2 * self.gaussian_param_mat_3d[kl, kf, 0])) / np.sqrt(2 * np.pi * self.gaussian_param_mat_3d[kl, kf, 0])
                # Gaussian density of feature kf over the whole data set
                P_feature *= np.exp(-(X[kf] - self.feature_prob_vec_2d[kf, 1])**2 / (2 * self.feature_prob_vec_2d[kf, 0])) / np.sqrt(2 * np.pi * self.feature_prob_vec_2d[kf, 0])
            # Bayes' formula: P(label|F) = P(label) * P(F|label) / P(F)
            ret_prob[kl] = P_feature_from_label * P_label / P_feature
        return ret_prob

    def predict_label(self, X: np.ndarray) -> int:
        ret_prob = self.predict_prob(X)
        norm_prob = ret_prob / np.linalg.norm(ret_prob, ord=2)  # scale the scores; the argmax is unaffected
        max_prob_index = norm_prob.argmax()
        print("[+] predict result:", max_prob_index, "probability:", norm_prob[max_prob_index], sep=' ')
        return max_prob_index
The classifier is then checked by instantiating the class.
I used the well-known iris data set from sklearn: the data is loaded and fed to the classifier for training to obtain a model, and the model is then used to classify the training data itself, from which the accuracy is computed.
#main.py
import naive_bayes
import numpy as np
from sklearn import datasets
data = datasets.load_iris()
print("features", data['feature_names'], sep='\n')
print("data", data['data'], sep='\n')
print("target", data['target'], sep='\n')
yyz_gnb_classify_machine = naive_bayes.Gaussian_naive_bayes()
yyz_gnb_classify_machine.fit(data.data, data.target)
yyz_gnb_classify_machine.print_params()
# print(yyz_gnb_classify_machine.predict_prob([4.3, 3., 1.1, 0.1]))
datasize = np.shape(data.target)[0]
result_list = list()
for i in range(datasize):
    result_list.append(yyz_gnb_classify_machine.predict_label(data.data[i]))
print("[+] rate of correct:", np.sum(data['target'] == np.array(result_list))/datasize)
Below is the program's output. Note that, because of space constraints, I compressed the results by hand; the parts that were too long are replaced with ...
features
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
data
[[5.1 3.5 1.4 0.2]
[4.9 3. 1.4 0.2]
[4.7 3.2 1.3 0.2]
[4.6 3.1 1.5 0.2]
...
[6.7 3.3 5.7 2.5]
[6.7 3. 5.2 2.3]
[6.3 2.5 5. 1.9]
[6.5 3. 5.2 2. ]
[6.2 3.4 5.4 2.3]
[5.9 3. 5.1 1.8]]
target
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2]
[+] gaussian parameters for P(label|feature)
[[[0.121764 5.006 ]
[0.140816 3.428 ]
[0.029556 1.462 ]
[0.010884 0.246 ]]
[[0.261104 5.936 ]
[0.0965 2.77 ]
[0.2164 4.26 ]
[0.038324 1.326 ]]
[[0.396256 6.588 ]
[0.101924 2.974 ]
[0.298496 5.552 ]
[0.073924 2.026 ]]]
[+] gaussian parameters for P(feature)
[[0.68112222 5.84333333]
[0.18871289 3.05733333]
[3.09550267 3.758 ]
[0.57713289 1.19933333]]
[+] prior probabilities as P(label)
[0.33333333 0.33333333 0.33333333]
[+] predict result: 0 probability: 1.0
[+] predict result: 0 probability: 1.0
[+] predict result: 0 probability: 1.0
[+] predict result: 0 probability: 1.0
...
[+] predict result: 2 probability: 0.9996593043019856
[+] predict result: 2 probability: 0.9999999314375727
[+] predict result: 2 probability: 0.9999999999999697
[+] predict result: 2 probability: 0.9982447476614136
[+] rate of correct: 0.96
As the output shows, on the original training data each individual prediction is reported with a probability close to 1, yet misclassifications still occur: the accuracy is about 0.96. Overall this meets the classification requirements of a machine-learning model.
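As a sanity check (this comparison is my addition, not part of the original experiment), sklearn's reference implementation can be fit on the same data and should report a comparable training-set accuracy:

# Cross-check against sklearn's GaussianNB
from sklearn import datasets
from sklearn.naive_bayes import GaussianNB

data = datasets.load_iris()
clf = GaussianNB().fit(data.data, data.target)
print("sklearn GaussianNB training accuracy:", clf.score(data.data, data.target))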
Summary
- The Gaussian naive Bayes classifier is an effective classifier; it works well on natural data whose features have already been quantified.
- In this experiment I did not separate the data into training and test sets; I will study that issue on its own later (a minimal sketch of such a split follows this list).
- This post only looked at the Gaussian naive Bayes classifier; there are also the multinomial and Bernoulli naive Bayes classifiers, which suit different forms of data sets (multinomial: text classification, for instance).
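For reference, a minimal sketch of such a split using sklearn's train_test_split (the test_size and random_state values are arbitrary choices for illustration):

from sklearn import datasets
from sklearn.model_selection import train_test_split

data = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=0)
# fit on (X_train, y_train), then evaluate accuracy on the held-out (X_test, y_test)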