I. KNN
1. About Distance
- If the attributes can be converted to numeric types, then each attribute can be treated as a dimension and each object as a point (a set of coordinates) in that space, so there is a well-defined distance between any two objects in the data table.
- Common distance formulas include (a short computation sketch follows this list):
Euclidean distance: $d(x,y)=\sqrt{\sum_{i=1}^{n}(x_i-y_i)^2}$
Manhattan distance: $d(x,y)=\sum_{i=1}^{n}|x_i-y_i|$
Minkowski distance: $d(x,y)=\left(\sum_{i=1}^{n}|x_i-y_i|^p\right)^{1/p}$ (p=1 gives Manhattan, p=2 gives Euclidean)
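A minimal sketch (toy 3-dimensional vectors, assumed for illustration only) comparing the three distance formulas above using scipy.spatial.distance:
>>>import numpy as np
>>>from scipy.spatial import distance
>>>x = np.array([1.0, 2.0, 3.0])
>>>y = np.array([4.0, 0.0, 3.0])
>>>distance.euclidean(x, y)       # sqrt((1-4)^2 + (2-0)^2 + (3-3)^2) = sqrt(13) ≈ 3.606
>>>distance.cityblock(x, y)       # |1-4| + |2-0| + |3-3| = 5  (Manhattan)
>>>distance.minkowski(x, y, p=3)  # (|1-4|^3 + |2-0|^3 + |3-3|^3)^(1/3) ≈ 3.271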
2. KD-Tree
- A KD-Tree is used to find the K points nearest to a given query point in a space containing many points.
- The KD-Tree procedure is as follows (a small query sketch follows this list):
First, repeatedly split the points into two halves, using a different dimension for each split, until no further splits are possible.
The splitting lines divide the space into many cells, which together can be viewed as a tree-shaped index.
To search, draw a circle around the target point large enough to cover K candidates; the splitting lines that intersect the circle identify which cells of the index to examine, and distances are computed only for the points in those cells.
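A minimal sketch (random 2-D toy points, names assumed for illustration) of building a KD-Tree index and querying the K nearest neighbors with scikit-learn's KDTree:
>>>import numpy as np
>>>from sklearn.neighbors import KDTree
>>>rng = np.random.RandomState(0)
>>>points = rng.rand(100, 2)             # 100 random points in a 2-D space
>>>tree = KDTree(points, leaf_size=10)   # build the KD-Tree index (the library handles the recursive splits)
>>>dist, ind = tree.query([[0.5, 0.5]], k=3)  # distances and indices of the 3 points nearest the target
>>>print(ind, dist)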
3. The Idea of the KNN Algorithm
- The idea of KNN (K-Nearest Neighbors) is: look at a data point's K nearest neighbors and see which label appears most often among them; the data point is then assigned that majority label (a minimal sketch of this voting rule is shown below, before the scikit-learn example).
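A minimal sketch (toy 1-D data, assumed for illustration) of the voting rule itself: find the K nearest training points to a query and take the most common label among them:
>>>import numpy as np
>>>from collections import Counter
>>>X = np.array([[1.0], [1.2], [3.0], [3.2], [3.4]])  # training points (one feature)
>>>y = np.array([0, 0, 1, 1, 1])                      # their labels
>>>dists = np.abs(X - np.array([2.9])).ravel()        # distance from the query point 2.9 to every training point
>>>nearest = np.argsort(dists)[:3]                    # indices of the K=3 closest points
>>>print(Counter(y[nearest]).most_common(1)[0][0])    # majority label among those neighbors
1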
>>>import os
>>>import pandas as pd
>>>import numpy as np
>>>from sklearn.model_selection import train_test_split
>>>from sklearn.neighbors import NearestNeighbors,KNeighborsClassifier
>>>from sklearn.metrics import accuracy_score,recall_score,f1_score
>>>df = pd.read_csv(os.path.join(".", "data", "WA_Fn-UseC_-HR-Employee-Attrition.csv"))  # HR employee attrition dataset
>>>X_tt,X_validation,Y_tt,Y_validation = train_test_split(df.JobLevel,df.JobSatisfaction,test_size=0.2)  # hold out 20% for validation
>>>X_train,X_test,Y_train,Y_test = train_test_split(X_tt,Y_tt,test_size=0.25)  # split the rest into 60% train / 20% test
>>>data = df[["JobSatisfaction","JobLevel"]]
>>>nbrs = NearestNeighbors(n_neighbors=3,algorithm="ball_tree").fit(data)  # 3-nearest-neighbor index (ball tree)
>>>distances,indices = nbrs.kneighbors(data)  # neighbor indices and distances for every row
>>>print(f"indices:{indices}")
>>>print(f"distance:{distances}")
indices:[[ 203 545 548]
[ 612 331 1468]
[ 779 1169 274]
...
[ 612 331 1468]
[ 612 331 1468]
[ 103 332 881]]
distance:[[0. 0. 0.]
[0. 0. 0.]
[0. 0. 0.]
...
[0. 0. 0.]
[0. 0. 0.]
[0. 0. 0.]]
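Note that the reported distances are all 0: JobSatisfaction and JobLevel each take only a few integer values, so most rows have many exact duplicates, and the nearest neighbors found are at distance 0.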
>>>knn_clf = KNeighborsClassifier(n_neighbors=5)  # KNN classifier with K=5
>>>knn_clf.fit(np.array(X_train).reshape(-1,1),np.array(Y_train))  # X must be 2-D; y should stay 1-D
>>>Y_pred = knn_clf.predict(np.array(X_validation).reshape(-1,1))
>>>print(f"ACC_validation:{accuracy_score(Y_validation,Y_pred)}")
>>>print(f"REC_validation:{recall_score(Y_validation,Y_pred,average='macro')}")
>>>print(f"F-Score_validation:{f1_score(Y_validation,Y_pred,average='macro')}")
>>>print("="*20)
>>>Y_pred = knn_clf.predict(np.array(X_test).reshape(-1,1))
>>>print(f"ACC_test:{accuracy_score(Y_test,Y_pred)}")
>>>print(f"REC_test:{recall_score(Y_test,Y_pred,average='macro')}")
>>>print(f"F-Score_test:{f1_score(Y_test,Y_pred,average='macro')}")
>>>print("="*20)
>>>Y_pred = knn_clf.predict(np.array(X_train).reshape(-1,1))
>>>print(f"ACC_train:{accuracy_score(Y_train,Y_pred)}")
>>>print(f"REC_train:{recall_score(Y_train,Y_pred,average='macro')}")
>>>print(f"F-Score_train:{f1_score(Y_train,Y_pred,average='macro')}")
ACC_validation:0.23129251700680273
REC_validation:0.231737012987013
F-Score_validation:0.18999414398110684
====================
ACC_test:0.2585034013605442
REC_test:0.22533289904776502
F-Score_test:0.18903384382681035
====================
ACC_train:0.28458049886621317
REC_train:0.26506593601295936
F-Score_train:0.22034449227123712