Machine Learning Day 5: Implementing Logistic Regression with Yesterday's Material
Dataset URL
https://www.xiehaoo.com/media/record/pinke/2018/08/Social_Network_Ads.csv
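This dataset contains information about users of a social network: user ID, gender, age, and estimated salary. A car company has just launched a new luxury SUV, and we want to predict which users will buy it. The last column records whether each user made the purchase. We will build a model that predicts the purchase from two variables, age and estimated salary, so our feature matrix consists of those two columns. The goal is to find how a user's age and estimated salary relate to the decision to buy the SUV.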
User ID Gender Age EstimatedSalary Purchased
0 15624510 Male 19 19000 0
1 15810944 Male 35 20000 0
2 15668575 Female 26 43000 0
3 15603246 Female 27 57000 0
4 15804002 Male 19 76000 0
5 15728773 Male 27 58000 0
6 15598044 Female 27 84000 0
7 15694829 Female 32 150000 1
8 15600575 Male 25 33000 0
9 15727311 Female 35 65000 0
10 15570769 Female 26 80000 0
11 15606274 Female 26 52000 0
12 15746139 Male 20 86000 0
13 15704987 Male 32 18000 0
14 15628972 Male 18 82000 0
15 15697686 Male 29 80000 0
16 15733883 Male 47 25000 1
17 15617482 Male 45 26000 1
18 15704583 Male 46 28000 1
19 15621083 Female 48 29000 1
20 15649487 Male 45 22000 1
21 15736760 Female 47 49000 1
22 15714658 Male 48 41000 1
23 15599081 Female 45 22000 1
24 15705113 Male 46 23000 1
25 15631159 Male 47 20000 1
26 15792818 Male 49 28000 1
27 15633531 Female 47 30000 1
28 15744529 Male 29 43000 0
29 15669656 Male 31 18000 0
.. ... ... ... ... ...
370 15611430 Female 60 46000 1
371 15774744 Male 60 83000 1
372 15629885 Female 39 73000 0
373 15708791 Male 59 130000 1
374 15793890 Female 37 80000 0
375 15646091 Female 46 32000 1
376 15596984 Female 46 74000 0
377 15800215 Female 42 53000 0
378 15577806 Male 41 87000 1
379 15749381 Female 58 23000 1
380 15683758 Male 42 64000 0
381 15670615 Male 48 33000 1
382 15715622 Female 44 139000 1
383 15707634 Male 49 28000 1
384 15806901 Female 57 33000 1
385 15775335 Male 56 60000 1
386 15724150 Female 49 39000 1
387 15627220 Male 39 71000 0
388 15672330 Male 47 34000 1
389 15668521 Female 48 35000 1
390 15807837 Male 48 33000 1
391 15592570 Male 47 23000 1
392 15748589 Female 45 45000 1
393 15635893 Male 60 42000 1
394 15757632 Female 39 59000 0
395 15691863 Female 46 41000 1
396 15706071 Male 51 23000 1
397 15654296 Female 50 20000 1
398 15755018 Male 36 33000 0
399 15594041 Female 49 36000 1
[400 rows x 5 columns]
Full code
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# train_test_split lives in sklearn.model_selection; sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
dataset = pd.read_csv('/Users/xiehao/Desktop/100-Days-Of-ML-Code-master/datasets/Social_Network_Ads.csv')
#Data preprocessing: columns 2 and 3 (Age, EstimatedSalary) are the features, column 4 (Purchased) is the label
X = dataset.iloc[:, [2, 3]].values
Y = dataset.iloc[:, 4].values
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.25, random_state = 0)
#Feature scaling: fit the scaler on the training set only, then reuse it on the test set
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
#Fit logistic regression to the training set
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
#Predict the test set results
y_pred = classifier.predict(X_test)
#Build the confusion matrix
cm = confusion_matrix(y_test, y_pred)
Step 1: Data preprocessing
Same routine as before.
#Import the dataset
dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values
Y = dataset.iloc[:, 4].values
#Split the dataset into training and test sets; test_size = 0.25 puts a quarter of the rows in the test set (3:1 train to test)
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.25, random_state = 0)
#Feature scaling
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
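To make the feature scaling step concrete, here is a minimal sketch of what StandardScaler computes. It uses the first few (Age, EstimatedSalary) rows from the table above as a toy training set; the names X_train_demo, X_test_demo and sc_demo are made up for this illustration and are not part of the original code.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data: (Age, EstimatedSalary) for rows 0-4 of the table, plus row 7 as a "test" row.
X_train_demo = np.array(
    [[19, 19000], [35, 20000], [26, 43000], [27, 57000], [19, 76000]],
    dtype=float,
)
X_test_demo = np.array([[32, 150000]], dtype=float)

# Same pattern as the tutorial: fit on the training data, reuse on the test data.
sc_demo = StandardScaler()
X_train_scaled = sc_demo.fit_transform(X_train_demo)
X_test_scaled = sc_demo.transform(X_test_demo)

# Manual equivalent: subtract the training mean and divide by the
# training standard deviation (population std, ddof=0), column by column.
mean = X_train_demo.mean(axis=0)
std = X_train_demo.std(axis=0)
manual_train = (X_train_demo - mean) / std
manual_test = (X_test_demo - mean) / std

print(np.allclose(X_train_scaled, manual_train))
print(np.allclose(X_test_scaled, manual_test))
```

The test row is scaled with the training mean and standard deviation, which is why the tutorial calls fit_transform on X_train but only transform on X_test.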
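Step 2: Logistic regression model
The library for this task is a linear model library. It is called linear because logistic regression is a linear classifier: in our two-dimensional feature space, the two classes of users (buyers and non-buyers) are separated by a straight line. We import the logistic regression class, then create an object of that class, which serves as the classifier for our training set.
#Create a LogisticRegression object and train it with its fit method
classifier = LogisticRegression()
classifier.fit(X_train, y_train)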
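The "linear classifier" claim can be checked directly: a fitted binary LogisticRegression predicts class 1 exactly when w·x + b > 0, and its probability is the sigmoid of that value. The sketch below uses synthetic two-feature data (not the SUV dataset), and the names X_demo, y_demo and clf are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic, made-up data: two Gaussian blobs with two features each.
rng = np.random.RandomState(0)
X_demo = np.vstack([
    rng.normal(-1.0, 1.0, size=(50, 2)),  # class 0
    rng.normal(1.0, 1.0, size=(50, 2)),   # class 1
])
y_demo = np.array([0] * 50 + [1] * 50)

clf = LogisticRegression()
clf.fit(X_demo, y_demo)

# The learned model is a weight vector w and an intercept b.
w = clf.coef_[0]
b = clf.intercept_[0]

# Linear score, sigmoid probability, and the resulting class decision.
z = X_demo @ w + b
proba_manual = 1.0 / (1.0 + np.exp(-z))
pred_manual = (z > 0).astype(int)

print(np.array_equal(pred_manual, clf.predict(X_demo)))
print(np.allclose(proba_manual, clf.predict_proba(X_demo)[:, 1]))
```

The decision boundary is the set of points where w[0]*x1 + w[1]*x2 + b = 0, which is a straight line in the two-feature plane.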
Step 3: Predicting the test set results
y_pred = classifier.predict(X_test)
>>> print(y_pred)
[0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 1 0 0 1 0 1 0 1 0 0 0 0 0 0 1 0 0 0 0
0 0 1 0 0 0 0 1 0 0 1 0 1 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0
0 0 1 0 1 1 1 1 0 0 1 1 0 1 0 0 0 1 0 0 0 0 0 0 1 1]
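Step 4: Evaluating the predictions
We have predicted the test set. Now we evaluate how well the logistic regression model has learned. The confusion matrix counts the model's correct and incorrect predictions: rows are the true classes and columns are the predicted classes.
cm = confusion_matrix(y_test, y_pred)
>>> print(cm)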
[[65 3]
[ 8 24]]
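As a rough sketch of how to read this matrix, the snippet below recomputes a few common metrics from the four numbers printed above. It hard-codes the matrix rather than rerunning the model; the metric names are standard, but which of them matters most for this problem is a judgment call the original does not make.

```python
import numpy as np

# Confusion matrix copied from the output above.
# scikit-learn convention: rows = true class, columns = predicted class.
cm = np.array([[65, 3],
               [8, 24]])

# For a binary problem, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all test users classified correctly
precision = tp / (tp + fp)        # of users predicted to buy, how many did
recall = tp / (tp + fn)           # of users who bought, how many were found

print(f"accuracy={accuracy:.2f} precision={precision:.3f} recall={recall:.2f}")
```

Out of 100 test users, 89 are classified correctly (65 non-buyers and 24 buyers), while 3 non-buyers are predicted as buyers and 8 buyers are missed.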
Thanks to the original author Avik-Jain, and to zhyongquan for the Chinese translation.