Python Machine Learning in Plain Terms (5): SVM

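It has been a while since my last update; I have been busy job hunting.
Today's topic is the SVM, the support vector machine.
In practice we often need to sort things into classes, but the classes cannot always be split by a straight line. For example, the data may spread outward from a center, so that separating the inner (important) points from the outer (unimportant) ones takes something like a circle. Data like this is called linearly inseparable. An SVM handles it by projecting the data into a higher-dimensional space, where a weighted combination of the features can tell the classes apart.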
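To make the ring-shaped case concrete, here is a minimal sketch. It assumes scikit-learn's `make_circles` as a stand-in for "data spreading out from a center", and the extra squared-radius feature is an illustrative choice of projection, not something an SVM computes literally.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: the inner ring is one class, the outer ring the other.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# A straight line in the original 2-D space cannot separate the rings.
linear_2d = SVC(kernel="linear").fit(X, y)
acc_2d = linear_2d.score(X, y)

# Add a third feature: the squared distance from the origin.
# In this 3-D space the two rings sit at different heights,
# so a flat plane can separate them.
X_lifted = np.column_stack([X, (X ** 2).sum(axis=1)])
linear_3d = SVC(kernel="linear").fit(X_lifted, y)
acc_3d = linear_3d.score(X_lifted, y)

print("linear SVM, original 2-D features:", acc_2d)
print("linear SVM, with squared-radius feature:", acc_3d)
```

The same linear model goes from poor to near-perfect training accuracy once the data is lifted into three dimensions, which is the intuition behind kernels.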
Let's start with the linearly separable case.

import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm
from sklearn.datasets import make_blobs
# Create 50 data points in two classes
x, y = make_blobs(n_samples = 50, centers = 2, random_state = 6)

# Create a support vector machine with a linear kernel
estimator = svm.SVC(kernel = 'linear', C = 1000)
estimator.fit(x, y)

# Plot the data points
plt.figure(figsize = (20,8))
plt.scatter(x[:,0], x[:,1], c = y, s = 30, cmap = plt.cm.Paired)

# Get the current axes and their limits
ax = plt.gca()
xlim = ax.get_xlim()
ylim = ax.get_ylim()

# Build a 30 x 30 grid over the plot area and evaluate the decision function on it
xx = np.linspace(xlim[0], xlim[1], 30)
yy = np.linspace(ylim[0], ylim[1], 30)
YY, XX = np.meshgrid(yy, xx)
xy = np.vstack([XX.ravel(), YY.ravel()]).T
Z = estimator.decision_function(xy).reshape(XX.shape)

# Draw the decision boundary (level 0) and the margins (levels -1 and 1)
ax.contour(XX, YY, Z, colors = 'k', levels = [-1, 0, 1], alpha = 0.5, linestyles = ['--', '--', '--'])
# Circle the support vectors; edgecolors is set explicitly so the hollow markers are visible
ax.scatter(estimator.support_vectors_[:,0], estimator.support_vectors_[:,1], s = 100, linewidth = 1, facecolors = 'none', edgecolors = 'k')
plt.show()

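[Figure: linear-kernel SVM on the two blobs, showing the decision boundary, the margins, and the circled support vectors]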

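The three circled points are the support vectors: the training points closest to the boundary. A support vector machine looks for the separating line with the largest margin, that is, the line that can be shifted the farthest in both directions before it touches a point of either class.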
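As a rough check on that description, the margin can be read off the fitted model. This is a sketch that refits the same model as above; the variable names `w`, `b`, and `margin_width` are my own, and the formula 2 / ||w|| is the standard margin width for a linear SVM.

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import make_blobs

# Same data and model as the example above.
x, y = make_blobs(n_samples=50, centers=2, random_state=6)
estimator = svm.SVC(kernel="linear", C=1000).fit(x, y)

# For a linear kernel the boundary is w . x + b = 0.
w = estimator.coef_[0]
b = estimator.intercept_[0]

# The margins are the lines w . x + b = +1 and -1 (the contour levels -1 and 1);
# the distance between them is 2 / ||w||.
margin_width = 2 / np.linalg.norm(w)

# Support vectors are expected to lie on (or inside) the margins,
# so their decision values should be close to +1 or -1 here.
sv_scores = estimator.decision_function(estimator.support_vectors_)

print("number of support vectors:", len(estimator.support_vectors_))
print("margin width:", margin_width)
print("decision values at support vectors:", np.round(sv_scores, 3))
```

Maximizing the margin is the same as minimizing ||w||, which is why a large `C` (little tolerance for violations) on separable data gives the widest gap that still separates the classes.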
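What if we project the points into three dimensions? Then the separator becomes a plane instead of a line. This is where a support vector machine with an RBF (radial basis function) kernel comes in: the RBF kernel scores how similar two points are with a Gaussian function of the distance between them, which implicitly maps the features into a higher-dimensional space.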

# Create a support vector machine with an RBF kernel
clf_rbf = svm.SVC(kernel = 'rbf', C = 1000)
clf_rbf.fit(x, y)

plt.figure(figsize = (20,8))
plt.scatter(x[:,0], x[:,1], c = y, s = 30, cmap = plt.cm.Paired)

# Get the current axes and their limits
ax = plt.gca()
xlim = ax.get_xlim()
ylim = ax.get_ylim()

# Build a 30 x 30 grid over the plot area and evaluate the decision function on it
xx = np.linspace(xlim[0], xlim[1], 30)
yy = np.linspace(ylim[0], ylim[1], 30)
YY, XX = np.meshgrid(yy, xx)
xy = np.vstack([XX.ravel(), YY.ravel()]).T
Z = clf_rbf.decision_function(xy).reshape(XX.shape)

# Draw the decision boundary (solid) and the margins (dashed)
ax.contour(XX, YY, Z, colors = 'k', levels = [-1, 0, 1], alpha = 0.5, linestyles = ['--', '-', '--'])

# Circle the support vectors; edgecolors is set explicitly so the hollow markers are visible
ax.scatter(clf_rbf.support_vectors_[:,0], clf_rbf.support_vectors_[:,1], s = 100, linewidth = 1, facecolors = 'none', edgecolors = 'k')

plt.show()

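[Figure: RBF-kernel SVM on the same data, showing a curved decision boundary, the margins, and the circled support vectors]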
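The RBF kernel itself is a short formula, K(a, b) = exp(-gamma * ||a - b||^2). The sketch below computes it by hand and compares it with scikit-learn's `rbf_kernel`; the helper `rbf_by_hand`, the sample points, and the `gamma` value are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def rbf_by_hand(a, b, gamma):
    """K(a, b) = exp(-gamma * ||a - b||^2) for every pair of rows in a and b."""
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq_dist)

# Three illustrative points: two close together, one far away.
points = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]])
gamma = 0.5  # illustrative value, not taken from the article

K_manual = rbf_by_hand(points, points, gamma)
K_sklearn = rbf_kernel(points, points, gamma=gamma)

print(np.round(K_manual, 4))
print("matches sklearn:", np.allclose(K_manual, K_sklearn))
```

Nearby points get a similarity close to 1 and distant points a similarity close to 0; a larger `gamma` makes the similarity fall off faster, so each support vector influences a smaller region.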

So far we have used SVM for classification, through svm.SVC. Next, let's use SVM to predict Boston house prices.

# Example: regression on the Boston housing data
# Load the dataset
# Note: load_boston was removed in scikit-learn 1.2, so this needs an older version
from sklearn.datasets import load_boston
boston = load_boston()

# Build a regression model with SVR
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Split the dataset
x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target, test_size = .25, random_state = 8)

# Fit and score the linear and rbf kernels in turn
for kernel in ['linear', 'rbf']:
    estimator = SVR(kernel = kernel)
    estimator.fit(x_train, y_train)
    print(kernel, 'kernel training score: {:.3f}'.format(estimator.score(x_train, y_train)))
    print(kernel, 'kernel test score: {:.3f}'.format(estimator.score(x_test, y_test)))

linear kernel training score: 0.709
linear kernel test score: 0.696
rbf kernel training score: 0.145
rbf kernel test score: 0.001
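For reading these numbers: the `score` method of a scikit-learn regressor returns the coefficient of determination R², where 1.0 is a perfect fit and 0.0 is no better than always predicting the mean. A small sketch on synthetic data (`make_regression` is a stand-in here, not the Boston data) shows the equivalence:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import r2_score
from sklearn.svm import SVR

# Synthetic regression data, used only to illustrate how the score is computed.
X_demo, y_demo = make_regression(n_samples=100, n_features=3, noise=10.0, random_state=0)
model = SVR(kernel="linear").fit(X_demo, y_demo)
pred = model.predict(X_demo)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_demo - pred) ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print("model.score:", model.score(X_demo, y_demo))
print("r2_score:   ", r2_score(y_demo, pred))
print("by hand:    ", r2_manual)
```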
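Neither result is good, and the RBF kernel's test score of 0.001 means it is doing no better than predicting the mean. At this point we should suspect the data rather than the model: SVMs are sensitive to the scale of the features, so let's plot the order of magnitude of each feature.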

# Plot the minimum and maximum value of each feature
plt.figure(figsize = (20,8))
plt.plot(boston.data.min(axis = 0), '*', label = 'min')
plt.plot(boston.data.max(axis = 0), '^', label = 'max')

# Use a logarithmic y axis
plt.yscale('log')

plt.legend()

plt.xticks(range(13),boston.feature_names)
plt.show()

[Figure: minimum and maximum of each Boston feature on a log-scale y axis]

The features clearly sit on very different scales, with values ranging from about 0.01 to over 100, so we need to standardize them.

# Feature engineering: standardization
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# Fit the scaler on the training set only, then apply the same transform to the test set
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)

# Check the feature ranges again
plt.figure(figsize = (20,8))

plt.plot(x_train_scaled.min(axis = 0), 'v', label = 'x_train_min')
plt.plot(x_train_scaled.max(axis = 0), '^', label = 'x_train_max')

plt.plot(x_test_scaled.min(axis = 0), 'h', label = 'x_test_min')
plt.plot(x_test_scaled.max(axis = 0), 'd', label = 'x_test_max')

# Standardized values are centered on zero, so the minima are negative.
# A log axis cannot show negative values, so the default linear y axis is used here.

plt.legend()

plt.xticks(range(13),boston.feature_names)
plt.show()

[Figure: minimum and maximum of each feature after standardization, for the training and test sets]
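What the scaler did can be checked by hand: it subtracts each column's training mean and divides by its training standard deviation. The sketch below uses a tiny made-up array (the values are illustrative, not Boston data).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two made-up features on very different scales.
train = np.array([[0.01, 100.0], [0.02, 300.0], [0.03, 500.0], [0.06, 700.0]])
test = np.array([[0.05, 900.0]])

scaler = StandardScaler().fit(train)

# StandardScaler uses the population standard deviation (ddof=0), NumPy's default.
mean = train.mean(axis=0)
std = train.std(axis=0)
train_manual = (train - mean) / std
# The test set is scaled with the TRAINING mean and std, never its own.
test_manual = (test - mean) / std

print(np.round(train_manual, 3))
print("train matches:", np.allclose(scaler.transform(train), train_manual))
print("test matches: ", np.allclose(scaler.transform(test), test_manual))
```

After scaling, every training column has mean 0 and standard deviation 1, so no single feature dominates the distances that the RBF kernel relies on.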

Now we refit and re-evaluate the models.

# Refit the SVR models on the standardized features
# (SVR, the train/test split, and the scaled arrays are reused from above)

# Fit and score the linear and rbf kernels in turn
for kernel in ['linear', 'rbf']:
    estimator = SVR(kernel = kernel)
    estimator.fit(x_train_scaled, y_train)
    print(kernel, 'kernel training score: {:.3f}'.format(estimator.score(x_train_scaled, y_train)))
    print(kernel, 'kernel test score: {:.3f}'.format(estimator.score(x_test_scaled, y_test)))

linear kernel training score: 0.706
linear kernel test score: 0.698
rbf kernel training score: 0.665
rbf kernel test score: 0.695
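The RBF kernel now does much better than before, while the linear kernel is about the same. But we are still using the default parameters, and building a model without tuning it is just fooling around.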

# Note: gamma only affects the rbf kernel; the linear kernel ignores it
for kernel in ['linear', 'rbf']:
    estimator = SVR(kernel = kernel, C = 100, gamma = 0.2)
    estimator.fit(x_train_scaled, y_train)
    print(kernel, 'kernel training score: {:.3f}'.format(estimator.score(x_train_scaled, y_train)))
    print(kernel, 'kernel test score: {:.3f}'.format(estimator.score(x_test_scaled, y_test)))

linear kernel training score: 0.706
linear kernel test score: 0.699
rbf kernel training score: 0.983
rbf kernel test score: 0.901
After tuning, the RBF kernel's test score reaches 0.9, which is acceptable.
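The values C = 100 and gamma = 0.2 were set by hand. One common way to search for such values is a cross-validated grid search over a pipeline that includes the scaler. The sketch below is only an illustration: it uses the bundled `load_diabetes` data as a stand-in (because `load_boston` is unavailable in recent scikit-learn), and the grid values are arbitrary assumptions.

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Stand-in regression dataset that ships with scikit-learn (not the Boston data).
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=8)

# Putting the scaler in the pipeline means it is refit on each CV training fold,
# so no information from the validation fold leaks into the scaling.
pipe = make_pipeline(StandardScaler(), SVR(kernel="rbf"))

# Illustrative grid; make_pipeline names the SVR step "svr".
param_grid = {"svr__C": [1, 10, 100], "svr__gamma": [0.01, 0.1, 1]}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print("best cross-validated R^2: {:.3f}".format(search.best_score_))
print("test R^2: {:.3f}".format(search.score(X_test, y_test)))
```

Choosing parameters by cross-validation on the training set, rather than by looking at the test score, keeps the test score an honest estimate of how the model generalizes.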
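Before deep learning took off, SVM was a very popular algorithm. Its training time grows quickly with the number of samples, though, and on datasets of more than about 100,000 samples an SVM takes a long time to fit, which is one reason most people reach for deep learning today.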
