0#07 SVM (Support Vector Machine)

0x00 Data preparation

To keep things simple, I will make up my own data and classify it.

Linearly separable
(x  ,y)     label
(-1 ,0)     0
(0  ,1)     1
(1  ,0)     1
Additional data:
(-2 ,-3)    0
(1  ,1)     1

Not linearly separable
(x  ,y)     label
(0  ,0)     0
(1  ,0)     0
(0  ,1)     0
(-1 ,0)     0
(0  ,-1)    0

(0  ,2)     1
(1  ,2)     1
(2  ,2)     1
(2  ,1)     1
(2  ,0)     1
(2  ,-1)    1
(2  ,-2)    1
(1  ,-2)    1
(0  ,-2)    1
(-1 ,-2)    1
(-2 ,-2)    1
(-2 ,-1)    1
(-2 ,0)     1
(-2 ,1)     1
(-2 ,2)     1
(-1 ,2)     1

SVMs are powerful and can be applied to many tasks. The practical example in this post uses the fetch_lfw_people dataset.

0x01 Machine learning by hand

First, let's plot the points.

[Figure: separating the points without an SVM]

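If we separate the classes with a straight line, as in the methods we already know, many different lines work. Some of them are clearly of little use: they would have to change as soon as more data arrives.

So the natural question is whether these points determine a single best line.

What makes a line the best one?
The SVM's answer: the best line is the one that maximizes its distance to the points nearest to it. In other words, it maximizes the margin.
For example: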

[Figure: margin maximization]
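The margin idea can be checked by hand on the five linearly separable points from section 0x00. The sketch below is only an illustration: the helper `margin` and the four candidate lines are hypothetical choices made for this example, not part of the original walkthrough. Each line is written as w·x + b = 0, and its margin is the smallest point-to-line distance, provided the line separates the classes at all.

```python
import numpy as np

# Linearly separable toy data from section 0x00.
X = np.array([[-1, 0], [0, 1], [1, 0], [-2, -3], [1, 1]], dtype=float)
labels = np.array([0, 1, 1, 0, 1])

def margin(w, b, X, labels):
    """Smallest distance from any point to the line w.x + b = 0.

    Returns -1.0 when the line does not put the two classes on opposite sides.
    """
    w = np.asarray(w, dtype=float)
    scores = X @ w + b
    signs = np.where(labels == 1, 1.0, -1.0)
    if np.any(scores * signs <= 0):
        return -1.0
    return float(np.min(np.abs(scores)) / np.linalg.norm(w))

# Hypothetical candidate lines, each given as (w, b).
candidates = {
    "x + y = 0": ([1, 1], 0.0),
    "x = -0.5": ([1, 0], 0.5),
    "3x + y + 1 = 0": ([3, 1], 1.0),
    "y = 0.5": ([0, 1], -0.5),  # does not separate the classes
}

margins = {name: margin(w, b, X, labels) for name, (w, b) in candidates.items()}
best = max(margins, key=margins.get)

for name, m in margins.items():
    print(f"{name:>16}: margin = {m:.3f}")
print("best candidate:", best)
```

Among these candidates, x + y = 0 has the largest margin, 1/√2 ≈ 0.707. This only compares a handful of lines; an SVM searches over all of them.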

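The points that lie on the margin are called support vectors,
here (-1,0), (0,1) and (1,0).
Removing a point that is not a support vector, or adding a point that falls outside the margin on the correct side, does not change the result.
Note, however, that this example is linearly separable.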
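The claim about non-support-vector points can be tried out with scikit-learn. This is a minimal sketch under two assumptions that are not in the original text: a linear `SVC` with a very large `C` (here 1e6) is used to approximate a hard margin, and the labels are the 0/1 labels from section 0x00. It fits once on all five points and once on only the first three, then compares the resulting lines.

```python
import numpy as np
from sklearn.svm import SVC

# Linearly separable toy data; the last two rows are the supplementary points.
X = np.array([[-1, 0], [0, 1], [1, 0], [-2, -3], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0, 1])

# Assumption: a very large C approximates a hard-margin SVM.
clf_all = SVC(kernel="linear", C=1e6).fit(X, y)

# Refit without the two supplementary points, which lie outside the margin.
clf_core = SVC(kernel="linear", C=1e6).fit(X[:3], y[:3])

print("all points :", clf_all.coef_, clf_all.intercept_)
print("first three:", clf_core.coef_, clf_core.intercept_)
print("support vectors reported by scikit-learn:")
print(clf_all.support_vectors_)
```

Both fits should give approximately the same line, x + y = 0. scikit-learn lists only points with a nonzero dual coefficient as support vectors, so a point that lies exactly on the margin may still be missing from `support_vectors_`.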

Other cases are not linearly separable,
for example
the second data set:

[Figure: data that is not linearly separable]

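The natural way to split this data is to draw a circle; no single straight line can separate it.
So we look for a way to project the coordinates into three dimensions,
for example by setting the third dimension to z = x1^2 + x2^2
or z = np.exp(-(x1^2 + x2^2)).
After the projection, the points form a 3D figure that is linearly separable.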

[Figure: separating the data after projecting it into 3D]

The code is as follows:

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d  # registers the '3d' projection on older matplotlib versions
from ipywidgets import interact,fixed
%matplotlib inline

# The 21 points of the second (not linearly separable) data set and their labels
x=np.array([[0  ,0],[1  ,0],[0  ,1],[-1 ,0],[0  ,-1],[0  ,2],[1  ,2],[2  ,2],[2  ,1],[2  ,0],[2  ,-1],[2  ,-2],[1  ,-2],[0  ,-2],[-1 ,-2],[-2 ,-2],[-2 ,-1],[-2 ,0],[-2 ,1],[-2 ,2],[-1 ,2]])
y=np.array([0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1])

# 2D view: the inner class is surrounded by the outer class
plt.figure(figsize=(6,6))
plt.scatter(x[:,0],x[:,1],c=y,cmap='autumn',s=500)

# Third coordinate: r = exp(-(x1^2 + x2^2))
r=np.exp(-(x**2).sum(1))

def plot_3D(elev=30,azim=30,x=x,y=y):
    # Scatter the points in 3D, using r as the height
    ax=plt.subplot(projection='3d')
    ax.scatter3D(x[:,0],x[:,1],r,c=y,s=50,cmap='autumn')
    ax.view_init(elev=elev,azim=azim)
    ax.set_xlabel('x')
    ax.set_ylabel('y')
    ax.set_zlabel('r')

# Sliders for the viewing angles (Jupyter notebook only)
interact(plot_3D,elev=(-90,90),azim=(-180,180),x=fixed(x),y=fixed(y));
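The 3D plot shows the separation visually; the same thing can be checked numerically without a notebook. The sketch below recomputes the third coordinate r = exp(-(x1^2 + x2^2)) for the 21 points and tests whether a single horizontal plane splits the classes. The midpoint `threshold` is an arbitrary choice made for this illustration.

```python
import numpy as np

# Second data set from section 0x00: 5 inner points (label 0), 16 outer points (label 1).
X = np.array([[0, 0], [1, 0], [0, 1], [-1, 0], [0, -1],
              [0, 2], [1, 2], [2, 2], [2, 1], [2, 0], [2, -1], [2, -2], [1, -2],
              [0, -2], [-1, -2], [-2, -2], [-2, -1], [-2, 0], [-2, 1], [-2, 2],
              [-1, 2]], dtype=float)
y = np.array([0] * 5 + [1] * 16)

# Lift each point into 3D with r = exp(-(x1^2 + x2^2)).
r = np.exp(-(X ** 2).sum(axis=1))

inner_min = r[y == 0].min()  # lowest inner point
outer_max = r[y == 1].max()  # highest outer point

# Any plane r = const between the two values separates the classes;
# the midpoint is just one convenient (assumed) choice.
threshold = (inner_min + outer_max) / 2
pred = (r < threshold).astype(int)  # below the plane -> outer class (label 1)

print("inner min r:", inner_min)
print("outer max r:", outer_max)
print("all points classified correctly:", bool((pred == y).all()))
```

The inner points sit at r ≥ e^-1 ≈ 0.368 and the outer points at r ≤ e^-4 ≈ 0.018, so a flat plane in the lifted space does what no straight line could do in 2D.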

0x03 Using a support vector machine

Now let's classify with an SVM.

from sklearn.datasets import fetch_lfw_people
"""
The download is very slow, so it is better to fetch the files in advance.
File URLs:
https://ndownloader.figshare.com/files/5976018 #lfw.tgz
https://ndownloader.figshare.com/files/5976015 #lfw-funneled.tgz
https://ndownloader.figshare.com/files/5976012 #pairsDevTrain.txt
https://ndownloader.figshare.com/files/5976009 #pairsDevTest.txt
https://ndownloader.figshare.com/files/5976006 #pairs.txt
Save them to ~/scikit_learn_data/lfw_home,
then extract lfw.tgz and lfw-funneled.tgz
"""
faces = fetch_lfw_people(min_faces_per_person=60,download_if_missing=True)
print(faces.target_names)
print(faces.images.shape)

import matplotlib.pyplot as plt
fig,ax = plt.subplots(3,5,)
for i, axi in enumerate(ax.flat):
    axi.imshow(faces.images[i],cmap='bone')
    axi.set(xticks=[],yticks=[],xlabel=faces.target_names[faces.target[i]])

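"""
This step extracts features. We have not covered the background yet, so it is enough to know that it reduces the nearly 3000 pixels of each image to 150 components.
"""

from sklearn.svm import SVC
# RandomizedPCA was removed from scikit-learn; PCA with svd_solver='randomized' replaces it
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline

pca = PCA(n_components=150,svd_solver='randomized',whiten=True,random_state=0)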
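"""
C: penalty parameter (the larger C is, the harder the margin)
kernel: kernel type.
        'linear': linear kernel
        'poly': polynomial kernel
        'rbf': Gaussian (RBF) kernel, for classifying samples that are not linearly separable
        'sigmoid':
        'precomputed'
degree: degree of the polynomial kernel ('poly'). Ignored by all other kernels.
gamma: kernel coefficient for 'rbf', 'poly' and 'sigmoid'. If gamma is 'auto', 1 / n_features is used.
"""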
svc = SVC(kernel='rbf',class_weight = 'balanced')
model=make_pipeline(pca,svc)

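"""
Split the data into a training set and a test set
"""
# sklearn.cross_validation was removed; train_test_split now lives in sklearn.model_selection
from sklearn.model_selection import train_test_split
xtrain,xtest,ytrain,ytest=train_test_split(faces.data,faces.target,random_state=0)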

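"""
Use a grid search to find the best parameters.
Note the parameter name format:
svc__C
pipeline step name, two underscores, then the name of that step's parameter
"""
# sklearn.grid_search was removed; GridSearchCV now lives in sklearn.model_selection
from sklearn.model_selection import GridSearchCV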
param_grid ={
    'svc__C':[1,5,10,50],
    'svc__gamma':[0.0001,0.0005,0.001,0.005]}
grid = GridSearchCV(model,param_grid)

grid.fit(xtrain,ytrain)
print(grid.best_params_)

model = grid.best_estimator_
yfit = model.predict(xtest)

"""
Generate a text report.
For example:

                   precision    recall  f1-score   support

     Ariel Sharon       0.92      0.69      0.79        16
     Colin Powell       0.84      0.87      0.85        61
  Donald Rumsfeld       0.75      0.69      0.72        35
    George W Bush       0.78      0.97      0.86       125
Gerhard Schroeder       0.90      0.66      0.76        29
      Hugo Chavez       1.00      0.63      0.77        19
Junichiro Koizumi       1.00      0.76      0.87        17
       Tony Blair       0.96      0.77      0.86        35

      avg / total       0.85      0.83      0.83       337
"""
from sklearn.metrics import classification_report
print(classification_report(ytest,yfit,target_names=faces.target_names))
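The `svc__C` naming used in the grid search above can be tried without downloading the face data. The sketch below is an illustration under stated assumptions: it swaps in a small synthetic data set from `make_classification`, and the grid values, `n_components=5` and `cv=3` are arbitrary choices for the example, not tuned settings. It shows that `make_pipeline` names each step after its lowercased class name, which is where the `svc__` prefix comes from.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Assumption: synthetic stand-in data so the example runs offline.
X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = make_pipeline(PCA(n_components=5, random_state=0), SVC(kernel="rbf"))

# make_pipeline derives step names from the class names: 'pca' and 'svc'.
step_names = [name for name, _ in pipe.steps]
print(step_names)

# Grid keys follow <step name>__<parameter name>.
param_grid = {"svc__C": [1, 10], "svc__gamma": [0.01, 0.1]}
grid = GridSearchCV(pipe, param_grid, cv=3)
grid.fit(X_train, y_train)

print(grid.best_params_)
print("test accuracy:", grid.score(X_test, y_test))
```

Calling `pipe.get_params().keys()` lists every name that can appear in `param_grid`, which is a quick way to catch a misspelled prefix.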

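It is worth mentioning that scikit-learn's SVC exposes its support vectors
through attributes of the fitted estimator.
Because model above is a pipeline, take the SVC step out of it first:
svc = model.named_steps['svc']
Indices of the support vectors:
svc.support_
Coordinates of the support vectors:
svc.support_vectors_
Number of support vectors for each class:
svc.n_support_
Weights assigned to the features (the coefficients in the primal problem). Only available with the linear kernel:
svc.coef_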
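These attributes can be inspected on the small linearly separable data set from section 0x00. The sketch below makes a few assumptions for illustration: a two-step pipeline of `PCA(n_components=2)` and a linear `SVC` with a large `C` stands in for the face-recognition pipeline, and the fitted SVC is pulled out of the pipeline through `named_steps`.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Linearly separable toy data from section 0x00.
X = np.array([[-1, 0], [0, 1], [1, 0], [-2, -3], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0, 1])

# Assumption: a small stand-in pipeline with the same shape as the one above.
pipe = make_pipeline(PCA(n_components=2), SVC(kernel="linear", C=1e6)).fit(X, y)

# The support-vector attributes belong to the SVC step, not to the pipeline.
svc = pipe.named_steps["svc"]

print("indices of support vectors:", svc.support_)
print("support vectors (in PCA coordinates):")
print(svc.support_vectors_)
print("support vectors per class:", svc.n_support_)
print("feature weights (linear kernel only):", svc.coef_)
```

Note that `support_vectors_` holds the coordinates the SVC actually saw, that is, after the PCA step; `X[svc.support_]` gives the same points in the original coordinates.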

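0x04 Some thoughts

Before deep learning took off, the SVM was one of the most successful algorithms.
An SVM can solve both linear and nonlinear classification problems.
The complexity of a trained model is determined by the number of support vectors, not by the dimensionality of the data.
SVMs are relatively resistant to overfitting, because apart from the support vectors the other points do not matter.
For hyperparameters such as the kernel:
if the data is linearly separable, use 'linear';
if the data is not linearly separable, use 'rbf'.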
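The kernel rule of thumb can be tried on the second data set from section 0x00. This is a rough sketch, not a tuning recipe: `C=1e3` and `gamma=1.0` are assumed values picked for the illustration, and accuracy is measured on the training points only, simply to show whether each kernel can separate the ring from the center at all.

```python
import numpy as np
from sklearn.svm import SVC

# Second data set from section 0x00: 5 inner points (label 0), 16 outer points (label 1).
X = np.array([[0, 0], [1, 0], [0, 1], [-1, 0], [0, -1],
              [0, 2], [1, 2], [2, 2], [2, 1], [2, 0], [2, -1], [2, -2], [1, -2],
              [0, -2], [-1, -2], [-2, -2], [-2, -1], [-2, 0], [-2, 1], [-2, 2],
              [-1, 2]], dtype=float)
y = np.array([0] * 5 + [1] * 16)

# Assumed hyperparameters, chosen only for this illustration.
linear_acc = SVC(kernel="linear", C=1e3).fit(X, y).score(X, y)
rbf_acc = SVC(kernel="rbf", C=1e3, gamma=1.0).fit(X, y).score(X, y)

print("linear kernel, training accuracy:", linear_acc)
print("rbf kernel, training accuracy   :", rbf_acc)
```

The linear kernel cannot reach 100% here, because the inner points lie inside the convex hull of the outer ones, while the RBF kernel separates the two classes.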
