SVM支持向量机(一)：基础概念

超平面

定义： $\mathbf{w}^T \mathbf{x} + b = 0$

对于处在超平面两侧的两个点 $\mathbf{x_1}$ 和 $\mathbf{x_2}$ ，分别有：
$\mathbf{w}^T \mathbf{x_1} + b > 0$
$\mathbf{w}^T \mathbf{x_2} + b < 0$

Hyperplane

某样本到超平面的单位法向量为： $\frac{w} {||w||}$
某样本点到超平面的距离可以表示为： $\gamma = sgn(\frac{w}{||w||} \cdot x + \frac{b} {||w||})$
$sgn(z) =\left\{ \begin{array}{rcl} -1 & & {if \space \space z < 0}\\ 0 & & {if \space \space z = 0}\\ +1 & & {if \space \space z > 0}\\ \end{array} \right.$

所以可以看到图中原点距离超平面的距离是 $\frac{b}{||w||}$

线性分类问题（Linear Classifier）

$y = \mathbf{w}^T \mathbf{x} + b$

利用上述性质，y > 0 时为class A，y < 0时为class B，y = 0时，x正好在超平面上，simple and easy。

SVM的作用，是在线性分类器的基础上，找到一个合理的间隔（margin），使得样本到超平面的距离最大。

思路

我们可以在原始的超平面两侧，平行地添加两个超平面作为边界，并且要求：
$\mathbf{w}^T \mathbf{x} + (b - s) > 0$ 对于class A中所有样本成立
$\mathbf{w}^T \mathbf{x} + (b + s) < 0$ 对于class B中所有样本成立

Add margin to hyperplane

在上面我们提到，原点距离原始的超平面的距离为： $d = \frac{b}{||w||}$
那么当我们像上图一样加了两个平行线之后，可以得到：
$d_{blue} = \frac{b-s} {||w||}$ ， $d_{green} = \frac{b+s} {||w||}$

于是margin就是：
$m = d_{blue} - d_{green} = \frac{2s} {||w||}$
观察到margin的大小是个比例关系，所以我们可以先简单地把看出 $s = 1$ ，那么我们的margin的表达式就是 $m=\frac{2} {||w||} = \frac{2} {\sqrt{w^T w}}$

于是SVM的优化问题就变成了：

minize $f_0(\mathbf{w}, b)= \frac{1}{2} \mathbf{w}^T \mathbf{w}$
subject to $f_i(\mathbf{w}, b) = y_i(\mathbf{w}^T \mathbf{x}_i + b) - 1 \geq 0$ for $i = 1, ..., N$

$y_i \in \{-1, 1\}$ 表示 $\mathbf{x}_i$ 的class label。这是一个Constrained convex optimization problem，或确切地来说，是一个quadratic programming问题。

SVM优化问题的求解

Primal problem

计算拉格朗日(拉格朗日乘数法)：
$L(\mathbf{w}, b, \mathbf{\alpha}) = \frac{1}{2} \mathbf{w}^T \mathbf{w} - \sum_{i=1}^{N} \alpha _i[y_i(\mathbf{w}^T \mathbf{x}_i + b) - 1]$
求L关于 $w$ 和 $b$ 的偏导：
$\nabla_w L(\mathbf{w}, b, \mathbf{\alpha}) = \mathbf{w} - \sum_{i=1}^N \alpha_i y_i \mathbf{x}_i \overset{!}{=} 0$ $\space \space \space \space$ (1)
$\frac{\partial L}{\partial b} = \sum_{i=1}^N \alpha_i y_i \overset{!}{=}0$ $\space \space \space \space$ (2)

于是我们可以得到：weights是训练样本 $\mathbf{x}$ 的线性组合:
$\mathbf{w} = \sum_{i=1}^N \alpha_i y_i \mathbf{x}_i$ $\space \space \space \space$ (3)

Dual problem

把(2)和(3)式代回到表达式 $L(\mathbf{w}, b, \mathbf{\alpha})$ 中，可以得到拉格朗日的dual function $g(\alpha)$

于是我们的问题变成了：
maximize $g(\alpha) = \sum_{i=1}^N \alpha_i - \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N y_i y_j \alpha_i \alpha_j \mathbf{x}_i^T \mathbf{x}_j$
subject to $\sum_{i=1}^N \alpha_i y_i = 0$ $\space \space \space \space \space \alpha_i \geq 0$ , for $i = 1, ..., N$

将dual functino $g(\alpha)$ 改成vector的形式，这样我们就得到了一个标准的quadratic programming的形式：

$g(\alpha) = \frac{1}{2} \alpha^T \mathbf{Q} \alpha + \alpha^T \mathbb{I}_N$
$\mathbf{Q}$ 是一个symmetric negative (semi-)definite matrix，约束 $\alpha$ 是线性的。

QP的解法可以用（比如）Sequential minimal optimization (SMO) 得到，由此可以得到 $\alpha$ 的最优解 $\alpha^*$ 。

求解 $\mathbf{w}$ 和 $b$ 对应的值

在primal problem中我们提到， $w$ 的值是训练样本的线性组合：
$\mathbf{w} = \sum_{i=1}^N \alpha^*_i y_i \mathbf{x}_i$

对 $b$ 的求解，我们可以用到互补松弛条件(Complementary slackness condition) $\alpha^* f_i(\mathbf{x}) = \alpha^* \cdot [y_i(\mathbf{w}^T \mathbf{x}_i + b) - 1] = 0$

任意选取一个 $\mathbf{x}_i$ 使得 $\alpha_i \neq 0$ 于是有：
$y_i(\mathbf{w}^T \mathbf{x}_i + b) = 1$ 因为 $y_i \in \{ -1, 1\}$ 所以可以写成：
$(\mathbf{w}^T \mathbf{x}_i + b) = y_i$ 于是：
$b = y_i - \mathbf{w}^T \mathbf{x}_i$

至此我们求解出了 $\mathbf{w}$ 和 $b$ ，超平面的两个参数。

支持向量机中的Support Vector是怎么得名的？

让我们回头再看一看complimentary slackness condition:
$\alpha_i f_i(\mathbf{x}^*) = 0$
$\alpha^* \cdot [y_i(\mathbf{w}^T \mathbf{x}_i + b) - 1] = 0$

也就是说，只有当某个训练样本正好落在了边界(margin)上的时候，才会对weight vector $\mathbf{w}$ 有贡献（只有此时 $\alpha_i \neq 0$ ），此时才会有 $y_i(\mathbf{w}^T \mathbf{x}_i + b) = 1$
这些训练样本就被称作support vector

Support vectors

用SVM进行二分类

对于一个待预测的样本 $\tilde{\mathbf{x}}$ ，它的类别由 $h(\tilde{\mathbf{x}}) = sgn(\mathbf{w}^T \tilde{\mathbf{x}} + b)$ 给出。
其中 $\mathbf{w} = \sum_{i=1}^N \alpha_i y_i \mathbf{x}_i$ 通过训练得到， $\mathbf{x_i}$ 为训练样本。
于是我们有：
$h(\tilde{\mathbf{x}}) = sgn(\sum_{i=1}^N \alpha_i y_i \mathbf{x}_i^T \cdot \tilde{\mathbf{x}} + b)$