12. Support Vector Machines

Support Vector Machines

Optimization objective

The SVM is derived by modifying logistic regression. Logistic regression hypothesis:

h_\theta(x) = \frac{1}{1+e^{-\theta^Tx}}

SVM cost function (the logistic cost terms are replaced by the piecewise-linear cost_1 and cost_0):

\min_\theta C\sum_{i=1}^m[y^{(i)}cost_1(\theta^Tx^{(i)})+(1-y^{(i)})cost_0(\theta^Tx^{(i)})]+\frac{1}{2}\sum_{j=1}^n\theta_j^2
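cost_1 and cost_0 are not written out above; a minimal sketch of the standard hinge-style choices (the exact piecewise-linear functions used in the lectures may differ in slope):

```python
import numpy as np

def cost_1(z):
    # Cost used when y = 1: zero once z = theta^T x >= 1, growing linearly as z decreases.
    return np.maximum(0.0, 1.0 - z)

def cost_0(z):
    # Cost used when y = 0: zero once z = theta^T x <= -1, growing linearly as z increases.
    return np.maximum(0.0, 1.0 + z)

z = np.linspace(-3, 3, 7)
print(cost_1(z))  # -> [4. 3. 2. 1. 0. 0. 0.]
print(cost_0(z))  # -> [0. 0. 0. 1. 2. 3. 4.]
```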

Large Margin Intuition

\min_\theta C\sum_{i=1}^m[y^{(i)}cost_1(\theta^Tx^{(i)})+(1-y^{(i)})cost_0(\theta^Tx^{(i)})]+\frac{1}{2}\sum_{j=1}^n\theta_j^2

If y=1, we want \theta^Tx\ge1 (not just \ge0)
If y=0, we want \theta^Tx\le-1 (not just \le0)

If C is too large, the decision boundary will be sensitive to outliers.

The mathematics behind large margin classification (optional)

Vector Inner Product
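As a reminder of the fact used in the next subsection: for u,v\in\mathbb{R}^2,

u^Tv = p\cdot||u|| = u_1v_1+u_2v_2

where p is the signed length of the projection of v onto u, and ||u||=\sqrt{u_1^2+u_2^2}.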

SVM Decision Boundary

\min\limits_\theta\frac{1}{2}\sum\limits_{j=1}^n\theta_j^2=\frac{1}{2}||\theta||^2

s.t. \quad \theta^Tx^{(i)}\ge1 \text{ if } y^{(i)}=1, \qquad \theta^Tx^{(i)}\le-1 \text{ if } y^{(i)}=0
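Writing \theta^Tx^{(i)} through the projection p^{(i)} of x^{(i)} onto \theta shows where the large margin comes from:

\theta^Tx^{(i)} = p^{(i)}\cdot||\theta||

so the constraints become p^{(i)}\cdot||\theta||\ge1 (for y^{(i)}=1) and p^{(i)}\cdot||\theta||\le-1 (for y^{(i)}=0). To keep ||\theta|| small, the optimizer needs the projections |p^{(i)}| to be large, which is exactly a large margin.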

Kernels I

Non-linear decision boundary:

Given x, compute new features depending on proximity to landmarks defined manually.

Kernels and Similarity (Gaussian kernel):

f_1=similarity(x,l^{(1)})=\exp (-\frac{||x-l^{(1)}||^2}{2\sigma^2})=\exp(-\frac{\sum_{j=1}^n(x_j-l_j^{(1)})^2}{2\sigma^2})

If x\approx l^{(1)},\qquad f_1\approx 1
If x is far from l^{(1)},\qquad f_1\approx0
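A minimal sketch of this similarity in code (the helper name gaussian_kernel is illustrative, not from any particular package):

```python
import numpy as np

def gaussian_kernel(x, l, sigma):
    # f = exp(-||x - l||^2 / (2 * sigma^2))
    diff = np.asarray(x, dtype=float) - np.asarray(l, dtype=float)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

l1 = [3.0, 5.0]
print(gaussian_kernel([3.0, 5.0], l1, sigma=1.0))    # x ~ l1     -> ~1.0
print(gaussian_kernel([10.0, -2.0], l1, sigma=1.0))  # x far away -> ~0.0
```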


Kernels II

Choosing the landmarks:
Where to get l?
Given (x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),\dots,(x^{(m)},y^{(m)}),
choose l^{(1)}=x^{(1)},l^{(2)}=x^{(2)},\dots,l^{(m)}=x^{(m)}

For each training example (x^{(i)},y^{(i)}), compute the feature vector f^{(i)}:

f_k^{(i)} = sim(x^{(i)},l^{(k)}),\quad k=1,\dots,m,\qquad f_0^{(i)}=1
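Stacking these features for all examples gives an m x (m+1) matrix; a rough sketch (make_features is a hypothetical helper, not a library function):

```python
import numpy as np

def make_features(X, landmarks, sigma):
    # f_k^(i) = exp(-||x^(i) - l^(k)||^2 / (2 sigma^2)) for every example/landmark pair,
    # plus the bias feature f_0 = 1 in the first column.
    sq_dists = ((X[:, None, :] - landmarks[None, :, :]) ** 2).sum(axis=2)
    F = np.exp(-sq_dists / (2.0 * sigma ** 2))
    return np.hstack([np.ones((X.shape[0], 1)), F])

X = np.random.randn(5, 2)           # 5 training examples, 2 original features
F = make_features(X, X, sigma=1.0)  # landmarks are the training examples themselves
print(F.shape)                      # (5, 6): m examples, m + 1 features each
```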

SVM with Kernels

Hypothesis: Given x, compute features f\in R^{m+1}
Predict 'y=1' if \theta^Tf\ge0
Training: \min\limits_\theta C\sum\limits_{i=1}^m[y^{(i)}cost_1(\theta^Tf^{(i)})+(1-y^{(i)})cost_0(\theta^Tf^{(i)})]+\frac{1}{2}\sum\limits_{j=1}^n\theta_j^2\quad (n=m)

Kernels are usually used with SVMs; although they can also be used with logistic regression, that combination runs slowly.

SVM parameters

C :

  • Large C: Lower bias, high variance.
  • Small C: Higher bias, low variance.

\sigma^2 :

  • Large \sigma^2: Features f_i vary more smoothly. Higher bias, lower variance. (Underfit)
  • Small \sigma^2: Features f_i vary less smoothly. Lower bias, higher variance. (Overfit)
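A quick numerical illustration of this effect, just evaluating f at a few distances for two values of \sigma^2:

```python
import numpy as np

# f = exp(-d^2 / (2 sigma^2)) as a function of the distance d = ||x - l||.
d = np.array([0.0, 1.0, 2.0, 3.0])
for sigma2 in (0.5, 4.0):
    f = np.exp(-d ** 2 / (2.0 * sigma2))
    print(f"sigma^2 = {sigma2}: {np.round(f, 3)}")
# With the larger sigma^2 the feature decays much more gradually with distance.
```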

Using an SVM

Need to specify:

  • Choice of parameter C
  • Choice of kernel (similarity function)

Note: Do perform feature scaling before using the Gaussian kernel.
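For example, with scikit-learn (one common SVM package), scaling and the Gaussian (RBF) kernel can be combined in a pipeline; its gamma parameter plays the role of 1/(2\sigma^2):

```python
from sklearn.datasets import make_moons
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# Feature scaling first, then an SVM with the Gaussian (RBF) kernel.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma=0.5))
model.fit(X, y)
print(model.score(X, y))
```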

Other choices of kernel

Not all similarity functions similarity(x,l) make valid kernels; they need to satisfy a technical condition called Mercer's Theorem so that SVM packages' optimizations run correctly and do not diverge.

Many off-the-shelf kernels available:

  • Polynomial kernel: k(x,l) = (x^Tl+\text{constant})^{\text{degree}}
  • String kernel
  • chi-square kernel
  • histogram intersection kernel
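As an illustration with scikit-learn: the polynomial kernel is built in, and other similarity functions can be supplied as a callable that returns the Gram matrix (the histogram-intersection function below is a sketch under that assumption):

```python
import numpy as np
from sklearn.svm import SVC

# Built-in polynomial kernel: (x^T l + coef0)^degree
poly_svm = SVC(kernel="poly", degree=3, coef0=1.0, C=1.0)

# A custom kernel is passed as a callable returning the kernel (Gram) matrix.
def histogram_intersection(X, L):
    # K(x, l) = sum_j min(x_j, l_j), assuming nonnegative histogram-like features.
    return np.minimum(X[:, None, :], L[None, :, :]).sum(axis=2)

custom_svm = SVC(kernel=histogram_intersection, C=1.0)
```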

Multi-class classification

Many SVM packages already have built-in multi-class classification functionality.
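For instance, scikit-learn's SVC handles a 3-class problem directly (internally it uses one-vs-one):

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)             # 3 classes
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)
print(clf.predict(X[:5]))                     # multi-class handled out of the box
```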

Logistic regression vs. SVM

n = number of features, m = number of training examples.

  • If n is large (relative to m):
    Use logistic regression, or SVM without a kernel.
  • If n is small and m is intermediate:
    Use SVM with Gaussian kernel.
  • If n is small, m is large:
    Create/add more features, then use logistic regression or SVM without a kernel.
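Mapping these cases onto scikit-learn estimators, as one possible reading of the advice above:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC

# n large relative to m, or n small with very large m (after adding features):
# a linear model, i.e. logistic regression or an SVM without a kernel.
linear_lr = LogisticRegression(C=1.0)
linear_svm = LinearSVC(C=1.0)

# n small and m intermediate: an SVM with the Gaussian kernel.
kernel_svm = SVC(kernel="rbf", C=1.0, gamma="scale")
```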