Linear Models

3.0 Generative Learning and Discriminative Classifiers

Basic goal: learn a mapping h : X \rightarrow Y, where X denotes the features and Y the class labels

  1. Bayes optimal classifier: P(Y|X)

    What is the most probable classification of the new instance given the training data?

  2. Generative classifier:

  • e.g.,

    Naive Bayes

    LDA:

    Linear Discriminant Analysis (LDA) is most commonly used as a dimensionality reduction technique in the pre-processing step for pattern-classification and machine learning applications. The goal is to project a dataset onto a lower-dimensional space with good class separability, in order to avoid overfitting ("curse of dimensionality") and also to reduce computational cost.

    QDA:

    Quadratic Discriminant Analysis (QDA) is a generative classifier closely related to LDA: each class is modeled by its own Gaussian distribution with a class-specific covariance matrix, so the resulting decision boundaries between classes are quadratic rather than linear.

    Non-par: non-parametric methods (e.g., kernel density estimation)

  • Explanation:

    A generative classifier tries to learn the model that generates the data behind the scenes by estimating the assumptions and distributions of the model. It then uses this to predict unseen data, because it assumes the learned model captures the real one. As we will see, this is often not true. An example is the Naive Bayes classifier.

  • Steps:

    1. Assume some functional form for P(X|Y), P(Y)
    2. Estimate parameters of P(X|Y) directly from training data
    3. Apply Bayes formula to calculate P(Y|X = x)
  • Strengths:

    1. Model the input patterns
    2. Usually converges fast
    3. Cheap computation
    4. Robust to noisy data
    5. Can generate a sample of the data via P(X) = \Sigma_y P(y)P(X|y)
  • Weakness:

    1. Usually performs worse

  3. Discriminative classifier

    • Explanation:

      A discriminative classifier tries to model the target directly from the observed data. It makes fewer assumptions about the distributions but depends heavily on the quality of the data (Is it representative? Is there enough of it?). An example is Logistic Regression.

    • Steps:

      1. Assume some functional form for P(Y|X)
      2. Estimate parameters of P(Y|X) directly from training data
      3. Directly learn P(Y|X)
    • Strengths:

      1. Model P(Y|X) directly
      2. Model the decision boundary
      3. Usually good performance
    • Weakness:

      1. Cannot obtain a sample of the data since P(X) is not available
      2. Slow convergence
      3. Expensive computation
      4. Sensitive to noisy data
  4. Generative classifier vs. Discriminative classifier

    • Bayes' theorem gives us the formula below:

    P(Y|X) = \frac{P(X|Y)P(Y)}{P(X)}

    • Concrete Example:

      Given that a software engineer lives in Silicon Valley (X), what is the probability that he makes over 100K salary, i.e. P(Y|X)?

      1. Generative classifier:

        To continue with our example, a generative classifier first estimates from the data the probability that a software engineer lives in Silicon Valley (X) given that he makes over 100K (Y), i.e. P(X|Y).

        Then, it estimates how many software engineers make over 100K, regardless of whether they live in Silicon Valley, i.e. P(Y).

      2. Discriminative classifier:

        As with our example, a discriminative classifier directly estimates the probability that a software engineer makes over 100K salary given that he lives in Silicon Valley, i.e. P(Y|X).
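
A minimal sketch (not part of the original notes) contrasting the two approaches on synthetic data: Gaussian Naive Bayes is generative (it estimates P(Y) and the class-conditional P(X|Y), then applies Bayes' rule), while logistic regression is discriminative (it models P(Y|X) directly). It assumes NumPy and scikit-learn are available; the data and variable names are illustrative.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Two Gaussian classes in 2-D: class 0 centered at (0, 0), class 1 at (2, 2).
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
               rng.normal(2.0, 1.0, size=(200, 2))])
y = np.array([0] * 200 + [1] * 200)

gen = GaussianNB().fit(X, y)            # generative: learns P(Y) and per-class P(X|Y)
disc = LogisticRegression().fit(X, y)   # discriminative: learns P(Y|X) directly

x_new = np.array([[1.0, 1.0]])          # a point halfway between the two class centers
print("generative     P(Y=1|x) =", gen.predict_proba(x_new)[0, 1])
print("discriminative P(Y=1|x) =", disc.predict_proba(x_new)[0, 1])
```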

3.1 Basic Form of a Linear Model

  • Given an example described by d attributes, x = (x_1,...,x_d), a linear model tries to make predictions with a linear combination of the attributes

y = f(x,w) = \Sigma_{j=1}^{d}w_jx_j + b \\ f(x) = w^Tx + b

  • Here x may be (a small construction sketch follows the next list):

    1. Raw features (predictor variables): continuous or categorical
    2. Transformed predictors (e.g., x_4 = \log x_3)
    3. Basis expansion (e.g., x_4 = x_3^2)
    4. Interaction terms (e.g., x_4 = x_2x_3)
  • Characteristics of linear models:

    1. Simple and easy to model. Many nonlinear models are obtained from linear models by introducing hierarchical structures or high-dimensional mappings.
    2. They reflect the importance of the attributes in prediction. When the coefficient estimates have equal standard errors, their magnitudes and signs indicate the size and direction of each attribute's effect on the prediction.
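
A minimal sketch (plain NumPy, synthetic data) of how those four kinds of columns can be assembled into one design matrix; the model stays linear in its coefficients even though some columns are nonlinear functions of the raw predictors.

```python
import numpy as np

rng = np.random.default_rng(0)
x1, x2, x3 = rng.uniform(1.0, 10.0, size=(3, 100))   # three raw predictors

X = np.column_stack([
    x1,             # 1. raw feature
    np.log(x3),     # 2. transformed predictor, x4 = log(x3)
    x3 ** 2,        # 3. basis expansion, x5 = x3^2
    x2 * x3,        # 4. interaction term, x6 = x2 * x3
    np.ones(100),   # constant column for the intercept b
])
print(X.shape)      # (100, 5)
```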

3.2 Linear Regression

  • Suppose we have observations Y = (y_1,...,y_m) \in R^m and X = (x_1,...,x_m) \in R^m, and we want to fit a linear model of x to predict y

3.2.1 Without an Intercept

  • Let \hat{w} be the parameter that minimizes the loss function

\hat{w} = \arg\min_w \Sigma_{i=1}^m(y_i - f(x_i))^2 = \arg\min_w \Sigma_{i=1}^m(y_i - wx_i)^2 = \arg\min_w||y - wx||^2 \\ \frac{\partial E_w}{\partial w} = -2\Sigma_{i=1}^m(y_i - wx_i)x_i = 0\\ \Sigma_{i=1}^m y_ix_i = w\Sigma_{i=1}^{m}x_i^2

  • \hat{w} = \frac{\Sigma_{i=1}^m y_ix_i }{\Sigma_{i=1}^{m}x_i^2} = \frac{X^TY}{||X||^2}
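
A minimal NumPy sketch (synthetic data, illustrative names) of this closed form, \hat{w} = X^TY / ||X||^2:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x + rng.normal(scale=0.1, size=100)   # true slope 3, no intercept, small noise

w_hat = (x @ y) / (x @ x)   # X^T Y / ||X||^2
print(w_hat)                # close to 3
```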

3.2.2 With an Intercept

  • The model is written as
    Y = b^* + w^*X + \epsilon

  • Let (\hat{b},\hat{w}) be the parameters that minimize the loss function
    (\hat{b},\hat{w}) = \arg\min_{(b,w)}\Sigma_{i=1}^m(y_i - b - wx_i)^2 = \arg\min_{(b,w)}||Y - b - wX||^2 \\ \frac{\partial E_{(w,b)}}{\partial w} = -2\Sigma_{i=1}^m(y_i - b - wx_i)x_i = 0 \\ \frac{\partial E_{(w,b)}}{\partial b} = -2\Sigma_{i=1}^m(y_i - b - wx_i) = 0 \\ mb = \Sigma_{i=1}^my_i - w\Sigma_{i=1}^m x_i \\ b = \bar{y} - w\bar{x} \\ \Sigma_{i=1}^m[y_i - \bar{y} - w(x_i - \bar{x})](x_i - \bar{x}) = 0\\ w = \frac{\Sigma_{i=1}^m(y_i - \bar{y})(x_i - \bar{x})}{\Sigma_{i=1}^m(x_i - \bar{x})(x_i - \bar{x})}

  • Therefore,
    \hat{b} = \bar{y} - w\bar{x}\\ \hat{w} = \frac{(x - \bar{x}\mathbb I)^T(y - \bar{y}\mathbb I)}{||x - \bar{x}\mathbb I||^2}

  • Note that
    \hat{w} = \frac{cov(x,y)}{var(x)} = cor(x,y)\sqrt{\frac{var(y)}{var(x)}}
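
A minimal NumPy sketch (synthetic data) of the intercept case: \hat{w} = cov(x,y)/var(x) and \hat{b} = \bar{y} - \hat{w}\bar{x}:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 + 3.0 * x + rng.normal(scale=0.1, size=100)   # true intercept 2, slope 3

xbar, ybar = x.mean(), y.mean()
w_hat = ((x - xbar) @ (y - ybar)) / ((x - xbar) @ (x - xbar))   # cov(x, y) / var(x)
b_hat = ybar - w_hat * xbar
print(w_hat, b_hat)   # close to 3 and 2
```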

3.2.3 Absorbing the Intercept into the Coefficient Vector (Multivariate Linear Regression)

  • We try to learn the mapping
    f(x_i) = w^Tx_i + b, \quad f(x_i) + \epsilon = y_i
    Absorb w and b into the vector \hat{w} = (w;b), i.e., treat the intercept term as one more factor.

    Write the dataset D as the matrix:
    X = \begin{pmatrix} x_{11} & x_{12} & ... & x_{1d} & 1\\ x_{21} & x_{22} & ... & x_{2d} & 1 \\ ... & ... & ... & ... & 1\\ x_{m1} & x_{m2} & ... & x_{md} & 1 \end{pmatrix}

Rewrite the labels y as:
y = (y_1;y_2;...;y_m)
The optimal parameters are:
\hat{w}^* = \arg\min_{\hat{w}}(y - X\hat{w})^T(y - X\hat{w})\\ E_{\hat{w}} = (y - X\hat{w})^T(y - X\hat{w})\\ \frac{\partial E_{\hat{w}}}{\partial \hat{w}} = 2X^T(X\hat{w} - y) = 0\\

When X^TX is full rank (or positive definite),
\hat{w}^* = (X^TX)^{-1}X^Ty \\ \hat{y} = X\hat{w} = X(X^TX)^{-1}X^Ty

  • Hat matrix

    H = X(X^TX)^{-1}X^T

    \hat{y} = X\hat{w} = X(X^TX)^{-1}X^Ty = Hy is exactly the projection of y \in R^m onto the subspace spanned by the columns of X (i.e., the column space of X)

    Properties of the projection matrix H:

    1. Symmetric: H^T = H
    2. Idempotent: H^2 = H
    3. Hx = x for all x \in col(X) and Hx = 0 for all x \perp col(X)
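
A minimal NumPy sketch (synthetic design matrix with the intercept column absorbed) that solves the normal equations and checks the stated properties of the hat matrix, plus the residual orthogonality used in 3.2.4:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 50, 3
X = np.column_stack([rng.normal(size=(m, d)), np.ones(m)])   # last column = intercept
w_true = np.array([1.0, -2.0, 0.5, 4.0])
y = X @ w_true + rng.normal(scale=0.1, size=m)

w_hat = np.linalg.solve(X.T @ X, X.T @ y)    # (X^T X)^{-1} X^T y, assumes X^T X full rank
H = X @ np.linalg.inv(X.T @ X) @ X.T         # hat (projection) matrix

print(np.allclose(H, H.T))                   # symmetric
print(np.allclose(H @ H, H))                 # idempotent
print(np.allclose(H @ y, X @ w_hat))         # H y is exactly y_hat
print(np.allclose(X.T @ (y - X @ w_hat), 0)) # residual is orthogonal to col(X)
```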

3.2.4 Orthogonalizing the Factors of a Multi-Factor Model (Multiple Regression)

  • Motivation for orthogonalization

    For the univariate regression model y = wx + \epsilon, the solution for the coefficient w is:
    \hat{w} = \frac{X^TY}{||X||^2} = \frac{\langle x,y\rangle}{\langle x, x\rangle}
    Comparing this with the multiple-regression solution
    \hat{w} = (X^TX)^{-1}X^TY
    we see that if all the explanatory variables (the columns of X) are pairwise orthogonal, \langle x_i,x_j\rangle = 0, i\not = j, then each component of \hat{w} is exactly
    \hat{w}_i = \frac{\langle x_i,y\rangle}{\langle x_i,x_i\rangle}
    In other words, when all explanatory variables are mutually orthogonal, the factors have no influence on each other's parameter estimates \hat{w}_i

  • Geometric meaning of regression

    Substituting the expression for \hat{w} into the regression model gives an expression for \epsilon; computing the inner product of X and \epsilon, we get
    X^T\epsilon = X^T(Y - X\hat{w}) = X^T(Y - X(X^TX)^{-1}X^TY) = X^TY-(X^TX)(X^TX)^{-1}X^TY = X^TY - X^TY = 0
    which shows that the residual \epsilon is orthogonal to the explanatory variables X


    For the univariate regression y = \hat{w}x + \epsilon, the regression orthogonally projects y onto x, which minimizes the norm of \epsilon


    For the bivariate regression y = w_1x_1 + w_2x_2 + \epsilon, suppose x_1 and x_2 are orthogonal. Geometrically, y is orthogonally projected onto the subspace spanned by x_1 and x_2, and the projection \hat{y} can then be projected onto x_1 and x_2 separately. Since x_1 and x_2 are orthogonal, \hat{y} is exactly the sum of these two component vectors, and \hat{w}_1 and \hat{w}_2 are exactly the projections of \hat{y} onto the normalized x_1 and x_2.

    When x_1 and x_2 are not orthogonal, \hat{y} is no longer the sum of the two separate projections, and the explanatory variables affect each other's regression coefficients:
    \hat{w}_i \not= \frac{\langle x_i,y\rangle}{\langle x_i,x_i\rangle}

  • Solving multiple regression via orthogonalization

    The Gram-Schmidt orthogonalization procedure

    [figure: regression by successive (Gram-Schmidt) orthogonalization; cf. Algorithm 3.1 in Hastie et al. (2016)]

    For step 3 of the procedure,
    b_p = \frac{\langle z_p,y\rangle}{\langle z_p,z_p\rangle}
    When the y_i are i.i.d. with variance \sigma^2, it can be shown that the variance of the regression coefficient b_p is inversely proportional to the squared norm of z_p:
    var(b_p) = \frac{\sigma^2}{\langle z_p,z_p\rangle} = \frac{\sigma^2}{||z_p||^2}
    Since z_0,z_1,...,z_p are mutually orthogonal, the corresponding regression coefficients are
    \hat{w}_j = \frac{\langle z_j,y\rangle}{\langle z_j,z_j\rangle}

    Drygas's backward procedure for solving the regression coefficients of the non-orthogonalized factors

    [figure: Drygas's back-substitution relating the orthogonalized coefficients to the original ones]

    From this, the regression coefficients corresponding to the original (non-orthogonalized) factors can be recovered; a NumPy sketch of the orthogonalization step follows below.

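A minimal NumPy sketch (synthetic data; not the notes' own code) of regression by successive Gram-Schmidt orthogonalization: after orthogonalizing the columns of X, the coefficient on the last orthogonalized column, \langle z_p,y\rangle / \langle z_p,z_p\rangle, matches the multiple-regression coefficient of the last original column (cf. Hastie et al. 2016).

```python
import numpy as np

rng = np.random.default_rng(0)
m, p = 100, 3
X = np.column_stack([np.ones(m), rng.normal(size=(m, p))])   # x_0 = 1, x_1, ..., x_p
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=m)

# Gram-Schmidt: z_j = x_j minus its projections onto z_0, ..., z_{j-1}
Z = X.copy()
for j in range(1, X.shape[1]):
    for l in range(j):
        Z[:, j] -= (Z[:, l] @ X[:, j]) / (Z[:, l] @ Z[:, l]) * Z[:, l]

b_p = (Z[:, -1] @ y) / (Z[:, -1] @ Z[:, -1])   # coefficient on the last residual z_p
w_ols = np.linalg.solve(X.T @ X, X.T @ y)      # ordinary multiple regression
print(b_p, w_ols[-1])                          # the two agree
```
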
3.3 Variants of Linear Regression

  • Log-linear regression: the output label varies on an exponential scale

\ln(y) = wx + b

  • Generalized linear model: g(\cdot) is called the link function and must be monotone and differentiable
    g(y) = wx + b

  • Logistic (log-odds) regression

\ln\frac{y}{1-y} = wx +b \\ y = \frac{1}{1+e^{-(w^Tx+b)}}

  1. If y is viewed as the probability that x is a positive example and 1-y as the probability that x is a negative example, then their ratio \frac{y}{1-y} is called the odds

  2. \ln \frac{y}{1-y} is called the log-odds (logit); a minimal fitting sketch follows below
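
A minimal sketch (plain NumPy, synthetic data, a crude gradient-descent fit rather than any particular library routine) of the log-odds model y = 1/(1+e^{-(wx+b)}):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
p_true = 1.0 / (1.0 + np.exp(-(2.0 * x + 0.5)))   # true w = 2, b = 0.5
y = rng.binomial(1, p_true)                       # binary labels

w, b, lr = 0.0, 0.0, 0.1
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(w * x + b)))        # predicted P(y = 1 | x)
    w -= lr * ((p - y) * x).mean()                # gradient of the average cross-entropy
    b -= lr * (p - y).mean()
print(w, b)                                       # roughly 2 and 0.5
```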

References

  • https://zhuanlan.zhihu.com/p/41993542
  • Drygas, H. (2011). On the Relationship between the Method of Least Squares and Gram-Schmidt Orthogonalization. Acta et Commentationes Universitatis Tartuensis de Mathematica, Vol. 15(1), 3 – 13.
  • Hastie, T., R. Tibshirani, and J. Friedman (2016). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd Ed. Springer.