Linear Models

3.0 Generative Learning and Discriminative Classifiers

Basic goal: learn a mapping h : X \rightarrow Y, where X denotes the features and Y the class labels

  1. Bayes optimal classifier: P(Y|X)

    What is the most probable classification of the new instance given the training data?

  2. Generative classifier:

  • e.g.,

    Naive Bayes

    LDA:

    Linear Discriminant Analysis (LDA) is most commonly used as a dimensionality reduction technique in the pre-processing step for pattern-classification and machine learning applications. The goal is to project a dataset onto a lower-dimensional space with good class separability, in order to avoid overfitting ("curse of dimensionality") and also to reduce computational cost.

    QDA:

    Quadratic Discriminant Analysis (QDA) is a generative classifier closely related to LDA: each class is modeled by its own Gaussian distribution with a class-specific covariance matrix, so the resulting decision boundaries between classes are quadratic rather than linear.

    Non-par: non-parametric methods (e.g., kernel density estimation)

  • Explanation:

    A generative classifier tries to learn the model that generates the data behind the scenes by estimating the assumptions and distributions of the model. It then uses this to predict unseen data, because it assumes the learned model captures the real one. As we will see, this is often not true. An example is the Naive Bayes classifier.

  • Steps:

    1. Assume some functional form for P(X|Y), P(Y)
    2. Estimate parameters of P(X|Y) directly from training data
    3. Apply Bayes formula to calculate P(Y|X = x)
  • Strengths:

    1. Model the input patterns
    2. Usually converges fast
    3. Cheap computation
    4. Robust to noisy data
    5. Can generate a sample of the data via P(X) = \Sigma_y P(y)P(X|y)
  • Weakness:

    1. Usually performs worse

  3. Discriminative classifier

    • Explanation:

      A discriminative classifier tries to model the target directly from the observed data. It makes fewer assumptions about the distributions but depends heavily on the quality of the data (Is it representative? Is there enough of it?). An example is Logistic Regression.

    • Steps:

      1. Assume some functional form for P(Y|X)
      2. Estimate parameters of P(Y|X) directly from training data
      3. Directly learn P(Y|X)
    • Strengths:

      1. Model P(Y|X) directly
      2. Model the decision boundary
      3. Usually good performance
    • Weakness:

      1. Cannot obtain a sample of the data since P(X) is not available
      2. Slow convergence
      3. Expensive computation
      4. Sensitive to noisy data
  4. Generative classifier vs. Discriminative classifier

    • Bayes' theorem gives us the formula below:

    P(Y|X) = \frac{P(X|Y)P(Y)}{P(X)}

    • Concrete Example:

      Given that a software engineer lives in Silicon Valley (X), what is the probability that he makes over 100K salary, i.e. P(Y|X)?

      1. Generative classifier:

        To continue with our example, a generative classifier first estimates from the data the probability that a software engineer lives in Silicon Valley (X) given that he makes over 100K (Y), i.e. P(X|Y).

        Then, it estimates how many software engineers make over 100K, regardless of whether they live in Silicon Valley, i.e. P(Y).

      2. Discriminative classifier:

        As with our example, a discriminative classifier directly estimates the probability that a software engineer makes over 100K salary given that he lives in Silicon Valley, i.e. P(Y|X).
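
A minimal sketch (not part of the original notes) contrasting the two approaches on synthetic data: Gaussian Naive Bayes is generative (it estimates P(Y) and the class-conditional P(X|Y), then applies Bayes' rule), while logistic regression is discriminative (it models P(Y|X) directly). It assumes NumPy and scikit-learn are available; the data and variable names are illustrative.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Two Gaussian classes in 2-D: class 0 centered at (0, 0), class 1 at (2, 2).
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
               rng.normal(2.0, 1.0, size=(200, 2))])
y = np.array([0] * 200 + [1] * 200)

gen = GaussianNB().fit(X, y)            # generative: learns P(Y) and per-class P(X|Y)
disc = LogisticRegression().fit(X, y)   # discriminative: learns P(Y|X) directly

x_new = np.array([[1.0, 1.0]])          # a point halfway between the two class centers
print("generative     P(Y=1|x) =", gen.predict_proba(x_new)[0, 1])
print("discriminative P(Y=1|x) =", disc.predict_proba(x_new)[0, 1])
```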

3.1 Basic Form of a Linear Model

  • Given an example described by d attributes, x = (x_1,...,x_d), a linear model tries to make predictions with a linear combination of the attributes

y = f(x,w) = \Sigma_{j=1}^{d}w_jx_j + b \\ f(x) = w^Tx + b

  • Here x may be (a small construction sketch follows the next list):

    1. Raw features (predictor variables): continuous or categorical
    2. Transformed predictors (e.g., x_4 = \log x_3)
    3. Basis expansion (e.g., x_4 = x_3^2)
    4. Interaction terms (e.g., x_4 = x_2x_3)
  • Characteristics of linear models:

    1. Simple and easy to model. Many nonlinear models are obtained from linear models by introducing hierarchical structures or high-dimensional mappings.
    2. They reflect the importance of the attributes in prediction. When the coefficient estimates have equal standard errors, their magnitudes and signs indicate the size and direction of each attribute's effect on the prediction.
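
A minimal sketch (plain NumPy, synthetic data) of how those four kinds of columns can be assembled into one design matrix; the model stays linear in its coefficients even though some columns are nonlinear functions of the raw predictors.

```python
import numpy as np

rng = np.random.default_rng(0)
x1, x2, x3 = rng.uniform(1.0, 10.0, size=(3, 100))   # three raw predictors

X = np.column_stack([
    x1,             # 1. raw feature
    np.log(x3),     # 2. transformed predictor, x4 = log(x3)
    x3 ** 2,        # 3. basis expansion, x5 = x3^2
    x2 * x3,        # 4. interaction term, x6 = x2 * x3
    np.ones(100),   # constant column for the intercept b
])
print(X.shape)      # (100, 5)
```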

3.2 Linear Regression

  • Suppose we have observations Y = (y_1,...,y_m) \in R^m and X = (x_1,...,x_m) \in R^m, and we want to fit a linear model of x to predict y

3.2.1 Without an Intercept

  • Let \hat{w} be the parameter that minimizes the loss function

\hat{w} = \arg\min_w \Sigma_{i=1}^m(y_i - f(x_i))^2 = \arg\min_w \Sigma_{i=1}^m(y_i - wx_i)^2 = \arg\min_w||y - wx||^2 \\ \frac{\partial E_w}{\partial w} = -2\Sigma_{i=1}^m(y_i - wx_i)x_i = 0\\ \Sigma_{i=1}^m y_ix_i = w\Sigma_{i=1}^{m}x_i^2

  • \hat{w} = \frac{\Sigma_{i=1}^m y_ix_i }{\Sigma_{i=1}^{m}x_i^2} = \frac{X^TY}{||X||^2}
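
A minimal NumPy sketch (synthetic data, illustrative names) of this closed form, \hat{w} = X^TY / ||X||^2:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x + rng.normal(scale=0.1, size=100)   # true slope 3, no intercept, small noise

w_hat = (x @ y) / (x @ x)   # X^T Y / ||X||^2
print(w_hat)                # close to 3
```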

3.2.2 With an Intercept

  • The model is written as
    Y = b^* + w^*X + \epsilon

  • Let (\hat{b},\hat{w}) be the parameters that minimize the loss function
    (\hat{b},\hat{w}) = \arg\min_{(b,w)}\Sigma_{i=1}^m(y_i - b - wx_i)^2 = \arg\min_{(b,w)}||Y - b - wX||^2 \\ \frac{\partial E_{(w,b)}}{\partial w} = -2\Sigma_{i=1}^m(y_i - b - wx_i)x_i = 0 \\ \frac{\partial E_{(w,b)}}{\partial b} = -2\Sigma_{i=1}^m(y_i - b - wx_i) = 0 \\ mb = \Sigma_{i=1}^my_i - w\Sigma_{i=1}^m x_i \\ b = \bar{y} - w\bar{x} \\ \Sigma_{i=1}^m[y_i - \bar{y} - w(x_i - \bar{x})](x_i - \bar{x}) = 0\\ w = \frac{\Sigma_{i=1}^m(y_i - \bar{y})(x_i - \bar{x})}{\Sigma_{i=1}^m(x_i - \bar{x})(x_i - \bar{x})}

  • Therefore,
    \hat{b} = \bar{y} - w\bar{x}\\ \hat{w} = \frac{(x - \bar{x}\mathbb I)^T(y - \bar{y}\mathbb I)}{||x - \bar{x}\mathbb I||^2}

  • Note that
    \hat{w} = \frac{cov(x,y)}{var(x)} = cor(x,y)\sqrt{\frac{var(y)}{var(x)}}
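
A minimal NumPy sketch (synthetic data) of the intercept case: \hat{w} = cov(x,y)/var(x) and \hat{b} = \bar{y} - \hat{w}\bar{x}:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 + 3.0 * x + rng.normal(scale=0.1, size=100)   # true intercept 2, slope 3

xbar, ybar = x.mean(), y.mean()
w_hat = ((x - xbar) @ (y - ybar)) / ((x - xbar) @ (x - xbar))   # cov(x, y) / var(x)
b_hat = ybar - w_hat * xbar
print(w_hat, b_hat)   # close to 3 and 2
```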

3.2.3 Absorbing the Intercept into the Coefficient Vector (Multivariate Linear Regression)

  • We try to learn the mapping
    f(x_i) = w^Tx_i + b, \quad f(x_i) + \epsilon = y_i
    Absorb w and b into the vector \hat{w} = (w;b), i.e., treat the intercept term as one more factor.

    Write the dataset D as the matrix:
    X = \begin{pmatrix} x_{11} & x_{12} & ... & x_{1d} & 1\\ x_{21} & x_{22} & ... & x_{2d} & 1 \\ ... & ... & ... & ... & 1\\ x_{m1} & x_{m2} & ... & x_{md} & 1 \end{pmatrix}

Rewrite the labels y as:
y = (y_1;y_2;...;y_m)
The optimal parameters are:
\hat{w}^* = \arg\min_{\hat{w}}(y - X\hat{w})^T(y - X\hat{w})\\ E_{\hat{w}} = (y - X\hat{w})^T(y - X\hat{w})\\ \frac{\partial E_{\hat{w}}}{\partial \hat{w}} = 2X^T(X\hat{w} - y) = 0\\

When X^TX is full rank (or positive definite),
\hat{w}^* = (X^TX)^{-1}X^Ty \\ \hat{y} = X\hat{w} = X(X^TX)^{-1}X^Ty

  • Hat matrix

    H = X(X^TX)^{-1}X^T

    \hat{y} = X\hat{w} = X(X^TX)^{-1}X^Ty = Hy is exactly the projection of y \in R^m onto the subspace spanned by the columns of X (i.e., the column space of X)

    Properties of the projection matrix H:

    1. Symmetric: H^T = H
    2. Idempotent: H^2 = H
    3. Hx = x for all x \in col(X) and Hx = 0 for all x \perp col(X)
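
A minimal NumPy sketch (synthetic design matrix with the intercept column absorbed) that solves the normal equations and checks the stated properties of the hat matrix, plus the residual orthogonality used in 3.2.4:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 50, 3
X = np.column_stack([rng.normal(size=(m, d)), np.ones(m)])   # last column = intercept
w_true = np.array([1.0, -2.0, 0.5, 4.0])
y = X @ w_true + rng.normal(scale=0.1, size=m)

w_hat = np.linalg.solve(X.T @ X, X.T @ y)    # (X^T X)^{-1} X^T y, assumes X^T X full rank
H = X @ np.linalg.inv(X.T @ X) @ X.T         # hat (projection) matrix

print(np.allclose(H, H.T))                   # symmetric
print(np.allclose(H @ H, H))                 # idempotent
print(np.allclose(H @ y, X @ w_hat))         # H y is exactly y_hat
print(np.allclose(X.T @ (y - X @ w_hat), 0)) # residual is orthogonal to col(X)
```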

3.2.4 Orthogonalizing the Factors of a Multi-Factor Model (Multiple Regression)

  • Motivation for orthogonalization

    For the univariate regression model y = wx + \epsilon, the solution for the coefficient w is:
    \hat{w} = \frac{X^TY}{||X||^2} = \frac{\langle x,y\rangle}{\langle x, x\rangle}
    Comparing this with the multiple-regression solution
    \hat{w} = (X^TX)^{-1}X^TY
    we see that if all the explanatory variables (the columns of X) are pairwise orthogonal, \langle x_i,x_j\rangle = 0, i\not = j, then each component of \hat{w} is exactly
    \hat{w}_i = \frac{\langle x_i,y\rangle}{\langle x_i,x_i\rangle}
    In other words, when all explanatory variables are mutually orthogonal, the factors have no influence on each other's parameter estimates \hat{w}_i

  • Geometric meaning of regression

    Substituting the expression for \hat{w} into the regression model gives an expression for \epsilon; computing the inner product of X and \epsilon, we get
    X^T\epsilon = X^T(Y - X\hat{w}) = X^T(Y - X(X^TX)^{-1}X^TY) = X^TY-(X^TX)(X^TX)^{-1}X^TY = X^TY - X^TY = 0
    which shows that the residual \epsilon is orthogonal to the explanatory variables X


    For the univariate regression y = \hat{w}x + \epsilon, the regression orthogonally projects y onto x, which minimizes the norm of \epsilon


    For the bivariate regression y = w_1x_1 + w_2x_2 + \epsilon, suppose x_1 and x_2 are orthogonal. Geometrically, y is orthogonally projected onto the subspace spanned by x_1 and x_2, and the projection \hat{y} can then be projected onto x_1 and x_2 separately. Since x_1 and x_2 are orthogonal, \hat{y} is exactly the sum of these two component vectors, and \hat{w}_1 and \hat{w}_2 are exactly the projections of \hat{y} onto the normalized x_1 and x_2.

    When x_1 and x_2 are not orthogonal, \hat{y} is no longer the sum of the two separate projections, and the explanatory variables affect each other's regression coefficients:
    \hat{w}_i \not= \frac{\langle x_i,y\rangle}{\langle x_i,x_i\rangle}

  • Solving multiple regression via orthogonalization

    The Gram-Schmidt orthogonalization procedure

    [figure: regression by successive (Gram-Schmidt) orthogonalization; cf. Algorithm 3.1 in Hastie et al. (2016)]

    For step 3 of the procedure,
    b_p = \frac{\langle z_p,y\rangle}{\langle z_p,z_p\rangle}
    When the y_i are i.i.d. with variance \sigma^2, it can be shown that the variance of the regression coefficient b_p is inversely proportional to the squared norm of z_p:
    var(b_p) = \frac{\sigma^2}{\langle z_p,z_p\rangle} = \frac{\sigma^2}{||z_p||^2}
    Since z_0,z_1,...,z_p are mutually orthogonal, the corresponding regression coefficients are
    \hat{w}_j = \frac{\langle z_j,y\rangle}{\langle z_j,z_j\rangle}

    Drygas's backward procedure for solving the regression coefficients of the non-orthogonalized factors

    [figure: Drygas's back-substitution relating the orthogonalized coefficients to the original ones]

    From this, the regression coefficients corresponding to the original (non-orthogonalized) factors can be recovered; a NumPy sketch of the orthogonalization step follows below.

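A minimal NumPy sketch (synthetic data; not the notes' own code) of regression by successive Gram-Schmidt orthogonalization: after orthogonalizing the columns of X, the coefficient on the last orthogonalized column, \langle z_p,y\rangle / \langle z_p,z_p\rangle, matches the multiple-regression coefficient of the last original column (cf. Hastie et al. 2016).

```python
import numpy as np

rng = np.random.default_rng(0)
m, p = 100, 3
X = np.column_stack([np.ones(m), rng.normal(size=(m, p))])   # x_0 = 1, x_1, ..., x_p
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=m)

# Gram-Schmidt: z_j = x_j minus its projections onto z_0, ..., z_{j-1}
Z = X.copy()
for j in range(1, X.shape[1]):
    for l in range(j):
        Z[:, j] -= (Z[:, l] @ X[:, j]) / (Z[:, l] @ Z[:, l]) * Z[:, l]

b_p = (Z[:, -1] @ y) / (Z[:, -1] @ Z[:, -1])   # coefficient on the last residual z_p
w_ols = np.linalg.solve(X.T @ X, X.T @ y)      # ordinary multiple regression
print(b_p, w_ols[-1])                          # the two agree
```
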
3.3 Variants of Linear Regression

  • Log-linear regression: the output label varies on an exponential scale

\ln(y) = wx + b

  • Generalized linear model: g(\cdot) is called the link function and must be monotone and differentiable
    g(y) = wx + b

  • Logistic (log-odds) regression

\ln\frac{y}{1-y} = wx +b \\ y = \frac{1}{1+e^{-(w^Tx+b)}}

  1. If y is viewed as the probability that x is a positive example and 1-y as the probability that x is a negative example, then their ratio \frac{y}{1-y} is called the odds

  2. \ln \frac{y}{1-y} is called the log-odds (logit); a minimal fitting sketch follows below
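
A minimal sketch (plain NumPy, synthetic data, a crude gradient-descent fit rather than any particular library routine) of the log-odds model y = 1/(1+e^{-(wx+b)}):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
p_true = 1.0 / (1.0 + np.exp(-(2.0 * x + 0.5)))   # true w = 2, b = 0.5
y = rng.binomial(1, p_true)                       # binary labels

w, b, lr = 0.0, 0.0, 0.1
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(w * x + b)))        # predicted P(y = 1 | x)
    w -= lr * ((p - y) * x).mean()                # gradient of the average cross-entropy
    b -= lr * (p - y).mean()
print(w, b)                                       # roughly 2 and 0.5
```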

References

  • https://zhuanlan.zhihu.com/p/41993542
  • Drygas, H. (2011). On the Relationship between the Method of Least Squares and Gram-Schmidt Orthogonalization. Acta et Commentationes Universitatis Tartuensis de Mathematica, Vol. 15(1), 3 – 13.
  • Hastie, T., R. Tibshirani, and J. Friedman (2016). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd Ed. Springer.