Study Notes: The Elements of Statistical Learning (Chapter 4)

4.1 Introduction

In this chapter, we revisit the classification problem and focus on linear methods for classification.

\bullet What does linear mean here?

Depending on the prediction function, the decision boundaries that partition the input space can be rough or smooth. For an important class of methods, these boundaries are linear.

While this entire chapter is devoted to linear decision boundaries, there is considerable scope for generalization. For example, we can expand our variable
set X_1,\cdots, X_p by including their squares and cross-products X_1^2,X_2^2,\cdots,X_1X_2,\cdots, thereby adding p(p+1)/2 additional variables. Linear functions in the augmented space map down to quadratic functions in the original space --- hence linear decision boundaries become quadratic decision boundaries. This approach can be used with any basis transformation h(X), where h:\mathbb{R}^p\to\mathbb{R}^q with q>p, and will be explored in later chapters.
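As a concrete illustration, here is a minimal numpy sketch of such a quadratic basis expansion h(X); the function name expand_quadratic is illustrative and not from the book.

```python
import numpy as np

def expand_quadratic(X):
    """Augment (N, p) inputs with squares and pairwise cross-products.

    A linear decision boundary fit in the augmented space corresponds to a
    quadratic boundary in the original p-dimensional space.
    """
    N, p = X.shape
    squares = X ** 2                                    # X_j^2, j = 1..p
    crosses = [X[:, i] * X[:, j]                        # X_i X_j, i < j
               for i in range(p) for j in range(i + 1, p)]
    crosses = np.column_stack(crosses) if crosses else np.empty((N, 0))
    return np.hstack([X, squares, crosses])             # adds p(p+1)/2 columns
```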

4.2 Linear Regression of an Indicator Matrix

In this section, a basic method --- linear regression on an indicator response matrix --- is introduced. It has some shortcomings (notably the masking problem described below) that are addressed, to some extent, by the methods in later sections.

Here, each of the response categories is coded via an indicator variable. If \mathcal G has K classes, there will be K such indicators Y_k, k=1,2,\cdots, K, with Y_k=1 if G=k and Y_k=0 otherwise. These are collected together in a vector Y =(Y_1,\cdots, Y_K), and the N training instances of these form an N\times K indicator response matrix Y. We fit a linear regression model to each of the columns of Y simultaneously; the fit is given by
\hat Y =X(X'X)^{-1}X'Y,
and the coefficient matrix is
\hat B=(X'X)^{-1}X'Y.
To classify a new observation with input x, we first compute the fitted output \hat f(x)^T=(1,x^T)\hat B, a K-vector, and then identify the largest component and classify accordingly: \hat G(x)=argmax_{k\in\mathcal G}\hat f_k(x).
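A minimal numpy sketch of this fit and classification rule, assuming integer class labels 0,\cdots,K-1 and an intercept column prepended to X (the helper names are illustrative):

```python
import numpy as np

def fit_indicator_regression(X, g, K):
    """Least-squares fit to the N x K indicator response matrix.

    X : (N, p) inputs; g : (N,) integer class labels in {0, ..., K-1}.
    Returns the (p+1) x K coefficient matrix B_hat = (X'X)^{-1} X'Y.
    """
    N = X.shape[0]
    X1 = np.hstack([np.ones((N, 1)), X])         # prepend intercept column
    Y = np.eye(K)[g]                             # N x K indicator matrix
    B_hat, *_ = np.linalg.lstsq(X1, Y, rcond=None)
    return B_hat

def predict_indicator_regression(B_hat, x):
    """Classify a new input x by the largest fitted component."""
    f_hat = np.concatenate([[1.0], x]) @ B_hat   # K-vector (1, x^T) B_hat
    return int(np.argmax(f_hat))
```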

\triangle Why does this method work? What is the rationale?

\bullet One justification is to view the regression as an estimate of conditional expectation.

For the random variable Y_k, E(Y_k|X=x)=P(G=k|X=x), so targeting the conditional expectation of each of the Y_k seems a sensible goal. The real issue is: how good an approximation to the conditional expectation is the rather rigid linear regression model? Alternatively, are the \hat f_k(x) reasonable estimates of the posterior probabilities P(G=k|X=x), and, more importantly, does this matter?

Although we can verify that \sum_{k\in\mathcal{G}}\hat f_k(x)=1 for any x, the \hat{f_k}(x) can be negative or greater than 1.

\bullet Another, more simplistic, viewpoint is to construct a target t_k for each class, where t_k is the kth column of the K\times K identity matrix; the response for observation i is y_i=t_k if g_i=k. Then we can fit the linear model by least squares:

\min_B\sum_{i=1}^N\|y_i-[(1,x_i^T)B]^T\|^2.
A new observation is classified by computing its fitted vector \hat f(x) and classifying to the closest target:
\hat G(x)=argmin_k\|\hat f(x)-t_k\|^2.

This is exactly the same as the previous approach: \|\hat f(x)-t_k\|^2=\|\hat f(x)\|^2-2\hat f_k(x)+1, and only the middle term depends on k, so minimizing the distance over k amounts to maximizing \hat f_k(x).

\triangle A serious problem of this method:

When the number of classes is K\geqslant 3, the model can work badly (the problem is especially prevalent when K is large): because of the rigid nature of the regression model, some classes can be masked by others.

4.3 Linear Discriminant Analysis

Decision theory for classification (Section 2.4) tells us that we need to know the class posteriors P(G|X) for optimal classification. Suppose f_k(x) is the class-conditional density of X in the class G=k, and let \pi_k be the prior probability of class k, with \sum_{k=1}^K \pi_k=1. A simple application of Bayes Theorem gives us
P(G=k|X=x)=\frac{f_k(x)\pi_k}{\sum_{l=1}^Kf_l(x)\pi_l}.
We see that in terms of ability to classify, having the f_k(x) is almost equivalent to having the quantity P(G=k|X=x).

Suppose that we model each class density as a multivariate Gaussian
f_k(x)=\frac{1}{(2\pi)^{p/2}|\Sigma_k|^{1/2}}\exp(-\frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k)).
Linear discriminant analysis (LDA) arises in the special case when we assume that the classes have a common covariance matrix \Sigma_k=\Sigma,\forall k. In comparing two classes k and l, it is sufficient to look at the log-ratio
\log\frac{P(G=k|X=x)}{P(G=l|X=x)}=\log\frac{\pi_k}{\pi_l}-\frac{1}{2}(\mu_k+\mu_l)^T\Sigma^{-1}(\mu_k-\mu_l)+x^T\Sigma^{-1}(\mu_k-\mu_l).
This equation is linear in x. It implies that the decision boundary between classes k and l --- the set where P(G=k|X=x)=P(G=l|X=x) --- is linear in x; in p dimensions it is a hyperplane.

We can also see that the linear discriminant functions
\delta_k(x)=x^T\Sigma^{-1}\mu_k-\frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k+\log\pi_k
are an equivalent description of the decision rule, with G(x)=argmax_k\delta_k(x).

In practice, we do not know the parameters of the Gaussian distributions and will need to estimate them from our training data (a short sketch follows the list below):

\bullet \hat\pi_k=N_k/N, where N_k is the number of class-k observations;

\bullet \hat\mu_k=\sum_{g_i=k}x_i/N_k;

\bullet \hat\Sigma=\sum_{k=1}^K\sum_{g_i=k}(x_i-\hat\mu_k)(x_i-\hat\mu_k)^T/(N-K).
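Below is a minimal numpy sketch that computes these plug-in estimates and then classifies by the largest linear discriminant function \delta_k(x); the function names are illustrative, and class labels are assumed to be coded 0,\cdots,K-1.

```python
import numpy as np

def fit_lda(X, g, K):
    """Plug-in LDA estimates: priors, class means, pooled covariance."""
    N, p = X.shape
    pi_hat = np.array([np.mean(g == k) for k in range(K)])          # N_k / N
    mu_hat = np.array([X[g == k].mean(axis=0) for k in range(K)])   # class means
    Sigma_hat = np.zeros((p, p))
    for k in range(K):
        R = X[g == k] - mu_hat[k]
        Sigma_hat += R.T @ R
    Sigma_hat /= (N - K)                                            # pooled covariance
    return pi_hat, mu_hat, Sigma_hat

def lda_predict(x, pi_hat, mu_hat, Sigma_hat):
    """Classify by the largest linear discriminant function delta_k(x)."""
    Sigma_inv = np.linalg.inv(Sigma_hat)
    deltas = [x @ Sigma_inv @ mu - 0.5 * mu @ Sigma_inv @ mu + np.log(pk)
              for mu, pk in zip(mu_hat, pi_hat)]
    return int(np.argmax(deltas))
```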

\triangle What if \Sigma_k are not assumed to be equal?

The pieces quadratic in x remain. We then get quadratic discriminant functions (QDA),
\delta_k(x)=-\frac{1}{2}\log|\Sigma_k|-\frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k)+\log\pi_k.
The decision boundary between each pair of classes k and l is described by a quadratic equation \{x:\delta_k(x)=\delta_l(x)\}.

4.3.1 Regularized Discriminant Analysis

In this subsection, a method which is a compromise between LDA and QDA is introduced.

The regularized covariance matrices have the form:

\hat\Sigma_k(\alpha)=\alpha\hat\Sigma_k+(1-\alpha)\hat{\Sigma},
where \hat\Sigma is the pooled covariance matrix as used in LDA. Here \alpha\in[0,1] allows a continuum of models between LDA and QDA, and needs to be specified (by cross-validation, for example).

Similar modifications allow \hat\Sigma itself to be shrunk toward the scalar covariance,
\hat\Sigma(\gamma)=\gamma\hat\Sigma+(1-\gamma)\hat\sigma^2I
for \gamma\in[0,1]. Replacing \hat\Sigma by \hat\Sigma(\gamma) leads to a more general family of covariances \hat\Sigma(\alpha,\gamma) indexed by a pair of parameters.
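A short sketch of this two-level shrinkage is given below. The book writes \hat\sigma^2I without pinning down the scalar estimate; the average diagonal entry of the pooled covariance is used here as one simple choice, and \alpha,\gamma are assumed to have been chosen already (e.g., by cross-validation).

```python
import numpy as np

def rda_covariance(Sigma_k, Sigma_pooled, alpha, gamma):
    """Regularized covariance: blend class covariance, pooled covariance,
    and a scalar (spherical) covariance sigma^2 * I."""
    p = Sigma_pooled.shape[0]
    sigma2 = np.trace(Sigma_pooled) / p                       # one simple scalar estimate
    Sigma_gamma = gamma * Sigma_pooled + (1 - gamma) * sigma2 * np.eye(p)
    return alpha * Sigma_k + (1 - alpha) * Sigma_gamma        # Sigma_hat(alpha, gamma)
```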

4.3.2 Computation for LDA

The computations of LDA and especially QDA are simplified by diagonalizing \hat\Sigma or \hat\Sigma_k.

For the latter, suppose we compute the eigen-decomposition for each \hat\Sigma_k=U_kD_kU_k^T, where U_k is p\times p orthonormal, and D_k a diagonal matrix of positive eigenvalues d_{kl}. Then the ingredients for \delta_k(x) are
(x-\hat\mu_k)^T\hat\Sigma_k^{-1}(x-\hat\mu_k)=[U_k^T(x-\hat\mu_k)]^TD_k^{-1}[U_k^T(x-\hat\mu_k)];
\log|\hat\Sigma_k|=\sum_l\log d_{kl}.
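The following sketch evaluates both ingredients from a single eigen-decomposition; np.linalg.eigh returns the eigenvalues d_{kl} and the orthonormal matrix U_k. The function name is illustrative.

```python
import numpy as np

def qda_ingredients(x, mu_k, Sigma_k):
    """Evaluate the two QDA ingredients via Sigma_k = U_k D_k U_k^T."""
    d, U = np.linalg.eigh(Sigma_k)            # d: eigenvalues, U: orthonormal columns
    z = U.T @ (x - mu_k)                      # rotate into the eigenbasis
    mahalanobis = np.sum(z ** 2 / d)          # (x - mu_k)^T Sigma_k^{-1} (x - mu_k)
    log_det = np.sum(np.log(d))               # log |Sigma_k|
    return mahalanobis, log_det
```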

4.3.3 Reduced-Rank Linear Discriminant Analysis

Motivations:

\bullet Gaussian classification with common covariances leads to linear decision boundaries. Classification can be achieved by sphering the data with respect to the common (within-class) covariance estimate W, and classifying to the closest centroid (modulo \log \pi_k) in the sphered space.

\bullet Since only the relative distances to the centroids count, one can confine the data to the subspace spanned by the centroids in the sphered space.

\bullet This subspace can be further decomposed into successively optimal subspaces in terms of centroid separation. This decomposition is identical to the decomposition due to Fisher.

For the detailed discussions, please refer to the textbook.

4.4 Logistic Regression

The logistic model arises from the desire to model the posterior probabilities of the K classes via linear functions in x, while at the same time ensuring that they sum to one and remain in [0,1].

The model has the form:

\log\frac{P(G=i|X=x)}{P(G=K|X=x)}=\beta_{i0}+\beta_i^Tx,\ \ (i=1,2,\cdots,K-1).
The model is specified in terms of K-1 log-odds or logit transformations. Although the model uses the last class as the denominator in the odds-ratios, the choice of denominator is arbitrary in that the estimates are equivariant under this choice. A simple calculation shows that

P(G=k|X=x)=\dfrac{\exp(\beta_{k0}+\beta_k^Tx)}{1+\sum_{l=1}^{K-1}\exp(\beta_{l0}+\beta_l^Tx)},\ k=1,2,\cdots, K-1,
P(G=K|X=x)=\dfrac{1}{1+\sum_{l=1}^{K-1}\exp(\beta_{l0}+\beta_l^Tx)},
and clearly sum to one. To emphasize the dependence on the entire parameter set \theta=\{\beta_{10},\beta_1^T,\cdots,\beta_{(K-1)0},\beta_{K-1}^T\}, we denote the probabilities P(G=k|X=x)=P_k(x;\theta).
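A minimal numpy sketch of these probability formulas, with class K (the last class) playing the role of the reference class; the argument layout is an illustrative choice.

```python
import numpy as np

def multilogit_probs(x, beta0, Beta):
    """Posterior probabilities for the K-class logit model.

    beta0 : (K-1,) intercepts; Beta : (K-1, p) coefficients; class K
    (the last one) is the reference class with linear predictor 0.
    """
    eta = beta0 + Beta @ x                        # K-1 linear predictors
    expo = np.exp(eta)
    denom = 1.0 + expo.sum()
    return np.append(expo / denom, 1.0 / denom)   # K probabilities, summing to one
```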

4.4.1 Fitting Logistic Regression Models

Logistic regression models are usually fit by maximum likelihood, using the conditional likelihood of G given X. Since P(G|X) completely specifies the conditional distribution, the multinomial distribution is appropriate. The log-likelihood for N observations is
l(\theta)=\sum_{i=1}^N\log p_{g_i}(x_i;\theta),
where p_k(x_i;\theta)=P(G=k|X=x_i;\theta).

To maximize it, we need to solve the score equations \frac{\partial l(\beta)}{\partial\beta}=0 using the Newton-Raphson algorithm. The resulting procedure is referred to as iteratively reweighted least squares (IRLS); for a detailed discussion, please refer to the textbook.
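For concreteness, here is a hedged two-class sketch of the Newton-Raphson/IRLS iteration, with labels y_i\in\{0,1\} and an intercept prepended inside the function; it uses the adjusted response and weights given in Section 4.4.3 below, and assumes the fitted probabilities stay away from 0 and 1.

```python
import numpy as np

def irls_logistic(X, y, n_iter=25, tol=1e-8):
    """Two-class logistic regression by IRLS (Newton-Raphson).

    X : (N, p) inputs (intercept column prepended here); y : (N,) in {0, 1}.
    """
    N = X.shape[0]
    X1 = np.hstack([np.ones((N, 1)), X])
    beta = np.zeros(X1.shape[1])
    for _ in range(n_iter):
        p_hat = 1.0 / (1.0 + np.exp(-X1 @ beta))       # fitted probabilities
        W = p_hat * (1.0 - p_hat)                      # IRLS weights (diagonal of W)
        z = X1 @ beta + (y - p_hat) / W                # adjusted response
        beta_new = np.linalg.solve(X1.T @ (W[:, None] * X1), X1.T @ (W * z))
        if np.max(np.abs(beta_new - beta)) < tol:
            beta = beta_new
            break
        beta = beta_new
    return beta
```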

Logistic regression models are used mostly as a data analysis and inference tool, where the goal is to understand the role of the input variables in explaining the outcome.

4.4.2 Example: South African Heart Disease

4.4.3 Quadratic Approximations and Inference

The maximum-likelihood parameter estimates \hat\beta satisfy a self-consistency relationship: they are the coefficients of a weighted least squares fit, where the responses are

z_i=x_i^T\hat\beta+\frac{(y_i-\hat p_i)}{\hat p_i(1-\hat p_i)},
and the weights are w_i=\hat p_i(1-\hat p_i), both depending on \hat\beta itself. Apart from providing a convenient algorithm, this connection with least squares has more to offer (a short numerical sketch follows the list):

\bullet The weighted residual sum-of-squares is the familiar Pearson chi-square statistic
\sum_{i=1}^N\frac{(y_i-\hat p_i)^2}{\hat p_i(1-\hat p_i)},
a quadratic approximation to the deviance;

\bullet Asymptotic likelihood theory says that if the model is correct, then \hat\beta is consistent (i.e., it converges to the true \beta).

\bullet A central limit theorem then shows that the distribution of \hat\beta converges to N(\beta,(X'WX)^{-1}). This and other asymptotics can be derived directly from the weighted least squares fit by mimicking normal theory inference.

\bullet Model building can be costly for logistic regression models, because each model fitted requires iteration. Popular shortcuts are the Rao score test, which tests for inclusion of a term, and the Wald test, which can be used to test for exclusion of a term. Neither of these requires iterative fitting; both are based on the maximum-likelihood fit of the current model. It turns out that both of these amount to adding or dropping a term from the weighted least squares fit, using the same weights. Such computations can be done efficiently, without recomputing the entire weighted least squares fit.
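As a small numerical companion to the first and third bullets, the sketch below computes the Pearson statistic and the asymptotic standard errors from the diagonal of (X'WX)^{-1}, reusing the two-class fit from the IRLS sketch above (X1 already contains the intercept column); the function name is illustrative.

```python
import numpy as np

def logistic_inference(X1, y, beta_hat):
    """Asymptotic standard errors and Pearson chi-square for a fitted
    two-class logistic model (X1 already includes the intercept column)."""
    p_hat = 1.0 / (1.0 + np.exp(-X1 @ beta_hat))
    w = p_hat * (1.0 - p_hat)
    cov = np.linalg.inv(X1.T @ (w[:, None] * X1))       # (X' W X)^{-1}
    std_err = np.sqrt(np.diag(cov))                     # asymptotic standard errors
    pearson = np.sum((y - p_hat) ** 2 / w)              # quadratic approx. to deviance
    return std_err, pearson
```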

4.4.4 L_1 Regularized Logistic Regression

The L_1 penalty used in the lasso can be used for variable selection and shrinkage with any linear regression model. For logistic regression, we would maximize a penalized version:

\max_{\beta_0,\beta}\{\sum_{i=1}^N\log p_{g_i}(x_i;\beta_0,\beta)-\lambda\sum_{j=1}^p|\beta_j|\},
where, in the two-class case with y_i\in\{0,1\},
\log p_{g_i}(x_i;\beta_0,\beta)=y_i(\beta_0+\beta^Tx_i)-\log(1+\exp(\beta_0+\beta^Tx_i)).
As with the lasso, we typically do not penalize the intercept term, and we standardize the predictors so that the penalty is meaningful. A solution can be found using nonlinear programming methods.
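One way to sketch such a solver is proximal gradient descent (ISTA-style) for the two-class criterion above, soft-thresholding only the penalized coefficients. The fixed step size and iteration count are illustrative assumptions; in practice specialized solvers are preferred.

```python
import numpy as np

def l1_logistic(X, y, lam, step=0.01, n_iter=2000):
    """Two-class L1-penalized logistic regression by proximal gradient.

    The intercept beta_0 is left unpenalized, as is conventional with the lasso.
    X : (N, p) standardized inputs; y : (N,) labels in {0, 1}.
    """
    N, p = X.shape
    beta0, beta = 0.0, np.zeros(p)
    for _ in range(n_iter):
        eta = beta0 + X @ beta
        p_hat = 1.0 / (1.0 + np.exp(-eta))
        g0 = np.sum(p_hat - y)                 # gradient of the negative log-likelihood
        g = X.T @ (p_hat - y)
        beta0 -= step * g0                     # plain gradient step on the intercept
        beta -= step * g
        beta = np.sign(beta) * np.maximum(np.abs(beta) - step * lam, 0.0)  # soft-threshold
    return beta0, beta
```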

4.4.5 Logistic Regression or LDA?

4.5 Separating Hyperplanes

The main idea of these methods is to find hyperplane classifiers, which construct linear decision boundaries that explicitly try to separate the data into different classes as well as possible.

The equation f(x)=\beta_0+\beta^Tx=0 defines a hyperplane (or affine set) L with the following properties (a short numerical check follows the list):

\bullet \beta^*=\beta/\|\beta\| is the vector normal to the hyperplane L, since for any x_1,x_2 in L, \beta^T(x_1-x_2)=0;

\bullet For any point x_0 in L, \beta^Tx_0=-\beta_0;

\bullet The signed distance of any point x to L is given by \beta^{*T}(x-x_0)=\frac{1}{\|\beta\|}(\beta^Tx+\beta_0)=\frac{1}{\|f'(x)\|}f(x). Hence f(x) is proportional to the signed distance from x to the hyperplane defined by f(x)=0.
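A quick numerical check of the signed-distance property; the function name is illustrative.

```python
import numpy as np

def signed_distance(x, beta, beta0):
    """Signed distance from x to the hyperplane {x : beta0 + beta^T x = 0}."""
    return (beta0 + beta @ x) / np.linalg.norm(beta)
```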

4.5.1 Rosenblatt's Perceptron Learning Algorithm

The perceptron learning algorithm tries to find a separating hyperplane by minimizing the distance of misclassified points to the decision boundary.

The goal is to minimize
D(\beta,\beta_0)=-\sum_{i\in \mathcal{M}}y_i(x_i^T\beta+\beta_0),
where \mathcal M indexes the set of misclassified points. The algorithm in fact uses stochastic gradient descent to minimize the piecewise linear criterion.
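A minimal sketch of the stochastic-gradient updates on misclassified points, with labels y_i\in\{-1,+1\} and learning rate \rho (taken as 1 here); the epoch-based loop and stopping rule are illustrative choices.

```python
import numpy as np

def perceptron(X, y, rho=1.0, n_epochs=100):
    """Rosenblatt's perceptron: update (beta, beta0) on misclassified points only.

    X : (N, p) inputs; y : (N,) labels in {-1, +1}; rho is the learning rate.
    """
    N, p = X.shape
    beta, beta0 = np.zeros(p), 0.0
    for _ in range(n_epochs):
        any_mistake = False
        for i in np.random.permutation(N):
            if y[i] * (X[i] @ beta + beta0) <= 0:      # misclassified (or on the boundary)
                beta += rho * y[i] * X[i]              # gradient step for one point
                beta0 += rho * y[i]
                any_mistake = True
        if not any_mistake:                            # a separating hyperplane was found
            break
    return beta, beta0
```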

There are a number of problems with this algorithm, summarized in Ripley (1996):

\bullet When the data are separable, there are many solutions, and which one is found depends on the starting values.

\bullet The "finite" number of steps can be very large. The smaller the gap, the longer the time to find it.

\bullet When the data are not separable, the algorithm will not converge, and cycles develop. The cycles can be long and therefore hard to detect.

4.5.2 Optimal Separating Hyperplanes

To be continued
