Study Notes: The Elements of Statistical Learning (Chapter 2)

2.1 Introduction

The main goal of supervised learning is to predict the outputs (responses) from the inputs (predictors).

2.2 Variable Types and Terminology

Overall, there are mainly two kinds of outputs: quantitative outputs (Y) and qualitative outputs (G).

For Y, some measurements are bigger than others, and measurements that are close in value are close in nature. Methods for predicting quantitative outputs are called regression.

For G, the values lie in a finite set with no explicit ordering among them. They can be coded with dummy variables or other schemes. Methods for predicting qualitative outputs are called classification.

There are also other kinds of outputs, such as ordered categorical outputs.

Given the data
X = (x_1,\cdots,x_N)^T,
which is an N\times p matrix; our task is to predict \hat{Y}\in\mathbb{R} or \hat{G}\in\mathcal{G}.
We have a set of training data: (x_i,y_i),i=1,\cdots,N

2.3 Two Simple Approaches to Prediction: Least Squares and Nearest Neighbors

2.3.1 Linear Models and Least Squares

Given the input X = (X_1,\cdots,X_p)^T, we predict Y via \hat{Y}=X^T\hat{\beta} (with the constant 1 included in X, so the intercept is absorbed into \hat{\beta}).
The function f(X) = X^T\beta is linear in X.

To fit this model, we minimize the residual sum of squares
RSS(\beta)=\sum_{i=1}^N(y_i-x_i^T\beta)^2=(y-X\beta)^T(y-X\beta),
where X is the N\times p matrix of inputs. Setting the derivative with respect to \beta to zero gives the normal equations, and if X^TX is nonsingular,
\hat{\beta}=(X^TX)^{-1}X^Ty.
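A minimal NumPy sketch of this fit (the data and variable names are made up for illustration; the intercept is handled by prepending a column of ones to X):

```python
import numpy as np

# Least squares via the normal equations: beta_hat = (X^T X)^{-1} X^T y.
rng = np.random.default_rng(0)
N, p = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])  # intercept + p inputs
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=N)

# np.linalg.solve is preferred over forming an explicit matrix inverse.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat  # fitted values
print(np.round(beta_hat, 2))
```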

2.3.2 Nearest Neighbor Methods

We find the k observations with x_i closest to x in the input space and average their responses:

\hat{Y}(x)=\frac{1}{k}\sum_{x_i\in N_k(x)}y_i,
where N_k(x) is the neighborhood of x defined by the k closest points x_i in the training sample.
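A minimal sketch of this k-NN average under the Euclidean metric (the helper knn_predict and the toy data are illustrative assumptions):

```python
import numpy as np

def knn_predict(x, X_train, y_train, k):
    """Average the responses of the k training points closest to x (Euclidean metric)."""
    dists = np.linalg.norm(X_train - x, axis=1)
    neighbors = np.argsort(dists)[:k]       # indices of the points in N_k(x)
    return y_train[neighbors].mean()

# Toy usage with made-up data.
rng = np.random.default_rng(1)
X_train = rng.normal(size=(50, 2))
y_train = X_train[:, 0] + rng.normal(scale=0.1, size=50)
print(knn_predict(np.zeros(2), X_train, y_train, k=5))
```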

2.3.3 From Least Squares to Nearest Neighbors

First we consider model complexity. The least squares fit involves p parameters, while k-NN appears to involve only a single parameter, k. However, the effective number of parameters of k-NN is N/k: if the neighborhoods were non-overlapping, there would be N/k neighborhoods and we would fit one parameter (a mean) in each, and N/k is generally bigger than p.

From the bias-variance point of view, linear models have low variance and potentially high bias, while k-NN fits are wiggly and unstable, with high variance and low bias.

Considering where the simulated data came from, we have two possible scenarios:

Scenario 1 : The training data in each class were generated from bivariate Gaussian distributions with uncorrelated components and different means.

Scenario 2 : The training data in each class came from a mixture of 10 low-variance Gaussian distributions, with individual means themselves distributed as Gaussian.

For Scenario 1, a linear decision boundary is the best one can do (see Chapter 4). For Scenario 2, other methods such as k-NN can do better.

Each method has its own situations for which it works best; in particular linear regression is more appropriate for Scenario 1 above, while nearest neighbors are more suitable for Scenario 2. The time has come to expose the oracle!
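A minimal sketch of how Scenario 2 data might be simulated (the class centers, the number of mixture components, and the noise scale below are illustrative assumptions, not the book's exact simulation recipe):

```python
import numpy as np

rng = np.random.default_rng(2)

def scenario2_class(n, center, n_means=10, mean_scale=1.0, noise_scale=0.2):
    """Draw n points from a mixture of n_means low-variance Gaussians whose
    means are themselves Gaussian draws around `center` (Scenario 2 style)."""
    means = center + mean_scale * rng.normal(size=(n_means, 2))
    which = rng.integers(n_means, size=n)            # pick a mixture component per point
    return means[which] + noise_scale * rng.normal(size=(n, 2))

blue = scenario2_class(100, center=np.array([1.0, 0.0]))
orange = scenario2_class(100, center=np.array([0.0, 1.0]))
X = np.vstack([blue, orange])
y = np.array([0] * 100 + [1] * 100)   # class labels
```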

2.4 Statistical Decision Theory

2.4.1 Theory for Quantitative Outputs

Given the joint distribution P(X,Y), to find a function f(X) for predicting Y we define the EPE (expected prediction error) as follows:

EPE(f) = E(L(Y,f(X))),
where L(\cdot,\cdot) here is the loss function. If L(Y,f(X)) = (Y-f(X))^2, we have
EPE(f)=E(Y-f(X))^2=E_XE_{Y|X}([Y-f(X)]^2|X).
Minimizing pointwise, fix X = x and let c := f(x); then compute
\min_c E([Y-c]^2|X=x),
whose solution is c=E(Y|X=x).
Thus the best prediction under squared-error loss is the conditional mean, f(x)=E(Y|X=x), also known as the regression function.
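A quick numerical check of the minimization step: among all constant predictions c, the squared error E(Y-c)^2 is minimized at the mean of Y (the toy distribution below is an arbitrary choice for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.exponential(scale=2.0, size=100_000)   # samples from an arbitrary distribution

grid = np.linspace(0.0, 6.0, 601)
losses = [np.mean((y - c) ** 2) for c in grid]  # empirical E(Y - c)^2 on a grid of c
best_c = grid[int(np.argmin(losses))]
print(best_c, y.mean())   # the minimizer sits at (approximately) the sample mean
```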

2.4.2 Implementation in k-NN and the Linear Model

The nearest-neighbor methods attempt to implement this recipe directly using the training data. At each point x, we might ask for the average of all those y_i's whose input x_i = x. Since there is typically at most one observation at any point x, we settle for
\hat{f}(x)=\mathrm{Ave}(y_i|x_i\in N_k(x)),
where "Ave" denotes average, and N_k(x) is the neighborhood containing the k points in \mathcal{T} closest to x. Two approximations are happening here:

\bullet expectation is approximated by averaging over sample data;

\bullet conditioning at a point is relaxed to conditioning on some region "close" to the target point.

Fact: Under mild regularity conditions on P(X,Y), \hat{f}(x)\to E(Y|X=x) as k, N \to \infty with k/N \to 0. However, as the dimension p increases, these approximations get worse.

For the linear model, we assume the regression function is approximately linear,
f(x)\approx x^T\beta, and plug this into the definition of EPE:
EPE(\beta)=E(Y-X^T\beta)^2.
Differentiating and setting to zero gives the population solution
\beta=[E(XX^T)]^{-1}E(XY).
Note: we have not conditioned on X; rather we have used our knowledge of the functional relationship to pool over values of X. The least squares estimate \hat{\beta} of Section 2.3.1 amounts to replacing these expectations by averages over the training data.

So both k-nearest neighbors and least squares end up approximating conditional expectations by averages. But they differ dramatically in terms of model assumptions:

\bullet Least squares assumes f(x) is well approximated by a globally linear function.

\bullet k-nearest neighbors assumes f(x) is well approximated by a locally constant function.

2.4.3 Bayes Classifier

For a categorical output G, the loss function can be represented by a K\times K matrix L, where K = \mathrm{card}(\mathcal{G}). Here L_{k,\ell} is the price paid for classifying an observation belonging to class \mathcal{G}_k as \mathcal{G}_\ell; L is 0 on the diagonal and non-negative elsewhere.

Most commonly we use the zero-one loss function, which can be represented as

L = J_{K\times K}-I_K,
the all-ones matrix minus the identity: 0 on the diagonal and 1 elsewhere, so every misclassification is charged a single unit.

With P(G,X), we have

EPE=E_X\sum_{k=1}^{K}L(\mathcal{G}_k,\hat{G}(X))P(\mathcal{G}_k|X).
With the 0-1 loss function we can minimize EPE pointwise, which gives
\hat{G}(x)=\mathcal{G}_k\ \text{if}\ P(\mathcal{G}_k|X=x)=\max_{g\in\mathcal{G}}P(g|X=x),
i.e., we classify to the most probable class given X = x. This solution is known as the Bayes classifier.

The error rate of this classifier is called the Bayes rate.
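A minimal sketch of the Bayes classifier when P(G,X) is fully known; the equal priors and the two spherical Gaussian class-conditional densities below are assumptions chosen for illustration:

```python
import numpy as np

priors = np.array([0.5, 0.5])
means = np.array([[0.0, 0.0], [2.0, 2.0]])   # class-conditional means (assumed known)

def gaussian_density(x, mean):
    """Density of N(mean, I) in two dimensions."""
    diff = np.asarray(x) - mean
    return np.exp(-0.5 * diff @ diff) / (2.0 * np.pi)

def bayes_classify(x):
    # Under 0-1 loss, pick the class maximizing the posterior P(G_k | X = x),
    # which is proportional to prior_k times the class-conditional density at x.
    posteriors = np.array([pi * gaussian_density(x, m) for pi, m in zip(priors, means)])
    return int(np.argmax(posteriors))

print(bayes_classify([0.2, -0.1]))  # closer to the first mean -> class 0
print(bayes_classify([1.9, 2.3]))   # closer to the second mean -> class 1
```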

2.5 Local Methods in High Dimensions

This section explains why k-NN gives worse results as the dimension p increases, and why the linear model can be much more stable when its rigid assumptions are satisfied.

Three main points explain why k-NN fails when p is large:

(1) To capture a fraction r of the data for a local average, we must cover more of the range of each input variable as p increases.

For example, with inputs uniformly distributed in a p-dimensional unit hypercube, a sub-cube covering a fraction e of the range of each dimension captures a fraction e^p of the volume. So to capture a fraction r of the data, the expected edge length must be e_p(r)=r^{1/p} in each dimension. As p increases, r^{1/p} increases toward 1, so such "neighborhoods" are no longer local.
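A quick computation of the required edge length e_p(r)=r^{1/p}:

```python
# Edge length of a sub-cube capturing a fraction r of a unit hypercube's volume.
for p in (1, 2, 10):
    for r in (0.01, 0.1):
        print(p, r, round(r ** (1 / p), 2))
# For p = 10: e_10(0.01) ~ 0.63 and e_10(0.1) ~ 0.80, i.e. capturing even 1% of the
# data requires covering most of the range of each input.
```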

(2) The sampling density is proportional to N^{1/p}, so it decreases rapidly as p increases; maintaining a given density requires a sample size that grows exponentially in p.

(3) The distance from a target point to even its nearest training observations grows with p, so nearest-neighbor estimates are no longer local and can be badly biased.

For example, consider N data points uniformly distributed in a p-dimensional unit ball centered at the origin, and suppose we want a nearest-neighbor estimate at the origin. The median distance from the origin to the closest data point is given by
d(p,N)=\left(1-\left(\tfrac{1}{2}\right)^{1/N}\right)^{1/p}.
For N = 500 and p = 10, d(p,N)\approx 0.52, more than halfway to the boundary. Hence most data points are closer to the boundary of the sample space than to any other data point. The intuition is similar to the hypercube example in the first point.
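The formula is easy to evaluate directly:

```python
def median_nearest_distance(p, N):
    """Median distance from the origin to the closest of N points drawn
    uniformly from the unit ball in R^p: (1 - (1/2)**(1/N))**(1/p)."""
    return (1.0 - 0.5 ** (1.0 / N)) ** (1.0 / p)

print(round(median_nearest_distance(10, 500), 2))   # ~ 0.52, as quoted above
print(round(median_nearest_distance(2, 500), 3))    # much smaller in low dimension
```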

Now we discuss why the linear model can be much more robust to increasing p when the linear assumption holds.

The linear assumption: Y=X^T\beta+\varepsilon,\varepsilon\sim N(0,\sigma^2).

Now, we have

EPE(x_0)=\sigma^2+E_{\mathcal{T}}\,x_0^T(X^TX)^{-1}x_0\,\sigma^2
(there is no bias term, since the model is correct). If N is large and \mathcal{T} were selected at random, and assuming E(X)=0 so that X^TX\to N\,\mathrm{Cov}(X), we have
E_{x_0}EPE(x_0)\approx \sigma^2+\sigma^2(p/N).
The expected EPE increases only linearly in p, with slope \sigma^2/N; if N is large and/or \sigma^2 is small, this growth is negligible as p increases.
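Where the p/N factor comes from: assuming x_0 is drawn from the same distribution as the training inputs (so \mathrm{Cov}(x_0)=\mathrm{Cov}(X)) and E(X)=0, a trace identity gives
E_{x_0}\,x_0^T\mathrm{Cov}(X)^{-1}x_0=\mathrm{trace}[\mathrm{Cov}(X)^{-1}\mathrm{Cov}(x_0)]=\mathrm{trace}(I_p)=p,
so the variance term \sigma^2E_{\mathcal{T}}\,x_0^T(X^TX)^{-1}x_0\approx\sigma^2x_0^T\mathrm{Cov}(X)^{-1}x_0/N averages to \sigma^2(p/N) over x_0.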

Overall, by relying on its rigid assumptions, the linear model has no bias at all and negligible variance, while the error of 1-nearest neighbor is substantially larger. However, if the assumptions are wrong, all bets are off and 1-nearest neighbor may dominate.

2.6 Statistical Models, Supervised Learning and Function Approximation

To overcome the dimensionality problems, we anticipate using other classes of models for f(x). Here we discuss a framework for incorporating them into the prediction problem.

2.6.1 A Statistical Model for the Joint Distribution P(X,Y)

Suppose our data arose from a statistical model

Y=f(X)+\varepsilon,
where the random error \varepsilon has E(\varepsilon)=0 and is independent of X.

Note that for this model, f(x)=E(Y|X=x), and in fact the conditional distribution P(Y|X) depends on X only through the conditional mean f(x).

This additive error model is typically not used for qualitative outputs G. In this case, the target function p(X) is the conditional probability P(G|X), which is modeled directly.

2.6.2 Supervised Learning

In this section, the task of supervised learning is briefly introduced. That is to learn f by example through a teacher.

2.6.3 Function Approximation

The function f(x) has domain equal to the p-dimensional input subspace, and is related to the data via a model such as y_i=f(x_i)+\varepsilon_i. For convenience, in this chapter we will assume the domain is \mathbb{R}^p, a p-dimensional Euclidean space, although in general the inputs can be of mixed type.

Note: many of the approximations we will encounter have associated with them a set of parameters \theta that can be modified to suit the data at hand.

For additive error models, RSS seems a reasonable criterion for fitting. However, least squares is not the only criterion used, and in some cases it would not make much sense, although it is generally convenient.

A more general principle for estimation is maximum likelihood estimation.
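For example, if the additive errors are Gaussian, \varepsilon\sim N(0,\sigma^2), the log-likelihood of the training data under the model y_i=f_\theta(x_i)+\varepsilon_i is
L(\theta)=-\frac{N}{2}\log(2\pi\sigma^2)-\frac{1}{2\sigma^2}\sum_{i=1}^N(y_i-f_\theta(x_i))^2,
so, up to terms not involving \theta, maximizing the likelihood is equivalent to minimizing RSS: least squares is the maximum likelihood estimate under additive Gaussian errors.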

2.7 Structured Regression Models

First we discuss the difficulty of the problem. Consider
RSS(f)=\sum_{i=1}^N(y_i-f(x_i))^2,
minimizing it without any constraint leads to infinitely many solutions: any function \hat{f} passing through all the training points is a minimizer, and such solutions are useless as predictors. Thus we must impose some constraints on f when searching for it.

In general the constraints imposed by most learning methods can be described as complexity restrictions of one kind or another. This usually means some kind of regular behavior in small neighborhoods of the input space. That is, for all input points x sufficiently close to each other in some metric, \hat{f} exhibits some special structure such as nearly constant, linear or low-order polynomial behavior. The estimator is then obtained by averaging or polynomial fitting in that neighborhood.

The strength of the constraint is dictated by the neighborhood size. The larger the size of the neighborhood, the stronger the constraint, and the more sensitive the solution is to the particular choice of constraint. For example, local constant fits in infinitesimally small neighborhoods is no constraint at all; local linear fits in very large neighborhoods is almost a globally linear model, and is very restrictive.

The nature of the constraint depends on the metric used. Some methods, such as kernel and local regression and tree-based methods, directly specify the metric and size of the neighborhood. The nearest-neighbor methods discussed so far are based on the assumption that locally the function is constant; close to a target input x_0, the function does not change much, and
so close outputs can be averaged to produce \hat{f}(x_0). Other methods such as splines, neural networks and basis-function methods implicitly define neighborhoods of local behavior.

One fact should be clear by now. Any method that attempts to produce locally varying functions in small isotropic neighborhoods will run into problems in high dimensions --- again the curse of dimensionality.

2.8 Classes of Restricted Estimators

The variety of nonparametric regression techniques or learning methods fall into a number of different classes depending on the nature of the restrictions imposed. Here three broad classes are described.

2.8.1 Roughness Penalty and Bayesian Methods

Here the class of functions is controlled by explicitly penalizing RSS(f) with a roughness penalty
PRSS(f;\lambda)=RSS(f)+\lambda J(f).
The user-selected penalty functional J(f) controls some property of f: it will be large for functions f that vary too rapidly over small regions of input space. For example, the cubic smoothing spline uses J(f)=\int[f''(x)]^2dx, which penalizes curvature. The amount of penalty is dictated by \lambda\geqslant 0. For \lambda=0 no penalty is imposed, and any interpolating function will do.
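As a concrete, minimal instance of this penalized criterion (a sketch, not the book's smoothing-spline example): ridge regression takes f(x)=x^T\beta and penalizes the squared coefficient norm, which has a closed-form solution.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize ||y - X beta||^2 + lam * ||beta||^2; the penalty here is the
    squared coefficient norm, a simple stand-in for a general J(f)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Toy usage: larger lam shrinks the coefficients toward zero.
rng = np.random.default_rng(4)
X = rng.normal(size=(60, 5))
y = X @ np.array([1.0, 0.0, -2.0, 0.0, 0.5]) + rng.normal(scale=0.3, size=60)
for lam in (0.0, 1.0, 100.0):
    print(lam, np.round(ridge_fit(X, y, lam), 2))
```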

Penalty function, or regularization methods, express our prior belief that the type of functions we seek exhibit a certain type of smooth behavior, and indeed can usually be cast in a Bayesian framework.

2.8.2 Kernel Methods and Local Regression

These methods can be thought of as explicitly providing estimates of the regression function or conditional expectation by specifying the nature of the local neighborhood, and of the class of regular functions fitted locally. The local neighborhood is specified by a kernel function K_\lambda (x_0,x), which assigns weights to points x in a region around x_0.

In general, we can define a local regression estimate of f(x_0) as f_{\hat{\theta}}(x_0), where \hat{\theta} minimizes
RSS(f_\theta,x_0)=\sum_{i=1}^NK_\lambda(x_0,x_i)(y_i-f_\theta(x_i))^2,
and f_\theta is some parameterized function, such as a low-order polynomial.

Note: K_\lambda(x_0,x) is larger when x is closer to x_0, so points in the neighborhood of x_0 carry more weight in the fit.
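A minimal sketch of a kernel-weighted fit with a Gaussian kernel, taking f_\theta to be locally constant so that the weighted RSS is minimized by a kernel-weighted average; the kernel form and bandwidth below are illustrative assumptions:

```python
import numpy as np

def gaussian_kernel(x0, x, lam):
    """K_lambda(x0, x): weight points x by their distance to the target x0."""
    return np.exp(-np.sum((x - x0) ** 2, axis=1) / (2.0 * lam ** 2))

def local_constant_fit(x0, X_train, y_train, lam=0.5):
    """Locally constant fit: the weighted RSS is minimized by the weighted average."""
    w = gaussian_kernel(x0, X_train, lam)
    return np.sum(w * y_train) / np.sum(w)

# Toy usage with made-up one-dimensional data.
rng = np.random.default_rng(5)
X_train = rng.uniform(-3, 3, size=(200, 1))
y_train = np.sin(X_train[:, 0]) + rng.normal(scale=0.2, size=200)
print(round(local_constant_fit(np.array([0.0]), X_train, y_train), 3))
```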

2.8.3 Basis Functions and Dictionary Methods

Here the model of function f is a linear expansion of basis functions
f_\theta(x)=\sum_{m=1}^M\theta_mh_m(x),
where each of the h_m is a function of the input x, and the term linear here refers to the action of the parameters \theta. In some cases, the sequence of basis functions is prescribed, such as a basis for polynomials in x of total degree M.
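A minimal sketch of such a prescribed basis expansion for a single input, with h_m(x)=x^{m-1} so that f_\theta is a polynomial, linear in \theta; the data are made up for illustration:

```python
import numpy as np

def polynomial_basis(x, M):
    """Return the N x M matrix whose columns are the basis functions h_1, ..., h_M."""
    return np.column_stack([x ** m for m in range(M)])

rng = np.random.default_rng(6)
x = rng.uniform(-1, 1, size=100)
y = np.cos(2 * x) + rng.normal(scale=0.1, size=100)

H = polynomial_basis(x, M=4)                  # prescribed (not adaptively chosen) basis
theta = np.linalg.solve(H.T @ H, H.T @ y)     # least squares in the expanded space
print(np.round(theta, 2))
```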

The adaptively chosen basis function methods are known as dictionary methods, where one has available a possibly infinite set or dictionary \mathcal{D} of candidate basis functions from which to choose, and models are built up by employing some kind of search mechanism.

2.9 Model Selection and the Bias-Variance Tradeoff

For the three classes described in the last section, we need to determine:

\bullet the multiplier of the penalty term;

\bullet the width of the kernel;

\bullet or the number of basis functions.

The k-NN regression fit \hat{f}_k(x_0) usefully illustrates the competing forces that affect the predictive ability of such approximations. Suppose the data arise from a model Y = f(X)+\varepsilon, with E(\varepsilon)=0 and Var(\varepsilon)=\sigma^2. For simplicity here we assume that the values of x_i in the sample are fixed in advance. The expected prediction error at x_0 can be decomposed:

\begin{aligned} EPE_k(x_0)&=E[(Y-\hat{f}_k(x_0))^2|X=x_0]\\ &=\sigma^2+[Bias^2(\hat{f}_k(x_0))+Var_{\mathcal{T}}(\hat{f}_k(x_0))]\\ &=\sigma^2+\Big[f(x_0)-\frac{1}{k}\sum_{l=1}^kf(x_{(l)})\Big]^2+\frac{\sigma^2}{k}, \end{aligned}
where x_{(l)} denotes the l-th nearest neighbor of x_0 among the (fixed) training inputs, and the expectation over \mathcal{T} is over the random training responses.

The first term \sigma^2 is beyond our control, even if we know the true f(x_0).

The second term is the squared bias; it will likely increase with k if the true function is reasonably smooth, because the neighborhood then includes points x_{(l)} farther from x_0, where f(x_{(l)}) can differ from f(x_0).

The third term is simply the variance of an average, \sigma^2/k, and decreases as the inverse of k.

Thus, as k varies, there is a bias-variance tradeoff.

More generally, as the model complexity of our procedure is increased, the variance tends to increase and the squared bias tends to decrease. The opposite behavior occurs as the model complexity is decreased. For k-nearest neighbors, the model complexity is controlled by k.
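A minimal simulation of this tradeoff at a single target point x_0, under an assumed smooth f and a fixed design: it estimates the squared bias and variance of the k-NN fit over repeated draws of the training responses and compares the variance with \sigma^2/k.

```python
import numpy as np

rng = np.random.default_rng(7)

f = lambda x: np.cos(3 * x)              # an assumed "true" smooth function
sigma = 0.3
x_train = np.linspace(-1, 1, 100)        # fixed design, as assumed in the decomposition
x0 = 0.0
order = np.argsort(np.abs(x_train - x0)) # training points sorted by distance to x0

for k in (1, 13, 51):
    fits = []
    for _ in range(2000):                # repeated draws of the training responses
        y = f(x_train) + sigma * rng.normal(size=x_train.size)
        fits.append(y[order[:k]].mean()) # k-NN fit at x0
    fits = np.array(fits)
    bias2 = (fits.mean() - f(x0)) ** 2   # squared bias at x0 (grows with k)
    var = fits.var()                     # variance of the fit (shrinks like sigma^2/k)
    print(k, round(bias2, 4), round(var, 4), round(sigma ** 2 / k, 4))
```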
