A College Slacker's ISLR Notes (6) - Linear Model Selection and Regularization

However, the linear model has distinct advantages in terms of inference and, on real-world problems, is often surprisingly competitive in relation to non-linear methods.

Before moving to the non-linear world, we discuss in this chapter some ways in which the simple linear model can be improved, by replacing plain least squares fitting with some alternative fitting procedures.

Why might we want to use another fitting procedure instead of least squares? As we will see, alternative fitting procedures can yield better prediction accuracy and model interpretability.

• Prediction Accuracy: If n ≫ p (that is, if n, the number of observations, is much larger than p, the number of variables), then the least squares estimates tend to also have low variance, and hence will perform well on test observations. However, if n is not much larger than p, then there can be a lot of variability in the least squares fit, resulting in overfitting and consequently poor predictions on future observations not used in model training.

• Model Interpretability: It is often the case that some or many of the variables used in a multiple regression model are not associated with the response. Including such irrelevant variables adds unnecessary complexity, yet least squares is extremely unlikely to yield any coefficient estimates that are exactly zero, so it never removes them. The methods in this chapter can automatically perform feature selection or variable selection, excluding irrelevant variables and yielding a model that is easier to interpret.

In this chapter, we discuss three important classes of methods.

• Subset Selection. This approach involves identifying a subset of the p predictors that we believe to be related to the response. We then fit a model using least squares on the reduced set of variables.

• Shrinkage. This approach involves fitting a model involving all p predictors. However, the estimated coefficients are shrunken towards zero relative to the least squares estimates. This shrinkage (also known as regularization) has the effect of reducing variance. Depending on what type of shrinkage is performed, some of the coefficients may be estimated to be exactly zero. Hence, shrinkage methods can also perform variable selection.

• Dimension Reduction. This approach involves projecting the p predictors into an M-dimensional subspace, where M < p. This is achieved by computing M different linear combinations, or projections, of the variables. Then these M projections are used as predictors to fit a linear regression model by least squares.

Subset Selection

Best Subset Selection
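Best subset selection fits a separate least squares model for every possible combination of the p predictors and then picks the best one. A minimal sketch of that idea, assuming Python with scikit-learn and a small synthetic dataset (the book itself uses R; exhaustive search over all 2^p subsets is only feasible for small p):

```python
# Best subset selection: for each size k, fit every possible model containing
# exactly k of the p predictors and keep the one with the smallest training RSS.
import itertools
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)
n, p = 80, 8                       # exhaustive search is only feasible for small p
X = rng.normal(size=(n, p))
y = 2 * X[:, 1] - X[:, 5] + rng.normal(size=n)

for k in range(1, p + 1):
    best_rss, best_subset = np.inf, None
    for subset in itertools.combinations(range(p), k):
        cols = list(subset)
        model = LinearRegression().fit(X[:, cols], y)
        rss = np.sum((y - model.predict(X[:, cols])) ** 2)
        if rss < best_rss:
            best_rss, best_subset = rss, subset
    print(f"k = {k}: best subset {best_subset}, RSS = {best_rss:.1f}")
# The winners for different k are then compared using cross-validation,
# Cp, AIC, BIC, or adjusted R^2, since training RSS always favors larger k.
```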


Stepwise Selection

For computational reasons, best subset selection cannot be applied with very large p. Best subset selection may also suffer from statistical problems when p is large. The larger the search space, the higher the chance of finding models that look good on the training data, even though they might not have any predictive power on future data. Thus an enormous search space can lead to overfitting and high variance of the coefficient estimates. For both of these reasons, stepwise methods, which explore a far more restricted set of models, are attractive alternatives to best subset selection.
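A minimal sketch of forward stepwise selection, again assuming scikit-learn and synthetic data rather than the book's R code: the greedy loop adds, at each step, the predictor that most improves the training fit.

```python
# Forward stepwise selection: start with the null model and greedily add,
# at each step, the predictor that most improves the training R^2.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.normal(size=(n, p))
y = 3 * X[:, 0] - 2 * X[:, 3] + rng.normal(size=n)   # only two truly relevant predictors

selected, remaining = [], list(range(p))
for _ in range(p):
    scores = []
    for j in remaining:
        model = LinearRegression().fit(X[:, selected + [j]], y)
        scores.append((model.score(X[:, selected + [j]], y), j))
    best_score, best_j = max(scores)
    selected.append(best_j)
    remaining.remove(best_j)
    print(f"step {len(selected)}: added X{best_j}, training R^2 = {best_score:.3f}")
# In practice one would then choose among the p nested models using
# cross-validation or criteria such as Cp, AIC, BIC, or adjusted R^2.
```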


Shrinkage Methods

The subset selection methods described in Section 6.1 involve using least squares to fit a linear model that contains a subset of the predictors. As an alternative, we can fit a model containing all p predictors using a technique that constrains or regularizes the coefficient estimates, or equivalently, that shrinks the coefficient estimates towards zero. It may not be immediately obvious why such a constraint should improve the fit, but it turns out that shrinking the coefficient estimates can significantly reduce their variance. The two best-known techniques for shrinking the regression coefficients towards zero are ridge regression and the lasso.

Ridge Regression

Recall from Chapter 3 that the least squares fitting procedure estimates β0, β1, . . . , βp using the values that minimize:

$$\mathrm{RSS} = \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2.$$
Ridge regression is very similar to least squares, except that the coefficients are estimated by minimizing a slightly different quantity. In particular, the ridge regression coefficient estimates $\hat{\beta}^R$ are the values that minimize:

$$\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda\sum_{j=1}^{p}\beta_j^2 = \mathrm{RSS} + \lambda\sum_{j=1}^{p}\beta_j^2,$$

where λ ≥ 0 is a tuning parameter, to be determined separately.

As with least squares, ridge regression seeks coefficient estimates that fit the data well, by making the RSS small. However, the second term, $\lambda\sum_{j}\beta_j^2$, called a shrinkage penalty, is small when β1, . . . , βp are close to zero, and so it has the effect of shrinking the estimates of βj towards zero. The tuning parameter λ serves to control the relative impact of these two terms on the regression coefficient estimates. When λ = 0, the penalty term has no effect, and ridge regression will produce the least squares estimates. However, as λ → ∞, the impact of the shrinkage penalty grows, and the ridge regression coefficient estimates will approach zero. Unlike least squares, which generates only one set of coefficient estimates, ridge regression will produce a different set of coefficient estimates, $\hat{\beta}^R_\lambda$, for each value of λ.
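A minimal sketch of this behavior, assuming scikit-learn (whose `Ridge` estimator calls λ `alpha`) and standardized synthetic data; the setup is illustrative, not the book's R code.

```python
# Ridge regression over a grid of lambda values: as lambda grows,
# all coefficient estimates shrink towards (but never exactly to) zero.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n, p = 50, 10
X = StandardScaler().fit_transform(rng.normal(size=(n, p)))
y = X @ rng.normal(size=p) + rng.normal(size=n)

for lam in [0.01, 1.0, 10.0, 100.0, 1e4]:
    beta = Ridge(alpha=lam).fit(X, y).coef_
    print(f"lambda={lam:>8}: ||beta||_2 = {np.linalg.norm(beta):.3f}")
# As lambda -> 0 the ridge fit approaches least squares; as lambda grows
# the coefficients shrink towards zero.
```

The predictors are standardized before fitting because the ridge penalty is not scale-equivariant: multiplying a predictor by a constant can change its ridge coefficient estimate substantially.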

Selecting a good value for λ is critical; we defer this discussion to Section 6.2.3, where we use cross-validation.


Why Does Ridge Regression Improve Over Least Squares?

Ridge regression’s advantage over least squares is rooted in the bias-variance trade-off. As λ increases, the flexibility of the ridge regression fit decreases, leading to decreased variance but increased bias.


In particular, when the number of variables p is almost as large as the number of observations n, as in the example in Figure 6.5, the least squares estimates will be extremely variable. And if p > n, then the least squares estimates do not even have a unique solution, whereas ridge regression can still perform well by trading off a small increase in bias for a large decrease in variance. Hence, ridge regression works best in situations where the least squares estimates have high variance. (In short: when n ≈ p or n < p, least squares has very high variance, but ridge regression buys a large reduction in variance at the cost of a small increase in bias.)

The Lasso

Ridge regression does have one obvious disadvantage. Unlike best subset, forward stepwise, and backward stepwise selection, which will generally select models that involve just a subset of the variables, ridge regression will include all p predictors in the final model. The penalty will shrink all of the coefficients towards zero, but it will not set any of them exactly to zero (unless λ = ∞). This may not be a problem for prediction accuracy, but it can create a challenge in model interpretation in settings in which the number of variables p is quite large.

The lasso is a relatively recent alternative to ridge regression that overcomes this disadvantage. The lasso coefficients, $\hat{\beta}^L_\lambda$, minimize the quantity

$$\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda\sum_{j=1}^{p}|\beta_j| = \mathrm{RSS} + \lambda\sum_{j=1}^{p}|\beta_j|.$$
In statistical parlance, the lasso uses an L1 penalty instead of an L2 penalty. As with ridge regression, the lasso shrinks the coefficient estimates towards zero; however, the L1 penalty has the effect of forcing some of the coefficient estimates to be exactly zero when the tuning parameter λ is sufficiently large. Hence, much like best subset selection, the lasso performs variable selection, and the resulting models are generally much easier to interpret.


The Variable Selection Property of the Lasso
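A minimal sketch of this property, assuming scikit-learn's `Lasso` and `Ridge` (both call λ `alpha`) on illustrative synthetic data: for a sufficiently large λ, the lasso estimates many coefficients to be exactly zero, while ridge leaves them all nonzero.

```python
# The L1 penalty of the lasso sets some coefficients exactly to zero,
# whereas the L2 penalty of ridge only shrinks them towards zero.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(2)
n, p = 100, 20
X = rng.normal(size=(n, p))
y = 4 * X[:, 0] - 3 * X[:, 1] + rng.normal(size=n)   # only the first two predictors matter

lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=0.5).fit(X, y)
print("lasso nonzero coefficients:", np.sum(lasso.coef_ != 0))   # typically only a few
print("ridge nonzero coefficients:", np.sum(ridge.coef_ != 0))   # all 20
```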


Selecting the Tuning Parameter
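The book selects λ by cross-validation over a grid of candidate values. A minimal sketch of that idea, assuming scikit-learn's built-in `RidgeCV` and `LassoCV` helpers rather than the R functions used in ISLR:

```python
# Choose lambda by cross-validation: fit over a grid of candidate values
# and keep the one with the smallest cross-validated error.
import numpy as np
from sklearn.linear_model import RidgeCV, LassoCV

rng = np.random.default_rng(3)
n, p = 100, 15
X = rng.normal(size=(n, p))
y = 2 * X[:, 0] - X[:, 4] + rng.normal(size=n)

lambdas = np.logspace(-3, 3, 50)
ridge_cv = RidgeCV(alphas=lambdas, cv=10).fit(X, y)
lasso_cv = LassoCV(alphas=lambdas, cv=10).fit(X, y)
print("best lambda (ridge):", ridge_cv.alpha_)
print("best lambda (lasso):", lasso_cv.alpha_)
```

After the best λ is found, the model is refit on all available observations using that value.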


Dimension Reduction Methods

The methods that we have discussed so far in this chapter have controlled variance in two different ways, either by using a subset of the original variables, or by shrinking their coefficients toward zero. All of these methods are defined using the original predictors, X1, X2, . . . , Xp. We now explore a class of approaches that transform the predictors and then fit a least squares model using the transformed variables. We will refer to these techniques as dimension reduction methods.

The term dimension reduction comes from the fact that this approach reduces the problem of estimating the p + 1 coefficients β0, β1, . . . , βp to the simpler problem of estimating the M + 1 coefficients θ0, θ1, . . . , θM, where M < p. In other words, the dimension of the problem has been reduced from p + 1 to M + 1.
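In symbols, following the book's notation, the M linear combinations of the original predictors are

$$Z_m = \sum_{j=1}^{p} \phi_{jm} X_j, \qquad m = 1, \dots, M,$$

for some constants φ1m, φ2m, . . . , φpm, and the regression model that is then fit by least squares is

$$y_i = \theta_0 + \sum_{m=1}^{M} \theta_m z_{im} + \epsilon_i, \qquad i = 1, \dots, n,$$

so only the M + 1 coefficients θ0, θ1, . . . , θM need to be estimated.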

Principal Components Regression

Principal components analysis (PCA) is a popular approach for deriving a low-dimensional set of features from a large set of variables. PCA is discussed in greater detail as a tool for unsupervised learning in Chapter 10.Here we describe its use as a dimension reduction technique for regression.

PCA is a technique for reducing the dimension of an n × p data matrix X.

There is also another interpretation for PCA: the first principal component vector defines the line that is as close as possible to the data.

The Principal Components Regression Approach
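A minimal sketch of PCR, assuming a scikit-learn pipeline that chains standardization, PCA, and least squares; in practice the number of components M would be chosen by cross-validation, as the book does.

```python
# Principal components regression: replace the p predictors by their
# first M principal components, then fit least squares on those components.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
n, p = 100, 20
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

for M in [1, 2, 5, 10]:
    pcr = make_pipeline(StandardScaler(), PCA(n_components=M), LinearRegression())
    mse = -cross_val_score(pcr, X, y, cv=10, scoring="neg_mean_squared_error").mean()
    print(f"M = {M:2d}: cross-validated MSE = {mse:.2f}")
```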


Partial Least Squares

Because the PCR directions are identified in an unsupervised way (the response Y is not used to help determine them), PCR suffers from a drawback: there is no guarantee that the directions that best explain the predictors will also be the best directions to use for predicting the response.

We now present partial least squares (PLS), a supervised alternative to PCR.
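A minimal sketch, assuming scikit-learn's `PLSRegression` on synthetic data; unlike PCA, the PLS directions are computed using the response y as well as the predictors.

```python
# Partial least squares: like PCR, but the directions are chosen in a
# supervised way, using y to guide the construction of the components.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(5)
n, p = 100, 20
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

pls = PLSRegression(n_components=5).fit(X, y)
y_hat = pls.predict(X).ravel()
print("training MSE with 5 PLS components:", np.mean((y - y_hat) ** 2))
```

As with PCR, the number of components would normally be chosen by cross-validation rather than fixed in advance.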

What Goes Wrong in High Dimensions?

In order to illustrate the need for extra care and specialized techniques for regression and classification when p > n , we begin by examining what can go wrong if we apply a statistical technique not intended for the high dimensional setting. For this purpose, we examine least squares regression. But the same concepts apply to logistic regression, linear discriminant analysis, and other classical statistical approaches.

When the number of features p is as large as, or larger than, the number of observations n , least squares as described in Chapter 3 cannot (or rather, should not) be performed.

The reason is simple: regardless of whether or not there truly is a relationship between the features and the response, least squares will yield a set of coefficient estimates that result in a perfect fit to the data, such that the residuals are zero.
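A minimal illustration of this point on synthetic data, assuming scikit-learn: with p ≥ n and pure noise features, least squares still drives the training residuals to essentially zero, while the test error remains large.

```python
# With p >= n, least squares fits the training data perfectly even when the
# features are pure noise, so training error says nothing about test error.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(6)
n, p = 20, 30                      # more features than observations
X_train = rng.normal(size=(n, p))  # noise features, unrelated to y
y_train = rng.normal(size=n)
X_test = rng.normal(size=(n, p))
y_test = rng.normal(size=n)

ols = LinearRegression().fit(X_train, y_train)
print("training MSE:", np.mean((y_train - ols.predict(X_train)) ** 2))  # essentially 0
print("test MSE:    ", np.mean((y_test - ols.predict(X_test)) ** 2))    # large
```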


When we perform the lasso, ridge regression, or other regression procedures in the high-dimensional setting, we must be quite cautious in the way that we report the results obtained.

In the high-dimensional setting, we can never know exactly which variables (if any) truly are predictive of the outcome, and we can never identify the best coefficients for use in the regression. At most, we can hope to assign large regression coefficients to variables that are correlated with the variables that truly are predictive of the outcome.

It is also important to be particularly careful in reporting errors and measures of model fit in the high-dimensional setting. We have seen that when p > n, it is easy to obtain a useless model that has zero residuals. Therefore, one should never use sum of squared errors, p-values, R² statistics, or other traditional measures of model fit on the training data as evidence of a good model fit in the high-dimensional setting.

It is important to instead report results on an independent test set, or cross-validation errors. The error on an independent test set is a valid measure of model fit, but the MSE on the training set certainly is not.
