A Struggling Student's ISLR Notes (3): Linear Regression

Simple Linear Regression

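Simple linear regression assumes that there is approximately a linear relationship between X and Y. Mathematically, we can write this linear relationship as:

$$Y \approx \beta_0 + \beta_1 X$$

Once we have estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ of the coefficients, we predict with

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$$

where $\hat{y}$ indicates a prediction of Y on the basis of X = x. Here we use a hat symbol, ˆ, to denote the estimated value for an unknown parameter or coefficient, or to denote the predicted value of the response.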

Estimating the Coefficients


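Let $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$ be the prediction for Y based on the ith value of X. Then $e_i = y_i - \hat{y}_i$ represents the ith residual: the difference between the ith observed response value and the ith response value that is predicted by our linear model. We define the residual sum of squares (RSS) as:

$$\mathrm{RSS} = e_1^2 + e_2^2 + \cdots + e_n^2 = \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2$$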


The least squares approach chooses $\hat{\beta}_0$ and $\hat{\beta}_1$ to minimize the RSS.

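Using some calculus, one can show that the minimizers are

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$

where $\bar{x}$ and $\bar{y}$ are the sample means of the predictor and the response.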
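
The closed-form least squares solution can be sketched in a few lines of plain Python. This is only an illustration: the function names `fit_simple_ols` and `rss` and the five data points are made up for the example and do not come from the book. The slope is the sum of cross-deviations divided by the sum of squared deviations of x, and the intercept makes the line pass through the point of sample means.

```python
def fit_simple_ols(x, y):
    """Return (b0, b1) minimizing the residual sum of squares for y ~ b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Slope: sum of cross-deviations over sum of squared x-deviations.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    # Intercept: forces the fitted line through (x_bar, y_bar).
    b0 = y_bar - b1 * x_bar
    return b0, b1


def rss(x, y, b0, b1):
    """Residual sum of squares of the line b0 + b1 * x on the data."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


# Made-up data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_simple_ols(x, y)
print("intercept:", b0, "slope:", b1)
print("RSS at the minimizer:", rss(x, y, b0, b1))
print("RSS with a perturbed slope:", rss(x, y, b0, b1 + 0.1))
```

Moving either coefficient away from the returned values should only increase the RSS, which is what "minimizer" means here.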

Assessing the Accuracy of the Coefficient Estimates


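We assume the true relationship is $Y = \beta_0 + \beta_1 X + \epsilon$, where $\epsilon$ is a mean-zero random error term. This model defines the population regression line, which is unobserved; the least squares line is our estimate of it, computed from the observed data.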

At first glance, the difference between the population regression line and the least squares line may seem subtle and confusing. We only have one data set, and so what does it mean that two different lines describe the relationship between the predictor and the response?

The concept of these two lines is a natural extension of the standard statistical approach of using information from a sample to estimate characteristics of a large population.

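For example, suppose that we are interested in knowing the population mean $\mu$ of some random variable Y. Unfortunately, $\mu$ is unknown, but we do have access to n observations from Y, which we can write as $y_1, \ldots, y_n$, and which we can use to estimate $\mu$. A reasonable estimate is the sample mean, $\hat{\mu} = \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$. (This is basic statistics.)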

The analogy between linear regression and estimation of the mean of a random variable is an apt one based on the concept of bias. An unbiased estimator does not systematically over- or under-estimate the true parameter.

If we estimate $\beta_0$ and $\beta_1$ on the basis of a particular data set, then our estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ won't be exactly equal to $\beta_0$ and $\beta_1$. But if we could average the estimates obtained over a huge number of data sets, then the average of these estimates would be spot on! In other words, the least squares estimates are unbiased.
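
A small simulation can make the "average over many data sets" idea concrete. The sketch below is an assumption-laden illustration, not part of the notes: the true coefficients (2 and 3), the noise level, the sample size, and the helper names are all chosen for the example. It repeatedly draws a data set from a known line plus Gaussian noise, fits least squares each time, and averages the estimates.

```python
import random


def fit_simple_ols(x, y):
    """Closed-form least squares fit; returns (intercept, slope)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def average_estimates(beta0, beta1, sigma, n_obs, n_datasets, seed=0):
    """Average the least squares estimates over many simulated data sets."""
    rng = random.Random(seed)
    # Fixed, evenly spaced predictor values on [-2, 2] (an arbitrary choice).
    x = [-2.0 + 4.0 * i / (n_obs - 1) for i in range(n_obs)]
    sum_b0 = sum_b1 = 0.0
    for _ in range(n_datasets):
        # New noise each time: a fresh data set from the same population line.
        y = [beta0 + beta1 * xi + rng.gauss(0.0, sigma) for xi in x]
        b0, b1 = fit_simple_ols(x, y)
        sum_b0 += b0
        sum_b1 += b1
    return sum_b0 / n_datasets, sum_b1 / n_datasets


avg_b0, avg_b1 = average_estimates(
    beta0=2.0, beta1=3.0, sigma=2.0, n_obs=30, n_datasets=2000
)
print("average intercept estimate:", avg_b0)
print("average slope estimate:", avg_b1)
```

Any single fit can miss the true values by a fair amount, but the averages should land very near 2 and 3.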

We continue the analogy with the estimation of the population mean $\mu$ of a random variable Y. A natural question is as follows: how accurate is the sample mean $\hat{\mu}$ as an estimate of $\mu$? We have established that the average of $\hat{\mu}$'s over many data sets will be very close to $\mu$, but that a single estimate $\hat{\mu}$ may be a substantial underestimate or overestimate of $\mu$.

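How far off will that single estimate $\hat{\mu}$ be? In general, we answer this question by computing the standard error of $\hat{\mu}$, written as $\mathrm{SE}(\hat{\mu})$. We have the well-known formula:

$$\mathrm{Var}(\hat{\mu}) = \mathrm{SE}(\hat{\mu})^2 = \frac{\sigma^2}{n}$$

where $\sigma$ is the standard deviation of each of the realizations $y_i$ of Y, assuming the n observations are uncorrelated.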


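In a similar vein, we can wonder how close $\hat{\beta}_0$ and $\hat{\beta}_1$ are to the true values $\beta_0$ and $\beta_1$. To compute the standard errors associated with $\hat{\beta}_0$ and $\hat{\beta}_1$, we use the following formulas:

$$\mathrm{SE}(\hat{\beta}_0)^2 = \sigma^2 \left[ \frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \right], \qquad \mathrm{SE}(\hat{\beta}_1)^2 = \frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

where $\sigma^2 = \mathrm{Var}(\epsilon)$.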
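
The standard errors of the two coefficients can be computed as in the sketch below. Since the error standard deviation is unknown in practice, the code estimates it from the data as the square root of RSS / (n − 2), which is the residual standard error. The function name `simple_ols_standard_errors` and the data are hypothetical, chosen only for illustration.

```python
import math


def simple_ols_standard_errors(x, y):
    """Fit y ~ b0 + b1 * x and return the estimates with their standard errors."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    # Residual sum of squares of the fitted line.
    rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    # Residual standard error: the plug-in estimate of sigma.
    rse = math.sqrt(rss / (n - 2))
    se_b1 = rse / math.sqrt(sxx)
    se_b0 = rse * math.sqrt(1.0 / n + x_bar ** 2 / sxx)
    return {"b0": b0, "b1": b1, "rse": rse, "se_b0": se_b0, "se_b1": se_b1}


# Made-up data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

result = simple_ols_standard_errors(x, y)
for name, value in result.items():
    print(name, "=", value)
```

The standard error of the slope shrinks when the x values are more spread out, because the sum of squared deviations of x sits in the denominator.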


Assessing the Accuracy of the Model

The quality of a linear regression fit is typically assessed using two related quantities: the residual standard error (RSE) and the $R^2$ statistic.

Because the model includes the error term $\epsilon$, even if we knew the true regression line (i.e. even if $\beta_0$ and $\beta_1$ were known), we would not be able to perfectly predict Y from X.

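The RSE is an estimate of the standard deviation of $\epsilon$. Roughly speaking, it is the average amount that the response will deviate from the true regression line. It is computed as:

$$\mathrm{RSE} = \sqrt{\frac{1}{n-2} \mathrm{RSS}} = \sqrt{\frac{1}{n-2} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$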


$R^2$ Statistic

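The RSE provides an absolute measure of lack of fit of the model to the data. But since it is measured in the units of Y, it is not always clear what constitutes a good RSE. The $R^2$ statistic provides an alternative measure of fit: it is a proportion (the proportion of variance explained), so it always lies between 0 and 1 and does not depend on the scale of Y.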

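To calculate $R^2$, we use the formula:

$$R^2 = \frac{\mathrm{TSS} - \mathrm{RSS}}{\mathrm{TSS}} = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}$$

where $\mathrm{TSS} = \sum_{i=1}^{n} (y_i - \bar{y})^2$ is the total sum of squares.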

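The $R^2$ statistic is a measure of the linear relationship between X and Y. Recall that correlation, defined as:

$$\mathrm{Cor}(X, Y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$$

is also a measure of the linear relationship between X and Y.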

This suggests that we might be able to use $r = \mathrm{Cor}(X, Y)$ instead of $R^2$ in order to assess the fit of the linear model. In fact, it can be shown that in the simple linear regression setting, $R^2 = r^2$. In other words, the squared correlation and the $R^2$ statistic are identical.
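
The identity between the squared correlation and the R-squared statistic in simple linear regression is easy to check numerically. The following is a minimal sketch on made-up data; the function name `r_squared_and_correlation` is hypothetical. It computes R-squared as 1 − RSS/TSS from a least squares fit and the sample correlation directly from the deviations.

```python
import math


def r_squared_and_correlation(x, y):
    """Return (R^2 of the simple least squares fit, sample correlation of x and y)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)  # this is the TSS
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    r2 = 1.0 - rss / syy
    r = sxy / math.sqrt(sxx * syy)
    return r2, r


# Made-up data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 2.9, 2.7, 4.8, 4.1, 6.5]

r2, r = r_squared_and_correlation(x, y)
print("R^2:", r2)
print("r^2:", r ** 2)
```

The two printed values should agree up to floating-point rounding. The identity is specific to the one-predictor case; with several predictors there is no single pairwise correlation to square.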


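Multiple Linear Regression

With p distinct predictors, the multiple linear regression model takes the form:

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \epsilon$$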

Estimating the Regression Coefficients

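Given estimates $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p$, we can make predictions using the formula:

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_p x_p$$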


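We choose $\beta_0, \beta_1, \ldots, \beta_p$ to minimize the sum of squared residuals:

$$\mathrm{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \hat{\beta}_2 x_{i2} - \cdots - \hat{\beta}_p x_{ip})^2$$

The values $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p$ that minimize the RSS are the multiple least squares regression coefficient estimates.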



A predictor can appear significant in a simple regression and yet lose its significance in a multiple regression. Consider an absurd example to illustrate the point. Running a regression of shark attacks versus ice cream sales for data collected at a given beach community over a period of time would show a positive relationship, similar to that seen between sales and newspaper in the book's Advertising data. Of course no one (yet) has suggested that ice creams should be banned at beaches to reduce shark attacks. In reality, higher temperatures cause more people to visit the beach, which in turn results in more ice cream sales and more shark attacks. A multiple regression of attacks versus ice cream sales and temperature reveals that, as intuition implies, the ice cream sales predictor is no longer significant after adjusting for temperature.
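
The shark attack story can be imitated with synthetic data. In the sketch below, which is a made-up simulation and not real data, temperature drives both ice cream sales and shark attacks, and ice cream sales have no direct effect on attacks. The helper `ols` and all numeric settings are assumptions for the example. The code compares coefficients only; it does not compute p-values.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Synthetic, standardized quantities: temperature is the common cause.
temperature = rng.normal(size=n)
ice_cream = temperature + 0.5 * rng.normal(size=n)
attacks = temperature + 0.5 * rng.normal(size=n)


def ols(predictors, y):
    """Least squares coefficients for y on the given predictors, intercept first."""
    design = np.column_stack([np.ones(len(y))] + list(predictors))
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


simple = ols([ice_cream], attacks)
multiple = ols([ice_cream, temperature], attacks)

print("simple regression, ice cream slope:", simple[1])
print("multiple regression, ice cream coefficient:", multiple[1])
print("multiple regression, temperature coefficient:", multiple[2])
```

In the simple regression the ice cream slope should be clearly positive, while in the multiple regression it should be close to zero, with temperature absorbing the effect.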

Some Important Questions

When we perform multiple linear regression, we usually are interested in answering a few important questions.

1. Is at least one of the predictors $X_1, X_2, \ldots, X_p$ useful in predicting the response?

$H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0$

$H_a$: at least one $\beta_j$ is non-zero.

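This hypothesis test is performed by computing the F-statistic:

$$F = \frac{(\mathrm{TSS} - \mathrm{RSS}) / p}{\mathrm{RSS} / (n - p - 1)}$$

When there is no relationship between the response and the predictors (i.e. $H_0$ is true), we expect the F-statistic to take a value close to 1. If $H_a$ is true, we expect F to be greater than 1.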
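
As a rough illustration of how the F-statistic behaves, the sketch below computes it as ((TSS − RSS) / p) / (RSS / (n − p − 1)) for two synthetic responses: one that really depends on a predictor and one that is pure noise. The function name `f_statistic` and the simulated data are assumptions made for the example.

```python
import numpy as np


def f_statistic(X, y):
    """F-statistic for H0: all p slopes are zero. X has shape (n, p), no intercept column."""
    n, p = X.shape
    design = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = np.sum((y - design @ coef) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return ((tss - rss) / p) / (rss / (n - p - 1))


rng = np.random.default_rng(1)
n, p = 200, 3
X = rng.normal(size=(n, p))

# Response that depends on the first predictor.
y_signal = 1.0 + 2.0 * X[:, 0] + rng.normal(size=n)
# Response unrelated to any predictor (H0 holds by construction).
y_noise = rng.normal(size=n)

f_signal = f_statistic(X, y_signal)
f_noise = f_statistic(X, y_noise)
print("F with a real relationship:", f_signal)
print("F under the null:", f_noise)
```

Under the null the statistic should be of order 1, while a real relationship pushes it far above 1. How large is "large enough" depends on n and p, which is why a p-value from the F-distribution is normally used.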

2. Do all the predictors help to explain Y , or is only a subset of the predictors useful?

For instance, if p = 2, then we can consider four models: (1) a model containing no variables, (2) a model containing $X_1$ only, (3) a model containing $X_2$ only, and (4) a model containing both $X_1$ and $X_2$.

Unfortunately, there are a total of 2^p models that contain subsets of p variables.

Therefore, unless p is very small, we cannot consider all 2^p models, and instead we need an automated and efficient approach to choose a smaller set of models to consider. There are three classical approaches for this task:

Forward selection. We begin with the null model—a model that contains an intercept but no predictors. We then fit p simple linear regressions and add to the null model the variable that results in the lowest RSS. We then add to that model the variable that results in the lowest RSS for the new two-variable model. This approach is continued until some stopping rule is satisfied.

Backward selection. We start with all variables in the model, and remove the variable with the largest p-value—that is, the variable that is the least statistically significant. The new (p − 1)-variable model is fit, and the variable with the largest p-value is removed. This procedure continues until a stopping rule is reached. For instance, we may stop when all remaining variables have a p-value below some threshold.

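Mixed selection. This is a combination of forward and backward selection. We start with no variables in the model and, as with forward selection, add the variable that provides the best fit, continuing one variable at a time. If at any point the p-value for one of the variables in the model rises above a certain threshold, we remove that variable. We continue these forward and backward steps until all variables in the model have a sufficiently low p-value and all variables outside the model would have a large p-value if added.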
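
Forward selection, the first of the three approaches, can be sketched as a greedy loop. The code below is a simplified illustration under stated assumptions: the stopping rule is simply a maximum number of variables (`max_vars`), the helper names are hypothetical, and the data are synthetic with only columns 3 and 0 truly related to the response.

```python
import numpy as np


def rss_of_fit(X, y, columns):
    """RSS of the least squares fit using an intercept plus the given columns of X."""
    design = np.column_stack([np.ones(len(y))] + [X[:, j] for j in columns])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return float(np.sum((y - design @ coef) ** 2))


def forward_selection(X, y, max_vars):
    """Greedily add the variable giving the lowest RSS until max_vars are chosen."""
    selected = []
    remaining = list(range(X.shape[1]))
    path = []
    while remaining and len(selected) < max_vars:
        # Try each remaining variable and keep the one with the lowest RSS.
        best = min(remaining, key=lambda j: rss_of_fit(X, y, selected + [j]))
        selected.append(best)
        remaining.remove(best)
        path.append((list(selected), rss_of_fit(X, y, selected)))
    return path


rng = np.random.default_rng(3)
X = rng.normal(size=(300, 5))
# Synthetic response: only columns 3 and 0 matter.
y = 3.0 * X[:, 3] + 1.5 * X[:, 0] + rng.normal(size=300)

path = forward_selection(X, y, max_vars=3)
for columns, value in path:
    print("selected columns:", columns, "RSS:", round(value, 1))
```

Each step fits at most p models, so the whole procedure fits on the order of p squared models instead of 2^p. The RSS always decreases as variables are added, so a real stopping rule would need a criterion that penalizes model size or a held-out error estimate.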

3. How well does the model fit the data?

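Two of the most common numerical measures of model fit are the RSE and $R^2$, the fraction of variance explained. In multiple regression the RSE generalizes to $\mathrm{RSE} = \sqrt{\mathrm{RSS} / (n - p - 1)}$.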


4. Given a set of predictor values, what response value should we predict, and how accurate is our prediction?

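Even if we knew f(X), that is, even if we knew the true values for $\beta_0, \beta_1, \ldots, \beta_p$, the response value cannot be predicted perfectly because of the random error $\epsilon$ in the model. We use a prediction interval to quantify the uncertainty in an individual response; a confidence interval quantifies only the uncertainty in the estimated average response f(X), so a prediction interval is always wider than the corresponding confidence interval.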

Other Considerations in the Regression Model


Qualitative Predictors


Predictors with Only Two Levels:


Extensions of the Linear Model

The standard linear regression model makes several highly restrictive assumptions that are often violated in practice. Two of the most important assumptions state that the relationship between the predictors and the response is additive and linear. The additive assumption means that the effect of changes in a predictor $X_j$ on the response Y is independent of the values of the other predictors. The linear assumption states that the change in the response due to a one-unit change in $X_j$ is constant, regardless of the value of $X_j$. In this book, we examine a number of sophisticated methods that relax these two assumptions. Here, we briefly examine some common classical approaches for extending the linear model.

Comparison of Linear Regression with K-Nearest Neighbors

In what setting will a parametric approach such as least squares linear regression outperform a non-parametric approach such as KNN regression? The answer is simple: the parametric approach will outperform the non-parametric approach if the parametric form that has been selected is close to the true form of f.
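
The comparison can be tried on synthetic data where the true f is exactly linear, which is the case most favorable to linear regression. The sketch below is an assumption-based illustration: `true_f`, the noise level, the sample sizes, and the simple one-dimensional `knn_predict` helper are all made up for the example.

```python
import numpy as np


def true_f(x):
    """Assumed true regression function: exactly linear."""
    return 2.0 + 3.0 * x


def knn_predict(x_train, y_train, x_new, k):
    """KNN regression in one dimension: average the responses of the k nearest points."""
    dist = np.abs(x_new[:, None] - x_train[None, :])
    nearest = np.argsort(dist, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)


def linear_predict(x_train, y_train, x_new):
    """Least squares line fitted on the training data, evaluated at x_new."""
    slope, intercept = np.polyfit(x_train, y_train, deg=1)
    return intercept + slope * x_new


rng = np.random.default_rng(2)
x_train = rng.uniform(-1, 1, size=100)
y_train = true_f(x_train) + rng.normal(size=100)
x_test = rng.uniform(-1, 1, size=2000)
y_test = true_f(x_test) + rng.normal(size=2000)

mse_linear = np.mean((y_test - linear_predict(x_train, y_train, x_test)) ** 2)
mse_knn1 = np.mean((y_test - knn_predict(x_train, y_train, x_test, k=1)) ** 2)
mse_knn9 = np.mean((y_test - knn_predict(x_train, y_train, x_test, k=9)) ** 2)

print("test MSE, linear regression:", mse_linear)
print("test MSE, KNN with K=1:", mse_knn1)
print("test MSE, KNN with K=9:", mse_knn9)
```

With a linear truth, the linear fit's test MSE should sit near the noise variance, and KNN with K = 1 is typically much worse because each prediction copies the noise of a single training point. Replacing `true_f` with a strongly non-linear function is a quick way to see the ranking change.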

