Simple Linear Regression
Simple linear regression assumes that there is approximately a linear relationship between X and Y. Mathematically, we can write this linear relationship as
Y ≈ β0 + β1X.
Once training data have been used to produce estimates ˆβ0 and ˆβ1 for the model coefficients, we can predict future responses via
ˆy = ˆβ0 + ˆβ1x,
where ˆy indicates a prediction of Y on the basis of X = x. Here we use a hat symbol, ˆ, to denote the estimated value for an unknown parameter or coefficient, or to denote the predicted value of the response.
Estimating the Coefficients
Let ˆyi = ˆβ0 + ˆβ1xi be the prediction for Y based on the ith value of X. Then ei = yi − ˆyi represents the ith residual: the difference between the ith observed response value and the ith response value predicted by our linear model. We define the residual sum of squares (RSS) as
RSS = e1^2 + e2^2 + · · · + en^2 = Σi (yi − ˆβ0 − ˆβ1xi)^2.
The least squares approach chooses ˆβ0 and ˆβ1 to minimize the RSS.
Using some calculus, one can show that the minimizers are
ˆβ1 = Σi (xi − x̄)(yi − ȳ) / Σi (xi − x̄)^2,
ˆβ0 = ȳ − ˆβ1 x̄,
where x̄ and ȳ are the sample means of the xi and the yi.
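A quick numpy sketch of these closed-form estimates. The data are simulated, and the "true" coefficients 2 and 3 are arbitrary values chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate data from a known line y = 2 + 3x plus noise (illustrative values).
x = rng.uniform(0, 10, size=100)
y = 2.0 + 3.0 * x + rng.normal(scale=2.0, size=100)

x_bar, y_bar = x.mean(), y.mean()

# Least squares estimates: the values that minimize the RSS.
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar

residuals = y - (beta0_hat + beta1_hat * x)
rss = np.sum(residuals ** 2)

print(beta0_hat, beta1_hat, rss)
```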
Assessing the Accuracy of the Coefficient Estimates
The population regression line is unobserved.
At first glance, the difference between the population regression line and the least squares line may seem subtle and confusing. We only have one data set, and so what does it mean that two different lines describe the relationship between the predictor and the response?
The concept of these two lines is a natural extension of the standard statistical approach of using information from a sample to estimate characteristics of a large population.
For example, suppose that we are interested in knowing the population mean μ of some random variable Y. Unfortunately, μ is unknown, but we do have access to n observations from Y, which we can write as y1, . . . , yn, and which we can use to estimate μ. A reasonable estimate is the sample mean ˆμ = ȳ = (1/n) Σi yi. (Basic statistics background.)
The analogy between linear regression and estimation of the mean of a random variable is an apt one based on the concept of bias .
If we estimate β0 and β1 on the basis of a particular data set, then our estimates ˆβ0 and ˆβ1 won't be exactly equal to β0 and β1. But if we could average the estimates obtained over a huge number of data sets, then the average of these estimates would be spot on: like the sample mean, the least squares estimates are unbiased.
We continue the analogy with the estimation of the population mean μ of a random variable Y. A natural question is as follows: how accurate is the sample mean ˆμ as an estimate of μ? We have established that the average of ˆμ's over many data sets will be very close to μ, but that a single estimate ˆμ may be a substantial underestimate or overestimate of μ.
How far off will that single estimate ˆμ be? In general, we answer this question by computing the standard error of ˆμ, written as SE(ˆμ). We have the well-known formula
Var(ˆμ) = SE(ˆμ)^2 = σ^2 / n,
where σ is the standard deviation of each of the realizations yi of Y.
In a similar vein, we can wonder how close ˆβ0 and ˆβ1 are to the true values β0 and β1. To compute the standard errors associated with ˆβ0 and ˆβ1, we use the following formulas:
SE(ˆβ0)^2 = σ^2 [ 1/n + x̄^2 / Σi (xi − x̄)^2 ],
SE(ˆβ1)^2 = σ^2 / Σi (xi − x̄)^2,
where σ^2 = Var(ε). In practice σ is unknown, but it can be estimated from the data by the residual standard error, RSE = sqrt(RSS/(n − 2)).
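A minimal sketch of these standard-error formulas in numpy, again on simulated data; the function name and the simulated coefficients are my own choices, and σ is replaced by its RSE-based estimate as above:

```python
import numpy as np

def coef_standard_errors(x, y):
    """Standard errors of the least squares estimates for y ~ beta0 + beta1*x."""
    n = len(x)
    x_bar = x.mean()
    sxx = np.sum((x - x_bar) ** 2)
    beta1 = np.sum((x - x_bar) * (y - y.mean())) / sxx
    beta0 = y.mean() - beta1 * x_bar
    rss = np.sum((y - beta0 - beta1 * x) ** 2)
    sigma2_hat = rss / (n - 2)          # RSE^2 estimates the error variance sigma^2
    se_beta1 = np.sqrt(sigma2_hat / sxx)
    se_beta0 = np.sqrt(sigma2_hat * (1.0 / n + x_bar ** 2 / sxx))
    return se_beta0, se_beta1

# Example on simulated data.
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 3.0 * x + rng.normal(scale=2.0, size=100)
print(coef_standard_errors(x, y))
```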
Assessing the Accuracy of the Model
The quality of a linear regression fit is typically assessed using two related quantities: the residual standard error (RSE) and the R^2 statistic.
Due to the presence of these error terms, even if we knew the true regression line (i.e. even if β0 and β1 were known), we would not be able to perfectly predict Y from X.
The RSE is an estimate of the standard deviation of the error term ε. Roughly speaking, it is the average amount that the response will deviate from the true regression line. It is computed as RSE = sqrt(RSS/(n − 2)).
R^2 Statistic
The RSE provides an absolute measure of lack of fit of the model to the data. But since it is measured in the units of Y, it is not always clear what constitutes a good RSE. The R^2 statistic provides an alternative measure of fit: it is a proportion (the proportion of variance explained), so it always lies between 0 and 1 and is independent of the scale of Y.
To calculate R^2, we use the formula
R^2 = (TSS − RSS) / TSS = 1 − RSS/TSS,
where TSS = Σi (yi − ȳ)^2 is the total sum of squares.
The R^2 statistic is a measure of the linear relationship between X and Y. Recall that the correlation, defined as
Cor(X, Y) = Σi (xi − x̄)(yi − ȳ) / [ sqrt(Σi (xi − x̄)^2) · sqrt(Σi (yi − ȳ)^2) ],
is also a measure of the linear relationship between X and Y.
This suggests that we might be able to use r = Cor(X, Y) instead of R^2 in order to assess the fit of the linear model. In fact, it can be shown that in the simple linear regression setting, R^2 = r^2. In other words, the squared correlation and the R^2 statistic are identical.
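A short numpy check of these quantities on simulated data (coefficients chosen arbitrarily), computing the RSE and R^2 from their definitions and confirming numerically that R^2 equals the squared sample correlation in the simple setting:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=100)
y = 5.0 - 1.5 * x + rng.normal(scale=3.0, size=100)

# Fit by least squares.
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()
y_hat = beta0 + beta1 * x

rss = np.sum((y - y_hat) ** 2)       # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)    # total sum of squares
n = len(y)

rse = np.sqrt(rss / (n - 2))         # residual standard error
r2 = 1.0 - rss / tss                 # R^2 statistic
r = np.corrcoef(x, y)[0, 1]          # sample correlation Cor(X, Y)

print(rse, r2, r ** 2)               # in simple linear regression, R^2 == r^2
```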
Multiple Linear Regression
Estimating the Regression Coefficients
Given estimates ˆβ0, ˆβ1, . . . , ˆβp, we can make predictions using the formula
ˆy = ˆβ0 + ˆβ1x1 + ˆβ2x2 + · · · + ˆβpxp.
We choose ˆβ0, ˆβ1, . . . , ˆβp to minimize the sum of squared residuals:
RSS = Σi (yi − ˆyi)^2 = Σi (yi − ˆβ0 − ˆβ1xi1 − ˆβ2xi2 − · · · − ˆβpxip)^2.
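A minimal sketch of the multiple least squares fit using numpy's built-in solver (np.linalg.lstsq), on simulated data with invented coefficients; the intercept is handled by adding a column of ones to the design matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 200, 3
X = rng.normal(size=(n, p))
true_beta = np.array([1.0, 2.0, -1.0, 0.5])        # intercept plus three slopes
y = true_beta[0] + X @ true_beta[1:] + rng.normal(scale=1.0, size=n)

# Add a column of ones so the intercept is estimated along with the slopes.
X_design = np.column_stack([np.ones(n), X])

# Least squares: the coefficients that minimize the sum of squared residuals.
beta_hat, _, _, _ = np.linalg.lstsq(X_design, y, rcond=None)

y_hat = X_design @ beta_hat
rss = np.sum((y - y_hat) ** 2)
print(beta_hat, rss)
```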
Consider an absurd example to illustrate the point. Running a regression of shark attacks versus ice cream sales for data collected at a given beach community over a period of time would show a positive relationship, similar to the one seen between sales and newspaper advertising in the Advertising data. Of course no one (yet) has suggested that ice creams should be banned at beaches to reduce shark attacks. In reality, higher temperatures cause more people to visit the beach, which in turn results in more ice cream sales and more shark attacks. A multiple regression of attacks versus ice cream sales and temperature reveals that, as intuition implies, the former predictor is no longer significant after adjusting for temperature.
Some Important Questions
When we perform multiple linear regression, we usually are interested in answering a few important questions.
1. Is at least one of the predictors X1, X2, . . . , Xp useful in predicting the response?
We answer this by testing the null hypothesis
H0: β1 = β2 = · · · = βp = 0
versus the alternative
Ha: at least one βj is non-zero.
This hypothesis test is performed by computing the F-statistic:
F = [ (TSS − RSS)/p ] / [ RSS/(n − p − 1) ].
When there is no relationship between the response and the predictors, we expect the F-statistic to take on a value close to 1; when Ha is true, we expect F to be greater than 1.
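A sketch of this computation in numpy, continuing the simulated-data style above (the data and coefficients are made up; here only two of the three predictors truly matter, but the overall F-test still rejects H0):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 200, 3
X = rng.normal(size=(n, p))
y = 1.0 + X @ np.array([2.0, 0.0, -1.0]) + rng.normal(size=n)

X_design = np.column_stack([np.ones(n), X])
beta_hat, _, _, _ = np.linalg.lstsq(X_design, y, rcond=None)

rss = np.sum((y - X_design @ beta_hat) ** 2)
tss = np.sum((y - y.mean()) ** 2)

# F-statistic for H0: beta1 = ... = betap = 0.
f_stat = ((tss - rss) / p) / (rss / (n - p - 1))
print(f_stat)   # far above 1 here, so H0 is implausible
```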
2. Do all the predictors help to explain Y , or is only a subset of the predictors useful?
For instance, if p = 2, then we can consider four models: (1) a model containing no variables, (2) a model containing X1 only, (3) a model containing X2 only, and (4) a model containing both X1 and X2.
Unfortunately, there are a total of 2^p models that contain subsets of p variables; even a moderate p = 30, for example, gives 2^30 = 1,073,741,824 models.
Therefore, unless p is very small, we cannot consider all 2^p models, and instead we need an automated and efficient approach to choose a smaller set of models to consider. There are three classical approaches for this task:
Forward selection. We begin with the null model (a model that contains an intercept but no predictors). We then fit p simple linear regressions and add to the null model the variable that results in the lowest RSS. We then add to that model the variable that results in the lowest RSS for the new two-variable model. This approach is continued until some stopping rule is satisfied. (A small sketch of this procedure follows the list.)
Backward selection. We start with all variables in the model, and remove the variable with the largest p-value—that is, the variable that is the least statistically significant. The new (p − 1)-variable model is fit, and the variable with the largest p-value is removed. This procedure continues until a stopping rule is reached. For instance, we may stop when all remaining variables have a p-value below some threshold.
Mixed selection. This is a combination of forward and backward selection: we add variables one by one as in forward selection, but if at any point the p-value for one of the variables in the model rises above a certain threshold, we remove that variable. We continue until all variables in the model have sufficiently low p-values.
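The forward selection sketch referenced above, in numpy. This is a simplified version: instead of a p-value or criterion-based stopping rule it simply stops after a fixed number of variables (max_vars), which is my own simplification for illustration:

```python
import numpy as np

def rss_of_fit(X_cols, y):
    """RSS of a least squares fit on the given design columns (intercept included)."""
    design = np.column_stack([np.ones(len(y))] + X_cols)
    beta, _, _, _ = np.linalg.lstsq(design, y, rcond=None)
    return np.sum((y - design @ beta) ** 2)

def forward_selection(X, y, max_vars):
    """Greedy forward selection: repeatedly add the predictor that lowers RSS most."""
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < max_vars:
        scores = {j: rss_of_fit([X[:, k] for k in selected + [j]], y)
                  for j in remaining}
        best = min(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected

# Example: only the first two of five predictors actually affect y.
rng = np.random.default_rng(5)
X = rng.normal(size=(300, 5))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=300)
print(forward_selection(X, y, max_vars=2))   # expected to pick columns 0 and 1
```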
3. How well does the model fit the data?
Two of the most common numerical measures of model fit are the RSE and R^2.
4. Given a set of predictor values, what response value should we predict, and how accurate is our prediction?
Even if we knew f(X) (that is, even if we knew the true values for β0, β1, . . . , βp), the response value could not be predicted perfectly, because of the random error ε in the model. We use a prediction interval to quantify how much Y will vary from ˆY; prediction intervals are always wider than confidence intervals, because they incorporate both the error in the estimate of f(X) (the reducible error) and the uncertainty due to ε (the irreducible error).
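A sketch of the two intervals for simple linear regression, using the textbook formulas for a new point x0 (a value I chose arbitrarily) and a simulated data set; scipy is assumed to be available for the t critical value:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
x = rng.uniform(0, 10, size=60)
y = 1.0 + 0.8 * x + rng.normal(scale=1.5, size=60)

n = len(x)
x_bar = x.mean()
sxx = np.sum((x - x_bar) ** 2)
beta1 = np.sum((x - x_bar) * (y - y.mean())) / sxx
beta0 = y.mean() - beta1 * x_bar
rse = np.sqrt(np.sum((y - beta0 - beta1 * x) ** 2) / (n - 2))

x0 = 5.0                                   # new predictor value (illustrative)
y0_hat = beta0 + beta1 * x0
t_crit = stats.t.ppf(0.975, df=n - 2)      # 95% two-sided critical value

# Confidence interval for the average response f(x0): reducible error only.
se_mean = rse * np.sqrt(1.0 / n + (x0 - x_bar) ** 2 / sxx)
# Prediction interval for an individual response Y at x0: adds the irreducible error.
se_pred = rse * np.sqrt(1.0 + 1.0 / n + (x0 - x_bar) ** 2 / sxx)

print((y0_hat - t_crit * se_mean, y0_hat + t_crit * se_mean))   # narrower
print((y0_hat - t_crit * se_pred, y0_hat + t_crit * se_pred))   # wider
```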
Other Considerations in the Regression Model
Qualitative Predictors
Predictors with Only Two Levels: if a qualitative predictor (factor) has only two levels, we create an indicator or dummy variable that takes on two possible numerical values (for example, 1 and 0) and use it as a predictor in the regression.
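A small sketch of this dummy-variable coding. The variable names echo a student/income/balance setting, but the data and coefficients are simulated for illustration only:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100

# A two-level qualitative predictor ("student" status) plus a quantitative one.
student = rng.choice(["Yes", "No"], size=n)
income = rng.uniform(20, 100, size=n)
balance = 200 + 5 * income + 380 * (student == "Yes") + rng.normal(scale=50, size=n)

# Encode the two-level factor as a 0/1 dummy variable and include it as a predictor.
dummy = (student == "Yes").astype(float)
X = np.column_stack([np.ones(n), income, dummy])
beta_hat, _, _, _ = np.linalg.lstsq(X, balance, rcond=None)

# beta_hat[2] estimates the average difference between the two levels,
# holding income fixed.
print(beta_hat)
```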
Extensions of the Linear Model
However, the standard linear regression model makes several highly restrictive assumptions that are often violated in practice. Two of the most important assumptions state that the relationship between the predictors and the response is additive and linear. The additive assumption means that the effect of changes in a predictor Xj on the response Y is independent of the values of the other predictors. The linear assumption states that the change in the response Y due to a one-unit change in Xj is constant, regardless of the value of Xj. In this book, we examine a number of sophisticated methods that relax these two assumptions. Here, we briefly examine some common classical approaches for extending the linear model.
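One classical way to relax the additive assumption is to include an interaction term (the product of two predictors) as an extra variable. A minimal sketch on simulated non-additive data, with coefficients of my own choosing, comparing the additive fit to the fit with an interaction term:

```python
import numpy as np

rng = np.random.default_rng(8)
n = 300
x1 = rng.uniform(0, 10, size=n)
x2 = rng.uniform(0, 10, size=n)

# The true model is non-additive: the effect of x1 depends on the level of x2.
y = 2.0 + 1.0 * x1 + 0.5 * x2 + 0.8 * x1 * x2 + rng.normal(scale=2.0, size=n)

# Additive model: y ~ x1 + x2.
X_add = np.column_stack([np.ones(n), x1, x2])
# Model with an interaction term: y ~ x1 + x2 + x1:x2.
X_int = np.column_stack([np.ones(n), x1, x2, x1 * x2])

for X in (X_add, X_int):
    beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    print(X.shape[1], rss)   # the interaction model fits this data far better
```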
Comparison of Linear Regression with K-Nearest Neighbors
In what setting will a parametric approach such as least squares linear regression outperform a non-parametric approach such as KNN regression? The answer is simple: the parametric approach will outperform the non-parametric approach if the parametric form that has been selected is close to the true form of f.
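A toy illustration of this point: when the true f is exactly linear, least squares should achieve a lower test error than KNN regression. The setup below (one-dimensional predictor, k = 5, simulated linear f) is my own choice for the sketch:

```python
import numpy as np

def knn_predict(x_train, y_train, x_new, k):
    """KNN regression: average the responses of the k nearest training points."""
    preds = []
    for x0 in x_new:
        idx = np.argsort(np.abs(x_train - x0))[:k]
        preds.append(y_train[idx].mean())
    return np.array(preds)

def f(x):
    # True regression function: exactly linear in this experiment.
    return 1.0 + 2.0 * x

rng = np.random.default_rng(9)
x_train = rng.uniform(-3, 3, size=100)
x_test = rng.uniform(-3, 3, size=200)
y_train = f(x_train) + rng.normal(size=100)
y_test = f(x_test) + rng.normal(size=200)

# Least squares fit on the training data.
x_bar = x_train.mean()
beta1 = (np.sum((x_train - x_bar) * (y_train - y_train.mean()))
         / np.sum((x_train - x_bar) ** 2))
beta0 = y_train.mean() - beta1 * x_bar
mse_ls = np.mean((y_test - (beta0 + beta1 * x_test)) ** 2)

mse_knn = np.mean((y_test - knn_predict(x_train, y_train, x_test, k=5)) ** 2)
print(mse_ls, mse_knn)   # typically mse_ls <= mse_knn when the true f is linear
```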