R语言与统计-4：线性回归分析与模型诊断

R语言与统计-1：t检验与秩和检验
 R语言与统计-2：方差分析
 R语言与统计-3：卡方检验

回归分析概述

在统计学中，回归分析（regression analysis)指的是确定两种或两种以上变量间相互依赖的定量关系的一种统计分析方法。回归分析按照涉及的变量的多少，分为一元回归和多元回归分析；按照因变量的多少，可分为简单回归分析和多重回归分析；按照自变量和因变量之间的关系类型，可分为线性回归分析和非线性回归分析。
在大数据分析中，回归分析是一种预测性的建模技术，它研究的是因变量（目标）和自变量（预测器）之间的关系。
主要的回归技术：1. Linear Regression线性回归；2. Logistic Regression逻辑回归；3. Polynomial Regression多项式回归；4. Stepwise Regression逐步回归；5. Ridge Regression岭回归；6. Lasso Regression套索回归；7.ElasticNet回归；等等。

1. 线性回归分析

在(y~x)中，自变量x不管是哪种类型，只要因变量y是连续变量，就可以做线性回归。
❗️但如果因变量y是分类变量，就不能做线性回归，需要做logistic回归。

构建回归模型

x <- seq(1,5,len=100) #自变量
noise <- rnorm(n=100,mean = 0,sd = 1)
beta0=1 #截距
beta1=2 #斜率
y <- beta0+beta1*x+noise #写一个线性回归模型

首先使用plot(y~x)判断x和y是否呈线性关系

plot(y~x)

x和y总体来讲呈线性关系，可以进行线性回归

lm函数 (linear model)：进行线性回归

连续型变量线性回归方法

model <- lm(y~x)
summary(model)

# Call:
# lm(formula = y ~ x)

# Residuals:
#     Min      1Q  Median      3Q     Max 
# -2.4399 -0.6294  0.1119  0.7016  2.1503 

# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept)  1.57387    0.26688   5.897 5.27e-08 ***
# x            1.78498    0.08292  21.528  < 2e-16 ***
# ---
# Signif. codes:  
# 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

# Residual standard error: 0.9671 on 98 degrees of freedom
# Multiple R-squared:  0.8254,  Adjusted R-squared:  0.8237 
# F-statistic: 463.4 on 1 and 98 DF,  p-value: < 2.2e-16

_{Residuals是残差的基本信息。残差是实际观察值与估计值 (拟合值) 之间的差，残差反应的是数据的离散程度。Coefficients是系数，(Intercept)是截距，1.57387是截距beta0的估计值，1.78498是x的估计值，也就是beta1（我们传入的是2）也就是x平均变化一个单位，y相应的变化1.78498个单位。Std. Error是标准误，x的p值< 2e-16 ***提示x和y的关系是非常显著的。R-squared是决定系数，取值范围是-1到1，越靠近1或-1，模型的可解释度就越大，越靠近于0，模型的可解释度久越小。Adjusted R-squared是调整后的R方值。F-statistics是对整个模型进行的检验，最后得到的p-value: < 2.2e-16是模型的p值，小于0.05，这个模型是显著的。上面的p值< 2e-16是针对x这个变量，而p-value: < 2.2e-16针对的是这个模型。}

分类变量线性回归方法

x <- factor(rep(c(0,1,2),each=20))
y <- c(rnorm(20,0,1),rnorm(20,1,1),rnorm(20,2,1))
model2 <- lm(y~x)
summary(model)

# Call:
# lm(formula = y ~ x)

# Residuals:
#     Min      1Q  Median      3Q     Max 
# -2.4399 -0.6294  0.1119  0.7016  2.1503 

# Coefficients:
#             Estimate Std. Error t value
# (Intercept)  1.57387    0.26688   5.897
# x            1.78498    0.08292  21.528
#             Pr(>|t|)    
# (Intercept) 5.27e-08 ***
# x            < 2e-16 ***
# ---
# Signif. codes:  
# 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

# Residual standard error: 0.9671 on 98 degrees of freedom
# Multiple R-squared:  0.8254,  Adjusted R-squared:  0.8237 
# F-statistic: 463.4 on 1 and 98 DF,  p-value: < 2.2e-16

多重线性回归/多变量线性回归（multivariable or multiple linear regression）：有多个自变量，只有1个因变量

2. 模型诊断

回归诊断主要用于检验关于回归假设是否成立，以及检验模型形式是否错误，否则我们通过最小二乘法求得的回归方程就缺乏理论依据。这些检验主要探究的问题为：
1）残差是否为随机性、是否为正态性、是否不为异方差；
2）高度相关的自变量是否引起了共线性；
3）模型的函数形式是否错误或在模型中是否缺少重要的自变量；
4）样本数据中是否存在异常值。

2.1 非正态性检验

使用shapiro.test()

data(LMdata,package = 'rinds')
model <- lm(y~x,data = LMdata$NonL) #构建模型
#首先找出残差
res1 <- residuals(model)
#查看残差是否符合正态分布
shapiro.test(res1)

#   Shapiro-Wilk normality test

# data:  res1
# W = 0.93524, p-value = 1e-04

##p值远小于0.05，不服从正态分布，说明这个模型是有问题的

2.2 非线性检验

model2 <- lm(y~x,data = LMdata$NonL)
plot(model2)

是非线性的

model2 <- lm(y~x+I(x^2),data = LMdata$NonL) #尝试加一个x^2
summary(model2)

# Call:
# lm(formula = y ~ x + I(x^2), data = LMdata$NonL)

# Residuals:
#      Min       1Q   Median       3Q      Max 
# -2.32348 -0.60702  0.00701  0.56018  2.27346 

# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept)  0.98702    0.62216   1.586    0.116    
# x            0.11085    0.45405   0.244    0.808    
# I(x^2)       1.97966    0.07456  26.552   <2e-16 ***
# ---
# Signif. codes:  
# 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

# Residual standard error: 0.907 on 97 degrees of freedom
# Multiple R-squared:  0.9961,  Adjusted R-squared:  0.996 
# F-statistic: 1.224e+04 on 2 and 97 DF,  p-value: < 2.2e-16

_{可以看到x是没有统计学意义的，而x²的p值是有统计学意义的。也就是说这里的线性关系不是y跟x的线性关系的，是y跟x²的线性关系。所以就可以把x剃掉，构建y和x²的模型。}

update()函数可以用于模型的更新

model3 <- update(model2,y~.-x)
summary(model3)

# Call:
# lm(formula = y ~ I(x^2), data = LMdata$NonL)

# Residuals:
#      Min       1Q   Median       3Q      Max 
# -2.34289 -0.60692  0.01722  0.58186  2.29601 

# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept)  1.13378    0.15963   7.103 1.97e-10 ***
# I(x^2)       1.99760    0.01271 157.195  < 2e-16 ***
# ---
# Signif. codes:  
# 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

# Residual standard error: 0.9026 on 98 degrees of freedom
# Multiple R-squared:  0.996,   Adjusted R-squared:  0.996 
# F-statistic: 2.471e+04 on 1 and 98 DF,  p-value: < 2.2e-16

除了update函数以外，AIC()函数(赤池信息准测)也可以用于模型的选择。AIC值越小，模型越好。

AIC(model,model2,model3)
#        df      AIC
# model   3 478.4558
# model2  4 269.2121
# model3  3 267.2736

#可以看到model3的拟合效果最好

plot(model3$residuals~LMdata$NonL$x)

残差没有规律的散布在0的左右，这样才是一个正确的残差分布图

2.3 异方差（方差不齐）

异方差考虑的是噪声noise对模型的影响
我们假设噪声是一种期望值为0，标准差为常数且服从正态分布的独立随机因素。如果这个假设不成立，模型的效果就会比较差。

model4 <- lm(y~x,data = LMdata$Hetero)
plot(model4$residuals~LMdata$NonL$x)

残差不齐

处理这种情况的方法是加权最小二乘。加权最小二乘是对不同的样本点给予不同的权重（不同样本点上残差的分布不一样，就给予样本点不同的权重）

model5 <- lm(y~x,weights = 1/x^2,data = LMdata$Hetero)  #weights就是加权
summary(model5)

# Call:
# lm(formula = y ~ x, data = LMdata$Hetero, weights = 1/x^2)

# Weighted Residuals:
#      Min       1Q   Median       3Q      Max 
# -2.25132 -0.68409 -0.03924  0.63997  2.61098 

# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept)   1.2611     0.5139   2.454   0.0159 *  
# x             1.8973     0.2317   8.190 9.98e-13 ***
# ---
# Signif. codes:  
# 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

# Residual standard error: 1.025 on 98 degrees of freedom
# Multiple R-squared:  0.4063,  Adjusted R-squared:  0.4003 
# F-statistic: 67.07 on 1 and 98 DF,  p-value: 9.977e-13

需要注意的是，在实际的问题当中，我们往往无法明确误差与x之间的关系，也就是我们不知道weights是什么。这个时候就可以使用迭代加权最小二乘。

library(nlme)
glsmodel <- gls(y~x,weights = varFixed(~x),data = LMdata$Hetero)
summary(glsmodel)
# Generalized least squares fit by REML
#   Model: y ~ x 
#   Data: LMdata$Hetero 
#        AIC      BIC    logLik
#   517.5286 525.2835 -255.7643

# Variance function:
#  Structure: fixed weights
#  Formula: ~x 

# Coefficients:
#                Value Std.Error  t-value p-value
# (Intercept) 1.421324 0.7091780 2.004184  0.0478
# x           1.832543 0.2603653 7.038351  0.0000

#  Correlation: 
#   (Intr)
# x -0.908

# Standardized residuals:
#         Min          Q1         Med          Q3 
# -2.17451000 -0.55322766 -0.03454023  0.51060224 
#         Max 
#  2.93969900 

# Residual standard error: 1.890127 
# Degrees of freedom: 100 total; 98 residual

2.4 自相关

各个观测间需要是绝对独立的，彼此之间没有联系。但在有些情况下，观测值可能是前后相关。表现在噪声当中，后一项的噪声和前一项的噪声之间存在依赖关系，这在经典的回归假设中是不允许的，对最小二乘会造成一定的影响。
自相关的问题很难从图中发现，只能通过检验方法来判断它是否存在残差的自相关。使用lmtest包中的dwtest()

model6 <- lm(y~x,data=LMdata$AC)
library(lmtest)
dwtest(model6)

#   Durbin-Watson test

# data:  model6
# DW = 0.65556, p-value = 2.683e-12
# alternative hypothesis: true autocorrelation is greater than 0

可以看到p值远小于0.05，也就是说模型中的残差存在自相关。默认的最小二乘法将不能使用。为了修正这种情况，我们使用gls函数来实施广义最小二乘

glsmodel2 <- gls(y~x,correlation = corAR1(),data=LMdata$AC)
summary(glsmodel2)
# Generalized least squares fit by REML
#   Model: y ~ x 
#   Data: LMdata$AC 
#        AIC      BIC    logLik
#   268.7617 279.1016 -130.3809

# Correlation Structure: AR(1)
#  Formula: ~1 
#  Parameter estimate(s):
#       Phi 
# 0.7041665 

# Coefficients:
#                Value Std.Error  t-value p-value
# (Intercept) 1.335059 0.7792019 1.713367  0.0898
# x           2.072936 0.2405513 8.617438  0.0000

#  Correlation: 
#   (Intr)
# x -0.926

# Standardized residuals:
#          Min           Q1          Med           Q3 
# -1.667086281 -0.571601899  0.001724114  0.568360354 
#          Max 
#  2.306177988 

# Residual standard error: 1.25329 
# Degrees of freedom: 100 total; 98 residual

2.5 异常值

在回归分析中，有三种异常的样本会对线性回归造成一定的影响。1. 远离回归线的离群点；2. 远离自变量均值的杠杆点；3. 对回归线有重要影响的高影响点。

model7 <- lm(y~x,data=LMdata$Outlier)
plot(y~x,data = LMdata$Outlier)
abline(model7)

识别异常点

library(car)
infl <- influencePlot(model7)

使用update函数剔除异常点

model8 <- update(model7,y~x,subset=-32,data=LMdata$Outlier)
plot(y~x,data = LMdata$Outlier)
abline(model7,col='red')
abline(model8,col='green')

2.6 多重共线性

在变量数目比较多的时候，自变量之间很可能会存在某种线性关系，这种情况就称为多重共线性，会对线性模型造成很大的影响。

model9 <- lm(y~x1+x2+x3,data = LMdata$Mult)
summary(model9)

# Call:
# lm(formula = y ~ x1 + x2 + x3, data = LMdata$Mult)

# Residuals:
#      Min       1Q   Median       3Q      Max 
# -1.29100 -0.26369  0.00141  0.32529  0.91182 

# Coefficients:
#             Estimate Std. Error t value
# (Intercept)   0.3848     0.5813   0.662
# x1            7.2022     4.8552   1.483
# x2          -14.0916    12.1385  -1.161
# x3            8.2312     4.8559   1.695
#             Pr(>|t|)  
# (Intercept)   0.5095  
# x1            0.1412  
# x2            0.2486  
# x3            0.0933 .
# ---
# Signif. codes:  
# 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

# Residual standard error: 0.499 on 96 degrees of freedom
# Multiple R-squared:  0.9984,  Adjusted R-squared:  0.9984 
# F-statistic: 2.057e+04 on 3 and 96 DF,  p-value: < 2.2e-16

_{可以看到x1 x2 x3这三个自变量跟y之间都没有显著的统计学关系，p值都大于0.05，但是R-squared都快接近1了，而且模型是显著的，这时候就要考虑多重共线性。}

使用vif函数计算方差膨胀因子，大于10认为存在共显性。

vif(model9)
#  x1         x2         x3 
#   7560.819 214990.752 222630.742

采用分步回归查看是哪些自变量之间的共显性

model10 <- step(model9)
# Start:  AIC=-135.11
# y ~ x1 + x2 + x3

#        Df Sum of Sq    RSS     AIC
# - x2    1   0.33560 24.241 -135.71
# <none>              23.905 -135.11
# - x1    1   0.54795 24.453 -134.84
# - x3    1   0.71550 24.621 -134.16

# Step:  AIC=-135.71
# y ~ x1 + x3

#        Df Sum of Sq     RSS     AIC
# <none>                 24.2 -135.71
# - x1    1     189.2   213.4   79.81
# - x3    1   15276.4 15300.7  507.05