R语言学习笔记1-婚外情预测模型

数据介绍

婚外情(出轨）数据即著名的“Fair's Affairs",取自于1969年《今日心理》(Psychology Today) 所做的一个非常有代表性的调查，而Greene（2003）和Fair (1978)都对它进行过分析。该数据从601个参与者身上收集了9个变量，包括一年来：

#affairs       婚外私通的次数
#age           年龄
#yearsmarried  婚龄
#religiousness 宗教信仰程度（5分制，1表示反对，5表示非常信仰）
#education     学历
#occupation    职业
#rating        对婚姻的自我评分 （5分制，1表示非常不幸福，5表示非常幸福）
#类别列：各个类别值下的频数统计
#gender        参与者的性别
#children      是否有小孩

一、导入数据
在R中，AER包携带有数据框Affairs，首先安装AER包并导入数据：

install.packages("AER")
data(Affairs,package="AER")  #进行包中指定的数据加载,Affairs变成全局变量
summary(Affairs)
  affairs          gender         age         yearsmarried    children 
Min.   : 0.000   female:315   Min.   :17.50   Min.   : 0.125   no :171  
1st Qu.: 0.000   male  :286   1st Qu.:27.00   1st Qu.: 4.000   yes:430  
Median : 0.000                Median :32.00   Median : 7.000            
Mean   : 1.456                Mean   :32.49   Mean   : 8.178            
3rd Qu.: 0.000                3rd Qu.:37.00   3rd Qu.:15.000            
Max.   :12.000                Max.   :57.00   Max.   :15.000            
religiousness     education       occupation        rating     
Min.   :1.000   Min.   : 9.00   Min.   :1.000   Min.   :1.000  
1st Qu.:2.000   1st Qu.:14.00   1st Qu.:3.000   1st Qu.:3.000  
Median :3.000   Median :16.00   Median :5.000   Median :4.000  
Mean   :3.116   Mean   :16.17   Mean   :4.195   Mean   :3.932  
3rd Qu.:4.000   3rd Qu.:18.00   3rd Qu.:6.000   3rd Qu.:5.000  
Max.   :5.000   Max.   :20.00   Max.   :7.000   Max.   :5.000

由此可以知道，女性有315名，男性286名，其中72%的人有孩子，样本的年龄中位数为32岁。

基于是否有过婚外情，讲因变量affairs区分成2值类型,未有过的为0

mydata<Affairs
mydata$ynaffair<-ifelse(mydata$affair==0,0,1)
class(mydata$ynaffair) #新列类型numeric
mydata$ynaffair <- factor(mydata$ynaffair,levels=c(0,1),labels=c("No","Yes"))
class(mydata$ynaffair) #新列类型factor
table(mydata$ynaffair) #基于新列，生成频数表
 No Yes 
451 150

二、建立Logistic模型

#glm(formula,family=family(link=function),data=)
#分布函数             默认的连接函数      
#binomial             link="logit"
fit.full <- glm(ynaffair~gender+age+yearsmarried+children+
                    religiousness+education+occupation+rating,
                    data=mydata,family=binomial(),x=T,y=T)

summary(fit.full)
Call:
glm(formula = ynaffair ~ gender + age + yearsmarried + children + 
    religiousness + education + occupation + rating, family = binomial(), 
    data = mydata)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.5713  -0.7499  -0.5690  -0.2539   2.5191  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)    1.37726    0.88776   1.551 0.120807    
gendermale     0.28029    0.23909   1.172 0.241083    
age           -0.04426    0.01825  -2.425 0.015301 *  
yearsmarried   0.09477    0.03221   2.942 0.003262 ** 
childrenyes    0.39767    0.29151   1.364 0.172508    
religiousness -0.32472    0.08975  -3.618 0.000297 ***
education      0.02105    0.05051   0.417 0.676851    
occupation     0.03092    0.07178   0.431 0.666630    
rating        -0.46845    0.09091  -5.153 2.56e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 675.38  on 600  degrees of freedom
Residual deviance: 609.51  on 592  degrees of freedom
AIC: 627.51

Number of Fisher Scoring iterations: 4
#从回归系数的p值可以看到，gendermale性别,
#childrenyes是否有孩子,education学历,occupation职业
#对方差的贡献都不显著，落入拒绝域，接受原假设：回归系数为0

去掉无用的变量，重新拟合模型，检验新模型的拟合效果。

fit.reduced <- glm(ynaffair~age+yearsmarried+religiousness+
                       rating,data=mydata,family=binomial())
summary(fit.reduced)

Call:
glm(formula = ynaffair ~ age + yearsmarried + religiousness + 
    rating, family = binomial(), data = mydata)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.6278  -0.7550  -0.5701  -0.2624   2.3998  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)    1.93083    0.61032   3.164 0.001558 ** 
age           -0.03527    0.01736  -2.032 0.042127 *  
yearsmarried   0.10062    0.02921   3.445 0.000571 ***
religiousness -0.32902    0.08945  -3.678 0.000235 ***
rating        -0.46136    0.08884  -5.193 2.06e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 675.38  on 600  degrees of freedom
Residual deviance: 615.36  on 596  degrees of freedom
AIC: 625.36

Number of Fisher Scoring iterations: 4

#新模型的每个回归系数都非常显著

anova,比较广义线性回归由于这个两模型是嵌套关系，所以我们可以使用anova()函数对它们进行比较。而对于广义线性回归，可用卡方检验进行判断。

anova(fit.reduced,fit.full,test="Chisq")
Analysis of Deviance Table

Model 1: ynaffair ~ age + yearsmarried + religiousness + rating
Model 2: ynaffair ~ gender + age + yearsmarried + children + religiousness + 
    education + occupation + rating
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1       596     615.36                     
2       592     609.51  4   5.8474   0.2108

#Pr(>Chi)=0.2108,未落入拒绝域，表明减少后的模型和之前的完整模型拟合程度一样好

三、解释模型的参数

#解释模型参数
#回归系数的含义是当其他预测变量不变时，一单位预测变量的变化可引起的响应变量对数优势比的变化
coef(fit.reduced)
  (Intercept)           age  yearsmarried religiousness        rating 
   1.93083017   -0.03527112    0.10062274   -0.32902386   -0.46136144 

#指数化
#保持年龄，宗教信仰，和婚姻评定不变，婚龄增加一年，婚外情的优势比将乘以1.106
#假如婚龄增加10年，优势比将乘以1.06^10，10次方，它反映的信息更为重要
#保持婚龄，宗教信仰，和婚姻评定不变，年龄增加一岁，婚外情的优势比将乘以0.965
#因为预测变量不能等于0，截距项在此处没有什么特定含义
exp(coef(fit.reduced))
  (Intercept)           age  yearsmarried religiousness        rating 
    6.8952321     0.9653437     1.1058594     0.7196258     0.6304248 
#在优势比的尺度上得到系数95%的置信区间
exp(confint(fit.reduced))
Waiting for profiling to be done...
                  2.5 %     97.5 %
(Intercept)   2.1255764 23.3506030
age           0.9323342  0.9981470
yearsmarried  1.0448584  1.1718250
religiousness 0.6026782  0.8562807
rating        0.5286586  0.7493370

R中回归结果里面Estimate就是B值，可以通过coef(fit)来获取
OR值是B以e为底的指数值，exp(coef(fit)) 获取
95%CI就是exp(confint(fit))
P值就是summary（fit）最后一列的结果
Wald值其实就是Z值的平方

四、使用函数 nomogram() 构造 Nomogram

> fit.lmreduced <- lrm(ynaffair~age+yearsmarried+religiousness+rating,data=mydata,x=T,y=T)
> nom<-nomogram(fit.lmreduced,fun=plogis,lp=F)
> plot(nom)

诺模图解读：
假设潘金莲，22岁结婚，结婚2年，无信仰，婚姻满意度较坏。然后我们根据每个变量的值来计算潘金莲每个特征的得分：20岁结婚（67.5）+结婚2年（12.5）+无信仰（70）+婚姻满意度较坏（100）=250分。总分250分的出轨发生率大于70%。

使用函数 calibrate() 构造校准曲线对象“cal1”并打印校准曲线。

> call<-calibrate(fit.lmreduced,method='boot',B=100)
> plot(call)

n=601   Mean absolute error=0.011   Mean squared error=0.00019
0.9 Quantile of absolute error=0.021

校准曲线的解释：实际上，校准曲线是实际发生概率与预测的散点图。实际上，校准曲线可视化了 Hosmer-Lemeshow 拟合优度检验的结果，因此除了校准曲线之外，我们还应该查看 Hosmer-Lemeshow 拟合优度检验的结果。预测率和实际发生率越接近 Y = X，Hosmer-Lemeshow 拟合优度检验的 p 值大于 0.05 时，模型的校准效果越好。在这种情况下，校准曲线几乎与 Y = X 线重合，表明模型校准良好。

C-Statistics

C-Index 是一个通用指标，特别是用于评价回归模型的判别能力。C-Index 的范围在 0.5 到 1.0 之间。C-Index =0.5完全不一致，说明模型没有预测效果；C-Index =1.0 完全一致，说明模型预测结果与实际完全一致。一般认为C-Index在0.50-0.70之间为低精度，0.71-0.80为中等精度，0.80以上精度较高，0.9以上精度为极高。C-Index（全称Concordance Index）也常写为Harrell's C-Index、Concordance C、C-statistic等，主要用来反映预测模型的判别能力，最早由生物统计学教授Harrell提出1996 年在范德比尔特大学，看看该模型是否可以做出准确的预测。

> fit.lmreduced
Logistic Regression Model
 
 lrm(formula = ynaffair ~ age + yearsmarried + religiousness + 
     rating, data = mydata, x = T, y = T)
 
                        Model Likelihood      Discrimination    Rank Discrim.    
                              Ratio Test             Indexes          Indexes    
 Obs           601    LR chi2      60.02      R2       0.141    C       0.704    
  No           451    d.f.             4      R2(4,601)0.089    Dxy     0.408    
  Yes          150    Pr(> chi2) <0.0001    R2(4,337.7)0.153    gamma   0.409    
 max |deriv| 3e-07                            Brier    0.168    tau-a   0.153    
 
               Coef    S.E.   Wald Z Pr(>|Z|)
 Intercept      1.9308 0.6103  3.16  0.0016  
 age           -0.0353 0.0174 -2.03  0.0421  
 yearsmarried   0.1006 0.0292  3.44  0.0006  
 religiousness -0.3290 0.0895 -3.68  0.0002  
 rating        -0.4614 0.0888 -5.19  <0.0001

C-Statistics：读取该模型中的Rank Discrim，模型参数中的参数C，即C-Statistics = 0.704。

如果要分析婚外情的数量，需要采用泊松回归。

#泊松回归
#当通过一系列连续型或类别性预测变量来预测计数型结果变量时，采用泊松回归
fit.poinsson <- glm(affairs~gender+age+yearsmarried+children+
                    religiousness+education+occupation+rating,
                    data=mydata,family=poisson())
 
summary(fit.poinsson)
 
Call:
glm(formula = affairs ~ gender + age + yearsmarried + children + 
    religiousness + education + occupation + rating, family = poisson(), 
    data = mydata)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-4.5331  -1.5813  -1.1597  -0.7084   8.3386  

Coefficients:
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)    2.5528720  0.2877313   8.872  < 2e-16 ***
gendermale     0.0577932  0.0816503   0.708   0.4791    
age           -0.0330294  0.0059571  -5.545 2.95e-08 ***
yearsmarried   0.1169683  0.0107798  10.851  < 2e-16 ***
childrenyes   -0.0026631  0.1027267  -0.026   0.9793    
religiousness -0.3547250  0.0309683 -11.454  < 2e-16 ***
education      0.0006042  0.0169084   0.036   0.9715    
occupation     0.0717169  0.0247803   2.894   0.0038 ** 
rating        -0.4105613  0.0279314 -14.699  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 2925.5  on 600  degrees of freedom
Residual deviance: 2359.6  on 592  degrees of freedom
AIC: 2871.5

Number of Fisher Scoring iterations: 7
 

> fit.poinssonreduced <- glm(affairs~age+yearsmarried+
+                     religiousness+occupation+rating,
+                     data=mydata,family=poisson())
> summary(fit.poinssonreduced)

Call:
glm(formula = affairs ~ age + yearsmarried + religiousness + 
    occupation + rating, family = poisson(), data = mydata)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-4.5968  -1.5728  -1.1627  -0.7067   8.3473  

Coefficients:
               Estimate Std. Error z value Pr(>|z|)    
(Intercept)    2.533905   0.196924  12.867  < 2e-16 ***
age           -0.032255   0.005851  -5.512 3.54e-08 ***
yearsmarried   0.115698   0.009908  11.677  < 2e-16 ***
religiousness -0.354037   0.030892 -11.460  < 2e-16 ***
occupation     0.079828   0.019449   4.105 4.05e-05 ***
rating        -0.409443   0.027381 -14.953  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 2925.5  on 600  degrees of freedom
Residual deviance: 2360.1  on 595  degrees of freedom
AIC: 2866.1

Number of Fisher Scoring iterations: 7

 
anova(fit.poinssonreduced,fit.poinsson,test="Chisq")
coef(fit.poinssonreduced)
exp(coef(fit.poinssonreduced))
exp(confint(fit.poinssonreduced)) 

 

#与logistic回归中的指数化参数相似，泊松模型中的指数化参数对响应变量的影响都是成倍增加的，
#而不是线性相加

#过度离势

#泊松分布的方差和均值相等
#当响应变量观测的方差比依据泊松分布预测的方差大时，泊松回归可能发生过度离势

#判断是否过度离势
#如果存在过度离势，在模型中你无法进行解释，那么可能会得到很小的标准误和置信区间
#也就是说，你将会发现并不真实存在的效应
#与logistic回归类似，此处如果残差偏差与残差自由度的比例远远大于1，那么表明存在过度离势

deviance(fit.poinssonreduced)/df.residual(fit.poinssonreduced) #返回  3.966529，远大于1，存在过度离势

#可能产生过度离势的原因
#遗漏了某个重要的预测变量
#可能因为事件相关
#在纵向数据分析中，重复测量的数据由于内在群聚特性可导致过度离势

R语言学习笔记1-婚外情预测模型

数据介绍

推荐阅读更多精彩内容