数据介绍
婚外情(出轨)数据即著名的“Fair's Affairs",取自于1969年《今日心理》(Psychology Today) 所做的一个非常有代表性的调查,而Greene(2003)和Fair (1978)都对它进行过分析。该数据从601个参与者身上收集了9个变量,包括一年来:
#affairs 婚外私通的次数
#age 年龄
#yearsmarried 婚龄
#religiousness 宗教信仰程度(5分制,1表示反对,5表示非常信仰)
#education 学历
#occupation 职业
#rating 对婚姻的自我评分 (5分制,1表示非常不幸福,5表示非常幸福)
#类别列:各个类别值下的频数统计
#gender 参与者的性别
#children 是否有小孩
一、导入数据
在R中,AER包携带有数据框Affairs,首先安装AER包并导入数据:
install.packages("AER")
data(Affairs,package="AER") #进行包中指定的数据加载,Affairs变成全局变量
summary(Affairs)
affairs gender age yearsmarried children
Min. : 0.000 female:315 Min. :17.50 Min. : 0.125 no :171
1st Qu.: 0.000 male :286 1st Qu.:27.00 1st Qu.: 4.000 yes:430
Median : 0.000 Median :32.00 Median : 7.000
Mean : 1.456 Mean :32.49 Mean : 8.178
3rd Qu.: 0.000 3rd Qu.:37.00 3rd Qu.:15.000
Max. :12.000 Max. :57.00 Max. :15.000
religiousness education occupation rating
Min. :1.000 Min. : 9.00 Min. :1.000 Min. :1.000
1st Qu.:2.000 1st Qu.:14.00 1st Qu.:3.000 1st Qu.:3.000
Median :3.000 Median :16.00 Median :5.000 Median :4.000
Mean :3.116 Mean :16.17 Mean :4.195 Mean :3.932
3rd Qu.:4.000 3rd Qu.:18.00 3rd Qu.:6.000 3rd Qu.:5.000
Max. :5.000 Max. :20.00 Max. :7.000 Max. :5.000
由此可以知道,女性有315名,男性286名,其中72%的人有孩子,样本的年龄中位数为32岁。
基于是否有过婚外情,讲因变量affairs区分成2值类型,未有过的为0
mydata<Affairs
mydata$ynaffair<-ifelse(mydata$affair==0,0,1)
class(mydata$ynaffair) #新列类型numeric
mydata$ynaffair <- factor(mydata$ynaffair,levels=c(0,1),labels=c("No","Yes"))
class(mydata$ynaffair) #新列类型factor
table(mydata$ynaffair) #基于新列,生成频数表
No Yes
451 150
二、建立Logistic模型
#glm(formula,family=family(link=function),data=)
#分布函数 默认的连接函数
#binomial link="logit"
fit.full <- glm(ynaffair~gender+age+yearsmarried+children+
religiousness+education+occupation+rating,
data=mydata,family=binomial(),x=T,y=T)
summary(fit.full)
Call:
glm(formula = ynaffair ~ gender + age + yearsmarried + children +
religiousness + education + occupation + rating, family = binomial(),
data = mydata)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.5713 -0.7499 -0.5690 -0.2539 2.5191
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.37726 0.88776 1.551 0.120807
gendermale 0.28029 0.23909 1.172 0.241083
age -0.04426 0.01825 -2.425 0.015301 *
yearsmarried 0.09477 0.03221 2.942 0.003262 **
childrenyes 0.39767 0.29151 1.364 0.172508
religiousness -0.32472 0.08975 -3.618 0.000297 ***
education 0.02105 0.05051 0.417 0.676851
occupation 0.03092 0.07178 0.431 0.666630
rating -0.46845 0.09091 -5.153 2.56e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 675.38 on 600 degrees of freedom
Residual deviance: 609.51 on 592 degrees of freedom
AIC: 627.51
Number of Fisher Scoring iterations: 4
#从回归系数的p值可以看到,gendermale性别,
#childrenyes是否有孩子,education学历,occupation职业
#对方差的贡献都不显著,落入拒绝域,接受原假设:回归系数为0
去掉无用的变量,重新拟合模型,检验新模型的拟合效果。
fit.reduced <- glm(ynaffair~age+yearsmarried+religiousness+
rating,data=mydata,family=binomial())
summary(fit.reduced)
Call:
glm(formula = ynaffair ~ age + yearsmarried + religiousness +
rating, family = binomial(), data = mydata)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.6278 -0.7550 -0.5701 -0.2624 2.3998
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.93083 0.61032 3.164 0.001558 **
age -0.03527 0.01736 -2.032 0.042127 *
yearsmarried 0.10062 0.02921 3.445 0.000571 ***
religiousness -0.32902 0.08945 -3.678 0.000235 ***
rating -0.46136 0.08884 -5.193 2.06e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 675.38 on 600 degrees of freedom
Residual deviance: 615.36 on 596 degrees of freedom
AIC: 625.36
Number of Fisher Scoring iterations: 4
#新模型的每个回归系数都非常显著
anova,比较广义线性回归 由于这个两模型是嵌套关系,所以我们可以使用anova()函数对它们进行比较。而对于广义线性回归,可用卡方检验进行判断。
anova(fit.reduced,fit.full,test="Chisq")
Analysis of Deviance Table
Model 1: ynaffair ~ age + yearsmarried + religiousness + rating
Model 2: ynaffair ~ gender + age + yearsmarried + children + religiousness +
education + occupation + rating
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 596 615.36
2 592 609.51 4 5.8474 0.2108
#Pr(>Chi)=0.2108,未落入拒绝域,表明减少后的模型和之前的完整模型拟合程度一样好
三、解释模型的参数
#解释模型参数
#回归系数的含义是当其他预测变量不变时,一单位预测变量的变化可引起的响应变量对数优势比的变化
coef(fit.reduced)
(Intercept) age yearsmarried religiousness rating
1.93083017 -0.03527112 0.10062274 -0.32902386 -0.46136144
#指数化
#保持年龄,宗教信仰,和婚姻评定不变,婚龄增加一年,婚外情的优势比将乘以1.106
#假如婚龄增加10年,优势比将乘以1.06^10,10次方,它反映的信息更为重要
#保持婚龄,宗教信仰,和婚姻评定不变,年龄增加一岁,婚外情的优势比将乘以0.965
#因为预测变量不能等于0,截距项在此处没有什么特定含义
exp(coef(fit.reduced))
(Intercept) age yearsmarried religiousness rating
6.8952321 0.9653437 1.1058594 0.7196258 0.6304248
#在优势比的尺度上得到系数95%的置信区间
exp(confint(fit.reduced))
Waiting for profiling to be done...
2.5 % 97.5 %
(Intercept) 2.1255764 23.3506030
age 0.9323342 0.9981470
yearsmarried 1.0448584 1.1718250
religiousness 0.6026782 0.8562807
rating 0.5286586 0.7493370
R中回归结果里面Estimate就是B值,可以通过coef(fit)来获取
OR值是B以e为底的指数值,exp(coef(fit)) 获取
95%CI就是exp(confint(fit))
P值就是summary(fit)最后一列的结果
Wald值其实就是Z值的平方
四、使用函数 nomogram() 构造 Nomogram
> fit.lmreduced <- lrm(ynaffair~age+yearsmarried+religiousness+rating,data=mydata,x=T,y=T)
> nom<-nomogram(fit.lmreduced,fun=plogis,lp=F)
> plot(nom)
诺模图解读:
假设潘金莲,22岁结婚,结婚2年,无信仰,婚姻满意度较坏。然后我们根据每个变量的值来计算潘金莲每个特征的得分:20岁结婚(67.5)+结婚2年(12.5)+无信仰(70)+婚姻满意度较坏(100)=250分。总分250分的出轨发生率大于70%。
使用函数 calibrate() 构造校准曲线对象“cal1”并打印校准曲线。
> call<-calibrate(fit.lmreduced,method='boot',B=100)
> plot(call)
n=601 Mean absolute error=0.011 Mean squared error=0.00019
0.9 Quantile of absolute error=0.021
校准曲线的解释:实际上,校准曲线是实际发生概率与预测的散点图。实际上,校准曲线可视化了 Hosmer-Lemeshow 拟合优度检验的结果,因此除了校准曲线之外,我们还应该查看 Hosmer-Lemeshow 拟合优度检验的结果。预测率和实际发生率越接近 Y = X,Hosmer-Lemeshow 拟合优度检验的 p 值大于 0.05 时,模型的校准效果越好。在这种情况下,校准曲线几乎与 Y = X 线重合,表明模型校准良好。
C-Statistics
C-Index 是一个通用指标,特别是用于评价回归模型的判别能力 。C-Index 的范围在 0.5 到 1.0 之间。C-Index =0.5完全不一致,说明模型没有预测效果;C-Index =1.0 完全一致,说明模型预测结果与实际完全一致。一般认为C-Index在0.50-0.70之间为低精度,0.71-0.80为中等精度,0.80以上精度较高,0.9以上精度为极高。C-Index(全称Concordance Index)也常写为Harrell's C-Index、Concordance C、C-statistic等,主要用来反映预测模型的判别能力,最早由生物统计学教授Harrell提出1996 年在范德比尔特大学,看看该模型是否可以做出准确的预测。
> fit.lmreduced
Logistic Regression Model
lrm(formula = ynaffair ~ age + yearsmarried + religiousness +
rating, data = mydata, x = T, y = T)
Model Likelihood Discrimination Rank Discrim.
Ratio Test Indexes Indexes
Obs 601 LR chi2 60.02 R2 0.141 C 0.704
No 451 d.f. 4 R2(4,601)0.089 Dxy 0.408
Yes 150 Pr(> chi2) <0.0001 R2(4,337.7)0.153 gamma 0.409
max |deriv| 3e-07 Brier 0.168 tau-a 0.153
Coef S.E. Wald Z Pr(>|Z|)
Intercept 1.9308 0.6103 3.16 0.0016
age -0.0353 0.0174 -2.03 0.0421
yearsmarried 0.1006 0.0292 3.44 0.0006
religiousness -0.3290 0.0895 -3.68 0.0002
rating -0.4614 0.0888 -5.19 <0.0001
C-Statistics:读取该模型中的Rank Discrim,模型参数中的参数C,即C-Statistics = 0.704。
如果要分析婚外情的数量,需要采用泊松回归。
#泊松回归
#当通过一系列连续型或类别性预测变量来预测计数型结果变量时,采用泊松回归
fit.poinsson <- glm(affairs~gender+age+yearsmarried+children+
religiousness+education+occupation+rating,
data=mydata,family=poisson())
summary(fit.poinsson)
Call:
glm(formula = affairs ~ gender + age + yearsmarried + children +
religiousness + education + occupation + rating, family = poisson(),
data = mydata)
Deviance Residuals:
Min 1Q Median 3Q Max
-4.5331 -1.5813 -1.1597 -0.7084 8.3386
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.5528720 0.2877313 8.872 < 2e-16 ***
gendermale 0.0577932 0.0816503 0.708 0.4791
age -0.0330294 0.0059571 -5.545 2.95e-08 ***
yearsmarried 0.1169683 0.0107798 10.851 < 2e-16 ***
childrenyes -0.0026631 0.1027267 -0.026 0.9793
religiousness -0.3547250 0.0309683 -11.454 < 2e-16 ***
education 0.0006042 0.0169084 0.036 0.9715
occupation 0.0717169 0.0247803 2.894 0.0038 **
rating -0.4105613 0.0279314 -14.699 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 2925.5 on 600 degrees of freedom
Residual deviance: 2359.6 on 592 degrees of freedom
AIC: 2871.5
Number of Fisher Scoring iterations: 7
> fit.poinssonreduced <- glm(affairs~age+yearsmarried+
+ religiousness+occupation+rating,
+ data=mydata,family=poisson())
> summary(fit.poinssonreduced)
Call:
glm(formula = affairs ~ age + yearsmarried + religiousness +
occupation + rating, family = poisson(), data = mydata)
Deviance Residuals:
Min 1Q Median 3Q Max
-4.5968 -1.5728 -1.1627 -0.7067 8.3473
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.533905 0.196924 12.867 < 2e-16 ***
age -0.032255 0.005851 -5.512 3.54e-08 ***
yearsmarried 0.115698 0.009908 11.677 < 2e-16 ***
religiousness -0.354037 0.030892 -11.460 < 2e-16 ***
occupation 0.079828 0.019449 4.105 4.05e-05 ***
rating -0.409443 0.027381 -14.953 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 2925.5 on 600 degrees of freedom
Residual deviance: 2360.1 on 595 degrees of freedom
AIC: 2866.1
Number of Fisher Scoring iterations: 7
anova(fit.poinssonreduced,fit.poinsson,test="Chisq")
coef(fit.poinssonreduced)
exp(coef(fit.poinssonreduced))
exp(confint(fit.poinssonreduced))
#与logistic回归中的指数化参数相似,泊松模型中的指数化参数对响应变量的影响都是成倍增加的,
#而不是线性相加
#过度离势
#泊松分布的方差和均值相等
#当响应变量观测的方差比依据泊松分布预测的方差大时,泊松回归可能发生过度离势
#判断是否过度离势
#如果存在过度离势,在模型中你无法进行解释,那么可能会得到很小的标准误和置信区间
#也就是说,你将会发现并不真实存在的效应
#与logistic回归类似,此处如果残差偏差与残差自由度的比例远远大于1,那么表明存在过度离势
deviance(fit.poinssonreduced)/df.residual(fit.poinssonreduced) #返回 3.966529,远大于1,存在过度离势
#可能产生过度离势的原因
#遗漏了某个重要的预测变量
#可能因为事件相关
#在纵向数据分析中,重复测量的数据由于内在群聚特性可导致过度离势