◾Binary Dependent Variables and the Linear Probability Model
◾Probit and Logit Regression
◾Estimation and Inference in the Logit and Probit Models
二进制因变量和线性概率模型
library(AER)
library(stargazer)
data(HMDA)
# 探索数据
head(HMDA)
summary(HMDA)
# 将'deny' 转换为 numeric
#lm()不接受因变量属于class factor
#请注意,as.numeric()会将deny = no变为deny = 1,而deny = yes变为deny = 2
HMDA$deny <- as.numeric(HMDA$deny) - 1
# 建立简单的线性概率模型
denymod1 <- lm(deny ~ pirat, data = HMDA)
denymod1
# 画图
plot(x = HMDA$pirat,
y = HMDA$deny,
main = "Scatterplot Mortgage Application Denial and the Payment-to-Income Ratio",
xlab = "P/I ratio",
ylab = "Deny",
pch = 20,
ylim = c(-0.4, 1.4),
cex.main = 0.8)
# 添加水平虚线和文本
abline(h = 1, lty = 2, col = "darkred") #水平线的y值
abline(h = 0, lty = 2, col = "darkred")
text(2.5, 0.9, cex = 0.8, "Mortgage denied")
text(2.5, -0.1, cex= 0.8, "Mortgage approved")
# 添加估计回归线
abline(denymod1,
lwd = 1.8,
col = "steelblue")
#使用coeftest()来获得两个系数估计的稳健标准误差
#其估计值可以解释为:增加1个百分点会导致拒绝贷款的可能性增加。
# print robust coefficient summary
coeftest(denymod1, vcov. = vcovHC, type = "HC1")
> head(HMDA)
deny pirat hirat lvrat chist mhist phist unemp
1 no 0.221 0.221 0.8000000 5 2 no 3.9
2 no 0.265 0.265 0.9218750 2 2 no 3.2
3 no 0.372 0.248 0.9203980 1 2 no 3.2
4 no 0.320 0.250 0.8604651 1 2 no 4.3
5 no 0.360 0.350 0.6000000 1 1 no 3.2
6 no 0.240 0.170 0.5105263 1 1 no 3.9
selfemp insurance condomin afam single hschool
1 no no no no no yes
2 no no no no yes yes
3 no no no no no yes
4 no no no no no yes
5 no no no no no yes
6 no no no no no yes
> summary(HMDA)
deny pirat hirat
no :2095 Min. :0.0000 Min. :0.0000
yes: 285 1st Qu.:0.2800 1st Qu.:0.2140
Median :0.3300 Median :0.2600
Mean :0.3308 Mean :0.2553
3rd Qu.:0.3700 3rd Qu.:0.2988
Max. :3.0000 Max. :3.0000
lvrat chist mhist phist
Min. :0.0200 1:1353 1: 747 no :2205
1st Qu.:0.6527 2: 441 2:1571 yes: 175
Median :0.7795 3: 126 3: 41
Mean :0.7378 4: 77 4: 21
3rd Qu.:0.8685 5: 182
Max. :1.9500 6: 201
unemp selfemp insurance condomin
Min. : 1.800 no :2103 no :2332 no :1694
1st Qu.: 3.100 yes: 277 yes: 48 yes: 686
Median : 3.200
Mean : 3.774
3rd Qu.: 3.900
Max. :10.600
afam single hschool
no :2041 no :1444 no : 39
yes: 339 yes: 936 yes:2341
> denymod1
Call:
lm(formula = deny ~ pirat, data = HMDA)
Coefficients:
(Intercept) pirat
-0.07991 0.60353
> coeftest(denymod1, vcov. = vcovHC, type = "HC1")
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.079910 0.031967 -2.4998 0.01249 *
pirat 0.603535 0.098483 6.1283 1.036e-09 ***
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
在1%的显著性水平上,P/I ratio的真实系数不为0(pvalue=1.036e-09<0.01)。其估计可解释如下:P/I ratio每增加1个百分点会导致拒绝贷款的可能性增加0.6%。
◾通过一个额外的回归变量扩充简单模型
增加了一个额外的回归变量'black',如果申请人是非裔美国人,black=1,否则=0。
# 重命名变量“ afam”以保持一致性
colnames(HMDA)[colnames(HMDA) == "afam"] <- "black"
# 建模
denymod2 <- lm(deny ~ pirat + black, data = HMDA)
coeftest(denymod2, vcov. = vcovHC)
> coeftest(denymod2, vcov. = vcovHC)
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.090514 0.033430 -2.7076 0.006826 **
pirat 0.559195 0.103671 5.3939 7.575e-08 ***
blackyes 0.177428 0.025055 7.0815 1.871e-12 ***
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
在1%的显著性水平上,'black'系数为正且不为0。其解释是,在保持P/I ratio不变的条件下,黑人会使拒绝抵押贷款申请的可能性增加约17.7%。
概率回归 Probit Regression
# 建立简单的概率回归模型
denyprobit <- glm(deny ~ pirat,
family = binomial(link = "probit"),
data = HMDA)
coeftest(denyprobit, vcov. = vcovHC, type = "HC1") #异方差
# 画图
plot(x = HMDA$pirat,
y = HMDA$deny,
main = "Probit Model of the Probability of Denial, Given P/I Ratio",
xlab = "P/I ratio",
ylab = "Deny",
pch = 20,
ylim = c(-0.4, 1.4),
cex.main = 0.85)
# 添加水平虚线和文本
abline(h = 1, lty = 2, col = "darkred")
abline(h = 0, lty = 2, col = "darkred")
text(2.5, 0.9, cex = 0.8, "Mortgage denied")
text(2.5, -0.1, cex= 0.8, "Mortgage approved")
# 添加估计的回归线
x <- seq(0, 3, 0.01)
y <- predict(denyprobit, list(pirat = x), type = "response")
lines(x, y, lwd = 1.5, col = "steelblue")
# 预测P/I ratio = 0.3, 0.4
predictions <- predict(denyprobit,
newdata = data.frame("pirat" = c(0.3, 0.4)),
type = "response")
# 计算在概率上的差异
# 支付与收入的比率从0.3增加到0.4,预计将使被拒绝的概率增加约6.2%
diff(predictions)
下面使用一个增强的概率模型来估计种族对抵押贷款申请被拒绝概率的影响:
denyprobit2 <- glm(deny ~ pirat + black,
family = binomial(link = "probit"),
data = HMDA)
coeftest(denyprobit2, vcov. = vcovHC, type = "HC1")
# 1. P/I ratio = 0.3预测两种不同假设的申请者
predictions <- predict(denyprobit2,
newdata = data.frame("black" = c("no", "yes"),
"pirat" = c(0.3, 0.3)),
type = "response")
# 2. 计算在概率上的差异
#白人和黑人申请者之间,预计将使被拒绝的概率差异约15.8%。
diff(predictions)
> coeftest(denyprobit2, vcov. = vcovHC, type = "HC1")
z test of coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.258787 0.176608 -12.7898 < 2.2e-16 ***
pirat 2.741779 0.497673 5.5092 3.605e-08 ***
blackyes 0.708155 0.083091 8.5227 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> diff(predictions)
2
0.1578117
逻辑回归Logit Regression
#建立简单逻辑回归
denylogit <- glm(deny ~ pirat,
family = binomial(link = "logit"),
data = HMDA)
coeftest(denylogit, vcov. = vcovHC, type = "HC1")
# 画图
plot(x = HMDA$pirat,
y = HMDA$deny,
main = "Probit and Logit Models Model of the Probability of Denial, Given P/I Ratio",
xlab = "P/I ratio",
ylab = "Deny",
pch = 20,
ylim = c(-0.4, 1.4),
cex.main = 0.9)
# 添加水平线和文本
abline(h = 1, lty = 2, col = "darkred")
abline(h = 0, lty = 2, col = "darkred")
text(2.5, 0.9, cex = 0.8, "Mortgage denied")
text(2.5, -0.1, cex= 0.8, "Mortgage approved")
# 添加概率模型和逻辑模型的估计回归线
x <- seq(0, 3, 0.01)
y_probit <- predict(denyprobit, list(pirat = x), type = "response")
y_logit <- predict(denylogit, list(pirat = x), type = "response")
lines(x, y_probit, lwd = 1.5, col = "steelblue")
lines(x, y_logit, lwd = 1.5, col = "black", lty = 2)
# 添加图例
legend("topleft",
horiz = TRUE,
legend = c("Probit", "Logit"),
col = c("steelblue", "black"),
lty = c(1, 2))
# 建立多元逻辑回归
denylogit2 <- glm(deny ~ pirat + black,
family = binomial(link = "logit"),
data = HMDA)
coeftest(denylogit2, vcov. = vcovHC, type = "HC1")
# 1. 预测P/I ratio = 0.3
predictions <- predict(denylogit2,
newdata = data.frame("black" = c("no", "yes"), "pirat" = c(0.3, 0.3)),
type = "response")
predictions
# 2. 计算在概率上的差距
diff(predictions)
> coeftest(denylogit, vcov. = vcovHC, type = "HC1")
z test of coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.02843 0.35898 -11.2218 < 2.2e-16 ***
pirat 5.88450 1.00015 5.8836 4.014e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> coeftest(denylogit2, vcov. = vcovHC, type = "HC1")
z test of coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.12556 0.34597 -11.9245 < 2.2e-16 ***
pirat 5.37036 0.96376 5.5723 2.514e-08 ***
blackyes 1.27278 0.14616 8.7081 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> predictions
1 2
0.07485143 0.22414592
白人申请人面临的拒绝概率仅为7.5%,而非裔美国人面临的拒绝概率为22.4%,差异为14.9%。
概率和逻辑模型的估计与推理
非线性最小二乘法(NLS):nls()
极大似然法(MLE):未知参数的极大似然估计量就是导致模型最有可能观察到的数据的值。结果证明,MLE比NLS更有效。
实例:应用于波士顿HMDA数据
# 定义低、中和高贷款价值比率 loan-to-value ratio
HMDA$lvrat <- factor(
ifelse(HMDA$lvrat < 0.8, "low",
ifelse(HMDA$lvrat >= 0.8 & HMDA$lvrat <= 0.95, "medium", "high")),
levels = c("low", "medium", "high"))
# 将credit转换为数字
HMDA$mhist <- as.numeric(HMDA$mhist)
HMDA$chist <- as.numeric(HMDA$chist)
# 估计所有6个模型的拒绝概率
#模型1是线性概率模型,模型2是Logit回归,模型3是概率回归
lpm_HMDA <- lm(deny ~ black + pirat + hirat + lvrat + chist + mhist + phist
+ insurance + selfemp, data = HMDA)
logit_HMDA <- glm(deny ~ black + pirat + hirat + lvrat + chist + mhist + phist
+ insurance + selfemp,
family = binomial(link = "logit"),
data = HMDA)
probit_HMDA_1 <- glm(deny ~ black + pirat + hirat + lvrat + chist + mhist + phist
+ insurance + selfemp,
family = binomial(link = "probit"),
data = HMDA)
probit_HMDA_2 <- glm(deny ~ black + pirat + hirat + lvrat + chist + mhist + phist
+ insurance + selfemp + single + hschool + unemp,
family = binomial(link = "probit"),
data = HMDA)
probit_HMDA_3 <- glm(deny ~ black + pirat + hirat + lvrat + chist + mhist
+ phist + insurance + selfemp + single + hschool + unemp + condomin
+ I(mhist==3) + I(mhist==4) + I(chist==3) + I(chist==4) + I(chist==5) + I(chist==6),
family = binomial(link = "probit"),
data = HMDA)
probit_HMDA_4 <- glm(deny ~ black * (pirat + hirat) + lvrat + chist + mhist + phist
+ insurance + selfemp + single + hschool + unemp,
family = binomial(link = "probit"),
data = HMDA)
#将系数估计量的异方差-稳健标准误存储在一个列表
rob_se <- list(sqrt(diag(vcovHC(lpm_HMDA, type = "HC1"))),
sqrt(diag(vcovHC(logit_HMDA, type = "HC1"))),
sqrt(diag(vcovHC(probit_HMDA_1, type = "HC1"))),
sqrt(diag(vcovHC(probit_HMDA_2, type = "HC1"))),
sqrt(diag(vcovHC(probit_HMDA_3, type = "HC1"))),
sqrt(diag(vcovHC(probit_HMDA_4, type = "HC1"))))
stargazer(lpm_HMDA, logit_HMDA, probit_HMDA_1,
probit_HMDA_2, probit_HMDA_3, probit_HMDA_4,
digits = 3,
type = "latex",
header = FALSE,
se = rob_se,
model.numbers = FALSE,
column.labels = c("(1)", "(2)", "(3)", "(4)", "(5)", "(6)"))
◾在表中,模型(1)、(2)和(3)是包括多个财务控制变量的基础模型,他们只在模拟拒绝概率的方式上不同。模型(1)是线性模型,模型(2)是逻辑回归,模型(3)概率回归。
◾在线性模型(1)中,系数有直接的解释。例如,估计消费信用评分‘chist’增加1个单位将使拒绝贷款的概率增加约0.031个百分点。
高loan-to-value比率会降低信用:lvrathigh系数为0.189,因此,在其他条件不变的情况下,高loan-to-value比率客户将比低loan-to-value比率客户被拒绝贷款的概率高19%。
非裔美国人的估计系数为0.084,表明在其他条件不变的情况下,非裔美国人比白人被拒绝贷款的概率高8.4%。
◾(2)和(3)模型提供了类似的证据,表明美国抵押贷款市场存在种族歧视。除住房费用收入比hirat(与0无显著差异)外,所有系数在1%水平上均很显著。
非线性使系数估计的解释比模型(1)更困难。为了陈述作为黑人的影响,需要计算两个只有种族不同的人的估计拒绝概率。估计Logit模型(2)的非裔美国人被拒绝概率的影响约为4%。
# 计算一个平均非裔美国人的回归量
new <- data.frame(
"pirat" = mean(HMDA$pirat),
"hirat" = mean(HMDA$hirat),
"lvrat" = "low",
"chist" = mean(HMDA$chist),
"mhist" = mean(HMDA$mhist),
"phist" = "no",
"insurance" = "no",
"selfemp" = "no",
"black" = c("no", "yes"),
"single" = "no",
"hschool" = "yes",
"unemp" = mean(HMDA$unemp),
"condomin" = "no")
# 计算线性模型的差异 #diff表示后一个数减去前一个数
predictions <- predict(lpm_HMDA, newdata = new)
diff(predictions)
> 2
0.08369674
# 计算logit模型(2)的差异
predictions <- predict(logit_HMDA, newdata = new, type = "response")
diff(predictions)
> 2
0.04042135
# 计算probit模型(3)的差异
predictions <- predict(probit_HMDA_1, newdata = new, type = "response")
diff(predictions)
> 2
0.05049716
# 计算probit模型(4)的差异
predictions <- predict(probit_HMDA_2, newdata = new, type = "response")
diff(predictions)
> 2
0.03978918
# 计算probit模型(5)的差异
predictions <- predict(probit_HMDA_3, newdata = new, type = "response")
diff(predictions)
> 2
0.04972468
# 计算probit模型(6)的差异
predictions <- predict(probit_HMDA_4, newdata = new, type = "response")
diff(predictions)
> 2
0.03955893
模型(2)和(3)对黑人被拒绝概率的影响的估计相似。这些简单的模型由于省略的变量而产生了有偏的估计。
模型(4)到(6)使用的回归变量,包括不同的申请人特征、信用评级指标变量以及交互作用。但大多数相应的系数并不显著,对这些模型非裔美国人对拒绝抵押贷款申请的可能性的影响与(2)和(3)没有太大差异。
使用F检验来测试这些系数在5%水平上是否共同显著:
linearHypothesis(probit_HMDA_4,
test = "F",
c("blackyes:pirat=0", "blackyes:hirat=0"),
vcov = vcovHC, type = "HC1")
>
Linear hypothesis test
Hypothesis:
blackyes:pirat = 0
blackyes:hirat = 0
Model 1: restricted model
Model 2: deny ~ black * (pirat + hirat) + lvrat + chist + mhist + phist +
insurance + selfemp + single + hschool + unemp
Note: Coefficient covariance matrix supplied.
Res.Df Df F Pr(>F)
1 2366
2 2364 2 0.2473 0.7809
由于p值约0.77,因此不能拒绝原假设。
linearHypothesis(probit_HMDA_4,
test = "F",
c("blackyes=0", "blackyes:pirat=0", "blackyes:hirat=0"),
vcov = vcovHC, type = "HC1")
>
Linear hypothesis test
Hypothesis:
blackyes = 0
blackyes:pirat = 0
blackyes:hirat = 0
Model 1: restricted model
Model 2: deny ~ black * (pirat + hirat) + lvrat + chist + mhist + phist +
insurance + selfemp + single + hschool + unemp
Note: Coefficient covariance matrix supplied.
Res.Df Df F Pr(>F)
1 2367
2 2364 3 4.7774 0.002534 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
由于p值约为0.002,可以拒绝完全没有种族歧视的原假设。
结论:
(1)到(6)模型表明,非裔美国人对拒绝抵押贷款申请的可能性有影响,影响估计为正(从4%到5%),在显著性1%水平下不为0。虽然线性模型似乎稍微高估了这种影响,但它仍然可以用作本质上非线性关系的近似。