计数模型的dependent variable为正整数,计数模型有很多回归的方法,这里我们只讨论泊松回归以及负二项回归
Poisson Model
Particularly useful for “rare” events,即数量少的事件,例如家庭拥有车辆数的预测。
若数据符合泊松分布,那么事件发生的间隔独立且符合指数分布。这里我就不放公式了。运用泊松回归需要注意的是均值与方差相等, 且均值大于等于0 (μ = σ^2)(μ = e^βo+β1X1)
-
回归结果解释
log(daysabsent) = 2.647 − 0.014LangArts − 0.409Male
如果在语言艺术这门考试中表现得好,那么就会有更少的缺席天数,如果你是男性,你将会有更少的缺席天数。
image.png(1)AgeGroup 45-54: e^1.90 = 6.86: The ratio of crash incidences is 6.86 times greater for 45-54 when compared to those aged 35 and under (baseline)
(2)RegionSouth: e^0.86 = 2.36: The ratio of crash incidences is 2.4 times greater for the South when compared to North Region (baseline) -
回归模型的质量 (Poisson Deviance)用来衡量模型有多接近完美image.png
Negetive Binomial Model
overdispersion 时我们会选择负二项回归, 即 variance is larger than mean (i.e, σ^2 > μ)

在这个模型中我们依旧加入了error term ε 独立分布,E(ε)符合Gamma分布
实际上泊松模型是一个特殊的负二项模型形式,当NB model的dispersion parameter θ = ∞时,负二项模型变为泊松模型。我们在数据分析时可以两个模型都做一下,然后通过对比他们的AIC值或者利用似然比检验来比较哪个模型更好。
library(foreign)
library(spam) #needed for library (fields)
library(fields)
library(MASS)
library(car)
library(ggplot2)
setwd(...)
pdata <- read.csv("poissonreg.csv")
names(pdata)
hist(pdata$daysabs,breaks=40,xlim=c(0,40),col="gray",main="Days Absent in HS",ylab="Num of HS juniors", xlab="Days Absent")
PModel<-glm(daysabs~math+langarts+male,family=poisson, data=pdata)
summary(PModel)
#rerun without math since not significant
PModel2<-glm(daysabs~langarts+male,family=poisson, data=pdata)
summary(PModel2)
pdata$phat<-predict(PModel2, type = "response")
pdata <- pdata[with(pdata, order(langarts, male)), ]
ggplot(pdata, aes(x = langarts, y = phat)) +
geom_point(aes(y = daysabs), alpha = 0.5, position = position_jitter(h = 0.2)) +
geom_line(size = 1) +
labs(x = "Language Arts Scores", y = "Expected days absent")
##dispersion
#you would like this number to be close to 1; if it is larger than one, your data is overdispersed.
#PModel2$df.res <= degrees of freedom of your model
#residuals(PModel2) <= actual-predicted value from PModel2
dp<-sum(residuals(PModel2,type="pearson")^2)/PModel2$df.res
plot(residuals(PModel2))
mean(pdata$daysabs)
var(pdata$daysabs)
sd(pdata$daysabs)
#if greater than one, the data shows signs of overdispersion
dp<-round(dp,2)
library(faraway) #written by Julian Faraway
par(mfrow=c(1,2))
halfnorm(residuals(PModel2))
plot(log(fitted(PModel2)),
log((pdata$daysabs-fitted(PModel2))^2),
xlab=expression(hat(mu)),
ylab=expression((y-hat(mu))^2),
main=bquote("Dispersion"==.(dp)))
abline(0,1)
##alternative method
library(AER)
dispersiontest(PModel2)
##Negative binomial model
nbmod<-glm.nb(formula=daysabs~math+langarts+male,link=log,data=pdata)
summary(nbmod)
##rerun without math since not significant
nbmod2<-glm.nb(formula=daysabs~langarts+male,link=log,data=pdata)
summary(nbmod2)
##likelihood ratio test
library(zoo)
library(lmtest)
lrtest(nbmod2)
lrtest(PModel2)

