如果dependent variable有多个类别(大于两类)那么我们就需要用MNL来解决问题,这些类别中并逻辑关联,例如去判断我喜欢的冰淇淋口味是草莓还是巧克力还是抹茶。
假设pij为第i个人属于第j类,那么就会存在:pi1+pi2+····+pin = 1
多元逻辑回归能不能简单的表示为:In [ pi1/(1-pi1)] = βX呢?当然不行,会造成更大的标准误差。正确的公式如下:
在MNL中,会利用newton-raphson算法进行最大似然估计,接下来举一个例子
预测一辆车会选择4条车道的主干道(Va), 还是两车道的高速路(Vt),还是四条车道的高速路(Vf)
回归后得到公式:
Va = −0.942(Dista)
Vt = 1.65−1.135(Distt)+0.128(VehAge)
Vf = −3.20 − 0.69(Distf ) + 0.233(VehAge) + 0.766(Male)
- 解释
(1)从这三个式子中可以看出,Va是没有截距的,因为Va成为了baseline
(2)当其余变量都想同时,通过对比三条道路的截距我们可以得出两车道高速路比四车道主干道更有可能被选择,四车道高速路比起四车道主干道更小概率的被选择。
需要指出的是,在逻辑回归中,如果置信区间包含1,那么需要考虑显著性的准确程度。而在线性回归中这个值为0.
最后加上代码,
library(tidyverse)
library(ggplot2)
library(nnet)
setwd("~/Desktop/analysis ppt/hw6")
SV_data <- read.csv("Class_Survey_W20.csv", header = TRUE)
Varname <- colnames(SV_data)
#dependent variable, predict which traffic mode are more likelt to be choosen(bus, walk, drive).
SV_data <- SV_data %>% rename(Method = Varname[1])
#independent variable, categorical preditor, male or female.
SV_data <- SV_data %>% rename(Sex = Varname[42])
#independent variable, categorical preditor, traffic influence the time of people arriving school or not.
SV_data <- SV_data %>% rename(TraffInf = Varname[38])
#independent variable, countinous preditor, time from home to school.
SV_data <- SV_data %>% rename(TTime = Varname[2])
SV_data <- SV_data %>% filter(Method != " ", Sex != " ", TraffInf !=" ", TTime != " ")
#recoding variables
MNL_data <- SV_data %>% filter(Method == "Bus"|Method == "Drive"|Method == "Walk")
MNL_data <- MNL_data %>% filter(Sex == "Female"|Sex == "Male")
MNL_data <- MNL_data %>% mutate (TraffInf = ifelse(TraffInf == "Never","No","Yes"))
MNL_data$Sex <- factor(MNL_data$Sex)
MNL_data$Method <- factor(MNL_data$Method)
#check whether the sex only includes female and male
levels(MNL_data$Sex)
#calculate the frequency of dependent variable
as.data.frame(table(MNL_data$Method))
#use nnet build multinominal model
MNL_data$Method <- relevel(MNL_data$Method, ref = "Bus")
MNLogit <- multinom(Method ~ Sex + TraffInf + TTime, data = MNL_data)
summary(MNLogit)
#z-test
z <- summary(MNLogit)$coefficients/summary(MNLogit)$standard.errors
z
# 2-tailed Wald z tests to test significance of coefficients
p <- (1 - pnorm(abs(z), 0, 1)) * 2
p
#confidence intervals of parameter estimates
confint(MNLogit)
#odds ratio
exp(coef(MNLogit))
#confidence intervals of the ORs
exp(confint(MNLogit))
BTW, 多元逻辑回归可以用vgam package做
library(xtable)
library(VGAM)
wallet<-read.csv(file="wallet.csv")
#For reviewing data
names(wallet)
summary(wallet)
#Including labels for 0, 1, and 2
wallet$honestfac<-factor(wallet$wallet,labels=c("least","ethical","most"))
#Recoding
punish_el<-ifelse(wallet$punish==1,1,0)
table(punish_el)
#################################################
#Multinomial logit model using "vglm" function
mlogit<-vglm(honestfac~male+business+punish_el+explain,family=multinomial(), data=wallet)
summary(mlogit)
##To create table for exporting into latex
A<-coef(summary(mlogit))
xtable(A)
exp(coef(mlogit)) #odds ratios
lrtest(mlogit) #to compare the initial and converged model
deviance(mlogit) #see residual deviance
AIC(mlogit) #AIC value
logLik(mlogit) #loglikelihood
#using vglm - you need to compute confidence interval yourself as follows:
b <- coef(mlogit)
se <- sqrt(diag(vcov(mlogit)))
CIM<-cbind(LL = b-qnorm(0.975)*se, UL = b+qnorm(0.975)*se)
round(CIM,2)