Ensemble Model: Stacked Model Example, R Code Walkthrough (V1.0)

This article walks through the R code in Ensemble Model: Stacked Model Example.

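Overview of the code:
The example builds an ensemble model: the predictions of several models are combined to predict house prices.
As the diagram below shows, each feature set can consist of one or more groups of attributes: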

model combination.png

1. Data preparation

1.1 Loading the data

train.raw <- read.csv(file.path(DATA.DIR,"train.csv"),stringsAsFactors = FALSE)
test.raw <- read.csv(file.path(DATA.DIR,"test.csv"), stringsAsFactors = FALSE)

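1.2 Data initialization

1.2.1 Selecting and grouping the important features

Computing feature importance

See Boruta Feature Importance Analysis for details. The main steps are:

Separate the character and numeric columns
Group the columns of the data set by type
Fill in missing values
1. Missing numeric values are set to -1
2. Missing character values are set to "*MISSING*"

Run the Boruta analysis to obtain the importance of each feature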

set.seed(13)
bor.results <- Boruta(sample.df,response,
           maxRuns=101,
           doTrace=0)

The resulting plot:

relative importance of each candidate explanatory attribute.png

Grouping the features

Example code:
CONFIRMED_ATTR <- c("MSSubClass","MSZoning","LotArea","LotShape",
                                  "LandContour","Neighborhood", …… ,"Fence")

1.2.2 Splitting the data for cross-validation

# create folds for training
set.seed(13)
data_folds <- createFolds(train.raw$SalePrice, k=5)

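Syntax note: createFolds(train.raw$SalePrice, k=5) (from caret) splits the rows into 5 folds and returns a list of 5 integer vectors, each holding the row indices of one fold.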
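To make the shape of data_folds concrete, here is a minimal base-R sketch of a k-fold split. The helper name make_folds is hypothetical and is not part of caret; caret's createFolds additionally balances the folds on the outcome, which this sketch does not attempt.

```r
# Hypothetical helper (not caret): split row indices 1..n into k folds.
make_folds <- function(n, k) {
  shuffled <- sample(n)                        # random permutation of 1..n
  fold_id  <- rep(seq_len(k), length.out = n)  # 1, 2, ..., k, 1, 2, ...
  folds    <- split(shuffled, fold_id)         # one index vector per fold
  names(folds) <- paste0("Fold", seq_len(k))
  folds
}

set.seed(13)
folds <- make_folds(n = 23, k = 5)
str(folds)  # a list of 5 integer vectors of row indices
```

Every row index lands in exactly one fold, so each row is held out exactly once during cross-validation.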

Create Level 0 Model Feature Sets

Two feature sets are created, Feature Set 1 and Feature Set 2.
Both include the Boruta Confirmed and Tentative attributes. Each feature set is generated by a user-defined R function that turns the raw training set into a feature set. No additional feature engineering is applied here.

1.3 Feature Set 1

Log-transform SalePrice; use the Boruta Confirmed and Tentative attributes

Code:
id <- df$Id
# df$SalePrice is NULL when the data frame has no such column (e.g. test data)
if (!is.null(df$SalePrice)) {
    y <- log(df$SalePrice)
} else {
    y <- NULL
}

Filling missing values

Code:
# for numeric attributes, set missing values to -1
num_attr <- intersect(predictor_vars,DATA_ATTR_TYPES$integer)
for (x in num_attr){
  predictors[[x]][is.na(predictors[[x]])] <- -1
}

# for character attributes, set missing values to "*MISSING*" and convert to factor
char_attr <- intersect(predictor_vars,DATA_ATTR_TYPES$character)
for (x in char_attr){
  predictors[[x]][is.na(predictors[[x]])] <- "*MISSING*"
  predictors[[x]] <- factor(predictors[[x]])
}
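The fragments above can be assembled into one function. The following is a minimal sketch only: the function name prep_feature_set, the toy data, and the decision to pick the fill value by column type are assumptions for illustration, not the original author's function.

```r
# Hypothetical feature-set builder; names and toy values are illustrative.
prep_feature_set <- function(df, predictor_vars) {
  id <- df$Id
  # No SalePrice column (e.g. test data) means there is no target.
  y <- if (!is.null(df$SalePrice)) log(df$SalePrice) else NULL
  predictors <- df[, predictor_vars, drop = FALSE]
  for (x in predictor_vars) {
    if (is.numeric(predictors[[x]])) {
      predictors[[x]][is.na(predictors[[x]])] <- -1
    } else {
      predictors[[x]][is.na(predictors[[x]])] <- "*MISSING*"
      predictors[[x]] <- factor(predictors[[x]])
    }
  }
  list(id = id, y = y, predictors = predictors)
}

train_toy <- data.frame(Id = 1:3, LotArea = c(8450, NA, 11250),
                        Alley = c("Grvl", NA, "Pave"),
                        SalePrice = c(208500, 181500, 223500),
                        stringsAsFactors = FALSE)
test_toy <- data.frame(Id = 4:5, LotArea = c(9600, NA),
                       Alley = c(NA, "Grvl"), stringsAsFactors = FALSE)

vars <- c("LotArea", "Alley")
feature_set <- list(train = prep_feature_set(train_toy, vars),
                    test  = prep_feature_set(test_toy, vars))
str(feature_set$train)
```

The result has the train/test layout (id, y, predictors) that the training helpers later in this article index into.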

1.4 Feature Set 2 (xgboost)

As in Feature Set 1, SalePrice is first log-transformed.
As in Feature Set 1, missing values are then filled in.

2. Level 0 Model Training

2.1 Helper Function For Training

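These helpers prepare for the modeling that follows: one fits a model with one cross-validation fold held out (trainOneFold), and one produces predictions from such a fold-level model.

train model on one data fold
The steps below are wrapped into a single function, trainOneFold:
1. Get the cross-validation data for the given fold, i.e. get fold specific cv data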

cv.data <- list()
cv.data$predictors <- feature_set$train$predictors[this_fold,]
cv.data$ID <- feature_set$train$id[this_fold]
cv.data$y <- feature_set$train$y[this_fold]

2. Get the matching training data (all rows outside the fold), i.e. get training data for specific fold.

train.data <- list()
train.data$predictors <- feature_set$train$predictors[-this_fold,]
train.data$y <- feature_set$train$y[-this_fold]

3. Use do.call() to assemble the argument lists into a single train() call and fit the model.

 fitted_mdl <- do.call(train,
                      c(list(x=train.data$predictors,y=train.data$y),
                    CARET.TRAIN.PARMS,
                    MODEL.SPECIFIC.PARMS,
                    CARET.TRAIN.OTHER.PARMS))

where:

do.call constructs and executes a function call from a name or a function and a list of arguments to be passed to it.

train() (from caret): fits predictive models over different tuning parameters.

4. Get the predictions, i.e. make prediction from a model fitted to one fold.

      yhat <- predict(fitted_mdl,newdata = cv.data$predictors,type = "raw")
      score <- rmse(cv.data$y,yhat)  
      ans <- list(fitted_mdl=fitted_mdl,
            score=score,
            predictions=data.frame(ID=cv.data$ID,yhat=yhat,y=cv.data$y))
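The rmse function used in step 4 is not defined in the snippets shown here, so it presumably comes from a package. As a minimal sketch of the same quantity, assuming the usual root-mean-squared-error definition, a base-R version (named rmse_sketch to avoid implying it is the original function) could look like this. Because y is log(SalePrice), the score is an error on the log scale.

```r
# Root mean squared error between actual and predicted values.
rmse_sketch <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Made-up prices, compared on the log scale as in the tutorial.
actual    <- log(c(200000, 150000, 300000))
predicted <- log(c(210000, 150000, 270000))
rmse_sketch(actual, predicted)
```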

make prediction from a model fitted to one fold
Predictions on the test set are made from an already fitted fold-level model. This is also wrapped into a function, makeOneFoldTestPrediction:

fitted_mdl <- this_fold$fitted_mdl
yhat <- predict(fitted_mdl,newdata = feature_set$test$predictors,type = "raw")
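One way such per-fold test predictions can be used is to average them across all fold-level models. The sketch below illustrates that idea only; it assumes lm on the built-in cars data as a stand-in for the caret models, and all object names are hypothetical.

```r
set.seed(13)
n       <- nrow(cars)
fold_id <- sample(rep(1:5, length.out = n))   # fold label for every row
test_df <- data.frame(speed = c(10, 15, 20))  # stand-in test set

# One model per fold, each fitted with that fold held out.
fold_models <- lapply(1:5, function(k) {
  lm(dist ~ speed, data = cars[fold_id != k, ])
})

# One column of test predictions per fold-level model.
fold_preds <- sapply(fold_models, predict, newdata = test_df)

# Average across the fold-level models, row by row.
avg_pred <- rowMeans(fold_preds)
avg_pred
```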

2.2 gbm model

set caret training parameters

The caret package (short for Classification And REgression Training) is a set of functions that attempt to streamline the process for creating predictive models. The package contains tools for:

data splitting
pre-processing
feature selection
model tuning using resampling
variable importance estimation

CARET.TRAIN.PARMS <-list(method="gbm")   
CARET.TUNE.GRID <-expand.grid(n.trees=100, 
                            interaction.depth=10, 
                            shrinkage=0.1,
                            n.minobsinnode=10)
MODEL.SPECIFIC.PARMS <- list(verbose=0)

where:

expand.grid(): creates a data frame from all combinations of the supplied vectors or factors.
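A short base-R illustration of expand.grid: with one value per parameter the grid has a single row, which is a single candidate model; with several values per parameter it holds every combination. The second grid is a made-up example and is not used in this article.

```r
# One value per parameter: a single-row grid, as in the gbm setup above.
single <- expand.grid(n.trees = 100,
                      interaction.depth = 10,
                      shrinkage = 0.1,
                      n.minobsinnode = 10)
nrow(single)

# Several values per parameter: every combination becomes one row.
multi <- expand.grid(n.trees = c(100, 200),
                     interaction.depth = c(5, 10),
                     shrinkage = 0.1,
                     n.minobsinnode = 10)
multi
```

A single-row grid fits the setup used here, where trainControl(method="none") fits one model without resampling.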

model specific training parameter

    CARET.TRAIN.CTRL <- trainControl(method="none",
                             verboseIter=FALSE,
                             classProbs=FALSE)

    CARET.TRAIN.OTHER.PARMS <- list(trControl=CARET.TRAIN.CTRL,
                       tuneGrid=CARET.TUNE.GRID,
                       metric="RMSE")

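where:

trainControl builds a set of parameters that control how the model is trained. Possible parameters include:
method: the resampling method
……
verboseIter: a logical value controlling whether the training log is printed.
classProbs: a logical value controlling whether class probabilities should be computed.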

generate features for Level 1

This prepares the inputs for the Level 1 model.

gbm_set <- llply(data_folds,trainOneFold,L0FeatureSet1)

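Here trainOneFold is the helper that trains a model with one fold held out, and L0FeatureSet1 is the prepared feature set. llply (from plyr) applies trainOneFold to each element of data_folds and returns a list with one result per fold.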
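The fold loop produces out-of-fold predictions: every training row is predicted by a model that never saw it. A minimal sketch of that mechanism, assuming base-R lapply in place of llply and lm on the built-in cars data in place of the caret model (the helper name train_one_fold_sketch is hypothetical):

```r
set.seed(13)
n     <- nrow(cars)
folds <- split(sample(n), rep(1:5, length.out = n))  # list of row indices

train_one_fold_sketch <- function(this_fold, data) {
  fit  <- lm(dist ~ speed, data = data[-this_fold, ])  # train without the fold
  yhat <- predict(fit, newdata = data[this_fold, ])    # predict held-out rows
  data.frame(ID = this_fold, yhat = unname(yhat), y = data$dist[this_fold])
}

# One data frame of predictions per fold, then stacked into one table.
fold_results <- lapply(folds, train_one_fold_sketch, data = cars)
oof <- do.call(rbind, fold_results)
head(oof)
```

Each row of cars appears exactly once in oof, which is what makes these predictions usable as training features for a second-level model.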

final model fit
The final GBM model is fitted on the full training set.

gbm_mdl <- do.call(train, c(list(x=L0FeatureSet1$train$predictors,y=L0FeatureSet1$train$y),
             CARET.TRAIN.PARMS,
             MODEL.SPECIFIC.PARMS,
             CARET.TRAIN.OTHER.PARMS))
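The do.call pattern used for the fit can be tried on a small scale. In this sketch mean stands in for train, and the two argument lists are made up; the point is only that c() joins the lists and do.call turns the result into one function call.

```r
# Argument groups kept in separate lists, mirroring the pattern above.
data_args  <- list(x = c(1, 4, 9, 16))
extra_args <- list(trim = 0, na.rm = TRUE)

# c() concatenates the lists into one argument list;
# do.call then evaluates mean(x = ..., trim = 0, na.rm = TRUE).
via_do_call <- do.call(mean, c(data_args, extra_args))
direct      <- mean(x = c(1, 4, 9, 16), trim = 0, na.rm = TRUE)

identical(via_do_call, direct)
```

Keeping the parameter groups in separate lists lets the same call be reused for each model by swapping only the lists.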

CV Error Estimate

cv_y <- do.call(c,lapply(gbm_set,function(x){x$predictions$y}))
cv_yhat <- do.call(c,lapply(gbm_set,function(x){x$predictions$yhat}))
rmse(cv_y,cv_yhat)
cat("Average CV rmse:",mean(do.call(c,lapply(gbm_set,function(x){x$score}))))

Here, cat is useful for producing output in user-defined functions.
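The code above reports two numbers that are generally not equal: the RMSE over all pooled out-of-fold predictions, and the mean of the per-fold RMSE scores. A small sketch with made-up fold results (the object fold_set and the helper rmse_sketch are hypothetical) shows the difference:

```r
rmse_sketch <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Made-up fold results with the same shape as the elements of gbm_set.
fold_set <- list(
  Fold1 = list(predictions = data.frame(y = c(1, 2, 3), yhat = c(1.1, 1.9, 3.2))),
  Fold2 = list(predictions = data.frame(y = c(4, 5),    yhat = c(4.5, 4.0)))
)
# Attach the per-fold score, as the training helper does.
fold_set <- lapply(fold_set, function(f) {
  f$score <- rmse_sketch(f$predictions$y, f$predictions$yhat)
  f
})

cv_y    <- do.call(c, lapply(fold_set, function(x) x$predictions$y))
cv_yhat <- do.call(c, lapply(fold_set, function(x) x$predictions$yhat))

pooled_rmse    <- rmse_sketch(cv_y, cv_yhat)
mean_fold_rmse <- mean(do.call(c, lapply(fold_set, function(x) x$score)))
c(pooled = pooled_rmse, mean_of_folds = mean_fold_rmse)
```

The pooled value weights every observation equally, while the mean of fold scores weights every fold equally, so the two diverge when fold sizes or fold errors differ.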

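create test submission
In the code shown here, the test-set predictions are made with the final model gbm_mdl fitted on the full training set. Because the model was trained on log(SalePrice), the predictions are transformed back with exp() before being written to a .csv file.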

test_gbm_yhat <- predict(gbm_mdl,newdata = L0FeatureSet1$test$predictors,type = "raw")
gbm_submission <- cbind(Id=L0FeatureSet1$test$id,SalePrice=exp(test_gbm_yhat))  
write.csv(gbm_submission,file="gbm_submission.csv",row.names=FALSE)
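A minimal sketch of the back-transformation and file output, using made-up IDs and predictions and a temporary file so nothing is written to the working directory. It uses data.frame instead of cbind, which is an assumption of this sketch; cbind on two vectors yields a matrix, which write.csv also accepts.

```r
# Made-up test IDs and log-scale predictions.
ids      <- c(1461, 1462, 1463)
log_pred <- log(c(120000, 155000, 180000))

# exp() undoes the log() applied to SalePrice before training.
submission <- data.frame(Id = ids, SalePrice = exp(log_pred))

out_file <- file.path(tempdir(), "example_submission.csv")
write.csv(submission, file = out_file, row.names = FALSE)

read_back <- read.csv(out_file)
read_back
```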

2.3 xgboost model

The xgboost model follows the same workflow as the gbm model, so the explanation is not repeated; only the main steps and code are listed below:

set caret training parameters

CARET.TRAIN.PARMS <- list(method="xgbTree")   
CARET.TUNE.GRID <-  expand.grid(nrounds=800, 
                            max_depth=10, 
                            eta=0.03, 
                            gamma=0.1, 
                            colsample_bytree=0.4, 
                            min_child_weight=1)
MODEL.SPECIFIC.PARMS <- list(verbose=0)

model specific training parameter

CARET.TRAIN.CTRL <- trainControl(method="none",
                             verboseIter=FALSE,
                             classProbs=FALSE)
CARET.TRAIN.OTHER.PARMS <- list(trControl=CARET.TRAIN.CTRL,
                       tuneGrid=CARET.TUNE.GRID,
                       metric="RMSE")

generate Level 1 features

xgb_set <- llply(data_folds,trainOneFold,L0FeatureSet2)

final model fit

xgb_mdl <- do.call(train, c(list(x=L0FeatureSet2$train$predictors,y=L0FeatureSet2$train$y),
             CARET.TRAIN.PARMS,
             MODEL.SPECIFIC.PARMS,
             CARET.TRAIN.OTHER.PARMS))

CV Error Estimate

cv_y <- do.call(c,lapply(xgb_set,function(x){x$predictions$y}))
cv_yhat <- do.call(c,lapply(xgb_set,function(x){x$predictions$yhat}))
rmse(cv_y,cv_yhat)
cat("Average CV rmse:",mean(do.call(c,lapply(xgb_set,function(x){x$score}))))

create test submission

test_xgb_yhat <- predict(xgb_mdl,newdata = L0FeatureSet2$test$predictors,type = "raw")
xgb_submission <- cbind(Id=L0FeatureSet2$test$id,SalePrice=exp(test_xgb_yhat))
write.csv(xgb_submission,file="xgb_submission.csv",row.names=FALSE)

2.4 ranger model

The ranger model follows the same workflow as the xgboost and gbm models, so the explanation is not repeated; only the main steps and code are listed below:

set caret training parameters

CARET.TRAIN.PARMS <- list(method="ranger")   
CARET.TUNE.GRID <-  expand.grid(mtry=2*as.integer(sqrt(ncol(L0FeatureSet1$train$predictors))))
MODEL.SPECIFIC.PARMS <- list(verbose=0,num.trees=500)
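The mtry expression above evaluates to twice the integer part of the square root of the number of predictor columns. A small sketch of the arithmetic, with made-up column counts and a hypothetical helper name:

```r
# Same formula as in the tuning grid above, applied to a column count p.
mtry_for <- function(p) 2 * as.integer(sqrt(p))

# Made-up numbers of predictor columns.
p_values <- c(10, 50, 80)
data.frame(p = p_values, mtry = sapply(p_values, mtry_for))
```

as.integer truncates toward zero, so for p = 50 the square root (about 7.07) becomes 7 before it is doubled.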

model specific training parameter

CARET.TRAIN.CTRL <- trainControl(method="none",
                             verboseIter=FALSE,
                             classProbs=FALSE)

CARET.TRAIN.OTHER.PARMS <- list(trControl=CARET.TRAIN.CTRL,
                       tuneGrid=CARET.TUNE.GRID,
                       metric="RMSE")

generate Level 1 features

rngr_set <- llply(data_folds,trainOneFold,L0FeatureSet1)

final model fit

rngr_mdl <- do.call(train, c(list(x=L0FeatureSet1$train$predictors,y=L0FeatureSet1$train$y),
             CARET.TRAIN.PARMS,
             MODEL.SPECIFIC.PARMS,
             CARET.TRAIN.OTHER.PARMS))

CV Error Estimate

cv_y <- do.call(c,lapply(rngr_set,function(x){x$predictions$y}))
cv_yhat <- do.call(c,lapply(rngr_set,function(x){x$predictions$yhat}))
rmse(cv_y,cv_yhat)
cat("Average CV rmse:",mean(do.call(c,lapply(rngr_set,function(x){x$score}))))

create test submission

test_rngr_yhat <- predict(rngr_mdl,newdata = L0FeatureSet1$test$predictors,type = "raw")
rngr_submission <- cbind(Id=L0FeatureSet1$test$id,SalePrice=exp(test_rngr_yhat))
write.csv(rngr_submission,file="rngr_submission.csv",row.names=FALSE)

3. Level 1 Model Training

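gbm_set, xgb_set, and rngr_set hold the per-fold results of the gbm, xgboost, and ranger models from the previous section. The out-of-fold predictions of the three models are extracted from them and serve as the features for Level 1.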

gbm_yhat <- do.call(c,lapply(gbm_set,function(x){x$predictions$yhat}))
xgb_yhat <- do.call(c,lapply(xgb_set,function(x){x$predictions$yhat}))
rngr_yhat <- do.call(c,lapply(rngr_set,function(x){x$predictions$yhat}))

3.1 Create predictions For Level 1 Model

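Note: the syntax of the following block was flagged as an open question in this write-up. Read line by line, it assembles the Level 1 feature set: the IDs and targets come from the gbm fold results, the three out-of-fold prediction vectors become the training predictors, and the three test-set prediction vectors become the test predictors. predictors_rank (the row-wise ranks of the three predictions) is computed, but the cbind that would add it is commented out, so it is not used.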

L1FeatureSet <- list()  # the container must exist before nested elements are assigned
L1FeatureSet$train$id <- do.call(c,lapply(gbm_set,function(x){x$predictions$ID}))
L1FeatureSet$train$y <- do.call(c,lapply(gbm_set,function(x){x$predictions$y}))
predictors <- data.frame(gbm_yhat,xgb_yhat,rngr_yhat)
predictors_rank <- t(apply(predictors,1,rank))
colnames(predictors_rank) <- paste0("rank_",names(predictors))
L1FeatureSet$train$predictors <- predictors #cbind(predictors,predictors_rank)
L1FeatureSet$test$id <- gbm_submission[,"Id"]
L1FeatureSet$test$predictors <- data.frame(gbm_yhat=test_gbm_yhat, xgb_yhat=test_xgb_yhat, rngr_yhat=test_rngr_yhat)
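The rank step is the least obvious part of the block above. The sketch below runs it on made-up predictions for three rows, so the intermediate shapes can be inspected; the values are hypothetical.

```r
# Made-up out-of-fold predictions from the three Level 0 models.
predictors <- data.frame(gbm_yhat  = c(12.1, 11.8, 12.5),
                         xgb_yhat  = c(12.3, 11.7, 12.4),
                         rngr_yhat = c(12.2, 11.9, 12.6))

# apply(..., 1, rank) ranks the three predictions within each row and
# returns one column per input row, so t() restores rows = observations.
predictors_rank <- t(apply(predictors, 1, rank))
colnames(predictors_rank) <- paste0("rank_", names(predictors))

# The commented-out alternative: predictions plus their ranks.
with_ranks <- cbind(predictors, predictors_rank)
with_ranks
```

In the first row the gbm prediction is the smallest and the xgboost prediction the largest, so the ranks are 1, 3, 2.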

3.2 Neural Net Model

The overall flow is the same as for the Level 0 models:

set caret training parameters

CARET.TRAIN.PARMS <- list(method="nnet") 
CARET.TUNE.GRID <-  NULL  # NULL: use caret's default tuning grid

model specific training parameter

CARET.TRAIN.CTRL <- trainControl(method="repeatedcv",
                             number=5,
                             repeats=1,
                             verboseIter=FALSE)
CARET.TRAIN.OTHER.PARMS <- list(trControl=CARET.TRAIN.CTRL,
                        maximize=FALSE,
                       tuneGrid=CARET.TUNE.GRID,
                       tuneLength=7,
                       metric="RMSE")
# Other model specific parameters
MODEL.SPECIFIC.PARMS <- list(verbose=FALSE,linout=TRUE,trace=FALSE)

train the model

l1_nnet_mdl <- do.call(train, 
                       c(list(x=L1FeatureSet$train$predictors, y=L1FeatureSet$train$y),
                        CARET.TRAIN.PARMS,
                        MODEL.SPECIFIC.PARMS,
                        CARET.TRAIN.OTHER.PARMS))
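To show the stacking step end to end without caret or nnet, the sketch below swaps in a plain linear model (lm) as the Level 1 learner. This is a deliberate substitution, not the neural net used above, and all data are simulated. The final exp() assumes, as in the Level 0 submissions, that the target is log(SalePrice).

```r
set.seed(13)
n <- 100
y <- rnorm(n, mean = 12, sd = 0.4)  # simulated log-price targets

# Simulated out-of-fold predictions: the target plus model-specific noise.
l1_train <- data.frame(gbm_yhat  = y + rnorm(n, sd = 0.15),
                       xgb_yhat  = y + rnorm(n, sd = 0.12),
                       rngr_yhat = y + rnorm(n, sd = 0.20))

# Level 1 learner (stand-in): combine the three predictions linearly.
l1_fit <- lm(y ~ ., data = cbind(l1_train, y = y))
coef(l1_fit)

# Level 0 predictions on two made-up test rows, then the stacked prediction.
l1_test <- data.frame(gbm_yhat  = c(11.9, 12.4),
                      xgb_yhat  = c(12.0, 12.3),
                      rngr_yhat = c(11.8, 12.5))
final_price <- exp(predict(l1_fit, newdata = l1_test))
final_price
```

The Level 1 learner only ever sees the Level 0 predictions as inputs, which is why those predictions must be out-of-fold on the training side.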

Appendix

For additional information on model stacking see these references:

MLWave: Kaggle Ensembling Guide
Kaggle Forum Posting: Stacking
Winning Data Science Competitions: Jeong-Yoon Lee This talk is about 90 minutes long. The sections relevant to model stacking are discussed in these segments (h:mm:ss to h:mm:ss): 1:05:25 to 1:12:15 and 1:21:30 to 1:27:00.

Last edited: 2017.12.07 16:51:37
