This post introduces the basic usage of xgboost in R, largely following the article linked here.
data loading
require(xgboost)
data(agaricus.train, package='xgboost')
data(agaricus.test, package='xgboost')
train <- agaricus.train
test <- agaricus.test
train contains both data and label; test has the same structure.
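As a quick check, the structure of the loaded objects can be inspected (a minimal sketch: train$data is a sparse feature matrix and train$label a 0/1 vector):

```r
require(xgboost)
data(agaricus.train, package='xgboost')
train <- agaricus.train

# data is a sparse feature matrix (class dgCMatrix from the Matrix package),
# label is a numeric vector of 0/1 class labels, one per row of data
class(train$data)
dim(train$data)       # number of examples x number of features
table(train$label)
```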
basic training
Several parameters must be set when calling xgboost:
objective: the objective function, e.g. binary:logistic for binary classification
max_depth: maximum depth of each tree
nthread: number of threads to use
nrounds: number of boosting rounds (trees)
eta: the learning rate
xgb = xgboost(data=train$data, label=train$label, max_depth=2, eta=1, objective='binary:logistic', nrounds=2)
## [1] train-error:0.046522
## [2] train-error:0.022263
xgb.DMatrix
xgb.DMatrix combines train$data and train$label into a single object:
dtrain = xgb.DMatrix(data=train$data, label=train$label)
bst = xgboost(data=dtrain, max_depth=2, eta=1, objective='binary:logistic', nrounds=2)
verbose option
Controls how much information is printed during training:
verbose=0: no message
verbose=1: evaluation metric
verbose=2: evaluation metric + tree information
bst = xgboost(data=dtrain, max_depth=2, eta=1, objective='binary:logistic', nrounds=2, verbose=2)
## [22:38:46] amalgamation/../src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 6 extra nodes, 0 pruned nodes, max_depth=2
## [1] train-error:0.046522
## [22:38:46] amalgamation/../src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 4 extra nodes, 0 pruned nodes, max_depth=2
## [2] train-error:0.022263
predict
Once the model bst has been trained, it can be used to predict the labels of test$data.
pred = predict(bst, test$data)
prediction = as.numeric(pred > 0.5)
print(head(prediction, 5))
err = mean(as.numeric(prediction != test$label))
print(paste("test-error =", err))
xgb.train
With xgb.train, the evaluation metric can be computed on a test set after each round, which makes it easier to pick a model that does not overfit.
dtrain = xgb.DMatrix(data=train$data, label=train$label)
dtest = xgb.DMatrix(data=test$data, label=test$label)
watchlist = list(train=dtrain, test=dtest)
bst = xgb.train(data=dtrain, max_depth=2, eta=1, nthread=2, nrounds=2, watchlist=watchlist, objective='binary:logistic')
linear boosting
All the models above are based on boosted trees. By setting booster='gblinear' (and removing eta), we can use linear boosting instead.
bst <- xgb.train(data=dtrain, booster = "gblinear", max_depth=2, nthread = 2, nrounds=2, watchlist=watchlist, eval_metric = "error", eval_metric = "logloss", objective = "binary:logistic")
# two eval_metric values are set to monitor both error and logloss
save & load
# DMatrix save & load
xgb.DMatrix.save(dtrain, 'dtrain.buffer')
dtrain = xgb.DMatrix('dtrain.buffer')
# model save & load
xgb.save(bst, 'xgboost.model')
bst = xgb.load('xgboost.model')
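A quick sanity check (a sketch, assuming bst and test from the sections above are still in scope): the reloaded model should produce exactly the same predictions as the original.

```r
# Predictions before and after the save/load round trip should match
pred_before = predict(bst, test$data)
bst2 = xgb.load('xgboost.model')
pred_after = predict(bst2, test$data)
all.equal(pred_before, pred_after)
```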
variable importance
import_mat = xgb.importance(colnames(train$data), model=bst)
print(import_mat)
xgb.plot.importance(importance_matrix=import_mat)
viewing the trees
The individual trees in the model can be dumped as text with xgb.dump(model, with_stats=TRUE) and plotted with xgb.plot.tree(model).
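Putting the two calls together (a sketch, assuming bst is a tree-based model such as the one from the basic training section; xgb.plot.tree additionally requires the DiagrammeR package to be installed):

```r
# Dump the trees as text, including gain/cover statistics for each split
xgb.dump(bst, with_stats = TRUE)

# Render the trees as a diagram (needs the DiagrammeR package)
xgb.plot.tree(model = bst)
```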