LightGBM 如何调参

本文结构：

什么是 LightGBM
怎么调参
和 xgboost 的代码比较

1. 什么是 LightGBM

Light GBM is a gradient boosting framework that uses tree based learning algorithm.

LightGBM 垂直地生长树，即 leaf-wise，它会选择最大 delta loss 的叶子来增长。

而以往其它基于树的算法是水平地生长，即 level-wise，

当生长相同的叶子时，Leaf-wise 比 level-wise 减少更多的损失。

高速，高效处理大数据，运行时需要更低的内存，支持 GPU

不要在少量数据上使用，会过拟合，建议 10,000+ 行记录时使用。

2. 怎么调参

下面几张表为重要参数的含义和如何应用

Control Parameters	含义	用法
`max_depth`	树的最大深度	当模型过拟合时,可以考虑首先降低 `max_depth`
`min_data_in_leaf`	叶子可能具有的最小记录数	默认20，过拟合时用
`feature_fraction`	例如为0.8时，意味着在每次迭代中随机选择80％的参数来建树	boosting 为 random forest 时用
`bagging_fraction`	每次迭代时用的数据比例	用于加快训练速度和减小过拟合
`early_stopping_round`	如果一次验证数据的一个度量在最近的`early_stopping_round` 回合中没有提高，模型将停止训练	加速分析，减少过多迭代
lambda	指定正则化	0～1
`min_gain_to_split`	描述分裂的最小 gain	控制树的有用的分裂
`max_cat_group`	在 group 边界上找到分割点	当类别数量很多时，找分割点很容易过拟合时

Core Parameters	含义	用法
Task	数据的用途	选择 train 或者 predict
application	模型的用途	选择 regression: 回归时，binary: 二分类时，multiclass: 多分类时
boosting	要用的算法	gbdt， rf: random forest， dart: Dropouts meet Multiple Additive Regression Trees， goss: `Gradient-based One-Side Sampling`
`num_boost_round`	迭代次数	通常 100+
`learning_rate`	如果一次验证数据的一个度量在最近的 `early_stopping_round` 回合中没有提高，模型将停止训练	常用 0.1, 0.001, 0.003…
`num_leaves`		默认 31
device		cpu 或者 gpu
metric		`mae: mean absolute error ， mse: mean squared error ， binary_logloss: loss for binary classification ， multi_logloss: loss for multi classification`

IO parameter	含义
`max_bin`	表示 feature 将存入的 bin 的最大数量
`categorical_feature`	如果 `categorical_features = 0,1,2`，则列 0，1，2是 categorical 变量
`ignore_column`	与 `categorical_features` 类似，只不过不是将特定的列视为categorical，而是完全忽略
`save_binary`	这个参数为 true 时，则数据集被保存为二进制文件，下次读数据时速度会变快

调参

IO parameter	含义
`num_leaves`	取值应 `<= 2 ^（max_depth）`，超过此值会导致过拟合
`min_data_in_leaf`	将它设置为较大的值可以避免生长太深的树，但可能会导致 underfitting，在大型数据集时就设置为数百或数千
`max_depth`	这个也是可以限制树的深度

下表对应了 Faster Speed ，better accuracy ，over-fitting 三种目的时，可以调的参数

Faster Speed	better accuracy	over-fitting
将 `max_bin` 设置小一些	用较大的 `max_bin`	`max_bin` 小一些
	`num_leaves` 大一些	`num_leaves` 小一些
用 `feature_fraction` 来做 `sub-sampling`		用 `feature_fraction`
用 `bagging_fraction 和 bagging_freq`		设定 `bagging_fraction 和 bagging_freq`
	training data 多一些	training data 多一些
用 `save_binary` 来加速数据加载	直接用 categorical feature	用 `gmin_data_in_leaf 和 min_sum_hessian_in_leaf`
用 parallel learning	用 dart	用 `lambda_l1, lambda_l2 ，min_gain_to_split` 做正则化
	`num_iterations` 大一些，`learning_rate` 小一些	用 `max_depth` 控制树的深度

3. lightGBM 和 xgboost 的代码比较

#xgboost
dtrain = xgb.DMatrix(x_train,label=y_train)
dtest = xgb.DMatrix(x_test)


# lightgbm
train_data = lgb.Dataset(x_train,label=y_train)

setting parameters:

#xgboost
parameters = {
    'max_depth':7, 
    'eta':1, 
    'silent':1,
    'objective':'binary:logistic',
    'eval_metric':'auc',
    'learning_rate':.05}

# lightgbm
param = {
    'num_leaves':150, 
    'objective':'binary',
    'max_depth':7,
    'learning_rate':.05,
    'max_bin':200}
param['metric'] = ['auc', 'binary_logloss']

training model :

#xgboost
num_round = 50
from datetime import datetime 
start = datetime.now() 
xg = xgb.train(parameters,dtrain,num_round) 
stop = datetime.now()

# lightgbm
num_round = 50
start = datetime.now()
lgbm = lgb.train(param,train_data,num_round)
stop = datetime.now()

Execution time of the model:

#xgboost
execution_time_xgb = stop - start 
execution_time_xgb

# lightgbm
execution_time_lgbm = stop - start
execution_time_lgbm

predicting model on test set:

#xgboost
ypred = xg.predict(dtest) 
ypred

# lightgbm
ypred2 = lgbm.predict(x_test)
ypred2[0:5]

Converting probabilities into 1 or 0:

#xgboost
for i in range(0,9769): 
    if ypred[i] >= .5:       # setting threshold to .5 
       ypred[i] = 1 
    else: 
       ypred[i] = 0

# lightgbm
for i in range(0,9769):
    if ypred2[i] >= .5:       # setting threshold to .5
       ypred2[i] = 1
    else:  
       ypred2[i] = 0

calculating accuracy of our model :

#xgboost
from sklearn.metrics import accuracy_score 
accuracy_xgb = accuracy_score(y_test,ypred) 
accuracy_xgb

# lightgbm
accuracy_lgbm = accuracy_score(ypred2,y_test)
accuracy_lgbm
y_test.value_counts()
from sklearn.metrics import roc_auc_score

calculating roc_auc_score:

#xgboost
auc_xgb =  roc_auc_score(y_test,ypred)

# lightgbm
auc_lgbm = roc_auc_score(y_test,ypred2)

最后可以建立一个 dataframe 来比较 Lightgbm 和 xgb:

auc_lgbm comparison_dict = {
    'accuracy score':(accuracy_lgbm,accuracy_xgb),
    'auc score':(auc_lgbm,auc_xgb),
    'execution time':(execution_time_lgbm,execution_time_xgb)}

comparison_df = DataFrame(comparison_dict) 
comparison_df.index= ['LightGBM','xgboost'] 
comparison_df

学习资料：
https://medium.com/@pushkarmandot/https-medium-com-pushkarmandot-what-is-lightgbm-how-to-implement-it-how-to-fine-tune-the-parameters-60347819b7fc
https://www.analyticsvidhya.com/blog/2017/06/which-algorithm-takes-the-crown-light-gbm-vs-xgboost/

推荐阅读历史技术博文链接汇总
 http://www.jianshu.com/p/28f02bb59fe5
也许可以找到你想要的：
[入门问题][TensorFlow][深度学习][强化学习][神经网络][机器学习][自然语言处理][聊天机器人]

LightGBM 如何调参