Note: this article is mostly code, kept as personal notes for later reference.
LightGBM, XGBoost, and CatBoost are the three most commonly used boosting models; they appear constantly in machine learning competitions and perform extremely well. To push accuracy further, they are often used later as base models for model ensembling. This article focuses on training LightGBM and tuning its hyperparameters; the other models can be trained following the same steps.
Before the main content, let's first look at how the complexity of a boosting model is computed, and at the improvements LightGBM makes over a basic boosting model.
Boosting model complexity = number of features × number of split points per feature × number of samples
LightGBM's optimizations:
- Histogram algorithm: discretizes continuous features into bins and searches for the best split over the resulting histograms, which greatly reduces both computation and memory use. In addition, a leaf node's histogram can be obtained by subtracting its sibling's histogram from its parent's, which further speeds up node splitting. (Reduces the number of split points; a toy sketch of the binning idea follows this list.)
- GOSS (Gradient-based One-Side Sampling): keeps the samples with large gradients and randomly subsamples the many samples with small gradients during training. (Reduces the number of samples.)
- EFB (Exclusive Feature Bundling): bundles mutually exclusive features together into a single feature. (Reduces the number of features.)
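To make the histogram idea concrete, here is a minimal numpy sketch (an illustration of the binning idea only, not LightGBM's actual implementation):
import numpy as np

# Toy sketch of histogram binning: discretizing a continuous feature into a
# fixed number of bins means only the bin boundaries remain as candidate
# split points, instead of every distinct value.
rng = np.random.default_rng(0)
feature = rng.normal(size=100000)

n_bins = 255
edges = np.quantile(feature, np.linspace(0, 1, n_bins + 1))
binned = np.searchsorted(edges[1:-1], feature)  # bin index for every sample

print("candidate splits before binning:", np.unique(feature).size)
print("candidate splits after binning:", n_bins)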
Define a LightGBM model and train it as the baseline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, mean_squared_error
import lightgbm as lgb
import pickle as pkl
Model parameter settings
param = {
    'boosting_type': 'gbdt',
    'metric': 'multi_logloss',   # 'auc' only applies to binary tasks
    'num_leaves': 20,
    'n_estimators': 500,
    'objective': 'multiclass',
    'learning_rate': 0.005,
    'max_depth': 50,
    'min_child_samples': 20,
    # 'feature_fraction': 0.8,
    # 'bagging_fraction': 0.9,
    # 'bagging_freq': 8,
    'reg_alpha': 0.005,
    'reg_lambda': 0.17,
    # 'lambda_l1': 0.6,
    # 'lambda_l2': 0.5,
    # 'scale_pos_weight': k,
    # 'is_unbalance': True
}
X_train, X_valid, y_train, y_valid = train_test_split(features, labels, test_size=0.2, random_state=2021)  # use your own dataset here; not covered in this note
lgb_model = lgb.LGBMClassifier(**param)
lgb_model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], early_stopping_rounds=800, verbose=5)  # train; note early_stopping_rounds > n_estimators (500) means early stopping can never trigger here
pred = lgb_model.predict(X_valid)
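Note: on lightgbm >= 4.0 the early_stopping_rounds and verbose arguments were removed from fit(); the equivalent call uses callbacks (a sketch assuming lightgbm 4.x):
# Equivalent training call for lightgbm >= 4.0, where early stopping and
# logging are configured through callbacks instead of fit() arguments.
lgb_model.fit(
    X_train, y_train,
    eval_set=[(X_valid, y_valid)],
    callbacks=[lgb.early_stopping(stopping_rounds=100), lgb.log_evaluation(period=5)],
)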
Check the model's overall accuracy and its per-class performance
print(accuracy_score(y_valid, pred))
print(classification_report(y_valid, pred))  # sklearn expects (y_true, y_pred) order
print(confusion_matrix(y_valid, pred))
Save the model
# Serialize an object to disk
def pickle_write(path, data):
    with open(path, "wb") as fw:
        pkl.dump(data, fw)

# Load a previously pickled object
def pickle_read(path):
    with open(path, "rb") as fr:
        data = pkl.load(fr)
    return data
pickle_write("model/lgb_baseline.pickle", lgb_model)
# lgb_model = pickle_read("model/lgb_baseline.pickle")
Grid search with GridSearchCV
from sklearn.model_selection import GridSearchCV
import warnings
warnings.filterwarnings('ignore')
param_dist = {
    "learning_rate": [0.1, 0.05, 0.01],
    "max_depth": [50, 80, 100],
    "n_estimators": [500, 800, 1000]
}
grid_search = GridSearchCV(lgb_model, param_grid=param_dist, cv=3, scoring="roc_auc_ovr", verbose=5)  # plain "roc_auc" is binary-only; use the one-vs-rest variant for multiclass
grid_search.fit(X_train,y_train)
grid_search.best_estimator_, grid_search.best_score_
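Since refit=True by default, GridSearchCV retrains the winning configuration on the full training set, so the tuned estimator is ready to use:
# Inspect the winning configuration and score it on the hold-out split.
print(grid_search.best_params_)
best_lgb = grid_search.best_estimator_
print(accuracy_score(y_valid, best_lgb.predict(X_valid)))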
Hyperparameter search with hyperopt
The commonly used distribution functions are:
- hp.choice(label, options): returns one element of the given list or array.
- hp.randint(label, upper): returns a random integer in [0, upper).
- hp.uniform(label, low, high): returns a value uniformly distributed over [low, high]; during optimization the variable is constrained to this two-sided interval.
- hp.normal(label, mu, sigma): returns a real value drawn from a normal distribution with mean mu and standard deviation sigma; during optimization this variable is unbounded.
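To see what these distributions actually produce, you can draw a few samples from a toy space (the labels below are made up purely for illustration):
from hyperopt import hp
from hyperopt.pyll.stochastic import sample

# Draw a few points from a toy space to see what each hp.* returns.
toy_space = {
    "choice_demo": hp.choice("choice_demo", ["gbdt", "dart"]),
    "randint_demo": hp.randint("randint_demo", 10),       # integer in [0, 10)
    "uniform_demo": hp.uniform("uniform_demo", 0.0, 1.0),
    "normal_demo": hp.normal("normal_demo", 0.0, 1.0),
}
for _ in range(3):
    print(sample(toy_space))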
from functools import partial
from hyperopt import fmin, tpe, hp, Trials
import matplotlib.pyplot as plt  # for plotting trial losses later
# define the hyperopt search space (raw integer/uniform draws,
# mapped into real ranges by argsDict_tranform below)
space = {
    "max_depth": hp.randint("max_depth", 15),                   # -> 5..19 after transform
    "num_trees": hp.randint("num_trees", 300),                  # -> 150..449
    'learning_rate': hp.uniform('learning_rate', 1e-3, 5e-1),   # -> ~0.05..0.06
    "bagging_fraction": hp.randint("bagging_fraction", 5),      # -> 0.5..0.9
    "num_leaves": hp.randint("num_leaves", 6),                  # -> 10, 13, ..., 25
}
Parameter transformation: map the raw draws into the intended ranges
def argsDict_tranform(argsDict, isPrint=False):
    argsDict = dict(argsDict)  # work on a copy so the caller's dict is not mutated
    argsDict["max_depth"] = argsDict["max_depth"] + 5
    argsDict['num_trees'] = argsDict['num_trees'] + 150
    argsDict["learning_rate"] = argsDict["learning_rate"] * 0.02 + 0.05
    argsDict["bagging_fraction"] = argsDict["bagging_fraction"] * 0.1 + 0.5
    argsDict["num_leaves"] = argsDict["num_leaves"] * 3 + 10
    if isPrint:
        print(argsDict)
    return argsDict
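A quick sanity check of the mapping: draw one raw point from the space above and print it before and after the transform (purely illustrative):
from hyperopt.pyll.stochastic import sample

raw = sample(space)  # one raw draw, e.g. {'max_depth': 7, 'num_trees': 42, ...}
print("raw:", raw)
print("transformed:", argsDict_tranform(raw))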
Model factory and evaluation output for the best model
def lightgbm_factory(argsDict, report=False):
    argsDict = argsDict_tranform(argsDict)
    params = {
        'boosting_type': 'gbdt',
        'max_depth': argsDict['max_depth'],
        'num_trees': argsDict['num_trees'],  # alias of n_estimators; set only one of the two
        'learning_rate': argsDict['learning_rate'],
        'bagging_fraction': argsDict['bagging_fraction'],  # only takes effect when bagging_freq > 0
        'num_leaves': argsDict['num_leaves'],
        'objective': 'multiclass',
        'min_child_samples': 20,
        'metric': 'multi_logloss',  # regression metric: 'rmse'
        # 'feature_fraction': 0.7,
        # 'lambda_l1': 0.6,
        # 'lambda_l2': 0.5,
        # 'bagging_seed': 100
    }
    model_lgb = lgb.LGBMClassifier(**params)
    model_lgb.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], early_stopping_rounds=500, verbose=5)
    return get_tranformer_score(model_lgb, report)
Evaluation functions (classification and regression)
def get_tranformer_score(tranformer, report):
    prediction = tranformer.predict(X_valid, num_iteration=tranformer.best_iteration_)
    if report:
        print(classification_report(y_valid, prediction))
        print(confusion_matrix(y_valid, prediction))
        pickle_write("model/lgb_best.pickle", tranformer)
    return -accuracy_score(y_valid, prediction)  # negated because fmin minimizes
### adapt this function to your own needs (regression variant)
def get_mse_score(tranformer):
    prediction = tranformer.predict(X_valid, num_iteration=tranformer.best_iteration_)
    return mean_squared_error(y_valid, prediction)
Run the parameter search:
trials = Trials()
algo = partial(tpe.suggest, n_startup_jobs=1)  # TPE, with a single random warm-up trial
best = fmin(lightgbm_factory, space, algo=algo, max_evals=20, pass_expr_memo_ctrl=None, trials=trials)
View the loss of each trial:
plt.plot(trials.losses())
# trials.statuses()
# trials.trials
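The Trials object also keeps a full record of every evaluation; for example, the best trial can be inspected directly:
# Lowest loss observed and the raw (untransformed) values that produced it.
print(trials.best_trial["result"]["loss"])
print(trials.best_trial["misc"]["vals"])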
Retrain with the best parameters:
ACC = lightgbm_factory(best, report=True)  # best holds the raw draws; the factory re-applies the transform
print('best:', best)
print('best params after transform:')
argsDict_tranform(best, isPrint=True)
print('accuracy of the best lightgbm:', -ACC)
Convert the model to C code for easier deployment later.
- Install the required library: pip install m2cgen.
import m2cgen as m2c
def model_2c(model, path):
    # m2cgen renders the trained model as standalone C source code
    code = m2c.export_to_c(model)
    with open(path, "w") as fw:
        fw.write(code)
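Hypothetical usage (the output path is just an example); the generated file is self-contained C source that can then be compiled with any C compiler:
# Export the tuned model; the resulting .c file exposes a scoring function.
model_2c(lgb_model, "model/lgb_model.c")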