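Today I was using xgboost's XGBRegressor, and when I pulled the feature importances at the end I found that plot_importance and feature_importances_ ranked the features differently.
It turns out that plot_importance defaults to importance_type='weight', whereas feature_importances_ defaults to importance_type='gain'. Setting importance_type='gain' in plot_importance makes the two rankings agree; only the scale differs, because feature_importances_ is normalized to sum to 1.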
So how does xgboost compute feature importance, and how do weight and gain differ?
Below is the explanation of importance_type from the plot_importance docstring, which lists three types: weight, gain, and cover.
importance_type : str, default "weight"
How the importance is calculated: either "weight", "gain", or "cover"
- "weight" is the number of times a feature appears in a tree
- "gain" is the average gain of splits which use the feature
- "cover" is the average coverage of splits which use the feature where coverage is defined as the number of samples affected by the split.
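weight is the number of times a feature appears in the boosted trees, that is, how many times the feature is used as a split node across all trees.
gain is the average gain, i.e. the improvement in the training objective, brought by the splits that use the feature across all trees.
cover is the relative number of observations associated with a feature. For example, suppose there are 100 observations, 4 features and 3 trees, and feature 1 is used to decide the leaf node for 10, 5 and 2 observations in tree 1, tree 2 and tree 3 respectively; the metric then counts the coverage of this feature as 10 + 5 + 2 = 17 observations. The same is computed for all 4 features, and each feature's cover is then expressed as a percentage of the total cover over all features. Feature 1's value is therefore 17 divided by that total, which equals 17% only if the four covers happen to sum to 100. Note that this example sums the cover over splits, whereas the "cover" in the docstring above averages it per split.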
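The three definitions can be worked through by hand on a toy list of splits. The sketch below is plain Python, not xgboost's implementation; the split records are hypothetical, except that feature f1 reuses the 10, 5 and 2 observations from the example above. Cover is treated as a plain sample count, which is a simplification (xgboost derives it from the second-order gradients, which coincide with sample counts for squared-error loss).

```python
from collections import defaultdict

# Hypothetical split records: (feature, gain of the split, cover of the split).
splits = [
    ("f1", 4.0, 10.0),  # tree 1
    ("f1", 2.0, 5.0),   # tree 2
    ("f1", 1.5, 2.0),   # tree 3
    ("f2", 6.0, 30.0),
    ("f2", 3.0, 20.0),
    ("f3", 0.5, 8.0),
    ("f4", 1.0, 10.0),
]

weight = defaultdict(int)         # number of splits using the feature
total_gain = defaultdict(float)   # summed gain over those splits
total_cover = defaultdict(float)  # summed cover over those splits

for feature, gain, cover in splits:
    weight[feature] += 1
    total_gain[feature] += gain
    total_cover[feature] += cover

# "gain" and "cover" as described in the docstring: averages per split.
avg_gain = {f: total_gain[f] / weight[f] for f in weight}
avg_cover = {f: total_cover[f] / weight[f] for f in weight}

# Relative cover as in the worked example: share of the summed cover.
cover_sum = sum(total_cover.values())
cover_share = {f: total_cover[f] / cover_sum for f in total_cover}

for f in sorted(weight):
    print(f, weight[f], avg_gain[f], round(avg_cover[f], 3), round(cover_share[f], 3))
```

With these made-up numbers the four covers sum to 85, so f1's total cover of 17 corresponds to a share of 20%. The averages also show why weight and gain can rank features differently: f1 is split on most often, but f2 has the larger average gain.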