How XGBoost's plot_importance and feature_importances_ are computed

Today I was using xgboost's XGBRegressor, and when I went to get the feature importances at the end, I found that plot_importance and feature_importances_ gave different feature rankings.

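It turns out that plot_importance defaults to importance_type='weight', whereas feature_importances_ defaults to importance_type='gain' (for tree boosters). Passing importance_type='gain' to plot_importance gives the same ranking. The values themselves still differ, because feature_importances_ is normalized to sum to 1 while plot_importance shows the raw scores.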

So how does xgboost actually compute feature importance, and how do weight and gain differ?

Below is the explanation of importance_type from the plot_importance docstring, which lists three types: weight, gain, and cover.

importance_type : str, default "weight"
How the importance is calculated: either "weight", "gain", or "cover"

"weight" is the number of times a feature appears in a tree
"gain" is the average gain of splits which use the feature
"cover" is the average coverage of splits which use the feature where coverage is defined as the number of samples affected by the split.

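weight is the number of times a feature appears in the boosted trees, that is, the number of times the feature is used as a split node across all trees.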

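gain is the average gain a feature brings over all the splits that use it, across all trees. Here the gain of a split is the improvement in the training objective achieved by that split.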
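The two definitions can be sketched in plain Python. The split records below are hypothetical numbers I made up, not output from a real model. They only show why counting splits and averaging gain can rank features differently.

```python
from collections import defaultdict

# Hypothetical (feature, gain) records, one per split node across all trees.
splits = [("f0", 12.0), ("f0", 4.0), ("f1", 9.0), ("f0", 2.0), ("f2", 1.0)]

count = defaultdict(int)
total_gain = defaultdict(float)
for feature, gain in splits:
    count[feature] += 1          # one more split that uses this feature
    total_gain[feature] += gain  # accumulate the gain of that split

# "weight": number of splits that use the feature.
weight = dict(count)
# "gain": average gain over the splits that use the feature.
avg_gain = {f: total_gain[f] / count[f] for f in count}

print("weight:", weight)
print("gain:", avg_gain)
print("top by weight:", max(weight, key=weight.get))
print("top by gain:", max(avg_gain, key=avg_gain.get))
```

In this toy data f0 is split on most often, so it wins on weight, but its average gain is lower than that of f1, which is used only once with a large gain.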

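cover is the relative number of records (observations) related to a feature. For example, suppose there are 100 observations, 4 features, and 3 trees, and feature 1 is used to decide the leaf node for 10, 5, and 2 observations in tree 1, tree 2, and tree 3 respectively. The cover metric then counts the coverage of this feature as 10 + 5 + 2 = 17 observations. The same is computed for all 4 features, and the cover of feature 1 is reported as 17 expressed as a percentage of the summed cover of all features (not as 17%). Note that this example describes a total that is normalized across features, whereas the docstring above defines "cover" as the average coverage per split.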
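The arithmetic in this example can be written out as a short sketch. Only the counts for feature 1 (10, 5, 2) come from the example above. The counts for the other three features are hypothetical values I picked so the result comes out round.

```python
# Observations covered per tree, keyed by feature.
# feature1 comes from the example; the other rows are hypothetical.
covered_per_tree = {
    "feature1": [10, 5, 2],
    "feature2": [20, 15, 5],
    "feature3": [10, 10, 5],
    "feature4": [1, 1, 1],
}

# Total coverage per feature: sum over the three trees.
total_cover = {f: sum(counts) for f, counts in covered_per_tree.items()}

# Relative cover: each feature's total as a share of all features' totals.
grand_total = sum(total_cover.values())
relative_cover = {f: c / grand_total for f, c in total_cover.items()}

print(total_cover["feature1"])     # 17
print(grand_total)                 # 85
print(relative_cover["feature1"])  # 17 / 85
```

With these made-up counts, feature 1 has a total cover of 17 out of 85, so its relative cover is 20%, not 17%.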