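This article is reposted from "A Summary of Encoding Categorical Features on Kaggle".
Kaggle competitions are, at heart, contests of well-worn recipes. This article covers the common recipes for handling categorical features in Kaggle competitions, mainly for tree-based models (lightgbm, xgboost, etc.). The focus is on target encoding and beta target encoding.
Summary:
- label encoding: the feature has an intrinsic order (ordinal feature);
- one hot encoding: the feature has no intrinsic order and fewer than 4 categories;
- target encoding (mean encoding, likelihood encoding, impact encoding): the feature has no intrinsic order and more than 4 categories;
- beta target encoding: the feature has no intrinsic order and more than 4 categories, used with K-fold cross validation;
- no preprocessing (the model encodes automatically): CatBoost, lightgbm.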
1. Label encoding
For a feature with m categories, label encoding maps each category to an integer between 0 and m-1. Label encoding suits ordinal features (features with an intrinsic order).
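import numpy as np
from sklearn.preprocessing import LabelEncoder

# train -> training dataframe
# test -> test dataframe
# cat_cols -> categorical columns
for col in cat_cols:
    le = LabelEncoder()
    # fit on train and test together so both share one category-to-integer mapping
    le.fit(np.concatenate([train[col], test[col]]))
    train[col] = le.transform(train[col])
    test[col] = le.transform(test[col])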
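One caveat worth illustrating: `LabelEncoder` numbers categories in sorted order of their values, which need not match a feature's intrinsic order. A minimal sketch, assuming a hypothetical ordinal column `size` whose order is known in advance, uses a pandas ordered categorical to make the order explicit:

```python
import pandas as pd

# Hypothetical toy data: 'size' is ordinal with a known order.
df = pd.DataFrame({"size": ["medium", "small", "large", "small"]})
size_order = ["small", "medium", "large"]  # assumed intrinsic order

# An ordered categorical assigns codes 0..m-1 following size_order.
df["size_encoded"] = pd.Categorical(
    df["size"], categories=size_order, ordered=True
).codes

print(df)
```

Sorted order would have put "large" first, so spelling out the order keeps the integer codes meaningful for a tree split such as `size_encoded <= 1`.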
2. One-hot encoding (OHE)
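For a feature with m categories, one-hot encoding (OHE) produces m binary features, one per category. These m binary features are mutually exclusive: exactly one is active for any given row.
One-hot encoding removes the problem of the original feature lacking an intrinsic order. Its drawback is that for a high-cardinality categorical feature (one with many categories) the encoded feature space becomes very large (PCA can be considered here to reduce the dimensionality). In addition, because one-hot features are quite unbalanced, each split in a tree model yields only a small gain, so the tree usually has to grow very deep to reach good accuracy. OHE is therefore generally used when there are fewer than 4 categories.
Reference: Using Categorical Data with One Hot Encoding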
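import pandas as pd

# train -> training dataframe
# test -> test dataframe
# cat_cols -> categorical columns
# stack train and test so both get the same dummy columns
df = pd.concat([train, test], ignore_index=True)
original_column = list(df.columns)
# dummy_na=True adds an extra indicator column for missing values
df = pd.get_dummies(df, columns=cat_cols, dummy_na=True)
new_column = [c for c in df.columns if c not in original_column]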
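To see what the encoding produces, here is a small sketch on a hypothetical three-category column `color`; the data and column names are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})
encoded = pd.get_dummies(df, columns=["color"], dtype=int)

# m = 3 categories become 3 binary columns, named <column>_<category>.
print(list(encoded.columns))

# Each row activates exactly one of them.
row_sums = encoded.sum(axis=1)
print(row_sums.tolist())
```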
3. Target encoding (or likelihood encoding, impact encoding, mean encoding)
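Target encoding uses the target mean value (within each category) to encode a categorical feature. To reduce target variable leakage, the mainstream approach computes the target mean with 2 levels of cross-validation, as follows:
- Split the train data into 20 folds (for example, infold: folds #2-20, out of fold: fold #1).
- Split each infold (folds #2-20) again into 10 folds (for example, inner_infold: folds #2-10, inner_oof: fold #1).
- Compute the inner out-of-fold values for the 10 folds (for example, use the target mean of inner_infold #2-10 as the predicted value for inner_oof #1).
- Average the 10 inner out-of-fold values to obtain inner_oof_mean.
- Compute oof_mean (for example, use the inner_oof_mean of infold #2-20 to predict the oof_mean of out-of-fold #1).
- Map the oof_mean of the train data onto the test data to complete the encoding.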
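The basic building block of this procedure, encoding each row with a target mean computed only from other folds, can be sketched as follows. This is a simplified single-level version on made-up data, not the two-level scheme itself; the function name `oof_target_encode` and its parameters are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import KFold


def oof_target_encode(df, feature, target, n_splits=5, seed=0):
    """Encode `feature` with out-of-fold target means (single level)."""
    global_mean = df[target].mean()
    encoded = pd.Series(index=df.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for infold, oof in kf.split(df):
        # Category means come from the in-fold rows only.
        means = df.iloc[infold].groupby(feature)[target].mean()
        # Categories unseen in the in-fold rows fall back to the global mean.
        encoded.iloc[oof] = (
            df.iloc[oof][feature].map(means).fillna(global_mean).to_numpy()
        )
    return encoded


toy = pd.DataFrame({
    "city": ["a", "a", "b", "b", "a", "b", "c", "a", "b", "a"],
    "y":    [1,   0,   1,   1,   0,   1,   0,   1,   0,   1],
})
toy["city_te"] = oof_target_encode(toy, "city", "y")
print(toy)
```

Category "c" occurs only once, so whenever its row is out of fold the category is unseen in the in-fold data and receives the global mean. This is the same fallback role that the default means play in the two-level version.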
Reference: Likelihood encoding of categorical features
Open source package category_encoders: scikit-learn-contrib/categorical-encoding
import pandas as pd
from sklearn.model_selection import KFold

# train -> training dataframe
# test -> test dataframe
# feature -> name of the categorical column to encode
# target -> name of the target column
n_folds = 20
n_inner_folds = 10
likelihood_coding_map = {}
oof_default_mean = train[target].mean()  # global prior mean
kf = KFold(n_splits=n_folds, shuffle=True)
likelihood_encoded_parts = []  # encoded out-of-fold rows, one Series per outer fold
oof_mean_cv_parts = []  # inner category means, one DataFrame per outer fold
for split, (infold, oof) in enumerate(kf.split(train[feature])):
    print('==============level 1 encoding..., fold %s ============' % split)
    infold_df = train.iloc[infold]
    inner_kf = KFold(n_splits=n_inner_folds, shuffle=True)
    inner_oof_default_mean = infold_df[target].mean()
    inner_oof_means = []
    for inner_split, (inner_infold, inner_oof) in enumerate(inner_kf.split(infold_df)):
        print('==============level 2 encoding..., inner fold %s ============' % inner_split)
        # inner out of fold mean; inner_infold holds positions within infold_df, not within train
        inner_oof_means.append(infold_df.iloc[inner_infold].groupby(by=feature)[target].mean())
    # one column per inner split; categories missing from a split get the infold mean
    inner_oof_mean_cv = pd.concat(inner_oof_means, axis=1, keys=range(n_inner_folds))
    inner_oof_mean_cv = inner_oof_mean_cv.fillna(inner_oof_default_mean)
    oof_mean_cv_parts.append(inner_oof_mean_cv)
    print('============final mapping...===========')
    # encode the out-of-fold rows with the average of the inner means
    inner_oof_mean = inner_oof_mean_cv.mean(axis=1)
    likelihood_encoded_parts.append(
        train.iloc[oof][feature].map(inner_oof_mean).fillna(oof_default_mean))
######################################### map into test dataframe
train[feature] = pd.concat(likelihood_encoded_parts)  # aligned back to train by index
oof_mean_cv = pd.concat(oof_mean_cv_parts, axis=1, keys=range(n_folds)).fillna(oof_default_mean)
likelihood_coding_mapping = oof_mean_cv.mean(axis=1)
default_coding = oof_default_mean
likelihood_coding_map[feature] = (likelihood_coding_mapping, default_coding)
mapping, default_mean = likelihood_coding_map[feature]
test[feature] = test[feature].map(mapping).fillna(default_mean)
4. Beta target encoding
I first saw this method in the 14th-place solution write-up for the Kaggle competition Avito Demand Prediction Challenge: 14th Place Solution: The Almost Golden Defenders
Like target encoding, beta target encoding also uses the target mean value (within each category) to encode a categorical feature. The difference is that, to further reduce target variable leakage, beta target encoding happens inside the 5-fold CV rather than before it:
- split the train data into 5 folds (5-fold cross validation)
- target encoding based on infold data
- train model
- get out of fold prediction
Beta target encoding also adds a smoothing term, replacing the mean with a Bayesian mean. The idea of the Bayesian mean (Bayesian average): if a category has few samples (< N_min), its mean is noisy, so the data is padded to achieve smoothing. The padded samples take the value of the prior mean. N_min is a regularization term: the larger N_min, the stronger the regularization.
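As a worked illustration of the padding idea, the sketch below computes the Bayesian mean for two made-up categories; the helper name `bayesian_mean` and the numbers are assumptions chosen for the example. A category with fewer than N_min samples is padded with N_min - count pseudo-samples at the prior mean, while a category that already has at least N_min samples keeps its plain mean.

```python
def bayesian_mean(target_sum, count, prior_mean, n_min):
    """Pad a category up to n_min samples using the prior mean."""
    n_prior = max(n_min - count, 0)  # number of pseudo-samples added
    return (prior_mean * n_prior + target_sum) / (n_prior + count)


prior_mean = 0.5
n_min = 10

# Rare category: 2 samples, both positive (raw mean 1.0).
rare = bayesian_mean(target_sum=2, count=2, prior_mean=prior_mean, n_min=n_min)
# Frequent category: 20 samples, 5 positive (raw mean 0.25).
frequent = bayesian_mean(target_sum=5, count=20, prior_mean=prior_mean, n_min=n_min)

print(rare)      # (0.5 * 8 + 2) / 10 = 0.6, pulled toward the prior
print(frequent)  # 5 / 20 = 0.25, unchanged
```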
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# train -> training dataframe
# test -> test dataframe
# N_min -> smoothing term, minimum sample size; a category with fewer than N_min samples is padded up to N_min
# target_col -> target column
# cat_cols -> categorical columns

# Step 1: fill NA in train and test dataframe

# Step 2: 5-fold CV (beta target encoding within each fold)
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for i, (dev_index, val_index) in enumerate(kf.split(train.index.values)):
    # split data into dev set and validation set (kf.split yields positions, hence iloc)
    dev = train.iloc[dev_index].reset_index(drop=True)
    val = train.iloc[val_index].reset_index(drop=True)

    feature_cols = []
    for var_name in cat_cols:
        feature_name = f'{var_name}_mean'
        feature_cols.append(feature_name)

        prior_mean = np.mean(dev[target_col])
        # per-category target sum and count, computed on the dev set only
        stats = dev.groupby(var_name)[target_col].agg(['sum', 'count']).reset_index()

        ### beta target encoding by Bayesian average for the dev, val and test sets
        for df in (dev, val, test):
            df_stats = pd.merge(df[[var_name]], stats, how='left')
            # categories unseen in dev act as a single sample at the prior mean
            df_stats['sum'] = df_stats['sum'].fillna(prior_mean)
            df_stats['count'] = df_stats['count'].fillna(1.0)
            N_prior = np.maximum(N_min - df_stats['count'].values, 0)  # prior parameters
            # Bayesian mean is equivalent to adding N_prior data points of value prior_mean to the data set.
            bayesian_mean = (prior_mean * N_prior + df_stats['sum']) / (N_prior + df_stats['count'])
            df[feature_name] = bayesian_mean.values  # assign by position, not by index
        del df_stats, stats

    # Step 3: train model (K-fold CV), get oof prediction
In addition, target encoding and beta target encoding need not use the target mean (or Bayesian mean). Other statistics can be used too, including median, frequency, mode, variance, skewness, and kurtosis, or any statistic that correlates with the target.
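A minimal sketch of computing a few such per-category statistics with pandas, on made-up data with hypothetical column names `cat` and `y`. In practice these would be computed inside the CV folds in the same way as the mean.

```python
import pandas as pd

df = pd.DataFrame({
    "cat": ["a", "a", "a", "b", "b", "c"],
    "y":   [1.0, 2.0, 6.0, 4.0, 8.0, 5.0],
})

# Per-category statistics of the target; 'count' serves as the frequency.
stats = df.groupby("cat")["y"].agg(["median", "count", "var"])
stats.columns = [f"cat_{name}" for name in stats.columns]

# Attach the statistics back onto each row as new features.
df = df.join(stats, on="cat")
print(df)
```

Category "c" has a single sample, so its variance is undefined and comes out as NaN. Tree libraries such as lightgbm and xgboost can usually work with missing values, but it is worth deciding explicitly how such rows should be filled.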
5. No preprocessing (the model encodes automatically)
XGBoost and Random Forest cannot handle categorical features directly; these must first be encoded as numerical features.
lightgbm and CatBoost can handle categorical features directly.
lightgbm: label encoding is required first. It uses a dedicated algorithm (On Grouping for Maximum Homogeneity) to find the optimal split, which works better than OHE. One-hot encoding can also be chosen. Features - LightGBM documentation
CatBoost: label encoding is not required first. One-hot encoding or target encoding (with regularization) can be chosen. CatBoost — Transforming categorical features to numerical features — Yandex Technologies
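As a hedged sketch of the data preparation this implies, the snippet below label-encodes made-up string columns into integer codes with pandas. Handing the result to a model is left as a comment because it depends on the installed library; the parameter names `categorical_feature` (LightGBM) and `cat_features` (CatBoost) come from those libraries' documentation and should be checked against the version in use.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["paris", "tokyo", "paris", "lima"],
    "device": ["ios", "android", "ios", "web"],
})
cat_cols = ["city", "device"]

# Label-encoded copy: each category becomes a non-negative integer code.
df_codes = df.copy()
for col in cat_cols:
    df_codes[col] = df_codes[col].astype("category").cat.codes

print(df_codes)

# With LightGBM, the integer columns would then be declared as categorical,
# e.g. through the categorical_feature argument of lightgbm.Dataset.
# With CatBoost, the original string columns in df can be passed as they are,
# naming them through the cat_features argument.
```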
Reference: https://towardsdatascience.com/catboost-vs-light-gbm-vs-xgboost-5f93620723db