Ctrip User Churn Early-Warning Project

1. Project Overview

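  • Background
    Ctrip is a leading comprehensive travel service company in China, providing a full range of travel services to more than 250 million members every day. The resulting volume of site traffic yields user-behavior data that can be mined for useful information. Customer churn rate is one of the key indicators of business performance. This project aims to understand user profiles and behavioral preferences, find the best-performing algorithm, and identify the key factors that drive churn, so that product design and user experience can be improved.
  • Evaluation criterion
    The score is the recall at 97% precision: among all operating points with precision >= 0.97, take max(recall).
  • Dataset
    The dataset contains 49 indicators (one week of data). The prediction target is the churned sample (label=1). The indicators are grouped into order-related, hotel-related, and customer-behavior-related categories.


    [Figure: indicator categories]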
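The evaluation criterion can be made concrete with a small NumPy-only sketch. The helper name max_recall_at_precision and the toy labels and scores are assumptions for illustration; the model sections later in this article obtain the same quantity from sklearn's precision_recall_curve.

```python
import numpy as np

def max_recall_at_precision(y_true, y_score, min_precision=0.97):
    """Largest recall over all score cut-offs whose precision >= min_precision."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    order = np.argsort(-y_score, kind="mergesort")   # rows sorted by descending score
    y_sorted = y_true[order]
    s_sorted = y_score[order]
    tp = np.cumsum(y_sorted == 1)                    # true positives when cutting after each row
    predicted = np.arange(1, len(y_sorted) + 1)      # predicted positives when cutting after each row
    # Only cut where the score changes, so rows with tied scores stay together.
    is_cut = np.append(s_sorted[1:] != s_sorted[:-1], True)
    precision = tp[is_cut] / predicted[is_cut]
    recall = tp[is_cut] / max(int((y_true == 1).sum()), 1)
    ok = precision >= min_precision
    return float(recall[ok].max()) if ok.any() else 0.0

# Toy data (made up): 4 churned users (label=1) and 4 retained users.
y_true = [1, 1, 1, 0, 1, 0, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
toy_recall = max_recall_at_precision(y_true, y_score)
print(toy_recall)
```

In the toy data the three highest-scored users are all churned, so the best cut-off that still keeps precision at or above 0.97 recovers three of the four churned users.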

2. Project Workflow

2.1 Data Processing

2.1.1 Target Feature Distribution

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
from sklearn.cluster import KMeans
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler,OneHotEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn import metrics
df_orign = pd.read_csv('userlostprob_train.txt', sep='\t')
df = df_orign.copy()
df['label'].value_counts()

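Output:


[Figure: distribution of the label field]

The ratio of churned to retained users is roughly 2:5. The classes are not severely imbalanced, so no rebalancing is applied here.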

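2.1.2 Handling Outliers

  • describe() shows that some price-related fields (delta_price1, delta_price2, lowestprice) contain negative values.


    [Figure: price fields containing negative values]

    The number of negative records per field:


    [Figure: negative-record counts for the price fields]

    Distributions of the three fields:
    [Figure: distribution of delta_price1]

    [Figure: distribution of delta_price2]

    [Figure: distribution of lowestprice]

As the figures show: delta_price1 [negative values >= 25%], delta_price2 [negative values >= 25%], lowestprice [only 1 negative record]. Given the field distributions (approximately normal) and practical requirements, negative prices are replaced with 0 here.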
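The negative-value counts are shown only as a screenshot. A minimal sketch of how such counts could be computed follows; the toy frame and its values are made up for illustration, and only the three column names come from the text.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real table; the values are invented.
toy = pd.DataFrame({
    'delta_price1': [-30.0, 15.0, np.nan, -5.0],
    'delta_price2': [10.0, -2.0, 40.0, 8.0],
    'lowestprice':  [120.0, 99.0, -1.0, np.nan],
})
price_cols = ['delta_price1', 'delta_price2', 'lowestprice']

# Comparisons with NaN evaluate to False, so missing values are not counted as negative.
negative_counts = (toy[price_cols] < 0).sum()
negative_share = (toy[price_cols] < 0).mean()   # share of all rows, missing rows included
print(negative_counts)
print(negative_share)
```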

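price_cols = ['delta_price1','delta_price2','lowestprice']
# Replace negative prices with 0; NaN values are left untouched
df[price_cols] = df[price_cols].clip(lower=0)
  • Login duration within the past 24 hours (landhalfhours) should not exceed 24 hours, so values greater than 24 are set to 24
df.loc[df['landhalfhours']>24,['landhalfhours']] = 24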

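2.1.3 Handling Missing Values

Missing values per field:


[Figure: missing values per field]

As the figure shows, almost every field has missing values, and all fields with missing values are continuous. historyvisit_7ordernum is more than 80% missing, which leaves too little to analyze, so it is a candidate for removal.
For the remaining fields, the missing values are filled with substitute values.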
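The per-field missing ratio and the 80% rule can be sketched as below. The toy frame and the variable names missing_ratio and to_drop are assumptions for illustration; only the field names and the 80% cut-off come from the text.

```python
import numpy as np
import pandas as pd

# Toy stand-in; the values are invented.
toy = pd.DataFrame({
    'historyvisit_7ordernum': [np.nan, np.nan, np.nan, np.nan, 2.0, np.nan],
    'cancelrate':             [0.1, np.nan, 0.3, 0.2, 0.5, 0.4],
    'label':                  [0, 1, 0, 0, 1, 0],
})

# Share of missing values per column, largest first.
missing_ratio = toy.isnull().mean().sort_values(ascending=False)

# Columns above the 80% cut-off are dropped; the rest are kept for filling.
to_drop = missing_ratio[missing_ratio > 0.8].index.tolist()
toy_reduced = toy.drop(columns=to_drop)
print(missing_ratio)
print(to_drop)
```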

  • Filling from other fields
    A correlation analysis shows that:
    commentnums_pre and novoters_pre are strongly correlated;
    commentnums_pre2 and novoters_pre2 are strongly correlated.
(df['commentnums_pre']/df['novoters_pre']).describe()

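Output:


[Figure: summary statistics of the rating rate]

The median of the result above, 65%, is taken as the rating rate: novoters_pre*65% fills missing commentnums_pre, and commentnums_pre/65% fills missing novoters_pre.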

def fill_commentnum_novoter_pre(x):
    if (x.isnull()['commentnums_pre'])&(x.notnull()['novoters_pre']):
        x['commentnums_pre'] = x['novoters_pre']*0.65
    elif (x.notnull()['commentnums_pre'])&(x.isnull()['novoters_pre']):
        x['novoters_pre'] = x['commentnums_pre']/0.65
    else:
        return x
    return x
df[['commentnums_pre','novoters_pre']] = df[['commentnums_pre','novoters_pre']].apply(fill_commentnum_novoter_pre,axis=1)
df[['commentnums_pre','novoters_pre']].info()

Output:


[Figure: missing values after filling]

This fills part of the missing values in commentnums_pre and novoters_pre; the rest are filled with the median.
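The same ratio-based fill can also be written without a row-wise apply, which tends to be much faster on large tables. This is a sketch on invented toy values; RATE is the 65% median quoted above.

```python
import numpy as np
import pandas as pd

RATE = 0.65  # median of commentnums_pre / novoters_pre reported in the text

# Toy stand-in; the values are invented.
toy = pd.DataFrame({
    'commentnums_pre': [np.nan, 13.0, np.nan, 26.0],
    'novoters_pre':    [100.0, np.nan, np.nan, 40.0],
})

# fillna aligns on the index, so each gap is filled from the other column of the same row.
toy['commentnums_pre'] = toy['commentnums_pre'].fillna(toy['novoters_pre'] * RATE)
toy['novoters_pre'] = toy['novoters_pre'].fillna(toy['commentnums_pre'] / RATE)
print(toy)
```

Rows where both fields are missing stay missing, which matches the text: they are left for the later median or mean fill.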

commentnums_pre2 and novoters_pre2 are filled the same way; their remaining missing values are filled with the mean.

def fill_commentnum_novoter_pre2(x):
    if (x.isnull()['commentnums_pre2'])&(x.notnull()['novoters_pre2']):
        x['commentnums_pre2'] = x['novoters_pre2']*0.65
    elif (x.notnull()['commentnums_pre2'])&(x.isnull()['novoters_pre2']):
        x['novoters_pre2'] = x['commentnums_pre2']/0.65
    else:
        return x
    return x
df[['commentnums_pre2','novoters_pre2']] = df[['commentnums_pre2','novoters_pre2']].apply(fill_commentnum_novoter_pre2,axis=1)
  • Filling with the mean, the median, or 0
# Mean (fields that are approximately normal and little affected by extreme values)
fill_mean = ['cancelrate','landhalfhours','visitnum_oneyear','starprefer','price_sensitive','lowestprice','customereval_pre2',
            'uv_pre2','lowestprice_pre2','novoters_pre2','commentnums_pre2','businessrate_pre2','lowestprice_pre','hotelcr','cancelrate_pre']
df[fill_mean] = df[fill_mean].apply(lambda x:x.fillna(x.mean()))
# Median (each field listed once)
fill_median = ['ordernum_oneyear','commentnums_pre','novoters_pre','uv_pre','ordercanncelednum','ordercanceledprecent',
               'lasthtlordergap','cityuvs','cityorders','lastpvgap','historyvisit_avghotelnum','businessrate_pre','cr','cr_pre',
               'novoters','hoteluv','ctrip_profits','customer_value_profit']
df[fill_median] = df[fill_median].apply(lambda x:x.fillna(x.median()))
# Fill with 0
df[['deltaprice_pre2_t1','historyvisit_visit_detailpagenum']] = df[['deltaprice_pre2_t1','historyvisit_visit_detailpagenum']].apply(lambda x:x.fillna(0))
  • Cluster-based filling
    commentnums is strongly correlated with novoters, cancelrate, and hoteluv, so commentnums is filled with the median of its cluster.
# commentnums: number of reviews of the current hotel
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
km = KMeans(n_clusters=4)
data = df.loc[:,['commentnums','novoters','cancelrate','hoteluv']]
ss = StandardScaler()  # clustering is distance-based, so standardize first
data[['novoters','cancelrate','hoteluv']] = ss.fit_transform(data[['novoters','cancelrate','hoteluv']])

km.fit(data.iloc[:,1:])
data['label_pred'] = km.labels_
#metrics.calinski_harabaz_score(data.iloc[:,1:],km.labels_)
# Fill missing commentnums with the median of the row's cluster
for k in range(km.n_clusters):
    in_cluster = data['label_pred'] == k
    data.loc[data['commentnums'].isnull() & in_cluster,'commentnums'] = data.loc[in_cluster,'commentnums'].median()
df['commentnums'] = data['commentnums']
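The per-cluster fill can also be expressed with groupby and transform. The sketch below assumes the cluster labels are already available in a label_pred column (here typed in by hand rather than produced by KMeans), and the values are invented.

```python
import numpy as np
import pandas as pd

# Toy stand-in; label_pred plays the role of the KMeans output.
toy = pd.DataFrame({
    'commentnums': [10.0, np.nan, 30.0, 200.0, np.nan, 400.0],
    'label_pred':  [0, 0, 0, 1, 1, 1],
})

# transform('median') returns one value per row: the median of that row's cluster,
# computed over the non-missing values only.
cluster_median = toy.groupby('label_pred')['commentnums'].transform('median')
toy['commentnums'] = toy['commentnums'].fillna(cluster_median)
print(toy)
```

This form does not depend on the number of clusters, so the same two lines work for the 4-, 5-, and 6-cluster fills.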

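Likewise, cluster on starprefer and consuming_capacity and fill missing avgprice with the mean avgprice of each cluster. KMeans does not accept NaN input, so consuming_capacity must already be filled at this point; the segment-based fill of consuming_capacity described later in this section has to run first.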

# avgprice: starprefer, consuming_capacity
km = KMeans(n_clusters=5)
data = df.loc[:,['avgprice','starprefer','consuming_capacity']]
ss = StandardScaler()  # clustering is distance-based, so standardize first
data[['starprefer','consuming_capacity']] = ss.fit_transform(data[['starprefer','consuming_capacity']])
km.fit(data.iloc[:,1:])
data['label_pred'] = km.labels_
#metrics.calinski_harabaz_score(data.iloc[:,1:],km.labels_)
# Fill missing avgprice with the mean of the row's cluster
for k in range(km.n_clusters):
    in_cluster = data['label_pred'] == k
    data.loc[data['avgprice'].isnull() & in_cluster,'avgprice'] = data.loc[in_cluster,'avgprice'].mean()
df['avgprice'] = data['avgprice']

Cluster on consuming_capacity and avgprice, then fill delta_price1 with the median of each cluster

# delta_price1: consuming_capacity, avgprice
km = KMeans(n_clusters=6)
data = df.loc[:,['delta_price1','consuming_capacity','avgprice']]
ss = StandardScaler()  # clustering is distance-based, so standardize first
data[['consuming_capacity','avgprice']] = ss.fit_transform(data[['consuming_capacity','avgprice']])

km.fit(data.iloc[:,1:])
data['label_pred'] = km.labels_
#metrics.calinski_harabaz_score(data.iloc[:,1:],km.labels_)
# Fill with the per-cluster median. The original run hard-coded the values
# 187, 100, 26, 1269, 323, 573 for clusters 0-5; KMeans cluster numbering is not
# stable across runs, so the medians are computed here instead.
for k in range(km.n_clusters):
    in_cluster = data['label_pred'] == k
    data.loc[data['delta_price1'].isnull() & in_cluster,'delta_price1'] = data.loc[in_cluster,'delta_price1'].median()
df['delta_price1'] = data['delta_price1']

Cluster on consuming_capacity and avgprice, then fill delta_price2 with the median delta_price2 of each cluster

# delta_price2: consuming_capacity, avgprice
km = KMeans(n_clusters=5)
data = df.loc[:,['delta_price2','avgprice','consuming_capacity']]
ss = StandardScaler()  # clustering is distance-based, so standardize first
data[['avgprice','consuming_capacity']] = ss.fit_transform(data[['avgprice','consuming_capacity']])

km.fit(data.iloc[:,1:])
data['label_pred'] = km.labels_
#metrics.calinski_harabaz_score(data.iloc[:,1:],km.labels_)
# Fill with the per-cluster median. The original run hard-coded the values
# 91, 419, 18, 205, 1042 for clusters 0-4; KMeans cluster numbering is not
# stable across runs, so the medians are computed here instead.
for k in range(km.n_clusters):
    in_cluster = data['label_pred'] == k
    data.loc[data['delta_price2'].isnull() & in_cluster,'delta_price2'] = data.loc[in_cluster,'delta_price2'].median()
df['delta_price2'] = data['delta_price2']
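  • Segment-based filling
    consuming_capacity is related to starprefer, so consuming_capacity is filled by starprefer segment.
    Summary statistics of the two fields:


    [Figure: summary statistics 0]

    [Figure: summary statistics 1]

    [Figure: summary statistics 2]

    [Figure: summary statistics 3]

    Based on these statistics, starprefer is split into three segments, and missing consuming_capacity values are filled with the mean consuming_capacity of the corresponding segment.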

# Three starprefer segments: <60, [60, 80), >=80
# (starprefer was mean-filled above, so every row falls into exactly one segment)
seg_low = df['starprefer']<60
seg_mid = (df['starprefer']>=60)&(df['starprefer']<80)
seg_high = df['starprefer']>=80
for seg in (seg_low, seg_mid, seg_high):
    # Selecting a single column label yields a Series, so .mean() is a scalar
    seg_mean = df.loc[seg,'consuming_capacity'].mean()
    df.loc[seg & df['consuming_capacity'].isnull(),'consuming_capacity'] = seg_mean

This completes the missing-value handling.

2.2 Feature Engineering

2.2.1 New Fields

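  • Time fields
    New fields: booking_gap (days between the visit date and the check-in date), week_day (day of the week of the check-in date), and is_weekend (whether the check-in date falls on a weekend)
# Dates are formatted as year-month-day
df[['d','arrival']] = df[['d','arrival']].apply(lambda x:pd.to_datetime(x,format='%Y-%m-%d'))
# Days between the visit date and the check-in date
df['booking_gap'] = ((df['arrival']-df['d'])/np.timedelta64(1,'D')).astype(int)
# Day of the week of the check-in date (Monday=0, Sunday=6)
df['week_day'] = df['arrival'].map(lambda x:x.weekday())
# Whether the check-in date falls on a weekend
df['is_weekend'] = df['week_day'].map(lambda x: 1 if x in (5,6) else 0)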
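The three time fields can also be derived with the pandas .dt accessor instead of row-wise lambdas. The sketch below uses two invented date pairs; only the column names come from the text.

```python
import pandas as pd

# Toy stand-in; the dates are invented.
toy = pd.DataFrame({
    'd':       ['2016-05-18', '2016-05-20'],
    'arrival': ['2016-05-21', '2016-05-23'],
})
toy['d'] = pd.to_datetime(toy['d'], format='%Y-%m-%d')
toy['arrival'] = pd.to_datetime(toy['arrival'], format='%Y-%m-%d')

toy['booking_gap'] = (toy['arrival'] - toy['d']).dt.days        # whole days between visit and check-in
toy['week_day'] = toy['arrival'].dt.weekday                      # Monday=0 ... Sunday=6
toy['is_weekend'] = toy['week_day'].isin([5, 6]).astype(int)     # 1 for Saturday or Sunday
print(toy)
```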
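  • Same-user identifier [built from a subset of customer-behavior indicators]
    Inspecting the sid field shows that 95% of records come from returning users and very few from new users. Some users may place several orders within the week, so to make the later train/validation split easier, a user_tag is added to identify orders that belong to the same user.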
df['user_tag'] = df['ordercanceledprecent'].map(str) + df['ordercanncelednum'].map(str) + df['ordernum_oneyear'].map(str) +\
                  df['starprefer'].map(str) + df['consuming_capacity'].map(str) + \
                 df['price_sensitive'].map(str) + df['customer_value_profit'].map(str) + df['ctrip_profits'].map(str) +df['visitnum_oneyear'].map(str) + \
                  df['historyvisit_avghotelnum'].map(str) + df['businessrate_pre2'].map(str) +\
                df['historyvisit_visit_detailpagenum'].map(str) + \
                  df['delta_price2'].map(str) +  \
                df['commentnums_pre2'].map(str) + df['novoters_pre2'].map(str) +df['customereval_pre2'].map(str) + df['lowestprice_pre2'].map(str)
df['user_tag'] = df['user_tag'].apply(lambda x : hash(x))
df['user_tag'].unique().shape

This returns 670226, i.e. 670226 distinct users placed orders during this week.
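Two details of this kind of tag are worth a sketch. Python's built-in hash() of a string is salted per interpreter process unless PYTHONHASHSEED is fixed, so the tag values differ between runs, and concatenating fields without a separator lets different field combinations produce the same string. The helper stable_user_tag below is a hypothetical alternative using hashlib, not the article's method, and the field values are invented.

```python
import hashlib

def stable_user_tag(values):
    """Hash a sequence of field values into a reproducible 64-bit integer."""
    # The separator keeps ('2', '11.0') and ('21', '1.0') from collapsing into one string.
    joined = '|'.join(str(v) for v in values)
    digest = hashlib.md5(joined.encode('utf-8')).hexdigest()
    return int(digest[:16], 16)

tag_a = stable_user_tag([0.5, 2, 11.0, 66.7])
tag_b = stable_user_tag([0.5, 2, 11.0, 66.7])   # same fields, same tag
tag_c = stable_user_tag([0.5, 21, 1.0, 66.7])   # different fields

# Without a separator the two different records above would concatenate identically.
no_sep_a = ''.join(str(v) for v in [0.5, 2, 11.0, 66.7])
no_sep_c = ''.join(str(v) for v in [0.5, 21, 1.0, 66.7])
print(tag_a == tag_b, tag_a == tag_c, no_sep_a == no_sep_c)
```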

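  • User and hotel fields
    A subset of user-related fields is clustered to create the user field user_group, and a subset of hotel-related fields is clustered to create the hotel field hotel_group.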
user_group = ['ordercanceledprecent','ordercanncelednum','ordernum_oneyear',
             'historyvisit_visit_detailpagenum','historyvisit_avghotelnum']
hotel_group = ['commentnums', 'novoters', 'lowestprice', 'hotelcr', 'hoteluv', 'cancelrate']
# Standardize before clustering
km_user = pd.DataFrame(df[user_group])
km_hotel = pd.DataFrame(df[hotel_group])
ss = StandardScaler()
for i in range(km_user.shape[1]):
    km_user[user_group[i]] = ss.fit_transform(df[user_group[i]].values.reshape(-1, 1)).ravel()
ss = StandardScaler()
for i in range(km_hotel.shape[1]):
    km_hotel[hotel_group[i]] = ss.fit_transform(df[hotel_group[i]].values.reshape(-1, 1)).ravel()
df['user_group'] = KMeans(n_clusters=3).fit_predict(km_user)
# score = metrics.calinski_harabaz_score(km_user,KMeans(n_clusters=3).fit(km_user).labels_)
# print('Calinski-Harabasz index of the clustering: %f'%(score)) #3:218580.269018  4:218580.416497 5:218581.368953 6:218581.203569
df['hotel_group'] = KMeans(n_clusters=5).fit_predict(km_hotel)
# score = metrics.calinski_harabaz_score(km_hotel,KMeans(n_clusters=3).fit(km_hotel).labels_)
# print('Calinski-Harabasz index of the clustering: %f'%(score))  #3:266853.481135  4:268442.314369 5:268796.468103 6:268796.707149

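2.2.2 Discretizing Continuous Features

Most values of historyvisit_avghotelnum are below 5, so the field is binarized into <=5 and >5;
most values of ordercanncelednum are below 5, so the field is binarized into <=5 and >5;
sid equal to 1 marks a new visitor and is set to 0, while all other values are set to 1 for returning users;
avgprice, lowestprice, starprefer, consuming_capacity, and h are discretized into numeric ranges.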

df['historyvisit_avghotelnum'] = df['historyvisit_avghotelnum'].apply(lambda x: 0 if x<=5 else 1)
df['ordercanncelednum'] = df['ordercanncelednum'].apply(lambda x: 0 if x<=5 else 1)
df['sid'] = df['sid'].apply(lambda x: 0 if x==1 else 1)  
# Range-based discretization
def discrete_avgprice(x):
    if x<=200:
        return 0
    elif x<=400:
        return 1
    elif x<=600:
        return 2
    else:
        return 3
    
def discrete_lowestprice(x):
    if x<=100:
        return 0
    elif x<=200:
        return 1
    elif x<=300:
        return 2
    else:
        return 3
    
def discrete_starprefer(x):
    if x==0:
        return 0
    elif x<=60:
        return 1
    elif x<=80:
        return 2
    else:
        return 3
    
def discrete_consuming_capacity(x):
    if x<0:
        return 0
    elif x<=20:
        return 1
    elif x<=40:
        return 2
    elif x<=60:
        return 3
    else:
        return 4
    
def discrete_h(x):
    if x>=0 and x<6:# early-morning visit
        return 0
    elif x<12:# morning visit
        return 1
    elif x<18:# afternoon visit
        return 2
    else:
        return 3# evening visit
    
df['avgprice'] = df['avgprice'].map(discrete_avgprice)
df['lowestprice'] = df['lowestprice'].map(discrete_lowestprice)
df['starprefer'] = df['starprefer'].map(discrete_starprefer)
df['consuming_capacity'] = df['consuming_capacity'].map(discrete_consuming_capacity)
df['h'] = df['h'].map(discrete_h)
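The hand-written range functions correspond to pandas.cut with right-closed bins. The sketch below checks this for the avgprice rule on a few invented prices; the bin edges are the ones used in discrete_avgprice.

```python
import numpy as np
import pandas as pd

def discrete_avgprice(x):
    # Same rule as in the article: <=200, <=400, <=600, otherwise
    if x <= 200:
        return 0
    elif x <= 400:
        return 1
    elif x <= 600:
        return 2
    else:
        return 3

prices = pd.Series([0.0, 200.0, 200.5, 400.0, 599.0, 600.0, 1500.0])

# Intervals are right-closed by default: (-inf, 200], (200, 400], (400, 600], (600, inf)
bins = [-np.inf, 200, 400, 600, np.inf]
cut_codes = pd.cut(prices, bins=bins, labels=False)
map_codes = prices.map(discrete_avgprice)
print(cut_codes.tolist())
```

One difference to keep in mind: pandas.cut leaves a missing price as NaN, whereas the if/elif chain sends it to the last bucket because every comparison with NaN is False.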

The resulting numeric categorical variables are then one-hot encoded with OneHotEncoder.

discrete_field = ['historyvisit_avghotelnum','ordercanncelednum'
                  ,'avgprice','lowestprice','starprefer','consuming_capacity','user_group',
                 'hotel_group','is_weekend','week_day','sid','h']
encode_df = pd.DataFrame(preprocessing.OneHotEncoder(handle_unknown='ignore').fit_transform(df[discrete_field]).toarray())
encode_df_new = pd.concat([df.drop(columns=discrete_field,axis=1),encode_df],axis=1)
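Building a DataFrame from the encoder's array gives the one-hot columns integer names (0, 1, 2, ...), which makes later feature-importance tables hard to read. As an alternative sketch, pandas.get_dummies keeps the source field and category in each column name. The toy frame is invented and this is not the article's encoder.

```python
import pandas as pd

# Toy stand-in; h and is_weekend are already discretized, cr stays continuous.
toy = pd.DataFrame({
    'h':          [0, 3, 1, 3],
    'is_weekend': [0, 1, 0, 0],
    'cr':         [1.1, 1.3, 1.0, 1.2],
})

# Only the listed columns are encoded; the others are passed through unchanged.
encoded = pd.get_dummies(toy, columns=['h', 'is_weekend'], dtype=int)
print(encoded.columns.tolist())
```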

2.2.3 Dropping Fields

Two kinds of fields are removed:
d, arrival, sampleid, and firstorder_bu, which carry no value for the analysis;
historyvisit_totalordernum, whose values equal those of ordernum_oneyear, so ordernum_oneyear is kept;
decisionhabit_user, whose values largely agree with historyvisit_avghotelnum, so historyvisit_avghotelnum is kept.

encode_df_new = encode_df_new.drop(columns=['d','arrival','sampleid','historyvisit_totalordernum','firstorder_bu','decisionhabit_user'],axis=1)
encode_df_new.shape

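After excluding the target field label and the split field user_tag, 79 fields remain.

2.3 Model Training

2.3.1 Splitting into Training and Validation Sets

To keep the training and validation sets independent and identically distributed, the data is sorted by user_tag; the first 70% becomes the training set and the rest the validation set.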

ss_df_new = encode_df_new
num = ss_df_new.shape[0]
df_sort = ss_df_new.sort_values(by=['user_tag'],ascending=True)
train_df = df_sort.iloc[:int(num*0.7),:]
test_df = df_sort.iloc[int(num*0.7):,:]
# Features exclude the target (label) and the split key (user_tag)
train_y = train_df['label']
train_x = train_df.drop(columns=['label','user_tag'])
test_y = test_df['label']
test_x = test_df.drop(columns=['label','user_tag'])
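Cutting the sorted rows at exactly 70% can still leave one user's orders on both sides of the boundary. A sketch of a split over unique user_tag values, which keeps every user entirely in one set, is shown below; the toy tags and labels are invented, and this is a variation on the article's split rather than the split that produced the reported results.

```python
import numpy as np
import pandas as pd

# Toy stand-in; each user_tag may appear on several rows.
toy = pd.DataFrame({
    'user_tag': [7, 3, 7, 5, 3, 9, 5, 1, 9, 9],
    'label':    [0, 1, 0, 0, 1, 0, 1, 0, 0, 1],
})

# Take 70% of the distinct users (not of the rows) for training.
users = np.sort(toy['user_tag'].unique())
n_train_users = int(len(users) * 0.7)
train_users = users[:n_train_users].tolist()

is_train = toy['user_tag'].isin(train_users)
train_df, test_df = toy[is_train], toy[~is_train]
print(len(train_df), len(test_df))
```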

2.3.2 比较各个模型的训练效果

All models are tuned with GridSearchCV grid search.

  • GBDT
# Parameters tuned:
# n_estimators
# max_depth and min_samples_split
# min_samples_split and min_samples_leaf
# max_features
# subsample
# learning_rate, tuned together with n_estimators
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
# Final parameters
gbc = GradientBoostingClassifier(loss='deviance',random_state=2019,learning_rate=0.05, n_estimators=200,min_samples_split=4,
                        min_samples_leaf=1,max_depth=11,max_features='sqrt', subsample=0.8)
gbc.fit(train_x,train_y)
predict_train = gbc.predict_proba(train_x)[:,1]
predict_test = gbc.predict_proba(test_x)[:,1]
pr_train,re_train,thre_train = metrics.precision_recall_curve(train_y,predict_train)
pr_test,re_test,thre_test = metrics.precision_recall_curve(test_y,predict_test)
auc_train = metrics.roc_auc_score(train_y,predict_train)
auc_test = metrics.roc_auc_score(test_y,predict_test)
prt_train = pd.DataFrame({'precision':pr_train,'recall':re_train})
prt_test = pd.DataFrame({'precision':pr_test,'recall':re_test})
print('After tuning: maximum recall on the test set with precision>=0.97:')
print(prt_test.loc[prt_test['precision']>=0.97,'recall'].max())
print('AUC score: {}'.format(auc_test))

Output:
0.15988300816140671
0.8808204850185188

  • xgboost
# Parameters tuned:
# number of estimators n_estimators
# min_child_weight and max_depth
# gamma
# subsample and colsample_bytree
# learning_rate, tuned together with n_estimators


from xgboost.sklearn import XGBClassifier
xgbc = XGBClassifier(learning_rate=0.05, objective= 'binary:logistic', nthread=1,  scale_pos_weight=1, seed=27,
                    subsample=0.6, colsample_bytree=0.6, gamma=0, reg_alpha= 0, reg_lambda=1,max_depth=38,min_child_weight=1,n_estimators=210)
xgbc.fit(train_x,train_y)
predict_train = xgbc.predict_proba(train_x)[:,1]
predict_test = xgbc.predict_proba(test_x)[:,1]
pr_train,re_train,thre_train = metrics.precision_recall_curve(train_y,predict_train)
pr_test,re_test,thre_test = metrics.precision_recall_curve(test_y,predict_test)
auc_train = metrics.roc_auc_score(train_y,predict_train)
auc_test = metrics.roc_auc_score(test_y,predict_test)
prt_train = pd.DataFrame({'precision':pr_train,'recall':re_train})
prt_test = pd.DataFrame({'precision':pr_test,'recall':re_test})
print('Maximum recall with precision>=0.97:')
print(prt_test.loc[prt_test['precision']>=0.97,'recall'].max())
print('AUC score: {}'.format(auc_test))

Output:
0.7640022417597814
0.9754939563495324

  • Random forest
# Parameters tuned:
# n_estimators
# max_depth
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=300,max_depth=50)
rf.fit(train_x,train_y)
predict_train = rf.predict_proba(train_x)[:,1]
predict_test = rf.predict_proba(test_x)[:,1]
pr_train,re_train,thre_train = metrics.precision_recall_curve(train_y,predict_train)
pr_test,re_test,thre_test = metrics.precision_recall_curve(test_y,predict_test)
auc_train = metrics.roc_auc_score(train_y,predict_train)
auc_test = metrics.roc_auc_score(test_y,predict_test)
prt_train = pd.DataFrame({'precision':pr_train,'recall':re_train})
prt_test = pd.DataFrame({'precision':pr_test,'recall':re_test})
print('Maximum recall with precision>=0.97:')
print(prt_test.loc[prt_test['precision']>=0.97,'recall'].max())
print('AUC score: {}'.format(auc_test))

Output:
0.666135416301797
0.9616117844760916

  • AdaBoost
from sklearn.ensemble import AdaBoostClassifier
bdt = AdaBoostClassifier(algorithm="SAMME",
                         n_estimators=600, learning_rate=1)
bdt.fit(train_x,train_y)
predict_train = bdt.predict_proba(train_x)[:,1]
predict_test = bdt.predict_proba(test_x)[:,1]
pr_train,re_train,thre_train = metrics.precision_recall_curve(train_y,predict_train)
pr_test,re_test,thre_test = metrics.precision_recall_curve(test_y,predict_test)
auc_train = metrics.roc_auc_score(train_y,predict_train)
auc_test = metrics.roc_auc_score(test_y,predict_test)
prt_train = pd.DataFrame({'precision':pr_train,'recall':re_train})
prt_test = pd.DataFrame({'precision':pr_test,'recall':re_test})
print('Maximum recall with precision>=0.97:')
print(prt_test.loc[prt_test['precision']>=0.97,'recall'].max())
print('AUC score: {}'.format(auc_test))

Output:
0.00019265123121650496
0.7300356696791559

  • DecisionTree
from sklearn.tree import DecisionTreeClassifier
bdt = DecisionTreeClassifier(random_state=0,max_depth=30, min_samples_split=70)
bdt.fit(train_x,train_y)
predict_train = bdt.predict_proba(train_x)[:,1]
predict_test = bdt.predict_proba(test_x)[:,1]
pr_train,re_train,thre_train = metrics.precision_recall_curve(train_y,predict_train)
pr_test,re_test,thre_test = metrics.precision_recall_curve(test_y,predict_test)
auc_train = metrics.roc_auc_score(train_y,predict_train)
auc_test = metrics.roc_auc_score(test_y,predict_test)
prt_train = pd.DataFrame({'precision':pr_train,'recall':re_train})
prt_test = pd.DataFrame({'precision':pr_test,'recall':re_test})
print('Maximum recall with precision>=0.97:')
print(prt_test.loc[prt_test['precision']>=0.97,'recall'].max())
print('AUC score: {}'.format(auc_test))

Output:
0.0
0.8340018840954033

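These results show that xgboost performs best: with precision>=0.97, recall reaches up to 76.4%.

2.3.3 Model Stacking

Model stacking was also tried to see whether it could do better. First, 57 features were selected by feature importance across the models above. Then 5-fold cross-validation with KFold produced, for each of the five models, out-of-fold predictions on the training data and predictions on the test data, which serve as the training and test sets of the second layer. A logistic regression model was then trained on these five features. The final result: with precision>=0.97, recall reaches up to 78.3%, a slight improvement over the previous 76.4%.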
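The out-of-fold idea behind the first layer can be sketched with NumPy alone. The base model here, centroid_scorer, is a deliberately tiny stand-in invented for this illustration (it is not one of the article's five models), and the data is synthetic. What carries over is the mechanism: each training row gets a prediction from a model that never saw it, and test predictions are averaged over the folds.

```python
import numpy as np

def centroid_scorer(x_tr, y_tr):
    """Tiny stand-in base model: score rows by closeness to the positive-class centroid."""
    pos = x_tr[y_tr == 1].mean(axis=0)
    neg = x_tr[y_tr == 0].mean(axis=0)
    def predict_proba(x):
        d_pos = np.linalg.norm(x - pos, axis=1)
        d_neg = np.linalg.norm(x - neg, axis=1)
        return 1.0 / (1.0 + np.exp(d_pos - d_neg))   # closer to pos -> score above 0.5
    return predict_proba

def get_out_of_fold(fit, x_train, y_train, x_test, n_splits=5):
    all_idx = np.arange(len(x_train))
    folds = np.array_split(all_idx, n_splits)         # contiguous folds, no shuffling
    oof_train = np.zeros(len(x_train))
    test_scores = np.zeros((n_splits, len(x_test)))
    for i, val_idx in enumerate(folds):
        tr_idx = np.setdiff1d(all_idx, val_idx)
        predict_proba = fit(x_train[tr_idx], y_train[tr_idx])
        oof_train[val_idx] = predict_proba(x_train[val_idx])   # rows unseen during fitting
        test_scores[i] = predict_proba(x_test)
    # One column per base model: out-of-fold scores for training, fold-averaged scores for test.
    return oof_train.reshape(-1, 1), test_scores.mean(axis=0).reshape(-1, 1)

# Synthetic data: alternating labels so every fold contains both classes.
rng = np.random.default_rng(0)
y_train = np.tile([0, 1], 50)
x_train = rng.normal(size=(100, 3)) + y_train[:, None] * 2.0
x_test = rng.normal(size=(20, 3))

# Two "base models" (the same scorer on different feature subsets) give two second-layer features.
oof_a, test_a = get_out_of_fold(centroid_scorer, x_train, y_train, x_test)
oof_b, test_b = get_out_of_fold(centroid_scorer, x_train[:, :2], y_train, x_test[:, :2])
train_x2 = np.hstack([oof_a, oof_b])
test_x2 = np.hstack([test_a, test_b])
print(train_x2.shape, test_x2.shape)
```

A second-layer model is then fitted on train_x2 against y_train and applied to test_x2, which is the role logistic regression plays in this article.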

  • Selecting important features
# Feature selection
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost.sklearn import XGBClassifier

def get_top_n_features(train_x, train_y):

    # random forest
    rf_est = RandomForestClassifier(n_estimators=300,max_depth=50)
    rf_est.fit(train_x, train_y)
    feature_imp_sorted_rf = pd.DataFrame({'feature': train_x.columns,
                                          'importance': rf_est.feature_importances_}).sort_values('importance', ascending=False)

    # AdaBoost
    ada_est =AdaBoostClassifier(n_estimators=600,learning_rate=1)
    ada_est.fit(train_x, train_y)
    feature_imp_sorted_ada = pd.DataFrame({'feature': train_x.columns,
                                           'importance': ada_est.feature_importances_}).sort_values('importance', ascending=False)

    
    # GradientBoosting
    gb_est = GradientBoostingClassifier(loss='deviance',random_state=2019,learning_rate=0.05, n_estimators=200,min_samples_split=4,
                        min_samples_leaf=1,max_depth=11,max_features='sqrt', subsample=0.8)
    gb_est.fit(train_x, train_y)
    feature_imp_sorted_gb = pd.DataFrame({'feature':train_x.columns,
                                          'importance': gb_est.feature_importances_}).sort_values('importance', ascending=False)

    # DecisionTree
    dt_est = DecisionTreeClassifier(random_state=0,min_samples_split=70,max_depth=30)
    dt_est.fit(train_x, train_y)
    feature_imp_sorted_dt = pd.DataFrame({'feature':train_x.columns,
                                          'importance': dt_est.feature_importances_}).sort_values('importance', ascending=False)
    
    # xgbc
    xg_est = XGBClassifier(learning_rate=0.05, objective= 'binary:logistic', nthread=1,  scale_pos_weight=1, seed=27,
                    subsample=0.6, colsample_bytree=0.6, gamma=0, reg_alpha= 0, reg_lambda=1,max_depth=38,min_child_weight=1,n_estimators=210)
    xg_est.fit(train_x, train_y)
    feature_imp_sorted_xg = pd.DataFrame({'feature':train_x.columns,
                                          'importance': xg_est.feature_importances_}).sort_values('importance', ascending=False)

    
    return feature_imp_sorted_rf,feature_imp_sorted_ada,feature_imp_sorted_gb,feature_imp_sorted_dt,feature_imp_sorted_xg

feature_imp_sorted_rf,feature_imp_sorted_ada,feature_imp_sorted_gb,feature_imp_sorted_dt,feature_imp_sorted_xg = get_top_n_features(train_x, train_y)
top_n_features = 35
features_top_n_rf = feature_imp_sorted_rf.head(top_n_features)['feature']
features_top_n_ada = feature_imp_sorted_ada.head(top_n_features)['feature']
features_top_n_gb = feature_imp_sorted_gb.head(top_n_features)['feature']
features_top_n_dt = feature_imp_sorted_dt.head(top_n_features)['feature']
features_top_n_xg = feature_imp_sorted_xg.head(top_n_features)['feature']
features_top_n = pd.concat([features_top_n_rf, features_top_n_ada, features_top_n_gb, features_top_n_dt,features_top_n_xg], 
                               ignore_index=True).drop_duplicates()
    
features_importance = pd.concat([feature_imp_sorted_rf, feature_imp_sorted_ada, 
                                   feature_imp_sorted_gb, feature_imp_sorted_dt,feature_imp_sorted_xg],ignore_index=True)
train_x_new = pd.DataFrame(train_x[features_top_n])
test_x_new = pd.DataFrame(test_x[features_top_n])
features_top_n

In the end, 57 of the 79 features were selected.

  • First-layer model training
# First layer
from sklearn.model_selection import KFold
ntrain = train_x_new.shape[0]
ntest = test_x_new.shape[0]
kf = KFold(n_splits = 5, shuffle=False)  # random_state has no effect without shuffling

def get_out_fold(clf, x_train, y_train, x_test):
    oof_train = np.zeros((ntrain,))
    oof_test = np.zeros((ntest,))
    oof_test_skf = np.empty((5, ntest))
    oof_train_prob = np.zeros((ntrain,))
    oof_test_prob = np.zeros((ntest,))
    oof_test_skf_prob = np.empty((5, ntest))

    for i, (train_index, test_index) in enumerate(kf.split(x_train)):
        x_tr = x_train[train_index]
        y_tr = y_train[train_index]
        x_te = x_train[test_index]

        clf.fit(x_tr, y_tr)

        oof_train[test_index] = clf.predict(x_te)
        oof_test_skf[i, :] = clf.predict(x_test)
        oof_train_prob[test_index] = clf.predict_proba(x_te)[:,1]
        oof_test_skf_prob[i, :] = clf.predict_proba(x_test)[:,1]
        print('Fold {}'.format(i))
        print('Training indices:')
        print(train_index)
        print('Validation indices:')
        print(test_index)
    oof_test[:] = oof_test_skf.mean(axis=0)
    oof_test_prob[:] = oof_test_skf_prob.mean(axis=0)
    return oof_train.reshape(-1, 1), oof_test.reshape(-1, 1),oof_train_prob.reshape(-1, 1), oof_test_prob.reshape(-1, 1)
rf = RandomForestClassifier(n_estimators=300,max_depth=50)
ada = AdaBoostClassifier(n_estimators=600,learning_rate=1)
gb = GradientBoostingClassifier(loss='deviance',random_state=2019,learning_rate=0.05, n_estimators=200,min_samples_split=4,
                        min_samples_leaf=1,max_depth=11,max_features='sqrt', subsample=0.8)
dt = DecisionTreeClassifier(random_state=0,min_samples_split=70,max_depth=30)

x_train = train_x_new.values 
x_test = test_x_new.values 
y_train =train_y.values
rf_oof_train, rf_oof_test,rf_oof_train_prob, rf_oof_test_prob = get_out_fold(rf, x_train, y_train, x_test) # Random Forest
ada_oof_train, ada_oof_test,ada_oof_train_prob, ada_oof_test_prob = get_out_fold(ada, x_train, y_train, x_test) # AdaBoost 
gb_oof_train, gb_oof_test,gb_oof_train_prob, gb_oof_test_prob = get_out_fold(gb, x_train, y_train, x_test) # Gradient Boost
dt_oof_train, dt_oof_test,dt_oof_train_prob, dt_oof_test_prob = get_out_fold(dt, x_train, y_train, x_test) # Decision Tree
xgbc = XGBClassifier(learning_rate=0.05, objective= 'binary:logistic', nthread=1,  scale_pos_weight=1, seed=27,
                    subsample=0.6, colsample_bytree=0.6, gamma=0, reg_alpha= 0, reg_lambda=1,max_depth=38,min_child_weight=1,n_estimators=210)
xgbc_oof_train, xgbc_oof_test,xgbc_oof_train_prob, xgbc_oof_test_prob = get_out_fold(xgbc, x_train, y_train, x_test) # XGBClassifier
print("Training is complete")
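  • Second-layer model training
    The first-layer outputs are used as the training and test sets
# Build the second-layer training and test sets from the five first-layer models
train_x2_prob = pd.DataFrame(np.concatenate((rf_oof_train_prob, ada_oof_train_prob, gb_oof_train_prob, dt_oof_train_prob, xgbc_oof_train_prob), axis=1),columns=['rf_prob','ada_prob','gb_prob','dt_prob','xgbc_prob'])
test_x2_prob = pd.DataFrame(np.concatenate((rf_oof_test_prob, ada_oof_test_prob, gb_oof_test_prob, dt_oof_test_prob, xgbc_oof_test_prob), axis=1),columns=['rf_prob','ada_prob','gb_prob','dt_prob','xgbc_prob'])
# Train the logistic regression model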
from sklearn.linear_model import LogisticRegression
# Parameter tuning
# param_rf4 = {'penalty': ['l1','l2'],'C':[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1]}
# rf_est4 = LogisticRegression()
# rfsearch4 = GridSearchCV(estimator=rf_est4,param_grid=param_rf4,scoring='roc_auc',iid=False,cv=5)
# rfsearch4.fit(train_x2_prob,train_y)
# print('Mean score for each parameter setting: {}'.format(rfsearch4.cv_results_['mean_test_score']))
# print('Best parameters: {}'.format(rfsearch4.best_params_))
# print('Best roc_auc score: {}'.format(rfsearch4.best_score_))
# Tuning result: C=0.1, penalty='l2'
lr = LogisticRegression(C=0.1,penalty='l2')
lr.fit(train_x2_prob,train_y)
predict_train = lr.predict_proba(train_x2_prob)[:,1]
predict_test = lr.predict_proba(test_x2_prob)[:,1]
pr_train,re_train,thre_train = metrics.precision_recall_curve(train_y,predict_train)
pr_test,re_test,thre_test = metrics.precision_recall_curve(test_y,predict_test)
auc_train = metrics.roc_auc_score(train_y,predict_train)
auc_test = metrics.roc_auc_score(test_y,predict_test)
prt_train = pd.DataFrame({'precision':pr_train,'recall':re_train})
prt_test = pd.DataFrame({'precision':pr_test,'recall':re_test})
print('Maximum recall with precision>=0.97:')
print(prt_test.loc[prt_test['precision']>=0.97,'recall'].max())
print('AUC score: {}'.format(auc_test))

Output:
0.7832498511331395
0.9763271659779821
Stacking raises the recall from 76.4% to 78.3%.
