Background:
Company A sells an online service, product P, and its marketing channel is 100% digital media. The company offers a 30-day free trial of P in the hope that customers will then sign up for the paid service. But it found that blasting e-mail marketing copy to tens of thousands of customers every one to two days not only yields a low purchase rate but also generates many complaints. Meanwhile, expected customer value is judged by hand from experience, with no quantitative or model-based tooling, so it is unclear whether to advertise to everyone or to target the campaign with a machine-learning model.
This problem is essentially similar to mock exercise 2, with the addition of deriving a payoff matrix from the confusion matrix, taking marketing cost and revenue into account (a concrete sketch follows).
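To make the payoff idea concrete before diving in, here is a minimal sketch of a payoff matrix applied to a confusion matrix, assuming 1500 net gain per correctly targeted buyer and 500 cost per wasted mailing (the same illustrative figures used in the final section):
import numpy as np
# payoff per cell; rows = actual label (0, 1), columns = predicted label (0, 1)
payoff = np.array([[0, -500],    # mailing a non-buyer costs 500
                   [0, 1500]])   # a correctly targeted buyer nets 1500
def expected_profit(conf_matrix):
    # conf_matrix uses sklearn's layout: conf_matrix[i, j] counts samples
    # with true label i predicted as j, so this returns TP*1500 - FP*500
    return int((conf_matrix * payoff).sum())
For a confusion matrix [[TN, FP], [FN, TP]], this returns TP*1500 - FP*500, exactly the profit computed at the end of this post.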
Training and test data total 8,000 rows, of which 6,000 are training data. The field meanings are listed below; see the official site for details.
Data loading
Opening the files shows that many missing values are represented by '?', and many columns load as object dtype, so conversion is needed.
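As an aside, pandas can be told about the sentinel at load time, so numeric columns parse as numbers right away; a minimal sketch (the rest of this post keeps the raw '?' values and handles them explicitly instead, so the two paths should not be mixed):
import pandas as pd
# treat '?' as NaN while reading; dtypes then come out numeric where possible
df = pd.read_csv("df_training.csv", na_values="?")
print(df.dtypes)
print(df.isna().sum())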
# -*- coding: UTF-8 -*-
# keep the script compatible with Python 3
from __future__ import print_function
import os  # locate the data files
import sys
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split  # train/validation split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn import metrics
from sklearn.feature_extraction import DictVectorizer  # feature transformer
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
from sklearn import tree
from sklearn.model_selection import GridSearchCV
%matplotlib inline
import warnings
# suppress warnings
warnings.filterwarnings("ignore")
def readData(path):
    """
    Read the data with pandas.
    """
    data = pd.read_csv(path)
    cols = list(data.columns.values)
    return data[cols]
def visualData(data):
    """
    Plot histograms for a quick feel of the data.
    """
    data.hist(
        rwidth=0.9, grid=True, figsize=(8, 8), alpha=0.6, bins=10, color="blue")
    plt.show()
if __name__ == "__main__":
    # set the display width
    pd.set_option('display.width', 1000)
    # quoting '__file__' makes this resolve to the working directory in a notebook
    homePath = os.path.dirname(os.path.abspath('__file__'))
    # os.path.join picks the right separator on both Windows and Linux
    dataPath = os.path.join(homePath, "df_training.csv")
    train = readData(dataPath)
    dataPath = os.path.join(homePath, "df_test.csv")
    test = readData(dataPath)
    # summary statistics
    train.info()
    train.head()
    # info() reports no nulls because the missing values are the string '?';
    # most columns are object dtype and need conversion
    test.info()
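A quick check confirms that the "missing" values really are the '?' sentinel rather than true NaN (a sketch, assuming train is loaded as above):
# isna() finds nothing because the sentinel is a string, not NaN
print(train.isna().sum().sum())
# count '?' occurrences per column instead
print((train == '?').sum().sort_values(ascending=False))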
Data cleaning and descriptive analysis
# print(set(train.columns) - set(['ID', "'Purchase or not'"]))  # list the feature names
features = ["'Product using score'", "'User area'", 'age', "'Point balance'", 'gender', "'Cumulative using time'",
            "'Active user'", "' Estimated salary'", "'Pay a monthly fee by credit card'", "'Product service usage'"]
for feature in features:
    na_count = train[train[feature]=='?'].shape[0]
    na_per = na_count/train[feature].count()
    na_per_buy = train[train[feature]=='?']["'Purchase or not'"].value_counts()[1]/na_count
    print('%s missing ratio: %.2f, purchase ratio among missing: %.2f' % (feature, na_per, na_per_buy))
    # replace '?' with NaN on a copy of the training data
    train2 = train.copy()
    train2[feature] = train2[feature].replace('?', np.nan)
    # convert non-categorical variables to numeric
    if feature not in ["'User area'", 'gender']:
        train2[feature] = train2[feature].apply(pd.to_numeric)
    # visualize the cross tabulation;
    # bin the feature if it takes 20 or more distinct values
    if len(train2[feature].unique()) >= 20:
        # equal-width binning
        train2['cut'] = pd.cut(train2[feature], bins=10, include_lowest=True, right=False, precision=0)
        cross1 = pd.crosstab(train2['cut'], train["'Purchase or not'"])
    else:
        cross1 = pd.crosstab(train2[feature], train["'Purchase or not'"])
    print(cross1)
    cross1.plot(kind="bar", color=["blue", "0.45"], rot=0)
    plt.show()
Overall, each feature is missing for roughly 30% of rows, and buyers account for about 20% of those missing rows. Because the training data are few and the missing ratio is high, the missing values are assigned to a category of their own rather than dropped, to avoid losing useful information.
For Product using score, purchases below about 410 are essentially zero; the rest of the distribution is uneven but close to normal.
By user area, the purchase ratio in Taichung is higher than in Tainan and Taipei.
The age distribution is roughly normal, but the purchases between 80 and 85 look suspicious, and purchases essentially drop to zero after 74.
For point balance, the intervals [0.0, 22227.0) and [111134.0, 133361.0) have notably more buyers; the rest are about even.
By gender, the purchase ratio is higher among women.
Product service usage is mostly the integers 0 to 4, with 3 and 4 especially high, 1 around 30%, and 2 the lowest.
Active user and pay-by-credit-card are boolean 0/1 values; inactive users buy more often, while credit-card payment makes little difference between buying and not.
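Since the missingness itself appears informative (buyers do show up among the '?' rows), one option before imputing is an explicit missing-indicator column, so the model keeps the "was missing" signal even after the value is filled in; a minimal sketch with one column from this dataset:
col = "'Product using score'"
train_ind = train.copy()
# 1 where the '?' sentinel appears, 0 otherwise
train_ind[col + '_missing'] = (train_ind[col] == '?').astype(int)
# then convert the value column itself
train_ind[col] = pd.to_numeric(train_ind[col].replace('?', np.nan))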
Use boxplots to inspect the distribution of the continuous features.
import matplotlib.pyplot as plt
feabox = ["'Product using score'", "' Estimated salary'"]
for feature in feabox:
    # drop the '?' sentinel and convert to numeric before plotting;
    # plt.boxplot does not tolerate NaN values
    vals = pd.to_numeric(train[feature].replace('?', np.nan)).dropna()
    plt.grid(linestyle="--", alpha=0.3)
    plt.boxplot(x=vals,                 # the data to plot
                patch_artist=True,      # fill the box with a custom color (default is white)
                showmeans=True,         # show the mean as a point
                # box style: edge color and fill color
                boxprops={'color': 'black', 'facecolor': 'steelblue'},
                # outlier style: marker shape, fill color and size
                flierprops={'marker': 'o', 'markerfacecolor': 'red', 'markersize': 3},
                # mean-point style: marker shape, fill color and size
                meanprops={'marker': 'D', 'markerfacecolor': 'indianred', 'markersize': 4},
                # median-line style: line type and color
                medianprops={'linestyle': '--', 'color': 'orange'},
                labels=[' ']            # blank x tick label (otherwise it shows as 1)
                )
    # add the title
    plt.title(feature)
    plt.show()
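The whisker rule the boxplots use can also be computed directly with the interquartile range, which makes the outlier counts explicit (a sketch over the same two columns):
for feature in feabox:
    s = pd.to_numeric(train[feature].replace('?', np.nan)).dropna()
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # Tukey fences
    n_out = int(((s < lower) | (s > upper)).sum())
    print('%s: fences [%.1f, %.1f], %d outliers' % (feature, lower, upper, n_out))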
Data preprocessing and feature engineering
# encoding maps
alldata = pd.concat([train, test], axis=0)
# dictionary lookup tables
area_map = {'Taichung': 0, 'Tainan': 1, 'Taipei': 2, '?': 3}
gender_map = {'Female': 0, 'Male': 1, '?': 3}
# apply the maps
alldata["'User area'"] = alldata["'User area'"].map(area_map)
alldata["gender"] = alldata["gender"].map(gender_map)
# convert the remaining '?' to NaN
alldata = alldata.replace('?', np.nan)
# mode() returns a Series, so take its first element for fillna
alldata["'Cumulative using time'"].fillna(alldata["'Cumulative using time'"].mode()[0], inplace=True)
alldata["'Product service usage'"].fillna(alldata["'Product service usage'"].mode()[0], inplace=True)
# assign everything else to a category of its own
alldata.fillna(value=-1, inplace=True)
# convert to numeric
for feature in features:
    alldata[feature] = alldata[feature].apply(pd.to_numeric)
# split back into training and test sets (copies avoid SettingWithCopyWarning)
newtrain = alldata[alldata["'Purchase or not'"] != 'Withheld'].copy()
newtrain["'Purchase or not'"] = newtrain["'Purchase or not'"].apply(pd.to_numeric)
newtest = alldata[alldata["'Purchase or not'"] == 'Withheld'].copy()
newtrain.info()
newtest.info()
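It is worth asserting that the cleaning actually finished the job before modeling (a short sanity check):
# every feature column should now be numeric with no NaN left
assert newtrain[features].isna().sum().sum() == 0
assert newtest[features].isna().sum().sum() == 0
print(newtrain[features].dtypes)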
First build a baseline with a random forest.
# quick prediction with a random forest
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
predictors = list(set(newtrain.columns) - set(['ID', "'Purchase or not'"]))
X_train, X_val, y_train, y_val = train_test_split(newtrain[predictors], newtrain["'Purchase or not'"],
                                                  test_size=0.2, random_state=1234)
rf = RandomForestClassifier(n_estimators=1000, min_samples_split=5, min_samples_leaf=3)
rf.fit(X_train, y_train)
# sklearn metric signatures are (y_true, y_pred)
print(accuracy_score(y_val, rf.predict(X_val)))
print(roc_auc_score(y_val, rf.predict(X_val)))
0.83
0.7924476371342856
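A note on the metric call above: roc_auc_score is normally given the positive-class probability rather than hard labels, since hard 0/1 predictions collapse the ROC curve to a single operating point. A sketch of the more standard call:
# probability that each validation customer buys
proba = rf.predict_proba(X_val)[:, 1]
print(roc_auc_score(y_val, proba))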
Gradient boosting training
Bagging and boosting are the two common model-ensembling approaches, and random forests belong to the former. Next we optimize with GBDT, a boosting method. Its main implementations are XGBoost and LightGBM; LightGBM is reputed to be faster and more accurate, so it is time to become a parameter-tuning warrior and start practicing.
step 1: set an initial learning rate and tune the number of boosting iterations
import pandas as pd
import lightgbm as lgb
params = {
'boosting_type': 'gbdt',
'objective': 'binary',
'metric': 'auc',
'nthread':4,
'learning_rate':0.1,
'num_leaves':30,
'max_depth': 6,
'subsample': 0.8,
'colsample_bytree': 0.8,
}
data_train = lgb.Dataset(X_train, y_train)
cv_results = lgb.cv(params, data_train, num_boost_round=1000, nfold=5, stratified=False, shuffle=True, metrics='auc',early_stopping_rounds=50,seed=0)
print('best n_estimators:', len(cv_results['auc-mean']))
print('best cv score:', pd.Series(cv_results['auc-mean']).max())
best n_estimators: 44
best cv score: 0.7852211260765876
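Plotting the cross-validated AUC per boosting round shows where early stopping kicked in (a sketch; the 'auc-mean' key matches the LightGBM version used here, while newer releases prefix it with 'valid '):
# lgb.cv returns, per round, the mean and std of the metric across folds
plt.plot(cv_results['auc-mean'])
plt.xlabel('boosting round')
plt.ylabel('cv mean AUC')
plt.show()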
step 2: given the iteration count, tune max_depth and num_leaves
These are the most important parameters for improving accuracy. Here we bring in sklearn's GridSearchCV() to do the search.
params_test1={'max_depth': range(3,8,1), 'num_leaves':range(5, 100, 5)}
gsearch1 = GridSearchCV(estimator = lgb.LGBMClassifier(boosting_type='gbdt',objective='binary',metrics='auc',learning_rate=0.1,
n_estimators=44, max_depth=6, bagging_fraction = 0.8,feature_fraction = 0.8),
param_grid = params_test1, scoring='roc_auc',cv=5,n_jobs=-1)
gsearch1.fit(X_train,y_train)
means = gsearch1.cv_results_['mean_test_score']
params = gsearch1.cv_results_['params']
for mean,param in zip(means,params):
print("%f with: %r" % (mean,param))
step 3: tune min_data_in_leaf and max_bin
params_test2={'max_bin': range(5,256,10), 'min_data_in_leaf':range(1,102,10)}
gsearch2 = GridSearchCV(estimator = lgb.LGBMClassifier(boosting_type='gbdt',objective='binary',metrics='auc',learning_rate=0.1,
n_estimators=44, max_depth=3, num_leaves=5,bagging_fraction = 0.8,feature_fraction = 0.8),
param_grid = params_test2, scoring='roc_auc',cv=5,n_jobs=-1)
gsearch2.fit(X_train,y_train)
means = gsearch2.cv_results_['mean_test_score']
params = gsearch2.cv_results_['params']
for mean,param in zip(means,params):
print("%f with: %r" % (mean,param))
step 4: settle feature_fraction, bagging_fraction and bagging_freq
params_test3={'feature_fraction': [0.6,0.7,0.8,0.9,1.0],
'bagging_fraction': [0.6,0.7,0.8,0.9,1.0],
'bagging_freq': range(0,81,10)
}
gsearch3 = GridSearchCV(estimator = lgb.LGBMClassifier(boosting_type='gbdt',objective='binary',metrics='auc',learning_rate=0.1,
n_estimators=44, max_depth=3, num_leaves=5,max_bin=185,min_data_in_leaf=1),
param_grid = params_test3, scoring='roc_auc',cv=5,n_jobs=-1)
gsearch3.fit(X_train,y_train)
means = gsearch3.cv_results_['mean_test_score']
params = gsearch3.cv_results_['params']
for mean,param in zip(means,params):
print("%f with: %r" % (mean,param))
step 5: tune lambda_l1 and lambda_l2
params_test4={'lambda_l1': [1e-5,1e-3,1e-1,0.0,0.1,0.3,0.5,0.7,0.9,1.0],
'lambda_l2': [1e-5,1e-3,1e-1,0.0,0.1,0.3,0.5,0.7,0.9,1.0]
}
gsearch4 = GridSearchCV(estimator = lgb.LGBMClassifier(boosting_type='gbdt',objective='binary',metrics='auc',learning_rate=0.1,
n_estimators=44, max_depth=3, num_leaves=5,max_bin=185,min_data_in_leaf=1,bagging_fraction=0.9,bagging_freq= 40, feature_fraction= 0.7),
param_grid = params_test4, scoring='roc_auc',cv=5,n_jobs=-1)
gsearch4.fit(X_train,y_train)
means = gsearch4.cv_results_['mean_test_score']
params = gsearch4.cv_results_['params']
for mean,param in zip(means,params):
print("%f with: %r" % (mean,param))
step 6: settle the min_split_gain parameter
params_test5={'min_split_gain':[0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0]}
gsearch5 = GridSearchCV(estimator = lgb.LGBMClassifier(boosting_type='gbdt',objective='binary',metrics='auc',learning_rate=0.1,
n_estimators=44, max_depth=3, num_leaves=5,max_bin=185,min_data_in_leaf=1,bagging_fraction=0.9,bagging_freq= 40, feature_fraction= 0.7,
lambda_l1=1e-05,lambda_l2=0.001),
param_grid = params_test5, scoring='roc_auc',cv=5,n_jobs=-1)
gsearch5.fit(X_train,y_train)
means = gsearch5.cv_results_['mean_test_score']
params = gsearch5.cv_results_['params']
for mean,param in zip(means,params):
print("%f with: %r" % (mean,param))
step 7: lower the learning rate, increase the iterations, and validate the model
from sklearn.metrics import accuracy_score, roc_auc_score
X_train, X_val, y_train, y_val = train_test_split(newtrain[predictors], newtrain["'Purchase or not'"],
                                                  test_size=0.2, random_state=2019)
model = lgb.LGBMClassifier(boosting_type='gbdt', objective='binary', metrics='auc', learning_rate=0.005,
                           n_estimators=2900, max_depth=3, num_leaves=5, max_bin=185, min_data_in_leaf=1,
                           bagging_fraction=0.9, bagging_freq=40, feature_fraction=0.7,
                           lambda_l1=1e-05, lambda_l2=0.001, min_split_gain=0)
model.fit(X_train, y_train)
pred = model.predict(X_val)
print(accuracy_score(y_val, pred))
newtest['Predicted_Results'] = model.predict(newtest[predictors])
newtest[['ID', 'Predicted_Results']].to_csv('results3.csv', index=False)
Result:
0.8341666666666666
Computing the final profit
Based on the cost of the marketing copy, the other costs, and the revenue, compute the final payoff matrix and profit.
from sklearn.metrics import confusion_matrix
# compute the final profit
cm = confusion_matrix(y_val, pred)
print(cm)
# sklearn layout: rows are true labels, columns are predictions,
# so cm[1][1] is TP (targeted buyers) and cm[0][1] is FP (wasted mailings)
tn, fp, fn, tp = cm.ravel()
# scenario A: 1500 revenue per converted customer, 500 cost per wasted mailing
profitA = tp * 1500 - fp * 500
print(profitA)
# scenario B: cheaper product, 700 revenue per conversion
profitB = tp * 700 - fp * 500
print(profitB)
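This profit also answers the question posed in the background, blanket advertising versus model-targeted advertising: mailing everyone is simply a model that predicts 1 for every customer. A sketch of the comparison on the validation split, with the same assumed payoffs:
# blanket mailing: every buyer converts, every non-buyer is a wasted mailing
n_buy = int((y_val == 1).sum())
n_nobuy = int((y_val == 0).sum())
profit_all = n_buy * 1500 - n_nobuy * 500
print('blanket mailing profit:', profit_all)
print('model-targeted profit :', profitA)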