CDA Analytical Modeling: Building a Product Marketing Model and Making Predictions

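Background:

Company A sells an online service, product P, and markets it entirely through online channels. The company offers a 30-day free trial of P and hopes that customers will then sign a contract for the paid service. However, Company A has found that sending electronic marketing material to tens of thousands of customers every 1 to 2 days not only yields a low purchase rate but also generates many customer complaints. In addition, the expected profit per customer is estimated from manual experience, with no quantitative or modeling tools to help, so the company does not know whether it should send advertising to everyone or use a machine learning model to decide who receives it.

This is essentially the same as practice problem 2, with the added requirement of deriving a profit matrix from the confusion matrix, taking marketing cost and revenue into account.

The training and test sets contain 8,000 records in total, 6,000 of them in the training set. The field definitions are listed on the official site.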

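Reading the data

Opening the files shows that many missing values are stored as '?', and that many columns are of object type and need to be converted.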
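One way to deal with the '?' placeholders is to let pandas treat them as missing values at load time, so that numeric columns are parsed as numbers directly. The following is a minimal sketch, assuming a small inline CSV; the sample rows and column subset are made up for illustration and are not the competition data.

```python
import io
import pandas as pd

# Hypothetical sample rows that mimic the '?' placeholder described above
csv_text = "ID,age,gender\n1,34,Male\n2,?,Female\n3,51,?\n"

# na_values tells read_csv which strings to treat as missing
df = pd.read_csv(io.StringIO(csv_text), na_values="?")

print(df.dtypes)        # age is parsed as a float column, not object
print(df.isna().sum())  # one missing value each in age and gender
```

The write-up keeps the '?' strings at load time because it later maps them to their own category; this variant is only useful when the placeholder should become NaN immediately.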

# -*- coding: UTF-8 -*-
# Keep the script compatible with Python 3
from __future__ import print_function

import os  # used to build the data file paths
import sys
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import metrics, tree
from sklearn.feature_extraction import DictVectorizer  # feature transformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split  # train/validation split
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# IPython/Jupyter magic; remove this line when running as a plain script
%matplotlib inline

# Suppress warnings
warnings.filterwarnings("ignore")

def readData(path):
    """
    Read the data with pandas.
    """
    data = pd.read_csv(path)
    cols = list(data.columns.values)
    return data[cols]


def visualData(data):
    """
    Plot histograms for a quick visual overview of the data.
    """
    data.hist(
        rwidth=0.9, grid=True, figsize=(8, 8), alpha=0.6, bins=10, color="blue")
    plt.show()

if __name__ == "__main__":
    # Set the display width
    pd.set_option('display.width', 1000)
    # The quoted '__file__' resolves against the current working directory,
    # which also works inside a notebook
    homePath = os.path.dirname(os.path.abspath('__file__'))
    # os.path.join picks the right separator on both Windows and Linux
    train = readData(os.path.join(homePath, "df_training.csv"))
    test = readData(os.path.join(homePath, "df_test.csv"))
    # Summary information
    train.info()
    print(train.head())
    # No nulls are reported because missing values are stored as '?';
    # most columns are object type and need conversion
    test.info()

Data cleaning and descriptive analysis

# print(set(train.columns) - set(['ID', "'Purchase or not'"]))  # list the features
features = ["'Product using score'", "'User area'", 'age', "'Point balance'", 'gender', "'Cumulative using time'",
            "'Active user'", "' Estimated salary'", "'Pay a monthly fee by credit card'", "'Product service usage'"]

for feature in features:
    is_missing = train[feature] == '?'
    na_count = is_missing.sum()
    na_per = na_count / train[feature].count()
    # Share of purchasers among the rows where this feature is missing
    na_per_buy = pd.to_numeric(train.loc[is_missing, "'Purchase or not'"]).mean()
    print('%s missing share: %.2f, purchase share among missing: %.2f' % (feature, na_per, na_per_buy))

    # Replace '?' with NaN; cut and crosstab skip the NaN rows
    train2 = train.copy()
    train2[feature] = train2[feature].replace('?', np.nan)
    # Convert non-categorical variables to numeric
    if feature not in ["'User area'", 'gender']:
        train2[feature] = pd.to_numeric(train2[feature])

    # Cross-tabulate against the target and plot
    # Bin the feature when it has 20 or more distinct values
    if len(train2[feature].unique()) >= 20:
        # Equal-width binning
        train2['cut'] = pd.cut(train2[feature], bins=10, include_lowest=True, right=False, precision=0)
        cross1 = pd.crosstab(train2['cut'], train["'Purchase or not'"])
    else:
        cross1 = pd.crosstab(train2[feature], train["'Purchase or not'"])
    print(cross1)
    cross1.plot(kind="bar", color=["blue", "0.45"], rot=0)
    plt.show()

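Overall, each feature has roughly 30% missing values, and about 20% of the rows with a missing value are purchasers. Because the training set is small and the share of missing values is high, dropping those rows would lose useful information, so missing values are instead assigned to a category of their own.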
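Treating "missing" as its own category can be sketched as follows. This is only an illustration on made-up toy columns: the column names area and score, the sentinel -1, and the extra score_missing flag are assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical toy columns; the values are made up for illustration
df = pd.DataFrame({
    "area": ["Taipei", "?", "Tainan", "?"],
    "score": ["620", "?", "480", "555"],
})

# Categorical column: keep '?' as its own level through an explicit mapping
area_codes = {"Taichung": 0, "Tainan": 1, "Taipei": 2, "?": 3}
df["area_code"] = df["area"].map(area_codes)

# Numeric column: unparseable entries such as '?' become NaN
score = pd.to_numeric(df["score"], errors="coerce")
# Flag the gaps, then fill them with a sentinel outside the valid range
df["score_missing"] = score.isna().astype(int)
df["score_filled"] = score.fillna(-1)

print(df)
```

Tree-based models can split on the sentinel value, so rows with missing data stay in the training set instead of being discarded.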

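For Product using score, purchases are essentially zero below 410; above that the distribution is uneven but close to normal.

By user area, the purchase rate in Taichung is higher than in Tainan and Taipei.

age is roughly normally distributed, but the purchases between 80 and 85 look like anomalous data, and purchases fall to essentially zero after 74.

For point balance, the two intervals [0.0, 22227.0) and [111134.0, 133361.0) have especially many purchasers; the other intervals are about the same.

By gender, the purchase rate is higher among women.

Product service usage consists mostly of integers from 0 to 4; the purchase rate is especially high for 3 and 4, about 30% for 1, and lowest for 2.

Active user and pay-by-credit-card are boolean flags that take the values 0 and 1. More purchases come from inactive users, while paying by credit card makes little difference between buying and not buying.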
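Statements about purchase rates are easier to verify when the cross-tabulation shows within-group shares rather than raw counts. The sketch below uses pd.crosstab with normalize="index" on made-up toy data; the column names and bin edges are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data: an age column and a 0/1 purchase flag
df = pd.DataFrame({
    "age": [22, 25, 31, 38, 44, 47, 53, 59],
    "purchase": [0, 0, 0, 1, 1, 0, 1, 1],
})

# Equal-width bins, left-closed like the intervals quoted above
df["age_bin"] = pd.cut(df["age"], bins=[20, 40, 60], right=False)

# normalize="index" turns each row into shares that sum to 1
rates = pd.crosstab(df["age_bin"], df["purchase"], normalize="index")
print(rates)
```

Column 1 of the result is the purchase rate per bin, which can be compared directly across groups of different sizes.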

Box plots show how the continuous values are distributed

import matplotlib.pyplot as plt

feabox = ["'Product using score'", "' Estimated salary'"]
for feature in feabox:
    # Convert '?' to NaN, cast to numeric, and drop the gaps before plotting
    values = pd.to_numeric(train[feature].replace('?', np.nan)).dropna()
    plt.grid(linestyle="--", alpha=0.3)
    plt.boxplot(x=values,  # data to plot
                patch_artist=True,  # fill the box with a custom color (white by default)
                showmeans=True,  # show the mean as a point
                boxprops={'color': 'black', 'facecolor': 'steelblue'},  # box edge and fill colors
                # Outlier marker shape, fill color, and size
                flierprops={'marker': 'o', 'markerfacecolor': 'red', 'markersize': 3},
                # Mean marker shape, fill color, and size
                meanprops={'marker': 'D', 'markerfacecolor': 'indianred', 'markersize': 4},
                # Median line style and color
                medianprops={'linestyle': '--', 'color': 'orange'})
    # Hide the x-axis tick label, which would otherwise show as 1
    plt.xticks([])
    # Add the figure title
    plt.title(feature)
    plt.show()

Data preprocessing and feature processing

# Encoding maps
alldata = pd.concat([train, test], axis=0)
# Build the lookup dictionaries; '?' gets its own code
area_map = {'Taichung': 0, 'Tainan': 1, 'Taipei': 2, '?': 3}
gender_map = {'Female': 0, 'Male': 1, '?': 3}
# Apply the maps
alldata["'User area'"] = alldata["'User area'"].map(area_map)
alldata["gender"] = alldata["gender"].map(gender_map)
# Turn the remaining '?' into NaN
alldata = alldata.replace('?', np.nan)
# Fill these two columns with their mode; mode() returns a Series, so take [0]
for col in ["'Cumulative using time'", "'Product service usage'"]:
    alldata[col] = alldata[col].fillna(alldata[col].mode()[0])
# Assign every other missing value to its own category
alldata.fillna(value=-1, inplace=True)
# Convert to numeric
for feature in features:
    alldata[feature] = pd.to_numeric(alldata[feature])
# Split back into training and test sets; copy() avoids chained-assignment issues
newtrain = alldata[alldata["'Purchase or not'"] != 'Withheld'].copy()
newtrain["'Purchase or not'"] = pd.to_numeric(newtrain["'Purchase or not'"])
newtest = alldata[alldata["'Purchase or not'"] == 'Withheld'].copy()
newtrain.info()
newtest.info()

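Start with a random forest as a baseline

# Quick prediction with a random forest
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Every column except the ID and the target is a predictor
predictors = [c for c in newtrain.columns if c not in ('ID', "'Purchase or not'")]
X_train, X_val, y_train, y_val = train_test_split(newtrain[predictors], newtrain["'Purchase or not'"],
                                                  test_size=0.2, random_state=1234)
rf = RandomForestClassifier(n_estimators=1000, min_samples_split=5, min_samples_leaf=3)
rf.fit(X_train, y_train)
# Both metrics take the true labels as the first argument
print(accuracy_score(y_val, rf.predict(X_val)))
# AUC is computed from predicted probabilities rather than hard labels
print(roc_auc_score(y_val, rf.predict_proba(X_val)[:, 1]))

Output of the author's run (there the second figure was computed as roc_auc_score(rf.predict(X_val), y_val), that is, on hard predictions with the arguments swapped, so it is not a standard ROC AUC):

0.83
0.7924476371342856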
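For reference, sklearn's roc_auc_score takes the true labels first and a score second, and the score is normally a probability for the positive class rather than a thresholded 0/1 prediction. The sketch below uses made-up labels and probabilities to show that the two inputs give different values.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and predicted probabilities for the positive class
y_true = np.array([0, 0, 1, 1, 0, 1])
y_prob = np.array([0.10, 0.40, 0.35, 0.80, 0.45, 0.90])

# Ranking-based AUC from probabilities: the signature is (y_true, y_score)
auc_prob = roc_auc_score(y_true, y_prob)

# Thresholding at 0.5 first throws away the ranking within each class
y_hard = (y_prob >= 0.5).astype(int)
auc_hard = roc_auc_score(y_true, y_hard)

print(auc_prob, auc_hard)
```

With a fitted sklearn classifier, the scores would typically come from predict_proba(X)[:, 1].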

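Gradient boosting training

Bagging and boosting are two common ensemble methods, and random forest belongs to the former. Next we try to improve on the baseline with a boosting model, GBDT. Implementations of GBDT include xgboost and LightGBM, and LightGBM is reputed to be the more efficient and accurate of the two, so it is time to turn into a parameter-tuning hero and start training.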

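Step 1: set an initial learning rate and tune the number of iterations

import pandas as pd
import lightgbm as lgb

params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': 'auc',
    'nthread': 4,
    'learning_rate': 0.1,
    'num_leaves': 30,
    'max_depth': 6,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
}
data_train = lgb.Dataset(X_train, y_train)
# Early stopping is passed as a callback; newer LightGBM releases no longer
# accept the early_stopping_rounds keyword in lgb.cv
cv_results = lgb.cv(params, data_train, num_boost_round=1000, nfold=5, stratified=False, shuffle=True,
                    metrics='auc', callbacks=[lgb.early_stopping(stopping_rounds=50)], seed=0)
# The result key is 'auc-mean' or 'valid auc-mean' depending on the version
auc_mean = [v for k, v in cv_results.items() if k.endswith('auc-mean')][0]
print('best n_estimators:', len(auc_mean))
print('best cv score:', pd.Series(auc_mean).max())

best n_estimators: 44
best cv score: 0.7852211260765876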

Step 2: with the number of iterations fixed, determine max_depth and num_leaves

These two parameters matter most for accuracy. Here we use sklearn's GridSearchCV() to search over them.

params_test1 = {'max_depth': range(3, 8, 1), 'num_leaves': range(5, 100, 5)}
gsearch1 = GridSearchCV(estimator=lgb.LGBMClassifier(boosting_type='gbdt', objective='binary', metrics='auc', learning_rate=0.1,
                                                     n_estimators=44, max_depth=6, bagging_fraction=0.8, feature_fraction=0.8),
                        param_grid=params_test1, scoring='roc_auc', cv=5, n_jobs=-1)
gsearch1.fit(X_train, y_train)
means = gsearch1.cv_results_['mean_test_score']
params = gsearch1.cv_results_['params']
for mean, param in zip(means, params):
    print("%f with: %r" % (mean, param))
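Instead of reading the winning combination off the printed list, GridSearchCV also exposes it directly through best_params_ and best_score_. The sketch below shows this on synthetic data with a DecisionTreeClassifier standing in for the LightGBM model; the data, the estimator, and the small grid are assumptions chosen to keep the example quick to run.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for the real training set (assumption)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# A deliberately small grid: 3 x 2 = 6 combinations
grid = {"max_depth": [2, 3, 4], "min_samples_leaf": [1, 5]}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid=grid,
    scoring="roc_auc",
    cv=5,
)
search.fit(X, y)

# best_params_ and best_score_ summarize the winning combination
print(search.best_params_)
print(round(search.best_score_, 4))
```

The values found at one step can then be passed as fixed arguments to the estimator in the next step, which is the pattern the following steps use.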

Step 3: tune min_data_in_leaf and max_bin

params_test2 = {'max_bin': range(5, 256, 10), 'min_data_in_leaf': range(1, 102, 10)}
gsearch2 = GridSearchCV(estimator=lgb.LGBMClassifier(boosting_type='gbdt', objective='binary', metrics='auc', learning_rate=0.1,
                                                     n_estimators=44, max_depth=3, num_leaves=5, bagging_fraction=0.8, feature_fraction=0.8),
                        param_grid=params_test2, scoring='roc_auc', cv=5, n_jobs=-1)
gsearch2.fit(X_train, y_train)
means = gsearch2.cv_results_['mean_test_score']
params = gsearch2.cv_results_['params']
for mean, param in zip(means, params):
    print("%f with: %r" % (mean, param))

Step 4: determine feature_fraction, bagging_fraction, and bagging_freq

params_test3 = {'feature_fraction': [0.6, 0.7, 0.8, 0.9, 1.0],
                'bagging_fraction': [0.6, 0.7, 0.8, 0.9, 1.0],
                'bagging_freq': range(0, 81, 10)
                }
gsearch3 = GridSearchCV(estimator=lgb.LGBMClassifier(boosting_type='gbdt', objective='binary', metrics='auc', learning_rate=0.1,
                                                     n_estimators=44, max_depth=3, num_leaves=5, max_bin=185, min_data_in_leaf=1),
                        param_grid=params_test3, scoring='roc_auc', cv=5, n_jobs=-1)
gsearch3.fit(X_train, y_train)
means = gsearch3.cv_results_['mean_test_score']
params = gsearch3.cv_results_['params']
for mean, param in zip(means, params):
    print("%f with: %r" % (mean, param))

Step 5: tune lambda_l1 and lambda_l2

params_test4 = {'lambda_l1': [1e-5, 1e-3, 1e-1, 0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0],
                'lambda_l2': [1e-5, 1e-3, 1e-1, 0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0]
                }
gsearch4 = GridSearchCV(estimator=lgb.LGBMClassifier(boosting_type='gbdt', objective='binary', metrics='auc', learning_rate=0.1,
                                                     n_estimators=44, max_depth=3, num_leaves=5, max_bin=185, min_data_in_leaf=1,
                                                     bagging_fraction=0.9, bagging_freq=40, feature_fraction=0.7),
                        param_grid=params_test4, scoring='roc_auc', cv=5, n_jobs=-1)
gsearch4.fit(X_train, y_train)
means = gsearch4.cv_results_['mean_test_score']
params = gsearch4.cv_results_['params']
for mean, param in zip(means, params):
    print("%f with: %r" % (mean, param))

Step 6: determine min_split_gain

params_test5 = {'min_split_gain': [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]}
gsearch5 = GridSearchCV(estimator=lgb.LGBMClassifier(boosting_type='gbdt', objective='binary', metrics='auc', learning_rate=0.1,
                                                     n_estimators=44, max_depth=3, num_leaves=5, max_bin=185, min_data_in_leaf=1,
                                                     bagging_fraction=0.9, bagging_freq=40, feature_fraction=0.7,
                                                     lambda_l1=1e-05, lambda_l2=0.001),
                        param_grid=params_test5, scoring='roc_auc', cv=5, n_jobs=-1)
gsearch5.fit(X_train, y_train)
means = gsearch5.cv_results_['mean_test_score']
params = gsearch5.cv_results_['params']
for mean, param in zip(means, params):
    print("%f with: %r" % (mean, param))

Step 7: lower the learning rate, increase the number of iterations, and validate the model

from sklearn.metrics import accuracy_score, roc_auc_score

X_train, X_val, y_train, y_val = train_test_split(newtrain[predictors], newtrain["'Purchase or not'"],
                                                  test_size=0.2, random_state=2019)
model = lgb.LGBMClassifier(boosting_type='gbdt', objective='binary', metrics='auc', learning_rate=0.005,
                           n_estimators=2900, max_depth=3, num_leaves=5, max_bin=185, min_data_in_leaf=1,
                           bagging_fraction=0.9, bagging_freq=40, feature_fraction=0.7,
                           lambda_l1=1e-05, lambda_l2=0.001, min_split_gain=0)
model.fit(X_train, y_train)
# Predictions on the validation set
pred = model.predict(X_val)
print(accuracy_score(y_val, pred))
# Predict the withheld test set and write the submission file
newtest = newtest.copy()
newtest['Predicted_Results'] = model.predict(newtest[predictors])
newtest[['ID', 'Predicted_Results']].to_csv('results3.csv', index=False)

The result is

0.8341666666666666

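Computing the final profit

Using the cost of each marketing message, the other costs, and the revenue, compute the final profit matrix and the profit.

from sklearn.metrics import confusion_matrix

# Compute the final profit
con_matrix = confusion_matrix(y_val, pred)
print(con_matrix)
# sklearn puts true labels on rows and predictions on columns, so for
# labels [0, 1] the flattened order is tn, fp, fn, tp
tn, fp, fn, tp = con_matrix.ravel()
profitA = tp * 1500 - fp * 500
print(profitA)
profitB = tp * 700 - fp * 500
print(profitB)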
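The question raised in the background, whether to advertise to everyone or only to the customers a model selects, can be framed with the same confusion-matrix counts. The sketch below is an illustration on made-up labels and predictions; it reuses the 1500 and 500 figures from the code above and assumes they mean the gain per correctly targeted buyer and the loss per wrongly targeted non-buyer.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical validation labels and model predictions (1 = purchase)
y_true = np.array([1, 0, 0, 1, 0, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 1, 0, 0, 0, 0, 1, 0])

# Assumed unit economics, taken from the figures used above
GAIN_PER_BUYER = 1500     # gain for each contacted customer who buys
LOSS_PER_NON_BUYER = 500  # loss for each contacted customer who does not

# For binary labels [0, 1], ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Targeted: contact only the customers the model predicts will buy
profit_model = tp * GAIN_PER_BUYER - fp * LOSS_PER_NON_BUYER

# Blanket: contact everyone, so all buyers gain and all non-buyers cost
profit_all = (tp + fn) * GAIN_PER_BUYER - (fp + tn) * LOSS_PER_NON_BUYER

print(profit_model, profit_all)
```

Which strategy wins depends on the unit figures and on how many buyers the model misses (fn), so the comparison needs to be rerun with the real costs and the actual validation predictions.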


Last edited: 2020.01.20 10:35:42

