Predicting User Ad Clicks (An Attempt)

The data comes from Alibaba Cloud's Tianchi platform (https://tianchi.aliyun.com/dataset/dataDetail?dataId=56). It contains four tables: the user behavior log behavior_log (bl), the raw sample skeleton raw_sample (rs), the ad feature table ad_feature (af), and the user profile table user_profile (up).

What follows is just a quick attempt at prediction with a random forest, so rows with missing values are simply dropped. The final accuracy comes out at 93.95%, which looks decent, though clicks are rare in this dataset, so a high accuracy largely reflects the majority non-click class.
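Before reading the 93.95% too optimistically, it is worth checking the label balance. A minimal sketch that scans only the clk column of raw_sample in chunks (same path and chunk size as used below):

import pandas as pd

clicks, total = 0, 0
for chunk in pd.read_csv(r'E:\datafile\ad_clik\raw_sample.csv',
                         usecols=['clk'], chunksize=10000000):
    clicks += chunk['clk'].sum()
    total += len(chunk)
print('overall CTR: %.4f' % (clicks / total))
# the click rate is only around 5%, so always predicting "no click"
# already scores roughly 95% accuracy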

The code is as follows:

import pandas as pd
import numpy as np
import datetime
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')  # suppress warnings

up = pd.read_csv(r'E:\datafile\ad_clik\user_profile.csv')
af = pd.read_csv(r'E:\datafile\ad_clik\ad_feature.csv')
rs = pd.read_csv(r'E:\datafile\ad_clik\raw_sample.csv',iterator=True,chunksize=10000000,header=0)

# Keep only fully non-null user rows: 395,932 remain, 665,836 userids have missing values.
# pvalue_level and new_user_class_level are the columns with NaNs; collect the userids
# missing one, the other, or both (note: 'new_user_class_level ' really ends with a
# space in the CSV)
user_class_null = up[up['new_user_class_level '].isnull() & up['pvalue_level'].notnull()]['userid'].tolist()
user_pvalue_null = up[up['pvalue_level'].isnull() & up['new_user_class_level '].notnull()]['userid'].tolist()
user_class_pvalue_null = up[up['pvalue_level'].isnull() & up['new_user_class_level '].isnull()]['userid'].tolist()
u = []
u.extend(user_class_null)
u.extend(user_pvalue_null)
u.extend(user_class_pvalue_null)
complete_up = up[~up['userid'].isin(u)]
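Since those two columns are the only ones with missing values, the same filtering can be written in one line; a minimal equivalent, assuming no other column contains NaN:

complete_up = up.dropna()  # keeps the same 395,932 fully non-null rows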

# Distribution of each user feature (and of the ad attributes); the plots are omitted here

vector = ['cms_segid','cms_group_id','final_gender_code','age_level','pvalue_level','shopping_level','occupation','new_user_class_level ']
%matplotlib inline
for i in vector:
    y = complete_up[i].value_counts().reset_index()
    y.columns = [i,'person_count']
    y = y.sort_values(by=i,ascending=True)  # sort_values returns a copy, so assign it back
    x = y[i].tolist()
    cou = y['person_count'].tolist()
    plt.figure(figsize=(15,8))
    plt.bar(x,cou)
    plt.show()


# Build the training set: keep the 2017-05-06 ~ 2017-05-12 window of raw_sample
t1 = '2017-05-06 00:00:00'
t2 = '2017-05-12 23:59:59'
f = '%Y-%m-%d %H:%M:%S'
startTime = datetime.datetime.strptime(t1,f).timestamp()  # 1494000000.0
endTime = datetime.datetime.strptime(t2,f).timestamp()    # 1494604799.0
# Keep only userids that appear in complete_up
u = complete_up['userid'].tolist()
# Drop adgroup_ids whose brand is missing in af
a = af[af['brand'].isnull()]['adgroup_id'].tolist()
count = 0
for chunk in rs:
    chunk.drop(index=chunk[chunk.time_stamp < startTime].index,inplace=True)
    chunk.drop(index=chunk[chunk.time_stamp > endTime].index,inplace=True)
    chunk.drop(index=chunk[chunk['adgroup_id'].isin(a)].index,inplace=True)
    chunk.drop(index=chunk[~chunk['user'].isin(u)].index,inplace=True)
    dates = []  # avoid shadowing the built-in name `list`
    for i in chunk.time_stamp.index:
        dates.append(datetime.datetime.fromtimestamp(chunk.time_stamp[i]))
    chunk.insert(loc=3,column='datetimes',value=dates)
    del chunk['time_stamp']
    chunk.to_csv('E:\\datafile\\rs\\rs_train_complete.csv',mode='a',index=False,header=0)  # header=0 is falsy, so no header row is written
    count += 1
    print(count,end='-')
print('ok')
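The inner row-by-row loop is the slow part of this pass; pandas can do the same conversion vectorized. A drop-in sketch for those lines (note: pd.to_datetime(..., unit='s') yields UTC-based datetimes, while datetime.fromtimestamp uses local time, so the values can differ by the local UTC offset):

# Vectorized replacement for the per-row timestamp loop above:
chunk.insert(loc=3, column='datetimes',
             value=pd.to_datetime(chunk['time_stamp'], unit='s'))
del chunk['time_stamp']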

# Join up and af onto the training skeleton
rs_train = pd.read_csv('E:\\datafile\\rs\\rs_train_complete.csv',header=None,
                       names=['userid','adgroup_id','datetimes','pid','nonclk','clk'])
df = pd.merge(rs_train,up,how='left',on='userid')
df = pd.merge(df,af,how='left',on='adgroup_id')
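Because raw_sample was already filtered down to complete users and to ads with a known brand, neither left join should introduce NaNs; a quick sanity check (a sketch, assuming ad_feature has no missing values outside brand):

assert df.isnull().sum().sum() == 0  # every row found its user profile and ad features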

### Some ad attributes take a huge number of distinct values, so convert them by bucketing on click count and click-through rate

# The same computation is needed for cate_id, customer, campaign_id and brand,
# so wrap it in one helper. The q values were tuned per column so that, after
# duplicate bin edges are dropped, exactly 10 bins survive to match the 10
# labels (pd.qcut raises a ValueError when labels and bins disagree).
def add_clk_bins(df, key, prefix, q_clk, q_ratio):
    # clicks per value of `key`
    clk = df[key][df['clk'] == 1].value_counts().reset_index()
    clk.columns = [key, 'clk']
    # impressions per value of `key`
    total = df[key].value_counts().reset_index()
    total.columns = [key, 'counts']
    stats = pd.merge(clk, total, how='outer', on=key).fillna(0)
    stats['clk_ratio'] = (stats['clk'] / stats['counts']).map(lambda x: float('%.4f' % x))
    labels = [1,2,3,4,5,6,7,8,9,10]
    stats[prefix + '_clk_bins'] = pd.qcut(stats['clk'], q_clk, duplicates='drop', labels=labels).astype(int)
    stats[prefix + '_clk_ratio_bins'] = pd.qcut(stats['clk_ratio'], q_ratio, duplicates='drop', labels=labels).astype(int)
    return stats.drop(['clk','counts','clk_ratio'], axis=1)

cate = add_clk_bins(df, 'cate_id', 'cate', 16, 14)
cust = add_clk_bins(df, 'customer', 'cust', 65, 26)
camp = add_clk_bins(df, 'campaign_id', 'camp', 100, 30)
brand = add_clk_bins(df, 'brand', 'brand', 40, 22)
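Why the odd-looking q values (16, 65, 100, ...)? With duplicates='drop', pd.qcut merges quantile edges that coincide, and the labels list must match however many bins survive. A small illustration on made-up, click-count-like data:

s = pd.Series([0]*90 + list(range(1, 11)))   # heavily skewed, like per-category clicks
binned = pd.qcut(s, 20, duplicates='drop')   # ask for 20 quantiles...
print(binned.cat.categories.size)            # ...far fewer bins actually survive

The q value for each column was presumably raised by trial and error until exactly 10 bins survived.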

### Correlation analysis

# Merge the bucketed features onto df to build the modeling table t
t = pd.merge(df,cate,how='left',on='cate_id')
t = pd.merge(t,cust,how='left',on='customer')
t = pd.merge(t,camp,how='left',on='campaign_id')
t = pd.merge(t,brand,how='left',on='brand')

from sklearn.feature_selection import chi2,SelectKBest
X = t[['cms_segid','cms_group_id','final_gender_code','age_level','pvalue_level','shopping_level','occupation','new_user_class_level ',
       'cate_clk_bins','cate_clk_ratio_bins','cust_clk_bins','cust_clk_ratio_bins','camp_clk_bins','camp_clk_ratio_bins','brand_clk_bins',
       'brand_clk_ratio_bins']].values
print(X.shape)
y = t['clk'].tolist()

# An exploratory run with k='all' showed that all p-values are below 0.05, but
# features 4-8 (age_level, pvalue_level, shopping_level, occupation,
# new_user_class_level) have much weaker chi2 scores than the rest
selector = SelectKBest(chi2,k=11)
v = selector.fit(X, y).get_support(indices=True)
print(v)  # indices of the 11 selected features
scores = selector.scores_
print(scores)
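The raw score array is hard to read on its own; a small sketch that pairs each chi2 score and p-value with its feature name (feature_names simply repeats the column list used for X):

feature_names = ['cms_segid','cms_group_id','final_gender_code','age_level','pvalue_level',
                 'shopping_level','occupation','new_user_class_level ','cate_clk_bins',
                 'cate_clk_ratio_bins','cust_clk_bins','cust_clk_ratio_bins','camp_clk_bins',
                 'camp_clk_ratio_bins','brand_clk_bins','brand_clk_ratio_bins']
score_table = pd.DataFrame({'feature': feature_names,
                            'chi2': selector.scores_,
                            'p_value': selector.pvalues_})
print(score_table.sort_values('chi2', ascending=False))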

### A separate check in SPSS also confirmed that price is correlated with clk

### Prediction

# pid is a string id such as '430548_1007', so map it to integer codes first;
# the same mapping is reused for the test set below
pid_codes = {p: c for c, p in enumerate(sorted(t['pid'].unique()))}
t['pid'] = t['pid'].map(pid_codes)
todrop = ['userid','adgroup_id','datetimes','nonclk','cate_id','campaign_id','customer','brand']
t.drop(todrop,axis=1,inplace=True)

## Cross-validate on the training window first
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score,train_test_split
# Keep only the 11 most associated categorical features, plus pid and price
rf_x = t[['cms_segid','cms_group_id','final_gender_code','cate_clk_bins','cate_clk_ratio_bins','cust_clk_bins','cust_clk_ratio_bins',
          'camp_clk_bins','camp_clk_ratio_bins','brand_clk_bins','brand_clk_ratio_bins','pid','price']].values
rf_y = t['clk'].tolist()
train_X,test_X, train_y, test_y = train_test_split(rf_x,rf_y,test_size=1/5)
clf1 = RandomForestClassifier(n_estimators=10,max_depth=None,min_samples_split=2,random_state=0)
scores = cross_val_score(clf1,train_X,train_y,scoring='accuracy',cv=5)
clf1.fit(train_X,train_y)
y_pred = clf1.predict(test_X)
print(scores.mean())  # 0.9394756096191237
print(scores.std())   # 0.00022112156289561522
test = pd.DataFrame([y_pred,test_y],index=['y_pred','test_y']).T
print(test[(y_pred!=test_y) & (y_pred==1)]['y_pred'].size)  # false positives
print(test[(y_pred!=test_y) & (y_pred==0)]['y_pred'].size)  # false negatives
print('Hold-out accuracy:',test[(y_pred==test_y)]['y_pred'].size/test['y_pred'].size)  # 0.9395123965128687
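Given the class imbalance noted at the start, accuracy by itself says little; a sketch of more informative metrics on the same hold-out split, using the fitted clf1:

from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score

print(confusion_matrix(test_y, y_pred))                  # rows: true 0/1, columns: predicted 0/1
print(classification_report(test_y, y_pred, digits=4))   # per-class precision and recall
# Ranking quality from predicted probabilities, independent of the 0.5 threshold
print(roc_auc_score(test_y, clf1.predict_proba(test_X)[:, 1]))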

## Now predict on the test set (2017-05-13)
rs_test = pd.DataFrame()
# The chunk iterator was exhausted by the training pass, so re-open it
rs = pd.read_csv(r'E:\datafile\ad_clik\raw_sample.csv',iterator=True,chunksize=10000000,header=0)
# Time window to extract
t1 = '2017-05-13 00:00:00'
t2 = '2017-05-13 23:59:59'
f = '%Y-%m-%d %H:%M:%S'
startTime = datetime.datetime.strptime(t1,f).timestamp()  # 1494604800.0
endTime = datetime.datetime.strptime(t2,f).timestamp()    # 1494691199.0
# Keep only userids that appear in complete_up
u = complete_up['userid'].tolist()
# Drop adgroup_ids whose brand is missing in af
a = af[af['brand'].isnull()]['adgroup_id'].tolist()
count = 0
for chunk in rs:
    chunk.drop(index=chunk[chunk.time_stamp < startTime].index,inplace=True)
    chunk.drop(index=chunk[chunk.time_stamp > endTime].index,inplace=True)
    chunk.drop(index=chunk[chunk['adgroup_id'].isin(a)].index,inplace=True)
    chunk.drop(index=chunk[~chunk['user'].isin(u)].index,inplace=True)
    dates = []
    for i in chunk.time_stamp.index:
        dates.append(datetime.datetime.fromtimestamp(chunk.time_stamp[i]))
    chunk.insert(loc=3,column='datetimes',value=dates)
    del chunk['time_stamp']
    rs_test = pd.concat([rs_test,chunk])
    count += 1
    print(count,end='-')
print('ok')

rs_test.columns = ['userid','adgroup_id','datetimes','pid','nonclk','clk']
temp = pd.merge(rs_test,up,how='left',on='userid')
rf_test = pd.merge(temp,af,how='left',on='adgroup_id')
temp = pd.merge(rf_test,cate,how='left',on='cate_id')
temp = pd.merge(temp,cust,how='left',on='customer')
temp = pd.merge(temp,camp,how='left',on='campaign_id')
rf_test = pd.merge(temp,brand,how='left',on='brand')
rf_test['pid'] = rf_test['pid'].map(pid_codes)  # same pid encoding as the training set
todrop = ['userid','adgroup_id','datetimes','nonclk','cate_id','campaign_id','customer','brand','age_level','pvalue_level',
          'shopping_level','occupation','new_user_class_level ']
rf_test.drop(todrop,axis=1,inplace=True)

# Drop rows with missing values
RF_test = rf_test.dropna()
# Inspect the rows that failed to match: 18,451 of them
test_null = rf_test[rf_test.isnull().any(axis=1)]
test_null.describe()
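These NaNs come from category, customer, campaign, or brand values that appear on 2017-05-13 but never occurred in the training window, so the left merges find no bucket for them. To see which columns are affected:

print(rf_test.isnull().sum())  # non-zero counts mark the bucket columns with unseen values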

test_x = RF_test[['cms_segid','cms_group_id','final_gender_code','cate_clk_bins','cate_clk_ratio_bins','cust_clk_bins','cust_clk_ratio_bins',
                  'camp_clk_bins','camp_clk_ratio_bins','brand_clk_bins','brand_clk_ratio_bins','pid','price']].values
test_y = RF_test['clk'].tolist()
y_pred = clf1.predict(test_x)
test = pd.DataFrame([y_pred,test_y],index=['y_pred','test_y']).T
print(test[(y_pred!=test_y) & (y_pred==1)]['y_pred'].size)  # false positives
print(test[(y_pred!=test_y) & (y_pred==0)]['y_pred'].size)  # false negatives
print('Test-day accuracy:',test[(y_pred==test_y)]['y_pred'].size/test['y_pred'].size)
