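Project background

An end-to-end exploration of a classic ad-performance dataset, from data review through K-Means clustering to interpreting the resulting clusters.

1 Dataset review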

import matplotlib.pyplot as plt  # plotting library
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.metrics import silhouette_score  # silhouette coefficient metric
from sklearn.cluster import KMeans  # K-Means estimator
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder  # preprocessing utilities
data = pd.read_table(r'E:\BaiduNetdiskDownload\Statistics\python数据分析与数据化运营\chapter7\ad_performance.txt')
data

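# The scales differ by several orders of magnitude: the mean 日均UV (daily UV) is about 540,
# while the registration and conversion rates are around 0.0x, so min-max scaling is applied later.
data.describe()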

data.isnull().any(axis=0)  # 平均停留时间 (average time on page) contains missing values

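# Restrict to numeric columns so corr() does not fail on the string columns
pd.DataFrame(data.select_dtypes(include="number").corr()["访问深度"].sort_values(ascending=False))

平均停留时间 (average time on page) is strongly correlated with 访问深度 (visit depth), so it should be dropped. The task here is clustering, and clustering algorithms are sensitive to collinear features.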
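The article does not show the step that drops the column, although the later column listing no longer contains 平均停留时间. Below is a minimal sketch of one way to do it, assuming a correlation cutoff of 0.7 (an illustrative threshold, not taken from the article) and a made-up toy frame that only borrows the article's column names.

```python
import pandas as pd

# Hypothetical toy data: the column names follow the article, the values are invented.
toy = pd.DataFrame({
    "访问深度": [1.0, 2.0, 3.0, 4.0],
    "平均停留时间": [10.0, 21.0, 29.0, 41.0],
    "日均UV": [5.0, 1.0, 4.0, 2.0],
})

# Correlation of every other column with 访问深度
corr = toy.corr()["访问深度"].drop("访问深度")

# Assumed cutoff: treat |r| > 0.7 as "strongly correlated"
to_drop = corr[corr.abs() > 0.7].index.tolist()

# Drop the collinear columns before clustering
reduced = toy.drop(columns=to_drop)
print(to_drop)
print(reduced.columns.tolist())
```

On the article's data the equivalent step would presumably be `data = data.drop(columns=["平均停留时间"])`.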

2 Exploratory data visualization

sns.set_style('white', {'font.sans-serif': ['simhei', 'Arial']})  # white background, plus a font that can render Chinese labels

2.1 Exploring the categorical variables

2.1.1 Ad type (广告类型)

sns.countplot(y="广告类型", data=data, color='#1E90FF',
              order=data["广告类型"].value_counts().index)
data["广告类型"].value_counts(normalize=True)  # share of each category

2.1.2 Creative type (素材类型)

sns.countplot(y="素材类型", data=data, color='#1E90FF',
              order=data["素材类型"].value_counts().index)
data["素材类型"].value_counts(normalize=True)  # share of each category

2.1.3 Cooperation model (合作方式)

sns.countplot(y="合作方式", data=data, color='#1E90FF',
              order=data["合作方式"].value_counts().index)
data["合作方式"].value_counts(normalize=True)  # share of each category

2.1.4 Ad size (广告尺寸)

sns.countplot(y="广告尺寸", data=data, color='#1E90FF',
              order=data["广告尺寸"].value_counts().index)
data["广告尺寸"].value_counts(normalize=True)  # share of each category

2.1.4.1 Converting ad size to ad area

data["广告尺寸"].unique()[0].split("*", 2)[0], data["广告尺寸"].unique()[0].split("*", 2)[1]  # the goal is to extract the two numbers, e.g. 140 and 40

sizes = data["广告尺寸"].unique()  # distinct size strings in "length*width" form
# split() returns strings, so convert to int before computing areas
length_lst = [int(s.split("*")[0]) for s in sizes]  # list of lengths
width_lst = [int(s.split("*")[1]) for s in sizes]  # list of widths

2.1.4.2 Table of ad areas

frames = [pd.DataFrame(length_lst, columns=["广告尺寸长度"]),
          pd.DataFrame(width_lst, columns=["广告尺寸宽度"]),
          pd.DataFrame(np.array(length_lst) * np.array(width_lst), columns=["广告尺寸面积"])]
df_adv = pd.concat(frames, join='inner', axis=1)
df_adv.sort_values(by="广告尺寸面积", ascending=False)
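The same length, width and area table can also be built without explicit loops. The sketch below uses pandas string methods on a hypothetical three-row sample; the sizes are the ones mentioned in this article, but the Series itself is made up for illustration.

```python
import pandas as pd

# Hypothetical sample of size strings in the article's "length*width" format
sizes = pd.Series(["140*40", "308*388", "600*90"], name="广告尺寸")

# Split each string on "*" into two columns and convert them to integers
parts = sizes.str.split("*", expand=True).astype(int)
parts.columns = ["广告尺寸长度", "广告尺寸宽度"]

# Area is simply length times width
parts["广告尺寸面积"] = parts["广告尺寸长度"] * parts["广告尺寸宽度"]

df_area = parts.sort_values(by="广告尺寸面积", ascending=False)
print(df_area)
```

Applied to the real data, `sizes` would be `pd.Series(data["广告尺寸"].unique())`.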

2.1.5 Ad selling point (广告卖点)

sns.countplot(y='广告卖点', data=data, color='#1E90FF',
              order=data['广告卖点'].value_counts().index)
data["广告卖点"].value_counts(normalize=True)  # share of each category

3 Data transformation

3.1 Encoding the categorical variables

3.1.1 Method 1: get_dummies

x_cate = pd.get_dummies(data.loc[:, "素材类型":"广告卖点"])
x_cate  # get_dummies performs one-hot encoding

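3.1.2 Method 2: OneHotEncoder

# The default output is a sparse matrix; toarray() makes it dense. This replaces sparse=False,
# which recent scikit-learn releases no longer accept (the argument was renamed sparse_output).
OneHotEncoder().fit_transform(data.loc[:, "素材类型":"广告卖点"]).toarray()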

3.2 Scaling the numeric variables

data.iloc[:, 1:7].describe()  # note the large gap between the minimum and maximum of 日均UV and 访问深度

x_seq = MinMaxScaler().fit_transform(data.iloc[:, 1:7]).round(2)  # fit_transform fits and transforms in one call; otherwise call fit, then transform
x_seq
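For reference, min-max scaling with the default range maps each column to [0, 1] via x' = (x - min) / (max - min). The sketch below reproduces that formula with NumPy on made-up values, so that a UV-like column and a rate-like column end up on the same scale.

```python
import numpy as np

# Hypothetical values: column 0 is on a UV-like scale, column 1 on a rate-like scale
X = np.array([[540.0, 0.01],
              [100.0, 0.03],
              [980.0, 0.05]])

col_min = X.min(axis=0)  # per-column minimum
col_max = X.max(axis=0)  # per-column maximum

# Min-max formula applied column by column
X_scaled = (X - col_min) / (col_max - col_min)
print(X_scaled.round(2))
```

After scaling, both columns span 0 to 1, so neither dominates the Euclidean distances that K-Means relies on.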

3.2.1 Combining the transformed data into one table

prepar_df = pd.concat([pd.DataFrame(x_seq, columns=data.columns[1:7]), x_cate], join='inner', axis=1)
prepar_df

4 Building the K-Means model

4.1 Method 1: elbow (scree) plot

4.1.1 Categorical variables

for i in range(1, 45):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(x_cate)
    inertia = kmeans.inertia_
    print('For n_cluster = ', i, 'The inertia is:', inertia)

4.1.2 Numeric variables

for i in range(1, 200):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(x_seq)
    inertia = kmeans.inertia_
    print('For n_cluster = ', i, 'The inertia is:', inertia)

4.1.3 Elbow plot for the numeric and categorical features combined

inertias_1 = []
for i in range(1, 20):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(prepar_df)
    inertia = kmeans.inertia_
    inertias_1.append(inertia)
    print('For n_cluster = ', i, 'The inertia is:', inertia)
figure = plt.figure(1, figsize=(15, 6))
plt.plot(np.arange(1, 20), inertias_1, alpha=0.5, marker='o')
plt.xlabel("K")
plt.ylabel("Inertia")

In the plot, the curve starts to flatten at around K=3 and is almost flat by K=7. Larger values of K are not meaningful even though the inertia keeps falling: the inertia reaches 0 only when every sample is treated as its own cluster.
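To make the quantity on the vertical axis concrete: inertia is the sum of squared distances from each sample to its nearest cluster center. The sketch below computes it by hand with NumPy on four made-up points and hand-picked centers (the helper name `inertia` is illustrative, not a library function), and shows it dropping to 0 when every point is its own center.

```python
import numpy as np

def inertia(X, centers):
    """Sum of squared distances from each point to its nearest center."""
    # Squared distance between every point and every center: shape (n_points, n_centers)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return float(d2.min(axis=1).sum())

# Hypothetical data: two tight pairs of points, far apart from each other
X = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 1.0]])

one_center = X.mean(axis=0, keepdims=True)         # K = 1: the overall mean
two_centers = np.array([[0.0, 0.5], [4.0, 0.5]])   # K = 2: the mean of each pair

print(inertia(X, one_center))   # K = 1
print(inertia(X, two_centers))  # K = 2
print(inertia(X, X))            # K = n: every point is its own center
```

The large drop from K=1 to K=2 followed by a small remaining gain is the "elbow" pattern the plot is read for.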

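4.2 Method 2: choosing K by the mean silhouette coefficient

4.2.1 The general procedure

Mean silhouette coefficient for n_clusters = 3:

model_kmeans = KMeans(n_clusters=3)  # create the clustering model
labels_tmp = model_kmeans.fit_predict(prepar_df)  # fit on the transformed data and assign a cluster label to each row
# Mean silhouette coefficient of the transformed data under these labels.
# Higher is better; among comparable scores, prefer the smaller K.
silhouette_score(prepar_df, labels_tmp)

[out]: 0.45746043641666684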
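For a single sample, the silhouette coefficient is s = (b - a) / max(a, b), where a is its mean distance to the other members of its own cluster and b is the smallest mean distance to the members of any other cluster. The sketch below computes the mean of s by hand on made-up one-dimensional data; `mean_silhouette` is an illustrative helper, and it assumes every cluster has at least two members.

```python
import numpy as np

def mean_silhouette(X, labels):
    """Mean silhouette coefficient; assumes every cluster has at least 2 members."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all samples
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False  # exclude the sample itself
        a = dist[i, same].mean()  # mean distance within its own cluster
        b = min(dist[i, labels == c].mean()  # mean distance to the nearest other cluster
                for c in np.unique(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Hypothetical data: two well-separated groups on a line
X = [[0.0], [1.0], [10.0], [11.0]]
labels = [0, 0, 1, 1]
score = mean_silhouette(X, labels)
print(round(score, 4))
```

Values close to 1 indicate compact, well-separated clusters; the scores of roughly 0.45 to 0.50 reported in this section indicate a moderate separation.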

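4.2.2 Looping over K

score_list = []  # stores the mean silhouette coefficient for each K
for k in range(3, 8):  # K from 3 to 7
    model_kmeans = KMeans(n_clusters=k)  # create the clustering model
    labels = model_kmeans.fit_predict(prepar_df)  # fit the model and get the labels
    silhouetteScore = silhouette_score(prepar_df, labels)  # mean silhouette coefficient for this K
    score_list.append([k, silhouetteScore])  # append() takes one object, so K and its score are stored together as a list
print(score_list)

[out]: [[3, 0.45746043641666684], [4, 0.5019703686844438], [5, 0.4798826406042495], [6, 0.4773992930791803], [7, 0.5005669314822621]]

K=4 gives the highest mean silhouette coefficient (0.5020). K=7 scores almost as high (0.5006), but the smaller K is preferred, so K=4 is chosen.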
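The choice of K can also be read off the list programmatically. The sketch below reuses the scores printed above; the tolerance of 0.005 for treating two scores as comparable is an assumption made for illustration, not something the article specifies.

```python
# Scores copied from the output above: [K, mean silhouette coefficient]
score_list = [[3, 0.45746043641666684], [4, 0.5019703686844438],
              [5, 0.4798826406042495], [6, 0.4773992930791803],
              [7, 0.5005669314822621]]

# K with the single highest score
best_k, best_score = max(score_list, key=lambda pair: pair[1])

# Assumed tolerance: scores within 0.005 of the best count as comparable,
# and among those the smallest K is preferred
tolerance = 0.005
candidates = [k for k, s in score_list if best_score - s <= tolerance]
chosen_k = min(candidates)

print(best_k, candidates, chosen_k)
```

Because the loop above does not fix `random_state`, the scores can vary slightly between runs, which is one reason to prefer a rule that does not depend on very small differences.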

4.2.3 Refitting the model and obtaining the cluster labels

This time K=4 is set by hand.

model_kmeans = KMeans(n_clusters=4, random_state=117)  # create the clustering model
labels = model_kmeans.fit_predict(prepar_df)
pd.DataFrame(labels, columns=['clusters'])

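4.2.4 A key turning point: merging the labels back into the original data

The transformed table prepar_df from the earlier steps has served its purpose now that clustering has produced the clusters labels; it is not used in the following steps.

final_df = pd.concat((data, pd.DataFrame(labels, columns=['clusters'])), axis=1)
final_df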

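5 Summarizing the model

For clusters = 0:

final_df[final_df["clusters"] == 0].iloc[:, 1:7].describe()  # descriptive statistics for clusters = 0; the means are used later to summarize each cluster

final_df[final_df["clusters"] == 0].iloc[:, 7:-1].describe()  # the "top" row shows the most frequent value of each categorical feature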

5.1 Combining the numeric feature means with the categorical features

cluster_features = []
for i in range(4):  # loop over the 4 clusters
    label_data = final_df[final_df['clusters'] == i]  # rows belonging to this cluster

    part1_data = label_data.iloc[:, 1:7]  # numeric features
    part1_desc = part1_data.describe()  # descriptive statistics of the numeric features
    merge_data1 = part1_desc.iloc[1, :]  # the "mean" row

    part2_data = label_data.iloc[:, 7:-1]  # string (categorical) features
    part2_desc = part2_data.describe(include='all')  # descriptive statistics of the categorical features
    merge_data2 = part2_desc.iloc[2, :]  # the "top" row: the most frequent value

    merge_line = pd.concat((merge_data1, merge_data2), axis=0)  # stack the numeric and categorical summaries into one Series
    cluster_features.append(merge_line)  # collect the summary of each cluster
cluster_pd = pd.DataFrame(cluster_features).T
cluster_pd
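The per-cluster summary (mean of each numeric feature, most frequent value of each categorical feature) can also be expressed as a single groupby aggregation. The sketch below uses a made-up five-row frame with one numeric and one categorical column; only the column names come from the article.

```python
import pandas as pd

# Hypothetical data: one numeric feature, one categorical feature, and cluster labels
toy = pd.DataFrame({
    "日均UV": [100.0, 300.0, 50.0, 70.0, 90.0],
    "素材类型": ["jpg", "jpg", "swf", "gif", "swf"],
    "clusters": [0, 0, 1, 1, 1],
})

# Mean for the numeric column, most frequent value (mode) for the categorical one
summary = toy.groupby("clusters").agg(
    {"日均UV": "mean", "素材类型": lambda s: s.mode().iloc[0]}
).T  # transpose so that features are rows and clusters are columns

print(summary)
```

One difference to keep in mind: when several values tie for most frequent, `mode().iloc[0]` returns the first in sorted order, which may not match the value that `describe()` reports as `top`.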

5.2 Trying the median instead

cluster_features_ = []
for i in range(4):  # loop over the 4 clusters
    label_data = final_df[final_df['clusters'] == i]  # rows belonging to this cluster

    merge_data1_ = label_data.iloc[:, 1:7].median()  # medians of the numeric features

    part2_data = label_data.iloc[:, 7:-1]  # string (categorical) features
    part2_desc = part2_data.describe(include='all')
    merge_data2 = part2_desc.iloc[2, :]  # the "top" row: the most frequent value

    merge_line = pd.concat((merge_data1_, merge_data2), axis=0)
    cluster_features_.append(merge_line)
cluster_pd_ = pd.DataFrame(cluster_features_).T
cluster_pd_

6 Exploring with a radar chart

fig = plt.figure(figsize=(6, 6))  # create the figure
ax = fig.add_subplot(111, polar=True)  # add a subplot; note the polar argument
labels = np.array(merge_data1.index)  # axis labels to display
cor_list = ['b', 'g', 'r', 'c', 'm', 'y', 'k', 'w']  # one color per cluster
angles = np.linspace(0, 2 * np.pi, len(labels), endpoint=False)  # angle of each axis
angles = np.concatenate((angles, [angles[0]]))  # repeat the first angle to close the polygon

6.1 Data preprocessing

num_sets = cluster_pd.iloc[:6, :].T.astype(np.float64)  # the per-cluster means to display
num_sets_max_min = MinMaxScaler().fit_transform(num_sets)
num_sets_max_min

data_tmp = num_sets_max_min[0, :]  # data for cluster 0
data_con = np.concatenate((data_tmp, [data_tmp[0]]))  # repeat the first value to close the polygon
ax.plot(angles, data_con, 'o-', c=cor_list[0], label=0)  # draw the line
ax.set_title("聚类汇总图", fontproperties="SimHei", fontweight="black", fontsize="x-large")  # title: "cluster summary chart"
fig

6.2 Plotting the means

fig = plt.figure(figsize=(6, 6))  # create the figure
ax = fig.add_subplot(111, polar=True)  # add a subplot; note the polar argument
labels = np.array(merge_data1.index)  # axis labels to display
cor_list = ['b', 'g', 'r', 'c', 'm', 'y', 'k', 'w']  # one color per cluster
angles = np.linspace(0, 2 * np.pi, len(labels), endpoint=False)  # angle of each axis
angles = np.concatenate((angles, [angles[0]]))  # repeat the first angle to close the polygon
for i in range(4):
    data_tmp = num_sets_max_min[i, :]  # data for cluster i
    data_con = np.concatenate((data_tmp, [data_tmp[0]]))  # repeat the first value to close the polygon
    ax.plot(angles, data_con, 'o-', c=cor_list[i], label=i)  # draw the line
ax.set_thetagrids(angles[:-1] * 180 / np.pi, labels, fontproperties="SimHei")  # one tick per axis; the repeated closing angle is excluded so ticks and labels match
ax.set_title("聚类汇总图", fontproperties="SimHei", fontweight="black", fontsize="x-large")  # title: "cluster summary chart"
ax.set_rlim(-0.2, 1.2)  # radial axis range
plt.legend(loc=0)  # let matplotlib pick the best legend position
cluster_pd

6.3 Switching to the median and comparing the clusters

num_sets = cluster_pd_.iloc[:6, :].T.astype(np.float64)  # the per-cluster medians to display
num_sets_max_min = MinMaxScaler().fit_transform(num_sets)
fig = plt.figure(figsize=(6, 6))  # create the figure
ax = fig.add_subplot(111, polar=True)  # add a subplot; note the polar argument
labels = np.array(merge_data1.index)  # axis labels to display
cor_list = ['b', 'g', 'r', 'c', 'm', 'y', 'k', 'w']  # one color per cluster
angles = np.linspace(0, 2 * np.pi, len(labels), endpoint=False)  # angle of each axis
angles = np.concatenate((angles, [angles[0]]))  # repeat the first angle to close the polygon
for i in range(4):
    data_tmp = num_sets_max_min[i, :]  # data for cluster i
    data_con = np.concatenate((data_tmp, [data_tmp[0]]))  # repeat the first value to close the polygon
    ax.plot(angles, data_con, 'o-', c=cor_list[i], label=i, alpha=0.5)  # draw the line
ax.set_thetagrids(angles[:-1] * 180 / np.pi, labels, fontproperties="SimHei")  # one tick per axis; the repeated closing angle is excluded so ticks and labels match
ax.set_title("聚类汇总图", fontproperties="SimHei", fontweight="black", fontsize="x-large")  # title: "cluster summary chart"
ax.set_rlim(-0.2, 1.2)  # radial axis range
plt.legend(loc=0)  # let matplotlib pick the best legend position
cluster_pd_

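The median and mean views differ considerably. Because the aim of this study is to assess the overall performance of the ads, and the median is less affected by a few extreme channels, the median is used as the basis for the visualization.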
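A small made-up example shows why the two summaries can diverge: a single extreme value pulls the mean far from the bulk of the data while leaving the median untouched. The numbers below are hypothetical and only stand in for a UV-like metric within one cluster.

```python
import numpy as np

# Hypothetical daily-UV values for one cluster, with a single extreme channel
uv = np.array([100.0, 110.0, 120.0, 130.0, 5000.0])

mean_uv = uv.mean()        # pulled upward by the outlier
median_uv = np.median(uv)  # stays at the middle of the typical values

print(mean_uv, median_uv)
```

With skewed metrics like this, a radar chart built on means mostly reflects the outliers, whereas one built on medians reflects the typical channel in each cluster.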

7 Visualizing the metrics of each cluster

final_df.columns[1:]

Index(['日均UV', '平均注册率', '平均搜索量', '访问深度', '订单转化率', '投放总时间', '素材类型', '广告类型',
'合作方式', '广告尺寸', '广告卖点', 'clusters'],
dtype='object')

col=final_df.columns[1:]

7.1 clusters = 0

subset = final_df[final_df['clusters'] == 0]  # rows in this cluster
f = plt.figure(figsize=(18, 7))
f.suptitle('clusters = 0', fontweight="black", fontsize="x-large")
f.subplots_adjust(hspace=1)  # add vertical space between the subplots
for idx in range(11):  # 6 numeric features followed by 5 categorical features
    f.add_subplot(6, 2, idx + 1)
    sns.histplot(subset, x=col[idx], kde=(idx < 6))  # KDE curve only for the numeric features

7.2 clusters = 1

subset = final_df[final_df['clusters'] == 1]  # rows in this cluster
f = plt.figure(figsize=(18, 7))
f.suptitle('clusters = 1', fontweight="black", fontsize="x-large")
f.subplots_adjust(hspace=1)  # add vertical space between the subplots
for idx in range(11):  # 6 numeric features followed by 5 categorical features
    f.add_subplot(6, 2, idx + 1)
    sns.histplot(subset, x=col[idx], kde=(idx < 6))  # KDE curve only for the numeric features

7.3 clusters = 2

subset = final_df[final_df['clusters'] == 2]  # rows in this cluster
f = plt.figure(figsize=(18, 7))
f.suptitle('clusters = 2', fontweight="black", fontsize="x-large")
f.subplots_adjust(hspace=1)  # add vertical space between the subplots
for idx in range(11):  # 6 numeric features followed by 5 categorical features
    f.add_subplot(6, 2, idx + 1)
    sns.histplot(subset, x=col[idx], kde=(idx < 6))  # KDE curve only for the numeric features

7.4 clusters = 3

subset = final_df[final_df['clusters'] == 3]  # rows in this cluster
f = plt.figure(figsize=(18, 7))
f.suptitle('clusters = 3', fontweight="black", fontsize="x-large")
f.subplots_adjust(hspace=1)  # add vertical space between the subplots
for idx in range(11):  # 6 numeric features followed by 5 categorical features
    f.add_subplot(6, 2, idx + 1)
    sns.histplot(subset, x=col[idx], kde=(idx < 6))  # KDE curve only for the numeric features

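Each metric is scored from its ranking in the radar chart: 4 points for the highest cluster and 1 point for the lowest. A score of 4 therefore means the cluster ranks first on that metric.

Score = (sum of the rank scores of the benefit metrics) / (rank score of the cost metric). The denominator is 投放总时间 (total run time); the numerator covers all the other numeric features.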
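The scoring rule can be sketched with pandas ranks. Everything below is illustrative: the per-cluster values are invented, only two benefit metrics are included for brevity, and it is an assumption that the cost metric is ranked the same way as the others (longest run time gets 4 points and therefore the largest denominator).

```python
import pandas as pd

# Hypothetical per-cluster medians (index = cluster label); values are made up
medians = pd.DataFrame(
    {
        "日均UV": [300.0, 150.0, 900.0, 600.0],
        "订单转化率": [0.04, 0.01, 0.002, 0.03],
        "投放总时间": [10.0, 20.0, 30.0, 5.0],
    },
    index=[0, 1, 2, 3],
)

# Rank each metric across clusters: 1 = lowest value, 4 = highest value
ranks = medians.rank(ascending=True)

benefit_cols = ["日均UV", "订单转化率"]  # benefit metrics (numerator)
cost_col = "投放总时间"                   # cost metric (denominator)

score = ranks[benefit_cols].sum(axis=1) / ranks[cost_col]
print(score)
print(score.idxmax())
```

A ratio of ranks is a coarse measure: it ignores how far apart the clusters are on each metric, so it is best read as an ordering rather than as a magnitude.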

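8 Conclusions and summary

8.1 Project conclusions

1. Cluster 3 scores highest and has the best overall ad performance. Its modest 日均UV (daily UV) is most likely explained by its typical combination of features (满减 as the selling point, JPG creatives, and the 308*388 ad size). The 308*388 format is the only "tree-shaped" (taller than wide) ad among all the formats.
2. Clusters 1 and 0 have similar scores and can be compared directly. Cluster 1 may be advertising higher-priced products: compared with cluster 0, its user-side metrics such as 访问深度 (visit depth) and 平均搜索量 (average searches) are higher, but its seller-side metrics such as conversion rate and registration rate are lower, which suggests users are still undecided. Cluster 0 is the reverse, with high seller-side metrics and lower user-side metrics; it may consist of ads for distinctive mass-market products.
3. Cluster 2 scores lowest. Its 访问深度 (visit depth), 日均UV (daily UV) and similar figures are mainly a product of its very long run time. Its ads are also the small-area 600*90 format; a plausible explanation is that the low placement cost allowed a long run time. Even so, the final results are disappointing. For future campaigns, the recommendation is to drop these placements outright and avoid wasting time and effort.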

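8.2 Takeaways and outlook

1. Strength of clustering: this project shows that clustering can summarize disorderly data. The benefit is most visible in the Seaborn plots of each cluster's metrics; before modeling, histograms of the raw data could not reveal these patterns.
2. The result is 4 clusters of unequal size. A follow-up could fit a multi-class classifier on the cluster labels to see which metrics actually drive the clustering. For example, ad area could be used as a new feature, and ad shape could be added to the classifier as another.
3. The visualizations still have room for improvement.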

K-Means Clustering of Ad Performance
