Data Analysis Thinking Case Study: Black Friday User Behavior Analysis

Study notes on the Black Friday user behavior analysis case
Instructor's page -> https://www.jianshu.com/u/1f32f227da5f
Tools used: Anaconda (Jupyter), Excel

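A data analyst needs three capabilities:
1. An understanding of the business and its operations
2. Structured thinking
3. Proficiency in one or two data tools
Business (using e-commerce as an example):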

Figure: e-commerce business

Data tools:

Figure: data tools

Data analysis thinking:

Figure: data thinking

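Black Friday user behavior analysis case

Project introduction:

Black Friday is the day after Thanksgiving in the United States and the start of the big shopping season before Christmas. On that day US stores typically run heavy discounts and promotions. US retailers traditionally record losses in red ink and profits in black ink, and the shopping rush on the Friday after Thanksgiving pushes profits up sharply, so merchants call it Black Friday. Merchants hope that the Christmas shopping season starting on this day will bring in the largest profits of the year.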

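Analysis goal:

The data comes from the Black Friday sales records of an e-commerce retailer, published on Kaggle. The analysis covers two main dimensions, products and users, and aims to give the platform findings and recommendations for setting its strategy.

Main framework of the analysis:

1. Overall spending
2. User profile analysis (finding the most valuable user types: gender, age, occupation, marital status)
3. City performance analysis (distribution by city, distribution by years of residence)
4. Product analysis (finding the most valuable products). Detailed breakdown: Top 10 products by sales amount, Top 10 product categories by sales amount
5. Highest-contributing user value analysis: average spend per customer, list of the Top 1000 users by value, profile of the Top 1000 users
6. Conclusions and recommendations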

A look at the dataset

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.rcParams['font.sans-serif']=['SimHei'] # font setting so that Chinese labels render correctly
plt.rcParams['axes.unicode_minus']=False # render the minus sign correctly
df=pd.read_csv("BlackFriday.csv")

df.info() # show each column's dtype and non-null count

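Data description

Analysis: the data for product categories 2 and 3 (Product_Category_2, Product_Category_3) is incomplete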
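To back up the observation that Product_Category_2 and Product_Category_3 are incomplete, the missing values per column can be counted directly. The sketch below runs on a small made-up frame (the name `toy` and all values are hypothetical, not taken from BlackFriday.csv); on the real data the same two lines would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows that mimic part of the column layout; not real data.
toy = pd.DataFrame({
    "Product_Category_1": [3, 1, 12, 8],
    "Product_Category_2": [np.nan, 6.0, np.nan, 14.0],
    "Product_Category_3": [np.nan, 14.0, np.nan, np.nan],
    "Purchase": [8370, 15200, 1422, 7969],
})

# Count missing values per column and express them as a share of all rows.
missing_count = toy.isnull().sum()
missing_share = missing_count / len(toy)

print(pd.DataFrame({"missing": missing_count, "share": missing_share}))
```

Columns with a non-zero count are the ones that need special handling (for example, being left out of the analysis or filled with a placeholder category).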

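The raw data has 12 fields and 537577 rows in total. The fields are:
User_ID: user ID
Product_ID: product ID
Gender: gender
Age: age band
Occupation: occupation (coded)
City_Category: city category (A, B, C)
Stay_In_Current_City_Years: years lived in the current city
Marital_Status: marital status
Product_Category_1: product category 1, the first-level category
Product_Category_2: product category 2, the second-level category
Product_Category_3: product category 3, the third-level category
Purchase: purchase amount (USD)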

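1. Overall spending

df["Purchase"].sum()  # total spending: about 5 billion USD
df["Product_ID"].count()  # number of purchase records (items sold): 537577
df["Purchase"].sum()/df["Product_ID"].count()
# average amount per purchase record: about 9333 USD
df["Purchase"].sum()/df["User_ID"].drop_duplicates(keep='first').count()  # average spend per customer: about 850,000 USD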

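Key point: drop_duplicates removes duplicate rows

data.drop_duplicates(subset=['A','B'],keep='first',inplace=True) Here subset takes column names: only these two columns are compared, and rows with identical values in both are treated as duplicates. The default, subset=None, compares all columns.

keep='first' keeps the first occurrence of each duplicated row and is the default. The other two values of keep are "last" and False, which keep the last occurrence and drop every duplicated row, respectively.

inplace=True removes the duplicates directly from the original DataFrame, while the default, False, leaves the original unchanged and returns a new DataFrame.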
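A small example makes the three values of keep easier to compare. This is only an illustrative sketch: the frame `data` and its values are made up, with the pair (1, "x") in columns A and B appearing twice.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 1 share the same values in columns A and B.
data = pd.DataFrame({
    "A": [1, 1, 2, 2],
    "B": ["x", "x", "y", "z"],
    "C": [10, 20, 30, 40],
})

first = data.drop_duplicates(subset=["A", "B"], keep="first")  # keeps row 0, drops row 1
last = data.drop_duplicates(subset=["A", "B"], keep="last")    # keeps row 1, drops row 0
none = data.drop_duplicates(subset=["A", "B"], keep=False)     # drops both row 0 and row 1

print(first["C"].tolist())  # [10, 30, 40]
print(last["C"].tolist())   # [20, 30, 40]
print(none["C"].tolist())   # [30, 40]
print(len(data))            # 4: the original is untouched because inplace defaults to False
```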

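Analysis:
Judging from these records, the dataset mainly captures big spenders: average spend per customer reaches 850,000 USD, and together these customers generated 5 billion USD in sales. Retaining loyal users and encouraging them to spend more is a basic practice in e-commerce.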
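The three overall figures (total amount, average per purchase record, average per customer) can also be computed with `nunique()`, which counts distinct users in one call and gives the same result as `drop_duplicates().count()`. The following is a minimal sketch on made-up data; `toy` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical purchase records: 3 distinct users, 6 records.
toy = pd.DataFrame({
    "User_ID": [1, 1, 2, 3, 3, 3],
    "Product_ID": ["P1", "P2", "P1", "P3", "P1", "P2"],
    "Purchase": [100, 200, 300, 400, 500, 600],
})

total_amount = toy["Purchase"].sum()           # sum of all purchase amounts
record_count = len(toy)                        # one row per purchase record
user_count = toy["User_ID"].nunique()          # number of distinct users

avg_per_record = total_amount / record_count   # average amount per record
avg_per_user = total_amount / user_count       # average spend per customer

print(total_amount, avg_per_record, avg_per_user)  # 2100 350.0 700.0
```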

2. The user perspective

01. Gender

df_gender_purchase=df.groupby("Gender").agg({"Purchase":"sum"}).reset_index().rename(columns={"Purchase":"Purchase_amount"})
# Group by gender, sum the purchase amounts, reset the index, and rename the column to Purchase_amount

df_gender_purchase["gender_purchase_prop"]=df_gender_purchase.apply(lambda x:x["Purchase_amount"]/df["Purchase"].sum(),axis=1)
# New column gender_purchase_prop:
# this gender's purchase amount divided by the total purchase amount, i.e. its share of sales

def Gender_user_count(x):
    if x["Gender"]=="F":
        return (df.loc[df["Gender"]=="F"].drop_duplicates(subset=["User_ID"],keep="first")["User_ID"].count())
    if x["Gender"]=="M":
        return (df.loc[df["Gender"]=="M"].drop_duplicates(subset=["User_ID"],keep="first")["User_ID"].count())
# Counts the users of the row's gender:
# for female rows, drop duplicate User_IDs and count the remaining female users;
# male rows are handled the same way

df_gender_purchase["gender_user_count"]=df_gender_purchase.apply(lambda x:Gender_user_count(x),axis=1)
# New column gender_user_count: number of distinct users of each gender

df_gender_purchase["gender_customer_price"]=df_gender_purchase.apply(lambda x:x["Purchase_amount"]/x["gender_user_count"],axis=1)
# New column gender_customer_price: average spend per customer for each gender

df_gender_purchase["gender_count_prop"]=df_gender_purchase.apply(lambda x:x["gender_user_count"]/df.drop_duplicates(subset=["User_ID"],keep="first")["User_ID"].count(),axis=1)
# New column gender_count_prop: this gender's share of all users
df_gender_purchase

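Grouped by gender

Analysis:
In the Black Friday event, men make up 71% of users, nearly 2.5 times the number of women, and they contribute nearly 76% of sales, about 3.3 times the women's share.
Clearly more men take part in the event, and their average spend per customer is also higher than women's, so higher-priced products should be promoted to men.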
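The same five columns (amount, share of sales, distinct users, spend per customer, share of users) are computed again for age, marital status, occupation and city, so the pattern can be wrapped in one vectorised helper. This is a sketch of an alternative, not the method used in the notes: the function name `segment_summary`, its generic column names, and the toy data are all assumptions made for illustration.

```python
import pandas as pd

def segment_summary(frame, key):
    """Summarise spend and distinct users per value of `key` (hypothetical helper)."""
    out = (
        frame.groupby(key)
        .agg(Purchase_amount=("Purchase", "sum"), user_count=("User_ID", "nunique"))
        .reset_index()
    )
    # Share of total sales, average spend per customer, share of all users.
    out["purchase_prop"] = out["Purchase_amount"] / frame["Purchase"].sum()
    out["customer_price"] = out["Purchase_amount"] / out["user_count"]
    out["count_prop"] = out["user_count"] / frame["User_ID"].nunique()
    return out

# Hypothetical purchase records: 3 male users and 1 female user.
toy = pd.DataFrame({
    "User_ID": [1, 1, 2, 3, 3, 4],
    "Gender": ["M", "M", "M", "F", "F", "M"],
    "Purchase": [100, 300, 200, 150, 50, 200],
})

summary = segment_summary(toy, "Gender")
print(summary)
```

On the real data, `segment_summary(df, "Age")` or `segment_summary(df, "City_Category")` would produce the corresponding tables, provided each user belongs to exactly one value of the grouping column.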

02. Age

df_age_purchase=df.groupby("Age").agg({"Purchase":"sum"}).reset_index().rename(columns={"Purchase":"Purchase_amount"})
df_age_purchase["Age_purchase_prop"]=df_age_purchase.apply(lambda x:x["Purchase_amount"]/df["Purchase"].sum(),axis=1)

def Age_user_count(x):
    for i in df["Age"].drop_duplicates():
        if x["Age"]==i:
            return (df.loc[df["Age"]==i].drop_duplicates(subset=["User_ID"],keep="first")["User_ID"].count())
# Counts the distinct users in the row's age band

df_age_purchase["Age_user_count"]=df_age_purchase.apply(lambda x:Age_user_count(x),axis=1)
df_age_purchase["Age_customer_price"]=df_age_purchase.apply(lambda x:x["Purchase_amount"]/x["Age_user_count"],axis=1)
df_age_purchase["Age_count_prop"]=df_age_purchase.apply(lambda x:x["Age_user_count"]/df.drop_duplicates(subset=["User_ID"],keep="first")["User_ID"].count(),axis=1)
df_age_purchase

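Grouped by age

Analysis:
Both the number of buyers and the amount spent are concentrated in the 18-45 age range, which contributes almost 80% of sales. Within it, the 26-35 band has both the most buyers and the highest spending, so these are the users to prioritize when promoting products.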

03. Marital status

df_Marital_purchase=df.groupby("Marital_Status").agg({"Purchase":"sum"}).reset_index().rename(columns={"Purchase":"Purchase_amount"})
df_Marital_purchase["Marital_purchase_prop"]=df_Marital_purchase.apply(lambda x:x["Purchase_amount"]/df["Purchase"].sum(),axis=1)

def Marital_user_count(x):
    if x["Marital_Status"]==0:
        return (df.loc[df["Marital_Status"]==0].drop_duplicates(subset=["User_ID"],keep="first")["User_ID"].count())
    if x["Marital_Status"]==1:
        return (df.loc[df["Marital_Status"]==1].drop_duplicates(subset=["User_ID"],keep="first")["User_ID"].count())
# Counts the distinct unmarried and married users

df_Marital_purchase["Marital_user_count"]=df_Marital_purchase.apply(lambda x:Marital_user_count(x),axis=1)
df_Marital_purchase["Marital_customer_price"]=df_Marital_purchase.apply(lambda x:x["Purchase_amount"]/x["Marital_user_count"],axis=1)
df_Marital_purchase["Marital_count_prop"]=df_Marital_purchase.apply(lambda x:x["Marital_user_count"]/df.drop_duplicates(subset=["User_ID"],keep="first")["User_ID"].count(),axis=1)
df_Marital_purchase

Grouped by marital status

Analysis:
Unmarried users are about 40% higher than married users in both sales amount and number of participants.

04. Combining gender and marital status to analyze sales across age bands

df["Gender_MaritalStatus"]=df[["Gender","Marital_Status"]].apply(lambda x:str(x["Gender"])+"_"+str(x["Marital_Status"]),axis=1)
# Combine gender and marital status into one field

df_Gender_MaritalStatus_purchase=df.groupby(["Gender_MaritalStatus","Age"]).agg({"Purchase":"sum"}).reset_index().rename(columns={"Purchase":"Purchase_amount"})
# Group by gender-marital status and age,
# sum the purchase amount of each group,
# reset the index,
# and rename the Purchase column to Purchase_amount

def Gender_MaritalStatus_user_count(x):
    for i in df["Gender_MaritalStatus"].drop_duplicates():
        for j in df["Age"].drop_duplicates():
            if x["Gender_MaritalStatus"]==i and x["Age"]==j:
                return (df.loc[(df["Gender_MaritalStatus"]==i) & (df["Age"]==j)].drop_duplicates(subset=["User_ID"],keep="first")["User_ID"].count())
# Counts the distinct users for each combination of gender-marital status and age

df_Gender_MaritalStatus_purchase["Gender_MaritalStatus_user_count"]=df_Gender_MaritalStatus_purchase.apply(lambda x:Gender_MaritalStatus_user_count(x),axis=1)
df_Gender_MaritalStatus_purchase["Gender_MaritalStatus_user_price"]=df_Gender_MaritalStatus_purchase.apply(lambda x:x["Purchase_amount"]/x["Gender_MaritalStatus_user_count"],axis=1)
df_Gender_MaritalStatus_purchase["Gender_MaritalStatus_count_prop"]=df_Gender_MaritalStatus_purchase.apply(lambda x:x["Gender_MaritalStatus_user_count"]/df.drop_duplicates(subset=["User_ID"],keep="first")["User_ID"].count(),axis=1)
df_Gender_MaritalStatus_purchase.head(5)

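Grouped by gender-marital status and age

sns.barplot(x="Age",hue="Gender_MaritalStatus",y="Gender_MaritalStatus_user_count",data=df_Gender_MaritalStatus_purchase)

User counts by age and gender-marital status

Analysis: in the 26-35 age band, unmarried men are the largest group of participants

sns.barplot(x="Age",hue="Gender_MaritalStatus",y="Purchase_amount",data=df_Gender_MaritalStatus_purchase)

Total purchase amount by age and gender-marital status

Analysis:
In the 18-35 age range, unmarried men also rank first in total spending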
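What the two bar charts show can be cross-checked numerically by picking, for each age band, the gender-marital status group with the largest total. The sketch below assumes made-up records (`toy` and its values are hypothetical) and uses a vectorised string concatenation plus `idxmax` rather than the row-wise functions above.

```python
import pandas as pd

# Hypothetical purchase records; not taken from BlackFriday.csv.
toy = pd.DataFrame({
    "User_ID": [1, 1, 2, 3, 4, 5, 5],
    "Gender": ["M", "M", "M", "F", "F", "M", "M"],
    "Marital_Status": [0, 0, 1, 0, 1, 0, 0],
    "Age": ["26-35", "26-35", "26-35", "26-35", "18-25", "18-25", "18-25"],
    "Purchase": [500, 300, 200, 400, 100, 250, 50],
})

# Build the combined key, e.g. "M_0" for a male user with Marital_Status 0.
toy["Gender_MaritalStatus"] = toy["Gender"] + "_" + toy["Marital_Status"].astype(str)

# Total amount and distinct users per (age band, combined key).
grouped = (
    toy.groupby(["Age", "Gender_MaritalStatus"])
    .agg(Purchase_amount=("Purchase", "sum"), user_count=("User_ID", "nunique"))
    .reset_index()
)

# For each age band, select the row with the largest purchase amount.
top_idx = grouped.groupby("Age")["Purchase_amount"].idxmax()
top_by_age = grouped.loc[top_idx, ["Age", "Gender_MaritalStatus", "Purchase_amount"]]
print(top_by_age)
```

Replacing `"Purchase_amount"` with `"user_count"` in the `idxmax` line gives the largest group by number of participants instead.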

05. Purchases by occupation

df_Occupation_purchase=df.groupby("Occupation").agg({"Purchase":"sum"}).reset_index().rename(columns={"Purchase":"Purchase_amount"})
df_Occupation_purchase["Occupation_purchase_prop"]=df_Occupation_purchase.apply(lambda x:x["Purchase_amount"]/df["Purchase"].sum(),axis=1)

def Occupation_user_count(x):
    for i in df["Occupation"].drop_duplicates():
        if x["Occupation"]==i:
            return (df.loc[df["Occupation"]==i].drop_duplicates(subset=["User_ID"],keep="first")["User_ID"].count())
# Counts the distinct users in the row's occupation

df_Occupation_purchase["Occupation_user_count"]=df_Occupation_purchase.apply(lambda x:Occupation_user_count(x),axis=1)
df_Occupation_purchase["Occupation_customer_price"]=df_Occupation_purchase.apply(lambda x:x["Purchase_amount"]/x["Occupation_user_count"],axis=1)
df_Occupation_purchase["Occupation_count_prop"]=df_Occupation_purchase.apply(lambda x:x["Occupation_user_count"]/df.drop_duplicates(subset=["User_ID"],keep="first")["User_ID"].count(),axis=1)
df_Occupation_purchase.sort_values(by="Occupation_user_count",ascending=False)

Spending statistics by occupation

Analysis:
Occupations 4, 0, 7 and 1 together account for 40% of all users, so these are the occupations to focus on
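A combined share like the one quoted for the four largest occupations can be obtained by counting each user once and summing the top shares. The following is a sketch on invented data (the frames `users` and `records` and their values are hypothetical), and it assumes each user has exactly one occupation code.

```python
import pandas as pd

# Hypothetical data: 20 users with one occupation code each.
occupations = [4] * 6 + [0] * 5 + [7] * 4 + [1] * 3 + [12] * 2
users = pd.DataFrame({"User_ID": range(1, 21), "Occupation": occupations})
# Simulate purchase records: the first five users each appear twice.
records = pd.concat([users, users.head(5)], ignore_index=True)

# Keep one row per user so repeat buyers are not double counted.
one_per_user = records.drop_duplicates(subset=["User_ID"])
# Share of users per occupation, sorted from largest to smallest.
occ_share = one_per_user["Occupation"].value_counts(normalize=True)

top4 = occ_share.head(4)
print(top4.index.tolist())   # [4, 0, 7, 1]
print(round(top4.sum(), 2))  # 0.9
```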

3. The city contribution perspective

df_City_Category_purchase=df.groupby("City_Category").agg({"Purchase":"sum"}).reset_index().rename(columns={"Purchase":"Purchase_amount"})
df_City_Category_purchase["City_Category_purchase_prop"]=df_City_Category_purchase.apply(lambda x:x["Purchase_amount"]/df["Purchase"].sum(),axis=1)

def City_Category_user_count(x):
    if x["City_Category"]=="A":
        return (df.loc[df["City_Category"]=="A"].drop_duplicates(subset=["User_ID"],keep="first")["User_ID"].count())
    if x["City_Category"]=="B":
        return (df.loc[df["City_Category"]=="B"].drop_duplicates(subset=["User_ID"],keep="first")["User_ID"].count())
    if x["City_Category"]=="C":
        return (df.loc[df["City_Category"]=="C"].drop_duplicates(subset=["User_ID"],keep="first")["User_ID"].count())
# Counts the distinct users in each city category

df_City_Category_purchase["City_Category_user_count"]=df_City_Category_purchase.apply(lambda x:City_Category_user_count(x),axis=1)
df_City_Category_purchase["City_Category_customer_price"]=df_City_Category_purchase.apply(lambda x:x["Purchase_amount"]/x["City_Category_user_count"],axis=1)
df_City_Category_purchase["City_Category_count_prop"]=df_City_Category_purchase.apply(lambda x:x["City_Category_user_count"]/df.drop_duplicates(subset=["User_ID"],keep="first")["User_ID"].count(),axis=1)
df_City_Category_purchase

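Spending statistics by city category

Analysis:
City category C accounts for 53% of participating users but only 30% of sales. By contrast, city category B has 28% of users yet contributes 40% of sales, and the average spend per customer in A and B is each roughly twice that of C. This suggests that spending power is higher in A and B cities. For the next event, prices in A and B cities could be raised moderately, while prices in C cities could be lowered moderately to raise sales amount through higher volume.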
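The comparison between share of users and share of sales can be condensed into a single ratio per city category, and spend per customer can be expressed relative to a baseline category. This is only a sketch: the summary frame `city`, the column names `spend_index` and `price_vs_C`, and all numbers are made up for illustration and are not the figures from the table above.

```python
import pandas as pd

# Hypothetical per-city summary; amounts and user counts are invented.
city = pd.DataFrame({
    "City_Category": ["A", "B", "C"],
    "Purchase_amount": [300.0, 400.0, 300.0],
    "user_count": [20, 30, 50],
})

city["purchase_prop"] = city["Purchase_amount"] / city["Purchase_amount"].sum()
city["count_prop"] = city["user_count"] / city["user_count"].sum()
city["customer_price"] = city["Purchase_amount"] / city["user_count"]

# A value above 1 means the category brings in more sales than its share of users.
city["spend_index"] = city["purchase_prop"] / city["count_prop"]

# Spend per customer relative to category C (the baseline here).
base = city.loc[city["City_Category"] == "C", "customer_price"].iloc[0]
city["price_vs_C"] = city["customer_price"] / base

print(city)
```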

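4. The product perspective

01. Top 10 products by sales amount

# Note: despite its name, df_count10 holds the top 10 products by sales amount
df_count10=df.groupby("Product_ID").agg({"User_ID":"count","Purchase":"sum"}).rename(columns={"Purchase":"Purchase_amount","User_ID":"User_count"}).reset_index().sort_values(by=["Purchase_amount"],ascending=False)[["Product_ID","Purchase_amount"]].head(10)
df_count10

Top 10 by sales amount

02. Top 10 products by sales volume

# Note: despite its name, df_amount10 holds the top 10 products by number of purchase records
df_amount10=df.groupby("Product_ID").agg({"User_ID":"count","Purchase":"sum"}).rename(columns={"Purchase":"Purchase_amount","User_ID":"User_count"}).reset_index().sort_values(by=["User_count"],ascending=False)[["Product_ID","User_count"]].head(10)
df_amount10

Top 10 by sales volume

03. Products in both the top 10 by sales volume and the top 10 by sales amount

# An inner join keeps only the products that appear in both top-10 lists
pd.merge(df_amount10,df_count10,left_on="Product_ID",right_on="Product_ID",how="inner")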

In both the top 10 by sales volume and the top 10 by sales amount
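The intersection of the two rankings can be seen end to end on a few invented rows. This sketch uses `nlargest` in place of `sort_values(...).head(...)` and a cutoff of 2 instead of 10 to keep the output small; `toy`, `per_product` and the values are hypothetical.

```python
import pandas as pd

# Hypothetical purchase records for four products.
toy = pd.DataFrame({
    "Product_ID": ["P1", "P1", "P1", "P2", "P2", "P3", "P4"],
    "User_ID":    [1, 2, 3, 1, 2, 1, 2],
    "Purchase":   [10, 10, 10, 100, 100, 500, 5],
})

# Number of purchase records and total sales amount per product.
per_product = (
    toy.groupby("Product_ID")
    .agg(User_count=("User_ID", "count"), Purchase_amount=("Purchase", "sum"))
    .reset_index()
)

n = 2  # the notes use 10; 2 keeps the toy example readable
top_by_amount = per_product.nlargest(n, "Purchase_amount")  # P3, P2
top_by_count = per_product.nlargest(n, "User_count")        # P1, P2

# Inner join on Product_ID keeps only products present in both lists.
both = pd.merge(top_by_count, top_by_amount[["Product_ID"]], on="Product_ID", how="inner")
print(both["Product_ID"].tolist())  # ['P2']
```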

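5. Summary

1. The user perspective

Conclusions: 1) Unmarried men aged 26-35 in occupations "4", "0", "7" and "1" are the high-spending group and the platform's most loyal users.

Follow-up improvements: 1) Focus on high-value users with more fine-grained marketing, and offer them more high-value products going forward;

2) For other users, mainly guide them to click through and buy by recommending more best-selling products.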

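2. The product perspective

Conclusions: 1) During Black Friday, first-level product categories 5, 1 and 8 rank in the top 3 by both sales volume and sales amount, and the top 10 most popular products also include items from these 3 categories. Together these 3 categories contribute 72% of sales;

2) The three categories with the lowest sales volume are 16, 11 and 12, each with a share below 0.3%;

3) Products that appear in both the Top 10 by sales amount and the Top 10 by sales volume are the bestsellers; their display positions can be used to drive traffic to other products.

Follow-up improvements: 1) Bundle the top 10 most popular products with related products to lift sales of the latter; on the pages of products in first-level categories 5, 1 and 8, recommend other products to guide users to click through and buy;

2) Investigate why the three lowest-selling categories perform poorly. If categories 16, 11 and 12 are obsolete products, or their market has been taken over by substitutes, consider delisting them and cutting advertising in the related channels.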
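The category-level conclusions above rely on shares by Product_Category_1, which are not computed in the code earlier in these notes. One possible way to derive them is sketched below; it is an assumption about how such figures could be produced, and `toy`, `by_category` and the numbers are invented for illustration.

```python
import pandas as pd

# Hypothetical purchase records with a first-level category code each.
toy = pd.DataFrame({
    "Product_Category_1": [5, 5, 5, 1, 1, 8, 8, 16],
    "Purchase":           [100, 100, 100, 300, 200, 100, 50, 50],
})

# Sales volume (number of records) and sales amount per category.
by_category = toy.groupby("Product_Category_1").agg(
    sales_count=("Purchase", "count"),
    sales_amount=("Purchase", "sum"),
)
by_category["count_share"] = by_category["sales_count"] / by_category["sales_count"].sum()
by_category["amount_share"] = by_category["sales_amount"] / by_category["sales_amount"].sum()

top3 = by_category.nlargest(3, "sales_amount")   # categories with the highest sales amount
bottom = by_category.nsmallest(1, "sales_count") # category with the lowest sales volume

print(top3.index.tolist(), round(top3["amount_share"].sum(), 2))  # [1, 5, 8] 0.95
print(bottom.index.tolist())                                      # [16]
```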

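3. The city perspective

Conclusions: 1) The best-selling first-level categories are, in order, 5, 8 and 1. Warehouse management should plan stock according to the bestseller list and categories, stock up in advance for category B cities, where spending is strongest, to save on redistribution, and monitor inventory to prevent stockouts.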