使用数据来自阿里巴巴天池,是2017年11月25日至2017年12月3日之间用户行为数据,包括用户id,购买商品类别,购买商品编号,用户行为和用户购买时间。由于原数据较大(100万),在此我们只导入前10万条数据进行分析。
1、导入数据
从数据集中导入数据,原数据包括100万条,为了便于分析,本文选择前10万条数据导入mysql
删除表中重复数据(根据所有字段,先对数据进行非并列排名,删除排名大于1的
SELECT * FROM
(SELECT 用户ID, 商品ID,商品类目ID,行为类型,时间戳,row_number() over (ORDER BY 用户ID,商品ID,商品类目ID,行为类型,时间戳) AS ranking
FROM userbehavior) AS a
GROUP BY ranking HAVING count(ranking) > 1;
查找空缺值
SELECT * FROM userbehavior
WHERE 用户ID is NULL OR 商品ID is NULL OR 商品类目ID is NULL OR 行为类型 is NULL OR 时间戳 is NULL;
一致化处理 (将时间戳整理成日期和时间)
ALTER TABLE userbehavior ADD 日期 varchar(255);
ALTER TABLE userbehavior ADD 时间 varchar(255);
UPDATE userbehavior SET 日期=FROM_UNIXTIME(时间戳,'%Y-%m-%d');
UPDATE userbehavior SET 时间=FROM_UNIXTIME(时间戳,'%h:%i:%s');
异常值处理
SELECT 日期,COUNT(*) FROM userbehavior
GROUP BY 日期 ORDER BY 日期;
DELETE FROM userbehavior
WHERE 日期 in('2017-09-11','2017-11-17','2017-11-20','2017-11-23','2017-11-24');
2、研究问题
首先了解本次选取的子集中共有多少商品类别,以及多少种商品
SELECT count(DISTINCT 商品类目ID) AS 商品类别, count(DISTINCT 商品ID) AS 商品种类
FROM userbehavior;
2.1.1 用户行为分析
点击率是否与购买量相符?按行为分组,观察点击、收藏、加购、购买各组数量
由图可知,点击人数多,加入购物车和收藏商品的数量有很大程度的下降,实际购买人数更少,这说明商品的购买转化率较低,为何会出现这种情况?从商品角度(是否推送的商品不是客户需要的?)和购买流程角度(是否流程上出现问题,导致最终购买量降低?)进行分析,并提出两个假设;
假设1:推送的商品不是客户想要的。
用户想要在淘宝找到什么商品(点击次数排名)
SELECT 商品类目ID,count(商品类目ID) AS 点击次数
FROM UserBehavior
WHERE 行为类型='pv'
GROUP BY 商品类目ID
ORDER BY 点击次数 DESC
LIMIT 10;
通过sql语句查询发现,点击量前10的商品类目如下。分为3个档位:第一和第二点击量为4509和3949,商品类别为4756105和3607361。从第三至第五,点击量为2500左右,从第六到第十,点击量从1769降至1275
上图为点击量前十的商品类别,可以看到4756105,3607361,4145813等商品类别在用户点击量中排名前列。选择在点击量中排名前三的商品类别4756105,3607361,4145813,进一步分析其中的商品
点击量排名前三的商品类目中排名前三的商品
SELECT 商品类目ID, 商品ID,count(商品ID) AS 点击次数
FROM UserBehavior
WHERE 行为类型='pv' AND 商品类目ID ='4756105'-- 4756105、3607361、4145813
GROUP BY 商品ID
ORDER BY 点击次数 DESC
LIMIT 3;
由上图可发现4756105和3607361类别中的商品点击量最高,都保持在30左右。4756105类别中点击量较高的商品为812879,3006793和3769601;3607361类别中点击量较高的商品为2331370,138964和3845720。而4145813类别中的商品点击量相对较小,保持在15左右
购买量前10的商品类别
SELECT 商品类目ID,count(商品类目ID) AS 购买商品数量
FROM UserBehavior
WHERE 行为类型='buy'
GROUP BY 商品类目ID
ORDER BY 购买商品数量 DESC;
LIMIT 10;
上图为购买量前十的商品类别,发现点击量排名前三的商品类别4756105,3607361,4145813钟只有4145813依旧排名前三,而4756105和3607361分别排名第七和第八。与此同时,在购买量中排名前列的商品类别2735466、146116和2885642等均未出现在点击量排名前列的商品类别中
点击量前3类目下的点击排名前3的商品,销量
SELECT 商品ID,count(商品ID) AS 购买商品数量
FROM UserBehavior
WHERE 行为类型='buy' AND 商品ID in ('812879','3006793','3769601','2331370','138964','3845720','2828948','4718309','4334690')
GROUP BY 商品ID
ORDER BY 购买商品数量 DESC;
结果如上图,发现只有两种商品存在购买情况,并且购买量皆仅为1个。
因此,假设1成立。淘宝的推送机制不够合理,淘宝根据用户点击量进行推荐的商品类别与用户对于商品类别的购买需求关联度尚可,但推荐的具体商品并没有满足用户的购买需求,导致了用户在点击商品后的行为转化率极低
假设1:不同的购买流程导致转换率低
按不同用户、不同商品分组,创建用户对具体商品的不同行为的视图
CREATE VIEW process AS
SELECT 用户ID,商品ID,
SUM(CASE WHEN 行为类型 = 'pv' THEN 1 ELSE 0 END) AS pv,
SUM(CASE WHEN 行为类型 = 'cart' THEN 1 ELSE 0 END) AS cart,
SUM(CASE WHEN 行为类型 = 'fav' THEN 1 ELSE 0 END) AS fav,
SUM(CASE WHEN 行为类型 = 'buy' THEN 1 ELSE 0 END) AS buy
FROM UserBehavior
GROUP BY 用户ID,商品ID;
按流程分析各环节的数据
-- 点击后流失
SELECT COUNT(用户ID) AS 点击后流失
FROM process
WHERE pv>0 AND cart=0 AND fav= 0 AND buy=0;
-- 点击后购买
SELECT COUNT(用户ID) AS 点击后购买
FROM process
WHERE pv>0 AND cart=0 AND fav= 0 AND buy>0;
-- 点击后收藏
SELECT COUNT(用户ID) AS 点击后收藏
FROM process
WHERE pv>0 AND fav>0;
-- 点击、收藏、加购物车
SELECT COUNT(用户ID) AS 点击、收藏、加购物车
FROM process
WHERE pv>0 AND cart>0 AND fav>0 AND buy=0;
-- 点击、收藏、加购物车、购买
SELECT COUNT(用户ID) AS 点击、收藏、加购物车、购买
FROM process
WHERE pv>0 AND cart>0 AND fav>0 AND buy>0;
-- 点击、收藏、购买
SELECT COUNT(用户ID) AS 点击、收藏、购买
FROM process
WHERE pv>0 AND cart=0 AND fav>0 AND buy>0;
-- 点击、收藏、流失
SELECT COUNT(用户ID) AS 点击、收藏、购买
FROM process
WHERE pv>0 AND cart=0 AND fav>0 AND buy=0;
-- 点击、加购物车
SELECT COUNT(用户ID) AS 点击、加购物车
FROM process
WHERE pv>0 AND cart>0 AND fav=0;
-- 点击、加购物车、购买
SELECT COUNT(用户ID) AS 点击、加购物车、购买
FROM process
WHERE pv>0 AND cart>0 AND fav=0 AND buy>0;
-- 点击、加购物车、流失
SELECT COUNT(用户ID) AS 点击、加购物车、流失
FROM process
WHERE pv>0 AND cart>0 AND fav=0 AND buy=0;
因此,假设2成立。其中,点击后购买率就有1.1%;点击→收藏→购买和点击→加入购物车→购买这两个流程的购买转化率明显大于点击→购买中的购买转化率。可以进一步分析为何流失?是商品问题?还是其他购物平台折扣力度更大?还是用户习惯问题?
2.1.1 用户行为与时间分析
假设1:用户活跃度和购买在工作日和周末时具有差异
用户购买
SELECT 日期,COUNT(行为类型) AS 当日购买
FROM UserBehavior
WHERE 行为类型='buy'
GROUP BY 日期
ORDER BY 日期 DESC;
日活
SELECT 日期,COUNT(行为类型) AS 当日购买
FROM UserBehavior
WHERE 行为类型='pv'
GROUP BY 日期
ORDER BY 日期 DESC;
通过对数据进行可视化发现,用户日活和商品销量在12/02和12/03这两天较高,而12/02和12/03两天是周末。主要原因是在周末用户通常都拥有更多的空闲时间,因此周末的日活显著高于工作日的日活
因此假设1成立,在周末通常用户都会拥有更多的空闲时间,因此相比工作日,周末的日活与购买均有有明显的上升趋势
假设2:用户活跃度和购买在不同时间段上具有差异
用户活跃度时间段(购买与之类似)
CREATE VIEW timetaa AS
select 日期,
sum(case when 时间 between '00:00:00' and '00:59:59' then 1 else 0 end) as 0点,
sum(case when 时间 between '01:00:00' and '01:59:59' then 1 else 0 end) as 1点,
sum(case when 时间 between '02:00:00' and '02:59:59' then 1 else 0 end) as 2点,
sum(case when 时间 between '03:00:00' and '03:59:59' then 1 else 0 end) as 3点,
sum(case when 时间 between '04:00:00' and '04:59:59' then 1 else 0 end) as 4点,
sum(case when 时间 between '05:00:00' and '05:59:59' then 1 else 0 end) as 5点,
sum(case when 时间 between '06:00:00' and '06:59:59' then 1 else 0 end) as 6点,
sum(case when 时间 between '07:00:00' and '07:59:59' then 1 else 0 end) as 7点,
sum(case when 时间 between '08:00:00' and '08:59:59' then 1 else 0 end) as 8点,
sum(case when 时间 between '09:00:00' and '09:59:59' then 1 else 0 end) as 9点,
sum(case when 时间 between '10:00:00' and '10:59:59' then 1 else 0 end) as 10点,
sum(case when 时间 between '11:00:00' and '11:59:59' then 1 else 0 end) as 11点,
sum(case when 时间 between '12:00:00' and '12:59:59' then 1 else 0 end) as 12点,
sum(case when 时间 between '13:00:00' and '13:59:59' then 1 else 0 end) as 13点,
sum(case when 时间 between '14:00:00' and '14:59:59' then 1 else 0 end) as 14点,
sum(case when 时间 between '15:00:00' and '15:59:59' then 1 else 0 end) as 15点,
sum(case when 时间 between '16:00:00' and '16:59:59' then 1 else 0 end) as 16点,
sum(case when 时间 between '17:00:00' and '17:59:59' then 1 else 0 end) as 17点,
sum(case when 时间 between '18:00:00' and '18:59:59' then 1 else 0 end) as 18点,
sum(case when 时间 between '19:00:00' and '19:59:59' then 1 else 0 end) as 19点,
sum(case when 时间 between '20:00:00' and '20:59:59' then 1 else 0 end) as 20点,
sum(case when 时间 between '21:00:00' and '21:59:59' then 1 else 0 end) as 21点,
sum(case when 时间 between '22:00:00' and '22:59:59' then 1 else 0 end) as 22点,
sum(case when 时间 between '23:00:00' and '23:59:59' then 1 else 0 end) as 23点
FROM userbehavior
WHERE 行为类型='pv'
GROUP BY 日期;
因此,假设2成立。用户活跃度和购买在不同时间段上存在差异,且在工作日与周末分别有不同的波峰波谷,建议淘宝根据工作日与周末的不同在不同时间段上分别制定相应的营销策略
2.1.3 RFM分析
R、F值(计算时间差和频率)
SELECT 用户ID,商品ID,最近日期,ROUND(UNIX_TIMESTAMP('2017-12-03')-UNIX_TIMESTAMP(最近日期))/3600/24 AS 时间差
FROM
(SELECT 用户ID,商品ID,MAX(日期) AS 最近日期
FROM userbehavior
WHERE 行为类型='buy' AND 日期 not in ('2017-12-03')
GROUP BY 用户ID) AS a;
SELECT 用户ID,商品ID,count(商品ID) AS 频率
FROM userbehavior
WHERE 行为类型='buy' AND 日期 BETWEEN '2017-11-28' AND '2017-12-03'
GROUP BY 商品ID;
给R、F按价值打分
计算价值的平均值
select avg(R_score),avg(F_score)
from (
select userid,R,
(case when R between 0 and 2 then 4
when R between 3 and 4 then 3
when R between 5 and 6 then 2
when R between 7 and 8 then 1
else 0 end) as R_score,F,
(case when F between 1 and 6 then 1
when F between 7 and 12 then 2
when F between 13 and 18 then 3
when F >=19 then 4
else 0 end) as F_score
from (select userid,DATEDIFF('2017-12-03',max(dates)) as R,
sum(case when behavior='buy' then 1 else 0 end) as F
from userbehavior where behavior='buy'
group by userid) as A) as B;
d. 得出R、F价值高低
e. 和用户分类规则表比较,得出用户分类
select 用户分类,count(userid) as '用户数量'
from (
select userid,
(case when R_score>3.2846 and F_score>1.1148 then "重要价值客户"
when R_score>3.2846 and F_score<1.1148 then "重要发展客户"
when R_score<3.2846 and F_score>1.1148 then "重要保持客户"
when R_score<3.2846 and F_score<1.1148 then "重要挽留客户"
end) as '用户分类'
from (select userid,R,
(case when R between 0 and 2 then 4
when R between 3 and 4 then 3
when R between 5 and 6 then 2
when R between 7 and 8 then 1
else 0 end) as R_score,F,
(case when F between 1 and 6 then 1
when F between 7 and 12 then 2
when F between 13 and 18 then 3
when F >=19 then 4
else 0 end) as F_score
from (select userid,DATEDIFF('2017-12-03',max(dates)) as R,
sum(case when behavior='buy' then 1 else 0 end) as F
from userbehavior where behavior='buy'
group by userid) as A) as B) as C group by 用户分类;
1、重要价值用户占比30.85%,对这部分用户,应加强交流与互动,可建立会员制度以维护平台的忠实用户,让其留在平台继续购物消费;
2、重要挽留用户占比29.21%,这部分用户可能已流失,要想办法主动联系用户,明确流失原因,想办法挽回并再次激活用户购物;
3、重要发展用户占比27.27%,对于这部分用户,应刺激消费,可推荐其平时浏览的同类商品,提高用户购物频率,增加用户对平台的粘性;
4、重要保持用户占比12.67%,对于这部分用户,可根据以往的消费记录,主动与其保持联系,为其进行个性化推荐,以提高复购率。
总结
1、从对用户行为与时间分析中可知:大部分用户的主要活跃时间在19点到22点,在22点左右可达到一天的顶峰。相对于周末,用户工作日的活跃度有所下降。商家可以根据用户的活跃时间段精准推送商家的折扣优惠或促销活动,推出直播带货等活动,提高购买率。
2、基于AARRR模型对用户行为路径进行分析可得:用户跳失率较低,复购率高,对于平台有较高的忠诚度和认可度。
3、用户在从浏览、加入购物车、收藏、购买行为路径中主要的流失环节为“加入购物车”环节,建议在后续商品详情页增加多个购物车入口,或增加领取优惠券等活动,引导用户点击后进行收藏和加购的行为,从而提升用户激活环节的转化率;也可优化电商平台的搜索匹配度和推荐策略,主动根据用户喜好推荐相关的商品,优化商品搜索的准确度和聚合能力,对搜索结果排序优先级进行优化。
4、基于RFM模型对用户价值进行分析可明确:对不同类别的用户采用不同的方式刺激其消费。对于重要保持用户可使用短信、邮件、push提醒的方法召回;对于重要价值用户,可建立会员权益制度或者继续促进消费。