题目一
image.png
导入excel数据至navicat,查看表信息如下:
image.png
求相机活跃用户数、次日留存用户数、三日留存用户数、七日留存用户数思路:
1.通过where筛选相机
2.通过groupby分组用户id、日期,(日期格式需调整和需求一致)
3.用户活跃率是前后日对比所得,所以需要将表命名两个不同名字的表,当做不同的集合,实现自连接
4.分组日期,聚合用户人数,求出每日活跃用户数
5.case when 实现次日留存用户数、三日留存用户数、七日留存用户数
6.case when 实现次日留存率、三日留存率、七日留存率
SQL查询代码
1.通过where筛选相机
SELECT * FROM sheet1
where app_name='相机';
2.通过groupby分组用户id、日期(日期格式需调整和需求一致)
SELECT *, date_format(dayno, '%Y%m%d') as day1
FROM sheet1
where app_name='相机'
group by uid,dayno
image.png
3.用户活跃率是前后日对比所得,所以需要将表命名两个不同名字的表,当做不同的集合,实现自连接
select *
from
(SELECT uid, date_format(dayno, '%Y%m%d') as day1
FROM sheet1
where app_name='相机'
group by uid,dayno) as a
left join
(SELECT uid, date_format(dayno, '%Y%m%d') as day2
FROM sheet1
where app_name='相机'
group by uid,dayno) as b
on a.uid =b.uid
where a.day1<b.day2;
4.分组日期,聚合用户人数,求出每日活跃用户数
select day1,count(distinct a.uid) as '活跃用户数'
from
(SELECT uid, date_format(dayno, '%Y%m%d') as day1
FROM sheet1
where app_name='相机'
group by uid,dayno) as a
left join
(SELECT uid, date_format(dayno, '%Y%m%d') as day2
FROM sheet1
where app_name='相机'
group by uid,dayno) as b
on a.uid =b.uid and a.day1<b.day2
group by a.day1 ;
image.png
5.case when 实现次日留存用户数、三日留存用户数、七日留存用户数
select day1,count(distinct a.uid) as '活跃用户数',
sum(case when b.day2-a.day1 =1 then 1 else 0 end) as '次日留存用户数',
sum(case when b.day2-a.day1 =3 then 1 else 0 end) as '三日留存用户数',
sum(case when b.day2-a.day1 =7 then 1 else 0 end) as '七日留存用户数'
from
(SELECT uid, date_format(dayno, '%Y%m%d') as day1
FROM sheet1
where app_name='相机'
group by uid,dayno) as a
left join
(SELECT uid, date_format(dayno, '%Y%m%d') as day2
FROM sheet1
where app_name='相机'
group by uid,dayno) as b
on a.uid =b.uid and a.day1<b.day2
group by a.day1 ;
image.png
6.case when 实现次日留存率、三日留存率、七日留存率
select day1,count(distinct a.uid) as '活跃用户数',
sum(case when b.day2-a.day1 =1 then 1 else 0 end) as '次日留存用户数',
sum(case when b.day2-a.day1 =3 then 1 else 0 end) as '三日留存用户数',
sum(case when b.day2-a.day1 =7 then 1 else 0 end) as '七日留存用户数',
concat(round((sum(case when b.day2-a.day1 =1 then 1 else 0 end)/count(distinct a.uid))*100,2),'%') as '次日留存率',
concat(round((sum(case when b.day2-a.day1 =3 then 1 else 0 end)/count(distinct a.uid))*100,2),'%') as '三日留存率',
concat(round((sum(case when b.day2-a.day1 =7 then 1 else 0 end)/count(distinct a.uid))*100,2),'%') as '七日留存率'
from
(SELECT uid, date_format(dayno, '%Y%m%d') as day1
FROM sheet1
where app_name='相机'
group by uid,dayno) as a
left join
(SELECT uid, date_format(dayno, '%Y%m%d') as day2
FROM sheet1
where app_name='相机'
group by uid,dayno) as b
on a.uid =b.uid and a.day1<b.day2
group by a.day1 ;
image.png
题目二
image.png
1、分别用Python、sql导入附件(相机.xlsx)的数据进入自己的Mysql当中,要求字段符合、表名符合(可加后缀区分)。
SQL部分:
直接导入到navicat,根据导入向导依次下一步操作,主要操作如下:
image.png
image.png
image.png
Python部分:
导入模块:
import pandas as pd
import numpy as np
import pymysql
pymysql.install_as_MySQLdb()
from sqlalchemy import create_engine
读取数据:
df=pd.read_excel
image.png
2、分析pd.to_sql中的if_exist参数可用哪些,分别是什么,有什么细节区别?认为用哪种更好
存储到数据库代码:
engine = create_engine('mysql://root:mysql111@localhost:3306/test')
df.to_sql('sheet1',con = engine,if_exists='replace', index=False)
分别是fail, append, replace。
fail当数据库里存在同名表时,什么也不做。append当数据库里存在同名表时,把数据追加在表后面。replace会删除原来的同名表,再新建表后插入数据。
具体哪种比较好要看情况,如果你的数据每天追加就用append。每天都更新,并且不需要保留之前数据就用replace。fail的话,可以避免覆盖原有表格的失误发生
3、总结MYSQL的date_format、concat的用法,并以此数据举例说明。
date_format可以将把一个日期转换为各种样式的字符串。格式为date_format(原日期格,新日期格式)
代码示例与结果如下:
SELECT *, date_format(dayno, '%Y%m%d') as day1
FROM sheet1
where app_name='相机'
group by uid,dayno
image.png
concat拼接用法,例如上述,留存率需要以百分比表现,拼接一个‘%’的用法如下:
select day1,count(distinct a.uid) as '活跃用户数',
round(sum(case when b.day2-a.day1 =1 then 1 else 0 end)/count(distinct a.uid),2) as '次日留存率',
round(sum(case when b.day2-a.day1 =1 then 1 else 0 end)/count(distinct a.uid),2) as '三日留存率',
round(sum(case when b.day2-a.day1 =1 then 1 else 0 end)/count(distinct a.uid),2) as '七日留存率',
concat(round((sum(case when b.day2-a.day1 =1 then 1 else 0 end)/count(distinct a.uid))*100,2),'%') as '次日留存率',
concat(round((sum(case when b.day2-a.day1 =3 then 1 else 0 end)/count(distinct a.uid))*100,2),'%') as '三日留存率',
concat(round((sum(case when b.day2-a.day1 =7 then 1 else 0 end)/count(distinct a.uid))*100,2),'%') as '七日留存率'
from
(SELECT uid, date_format(dayno, '%Y%m%d') as day1
FROM sheet1
where app_name='相机'
group by uid,dayno) as a
left join
(SELECT uid, date_format(dayno, '%Y%m%d') as day2
FROM sheet1
where app_name='相机'
group by uid,dayno) as b
on a.uid =b.uid and a.day1<b.day2
group by a.day1 ;
image.png
4、利用MYSQL语句解题,分步骤达到目的再一次性运行,达到题目要求。
(1)求活跃度计算
(2)次日留存
(3)延伸 次日、三日、七日留存
解题思路与答案见题目一
5、用Python解题,要求函数封装,利用调用函数计算
解题思路与答案借鉴Vayne老哥
作者:vayne233
链接:https://www.jianshu.com/p/8a03ff5d99c8
import datetime
import pandas as pd
import numpy as np
def jisuan():
# 读取数据
df=pd.read_excel(r'C:\Users\admin\Desktop\xiangji.xlsx',sheet_name='Sheet1')
# 通过处理,得到后续所有计算所需的df2表
df=df[df.app_name=='相机'].drop_duplicates(subset=['dayno','uid'])
df=df[['uid','dayno']]
df2=pd.merge(df,df,on='uid',how='left')
df2=df2[df2.dayno_x<=df2.dayno_y]
df2['difference']=df2.apply(lambda x: x[2]-x[1], axis=1)
# 得到基础活跃人数表d1,方便以后做连接
d1=df2.groupby('dayno_x').uid.nunique().reset_index().rename(columns={'dayno_x':'日期','uid':'活跃用户数'})
d2=df2[df2['difference']==datetime.timedelta(1)].groupby('dayno_x').uid.count().reset_index().rename(columns={'dayno_x':'日期','uid':'次日留存用户数'})
d1=pd.merge(d1,d2,how='left')
d2=df2[df2['difference']==datetime.timedelta(2)].groupby('dayno_x').uid.count().reset_index().rename(columns={'dayno_x':'日期','uid':'三日留存用户数'})
d1=pd.merge(d1,d2,how='left')
d2=df2[df2['difference']==datetime.timedelta(6)].groupby('dayno_x').uid.count().reset_index().rename(columns={'dayno_x':'日期','uid':'七日留存用户数'})
# 填充缺失值
d1=pd.merge(d1,d2,how='left').fillna(0)
# 浮点数转为整数
d1[['次日留存用户数','三日留存用户数','七日留存用户数']]=d1[['次日留存用户数','三日留存用户数','七日留存用户数']].astype(int)
d1['次日留存率']=d1['次日留存用户数']/d1['活跃用户数']
d1['三日留存率']=d1['三日留存用户数']/d1['活跃用户数']
d1['七日留存率']=d1['七日留存用户数']/d1['活跃用户数']
# 浮点数转为百分比
d1['次日留存率']=d1['次日留存率'].apply(lambda x: format(x,'.2%'))
d1['三日留存率']=d1['三日留存率'].apply(lambda x: format(x,'.2%'))
d1['七日留存率']=d1['七日留存率'].apply(lambda x: format(x,'.2%'))
return d1
image.png