参考:
数分面试-SQL篇
本篇文章完全按照以上参考链接写,其实可以理解为个人笔记
拼多多面试题
注:部分来源于笔试,面试部分因为都是基于其他人面经口述记录,所以难免有一些在格式不统一的地方。
1.case专题-商品订单数据
数据表:
订单表orders,大概字段有(order_id'订单号,'user_id‘用户编号’, order_pay‘订单金额’ , order_time‘下单时间’,'商品一级类目commodity_level_I','商品二级类目commodity_level_2')
问题1. 求最近7天内每一个一级类目下成交总额排名前3的二级类目:
思路:
- 先给出最近七天每一个一级类目下的二级类目的成交总额
select o.commodity_level_I, o.commodity_level_2, sum(order_pay) as total_pay
from orders as o
where datediff(now(), o.order_time) <= 7
group by o.commodity_level_I, o.commodity_level_2
- 再对每个一级类目下的成交总额进行排名
select tmp.commodity_level_I, tmp.commodity_level_2, tmp.total_pay,
row_number() over(partition by tmp.commodity_level_I order by tmp.total_pay desc) as rank
from (select o.commodity_level_I, o.commodity_level_2, sum(order_pay) as total_pay
from orders as o
where datediff(now(), o.order_time) <= 7
group by o.commodity_level_I, o.commodity_level_2) as tmp
- 取排名前三的即可
select tmp1.commodity_level_I as '商品一级类目',
tmp1.commodity_level_2 as '商品二级类目',
tmp1.total_pay as '成交总额'
from (select tmp.commodity_level_I, tmp.commodity_level_2, tmp.total_pay,
row_number() over(partition by tmp.commodity_level_I order by tmp.total_pay desc) as rank
from (select o.commodity_level_I, o.commodity_level_2, sum(order_pay) as total_pay
from orders as o
where datediff(now(), o.order_time) <= 7
group by o.commodity_level_I, o.commodity_level_2) as tmp ) as tmp1
where tmp1.rank <=3
问题2.提取8.1-8.10每一天消费金额排名在101-195的user_id
思路:
- 选取8.1-8.10每一天每个user的消费金额
select o.order_time, o. user_id, sum(o.order_pay) as sum_pay
from orders as o
where o.order_time between '2020-08-01' and '2020-08-10'
group by o.order_time, o. user_id
- 给出每天的消费金额排名
select tmp.order_time, tmp. user_id, tmp.sum_pay,
row_number() over(partition by tmp.order_time order by tmp.sum_pay desc) as rank
from (select o.order_time, o. user_id, sum(o.order_pay) as sum_pay
from orders as o
where o.order_time between '2020-08-01' and '2020-08-10'
group by o.order_time, o. user_id) as tmp
3.选出排名101-195的即可
select tmp1.order_time as '订单日期',
tmp1. user_id ,
tmp1.sum_pay as '消费金额'
from (select tmp.order_time, tmp. user_id, tmp.sum_pay,
row_number() over(partition by tmp.order_time order by tmp.sum_pay desc) as rank
from (select o.order_time, o. user_id, sum(o.order_pay) as sum_pay
from orders as o
where o.order_time between '2020-08-01' and '2020-08-10'
group by o.order_time, o. user_id) as tmp) as tmp1
where tmp1.rank between 101 and 195;
2.case专题-活动运营数据分析
数据表
表1——订单表orders,大概字段有(user_id‘用户编号’, order_pay‘订单金额’ , order_time‘下单时间’)
表2——活动报名表act_apply,大概字段有(act_id‘活动编号’, user_id‘报名用户’,act_time‘报名时间’)
1.活动运营数据分析-统计每个活动对应所有用户在报名后产生的总订单金额,总订单数
思路:
两表按user_id进行连接
还有一个条件是下单的时间要大于等于报名时间
然后按照act_id进行分组统计
select a.act_id, sum(o.order_pay) as '总金额', count(a.act_id) as '订单数'
from orders as o
join act_apply as a
on o.user_id = a.user_id
where o.order_time >= a.act_time
group by a.act_id
参考链接中使用的是left join(如果需要改我这里是right join)
这里是需要斟酌一下的
2.统计每个活动从开始后到当天(考试日)平均每天产生的订单数,活动开始时间定义为最早有用户报名的时间。(涉及到时间的数据类型均为:datetime)
- 先统计每个活动的最小时间即为活动开始的时间
select a.act_id, a.user_id, a.act_time,
min(a.act_time) over(partition by a.act_id) as start_date
from act_apply as a
2.与订单表连接 总的订单数除以时间差即可
select tmp.act_id as '活动编号', count(*)/datediff(now(), tmp.start_date) as `平均每天产生的订单数`
from (select a.act_id, a.user_id, a.act_time,
min(a.act_time) over(partition by a.act_id) as start_date
from act_apply as a) as tmp
right join orders as o
on o.user_id = tmp.user_id
where o.order_time >= tmp.act_time
-- 如果group by 没有start_date的话 sql不允许select语句使用它的 这是聚合函数的限制
group by tmp.act_id
其实不用窗口函数也是可行的
SELECT a.act_id, COUNT(*)/ DATEDIFF(NOW() - MIN(a.act_time)) as m_order_num
FROM orders as o
LEFT JOIN act_apply as a
ON a.user_id = o.user_id
WHERE order_time >= act_time
GROUP BY a.act_id;
3.case专题-用户行为路径分析
表1——用户行为表tracking_log,大概字段有(user_id‘用户编号’,opr_id‘操作编号’,log_time‘操作时间’)
问题1.统计每天符合以下条件的用户数:A操作之后是B操作,AB操作必须相邻
思路:
- 用窗口函数给出每天每个用户的每次操作、对应的前一个操作
select t.user_id, t.opr_id as opr1, t.log_time,
lead(1) over(partition by t.log_time, t.user_id order by t.log_time asc) as opr2
from tracking_log as t
2.按时间进行分组 统计opr1为A opr2为B的用户数目
select tmp.log_time, count(distinct tmp.user_id) as 'A-B路径用户计数'
from (select t.user_id, t.opr_id as opr1, t.log_time,
lead(1) over(partition by t.log_time, t.user_id order by t.log_time asc) as opr2
from tracking_log as t) as tmp
where tmp.opr1 = 'A' and tmp.opr2 = 'B'
group by tmp.log_time
问题2.统计用户行为序列为A-B-D的用户数
其中:A-B之间可以有任何其他浏览记录(如C,E等),B-D之间除了C记录可以有任何其他浏览记录(如A,E等)
思路:
- 构造每个用户的浏览记录
select t.user_id, group_concat(t.opr_id order by t.log_time asc) as path
from tracking_log as t
group by t.user_id
GROUP_CONCAT(DISTINCT expression
ORDER BY expression
SEPARATOR sep);
//原文出自【易百教程】,商业转载请联系作者获得授权,非商业请保留原文链接:https://www.yiibai.com/mysql/group_concat.html
2.选出满足条件的记录
select t.user_id, group_concat(t.opr_id order by t.log_time asc) as path
from tracking_log as t
group by t.user_id
having path REGEXP 'A[A-Z]+B[^C]+D'
MySQL 正则表达式
3.统计数量即可
select count(tmp.user_id) as result
from (select t.user_id, group_concat(t.opr_id order by t.log_time asc) as path
from tracking_log as t
group by t.user_id
having path REGEXP 'A[A-Z]+B[^C]+D') as tmp
参考答案
select count(user_id) as result
from
(
select user_id,group_concat(opr_id order by log_time) as user_behavior_path
from tracking_log
group by user_id
having (user_behavior_path like '%A%B%D%')
and (user_behavior_path not like '%A%B%C%D%')
) t
使用模糊查询 选出满足 ABD顺序的 再排除满足ABCD顺序的即可
4.case专题-用户留存分析
表1——用户登陆表user_log,大概字段有(user_id‘用户编号’,log_date‘登陆时间’)
问题1.求每天新增用户数,以及他们第2天、30天的留存率
思路:
- 先选出每个用户的最小登陆时间 则该用户是当天的新增用户
select u.user_id, min(u.log_time) as first_time
from user_log as u
group by u.user_id
2.与主表连接 得到每个用户的登陆时间及其首次登陆时间
对首次登陆时间分组 统计数量则为每日新增用户数
select tmp. first_time, count(distinct tmp.user_id) as '每天新增用户数'
from (select u.user_id, min(u.log_time) as first_time
from user_log as u
group by u.user_id) as tmp
join user_log as u
on u.user_id = tmp.user_id
group by tmp. first_time;
- 第二天留存的用户通过if判断 如果登陆时间与首次登陆时间差值为1 则返回用户id 否则记为null 统计数量即可
30天留存同理
select tmp. first_time, count(distinct tmp.user_id) as '每天新增用户数',
count(distinct if(datediff(tmp.log_time, tmp. first_time) = 1, tmp.user_id, null)) as '第二天留存用户数',
count(distinct if(datediff(tmp.log_time, tmp. first_time) = 29, tmp.user_id, null)) as '第三十天留存用户数'
from (select u.user_id, min(u.log_time) as first_time
from user_log as u
group by u.user_id) as tmp
join user_log as u
on u.user_id = tmp.user_id
group by tmp. first_time;
- 所求留存率只需除以新增用户数 然后用round取两位即可
select tmp. first_time, count(distinct tmp.user_id) as '每天新增用户数',
round(count(distinct if(datediff(tmp.log_time, tmp. first_time) = 1, tmp.user_id, null))/count(distinct tmp.user_id), 2) as '第二天的留存率',
round(count(distinct if(datediff(tmp.log_time, tmp. first_time) = 29, tmp.user_id, null))/count(distinct tmp.user_id), 2) as '第三十天的留存率'
from (select u.user_id, min(u.log_time) as first_time
from user_log as u
group by u.user_id) as tmp
join user_log as u
on u.user_id = tmp.user_id
group by tmp. first_time;
用窗口函数怎么做呢
其实以上思路的1前三步用窗口函数就可以实现
select tmp.first_date, count(distinct tmp.user_id) as '每天新增用户数'
from (select u.user_id, u.log_time, min(u.log_time) over(partition by u.user_id) as first_date
from user_log as u) as tmp
group by tmp.first_date;
第二天的回访数量可以这么求
select tmp.first_date, count(distinct tmp.user_id) as '第二天留存用户数'
from (select u.user_id, u.log_time, min(u.log_time) over(partition by u.user_id) as first_date
from user_log as u) as tmp
where datediff(tmp.log_time, tmp.first_date)= 1
group by tmp.first_date;
第三十天留存用户数同理
select tmp.first_date, count(distinct tmp.user_id) as '第三十天留存用户数'
from (select u.user_id, u.log_time, min(u.log_time) over(partition by u.user_id) as first_date
from user_log as u) as tmp
where datediff(tmp.log_time, tmp.first_date)= 29
group by tmp.first_date;
三表连接
select t1.first_date, t1.'每天新增用户数',
concat(round(t2.'第二天留存用户数'/t1.'每天新增用户数' , 2), '%') as '第2天回访率',
concat(round(t2.'第二天留存用户数'/t1.'每天新增用户数' , 2), '%') as '第30天回访率'
from (select tmp.first_date, count(distinct tmp.user_id) as '每天新增用户数'
from (select u.user_id, u.log_time, min(u.log_time) over(partition by u.user_id) as first_date
from user_log as u) as tmp
group by tmp.first_date) as t1
join (select tmp.first_date, count(distinct tmp.user_id) as '第二天留存用户数'
from (select u.user_id, u.log_time, min(u.log_time) over(partition by u.user_id) as first_date
from user_log as u) as tmp
where datediff(tmp.log_time, tmp.first_date)= 1
group by tmp.first_date) as t2
on t1. first_date = t2.first_date
join (select tmp.first_date, count(distinct tmp.user_id) as '第三十天留存用户数'
from (select u.user_id, u.log_time, min(u.log_time) over(partition by u.user_id) as first_date
from user_log as u) as tmp
where datediff(tmp.log_time, tmp.first_date)= 29
group by tmp.first_date) as t3
on t1. first_date = t3.first_date
问题2.找近90天,30天,7天的登录人数
思路:
- 求得每个用户每次登陆距离现在的时间
select u.user_id, u.log_time, datediff(now(), u.log_time) as '距今登陆时间天数'
from user_log as u
2.统计近7天的登陆人数
select count(distinct u.user_id)
from user_log as u
where datediff(now(), u.log_time) <= 7
30天,90天的类似
如果一次查询出则可以用如下方法
select
count(distinct if(datediff(now(), u.log_time)<=7, u.user_id, null)) as '近7天登陆人数',
count(distinct if(datediff(now(), u.log_time)<=30, u.user_id, null)) as '近30天登陆人数',
count(distinct if(datediff(now(), u.log_time)<=90, u.user_id, null)) as '近90天登陆人数'
from user_log as u
问题3.求用户近一个月平均登录时间间隔(按天)
思路:
- 给出近一个月每次登陆前一次的登陆情况
selec u.user_id, u.log_time, lag(u.log_time) over(partition u.user_id order by u.log_time asc) as pre_log_time
from user_log as u
where datediff(now(), u.log_time) <= 30
- 给出用户每次登陆的时间间隔
selec u.user_id, u.log_time - lag(u.log_time) over(partition u.user_id order by u.log_time asc) as diff
from user_log as u
where datediff(now(), u.log_time) <= 30
- 对用户进行分组 统计平均的间隔
select tmp.user_id, avg(tmp.diff) as '平均间隔'
from (selec u.user_id, u.log_time - lag(u.log_time) over(partition u.user_id order by u.log_time asc) as diff
from user_log as u
where datediff(now(), u.log_time) <= 30) as tmp
group by tmp.user_id
5.case专题-统计特征(中位数,众数,四分位数)
表1——订单表orders,字段:店铺id(shop_id),销量(sale),商品id(commodity_id)
问题1.求每个店铺销量的中位数
这个问题其实在这里做过
Leetcode569. 员工薪水中位数(困难)
还有另一个中位数的题目
Leetcode571. 给定数字的频率查询中位数(困难)
都挺难的 现在返回来看还是有些不会的
参考这篇文章还是学到了新的方法
解法一:
常规思路
设每个店铺销量组成的序列长度是cnt
当cnt是偶数时,中位数所在序号是cnt/2,cnt/2+1
当cnt是奇数时,中位数所在序号是ceil(cnt/2)
1.先给出cnt 和每个店铺按照销量的排名
select o.shop_id, o.sale,
count(1) over(partition by o.shop_id) as cnt,
row_number() over(partition by o.shop_id order by o.sale asc) as sale_rank
from orders as o
- 对于cnt为奇数或偶数选出对应排名的记录
select tmp.shop_id, tmp.sale
from (select o.shop_id, o.sale,
count(1) over(partition by o.shop_id) as cnt,
row_number() over(partition by o.shop_id order by o.sale asc) as sale_rank
from orders as o) as tmp
where (tmp.cnt%2=0 and sale_rank in (cnt/2, cnt/2+1)) or
(tmp.cnt%2=1 and sale_rank = ceiling(cnt/2))
- 最后对shop_id分组 对sale取平均即可
select tmp.shop_id, avg(tmp.sale) as '中位数'
from (select o.shop_id, o.sale,
count(1) over(partition by o.shop_id) as cnt,
row_number() over(partition by o.shop_id order by o.sale asc) as sale_rank
from orders as o) as tmp
where (tmp.cnt%2=0 and sale_rank in (cnt/2, cnt/2+1)) or
(tmp.cnt%2=1 and sale_rank = ceiling(cnt/2))
group by tmp.group_id
解法二:
abs(rn - (cnt+1)/2) < 1
解释下上面的公式:
rn是给定长度为cnt的数列的序号排序,
eg:对于1,2,3,4,5,它的中位数所在序号是3,3-(5+1)/2 = 0
对于1,2,3,4,它的中位数所在序号是2,3
2 - (4+1)/2 = -0.5
3-(4+1)/2 = 0.5
可见(cnt+1)/2是一个数列的中间位置,如果是奇数数列,这个位置刚好是中位数所在
如果是偶数,abs(rn - (cnt+1)/2) < 1
代码思路与上边很相似 只不过改了where条件
select shop_id,avg(sale) as median
from
(
select shop_id,sale,
row_number() over (partition by shop_id order by sale) as rn, -- 各商品销量在其店铺内的顺序编号
count(1) over (partition by shop_id) as cnt -- 各店铺的商品记录数
from orders
)
where abs(rn - (cnt+1)/2) < 1 -- 顺序编号在店铺商品销量记录数中间的,即为中位数
group by shop_id
解法三:不用窗口函数,不排序,利用中位数定义
这个有点儿绕 感兴趣的可以看一下链接
中位数出现的频率一定大于等于大于它的数和小于它的数的绝对值之差。
问题2.求每个店铺订购商品的众数
不用窗口函数的思路:
- 对每个店铺的商品出现次数进行统计
select o.shop_id, o.commodity_id, count(*) as order_num
from order as o
group by o.shop_id, o.commodity_id
- 选出每个店铺频数最高的商品次数
select tmp.shop_id, max(order_num) as max_order_num
from (select o.shop_id, o.commodity_id, count(*) as order_num
from order as o
group by o.shop_id, o.commodity_id) as tmp
group by tmp.shop_id
- 利用二元in选出对应字段即可
select tmp.shop_id, tmp.commodity_id, tmp.order_num
from (select o.shop_id, o.commodity_id, count(*) as order_num
from order as o
group by o.shop_id, o.commodity_id) as tmp
where (tmp.shop_id, tmp.order_num) in (select tmp.shop_id, max(order_num) as max_order_num
from (select o.shop_id, o.commodity_id, count(*) as order_num
from order as o
group by o.shop_id, o.commodity_id) as tmp
group by tmp.shop_id)
利用窗口函数的思路:
- 对每个店铺的商品出现次数进行统计
select o.shop_id, o.commodity_id, count(*) as order_num
from order as o
group by o.shop_id, o.commodity_id
2.选出每个店铺频数最高的商品次数
select tmp.shop_id, tmp.commodity_id, tmp.order_num,
max(tmp.order_num) over(partition by tmp.shop_id) as max_order_num
from (select o.shop_id, o.commodity_id, count(*) as order_num
from order as o
group by o.shop_id, o.commodity_id) as tmp
- 最后选出频数等于最高频数即可
select tmp1.shop_id, tmp1.commodity_id, tmp1.order_num
from (select tmp.shop_id, tmp.commodity_id, tmp.order_num,
max(tmp.order_num) over(partition by tmp.shop_id) as max_order_num
from (select o.shop_id, o.commodity_id, count(*) as order_num
from order as o
group by o.shop_id, o.commodity_id) as tmp) as tmp1
where tmp1.order_num = tmp1.max_order_num
问题3.求四分位数
三个表
T1:good_id,cate_id(分类)
T2:mall_id(店铺), good_id
T3:mall_id, credit_score(信用分)
问,在不同分类中,店铺的信用分top25%
思路:
- 连接三表
select T1.good_id, T2.cate_id, T2.mall_id, T3.credit_score
from T1
join T2
on T1.good_id = T2.good_id
join T3
on T3.mall_id = T2.mall_id
- 给出不同分类下的排名 和不同分类的总条数
select tmp.cate_id, tmp.mall_id, tmp.credit_score,
row_number() over(partition by tmp.cate_id order by tmp.credit_score desc) as cate_rank,
count(*) over(partition by tmp.cate_id) as cate_total
from (select T1.good_id, T2.cate_id, T2.mall_id, T3.credit_score
from T1
join T2
on T1.good_id = T2.good_id
join T3
on T3.mall_id = T2.mall_id) as tmp
- 选出排名小于等于总数*25%的即可
select tmp2.cate_id, tmp2.mall_id, tmp2.credit_score, tmp2.cate_rank
from (select tmp.cate_id, tmp.mall_id, tmp.credit_score,
row_number() over(partition by tmp.cate_id order by tmp.credit_score desc) as cate_rank,
count(*) over(partition by tmp.cate_id) as cate_total
from (select T1.good_id, T2.cate_id, T2.mall_id, T3.credit_score
from T1
join T2
on T1.good_id = T2.good_id
join T3
on T3.mall_id = T2.mall_id) as tmp) as tmp2
where tmp2.cate_rank <= tmp2.cate_total*0.25
6.case专题-GMV周同比统计
表:T —— 字段:时间(sale_date),店铺类别(cate_id),店铺数量(mall_num),gmv
问题1.拼多多618前后一周内各店铺类别gmv的日均提升幅度和比例
思路:
- 选出618前一周的gmv 和对应7天后的gmv
select T.sale_date, T.cate_id, T.mall_num, T.gmv,
lead(gmv, 7) over(partition by T.cate_id order by T.sale_date asc) as gmv_lead_7
from T
-- 窗口函数执行实在where之后的所以这里的范围是这样 而不是 0617
where T.sale_date between '2020-06-11' and '2020-06-24'
- 选出一周前的记录 对店铺类别分组 统计gmv的日均提升幅度和比例即可
select T.cate_id,
avg(gmv_lead_7 - gmv) as '日均提升幅度',
avg((gmv_lead_7 - gmv)/gmv) as '日均提升比例'
from (select T.sale_date, T.cate_id, T.mall_num, T.gmv,
lead(gmv, 7) over(partition by T.cate_id order by T.sale_date asc) as gmv_lead_7
from T
where T.sale_date between '2020-06-11' and '2020-06-24') as tmp
where tmp.sale_date between '2020-06-11' and '2020-06-17'
group by T.cate_id;
问题2.在618前一周gmv top20%,20-40%等这5类商铺在618后一周内gmv日均提升幅度和比例
思路:
- 先给出618前一周的记录 然后对类别进行分组 得到前一周每个类别的总gmv 一遍后续求比例
select T.cate_id, sum(T.gmv) as total_gmv
from T
where T.sale_date between '2020-06-11' and '2020-06-17'
group by T.cate_id
2.根据总的gmv排名 给定五个类别的分组
select tmp.cate_id,
count(T.cate_id) as total,
row_number() over(order by tmp.total_gmv desc) as rank
from (select T.cate_id, sum(T.gmv) as total_gmv
from T
where T.sale_date between '2020-06-11' and '2020-06-17'
group by T.cate_id) as tmp
3.根据rank/total的大小可以给定分组
select tmp1.cate_id,
case when tmp1.rank <= tmp1.total*0.1 then '10%'
else tmp1.rank <= tmp1.total*0.2 then '10%-20%'
else tmp1.rank <= tmp1.total*0.3 then '20%-30%'
else tmp1.rank <= tmp1.total*0.4 then '30%-40%'
else tmp1.rank <= tmp1.total*0.5then '40%-50%'
else tmp1.rank > tmp1.total*0.5 then '50%以上'
end as gmv_quantile
from (select tmp.cate_id,
count(T.cate_id) as total,
row_number() over(order by tmp.total_gmv desc) as rank
from (select T.cate_id, sum(T.gmv) as total_gmv
from T
where T.sale_date between '2020-06-11' and '2020-06-17'
group by T.cate_id) as tmp) as tmp1
- 之后就与第一问很相似了 只不过是对于没类店铺又进行了一个分类 然后根据这个分类进行统计
select b.gmv_quantile,
avg(a.gmv_lead_7 - a.gmv) as '日均提升幅度',
avg((a.gmv_lead_7 - a.gmv)/a.gmv) as '日均提升比例'
from (select T.sale_date, T.cate_id, T.mall_num, T.gmv,
lead(gmv, 7) over(partition by T.cate_id order by T.sale_date asc) as gmv_lead_7
from T
where T.sale_date between '2020-06-11' and '2020-06-24') as a
join (select tmp1.cate_id,
case when tmp1.rank <= tmp1.total*0.1 then '10%'
else tmp1.rank <= tmp1.total*0.2 then '10%-20%'
else tmp1.rank <= tmp1.total*0.3 then '20%-30%'
else tmp1.rank <= tmp1.total*0.4 then '30%-40%'
else tmp1.rank <= tmp1.total*0.5then '40%-50%'
else tmp1.rank > tmp1.total*0.5 then '50%以上'
end as gmv_quantile
from (select tmp.cate_id,
count(T.cate_id) as total,
row_number() over(order by tmp.total_gmv desc) as rank
from (select T.cate_id, sum(T.gmv) as total_gmv
from T
where T.sale_date between '2020-06-11' and '2020-06-17'
group by T.cate_id) as tmp) as tmp1) as b
on a.cate_id = b.cate_id
where a.sale_date between '20190611' and '20190617'
and b.gmv_quantile in ('10%','10%-20%','20%-30%','30%-40','40%-50%')
group by b.gmv_quantile
7.case专题-连续区间问题
(拼多多二面面试题,就是找到一个session的开始和结束位置,但是具体题目条件记不清楚了,这里放个类似的题目,看看这一类题目的解题思路)
Leetcode1285. 找到连续区间的开始和结束数字(中等)
思路:
- 利用定义变量的方法 判断当前值与前一值是否差一 若是则分组保持不变 否组分组+1
select L.log_id,
@group:=if(L.log_id = @pre_id + 1,@group, @group+1) as group,
@pre_id:=L.log_id
from Logs as L, (select @pre_id:=null, @group:=null) as init
- 对于每个分组 最大值为结束 最小值为起始
select min(L.log_id) as start_id,
max(L.log_id) as end_id
from (select L.log_id,
@group:=if(L.log_id = @pre_id + 1,@group, @group+1) as group,
@pre_id:=L.log_id
from Logs as L, (select @pre_id:=null, @group:=null) as init) as tmp
group by tmp.group;
猿辅导面试题
8.case专题-学生成绩分析
表:Enrollments
+---------------+---------+
| Column Name | Type |
+---------------+---------+
| student_id | int |
| course_id | int |
| grade | int |
+---------------+---------+
(student_id, course_id) 是该表的主键。
问题1.查询每位学生获得的最高成绩和它所对应的科目,若科目成绩并列,取 course_id 最小的一门。查询结果需按 student_id 增序进行排序。
解法1. 利用二元in
思路:
- 选出没名学生的最高成绩
select E.student_id, max(E.grade) as max_grade
from Enrollments as E
group by E.student_id
- 利用二元in选出对应的科目
select E.student_id, E.course_id, E.grade
from Enrollments as E
where (E.student_id, E.grade) in (select E.student_id, max(E.grade) as max_grade
from Enrollments as E
group by E.student_id)
- 按照student_id和grade分组 选出最小的course_id 然后按student_id排序即可
select E.student_id, min(E.course_id) as course_id, E.grade
from Enrollments as E
where (E.student_id, E.grade) in (select E.student_id, max(E.grade) as max_grade
from Enrollments as E
group by E.student_id)
group by E.student_id, E.grade
order by E.student_id asc;
解法2. 利用窗口函数
思路:
- 对stdent_id进行分组 对grade desc, course_id asc进行排序 得到成绩排序(相同成绩的课程id小的在前)
select E.student_id, E.course_id, E.grade,
row_number() over(partition by E.student_id order by grade desc, course_id asc) as rank
from Enrollments as E
- 选出rank为1的记录然后按student_id 增序进行排序即可
select tmp.student_id, tmp.course_id, tmp.grade
from (select E.student_id, E.course_id, E.grade,
row_number() over(partition by E.student_id order by grade desc, course_id asc) as rank
from Enrollments as E) as tmp
where tmp.rank = 1
order by tmp.student_id
参考链接的代码:
- 对stdent_id进行分组 对grade desc进行排序 按dense_rank() 给定序号 最高分相同的序号都为1
- 选出序号为1 的 然后对stdent_id进行分组 对course_id asc进行排序 选出rank为1的
- 按照student_id排序即可
select a.student_id,a.course_id,a.grade
from
(
select student_id,course_id,grade,
row_number()over(partition by student_id order by course_id) as course_rank
from
(
select student_id,course_id,grade,
dense_rank()over(partition by student_id order by grade desc) as grade_rank
from Enrollments
) t
where t.grade_rank = 1
) a
where a.course_rank = 1
order by a.student_id
问题2.查询每一科目成绩最高和最低分数的学生,输出courseid,studentid,score
思路:
- 先查出每门科目的最高分和最低分
select course_id,max(grade) as max_grade
from Enrollments
group by course_id
select course_id,min(grade) as min_grade
from Enrollments
group by course_id
- 利用二元in
select E.student_id, E.course_id, E.grade
from Enrollments as E
where (E.student_id, E.grade) in
(select E.student_id, max(E.grade) as max_grade
from Enrollments as E
group by E.student_id) or
(E.student_id, E.grade) in
(select E.student_id, min(E.grade) as min_grade
from Enrollments as E
group by E.student_id)
参考的链接
用连接分别选出最高分和最低分 然后用union进行合并
select e.course_id,e.student_id,e.grade as score
from Enrollments e left join
(
select course_id,max(grade) as max_grade
from Enrollments
group by course_id
) t on e.course_id = t.course_id
where e.grade = t.max_grade
union all
select e.course_id,e.student_id,e.grade as score
from Enrollments e left join
(
select course_id,min(grade) as min_grade
from Enrollments
group by course_id
) t on e.course_id = t.course_id
where e.grade = t.min_grade
9.case专题-学生做题情况分析
表t:做题日期(time),学生id(stu_id),题目id(exer_id)
统计10.1-10.10每天做新题的人的数量,重点在每天
思路:
- 每名学生对于每道题目第一次做才叫新题 也就是最小的时间
select t.time, t.stu_id, t.exer_id,
min(t.time) over(partition by t.stu_id, t.exer_id) as first_time
from t
where t.time between '2020-10-01' and '2020-10-10'
- 如果time等于first_time就是在做新题 统计每名学生做新题的次数 如果为10次 则说明每天都在做新题
select tmp.stu_id
from (select t.time, t.stu_id, t.exer_id,
min(t.time) over(partition by t.stu_id, t.exer_id) as first_time
from t
where t.time between '2020-10-01' and '2020-10-10') as tmp
where tmp.time = tmp.first_time
group by tmp.stu_id
having count(tmp.stu_id) = 10
3.统计最终满足条件的用户数量
select count(a.stu_id) as '10.1-10.10每天做新题的人的数量'
from (select tmp.stu_id
from (select t.time, t.stu_id, t.exer_id,
min(t.time) over(partition by t.stu_id, t.exer_id) as first_time
from t
where t.time between '2020-10-01' and '2020-10-10') as tmp
where tmp.time = tmp.first_time
group by tmp.stu_id
having count(tmp.stu_id) = 10) as a