1.over() 窗口函数的语法结构
分析函数 over(partition by 列名 order by 列 rows between 开始位置 and 结束位置)
partition by 相当于sql中的group by,例如“partition by 月份”
rows between指定窗口位置。常用的范围是 <u>"rows between unbounded preceding and current row(表示从起点到当前行)</u>"
其他常用窗口关键字:
PRECEDING:往前
FOLLOWING:往后
CURRENT ROW:当前行
UNBOUNDED:起点(一般结合PRECEDING,FOLLOWING使用)
UNBOUNDED PRECEDING 表示该窗口最前面的行(起点)
UNBOUNDED FOLLOWING:表示该窗口最后面的行(终点)
比如说:
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW(表示从起点到当前行)
ROWS BETWEEN 2 PRECEDING AND 1 FOLLOWING(表示往前2行到往后1行)
ROWS BETWEEN 2 PRECEDING AND 1 CURRENT ROW(表示往前2行到当前行)
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING(表示当前行到终点)
2. 常与over() 一起使用的分析函数
2.1 聚合类
avg sum min
2.2 排名类
row_number(不会重复) rank(会有重复的会产生空位 如1 1 3) dense_rank(会重复但是不会空位如1 1 2.
2.3 其他类
lag:可以记录上一条记录
lead, ntile(n)
3. 第一组练习题
3.1 原始数据
20191020,11111,85
20191020,22222,83
20191020,33333,86
20191021,11111,87
20191021,22222,65
20191021,33333,98
20191022,11111,67
20191022,22222,34
20191022,33333,88
20191023,11111,99
20191023,22222,33
创建表格:
create table test_window
(logday string, userid string, score int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
load data local inpath '/home/frog005/sxing/prac1.txt' into table test_window;
3.2 每个用户的信息以及表中数据的总数
3.2.1 方法一、用count和left join
select * from test_window a, (select count(logday) as total_count from test_window) b;
3.2.2 方法二、窗口函数count
select *, count() over() as total_count from test_window;
3.3 用户明细和用户每天总数
3.3.1 用group by结合left join
select b.*, a.day_sum from
(select logday, count(*) as day_sum from test_window group by logday ) a
left join test_window b
on a.logday=b.logday;
3.3.2 用窗口函数
select *, count() over(partition by logday) as day_sum from test_window;
3.4 计算第一天到现在所有score大于80分的用户人次
3.4.1 使用窗口函数
select *, count() over(order by logday rows between unbounded preceding and current row) as prev_cum from test_window where score>80;
注意这里求的是人次,所以同一个人两次超过80,应该计数2.
3.5 计算每个用户到当前日期分数大于80的天数
使用窗口函数
select *,
count() over(partition by userid order by logday rows between unbounded preceding and current row) as my_day_over_80
from test_window where score > 80;
3.6 计算第一天到现在所有score大于80分的用户人数(去重用count结合distinct)
-
先计算每个用户的第一天是哪天。
如果直接group by 再order by结果不对,因为group by会取出第一个数据,然后用这个数据去排序,但是第一个数据不一定是最大数据。但是我们这一个题目的数据正好就是,所以我们不用关心,直接搞。具体的group by和orderby的说明,参考这个文章。group by 和order by联合使用的注意事项。
select userid,
collect_set(logday)[0] as first_day,
collect_set(score)[0] as first_score_over80
from test_window
where score > 80
group by userid;
这里是试探一下数据。注意where在group by前面,注意group by里面没有的字段需要用collect_set()[0] 来表示。结果如图:
然后我们可以增加条目,累加得到截止每一天的总人数。
4. 第二组练习
4.0.原始数据
需要导入的数据txt:
jack,2017-01-01,10
tony,2017-01-02,15
jack,2017-02-03,23
tony,2017-01-04,29
jack,2017-01-05,46
jack,2017-04-06,42
tony,2017-01-07,50
jack,2017-01-08,55
mart,2017-04-08,62
mart,2017-04-09,68
neil,2017-05-10,12
mart,2017-04-11,75
neil,2017-06-12,80
mart,2017-04-13,94
建表以及导入:
create table buy_event
(
name string,
orderdate string,
cost int
)ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
load data local inpath "/home/frog005/sxing/buy_event.txt" into table buy_event;
4.1 查询在2017年4月份购买过的顾客及总人次
select *,
count() over() as total_persons
from buy_event
where substr(orderdate,0, 7)='2017-04';
直接用count() 会报错?不知道原因,但是加上over就ok了。
方法二,子查询的方法
先得到月份表
select *,substr(orderdate, 0, 7) as month_str
from buy_event
where substr(orderdate, 0, 7) ='2017-04';
有这个表作为目标做子查询即可。太麻烦了,还是用count over方法一好。
方法三:如果是人数,要求去重人次怎么办
select *,
count(distinct name) over() as total_persons
from buy_event
where substr(orderdate,0, 7)='2017-04';
4.2 查询顾客的购买明细和月购买总额
select *,
sum(cost) over(partition by name, substr(orderdate, 0, 7)) as month_total
from buy_event;
partition 里面可以使用函数,substr等等。
4.3 顾客的购买明细以及到目前为止的总金额
select *, sum(cost) over(partition by name order by orderdate) as total_until_this_time from buy_event;
order by 之后rows between unbounded preceding and current row写不写无所谓。
4.4 查询顾客的上次购买时间
select *, lag(orderdate) over(partition by name order by orderdate) as record_buy_date from buy_event;
4.5 查询前20%时间的订单信息
select *, ntile(5) over (order by orderdate) as orderdate_seg
from buy_event;
这样可以得到我们每一个条目所在的段id。然后用这个表做子查询即可。
select * from (
select *, ntile(5) over (order by orderdate) as orderdate_seg
from buy_event) a
where orderdate_seg = 1;
5. 第三组练习
5.0 原始数据
name subject score
孙悟空 语文 87
孙悟空 数学 95
孙悟空 英语 68
大海 语文 94
大海 数学 56
大海 英语 84
宋宋 语文 64
宋宋 数学 86
宋宋 英语 84
婷婷 语文 65
婷婷 数学 85
婷婷 英语 78
创建表格
create table score
(
name string,
subject string,
score int
) row format delimited fields terminated by "\t";
#加载数据
load data local inpath '/home/frog005/sxing/score_source.txt' into table score;
5.1 每门课程的学生排名
select *,
row_number() over (partition by subject order by score desc) as bukaolvchongfu,
rank() over (partition by subject order by score desc) as chongfu_baoliu_kongwei,
dense_rank() over (partition by subject order by score desc) as chongfu_bubaoliu_kongwei
from score;
5.2 每门科目的top n的学生
select *
from
(select *,
row_number() over (partition by subject order by score desc) as bukaolvchongfu from score
) suanhaopaimingde
where bukaolvchongfu < 4;