hive窗口函数基本面试面sql必考项目,这里画一下重点猴,那现在就操练起来~
窗口函数格式:
function_name(field) over (partition by field order by field rows between unbounded preceding and current row)
窗口说明:
n PRECEDING:往前n行数据
n FOLLOWING:往后n行数据
CURRENT ROW:当前行
UNBOUNDED:起点,UNBOUNDED PRECEDING 表示从前面的起点 UNBOUNDED FOLLOWING:表示到后面的终点
默认是自起点到当前行
数据准备:
create table study_test
(
student_id bigint comment '学生',
class_name string comment '课程名称',
score double comment '成绩'
);
insert into table study_test values(1,'语文',66.6)
,(1,'数学',33.6) ,(1,'英语',55.6)
,(2,'语文',66.6)
,(2,'数学',88.6) ,(2,'英语',35.6)
,(3,'语文',46.6)
,(3,'数学',58.6) ,(3,'英语',95.6)
,(4,'语文',96.6)
,(4,'数学',98.6) ,(4,'英语',85.6)
;
1.常用的排名函数 row_number,rank,dense_rank
1.1 row_number(),排名从1开始增加,无重复并列情况
1.2 rank(),有并列,排名相等会在名次中留下空位
1.3 dense_rank(),有并列,排名相等会在名次中不留空位
测试case:
select
*,
row_number()over(partition by class_name order by score desc) rnk_1,
rank()over(partition by class_name order by score desc) rnk_1,
dense_rank()over(partition by class_name order by score desc) rnk_1
from study_test
结果:
排序名次
2.sum,avg,min,max等窗口函数
测试case:
select
*,
row_number()over(partition by class_name order by score desc) rnk , --排名
sum(score)over(partition by class_name order by score desc) as sum_all, -- 分组所有行
sum(score)over(partition by class_name order by score desc) as sum_1, --默认起点到当前行
sum(score)over(partition by class_name order by score desc rows between unbounded preceding and current row) as sum_2, --同上
sum(score)over(partition by class_name order by score desc rows between 1 preceding and current row) as sum_3, -- 往前一行到当前行
sum(score)over(partition by class_name order by score desc rows between current row and unbounded following) as sum_4, -- 当前行到往后所有行
sum(score)over(partition by class_name order by score desc rows between 1 preceding and 1 following ) as sum_5 -- 往前一行+当前行+往后一行
from study_test
where class_name='数学'
结果:
student_name | class_name | score | rnk | sum_all | sum_1 | sum_2 | sum_3 | sum_4 | sum_5 |
---|---|---|---|---|---|---|---|---|---|
小小 | 数学 | 98.6 | 1 | 98.6 | 98.6 | 98.6 | 98.6 | 279.4 | 187.2 |
小白 | 数学 | 88.6 | 2 | 187.2 | 187.2 | 187.2 | 187.2 | 180.8 | 245.8 |
小花 | 数学 | 58.6 | 3 | 245.8 | 245.8 | 245.8 | 147.2 | 92.2 | 180.8 |
小明 | 数学 | 33.6 | 4 | 279.4 | 279.4 | 279.4 | 92.2 | 33.6 | 92.2 |
3.lead(),lag()函数,first_value(),last_value()
lead(col,n,default) 用于统计窗口内往下第n行值:
第一个参数为列名;
第二个参数为往下第n行(可选,默认为1);
第三个参数为默认值(当往下第n行为NULL时候,取默认值,如不指定,则为NULL);
同理,lag(col,n,default) 用于统计窗口内往上第n行值。
first_value:取分组内排序后,截止到当前行,第一个值
last_value:取分组内排序后,截止到当前行,最后一个值
测试 case:
select
*,
row_number()over(partition by class_name order by score desc) rnk , --排名
lead(score,1,0.999)over(partition by class_name order by score desc) as lead_1, -- 往后一行,默认0.999
lead(score,1)over(partition by class_name order by score desc) as lead_2, -- 往后一行,默认null
lag(score,1,999.999)over(partition by class_name order by score desc) as lag_1, -- 往前一行,默认999.999
lag(score,1)over(partition by class_name order by score desc) as lag_2, -- 往前一行,默认null
first_value(score)over(partition by class_name order by score desc) as first_data,
last_value(score)over(partition by class_name order by score desc) as last_data,
last_value(score)over(partition by class_name) as last_data_2
from study_test
where class_name='数学'
测试结果: