I. Use cases:
- Sorting within partitions
- Dynamic GROUP BY
- Top N
- Cumulative calculations
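II. Function overview
1. Window functions:
first_value: the first value in the partition, after sorting, within the frame that ends at the current row;
last_value: the last value in the partition, after sorting, within the frame that ends at the current row;
lead(col, n, default): returns the value of col from the row n rows after the current row in the window. The first argument is the column name, the second is the offset n (optional, defaults to 1), and the third is the default value returned when there is no row n rows after the current row (null if not specified);
lag(col, n, default): the opposite of lead; returns the value of col from the row n rows before the current row in the window. The first argument is the column name, the second is the offset n (optional, defaults to 1), and the third is the default value returned when there is no row n rows before the current row (null if not specified).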
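2. The OVER clause
1) OVER with the standard aggregate functions count, sum, min, max, and avg.
2) OVER with a partition by clause on one or more partitioning columns of any primitive data type.
3) OVER with partition by and order by clauses on one or more partitioning or ordering columns.
4) OVER with a window specification. The window specification supports the following formats:
(ROWS | RANGE) BETWEEN (UNBOUNDED | [num]) PRECEDING AND ([num] PRECEDING | CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
(ROWS | RANGE) BETWEEN CURRENT ROW AND (CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
(ROWS | RANGE) BETWEEN [num] FOLLOWING AND (UNBOUNDED | [num]) FOLLOWING
When ORDER BY is specified without a window clause, the window specification defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.
When both ORDER BY and the window clause are missing, the window specification defaults to ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.
The OVER clause supports the following functions, but they cannot be combined with a window specification:
Ranking functions: Rank, NTile, DenseRank, CumeDist, PercentRank.
Lead and Lag functions.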
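The two defaults above can be checked with a small sketch. It is written in Python against an in-memory SQLite database (3.25 or later) as a stand-in for Hive, which is an assumption of this example, as are the four sample rows; only the table and column names follow the queries in this article. The tie on sales = 2 is what separates the default RANGE frame from an explicit ROWS frame.

```python
import sqlite3

# Assumption: SQLite (3.25+) stands in for Hive; the sample rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE order_detail (user_id TEXT, user_type TEXT, sales INTEGER)")
conn.executemany(
    "INSERT INTO order_detail VALUES (?, ?, ?)",
    [("u1", "new", 1), ("u2", "new", 2), ("u3", "new", 2), ("u4", "new", 5)],
)

rows = conn.execute("""
    SELECT user_id,
           sales,
           -- ORDER BY without a frame: RANGE up to the current row, ties included
           SUM(sales) OVER (PARTITION BY user_type ORDER BY sales) AS range_default,
           -- explicit ROWS frame: accumulates strictly row by row
           SUM(sales) OVER (PARTITION BY user_type ORDER BY sales, user_id
                            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS rows_frame,
           -- neither ORDER BY nor a frame: the whole partition
           SUM(sales) OVER (PARTITION BY user_type) AS whole_partition
    FROM order_detail
    ORDER BY sales, user_id
""").fetchall()

for row in rows:
    print(row)
```

For the two tied rows, range_default already includes both of them (5 and 5), while rows_frame grows one row at a time (3, then 5). The extra user_id in the ROWS ordering is only there to make the tie order deterministic.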
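3. Analytic functions
row_number(): generates a sequence number for each row within the partition, starting at 1 and following the sort order. Typical uses: ranking rows within a group by pv in descending order, fetching the top-1 record of each group, or fetching the first record of a session.
rank(): the rank of the row within its partition; tied rows share a rank and leave gaps in the numbering.
dense_rank(): the rank of the row within its partition; tied rows share a rank and leave no gaps in the numbering.
cume_dist: (number of rows with a value less than or equal to the current value) / (total rows in the partition). For example, the share of all employees whose salary is less than or equal to the current salary.
percent_rank: (rank of the current row within the partition - 1) / (total rows in the partition - 1).
ntile(n): splits the ordered rows of the partition into n buckets and returns the bucket number of the current row. If the rows do not divide evenly, the extra rows go to the earlier buckets, starting with the first. ntile does not support rows between; for example, this is not allowed: ntile(2) over(partition by cookieid order by createtime rows between 3 preceding and current row).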
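--- Distinct is supported in Hive 2.1.0 and later
COUNT(DISTINCT a) OVER (PARTITION BY c)
--- Hive 2.2.0 adds distinct support together with ORDER BY and a window specification
COUNT(DISTINCT a) OVER (PARTITION BY c ORDER BY d ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING)
--- Aggregate functions inside the OVER clause are supported in Hive 2.1.0 and later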
SELECT rank() OVER (ORDER BY sum(b))
FROM t
GROUP BY a
;
4. Queries on the test dataset
-- COUNT, SUM, MIN, MAX, AVG
select
user_id,
user_type,
sales,
-- Default frame: from the start of the partition through the current row
sum(sales) OVER(PARTITION BY user_type ORDER BY sales asc) AS sales_1,
-- Start through the current row with an explicit ROWS frame; differs from sales_1 when sales has ties, because the default frame is RANGE
sum(sales) OVER(PARTITION BY user_type ORDER BY sales asc ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS sales_2,
-- Current row + the 3 preceding rows
sum(sales) OVER(PARTITION BY user_type ORDER BY sales asc ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS sales_3,
-- Current row + the 3 preceding rows + the 1 following row
sum(sales) OVER(PARTITION BY user_type ORDER BY sales asc ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS sales_4,
-- Current row + all following rows
sum(sales) OVER(PARTITION BY user_type ORDER BY sales asc ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS sales_5,
-- All rows in the partition
SUM(sales) OVER(PARTITION BY user_type) AS sales_6
from
order_detail
order by
user_type,
sales,
user_id
;
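-- Notes:
-- The output depends on order by, which defaults to ascending;
-- If rows between is omitted but order by is present, the frame runs from the start of the partition through the current row (as a RANGE frame, so rows tied with the current row are included);
-- If order by is omitted, all values in the partition are summed;
The key is to understand ROWS BETWEEN, also called the WINDOW clause:
PRECEDING: rows before the current row
FOLLOWING: rows after the current row
CURRENT ROW: the current row
UNBOUNDED: no bound (the start or the end of the partition)
UNBOUNDED PRECEDING: from the start of the partition
UNBOUNDED FOLLOWING: through the end of the partition
COUNT, AVG, MIN, and MAX are used the same way as SUM.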
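To make the frame keywords concrete, here is a minimal sketch of two ROWS BETWEEN frames. It runs in Python against an in-memory SQLite database (3.25 or later) as a stand-in for Hive, which is an assumption of this example; the five sample rows are invented so that the sums are easy to verify by hand.

```python
import sqlite3

# Assumption: SQLite (3.25+) stands in for Hive; sales values 1..5 are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE order_detail (user_id TEXT, sales INTEGER)")
conn.executemany(
    "INSERT INTO order_detail VALUES (?, ?)",
    [("u1", 1), ("u2", 2), ("u3", 3), ("u4", 4), ("u5", 5)],
)

rows = conn.execute("""
    SELECT sales,
           -- previous row + current row + next row
           SUM(sales) OVER (ORDER BY sales
                            ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS around,
           -- current row through the end of the partition
           SUM(sales) OVER (ORDER BY sales
                            ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS to_end
    FROM order_detail
    ORDER BY sales
""").fetchall()

for row in rows:
    print(row)
```

At the edges the frame simply shrinks: the first row has no preceding row, so around is 1 + 2 = 3, and the last row has no following row, so around is 4 + 5 = 9.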
-- first_value and last_value
select
user_id,
user_type,
ROW_NUMBER() OVER(PARTITION BY user_type ORDER BY sales) AS row_num,
first_value(user_id) over (partition by user_type order by sales desc) as max_sales_user,
first_value(user_id) over (partition by user_type order by sales asc) as min_sales_user,
last_value(user_id) over (partition by user_type order by sales desc) as curr_last_min_user,
last_value(user_id) over (partition by user_type order by sales asc) as curr_last_max_user
from
order_detail;
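One behavior of last_value is easy to miss: with order by and no explicit frame, the frame ends at the current row, so last_value returns the current row's own value (or a tied peer's) and not the last value of the whole partition. The sketch below illustrates this in Python with an in-memory SQLite database (3.25 or later) standing in for Hive; the engine and the three sample rows are assumptions of this example.

```python
import sqlite3

# Assumption: SQLite (3.25+) stands in for Hive; the sample rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE order_detail (user_id TEXT, sales INTEGER)")
conn.executemany(
    "INSERT INTO order_detail VALUES (?, ?)",
    [("u1", 10), ("u2", 20), ("u3", 30)],
)

rows = conn.execute("""
    SELECT user_id,
           sales,
           FIRST_VALUE(user_id) OVER (ORDER BY sales) AS first_so_far,
           -- default frame stops at the current row
           LAST_VALUE(user_id) OVER (ORDER BY sales) AS last_so_far,
           -- widen the frame to reach the real last row
           LAST_VALUE(user_id) OVER (ORDER BY sales
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS last_overall
    FROM order_detail
    ORDER BY sales
""").fetchall()

for row in rows:
    print(row)
```

Here last_so_far always equals the row's own user_id, while last_overall is u3 on every row. Sorting in the opposite direction and using first_value is another common way to get the same result.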
-- lead and lag
select
user_id,device_id,
lead(device_id) over (order by sales) as default_after_one_line,
lag(device_id) over (order by sales) as default_before_one_line,
lead(device_id,2) over (order by sales) as after_two_line,
lag(device_id,2,'abc') over (order by sales) as before_two_line
from
order_detail;
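The lead and lag query above can be tried on a tiny table. This is a sketch in Python with an in-memory SQLite database (3.25 or later) as a stand-in for Hive; the engine choice and the four sample rows are assumptions, while the column aliases mirror the query above.

```python
import sqlite3

# Assumption: SQLite (3.25+) stands in for Hive; the sample rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE order_detail (user_id TEXT, device_id TEXT, sales INTEGER)")
conn.executemany(
    "INSERT INTO order_detail VALUES (?, ?, ?)",
    [("u1", "d1", 10), ("u2", "d2", 20), ("u3", "d3", 30), ("u4", "d4", 40)],
)

rows = conn.execute("""
    SELECT user_id,
           device_id,
           LEAD(device_id) OVER (ORDER BY sales)          AS default_after_one_line,
           LAG(device_id) OVER (ORDER BY sales)           AS default_before_one_line,
           LEAD(device_id, 2) OVER (ORDER BY sales)       AS after_two_line,
           LAG(device_id, 2, 'abc') OVER (ORDER BY sales) AS before_two_line
    FROM order_detail
    ORDER BY sales
""").fetchall()

for row in rows:
    print(row)
```

Rows whose offset falls outside the window get NULL (None in Python), except before_two_line, where the third argument 'abc' is returned for the first two rows.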
-- RANK, ROW_NUMBER, DENSE_RANK
select
user_id,user_type,sales,
RANK() over (partition by user_type order by sales desc) as r,
ROW_NUMBER() over (partition by user_type order by sales desc) as rn,
DENSE_RANK() over (partition by user_type order by sales desc) as dr
from
order_detail;
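The difference between the three ranking functions only shows up when there are ties. The following sketch uses Python with an in-memory SQLite database (3.25 or later) as a stand-in for Hive; the engine and the sample rows, including the tie on sales = 40, are assumptions of this example.

```python
import sqlite3

# Assumption: SQLite (3.25+) stands in for Hive; the sample rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE order_detail (user_id TEXT, user_type TEXT, sales INTEGER)")
conn.executemany(
    "INSERT INTO order_detail VALUES (?, ?, ?)",
    [("u1", "new", 50), ("u2", "new", 40), ("u3", "new", 40), ("u4", "new", 30)],
)

rows = conn.execute("""
    SELECT sales,
           RANK()       OVER (PARTITION BY user_type ORDER BY sales DESC) AS r,
           ROW_NUMBER() OVER (PARTITION BY user_type ORDER BY sales DESC) AS rn,
           DENSE_RANK() OVER (PARTITION BY user_type ORDER BY sales DESC) AS dr
    FROM order_detail
    ORDER BY rn
""").fetchall()

for row in rows:
    print(row)
```

The two rows with sales = 40 share rank 2 under both RANK and DENSE_RANK; the next row is 4 under RANK (a gap) and 3 under DENSE_RANK (no gap). ROW_NUMBER stays unique, but which of the tied rows gets 2 and which gets 3 is not defined unless the order by breaks the tie.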
-- NTILE
select
user_type,sales,
-- Split the rows of each partition into 2 buckets
NTILE(2) OVER(PARTITION BY user_type ORDER BY sales) AS nt2,
-- Split the rows of each partition into 3 buckets
NTILE(3) OVER(PARTITION BY user_type ORDER BY sales) AS nt3,
-- Split the rows of each partition into 4 buckets
NTILE(4) OVER(PARTITION BY user_type ORDER BY sales) AS nt4,
-- Split all rows into 4 buckets
NTILE(4) OVER(ORDER BY sales) AS all_nt4
from
order_detail
order by
user_type,
sales
;
-- Get the user IDs in the top 20% by sales
select
user_id
from
(
select
user_id,
NTILE(5) OVER(ORDER BY sales desc) AS nt
from
order_detail
)A
where nt=1;
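How NTILE distributes rows that do not divide evenly, and how the top-20% pattern above behaves, can be seen on five rows. This is a sketch in Python with an in-memory SQLite database (3.25 or later) standing in for Hive; the engine and the sample data are assumptions of this example.

```python
import sqlite3

# Assumption: SQLite (3.25+) stands in for Hive; the sample rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE order_detail (user_id TEXT, sales INTEGER)")
conn.executemany(
    "INSERT INTO order_detail VALUES (?, ?)",
    [("u1", 10), ("u2", 20), ("u3", 30), ("u4", 40), ("u5", 50)],
)

# Five rows into 2 and 3 buckets: the earlier buckets take the extra rows.
buckets = conn.execute("""
    SELECT sales,
           NTILE(2) OVER (ORDER BY sales) AS nt2,
           NTILE(3) OVER (ORDER BY sales) AS nt3
    FROM order_detail
    ORDER BY sales
""").fetchall()

# Top 20% by sales: the first of 5 buckets in descending order.
top_users = [
    row[0]
    for row in conn.execute("""
        SELECT user_id
        FROM (
            SELECT user_id,
                   NTILE(5) OVER (ORDER BY sales DESC) AS nt
            FROM order_detail
        ) a
        WHERE nt = 1
    """)
]

print(buckets)
print(top_users)
```

With five rows, NTILE(2) yields bucket sizes 3 and 2, and NTILE(3) yields 2, 2, and 1. The top-20% query returns only u5, the user with the highest sales.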
-- CUME_DIST, PERCENT_RANK
select
user_id,user_type,sales,
-- No partition by: all rows form a single group
CUME_DIST() OVER(ORDER BY sales) AS cd1,
-- Partitioned by user_type
CUME_DIST() OVER(PARTITION BY user_type ORDER BY sales) AS cd2
from
order_detail;
select
user_type,sales,
-- Total number of rows in the partition
SUM(1) OVER(PARTITION BY user_type) AS s,
-- RANK value over all rows
RANK() OVER(ORDER BY sales) AS r,
PERCENT_RANK() OVER(ORDER BY sales) AS pr,
-- Within each partition
PERCENT_RANK() OVER(PARTITION BY user_type ORDER BY sales) AS prg
from
order_detail;
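The formulas for cume_dist and percent_rank given earlier can be checked against the engine's output. The sketch below does this in Python with an in-memory SQLite database (3.25 or later) as a stand-in for Hive; the engine, the sample sales values, and the helper names expected_cd and expected_pr are assumptions of this example.

```python
import sqlite3

# Assumption: SQLite (3.25+) stands in for Hive; the sales values are made up.
sales = [10, 20, 20, 30, 40]
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE order_detail (sales INTEGER)")
conn.executemany("INSERT INTO order_detail VALUES (?)", [(s,) for s in sales])

rows = conn.execute("""
    SELECT sales,
           CUME_DIST()    OVER (ORDER BY sales) AS cd,
           PERCENT_RANK() OVER (ORDER BY sales) AS pr
    FROM order_detail
    ORDER BY sales
""").fetchall()

n = len(sales)
for s, cd, pr in rows:
    # cume_dist: rows with a value <= the current value / total rows
    expected_cd = sum(1 for x in sales if x <= s) / n
    # percent_rank: (rank - 1) / (total rows - 1), where rank - 1 is the
    # number of rows with a value strictly below the current value
    expected_pr = sum(1 for x in sales if x < s) / (n - 1)
    print(s, cd, expected_cd, pr, expected_pr)
```

The two tied rows (sales = 20) get the same cume_dist, 3/5, and the same percent_rank, 1/4, because both functions are defined in terms of rank and value, not row position.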