Hive开窗函数

1. 介绍

普通聚合函数聚合的行集是组,开窗函数聚合的行集是窗口。
因此,普通聚合函数每组(Group by)只有一个返回值,而开窗函数则可以为窗口中的每行都返回一个值。

1.1 基础结构

分析函数(如:sum(), max(), row_number()...) + 窗口子句(over函数)

1.2 over函数

over(partition by [column_n] order by [column_m])先按照column_n分区,相同的column_n分为一区,每个分区根据column_m排序(默认升序)。

1.3 测试数据

建表并插入数据

-- 建表
create table student_scores(
  id int,
  studentId int,
  language int,
  math int,
  english int,
  classId string,
  departmentId string
);
-- 写入数据
insert into table student_scores values 
  (1,111,68,69,90,'class1','department1'),
  (2,112,73,80,96,'class1','department1'),
  (3,113,90,74,75,'class1','department1'),
  (4,114,89,94,93,'class1','department1'),
  (5,115,99,93,89,'class1','department1'),
  (6,121,96,74,79,'class2','department1'),
  (7,122,89,86,85,'class2','department1'),
  (8,123,70,78,61,'class2','department1'),
  (9,124,76,70,76,'class2','department1'),
  (10,211,89,93,60,'class1','department2'),
  (11,212,76,83,75,'class1','department2'),
  (12,213,71,94,90,'class1','department2'),
  (13,214,94,94,66,'class1','department2'),
  (14,215,84,82,73,'class1','department2'),
  (15,216,85,74,93,'class1','department2'),
  (16,221,77,99,61,'class2','department2'),
  (17,222,80,78,96,'class2','department2'),
  (18,223,79,74,96,'class2','department2'),
  (19,224,75,80,78,'class2','department2'),
  (20,225,82,85,63,'class2','department2');

1.4 窗口含义

select studentId,math,departmentId,classId,
-- 符合所有条件的行作为窗口,这里符合department1的有9个
count(math) over() as count1,
-- 按照classId分组的所有行作为窗口
count(math) over(partition by classId) as count2,
-- 按照classId分组、按照math排序的所有行作为窗口
count(math) over(partition by classId order by math) as count3,
-- 按照classId分组、按照math排序,当前行向前1行向后2行的行作为窗口
count(math) over(partition by classId order by math rows between 1 preceding and 2 following) as count4,
-- 按照classId分组、按照math排序,当前行向后所有行作为窗口
count(math) over(partition by classId order by math rows between current row and unbounded following) as count5
from student_scores where departmentId='department1';

结果:

studentId    math    departmentId    classId    count1    count2    count3    count4    count5
111          69      department1     class1     9         5         1         3         5
113          74      department1     class1     9         5         2         4         4
112          80      department1     class1     9         5         3         4         3
115          93      department1     class1     9         5         4         3         2
114          94      department1     class1     9         5         5         2         1
124          70      department1     class2     9         4         1         3         4
121          74      department1     class2     9         4         2         4         3
123          78      department1     class2     9         4         3         3         2
122          86      department1     class2     9         4         4         2         1

结果解析:

studentId=115,
count1为departmentId=department1的行数为9,
count2为分区class1中的行数5,
count3为分区class1中math<=93的行数4,
count4为分区class1中math值向前1行和向后2行的行数3,
count5为分区class1中当前math值到class1分区结束的行数2。

上面可以看到:如果不指定ROWS BETWEEN,默认统计窗口是从起点到当前行
关键是ROWS BETWEEN,也叫做window子句
PRECEDING:向前
FOLLOWING:向后
CURRENT ROW:当前行
UNBOUNDED:无边界,UNBOUNDED PRECEDING表示从最前面的起点开始,UNBOUNDED FOLLOWING表示到最后面的终点

开窗函数可以粗略地分为两类:聚合开窗函数和排序开窗函数。

2. 聚合开窗函数

2.1 sum函数

select studentId,math,departmentId,classId,
sum(math) over() as sum1,
sum(math) over(partition by classId) as sum2,
sum(math) over(partition by classId order by math) as sum3,
sum(math) over(partition by classId order by math rows between 1 preceding and 2 following) as sum4,
sum(math) over(partition by classId order by math rows between current row and unbounded following) as sum5
from student_scores where departmentId='department1';

-- 结果解析:类似count()函数

min(),max(),avg()都与count()类似

2.2 first_value开窗函数

作用:返回分区中的第一个值

select studentId,math,departmentId,classId,
first_value(math) over(partition by classId order by math) as first_value1,
first_value(math) over(partition by classId order by math rows between 1 preceding and 2 following) as first_value2
from student_scores where departmentId='department1';

结果:

studentid    math    departmentid    classid    first_value1    first_value2
111          69      department1     class1     69              69
113          74      department1     class1     69              69
112          80      department1     class1     69              74
115          93      department1     class1     69              80
114          94      department1     class1     69              93
124          70      department1     class2     70              70
121          74      department1     class2     70              70
123          78      department1     class2     70              74
122          86      department1     class2     70              78

结果解析:

studentId=122,
first_value1为分区class2中按照math排序的第一个值70,
first_value2为分区class2中按照math排序后当前行向前1行向后2行区间的第一个值78。

2.3 last_vlaue开窗函数

作用:返回分区最后一个值

2.4 lag开窗函数

作用:LAG(col, n, DEFAULT)用于统计窗口内向上第n行的值
col:列名
n:向上n行,[可选,默认为1]
DEFAULT:当向上n行为NULL时,取默认值;如果不指定,则为NULL

select studentId,math,departmentId,classId,
lag(math,2,60) over(partition by classId order by math) as lag1,
lag(math,2) over(partition by classId order by math) as lag2
from student_scores where departmentId='department1';

结果:

studentid    math    departmentid    classid    lag1    lag2
111          69      department1     class1     60      NULL
113          74      department1     class1     60      NULL
112          80      department1     class1     69      69
115          93      department1     class1     74      74
114          94      department1     class1     80      80
124          70      department1     class2     60      NULL
121          74      department1     class2     60      NULL
123          78      department1     class2     70      70
122          86      department1     class2     74      74

结果解析:

studentId=113,
lag1为分区class1按照math排序后当前行向上2行的值NULL,但是设置了DEFUALT,所以为60,
lag2因为没有设置DEFAULT,所以为NULL。

2.5 lead开窗函数

作用:LEAD(col, n, DEFAULT)LAG相反,用于统计窗口内向下n行的值
col:列名
n:向下n行,[可选,默认为1]
DEFAULT:当向下n行为NULL时,取默认值;如果不指定,则为NULL

2.6 cume_dist开窗函数

作用:计算某个窗口或分区中某个值的累积分布。假定升序排序,则使用以下公式确定累积分布:(小于等于当前值的行数) / (分区内总行数)

select studentId,math,departmentId,classId,
cume_dist() over(order by math) as cume_dist1,
cume_dist() over(order by math desc) as cume_dist2,
cume_dist() over(partition by classId order by math) as cume_dist3
from student_scores where departmentId='department1';

结果:

studentid    math    departmentid    classid    cume_dist1            cume_dist2            cume_dist3
111          69      department1     class1     0.1111111111111111    1.0                   0.2
113          74      department1     class1     0.4444444444444444    0.7777777777777778    0.4
112          80      department1     class1     0.6666666666666666    0.4444444444444444    0.6
115          93      department1     class1     0.8888888888888888    0.2222222222222222    0.8
114          94      department1     class1     1.0                   0.1111111111111111    1.0
124          70      department1     class2     0.2222222222222222    0.8888888888888888    0.25
121          74      department1     class2     0.4444444444444444    0.7777777777777778    0.5
123          78      department1     class2     0.5555555555555556    0.5555555555555556    0.75
122          86      department1     class2     0.7777777777777778    0.3333333333333333    1.0

结果解析:

studentId=115,
cume_dist1=小于等于93的行数8/总行数9=0.8888888888888888
cume_dist2=大于等于93的行数2/总行数9=0.2222222222222222
cume_dist3=class1分区内小于等于93的行数4/总行数5=0.8

3. 排序开窗函数

3.1 row_number开窗函数

作用:row_number() over([partition by col1] [order by col2])开窗函数是基于over子句中order by列的一个排名。在窗口或分区内从1开始排序,即使遇到col2相等时,名次依旧增加。例如:有两条记录相等,但一个是第一,一个是第二。

select id,studentId,language,math,english,classId,departmentId,
row_number() over (order by math) as row_number1
from student_scores;

结果:

id    studentid    language    math    english    classid    departmentid    row_number1
1     111          68          69      90         class1     department1     1
9     124          76          70      76         class2     department1     2
18    223          79          74      96         class2     department2     3
15    216          85          74      93         class1     department2     4
3     113          90          74      75         class1     department1     5
6     121          96          74      79         class2     department1     6
8     123          70          78      61         class2     department1     7
17    222          80          78      96         class2     department2     8
19    224          75          80      78         class2     department2     9
2     112          73          80      96         class1     department1     10
14    215          84          82      73         class1     department2     11
11    212          76          83      75         class1     department2     12
20    225          82          85      63         class2     department2     13
7     122          89          86      85         class2     department1     14
5     115          99          93      89         class1     department1     15
10    211          89          93      60         class1     department2     16
12    213          71          94      90         class1     department2     17
4     114          89          94      93         class1     department1     18
13    214          94          94      66         class1     department2     19
16    221          77          99      61         class2     department2     20

studentId=223, 216, 113, 121的math都为74,但是row_number1的值是递增的

3.2 rank开窗函数

作用:rank() over([partition by col1] [order by col2]),当遇到col2相等时,名次相同,但是下一个col2值的名次递增N(N是重复的次数)。例如:有两条记录是并列第一,下一个是第三,没有第二。

select id,studentId,language,math,english,classId,departmentId,
dense_rank() over (order by math) as rank1
from student_scores;

结果:

id    studentid    language    math    english    classid    departmentid    rank1
1     111          68          69      90         class1     department1     1
9     124          76          70      76         class2     department1     2
18    223          79          74      96         class2     department2     3
15    216          85          74      93         class1     department2     3
3     113          90          74      75         class1     department1     3
6     121          96          74      79         class2     department1     3
8     123          70          78      61         class2     department1     7
17    222          80          78      96         class2     department2     7
19    224          75          80      78         class2     department2     9
2     112          73          80      96         class1     department1     9
14    215          84          82      73         class1     department2     11
11    212          76          83      75         class1     department2     12
20    225          82          85      63         class2     department2     13
7     122          89          86      85         class2     department1     14
5     115          99          93      89         class1     department1     15
10    211          89          93      60         class1     department2     15
12    213          71          94      90         class1     department2     17
4     114          89          94      93         class1     department1     17
13    214          94          94      66         class1     department2     17
16    221          77          99      61         class2     department2     20

从第3行到第6行,重复出现了4次,所以第7行rank1为3+4=7

3.3 dense_rank开窗函数

作用:dense_rank() over([partition by col1] [order by col2])rank类似,当遇到col2相等时,名次同样相等,不同的是,下一个col2值的名次+1,而不是+N。

select id,studentId,language,math,english,classId,departmentId,
dense_rank() over (order by math) as dense_rank1
from student_scores;

结果:

id    studentid    language    math    english    classid    departmentid    dense_rank1
1     111          68          69      90         class1     department1     1
9     124          76          70      76         class2     department1     2
18    223          79          74      96         class2     department2     3
15    216          85          74      93         class1     department2     3
3     113          90          74      75         class1     department1     3
6     121          96          74      79         class2     department1     3
8     123          70          78      61         class2     department1     4
17    222          80          78      96         class2     department2     4
19    224          75          80      78         class2     department2     5
2     112          73          80      96         class1     department1     5
14    215          84          82      73         class1     department2     6
11    212          76          83      75         class1     department2     7
20    225          82          85      63         class2     department2     8
7     122          89          86      85         class2     department1     9
5     115          99          93      89         class1     department1     10
10    211          89          93      60         class1     department2     10
12    213          71          94      90         class1     department2     11
4     114          89          94      93         class1     department1     11
13    214          94          94      66         class1     department2     11
16    221          77          99      61         class2     department2     12

从第3行到第6行math值相等,dense_rank1为3,所以第7行dense_rank1为4

3.4 ntile开窗函数

作用:ntile(N) over([partition by col1] [order by col2]),将分区中的数据按照顺序划分为N片,返回当前片的值。
注1:如果切片分布不均匀,默认增加第一个切片的分布
注2:ntile不支持ROWS BETWEEN

select id,studentId,language,math,english,classId,departmentId,
ntile(2) over(partition by departmentid order by math) as ntile1
from student_scores;

结果:

id    studentid    language    math    english    classid    departmentid    ntile1
1     111          68          69      90         class1     department1     1     
9     124          76          70      76         class2     department1     1     
6     121          96          74      79         class2     department1     1     
3     113          90          74      75         class1     department1     1     
8     123          70          78      61         class2     department1     1     
2     112          73          80      96         class1     department1     2     
7     122          89          86      85         class2     department1     2     
5     115          99          93      89         class1     department1     2     
4     114          89          94      93         class1     department1     2     
15    216          85          74      93         class1     department2     1     
18    223          79          74      96         class2     department2     1     
17    222          80          78      96         class2     department2     1     
19    224          75          80      78         class2     department2     1     
14    215          84          82      73         class1     department2     1     
11    212          76          83      75         class1     department2     1     
20    225          82          85      63         class2     department2     2     
10    211          89          93      60         class1     department2     2     
12    213          71          94      90         class1     department2     2     
13    214          94          94      66         class1     department2     2     
16    221          77          99      61         class2     department2     2     

3.5 percent_rank开窗函数

作用:percent_rank() over([partition by col1] [order by col2]),计算给定行的百分比排名。分组内当前行的RANK值-1/分组内总行数-1,可以用来计算超过了百分之多少的人。

select id,studentid,language,math,english,classid,departmentid,
percent_rank() over(order by math) as rn1 
from student_scores;

结果:

id    studentid    language    math    english    classid    departmentid    rn1
1     111          68          69      90         class1     department1     0.0
9     124          76          70      76         class2     department1     0.05263157894736842
18    223          79          74      96         class2     department2     0.10526315789473684
15    216          85          74      93         class1     department2     0.10526315789473684
3     113          90          74      75         class1     department1     0.10526315789473684
6     121          96          74      79         class2     department1     0.10526315789473684
8     123          70          78      61         class2     department1     0.3157894736842105
17    222          80          78      96         class2     department2     0.3157894736842105
19    224          75          80      78         class2     department2     0.42105263157894735
2     112          73          80      96         class1     department1     0.42105263157894735
14    215          84          82      73         class1     department2     0.5263157894736842
11    212          76          83      75         class1     department2     0.5789473684210527
20    225          82          85      63         class2     department2     0.631578947368421
7     122          89          86      85         class2     department1     0.6842105263157895
5     115          99          93      89         class1     department1     0.7368421052631579
10    211          89          93      60         class1     department2     0.7368421052631579
12    213          71          94      90         class1     department2     0.8421052631578947
4     114          89          94      93         class1     department1     0.8421052631578947
13    214          94          94      66         class1     department2     0.8421052631578947
16    221          77          99      61         class2     department2     1.0

第二行,rn1的值为(2-1)/(20-1)=0.05263157894736842
©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 203,324评论 5 476
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 85,303评论 2 381
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 150,192评论 0 337
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 54,555评论 1 273
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 63,569评论 5 365
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 48,566评论 1 281
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 37,927评论 3 395
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 36,583评论 0 257
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 40,827评论 1 297
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 35,590评论 2 320
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 37,669评论 1 329
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 33,365评论 4 318
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 38,941评论 3 307
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 29,928评论 0 19
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 31,159评论 1 259
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 42,880评论 2 349
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 42,399评论 2 342

推荐阅读更多精彩内容