HQL聚集计算之窗口篇

自Hive 0.11.0之后,窗口函数,作为一组可以以扫描多个输入行作为输入来计算和生成每一个返回值的特殊函数,正式进入HQL大家族。该组函数的语法位,
Function (arg1,..., argn) OVER ([PARTITION BY <...>] [ORDER BY <....>] [<window_expression>]),其中Function (arg1,..., argn)分为以下四类,

  • 聚集函数 如 SUM, MAX, MIN
  • 排序函数 如 RANK, ROW_NUMBER
  • 分析函数 用于统计或比较,如 LEAD, LAG, FIRST_VALUE
  1. 聚集函数
    自Hive 2.2.0之后,DISTINCT 可以和聚集函数一起使用于窗口函数中。以下示例1,用于创建表和加载数据以方便演示功能
> CREATE TABLE IF NOT EXISTS employee_contract (      
> name string,      
> dept_num int,      
> employee_id int,      
> salary int,      
> type string,      
> start_date date      
> )      
> ROW FORMAT DELIMITED      
> FIELDS TERMINATED BY '|'      
> STORED as TEXTFILE;      
No rows affected (0.282 seconds)      
> LOAD DATD INPATH '/tmp/hivedemo/data/employee_contract.txt'      
> OVERWRITE INTO TABLE employee_contract;      
No rows affected (0.48 seconds)

以下示例2,演示了一般聚集函数的窗口函数用法

> SELECT       
> name,       
> dept_num as deptno,       
> salary,      
> count(*) OVER (PARTITION BY dept_num) as cnt,      
> count(distinct dept_num) OVER (PARTITION BY dept_num) as dcnt,      
> sum(salary) OVER(PARTITION BY dept_num ORDER BY dept_num) as       sum1,      
> sum(salary) OVER(ORDER BY dept_num) as sum2,      
> sum(salary) OVER(ORDER BY dept_num, name) as sum3      
> FROM employee_contract      
> ORDER BY deptno, name;      
+---------+--------+--------+-----+-----+-------+-------+-------+      
| name    | deptno | salary | cnt | dcnt| sum1  | sum2  | sum3  |      
+---------+--------+--------+-----+-----+-------+-------+-------+      
| Lucy    | 1000   | 5500   | 5   | 1   | 24900 | 24900 | 5500  |      
| Michael | 1000   | 5000   | 5   | 1   | 24900 | 24900 | 10500 |      
| Steven  | 1000   | 6400   | 5   | 1   | 24900 | 24900 | 16900 |      
| Wendy   | 1000   | 4000   | 5   | 1   | 24900 | 24900 | 20900 |      
| Will    | 1000   | 4000   | 5   | 1   | 24900 | 24900 | 24900 |     
| Jess    | 1001   | 6000   | 3   | 1   | 17400 | 42300 | 30900 |      
| Lily    | 1001   | 5000   | 3   | 1   | 17400 | 42300 | 35900 |      
| Mike    | 1001   | 6400   | 3   | 1   | 17400 | 42300 | 42300 |      
| Richard | 1002   | 8000   | 3   | 1   | 20500 | 62800 | 50300 |      
| Wei     | 1002   | 7000   | 3   | 1   | 20500 | 62800 | 57300 |      
| Yun     | 1002   | 5500   | 3   | 1   | 20500 | 62800 | 62800 |      
+---------+--------+--------+-----+-----+-------+-------+-------+      
11 rows selected (111.856 seconds)
  1. 排序函数,
  • ROW_NUMBER 根据分区和排序,从1开始给每一行指定唯一的序号。
  • RANK 在一个组中,对组内的记录条目排序
  • DENSE_RANK 和RANK相似,但是排名是连续无断开和跳跃。
  • PERCENT_RANK 公式为(current rank - 1)/(total number of rows - 1)
  • NTILE 把一个排序过的数据集分成若干部分,给每一条记录分配至这些部分的序号。
    以下示例演示了以上函数的使用,
> SELECT 
> name, 
> dept_num as deptno, 
> salary,
> row_number() OVER () as rnum, -- sequence in orginal table
> rank() OVER (PARTITION BY dept_num ORDER BY salary) as rk, 
> dense_rank() OVER (PARTITION BY dept_num ORDER BY salary) as drk,
> percent_rank() OVER(PARTITION BY dept_num ORDER BY salary) as prk,
> ntile(4) OVER(PARTITION BY dept_num ORDER BY salary) as ntile
> FROM employee_contract
> ORDER BY deptno, name;
+---------+--------+--------+------+----+-----+------+-------+
| name    | deptno | salary | rnum | rk | drk | prk  | ntile |
+---------+--------+--------+------+----+-----+------+-------+
| Lucy    | 1000   | 5500   | 7    | 4  | 3   | 0.75 | 3     |
| Michael | 1000   | 5000   | 11   | 3  | 2   | 0.5  | 2     |
| Steven  | 1000   | 6400   | 8    | 5  | 4   | 1.0  | 4     |
| Wendy   | 1000   | 4000   | 9    | 1  | 1   | 0.0  | 1     |
| Will    | 1000   | 4000   | 10   | 1  | 1   | 0.0  | 1     |
| Jess    | 1001   | 6000   | 5    | 2  | 2   | 0.5  | 2     |
| Lily    | 1001   | 5000   | 6    | 1  | 1   | 0.0  | 1     |
| Mike    | 1001   | 6400   | 4    | 3  | 3   | 1.0  | 3     |
| Richard | 1002   | 8000   | 1    | 3  | 3   | 1.0  | 3     |
| Wei     | 1002   | 7000   | 3    | 2  | 2   | 0.5  | 2     |
| Yun     | 1002   | 5500   | 2    | 1  | 1   | 0.0  | 1     |
+---------+--------+--------+------+----+-----+------+-------+
11 rows selected (80.052 seconds)

自Hive 2.1.0之后,聚集函数可以用于OVER子句中,请看以下示例,

> SELECT
> dept_num,
> rank() OVER (PARTITION BY dept_num ORDER BY sum(salary)) as rk
> FROM employee_contract
> GROUP BY dept_num;
+----------+----+
| dept_num | rk |
+----------+----+
| 1000     | 1  |
| 1001     | 1  |
| 1002     | 1  | 
+----------+----+
3 rows selected (54.43 seconds)
  1. 分析函数
  • cume_dist (number of rows ≤ current row)/(total number of rows)
  • lead 返回后续记录
  • lag 返回之前记录
  • first_value 返回排序数据集的第一个值
  • last_value 返回排序数据集的最后一个值
    以下示例,
> SELECT 
> name,
> dept_num as deptno,
> salary,
> cume_dist() OVER (PARTITION BY dept_num ORDER BY salary) as cume,
> lead(salary, 2) OVER (PARTITION BY dept_num ORDER BY salary) as lead,
> lag(salary, 2, 0) OVER (PARTITION BY dept_num ORDER BY salary) as lag,
> first_value(salary) OVER (PARTITION BY dept_num ORDER BY salary) as fval,
> last_value(salary) OVER (PARTITION BY dept_num ORDER BY salary) as lval,
> last_value(salary) OVER (PARTITION BY dept_num ORDER BY salary RANGE BETWEEN UNBOUNDED 
> PRECEDING AND UNBOUNDED FOLLOWING) as lval2
> FROM employee_contract 
> ORDER BY deptno, salary;
+--------+------+--------+------+------+-----+------+------+-------+
| name   |deptno| salary | cume | lead | lag | fval |lvalue|lvalue2|
+--------+------+--------+------+------+-----+------+------+-------+
| Will   | 1000 | 4000   | 0.4  | 5500 | 0   | 4000 | 4000 | 6400  |
| Wendy  | 1000 | 4000   | 0.4  | 5000 | 0   | 4000 | 4000 | 6400  |
| Michael| 1000 | 5000   | 0.6  | 6400 | 4000| 4000 | 5000 | 6400  |
| Lucy   | 1000 | 5500   | 0.8  | NULL | 4000| 4000 | 5500 | 6400  |
| Steven | 1000 | 6400   | 1.0  | NULL | 5000| 4000 | 6400 | 6400  |
| Lily   | 1001 | 5000   | 0.33 | 6400 | 0   | 5000 | 5000 | 6400  |
| Jess   | 1001 | 6000   | 0.67 | NULL | 0   | 5000 | 6000 | 6400  |
| Mike   | 1001 | 6400   | 1.0  | NULL | 5000| 5000 | 6400 | 6400  |
| Yun    | 1002 | 5500   | 0.33 | 8000 | 0   | 5500 | 5500 | 8000  |
| Wei    | 1002 | 7000   | 0.67 | NULL | 0   | 5500 | 7000 | 8000  |
| Richard| 1002 | 8000   | 1.0  | NULL | 5500| 5500 | 8000 | 8000  |
+--------+------+--------+------+------+-----+------+------+-------+
11 rows selected (55.203 seconds)

其中last_value缺省的窗口为RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
[<window_expression>] 用来进一步子分区结果并应用窗口函数,有两类窗口类型,行类型和范围类型。另外rank(...), ntile(...), dense_rank(...), cume_dist(...), percent_rank(...), lead(...), lag(...), and row_number(...) 暂不支持这一个功能。

window_expression definition

行类型示例1

-- Preceding and Following
> SELECT 
> name, dept_num as dno, salary as sal,
> max(salary) OVER (PARTITION BY dept_num ORDER BY name
> ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) win1,
> max(salary) OVER (PARTITION BY dept_num ORDER BY name 
> ROWS BETWEEN 2 PRECEDING AND UNBOUNDED FOLLOWING) win2,
> max(salary) OVER (PARTITION BY dept_num ORDER BY name 
> ROWS BETWEEN 1 PRECEDING AND 2 FOLLOWING) win3,
> max(salary) OVER (PARTITION BY dept_num ORDER BY name 
> ROWS BETWEEN 2 PRECEDING AND 1 PRECEDING) win4,
> max(salary) OVER (PARTITION BY dept_num ORDER BY name 
> ROWS BETWEEN 1 FOLLOWING AND 2 FOLLOWING) win5,
> max(salary) OVER (PARTITION BY dept_num ORDER BY name 
> ROWS 2 PRECEDING) win6, -- FOLLOWING does not work in this way
> max(salary) OVER (PARTITION BY dept_num ORDER BY name
> ROWS UNBOUNDED PRECEDING) win7
> FROM employee_contract
> ORDER BY dno, name;
+---------+------+------+------+------+------+------+------+------+------+
| name    | dno  | sal  | win1 | win2 | win3 | win4 | win5 | win6 | win7 |
+---------+------+------+------+------+------+------+------+------+------+
| Lucy    | 1000 | 5500 | 5500 | 6400 | 6400 | NULL | 6400 | 5500 | 5500 |
| Michael | 1000 | 5000 | 5500 | 6400 | 6400 | 5500 | 6400 | 5500 | 5500 |
| Steven  | 1000 | 6400 | 6400 | 6400 | 6400 | 5500 | 4000 | 6400 | 6400 |
| Wendy   | 1000 | 4000 | 6400 | 6400 | 6400 | 6400 | 4000 | 6400 | 6400 |
| Will    | 1000 | 4000 | 6400 | 6400 | 4000 | 6400 | NULL | 6400 | 6400 |
| Jess    | 1001 | 6000 | 6000 | 6400 | 6400 | NULL | 6400 | 6000 | 6000 |
| Lily    | 1001 | 5000 | 6000 | 6400 | 6400 | 6000 | 6400 | 6000 | 6000 |
| Mike    | 1001 | 6400 | 6400 | 6400 | 6400 | 6000 | NULL | 6400 | 6400 |
| Richard | 1002 | 8000 | 8000 | 8000 | 8000 | NULL | 7000 | 8000 | 8000 |
| Wei     | 1002 | 7000 | 8000 | 8000 | 8000 | 8000 | 5500 | 8000 | 8000 |
| Yun     | 1002 | 5500 | 8000 | 8000 | 7000 | 8000 | NULL | 8000 | 8000 |
+---------+------+------+------+------+------+------+------+------+------+
11 rows selected (55.885 seconds)-- Current and Unbounded
> SELECT 
> name, dept_num as dno, salary as sal,
> max(salary) OVER (PARTITION BY dept_num ORDER BY name 
> ROWS BETWEEN CURRENT ROW AND CURRENT ROW) win8,
> max(salary) OVER (PARTITION BY dept_num ORDER BY name 
> ROWS BETWEEN CURRENT ROW AND 1 FOLLOWING) win9,
> max(salary) OVER (PARTITION BY dept_num ORDER BY name 
> ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) win10,
> max(salary) OVER (PARTITION BY dept_num ORDER BY name 
> ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) win11,
> max(salary) OVER (PARTITION BY dept_num ORDER BY name 
> ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) win12,
> max(salary) OVER (PARTITION BY dept_num ORDER BY name 
> ROWS BETWEEN UNBOUNDED PRECEDING AND 1 FOLLOWING) win13,
> max(salary) OVER (PARTITION BY dept_num ORDER BY name 
> ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) win14
> FROM employee_contract> ORDER BY dno, name;
+-------+----+------+------+------+-------+-------+-------+-------+-------+
|name   |dno | sal  | win8 | win9 | win10 | win11 | win12 | win13 | win14 |
+-------+----+------+------+------+-------+-------+-------+-------+-------+
|Lucy   |1000| 5500 | 5500 | 5500 | 6400  | NULL  | 5500  | 5500  | 6400  |
|Michael|1000| 5000 | 5000 | 6400 | 6400  | 5500  | 5500  | 6400  | 6400  |
|Steven |1000| 6400 | 6400 | 6400 | 6400  | 5500  | 6400  | 6400  | 6400  |
|Wendy  |1000| 4000 | 4000 | 4000 | 4000  | 6400  | 6400  | 6400  | 6400  |
|Will   |1000| 4000 | 4000 | 4000 | 4000  | 6400  | 6400  | 6400  | 6400  |
|Jess   |1001| 6000 | 6000 | 6000 | 6400  | NULL  | 6000  | 6000  | 6400  |
|Lily   |1001| 5000 | 5000 | 6400 | 6400  | 6000  | 6000  | 6400  | 6400  |
|Mike   |1001| 6400 | 6400 | 6400 | 6400  | 6000  | 6400  | 6400  | 6400  |
|Richard|1002| 8000 | 8000 | 8000 | 8000  | NULL  | 8000  | 8000  | 8000  |
|Wei    |1002| 7000 | 7000 | 7000 | 7000  | 8000  | 8000  | 8000  | 8000  |
|Yun    |1002| 5500 | 5500 | 5500 | 5500  | 8000  | 8000  | 8000  | 8000  |
+-------+----+------+------+------+-------+-------+-------+-------+-------+
11 rows selected (53.754 seconds)

行类型示例2

> SELECT 
> name, dept_num, salary,
> max(salary) OVER w1 as win1,
> max(salary) OVER w2 as win2,
> max(salary) OVER w3 as win3
> FROM employee_contract
> WINDOW w1 as (
> PARTITION BY dept_num ORDER BY name
> ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
> ),
> w2 as w3,
> w3 as (
> PARTITION BY dept_num ORDER BY name
> ROWS BETWEEN 1 PRECEDING AND 2 FOLLOWING
> );
+---------+----------+--------+------+------+------+
| name    | dept_num | salary | win1 | win2 | win3 |
+---------+----------+--------+------+------+------+
| Lucy    | 1000     | 5500   | 5500 | 6400 | 6400 |
| Michael | 1000     | 5000   | 5500 | 6400 | 6400 |
| Steven  | 1000     | 6400   | 6400 | 6400 | 6400 |
| Wendy   | 1000     | 4000   | 6400 | 6400 | 6400 |
| Will    | 1000     | 4000   | 6400 | 4000 | 4000 |
| Jess    | 1001     | 6000   | 6000 | 6400 | 6400 |
| Lily    | 1001     | 5000   | 6000 | 6400 | 6400 |
| Mike    | 1001     | 6400   | 6400 | 6400 | 6400 |
| Richard | 1002     | 8000   | 8000 | 8000 | 8000 |
| Wei     | 1002     | 7000   | 8000 | 8000 | 8000 |
| Yun     | 1002     | 5500   | 8000 | 7000 | 7000 |
+---------+----------+--------+------+------+------+
11 rows selected (57.204 seconds)

范围类型示例1

> SELECT
> dept_num, start_date, name, salary,
> max(salary) OVER (PARTITION BY dept_num ORDER BY salary
> RANGE BETWEEN 500 PRECEDING AND 1000 FOLLOWING) win1,
> max(salary) OVER (PARTITION BY dept_num ORDER BY salary
> RANGE BETWEEN 500 PRECEDING AND CURRENT ROW) win2
> FROM employee_contract
> order by dept_num, start_date;
+----------+------------+---------+--------+------+------+
| dept_num | start_date | name    | salary | win1 | win2 |
+----------+------------+---------+--------+------+------+
| 1000     | 2010-01-03 | Lucy    | 5500   | 6400 | 5500 |
| 1000     | 2012-11-03 | Steven  | 6400   | 6400 | 6400 |
| 1000     | 2013-10-02 | Will    | 4000   | 5000 | 4000 |
| 1000     | 2014-01-29 | Michael | 5000   | 5500 | 5000 |
| 1000     | 2014-10-02 | Wendy   | 4000   | 5000 | 4000 |
| 1001     | 2013-11-03 | Mike    | 6400   | 6400 | 6400 |
| 1001     | 2014-11-29 | Lily    | 5000   | 6000 | 5000 |
| 1001     | 2014-12-02 | Jess    | 6000   | 6400 | 6000 |
| 1002     | 2010-04-03 | Wei     | 7000   | 8000 | 7000 |
| 1002     | 2013-09-01 | Richard | 8000   | 8000 | 8000 |
| 1002     | 2014-01-29 | Yun     | 5500   | 5500 | 5500 |
+----------+------------+---------+--------+------+------+
11 rows selected (60.784 seconds)
最后编辑于
©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 205,132评论 6 478
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 87,802评论 2 381
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 151,566评论 0 338
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 54,858评论 1 277
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 63,867评论 5 368
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 48,695评论 1 282
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 38,064评论 3 399
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 36,705评论 0 258
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 42,915评论 1 300
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 35,677评论 2 323
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 37,796评论 1 333
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 33,432评论 4 322
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 39,041评论 3 307
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 29,992评论 0 19
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 31,223评论 1 260
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 45,185评论 2 352
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 42,535评论 2 343

推荐阅读更多精彩内容

  • 分析函数,也称为窗口函数,通常被认为仅对数据仓库SQL有用。使用分析函数的查询,基于对数据行的分组来计算总量值。与...
    猫猫_tomluo阅读 3,305评论 3 18
  • pyspark.sql模块 模块上下文 Spark SQL和DataFrames的重要类: pyspark.sql...
    mpro阅读 9,446评论 0 13
  • 这一周主要学习了 Hive 的一些基础知识,学习了多个 Hive 窗口函数,虽然感觉这些窗口函数没有实际的应用...
    大石兄阅读 2,722评论 2 8
  • Hive 数据类型 基本数据类型 注:Hive会隐式地将类型转换为两个整型类型中值较大的那个类型; 集合数据类型 ...
    haidaozheng阅读 2,255评论 0 1
  • 正月十五看烟花是一件让我欢喜让我忧,让我笑来让我愁的事儿。 吃过晚饭妈妈微笑着说:“走,去看烟花啦。” “不去了吧...
    文字在笔下静静流淌阅读 172评论 0 0