自Hive 0.11.0之后,窗口函数,作为一组可以以扫描多个输入行作为输入来计算和生成每一个返回值的特殊函数,正式进入HQL大家族。该组函数的语法位,
Function (arg1,..., argn) OVER ([PARTITION BY <...>] [ORDER BY <....>] [<window_expression>]),其中Function (arg1,..., argn)分为以下四类,
- 聚集函数 如 SUM, MAX, MIN
- 排序函数 如 RANK, ROW_NUMBER
- 分析函数 用于统计或比较,如 LEAD, LAG, FIRST_VALUE
- 聚集函数
自Hive 2.2.0之后,DISTINCT 可以和聚集函数一起使用于窗口函数中。以下示例1,用于创建表和加载数据以方便演示功能
> CREATE TABLE IF NOT EXISTS employee_contract (
> name string,
> dept_num int,
> employee_id int,
> salary int,
> type string,
> start_date date
> )
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY '|'
> STORED as TEXTFILE;
No rows affected (0.282 seconds)
> LOAD DATD INPATH '/tmp/hivedemo/data/employee_contract.txt'
> OVERWRITE INTO TABLE employee_contract;
No rows affected (0.48 seconds)
以下示例2,演示了一般聚集函数的窗口函数用法
> SELECT
> name,
> dept_num as deptno,
> salary,
> count(*) OVER (PARTITION BY dept_num) as cnt,
> count(distinct dept_num) OVER (PARTITION BY dept_num) as dcnt,
> sum(salary) OVER(PARTITION BY dept_num ORDER BY dept_num) as sum1,
> sum(salary) OVER(ORDER BY dept_num) as sum2,
> sum(salary) OVER(ORDER BY dept_num, name) as sum3
> FROM employee_contract
> ORDER BY deptno, name;
+---------+--------+--------+-----+-----+-------+-------+-------+
| name | deptno | salary | cnt | dcnt| sum1 | sum2 | sum3 |
+---------+--------+--------+-----+-----+-------+-------+-------+
| Lucy | 1000 | 5500 | 5 | 1 | 24900 | 24900 | 5500 |
| Michael | 1000 | 5000 | 5 | 1 | 24900 | 24900 | 10500 |
| Steven | 1000 | 6400 | 5 | 1 | 24900 | 24900 | 16900 |
| Wendy | 1000 | 4000 | 5 | 1 | 24900 | 24900 | 20900 |
| Will | 1000 | 4000 | 5 | 1 | 24900 | 24900 | 24900 |
| Jess | 1001 | 6000 | 3 | 1 | 17400 | 42300 | 30900 |
| Lily | 1001 | 5000 | 3 | 1 | 17400 | 42300 | 35900 |
| Mike | 1001 | 6400 | 3 | 1 | 17400 | 42300 | 42300 |
| Richard | 1002 | 8000 | 3 | 1 | 20500 | 62800 | 50300 |
| Wei | 1002 | 7000 | 3 | 1 | 20500 | 62800 | 57300 |
| Yun | 1002 | 5500 | 3 | 1 | 20500 | 62800 | 62800 |
+---------+--------+--------+-----+-----+-------+-------+-------+
11 rows selected (111.856 seconds)
- 排序函数,
- ROW_NUMBER 根据分区和排序,从1开始给每一行指定唯一的序号。
- RANK 在一个组中,对组内的记录条目排序
- DENSE_RANK 和RANK相似,但是排名是连续无断开和跳跃。
- PERCENT_RANK 公式为(current rank - 1)/(total number of rows - 1)
- NTILE 把一个排序过的数据集分成若干部分,给每一条记录分配至这些部分的序号。
以下示例演示了以上函数的使用,
> SELECT
> name,
> dept_num as deptno,
> salary,
> row_number() OVER () as rnum, -- sequence in orginal table
> rank() OVER (PARTITION BY dept_num ORDER BY salary) as rk,
> dense_rank() OVER (PARTITION BY dept_num ORDER BY salary) as drk,
> percent_rank() OVER(PARTITION BY dept_num ORDER BY salary) as prk,
> ntile(4) OVER(PARTITION BY dept_num ORDER BY salary) as ntile
> FROM employee_contract
> ORDER BY deptno, name;
+---------+--------+--------+------+----+-----+------+-------+
| name | deptno | salary | rnum | rk | drk | prk | ntile |
+---------+--------+--------+------+----+-----+------+-------+
| Lucy | 1000 | 5500 | 7 | 4 | 3 | 0.75 | 3 |
| Michael | 1000 | 5000 | 11 | 3 | 2 | 0.5 | 2 |
| Steven | 1000 | 6400 | 8 | 5 | 4 | 1.0 | 4 |
| Wendy | 1000 | 4000 | 9 | 1 | 1 | 0.0 | 1 |
| Will | 1000 | 4000 | 10 | 1 | 1 | 0.0 | 1 |
| Jess | 1001 | 6000 | 5 | 2 | 2 | 0.5 | 2 |
| Lily | 1001 | 5000 | 6 | 1 | 1 | 0.0 | 1 |
| Mike | 1001 | 6400 | 4 | 3 | 3 | 1.0 | 3 |
| Richard | 1002 | 8000 | 1 | 3 | 3 | 1.0 | 3 |
| Wei | 1002 | 7000 | 3 | 2 | 2 | 0.5 | 2 |
| Yun | 1002 | 5500 | 2 | 1 | 1 | 0.0 | 1 |
+---------+--------+--------+------+----+-----+------+-------+
11 rows selected (80.052 seconds)
自Hive 2.1.0之后,聚集函数可以用于OVER子句中,请看以下示例,
> SELECT
> dept_num,
> rank() OVER (PARTITION BY dept_num ORDER BY sum(salary)) as rk
> FROM employee_contract
> GROUP BY dept_num;
+----------+----+
| dept_num | rk |
+----------+----+
| 1000 | 1 |
| 1001 | 1 |
| 1002 | 1 |
+----------+----+
3 rows selected (54.43 seconds)
- 分析函数
- cume_dist (number of rows ≤ current row)/(total number of rows)
- lead 返回后续记录
- lag 返回之前记录
- first_value 返回排序数据集的第一个值
- last_value 返回排序数据集的最后一个值
以下示例,
> SELECT
> name,
> dept_num as deptno,
> salary,
> cume_dist() OVER (PARTITION BY dept_num ORDER BY salary) as cume,
> lead(salary, 2) OVER (PARTITION BY dept_num ORDER BY salary) as lead,
> lag(salary, 2, 0) OVER (PARTITION BY dept_num ORDER BY salary) as lag,
> first_value(salary) OVER (PARTITION BY dept_num ORDER BY salary) as fval,
> last_value(salary) OVER (PARTITION BY dept_num ORDER BY salary) as lval,
> last_value(salary) OVER (PARTITION BY dept_num ORDER BY salary RANGE BETWEEN UNBOUNDED
> PRECEDING AND UNBOUNDED FOLLOWING) as lval2
> FROM employee_contract
> ORDER BY deptno, salary;
+--------+------+--------+------+------+-----+------+------+-------+
| name |deptno| salary | cume | lead | lag | fval |lvalue|lvalue2|
+--------+------+--------+------+------+-----+------+------+-------+
| Will | 1000 | 4000 | 0.4 | 5500 | 0 | 4000 | 4000 | 6400 |
| Wendy | 1000 | 4000 | 0.4 | 5000 | 0 | 4000 | 4000 | 6400 |
| Michael| 1000 | 5000 | 0.6 | 6400 | 4000| 4000 | 5000 | 6400 |
| Lucy | 1000 | 5500 | 0.8 | NULL | 4000| 4000 | 5500 | 6400 |
| Steven | 1000 | 6400 | 1.0 | NULL | 5000| 4000 | 6400 | 6400 |
| Lily | 1001 | 5000 | 0.33 | 6400 | 0 | 5000 | 5000 | 6400 |
| Jess | 1001 | 6000 | 0.67 | NULL | 0 | 5000 | 6000 | 6400 |
| Mike | 1001 | 6400 | 1.0 | NULL | 5000| 5000 | 6400 | 6400 |
| Yun | 1002 | 5500 | 0.33 | 8000 | 0 | 5500 | 5500 | 8000 |
| Wei | 1002 | 7000 | 0.67 | NULL | 0 | 5500 | 7000 | 8000 |
| Richard| 1002 | 8000 | 1.0 | NULL | 5500| 5500 | 8000 | 8000 |
+--------+------+--------+------+------+-----+------+------+-------+
11 rows selected (55.203 seconds)
其中last_value缺省的窗口为RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
[<window_expression>] 用来进一步子分区结果并应用窗口函数,有两类窗口类型,行类型和范围类型。另外rank(...), ntile(...), dense_rank(...), cume_dist(...), percent_rank(...), lead(...), lag(...), and row_number(...) 暂不支持这一个功能。
行类型示例1
-- Preceding and Following
> SELECT
> name, dept_num as dno, salary as sal,
> max(salary) OVER (PARTITION BY dept_num ORDER BY name
> ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) win1,
> max(salary) OVER (PARTITION BY dept_num ORDER BY name
> ROWS BETWEEN 2 PRECEDING AND UNBOUNDED FOLLOWING) win2,
> max(salary) OVER (PARTITION BY dept_num ORDER BY name
> ROWS BETWEEN 1 PRECEDING AND 2 FOLLOWING) win3,
> max(salary) OVER (PARTITION BY dept_num ORDER BY name
> ROWS BETWEEN 2 PRECEDING AND 1 PRECEDING) win4,
> max(salary) OVER (PARTITION BY dept_num ORDER BY name
> ROWS BETWEEN 1 FOLLOWING AND 2 FOLLOWING) win5,
> max(salary) OVER (PARTITION BY dept_num ORDER BY name
> ROWS 2 PRECEDING) win6, -- FOLLOWING does not work in this way
> max(salary) OVER (PARTITION BY dept_num ORDER BY name
> ROWS UNBOUNDED PRECEDING) win7
> FROM employee_contract
> ORDER BY dno, name;
+---------+------+------+------+------+------+------+------+------+------+
| name | dno | sal | win1 | win2 | win3 | win4 | win5 | win6 | win7 |
+---------+------+------+------+------+------+------+------+------+------+
| Lucy | 1000 | 5500 | 5500 | 6400 | 6400 | NULL | 6400 | 5500 | 5500 |
| Michael | 1000 | 5000 | 5500 | 6400 | 6400 | 5500 | 6400 | 5500 | 5500 |
| Steven | 1000 | 6400 | 6400 | 6400 | 6400 | 5500 | 4000 | 6400 | 6400 |
| Wendy | 1000 | 4000 | 6400 | 6400 | 6400 | 6400 | 4000 | 6400 | 6400 |
| Will | 1000 | 4000 | 6400 | 6400 | 4000 | 6400 | NULL | 6400 | 6400 |
| Jess | 1001 | 6000 | 6000 | 6400 | 6400 | NULL | 6400 | 6000 | 6000 |
| Lily | 1001 | 5000 | 6000 | 6400 | 6400 | 6000 | 6400 | 6000 | 6000 |
| Mike | 1001 | 6400 | 6400 | 6400 | 6400 | 6000 | NULL | 6400 | 6400 |
| Richard | 1002 | 8000 | 8000 | 8000 | 8000 | NULL | 7000 | 8000 | 8000 |
| Wei | 1002 | 7000 | 8000 | 8000 | 8000 | 8000 | 5500 | 8000 | 8000 |
| Yun | 1002 | 5500 | 8000 | 8000 | 7000 | 8000 | NULL | 8000 | 8000 |
+---------+------+------+------+------+------+------+------+------+------+
11 rows selected (55.885 seconds)-- Current and Unbounded
> SELECT
> name, dept_num as dno, salary as sal,
> max(salary) OVER (PARTITION BY dept_num ORDER BY name
> ROWS BETWEEN CURRENT ROW AND CURRENT ROW) win8,
> max(salary) OVER (PARTITION BY dept_num ORDER BY name
> ROWS BETWEEN CURRENT ROW AND 1 FOLLOWING) win9,
> max(salary) OVER (PARTITION BY dept_num ORDER BY name
> ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) win10,
> max(salary) OVER (PARTITION BY dept_num ORDER BY name
> ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) win11,
> max(salary) OVER (PARTITION BY dept_num ORDER BY name
> ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) win12,
> max(salary) OVER (PARTITION BY dept_num ORDER BY name
> ROWS BETWEEN UNBOUNDED PRECEDING AND 1 FOLLOWING) win13,
> max(salary) OVER (PARTITION BY dept_num ORDER BY name
> ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) win14
> FROM employee_contract> ORDER BY dno, name;
+-------+----+------+------+------+-------+-------+-------+-------+-------+
|name |dno | sal | win8 | win9 | win10 | win11 | win12 | win13 | win14 |
+-------+----+------+------+------+-------+-------+-------+-------+-------+
|Lucy |1000| 5500 | 5500 | 5500 | 6400 | NULL | 5500 | 5500 | 6400 |
|Michael|1000| 5000 | 5000 | 6400 | 6400 | 5500 | 5500 | 6400 | 6400 |
|Steven |1000| 6400 | 6400 | 6400 | 6400 | 5500 | 6400 | 6400 | 6400 |
|Wendy |1000| 4000 | 4000 | 4000 | 4000 | 6400 | 6400 | 6400 | 6400 |
|Will |1000| 4000 | 4000 | 4000 | 4000 | 6400 | 6400 | 6400 | 6400 |
|Jess |1001| 6000 | 6000 | 6000 | 6400 | NULL | 6000 | 6000 | 6400 |
|Lily |1001| 5000 | 5000 | 6400 | 6400 | 6000 | 6000 | 6400 | 6400 |
|Mike |1001| 6400 | 6400 | 6400 | 6400 | 6000 | 6400 | 6400 | 6400 |
|Richard|1002| 8000 | 8000 | 8000 | 8000 | NULL | 8000 | 8000 | 8000 |
|Wei |1002| 7000 | 7000 | 7000 | 7000 | 8000 | 8000 | 8000 | 8000 |
|Yun |1002| 5500 | 5500 | 5500 | 5500 | 8000 | 8000 | 8000 | 8000 |
+-------+----+------+------+------+-------+-------+-------+-------+-------+
11 rows selected (53.754 seconds)
行类型示例2
> SELECT
> name, dept_num, salary,
> max(salary) OVER w1 as win1,
> max(salary) OVER w2 as win2,
> max(salary) OVER w3 as win3
> FROM employee_contract
> WINDOW w1 as (
> PARTITION BY dept_num ORDER BY name
> ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
> ),
> w2 as w3,
> w3 as (
> PARTITION BY dept_num ORDER BY name
> ROWS BETWEEN 1 PRECEDING AND 2 FOLLOWING
> );
+---------+----------+--------+------+------+------+
| name | dept_num | salary | win1 | win2 | win3 |
+---------+----------+--------+------+------+------+
| Lucy | 1000 | 5500 | 5500 | 6400 | 6400 |
| Michael | 1000 | 5000 | 5500 | 6400 | 6400 |
| Steven | 1000 | 6400 | 6400 | 6400 | 6400 |
| Wendy | 1000 | 4000 | 6400 | 6400 | 6400 |
| Will | 1000 | 4000 | 6400 | 4000 | 4000 |
| Jess | 1001 | 6000 | 6000 | 6400 | 6400 |
| Lily | 1001 | 5000 | 6000 | 6400 | 6400 |
| Mike | 1001 | 6400 | 6400 | 6400 | 6400 |
| Richard | 1002 | 8000 | 8000 | 8000 | 8000 |
| Wei | 1002 | 7000 | 8000 | 8000 | 8000 |
| Yun | 1002 | 5500 | 8000 | 7000 | 7000 |
+---------+----------+--------+------+------+------+
11 rows selected (57.204 seconds)
范围类型示例1
> SELECT
> dept_num, start_date, name, salary,
> max(salary) OVER (PARTITION BY dept_num ORDER BY salary
> RANGE BETWEEN 500 PRECEDING AND 1000 FOLLOWING) win1,
> max(salary) OVER (PARTITION BY dept_num ORDER BY salary
> RANGE BETWEEN 500 PRECEDING AND CURRENT ROW) win2
> FROM employee_contract
> order by dept_num, start_date;
+----------+------------+---------+--------+------+------+
| dept_num | start_date | name | salary | win1 | win2 |
+----------+------------+---------+--------+------+------+
| 1000 | 2010-01-03 | Lucy | 5500 | 6400 | 5500 |
| 1000 | 2012-11-03 | Steven | 6400 | 6400 | 6400 |
| 1000 | 2013-10-02 | Will | 4000 | 5000 | 4000 |
| 1000 | 2014-01-29 | Michael | 5000 | 5500 | 5000 |
| 1000 | 2014-10-02 | Wendy | 4000 | 5000 | 4000 |
| 1001 | 2013-11-03 | Mike | 6400 | 6400 | 6400 |
| 1001 | 2014-11-29 | Lily | 5000 | 6000 | 5000 |
| 1001 | 2014-12-02 | Jess | 6000 | 6400 | 6000 |
| 1002 | 2010-04-03 | Wei | 7000 | 8000 | 7000 |
| 1002 | 2013-09-01 | Richard | 8000 | 8000 | 8000 |
| 1002 | 2014-01-29 | Yun | 5500 | 5500 | 5500 |
+----------+------------+---------+--------+------+------+
11 rows selected (60.784 seconds)