1. posexplode
select date, sum(amount) as total_amount
from (
select date_add(start_date,id) as date, amount from tmp
lateral view posexplode(split(space(datediff(end_date,start_date)),' ')) t as id,t_date
) t
group by date ;
reference: https://www.jianshu.com/p/d333e70c21d1
2. space
select split(space(10), ' ') from dual;
3. lag/lead
lag(col, n, default) 前n行数据
LAG(column_expression, offset, default_value) OVER (PARTITION BY partition_expression ORDER BY sort_expression)
lead(col, n, default) 后n行数据
LEAD(column_expression, offset, default_value) OVER (PARTITION BY partition_expression ORDER BY sort_expression)
https://zhuanlan.zhihu.com/p/397789258?utm_id=0
4. ntile
ntile(n) over(order by col) as bucket_num
n是指定的分箱数量。如果不能平均分配,则优先分配较小编号的箱,并且各个箱中能放的行数最多相差1。
备注:NULL值的处理,可以设置单独为一组,或者默认为最小值
5. first_value/last_value
first_value取分区内排序后,截止到当前行,第一个值
last_value分组内排序后,截止到当前行,最后一个值,如果要取分组内的最后一行,需要使用 rows between current row and unbounded following
select *,
first_value(logday) over(partition by userid order by logday) as co1,
last_value(logday) over(partition by userid order by logday) as co2,
last_value(logday) over(partition by userid order by logday rows between current row and unbounded following) as co3
from test.test_window;
6. rank/dense_rank
rank 会留下空位
dense_rank 不会留下空位
7. percent_rank()
当前行-1 / 当前组总行数-1
select dept, userid, sal,
percent_rank() over(order by sal) as pr1
from cookie3;
结果分析:
d1 user1 1000 0.0 (1-1)/(5-1)=0.0
d1 user2 2000 0.25 (2-1)/(5-1)=0.25
d1 user3 3000 0.5 (3-1)/(5-1)=0.5
d2 user4 4000 0.75 (4-1)/(5-1)=0.75
d2 user5 5000 1.0 (5-1)/(5-1)=1.0
select dept, userid, sal,
percent_rank() over(partition by dept order by sal) as pr2
from cookie3;
结果分析:
d1 user1 1000 0.0 (1-1)/(3-1)=0.0
d1 user2 2000 0.5 (2-1)/(3-1)=0.5
d1 user3 3000 1.0 (3-1)/(3-1)=1.0
d2 user4 4000 0.0 (1-1)/(2-1)=0.0
d2 user5 5000 1.0 (2-1)/(2-1)=1.0
8. cume_dist
小于(或大于)等于当前值的行数/分组内总行数
select dept, userid, sal,
cume_dist() over(order by sal) as cume1
from cookie3;
结果:
d1 user1 1000 0.2
d1 user2 2000 0.4
d1 user3 3000 0.6
d2 user4 4000 0.8
d2 user5 5000 1.0
9. hive 窗口函数中的rows和range的区分
sum(id) over(order by id range between 1 preceding and 2 following)
窗口大小设置为该分区内小于本记录id-1和id+2天的窗口
order by id,所以要看id的值,并对key(id进行range操作),即[id-1, id+2],注意是闭区间
https://blog.csdn.net/DolphinF/article/details/131695958
窗口函数使用:https://www.baispace.cn/article/hive-knowledge-window-function.html
10. 时间日期函数
https://www.cnblogs.com/lubians/p/17480959.html
11. array
https://blog.csdn.net/dupizi/article/details/131779370
12. limit offset
跳过offset行,取接下来的limit行。
image.png