本篇为视频篇的md文稿,说明:
本篇不归纳总结,以3道SQL题为例,主要介绍一下在数据仓库中一些常见的SQL处理应用
一. 窗口应用:用户连续消费的最大月份
二. 函数应用:提取Json中的key值
三. 分组应用:混合排序
一. 窗口应用:用户连续消费的最大月份
有表log_consume,字段 uid order_time order_price
数据:
A 2023-05-01 100
A 2023-06-02 100
A 2023-06-03 100
B 2023-01-01 100
B 2023-03-01 100
B 2023-04-01 100
B 2023-05-01 100
需求: 现在要统计用户连续消费的最大月份数
- 首先对uid,order_time(年-月)去重
- 窗口,偏移量函数(lag)对order_time开窗,对uid分组,偏移量为1 as lag_order_time
- 对两个时间点进行差计算, 如果结果为1,说明就是连续的(考虑跨年),if判断
- sun窗口,对时间差进行开窗,对uid进行分组
- 输出结果
with x as(
SELECT
uid,
order_month,
(year(order_month) - year(lag_order_month) ) * 12 + month(order_month) - month(lag_order_month) as month_sum
FROM (
SELECT
uid,
order_month,
lag(order_month,1,order_month) over(partion by uid order by order_month asc) as lag_order_month
FROM (
SELECT uid,concat(substr(order_time,1,7),'-1') as order_month
FROM log_consume
GROUP BY concat(substr(order_time,1,7),'-1')
) t1
) t2
)
SELECT
uid ,k,group_concat(order_month)
FROM
(
SELECT
uid,
order_month,
sum(tag) over(partion by uid order by order_month asc) as k
FROM
(
SELECT
uid,
order_month,
IF(month_sum = 1 ,0,1) as tag
FROM
x
) t1
) t2
GROUP BY uid ,k
二. 函数应用:提取Json中的key值
[{“website”:“baidu.com”,“name”:“百度”},{“website”:“google.com”,“name”:“谷歌”}]
- get_json_object
- json_tuple
select
good_friend,friend_1 ,friend_2, friend_3
from
table
lateral view json_tuple(str,'friend') as good_friend
lateral view json_tuple(good_friend,'friend_1','friend_2','friend_3') tb as friend_1 ,friend_2, friend_3
- explode,replace,posexplode
主要的思路正则替换,炸裂函数
SELECT
pos + 1 as rn
val
FROM
(
regexp_replace(translate('[{“website”:“baidu.com”,“name”:“百度”},{“website”:“google.com”,“name”:“谷歌”}]','[}{}""',''),'\,','\:') as str
) t1
lateral view posexplode(str,':') t2 as pas ,val
三. 分组应用:混合排序
sum() over()
sum() over(partion by ....)
sum() over(partion by order by ....)
sum() over() as rn 标签值
- 按city分类
- 按score降序,此时只考虑为A B C 不考虑A-1 A-2
- 如果存在A-1 A-2 之类,按照id升序
排序前的数据
id city name score
1 a A 100
2 a A-1 80
3 b C 70
4 a A-2 90
5 b D 85
6 b B 75
7 b E 30
8 b B-1 50
9 a F 95
10 b G 65
排序后的数据
id city name score
1 a A 100
2 a A-1 80
4 a A-2 90
9 a F 95
5 b D 85
6 b B 75
8 b B-1 50
3 b C 70
10 b G 65
7 b E 30
with x as (
select
id , city, name, score ,if(instr(name,'-'),substr(1,1),name) as base_name
from
order_data
) t1
select
x.id , x.city, x.name, x.score
from
order_data join x on order_data.name = x.base_name
order by x.city,order_data.score,x.id