hive 日常数据需求(尽可能展示窗口函数的使用)
题目一:每个用户截止到每月为止的最大交易金额和累计到该月的总交易金额,结果数据格式如下
#先将每个人每个月的消费进行聚合
SELECT customer_key,DATE_FORMAT(create_date,"yyyy-MM") AS umonth,
COUNT(1) AS ucount,SUM(unit_price) AS sum_price
FROM ods_sales_orders
GROUP BY customer_key,DATE_FORMAT(create_date,"yyyy-MM")
ORDER BY umonth
#随后通过窗口函数,求出当前月为止最大消费以及累计消费
SELECT customer_key,
umonth,
ucount,
MAX(sum_price) over(PARTITION BY customer_key ORDER BY umonth ROWS BETWEEN unbounded preceding AND current ROW) AS current_max,
SUM(sum_price) over(PARTITION BY customer_key ORDER BY umonth ROWS BETWEEN unbounded preceding AND current ROW) AS current_sum
FROM( #每个人每月消费
SELECT customer_key,DATE_FORMAT(create_date,"yyyy-MM") AS umonth,
COUNT(1) AS ucount,SUM(unit_price) AS sum_price
FROM ods_sales_orders
GROUP BY customer_key,DATE_FORMAT(create_date,"yyyy-MM")
ORDER BY umonth) a
LIMIT 10;
笔记:
1.窗口函数边界:
2.hive中,取年月的两种方式
题目二:计算用户的回购率和复购率
复购率: 当前月份购买2次及以上的客户占所有客户比例
回购率:当前月份购买且上个月份也购买的客户占当月所有月份客户比例
思路:
复购率
1、对当月(2月份)的客户分组,计数购买次数
2、筛选购买次数为2以上的,认为是复购群体
SELECT
umonth,
COUNT(1) AS `总客户数`,
SUM(IF(buys >=2,1,0)) AS `复购客户`,
SUM(IF(buys >=2,1,0))/COUNT(1) AS `复购率`
FROM ( #按月份统计客户消费次数情况
SELECT customer_key,
DATE_FORMAT(create_date,"yyyy-MM") umonth,
COUNT(1) buys
FROM ods_sales_orders
GROUP BY customer_key,DATE_FORMAT(create_date,"yyyy-MM")
) a
GROUP BY umonth
LIMIT 10;
回购率
1、筛选当月及上月部分
2、利用客户id进行当月连上月,推荐左连
3、对同一条客户id均有购买记录的,认为是回购群体
SELECT
a.umonth,
SUM(IF(a.cons >=2,1,0))/COUNT(1) AS `复购率`,
COUNT(b.customer_key)/COUNT(a.customer_key) AS `回购率`
FROM
(SELECT customer_key,
DATE_FORMAT(create_date,'yyyy-MM') umonth,
COUNT(1) cons
FROM ods_sales_orders
GROUP BY customer_key,DATE_FORMAT(create_date,'yyyy-MM')
)a
LEFT JOIN
(SELECT customer_key,
DATE_FORMAT(create_date,'yyyy-MM') umonth,
COUNT(1) cons
FROM ods_sales_orders
GROUP BY customer_key,DATE_FORMAT(create_date,'yyyy-MM')
) b
ON a.customer_key = b.customer_key
AND CONCAT(a.umonth,'-01') =add_months(CONCAT(b.umonth,'-01'),1)
GROUP BY a.umonth
LIMIT 10;
注意:
1.连接条件中,条件1为客户key字段,表示同一个客户;条件2为两个月的信息连接(本月日期 = 上月日期 +1个月)
2.ADD_MONTHS:别忘了有个's',add_months只能对完整日期格式进行月份加减(add_months(‘yyyy-MM-dd hh:mm:ss’,N),add_months(‘yyyy-MM-dd’,N)),所以这里先用concat给'年月'增加一个'日'字段,其中N为正则增加,N为负则减少
题目三:求用户最先购买的两种子类别产品信息,要求拼接成 “用户号-产品1-产品2” 的形式
第一步:根据用户分组,求出用户购买的产品及顺序
cpzl_zw1是cpzl_zw的上一个购买的产品
SELECT customer_key,cpzl_zw,
row_number() over(PARTITION BY customer_key ORDER BY create_date ASC) AS order_num,
lag(cpzl_zw,1,0) over(PARTITION BY customer_key ORDER BY create_date ASC) AS cpzl_zw1
FROM ods_sales_orders
-
lag() over():
lag(参数1,参数2,参数3)
参数1:表中列名 参数2:往下偏移多少位 参数3:超出行数时默认设置值
lag 往往和over结合使用
如:lag(参数1,参数2,参数3)over(order by 列)
第二步:选择相邻两个产品不同的记录
SELECT
customer_key,
cpzl_zw,
order_num,
cpzl_zw1
FROM (
SELECT
customer_key,
cpzl_zw,
row_number() OVER (PARTITION BY customer_key ORDER BY create_date ASC) AS order_num,
lag(cpzl_zw, 1, 0) OVER (PARTITION BY customer_key ORDER BY create_date ASC) AS cpzl_zw1
FROM ods_sales_orders
) a
WHERE cpzl_zw != cpzl_zw1
第三步:为相邻的产品添加排序(即上面lag函数添加的产品)
若相邻产品的排序为3,则表示第三件产品,题目要求前两件产品,则可以根据小于3进行筛选
WITH tmp1 AS (
SELECT
customer_key,
cpzl_zw,
order_num,
cpzl_zw1
FROM (
SELECT
customer_key,
cpzl_zw,
row_number() OVER (PARTITION BY customer_key ORDER BY create_date ASC) AS order_num,
lag(cpzl_zw, 1, 0) OVER (PARTITION BY customer_key ORDER BY create_date ASC) AS cpzl_zw1
FROM ods_sales_orders
) a
WHERE cpzl_zw != cpzl_zw1)
SELECT customer_key,cpzl_zw,order_num,cpzl_zw1,
row_number() over(PARTITION BY customer_key ORDER BY order_num ASC) AS order_num1
FROM tmp1
-
with as():类似于mysql中的视图功能暂时引用版(as 括号后为暂时引用的表),必须在结束后使用select进行查询否则报错,且多个with as用逗号隔开。
第四步:数据拼接
WITH tmp1 AS (
SELECT
customer_key,
cpzl_zw,
order_num,
cpzl_zw1
FROM (
SELECT
customer_key,
cpzl_zw,
row_number() OVER (PARTITION BY customer_key ORDER BY create_date ASC) AS order_num,
lag(cpzl_zw, 1, 0) OVER (PARTITION BY customer_key ORDER BY create_date ASC) AS cpzl_zw1
FROM ods_sales_orders
) a
WHERE cpzl_zw != cpzl_zw1)
,tmp2 AS (
SELECT customer_key,cpzl_zw,order_num,cpzl_zw1,
row_number() over(PARTITION BY customer_key ORDER BY order_num ASC) AS order_num1
FROM tmp1)
SELECT
CONCAT(customer_key,'-',CONCAT_WS('-',collect_set(cpzl_zw))) AS z1
FROM tmp2
WHERE order_num1 < 3
GROUP BY customer_key;
-
concat_ws:如何指定参数之间的分隔符
使用函数CONCAT_WS()。使用语法为:CONCAT_WS(separator,str1,str2,…)
CONCAT_WS() 代表 CONCAT With Separator ,是CONCAT()的特殊形式。第一个参数是其它参数的分隔符。分隔符的位置放在要连接的两个字符串之间。分隔符可以是一个字符串,也可以是其它参数。如果分隔符为 NULL,则结果为 NULL。函数会忽略任何分隔符参数后的 NULL 值。但是CONCAT_WS()不会忽略任何空字符串。 (然而会忽略所有的 NULL)。 - collect_set():
常用于聚合后的字段处理
个人认为第二步可以省略,collect_set自带去重功能,有点多余了
WITH tmp AS (
SELECT customer_key,cpzl_zw,
row_number() over(PARTITION BY customer_key ORDER BY create_date ASC) AS order_num,
lag(cpzl_zw,1,0) over(PARTITION BY customer_key ORDER BY create_date ASC) AS cpzl_zw1
FROM ods_sales_orders
),tmp2 AS(
SELECT customer_key,cpzl_zw,order_num,cpzl_zw1,
row_number() over(PARTITION BY customer_key ORDER BY order_num ASC) order_num1
FROM tmp)
SELECT
CONCAT(customer_key,'-',CONCAT_WS('-',collect_set(cpzl_zw))) AS z1
FROM tmp2
WHERE order_num1 < 3
GROUP BY customer_key;
题目四:统计各个省份所属城市下最受欢迎的Top 3产品和其销量(不能出现有null)
#多次利用with as
#方法一:
SELECT chinese_city,product_key,cons,c.`rank`
FROM (
WITH tmp AS(
SELECT sales_order_key,product_key,chinese_city
FROM ods_sales_orders a
RIGHT JOIN ods_customer b
ON a.customer_key = b.customer_key)
,tmp2 AS (
SELECT chinese_city,product_key,COUNT(sales_order_key) AS cons
FROM tmp
GROUP BY chinese_city,product_key
)
SELECT chinese_city,product_key,cons,
row_number() over(PARTITION BY chinese_city ORDER BY cons DESC) AS `rank`
FROM tmp2
) c
WHERE c.`rank` <= 3
LIMIT 100;
#方法二:
SELECT chinese_city,product_key,cons,`rank`
FROM(
SELECT chinese_city,product_key,cons,
row_number() over(PARTITION BY chinese_city ORDER BY cons DESC) AS `rank`
FROM (
SELECT b.chinese_city,a.product_key,COUNT(sales_order_key) AS cons
FROM ods_sales_orders a
RIGHT JOIN ods_customer b
ON a.customer_key = b.customer_key
GROUP BY b.chinese_city,a.product_key
) c
) d
WHERE `rank` <= 3
AND d.chinese_city != 'null'
LIMIT 100;
步骤:
第一步:将两表进行连接(customer为主表)
第二步:按城市、产品进行聚合分组,求出销量
第三步:窗口函数对销量进行排序
第四步:where进行筛选(where不能直接对窗口函数结果进行筛选,所以要再次使用表子连接)
题目五:商品的销售数量top10,排名需考虑并列排名的情况
提示:只用到订单表 ods_sales_orders,并列排序 dense_rank 窗口函数
SELECT product_key,cons,`rank`
FROM(
SELECT product_key,cons,dense_rank() over(ORDER BY cons DESC) AS `rank`
FROM (
SELECT product_key,COUNT(1) AS cons
FROM ods_sales_orders
GROUP BY product_key) a)b
WHERE b.`rank` <= 10;
题目六:计算累计和(统计2019年1-12月的累积销量,即1月为1月份的值,2月为1、2月份值的和,3月为1、2、3月份的和,12月为1-12月份值的和)
SELECT yearmonth,ROUND(sum_price,2),
ROUND(SUM(sum_price) over(ORDER BY yearmonth ASC),2) AS cumsum
FROM(
SELECT
DATE_FORMAT(create_date,"yyyy-MM") AS yearmonth,SUM(unit_price) AS sum_price
FROM ods_sales_orders
WHERE YEAR(create_date) = '2019'
GROUP BY DATE_FORMAT(create_date,"yyyy-MM")
) a;
题目七:计算客户平均购买一次商品的间隔时间
SELECT customer_key,AVG(diff_date) AS avg_buy_period
FROM(
SELECT
customer_key,
create_date,
lead(create_date,1) over(PARTITION BY customer_key ORDER BY create_date ASC) next_date,
DATEDIFF(lead(create_date,1) over(PARTITION BY customer_key ORDER BY create_date ASC),create_date) AS diff_date
FROM ods_sales_orders) a
GROUP BY customer_key
HAVING avg_buy_period IS NOT NULL;
- lead() over():作用于lag相同,方向相反;Lag和Lead分析函数可以在同一次查询中取出同一字段的前N行的数据(Lag)和后N行的数据(Lead)作为独立的列。
题目八:查询最近前20%时间的订单信息
SELECT *
FROM(
SELECT *, ntile(5) OVER(ORDER BY create_date DESC) `grouping`
FROM ods_sales_orders ) a
WHERE a.`grouping` = 1
LIMIT 10;
- ntile(n):用于将数据分成n组