背景
日常工作中有许多数据处理需求需要解决,在此之间,获得需求,用hive实现需求,最终完成任务。
题目
数据源在:hive中的adventure_ods库的ods_sales_orders表
表名 | 表注释 | 字段 | 字段注释 | |
---|---|---|---|---|
ods_sales_orders | 订单明细表 | sales_order_key | 订单主键 | 一个订单表示销售一个产品 |
ods_sales_orders | 订单明细表 | create_date | 订单日期 | |
ods_sales_orders | 订单明细表 | customer_key | 客户编号 | |
ods_sales_orders | 订单明细表 | product_key | 产品编号 | |
ods_sales_orders | 订单明细表 | english_product_name | 产品名 | |
ods_sales_orders | 订单明细表 | cpzl_zw | 产品子类 | |
ods_sales_orders | 订单明细表 | cplb_zw | 产品类别 | |
ods_sales_orders | 订单明细表 | unit_price | 产品单价 |
找出前前几行看看:
题目一:每个用户截止到每月为止的最大交易金额和累计到该月的总交易金额,结果数据格式如下
方法1,先对日期汇总
先针对日期做窗口汇总,然后group by+max 干掉每个月里面的不是第一天的那种'小数据'.
select
customer_key,
substr(create_date, 1, 7) as umonth,
count() over(partition by customer_key, substr(create_date, 1, 7)) as month_count,
count() over(partition by customer_key) as total_count,
last_value(create_date) over(partition by customer_key) as last_id,
max(unit_price) over(partition by customer_key order by create_date) max_until_this_time,
sum(unit_price) over(partition by customer_key order by create_date) cumsum_until_this_time
from ods_sales_orders;
这时候只是多了冗余数据,去掉即可.怎么去掉呢?我们先把这个结果存起来。CTAS。
create table temp_table as
(select
customer_key,
substr(create_date, 1, 7) as umonth,
count() over(partition by customer_key, substr(create_date, 1, 7)) as month_count,
count() over(partition by customer_key) as total_count,
last_value(create_date) over(partition by customer_key) as last_id,
max(unit_price) over(partition by customer_key order by create_date) max_until_this_time,
sum(unit_price) over(partition by customer_key order by create_date) cumsum_until_this_time
from ods_sales_orders
);
这时候我们就可以去掉冗余了,使用max函数干掉小的那种。
select customer_key, umonth, month_count,
max(max_until_this_time) as current_max,
max(cumsum_until_this_time) as current_sum
from temp_table
group by customer_key, umonth, month_count;
方法二,先对月份做汇总
select
customer_key, substr(create_date, 1, 7) as umonth,
count() over(partition by customer_key, substr(create_date, 1, 7)) as ucount,
max(unit_price) over(partition by customer_key, substr(create_date, 1, 7)) as max_this_month,
sum(unit_price) over(partition by customer_key, substr(create_date, 1, 7)) as sum_this_month
from ods_sales_orders;
这时候有重复,用group by 去重。
select customer_key, umonth, ucount, max_this_month, sum_this_month
from (
select
customer_key, substr(create_date, 1, 7) as umonth,
count() over(partition by customer_key, substr(create_date, 1, 7)) as ucount,
max(unit_price) over(partition by customer_key, substr(create_date, 1, 7)) as max_this_month,
sum(unit_price) over(partition by customer_key, substr(create_date, 1, 7)) as sum_this_month
from ods_sales_orders)mubiao
group by customer_key, umonth, ucount, max_this_month, sum_this_month;
-- 保存为临时表
create table temp_tt as (
select customer_key, umonth, ucount, max_this_month, sum_this_month
from (
select
customer_key, substr(create_date, 1, 7) as umonth,
count() over(partition by customer_key, substr(create_date, 1, 7)) as ucount,
max(unit_price) over(partition by customer_key, substr(create_date, 1, 7)) as max_this_month,
sum(unit_price) over(partition by customer_key, substr(create_date, 1, 7)) as sum_this_month
from ods_sales_orders)mubiao
group by customer_key, umonth, ucount, max_this_month, sum_this_month);
这时候我们就可以计算截止当前的最大和累计了。
select customer_key, umonth, ucount,
cast(max(max_this_month) over(partition by customer_key order by umonth) as decimal(10,2)) as current_max,
cast(sum(sum_this_month) over(partition by customer_key order by umonth) as decimal(10,2)) as current_sum
from temp_tt;
customer_key | umonth(当月) | ucount(月订单量) | current_max(最大交易金额) | current_sum(累计该月总交易金额) |
---|---|---|---|---|
11009 | 2018-12 | 1 | 53.99 | 53.99 |
1358999 | 2019-2 | 1 | 28.99 | 28.99 |
1358999 | 2019-4 | 1 | 69.99 | 98.98 |
1359000 | 2019-1 | 1 | 2294.99 | 2294.99 |
1359002 | 2019-11 | 1 | 8.99 | 8.99 |
1359003 | 2020-1 | 1 | 1120.49 | 1120.49 |
1359005 | 2019-2 | 1 | 782.99 | 782.99 |
1359009 | 2019-1 | 1 | 2384.07 | 2384.07 |
1359014 | 2019-1 | 1 | 69.99 | 69.99 |
1359014 | 2019-2 | 1 | 69.99 | 94.98 |
思路:
1.对数据按照客户及其年-月分组
2.分组后就每月销售金额之和
3.使用窗口函数,对每个客户不同月份分组求最大值(max)和累计值(sum)
题目二:计算用户的回购率和复购率
复购率
复购率: 当前月份购买2次及以上的客户占所有客户比例,以2月为例进行计算
- 先增加购买一个本月购买次数的列
select customer_key, count(customer_key) as count_this_month from
ods_sales_orders
where year(create_date)=2019 and month(create_date)=2
group by customer_key;
下面就好算了。
select concat("", cast(sum(if(count_this_month>1,1,0))*100/count(*) as decimal(14,4)), "%")
from (
select customer_key, count(customer_key) as count_this_month from
ods_sales_orders
where year(create_date)=2019 and month(create_date)=2
group by customer_key
) mubiao ;
回购率
回购率:当前月份购买且上个月份也购买的客户占当月所有月份客户比例
1.先得到当前月份有哪些用户
select customer_key,month(create_date) as month_2_count
from ods_sales_orders
where month(create_date)=2
group by customer_key, month(create_date); -- 2月份的用户
select customer_key,month(create_date) as month_1_count
from ods_sales_orders
where month(create_date)=1
group by customer_key, month(create_date); -- 1月份的用户
上面可以分别得到2,1月份每个customer的购买情况,left_join 筛选非NULL行就可以得到上个月是否买了东西。
select sum(if (month_1_count is not null, 1, 0)) / count(*)
from (select customer_key,month(create_date) as month_2_count
from ods_sales_orders
where month(create_date)=2
group by customer_key, month(create_date)) a
left join (select customer_key,month(create_date) as month_1_count
from ods_sales_orders
where month(create_date)=1
group by customer_key, month(create_date)) b on a.customer_key=b.customer_key;
思路:
复购率
- 1、对当月(2月份)的客户分组,计数购买次数
- 2、筛选购买次数为2以上的,认为是复购群体
回购率
- 1、筛选当月及上月部分
- 2、利用客户id进行当月连上月,推荐左连
- 3、对同一条客户id均有购买记录的,认为是回购群体
题目三:求用户号对应不同的产品
用户号 | 产品 | 购买时间 |
---|---|---|
1 | A | 2019-12-23 |
1 | B | 2019-12-23 |
2 | C | 2019-12-23 |
2 | A | 2019-12-24 |
2 | B | 2019-12-23 |
解答
先找几条看看数据的样子。
select customer_key, create_date, product_key from ods_sales_orders limit 100;
其实就是怕连续买了两个产品1,所以需要去重,对于同一个产品只保留最前面的那一次购买,分组然后min时间即可。
select customer_key, product_key, min(create_date) as create_date
from ods_sales_orders
group by customer_key, product_key;
with temp as (
select customer_key, product_key, min(create_date) as create_date
from ods_sales_orders
group by customer_key, product_key
)
select
*,
row_number() over(partition by customer_key order by create_date) as product_index_by_time
from temp;
这时候就可以筛选了,选出其中序号<3的那些订单们。
with temp as (
select customer_key, product_key, min(create_date) as create_date
from ods_sales_orders
group by customer_key, product_key
) , -- end of temp table1
temp_with_index as(
select
*,
row_number() over(partition by customer_key order by create_date) as product_index_by_time
from temp) --end of temp table2
select concat(customer_key, '-', concat_ws('-', collect_set(product_key)))
from temp_with_index
where product_index_by_time < 3
group by customer_key;
当然也可以改为算子类:
with temp as (
select customer_key, cpzl_zw, min(create_date) as create_date
from ods_sales_orders
group by customer_key, cpzl_zw
) , -- end of temp table1
temp_with_index as(
select
*,
row_number() over(partition by customer_key order by create_date) as product_index_by_time
from temp) --end of temp table2
select concat(customer_key, '-', concat_ws('-', collect_set(cpzl_zw)))
from temp_with_index
where product_index_by_time < 3
group by customer_key;
要求输出例子**:用户号-产品1-产品2
例如:1-A-B (按先后时间顺序,同时不限定)
参考:https://www.jianshu.com/p/90d0657c0218
思路:
- 1.利用窗口函数,对用户号分组,按时间对产品进行排序
- 2.利用左连或其他方法拼接,筛选排序顺序为1、2的
- 3.用concat或者其他函数拼接获得结果
题目四:统计各个省份所属城市下最受欢迎的Top 3产品和其销量(不能出现有null)
-
先看看城市和顾客的从属关系
关联一下,给每个订单确定一个城市。
select
a.*,
b.chinese_city
from ods_sales_orders a inner join ods_customer b
on a.customer_key = b.customer_key
where product_key = 214;
with temp as
(select
product_key,
b.chinese_city,
count(*)
from ods_sales_orders a inner join ods_customer b
on (a.customer_key = b.customer_key)
and (chinese_city is not NULL) and (product_key is not NULL)
group by chinese_city, product_key
)
select * from temp;
截图里面有个null,好奇嘛?原来他用的是字符串null
-- 汇总每个城市里面的产品降序
with temp as
(select
b.chinese_city,
product_key,
count(*) as pro_count
from ods_sales_orders a inner join ods_customer b
on (a.customer_key = b.customer_key)
and (chinese_city != 'null') and (product_key != 'null')
group by chinese_city, product_key
)
select *,
row_number() over(partition by chinese_city order by pro_count desc) as rn
from temp;
下面考虑只要top3.
with temp as
(select
b.chinese_city,
product_key,
count(*) as pro_count
from ods_sales_orders a inner join ods_customer b
on (a.customer_key = b.customer_key)
and (chinese_city != 'null') and (product_key != 'null')
group by chinese_city, product_key
),
temp_rn as(
select *,
row_number() over(partition by chinese_city order by pro_count desc) as rn
from temp)
select * from temp_rn
where rn < 4;
题目五:商品的销售数量top10,排名需考虑并列排名的情况
select product_key,
count(*) as num
from ods_sales_orders
group by product_key;
with temp as(
select product_key,
count(*) as num
from ods_sales_orders
group by product_key
),
temp_with_rn as
(
select
*,
rank() over(order by num desc) as rn
from temp
)
select * from temp_with_rn
where rn < 11;
题目六:计算累计和(统计2019年1-12月的累积销量,即1月为1月份的值,2月为1、2月份值的和,3月为1、2、3月份的和,12月为1-12月份值的和)
先统计每个月的和
select substr(create_date, 1, 7) as umonth,
sum(unit_price) as umonth_total
from ods_sales_orders
where year(create_date)=2019
group by substr(create_date, 1, 7);
累加和的计算
with temp as(
select substr(create_date, 1, 7) as umonth,
sum(unit_price) as umonth_total
from ods_sales_orders
where year(create_date)=2019
group by substr(create_date, 1, 7)
)
select *,
cast(sum(umonth_total) over(order by umonth) as decimal(12, 2)) as cumsum
from temp;
题目七:计算客户平均购买一次商品的间隔时间
先记录一下上一次的购买时间
select customer_key,
lag(create_date) over(partition by customer_key order by create_date) as prev_date
from ods_sales_orders;
记录两次之间的时间间隔:
with temp as
(select customer_key, create_date,
lag(create_date) over(partition by customer_key order by create_date) as prev_date
from ods_sales_orders)
select customer_key, create_date, prev_date,
datediff(create_date, prev_date) as gap_date
from temp
where prev_date is not NULL;
with temp as
(select customer_key, create_date,
lag(create_date) over(partition by customer_key order by create_date) as prev_date
from ods_sales_orders),
temp_with_diff as
(
select customer_key, create_date, prev_date,
datediff(create_date, prev_date) as gap_date
from temp
where prev_date is not NULL
)
select customer_key, avg(gap_date) as avg_gap
from temp_with_diff
group by customer_key;
题目八:查询最近前20%时间的订单信息
提示:使用ntile 函数记录条目在哪个分段。
select *,
ntile(5) over(order by create_date) as tile_index
from ods_sales_orders;
with temp as
(
select *,
ntile(5) over(order by create_date) as tile_index
from ods_sales_orders
)
select *
from temp
where tile_index = 2; -- 20%-40%