Hive实战练习

背景

日常工作中有许多数据处理需求需要解决,在此之间,获得需求,用hive实现需求,最终完成任务。

题目

数据源在:hive中的adventure_ods库的ods_sales_orders表

表名 表注释 字段 字段注释
ods_sales_orders 订单明细表 sales_order_key 订单主键 一个订单表示销售一个产品
ods_sales_orders 订单明细表 create_date 订单日期
ods_sales_orders 订单明细表 customer_key 客户编号
ods_sales_orders 订单明细表 product_key 产品编号
ods_sales_orders 订单明细表 english_product_name 产品名
ods_sales_orders 订单明细表 cpzl_zw 产品子类
ods_sales_orders 订单明细表 cplb_zw 产品类别
ods_sales_orders 订单明细表 unit_price 产品单价

找出前前几行看看:

1592889021912

题目一:每个用户截止到每月为止的最大交易金额和累计到该月的总交易金额,结果数据格式如下

方法1,先对日期汇总

先针对日期做窗口汇总,然后group by+max 干掉每个月里面的不是第一天的那种'小数据'.

select 
customer_key, 
substr(create_date, 1, 7) as umonth,
count() over(partition by customer_key, substr(create_date, 1, 7)) as month_count,
count() over(partition by customer_key) as total_count,
last_value(create_date) over(partition by customer_key) as last_id,
max(unit_price) over(partition by customer_key order by create_date) max_until_this_time,
sum(unit_price) over(partition by customer_key order by create_date) cumsum_until_this_time
from ods_sales_orders;
1592896441615

这时候只是多了冗余数据,去掉即可.怎么去掉呢?我们先把这个结果存起来。CTAS。

create table temp_table as
(select 
customer_key, 
substr(create_date, 1, 7) as umonth,
count() over(partition by customer_key, substr(create_date, 1, 7)) as month_count,
count() over(partition by customer_key) as total_count,
last_value(create_date) over(partition by customer_key) as last_id,
max(unit_price) over(partition by customer_key order by create_date) max_until_this_time,
sum(unit_price) over(partition by customer_key order by create_date) cumsum_until_this_time
from ods_sales_orders
);
1592897995262

这时候我们就可以去掉冗余了,使用max函数干掉小的那种。

select customer_key, umonth, month_count,
max(max_until_this_time) as current_max,
max(cumsum_until_this_time) as current_sum
from temp_table
group by customer_key, umonth, month_count;
1592898843979

方法二,先对月份做汇总

select 
customer_key, substr(create_date, 1, 7) as umonth,
count() over(partition by customer_key, substr(create_date, 1, 7)) as ucount,
max(unit_price) over(partition by customer_key, substr(create_date, 1, 7)) as max_this_month,
sum(unit_price) over(partition by customer_key, substr(create_date, 1, 7)) as sum_this_month
from ods_sales_orders;
1592899355077

这时候有重复,用group by 去重。


select customer_key, umonth, ucount, max_this_month, sum_this_month
from (
select 
customer_key, substr(create_date, 1, 7) as umonth,
count() over(partition by customer_key, substr(create_date, 1, 7)) as ucount,
max(unit_price) over(partition by customer_key, substr(create_date, 1, 7)) as max_this_month,
sum(unit_price) over(partition by customer_key, substr(create_date, 1, 7)) as sum_this_month
from ods_sales_orders)mubiao
group by customer_key, umonth, ucount, max_this_month, sum_this_month;
-- 保存为临时表
create table temp_tt as (
select customer_key, umonth, ucount, max_this_month, sum_this_month
from (
select 
customer_key, substr(create_date, 1, 7) as umonth,
count() over(partition by customer_key, substr(create_date, 1, 7)) as ucount,
max(unit_price) over(partition by customer_key, substr(create_date, 1, 7)) as max_this_month,
sum(unit_price) over(partition by customer_key, substr(create_date, 1, 7)) as sum_this_month
from ods_sales_orders)mubiao
group by customer_key, umonth, ucount, max_this_month, sum_this_month);

1592899570032
1592899901432

这时候我们就可以计算截止当前的最大和累计了。

select customer_key, umonth, ucount, 
cast(max(max_this_month) over(partition by customer_key order by umonth) as decimal(10,2)) as current_max,
cast(sum(sum_this_month) over(partition by customer_key order by umonth) as decimal(10,2)) as current_sum
from temp_tt;

1592900568935
customer_key umonth(当月) ucount(月订单量) current_max(最大交易金额) current_sum(累计该月总交易金额)
11009 2018-12 1 53.99 53.99
1358999 2019-2 1 28.99 28.99
1358999 2019-4 1 69.99 98.98
1359000 2019-1 1 2294.99 2294.99
1359002 2019-11 1 8.99 8.99
1359003 2020-1 1 1120.49 1120.49
1359005 2019-2 1 782.99 782.99
1359009 2019-1 1 2384.07 2384.07
1359014 2019-1 1 69.99 69.99
1359014 2019-2 1 69.99 94.98

思路:

  • 1.对数据按照客户及其年-月分组

  • 2.分组后就每月销售金额之和

  • 3.使用窗口函数,对每个客户不同月份分组求最大值(max)和累计值(sum)

题目二:计算用户的回购率和复购率

复购率

复购率: 当前月份购买2次及以上的客户占所有客户比例,以2月为例进行计算

  1. 先增加购买一个本月购买次数的列
   
select customer_key, count(customer_key) as count_this_month from 
   
   ods_sales_orders
   
   where year(create_date)=2019 and month(create_date)=2
   
   group by customer_key;
1592902362138

下面就好算了。

select concat("", cast(sum(if(count_this_month>1,1,0))*100/count(*) as decimal(14,4)), "%")
from (
select customer_key, count(customer_key) as count_this_month from 
   
   ods_sales_orders
   
   where year(create_date)=2019 and month(create_date)=2
   
   group by customer_key
) mubiao ;
1592903111227

回购率

回购率:当前月份购买且上个月份也购买的客户占当月所有月份客户比例

1.先得到当前月份有哪些用户

select customer_key,month(create_date) as month_2_count

from ods_sales_orders

where month(create_date)=2

group by customer_key, month(create_date);  --  2月份的用户

select customer_key,month(create_date) as month_1_count

from ods_sales_orders

where month(create_date)=1

group by customer_key, month(create_date);  --  1月份的用户

上面可以分别得到2,1月份每个customer的购买情况,left_join 筛选非NULL行就可以得到上个月是否买了东西。

select sum(if (month_1_count is not null, 1, 0)) / count(*)

from (select customer_key,month(create_date) as month_2_count

from ods_sales_orders

where month(create_date)=2

group by customer_key, month(create_date)) a

left join (select customer_key,month(create_date) as month_1_count

from ods_sales_orders

where month(create_date)=1

group by customer_key, month(create_date)) b on a.customer_key=b.customer_key;
1592923229512

思路

复购率

  • 1、对当月(2月份)的客户分组,计数购买次数
  • 2、筛选购买次数为2以上的,认为是复购群体

回购率

  • 1、筛选当月及上月部分
  • 2、利用客户id进行当月连上月,推荐左连
  • 3、对同一条客户id均有购买记录的,认为是回购群体

题目三:求用户号对应不同的产品

用户号 产品 购买时间
1 A 2019-12-23
1 B 2019-12-23
2 C 2019-12-23
2 A 2019-12-24
2 B 2019-12-23

解答

先找几条看看数据的样子。

select customer_key, create_date, product_key from ods_sales_orders limit 100;
1592953179472

其实就是怕连续买了两个产品1,所以需要去重,对于同一个产品只保留最前面的那一次购买,分组然后min时间即可。

select customer_key, product_key, min(create_date)  as create_date

from ods_sales_orders

group by customer_key, product_key;
1592953922587
with temp as (

select customer_key, product_key, min(create_date)  as create_date

from ods_sales_orders

group by customer_key, product_key

) 

select 

*,

row_number() over(partition by customer_key order by create_date) as product_index_by_time

from temp;
1592954161731

这时候就可以筛选了,选出其中序号<3的那些订单们。

with temp as (

select customer_key, product_key, min(create_date)  as create_date

from ods_sales_orders

group by customer_key, product_key

) , -- end of temp table1

temp_with_index as(

select 

*,

row_number() over(partition by customer_key order by create_date) as product_index_by_time

from temp) --end of temp table2

select concat(customer_key, '-', concat_ws('-', collect_set(product_key)))

from temp_with_index 

where product_index_by_time < 3

group by customer_key;
1592954954300

当然也可以改为算子类:

with temp as (

select customer_key, cpzl_zw, min(create_date)  as create_date

from ods_sales_orders

group by customer_key, cpzl_zw

) , -- end of temp table1

temp_with_index as(

select 

*,

row_number() over(partition by customer_key order by create_date) as product_index_by_time

from temp) --end of temp table2

select concat(customer_key, '-', concat_ws('-', collect_set(cpzl_zw)))

from temp_with_index 

where product_index_by_time < 3

group by customer_key;
1592955065121
1592955316071

要求输出例子**:用户号-产品1-产品2

例如:1-A-B (按先后时间顺序,同时不限定)

参考:https://www.jianshu.com/p/90d0657c0218

思路:

  • 1.利用窗口函数,对用户号分组,按时间对产品进行排序
  • 2.利用左连或其他方法拼接,筛选排序顺序为1、2的
  • 3.用concat或者其他函数拼接获得结果

题目四:统计各个省份所属城市下最受欢迎的Top 3产品和其销量(不能出现有null)

  1. 先看看城市和顾客的从属关系

    1592970242408
  2. 关联一下,给每个订单确定一个城市。

select

a.*,

b.chinese_city

from ods_sales_orders a inner join ods_customer  b

on a.customer_key = b.customer_key 
where product_key = 214;
1592970499187
with temp as
(select
product_key,
b.chinese_city,
count(*)
from ods_sales_orders a inner join ods_customer  b
on (a.customer_key = b.customer_key)
and (chinese_city is not NULL) and (product_key is not NULL)
group by chinese_city, product_key
)
select * from temp;
1592970871254

截图里面有个null,好奇嘛?原来他用的是字符串null

1592972100699

用正规的NULL关键字反而搞不出来。所以我们应该入乡随俗,用!=null 来表示。
1592972153906
-- 汇总每个城市里面的产品降序
with temp as
(select
b.chinese_city,
product_key,
count(*) as pro_count
from ods_sales_orders a inner join ods_customer  b
on (a.customer_key = b.customer_key)
and (chinese_city != 'null') and (product_key != 'null')
group by chinese_city, product_key
)
select *,
row_number() over(partition by chinese_city order by pro_count desc) as rn
from temp;
1592972545377

下面考虑只要top3.

with temp as
(select
b.chinese_city,
product_key,
count(*) as pro_count
from ods_sales_orders a inner join ods_customer  b
on (a.customer_key = b.customer_key)
and (chinese_city != 'null') and (product_key != 'null')
group by chinese_city, product_key
),
temp_rn as(
select *,
row_number() over(partition by chinese_city order by pro_count desc) as rn
from temp)
select * from temp_rn
where rn < 4;
1592972683061

题目五:商品的销售数量top10,排名需考虑并列排名的情况

select product_key,
count(*) as num
from ods_sales_orders
group by product_key;
1592972806796
with temp as(
select product_key,
count(*) as num
from ods_sales_orders
group by product_key
),
temp_with_rn as
(
select 
*,
rank() over(order by num desc) as rn
from temp
)
select * from temp_with_rn
where rn < 11;
1592972999469

题目六:计算累计和(统计2019年1-12月的累积销量,即1月为1月份的值,2月为1、2月份值的和,3月为1、2、3月份的和,12月为1-12月份值的和)

先统计每个月的和

select substr(create_date, 1, 7) as umonth,
sum(unit_price) as umonth_total
from ods_sales_orders
where year(create_date)=2019
group by substr(create_date, 1, 7);
1592973185048

累加和的计算

with temp as(
select substr(create_date, 1, 7) as umonth,
sum(unit_price) as umonth_total
from ods_sales_orders
where year(create_date)=2019
group by substr(create_date, 1, 7)
) 
select *,
cast(sum(umonth_total) over(order by umonth) as decimal(12, 2)) as cumsum
from temp;
1592973366784

题目七:计算客户平均购买一次商品的间隔时间

先记录一下上一次的购买时间

select customer_key,
lag(create_date) over(partition by customer_key order by create_date) as prev_date
from ods_sales_orders;
1592973864868

记录两次之间的时间间隔:

with temp as
(select customer_key, create_date,
lag(create_date) over(partition by customer_key order by create_date) as prev_date
from ods_sales_orders)
select customer_key, create_date, prev_date,
datediff(create_date, prev_date) as gap_date
from temp
where prev_date is not NULL;
1592974122319
with temp as
(select customer_key, create_date,
lag(create_date) over(partition by customer_key order by create_date) as prev_date
from ods_sales_orders),
temp_with_diff as
(
select customer_key, create_date, prev_date,
datediff(create_date, prev_date) as gap_date
from temp
where prev_date is not NULL
)
select customer_key, avg(gap_date) as avg_gap
from temp_with_diff
group by customer_key;
1592974262952

题目八:查询最近前20%时间的订单信息

提示:使用ntile 函数记录条目在哪个分段。

select *,

ntile(5) over(order by create_date) as tile_index

from ods_sales_orders;
with temp as 
(
select *,

ntile(5) over(order by create_date) as tile_index

from ods_sales_orders
) 
select *
from temp
where tile_index = 2; -- 20%-40%
1592976373523
©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 213,752评论 6 493
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 91,100评论 3 387
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 159,244评论 0 349
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 57,099评论 1 286
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 66,210评论 6 385
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 50,307评论 1 292
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 39,346评论 3 412
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 38,133评论 0 269
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 44,546评论 1 306
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 36,849评论 2 328
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 39,019评论 1 341
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 34,702评论 4 337
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 40,331评论 3 319
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 31,030评论 0 21
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 32,260评论 1 267
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 46,871评论 2 365
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 43,898评论 2 351

推荐阅读更多精彩内容