hive sql 练习

数据和题目源：数据蛙https://www.jianshu.com/u/1f32f227da5f

建表：使用hive sql的CTAS 快速建表语法复制表到自己的数据库

create table frog_qing.ods_sales_orders as select * from adventure_ods.ods_sales_orders;
create table frog_qing.ods_cutomer as select * from adventure_ods.ods_customer;

ods_sales_orders表的列说明如下：

1.PNG
ods_cutomer表的列说明如下：

2.PNG

题目汇总

题目一：每个用户截止到每月为止的最大交易金额和累计到该月的总交易金额，结果数据格式如下
题目二：计算用户的回购率和复购率
题目三：求用户号对应不同的产品]
题目四：统计各个省份所属城市下最受欢迎的Top 3产品和其销量（不能出现有null）
题目五：商品的销售数量top10，排名需考虑并列排名的情况
题目六：计算累计和(统计2019年1-12月的累积销量，即1月为1月份的值，2月为1、2月份值的和，3月为1、2、3月份的和，12月为1-12月份值的和
题目七计算客户平均购买一次商品的间隔时间
题目八：查询最近前20%时间的订单信息

每个用户截止到每月为止的最大交易金额和累计到该月的总交易金额，结果数据格式如下
1.1 先将日期转为By 年/月格式,按照客户和年月进行分组
1.2 计算分组后的最大交易金额和累计总交易金额

select 
    b.customer_key,
    b.umonth,
    b.ucount,
    b.current_max,
    sum(b.sum_price) over(partition by b.customer_key order by b.umonth) as current_sum
from (
    select 
        customer_key,
        date_format(create_date, 'YYYY-MM-01') as umonth,
        count(sales_order_key) as ucount,
        max(unit_price) as current_max,
        sum(unit_price) as sum_price
    from ods_sales_orders
    group by customer_key,date_format(create_date, 'YYYY-MM-01')
) b
limit 50;

计算用户的回购率和复购率
2.1 复购率，一个月内购买了多次的用户占总用户的比例

select 
    a.umonth,
    count(a.ucount) as user_c,
    sum(if(a.ucount>1, 1, 0))/count(ucount) as fugou
from (
    select 
        date_format(create_date, 'YYYY-MM-01') AS umonth,
        count(customer_key) as ucount
    from ods_sales_orders
    group by date_format(create_date, 'YYYY-MM-01'),customer_key
) a
group by a.umonth
limit 20;

2.2 回购率，当前月份购买且下个月份也购买的客户占当月所有月份客户比例，思路：现将当月数据和下个月数据进行连接，连接后按月分组计算回购人数和比例

select 
    a.umonth,
    count(a.customer_key) as pay_user,
    count(b.customer_key) as fg_user,
    concat((count(b.customer_key)/count(a.customer_key))*100,'%') as fg_rate
from (
    select 
        date_format(create_date, 'YYYY-MM-01') as umonth,
        customer_key
    from ods_sales_orders a
    group by date_format(create_date, 'YYYY-MM-01'),customer_key) a
left join (
    select 
        date_format(create_date, 'YYYY-MM-01') as umonth,
        customer_key
    from ods_sales_orders a
    group by date_format(create_date, 'YYYY-MM-01'),customer_key) b
on a.customer_key=b.customer_key 
and add_months(a.umonth,1)=b.umonth
group by a.umonth;

求用户号对应不同的产品,理解：求出用户第一次和第二次购买的产品
注意点：
3.1.1 lead()窗口函数求出窗口的后一个值，当为Null值时，使用where函数判断是否相等时会过滤掉这些null值，而我们需要保留这些null值，使用nvl()将null值转换，进行保留
3.1.2 collect_set(字段)根据某个字段分组后，把分在一组的数据进行去重，去重后的数据按分组形成集合，默认分隔符','
3.1.3 concat_ws('-',collect_set(xx)) 可将分组后的数据使用指定分隔符链接

方法一、按照数据蛙答案修改,筛选出来购买大于两次的用户，且购买产品不同的用户和对应产品

with tmp as (
    select 
        a.customer_key,
        a.cpzl_zw,
        a.rk,
        a.second_cpzl,
        row_number() over(partition by customer_key order by rk) as rn
    from (
        select 
            customer_key,
            cpzl_zw,
            row_number() over(partition by customer_key order by create_date) as rk,
            lead(cpzl_zw,1, null) over(partition by customer_key order by create_date) as second_cpzl
        from ods_sales_orders
    ) a
    where a.cpzl_zw != a.second_cpzl
)
select concat(customer_key,'-',cpzl_zw, '-',second_cpzl),count(1)
from tmp
where tmp.rn < 2
limit 20;

方法二、利用collect_set特性，保留了购买一次客户的产品和两次购买相同产品的客户

select 
    a.customer_key,
    count(1),
    concat(a.customer_key, '-',concat_ws('-', collect_set(a.cpzl_zw))) as product
from (
    select 
        customer_key,
        cpzl_zw,
        row_number() over(partition by customer_key order by create_date) as rn
    from ods_sales_orders
) a
where a.rn <3
group by a.customer_key
limit 20;

统计各个省份所属城市下最受欢迎的Top 3产品和其销量（不能出现有null）
4.1 求出每个城市各产品的销量

select 
    b.chinese_city,
    a.cpzl_zw,
    count(a.sales_order_key) as sales
from ods_sales_orders a
left join ods_cutomer b
on a.customer_key = b.customer_key
group by b.chinese_city,a.cpzl_zw
limit 20;

4.2 求出按城市按产品销量的排名

select 
    c.chinese_city,
    c.cpzl_zw,
    c.sales,
    row_number() over(partition by c.chinese_city order by c.sales desc) rn
from (
    select 
        b.chinese_city,
        a.cpzl_zw,
        count(a.sales_order_key) as sales
    from ods_sales_orders a
    left join ods_cutomer b
    on a.customer_key = b.customer_key
    group by b.chinese_city,a.cpzl_zw
) c
where c.chinese_city is not null and c.chinese_city != 'null'
limit 20;

4.3 筛选每个城市top3的产品

select 
    d.chinese_city,
    d.cpzl_zw,
    d.sales,
    d.rn
from (
    select 
        c.chinese_city,
        c.cpzl_zw,
        c.sales,
        row_number() over(partition by c.chinese_city order by c.sales desc) rn
    from (
        select 
            b.chinese_city,
            a.cpzl_zw,
            count(a.sales_order_key) as sales
        from ods_sales_orders a
        left join ods_cutomer b
        on a.customer_key = b.customer_key
        group by b.chinese_city,a.cpzl_zw
    ) c
    where c.chinese_city is not null and c.chinese_city != 'null'
) d
where d.rn <4;

商品的销售数量top10，排名需考虑并列排名的情况
5.1 计算每个产品的销量

select 
    cpzl_zw,
    count(1) as sales
from ods_sales_orders
group by cpzl_zw
limit 20;

5.2 按销量进行排名，筛选前10销量
DENSE_RANK() 生成数据项在分组中的排名，相同值序号一样，不跳过下个序号

select *
from (
    select 
        a.cpzl_zw,
        a.sales,
        dense_rank() over(order by a.sales) as rn
    from (
        select 
            cpzl_zw,
            count(1) as sales
        from ods_sales_orders
        group by cpzl_zw
    ) a
    ) b
where b.rn <=10;

计算累计和(统计2019年1-12月的累积销量，即1月为1月份的值，2月为1、2月份值的和，3月为1、2、3月份的和，12月为1-12月份值的和)
6.1 计算每个月的销量

select 
    date_format(create_date, 'YYYY-MM-01') as month_date,
    COUNT(sales_order_key) as sale_amount
from ods_sales_orders
group by date_format(create_date, 'YYYY-MM-01');

6.2 计算累计销量
ORDER BY create_date 按月份对查询读取的记录进行排序，就是窗口范围内的排序

ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW 定义起点和终点，UNBOUNDED PRECEDING 为起点，表明从第一行开始, CURRENT ROW为默认值，就是这一句等于：ROWS UNBOUNDED PRECEDING PRECEDING：在前 N 行的意思，FOLLOWING：在后 N 行的意思。

select 
    a.month_date,
    a.sale_amount,
    sum(a.sale_amount) over(order by month_date asc ORDER BY create_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as sale_amount_consum
from (
    select 
        date_format(create_date, 'YYYY-MM-01') as month_date,
        COUNT(sales_order_key) as sale_amount
    from ods_sales_orders
    group by date_format(create_date, 'YYYY-MM-01')
) a;

计算客户平均购买一次商品的间隔时间
7.1 得出用户两次购买的时间，使用lead()函数
LEAD(col,n,DEFAULT) 用于统计窗口内往下第n行值，第一个参数为列名，第二个参数为往下第n行（可选，默认为1），第三个参数为默认值（当往下第n行为NULL时候，取默认值，如不指定，则为NULL）所以使用LEAD (时间,1)，取客户的时间，按照客户编号分组再按时间排序，所以使用over(partition by 客户编号 order by 时间)

select 
    customer_key,
    create_date,
    lead(create_date,1,null) over(partition by customer_key order by create_date) as next_date
from ods_sales_orders
limit 20;

7.2 计算两次购买时间差,再计算出每个客户的平均时间差

select 
    a.customer_key,
    avg(datediff(a.next_date, a.create_date)) as avg_days
from (
    select 
        customer_key,
        create_date,
        lead(create_date,1,'2019-01-01') over(partition by customer_key order by create_date) as next_date
    from ods_sales_orders
) a 
where datediff(a.next_date, a.create_date)>0
group by a.customer_key limit 20;

查询最近前20%即前1/5时间的订单信息
8.1 将时间按降序分为5组，使用nitle()窗口函数将日期降序分位5组

select 
    *,
    ntile(5) over(order by create_date desc) gid
from ods_sales_orders limit 10;

8.2 取出时间分组为第一组的订单信息

select *
from (
    select 
        *,
        ntile(5) over(order by create_date desc) gid
    from ods_sales_orders
) a
where a.gid=1;

hive sql 练习

题目汇总

推荐阅读更多精彩内容