本篇介绍Hive三种方法,优化多count(distinct )
先上待优化代码:
select count(distinct sid) as sid
,count(distinct entity_id) as entity_id
,count(distinct billing_status_code) as billing_status_code
from c_detail
where cal_dt='2020-03-30';
因为count(disticnt)需要去重操作,需要将所有数据放到同一task去重,只会产生一个reduce task。如果数据量过大,成为性能瓶颈
两种普通操作:
一、负载均衡法(将一步到位转成两步 ①去重 ②count)
select max(case when col = 'sid' then cn end) as sid
,max(case when col = 'entity_id' then cn end) as entity_id
,max(case when col = 'billing_status_code' then cn end) as billing_status_code
from
(select count(1) as cn,'sid' as col from (select sid from c_detail where cal_dt='2020-03-30' group by sid)a union all
select count(1) as cn,'entity_id' as col from (select entity_id from c_detail where cal_dt='2020-03-30' group by entity_id)a union all
select count(1) as cn,'billing_status_code' as col from (select billing_status_code from c_detail where cal_dt='2020-03-30' group by billing_status_code)a )b ;
优点:负载均衡,完全解决一个reduce产生的性能瓶颈
缺点:写起来比较麻烦,需要每个字段单独去重union all到一起。最后还得行转列,如果几十个count(distinct) 写死也不为过。
二、省事法 ①先组合去重减少数据量 ②count(distinct) 第二步仍然是一个reduce,但是数据量减少了。
select count(distinct sid),count(distinct entity_id),count(distinct billing_status_code)
from
(select sid,entity_id,billing_status_code
from c_detail
where cal_dt='2020-03-30'
group by sid,entity_id,billing_status_code)a;
优点:写起来简单
缺点:如果去重后数据量仍然非常大,还是跑不出来,如果sid是用户id基数很大,其他两列基数很小,这个去重效果就微乎其微,需要一二组合使用
组合折中法
select count(distinct sid),count(distinct entity_id),count(distinct billing_status_code)
from
(select entity_id,billing_status_code from c_detail
where cal_dt='2020-03-30'
group by entity_id,billing_status_code)a
join (select count(1) as sid from (select distinct sid from c_detail
where cal_dt='2020-03-30') b)c on 1=1;
如果基数低字段较多的话,组合使用写起来也非常烦
三、高端大气grouping sets法,既写起来方便,又能负载均衡
select count(case when entity_id is null and billing_status_code is null then 1 end) as sid
,count(case when sid is null and billing_status_code is null then 1 end) as entity_id
,count(case when sid is null and entity_id is null then 1 end) as billing_status_code
from
(select sid,entity_id,billing_status_code
from c_detail
group by sid,entity_id,billing_status_code
grouping sets((sid),(entity_id),(billing_status_code)))a;
相当于使用grouping sets来代替 无数的group by+union all
缺点:grouping sets组过多会产生性能问题,grouping sets语法逻辑,会产生什么样的性能问题呢?请看下篇 Hive多维度聚合