Hive优化之多count(distinct)

本篇介绍Hive三种方法,优化多count(distinct )


先上待优化代码:

select count(distinct sid) as sid

      ,count(distinct entity_id) as entity_id

      ,count(distinct billing_status_code) as billing_status_code

  from c_detail

where cal_dt='2020-03-30';

因为count(disticnt)需要去重操作,需要将所有数据放到同一task去重,只会产生一个reduce task。如果数据量过大,成为性能瓶颈


两种普通操作:

一、负载均衡法(将一步到位转成两步 ①去重  ②count)

select max(case when col = 'sid' then cn end) as sid

      ,max(case when col = 'entity_id' then cn end)  as entity_id

  ,max(case when col = 'billing_status_code' then cn end) as billing_status_code

  from

(select count(1) as cn,'sid' as col from (select sid from c_detail where cal_dt='2020-03-30' group by sid)a union all

select count(1) as cn,'entity_id' as col from (select entity_id from c_detail where cal_dt='2020-03-30' group by entity_id)a union all

select count(1) as cn,'billing_status_code' as col from (select billing_status_code from c_detail where cal_dt='2020-03-30' group by billing_status_code)a )b ;

优点:负载均衡,完全解决一个reduce产生的性能瓶颈

缺点:写起来比较麻烦,需要每个字段单独去重union all到一起。最后还得行转列,如果几十个count(distinct) 写死也不为过。


二、省事法 ①先组合去重减少数据量 ②count(distinct) 第二步仍然是一个reduce,但是数据量减少了。

select count(distinct sid),count(distinct entity_id),count(distinct billing_status_code)

  from

(select sid,entity_id,billing_status_code

  from c_detail

where cal_dt='2020-03-30'

group by sid,entity_id,billing_status_code)a;

优点:写起来简单

缺点:如果去重后数据量仍然非常大,还是跑不出来,如果sid是用户id基数很大,其他两列基数很小,这个去重效果就微乎其微,需要一二组合使用

组合折中法

select count(distinct sid),count(distinct entity_id),count(distinct billing_status_code)

  from

(select  entity_id,billing_status_code from c_detail

where cal_dt='2020-03-30'

group by entity_id,billing_status_code)a

join (select  count(1) as sid from (select  distinct sid from c_detail

where cal_dt='2020-03-30') b)c on 1=1;

如果基数低字段较多的话,组合使用写起来也非常烦


三、高端大气grouping sets法,既写起来方便,又能负载均衡

select count(case when entity_id is null and billing_status_code is null then 1 end) as sid

      ,count(case when sid is null and billing_status_code is null then 1 end) as entity_id

  ,count(case when sid is null and entity_id is null then 1 end) as billing_status_code

  from

(select sid,entity_id,billing_status_code

  from c_detail

group by sid,entity_id,billing_status_code

grouping sets((sid),(entity_id),(billing_status_code)))a;

相当于使用grouping sets来代替 无数的group by+union all

缺点:grouping sets组过多会产生性能问题,grouping sets语法逻辑,会产生什么样的性能问题呢?请看下篇 Hive多维度聚合

最后编辑于
©著作权归作者所有,转载或内容合作请联系作者
平台声明:文章内容(如有图片或视频亦包括在内)由作者上传并发布,文章内容仅代表作者本人观点,简书系信息发布平台,仅提供信息存储服务。

推荐阅读更多精彩内容