Hive小技巧

转载请在文章起始处注明出处，谢谢。

1、取字符串某个位置的值

比如，我要取‘abcd’的b，有两种取法：

select split('abcd','')[1]
select substr('abcd',2,1)

2、我要取a表的用户，b表的用户，还有a、b重复的用户

这里可以用sum() over(partition by) 来取，下面举个例子：
假如a表和b表结构一样，都有dt,pin,type。

取数代码示意如下：

set hive.merge.mapfiles = true;
set hive.merge.mapredfiles = true;
set hive.merge.size.per.task = 512000000;
set hive.merge.smallfiles.avgsize = 512000000;
set mapred.max.split.size = 512000000;
set mapred.min.split.size.per.node = 512000000;
set mapred.min.split.size.per.rack = 512000000;
set hive.input.format = org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
set hive.hadoop.supports.splittable.combineinputformat = true;
set hive.exec.dynamic.partition = true;
set hive.exec.dynamic.partition.mode = nostrict;
set hive.exec.max.dynamic.partitions.pernode = 1000;
set hive.exec.parallel = true;
set hive.exec.parallel.thread.number = 16;
set hive.new.job.grouping.set.cardinality = 5000;
select
    t2.dt,
    count(distinct case
        when t2.t = 1 then t2.pin
        end) a的人数,
    count(distinct case
        when t2.t = 2 then t2.pin
        end) b的人数,
    count(distinct case
        when t2.t >= 3 then t2.pin
        end) ab重复人数
from 
    (
        select
            t1.dt,
            t1.pin,
            t1.type,
            sum(t1.type) over(partition by t1.pin) t
        from 
            (
                表a a 
                union all
                表b b 
            )
            t1 
        group by 
            t1.dt,
            t1.pin,
            t1.type
    )
    t2
group by 
    t2.dt

注意：t2子查询里的t1.type是一定要写的，不然会报错：

FAILED: SemanticException Failed to breakup Windowing invocations into Groups. At least 1 group must only depend on input columns. Also check for circular dependencies.

Underlying error: org.apache.hadoop.hive.ql.parse.SemanticException: Line 4:16 Invalid column reference 'type'

3、如果拿到一条数据，不需要跑底表可以测试代码是否准确。

比如我想把一段string改成array，底表很大，用代码测试一次需要很长时间，那可以通过with as来做一个临时表，方便测试。

hive> with releases as (select '["us","ca","fr"]' as country)
    > select  split(regexp_extract(country,'^\\["(.*)\\"]$',1),'","')
    > from    releases
    > ;
OK
_c0
["us","ca","fr"]

最后编辑于：2021.01.22 18:32:27

©著作权归作者所有,转载或内容合作请联系作者
【社区内容提示】社区部分内容疑似由AI辅助生成，浏览时请结合常识与多方信息审慎甄别。
平台声明：文章内容（如有图片或视频亦包括在内）由作者上传并发布，文章内容仅代表作者本人观点，简书系信息发布平台，仅提供信息存储服务。

Hive小技巧

Hive小技巧

转载请在文章起始处注明出处，谢谢。

1、取字符串某个位置的值

2、我要取a表的用户，b表的用户，还有a、b重复的用户

3、如果拿到一条数据，不需要跑底表可以测试代码是否准确。

相关阅读更多精彩内容

友情链接更多精彩内容