Hive SQl 笔记

作为一个 SQL 盲，最近也写了点 hive sql 相关的东西，整理一些零散的笔记

表数据核对：两个表表结构一样，如何进行全表全字段数据校验，输出不一致的行？

select *,count(1) as count from(
select 全字段  from mgwh_mgplusmigrate_temp.
union all
select  全字段 from mgwh_mgplusmigrate_ods.) u
group by 全字段
having count(*) = 1

一个例子：

select *,count(1) as count from(
select contentid_whole,consumecode_id,consumecode_name,filename from mgwh_mgplusmigrate_temp.migu_plus_mgp_consumecode
union all
select contentid_whole,consumecode_id,consumecode_name,filename from mgwh_mgplusmigrate_ods.migu_plus_mgp_consumecodetmp) u
group by contentid_whole,consumecode_id,consumecode_name,filename
having count(*) = 1

思路：两个表全表查询并 union all 合并成一张表，对这张表所有字段进行 group by ，最后根据所以字段进行聚合 - count，输出 group by 之后 count(*) = 1 的即为匹配差异行。
如果全字段匹配能匹配上，count(*) > 1 （如果单表数据有重复还必须为偶数）

知识点：

记住union all 和 union：union all 不去重合并，union 去重合并（如何记住：all 代表所有都要）
理解 group by：根据字段聚合分组，每个分组只返回一行
如果要屏蔽 null 和 '' 的差异，可以采用函数 CALESCE(col,'') 来实现，COALESCE 是一个函数，COALESCE (expression_1, expression_2, ...,expression_n) 依次参考各参数表达式，遇到非 null 值即停止并返回该值。

延伸思考：select * from table group by id 为什么不行？
group by 顾名思义：分组，根据某个或者某几个属性分组，每个唯一的分组只返回一行；而分组后肯定会存在一对多的关系，所以不能直接 select *，对于非分组的字段需采用聚合操作保证这个字段多个聚合为一个值。

GROUP BY returns one row for each unique combination of fields in the GROUP BY clause. To ensure only one row, you would have to use an aggregate function - COUNT, SUM, MAX - without a GROUP BY clause.[1]

hive 不支持子查询，如何改写 in 和 not in 这类子查询

# in
select * from A where col_1 in (select col_1 from B)
# not in
select * from A where col_1 not in (select col_1 from B)

将这种常见类型的子查询改写，常规思路的：通过 left join 改写，从左表匹配，从右表未匹配到的列补 null。通过 null 和非 null 来区分左右表数据包含关系。

# in 
select * from A
left join from B
on A.col = B.col 
where B.col is not null

# not in
select * from A
left join from B
on A.col = B.col 
where B.col is null

通过韦恩图，图形化理解起来更容易一点：

in - 韦恩图

not in - 韦恩图

一图了解所有的 join

sql joins

如何对表数据进行去重？
union 可以实现表与表之间的全字段去重，但是根据某个字段或者某几个字段去重就可以用开窗函数 - ROW_NUMBER
ROW_NUMBER() over(partition BY userid ORDER BY request_time DESC) rn
这个表达式：根据 userid 进行计数（userid 相同的行从1开始自增），根据 request_time 降序排序

一个样例

insert overwrite table userpassidtmp
select userid,passid from (select userid,get_json_object(message,'$.response.result.body.passid') passid, 
ROW_NUMBER() over(partition BY userid ORDER BY request_time DESC) rn
from user_center_info_di where user_type < 2 
and operate_type in ('004003','004015') and resultcode = 0 and dayid=$day) t where t.rn = 1;

[1] how-does-group-by-work

最后编辑于：2019.06.26 17:00:24