前言
hive中行转列和列转行比较常见,一般是用于中间数据的生产。
行转列
行转列一般采用IF进行判断,筛选出本列对应的数据进行统计。
- 基础数据
unique_id | sex | age |
---|---|---|
561635f8aa0cc1195238dbaf0c2044e3 | Female | 20 |
5ef957bb74e46a0f7b61d56d80f01689 | Male | 30 |
fbb891c344394007e00b3bd51ed65e2e | Female | 25 |
- ETL转化
SELECT age
, count(if(sex = 'Female', 1, NULL)) AS Female
, count(if(sex = 'Male', 1, NULL)) AS Male
FROM person_basic_info_all
GROUP BY age
ORDER BY age
- 输出结果
age | Female | Male |
---|---|---|
10 | 10 | 20 |
20 | 30 | 30 |
30 | 25 | 25 |
炸裂函数explode,将数组转换成多行
- 基础数据
unique_id | certificate_id |
---|---|
34f228cc5176d87907cbefc9ed5d01f6 | 01095583,P0104378,30342-01 |
c3db65dca36fa59a368204fd18f5bc82 | 51425538,EC6311054,0073774-01 |
21e4e7491e3ef58307ade2ac8e5e3c90 | 01092425,P0058397,1045-12 |
- ETL转化
SELECT explode(split(certificate_id, ',')) AS certificate_id FROM certification_info
- 输出结果
certificate_id
01095583
EC6311054
0073774-01
P0058397
如果我们直接
SELECT unique_id, explode(split(certificate_id, ',')) AS certificate_id FROM certification_info
就会报错
Error while compiling statement: FAILED: SemanticException [Error 10081]: UDTF's are not supported outside the SELECT clause, nor nested in expressions
那么,我们由此引入LATERAL VIEW函数
1.Lateral View 用于和UDTF函数【explode,split】结合来使用。
2.首先通过UDTF函数将数据拆分成多行,再将多行结果组合成一个支持别名的虚拟表。
3..主要解决在select使用UDTF做查询的过程中查询只能包含单个UDTF,不能包含其它字段以及多个UDTF的情况。
4.语法:LATERAL VIEW udtf(expression) tableAlias AS columnAlias (',' columnAlias)
列转行(Lateral View + explode)
对一列进行炸裂,其他列保持不变
SELECT unique_id, certificate_id FROM certification_info
LATERAL VIEW explode(certificate_id) tmpTable AS certificate_id
- 输出结果
unique_id | certificate_id |
---|---|
34f228cc5176d87907cbefc9ed5d01f6 | 01095583 |
34f228cc5176d87907cbefc9ed5d01f6 | P0104378 |
34f228cc5176d87907cbefc9ed5d01f6 | 30342-01 |
c3db65dca36fa59a368204fd18f5bc82 | 51425538 |
c3db65dca36fa59a368204fd18f5bc82 | EC6311054 |
21e4e7491e3ef58307ade2ac8e5e3c90 | 01092425 |
列转行(Lateral View + posexplode)
对多列进行炸裂,其他列保持不变
- 基础数据
unique_id | ic_number | passport_number | birth_certificate_id |
---|---|---|---|
561635f8aa0cc1195238dbaf0c2044e3 | 01052408 | C0544935 | 36066 |
5ef957bb74e46a0f7b61d56d80f01689 | 01107410 | P0093218 | 40067-01 |
fbb891c344394007e00b3bd51ed65e2e | 01041475 | C0474512 | 22612 |
- ETL转化
SELECT unique_id, certificate_type, certificate_id
FROM (
SELECT unique_id, certificate_type, certificate_id
FROM (
SELECT unique_id
, concat_ws(',', 'ic_number', 'passport_number', 'birth_certificate_id') AS certificate_type_tuple
, concat_ws(',', ic_number, passport_number, birth_certificate_id) AS certificate_id_tuple
FROM person_basic_info_all
) person
LATERAL VIEW posexplode(split(certificate_type_tuple, ',')) certificate_type_tuple AS idx_type, certificate_type
LATERAL VIEW posexplode(split(certificate_id_tuple, ',')) certificate_id_tuple AS idx_id, certificate_id
WHERE idx_type = idx_id
) tmp
- 输出结果
unique_id | certificate_type | certificate_id |
---|---|---|
5ef957bb74e46a0f7b61d56d80f01689 | passport_number | X419700 |
81536afe4de7de3d33e0c9d864b54f3a | ic_number | 50687170 |
d45af8297c01ebea84782d2b3e72cb0c | birth_certificate_id | 0090801-01 |
结
行和列转换的问题主要解决存储和计算过程中关联关系不对等的问题。