背景
hive 中有一张待处理的分区表,存储的方式是parquet,处理之后的目标表是一张非分区的外部表,并且分隔方式为 “,”。
问题
部分记录的 int 类型字段 出现 null 情况
表结构
原表结构
CREATE EXTERNAL TABLE
adm_v1.adm_cms_flow_daily
(
url
string,
title
string,
courseeduid
string,
courseeduname
string,
catalogid
string,
catalogname
string,
innercode
string,
adduser
string,
addtime
string,
original
string,
pv
string,
uv
string,
downuv
string,
downpv
string,
platform
string,
areaid
string
)
PARTITIONED BY (
dn
string,
dt
string)
stored as parquet
TBLPROPERTIES ('parquet.compression'='gzip');
目标表结构
CREATE EXTERNAL TABLE
tmp_v1.tmp_zdh_test_biz_cms_flow_tmp_v4
(
dt
string,
dn
BIGINT,
addtime
BIGINT ,
adduser
string,
innercode
string,
url
string,
title
string,
original
BIGINT,
timestamp
TIMESTAMP,
pv
BIGINT,
uv
BIGINT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ;
处理语句
insert overwrite table tmp_v1.tmp_zdh_test_biz_cms_flow_tmp_v4
select from_unixtime(unix_timestamp(dt,'yyyyMMdd'),'yyyy-MM-dd') as dt,case dn when 'acc' then 1 when 'jianshe' then 2 when 'med' then 3 else 4 end as dn1,addtime,adduser,innercode,url,title,cast(original as bigint ),from_unixtime(unix_timestamp(),'yyyy-MM-dd HH:mm:ss'),cast(pv as bigint),cast(uv as bigint) from adm_v1.adm_cms_flow_daily where dn='acc' and dt='20201110' and addtime='20201110' ;
问题展示
原表中的original 字段 存在 1 的值
导入到tmp_v1.tmp_zdh_test_biz_cms_flow_tmp_v4表之后 original 字段变为null 值
问题分析/定位
将有问题的记录取出,观察数据,要特别注意一下original 的前一个字段是title,title 中可能存在逗号影响目标表的存储。
查询原表中记录
查询目标中记录
从上面两个查询结果对比可以看出,title 字段在逗号的位置被截断了。这样问题就找到了,替换目标表的分隔符为 \u0001。 重新观察
CREATE EXTERNAL TABLE
tmp_v1.tmp_zdh_test_biz_cms_flow_tmp_v8
(
dt
string,
dn
BIGINT,
addtime
BIGINT ,
adduser
string,
innercode
string,
url
string,
title
string,
original
BIGINT,
pv
BIGINT,
uv
BIGINT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\u0001' ;
问题解决。
引出的问题
最后一列跑哪去了?
测试:最后两列是pv、uv,从下图看出来,最后一列被舍弃了。
本文由博客群发一文多发等运营工具平台 OpenWrite 发布