背景
hive 中有一张待处理的分区表,存储的方式是parquet,处理之后的目标表是一张非分区的外部表,并且分隔方式为 “,”。
问题
部分记录的 int 类型字段 出现 null 情况
表结构
原表结构
CREATE EXTERNAL TABLE
adm_v1.adm_cms_flow_daily(
urlstring,
titlestring,
courseeduidstring,
courseedunamestring,
catalogidstring,
catalognamestring,
innercodestring,
adduserstring,
addtimestring,
originalstring,
pvstring,
uvstring,
downuvstring,
downpvstring,
platformstring,
areaidstring
)
PARTITIONED BY (
dnstring,
dtstring)
stored as parquet
TBLPROPERTIES ('parquet.compression'='gzip');
目标表结构
CREATE EXTERNAL TABLE
tmp_v1.tmp_zdh_test_biz_cms_flow_tmp_v4(
dtstring,
dnBIGINT,
addtimeBIGINT ,
adduserstring,
innercodestring,
urlstring,
titlestring,
originalBIGINT,
timestampTIMESTAMP,
pvBIGINT,
uvBIGINT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ;
处理语句
insert overwrite table tmp_v1.tmp_zdh_test_biz_cms_flow_tmp_v4
select from_unixtime(unix_timestamp(dt,'yyyyMMdd'),'yyyy-MM-dd') as dt,case dn when 'acc' then 1 when 'jianshe' then 2 when 'med' then 3 else 4 end as dn1,addtime,adduser,innercode,url,title,cast(original as bigint ),from_unixtime(unix_timestamp(),'yyyy-MM-dd HH:mm:ss'),cast(pv as bigint),cast(uv as bigint) from adm_v1.adm_cms_flow_daily where dn='acc' and dt='20201110' and addtime='20201110' ;
问题展示
原表中的original 字段 存在 1 的值
file
导入到tmp_v1.tmp_zdh_test_biz_cms_flow_tmp_v4表之后 original 字段变为null 值
file
问题分析/定位
将有问题的记录取出,观察数据,要特别注意一下original 的前一个字段是title,title 中可能存在逗号影响目标表的存储。
查询原表中记录
file
查询目标中记录
file
从上面两个查询结果对比可以看出,title 字段在逗号的位置被截断了。这样问题就找到了,替换目标表的分隔符为 \u0001。 重新观察
CREATE EXTERNAL TABLE
tmp_v1.tmp_zdh_test_biz_cms_flow_tmp_v8(
dtstring,
dnBIGINT,
addtimeBIGINT ,
adduserstring,
innercodestring,
urlstring,
titlestring,
originalBIGINT,
pvBIGINT,
uvBIGINT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\u0001' ;
file
问题解决。
引出的问题
最后一列跑哪去了?
测试:最后两列是pv、uv,从下图看出来,最后一列被舍弃了。
file
本文由博客群发一文多发等运营工具平台 OpenWrite 发布





