Compression and storage in the offline data warehouse

I. Log (user behavior) data

1. The Flume HDFS sink sets the compression format used for log data written to HDFS

a1.sinks.k1.hdfs.fileType = CompressedStream
a1.sinks.k1.hdfs.codeC = lzop

2. Create the LZO table and load the data

CREATE EXTERNAL TABLE ods_log (`line` string)
PARTITIONED BY (`dt` string) -- partition by date
STORED AS -- storage format: data is read with DeprecatedLzoTextInputFormat
  INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/warehouse/gmall/ods/ods_log'  -- where the data is stored on HDFS
;
load data inpath '/origin_data/gmall/log/topic_log/2020-08-20'
into table ods_log partition(dt='2020-08-20'); -- load the data

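3. Move the data from HDFS into the Hive external table partition /warehouse/gmall/ods/ods_log/dt=$do_date

The job is submitted to the hive queue, and an LZO index is built so that the files can be split when they are read later.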

$hadoop jar /opt/module/hadoop-3.1.3/share/hadoop/common/hadoop-lzo-0.4.20.jar \
com.hadoop.compression.lzo.DistributedLzoIndexer \
-Dmapreduce.job.queuename=hive \
/warehouse/gmall/ods/ods_log/dt=$do_date
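
Steps 2 and 3 are usually run once per day, so they can be wrapped in a small script. The following is only a sketch under assumptions: the function names are hypothetical, the paths and queue name are copied from the commands above, the database name gmall is taken from the hive prompt, and the default date relies on GNU date. The functions only print the statements; nothing is executed until the output is handed to hive or sh.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical; paths come from the notes.

# Use the date passed as $1, or default to yesterday (GNU date syntax).
resolve_do_date() {
  if [ -n "$1" ]; then
    echo "$1"
  else
    date -d '-1 day' +%F
  fi
}

# Print the HiveQL that moves one day of raw logs into its ods_log partition.
build_load_sql() {
  echo "load data inpath '/origin_data/gmall/log/topic_log/$1' into table gmall.ods_log partition(dt='$1');"
}

# Print the hadoop command that builds the LZO index files for that partition.
build_index_cmd() {
  echo "hadoop jar /opt/module/hadoop-3.1.3/share/hadoop/common/hadoop-lzo-0.4.20.jar com.hadoop.compression.lzo.DistributedLzoIndexer -Dmapreduce.job.queuename=hive /warehouse/gmall/ods/ods_log/dt=$1"
}
```

A daily run could then look like: do_date=$(resolve_do_date "$1"), followed by hive -e "$(build_load_sql "$do_date")" and sh -c "$(build_index_cmd "$do_date")". Printing first makes it easy to inspect the exact statements before they touch the cluster.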

4. How the dwd layer reads table data that is LZO-compressed and indexed

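(1) Count the rows with two different queries
hive (gmall)> select * from ods_log;
2955 row(s)
hive (gmall)> select count(*) from ods_log;
2959 row(s)
(2) The two results do not match.
The reason is that select * from ods_log does not launch a MapReduce job. It reads through the DeprecatedLzoTextInputFormat named in the ods_log DDL, which recognizes the lzo.index files as index files.
select count(*) from ods_log does launch a MapReduce job, and Hive's default input format, CombineHiveInputFormat, does not recognize lzo.index files as index files. It treats them as ordinary data files, so their contents are counted as extra rows. Worse, it also prevents the LZO files from being split.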
hive (gmall)> set hive.input.format;
hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat
Fix: change hive.input.format from CombineHiveInputFormat to HiveInputFormat
(3) Test again
hive (gmall)>
SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;

hive (gmall)> select * from ods_log;
Time taken: 0.706 seconds, Fetched: 2955 row(s)
hive (gmall)> select count(*) from ods_log;
2955 row(s)
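
To repeat the comparison outside the interactive prompt, the same two statements can be generated from the shell. This is a minimal sketch: count_sql and the two variable names are hypothetical helpers, and the table name gmall.ods_log assumes the gmall database shown in the prompt.

```shell
# Sketch: print a HiveQL snippet that sets the session's input format
# and then counts the rows of a table. count_sql is a hypothetical helper.
count_sql() {
  printf 'SET hive.input.format=%s; select count(*) from %s;\n' "$1" "$2"
}

# The two input formats compared in the test above.
COMBINE=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat
PLAIN=org.apache.hadoop.hive.ql.io.HiveInputFormat
```

Running hive -e "$(count_sql "$COMBINE" gmall.ods_log)" and then hive -e "$(count_sql "$PLAIN" gmall.ods_log)" should show the same discrepancy and its fix, provided the partition contains index files.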

Tables in the dwd layer are created with the Parquet storage format shown below.
PARTITIONED BY (dt string)
stored as parquet
LOCATION '/warehouse/gmall/dwd/dwd_error_log'
TBLPROPERTIES('parquet.compression'='lzo');
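Note: Parquet storage is splittable on its own, so no index needs to be built for the data.
If the data is stored as plain text, it needs a splittable compression format, that is, lzop compression plus an index.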
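
Because a text-format LZO file only becomes splittable once its index exists, it can help to check that every .lzo file in a directory has one. The sketch below assumes the index file is named after the data file with .index appended (the lzo.index files mentioned above); missing_lzo_indexes is a hypothetical helper that reads one path per line from standard input.

```shell
# Sketch: read one path per line on stdin and print every .lzo file
# that has no matching .lzo.index file in the same listing.
missing_lzo_indexes() {
  listing=$(cat)
  printf '%s\n' "$listing" | grep '\.lzo$' | while IFS= read -r f; do
    # -x: match the whole line, -F: fixed string, -q: no output
    printf '%s\n' "$listing" | grep -qxF "$f.index" || printf '%s\n' "$f"
  done
}
```

One way to feed it is hadoop fs -ls -C /warehouse/gmall/ods/ods_log/dt=$do_date | missing_lzo_indexes, where -C prints paths only. Empty output means every file is indexed.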

II. Business (operational) data

1. When importing into HDFS with Sqoop, compress with LZO and build the LZO index

import_data(){
$sqoop import \
--connect jdbc:mysql://hadoop102:3306/gmall \
--username root \
--password 000000 \
--target-dir /origin_data/gmall/db/$1/$do_date \
--delete-target-dir \
--query "$2 and  \$CONDITIONS" \
--num-mappers 1 \
--fields-terminated-by '\t' \
--compress \
--compression-codec lzop \
--null-string '\\N' \
--null-non-string '\\N'

hadoop jar /opt/module/hadoop-3.1.3/share/hadoop/common/hadoop-lzo-0.4.20.jar \
com.hadoop.compression.lzo.DistributedLzoIndexer \
/origin_data/gmall/db/$1/$do_date
}
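
import_data takes the target directory name as $1 and a SQL query as $2, and appends " and $CONDITIONS" to the query. That means the query passed in must already end in a WHERE clause. A minimal sketch of the expansion follows; build_query is a hypothetical helper that mirrors the --query argument, and the table name order_info is an assumption based on the ods_order_info path in these notes.

```shell
# Sketch: mirror how import_data builds its --query argument.
# The caller's SQL must already contain a WHERE clause, because
# " and $CONDITIONS" is appended; Sqoop substitutes $CONDITIONS at run time.
build_query() {
  printf '%s and $CONDITIONS\n' "$1"
}

# Hypothetical call: "where 1=1" supplies a WHERE clause when no filter is needed.
build_query "select * from order_info where 1=1"
```

The matching call in the import script would be import_data order_info "select * from order_info where 1=1". A query without a WHERE clause would expand to invalid SQL.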

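2. The ods-layer business tables use LZO compression, with the input and output formats set explicitly

STORED AS -- storage format: read with DeprecatedLzoTextInputFormat, write with HiveIgnoreKeyTextOutputFormat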
  INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
location '/warehouse/gmall/ods/ods_order_info/' -- where the data is stored on HDFS
;

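3. The dwd-layer business tables use LZO compression with Parquet storage

-- Set HiveInputFormat; the default CombineHiveInputFormat must not be used here
SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
-- Switch the table's storage format to Parquet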
stored as parquet
location '/warehouse/gmall/dwd/dwd_dim_sku_info/'
tblproperties ("parquet.compression"="lzo");

III. The dwd, dws, and dwt layers all use Parquet storage with LZO compression.

IV. The ads layer holds relatively little data, so compression can be skipped.

When reading data from the dws/dwt layers there is no need to set HiveInputFormat.
The default CombineHiveInputFormat handles Parquet, which is splittable on its own.
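
The rule running through these notes is that the input format depends on the layer a query reads from. It can be captured in a small helper. This is a sketch only: input_format_for is a hypothetical name, and it covers just the cases the notes state (ods read through LZO text files, dws/dwt read through Parquet).

```shell
# Sketch: choose hive.input.format from the layer a query reads.
# input_format_for is a hypothetical helper name.
input_format_for() {
  case "$1" in
    ods)
      # Text + lzop + index files: the default format would read the
      # index files as data, so force HiveInputFormat.
      echo org.apache.hadoop.hive.ql.io.HiveInputFormat ;;
    dws|dwt)
      # Parquet is splittable on its own, so the default is fine.
      echo org.apache.hadoop.hive.ql.io.CombineHiveInputFormat ;;
    *)
      echo "unknown layer: $1" >&2
      return 1 ;;
  esac
}
```

A load script for a dwd table, which reads from ods, could start its HiveQL with SET hive.input.format=$(input_format_for ods);.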