A short summary of file formats and compression algorithms in Hadoop/Hive. Contents:
0. Overview
1. File Formats
2. Compression Algorithms
3. Others
4. Reference
Overview
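File formats and compression algorithms are a high-priority optimization point in big data systems, and the two are usually tuned together.
1. File Formats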
A file format is the way information is stored or encoded in a computer file. In Hive, it refers to how records are stored inside the file. Because Hive deals with structured data, each record has its own structure, and the way records are encoded in a file defines the file format.
file format | characteristics | hive storage option |
---|---|---|
TextFile | plain text, default format | STORED AS TEXTFILE |
SequenceFile | row-based, binary key-value, splittable | STORED AS SEQUENCEFILE |
Avro | row-based, binary or JSON, splittable | STORED AS AVRO |
RCFile | columnar, RLE | STORED AS RCFILE |
ORCFile | Optimized RC, Flatten | STORED AS ORC |
Parquet | column-oriented binary file, Nested | STORED AS PARQUET |
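To make the row-based versus columnar distinction in the table concrete, here is a minimal sketch in plain Python. The three records and their field names (id, country, amount) are made up for illustration; real formats such as ORC or Parquet add metadata, encodings, and block structure on top of this basic idea.

```python
# Three hypothetical records: (id, country, amount).
records = [
    (1, "CN", 10.0),
    (2, "CN", 12.5),
    (3, "US", 7.0),
]


def to_row_layout(rows):
    """Row-oriented: all fields of one record are stored next to each other."""
    return [field for row in rows for field in row]


def to_column_layout(rows):
    """Column-oriented: all values of one column are stored next to each other."""
    return [list(column) for column in zip(*rows)]


row_layout = to_row_layout(records)
column_layout = to_column_layout(records)

# Reading only the "country" column touches one contiguous list in the
# columnar layout, whereas the row layout requires scanning every record.
countries = column_layout[1]

print(row_layout)
print(column_layout)
```

Because values of the same column sit together, a columnar layout tends to produce long runs of similar values, which is what makes encodings such as RLE effective.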
2. Compression Algorithms
Choosing a compression format means balancing the CPU required to compress and decompress the data, the disk I/O required to read and write the data, and the network bandwidth required to send the data across the network.
Compression is not recommended if your data is already compressed (such as images in JPEG format); the resulting file can even be larger than the original.
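The effect of compressing already-compressed data can be checked with a small sketch using Python's standard `zlib` module (DEFLATE). Random bytes are used here as a stand-in for an already-compressed payload such as a JPEG, and the repetitive log line is a made-up example of typical text data; both are assumptions for illustration.

```python
import random
import zlib

rng = random.Random(0)  # fixed seed so the run is repeatable

# Random bytes stand in for data that has already been compressed.
incompressible = bytes(rng.getrandbits(8) for _ in range(100_000))

# Highly repetitive text stands in for typical log or CSV data.
repetitive = b"2024-01-01,INFO,request served\n" * 3_000

packed_random = zlib.compress(incompressible)
packed_text = zlib.compress(repetitive)

# The random payload comes out slightly larger than it went in,
# because the codec can only add framing overhead to it.
print("random:", len(incompressible), "->", len(packed_random))
print("text:  ", len(repetitive), "->", len(packed_text))
```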
compression format | characteristics | splittable |
---|---|---|
DEFLATE | Hadoop DefaultCodec | no |
GZip | uses more CPU resources than Snappy or LZO; provides a higher compression ratio; a good choice for cold data | no |
BZip2 | can achieve a higher compression ratio than GZip, at the cost of speed | yes |
LZO | better choice for hot data | yes if indexed |
LZ4 | significantly faster than LZO | no |
Snappy | often performs better than LZO, better choice for hot data | no (splittable only when used inside a container format such as SequenceFile, Avro, ORC, or Parquet) |
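The ratio versus CPU trade-off in the table can be measured directly. The sketch below compares only the two codecs available in the Python standard library, `zlib` (DEFLATE, the algorithm behind GZip) and `bz2`; Snappy, LZO, and LZ4 need third-party packages and are left out. The `measure` helper and the synthetic CSV payload are hypothetical, and the numbers depend heavily on the data and the machine.

```python
import bz2
import time
import zlib


def measure(name, compress, decompress, payload):
    """Compress once, verify the round trip, and report ratio and time."""
    start = time.perf_counter()
    packed = compress(payload)
    elapsed = time.perf_counter() - start
    if decompress(packed) != payload:
        raise ValueError(f"{name}: round trip failed")
    return {"codec": name, "ratio": len(payload) / len(packed), "seconds": elapsed}


# Synthetic CSV-like payload, purely for illustration.
payload = b"".join(b"%d,user_%d,click\n" % (i, i % 97) for i in range(50_000))

results = [
    measure("DEFLATE (zlib)", zlib.compress, zlib.decompress, payload),
    measure("BZip2", bz2.compress, bz2.decompress, payload),
]

for r in results:
    print(f"{r['codec']:<15} ratio={r['ratio']:.2f} time={r['seconds'] * 1000:.1f} ms")
```

Running this on a sample of your own data is a more reliable guide than any general table, since the ranking between codecs can shift with the input.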
Others
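- Run Length Encoding (RLE): commonly used in columnar storage, e.g. AAAABBBCCDEEEE -> 4A3B2C1D4E
- Erasure Coding (EC): the alternative to replicas introduced in Hadoop 3.0.0; because of its high bandwidth and CPU cost, it is mostly used for cold data; k original blocks + m parity blocks
- Doc Values: greatest-common-divisor compression, offset encoding, sorted by doc ID, accessed through memory-mapped files (mmap), read-ahead mechanism
- skipList
- bitSet [1,3,4,7,10]->[1,0,1,1,0,0,1,0,0,1]
- Roaring Bitmap (an improvement over the plain bitset), similar to RLE, e.g. 4A3B
- Frame Of Reference encoding
- Delta encoding of values [73,300,302,332,343,372]->[73,227,2,30,11,29]
- term index, trie
- term dictionary
- finite state transducers
- Move dimension fields up into the parent document instead of repeating them in every child document, which reduces the index size
- segment: can be stored in a single int
- HyperLogLog
- Aggregating over the result of an aggregation: Pipeline Aggregation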
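Several of the encodings in the list above are small enough to sketch in a few lines of Python. The function names below are hypothetical, and the block size of 3 in the Frame Of Reference step is chosen only to keep the example short; production implementations such as Lucene's use much larger blocks and bit-packed storage.

```python
from itertools import accumulate, groupby


def rle_encode(text):
    """Run Length Encoding: each run of a symbol becomes <count><symbol>."""
    return "".join(f"{len(list(group))}{symbol}" for symbol, group in groupby(text))


def to_bitset(doc_ids):
    """Bitset with 1-based positions: bit i is 1 when doc ID i is present."""
    bits = [0] * max(doc_ids)
    for doc_id in doc_ids:
        bits[doc_id - 1] = 1
    return bits


def delta_encode(sorted_ids):
    """Keep the first value, then store each gap to the previous value."""
    return [cur - prev for prev, cur in zip([0, *sorted_ids], sorted_ids)]


def delta_decode(deltas):
    """A running sum restores the original sorted values."""
    return list(accumulate(deltas))


def bits_per_block(deltas, block_size):
    """Frame Of Reference idea: each block only needs as many bits per
    value as its largest delta requires."""
    blocks = [deltas[i:i + block_size] for i in range(0, len(deltas), block_size)]
    return [(block, max(block).bit_length()) for block in blocks]


print(rle_encode("AAAABBBCCDEEEE"))   # 4A3B2C1D4E
print(to_bitset([1, 3, 4, 7, 10]))    # [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]

deltas = delta_encode([73, 300, 302, 332, 343, 372])
print(deltas)                         # [73, 227, 2, 30, 11, 29]
print(delta_decode(deltas))           # [73, 300, 302, 332, 343, 372]
print(bits_per_block(deltas, 3))      # [([73, 227, 2], 8), ([30, 11, 29], 5)]
```

Delta encoding pays off because the gaps are much smaller than the absolute values, so the second block here fits in 5 bits per value instead of the 9 bits that a value like 372 would need.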
Reference
- Format Wars
- Data Storage and Modelling in Hadoop
- Apache Hive Different File Formats
- Introduction to Hive Columnar Storage (Hive 列存储简介)
- Hadoop Compression: gzip, bzip2, lzo, snappy (hadoop 压缩 gzip biz2 lzo snappy)
- Choosing a Data Compression Format
- Data Compression in Hadoop
- Hadoop: The Definitive Guide
- An Overview of File and Serialization Formats in Hadoop
- Understanding Elasticsearch Doc Values in Depth (深入理解 ElasticSearch Doc Values)