三种列式存储源代码分析读取差异

tablename : t
列信息：A,B,C三列，列数据量相同，类型为string
表数据量大小：300G，每列数据100G
主要分析两种类型的sql对于不同存储格式的读取数据量大小

Q1:select count(distinct A) from t;
Q2:select * from t where A='1';

1. PARQUET

1.1 结构图

说明：最小行级别块为：row group

1575256561463.png

1.2 读取parquet代码解析

读取数据时需要传入schema信息，schema信息包含列信息

ParquetFileReader parquetFileReader = new ParquetFileReader(configuration, parquetFilePath,
                readFooter.getBlocks(), schema.getColumns());
PageReadStore pages = null;
while (null != (pages = parquetFileReader.readNextRowGroup())) {

通过ParquetFileReader.readNextRowGroup可以看出来，根据schema信息组装列信息(无索引利用情况)

public PageReadStore readNextRowGroup() throws IOException {
        if (this.currentBlock == this.blocks.size()) {
            return null;
        } else {
            BlockMetaData block = (BlockMetaData)this.blocks.get(this.currentBlock);
            if (block.getRowCount() == 0L) {
                throw new RuntimeException("Illegal row group of 0 rows");
            } else {
                ColumnChunkPageReadStore columnChunkPageReadStore = new ColumnChunkPageReadStore(block.getRowCount());
                List<ParquetFileReader.ConsecutiveChunkList> allChunks = new ArrayList();
                ParquetFileReader.ConsecutiveChunkList currentChunks = null;
                Iterator i$ = block.getColumns().iterator();

                while(true) {
                    ColumnChunkMetaData mc;
                    ColumnDescriptor columnDescriptor;
                    do {
                        if (!i$.hasNext()) {
                            i$ = allChunks.iterator();

                            while(i$.hasNext()) {
                                ParquetFileReader.ConsecutiveChunkList consecutiveChunks = (ParquetFileReader.ConsecutiveChunkList)i$.next();
                                List<ParquetFileReader.Chunk> chunks = consecutiveChunks.readAll(this.f);
                                Iterator i$ = chunks.iterator();

                                while(i$.hasNext()) {
                                    ParquetFileReader.Chunk chunk = (ParquetFileReader.Chunk)i$.next();
                                    columnChunkPageReadStore.addColumn(chunk.descriptor.col, chunk.readAllPages());
                                }
                            }

1.3 读取量结果

parquet	数据量大小
Q1	100G
Q2	300G

2. ORCFILE

2.1 结构图

说明：最小行级别块为：stripe

1575256520904.png

2.2 读取orcfile代码解析

1.orcReader

SearchArgument sArg = SearchArgumentFactory
            .newBuilder()
            .equals("x", PredicateLeaf.Type.LONG, 9996l)
            .build();
    String[] sCols = new String[]{"x", null };
VectorizedRowBatch batch = readSchema.createRowBatch();
    Reader.Options options = reader.options().schema(readSchema);
    if (sArg != null && sCols != null) {
      options.searchArgument(sArg, sCols);
    }
    RecordReader rowIterator = reader.rows(options);

2.RecordReaderImpl.java

private void readStripe() throws IOException {
    StripeInformation stripe = beginReadStripe();
    planner.parseStripe(stripe, fileIncluded);
    includedRowGroups = pickRowGroups();

    // move forward to the first unskipped row
    if (includedRowGroups != null) {
      while (rowInStripe < rowCountInStripe &&
          !includedRowGroups[(int) (rowInStripe / rowIndexStride)]) {
        rowInStripe = Math.min(rowCountInStripe, rowInStripe + rowIndexStride);
      }
    }

    // if we haven't skipped the whole stripe, read the data
    if (rowInStripe < rowCountInStripe) {
      planner.readData(indexes, includedRowGroups, false);
      reader.startStripe(planner);
      // if we skipped the first row group, move the pointers forward
      if (rowInStripe != 0) {
        seekToRowEntry(reader, (int) (rowInStripe / rowIndexStride));
      }
    }
  }

pickRowGroup(Pick the row groups that we need to load from the current stripe.return an array with a boolean for each row group or null if all of the row groups must be read)看如下代码，sargApp是过滤条件，后面几个表示当前stripe,还有index信息，通过这几个条件判断当前stripe是否需要读取。

protected boolean[] pickRowGroups() throws IOException {
    // if we don't have a sarg or indexes, we read everything
    if (sargApp == null) {
      return null;
    }
    readRowIndex(currentStripe, fileIncluded, sargApp.sargColumns);
    return sargApp.pickRowGroups(stripes.get(currentStripe),
        indexes.getRowGroupIndex(),
        indexes.getBloomFilterKinds(), stripeFooter.getColumnsList(),
        indexes.getBloomFilterIndex(), false);
  }

构造读取时分两部分，通过跟踪代码发现

1575256250324.png

会先使用index（有三种类型的index过滤）进行stripe数据剪技，再使用schema进行列值还原。比parquet多一步使用index过滤阶段

2.3 读取量结果

假如通过index(max,min,bloomfilter等)过滤量数据量大小为xG

orcfile	数据量大小
Q1	100G
Q2	300G-X

2.4 hive配置语句

CREATE TABLE lxw1234_orc2 stored AS ORC 
TBLPROPERTIES
('orc.compress'='SNAPPY',
'orc.create.index'='true',
"orc.bloom.filter.columns"="pcid",
'orc.bloom.filter.fpp'='0.05',
'orc.stripe.size'='10485760',
'orc.row.index.stride'='10000') 
AS 
SELECT CAST(siteid AS INT) AS id,
pcid 
FROM lxw1234_text 
DISTRIBUTE BY id sort BY id;

2.4 例子分析（写数据排序）

每个stripe由行组成，每行数据默认10000行

写数据设置

1575268150641.png

读数据设置

1575268169113.png

过滤数据结果呈现

1575268124206.png

使用coreWriter写一个文件，有100000行数据，x的范围是1-100000；然后我构造SearchArgument（x=99961）,从过滤结果呈现可以看出来，过滤了前90000行数据，把当前stripe需要读取的rowInStripe设置为90000.

2.5 例子分析2（写数据列值随机）

1575269098965.png

1575269110150.png

1575269070512.png

使用coreWriter写一个文件，有100000行数据，x的范围是1-100000；但不是有序的，是随机的，从过滤结果来看，没有过滤掉数据，因此需要读取全部的stripe的数据。

3. CARBONDATA

3.1 结构图

说明：最小行级别块为：blocklet

1575256853170.png

3.2 读取carbondata代码解析

public BlockletScannedResult scanBlocklet(RawBlockletColumnChunks rawBlockletColumnChunks)
      throws IOException, FilterUnsupportedException {
    if (blockExecutionInfo.isDirectVectorFill()) {
      return executeFilterForPages(rawBlockletColumnChunks);
    } else {
      return executeFilter(rawBlockletColumnChunks);
    }
  }

跟踪了一个select语句，会发现，如果有过滤条件，会到BlockletFilterScanner.scanBlocklet,其中isDirectVectorFill是用来控制是否把结果直接交给spark。跟executeFilter方法会发现方法上注释如下：

   This method will process the data in below order
   1. first apply min max on the filter tree and check whether any of the filter
   is fall on the range of min max, if not then return empty result
   2. If filter falls on min max range then apply filter on actual
   data and get the filtered row index
   3. if row index is empty then return the empty result
   4. if row indexes is not empty then read only those blocks(measure or dimension)
   which was present in the query but not present in the filter, as while applying filter
   some of the blocks where already read and present in chunk holder so not need to
   read those blocks again, this is to avoid reading of same blocks which was already read
   5. Set the blocks and filter indexes to result

总结一下逻辑

page使用最大最小值进行过滤，如果符合，进行实际数据过滤，然后拿到row index;
如果row index为空，返回空
如果不为空，拿其他列（这时如果已经读取过的block不需要再次读取）
返回block与index
所以carbon是通过过滤列的列值先定位到数据的row index，再通过row index去取其他列的数据。其花费的代价为过滤列的总量-过滤的数据量+取其他列时读取的数据量（这块最小单位为page）

3.3 读取量结果

跟上面一样，假如通过index(max,min,bloomfilter等)过滤量数据量大小为xG，假如需要读取的其他列的Page总数据量为yG

carbondata	数据量大小
Q1	100G
Q2	100G-x+y

4 结论

4.1 对比

item/存储格式	parquet	orcfile	carbondata
Q1	100G	100G	100G
Q2	300G	300G -x	100G-x+y
索引	无	column min/max,bloomfilter	cloumn min/max,bloomfilter，inverted index(占空间大)
行列存储最小单位	row group	stripe	blocklet