一、hbase

1.1 文件格式

[图片上传失败...(image-9cdc93-1646955640463)]

1.2 索引Block数据结构

代码入口：

   class BlockIndexChunk {

     // key
    private final List<byte[]> blockKeys = new ArrayList<>();

     // 对应的block的offset
    private final List<Long> blockOffsets = new ArrayList<>();

     // block的大小
    private final List<Integer> onDiskDataSizes = new ArrayList<>();

     // 叶子结点 
    private final List<Long> numSubEntriesAt = new ArrayList<>();
  }

1.3 数据索引构建

代码入口:org.apache.hadoop.hbase.regionserver.StoreFileWriter#append

1.3.1 leaf index block

构建leaf index block代码块：

void writeNonRoot(DataOutput out) throws IOException {
  // The number of entries in the block.
  out.writeInt(blockKeys.size());

  if (secondaryIndexOffsetMarks.size() != blockKeys.size()) {
    throw new IOException("Corrupted block index chunk writer: " +
                          blockKeys.size() + " entries but " +
                          secondaryIndexOffsetMarks.size() + " secondary index items");
  }
  // For each entry, write a "secondary index" of relative offsets to the
  // entries from the end of the secondary index. This works, because at
  // read time we read the number of entries and know where the secondary
  // index ends.
  for (int currentSecondaryIndex : secondaryIndexOffsetMarks)
    out.writeInt(currentSecondaryIndex);
  // We include one other element in the secondary index to calculate the
  // size of each entry more easily by subtracting secondary index elements.
  out.writeInt(curTotalNonRootEntrySize);
  for (int i = 0; i < blockKeys.size(); ++i) {
    out.writeLong(blockOffsets.get(i));
    out.writeInt(onDiskDataSizes.get(i));
    out.write(blockKeys.get(i));
  }
}

构建上层索引,在root chunk 中添加索引

 public void blockWritten(long offset, int onDiskSize, int uncompressedSize) {
      // Add leaf index block size
      totalBlockOnDiskSize += onDiskSize;
      totalBlockUncompressedSize += uncompressedSize;

      if (singleLevelOnly)
        throw new UnsupportedOperationException(INLINE_BLOCKS_NOT_ALLOWED);

      if (firstKey == null) {
        throw new IllegalStateException("Trying to add second-level index " +
            "entry with offset=" + offset + " and onDiskSize=" + onDiskSize +
            "but the first key was not set in writeInlineBlock");
      }

      if (rootChunk.getNumEntries() == 0) {
        // We are writing the first leaf block, so increase index level.
        expectNumLevels(1);
        numLevels = 2;
      }

      // Add another entry to the second-level index. Include the number of
      // entries in all previous leaf-level chunks for mid-key calculation.
      rootChunk.add(firstKey, offset, onDiskSize, totalNumEntries);
      firstKey = null;
    }

1.3.2 intermediate level data index block

代码入口:org.apache.hadoop.hbase.io.hfile.HFileBlockIndex.BlockIndexWriter#writeIndexBlocks

/**
 * Current leaf-level chunk. New entries referencing data blocks get added
 * to this chunk until it grows large enough to be written to disk.
*/
private BlockIndexChunk curInlineChunk = new BlockIndexChunk();
if (curInlineChunk != null) {
        while (rootChunk.getRootSize() > maxChunkSize
            // HBASE-16288: if firstKey is larger than maxChunkSize we will loop indefinitely
            && rootChunk.getNumEntries() > minIndexNumEntries
            // Sanity check. We will not hit this (minIndexNumEntries ^ 16) blocks can be addressed
            && numLevels < 16) {
          rootChunk = writeIntermediateLevel(out, rootChunk);
          numLevels += 1;
        }
}

1.3.3 root index

代码入口：org.apache.hadoop.hbase.io.hfile.HFileBlockIndex.BlockIndexWriter#writeIndexBlocks

      // write the root level
long rootLevelIndexPos = out.getPos();

{
  DataOutput blockStream =
    blockWriter.startWriting(BlockType.ROOT_INDEX);
  rootChunk.writeRoot(blockStream);
  if (midKeyMetadata != null)
    blockStream.write(midKeyMetadata);
  blockWriter.writeHeaderAndData(out);
  if (cacheConf != null) {
    cacheConf.getBlockCache().ifPresent(cache -> {
      HFileBlock blockForCaching = blockWriter.getBlockForCaching(cacheConf);
      cache.cacheBlock(new BlockCacheKey(nameForCaching, rootLevelIndexPos, true,
                                         blockForCaching.getBlockType()), blockForCaching);
    });
  }
}

1.3.4 序列化结构

一级索引：

[图片上传失败...(image-a839e-1646955640462)]

多级索引查询：二分查找

    static int binarySearchNonRootIndex(Cell key, ByteBuff nonRootIndex,
        CellComparator comparator) {

      int numEntries = nonRootIndex.getIntAfterPosition(0);
      int low = 0;
      int high = numEntries - 1;
      int mid = 0;

      // Entries start after the number of entries and the secondary index.
      // The secondary index takes numEntries + 1 ints.
      int entriesOffset = Bytes.SIZEOF_INT * (numEntries + 2);

      // If we imagine that keys[-1] = -Infinity and
      // keys[numEntries] = Infinity, then we are maintaining an invariant that
      // keys[low - 1] < key < keys[high + 1] while narrowing down the range.
      ByteBufferKeyOnlyKeyValue nonRootIndexkeyOnlyKV = new ByteBufferKeyOnlyKeyValue();
      ObjectIntPair<ByteBuffer> pair = new ObjectIntPair<>();
      // 二分查找
      while (low <= high) {
        mid = low + ((high - low) >> 1);

        // Midkey's offset relative to the end of secondary index
        int midKeyRelOffset = nonRootIndex.getIntAfterPosition(Bytes.SIZEOF_INT * (mid + 1));

        // The offset of the middle key in the blockIndex buffer
        int midKeyOffset = entriesOffset       // Skip secondary index
            + midKeyRelOffset                  // Skip all entries until mid
            + SECONDARY_INDEX_ENTRY_OVERHEAD;  // Skip offset and on-disk-size

        // We subtract the two consecutive secondary index elements, which
        // gives us the size of the whole (offset, onDiskSize, key) tuple. We
        // then need to subtract the overhead of offset and onDiskSize.
        int midLength = nonRootIndex.getIntAfterPosition(Bytes.SIZEOF_INT * (mid + 2)) -
            midKeyRelOffset - SECONDARY_INDEX_ENTRY_OVERHEAD;

        // we have to compare in this order, because the comparator order
        // has special logic when the 'left side' is a special key.
        // TODO make KeyOnlyKeyValue to be Buffer backed and avoid array() call. This has to be
        // done after HBASE-12224 & HBASE-12282
        // TODO avoid array call.
        nonRootIndex.asSubByteBuffer(midKeyOffset, midLength, pair);
        nonRootIndexkeyOnlyKV.setKey(pair.getFirst(), pair.getSecond(), midLength);
        int cmp = PrivateCellUtil.compareKeyIgnoresMvcc(comparator, key, nonRootIndexkeyOnlyKV);

        // key lives above the midpoint
        if (cmp > 0)
          low = mid + 1; // Maintain the invariant that keys[low - 1] < key
        // key lives below the midpoint
        else if (cmp < 0)
          high = mid - 1; // Maintain the invariant that key < keys[high + 1]
        else
          return mid; // exact match
      }

      // As per our invariant, keys[low - 1] < key < keys[high + 1], meaning
      // that low - 1 < high + 1 and (low - high) <= 1\. As per the loop break
      // condition, low >= high + 1\. Therefore, low = high + 1.

      if (low != high + 1) {
        throw new IllegalStateException("Binary search broken: low=" + low
            + " " + "instead of " + (high + 1));
      }

      // OK, our invariant says that keys[low - 1] < key < keys[low]. We need to
      // return i such that keys[i] <= key < keys[i + 1]. Therefore i = low - 1.
      int i = low - 1;

      // Some extra validation on the result.
      if (i < -1 || i >= numEntries) {
        throw new IllegalStateException("Binary search broken: result is " +
            i + " but expected to be between -1 and (numEntries - 1) = " +
            (numEntries - 1));
      }

      return i;
    }

1.4 布隆过滤器构建

会根据key进行拆分不同大小的bloomchunk

内存数据结构：BloomFilterChunk

一层索引

代码入口：CompoundBloomFilterWriter

@InterfaceAudience.Private
public class BloomFilterChunk implements BloomFilterBase {

  /** Bytes (B) in the array. This actually has to fit into an int. */
  protected long byteSize;
  /** Number of hash functions */
  protected int hashCount;
  /** Hash type */
  protected final int hashType;
  /** Hash Function */
  protected final Hash hash;
  /** Keys currently in the bloom */
  protected int keyCount;
  /** Max Keys expected for the bloom */
  protected int maxKeys;
  /** Bloom bits */
  protected ByteBuffer bloom;
  /** The type of bloom */
  protected BloomType bloomType;
  }

二、presto

目前 presto没有索引类，只有统计类的数据数据结构

列统计信息：ColumnStatistics

public final class ColumnStatistics
{
    private static final ColumnStatistics EMPTY = new ColumnStatistics(Estimate.unknown(), Estimate.unknown(), Estimate.unknown(), Optional.empty());

    private final Estimate nullsFraction;
    private final Estimate distinctValuesCount;
    private final Estimate dataSize;
    private final Optional<DoubleRange> range;
}

public enum ColumnStatisticType
{
    MIN_VALUE,
    MAX_VALUE,
    NUMBER_OF_DISTINCT_VALUES,
    NUMBER_OF_NON_NULL_VALUES,
    NUMBER_OF_TRUE_VALUES,
    MAX_VALUE_SIZE_IN_BYTES,
    TOTAL_SIZE_IN_BYTES,
}


# 四、es-lucene

[https://www.elastic.co/cn/blog/frame-of-reference-and-roaring-bitmaps](https://www.elastic.co/cn/blog/frame-of-reference-and-roaring-bitmaps)

[https://issues.apache.org/jira/browse/LUCENE-5983](https://issues.apache.org/jira/browse/LUCENE-5983)

代码块 ： RoaringDocIdSet

/**

{@link DocIdSet} implementation inspired from http://roaringbitmap.org/
The space is divided into blocks of 2^16 bits and each block is encoded
independently. In each block, if less than 2^12 bits are set, then
documents are simply stored in a short[]. If more than 2^16-212 bits are
set, then the inverse of the set is encoded in a simple short[]. Otherwise
a {@link FixedBitSet} is used.
@lucene.internal
*/
public class RoaringDocIdSet extends DocIdSet {
}


目前lucene在使用query的查询缓存的时候缓存的是 docId ,目前是自己实现的bitmap

# 五、carbondata

# 六、参考

2022-03-11