Original URL: https://github.com/facebook/rocksdb/wiki/Block-Cache
The block cache is where RocksDB caches data in memory for reads. Users can pass a Cache object with the desired capacity (size) to a RocksDB instance. A Cache object can be shared by multiple RocksDB instances in the same process, which lets users control the overall cache capacity. The block cache stores uncompressed blocks. Optionally, users can set a second block cache that stores compressed blocks. Reads fetch data blocks first from the uncompressed block cache, then from the compressed block cache. The compressed block cache can replace the OS page cache if [[Direct-IO]] is used.
RocksDB has two cache implementations, LRUCache and ClockCache. Both are sharded to mitigate lock contention. Capacity is divided evenly among the shards, and shards do not share capacity. By default each cache is sharded into at most 64 shards, with each shard holding no less than 512 KB of capacity.
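The default sharding rule described above (at most 64 shards, each no smaller than 512 KB) can be sketched as follows. This is a minimal standalone illustration, not RocksDB's own code: DefaultShardBits and PerShardCapacity are hypothetical names, and rounding the per-shard capacity up is an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: picks the number of shard bits so that the cache has
// at most 2^max_bits shards and each shard holds at least min_shard_size bytes.
int DefaultShardBits(std::uint64_t capacity,
                     std::uint64_t min_shard_size = 512 * 1024,
                     int max_bits = 6) {
  assert(min_shard_size > 0);
  int bits = 0;
  std::uint64_t num_shards = capacity / min_shard_size;
  // Every halving that leaves a nonzero value allows one more shard bit.
  while ((num_shards >>= 1) != 0 && bits < max_bits) {
    ++bits;
  }
  return bits;
}

// Hypothetical helper: capacity is split evenly across 2^bits shards
// (rounded up here, which is an assumption of this sketch).
std::uint64_t PerShardCapacity(std::uint64_t capacity, int bits) {
  const std::uint64_t num_shards = std::uint64_t{1} << bits;
  return (capacity + num_shards - 1) / num_shards;
}
```

Under these assumptions, an 8MB cache gets 4 shard bits (16 shards of 512 KB each), while a 64GB cache is capped at 6 bits (64 shards).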
Usage
Out of the box, RocksDB uses an LRU-based block cache implementation with 8MB capacity. To set a customized block cache, call NewLRUCache() or NewClockCache() to create a cache object, and set it in the block-based table options. Users can also provide their own cache implementation by implementing the Cache interface.
std::shared_ptr<Cache> cache = NewLRUCache(capacity);
BlockBasedTableOptions table_options;
table_options.block_cache = cache;
Options options;
options.table_factory.reset(NewBlockBasedTableFactory(table_options));
To set a compressed block cache:
table_options.block_cache_compressed = another_cache;
RocksDB will create the default block cache if block_cache is set to nullptr. To disable the block cache completely:
table_options.no_block_cache = true;
LRU Cache
Out of the box, RocksDB uses an LRU-based block cache implementation with 8MB capacity. Each shard of the cache maintains its own LRU list and its own hash table for lookup. Synchronization is done via a per-shard mutex. Both lookups and inserts require locking the shard's mutex. Users can create an LRU cache by calling NewLRUCache(). The function provides several useful options for the cache:
- capacity: Total size of the cache.
- num_shard_bits: The number of bits from cache keys to be used as the shard id. The cache will be sharded into 2^num_shard_bits shards.
- strict_capacity_limit: In rare cases, the block cache size can grow larger than its capacity. This happens when ongoing reads or iterations over the DB pin blocks in the block cache, and the total size of the pinned blocks exceeds the capacity. If further reads then try to insert blocks into the block cache and strict_capacity_limit=false (the default), the cache will fail to respect its capacity limit and allow the insertion. This can cause an undesired OOM error that crashes the DB if the host does not have enough memory. Setting the option to true will reject further insertions into the cache and fail the read or iteration. The option works on a per-shard basis, which means one shard may reject inserts when it is full while another shard still has extra unpinned space.
- high_pri_pool_ratio: The ratio of capacity reserved for high-priority blocks. See the [[Caching Index, Filter, and Compression Dictionary Blocks|Block-Cache#caching-index-filter-and-compression-dictionary-blocks]] section below for more information.
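The effect of strict_capacity_limit on a single shard can be illustrated with a toy model. The sketch below is not RocksDB code: ToyShard and Entry are hypothetical names, every inserted block is assumed to stay pinned until it is released, and eviction simply walks from the oldest entry.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <string>

// Toy model of a single cache shard. All names are hypothetical.
struct Entry {
  std::string key;
  std::size_t charge;  // bytes counted against the shard capacity
  bool pinned;         // true while a reader still holds the block
};

class ToyShard {
 public:
  ToyShard(std::size_t capacity, bool strict_capacity_limit)
      : capacity_(capacity), strict_(strict_capacity_limit) {
    assert(capacity > 0);
  }

  // Returns true if the entry was inserted. The new entry starts pinned,
  // as if the inserting reader were still using the block.
  bool Insert(const std::string& key, std::size_t charge) {
    // Evict unpinned entries, oldest first, until the new entry fits.
    auto it = entries_.end();
    while (usage_ + charge > capacity_ && it != entries_.begin()) {
      --it;
      if (!it->pinned) {
        usage_ -= it->charge;
        it = entries_.erase(it);
      }
    }
    if (usage_ + charge > capacity_ && strict_) {
      return false;  // strict mode: reject the insert, so the read would fail
    }
    // Non-strict mode may exceed the capacity here.
    entries_.push_front(Entry{key, charge, true});
    usage_ += charge;
    return true;
  }

  // Marks the entry as no longer in use so that it becomes evictable.
  void Release(const std::string& key) {
    for (Entry& e : entries_) {
      if (e.key == key) e.pinned = false;
    }
  }

  std::size_t usage() const { return usage_; }

 private:
  std::size_t capacity_;
  bool strict_;
  std::size_t usage_ = 0;
  std::list<Entry> entries_;  // front = most recently inserted
};
```

With a capacity of 100 and two pinned 60-byte blocks, the non-strict shard ends up at a usage of 120, while the strict shard rejects the second insert until the first block is released.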
Clock Cache
WARNING: The ClockCache implementation has at least one remaining bug that could lead to crash or data corruption. Please do not use ClockCache until this is fixed.
ClockCache implements the CLOCK algorithm. Each shard of the clock cache maintains a circular list of cache entries. A clock handle runs over the circular list looking for unpinned entries to evict, while giving each entry a second chance to stay in the cache if it has been used since the last scan. A tbb::concurrent_hash_map is used for lookup.
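The second-chance scan can be sketched with a small single-threaded model. This is an illustration of the general CLOCK technique, not the ClockCache implementation: ToyClock is a hypothetical name, the concurrent hash map is replaced by a linear search, all entries are assumed to have the same size, and duplicate keys are not handled.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy CLOCK cache over a fixed number of equally sized slots (hypothetical).
class ToyClock {
 public:
  explicit ToyClock(std::size_t slots)
      : slots_(slots, Slot{"", false, false, false}) {
    assert(slots > 0);
  }

  // Returns true on a hit and sets the usage bit of the entry.
  bool Lookup(const std::string& key) {
    for (Slot& s : slots_) {
      if (s.occupied && s.key == key) {
        s.used = true;
        return true;
      }
    }
    return false;
  }

  // Inserts key into the slot chosen by the clock hand.
  // Returns false if no slot could be freed (every entry is pinned).
  bool Insert(const std::string& key, bool pinned = false) {
    // Two full sweeps suffice: the first may only clear usage bits,
    // the second then finds an unpinned entry if one exists.
    for (std::size_t step = 0; step < 2 * slots_.size(); ++step) {
      Slot& s = slots_[hand_];
      hand_ = (hand_ + 1) % slots_.size();
      if (s.occupied && s.pinned) continue;  // pinned entries are skipped
      if (s.occupied && s.used) {
        s.used = false;  // second chance: clear the bit and move on
        continue;
      }
      s = Slot{key, true, false, pinned};  // empty slot or evicted entry
      return true;
    }
    return false;
  }

  bool Contains(const std::string& key) const {
    for (const Slot& s : slots_) {
      if (s.occupied && s.key == key) return true;
    }
    return false;
  }

 private:
  struct Slot {
    std::string key;
    bool occupied;
    bool used;    // set on lookup, cleared by the clock hand
    bool pinned;  // pinned entries are never evicted
  };
  std::vector<Slot> slots_;
  std::size_t hand_ = 0;
};
```

In a two-slot cache holding a and b, looking up a before inserting c causes the hand to spare a (clearing its usage bit) and evict b instead.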
The benefit over LRUCache is finer-grained locking. With the LRU cache, the per-shard mutex has to be locked even on lookup, since a lookup needs to update the LRU list. A lookup in a clock cache does not require locking the per-shard mutex; it only looks up the concurrent hash map, which has fine-grained locking. Only inserts need to lock the per-shard mutex. With the clock cache we see a boost in read throughput over the LRU cache in contended environments (see the inline comments in cache/clock_cache.cc for the benchmark setup):
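| Threads | Cache Size | Cache Index/Filter | ClockCache Throughput (MB/s) | ClockCache Hit | LRUCache Throughput (MB/s) | LRUCache Hit |
|---|---|---|---|---|---|---|
| 32 | 2GB | yes | 466.7 | 85.9% | 433.7 | 86.5% |
| 32 | 2GB | no | 529.9 | 72.7% | 532.7 | 73.9% |
| 32 | 64GB | yes | 649.9 | 99.9% | 507.9 | 99.9% |
| 32 | 64GB | no | 740.4 | 99.9% | 662.8 | 99.9% |
| 16 | 2GB | yes | 278.4 | 85.9% | 283.4 | 86.5% |
| 16 | 2GB | no | 318.6 | 72.7% | 335.8 | 73.9% |
| 16 | 64GB | yes | 391.9 | 99.9% | 353.3 | 99.9% |
| 16 | 64GB | no | 433.8 | 99.8% | 419.4 | 99.8% |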
To create a clock cache, call NewClockCache(). To make the clock cache available, RocksDB needs to be linked with the Intel TBB library. Again, there are several options users can set when creating a clock cache:
- capacity: Same as LRUCache.
- num_shard_bits: Same as LRUCache.
- strict_capacity_limit: Same as LRUCache.
Caching Index, Filter, and Compression Dictionary Blocks
By default, index, filter, and compression dictionary blocks (with the exception of the partitions of partitioned indexes/filters) are cached outside of the block cache, and users cannot control how much memory is used to cache these blocks other than by setting max_open_files. Users can opt to cache index and filter blocks in the block cache, which allows for better control of the memory used by RocksDB. To cache index, filter, and compression dictionary blocks in the block cache:
BlockBasedTableOptions table_options;
table_options.cache_index_and_filter_blocks = true;
Note that the partitions of partitioned indexes/filters are as a rule stored in the block cache, regardless of the value of the above option.
When index, filter, and compression dictionary blocks are put in the block cache, they have to compete against data blocks to stay in the cache. Although index and filter blocks are accessed more frequently than data blocks, there are scenarios where these blocks can thrash. This is undesirable because index and filter blocks tend to be much larger than data blocks, and they are usually more valuable to keep in the cache (the latter is also true for compression dictionary blocks). The following options can be tuned to mitigate the problem:
- cache_index_and_filter_blocks_with_high_priority: Set the priority to high for index, filter, and compression dictionary blocks in the block cache. For partitioned indexes/filters, this affects the priority of the partitions as well. So far it only affects LRUCache, and it needs to be used together with high_pri_pool_ratio when calling NewLRUCache(). If the feature is enabled, the LRU list in the LRU cache is split into two parts, one for high-pri blocks and one for low-pri blocks. Data blocks are inserted at the head of the low-pri pool. Index, filter, and compression dictionary blocks are inserted at the head of the high-pri pool. If the total usage in the high-pri pool exceeds capacity * high_pri_pool_ratio, the block at the tail of the high-pri pool overflows to the head of the low-pri pool, after which it competes against data blocks to stay in the cache. Eviction starts from the tail of the low-pri pool.
- pin_l0_filter_and_index_blocks_in_cache: Pin the index and filter blocks of level-0 files in the block cache to prevent them from being evicted. Starting with RocksDB version 6.4, this option also affects compression dictionary blocks. Level-0 index and filter blocks are typically accessed more frequently. They also tend to be smaller, so pinning them in the cache should not consume too much capacity.
- pin_top_level_index_and_filter: Only applicable to partitioned indexes/filters. If true, the top level of the partitioned index/filter structure will be pinned in the cache, regardless of the LSM tree level (that is, unlike the previous option, this affects files on all LSM tree levels, not just L0).
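The two-pool LRU list described for cache_index_and_filter_blocks_with_high_priority can be sketched as a toy model. This is not the LRUCache implementation: ToyTwoPoolLru is a hypothetical name, and lookups, pinning, and sharding are left out so that only the insert, overflow, and eviction order is shown.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <list>
#include <string>

// Toy LRU list split into a high-pri pool and a low-pri pool (hypothetical).
class ToyTwoPoolLru {
 public:
  ToyTwoPoolLru(std::size_t capacity, double high_pri_pool_ratio)
      : capacity_(capacity),
        high_capacity_(
            static_cast<std::size_t>(capacity * high_pri_pool_ratio)) {
    assert(high_pri_pool_ratio >= 0.0 && high_pri_pool_ratio <= 1.0);
  }

  void Insert(const std::string& key, std::size_t charge, bool high_pri) {
    if (high_pri) {
      high_.push_front(Entry{key, charge});
      high_usage_ += charge;
      // Overflow: the tail of the high-pri pool moves to the head of the
      // low-pri pool until the high-pri pool fits its reserved share.
      while (high_usage_ > high_capacity_ && !high_.empty()) {
        high_usage_ -= high_.back().charge;
        low_usage_ += high_.back().charge;
        low_.splice(low_.begin(), high_, std::prev(high_.end()));
      }
    } else {
      low_.push_front(Entry{key, charge});
      low_usage_ += charge;
    }
    // Eviction starts from the tail of the low-pri pool.
    while (high_usage_ + low_usage_ > capacity_ && !low_.empty()) {
      low_usage_ -= low_.back().charge;
      low_.pop_back();
    }
  }

  bool InHigh(const std::string& key) const { return Find(high_, key); }
  bool InLow(const std::string& key) const { return Find(low_, key); }

 private:
  struct Entry {
    std::string key;
    std::size_t charge;
  };

  static bool Find(const std::list<Entry>& pool, const std::string& key) {
    for (const Entry& e : pool) {
      if (e.key == key) return true;
    }
    return false;
  }

  std::size_t capacity_;
  std::size_t high_capacity_;
  std::size_t high_usage_ = 0;
  std::size_t low_usage_ = 0;
  std::list<Entry> high_;  // front = head of the high-pri pool
  std::list<Entry> low_;   // front = head of the low-pri pool
};
```

With a capacity of 100 and a ratio of 0.5, inserting two 30-byte index blocks pushes the older one into the low-pri pool, where later data-block inserts can evict it like any other low-pri entry.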
Simulated Cache
SimCache is a utility to predict the cache hit rate if the cache capacity or the number of shards is changed. It wraps around the real Cache object that the DB is using, runs a shadow LRU cache simulating the given capacity and number of shards, and measures the cache hits and misses of the shadow cache. The utility is useful when a user wants to open a DB with, say, a 4GB cache size, but would like to know what the cache hit rate would become if the cache size were enlarged to, say, 64GB. To create a simulated cache:
// This cache is the actual cache used by the DB.
std::shared_ptr<Cache> cache = NewLRUCache(capacity);
// This is the simulated cache.
std::shared_ptr<Cache> sim_cache = NewSimCache(cache, sim_capacity, sim_num_shard_bits);
BlockBasedTableOptions table_options;
table_options.block_cache = sim_cache;
The extra memory overhead of the simulated cache is less than 2% of sim_capacity.
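The idea of a shadow cache can be illustrated with a key-only LRU that replays the same accesses at a different capacity. This sketch only shows the general technique and is not how SimCache is implemented: ShadowLru is a hypothetical name, sharding is ignored, and only keys and charges are stored, never block contents.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <list>
#include <string>
#include <unordered_map>

// Key-only LRU that tracks which keys would be resident at a given capacity.
class ShadowLru {
 public:
  explicit ShadowLru(std::size_t capacity) : capacity_(capacity) {
    assert(capacity > 0);
  }

  // Records one access of `key` with size `charge`; returns true on a hit.
  bool Access(const std::string& key, std::size_t charge) {
    auto it = index_.find(key);
    if (it != index_.end()) {
      ++hits_;
      lru_.splice(lru_.begin(), lru_, it->second);  // move to the MRU end
      return true;
    }
    ++misses_;
    lru_.push_front(Item{key, charge});
    index_[key] = lru_.begin();
    usage_ += charge;
    // Evict from the LRU end until the simulated capacity is respected.
    while (usage_ > capacity_ && !lru_.empty()) {
      usage_ -= lru_.back().charge;
      index_.erase(lru_.back().key);
      lru_.pop_back();
    }
    return false;
  }

  std::uint64_t hits() const { return hits_; }
  std::uint64_t misses() const { return misses_; }

  double HitRate() const {
    const std::uint64_t total = hits_ + misses_;
    return total == 0 ? 0.0 : static_cast<double>(hits_) / total;
  }

 private:
  struct Item {
    std::string key;
    std::size_t charge;
  };
  std::size_t capacity_;
  std::size_t usage_ = 0;
  std::uint64_t hits_ = 0;
  std::uint64_t misses_ = 0;
  std::list<Item> lru_;  // front = most recently used
  std::unordered_map<std::string, std::list<Item>::iterator> index_;
};
```

Replaying the access sequence a, b, c, a, b, c with unit charges gives no hits at a simulated capacity of 2 and three hits at a capacity of 3, which is the kind of "what if the cache were larger" answer the shadow cache is meant to provide.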
Statistics
A list of block cache counters can be accessed through Options.statistics if it is non-null.
// total block cache misses
// REQUIRES: BLOCK_CACHE_MISS == BLOCK_CACHE_INDEX_MISS +
// BLOCK_CACHE_FILTER_MISS +
// BLOCK_CACHE_DATA_MISS;
BLOCK_CACHE_MISS = 0,
// total block cache hit
// REQUIRES: BLOCK_CACHE_HIT == BLOCK_CACHE_INDEX_HIT +
// BLOCK_CACHE_FILTER_HIT +
// BLOCK_CACHE_DATA_HIT;
BLOCK_CACHE_HIT,
// # of blocks added to block cache.
BLOCK_CACHE_ADD,
// # of failures when adding blocks to block cache.
BLOCK_CACHE_ADD_FAILURES,
// # of times cache miss when accessing index block from block cache.
BLOCK_CACHE_INDEX_MISS,
// # of times cache hit when accessing index block from block cache.
BLOCK_CACHE_INDEX_HIT,
// # of times cache miss when accessing filter block from block cache.
BLOCK_CACHE_FILTER_MISS,
// # of times cache hit when accessing filter block from block cache.
BLOCK_CACHE_FILTER_HIT,
// # of times cache miss when accessing data block from block cache.
BLOCK_CACHE_DATA_MISS,
// # of times cache hit when accessing data block from block cache.
BLOCK_CACHE_DATA_HIT,
// # of bytes read from cache.
BLOCK_CACHE_BYTES_READ,
// # of bytes written into cache.
BLOCK_CACHE_BYTES_WRITE,
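As an illustration of how these counters relate, the sketch below derives hit ratios from a snapshot of the per-type hit and miss counts, using the REQUIRES identities stated in the comments above. BlockCacheCounters and the helper functions are hypothetical names for this example; in a real program the values would presumably be read from the statistics object (for instance via its getTickerCount() method) rather than filled in by hand.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical snapshot of the per-type block cache counters.
struct BlockCacheCounters {
  std::uint64_t index_hit;
  std::uint64_t index_miss;
  std::uint64_t filter_hit;
  std::uint64_t filter_miss;
  std::uint64_t data_hit;
  std::uint64_t data_miss;
};

// BLOCK_CACHE_HIT is the sum of the index, filter, and data hits.
std::uint64_t TotalHit(const BlockCacheCounters& c) {
  return c.index_hit + c.filter_hit + c.data_hit;
}

// BLOCK_CACHE_MISS is the sum of the index, filter, and data misses.
std::uint64_t TotalMiss(const BlockCacheCounters& c) {
  return c.index_miss + c.filter_miss + c.data_miss;
}

// Fraction of lookups served from the cache; 0 if there were no lookups.
double HitRatio(std::uint64_t hit, std::uint64_t miss) {
  const std::uint64_t total = hit + miss;
  return total == 0 ? 0.0 : static_cast<double>(hit) / total;
}
```

For example, HitRatio(TotalHit(c), TotalMiss(c)) gives the overall block cache hit ratio, and HitRatio(c.data_hit, c.data_miss) gives the ratio for data blocks only.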
See also: [[Memory-usage-in-RocksDB#block-cache]]