Leveled Compaction (Translation)

Source: https://github.com/facebook/rocksdb/wiki/Leveled-Compaction

Leveled Compaction

Structure of the files

Files on disk are organized into multiple levels. We call them level-1, level-2, etc., or L1, L2, etc., for short. A special level-0 (or L0 for short) contains files just flushed from the in-memory write buffer (memtable). Each level (except level 0) is one sorted data run:

Inside each level (except level 0), data is range-partitioned into multiple SST files:

The level is a sorted run because keys in each SST file are sorted (see Block-based Table Format as an example). To identify a position for a key, we first binary search the start/end keys of all files to identify which file possibly contains the key, and then binary search inside that file to locate the exact position. In all, it is a full binary search across all the keys in the level.
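
As a rough illustration of the lookup path just described, here is a minimal self-contained sketch. It is not RocksDB's actual implementation; SstFile, FindInLevel, and the in-memory layout are simplified for the example:

    #include <algorithm>
    #include <optional>
    #include <string>
    #include <utility>
    #include <vector>

    struct SstFile {
      std::string smallest_key, largest_key;  // key range covered by the file
      std::vector<std::pair<std::string, std::string>> entries;  // sorted by key
    };

    // Look up `key` in one level (one sorted run): the files are ordered by
    // key range and non-overlapping.
    std::optional<std::string> FindInLevel(const std::vector<SstFile>& files,
                                           const std::string& key) {
      // Step 1: binary search over file boundaries to find the single file
      // whose [smallest_key, largest_key] range may contain `key`.
      auto file = std::lower_bound(
          files.begin(), files.end(), key,
          [](const SstFile& f, const std::string& k) { return f.largest_key < k; });
      if (file == files.end() || key < file->smallest_key) return std::nullopt;

      // Step 2: binary search inside that file for the exact position.
      auto entry = std::lower_bound(
          file->entries.begin(), file->entries.end(), key,
          [](const std::pair<std::string, std::string>& e, const std::string& k) {
            return e.first < k;
          });
      if (entry == file->entries.end() || entry->first != key) return std::nullopt;
      return entry->second;
    }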

All non-0 levels have target sizes. Compaction's goal is to restrict the data size of those levels to stay under their targets. The size targets are usually exponentially increasing:
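
As a worked example, assuming RocksDB's defaults (both options are configurable): with max_bytes_for_level_base = 256MB and max_bytes_for_level_multiplier = 10, the targets follow Target_Size(Ln+1) = Target_Size(Ln) * max_bytes_for_level_multiplier:

    Target_Size(L1) = 256MB
    Target_Size(L2) = 2.56GB
    Target_Size(L3) = 25.6GB
    Target_Size(L4) = 256GB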

Compactions

Compaction triggers when the number of L0 files reaches level0_file_num_compaction_trigger; the L0 files are then merged into L1. Normally we have to pick up all the L0 files because they usually overlap each other:
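
A minimal sketch of setting the options discussed in this section through the public RocksDB C++ API (the numeric values are illustrative, not recommendations):

    #include <rocksdb/options.h>

    rocksdb::Options MakeLeveledOptions() {
      rocksdb::Options options;
      options.compaction_style = rocksdb::kCompactionStyleLevel;
      // L0 -> L1 compaction triggers once this many L0 files accumulate.
      options.level0_file_num_compaction_trigger = 4;
      // Target size of L1; each deeper level multiplies the target by 10.
      options.max_bytes_for_level_base = 256 << 20;  // 256MB
      options.max_bytes_for_level_multiplier = 10;
      return options;
    }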

After the compaction, it may push the size of L1 to exceed its target:

In this case, we will pick at least one file from L1 and merge it with the overlapping range of L2. The result files will be placed in L2:

If the result pushes the next level's size beyond its target, we do the same as previously: pick up a file and merge it into the next level:

and so on, cascading down the levels (the original wiki illustrates this with a sequence of diagrams).

Multiple compactions can be executed in parallel if needed:

The maximum number of compactions allowed to run at the same time is controlled by max_background_compactions.

However, L0 to L1 compaction is not parallelized by default, so in some cases it may become a bottleneck that limits the total compaction speed. RocksDB supports subcompaction-based parallelization only for L0 to L1: to enable it, users can set max_subcompactions to more than 1, and RocksDB will then try to partition the range and use multiple threads to execute it:
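
A minimal sketch of the two parallelism knobs just described (values are illustrative):

    #include <rocksdb/options.h>

    rocksdb::Options MakeParallelCompactionOptions() {
      rocksdb::Options options;
      // Allow up to 4 compactions to run concurrently in the background.
      options.max_background_compactions = 4;
      // Split an L0 -> L1 compaction into up to 4 subcompactions.
      options.max_subcompactions = 4;
      return options;
    }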

Compaction Picking

When multiple levels trigger the compaction condition, RocksDB needs to pick which level to compact first. A score is generated for each level:

  • For non-zero levels, the score is the total size of the level divided by the target size. If there are already files picked that are being compacted into the next level, the size of those files is not included in the total size, because they will soon go away.

  • For level-0, the score is the total number of files divided by level0_file_num_compaction_trigger, or total size over max_bytes_for_level_base, whichever is larger. (If the file count is smaller than level0_file_num_compaction_trigger, compaction won't trigger from level 0, no matter how big the score is.)

We compare the score of each level, and the level with the highest score takes priority to compact, as sketched below.
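
A minimal sketch (not RocksDB's actual code) of the scoring rule described by the two bullets above; the parameter names are illustrative:

    #include <algorithm>
    #include <cstdint>

    // Score for level 0: file count or total size, whichever ratio is larger.
    double Level0Score(uint64_t num_l0_files, uint64_t total_l0_bytes,
                       uint64_t level0_file_num_compaction_trigger,
                       uint64_t max_bytes_for_level_base) {
      double by_count = static_cast<double>(num_l0_files) /
                        level0_file_num_compaction_trigger;
      double by_size = static_cast<double>(total_l0_bytes) /
                       max_bytes_for_level_base;
      return std::max(by_count, by_size);
    }

    // Score for a non-zero level: total size over target size, excluding
    // bytes already picked for compaction (they will soon go away).
    double NonZeroLevelScore(uint64_t level_bytes,
                             uint64_t bytes_being_compacted,
                             uint64_t target_bytes) {
      return static_cast<double>(level_bytes - bytes_being_compacted) /
             target_bytes;
    }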

Which file(s) to compact from a level is explained in Choose Level Compaction Files.

level_compaction_dynamic_level_bytes is true

Target size of the last level (num_levels-1) will always be the actual size of the level. Then Target_Size(Ln-1) = Target_Size(Ln) / max_bytes_for_level_multiplier. We won't fill any level whose target would be lower than max_bytes_for_level_base / max_bytes_for_level_multiplier. These levels are kept empty, and all L0 compactions skip them and go directly to the first level with a valid target size.

For example, if max_bytes_for_level_base is 1GB, num_levels=6 and the actual size of the last level is 276GB, then the target sizes of L1-L6 will be 0, 0, 0.276GB, 2.76GB, 27.6GB and 276GB, respectively.
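
A minimal sketch of this backwards computation (illustrative, working in GB for readability): start from the actual size of the last level and divide by the multiplier, leaving empty any level whose target would fall below max_bytes_for_level_base / max_bytes_for_level_multiplier:

    #include <vector>

    // Returns targets indexed 1..num_levels (index 0 is unused), in GB.
    std::vector<double> DynamicLevelTargetsGB(double last_level_actual_gb,
                                              double base_gb,
                                              double multiplier,
                                              int num_levels) {
      std::vector<double> targets(num_levels + 1, 0.0);
      double t = last_level_actual_gb;
      for (int lvl = num_levels; lvl >= 1; --lvl) {
        // Levels whose target would be below base/multiplier stay empty (0).
        if (t < base_gb / multiplier) break;
        targets[lvl] = t;
        t /= multiplier;
      }
      return targets;
    }

    // DynamicLevelTargetsGB(276.0, 1.0, 10.0, 6) yields
    // {unused, 0, 0, 0.276, 2.76, 27.6, 276} -- the example above.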

This is to guarantee a stable LSM-tree structure, where 90% of data is stored in the last level, which can't be guaranteed if level_compaction_dynamic_level_bytes is false. For example, in the previous example:

We can guarantee that 90% of the data is stored in the last level and 9% in the second-to-last level. This has multiple benefits.

When L0 files pile up

Sometimes writes are heavy, temporarily or permanently, so that the number of L0 files piles up before they can be compacted to lower levels. When this happens, the behavior of leveled compaction changes:

Intra-L0 Compaction

Too many L0 files hurt read performance in most queries. To address the issue, RocksDB may choose to compact some L0 files together into a larger file. This adds one to write amplification but may significantly improve read amplification in L0, and in turn increases the amount of data RocksDB can hold in L0, which generates the other benefits explained below. The additional write amplification of 1 is far smaller than the usual write amplification of leveled compaction, which is often larger than 10, so we believe it is a good trade-off. The maximum size of an Intra-L0 compaction is also bounded by options.max_compaction_bytes. If the option takes a reasonable value, total L0 size will still be bounded, even with Intra-L0 files.

Adjust level targets

If total L0 size grows too large, it can become even larger than the target size of L1, or even of lower levels. It doesn't make sense to keep following the configured targets for each level. Instead, for dynamic leveling, the targets are adjusted: the target of L1 is set to the actual size of L0, and all levels between L1 and the last level get adjusted targets so that every level has the same multiplier. The motivation is to slow down compaction into the lower levels: if data is stuck in L0->L1 compaction, it is wasteful to keep aggressively compacting the lower levels, which competes for I/O with the higher-level compactions.

For example, suppose the configured multiplier is 10, the configured base level size is 1GB, and the actual sizes of L1 to L4 are 640MB, 6.4GB, 64GB and 640GB, respectively. If a spike of writes comes and pushes total L0 size up to 10GB, the L1 target will be adjusted to 10GB, and the size targets of L1 to L4 become 10GB, 40GB, 160GB and 640GB. If it is a temporary recent spike, where the new data is likely still in its current level L0 or maybe the next level L1, then the actual file sizes of the lower levels (i.e., L3, L4) are still close to their previous sizes while their size targets have increased. Therefore lower-level compaction almost stops, and all the resources are used for L0 => L1 and L1 => L2 compactions, so L0 files can be cleared sooner. In case the high write rate becomes permanent, the adjusted targets' write amplification (expected 14) is better than the configured one (expected 32), so it is still a good move.
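
To make those numbers concrete, here is one plausible accounting consistent with the 14 and 32 quoted above (a sketch of the reasoning, not RocksDB's exact formula). With L1 adjusted to 10GB and L4 at 640GB, the common per-level multiplier becomes

    (640GB / 10GB)^(1/3) = 4

so the targets are 10GB, 40GB, 160GB, 640GB. If each Ln -> Ln+1 compaction rewrites data roughly multiplier times, expected write amplification is about

    1 (flush) + 1 (L0 -> L1) + 3 * 4  = 14   (adjusted)
    1 (flush) + 1 (L0 -> L1) + 3 * 10 = 32   (configured)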

The goal of this feature is for leveled compaction to handle temporary write spikes more smoothly. Note that leveled compaction still cannot efficiently handle a write rate that is much higher than the capacity implied by the configuration. Work is ongoing to further improve it.

TTL

A file could exist in the LSM tree without going through the compaction process for a really long time if there are no updates to the data in the file's key range. For example, in certain use cases, the keys are "soft deleted" -- set the values to be empty instead of actually issuing a Delete. There might not be any more writes to this "deleted" key range, and if so, such data could remain in the LSM for a really long time resulting in wasted space.

A dynamic ttl column-family option has been introduced to solve this problem. Files (and, in turn, data) older than TTL will be scheduled for compaction when there is no other background work. This will make the data go through the regular compaction process, reach to the bottommost level and get rid of old unwanted data. This also has the (good) side-effect of all the data in the non-bottommost level being newer than ttl, and all data in the bottommost level older than ttl. Note that it could lead to more writes as RocksDB would schedule more compactions.
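
A minimal sketch of enabling the TTL option just described (the value is illustrative):

    #include <rocksdb/options.h>

    rocksdb::Options MakeTtlOptions() {
      rocksdb::Options options;
      // Files older than 30 days are scheduled for compaction when there is
      // no other background work, so stale data eventually reaches the
      // bottommost level and gets cleaned up.
      options.ttl = 30 * 24 * 60 * 60;  // seconds
      return options;
    }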

Periodic compaction

If a compaction filter is present, RocksDB ensures that data goes through the compaction filter after a certain amount of time. This is achieved via options.periodic_compaction_seconds. Setting it to 0 disables this feature. Leaving it at the default value, i.e. UINT64_MAX - 1, indicates that RocksDB controls the feature; at the moment, RocksDB changes the value to 30 days. Whenever RocksDB tries to pick a compaction, files older than 30 days are eligible for compaction and are compacted to the same level.
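
A minimal sketch of configuring periodic compaction (the value is illustrative):

    #include <rocksdb/options.h>

    rocksdb::Options MakePeriodicCompactionOptions() {
      rocksdb::Options options;
      // Ensure every file becomes eligible for compaction at least once a
      // week, so data passes through the compaction filter that often.
      options.periodic_compaction_seconds = 7 * 24 * 60 * 60;
      return options;
    }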
