LSM-Tree Merge Strategy

Overall Introduction

Observation: Worst-case point lookup cost, long-range lookup cost, and space amplification derive mostly from the largest level.
Inference: Merge operations at all levels of the LSM-tree except the largest hardly improve these metrics, while adding significantly to the amortized cost of updates.
Result: the state of the art offers suboptimal trade-offs.

The balance between the I/O cost of merging on the one hand, and the I/O cost of lookups and space amplification on the other, can be tuned using two knobs.

  • The size ratio T between the capacities of adjacent levels; T controls the number of levels of the LSM-tree and thus the overall number of times an entry gets merged across levels.
  • The merge policy; it controls the number of times an entry gets merged within a level.
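
A rough way to see what the first knob controls (a sketch using assumed notation, not taken from the notes):

```latex
% Sketch of how T sets the number of levels (assumed notation:
% N = total entries, B = entries per buffer flush):
L \;\approx\; \left\lceil \log_T \frac{N}{B} \right\rceil
% A larger T means fewer levels, so an entry crosses fewer level boundaries;
% the merge policy then determines how often it is merged within each level.
```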

The Two Major Merge Policies

  • Tiering
  • Leveling


    Overview of an LSM-tree using tiering and leveling merge policies

In both cases, a merge is triggered when a buffer flush causes Level 1 to reach capacity. With tiering, all runs at Level 1 get merged into a new run that is placed at Level 2. With leveling, the merge also includes the preexisting run at Level 2.
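
A minimal sketch of this difference, under an assumed toy model where runs are Python dicts and capacities are counted in entries (not any real engine's code):

```python
T = 4        # assumed size ratio between adjacent levels
BUFFER = 16  # assumed buffer capacity, in entries

def merge(*runs):
    """Merge sorted runs into one; later (newer) runs win on duplicate keys."""
    out = {}
    for run in runs:
        out.update(run)
    return out

def flush_tiering(l1_runs, l2_runs, flushed):
    """Tiering: Level 1 accumulates up to T runs; once full, they are merged
    into a single new run that is appended to Level 2."""
    l1_runs.append(flushed)
    if len(l1_runs) == T:
        l2_runs.append(merge(*l1_runs))
        l1_runs.clear()

def flush_leveling(l1_run, l2_run, flushed):
    """Leveling: each level holds one run; the flush merges into Level 1's run,
    and when Level 1 reaches capacity the merge also includes the preexisting
    Level 2 run."""
    l1_run = merge(l1_run, flushed)
    if len(l1_run) >= T * BUFFER:
        l2_run = merge(l2_run, l1_run)
        l1_run = {}
    return l1_run, l2_run
```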

Size-Tiered compaction strategy (STCS). The idea of STCS is fairly straightforward, as illustrated here:

(figure: size-tiered compaction, with small sstables repeatedly compacted into progressively larger ones)

As usual, memtables are periodically flushed to new sstables. These are pretty small, and soon their number grows. As soon as we have enough (by default, 4) of these small sstables, we compact them into one medium sstable. When we have collected enough medium tables, we compact them into one large table. And so on, with compacted sstables growing increasingly large.
The full STCS algorithm is more complex than what we just described, because sstables may have overlapping data or deletions and thus the compacted sstables have varying sizes that don’t fall neatly into “small, medium, large, etc.” categories. Therefore STCS tries to fit the sstables into buckets of similarly-sized sstables, and compacts sstables within each of these buckets. Understanding these details is not necessary for the point of this post, but if you’re curious, refer to this blog post from 2014 or any other documentation of Scylla’s or Cassandra’s size-tiered compaction strategy.
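
The bucketing idea can be sketched as follows; the 0.5x to 1.5x similarity window and the min_threshold of 4 are assumptions for illustration, not the exact Scylla or Cassandra parameters:

```python
def buckets_by_size(sstable_sizes, low=0.5, high=1.5):
    """Group sstable sizes into buckets of similarly-sized sstables."""
    buckets = []                      # each bucket: (average_size, [sizes])
    for size in sorted(sstable_sizes):
        for i, (avg, members) in enumerate(buckets):
            if low * avg <= size <= high * avg:
                members.append(size)
                buckets[i] = (sum(members) / len(members), members)
                break
        else:
            buckets.append((size, [size]))
    return [members for _, members in buckets]

def pick_compaction(sstable_sizes, min_threshold=4):
    """Return one bucket with enough similarly-sized sstables to compact."""
    for bucket in buckets_by_size(sstable_sizes):
        if len(bucket) >= min_threshold:
            return bucket
    return None

# e.g. four sstables of roughly 10 MB get compacted into one ~40 MB sstable,
# while the lone 160 MB sstable is left alone:
print(pick_compaction([9, 10, 11, 10, 160]))   # -> [9, 10, 10, 11]
```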
Size-tiered compaction has several compelling properties which made it popular as the first and default compaction strategy of Cassandra and Scylla, and of many other LSM implementations. It results in a low and logarithmic (in size of data) number of sstables, and the same data is copied during compaction a fairly low number of times. We’ll address these issues again, using measures called read amplification and write amplification, in the following posts. In this post, we want to focus on the weakest aspect of size-tiered compaction, known as space amplification. This weakness is what eventually led to the development of alternative compaction strategies, such as leveled compaction and hybrid compaction which we will investigate in the next two posts.

Summary: With size-tiered compaction, each level holds at most T sstables, and the keys within each sstable are sorted. Every T sstables are merged into one larger sstable that is placed at the next level; as the figure shows, the sstables at each level are T times larger than those at the level above. Size-tiered compaction has relatively low write amplification, but it suffers from space amplification: the storage footprint exceeds the size the data would occupy if it were held in a single sstable. The cause is that during a merge, the data of the input level and of the merged output level must be kept on disk at the same time, which can temporarily double the space used. The alternative merge policy, the Leveled Compaction Strategy (LCS), is designed to solve the space-amplification problem, but it introduces a new one: write amplification.

The Leveled Compaction Strategy was the second compaction strategy introduced in Apache Cassandra. It was first introduced in Cassandra 1.0 in 2011, and was based on ideas from Google’s LevelDB. As we will show below, it solves STCS’s space-amplification problem. It also reduces read amplification (the average number of disk reads needed per read request), which we will not further discuss in this post.
The first thing that Leveled Compaction does is to replace large sstables, the staple of STCS, by “runs” of fixed-size (by default, 160 MB) sstables. A run is a log-structured-merge (LSM) term for a large sorted file split into several smaller files. In other words, a run is a collection of sstables with non-overlapping token ranges. The benefit of using a run of fragments (small sstables) instead of one huge sstable is that with a run, we can compact only parts of the huge sstable instead of all of it. Leveled compaction indeed does this; the clever part is how it does it:
Leveled compaction divides the small sstables (“fragments”) into levels:

(figure: leveled compaction, with new sstables at L0 and runs of fixed-size sstables at L1, L2, L3, ...)

Level 0 (L0) holds the new sstables, recently flushed from memtables. As their number grows (and reads slow down), our goal is to move sstables out of this level into the next levels.
Each of the other levels, L1, L2, L3, etc., is a single run of an exponentially increasing size: L1 is a run of 10 sstables, L2 is a run of 100 sstables, L3 is a run of 1000 sstables, and so on. (Factor 10 is the default setting in both Scylla and Apache Cassandra).
The job of Leveled compaction strategy is to maintain this structure while keeping L0 empty:

  • When we have enough (e.g., 4) sstables in L0, we compact them into L1.
    We do this by compacting all the sstables in L0 together with all the sstables in L1. The result of this compaction is a new run (a large sorted output split into small sstables at our chosen size limit, by default 160 MB) which we put in L1, replacing the entire content of L1.
  • The new run in L1 may have more than the desired 10 sstables. If that happens, we pick one sstable from L1 and compact it into L2:
  • A single sstable in L1 is part of a run of 10 files. The whole run covers the entire token range, which means that the single sstable we chose covers roughly 1/10th of the token range. At the same time, each of the L2 sstables covers roughly 1/100th of the token range. So the single L1 sstable we are compacting will overlap around 10 of the L2 sstables.
  • So what we do is to take the single L1 sstable and the roughly 10 L2 sstables which overlap its token range, and compact those together – as usual splitting the result into small sstables. We replace the input sstables with the compaction results, putting the results in L2 (note that L2 remains a single run).
  • After we have compacted an sstable from L1 into L2, L2 may in turn have more than the desired number of sstables, so we compact sstables from L2 into L3. Again, this involves compacting one sstable from L2 and about 10 sstables from L3.
    And so on.

Summary: The Leveled Compaction Strategy introduces the concept of runs. A run consists of multiple small sstables. Each merge is an operation between one run and another. To keep merges cheap, a merge picks a single sstable and merges it with the roughly T sstables in the next level that cover the same key range.
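
The overlap selection described above can be sketched as follows, under an assumed model where each sstable is represented only by its (first_token, last_token) range and a run is a list of such ranges:

```python
def overlapping(candidate, next_level_run):
    """Return the sstables in the next level's run whose token ranges overlap
    the candidate sstable's token range."""
    lo, hi = candidate
    return [(a, b) for (a, b) in next_level_run if a <= hi and b >= lo]

# One L1 sstable covers about 1/10 of the token range; each L2 sstable covers
# about 1/100, so we expect roughly 10 overlapping L2 sstables.
l1_sstable = (0.30, 0.40)
l2_run = [(i / 100, (i + 1) / 100) for i in range(100)]
inputs = overlapping(l1_sstable, l2_run)
print(len(inputs))   # 12: about 10 overlap properly, plus 2 that only touch the endpoints
```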

Fence Pointer
All major LSM-tree based key-value stores index the first key of every block of every run in main memory. These indexes are called fence pointers. They take up O(N/B) space in main memory (with N entries overall and B entries per storage block), and they enable a lookup to find the relevant key range within any run with a single I/O.
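
A minimal sketch of a fence-pointer lookup, with the data layout assumed for illustration:

```python
import bisect

def block_for_key(fence_pointers, key):
    """Binary-search the in-memory fence pointers (first key of each block)
    to find the single block of the run that could contain `key`, so at most
    one storage I/O is needed to fetch it."""
    i = bisect.bisect_right(fence_pointers, key) - 1
    return i if i >= 0 else None    # None: key sorts before the run's first key

fences = ["apple", "grape", "melon", "peach"]   # first keys of blocks 0..3
print(block_for_key(fences, "kiwi"))      # 1, the block starting at "grape"
print(block_for_key(fences, "apple"))     # 0
print(block_for_key(fences, "aardvark"))  # None
```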

Bloom Filters
Objective: speed up point lookups. Each run has a Bloom filter in main memory.
A Bloom Filter is a space-efficient probabilistic data structure used to answer set membership queries.
The false positive rate (FPR) depends on the ratio between the number of bits allocated to the filter and the number of entries in the set.
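For reference, the standard Bloom filter analysis (with m bits, n entries, and k hash functions; notation assumed here, not from the notes) gives:

```latex
% Standard Bloom filter false-positive analysis (m bits, n entries, k hashes):
\mathrm{FPR} \approx \left(1 - e^{-kn/m}\right)^{k},
\qquad
k_{\text{opt}} = \frac{m}{n}\ln 2
\;\Rightarrow\;
\mathrm{FPR} \approx e^{-(m/n)\ln^2 2} \approx 0.6185^{\,m/n}
```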

A point lookup probes a Bloom filter before accessing the corresponding run in storage. If the filter returns a true positive, the lookup accesses the run with one I/O (using the fence pointers), finds the matching entry, and terminates.
If the filter returns a negative, the lookup skips the run, thereby saving one I/O.
Otherwise, we have a false positive, meaning the lookup wastes one I/O by accessing the run, not finding a matching entry, and having to continue searching for the target key in the next run.
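
Putting the filter probe and the fence pointers together, the point-lookup flow can be sketched as follows (structures assumed and simplified: each run object carries an in-memory Bloom filter and fence pointers, and read_block stands in for the single storage I/O):

```python
import bisect

def point_lookup(key, runs):
    """Probe runs from newest to oldest; skip a run when its filter says the
    key is definitely absent, otherwise spend one I/O on the candidate block."""
    for run in runs:                               # ordered newest -> oldest
        if not run.bloom_filter.may_contain(key):
            continue                               # negative: run skipped, no I/O
        i = bisect.bisect_right(run.fence_pointers, key) - 1
        if i < 0:
            continue                               # key precedes the run's first key
        block = run.read_block(i)                  # the single I/O for this run
        if key in block:
            return block[key]                      # true positive: done
        # false positive: the I/O was wasted; keep searching older runs
    return None                                    # key not present in any run
```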

Design Space and Problem Analysis

Analysis: worst-case space amplification and the I/O costs of updates and lookups, as functions of two knobs:

  • merge policy
  • size ratio

Note:

  • In key-value stores in industry, the number of bits per entry for Bloom filters is the same across all levels. Therefore, the Bloom filters at the largest level are larger than the filters at all smaller levels combined.
  • The FPR at the largest level is upper bounded by O(e^{-M/N}), where M is the memory budget for all filters and N is the number of entries in the system.
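
To see where this bound comes from (a sketch based on the standard formula above, with the same bits-per-entry ratio M/N at every level):

```latex
% Sketch: plugging m/n = M/N into the optimal-k Bloom filter formula gives,
% for the largest level (which holds a constant fraction of all N entries):
\mathrm{FPR}_{\text{largest}} \approx e^{-\ln^2 2 \,\cdot\, (M/N)} = O\!\left(e^{-M/N}\right)
```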