分类：LSM-tree，optimization for write amplification，novel lsm structure

PebblesDB：SOSP 17

提出了一个novel lsm structure 以及对应的 novel compaction algorithm。

lsm的每一个component（level）通过guards 把 sstable 切分成多个独立的units（单元）（同一层不同的unit 之间是有序的，但是每一个unit内的sstable不保证有序，所以会出现overlapping）；这个方法借鉴的思路是从skiplist出来的。

compaction algorithm的目的也是为了维护这种结构，并且保证同一个component（level）的数据没有overwrite。

LSM

write-optimized data structures
- high write throughput
- good read throughput
- disadvantage： write amplification

impaction about write amplification
- reduce write read throughput
- wear out flash device

探求问题的解决方案要看清问题的本质，本质就是引起问题的根本原因（内因，外因）；然后再根据自己的需求对导致问题的根本原因进行放缩，知道一种在某些workload情景下有非常明显提高的方法。
The root cause of write amplification is： rewriting data to the same level multiple times to maintain sorted non-overlapping files in each level

FLSM（fragmented LSM）

Naive approach

just append the file to the end of next level， but many overlapping files within a level
affect the read performance

partially sorted level

Hybrid overlapping files and non-overlapping files
guards to group together overlapping files
write data only once per level in most cases

大概的思路如下图所示：
[图片上传中...(image.png-6e05fd-1553343884927-0)]

the problem solved

Not re-write data to the same level in most case to reduce write amplification
partially sorted levels to efficiently reduce the search space

critical issue

How to select the guards？

guards are chosen dynamically and randomly
dependent on the distribution of the data

如何选择guards

总体原则：概率的从插入的key中选择，防止数据的skew（倾斜）

方法：设置了一个 Guard Probability。当一个key插入到系统的时候，通过这个Guard Probability决定其是否可以成为一个Guard。

定义：Guard probability gp(key,i) ：表示key是否可以在第i个level成为Guard。

其中这个guard probability表示是最低层级（ level 1 ）的概率。

Guard 选择的影响：在skiplist中，如果在某一层选择出来一个key作为skiplist的一个guard，那么这个guard将为影响到这一层以及这一层一下的所有层级的guard。在FLSM的guard选择中，也是一样的，如果在i层选择了K作为guard，那么K也会成为i+1，i+2的guard。

局限性：当前的选择guard的方式非常简单，容易计算，仅仅考虑到了每一个数据key的公平分布问题，没有考虑到选择为了将选择出来的guard应用到各个层级中的IO消耗（选择出来guard需要保证每一个层级的数据维持FLSM的原则，那么需要对数据进行compaction）

插入与删除guards

插入guards

插入和选择guards并不是保持同步的，因为选择某个层次多一个guard是非常容易的并且可以做到原子性，但是真的让存储系统的SSTable按照新选出来的guard进行组织并不是一个可以和选择guard同步完成的事情。guard的插入是一个异步的过程。

首先在内存中存在一个 uncommitted guard的集合表示的是那些选出来了，但是还没有被插入的guard。针对这些uncommitted guard，需要在下一轮compaction的时候，对sstable进行分区，compaction的原则是要结合 old guards和uncommitted guard。当所有的compaction结束以后，就可以把那些uncommitted guards写入到存储中，那么新的查询，将使用新的guards集合。

疑问：论文中这点描述的比较弱，仅仅有一段内容。下面是自己的看法：

在Li层选出一个新的guard，那么Li+1，Li+2都需要保证这个guard（这一点是论文中要求的，skiplist也是这样做的）
新选出的uncommitted guard相邻的两个left guard和right guard表示的是Li层对应的区间，当uncommitted guard需要插入的时候，就要原子性的保证Li层原来的left guard和right guard的sstable需要按照 uncommitted guard进行区分。
根据2中的原则，提交 uncommitted guard的时候，需要保证Li+1，Li+2...的层级都需要插入对应的guard，所以这轮compaction也需要把余下的层的sstable的进行partition。
重新开始一轮compaction的时候，如果有多个uncommitted guard的时候，需要一次性保证这些uncommitted guard在这轮compaction中完成。

删除guards

删除Li层的某个guard的情况并不多，只有在某些guard中的数据被完全删除为空，或者数据量较少的时候才需要删除。

删除guard的方式和插入guard的方式基本相同，在进行compaction的时候不需要太多的考虑，对于Li层只需要考虑append到相邻的guard后面即可，Li层的数据compaction到Li+1的过程仍然按照原来的方式进行即可。

删除Li层的某个guard，需要保证level<i的对应的guard都被删除。

调优与局限性

FLSM针对读和range 查询的性能主要依靠的就是在一个guard中sstable的个数。如果sstable过多，那么查询性能机会降低，如果sstable过少，那么写入放大就会比较大。

文章特殊贡献（自己认为）

点出了LSM的写入放大的根本原因，以及从naive的思路去去除导致写入放大的根本原因。
点出了写入放大带来的危害(多个方面):写吞吐，合并，查询（合并不及时导致数据无序性增加）
分析B+Tree为什么不适合 write-intensive的workload，以及B+tree的写入放大会更大： B+Tree会导致磁盘的random access

收获

摘要书写：应用价值，使用结构，结构出现的问题。为了解决这个问题，提出什么结构。这个结构中有什么新的方法，这个方法的功能是什么。最后集成在了什么样的系统中。评估我们的系统，在xx的workload中获得了xx的效果。
实验要充分：
- 实验准备与配置，使用什么样的环境，系统，以及原因
- 测试系统的配置与原因
- 测试项目：write amplification，single thread，random write and read，sequentialwrite，range query，space amplification， multi-threads read and write（read是等write finished有进行的），concurrent read and write （mixed），low memory performance