原网址:https://github.com/facebook/rocksdb/wiki/Prefix-Seek
(有道)
Why Prefix Seek?
Normally, when an iterator seek is executed, RocksDB needs to place the position at every sorted run (memtable, level-0 file, other levels, etc) to the seek position and merge these sorted runs. This can sometimes involve several I/O requests. It is also CPU heavy for decompression of several data blocks and other CPU overheads.
通常情况下,当迭代器查找执行时,RocksDB需要在每次排序运行(memtable, level-0 file,其他级别等)时将位置放置到查找位置,并合并这些排序运行。这有时会涉及到几个I/O请求。对于多个数据块的解压缩和其他CPU开销,它也是CPU负担。
Prefix seek is a feature for mitigating these overheads for some use cases. The basic idea is that, if users know the iterating will be within one key prefix, the common prefix can be used to reduce costs. The most commonly used prefix iterating technique is prefix bloom filter. If many sorted runs don't contain any entry for this prefix, it can be filtered out by a bloom filter, and some I/Os and CPU for the sorted run can be ignored.
在某些用例中,前缀查找是用来减少这些开销的特性。其基本思想是,如果用户知道迭代将在一个关键前缀内,则可以使用公共前缀来降低成本。最常用的前缀迭代技术是前缀布隆过滤器。如果许多已排序的运行不包含此前缀的任何条目,它可以被bloom过滤器过滤掉,并且一些已排序运行的I/O和CPU可以被忽略。
A motivating use case for prefix seek is representing multi-maps, such as secondary indexes in MyRocks, where the RocksDB prefix is the key for the multimap and a RocksDB iterator finds all the entries associated with that prefix (multimap key).
一个表现不错的使用场景multi-maps,比如MyRocks中的二级索引,其中RocksDB前缀是multimap的键,而RocksDB迭代器会查找与该前缀(multimap键)相关的所有条目。
Whether prefix seek can improve performance is workload dependent. It's more likely to be effective for very short iterator queries than longer ones.
前缀查找是否能提高性能取决于工作负载。相对于较长的迭代器查询,它更可能对非常短的迭代器查询有效。
Defining a "prefix"
"Prefix" is defined by options.prefix_extractor, which is a shared_pointer of a SliceTransform
instance. By calling SliceTransform.Transform()
against a key, we extract a Slice
representing a substring of the Slice
, usually the prefix part. In this wiki page, we use "prefix" to refer to the output of options.prefix_extractor.Transform()
of a key. You can use fixed length prefix transformer, by calling NewFixedPrefixTransform(prefix_len)
, a capped length prefix transformer, by calling NewCappedPrefixTransform(prefix_len)
, or you can implement your own prefix transformer in the way you want and pass it to options.prefix_extractor
. Prefix extractor is related to comparator. A prefix extractor needs to be used with a comparator where keys with the same prefix are close to each others in the total order defined by the comparator.
"Prefix"通过options.prefix_extractor定义,它是一个SliceTransform
实例的shared_pointer。通过对一个键调用SliceTransform.Transform(),我们提取子字符串的Slice,通常是前缀部分。在这个wiki页面中,我们使用“prefix”来引用一个键的options.prefix_extractor.Transform()的输出。你可以通过调用NewFixedPrefixTransform(prefix_len)来使用固定长度的前缀转换器,或者你也可以按照你想要的方式实现你自己的前缀转换器,并将其传递给options.prefix_extractor。前缀提取器与比较器相关。前缀提取器需要与比较器一起使用,其中具有相同前缀的键按照比较器定义的总顺序彼此接近。
The recommendation is to use capped prefix transform with bytewise or reverse bytewise comparators when possible. It will maximize features supported with relatively good performance.
建议在可能的情况下使用带上限的前缀转换和字节或反向的字节比较器。它将以相对较好的性能最大限度地支持特性。
Configure prefix bloom filter
Although it is not the only way users can take advantage of prefix iterating, prefix bloom filter in block-based SST file is the most commonly used and effective one. Here is an example how you can set up prefix bloom filters in SST files.
虽然前缀迭代不是用户利用前缀迭代的唯一方式,但基于块的SST文件中的前缀bloom过滤器是最常用和有效的方法。下面是一个如何在SST文件中设置前缀bloom过滤器的例子。
Options options;
// Set up bloom filter
rocksdb::BlockBasedTableOptions table_options;
table_options.filter_policy.reset(rocksdb::NewBloomFilterPolicy(10, false));
table_options.whole_key_filtering = false; // If you also need Get() to use whole key filters, leave it to true.
options.table_factory.reset(
rocksdb::NewBlockBasedTableFactory(table_options)); // For multiple column family setting, set up specific column family's ColumnFamilyOptions.table_factory instead.
// Define a prefix. In this way, a fixed length prefix extractor. A recommended one to use.
options.prefix_extractor.reset(NewCappedPrefixTransform(3));
DB* db;
Status s = DB::Open(options, "/tmp/rocksdb", &db);
How this bloom filter is used depends on ReadOptions
setting in read queries. See the section below for details.
如何使用bloom filter取决于读取查询中的ReadOptions设置。有关详细信息,请参阅下面一节。
options.prefix_extractor
can be changed when DB is restarted. Usually the end result would be that bloom filters in existing SST files will be ignored in reads. In some cases, old bloom filters can still be used. This will be explained in following sections.
options.prefix_extractor
可以在DB重启时修改。通常,最终的结果是在读取时忽略现有SST文件中的bloom过滤器。在某些情况下,旧的bloom filter仍然可以使用。这将在下面几节中解释。
Configure reads while prefix bloom filters are set up
How to ignore prefix bloom filters in read
Users can only use prefix bloom filter to read inside a specific prefix. If iterator is outside a prefix, the feature needs to be disabled for specific iterators to prevent wrong results:
用户只能使用prefix bloom过滤器读取特定的前缀。如果迭代器在前缀之外,则需要为特定的迭代器禁用该特性,以防止错误的结果:
ReadOptions read_options;
read_options.total_order_seek = true;
Iterator* iter = db->NewIterator(read_options);
Slice key = "foobar";
iter->Seek(key); // Seek "foobar" in total order
Putting read_options.total_order_seek = true
will make sure the query returns the same result as if there is no prefix bloom filter.
read_options.total_order_seek = true
将确保查询返回相同的结果,就像没有前缀bloom过滤器一样。
Adaptive Prefix Mode
Since Release 6.8, a new adaptive mode is introduced:
从6.8版开始,引入了一个新的自适应模式:
ReadOptions read_options;
read_options.auto_prefix_mode = true;
Iterator* iter = db->NewIterator(read_options);
// ......
This will always generate the same result as read_options.total_order_seek = true
, while prefix bloom filter might be used, based on seek key and iterator upper bound. It is the recommended way of using prefix bloom filter, because it's less prone to misuse. We will turn this option as default on once it is proven to be stable.
这将总是生成与read_options相同的结果。Total_order_seek = true,而前缀bloom filter可以根据查找键和迭代器上限使用。这是使用前缀bloom过滤器的推荐方式,因为它不太容易被误用。一旦它被证明是稳定的,我们将把这个选项作为默认打开。
Here is an example how this feature can be used: for a fixed length prefix extractor of 3, following queries will take advantage of bloom filters:
下面是一个如何使用这个特性的例子:对于一个固定长度为3的前缀提取器,以下查询将利用bloom过滤器:
options.prefix_extractor.reset(NewCappedPrefixTransform(3));
options.comparator = BytewiseComparator(); // This is the default
// ......
ReadOptions read_options;
read_options.auto_prefix_mode = true;
std::string upper_bound;
Slice upper_bound_slice;
// "foo2" and "foo9" share the same prefix "foo".
upper_bound = "foo9";
upper_bound_slice = Slice(upper_bound);
read_options.iterate_upper_bound = &upper_bound_slice;
Iterator* iter = db->NewIterator(read_options);
iter->Seek("foo2");
// "foobar2" and "foobar9" share longer prefix than "foo".
upper_bound = "foobar9";
upper_bound_slice = Slice(upper_bound);
read_options.iterate_upper_bound = &upper_bound_slice;
Iterator* iter = db->NewIterator(read_options);
iter->Seek("foobar2");
// "foo2" and "fop" doesn't share the same prefix, but "fop" is the successor key of prefix "foo" which is the prefix of "foo2".
upper_bound = "fop";
upper_bound_slice = Slice(upper_bound);
read_options.iterate_upper_bound = &upper_bound_slice;
Iterator* iter = db->NewIterator(read_options);
iter->Seek("foo2");
This feature has some limitations:
这个特性有一些局限性:
- It only supports fixed and capped prefix extractor
它只支持固定和封顶前缀提取器 - The comparator needs to implement
IsSameLengthImmediateSuccessor()
. The built-in byte-wise and reverse byte-wise comparators have it implemented.
比较器需要实现IsSameLengthImmediateSuccessor()
。内置的逐字节和反向逐字节比较器实现了它。 - Right now only
Seek()
can automatically take advantage of the feature by comparing seek key with iterator upper bound.SeekForPrev()
never uses prefix bloom filter. (no fundamental reason why it can't be done. You are welcome to finish the feature)
目前,只有Seek()可以通过比较Seek键和迭代器上限来自动利用该特性。SeekForPrev()从不使用前缀bloom过滤器。(没有根本的理由说明为什么不能这样做。欢迎您完成功能) - Some extra CPU is used to determine qualification of prefix bloom filter in each SST file to query.
使用额外的CPU来确定每个SST文件中要查询的前缀bloom filter是否匹配。
Manual prefix iterating
With this option, users determine their use case qualifies for prefix iterator and are responsible of never iterating outside iterator. RocksDB will not return error when it is misused and the iterating result will be undefined. Note that, when iterating outside the iterator range, the result might not be limited to missing keys. Some undefined result might include: deleted keys might show up, key ordering is not followed, or very slow queries. Also note that even data within the prefix range might not be correct if the iterator has moved out of the prefix range and come back again.
通过这个选项,用户决定是否使用prefix iterator并保证不会超出范围。当被误用时,RocksDB将不会返回错误,并且迭代结果将是未定义的。注意,在迭代器范围外迭代时,结果可能不限于丢失键。一些未定义的结果可能包括:删除的键可能会出现,键的顺序没有遵循,或非常慢的查询。还要注意,如果迭代器已经移出了前缀范围并再次返回,那么即使是前缀范围内的数据也可能不正确。
Right now this mode is the default for backward compatible reason, but users should be extra careful when using this mode.
由于向后兼容的原因,目前这个模式是默认的,但是用户在使用这个模式时应该格外小心。
How to use the manual mode:
手动模式的使用方法:
options.prefix_extractor.reset(NewCappedPrefixTransform(3));
ReadOptions read_options;
read_options.total_order_seek = false;
read_options.auto_prefix_mode = false;
Iterator* iter = db->NewIterator(read_options);
iter->Seek("foobar");
// Iterate within prefix "foo"
iter->SeekForPrev("foobar");
// iterate within prefix "foo"
Prefix extractor change
If prefix extractor changes when DB restarts, some SST files may contain prefix bloom filters generated using different prefix extractor than the one in current options. When opening those files, RocksDB will compare prefix extractor name stored in the properties of the SST files. If the name is different from the prefix extractor provided in the options, the prefix bloom filter is not used for this file, with following exception: if the previous SST file uses fixed or capped prefix extractor, the filter may sometimes be used if seek key and upper bound indicates that the iterator is within a prefix specified by previous prefix extractor. The behavior within the SST file would be the same as automatic prefix mode.
当DB重启时,如果prefix extractor发生变化,可能会导致某些SST文件中包含使用与当前选项不同的prefix extractor生成的prefix bloom filter。当打开这些文件时,RocksDB会比较存储在SST文件属性中的前缀提取器名称。如果名称与选项中提供的前缀提取器不同,则该文件不使用前缀bloom过滤器,例外如下:如果前一个SST文件使用固定的或有上限的前缀提取器,如果查找键和上限表示迭代器在前一个前缀提取器指定的前缀内,则有时会使用这个过滤器。SST文件中的行为与自动前缀模式相同。
Other prefix iterating features
Besides prefix bloom filter. There are several prefix iterating features:
除了前缀bloom过滤器。有几个前缀迭代特性:
- Prefix bloom filter in memtable. Turn it on with
options.memtable_prefix_bloom_size_ratio
memtable中的前缀bloom过滤器。使用options.memtable_prefix_bloom_size_ratio打开它 - Prefix memtables. Hash linked list and hash skip list memtables are supported. See [[MemTable]] for details.
前缀memtables。支持哈希链表和哈希跳跃表。详情请参见[[MemTable]]。 - Prefix hash index in block based table format. See [[Data Block Hash Index]] for details.
前缀哈希索引基于块的表格式。详见[[数据块哈希索引]]。 - PlainTable format. See [[PlainTable Format]] for details.
PlainTable格式。详情请参见[[PlainTable Format]]。
General Prefix Seek API
We introduced how to use prefix bloom filter in detail above. The usage is almost the same for other prefix iterating features. The only difference is that, some options might not be supported. Here is the general workflow:
我们在上面详细介绍了如何使用前缀bloom过滤器。其他前缀迭代特性的用法几乎相同。唯一的区别是,有些选项可能不被支持。以下是一般的工作流程:
When options.prefix_extractor
for your DB or column family is specified, RocksDB is in a "prefix seek" mode. Example of how to use it:
当设置options.prefix_extractor
用于DB或者column family时,RocksDB处于“prefix seek”模式。使用它的例子:
Options options;
// <---- Enable some features supporting prefix extraction
options.prefix_extractor.reset(NewFixedPrefixTransform(3));
DB* db;
Status s = DB::Open(options, "/tmp/rocksdb", &db);
......
Iterator* iter = db->NewIterator(ReadOptions());
iter->Seek("foobar"); // Seek inside prefix "foo"
iter->Next(); // Find next key-value pair inside prefix "foo"
When options.prefix_extractor
is not nullptr
and with default ReadOptions
, iterators are not guaranteed a total order of all keys, but only keys for the same prefix. When doing Iterator.Seek(lookup_key)
, RocksDB will extract the prefix of lookup_key. If there is one or more keys in the database matching prefix of lookup_key, RocksDB will place the iterator to the key equal or larger than lookup_key of the same prefix, as for total ordering mode. If no key of the prefix equals or is larger than lookup_key, or after calling one or more Next()
, we finish all keys for the prefix, we might return Valid()=false
, or any key (might be non-existing) that is larger than the previous key. Setting ReadOptions.prefix_same_as_start=true
guarantees the first of those two behaviors.
当options.prefix_extractor
不是nullptr,并且使用默认的ReadOptions,迭代器不能保证所有键的总顺序,只能保证相同前缀的键的顺序。当执行Iterator.Seek(lookup_key)时,RocksDB会提取lookup_key的前缀。如果数据库中有一个或多个键匹配lookup_key的前缀,那么RocksDB会将迭代器放置到等于或大于该前缀的lookup_key的键,就像总的排序模式一样。如果前缀中没有大于或等于lookup_key的键,或者在调用了一个或多个Next()之后,完成了前缀的所有键,可以返回Valid()=false,或者任何大于前一个键的键(可能不存在)。设置 ReadOptions.prefix_same_as_start=true
保证了这两种行为中的第一种。
From release 4.11, we support Prev()
in prefix mode, but only when the iterator is still within the range of all the keys for the prefix. The output of Prev()
is not guaranteed to be correct when the iterator is out of the range.
从4.11版开始,我们在前缀模式下支持Prev(),但只有当迭代器仍然在前缀的所有键的范围内时才支持。当迭代器超出该范围时,不能保证Prev()的输出是正确的。
When prefix seek mode is enabled, RocksDB will freely organize the data or build look-up data structures that can locate keys for specific prefix or rule out non-existing prefixes quickly. Here are some supported optimizations for prefix seek mode: prefix bloom for block based tables and mem tables, hash-based mem tables, as well as PlainTable format. One example setting:
当启用前缀查找模式时,RocksDB会自由地组织数据,或者构建查找数据结构,可以快速定位特定前缀的键,或者排除不存在的前缀。这里有一些支持的前缀查找模式的优化:基于块的表和mem表的前缀bloom,基于哈希的mem表,以及PlainTable格式。设置一个例子:
Options options;
// Enable prefix bloom for mem tables
options.prefix_extractor.reset(NewFixedPrefixTransform(3));
options.memtable_prefix_bloom_size_ratio = 0.1;
// Enable prefix hash for SST files
BlockBasedTableOptions table_options;
table_options.index_type = BlockBasedTableOptions::IndexType::kHashSearch;
DB* db;
Status s = DB::Open(options, "/tmp/rocksdb", &db);
......
auto iter = db->NewIterator(ReadOptions());
iter->Seek("foobar"); // Seek inside prefix "foo"
From release 3.5, we support a read option to allow RocksDB to use total order even if options.prefix_extractor
is given. To enable the feature set ReadOption.total_order_seek=true
to the read option passed when doing NewIterator()
, example:
从3.5版本开始,我们便支持了一个读取选项,即允许RocksDB使用总序,即使给出options.prefix_extractor
。启用特性集ReadOption。total_order_seek=true在执行NewIterator()时传递的读取选项,例如:
ReadOptions read_options;
read_options.total_order_seek = true;
auto iter = db->NewIterator(read_options);
Slice key = "foobar";
iter->Seek(key); // Seek "foobar" in total order
Performance might be worse in this mode. Please be aware that not all implementations of prefix seek support this option. For example, some implementations of PlainTable doesn't support it and you'll see an error in status code when you try to use it. Hash-based mem tables might do a very expensive online sorting if you use it. This mode is supported in prefix bloom and hash index of block based tables.
在这种模式下,性能可能会更差。请注意,不是所有的prefix seek的实现都支持这个选项。例如,PlainTable的一些实现不支持它,当您尝试使用它时,会在状态码中看到一个错误。如果您使用基于哈希的mem表,它可能会执行非常昂贵的在线排序。基于块的表的前缀bloom和hash索引支持这种方式。
From release 6.8, we support a similar option while allowing underlying implementation to take advantage of prefix specific information when not impacting results.
从6.8版开始,我们支持类似的选项,同时允许底层实现在不影响结果的情况下利用前缀特定的信息。
ReadOptions read_options;
read_options.auto_prefix_mode = true;
auto iter = db->NewIterator(read_options);
Slice key = "foobar";
iter->Seek(key); // Seek "foobar" in total order
Right now, the feature is only supported for forward seeking with prefix bloom filter.
目前,该功能只支持前向搜索与前缀bloom过滤器。
Limitation
SeekToLast()
is not supported well with prefix iterating. SeekToFirst()
is only supported by some configurations. You should use total order mode, if you will execute those types of queries against your iterator.
在使用前缀迭代时,不支持SeekToLast()。SeekToFirst()仅在某些配置中支持。如果你要对你的迭代器执行这些类型的查询,你应该使用全顺序模式。
One common bug of using prefix iterating is to use prefix mode to iterate in reverse order. But it is not yet supported. If reverse iterating is your common query pattern, you can reorder the data to turn your iterating order to be forward. You can do it through implementing a customized comparator, or encode your key in a different way.
使用前缀迭代的一个常见错误是使用前缀模式进行反向迭代。但目前还没有得到支持。如果反向迭代是常见的查询模式,则可以重新排序数据,将迭代顺序改为向前。您可以通过实现一个定制的比较器来实现,或者以不同的方式对密钥进行编码。
API change from 2.8 -> 3.0
In this section, we explain the API as of 2.8 release and the change in 3.0.
在本节中,我们将解释2.8版本的API以及3.0版本的变化。
Before the Change
As of RocksDB 2.8, there are 3 seek modes:
在RocksDB 2.8中,有3种寻找模式:
Total Order Seek
This is the traditional seek behavior you'd expect. The seek performs on a total ordered key space, positioning the iterator to a key that is greater or equal to the target key you seek.
这是传统的搜索行为。查找在一个总的有序键空间上执行,将迭代器定位到一个大于或等于目标键的键。
auto iter = db->NewIterator(ReadOptions());
Slice key = "foo_bar";
iter->Seek(key);
Not all table formats support total order seek. For example, the newly introduced PlainTable format only supports prefix-based seek() unless it is opened in total order mode (Options.prefix_extractor == nullptr).
并不是所有的表格式都支持总顺序查找。例如,新引入的PlainTable格式只支持基于前缀的seek(),除非它在总顺序模式下打开(Options。prefix_extractor = = nullptr)。
Use ReadOptions.prefix
This is the least flexible way to do a seek. Prefix needs to be supplied when creating an iterator.
这是做求的最不灵活的方法。创建迭代器时需要提供前缀。
Slice prefix = "foo";
ReadOptions ro;
ro.prefix = &prefix;
auto iter = db->NewIterator(ro);
Slice key = "foo_bar"
iter->Seek(key);
Options.prefix_extractor
is a prerequisite. The Seek()
is constrained to the prefix provided by ReadOptions
, which means you will need to create a new iterator to seek a different prefix. The benefit of this approach is that irrelevant files are filtered out at the time of building the new iterator. So if you want to seek multiple keys with the same prefix, it might perform better. However, we consider this is a very rare use case.
Options.prefix_extractor
是一个先决条件。Seek()被限制为ReadOptions提供的前缀,这意味着您需要创建一个新的迭代器来查找不同的前缀。这种方法的好处是,在构建新的迭代器时,不相关的文件会被过滤掉。因此,如果您想寻找具有相同前缀的多个键,它可能会执行得更好。然而,我们认为这是一个非常罕见的用例。
Use ReadOptions.prefix_seek
This mode is more flexible than ReadOption.prefix
. No pre-filtering is done at iterator creation time. As a result, the same iterator can be reused for seek of different key/prefix.
此模式比ReadOption.prefix更灵活。在迭代器创建时不进行预过滤。因此,可以重用相同的迭代器来查找不同的键/前缀。
ReadOptions ro;
ro.prefix_seek = true;
auto iter = db->NewIterator(ro);
Slice key = "foo_bar";
iter->Seek(key);
Same as ReadOptions.prefix
, is a prerequisite.
同ReadOptions.prefix
一样,Options.prefix_extractor
是一个先决条件。
What's Changed
It becomes obvious that 3 modes of seek are confusing:
很明显,3种搜索模式是令人困惑的:
- One mode would require another option to be set (e.g.
Options.prefix_extractor
);
一种模式需要设置另一个选项(例如Options.prefix_extractor); - It is not obvious to our users which mode of the last two is preferred under different circumstances
对于我们的用户来说,在后两种模式中哪一种在不同情况下更受欢迎并不明显
This change tries to address this issue and makes things straight: by default, Seek()
is performed in prefix mode if Options.prefix_extractor
is defined and vice versa. The motivation is simple: if Options.prefix_extractor
is provided, it is a very clear signal that underlying data can be sharded and prefix seek is a natural fit. Usage becomes unified:
此更改试图解决这个问题,并使事情变得更简单:默认情况下,如果Options, Seek()将在前缀模式下执行。定义Prefix_extractor,反之亦然。动机很简单:if Options.prefix_extractor
被设置,这是一个非常明确的信号,表明底层数据可以被分片,前缀查找是一种自然的匹配。使用变得统一:
auto iter = db->NewIterator(ReadOptions());
Slice key = "foo_bar";
iter->Seek(key);
Transition to the New Usage
Transition to the new style should be simple: remove the assignment to Options.prefix
or Options.prefix_seek
, since they are deprecated. Now, seek directly with your target key or prefix. Since Next()
can go across the boundary to a different prefix, you will need to check the end condition:
过渡到新样式应该很简单:删除对Options的赋值。前缀或选项。Prefix_seek,因为它们已被弃用。现在,直接寻找你的目标键或前缀。因为Next()可以跨越边界到不同的前缀,你需要检查结束条件:
auto iter = DB::NewIterator(ReadOptions());
for (iter.Seek(prefix); iter.Valid() && iter.key().starts_with(prefix); iter.Next()) {
// do something
}