Source: https://github.com/facebook/rocksdb/wiki/Iterator
Introduction
All data in the database is logically arranged in sorted order. An application can supply a key comparator that defines a total ordering of keys. The Iterator API allows an application to do a range scan on the database. The Iterator can seek to a specified key, and the application can then scan one key at a time from that point. The Iterator API can also be used to iterate over the keys in reverse. A consistent point-in-time view of the database is created when the Iterator is created, so all keys returned via the Iterator come from a consistent view of the database.
Consistent View
If ReadOptions.snapshot is given, the iterator returns data as of that snapshot. If it is nullptr, the iterator reads from an implicit snapshot taken at the time the iterator is created. The implicit snapshot is preserved by [[pinning resource|Iterator#resource-pinned-by-iterators-and-iterator-refreshing]]. There is no way to convert this implicit snapshot into an explicit snapshot.
Error Handling
Iterator::status() returns the error encountered during iteration, if any. Possible errors include I/O errors, checksum mismatches, unsupported operations, internal errors, and other errors.
If there is no error, the status is Status::OK(). If the status is not OK, the iterator is invalidated as well. In other words, if Iterator::Valid() is true, status() is guaranteed to be OK, so it is safe to proceed with other operations without checking status():
for (it->Seek("hello"); it->Valid(); it->Next()) {
  // Do something with it->key() and it->value().
}
if (!it->status().ok()) {
  // Handle error. it->status().ToString() contains error message.
}
On the other hand, if Iterator::Valid() is false, there are two possibilities: (1) the iterator reached the end of the data, in which case status() is OK; or (2) there was an error, in which case status() is not OK. It is always good practice to check status() once the iterator is invalidated.
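The contract between Valid() and status() can be illustrated with a small self-contained model. This is only a sketch: ToyIterator, status_ok(), ScanAll, ScanResult and kNoFailure are hypothetical names invented for this illustration, and the class is a stand-in over an in-memory vector, not the RocksDB iterator. It assumes the behavior described above: an error invalidates the iterator, and a seek discards the previous status.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Sentinel meaning "never inject an error" (hypothetical, for this sketch).
constexpr std::size_t kNoFailure = static_cast<std::size_t>(-1);

// Hypothetical stand-in for an iterator; NOT the RocksDB class.
// It walks `keys` and simulates an error when it reaches index `fail_at`.
class ToyIterator {
 public:
  ToyIterator(std::vector<std::string> keys, std::size_t fail_at)
      : keys_(std::move(keys)), fail_at_(fail_at) {}

  // Seeking discards the previous status before repositioning.
  void SeekToFirst() {
    pos_ = 0;
    ok_ = true;
    Check();
  }
  void Next() {
    assert(Valid());
    ++pos_;
    Check();
  }
  // Valid() is true only when positioned on an entry AND no error occurred.
  bool Valid() const { return ok_ && pos_ < keys_.size(); }
  bool status_ok() const { return ok_; }
  const std::string& key() const {
    assert(Valid());
    return keys_[pos_];
  }

 private:
  void Check() {
    if (pos_ == fail_at_) ok_ = false;  // simulated read error
  }
  std::vector<std::string> keys_;
  std::size_t fail_at_;
  std::size_t pos_ = 0;
  bool ok_ = true;
};

struct ScanResult {
  std::vector<std::string> keys;
  bool ok;  // false means the scan stopped because of an error
};

// Scans while Valid(), then checks the status once after the loop ends.
ScanResult ScanAll(ToyIterator& it) {
  ScanResult result{{}, true};
  for (it.SeekToFirst(); it.Valid(); it.Next()) {
    result.keys.push_back(it.key());
  }
  result.ok = it.status_ok();
  return result;
}
```

In this model a scan that ends early with ok == false returned only a prefix of the data, which is why the final status check matters.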
Seek() and SeekForPrev() discard the previous status.
Note that in release 5.13.x and earlier (before https://github.com/facebook/rocksdb/pull/3810, which was merged on May 17, 2018), status() and Valid() behaved differently:
- Valid() could return true even if status() was not OK. This could sometimes be used to skip over corrupted data. This is no longer supported. The intended way of dealing with corrupted data is RepairDB() (see db.h).
- Seek() and SeekForPrev() didn't always discard the previous status. Next() and Prev() didn't always preserve a non-OK status.
Iterating upper bound and lower bound
You can specify an upper bound for a range query by setting ReadOptions.iterate_upper_bound in the read options passed to NewIterator(). With this option set, RocksDB doesn't have to find the next key after the upper bound, which in some cases avoids I/O or computation. In some specific workloads, the improvement can be significant. Note that the bound applies to both forward and backward iteration. The behavior is undefined when you call SeekForPrev() with a seek key higher than the upper bound, or call SeekToLast() when the last key is higher than the iterator's upper bound, although RocksDB will not crash.
Similarly, ReadOptions.iterate_lower_bound can be used with backward iteration to help RocksDB optimize performance.
See the comments on these options for more information.
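The effect of the two bounds on a scan can be sketched with a sorted in-memory map. This is a toy model, not RocksDB code: ScanForward and ScanBackward are hypothetical helpers, and the sketch assumes the lower bound is inclusive and the upper bound is exclusive, which is how the option comments describe them to the best of my understanding.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

using ToyDb = std::map<std::string, std::string>;

// Forward scan over [lower_bound, upper_bound): stops as soon as a key
// reaches the upper bound, without looking at anything beyond it.
std::vector<std::string> ScanForward(const ToyDb& db,
                                     const std::string& lower_bound,
                                     const std::string& upper_bound) {
  assert(lower_bound <= upper_bound);
  std::vector<std::string> keys;
  for (auto it = db.lower_bound(lower_bound);
       it != db.end() && it->first < upper_bound; ++it) {
    keys.push_back(it->first);
  }
  return keys;
}

// Backward scan over the same range: starts just below the upper bound and
// stops as soon as a key falls below the lower bound.
std::vector<std::string> ScanBackward(const ToyDb& db,
                                      const std::string& lower_bound,
                                      const std::string& upper_bound) {
  assert(lower_bound <= upper_bound);
  std::vector<std::string> keys;
  auto it = db.lower_bound(upper_bound);  // first key >= upper bound
  while (it != db.begin()) {
    --it;
    if (it->first < lower_bound) break;
    keys.push_back(it->first);
  }
  return keys;
}
```

Both helpers stop at the bound rather than at the end of the data, which is the work the bound options let RocksDB skip.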
Resource pinned by iterators and iterator refreshing
Iterators by themselves don't use much memory, but they can prevent some resources from being released. These include:
- Memtables and SST files as of the creation time of the iterator. Even if some memtables and SST files are removed after flush or compaction, they are still preserved as long as an iterator pins them.
- Data blocks for the current iterating position. These blocks are kept in memory, either pinned in the block cache or on the heap if no block cache is set. Note that although blocks are normally small, in some extreme cases a single block can be quite large if the value size is very large.
So the best way to use an iterator is to keep it short-lived, so that these resources are freed promptly.
An iterator has some creation cost. In some use cases (especially memory-only ones), people want to avoid that cost by reusing iterators. If you do this, be aware that an iterator that has gone stale can block resources from being released. So make sure you destroy or refresh iterators that have not been used for some time, e.g. one second. Before release 5.7, the only way to deal with a stale iterator was to destroy it and recreate it if needed. Since release 5.7, you can call Iterator::Refresh() to refresh it. Calling this function refreshes the iterator to represent the recent state and releases the stale resources it had pinned.
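Why a stale iterator holds resources, and what refreshing changes, can be sketched with reference-counted versions. This is a simplified model under stated assumptions, not how RocksDB is implemented: ToyStore, PinningIterator, Contains and PinnedVersion are hypothetical names, and a whole-table copy stands in for the memtables and SST files an iterator pins.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>

using Table = std::map<std::string, std::string>;

// Hypothetical toy store (NOT the RocksDB API). Each write installs a new
// immutable table version; an old version is freed once nothing pins it.
class ToyStore {
 public:
  void Put(const std::string& key, const std::string& value) {
    auto next = std::make_shared<Table>(*current_);
    (*next)[key] = value;
    current_ = std::move(next);
  }
  std::shared_ptr<const Table> Current() const { return current_; }

 private:
  std::shared_ptr<const Table> current_ = std::make_shared<Table>();
};

// Toy iterator handle: pins the version that was current at creation time.
class PinningIterator {
 public:
  explicit PinningIterator(const ToyStore& store)
      : store_(&store), pinned_(store.Current()) {}

  bool Contains(const std::string& key) const {
    assert(pinned_ != nullptr);
    return pinned_->count(key) > 0;
  }
  // Drops the stale version and pins the latest one.
  void Refresh() { pinned_ = store_->Current(); }
  // Non-owning handle, used only to observe when the version is freed.
  std::weak_ptr<const Table> PinnedVersion() const { return pinned_; }

 private:
  const ToyStore* store_;
  std::shared_ptr<const Table> pinned_;
};
```

In this model, a version written before the iterator was created stays alive until Refresh() is called or the iterator is destroyed, even though the store itself has moved on.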
Prefix Iterating
A prefix iterator allows users to use a bloom filter or hash index in the iterator in order to improve performance. However, the feature has limitations and, if misused, may return wrong results without reporting an error, so we recommend using it carefully. For how to use the feature, see [[Prefix Seek|Prefix Seek]]. The options total_order_seek and prefix_same_as_start are only applicable to prefix iterating.
Read-ahead
RocksDB does automatic readahead and prefetches data once it notices more than 2 IOs for the same table file during iteration. This applies only to the block-based table format. The readahead size starts at 8 KB and is increased exponentially on each additional sequential IO, up to a maximum of BlockBasedTableOptions.max_auto_readahead_size (default 256 KB). This helps cut down the number of IOs needed to complete the range scan. Automatic readahead is enabled only when ReadOptions.readahead_size = 0 (the default value). On Linux, the readahead syscall is used in buffered IO mode, and an AlignedBuffer is used in direct IO mode to store the prefetched data. (Automatic iterator readahead is available starting with 5.12 for buffered IO and 5.15 for direct IO.)
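The growth of the automatic readahead size can be worked through with a few lines of arithmetic. The sketch below is illustrative only: ReadaheadSchedule and the two constants are hypothetical names, and it assumes that "exponentially increased" means doubling on each sequential read, which the text does not state explicitly.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kInitialReadahead = 8 * 1024;       // 8 KB, from the text
constexpr std::size_t kDefaultMaxReadahead = 256 * 1024;  // default cap, from the text

// Returns the readahead size used for each of `num_reads` consecutive
// sequential reads after automatic readahead has kicked in.
// Assumption: the size doubles each time until it reaches `max_size`.
std::vector<std::size_t> ReadaheadSchedule(
    std::size_t num_reads, std::size_t max_size = kDefaultMaxReadahead) {
  assert(max_size >= kInitialReadahead);
  std::vector<std::size_t> sizes;
  std::size_t size = kInitialReadahead;
  for (std::size_t i = 0; i < num_reads; ++i) {
    sizes.push_back(size);
    size = std::min(size * 2, max_size);
  }
  return sizes;
}
```

Under the doubling assumption, the sizes go 8, 16, 32, 64, 128, 256 KB and then stay at the cap, so the default maximum is reached on the sixth sequential read.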
If your entire use case is dominated by iterating and you rely on the OS page cache (i.e. you use buffered IO), you can choose to turn on readahead manually by setting DBOptions.advise_random_on_open = false. This is more helpful if you run on hard drives or remote storage, but may not have much actual effect on directly attached SSD devices.
ReadOptions.readahead_size provides read-ahead support in RocksDB for very limited use cases. The limitation of this feature is that, when it is turned on, the constant cost of the iterator is much higher. So you should only use it if you iterate over a very large range of data and can't work around the problem with other approaches. A typical use case is remote storage with very long latency, where the OS page cache is not available and a large amount of data will be scanned. With this feature enabled, every read of an SST file reads ahead according to this setting. Note that one iterator can have one file open per level, as well as all L0 files, at the same time. You need to budget your read-ahead memory for them. The memory used by the read-ahead buffer can't be tracked automatically.
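The budgeting advice above can be turned into a rough back-of-the-envelope estimate. This is only a sketch: ReadaheadBudgetBytes is a hypothetical helper, not a RocksDB function, and it assumes the worst case in which every file the iterator has open (all L0 files plus one file in each lower level that holds data) keeps a full read-ahead buffer.

```cpp
#include <cassert>
#include <cstddef>

// Rough upper-bound estimate of read-ahead memory for ONE iterator.
// Hypothetical helper for illustration; not part of RocksDB.
std::size_t ReadaheadBudgetBytes(std::size_t readahead_size,
                                 std::size_t num_l0_files,
                                 std::size_t num_lower_levels_with_data) {
  // A value of 0 selects automatic readahead, which this estimate
  // does not model.
  assert(readahead_size > 0);
  const std::size_t open_files = num_l0_files + num_lower_levels_with_data;
  return readahead_size * open_files;
}
```

For example, with a 2 MB readahead_size, 4 L0 files and 5 lower levels holding data, the estimate is 9 buffers of 2 MB, or 18 MB per iterator; multiply by the number of concurrent iterators to get a process-wide figure.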
We are looking to improve read-ahead in RocksDB.