Study materials:
1.https://www.elastic.co/cn/blog/found-dive-into-elasticsearch-storage
2.https://alibaba-cloud.medium.com/analysis-of-lucene-basic-concepts-5ff5d8b90a53
3.https://dzone.com/refcardz/lucene
4.https://stackoverflow.com/questions/2602253/how-does-lucene-index-documents
5.https://lucene.apache.org/core/3_5_0/fileformats.html
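Goals:
Learn Lucene's storage design.
Learning ES is not urgent for now, because ES involves a lot of distributed-systems material; the current focus is single-node storage design.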
Definitions
Segment
When Lucene writes data, it first writes to an in-memory buffer (similar to the MemTable in an LSM tree, but not readable). When the buffered data reaches a certain size, it is flushed to become a Segment. Every segment has its own independent index and is independently searchable, but its data can never be changed. This scheme avoids random writes: data is written in batches or appended, which achieves high throughput. Documents written to a Segment cannot be modified, but they can be deleted. Deletion does not touch the document at its original location in the file; instead, the DocID of the deleted document is recorded in a separate file, so the data file itself never has to be modified. An index query has to search multiple Segments, merge the results, and filter out deleted documents. To optimize queries, Lucene has a policy for merging multiple segments, similar to the way an LSM tree merges SSTables.
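The write path described above can be sketched with a toy model. This is not Lucene's API: `Segment`, `ToyIndex`, and `buffer_limit` are made-up names, the "inverted index" is a plain dict, and doc IDs are global here for simplicity. The sketch only shows the shape of the idea: buffered documents are not searchable until flushed, segments are never modified, deletes are recorded beside the segment, and a merge rewrites the live documents into one new segment.

```python
class Segment:
    """Toy immutable segment: documents plus their own inverted index."""

    def __init__(self, docs):
        # docs maps doc_id -> text and is never changed after construction.
        self.docs = dict(docs)
        self.postings = {}
        for doc_id, text in self.docs.items():
            for term in set(text.lower().split()):
                self.postings.setdefault(term, set()).add(doc_id)
        # Deleted doc IDs are tracked separately (stand-in for a deletes file).
        self.deleted = set()

    def search(self, term):
        return self.postings.get(term.lower(), set()) - self.deleted


class ToyIndex:
    def __init__(self, buffer_limit=2):
        self.buffer_limit = buffer_limit  # hypothetical flush threshold
        self.buffer = {}                  # in-memory, not searchable
        self.segments = []

    def add(self, doc_id, text):
        self.buffer[doc_id] = text
        if len(self.buffer) >= self.buffer_limit:
            self.flush()

    def flush(self):
        # Turn the buffer into a new immutable segment.
        if self.buffer:
            self.segments.append(Segment(self.buffer))
            self.buffer = {}

    def delete(self, doc_id):
        # Mark the document as deleted; segment data is left untouched.
        for seg in self.segments:
            if doc_id in seg.docs:
                seg.deleted.add(doc_id)

    def search(self, term):
        # Query every segment and merge the per-segment results.
        hits = set()
        for seg in self.segments:
            hits |= seg.search(term)
        return sorted(hits)

    def merge(self):
        # Rewrite all live documents into a single new segment.
        live = {}
        for seg in self.segments:
            for doc_id, text in seg.docs.items():
                if doc_id not in seg.deleted:
                    live[doc_id] = text
        self.segments = [Segment(live)] if live else []
```

In this toy, the deleted document only physically disappears at merge time, which is the same trade-off the paragraph describes for segment merging.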
Type of fields
1. In Lucene, fields may be stored, in which case their text is stored in the index literally, in a non-inverted manner.
2. Fields that are inverted are called indexed.
3. The text of a field may be tokenized into terms to be indexed, or the text of a field may be used literally as a single term to be indexed.
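To make the three options concrete, here is a small sketch in plain Python. It is an illustration only, not Lucene's field API: `FieldSpec` and `index_document` are hypothetical names, and tokenization is assumed to be lowercase-and-split-on-whitespace. A stored field keeps its original text, an indexed field contributes terms to the inverted index, and the tokenized flag decides whether the text becomes many terms or one literal term.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSpec:
    name: str
    stored: bool     # keep the original text, non-inverted
    indexed: bool    # add the field's terms to the inverted index
    tokenized: bool  # split text into terms, or use it as one term


def index_document(doc_id, doc, specs, stored_fields, postings):
    """Apply each field's options to one document (toy model)."""
    for spec in specs:
        value = doc.get(spec.name)
        if value is None:
            continue
        if spec.stored:
            stored_fields.setdefault(doc_id, {})[spec.name] = value
        if spec.indexed:
            # Assumed tokenizer: lowercase, split on whitespace.
            terms = value.lower().split() if spec.tokenized else [value]
            for term in terms:
                postings.setdefault((spec.name, term), set()).add(doc_id)
```

For example, a `title` field could be stored, indexed, and tokenized; an `id` field stored and indexed as a single literal term; and a `body` field indexed and tokenized but not stored, so it is searchable but its original text cannot be retrieved.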
Field Infos
1. fnm: the field definitions
2. fdx: the field index, which points into the field data (fdt)
3. fdt: the stored fields for documents
Term Infos
1. tis: part of the term dictionary; stores term info
2. tii: the index into the Term Infos (tis) file
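The relationship between the two files can be sketched as a sparse index over a sorted term list. As I understand the 3.5 file format document (link 5 above), tii holds only every N-th entry of tis so that it fits in memory; the code below is a toy model of that idea, not the on-disk format. `build_term_index`, `lookup`, and `interval` are made-up names, and the value 4 is chosen only to keep the example small.

```python
import bisect


def build_term_index(sorted_terms, interval=4):
    """Keep every `interval`-th term with its position in the full list."""
    return [(sorted_terms[i], i) for i in range(0, len(sorted_terms), interval)]


def lookup(term, sorted_terms, term_index, interval=4):
    """Find a term's position: search the sparse index, then scan one block."""
    keys = [t for t, _ in term_index]
    # Last indexed term that is <= the target term.
    slot = bisect.bisect_right(keys, term) - 1
    if slot < 0:
        return None  # term sorts before the first dictionary entry
    start = term_index[slot][1]
    # Scan at most `interval` entries of the full dictionary.
    for pos in range(start, min(start + interval, len(sorted_terms))):
        if sorted_terms[pos] == term:
            return pos
    return None
```

The design choice is the usual one for sparse indexes: the small index is searched in memory, and only one short block of the large dictionary has to be read per lookup.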
Other