The Lucene tim File Format in Detail

This article and the later Lucene articles in this series use Lucene 8.1.0.

1. What is the tim file

The tim file stores the term values together with per-term statistics, such as each term's total frequency and document frequency.

2. tim file format

[Figure: tim file format]

3. Test code and results

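import java.nio.file.Paths;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.NIOFSDirectory;

public class TimFileDemo {
    public static void main(String[] args) {
        IndexWriterConfig config = new IndexWriterConfig();
        // try-with-resources closes the writer and the directory, even on failure
        try (Directory index = new NIOFSDirectory(Paths.get("/tmp/lucene/test_two"));
             IndexWriter writer = new IndexWriter(index, config)) {
            // doc 0
            Document doc = new Document();
            doc.add(new TextField("title", "lucene test, hello word, nice, nice", Field.Store.YES));
            writer.addDocument(doc);
            // doc 1
            Document doc2 = new Document();
            doc2.add(new TextField("title", "nice haha", Field.Store.YES));
            writer.addDocument(doc2);
            // write both documents out to the index directory
            writer.commit();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}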

The text of doc 0 is: 'lucene test, hello word, nice, nice'
The text of doc 1 is: 'nice haha'

The terms produced by analysis (Lucene sorts the final terms in dictionary order) are:

term	doc/ doc freq	term freq	pos
haha	1/1	1	1
hello	0/1	1	2
lucene	0/1	1	0
nice	0, 1 /2	3	4,5/0
test	0/1	1	1
word	0/1	1	3

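Notes:

doc / doc freq: the docIds of the documents that contain the term, and the term's docFreq. Take haha as an example: it appears in doc 1 and only in doc 1. nice appears in doc 0 and doc 1, so its doc freq = 2.
term freq: the number of times the term occurs across all documents. nice occurs 3 times in total: twice in doc 0 and once in doc 1.
pos: the term's positions within each document that contains it. For nice, the positions are 4 and 5 in doc 0, and 0 in doc 1.

Why is the final term order haha hello lucene nice test word? When building the index, Lucene sorts all terms in dictionary order. On one hand this makes lookups easy; on the other hand, sorted terms take the least space when the FST index is built.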
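
To make the table above easier to reproduce, here is a minimal plain-Java sketch (no Lucene dependency) that recomputes it. It assumes that lowercasing and splitting on non-letter characters is close enough to what the default analyzer does for this simple ASCII input; the class and method names (TermStatsSketch, index, docFreq, totalTermFreq) are made up for illustration and are not Lucene APIs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

final class TermStatsSketch {
    /** Builds term -> (docId -> positions); TreeMap keeps terms in dictionary order. */
    static Map<String, TreeMap<Integer, List<Integer>>> index(String... docs) {
        Map<String, TreeMap<Integer, List<Integer>>> postings = new TreeMap<>();
        for (int docId = 0; docId < docs.length; docId++) {
            // Assumption: a crude stand-in for the real analyzer.
            String[] tokens = docs[docId].toLowerCase(Locale.ROOT).split("[^a-z]+");
            int position = 0;
            for (String token : tokens) {
                if (token.isEmpty()) {
                    continue;
                }
                postings.computeIfAbsent(token, t -> new TreeMap<>())
                        .computeIfAbsent(docId, d -> new ArrayList<>())
                        .add(position++);
            }
        }
        return postings;
    }

    /** Number of documents that contain the term. */
    static int docFreq(Map<Integer, List<Integer>> termPostings) {
        return termPostings.size();
    }

    /** Number of occurrences of the term across all documents. */
    static int totalTermFreq(Map<Integer, List<Integer>> termPostings) {
        int total = 0;
        for (List<Integer> positions : termPostings.values()) {
            total += positions.size();
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, TreeMap<Integer, List<Integer>>> postings =
                index("lucene test, hello word, nice, nice", "nice haha");
        postings.forEach((term, p) -> System.out.println(
                term + "\tdocFreq=" + docFreq(p)
                        + "\ttermFreq=" + totalTermFreq(p) + "\tpositions=" + p));
    }
}
```

Running main prints one line per term in the same order as the table, with nice showing docFreq 2 and term freq 3.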

4. The tim file

[Figure: tim file contents (hex dump)]

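5. File content analysis

5.1 File header

The header section contains two headers, BlockTreeTermsDict and Lucene50PostingsWriterTerms, whose layouts are essentially the same. The source for this part is at lines 274 and 283 of BlockTreeTermsWriter.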

5.1.1 BlockTreeTermsDict contents

1) 3fd7 6c17: the fixed header MAGIC
2) 12: the length of "BlockTreeTermsDict", 18
3) 42 6c6f 636b 5472 6565 5465 726d 7344 6963 74: 18 bytes, i.e. "BlockTreeTermsDict"
4) 00 0000 03: 4 bytes, BlockTreeTermsReader.VERSION_CURRENT
5) e7 b872 3275 fbea 57cd 23fa 3f4a 2696 30: the 16-byte segmentId, which is randomly generated
6) 0a: the segment suffix length, 10
7) 4c75 6365 6e65 3530 5f30: the 10 bytes of the segment suffix, i.e. "Lucene50_0"

5.1.2 Lucene50PostingsWriterTerms contents

1) Same as 1) in 5.1.1
2) 1b: the length of "Lucene50PostingsWriterTerms", 27
3) 4c 7563 656e 6535 3050 6f73 7469 6e67 7357 7269 7465 7254 6572 6d73: i.e. "Lucene50PostingsWriterTerms"
4) 0000 0001: 4 bytes, VERSION_IMPACT_SKIP_DATA
5) Same as 5) in 5.1.1
6) Same as 6) in 5.1.1
7) Same as 7) in 5.1.1
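
The seven header fields can be put together in a few lines of plain Java. The following is only a sketch of the layout described above, not Lucene's own writer code: IndexHeaderSketch and its header method are hypothetical names, the segmentId is passed in as an arbitrary 16-byte array, and both length fields are written as a single byte, which is only valid because the strings here are shorter than 128 bytes.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class IndexHeaderSketch {
    static final int CODEC_MAGIC = 0x3fd76c17;

    /** Lays out the 7 header fields in order; ByteBuffer is big-endian by default. */
    static byte[] header(String codec, int version, byte[] segmentId, String suffix) {
        byte[] codecBytes = codec.getBytes(StandardCharsets.UTF_8);
        byte[] suffixBytes = suffix.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(
                4 + 1 + codecBytes.length + 4 + segmentId.length + 1 + suffixBytes.length);
        buf.putInt(CODEC_MAGIC);             // 1) magic
        buf.put((byte) codecBytes.length);   // 2) codec name length (assumed < 128)
        buf.put(codecBytes);                 // 3) codec name
        buf.putInt(version);                 // 4) version
        buf.put(segmentId);                  // 5) 16-byte segment id
        buf.put((byte) suffixBytes.length);  // 6) suffix length (assumed < 128)
        buf.put(suffixBytes);                // 7) suffix
        return buf.array();
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] segmentId = new byte[16]; // placeholder; the real id is random
        byte[] dict = header("BlockTreeTermsDict", 3, segmentId, "Lucene50_0");
        byte[] postings = header("Lucene50PostingsWriterTerms", 1, segmentId, "Lucene50_0");
        System.out.println(dict.length + " bytes: " + hex(dict));
        System.out.println(postings.length + " bytes: " + hex(postings));
    }
}
```

With these inputs the two headers come out at 54 and 63 bytes, so the term data in this file starts at offset 117.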

5.2 Term data

The code above inserted two docs, each with a single field, title. After analysis there are 6 terms in total:

term	doc/ doc freq	term freq
haha	1/1	1
hello	0/1	1
lucene	0/1	1
nice	0, 1 /2	3
test	0/1	1
word	0/1	1

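Now let's analyze the term section.

80 01: the block size, which is 128 encoded as a VInt. Lucene50PostingsWriter writes this value right after its header; it is the postings BLOCK_SIZE (how many values are packed into one postings block), not a limit on the number of terms in a term block. Lucene's encoding schemes will be covered separately in a later article.
0d, i.e. 13: 13 = 6 * 2 + 1, where 6 is the number of terms and the low bit is a flag. The calculation is at line 668 of BlockTreeTermsWriter#writeBlock. Corresponds to "term count" in the tim file format figure.
43, i.e. 67: 67 = 33 * 2 + 1, where 33 is the number of bytes taken by the 6 terms and the low bit is again a flag. The calculation is at line 820 of BlockTreeTermsWriter#writeBlock. Corresponds to "term bytes size" in the tim file format figure.
The next 33 bytes are the term contents: for each term, first its size, then its value.
04 6861 6861 0568 656c 6c6f 066c 7563 656e 6504 6e69 6365 0474 6573 7404 776f 7264: corresponds to the "term 1 to term n length and value" part of the tim file format figure.
0c, i.e. 12: the next 12 bytes hold the terms' docFreq and totalTermFreq. Corresponds to "doc freq / term freq size" in the tim file format figure.
01 0001 0001 0002 0101 0001 00: for each term in turn, its docFreq followed by (totalTermFreq - docFreq). Corresponds to the "doc freq / term freq values" part of the tim file format figure.
11, i.e. 17: the next 17 bytes are the meta data, which mainly records each term's offsets in the doc file and the pos file, plus the term's docId. Entries come in groups of three bytes; when a term appears in more than one document (docFreq > 1), its doc IDs are stored in the doc file instead, so that entry has only two bytes. 17 = 5 * 3 + 2.
5e 3d 01: the meta data of the first term, haha. 5e is the term's offset in the doc file; looking closely at the doc file, 0x5e is exactly where the doc data starts. 3d is haha's offset in the pos file; the byte at 0x3d in the pos file is 1, which is exactly haha's position. 01 means doc 1.
00 01 00: the meta data of term hello. 00 01 are delta-encoded, so the term's doc data offset is 0x5e and its position offset is 0x3e. The final 00 means doc 0.
00 01 00: the meta data of term lucene. 00 01 are delta-encoded, so the term's doc data offset is 0x5e and its position offset is 0x3f. The final 00 means doc 0.
00 01: the meta data of term nice. 00 01 are delta-encoded, so the term's doc data offset is 0x5e and its position offset is 0x40. nice has a doc freq greater than 1, so its doc information is stored in the doc file.
03 03 00: the meta data of term test. 03 03 are delta-encoded, so the term's doc data offset is 0x5e + 3 = 0x61 and its position offset is 0x40 + 3 = 0x43 (the 3 bytes skipped in each file belong to nice). The final 00 means doc 0.
00 01 00: the meta data of term word. 00 01 are delta-encoded, so the term's doc data offset is 0x61 and its position offset is 0x44. The final 00 means doc 0.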
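
The walkthrough above can be checked mechanically. Below is a minimal plain-Java sketch that decodes exactly the bytes listed in this section. It is not Lucene's reader: TermBlockSketch, readVInt, decode and BLOCK_HEX are names invented here, and the decoder assumes what holds for this particular file (a single leaf block, positions indexed without offsets or payloads, two delta-encoded file pointers per term, and a trailing docId only when docFreq is 1).

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class TermBlockSketch {
    // The term section exactly as listed above: block size, term block, stats, meta.
    static final String BLOCK_HEX = "8001" + "0d" + "43"
            + "0468616861" + "0568656c6c6f" + "066c7563656e65"
            + "046e696365" + "0474657374" + "04776f7264"
            + "0c" + "010001000100020101000100"
            + "11" + "5e3d01" + "000100" + "000100" + "0001" + "030300" + "000100";

    private final byte[] data;
    private int pos;

    TermBlockSketch(byte[] data) {
        this.data = data;
    }

    /** VInt: 7 payload bits per byte, low bits first; the high bit means "more bytes follow". */
    int readVInt() {
        int value = 0;
        int shift = 0;
        byte b;
        do {
            b = data[pos++];
            value |= (b & 0x7f) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return value;
    }

    String readString(int length) {
        String s = new String(data, pos, length, StandardCharsets.UTF_8);
        pos += length;
        return s;
    }

    static byte[] fromHex(String hex) {
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    static List<String> decode(byte[] block) {
        TermBlockSketch in = new TermBlockSketch(block);
        in.readVInt();                       // 80 01 -> 128, postings block size
        int numTerms = in.readVInt() >>> 1;  // 0d -> 6 terms (low bit is a flag)
        in.readVInt();                       // 43 -> 33 term bytes (low bit is a flag)
        String[] terms = new String[numTerms];
        for (int i = 0; i < numTerms; i++) {
            terms[i] = in.readString(in.readVInt());
        }
        in.readVInt();                       // 0c -> 12 bytes of stats
        int[] docFreq = new int[numTerms];
        int[] totalTermFreq = new int[numTerms];
        for (int i = 0; i < numTerms; i++) {
            docFreq[i] = in.readVInt();
            totalTermFreq[i] = docFreq[i] + in.readVInt(); // stored as a delta
        }
        in.readVInt();                       // 11 -> 17 bytes of meta data
        int docFP = 0;
        int posFP = 0;
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < numTerms; i++) {
            docFP += in.readVInt();          // delta against the previous term
            posFP += in.readVInt();
            int singletonDoc = docFreq[i] == 1 ? in.readVInt() : -1;
            rows.add(String.format(
                    "%s docFreq=%d totalTermFreq=%d docFP=0x%x posFP=0x%x singletonDoc=%d",
                    terms[i], docFreq[i], totalTermFreq[i], docFP, posFP, singletonDoc));
        }
        return rows;
    }

    public static void main(String[] args) {
        decode(fromHex(BLOCK_HEX)).forEach(System.out::println);
    }
}
```

The decoded rows should match the list above term by term; singletonDoc=-1 for nice simply marks that no docId is stored inline for it.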

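5.3 Field metadata section

The field metadata section mainly stores per-field statistics. The source is in BlockTreeTermsWriter#close. In detail:

01, i.e. 1: the number of fields. Corresponds to "field size n" in the tim file format figure. The metadata of each field follows.
00: the field info number. Lucene assigns each field a field number internally, and the first field gets 0. Corresponds to "field info number" in the tim file format figure.
06: the number of terms. Corresponds to "num terms" in the tim file format figure.
02: the root code length. Corresponds to "rootcode length" in the tim file format figure.
de 03: the root code contents. Corresponds to "root code" in the tim file format figure.
08: the total frequency of all terms, 8 = (1 + 1 + 1 + 3 + 1 + 1). Corresponds to "sumTotalTermFreq" in the tim file format figure.
07: the total doc freq of all terms, 7 = (1 + 1 + 1 + 2 + 1 + 1). Corresponds to "sumDocFreq" in the tim file format figure.
02: the number of docs. Corresponds to "docCount" in the tim file format figure.
02: the longs size, which is 2 for this field (positions are indexed, without offsets or payloads). Corresponds to "longsSize" in the tim file format figure.
04 6861 6861: the smallest term. 04 is the length and the following 4 bytes are the content. Corresponds to "minTerm" in the tim file format figure.
0477 6f72 6400: the largest term. 04 is the length and the following 4 bytes are the content. Corresponds to "maxTerm" in the tim file format figure.
00 0000 0000 0000 b9: an 8-byte long, the offset of the field meta within the tim file, 185. Corresponds to "fieldMeta offset" in the tim file format figure.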
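
As a sanity check on the numbers in this section, the sketch below re-derives the two sums and the 185 offset from the sizes worked out earlier in the article. It is plain arithmetic rather than anything read from a real file, and TimOffsetSketch and its methods are hypothetical names used only here.

```java
final class TimOffsetSketch {
    // Per-term statistics in dictionary order: haha, hello, lucene, nice, test, word.
    static final int[] DOC_FREQ = {1, 1, 1, 2, 1, 1};
    static final int[] TOTAL_TERM_FREQ = {1, 1, 1, 3, 1, 1};

    static int sum(int[] values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    /** magic + length byte + codec name + version + segment id + length byte + suffix. */
    static int headerLength(String codec, String suffix) {
        return 4 + 1 + codec.length() + 4 + 16 + 1 + suffix.length();
    }

    /** Offset at which the field metadata section starts. */
    static int fieldMetaStart() {
        int headers = headerLength("BlockTreeTermsDict", "Lucene50_0")
                + headerLength("Lucene50PostingsWriterTerms", "Lucene50_0");
        int blockSize = 2;                           // 80 01
        int termBlock = 1 + 1 + 33 + 1 + 12 + 1 + 17; // 0d, 43, terms, 0c, stats, 11, meta
        return headers + blockSize + termBlock;
    }

    public static void main(String[] args) {
        System.out.println("sumTotalTermFreq = " + sum(TOTAL_TERM_FREQ));
        System.out.println("sumDocFreq       = " + sum(DOC_FREQ));
        System.out.println("field meta start = " + fieldMetaStart()
                + " (0x" + Integer.toHexString(fieldMetaStart()) + ")");
    }
}
```

The offset works out as 54 + 63 + 2 + 66 = 185 = 0xb9, which agrees with the trailing long.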

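5.4 Footer section

The footer section contains the following:

c0 2893 e8: the footer MAGIC value, which is the bitwise complement of the header MAGIC
00 0000 00: a fixed 4-byte int with value 0
00 0000 00fb 0975 84: the 8-byte CRC checksum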
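
The relationship between the two magic values and the shape of the footer can be sketched in plain Java. This is an illustration rather than Lucene's own code: FooterSketch and appendFooter are invented names, and the sketch assumes the checksum is a CRC32 computed over every byte before it, including the first 8 footer bytes, then stored as an 8-byte long.

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

final class FooterSketch {
    static final int CODEC_MAGIC = 0x3fd76c17;
    static final int FOOTER_MAGIC = ~CODEC_MAGIC; // bitwise complement: 0xc02893e8

    /** Appends the 16-byte footer (magic, 0, checksum) to the given file body. */
    static byte[] appendFooter(byte[] body) {
        ByteBuffer out = ByteBuffer.allocate(body.length + 16);
        out.put(body);
        out.putInt(FOOTER_MAGIC);
        out.putInt(0); // fixed int, always 0
        CRC32 crc = new CRC32();
        // Assumption: the checksum covers everything written so far.
        crc.update(out.array(), 0, body.length + 8);
        out.putLong(crc.getValue()); // a 32-bit CRC in a long: the high 4 bytes are 0
        return out.array();
    }

    public static void main(String[] args) {
        System.out.println("footer magic = 0x" + Integer.toHexString(FOOTER_MAGIC));
        byte[] file = appendFooter(new byte[] {1, 2, 3});
        System.out.println("file length with footer = " + file.length);
    }
}
```

Storing a 32-bit CRC in a long also explains why the first four bytes of the checksum in the dump are all zero.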
