Lucene Source Code Analysis: The Search Process, Part 1

1 Search Example

First, a few documents were written to a Lucene index in advance. Each document contains two fields (id and name), and both fields are stored and indexed.

{
    "id": 0,
    "name": "Stephen"
},{
    "id": 1,
    "name": "Draymond"
},{
    "id": 2,
    "name": "LeBron"
},{
    "id": 3,
    "name": "Kevin"
}

With the following code we can search the index for the document whose name is LeBron. The rest of this series analyzes how this code works.

public class IndexSearcherTest {
    public static void main(String[] args) throws IOException, ParseException {
        Directory directory = FSDirectory.open(new File("/lucene/index/path"));
        IndexReader indexReader = DirectoryReader.open(directory);

        IndexSearcher indexSearcher = new IndexSearcher(indexReader);
        
        QueryParser queryParser = new QueryParser("name", new StandardAnalyzer());
        Query query = queryParser.parse("LeBron");

        TopDocs topDocs = indexSearcher.search(query, 10);

        ScoreDoc[] scoreDocs = topDocs.scoreDocs;
        for(ScoreDoc scoreDoc : scoreDocs){
            System.out.println("doc: " + scoreDoc.doc + ", score: " + scoreDoc.score);

            Document document = indexReader.document(scoreDoc.doc);
            if(null == document){
                continue;
            }
            System.out.println("id :" + document.get("id") + ", name: " + document.get("name"));
        }
    }
}

This article covers opening the index path and the index directory.

2 Opening the Index Files

Since the index is stored on the local disk, FSDirectory can open the local index files and return a Directory object for the index path.

Directory directory = FSDirectory.open(new File("/lucene/index/path"));

① If the current JRE is 64-bit and supports unmapping (i.e. the sun.misc.Cleaner class and the java.nio.DirectByteBuffer.cleaner() method can be loaded), an MMapDirectory is created
② Otherwise, if the operating system is Windows (the OS name starts with "Windows"), a SimpleFSDirectory is created
③ Otherwise, an NIOFSDirectory is created

public abstract class FSDirectory extends BaseDirectory {
  public static FSDirectory open(File path) throws IOException {
    return open(path, null);
  }

  public static FSDirectory open(File path, LockFactory lockFactory) throws IOException {
    if (Constants.JRE_IS_64BIT && MMapDirectory.UNMAP_SUPPORTED) {
      return new MMapDirectory(path, lockFactory);
    } else if (Constants.WINDOWS) {
      return new SimpleFSDirectory(path, lockFactory);
    } else {
      return new NIOFSDirectory(path, lockFactory);
    }
  }
}
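The branch order above can be exercised in isolation. The sketch below is a simplified stand-in for FSDirectory.open()'s decision; the real MMapDirectory.UNMAP_SUPPORTED check additionally probes sun.misc.Cleaner via reflection, and the class here is illustrative, not part of Lucene:

```java
public class DirectoryChooser {

    /**
     * Mirrors the selection order in FSDirectory.open(): memory-mapped I/O
     * first (64-bit JRE with unmap support), then SimpleFSDirectory on
     * Windows, and NIOFSDirectory everywhere else.
     */
    static String choose(boolean jre64Bit, boolean unmapSupported, boolean isWindows) {
        if (jre64Bit && unmapSupported) {
            return "MMapDirectory";
        } else if (isWindows) {
            return "SimpleFSDirectory";
        } else {
            return "NIOFSDirectory";
        }
    }

    public static void main(String[] args) {
        System.out.println(choose(true, true, false));   // MMapDirectory
        System.out.println(choose(false, false, true));  // SimpleFSDirectory
        System.out.println(choose(false, false, false)); // NIOFSDirectory
    }
}
```

Note that a 64-bit Windows JVM with unmap support also gets MMapDirectory, since that branch is tested first.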

All three Directory implementations delegate their construction to the parent class FSDirectory.

(Figure: FSDirectory class diagram)

The FSDirectory constructor mainly initializes the lockFactory and the directory field.

public abstract class FSDirectory extends BaseDirectory {
  protected FSDirectory(File path, LockFactory lockFactory) throws IOException {
    // new ctors use always NativeFSLockFactory as default:
    if (lockFactory == null) {
      lockFactory = new NativeFSLockFactory();
    }
    directory = path.getCanonicalFile();

    if (directory.exists() && !directory.isDirectory())
      throw new NoSuchDirectoryException("file '" + directory + "' exists but is not a directory");

    setLockFactory(lockFactory);

  }
}
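The constructor's path handling can be reproduced with plain java.io.File: canonicalize the path, then reject a path that exists but is not a directory. This is a sketch only; the real code throws NoSuchDirectoryException rather than a plain IOException:

```java
import java.io.File;
import java.io.IOException;

public class DirectoryCheck {

    /** Canonicalize the path and reject an existing non-directory, as FSDirectory does. */
    static File validate(File path) throws IOException {
        File dir = path.getCanonicalFile();
        if (dir.exists() && !dir.isDirectory()) {
            throw new IOException("file '" + dir + "' exists but is not a directory");
        }
        return dir;
    }

    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("not-a-dir", ".tmp");
        tmp.deleteOnExit();
        try {
            validate(tmp); // a regular file must be rejected
        } catch (IOException expected) {
            System.out.println("rejected: " + expected.getMessage());
        }
        // a path that does not exist yet is accepted; the directory can be created later
        File ok = validate(new File(tmp.getParentFile(), "future-index-dir"));
        System.out.println(ok.isAbsolute());
    }
}
```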

3 Opening the Index Directory

3.1 Locating the segments file

Once the index path has been opened and the Directory obtained, DirectoryReader opens the index directory.

IndexReader indexReader = DirectoryReader.open(directory);

This is a fairly involved process: it first locates and opens the segments files (segments_N and segments.gen), then reads the per-segment file lists from them and opens the index files themselves (.tip, .tim, .doc, .pos, .tvd, .tvx, .si, .nvd, .nvm).

public abstract class DirectoryReader extends BaseCompositeReader<AtomicReader> {
  public static DirectoryReader open(final Directory directory) throws IOException {
    return StandardDirectoryReader.open(directory, null, DEFAULT_TERMS_INDEX_DIVISOR);
  }
}

StandardDirectoryReader's open() method creates a SegmentInfos.FindSegmentsFile object with an overridden doBody() method, then calls run() on it.

final class StandardDirectoryReader extends DirectoryReader {
  static DirectoryReader open(final Directory directory, final IndexCommit commit,
                          final int termInfosIndexDivisor) throws IOException {
    return (DirectoryReader) new SegmentInfos.FindSegmentsFile(directory) {
      @Override
      protected Object doBody(String segmentFileName) throws IOException {
        SegmentInfos sis = new SegmentInfos();
        sis.read(directory, segmentFileName);
        final SegmentReader[] readers = new SegmentReader[sis.size()];
        boolean success = false;
        try {
          for (int i = sis.size()-1; i >= 0; i--) {
            readers[i] = new SegmentReader(sis.info(i), termInfosIndexDivisor, IOContext.READ);
          }

          DirectoryReader reader = new StandardDirectoryReader(directory, readers, null, sis, termInfosIndexDivisor, false);
          success = true;

          return reader;
        } finally {
          if (success == false) {
            IOUtils.closeWhileHandlingException(readers);
          }
        }
      }
    }.run(commit);
  }
}

SegmentInfos.FindSegmentsFile's run() method locates the segments file:
① Among the files whose names start with "segments" (excluding segments.gen), take the largest generation suffix as genA
② Read the segments.gen file; per the Lucene index file format its layout is:

GenHeader Generation Generation Footer

The generation is a long value written twice; if the two copies agree, that value is genB
③ Take the larger of genA and genB as the final gen, and use segments_[gen] as the segments file name

public final class SegmentInfos implements Cloneable, Iterable<SegmentCommitInfo> {
  public abstract static class FindSegmentsFile {
    public Object run(IndexCommit commit) throws IOException {
      if (commit != null) {
        if (directory != commit.getDirectory())
          throw new IOException("the specified commit does not match the specified Directory");
        return doBody(commit.getSegmentsFileName());
      }

      String segmentFileName = null;
      long lastGen = -1;
      long gen = 0;
      int retryCount = 0;

      boolean useFirstMethod = true;

      while(true) {
        if (useFirstMethod) {
          // method 1: list the directory and consult segments.gen
          String[] files = directory.listAll();

          long genA = -1;
          if (files != null) {
            genA = getLastCommitGeneration(files);
          }

          long genB = -1;
          ChecksumIndexInput genInput = null;
          try {
            genInput = directory.openChecksumInput(IndexFileNames.SEGMENTS_GEN, IOContext.READONCE);
          } catch (IOException e) {
            // segments.gen is missing or unreadable; keep genB == -1
          }

          if (genInput != null) {
            try {
              int version = genInput.readInt();
              if (version == FORMAT_SEGMENTS_GEN_47 || version == FORMAT_SEGMENTS_GEN_CHECKSUM) {
                long gen0 = genInput.readLong();
                long gen1 = genInput.readLong();
                if (gen0 == gen1) {
                  // The file is consistent.
                  genB = gen0;
                }
              } else {
                throw new IndexFormatTooNewException(genInput, version, FORMAT_SEGMENTS_GEN_START, FORMAT_SEGMENTS_GEN_CURRENT);
              }
            } catch (IOException err2) {
              // ignore: a torn segments.gen just means genB stays -1
            } finally {
              genInput.close();
            }
          }

          gen = Math.max(genA, genB);
        }

        // after two retries on the same gen, fall back to probing file names directly
        if (useFirstMethod && lastGen == gen && retryCount >= 2) {
          useFirstMethod = false;
        }
        lastGen = gen;
        // ... (retry bookkeeping elided)

        segmentFileName = IndexFileNames.fileNameFromGeneration(IndexFileNames.SEGMENTS,
                                                                "",
                                                                gen);
        try {
          Object v = doBody(segmentFileName);
          if (infoStream != null) {
            message("success on " + segmentFileName);
          }
          return v;
        } catch (IOException err) {
          // ... (remember the exception and retry)
        }
      }
    }

    protected abstract Object doBody(String segmentFileName) throws IOException;
  }
}
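The generation arithmetic in run() can be sketched standalone: genA comes from the largest base-36 suffix among the segments_N file names, genB from the double-written long in segments.gen, and the winner is formatted back into a file name. DataInput/OutputStream stand in for Lucene's IndexInput here, and the format-header value written below is a placeholder:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class SegmentGen {

    /** genA: largest base-36 suffix among "segments_N" files, skipping "segments.gen". */
    static long lastCommitGeneration(String[] files) {
        long max = -1;
        for (String f : files) {
            if (f.startsWith("segments") && !f.equals("segments.gen")) {
                long gen = f.equals("segments")
                        ? 0
                        : Long.parseLong(f.substring("segments_".length()), Character.MAX_RADIX);
                max = Math.max(max, gen);
            }
        }
        return max;
    }

    /** genB: the generation stored twice in segments.gen; valid only if both copies agree. */
    static long genFromSegmentsGen(DataInputStream in) throws IOException {
        in.readInt(); // format header (version checking elided)
        long gen0 = in.readLong();
        long gen1 = in.readLong();
        return gen0 == gen1 ? gen0 : -1;
    }

    /** The winning generation names the segments file; the suffix is written in base 36. */
    static String fileNameFromGeneration(long gen) {
        return gen == 0 ? "segments" : "segments_" + Long.toString(gen, Character.MAX_RADIX);
    }

    public static void main(String[] args) throws IOException {
        long genA = lastCommitGeneration(
                new String[] {"segments_1", "segments_a", "segments.gen", "_0.cfs"});

        // Simulate a segments.gen claiming generation 11, written twice.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeInt(-2); // placeholder format header
        out.writeLong(11);
        out.writeLong(11);
        long genB = genFromSegmentsGen(
                new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));

        long gen = Math.max(genA, genB);                  // max(10, 11) = 11
        System.out.println(fileNameFromGeneration(gen));  // segments_b
    }
}
```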

3.2 Opening the index files

The doBody() method overridden on SegmentInfos.FindSegmentsFile reads the individual index files.

final class StandardDirectoryReader extends DirectoryReader {
  static DirectoryReader open(final Directory directory, final IndexCommit commit,
                          final int termInfosIndexDivisor) throws IOException {
    return (DirectoryReader) new SegmentInfos.FindSegmentsFile(directory) {
      @Override
      protected Object doBody(String segmentFileName) throws IOException {
        SegmentInfos sis = new SegmentInfos();
        sis.read(directory, segmentFileName);
        final SegmentReader[] readers = new SegmentReader[sis.size()];
        boolean success = false;
        try {
          for (int i = sis.size()-1; i >= 0; i--) {
            readers[i] = new SegmentReader(sis.info(i), termInfosIndexDivisor, IOContext.READ);
          }

          // This may throw IllegalArgumentException if there are too many docs, so
          // it must be inside try clause so we close readers in that case:
          DirectoryReader reader = new StandardDirectoryReader(directory, readers, null, sis, termInfosIndexDivisor, false);
          success = true;

          return reader;
        } finally {
          if (success == false) {
            IOUtils.closeWhileHandlingException(readers);
          }
        }
      }
    }.run(commit);
  }
}

doBody() first creates a SegmentInfos object and calls sis.read(directory, segmentFileName); to read the segments file.
The segments file layout is:

Header Version NameCounter SegCount <SegName, SegCodec, DelGen, DeletionCount, FieldInfosGen, UpdatesFiles>^SegCount CommitUserData Footer

It then iterates over every segment's metadata and calls the SegmentReader constructor to read that segment's index files.

public final class SegmentReader extends AtomicReader implements Accountable {
  public SegmentReader(SegmentCommitInfo si, int termInfosIndexDivisor, IOContext context) throws IOException {
    this.si = si;
    // read the field infos (via the .cfs compound file when one is used)
    fieldInfos = readFieldInfos(si);
    // read the .tip .tim .nvd .nvm .fdt .fdx .tvf .tvd .tvx files
    core = new SegmentCoreReaders(this, si.info.dir, si, context, termInfosIndexDivisor);
    segDocValues = new SegmentDocValues();
    
    boolean success = false;
    final Codec codec = si.info.getCodec();
    try {
      if (si.hasDeletions()) {
        // read the .del file
        liveDocs = codec.liveDocsFormat().readLiveDocs(directory(), si, IOContext.READONCE);
      } else {
        assert si.getDelCount() == 0;
        liveDocs = null;
      }
      numDocs = si.info.getDocCount() - si.getDelCount();

      if (fieldInfos.hasDocValues()) {
        initDocValuesProducers(codec);
      }

      success = true;
    } finally {
      if (!success) {
        doClose();
      }
    }
  }
}

① The readFieldInfos() method reads the field infos; when the segment uses a compound file it goes through the .cfs file, a "virtual" file used to access the compound stream

public final class SegmentReader extends AtomicReader implements Accountable {
  static FieldInfos readFieldInfos(SegmentCommitInfo info) throws IOException {
    final Directory dir;
    final boolean closeDir;
    if (info.getFieldInfosGen() == -1 && info.info.getUseCompoundFile()) {
      // no fieldInfos gen and segment uses a compound file
      dir = new CompoundFileDirectory(info.info.dir,
          IndexFileNames.segmentFileName(info.info.name, "", IndexFileNames.COMPOUND_FILE_EXTENSION),
          IOContext.READONCE,
          false);
      closeDir = true;
    } else {
      // gen'd FIS are read outside CFS, or the segment doesn't use a compound file
      dir = info.info.dir;
      closeDir = false;
    }
    
    try {
      final String segmentSuffix = info.getFieldInfosGen() == -1 ? "" : Long.toString(info.getFieldInfosGen(), Character.MAX_RADIX);
      Codec codec = info.info.getCodec();
      FieldInfosFormat fisFormat = codec.fieldInfosFormat();
      return fisFormat.getFieldInfosReader().read(dir, info.info.name, segmentSuffix, IOContext.READONCE);
    } finally {
      if (closeDir) {
        dir.close();
      }
    }
  }
}
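CompoundFileDirectory's trick is that the .cfs file is one packed file that exposes each member as a virtual sub-file through an entry table of offsets and lengths. A toy illustration of the idea follows; the layout here is invented for the example and is not Lucene's actual on-disk .cfs format:

```java
import java.util.HashMap;
import java.util.Map;

public class ToyCompoundFile {

    // Entry table: virtual file name -> {offset, length} into the packed bytes.
    private final byte[] data;
    private final Map<String, int[]> entries = new HashMap<>();

    ToyCompoundFile(byte[] data) {
        this.data = data;
    }

    void addEntry(String name, int offset, int length) {
        entries.put(name, new int[] {offset, length});
    }

    /** "Open" a virtual sub-file by slicing the packed bytes. */
    byte[] openInput(String name) {
        int[] e = entries.get(name);
        byte[] slice = new byte[e[1]];
        System.arraycopy(data, e[0], slice, 0, e[1]);
        return slice;
    }

    public static void main(String[] args) {
        byte[] packed = "HELLOWORLD".getBytes();
        ToyCompoundFile cfs = new ToyCompoundFile(packed);
        cfs.addEntry("_0.fnm", 0, 5);
        cfs.addEntry("_0.tim", 5, 5);
        System.out.println(new String(cfs.openInput("_0.fnm"))); // HELLO
        System.out.println(new String(cfs.openInput("_0.tim"))); // WORLD
    }
}
```

Packing many per-segment files into one reduces the number of open file handles, which is why Lucene defaults to compound files for small segments.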

② Constructing the SegmentCoreReaders object reads the .tip, .tim, .nvd, .nvm, .fdt, .fdx, .tvf, .tvd and .tvx files

final class SegmentCoreReaders implements Accountable {
  SegmentCoreReaders(SegmentReader owner, Directory dir, SegmentCommitInfo si, IOContext context, int termsIndexDivisor) throws IOException {

    if (termsIndexDivisor == 0) {
      throw new IllegalArgumentException("indexDivisor must be < 0 (don't load terms index) or greater than 0 (got 0)");
    }
    
    final Codec codec = si.info.getCodec();
    final Directory cfsDir; // confusing name: if (cfs) its the cfsdir, otherwise its the segment's directory.

    boolean success = false;
    
    try {
      if (si.info.getUseCompoundFile()) {
        // open the .cfs compound file
        cfsDir = cfsReader = new CompoundFileDirectory(dir, IndexFileNames.segmentFileName(si.info.name, "", IndexFileNames.COMPOUND_FILE_EXTENSION), context, false);
      } else {
        cfsReader = null;
        cfsDir = dir;
      }

      final FieldInfos fieldInfos = owner.fieldInfos;
      
      this.termsIndexDivisor = termsIndexDivisor;
      final PostingsFormat format = codec.postingsFormat();
      final SegmentReadState segmentReadState = new SegmentReadState(cfsDir, si.info, fieldInfos, context, termsIndexDivisor);
      // read the .tip and .tim files
      fields = format.fieldsProducer(segmentReadState);
      assert fields != null;

      if (fieldInfos.hasNorms()) {
        // read the .nvd and .nvm files
        normsProducer = codec.normsFormat().normsProducer(segmentReadState);
        assert normsProducer != null;
      } else {
        normsProducer = null;
      }
      // read the .fdx and .fdt files
      fieldsReaderOrig = si.info.getCodec().storedFieldsFormat().fieldsReader(cfsDir, si.info, fieldInfos, context);
      if (fieldInfos.hasVectors()) { 
        // read the .tvf, .tvd and .tvx files
        termVectorsReaderOrig = si.info.getCodec().termVectorsFormat().vectorsReader(cfsDir, si.info, fieldInfos, context);
      } else {
        termVectorsReaderOrig = null;
      }

      success = true;
    } finally {
      if (!success) {
        decRef();
      }
    }
  }
}

The .tim (term dictionary) file layout is:

Header PostingsHeader NodeBlock^NumBlocks FieldSummary DirOffset Footer

The .tip (term index) file layout is:

Header FSTIndex^NumFields <IndexStartFP>^NumFields DirOffset Footer

public class BlockTreeTermsReader extends FieldsProducer {
  public BlockTreeTermsReader(Directory dir, FieldInfos fieldInfos, SegmentInfo info,
                              PostingsReaderBase postingsReader, IOContext ioContext,
                              String segmentSuffix, int indexDivisor)
    throws IOException {
    
    this.postingsReader = postingsReader;

    this.segment = info.name;
    // open the .tim terms dictionary file
    in = dir.openInput(IndexFileNames.segmentFileName(segment, segmentSuffix, BlockTreeTermsWriter.TERMS_EXTENSION),
                       ioContext);

    boolean success = false;
    IndexInput indexIn = null;

    try {
      version = readHeader(in);
      if (indexDivisor != -1) {
        indexIn = dir.openInput(IndexFileNames.segmentFileName(segment, segmentSuffix, BlockTreeTermsWriter.TERMS_INDEX_EXTENSION),
                                ioContext);
        int indexVersion = readIndexHeader(indexIn);
        if (indexVersion != version) {
          throw new CorruptIndexException("mixmatched version files: " + in + "=" + version + "," + indexIn + "=" + indexVersion);
        }
      }
      
      // verify
      if (indexIn != null && version >= BlockTreeTermsWriter.VERSION_CHECKSUM) {
        CodecUtil.checksumEntireFile(indexIn);
      }

      // Have PostingsReader init itself
      postingsReader.init(in);

      // ... (remaining setup elided)
      success = true;
    } finally {
      if (!success) {
        IOUtils.closeWhileHandlingException(in, indexIn);
      }
    }
  }
}

The .fdx file layout is:

<Header> <FieldValuesPosition>^SegSize

The .fdt file layout is:

<Header> <DocFieldData>^SegSize

public final class Lucene40StoredFieldsReader extends StoredFieldsReader implements Cloneable, Closeable {
  public Lucene40StoredFieldsReader(Directory d, SegmentInfo si, FieldInfos fn, IOContext context) throws IOException {
    final String segment = si.name;
    boolean success = false;
    fieldInfos = fn;
    try {
      fieldsStream = d.openInput(IndexFileNames.segmentFileName(segment, "", FIELDS_EXTENSION), context);
      final String indexStreamFN = IndexFileNames.segmentFileName(segment, "", FIELDS_INDEX_EXTENSION);
      indexStream = d.openInput(indexStreamFN, context);
      
      CodecUtil.checkHeader(indexStream, CODEC_NAME_IDX, VERSION_START, VERSION_CURRENT);
      CodecUtil.checkHeader(fieldsStream, CODEC_NAME_DAT, VERSION_START, VERSION_CURRENT);
      assert HEADER_LENGTH_DAT == fieldsStream.getFilePointer();
      assert HEADER_LENGTH_IDX == indexStream.getFilePointer();
      final long indexSize = indexStream.length() - HEADER_LENGTH_IDX;
      this.size = (int) (indexSize >> 3);
      // Verify two sources of "maxDoc" agree:
      if (this.size != si.getDocCount()) {
        throw new CorruptIndexException("doc counts differ for segment " + segment + ": fieldsReader shows " + this.size + " but segmentInfo shows " + si.getDocCount());
      }
      numTotalDocs = (int) (indexSize >> 3);
      success = true;
    } finally {
      if (!success) {
        try {
          close();
        } catch (Throwable t) {} // ensure we throw our original exception
      }
    }
  }
}
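The `(indexSize >> 3)` above works because, after the header, the .fdx file holds exactly one 8-byte pointer into .fdt per document, so the byte count divided by 8 must equal the segment's doc count. A minimal sketch of that arithmetic, with a stand-in header length:

```java
public class FdxDocCount {

    /** One long (8 bytes) per document follows the header; >> 3 divides by 8. */
    static int docCount(long fdxFileLength, long headerLength) {
        long indexSize = fdxFileLength - headerLength;
        return (int) (indexSize >> 3);
    }

    public static void main(String[] args) {
        long header = 20;                   // stand-in header length
        long fileLength = header + 4 * 8L;  // four 8-byte pointers => four documents
        System.out.println(docCount(fileLength, header)); // 4
        // A mismatch with SegmentInfo's doc count triggers CorruptIndexException.
    }
}
```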

The .tvx file layout is:

Header <DocumentPosition,FieldPosition>^NumDocs

The .tvd file layout is:

Header <NumFields, FieldNums, FieldPositions>^NumDocs

The .tvf file layout is:

Header <NumTerms, Flags, TermFreqs>^NumFields

The call liveDocs = codec.liveDocsFormat().readLiveDocs(directory(), si, IOContext.READONCE); reads the .del file.
The .del file layout is:

Format Header ByteCount BitCount Bits

public class Lucene40LiveDocsFormat extends LiveDocsFormat {
  public Bits readLiveDocs(Directory dir, SegmentCommitInfo info, IOContext context) throws IOException {
    String filename = IndexFileNames.fileNameFromGeneration(info.info.name, DELETES_EXTENSION, info.getDelGen());
    final BitVector liveDocs = new BitVector(dir, filename, context);
    if (liveDocs.length() != info.info.getDocCount()) {
      throw new CorruptIndexException("liveDocs.length()=" + liveDocs.length() + "info.docCount=" + info.info.getDocCount() + " (filename=" + filename + ")");
    }
    if (liveDocs.count() != info.info.getDocCount() - info.getDelCount()) {
      throw new CorruptIndexException("liveDocs.count()=" + liveDocs.count() + " info.docCount=" + info.info.getDocCount() + " info.getDelCount()=" + info.getDelCount() + " (filename=" + filename + ")");
    }
    return liveDocs;
  }
}
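The two consistency checks in readLiveDocs() can be mirrored with java.util.BitSet standing in for Lucene's BitVector: one bit per document, set for live documents, so length() must equal docCount and count() must equal docCount minus delCount. A sketch under those assumptions:

```java
import java.util.BitSet;

public class ToyLiveDocs {

    final BitSet bits;
    final int length;

    /** Start with every document live, then clear the bits of deleted docs. */
    ToyLiveDocs(int docCount, int... deletedDocs) {
        length = docCount;
        bits = new BitSet(docCount);
        bits.set(0, docCount);
        for (int doc : deletedDocs) {
            bits.clear(doc);
        }
    }

    int count() {
        return bits.cardinality();
    }

    /** The same two checks readLiveDocs() performs after loading the .del file. */
    void verify(int docCount, int delCount) {
        if (length != docCount) {
            throw new IllegalStateException("liveDocs.length()=" + length + " docCount=" + docCount);
        }
        if (count() != docCount - delCount) {
            throw new IllegalStateException("liveDocs.count()=" + count() + " expected=" + (docCount - delCount));
        }
    }

    public static void main(String[] args) {
        ToyLiveDocs liveDocs = new ToyLiveDocs(5, 1, 3); // 5 docs, docs 1 and 3 deleted
        liveDocs.verify(5, 2);                           // both checks pass
        System.out.println(liveDocs.count());            // 3
        System.out.println(liveDocs.bits.get(1));        // false: doc 1 is deleted
    }
}
```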

4 Creating the IndexSearcher

After the index directory has been opened, the next step is to create the IndexSearcher object.

IndexSearcher indexSearcher = new IndexSearcher(indexReader);

Constructing the IndexSearcher mainly initializes the searcher's context and reader.

public class IndexSearcher {
  public IndexSearcher(IndexReader r) {
    this(r, null);
  }
  public IndexSearcher(IndexReader r, ExecutorService executor) {
    this(r.getContext(), executor);
  }
  public IndexSearcher(IndexReaderContext context, ExecutorService executor) {
    assert context.isTopLevel: "IndexSearcher's ReaderContext must be topLevel for reader" + context.reader();
    reader = context.reader();
    this.executor = executor;
    this.readerContext = context;
    leafContexts = context.leaves();
    this.leafSlices = executor == null ? null : slices(leafContexts);
  }
}
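context.leaves() flattens the reader tree into the ordered list of segment-level (leaf) contexts that the searcher later iterates over. A toy sketch of that flattening over a composite/leaf tree; the class names are illustrative, not Lucene's API:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ToyReaderTree {

    interface Reader {}

    /** A leaf corresponds to one segment's reader. */
    static class Leaf implements Reader {
        final String segment;
        Leaf(String segment) { this.segment = segment; }
    }

    /** A composite reader wraps child readers, like DirectoryReader over SegmentReaders. */
    static class Composite implements Reader {
        final List<Reader> children;
        Composite(Reader... children) { this.children = Arrays.asList(children); }
    }

    /** Depth-first flatten: composites recurse, leaves are collected in order. */
    static List<Leaf> leaves(Reader r) {
        List<Leaf> out = new ArrayList<>();
        collect(r, out);
        return out;
    }

    private static void collect(Reader r, List<Leaf> out) {
        if (r instanceof Leaf) {
            out.add((Leaf) r);
        } else {
            for (Reader child : ((Composite) r).children) {
                collect(child, out);
            }
        }
    }

    public static void main(String[] args) {
        Reader root = new Composite(new Leaf("_0"),
                                    new Composite(new Leaf("_1"), new Leaf("_2")));
        System.out.println(leaves(root).size()); // 3
    }
}
```

When an ExecutorService is supplied, IndexSearcher additionally groups these leaves into slices so segments can be searched in parallel.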