Using Solr to locate the chapters in which “独孤求败” (Dugu Qiubai) appears in Jin Yong's 15 novels

Experiment overview

Use Solr to build a full-text index over every chapter of Jin Yong's 15 novels, then locate the chapters in which “独孤求败” appears.

Environment

Operating system: Windows 10 Pro 17134.137
Solr version: 7.3.1
Java version: 1.8.0_172

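Steps

1. Start Solr and create a new core

In the Solr root directory, hold Shift and right-click, choose "Open PowerShell window here" from the menu, and enter the following command to start Solr: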

bin/solr.cmd start

Once Solr has started, enter the following command to create a new core:

bin/solr.cmd create -c jinyong

Then open the following address in a browser:

http://localhost:8983/solr/#/jinyong

This opens the Solr admin UI, where the newly created core can be managed.
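As a quick check that does not depend on the browser, the core can also be looked up through Solr's CoreAdmin `STATUS` action. The following is a minimal Python sketch, assuming Solr is listening on the default `localhost:8983` used above; the helper names (`core_status_url`, `core_exists`, `fetch_json`) are made up for this illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

SOLR = "http://localhost:8983/solr"  # same host and port as in the steps above


def core_status_url(core, base=SOLR):
    """Build the CoreAdmin STATUS URL for a single core."""
    query = urlencode({"action": "STATUS", "core": core, "wt": "json"})
    return f"{base}/admin/cores?{query}"


def core_exists(status_response, core):
    """A core that does not exist is reported with an empty status entry."""
    return bool(status_response.get("status", {}).get(core))


def fetch_json(url):
    """GET a URL and decode the JSON body (needs a running Solr)."""
    with urlopen(url) as resp:
        return json.load(resp)


# Usage, with Solr running:
#   print(core_exists(fetch_json(core_status_url("jinyong")), "jinyong"))
```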

Tip: on Windows, restart Solr with the following command:

bin/solr.cmd restart -p 8983

2. Add the IK tokenizer package and the text_ik field type

First go to the following directory:

image

Then put the two IK jar files into the lib folder and the configuration files into the classes folder:

image

Then add the following fieldType configuration to the managed-schema file of the jinyong core:

<!-- IK field type -->
<fieldType name="text_ik" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="org.apache.lucene.analysis.ik.IKTokenizerFactory" isMaxWordLength="false" useSmart="false"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="org.apache.lucene.analysis.ik.IKTokenizerFactory" isMaxWordLength="false" useSmart="false"/>
  </analyzer>
</fieldType>

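Restart Solr, open the Solr admin UI, and check how the text_ik field type tokenizes text. To have “独孤求败” come out intact as a single four-character token, I added “独孤求败” to IK's extension dictionary (ext.dic) and added the two words “独孤” and “求败” to the stop-word list (stopword.dic). Both files are in the following directory: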

image

The tokenization result is shown below:

image
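The same check can be scripted against Solr's field analysis handler (`/analysis/field`), which the admin UI's Analysis screen is built on. A minimal Python sketch, assuming the default host and port and the `jinyong` core from above; `analysis_url` is a hypothetical helper, and the shape of the JSON response is not parsed here.

```python
from urllib.parse import urlencode


def analysis_url(text, field_type="text_ik", core="jinyong",
                 base="http://localhost:8983/solr"):
    """Build a field-analysis URL that runs `text` through `field_type`."""
    params = {
        "analysis.fieldtype": field_type,  # analyze with the type, not a field
        "analysis.fieldvalue": text,       # text to push through the index analyzer
        "wt": "json",
    }
    return f"{base}/{core}/analysis/field?{urlencode(params)}"


# Usage, with Solr running: open the printed URL in a browser or fetch it
# with urllib.request.urlopen and inspect the token list that comes back.
#   print(analysis_url("独孤求败"))
```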

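3. Import the text files of Jin Yong's novels with DIH

To find out exactly which chapters “独孤求败” appears in, first use a chapter splitter to split the 15 novels by chapter, with one folder per novel, then use an encoding converter to convert the txt files to UTF-8, as shown: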

image
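The splitting and re-encoding were done with separate tools, which are not named here. For readers who would rather script this step, the following Python sketch does roughly the same thing under two assumptions that are not taken from the original: the source files are GBK-encoded, and chapter headings sit at the start of a line in the form "第…章" or "第…回". The function names and the numbered output file names are likewise illustrative.

```python
import re
from pathlib import Path

# Assumption: a chapter heading starts a line and looks like "第一章 ..." or "第十二回 ...".
CHAPTER_HEADING = re.compile(
    r"^第[零〇一二三四五六七八九十百千0-9]+[章回].*$", re.MULTILINE
)


def split_chapters(text):
    """Return a list of (heading, body) pairs, one per chapter heading found."""
    matches = list(CHAPTER_HEADING.finditer(text))
    chapters = []
    for i, m in enumerate(matches):
        # A chapter body runs until the next heading, or to the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        chapters.append((m.group().strip(), text[m.end():end].strip()))
    return chapters


def write_novel(src_path, out_dir, src_encoding="gbk"):
    """Decode one novel, split it, and write each chapter as a UTF-8 .txt file."""
    text = Path(src_path).read_text(encoding=src_encoding, errors="replace")
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for n, (heading, body) in enumerate(split_chapters(text), start=1):
        (out / f"{n:03d}.txt").write_text(heading + "\n" + body, encoding="utf-8")
```

Text that precedes the first heading (a preface, for example) is dropped by this sketch, and a body line that happens to start with the heading pattern would be treated as a new chapter, so the output is worth spot-checking.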

With the files to be indexed ready, create a data-config.xml file in the solr-7.3.1/server/solr/jinyong/conf directory with the following content:

<dataConfig>
  <dataSource name="fileDataSource" type="FileDataSource" />
  <document>
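    <!-- baseDir: location of the files to index; fileName: regex matched against file names -->
    <!-- onError="skip": skip a file if indexing it fails -->
    <!-- recursive="true": index every folder under baseDir recursively -->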
    <entity name="files" dataSource="null" rootEntity="false"
    processor="FileListEntityProcessor"
    baseDir="D:\Storage\小说\金庸小说全集" fileName=".*\.txt" 
    onError="skip" 
    recursive="true"> 
    
      <!-- Map the file metadata to the corresponding fields in managed-schema: id, filePath, size, lastModified, and text -->
      <field column="file" name="id"/>
      <field column="fileAbsolutePath" name="filePath" />
      <field column="fileSize" name="size" />
      <field column="fileLastModified" name="lastModified" />

      <entity processor="PlainTextEntityProcessor" name="txtfile" url="${files.fileAbsolutePath}" dataSource="fileDataSource">
        <field column="plainText" name="text"/>
      </entity>
    </entity>
  </document>
</dataConfig>
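One thing worth checking before the import: the configuration maps `file`, which holds only the bare file name, to `id`. If chapter files in different novel folders happen to share a name (for example a `001.txt` in every folder), they would end up with the same `id`, and a later document would replace an earlier one in the index. Whether that applies depends on how the chapter files were named, which is not stated here. The Python sketch below lists such collisions; both function names are made up for this illustration.

```python
from collections import defaultdict
from pathlib import Path


def list_txt_files(base_dir):
    """Recursively list the .txt files under base_dir."""
    return sorted(str(p) for p in Path(base_dir).rglob("*.txt") if p.is_file())


def colliding_ids(paths):
    """Group paths by bare file name and keep the names that occur more than once."""
    by_name = defaultdict(list)
    for p in paths:
        by_name[Path(p).name].append(p)
    return {name: group for name, group in by_name.items() if len(group) > 1}


# Usage, pointing at the baseDir from data-config.xml:
#   print(colliding_ids(list_txt_files(r"D:\Storage\小说\金庸小说全集")))
```

If the result is not empty, one option would be to map `fileAbsolutePath` to `id` instead of `file`, since absolute paths are unique.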

Then add the corresponding fields to the managed-schema file (the id field is defined by default, so it does not need to be defined again):

<!-- Fields for the txt files -->
<field name="text" type="text_ik" indexed="true" stored="true" omitNorms="true" multiValued="false"/>
<field name="fileName" type="string" indexed="true" stored="true" />
<field name="filePath" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="size" type="plong" indexed="true" stored="true" />
<field name="lastModified" type="pdate" indexed="true" stored="true" />
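As an alternative to editing managed-schema by hand, the same five fields could be added through Solr's Schema API with an `add-field` command posted to `/solr/jinyong/schema`. The Python sketch below only builds the requests; it assumes the default host and port, assumes the `text_ik` field type has already been defined, and uses a hypothetical helper name (`add_field_request`).

```python
import json
from urllib.request import Request, urlopen

# The same field definitions as in the managed-schema snippet above.
FIELDS = [
    {"name": "text", "type": "text_ik", "indexed": True, "stored": True,
     "omitNorms": True, "multiValued": False},
    {"name": "fileName", "type": "string", "indexed": True, "stored": True},
    {"name": "filePath", "type": "string", "indexed": True, "stored": True,
     "required": True, "multiValued": False},
    {"name": "size", "type": "plong", "indexed": True, "stored": True},
    {"name": "lastModified", "type": "pdate", "indexed": True, "stored": True},
]


def add_field_request(field, core="jinyong", base="http://localhost:8983/solr"):
    """Build (but do not send) a Schema API request that adds one field."""
    body = json.dumps({"add-field": field}).encode("utf-8")
    return Request(f"{base}/{core}/schema", data=body,
                   headers={"Content-Type": "application/json"}, method="POST")


# Usage, with Solr running:
#   for field in FIELDS:
#       with urlopen(add_field_request(field)) as resp:
#           print(field["name"], resp.status)
```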

Next, define DIH in solrconfig.xml.
First load the required jar by adding the following statement at the location shown below:

image

<!-- Add the DIH jar dependency -->
<lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-dataimporthandler-.*\.jar" />

Then add the following definition at a suitable location:

<requestHandler name="/dataimport" class="solr.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
  </lst>
</requestHandler>

After restarting Solr, open the Solr admin UI; the files can be imported and the index built from the following screen:
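The import can also be driven over HTTP through the `/dataimport` handler defined above, using its `full-import` and `status` commands, and a phrase query against the `text` field then shows which chapter files matched. A minimal Python sketch, assuming the default host and port; the helper names are illustrative, and the `rows` value of 100 is an arbitrary choice.

```python
from urllib.parse import urlencode

BASE = "http://localhost:8983/solr/jinyong"


def dih_url(command, **params):
    """Build a DataImportHandler URL, e.g. for 'full-import' or 'status'."""
    return f"{BASE}/dataimport?" + urlencode({"command": command, "wt": "json", **params})


def phrase_query_url(phrase, field="text", rows=100):
    """Build a /select URL that searches `field` for an exact phrase."""
    params = {
        "q": f'{field}:"{phrase}"',
        "fl": "id,filePath",  # return only the file name and its path
        "rows": rows,
        "wt": "json",
    }
    return f"{BASE}/select?" + urlencode(params)


# Usage, with Solr running (fetch each URL with urllib.request.urlopen):
#   dih_url("full-import", clean="true", commit="true")  # start the import
#   dih_url("status")                                    # poll until it is idle
#   phrase_query_url("独孤求败")                          # list matching chapters
```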