Preface

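I'm almost two years into my PhD. My area is nominally Information Retrieval, yet I still can't use any of the commonly used indexing packages. This time the goal was to index Reddit posts stored as JSON. I originally wanted to use Glasgow's Terrier, because I know a few students in that group and they all recommended it to me. Once I tried it, though, I found that Terrier only provides optimizations and configuration documentation for the TREC collections commonly used in academia. For other document types it has no complete documentation, which left me thoroughly confused.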

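Given Lucene's strong open-source background, and my lazy reluctance to write any more Java, I decided to build the index with Solr 6.0 and then access it from Python through curl-style HTTP requests.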

Installing Solr 6.0

Set up the Java environment (Solr 6 requires Java 8 or later)
Download the zip package
Unzip it

Using Solr 6.0

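I have to vent first: just configuring schema.xml cost me a whole week. Doing a PhD really does lower your IQ... Below I record, step by step, the workflow I finally figured out.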

Startup procedure for standalone mode

First, [here is a very good example](Solr Schema.xml Example)
Change into the directory where Solr was unpacked

bin/solr start

This starts standalone mode. Opening http://localhost:8983/solr/ in a browser now shows the admin page.

Create a core. One core corresponds to one index over a document collection.

bin/solr create -c <core name> -d basic_configs

Here we use basic_configs as the default configuration set.

Once the core is created, we need to configure schema.xml and solrconfig.xml. Both files live under server/solr/<core name>/conf.

In solrconfig.xml, switch the core to read a hand-edited schema.xml:

<schemaFactory class = "ClassicIndexSchemaFactory"/>

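Rename the managed-schema file to schema.xml.
Configure schema.xml; the next section covers this step in detail.
Important: the core must be reloaded before the new configuration takes effect.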

curl "http://localhost:8983/solr/admin/cores?action=RELOAD&core=<core name>"
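Since the rest of this post talks to Solr from Python, the same reload can also be issued without curl. A minimal sketch, assuming Solr listens on localhost:8983 as above; the helper names `reload_url` and `reload_core` are my own and not part of any Solr client library.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def reload_url(core_name, base="http://localhost:8983/solr"):
    """Build the CoreAdmin RELOAD URL used in the curl command above."""
    params = urlencode({"action": "RELOAD", "core": core_name})
    return "{}/admin/cores?{}".format(base, params)


def reload_core(core_name):
    """Send the reload request and return the HTTP status code."""
    with urlopen(reload_url(core_name)) as resp:
        return resp.status


if __name__ == "__main__":
    # Only builds and prints the URL; call reload_core() against a running Solr.
    print(reload_url("reddit_r3d"))
```

Keeping the URL construction separate from the network call makes it easy to check the request before sending it.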

Indexing the file

bin/post -c <core name> /data/redditData/full_corpus.json
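Before writing the schema it helps to know which keys actually occur in the corpus. A rough sketch, assuming the dump stores one JSON object per line (an assumption about the file layout, not something bin/post requires); `collect_keys` is a hypothetical helper.

```python
import json
from collections import Counter


def collect_keys(path, limit=1000):
    """Count how often each top-level key appears in the first `limit` lines."""
    counts = Counter()
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle):
            if line_number >= limit:
                break
            line = line.strip()
            if line:  # skip blank lines
                counts.update(json.loads(line).keys())
    return counts
```

Calling `collect_keys("/data/redditData/full_corpus.json")` would then give a quick overview of which fields are worth declaring.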

Querying from Python

import json
import os
from urllib.request import urlopen

# Placeholders: in the original script these are set by the surrounding
# loop over queries; the values below are only examples.
output_root = "$PATH"    # base output directory, placeholder kept from the post
corename = "reddit_r3d"
query_id = "q1"
kvalue = 10

connection = urlopen('http://localhost:8983/solr/reddit_r3d/select?q=title:"French%20Open"%20selftext:"French%20Open"&fl=id,name,title,selftext,created_utc&wt=json')
response = json.load(connection)
num_found = response['response']['numFound']
print(num_found, "documents found.")

# Create the output folder to store relevant documents
output_folderpath = os.path.join(output_root, corename)
os.makedirs(output_folderpath, exist_ok=True)

# Write the relevant documents to the file, one JSON object per line
# (the original used `continue` here because it ran inside a loop over queries)
if num_found > 0:
    output_filepath = os.path.join(
        output_folderpath,
        "{}_{}_{}.json".format(query_id, num_found, kvalue))
    print(output_filepath)
    with open(output_filepath, "w") as text_file:
        for document in response['response']['docs']:
            text_file.write(json.dumps(document) + '\n')
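The long query URL above is easy to get wrong by hand. As a sketch, the same request can be assembled with `urllib.parse.urlencode`, which takes care of escaping quotes and spaces; `build_select_url` is a name I made up for illustration.

```python
from urllib.parse import urlencode


def build_select_url(core, phrase, fields, base="http://localhost:8983/solr"):
    """Build a /select URL that searches `phrase` in title and selftext."""
    query = 'title:"{0}" selftext:"{0}"'.format(phrase)
    params = urlencode({"q": query, "fl": ",".join(fields), "wt": "json"})
    return "{}/{}/select?{}".format(base, core, params)


if __name__ == "__main__":
    print(build_select_url(
        "reddit_r3d", "French Open",
        ["id", "name", "title", "selftext", "created_utc"]))
```

The resulting string can be passed straight to `urlopen`, so a loop over many queries only has to change the `phrase` argument.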

Configuring schema.xml by hand

Configure the fields to declare which tags should be processed

<field name="_root_" type="string" docValues="false" indexed="true" stored="false"/>
  <field name="_text_" type="text_en" multiValued="true" indexed="true" stored="false"/>
  <field name="_version_" type="long" indexed="false" stored="false"/>
  <field name="author" type="text_general" indexed="false" stored="true"/>
  <field name="created_utc" type="tlongs" indexed="true" stored="true"/>
  <field name="domain" type="text_general" indexed="true" stored="true"/>
  <field name="downs" type="tlongs" indexed="false" stored="true"/>
  <field name="edited" type="booleans" indexed="false" stored="true"/>
  <field name="id" type="string" multiValued="false" indexed="true" required="true" stored="true"/>
  <field name="is_self" type="booleans" indexed="false" stored="true"/>
  <field name="name" type="text_general" indexed="false" stored="true"/>
  <field name="num_comments" type="tlongs" indexed="false" stored="true"/>
  <field name="retrieved_on" type="tlongs" indexed="false" stored="true"/>
  <field name="score" type="tlongs" indexed="false" stored="true"/>
  <field name="selftext" type="text_en" indexed="true" stored="true"/>
  <field name="subreddit" type="text_general" indexed="true" stored="true"/>
  <field name="subreddit_id" type="text_general" indexed="false" stored="true"/>
  <field name="title" type="text_en" indexed="true" stored="true"/>
  <field name="ups" type="tlongs" indexed="false" stored="true"/>
  <field name="url" type="text_general" indexed="true" stored="true"/>

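A note here: for text whose format you don't know in advance, text_general is the safer choice. string can raise errors (for instance, an indexed string value longer than Lucene's 32766-byte term limit is rejected).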
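To double-check which fields end up searchable, the field declarations can be read back with Python's standard XML parser. A small sketch; it assumes the `<field>` elements sit inside a `<schema>` root as in a full schema.xml, and both `indexed_fields` and the shortened sample schema are hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened sample for illustration; a real schema.xml has many more entries.
SAMPLE_SCHEMA = """
<schema name="example" version="1.6">
  <field name="id" type="string" indexed="true" required="true" stored="true"/>
  <field name="author" type="text_general" indexed="false" stored="true"/>
  <field name="title" type="text_en" indexed="true" stored="true"/>
</schema>
"""


def indexed_fields(schema_xml):
    """Return the names of fields explicitly declared with indexed="true"."""
    root = ET.fromstring(schema_xml)
    return [f.get("name") for f in root.iter("field")
            if f.get("indexed") == "true"]


if __name__ == "__main__":
    print(indexed_fields(SAMPLE_SCHEMA))
```

Only fields in this list can be used in a `q=` clause such as `title:"French Open"`; the others are merely returned with the results.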

Configure fieldType to declare what processing is applied to the content of fields of that type

    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

    <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <charFilter class="solr.HTMLStripCharFilterFactory"/>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="lang/stopwords_en.txt"
                />
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPossessiveFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <filter class="solr.PorterStemFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <charFilter class="solr.HTMLStripCharFilterFactory"/>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="lang/stopwords_en.txt"
                />
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPossessiveFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <filter class="solr.PorterStemFilterFactory"/>
      </analyzer>
    </fieldType>

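As shown above, you can define exactly what happens at index time and at query time. The only thing I added to the original configuration is:
<charFilter class="solr.HTMLStripCharFilterFactory"/>
which strips HTML markup from the text.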
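To get a feel for what the text_en index chain does to a post, here is a rough Python imitation of its first steps. It is not Solr's implementation: the tokenizer is a simple regex, the stop list is a tiny made-up sample rather than lang/stopwords_en.txt, and the protected-word and Porter stemming steps are left out.

```python
import html
import re

SAMPLE_STOPWORDS = {"a", "an", "and", "of", "the", "to"}  # illustrative only


def analyze(text):
    """Approximate: HTML strip -> tokenize -> stop words -> lowercase -> possessive."""
    # Remove tags and decode entities, roughly like HTMLStripCharFilterFactory
    stripped = html.unescape(re.sub(r"<[^>]+>", " ", text))
    # Very simplified stand-in for StandardTokenizerFactory
    tokens = re.findall(r"[A-Za-z0-9']+", stripped)
    # StopFilterFactory with ignoreCase="true"
    tokens = [t for t in tokens if t.lower() not in SAMPLE_STOPWORDS]
    # LowerCaseFilterFactory
    tokens = [t.lower() for t in tokens]
    # EnglishPossessiveFilterFactory: drop a trailing 's
    return [t[:-2] if t.endswith("'s") else t for t in tokens]
```

For example, `analyze("<p>The French Open's final</p>")` returns `['french', 'open', 'final']`: the tags and the stop word are gone, and the possessive is removed.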

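Configure copyField to copy the content of one field into another, either to merge fields or to process the same field in different ways. Here several fields are merged into _text_ so they are indexed together: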

   <copyField source="title" dest="_text_"/>
   <copyField source="selftext" dest="_text_"/>
   <copyField source="domain" dest="_text_"/>
   <copyField source="subreddit" dest="_text_"/>
   <copyField source="url" dest="_text_"/>
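Conceptually, copyField appends the source values to the destination field at index time. The toy function below mimics that on a plain dict so the effect is visible; it only illustrates the idea, and `apply_copy_fields` and any sample document passed to it are invented.

```python
# (source, dest) pairs mirroring the copyField rules above
COPY_RULES = [("title", "_text_"), ("selftext", "_text_"), ("domain", "_text_"),
              ("subreddit", "_text_"), ("url", "_text_")]


def apply_copy_fields(doc, rules=COPY_RULES):
    """Return a copy of `doc` with source values appended to each destination."""
    result = dict(doc)
    for source, dest in rules:
        if source not in doc:
            continue
        existing = result.get(dest, [])
        if not isinstance(existing, list):
            existing = [existing]
        result[dest] = existing + [doc[source]]
    return result
```

This is also why _text_ is declared multiValued="true": it receives one value per copied source field.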

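Configure dynamicField to match, on the fly, undefined fields encountered during indexing.
Fields you don't care about can be filtered out as follows: declare an ignored type and let a dynamic field match all fields that have not been defined.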

<dynamicField name="*" type="ignored" multiValued="true" />
<fieldType name="ignored" stored="false" indexed="false" docValues="false" multiValued="true" class="solr.TextField" />
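With the catch-all rule in place, any key without an explicit field declaration is silently dropped. A sketch of which keys of a record would be kept and which ignored; the declared set is copied from the field list earlier in this section, while `split_by_schema` and any sample record are made up.

```python
# Field names declared explicitly in the schema shown earlier
DECLARED = {"author", "created_utc", "domain", "downs", "edited", "id", "is_self",
            "name", "num_comments", "retrieved_on", "score", "selftext",
            "subreddit", "subreddit_id", "title", "ups", "url"}


def split_by_schema(record, declared=DECLARED):
    """Split a record into declared key/value pairs and ignored key names."""
    kept = {k: v for k, v in record.items() if k in declared}
    ignored = sorted(k for k in record if k not in declared)
    return kept, ignored
```

Running this over a few records before indexing is a cheap way to confirm that nothing you need falls into the ignored bucket.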

Additional notes

Starting with Solr 6.0, the default similarity (ranking function) is BM25 instead of the classic TF-IDF. This matters a lot.
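For reference, the per-term BM25 score can be sketched as below. This follows the textbook formula with k1 = 1.2 and b = 0.75, which are the defaults of Lucene's BM25Similarity; Lucene additionally stores document lengths in a lossy encoding, so real scores differ slightly. The function name and argument names are mine.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, doc_freq, num_docs, k1=1.2, b=0.75):
    """Score one query term against one document with BM25."""
    # Inverse document frequency: rare terms weigh more
    idf = math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalisation: longer-than-average documents are penalised
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    # Term frequency saturates instead of growing linearly
    return idf * tf * (k1 + 1) / (tf + norm)
```

Unlike TF-IDF, repeating a term many times yields diminishing returns, which matters for long Reddit self posts.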

Solr 6.0: a quick start from scratch