Custom Tokenizers

Preface

Elasticsearch owes its fast full-text search not only to the inverted-index design at its core, but also to its analyzers.

Analyzers

  • Elasticsearch itself ships with a number of commonly used analyzers. An analyzer is built from three kinds of components:
    • character filter: preprocesses the raw text before it is tokenized, e.g. stripping HTML tags
    • tokenizer: splits the text into individual tokens
    • token filter: post-processes the emitted tokens, e.g. lowercase conversion
  • Order of the three: character filter -> tokenizer -> token filter
  • Count of each: character filter (0 or more) + tokenizer (exactly one) + token filter (0 or more); the _analyze example below shows the full chain
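The chain is easy to observe with the _analyze API. A minimal sketch using only built-in components (the html_strip character filter, the standard tokenizer, and the lowercase token filter):
POST _analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "<p>Hello World</p>"
}
The character filter strips the HTML tags, the tokenizer splits the remainder into Hello and World, and the token filter lowercases them to hello and world.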

Built-in analyzers

  • Elasticsearch ships with a number of commonly used analyzers, listed below:
Standard Analyzer - the default; splits on word boundaries and lowercases
Simple Analyzer - splits on anything that is not a letter (symbols are discarded) and lowercases
Stop Analyzer - lowercases and removes stop words (the, a, is)
Whitespace Analyzer - splits on whitespace, does not lowercase
Keyword Analyzer - no tokenization; the input is emitted unchanged as a single token
Pattern Analyzer - splits by regular expression, \W+ by default (non-word characters)
Language - analyzers for 30+ common languages
Custom Analyzer - user-defined analyzers
  • From these building blocks we can define simple analyzers of our own, e.g. an analyzer that splits on commas, tested just below:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "comma": {
          "type": "pattern",
          "pattern": ","
        }
      }
    }
  }
}
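Assuming these settings were applied to an index called my_index (the name is just for illustration), the analyzer can be checked with _analyze:
POST my_index/_analyze
{
  "analyzer": "comma",
  "text": "apple,banana,cherry"
}
This should return the tokens apple, banana and cherry; note that the pattern analyzer also lowercases by default.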
  • Or pick an existing tokenizer plus some filters and assemble them into a new analyzer (again tested below):
{
    "settings": {
        "analysis": {
            "analyzer": {
                "std_folded": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "asciifolding"
                    ]
                }
            }
        }
    }
}
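Again assuming the settings live on an index called my_index, the assembled analyzer lowercases first and then strips diacritics:
POST my_index/_analyze
{
  "analyzer": "std_folded",
  "text": "Café CRÈME"
}
Expected tokens: cafe and creme; the filters run in the listed order, so lowercase is applied before asciifolding folds the accented characters to ASCII.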

Custom analyzers

  • Not every requirement can be met by assembling built-in components; for some special needs the built-in tokenizers fall short, and that is when writing a custom analyzer is worth trying. Take tokenizing into consecutive character runs as an example: given a string, the tokens must cover every consecutive run of 3 letters, 4 letters, 5 letters, and so on.
    Hmm... this particular one can in fact still be handled by a built-in tokenizer, as follows (min_gram is set to 3 so the output matches the requirement above; a quick _analyze test follows the settings):
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 10,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  }
}
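Running _analyze against an index created with these settings (again assume my_index) makes the sliding windows visible:
POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "abcd"
}
Expected tokens: abc, abcd, bcd. One caveat: index.max_ngram_diff (default 1) limits the allowed spread between min_gram and max_gram; Elasticsearch 6.x only logs a deprecation warning when it is exceeded, but 7.x refuses to create the index, so a 3-to-10 spread also needs that index setting raised.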

Custom plugin implementation

Here we take a whitespace tokenizer as the example.

The pom file
  <properties>
    <elasticsearch.version>6.5.4</elasticsearch.version>
    <lucene.version>7.5.0</lucene.version>
    <maven.compiler.target>1.8</maven.compiler.target>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  </properties>

  <dependencies>
    <dependency>
      <groupId>org.elasticsearch</groupId>
      <artifactId>elasticsearch</artifactId>
      <version>${elasticsearch.version}</version>
      <scope>provided</scope>
    </dependency>
  </dependencies>

  <build>
    <resources>
      <resource>
        <directory>src/main/resources</directory>
        <filtering>false</filtering>
        <excludes>
          <exclude>*.properties</exclude>
        </excludes>
      </resource>
    </resources>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-assembly-plugin</artifactId>
        <version>2.6</version>
        <configuration>
          <appendAssemblyId>false</appendAssemblyId>
          <outputDirectory>${project.build.directory}/releases/</outputDirectory>
          <descriptors>
            <descriptor>${basedir}/src/main/assemblies/plugin.xml</descriptor>
          </descriptors>
        </configuration>
        <executions>
          <execution>
            <phase>package</phase>
            <goals>
              <goal>single</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>3.5.1</version>
        <configuration>
          <source>${maven.compiler.target}</source>
          <target>${maven.compiler.target}</target>
        </configuration>
      </plugin>
    </plugins>
  </build>
  • Note that the pom points at plugin.xml and configures the static resource files
plugin.xml (mind the file location)
<?xml version="1.0"?>
<assembly>
  <id>my-analysis</id>
  <formats>
    <format>zip</format>
  </formats>
  <includeBaseDirectory>false</includeBaseDirectory>
  <files>
    <file>
      <source>${project.basedir}/src/main/resources/my.properties</source>
      <outputDirectory/>
      <filtered>true</filtered>
    </file>
  </files>
  <dependencySets>
    <dependencySet>
      <outputDirectory/>
      <useProjectArtifact>true</useProjectArtifact>
      <useTransitiveFiltering>true</useTransitiveFiltering>
      <excludes>
        <exclude>org.elasticsearch:elasticsearch</exclude>
      </excludes>
    </dependencySet>
  </dependencySets>
</assembly>
  • This references my.properties
my.properties
description=${project.description}
version=${project.version}
name=${project.name}
classname=com.test.plugin.MyPlugin
java.version=${maven.compiler.target}
elasticsearch.version=${elasticsearch.version}
  • The classname property here is our plugin class
The code
  • The analyzer
package com.test.index.analysis;

import org.apache.lucene.analysis.Analyzer;

/**
 * @author phil.zhang
 * @date 2021/2/21
 */
public class MyAnalyzer extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    // the analysis chain consists of our tokenizer only, with no char filters or token filters
    MyTokenizer myTokenizer = new MyTokenizer();
    return new TokenStreamComponents(myTokenizer);
  }
}
  • The analyzer provider
package com.test.index.analysis;

import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.env.Environment;
import org.elasticsearch.index.IndexSettings;
import org.elasticsearch.index.analysis.AbstractIndexAnalyzerProvider;

/**
 * @author phil.zhang
 * @date 2021/2/21
 */
public class MyAnalyzerProvider extends AbstractIndexAnalyzerProvider<MyAnalyzer> {
  private final MyAnalyzer myAnalyzer;

  public MyAnalyzerProvider(IndexSettings indexSettings, Environment environment, String name, Settings settings) {
    super(indexSettings, name, settings);
    myAnalyzer = new MyAnalyzer();
  }

  @Override
  public MyAnalyzer get() {
    return myAnalyzer;
  }
}
  • The tokenizer: the core logic (a small test driver follows the code)
package com.test.index.analysis;

import java.io.IOException;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

/**
 * @author phil.zhang
 * @date 2021/2/21
 */
public class MyTokenizer extends Tokenizer {
  private final StringBuilder buffer = new StringBuilder();
  /** start offset of the current token **/
  private int tokenStart = 0;
  /** end offset of the current token **/
  private int tokenEnd = 0;
  /** register the attributes to emit: each token carries its term text and its offsets **/
  private final CharTermAttribute termAttribute = addAttribute(CharTermAttribute.class);
  private final OffsetAttribute offsetAttribute = addAttribute(OffsetAttribute.class);

  @Override
  public boolean incrementToken() throws IOException {
    clearAttributes();
    buffer.setLength(0); // clear the term buffer
    int ci;
    char ch;
    tokenStart = tokenEnd;
    // read one character
    ci = input.read();
    ch = (char) ci;
    while (true) {
      if (ci == -1) {
        // end of input
        if (buffer.length() == 0) {
          // nothing pending: tokenization is finished
          return false;
        } else {
          // emit the final token
          termAttribute.setEmpty().append(buffer);
          offsetAttribute.setOffset(correctOffset(tokenStart), correctOffset(tokenEnd));
          return true;
        }
      } else if (ch == ' ') {
        // hit a space
        if (buffer.length() > 0) {
          // emit the token collected so far; the space itself is not part of it
          termAttribute.setEmpty().append(buffer);
          offsetAttribute.setOffset(correctOffset(tokenStart), correctOffset(tokenEnd));
          tokenEnd++; // step past the space before the next call
          return true;
        } else {
          // leading or repeated space: skip it and move both offsets forward
          tokenStart = ++tokenEnd;
          ci = input.read();
          ch = (char) ci;
        }
      } else {
        // ordinary character: append it and keep reading
        buffer.append(ch);
        tokenEnd++;
        ci = input.read();
        ch = (char) ci;
      }
    }
  }

  @Override
  public void end() throws IOException {
    super.end();
    // the final offset is just past the last character consumed
    int finalOffset = correctOffset(tokenEnd);
    offsetAttribute.setOffset(finalOffset, finalOffset);
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    tokenStart = tokenEnd = 0;
  }
}
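A quick way to sanity-check the tokenizer without a running cluster is to drive it directly through the Lucene API; a minimal sketch (the class name and sample string are chosen here just for illustration):
package com.test.index.analysis;

import java.io.StringReader;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

public class MyTokenizerTest {
  public static void main(String[] args) throws Exception {
    MyTokenizer tokenizer = new MyTokenizer();
    tokenizer.setReader(new StringReader("hello elastic search"));
    tokenizer.reset(); // must be called before consuming the stream
    CharTermAttribute term = tokenizer.getAttribute(CharTermAttribute.class);
    OffsetAttribute offset = tokenizer.getAttribute(OffsetAttribute.class);
    while (tokenizer.incrementToken()) {
      // print each token with its [start,end) offsets
      System.out.println(term + " [" + offset.startOffset() + "," + offset.endOffset() + ")");
    }
    tokenizer.end();
    tokenizer.close();
  }
}
This should print hello [0,5), elastic [6,13) and search [14,20).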
  • The tokenizer factory
package com.test.index.analysis;

import org.apache.lucene.analysis.Tokenizer;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.env.Environment;
import org.elasticsearch.index.IndexSettings;
import org.elasticsearch.index.analysis.AbstractTokenizerFactory;

/**
 * @author phil.zhang
 * @date 2021/2/21
 */
public class MyTokenizerFactory extends AbstractTokenizerFactory {

  public MyTokenizerFactory(IndexSettings indexSettings, Environment environment, String ignored, Settings settings) {
    super(indexSettings, ignored, settings);
  }

  @Override
  public Tokenizer create() {
    return new MyTokenizer();
  }
}
  • The plugin class (a usage example follows the code)
package com.test.plugin;

import com.test.index.analysis.MyAnalyzerProvider;
import com.test.index.analysis.MyTokenizerFactory;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.Analyzer;
import org.elasticsearch.index.analysis.AnalyzerProvider;
import org.elasticsearch.index.analysis.TokenizerFactory;
import org.elasticsearch.indices.analysis.AnalysisModule;
import org.elasticsearch.plugins.AnalysisPlugin;
import org.elasticsearch.plugins.Plugin;

/**
 * @author phil.zhang
 * @date 2021/2/21
 */
public class MyPlugin extends Plugin implements AnalysisPlugin {

  @Override
  public Map<String, AnalysisModule.AnalysisProvider<TokenizerFactory>> getTokenizers() {
    Map<String, AnalysisModule.AnalysisProvider<TokenizerFactory>> extra = new HashMap<>();
    extra.put("my-word", MyTokenizerFactory::new);
    return extra;
  }
  @Override
  public Map<String, AnalysisModule.AnalysisProvider<AnalyzerProvider<? extends Analyzer>>> getAnalyzers() {

    Map<String, AnalysisModule.AnalysisProvider<AnalyzerProvider<? extends Analyzer>>> extra = new HashMap<>();
    extra.put("my-word", MyAnalyzerProvider::new);
    return extra;
  }
}
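Once the plugin is installed, both registrations can be exercised; an analyzer contributed through getAnalyzers is available cluster-wide, so no index settings are needed (the sample text is illustrative):
POST _analyze
{
  "analyzer": "my-word",
  "text": "hello elastic search"
}
The same name can also be referenced from a mapping, e.g. "analyzer": "my-word" on a text field.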
Next steps

That completes the code. You can run a quick self-test to check the behavior, then package the plugin with a Maven command; what follows is the standard plugin installation procedure, which is not covered in detail here, but a rough sketch of those two steps is given below.
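For reference (the artifact name and paths are illustrative and depend on your project):

mvn clean package
# the assembly plugin writes the plugin zip to target/releases/
bin/elasticsearch-plugin install file:///path/to/target/releases/my-plugin.zip

Restart the node afterwards so Elasticsearch loads the plugin.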
