Preface
Elasticsearch owes its fast full-text search not only to the inverted index at its core, but also to its analyzers.
Analyzers
- Elasticsearch ships with a number of commonly used analyzers; an analyzer is built from three kinds of components:
- character filter: pre-processes the raw text before it is tokenized, e.g. stripping HTML tags
- tokenizer: splits the field value into individual tokens
- token filter: post-processes the emitted tokens, e.g. lowercasing
- Order of application: character filter -> tokenizer -> token filter
- Cardinality: 0 or more character filters + exactly 1 tokenizer + 0 or more token filters (see the _analyze sketch below)
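A minimal sketch of the three stages working together, using the _analyze API with only built-in components (the html_strip character filter, standard tokenizer and lowercase token filter; the text is just an arbitrary sample):
GET _analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "<p>Quick Brown FOX</p>"
}
This should return the three tokens quick, brown and fox: the character filter strips the HTML tags, the tokenizer splits on word boundaries, and the token filter lowercases each token.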
Built-in analyzers
- Elasticsearch ships with the following commonly used analyzers (a quick _analyze comparison follows the list):
Standard Analyzer - the default analyzer; splits on word boundaries and lowercases
Simple Analyzer - splits on any non-letter character (symbols are dropped) and lowercases
Stop Analyzer - lowercases and removes stop words (the, a, is, ...)
Whitespace Analyzer - splits on whitespace, no lowercasing
Keyword Analyzer - no tokenization; the whole input is emitted as a single token
Pattern Analyzer - splits on a regular expression, \W+ (non-word characters) by default
Language - analyzers for more than 30 common languages
Custom Analyzer - user-defined analyzers
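The differences are easy to see by running the same sample text through two of them (the text is arbitrary):
GET _analyze
{
  "analyzer": "standard",
  "text": "The QUICK Brown-Foxes."
}
GET _analyze
{
  "analyzer": "whitespace",
  "text": "The QUICK Brown-Foxes."
}
The standard analyzer should return the, quick, brown, foxes, while the whitespace analyzer returns The, QUICK and Brown-Foxes. unchanged.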
- With these building blocks we can define some simple analyzers of our own, e.g. an analyzer that splits on commas; the index settings and a test call are shown below:
{
"settings":{
"analysis":{
"analyzer":{
"comma":{
"type":"pattern",
"pattern":","
}
}
}
}
}
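Assuming an index (hypothetically named my_index) is created with these settings, the analyzer can be checked with _analyze:
GET my_index/_analyze
{
  "analyzer": "comma",
  "text": "Hello,World,Elasticsearch"
}
This should return the tokens hello, world and elasticsearch, since the pattern analyzer also lowercases by default.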
- Or we can pick a tokenizer and some token filters and assemble them into a new analyzer; again the settings and a test call follow:
{
"settings": {
"analysis": {
"analyzer": {
"std_folded": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"asciifolding"
]
}
}
}
}
}
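As above, assuming an index (again hypothetically my_index) created with these settings:
GET my_index/_analyze
{
  "analyzer": "std_folded",
  "text": "Crème Brûlée"
}
The standard tokenizer splits the two words, lowercase lowercases them, and asciifolding converts the accented characters, so the result should be creme and brulee.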
Custom analyzers
- Not every requirement can be met by assembling the built-in components; for special needs the built-in tokenizers may fall short, and that is when we can try writing our own analyzer. Take the following as an example: given a string, the tokens should cover every run of 3 consecutive letters, 4 consecutive letters, 5 consecutive letters, and so on.
Well... this particular case can actually still be covered by one of Elasticsearch's built-in tokenizers (ngram), as below, together with a quick test:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 4,
          "max_gram": 10,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  }
}
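Assuming an index (hypothetically my_index) created with these settings, a quick check:
GET my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "elastic"
}
This should return every 4- to 7-character substring of elastic (elas, elast, elasti, elastic, last, lasti, ...). Note that newer Elasticsearch versions limit max_gram - min_gram through the index.max_ngram_diff index setting, so a spread as wide as 4 to 10 may require raising that setting when the index is created.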
Custom plugin implementation
Here we take a whitespace tokenizer as the example.
pom file
<properties>
<elasticsearch.version>6.5.4</elasticsearch.version>
<lucene.version>7.5.0</lucene.version>
<maven.compiler.target>1.8</maven.compiler.target>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>
<dependencies>
<dependency>
<groupId>org.elasticsearch</groupId>
<artifactId>elasticsearch</artifactId>
<version>${elasticsearch.version}</version>
<scope>provided</scope>
</dependency>
</dependencies>
<build>
<resources>
<resource>
<directory>src/main/resources</directory>
<filtering>false</filtering>
<excludes>
<exclude>*.properties</exclude>
</excludes>
</resource>
</resources>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-assembly-plugin</artifactId>
<version>2.6</version>
<configuration>
<appendAssemblyId>false</appendAssemblyId>
<outputDirectory>${project.build.directory}/releases/</outputDirectory>
<descriptors>
<descriptor>${basedir}/src/main/assemblies/plugin.xml</descriptor>
</descriptors>
</configuration>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.5.1</version>
<configuration>
<source>${maven.compiler.target}</source>
<target>${maven.compiler.target}</target>
</configuration>
</plugin>
</plugins>
</build>
- Note that the pom points at plugin.xml and configures the static resource files.
plugin.xml (note the file location: src/main/assemblies/plugin.xml, as referenced in the pom)
<?xml version="1.0"?>
<assembly>
<id>my-analysis</id>
<formats>
<format>zip</format>
</formats>
<includeBaseDirectory>false</includeBaseDirectory>
<files>
<file>
<source>${project.basedir}/src/main/resources/my.properties</source>
<destName>plugin-descriptor.properties</destName>
<outputDirectory/>
<filtered>true</filtered>
</file>
</files>
<dependencySets>
<dependencySet>
<outputDirectory/>
<useProjectArtifact>true</useProjectArtifact>
<useTransitiveFiltering>true</useTransitiveFiltering>
<excludes>
<exclude>org.elasticsearch:elasticsearch</exclude>
</excludes>
</dependencySet>
</dependencySets>
</assembly>
- The my.properties referenced here is the plugin descriptor; it is renamed to plugin-descriptor.properties inside the zip, since that is the file name Elasticsearch expects when installing a plugin.
my.properties
description=${project.description}
version=${project.version}
name=${project.name}
classname=com.test.plugin.MyPlugin
java.version=${maven.compiler.target}
elasticsearch.version=${elasticsearch.version}
- The classname property is the fully qualified name of our plugin class.
Code
- Analyzer
package com.test.index.analysis;
import org.apache.lucene.analysis.Analyzer;
/**
* @author phil.zhang
* @date 2021/2/21
*/
public class MyAnalyzer extends Analyzer {
@Override
protected TokenStreamComponents createComponents(String fieldName) {
MyTokenizer myTokenizer = new MyTokenizer();
return new TokenStreamComponents(myTokenizer);
}
}
- Analyzer provider
package com.test.index.analysis;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.env.Environment;
import org.elasticsearch.index.IndexSettings;
import org.elasticsearch.index.analysis.AbstractIndexAnalyzerProvider;
/**
* @author phil.zhang
* @date 2021/2/21
*/
public class MyAnalyzerProvider extends AbstractIndexAnalyzerProvider<MyAnalyzer> {
private MyAnalyzer myAnalyzer;
public MyAnalyzerProvider(IndexSettings indexSettings,Environment environment, String name, Settings settings) {
super(indexSettings,name,settings);
myAnalyzer = new MyAnalyzer();
}
@Override
public MyAnalyzer get() {
return myAnalyzer;
}
}
- Tokenizer -- the core logic
package com.test.index.analysis;
import java.io.IOException;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
/**
* @author phil.zhang
* @date 2021/2/21
*/
public class MyTokenizer extends Tokenizer {
    private final StringBuilder buffer = new StringBuilder();
    /** start offset of the current token **/
    private int tokenStart = 0;
    /** end offset of the current token **/
    private int tokenEnd = 0;
    /** register the attributes this tokenizer emits: the term text and its offsets **/
    private final CharTermAttribute termAttribute = addAttribute(CharTermAttribute.class);
    private final OffsetAttribute offsetAttribute = addAttribute(OffsetAttribute.class);

    @Override
    public boolean incrementToken() throws IOException {
        clearAttributes();
        buffer.setLength(0); // reset the term buffer
        int ci;
        char ch;
        tokenStart = tokenEnd;
        // read the input one character at a time
        ci = input.read();
        ch = (char) ci;
        while (true) {
            if (ci == -1) {
                // end of input
                if (buffer.length() == 0) {
                    // nothing buffered: tokenization is finished
                    return false;
                } else {
                    // emit the last buffered token
                    termAttribute.setEmpty().append(buffer);
                    offsetAttribute.setOffset(correctOffset(tokenStart), correctOffset(tokenEnd));
                    return true;
                }
            } else if (ch == ' ') {
                // hit a space
                if (buffer.length() > 0) {
                    // emit the buffered token; the space itself is not part of it
                    termAttribute.setEmpty().append(buffer);
                    offsetAttribute.setOffset(correctOffset(tokenStart), correctOffset(tokenEnd));
                    tokenEnd++; // account for the consumed space
                    return true;
                } else {
                    // leading space: skip it and move the start offset forward
                    tokenEnd++;
                    tokenStart = tokenEnd;
                    ci = input.read();
                    ch = (char) ci;
                }
            } else {
                // not a space: keep appending to the current token
                buffer.append(ch);
                tokenEnd++;
                ci = input.read();
                ch = (char) ci;
            }
        }
    }

    @Override
    public void end() throws IOException {
        super.end();
        // after the last token, both offsets point at the end of the consumed input
        int finalOffset = correctOffset(tokenEnd);
        offsetAttribute.setOffset(finalOffset, finalOffset);
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        tokenStart = tokenEnd = 0;
    }
}
- Tokenizer factory
package com.test.index.analysis;
import org.apache.lucene.analysis.Tokenizer;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.env.Environment;
import org.elasticsearch.index.IndexSettings;
import org.elasticsearch.index.analysis.AbstractTokenizerFactory;
/**
* @author phil.zhang
* @date 2021/2/21
*/
public class MyTokenizerFactory extends AbstractTokenizerFactory {
public MyTokenizerFactory(IndexSettings indexSettings,Environment environment,String ignored, Settings settings) {
super(indexSettings,ignored,settings);
}
@Override
public Tokenizer create() {
return new MyTokenizer();
}
}
- Plugin class
package com.test.plugin;
import com.test.index.analysis.MyAnalyzerProvider;
import com.test.index.analysis.MyTokenizerFactory;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.Analyzer;
import org.elasticsearch.index.analysis.AnalyzerProvider;
import org.elasticsearch.index.analysis.TokenizerFactory;
import org.elasticsearch.indices.analysis.AnalysisModule;
import org.elasticsearch.plugins.AnalysisPlugin;
import org.elasticsearch.plugins.Plugin;
/**
* @author phil.zhang
* @date 2021/2/21
*/
public class MyPlugin extends Plugin implements AnalysisPlugin {
@Override
public Map<String, AnalysisModule.AnalysisProvider<TokenizerFactory>> getTokenizers() {
Map<String, AnalysisModule.AnalysisProvider<TokenizerFactory>> extra = new HashMap<>();
extra.put("my-word", MyTokenizerFactory::new);
return extra;
}
@Override
public Map<String, AnalysisModule.AnalysisProvider<AnalyzerProvider<? extends Analyzer>>> getAnalyzers() {
Map<String, AnalysisModule.AnalysisProvider<AnalyzerProvider<? extends Analyzer>>> extra = new HashMap<>();
extra.put("my-word", MyAnalyzerProvider::new);
return extra;
}
}
Next steps
That completes the code. We can run a quick self-test to see the effect (for example the _analyze call below), then package the plugin with Maven (mvn package, which drops the zip into target/releases as configured in the assembly plugin) and install it following the usual plugin installation procedure, which is not covered further here.
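A minimal smoke test, assuming the plugin has been packaged and installed on a node; the analyzer name my-word is the one registered by MyPlugin above, and the sample text is arbitrary:
GET _analyze
{
  "analyzer": "my-word",
  "text": "hello elasticsearch plugin"
}
This should return the tokens hello, elasticsearch and plugin, split on the spaces.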