Custom Analyzer
- When the analyzers that ship with ElasticSearch cannot meet your needs, you can build a custom analyzer by combining different components;
- An analyzer is made up of three components:
- Character Filters: pre-process the raw text, e.g. stripping HTML tags;
- Tokenizer: splits the text into terms according to a set of rules;
- Token Filters: post-process the terms produced by the Tokenizer, e.g. lowercasing, removing stopwords, adding synonyms;
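As a quick illustration of how the three components chain together, the _analyze API accepts all of them in one ad-hoc request (the sample text below is only illustrative):
POST _analyze
{
  "char_filter": [ "html_strip" ],
  "tokenizer": "standard",
  "filter": [ "lowercase", "stop" ],
  "text": "<p>The QUICK Brown Foxes</p>"
}
The HTML tags are stripped first, the standard tokenizer splits the remaining text, and the filters then lowercase the terms and drop the stopword "the", leaving quick, brown, foxes.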
Character Filters
- Process the text before it reaches the Tokenizer, e.g. adding, removing, or replacing characters;
- Multiple Character Filters can be configured;
- Character Filters affect the position and offset information seen by the Tokenizer;
- Character Filters that ship with ElasticSearch:
- HTML strip - removes HTML tags;
- Mapping - string replacement;
- Pattern replace - regex-based replacement;
Tokenizer
- Splits the raw text into terms (term / token) according to a set of rules;
- Tokenizers built into ElasticSearch:
- whitespace
- standard
- uax_url_email (see the example below)
- pattern
- keyword
- path_hierarchy
- You can also implement your own Tokenizer by writing a plugin in Java;
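For example, the uax_url_email tokenizer keeps URLs and e-mail addresses as single tokens instead of breaking them apart (the address and URL below are made up for illustration):
POST _analyze
{
  "tokenizer": "uax_url_email",
  "text": "Contact admin@example.com or visit https://www.elastic.co"
}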
Token Filters
- Add, modify, or remove the terms output by the Tokenizer;
- Token Filters that ship with ElasticSearch:
- Lowercase
- stop
- synonym - adds synonyms
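A minimal sketch of the synonym filter, defined inline in an _analyze request (the synonym pair quick, fast is only an assumption for illustration):
GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "synonym",
      "synonyms": [ "quick, fast" ]
    }
  ],
  "text": "a quick response"
}
Both quick and fast end up at the same position, so a query for either term will match.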
Built-in analyzer components | a few examples
char_filter -> html_strip
POST _analyze
{
"tokenizer":"keyword",
"char_filter":["html_strip"],
"text": "<b>hello world</b>"
}
char_filter -> mapping
POST _analyze
{
"tokenizer": "standard",
"char_filter": [
{
"type" : "mapping",
"mappings" : [ "- => _"]
}
],
"text": "123-456, I-test! test-990 650-555-1234"
}
char_filter -> mapping | multiple mapping rules
POST _analyze
{
"tokenizer": "standard",
"char_filter": [
{
"type" : "mapping",
"mappings" : [ ":) => happy", ":( => sad"]
}
],
"text": ["I am felling :)", "Feeling :( today"]
}
char_filter -> pattern_replace
GET _analyze
{
"tokenizer": "standard",
"char_filter": [
{
"type" : "pattern_replace",
"pattern" : "http://(.*)",
"replacement" : "$1"
}
],
"text" : "http://www.elastic.co"
}
tokenizer -> path_hierarchy
POST _analyze
{
"tokenizer":"path_hierarchy",
"text":"/user/ymruan/a/b/c/d/e"
}
filter -> stop
GET _analyze
{
"tokenizer": "whitespace",
"filter": ["stop"],
"text": ["The rain in Spain falls mainly on the plain."]
}
filter -> lowercase
GET _analyze
{
"tokenizer": "whitespace",
"filter": ["lowercase","stop"],
"text": ["The gilrs in China are playing this game!"]
}
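Note the filter order here: lowercase runs before stop, so "The" is normalized to "the" and then removed; with the order reversed it would survive, because the stop filter matches stopwords case-sensitively by default.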
Custom Analyzer | an example
DELETE my_index
PUT my_index
{
"settings":{
"analysis":{
"analyzer":{
"my_custom_analyzer":{
"type":"custom",
"char_filter":[
"emoticons"
],
"tokenizer":"punctuation",
"filter":[
"lowercase",
"english_stop"
]
}
},
"tokenizer":{
"punctuation":{
"type":"pattern",
"pattern":"[ .,!?]"
}
},
"char_filter":{
"emoticons":{
"type":"mapping",
"mappings":[
":) => _happy_",
":( => _sad_)"
]
}
},
"filter":{
"english_stop":{
"type":"stop",
"stopwords":"_english_"
}
}
}
}
}
POST my_index/_analyze
{
"analyzer":"my_custom_analyzer",
"text":"I'm a :) person, and you?"
}
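For reference, this request should produce roughly the tokens i'm, _happy_, person and you: the emoticons char_filter rewrites :) to _happy_, the punctuation tokenizer splits on spaces and punctuation, lowercase normalizes the terms, and english_stop drops a and and.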