1. References
Geek Time: "Elasticsearch Core Technology and Practice" by Ruan Yiming
Elasticsearch analyzers
Comparison of Elasticsearch's default analyzer and Chinese analyzers, and how to use them
Elasticsearch series --- using a Chinese analyzer
Official docs: character filters
Official docs: tokenizers
Official docs: token filters
2. Getting started
I. The _analyze API
Method 1: specify an analyzer
GET /_analyze
{
  "analyzer": "ik_max_word",
  "text": "Hello Lady, I'm Elasticsearch ^_^"
}
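Note: ik_max_word is not built into Elasticsearch; it comes from the elasticsearch-analysis-ik plugin, so the request above only works once that plugin is installed. If you don't have it yet, the built-in standard analyzer is enough to try the API (a minimal sketch, output omitted):
GET /_analyze
{
  "analyzer": "standard",
  "text": "Hello Lady, I'm Elasticsearch ^_^"
}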
Method 2: specify an index and a field
GET /tmdb_movies/_analyze
{
  "field": "title",
  "text": "Basketball with cartoon alias"
}
Method 3: ad-hoc combination of a tokenizer and filters
GET /_analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "Machine Building Industry Epoch"
}
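For reference, the standard tokenizer plus the lowercase filter should return something along these lines (condensed; exact type strings may vary by version):
{
  "tokens" : [
    { "token" : "machine",  "start_offset" : 0,  "end_offset" : 7,  "type" : "<ALPHANUM>", "position" : 0 },
    { "token" : "building", "start_offset" : 8,  "end_offset" : 16, "type" : "<ALPHANUM>", "position" : 1 },
    { "token" : "industry", "start_offset" : 17, "end_offset" : 25, "type" : "<ALPHANUM>", "position" : 2 },
    { "token" : "epoch",    "start_offset" : 26, "end_offset" : 31, "type" : "<ALPHANUM>", "position" : 3 }
  ]
}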
II. Anatomy of an analyzer
- An analyzer is made up of three parts: character filters, a tokenizer, and token filters.
character filter
Pre-processes the raw text before tokenization; multiple character filters can be configured, and they affect the position and offset information seen by the tokenizer.
Elasticsearch ships with a few built-in character filters:
- HTML Strip: removes HTML tags
- Mapping: string replacement
- Pattern Replace: regex match and replace
Example:
html_strip
GET _analyze
{
  "tokenizer": "keyword",
  "char_filter": ["html_strip"],
  "text": "<br>you know, for search</br>"
}
- Result
{
  "tokens" : [
    {
      "token" : """
you know, for search
""",
      "start_offset" : 0,
      "end_offset" : 29,
      "type" : "word",
      "position" : 0
    }
  ]
}
mapping
GET _analyze
{
  "tokenizer": "whitespace",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": ["- => "]
    },
    "html_strip"
  ],
  "text": "<br>中国-北京 中国-台湾 中国-人民</br>"
}
- Result
{
  "tokens" : [
    {
      "token" : "中国北京",
      "start_offset" : 4,
      "end_offset" : 9,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "中国台湾",
      "start_offset" : 10,
      "end_offset" : 15,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "中国人民",
      "start_offset" : 16,
      "end_offset" : 21,
      "type" : "word",
      "position" : 2
    }
  ]
}
pattern_replace
GET /_analyze
{
  "tokenizer": "keyword",
  "char_filter": [
    {
      "type": "pattern_replace",
      "pattern": "https?://(.*)",
      "replacement": "$1"
    }
  ],
  "text": "https://www.elastic.co"
}
- Result
{
  "tokens" : [
    {
      "token" : "www.elastic.co",
      "start_offset" : 0,
      "end_offset" : 22,
      "type" : "word",
      "position" : 0
    }
  ]
}
tokenizer
Splits the raw text into individual terms (tokens) according to a set of rules.
Elasticsearch provides several built-in tokenizers:
- standard
- letter
- lowercase
- whitespace
- uax url email
- classic
- thai
- n-gram
- edge n-gram
- keyword
- pattern
- simple pattern
- char group
- simple pattern split
- path hierarchy
Example:
path_hierarchy
GET /_analyze
{
  "tokenizer": "path_hierarchy",
  "text": ["/usr/local/bin/java"]
}
- Result
{
  "tokens" : [
    {
      "token" : "/usr",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "/usr/local",
      "start_offset" : 0,
      "end_offset" : 10,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "/usr/local/bin",
      "start_offset" : 0,
      "end_offset" : 14,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "/usr/local/bin/java",
      "start_offset" : 0,
      "end_offset" : 19,
      "type" : "word",
      "position" : 0
    }
  ]
}
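Another tokenizer from the list above, uax_url_email, behaves like standard but keeps URLs and email addresses intact. A quick sketch (the sample text here is my own, not from the original post):
GET /_analyze
{
  "tokenizer": "uax_url_email",
  "text": "email me at someone@example.com or visit https://www.elastic.co"
}
With the standard tokenizer the address and the URL would each be split into several tokens; uax_url_email should emit them as single <EMAIL> and <URL> tokens.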
token filter
Post-processes the terms emitted by the tokenizer, e.g. adding, modifying, or removing tokens.
Elasticsearch provides many built-in token filters, including:
- lowercase
- stop
- uppercase
- reverse
- length
- n-gram
- edge n-gram
- pattern replace
- trim
- ... [see the official docs for the full list; only the ones used here are listed]
Example:
GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": ["stop"],
  "text": ["how are you i am fine thank you"]
}
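With the default _english_ stop word list only "are" is a stop word here ("you", "i" and "am" are not in Lucene's default list), so the response should look roughly like this (condensed, offsets omitted); note the gap at position 1 left by the removed token:
{
  "tokens" : [
    { "token" : "how",   "position" : 0 },
    { "token" : "you",   "position" : 2 },
    { "token" : "i",     "position" : 3 },
    { "token" : "am",    "position" : 4 },
    { "token" : "fine",  "position" : 5 },
    { "token" : "thank", "position" : 6 },
    { "token" : "you",   "position" : 7 }
  ]
}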
III. Custom analyzers
A custom analyzer is simply your own combination of char_filter, tokenizer, and filter (token filter) definitions.
DELETE /my_analysis
PUT /my_analysis
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": [
            "my_char_filter"
          ],
          "tokenizer": "my_tokenizer",
          "filter": [
            "my_tokenizer_filter"
          ]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings": ["_ => "]
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": "[,.!? ]"
        }
      },
      "filter": {
        "my_tokenizer_filter": {
          "type": "stop",
          "stopwords": "_english_"
        }
      }
    }
  }
}
POST /my_analysis/_analyze
{
  "analyzer": "my_analyzer",
  "text": ["Hello Kitty!, A_n_d you?"]
}
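To use the custom analyzer at index time, reference it from a field mapping in the same index. A minimal sketch for Elasticsearch 7+ (the title field is just a placeholder of mine):
PUT /my_analysis/_mapping
{
  "properties": {
    "title": {
      "type": "text",
      "analyzer": "my_analyzer"
    }
  }
}
Documents indexed into title will then pass through my_char_filter (strips underscores), my_tokenizer (splits on [,.!? ]) and my_tokenizer_filter (drops English stop words).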