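Elasticsearch's built-in analyzers handle Chinese poorly: they split Chinese text into single characters for full-text search, which rarely gives the results you want. Full-text search matters and new words appear quickly on today's internet, so a better tool is needed. IK provides sensible word segmentation and supports custom dictionaries.
IK Analyzer is an open-source, lightweight Chinese word segmentation toolkit written in Java.
ik ships with two analyzers:
ik_max_word: splits the text at the finest granularity, producing as many words as possible
ik_smart: splits the text at the coarsest granularity; characters already assigned to one word are not reused by another word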
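To make the difference between the two granularities concrete, here is a toy Python sketch. It is not IK's actual algorithm or dictionary: `TOY_DICT`, `finest_split` and `coarsest_split` are names invented for this illustration. The first function lists every dictionary word found at every position, while the second greedily takes the longest word and never reuses its characters.

```python
# Toy illustration only: IK's real implementation and dictionary are far richer.
TOY_DICT = {"好好学习", "好好学", "好好", "好学", "学习", "天天向上", "天天", "向上"}
MAX_LEN = max(len(w) for w in TOY_DICT)


def finest_split(text):
    """List every dictionary word at every start position (longest first)."""
    tokens = []
    for start in range(len(text)):
        for end in range(min(len(text), start + MAX_LEN), start, -1):
            if text[start:end] in TOY_DICT:
                tokens.append(text[start:end])
    return tokens


def coarsest_split(text):
    """Greedy longest match; characters already used are never reused."""
    tokens, start = [], 0
    while start < len(text):
        for end in range(min(len(text), start + MAX_LEN), start, -1):
            if text[start:end] in TOY_DICT:
                tokens.append(text[start:end])
                start = end
                break
        else:
            start += 1  # no dictionary word starts here (e.g. punctuation): skip it
    return tokens


if __name__ == "__main__":
    sentence = "好好学习,天天向上"
    print(finest_split(sentence))
    print(coarsest_split(sentence))
```

With this toy dictionary the first call yields overlapping words and the second yields only non-overlapping ones, which is the distinction the two IK analyzers are described as making.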
Standard analyzer
GET _analyze
{
"analyzer": "standard",
"text":"好好学习,天天向上"
}
The result treats every single character as its own token
{
"tokens": [
{
"token": "好",
"start_offset": 0,
"end_offset": 1,
"type": "<IDEOGRAPHIC>",
"position": 0
},
{
"token": "好",
"start_offset": 1,
"end_offset": 2,
"type": "<IDEOGRAPHIC>",
"position": 1
},
{
"token": "学",
"start_offset": 2,
"end_offset": 3,
"type": "<IDEOGRAPHIC>",
"position": 2
},
{
"token": "习",
"start_offset": 3,
"end_offset": 4,
"type": "<IDEOGRAPHIC>",
"position": 3
},
{
"token": "天",
"start_offset": 5,
"end_offset": 6,
"type": "<IDEOGRAPHIC>",
"position": 4
},
{
"token": "天",
"start_offset": 6,
"end_offset": 7,
"type": "<IDEOGRAPHIC>",
"position": 5
},
{
"token": "向",
"start_offset": 7,
"end_offset": 8,
"type": "<IDEOGRAPHIC>",
"position": 6
},
{
"token": "上",
"start_offset": 8,
"end_offset": 9,
"type": "<IDEOGRAPHIC>",
"position": 7
}
]
}
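The `start_offset` and `end_offset` values are character positions in the original text, which is why they jump from 4 to 5 where the comma was dropped. The following Python sketch checks this against the tokens copied from the response above. The helper names are hypothetical, and it assumes Python string indices line up with Elasticsearch offsets, which holds for the characters in this sentence.

```python
TEXT = "好好学习,天天向上"

# (token, start_offset, end_offset) copied from the standard analyzer response.
STANDARD_TOKENS = [
    ("好", 0, 1), ("好", 1, 2), ("学", 2, 3), ("习", 3, 4),
    ("天", 5, 6), ("天", 6, 7), ("向", 7, 8), ("上", 8, 9),
]


def offsets_match(text, tokens):
    """True if every token equals the slice of text its offsets point at."""
    return all(text[start:end] == word for word, start, end in tokens)


def uncovered_characters(text, tokens):
    """Characters of text that no token covers."""
    covered = set()
    for _, start, end in tokens:
        covered.update(range(start, end))
    return [ch for i, ch in enumerate(text) if i not in covered]


if __name__ == "__main__":
    print(offsets_match(TEXT, STANDARD_TOKENS))
    print(uncovered_characters(TEXT, STANDARD_TOKENS))
```

The only character left uncovered is the comma at index 4, so punctuation is discarded while offsets still refer to the untouched input.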
ik_smart analysis and its result (coarsest-grained split; characters already assigned to one word are not reused by another word)
GET _analyze
{
"analyzer": "ik_smart",
"text":"好好学习,天天向上"
}
{
"tokens": [
{
"token": "好好学习",
"start_offset": 0,
"end_offset": 4,
"type": "CN_WORD",
"position": 0
},
{
"token": "天天向上",
"start_offset": 5,
"end_offset": 9,
"type": "CN_WORD",
"position": 1
}
]
}
ik_max_word analysis and its result (finest-grained split; produces as many words as possible)
GET _analyze
{
"analyzer": "ik_max_word",
"text":"好好学习,天天向上"
}
{
"tokens": [
{
"token": "好好学习",
"start_offset": 0,
"end_offset": 4,
"type": "CN_WORD",
"position": 0
},
{
"token": "好好学",
"start_offset": 0,
"end_offset": 3,
"type": "CN_WORD",
"position": 1
},
{
"token": "好好",
"start_offset": 0,
"end_offset": 2,
"type": "CN_WORD",
"position": 2
},
{
"token": "好学",
"start_offset": 1,
"end_offset": 3,
"type": "CN_WORD",
"position": 3
},
{
"token": "学习",
"start_offset": 2,
"end_offset": 4,
"type": "CN_WORD",
"position": 4
},
{
"token": "天天向上",
"start_offset": 5,
"end_offset": 9,
"type": "CN_WORD",
"position": 5
},
{
"token": "天天",
"start_offset": 5,
"end_offset": 7,
"type": "CN_WORD",
"position": 6
},
{
"token": "向上",
"start_offset": 7,
"end_offset": 9,
"type": "CN_WORD",
"position": 7
}
]
}
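One way to see the difference between the two IK responses is to look for tokens whose character ranges overlap. The sketch below does that in Python with the offsets copied from the two responses above; `overlapping_pairs` is a hypothetical helper written for this post, not part of IK or Elasticsearch.

```python
# (token, start_offset, end_offset) copied from the two IK responses.
IK_SMART = [("好好学习", 0, 4), ("天天向上", 5, 9)]
IK_MAX_WORD = [
    ("好好学习", 0, 4), ("好好学", 0, 3), ("好好", 0, 2), ("好学", 1, 3),
    ("学习", 2, 4), ("天天向上", 5, 9), ("天天", 5, 7), ("向上", 7, 9),
]


def overlapping_pairs(tokens):
    """Return pairs of tokens whose [start, end) character ranges intersect."""
    pairs = []
    for i, (word_a, start_a, end_a) in enumerate(tokens):
        for word_b, start_b, end_b in tokens[i + 1:]:
            if start_a < end_b and start_b < end_a:
                pairs.append((word_a, word_b))
    return pairs


if __name__ == "__main__":
    print(overlapping_pairs(IK_SMART))
    print(overlapping_pairs(IK_MAX_WORD))
```

The ik_smart tokens share no characters, while the ik_max_word tokens overlap heavily, for example 好好学习 and 学习.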
Example: a demonstration of the IK analyzers
Create an index and set its mapping
PUT /ik_index
PUT /ik_index/text/_mapping
{
"properties": {
"context":{
"type": "text",
"fields": {
"context_ik_smart":{
"type": "text",
"analyzer": "ik_smart",
"search_analyzer": "ik_smart"
},
"context_ik_max_word":{
"type": "text",
"analyzer": "ik_max_word",
"search_analyzer": "ik_max_word"
}
}
}
}
}
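The two sub-fields follow the same pattern, so the mapping body can also be generated. This is a minimal Python sketch: the helper name `multi_field_mapping` is an assumption of this example, and it only builds the JSON body without sending anything to a cluster.

```python
import json


def multi_field_mapping(field, analyzers):
    """Build a mapping with one text sub-field per analyzer."""
    sub_fields = {
        f"{field}_{name}": {
            "type": "text",
            "analyzer": name,
            "search_analyzer": name,
        }
        for name in analyzers
    }
    # The parent field has no analyzer set, so it uses the default (standard) one.
    return {"properties": {field: {"type": "text", "fields": sub_fields}}}


if __name__ == "__main__":
    body = multi_field_mapping("context", ["ik_smart", "ik_max_word"])
    print(json.dumps(body, indent=2))
```

Called with `context` and the two IK analyzer names, it reproduces the mapping body shown above, with sub-fields reachable as `context.context_ik_smart` and `context.context_ik_max_word`.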
Add several documents
POST /ik_index/text
{
"context":"好好学习,天天向上"
}
POST /ik_index/text
{
"context":"学和习,有什么区别"
}
POST /ik_index/text
{
"context":"es的分词该怎么学的"
}
POST /ik_index/text
{
"context":"ik是怎么把句子分成词的"
}
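The four documents could also be sent in a single request through the `_bulk` API, which expects newline-delimited JSON: one action line followed by one document line, with a trailing newline at the end. The Python sketch below only builds that body. It assumes the request would go to `POST /ik_index/text/_bulk` so the action lines can stay empty, and `bulk_body` is a hypothetical helper name.

```python
import json

DOCS = [
    {"context": "好好学习,天天向上"},
    {"context": "学和习,有什么区别"},
    {"context": "es的分词该怎么学的"},
    {"context": "ik是怎么把句子分成词的"},
]


def bulk_body(docs):
    """Build an NDJSON body: an empty index action line before each document."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {}}))
        # ensure_ascii=False keeps the Chinese text readable in the body.
        lines.append(json.dumps(doc, ensure_ascii=False))
    return "\n".join(lines) + "\n"  # the bulk body must end with a newline


if __name__ == "__main__":
    print(bulk_body(DOCS), end="")
```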
Search for "学习"
// Search with the standard analyzer
GET /ik_index/text/_search?pretty
{
"query": {
"match": {
"context": "学习"
}
}
}
// Search with ik_smart
GET /ik_index/text/_search?pretty
{
"query": {
"match": {
"context.context_ik_smart": "学习"
}
}
}
// Search with ik_max_word
GET /ik_index/text/_search?pretty
{
"query": {
"match": {
"context.context_ik_max_word": "学习"
}
}
}
Search results with the standard analyzer
[screenshot: standard]
Search results with ik_smart
[screenshot: ik_smart]
Search results with ik_max_word
[screenshot: ik_max_word]
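Why the three searches can behave differently can be reasoned through for the first document, 好好学习,天天向上, using the token lists from the `_analyze` responses earlier in this post. The Python sketch below is a simplified model, not Elasticsearch itself: it treats a `match` query as hitting when any query token was indexed. How each analyzer splits the query string "学习" is an assumption, since the post does not show it.

```python
# Tokens indexed for the first document, copied from the _analyze responses.
DOC_TOKENS = {
    "standard": ["好", "好", "学", "习", "天", "天", "向", "上"],
    "ik_smart": ["好好学习", "天天向上"],
    "ik_max_word": ["好好学习", "好好学", "好好", "好学", "学习",
                    "天天向上", "天天", "向上"],
}

# ASSUMPTION: how each analyzer splits the query "学习" (not shown in the post).
QUERY_TOKENS = {
    "standard": ["学", "习"],
    "ik_smart": ["学习"],
    "ik_max_word": ["学习"],
}


def matches(analyzer):
    """Simplified match query: a hit if any query token was indexed."""
    indexed = set(DOC_TOKENS[analyzer])
    return any(tok in indexed for tok in QUERY_TOKENS[analyzer])


if __name__ == "__main__":
    for name in DOC_TOKENS:
        print(name, matches(name))
```

Under these assumptions the first document is found through the standard field and the ik_max_word sub-field, but not through ik_smart, because ik_smart indexed 好好学习 as a single token and never emitted 学习 on its own.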