018. Elasticsearch Analyzers: How They Work and How to Use Them

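1. Introduction to analyzers

  • What is an analyzer?

    An analyzer is a tool that splits a piece of text into individual terms according to a set of rules and normalizes those terms. For example:

    "hello tom and jerry" is split into four words: "hello", "tom", "and", "jerry".

    Normalization means, for example, converting the "&" character in "hello tom & jerry" to "and", or stripping the tags before tokenizing HTML: "<span>hello</span>" -> "hello".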

  • Commonly used built-in analyzers

    • standard analyzer
    • simple analyzer
    • whitespace analyzer
    • stop analyzer
    • language analyzer
    • pattern analyzer
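
To make the two steps described above (splitting and normalization) concrete, here is a minimal Python sketch. It only imitates the idea and is not how Elasticsearch is implemented; the function name `analyze` is hypothetical, and the "&" to "and" rule and the tag stripping are taken from the examples above.

```python
import re

def analyze(text):
    """Toy analysis: strip HTML tags, map '&' to 'and', lowercase, split on whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)   # drop HTML tags such as <span> and </span>
    text = text.replace("&", " and ")      # normalize the '&' character
    return text.lower().split()            # split on whitespace

print(analyze("hello tom & jerry"))    # ['hello', 'tom', 'and', 'jerry']
print(analyze("<span>hello</span>"))   # ['hello']
```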

1.1 standard analyzer

Default analyzer: splits text on characters that are neither letters nor digits and lowercases each token.
Test text: a*B!c d4e 5f 7-h
Tokens: a, b, c, d4e, 5f, 7, h

{
  "tokens" : [
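    {
      "token" : "a", # the term produced by analysis
      "start_offset" : 0, # start offset in the original text
      "end_offset" : 1, # end offset in the original text (exclusive)
      "type" : "<ALPHANUM>", # token type: ALPHANUM (alphanumeric), NUM (numeric)
      "position" : 0 # position of this token in the token sequence (0-based)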
    },
    {
      "token" : "b",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "c",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "d4e",
      "start_offset" : 6,
      "end_offset" : 9,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "5f",
      "start_offset" : 10,
      "end_offset" : 12,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "7",
      "start_offset" : 13,
      "end_offset" : 14,
      "type" : "<NUM>",
      "position" : 5
    },
    {
      "token" : "h",
      "start_offset" : 15,
      "end_offset" : 16,
      "type" : "<ALPHANUM>",
      "position" : 6
    }
  ]
}
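
The offsets and positions in the response above can be reproduced with a short Python sketch. This is an approximation under a stated assumption: it treats a token as a run of ASCII letters and digits, whereas the real standard tokenizer follows Unicode text segmentation rules, so the two agree only on simple input like this test text. The function name `tokenize_with_offsets` is hypothetical.

```python
import re

def tokenize_with_offsets(text):
    """Find runs of ASCII letters/digits and record offsets the way the _analyze response does."""
    tokens = []
    for position, match in enumerate(re.finditer(r"[A-Za-z0-9]+", text)):
        tokens.append({
            "token": match.group().lower(),   # lowercased term
            "start_offset": match.start(),    # inclusive start in the original text
            "end_offset": match.end(),        # exclusive end in the original text
            "position": position,             # ordinal of the token in the stream
        })
    return tokens

for token in tokenize_with_offsets("a*B!c d4e 5f 7-h"):
    print(token)
```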

1.2 simple analyzer

Behavior: splits text on non-letter characters and lowercases each token.
Test text: a*B!c d4e 5f 7-h
Tokens: a, b, c, d, e, f, h

GET _analyze
{
  "analyzer": "simple",
  "text": "a*B!c d4e 5f 7-h"
}

{
  "tokens" : [
    {
      "token" : "a",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "b",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "c",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "d",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "e",
      "start_offset" : 8,
      "end_offset" : 9,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "f",
      "start_offset" : 11,
      "end_offset" : 12,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "h",
      "start_offset" : 15,
      "end_offset" : 16,
      "type" : "word",
      "position" : 6
    }
  ]
}

1.3 whitespace analyzer

Behavior: splits text on whitespace only; case and punctuation are preserved.
Test text: a*B!c D d4e 5f 7-h
Tokens: a*B!c, D, d4e, 5f, 7-h

GET _analyze
{
  "analyzer": "whitespace",
  "text": "a*B!c D d4e 5f 7-h"
}

{
  "tokens" : [
    {
      "token" : "a*B!c",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "D",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "d4e",
      "start_offset" : 8,
      "end_offset" : 11,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "5f",
      "start_offset" : 12,
      "end_offset" : 14,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "7-h",
      "start_offset" : 15,
      "end_offset" : 18,
      "type" : "word",
      "position" : 4
    }
  ]
}
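
The difference between the simple and whitespace analyzers is easy to see side by side. The sketch below is a rough Python approximation that handles ASCII letters only; both function names are hypothetical and neither is Elasticsearch code.

```python
import re

def simple_tokens(text):
    # Approximation of the simple analyzer: runs of letters, lowercased.
    return [token.lower() for token in re.findall(r"[A-Za-z]+", text)]

def whitespace_tokens(text):
    # Approximation of the whitespace analyzer: split on whitespace, keep case and punctuation.
    return text.split()

text = "a*B!c D d4e 5f 7-h"
print(simple_tokens(text))      # ['a', 'b', 'c', 'd', 'd', 'e', 'f', 'h']
print(whitespace_tokens(text))  # ['a*B!c', 'D', 'd4e', '5f', '7-h']
```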

1.4 stop analyzer

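Behavior: splits text on non-letter characters, lowercases each token, and removes stop words (English stop words by default, such as the, a, an, this, of, at).
Test text: The apple is red
Tokens: apple, red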

GET _analyze
{
  "analyzer": "stop",
  "text": "The apple is red"
}

{
  "tokens" : [
    {
      "token" : "apple",
      "start_offset" : 4,
      "end_offset" : 9,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "red",
      "start_offset" : 13,
      "end_offset" : 16,
      "type" : "word",
      "position" : 3
    }
  ]
}
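
Note that the surviving tokens above keep positions 1 and 3: removing a stop word leaves a gap instead of renumbering. A minimal Python sketch of that behavior follows, assuming a small hand-picked stop word set (`STOP_WORDS` is only a subset chosen for this example) and a hypothetical function name `stop_analyze`.

```python
import re

# Small illustrative subset of English stop words; the real default list is longer.
STOP_WORDS = {"the", "a", "an", "this", "of", "at", "is"}

def stop_analyze(text, stop_words=STOP_WORDS):
    """Lowercase letter runs, drop stop words, and keep each token's original position."""
    tokens = []
    for position, match in enumerate(re.finditer(r"[A-Za-z]+", text)):
        word = match.group().lower()
        if word not in stop_words:
            tokens.append((word, match.start(), match.end(), position))
    return tokens

print(stop_analyze("The apple is red"))
# [('apple', 4, 9, 1), ('red', 13, 16, 3)]
```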

1.5 language analyzer

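Behavior: analyzes text using the rules of a specific language. The example below uses english; none of the built-in analyzers performs dictionary-based Chinese word segmentation.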

GET _analyze
{
  "analyzer": "english",
  "text": "\"I'm Tony,\", he said, \"nice to meet you!\""
}

{
  "tokens" : [
    {
      "token" : "i'm",
      "start_offset" : 1,
      "end_offset" : 4,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "toni",
      "start_offset" : 5,
      "end_offset" : 9,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "he",
      "start_offset" : 13,
      "end_offset" : 15,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "said",
      "start_offset" : 16,
      "end_offset" : 20,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "nice",
      "start_offset" : 23,
      "end_offset" : 27,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "meet",
      "start_offset" : 31,
      "end_offset" : 35,
      "type" : "<ALPHANUM>",
      "position" : 6
    },
    {
      "token" : "you",
      "start_offset" : 36,
      "end_offset" : 39,
      "type" : "<ALPHANUM>",
      "position" : 7
    }
  ]
}

1.6 pattern analyzer

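Behavior: splits text with a regular expression. The default pattern is \W+ (one or more non-word characters, that is, anything other than letters, digits, and the underscore), and tokens are lowercased by default.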

GET _analyze
{
  "analyzer": "pattern",
  "text": "The best 3-points shooter is Curry!"
}

{
  "tokens" : [
    {
      "token" : "the",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "best",
      "start_offset" : 4,
      "end_offset" : 8,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "3",
      "start_offset" : 9,
      "end_offset" : 10,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "points",
      "start_offset" : 11,
      "end_offset" : 17,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "shooter",
      "start_offset" : 18,
      "end_offset" : 25,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "is",
      "start_offset" : 26,
      "end_offset" : 28,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "curry",
      "start_offset" : 29,
      "end_offset" : 34,
      "type" : "word",
      "position" : 6
    }
  ]
}
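
The same split can be sketched with Python's `re` module. The `re.ASCII` flag is used so that `\W` means `[^a-zA-Z0-9_]`, which is how the pattern behaves on this ASCII test text; the function name `pattern_tokens` is hypothetical and this is an illustration, not Elasticsearch code.

```python
import re

def pattern_tokens(text, pattern=r"\W+"):
    # Split on the separator pattern, drop the empty strings produced by
    # leading/trailing separators, and lowercase what remains.
    return [token.lower() for token in re.split(pattern, text, flags=re.ASCII) if token]

print(pattern_tokens("The best 3-points shooter is Curry!"))
# ['the', 'best', '3', 'points', 'shooter', 'is', 'curry']
```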

2. Using analyzers

2.1 Specifying an analyzer for an index

Create a test index:

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer_1": {
          "type": "whitespace"
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "id": {
          "type": "keyword"
        },
        "name": {
          "type": "text"
        },
        "desc": {
          "type": "text",
          "analyzer": "my_analyzer_1"
        }
      }
    }
  }
}

Index a test document:

PUT my_index/_doc/1
{
  "id": "001",
  "name": "Curry",
  "desc": "The best 3-points shooter is Curry!"
}

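Query: because the desc field is analyzed with whitespace, searching for curry finds nothing; the indexed term is Curry!, so the query must use Curry! as well.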

GET my_index/_search
{
  "query": {
    "match": {
      "desc": "curry"
    }
  }
}

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : [ ]
  }
}

GET my_index/_search
{
  "query": {
    "match": {
      "desc": "Curry!"
    }
  }
}

{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.2876821,
    "hits" : [
      {
        "_index" : "my_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.2876821,
        "_source" : {
          "id" : "001",
          "name" : "Curry",
          "desc" : "The best 3-points shooter is Curry!"
        }
      }
    ]
  }
}
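
Why the first query misses and the second one hits can be sketched in a few lines of Python. The sketch assumes that the query text is analyzed with the same whitespace rule as the field (the default when no separate search analyzer is set) and that a match needs at least one term in common; the names `whitespace_terms` and `matches` are hypothetical.

```python
def whitespace_terms(text):
    # Whitespace analysis: split on whitespace only; case and punctuation are kept.
    return set(text.split())

# Terms stored for the desc field of the test document.
indexed_terms = whitespace_terms("The best 3-points shooter is Curry!")

def matches(query_text):
    # Analyze the query text the same way, then look for any shared term.
    return bool(whitespace_terms(query_text) & indexed_terms)

print(matches("curry"))   # False: the index holds "Curry!", not "curry"
print(matches("Curry!"))  # True
```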

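2.2 Changing analyzer settings

# Create an index with an analyzer that enables stop words; the default standard analyzer does not remove stop words.
# If my_index still exists from the previous section, delete it first (DELETE /my_index).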
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_standard": {
          "type": "standard",
          "stopwords": "_english_"
        }
      }
    }
  }
}

# Test
GET /my_index/_analyze
{
  "analyzer": "my_standard",
  "text": "a dog is in the house"
}

{
  "tokens": [
    {
      "token": "dog",
      "start_offset": 2,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "house",
      "start_offset": 16,
      "end_offset": 21,
      "type": "<ALPHANUM>",
      "position": 5
    }
  ]
}

2.3 Custom analyzers

PUT /my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings": ["& => and"] # convert "&" to "and"
        }
      },
      "filter": {
        "my_filter": {
          "type": "stop",
          "stopwords": ["the", "a"] # two custom stop words
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip", "my_char_filter"], # built-in HTML tag stripping plus the custom my_char_filter
          "tokenizer": "standard",
          "filter": ["lowercase", "my_filter"] # built-in lowercase filter plus the custom my_filter
        }
      }
    }
  }
}

GET /my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "tom&jerry are a friend in the house, <a>, HAHA!!"
}

{
  "tokens": [
    {
      "token": "tomandjerry",
      "start_offset": 0,
      "end_offset": 9,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "are",
      "start_offset": 10,
      "end_offset": 13,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "friend",
      "start_offset": 16,
      "end_offset": 22,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "in",
      "start_offset": 23,
      "end_offset": 25,
      "type": "<ALPHANUM>",
      "position": 4
    },
    {
      "token": "house",
      "start_offset": 30,
      "end_offset": 35,
      "type": "<ALPHANUM>",
      "position": 6
    },
    {
      "token": "haha",
      "start_offset": 42,
      "end_offset": 46,
      "type": "<ALPHANUM>",
      "position": 7
    }
  ]
}
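
The custom analyzer above is a three-stage pipeline: character filters rewrite the raw text, a tokenizer splits it, and token filters rewrite the token stream. The Python sketch below imitates those stages under simplifying assumptions (ASCII-only tokenizer, naive tag stripping); every function name is hypothetical and none of this is Elasticsearch code.

```python
import re

def html_strip(text):
    return re.sub(r"<[^>]+>", "", text)         # character filter: remove HTML tags

def map_chars(text):
    return text.replace("&", "and")             # character filter: "& => and"

def standard_tokenize(text):
    return re.findall(r"[A-Za-z0-9]+", text)    # tokenizer: ASCII approximation

def lowercase(tokens):
    return [token.lower() for token in tokens]  # token filter

def stop(tokens, stopwords=("the", "a")):
    return [token for token in tokens if token not in stopwords]  # token filter

def my_analyzer(text):
    for char_filter in (html_strip, map_chars):  # stage 1: character filters, in order
        text = char_filter(text)
    tokens = standard_tokenize(text)             # stage 2: tokenizer
    for token_filter in (lowercase, stop):       # stage 3: token filters, in order
        tokens = token_filter(tokens)
    return tokens

print(my_analyzer("tom&jerry are a friend in the house, <a>, HAHA!!"))
# ['tomandjerry', 'are', 'friend', 'in', 'house', 'haha']
```

Filter order matters: because lowercase runs before the stop filter, a capitalized "The" would also be removed.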

2.4 Setting a custom analyzer for a specific field of a specific type

PUT /my_index/_mapping/my_type
{
  "properties": {
    "content": {
      "type": "text",
      "analyzer": "my_analyzer"
    }
  }
}

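3. Chinese analyzers

3.1 Introduction to Chinese analyzers

The analyzers built into Elasticsearch cannot segment Chinese text into words. For example, the standard analyzer emits one token per character: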

GET _analyze
{
  "analyzer": "standard",
  "text": "火箭明年总冠军"
}

{
  "tokens" : [
    {
      "token" : "火",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<IDEOGRAPHIC>",
      "position" : 0
    },
    {
      "token" : "箭",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "<IDEOGRAPHIC>",
      "position" : 1
    },
    {
      "token" : "明",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "<IDEOGRAPHIC>",
      "position" : 2
    },
    {
      "token" : "年",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "<IDEOGRAPHIC>",
      "position" : 3
    },
    {
      "token" : "总",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "<IDEOGRAPHIC>",
      "position" : 4
    },
    {
      "token" : "冠",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "<IDEOGRAPHIC>",
      "position" : 5
    },
    {
      "token" : "军",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "<IDEOGRAPHIC>",
      "position" : 6
    }
  ]
}

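The result we want is 火箭, 明年, 总冠军, which requires a Chinese analyzer.

  • Common Chinese analyzers
    • smartCN: a simple analyzer for Chinese or mixed Chinese-English text
    • IK analyzer: a smarter, more user-friendly Chinese analyzer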

3.2 Installing smartCN

bin/elasticsearch-plugin install analysis-smartcn

After installation, restart the ES cluster and test:

GET _analyze
{
  "analyzer": "smartcn",
  "text": "火箭明年总冠军"
}

{
  "tokens" : [
    {
      "token" : "火箭",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "明年",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "总",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "冠军",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "word",
      "position" : 3
    }
  ]
}

3.3 Installing the IK analyzer

Download: https://github.com/medcl/elasticsearch-analysis-ik/releases

  • Download the IK release that matches your ES version: elasticsearch-analysis-ik-x.x.x.zip

  • Create an ik directory under the ES plugins directory

    [giant@jd2 plugins]$ mkdir ik
    
  • Upload elasticsearch-analysis-ik-x.x.x.zip to plugins/ik and unzip it

    [giant@jd2 ik]$ unzip elasticsearch-analysis-ik-6.6.0.zip
    
  • Delete the elasticsearch-analysis-ik-x.x.x.zip archive

    [giant@jd2 ik]$ rm -rf elasticsearch-analysis-ik-6.6.0.zip
    [giant@jd2 ik]$ ll
    total 1428
    -rw-r--r-- 1 giant giant 263965 Jan 15 17:07 commons-codec-1.9.jar
    -rw-r--r-- 1 giant giant  61829 Jan 15 17:07 commons-logging-1.2.jar
    drwxr-xr-x 2 giant giant    299 Jan 15 17:07 config
    -rw-r--r-- 1 giant giant  54693 Jan 15 17:07 elasticsearch-analysis-ik-6.6.0.jar
    -rw-r--r-- 1 giant giant 736658 Jan 15 17:07 httpclient-4.5.2.jar
    -rw-r--r-- 1 giant giant 326724 Jan 15 17:07 httpcore-4.4.4.jar
    -rw-r--r-- 1 giant giant   1805 Jan 15 17:07 plugin-descriptor.properties
    -rw-r--r-- 1 giant giant    125 Jan 15 17:07 plugin-security.policy
    
  • Repeat the steps above on every ES node, then restart the ES cluster

Testing the IK analyzer:

GET _analyze
{
  "analyzer": "ik_max_word",
  "text": "火箭明年总冠军"
}

{
  "tokens" : [
    {
      "token" : "火箭",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "明年",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "总冠军",
      "start_offset" : 4,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "冠军",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 3
    }
  ]
}

The IK plugin provides two analyzers, ik_max_word and ik_smart:

  • ik_max_word: splits the text at the finest granularity
  • ik_smart: splits the text at the coarsest granularity
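
The difference between fine-grained and coarse-grained splitting can be illustrated with two toy dictionary-based strategies in Python. This is only a sketch of the idea and does not reproduce IK's actual algorithm; the mini dictionary and both function names are hypothetical.

```python
def all_dictionary_words(text, dictionary):
    """Fine-grained: emit every dictionary word found at every start position."""
    found = []
    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            if text[start:end] in dictionary:
                found.append(text[start:end])
    return found

def forward_max_match(text, dictionary):
    """Coarse-grained: repeatedly take the longest dictionary word, else a single character."""
    tokens, start = [], 0
    while start < len(text):
        for end in range(len(text), start, -1):
            if text[start:end] in dictionary or end == start + 1:
                tokens.append(text[start:end])
                start = end
                break
    return tokens

dictionary = {"火箭", "明年", "总冠军", "冠军"}  # hypothetical mini dictionary
text = "火箭明年总冠军"
print(all_dictionary_words(text, dictionary))  # ['火箭', '明年', '总冠军', '冠军']
print(forward_max_match(text, dictionary))     # ['火箭', '明年', '总冠军']
```

The fine-grained strategy yields overlapping words (总冠军 and 冠军), in the spirit of the ik_max_word output shown earlier, while the coarse-grained one covers the text with the longest non-overlapping words.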

3.4 IK analyzer configuration files

  • IKAnalyzer.cfg.xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- Configure your own extension dictionary here -->
    <entry key="ext_dict"></entry>
    <!-- Configure your own extension stop-word dictionary here -->
    <entry key="ext_stopwords"></entry>
    <!-- Configure a remote extension dictionary here -->
    <!-- <entry key="remote_ext_dict">words_location</entry> -->
    <!-- Configure a remote extension stop-word dictionary here -->
    <!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>
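  • main.dic: IK's built-in Chinese dictionary, with more than 270,000 entries; any word defined here is kept together as one token
  • quantifier.dic: unit and measure words
  • suffix.dic: common suffixes
  • surname.dic: Chinese surnames
  • stopword.dic: English stop words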

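3.5 Custom dictionaries

  • Custom dictionary: new buzzwords appear every year (网红, 蓝瘦香菇, 喊麦, 鬼畜) and are usually missing from IK's built-in dictionary. Add them to your own dictionary file, then reference that file in IKAnalyzer.cfg.xml.

  • Custom stop-word dictionary: words such as "了", "的", "啥", "么" are ones we may not want to index or make searchable.

    <entry key="ext_dict">custom/mydict.dic</entry>
    <entry key="ext_stopwords">custom/mystopdict.dic</entry>
    
  • ES must be restarted for the changes to take effect

  • Test (before the custom dictionary is added)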
GET _analyze
{
  "analyzer": "ik_max_word",
  "text": "网红"
}

{
    "tokens": [
        {
            "token": "网",
            "start_offset": 0,
            "end_offset": 1,
            "type": "CN_CHAR",
            "position": 0
        },
        {
            "token": "红",
            "start_offset": 1,
            "end_offset": 2,
            "type": "CN_CHAR",
            "position": 1
        }
    ]
}
  • Create the custom dictionary
mkdir -p ${ELASTICSEARCH_HOME}/plugins/ik/config/custom
touch ${ELASTICSEARCH_HOME}/plugins/ik/config/custom/mydict.dic
# Then add the word 网红 to the file (one word per line)
# Then edit IKAnalyzer.cfg.xml
<entry key="ext_dict">custom/mydict.dic</entry>
  • Restart ES and test again
GET _analyze
{
  "analyzer": "ik_max_word",
  "text": "网红"
}

{
    "tokens": [
        {
            "token": "网红",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 0
        }
    ]
}
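
The before and after results above can be mimicked with a toy Python sketch: a dictionary file with one word per line, and a longest-match segmenter that falls back to single characters for unknown text. The file is written to a temporary directory instead of the real plugin directory, the function names are hypothetical, and the matching logic is a simplification, not IK's algorithm.

```python
import tempfile
from pathlib import Path

def load_dictionary(path):
    # One word per line, UTF-8; blank lines are ignored.
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}

def segment(text, dictionary):
    # Take the longest dictionary word at each step; fall back to one character.
    tokens, start = [], 0
    while start < len(text):
        for end in range(len(text), start, -1):
            if text[start:end] in dictionary or end == start + 1:
                tokens.append(text[start:end])
                start = end
                break
    return tokens

with tempfile.TemporaryDirectory() as tmp:
    dict_path = Path(tmp) / "custom" / "mydict.dic"   # mirrors custom/mydict.dic
    dict_path.parent.mkdir(parents=True)
    dict_path.write_text("", encoding="utf-8")        # empty dictionary
    before = segment("网红", load_dictionary(dict_path))
    dict_path.write_text("网红\n", encoding="utf-8")   # add the new word
    after = segment("网红", load_dictionary(dict_path))

print(before)  # ['网', '红']
print(after)   # ['网红']
```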