Tags: search

Elasticsearch is a powerful open-source search and analytics engine that is highly scalable and highly available.

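At the time of writing, the latest version is 1.7.2. Combined with a Chinese analyzer, it retrieves data efficiently: it returns the documents that match a query, each with a relevance score, sorted by that score.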

This post is only an introduction, but it touches on most of the commonly used features.

Elasticsearch is a mature product. For a deeper understanding, read the official documentation in detail.

Current status

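Powerful as Elasticsearch is, you have to organize the data structure and import the data yourself. Versions before 1.5 supported river plugins, which could sync data from a database straight into Elasticsearch, but current versions no longer recommend these import plugins, so you need to write your own data synchronization module.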

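For Chinese word segmentation, the elasticsearch-analysis-ik plugin can be used. It supports updating hot words at runtime and can be configured with different segmentation strategies.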

Elasticsearch is written in Java and needs a Java runtime to run. Its interface is RESTful, so in practice each CRUD operation maps to the corresponding HTTP method.

With the default configuration, the HTTP API is reachable at localhost on port 9200.

Data structure

Documents are addressed as /{index}/{type}/{id}, a three-level hierarchy, and the original document is stored as JSON.

For example:

PUT /index/test/1
{ "title": "最新电影" }

GET /index/test/1
{
    "_index": "index",
    "_type": "test",
    "_id": "1",
    "_version": 1,
    "found": true,
    "_source": {
        "title": "最新电影"
    }
}

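The original document is stored under _source, and the stored record also carries the metadata fields _index, _type, _id, and _version. Issuing the same PUT again updates the document and changes _version to 2.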
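To make the addressing and versioning behavior concrete, here is a minimal in-memory sketch in Python. ToyDocumentStore is a hypothetical stand-in written for this post, not an Elasticsearch client; it only mimics the /{index}/{type}/{id} layout and the way _version grows on each PUT.

```python
class ToyDocumentStore:
    """In-memory stand-in for the /{index}/{type}/{id} layout (not Elasticsearch itself)."""

    def __init__(self):
        # (index, type, id) -> {"_version": int, "_source": dict}
        self._docs = {}

    def put(self, index, doc_type, doc_id, source):
        """Store or overwrite a document and return its new version."""
        key = (index, doc_type, str(doc_id))
        previous = self._docs.get(key)
        version = previous["_version"] + 1 if previous else 1
        self._docs[key] = {"_version": version, "_source": dict(source)}
        return version

    def get(self, index, doc_type, doc_id):
        """Return the document wrapped in metadata, shaped like the GET response above."""
        key = (index, doc_type, str(doc_id))
        entry = self._docs.get(key)
        meta = {"_index": index, "_type": doc_type, "_id": str(doc_id)}
        if entry is None:
            return {**meta, "found": False}
        return {**meta, "_version": entry["_version"], "found": True,
                "_source": entry["_source"]}


store = ToyDocumentStore()
store.put("index", "test", 1, {"title": "最新电影"})
store.put("index", "test", 1, {"title": "最新电影"})  # second PUT bumps the version
doc = store.get("index", "test", 1)
print(doc["_version"], doc["_source"])
```

The real server does far more (sharding, persistence, concurrency control); the sketch only shows why a repeated PUT is an update rather than an error.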

Importing data

Data can be imported with HTTP POST requests.

1. Create the index

To create a plain index, no request body is needed:

PUT http://localhost:9200/test
// Response
{
    "acknowledged": true
}

To use analysis (tokenizing and analyzing text), create the index with explicit settings and mappings:

PUT http://localhost:9200/test
{
  "settings": {
     "refresh_interval": "5s",
     "number_of_shards" :   1, // one primary shard (default 5)
     "number_of_replicas" : 0 // no replicas for now; can be raised later (default 1)
  },
  "mappings": {
    "_default_":{
      "_all": { "enabled":  false } // disable the _all field, since we only search the title field
    },
    "resource": { // this is the _type
      "dynamic": false, // disable dynamic mapping updates
      "properties": {
        "title": { // analyze the title field
          "type": "string",
          "index": "analyzed",
          "fields": { // sub-fields index the same text with a different analyzer per language
            "cn": { // Chinese: use the Chinese analyzer
              "type": "string",
              "analyzer": "ik_smart"
            },
            "en": { // English: use the English analyzer
              "type": "string",
              "analyzer": "english"
            }
          }
        }
      }
    }
  }
}
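The annotated body above uses // comments for explanation, which strict JSON does not allow, so they have to be dropped before the request is sent. A minimal Python sketch that builds the same body as plain JSON follows; build_index_body and its parameters are names made up for this illustration, and the settings simply mirror the example above.

```python
import json


def build_index_body(type_name="resource", chinese_analyzer="ik_smart"):
    """Return the settings and mappings from the example above as a plain dict."""
    return {
        "settings": {
            "refresh_interval": "5s",
            "number_of_shards": 1,    # one primary shard
            "number_of_replicas": 0,  # no replicas
        },
        "mappings": {
            "_default_": {"_all": {"enabled": False}},
            type_name: {
                "dynamic": False,
                "properties": {
                    "title": {
                        "type": "string",
                        "index": "analyzed",
                        "fields": {
                            "cn": {"type": "string", "analyzer": chinese_analyzer},
                            "en": {"type": "string", "analyzer": "english"},
                        },
                    }
                },
            },
        },
    }


body = build_index_body()
payload = json.dumps(body, ensure_ascii=False, indent=2)  # comment-free JSON text
print(payload)
```

The resulting payload string is what would go in the body of the PUT request shown above.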

Then import data into the index created above (test):

POST /test/resource/ { "title": "周星驰" } // the id is generated automatically
PUT /test/resource/1?op_type=create { "title": "周星驰" }
PUT /test/resource/1/_create { "title": "周星驰" }

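The first form generates the _id automatically; the other two create the document only if that _id does not already exist.

Several documents can be imported in one request with the bulk API: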

POST /_bulk or /test/_bulk or /test/resource/_bulk
{ "create": { "_index": "test", "_type": "resource", "_id": 1 } }
{ "title": "周星驰最新电影" }
{ "create": { "_index": "test", "_type": "resource", "_id": 2 } }
{ "title": "周星驰最好看的新电影" }
{ "create": { "_index": "test", "_type": "resource", "_id": 3 } }
{ "title": "周星驰最新电影，最好，新电影" }
{ "create": { "_index": "test", "_type": "resource", "_id": 4 } }
{ "title": "最最最最好的新新新新电影" }
{ "create": { "_index": "test", "_type": "resource", "_id": 5 } }
{ "title": "I'm not happy about the foxes" }

The /_bulk payload can also be placed in a text file (the file must end with a newline, \n):

curl -s -XPOST localhost:9200/_bulk --data-binary "@requests"

// Creating a document that already exists returns an error
{
  "error" : "DocumentAlreadyExistsException[[website][4] [blog][123]:
             document already exists]",
  "status" : 409
}
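The bulk payload shown earlier is newline-delimited JSON: one action line followed by one source line per document, with a final newline. A minimal Python sketch that generates it is below; build_bulk_body is a hypothetical helper, and sequential integer ids starting at 1 are an assumption taken from the example.

```python
import json


def build_bulk_body(index, doc_type, titles, first_id=1):
    """Build a bulk payload with one 'create' action line and one source line per title."""
    lines = []
    for offset, title in enumerate(titles):
        action = {"create": {"_index": index, "_type": doc_type, "_id": first_id + offset}}
        lines.append(json.dumps(action, ensure_ascii=False))
        lines.append(json.dumps({"title": title}, ensure_ascii=False))
    # The payload must end with a newline.
    return "\n".join(lines) + "\n"


bulk_body = build_bulk_body("test", "resource", ["周星驰最新电影", "周星驰最好看的新电影"])
print(bulk_body, end="")
```

Writing bulk_body to a UTF-8 file named requests gives a file that can be sent with the curl command above.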

Searching data

Retrieving documents

Use HEAD to check whether a document exists:

HEAD /{index}/{type}/{id}

A document can be fetched exactly by addressing it down to the id level.

// pretty formats the JSON response
GET /{index}/{type}/{id}[?pretty][&_source=field1,field...]

The _source parameter restricts the response to the listed fields.
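As a small illustration of the URL pattern above, the following Python sketch assembles a document URL with the optional pretty and _source parameters. document_url and the base address are assumptions for this example; pretty is written as pretty=true, which is equivalent to the bare flag.

```python
from urllib.parse import quote, urlencode


def document_url(index, doc_type, doc_id, pretty=False, source_fields=None,
                 base="http://localhost:9200"):
    """Build /{index}/{type}/{id} with optional pretty and _source query parameters."""
    path = "/".join(quote(str(part), safe="") for part in (index, doc_type, doc_id))
    params = {}
    if pretty:
        params["pretty"] = "true"
    if source_fields:
        params["_source"] = ",".join(source_fields)
    query = urlencode(params, safe=",")  # keep commas readable in the field list
    return f"{base}/{path}" + (f"?{query}" if query else "")


url = document_url("test", "resource", 1, pretty=True, source_fields=["title"])
print(url)
```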

To search, use the _search endpoint:

GET /{index}/_search or /{index}/{type}/_search
// Sample response
{
    "took": 1, // time taken, in milliseconds
    "timed_out": false, // a timeout can be set on the request with ?timeout=10ms
    "_shards": { // shards
        "total": 5,
        "successful": 5,
        "failed": 0
    },
    "hits": { // search results
        "total": 2,
        "max_score": 1,
        "hits": [ // the matching documents
          {
            "_index": "test",
            "_type": "resource",
            "_id": 1,
            "_score": 1,
            "_source": {
              "title": "周星驰"
            }
          },
          ...
        ]
    }
}

With analysis enabled, multi_match searches by keyword (analyzed terms), and the results come back sorted by _score.

POST /{index}/{type}/_search
{
  "query": {
    "multi_match": {
      "type":     "most_fields", 
      "query":    "周星驰最新电影fox",
      "fields": ["title", "title.cn", "title.en"]
    }
  }
}
// Sample response
{
    "took": 7,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "failed": 0
    },
    "hits": {
        "total": 5,
        "max_score": 1.4102149,
        "hits": [
            {
                "_index": "index",
                "_type": "test",
                "_id": "1",
                "_score": 1.4102149,
                "_source": {
                    "title": "周星驰最新电影"
                }
            },
            {
                "_index": "index",
                "_type": "test",
                "_id": "3",
                "_score": 1.1354887,
                "_source": {
                    "title": "周星驰最新电影，最好，新电影"
                }
            },
            {
                "_index": "index",
                "_type": "test",
                "_id": "2",
                "_score": 1.0024924,
                "_source": {
                    "title": "周星驰最好看的新电影"
                }
            },
            {
                "_index": "index",
                "_type": "test",
                "_id": "4",
                "_score": 0.31740457,
                "_source": {
                    "title": "最最最最好的新新新新电影"
                }
            },
            {
                "_index": "index",
                "_type": "test",
                "_id": "5",
                "_score": 0.013072087,
                "_source": {
                    "title": "I'm not happy about the foxes"
                }
            }
        ]
    }
}

Pagination, highlighting, and a minimum match threshold can be added as well:

POST /{index}/{type}/_search
{
  "query": {
    "multi_match": {
      "type":     "most_fields",  // the matching mode
      "query":    "周星驰最新电影fox",
      "fields": [ "title", "title.cn", "title.en" ], // the fields to search
      "minimum_should_match": "20%" // minimum match threshold
    }
  },
  "from": 0,
  "size": 10,
  "highlight" : {
    "pre_tags" : ["<strong>"],
    "post_tags" : ["</strong>"],
    "fields" : {
      "title" : {},
      "title.cn" : {},
      "title.en" : {}
    }
  }
}
// Response
{
    "took": 13,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "failed": 0
    },
    "hits": {
        "total": 5,
        "max_score": 1.1782398,
        "hits": [
            {
                "_index": "test",
                "_type": "resource",
                "_id": "1",
                "_score": 1.1782398,
                "_source": {
                    "title": "周星驰最新电影"
                },
                "highlight": {
                    "title": [
                        "<strong>周</strong><strong>星</strong><strong>驰</strong><strong>最</strong><strong>新</strong><strong>电</strong><strong>影</strong>"
                    ],
                    "title.cn": [
                        "<strong>周星驰</strong><strong>最新</strong><strong>电影</strong>"
                    ],
                    "title.en": [
                        "<strong>周</strong><strong>星</strong><strong>驰</strong><strong>最</strong><strong>新</strong><strong>电</strong><strong>影</strong>"
                    ]
                }
            },
            {
                "_index": "test",
                "_type": "resource",
                "_id": "3",
                "_score": 0.9440402,
                "_source": {
                    "title": "周星驰最新电影，最好，新电影"
                },
                "highlight": {
                    "title": [
                        "<strong>周</strong><strong>星</strong><strong>驰</strong><strong>最</strong><strong>新</strong><strong>电</strong><strong>影</strong>，<strong>最</strong>好，<strong>新</strong><strong>电</strong><strong>影</strong>"
                    ],
                    "title.cn": [
                        "<strong>周星驰</strong><strong>最新</strong><strong>电影</strong>，最好，<strong>新</strong><strong>电影</strong>"
                    ],
                    "title.en": [
                        "<strong>周</strong><strong>星</strong><strong>驰</strong><strong>最</strong><strong>新</strong><strong>电</strong><strong>影</strong>，<strong>最</strong>好，<strong>新</strong><strong>电</strong><strong>影</strong>"
                    ]
                }
            },
            {
                "_index": "test",
                "_type": "resource",
                "_id": "2",
                "_score": 0.8302629,
                "_source": {
                    "title": "周星驰最好看的新电影"
                },
                "highlight": {
                    "title": [
                        "<strong>周</strong><strong>星</strong><strong>驰</strong><strong>最</strong>好看的<strong>新</strong><strong>电</strong><strong>影</strong>"
                    ],
                    "title.cn": [
                        "<strong>周星驰</strong><strong>最</strong>好看的<strong>新</strong><strong>电影</strong>"
                    ],
                    "title.en": [
                        "<strong>周</strong><strong>星</strong><strong>驰</strong><strong>最</strong>好看的<strong>新</strong><strong>电</strong><strong>影</strong>"
                    ]
                }
            },
            {
                "_index": "test",
                "_type": "resource",
                "_id": "4",
                "_score": 0.255055,
                "_source": {
                    "title": "最最最最好的新新新新电影"
                },
                "highlight": {
                    "title": [
                        "<strong>最</strong><strong>最</strong><strong>最</strong><strong>最</strong>好的<strong>新</strong><strong>新</strong><strong>新</strong><strong>新</strong><strong>电</strong><strong>影</strong>"
                    ],
                    "title.cn": [
                        "最最<strong>最</strong>最好的新新新新<strong>电影</strong>"
                    ],
                    "title.en": [
                        "<strong>最</strong><strong>最</strong><strong>最</strong><strong>最</strong>好的<strong>新</strong><strong>新</strong><strong>新</strong><strong>新</strong><strong>电</strong><strong>影</strong>"
                    ]
                }
            },
            {
                "_index": "test",
                "_type": "resource",
                "_id": "5",
                "_score": 0.012243208,
                "_source": {
                    "title": "I'm not happy about the foxes"
                },
                "highlight": {
                    "title.en": [
                        "I'm not happy about the <strong>foxes</strong>"
                    ]
                }
            }
        ]
    }
}

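Each sub-field applies its own analyzer, and the highlights show the consequences: the english analyzer splits Chinese text into individual characters, while the Chinese analyzer does not handle English (title.cn produced no match for the English document). Keep this in mind when choosing analyzers.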
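The paginated, highlighted request shown earlier can be assembled programmatically. The Python sketch below is one way to do it; build_search_body and its zero-based page parameter are assumptions for this example, and the tags and minimum_should_match default are copied from the request above.

```python
import json


def build_search_body(query, fields, page=0, page_size=10,
                      match_type="most_fields", minimum_should_match="20%"):
    """Build a multi_match search body with pagination and highlighting."""
    return {
        "query": {
            "multi_match": {
                "type": match_type,
                "query": query,
                "fields": list(fields),
                "minimum_should_match": minimum_should_match,
            }
        },
        "from": page * page_size,  # offset of the first hit to return
        "size": page_size,         # number of hits per page
        "highlight": {
            "pre_tags": ["<strong>"],
            "post_tags": ["</strong>"],
            "fields": {field: {} for field in fields},
        },
    }


body = build_search_body("周星驰最新电影fox", ["title", "title.cn", "title.en"], page=2)
print(json.dumps(body, ensure_ascii=False, indent=2))
```

Computing from as page * page_size keeps the paging arithmetic in one place; the first page is page=0.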

In the examples above, multi_match uses most_fields, which matches any field that satisfies the query. multi_match supports the following types:

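best_fields
The default. Matches any field, but the _score is taken from the single best-scoring field.
most_fields
Matches any field, and the _score combines the scores from all matching fields.
cross_fields
Treats all the fields as if they were one field when searching.
phrase or phrase_prefix
Both behave like best_fields, but the request is split into one match_phrase or match_phrase_prefix query per field. For example, the first request below is executed as the second: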

{
  "multi_match" : {
    "query":      "quick brown f",
    "type":       "phrase_prefix",
    "fields":     [ "subject", "message" ]
  }
}
// is executed as
{
  "dis_max": {
    "queries": [
      { "match_phrase_prefix": { "subject": "quick brown f" }},
      { "match_phrase_prefix": { "message": "quick brown f" }}
    ]
  }
}
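The rewrite above is mechanical, so it can be sketched in a few lines of Python. expand_phrase_multi_match is a hypothetical helper written for this post; it only reproduces the shape of the rewrite and is not how the server implements it internally.

```python
def expand_phrase_multi_match(multi_match):
    """Rewrite a phrase / phrase_prefix multi_match into one per-field query under dis_max."""
    per_field_query = {
        "phrase": "match_phrase",
        "phrase_prefix": "match_phrase_prefix",
    }[multi_match["type"]]
    return {
        "dis_max": {
            "queries": [
                {per_field_query: {field: multi_match["query"]}}
                for field in multi_match["fields"]
            ]
        }
    }


expanded = expand_phrase_multi_match({
    "query": "quick brown f",
    "type": "phrase_prefix",
    "fields": ["subject", "message"],
})
print(expanded)
```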

Deleting documents

DELETE removes documents.

A document is not removed immediately; it is only marked as deleted and is physically removed later, when needed.

DELETE /{index}  // deletes the entire index
DELETE /{index}/{type}  // deletes everything under the type
DELETE /{index}/{type}/{id}  // deletes one specific document
// 200
{
  "found" :    true,
  "_index" :   "x",
  "_type" :    "x",
  "_id" :      "x",
  "_version" : 3
}
// 404
{
  "found" :    false,
  "_index" :   "x",
  "_type" :    "x",
  "_id" :      "x",
  "_version" : 4
}

Chinese word segmentation

https://github.com/medcl/elasticsearch-analysis-ik

Configuration: use either option 1 or option 2.

elasticsearch.yml
// 1
index:
  analysis:
    analyzer:
      ik:
          alias: [ik_analyzer]
          type: org.elasticsearch.index.analysis.IkAnalyzerProvider
      ik_max_word:
          type: ik
          use_smart: false
      ik_smart:
          type: ik
          use_smart: true
// 2
index.analysis.analyzer.ik.type : "ik" // = ik_max_word

ik_max_word splits text at the finest granularity. For example,

『中华人民共和国国歌』 is split into
『中华人民共和国』
『中华人民』
...
『国歌』, exhausting every possible combination.

ik_smart splits text at the coarsest granularity. For example,

『中华人民共和国国歌』 is split into
『中华人民共和国』
『国歌』
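The difference between the two granularities can be illustrated with a toy dictionary-based segmenter in Python. This is not the IK plugin's actual algorithm: TOY_DICTIONARY, finest_grained, and coarsest_grained are all made up for this sketch, which only shows the idea of "emit every dictionary word" versus "take the longest word at each position".

```python
# Hypothetical mini-dictionary; the real plugin ships a much larger one.
TOY_DICTIONARY = {
    "中华人民共和国", "中华人民", "中华", "华人",
    "人民共和国", "人民", "共和国", "共和", "国歌",
}


def finest_grained(text, dictionary):
    """Emit every dictionary word found anywhere in the text (overlaps allowed)."""
    return [
        text[start:end]
        for start in range(len(text))
        for end in range(start + 1, len(text) + 1)
        if text[start:end] in dictionary
    ]


def coarsest_grained(text, dictionary):
    """Greedy longest match, left to right; unknown characters become single tokens."""
    tokens, start = [], 0
    while start < len(text):
        for end in range(len(text), start, -1):
            if text[start:end] in dictionary:
                break
        else:
            end = start + 1  # no dictionary word starts here
        tokens.append(text[start:end])
        start = end
    return tokens


text = "中华人民共和国国歌"
fine = finest_grained(text, TOY_DICTIONARY)
coarse = coarsest_grained(text, TOY_DICTIONARY)
print(fine)
print(coarse)
```

With this toy dictionary, the coarse result contains only the two longest words, while the fine result also includes the shorter words nested inside them.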

The earlier examples already use this segmentation plugin.

Getting started with Elasticsearch