Basics

Elasticsearch (ES) query URLs are organized as /index/type/document. An example query URL:

GET /product/book/1?name=xxx

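An index is like a database in SQL, a type is like a table, and a document is like a row.

ykuang: Personally, I think it is more apt to compare an index to a dictionary table in a database, create table dict(id,type,item,name,...,other fields). The type is the column in that table that tells the dictionaries apart, a document is one row (one dictionary entry), and the JSON properties in the document are the table's other columns.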

Index

An index is the structure that stores document data; its role is similar to that of a database in a relational database system.

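Type

A type is also a logical structure for storing documents. It sits one level below the index, so when modeling real data the index usually serves as the broad category and the type as the subcategory. For example, if book is the broad category used to create the index, the kinds of books (fiction, literature, IT and so on) can be the types.

A type is only a logical grouping; it does not physically partition the data.
In SQL terms, think of an employee table in which one column of each row says which department the employee belongs to.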

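Document

Documents are in JSON format.
A document has several main identifying fields:

_index (which index the document is stored in),
_type (which type the document is stored under),
_id (the document's ID),
_version: the version number, incremented each time the document with this ID is modified

The first three identify a document when it is created. If no _type is given, it defaults to _doc, and if no _id is given, ES generates one automatically.

Elasticsearch is not completely schemaless, so do not lump it together with certain NoSQL databases, even though its structure is very flexible (it is JSON-oriented and fields can be added freely). Each index also has a mapping, which manages the properties of every field in the index; in other words, it defines the structure of the documents in the index.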

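Querying through URL parameters is obviously not very flexible, so ES also provides a query DSL. The DSL syntax is described below.

Exact queries

Equality queries

These can be thought of as the = operator in SQL.

term is mainly used for exact matching, for example on numbers, dates, booleans, or non-analyzed strings.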

GET  test1/_doc/_search
{
  "query": {
    "term": {
      "phone": "12345678909"
    }
  }
}


To match several values on one field, use terms, which is equivalent to IN in SQL.

GET  test1/_doc/_search
{
  "query": {
    "terms": {
       "uid": [ 1234, 12345, 123456 ] 
    }
  }
}

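term usage (compared with match)
term is generally used on non-analyzed fields because it is an exact-match query. If the field being queried is analyzed, its content is split into individual tokens at index time, and those tokens no longer line up with the whole value being searched for.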
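
The difference between term and match can be sketched with a toy model in Python. This is not Elasticsearch code: the analyze function is a hypothetical stand-in for an analyzer (it only lowercases and splits on whitespace), and a field is modeled as a plain list of indexed terms.

```python
def analyze(text):
    """Rough stand-in for an analyzer: lowercase, then split on whitespace."""
    return text.lower().split()

def term_query(indexed_terms, value):
    """term: the query value is NOT analyzed; it must equal an indexed term exactly."""
    return value in indexed_terms

def match_query(indexed_terms, text):
    """match: the query text is analyzed first; any resulting token may match."""
    return any(token in indexed_terms for token in analyze(text))

source = "PHILIPS toothbrush"
keyword_field = [source]       # non-analyzed field: a single term, stored verbatim
text_field = analyze(source)   # analyzed field: ['philips', 'toothbrush']

print(term_query(keyword_field, "PHILIPS toothbrush"))  # True
print(term_query(text_field, "PHILIPS toothbrush"))     # False
print(match_query(text_field, "PHILIPS toothbrush"))    # True
```

The second call fails because the analyzed field no longer contains the whole string as one term, which is the mismatch described above.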

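Range queries

range can be thought of as the > and < operators in SQL: gt is greater than, lt is less than, gte is greater than or equal to, and lte is less than or equal to.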

GET  test1/_doc/_search
{
  "query": {
   "range": { 
      "uid": { 
        "gt": 1234,
        "lte": 12345
      } 
    } 
  }
}

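Exists queries

exists is closer to IS NOT NULL in SQL than to the EXISTS function: it checks whether a document has a value for the field at all, not what that value is.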

GET  test1/_doc/_search
{
  "query": {
   "exists": { 
       "field":"msgcode" 
    } 
  }
}
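
The exact queries in this section can be read as simple predicates over documents. The Python sketch below illustrates that reading only; the sample documents and the helper names (term, terms, range_, exists, search) are made up for illustration and are not part of any Elasticsearch client.

```python
docs = [
    {"uid": 1234, "phone": "12345678909", "msgcode": "1"},
    {"uid": 12345, "phone": "12345678909"},
    {"uid": 123456, "phone": "10000000000", "msgcode": None},
]

def term(field, value):
    # Like SQL "=": the field value equals the given value.
    return lambda d: d.get(field) == value

def terms(field, values):
    # Like SQL "IN": the field value is one of the given values.
    return lambda d: d.get(field) in values

def range_(field, gt=None, gte=None, lt=None, lte=None):
    # Like SQL ">", ">=", "<", "<=": every bound that is set must hold.
    def check(d):
        v = d.get(field)
        if v is None:
            return False
        return ((gt is None or v > gt) and (gte is None or v >= gte)
                and (lt is None or v < lt) and (lte is None or v <= lte))
    return check

def exists(field):
    # Like SQL "IS NOT NULL": a missing field and a null value both fail.
    return lambda d: d.get(field) is not None

def search(predicate):
    return [d["uid"] for d in docs if predicate(d)]

print(search(term("phone", "12345678909")))        # [1234, 12345]
print(search(terms("uid", [1234, 12345, 123456]))) # [1234, 12345, 123456]
print(search(range_("uid", gt=1234, lte=12345)))   # [12345]
print(search(exists("msgcode")))                   # [1234]
```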

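Approximate queries

Prefix queries

prefix performs a prefix search (performance is relatively poor because it has to scan the terms in the inverted index).
For example, take a non-analyzed field product_name and two docs whose values are iphone-6 and iphone-7. Searching for the prefix iphone finds both.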

GET /product_index/product/_search
{
  "query": {
    "prefix": {
      "product_name": {
        "value": "iphone"
      }
    }
  }
}

Wildcard queries

A wildcard query is similar to LIKE in SQL, except that the search pattern uses the * symbol as the wildcard.

GET /test1/_search
{
  "query": {
   "wildcard": { 
       "message":"*wu*" 
    } 
  }
}

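Regexp queries

regexp supports regular-expression matching, for example to find verification codes in SMS message content.

The example below matches terms that consist of xu followed by a single digit from 0 to 9. The pattern is matched against the whole term, so append .* to it to allow arbitrary trailing text.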

GET /test1/_search
{
  "query": {
   "regexp": { 
       "message":"xu[0-9]" 
    } 
  }
}
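
The way wildcard and regexp patterns are applied to whole indexed terms can be illustrated in Python. This is only an analogy: fnmatchcase and re.fullmatch stand in for the two queries, the term list is made up, and Python's pattern syntax is not identical to the Lucene syntax that ES uses.

```python
import re
from fnmatch import fnmatchcase

# Hypothetical terms in the inverted index of the message field.
indexed_terms = ["xu1", "xu12", "axu1", "wuhan", "xiaowu", "hello"]

def wildcard(pattern):
    # "*" matches any run of characters; the pattern must cover the whole term.
    return [t for t in indexed_terms if fnmatchcase(t, pattern)]

def regexp(pattern):
    # fullmatch mirrors the fact that the pattern is anchored at both ends.
    return [t for t in indexed_terms if re.fullmatch(pattern, t)]

print(wildcard("*wu*"))     # ['wuhan', 'xiaowu']
print(regexp("xu[0-9]"))    # ['xu1']
print(regexp("xu[0-9]+"))   # ['xu1', 'xu12']
```

Because the pattern has to cover the whole term, xu[0-9] does not match xu12 or axu1.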

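fuzzy (typo-tolerant) queries

The fuzziness parameter sets the maximum edit distance, that is, how many single-character corrections are allowed. It cannot be large: the maximum is 2. AUTO picks the distance from the length of the term and is the usual choice.
In the keyword below, one letter o is missing.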

GET /product_index/product/_search
{
  "query": {
    "match": {
      "product_name": {
        "query": "PHILIPS tothbrush",
        "fuzziness": "AUTO",
        "operator": "and"
      }
    }
  }
}
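
A minimal Python sketch of how fuzziness can be understood. It assumes the documented AUTO thresholds (terms of 0 to 2 characters must match exactly, 3 to 5 characters allow one edit, longer terms allow two) and uses plain Levenshtein distance; ES can additionally count a swap of two adjacent characters as a single edit, which this sketch leaves out. The function names are illustrative only.

```python
def edit_distance(a, b):
    """Classic Levenshtein distance: insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute (free if equal)
        prev = cur
    return prev[-1]

def auto_fuzziness(term):
    """Maximum number of edits that AUTO allows for a term of this length."""
    if len(term) <= 2:
        return 0
    if len(term) <= 5:
        return 1
    return 2

def fuzzy_match(query_term, indexed_term):
    return edit_distance(query_term, indexed_term) <= auto_fuzziness(query_term)

print(edit_distance("tothbrush", "toothbrush"))  # 1
print(auto_fuzziness("tothbrush"))               # 2
print(fuzzy_match("tothbrush", "toothbrush"))    # True
```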

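Full-text search

match_all matches every document. Because the request below names no index, it searches all indices in the cluster, which can include hidden or system indices depending on the version and settings.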

GET _search
{   
  "query": {
    "match_all": {}
  }
}

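full-text search (inverted index)

A document appears in the results if it matches any one of the tokens produced from the query text; the better the match, the higher it ranks.
For example, the query PHILIPS toothbrush is split into two tokens, PHILIPS and toothbrush. Any document whose product_name contains either token is returned, and documents containing both rank first.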

GET /product_index/product/_search
{
  "query": {
    "match": {
      "product_name": "PHILIPS toothbrush"
    }
  }
}

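phrase search

A document appears in the results only if it matches all of the tokens, in the same order and next to each other.
For example, the query PHILIPS toothbrush is split into two tokens, PHILIPS and toothbrush. Only documents in which both tokens occur, as that phrase, are returned.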

GET /product_index/product/_search
{
  "query": {
    "match_phrase": {
      "product_name": "PHILIPS toothbrush"
    }
  }
}

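match usage (compared with term):
The query text is analyzed, and by default a document is returned if it contains any one of the resulting tokens.

match can also require every token from the query text to be present, rather than any one of them, by setting operator to and. (This is a common case, so it matters.)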

GET /product_index/product/_search
{
  "query": {
    "match": {
      "product_name": {
        "query": "PHILIPS toothbrush",
        "operator": "and"
      }
     }
   }
}

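match can also require a given percentage of the tokens to match. Suppose the search text is tokenized as: java 程序员 书 推荐, which is 4 tokens. To return documents that hit at least 50% of them, that is, 2 tokens, set minimum_should_match.
The same requirement can also be expressed by combining must, must_not and should clauses on the same field, but the match form is: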

GET /product_index/product/_search
{
  "query": {
    "match": {
      "product_name": {
        "query": "java 程序员 书 推荐",
        "minimum_should_match": "50%"
      }
    }
  }
}
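
The arithmetic behind a percentage value of minimum_should_match can be sketched in Python. The sketch assumes the documented behavior that the number computed from a percentage is rounded down; required_matches and matches are hypothetical helper names, and documents are modeled as sets of tokens.

```python
def required_matches(num_terms, percent):
    """Number of query tokens that must match; the result is rounded down."""
    return (num_terms * percent) // 100

def matches(doc_tokens, query_tokens, percent):
    hits = sum(1 for t in query_tokens if t in doc_tokens)
    return hits >= required_matches(len(query_tokens), percent)

query = ["java", "程序员", "书", "推荐"]

print(required_matches(len(query), 50))             # 2
print(matches({"java", "书", "入门"}, query, 50))    # True  (2 of 4 tokens hit)
print(matches({"java", "入门"}, query, 50))          # False (1 of 4 tokens hit)
```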

multi_match usage:
Return documents that contain the keyword toothbrush in either the product_name or the product_desc field.

GET /product_index/product/_search
{
  "query": {
    "multi_match": {
      "query": "toothbrush",
      "fields": [
        "product_name",
        "product_desc"
      ]
    }
  }
}

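multi_match with type cross_fields searches across several fields as if they were one combined field. With operator and, every query token must appear in at least one of the listed fields, but the tokens do not all have to be in the same field.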

GET /product_index/product/_search
{
  "query": {
    "multi_match": {
      "query": "PHILIPS toothbrush",
      "type": "cross_fields",
      "operator": "and",
      "fields": [
        "product_name",
        "product_desc"
      ]
    }
  }
}

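match_phrase usage (phrase search) (compared with match):
The query text is still analyzed, but the resulting tokens must all appear in the field in the same order and adjacent to one another for a document to be returned.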

GET /product_index/product/_search
{
  "query": {
    "match_phrase": {
      "product_name": "PHILIPS toothbrush"
    }
  }
}

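match_phrase + slop (compared with match_phrase):
Before describing slop, note that the source data is PHILIPS toothbrush HX6730/02, which after analysis yields at least three terms: PHILIPS, toothbrush and HX6730.
As described above, match_phrase requires the query tokens to match as an exact phrase, and PHILIPS HX6730 clearly does not.
Sometimes, though, this kind of inexact match is exactly what we want: the tokens only need to be close to each other, and a few words in between are acceptable. That is what slop is for. slop = 2 means that tokens up to 2 positions apart still count as a match.
Strictly speaking, slop is not a gap but a number of moves: how many positions the query tokens must be shifted to line up with the doc content. So HX6730 PHILIPS can also match the doc, but only with a slop of at least 3.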

GET /product_index/product/_search
{
  "query": {
    "match_phrase": {
      "product_name" : {
          "query" : "PHILIPS HX6730",
          "slop" : 1
      }
    }
  }
}
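
The "number of moves" reading of slop can be made concrete with a small Python sketch. It assumes that each query token occurs exactly once in the document and that the sample text is tokenized into the four lowercase tokens shown (both are simplifying assumptions); min_slop is a hypothetical helper, not an ES API.

```python
def min_slop(doc_tokens, query_tokens):
    """Smallest slop at which the phrase matches.

    For each query token, take its position in the document minus its
    position in the query. If all the differences are equal, the phrase
    matches exactly; otherwise the spread between the largest and the
    smallest difference is the number of moves needed.
    """
    shifted = [doc_tokens.index(tok) - offset
               for offset, tok in enumerate(query_tokens)]
    return max(shifted) - min(shifted)

doc = ["philips", "toothbrush", "hx6730", "02"]

print(min_slop(doc, ["philips", "toothbrush"]))  # 0, an exact phrase
print(min_slop(doc, ["philips", "hx6730"]))      # 1, one token in between
print(min_slop(doc, ["hx6730", "philips"]))      # 3, reversed order costs more
```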

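match + match_phrase + slop combined, to get results that are both more precise and more numerous
match_phrase is slower than match, so the usual approach is to filter with match first and then apply match_phrase as a second step that re-scores the results. This uses rescore; window_size: 10 means that the top 10 results (per shard) are re-scored.
Of the two queries below, the first does not re-score and the second does.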

GET /product_index/product/_search
{
  "query": {
    "bool": {
      "must": {
        "match": {
          "product_name": {
            "query": "PHILIPS HX6730"
          }
        }
      },
      "should": {
        "match_phrase": {
          "product_name": {
            "query": "PHILIPS HX6730",
            "slop": 10
          }
        }
      }
    }
  }
}

GET /product_index/product/_search
{
  "query": {
    "match": {
      "product_name": "PHILIPS HX6730"
    }
  },
  "rescore": {
    "window_size": 10,
    "query": {
      "rescore_query": {
        "match_phrase": {
          "product_name": {
            "query": "PHILIPS HX6730",
            "slop": 10
          }
        }
      }
    }
  }
}
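
The effect of window_size can be sketched as a two-phase ranking in Python. The sketch assumes the default combination, where the rescore score is simply added to the original score, and it ignores sharding; the hit list and phrase scores are invented for illustration.

```python
def rescore(hits, rescore_score, window_size):
    """hits: list of (doc_id, score) pairs, already sorted by score descending.

    Only the first window_size hits are re-scored and re-sorted; the rest
    keep their original order and score.
    """
    window, rest = hits[:window_size], hits[window_size:]
    rescored = [(doc, score + rescore_score(doc)) for doc, score in window]
    rescored.sort(key=lambda hit: hit[1], reverse=True)
    return rescored + rest

# Phase 1: scores from the cheap match query (made-up numbers).
hits = [("a", 3.0), ("b", 2.5), ("c", 2.0), ("d", 1.0)]
# Phase 2: extra score from the more expensive match_phrase query.
phrase_scores = {"b": 1.0, "d": 5.0}

result = rescore(hits, lambda doc: phrase_scores.get(doc, 0.0), window_size=3)
print(result)  # [('b', 3.5), ('a', 3.0), ('c', 2.0), ('d', 1.0)]
```

Document d would score highest on the phrase query, but it falls outside the window and is never re-scored, which is the trade-off that window_size controls.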

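match_phrase_prefix usage (uncommon); typically used for keyword suggestions as the user types, as in the Google search box
max_expansions limits how many terms the trailing prefix can expand to, which helps performance
Overall performance is still poor, though, because the inverted index still has to be scanned. Using edge_ngram is recommended for this feature instead.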

GET /product_index/product/_search
{
  "query": {
    "match_phrase_prefix": {
      "product_name": {
        "query": "PHILIPS HX",
        "slop": 5,
        "max_expansions": 20
      }
    }
  }
}

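Compound queries

bool combines the results of several query conditions with boolean logic. It includes the following operators:

must: all of the conditions must match; equivalent to and.
must_not: none of the conditions may match; equivalent to not.
should: at least one of the conditions should match; equivalent to or. When the bool query also has a must clause, as in the example below, should clauses become optional and only raise the score of documents that match them.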

GET /test1/_search
{
  "query": {
    "bool": {
      "must": {
        "term": {
          "phone": "12345678909"
        }
      },
      "must_not": {
        "term": {
          "uid": 12345
        }
      },
      "should": [
        {
          "term": {
            "uid": 1234
          }
        },
        {
          "term": {
            "uid": 123456
          }
        }
      ],
      "adjust_pure_negative": true,
      "boost": 1
    }
  }
}
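
The boolean logic of the query above can be modeled with a short Python evaluator. It is a simplified sketch: the "score" is just the number of matching must and should clauses, not a real relevance score, and the rule that should clauses are only required when there is no must clause reflects the default behavior. The helper names and sample documents are illustrative.

```python
def term(field, value):
    return lambda doc: doc.get(field) == value

def bool_query(must=(), must_not=(), should=()):
    def evaluate(doc):
        """Return a toy score, or None if the document does not match."""
        if not all(clause(doc) for clause in must):
            return None                      # every must clause is required
        if any(clause(doc) for clause in must_not):
            return None                      # any must_not hit excludes the doc
        should_hits = sum(1 for clause in should if clause(doc))
        if not must and should and should_hits == 0:
            return None                      # with no must, one should is needed
        return len(must) + should_hits
    return evaluate

query = bool_query(
    must=[term("phone", "12345678909")],
    must_not=[term("uid", 12345)],
    should=[term("uid", 1234), term("uid", 123456)],
)

print(query({"uid": 1234, "phone": "12345678909"}))    # 2
print(query({"uid": 12345, "phone": "12345678909"}))   # None (excluded by must_not)
print(query({"uid": 999, "phone": "12345678909"}))     # 1 (no should hit, still matches)
print(query({"uid": 123456, "phone": "10000000000"}))  # None (must clause fails)
```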

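Filtering

When query and filter are used together, the filter is applied first.

In terms of search results:
filter: returns only the data that satisfies the condition, without computing a relevance score
query: returns the data that satisfies the condition, computes a relevance score, and sorts by score in descending order

In terms of performance:
filter (faster, no sorting): no relevance score is computed, so no sorting is needed, and the results of the most frequently used filters are cached automatically
What is cached is not the document content but a bitset; for details see: https://www.elastic.co/guide/en/elasticsearch/guide/2.x/_finding_exact_values.html#_internal_filter_operation
query (slower, sorted): a relevance score is computed and results are sorted by it in descending order; results are not cached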

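filter and query can be used together to get the benefits of both, so choose according to your business needs. The filtered query that ES 1.x used for this was deprecated in 2.0 and removed in 5.0; the equivalent is a bool query with a filter clause.

GET /store/products/_search
{
  "query": {
    "bool": {
      "must": {
        "match_all": {}
      },
      "filter": {
        "term": {
          "price": 200
        }
      }
    }
  }
}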

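Sorting

This is probably rarely needed, because ES is mostly used for full-text search, where results are normally ordered by relevance in descending order.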

GET /product_index/product/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "product_name": "PHILIPS toothbrush"
          }
        }
      ]
    }
  },
  "sort": [
    {
      "price": {
        "order": "desc"
      }
    }
  ]
}

Boosting

boost usage (the default is 1). Here is another way to control search precision: suppose results for PHILIPS toothbrush should rank ahead of those for Braun toothbrush. We can write:

GET /product_index/product/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "product_name": "toothbrush"
          }
        }
      ],
      "should": [
        {
          "match": {
            "product_name": {
              "query": "PHILIPS",
              "boost": 4
            }
          }
        },
        {
          "match": {
            "product_name": {
              "query": "Braun",
              "boost": 3
            }
          }
        }
      ]
    }
  }
}

Highlight

Wraps the matched query tokens in highlighting HTML tags, giving results such as: <em>PHILIPS</em> <em>toothbrush</em> HX6730/02

GET /product_index/product/_search
{
  "query": {
    "match": {
      "product_name": "PHILIPS toothbrush"
    }
  },
  "highlight": {
    "fields": {
      "product_name": {}
    }
  }
}

References

Official documentation:

https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html
https://www.elastic.co/guide/en/elasticsearch/guide/current/_queries_and_filters.html

Blog posts:

https://www.cnblogs.com/xuwujing/p/11567053.html
https://www.cnblogs.com/sddai/p/11061412.html

Articles from the 随风行云 blog:

ElasticSearch查询基础知识 (ElasticSearch query basics)