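Preface
I used the elasticsearch-definitive-guide-cn project as my introductory tutorial, and this article corrects some problems found in that guide. For example, when a demo fails, it is usually because the guide targets an older version, and comparing it with the current official documentation reveals the cause. One such case: the filtered query was deprecated and then removed in ES 5.0; use a bool query with must / filter clauses instead.
Overall, Elasticsearch: The Definitive Guide is a decent introduction. If your English is good enough, though, I recommend reading the official documentation directly.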
Environment Setup
Elasticsearch version: 6.8.2
For installation, see my other article, "Installing Elasticsearch with Docker on Windows".
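Interacting with Elasticsearch
Java API
For more on the Java API, see the relevant chapter: Java API
RESTful API over HTTP, with JSON as the data exchange format
All other programming languages can communicate with Elasticsearch through the RESTful API on port 9200, using whichever web client you prefer. In fact, as you will see, you can even talk to Elasticsearch with the curl command.
A request to Elasticsearch has the same components as any ordinary HTTP request: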
curl -X<VERB> '<PROTOCOL>://<HOST>:<PORT>/<PATH>?<QUERY_STRING>' -d '<BODY>'
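- VERB The HTTP method: GET, POST, PUT, HEAD, or DELETE
- PROTOCOL http or https (https is available only when an HTTPS proxy sits in front of Elasticsearch)
- HOST The hostname of any node in the Elasticsearch cluster, or localhost for a node on your local machine
- PORT The port of the Elasticsearch HTTP service, 9200 by default
- PATH The API path (for example, _count returns the number of documents in the cluster). PATH may contain multiple components, such as _cluster/stats or _nodes/stats/jvm
- QUERY_STRING Optional query-string parameters; for example, ?pretty makes the request return more readable, pretty-printed JSON
- BODY A JSON-encoded request body (if the request needs one)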
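The components listed above can be assembled programmatically. The following is a minimal Python sketch using only the standard library; the helper name build_request and its defaults are assumptions for illustration, not part of any Elasticsearch client. It only builds the request object, so it runs without a cluster. Note that Elasticsearch 6.x expects a Content-Type header whenever a body is sent, which is why the sketch sets one.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request


def build_request(verb, path, host="localhost", port=9200,
                  protocol="http", query=None, body=None):
    """Assemble an HTTP request from VERB, PROTOCOL, HOST, PORT, PATH,
    QUERY_STRING and BODY (hypothetical helper, for illustration only)."""
    url = f"{protocol}://{host}:{port}/{path.lstrip('/')}"
    if query:
        url += "?" + urlencode(query)
    data = json.dumps(body).encode("utf-8") if body is not None else None
    # Elasticsearch 6.x rejects a request body sent without a Content-Type.
    headers = {"Content-Type": "application/json"} if data is not None else {}
    return Request(url, data=data, headers=headers, method=verb)


req = build_request("GET", "_count",
                    query={"pretty": "true"},
                    body={"query": {"match_all": {}}})
print(req.get_method(), req.full_url)

# Sending it requires a node listening on localhost:9200:
# from urllib.request import urlopen
# print(urlopen(req).read().decode("utf-8"))
```

The request is only sent once it is passed to urlopen, so the sketch can be inspected safely before pointing it at a real node.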
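Document Oriented
Elasticsearch is document oriented, meaning that it stores entire objects or documents. It does not only store them, though; it also indexes the contents of each document to make them searchable. In Elasticsearch, you index, search, sort, and filter documents rather than rows of columnar data. This is a fundamentally different way of thinking about data, and it is one of the reasons Elasticsearch can perform complex full-text search.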
JSON
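Elasticsearch uses JavaScript Object Notation, or JSON, as the serialization format for documents. JSON is supported by most programming languages and has become the standard format in the NoSQL world. It is concise, simple, and easy to read.
Let's start by adding a few documents. The following requests can be run with Postman: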
PUT /mycompany/customer/101
{
"first_name" : "Donald",
"last_name" : "Trump",
"gender" : "male",
"age":74,
"about" : "Businessman,the president of America,a crazy guy!He has lots of money!",
"interests": [ "golf", "music" ]
}
Response:
{
"_index": "mycompany",
"_type": "customer",
"_id": "101",
"_version": 1,
"result": "created",
"_shards": {
"total": 2,
"successful": 1,
"failed": 0
},
"_seq_no": 0,
"_primary_term": 1
}
So easy! The index is mycompany, the type is customer, and the ID is 101. Let's create some more data:
PUT /mycompany/customer/102
{
"first_name" : "Jack",
"last_name" : "Ma",
"gender" : "male",
"age":56,
"about" : "Chinese Rural Teacher,the president of Alibaba, he is not interested in money!",
"interests": [ "太极", "dance" ]
}
PUT /mycompany/customer/103
{
"first_name" : "Jackie",
"last_name" : "Chan",
"gender" : "male",
"age":66,
"about" : "Famous kung fu star,He is a Chinese!",
"interests": [ "kung fu", "music" ]
}
PUT /mycompany/customer/104
{
"first_name" : "Taylor",
"last_name" : "Swift",
"gender" : "female",
"age":31,
"about" : "American Country Singer!",
"interests": [ "music" ]
}
Retrieving Documents
Query-String Search
Exact lookup (by ID)
GET /{index}/{type}/{id}
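The _search keyword
Use the _search keyword in place of the document ID. The hits array in the response contains all four of our documents. By default, a search returns the first 10 results.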
GET /{index}/{type}/_search
Conditional search
GET /{index}/{type}/_search?q={field}:{val}
For example:
GET /mycompany/customer/_search?q=first_name:Jack
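Query DSL
Query-string search is handy for ad hoc searches from the command line, but it has its limitations. Elasticsearch provides a rich, flexible query language called the Query DSL, which lets you build more complex and powerful queries.
The DSL (domain-specific language) is expressed as a JSON request body. For example, a search for "singer" in the about field looks like this: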
GET /mycompany/customer/_search
{
"query" : {
"match" : {
"about" : "singer"
}
}
}
More Complicated Searches
- Note: the filtered query was deprecated and then removed in ES 5.0; use a bool query with must / filter clauses instead.
GET /mycompany/customer/_search
{
"query" : {
"bool" : {
"filter" : {
"range" : {
"age" : { "gt" : 70 } <1>
}
},
"must" : {
"match" : {
"gender" : "male" <2>
}
}
}
}
}
- <1> This part of the query is a range filter. It finds all customers older than 70; gt stands for "greater than".
- <2> This part of the query is the same kind of match query used earlier.
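To make the two clauses concrete, here is a small Python sketch that builds the same request body and then applies the equivalent conditions to the four sample customers in plain Python. The function names bool_query and select_locally are hypothetical, and the local check deliberately ignores analysis and scoring; it only shows which documents satisfy "age greater than 70" and "gender is male".

```python
customers = [
    {"first_name": "Donald", "gender": "male", "age": 74},
    {"first_name": "Jack", "gender": "male", "age": 56},
    {"first_name": "Jackie", "gender": "male", "age": 66},
    {"first_name": "Taylor", "gender": "female", "age": 31},
]


def bool_query(age_gt, gender):
    """Build the request body: a range filter plus a match clause in bool."""
    return {
        "query": {
            "bool": {
                "filter": {"range": {"age": {"gt": age_gt}}},
                "must": {"match": {"gender": gender}},
            }
        }
    }


def select_locally(docs, body):
    """Apply the same two conditions in plain Python (no scoring involved)."""
    clause = body["query"]["bool"]
    age_gt = clause["filter"]["range"]["age"]["gt"]
    gender = clause["must"]["match"]["gender"]
    return [d for d in docs if d["age"] > age_gt and d["gender"] == gender]


body = bool_query(70, "male")
matched = select_locally(customers, body)
print([d["first_name"] for d in matched])
```

Only the filter clause decides membership without affecting the score, which is why range conditions like this one usually go under filter rather than must.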
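Full-Text Search
So far the searches have been simple: looking up a specific name and filtering by age. Let's try something more advanced, full-text search, which traditional databases struggle with.
We will search for all customers who are "not interested in money":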
GET /mycompany/customer/_search
{
"query" : {
"match" : {
"about" : "not interested in money"
}
}
}
Two people are returned, but their _score values differ; a higher value means a better match:
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 1.1507283,
"hits": [
{
"_index": "mycompany",
"_type": "customer",
"_id": "102",
"_score": 1.1507283,
"_source": {
"first_name": "Jack",
"last_name": "Ma",
"gender": "male",
"age": 56,
"about": "Chinese Rural Teacher,the president of Alibaba, he is not interested in money!",
"interests": [
"太极",
"dance"
]
}
},
{
"_index": "mycompany",
"_type": "customer",
"_id": "101",
"_score": 0.78111285,
"_source": {
"first_name": "Donald",
"last_name": "Trump",
"gender": "male",
"age": 74,
"about": "Businessman,the president of America,a crazy guy!He has lots of money!",
"interests": [
"golf",
"music"
]
}
}
]
}
}
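By default, Elasticsearch sorts the result set by relevance score, that is, by how well each document matches the query. Clearly, the about field of the top-ranked Jack Ma explicitly says "not interested in money".
But why does Trump also appear in the results? Because "money" is mentioned in his about field. Since only "money" matches and "not interested in" does not, his _score is lower than Jack Ma's.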
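This example nicely shows how Elasticsearch performs full-text search across text fields and returns the most relevant results. The concept of relevance is important in Elasticsearch, and it is foreign to traditional relational databases, where a record either matches or it does not.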
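The reason both customers match can be illustrated with a crude term-overlap count. This Python sketch is not how Elasticsearch computes _score (6.x uses BM25 by default); the tokenizer and the matched_terms helper are simplifying assumptions that only show which query words each about field contains.

```python
import re

abouts = {
    "Jack Ma": "Chinese Rural Teacher,the president of Alibaba, "
               "he is not interested in money!",
    "Donald Trump": "Businessman,the president of America,a crazy guy!"
                    "He has lots of money!",
    "Jackie Chan": "Famous kung fu star,He is a Chinese!",
    "Taylor Swift": "American Country Singer!",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic words (a rough
    stand-in for the analyzer, assumed for illustration)."""
    return re.findall(r"[a-z]+", text.lower())


def matched_terms(query, text):
    """Return the query words that also occur in the text."""
    doc_terms = set(tokenize(text))
    return [t for t in tokenize(query) if t in doc_terms]


query = "not interested in money"
hits = {name: matched_terms(query, about) for name, about in abouts.items()}
hits = {name: terms for name, terms in hits.items() if terms}

# More matched words first, mirroring the ordering seen in the response.
for name, terms in sorted(hits.items(), key=lambda kv: -len(kv[1])):
    print(name, terms)
```

A match query needs only one of the words to be present, so a single shared word such as "money" is enough to make a document a hit.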
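Phrase Search
So far we can search for individual words in a field, which is fine, but sometimes you want to match an exact sequence of words, that is, a phrase. For example, we want only the people who are "not interested in money", not those who have "lots of money".
To do this, we only need to change the match query to a match_phrase query: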
GET /mycompany/customer/_search
{
"query" : {
"match_phrase" : {
"about" : "not interested in money"
}
}
}
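The difference from a plain match query is that the words must appear next to each other and in order. A minimal Python sketch of that idea follows; contains_phrase and the simple tokenizer are assumptions for illustration and do not reproduce Elasticsearch's position-based implementation.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def contains_phrase(phrase, text):
    """True if the words of the phrase occur consecutively, in order."""
    words, target = tokenize(text), tokenize(phrase)
    n = len(target)
    return any(words[i:i + n] == target for i in range(len(words) - n + 1))


jack = ("Chinese Rural Teacher,the president of Alibaba, "
        "he is not interested in money!")
donald = ("Businessman,the president of America,a crazy guy!"
          "He has lots of money!")

print(contains_phrase("not interested in money", jack))
print(contains_phrase("not interested in money", donald))
```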
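Highlighted Search
Many applications like to highlight the matched keywords in each search result so that users can see why a document matched the query. Highlighting snippets is easy in Elasticsearch.
Let's add a highlight parameter to the previous request: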
GET /mycompany/customer/_search
{
"query" : {
"match_phrase" : {
"about" : "not interested in money"
}
},
"highlight": {
"fields" : {
"about" : {}
}
}
}
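When we run this request, it returns the same hit as before, but the response now has a new section called highlight. It contains the text from the about field, with the matched words wrapped in <em></em> tags.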
{
"took": 14,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 1.1507283,
"hits": [
{
"_index": "mycompany",
"_type": "customer",
"_id": "102",
"_score": 1.1507283,
"_source": {
"first_name": "Jack",
"last_name": "Ma",
"gender": "male",
"age": 56,
"about": "Chinese Rural Teacher,the president of Alibaba, he is not interested in money!",
"interests": [
"太极",
"dance"
]
},
"highlight": {
"about": [
"Chinese Rural Teacher,the president of Alibaba, he is <em>not</em> <em>interested</em> <em>in</em> <em>money</em>!"
]
}
}
]
}
}
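The shape of the highlight fragment above can be imitated with a regular expression. This is only a rough Python sketch: Elasticsearch highlights based on the analyzed terms of the field, whereas the hypothetical highlight function below simply wraps whole-word, case-insensitive matches of the query words.

```python
import re


def highlight(query, text, pre="<em>", post="</em>"):
    """Wrap every whole-word occurrence of a query word in pre/post tags."""
    terms = [re.escape(t) for t in query.lower().split()]
    pattern = re.compile(r"\b(" + "|".join(terms) + r")\b", re.IGNORECASE)
    return pattern.sub(lambda m: pre + m.group(0) + post, text)


about = ("Chinese Rural Teacher,the president of Alibaba, "
         "he is not interested in money!")
fragment = highlight("not interested in money", about)
print(fragment)
```

The word-boundary anchors keep the "in" inside "Chinese" from being tagged, which matches what the real response shows.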
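Analytics
Elasticsearch has a feature called aggregations, which lets you generate sophisticated analytics over your data. It is similar to GROUP BY in SQL, but more powerful.
For example, let's find the most popular interests among all customers: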
GET /mycompany/customer/_search
{
"aggs": {
"all_interests": {
"terms": { "field": "interests" }
}
}
}
Running this request as-is returns an error:
{
"error": {
"root_cause": [
{
"type": "illegal_argument_exception",
"reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [interests] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory."
}
],
"type": "search_phase_execution_exception",
"reason": "all shards failed",
"phase": "query",
"grouped": true,
"failed_shards": [
{
"shard": 0,
"index": "megacorp",
"node": "-Md3f007Q3G6HtdnkXoRiA",
"reason": {
"type": "illegal_argument_exception",
"reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [interests] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory."
}
}
],
"caused_by": {
"type": "illegal_argument_exception",
"reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [interests] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory."
}
},
"status": 400
}
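Since 5.x, fielddata is disabled on text fields by default. Sorting and aggregating on a text field relies on fielddata, an in-memory data structure built by uninverting the inverted index, so it has to be enabled explicitly. See the official fielddata documentation for details.
In short, run the following before the aggregation: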
PUT /mycompany/_mapping/customer
{
"properties": {
"interests": {
"type": "text",
"fielddata": true
}
}
}
Response:
{
"acknowledged": true
}
The aggregation request now runs normally.
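Enabling fielddata is one route. A possible alternative, sketched here under an assumption: with Elasticsearch's default dynamic mapping, a string field such as interests is normally also indexed as a keyword sub-field named interests.keyword, which can be aggregated without fielddata. Whether that sub-field exists in your index is not confirmed by this article, so check GET /mycompany/_mapping first. The helper name terms_agg is hypothetical, and the code only builds and prints the request body.

```python
import json


def terms_agg(field, name="all_interests"):
    """Build a terms-aggregation request body for the given field.
    "size": 0 suppresses the hits so only the aggregation is returned."""
    return {"size": 0, "aggs": {name: {"terms": {"field": field}}}}


# Assumption: the interests.keyword sub-field exists in the mapping.
body = terms_agg("interests.keyword")
print("GET /mycompany/customer/_search")
print(json.dumps(body, indent=2))
```

Because a keyword field is not analyzed, buckets built this way would hold whole values such as "kung fu" rather than individual words.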
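Aggregations can also be nested to produce hierarchical rollups. For example, let's compute the average age of the customers for each interest: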
GET /mycompany/customer/_search
{
"aggs" : {
"all_interests" : {
"terms" : { "field" : "interests" },
"aggs" : {
"avg_age" : {
"avg" : { "field" : "age" }
}
}
}
}
}
The result:
{
"took": 15,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 4,
"max_score": 1.0,
"hits": [......]
},
"aggregations": {
"all_interests": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "music",
"doc_count": 3,
"avg_age": {
"value": 57.0
}
},
{
"key": "dance",
"doc_count": 1,
"avg_age": {
"value": 56.0
}
},
{
"key": "fu",
"doc_count": 1,
"avg_age": {
"value": 66.0
}
},
{
"key": "golf",
"doc_count": 1,
"avg_age": {
"value": 74.0
}
},
{
"key": "kung",
"doc_count": 1,
"avg_age": {
"value": 66.0
}
},
{
"key": "太",
"doc_count": 1,
"avg_age": {
"value": 56.0
}
},
{
"key": "极",
"doc_count": 1,
"avg_age": {
"value": 56.0
}
}
]
}
}
}
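This result is richer than the previous aggregation. We still get the list of interests and their counts (the number of customers with each interest), but each interest now also has an avg_age field showing the average age of the customers who share it.
Even if you do not understand the syntax yet, you can see that this feature makes fairly complex aggregations possible, over any kind of data.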
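The nested aggregation amounts to "group by interest, then average the ages in each group". The Python sketch below reproduces that logic over the four sample customers; avg_age_by_interest is a hypothetical helper. One deliberate difference: it groups on whole values, whereas the response above contains buckets such as "kung", "fu", "太", and "极" because fielddata on a text field aggregates the analyzed tokens.

```python
from collections import defaultdict

customers = [
    {"age": 74, "interests": ["golf", "music"]},
    {"age": 56, "interests": ["太极", "dance"]},
    {"age": 66, "interests": ["kung fu", "music"]},
    {"age": 31, "interests": ["music"]},
]


def avg_age_by_interest(docs):
    """Bucket documents by interest, then average the ages per bucket."""
    ages = defaultdict(list)
    for doc in docs:
        for interest in doc["interests"]:
            ages[interest].append(doc["age"])
    return {
        key: {"doc_count": len(values), "avg_age": sum(values) / len(values)}
        for key, values in ages.items()
    }


buckets = avg_age_by_interest(customers)
# Largest buckets first, as in the terms aggregation's default ordering.
for key, bucket in sorted(buckets.items(), key=lambda kv: -kv[1]["doc_count"]):
    print(key, bucket)
```

The music bucket works out to (74 + 66 + 31) / 3 = 57.0, which agrees with the avg_age value in the response.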