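0. Background:
One day, after I had solved pagination for Elasticsearch aggregation results, the product manager came looking for me.
Product manager: We need a search feature here! It goes live tomorrow along with the other features!
Me: Sure. (Are you kidding me? Did inspiration strike while you were on the toilet? Why wasn't this in the prototype you gave me earlier?)
After cursing out the product manager in my head, I started thinking about how to implement it.
1. Terms aggregation with include
ES version: 7.9.2
For the demo data, see the linked article.
First, a look at the earlier pagination solution: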
GET employees/_search
{
"size": 0,
"query": {
"match_all": {}
},
"aggs": {
"myTerms": {
"terms": {
"field": "job.keyword",
"size": 10
},
"aggs": {
"myBucketSort": {
"bucket_sort": {
"from": 0,
"size": 5,
"gap_policy": "SKIP"
}
}
}
},
"termsCount": {
"cardinality": {
"field": "job.keyword",
"precision_threshold": 30000
}
}
}
}
Here bucket_sort does the paging and cardinality provides the total.
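The arithmetic behind those two pieces can be sketched in a few lines of Python. This is only an illustration: the helper names paging_params and total_pages are made up here, and pages are assumed to be 1-based. The point to notice is that the terms size has to cover from + size, otherwise bucket_sort has no buckets left to return for later pages.

```python
import math


def paging_params(page, page_size):
    """Translate a 1-based page number into terms/bucket_sort settings."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be >= 1")
    offset = (page - 1) * page_size
    return {
        "from": offset,        # goes into bucket_sort.from
        "size": page_size,     # goes into bucket_sort.size
        # terms.size must be large enough to reach the requested page
        "terms_size": offset + page_size,
    }


def total_pages(distinct_terms, page_size):
    """Page count derived from the cardinality value (may be approximate for large counts)."""
    return math.ceil(distinct_terms / page_size)


print(paging_params(2, 5))   # {'from': 5, 'size': 5, 'terms_size': 10}
print(total_pages(11, 5))    # 3
```

With the request above (terms size 10, bucket_sort size 5), only pages 1 and 2 are reachable unless the terms size grows with the page number.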
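After a tour of the official documentation, I found that the terms aggregation has an include option that looked like it could solve this, so I dove straight in: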
GET employees/_search
{
"size": 0,
"query": {
"match_all": {}
},
"aggs": {
"myTerms": {
"terms": {
"include": ".*Programmer.*",
"field": "job.keyword",
"size": 10
},
"aggs": {
"myBucketSort": {
"bucket_sort": {
"from": 0,
"size": 5,
"gap_policy": "SKIP"
}
}
}
},
"termsCount": {
"cardinality": {
"field": "job.keyword",
"precision_threshold": 30000
}
}
}
}
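include: when it is a string, it is treated as a regular expression; when it is an array, it filters by exact match against multiple values.
For example:
...
"aggs": {
"myTerms": {
"terms": {
"include": ".*Programmer.*", #supports regular expressions, but some characters are reserved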
"field": "job.keyword",
"size": 10
}
}
...
...
"aggs": {
"myTerms": {
"terms": {
"include": ["Programmer","DBA"], #supports multiple values, but not regular expressions
"field": "job.keyword",
"size": 10
}
}
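Because the string form of include is a regular expression, a keyword typed by a user has to be escaped before it is wrapped in ".*". The sketch below assumes the reserved set listed for Lucene regular expressions in the Elasticsearch documentation (the standard operators plus the optional ones); the function names are hypothetical.

```python
# Characters that Lucene regular expressions treat as operators, including the
# optional ones (# @ & < > ~). A backslash makes each of them match literally.
LUCENE_REGEX_RESERVED = set('.?+*|{}[]()"\\#@&<>~')


def escape_lucene_regex(text):
    """Backslash-escape every reserved character in text."""
    return "".join("\\" + ch if ch in LUCENE_REGEX_RESERVED else ch for ch in text)


def contains_pattern(keyword):
    """Build a case-sensitive 'contains keyword' pattern for terms.include."""
    return ".*" + escape_lucene_regex(keyword) + ".*"


print(contains_pattern("Programmer"))  # .*Programmer.*
print(contains_pattern("C++"))         # .*C\+\+.*
```

The array form needs no escaping, since its values are matched exactly.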
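Just as I was quietly congratulating myself, I noticed that the total returned above is the total without the include filter, which is not what we need.
2. cardinality with script
Use the script option of cardinality to apply the same filter: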
GET employees/_search
{
"size": 0,
"query": {
"match_all": {}
},
"aggs": {
"myTerms": {
"terms": {
"include": ".*Programmer.*",
"field": "job.keyword",
"size": 10
},
"aggs": {
"myBucketSort": {
"bucket_sort": {
"from": 0,
"size": 5,
"gap_policy": "SKIP"
}
}
}
},
"termsCount": {
"cardinality": {
"script": {
"source": """if(doc['job.keyword'].value.contains('Programmer')) {doc['job.keyword'].value }"""
},
"precision_threshold": 30000
}
}
}
}
The response:
{
"took" : 4,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 20,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"myTerms" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Java Programmer",
"doc_count" : 7
},
{
"key" : "Javascript Programmer",
"doc_count" : 4
}
]
},
"termsCount" : {
"value" : 2
}
}
}
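On the application side, the page rows and the total can be pulled out of that response with a few lines. This is a sketch only: parse_page is a hypothetical helper, and the sample dict is the response above trimmed to the aggregations part.

```python
def parse_page(response):
    """Return (total, rows) from a response shaped like the one above."""
    aggs = response["aggregations"]
    rows = [(b["key"], b["doc_count"]) for b in aggs["myTerms"]["buckets"]]
    total = aggs["termsCount"]["value"]
    return total, rows


# The response shown above, reduced to the fields that are read here.
sample = {
    "aggregations": {
        "myTerms": {
            "buckets": [
                {"key": "Java Programmer", "doc_count": 7},
                {"key": "Javascript Programmer", "doc_count": 4},
            ]
        },
        "termsCount": {"value": 2},
    }
}

print(parse_page(sample))
# (2, [('Java Programmer', 7), ('Javascript Programmer', 4)])
```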
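3. Problems
Case sensitivity for English keywords: the include option of the terms aggregation supports regular expressions, but not the (?i) flag, so matching is case-sensitive.
For example, the keyword "ai" or "Ai" should be able to find AI. The terms aggregation can of course also use a script, for example: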
GET employees/_search
{
"size": 0,
"query": {
"match_all": {}
},
"aggs": {
"myTerms": {
"terms": {
"script": {
"source": """if(doc['job.keyword'].value.contains('Programmer')) {doc['job.keyword'].value }"""
},
"size": 10
},
"aggs": {
"myBucketSort": {
"bucket_sort": {
"from": 0,
"size": 5,
"gap_policy": "SKIP"
}
}
}
},
"termsCount": {
"cardinality": {
"script": {
"source": """if(doc['job.keyword'].value.contains('Programmer')) {doc['job.keyword'].value }"""
},
"precision_threshold": 30000
}
}
}
}
The if condition can be written to accept the all-uppercase or all-lowercase form, but a mixed-case keyword such as Ai still fails to match AI.
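One possible way around the mixed-case case is to lower-case both sides inside the script. The sketch below builds the request body in Python. It assumes that Painless exposes String.toLowerCase() and that the keyword is passed in through script params; build_body and ci_script are hypothetical names, and none of this has been run against the 7.9.2 cluster used in this post.

```python
import json

# Painless source, sent as a plain string. Lower-casing the field value and
# comparing it with an already lower-cased params.kw ignores case.
# Returning null means the document contributes no value.
CI_CONTAINS_SCRIPT = (
    "def v = doc['job.keyword'].value.toLowerCase();"
    " if (v.contains(params.kw)) { return doc['job.keyword'].value; }"
    " return null;"
)


def ci_script(keyword):
    """Script clause shared by the terms and cardinality aggregations."""
    return {"source": CI_CONTAINS_SCRIPT, "params": {"kw": keyword.lower()}}


def build_body(keyword, page_from=0, page_size=5, terms_size=10):
    """Same request shape as above, with a case-insensitive script filter."""
    script = ci_script(keyword)
    return {
        "size": 0,
        "query": {"match_all": {}},
        "aggs": {
            "myTerms": {
                "terms": {"script": script, "size": terms_size},
                "aggs": {
                    "myBucketSort": {
                        "bucket_sort": {
                            "from": page_from,
                            "size": page_size,
                            "gap_policy": "SKIP",
                        }
                    }
                },
            },
            "termsCount": {
                "cardinality": {"script": script, "precision_threshold": 30000}
            },
        },
    }


print(json.dumps(build_body("Ai"), indent=2))
```

The bucket keys keep their original casing because the script returns the untouched field value; only the comparison is lower-cased.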
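Another way to tackle this is a normalizer, though that has drawbacks of its own.
If case-sensitive search is what you wanted in the first place, you can ignore the issues above.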
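For the include side of the case problem, one workaround that stays within plain regular-expression syntax is to expand every letter of the keyword into a two-character class, so "Ai" becomes [aA][iI]. A minimal sketch follows; ci_regex is a hypothetical helper, and the reserved set is assumed from the Elasticsearch regular-expression documentation.

```python
# Characters treated as operators by Lucene regular expressions.
RESERVED = set('.?+*|{}[]()"\\#@&<>~')


def ci_regex(keyword):
    """Build a 'contains keyword, ignoring case' pattern without (?i)."""
    parts = []
    for ch in keyword:
        lo, up = ch.lower(), ch.upper()
        if lo != up and len(lo) == 1 and len(up) == 1:
            parts.append("[" + lo + up + "]")   # cased letter: accept both forms
        elif ch in RESERVED:
            parts.append("\\" + ch)             # operator: match it literally
        else:
            parts.append(ch)                    # digits, spaces, uncased characters
    return ".*" + "".join(parts) + ".*"


print(ci_regex("Ai"))    # .*[aA][iI].*
print(ci_regex("C++"))   # .*[cC]\+\+.*
```

The resulting string can be used as the include value of the terms aggregation; the cardinality script still needs its own case-insensitive check for the total to agree.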
4. Summary
- Method 1:
GET employees/_search
{
"size": 0,
"query": {
"match_all": {}
},
"aggs": {
"myTerms": {
"terms": {
"include": ".*Programmer.*",
"field": "job.keyword",
"size": 10
},
"aggs": {
"myBucketSort": {
"bucket_sort": {
"from": 0,
"size": 5,
"gap_policy": "SKIP"
}
}
}
},
"termsCount": {
"cardinality": {
"script": {
"source": """if(doc['job.keyword'].value.contains('Programmer')) {doc['job.keyword'].value }"""
},
"precision_threshold": 30000
}
}
}
}
- Method 2:
GET employees/_search
{
"size": 0,
"query": {
"match_all": {}
},
"aggs": {
"myTerms": {
"terms": {
"script": {
"source": """if(doc['job.keyword'].value.contains('Programmer')) {doc['job.keyword'].value }"""
},
"size": 10
},
"aggs": {
"myBucketSort": {
"bucket_sort": {
"from": 0,
"size": 5,
"gap_policy": "SKIP"
}
}
}
},
"termsCount": {
"cardinality": {
"script": {
"source": """if(doc['job.keyword'].value.contains('Programmer')) {doc['job.keyword'].value }"""
},
"precision_threshold": 30000
}
}
}
}
- Other approaches: use a normalizer or an analyzer. These have drawbacks too; search online for how to set them up.
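As a rough idea of what the normalizer route could look like, here is a sketch of index settings written as a Python dict. It assumes job is a text field with a keyword sub-field, which the job.keyword name suggests. The normalizer name lowercase_normalizer and the extra sub-field job.lower are made up for illustration. One drawback to expect: a normalized keyword field stores the lower-cased value, so the bucket keys come back lower-cased (for example "java programmer").

```python
# Index body with a lowercase normalizer on an extra keyword sub-field.
# "lowercase_normalizer" and "lower" are hypothetical names.
INDEX_BODY = {
    "settings": {
        "analysis": {
            "normalizer": {
                "lowercase_normalizer": {"type": "custom", "filter": ["lowercase"]}
            }
        }
    },
    "mappings": {
        "properties": {
            "job": {
                "type": "text",
                "fields": {
                    "keyword": {"type": "keyword"},
                    "lower": {
                        "type": "keyword",
                        "normalizer": "lowercase_normalizer",
                    },
                },
            }
        }
    },
}


def lower_terms(keyword, size=10):
    """terms clause on the normalized sub-field.

    Assumes the keyword contains no regular-expression operators; the
    keyword is lower-cased here because include is matched against the
    stored (already lower-cased) terms.
    """
    return {"field": "job.lower", "include": ".*" + keyword.lower() + ".*", "size": size}


print(lower_terms("Ai"))
# {'field': 'job.lower', 'include': '.*ai.*', 'size': 10}
```

Documents indexed before the sub-field existed would need to be reindexed before they show up in job.lower.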