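Preface
Aggregations in Elasticsearch (ES) resemble the GROUP BY clause in SQL, but are considerably more powerful.
1. Concepts
SELECT COUNT(color) -- metric
FROM table
GROUP BY color -- bucket
1.1 Buckets: collections of documents that meet a given criterion
1.2 Metrics: statistics computed over the documents in a bucket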
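To make the bucket/metric split concrete, here is a minimal Python sketch that mimics the SQL above in memory. The sample documents and the helper name bucket_by are made up for illustration; this is not how Elasticsearch implements aggregations internally.

```python
from collections import defaultdict

# Hypothetical sample documents; the "color" field mirrors the SQL above.
docs = [
    {"color": "red", "price": 10000},
    {"color": "red", "price": 20000},
    {"color": "green", "price": 30000},
    {"color": "blue", "price": 15000},
]


def bucket_by(documents, field):
    """Group documents into buckets keyed by the value of `field`."""
    buckets = defaultdict(list)
    for doc in documents:
        buckets[doc[field]].append(doc)
    return dict(buckets)


buckets = bucket_by(docs, "color")

# A metric is a statistic computed over the documents inside one bucket.
doc_counts = {key: len(members) for key, members in buckets.items()}
avg_prices = {
    key: sum(d["price"] for d in members) / len(members)
    for key, members in buckets.items()
}
```

The grouping step corresponds to GROUP BY color (the bucket), and the per-group count and average correspond to COUNT(color) and an avg metric.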
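2. Basic syntax
2.1 Example 1
{
  "size" : 0, // return no hits, only the aggregation results
  "aggs": { // top-level aggregations
    "colors": { // bucket name
      "terms": { // aggregation type
        "field": "color"
      },
      "aggs": { // aggregations nested inside each color bucket
        "avg_price": { "avg": { "field": "price" } }, // metric name and aggregation type
        "make" : { // bucket name
          "terms" : { // aggregation type
            "field" : "make"
          },
          "aggs" : { // aggregations nested inside each make bucket
            "min_price" : { "min": { "field": "price"} }, // aggregation type
            "max_price" : { "max": { "field": "price"} } // aggregation type
          }
        }
      }
    }
  }
}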
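The response to a nested request like this mirrors the request's nesting: each bucket carries its key, its doc_count, and one entry per nested aggregation. The sketch below flattens such a response into rows. The sample response is hypothetical (values invented, extra fields such as doc_count_error_upper_bound omitted), and flatten is an illustrative helper, not part of any client library.

```python
# Hypothetical, trimmed response for the request above.
response = {
    "aggregations": {
        "colors": {
            "buckets": [
                {
                    "key": "red",
                    "doc_count": 4,
                    "avg_price": {"value": 32500.0},
                    "make": {
                        "buckets": [
                            {
                                "key": "honda",
                                "doc_count": 3,
                                "min_price": {"value": 10000.0},
                                "max_price": {"value": 20000.0},
                            },
                            {
                                "key": "bmw",
                                "doc_count": 1,
                                "min_price": {"value": 80000.0},
                                "max_price": {"value": 80000.0},
                            },
                        ]
                    },
                }
            ]
        }
    }
}


def flatten(resp):
    """Turn the nested color -> make buckets into one flat row per pair."""
    rows = []
    for color in resp["aggregations"]["colors"]["buckets"]:
        for make in color["make"]["buckets"]:
            rows.append(
                {
                    "color": color["key"],
                    "avg_price": color["avg_price"]["value"],
                    "make": make["key"],
                    "min_price": make["min_price"]["value"],
                    "max_price": make["max_price"]["value"],
                }
            )
    return rows


rows = flatten(response)
```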
2.2 Example 2: histogram
"aggs":{
  "price":{
    "histogram":{ "field": "price", "interval": 20000 },
    "aggs":{
      "revenue": {
        "sum": {"field" : "price"}
      }
    }
  }
}
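A histogram assigns each value to the bucket whose key is the value rounded down to a multiple of the interval. The sketch below reproduces that rounding and the nested sum in plain Python, assuming an offset of 0. The price list and the function name histogram_sum are illustrative, and the sketch lists only non-empty buckets.

```python
import math
from collections import defaultdict


def histogram_sum(values, interval):
    """Sum values per histogram bucket; bucket key = floor(v / interval) * interval."""
    revenue = defaultdict(float)
    for v in values:
        key = math.floor(v / interval) * interval
        revenue[key] += v
    # Return buckets in ascending key order.
    return dict(sorted(revenue.items()))


# Hypothetical sale prices.
prices = [10000, 12000, 20000, 25000, 30000, 80000]
revenue_per_bucket = histogram_sum(prices, 20000)
```

With an interval of 20000, prices from 0 up to 19999 fall into bucket 0, prices from 20000 up to 39999 into bucket 20000, and so on.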
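2.3 Example 3: date_histogram, and forcing empty buckets to be returned
"aggs": {
  "sales": {
    "date_histogram": {
      "field": "sold",
      "interval": "month", // on Elasticsearch 7.2+ use "calendar_interval" instead
      "format": "yyyy-MM-dd",
      "min_doc_count" : 0, // also return empty buckets
      "extended_bounds" : { // extend the buckets to cover the whole year
        "min" : "2014-01-01",
        "max" : "2014-12-31"
      }
    }
  }
}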
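The combined effect of min_doc_count: 0 and extended_bounds can be sketched as follows: count documents per month, then emit a bucket for every month in the requested range, including months with no documents. The function names and sale dates are hypothetical. The bounds only widen the range here and never drop data outside it, since extended_bounds is not a filter.

```python
from collections import Counter
from datetime import date


def month_start(d):
    """Round a date down to the first day of its month."""
    return date(d.year, d.month, 1)


def next_month(d):
    """First day of the month after d."""
    return date(d.year + (d.month == 12), d.month % 12 + 1, 1)


def monthly_histogram(sold_dates, bounds_min, bounds_max):
    """Monthly doc counts, with empty months filled in across the bounds."""
    counts = Counter(month_start(d) for d in sold_dates)
    lo = min([month_start(bounds_min)] + list(counts))
    hi = max([month_start(bounds_max)] + list(counts))
    result = []
    current = lo
    while current <= hi:
        # Key formatted like "format": "yyyy-MM-dd" in the request above.
        result.append((current.strftime("%Y-%m-%d"), counts.get(current, 0)))
        current = next_month(current)
    return result


# Hypothetical sale dates.
sold = [date(2014, 1, 5), date(2014, 1, 20), date(2014, 3, 2), date(2014, 11, 30)]
sales = monthly_histogram(sold, date(2014, 1, 1), date(2014, 12, 31))
```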
2.4 Example 4: post_filter (filters only the search hits, not the aggregation results)
{
  "size" : 0,
  "query": {"match": {"make": "ford"}},
  "post_filter": {
    "term" : { "color" : "green" }
  },
  "aggs" : {
    "all_colors": { "terms" : { "field" : "color" }}
  }
}
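The order of operations is what matters with post_filter: aggregations are computed over everything the query matches, and the post filter is applied to the hits afterwards. The in-memory sketch below shows that ordering. The search function and the sample cars are hypothetical stand-ins, not an Elasticsearch client API.

```python
from collections import Counter

# Hypothetical documents.
cars = [
    {"make": "ford", "color": "green"},
    {"make": "ford", "color": "blue"},
    {"make": "ford", "color": "green"},
    {"make": "honda", "color": "green"},
]


def search(documents, query, post_filter, agg_field):
    """Apply query, aggregate over the matches, then post-filter the hits."""
    matched = [d for d in documents if query(d)]
    # The aggregation sees every document the query matched.
    aggregation = dict(Counter(d[agg_field] for d in matched))
    # post_filter narrows only the hits that are returned.
    hits = [d for d in matched if post_filter(d)]
    return hits, aggregation


hits, all_colors = search(
    cars,
    query=lambda d: d["make"] == "ford",
    post_filter=lambda d: d["color"] == "green",
    agg_field="color",
)
```

Here the hits contain only green Fords, while all_colors still counts every Ford color, which is what lets a UI show the other color options alongside filtered results.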
2.5 Example 5: built-in bucket ordering
"aggs" : {
  "colors" : {
    "terms" : {
      "field" : "color",
      "order": {"_count" : "asc" } // order by document count, ascending
    }
  }
}
2.6 Example 6: ordering by a metric
"aggs" : {
  "colors" : {
    "terms" : {
      "field" : "color",
      "order": { "avg_price" : "asc" } // order by the nested avg_price metric
    },
    "aggs": {
      "avg_price": { "avg": {"field": "price"} }
    }
  }
}
2.7 Example 7: ordering by a deeply nested metric
"aggs" : {
  "colors" : {
    "histogram" : {
      "field" : "price",
      "interval": 20000,
      "order": {"red_green_cars>stats.variance" : "asc" } // use > to descend into nested aggregations, and . to pick a metric
    },
    "aggs": {
      "red_green_cars": {
        "filter": { "terms": {"color": ["red", "green"]}},
        "aggs": {
          "stats": {"extended_stats": {"field" : "price"}}
        }
      }
    }
  }
}
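The order path can be read as a lookup into each bucket: names separated by > descend through nested aggregations, and the part after . selects one value from a multi-value metric. The sketch below resolves such a path against hypothetical response buckets and sorts by it. resolve is an illustrative helper, and the variances are invented.

```python
def resolve(bucket, path):
    """Follow an order path such as 'red_green_cars>stats.variance' in a bucket."""
    *agg_names, last = path.split(">")
    node = bucket
    for name in agg_names:
        node = node[name]
    agg_name, _, metric = last.partition(".")
    node = node[agg_name]
    # Single-value metrics (avg, min, ...) expose their result under "value".
    return node[metric] if metric else node["value"]


# Hypothetical histogram buckets, trimmed to the fields used here.
price_buckets = [
    {"key": 0, "red_green_cars": {"doc_count": 2, "stats": {"variance": 2500.0}}},
    {"key": 20000, "red_green_cars": {"doc_count": 3, "stats": {"variance": 100.0}}},
    {"key": 40000, "red_green_cars": {"doc_count": 2, "stats": {"variance": 900.0}}},
]

ordered = sorted(
    price_buckets,
    key=lambda b: resolve(b, "red_green_cars>stats.variance"),
)
ordered_keys = [b["key"] for b in ordered]
```

The same helper handles the simpler path from Example 6: resolve(bucket, "avg_price") reads bucket["avg_price"]["value"].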
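3. Approximate aggregations
3.1 cardinality (distinct count)
1) Syntax
"aggs" : {
  "distinct_colors" : { "cardinality" : { "field" : "color", "precision_threshold" : 100 } }
  // precision_threshold: the cardinality below which the count is expected to be close to exact
}
2) How it works
cardinality is based on the HyperLogLog (HLL) algorithm (Elasticsearch uses the HyperLogLog++ variant), which Redis also uses. Its advantage is that the memory needed to estimate the cardinality is fixed and small, no matter how many input elements there are or how large they are. In Redis, a HyperLogLog needs only 12 KB of memory to count up to 2^64 distinct elements with a standard error of 0.81%.
stream-lib provides a Java implementation of HLL: https://github.com/addthis/stream-lib/blob/master/src/main/java/com/clearspring/analytics/stream/cardinality/HyperLogLog.java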
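To show why the memory stays fixed, here is a minimal sketch of the basic HyperLogLog estimator in Python. It is a teaching sketch under stated assumptions (SHA-1 as the hash, p = 14 index bits, the classic small-range correction), not the HyperLogLog++ code used by Elasticsearch nor the Redis implementation. With m = 2^14 = 16384 registers the expected standard error is about 1.04 / sqrt(m), roughly 0.81%, and 16384 six-bit registers take 12 KB, which matches the Redis figures above. The Python list used here is not packed that tightly.

```python
import hashlib
import math


class HyperLogLog:
    """Basic HyperLogLog with m = 2**p registers (illustrative sketch)."""

    def __init__(self, p=14):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m
        # Bias-correction constant, valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        x = int.from_bytes(digest[:8], "big")        # 64-bit hash
        index = x >> (64 - self.p)                   # first p bits pick a register
        rest = x & ((1 << (64 - self.p)) - 1)        # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in `rest` (1-based).
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def count(self):
        raw = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small cardinalities: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


hll = HyperLogLog()
for i in range(100000):
    hll.add(f"user-{i}")
for i in range(30000):
    hll.add(f"user-{i}")  # duplicates never change a register
estimate = hll.count()
```

However many items are added, the state is always the same 16384 small integers; only the register values change.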
3.2 percentiles and percentile_ranks (percentile calculations)
1) Syntax
percentiles: by default, percentiles returns a predefined set of percentiles, [1, 5, 25, 50, 75, 95, 99]. Each result is the value at or below which that percentage of the observed values falls.
"aggs" : {
  "zones" : {
    "terms" : { "field" : "zone"},
    "aggs" : {
      "load_times" : {
        "percentiles" : {
          "field" : "latency",
          "percents" : [50, 95.0, 99.0] // the percentiles of interest
        }
      },
      "avg_load_avg" : { "avg" : { "field" : "latency" } }
    }
  }
}
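percentile_ranks: reports the percentile rank of a specific value, that is, the percentage of observed values that are at or below it
"aggs" : {
  "zones" : {
    "terms" : { "field" : "zone"},
    "aggs" : {
      "load_times" : {
        "percentile_ranks" : {
          "field" : "latency",
          "values" : [210, 800]
        }
      }
    }
  }
}
2) How it works
Percentiles are approximated with the TDigest algorithm.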
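The two aggregations are inverses of each other: percentiles maps a percentage to a value, and percentile_ranks maps a value to a percentage. The sketch below computes both exactly on a small list using the nearest-rank definition. It deliberately does not implement TDigest, so Elasticsearch may return slightly different (interpolated, approximate) numbers. The latency values are hypothetical.

```python
import math
from bisect import bisect_right


def percentile(values, percent):
    """Nearest-rank percentile: smallest value with >= percent% of values at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(percent / 100 * len(ordered)))
    return ordered[rank - 1]


def percentile_rank(values, threshold):
    """Percentage of values that are at or below `threshold`."""
    ordered = sorted(values)
    return 100 * bisect_right(ordered, threshold) / len(ordered)


# Hypothetical latencies in milliseconds.
latencies = [100, 120, 150, 200, 210, 300, 450, 800, 900, 1500]

p50 = percentile(latencies, 50)
p95 = percentile(latencies, 95)
rank_210 = percentile_rank(latencies, 210)
rank_800 = percentile_rank(latencies, 800)
```

An exact computation like this needs all values sorted in one place, which is why an approximate, mergeable structure such as TDigest is used for large distributed data sets.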