Chapter 1: Overview
JSON queries are Druid's built-in native queries. Native queries are relatively low level and closely tied to how computation is executed internally. JSON queries are designed to be lightweight and to complete very quickly, which means that for more complex analysis, or to build more complex visualizations, multiple JSON queries may be needed.
SQL queries use Druid's built-in SQL layer, an alternative to the JSON-based query language. SQL queries are translated into native queries on the query nodes and then passed to the data processes. Apart from the overhead of translating SQL on the query nodes, there is no additional performance penalty compared with Druid's JSON queries.
Chapter 2: Making native queries
1. How to submit a query
Druid supports querying by sending an HTTP REST request with a JSON body to a query node, as follows:
curl -X POST '<queryable_host>:<port>/druid/v2/?pretty' -H 'Content-Type:application/json' -H 'Accept:application/json' -d @<query_json_file>
- <queryable_host> is the query node
- <port> is the query node's port, typically 8082
- <query_json_file> is the file containing the query body
Example query (query ID: optional; type: search; data source: tsdb; result format: list; filter: isgood; interval: 2020-03-25 to 2020-04-26)
{
"queryId":"search1",
"queryType": "search",
"dataSource": "tsdb",
"resultFormat": "list",
"columns": [
"tagname",
"tagvalue",
"isgood"
],
"filter": {
"type": "selector",
"dimension": "isgood",
"value": "true"
},
"intervals": [
"2020-03-25/2020-04-26"
],
"batchSize": 20480,
"limit": 10
}
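As an illustrative sketch (not part of the original manual), the same search query can be assembled and POSTed from Python. The host, port, and query fields mirror the curl example above; `post_query` requires a reachable Broker to actually run:

```python
import json
import urllib.request

def make_search_query(datasource, columns, filter_dim, filter_value, interval, limit=10):
    """Build the JSON body for a search query like the example above."""
    return {
        "queryType": "search",
        "dataSource": datasource,
        "resultFormat": "list",
        "columns": columns,
        "filter": {"type": "selector", "dimension": filter_dim, "value": filter_value},
        "intervals": [interval],
        "limit": limit,
    }

def post_query(host, port, query):
    """POST the query to the Broker, mirroring the curl command above."""
    req = urllib.request.Request(
        f"http://{host}:{port}/druid/v2/?pretty",
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json", "Accept": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:  # needs a live Broker
        return json.load(resp)

query = make_search_query("tsdb", ["tagname", "tagvalue", "isgood"],
                          "isgood", "true", "2020-03-25/2020-04-26")
```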
2. How to cancel a query
A query can be cancelled explicitly using its unique identifier. If the query identifier was set at query time, or is otherwise known, the following endpoint on the Broker or Router can be used to cancel the query:
DELETE http://IP:Port/druid/v2/{queryId}
curl -X DELETE "http://host:port/druid/v2/search1"
Chapter 3: Basic elements of a query request
1. Query types
- Timeseries
- TopN
- GroupBy
- Scan
- TimeBoundary
- SegmentMetadata
- DatasourceMetadata
- Search
- Select
- Multi-value dimensions
- Lookups
- Joins
- Multitenancy considerations
- Query caching
- Spatial filters
2. Data source
<font color=red>Required.</font> Defines the data source of the query. It can be a string naming a single data source, or a JSON object specifying multiple data sources or using the result of a groupBy query as the data source.
[1] Single data source
This is the default dataSource form, specifying a single data source:
"dataSource":{
"type":"table",
"name":"<DatasourceName>"
}
Usually abbreviated as:
"dataSource":"<DatasourceName>"
[2] Multiple data sources
Specifying type union with a list of data sources queries multiple data sources that share the same schema:
"type":"union",
"dataSources":[
"<datasource_1>",
"<datasource_2>",
"...",
"<datasource_n>"
]
[3] Query result as data source
If the data source of a query is the result of another groupBy query, a nested query can be built with a query dataSource:
"dataSource":{
"type":"query",
"query":{
"queryType":"groupBy",
...
}
}
3. Filters
<font color=red>Optional.</font> A JSON object for conditional filtering on dimensions. String, long, and float dimensions are supported, as well as the timestamp column.
[1] selector
This is the simplest filter type, equivalent to a single equality condition in a SQL query:
"filter":{
"type":"selector",
"dimension":"<dimension_name>",
"value":"<dimension_value>"
}
Equivalent to "where <dimension_name> = <dimension_value>" in a SELECT.
[2] columnComparison
Expresses an equality condition between two different dimension columns:
"filter":{
"type":"columnComparison",
"dimensions":[
"<dimension_a>",
"<dimension_b>"
]
}
Equivalent to "where <dimension_a> = <dimension_b>" in a SELECT.
[3] regex
Regular-expression filtering on string dimension columns. <pattern_string> can be any standard Java regular expression:
"filter":{
"type":"regex",
"dimension":"<dimension_name>",
"pattern":"<pattern_string>"
}
[4] search
Filtering on partial string matches:
"filter":{
"type":"search",
"dimension":"<dimension_name>",
"query":{
"type":"<match_type>",
"value":"<value>",
"caseSensitive":true|false
}
}
<match_type> values
-
contains
Matches rows whose value for the dimension <dimension_name> contains the substring <value>. caseSensitive is true or false, defaulting to false (case-insensitive).
"filter":{ "type":"search", "dimension":"product", "query":{ "type":"contains", "value":"foo", "caseSensitive":true } }
-
insensitive_contains
Equivalent to a contains match with caseSensitive set to false (or left unset).
"filter":{ "type":"search", "dimension":"product", "query":{ "type":"insensitive_contains", "value":"foo" } }
-
fragment
Matches rows whose value for the dimension <dimension_name> contains all of the substrings listed in <value>.
"filter":{ "type":"search", "dimension":"product", "query":{ "type":"fragment", "value":["foo","bar"], "caseSensitive":true } }
[5] in
The in filter works like IN in a SELECT:
"filter":{
"type":"in",
"dimension":"<dimension_name>",
"values":<value_json_array>
}
Example:
"filter":{
"type":"in",
"dimension":"outlaw",
"values":["Good", "Bad", "Ugly"]
}
[6] bound
Filters dimension values by range, and supports comparisons such as greater than, less than, greater than or equal to, less than or equal to, and between:
"filter":{
"type":"bound",
"dimension":"<dimension_name>",
"lower":"<lower_value>",
"lowerStrict":true|false,
"upper":"<upper_value>",
"upperStrict":true|false,
"ordering":"lexicographic|alphanumeric|numeric|strlen"
}
ordering options
-
lexicographic
The default. Values to be compared are converted to UTF-8 byte arrays and ordered lexicographically.
-
alphanumeric
For comparing values that mix digits and letters.
-
numeric
Values are converted to numbers before comparison; values that cannot be parsed are treated as null.
-
strlen
Compares string lengths; when lengths are equal, the strings themselves are compared.
Examples showing the effect of different parameter combinations:
-
SQL equivalent: name>="foo" and name<="hoo"
"filter":{ "type":"bound", "dimension":"name", "lower":"foo", "upper":"hoo" }
-
SQL equivalent: age>21 and age<31
"filter":{ "type":"bound", "dimension":"age", "lower":"21", "lowerStrict":true, "upper":"31", "upperStrict":true, "ordering":"numeric" }
-
SQL equivalent: age between 21 and 31
"filter":{ "type":"bound", "dimension":"age", "lower":"21", "upper":"31", "ordering":"numeric" }
-
SQL equivalent: age>=21
"filter":{ "type":"bound", "dimension":"age", "lower":"21", "ordering":"numeric" }
[7] interval
The interval filter filters on columns containing long millisecond values. It applies to the __time column and to dimensions that can be parsed as long millisecond values:
"filter":{
"type":"interval",
"dimension":"__time",
"intervals":[
"<ISO 8601 time interval_1>", # 2014-10-01T00:00:00.000Z/2014-10-07T00:00:00.000Z
"<ISO 8601 time interval_2>", # 2014-11-15T00:00:00.000Z/2014-11-16T00:00:00.000Z
...
]
}
All intervals are half-open: closed on the left, open on the right.
[8] javascript
The JavaScript filter matches a dimension against the result of a given JavaScript predicate function. The function takes the dimension value as input:
"filter":{
"type":"javascript",
"dimension":"<dimension_string>",
"function":"function(value){<...>}" # function(x) { return(x >= 'bar' && x <= 'foo') }
}
[9] extraction
The extraction filter matches a dimension transformed by a given extraction function:
"filter":{
"type":"extraction",
"dimension":"product",
"value":"bar_1",
"extractionFn":{
"type":"lookup",
"lookup":{
"type": "map",
"map":{
"product_1": "bar_1",
"product_5": "bar_1",
"product_3": "bar_1"
}
}
}
}
[10] Logical combinations
Filters can be combined with logical operators. Each <filter> is a nested child filter, such as a selector or search filter.
-
and
"filter":{ "type": "and", "fields": ["<filter>", "<filter>",…] }
-
or
"filter":{ "type": "or", "fields": ["<filter>", "<filter>",…] }
-
not
"filter":{ "type": "not", "field": "<filter>" }
[11] Using extraction functions in filters
All filters except the spatial filter support extraction functions. An extraction function is set via the extractionFn field of a filter.
Once set, the extraction function transforms the input value before the filter is applied.
The example below shows a selector filter combined with an extraction function.
It matches rows whose product column value is product_1, product_5, or product_3 (each of which the lookup maps to bar_1):
{
"filter": {
"type": "selector",
"dimension": "product",
"value": "bar_1",
"extractionFn": {
"type": "lookup",
"lookup": {
"type": "map",
"map": {
"product_1": "bar_1",
"product_5": "bar_1",
"product_3": "bar_1"
}
}
}
}
[12] Filtering on the timestamp
Filters can also be used on the timestamp column. The timestamp is a long millisecond value, and __time is used as the dimension name of the timestamp column. Note that the timestamp column has no bitmap index, so filtering on the timestamp scans the whole column, which hurts performance. When possible, specifying the query time range via intervals is faster.
-
Filtering on a long millisecond timestamp value
"filter": { "type": "selector", "dimension": "__time", "value": "124457387532" }
-
Filtering on the day of the week
"filter": { "type": "selector", "dimension": "__time", "value": "Friday", "extractionFn": { "type": "timeFormat", "format": "EEEE", "timeZone": "America/New_York", "locale": "en" } }
-
Filtering on a set of ISO 8601 time intervals
"filter":{ "type" : "interval", "dimension" : "__time", "intervals" : [ "2014-10-01T00:00:00.000Z/2014-10-07T00:00:00.000Z", "2014-11-15T00:00:00.000Z/2014-11-16T00:00:00.000Z" ] }
4. Aggregations
Aggregations can be applied at ingestion time, as a way to summarize data before it enters the time-series database, or at query time, as part of a query.
[1] Count aggregator
The count aggregator counts the number of rows that match the filter, if one is given:
"aggregation":{
"type":"count",
"name": "<output_name>"
}
[2] Sum aggregators
-
longSum aggregator
The longSum aggregator computes the sum of values as a 64-bit signed integer:
"aggregation":{ "type":"longSum", "name": "<output_name>", "fieldName": "<metric_name>" }
-
doubleSum aggregator
The doubleSum aggregator computes the sum of values as a 64-bit floating-point number:
"aggregation":{ "type":"doubleSum", "name": "<output_name>", "fieldName": "<metric_name>" }
[3] Min/Max aggregators
-
doubleMin aggregator
The doubleMin aggregator computes the minimum of all metric values and Double.POSITIVE_INFINITY:
"aggregation":{ "type":"doubleMin", "name": "<output_name>", "fieldName": "<metric_name>" }
-
doubleMax aggregator
The doubleMax aggregator computes the maximum of all metric values and Double.NEGATIVE_INFINITY:
"aggregation":{ "type":"doubleMax", "name": "<output_name>", "fieldName": "<metric_name>" }
-
longMin aggregator
The longMin aggregator computes the minimum of all metric values and Long.MAX_VALUE:
"aggregation":{ "type":"longMin", "name": "<output_name>", "fieldName": "<metric_name>" }
-
longMax aggregator
The longMax aggregator computes the maximum of all metric values and Long.MIN_VALUE:
"aggregation":{ "type":"longMax", "name": "<output_name>", "fieldName": "<metric_name>" }
[4] JavaScript aggregator
The JavaScript aggregator computes an arbitrary JavaScript function over a set of fields (both metric and dimension columns are allowed) and returns a floating-point value:
"aggregation":{
"type":"javascript",
"name": "<output_name>",
"fieldNames": ["<column1>", "<column2>",...],
"fnAggregate": "function(current, column1, column2,...) {
<updates the partial aggregate (current) based on the current row values>
return <updated partial aggregate>
}",
"fnCombine" : "function(partialA, partialB) {return <combined partial results>; }",
"fnReset" : "function(){return <initial value>; }"
}
Example
"aggregation":{
"type": "javascript",
"name": "sum(log(x)*y) + 10",
"fieldNames": [
"x",
"y"
],
"fnAggregate": "function(current, a, b) { return current + (Math.log(a) * b); }",
"fnCombine": "function(partialA, partialB) { return partialA + partialB; }",
"fnReset": "function() { return 10; }"
}
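As an illustrative sketch (not Druid's actual execution), the fnReset/fnAggregate/fnCombine contract of the example above can be mimicked in Python: a partition folds its rows starting from fnReset's value, and fnCombine merges partial results. The example's +10 offset comes from the reset value, so this sketch assumes a single partition:

```python
import math

# sum(log(x) * y) + 10, mirroring the JavaScript aggregator example above
fn_reset = lambda: 10
fn_aggregate = lambda current, a, b: current + math.log(a) * b
fn_combine = lambda pa, pb: pa + pb

rows = [(1.0, 2.0), (math.e, 3.0), (math.e, 4.0)]

acc = fn_reset()
for x, y in rows:
    acc = fn_aggregate(acc, x, y)
# log(1)*2 + log(e)*3 + log(e)*4 = 0 + 3 + 4 = 7, plus the initial 10
```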
[5] Approximate aggregators
-
Cardinality aggregator
The cardinality aggregator computes the cardinality of a set of dimensions, estimated with HyperLogLog. Note: this aggregator is much slower than indexing a column with a dedicated aggregator at ingestion time.
"aggregation":{ "type":"cardinality", "name": "<output_name>", "fields": ["<dimension1>", "<dimension2>",...], "byRow": false|true }
Meaning of the byRow parameter
With byRow set to false (the default), the cardinality aggregator computes the cardinality of the merged set of values across all the given dimensions.
For a single dimension this is equivalent to: SELECT COUNT(DISTINCT(dimension)) FROM "<datasource>". For multiple dimensions it is equivalent to: SELECT COUNT(DISTINCT(value)) FROM ( SELECT dim_1 as value FROM "<datasource>" UNION SELECT dim_2 as value FROM "<datasource>" UNION SELECT dim_3 as value FROM "<datasource>" )
With byRow set to true, it computes cardinality by row, equivalent to:
SELECT COUNT(*) FROM ( SELECT DIM1, DIM2, DIM3 FROM "<datasource>" GROUP BY DIM1, DIM2, DIM3 )
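The difference between the two byRow modes can be sketched on a toy dataset (exact counting here is only a model; the real aggregator reports a HyperLogLog estimate):

```python
rows = [
    {"dim_1": "a", "dim_2": "x"},
    {"dim_1": "a", "dim_2": "y"},
    {"dim_1": "b", "dim_2": "x"},
]

def cardinality(rows, dims, by_row=False):
    if by_row:
        # byRow=true: count distinct dimension-value tuples (like GROUP BY dims)
        return len({tuple(r[d] for d in dims) for r in rows})
    # byRow=false: count distinct values in the union of all the dimensions
    return len({r[d] for r in rows for d in dims})

merged = cardinality(rows, ["dim_1", "dim_2"])         # {'a','b','x','y'} -> 4
per_row = cardinality(rows, ["dim_1", "dim_2"], True)  # 3 distinct pairs
```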
Examples
Count the number of distinct countries that people live in or come from:
"aggregation":{ "type": "cardinality", "name": "distinct_countries", "fields": [ "country_of_origin", "country_of_residence" ] }
Count the number of distinct people (by the combination of first name and last name):
"aggregation":{ "type": "cardinality", "name": "distinct_people", "fields": [ "first_name", "last_name" ], "byRow" : true }
Count the number of distinct first characters of last names:
"aggregation":{ "type": "cardinality", "name": "distinct_last_name_first_char", "fields": [ { "type": "extraction", "dimension": "last_name", "outputName": "last_name_first_char", "extractionFn": { "type": "substring", "index": 0, "length": 1 } } ], "byRow": true }
-
hyperUnique aggregator
Computes the estimated cardinality of a dimension with HyperLogLog. This requires that the column was already aggregated as a hyperUnique metric at ingestion time:
{ "type":"hyperUnique", "name":"devices", "fieldName":"device_id_met" }
hyperUnique aggregation syntax
"aggregation":{ "type":"hyperUnique", "name": "<output_name>", "fieldName": "<metric_name>" }
[6] Filtered aggregator
The filtered aggregator wraps any given aggregator, but only aggregates the values that match the given dimension filter:
"aggregation":{
"type":"filtered",
"filter" : {
"type" : "selector",
"dimension" : "<dimension>",
"value" : "<dimension value>"
},
"aggregator" : "<aggregation>"
}
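As an illustrative row-level sketch (not Druid's actual execution), a filtered aggregation applies the selector first and then aggregates only the matching rows:

```python
rows = [
    {"isgood": "true", "tagvalue": 10},
    {"isgood": "false", "tagvalue": 5},
    {"isgood": "true", "tagvalue": 7},
]

def filtered_agg(rows, dimension, value, agg):
    """Aggregate only the rows matched by a selector filter."""
    return agg([r for r in rows if r[dimension] == value])

matched_count = filtered_agg(rows, "isgood", "true", len)  # count aggregator
matched_sum = filtered_agg(rows, "isgood", "true",
                           lambda rs: sum(r["tagvalue"] for r in rs))  # longSum
```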
[7] Extension aggregators
The basic aggregators cover only count / count distinct / sum / min / max. To satisfy other common aggregation needs, extensions such as the stats and histogram packages can be loaded, but they require special setup in the system configuration and at ingestion time.
-
Stats aggregation (computes the variance and standard deviation of a metric column)
Usage notes
A. Modify common.runtime.properties
Add druid-stats to druid.extensions.loadList in conf/druid/_common/common.runtime.properties. Nodes must be restarted to load the newly added extension.
B. Configure at ingestion time
Define it under spec->dataSchema->metricsSpec:
{ "type":"variance", "name":"<output_name>", "fieldName":"<metric_name>", "inputType":"<input_type>", "estimator":"<estimator>" }
output_name : name of the resulting aggregated metric column
metric_name : name of the original metric column
input_type : one of float/long/variance, defaulting to float
estimator : the variance algorithm; can be set to population, defaulting to the sample method. The difference is the denominator: population divides by the total count, sample by the count minus one.
Complete example
{ "type":"index", "spec":{ "ioConfig":{ "type":"index", "firehose":{ "type":"local", "baseDir":"/rawdata/", "filter":"testdata.json" } }, "dataSchema":{ "dataSource":"test_variance_1", "granularitySpec":{ "type":"uniform", "segmentGranularity":"day", "queryGranularity":"none", "rollup":"false", "intervals":[ "2017-05-11/2017-05-12" ] }, "parser":{ "type":"string", "parseSpec":{ "format":"json", "dimensionsSpec":{ "dimensions":[ "SN", "Description", "SENSOR_ID", "PLC_ID" ] }, "timestampSpec":{ "format":"millis", "column":"TIMESTAMP" } } }, "metricsSpec":[ { "type":"count", "name":"count" }, { "type":"variance", "name":"VALUE_var_metric", "fieldName":"VALUE" }, { "type":"doubleSum", "name":"VALUE", "fieldName":"VALUE" } ] }, "tuningConfig":{ "type":"index", "partitionsSpec":{ "type":"hashed", "targetPartitionSize":5000000 }, "ignoreInvalidRows":true } } }
Query syntax (these aggregators apply only to numeric metric columns; if a row's metric value is null, it is treated as 0)
Computing the variance
Define it in the query's aggregation or aggregations:
{ "type":"varianceFold", "name":"<output_name>", "fieldName":"<metric_name>", "estimator":"<estimator>" }
This mirrors the ingestion-time definition, except that the type is varianceFold and <metric_name> is the <output_name> from the ingestion-time definition.
Computing the standard deviation
Unlike the variance, the standard deviation must be defined in postAggregation or postAggregations:
{ "type":"stddev", "name":"<output_name>", "fieldName":"<aggregator_name>", "estimator":"<estimator>" }
This mirrors the variance query syntax, except that the type is stddev and <aggregator_name> is the <output_name> from the variance aggregation.
Variance and standard deviation query example
{ "queryType":"timeseries", "dataSource":"test_variance_1", "intervals":[ "2015-05-01/2018-05-04" ], "granularity":"day", "threshold":25, "aggregations":[ { "type":"varianceFold", "name":"VALUE_var", "fieldName":"VALUE_var_metric" } ], "postAggregations":[ { "type":"stddev", "name":"VALUE_stddev", "fieldName":"VALUE_var_metric" } ] }
-
approxHistogram aggregation (computes an approximate histogram; applies only to metric columns)
Usage notes
A. Modify common.runtime.properties
Add druid-histogram to druid.extensions.loadList in conf/druid/_common/common.runtime.properties. Nodes must be restarted to load the newly added extension.
B. Configure at ingestion time
Define it under spec->dataSchema->metricsSpec:
{ "type":"approxHistogram|approxHistogramFold", "name":"<output_name>", "fieldName":"<metric_name>", "resolution":<integer>, "numBuckets":<integer>, "lowerLimit":<float>, "upperLimit":<float> }
approxHistogram: missing values are treated as 0
approxHistogramFold: missing values are ignored
resolution: the number of centroids (data points) to store. Higher resolution gives more accurate results but slower computation. Defaults to 50
numBuckets: the number of output buckets of the resulting histogram. Bucket intervals are dynamic, based on the range of the underlying data. Defaults to 7
lowerLimit/upperLimit: restrict the approximation to the given range. Values outside the range are aggregated into two centroids; the count of values outside the range is still preserved. Defaults to -INF/+INF
Example
{ "type":"approxHistogram", "name":"VALUE_approxHistogram", "fieldName":"VALUE", "numBuckets":3 }
Query syntax
{ "queryType":"timeseries", "dataSource":"test_histogram_new_test", "intervals":[ "2015-05-01/2018-05-04" ], "granularity":"day", "threshold":25, "aggregations":[ { "type":"approxHistogramFold", "name":"VALUE_approxHistogram", "fieldName":"VALUE_approxHistogram", "numBuckets":3 } ], "postAggregations":[ { "type":"quantile", "name":"VALUE_approxHistogram_quantile", "fieldName":"VALUE_approxHistogram", "probability":0.5 }, { "type":"min", "name":"VALUE_approxHistogram_min", "fieldName":"VALUE_approxHistogram" } ] }
Use postAggregations to compute quantiles, the minimum, the maximum, and similar metrics.
min: returns the minimum value of the underlying approximate histogram aggregator
{ "type":"min", "name":"<output_name>", "fieldName":"<aggregator_name>" }
max: returns the maximum value of the underlying approximate histogram aggregator
{ "type":"max", "name":"<output_name>", "fieldName":"<aggregator_name>" }
equalBuckets: computes the approximate histogram with the given number of equal-width buckets. Bucket width is based on the range of the data
{ "type":"equalBuckets", "name":"<output_name>", "fieldName":"<aggregator_name>", "numBuckets":<numBucket> }
buckets: computes the approximate histogram with the given offset and bucketSize
{ "type":"buckets", "name":"<output_name>", "fieldName":"<aggregator_name>", "bucketSize":<bucket_size>, "offset":<offset> }
customBuckets: buckets the data at the specified break points
{ "type":"customBuckets", "name":"<output_name>", "fieldName":"<aggregator_name>", "breaks":[ "<value>", "<value>", ... ] }
quantile: computes a single quantile
{ "type":"quantile", "name":"<output_name>", "fieldName":"<aggregator_name>", "probability":<quantile> }
quantiles: computes a set of quantiles
{ "type":"quantiles", "name":"<output_name>", "fieldName":"<aggregator_name>", "probabilities":[ <quantile>, <quantile>, ... ] }
5. Post-aggregations
A postAggregation is a further operation on the results of aggregations.
[1] Arithmetic
The arithmetic post-aggregator applies the given function, left to right, to the given fields, which can be aggregators or other post-aggregators. Supported functions are +, -, *, /, and quotient. An arithmetic post-aggregator can also specify an ordering that determines how results are sorted. If no ordering is specified (or it is set to null), the default floating-point ordering is used. The numericFirst ordering returns finite values first, then null, then infinite values.
"postAggregation":{
"type" : "arithmetic",
"name" : "<output_name>",
"fn" : "<arithmetic_function>",
"fields" : [<post_aggregator>, <post_aggregator>, . . . ],
"ordering" : <null (default), or "numericFirst">
}
Usage example
{ ...
"aggregations": [
{
"type": "count",
"name": "rows"
},
{
"type": "doubleSum",
"name": "tot",
"fieldName": "total"
}
],
"postAggregations": [{
"type": "arithmetic",
"name": "average",
"fn": "*",
"fields": [
{ "type": "arithmetic",
"name": "div",
"fn": "/",
"fields": [
{"type": "fieldAccess","name": "tot","fieldName":"tot"},
{"type": "fieldAccess","name": "rows","fieldName": "rows"}
]
},
{"type": "constant","name": "const","value": 100}
]
}]
...
}
[2] fieldAccess
The fieldAccess post-aggregator returns the value of the specified aggregator:
"postAggregation":{
"type" : "fieldAccess",
"name" : "<output_name>",
"fieldName" : "<aggregator_name>"
}
[3] Constant
The constant post-aggregator returns the specified constant:
"postAggregation":{
"type" : "constant",
"name" : "<output_name>",
"value" : <numerical_value>
}
[4] JavaScript
Applies the specified JavaScript function to the given fields, which are passed to the function as arguments in the specified order:
"postAggregation":{
"type" : "javascript",
"name" : "<output_name>",
"fieldNames" : [<aggregator_name>, <aggregator_name>, . . . ],
"function" : "<javascript function>"
}
Example
"postAggregation":{
"type": "javascript",
"name": "absPercent",
"fieldNames": [
"delta",
"total"
],
"function": "function(delta, total) { return 100 * Math.abs(delta) / total; }"
}
[5] HyperUniqueCardinality
The hyperUniqueCardinality post-aggregator wraps a hyperUnique object so that it can be used in post-aggregations:
"postAggregation":{
"type" : "hyperUniqueCardinality",
"name" : "<output_name>",
"fieldName" : "<the name of the hyperUnique aggregator>"
}
Example
"postAggregation":{
"type": "arithmetic",
"name": "average_users_per_row",
"fn": "/",
"fields": [
{
"type": "hyperUniqueCardinality",
"fieldName": "unique_users"
},
{
"type": "fieldAccess",
"name": "rows",
"fieldName": "rows"
}
]
}
6. Time granularity
[1] Simple granularities
Buckets UTC timestamps into time units according to a given simple granularity string:
- all
- none
- second
- minute
- fifteen_minute
- thirty_minute
- hour
- day
- week
- month
- quarter
- year
all groups all data into a single bucket; none applies no bucketing, which is currently not recommended in timeseries queries.
Sample data
{"timestamp": "2013-08-31T01:02:33Z", "page": "AAA", "language" : "en"}
{"timestamp": "2013-09-01T01:02:33Z", "page": "BBB", "language" : "en"}
{"timestamp": "2013-09-02T23:32:45Z", "page": "CCC", "language" : "en"}
{"timestamp": "2013-09-03T03:32:45Z", "page": "DDD", "language" : "en"}
-
groupBy query at hour granularity
{ "queryType": "groupBy", "dataSource": "my_dataSource", "granularity": "hour", "dimensions": [ "language" ], "aggregations": [ { "type": "count", "name": "count" } ], "intervals": [ "2000-01-01T00:00Z/3000-01-01T00:00Z" ] }
-
Result (all empty buckets are discarded)
[ { "version" : "v1", "timestamp" : "2013-08-31T01:00:00.000Z", "event" : { "count" : 1, "language" : "en" } }, { "version" : "v1", "timestamp" : "2013-09-01T01:00:00.000Z", "event" : { "count" : 1, "language" : "en" } }, { "version" : "v1", "timestamp" : "2013-09-02T23:00:00.000Z", "event" : { "count" : 1, "language" : "en" } }, { "version" : "v1", "timestamp" : "2013-09-03T03:00:00.000Z", "event" : { "count" : 1, "language" : "en" } } ]
-
Result with the query granularity changed to day
[ { "version": "v1", "timestamp": "2013-08-31T00:00:00.000Z", "event": { "count": 1, "language": "en" } }, { "version": "v1", "timestamp": "2013-09-01T00:00:00.000Z", "event": { "count": 1, "language": "en" } }, { "version": "v1", "timestamp": "2013-09-02T00:00:00.000Z", "event": { "count": 1, "language": "en" } }, { "version": "v1", "timestamp": "2013-09-03T00:00:00.000Z", "event": { "count": 1, "language": "en" } } ]
If the query granularity is set to none, the results match those obtained with the granularity set to the ingestion granularity. Querying at a granularity finer than the ingestion granularity is not meaningful, because the finer-grained information is not represented in the index. Therefore, if the query granularity is finer than the ingestion granularity, the time-series database automatically adjusts it to the ingestion granularity.
-
If the query granularity is set to all, all results are aggregated into a single bucket:
[ { "version": "v1", "timestamp": "2000-01-01T00:00:00.000Z", "event": { "count": 4, "language": "en" } } ]
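The bucketing behind the hour/day results above can be sketched by truncating each timestamp to its granularity unit and counting rows per bucket (a toy model of the rollup, using the four sample rows):

```python
from collections import Counter
from datetime import datetime, timezone

stamps = ["2013-08-31T01:02:33Z", "2013-09-01T01:02:33Z",
          "2013-09-02T23:32:45Z", "2013-09-03T03:32:45Z"]

def truncate(ts, granularity):
    """Truncate an ISO timestamp to the start of its granularity bucket."""
    dt = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    if granularity == "hour":
        dt = dt.replace(minute=0, second=0)
    elif granularity == "day":
        dt = dt.replace(hour=0, minute=0, second=0)
    return dt.isoformat()

by_hour = Counter(truncate(t, "hour") for t in stamps)
by_day = Counter(truncate(t, "day") for t in stamps)
```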
[2] duration granularity
A duration granularity buckets time by an exact duration in milliseconds, and supports specifying an origin instant:
"granularity":{
"type": "duration",
"duration": 36000000,
"origin":"2012-01-01T00:30:00Z"
}
Reusing the data from the previous example, submit a groupBy query with a 24-hour duration:
-
Query
{ "queryType": "groupBy", "dataSource": "my_dataSource", "granularity": {"type":"duration","duration":"86400000"}, "dimensions": [ "language" ], "aggregations": [ { "type": "count", "name": "count" } ], "intervals": [ "2000-01-01T00:00Z/3000-01-01T00:00Z" ] }
-
Result
[ { "version" : "v1", "timestamp" : "2013-08-31T00:00:00.000Z", "event" : { "count" : 1, "language" : "en" } }, { "version" : "v1", "timestamp" : "2013-09-01T00:00:00.000Z", "event" : { "count" : 1, "language" : "en" } }, { "version" : "v1", "timestamp" : "2013-09-02T00:00:00.000Z", "event" : { "count" : 1, "language" : "en" } }, { "version" : "v1", "timestamp" : "2013-09-03T00:00:00.000Z", "event" : { "count" : 1, "language" : "en" } } ]
If the granularity's origin is set to 2012-01-01T00:30:00Z:
-
Result
[ { "version" : "v1", "timestamp" : "2013-08-31T00:30:00.000Z", "event" : { "count" : 1, "language" : "en" } }, { "version" : "v1", "timestamp" : "2013-09-01T00:30:00.000Z", "event" : { "count" : 1, "language" : "en" } }, { "version" : "v1", "timestamp" : "2013-09-02T00:30:00.000Z", "event" : { "count" : 1, "language" : "en" } }, { "version" : "v1", "timestamp" : "2013-09-03T00:30:00.000Z", "event" : { "count" : 1, "language" : "en" } } ]
The timestamp of every bucket now starts at minute 30.
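The duration bucketing with an origin can be sketched as arithmetic on millisecond timestamps (an illustrative helper, not part of the manual): the bucket start is the origin plus a whole number of durations.

```python
def bucket_start(ts_ms, duration_ms, origin_ms=0):
    """Start of the duration bucket containing ts_ms, anchored at origin_ms."""
    return origin_ms + ((ts_ms - origin_ms) // duration_ms) * duration_ms

DAY = 86_400_000
HALF_HOUR = 1_800_000

ts = 5 * DAY + 3_600_000                          # 01:00 UTC on day 5
start_default = bucket_start(ts, DAY)             # buckets start at midnight
start_shifted = bucket_start(ts, DAY, HALF_HOUR)  # buckets start at 00:30
```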
[3] period granularity
A period granularity buckets time by any combination of years, months, weeks, hours, minutes, and seconds (in ISO 8601 format), and supports specifying both the time zone in which bucket boundaries are computed and the time zone of the returned timestamps. By default, years start on January 1, months start on the 1st, and weeks start at the beginning of Monday.
The time zone is optional (defaulting to UTC), and the origin is optional (defaulting to 1970-01-01T00:00:00 in the given time zone).
-
Bucketing by two-day periods in the Pacific time zone
"granularity":{ "type": "period", "period": "P2D", "timeZone": "America/Los_Angeles" }
-
Bucketing by three-month periods in the Pacific time zone, starting from February
"granularity":{ "type": "period", "period": "P3M", "timeZone": "America/Los_Angeles", "origin": "2012-02-01T00:00:00-08:00" }
Using the data from the previous example, submit the following groupBy query (Pacific time zone, one-day period):
-
Query
{ "queryType":"groupBy", "dataSource":"my_dataSource", "granularity":{"type": "period", "period": "P1D", "timeZone": "America/Los_Angeles"}, "dimensions":[ "language" ], "aggregations":[ { "type":"count", "name":"count" } ], "intervals":[ "1999-12-31T16:00:00.000-08:00/2999-12-31T16:00:00.000-08:00" ] }
-
Result
[ { "version" : "v1", "timestamp" : "2013-08-30T00:00:00.000-07:00", "event" : { "count" : 1, "language" : "en" } }, { "version" : "v1", "timestamp" : "2013-08-31T00:00:00.000-07:00", "event" : { "count" : 1, "language" : "en" } }, { "version" : "v1", "timestamp" : "2013-09-02T00:00:00.000-07:00", "event" : { "count" : 2, "language" : "en" } } ]
The timestamp of each bucket has been converted to Pacific time. Intervals in a groupBy query are not converted to the specified time zone.
If the granularity's origin is set to 1970-01-01T20:30:00-08:00:
- Query
"granularity":{
"type": "period",
"period": "P1D",
"timeZone": "America/Los_Angeles",
"origin": "1970-01-01T20:30:00-08:00"
}
- Result
[ {
"version" : "v1",
"timestamp" : "2013-08-29T20:30:00.000-07:00",
"event" : {
"count" : 1,
"language" : "en"
}
}, {
"version" : "v1",
"timestamp" : "2013-08-30T20:30:00.000-07:00",
"event" : {
"count" : 1,
"language" : "en"
}
}, {
"version" : "v1",
"timestamp" : "2013-09-01T20:30:00.000-07:00",
"event" : {
"count" : 1,
"language" : "en"
}
}, {
"version" : "v1",
"timestamp" : "2013-09-02T20:30:00.000-07:00",
"event" : {
"count" : 1,
"language" : "en"
}
} ]
The specified origin is independent of the time zone; it only anchors the starting point of each granularity bucket.
7. Dimension specs
[1] default
Returns dimension values as-is, optionally renaming the dimension:
"dimensionSpecs":{
"type" : "default",
"dimension" : "<dimension>",
"outputName": "<output_name>",
"outputType": <"STRING"|"LONG"|"FLOAT">
}
[2] extraction
Returns dimension values transformed by the given extraction function:
"dimensionSpecs":{
"type" : "extraction",
"dimension" : "<dimension>",
"outputName" : "<output_name>",
"outputType": <"STRING"|"LONG"|"FLOAT">,
"extractionFn" : "<extraction_function>"
}
<extraction_function> defines a transformation applied to a dimension column. These transformations can be applied to ordinary (string-typed) dimensions, as well as to the __time dimension of aggregated results (not the raw timestamps; this depends on the time granularity of the aggregation).
If an extraction function that takes string input is used on the __time dimension, the value is first converted to ISO 8601 format.
extractionFn types
-
regex
A regular-expression extraction function; returns the first group matching the given regular expression. If there is no match, the dimension value is returned unchanged:
{ "type": "regex", "expr": "<regular_expression>", "replaceMissingValue": true, "replaceMissingValueWith": "foobar" }
For example, "expr" : "(\w\w\w).*" turns 'Monday', 'Tuesday', 'Wednesday' into 'Mon', 'Tue', 'Wed'.
If replaceMissingValue is set to true, non-matching dimension values are replaced by the replaceMissingValueWith value; if replaceMissingValueWith is not specified, non-matching values are replaced with null.
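The regex extraction behavior, including replaceMissingValue handling, can be mimicked in Python (an illustrative sketch; `re.match` stands in for the Java regex engine):

```python
import re

def regex_extract(value, expr, replace_missing=False, replace_with=None):
    """Return the first capture group; on no match, either the original value
    or (when replace_missing is set) the replacement value."""
    m = re.match(expr, value)
    if m:
        return m.group(1)
    return replace_with if replace_missing else value

day = regex_extract("Monday", r"(\w\w\w).*")                # 'Mon'
miss = regex_extract("??", r"(\w\w\w).*", True, "foobar")   # replaced
keep = regex_extract("??", r"(\w\w\w).*")                   # original kept
```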
-
partial
Returns the original dimension value when it matches the regular expression, otherwise null:
{ "type" : "partial", "expr" : "<regular_expression>" }
-
Search query extraction function
Returns the original dimension value when it matches the given SearchQuerySpec, otherwise null:
{ "type" : "searchQuery", "query" : "<search_query_spec>" }
-
substring
Returns the substring of the dimension value starting at the given position, with the given length. If the requested length extends past the end of the value, the rest of the string after the start index is returned. If the start index is greater than the length of the whole value, null is returned:
{ "type" : "substring", "index" : 1, "length" : 4 }
-
timeFormat
Returns the dimension value formatted according to the given format string, time zone, and locale.
For the __time dimension, it formats the time values of results aggregated by the time granularity; for ordinary dimensions, the function assumes the string is in ISO 8601 date-time format.
{ "type": "timeFormat", "format": "<output_format>", "timeZone": "<time_zone>", "locale": "<locale>", "granularity": "<granularity>" }
- format: the date-time format of the result
- locale: the language and country, e.g. en-US, en-GB, fr-FR, fr-CA
- timeZone: the time zone
- granularity: the time granularity applied before formatting
{ "type": "extraction", "dimension": "__time", "outputName": "dayOfWeek", "extractionFn": { "type": "timeFormat", "format": "EEEE", "timeZone": "America/Montreal", "locale": "fr" } }
-
time
Converts the format of timestamp columns (no conversion to an ISO 8601 string is needed):
{ "type" : "time", "timeFormat" : "<input_format>", "resultFormat" : "<output_format>" }
-
javascript
The JavaScript extraction function returns the dimension value transformed by the given JavaScript function. For ordinary (string-typed) dimensions, the input is passed as a string. For the __time dimension, the input is passed as a number: the milliseconds since 1970-01-01 UTC.
-
Example on an ordinary dimension
{ "type": "javascript", "function": "function(str) { return str.substr(0, 3); }" } { "type": "javascript", "function": "function(str) { return str + '!!!'; }", "injective": true }
The injective property states whether the JavaScript function preserves uniqueness. It defaults to false (not preserved).
-
Example on the __time dimension
{ "type": "javascript", "function": "function(t) { return 'Second ' + Math.floor((t % 60000) / 1000); }" }
-
lookup
Maps dimension values to replacements. A set of key/value pairs can be defined for extraction at query time, where each key is a value to match and the corresponding value is its replacement:
{ "type": "lookup", "lookup": { "type": "map", "map": { "foo": "bar", "baz": "bat" } }, "retainMissingValue": false, "injective": false, "replaceMissingValueWith": "MISSING" } { "type": "lookup", "lookup": { "type": "namespace", "namespace": "some_lookup" }, "replaceMissingValueWith": "Unknown", "injective": false }
Property | Description |
---|---|
lookup | Either namespace or map. A map lookup is passed as part of the query; a namespace lookup is distributed to all query nodes |
retainMissingValue | true or false. true keeps the original value when no match is found in the lookup; false does not keep it |
replaceMissingValueWith | The replacement value used when a value is treated as missing. Setting it to "" is equivalent to setting it to null |
injective | States whether optimization is possible, i.e. assumes the lookup does not combine multiple names into one |

The optimize property (default true) allows lookups based on extraction filters to be optimized. The optimization layer runs on the Broker and rewrites the extraction filter statement as clauses of selector filter statements.
{ "filter": { "type": "selector", "dimension": "product", "value": "bar_1", "extractionFn": { "type": "lookup", "optimize": true, "lookup": { "type": "map", "map": { "product_1": "bar_1", "product_3": "bar_1" } } } } }
is rewritten as
{ "filter": { "type": "or", "fields": [ { "filter": { "type": "selector", "dimension": "product", "value": "product_1" } }, { "filter": { "type": "selector", "dimension": "product", "value": "product_3" } } ] } }
-
registeredLookup
It is recommended to use the lookup dimension spec whenever possible. Any lookup registered for use as a lookup dimension spec can also be used for dimension extraction:
{ "type": "registeredLookup", "lookup": "some_lookup_name", "retainMissingValue": true, "injective": false }
-
cascade
Chains the execution of extraction functions. The extractionFns property is an array of extraction functions, executed in array-index order. The following example chains regex, javascript, and substring; dimension values are transformed by the extraction functions in that order:
{ "type": "cascade", "extractionFns": [ { "type": "regex", "expr": "/([^/]+)/", "replaceMissingValue": false, "replaceMissingValueWith": null }, { "type": "javascript", "function": "function(str) { return \"the \".concat(str) }" }, { "type": "substring", "index": 0, "length": 7 } ] }
-
stringFormat
Returns the dimension value transformed by the given format string:
{ "type": "stringFormat", "format": "<sprintf_expression>", "nullHandling": "<optionalattributeforhandlingnullvalue>" }
To wrap the raw dimension value in "[" and "]", specify "[%s]" as the format string.
-
upper/lower
upper and lower return the dimension value in all upper case or all lower case. A locale can also be specified:
{ "type" : "upper", "locale":"fr" }
The locale option can also be omitted.
-
bucket
Buckets numeric values using the given size and offset parameters. Non-numeric values are converted to null.
size: the bucket size (optional, defaulting to 1)
offset: the bucket offset (optional, defaulting to 0)
The following example creates buckets of length 5 starting at 2: values in the range [2,7) are converted to 2, and values in [7,12) to 7.
{ "type": "bucket", "size": 5, "offset": 2 }
8. Query context
The query context (context) holds additional query parameters and is optional.
Property | Default | Description |
---|---|---|
timeout | 0 (no timeout) | Query timeout in milliseconds; queries that exceed it are cancelled |
priority | 0 | Query priority; higher-priority queries get computation resources first |
queryId | auto-generated | Unique identifier of the query; if the query id is set or known, it can be used to cancel the query |
useCache | true | Whether to use the cache; may be overridden in the Broker or Historical node configuration |
populateCache | true | Whether to store query results in the query cache; may be overridden in the Broker or Historical node configuration |
bySegment | false | Whether to return results grouped by the segments they came from; primarily used for debugging |
finalize | true | Whether to finalize aggregation results; primarily used for debugging. For example, when set to false, hyperUnique aggregators return the full HyperLogLog sketch instead of the estimated cardinality |
chunkPeriod | 0 (off) | At the Broker level, long-interval queries are broken into shorter-interval ones to allow more parallelism than usual. Broken-up queries use more cluster resources but can complete faster |
minTopNThreshold | 1000 | Return the top minTopNThreshold local results from each segment and merge them to determine the global TopN |
maxResults | 500000 | The maximum number of results a groupBy query can process. The default can be changed via the druid.query.groupBy.maxResults property on Broker (query) and Historical (data) nodes; at query time it can only be lowered |
maxIntermediateRows | 50000 | The maximum number of intermediate rows when processing a single segment in a groupBy query. The default can be changed via druid.query.groupBy.maxIntermediateRows on Broker and Historical nodes; at query time it can only be lowered |
groupByIsSingleThreaded | false | Whether to run groupBy queries single-threaded. The default can be changed via druid.query.groupBy.singleThreaded on Broker and Historical nodes |