I. Running a statement with EXPLAIN in Hive
Create a test table:
create table dual(key string, value string);
Then run EXPLAIN EXTENDED on a join query:
hive> explain extended select a.key*(4+5), b.value from dual a join dual b on a.key=b.key and a.key>10;
OK
Explain
ABSTRACT SYNTAX TREE:

TOK_QUERY
   TOK_FROM
      TOK_JOIN
         TOK_TABREF
            TOK_TABNAME
               dual
            a
         TOK_TABREF
            TOK_TABNAME
               dual
            b
         and
            =
               .
                  TOK_TABLE_OR_COL
                     a
                  key
               .
                  TOK_TABLE_OR_COL
                     b
                  key
            >
               .
                  TOK_TABLE_OR_COL
                     a
                  key
               10
   TOK_INSERT
      TOK_DESTINATION
         TOK_DIR
            TOK_TMP_FILE
      TOK_SELECT
         TOK_SELEXPR
            *
               .
                  TOK_TABLE_OR_COL
                     a
                  key
               +
                  4
                  5
         TOK_SELEXPR
            .
               TOK_TABLE_OR_COL
                  b
               value
STAGE DEPENDENCIES:
  Stage-4 is a root stage
  Stage-3 depends on stages: Stage-4
  Stage-0 depends on stages: Stage-3

STAGE PLANS:
  Stage: Stage-4
    Map Reduce Local Work
      Alias -> Map Local Tables:
        a
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        a
          TableScan
            alias: a
            Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
            GatherStats: false
            Filter Operator
              isSamplingPred: false
              predicate: (key > 10) (type: boolean)
              Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
              HashTable Sink Operator
                keys:
                  0 key (type: string)
                  1 key (type: string)
                Position of Big Table: 1

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: b
            Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
            GatherStats: false
            Filter Operator
              isSamplingPred: false
              predicate: (key > 10) (type: boolean)
              Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
              Map Join Operator
                condition map:
                     Inner Join 0 to 1
                keys:
                  0 key (type: string)
                  1 key (type: string)
                outputColumnNames: _col0, _col6
                Position of Big Table: 1
                Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                Select Operator
                  expressions: (_col0 * 9) (type: double), _col6 (type: string)
                  outputColumnNames: _col0, _col1
                  Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                  File Output Operator
                    compressed: false
                    GlobalTableId: 0
                    directory: hdfs://hadoop000:8020/tmp/hive/hadoop/71908556-ae01-4cd2-a13c-171c04687fe4/hive_2019-04-20_18-26-31_164_1873867714665609652-1/-mr-10000/.hive-staging_hive_2019-04-20_18-26-31_164_1873867714665609652-1/-ext-10001
                    NumFilesPerFileSink: 1
                    Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                    Stats Publishing Key Prefix: hdfs://hadoop000:8020/tmp/hive/hadoop/71908556-ae01-4cd2-a13c-171c04687fe4/hive_2019-04-20_18-26-31_164_1873867714665609652-1/-mr-10000/.hive-staging_hive_2019-04-20_18-26-31_164_1873867714665609652-1/-ext-10001/
                    table:
                        input format: org.apache.hadoop.mapred.TextInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                        properties:
                          columns _col0,_col1
                          columns.types double:string
                          escape.delim \
                          hive.serialization.extend.additional.nesting.levels true
                          serialization.format 1
                          serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                    TotalFiles: 1
                    GatherStats: false
                    MultiFileSpray: false
      Local Work:
        Map Reduce Local Work
      Path -> Alias:
        hdfs://hadoop000:8020/user/hive/warehouse/g6_hadoop.db/dual [b]
      Path -> Partition:
        hdfs://hadoop000:8020/user/hive/warehouse/g6_hadoop.db/dual
          Partition
            base file name: dual
            input format: org.apache.hadoop.mapred.TextInputFormat
            output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
            properties:
              bucket_count -1
              columns key,value
              columns.comments
              columns.types string:string
              file.inputformat org.apache.hadoop.mapred.TextInputFormat
              file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              location hdfs://hadoop000:8020/user/hive/warehouse/g6_hadoop.db/dual
              name g6_hadoop.dual
              serialization.ddl struct dual { string key, string value}
              serialization.format 1
              serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              transient_lastDdlTime 1555755989
            serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

              input format: org.apache.hadoop.mapred.TextInputFormat
              output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              properties:
                bucket_count -1
                columns key,value
                columns.comments
                columns.types string:string
                file.inputformat org.apache.hadoop.mapred.TextInputFormat
                file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                location hdfs://hadoop000:8020/user/hive/warehouse/g6_hadoop.db/dual
                name g6_hadoop.dual
                serialization.ddl struct dual { string key, string value}
                serialization.format 1
                serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                transient_lastDdlTime 1555755989
              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              name: g6_hadoop.dual
            name: g6_hadoop.dual
      Truncated Path -> Alias:
        /g6_hadoop.db/dual [b]

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
II. Process analysis
An HQL statement goes through the following steps:
1. The HQL is parsed into an abstract syntax tree (AST); the AST is essentially a string.
2. The semantic analyzer then turns the AST into a QB (Query Block).
3. Next comes the logical execution plan: a set of Operator Trees is generated.
4. The logical optimizer optimizes the logical plan and produces optimized Operator Trees.
5. Then comes the physical execution plan: a set of Task Trees is generated.
6. The physical optimizer optimizes the physical plan, producing the final Task Trees, i.e. the tasks that actually run on the cluster.
The key stages are the logical and physical optimizers.
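For reference, the two EXPLAIN forms map onto this pipeline: the statements below are simply the query from section I run both ways.

```sql
-- EXPLAIN shows the stage dependencies and stage plans (the optimized task tree).
explain
select a.key*(4+5), b.value from dual a join dual b on a.key = b.key and a.key > 10;

-- EXPLAIN EXTENDED additionally shows the AST and physical details
-- (HDFS paths, SerDe properties), as in the output above.
explain extended
select a.key*(4+5), b.value from dual a join dual b on a.key = b.key and a.key > 10;
```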
III. Common SQL-on-Hadoop query patterns
1. Single-table analysis: select b, <aggregate function> from a group by b
2. Multi-table join: select a.*, b.* from a join b on a.id = b.id
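As a minimal sketch, both patterns written against the dual(key, value) table created in section I; the grouping and join columns are chosen purely for illustration:

```sql
-- Single-table aggregation: group the rows by value and count them
select value, count(1) as cnt
from dual
group by value;

-- Two-table join: join the table to itself on key
select a.key, b.value
from dual a
join dual b on a.key = b.key;
```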
IV. How SQL maps to a MapReduce flow
1. Filter queries: e.g. select a.id, a.city from a where a.day = '20190420' and a.city = '101'. This type of query does not trigger a shuffle.
2. Aggregation queries: e.g. select a.city, count(1) from a where a.day = '20190420' group by a.city. This type of query does trigger a shuffle.
| id | city_id | category | day |
| --- | --- | --- | --- |
| 1 | 101 | 奔驰 | 20180101 |
| 2 | 101 | 路虎 | 20180101 |
| 3 | 102 | 奔驰 | 20180101 |
| 4 | 102 | 奔驰 | 20180101 |
| 5 | 101 | 奔驰 | 20180101 |
| 6 | 101 | 奔驰 | 20180101 |
The infos table has three columns, id, city_id, and category, plus a partition column day.
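A possible DDL for this table; the column types and the field delimiter are assumptions, since the source only lists the column names:

```sql
-- Hypothetical DDL matching the columns above (types assumed)
create table infos (
  id       int,
  city_id  int,
  category string
)
partitioned by (day string)
row format delimited fields terminated by ',';
```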
Filter query:
select a.id,a.city_id from infos a where a.day='20180101' and a.category='奔驰'
Aggregation query:
select a.city_id,count(1) from infos a where a.day='20180101' and a.category='奔驰' group by a.city_id
In essence this is a word count. The combiner in the diagram above is a local, partial reduce; its benefit is that it cuts down the volume of data shuffled. However, a combiner cannot be applied in every scenario, for example when computing an average.
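A quick numeric illustration (hypothetical values) of why averaging breaks a naive combiner: if mapper 1 sees 10 and 20 and mapper 2 sees 30, a combiner that pre-averages yields avg(15, 30) = 22.5, while the true average is (10 + 20 + 30) / 3 = 20. To make averaging combinable, the combiner must instead emit partial (sum, count) pairs, here (30, 2) and (30, 1), so the reducer computes 60 / 3 = 20.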
The MapReduce flow for a count (word count):
- map: split ==> (word, 1)
- shuffle: (word,1) partitioner ==> reduce
- reduce: (word, iterable of (1, 1, 1, 1, ...)) ==> (word, sum(iterable))
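The same word-count flow expressed in HiveQL, as a rough sketch; the docs table and its line column are hypothetical here:

```sql
-- Hypothetical input table: one line of raw text per row
-- create table docs (line string);

-- Map side: split each line into words; shuffle: partition by word;
-- reduce side: sum the 1s per word
select word, count(1) as cnt
from (
  select explode(split(line, ' ')) as word
  from docs
) t
group by word;
```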