I. Running a statement with EXPLAIN in Hive
Create a test table:
create table dual(key string, value string);
Then run EXPLAIN EXTENDED on a join query:
hive> explain extended select a.key*(4+5), b.value from dual a join dual b on a.key=b.key and a.key>10;
OK
Explain
ABSTRACT SYNTAX TREE:

TOK_QUERY
   TOK_FROM
      TOK_JOIN
         TOK_TABREF
            TOK_TABNAME
               dual
            a
         TOK_TABREF
            TOK_TABNAME
               dual
            b
         and
            =
               .
                  TOK_TABLE_OR_COL
                     a
                  key
               .
                  TOK_TABLE_OR_COL
                     b
                  key
            >
               .
                  TOK_TABLE_OR_COL
                     a
                  key
               10
   TOK_INSERT
      TOK_DESTINATION
         TOK_DIR
            TOK_TMP_FILE
      TOK_SELECT
         TOK_SELEXPR
            *
               .
                  TOK_TABLE_OR_COL
                     a
                  key
               +
                  4
                  5
         TOK_SELEXPR
            .
               TOK_TABLE_OR_COL
                  b
               value
STAGE DEPENDENCIES:
  Stage-4 is a root stage
  Stage-3 depends on stages: Stage-4
  Stage-0 depends on stages: Stage-3

STAGE PLANS:
  Stage: Stage-4
    Map Reduce Local Work
      Alias -> Map Local Tables:
        a
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        a
          TableScan
            alias: a
            Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
            GatherStats: false
            Filter Operator
              isSamplingPred: false
              predicate: (key > 10) (type: boolean)
              Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
              HashTable Sink Operator
                keys:
                  0 key (type: string)
                  1 key (type: string)
                Position of Big Table: 1

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: b
            Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
            GatherStats: false
            Filter Operator
              isSamplingPred: false
              predicate: (key > 10) (type: boolean)
              Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
              Map Join Operator
                condition map:
                     Inner Join 0 to 1
                keys:
                  0 key (type: string)
                  1 key (type: string)
                outputColumnNames: _col0, _col6
                Position of Big Table: 1
                Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                Select Operator
                  expressions: (_col0 * 9) (type: double), _col6 (type: string)
                  outputColumnNames: _col0, _col1
                  Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                  File Output Operator
                    compressed: false
                    GlobalTableId: 0
                    directory: hdfs://hadoop000:8020/tmp/hive/hadoop/71908556-ae01-4cd2-a13c-171c04687fe4/hive_2019-04-20_18-26-31_164_1873867714665609652-1/-mr-10000/.hive-staging_hive_2019-04-20_18-26-31_164_1873867714665609652-1/-ext-10001
                    NumFilesPerFileSink: 1
                    Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                    Stats Publishing Key Prefix: hdfs://hadoop000:8020/tmp/hive/hadoop/71908556-ae01-4cd2-a13c-171c04687fe4/hive_2019-04-20_18-26-31_164_1873867714665609652-1/-mr-10000/.hive-staging_hive_2019-04-20_18-26-31_164_1873867714665609652-1/-ext-10001/
                    table:
                        input format: org.apache.hadoop.mapred.TextInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                        properties:
                          columns _col0,_col1
                          columns.types double:string
                          escape.delim \
                          hive.serialization.extend.additional.nesting.levels true
                          serialization.format 1
                          serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                    TotalFiles: 1
                    GatherStats: false
                    MultiFileSpray: false
      Local Work:
        Map Reduce Local Work
      Path -> Alias:
        hdfs://hadoop000:8020/user/hive/warehouse/g6_hadoop.db/dual [b]
      Path -> Partition:
        hdfs://hadoop000:8020/user/hive/warehouse/g6_hadoop.db/dual
          Partition
            base file name: dual
            input format: org.apache.hadoop.mapred.TextInputFormat
            output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
            properties:
              bucket_count -1
              columns key,value
              columns.comments
              columns.types string:string
              file.inputformat org.apache.hadoop.mapred.TextInputFormat
              file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              location hdfs://hadoop000:8020/user/hive/warehouse/g6_hadoop.db/dual
              name g6_hadoop.dual
              serialization.ddl struct dual { string key, string value}
              serialization.format 1
              serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              transient_lastDdlTime 1555755989
            serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

              input format: org.apache.hadoop.mapred.TextInputFormat
              output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              properties:
                bucket_count -1
                columns key,value
                columns.comments
                columns.types string:string
                file.inputformat org.apache.hadoop.mapred.TextInputFormat
                file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                location hdfs://hadoop000:8020/user/hive/warehouse/g6_hadoop.db/dual
                name g6_hadoop.dual
                serialization.ddl struct dual { string key, string value}
                serialization.format 1
                serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                transient_lastDdlTime 1555755989
              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              name: g6_hadoop.dual
            name: g6_hadoop.dual
      Truncated Path -> Alias:
        /g6_hadoop.db/dual [b]

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
II. Process analysis
An HQL statement goes through the following steps:
1. The HQL is parsed into an abstract syntax tree (AST); the AST is essentially a string.
2. The semantic analyzer then turns the AST into a QB (Query Block).
3. Next comes the logical execution plan: a set of Operator Trees is generated.
4. The logical optimizer optimizes the logical plan and produces optimized Operator Trees.
5. Then comes the physical execution plan: a set of Task Trees is generated.
6. The physical optimizer optimizes the physical plan, producing the final Task Trees, i.e. the tasks that actually run on the cluster.
The key stages are the logical and physical optimizers.
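For reference, the two EXPLAIN forms map onto this pipeline: the statements below are simply the query from section I run both ways.

```sql
-- EXPLAIN shows the stage dependencies and stage plans (the optimized task tree).
explain
select a.key*(4+5), b.value from dual a join dual b on a.key = b.key and a.key > 10;

-- EXPLAIN EXTENDED additionally shows the AST and physical details
-- (HDFS paths, SerDe properties), as in the output above.
explain extended
select a.key*(4+5), b.value from dual a join dual b on a.key = b.key and a.key > 10;
```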
III. Common SQL-on-Hadoop query patterns
1. Single-table analysis: select b, <aggregate function> from a group by b
2. Multi-table join: select a.*, b.* from a join b on a.id = b.id
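As a minimal sketch, both patterns written against the dual(key, value) table created in section I; the grouping and join columns are chosen purely for illustration:

```sql
-- Single-table aggregation: group the rows by value and count them
select value, count(1) as cnt
from dual
group by value;

-- Two-table join: join the table to itself on key
select a.key, b.value
from dual a
join dual b on a.key = b.key;
```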
IV. How SQL maps to a MapReduce flow
1. Filter queries: e.g. select a.id, a.city from a where a.day = '20190420' and a.city = '101'. This type of query does not trigger a shuffle.
2. Aggregation queries: e.g. select a.city, count(1) from a where a.day = '20190420' group by a.city. This type of query does trigger a shuffle.
| id | city_id | category | day |
| --- | --- | --- | --- |
| 1 | 101 | 奔驰 | 20180101 |
| 2 | 101 | 路虎 | 20180101 |
| 3 | 102 | 奔驰 | 20180101 |
| 4 | 102 | 奔驰 | 20180101 |
| 5 | 101 | 奔驰 | 20180101 |
| 6 | 101 | 奔驰 | 20180101 |
The infos table has three columns, id, city_id, and category, plus a partition column day.
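A possible DDL for this table; the column types and the field delimiter are assumptions, since the source only lists the column names:

```sql
-- Hypothetical DDL matching the columns above (types assumed)
create table infos (
  id       int,
  city_id  int,
  category string
)
partitioned by (day string)
row format delimited fields terminated by ',';
```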
Filter query:
select a.id,a.city_id from infos a where a.day='20180101' and a.category='奔驰'
Aggregation query:
select a.city_id,count(1) from infos a where a.day='20180101' and a.category='奔驰' group by a.city_id
In essence this is a word count. The combiner in the diagram above is a local, partial reduce; its benefit is that it cuts down the volume of data shuffled. However, a combiner cannot be applied in every scenario, for example when computing an average.
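A quick numeric illustration (hypothetical values) of why averaging breaks a naive combiner: if mapper 1 sees 10 and 20 and mapper 2 sees 30, a combiner that pre-averages yields avg(15, 30) = 22.5, while the true average is (10 + 20 + 30) / 3 = 20. To make averaging combinable, the combiner must instead emit partial (sum, count) pairs, here (30, 2) and (30, 1), so the reducer computes 60 / 3 = 20.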
The MapReduce flow for a count (word count):
- map: split ==> (word, 1)
- shuffle: (word,1) partitioner ==> reduce
- reduce: (word, iterable of (1, 1, 1, 1, ...)) ==> (word, sum(iterable))
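The same word-count flow expressed in HiveQL, as a rough sketch; the docs table and its line column are hypothetical here:

```sql
-- Hypothetical input table: one line of raw text per row
-- create table docs (line string);

-- Map side: split each line into words; shuffle: partition by word;
-- reduce side: sum the 1s per word
select word, count(1) as cnt
from (
  select explode(split(line, ' ')) as word
  from docs
) t
group by word;
```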