HQL Execution Flow and Its Mapping to MapReduce

1. Running a statement with EXPLAIN in Hive
Create a test table:

create table dual(key string,value string);

Then run EXPLAIN EXTENDED on a self-join that includes both an equi-join condition and a filter:

hive> explain extended select  a.key*(4+5), b.value  from dual a join dual b on a.key=b.key and a.key>10;
OK
Explain
ABSTRACT SYNTAX TREE:
  
TOK_QUERY
   TOK_FROM
      TOK_JOIN
         TOK_TABREF
            TOK_TABNAME
               dual
            a
         TOK_TABREF
            TOK_TABNAME
               dual
            b
         and
            =
               .
                  TOK_TABLE_OR_COL
                     a
                  key
               .
                  TOK_TABLE_OR_COL
                     b
                  key
            >
               .
                  TOK_TABLE_OR_COL
                     a
                  key
               10
   TOK_INSERT
      TOK_DESTINATION
         TOK_DIR
            TOK_TMP_FILE
      TOK_SELECT
         TOK_SELEXPR
            *
               .
                  TOK_TABLE_OR_COL
                     a
                  key
               +
                  4
                  5
         TOK_SELEXPR
            .
               TOK_TABLE_OR_COL
                  b
               value


STAGE DEPENDENCIES:
  Stage-4 is a root stage
  Stage-3 depends on stages: Stage-4
  Stage-0 depends on stages: Stage-3

STAGE PLANS:
  Stage: Stage-4
    Map Reduce Local Work
      Alias -> Map Local Tables:
        a 
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        a 
          TableScan
            alias: a
            Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
            GatherStats: false
            Filter Operator
              isSamplingPred: false
              predicate: (key > 10) (type: boolean)
              Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
              HashTable Sink Operator
                keys:
                  0 key (type: string)
                  1 key (type: string)
                Position of Big Table: 1

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: b
            Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
            GatherStats: false
            Filter Operator
              isSamplingPred: false
              predicate: (key > 10) (type: boolean)
              Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
              Map Join Operator
                condition map:
                     Inner Join 0 to 1
                keys:
                  0 key (type: string)
                  1 key (type: string)
                outputColumnNames: _col0, _col6
                Position of Big Table: 1
                Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                Select Operator
                  expressions: (_col0 * 9) (type: double), _col6 (type: string)
                  outputColumnNames: _col0, _col1
                  Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                  File Output Operator
                    compressed: false
                    GlobalTableId: 0
                    directory: hdfs://hadoop000:8020/tmp/hive/hadoop/71908556-ae01-4cd2-a13c-171c04687fe4/hive_2019-04-20_18-26-31_164_1873867714665609652-1/-mr-10000/.hive-staging_hive_2019-04-20_18-26-31_164_1873867714665609652-1/-ext-10001
                    NumFilesPerFileSink: 1
                    Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                    Stats Publishing Key Prefix: hdfs://hadoop000:8020/tmp/hive/hadoop/71908556-ae01-4cd2-a13c-171c04687fe4/hive_2019-04-20_18-26-31_164_1873867714665609652-1/-mr-10000/.hive-staging_hive_2019-04-20_18-26-31_164_1873867714665609652-1/-ext-10001/
                    table:
                        input format: org.apache.hadoop.mapred.TextInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                        properties:
                          columns _col0,_col1
                          columns.types double:string
                          escape.delim \
                          hive.serialization.extend.additional.nesting.levels true
                          serialization.format 1
                          serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                    TotalFiles: 1
                    GatherStats: false
                    MultiFileSpray: false
      Local Work:
        Map Reduce Local Work
      Path -> Alias:
        hdfs://hadoop000:8020/user/hive/warehouse/g6_hadoop.db/dual [b]
      Path -> Partition:
        hdfs://hadoop000:8020/user/hive/warehouse/g6_hadoop.db/dual 
          Partition
            base file name: dual
            input format: org.apache.hadoop.mapred.TextInputFormat
            output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
            properties:
              bucket_count -1
              columns key,value
              columns.comments 
              columns.types string:string
              file.inputformat org.apache.hadoop.mapred.TextInputFormat
              file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              location hdfs://hadoop000:8020/user/hive/warehouse/g6_hadoop.db/dual
              name g6_hadoop.dual
              serialization.ddl struct dual { string key, string value}
              serialization.format 1
              serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              transient_lastDdlTime 1555755989
            serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
          
              input format: org.apache.hadoop.mapred.TextInputFormat
              output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              properties:
                bucket_count -1
                columns key,value
                columns.comments 
                columns.types string:string
                file.inputformat org.apache.hadoop.mapred.TextInputFormat
                file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                location hdfs://hadoop000:8020/user/hive/warehouse/g6_hadoop.db/dual
                name g6_hadoop.dual
                serialization.ddl struct dual { string key, string value}
                serialization.format 1
                serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                transient_lastDdlTime 1555755989
              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              name: g6_hadoop.dual
            name: g6_hadoop.dual
      Truncated Path -> Alias:
        /g6_hadoop.db/dual [b]

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

2. Anatomy of query processing
A submitted HQL statement goes through the following stages:

1. The HQL text is parsed into an abstract syntax tree (AST); the AST that EXPLAIN prints is essentially a string rendering of this tree.
2. The semantic analyzer then turns the AST into a QB (Query Block).
3. The logical execution plan is generated from the QB as an Operator Tree.
4. The logical optimizer rewrites that plan into an optimized Operator Tree.
5. The physical execution plan is then generated as a Task Tree.
6. The physical optimizer optimizes the Task Tree, producing the final tasks that actually run on the cluster.

The key stages are the logical and physical optimizers.
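
A quick way to see the optimizers at work is to toggle one optimization and compare plans. The sketch below assumes this Hive version converts the join above into a map join via the standard hive.auto.convert.join setting (which is what produced the local HashTable Sink stage, Stage-4, in the plan above):

set hive.auto.convert.join=true;
-- EXPLAIN shows the map-join plan seen above: a local Stage-4 builds
-- a hash table, and Stage-3 streams the big table through it.
explain select a.key*(4+5), b.value from dual a join dual b on a.key=b.key and a.key>10;

set hive.auto.convert.join=false;
-- The same query now compiles to a common (shuffle) join: both tables
-- are read by mappers and the join happens in a reducer.
explain select a.key*(4+5), b.value from dual a join dual b on a.key=b.key and a.key>10;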

3. Common SQL-on-Hadoop query patterns

1. Single-table analysis: select b, <aggregate function> from a group by b
2. Multi-table join: select a.*, b.* from a join b on a.id = b.id (concrete sketches of both patterns follow below)
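
Written against the dual table created earlier, purely as illustrative sketches, the two patterns look like:

-- Pattern 1: single-table aggregation
select key, count(value) from dual group by key;

-- Pattern 2: two-table equi-join
select a.*, b.* from dual a join dual b on a.key = b.key;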

4. How SQL maps to MapReduce
1. Filter class: e.g. select a.id, a.city from a where a.day='20190420' and a.city='101'. Queries of this class do not trigger a shuffle.
2. Aggregation class: e.g. select a.city, count(1) from a where a.day='20190420' group by a.city. Queries of this class do trigger a shuffle.

Sample data for table infos, which has three columns (id, city_id, category) plus the partition column day:

id  city_id  category  day
1   101      奔驰      20180101
2   101      路虎      20180101
3   102      奔驰      20180101
4   102      奔驰      20180101
5   101      奔驰      20180101
6   101      奔驰      20180101

Filter class:

select a.id,a.city_id from infos a where a.day='20180101' and a.category='奔驰'

[Figure: map-only flow — rows are read, filtered, and written out in the map phase; no shuffle or reduce]
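
One way to verify the no-shuffle claim is to EXPLAIN the query. A minimal check (note that, depending on hive.fetch.task.conversion, the plan may even collapse to a plain Fetch task with no MR job at all):

explain select a.id, a.city_id from infos a where a.day='20180101' and a.category='奔驰';
-- Expect a map-only plan: TableScan -> Filter Operator -> Select Operator
-- -> File Output Operator, with no Reduce Operator Tree.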

Aggregation class:

select a.city_id,count(1) from infos a where a.day='20180101' and a.category='奔驰' group by a.city_id

[Figure: aggregation flow — map emits (city_id, 1), a combiner pre-aggregates locally, then shuffle and reduce produce the final counts]

This is essentially word count (WC). The combiner in the figure above acts as a local, partial reduce; the benefit is that it cuts the volume of data shuffled. A combiner cannot be applied in every scenario, though: a naive average, for example, cannot simply reuse the reduce function as a combiner.
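
On the Hive side, the combiner role is played by map-side aggregation, controlled by the standard hive.map.aggr setting. A minimal sketch (operator names may vary slightly by Hive version):

set hive.map.aggr=true;
explain select a.city_id, count(1) from infos a where a.day='20180101' and a.category='奔驰' group by a.city_id;
-- With map-side aggregation on, the map phase carries a hash-mode
-- Group By Operator that pre-aggregates counts before the shuffle;
-- with it off, raw rows are shuffled and all counting happens in the reducer.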

The MR count flow:

  1. map: split ==> (word, 1)
  2. shuffle: (word, 1) is partitioned and routed to a reducer
  3. reduce: (word, iterable of (1, 1, 1, 1, ...))
     ==> (word, sum(iterable))
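
Expressed back in HiveQL over a hypothetical words table holding one word per row, this flow is exactly the aggregation pattern from section 3:

-- map:     each row emits (word, 1)
-- shuffle: pairs are partitioned by word and routed to reducers
-- reduce:  each reducer sums the 1s for its words
select word, count(1) from words group by word;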