大数据开发之Hive篇5-Hive数据查询语言

备注:
Hive 版本 2.1.1

一.Hive SELECT(数据查询语言)概述

select语句是Hive中使用的最频繁,也是语法最为复杂的语句。select语句很多语法与传统的关系型数据库类似,这也就给从传统数据库转大数据hive数据仓库提供了便利。

语法:

[WITH CommonTableExpression (, CommonTableExpression)*]    (Note: Only available starting with Hive 0.13.0)
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
  FROM table_reference
  [WHERE where_condition]
  [GROUP BY col_list]
  [ORDER BY col_list]
  [CLUSTER BY col_list
    | [DISTRIBUTE BY col_list] [SORT BY col_list]
  ]
 [LIMIT [offset,] rows]

测试数据准备:

drop table if exists dept;
drop table if exists emp;
drop table if exists bonus;
drop table if exists salgrade;


create table DEPT
( deptno INT,
  dname  VARCHAR(14),
  loc    VARCHAR(13)
); 

insert into DEPT(deptno, dname, loc) values ('10', 'ACCOUNTING', 'NEW YORK');

insert into DEPT(deptno, dname, loc) values ('20', 'RESEARCH', 'DALLAS');

insert into DEPT(deptno, dname, loc) values ('30', 'SALES', 'CHICAGO');

insert into DEPT(deptno, dname, loc) values ('40', 'OPERATIONS', 'BOSTON');



-- Create table
create table EMP
(
  empno    INT,
  ename    VARCHAR(10),
  job      VARCHAR(9),
  mgr      INT,
  hiredate DATE,
  sal      decimal(7,2),
  comm     decimal(7,2),
  deptno   INT
) ;

insert into EMP(empno, ename, job, mgr, hiredate, sal, comm, deptno)
values ('7369', 'SMITH', 'CLERK', '7902','1980-12-17', '800', null, '20');

insert into EMP(empno, ename, job, mgr, hiredate, sal, comm, deptno)
values ('7499', 'ALLEN', 'SALESMAN', '7698', '1981-02-20', '1600', '300', '30');

insert into EMP(empno, ename, job, mgr, hiredate, sal, comm, deptno)
values ('7521', 'WARD', 'SALESMAN', '7698', '1981-02-22', '1250', '500', '30');

insert into EMP(empno, ename, job, mgr, hiredate, sal, comm, deptno)
values ('7566', 'JONES', 'MANAGER', '7839', '1981-04-02', '2975', null, '20');

insert into EMP(empno, ename, job, mgr, hiredate, sal, comm, deptno)
values ('7654', 'MARTIN', 'SALESMAN', '7698', '1981-09-28', '1250', '1400', '30');

insert into EMP(empno, ename, job, mgr, hiredate, sal, comm, deptno)
values ('7698', 'BLAKE', 'MANAGER', '7839', '1981-05-01', '2850', null, '30');

insert into EMP(empno, ename, job, mgr, hiredate, sal, comm, deptno)
values ('7782', 'CLARK', 'MANAGER', '7839', '1981-06-09', '2450', null, '10');

insert into EMP(empno, ename, job, mgr, hiredate, sal, comm, deptno)
values ('7788', 'SCOTT', 'ANALYST', '7566', '1987-06-13', '3000', null, '20');

insert into EMP(empno, ename, job, mgr, hiredate, sal, comm, deptno)
values ('7839', 'KING', 'PRESIDENT', null, '1981-11-17', '5000', null, '10');

insert into EMP(empno, ename, job, mgr, hiredate, sal, comm, deptno)
values ('7844', 'TURNER', 'SALESMAN', '7698', '1981-09-08', '1500', '0', '30');

insert into EMP(empno, ename, job, mgr, hiredate, sal, comm, deptno)
values ('7876', 'ADAMS', 'CLERK', '7788', '1987-06-13', '1100', null, '20');

insert into EMP(empno, ename, job, mgr, hiredate, sal, comm, deptno)
values ('7900', 'JAMES', 'CLERK', '7698', '1981-12-03', '950', null, '30');

insert into EMP(empno, ename, job, mgr, hiredate, sal, comm, deptno)
values ('7902', 'FORD', 'ANALYST', '7566', '1981-12-03', '3000', null, '20');

insert into EMP(empno, ename, job, mgr, hiredate, sal, comm, deptno)
values ('7934', 'MILLER', 'CLERK', '7782', '1982-01-23', '1300', null, '10');



create table SALGRADE
(
  grade INT,
  losal INT,
  hisal INT
) ;

insert into SALGRADE(grade, losal, hisal)
values ('1', '700', '1200');

insert into SALGRADE(grade, losal, hisal)
values ('2', '1201', '1400');

insert into SALGRADE(grade, losal, hisal)
values ('3', '1401', '2000');

insert into SALGRADE(grade, losal, hisal)
values ('4', '2001', '3000');

insert into SALGRADE(grade, losal, hisal)
values ('5', '3001', '9999');


create table BONUS
(
  ename VARCHAR(10),
  job   VARCHAR(9),
  sal   INT,
  comm  INT
) ;

二.Select的几个简单例子

打开hive的列头输出:

set hive.cli.print.header=true;

下面以几个简单的例子来认识下select语句

2.1 表别名

当select中有多个表时,可以给表一个别名,一些可视化工具,例如DBeaver中,有别名可以直接带出相应的列。

select ename,sal from emp;
select e.ename,e.sal from emp e;

测试记录:

hive> 
    > select ename,sal from emp;
OK
ename   sal
SMITH   800.00
ALLEN   1600.00
ADAMS   1100.00
JAMES   950.00
FORD    3000.00
MILLER  1300.00
WARD    1250.00
JONES   2975.00
MARTIN  1250.00
BLAKE   2850.00
CLARK   2450.00
SCOTT   3000.00
KING    5000.00
TURNER  1500.00
Time taken: 0.073 seconds, Fetched: 14 row(s)
hive> select e.ename,e.sal from emp e;
OK
e.ename e.sal
SMITH   800.00
ALLEN   1600.00
ADAMS   1100.00
JAMES   950.00
FORD    3000.00
MILLER  1300.00
WARD    1250.00
JONES   2975.00
MARTIN  1250.00
BLAKE   2850.00
CLARK   2450.00
SCOTT   3000.00
KING    5000.00
TURNER  1500.00
Time taken: 0.07 seconds, Fetched: 14 row(s)

2.2 字段值的计算

实际的数据开发过程中,有时候会对表的列进行一些的计算。

代码:

SELECT count(*), avg(sal) FROM emp;

测试记录:

hive> 
    > SELECT count(*), avg(sal) FROM emp;
Query ID = root_20201204164822_8e00e473-82d2-406c-af03-f6236729d963
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1606698967173_0088, Tracking URL = http://hp1:8088/proxy/application_1606698967173_0088/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1606698967173_0088
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
2020-12-04 16:48:30,047 Stage-1 map = 0%,  reduce = 0%
2020-12-04 16:48:36,234 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.25 sec
2020-12-04 16:48:41,388 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 6.31 sec
MapReduce Total cumulative CPU time: 6 seconds 310 msec
Ended Job = job_1606698967173_0088
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 2  Reduce: 1   Cumulative CPU: 6.31 sec   HDFS Read: 17740 HDFS Write: 114 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 310 msec
OK
_c0     _c1
14      2073.214286
Time taken: 19.467 seconds, Fetched: 1 row(s)

2.3 字段别名

我们看到上面的例子,没有给字段加别名,导致系统随机的给了_c0 _c1这样的。
除了可以给表别名之外,还可以给列加别名。

代码:

SELECT count(*) as emp_count, avg(sal) as avg_salary FROM emp;

测试记录:

hive> 
    > SELECT count(*) as emp_count, avg(sal) as avg_salary FROM emp;
Query ID = root_20201204165143_ce11cb13-1464-4b7c-8e65-327395c82bed
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1606698967173_0089, Tracking URL = http://hp1:8088/proxy/application_1606698967173_0089/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1606698967173_0089
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
2020-12-04 16:51:49,569 Stage-1 map = 0%,  reduce = 0%
2020-12-04 16:51:55,759 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.17 sec
2020-12-04 16:52:01,934 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 6.4 sec
MapReduce Total cumulative CPU time: 6 seconds 400 msec
Ended Job = job_1606698967173_0089
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 2  Reduce: 1   Cumulative CPU: 6.4 sec   HDFS Read: 17733 HDFS Write: 114 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 400 msec
OK
emp_count       avg_salary
14      2073.214286
Time taken: 19.514 seconds, Fetched: 1 row(s)

2.4 LIMIT

LIMIT用于限制输出的行,例如我只想输出10行,就用limit 10

代码:

SELECT ename, sal FROM emp LIMIT 10;

测试记录:

hive> 
    > SELECT ename, sal FROM emp LIMIT 10;
OK
ename   sal
SMITH   800.00
ALLEN   1600.00
ADAMS   1100.00
JAMES   950.00
FORD    3000.00
MILLER  1300.00
WARD    1250.00
JONES   2975.00
MARTIN  1250.00
BLAKE   2850.00
Time taken: 0.077 seconds, Fetched: 10 row(s)

2.5 FROM子查询

有时候逻辑相对而言比较复杂,需要用到from 子查询语句。

代码:

SELECT ename,sal
from 
(select ename,sal from emp) e;

测试记录:

hive> 
    > SELECT ename,sal
    > from 
    > (select ename,sal from emp) e;
OK
ename   sal
SMITH   800.00
ALLEN   1600.00
ADAMS   1100.00
JAMES   950.00
FORD    3000.00
MILLER  1300.00
WARD    1250.00
JONES   2975.00
MARTIN  1250.00
BLAKE   2850.00
CLARK   2450.00
SCOTT   3000.00
KING    5000.00
TURNER  1500.00
Time taken: 0.069 seconds, Fetched: 14 row(s)

2.6 case when 判断

假设我此时需要根据薪酬来判断薪资的级别,可以通过case when语句进行判断。

代码:

select ename,
       sal,
       case when sal >= 3000 then 'High SAL'
            when sal >= 2000 and sal < 3000 then 'Middle SAL'
            else 'Low SAL'
       end as sal_grade
  from emp;

测试记录:

hive> 
    > select ename,
    >        sal,
    >        case when sal >= 3000 then 'High SAL'
    >             when sal >= 2000 and sal < 3000 then 'Middle SAL'
    >             else 'Low SAL'
    >        end as sal_grade
    >   from emp;
Query ID = root_20201204165914_7229d9ea-b045-423b-a240-b04e6e8276e1
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1606698967173_0090, Tracking URL = http://hp1:8088/proxy/application_1606698967173_0090/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1606698967173_0090
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
2020-12-04 16:59:21,504 Stage-1 map = 0%,  reduce = 0%
2020-12-04 16:59:28,775 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 6.75 sec
MapReduce Total cumulative CPU time: 6 seconds 750 msec
Ended Job = job_1606698967173_0090
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 2   Cumulative CPU: 6.75 sec   HDFS Read: 13014 HDFS Write: 660 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 750 msec
OK
ename   sal     sal_grade
ALLEN   1600.00 Low SAL
ADAMS   1100.00 Low SAL
MILLER  1300.00 Low SAL
WARD    1250.00 Low SAL
MARTIN  1250.00 Low SAL
BLAKE   2850.00 Middle SAL
SCOTT   3000.00 High SAL
SMITH   800.00  Low SAL
JAMES   950.00  Low SAL
FORD    3000.00 High SAL
JONES   2975.00 Middle SAL
CLARK   2450.00 Middle SAL
KING    5000.00 High SAL
TURNER  1500.00 Low SAL
Time taken: 15.382 seconds, Fetched: 14 row(s)
hive> 

2.7 where过滤

此时我只想看dept 为10的员工信息,并不想看所有,此时可以通过where子句进行过滤

代码:

select ename,sal from emp where deptno = '10';

测试记录:

hive> select ename,sal from emp where deptno = '10';
Query ID = root_20201204170244_78b7dc0a-5e43-4183-bd47-9379092687cc
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1606698967173_0091, Tracking URL = http://hp1:8088/proxy/application_1606698967173_0091/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1606698967173_0091
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
2020-12-04 17:02:52,315 Stage-1 map = 0%,  reduce = 0%
2020-12-04 17:02:58,494 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 6.73 sec
MapReduce Total cumulative CPU time: 6 seconds 730 msec
Ended Job = job_1606698967173_0091
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 2   Cumulative CPU: 6.73 sec   HDFS Read: 12664 HDFS Write: 252 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 730 msec
OK
ename   sal
MILLER  1300.00
CLARK   2450.00
KING    5000.00
Time taken: 14.775 seconds, Fetched: 3 row(s)

2.8 group by分组

加上此时我想知道每个部门的平均工资,此时可以通过group by分组语句实现

代码:

select deptno,avg(sal) from emp group by deptno;

测试记录:

hive> 
    > select deptno,avg(sal) from emp group by deptno;
Query ID = root_20201204170424_288f6ce3-a3ee-4f7c-99b9-5634269bb613
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1606698967173_0092, Tracking URL = http://hp1:8088/proxy/application_1606698967173_0092/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1606698967173_0092
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
2020-12-04 17:04:31,900 Stage-1 map = 0%,  reduce = 0%
2020-12-04 17:04:39,100 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.19 sec
2020-12-04 17:04:44,246 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 6.29 sec
MapReduce Total cumulative CPU time: 6 seconds 290 msec
Ended Job = job_1606698967173_0092
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 2  Reduce: 1   Cumulative CPU: 6.29 sec   HDFS Read: 17311 HDFS Write: 168 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 290 msec
OK
deptno  _c1
10      2916.666667
20      2175.000000
30      1566.666667
Time taken: 20.501 seconds, Fetched: 3 row(s)
hive> 

2.9 Having子句

HAVING用于约束结果集,只给出符合HAVING条件的结果

代码:

select deptno,avg(sal) avg_sal from emp group by deptno having avg(sal) > 2000;

测试记录:

hive> 
    > select deptno,avg(sal) avg_sal from emp group by deptno having avg(sal) > 2000;
Query ID = root_20201204170622_ee515280-33b8-4cf7-af56-a1cdb9731d38
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1606698967173_0093, Tracking URL = http://hp1:8088/proxy/application_1606698967173_0093/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1606698967173_0093
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
2020-12-04 17:06:29,633 Stage-1 map = 0%,  reduce = 0%
2020-12-04 17:06:36,835 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.31 sec
2020-12-04 17:06:43,012 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 7.56 sec
MapReduce Total cumulative CPU time: 7 seconds 560 msec
Ended Job = job_1606698967173_0093
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 2  Reduce: 1   Cumulative CPU: 7.56 sec   HDFS Read: 17746 HDFS Write: 141 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 560 msec
OK
deptno  avg_sal
10      2916.666667
20      2175.000000
Time taken: 21.566 seconds, Fetched: 2 row(s)
hive> 

三.Hive的join

Hive的join与关系型数据库的类似,但是要注意,hive不支持非等值连接。

语法:

join_table:
    table_reference [INNER] JOIN table_factor [join_condition]
  | table_reference {LEFT|RIGHT|FULL} [OUTER] JOIN table_reference join_condition
  | table_reference LEFT SEMI JOIN table_reference join_condition
  | table_reference CROSS JOIN table_reference [join_condition] (as of Hive 0.10)

table_reference:
    table_factor
  | join_table

table_factor:
    tbl_name [alias]
  | table_subquery alias
  | ( table_references )

join_condition:
    ON expression

hive 支持如下连接:

INNER JOIN
LEFT OUTER JOIN
RIGHT OUTER JOIN
FULL OUTER JOIN
LEFT SEMI JOIN

Join MapReduce实现

image.png

测试数据:

create table t1(id int);
create table t2(id int);

insert into t1 values(1);
insert into t1 values(2);

insert into t2 values(2);
insert into t2 values(3);

3.1 内连接 --求交集

image.png
select t1.id,t2.id id2
from t1
inner join t2
     on t1.id = t2.id;
--或者
select t1.id,t2.id id2
from t1,t2
where t1.id = t2.id;
hive> 
    > select t1.id,t2.id id2
    > from t1
    > inner join t2
    >      on t1.id = t2.id;
Query ID = root_20201204191120_3861be11-a271-4bc4-b240-a5ec96b9794e
Total jobs = 1
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1606698967173_0114, Tracking URL = http://hp1:8088/proxy/application_1606698967173_0114/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1606698967173_0114
Hadoop job information for Stage-3: number of mappers: 2; number of reducers: 0
2020-12-04 19:11:34,556 Stage-3 map = 0%,  reduce = 0%
2020-12-04 19:11:42,800 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 6.34 sec
MapReduce Total cumulative CPU time: 6 seconds 340 msec
Ended Job = job_1606698967173_0114
MapReduce Jobs Launched: 
Stage-Stage-3: Map: 2   Cumulative CPU: 6.34 sec   HDFS Read: 12447 HDFS Write: 190 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 340 msec
OK
t1.id   id2
2       2
Time taken: 23.749 seconds, Fetched: 1 row(s)
hive> select t1.id,t2.id id2
    > from t1,t2
    > where t1.id = t2.id;
Query ID = root_20201204191146_2e3e53c9-5f29-4d17-84f6-8222f5b4c742
Total jobs = 1
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1606698967173_0115, Tracking URL = http://hp1:8088/proxy/application_1606698967173_0115/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1606698967173_0115
Hadoop job information for Stage-3: number of mappers: 2; number of reducers: 0
2020-12-04 19:12:01,580 Stage-3 map = 0%,  reduce = 0%
2020-12-04 19:12:07,771 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 6.67 sec
MapReduce Total cumulative CPU time: 6 seconds 670 msec
Ended Job = job_1606698967173_0115
MapReduce Jobs Launched: 
Stage-Stage-3: Map: 2   Cumulative CPU: 6.67 sec   HDFS Read: 12773 HDFS Write: 190 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 670 msec
OK
t1.id   id2
2       2
Time taken: 22.941 seconds, Fetched: 1 row(s)
hive> 

3.2 左连接 --求A的全部

image.png
select t1.id,t2.id id2
from t1
left join t2
  on t1.id = t2.id;

hive> select t1.id,t2.id id2
    > from t1
    > left join t2
    >   on t1.id = t2.id;
Query ID = root_20201204191035_0e063217-a8b4-4669-8a30-5e1be3e903eb
Total jobs = 1
WARNING: Use "yarn jar" to launch YARN applications.
2020-12-04 19:10:42     Uploaded 1 File to: file:/tmp/root/e8ed4bca-fbbf-4db0-b223-33d23b3bbc3a/hive_2020-12-04_19-10-35_645_5302676112925292414-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile91--.hashtable (296 bytes)
2020-12-04 19:10:42     End of local task; Time Taken: 0.963 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1606698967173_0113, Tracking URL = http://hp1:8088/proxy/application_1606698967173_0113/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1606698967173_0113
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
2020-12-04 19:10:49,515 Stage-3 map = 0%,  reduce = 0%
2020-12-04 19:10:56,723 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 1.98 sec
MapReduce Total cumulative CPU time: 1 seconds 980 msec
Ended Job = job_1606698967173_0113
MapReduce Jobs Launched: 
Stage-Stage-3: Map: 1   Cumulative CPU: 1.98 sec   HDFS Read: 6112 HDFS Write: 120 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 980 msec
OK
t1.id   id2
1       NULL
2       2
Time taken: 22.147 seconds, Fetched: 2 row(s)

3.3 左连接 --实现A-B的差集

image.png
select t1.id,t2.id id2
from t1
left join t2
  on t1.id = t2.id
where t2.id is null;
hive> select t1.id,t2.id id2
    > from t1
    > left join t2
    >   on t1.id = t2.id
    > where t2.id is null;
Query ID = root_20201204190954_8ea563bb-c5e6-4d00-8262-ed1264c1c1cc
Total jobs = 1
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1606698967173_0112, Tracking URL = http://hp1:8088/proxy/application_1606698967173_0112/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1606698967173_0112
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
2020-12-04 19:10:08,983 Stage-3 map = 0%,  reduce = 0%
2020-12-04 19:10:15,161 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 3.0 sec
MapReduce Total cumulative CPU time: 3 seconds 0 msec
Ended Job = job_1606698967173_0112
MapReduce Jobs Launched: 
Stage-Stage-3: Map: 1   Cumulative CPU: 3.0 sec   HDFS Read: 6586 HDFS Write: 104 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 0 msec
OK
t1.id   id2
1       NULL
Time taken: 22.658 seconds, Fetched: 1 row(s)

3.4 全连接 -- A union B 求合集

image.png
select t1.id id1,t2.id id2
from t1
full join t2
on t1.id = t2.id
hive> 
    > select t1.id id1,t2.id id2
    > from t1
    > full join t2
    > on t1.id = t2.id
    > ;
Query ID = root_20201204190853_888f4198-8453-4c53-b8ce-bc06c59ebc6a
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1606698967173_0111, Tracking URL = http://hp1:8088/proxy/application_1606698967173_0111/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1606698967173_0111
Hadoop job information for Stage-1: number of mappers: 3; number of reducers: 1
2020-12-04 19:09:00,901 Stage-1 map = 0%,  reduce = 0%
2020-12-04 19:09:07,088 Stage-1 map = 67%,  reduce = 0%, Cumulative CPU 3.77 sec
2020-12-04 19:09:10,186 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 5.55 sec
2020-12-04 19:09:15,339 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 7.94 sec
MapReduce Total cumulative CPU time: 7 seconds 940 msec
Ended Job = job_1606698967173_0111
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 3  Reduce: 1   Cumulative CPU: 7.94 sec   HDFS Read: 17569 HDFS Write: 137 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 940 msec
OK
id1     id2
1       NULL
2       2
NULL    3
Time taken: 22.535 seconds, Fetched: 3 row(s)

3.5 全连接实现-去交集

image.png
select t1.id id1,t2.id id2
from t1
left join t2
on t1.id = t2.id
where t2.id is null
union all
select t1.id id1,t2.id id2
from t1
right join t2
on t1.id = t2.id
where t1.id is null;
hive> 
    > select t1.id id1,t2.id id2
    > from t1
    > left join t2
    > on t1.id = t2.id
    > where t2.id is null
    > union all
    > select t1.id id1,t2.id id2
    > from t1
    > right join t2
    > on t1.id = t2.id
    > where t1.id is null;
Query ID = root_20201204190745_d1e37397-4a04-44b5-920b-cc9e3327d6ac
Total jobs = 1
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1606698967173_0110, Tracking URL = http://hp1:8088/proxy/application_1606698967173_0110/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1606698967173_0110
Hadoop job information for Stage-2: number of mappers: 3; number of reducers: 0
2020-12-04 19:07:59,931 Stage-2 map = 0%,  reduce = 0%
2020-12-04 19:08:08,176 Stage-2 map = 67%,  reduce = 0%, Cumulative CPU 5.88 sec
2020-12-04 19:08:12,287 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 8.79 sec
MapReduce Total cumulative CPU time: 8 seconds 790 msec
Ended Job = job_1606698967173_0110
MapReduce Jobs Launched: 
Stage-Stage-2: Map: 3   Cumulative CPU: 8.79 sec   HDFS Read: 23996 HDFS Write: 295 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 8 seconds 790 msec
OK
_u1.id1 _u1.id2
1       NULL
NULL    3
Time taken: 27.58 seconds, Fetched: 2 row(s)

3.6 右连接实现-B-A 求差集

image.png
select t1.id,t2.id id2
from t1
right join t2
  on t1.id = t2.id
where t1.id is null;
hive> select t1.id,t2.id id2
    > from t1
    > right join t2
    >   on t1.id = t2.id
    > where t1.id is null;
Query ID = root_20201204190148_850cffa0-f440-4feb-b85f-6d014f9c6c3f
Total jobs = 1
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1606698967173_0105, Tracking URL = http://hp1:8088/proxy/application_1606698967173_0105/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1606698967173_0105
Hadoop job information for Stage-3: number of mappers: 2; number of reducers: 0
2020-12-04 19:02:03,141 Stage-3 map = 0%,  reduce = 0%
2020-12-04 19:02:09,326 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 6.04 sec
MapReduce Total cumulative CPU time: 6 seconds 40 msec
Ended Job = job_1606698967173_0105
MapReduce Jobs Launched: 
Stage-Stage-3: Map: 2   Cumulative CPU: 6.04 sec   HDFS Read: 13009 HDFS Write: 191 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 40 msec
OK
t1.id   id2
NULL    3
Time taken: 22.345 seconds, Fetched: 1 row(s)
hive> 

3.7 右连接 --求B的全部

image.png
select t1.id,t2.id id2
from t1
right join t2
 on t1.id = t2.id;
hive> select t1.id,t2.id id2
    > from t1
    > right join t2
    >  on t1.id = t2.id;
Query ID = root_20201204190106_2d049d88-62e4-4e51-88e4-f005248cff60
Total jobs = 1
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
2020-12-04 19:01:12     Starting to launch local task to process map join;      maximum memory = 1908932608
2020-12-04 19:01:13     Dump the side-table for tag: 0 with group count: 2 into file: file:/tmp/root/e8ed4bca-fbbf-4db0-b223-33d23b3bbc3a/hive_2020-12-04_19-01-06_491_8753533712871347988-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile40--.hashtable
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1606698967173_0104, Tracking URL = http://hp1:8088/proxy/application_1606698967173_0104/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1606698967173_0104
Hadoop job information for Stage-3: number of mappers: 2; number of reducers: 0
2020-12-04 19:01:20,696 Stage-3 map = 0%,  reduce = 0%
2020-12-04 19:01:26,882 Stage-3 map = 50%,  reduce = 0%, Cumulative CPU 1.97 sec
2020-12-04 19:01:27,911 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 3.95 sec
MapReduce Total cumulative CPU time: 3 seconds 950 msec
Ended Job = job_1606698967173_0104
MapReduce Jobs Launched: 
Stage-Stage-3: Map: 2   Cumulative CPU: 3.95 sec   HDFS Read: 12061 HDFS Write: 207 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 950 msec
OK
t1.id   id2
2       2
NULL    3
Time taken: 22.507 seconds, Fetched: 2 row(s)
hive> 

3.8 left semi join

对于一对多的情况下,经常使用in 和exists,在hive里面其实也可以使用left semi join

SELECT a.* FROM a WHERE a.key IN
(SELECT b.key FROM b WHERE b.value > 100)
等价于
SELECT a.* FROM a LEFT SEMI JOIN b ON
(a.key = b.key AND b.value > 100)

四.排序子句

排序子句有
1) ORDER BY
与传统RDBMS SQL语义一致,对结果集全局排序
对于MapReduce的实现,是需要将结果集shuffle到一个reducer
如果数据量非常大,则会导致reducer执行时间非常长
通常会加LIMIT来限制排序结果的数量

2) SORT BY
Hive特有,reducer本地排序,而非全局排序
当reducer = 1时,与Order By语义相同,否则得到的结果集不同

3) DISTRIBUTE BY
控制map输出的shuffle key
默认是按照key的hashcode
一般用于控制将相同key的数据shuffle到同一个reducer

4) CLUSTER BY
等于DISTRIBUTE BY … SORT BY …
DISTRIBUTE BY 和 SORT BY的字段相同,并正序排序

五.抽样(TABLESAMPLE)

当数据量特别大时,对全体数据进行处理存在困难时,抽样就显得尤其重要了。抽样可以从被抽取的数据中估计和推断出整体的特性,是科学实验、质量检验、社会调查普遍采用的一种经济有效的工作和研究方法。

Hive支持桶表抽样和块抽样。所谓桶表指的是在创建表时使用CLUSTERED BY子句创建了桶的表。桶表抽样的语法如下:

table_sample: TABLESAMPLE (BUCKET x OUT OF y [ON colname])

TABLESAMPLE子句允许用户编写用于数据抽样而不是整个表的查询,该子句出现FROM子句中,可用于任何表中。桶编号从1开始,colname表明抽取样本的列,可以是非分区列中的任意一列,或者使用rand()表明在整个行中抽取样本而不是单个列。在colname上分桶的行随机进入1到y个桶中,返回属于桶x的行。下面的例子中,返回32个桶中的第3个桶中的行:

SELECT *
FROM source TABLESAMPLE(BUCKET 3 OUT OF 32 ON rand()) s;

通常情况下,TABLESAMPLE将会扫描整个表然后抽取样本,显然这种做法效率不是很高。替代方法是,由于在使用CLUSTERED BY时指定了分桶的列,如果抽样时TABLESAMPLE子句中指定的列匹配CLUSTERED BY子句中的列,TABLESAMPLE只扫描表中要求的分区。假如上面的例子中,source表在创建时使用了CLUSTEREDBY id INTO 32 BUCKETS,那么下面的语句将返回第3个和第19个簇中的行,因为每个桶由(32/16)=2个簇组成。为什么是3和19呢,因为要返回的是第3个桶,而每个桶由原来的2个簇组成,第3个桶就由原来的第3个和19个簇组成,根据简单的哈希算法(3%16=19%16)。

TABLESAMPLE(BUCKET 3 OUT OF 16 ON id)

相反,下面的语句将会返回第3个簇的一半,因为每个桶由(32/64)=1/2个簇组成。

TABLESAMPLE(BUCKET 3 OUT OF 64 ON id)  

从Hive-0.8开始可以使用块抽样,语法为:

block_sample: TABLESAMPLE (n PERCENT)  

该语句允许抽取数据大小的至少n%(不是行数,而是数据大小)做为输入,支持CombineHiveInputFormat而一些特殊的压缩格式是不能够被处理的,如果抽样失败,MapReduce作业的输入将是整个表。由于在HDFS块层级进行抽样,所以抽样粒度为块的大小,例如如果块大小为256MB,即使输入的n%仅为100MB,也会得到256MB的数据。下面的例子中输入的0.1%或更多将用于查询:

SELECT *  
ROM source TABLESAMPLE(0.1 PERCENT) s;  

如果希望在不同的块中抽取相同的数据,可以改变下面的参数:

set hive.sample.seednumber=<INTEGER>;  

也可以指定读取数据的长度,该方法与PERCENT抽样具有一样的限制,为什么有相同的限制,是因为该语法仅将百分比改为了具体值,但没有改变基于块抽样这一前提条件。该语法为:

block_sample: TABLESAMPLE (ByteLengthLiteral)  

ByteLengthLiteral : (Digit)+ ('b' | 'B' | 'k' | 'K' | 'm' | 'M' | 'g' | 'G')  

下面的例子中输入的100M或更多将用于查询:

SELECT *  
FROM source TABLESAMPLE(100M) s; 

Hive也支持基于行数的输入限制,当效果与上面介绍的两个不同。首先不需要CombineHiveInputFormat,这意味着可以被用在非原生表中。其次行数被用在每个split中。因此总的行数根据输入的split数而变化很大。语法格式为:

block_sample: TABLESAMPLE (n ROWS) 

例如下面的查询将从每个split中抽取10行:

SELECT * FROM source TABLESAMPLE(10 ROWS);  

参考:

1.https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Select

©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 212,686评论 6 492
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 90,668评论 3 385
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 158,160评论 0 348
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 56,736评论 1 284
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 65,847评论 6 386
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 50,043评论 1 291
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 39,129评论 3 410
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 37,872评论 0 268
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 44,318评论 1 303
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 36,645评论 2 327
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 38,777评论 1 341
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 34,470评论 4 333
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 40,126评论 3 317
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 30,861评论 0 21
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 32,095评论 1 267
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 46,589评论 2 362
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 43,687评论 2 351

推荐阅读更多精彩内容