Note:
Hive version 2.1.1
1. Overview of Hive DML (Data Manipulation Language)
Hive's DML statements are inherited from traditional relational databases, with some Hive-specific additions on top.
DML comes up constantly in data warehouse projects, where it is used to implement business logic.
Statement | Description | Notes |
---|---|---|
LOAD | Load data into a table | The source can be an OS file or a file on HDFS |
INSERT INTO TABLE FROM Query | Insert query results into a table or partition | The most frequently used statement in production |
CTAS | Create a table directly from a query result | Skips the separate DDL; data lands in an automatically created table |
Writing data into the filesystem from queries | Write query results into a filesystem directory | Exports Hive table data to files |
INSERT | Standard INSERT statement for entering rows | INSERT ... VALUES, the standard way to enter data |
UPDATE | UPDATE statement, modifies rows | Requires transactions to be enabled |
DELETE | DELETE statement, removes rows | Requires transactions to be enabled |
MERGE | UPDATE on match, INSERT otherwise | Requires transactions; supported from Hive 2.2 |
2. The LOAD Command
Syntax:
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)] [INPUTFORMAT 'inputformat' SERDE 'serde'] (3.0 or later)
A LOAD is a simple filesystem move/copy operation; it does not launch an MR job.
If the table is partitioned, you must specify the exact target partition (values for all partition columns).
filepath may be a file or a directory. A file is moved/copied to the target path; for a directory, every file under it is moved/copied there, and the directory must not contain subdirectories.
With the LOCAL keyword, a local file is copied to the target path; without it, the file is assumed to be on HDFS and is moved there.
With the OVERWRITE keyword, the existing contents of the table or partition are deleted first and replaced by the contents of filepath; otherwise, the contents of filepath are appended to the table or partition path.
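The PARTITION and OVERWRITE options can be combined; a minimal sketch, with a hypothetical partitioned table and file path:

```sql
-- Hypothetical partitioned table; all names are for illustration only
CREATE TABLE login_log (
  user_id INT,
  name    STRING
) PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Values must be supplied for every partition column; OVERWRITE first
-- clears the target partition, then replaces it with the file contents
LOAD DATA LOCAL INPATH '/backup/hive/login_20200101.txt'
OVERWRITE INTO TABLE login_log PARTITION (dt = '2020-01-01');
```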
2.1 Preparing test data
Hand-crafted test data:
Empty values appear in three forms: literal null, '', and a field left completely empty.
[root@hp1 hive]# more test1.txt
1,'abc','2020-01-01'
2,'def','2020-01-01 23:00:01'
3,'ghi','2020-01-02 19:00:10'
4,'aaa','2020-01-02 19:00:20'
5,,'2020-01-02 19:00:30'
6,'abc',
7,'def','2020-01-03 19:00:30'
8,'abc','2020-02-02 19:00:30'
Upload test1.txt to HDFS:
[root@hp1 hive]# hadoop fs -put test1.txt /tmp
[root@hp1 hive]#
[root@hp1 hive]# hadoop fs -ls /tmp
Found 6 items
d--------- - hdfs supergroup 0 2020-12-03 18:16 /tmp/.cloudera_health_monitoring_canary_files
drwxrwxrwt - root supergroup 0 2020-12-03 11:15 /tmp/db_test3.db
drwxrwxrwt - root supergroup 0 2020-12-03 11:16 /tmp/db_test4.db
drwx-wx-wx - hive supergroup 0 2020-11-18 17:57 /tmp/hive
drwxrwxrwt - mapred hadoop 0 2020-11-19 15:36 /tmp/logs
-rw-rw-rw- 3 root supergroup 166 2020-12-03 18:16 /tmp/test1.txt
[root@hp1 hive]#
Hive table:
CREATE TABLE t2(
user_id int,
name varchar(100) COMMENT '用户姓名',
loagin_date string);
2.2 Loading a local file into a Hive table
LOAD DATA LOCAL INPATH '/backup/hive/test1.txt' into table t2;
First import attempt.
Test log:
Every column comes back NULL: the table was created without specifying a field delimiter, so Hive's default delimiter (\001) does not match the commas in the file.
hive>
> LOAD DATA LOCAL INPATH '/backup/hive/test1.txt' into table t2;
Loading data to table test.t2
OK
Time taken: 0.867 seconds
hive> select * from t2;
OK
NULL NULL NULL
NULL NULL NULL
NULL NULL NULL
NULL NULL NULL
NULL NULL NULL
NULL NULL NULL
NULL NULL NULL
NULL NULL NULL
NULL NULL NULL
Time taken: 0.446 seconds, Fetched: 9 row(s)
Change the table's field delimiter:
drop table t2;
CREATE TABLE t2(
user_id int,
name varchar(100) COMMENT '用户姓名',
loagin_date string) row format delimited fields terminated by ',' stored as textfile;
Second import attempt.
The numeric and string columns now load, but the values keep their quote characters, so the date column is still not imported as a clean timestamp.
hive>
>
> LOAD DATA LOCAL INPATH '/backup/hive/test1.txt' into table t2;
Loading data to table test.t2
OK
Time taken: 0.632 seconds
hive> select * from t2;
OK
1 'abc' '2020-01-01'
2 'def' '2020-01-01 23:00:01'
3 'ghi' '2020-01-02 19:00:10'
4 'aaa' '2020-01-02 19:00:20'
5 '2020-01-02 19:00:30'
6 'abc'
7 'def' '2020-01-03 19:00:30'
8 'abc' '2020-02-02 19:00:30'
Time taken: 0.073 seconds, Fetched: 8 row(s)
2.3 Loading an HDFS file into a Hive table
Compared with loading an OS file, the only difference is dropping the LOCAL keyword; everything else is the same.
create table t3 like t2;
LOAD DATA INPATH '/tmp/test1.txt' into table t3;
Test log:
hive>
> create table t3 like t2;
OK
Time taken: 0.078 seconds
hive>
> LOAD DATA INPATH '/tmp/test1.txt' into table t3;
Loading data to table test.t3
OK
Time taken: 0.605 seconds
hive> select * from t3;
OK
1 'abc' '2020-01-01'
2 'def' '2020-01-01 23:00:01'
3 'ghi' '2020-01-02 19:00:10'
4 'aaa' '2020-01-02 19:00:20'
5 '2020-01-02 19:00:30'
6 'abc'
7 'def' '2020-01-03 19:00:30'
8 'abc' '2020-02-02 19:00:30'
Time taken: 0.064 seconds, Fetched: 8 row(s)
hive>
3. INSERT INTO TABLE ... FROM Query
Syntax:
Standard syntax:
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1 FROM from_statement;
INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement;
Hive extension (multiple inserts):
FROM from_statement
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1
[INSERT OVERWRITE TABLE tablename2 [PARTITION ... [IF NOT EXISTS]] select_statement2]
[INSERT INTO TABLE tablename2 [PARTITION ...] select_statement2] ...;
FROM from_statement
INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1
[INSERT INTO TABLE tablename2 [PARTITION ...] select_statement2]
[INSERT OVERWRITE TABLE tablename2 [PARTITION ... [IF NOT EXISTS]] select_statement2] ...;
Hive extension (dynamic partition inserts):
INSERT OVERWRITE TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...) select_statement FROM from_statement;
INSERT INTO TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...) select_statement FROM from_statement;
If the table is partitioned, you must specify the exact target partition (values for all partition columns).
With the OVERWRITE keyword, the existing contents of the table or partition are deleted first and replaced by the query result; otherwise, the query result is appended to the table or partition path.
3.1 The INSERT INTO ... SELECT statement
INSERT INTO ... SELECT has the same syntax as in relational databases and is used to load data from another table.
create table t4 like t3;
insert into t4 select user_id,name,loagin_date from t3;
select * from t4;
Test log:
hive>
>
> create table t4 like t3;
OK
Time taken: 0.091 seconds
hive> insert into t4 select user_id,name,loagin_date from t3;
Query ID = root_20201204105325_8e9447df-993b-4b9c-b347-fd0c60382987
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1606698967173_0021, Tracking URL = http://hp1:8088/proxy/application_1606698967173_0021/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job -kill job_1606698967173_0021
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2020-12-04 10:53:33,921 Stage-1 map = 0%, reduce = 0%
2020-12-04 10:53:39,158 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.78 sec
MapReduce Total cumulative CPU time: 1 seconds 780 msec
Ended Job = job_1606698967173_0021
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://nameservice1/user/hive/warehouse/test.db/t4/.hive-staging_hive_2020-12-04_10-53-25_821_7213321429457977661-1/-ext-10000
Loading data to table test.t4
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 1.78 sec HDFS Read: 4468 HDFS Write: 269 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 780 msec
OK
Time taken: 16.067 seconds
hive>
> select * from t4;
OK
1 'abc' '2020-01-01'
2 'def' '2020-01-01 23:00:01'
3 'ghi' '2020-01-02 19:00:10'
4 'aaa' '2020-01-02 19:00:20'
5 '2020-01-02 19:00:30'
6 'abc'
7 'def' '2020-01-03 19:00:30'
8 'abc' '2020-02-02 19:00:30'
Time taken: 0.066 seconds, Fetched: 8 row(s)
3.2 The INSERT OVERWRITE ... SELECT statement
INSERT OVERWRITE TABLE ... SELECT first truncates the table, then loads the query result into it.
This statement is a Hive extension, used constantly for full refreshes in data warehouses.
Syntax:
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1 FROM from_statement;
Code:
-- Continuing the example above, run the same insert again
insert into t4 select user_id,name,loagin_date from t3;
-- Querying now shows duplicated rows
select * from t4;
-- Switch to the insert overwrite table command
insert overwrite table t4 select user_id,name,loagin_date from t3;
-- Verify the data again
select * from t4;
Test log:
hive>
> insert into t4 select user_id,name,loagin_date from t3;
Query ID = root_20201204110806_8624193d-6311-44f5-8575-2abeb6c9e102
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1606698967173_0022, Tracking URL = http://hp1:8088/proxy/application_1606698967173_0022/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job -kill job_1606698967173_0022
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2020-12-04 11:08:14,896 Stage-1 map = 0%, reduce = 0%
2020-12-04 11:08:22,132 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.72 sec
MapReduce Total cumulative CPU time: 1 seconds 720 msec
Ended Job = job_1606698967173_0022
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://nameservice1/user/hive/warehouse/test.db/t4/.hive-staging_hive_2020-12-04_11-08-06_488_865341941305657171-1/-ext-10000
Loading data to table test.t4
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 1.72 sec HDFS Read: 4552 HDFS Write: 269 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 720 msec
OK
Time taken: 17.314 seconds
hive> select * from t4;
OK
1 'abc' '2020-01-01'
2 'def' '2020-01-01 23:00:01'
3 'ghi' '2020-01-02 19:00:10'
4 'aaa' '2020-01-02 19:00:20'
5 '2020-01-02 19:00:30'
6 'abc'
7 'def' '2020-01-03 19:00:30'
8 'abc' '2020-02-02 19:00:30'
1 'abc' '2020-01-01'
2 'def' '2020-01-01 23:00:01'
3 'ghi' '2020-01-02 19:00:10'
4 'aaa' '2020-01-02 19:00:20'
5 '2020-01-02 19:00:30'
6 'abc'
7 'def' '2020-01-03 19:00:30'
8 'abc' '2020-02-02 19:00:30'
Time taken: 0.066 seconds, Fetched: 16 row(s)
hive>
> insert overwrite table t4 select user_id,name,loagin_date from t3;
Query ID = root_20201204110931_49346b1e-779f-4b4f-830f-7f363b2855c9
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1606698967173_0023, Tracking URL = http://hp1:8088/proxy/application_1606698967173_0023/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job -kill job_1606698967173_0023
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2020-12-04 11:09:38,356 Stage-1 map = 0%, reduce = 0%
2020-12-04 11:09:43,517 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.75 sec
MapReduce Total cumulative CPU time: 1 seconds 750 msec
Ended Job = job_1606698967173_0023
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://nameservice1/user/hive/warehouse/test.db/t4/.hive-staging_hive_2020-12-04_11-09-31_276_442497908099823168-1/-ext-10000
Loading data to table test.t4
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 1.75 sec HDFS Read: 4557 HDFS Write: 269 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 750 msec
OK
Time taken: 14.942 seconds
hive> select * from t4;
OK
1 'abc' '2020-01-01'
2 'def' '2020-01-01 23:00:01'
3 'ghi' '2020-01-02 19:00:10'
4 'aaa' '2020-01-02 19:00:20'
5 '2020-01-02 19:00:30'
6 'abc'
7 'def' '2020-01-03 19:00:30'
8 'abc' '2020-02-02 19:00:30'
Time taken: 0.066 seconds, Fetched: 8 row(s)
3.3 Multiple inserts
Syntax:
Hive extension (multiple inserts):
FROM from_statement
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1
[INSERT OVERWRITE TABLE tablename2 [PARTITION ... [IF NOT EXISTS]] select_statement2]
[INSERT INTO TABLE tablename2 [PARTITION ...] select_statement2] ...;
FROM from_statement
INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1
[INSERT INTO TABLE tablename2 [PARTITION ...] select_statement2]
[INSERT OVERWRITE TABLE tablename2 [PARTITION ... [IF NOT EXISTS]] select_statement2] ...;
Requirement:
Split the employee table in two, one table for employees who receive a bonus and one for those who do not.
Normally this takes two INSERT statements; with a multiple insert, a single statement does it.
Code:
create table emp_bonus like emp;
create table emp_no_bonus like emp;
from emp e
insert into emp_bonus
select empno,ename,job,mgr,hiredate,sal,comm,deptno
where comm > 0
insert into emp_no_bonus
select empno,ename,job,mgr,hiredate,sal,comm,deptno
where comm = 0 or comm is null;
select * from emp_bonus;
select * from emp_no_bonus;
Test log:
hive>
> create table emp_bonus like emp;
OK
Time taken: 0.078 seconds
hive> create table emp_no_bonus like emp;
OK
Time taken: 0.095 seconds
hive>
> from emp e
> insert into emp_bonus
> select empno,ename,job,mgr,hiredate,sal,comm,deptno
> where comm > 0
> insert into emp_no_bonus
> select empno,ename,job,mgr,hiredate,sal,comm,deptno
> where comm = 0 or comm is null;
Query ID = root_20201204113103_98083a55-2d8a-44d5-85fa-67af5369a21d
Total jobs = 5
Launching Job 1 out of 5
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1606698967173_0024, Tracking URL = http://hp1:8088/proxy/application_1606698967173_0024/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job -kill job_1606698967173_0024
Hadoop job information for Stage-2: number of mappers: 2; number of reducers: 0
2020-12-04 11:31:11,214 Stage-2 map = 0%, reduce = 0%
2020-12-04 11:31:17,437 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 6.82 sec
MapReduce Total cumulative CPU time: 6 seconds 820 msec
Ended Job = job_1606698967173_0024
Stage-5 is filtered out by condition resolver.
Stage-4 is selected by condition resolver.
Stage-6 is filtered out by condition resolver.
Stage-11 is filtered out by condition resolver.
Stage-10 is selected by condition resolver.
Stage-12 is filtered out by condition resolver.
Launching Job 4 out of 5
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1606698967173_0025, Tracking URL = http://hp1:8088/proxy/application_1606698967173_0025/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job -kill job_1606698967173_0025
Hadoop job information for Stage-4: number of mappers: 1; number of reducers: 0
2020-12-04 11:31:29,467 Stage-4 map = 0%, reduce = 0%
2020-12-04 11:31:35,647 Stage-4 map = 100%, reduce = 0%, Cumulative CPU 1.53 sec
MapReduce Total cumulative CPU time: 1 seconds 530 msec
Ended Job = job_1606698967173_0025
Launching Job 5 out of 5
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1606698967173_0026, Tracking URL = http://hp1:8088/proxy/application_1606698967173_0026/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job -kill job_1606698967173_0026
Hadoop job information for Stage-10: number of mappers: 1; number of reducers: 0
2020-12-04 11:31:47,670 Stage-10 map = 0%, reduce = 0%
2020-12-04 11:31:53,852 Stage-10 map = 100%, reduce = 0%, Cumulative CPU 1.57 sec
MapReduce Total cumulative CPU time: 1 seconds 570 msec
Ended Job = job_1606698967173_0026
Loading data to table test.emp_bonus
Loading data to table test.emp_no_bonus
MapReduce Jobs Launched:
Stage-Stage-2: Map: 2 Cumulative CPU: 6.82 sec HDFS Read: 15263 HDFS Write: 965 HDFS EC Read: 0 SUCCESS
Stage-Stage-4: Map: 1 Cumulative CPU: 1.53 sec HDFS Read: 3180 HDFS Write: 157 HDFS EC Read: 0 SUCCESS
Stage-Stage-10: Map: 1 Cumulative CPU: 1.57 sec HDFS Read: 3568 HDFS Write: 520 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 9 seconds 920 msec
OK
Time taken: 52.03 seconds
hive> select * from emp_bonus;
OK
7521 WARD SALESMAN 7698 1981-02-22 1250.0 500.0 30
7654 MARTIN SALESMAN 7698 1981-09-28 1250.0 1400.0 30
7499 ALLEN SALESMAN 7698 1981-02-20 1600.0 300.0 30
Time taken: 0.081 seconds, Fetched: 3 row(s)
hive>
> select * from emp_no_bonus;
OK
7566 JONES MANAGER 7839 1981-04-02 2975.0 NULL 20
7698 BLAKE MANAGER 7839 1981-05-01 2850.0 NULL 30
7782 CLARK MANAGER 7839 1981-06-09 2450.0 NULL 10
7788 SCOTT ANALYST 7566 1987-06-13 3000.0 NULL 20
7839 KING PRESIDENT NULL 1981-11-17 5000.0 NULL 10
7844 TURNER SALESMAN 7698 1981-09-08 1500.0 0.0 30
7876 ADAMS CLERK 7788 1987-06-13 1100.0 NULL 20
7900 JAMES CLERK 7698 1981-12-03 950.0 NULL 30
7902 FORD ANALYST 7566 1981-12-03 3000.0 NULL 20
7934 MILLER CLERK 7782 1982-01-23 1300.0 NULL 10
7369 SMITH CLERK 7902 1980-12-17 800.0 NULL 20
Time taken: 0.059 seconds, Fetched: 11 row(s)
hive>
3.4 Dynamic partition inserts
Syntax:
Hive extension (dynamic partition inserts):
INSERT OVERWRITE TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...) select_statement FROM from_statement;
INSERT INTO TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...) select_statement FROM from_statement;
You may specify values for only some of the partition columns instead of all of them. Partitions whose values are fixed in the statement are static partitions; those left unspecified are dynamic partitions.
The value of each dynamic partition is taken from the corresponding column of the SELECT.
hive.exec.dynamic.partition.mode=[strict|nonstrict]
In strict mode, at least one static partition must be specified; in nonstrict mode, all partitions may be dynamic.
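The rules above can be sketched as follows; the table names and data are hypothetical, and the two SET commands assume the usual session-level switches for dynamic partitioning:

```sql
-- Allow dynamic partitions, with no static partition required
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- Hypothetical target table partitioned by country and dt
CREATE TABLE user_visits (
  user_id INT,
  name    STRING
) PARTITIONED BY (country STRING, dt STRING);

-- country is a static partition; dt is dynamic, so its value is taken
-- per row from the last column of the SELECT list
INSERT OVERWRITE TABLE user_visits PARTITION (country = 'CN', dt)
SELECT user_id, name, login_date FROM raw_visits;
```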
4. CTAS
CTAS skips the separate CREATE TABLE statement and loads the query result straight into a new table; handy for temporary tables and backup tables.
Syntax:
CREATE TABLE tablename
[ROW FORMAT row_format] [STORED AS file_format]
AS SELECT … FROM
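The optional clauses can be combined with the query; a hedged sketch against the emp demo table used below (the new table name and the filter are illustrative):

```sql
-- CTAS with an explicit delimiter and storage format; the column names
-- and types of the new table are inferred from the SELECT list
CREATE TABLE emp_backup
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
AS SELECT empno, ename, sal FROM emp WHERE sal > 1000;
```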
Test log:
hive>
>
>
> desc emp_ctas;
FAILED: SemanticException [Error 10001]: Table not found emp_ctas
hive> create table emp_ctas as select * from emp where 1 = 1;
Query ID = root_20201204144641_1e70859c-276e-4ac4-b593-225f3d9e2770
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1606698967173_0027, Tracking URL = http://hp1:8088/proxy/application_1606698967173_0027/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job -kill job_1606698967173_0027
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
2020-12-04 14:46:48,605 Stage-1 map = 0%, reduce = 0%
2020-12-04 14:46:54,786 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.69 sec
MapReduce Total cumulative CPU time: 3 seconds 690 msec
Ended Job = job_1606698967173_0027
Stage-4 is filtered out by condition resolver.
Stage-3 is selected by condition resolver.
Stage-5 is filtered out by condition resolver.
Launching Job 3 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1606698967173_0028, Tracking URL = http://hp1:8088/proxy/application_1606698967173_0028/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job -kill job_1606698967173_0028
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
2020-12-04 14:47:07,574 Stage-3 map = 0%, reduce = 0%
2020-12-04 14:47:13,760 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 1.52 sec
MapReduce Total cumulative CPU time: 1 seconds 520 msec
Ended Job = job_1606698967173_0028
Moving data to directory hdfs://nameservice1/user/hive/warehouse/test.db/emp_ctas
MapReduce Jobs Launched:
Stage-Stage-1: Map: 2 Cumulative CPU: 3.69 sec HDFS Read: 9949 HDFS Write: 816 HDFS EC Read: 0 SUCCESS
Stage-Stage-3: Map: 1 Cumulative CPU: 1.52 sec HDFS Read: 3004 HDFS Write: 677 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 210 msec
OK
Time taken: 33.892 seconds
hive>
> select * from emp_ctas;
OK
7521 WARD SALESMAN 7698 1981-02-22 1250.0 500.0 30
7566 JONES MANAGER 7839 1981-04-02 2975.0 NULL 20
7654 MARTIN SALESMAN 7698 1981-09-28 1250.0 1400.0 30
7698 BLAKE MANAGER 7839 1981-05-01 2850.0 NULL 30
7782 CLARK MANAGER 7839 1981-06-09 2450.0 NULL 10
7788 SCOTT ANALYST 7566 1987-06-13 3000.0 NULL 20
7839 KING PRESIDENT NULL 1981-11-17 5000.0 NULL 10
7844 TURNER SALESMAN 7698 1981-09-08 1500.0 0.0 30
7876 ADAMS CLERK 7788 1987-06-13 1100.0 NULL 20
7900 JAMES CLERK 7698 1981-12-03 950.0 NULL 30
7902 FORD ANALYST 7566 1981-12-03 3000.0 NULL 20
7934 MILLER CLERK 7782 1982-01-23 1300.0 NULL 10
7369 SMITH CLERK 7902 1980-12-17 800.0 NULL 20
7499 ALLEN SALESMAN 7698 1981-02-20 1600.0 300.0 30
Time taken: 0.061 seconds, Fetched: 14 row(s)
5. Writing data into the filesystem from queries
Write a query's result set into a filesystem directory.
Syntax:
Standard syntax:
INSERT OVERWRITE [LOCAL] DIRECTORY directory1
[ROW FORMAT row_format] [STORED AS file_format] (Note: Only available starting with Hive 0.11.0)
SELECT ... FROM ...
Hive extension (multiple inserts):
FROM from_statement
INSERT OVERWRITE [LOCAL] DIRECTORY directory1 select_statement1
[INSERT OVERWRITE [LOCAL] DIRECTORY directory2 select_statement2] ...
row_format
: DELIMITED [FIELDS TERMINATED BY char [ESCAPED BY char]] [COLLECTION ITEMS TERMINATED BY char]
[MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]
[NULL DEFINED AS char] (Note: Only available starting with Hive 0.13)
Code:
insert overwrite local directory '/tmp/emp.txt'
select * from emp;
Test log:
hive>
> insert overwrite local directory '/tmp/emp.txt'
> select * from emp;
Query ID = root_20201204145220_d4e24136-f156-466c-bf0d-f9e36e9a82fe
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1606698967173_0029, Tracking URL = http://hp1:8088/proxy/application_1606698967173_0029/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job -kill job_1606698967173_0029
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
2020-12-04 14:52:26,262 Stage-1 map = 0%, reduce = 0%
2020-12-04 14:52:33,455 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.42 sec
MapReduce Total cumulative CPU time: 3 seconds 420 msec
Ended Job = job_1606698967173_0029
Moving data to local directory /tmp/emp.txt
MapReduce Jobs Launched:
Stage-Stage-1: Map: 2 Cumulative CPU: 3.42 sec HDFS Read: 9863 HDFS Write: 677 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 420 msec
OK
Time taken: 14.343 seconds
[root@hp1 hive]# more /tmp/emp.txt/
*** /tmp/emp.txt/: directory ***
[root@hp1 hive]# cd /tmp/emp.txt/
[root@hp1 emp.txt]# ls
000000_0 000001_0
[root@hp1 emp.txt]# more 000000_0
7521�WARD�SALESMAN�7698�1981-02-22�1250.0�500.0�30
7566�JONES�MANAGER�7839�1981-04-02�2975.0�\N�20
7654�MARTIN�SALESMAN�7698�1981-09-28�1250.0�1400.0�30
7698�BLAKE�MANAGER�7839�1981-05-01�2850.0�\N�30
7782�CLARK�MANAGER�7839�1981-06-09�2450.0�\N�10
7788�SCOTT�ANALYST�7566�1987-06-13�3000.0�\N�20
7839�KING�PRESIDENT�\N�1981-11-17�5000.0�\N�10
7844�TURNER�SALESMAN�7698�1981-09-08�1500.0�0.0�30
7876�ADAMS�CLERK�7788�1987-06-13�1100.0�\N�20
7900�JAMES�CLERK�7698�1981-12-03�950.0�\N�30
7902�FORD�ANALYST�7566�1981-12-03�3000.0�\N�20
7934�MILLER�CLERK�7782�1982-01-23�1300.0�\N�10
[root@hp1 emp.txt]# more 000001_0
7369�SMITH�CLERK�7902�1980-12-17�800.0�\N�20
7499�ALLEN�SALESMAN�7698�1981-02-20�1600.0�300.0�30
[root@hp1 emp.txt]#
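The unreadable separator in the exported files is Hive's default field delimiter \001 (Ctrl-A). Adding a ROW FORMAT clause to the export produces a plain comma-separated file instead; a minimal sketch, with a hypothetical output directory:

```sql
-- Export with an explicit comma delimiter instead of the default \001
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/emp_csv'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT * FROM emp;
```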
6. INSERT
The SQL-standard INSERT statement, used the same way as in traditional relational databases.
Syntax:
Standard Syntax:
INSERT INTO TABLE tablename [PARTITION (partcol1[=val1], partcol2[=val2] ...)] VALUES values_row [, values_row ...]
Where values_row is:
( value [, value ...] )
where a value is either null or any valid SQL literal
Requirement: insert rows into the emp_ctas table.
Code:
insert into emp_ctas(empno,ename,deptno) values (1,'test1',10);
insert into emp_ctas(empno,ename,deptno) values (2,'test2',20),(3,'test3',30);
Test log:
hive>
> insert into emp_ctas(empno,ename,deptno) values (1,'test1',10);
Query ID = root_20201204145950_46e74db1-30a2-47ef-bc68-7334c0bfe75d
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1606698967173_0030, Tracking URL = http://hp1:8088/proxy/application_1606698967173_0030/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job -kill job_1606698967173_0030
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2020-12-04 14:59:57,007 Stage-1 map = 0%, reduce = 0%
2020-12-04 15:00:05,251 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.97 sec
MapReduce Total cumulative CPU time: 2 seconds 970 msec
Ended Job = job_1606698967173_0030
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://nameservice1/user/hive/warehouse/test.db/emp_ctas/.hive-staging_hive_2020-12-04_14-59-50_800_1003035375971432143-1/-ext-10000
Loading data to table test.emp_ctas
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 2.97 sec HDFS Read: 5352 HDFS Write: 95 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 970 msec
OK
Time taken: 16.171 seconds
hive> insert into emp_ctas(empno,ename,deptno) values (2,'test2',20),(3,'test3',30);
Query ID = root_20201204150012_55a81ff8-bb7f-4e07-b77b-371655c88c3e
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1606698967173_0031, Tracking URL = http://hp1:8088/proxy/application_1606698967173_0031/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job -kill job_1606698967173_0031
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2020-12-04 15:00:18,975 Stage-1 map = 0%, reduce = 0%
2020-12-04 15:00:25,227 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.0 sec
MapReduce Total cumulative CPU time: 3 seconds 0 msec
Ended Job = job_1606698967173_0031
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://nameservice1/user/hive/warehouse/test.db/emp_ctas/.hive-staging_hive_2020-12-04_15-00-12_024_6107099678250146899-1/-ext-10000
Loading data to table test.emp_ctas
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 3.0 sec HDFS Read: 5366 HDFS Write: 121 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 0 msec
OK
Time taken: 14.84 seconds
hive>
7. The UPDATE Statement
Syntax:
Standard Syntax:
UPDATE tablename SET column = value [, column = value ...] [WHERE expression]
Transactions are not enabled by default, so an UPDATE statement fails with the following error:
hive>
> update emp_ctas set comm = 100 where ename = 'test1';
FAILED: SemanticException [Error 10294]: Attempt to do update or delete using transaction manager that does not support these operations.
hive>
>
ACID transactions will be covered in detail in a later chapter on transactions.
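For reference ahead of that chapter, a hedged sketch of what ACID DML typically requires on Hive 2.x; the property values shown are the commonly documented ones and should be verified against your deployment (emp_acid is a hypothetical table):

```sql
-- Settings commonly required before ACID DML will run
SET hive.support.concurrency = true;
SET hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

-- In Hive 2.x a transactional table must be bucketed and stored as ORC
CREATE TABLE emp_acid (
  empno INT,
  ename STRING,
  comm  DOUBLE
) CLUSTERED BY (empno) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

-- With the above in place, UPDATE no longer raises error 10294
UPDATE emp_acid SET comm = 100 WHERE ename = 'test1';
```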
8. The DELETE Statement
Syntax:
Standard Syntax:
DELETE FROM tablename [WHERE expression]
ACID transactions will be covered in detail in a later chapter on transactions.
9. MERGE
MERGE works the same way as the MERGE statement in relational databases: rows that match are updated, rows that do not are inserted.
Syntax:
Standard Syntax:
MERGE INTO <target table> AS T USING <source expression/table> AS S
ON <boolean expression1>
WHEN MATCHED [AND <boolean expression2>] THEN UPDATE SET <set clause list>
WHEN MATCHED [AND <boolean expression3>] THEN DELETE
WHEN NOT MATCHED [AND <boolean expression4>] THEN INSERT VALUES<value list>
MERGE is only supported from Hive 2.2 onward, so it is skipped here.
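Even though it cannot run on 2.1.1, a sketch of the shape a MERGE takes on Hive 2.2+ may still be useful; both tables would need to be ACID tables, and all names here are hypothetical:

```sql
-- Hypothetical: apply staged changes from emp_stage to emp_target
MERGE INTO emp_target AS T
USING emp_stage AS S
ON T.empno = S.empno
WHEN MATCHED AND S.deleted = 1 THEN DELETE
WHEN MATCHED THEN UPDATE SET comm = S.comm
WHEN NOT MATCHED THEN INSERT VALUES (S.empno, S.ename, S.comm);
```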
References
1.https://cwiki.apache.org/confluence/display/Hive/LanguageManual