Basic Hive operations

Ways to use Hive

1. Using the CLI

Run the hive command to enter the client.

2. Using the hiveserver2 service

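Edit hdfs-site.xml and core-site.xml:
- In hdfs-site.xml, set dfs.webhdfs.enabled --> true
- In core-site.xml, add hadoop.proxyuser.hadoop.hosts --> *
  ......groups --> *
Start Hive as a background service. Programs such as JDBC and ODBC clients can connect to Hive only after it is running as a background service.

nohup command &
nohup means "no hang up": the command keeps running after the terminal session ends.
Run: nohup hiveserver2 1>~/hive_std.log 2>~/hive_err.log &
The redirections must come before the trailing &; anything written after & is parsed as a separate command.
File descriptors: 0 is standard input, 1 is standard output, 2 is standard error.
nohup hiveserver2 1>/dev/null 2>/dev/null &
This sends both streams to /dev/null (the "black hole"), so no logs are kept.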
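The effect of writing the redirections before the trailing & can be checked without Hive. The sketch below uses a hypothetical shell function, fake_service, as a stand-in for hiveserver2. That function and the log directory are assumptions made for illustration, and nohup is left out because it needs a real executable rather than a shell function.

```shell
# fake_service is a hypothetical stand-in for hiveserver2:
# it writes one line to stdout and one line to stderr.
fake_service() {
  echo "service started"
  echo "something went wrong" >&2
}

log_dir=$(mktemp -d)

# Redirections first, & last: both streams of the background job are captured.
fake_service 1>"$log_dir/hive_std.log" 2>"$log_dir/hive_err.log" &
wait   # block until the background job has finished writing

cat "$log_dir/hive_std.log"
cat "$log_dir/hive_err.log"
```

Each stream ends up in its own file, which is the behaviour wanted for the real service logs.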
Run jps; a RunJar process in the output means the service started successfully.
Use the beeline client to connect to hiveserver2:
1. $ beeline
2. > !connect jdbc:hive2://hadoop02:10000

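Using HQL

DDL for databases

Create a database
create database if not exists hadoop;
When creating, if not exists suppresses the error raised when the database already exists.
When dropping, if exists suppresses the error raised when it does not exist.
Both clauses also work for tables.
List databases
show databases;
Show the database currently in use
select current_database();
Switch database
use dname;
Show database details
desc database dname;
desc database extended dname;
Drop a database
drop database dname;
drop database dname restrict;
restrict is the default: a database that still contains tables cannot be dropped.
drop database dname cascade;
Drops the database together with its tables (cascading delete).
Alter a database: rarely needed.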

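DDL for tables

Create a table
create [external] table [if not exists] table_name (col_name data_type, ...)

comment: the table comment
partitioned by (col_name data_type ...)
A partition column must not also appear among the table's columns.
clustered by (col_name, ...): bucketing
[sorted by (col_name [asc|desc], ...)]: whether, and by which columns, rows are sorted within each bucket
into num_buckets buckets: how many buckets the table is split into
Bucketing columns must be a subset of the table's columns.
row format row_format: the field and line delimiters
row format delimited fields terminated by "," lines terminated by "\n"
stored as file_format: the storage file format
file_format:
1. textfile: plain text
2. sequencefile: binary key/value sequence file
3. rcfile: hybrid row-columnar file
4. custom file format
location hdfs_path
The table's path can be specified when the table is created.
Both internal and external tables can specify an HDFS storage path.
Best practice: if the data already sits on HDFS and is shared by several clients, use an external table.
set hive.exec.mode.local.auto=true;
Hive tries to run in local mode.
The setting is lost when the session ends or after reset.
Copy a table
create table student_1 like student;
Copies the table definition only, without data.
CTAS
create table .... as select ...
set property
View configuration settings.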

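Six table DDL examples

Create an internal table: create table student (id int, name string) row format delimited fields terminated by ',';
Create an external table: create external table student_ext (id int, name string) row format delimited fields terminated by ',' location '/hive/student';
desc formatted student_ext shows the table type as EXTERNAL_TABLE.
Partitioned table:
create table student_ptn (id int, name string) partitioned by (city string) row format delimited fields terminated by ',';

create table t01_ptn02 (count int) partitioned by (username string, month string) row format delimited fields terminated by ',';

Add a partition: alter table student_ptn add partition (city="beijing");

city is the partition column. If there were a second partition column such as zip, the directory structure would be /city=beijing/zip=10011.

A partition column cannot reuse a column that already exists in the table.
For a partitioned table, each partition is a directory under the table's directory.
Data files must be placed inside a partition directory, not directly under the table directory.

List partitions: show partitions student_ptn;
Bucketed table
create table student_bck (id int, name string) clustered by (id) sorted by (id asc, name desc) into 4 buckets row format delimited fields terminated by ',';
Create a table with CTAS
CTAS creates a table that stores the result of a query.
create table student_ctas as select * from student where id < 10;
Copy a table structure
create table sut_copy like student;

Whether the source table is internal or external, the new table is internal unless the external keyword precedes table.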

Inspection commands

show tables;
show tables in dname;
show tables like 'stu*'; -- pattern match with the * wildcard

Show table details
desc student;
desc extended student;
desc formatted student;
show partitions stu; -- list partitions
show functions; -- list functions
desc function extended substring; -- show how a function is used
show create table stu; -- show the full create statement

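Altering tables

Rename a table
alter table stu rename to new_stu;
Change column definitions
- Add columns
  alter table stu add columns (sex string, age int);
- Change a column definition
  alter table stu change age new_age string;
- Drop a column
  Not supported.
- Replace all columns
  alter table stu replace columns (id int, name string);
  An int column can be changed to string, but a string column cannot be changed to int.
  In hive-1.2.2, however, columns can be replaced freely.
  Hive is schema on read: the schema is applied when data is read, not when it is written.
- Change partition information
  - Add static partitions: alter table stu_ptn add partition (city="chongqing") partition (city="kunming") ......;
  - Modify a partition
    Usually only the storage directory of the partition's data is changed: alter table stu_ptn partition (city='beijing') set location '/stu_ptn_beijing';
  - Drop a partition
    alter table stu_ptn drop partition (city='beijing');
Truncate a table
truncate table stu;
Drop a table
drop table stu;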

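DML (data manipulation language)

Importing data

Loading data with load
Hive is schema on read, so data of any shape can be loaded; nothing is validated at load time.
- load data local inpath "/home/" into table student;
  Loads data from the local Linux filesystem into the student table.
  The data file is uploaded to /user/hive/warehouse/student.
- load data inpath "/stu/test.txt" into table stu;
  Loads data that is already on HDFS.
  If the data is already on HDFS, do not load it into an internal table,
  because the load moves the file into the /user/hive/warehouse/ directory,
  and dropping the internal table then deletes the data.
- hadoop fs -put file /user/hive/warehouse/student/
  Uploads the file straight into the Hive table's directory.
- load data local inpath "....." overwrite into table student;
  Overwrites the existing data.
Inserting data with insert
- insert into student (id,name,sex,age,department) values (1111,'ss','f',12,'nn'),(xx,xxx,xxx,);
  With insert ... values, Hive first creates a temporary table such as values_tmp_table_1 to hold the rows from the insert statement, then inserts the records into the target table.
- insert into table student_c select * from student where age <= 18;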

Multi-insert

Create a partitioned table: create table stu_ptn_age (id int, name string, sex string, department string) partitioned by (age int) .....

Split the rows of the student table into three groups and insert them into three partitions of stu_ptn_age.
When inserting into a partitioned table, the target partition need not exist; it is created automatically.

insert into table stu_ptn_age partition(age=18) select id,name,sex,department from student where age <= 18;
insert into table stu_ptn_age partition(age=19) select id,name,sex,department from student where age = 19;
insert into table stu_ptn_age partition(age=20) select id,name,sex,department from student where age >= 20;

This approach is slow, because each statement scans student separately.

A multi-insert reduces the work.
The main saving is the number of scans of the source table: it is read once for all three inserts.

  from student
  insert into table stu_ptn_age partition(age=18) select id,name,sex,department where age <= 18
  insert into table stu_ptn_age partition(age=19) select id,name,sex,department where age = 19
  insert into table stu_ptn_age partition(age=20) select id,name,sex,department where age >= 20;
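The single-scan idea behind a multi-insert can be imitated with ordinary shell tools. The sketch below is an analogy rather than Hive itself: one awk pass reads made-up rows (assumed column order id,name,sex,age,department) and routes each row to one of three output files that play the role of the three partitions.

```shell
out_dir=$(mktemp -d)

# Hypothetical rows in the column order id,name,sex,age,department.
rows='1,ann,f,17,cs
2,bob,m,19,math
3,cat,f,22,cs
4,dan,m,18,art'

# One pass over the input; each row is routed to the matching "partition" file.
printf '%s\n' "$rows" | awk -F, -v out="$out_dir" '
  { line = $1 "," $2 "," $3 "," $5 }           # select id,name,sex,department
  $4 <= 18 { print line > (out "/age=18") }
  $4 == 19 { print line > (out "/age=19") }
  $4 >= 20 { print line > (out "/age=20") }
'

cat "$out_dir/age=18"
```

Running three separate awk commands with one condition each would give the same files but read the input three times, which mirrors the three independent insert statements.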

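truncate does not remove the age=xx partition metadata.
select * from stu_ptn_age;
The partition column is shown in the result as well.
In queries, a partition column is used just like an ordinary column.
Partition information is stored in the PARTITIONS table of the metastore.

Question: what if the real requirement is one partition per age?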

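Dynamic partition insert

Create a test table: create table stu_ptn_dpt ..... partitioned by (department string) ....
The following insert fails: insert into table t01_ptn partition(username,month) select count,username,month from table01;
set hive.exec.dynamic.partition.mode=nonstrict;
In the default strict mode, a dynamic partition insert must name at least one static partition column, which matters when a table has several partition columns. To lift this restriction, set the mode to nonstrict.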
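In a dynamic partition insert, Hive takes the partition values from the trailing columns of the select and creates the matching partition directories as needed. The sketch below imitates that layout with plain shell, as an analogy only: the temporary directory stands in for the table directory, and the rows and the file name data are made up.

```shell
warehouse=$(mktemp -d)   # hypothetical stand-in for the table directory

# Hypothetical rows in the order count,username,month: the trailing two
# columns play the role of the dynamic partition columns.
printf '%s\n' '3,tom,2018-01' '5,tom,2018-02' '7,amy,2018-01' |
while IFS=, read -r count username month; do
  part_dir="$warehouse/username=$username/month=$month"
  mkdir -p "$part_dir"                 # the partition is created on first use
  echo "$count" >> "$part_dir/data"    # only the non-partition column is stored
done

(cd "$warehouse" && find . -type f | sort)
```

The listing shows one directory level per partition column, the same username=.../month=... structure that a partitioned table has on HDFS.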

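When putting data into a partitioned table, avoid load: it does not check the rows, so data can easily end up in the wrong partition. Use it only when you are certain the file belongs to that partition.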

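Exporting data with insert
insert overwrite local directory "/home/hadoop/tem/stu_le18" select * from student where age <= 18;
Be careful with the path: overwrite replaces whatever is already in that directory.
To view the exported data, run sed -e 's/\x01/\t/g' file.txt to replace the default Ctrl+A field delimiter with tabs.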
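The substitution can be tried on a single hand-made row before running it on a real export. This is a minimal sketch assuming GNU sed, since \x01 and \t inside the pattern are GNU extensions; the field values are invented.

```shell
# Build one sample row whose fields are separated by Ctrl+A (byte 0x01).
# The values are made up for illustration.
raw=$(printf '95001\001lisi\00118')

# Same substitution as above; \x01 and \t in the pattern are GNU sed extensions.
converted=$(printf '%s\n' "$raw" | sed -e 's/\x01/\t/g')

printf '%s\n' "$converted"
```

The g flag matters here because every row contains more than one delimiter.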

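String substitution: the s command
sed 's/hello/hi/g' sed.txt
##  Replaces hello with hi across the whole line. Without the g flag, only the first hello on each line is replaced.

Multiple edits: the -e option
sed -e '1,5d' -e 's/hello/hi/' sed.txt
##  The -e option allows several commands in one invocation. In this example the first command deletes lines 1 to 5 and the second replaces hello with hi.
The order of the commands affects the result. If both are substitutions, the first one changes the input that the second one sees.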
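A small experiment makes the ordering effect visible. The sketch below applies the same two substitutions in both orders to the same made-up input line.

```shell
# Same two substitutions, applied in opposite orders to the same input.
first=$(printf 'hello world\n' | sed -e 's/hello/hi/' -e 's/hi/hey/')
second=$(printf 'hello world\n' | sed -e 's/hi/hey/' -e 's/hello/hi/')

echo "$first"    # the second command rewrites the hi produced by the first
echo "$second"   # s/hi/hey/ finds nothing to replace, then hello becomes hi
```

The first pipeline ends with hey world and the second with hi world, so the two orders are not interchangeable.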

sed --expression='s/hello/hi/' --expression='/today/d' sed.txt
##  --expression is the long form of -e in GNU sed; each option passes one sed expression.

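Queries

distinct removes duplicates.
show functions; lists 271 built-in functions in Hive 2.3.3.
UDF: row-level function, one input row, one output value.
UDAF: many-to-one (aggregate) function, n input rows, one output.
UDTF: one-to-many function, one input row, n output rows.
update and delete are not supported on ordinary (non-transactional) tables,
because Hive is a data warehouse built for online analytical processing (OLAP), not for transaction processing.
in and exists are supported.
select * from student where age in (18,19);
Older versions do not support them; Hive recommends a semi join (left semi join) instead.
case when is supported.

select id, t_job, t_edu, case t_edu
when "硕士" then 1 -- master's degree
when "本科" then 2 -- bachelor's degree
else 3
end as level
from lagou limit 1,100;

select count(distinct age) from ... join ... on ... where ... group by ... having ... cluster by ... distribute by ... sort by ... order by ... limit ....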

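order by: global sort. select * from student order by age desc;
sort by
Local sort: rows are sorted within each reducer's output, but rows with the same age can land in different reducers, because the rows are not distributed by a hash of age.
One SQL statement runs as one MapReduce job. Local sort means that when several reduce tasks run, the output of each reduce task is sorted. With a single reduce task, sort by is equivalent to order by.
set mapreduce.job.reduces=3;
select * from student sort by age desc;
With select * and sort by alone, rows are assigned to reducers at random.
distribute by
Distributes rows into buckets.
select * from student distribute by age sort by age desc;
A row's bucket is the hash of age modulo the number of buckets, and the number of buckets equals the number of reduce tasks.
sort by sorts locally, so the rows inside each bucket are ordered.
cluster by
cluster by age = distribute by age sort by age;
distribute by id sort by id,age != cluster by id sort by age;
cluster by cannot be combined with sort by.
To distribute on one column and then sort on several columns, the only option is the distribute by and sort by combination.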
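The combination distribute by age sort by age desc can be imitated on a few rows with awk and sort. This is a rough sketch, not Hive's implementation: it assumes the hash of an integer is the integer itself, uses three reducers as in the setting above, and works on invented rows in the order id,name,age.

```shell
num_reducers=3

# Hypothetical rows in the order id,name,age.
rows='1,zhangsan,20
2,lisi,18
3,wangwu,19
4,zhaoliu,18
5,tianqi,21
6,sunba,20'

# distribute by age: prefix each row with its bucket number, age mod num_reducers
# (assumption: the hash of an integer is the integer itself).
bucketed=$(printf '%s\n' "$rows" |
  awk -F, -v n="$num_reducers" '{ printf "%d,%s\n", $3 % n, $0 }')

# sort by age desc: order rows inside each bucket only (id breaks ties).
result=$(printf '%s\n' "$bucketed" | sort -t, -k1,1n -k4,4nr -k2,2n)

printf '%s\n' "$result"
```

In the output the first field is the bucket number. Rows with the same age always share a bucket, and ages are descending inside each bucket but not across buckets, which is the difference between this local sort and order by.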
