1. 借鉴

Hive学习笔记——Hive中的分桶
 Hive分区和分桶（0925）
HIVE表索引，分区和分桶的区别

2. 开始

基于前篇文章的建表语句，来分别看下分区表和分桶表。

分区表

分区表对应HDFS上独立的文件夹，一个文件夹表示一个分区

创建分区表

# 2020-03-18日 创建订单
# 文件为hotel_checkin_log-2020-03-18.log
10271563,2020-03-19,14:00,2020-03-20,12:00,02:55:22,1431259675035954008
00277553,2020-03-19,12:00,2020-03-22,12:00,12:00:00,1124559675462631493
02371493,2020-03-19,17:00,2020-03-19,12:00,09:20:00,1634659676964192299

# 2020-03-19日 创建订单
# 文件为hotel_checkin_log-2020-03-19.log
10271563,2020-03-20,14:00,2020-03-21,12:00,02:55:22,1124759681900888144
00277553,2020-03-20,12:00,2020-03-22,12:00,12:00:00,1234759682320318549
02371493,2020-03-20,17:00,2020-03-21,12:00,09:20:00,1236559684014817343

# 2020-03-20日 创建订单
# 文件为hotel_checkin_log-2020-03-20.log
10271563,2020-03-21,14:00,2020-03-23,12:00,02:55:22,1123659685898059853
00277553,2020-03-22,12:00,2020-03-25,12:00,12:00:00,1213559689173811267
02371493,2020-03-22,17:00,2020-03-23,12:00,09:20:00,1133659694081146884

我们根据数据构建以下表结构:

create table hotel_checkin_log(
  code string,
  checkin_date string,
  checkin_time string,
  checkout_date string,
  checkout_time string,
  order_time string,
  uId string
)
partitioned by (order_date string)
row format delimited fields terminated by ',';

加载数据

此时加载数据和之前的略有区别，主要是需要指定分区。此时分区列不包含在原始数据中，在加载数据的时候需要手动指定，所以数据一定要规划好

load data local inpath '/opt/envs/datas/hotel_checkin_log/hotel_checkin_log-2020-03-18.log' into table db_company.hotel_checkin_log partition(order_date='2020-03-18');
load data local inpath '/opt/envs/datas/hotel_checkin_log/hotel_checkin_log-2020-03-19.log' into table db_company.hotel_checkin_log partition(order_date='2020-03-19');
load data local inpath '/opt/envs/datas/hotel_checkin_log/hotel_checkin_log-2020-03-20.log' into table db_company.hotel_checkin_log partition(order_date='2020-03-20');

如果没有指定分区会出现以下错误

ValidationFailureSemanticException Partition spec xxx contains non-partition columns

查询分区表

不指定分区

select * from hotel_checkin_log;

指定分区

select * from db_company.hotel_checkin_log where order_date = '2020-03-18';

查询分区明细

show partitions  hotel_checkin_log;

明细如下：

partition
order_date=2020-03-18
order_date=2020-03-19
order_date=2020-03-20
Time taken: 0.061 seconds, Fetched: 3 row(s)

创建分区

注意：这里的创建分区，不是有了一个order_date分区，再创建一个order_time分区，而是为order_date添加分区，我们是可以指定多个分区，后面我们再说。

# 创建一个分区
alter table db_company.hotel_checkin_log add partition(order_date='2020-03-21');

# 创建多个分区
alter table db_company.hotel_checkin_log add partition(order_date='2020-03-22') partition(order_date='2020-03-23');

我们此时看下分区明细

partition
order_date=2020-03-18
order_date=2020-03-19
order_date=2020-03-20
order_date=2020-03-21
order_date=2020-03-22
order_date=2020-03-23
Time taken: 0.059 seconds, Fetched: 6 row(s)

删除分区

可以看到删除分区和创建分区基本类似，区别在于

add 关键字替换为drop关键字
批量删除时以,分隔，批量添加时以空格分隔

# 删除单个分区
alter table db_company.hotel_checkin_log drop partition(order_date='2020-03-21');

# 删除多个分区
alter table db_company.hotel_checkin_log drop partition(order_date='2020-03-22'), partition(order_date='2020-03-23');

创建多级分区

上面我们说了创建分区，是在一个分区列下面增加分区，那如果我想有两个分区列呢

create table hotel_checkin_log_gather(
  code string,
  checkin_date string,
  checkin_time string,
  checkout_date string,
  checkout_time string,
  order_time string,
  uId string
)
partitioned by (order_date string, gather int)
row format delimited fields terminated by ',';

加载数据

load data local inpath '/opt/envs/datas/hotel_checkin_log/hotel_checkin_log-2020-03-18.log' into table db_company.hotel_checkin_log_gather partition(order_date='2020-03-18', gather = 1);
load data local inpath '/opt/envs/datas/hotel_checkin_log/hotel_checkin_log-2020-03-19.log' into table db_company.hotel_checkin_log_gather partition(order_date='2020-03-19', gather = 1);
load data local inpath '/opt/envs/datas/hotel_checkin_log/hotel_checkin_log-2020-03-20.log' into table db_company.hotel_checkin_log_gather partition(order_date='2020-03-20', gather = 1);

指定了两个分区，则在加载数据时都需要指定，否则会出现未知分区的错误提示

分区与数据加载

上面我们都是加载数据到hdfs时并指定分区。
我们知道，hive的数据是存放在hdfs中的，有时我们需要先将数据直接放到hdfs的目录中然后再加载。

方式1

① 创建了分区表
② 手动在hdfs中创建分区文件夹，并上传数据文件（在这一步可以验证以下，分区表是查询不出来数据的）
③ 进行关联修复
④ 再次查询有数据了

下面我们依次来操作一下

我们使用hotel_checkin_log这个表，目前它有三个分区（order_date=2020-03-18，order_date=2020-03-19，order_date=2020-03-20）

我们创建一个2020-03-31的分区并上传文件

#创建文件夹
hdfs dfs -mkdir -p /hive.db/company/hotel_checkin_log/order_date=2020-03-31
# 上传文件
hdfs dfs -put /opt/envs/datas/hotel_checkin_log/hotel_checkin_log-2020-03-31.log /hive.db/company/hotel_checkin_log/order_date=2020-03-31

查询一下数据【没有是正常的】

select * from db_company.hotel_checkin_log where order_date = '2020-03-31'

修复关联

msck repair table db_company.hotel_checkin_log;

再次查询即可看到数据了

方式2

① 创建了分区表
② 手动在hdfs中创建分区文件夹，并上传数据文件（在这一步可以验证以下，分区表是查询不出来数据的）
③ 创建分区

下面我们依次来操作一下

我们使用hotel_checkin_log这个表，目前它有四个分区（order_date=2020-03-18，order_date=2020-03-19，order_date=2020-03-20, order_date=2020-03-31）

我们创建一个2020-04-01的分区并上传文件

#创建文件夹
hdfs dfs -mkdir -p /hive.db/company/hotel_checkin_log/order_date=2020-04-01
# 上传文件
hdfs dfs -put /opt/envs/datas/hotel_checkin_log/hotel_checkin_log-2020-04-01.log /hive.db/company/hotel_checkin_log/order_date=2020-04-01

查询一下数据【没有是正常的】

select * from db_company.hotel_checkin_log where order_date = '2020-03-31'

创建分区

alter table hotel_checkin_log add partition(order_date='2020-04-01');

再次查询即可看到数据了

分桶表

分桶表对应HDFS上多个文件，一个文件表示一个桶。分桶将整个数据内容按照某列属性的值的hash值进行分桶，比如我们这里的按照code进行分桶，分为4个桶，就是对code属性值的hash值对4取摸，按照取模结果对数据分桶。

创建分区表

我们基于以下数据创建分区表 <数据文件可以这里下载>

创建分区表后导入数据有一个限制，需要从中间表导入，否则数据并没有在多个分捅。为了将数据填充到桶中，必须使用insert语句而不是load语句，因为它(load)不根据元数据定义验证数据(只将文件复制到文件夹)

我们根据数据构建以下表结构:

# 创建临时表
create TEMPORARY table tmp_hotel_checkin_bucket_log(
  code string,
  checkin_date string,
  checkin_time string,
  checkout_date string,
  checkout_time string,
  order_time string,
  uId string
)
row format delimited 
fields terminated by ',';

# 创建正式表
create table hotel_checkin_bucket_log(
  code string,
  checkin_date string,
  checkin_time string,
  checkout_date string,
  checkout_time string,
  order_time string,
  uId string
)
clustered by(code) 
into 4 buckets
row format delimited 
fields terminated by ',';

查看分捅数量 [Num Buckets 属性]

desc formatted hotel_checkin_bucket_log;

OK
col_name        data_type       comment
# col_name              data_type               comment             
                 
code                    string                                      
checkin_date            string                                      
checkin_time            string                                      
checkout_date           string                                      
checkout_time           string                                      
order_time              string                                      
uid                     string                                      
                 
# Detailed Table Information             
Database:               db_company               
Owner:                  root                     
CreateTime:             Mon Jun 22 11:38:56 UTC 2020     
LastAccessTime:         UNKNOWN                  
Protect Mode:           None                     
Retention:              0                        
Location:               hdfs://hadoop01:9000/hive.db/company/hotel_checkin_bucket_log    
Table Type:             MANAGED_TABLE            
Table Parameters:                
        COLUMN_STATS_ACCURATE   true                
        numFiles                1                   
        totalSize               863                 
        transient_lastDdlTime   1592826192          
                 
# Storage Information            
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe       
InputFormat:            org.apache.hadoop.mapred.TextInputFormat         
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat       
Compressed:             No                       
Num Buckets:            4                  
Bucket Columns:         [code]                   
Sort Columns:           []                       
Storage Desc Params:             
        field.delim             ,                   
        serialization.format    ,                   
Time taken: 0.05 seconds, Fetched: 36 row(s)

加载数据至中间表

load data local inpath '/opt/envs/datas/hotel_checkin_log/hotel_checkin_log.log' into table tmp_hotel_checkin_bucket_log;

在插入数据之前，需要在hive命令行设置以下参数，另外在hive本地模式下强制分桶会报错，因为只有一个reducer，所以要么需要集群模式，要么使用本地模式，不强制分桶，结果只有一个桶

# 设置强制分桶
set hive.enforce.bucketing=true;
 # 这里我创建的是4个桶，或者设置为-1，表示不限制，依据建表中的分桶数量
set mapred.reduce.tasks = 4;

从中间表导入数据

insert into table hotel_checkin_bucket_log select code, checkin_date, checkin_time, checkout_date, checkout_time,  order_time, uId from tmp_hotel_checkin_bucket_log;

Hive 1.2.1 分区和分捅

Hive 1.2.1 分区和分捅

1. 借鉴

2. 开始

分区表

创建分区表

加载数据

查询分区表

查询分区明细

创建分区

删除分区

创建多级分区

分区与数据加载

方式1

方式2

分桶表

创建分区表

3. 大功告成