Setting Up the Hive Data Warehouse Tool on Linux

The first article introduced the main features of the Hive tool in the big data ecosystem: it lets you complete big data computation tasks with SQL statements, which is a great convenience for business users.

The previous article described how to set up a single-node Hadoop environment on Linux. This article records, building on that environment, the detailed process of setting up the data warehouse tool Hive.

Environment

Component          Version
Operating system   Ubuntu 18.04
Java               openjdk version "1.8.0_212"
Hadoop             3.2.0
MySQL              5.7
Hive               3.1.1

Configure MySQL

Configure MySQL to store the Hive metadata.

  1. Install MySQL
    https://www.jianshu.com/p/f81ec3a7c903
sudo apt install -y mysql-server
  2. Create a hive user
mysql> create user 'hive'@'localhost' identified by '123456';
mysql> grant all on *.* to 'hive'@'localhost' identified by '123456';
mysql> flush privileges;
  3. Verify the hive user
mysql> select host, user, authentication_string from mysql.user;
+-----------+------------------+-------------------------------------------+
| host      | user             | authentication_string                     |
+-----------+------------------+-------------------------------------------+
| localhost | root             | *6BB4837EB74329105EE4568DDA7DC67ED2CA2AD9 |
| localhost | mysql.session    | *THISISNOTAVALIDPASSWORDTHATCANBEUSEDHERE |
| localhost | mysql.sys        | *THISISNOTAVALIDPASSWORDTHATCANBEUSEDHERE |
| localhost | debian-sys-maint | *01B9179E564621D21BF72BE9D4D31F64EAD60870 |
| %         | root             | *6BB4837EB74329105EE4568DDA7DC67ED2CA2AD9 |
| localhost | hive             | *6BB4837EB74329105EE4568DDA7DC67ED2CA2AD9 |
+-----------+------------------+-------------------------------------------+
6 rows in set (0.00 sec)
  4. Create the hive database
mysql -u hive -p

mysql> create database hive;

Configure Hive

  1. Download Hive
wget http://mirror.bit.edu.cn/apache/hive/hive-3.1.1/apache-hive-3.1.1-bin.tar.gz
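After the download, extract the archive and move it into place so that it matches the HIVE_HOME used below (a minimal sketch; the /home/hduser/hadoop-family layout is the one this article assumes):
tar -zxvf apache-hive-3.1.1-bin.tar.gz
mv apache-hive-3.1.1-bin /home/hduser/hadoop-family/hive-3.1.1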
  2. Configure environment variables
export HIVE_HOME=/home/hduser/hadoop-family/hive-3.1.1
export PATH=${PATH}:${HADOOP_HOME}/bin/:${HADOOP_HOME}/sbin/:${HIVE_HOME}/bin:${DERBY_HOME}/bin
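To make these variables persist across shell sessions, one common option (assuming ~/.bashrc is your shell startup file) is:
echo 'export HIVE_HOME=/home/hduser/hadoop-family/hive-3.1.1' >> ~/.bashrc
echo 'export PATH=${PATH}:${HIVE_HOME}/bin' >> ~/.bashrc
source ~/.bashrc
hive --version    # verify the hive binary is now on the PATH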
  3. Configure hive-site.xml
    By default there is no hive-site.xml under hive-3.1.1/conf.
    Run cp hive-default.xml.template hive-site.xml first, then modify the entries below; note that in the JDBC URL the & character must be escaped as &amp;.
    Also remove the illegal character &#8; at line 3210 of the file.
<!--mysql connection configuration-->
<property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>Driver class name for a JDBC metastore</description>
</property>
<property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/hive?characterEncoding=UTF-8&amp;useSSL=false</value>
    <description>JDBC connect string for a JDBC metastore.</description>
</property>
<property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
    <description>Username to use against metastore database</description>
</property>
<property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>123456</value>
    <description>password to use against metastore database</description>
</property>

<!--hiveserver2 configuration-->
<property>
    <name>hive.server2.thrift.bind.host</name>
    <value>localhost</value>
    <description>Bind host on which to run the HiveServer2 Thrift service.</description>
</property>

Create the directory with mkdir -p /home/hduser/hadoop-family/hive-3.1.1/hive_data/tmp/ (it must match the path used in the sed command below),
then replace every occurrence of ${system:java.io.tmpdir} and ${system:user.name} in hive-site.xml:

sed -i 's/${system:java.io.tmpdir}/\/home\/hduser\/hadoop-family\/hive-3.1.1\/hive_data\/tmp/g' hive-site.xml 
sed -i 's/${system:user.name}/hduser/g' hive-site.xml 
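A quick sanity check that the substitution caught every occurrence (the command should print nothing if it did):
grep -n 'system:java.io.tmpdir\|system:user.name' hive-site.xml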
  4. Download mysql-connector-java.jar and put it into hive-3.1.1/lib/
wget https://cdn.mysql.com//Downloads/Connector-J/mysql-connector-java-5.1.47.tar.gz
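The download is a tarball, so the driver jar still needs to be extracted and copied into Hive's lib directory (the exact jar name inside the archive may differ slightly between releases):
tar -zxvf mysql-connector-java-5.1.47.tar.gz
cp mysql-connector-java-5.1.47/mysql-connector-java-5.1.47.jar ${HIVE_HOME}/lib/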

Configure Hadoop

Modify hadoop-3.2.0/etc/hadoop/core-site.xml and add the following. The main reason is that Hadoop has an impersonation (proxy user) mechanism: an upper-layer system is not allowed to pass the real user directly down to Hadoop. Instead, the real user is handed to a proxy user, and the proxy user performs the operations on Hadoop, which prevents arbitrary clients from manipulating Hadoop at will.

<property>
    <name>hadoop.proxyuser.hduser.hosts</name>
    <value>*</value>
</property>
<property>
    <name>hadoop.proxyuser.hduser.groups</name>
    <value>*</value>
</property>
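Hadoop reads core-site.xml at startup, so the change has to be applied by restarting the daemons; on this single-node setup (with the Hadoop sbin scripts on the PATH, as configured earlier) that is:
stop-yarn.sh && stop-dfs.sh
start-dfs.sh && start-yarn.sh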

Before creating tables with Hive, you need to create the directory pointed to by hive.metastore.warehouse.dir in Hadoop and grant it g+w permission:

hdfs dfs -mkdir /tmp
hdfs dfs -mkdir -p /user/hive/warehouse
hdfs dfs -chmod g+w /tmp
hdfs dfs -chmod g+w /user/hive/warehouse
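You can confirm the permission change took effect (the mode bits should now show group write, e.g. drwxrwxr-x):
hdfs dfs -ls -d /tmp /user/hive/warehouse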

Start Hive

Since Hive 2.1, you must run the schematool command to initialize the metastore schema first:

schematool -dbType mysql -initSchema
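If the initialization succeeds, schematool creates the metastore tables inside the hive database in MySQL; a quick check (it will prompt for the hive user's password set earlier):
mysql -u hive -p -e 'use hive; show tables;'
# metastore tables such as DBS, TBLS and COLUMNS_V2 should be listed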
  1. Create a database
hive> create database hive_db;
  2. Create a table
hive> use hive_db;
hive> create table hive_table (
    > id int,
    > name varchar(16),
    > age int,
    > message varchar(128)
    > );
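You can verify what Hive recorded for the table (desc lists each column with its type):
hive> desc hive_table;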
  3. Insert data
    Hive converts the SQL statement into MapReduce jobs for execution.
hive> insert into hive_table values (1, 'xiaoxin', 16, 'I am wangxin'),(2, 'kunkun', 18, 'I am shikun');
Query ID = hduser_20190625042455_3600e443-663d-45f6-b842-37a3f2a8843b
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1561428402986_0002, Tracking URL = http://ubuntu:8088/proxy/application_1561428402986_0002/
Kill Command = /home/hduser/hadoop-family/hadoop-3.2.0/bin/mapred job  -kill job_1561428402986_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2019-06-25 04:25:04,263 Stage-1 map = 0%,  reduce = 0%
2019-06-25 04:25:10,502 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.77 sec
2019-06-25 04:25:15,666 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 5.49 sec
MapReduce Total cumulative CPU time: 5 seconds 490 msec
Ended Job = job_1561428402986_0002
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://localhost:9000/user/hive/warehouse/hive_db.db/hive_table/.hive-staging_hive_2019-06-25_04-24-55_404_2009487560348382612-1/-ext-10000
Loading data to table hive_db.hive_table
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 5.49 sec   HDFS Read: 18587 HDFS Write: 380 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 490 msec
OK
Time taken: 21.773 seconds
  4. Query the data
hive> select * from hive_table;
OK
1   xiaoxin 16  I am wangxin
2   kunkun  18  I am shikun
Time taken: 0.174 seconds, Fetched: 2 row(s)
  5. Check the Hadoop files
    In HDFS you can see the newly created database directory:
$ hdfs dfs -ls /user/hive/warehouse
Found 1 items
drwxr-xr-x   - hduser supergroup          0 2019-06-25 04:22 /user/hive/warehouse/hive_db.db
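The inserted rows themselves are stored as plain files under the table's directory, which you can inspect directly:
hdfs dfs -ls /user/hive/warehouse/hive_db.db/hive_table
hdfs dfs -cat /user/hive/warehouse/hive_db.db/hive_table/*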

Start HiveServer2

HiveServer was deprecated after Hive 2.0 because it lacked features such as multi-client concurrency, and it was replaced by HiveServer2, which has its own CLI tool, Beeline.

  1. Start hiveserver2
$ hiveserver2 
2019-06-25 19:20:16: Starting HiveServer2
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hduser/hadoop-family/hive-3.1.1/lib/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hduser/hadoop-family/hadoop-3.2.0/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Hive Session ID = 2b252413-1672-4aee-943e-b1fac451ffeb
Hive Session ID = 4937ae43-4382-4333-ac1b-458851a55d45
Hive Session ID = ea183319-7ba4-4f3b-9d3b-d6ac6477aeca
Hive Session ID = 00343592-4acb-4d21-a2ec-af4ef765b2af
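HiveServer2 listens on port 10000 by default; before connecting, you can confirm that the port is open:
ss -lnt | grep 10000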

  2. Connect to hiveserver2
# Local connection
beeline -u jdbc:hive2://

# Remote connection
beeline -u jdbc:hive2://localhost:10000 -n hduser -p 123456

# Test the connection
beeline 
beeline> !connect jdbc:hive2://localhost:10000
Connecting to jdbc:hive2://localhost:10000
Enter username for jdbc:hive2://localhost:10000: hduser
Enter password for jdbc:hive2://localhost:10000: 123456
  3. Query the data
    Once connected, the operations are the same as in the Hive CLI above.
0: jdbc:hive2://localhost:10000> select * from hive_db.hive_table;
INFO  : Compiling command(queryId=hduser_20190625200727_652bdba6-9b35-4253-9955-9d20e9dca497): select * from hive_db.hive_table
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Semantic Analysis Completed (retrial = false)
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:hive_table.id, type:int, comment:null), FieldSchema(name:hive_table.name, type:varchar(16), comment:null), FieldSchema(name:hive_table.age, type:int, comment:null), FieldSchema(name:hive_table.message, type:varchar(128), comment:null)], properties:null)
INFO  : Completed compiling command(queryId=hduser_20190625200727_652bdba6-9b35-4253-9955-9d20e9dca497); Time taken: 0.171 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=hduser_20190625200727_652bdba6-9b35-4253-9955-9d20e9dca497): select * from hive_db.hive_table
INFO  : Completed executing command(queryId=hduser_20190625200727_652bdba6-9b35-4253-9955-9d20e9dca497); Time taken: 0.0 seconds
INFO  : OK
INFO  : Concurrency mode is disabled, not creating a lock manager
+----------------+------------------+-----------------+---------------------+
| hive_table.id  | hive_table.name  | hive_table.age  | hive_table.message  |
+----------------+------------------+-----------------+---------------------+
| 1              | xiaoxin          | 16              | I am wangxin        |
| 2              | kunkun           | 18              | I am shikun         |
+----------------+------------------+-----------------+---------------------+
2 rows selected (0.214 seconds)
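When finished, exit Beeline with its built-in command:
0: jdbc:hive2://localhost:10000> !quit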