Setting Up the Hive Data Warehouse Tool on Linux

The first article introduced the main features of the Hive tool in the big data ecosystem: it lets you complete big data computation tasks with SQL statements, which is a great convenience for business users.

The previous article described how to set up a single-node Hadoop environment on Linux. This article records, building on that environment, the detailed process of setting up the data warehouse tool Hive.

Environment

Component          Version
Operating system   Ubuntu 18.04
Java               openjdk version "1.8.0_212"
Hadoop             3.2.0
MySQL              5.7
Hive               3.1.1

Configure MySQL

Configure MySQL to store the Hive metadata.

  1. Install MySQL
    https://www.jianshu.com/p/f81ec3a7c903
sudo apt install -y mysql-server
  2. Create a hive user
mysql> create user 'hive'@'localhost' identified by '123456';
mysql> grant all on *.* to 'hive'@'localhost' identified by '123456';
mysql> flush privileges;
  3. Verify the hive user
mysql> select host, user, authentication_string from mysql.user;
+-----------+------------------+-------------------------------------------+
| host      | user             | authentication_string                     |
+-----------+------------------+-------------------------------------------+
| localhost | root             | *6BB4837EB74329105EE4568DDA7DC67ED2CA2AD9 |
| localhost | mysql.session    | *THISISNOTAVALIDPASSWORDTHATCANBEUSEDHERE |
| localhost | mysql.sys        | *THISISNOTAVALIDPASSWORDTHATCANBEUSEDHERE |
| localhost | debian-sys-maint | *01B9179E564621D21BF72BE9D4D31F64EAD60870 |
| %         | root             | *6BB4837EB74329105EE4568DDA7DC67ED2CA2AD9 |
| localhost | hive             | *6BB4837EB74329105EE4568DDA7DC67ED2CA2AD9 |
+-----------+------------------+-------------------------------------------+
6 rows in set (0.00 sec)
  4. Create the hive database
mysql -u hive -p

mysql> create database hive;

Configure Hive

  1. Download Hive
wget http://mirror.bit.edu.cn/apache/hive/hive-3.1.1/apache-hive-3.1.1-bin.tar.gz
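After the download, extract the archive and move it into place so that it matches the HIVE_HOME used below (a minimal sketch; the /home/hduser/hadoop-family layout is the one this article assumes):
tar -zxvf apache-hive-3.1.1-bin.tar.gz
mv apache-hive-3.1.1-bin /home/hduser/hadoop-family/hive-3.1.1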
  2. Configure environment variables
export HIVE_HOME=/home/hduser/hadoop-family/hive-3.1.1
export PATH=${PATH}:${HADOOP_HOME}/bin/:${HADOOP_HOME}/sbin/:${HIVE_HOME}/bin:${DERBY_HOME}/bin
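To make these variables persist across shell sessions, one common option (assuming ~/.bashrc is your shell startup file) is:
echo 'export HIVE_HOME=/home/hduser/hadoop-family/hive-3.1.1' >> ~/.bashrc
echo 'export PATH=${PATH}:${HIVE_HOME}/bin' >> ~/.bashrc
source ~/.bashrc
hive --version    # verify the hive binary is now on the PATH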
  3. Configure hive-site.xml
    By default there is no hive-site.xml under hive-3.1.1/conf.
    Run cp hive-default.xml.template hive-site.xml first, then modify the entries below; note that in the JDBC URL the & character must be escaped as &amp;.
    Also remove the illegal character &#8; at line 3210 of the file.
<!--mysql connection configuration-->
<property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>Driver class name for a JDBC metastore</description>
</property>
<property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/hive?characterEncoding=UTF-8&amp;useSSL=false</value>
    <description>JDBC connect string for a JDBC metastore.</description>
</property>
<property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
    <description>Username to use against metastore database</description>
</property>
<property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>123456</value>
    <description>password to use against metastore database</description>
</property>

<!--hiveserver2 configuration-->
<property>
    <name>hive.server2.thrift.bind.host</name>
    <value>localhost</value>
    <description>Bind host on which to run the HiveServer2 Thrift service.</description>
</property>

Create the directory with mkdir -p /home/hduser/hadoop-family/hive-3.1.1/hive_data/tmp/ (it must match the path used in the sed command below),
then replace every occurrence of ${system:java.io.tmpdir} and ${system:user.name} in hive-site.xml:

sed -i 's/${system:java.io.tmpdir}/\/home\/hduser\/hadoop-family\/hive-3.1.1\/hive_data\/tmp/g' hive-site.xml 
sed -i 's/${system:user.name}/hduser/g' hive-site.xml 
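A quick sanity check that the substitution caught every occurrence (the command should print nothing if it did):
grep -n 'system:java.io.tmpdir\|system:user.name' hive-site.xml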
  4. Download mysql-connector-java.jar and put it into hive-3.1.1/lib/
wget https://cdn.mysql.com//Downloads/Connector-J/mysql-connector-java-5.1.47.tar.gz
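The download is a tarball, so the driver jar still needs to be extracted and copied into Hive's lib directory (the exact jar name inside the archive may differ slightly between releases):
tar -zxvf mysql-connector-java-5.1.47.tar.gz
cp mysql-connector-java-5.1.47/mysql-connector-java-5.1.47.jar ${HIVE_HOME}/lib/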

Configure Hadoop

Modify hadoop-3.2.0/etc/hadoop/core-site.xml and add the following. The main reason is that Hadoop has an impersonation (proxy user) mechanism: an upper-layer system is not allowed to pass the real user directly down to Hadoop. Instead, the real user is handed to a proxy user, and the proxy user performs the operations on Hadoop, which prevents arbitrary clients from manipulating Hadoop at will.

<property>
    <name>hadoop.proxyuser.hduser.hosts</name>
    <value>*</value>
</property>
<property>
    <name>hadoop.proxyuser.hduser.groups</name>
    <value>*</value>
</property>
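Hadoop reads core-site.xml at startup, so the change has to be applied by restarting the daemons; on this single-node setup (with the Hadoop sbin scripts on the PATH, as configured earlier) that is:
stop-yarn.sh && stop-dfs.sh
start-dfs.sh && start-yarn.sh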

Before creating tables with Hive, you need to create the directory pointed to by hive.metastore.warehouse.dir in Hadoop and grant it g+w permission:

hdfs dfs -mkdir /tmp
hdfs dfs -mkdir -p /user/hive/warehouse
hdfs dfs -chmod g+w /tmp
hdfs dfs -chmod g+w /user/hive/warehouse
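You can confirm the permission change took effect (the mode bits should now show group write, e.g. drwxrwxr-x):
hdfs dfs -ls -d /tmp /user/hive/warehouse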

Start Hive

Since Hive 2.1, you must run the schematool command to initialize the metastore schema first:

schematool -dbType mysql -initSchema
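If the initialization succeeds, schematool creates the metastore tables inside the hive database in MySQL; a quick check (it will prompt for the hive user's password set earlier):
mysql -u hive -p -e 'use hive; show tables;'
# metastore tables such as DBS, TBLS and COLUMNS_V2 should be listed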
  1. Create a database
hive> create database hive_db;
  2. Create a table
hive> use hive_db;
hive> create table hive_table (
    > id int,
    > name varchar(16),
    > age int,
    > message varchar(128)
    > );
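You can verify what Hive recorded for the table (desc lists each column with its type):
hive> desc hive_table;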
  3. Insert data
    Hive converts the SQL statement into MapReduce jobs for execution.
hive> insert into hive_table values (1, 'xiaoxin', 16, 'I am wangxin'),(2, 'kunkun', 18, 'I am shikun');
Query ID = hduser_20190625042455_3600e443-663d-45f6-b842-37a3f2a8843b
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1561428402986_0002, Tracking URL = http://ubuntu:8088/proxy/application_1561428402986_0002/
Kill Command = /home/hduser/hadoop-family/hadoop-3.2.0/bin/mapred job  -kill job_1561428402986_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2019-06-25 04:25:04,263 Stage-1 map = 0%,  reduce = 0%
2019-06-25 04:25:10,502 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.77 sec
2019-06-25 04:25:15,666 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 5.49 sec
MapReduce Total cumulative CPU time: 5 seconds 490 msec
Ended Job = job_1561428402986_0002
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://localhost:9000/user/hive/warehouse/hive_db.db/hive_table/.hive-staging_hive_2019-06-25_04-24-55_404_2009487560348382612-1/-ext-10000
Loading data to table hive_db.hive_table
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 5.49 sec   HDFS Read: 18587 HDFS Write: 380 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 490 msec
OK
Time taken: 21.773 seconds
  4. Query the data
hive> select * from hive_table;
OK
1   xiaoxin 16  I am wangxin
2   kunkun  18  I am shikun
Time taken: 0.174 seconds, Fetched: 2 row(s)
  5. Check the Hadoop files
    In HDFS you can see the newly created database directory:
$ hdfs dfs -ls /user/hive/warehouse
Found 1 items
drwxr-xr-x   - hduser supergroup          0 2019-06-25 04:22 /user/hive/warehouse/hive_db.db
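The inserted rows themselves are stored as plain files under the table's directory, which you can inspect directly:
hdfs dfs -ls /user/hive/warehouse/hive_db.db/hive_table
hdfs dfs -cat /user/hive/warehouse/hive_db.db/hive_table/*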

Start HiveServer2

HiveServer was deprecated after Hive 2.0 because it lacked features such as multi-client concurrency, and it was replaced by HiveServer2, which has its own CLI tool, Beeline.

  1. Start hiveserver2
$ hiveserver2 
2019-06-25 19:20:16: Starting HiveServer2
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hduser/hadoop-family/hive-3.1.1/lib/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hduser/hadoop-family/hadoop-3.2.0/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Hive Session ID = 2b252413-1672-4aee-943e-b1fac451ffeb
Hive Session ID = 4937ae43-4382-4333-ac1b-458851a55d45
Hive Session ID = ea183319-7ba4-4f3b-9d3b-d6ac6477aeca
Hive Session ID = 00343592-4acb-4d21-a2ec-af4ef765b2af
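HiveServer2 listens on port 10000 by default; before connecting, you can confirm that the port is open:
ss -lnt | grep 10000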

  2. Connect to hiveserver2
# Local connection
beeline -u jdbc:hive2://

# Remote connection
beeline -u jdbc:hive2://localhost:10000 -n hduser -p 123456

# Test the connection
beeline 
beeline> !connect jdbc:hive2://localhost:10000
Connecting to jdbc:hive2://localhost:10000
Enter username for jdbc:hive2://localhost:10000: hduser
Enter password for jdbc:hive2://localhost:10000: 123456
  3. Query the data
    Once connected, the operations are the same as in the Hive CLI above.
0: jdbc:hive2://localhost:10000> select * from hive_db.hive_table;
INFO  : Compiling command(queryId=hduser_20190625200727_652bdba6-9b35-4253-9955-9d20e9dca497): select * from hive_db.hive_table
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Semantic Analysis Completed (retrial = false)
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:hive_table.id, type:int, comment:null), FieldSchema(name:hive_table.name, type:varchar(16), comment:null), FieldSchema(name:hive_table.age, type:int, comment:null), FieldSchema(name:hive_table.message, type:varchar(128), comment:null)], properties:null)
INFO  : Completed compiling command(queryId=hduser_20190625200727_652bdba6-9b35-4253-9955-9d20e9dca497); Time taken: 0.171 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=hduser_20190625200727_652bdba6-9b35-4253-9955-9d20e9dca497): select * from hive_db.hive_table
INFO  : Completed executing command(queryId=hduser_20190625200727_652bdba6-9b35-4253-9955-9d20e9dca497); Time taken: 0.0 seconds
INFO  : OK
INFO  : Concurrency mode is disabled, not creating a lock manager
+----------------+------------------+-----------------+---------------------+
| hive_table.id  | hive_table.name  | hive_table.age  | hive_table.message  |
+----------------+------------------+-----------------+---------------------+
| 1              | xiaoxin          | 16              | I am wangxin        |
| 2              | kunkun           | 18              | I am shikun         |
+----------------+------------------+-----------------+---------------------+
2 rows selected (0.214 seconds)
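When finished, exit Beeline with its built-in command:
0: jdbc:hive2://localhost:10000> !quit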