Introduction
Hive is a data warehouse that runs on top of Hadoop. It maps structured data files to database tables and offers a simple SQL-like query language, called HQL, whose statements are translated into MapReduce jobs for execution. This makes it convenient to query and analyze data with SQL, and it is best suited to data that changes infrequently. Under the hood, Hive can read from files stored on HDFS or from HBase.
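To make that concrete, here is a minimal sketch (the table name, columns, and HDFS path are all hypothetical) that maps a tab-delimited file already sitting on HDFS to a table and then runs an aggregation, which Hive compiles into a MapReduce job:
```
$ hive -e "
CREATE EXTERNAL TABLE IF NOT EXISTS access_log (ip STRING, url STRING, ts STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/access_log';
-- this GROUP BY is compiled into a MapReduce job
SELECT ip, count(*) AS hits FROM access_log GROUP BY ip;"
```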
Recommended article: hbase和hive的差别是什么,各自适用在什么场景中? (What are the differences between HBase and Hive, and which scenarios suit each?)
Author: yuan daisy
Link: https://www.zhihu.com/question/21677041/answer/78289309
Source: Zhihu (copyright belongs to the author; contact the author for permission to republish)
- Tables in Hive are purely logical: just the table definition, i.e. the table's metadata. Hive itself stores no data; it relies entirely on HDFS and MapReduce. That is how it can map structured data files to database tables, provide full SQL querying, and ultimately translate SQL statements into MapReduce jobs. HBase tables, by contrast, are physical tables, well suited to storing unstructured data.
- Hive processes data via MapReduce, which operates in a row-oriented fashion; HBase processes data column-oriented rather than row-oriented, which suits random access over massive datasets.
- HBase tables are sparsely stored, so each row may define its own set of columns; Hive tables are dense: however many columns are defined, every row stores that fixed number of columns.
- Hive analyzes data through Hadoop, a batch-processing system, so it cannot guarantee low latency; HBase is a near-real-time system that supports real-time queries.
- Hive does not provide row-level updates; it suits batch processing of large append-only datasets such as logs (see the loading sketch after this list). HBase queries, on the other hand, do support row-level updates.
- Hive provides a fairly complete SQL implementation and is typically used for mining and analysis over historical data. HBase does not suit applications that need joins, multi-level indexes, or complex table relationships.
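As the loading sketch promised above, here is what Hive's append-only pattern typically looks like (table and path names are hypothetical): new files are appended in bulk, and changes are made by rewriting data rather than by row-level UPDATE statements:
```
# Append a day's worth of log files to an existing table
$ hive -e "LOAD DATA INPATH '/logs/2016-11-01' INTO TABLE access_log;"
# A plain Hive table has no row-level UPDATE; to change data you rewrite it,
# e.g. by overwriting a cleaned copy of the table:
$ hive -e "INSERT OVERWRITE TABLE access_log_clean
SELECT * FROM access_log WHERE ip IS NOT NULL;"
```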
Environment Setup
Environment Description
Two machines are used: the Hive server (172.16.252.128, host: master), which is also hadoop-master and has MySQL installed, and the Hive client (host: slave02), which is also a hadoop-slave.
Dependencies
MySQL and Hadoop need to be installed in advance.
Installation guides are in my blog posts:
[Linux(CentOS7)下rpm方式安装mysql5.6.29](http://www.jianshu.com/p/e23d22022c53)
[CentOS7下搭建Hadoop2.7.3集群](http://www.jianshu.com/p/26e857d7aca8)
Installation and Testing
Server (master)
- Download the latest Hive binary tarball apache-hive-2.1.0-bin.tar.gz from the official site. Unpacking it is the whole install; I keep it under the /data directory:
$ tar -zxvf /data/apache-hive-2.1.0-bin.tar.gz
- Configure the environment variables
$ vi /etc/profile
#Hive Env
HIVE_HOME=/data/apache-hive-2.1.0-bin
PATH=$PATH:$HIVE_HOME/bin
export HIVE_HOME PATH
$ source /etc/profile
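A quick sanity check that the variables took effect (the output assumes the /data install path used here):
```
$ echo $HIVE_HOME
/data/apache-hive-2.1.0-bin
$ which hive
/data/apache-hive-2.1.0-bin/bin/hive
```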
- Log in to MySQL and create a hive user for the metastore (the hive database itself will be created automatically thanks to the createDatabaseIfNotExist option in the JDBC URL below):
$ mysql -uroot -p
mysql> CREATE USER 'hive' IDENTIFIED BY 'hive';
mysql> GRANT ALL PRIVILEGES ON *.* TO 'hive'@'master' WITH GRANT OPTION;
mysql> flush privileges;
mysql> exit;
$ mysql -h master -uhive
mysql> set password = password('hive');
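Optionally, verify the account and grant before moving on (run as root on master):
```
mysql> SELECT user, host FROM mysql.user WHERE user = 'hive';
mysql> SHOW GRANTS FOR 'hive'@'master';
```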
- Download mysql-connector-java-5.1.40.tar.gz and copy the JDBC driver jar into Hive's lib directory:
$ tar -zvxf mysql-connector-java-5.1.40.tar.gz
$ cd mysql-connector-java-5.1.40
$ cp mysql-connector-java-5.1.40-bin.jar $HIVE_HOME/lib/
- Edit the Hive configuration file
$ cp /data/apache-hive-2.1.0-bin/conf/hive-default.xml.template /data/apache-hive-2.1.0-bin/conf/hive-site.xml
$ vi /data/apache-hive-2.1.0-bin/conf/hive-site.xml
<property>
  <!-- JDBC driver class for the MySQL metastore -->
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
  <description>Driver class name for a JDBC metastore</description>
</property>
<property>
  <!-- MySQL connection URL -->
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://master:3306/hive?createDatabaseIfNotExist=true</value>
  <description>
    JDBC connect string for a JDBC metastore.
    To use SSL to encrypt/authenticate the connection, provide database-specific SSL flag in the connection URL.
    For example, jdbc:postgresql://myhost/db?ssl=true for postgres database.
  </description>
</property>
<property>
  <!-- MySQL user name -->
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
  <description>Username to use against metastore database</description>
</property>
<property>
  <!-- MySQL password -->
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hive</value>
  <description>password to use against metastore database</description>
</property>
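One step worth calling out: with Hive 2.x the metastore schema usually has to be initialized once against MySQL before the first start, using the schematool shipped with Hive. A sketch, assuming the connection settings above:
```
# One-time initialization of the metastore schema in the MySQL hive database
$ $HIVE_HOME/bin/schematool -dbType mysql -initSchema
```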
Client (slave02)
Steps 1 and 2 are the same as on the server.
Step 3: edit the configuration file
$ cp /data/apache-hive-2.1.0-bin/conf/hive-default.xml.template /data/apache-hive-2.1.0-bin/conf/hive-site.xml
$ vi /data/apache-hive-2.1.0-bin/conf/hive-site.xml
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://master:9083</value>
  <description>Thrift URI for the remote metastore. Used by metastore client to connect to remote metastore.</description>
</property>
Startup (if you hit errors, see the analysis in the next section)
- Start the metastore service on the server
[root@master bin]# cd $HIVE_HOME/bin/
[root@master bin]# ./hive --service metastore &
[1] 32436
[root@master bin]# which: no hbase in (/data/spark-2.0.1-bin-hadoop2.7/bin:/data/spark-2.0.1-bin-hadoop2.7/sbin:/data/soft/scala-2.11.2/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/usr/java/jdk1.7.0_79/bin:/usr/java/jdk1.7.0_79/jre/bin:/data/hadoop-2.7.3/bin:/data/hadoop-2.7.3/sbin:/date/apache-hive-2.1.0-bin/bin:/root/bin)
Starting Hive Metastore Server
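Before starting any clients, it can help to confirm the metastore is actually listening on its default Thrift port:
```
# 9083 is the default metastore port (use `ss -nltp` if netstat is missing)
$ netstat -nltp | grep 9083
```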
- Log in to Hive on the server
[root@master bin]# ./hive
which: no hbase in (/data/spark-2.0.1-bin-hadoop2.7/bin:/data/spark-2.0.1-bin-hadoop2.7/sbin:/data/soft/scala-2.11.2/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/usr/java/jdk1.7.0_79/bin:/usr/java/jdk1.7.0_79/jre/bin:/data/hadoop-2.7.3/bin:/data/hadoop-2.7.3/sbin:/date/apache-hive-2.1.0-bin/bin:/root/bin)
Logging initialized using configuration in jar:file:/data/apache-hive-2.1.0-bin/lib/hive-common-2.1.0.jar!/hive-log4j2.properties Async: true
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. tez, spark) or using Hive 1.X releases.
hive> show databases;
OK
default
wentao
Time taken: 0.85 seconds, Fetched: 2 row(s)
- Log in to Hive on the client
[root@slave02 bin]# ./hive
which: no hbase in (/data/spark-2.0.1-bin-hadoop2.7/bin:/data/spark-2.0.1-bin-hadoop2.7/sbin:/data/soft/scala-2.11.2/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/usr/java/jdk1.7.0_79/bin:/usr/java/jdk1.7.0_79/jre/bin:/data/hadoop-2.7.3/bin:/data/hadoop-2.7.3/sbin:/root/bin)
Logging initialized using configuration in jar:file:/data/apache-hive-2.1.0-bin/lib/hive-common-2.1.0.jar!/hive-log4j2.properties Async: true
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. tez, spark) or using Hive 1.X releases.
hive> show databases;
OK
default
wentao
Time taken: 0.933 seconds, Fetched: 2 row(s)
hive>
The two sides see the same data. Log in to MySQL on master at the same time and inspect the hive database:
mysql> use hive;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Database changed
mysql> show tables;
+---------------------------+
| Tables_in_hive |
+---------------------------+
| AUX_TABLE |
| BUCKETING_COLS |
| CDS |
| COLUMNS_V2 |
| COMPACTION_QUEUE |
| COMPLETED_COMPACTIONS |
| COMPLETED_TXN_COMPONENTS |
| DATABASE_PARAMS |
| DBS |
| DB_PRIVS |
| DELEGATION_TOKENS |
| FUNCS |
| FUNC_RU |
| GLOBAL_PRIVS |
| HIVE_LOCKS |
| IDXS |
| INDEX_PARAMS |
| KEY_CONSTRAINTS |
| MASTER_KEYS |
| NEXT_COMPACTION_QUEUE_ID |
| NEXT_LOCK_ID |
| NEXT_TXN_ID |
| NOTIFICATION_LOG |
| NOTIFICATION_SEQUENCE |
| NUCLEUS_TABLES |
| PARTITIONS |
| PARTITION_EVENTS |
| PARTITION_KEYS |
| PARTITION_KEY_VALS |
| PARTITION_PARAMS |
| PART_COL_PRIVS |
| PART_COL_STATS |
| PART_PRIVS |
| ROLES |
| ROLE_MAP |
| SDS |
| SD_PARAMS |
| SEQUENCE_TABLE |
| SERDES |
| SERDE_PARAMS |
| SKEWED_COL_NAMES |
| SKEWED_COL_VALUE_LOC_MAP |
| SKEWED_STRING_LIST |
| SKEWED_STRING_LIST_VALUES |
| SKEWED_VALUES |
| SORT_COLS |
| TABLE_PARAMS |
| TAB_COL_STATS |
| TBLS |
| TBL_COL_PRIVS |
| TBL_PRIVS |
| TXNS |
| TXN_COMPONENTS |
| TYPES |
| TYPE_FIELDS |
| VERSION |
| WRITE_SET |
+---------------------------+
57 rows in set (0.00 sec)
mysql>
Hive's table definitions and other metadata (not the data itself, which stays on HDFS) are stored in these tables, much of it as key-value pairs.
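For example, DBS holds one row per Hive database and TBLS one row per table, so the databases listed by show databases; above can be inspected directly (a sketch; the exact column sets vary slightly across Hive versions):
```
mysql> SELECT DB_ID, NAME, DB_LOCATION_URI FROM DBS;
mysql> SELECT TBL_ID, TBL_NAME, TBL_TYPE FROM TBLS;
```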
Errors and Fixes
- Multiple SLF4J bindings
```
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/data/apache-hive-2.1.0-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/data/hadoop-2.7.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
```
Fix:
The two jars above both bind an SLF4J Logger implementation; deleting the older one resolves the conflict.
``rm -rf /data/hadoop-2.7.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar``
- Hive cannot start because Hadoop is in safe mode
```
Exception in thread "main" java.lang.RuntimeException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.SafeModeException): Cannot create directory /tmp/hive. Name node is in safe mode.
Resources are low on NN. Please add or free up more resources then turn off safe mode manually. NOTE: If you turn off safe mode before adding resources, the NN will immediately return to safe mode. Use "hdfs dfsadmin -safemode leave" to turn safe mode off.
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkNameNodeSafeMode(FSNamesystem.java:1327)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:3895)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:984)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:622)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:578)
at org.apache.hadoop.hive.ql.session.SessionState.beginStart(SessionState.java:518)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:705)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:641)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
```
Fix:
Force Hadoop out of safe mode (on Hadoop 2.x, `hdfs dfsadmin -safemode leave` is the current form; `hadoop dfsadmin` is deprecated but still works):
``$HADOOP_HOME/bin/hadoop dfsadmin -safemode leave``
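It is worth checking the safe-mode status first, and freeing up NameNode disk space if that is what triggered it (as the message above suggests), before forcing it off:
```
$ $HADOOP_HOME/bin/hdfs dfsadmin -safemode get
$ $HADOOP_HOME/bin/hdfs dfsadmin -safemode leave
```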
- ${system:java.io.tmpdir} placeholders cannot be resolved
```
Exception in thread "main" java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: ${system:java.io.tmpdir%7D/$%7Bsystem:user.name%7D
at org.apache.hadoop.fs.Path.initialize(Path.java:205)
at org.apache.hadoop.fs.Path.<init>(Path.java:171)
at org.apache.hadoop.hive.ql.session.SessionState.createSessionDirs(SessionState.java:631)
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:550)
at org.apache.hadoop.hive.ql.session.SessionState.beginStart(SessionState.java:518)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:705)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:641)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: java.net.URISyntaxException: Relative path in absolute URI: ${system:java.io.tmpdir%7D/$%7Bsystem:user.name%7D
at java.net.URI.checkPath(URI.java:1804)
at java.net.URI.<init>(URI.java:752)
at org.apache.hadoop.fs.Path.initialize(Path.java:202)
... 12 more
```
Fix:
Pin `java.io.tmpdir` to a fixed directory:
```
$ mkdir $HIVE_HOME/tmpdir
$ vi $HIVE_HOME/conf/hive-site.xml
```
In hive-site.xml, replace every value containing ${system:java.io.tmpdir} with that directory. Note that hive-site.xml does not expand shell variables, so write the literal absolute path rather than $HIVE_HOME/tmpdir.
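For example, hive.exec.local.scratchdir is one of the affected settings; after the edit the entry would look like this (assuming the /data install path used in this post):
```
<property>
  <name>hive.exec.local.scratchdir</name>
  <!-- default was ${system:java.io.tmpdir}/${system:user.name} -->
  <value>/data/apache-hive-2.1.0-bin/tmpdir</value>
</property>
```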
**Note: every marker like $HIVE_HOME above stands for your own install directory.**
References
- Hadoop集群之Hive安装配置
- hive shell not opening when I have hive-site.xml
- Hadoop 解除 "Name node is in safe mode"
- slf4j提示Class path contains multiple SLF4J bindings
- hbase和hive的差别是什么,各自适用在什么场景中?