Hive Deployment and HBase Integration
1. Software Versions
hadoop: 2.8.3
hbase: 1.0.0
hive: 2.3.4
mysql: 5.7.24
2. Prepare MySQL
# MySQL only stores the metadata of Hive tables; the actual table data is not stored here
2.1 Create a database named hive
2.2 Change the database character set to latin1 and the collation to latin1_bin
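A minimal sketch of steps 2.1 and 2.2 in the MySQL client; the GRANT line is an assumption that matches the zabbix/root credentials configured in hive-site.xml in section 3.2 (skip it if that account already exists):
CREATE DATABASE hive CHARACTER SET latin1 COLLATE latin1_bin;
-- assumed: grant access to the metastore user configured in section 3.2
GRANT ALL PRIVILEGES ON hive.* TO 'zabbix'@'%' IDENTIFIED BY 'root';
FLUSH PRIVILEGES;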
3. hive配置
--我将下载好的tar.gz包解压到 /home/etc/这个目录下,成为 /home/etc/apache-hive-2.3.4/
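A sketch of the extraction; the tarball name and its apache-hive-2.3.4-bin unpack directory are assumptions based on standard Apache release naming:
# assumed release tarball; the archive unpacks to apache-hive-2.3.4-bin
tar -zxvf apache-hive-2.3.4-bin.tar.gz -C /home/etc/
mv /home/etc/apache-hive-2.3.4-bin /home/etc/apache-hive-2.3.4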
3.1 In the /home/etc/apache-hive-2.3.4/conf/ directory, copy hive-default.xml.template to a file named hive-site.xml
cp hive-default.xml.template hive-site.xml
3.2 Edit the configuration
<!-- Find each of the following properties by its name and modify the value -->
<!-- MySQL connection URL -->
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>
<!-- MySQL JDBC driver class name -->
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>Driver class name for a JDBC metastore</description>
</property>
<!-- MySQL connection username -->
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>zabbix</value>
<description>username to use against metastore database</description>
</property>
<!-- MySQL connection password -->
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>root</value>
<description>password to use against metastore database</description>
</property>
<!-- Local scratch directory; it must be created manually (see 3.3) -->
<property>
<name>hive.exec.local.scratchdir</name>
<value>/home/etc/apache-hive-2.3.4/iotmp</value>
<description>Local scratch space for Hive jobs</description>
</property>
<!-- Same scratch directory as above -->
<property>
<name>hive.querylog.location</name>
<value>/home/etc/apache-hive-2.3.4/iotmp</value>
<description>Location of Hive run time structured log file</description>
</property>
<!-- Same scratch directory as above -->
<property>
<name>hive.downloaded.resources.dir</name>
<value>/home/etc/apache-hive-2.3.4/iotmp</value>
<description>Temporary local directory for added resources in the remote file system. </description>
</property>
3.3 Create the iotmp directory
Create an iotmp directory under the apache-hive-2.3.4 directory
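This is the directory referenced by the three scratch-path properties above:
mkdir -p /home/etc/apache-hive-2.3.4/iotmp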
3.4 Configure environment variables
Edit ~/.bash_profile (per-user environment variables)
# 1. Set HIVE_HOME
export HIVE_HOME=/home/etc/apache-hive-2.3.4
# 2. Add HIVE_HOME to the PATH environment variable
export PATH=$PATH:$HIVE_HOME/bin
# 3. Reload the environment variables
source ~/.bash_profile
3.5 Import the HBase jars and the MySQL driver jar
# 1. Put the MySQL driver jar into /home/etc/apache-hive-2.3.4/lib/
mysql-connector-java-5.1.9.jar
# 2. Put the HBase jars into /home/etc/apache-hive-2.3.4/lib/
From the HBase lib directory, run:
cp hbase-protocol-1.0.0.jar /home/etc/apache-hive-2.3.4/lib/
cp hbase-server-1.0.0.jar /home/etc/apache-hive-2.3.4/lib/
cp hbase-client-1.0.0.jar /home/etc/apache-hive-2.3.4/lib/
cp hbase-common-1.0.0.jar /home/etc/apache-hive-2.3.4/lib/
cp hbase-common-1.0.0-tests.jar /home/etc/apache-hive-2.3.4/lib/
3.6 Add the HBase configuration to Hive
Edit hive-site.xml under /home/etc/apache-hive-2.3.4/conf/
<!-- Add the file:// jar paths to the reloadable auxiliary jars path -->
<property>
<name>hive.reloadable.aux.jars.path</name>
<value>
file:///home/etc/apache-hive-2.3.4/lib/hive-hbase-handler-2.3.4.jar,
file:///home/etc/apache-hive-2.3.4/lib/hbase-protocol-1.0.0.jar,
file:///home/etc/apache-hive-2.3.4/lib/hbase-server-1.0.0.jar,
file:///home/etc/apache-hive-2.3.4/lib/hbase-client-1.0.0.jar,
file:///home/etc/apache-hive-2.3.4/lib/hbase-common-1.0.0.jar,
file:///home/etc/apache-hive-2.3.4/lib/hbase-common-1.0.0-tests.jar,
file:///home/etc/apache-hive-2.3.4/lib/zookeeper-3.4.6.jar,
file:///home/etc/apache-hive-2.3.4/lib/guava-14.0.1.jar
</value>
<description>
The locations of the plugin jars, which can be a comma-separated folders or jars. Jars can be renewed
by executing reload command. And these jars can be used as the auxiliary classes like creating a UDF or SerDe.
</description>
</property>
<!-- Prevents the startup error MissingTableException: Required table missing : "VERSION" in Catalog ""
Schema "". DataNucleus requires this table to perform its persistence operations -->
<property>
<name>datanucleus.schema.autoCreateAll</name>
<value>true</value>
<description>Auto creates necessary schema on a startup if one doesn't exist</description>
</property>
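As an alternative to letting DataNucleus auto-create the schema, the metastore schema can be initialized once with the schematool utility shipped in $HIVE_HOME/bin; a sketch of the usual invocation:
# one-time initialization of the metastore schema in the MySQL database from section 2
schematool -dbType mysql -initSchema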
4. Starting Hive
4.1 Interactive startup
Since $HIVE_HOME/bin is on the PATH, typing hive anywhere starts a Hive interactive shell.
set hive.cli.print.current.db=true;  -- show the current database in the prompt
set hive.cli.print.header=true;      -- show column names in query results
4.2 Start the Hive service
From the /home/etc/apache-hive-2.3.4/bin/ directory:
hiveserver2 -hiveconf hive.root.logger=DEBUG,console
Or start it in the background:
hiveserver2 1>/dev/null 2>&1 &
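Once HiveServer2 is up, a client can connect with Beeline; 10000 is the default HiveServer2 port, and localhost is an assumption (substitute your server's address):
# connect to HiveServer2 over JDBC
beeline -u jdbc:hive2://localhost:10000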
5. Creating Hive Tables
5.1 Create the HBase table at the same time as the Hive table
Run in the Hive interactive shell
-- tbl_name is the table name in Hive
-- STORED BY specifies the storage handler
-- hbase.columns.mapping declares the column family and column
-- hbase.table.name declares the HBase table name; optional, defaults to the Hive table name
-- hbase.mapred.output.outputtable specifies the table that inserts write to; set it if you plan to insert data into this table later
-- key maps to the rowkey, cf1 is the column family, val is the column
CREATE TABLE tbl_name(key int, value string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")
TBLPROPERTIES ("hbase.table.name" = "tbl_name", "hbase.mapred.output.outputtable" = "tbl_name");
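A usage sketch for populating the table through Hive; the source table src(key int, value string) is hypothetical, but INSERT ... SELECT is the standard way to load an HBase-backed table:
-- src is a hypothetical Hive table with compatible columns; the rows
-- written here land in the underlying HBase table tbl_name
INSERT OVERWRITE TABLE tbl_name SELECT key, value FROM src;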
5.2 Create a Hive table over an existing HBase table
Run in the Hive interactive shell
-- An HBase table named JIAYUAN already exists with one column family, body
-- Its columns are userId, url, deviceId, type, platId, channelId, citycode, req, res, time
-- EXTERNAL marks this as an external table backed by HBase
-- jiayuan_table1 is the table name in Hive
CREATE EXTERNAL TABLE jiayuan_table1 (
rowkey string,
userId string,
url string,
deviceId string,
type string,
platId string,
channelId string,
citycode string,
req string,
res string,
time string) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,body:userId,body:url,body:deviceId,body:type,body:platId,body:channelId,body:citycode,body:req,body:res,body:time")
TBLPROPERTIES ("hbase.table.name" = "JIAYUAN");
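To confirm the HBase table and its column family before (or after) mapping it, you can check from the hbase shell:
# run inside the hbase shell; prints the column families of JIAYUAN
describe 'JIAYUAN'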
Inspecting the data
show tables;  -- list all tables
select * from jiayuan_table1 limit 1;  -- fetch one row from the table
hive> select * from jiayuan_table1 limit 1;
OK
8c64a436-afb7-4d91-8e24-726e7437dc791604887326065 8c64a436-afb7-4d91-8e24-726e7437dc79 /house/search-keyword GET KM ios NULL {"searchType":"forsale","cityCode":"530100","keyword":"j"} {"code":"hlsp-err-02","data":[],"errMsg":"关键词字数不少于2个字"} 1604887326065
Time taken: 3.282 seconds, Fetched: 1 row(s)
6. Connecting to Hive from PySpark
6.1 Copy the jar
Find the hive-hbase-handler-2.3.4.jar package in /home/etc/apache-hive-2.3.4/lib/
and put it into the jars directory under the Spark installation,
otherwise reading Hive table data fails with the following error:
# java.lang.ClassNotFoundException: org.apache.hadoop.hive.hbase.HBaseStorageHandler
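A sketch of the copy, assuming SPARK_HOME points at the Spark installation directory:
# SPARK_HOME is an assumption; substitute your Spark installation path
cp /home/etc/apache-hive-2.3.4/lib/hive-hbase-handler-2.3.4.jar $SPARK_HOME/jars/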
6.2 Edit the configuration file
Edit /home/etc/apache-hive-2.3.4/conf/hive-site.xml
<!-- host is the address of the Hive server; 9083 is the default metastore port -->
<property>
<name>hive.metastore.uris</name>
<value>thrift://host:9083</value>
<description>Thrift URI for the remote metastore. Used by metastore client to connect to remote metastore.</description>
</property>
6.3 迁移配置文件
将/home/etc/apache-hive-2.3.4/conf/hive-site.xml
复制一份到 spark目录下的conf文件夹内
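Again assuming SPARK_HOME points at the Spark installation:
cp /home/etc/apache-hive-2.3.4/conf/hive-site.xml $SPARK_HOME/conf/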
6.4 Start the services
# 1. Start the metastore service
hive --service metastore
# 2. Start HiveServer2
hiveserver2 -hiveconf hive.root.logger=DEBUG,console
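Both services can also be backgrounded, mirroring the pattern from 4.2:
hive --service metastore 1>/dev/null 2>&1 &
hiveserver2 1>/dev/null 2>&1 &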
6.5 Accessing Hive from PySpark
from pyspark.sql import SparkSession
# the URI in config() is the metastore address configured in 6.2
# enableHiveSupport() turns on Hive support
spark = SparkSession\
.builder\
.appName('map-search-list')\
.config("hive.metastore.uris", "thrift://128.196.216.16:9083")\
.enableHiveSupport()\
.getOrCreate()
# list all tables
spark.sql('show tables').show()
hive_database = "default"  # the database to query
hive_table = "jiayuan_table1"  # the table to query
hive_read_sql = "select * from {}.{}".format(hive_database, hive_table)
df = spark.sql(hive_read_sql)  # default.jiayuan_table1
df.show()
print('finished reading from Hive')
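A hedged usage note: the snippet can run in the interactive pyspark shell, or be saved to a file and submitted with spark-submit; read_hive.py is a hypothetical filename:
# submit the script above to Spark; read_hive.py is hypothetical
spark-submit read_hive.py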