1. System environment
cdh:6.3.2
hive:2.1.1
hadoop:3.0.0
2. Preparation
The required JARs can be downloaded from the Maven Central repository. ==Special note: mongo-hadoop-core was compiled manually.==
mongo-java-driver-3.12.8
mongo-hadoop-core-2.0.2
mongo-hadoop-hive-2.0.2
The JARs after compilation are:
mongo-hadoop-core-2.0.2.jar mongo-java-driver-3.12.8.jar mongo-hadoop-hive-2.0.2.jar
Download links:
http://mvnrepository.com/artifact/org.mongodb.mongo-hadoop/mongo-hadoop-core/2.0.2
http://mvnrepository.com/artifact/org.mongodb.mongo-hadoop/mongo-hadoop-hive/2.0.2
http://mvnrepository.com/artifact/org.mongodb/mongo-java-driver/3.12.8
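3. Configure the Hive auxiliary JAR directory in CDH and distribute the three JARs to that directory on each machine
In other setups a different path can be specified. For the detailed configuration steps, see: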
https://blog.csdn.net/shujuelin/article/details/106372341
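For a quick test before changing the cluster-wide configuration, the JARs can also be registered for a single Hive session. A minimal sketch, assuming the three JARs were copied to a hypothetical local directory `/opt/hive/auxlib` on the host that runs the session (the HiveServer2 host when connecting through Beeline):

```sql
-- Hypothetical path: replace /opt/hive/auxlib with the directory that actually holds the JARs.
ADD JAR /opt/hive/auxlib/mongo-java-driver-3.12.8.jar;
ADD JAR /opt/hive/auxlib/mongo-hadoop-core-2.0.2.jar;
ADD JAR /opt/hive/auxlib/mongo-hadoop-hive-2.0.2.jar;

-- Show the JARs registered in the current session.
LIST JARS;
```

JARs added this way are visible only to the current session, so the auxiliary JAR directory is still the better choice for permanent use.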
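4. Procedure
There are two ways to integrate MongoDB with Hive:
- MongoDB-based: connects directly to a hidden node and uses com.mongodb.hadoop.hive.MongoStorageHandler to serialize and deserialize the data
- BSON-based: dumps the data to BSON files, uploads them to HDFS, and uses com.mongodb.hadoop.hive.BSONSerDe
- MongoDB-based: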
CREATE external TABLE test2(
`_id` string comment 'primary key id',
fullOrderInfo string comment 'order data',
`_class` string comment 'model'
)
STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler'
WITH SERDEPROPERTIES('mongo.columns.mapping'='{"_id":"_id","fullOrderInfo":"fullOrderInfo","_class":"_class"}')
TBLPROPERTIES('mongo.uri'='mongodb://admin:123456@XX.XX.XX.XX:27017/database_name.collection_name?authSource=admin&authMechanism=SCRAM-SHA-1');
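Once the table exists, a short query is a quick way to confirm that Hive can reach MongoDB and that the column mapping works. A minimal sketch against the `test2` table, assuming the JARs from step 3 are already on the Hive classpath:

```sql
-- Inspect the table definition, including the storage handler and table properties.
DESCRIBE FORMATTED test2;

-- Read a few documents through MongoStorageHandler.
-- Column names that start with an underscore must be quoted with backticks.
SELECT `_id`, fullOrderInfo, `_class`
FROM test2
LIMIT 10;
```

Note that `DESCRIBE FORMATTED` prints the table properties, including the credentials embedded in `mongo.uri`, so avoid pasting its output into shared logs.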
- BSON-based:
create external table if not exists test
(
`_id` string comment 'primary key id',
fullOrderInfo string comment 'order data',
`_class` string comment 'model'
)
comment 'temporary table'
row format serde 'com.mongodb.hadoop.hive.BSONSerDe'
WITH SERDEPROPERTIES('mongo.columns.mapping'='{"fullOrderInfo":"fullOrderInfo"}')
stored as inputformat 'com.mongodb.hadoop.mapred.BSONFileInputFormat'
outputformat 'com.mongodb.hadoop.hive.output.HiveBSONFileOutputFormat'
location '/warehouse/XX//XX/XXXX';
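==Note: this approach can fail with a variety of errors, which are usually caused by incompatible JAR versions. Resolve them according to the error message.==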
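5. Summary
- With the first approach, create an external table whenever possible, so that dropping the Hive table does not accidentally delete data.
- The first approach suits reading data in real time.
- The second approach suits offline synchronization and is more stable.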
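For the offline case, one option is to copy the BSON-backed table into a native Hive table so that later queries no longer depend on the mongo-hadoop SerDe. This is only a sketch under assumptions, not part of the original setup: `test_snapshot` and the column aliases `id` and `class_name` are hypothetical names, and Parquet is just one possible storage format.

```sql
-- test_snapshot, id and class_name are hypothetical names used only for this sketch.
CREATE TABLE IF NOT EXISTS test_snapshot
STORED AS PARQUET
AS
SELECT
  `_id`    AS id,          -- rename to avoid a leading underscore in the new table
  fullOrderInfo,
  `_class` AS class_name
FROM test;
```

Re-running the statement does not refresh an existing snapshot because of `IF NOT EXISTS`; a scheduled job would need to drop or overwrite the table first.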