Preface
Hive 3.0.0 currently shows an unresolved incompatibility with Tez 0.9.2 in beeline mode, and Spark 2.3's improvements are aimed at streaming and bring little benefit to offline Hive workloads.
So this walkthrough builds Hive 2.3.6 on Spark 2.0.0, which gives us the Spark engine while still fully supporting the Tez engine.
1. Hive and Spark version compatibility:
(image: Hive/Spark version compatibility table)
2. Environment and versions
2.1 Software
jdk-1.8.0
scala-2.11.8
apache-hive-2.3.6.tar.gz
Hadoop-2.7.2
spark-2.0.0-src
maven-3.6.3
Note: installing these packages is not covered here; guides are everywhere. Just make sure the ports do not conflict.
2.2 Maven settings.xml remote-mirror configuration:
<mirror>
<id>nexus-aliyun</id>
<mirrorOf>central</mirrorOf>
<name>Nexus aliyun</name>
<url>http://maven.aliyun.com/nexus/content/groups/public</url>
</mirror>
Note: the default mirror also works, it is just slower.
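If you are not sure where the mirror block goes, a minimal sketch (the path is my assumption, adjust to your Maven installation):
# Maven reads ~/.m2/settings.xml by default; paste the <mirror> entry above into its <mirrors> section
mkdir -p ~/.m2
vi ~/.m2/settings.xml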
3. Compile Spark 2.0.0 (without Hive) and install it
sudo ./dev/make-distribution.sh --name "hadoop2-without-hive" --tgz "-Pyarn,hadoop-provided,hadoop-2.7,parquet-provided"
Note: it is best to run the build as root.
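If the build dies with an out-of-memory error, raising Maven's heap before rerunning usually helps; the values below are an assumption for a small build box:
# Give Maven more memory for the Spark build, then rerun make-distribution.sh
export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"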
What the compiled output looks like:
(screenshot of the build output)
3.1 Extract and rename:
tar -zxvf spark-2.0.0-bin-hadoop2-without-hive.tgz -C ./
mv spark-2.0.0-bin-hadoop2-without-hive spark200
3.2 Configure environment variables and source the profile:
#SPARK_HOME
export SPARK_HOME=/opt/module/spark200
export PATH=$PATH:$SPARK_HOME/bin
export SCALA_HOME=/opt/module/scala-2.11.8
export PATH=$PATH:$SCALA_HOME/bin
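A quick way to confirm the variables took effect after sourcing (just a sketch):
# Both commands should resolve from the PATH entries added above
spark-submit --version
scala -version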
3.3 Configure Spark:
vi /opt/module/spark200/conf/spark-env.sh
export JAVA_HOME=/opt/module/jdk
export SCALA_HOME=/opt/module/scala-2.11.8
export HADOOP_HOME=/opt/module/hadoop-2.7.2
export HADOOP_CONF_DIR=/opt/module/hadoop-2.7.2/etc/hadoop
export HADOOP_YARN_CONF_DIR=/opt/module/hadoop-2.7.2/etc/hadoop
export SPARK_HOME=/opt/module/spark200
export SPARK_WORKER_MEMORY=512m
export SPARK_EXECUTOR_MEMORY=512m
export SPARK_DRIVER_MEMORY=512m
export SPARK_DIST_CLASSPATH=$(/opt/module/hadoop-2.7.2/bin/hadoop classpath)
#SPARK_HISTORY_OPTS="$SPARK_HISTORY_OPTS -Dspark.eventLog.enabled=true -Dspark.eventLog.dir=hdfs://hadoop1:9000/spark-log -Dspark.history.fs.logDirectory=hdfs://hadoop1:9000/spark-log"
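The event-log directory referenced above (and again later in hive-site.xml) must already exist on HDFS, so it is worth creating it up front:
# Create the Spark event-log directory used by spark.eventLog.dir / the history server
hdfs dfs -mkdir -p /spark-log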
3.4 Configure the worker nodes:
vi /opt/module/spark200/conf/slaves
hadoop1
hadoop2
hadoop3
Note: this step is optional; jobs are scheduled by YARN, so the Spark standalone cluster does not even need to be started.
3.5 Upload everything under spark/jars/ to a directory on HDFS
hdfs dfs -mkdir /spark-jars
hdfs dfs -put ./jars/*.jar /spark-jars
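Optionally verify the upload (sketch):
# Confirm the Spark jars are actually on HDFS
hdfs dfs -ls /spark-jars | head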
3.6 Start Spark and run a test
hive --service metastore
hive --service hiveserver2
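In practice the two Hive services are usually left running in the background; a sketch of that (the log file names are arbitrary):
# Run metastore and hiveserver2 in the background instead of tying up two terminals
nohup hive --service metastore   > metastore.log   2>&1 &
nohup hive --service hiveserver2 > hiveserver2.log 2>&1 &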
cd /opt/module/spark200
./sbin/start-all.sh               # web UI on port 8080 by default
./sbin/start-history-server.sh    # history server on port 18080 by default
Note: make sure these ports do not conflict with anything else.
./bin/spark-submit --class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode cluster \
--executor-cores 1 \
--queue default \
./examples/jars/spark-examples_2.11-2.0.0.jar
(screenshot of the test job)
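In yarn-cluster mode the Pi result goes to the driver container's log rather than the console, so pull it from YARN to confirm the job really worked (substitute the application id YARN assigned):
# Fetch the driver log for the finished application and look for the SparkPi output
yarn logs -applicationId <application_id> | grep "Pi is roughly"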
3.7 Add the Spark configuration to Hive (hive-site.xml)
<!-- Spark configuration -->
<property>
<name>hive.execution.engine</name>
<value>spark</value>
</property>
<property>
<name>hive.enable.spark.execution.engine</name>
<value>true</value>
</property>
<property>
<name>spark.home</name>
<value>/opt/module/spark200</value>
</property>
<property>
<name>spark.master</name>
<value>yarn-cluster</value>
</property>
<property>
<name>spark.eventLog.enabled</name>
<value>true</value>
</property>
<property>
<name>spark.eventLog.dir</name>
<value>hdfs://hadoop1:9000/spark-log</value>
</property>
<property>
<name>spark.serializer</name>
<value>org.apache.spark.serializer.KryoSerializer</value>
</property>
<property>
<name>spark.executor.memory</name>
<value>512m</value>
</property>
<property>
<name>spark.driver.memory</name>
<value>512m</value>
</property>
<property>
<name>spark.yarn.jars</name>
<value>hdfs://hadoop1:9000/spark-jars/*</value>
</property>
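With hive-site.xml in place, a simple smoke test from the shell forces a Spark job onto YARN; the database and table below are hypothetical, so substitute ones that exist in your metastore:
# Run an aggregation so Hive actually launches a Spark application (not just a metadata lookup)
hive -e "set hive.execution.engine=spark; select count(*) from test_db.test_table;"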
3.8 Copy the required jars from Spark into Hive's lib directory
cd $SPARK_HOME/jars
cp scala-library-2.11.8.jar $HIVE_HOME/lib/
cp spark-core_2.11-2.0.0.jar $HIVE_HOME/lib/
cp spark-network-common_2.11-2.0.0.jar $HIVE_HOME/lib/
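A quick check that the three jars made it into Hive's classpath (sketch):
# The next hive/beeline session will pick these up from $HIVE_HOME/lib
ls $HIVE_HOME/lib | grep -E "scala-library|spark-core|spark-network-common"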
3.9 Run a query from the Hive CLI:
(screenshot: Hive CLI)
(screenshot: YARN web UI)
3.10 Run a query from Beeline:
(screenshot: beeline session)
Note: since Hive 2.0, getting Beeline to print job progress logs requires modifying and recompiling the Hive source.
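For reference, a typical Beeline session against this setup would look like the sketch below; the host, the default HiveServer2 port 10000, the user name, and the table are assumptions for this cluster:
# Connect to HiveServer2 and run a query on the Spark engine
beeline -u jdbc:hive2://hadoop1:10000 -n hadoop \
  -e "set hive.execution.engine=spark; select count(*) from test_db.test_table;"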
Conclusion
If you run into problems during the setup, just leave a comment. This build uses the officially compatible version pairing, so it is fairly easy; a follow-up will cover building with versions that are not an official match.