The cluster was previously built with CDH 5.2; now we need CDH to support spark-sql. For the cluster setup itself, see the CDH offline installation guide.
Part 1: Prepare the environment
jdk1.7.0_79
scala2.10.4
maven3.3.9
spark-1.1.0.tgz
Configure the following environment variables and apply them with source /etc/profile:
export JAVA_HOME=/usr/local/jdk1
export M2_HOME=/usr/local/maven
export SCALA_HOME=/usr/local/scala
export PATH=$JAVA_HOME/bin:$M2_HOME/bin:$SCALA_HOME/bin:$PATH
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
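Once the profile is sourced, a quick sanity check of the toolchain (these checks are our addition, not part of the original steps):
$ java -version    # should report 1.7.0_79
$ scala -version   # should report 2.10.4
$ mvn -version     # should report 3.3.9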
(Screenshot in the original post: the compiled Spark package is in place.)
Part 2: Compile the Spark source
1. Raise the memory available to Maven, since the build is complex and time-consuming:
export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
2. Extract the source and compile:
nohup mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.5.0-cdh5.2.0 -Phive -Phive-thriftserver -DskipTests clean package > ./spark-mvn-`date +%Y%m%d%H`.log 2>&1 &
Note: Spark 1.1.0 ships no hadoop-2.5 profile, so -Phadoop-2.4 is used for Hadoop 2.4 and later, with the exact CDH version supplied via -Dhadoop.version; Scala 2.10 is the build default, so no Scala flag is needed.
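The build runs in the background. You can follow it in the log, and once it finishes the assembly jar appears under assembly/target (this follows Spark's standard build layout; the paths below are a sketch, not from the original post):
$ tail -f spark-mvn-*.log
$ ls assembly/target/scala-2.10/spark-assembly-*.jar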
Part 3: Install the Spark assembly
1. Copy the assembly jar
Copy the compiled assembly jar into the CDH parcel's jars directory:
$ cp spark-assembly-1.1.0-hadoop2.4.0.jar /opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/jars/
2. Replace the assembly jar under CDH's Spark
Update the symlinks:
$ cd /opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/spark/assembly/lib
$ ln -s ../../../jars/spark-assembly-1.1.0-hadoop2.4.0.jar spark-assembly-1.1.0-hadoop2.4.0.jar
$ ln -s spark-assembly-1.1.0-hadoop2.4.0.jar spark-assembly.jar
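A quick way to verify the links:
$ ls -l /opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/spark/assembly/lib
spark-assembly.jar should now resolve, through the versioned link, to the new jar in ../../../jars/.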
3. Copy the spark-sql launch script
Copy it from the bin directory of the Spark distribution into CDH's Spark bin directory:
$ mv /opt/cloudera/parcels/CDH/lib/spark/bin/spark-sql /opt/cloudera/parcels/CDH/lib/spark/bin/spark-sql.bak
$ cp /root/spark-1.1.0-bin-hadoop2.4/bin/spark-sql /opt/cloudera/parcels/CDH/lib/spark/bin/
4. Configure environment variables
export HADOOP_HOME=/opt/cloudera/parcels/CDH/lib/hadoop
export HADOOP_CONF_DIR=/etc/hadoop/conf
export HADOOP_CMD=/opt/cloudera/parcels/CDH/bin/hadoop
export HIVE_HOME=/opt/cloudera/parcels/CDH/lib/hive
export SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark
export SCALA_HOME=/usr/local/scala-2.10.4
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin:$SCALA_HOME/bin
5. Copy the assembly jar to HDFS
Copy the assembly jar to the /user/spark/share/lib directory on HDFS and set its permissions to 755, for example as sketched below.
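A minimal sketch of the upload, assuming the jar name from step 1 and that the target directory does not yet exist:
$ hadoop fs -mkdir -p /user/spark/share/lib
$ hadoop fs -put spark-assembly-1.1.0-hadoop2.4.0.jar /user/spark/share/lib/
$ hadoop fs -chmod 755 /user/spark/share/lib/spark-assembly-1.1.0-hadoop2.4.0.jar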
6. Configure in Cloudera Manager
Log in to CM and point the Spark service at the assembly jar's path in HDFS.
(Screenshots in the original post walk through modifying the service scope, the advanced configuration, and the client configuration.)
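If you prefer to set the jar location by hand rather than through CM, the client-side equivalent for Spark on YARN is the spark.yarn.jar property in $SPARK_HOME/conf/spark-defaults.conf (a sketch; adjust the HDFS URI to your cluster):
spark.yarn.jar=hdfs:///user/spark/share/lib/spark-assembly-1.1.0-hadoop2.4.0.jar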
7. Run spark-sql
(Screenshot in the original post: running a SQL query in spark-sql.)
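A brief usage example (the table name is hypothetical; any Hive table visible to the metastore will do):
$ spark-sql --master yarn-client
spark-sql> show databases;
spark-sql> select count(*) from sample_table;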
Part 4: Suppress spark-sql's INFO output
1. Back up log4j.properties
Go to the $SPARK_HOME/conf directory:
$ cp /opt/cloudera/parcels/CDH/lib/spark/conf/log4j.properties /opt/cloudera/parcels/CDH/lib/spark/conf/log4j.properties.bak
2. Edit log4j.properties and change INFO to WARN on the rootCategory line (the second line of the file), as sketched below.
(Screenshot in the original post: the modified file.)
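Assuming the stock Spark log4j.properties template, the change is a single line:
log4j.rootCategory=WARN, console
(previously: log4j.rootCategory=INFO, console)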
3. If spark-sql fails with local class incompatible: stream classdesc serialVersionUID = 5017373498943810947, local class serialVersionUID = 18257903091306170
Solution: the client's class version differs from the server's. Upload the client's assembly jar to HDFS and point the Spark configuration at it, as in steps 5 and 6.
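A sketch of that fix, reusing the paths from the earlier steps (jar names assumed; substitute your own):
$ hadoop fs -rm /user/spark/share/lib/spark-assembly-1.1.0-hadoop2.4.0.jar
$ hadoop fs -put /opt/cloudera/parcels/CDH/lib/spark/assembly/lib/spark-assembly.jar /user/spark/share/lib/spark-assembly-1.1.0-hadoop2.4.0.jar
Then re-check the CM setting from step 6 so the client and server resolve the same jar.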