I previously worked on a big data analytics project built on Spark. Since I have decided to write up my notes for learning and exchange, I am gradually publishing the material I have organized; comments and corrections are welcome.
The overall setup is as follows. Most of the installation steps come from the official documentation and reliable online sources, and the parameters below were chosen after testing.
1. Add the IP-to-hostname mapping of every server to /etc/hosts on all servers
vi /etc/hosts
IP address and hostname of each server
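For example, assuming a three-node cluster (the IP addresses and hostnames below are placeholders; replace them with your own), each node's /etc/hosts would contain:
192.168.1.10  master
192.168.1.11  slave1
192.168.1.12  slave2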
2. Passwordless SSH login
Install SSH on all servers and set up passwordless login between all of them.
Generate an RSA key pair on each server (press Enter to accept the defaults):
ssh-keygen -t rsa
If SSH connections still fail, back up the public key and authorized_keys files and transfer them manually.
Send the id_rsa.pub from each machine to the master node; the public keys can be transferred with scp:
scp ~/.ssh/id_rsa.pub root@master:~/.ssh/id_rsa.pub.slave1
On master, append all public keys to the authentication file authorized_keys:
cat ~/.ssh/id_rsa.pub* >> ~/.ssh/authorized_keys
Distribute the authorized_keys file to every slave:
scp ~/.ssh/authorized_keys spark@slave1:~/.ssh/
Verify passwordless SSH communication on every machine.
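For example, assuming the placeholder hostnames above, run the following from master; each line should print the remote hostname without prompting for a password:
for host in master slave1 slave2; do ssh $host hostname; done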
3. Install Java
Download JDK 8 and unpack it.
Edit the environment variables with sudo vi /etc/profile and add the following, replacing the home path with the one on your server:
export WORK_SPACE=/home/spark/workspace/
export JAVA_HOME=$WORK_SPACE/jdk
export JRE_HOME=$JAVA_HOME/jre
export PATH=$JAVA_HOME/bin:$JAVA_HOME/jre/bin:$PATH
export CLASSPATH=$CLASSPATH:.:$JAVA_HOME/lib:$JAVA_HOME/jre/lib
Then make the environment variables take effect and verify that Java is installed correctly:
$ source /etc/profile   # reload the environment variables
$ java -version         # check the Java version
If version information like the following is printed, the installation succeeded:
java version
Java(TM) SE Runtime Environment
Java HotSpot(TM) 64-Bit Server VM (build 24.75-b04, mixed mode)
4. Install Scala
Spark 2.0.x is built against Scala 2.11.x, so download Scala 2.11.8 (matching the Spark configuration in section 6) and unpack it.
Edit the environment variables again with sudo vi /etc/profile and add the following, replacing the home path with the one on your server:
export SCALA_HOME=$WORK_SPACE/scala
export PATH=$PATH:$SCALA_HOME/bin
Make the environment variables take effect in the same way and verify that Scala is installed correctly:
$ source /etc/profile   # reload the environment variables
$ scala -version        # check the Scala version
If version information like the following is printed, the installation succeeded:
Scala code runner version 2.11.8 -- Copyright 2002-2016, LAMP/EPFL
5. Configure the Hadoop cluster
Download Hadoop 2.7.3 from the official site and unpack it.
Run cd ~/workspace/hadoop-2.7.3/etc/hadoop to enter the Hadoop configuration directory. The following files need to be configured: hadoop-env.sh, yarn-env.sh, slaves, core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml, capacity-scheduler.xml.
In hadoop-env.sh, set JAVA_HOME to the corresponding home path on the server:
# The java implementation to use.
export JAVA_HOME=/home/spark/workspace/jdk
export HADOOP_HEAPSIZE=6000
In yarn-env.sh, set JAVA_HOME as well:
# some Java parameters
export JAVA_HOME=/home/spark/workspace/jdk
YARN_HEAPSIZE=6000
In the slaves file, list the IP or hostname of each slave node, one per line (XX is a placeholder for your servers' hostnames):
XX
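For example, with the placeholder hostnames used earlier, the slaves file would simply contain:
slave1
slave2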
Edit core-site.xml:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://xx:9000/</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hadooptemp</value>
  </property>
</configuration>
Edit hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>xx:9001</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/usr/local/hadoop-2.7.3/dfs/name</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/usr/local/hadoop-2.7.3/dfs/data</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
Edit mapred-site.xml:
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
Edit yarn-site.xml:
<configuration>
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>16</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>16384</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle,spark_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
    <value>org.apache.spark.network.yarn.YarnShuffleService</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>xx:8032</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>xx:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>xx:8035</value>
  </property>
  <property>
    <name>yarn.resourcemanager.admin.address</name>
    <value>xx:8033</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>xx:8088</value>
  </property>
</configuration>
Edit capacity-scheduler.xml:
<configuration>
  <property>
    <name>yarn.scheduler.capacity.maximum-applications</name>
    <value>10000</value>
    <description>Maximum number of applications that can be pending and running.</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
    <value>0.1</value>
    <description>Maximum percent of resources in the cluster which can be used to run
      application masters i.e. controls number of concurrent running applications.</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.resource-calculator</name>
    <value>org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator</value>
    <description>The ResourceCalculator implementation to be used to compare
      Resources in the scheduler. The default i.e. DefaultResourceCalculator only uses
      Memory while DominantResourceCalculator uses dominant-resource to compare
      multi-dimensional resources such as Memory, CPU etc.</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>default,cdn</value>
    <description>The queues at the this level (root is the root queue).</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.cdn.capacity</name>
    <value>30</value>
    <description>Default queue target capacity.</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.cdn.state</name>
    <value>RUNNING</value>
    <description>The state of the queue. State can be one of RUNNING or STOPPED.</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.default.capacity</name>
    <value>70</value>
    <description>Default queue target capacity.</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.default.user-limit-factor</name>
    <value>1.0</value>
    <description>Default queue user limit a percentage from 0.0 to 1.0.</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.default.maximum-capacity</name>
    <value>100</value>
    <description>The maximum capacity of the default queue.</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.cdn.maximum-capacity</name>
    <value>100</value>
    <description>The maximum capacity of the cdn queue.</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.default.state</name>
    <value>RUNNING</value>
    <description>The state of the default queue. State can be one of RUNNING or STOPPED.</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.default.acl_submit_applications</name>
    <value>*</value>
    <description>The ACL of who can submit jobs to the default queue.</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.default.acl_administer_queue</name>
    <value>*</value>
    <description>The ACL of who can administer jobs on the default queue.</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.node-locality-delay</name>
    <value>40</value>
    <description>Number of missed scheduling opportunities after which the CapacityScheduler
      attempts to schedule rack-local containers.
      Typically this should be set to the number of nodes in the cluster. By default it is set to
      approximately the number of nodes in one rack, which is 40.</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.queue-mappings</name>
    <value></value>
    <description>A list of mappings that will be used to assign jobs to queues.
      The syntax for this list is [u|g]:[name]:[queue_name][,next mapping]*.
      Typically this list will be used to map users to queues, for example,
      u:%user:%user maps all users to queues with the same name as the user.</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.queue-mappings-override.enable</name>
    <value>false</value>
    <description>If a queue mapping is present, will it override the value specified
      by the user? This can be used by administrators to place jobs in queues
      that are different than the one specified by the user.
      The default is false.</description>
  </property>
</configuration>
Distribute the configured hadoop-2.7.3 directory to all slaves:
scp -r ~/workspace/hadoop-2.7.3 root@slave1:~/workspace/
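With several slaves it is convenient to loop over them (slave1 and slave2 are placeholder hostnames; substitute your own):
for host in slave1 slave2; do scp -r ~/workspace/hadoop-2.7.3 root@$host:~/workspace/; done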
Start Hadoop
Run the following on master to start Hadoop:
cd ~/workspace/hadoop-2.7.3   # enter the hadoop directory
bin/hdfs namenode -format     # format the namenode
sbin/start-dfs.sh             # start dfs
sbin/start-yarn.sh            # start yarn
To verify that Hadoop is installed correctly, use the jps command to check that the expected processes are running on each node. On master you should see the following processes:
$jps #run on master
3407 SecondaryNameNode
3218 NameNode
3552 ResourceManager
3910 Jps
On each slave you should see the following processes:
$jps #run on slaves
2072 NodeManager
2213 Jps
1962 DataNode
Alternatively, open http://master:8088 in a browser; the Hadoop (YARN) management UI should appear and show the configured cluster.
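The same can be checked from the command line; running the following in the Hadoop directory on master should list all live DataNodes:
bin/hdfs dfsadmin -report   # prints cluster capacity and the registered DataNodes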
6. Configure the Spark cluster
Download Spark 2.0.1 from the official site and unpack it.
Only the most important environment parameters are listed below; you need to configure spark-env.sh and spark-defaults.conf under the conf directory.
Configure spark-env.sh:
export SCALA_HOME=/usr/local/scala-2.11.8
export JAVA_HOME=/usr/local/java
export HADOOP_HOME=/usr/local/hadoop-2.7.3
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
SPARK_MASTER_IP=BJYZ-VA-TJFX1
SPARK_LOCAL_DIRS=/usr/local/spark-2.0.1-bin-hadoop2.7.3
SPARK_DRIVER_MEMORY=2G
SPARK_EXECUTOR_MEMORY=1G
Configure spark-defaults.conf:
spark.default.parallelism 64
spark.executor.extraJavaOptions  -Xss1024k -XX:PermSize=128M -XX:MaxPermSize=256M
spark.dynamicAllocation.enabled true
spark.shuffle.service.enabled true
spark.dynamicAllocation.initialExecutors 4
spark.dynamicAllocation.minExecutors 2
spark.dynamicAllocation.maxExecutors 10
spark.dynamicAllocation.executorIdleTimeout 30s
spark.executor.memory 6g
spark.executor.cores 4
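One step this setup implicitly depends on: yarn-site.xml registers org.apache.spark.network.yarn.YarnShuffleService and spark-defaults.conf enables spark.shuffle.service.enabled, so the Spark YARN shuffle jar must be on the classpath of every NodeManager. A common way to arrange this (the paths below are examples based on the directories used above; the exact jar name may differ in your package) is to copy the jar shipped under Spark's yarn/ directory into Hadoop's YARN lib directory on every node and restart the NodeManagers:
cp /usr/local/spark-2.0.1-bin-hadoop2.7.3/yarn/spark-2.0.1-yarn-shuffle.jar /usr/local/hadoop-2.7.3/share/hadoop/yarn/lib/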
Verify that Spark is installed correctly. Start Spark with the start-all.sh script in the sbin directory:
sbin/start-all.sh
Check with jps; on master you should see the following processes:
$jps
7949 Jps
7328 SecondaryNameNode
7805 Master
7137 NameNode
7475 ResourceManager
On each slave you should see the following processes:
$jps
3132 DataNode
3759 Worker
3858 Jps
3231 NodeManager
Open the Spark web management UI at http://master:8080.
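As a final end-to-end check, you can submit the bundled SparkPi example to YARN (a sketch; the jar path assumes the prebuilt Spark 2.0.1 package for Scala 2.11, so adjust it to your installation):
bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster examples/jars/spark-examples_2.11-2.0.1.jar 100
If the application finishes in state SUCCEEDED on http://master:8088, the Hadoop and Spark stack is working end to end.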