I. Hadoop Distributed Cluster Setup
1 Cluster Deployment Preparation
Node | hostName | IP address
---|---|---
Master | spark | 192.168.59.137
Slave1 | sparkslave | 192.168.59.138
Two CentOS virtual machines are used; their details are as follows:
[root@spark ~]# uname -a
Linux spark 3.10.0-693.el7.x86_64 #1 SMP Tue Aug 22 21:09:27 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
[root@spark ~]# cat /etc/redhat-release
CentOS Linux release 7.4.1708 (Core)
2 Set the Hostnames
2.1 Log in to the Master node as root and edit /etc/hostname. On CentOS 7 this file should contain only the hostname itself:
spark
On CentOS 6 and earlier, edit /etc/sysconfig/network instead and set HOSTNAME=spark.
2.2 Edit /etc/hosts and add an entry for each node so that both hostnames resolve:
192.168.59.137 spark
192.168.59.138 sparkslave
2.3 Reboot the system (reboot). After rebooting, the configuration succeeded if the hostname command prints spark:
hostname
2.4 Apply the same configuration on the Slave node (with sparkslave as its hostname).
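On CentOS 7, hostnamectl offers a quicker alternative that writes /etc/hostname for you; a minimal sketch, run as root on each node:
hostnamectl set-hostname spark        # on the Master node
hostnamectl set-hostname sparkslave   # on the Slave node
hostnamectl status                    # confirm the static hostname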
3 Passwordless SSH Login
3.1 Create the hadoop user on every node (create the hadoop group first if it does not exist):
groupadd hadoop
useradd -u XXXX -g hadoop -d /home/hadoop -c "Hadoop User." -m -s /bin/bash hadoop
passwd hadoop
3.2 Passwordless login from the Master node
Perform the following steps on the Master node.
3.2.1 First make sure the ssh service is installed, then switch to the hadoop user:
su - hadoop
3.2.2 Then, as the hadoop user, generate a public/private key pair; just press Enter at every prompt.
ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop/.ssh/id_rsa):
Created directory '/home/hadoop/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/hadoop/.ssh/id_rsa.
Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:KlyIXT+aFZrvqtl6Mt5UdwnflSd1pAkJYlz+YRAxYVw hadoop@spark
The key's randomart image is:
+---[RSA 2048]----+
| .ooX*E. .+|
| ..+.o.. +o|
| . . o o +.o|
| o o + . = + o.|
| . o + S . = . |
| . . B o . |
| o = . |
| o*.. |
| .=*+.. |
+----[SHA256]-----+
When it finishes, the generated private and public keys (id_rsa and id_rsa.pub) can be seen under /home/hadoop/.ssh/:
[hadoop@spark ~]$ cd .ssh
[hadoop@spark .ssh]$ ls -ltr
total 8
-rw-r--r--. 1 hadoop hadoop 394 Mar 15 13:02 id_rsa.pub
-rw-------. 1 hadoop hadoop 1679 Mar 15 13:02 id_rsa
3.2.3 Next, install the public key into the Slave node's authorized_keys file with ssh-copy-id:
[hadoop@spark .ssh]$ ssh-copy-id hadoop@sparkslave
/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/home/hadoop/.ssh/id_rsa.pub"
The authenticity of host 'sparkslave (192.168.59.138)' can't be established.
ECDSA key fingerprint is SHA256:3yVmsQP6CIq8vj0Pd3lOW/q98EqFlF2g1YxyjFZD6Dk.
ECDSA key fingerprint is MD5:90:00:e5:c6:c6:3f:1a:73:67:2b:72:b8:1b:f4:5c:33.
Are you sure you want to continue connecting (yes/no)? yes
/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
hadoop@sparkslave's password:
Number of key(s) added: 1
Now try logging into the machine, with: "ssh 'hadoop@sparkslave'"
and check to make sure that only the key(s) you wanted were added.
Check on sparkslave:
[hadoop@sparkslave .ssh]$ ls -ltr
total 16
-rw-r--r--. 1 hadoop hadoop 399 Mar 15 13:02 id_rsa.pub
-rw-------. 1 hadoop hadoop 1675 Mar 15 13:02 id_rsa
-rw-------. 1 hadoop hadoop 394 Mar 15 13:04 authorized_keys
-rw-r--r--. 1 hadoop hadoop 182 Mar 15 13:12 known_hosts
Also append the node's own public key to the end of authorized_keys (so the hadoop user can start services locally without a password prompt):
cat id_rsa.pub >> authorized_keys
3.2.4 Test whether a password is still required. The first connection asks you to confirm with yes; after that there is no prompt. If you can log in without a password, the setup succeeded.
ssh hadoop@sparkslave
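If the login still asks for a password, sshd is usually rejecting the key because of file permissions; a hedged checklist, run as hadoop on the target node (here sparkslave):
chmod 755 /home/hadoop                      # the home directory must not be group/world writable
chmod 700 /home/hadoop/.ssh
chmod 600 /home/hadoop/.ssh/authorized_keys # sshd ignores keys if these are too open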
3.3 Generate and upload public keys from the Slave nodes
So that the Master can log in to every Slave node without a password, and each Slave node can likewise log in to the Master as the hadoop user, perform the following steps on every Slave node.
3.3.1 First make sure the ssh service is installed, then switch to the hadoop user:
su - hadoop
3.3.2 Then, as the hadoop user, generate a public/private key pair; just press Enter at each prompt.
ssh-keygen -t rsa
3.3.3 Then upload the public key to the Master node. You will be prompted for the hadoop user's password on the Master; enter it and the key is installed.
ssh-copy-id hadoop@spark
Also append the node's own public key to the end of authorized_keys (so the hadoop user can start services locally without a password prompt):
cat id_rsa.pub >> authorized_keys
3.3.4 Test
Once the keys are in place, test whether logging in to the Master node still requires a password.
ssh hadoop@spark
If you can log in without a password, the passwordless SSH configuration is complete.
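To confirm that every direction works without any prompt, a small loop like the following can be run as hadoop from each node; BatchMode=yes makes ssh fail instead of asking for a password (node names as in the table above):
for host in spark sparkslave; do
  ssh -o BatchMode=yes hadoop@$host hostname || echo "passwordless login to $host FAILED"
done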
4 Install the JDK
4.1 Remove the bundled OpenJDK
First, uninstall the OpenJDK packages that ship with the system.
[root@spark ~]# rpm -qa | grep java
tzdata-java-2017b-1.el7.noarch
python-javapackages-3.4.1-11.el7.noarch
java-1.8.0-openjdk-1.8.0.131-11.b12.el7.x86_64
java-1.8.0-openjdk-headless-1.8.0.131-11.b12.el7.x86_64
javapackages-tools-3.4.1-11.el7.noarch
Uninstall each of the installed packages listed above, for example:
rpm -e --nodeps java-1.8.0-openjdk-1.8.0.131-11.b12.el7.x86_64
rpm -e --nodeps java-1.8.0-openjdk-headless-1.8.0.131-11.b12.el7.x86_64
Note: to remove the JDK completely, run the corresponding rpm -e command for every listed package.
4.2 Install the Oracle (Sun) JDK
mkdir -p /usr/local/java
Download the JDK from the official site and extract it into that directory:
tar -zxvf jdk-8u162-linux-x64.tar.gz -C /usr/local/java
Then edit the system profile:
vi /etc/profile
#JAVA ENV
export JAVA_HOME=/usr/local/java/jdk1.8.0_162
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export PATH=$PATH:$JAVA_HOME/bin
After saving and exiting, run the following command so the changes take effect immediately; otherwise a reboot (or re-login) is required.
source /etc/profile
Finally, run java -version in a terminal; output like the following confirms that the installation succeeded:
[root@spark jdk1.8.0_162]# java -version
java version "1.8.0_162"
Java(TM) SE Runtime Environment (build 1.8.0_162-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.162-b12, mixed mode)
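Optionally, the new JDK can also be registered with the alternatives system so that /usr/bin/java points at it even for shells that have not sourced /etc/profile; a sketch assuming the install path above:
alternatives --install /usr/bin/java java /usr/local/java/jdk1.8.0_162/bin/java 2
alternatives --install /usr/bin/javac javac /usr/local/java/jdk1.8.0_162/bin/javac 2
alternatives --config java    # interactively select the newly installed JDK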
5. Install and Configure Hadoop
5.1 Create the directories and set their permissions
su - root
mkdir -p /usr/local/hadoop
mkdir -p /var/local/hadoop
chmod -R 777 /usr/local/hadoop #set permissions
chmod -R 777 /var/local/hadoop #set permissions
chown -R hadoop:hadoop /usr/local/hadoop #set ownership
chown -R hadoop:hadoop /var/local/hadoop #set ownership
su - hadoop
mkdir -p /var/local/hadoop/tmp
mkdir -p /var/local/hadoop/dfs/name
mkdir -p /var/local/hadoop/dfs/data
5.2 Download Hadoop
Choose a suitable release from the Hadoop download mirrors (http://www.apache.org/dyn/closer.cgi/hadoop/common); here I picked the relatively recent 2.7.5 to match the Spark version.
5.3 Extract the archive
su - hadoop
tar -xzvf hadoop-2.7.5.tar.gz
mv hadoop-2.7.5 /usr/local/hadoop/
5.4 Update the environment variables by adding the Hadoop variables to /etc/profile as follows:
#Hadoop environment variables
export HADOOP_HOME=/usr/local/hadoop/hadoop-2.7.5
export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
After editing, run source /etc/profile so the changes take effect immediately.
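A quick sanity check, assuming the hadoop user's shell has sourced the updated /etc/profile:
which hadoop      # should print /usr/local/hadoop/hadoop-2.7.5/bin/hadoop
hadoop version    # should report Hadoop 2.7.5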
5.5 Edit the configuration files (located under hadoop-2.7.5/etc/hadoop)
(1) Configure core-site.xml
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/var/local/hadoop/tmp</value>
<description>Abase for other temporary directories.</description>
</property>
<property>
<name>fs.defaultFS</name>
<value>hdfs://spark:9000</value>
</property>
<property>
<name>io.file.buffer.size</name>
<value>131072</value>
</property>
</configuration>
(2) Configure hdfs-site.xml
<configuration>
<property>
<name>dfs.http.address</name>
<value>spark:50070</value>
<description>The address and the base port where the dfs namenode web ui will listen on. If the port is 0 then the server will start on a free port.
</description>
</property>
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>sparkslave:50090</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/var/local/hadoop/dfs/name</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/var/local/hadoop/dfs/data</value>
</property>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.webhdfs.enabled</name>
<value>true</value>
</property>
</configuration>
Note: according to the official tutorial, the cluster can run with only fs.defaultFS and dfs.replication configured. However, if hadoop.tmp.dir is not set, the default temporary directory is /tmp/hadoop-hadoop, which may be wiped by the system on reboot and then force you to re-run the format step. Likewise, set dfs.namenode.name.dir and dfs.datanode.data.dir explicitly, otherwise the following steps may fail.
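To double-check which values the cluster will actually use, hdfs getconf can print the effective configuration; a minimal sketch, run as hadoop after editing the files:
hdfs getconf -confKey hadoop.tmp.dir          # expect /var/local/hadoop/tmp
hdfs getconf -confKey dfs.namenode.name.dir   # expect /var/local/hadoop/dfs/name
hdfs getconf -confKey dfs.replication         # expect 2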
(3) Configure yarn-site.xml
<configuration>
<!-- Site specific YARN configuration properties -->
<property>
<name>yarn.resourcemanager.hostname</name>
<value>spark</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>spark:8032</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>spark:8030</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>spark:8031</value>
</property>
<property>
<name>yarn.resourcemanager.admin.address</name>
<value>spark:8033</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>spark:8088</value>
</property>
</configuration>
(4) Configure mapred-site.xml
Copy mapred-site.xml.template to mapred-site.xml and edit it as follows:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
<final>true</final>
</property>
<property>
<name>mapreduce.jobtracker.http.address</name>
<value>spark:50030</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>spark:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>spark:19888</value>
</property>
<property>
<name>mapred.job.tracker</name>
<value>http://spark:9001</value>
</property>
</configuration>
(5) Configure the slaves file
[hadoop@sparkslave hadoop]$ cat slaves
spark
sparkslave
(6) Configure hadoop-env.sh and set:
export JAVA_HOME=/usr/local/java/jdk1.8.0_162 #the absolute path must be used here
(7) Configure the Slave nodes
Copy the entire hadoop-2.7.5 directory from the Master to each Slave node.
First prepare the directories and permissions on the Slave:
su - root
mkdir -p /usr/local/hadoop
mkdir -p /var/local/hadoop
chmod -R 777 /usr/local/hadoop #set permissions
chmod -R 777 /var/local/hadoop #set permissions
chown -R hadoop:root /usr/local/hadoop #set ownership
chown -R hadoop:root /var/local/hadoop #set ownership
su - hadoop
mkdir -p /var/local/hadoop/tmp
mkdir -p /var/local/hadoop/dfs/name
mkdir -p /var/local/hadoop/dfs/data
Modify /etc/profile on the Slave as in section 5.4, then copy Hadoop over:
scp -r /usr/local/hadoop/hadoop-2.7.5 hadoop@sparkslave:/usr/local/hadoop/
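Whenever a file under etc/hadoop is changed later on the Master, it must be redistributed to the Slave as well; a sketch, assuming HADOOP_HOME points to the same path on both nodes:
scp $HADOOP_HOME/etc/hadoop/*.xml \
    $HADOOP_HOME/etc/hadoop/slaves \
    $HADOOP_HOME/etc/hadoop/hadoop-env.sh \
    hadoop@sparkslave:$HADOOP_HOME/etc/hadoop/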
(8) Format the NameNode
hdfs namenode -format
Output containing "Exiting with status 0" means success; "Exiting with status 1" indicates an error.
18/03/26 15:40:37 INFO util.ExitUtil: Exiting with status 0
18/03/26 15:40:37 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at spark/192.168.59.187
************************************************************/
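One caution, as a hedged sketch: if the NameNode ever has to be formatted again, first stop HDFS and clear the old name, data and tmp directories on every node, otherwise the DataNodes refuse to start with an "Incompatible clusterIDs" error (paths as configured above):
stop-dfs.sh
rm -rf /var/local/hadoop/dfs/name/* \
       /var/local/hadoop/dfs/data/* \
       /var/local/hadoop/tmp/*        # repeat the rm on every node
hdfs namenode -format                 # re-run the format on the Master only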
(9) Start the Hadoop cluster
[hadoop@spark ~]$ start-dfs.sh
Starting namenodes on [spark]
spark: starting namenode, logging to /usr/local/hadoop/hadoop-2.7.5/logs/hadoop-hadoop-namenode-spark.out
sparkslave: starting datanode, logging to /usr/local/hadoop/hadoop-2.7.5/logs/hadoop-hadoop-datanode-sparkslave.out
spark: starting datanode, logging to /usr/local/hadoop/hadoop-2.7.5/logs/hadoop-hadoop-datanode-spark.out
Starting secondary namenodes [sparkslave]
sparkslave: starting secondarynamenode, logging to /usr/local/hadoop/hadoop-2.7.5/logs/hadoop-hadoop-secondarynamenode-sparkslave.out
[hadoop@spark ~]$ start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to /usr/local/hadoop/hadoop-2.7.5/logs/yarn-hadoop-resourcemanager-spark.out
sparkslave: starting nodemanager, logging to /usr/local/hadoop/hadoop-2.7.5/logs/yarn-hadoop-nodemanager-sparkslave.out
spark: starting nodemanager, logging to /usr/local/hadoop/hadoop-2.7.5/logs/yarn-hadoop-nodemanager-spark.out
After startup, use the jps command to check whether the daemons came up successfully.
Master node:
[hadoop@spark ~]$ jps
15011 NodeManager
14468 NameNode
15224 Jps
14894 ResourceManager
Slave1 node:
[hadoop@sparkslave hadoop]$ jps
10705 NodeManager
10867 Jps
10565 SecondaryNameNode
10478 DataNode
When a Java process starts, it creates a directory /tmp/hsperfdata_username (where username is the user that started it) containing one file per JVM, named after its PID. jps works by listing the file names under /tmp/hsperfdata_username. If the owner or group of that directory does not match the user that started the processes, the JVMs cannot write into it, the directory stays empty, and jps shows nothing. In that case fix the ownership:
[root@spark ~]# chown -R hadoop /tmp/hsperfdata_hadoop
[root@spark ~]# chgrp -R hadoop /tmp/hsperfdata_hadoop
(10) Write to HDFS as a test
[hadoop@spark ~]$ hadoop fs -mkdir /test
[hadoop@spark ~]$ hadoop fs -ls /
Found 1 items
drwxr-xr-x - hadoop supergroup 0 2018-03-15 13:31 /test
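A slightly fuller round trip, run as hadoop on the Master, assuming the cluster is up:
echo "hello hdfs" > /tmp/hello.txt
hadoop fs -put /tmp/hello.txt /test/   # upload a local file
hadoop fs -cat /test/hello.txt         # read it back; should print "hello hdfs"
hdfs dfsadmin -report                  # both DataNodes should be reported as live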
NameNode web UI: http://spark:50070/
ResourceManager web UI: http://spark:8088/ (the port set in yarn.resourcemanager.webapp.address)
NodeManager web UI: http://spark:8042/
If these pages cannot be reached, the firewall is a likely cause.
[root@spark ~]# systemctl stop firewalld.service
[root@spark ~]# firewall-cmd --state
not running
[root@spark ~]# systemctl disable firewalld.service
Removed symlink /etc/systemd/system/multi-user.target.wants/firewalld.service.
Removed symlink /etc/systemd/system/dbus-org.fedoraproject.FirewallD1.service.
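Disabling firewalld is the simplest option for a lab cluster; if the firewall has to stay on, the required ports can be opened instead, for example (ports as configured in this guide):
firewall-cmd --permanent --add-port=50070/tcp   # NameNode web UI
firewall-cmd --permanent --add-port=8088/tcp    # ResourceManager web UI
firewall-cmd --permanent --add-port=9000/tcp    # HDFS fs.defaultFS
firewall-cmd --reload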
II. Spark Installation
1 Install Scala
(1) Download Scala from the official site (http://www.scala-lang.org/download/all.html); the version chosen here is Scala 2.12.4. Extract it and place it under /usr/local/hadoop.
(2) Modify /etc/profile:
#JAVA ENV
export JAVA_HOME=/usr/local/java/jdk1.8.0_162
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export HADOOP_HOME=/usr/local/hadoop/hadoop-2.7.5
#Scala ENV
export SCALA_HOME=/usr/local/hadoop/scala-2.12.4
export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$SCALA_HOME/bin:$PATH
(3) Run the following in a terminal:
[hadoop@spark hadoop]$ source /etc/profile
[hadoop@spark hadoop]$ scala
Welcome to Scala 2.12.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_162).
Type in expressions for evaluation. Or try :help.
scala>
2 Download and Install Spark
(1) Get the package
Download the archive from http://spark.apache.org/downloads.html. Since my Hadoop version is 2.7.5, I downloaded the spark-2.3.0 tgz built as "Pre-built for Hadoop 2.7 and later", then extracted it into /usr/local/hadoop.
2.1 Configure Spark
(1) Configure spark-env.sh
[hadoop@spark conf]$ pwd
/usr/local/hadoop/spark-2.3.0-bin-hadoop2.7/conf
[hadoop@spark conf]$ cp spark-env.sh.template spark-env.sh
[hadoop@spark conf]$ vi spark-env.sh
[hadoop@spark conf]$ cat spark-env.sh
#JAVA_HOME
export JAVA_HOME=/usr/local/java/jdk1.8.0_162
#Hadoop_HOME
export HADOOP_HOME=/usr/local/hadoop/hadoop-2.7.5
#Scala_HOME
export SCALA_HOME=/usr/local/hadoop/scala-2.12.4
#Spark_HOME
export SPARK_HOME=/usr/local/hadoop/spark-2.3.0-bin-hadoop2.7
export SPARK_MASTER_IP=spark
export SPARK_WORKER_MEMORY=1g
export SPARK_WORKER_CORES=2
[hadoop@spark ~]$ echo " export HADOOP_CONF_DIR=/usr/local/hadoop/hadoop-2.7.5/etc/hadoop" >> /usr/local/hadoop/spark-2.3.0-bin-hadoop2.7/conf/spark-env.sh
(2) Configure slaves
[hadoop@spark conf]$ cp slaves.template slaves
[hadoop@spark conf]$ vi slaves
[hadoop@spark conf]$ cat slaves
# A Spark Worker will be started on each of the machines listed below.
#
spark
sparkslave
(3) Configure /etc/profile
#JAVA ENV
export JAVA_HOME=/usr/local/java/jdk1.8.0_162
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export HADOOP_HOME=/usr/local/hadoop/hadoop-2.7.5
#Scala ENV
export SCALA_HOME=/usr/local/hadoop/scala-2.12.4
#Spark ENV
export SPARK_HOME=/usr/local/hadoop/spark-2.3.0-bin-hadoop2.7
export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$SCALA_HOME/bin:$SPARK_HOME/bin:$SPARK_HOME/sbin:$PATH
Spark standalone mode
Starting Spark's own master and worker services runs Spark in standalone mode. Note that both $HADOOP_HOME/sbin and $SPARK_HOME/sbin provide a start-all.sh/stop-all.sh script, so it is safer to invoke the Spark scripts by their full path, as done below.
(4) Copy to the other nodes
After Spark has been installed and configured on the Master, copy the whole spark directory to the other nodes and update the environment variables in /etc/profile on each of them.
(5) Test
Start the cluster on the Master node:
[hadoop@spark ~]$ /usr/local/hadoop/spark-2.3.0-bin-hadoop2.7/sbin/start-all.sh
starting org.apache.spark.deploy.master.Master, logging to /usr/local/hadoop/spark-2.3.0-bin-hadoop2.7/logs/spark-hadoop-org.apache.spark.deploy.master.Master-1-spark.out
sparkslave: starting org.apache.spark.deploy.worker.Worker, logging to /usr/local/hadoop/spark-2.3.0-bin-hadoop2.7/logs/spark-hadoop-org.apache.spark.deploy.worker.Worker-1-sparkslave.out
spark: starting org.apache.spark.deploy.worker.Worker, logging to /usr/local/hadoop/spark-2.3.0-bin-hadoop2.7/logs/spark-hadoop-org.apache.spark.deploy.worker.Worker-1-spark.out
[hadoop@spark ~]$ jps
15011 NodeManager
14468 NameNode
15957 Jps
15737 Master
15837 Worker
14894 ResourceManager
Slave:
[hadoop@sparkslave hadoop]$ jps
10705 NodeManager
11473 Worker
10565 SecondaryNameNode
10478 DataNode
11599 Jps
Open a browser at the Master node's port 8080 (http://spark:8080); seeing the active Workers listed there confirms that the installation, configuration, and startup succeeded.
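Beyond the web UI, a small job can be submitted to the standalone master to confirm that work is actually scheduled; a sketch using the SparkPi example bundled with Spark (the exact examples jar name may differ between builds):
/usr/local/hadoop/spark-2.3.0-bin-hadoop2.7/bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://spark:7077 \
  /usr/local/hadoop/spark-2.3.0-bin-hadoop2.7/examples/jars/spark-examples_2.11-2.3.0.jar 100
# the driver output should contain a line like "Pi is roughly 3.14..."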
Spark on YARN mode
In this mode, Spark only needs to be installed and configured on any single node of the Hadoop cluster; there is no need for a cluster-wide installation, because once a Spark application is submitted to YARN, YARN handles the scheduling of cluster resources.
We keep the Spark installation on the Master node and move the one on the Slave out of the way:
[hadoop@sparkslave hadoop]$ cd /usr/local/hadoop/
[hadoop@sparkslave hadoop]$ ls
hadoop-2.7.5 scala-2.12.4 spark-2.3.0-bin-hadoop2.7
[hadoop@sparkslave hadoop]$ mv spark-2.3.0-bin-hadoop2.7 spark-2.3.0-bin-hadoop2.7-bak
[hadoop@sparkslave hadoop]$ ls
hadoop-2.7.5 scala-2.12.4 spark-2.3.0-bin-hadoop2.7-bak
[hadoop@sparkslave hadoop]$ pwd
/usr/local/hadoop
(6) Running spark-shell on YARN
(1) yarn-client mode
Run spark-shell --master yarn --deploy-mode client:
[hadoop@spark ~]$ spark-shell --master yarn --deploy-mode client
2018-03-26 16:30:49 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2018-03-26 16:31:28 WARN Client:66 - Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
Spark context Web UI available at http://spark:4040
Spark context available as 'sc' (master = yarn, app id = application_1522051219440_0001).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.3.0
/_/
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_162)
Type in expressions to have them evaluated.
Type :help for more information.
scala>
If instead you get output like the following:
[hadoop@spark ~]$ spark-shell --master yarn --deploy-mode client
2018-03-26 16:30:49 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2018-03-26 16:31:28 WARN Client:66 - Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
2018-03-26 16:36:28 ERROR spark.SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:85)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:62)
at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:173)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:509)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2509)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:909)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:901)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:901)
at org.apache.spark.repl.Main$.createSparkSession(Main.scala:97)
at $line3.$read$$iw$$iw.<init>(<console>:15)
at $line3.$read$$iw.<init>(<console>:42)
at $line3.$read.<init>(<console>:44)
at $line3.$read$.<init>(<console>:48)
at $line3.$read$.<clinit>(<console>)
at $line3.$eval$.$print$lzycompute(<console>:7)
at $line3.$eval$.$print(<console>:6)
at $line3.$eval.$print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:786)
at scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1047)
at scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:638)
at scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:637)
at scala.reflect.internal.util.ScalaClassLoader$class.asContext(ScalaClassLoader.scala:31)
at scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:19)
at scala.tools.nsc.interpreter.IMain$WrappedRequest.loadAndRunReq(IMain.scala:637)
at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:569)
at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:565)
at scala.tools.nsc.interpreter.ILoop.interpretStartingWith(ILoop.scala:807)
at scala.tools.nsc.interpreter.ILoop.command(ILoop.scala:681)
at scala.tools.nsc.interpreter.ILoop.processLine(ILoop.scala:395)
at org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply$mcV$sp(SparkILoop.scala:38)
at org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply(SparkILoop.scala:37)
at org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply(SparkILoop.scala:37)
at scala.tools.nsc.interpreter.IMain.beQuietDuring(IMain.scala:214)
at org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:37)
at org.apache.spark.repl.SparkILoop.loadFiles(SparkILoop.scala:98)
at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply$mcZ$sp(ILoop.scala:920)
at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:909)
at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:909)
at scala.reflect.internal.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:97)
at scala.tools.nsc.interpreter.ILoop.process(ILoop.scala:909)
at org.apache.spark.repl.Main$.doMain(Main.scala:70)
at org.apache.spark.repl.Main$.main(Main.scala:53)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:755)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
2018-03-26 16:39:08 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered!
2018-03-26 16:39:09 WARN metrics.MetricsSystem: Stopping a MetricsSystem that is not running
org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:85)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:62)
at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:173)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:509)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2509)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:909)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:901)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:901)
at org.apache.spark.repl.Main$.createSparkSession(Main.scala:97)
... 47 elided
<console>:14: error: not found: value spark
import spark.implicits._
^
<console>:14: error: not found: value spark
import spark.sql
^
Spark context available as 'sc' (master = yarn, app id = application_1522051219440_0001).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.3.0
/_/
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_162)
Type in expressions to have them evaluated.
Type :help for more information.
scala>
This usually happens because, running inside virtual machines, the containers' virtual-memory usage exceeds the limit YARN enforces, so the ApplicationMaster is killed. The fix:
Stop the YARN service first, then add the following to yarn-site.xml:
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
<description>Whether virtual memory limits will be enforced for containers</description>
</property>
<property>
<name>yarn.nodemanager.vmem-pmem-ratio</name>
<value>4</value>
<description>Ratio between virtual memory to physical memory when setting memory limits for containers</description>
</property>
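After editing yarn-site.xml, the change has to reach every NodeManager and YARN must be restarted; a sketch, assuming the same paths on both nodes:
stop-yarn.sh
scp $HADOOP_HOME/etc/hadoop/yarn-site.xml hadoop@sparkslave:$HADOOP_HOME/etc/hadoop/
start-yarn.sh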
(2) YARN web UI
Open the YARN web page at 192.168.59.187:8088.
The Spark shell application can be seen running; click its ID link to view the application's details.
scala> val rdd=sc.parallelize(1 to 100,5)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24
scala> rdd.count
res0: Long = 100
scala>
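As a final end-to-end check of YARN mode, a batch job can be submitted in cluster mode and tracked with the yarn CLI; a sketch (again, the examples jar name may differ between builds, and yarn logs only works if log aggregation is enabled):
spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \
  $SPARK_HOME/examples/jars/spark-examples_2.11-2.3.0.jar 100
yarn application -list              # the SparkPi application should appear here
yarn logs -applicationId <appId>    # substitute the real application id to inspect the output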