Spark shell

Common options

1.1 --master

Followed by the master URL.
When --master is omitted, the default is local[*], i.e. local mode with one worker thread per logical core.

1.1.1 Values accepted by --master

(1) local

Run Spark locally with one worker thread (i.e. no parallelism at all).

(2) local[K]
Run Spark locally with K worker threads (ideally, set K to the number of CPU cores on your machine).
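
As a rough illustration of picking K, the sketch below derives a local[K] master URL from the number of online CPU cores. The helper name local_master_url is hypothetical (it is not part of Spark), and it assumes getconf is available on the machine.

```shell
# local_master_url is a hypothetical helper, not part of Spark.
# It prints local[K], where K is the number of online CPU cores
# (falls back to 1 if getconf cannot report it).
local_master_url() {
  cores="$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)"
  echo "local[${cores}]"
}

MASTER_URL="$(local_master_url)"
echo "$MASTER_URL"

# Intended use (requires a Spark installation):
#   "$SPARK_HOME/bin/spark-shell" --master "$MASTER_URL"
```

The URL is quoted when passed to spark-shell so that the shell does not treat the square brackets as a glob pattern.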

(3) yarn
Connect to a YARN cluster in client or cluster mode, depending on the value of --deploy-mode.

Note

If --deploy-mode is not specified, it defaults to client (yarn-client mode).
In other words, --master yarn is equivalent to --master yarn --deploy-mode client.

Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory that contains the (client-side) configuration files for the Hadoop cluster.

These configs are used to write to HDFS and to connect to the YARN ResourceManager.

The configuration contained in this directory will be distributed to the YARN cluster so that all containers used by the application use the same configuration.

In other words, when --master is set to yarn, HADOOP_CONF_DIR or YARN_CONF_DIR must be configured.
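
The requirement above can be checked before launching anything. The following is a minimal sketch, assuming a POSIX shell; check_yarn_conf is a hypothetical helper name, and it only verifies that one of the two variables names an existing directory, not that the files inside are valid.

```shell
# check_yarn_conf is a hypothetical helper, not part of Spark or Hadoop.
# It prints the configuration directory and returns 0 when HADOOP_CONF_DIR
# (or, failing that, YARN_CONF_DIR) names an existing directory.
check_yarn_conf() {
  conf_dir="${HADOOP_CONF_DIR:-${YARN_CONF_DIR:-}}"
  if [ -z "$conf_dir" ]; then
    echo "neither HADOOP_CONF_DIR nor YARN_CONF_DIR is set" >&2
    return 1
  fi
  if [ ! -d "$conf_dir" ]; then
    echo "not a directory: $conf_dir" >&2
    return 1
  fi
  echo "$conf_dir"
}

# Intended use (requires a Spark installation):
#   check_yarn_conf && "$SPARK_HOME/bin/spark-shell" --master yarn
```

Failing early with a readable message is usually easier to diagnose than the error Spark reports when neither variable is set.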

1.2 --class
1.3 --name
1.4 --jars

Starting Spark

2.1 On startup, Spark Shell creates a default SparkContext (available as sc)

2.2 On startup, Spark Shell is also given a default application ID

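Differences between yarn-client mode and yarn-cluster mode

3.1 yarn-client mode

(1) The Spark driver runs locally, on the machine that submits the application

(2) The client (e.g. spark-shell) must stay connected; if it is closed and reopened, a new application requests fresh resources from the ResourceManager and starts new executors

(3) Logs are visible on the client console

(4) The driver runs on the client, which may be outside the cluster; because the driver communicates with the executors frequently, this puts more load on the network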

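3.2 yarn-cluster mode

(1) The Spark driver runs inside the ApplicationMaster (AM) on the cluster
(2) The client can disconnect after submitting, because the driver runs on the cluster rather than on the client (spark-shell itself only supports client mode; cluster mode applies to jobs submitted with spark-submit)
(3) The application's runtime logs are not shown on the client console
(4) Network load is somewhat lower, because the driver sits inside the cluster alongside the executors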

3.3 How to view the logs of a Spark job running in yarn-cluster mode

yarn logs -applicationId <app ID>
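
As a small sketch of how the log lookup might be scripted, the helper below checks that its argument looks like a YARN application ID (application_<cluster timestamp>_<sequence number>) and prints the corresponding yarn logs -applicationId command. The name yarn_logs_cmd and the sample ID are made up for illustration. Fetching logs this way typically requires YARN log aggregation to be enabled on the cluster.

```shell
# yarn_logs_cmd is a hypothetical helper, not part of Hadoop.
# It prints the log-fetch command for a well-formed application ID
# and returns 1 for anything else.
yarn_logs_cmd() {
  app_id="$1"
  if ! echo "$app_id" | grep -Eq '^application_[0-9]+_[0-9]+$'; then
    echo "unexpected application ID: $app_id" >&2
    return 1
  fi
  echo "yarn logs -applicationId $app_id"
}

# The ID below is a made-up placeholder, not a real application.
yarn_logs_cmd application_1528080031923_0064
```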

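3.4 How to submit a job to a Spark cluster
(1) Package the project into a jar with Maven
(2) Use the spark-submit script

Arguments
  --master
  --class
  --name
  the packaged jar
  HDFS path of the input log file (application argument)
  HDFS output path (application argument)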

3.5 Example

(1) local mode

$SPARK_HOME/bin/spark-submit \
--master local[2] \
--class com.henry.com.SparkCore.SparkContextNewApp \
--name LogServerApp \
/home/hadoop/test/g5SparkLearning-1.0-SNAPSHOT.jar

(2) yarn mode

export HADOOP_CONF_DIR=/home/hadoop/app/hadoop-2.6.0-cdh5.7.0/etc/hadoop
$SPARK_HOME/bin/spark-submit \
--master yarn \
--class com.henry.com.SparkCore.LogServerApp \
--name LogServerApp \
/home/hadoop/test/Spark.jar \
hdfs://hadoop:12345/logs/input/data.txt hdfs://hadoop:12345/logs/output15
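
The yarn example runs in client mode, since no --deploy-mode is given. For comparison, here is a hedged sketch of the same submission in yarn-cluster mode. It reuses the class, jar, and HDFS paths from the example as-is. The DRY_RUN variable, the submit_cluster function, and the /opt/spark fallback for SPARK_HOME are assumptions added for illustration; by default the script only prints the command instead of running it.

```shell
# Dry-run sketch: by default this only prints the spark-submit command.
# Set DRY_RUN to an empty string to actually run it (requires Spark and YARN).
DRY_RUN="${DRY_RUN-echo}"
# /opt/spark is an assumed fallback, used only when SPARK_HOME is unset.
SPARK_HOME="${SPARK_HOME:-/opt/spark}"

# submit_cluster is a hypothetical wrapper around spark-submit.
submit_cluster() {
  $DRY_RUN "$SPARK_HOME/bin/spark-submit" \
    --master yarn \
    --deploy-mode cluster \
    --class com.henry.com.SparkCore.LogServerApp \
    --name LogServerApp \
    /home/hadoop/test/Spark.jar \
    hdfs://hadoop:12345/logs/input/data.txt hdfs://hadoop:12345/logs/output15
}

submit_cluster
```

In cluster mode the driver output does not appear on the submitting console, so the application ID reported at submission is what you would pass to the log command from section 3.3.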
