Spark运行原理及术语

一、Glossary

The following table summarizes terms you’ll see used to refer to cluster concepts(下表总结了您将看到用于引用群集概念的术语):

Term	Meaning
Application	User program built on Spark. Consists of a driver program and executors on the cluster.
Application jar	A jar containing the user's Spark application. In some cases users will want to create an "uber jar" containing their application along with its dependencies. The user's jar should never include Hadoop or Spark libraries, however, these will be added at runtime.
Driver program	The process running the main() function of the application and creating the SparkContext
Cluster manager	An external service for acquiring resources on the cluster (e.g. standalone manager, Mesos, YARN)
Deploy mode	Distinguishes where the driver process runs. In "cluster" mode, the framework launches the driver inside of the cluster. In "client" mode, the submitter launches the driver outside of the cluster.
Worker node	Any node that can run application code in the cluster
Executor	A process launched for an application on a worker node, that runs tasks and keeps data in memory or disk storage across them. Each application has its own executors.
Task	A unit of work that will be sent to one executor
Job	A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g. save, collect); you'll see this term used in the driver's logs.
Stage	Each job gets divided into smaller sets of tasks called stages that depend on each other (similar to the map and reduce stages in MapReduce); you'll see this term used in the driver's logs.

术语总结：

Application:由一个driver和多个executor构成
Application jar：Idea开发的Spark代码通过Maven打成的jar包
Driver program：运行main()函数的进程，并且创建了一个SparkContext
Cluster manage：用来从Cluster(YARN)申请资源的一个外部服务
Deploy mode ：用来区别Driver跑在local还是Cluster。以Client运行时Driver跑在本地，以Cluster运行时Driver跑在Cluster
Worker node ：集群中用来运行application的节点(对应YARN的NodeManager)
Executor：在NM上启用application的进程，可以运行task和保存数据到磁盘或者内存。每个application拥有自己的executor，也就是说两个application的executor互不干扰。(运行在Yarn中的container中)
Job：众多task组成的一个并行计算，一个Spark 的action算子产生一个Job
Stage：每个Job被分成一系列的task集合，这些集合称之为Stage，Stage之间彼此依赖。一遇到Spark 的shuffle算子就会产生新stage
task：Spark运行的基本单元

Spark的Job、Stage、Task、Partition之间的关系：

一个Job由一个或多个Stage组成
一个Stage由多个task组成
task数等于Satage中最后一个RDD的的partition数

1Job --> N Stage --> N Task

二、Spark运行原理架构图

Spark执行流程图.png

There are several useful things to note about this architecture(关于这种架构有几点有用的注意事项):

Each application gets its own executor processes, which stay up for the duration of the whole application and run tasks in multiple threads. This has the benefit of isolating(隔离) applications from each other, on both the scheduling side (each driver schedules its own tasks) and executor side (tasks from different applications run in different JVMs). However, it also means that data cannot be shared across different Spark applications (instances of SparkContext) without writing it to an external storage system.
Spark is agnostic(不关心) to the underlying cluster manager. As long as it can acquire(申请到) executor processes, and these communicate with each other, it is relatively easy to run it even on a cluster manager that also supports other applications (e.g. Mesos/YARN).
The driver program must listen for and accept incoming connections from its executors throughout its lifetime (e.g., see spark.driver.port in the network config section). As such, the driver program must be network addressable from the worker nodes.
Because the driver schedules tasks on the cluster, it should be run close to(靠近) the worker nodes, preferably on the same local area network. If you’d like to send requests to the cluster remotely, it’s better to open an RPC to the driver and have it submit operations from nearby than to run a driver far away from the worker nodes.

内部详细流程图png

1、Client通过spark-submit提交Spark作业，根据Deploy mode参数在相对应的位置初始化SparkContext(Driver Manager)，即Spark的运行环境。SparkContext创建DAG Scheduler和Task Scheduer，Driver根据应用程序执行代码，将整个程序根据action算子划分成多个job。每个job内部构建DAG图，DAG Scheduler将DAG图划分为多个stage，同时每个stage内部划分为多个task，DAG Scheduler将taskset传给Task Scheduer，Task Scheduer负责集群上task的调度

2、Driver根据sparkcontext中的资源需求向resource manager申请资源，包括executor数及内存资源。

3、资源管理器收到请求后在满足条件的work node(NM)节点上创建executor进程

4、Executor创建完成后会向driver反向注册，以便driver可以分配task给他执行

5、当程序执行完后，driver向resource manager注销所申请的资源