What is Spark
Apache Spark 2.2.0 is a fast and general engine for large-scale data processing.
How fast is Spark
Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
Apache Spark has an advanced DAG execution engine that supports acyclic data flow and in-memory computing.
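The "in-memory computing" part is what the 100x claim rests on: an RDD marked with cache() is kept in executor memory after its first computation, so later actions reuse it instead of rereading input and recomputing. A minimal sketch, assuming a SparkSession named spark and a hypothetical input path (neither is from the original text):

```scala
import org.apache.spark.sql.SparkSession

object CacheSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("Cache Sketch").getOrCreate()
    // Hypothetical input path -- any line-oriented text file works.
    val words = spark.sparkContext
      .textFile("hdfs:///tmp/input.txt")
      .flatMap(_.split(" "))
      .cache()                                  // keep partitions in memory after first use
    println(words.count())                      // first action: reads the file, fills the cache
    println(words.filter(_ == "spark").count()) // second action: served from memory
    spark.stop()
  }
}
```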
Why is Spark fast
Deployment architecture diagram
Locating the objects in the code
```scala
package org.apache.spark.examples

import scala.math.random

import org.apache.spark.sql.SparkSession

/** Computes an approximation to pi */
object SparkPi {
  def main(args: Array[String]) {
    val spark = SparkSession
      .builder
      .appName("Spark Pi")
      .getOrCreate()
    val slices = if (args.length > 0) args(0).toInt else 2
    val n = math.min(100000L * slices, Int.MaxValue).toInt // avoid overflow
    val count = spark.sparkContext.parallelize(1 until n, slices).map { i =>
      val x = random * 2 - 1
      val y = random * 2 - 1
      if (x*x + y*y < 1) 1 else 0
    }.reduce(_ + _)
    println("Pi is roughly " + 4.0 * count / (n - 1))
    spark.stop()
  }
}
```
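SparkPi estimates π by Monte Carlo sampling: each task draws points (x, y) uniformly from the square [-1, 1] × [-1, 1], and the fraction landing inside the unit circle converges to π/4, which is why the result is reported as 4.0 * count / (n - 1). From a Spark distribution the example can be launched with `bin/run-example SparkPi 10`, where 10 is the number of slices (partitions).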
Object architecture in Standalone mode
Job submission process
Deployment architecture in YARN mode
Computation model
Analyzing the GroupByTest example
```scala
package org.apache.spark.examples

import java.util.Random

import org.apache.spark.sql.SparkSession

/**
 * Usage: GroupByTest [numMappers] [numKVPairs] [valSize] [numReducers]
 *
 *   bin/run-example GroupByTest 100 10000 1000 36
 */
object GroupByTest {
  def main(args: Array[String]) {
    val spark = SparkSession
      .builder
      .appName("GroupBy Test")
      .getOrCreate()

    val numMappers = if (args.length > 0) args(0).toInt else 2
    val numKVPairs = if (args.length > 1) args(1).toInt else 1000
    val valSize = if (args.length > 2) args(2).toInt else 1000
    val numReducers = if (args.length > 3) args(3).toInt else numMappers

    val pairs1 = spark.sparkContext.parallelize(0 until numMappers, numMappers).flatMap { p =>
      val ranGen = new Random
      val arr1 = new Array[(Int, Array[Byte])](numKVPairs)
      for (i <- 0 until numKVPairs) {
        val byteArr = new Array[Byte](valSize)
        ranGen.nextBytes(byteArr)
        arr1(i) = (ranGen.nextInt(Int.MaxValue), byteArr)
      }
      arr1
    }.cache()
    // Force evaluation so everything is computed and cached
    pairs1.count()

    println(pairs1.groupByKey(numReducers).count())

    spark.stop()
  }
}
```
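In words, the data flow of this job is: each of the numMappers map tasks generates numKVPairs random (Int, Array[Byte]) pairs with valSize-byte values, so the cached pairs1 occupies roughly numMappers × numKVPairs × valSize bytes (about 1 GB for the `bin/run-example GroupByTest 100 10000 1000 36` arguments above). The count() forces that materialization, and groupByKey() then shuffles the pairs into numReducers partitions before the final count().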
Data flow
Logical execution plan
The logical execution plan describes a job's data flow: which transformation()s the job goes through, which intermediate RDDs are produced along the way, and the dependencies between those RDDs.
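Spark can print exactly this information for a concrete job. A minimal sketch, assuming a spark-shell session where sc is the predefined SparkContext (the RDD names in the output vary by Spark version):

```scala
// A tiny job: two transformations followed by a shuffle.
val words  = sc.parallelize(Seq("a", "b", "a", "c"), 2)  // ParallelCollectionRDD
val pairs  = words.map(w => (w, 1))                      // MapPartitionsRDD
val counts = pairs.reduceByKey(_ + _)                    // ShuffledRDD (shuffle dependency)
// toDebugString dumps the lineage: each RDD and the dependencies between them,
// i.e. the logical execution plan of the job ending at counts.
println(counts.toDebugString)
```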