15-SparkCore02

Application

an application = one driver program + a set of executors

one SparkContext corresponds to one application

Is spark-shell an application? Yes: launching spark-shell starts a driver (with a ready-made SparkContext, sc) on the gateway/client machine

application1: 1 driver + 10 executors

application2: 1 driver + 10 executors

executors are not shared across applications: each application runs its own executor processes, so cached RDDs cannot be shared between applications directly

application ==> n jobs (one per action) ==> each job: n stages (split at shuffle boundaries) ==> each stage: n tasks

one partition == one task: each task computes one partition of the stage's final RDD
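A minimal sketch of the hierarchy, runnable in spark-shell (the path and partition count are hypothetical):

    val rdd = sc.textFile("/tmp/input.txt", 4)   // ask for at least 4 partitions
    println(rdd.getNumPartitions)                // 4 partitions ==> 4 tasks per stage
    rdd.count()                                  // one action ==> one job (a single stage here, no shuffle)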

textFile("")............ count

textFile("")............ count

textFile("")............ count

textFile("").cache

cache  lazy === transformation

unpersist eager
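A minimal sketch of the caching behavior above (hypothetical path):

    val lines = sc.textFile("/tmp/input.txt")
    lines.count()       // reads the file from the source
    lines.count()       // reads it again

    lines.cache()       // lazy: nothing happens yet
    lines.count()       // reads once more, caching partitions as it goes
    lines.count()       // served from the cache, no re-read
    lines.unpersist()   // eager: cached partitions are dropped right away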

def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)

def cache(): this.type = persist()

class StorageLevel private(
    private var _useDisk: Boolean,
    private var _useMemory: Boolean,
    private var _useOffHeap: Boolean,
    private var _deserialized: Boolean,
    private var _replication: Int = 1)

MEMORY_ONLY = (useDisk = false, useMemory = true, useOffHeap = false, deserialized = true), replication = 1
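A hedged sketch of picking a different level (MEMORY_AND_DISK is a real constant on Spark's StorageLevel object; the path is hypothetical):

    import org.apache.spark.storage.StorageLevel

    val lines = sc.textFile("/tmp/input.txt")
    lines.persist(StorageLevel.MEMORY_AND_DISK)   // (true, true, false, true): spill to disk when memory is short
    lines.count()                                 // first action materializes the persisted data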

Lineage

textFile ==> xx ==> yy ==> zz

  map  filter  map  .....

Lineage describes how an RDD is computed from its parent RDD(s); if a partition is lost, Spark can recompute it by replaying the chain.
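You can print a lineage with toDebugString; the chain below mirrors the xx/yy/zz sketch (path and functions are hypothetical):

    val zz = sc.textFile("/tmp/input.txt")
      .map(_.trim)              // xx
      .filter(_.nonEmpty)       // yy
      .map(_.length)            // zz
    println(zz.toDebugString)   // prints the chain of parent RDDs back to textFile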

Dependency

narrow dependency

a partition of the parent RDD is used by at most one partition of the child RDD

narrow dependencies can be pipelined: consecutive narrow operators run together in one stage

wide dependency

a partition of the parent RDD is used by multiple partitions of the child RDD

typical causes: the xxxByKey operators (reduceByKey, groupByKey, ...)

join whose inputs are not co-partitioned

a wide dependency requires a shuffle, and each shuffle ==> a new stage boundary
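A small sketch to inspect both kinds (rdd.dependencies is a real Spark API; the data is made up):

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
    println(pairs.map(identity).dependencies)        // narrow: OneToOneDependency
    println(pairs.reduceByKey(_ + _).dependencies)   // wide:   ShuffleDependency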

lines.flatMap(_.split("\t")).map((_,1)).reduceByKey(_+_).collect
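For this word count, collect triggers exactly one job; reduceByKey introduces a shuffle, so the job splits into two stages: stage 0 pipelines flatMap and map over the input partitions, and stage 1 runs the reduce side, with one task per partition in each stage.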
