1、A Resilient Distributed Dataset (RDD), the basic abstraction in Spark.
2、Represents an immutable, partitioned collection of elements that can be operated on in parallel.
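RDD is short for Resilient Distributed Dataset. The "resilient" part refers to computation: if data is lost or a failure occurs at some stage of a Spark job, the lost data can be recomputed from the RDD's lineage.
An RDD is immutable: once created, it cannot be changed. In RDDA --> RDDB, a transformation applied to RDDA produces RDDB. The two RDDs are related by lineage, but they are two distinct RDDs, which is what immutability means here.
An RDD consists of a set of partitions, and those partitions can be processed in parallel.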
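The RDDA --> RDDB relationship can be checked directly. The following is a minimal sketch, assuming spark-core is on the classpath and a local master is acceptable; the object name `LineageSketch`, the helper `run`, and the sample data are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LineageSketch {
  // Returns (contents of rddA, contents of rddB, whether rddB's parent is rddA).
  def run(sc: SparkContext): (Seq[Int], Seq[Int], Boolean) = {
    val rddA = sc.parallelize(1 to 6, numSlices = 3)
    // map is a transformation: it builds a new RDD and leaves rddA as it was.
    val rddB = rddA.map(_ * 10)
    // rddB records rddA as its parent; this link is the lineage used for recomputation.
    val parentIsA = rddB.dependencies.head.rdd eq rddA
    // Prints the lineage chain; the exact text varies between Spark versions.
    println(rddB.toDebugString)
    (rddA.collect().toSeq, rddB.collect().toSeq, parentIsA)
  }

  def main(args: Array[String]): Unit = {
    // Local mode with two threads; the app name is arbitrary.
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("lineage-sketch"))
    try println(run(sc)) finally sc.stop()
  }
}
```

After the `map`, `rddA` still holds 1 to 6, while `rddB` is a separate object whose first dependency points back at `rddA`.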
abstract class RDD[T: ClassTag](
    @transient private var _sc: SparkContext,
    @transient private var deps: Seq[Dependency[_]]
  ) extends Serializable with Logging
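RDD definition:
1、RDD is an abstract class with many subclasses, such as JdbcRDD and HadoopRDD
2、Serializable: an RDD can be serialized, so it can be shipped from the driver to executors as part of a task
3、Logging: the RDD class mixes in Spark's Logging trait
4、transient: the _sc and deps constructor parameters are marked @transient, so they are skipped when the RDD is serialized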
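The effect of `@transient` on a `Serializable` class can be seen without Spark at all. This is a small sketch in plain Scala using Java serialization; `Holder`, `TransientSketch`, and `roundTrip` are hypothetical names, and the `Thread` field only stands in for a non-serializable, driver-side reference such as a SparkContext.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Hypothetical stand-in for an RDD-like class. Thread is not Serializable,
// so without @transient the write below would fail.
class Holder(val name: String, @transient val handle: Thread) extends Serializable

object TransientSketch {
  // Serializes a value to bytes and reads it back with plain Java serialization.
  def roundTrip[A <: AnyRef](value: A): A = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(value)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    try in.readObject().asInstanceOf[A] finally in.close()
  }

  def main(args: Array[String]): Unit = {
    val copy = roundTrip(new Holder("rdd-like", Thread.currentThread()))
    println(copy.name)           // the regular field survives the round trip
    println(copy.handle == null) // the transient field is not written, so it comes back null
  }
}
```

The same mechanism is why code running inside a task should not expect the RDD's `_sc` reference to be usable there.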
The five main properties of an RDD
- Internally, each RDD is characterized by five main properties:
- A list of partitions
- A function for computing each split
- A list of dependencies on other RDDs
- Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
- Optionally, a list of preferred locations to compute each split on (e.g. block locations for
an HDFS file)
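Property 1: an RDD is made up of a list of partitions
Property 2: a function is applied to each partition to compute it; computing an RDD is, in essence, computing its partitions
Property 3: RDDs are linked by lineage; each RDD records the RDDs it depends on and can be derived from them
Property 4: optionally, a key-value RDD has a Partitioner that decides which partition each key goes to (for example, hash partitioning); an RDD without one has partitioner = None
Property 5: optionally, each partition has a list of preferred locations; the scheduler tries to honor data locality and run each Task on a node that holds its data. Move the computation, not the data.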
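Most of these properties are visible through public accessors on any RDD. Below is a rough sketch, assuming spark-core is on the classpath; `FivePropertiesSketch`, `run`, and the sample pairs are illustrative only. Property 2 (`compute`) is not called directly here; it runs when an action such as `collect` is triggered.

```scala
import org.apache.spark.{HashPartitioner, Partitioner, SparkConf, SparkContext}

object FivePropertiesSketch {
  // Returns (partition count before, partitioner before, partitioner after, partition count after).
  def run(sc: SparkContext): (Int, Option[Partitioner], Option[Partitioner], Int) = {
    val pairs = sc.parallelize(Seq("a" -> 1, "b" -> 2, "a" -> 3), numSlices = 2)
    // partitionBy shuffles the pairs so that each key's partition is chosen by its hash.
    val hashed = pairs.partitionBy(new HashPartitioner(4))

    // Property 3: the parent RDDs, wrapped in Dependency objects.
    println(hashed.dependencies.map(_.getClass.getSimpleName))
    // Property 5: preferred locations of the first partition (may be empty).
    println(pairs.preferredLocations(pairs.partitions.head))
    // Property 2 runs here: collect is an action, so each partition is computed.
    println(hashed.collect().sorted.mkString(","))

    // Properties 1 and 4, before and after repartitioning by key.
    (pairs.partitions.length, pairs.partitioner, hashed.partitioner, hashed.partitions.length)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("five-properties-sketch"))
    try println(run(sc)) finally sc.stop()
  }
}
```

The plain parallelized RDD reports no partitioner, which illustrates that property 4 is optional and only set once a key-value RDD has been partitioned by key.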
How the five properties map to the RDD source code:
- Property 1: A list of partitions
protected def getPartitions: Array[Partition]
- Property 2: A function for computing each split
def compute(split: Partition, context: TaskContext): Iterator[T]
- Property 3: A list of dependencies on other RDDs
protected def getDependencies: Seq[Dependency[_]] = deps
- Property 4: Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
@transient val partitioner: Option[Partitioner] = None
- Property 5: Optionally, a list of preferred locations to compute each split on (e.g. block locations for
an HDFS file)
protected def getPreferredLocations(split: Partition): Seq[String] = Nil
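To see how these members fit together, here is a minimal sketch of a custom RDD subclass, assuming spark-core is on the classpath. `RangePartition`, `SimpleRangeRDD`, and `CustomRddSketch` are hypothetical names invented for this example, not Spark classes. Properties 3 and 4 are left at their simplest: no parents (`Nil`) and the default `partitioner` of `None`.

```scala
import org.apache.spark.{Partition, SparkConf, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical partition type: partition `index` covers the range [start, end).
class RangePartition(val index: Int, val start: Int, val end: Int) extends Partition

// Hypothetical RDD with no parent RDDs that yields the integers 0 until n.
class SimpleRangeRDD(sc: SparkContext, n: Int, numParts: Int) extends RDD[Int](sc, Nil) {

  // Property 1: the list of partitions.
  override protected def getPartitions: Array[Partition] =
    Array.tabulate[Partition](numParts) { i =>
      new RangePartition(i, i * n / numParts, (i + 1) * n / numParts)
    }

  // Property 2: how to compute the elements of one partition.
  override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
    val p = split.asInstanceOf[RangePartition]
    (p.start until p.end).iterator
  }

  // Property 5: no locality preference, the same as the default implementation.
  override protected def getPreferredLocations(split: Partition): Seq[String] = Nil
}

object CustomRddSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("custom-rdd-sketch"))
    try {
      val rdd = new SimpleRangeRDD(sc, n = 10, numParts = 3)
      println(rdd.partitions.length)                  // 3
      println(rdd.map(_ * 2).collect().mkString(",")) // 0,2,4,6,8,10,12,14,16,18
    } finally sc.stop()
  }
}
```

Only `getPartitions` and `compute` are abstract in `RDD`, so those two are the minimum a subclass has to provide; the other members have defaults.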