1. RDD Explained
Distributed: the data comes from, and is stored across, multiple nodes of the cluster
Dataset: an encapsulation of the data type and the computation logic (similar to a data model)
Resilient: storage can move between memory and disk, lost partitions can be recomputed from lineage, failed tasks are retried, and data can be re-partitioned as needed
- Abstract class: in the Spark source, RDD is defined as an abstract class that concrete RDD types subclass:
abstract class RDD[T: ClassTag](
@transient private var _sc: SparkContext,
@transient private var deps: Seq[Dependency[_]]
) extends Serializable with Logging
- Immutable: the computation logic of an RDD cannot be changed; a transformation produces a new RDD instead of modifying the old one
- Partitionable: the data is split into partitions, which raises data-processing throughput
- Parallel computation: multiple tasks run at the same time, one per partition (a short sketch follows this list)
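To make these points concrete, here is a minimal sketch using a local SparkContext; the object name RddBasics, the master setting local[4], and the sample data are assumptions made for this example, not anything fixed by the notes above.

import org.apache.spark.{SparkConf, SparkContext}

object RddBasics {
  def main(args: Array[String]): Unit = {
    // Local mode for illustration; on a cluster the partitions live on different executors.
    val conf = new SparkConf().setAppName("RddBasics").setMaster("local[4]")
    val sc   = new SparkContext(conf)

    // Partitionable: explicitly ask for 4 partitions.
    val nums = sc.parallelize(1 to 100, numSlices = 4)
    println(nums.getNumPartitions)   // 4

    // Immutable: map does not change nums, it returns a new RDD whose
    // computation logic is "multiply by 2, applied on top of nums".
    val doubled = nums.map(_ * 2)

    // Parallel computation: one task per partition evaluates the logic;
    // nothing actually runs until this action is called.
    println(doubled.reduce(_ + _))   // 10100

    sc.stop()
  }
}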
2. RDD Properties
/**
 * :: DeveloperApi ::
 * Implemented by subclasses to compute a given partition.
 */
@DeveloperApi
def compute(split: Partition, context: TaskContext): Iterator[T]
/**
* Implemented by subclasses to return the set of partitions in this RDD. This method will only
* be called once, so it is safe to implement a time-consuming computation in it.
*
* The partitions in this array must satisfy the following property:
* `rdd.partitions.zipWithIndex.forall { case (partition, index) => partition.index == index }`
*/
protected def getPartitions: Array[Partition]
/**
* Implemented by subclasses to return how this RDD depends on parent RDDs. This method will only
* be called once, so it is safe to implement a time-consuming computation in it.
*/
protected def getDependencies: Seq[Dependency[_]] = deps
/**
* Optionally overridden by subclasses to specify placement preferences.
*/
protected def getPreferredLocations(split: Partition): Seq[String] = Nil
/** Optionally overridden by subclasses to specify how they are partitioned. */
@transient val partitioner: Option[Partitioner] = None
1. A list of partitions, the basic units that make up the dataset: protected def getPartitions: Array[Partition]
2. A function that computes each partition: def compute(split: Partition, context: TaskContext): Iterator[T]
3. The dependencies on parent RDDs: protected def getDependencies: Seq[Dependency[_]] = deps
4. A Partitioner, i.e. the RDD's partitioning function (defined only for key-value RDDs): @transient val partitioner: Option[Partitioner] = None
5. A list of preferred locations for each Partition, so the computation is shipped to where the data lives: protected def getPreferredLocations(split: Partition): Seq[String] = Nil
A minimal custom RDD showing how these five properties map onto the overrides above is sketched right after this list.
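The sketch below defines a hypothetical source RDD, NumberRDD, that produces a range of integers; NumberRDD and its RangePartition helper are invented for this note. It overrides only properties 1 and 2 and leaves 3-5 at their defaults (no parent dependencies, no partitioner, no locality preference).

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// A partition carrying one sub-range of the numbers.
class RangePartition(val index: Int, val start: Int, val end: Int) extends Partition

// A source RDD with no parent, so deps = Nil; it yields the numbers
// [lower, upper) split across numPartitions partitions.
class NumberRDD(sc: SparkContext, lower: Int, upper: Int, numPartitions: Int)
  extends RDD[Int](sc, Nil) {

  // Property 1: the list of partitions.
  override protected def getPartitions: Array[Partition] = {
    val step = math.ceil((upper - lower).toDouble / numPartitions).toInt
    (0 until numPartitions).map { i =>
      val start = lower + i * step
      new RangePartition(i, start, math.min(start + step, upper))
    }.toArray
  }

  // Property 2: how to compute one partition.
  override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
    val p = split.asInstanceOf[RangePartition]
    (p.start until p.end).iterator
  }

  // Properties 3-5 (dependencies, partitioner, preferred locations)
  // keep the defaults inherited from RDD.
}

Calling new NumberRDD(sc, 0, 10, 3).collect() would return 0 through 9, computed by three parallel tasks; a production-quality Partition subclass would also override equals and hashCode.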
An RDD represents a read-only, partitioned dataset. The only way to "modify" an RDD is through a transformation, which derives a new RDD from an existing one; the new RDD carries all the information needed to derive it from its parents. RDDs therefore form dependency chains, and execution is lazy, following this lineage only when an action is triggered. If the lineage grows long, the RDD can be checkpointed (persisted to reliable storage) to truncate it.
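A small sketch of lazy, lineage-driven evaluation and of truncating the lineage; the local master and the checkpoint directory /tmp/spark-checkpoints are assumptions for the example.

import org.apache.spark.{SparkConf, SparkContext}

object LineageDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("LineageDemo").setMaster("local[*]"))
    // Any reliable path (e.g. an HDFS directory) works as the checkpoint dir.
    sc.setCheckpointDir("/tmp/spark-checkpoints")

    val words  = sc.parallelize(Seq("a", "b", "a", "c"))
    val pairs  = words.map((_, 1))          // transformation: new RDD, nothing runs yet
    val counts = pairs.reduceByKey(_ + _)   // still lazy; only the lineage grows

    println(counts.toDebugString)           // prints the dependency chain (lineage)

    counts.checkpoint()                     // mark the RDD so its lineage gets truncated
    counts.collect()                        // action: triggers execution and the checkpoint

    println(counts.toDebugString)           // lineage now starts from the checkpoint data
    sc.stop()
  }
}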