RDD的存储机制：

其数据分布存储在多台机器上，都是以block的形式存储在服务器上。每个Executor都会启动一个BlockManagerSlave，并且管理一部分block;block的元数据保存在Driver的BlockManagerMaster上，当BlockManagerSlave产生block之后就会向BlockManagerMaster注册该block。RDD的partion是一个逻辑数据块，对应相应的物理数据块block。具体流程如下：

RDD数据存储机制

1.分区列表性

  /**
   * Implemented by subclasses to return the set of partitions in this RDD. This method will only
   * be called once, so it is safe to implement a time-consuming computation in it.
   *
   * The partitions in this array must satisfy the following property:
   *   `rdd.partitions.zipWithIndex.forall { case (partition, index) => partition.index == index }`
   */
 protected def getPartitions: Array [Partition]

2.依赖其他RDD，血缘关系

  /**
   * Implemented by subclasses to return how this RDD depends on parent RDDs. This method will only
   * be called once, so it is safe to implement a time-consuming computation in it.
   */
 protected def getDependencies: Seq [Dependency[_]]= deps

3.每个分区都有一个函数计算

  /**
   * :: DeveloperApi ::
   * Implemented by subclasses to compute a given partition.
   */
  @DeveloperApi
  def compute(split: Partition, context: TaskContext): Iterator[T]

4.key-value的分区器

  /** Optionally overridden by subclasses to specify how they are partitioned. */
  @transient val partitioner: Option[Partitioner] = None

5.每个分区都有一个分区位置列表

/**
* Optionally overridden by subclasses to specify placement preferences.
* 可选的，输入参数是 split分片，返回的是一组优先的节点位置。例如hdfs的block位置
*/
protected def getPreferredLocations(split: Partition): Seq[String] = Nil

1.1 Spark-RDD存储机制以及特性

1.1 Spark-RDD存储机制以及特性

RDD的存储机制：

1.分区列表性

2.依赖其他RDD，血缘关系

3.每个分区都有一个函数计算

4.key-value的分区器

5.每个分区都有一个分区位置列表

推荐阅读更多精彩内容