Spark快速入门(1) 核心概念和抽象：RDD

Spark简介

Spark是目前比较流程的大数据计算引擎。在Spark出现之前，MapReduce已经作为大数据领域的编程模型和计算框架，那为什么又开发了Spark呢？

MapReduce并没有充分的利用内存，而有些数据量不大的场景使用内存就可以完成计算；
MapReduce有很多冗余的中间结果读写操作，虽然会提升稳定性，但某些adhoc的查询对执行时间更加敏感；
MapReduce没有对功能做组件化，用户只能全部安装。

Spark通过对数据的重新抽象，和执行过程的优化，致力于解决以上问题，目前Spark已经可以应用于批计算、实时计算和图计算，并在向数据科学方向继续发展。

Spark RDD

RDD是Spark的核心抽象，既可以灵活方便的定义一个计算过程，又可以保证计算过程可以有效的执行。

为什么需要RDD

假设我们实现一些需要循环的计算，比如K-Means和PageRank，使用MapReduce的模型存在什么问题呢？

循环的连续任务之间的关系，只有用户的代码才知道，框架并不知道，因此不能进行任何优化，比如调整寄存器数量等；
MapReduce在磁盘持久化中间计算结果，产生了额外的I/O，但对于一个只需要执行几分钟的K-Means计算来说，在这几分钟出现故障的可能性很低，而使用内存保存中间结果是可以大幅降低计算时间的；
在使用高级计算操作，比如join的场景时，不同应用的MapReduce的代码很难复用。

什么是RDD

然后我们看一下RDD（Resilient Distributed Dataset），这里的Resilient和Distributed体现了两个重要的特性：

Resilient：能够自动恢复计算过程的错误，并保证计算完成；
Distributed：计算过程可以并行执行。

因此RDD的准确定义是一组只读并且支持分区的数据集。每个RDD需要实现的三个接口有：

partitions() -> Array[Partition]：可以枚举访问数据集的所有分区。

  /**
   * Get the array of partitions of this RDD, taking into account whether the
   * RDD is checkpointed or not.
   */
  final def partitions: Array[Partition] = {
    checkpointRDD.map(_.partitions).getOrElse {
      if (partitions_ == null) {
        partitions_ = getPartitions
        partitions_.zipWithIndex.foreach { case (partition, index) =>
          require(partition.index == index,
            s"partitions($index).partition == ${partition.index}, but it should equal $index")
        }
      }
      partitions_
    }
  }

iterator(p : Partition) -> Iterator：partitions方法将获取到的分区传入到iterator方法，用于遍历分区的数据。

  /**
   * Internal method to this RDD; will read from cache if applicable, or otherwise compute it.
   * This should ''not'' be called by users directly, but is available for implementors of custom
   * subclasses of RDD.
   */
  final def iterator(split: Partition, context: TaskContext): Iterator[T] = {
    if (storageLevel != StorageLevel.NONE) {
      getOrCompute(split, context)
    } else {
      computeOrReadCheckpoint(split, context)
    }
  }

dependencies() -> Array[Dependency]：用于定义RDD依赖的parent RDD，创建parent RDD的partition到当前RDD partition的映射。

  /**
   * Get the list of dependencies of this RDD, taking into account whether the
   * RDD is checkpointed or not.
   */
  final def dependencies: Seq[Dependency[_]] = {
    checkpointRDD.map(r => List(new OneToOneDependency(r))).getOrElse {
      if (dependencies_ == null) {
        dependencies_ = getDependencies
      }
      dependencies_
    }
  }

RDD中的数据类型是提前定义好的，比如RDD<String>、RDD<Integer>等。

我们来看一个RDD的实现，假如我们有一个读取HDFS上的二进制文件的RDD：

partitions() -> Array[Partition]：
由于HDFS的每个文件被分为小的blocks，每个blocks适合做为一个partition。因此该方法的实现过程为：从NameNode中查询HDFS文件的block信息，为每个block创建一个partition，返回partition的数组。
iterator(p Partition, parents: Array[Iterator[_]]) -> Iterator[Byte]：
解析partition的block信息，为每个partition创建一个读取文件的Reader对象。
dependencies() -> Array[Dependency]：
读取HDFS文件没有依赖的RDD，因此返回一个空数组。

这里如果文件是由某种Hadoop的文件格式存储，并进行切分的话，切分会使用到一个实现了输入文件切分接口的对象，而且这个对象也会被发送到iterator中用于解析切分后的文件。

再看一个RDD的实现，内存中的数组。如果将整个数组看成一个partition，那么它的三个方法的实现分别为：

partitions() -> Array[Partition]：
返回整个数组作为一个partition。
iterator(p: Partition, parents: Array[Iterator[_]]) -> Iterator[T]：
返回整个数组的遍历器。
dependencies() -> Array[Dependency]：
没有依赖，返回空数组。

如果将数组分成多块，每块为一个partition的话，那么它的三个方法实现如下：

partitions() -> Array[Partition]：
将数组分成N块，每块创建一个partition，返回所有的partitions。
iterator(p: Partition, parents: Array[Iterator[_]]) -> Iterator[T]：
返回数组中一块的遍历器。
dependencies() -> Array[Dependency]：
没有依赖，返回空数组。

使用分块的方法可以使计算并行化，加快计算速度。

小结

本节我们主要介绍了RDD的概念，RDD的实现不仅说明了RDD的分区方式，遍历方式，还说明了它所依赖的RDD的格式，这么做的目的会在后面继续介绍。