aggregate
Official documentation:
Aggregate the elements of each partition, and then the results for all the partitions,
using given combine functions and a neutral "zero value". This function can return
a different result type, U, than the type of this RDD, T. Thus, we need one operation
for merging a T into an U and one operation for merging two U's, as in scala.TraversableOnce.
Both of these functions are allowed to modify and return their first argument
instead of creating a new U to avoid memory allocation.
Function prototype:
def aggregate[U](zeroValue: U)(seqOp: JFunction2[U, T, U], combOp: JFunction2[U, U, U]): U
Source code analysis:
def aggregate[U: ClassTag](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U): U = withScope {
  // Clone the zero value since we will also be serializing it as part of tasks
  var jobResult = Utils.clone(zeroValue, sc.env.serializer.newInstance())
  val cleanSeqOp = sc.clean(seqOp)
  val cleanCombOp = sc.clean(combOp)
  val aggregatePartition = (it: Iterator[T]) => it.aggregate(zeroValue)(cleanSeqOp, cleanCombOp)
  val mergeResult = (index: Int, taskResult: U) => jobResult = combOp(jobResult, taskResult)
  sc.runJob(this, aggregatePartition, mergeResult)
  jobResult
}
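Note what the source does with the zero value: aggregatePartition calls Iterator.aggregate, which for a plain iterator is effectively a foldLeft with seqOp starting from zeroValue, so each partition's fold starts from its own copy of the zero value; jobResult is also initialized with a clone of zeroValue, so the zero value enters the computation once per partition and once more in the final driver-side merge.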
**
The aggregate function first folds the elements of each partition together with the zero value (zeroValue) using the seq function, and then merges the per-partition results, together with another copy of the zero value, using the comb function. The final result type U does not need to match the element type T of the RDD; that is why one function is needed to merge a T into a U (the seq function) and another to merge two U's (the comb function). Parameter 1 is the zero value; parameter 2 is the seq function, which folds each element into the accumulator within a partition (in the example below it keeps the running maximum); parameter 3 is the comb function, which merges the partition results.
Note: the aggregation is always carried out per partition, even when the number of partitions is not specified explicitly; an empty partition simply contributes the zero value.
**
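To make the point about the result type U differing from the element type T concrete, here is a minimal sketch. It is not part of the original example; it reuses the javaSparkContext variable from the examples in this article and assumes a Java 8 environment so that Function2 can be written as a lambda. It computes the sum and the count of the elements in one pass, with U = long[] and T = Integer:

// Zero value {sum = 0, count = 0}: neutral, so folding it in once per partition is harmless
List<Integer> values = Arrays.asList(5, 1, 1, 4, 4, 2, 2);
JavaRDD<Integer> valueRDD = javaSparkContext.parallelize(values, 3);
long[] sumCount = valueRDD.aggregate(
        new long[]{0L, 0L},
        // seqOp: merge a T (Integer) into a U (long[]) inside a partition
        (acc, v) -> new long[]{acc[0] + v, acc[1] + 1},
        // combOp: merge two U's coming from different partitions
        (a, b) -> new long[]{a[0] + b[0], a[1] + b[1]});
System.out.println("mean = " + (double) sumCount[0] / sumCount[1]);  // 19 / 7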
Example:
List<Integer> data = Arrays.asList(5, 1, 1, 4, 4, 2, 2);
JavaRDD<Integer> javaRDD = javaSparkContext.parallelize(data, 3);
Integer aggregateValue = javaRDD.aggregate(3, new Function2<Integer, Integer, Integer>() {
    @Override
    public Integer call(Integer v1, Integer v2) throws Exception {
        System.out.println("seq~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~" + v1 + "," + v2);
        return Math.max(v1, v2);
    }
}, new Function2<Integer, Integer, Integer>() {
    int i = 0;
    @Override
    public Integer call(Integer v1, Integer v2) throws Exception {
        System.out.println("comb~~~~~~~~~i~~~~~~~~~~~~~~~~~~~" + i++);
        System.out.println("comb~~~~~~~~~v1~~~~~~~~~~~~~~~~~~~" + v1);
        System.out.println("comb~~~~~~~~~v2~~~~~~~~~~~~~~~~~~~" + v2);
        return v1 + v2;
    }
});
System.out.println("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~" + aggregateValue);
aggregateByKey
Official documentation:
Aggregate the values of each key, using given combine functions and a neutral "zero value".
This function can return a different result type, U, than the type of the values in this RDD, V.
Thus, we need one operation for merging a V into a U and one operation for merging
two U's, as in scala.TraversableOnce. The former operation is used for merging values
within a partition, and the latter is used for merging values between partitions.
To avoid memory allocation, both of these functions are allowed to modify and return
their first argument instead of creating a new U.
Function prototype:
def aggregateByKey[U](zeroValue: U, partitioner: Partitioner, seqFunc: JFunction2[U, V, U],
    combFunc: JFunction2[U, U, U]): JavaPairRDD[K, U]
def aggregateByKey[U](zeroValue: U, numPartitions: Int, seqFunc: JFunction2[U, V, U],
    combFunc: JFunction2[U, U, U]): JavaPairRDD[K, U]
def aggregateByKey[U](zeroValue: U, seqFunc: JFunction2[U, V, U], combFunc: JFunction2[U, U, U]): JavaPairRDD[K, U]
Source code analysis:
def aggregateByKey[U: ClassTag](zeroValue: U, partitioner: Partitioner)(seqOp: (U, V) => U,
    combOp: (U, U) => U): RDD[(K, U)] = self.withScope {
  // Serialize the zero value to a byte array so that we can get a new clone of it on each key
  val zeroBuffer = SparkEnv.get.serializer.newInstance().serialize(zeroValue)
  val zeroArray = new Array[Byte](zeroBuffer.limit)
  zeroBuffer.get(zeroArray)
  lazy val cachedSerializer = SparkEnv.get.serializer.newInstance()
  val createZero = () => cachedSerializer.deserialize[U](ByteBuffer.wrap(zeroArray))
  // We will clean the combiner closure later in `combineByKey`
  val cleanedSeqOp = self.context.clean(seqOp)
  combineByKey[U]((v: V) => cleanedSeqOp(createZero(), v), cleanedSeqOp, combOp, partitioner)
}
**
The aggregateByKey function aggregates the values that share the same key in a pair RDD, again using a neutral zero value during the aggregation. As with aggregate, the result type U of aggregateByKey does not need to match the value type V of the RDD. Because aggregateByKey aggregates the values of each key, it still returns a pair RDD whose entries are each key together with its aggregated value, whereas aggregate returns a single non-RDD result; this difference is worth noting. Three aggregateByKey overloads are defined, but they all end up in the same implementation. The parameter zeroValue is the initial value; partitioner (or numPartitions) controls the partitioning of the result; seqFunc merges a value into the per-key accumulator within a partition; and combFunc merges the per-key accumulators across partitions. As the source above shows, the zero value is serialized once and a fresh clone of it is created for every key (via createZero inside combineByKey's createCombiner), rather than once per partition as in aggregate.
**
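As with aggregate, a short sketch may help show U differing from the value type V. This is not part of the original example; it reuses javaSparkContext and assumes Java 8 lambdas, and the key/value data is made up for illustration. It collects a per-key sum and count, with V = Integer and U = int[]:

JavaPairRDD<String, Integer> pairs = javaSparkContext.parallelizePairs(Arrays.asList(
        new Tuple2<String, Integer>("a", 5), new Tuple2<String, Integer>("a", 1),
        new Tuple2<String, Integer>("b", 4), new Tuple2<String, Integer>("b", 2)), 2);
JavaPairRDD<String, int[]> sumCountByKey = pairs.aggregateByKey(
        new int[]{0, 0},
        // seqFunc: merge a value V into the per-key accumulator U within a partition
        // (the zero value is cloned for every key, see createZero in the source above)
        (acc, v) -> new int[]{acc[0] + v, acc[1] + 1},
        // combFunc: merge the accumulators of the same key across partitions
        (a, b) -> new int[]{a[0] + b[0], a[1] + b[1]});
for (Tuple2<String, int[]> t : sumCountByKey.collect()) {
    System.out.println(t._1() + " -> sum=" + t._2()[0] + ", count=" + t._2()[1]);
}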
Example:
// In words: in the data set, the values are grouped and merged by key;
// in the seq function each value is compared with the zero value and the larger one is kept,
// and the comb function then defines how the per-partition results are merged.
List<Integer> data = Arrays.asList(5, 1, 1, 4, 4, 2, 2);
int numPartitions = 4;
JavaRDD<Integer> javaRDD = javaSparkContext.parallelize(data);
final Random random = new Random(100);
JavaPairRDD<Integer, Integer> javaPairRDD = javaRDD.mapToPair(new PairFunction<Integer, Integer, Integer>() {
    @Override
    public Tuple2<Integer, Integer> call(Integer integer) throws Exception {
        return new Tuple2<Integer, Integer>(integer, random.nextInt(10));
    }
});
System.out.println("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~" + javaPairRDD.collect());
JavaPairRDD<Integer, Integer> aggregateByKeyRDD = javaPairRDD.aggregateByKey(3, numPartitions, new Function2<Integer, Integer, Integer>() {
    @Override
    public Integer call(Integer v1, Integer v2) throws Exception {
        System.out.println("seq~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~" + v1 + "," + v2);
        return Math.max(v1, v2);
    }
}, new Function2<Integer, Integer, Integer>() {
    int i = 0;
    @Override
    public Integer call(Integer v1, Integer v2) throws Exception {
        System.out.println("comb~~~~~~~~~i~~~~~~~~~~~~~~~~~~~" + i++);
        System.out.println("comb~~~~~~~~~v1~~~~~~~~~~~~~~~~~~~" + v1);
        System.out.println("comb~~~~~~~~~v2~~~~~~~~~~~~~~~~~~~" + v2);
        return v1 + v2;
    }
});
System.out.println("aggregateByKeyRDD.partitions().size()~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~"+aggregateByKeyRDD.partitions().size());
System.out.println("aggregateByKeyRDD~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~"+aggregateByKeyRDD.collect());