aggregate
Official documentation:
Aggregate the elements of each partition, and then the results for all the partitions,
using given combine functions and a neutral "zero value". This function can return
a different result type, U, than the type of this RDD, T. Thus, we need one operation
for merging a T into an U and one operation for merging two U's, as in scala.TraversableOnce.
Both of these functions are allowed to modify and return their first argument
instead of creating a new U to avoid memory allocation.
Function prototype:
def aggregate[U](zeroValue: U)(seqOp: JFunction2[U, T, U], combOp: JFunction2[U, U, U]): U
Source code analysis:
def aggregate[U: ClassTag](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U): U = withScope {
  // Clone the zero value since we will also be serializing it as part of tasks
  var jobResult = Utils.clone(zeroValue, sc.env.serializer.newInstance())
  val cleanSeqOp = sc.clean(seqOp)
  val cleanCombOp = sc.clean(combOp)
  val aggregatePartition = (it: Iterator[T]) => it.aggregate(zeroValue)(cleanSeqOp, cleanCombOp)
  val mergeResult = (index: Int, taskResult: U) => jobResult = combOp(jobResult, taskResult)
  sc.runJob(this, aggregatePartition, mergeResult)
  jobResult
}
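Note what the source does with the zero value: aggregatePartition calls Iterator.aggregate, which for a plain iterator is effectively a foldLeft with seqOp starting from zeroValue, so each partition's fold starts from its own copy of the zero value; jobResult is also initialized with a clone of zeroValue, so the zero value enters the computation once per partition and once more in the final driver-side merge.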
**
The aggregate function first folds the elements of each partition together with the zero value (zeroValue) using the seq function, and then merges the per-partition results, together with another copy of the zero value, using the comb function. The final result type U does not need to match the element type T of the RDD; that is why one function is needed to merge a T into a U (the seq function) and another to merge two U's (the comb function). Parameter 1 is the zero value; parameter 2 is the seq function, which folds each element into the accumulator within a partition (in the example below it keeps the running maximum); parameter 3 is the comb function, which merges the partition results.
Note: the aggregation is always carried out per partition, even when the number of partitions is not specified explicitly; an empty partition simply contributes the zero value.
**
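To make the point about the result type U differing from the element type T concrete, here is a minimal sketch. It is not part of the original example; it reuses the javaSparkContext variable from the examples in this article and assumes a Java 8 environment so that Function2 can be written as a lambda. It computes the sum and the count of the elements in one pass, with U = long[] and T = Integer:

// Zero value {sum = 0, count = 0}: neutral, so folding it in once per partition is harmless
List<Integer> values = Arrays.asList(5, 1, 1, 4, 4, 2, 2);
JavaRDD<Integer> valueRDD = javaSparkContext.parallelize(values, 3);
long[] sumCount = valueRDD.aggregate(
        new long[]{0L, 0L},
        // seqOp: merge a T (Integer) into a U (long[]) inside a partition
        (acc, v) -> new long[]{acc[0] + v, acc[1] + 1},
        // combOp: merge two U's coming from different partitions
        (a, b) -> new long[]{a[0] + b[0], a[1] + b[1]});
System.out.println("mean = " + (double) sumCount[0] / sumCount[1]);  // 19 / 7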
Example:
List<Integer> data = Arrays.asList(5, 1, 1, 4, 4, 2, 2);
JavaRDD<Integer> javaRDD = javaSparkContext.parallelize(data, 3);
Integer aggregateValue = javaRDD.aggregate(3, new Function2<Integer, Integer, Integer>() {
    @Override
    public Integer call(Integer v1, Integer v2) throws Exception {
        System.out.println("seq~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~" + v1 + "," + v2);
        return Math.max(v1, v2);
    }
}, new Function2<Integer, Integer, Integer>() {
    int i = 0;
    @Override
    public Integer call(Integer v1, Integer v2) throws Exception {
        System.out.println("comb~~~~~~~~~i~~~~~~~~~~~~~~~~~~~" + i++);
        System.out.println("comb~~~~~~~~~v1~~~~~~~~~~~~~~~~~~~" + v1);
        System.out.println("comb~~~~~~~~~v2~~~~~~~~~~~~~~~~~~~" + v2);
        return v1 + v2;
    }
});
System.out.println("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~" + aggregateValue);
aggregateByKey
Official documentation:
Aggregate the values of each key, using given combine functions and a neutral "zero value".
This function can return a different result type, U, than the type of the values in this RDD, V.
Thus, we need one operation for merging a V into a U and one operation for merging
two U's, as in scala.TraversableOnce. The former operation is used for merging values
within a partition, and the latter is used for merging values between partitions.
To avoid memory allocation, both of these functions are allowed to modify and return
their first argument instead of creating a new U.
Function prototype:
def aggregateByKey[U](zeroValue: U, partitioner: Partitioner, seqFunc: JFunction2[U, V, U],
    combFunc: JFunction2[U, U, U]): JavaPairRDD[K, U]
def aggregateByKey[U](zeroValue: U, numPartitions: Int, seqFunc: JFunction2[U, V, U],
    combFunc: JFunction2[U, U, U]): JavaPairRDD[K, U]
def aggregateByKey[U](zeroValue: U, seqFunc: JFunction2[U, V, U], combFunc: JFunction2[U, U, U]): JavaPairRDD[K, U]
Source code analysis:
def aggregateByKey[U: ClassTag](zeroValue: U, partitioner: Partitioner)(seqOp: (U, V) => U,
    combOp: (U, U) => U): RDD[(K, U)] = self.withScope {
  // Serialize the zero value to a byte array so that we can get a new clone of it on each key
  val zeroBuffer = SparkEnv.get.serializer.newInstance().serialize(zeroValue)
  val zeroArray = new Array[Byte](zeroBuffer.limit)
  zeroBuffer.get(zeroArray)
  lazy val cachedSerializer = SparkEnv.get.serializer.newInstance()
  val createZero = () => cachedSerializer.deserialize[U](ByteBuffer.wrap(zeroArray))
  // We will clean the combiner closure later in `combineByKey`
  val cleanedSeqOp = self.context.clean(seqOp)
  combineByKey[U]((v: V) => cleanedSeqOp(createZero(), v), cleanedSeqOp, combOp, partitioner)
}
**
The aggregateByKey function aggregates the values that share the same key in a pair RDD, again using a neutral zero value during the aggregation. As with aggregate, the result type U of aggregateByKey does not need to match the value type V of the RDD. Because aggregateByKey aggregates the values of each key, it still returns a pair RDD whose entries are each key together with its aggregated value, whereas aggregate returns a single non-RDD result; this difference is worth noting. Three aggregateByKey overloads are defined, but they all end up in the same implementation. The parameter zeroValue is the initial value; partitioner (or numPartitions) controls the partitioning of the result; seqFunc merges a value into the per-key accumulator within a partition; and combFunc merges the per-key accumulators across partitions. As the source above shows, the zero value is serialized once and a fresh clone of it is created for every key (via createZero inside combineByKey's createCombiner), rather than once per partition as in aggregate.
**
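As with aggregate, a short sketch may help show U differing from the value type V. This is not part of the original example; it reuses javaSparkContext and assumes Java 8 lambdas, and the key/value data is made up for illustration. It collects a per-key sum and count, with V = Integer and U = int[]:

JavaPairRDD<String, Integer> pairs = javaSparkContext.parallelizePairs(Arrays.asList(
        new Tuple2<String, Integer>("a", 5), new Tuple2<String, Integer>("a", 1),
        new Tuple2<String, Integer>("b", 4), new Tuple2<String, Integer>("b", 2)), 2);
JavaPairRDD<String, int[]> sumCountByKey = pairs.aggregateByKey(
        new int[]{0, 0},
        // seqFunc: merge a value V into the per-key accumulator U within a partition
        // (the zero value is cloned for every key, see createZero in the source above)
        (acc, v) -> new int[]{acc[0] + v, acc[1] + 1},
        // combFunc: merge the accumulators of the same key across partitions
        (a, b) -> new int[]{a[0] + b[0], a[1] + b[1]});
for (Tuple2<String, int[]> t : sumCountByKey.collect()) {
    System.out.println(t._1() + " -> sum=" + t._2()[0] + ", count=" + t._2()[1]);
}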
Example:
// In words: in the data set, the values are grouped and merged by key;
// in the seq function each value is compared with the zero value and the larger one is kept,
// and the comb function then defines how the per-partition results are merged.
List<Integer> data = Arrays.asList(5, 1, 1, 4, 4, 2, 2);
int numPartitions = 4;
JavaRDD<Integer> javaRDD = javaSparkContext.parallelize(data);
final Random random = new Random(100);
JavaPairRDD<Integer, Integer> javaPairRDD = javaRDD.mapToPair(new PairFunction<Integer, Integer, Integer>() {
    @Override
    public Tuple2<Integer, Integer> call(Integer integer) throws Exception {
        return new Tuple2<Integer, Integer>(integer, random.nextInt(10));
    }
});
System.out.println("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~" + javaPairRDD.collect());
JavaPairRDD<Integer, Integer> aggregateByKeyRDD = javaPairRDD.aggregateByKey(3, numPartitions, new Function2<Integer, Integer, Integer>() {
    @Override
    public Integer call(Integer v1, Integer v2) throws Exception {
        System.out.println("seq~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~" + v1 + "," + v2);
        return Math.max(v1, v2);
    }
}, new Function2<Integer, Integer, Integer>() {
    int i = 0;
    @Override
    public Integer call(Integer v1, Integer v2) throws Exception {
        System.out.println("comb~~~~~~~~~i~~~~~~~~~~~~~~~~~~~" + i++);
        System.out.println("comb~~~~~~~~~v1~~~~~~~~~~~~~~~~~~~" + v1);
        System.out.println("comb~~~~~~~~~v2~~~~~~~~~~~~~~~~~~~" + v2);
        return v1 + v2;
    }
});
System.out.println("aggregateByKeyRDD.partitions().size()~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~"+aggregateByKeyRDD.partitions().size());
System.out.println("aggregateByKeyRDD~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~"+aggregateByKeyRDD.collect());