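The aggregate operator first aggregates the elements within each partition and then combines the per-partition results into a single global result. It takes an initial value, a function applied within each partition, and a function that merges the partition results.
Example: val rdd1 = sc.parallelize(List(1,2,3,4,5), 2)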
View the elements in each partition:
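The function used for this step is not shown in these notes. What follows is a minimal sketch modeled on the func2 that appears later in the text; the name func1 is hypothetical, and the sample iterator is only an illustration of what one partition might hold.

```scala
// Hypothetical helper (the name func1 is an assumption): tag every element
// with the index of the partition that holds it.
def func1(index: Int, iter: Iterator[Int]): Iterator[String] =
  iter.map(x => s"[partID:$index, val: $x]")

// In spark-shell it would be applied once per partition:
//   rdd1.mapPartitionsWithIndex(func1).collect
// Outside Spark, the same function can be tried on a plain iterator:
val tagged = func1(0, Iterator(1, 2)).toList
println(tagged.mkString(", "))
```

Because the function only needs a partition index and an iterator, it can be checked locally before being handed to mapPartitionsWithIndex.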
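Sum the maximum value of each partition. Note: the initial value is 0.
If the initial value is 10, the result is 30: the initial value takes part in the maximum of each partition (giving 10 and 10) and is added once more in the final combine, so 10 + 10 + 10 = 30.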
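The arithmetic above can be reproduced without a cluster. The following is a rough sketch in plain Scala: simulateAggregate is a hypothetical helper (not a Spark API) that folds each partition starting from the initial value and then folds the partition results starting from the same initial value. The partition layout [1, 2] / [3, 4, 5] is an assumption about how parallelize slices the list.

```scala
// Hypothetical stand-in for RDD.aggregate on local data: the initial value
// seeds every partition fold and also the final combine.
def simulateAggregate[T, U](partitions: List[List[T]])(zero: U)(
    seqOp: (U, T) => U, combOp: (U, U) => U): U =
  partitions.map(_.foldLeft(zero)(seqOp)).foldLeft(zero)(combOp)

// Assumed layout of rdd1 across its two partitions.
val parts = List(List(1, 2), List(3, 4, 5))

// Spark equivalent: rdd1.aggregate(0)(math.max(_, _), _ + _)
val maxSum0 = simulateAggregate(parts)(0)(math.max(_, _), _ + _)
// Spark equivalent: rdd1.aggregate(10)(math.max(_, _), _ + _)
val maxSum10 = simulateAggregate(parts)(10)(math.max(_, _), _ + _)

println(maxSum0)   // 7  = 0 + 2 + 5
println(maxSum10)  // 30 = 10 + 10 + 10
```

With the initial value 10, both partition maxima become 10 because no element exceeds it, which is why the element values no longer influence the result.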
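For a plain sum, again note that the initial value is 0.
If the initial value is 10, the result is 45: the elements sum to 15, and the initial value is added once in each of the two partitions and once more in the final combine, so 15 + 3 * 10 = 45.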
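As a sketch of why the initial value is counted three times, the sum can be replayed on local lists. The name sumWith is hypothetical, and the partition layout is again an assumption about how parallelize splits the five elements.

```scala
// Assumed layout of rdd1 across its two partitions.
val parts = List(List(1, 2), List(3, 4, 5))

// Sum with a given initial value: the value seeds each partition fold
// and is added once more when the partition sums are combined.
def sumWith(zero: Int): Int = {
  val perPartition = parts.map(_.foldLeft(zero)(_ + _))
  perPartition.foldLeft(zero)(_ + _)
}

// Spark equivalent: rdd1.aggregate(zero)(_ + _, _ + _)
println(sumWith(0))   // 15
println(sumWith(10))  // 45 = 15 + 10 * (2 partitions + 1)
```

For a sum, the total grows by the initial value times the number of partitions plus one, so the initial value should normally be the identity of the operation (0 for addition).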
A string example:
val rdd2 = sc.parallelize(List("a","b","c","d","e","f"),2)
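Adapt the partition-viewing function used earlier so that it accepts strings:
def func2(index: Int, iter: Iterator[String]): Iterator[String] = {
  // Tag each element with the index of the partition that holds it
  iter.toList.map(x => "[partID:" + index + ", val: " + x + "]").iterator
}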
Elements in the two partitions:
[partID:0, val: a], [partID:0, val: b], [partID:0, val: c],
[partID:1, val: d], [partID:1, val: e], [partID:1, val: f]
Result:
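The call that produced this result, and its output, are missing from these notes. Purely as an illustration, and assuming the example concatenates the strings within and across partitions, the behavior can be sketched on local lists. The name concatWith and the "|" initial value are assumptions, not taken from the original.

```scala
// Layout of rdd2 as shown in the partition listing.
val parts = List(List("a", "b", "c"), List("d", "e", "f"))

// Concatenate within each partition, then across partitions,
// seeding every fold with the same initial value.
def concatWith(zero: String): String =
  parts.map(_.foldLeft(zero)(_ + _)).foldLeft(zero)(_ + _)

// Assumed Spark equivalent: rdd2.aggregate(zero)(_ + _, _ + _)
println(concatWith(""))   // abcdef
println(concatWith("|"))  // ||abc|def
```

This local version always combines the partitions in order. On a cluster the partition results may arrive in either order, so a real run could also place the second partition first.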
A slightly more complex example:
val rdd3 = sc.parallelize(List("12","23","345","4567"),2)
rdd3.aggregate("")((x,y) => math.max(x.length, y.length).toString, (x,y) => x + y)
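The result may be "24" or "42": the first partition yields "2" and the second yields "4", and the order in which the partition results are concatenated is not fixed.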
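Both outcomes can be reproduced locally. This is a sketch under the assumption that the four strings are split into ["12", "23"] and ["345", "4567"]; reversing the list of partition results stands in for the second partition finishing first.

```scala
// Assumed layout of rdd3 across its two partitions.
val parts = List(List("12", "23"), List("345", "4567"))

// seqOp from the example: keep the larger of the two lengths, as a string.
val seqOp = (acc: String, s: String) => math.max(acc.length, s.length).toString

// Per-partition results: "2" for the first partition, "4" for the second.
val perPartition = parts.map(_.foldLeft("")(seqOp))

// combOp is string concatenation, so the arrival order shows in the result.
val inOrder = perPartition.foldLeft("")(_ + _)
val reversed = perPartition.reverse.foldLeft("")(_ + _)

println(inOrder)   // 24
println(reversed)  // 42
```

Concatenation is not commutative, which is why the combine order becomes visible here while it stayed hidden in the numeric sums.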
val rdd4 = sc.parallelize(List("12","23","345",""),2)
rdd4.aggregate("")((x,y) => math.min(x.length, y.length).toString, (x,y) => x + y)
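The result is "10", or possibly "01".
Reason: the initial value "" has length 0, and 0.toString turns that into the string "0", whose length is 1. In the first partition the accumulator goes "" -> "0" -> "1"; in the second it goes "" -> "0" -> "0", because the final element "" has length 0.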
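The intermediate accumulator values can be made visible with scanLeft, which records every step of a fold. This is a sketch that assumes the partitions are ["12", "23"] and ["345", ""]; the names p0, p1, trace0 and trace1 are illustrative.

```scala
// seqOp from the example: keep the smaller of the two lengths, as a string.
val seqOp = (acc: String, s: String) => math.min(acc.length, s.length).toString

// Assumed layout of rdd4 across its two partitions.
val p0 = List("12", "23")
val p1 = List("345", "")

// scanLeft keeps every intermediate accumulator, starting with the initial "".
val trace0 = p0.scanLeft("")(seqOp)  // List("", "0", "1")
val trace1 = p1.scanLeft("")(seqOp)  // List("", "0", "0")

// Combining the final accumulators in partition order gives "10".
val combined = trace0.last + trace1.last
println(combined)
```

The jump from "0" to "1" in the first trace is the key step: the accumulator "0" is itself a one-character string, so its length is 1 rather than 0.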
val rdd5 = sc.parallelize(List("12","23","","345"),2)
rdd5.aggregate("")((x,y) => math.min(x.length, y.length).toString, (x,y) => x + y)
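The result is "11", for the same reason: in the second partition the empty string now comes first, so the accumulator goes "" -> "0" -> "1" (the minimum of "0".length = 1 and "345".length = 3).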
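The only difference between the last two examples is where the empty string sits within the second partition. A small sketch, assuming the second partition holds ["345", ""] in one case and ["", "345"] in the other, shows how that position changes the outcome; the names emptyLast and emptyFirst are illustrative.

```scala
// seqOp from the example: keep the smaller of the two lengths, as a string.
val seqOp = (acc: String, s: String) => math.min(acc.length, s.length).toString

// Empty string last: "" -> "0" -> "0" (min of 1 and 0).
val emptyLast = List("345", "").foldLeft("")(seqOp)
// Empty string first: "" -> "0" -> "1" (min of 1 and 3).
val emptyFirst = List("", "345").foldLeft("")(seqOp)

println(emptyLast)   // 0
println(emptyFirst)  // 1
```

Since the first partition ["12", "23"] yields "1" in both cases, the combined result is "10" or "01" when the empty string is last, and "11" when it is first.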