【Spark Java API】Transformation(2)—sample、randomSplit

sample

官方文档描述：

 Return a sampled subset of this RDD.

返回抽样的样本的子集。

函数原型：

withReplacement can elements be sampled multiple times (replaced when sampled out)
fraction expected size of the sample as a fraction of this RDD's size
without replacement: probability that each element is chosen; fraction must be [0, 1]
with replacement: expected number of times each element is chosen; fraction must be >= 0

def sample(withReplacement: Boolean, fraction: Double): JavaRDD[T]

withReplacement can elements be sampled multiple times (replaced when sampled out)
fraction expected size of the sample as a fraction of this RDD's size
without replacement: probability that each element is chosen; fraction must be [0, 1]
with replacement: expected number of times each element is chosen; fraction must be >= 0
seed seed for the random number generator

def sample(withReplacement: Boolean, fraction: Double, seed: Long): JavaRDD[T]

**
第一函数是基于第二个实现的，在第一个函数中seed为Utils.random.nextLong；其中，withReplacement是建立不同的采样器；fraction为采样比例；seed为随机生成器的种子
**

源码分析：

def sample(withReplacement: Boolean, fraction: Double,    
seed: Long = Utils.random.nextLong): RDD[T] = withScope {  
require(fraction >= 0.0, "Negative fraction value: " + fraction)  
if (withReplacement) {  
  new PartitionwiseSampledRDD[T, T](this, new PoissonSampler[T](fraction), true, seed)  
} else {    
new PartitionwiseSampledRDD[T, T](this, new BernoulliSampler[T](fraction), true, seed) 
 }
}

**
sample函数中，首先对fraction进行验证；再次建立PartitionwiseSampledRDD，依据withReplacement的值分别建立柏松采样器或伯努利采样器。
**

实例：

List<Integer> data = Arrays.asList(1, 2, 4, 3, 5, 6, 7);
JavaRDD<Integer> javaRDD = javaSparkContext.parallelize(data);
//false   是伯努利分布  (元素可以多次采样);0.2   采样比例;100   随机数生成器的种子
JavaRDD<Integer> sampleRDD = javaRDD.sample(false,0.2,100);
System.out.println("sampleRDD~~~~~~~~~~~~~~~~~~~~~~~~~~" + sampleRDD.collect());
//true  是柏松分布;0.2   采样比例;100   随机数生成器的种子
JavaRDD<Integer> sampleRDD1 = javaRDD.sample(false,0.2,100);
System.out.println("sampleRDD1~~~~~~~~~~~~~~~~~~~~~~~~~~" + sampleRDD1.collect());

randomSplit

官方文档描述：

 Randomly splits this RDD with the provided weights.

依据所提供的权重对该RDD进行随机划分

函数原型：

weights for splits, will be normalized if they don't sum to 1
random seed
return split RDDs in an array

def randomSplit(weights: Array[Double], seed: Long): Array[JavaRDD[T]]

weights for splits, will be normalized if they don't sum to 1
return split RDDs in an array

def randomSplit(weights: Array[Double]): Array[JavaRDD[T]]

源码分析：

def randomSplit(weights: Array[Double],    
seed: Long = Utils.random.nextLong): Array[RDD[T]] = withScope {  
val sum = weights.sum  
val normalizedCumWeights = weights.map(_ / sum).scanLeft(0.0d)(_ + _)  
normalizedCumWeights.sliding(2).map { x =>
  randomSampleWithRange(x(0), x(1), seed) 
 }.toArray
}

def randomSampleWithRange(lb: Double, ub: Double, seed: Long): RDD[T] = { 
this.mapPartitionsWithIndex( { (index, partition) =>    
  val sampler = new BernoulliCellSampler[T](lb, ub)        
  sampler.setSeed(seed + index)    
  sampler.sample(partition)  
 }, preservesPartitioning = true)
}

**
从源码中可以看到randomSPlit先是对权重数组进行0-1正则化；再利用randomSampleWithRange函数，对RDD进行划分；而在该函数中调用mapPartitionsWithIndex（上一节有具体说明），建立伯努利采样器对RDD进行划分。
**

实例：

List<Integer> data = Arrays.asList(1, 2, 4, 3, 5, 6, 7);
JavaRDD<Integer> javaRDD = javaSparkContext.parallelize(data);
double [] weights =  {0.1,0.2,0.7};
//依据所提供的权重对该RDD进行随机划分
JavaRDD<Integer> [] randomSplitRDDs = javaRDD.randomSplit(weights);
System.out.println("randomSplitRDDs of size~~~~~~~~~~~~~~" + randomSplitRDDs.length);
int i = 0;
for(JavaRDD<Integer> item:randomSplitRDDs)        
   System.out.println(i++ + " randomSplitRDDs of item~~~~~~~~~~~~~~~~" + item.collect());

最后编辑于：2017.11.27 06:13:15

人面猴
序言：七十年代末，一起剥皮案震惊了整个滨河市，随后出现的几起案子，更是在滨河造成了极大的恐慌，老刑警刘岩，带你破解...
沈念sama阅读 216,287评论 6赞 498
死咒
序言：滨河连续发生了三起死亡事件，死亡现场离奇诡异，居然都是意外死亡，警方通过查阅死者的电脑和手机，发现死者居然都...
沈念sama阅读 92,346评论 3赞 392
救了他两次的神仙让他今天三更去死
文/潘晓璐我一进店门，熙熙楼的掌柜王于贵愁眉苦脸地迎上来，“玉大人，你说我怎么就摊上这事。” “怎么了？”我有些...
开封第一讲书人阅读 162,277评论 0赞 353
道士缉凶录：失踪的卖姜人
文/不坏的土叔我叫张陵，是天一观的道长。经常有香客问我，道长，这世上最难降的妖魔是什么？我笑而不...
开封第一讲书人阅读 58,132评论 1赞 292
港岛之恋（遗憾婚礼）
正文为了忘掉前任，我火速办了婚礼，结果婚礼上，老公的妹妹穿的比我还像新娘。我一直安慰自己，他们只是感情好，可当我...
茶点故事阅读 67,147评论 6赞 388
恶毒庶女顶嫁案：这布局不是一般人想出来的
文/花漫我一把揭开白布。她就那样静静地躺着，像睡着了一般。火红的嫁衣衬着肌肤如雪。梳的纹丝不乱的头发上，一...
开封第一讲书人阅读 51,106评论 1赞 295
城市分裂传说
那天，我揣着相机与录音，去河边找鬼。笑死，一个胖子当着我的面吹牛，可吹牛的内容都是我干的。我是一名探鬼主播，决...
沈念sama阅读 40,019评论 3赞 417
双鸳鸯连环套：你想象不到人心有多黑
文/苍兰香墨我猛地睁开眼，长吁一口气：“原来是场噩梦啊……” “哼！你这毒妇竟也来了？” 一声冷哼从身侧响起，我...
开封第一讲书人阅读 38,862评论 0赞 274
万荣杀人案实录
序言：老挝万荣一对情侣失踪，失踪者是张志新（化名）和其女友刘颖，没想到半个月后，有当地人在树林里发现了一具尸体，经...
沈念sama阅读 45,301评论 1赞 310
护林员之死
正文独居荒郊野岭守林人离奇死亡，尸身上长有42处带血的脓包…… 初始之章·张勋以下内容为张勋视角年9月15日...
茶点故事阅读 37,521评论 2赞 332
白月光启示录
正文我和宋清朗相恋三年，在试婚纱的时候发现自己被绿了。大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
茶点故事阅读 39,682评论 1赞 348
活死人
序言：一个原本活蹦乱跳的男人离奇死亡，死状恐怖，灵堂内的尸体忽然破棺而出，到底是诈尸还是另有隐情，我是刑警宁泽，带...
沈念sama阅读 35,405评论 5赞 343
日本核电站爆炸内幕
正文年R本政府宣布，位于F岛的核电站，受9级特大地震影响，放射性物质发生泄漏。R本人自食恶果不足惜，却给世界环境...
茶点故事阅读 40,996评论 3赞 325
男人毒药：我在死后第九天来索命
文/蒙蒙一、第九天我趴在偏房一处隐蔽的房顶上张望。院中可真热闹，春花似锦、人声如沸。这庄子的主人今日做“春日...
开封第一讲书人阅读 31,651评论 0赞 22
一桩弑父案，背后竟有这般阴谋
文/苍兰香墨我抬头看了看天上的太阳。三九已至，却和暖如春，着一层夹袄步出监牢的瞬间，已是汗流浃背。一阵脚步声响...
开封第一讲书人阅读 32,803评论 1赞 268
情欲美人皮
我被黑心中介骗来泰国打工，没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留，地道东北人。一个月前我还...
沈念sama阅读 47,674评论 2赞 368
代替公主和亲
正文我出身青楼，却偏偏与公主长得像，于是被迫代替她去往敌国和亲。传闻我的和亲对象是个残疾皇子，可洞房花烛夜当晚...
茶点故事阅读 44,563评论 2赞 352

【Spark Java API】Transformation(2)—sample、randomSplit

sample

官方文档描述：

函数原型：

源码分析：

实例：

randomSplit

官方文档描述：

函数原型：

源码分析：

实例：

推荐阅读更多精彩内容