RDD transformations
https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html#transformations
map(x => (x._2, x._1)) // swap key and value
flatMap(line => line.split(' ')) // flattens nested collections and Options (None is dropped), but does not flatten a tuple like (1, 2, 3)
filter(x => x > 0) // the predicate must return Boolean; filter(println) does not compile because println returns Unit
distinct
sample
union
...
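The same combinators exist on plain Scala collections, so their semantics can be sanity-checked locally without a cluster. A minimal sketch (names and sample data are illustrative, not Spark API):

```scala
// Local check of RDD-style combinators using plain Scala collections.
object TransformDemo {
  // map: swap key and value in each pair
  def swap(pairs: Seq[(String, Int)]): Seq[(Int, String)] =
    pairs.map(x => (x._2, x._1))

  // flatMap: each element maps to a collection; results are concatenated, not nested
  def words(lines: Seq[String]): Seq[String] =
    lines.flatMap(line => line.split(' '))

  def main(args: Array[String]): Unit = {
    println(swap(Seq(("a", 1), ("b", 2))))       // List((1,a), (2,b))
    println(words(Seq("hello world", "spark")))  // List(hello, world, spark)
    // Options flatten away None — this is why flatMap "ignores" None
    println(Seq(Some(1), None, Some(3)).flatten) // List(1, 3)
  }
}
```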
RDD actions
Nothing gets done until an action is called (lazy evaluation).
Think: how to minimize shuffle operations.
https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html#actions
collect
count
...
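Spark's lazy evaluation can be imitated locally with Scala's lazy views: building the view (like a transformation) runs nothing; forcing it (like an action) triggers the computation. A sketch, not Spark itself:

```scala
// Lazy-evaluation analogy: Scala views defer work until forced,
// much as RDD transformations defer work until an action runs.
object LazyDemo {
  def main(args: Array[String]): Unit = {
    var evaluations = 0
    val nums = Seq(1, 2, 3)
    // Like an RDD transformation: defining the view computes nothing yet
    val mapped = nums.view.map { x => evaluations += 1; x * 2 }
    println(evaluations)   // 0 — nothing computed yet
    // Like an action: forcing the view triggers the computation
    println(mapped.toList) // List(2, 4, 6)
    println(evaluations)   // 3
  }
}
```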
val result = someRdd.sortBy(_._1, ascending = false)
// sort by the first field, high to low; RDD.sortBy takes an `ascending` flag
// (an RDD has no toSeq, and a plain Seq's sortBy takes an Ordering, not a boolean)
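Off-cluster, the same descending sort can be checked on a plain Seq, where sortBy takes an Ordering rather than a boolean flag (illustrative sketch):

```scala
// Descending sort on a local Seq: pass a reversed Ordering to sortBy.
object SortDemo {
  def sortDesc(xs: Seq[(Int, String)]): Seq[(Int, String)] =
    xs.sortBy(_._1)(Ordering[Int].reverse)

  def main(args: Array[String]): Unit = {
    println(sortDesc(Seq((1, "low"), (3, "high"), (2, "mid"))))
    // List((3,high), (2,mid), (1,low))
  }
}
```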
Key-value RDDs
reduceByKey(_ + _) // merge values per key with the given function
groupByKey
sortByKey
keys
values
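reduceByKey(_ + _) sums all values sharing a key, combining map-side before the shuffle (which is why it is usually preferred over groupByKey). Its result can be mimicked locally with groupBy plus a reduce — a sketch of the semantics, not the Spark implementation:

```scala
// Local equivalent of rdd.reduceByKey(_ + _): group by key, then reduce each group.
object KeyValueDemo {
  def reduceByKey(pairs: Seq[(String, Int)]): Map[String, Int] =
    pairs.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).reduce(_ + _)) }

  def main(args: Array[String]): Unit = {
    val pairs = Seq(("a", 1), ("b", 2), ("a", 3))
    println(reduceByKey(pairs)) // "a" sums to 4, "b" stays 2 (map order may vary)
  }
}
```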
Find the average value per key (use a (sum, count) tuple as the value)
val totals = rdd
  .mapValues(x => (x, 1))                            // pair each value with a count of 1
  .reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2)) // sum values and counts per key
val avg = totals.mapValues(x => x._1.toDouble / x._2) // toDouble avoids integer division
val ans = avg.collect()
ans.sorted.foreach(println)
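The same (sum, count) trick can be verified on local collections before running it on a cluster. A sketch using groupBy in place of reduceByKey (same arithmetic, different machinery):

```scala
// Local check of the per-key average pipeline: carry (sum, count), divide at the end.
object AvgDemo {
  def averageByKey(pairs: Seq[(String, Int)]): Map[String, Double] =
    pairs
      .map { case (k, v) => (k, (v, 1)) }  // like mapValues(x => (x, 1))
      .groupBy(_._1)
      .map { case (k, kvs) =>
        val (sum, count) = kvs.map(_._2).reduce { (x, y) =>
          (x._1 + y._1, x._2 + y._2)       // like reduceByKey
        }
        (k, sum.toDouble / count)          // like mapValues(x => x._1 / x._2)
      }

  def main(args: Array[String]): Unit = {
    println(averageByKey(Seq(("a", 1), ("a", 3), ("b", 4))))
    // "a" averages to 2.0, "b" to 4.0 (map order may vary)
  }
}
```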