Spark Asynchronous Actions

What if we want to execute two actions concurrently on different RDDs? Spark actions are synchronous by default: if we perform two actions one after the other, they always execute sequentially, the second starting only after the first finishes.
Let's look at an example:

val rdd = sc.parallelize(List(32, 34, 2, 3, 4, 54, 3), 4)
rdd.collect().foreach { x => println("Items in the list: " + x) }
val rddCount = sc.parallelize(List(434, 3, 2, 43, 45, 3, 2), 4)
println("Number of items in the list: " + rddCount.count())

In the above example the two actions, collect and count, are performed one after the other, and both execute synchronously: count always runs only after collect has finished. The output of the above code is as follows:


The question, then, is how to run Spark jobs concurrently, in an asynchronous fashion. The answer is simple: Apache Spark also provides asynchronous actions for concurrent execution of jobs. A few of the asynchronous actions Spark provides are:
- collectAsync() -> Returns a future for retrieving all elements of this RDD.
- countAsync() -> Returns a future for counting the number of elements in the RDD.
- foreachAsync(scala.Function1<T,scala.runtime.BoxedUnit> f) -> Applies a function f to all elements of this RDD.
- foreachPartitionAsync(scala.Function1<scala.collection.Iterator,scala.runtime.BoxedUnit> f) -> Applies a function f to each partition of this RDD.
- takeAsync(int num) -> Returns a future for retrieving the first num elements of the RDD.
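The remaining actions follow the same future-returning pattern as the ones demonstrated below. A minimal sketch of takeAsync and foreachAsync, assuming a running SparkContext `sc` (the sample data is illustrative):

```scala
val rdd = sc.parallelize(List(32, 34, 2, 3, 4, 54, 3), 4)

// Returns a FutureAction[Seq[Int]] that will hold the first 3 elements.
val firstThree = rdd.takeAsync(3)

// Applies println to every element on the executors; returns a FutureAction[Unit].
val printed = rdd.foreachAsync(x => println("Item: " + x))
```

Both calls return immediately; the driver can continue submitting work while these jobs run.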
Now let us see what happens when we use async actions:

import scala.concurrent.ExecutionContext.Implicits.global

val rdd = sc.parallelize(List(32, 34, 2, 3, 4, 54, 3), 4)
rdd.collectAsync().map { items => items.foreach(x => println("Items in the list: " + x)) }
val rddCount = sc.parallelize(List(434, 3, 2, 43, 45, 3, 2), 4)
rddCount.countAsync().map { x => println("Number of items in the list: " + x) }

The output of the above code is as follows:


You can see in the output above that the result of the second job comes first: the first job returns a future immediately, so the second can be submitted without waiting for it. But notice that the jobs themselves still execute one after the other; a single job uses all the resources of the cluster, so the other job is delayed.
To take full advantage of asynchronous actions, we need to configure the job scheduler.
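Since Spark's FutureAction implements scala.concurrent.Future, the standard Future combinators apply: when the driver needs both results, combine the futures and block exactly once at the end instead of awaiting each job in sequence. A minimal sketch of that pattern with plain Scala futures (the sleeps stand in for the two Spark jobs; with Spark these would be the results of collectAsync() and countAsync()):

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Two independent "jobs" running concurrently on the execution context.
val jobA: Future[Seq[Int]] = Future { Thread.sleep(100); Seq(32, 34, 2) }
val jobB: Future[Long]     = Future { Thread.sleep(100); 7L }

// Combine them and block once for both results.
val combined: Future[(Seq[Int], Long)] = jobA.zip(jobB)
val (items, count) = Await.result(combined, 10.seconds)

println(s"Items: $items, count: $count")
```

The same shape works with FutureActions on the driver: submit both jobs, then await the combined future.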
Job Scheduling
By default the Spark scheduler runs jobs in FIFO (First In, First Out) fashion: priority is given to the first job, then the second, and so on. If the first job does not need the whole cluster, the second job can also run in parallel; but if the first job is large, the second will wait a long time even if it would take very little time to execute. As a solution, Spark provides the fair scheduler, under which jobs execute in "round robin" fashion.
To configure the fair scheduler, set the configuration as follows:
val conf = new SparkConf().setAppName("spark_auth").setMaster("local[*]").set("spark.scheduler.mode", "FAIR")
After configuring FAIR scheduling you can see that both jobs run concurrently and share the resources of the Spark cluster. The output of the above code is now as follows:
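Putting the pieces together, a rough end-to-end sketch for a standalone driver program (the `local[*]` master, app name, and sample data are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import scala.concurrent.ExecutionContext.Implicits.global

// FAIR mode lets concurrently submitted jobs share executors round-robin.
val conf = new SparkConf()
  .setAppName("spark_auth")
  .setMaster("local[*]")
  .set("spark.scheduler.mode", "FAIR")
val sc = new SparkContext(conf)

val rdd      = sc.parallelize(List(32, 34, 2, 3, 4, 54, 3), 4)
val rddCount = sc.parallelize(List(434, 3, 2, 43, 45, 3, 2), 4)

// Both jobs are submitted immediately and run concurrently under FAIR scheduling.
val f1 = rdd.collectAsync().map(items => items.foreach(x => println("Items in the list: " + x)))
val f2 = rddCount.countAsync().map(n => println("Number of items in the list: " + n))
```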
Screenshot from 2015-10-21 13:35:53

You can see in the result above that both jobs run concurrently; neither action waits for the other's result.
For the above code you can check out: https://github.com/knoldus/spark-scala-async

