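Spark offers three distributed data abstractions (RDD, DataFrame, and Dataset), but it is not obvious which one performs best (some articles claim Dataset < DataFrame < RDD). Well, that calls for someone with too much free time, and that someone is me. Here I test how the three compare.
Test code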
import java.util.UUID

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

class App10 {
  // Kerberos settings for the secured cluster used in this test.
  System.setProperty("java.security.krb5.conf", "/etc/krb5.conf")
  System.setProperty("sun.security.krb5.debug", "false")

  val sparkConf = new SparkConf()
    .set("spark.shuffle.service.enabled", "true")
    .set("spark.dynamicAllocation.enabled", "true")
    .set("spark.dynamicAllocation.minExecutors", "1")
    .set("spark.dynamicAllocation.initialExecutors", "1")
    .set("spark.dynamicAllocation.maxExecutors", "6")
    .set("spark.dynamicAllocation.executorIdleTimeout", "60")
    .set("spark.dynamicAllocation.cachedExecutorIdleTimeout", "60")
    .set("spark.executor.cores", "4")
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    // .setMaster("local[12]")
    .setAppName("Boring Dataset, DataFrame, RDD test")

  val spark = SparkSession
    .builder
    .config(sparkConf)
    .getOrCreate()

  // typ selects the abstraction under test: 0 = RDD, 1 = DataFrame, 2 = Dataset.
  def run(typ: Int): Unit = {
    import spark.implicits._ // provides toDF() and toDS()
    spark.sparkContext.setLogLevel("ERROR")

    // `0 to 4000000` is inclusive, so this builds 4,000,001 records on the
    // driver before distributing them. It is a `def`, so the data is only
    // generated in the branch that actually runs.
    def logs = spark.sparkContext.parallelize((0 to 4000000).map { num =>
      Log10(UUID.randomUUID().toString, num)
    })

    // Every variant runs the same single action: count().
    typ match {
      case 0 => logs.count()
      case 1 => logs.toDF().count()
      case 2 => logs.toDS().count()
      case _ => () // unknown type: do nothing
    }
  }
}

// Record type shared by all three variants.
case class Log10(uid: String, age: Int)

object App10 {
  def main(args: Array[String]): Unit = {
    new App10().run(args(0).toInt)
  }
}
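Test groups
PS: The cluster consists of two machines, each with 12 cores and 24 GB of RAM. No other jobs were running on them; both hosts were idle, so the results should be fairly clean.
Group 1 (RDD, argument 0)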
time spark-submit --master yarn --jars "hdfs:///tmp/jars/*" --class com.dounine.hbase.App10 --driver-memory 3g --executor-memory 2G build/libs/hdfs-token-1.0.0-SNAPSHOT.jar 0
Results of three runs
real 0m34.242s
user 0m54.498s
sys 0m3.584s
-----------------------
real 0m34.009s
user 0m45.385s
sys 0m3.520s
----------------------
real 0m34.948s
user 0m49.349s
sys 0m3.407s
Group 2 (DataFrame, argument 1)
time spark-submit --master yarn --jars "hdfs:///tmp/jars/*" --class com.dounine.hbase.App10 --driver-memory 3g --executor-memory 2G build/libs/hdfs-token-1.0.0-SNAPSHOT.jar 1
Results of three runs
real 0m37.738s
user 0m52.649s
sys 0m3.684s
------------------
real 0m37.471s
user 0m50.647s
sys 0m3.557s
-------------------
real 0m37.248s
user 0m46.946s
sys 0m3.471s
Group 3 (Dataset, argument 2)
time spark-submit --master yarn --jars "hdfs:///tmp/jars/*" --class com.dounine.hbase.App10 --driver-memory 3g --executor-memory 2G build/libs/hdfs-token-1.0.0-SNAPSHOT.jar 2
Results of three runs
real 0m36.179s
user 0m59.250s
sys 0m3.674s
---------------------
real 0m35.090s
user 0m54.178s
sys 0m3.476s
--------------------
real 0m35.181s
user 0m50.917s
sys 0m3.599s
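To make the three groups easier to compare, the wall-clock (`real`) times can be averaged. The snippet below is only a small sketch in plain Scala, with no Spark dependency; the object name `RealTimeSummary` and its members are made up for illustration, and the numbers are copied from the runs above.

```scala
object RealTimeSummary {
  // Wall-clock ("real") seconds copied from the three runs of each group.
  val realSeconds: Seq[(String, Seq[Double])] = Seq(
    "RDD"       -> Seq(34.242, 34.009, 34.948),
    "DataFrame" -> Seq(37.738, 37.471, 37.248),
    "Dataset"   -> Seq(36.179, 35.090, 35.181)
  )

  // Arithmetic mean of a non-empty sequence.
  def mean(xs: Seq[Double]): Double = xs.sum / xs.size

  def main(args: Array[String]): Unit = {
    // The RDD group (first entry) is used as the reference point.
    val baseline = mean(realSeconds.head._2)
    realSeconds.foreach { case (name, runs) =>
      val m = mean(runs)
      val deltaPercent = (m / baseline - 1) * 100
      println(f"$name%-10s mean = $m%.3f s, relative to RDD: $deltaPercent%+.1f%%")
    }
  }
}
```

The means come out at about 34.40 s for RDD, 37.49 s for DataFrame, and 35.48 s for Dataset, so DataFrame is roughly 9% slower than RDD and Dataset roughly 3% slower in this particular test.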
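Conclusion
RDD still comes out slightly ahead: averaged over three runs, the wall-clock time was about 34.4 s for RDD, 35.5 s for Dataset, and 37.5 s for DataFrame. The test only runs a single count(), though, so I may well be doing it wrong. I will test again once I think of a better benchmark.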
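One possible reason the numbers are so close is that `time spark-submit` measures the whole process, including JVM startup, YARN resource negotiation, and generating the records on the driver. A follow-up test could time only the action itself from inside the application. Below is a minimal sketch of such a helper in plain Scala; `Timing`, `timed`, and `benchmark` are hypothetical names that are not part of the test code above, and the workload in `main` is just a stand-in.

```scala
object Timing {
  // Runs `block` once and returns its result together with the elapsed milliseconds.
  def timed[T](block: => T): (T, Double) = {
    val start = System.nanoTime()
    val result = block
    (result, (System.nanoTime() - start) / 1e6)
  }

  // Runs `block` `warmup` times without timing it, then `runs` more times
  // with timing, and returns the measured durations in milliseconds.
  def benchmark[T](warmup: Int, runs: Int)(block: => T): Seq[Double] = {
    (1 to warmup).foreach(_ => block)
    (1 to runs).map(_ => timed(block)._2)
  }

  def main(args: Array[String]): Unit = {
    // Stand-in workload. In App10 this could be something like
    // logs.count() or logs.toDF().count() on an already created RDD.
    val timings = benchmark(warmup = 2, runs = 5) {
      (0 to 1000000).map(_.toLong + 1).sum
    }
    println(timings.map(t => f"$t%.1f ms").mkString(", "))
  }
}
```

Used inside `run`, the three variants would then be compared on the time of the `count()` action alone, with the startup cost excluded from the comparison.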