Parameter | Default | Description |
---|---|---|
spark.sql.shuffle.partitions | 200 | Configures the number of partitions to use when shuffling data for joins or aggregations. |
spark.default.parallelism | For distributed shuffle operations like reduceByKey and join, the largest number of partitions in a parent RDD. For operations like parallelize with no parent RDDs, it depends on the cluster manager:<br>• Local mode: number of cores on the local machine<br>• Mesos fine-grained mode: 8<br>• Others: total number of cores on all executor nodes or 2, whichever is larger | Default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by the user. |
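If you want to see which values are actually in effect, both can be read from a live session. A quick sketch, assuming an existing SparkSession bound to the name `spark`:

```scala
// Read the effective values from a running SparkSession (assumed bound to `spark`)
println(spark.conf.get("spark.sql.shuffle.partitions")) // "200" unless overridden
println(spark.sparkContext.defaultParallelism)          // cluster-manager dependent, per the table
```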
Both of these parameters set the default parallelism, but they apply in different scenarios:
spark.sql.shuffle.partitions takes effect when Spark SQL performs a shuffle, such as for joins or aggregations. A colleague once set spark.default.parallelism to 2000, yet the job still produced shuffle stages with 200 tasks (the spark.sql.shuffle.partitions default); it took a long round of troubleshooting to discover that this was the reason.
spark.default.parallelism takes effect only for RDD shuffle operations, such as join and reduceByKey. The sketch below contrasts the two.
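To make the difference concrete, here is a minimal Scala sketch mirroring the anecdote above. The object name, app name, and master URL are illustrative assumptions, and spark.sql.adaptive.enabled is switched off on the assumption of Spark 3.x, where adaptive execution would otherwise coalesce shuffle partitions and hide the raw counts:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object ParallelismDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("parallelism-demo")
      .master("local[4]")
      // Disable AQE so the raw shuffle partition counts are observable
      .config("spark.sql.adaptive.enabled", "false")
      // Set ONLY spark.default.parallelism, as in the anecdote;
      // spark.sql.shuffle.partitions keeps its default of 200
      .config("spark.default.parallelism", "2000")
      .getOrCreate()

    // DataFrame aggregation: the shuffle is sized by spark.sql.shuffle.partitions,
    // so it still produces 200 partitions despite spark.default.parallelism = 2000
    val byGroup = spark.range(1000).groupBy(col("id") % 10).count()
    println(byGroup.rdd.getNumPartitions) // 200

    // RDD reduceByKey: the shuffle is sized by spark.default.parallelism
    val counts = spark.sparkContext
      .parallelize(1 to 1000)
      .map(x => (x % 10, 1))
      .reduceByKey(_ + _)
    println(counts.getNumPartitions) // 2000

    spark.stop()
  }
}
```

The first println reproduces exactly the 200-task stages from the anecdote: raising spark.default.parallelism never touches a Spark SQL shuffle, which is why the fix there was to set spark.sql.shuffle.partitions instead.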