PySpark's union operator is actually a narrow dependency!
In SQL, UNION and UNION ALL are not the same: UNION deduplicates, while UNION ALL does not. Deduplication requires a shuffle, and a ShuffleDependency is a wide dependency; the shuffle is also what Spark uses to split stages. With a wide dependency, a child partition no longer maps cleanly to a single parent partition, so recomputing one lost partition forces redundant recomputation of multiple parent partitions.
But PySpark's union operator is not the same as SQL's UNION: it does not deduplicate, so it is a narrow dependency!
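A quick way to see this is to compare physical plans. Below is a minimal sketch, assuming a local SparkSession; the column names and data are made up for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("union-demo").getOrCreate()

df1 = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])
df2 = spark.createDataFrame([(2, "b"), (3, "c")], ["id", "val"])

# Plain union: no deduplication, so the physical plan is just a Union node
# over both children -- no Exchange, hence a narrow dependency.
df1.union(df2).explain()

# union + distinct: deduplication introduces an Exchange (shuffle) into the
# plan -- a wide dependency, and therefore a stage boundary.
df1.union(df2).distinct().explain()
```

On the Spark versions I have seen, the second plan contains an `Exchange hashpartitioning(...)` node while the first does not; the exact plan text varies by version.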
Quoting the PySpark documentation:
union
Return a new DataFrame containing union of rows in this and another frame.
This is equivalent to UNION ALL in SQL. To do a SQL-style set union (that does deduplication of elements), use this function followed by distinct().
Also as standard in SQL, this function resolves columns by position (not by name).
New in version 2.0.
unionAll
Return a new DataFrame containing union of rows in this and another frame.
This is equivalent to UNION ALL in SQL. To do a SQL-style set union (that does deduplication of elements), use this function followed by distinct().
Also as standard in SQL, this function resolves columns by position (not by name).
Note Deprecated in 2.0, use union() instead.
New in version 1.3.
The documentation is clear: unionAll was introduced in version 1.3 and union in version 2.0, and both are equivalent to UNION ALL in SQL. unionAll has been deprecated since 2.0, so use union instead.
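So to get SQL UNION (deduplicating) semantics, follow union with distinct(), exactly as the docs suggest. Reusing df1 and df2 from the sketch above:

```python
union_all = df1.union(df2)             # SQL UNION ALL: keeps the duplicate row (2, "b")
union_set = df1.union(df2).distinct()  # SQL UNION: the duplicate row is removed
```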
Also, union merges columns by position, not by name. If you need name-based matching, use unionByName.
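A small sketch of the difference (unionByName exists since Spark 2.3; the data here is made up):

```python
a = spark.createDataFrame([(1, 2)], ["x", "y"])
b = spark.createDataFrame([(3, 4)], ["y", "x"])  # same columns, swapped order

# union resolves by POSITION: b's row lands as x=3, y=4,
# ignoring that b's first column is actually named "y".
a.union(b).show()

# unionByName resolves by NAME: b's row is realigned, giving x=4, y=3.
a.unionByName(b).show()
```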