The first DataFrame I worked with was pandas', which is very convenient for handling time-series data, and I was never quite clear about the differences between the two.
The markdown comparison table was awkward to reproduce, so a screenshot was used instead.
Reposted from: http://www.lining0806.com/spark与pandas中dataframe对比/
An example of the diff() operation:
1. Invoke ipython console --profile=pyspark:
In [1]: from pyspark import SparkConf, SparkContext, SQLContext
In [2]: import pandas as pd
In [3]: sqlCtx = SQLContext(sc)
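(Not in the original post.) On Spark 2.x and later, the SQLContext setup above is usually replaced by a SparkSession. A minimal sketch of the equivalent setup, with the application name chosen arbitrarily for illustration:

```python
# Hedged sketch: SparkSession is the Spark 2.x+ entry point that subsumes
# SQLContext; createDataFrame() and toPandas() work the same way through it.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pandas-vs-spark-diff").getOrCreate()
df = spark.createDataFrame([(1, 4), (1, 5), (2, 6), (2, 6), (3, 0)], ["A", "B"])
```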
2. Computing diff on a column in Pandas:
In [4]: df = sqlCtx.createDataFrame([(1, 4), (1, 5), (2, 6),
   ...:                              (2, 6), (3, 0)], ["A", "B"])
In [5]: pdf = df.toPandas()
In [6]: pdf
Out[6]:
   A  B
0  1  4
1  1  5
2  2  6
3  2  6
4  3  0
In [7]: pdf['diff'] = pdf.B.diff()
In [8]: pdf
Out[8]:
   A  B  diff
0  1  4   NaN
1  1  5   1.0
2  2  6   1.0
3  2  6   0.0
4  3  0  -6.0
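For comparison (this sketch is not in the original post), pandas can also compute the difference within each key directly via groupby, which is the closest analogue to the windowed Spark version in the next step; the column name diff_by_A is made up for illustration:

```python
# Hedged sketch, reusing the pdf DataFrame created above: compute the
# difference of B within each group of A (current row minus previous row).
pdf["diff_by_A"] = pdf.groupby("A")["B"].diff()
# The first row of each group has no previous row, so it gets NaN.
print(pdf)
```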
3. Computing diff on a column given a specific key using the Window operation:
In [9]: from pyspark.sql.window import Window
In [10]: from pyspark.sql import functions as F
In [11]: window_over_A = Window.partitionBy("A").orderBy("B")
In [12]: df.withColumn("diff", F.lead("B").over(window_over_A) - df.B).show()
+---+---+----+
|  A|  B|diff|
+---+---+----+
|  1|  4|   1|
|  1|  5|null|
|  2|  6|   0|
|  2|  6|null|
|  3|  0|null|
+---+---+----+
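Note that F.lead() yields next-row-minus-current-row, which is why the last row of each partition is null. To match the semantics of pandas' diff() (current row minus previous row), F.lag() can be used instead; a hedged sketch, with the column name diff_prev made up for illustration:

```python
# Hedged sketch: same window as above, but lag() looks at the previous row,
# so the result matches pandas' diff() (current minus previous, per key).
from pyspark.sql import functions as F
from pyspark.sql.window import Window

window_over_A = Window.partitionBy("A").orderBy("B")
df.withColumn("diff_prev", df.B - F.lag("B").over(window_over_A)).show()
# The first row of each partition has no previous row, so diff_prev is null there.
```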
Source: https://blog.csdn.net/ljp812184246/article/details/77678591