Python 数据处理(三十三)—— 重塑和数据透视表
1 使用 pivot 重塑数据
数据通常以 堆叠 或 记录 的格式存储。
例如,有如下数据,表示不同时间不同变量的观察值
In [1]: df
Out[1]:
date variable value
0 2000-01-03 A 0.469112
1 2000-01-04 A -0.282863
2 2000-01-05 A -1.509059
3 2000-01-03 B -1.135632
4 2000-01-04 B 1.212112
5 2000-01-05 B -0.173215
6 2000-01-03 C 0.119209
7 2000-01-04 C -1.044236
8 2000-01-05 C -0.861849
9 2000-01-03 D -2.104569
10 2000-01-04 D -0.494929
11 2000-01-05 D 1.071804
我们可以选择变量 A
所对应的所有数据
In [2]: df[df["variable"] == "A"]
Out[2]:
date variable value
0 2000-01-03 A 0.469112
1 2000-01-04 A -0.282863
2 2000-01-05 A -1.509059
假设我们现在想对变量进行时间序列分析,但是,这种数据的表现形式显然是不利于我们进行分析的。
如果我们能将数据转换为 DataFrame
形式,以时间序列作为索引,变量名作为列名 ,索引与列名所标识的值是对应的观察值,更加有利于我们的分析
我们可以使用 DataFrame.pivot()
或顶层的 pd.pivot()
方法来实现
In [3]: df.pivot(index="date", columns="variable", values="value")
Out[3]:
variable A B C D
date
2000-01-03 0.469112 -1.135632 0.119209 -2.104569
2000-01-04 -0.282863 1.212112 -1.044236 -0.494929
2000-01-05 -1.509059 -0.173215 -0.861849 1.071804
如果省略了 values
参数,且除了指定为索引和列名的列之外,剩余的列超过一个,则输出的结果 DataFrame
中将会包含层次列。
例如
In [4]: df["value2"] = df["value"] * 2
In [5]: pivoted = df.pivot(index="date", columns="variable")
In [6]: pivoted
Out[6]:
value value2
variable A B C D A B C D
date
2000-01-03 0.469112 -1.135632 0.119209 -2.104569 0.938225 -2.271265 0.238417 -4.209138
2000-01-04 -0.282863 1.212112 -1.044236 -0.494929 -0.565727 2.424224 -2.088472 -0.989859
2000-01-05 -1.509059 -0.173215 -0.861849 1.071804 -3.018117 -0.346429 -1.723698 2.143608
你可以从结果中提取出数据子集
In [7]: pivoted["value2"]
Out[7]:
variable A B C D
date
2000-01-03 0.938225 -2.271265 0.238417 -4.209138
2000-01-04 -0.565727 2.424224 -2.088472 -0.989859
2000-01-05 -3.018117 -0.346429 -1.723698 2.143608
注意:
如果指定的索引或列包含重复值,pivot
将会抛出 ValueError: Index contains duplicate entries, cannot reshape
异常。
可以使用其泛化的函数 pivot_table()
,它支持索引或列重复值
2 使用 stack 和 unstack 重塑数据
与 pivot()
方法密切相关的是 Series
和 DataFrame
上都可以用的 stack()
和 unstack()
方法
这两个方法主要用于 MultiIndex
对象
-
stack
: 将一列(或层次列名的某一个level
)映射到索引的最内层level
-
unstack
:stack
的逆向操作,将索引最内层的level
映射为一列
让我们来看一个例子,就很清晰了
In [8]: tuples = list(
...: zip(
...: *[
...: ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
...: ["one", "two", "one", "two", "one", "two", "one", "two"],
...: ]
...: )
...: )
...:
In [9]: index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])
In [10]: df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=["A", "B"])
In [11]: df2 = df[:4]
In [12]: df2
Out[12]:
A B
first second
bar one 0.721555 -0.706771
two -1.039575 0.271860
baz one -0.424972 0.567020
two 0.276232 -1.087401
执行 stack
函数会将 DataFrame
进行压缩,便会产生一个
- 只是简单的列名会生成一个
Series
-
MultiIndex
形式的列名会生成一个DataFrame
如果列是 MultiIndex
形式,可以选择某一 level
进行 stack
In [13]: stacked = df2.stack()
In [14]: stacked
Out[14]:
first second
bar one A 0.721555
B -0.706771
two A -1.039575
B 0.271860
baz one A -0.424972
B 0.567020
two A 0.276232
B -1.087401
dtype: float64
我们可以使用 unstack
逆转回来,默认会将最内层的 level
进行转换
In [15]: stacked.unstack()
Out[15]:
A B
first second
bar one 0.721555 -0.706771
two -1.039575 0.271860
baz one -0.424972 0.567020
two 0.276232 -1.087401
对指定 level
进行 unstack
In [16]: stacked.unstack(1)
Out[16]:
second one two
first
bar A 0.721555 -1.039575
B -0.706771 0.271860
baz A -0.424972 0.276232
B 0.567020 -1.087401
In [17]: stacked.unstack(0)
Out[17]:
first bar baz
second
one A 0.721555 -0.424972
B -0.706771 0.567020
two A -1.039575 0.276232
B 0.271860 -1.087401
当然,如果索引有名称,也可以直接传入索引名
In [18]: stacked.unstack("second")
Out[18]:
second one two
first
bar A 0.721555 -1.039575
B -0.706771 0.271860
baz A -0.424972 0.276232
B 0.567020 -1.087401
注意,stack
和 unstack
方法隐式地对涉及的索引级别进行排序。因此,先调用 stack
,然后调用 unstack
,或者反过来,将得到原始 DataFrame
或 Series
的排序拷贝
In [19]: index = pd.MultiIndex.from_product([[2, 1], ["a", "b"]])
In [20]: df = pd.DataFrame(np.random.randn(4), index=index, columns=["A"])
In [21]: df
Out[21]:
A
2 a -0.370647
b -1.157892
1 a -1.344312
b 0.844885
In [22]: all(df.unstack().stack() == df.sort_index())
Out[22]: True
如果不调用 sort_index
,则上面的代码将引发 TypeError
2.1 多个级别
您还可以通过传递一个级别列表来 stack
或 unstack
多个级别,在这种情况下,最终结果就好像按列表中的顺序每次处理一个 level
一样
In [23]: columns = pd.MultiIndex.from_tuples(
....: [
....: ("A", "cat", "long"),
....: ("B", "cat", "long"),
....: ("A", "dog", "short"),
....: ("B", "dog", "short"),
....: ],
....: names=["exp", "animal", "hair_length"],
....: )
....:
In [24]: df = pd.DataFrame(np.random.randn(4, 4), columns=columns)
In [25]: df
Out[25]:
exp A B A B
animal cat cat dog dog
hair_length long long short short
0 1.075770 -0.109050 1.643563 -1.469388
1 0.357021 -0.674600 -1.776904 -0.968914
2 -1.294524 0.413738 0.276662 -0.472035
3 -0.013960 -0.362543 -0.006154 -0.923061
In [26]: df.stack(level=["animal", "hair_length"])
Out[26]:
exp A B
animal hair_length
0 cat long 1.075770 -0.109050
dog short 1.643563 -1.469388
1 cat long 0.357021 -0.674600
dog short -1.776904 -0.968914
2 cat long -1.294524 0.413738
dog short 0.276662 -0.472035
3 cat long -0.013960 -0.362543
dog short -0.006154 -0.923061
级别列表可以包含级别名称或级别编号(但不能同时包含两者)。上面的代码也可以是
In [27]: df.stack(level=[1, 2])
Out[27]:
exp A B
animal hair_length
0 cat long 1.075770 -0.109050
dog short 1.643563 -1.469388
1 cat long 0.357021 -0.674600
dog short -1.776904 -0.968914
2 cat long -1.294524 0.413738
dog short 0.276662 -0.472035
3 cat long -0.013960 -0.362543
dog short -0.006154 -0.923061
2.2 缺失值
这些函数会自动处理缺失值,并且不希望层次索引中的每个子组具有相同的标签集,同时它们也可以处理未排序的索引
例如,下面这个例子
In [28]: columns = pd.MultiIndex.from_tuples(
....: [
....: ("A", "cat"),
....: ("B", "dog"),
....: ("B", "cat"),
....: ("A", "dog"),
....: ],
....: names=["exp", "animal"],
....: )
....:
In [29]: index = pd.MultiIndex.from_product(
....: [("bar", "baz", "foo", "qux"), ("one", "two")], names=["first", "second"]
....: )
....:
In [30]: df = pd.DataFrame(np.random.randn(8, 4), index=index, columns=columns)
In [31]: df2 = df.iloc[[0, 1, 2, 4, 5, 7]]
In [32]: df2
Out[32]:
exp A B A
animal cat dog cat dog
first second
bar one 0.895717 0.805244 -1.206412 2.565646
two 1.431256 1.340309 -1.170299 -0.226169
baz one 0.410835 0.813850 0.132003 -0.827317
foo one -1.413681 1.607920 1.024180 0.569605
two 0.875906 -2.211372 0.974466 -2.006747
qux two -1.226825 0.769804 -1.281247 -0.727707
在 stack
中传递 level
In [33]: df2.stack("exp")
Out[33]:
animal cat dog
first second exp
bar one A 0.895717 2.565646
B -1.206412 0.805244
two A 1.431256 -0.226169
B -1.170299 1.340309
baz one A 0.410835 -0.827317
B 0.132003 0.813850
foo one A -1.413681 0.569605
B 1.024180 1.607920
two A 0.875906 -2.006747
B 0.974466 -2.211372
qux two A -1.226825 -0.727707
B -1.281247 0.769804
In [34]: df2.stack("animal")
Out[34]:
exp A B
first second animal
bar one cat 0.895717 -1.206412
dog 2.565646 0.805244
two cat 1.431256 -1.170299
dog -0.226169 1.340309
baz one cat 0.410835 0.132003
dog -0.827317 0.813850
foo one cat -1.413681 1.024180
dog 0.569605 1.607920
two cat 0.875906 0.974466
dog -2.006747 -2.211372
qux two cat -1.226825 -1.281247
dog -0.727707 0.769804
我们提取一部分数据,然后进行 unstack
。如果对应标签的值不存在,默认会使用填充为缺失值,如 NaN
,NaT
等。
In [35]: df3 = df.iloc[[0, 1, 4, 7], [1, 2]]
In [36]: df3
Out[36]:
exp B
animal dog cat
first second
bar one 0.805244 -1.206412
two 1.340309 -1.170299
foo one 1.607920 1.024180
qux two 0.769804 -1.281247
In [37]: df3.unstack()
Out[37]:
exp B
animal dog cat
second one two one two
first
bar 0.805244 1.340309 -1.206412 -1.170299
foo 1.607920 NaN 1.024180 NaN
qux NaN 0.769804 NaN -1.281247
unstack
接受一个可选参数 fill_value
,用于指定用于填充缺失数据的值
In [38]: df3.unstack(fill_value=-1e9)
Out[38]:
exp B
animal dog cat
second one two one two
first
bar 8.052440e-01 1.340309e+00 -1.206412e+00 -1.170299e+00
foo 1.607920e+00 -1.000000e+09 1.024180e+00 -1.000000e+09
qux -1.000000e+09 7.698036e-01 -1.000000e+09 -1.281247e+00
2.3 MultiIndex 列
对于 MultiIndex
形式的列名的 DataFrame
In [39]: df[:3].unstack(0)
Out[39]:
exp A B A
animal cat dog cat dog
first bar baz bar baz bar baz bar baz
second
one 0.895717 0.410835 0.805244 0.81385 -1.206412 0.132003 2.565646 -0.827317
two 1.431256 NaN 1.340309 NaN -1.170299 NaN -0.226169 NaN
In [40]: df2.unstack(1)
Out[40]:
exp A B A
animal cat dog cat dog
second one two one two one two one two
first
bar 0.895717 1.431256 0.805244 1.340309 -1.206412 -1.170299 2.565646 -0.226169
baz 0.410835 NaN 0.813850 NaN 0.132003 NaN -0.827317 NaN
foo -1.413681 0.875906 1.607920 -2.211372 1.024180 0.974466 0.569605 -2.006747
qux NaN -1.226825 NaN 0.769804 NaN -1.281247 NaN -0.727707
3 使用 melt 重塑 DataFrame
顶层的 melt()
以及对应的 DataFrame.melt()
方法,也可以对 DataFrame
进行格式化。
可以把一列或多列作为标识变量,其他所有列作为观测变量,相当于 pivot
的逆操作,只留下两列非标识列:variable
、value
可以使用 var_name
和 value_name
设置对应的列名
In [41]: cheese = pd.DataFrame(
....: {
....: "first": ["John", "Mary"],
....: "last": ["Doe", "Bo"],
....: "height": [5.5, 6.0],
....: "weight": [130, 150],
....: }
....: )
....:
In [42]: cheese
Out[42]:
first last height weight
0 John Doe 5.5 130
1 Mary Bo 6.0 150
In [43]: cheese.melt(id_vars=["first", "last"])
Out[43]:
first last variable value
0 John Doe height 5.5
1 Mary Bo height 6.0
2 John Doe weight 130.0
3 Mary Bo weight 150.0
In [44]: cheese.melt(id_vars=["first", "last"], var_name="quantity")
Out[44]:
first last quantity value
0 John Doe height 5.5
1 Mary Bo height 6.0
2 John Doe weight 130.0
3 Mary Bo weight 150.0
当使用 melt()
转换 DataFrame
时,索引将被忽略。通过将 ignore_index
参数设置为 False
(默认为 True
),可以保留原始的索引值。
In [45]: index = pd.MultiIndex.from_tuples([("person", "A"), ("person", "B")])
In [46]: cheese = pd.DataFrame(
....: {
....: "first": ["John", "Mary"],
....: "last": ["Doe", "Bo"],
....: "height": [5.5, 6.0],
....: "weight": [130, 150],
....: },
....: index=index,
....: )
....:
In [47]: cheese
Out[47]:
first last height weight
person A John Doe 5.5 130
B Mary Bo 6.0 150
In [48]: cheese.melt(id_vars=["first", "last"])
Out[48]:
first last variable value
0 John Doe height 5.5
1 Mary Bo height 6.0
2 John Doe weight 130.0
3 Mary Bo weight 150.0
In [49]: cheese.melt(id_vars=["first", "last"], ignore_index=False)
Out[49]:
first last variable value
person A John Doe height 5.5
B Mary Bo height 6.0
A John Doe weight 130.0
B Mary Bo weight 150.0
另一种转换方法是使用 wide_to_long()
函数。它没有 melt()
灵活,但是更容易操作
In [50]: dft = pd.DataFrame(
....: {
....: "A1970": {0: "a", 1: "b", 2: "c"},
....: "A1980": {0: "d", 1: "e", 2: "f"},
....: "B1970": {0: 2.5, 1: 1.2, 2: 0.7},
....: "B1980": {0: 3.2, 1: 1.3, 2: 0.1},
....: "X": dict(zip(range(3), np.random.randn(3))),
....: }
....: )
....:
In [51]: dft["id"] = dft.index
In [52]: dft
Out[52]:
A1970 A1980 B1970 B1980 X id
0 a d 2.5 3.2 -0.121306 0
1 b e 1.2 1.3 -0.097883 1
2 c f 0.7 0.1 0.695775 2
In [53]: pd.wide_to_long(dft, ["A", "B"], i="id", j="year")
Out[53]:
X A B
id year
0 1970 -0.121306 a 2.5
1 1970 -0.097883 b 1.2
2 1970 0.695775 c 0.7
0 1980 -0.121306 d 3.2
1 1980 -0.097883 e 1.3
2 1980 0.695775 f 0.1