Python 数据处理（三十三）—— 重塑和数据透视表

1 使用 pivot 重塑数据

数据通常以堆叠或记录的格式存储。

例如，有如下数据，表示不同时间不同变量的观察值

In [1]: df
Out[1]: 
         date variable     value
0  2000-01-03        A  0.469112
1  2000-01-04        A -0.282863
2  2000-01-05        A -1.509059
3  2000-01-03        B -1.135632
4  2000-01-04        B  1.212112
5  2000-01-05        B -0.173215
6  2000-01-03        C  0.119209
7  2000-01-04        C -1.044236
8  2000-01-05        C -0.861849
9  2000-01-03        D -2.104569
10 2000-01-04        D -0.494929
11 2000-01-05        D  1.071804

我们可以选择变量 A 所对应的所有数据

In [2]: df[df["variable"] == "A"]
Out[2]: 
        date variable     value
0 2000-01-03        A  0.469112
1 2000-01-04        A -0.282863
2 2000-01-05        A -1.509059

假设我们现在想对变量进行时间序列分析，但是，这种数据的表现形式显然是不利于我们进行分析的。

如果我们能将数据转换为 DataFrame 形式，以时间序列作为索引，变量名作为列名，索引与列名所标识的值是对应的观察值，更加有利于我们的分析

我们可以使用 DataFrame.pivot() 或顶层的 pd.pivot() 方法来实现

In [3]: df.pivot(index="date", columns="variable", values="value")
Out[3]: 
variable           A         B         C         D
date                                              
2000-01-03  0.469112 -1.135632  0.119209 -2.104569
2000-01-04 -0.282863  1.212112 -1.044236 -0.494929
2000-01-05 -1.509059 -0.173215 -0.861849  1.071804

如果省略了 values 参数，且除了指定为索引和列名的列之外，剩余的列超过一个，则输出的结果 DataFrame 中将会包含层次列。

例如

In [4]: df["value2"] = df["value"] * 2

In [5]: pivoted = df.pivot(index="date", columns="variable")

In [6]: pivoted
Out[6]: 
               value                                  value2                              
variable           A         B         C         D         A         B         C         D
date                                                                                      
2000-01-03  0.469112 -1.135632  0.119209 -2.104569  0.938225 -2.271265  0.238417 -4.209138
2000-01-04 -0.282863  1.212112 -1.044236 -0.494929 -0.565727  2.424224 -2.088472 -0.989859
2000-01-05 -1.509059 -0.173215 -0.861849  1.071804 -3.018117 -0.346429 -1.723698  2.143608

你可以从结果中提取出数据子集

In [7]: pivoted["value2"]
Out[7]: 
variable           A         B         C         D
date                                              
2000-01-03  0.938225 -2.271265  0.238417 -4.209138
2000-01-04 -0.565727  2.424224 -2.088472 -0.989859
2000-01-05 -3.018117 -0.346429 -1.723698  2.143608

注意：

如果指定的索引或列包含重复值，pivot 将会抛出 ValueError: Index contains duplicate entries, cannot reshape 异常。

可以使用其泛化的函数 pivot_table()，它支持索引或列重复值

2 使用 stack 和 unstack 重塑数据

与 pivot() 方法密切相关的是 Series 和 DataFrame 上都可以用的 stack() 和 unstack() 方法

这两个方法主要用于 MultiIndex 对象

stack: 将一列（或层次列名的某一个 level）映射到索引的最内层 level
unstack: stack 的逆向操作，将索引最内层的 level 映射为一列

让我们来看一个例子，就很清晰了

In [8]: tuples = list(
   ...:     zip(
   ...:         *[
   ...:             ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
   ...:             ["one", "two", "one", "two", "one", "two", "one", "two"],
   ...:         ]
   ...:     )
   ...: )
   ...: 

In [9]: index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])

In [10]: df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=["A", "B"])

In [11]: df2 = df[:4]

In [12]: df2
Out[12]: 
                     A         B
first second                    
bar   one     0.721555 -0.706771
      two    -1.039575  0.271860
baz   one    -0.424972  0.567020
      two     0.276232 -1.087401

执行 stack 函数会将 DataFrame 进行压缩，便会产生一个

只是简单的列名会生成一个 Series
MultiIndex 形式的列名会生成一个 DataFrame

如果列是 MultiIndex 形式，可以选择某一 level 进行 stack

In [13]: stacked = df2.stack()

In [14]: stacked
Out[14]: 
first  second   
bar    one     A    0.721555
               B   -0.706771
       two     A   -1.039575
               B    0.271860
baz    one     A   -0.424972
               B    0.567020
       two     A    0.276232
               B   -1.087401
dtype: float64

我们可以使用 unstack 逆转回来，默认会将最内层的 level 进行转换

In [15]: stacked.unstack()
Out[15]: 
                     A         B
first second                    
bar   one     0.721555 -0.706771
      two    -1.039575  0.271860
baz   one    -0.424972  0.567020
      two     0.276232 -1.087401

unstack()

对指定 level 进行 unstack

In [16]: stacked.unstack(1)
Out[16]: 
second        one       two
first                      
bar   A  0.721555 -1.039575
      B -0.706771  0.271860
baz   A -0.424972  0.276232
      B  0.567020 -1.087401

In [17]: stacked.unstack(0)
Out[17]: 
first          bar       baz
second                      
one    A  0.721555 -0.424972
       B -0.706771  0.567020
two    A -1.039575  0.276232
       B  0.271860 -1.087401

unstack(1)

unstack(0)

当然，如果索引有名称，也可以直接传入索引名

In [18]: stacked.unstack("second")
Out[18]: 
second        one       two
first                      
bar   A  0.721555 -1.039575
      B -0.706771  0.271860
baz   A -0.424972  0.276232
      B  0.567020 -1.087401

unstack("second")

注意，stack 和 unstack 方法隐式地对涉及的索引级别进行排序。因此，先调用 stack，然后调用 unstack，或者反过来，将得到原始 DataFrame 或 Series 的排序拷贝

In [19]: index = pd.MultiIndex.from_product([[2, 1], ["a", "b"]])

In [20]: df = pd.DataFrame(np.random.randn(4), index=index, columns=["A"])

In [21]: df
Out[21]: 
            A
2 a -0.370647
  b -1.157892
1 a -1.344312
  b  0.844885

In [22]: all(df.unstack().stack() == df.sort_index())
Out[22]: True

如果不调用 sort_index，则上面的代码将引发 TypeError

2.1 多个级别

您还可以通过传递一个级别列表来 stack 或 unstack 多个级别，在这种情况下，最终结果就好像按列表中的顺序每次处理一个 level 一样

In [23]: columns = pd.MultiIndex.from_tuples(
   ....:     [
   ....:         ("A", "cat", "long"),
   ....:         ("B", "cat", "long"),
   ....:         ("A", "dog", "short"),
   ....:         ("B", "dog", "short"),
   ....:     ],
   ....:     names=["exp", "animal", "hair_length"],
   ....: )
   ....: 

In [24]: df = pd.DataFrame(np.random.randn(4, 4), columns=columns)

In [25]: df
Out[25]: 
exp                 A         B         A         B
animal            cat       cat       dog       dog
hair_length      long      long     short     short
0            1.075770 -0.109050  1.643563 -1.469388
1            0.357021 -0.674600 -1.776904 -0.968914
2           -1.294524  0.413738  0.276662 -0.472035
3           -0.013960 -0.362543 -0.006154 -0.923061

In [26]: df.stack(level=["animal", "hair_length"])
Out[26]: 
exp                          A         B
  animal hair_length                    
0 cat    long         1.075770 -0.109050
  dog    short        1.643563 -1.469388
1 cat    long         0.357021 -0.674600
  dog    short       -1.776904 -0.968914
2 cat    long        -1.294524  0.413738
  dog    short        0.276662 -0.472035
3 cat    long        -0.013960 -0.362543
  dog    short       -0.006154 -0.923061

级别列表可以包含级别名称或级别编号（但不能同时包含两者）。上面的代码也可以是

In [27]: df.stack(level=[1, 2])
Out[27]: 
exp                          A         B
  animal hair_length                    
0 cat    long         1.075770 -0.109050
  dog    short        1.643563 -1.469388
1 cat    long         0.357021 -0.674600
  dog    short       -1.776904 -0.968914
2 cat    long        -1.294524  0.413738
  dog    short        0.276662 -0.472035
3 cat    long        -0.013960 -0.362543
  dog    short       -0.006154 -0.923061

2.2 缺失值

这些函数会自动处理缺失值，并且不希望层次索引中的每个子组具有相同的标签集，同时它们也可以处理未排序的索引

例如，下面这个例子

In [28]: columns = pd.MultiIndex.from_tuples(
   ....:     [
   ....:         ("A", "cat"),
   ....:         ("B", "dog"),
   ....:         ("B", "cat"),
   ....:         ("A", "dog"),
   ....:     ],
   ....:     names=["exp", "animal"],
   ....: )
   ....: 

In [29]: index = pd.MultiIndex.from_product(
   ....:     [("bar", "baz", "foo", "qux"), ("one", "two")], names=["first", "second"]
   ....: )
   ....: 

In [30]: df = pd.DataFrame(np.random.randn(8, 4), index=index, columns=columns)

In [31]: df2 = df.iloc[[0, 1, 2, 4, 5, 7]]

In [32]: df2
Out[32]: 
exp                  A         B                   A
animal             cat       dog       cat       dog
first second                                        
bar   one     0.895717  0.805244 -1.206412  2.565646
      two     1.431256  1.340309 -1.170299 -0.226169
baz   one     0.410835  0.813850  0.132003 -0.827317
foo   one    -1.413681  1.607920  1.024180  0.569605
      two     0.875906 -2.211372  0.974466 -2.006747
qux   two    -1.226825  0.769804 -1.281247 -0.727707

在 stack 中传递 level

In [33]: df2.stack("exp")
Out[33]: 
animal                 cat       dog
first second exp                    
bar   one    A    0.895717  2.565646
             B   -1.206412  0.805244
      two    A    1.431256 -0.226169
             B   -1.170299  1.340309
baz   one    A    0.410835 -0.827317
             B    0.132003  0.813850
foo   one    A   -1.413681  0.569605
             B    1.024180  1.607920
      two    A    0.875906 -2.006747
             B    0.974466 -2.211372
qux   two    A   -1.226825 -0.727707
             B   -1.281247  0.769804

In [34]: df2.stack("animal")
Out[34]: 
exp                         A         B
first second animal                    
bar   one    cat     0.895717 -1.206412
             dog     2.565646  0.805244
      two    cat     1.431256 -1.170299
             dog    -0.226169  1.340309
baz   one    cat     0.410835  0.132003
             dog    -0.827317  0.813850
foo   one    cat    -1.413681  1.024180
             dog     0.569605  1.607920
      two    cat     0.875906  0.974466
             dog    -2.006747 -2.211372
qux   two    cat    -1.226825 -1.281247
             dog    -0.727707  0.769804

我们提取一部分数据，然后进行 unstack。如果对应标签的值不存在，默认会使用填充为缺失值，如 NaN，NaT 等。

In [35]: df3 = df.iloc[[0, 1, 4, 7], [1, 2]]

In [36]: df3
Out[36]: 
exp                  B          
animal             dog       cat
first second                    
bar   one     0.805244 -1.206412
      two     1.340309 -1.170299
foo   one     1.607920  1.024180
qux   two     0.769804 -1.281247

In [37]: df3.unstack()
Out[37]: 
exp            B                              
animal       dog                 cat          
second       one       two       one       two
first                                         
bar     0.805244  1.340309 -1.206412 -1.170299
foo     1.607920       NaN  1.024180       NaN
qux          NaN  0.769804       NaN -1.281247

unstack 接受一个可选参数 fill_value，用于指定用于填充缺失数据的值

In [38]: df3.unstack(fill_value=-1e9)
Out[38]: 
exp                B                                          
animal           dog                         cat              
second           one           two           one           two
first                                                         
bar     8.052440e-01  1.340309e+00 -1.206412e+00 -1.170299e+00
foo     1.607920e+00 -1.000000e+09  1.024180e+00 -1.000000e+09
qux    -1.000000e+09  7.698036e-01 -1.000000e+09 -1.281247e+00

2.3 MultiIndex 列

对于 MultiIndex 形式的列名的 DataFrame

In [39]: df[:3].unstack(0)
Out[39]: 
exp            A                   B                                      A          
animal       cat                 dog                cat                 dog          
first        bar       baz       bar      baz       bar       baz       bar       baz
second                                                                               
one     0.895717  0.410835  0.805244  0.81385 -1.206412  0.132003  2.565646 -0.827317
two     1.431256       NaN  1.340309      NaN -1.170299       NaN -0.226169       NaN

In [40]: df2.unstack(1)
Out[40]: 
exp            A                   B                                       A          
animal       cat                 dog                 cat                 dog          
second       one       two       one       two       one       two       one       two
first                                                                                 
bar     0.895717  1.431256  0.805244  1.340309 -1.206412 -1.170299  2.565646 -0.226169
baz     0.410835       NaN  0.813850       NaN  0.132003       NaN -0.827317       NaN
foo    -1.413681  0.875906  1.607920 -2.211372  1.024180  0.974466  0.569605 -2.006747
qux          NaN -1.226825       NaN  0.769804       NaN -1.281247       NaN -0.727707

3 使用 melt 重塑 DataFrame

顶层的 melt() 以及对应的 DataFrame.melt() 方法，也可以对 DataFrame 进行格式化。

可以把一列或多列作为标识变量，其他所有列作为观测变量，相当于 pivot 的逆操作，只留下两列非标识列：variable、value

可以使用 var_name 和 value_name 设置对应的列名

In [41]: cheese = pd.DataFrame(
   ....:     {
   ....:         "first": ["John", "Mary"],
   ....:         "last": ["Doe", "Bo"],
   ....:         "height": [5.5, 6.0],
   ....:         "weight": [130, 150],
   ....:     }
   ....: )
   ....: 

In [42]: cheese
Out[42]: 
  first last  height  weight
0  John  Doe     5.5     130
1  Mary   Bo     6.0     150

In [43]: cheese.melt(id_vars=["first", "last"])
Out[43]: 
  first last variable  value
0  John  Doe   height    5.5
1  Mary   Bo   height    6.0
2  John  Doe   weight  130.0
3  Mary   Bo   weight  150.0

In [44]: cheese.melt(id_vars=["first", "last"], var_name="quantity")
Out[44]: 
  first last quantity  value
0  John  Doe   height    5.5
1  Mary   Bo   height    6.0
2  John  Doe   weight  130.0
3  Mary   Bo   weight  150.0

当使用 melt() 转换 DataFrame 时，索引将被忽略。通过将 ignore_index 参数设置为 False(默认为 True)，可以保留原始的索引值。

In [45]: index = pd.MultiIndex.from_tuples([("person", "A"), ("person", "B")])

In [46]: cheese = pd.DataFrame(
   ....:     {
   ....:         "first": ["John", "Mary"],
   ....:         "last": ["Doe", "Bo"],
   ....:         "height": [5.5, 6.0],
   ....:         "weight": [130, 150],
   ....:     },
   ....:     index=index,
   ....: )
   ....: 

In [47]: cheese
Out[47]: 
         first last  height  weight
person A  John  Doe     5.5     130
       B  Mary   Bo     6.0     150

In [48]: cheese.melt(id_vars=["first", "last"])
Out[48]: 
  first last variable  value
0  John  Doe   height    5.5
1  Mary   Bo   height    6.0
2  John  Doe   weight  130.0
3  Mary   Bo   weight  150.0

In [49]: cheese.melt(id_vars=["first", "last"], ignore_index=False)
Out[49]: 
         first last variable  value
person A  John  Doe   height    5.5
       B  Mary   Bo   height    6.0
       A  John  Doe   weight  130.0
       B  Mary   Bo   weight  150.0

另一种转换方法是使用 wide_to_long() 函数。它没有 melt() 灵活，但是更容易操作

In [50]: dft = pd.DataFrame(
   ....:     {
   ....:         "A1970": {0: "a", 1: "b", 2: "c"},
   ....:         "A1980": {0: "d", 1: "e", 2: "f"},
   ....:         "B1970": {0: 2.5, 1: 1.2, 2: 0.7},
   ....:         "B1980": {0: 3.2, 1: 1.3, 2: 0.1},
   ....:         "X": dict(zip(range(3), np.random.randn(3))),
   ....:     }
   ....: )
   ....: 

In [51]: dft["id"] = dft.index

In [52]: dft
Out[52]: 
  A1970 A1980  B1970  B1980         X  id
0     a     d    2.5    3.2 -0.121306   0
1     b     e    1.2    1.3 -0.097883   1
2     c     f    0.7    0.1  0.695775   2

In [53]: pd.wide_to_long(dft, ["A", "B"], i="id", j="year")
Out[53]: 
                X  A    B
id year                  
0  1970 -0.121306  a  2.5
1  1970 -0.097883  b  1.2
2  1970  0.695775  c  0.7
0  1980 -0.121306  d  3.2
1  1980 -0.097883  e  1.3
2  1980  0.695775  f  0.1

Python 数据处理（三十三）—— 重塑数据