Python Data Processing (37): groupby (Grouping)

Introduction

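By "group by" we mean a process that involves one or more of the following steps:

Split: divide the data into groups according to some criteria
Apply: apply a function to each group independently
Combine: combine the results into a data structure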

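Of these steps, splitting is the most straightforward. In practice, we usually want to split a data set into groups and then do something with those groups

In the apply step, we may want to do one of the following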

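Aggregation: compute one or more summary statistics for each group, for example:
- compute group sums or means
- compute group sizes/counts
Transformation: perform group-specific computations and return an object indexed like the original, for example:
- standardize data (zscore) within each group
- fill NA values within each group
Filtration: discard some groups, for example:
- discard groups that have only a few members
- filter groups based on the group sum or mean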
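
The three kinds of apply step can be sketched on a toy table. This is a minimal illustration; the column names key and val and the data are made up, and centering within each group is used as a simpler stand-in for a full z-score.

```python
import pandas as pd

# Hypothetical toy data: group "a" has three rows, group "b" has one.
df = pd.DataFrame({"key": ["a", "a", "a", "b"], "val": [1.0, 2.0, 3.0, 10.0]})
grouped = df.groupby("key")["val"]

# Aggregation: one summary value per group.
sums = grouped.sum()

# Transformation: the result is indexed like the original data;
# here each value is centered on its own group's mean.
centered = grouped.transform(lambda x: x - x.mean())

# Filtration: keep only the rows of groups with at least two members.
big_groups = df.groupby("key").filter(lambda g: len(g) >= 2)

print(sums)
print(centered)
print(big_groups)
```

Aggregation shrinks the data to one row per group, transformation keeps the original shape, and filtration returns a subset of the original rows.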

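Anyone who has used SQL will recognize this pattern from queries such as

SELECT Column1, Column2, mean(Column3), sum(Column4)
FROM SomeTable
GROUP BY Column1, Column2

The groupby method of pandas objects aims to make operations like this concise and easy to express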
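
As a rough illustration, the SQL query above could be written in pandas as follows. The table contents are invented; only the column names come from the query.

```python
import pandas as pd

# Hypothetical table mirroring the column names used in the SQL query.
some_table = pd.DataFrame(
    {
        "Column1": ["x", "x", "y", "y"],
        "Column2": ["p", "p", "q", "r"],
        "Column3": [1.0, 3.0, 5.0, 7.0],
        "Column4": [10, 20, 30, 40],
    }
)

# GROUP BY Column1, Column2 with the mean of Column3 and the sum of Column4.
result = (
    some_table.groupby(["Column1", "Column2"])
    .agg({"Column3": "mean", "Column4": "sum"})
    .reset_index()  # turn the group keys back into ordinary columns, as SQL does
)
print(result)
```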

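1 Splitting an object into groups

pandas objects can be split along any of their axes. For example, the following code creates GroupBy objects (all examples assume import numpy as np and import pandas as pd)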

In [1]: df = pd.DataFrame(
   ...:     [
   ...:         ("bird", "Falconiformes", 389.0),
   ...:         ("bird", "Psittaciformes", 24.0),
   ...:         ("mammal", "Carnivora", 80.2),
   ...:         ("mammal", "Primates", np.nan),
   ...:         ("mammal", "Carnivora", 58),
   ...:     ],
   ...:     index=["falcon", "parrot", "lion", "monkey", "leopard"],
   ...:     columns=("class", "order", "max_speed"),
   ...: )
   ...: 

In [2]: df
Out[2]: 
          class           order  max_speed
falcon     bird   Falconiformes      389.0
parrot     bird  Psittaciformes       24.0
lion     mammal       Carnivora       80.2
monkey   mammal        Primates        NaN
leopard  mammal       Carnivora       58.0

# default is axis=0 (note: the axis argument of groupby is deprecated as of pandas 2.1)
In [3]: grouped = df.groupby("class")

In [4]: grouped = df.groupby("order", axis="columns")

In [5]: grouped = df.groupby(["class", "order"])

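The grouping can be specified in any of the following ways:

A function, which is called on each of the axis labels
A list or array with the same length as the selected axis
A dict or Series that provides a label -> group name mapping
For DataFrame objects, a string naming a column or an index level
Note that df.groupby('A') is just syntactic sugar for df.groupby(df['A'])
A list containing any combination of the above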
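
The options listed above can be tried side by side. The sketch below uses invented data and group names (g1, g2, y-row, other), and passes a NumPy array rather than a plain list so that the values cannot be mistaken for column names.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"A": ["foo", "bar", "foo"], "C": [1, 2, 3]}, index=["x", "y", "z"])

# A column name, and the equivalent explicit Series.
by_column = df.groupby("A")["C"].sum()
by_series = df.groupby(df["A"])["C"].sum()

# An array with one group name per row.
by_array = df.groupby(np.array(["g1", "g2", "g1"]))["C"].sum()

# A dict mapping index labels to group names.
by_dict = df.groupby({"x": "g1", "y": "g2", "z": "g1"})["C"].sum()

# A function called on each index label.
by_func = df.groupby(lambda label: "y-row" if label == "y" else "other")["C"].sum()

print(by_column, by_array, by_dict, by_func, sep="\n\n")
```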

Note: if the string passed in matches both a column name and an index level name, an exception (ValueError) is raised
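
The ambiguity described in this note can be reproduced with a frame whose index name matches a column name. This is a small sketch with invented data; one possible workaround, shown at the end, is to pass the column values explicitly.

```python
import pandas as pd

# Hypothetical frame where "A" is both a column and the name of the index.
df = pd.DataFrame({"A": ["foo", "bar"], "C": [1, 2]})
df.index = pd.Index(["u", "v"], name="A")

try:
    df.groupby("A").sum()
    raised = False
except ValueError as exc:
    raised = True
    print(exc)

# Passing the column's values directly avoids the name lookup entirely.
totals = df.groupby(df["A"].to_numpy())["C"].sum()
print(totals)
```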

In [6]: df = pd.DataFrame(
   ...:     {
   ...:         "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
   ...:         "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
   ...:         "C": np.random.randn(8),
   ...:         "D": np.random.randn(8),
   ...:     }
   ...: )
   ...: 

In [7]: df
Out[7]: 
     A      B         C         D
0  foo    one  0.469112 -0.861849
1  bar    one -0.282863 -2.104569
2  foo    two -1.509059 -0.494929
3  bar  three -1.135632  1.071804
4  foo    two  1.212112  0.721555
5  bar    two -0.173215 -0.706771
6  foo    one  0.119209 -1.039575
7  foo  three -1.044236  0.271860

For a DataFrame, calling groupby() returns a GroupBy object. We can group by column A, or by both A and B

In [8]: grouped = df.groupby("A")

In [9]: grouped = df.groupby(["A", "B"])

If we move A and B into a MultiIndex, we can group by the chosen index levels instead; the example below groups by every level except B

In [10]: df2 = df.set_index(["A", "B"])

In [11]: grouped = df2.groupby(level=df2.index.names.difference(["B"]))

In [12]: grouped.sum()
Out[12]: 
            C         D
A                      
bar -1.591710 -1.739537
foo -0.752861 -1.402938

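The examples above split the DataFrame along its index (rows). We can also split it by columns, for instance with a function that is called on each column label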

In [13]: def get_letter_type(letter):
   ....:     if letter.lower() in 'aeiou':
   ....:         return 'vowel'
   ....:     else:
   ....:         return 'consonant'
   ....: 

In [14]: grouped = df.groupby(get_letter_type, axis=1)
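
Because the axis argument of groupby is deprecated in recent pandas releases (2.1 and later), one alternative for grouping columns is to transpose first. The sketch below assumes an all-numeric frame with the same column names; the data is invented.

```python
import pandas as pd


def get_letter_type(letter):
    """Classify a single-letter label as a vowel or a consonant."""
    return "vowel" if letter.lower() in "aeiou" else "consonant"


# Hypothetical frame with the same column names as the example above.
df = pd.DataFrame({"A": [1, 2], "B": [3, 4], "C": [5, 6], "D": [7, 8]})

# Transposing turns the column labels into index labels, so an ordinary
# row-wise groupby can call the function on them.
grouped = df.T.groupby(get_letter_type)

# Sum within each letter type, then transpose back to the original orientation.
summed = grouped.sum().T
print(summed)
```

Transposing a frame with mixed column types produces object columns, so this route is most convenient when the columns share a dtype.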

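pandas Index objects support duplicate values. A non-unique index can therefore be used as the group key, and all rows that share an index value are placed in the same group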

In [15]: lst = [1, 2, 3, 1, 2, 3]

In [16]: s = pd.Series([1, 2, 3, 10, 20, 30], lst)

In [17]: grouped = s.groupby(level=0)

In [18]: grouped.first()
Out[18]: 
1    1
2    2
3    3
dtype: int64

In [19]: grouped.last()
Out[19]: 
1    10
2    20
3    30
dtype: int64

In [20]: grouped.sum()
Out[20]: 
1    11
2    22
3    33
dtype: int64

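Note: no splitting actually takes place until it is needed. Creating the GroupBy object only checks that the grouping you passed is valid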

1.1 Sorting

By default, groupby sorts the group keys. Passing sort=False skips the sort, which can speed up the operation

In [21]: df2 = pd.DataFrame({"X": ["B", "B", "A", "A"], "Y": [1, 2, 3, 4]})

In [22]: df2.groupby(["X"]).sum()
Out[22]: 
   Y
X   
A  7
B  3

In [23]: df2.groupby(["X"], sort=False).sum()
Out[23]: 
   Y
X   
B  3
A  7

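Note: groupby preserves the order of the rows within each group. In the example below, each group lists its rows in the order in which they appear in the original DataFrame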

In [24]: df3 = pd.DataFrame({"X": ["A", "B", "A", "B"], "Y": [1, 4, 3, 2]})

In [25]: df3.groupby(["X"]).get_group("A")
Out[25]: 
   X  Y
0  A  1
2  A  3

In [26]: df3.groupby(["X"]).get_group("B")
Out[26]: 
   X  Y
1  B  4
3  B  2

dropna

By default, NA values are excluded from the group keys during a groupby operation. Pass dropna=False to keep NA as a group key

In [27]: df_list = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]]

In [28]: df_dropna = pd.DataFrame(df_list, columns=["a", "b", "c"])

In [29]: df_dropna
Out[29]: 
   a    b  c
0  1  2.0  3
1  1  NaN  4
2  2  1.0  3
3  1  2.0  2

# NA keys are dropped by default (dropna=True)
In [30]: df_dropna.groupby(by=["b"], dropna=True).sum()
Out[30]: 
     a  c
b        
1.0  2  3
2.0  2  5

# dropna=False keeps NA as a group key
In [31]: df_dropna.groupby(by=["b"], dropna=False).sum()
Out[31]: 
     a  c
b        
1.0  2  3
2.0  2  5
NaN  1  4

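1.2 GroupBy object attributes

The groups attribute is a dict whose keys are the group names and whose values are the axis labels belonging to each group. For example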

In [32]: df.groupby("A").groups
Out[32]: {'bar': [1, 3, 5], 'foo': [0, 2, 4, 6, 7]}

In [33]: df.groupby(get_letter_type, axis=1).groups
Out[33]: {'consonant': ['B', 'C', 'D'], 'vowel': ['A']}

Calling len on a GroupBy object returns the length of the groups dict, that is, the number of groups

In [34]: grouped = df.groupby(["A", "B"])

In [35]: grouped.groups
Out[35]: {('bar', 'one'): [1], ('bar', 'three'): [3], ('bar', 'two'): [5], ('foo', 'one'): [0, 6], ('foo', 'three'): [7], ('foo', 'two'): [2, 4]}

In [36]: len(grouped)
Out[36]: 6
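
One way to use the groups dict is to feed its labels back into .loc to recover the rows of a group. A minimal sketch with invented data:

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"A": ["foo", "bar", "foo", "bar"], "C": [1, 2, 3, 4]})
grouped = df.groupby("A")

# groups maps each group name to the index labels of its rows.
labels = {name: list(idx) for name, idx in grouped.groups.items()}
print(labels)

# Those labels select the rows of one group from the original frame.
foo_rows = df.loc[grouped.groups["foo"]]
print(foo_rows)

# len() counts the groups, not the rows.
n_groups = len(grouped)
```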

1.3 MultiIndex

With a hierarchical index, we can group by any one of its levels

We first create a Series with a two-level MultiIndex

In [40]: arrays = [
   ....:     ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
   ....:     ["one", "two", "one", "two", "one", "two", "one", "two"],
   ....: ]
   ....: 

In [41]: index = pd.MultiIndex.from_arrays(arrays, names=["first", "second"])

In [42]: s = pd.Series(np.random.randn(8), index=index)

In [43]: s
Out[43]: 
first  second
bar    one      -0.919854
       two      -0.042379
baz    one       1.247642
       two      -0.009920
foo    one       0.290213
       two       0.495767
qux    one       0.362949
       two       1.548106
dtype: float64

We can then group s by one of its levels, for example level=0

In [44]: grouped = s.groupby(level=0)

In [45]: grouped.sum()
Out[45]: 
first
bar   -0.962232
baz    1.237723
foo    0.785980
qux    1.911055
dtype: float64

If the MultiIndex has level names, these can be used instead of the numeric positions

In [46]: s.groupby(level="second").sum()
Out[46]: 
second
one    0.980950
two    1.991575
dtype: float64

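Aggregation functions such as sum used to accept the level argument directly, returning a result indexed by the groups of that level. This argument was deprecated in pandas 1.3 and removed in pandas 2.0, where s.groupby(level="second").sum() gives the same result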

In [47]: s.sum(level="second")
Out[47]: 
second
one    0.980950
two    1.991575
dtype: float64
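
On pandas versions where sum no longer takes a level argument, the same result can be obtained by grouping on the level first. A small sketch with fixed values in place of the random ones:

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_arrays(
    [["bar", "bar", "baz", "baz"], ["one", "two", "one", "two"]],
    names=["first", "second"],
)
s = pd.Series(np.arange(4.0), index=index)  # values 0.0, 1.0, 2.0, 3.0

# Equivalent of the older s.sum(level="second").
by_second = s.groupby(level="second").sum()
print(by_second)
```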

Multiple levels can also be passed for grouping. Here s is a Series with a three-level MultiIndex

In [48]: s
Out[48]: 
first  second  third
bar    doo     one     -1.131345
               two     -0.089329
baz    bee     one      0.337863
               two     -0.945867
foo    bop     one     -0.932132
               two      1.956030
qux    bop     one      0.017587
               two     -0.016692
dtype: float64

In [49]: s.groupby(level=["first", "second"]).sum()
Out[49]: 
first  second
bar    doo      -1.220674
baz    bee      -0.608004
foo    bop       1.023898
qux    bop       0.000895
dtype: float64

Index level names can also be passed directly as keys

In [50]: s.groupby(["first", "second"]).sum()
Out[50]: 
first  second
bar    doo      -1.220674
baz    bee      -0.608004
foo    bop       1.023898
qux    bop       0.000895
dtype: float64
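
The construction of the three-level Series is not shown in the text. The sketch below is one way it might be built; the index labels follow the displayed output, while the values are made up because the originals were random.

```python
import pandas as pd

# Hypothetical reconstruction of a three-level index like the one displayed above.
index = pd.MultiIndex.from_arrays(
    [
        ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
        ["doo", "doo", "bee", "bee", "bop", "bop", "bop", "bop"],
        ["one", "two", "one", "two", "one", "two", "one", "two"],
    ],
    names=["first", "second", "third"],
)
s = pd.Series(range(8), index=index)  # made-up values 0..7

# Grouping by level names via level=... and via plain keys gives the same result.
by_level = s.groupby(level=["first", "second"]).sum()
by_key = s.groupby(["first", "second"]).sum()
print(by_level)
```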

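1.4 Grouping by index levels and columns

A DataFrame can be grouped by a combination of columns and index levels: column names are passed as strings and index levels as pd.Grouper objects

For example, given the following data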

In [51]: arrays = [
   ....:     ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
   ....:     ["one", "two", "one", "two", "one", "two", "one", "two"],
   ....: ]
   ....: 

In [52]: index = pd.MultiIndex.from_arrays(arrays, names=["first", "second"])

In [53]: df = pd.DataFrame({"A": [1, 1, 1, 1, 2, 2, 3, 3], "B": np.arange(8)}, index=index)

In [54]: df
Out[54]: 
              A  B
first second      
bar   one     1  0
      two     1  1
baz   one     1  2
      two     1  3
foo   one     2  4
      two     2  5
qux   one     3  6
      two     3  7

We can group by level=1 (the second index level) together with column A

In [55]: df.groupby([pd.Grouper(level=1), "A"]).sum()
Out[55]: 
          B
second A   
one    1  2
       2  4
       3  6
two    1  4
       2  5
       3  7

The index level can also be specified by name

In [56]: df.groupby([pd.Grouper(level="second"), "A"]).sum()
Out[56]: 
          B
second A   
one    1  2
       2  4
       3  6
two    1  4
       2  5
       3  7

More concisely, the index level name can be passed directly as a key

In [57]: df.groupby(["second", "A"]).sum()
Out[57]: 
          B
second A   
one    1  2
       2  4
       3  6
two    1  4
       2  5
       3  7
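
The three spellings shown above should be interchangeable. The following sketch checks this on a smaller frame with invented data.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_arrays(
    [["bar", "bar", "baz", "baz"], ["one", "two", "one", "two"]],
    names=["first", "second"],
)
df = pd.DataFrame({"A": [1, 1, 2, 2], "B": np.arange(4)}, index=index)

# Index level by position, index level by name, and plain string keys.
by_position = df.groupby([pd.Grouper(level=1), "A"]).sum()
by_name = df.groupby([pd.Grouper(level="second"), "A"]).sum()
by_key = df.groupby(["second", "A"]).sum()

print(by_key)
```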

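1.5 Selecting columns from a GroupBy object

After creating a GroupBy object, you may want to treat each column differently. Columns can be selected with [], just as when getting a column from a DataFrame. Here df is once again a DataFrame with columns A, B, C and D, as in section 1, rebuilt with new random values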

In [58]: grouped = df.groupby(["A"])

In [59]: grouped_C = grouped["C"]

In [60]: grouped_D = grouped["D"]

This syntactic sugar mainly replaces the much more verbose form below

In [61]: df["C"].groupby(df["A"])
Out[61]: <pandas.core.groupby.generic.SeriesGroupBy object at 0x7fd2f6794610>
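
A short sketch, on invented data, showing that the short and verbose forms agree and that different columns of one GroupBy object can receive different operations:

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame(
    {"A": ["foo", "bar", "foo"], "C": [1.0, 2.0, 3.0], "D": [10.0, 20.0, 30.0]}
)

# Column selection on the GroupBy object versus the verbose equivalent.
short = df.groupby("A")["C"].mean()
verbose = df["C"].groupby(df["A"]).mean()

# One GroupBy object, a different operation per column.
grouped = df.groupby("A")
c_mean = grouped["C"].mean()
d_max = grouped["D"].max()
print(c_mean, d_max, sep="\n\n")
```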

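2 Iterating through groups

Once the GroupBy object has been created, iterating over it is straightforward: each step yields the group name together with that group's rows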

In [62]: grouped = df.groupby('A')

In [63]: for name, group in grouped:
   ....:     print(name)
   ....:     print(group)
   ....: 
bar
     A      B         C         D
1  bar    one  0.254161  1.511763
3  bar  three  0.215897 -0.990582
5  bar    two -0.077118  1.211526
foo
     A      B         C         D
0  foo    one -0.575247  1.346061
2  foo    two -1.143704  1.627081
4  foo    two  1.193555 -0.441652
6  foo    one -0.408530  0.268520
7  foo  three -0.862495  0.024580

When grouping by multiple keys, the group name is a tuple

In [64]: for name, group in df.groupby(['A', 'B']):
   ....:     print(name)
   ....:     print(group)
   ....: 
('bar', 'one')
     A    B         C         D
1  bar  one  0.254161  1.511763
('bar', 'three')
     A      B         C         D
3  bar  three  0.215897 -0.990582
('bar', 'two')
     A    B         C         D
5  bar  two -0.077118  1.211526
('foo', 'one')
     A    B         C         D
0  foo  one -0.575247  1.346061
6  foo  one -0.408530  0.268520
('foo', 'three')
     A      B         C        D
7  foo  three -0.862495  0.02458
('foo', 'two')
     A    B         C         D
2  foo  two -1.143704  1.627081
4  foo  two  1.193555 -0.441652
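
Besides printing, the (name, group) pairs can be collected into ordinary Python structures. A minimal sketch with invented data:

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame(
    {"A": ["foo", "bar", "foo"], "B": ["one", "one", "two"], "C": [1, 2, 3]}
)

# Single string key: each name is a scalar.
single = {name: int(group["C"].sum()) for name, group in df.groupby("A")}

# Several keys: each name is a tuple, which unpacks in the loop header.
multi = {}
for (a, b), group in df.groupby(["A", "B"]):
    multi[(a, b)] = len(group)

print(single)
print(multi)
```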

3 Selecting a group

A single group can be selected with get_group()

In [65]: grouped.get_group("bar")
Out[65]: 
     A      B         C         D
1  bar    one  0.254161  1.511763
3  bar  three  0.215897 -0.990582
5  bar    two -0.077118  1.211526

For an object grouped on multiple columns, pass a tuple

In [66]: df.groupby(["A", "B"]).get_group(("bar", "one"))
Out[66]: 
     A    B         C         D
1  bar  one  0.254161  1.511763
