Revised translation of Python for Data Analysis, 2nd Edition, Chapter 12: Advanced pandas

The preceding chapters have focused on introducing different types of data wrangling workflows and features of NumPy, pandas, and other libraries. Over time, pandas has developed a depth of features for power users. This chapter digs into a few more advanced feature areas to help you deepen your expertise as a pandas user.

12.1 Categorical Data

This section introduces the pandas Categorical type. I will show how you can achieve better performance and memory use in some pandas operations by using it. I also introduce some tools for using categorical data in statistics and machine learning applications.

12.1.1 Background and Motivation

Frequently, a column in a table contains repeated instances of a smaller set of distinct values. We have already seen functions like pandas.unique and pandas.value_counts, which extract the distinct values from an array and compute their frequencies, respectively:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.rc('figure', figsize=(10, 6))
PREVIOUS_MAX_ROWS = pd.options.display.max_rows
pd.options.display.max_rows = 20
np.set_printoptions(precision=4, suppress=True)

In [10]: import numpy as np; import pandas as pd

In [11]: vals = pd.Series(['apple', 'orange', 'apple',
   ....:                     'apple'] * 2) # gg note: renamed from `values` in the original to avoid ambiguity

In [12]: vals
0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
dtype: object

In [13]: pd.unique(vals)
Out[13]: array(['apple', 'orange'], dtype=object)

In [14]: pd.value_counts(vals)
apple     6
orange    2
dtype: int64

Many data systems (for data warehousing, statistical computing, or other uses) have developed specialized approaches for representing data with repeated values for more efficient storage and computation. In data warehousing, a best practice is to use so-called dimension tables containing the distinct values and to store the primary observations as integer keys referencing the dimension table:

In [15]: vals = pd.Series([0, 1, 0, 0] * 2)

In [16]: dim = pd.Series(['apple', 'orange'])

In [17]: vals
0    0
1    1
2    0
3    0
4    0
5    1
6    0
7    0
dtype: int64

In [18]: dim
0     apple
1    orange
dtype: object

We can use the take method to restore the original Series of strings:

In [19]: dim.take(vals)
0     apple
1    orange
0     apple
0     apple
0     apple
1    orange
0     apple
0     apple
dtype: object

This representation as integers is called the categorical or dictionary-encoded representation. The array of distinct values can be called the categories, dictionary, or levels of the data. In this book we will use the terms categorical and categories. The integer values that reference the categories are called the category codes or simply codes.
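As a minimal sketch of how such an encoding can be derived from raw values, pandas.factorize returns the integer codes together with the distinct values. The variable names raw, codes, uniques, and restored are illustrative and do not come from the original text:

```python
import pandas as pd

raw = pd.Series(['apple', 'orange', 'apple', 'apple'] * 2)

# factorize returns integer codes plus the distinct values,
# listed in order of first appearance
codes, uniques = pd.factorize(raw)

# Indexing the distinct values with the codes restores the original data
restored = pd.Series(uniques.take(codes))
```

Here uniques plays the role of the dimension table and codes the role of the integer keys.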

The categorical representation can yield significant performance improvements when you are doing analytics. You can also perform transformations on the categories while leaving the codes unmodified. Some example transformations that can be made at relatively low cost are:
• Renaming categories
• Appending a new category without changing the order or position of the existing categories
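The two transformations listed above can be sketched with the rename_categories and add_categories methods of pandas.Categorical. This is an illustration under assumed sample data (the 'pear' category and the variable names are hypothetical), showing that the codes array is left untouched in both cases:

```python
import pandas as pd

cat = pd.Categorical(['apple', 'orange', 'apple', 'apple'])
codes_before = cat.codes.copy()

# Renaming touches only the categories; the codes stay as they are
renamed = cat.rename_categories({'apple': 'APPLE', 'orange': 'ORANGE'})

# A new category is appended at the end, so existing codes keep their meaning
extended = cat.add_categories(['pear'])
```

Because only the small categories array changes, the cost of either operation does not grow with the number of rows in the way a string replacement would.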

12.1.2 Categorical Type in pandas

pandas has a special Categorical type for holding data that uses the integer-based categorical representation or encoding. Let’s consider the example Series from before:

In [20]: fruits = ['apple', 'orange', 'apple', 'apple'] * 2

In [21]: N = len(fruits)

In [22]: df = pd.DataFrame({'fruit': fruits,
   ....:                    'basket_id': np.arange(N),
   ....:                    'count': np.random.randint(3, 15, size=N),
   ....:                    'weight': np.random.uniform(0, 4, size=N)},
   ....:                   columns=['basket_id', 'fruit', 'count', 'weight'])

In [23]: df
   basket_id   fruit  count    weight
0          0   apple      5  3.858058
1          1  orange      8  2.612708
2          2   apple      4  2.995627
3          3   apple      7  2.614279
4          4   apple     12  2.990859
5          5  orange      8  3.845227
6          6   apple      5  0.033553
7          7   apple      4  0.425778

Here, df['fruit'] is an array of Python string objects. We can convert it to categorical by calling:

In [24]: fruit_cat = df['fruit'].astype('category')

In [25]: fruit_cat
0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
Name: fruit, dtype: category
Categories (2, object): [apple, orange]

The values for fruit_cat are not a NumPy array, but an instance of pandas.Categorical:

In [26]: c = fruit_cat.values

In [27]: type(c)
Out[27]: pandas.core.categorical.Categorical

The Categorical object has categories and codes attributes:

In [28]: c.categories
Out[28]: Index(['apple', 'orange'], dtype='object')

In [29]: c.codes
Out[29]: array([0, 1, 0, 0, 0, 1, 0, 0], dtype=int8)

You can convert a DataFrame column to categorical by assigning the converted result:

In [30]: df['fruit'] = df['fruit'].astype('category')

In [31]: df.fruit
0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
Name: fruit, dtype: category
Categories (2, object): [apple, orange]

You can also create pandas.Categorical directly from other types of Python sequences:

In [32]: my_categories = pd.Categorical(['foo', 'bar', 'baz', 'foo', 'bar'])

In [33]: my_categories
[foo, bar, baz, foo, bar]
Categories (3, object): [bar, baz, foo]

If you have obtained categorical encoded data from another source, you can use the alternative from_codes constructor:

In [34]: ca = ['foo', 'bar', 'baz'] # gg note: renamed from `categories` in the original to avoid ambiguity

In [35]: co = [0, 1, 2, 0, 0, 1] # gg note: renamed from `codes` in the original to avoid ambiguity

In [36]: my_cats_2 = pd.Categorical.from_codes(codes=co, categories=ca)

In [37]: my_cats_2
[foo, bar, baz, foo, foo, bar]
Categories (3, object): [foo, bar, baz]
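A short sketch of the round trip may help here: the codes and categories attributes of an existing Categorical are enough to rebuild it with from_codes. The example also shows that a code of -1 stands for a missing value. The variable names are illustrative only:

```python
import pandas as pd

original = pd.Categorical(['foo', 'bar', 'baz', 'foo', 'bar'])

# Rebuild an equivalent categorical from its codes and categories
rebuilt = pd.Categorical.from_codes(original.codes,
                                    categories=original.categories)

# A code of -1 is the sentinel for a missing value
with_missing = pd.Categorical.from_codes([0, -1, 1],
                                         categories=['foo', 'bar'])
```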

Unless explicitly specified, categorical conversions assume no specific ordering of the categories. So the categories array may be in a different order depending on the ordering of the input data. When using from_codes or any of the other constructors, you can indicate that the categories have a meaningful ordering:

In [38]: ordered_cat = pd.Categorical.from_codes(codes=co, categories=ca,
   ....:                                         ordered=True)

In [39]: ordered_cat
[foo, bar, baz, foo, foo, bar]
Categories (3, object): [foo < bar < baz]

The output [foo < bar < baz] indicates that 'foo' precedes 'bar' in the ordering, and so on. An unordered categorical instance can be made ordered with as_ordered:

In [40]: my_cats_2.as_ordered()
[foo, bar, baz, foo, foo, bar]
Categories (3, object): [foo < bar < baz]
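To illustrate what the ordering enables, the following sketch uses a hypothetical size scale (not from the original text). With ordered=True, comparisons and min/max follow the declared category order rather than alphabetical order:

```python
import pandas as pd

sizes = pd.Categorical(['small', 'large', 'medium', 'small'],
                       categories=['small', 'medium', 'large'],
                       ordered=True)

# Comparisons follow the category order, not alphabetical order
below_large = sizes < 'large'

# min and max are defined only for ordered categoricals
smallest, largest = sizes.min(), sizes.max()

# Sorting a Series of this dtype also uses the category order
ordered_values = pd.Series(sizes).sort_values().tolist()
```

As plain strings, 'large' would sort before 'small'; the ordered categorical avoids that mismatch.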

As a last note, categorical data need not be strings, even though I have only shown string examples. A categorical array can consist of any immutable value types.
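For instance, a categorical can be built from integers. This is a small sketch with made-up year values, assuming the default behavior in which categories are inferred in sorted order:

```python
import pandas as pd

years = pd.Categorical([2019, 2021, 2019, 2020])

# The categories are the sorted distinct integers; codes index into them
year_categories = list(years.categories)
year_codes = years.codes.tolist()
```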

12.1.3 Computations with Categoricals

A Categorical in pandas generally behaves the same way as the non-encoded version (such as an array of strings). Some parts of pandas, like the groupby function, perform better when working with categoricals. There are also some functions that can utilize the ordered flag.

Let’s consider some random numeric data, and use the pandas.qcut binning function. This returns a pandas.Categorical; we used pandas.cut earlier in the book but glossed over the details of how categoricals work:

In [41]: np.random.seed(12345)

In [42]: draws = np.random.randn(1000)

In [43]: draws[:5]
Out[43]: array([-0.2047,  0.4789, -0.5194, -0.5557,  1.9658])

Let’s compute a quartile binning of this data and extract some statistics:

In [44]: bs = pd.qcut(draws, 4) # gg note: renamed from `bins` in the original to avoid ambiguity

In [45]: bs
[(-0.684, -0.0101], (-0.0101, 0.63], (-0.684, -0.0101], (-0.684, -0.0101], (0.63, 3.928], ..., (-0.0101, 0.63], (-0.684, -0.0101], (-2.95, -0.684], (-0.0101, 0.63], (0.63, 3.928]]
Length: 1000
Categories (4, interval[float64]): [(-2.95, -0.684] < (-0.684, -0.0101] < (-0.0101, 0.63] < (0.63, 3.928]]

While useful, the exact sample quartiles may be less useful for producing a report than quartile names. We can achieve this with the labels argument to qcut:

In [46]: bs = pd.qcut(draws, 4, labels=['Q1', 'Q2', 'Q3', 'Q4']) # gg note: renamed from `bins` in the original to avoid ambiguity

In [47]: bs
[Q2, Q3, Q2, Q2, Q4, ..., Q3, Q2, Q1, Q3, Q4]
Length: 1000
Categories (4, object): [Q1 < Q2 < Q3 < Q4]

In [48]: bs.codes[:10]
Out[48]: array([1, 2, 1, 1, 3, 3, 2, 2, 3, 3], dtype=int8)
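As a quick check on what quantile binning does, the sketch below counts the observations per labeled bin. It uses its own random sample with an arbitrarily chosen seed, so the data differ from the draws above, and the variable names are illustrative:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen arbitrarily for this sketch
sample = rng.standard_normal(1000)

quartiles = pd.qcut(sample, 4, labels=['Q1', 'Q2', 'Q3', 'Q4'])

# Quantile-based bins hold equal numbers of observations
# when the values are distinct
counts = quartiles.value_counts()
```

The result of qcut is an ordered categorical, which is why the categories print as Q1 < Q2 < Q3 < Q4.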

The labeled bs categorical does not contain information about the bin edges in the data, so we can use groupby to extract some summary statistics:

In [49]: bs_s = pd.Series(bs, name='quartile') # gg note: renamed from `bins` in the original to avoid ambiguity

In [50]: results = (pd.Series(draws)
   ....:            .groupby(bs_s)
   ....:            .agg(['count', 'min', 'max'])
   ....:            .reset_index())

In [51]: results
  quartile  count       min       max
0       Q1    250 -2.949343 -0.685484
1       Q2    250 -0.683066 -0.010115
2       Q3    250 -0.010032  0.628894
3       Q4    250  0.634238  3.927528

The 'quartile' column in the result retains the original categorical information, including ordering, from bs:

In [52]: results['quartile']
0    Q1
1    Q2
2    Q3
3    Q4
Name: quartile, dtype: category
Categories (4, object): [Q1 < Q2 < Q3 < Q4]

Better performance with categoricals

If you do a lot of analytics on a particular dataset, converting to categorical can yield substantial overall performance gains. A categorical version of a DataFrame column will often use significantly less memory, too. Let’s consider some Series with 10 million elements and a small number of distinct categories:

In [53]: N = 10000000

In [54]: draws = pd.Series(np.random.randn(N))

In [55]: lbs = pd.Series(['foo', 'bar', 'baz', 'qux'] * (N // 4))  # gg note: renamed from `labels` in the original to avoid ambiguity

Now we convert lbs to categorical:

In [56]: cat_s = lbs.astype('category') # gg note: renamed from `categories` in the original to avoid ambiguity

Now we note that lbs uses significantly more memory than cat_s:

In [57]: lbs.memory_usage()
Out[57]: 80000080

In [58]: cat_s.memory_usage()
Out[58]: 10000272
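One caveat worth sketching: for a Series of Python string objects, memory_usage() by default counts only the 8-byte references, not the strings themselves. Passing deep=True asks pandas to include the string contents as well. The example uses a smaller, illustrative Series so that it runs quickly:

```python
import pandas as pd

n = 100_000
labels = pd.Series(['foo', 'bar', 'baz', 'qux'] * (n // 4))
cats = labels.astype('category')

# deep=True also counts the memory held by the string values themselves
labels_bytes = labels.memory_usage(deep=True)
cats_bytes = cats.memory_usage(deep=True)

# With only four categories, the codes fit in one byte per element
codes_dtype = cats.cat.codes.dtype
```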

The conversion to category is not free, of course, but it is a one-time cost:

In [59]: %time _ = lbs.astype('category')
CPU times: user 490 ms, sys: 240 ms, total: 730 ms
Wall time: 726 ms

GroupBy operations can be significantly faster with categoricals because the underlying algorithms use the integer-based codes array instead of an array of strings.
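A rough way to observe this on your own machine is sketched below. The data size, seed, and variable names are assumptions for illustration, and the measured times will vary by hardware and pandas version, so no particular speedup is claimed here:

```python
import time

import numpy as np
import pandas as pd

n = 1_000_000
values = pd.Series(np.random.default_rng(0).standard_normal(n))
labels = pd.Series(['foo', 'bar', 'baz', 'qux'] * (n // 4))
cats = labels.astype('category')

# Group by the plain string labels
start = time.perf_counter()
by_label = values.groupby(labels).mean()
label_seconds = time.perf_counter() - start

# Group by the categorical version of the same labels
start = time.perf_counter()
by_cat = values.groupby(cats, observed=True).mean()
cat_seconds = time.perf_counter() - start
```

Both groupings produce the same means; comparing label_seconds with cat_seconds gives a rough sense of the difference.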

12.1.4 Categorical Methods

Series containing categorical data have several special methods similar to the Series.str specialized string methods. This also provides convenient access to the categories and codes. Consider the Series:

In [60]: s = pd.Series(['a', 'b', 'c', 'd'] * 2)

In [61]: cat_s = s.astype('category')

In [62]: cat_s
0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: category
Categories (4, object): [a, b, c, d]

The special attribute cat provides access to categorical methods:

In [63]: cat_s.cat.codes
0    0
1    1
2    2
3    3
4    0
5    1
6    2
7    3
dtype: int8

In [64]: cat_s.cat.categories
Out[64]: Index(['a', 'b', 'c', 'd'], dtype='object')

Suppose that we know the actual set of categories for this data extends beyond the four values observed in the data. We can use the set_categories method to change them:

In [65]: actual_categories = ['a', 'b', 'c', 'd', 'e']

In [66]: cat_s2 = cat_s.cat.set_categories(actual_categories)

In [67]: cat_s2
0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: category
Categories (5, object): [a, b, c, d, e]

While it appears that the data is unchanged, the new categories will be reflected in operations that use them. For example, value_counts respects the categories, if present:

In [68]: cat_s.value_counts()
d    2
c    2
b    2
a    2
dtype: int64

In [69]: cat_s2.value_counts()
d    2
c    2
b    2
a    2
e    0
dtype: int64

In large datasets, categoricals are often used as a convenient tool for memory savings and better performance. After you filter a large DataFrame or Series, many of the categories may not appear in the data. To help with this, we can use the remove_unused_categories method to trim unobserved categories:

In [70]: cat_s3 = cat_s[cat_s.isin(['a', 'b'])]

In [71]: cat_s3
0    a
1    b
4    a
5    b
dtype: category
Categories (4, object): [a, b, c, d]

In [72]: cat_s3.cat.remove_unused_categories()
0    a
1    b
4    a
5    b
dtype: category
Categories (2, object): [a, b]

See Table 12-1 for a listing of available categorical methods.
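A few other methods available through the cat accessor are sketched here on the same kind of Series. The selection is illustrative and is not meant as a reproduction of the table:

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c', 'd'] * 2, dtype='category')

# Reorder the categories without changing the values
reordered = s.cat.reorder_categories(['d', 'c', 'b', 'a'], ordered=True)

# Removing a category turns its values into missing data
without_d = s.cat.remove_categories(['d'])

# Mark an existing categorical as ordered
as_ordered = s.cat.as_ordered()
```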


Table 12-1. Categorical methods for Series in pandas

Creating dummy variables for modeling

When you’re using statistics or machine learning tools, you’ll often transform categorical data into dummy variables, also known as one-hot encoding. This involves creating a DataFrame with a column for each distinct category; these columns contain 1s for occurrences of a given category and 0 otherwise.

Consider the previous example:

In [73]: cat_s = pd.Series(['a', 'b', 'c', 'd'] * 2, dtype='category')

As mentioned previously in Chapter 7, the pandas.get_dummies function converts this one-dimensional categorical data into a DataFrame containing the dummy variables:

In [74]: pd.get_dummies(cat_s)
   a  b  c  d
0  1  0  0  0
1  0  1  0  0
2  0  0  1  0
3  0  0  0  1
4  1  0  0  0
5  0  1  0  0
6  0  0  1  0
7  0  0  0  1
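Two details can matter when feeding dummy variables to a model, sketched below with a hypothetical three-category dtype. Categories that never occur in the data still get an all-zero column, and the dtype argument requests integer 0/1 columns explicitly, since newer pandas versions may return booleans by default:

```python
import pandas as pd

s = pd.Series(['a', 'b', 'a'],
              dtype=pd.CategoricalDtype(['a', 'b', 'c']))

# dtype=int requests 0/1 columns; prefix is prepended to each column name
dummies = pd.get_dummies(s, prefix='cat', dtype=int)
```

Fixing the categories up front in this way keeps the column layout identical across different subsets of the data.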
