Revised translation of "Python for Data Analysis, 2nd Edition": 7.2.5 Discretization and Binning

7.2.5 Discretization and Binning

Continuous data is often discretized or otherwise separated into “bins” for analysis. Suppose you have data about a group of people in a study, and you want to group them into discrete age buckets:

In [75]: ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]

Let’s divide these into bins of 18 to 25, 26 to 35, 36 to 60, and finally 61 and older. To do so, use the pandas.cut function:

In [76]: b = [18, 25, 35, 60, 100] # gg's note: variable renamed from the original's bins to b to avoid ambiguity

In [77]: cats = pd.cut(ages, bins=b) # gg's note: slightly modified from the original to make the bins parameter explicit

In [78]: cats
Out[78]: 
[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]

The object pandas returns is a special Categorical object. The output describes the bins computed by pandas.cut. You can treat it like an array of strings naming the bin of each value; internally it holds a categories array with the distinct category names, plus a codes attribute that labels each element of ages with the position of its category:

In [79]: cats.codes
Out[79]: array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)

In [80]: cats.categories
Out[80]: 
IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]]
              closed='right',
              dtype='interval[int64]')

In [81]: pd.value_counts(cats)
Out[81]: 
(18, 25]     5
(35, 60]     3
(25, 35]     3
(60, 100]    1
dtype: int64

Note that pd.value_counts(cats) gives the bin counts for the result of pandas.cut.
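To make the link between codes and categories concrete, here is a minimal sketch. It assumes only the ages list and bin edges used above; the names rebuilt and count_list are made up for illustration. It also counts through a Series rather than the top-level pd.value_counts, which, as far as I can tell, recent pandas releases (2.1 and later) have deprecated.

```python
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
cats = pd.cut(ages, bins=[18, 25, 35, 60, 100])

# Each code is a position in cats.categories; take() maps codes back to intervals.
rebuilt = cats.categories.take(cats.codes)
first_interval = rebuilt[0]  # the bin that holds ages[0] == 20

# Counting through a Series, then sorting by index, lists the bins in bin order.
counts = pd.Series(cats).value_counts().sort_index()
count_list = counts.tolist()
```

Sorting by index is optional; without it the counts come back ordered by frequency, as in Out[81].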

Consistent with mathematical notation for intervals, a parenthesis means that the side is open, while a square bracket means it is closed (inclusive). You can change which side is closed by passing right=False:

In [82]: pd.cut(ages, [18, 26, 36, 61, 100], right=False)
Out[82]: 
[[18, 26), [18, 26), [18, 26), [26, 36), [18, 26), ..., [26, 36), [61, 100), [36, 61), [36, 61), [26, 36)]
Length: 12
Categories (4, interval[int64]): [[18, 26) < [26, 36) < [36, 61) < [61, 100)]
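The effect of the closed side is easiest to see on a value that sits exactly on an edge. The sketch below is only an illustration: the edges list and the variable names are assumptions, and it reads the bin position from the codes attribute described earlier. It also shows one consequence of the default right=True: the lowest edge itself falls outside the first bin unless include_lowest=True is passed.

```python
import pandas as pd

edges = [18, 25, 35, 60, 100]

# 25 sits on an edge: it belongs to (18, 25] by default, to [25, 35) with right=False.
right_closed_code = pd.cut([25], bins=edges).codes[0]
left_closed_code = pd.cut([25], bins=edges, right=False).codes[0]

# With right=True the first bin is (18, 25], so 18 itself gets no bin.
# A code of -1 marks a missing (NaN) category.
without_lowest_code = pd.cut([18], bins=edges).codes[0]
with_lowest_code = pd.cut([18], bins=edges, include_lowest=True).codes[0]
```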

You can also supply your own bin names by passing a list or array to the labels option:

In [83]: group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']

In [84]: pd.cut(ages, bins=b, labels=group_names) # gg's note: slightly modified from the original to make the bins parameter explicit
Out[84]: 
[Youth, Youth, Youth, YoungAdult, Youth, ..., YoungAdult, Senior, MiddleAged, MiddleAged, YoungAdult]
Length: 12
Categories (4, object): [Youth < YoungAdult < MiddleAged < Senior]
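Two details about labels are worth a quick sketch: there must be exactly one label per bin, and passing labels=False returns plain integer bin positions instead of names. The variable names named_list and positions below are illustrative only.

```python
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
b = [18, 25, 35, 60, 100]
group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']

# One label per bin: len(b) edges define len(b) - 1 bins.
n_bins = len(b) - 1

named = pd.cut(ages, bins=b, labels=group_names)
named_list = list(named)  # plain Python strings such as 'Youth'

# labels=False skips the names and returns integer bin positions instead.
positions = pd.cut(ages, bins=b, labels=False)
```

The integer positions match the codes shown in Out[79].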

If you pass an integer number of bins to pandas.cut instead of explicit bin edges, it computes equal-length bins based on the minimum and maximum values in the data. Consider some uniformly distributed data chopped into fourths:

In [85]: data = np.random.rand(20)

In [86]: pd.cut(data, bins=4, precision=2) # gg's note: slightly modified from the original to make the bins parameter explicit
Out[86]: 
[(0.34, 0.55], (0.34, 0.55], (0.76, 0.97], (0.76, 0.97], (0.34, 0.55], ..., (0.34, 0.55], (0.34, 0.55], (0.55, 0.76], (0.34, 0.55], (0.12, 0.34]]
Length: 20
Categories (4, interval[float64]): [(0.12, 0.34] < (0.34, 0.55] < (0.55, 0.76] < (0.76, 0.97]]

The precision=2 option limits the decimal precision of the bin labels to two digits.
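To inspect the edges pandas actually computes for an integer bin count, you can ask for them with retbins=True. The sketch below is hedged: the seed and variable names are assumptions made for repeatability, and the small outward nudge of the outermost edge (so that the minimum still lands inside the first bin) is behavior I have seen in the pandas versions I checked rather than something the text above states.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable
data = rng.random(20)           # uniform values in [0, 1)

# retbins=True also returns the edges that pandas computed.
cats, edges = pd.cut(data, bins=4, retbins=True)

widths = np.diff(edges)
nominal_width = (data.max() - data.min()) / 4  # (max - min) split four ways
```

Here edges has five entries for four bins, and the interior widths match nominal_width.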

A closely related function, pandas.qcut, bins the data based on sample quantiles. Depending on the distribution of the data, pandas.cut will not usually give each bin the same number of data points. Since pandas.qcut uses sample quantiles instead, by definition you obtain roughly equal-size bins:

In [87]: data = np.random.randn(1000)  # Normally distributed

In [88]: cats = pd.qcut(data, q=4)  # Cut into quartiles (gg's note: slightly modified from the original to make the q parameter explicit)

In [89]: cats
Out[89]: 
[(-0.0265, 0.62], (0.62, 3.928], (-0.68, -0.0265], (0.62, 3.928], (-0.0265, 0.62], ..., (-0.68, -0.0265], (-0.68, -0.0265], (-2.95, -0.68], (0.62, 3.928], (-0.68, -0.0265]]
Length: 1000
Categories (4, interval[float64]): [(-2.95, -0.68] < (-0.68, -0.0265] < (-0.0265, 0.62] < (0.62, 3.928]]

In [90]: pd.value_counts(cats)
Out[90]:
(0.62, 3.928]       250
(-0.0265, 0.62]     250
(-0.68, -0.0265]    250
(-2.95, -0.68]      250
dtype: int64
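The contrast between the two functions shows up clearly when both are applied to the same normally distributed sample. This is a rough sketch with a seeded generator and illustrative variable names; the actual counts for the equal-width bins depend on the sample.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable
data = rng.standard_normal(1000)

# Equal-width bins: most points pile up in the middle bins.
equal_width = pd.Series(pd.cut(data, bins=4)).value_counts().sort_index()

# Quantile-based bins: each quartile holds the same number of points.
equal_count = pd.Series(pd.qcut(data, q=4)).value_counts().sort_index()
```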

As with pandas.cut, you can pass your own quantiles (numbers between 0 and 1, inclusive):

In [91]: pd.qcut(data, q=[0, 0.1, 0.5, 0.9, 1.]) # gg's note: slightly modified from the original to make the q parameter explicit
Out[91]: 
[(-0.0265, 1.286], (-0.0265, 1.286], (-1.187, -0.0265], (-0.0265, 1.286], (-0.0265, 1.286], ..., (-1.187, -0.0265], (-1.187, -0.0265], (-2.95, -1.187], (-0.0265, 1.286], (-1.187, -0.0265]]
Length: 1000
Categories (4, interval[float64]): [(-2.95, -1.187] < (-1.187, -0.0265] < (-0.0265, 1.286] < (1.286, 3.928]]
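With custom quantiles, the share of points in each bin should follow the gaps between neighbouring quantiles. A small sketch under the same assumptions as before (seeded sample, illustrative names) checks this:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable
data = rng.standard_normal(1000)

quantiles = [0, 0.1, 0.5, 0.9, 1.0]
cats = pd.qcut(data, q=quantiles)

# The expected share of each bin is the gap between neighbouring quantiles.
expected_shares = np.diff(quantiles)  # 0.1, 0.4, 0.4, 0.1
counts = pd.Series(cats).value_counts().sort_index()
observed_shares = (counts / len(data)).to_numpy()
```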

We’ll return to pandas.cut and pandas.qcut in the later chapter on aggregation and group operations, as these discretization functions are especially useful for quantile and group analysis.
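As a small preview of that kind of group analysis, the sketch below bins an age column and aggregates a second column per bin. The DataFrame, its score column, and the random values are invented purely for illustration and are not from the book.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable
frame = pd.DataFrame({
    "age": [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32],
    "score": rng.random(12),  # made-up values for illustration
})

age_group = pd.cut(frame["age"], bins=[18, 25, 35, 60, 100])

# observed=False keeps every bin in the result, even one with no rows.
summary = frame.groupby(age_group, observed=False)["score"].agg(["count", "mean"])
```

Each row of summary corresponds to one age bin, in bin order.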