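The reading experience on Jianshu is poor (it parses markdown differently from Youdao Cloud Notes), so I recommend following the link to the notebook instead:
jupyter notebook: pandas study notes (3): hierarchical indexing
This series collects the notes I took while studying the Python Data Science Handbook.
They are mainly for my own reference,
and I share them in passing, so some parts may be imprecise or unclear.
Multi-level indexing on a Series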
import numpy as np
import pandas as pd
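What a multi-level index is for: representing higher-dimensional data with a lower-dimensional Series or DataFrame
First, suppose we don't know that pandas provides multi-level indexes, and build a Series whose index is a collection of (state, year) tuples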
index= {('California', 2000),('California',2010),
('New York',2000),('New York',2010),
('Texas',2000),('Texas',2010)}
populations = [33871648,37253956,
18976457,19378102,
20851820,25145561]
pop = pd.Series(populations, index = index)
pop
Texas 2000 33871648
New York 2000 37253956
2010 18976457
California 2010 19378102
Texas 2010 20851820
California 2000 25145561
dtype: int64
Let's look at the index we defined
index
{('California', 2000),
('California', 2010),
('New York', 2000),
('New York', 2010),
('Texas', 2000),
('Texas', 2010)}
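This is a multi-level index built from plain tuples, which is awkward to work with in many ways.
Also, the two California entries in pop above are not next to each other, which is hard to look at. The index was written with braces, so it is a Python set, and a set has no defined order. For the same reason the values no longer line up with their intended labels: 33871648 was meant for ('California', 2000) but ended up under ('Texas', 2000).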
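A minimal sketch of one way to avoid the scrambled order, assuming the tuples are kept in a list so their order is preserved. The names `tuples` and `pop_ordered` are introduced here for illustration only and do not appear in the original notebook.

```python
import pandas as pd

# A list (square brackets) preserves order; a set (braces) does not.
tuples = [('California', 2000), ('California', 2010),
          ('New York', 2000), ('New York', 2010),
          ('Texas', 2000), ('Texas', 2010)]
populations = [33871648, 37253956,
               18976457, 19378102,
               20851820, 25145561]

# pop_ordered is a hypothetical name used only in this sketch.
pop_ordered = pd.Series(populations, index=pd.MultiIndex.from_tuples(tuples))

# Each value is now paired with the tuple at the same position in the list.
print(pop_ordered['California', 2000])
print(pop_ordered['California'])
```

With the index built this way, the rest of the notes would show different numbers under each label than the outputs printed here, since those outputs come from the set-based index.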
- pandas multi-level index
Now we use a Cartesian product to generate the multi-level index
index = pd.MultiIndex.from_product([['California','New York','Texas'],[2000,2010]])
index
MultiIndex(levels=[['California', 'New York', 'Texas'], [2000, 2010]],
labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]])
Next, we reindex pop with this index, and the hierarchical index becomes visible
pop = pop.reindex(index)
pop
California 2000 25145561
2010 19378102
New York 2000 37253956
2010 18976457
Texas 2000 33871648
2010 20851820
dtype: int64
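Much nicer. The leftmost index is level 0, the years such as 2000 are level 1, and so on.
The object is still a Series.
Now we can use the second level directly to get all the data for 2010
pop[:,2010] # [a,b]: a selects the state names such as California, b selects the years such as 2000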
California 19378102
New York 18976457
Texas 20851820
dtype: int64
Ways to create a multi-level index
1. Creating a multi-level index explicitly
pd.MultiIndex.from_arrays([['a','a','b','b'],[1,2,1,2]]) # from simple arrays
MultiIndex(levels=[['a', 'b'], [1, 2]],
labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
pd.MultiIndex.from_tuples([('a',1),('a',2),('b',1),('b',2)]) # from tuples
MultiIndex(levels=[['a', 'b'], [1, 2]],
labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
pd.MultiIndex.from_product([['California','New York','Texas'],[2000,2010]]) # from a Cartesian product, as seen above
MultiIndex(levels=[['California', 'New York', 'Texas'], [2000, 2010]],
labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]])
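In more detail, you can build one directly by passing levels and labels. (pandas 0.24 renamed `labels` to `codes`; on newer versions pass `codes=` instead, and the printed repr shows `codes=` as well.)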
pd.MultiIndex(levels = [['a','b'],[1,2]],
labels = [[0,0,1,1],[0,1,0,1]])
MultiIndex(levels=[['a', 'b'], [1, 2]],
labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
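levels holds two lists: the labels of the level-0 index and the labels of the level-1 index
labels also holds two lists of equal length (the number of elements in the data set). For each element they give the position, within the level-0 and the level-1 list respectively, of the label to use. It helps to think about this together with the Cartesian product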
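The relationship between levels and labels can be checked by hand. The sketch below assumes pandas 0.24 or later, where the parameter is named `codes` rather than `labels`; the variable names `mi` and `rebuilt` are illustrative.

```python
import pandas as pd

levels = [['a', 'b'], [1, 2]]
codes = [[0, 0, 1, 1], [0, 1, 0, 1]]   # called `labels` before pandas 0.24

mi = pd.MultiIndex(levels=levels, codes=codes)

# Rebuild every entry by hand:
# entry i takes levels[k][codes[k][i]] for each level k.
rebuilt = [tuple(level[c] for level, c in zip(levels, row))
           for row in zip(*codes)]

print(rebuilt)
print(mi.tolist() == rebuilt)
```

Each code is simply a position in the corresponding level list, which is why the two code lists must have the same length as the data.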
Giving the index levels names makes them easier to manage
pop.index.names = ['states','years']
pop
states years
California 2000 25145561
2010 19378102
New York 2000 37253956
2010 18976457
Texas 2000 33871648
2010 20851820
dtype: int64
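Once the levels are named, they can be referred to by name instead of by position. A small sketch with made-up values (not the census figures above); the variable names are illustrative.

```python
import pandas as pd

index = pd.MultiIndex.from_product([['California', 'Texas'], [2000, 2010]],
                                   names=['states', 'years'])
# Illustrative values only, not real population figures.
s = pd.Series([10, 20, 30, 40], index=index)

# Cross-section: all entries whose 'years' level equals 2010.
in_2010 = s.xs(2010, level='years')

# Aggregate over one level, referring to it by name.
per_state = s.groupby(level='states').sum()

# Pull out the labels of a single level.
years = s.index.get_level_values('years')

print(in_2010)
print(per_state)
```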
2. Multi-level column index
In a DataFrame, the columns can carry a multi-level index just as the rows can
Below we mock up a DataFrame of medical data
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
names=['subject', 'type'])
data = np.round(np.random.randn(4,6), 1)
data[:,::2] *= 10
data += 37
data
array([[30. , 38.4, 43. , 35.8, 34. , 36. ],
[35. , 36. , 31. , 36.6, 24. , 36.7],
[31. , 36.6, 47. , 37.2, 37. , 39. ],
[39. , 35.7, 40. , 36.4, 37. , 36.9]])
health_data = pd.DataFrame(data, index = index, columns = columns)
health_data
<div>
<table border="1" class="dataframe">
<thead>
<tr>
<th></th>
<th>subject</th>
<th colspan="2" halign="left">Bob</th>
<th colspan="2" halign="left">Guido</th>
<th colspan="2" halign="left">Sue</th>
</tr>
<tr>
<th></th>
<th>type</th>
<th>HR</th>
<th>Temp</th>
<th>HR</th>
<th>Temp</th>
<th>HR</th>
<th>Temp</th>
</tr>
<tr>
<th>year</th>
<th>visit</th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th rowspan="2" valign="top">2013</th>
<th>1</th>
<td>30.0</td>
<td>38.4</td>
<td>43.0</td>
<td>35.8</td>
<td>34.0</td>
<td>36.0</td>
</tr>
<tr>
<th>2</th>
<td>35.0</td>
<td>36.0</td>
<td>31.0</td>
<td>36.6</td>
<td>24.0</td>
<td>36.7</td>
</tr>
<tr>
<th rowspan="2" valign="top">2014</th>
<th>1</th>
<td>31.0</td>
<td>36.6</td>
<td>47.0</td>
<td>37.2</td>
<td>37.0</td>
<td>39.0</td>
</tr>
<tr>
<th>2</th>
<td>39.0</td>
<td>35.7</td>
<td>40.0</td>
<td>36.4</td>
<td>37.0</td>
<td>36.9</td>
</tr>
</tbody>
</table>
</div>
- Indexing a DataFrame with a single key can only select from the level-0 column index
health_data['Guido'] # health_data['HR'] raises a KeyError
<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>type</th>
<th>HR</th>
<th>Temp</th>
</tr>
<tr>
<th>year</th>
<th>visit</th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th rowspan="2" valign="top">2013</th>
<th>1</th>
<td>43.0</td>
<td>35.8</td>
</tr>
<tr>
<th>2</th>
<td>31.0</td>
<td>36.6</td>
</tr>
<tr>
<th rowspan="2" valign="top">2014</th>
<th>1</th>
<td>47.0</td>
<td>37.2</td>
</tr>
<tr>
<th>2</th>
<td>40.0</td>
<td>36.4</td>
</tr>
</tbody>
</table>
</div>
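To reach a label that lives in the inner column level, such as 'HR', one option is a cross-section along the columns. This is a sketch on a stand-in frame called `demo` (a hypothetical name) filled with the integers 0 to 23 so that the result is reproducible.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
# Placeholder values instead of random numbers.
demo = pd.DataFrame(np.arange(24).reshape(4, 6), index=index, columns=columns)

# 'HR' is in level 1 ('type'), so select it with xs along axis=1.
hr_only = demo.xs('HR', axis=1, level='type')
print(hr_only)
```

The selected level is dropped from the result, leaving one column per subject.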
Selecting values from a multi-level index
1. Series with a multi-level index
Using the pop data set as an example
pop
states years
California 2000 25145561
2010 19378102
New York 2000 37253956
2010 18976457
Texas 2000 33871648
2010 20851820
dtype: int64
pop['California',2000] # mind the order of the index levels
25145561
pop['California'] # with a single key and no comma, only level 0 is searched; pop[2010] raises an error
years
2000 25145561
2010 19378102
dtype: int64
pop.loc['California':'New York'] # slicing also works, but the level-0 index must be sorted (A-Z)
# use pop = pop.sort_index() to sort the index
states years
California 2000 25145561
2010 19378102
New York 2000 37253956
2010 18976457
dtype: int64
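The sorting requirement can be seen on a small, deliberately unsorted index. This is a sketch with made-up labels; in the pandas versions I am aware of, the failed slice raises `UnsortedIndexError`, which is a subclass of `KeyError`.

```python
import pandas as pd

# Level-0 labels are deliberately out of order.
idx = pd.MultiIndex.from_tuples([('b', 1), ('a', 1), ('b', 2), ('a', 2)])
s = pd.Series([1, 2, 3, 4], index=idx)

try:
    s.loc['a':'b']
    slice_failed = False
except KeyError as err:
    # pandas.errors.UnsortedIndexError derives from KeyError.
    slice_failed = True
    print('slicing failed:', err)

# After sorting, the same slice works.
s_sorted = s.sort_index()
print(s_sorted.loc['a':'b'])
```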
- If the index is sorted, a lower level can be selected by passing an empty slice for level 0
pop[:,2010]
states
California 19378102
New York 18976457
Texas 20851820
dtype: int64
Boolean masks and fancy indexing also work; I won't go into them here
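For completeness, a brief sketch of both techniques on a Series with a multi-level index. The values are made up and the variable names are illustrative.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]],
    names=['states', 'years'])
# Illustrative values only.
s = pd.Series([10, 20, 30, 40, 50, 60], index=index)

# Boolean mask: keep entries above a threshold; the MultiIndex is preserved.
large = s[s > 35]

# Fancy indexing: pass a list of level-0 labels to .loc.
two_states = s.loc[['California', 'Texas']]

print(large)
print(two_states)
```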
2. DataFrame with a multi-level index
Using the health_data data set as an example
health_data
<div>
<table border="1" class="dataframe">
<thead>
<tr>
<th></th>
<th>subject</th>
<th colspan="2" halign="left">Bob</th>
<th colspan="2" halign="left">Guido</th>
<th colspan="2" halign="left">Sue</th>
</tr>
<tr>
<th></th>
<th>type</th>
<th>HR</th>
<th>Temp</th>
<th>HR</th>
<th>Temp</th>
<th>HR</th>
<th>Temp</th>
</tr>
<tr>
<th>year</th>
<th>visit</th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th rowspan="2" valign="top">2013</th>
<th>1</th>
<td>30.0</td>
<td>38.4</td>
<td>43.0</td>
<td>35.8</td>
<td>34.0</td>
<td>36.0</td>
</tr>
<tr>
<th>2</th>
<td>35.0</td>
<td>36.0</td>
<td>31.0</td>
<td>36.6</td>
<td>24.0</td>
<td>36.7</td>
</tr>
<tr>
<th rowspan="2" valign="top">2014</th>
<th>1</th>
<td>31.0</td>
<td>36.6</td>
<td>47.0</td>
<td>37.2</td>
<td>37.0</td>
<td>39.0</td>
</tr>
<tr>
<th>2</th>
<td>39.0</td>
<td>35.7</td>
<td>40.0</td>
<td>36.4</td>
<td>37.0</td>
<td>36.9</td>
</tr>
</tbody>
</table>
</div>
- Basic indexing on a DataFrame is column indexing; without loc or iloc you can only select columns
health_data['Guido','HR']
year visit
2013 1 43.0
2 31.0
2014 1 47.0
2 40.0
Name: (Guido, HR), dtype: float64
- With the DataFrame indexers, you can select by both rows and columns
health_data.iloc[0:2, 0:2]
<div>
<table border="1" class="dataframe">
<thead>
<tr>
<th></th>
<th>subject</th>
<th colspan="2" halign="left">Bob</th>
</tr>
<tr>
<th></th>
<th>type</th>
<th>HR</th>
<th>Temp</th>
</tr>
<tr>
<th>year</th>
<th>visit</th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th rowspan="2" valign="top">2013</th>
<th>1</th>
<td>30.0</td>
<td>38.4</td>
</tr>
<tr>
<th>2</th>
<td>35.0</td>
<td>36.0</td>
</tr>
</tbody>
</table>
</div>
health_data.loc[:,(('Bob','Guido'), 'HR')] # this example is worth studying closely
<div>
<table border="1" class="dataframe">
<thead>
<tr>
<th></th>
<th>subject</th>
<th>Bob</th>
<th>Guido</th>
</tr>
<tr>
<th></th>
<th>type</th>
<th>HR</th>
<th>HR</th>
</tr>
<tr>
<th>year</th>
<th>visit</th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th rowspan="2" valign="top">2013</th>
<th>1</th>
<td>30.0</td>
<td>43.0</td>
</tr>
<tr>
<th>2</th>
<td>35.0</td>
<td>31.0</td>
</tr>
<tr>
<th rowspan="2" valign="top">2014</th>
<th>1</th>
<td>31.0</td>
<td>47.0</td>
</tr>
<tr>
<th>2</th>
<td>39.0</td>
<td>40.0</td>
</tr>
</tbody>
</table>
</div>
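The loc indexer can select not only by rows and columns but also across several index levels on each axis; the example above illustrates this well
- In health_data.loc[ , ], the left side of the comma selects rows and the right side selects columns
- health_data.loc[:, ((level-0 column labels), (level-1 column labels))]: to select across several levels, the keys must be written as a nested tuple
This tuple-based indexing is not very convenient: writing a slice inside a tuple is a syntax error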
health_data.loc[:,(:, 'HR')]
File "<ipython-input-25-ff9aeaa8e80b>", line 1
health_data.loc[:,(:, 'HR')]
^
SyntaxError: invalid syntax
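pandas provides `pd.IndexSlice` for this situation, and the built-in `slice(None)` also works inside a tuple. The sketch below uses a stand-in frame called `demo` (a hypothetical name) filled with the integers 0 to 23.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
demo = pd.DataFrame(np.arange(24).reshape(4, 6), index=index, columns=columns)

idx = pd.IndexSlice

# All rows, every subject's 'HR' column.
all_hr = demo.loc[:, idx[:, 'HR']]

# First visit of each year, 'HR' columns only.
first_visit_hr = demo.loc[idx[:, 1], idx[:, 'HR']]

# The same column selection written with slice(None) instead of IndexSlice.
all_hr_alt = demo.loc[:, (slice(None), 'HR')]

print(first_visit_hr)
```

Both forms rely on the index being sorted, which `from_product` guarantees here because the input lists are already in order.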
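3. Setting and resetting the index
Setting and resetting the index converts data between a flat table and a hierarchically indexed one
- Resetting the index
This operation is performed on a Series object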
help(pop.reset_index)
Help on method reset_index in module pandas.core.series:
reset_index(level=None, drop=False, name=None, inplace=False) method of pandas.core.series.Series instance
Generate a new DataFrame or Series with the index reset.
This is useful when the index needs to be treated as a column, or
when the index is meaningless and needs to be reset to the default
before another operation.
Parameters
----------
level : int, str, tuple, or list, default optional
For a Series with a MultiIndex, only remove the specified levels
from the index. Removes all levels by default.
drop : bool, default False
Just reset the index, without inserting it as a column in
the new DataFrame.
name : object, optional
The name to use for the column containing the original Series
values. Uses ``self.name`` by default. This argument is ignored
when `drop` is True.
inplace : bool, default False
Modify the Series in place (do not create a new object).
Returns
-------
Series or DataFrame
When `drop` is False (the default), a DataFrame is returned.
The newly created columns will come first in the DataFrame,
followed by the original Series values.
When `drop` is True, a `Series` is returned.
In either case, if ``inplace=True``, no value is returned.
See Also
--------
DataFrame.reset_index: Analogous function for DataFrame.
Examples
--------
>>> s = pd.Series([1, 2, 3, 4], name='foo',
... index=pd.Index(['a', 'b', 'c', 'd'], name='idx'))
Generate a DataFrame with default index.
>>> s.reset_index()
idx foo
0 a 1
1 b 2
2 c 3
3 d 4
To specify the name of the new column use `name`.
>>> s.reset_index(name='values')
idx values
0 a 1
1 b 2
2 c 3
3 d 4
To generate a new Series with the default set `drop` to True.
>>> s.reset_index(drop=True)
0 1
1 2
2 3
3 4
Name: foo, dtype: int64
To update the Series in place, without generating a new one
set `inplace` to True. Note that it also requires ``drop=True``.
>>> s.reset_index(inplace=True, drop=True)
>>> s
0 1
1 2
2 3
3 4
Name: foo, dtype: int64
The `level` parameter is interesting for Series with a multi-level
index.
>>> arrays = [np.array(['bar', 'bar', 'baz', 'baz']),
... np.array(['one', 'two', 'one', 'two'])]
>>> s2 = pd.Series(
... range(4), name='foo',
... index=pd.MultiIndex.from_arrays(arrays,
... names=['a', 'b']))
To remove a specific level from the Index, use `level`.
>>> s2.reset_index(level='a')
a foo
b
one bar 0
two bar 1
one baz 2
two baz 3
If `level` is not set, all levels are removed from the Index.
>>> s2.reset_index()
a b foo
0 bar one 0
1 bar two 1
2 baz one 2
3 baz two 3
pop_flat = pop.reset_index() # without the name argument, the value column gets a default name (0 here)
pop_flat
<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>states</th>
<th>years</th>
<th>0</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>California</td>
<td>2000</td>
<td>25145561</td>
</tr>
<tr>
<th>1</th>
<td>California</td>
<td>2010</td>
<td>19378102</td>
</tr>
<tr>
<th>2</th>
<td>New York</td>
<td>2000</td>
<td>37253956</td>
</tr>
<tr>
<th>3</th>
<td>New York</td>
<td>2010</td>
<td>18976457</td>
</tr>
<tr>
<th>4</th>
<td>Texas</td>
<td>2000</td>
<td>33871648</td>
</tr>
<tr>
<th>5</th>
<td>Texas</td>
<td>2010</td>
<td>20851820</td>
</tr>
</tbody>
</table>
</div>
pop_flat2 = pop.reset_index(name = 'population') # the name argument sets the name of the value column
pop_flat2
<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>states</th>
<th>years</th>
<th>population</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>California</td>
<td>2000</td>
<td>25145561</td>
</tr>
<tr>
<th>1</th>
<td>California</td>
<td>2010</td>
<td>19378102</td>
</tr>
<tr>
<th>2</th>
<td>New York</td>
<td>2000</td>
<td>37253956</td>
</tr>
<tr>
<th>3</th>
<td>New York</td>
<td>2010</td>
<td>18976457</td>
</tr>
<tr>
<th>4</th>
<td>Texas</td>
<td>2000</td>
<td>33871648</td>
</tr>
<tr>
<th>5</th>
<td>Texas</td>
<td>2010</td>
<td>20851820</td>
</tr>
</tbody>
</table>
</div>
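- Setting the index
This is the reverse of the operation above
Taking pop_flat2 as an example, it turns the flat DataFrame above into a DataFrame with a multi-level index
pop_flat2.set_index(['states', 'years']) # returns a DataFrame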
<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th></th>
<th>population</th>
</tr>
<tr>
<th>states</th>
<th>years</th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th rowspan="2" valign="top">California</th>
<th>2000</th>
<td>25145561</td>
</tr>
<tr>
<th>2010</th>
<td>19378102</td>
</tr>
<tr>
<th rowspan="2" valign="top">New York</th>
<th>2000</th>
<td>37253956</td>
</tr>
<tr>
<th>2010</th>
<td>18976457</td>
</tr>
<tr>
<th rowspan="2" valign="top">Texas</th>
<th>2000</th>
<td>33871648</td>
</tr>
<tr>
<th>2010</th>
<td>20851820</td>
</tr>
</tbody>
</table>
</div>
pop_flat2.set_index( 'years') # returns a DataFrame
<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>states</th>
<th>population</th>
</tr>
<tr>
<th>years</th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th>2000</th>
<td>California</td>
<td>25145561</td>
</tr>
<tr>
<th>2010</th>
<td>California</td>
<td>19378102</td>
</tr>
<tr>
<th>2000</th>
<td>New York</td>
<td>37253956</td>
</tr>
<tr>
<th>2010</th>
<td>New York</td>
<td>18976457</td>
</tr>
<tr>
<th>2000</th>
<td>Texas</td>
<td>33871648</td>
</tr>
<tr>
<th>2010</th>
<td>Texas</td>
<td>20851820</td>
</tr>
</tbody>
</table>
</div>
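Since set_index undoes reset_index, the two can be chained into a round trip. A minimal sketch with made-up values; `flat` and `restored` are illustrative names.

```python
import pandas as pd

index = pd.MultiIndex.from_product([['California', 'Texas'], [2000, 2010]],
                                   names=['states', 'years'])
s = pd.Series([10, 20, 30, 40], index=index)  # illustrative values

# Index levels become ordinary columns; the values go into 'population'.
flat = s.reset_index(name='population')

# Move the two columns back into the index and take the value column again.
restored = flat.set_index(['states', 'years'])['population']

print(flat)
print(restored)
```

Selecting the 'population' column at the end turns the one-column DataFrame back into a Series with the same multi-level index.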
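stack and unstack on an index
Using the pop data set as an example
I find stack and unstack very convenient for changing dimensionality: they reshape a data set between long and wide forms to suit different needs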
pop
states years
California 2000 25145561
2010 19378102
New York 2000 37253956
2010 18976457
Texas 2000 33871648
2010 20851820
dtype: int64
Use unstack to turn the years (level 1) into column labels; states stay as the row index
pop.unstack(level = 1)
<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th>years</th>
<th>2000</th>
<th>2010</th>
</tr>
<tr>
<th>states</th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th>California</th>
<td>25145561</td>
<td>19378102</td>
</tr>
<tr>
<th>New York</th>
<td>37253956</td>
<td>18976457</td>
</tr>
<tr>
<th>Texas</th>
<td>33871648</td>
<td>20851820</td>
</tr>
</tbody>
</table>
</div>
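The level passed to unstack decides which labels become columns, and stack reverses the move. A short sketch with made-up values; the variable names are illustrative.

```python
import pandas as pd

index = pd.MultiIndex.from_product([['California', 'Texas'], [2000, 2010]],
                                   names=['states', 'years'])
s = pd.Series([10, 20, 30, 40], index=index)  # illustrative values

by_year = s.unstack(level='years')    # years become columns, states stay as rows
by_state = s.unstack(level='states')  # states become columns, years stay as rows

# stack moves the column labels back into the innermost index level.
back = by_year.stack()

print(by_year)
print(by_state)
print(back)
```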