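I. Hierarchical Indexing in Pandas
1. An index can have multiple levels. The example below creates a Series; the index argument is a list made of two sub-lists, where the first sub-list is the outer index level and the second is the inner index level.
>>> import pandas as pd
>>> import numpy as np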
>>> ser_obj=pd.Series(np.random.randn(12),index=[['a','a','a','b','b','b','c','c','c','d','d','d'],[0,1,2,0,1,2,0,1,2,0,1,2]])
>>> ser_obj
a 0 -1.033278
1 0.487781
2 0.337266
b 0 -0.768059
1 0.841555
2 -0.872732
c 0 -1.217606
1 -0.481930
2 -0.786703
d 0 0.508066
1 0.680397
2 -0.156162
dtype: float64
>>> print(type(ser_obj))
<class 'pandas.core.series.Series'>
>>> print(type(ser_obj.index))
<class 'pandas.core.indexes.multi.MultiIndex'>
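# The index is a MultiIndex object, which holds two pieces of information: levels and labels (labels is exposed as codes from pandas 0.24 on). levels lists the distinct labels in each of the two levels; labels records which label sits at each position.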
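To make these two pieces of information concrete, here is a minimal sketch on a small hand-written Series (the values are illustrative, not the random data above). It assumes pandas 0.24 or later, where the per-position information is available as `codes`.

```python
import pandas as pd

# Small deterministic Series with a two-level index (illustrative values).
ser = pd.Series(
    [10, 11, 12, 13],
    index=[['a', 'a', 'b', 'b'], [0, 1, 0, 1]],
)

idx = ser.index
print(idx.nlevels)           # number of index levels
print(list(idx.levels[0]))   # distinct labels of the outer level
print(list(idx.levels[1]))   # distinct labels of the inner level
print(list(idx.codes[0]))    # for each row, the position of its outer label in levels[0]
print(list(idx.codes[1]))    # for each row, the position of its inner label in levels[1]
```

Each row's full label can be rebuilt by looking up its code in the matching entry of levels.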
2. Selecting data by index
# Outer-level selection: use an outer index label directly
>>> print(ser_obj['c'])
0 -1.217606
1 -0.481930
2 -0.786703
dtype: float64
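# Inner-level selection: the first element inside the brackets selects the outer label and the second element selects the inner label. In the example below, the colon selects all outer labels.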
>>> print(ser_obj[:,1])
a 0.487781
b 0.841555
c -0.481930
d 0.680397
dtype: float64
# Selecting a single value
>>> print(ser_obj['b',1])
0.8415554704217344
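The same three kinds of selection can also be written with the explicit `.loc` and `.xs` accessors. The following is only a sketch on hand-written data; the variable names are illustrative and not taken from the session above.

```python
import pandas as pd

ser = pd.Series(
    [10, 11, 12, 13, 14, 15],
    index=[['a', 'a', 'b', 'b', 'c', 'c'], [0, 1, 0, 1, 0, 1]],
)

outer = ser.loc['b']          # every row under the outer label 'b'
inner = ser.xs(1, level=1)    # inner label 1 across all outer labels
single = ser.loc[('b', 1)]    # one scalar value

print(outer.tolist())
print(inner.tolist())
print(single)
```

`xs` drops the selected level from the result, so `inner` is indexed by the outer labels only.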
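3. Swapping index levels
swaplevel(): swaps the inner and outer index levels.
Levels are referred to by number:
0: the outermost level,
1: the next inner level,
2: the level inside that
# Example
>>> print(ser_obj.swaplevel()) # swaps the outer and inner index levels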
0 a -1.033278
1 a 0.487781
2 a 0.337266
0 b -0.768059
1 b 0.841555
2 b -0.872732
0 c -1.217606
1 c -0.481930
2 c -0.786703
0 d 0.508066
1 d 0.680397
2 d -0.156162
dtype: float64
>>> print(ser_obj.swaplevel(1)) # swaps level 1 with the innermost level; with only two levels, level 1 is the innermost level, so it is swapped with itself
a 0 -1.033278
1 0.487781
2 0.337266
b 0 -0.768059
1 0.841555
2 -0.872732
c 0 -1.217606
1 -0.481930
2 -0.786703
d 0 0.508066
1 0.680397
2 -0.156162
dtype: float64
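With more than two levels the arguments of swaplevel() start to matter. A small sketch on a three-level index (hand-written data, illustrative names) shows the default call next to an explicit pair of level numbers.

```python
import pandas as pd

ser3 = pd.Series(
    [1, 2, 3, 4],
    index=pd.MultiIndex.from_tuples(
        [('a', 0, 'x'), ('a', 1, 'y'), ('b', 0, 'x'), ('b', 1, 'y')]
    ),
)

default_swap = ser3.swaplevel()      # default: swaps the two innermost levels
outer_swap = ser3.swaplevel(0, 2)    # swaps the outermost and the innermost level

print(default_swap.index[0])
print(outer_swap.index[0])
```

In both cases only the index labels are rearranged; the values keep their order.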
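4. Sorting by level
sortlevel() sorts in ascending order by default; ascending=False sorts in descending order. (sortlevel() has been deprecated since pandas 0.20 in favor of sort_index(level=...).)
The argument selects the level to sort on:
0: the outermost level,
1: the next inner level,
2: the level inside that
# Example
>>> ser_obj2 = pd.Series(np.random.randn(12), index=[
...     ['a','c','b','a','b','c','b','c','a','c','a','b'],
...     [-10,-1,12,0,-1,21,10,11,21,-11,18,26],
...     [10,21,25,20,14,20,2,16,11,9,-10,0]])
# Print the Series object
>>> print(ser_obj2)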
a -10 10 -0.692522
c -1 21 1.606770
b 12 25 0.872304
a 0 20 0.434283
b -1 14 -0.334205
c 21 20 1.143886
b 10 2 0.062329
c 11 16 -0.635995
a 21 11 0.060879
c -11 9 -0.380142
a 18 -10 1.329021
b 26 0 -1.848677
dtype: float64
# Called without arguments, sortlevel() defaults to level 0 and sorts by the outer index level
>>> print(ser_obj2.sortlevel())
a -10 10 -0.692522
0 20 0.434283
18 -10 1.329021
21 11 0.060879
b -1 14 -0.334205
10 2 0.062329
12 25 0.872304
26 0 -1.848677
c -11 9 -0.380142
-1 21 1.606770
11 16 -0.635995
21 20 1.143886
dtype: float64
# With argument 0, the outer index level is sorted
>>> print(ser_obj2.sortlevel(0))
a -10 10 -0.692522
0 20 0.434283
18 -10 1.329021
21 11 0.060879
b -1 14 -0.334205
10 2 0.062329
12 25 0.872304
26 0 -1.848677
c -11 9 -0.380142
-1 21 1.606770
11 16 -0.635995
21 20 1.143886
dtype: float64
# With argument 1, the second index level is sorted
>>> print(ser_obj2.sortlevel(1))
c -11 9 -0.380142
a -10 10 -0.692522
b -1 14 -0.334205
c -1 21 1.606770
a 0 20 0.434283
b 10 2 0.062329
c 11 16 -0.635995
b 12 25 0.872304
a 18 -10 1.329021
21 11 0.060879
c 21 20 1.143886
b 26 0 -1.848677
dtype: float64
# With argument 2, the third index level is sorted
>>> print(ser_obj2.sortlevel(2))
a 18 -10 1.329021
b 26 0 -1.848677
10 2 0.062329
c -11 9 -0.380142
a -10 10 -0.692522
21 11 0.060879
b -1 14 -0.334205
c 11 16 -0.635995
a 0 20 0.434283
c 21 20 1.143886
-1 21 1.606770
b 12 25 0.872304
dtype: float64
# There are only three index levels, so argument 3 refers to a fourth level that does not exist and raises an error
>>> print(ser_obj2.sortlevel(3))
Traceback (most recent call last):
File "C:\Users\admin\AppData\Local\Programs\Python\Python3.5\lib\site-packages\pandas\core\indexes\multi.py", line 655, in _get_level_number
level = self.names.index(level)
ValueError: 3 is not in list
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\admin\AppData\Local\Programs\Python\Python3.5\lib\site-packages\pandas\core\series.py", line 2137, in sortlevel
sort_remaining=sort_remaining)
File "C:\Users\admin\AppData\Local\Programs\Python\Python3.5\lib\site-packages\pandas\core\series.py", line 1948, in sort_index
sort_remaining=sort_remaining)
File "C:\Users\admin\AppData\Local\Programs\Python\Python3.5\lib\site-packages\pandas\core\indexes\multi.py", line 1735, in sortlevel
level = [self._get_level_number(lev) for lev in level]
File "C:\Users\admin\AppData\Local\Programs\Python\Python3.5\lib\site-packages\pandas\core\indexes\multi.py", line 1735, in <listcomp>
level = [self._get_level_number(lev) for lev in level]
File "C:\Users\admin\AppData\Local\Programs\Python\Python3.5\lib\site-packages\pandas\core\indexes\multi.py", line 669, in _get_level_number
'not %d' % (self.nlevels, level + 1))
IndexError: Too many levels: Index has only 3 levels, not 4
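On pandas releases where sortlevel() is no longer available, the same sorts can be expressed with sort_index(level=...). The sketch below uses a small hand-written Series rather than the random data above.

```python
import pandas as pd

ser = pd.Series(
    [1.0, 2.0, 3.0, 4.0],
    index=[['b', 'a', 'b', 'a'], [2, 1, 1, 2]],
)

by_outer = ser.sort_index(level=0)                         # sort on the outer level
by_inner = ser.sort_index(level=1)                         # sort on the inner level
by_outer_desc = ser.sort_index(level=0, ascending=False)   # descending order

print(by_outer)
print(by_inner)
print(by_outer_desc)
```

By default the remaining levels are used to break ties, which matches the outputs shown for sortlevel() above.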
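5. Reshaping a Series into a DataFrame
unstack(): the argument selects which index level is moved into the columns
# Example
>>> ser_obj3 = pd.Series(np.random.randn(12), index=[
...     ['a','a','a','b','b','b','c','c','c','d','d','d'],
...     [0,1,2,0,1,2,0,1,2,0,1,2]])
>>> df_obj = ser_obj3.unstack() # outer level becomes the rows, inner level the columns
>>> df_obj2 = ser_obj3.unstack(0) # outer level becomes the columns, inner level the rows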
>>> print(df_obj)
0 1 2
a -1.412599 -0.549820 0.166354
b -0.533238 1.677337 -1.093246
c 0.921574 1.943227 -1.369392
d 0.567585 1.257389 -1.877724
>>> print(df_obj2)
a b c d
0 -1.412599 -0.533238 0.921574 0.567585
1 -0.549820 1.677337 1.943227 1.257389
2 0.166354 -1.093246 -1.369392 -1.877724
6. Reshaping a DataFrame into a Series
stack(): the argument selects which column level is moved into the index
# Example
>>> print(df_obj.stack())
a 0 -1.412599
1 -0.549820
2 0.166354
b 0 -0.533238
1 1.677337
2 -1.093246
c 0 0.921574
1 1.943227
2 -1.369392
d 0 0.567585
1 1.257389
2 -1.877724
dtype: float64
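unstack() and stack() undo each other when no values are missing. A minimal round-trip sketch on hand-written data (variable names are illustrative):

```python
import pandas as pd

ser = pd.Series(
    [1.0, 2.0, 3.0, 4.0],
    index=[['a', 'a', 'b', 'b'], [0, 1, 0, 1]],
)

wide = ser.unstack()          # inner level becomes the columns
wide_outer = ser.unstack(0)   # outer level becomes the columns
back = wide.stack()           # columns go back to being the inner index level

print(wide)
print(wide_outer)
print(back.tolist())
```

Here `back` carries the same labels and values as `ser`, in the same order.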
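II. Statistics and Descriptions in Pandas
1. Statistics and computation
Common statistical computations:
sum, mean, max, min, ...
axis=0 computes one result per column; axis=1 computes one result per row
skipna excludes missing values; it defaults to True
# Example: create a DataFrame object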
>>> df_obj = pd.DataFrame(np.random.randn(5,4), columns = ['a', 'b', 'c', 'd'])
>>> print(df_obj)
a b c d
0 0.732092 -0.890524 1.009502 0.030393
1 0.544709 -0.598250 0.924535 -0.827764
2 -0.420975 0.916274 -0.491443 1.239700
3 -0.297845 0.076457 0.360543 -0.449391
4 0.247270 1.426305 0.707588 -0.084117
>>> df_obj.sum() # sum of each column
a 0.805251
b 0.930262
c 2.510725
d -0.091179
dtype: float64
>>> df_obj.max() # maximum of each column
a 0.732092
b 1.426305
c 1.009502
d 1.239700
dtype: float64
>>> df_obj.min(axis=1, skipna=False) # axis=1 takes the minimum of each row; skipna=False keeps missing values in the computation (the default skipna=True skips them)
0 -0.890524
1 -0.827764
2 -0.491443
3 -0.449391
4 -0.084117
dtype: float64
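The data above contains no missing values, so skipna makes no visible difference there. A small sketch with one NaN (hand-written data) shows what the parameter changes:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [4.0, 5.0, 6.0]})

col_sum_skip = df.sum()                        # axis=0, skipna=True by default
col_sum_keep = df.sum(skipna=False)            # the NaN propagates into column 'a'
row_min_keep = df.min(axis=1, skipna=False)    # the row containing NaN yields NaN

print(col_sum_skip)
print(col_sum_keep)
print(row_min_keep)
```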
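2. Descriptive statistics
describe() produces a statistical summary of the dataset
For each column it reports the count, mean, standard deviation, minimum, quartiles, and maximum
# Example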
>>> print(df_obj.describe())
a b c d
count 5.000000 5.000000 5.000000 5.000000
mean 0.161050 0.186052 0.502145 -0.018236
std 0.507458 0.982040 0.609364 0.779477
min -0.420975 -0.890524 -0.491443 -0.827764
25% -0.297845 -0.598250 0.360543 -0.449391
50% 0.247270 0.076457 0.707588 -0.084117
75% 0.544709 0.916274 0.924535 0.030393
max 0.732092 1.426305 1.009502 1.239700
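III. Grouping and Aggregation in Pandas
1. Grouping (groupby)
Split the dataset into groups, then run statistical analysis on each group
The grouping process: split -> apply (for example, sum) -> combine
# Example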
>>> dict_obj = {'key1' : ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'a'],
... 'key2' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
... 'data1': np.random.randn(8),
... 'data2': np.random.randn(8)}
>>> df_obj = pd.DataFrame(dict_obj)
>>> print(df_obj)
data1 data2 key1 key2
0 1.063867 -0.317270 a one
1 -0.653952 -0.210728 b one
2 -0.832824 0.876746 a two
3 -0.639590 1.413713 b three
4 -0.741649 -0.378003 a two
5 -0.364427 1.628013 b two
6 -0.774461 -0.259763 a one
7 0.517515 -1.813689 a three
# Grouping: to group the whole dataset, pass the column name to groupby
>>> grouped = df_obj.groupby("key2")
>>> print(grouped)
# The result is a GroupBy object
pandas.core.groupby.DataFrameGroupBy object
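Grouped computation
Methods called on the GroupBy object are applied to each group
Note: in the pandas version used here, grouped computations apply only to the numeric columns; non-numeric columns are left out (newer releases may need numeric_only=True to get the same result)
# Example 1
>>> print(grouped.sum()) # sum of each group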
data1 data2
key2
one -0.364546 -0.787760
three -0.122075 -0.399976
two -1.938899 2.126756
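The split -> apply -> combine process can be spelled out by hand and compared with groupby. This is only a sketch on hand-written data; the column names `key` and `data` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'key': ['one', 'two', 'one', 'two'],
    'data': [1, 2, 3, 4],
})

# Split: select the rows of each distinct key.
# Apply: sum the 'data' values of that subset.
# Combine: collect the per-key results into one Series.
manual = pd.Series({
    key: df.loc[df['key'] == key, 'data'].sum()
    for key in sorted(df['key'].unique())
})

# groupby performs the same three steps in one call.
grouped_sum = df.groupby('key')['data'].sum()

print(manual)
print(grouped_sum)
```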
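# Example 2
# To run a grouped computation on a single data column, select that column first and pass the grouping column of the dataset to groupby
>>> grouped2 = df_obj["data2"].groupby(df_obj["key1"])
>>> grouped2.mean() # mean of each group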
key1
a -0.156234
b 0.110796
Name: data2, dtype: float64
size() returns the number of elements in each group
# Example
>>> grouped1 = df_obj.groupby('key1')
>>> print(grouped1.mean())
data1 data2
key1
a -0.177206 -0.156234
b 0.877349 0.110796
>>> print(grouped1.size())
key1
a 5
b 3
dtype: int64
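2. Grouping by a custom key
If the existing columns do not give the grouping you need, you can define your own grouping rule and group by it
obj.groupby(self_def_key)
The custom key can be a list or a list of lists
obj.groupby(['label1', 'label2']) -> DataFrame with a multi-level index
# Custom grouping
>>> self_key = ["aa","bb","cc","dd","aa","bb","cc","dd"]
>>> grouped3 = df_obj.groupby(self_key)
>>> print(grouped3.sum()) # sum of each group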
data1 data2
aa -0.432495 -1.395749
bb 1.512831 0.879091
cc -0.971006 0.883061
dd 1.636686 -0.815187
3. Multi-level grouping
Several columns can be given; the order of the index levels follows the order of the columns in the list
>>> grouped4 = df_obj.groupby(["key1","key2"])
>>> print(grouped4.sum())
data1 data2
key1 key2
a one -2.601935 0.666740
three 0.517470 -0.268483
two 1.198434 -1.179429
b one -0.028543 1.558148
three 1.119217 -0.546704
two 1.541374 -0.679058
4. GroupBy objects support iteration
Each iteration returns a tuple (group_name, group_data)
>>> grouped4 = df_obj.groupby(["key1","key2"])
>>> print(grouped4) # iterating over grouped4 yields tuples
<pandas.core.groupby.DataFrameGroupBy object at 0x000001325B88A588>
>>> for name, data in grouped4:
... print(name)
... print(data)
...
('a', 'one')
data1 data2 key1 key2
0 -0.979220 -0.171207 a one
6 -1.622715 0.837948 a one
('a', 'three')
data1 data2 key1 key2
7 0.51747 -0.268483 a three
('a', 'two')
data1 data2 key1 key2
2 0.651709 0.045113 a two
4 0.546725 -1.224542 a two
('b', 'one')
data1 data2 key1 key2
1 -0.028543 1.558148 b one
('b', 'three')
data1 data2 key1 key2
3 1.119217 -0.546704 b three
('b', 'two')
data1 data2 key1 key2
5 1.541374 -0.679058 b two
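5. A GroupBy object can be converted to a list or a dict
The list contains one tuple per group, each holding the name and the data
In the dict, the key is the name and the value is the data
# Example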
>>> print(list(grouped4))
[(('a', 'one'), data1 data2 key1 key2
0 -0.979220 -0.171207 a one
6 -1.622715 0.837948 a one), (('a', 'three'), data1 data2 key1 key2
7 0.51747 -0.268483 a three), (('a', 'two'), data1 data2 key1 key2
2 0.651709 0.045113 a two
4 0.546725 -1.224542 a two), (('b', 'one'), data1 data2 key1 key2
1 -0.028543 1.558148 b one), (('b', 'three'), data1 data2 key1 key2
3 1.119217 -0.546704 b three), (('b', 'two'), data1 data2 key1 key2
5 1.541374 -0.679058 b two)]
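The dict conversion mentioned above is not shown in the session; a minimal sketch on hand-written data could look like this (the column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'key1': ['a', 'b', 'a'],
    'key2': ['one', 'one', 'two'],
    'data': [1, 2, 3],
})

# Each (name, data) tuple becomes one key/value pair of the dict.
groups = dict(list(df.groupby(['key1', 'key2'])))

print(sorted(groups))
print(groups[('a', 'one')])
```

With two grouping columns the dict keys are tuples, and each value is the sub-DataFrame of that group.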
6. Grouping by data type
The argument axis=1 groups along the column axis
>>> print(df_obj.dtypes) # shows two float64 columns and two object columns
data1 float64
data2 float64
key1 object
key2 object
dtype: object
>>> print(df_obj.groupby(df_obj.dtypes,axis = 1).size())
float64 2
object 2
dtype: int64
>>> print(df_obj.groupby(df_obj.dtypes,axis = 1).sum())
float64 object
0 -1.150428 aone
1 1.529606 bone
2 0.696821 atwo
3 0.572513 bthree
4 -0.677816 atwo
5 0.862317 btwo
6 -0.784767 aone
7 0.248986 athree
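7. Modifying values
# Build a 5x5 DataFrame of random integers in the range 1 to 9; the row labels come from the index list and the column labels from the columns list
>>> df_obj2 = pd.DataFrame(np.random.randint(1, 10, (5,5)), columns=['a', 'b', 'c', 'd', 'e'], index=['A', 'B', 'C', 'D', 'E'])
>>> df_obj2.iloc[1,1:4] = np.nan # .iloc replaces the .ix accessor, which was removed in pandas 1.0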
>>> print(df_obj2)
a b c d e
A 9 4.0 9.0 2.0 5
B 4 NaN NaN NaN 8
C 8 4.0 1.0 8.0 1
D 2 3.0 4.0 6.0 5
E 9 1.0 9.0 6.0 9
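For assignments like the one above, pandas offers the position-based `.iloc` and the label-based `.loc` accessors. The following sketch uses a small hand-written frame; converting to float up front is a choice made here so that NaN can be stored without changing the column types.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(1, 10).reshape(3, 3),
    columns=['a', 'b', 'c'],
    index=['A', 'B', 'C'],
).astype(float)

df.iloc[1, 1:3] = np.nan     # by position: second row, second and third column
df.loc['C', 'a'] = np.nan    # by label: row 'C', column 'a'

print(df)
```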
8. Grouping with a dict
>>> mapping = {"a":"one","b":"two","c":"three"} # named mapping so the built-in dict is not shadowed
>>> print(df_obj2.groupby(mapping,axis=1).size())
one 1
three 1
two 1
dtype: int64
>>> print(df_obj2.groupby(mapping, axis=1).count()) # number of non-NaN values
one three two
A 1 1 1
B 1 0 0
C 1 1 1
D 1 1 1
E 1 1 1
>>> print(df_obj2.groupby(mapping, axis=1).sum())
one three two
A 9.0 9.0 4.0
B 4.0 0.0 0.0
C 8.0 1.0 4.0
D 2.0 4.0 3.0
E 9.0 9.0 1.0
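Recent pandas releases deprecate the axis=1 argument of groupby. One possible substitute, not the method used in the session above, is to transpose, group the rows, and transpose back. A sketch on hand-written data, where the name `mapping` is illustrative:

```python
import pandas as pd

df = pd.DataFrame(
    {'a': [1, 2], 'b': [3, 4], 'c': [5, 6], 'd': [7, 8]},
    index=['A', 'B'],
)

# Column label -> group name; 'd' has no entry and is left out of the result.
mapping = {'a': 'one', 'b': 'two', 'c': 'one'}

# Transpose so the columns become rows, group those rows, then transpose back.
result = df.T.groupby(mapping).sum().T

print(result)
```

As in the session above, columns that do not appear in the dict do not contribute to any group.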
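Summary
Building a multi-level index on a Series
Selecting data by index
Swapping and sorting index levels
Converting between DataFrame and Series objects
Common Pandas methods for statistics and description: sum, mean, max, min, and so on
Grouping: grouped computation, multi-level grouping, iterating over groups, converting groups to a list, grouping with a dict, aggregation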