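1. The DataFrame object
A DataFrame arranges several columns of data in a fixed order; each column can have a different data type.
A DataFrame has two index arrays. The first labels the rows and works much like the index of a Series: each label is associated with all the elements in its row. The second holds the column labels, each associated with one column of data.
A DataFrame can be thought of as a dict of Series: each column name is a key, and the Series that forms that column is the value.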
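The "dict of Series" view can be sketched as follows. The row labels r0 and r1 and the values are made up for illustration.

```python
import pandas as pd

# Two Series that share the same row labels (illustrative data).
colors = pd.Series(['red', 'blue'], index=['r0', 'r1'])
prices = pd.Series([1.1, 1.2], index=['r0', 'r1'])

# Each dict key becomes a column name, each Series becomes that column.
df = pd.DataFrame({'colors': colors, 'price': prices})

# Selecting a column gives back a Series carrying the same row index.
col = df['price']
print(df)
print(type(col).__name__, list(col.index))
```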
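2. Defining a DataFrame object
The most common way to create a DataFrame is to pass a dict to the DataFrame() constructor.
Each key of the dict is a column name, and each value is an array holding that column's elements.
1) Put every key/value pair of the dict into the DataFrame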
>>> import pandas as pd  # import the pandas package
>>> data={'colors':['red','blue','yellow','black'],'object':['pen','paper','ball','mug'],'price':[1.1,1.2,3.2,4]}  # each key becomes a column name of the DataFrame, each value becomes that column's contents
>>> data
{'colors': ['red', 'blue', 'yellow', 'black'], 'object': ['pen', 'paper', 'ball', 'mug'], 'price': [1.1, 1.2, 3.2, 4]}
>>> s=pd.DataFrame(data)  # pass the dict to the DataFrame constructor
>>> s
colors object price
0 red pen 1.1
1 blue paper 1.2
2 yellow ball 3.2
3 black mug 4.0
2) Initialize the DataFrame from only some of the dict's key/value pairs
>>> import pandas as pd  # import the pandas package
>>> dic={'colors':['red','black','yellow','orange'],'object':['pen','ball','shirt','mug'],'price':[1.2,3.4,2.3,5]}  # define the dict
>>> dic
{'colors': ['red', 'black', 'yellow', 'orange'], 'object': ['pen', 'ball', 'shirt', 'mug'], 'price': [1.2, 3.4, 2.3, 5]}
>>> s=pd.DataFrame(dic,columns=['price','object'])  # build the DataFrame from the dict, keeping only two columns, in the order given
>>> s
price object
0 1.2 pen
1 3.4 ball
2 2.3 shirt
3 5.0 mug
3) Give the DataFrame a custom index (the examples above do not set one, so pandas assigns a default integer index starting at 0)
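A minimal sketch of a custom index, reusing the dict from item 1). The row labels 'one' to 'four' are made up for illustration.

```python
import pandas as pd

data = {'colors': ['red', 'blue', 'yellow', 'black'],
        'object': ['pen', 'paper', 'ball', 'mug'],
        'price': [1.1, 1.2, 3.2, 4]}

# index= supplies one label per row; its length must match the number of rows.
df = pd.DataFrame(data, index=['one', 'two', 'three', 'four'])
print(df)
```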
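4) Define a DataFrame without a dict, using three constructor arguments
Pass three arguments in this order: the data matrix, the index option, and the columns option. Assign the array of row labels to index and the array of column names to columns. np.arange(16).reshape(4,4) is a quick way to generate a matrix.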
>>> import numpy as np
>>> import pandas as pd
>>> arry=np.arange(16)
>>> arry
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15])
>>> arry=np.arange(16).reshape(4,4)
>>> arry
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11],
[12, 13, 14, 15]])
>>> s=pd.DataFrame(arry,index=['a','b','c','d'],columns=['A','B','C','D'])
>>> s
A B C D
a 0 1 2 3
b 4 5 6 7
c 8 9 10 11
d 12 13 14 15
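3. Selecting elements
1) To get the names of all columns of a DataFrame, read its columns attribute
2) To get the list of index labels, read its index attribute
3) To get the elements stored in the data structure, read its values attribute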
>>> import numpy as np
>>> import pandas as pd
>>> arry=np.arange(16)
>>> arry
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15])
>>> arry=np.arange(16).reshape(4,4)
>>> arry
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11],
[12, 13, 14, 15]])
>>> s=pd.DataFrame(arry,columns=['a','b','c','d'])
>>> s
a b c d
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
>>> s=pd.DataFrame(arry,index=['a','b','c','d'],columns=['A','B','C','D'])
>>> s
A B C D
a 0 1 2 3
b 4 5 6 7
c 8 9 10 11
d 12 13 14 15
>>> s.index
Index(['a', 'b', 'c', 'd'], dtype='object')
>>> s.columns
Index(['A', 'B', 'C', 'D'], dtype='object')
>>> s.values
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11],
[12, 13, 14, 15]])
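4) To get the contents of one column, use the column name as an index, or access the column name as an attribute (the attribute form only works when the name is a valid Python identifier that does not clash with an existing DataFrame attribute)
First way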
>>> s['B']
a 1
b 5
c 9
d 13
Name: B, dtype: int64
Second way
>>> s.B
a 1
b 5
c 9
d 13
Name: B, dtype: int64
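5) To get a row of a DataFrame, use iloc with the row's integer position or loc with its label (the ix indexer of older pandas releases has been removed)
Get a single row
>>> s
A B C D
a 0 1 2 3
b 4 5 6 7
c 8 9 10 11
d 12 13 14 15
>>> s.iloc[2]
A 8
B 9
C 10
D 11
Name: c, dtype: int64
>>> s.loc['c']
A 8
B 9
C 10
D 11
Name: c, dtype: int64
Get several rows (non-contiguous)
>>> s
A B C D
a 0 1 2 3
b 4 5 6 7
c 8 9 10 11
d 12 13 14 15
>>> s.iloc[[1,3]]
A B C D
b 4 5 6 7
d 12 13 14 15
>>> s.loc[['b','d']]
A B C D
b 4 5 6 7
d 12 13 14 15
Get several rows (contiguous)
>>> s
A B C D
a 0 1 2 3
b 4 5 6 7
c 8 9 10 11
d 12 13 14 15
>>> s.iloc[0:3]
A B C D
a 0 1 2 3
b 4 5 6 7
c 8 9 10 11
>>> s.loc['a':'c']  # a label slice includes both endpoints
A B C D
a 0 1 2 3
b 4 5 6 7
c 8 9 10 11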
A B C D
a 0 1 2 3
b 4 5 6 7
c 8 9 10 11
Get a single element
>>> s
A B C D
a 0 1 2 3
b 4 5 6 7
c 8 9 10 11
d 12 13 14 15
>>> s['A'].iloc[1]  # select column 'A' first, then the row at position 1
4
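Besides chaining a column selection with a row selection, a single element can be read in one step. The sketch below rebuilds the same 4x4 frame and shows three equivalent lookups of the cell in row 'b', column 'A'.

```python
import numpy as np
import pandas as pd

s = pd.DataFrame(np.arange(16).reshape(4, 4),
                 index=['a', 'b', 'c', 'd'], columns=['A', 'B', 'C', 'D'])

by_label = s.loc['b', 'A']     # row label first, then column label
by_position = s.iloc[1, 0]     # row position first, then column position
single_cell = s.at['b', 'A']   # label-based access to exactly one cell
print(by_label, by_position, single_cell)
```

Note that loc, iloc and at all take the row first and the column second, the opposite of the chained form s['A'].iloc[1].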
4. Assignment
1) Give the index and the columns a name
>>> import numpy as np
>>> import pandas as pd
>>> arry=np.arange(16)
>>> arry
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15])
>>> arry=np.arange(16).reshape(4,4)
>>> arry
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11],
[12, 13, 14, 15]])
>>> s=pd.DataFrame(arry,index=['a','b','c','d'],columns=['A','B','C','D'])
>>> s
A B C D
a 0 1 2 3
b 4 5 6 7
c 8 9 10 11
d 12 13 14 15
>>> s.columns.name='item'
>>> s
item A B C D
a 0 1 2 3
b 4 5 6 7
c 8 9 10 11
d 12 13 14 15
>>> s.index.name='id'
>>> s
item A B C D
id
a 0 1 2 3
b 4 5 6 7
c 8 9 10 11
d 12 13 14 15
2) Add a new column
>>> s
item A B C D
id
a 0 1 2 3
b 4 5 6 7
c 8 9 10 11
d 12 13 14 15
>>> s['E']=12
>>> s
item A B C D E
id
a 0 1 2 3 12
b 4 5 6 7 12
c 8 9 10 11 12
d 12 13 14 15 12
3) Update the values of an existing column
>>> s
item A B C D E
id
a 0 1 2 3 12
b 4 5 6 7 12
c 8 9 10 11 12
d 12 13 14 15 12
>>> s['E']=[3,5,2,6]
>>> s
item A B C D E
id
a 0 1 2 3 3
b 4 5 6 7 5
c 8 9 10 11 2
d 12 13 14 15 6
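Two further assignment patterns, sketched on the same 4x4 frame: assigning a Series (which is aligned by row label) and setting a single cell. The values 10, 20 and 0 are arbitrary.

```python
import numpy as np
import pandas as pd

s = pd.DataFrame(np.arange(16).reshape(4, 4),
                 index=['a', 'b', 'c', 'd'], columns=['A', 'B', 'C', 'D'])

# Assigning a Series aligns it with the row labels; rows it does not cover get NaN.
s['E'] = pd.Series([10, 20], index=['b', 'd'])

# A single cell can be set with loc[row_label, column_label].
s.loc['a', 'E'] = 0
print(s)
```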
5. Membership of elements
>>> s
item A B C D E F
id
a 0 1 2 3 NaN NaN
b 4 5 6 7 NaN NaN
c 8 9 10 11 NaN NaN
d 12 13 14 15 NaN NaN
>>> s.isin([1,4])
item A B C D E F
id
a False True False False False False
b True False False False False False
c False False False False False False
d False False False False False False
>>> s[s.isin([1,4])]
item A B C D E F
id
a NaN 1.0 NaN NaN NaN NaN
b 4.0 NaN NaN NaN NaN NaN
c NaN NaN NaN NaN NaN NaN
d NaN NaN NaN NaN NaN NaN
6. Deleting a column
>>> s
item A B C D E F
id
a 0 1 2 3 NaN NaN
b 4 5 6 7 NaN NaN
c 8 9 10 11 NaN NaN
d 12 13 14 15 NaN NaN
>>> del s['E']
>>> s
item A B C D F
id
a 0 1 2 3 NaN
b 4 5 6 7 NaN
c 8 9 10 11 NaN
d 12 13 14 15 NaN
7. Filtering
>>> s
item A B C D F
id
a 0 1 2 3 NaN
b 4 5 6 7 NaN
c 8 9 10 11 NaN
d 12 13 14 15 NaN
>>> s[s<3]
item A B C D F
id
a 0.0 1.0 2.0 NaN NaN
b NaN NaN NaN NaN NaN
c NaN NaN NaN NaN NaN
d NaN NaN NaN NaN NaN
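8. Creating a DataFrame from a nested dict
When a nested dict is passed to the DataFrame constructor, pandas uses the outer keys as column names and the inner keys as row index labels. Not every position has a corresponding element; pandas fills the missing ones with NaN.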
>>> import pandas as pd
>>> dic={'red':{2012:22,2013:33},'white':{2011:13,2012:22,2013:16},'blue':{2017:17,2012:23,2018:18}}
>>> dic
{'blue': {2017: 17, 2018: 18, 2012: 23}, 'white': {2011: 13, 2012: 22, 2013: 16}, 'red': {2012: 22, 2013: 33}}
>>> s=pd.DataFrame(dic)
>>> s
blue red white
2011 NaN NaN 13.0
2012 23.0 22.0 22.0
2013 NaN 33.0 16.0
2017 17.0 NaN NaN
2018 18.0 NaN NaN
9. Transposing a DataFrame
>>> s
blue red white
2011 NaN NaN 13.0
2012 23.0 22.0 22.0
2013 NaN 33.0 16.0
2017 17.0 NaN NaN
2018 18.0 NaN NaN
>>> s.T  # T is an attribute, so no parentheses are needed
2011 2012 2013 2017 2018
blue NaN 23.0 NaN 17.0 18.0
red NaN 22.0 33.0 NaN NaN
white 13.0 22.0 16.0 NaN NaN
10. The Index object
The Index objects of a Series or DataFrame are immutable: once created, their elements cannot be modified in place.
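A small sketch of what immutability means in practice: changing one element of an Index raises a TypeError, while replacing the whole index with a new one is allowed. The labels are made up for illustration.

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])

try:
    s.index[0] = 'z'   # attempt to modify one element of the Index in place
    mutated = True
except TypeError:
    mutated = False    # pandas refuses: Index does not support mutation

# Assigning a complete new index to the Series is fine.
s.index = ['x', 'y', 'z']
print(mutated, list(s.index))
```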
11. Methods related to the Index
idxmin() and idxmax() return the index label of the smallest and the largest value, respectively.
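A short sketch of both methods, using the color-labelled Series that also appears in the sorting section of these notes.

```python
import pandas as pd

s = pd.Series([5, 0, 3, 8, 4],
              index=['red', 'blue', 'yellow', 'white', 'green'])

smallest_label = s.idxmin()   # label of the minimum value (0)
largest_label = s.idxmax()    # label of the maximum value (8)
print(smallest_label, largest_label)
```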
12. An Index with duplicate labels
>>> import pandas as pd
>>> s=pd.Series(range(6),index=['a','a','b','c','c','d'])
>>> s
a 0
a 1
b 2
c 3
c 4
d 5
dtype: int64
>>> s['a']
a 0
a 1
dtype: int64
>>> s.index.is_unique  # reports whether all index labels are unique
False
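13. Reindexing
The reindex function of pandas changes the index of a Series: it rearranges the elements of the original Series according to the new label sequence and returns a new Series.
When reindexing you can reorder the labels, drop labels, or add new ones; labels that did not exist before get NaN.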
>>> import pandas as pd
>>> s=pd.Series([1,2,3,4],index=['a','b','c','d'])
>>> s
a 1
b 2
c 3
d 4
dtype: int64
>>> s.reindex(['e','f','g','b'])
e NaN
f NaN
g NaN
b 2.0
dtype: float64
Spelling out every label this way does not scale to a large DataFrame; in that case reindex can fill the gaps automatically with a fill method,
as follows:
>>> import pandas as pd
>>> s=pd.Series([1,5,6,3],index=[0,3,5,6])
>>> s
0 1
3 5
5 6
6 3
dtype: int64
>>> s.reindex(range(6),method='ffill')  # reindex s with labels 0 through 5; ffill gives each new label the value of the nearest existing label before it
0 1
1 1
2 1
3 5
4 5
5 6
dtype: int64
>>>
>>> s=pd.Series([1,5,6,3],index=[0,3,5,6])
>>> s
0 1
3 5
5 6
6 3
dtype: int64
>>> s.reindex(range(6),method='bfill')  # bfill gives each new label the value of the nearest existing label after it
0 1
1 5
2 5
3 5
4 6
5 6
dtype: int64
>>> dic={'colors':['blue','green','yellow','red','white'],'price':[1.2,1.0,0.6,0.9,1.7],'object':['ballpand','pen','pencil','paper','mug']}  # define a dict of lists
>>> dic
{'object': ['ballpand', 'pen', 'pencil', 'paper', 'mug'], 'price': [1.2, 1.0, 0.6, 0.9, 1.7], 'colors': ['blue', 'green', 'yellow', 'red', 'white']}
>>> s=pd.DataFrame(dic)  # build s from the dict
>>> s
colors object price
0 blue ballpand 1.2
1 green pen 1.0
2 yellow pencil 0.6
3 red paper 0.9
4 white mug 1.7
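>>> s.reindex(range(5),method='ffill',columns=['colors','price','new','object'])  # add the column label 'new'; in the output below, ffill gives it the values of 'colors', the nearest column label that sorts before it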
colors price new object
0 blue 1.2 blue ballpand
1 green 1.0 green pen
2 yellow 0.6 yellow pencil
3 red 0.9 red paper
4 white 1.7 white mug
>>> s=pd.DataFrame(dic,index=[1,2,3,5,7])  # a DataFrame with a custom index
>>> s
colors object price
1 blue ballpand 1.2
2 green pen 1.0
3 yellow pencil 0.6
5 red paper 0.9
7 white mug 1.7
>>> s.reindex(range(5),method='ffill')  # reindex the rows
colors object price
0 NaN NaN NaN
1 blue ballpand 1.2
2 green pen 1.0
3 yellow pencil 0.6
4 yellow pencil 0.6
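14. Dropping index labels
1) Drop one item from a Series
2) Drop several items from a Series: pass the labels to drop() as a list
3) Drop some rows of a DataFrame
4) Drop columns of a DataFrame: pass axis=1 to refer to columns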
>>> import numpy as np
>>> import pandas as pd
>>> s=pd.Series(np.arange(4),index=['red','blue','yellow','white'])
>>> s
red 0
blue 1
yellow 2
white 3
dtype: int64
>>> s.drop('yellow')  # drop one label and its element from the Series
red 0
blue 1
white 3
dtype: int64
>>> s.drop(['red','white'])  # drop several labels from the Series
blue 1
yellow 2
dtype: int64
>>> frame=pd.DataFrame(np.arange(16).reshape(4,4),index=['red','blue','yellow','white'],columns=['ball','pen','pencil','paper'])
>>> frame
ball pen pencil paper
red 0 1 2 3
blue 4 5 6 7
yellow 8 9 10 11
white 12 13 14 15
>>> frame.drop(['blue','yellow'])  # drop several rows of the DataFrame
ball pen pencil paper
red 0 1 2 3
white 12 13 14 15
>>> frame.drop(['pen','pencil'],axis=1)  # drop several columns of the DataFrame; axis=1 is required
ball paper
red 0 3
blue 4 7
yellow 8 11
white 12 15
15. Arithmetic and data alignment
1) Adding two Series
>>> import pandas as pd
>>> s1=pd.Series([3,2,5,1],['white','yellow','green','blue'])
>>> s2=pd.Series([1,4,7,2,1],index=['white','yellow','black','blue','brown'])
>>> s1
white 3
yellow 2
green 5
blue 1
dtype: int64
>>> s2
white 1
yellow 4
black 7
blue 2
brown 1
dtype: int64
>>> s1+s2
black NaN
blue 3.0
brown NaN
green NaN
white 4.0
yellow 6.0
dtype: float64
2) Adding two DataFrames
>>> import numpy as np
>>> frame1=pd.DataFrame(np.arange(16).reshape(4,4),index=['red','blue','yellow','white'],columns=['ball','pen','pencil','paper'])
>>> frame2=pd.DataFrame(np.arange(12).reshape(4,3),index=['blue','green','white','yellow'],columns=['mug','pen','ball'])
>>> frame1
ball pen pencil paper
red 0 1 2 3
blue 4 5 6 7
yellow 8 9 10 11
white 12 13 14 15
>>> frame2
mug pen ball
blue 0 1 2
green 3 4 5
white 6 7 8
yellow 9 10 11
>>> frame1+frame2
ball mug paper pen pencil
blue 6.0 NaN NaN 6.0 NaN
green NaN NaN NaN NaN NaN
red NaN NaN NaN NaN NaN
white 20.0 NaN NaN 20.0 NaN
yellow 19.0 NaN NaN 19.0 NaN
The same results can be obtained with the equivalent methods:
1) Adding Series
2) Adding DataFrames
>>> s1.add(s2)
black NaN
blue 3.0
brown NaN
green NaN
white 4.0
yellow 6.0
dtype: float64
>>> frame1.add(frame2)
ball mug paper pen pencil
blue 6.0 NaN NaN 6.0 NaN
green NaN NaN NaN NaN NaN
red NaN NaN NaN NaN NaN
white 20.0 NaN NaN 20.0 NaN
yellow 19.0 NaN NaN 19.0 NaN
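One practical difference between the + operator and the add() method is that add() accepts a fill_value. The sketch below, using the two Series from this section, treats a label missing on one side as 0 instead of producing NaN.

```python
import pandas as pd

s1 = pd.Series([3, 2, 5, 1], index=['white', 'yellow', 'green', 'blue'])
s2 = pd.Series([1, 4, 7, 2, 1],
               index=['white', 'yellow', 'black', 'blue', 'brown'])

# Labels present in only one Series are paired with fill_value before adding.
total = s1.add(s2, fill_value=0)
print(total)
```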
16. Operations between a DataFrame and a Series
1) The Series index matches the DataFrame column names
>>> import numpy as np
>>> import pandas as pd
>>> s=pd.Series([1,2,3,4],index=['a','b','c','d'])
>>> frame=pd.DataFrame(np.arange(16).reshape(4,4),columns=['a','b','c','d'])
>>> s
a 1
b 2
c 3
d 4
dtype: int64
>>> frame
a b c d
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
>>> s+frame  # s is aligned with the columns of frame by label and added to every row
a b c d
0 1 3 5 7
1 5 7 9 11
2 9 11 13 15
3 13 15 17 19
>>> frame-s  # s is aligned with the columns of frame by label and subtracted from every row
a b c d
0 -1 -1 -1 -1
1 3 3 3 3
2 7 7 7 7
3 11 11 11 11
2) The Series index differs from the DataFrame column names
>>> frame2=pd.DataFrame(np.arange(16).reshape(4,4),columns=['b','d','e','c'])
>>> s
a 1
b 2
c 3
d 4
dtype: int64
>>> frame2
b d e c
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
>>> s+frame2
a b c d e
0 NaN 2.0 6.0 5.0 NaN
1 NaN 6.0 10.0 9.0 NaN
2 NaN 10.0 14.0 13.0 NaN
3 NaN 14.0 18.0 17.0 NaN
>>> frame2-s
a b c d e
0 NaN -2.0 0.0 -3.0 NaN
1 NaN 2.0 4.0 1.0 NaN
2 NaN 6.0 8.0 5.0 NaN
3 NaN 10.0 12.0 9.0 NaN
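The examples above align the Series with the column labels. To align it with the row labels instead, the arithmetic methods take an axis argument. A minimal sketch that subtracts column 'a' from every column:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(16).reshape(4, 4), columns=['a', 'b', 'c', 'd'])
col = frame['a']   # a Series indexed by the row labels 0..3

# axis=0 matches the Series index against the row index of the frame.
result = frame.sub(col, axis=0)
print(result)
```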
17. Taking the square root of every element of a DataFrame with NumPy's sqrt function
>>> frame
a b c d
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
>>> np.sqrt(frame)
a b c d
0 0.000000 1.000000 1.414214 1.732051
1 2.000000 2.236068 2.449490 2.645751
2 2.828427 3.000000 3.162278 3.316625
3 3.464102 3.605551 3.741657 3.872983
18. Functions applied by row or by column
1) Apply a user-defined function to each column of a DataFrame
2) Apply a user-defined function to each row of a DataFrame
>>> f=lambda x:x.max()-x.min()
>>> frame.apply(f)  # f receives each column of the DataFrame
a 12
b 12
c 12
d 12
dtype: int64
>>> frame.apply(f,axis=1)  # axis=1 means f receives each row of the DataFrame
0 3
1 3
2 3
3 3
dtype: int64
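3) Use apply with a function that returns a Series, turning one DataFrame into another and computing several statistics at once
>>> f=lambda x:pd.Series([x.min(),x.max()],index=['min','max'])  # x is one column of the DataFrame; f returns a Series indexed by 'min' and 'max' holding that column's minimum and maximum
>>> frame.apply(f)  # f is applied to every column, and the resulting Series are combined into a new DataFrame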
a b c d
min 0 1 2 3
max 12 13 14 15
19. Statistical functions
Most of the statistical functions for arrays also work on a DataFrame
>>> frame
a b c d
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
>>> frame.sum()
a 24
b 28
c 32
d 36
dtype: int64
>>> frame.mean()
a 6.0
b 7.0
c 8.0
d 9.0
dtype: float64
>>> frame.describe()
a b c d
count 4.000000 4.000000 4.000000 4.000000
mean 6.000000 7.000000 8.000000 9.000000
std 5.163978 5.163978 5.163978 5.163978
min 0.000000 1.000000 2.000000 3.000000
25% 3.000000 4.000000 5.000000 6.000000
50% 6.000000 7.000000 8.000000 9.000000
75% 9.000000 10.000000 11.000000 12.000000
max 12.000000 13.000000 14.000000 15.000000
>>> frame.sum(axis=1)  # to apply a statistical function across each row, pass axis=1
0 6
1 22
2 38
3 54
dtype: int64
20. Sorting and ranking
1) Sorting a Series
>>> import numpy as np
>>> import pandas as pd
>>> s=pd.Series([5,0,3,8,4],index=['red','blue','yellow','white','green'])
>>> s
red 5
blue 0
yellow 3
white 8
green 4
dtype: int64
>>> s.sort_index()  # sort by index label in alphabetical order
blue 0
green 4
red 5
white 8
yellow 3
dtype: int64
>>> s.sort_index(ascending=False)  # ascending=False sorts in descending order
yellow 3
white 8
red 5
green 4
blue 0
dtype: int64
2) Sorting a DataFrame
>>> import numpy as np
>>> import pandas as pd
>>> frame=pd.DataFrame(np.arange(16).reshape(4,4),index=['red','blue','yellow','white'],columns=['ball','pen','pencil','paper'])
>>> frame
ball pen pencil paper
red 0 1 2 3
blue 4 5 6 7
yellow 8 9 10 11
white 12 13 14 15
>>> frame.sort_index()  # sorts by row label by default: blue, red, white, yellow
ball pen pencil paper
blue 4 5 6 7
red 0 1 2 3
white 12 13 14 15
yellow 8 9 10 11
>>> frame.sort_index(axis=1)  # axis=1 sorts by column label (ball, paper, pen, pencil); whole columns change position
ball paper pen pencil
red 0 3 1 2
blue 4 7 5 6
yellow 8 11 9 10
white 12 15 13 14
21. The examples above sort by index; the following sort by values
1) Sort a Series by its values
>>> s.sort_values()
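Sorting a Series by its values can be sketched with sort_values(), here on the color-labelled Series from section 20.

```python
import pandas as pd

s = pd.Series([5, 0, 3, 8, 4],
              index=['red', 'blue', 'yellow', 'white', 'green'])

ascending = s.sort_values()                   # smallest value first
descending = s.sort_values(ascending=False)   # largest value first
print(ascending)
print(descending)
```

Both calls return a new Series and leave s unchanged.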
2) Sort a DataFrame by its values
>>> frame
ball pen pencil paper
red 0 1 2 3
blue 4 5 6 7
yellow 8 9 10 11
white 12 13 14 15
>>> frame.sort_values(by='pen')  # sort the rows by the values in column 'pen'
ball pen pencil paper
red 0 1 2 3
blue 4 5 6 7
yellow 8 9 10 11
white 12 13 14 15
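Ranking, mentioned in the heading of section 20, assigns each element its position in the sorted order without rearranging the data. A minimal sketch with rank(), on a Series that has no tied values:

```python
import pandas as pd

s = pd.Series([5, 0, 3, 8, 4],
              index=['red', 'blue', 'yellow', 'white', 'green'])

ranks = s.rank()                       # 1 = smallest value
ranks_desc = s.rank(ascending=False)   # 1 = largest value
print(ranks)
print(ranks_desc)
```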
22. Correlation and covariance
1) Correlation and covariance between two Series
>>> import numpy as np
>>> import pandas as pd
>>> s1=pd.Series([3,4,3,4,5,4,3,2])
>>> s2=pd.Series([1,2,3,4,4,3,2,1])
>>> s1
0 3
1 4
2 3
3 4
4 5
5 4
6 3
7 2
dtype: int64
>>> s2
0 1
1 2
2 3
3 4
4 4
5 3
6 2
7 1
dtype: int64
>>> s1.corr(s2)  # correlation
0.7745966692414834
>>> s1.cov(s2)  # covariance
0.8571428571428571
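To see where these two numbers come from, the sketch below recomputes them from the definitions: the sample covariance divides the summed products of deviations by n-1, and the Pearson correlation divides the covariance by the product of the two sample standard deviations.

```python
import numpy as np
import pandas as pd

s1 = pd.Series([3, 4, 3, 4, 5, 4, 3, 2])
s2 = pd.Series([1, 2, 3, 4, 4, 3, 2, 1])

d1 = s1 - s1.mean()   # deviations from the mean
d2 = s2 - s2.mean()
n = len(s1)

cov = (d1 * d2).sum() / (n - 1)       # sample covariance
corr = cov / (s1.std() * s2.std())    # Series.std() also uses n-1 by default
print(cov, corr)
```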
2) Correlation and covariance within a single DataFrame
>>> frame=pd.DataFrame([[1,4,3,6],[4,5,6,1],[3,3,1,5],[4,1,6,4]],index=['red','blue','yellow','white'],columns=['ball','pen','pencil','paper'])
>>> frame
ball pen pencil paper
red 1 4 3 6
blue 4 5 6 1
yellow 3 3 1 5
white 4 1 6 4
>>> frame.corr()
ball pen pencil paper
ball 1.000000 -0.276026 0.577350 -0.763763
pen -0.276026 1.000000 -0.079682 -0.361403
pencil 0.577350 -0.079682 1.000000 -0.692935
paper -0.763763 -0.361403 -0.692935 1.000000
>>> frame.cov()
ball pen pencil paper
ball 2.000000 -0.666667 2.000000 -2.333333
pen -0.666667 2.916667 -0.333333 -1.333333
pencil 2.000000 -0.333333 6.000000 -3.666667
paper -2.333333 -1.333333 -3.666667 4.666667
3) Pairwise correlation between the rows or columns of a DataFrame and a Series or another DataFrame
>>> s
red 5
blue 0
yellow 3
white 8
green 4
dtype: int64
>>> frame
ball pen pencil paper
red 1 4 3 6
blue 4 5 6 1
yellow 3 3 1 5
white 4 1 6 4
>>> frame.corrwith(s)
ball -0.140028
pen -0.869657
pencil 0.080845
paper 0.595854
dtype: float64
23. Assigning NaN to elements
>>> s=pd.Series([1,2,np.nan,3])
>>> s
0 1.0
1 2.0
2 NaN
3 3.0
dtype: float64
24. Filtering out NaN
>>> s
0 1.0
1 2.0
2 NaN
3 3.0
dtype: float64
>>> s.dropna()  # use the dropna function
0 1.0
1 2.0
3 3.0
dtype: float64
>>>
Alternatively, filter with the notnull method:
>>> s=pd.Series([1,2,np.nan,3])
>>> s
0 1.0
1 2.0
2 NaN
3 3.0
dtype: float64
>>> s[s.notnull()]
0 1.0
1 2.0
3 3.0
dtype: float64
On a DataFrame, dropna() by default removes every row that contains at least one NaN (with axis=1, every such column)
>>> frame=pd.DataFrame([[6,np.nan,6],[np.nan,np.nan,np.nan],[2,np.nan,5]],index=['blue','green','red'],columns=['ball','mug','pen'])
>>> frame
ball mug pen
blue 6.0 NaN 6.0
green NaN NaN NaN
red 2.0 NaN 5.0
>>> frame.dropna()
Empty DataFrame
Columns: [ball, mug, pen]
Index: []
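To avoid dropping rows or columns that are only partly missing, pass how='all', which tells dropna to remove only the rows or columns whose elements are all NaN
>>> frame=pd.DataFrame([[6,np.nan,6],[np.nan,np.nan,np.nan],[2,np.nan,5]],index=['blue','green','red'],columns=['ball','mug','pen'])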
>>> frame
ball mug pen
blue 6.0 NaN 6.0
green NaN NaN NaN
red 2.0 NaN 5.0
>>> frame.dropna(how='all')
ball mug pen
blue 6.0 NaN 6.0
red 2.0 NaN 5.0
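The how='all' option also works along columns. The sketch below, on the same frame, drops the all-NaN column 'mug' and then combines both directions.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame([[6, np.nan, 6], [np.nan, np.nan, np.nan], [2, np.nan, 5]],
                     index=['blue', 'green', 'red'],
                     columns=['ball', 'mug', 'pen'])

# axis=1 looks at columns: only 'mug' is entirely NaN, so only it is dropped.
no_empty_cols = frame.dropna(axis=1, how='all')

# Chaining both calls removes the empty row and the empty column.
cleaned = frame.dropna(how='all').dropna(axis=1, how='all')
print(no_empty_cols)
print(cleaned)
```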
25. Filling NaN elements with other values
1) Replace every NaN with the same value using the fillna function
>>> frame=pd.DataFrame([[6,np.nan,6],[np.nan,np.nan,np.nan],[2,np.nan,5]],index=['blue','green','red'],columns=['ball','mug','pen'])
>>> frame
ball mug pen
blue 6.0 NaN 6.0
green NaN NaN NaN
red 2.0 NaN 5.0
>>> frame.fillna(0)
ball mug pen
blue 6.0 0.0 6.0
green 0.0 0.0 0.0
red 2.0 0.0 5.0
2) Replace the NaN in different columns with different values: pass a dict that maps each column name to its replacement value
>>> frame.fillna({'ball':1,'mug':2,'pen':8})
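A runnable sketch of per-column filling on the same frame, passing fillna() a dict from column name to fill value:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame([[6, np.nan, 6], [np.nan, np.nan, np.nan], [2, np.nan, 5]],
                     index=['blue', 'green', 'red'],
                     columns=['ball', 'mug', 'pen'])

# Each column's NaN values are replaced by the value given for that column.
filled = frame.fillna({'ball': 1, 'mug': 2, 'pen': 8})
print(filled)
```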
26. Hierarchical indexing and levels
1) Create a Series with a hierarchical index
>>> import numpy as np
>>> import pandas as pd
>>> s=pd.Series(np.random.rand(8),index=[['a','a','a','b','b','c','c','c'],['up','down','right','up','down','up','down','left']])
>>> s
a up 0.587733
down 0.425383
right 0.356205
b up 0.251802
down 0.105830
c up 0.253041
down 0.140155
left 0.425004
dtype: float64
2) Show the index attribute of a Series with a hierarchical index
>>> s.index
MultiIndex(levels=[['a', 'b', 'c'], ['down', 'left', 'right', 'up']],
labels=[[0, 0, 0, 1, 1, 2, 2, 2], [3, 0, 2, 3, 0, 3, 0, 1]])
3) Select the elements under a label of the first index level
>>> s['a']
up 0.587733
down 0.425383
right 0.356205
dtype: float64
4) Select the elements under a label of the second index level
>>> s[:,'up']  # the leading colon and comma select every label of the first level
a 0.587733
b 0.251802
c 0.253041
dtype: float64
5) Select one specific element of a Series with a hierarchical index
>>> s['a','up']
0.5877327517004284
6) Convert a Series with a hierarchical index into a DataFrame
>>> s.unstack()
down left right up
a 0.425383 NaN 0.356205 0.587733
b 0.105830 NaN NaN 0.251802
c 0.140155 0.425004 NaN 0.253041
7) Convert a DataFrame into a Series with a hierarchical index
>>> frame=s.unstack()
>>> frame
down left right up
a 0.425383 NaN 0.356205 0.587733
b 0.105830 NaN NaN 0.251802
c 0.140155 0.425004 NaN 0.253041
>>> frame.stack()
a down 0.425383
right 0.356205
up 0.587733
b down 0.105830
up 0.251802
c down 0.140155
left 0.425004
up 0.253041
dtype: float64
8) Define a DataFrame whose index and columns are both hierarchical
>>> frame=pd.DataFrame(np.random.randn(16).reshape(4,4),index=[['white','white','red','red'],['up','down','up','down']],columns=[['pen','pen','paper','paper'],[1,2,1,2]])
>>> frame
pen paper
1 2 1 2
white up -0.487631 0.200648 0.344613 0.144835
down 0.246683 -0.847063 -0.391592 -0.091928
red up -0.132962 -1.728167 1.787231 0.374895
down -1.033622 0.354458 0.007813 -1.203889
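Selection on such a frame follows the same idea as for the Series above. The sketch below uses the same labels but fixed integers from np.arange(16) instead of random numbers, so the results are reproducible; that substitution is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(16).reshape(4, 4),
                     index=[['white', 'white', 'red', 'red'],
                            ['up', 'down', 'up', 'down']],
                     columns=[['pen', 'pen', 'paper', 'paper'],
                              [1, 2, 1, 2]])

pen = frame['pen']           # outer column label: a DataFrame with columns 1 and 2
white = frame.loc['white']   # outer row label: a DataFrame with rows up and down
cell = frame.loc[('white', 'down'), ('paper', 2)]   # one cell, addressed by full tuples
print(pen)
print(white)
print(cell)
```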
27. Reordering and sorting levels
>>> frame
pen paper
1 2 1 2
white up -0.487631 0.200648 0.344613 0.144835
down 0.246683 -0.847063 -0.391592 -0.091928
red up -0.132962 -1.728167 1.787231 0.374895
down -1.033622 0.354458 0.007813 -1.203889
>>> frame.index.names=['colors','status']
>>> frame.columns.names=['objects','id']
>>> frame
objects pen paper
id 1 2 1 2
colors status
white up -0.487631 0.200648 0.344613 0.144835
down 0.246683 -0.847063 -0.391592 -0.091928
red up -0.132962 -1.728167 1.787231 0.374895
down -1.033622 0.354458 0.007813 -1.203889
>>> frame.swaplevel('colors','status')  # swap the order of the 'colors' and 'status' row levels
objects pen paper
id 1 2 1 2
status colors
up white -0.487631 0.200648 0.344613 0.144835
down white 0.246683 -0.847063 -0.391592 -0.091928
up red -0.132962 -1.728167 1.787231 0.374895
down red -1.033622 0.354458 0.007813 -1.203889
>>> frame
objects pen paper
id 1 2 1 2
colors status
white up -0.487631 0.200648 0.344613 0.144835
down 0.246683 -0.847063 -0.391592 -0.091928
red up -0.132962 -1.728167 1.787231 0.374895
down -1.033622 0.354458 0.007813 -1.203889
>>> frame.sort_index(level='colors')  # sort the rows alphabetically by the 'colors' level (older pandas: frame.sortlevel())
objects pen paper
id 1 2 1 2
colors status
red down -1.033622 0.354458 0.007813 -1.203889
up -0.132962 -1.728167 1.787231 0.374895
white down 0.246683 -0.847063 -0.391592 -0.091928
up -0.487631 0.200648 0.344613 0.144835
28. Statistics by level
1) To aggregate by a row level, pass the level's name as the level argument
>>> frame
objects pen paper
id 1 2 1 2
colors status
white up -0.487631 0.200648 0.344613 0.144835
down 0.246683 -0.847063 -0.391592 -0.091928
red up -0.132962 -1.728167 1.787231 0.374895
down -1.033622 0.354458 0.007813 -1.203889
>>> frame.groupby(level='colors',sort=False).sum()  # sum over the 'colors' row level (older pandas: frame.sum(level='colors'))
objects pen paper
id 1 2 1 2
colors
white -0.240947 -0.646416 -0.046978 0.052907
red -1.166584 -1.373709 1.795044 -0.828994
2) To aggregate by a column level
>>> frame
objects pen paper
id 1 2 1 2
colors status
white up -0.487631 0.200648 0.344613 0.144835
down 0.246683 -0.847063 -0.391592 -0.091928
red up -0.132962 -1.728167 1.787231 0.374895
down -1.033622 0.354458 0.007813 -1.203889
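>>> frame.T.groupby(level='id').sum().T  # sum over the 'id' column level by transposing, grouping, and transposing back (older pandas: frame.sum(level='id',axis=1))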
id 1 2
colors status
white up -0.143017 0.345483
down -0.144909 -0.938991
red up 1.654270 -1.353272
down -1.025809 -0.849432
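A reproducible sketch of both level-wise sums, written with groupby(level=...). It uses the same labels and level names as above, but fixed integers from np.arange(16) instead of random data; that substitution is an assumption made so the sums can be checked by hand.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(16).reshape(4, 4),
                     index=[['white', 'white', 'red', 'red'],
                            ['up', 'down', 'up', 'down']],
                     columns=[['pen', 'pen', 'paper', 'paper'],
                              [1, 2, 1, 2]])
frame.index.names = ['colors', 'status']
frame.columns.names = ['objects', 'id']

# Row level: add up the rows that share the same 'colors' label.
by_color = frame.groupby(level='colors', sort=False).sum()

# Column level: transpose, group the former columns by 'id', transpose back.
by_id = frame.T.groupby(level='id').sum().T
print(by_color)
print(by_id)
```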