Importing the pandas package
import numpy as np
import pandas as pd
Creating objects
- Series
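A Series is a one-dimensional, array-like object made up of a sequence of values and an associated set of data labels (its index).
You can create a Series by passing a list.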
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: s = pd.Series([1,3,5,np.nan,6,8])
In [4]: s
Out[4]:
0 1
1 3
2 5
3 NaN
4 6
5 8
dtype: float64
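As a minimal sketch of the same idea, the snippet below builds the Series again and also one with an explicit index. The name `labeled` and its values are illustrative assumptions, not part of the original session.

```python
import numpy as np
import pandas as pd

# A list that contains np.nan is stored as float64, because NaN is a float.
s = pd.Series([1, 3, 5, np.nan, 6, 8])

# Without an index argument the labels are 0..n-1; an explicit index
# (hypothetical labels here) can be supplied instead.
labeled = pd.Series([10, 20, 30], index=["a", "b", "c"])

print(s.dtype)
print(labeled["b"])      # look up a value by its label
print(s.isna().sum())    # number of missing values
```

The default integer index and a custom label index behave the same way for lookups; only the labels differ.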
- DataFrame
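A DataFrame is a tabular data structure containing an ordered collection of columns, each of which can hold a different value type. A DataFrame has both a row index and a column index, and can be thought of as a dictionary of Series.
You can create a DataFrame by passing a NumPy array together with a datetime index and column labels.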
In [6]: dates=pd.date_range('20130101',periods=6)
In [9]: dates
Out[9]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01, ..., 2013-01-06]
Length: 6, Freq: D, Timezone: None
In [11]: df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))
In [12]: df
Out[12]:
A B C D
2013-01-01 -0.558163 -2.595570 -0.740695 0.573137
2013-01-02 -0.442056 0.317163 -0.438089 -1.043022
2013-01-03 0.747714 0.295502 0.231061 -0.314072
2013-01-04 0.861701 0.494772 -0.744115 0.404745
2013-01-05 -0.204729 1.073058 -0.653405 0.248553
2013-01-06 1.242062 -1.018416 -1.744268 -0.881623
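Because np.random.randn draws fresh values on every call, the numbers printed for df differ from run to run. The sketch below is one way to get a reproducible frame of the same shape; the seed value 0 and the use of NumPy's Generator API are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)

# A seeded generator makes the random values repeatable (seed chosen arbitrarily).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((6, 4)), index=dates, columns=list("ABCD"))

print(df.shape)
print(df.index[0], df.index[-1])
print(df.columns.tolist())
```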
A DataFrame can also be created by passing a dict whose values can each be converted to something series-like.
In [16]: df2 = pd.DataFrame({'A':1,
....: 'B':pd.Timestamp('20130102'),
....: 'C':pd.Series(1,index=list(range(4)),dtype='float32'),
....: 'D':np.array([3]*4,dtype='int32'),
....: 'E':pd.Categorical(["test","train","test","train"]),
....: 'F':'foo'})
In [17]: df2
Out[17]:
A B C D E F
0 1 2013-01-02 1 3 test foo
1 1 2013-01-02 1 3 train foo
2 1 2013-01-02 1 3 test foo
3 1 2013-01-02 1 3 train foo
Viewing the data types of the individual columns.
In [19]: df2.dtypes
Out[19]:
A int64
B datetime64[ns]
C float32
D int32
E object
F object
dtype: object
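The following sketch repeats the dict-based construction as a plain script and inspects the result. Scalars such as 1 and 'foo' are broadcast to every row. Note that recent pandas releases are expected to report column E as category rather than object; the exact dtype strings for the other columns can vary by version, so only a few are checked here.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({
    "A": 1,                                   # scalar, broadcast to every row
    "B": pd.Timestamp("20130102"),            # scalar timestamp, broadcast
    "C": pd.Series(1, index=list(range(4)), dtype="float32"),
    "D": np.array([3] * 4, dtype="int32"),
    "E": pd.Categorical(["test", "train", "test", "train"]),
    "F": "foo",
})

print(len(df2))
print(df2.dtypes)

# select_dtypes picks columns by type; "number" keeps only numeric columns.
print(df2.select_dtypes(include="number").columns.tolist())
```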
Viewing data
# Iterate over the rows with a for loop
for index, row in df.iterrows():
    print('Row index:', index)
    print('Row data:', row)
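For comparison, here is a small sketch of row iteration on a toy frame (the frame and its column values are made up for illustration). It shows that iterrows yields each row as a Series, and that itertuples is an alternative that yields namedtuples.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [0.5, 1.5]}, index=["x", "y"])

for index, row in df.iterrows():
    # row is a Series; with mixed int/float columns its values are upcast to float64
    print(index, row["A"], row["B"])

for tup in df.itertuples():
    # tup is a namedtuple; the row label is available as tup.Index
    print(tup.Index, tup.A, tup.B)

# Collect the rows as plain tuples
rows = [(t.Index, t.A, t.B) for t in df.itertuples()]
```

itertuples keeps each column's own type, which is one reason it is often preferred when the row values are needed as-is.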
- Viewing the first and last rows of a DataFrame
df.head()
df.tail()
In [20]: df.head()
Out[20]:
A B C D
2013-01-01 0.651580 1.655077 -1.196456 -0.533145
2013-01-02 -0.637116 -0.159470 -0.294879 1.014046
2013-01-03 -0.490999 0.027923 0.751392 0.665089
2013-01-04 -0.014505 0.093663 0.597468 1.045386
2013-01-05 1.576453 0.085662 -1.461838 -0.973813
In [21]: df.tail(2)
Out[21]:
A B C D
2013-01-05 1.576453 0.085662 -1.461838 -0.973813
2013-01-06 -0.525558 0.142466 0.305336 -0.551035
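A short sketch of how the row count argument behaves, using a toy six-row frame (an illustrative stand-in for df):

```python
import pandas as pd

df = pd.DataFrame({"A": range(6)}, index=pd.date_range("20130101", periods=6))

print(len(df.head()))     # head() without an argument returns the first 5 rows
print(df.tail(2).index)   # the last 2 rows
print(len(df.head(-2)))   # a negative n returns everything except the last 2 rows
```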
- Displaying the index, the columns, and the underlying NumPy data
df.index
df.columns
df.values
In [22]: df.index
Out[22]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01, ..., 2013-01-06]
Length: 6, Freq: D, Timezone: None
In [26]: df.columns
Out[26]: Index(['A', 'B', 'C', 'D'], dtype='object')
In [27]: df.values
Out[27]:
array([[ 0.65157962, 1.65507664, -1.19645589, -0.53314538],
[-0.63711581, -0.15946969, -0.29487877, 1.0140458 ],
[-0.49099863, 0.02792297, 0.7513917 , 0.6650887 ],
[-0.01450487, 0.0936628 , 0.59746839, 1.04538564],
[ 1.57645343, 0.08566164, -1.46183828, -0.97381329],
[-0.52555849, 0.14246563, 0.30533559, -0.55103523]])
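The three attributes can be inspected together as in the sketch below. It also uses DataFrame.to_numpy(), which newer pandas releases offer as an alternative to .values; the toy data is an assumption for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(12.0).reshape(3, 4),
    index=pd.date_range("20130101", periods=3),
    columns=list("ABCD"),
)

arr = df.to_numpy()               # the data as a plain NumPy array

print(type(df.index).__name__)    # the row index type
print(df.columns.tolist())        # the column labels
print(arr.shape, arr.dtype)
```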
- Quick statistical summary of the data
df.describe()
In [28]: df.describe()
Out[28]:
A B C D
count 6.000000 6.000000 6.000000 6.000000
mean 0.093309 0.307553 -0.216496 0.111088
std 0.869591 0.668485 0.936909 0.897287
min -0.637116 -0.159470 -1.461838 -0.973813
25% -0.516919 0.042358 -0.971062 -0.546563
50% -0.252752 0.089662 0.005228 0.065972
75% 0.485058 0.130265 0.524435 0.926807
max 1.576453 1.655077 0.751392 1.045386
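To make the rows of the summary concrete, the sketch below runs describe() on a tiny hand-checkable column (values chosen for illustration) and compares two of the statistics with direct computations.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0]})
summary = df.describe()

print(summary.index.tolist())     # the statistics reported, in order
print(summary.loc["mean", "A"])   # arithmetic mean
print(summary.loc["50%", "A"])    # median

# std in describe() is the sample standard deviation (ddof=1)
sample_std = np.std(df["A"].to_numpy(), ddof=1)
print(summary.loc["std", "A"], sample_std)
```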
- Transposing the data
df.T
In [29]: df.T
Out[29]:
2013-01-01 2013-01-02 2013-01-03 2013-01-04 2013-01-05 2013-01-06
A 0.651580 -0.637116 -0.490999 -0.014505 1.576453 -0.525558
B 1.655077 -0.159470 0.027923 0.093663 0.085662 0.142466
C -1.196456 -0.294879 0.751392 0.597468 -1.461838 0.305336
D -0.533145 1.014046 0.665089 1.045386 -0.973813 -0.551035
- Sorting along an axis
df.sort_index()
In [32]: df.sort_index(axis=1,ascending=False)
Out[32]:
D C B A
2013-01-01 -0.533145 -1.196456 1.655077 0.651580
2013-01-02 1.014046 -0.294879 -0.159470 -0.637116
2013-01-03 0.665089 0.751392 0.027923 -0.490999
2013-01-04 1.045386 0.597468 0.093663 -0.014505
2013-01-05 -0.973813 -1.461838 0.085662 1.576453
2013-01-06 -0.551035 0.305336 0.142466 -0.525558
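The axis argument decides whether row labels or column labels are sorted. A minimal sketch on a toy frame (labels chosen for illustration):

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6]],
    index=["b", "a"],
    columns=["B", "A", "C"],
)

by_rows = df.sort_index()                          # axis=0: sort by row labels
by_cols = df.sort_index(axis=1, ascending=False)   # axis=1: sort by column labels, descending

print(by_rows.index.tolist())
print(by_cols.columns.tolist())
```

sort_index returns a new object; the original frame keeps its order unless the result is assigned back.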
- Sorting by values
df.sort_values()
In [33]: df.sort_values(by='C')
Out[33]:
A B C D
2013-01-05 1.576453 0.085662 -1.461838 -0.973813
2013-01-01 0.651580 1.655077 -1.196456 -0.533145
2013-01-02 -0.637116 -0.159470 -0.294879 1.014046
2013-01-06 -0.525558 0.142466 0.305336 -0.551035
2013-01-04 -0.014505 0.093663 0.597468 1.045386
2013-01-03 -0.490999 0.027923 0.751392 0.665089
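In current pandas releases, sorting by the values of a column is done with DataFrame.sort_values. The sketch below uses a toy frame (values are illustrative) to show ascending and descending order.

```python
import pandas as pd

df = pd.DataFrame({"C": [0.3, -1.4, 0.7], "D": [1, 2, 3]}, index=["x", "y", "z"])

asc = df.sort_values(by="C")                     # smallest C first
desc = df.sort_values(by="C", ascending=False)   # largest C first

print(asc.index.tolist())
print(desc.index.tolist())
```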
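Selecting data
The optimized pandas data access methods are at, iat, loc, and iloc. Older releases also offered ix, which was deprecated in pandas 0.20 and removed in pandas 1.0.
- Getting data
Selecting a single column returns a Series; this is equivalent to df.A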
In [10]: df['A']
Out[10]:
2013-01-01 -0.251925
2013-01-02 -0.059640
2013-01-03 0.162197
2013-01-04 -0.217098
2013-01-05 0.264930
2013-01-06 -1.974721
Freq: D, Name: A, dtype: float64
Selecting with [] slices the rows
In [11]: df[0:3]
Out[11]:
A B C D
2013-01-01 -0.251925 -0.064685 0.392655 -2.083434
2013-01-02 -0.059640 -0.745822 0.701752 0.209347
2013-01-03 0.162197 -0.879785 1.634737 0.087359
In [13]: df['20130102':'20130104']
Out[13]:
A B C D
2013-01-02 -0.059640 -0.745822 0.701752 0.209347
2013-01-03 0.162197 -0.879785 1.634737 0.087359
2013-01-04 -0.217098 2.251497 -0.541654 2.451352
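The sketch below gathers the three uses of [] in one place, on a frame filled with consecutive integers so the results are easy to verify (the data is an illustrative assumption).

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

col = df["A"]                            # a column label returns a Series
first3 = df[0:3]                         # an integer slice selects rows by position
by_label = df["20130102":"20130104"]     # a label slice selects rows, end label included

print(type(col).__name__)
print(len(first3), len(by_label))
```

Integer slices follow the usual Python convention and exclude the stop position, while label slices include both endpoints.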
- Selection by label
loc[]
Getting a cross section using a label
In [15]: df.loc[dates[0]]
Out[15]:
A -0.251925
B -0.064685
C 0.392655
D -2.083434
Name: 2013-01-01 00:00:00, dtype: float64
In [16]: df.loc['2013-01-01']
Out[16]:
A -0.251925
B -0.064685
C 0.392655
D -2.083434
Name: 2013-01-01 00:00:00, dtype: float64
Selecting on multiple axes by label
In [17]: df.loc[:,['A','B']]
Out[17]:
A B
2013-01-01 -0.251925 -0.064685
2013-01-02 -0.059640 -0.745822
2013-01-03 0.162197 -0.879785
2013-01-04 -0.217098 2.251497
2013-01-05 0.264930 -0.512883
2013-01-06 -1.974721 0.786016
Label slicing (both endpoints are included)
In [23]: df.loc['20130101':'20130103','A':'C']
Out[23]:
A B C
2013-01-01 -0.251925 -0.064685 0.392655
2013-01-02 -0.059640 -0.745822 0.701752
2013-01-03 0.162197 -0.879785 1.634737
Reducing the dimensions of the returned object
In [21]: df.loc['20130101',['A','B']]
Out[21]:
A -0.251925
B -0.064685
Name: 2013-01-01 00:00:00, dtype: float64
Getting a scalar value
In [24]: df.loc[dates[0],'A']
Out[24]: -0.2519246989360483
Fast access to a scalar value
In [25]: df.at[dates[0],'A']
Out[25]: -0.2519246989360483
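The label-based accessors can be summarized in a short script. This is a sketch on a frame of consecutive integers (an assumption for illustration) so that each result can be checked by hand.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

row = df.loc[dates[0]]                           # one row label: a Series
sub = df.loc["20130101":"20130103", "A":"C"]     # label slices on both axes
pair = df.loc[dates[0], ["A", "B"]]              # one row, two columns: a Series
scalar = df.loc[dates[0], "A"]                   # one row, one column: a scalar
fast = df.at[dates[0], "A"]                      # at is meant for single-value access

print(sub.shape)
print(scalar, fast)
```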
- Selection by position
iloc[]
Selecting by position with an integer (this selects a row)
In [27]: df.iloc[3]
Out[27]:
A -0.217098
B 2.251497
C -0.541654
D 2.451352
Name: 2013-01-04 00:00:00, dtype: float64
Slicing by integers, which behaves like NumPy/Python slicing
In [28]: df.iloc[3:5,0:2]
Out[28]:
A B
2013-01-04 -0.217098 2.251497
2013-01-05 0.264930 -0.512883
Selecting with lists of integer positions
In [30]: df.iloc[[1,2,5],[0,3]]
Out[30]:
A D
2013-01-02 -0.059640 0.209347
2013-01-03 0.162197 0.087359
2013-01-06 -1.974721 -0.166359
Slicing rows
In [31]: df.iloc[1:3,:]
Out[31]:
A B C D
2013-01-02 -0.059640 -0.745822 0.701752 0.209347
2013-01-03 0.162197 -0.879785 1.634737 0.087359
Slicing columns
In [32]: df.iloc[:,1:3]
Out[32]:
B C
2013-01-01 -0.064685 0.392655
2013-01-02 -0.745822 0.701752
2013-01-03 -0.879785 1.634737
2013-01-04 2.251497 -0.541654
2013-01-05 -0.512883 0.762724
2013-01-06 0.786016 -1.453113
Getting a specific value
In [33]: df.iloc[1,1]
Out[33]: -0.74582152264492807
In [34]: df.iat[1,1]
Out[34]: -0.74582152264492807
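The positional accessors follow the same pattern as the label-based ones, but take integers. A sketch on a frame of consecutive integers (illustrative data):

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

row = df.iloc[3]                      # the fourth row as a Series
block = df.iloc[3:5, 0:2]             # rows 3-4, columns 0-1 (stop positions excluded)
picked = df.iloc[[1, 2, 5], [0, 3]]   # explicit lists of row and column positions
value = df.iloc[1, 1]                 # a single value
fast = df.iat[1, 1]                   # iat is meant for single-value access

print(row.tolist())
print(block.shape, picked.columns.tolist())
print(value, fast)
```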
- Boolean indexing
Using the values of a single column to select data
In [35]: df[df.A>0]
Out[35]:
A B C D
2013-01-03 0.162197 -0.879785 1.634737 0.087359
2013-01-05 0.264930 -0.512883 0.762724 -1.468283
Selecting data with a where operation
In [36]: df[df>0]
Out[36]:
A B C D
2013-01-01 NaN NaN 0.392655 NaN
2013-01-02 NaN NaN 0.701752 0.209347
2013-01-03 0.162197 NaN 1.634737 0.087359
2013-01-04 NaN 2.251497 NaN 2.451352
2013-01-05 0.264930 NaN 0.762724 NaN
2013-01-06 NaN 0.786016 NaN NaN
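The two forms differ in what they keep: a boolean Series drops whole rows, while a boolean DataFrame keeps the shape and fills non-matching cells with NaN. A small sketch with made-up values:

```python
import pandas as pd

df = pd.DataFrame({"A": [-1.0, 2.0, 3.0], "B": [4.0, -5.0, 6.0]})

rows = df[df.A > 0]        # boolean Series: keeps only rows where A is positive
masked = df[df > 0]        # boolean DataFrame: same shape, NaN where the test fails
same = df.where(df > 0)    # the explicit where() spelling of the previous line

print(rows.index.tolist())
print(masked)
```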
Filtering data with the isin() method
In [39]: df
Out[39]:
A B C D E
2013-01-01 -0.251925 -0.064685 0.392655 -2.083434 one
2013-01-02 -0.059640 -0.745822 0.701752 0.209347 ome
2013-01-03 0.162197 -0.879785 1.634737 0.087359 two
2013-01-04 -0.217098 2.251497 -0.541654 2.451352 three
2013-01-05 0.264930 -0.512883 0.762724 -1.468283 four
2013-01-06 -1.974721 0.786016 -1.453113 -0.166359 three
In [40]: df[df['E'].isin(['two','four'])]
Out[40]:
A B C D E
2013-01-03 0.162197 -0.879785 1.634737 0.087359 two
2013-01-05 0.264930 -0.512883 0.762724 -1.468283 four
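isin() returns a boolean mask, so it can also be negated with ~ to exclude values. A minimal sketch on a toy frame (column contents are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4], "E": ["one", "two", "three", "four"]})

mask = df["E"].isin(["two", "four"])   # True where E is one of the listed values
selected = df[mask]
rest = df[~mask]                       # the complement of the selection

print(selected["E"].tolist())
print(rest["E"].tolist())
```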
- Setting new columns and values
Setting a new column
In [41]: sl=pd.Series([1,2,3,4,5,6],index=pd.date_range('20130102',periods=6))
In [42]: sl
Out[42]:
2013-01-02 1
2013-01-03 2
2013-01-04 3
2013-01-05 4
2013-01-06 5
2013-01-07 6
Freq: D, dtype: int64
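The session above builds sl but does not show the assignment itself. Assuming the intent is to attach it as a new column (called F here, which is an assumption), the sketch below shows that the values are aligned on the index rather than by position.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame({"A": np.zeros(6)}, index=dates)

# sl starts one day later than df and ends one day later
sl = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range("20130102", periods=6))

# Assignment aligns on the index: 2013-01-01 has no match and becomes NaN,
# and the 2013-01-07 entry of sl is not in df, so it is left out.
df["F"] = sl

print(df)
```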
Setting a value by label
In [45]: df.at[dates[0],'A']=0
Setting a value by position
In [47]: df.iat[1,2]=0
Setting a group of values by assigning a NumPy array
In [50]: df.loc[:,'D']=np.array([5]*len(df))
In [51]: df
Out[51]:
A B C D E
2013-01-01 0.000000 -0.064685 0.392655 5 one
2013-01-02 -0.059640 -0.745822 0.000000 5 ome
2013-01-03 0.162197 -0.879785 1.634737 5 two
2013-01-04 -0.217098 2.251497 -0.541654 5 three
2013-01-05 0.264930 -0.512883 0.762724 5 four
2013-01-06 -1.974721 0.786016 -1.453113 5 three
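The three assignment forms can be run together as in the sketch below. The frame of ones is an illustrative assumption that makes the changed cells easy to spot.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.ones((6, 4)), index=dates, columns=list("ABCD"))

df.at[dates[0], "A"] = 0                    # set one cell by label
df.iat[1, 2] = 0                            # set one cell by position (row 1, column C)
df.loc[:, "D"] = np.array([5] * len(df))    # set a whole column from an array

print(df)
```

The array assigned to a column must have as many elements as the frame has rows.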
Setting values with a where operation
In [69]: df2=df.copy()
In [70]: df2
Out[70]:
A B C D E F
2013-01-01 0.000000 -0.064685 0.392655 5 one NaN
2013-01-02 -0.059640 -0.745822 0.000000 5 ome 1
2013-01-03 0.162197 -0.879785 1.634737 5 two 2
2013-01-04 -0.217098 2.251497 -0.541654 5 three 3
2013-01-05 0.264930 -0.512883 0.762724 5 four 4
2013-01-06 -1.974721 0.786016 -1.453113 5 three 5
In [71]: df2=df2[df2>0]
In [72]: df2
Out[72]:
A B C D E F
2013-01-01 NaN NaN 0.392655 5 one NaN
2013-01-02 NaN NaN NaN 5 ome 1
2013-01-03 0.162197 NaN 1.634737 5 two 2
2013-01-04 NaN 2.251497 NaN 5 three 3
2013-01-05 0.264930 NaN 0.762724 5 four 4
2013-01-06 NaN 0.786016 NaN 5 three 5
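In recent pandas releases, comparing a frame that contains a string column against 0 may raise a TypeError, so the sketch below restricts the where operation to the numeric columns first. It also shows assignment through a boolean mask. The toy frame and the names num, masked, and flipped are illustrative assumptions.

```python
import pandas as pd

df2 = pd.DataFrame({"A": [-1.0, 2.0], "B": [3.0, -4.0], "E": ["one", "two"]})

# Keep only the numeric columns so that the comparison with 0 is well defined
num = df2.select_dtypes(include="number")

masked = num[num > 0]               # NaN wherever the value is not positive

flipped = num.copy()
flipped[flipped > 0] = -flipped     # where-style assignment: negate the positive cells

print(masked)
print(flipped)
```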