A pandas User Guide (Part 1)

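pandas is a module that provides a data structure similar to the data frame in R, which makes it a very efficient tool for anyone who regularly does machine learning or data processing. The pandas site already offers a quick-start guide, but it is rather long (even though it is billed as learnable in ten minutes), so I decided to write a condensed version for beginners who have just started using pandas. Readers who already have some pandas experience can skip this article.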

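This article covers:

An explanation of the basic structure of a dataframe
Several common ways to create a dataframe
Several common ways to retrieve data from a dataframe
Several ways to modify the data in an existing dataframe
Two frequently used dataframe functions: groupby and apply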

We start by importing the modules we need:

In [1]: import pandas as pd
In [2]: import numpy as np

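Basic concepts

series

A series is the basic data structure in pandas. It resembles a list, but it carries its own index. In practice, each column of a dataframe corresponds to one series. Because a series has an index just as a dataframe does, you must check whether the two indexes match whenever you add a series to a dataframe.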

In [3]: s1 = pd.Series([1,3,5,6,8])

In [4]: s1
Out[4]: 
0    1
1    3
2    5
3    6
4    8
dtype: int64

In [5]: s2 = pd.Series(['a', 'b', 'c'])

In [6]: s2
Out[6]:
0    a
1    b
2    c
dtype: object
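
The point about matching indexes can be made concrete with a small sketch. The names frame and extra are made up for this illustration. The assumption being shown is that column assignment aligns on index labels rather than on position, so labels without a partner end up as NaN or are dropped.

```python
import pandas as pd

# A small frame with the default index 0, 1, 2.
frame = pd.DataFrame({"A": [10, 20, 30]})

# A series whose index only partly overlaps with the frame's index.
extra = pd.Series([1.0, 2.0, 3.0], index=[1, 2, 3])

# Assignment aligns on index labels, not on position:
# label 0 has no partner in `extra` and becomes NaN,
# label 3 does not exist in `frame` and is dropped.
frame["B"] = extra
print(frame)
```

If the values are meant to be taken in order regardless of labels, passing extra.values (a plain array) instead of the series avoids the alignment step, provided the lengths match.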

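dataframe

A pandas dataframe is much like a data frame in R: it provides a table-like data structure in which every column is a series, and all of these series share the index of the dataframe as a whole.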

In [7]: index = list(range(6))

In [8]: index
Out[8]: [0, 1, 2, 3, 4, 5]

In [9]: df = pd.DataFrame(np.random.randn(6,4), index=index, columns=list('ABCD'))

In [10]: df
Out[10]:
          A         B         C         D
0  0.531734 -1.292527 -0.627073 -1.900115
1 -0.715616  2.007529 -0.535181 -0.055376
2 -0.051779  0.751669 -1.150100  1.295839
3 -1.749876  0.962153  0.489654 -0.134320
4 -0.350185 -0.495535  0.116741  1.192340
5  0.844690 -0.111987 -1.221936 -0.714337

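Index

Indexes are used by both dataframes and series. In the simplest case an index is a data structure similar to an array of integers, although labels of other types, such as strings, are possible as well. The outputs below come from an older pandas release; since pandas 2.0 the Int64Index class has been removed and such an index is displayed as a plain Index with dtype='int64'.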


In [11]: df.index
Out[11]: Int64Index([0, 1, 2, 3, 4, 5], dtype='int64')

In [12]: type(df.index)
Out[12]: pandas.indexes.numeric.Int64Index

Note, however, that the index of a dataframe or series does not necessarily start at 0, and the index you end up with can change as the data is manipulated.


In [13]: df2 = df[3:5]

In [14]: df2
Out[14]:
          A         B         C        D
3 -1.749876  0.962153  0.489654 -0.13432
4 -0.350185 -0.495535  0.116741  1.19234

In [15]: df2.index
Out[15]: Int64Index([3, 4], dtype='int64')
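
When a slice should be renumbered from 0 again, reset_index is one option. The sketch below is an illustration rather than part of the original session: it uses np.arange instead of random numbers so the values are predictable, and the names sub and renumbered are made up here.

```python
import numpy as np
import pandas as pd

# Deterministic 6x4 frame; row i holds the values 4*i .. 4*i + 3.
df = pd.DataFrame(np.arange(24).reshape(6, 4), columns=list("ABCD"))

# Slicing keeps the original labels 3 and 4.
sub = df[3:5]

# drop=True discards the old labels instead of keeping them as a column.
renumbered = sub.reset_index(drop=True)

print(list(sub.index))
print(list(renumbered.index))
```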

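Creating a dataframe

A pandas dataframe can be created in many ways: it can be built by hand, or generated directly by reading a file or a database. Each approach is introduced briefly below.

Creating one by hand

If there is no existing file or database table, we may need to create a dataframe by hand and keep assigning values to it while the program runs. The basic constructor call is pd.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False) (in recent pandas releases the default for copy is None).

Its parameters are data, index, columns, dtype and copy; I will cover the first three here:

data: the data to store in the dataframe; it can be a 2D NumPy array or a plain 2D list.
index: a sequence to use as the index of the dataframe. Note that pandas does not check it for duplicates, so check it yourself before using it as an index to avoid errors later.
columns: a list of strings, one per column name, in the same order as the columns.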


In [16]: index = [1, 1, 2, 3, 4]

In [17]: index
Out[17]: [1, 1, 2, 3, 4]

In [18]: data = [list(range(4)), list(range(1, 5)), list(range(2, 6)), list(range(3, 7)), list(range(4, 8))]

In [19]: data
Out[19]: [[0, 1, 2, 3], [1, 2, 3, 4], [2, 3, 4, 5], [3, 4, 5, 6], [4, 5, 6, 7]]

In [20]: df2 = pd.DataFrame(data=data, index=index, columns=list('ABCD'))

In [21]: df2
Out[21]:
   A  B  C  D
1  0  1  2  3
1  1  2  3  4
2  2  3  4  5
3  3  4  5  6
4  4  5  6  7
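
Since pandas accepts the duplicated label 1 without complaint, a quick check before relying on the index can help. The sketch below rebuilds the same df2 and uses Index.is_unique and Index.duplicated; it is one possible way to do the check, not something the original session ran.

```python
import pandas as pd

index = [1, 1, 2, 3, 4]
data = [list(range(i, i + 4)) for i in range(5)]
df2 = pd.DataFrame(data=data, index=index, columns=list("ABCD"))

# False, because the label 1 appears twice.
print(df2.index.is_unique)

# Boolean array marking every repeat of an earlier label.
print(df2.index.duplicated())

# A duplicated label returns a dataframe (two rows here),
# while a unique label returns a single series.
print(type(df2.loc[1]).__name__)
print(type(df2.loc[2]).__name__)
```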

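Reading from a file

Taking a CSV file as an example, the read_csv function is all you need. Suppose the working directory contains a file named weather.csv with the following contents: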

,date,rain,snow,weather
0,20160101,0,0,0
1,20160102,0,0,0
2,20160103,0,0,0
3,20160104,0,0,0
4,20160105,0,0,0
5,20160106,0,0,0
6,20160107,0,0,0
7,20160108,0,0,0
8,20160109,1,0,1
9,20160110,1,0,1
10,20160111,0,0,0

It is read as follows:


In [22]: df2 = pd.read_csv('weather.csv')

In [23]: df2
Out[23]:
    Unnamed: 0      date  rain  snow  weather
0            0  20160101     0     0        0
1            1  20160102     0     0        0
2            2  20160103     0     0        0
3            3  20160104     0     0        0
4            4  20160105     0     0        0
5            5  20160106     0     0        0
6            6  20160107     0     0        0
7            7  20160108     0     0        0
8            8  20160109     1     0        1
9            9  20160110     1     0        1
10          10  20160111     0     0        0

Here the first column has no header in the CSV file, so after reading it the column is named Unnamed: 0 in the dataframe.
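
If that unnamed first column is really the saved index, read_csv can be told to use it as such through the index_col parameter. The sketch below assumes this is what the file represents, and reads a shortened copy of the data from a string via io.StringIO so that it runs without a weather.csv on disk.

```python
import io

import pandas as pd

# First three rows of the weather.csv shown above, kept inline.
csv_text = """,date,rain,snow,weather
0,20160101,0,0,0
1,20160102,0,0,0
2,20160103,0,0,0
"""

# Default behaviour: the header-less column becomes "Unnamed: 0".
default = pd.read_csv(io.StringIO(csv_text))

# index_col=0 uses the first column as the index instead.
with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(default.columns.tolist())
print(with_index.columns.tolist())
```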

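Reading from SQL

To read from a SQL database, we need to know a little SQL and have sqlalchemy installed in Python (along with a driver for the database in question, such as psycopg2 for PostgreSQL); then we use the read_sql function.
Suppose our database contains a table named weather whose contents are essentially the same as the CSV file. An example follows (we use PostgreSQL here, but MySQL and others work in much the same way):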


In [24]: from sqlalchemy import create_engine

In [25]: database_engine = create_engine('postgresql://[username]:[password]@localhost:5432/[database_name]', echo=False)

In [26]: weather_df = pd.read_sql("SELECT * FROM weather", con=database_engine)

In [27]: weather_df
Out[27]:
    id      date  rain  snow  weather
0    0  20160101     0     0        0
1    1  20160102     0     0        0
2    2  20160103     0     0        0
3    3  20160104     0     0        0
4    4  20160105     0     0        0
5    5  20160106     0     0        0
6    6  20160107     0     0        0
7    7  20160108     0     0        0
8    8  20160109     1     0        1
9    9  20160110     1     0        1
10  10  20160111     0     0        0
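
For trying read_sql without a PostgreSQL server, the following sketch swaps in an in-memory SQLite database from the standard library. This is an assumption made purely for illustration: the table is created by hand with three of the rows above, and the sqlite3 connection is passed to read_sql directly instead of a sqlalchemy engine. It also shows a parameterised query through the params argument.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database standing in for the PostgreSQL one.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE weather "
    "(id INTEGER, date INTEGER, rain INTEGER, snow INTEGER, weather INTEGER)"
)
conn.executemany(
    "INSERT INTO weather VALUES (?, ?, ?, ?, ?)",
    [
        (0, 20160101, 0, 0, 0),
        (1, 20160102, 0, 0, 0),
        (8, 20160109, 1, 0, 1),
    ],
)
conn.commit()

# "?" is the SQLite placeholder style; other drivers use different ones.
rainy = pd.read_sql(
    "SELECT id, date, rain FROM weather WHERE rain = ?",
    con=conn,
    params=(1,),
)
conn.close()

print(rainy)
```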

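Using a dataframe

Using a dataframe mostly comes down to viewing and retrieving the data in it as needed.

Viewing data

Viewing a dataframe mainly means getting basic information about the whole table, getting a rough sense of the data it holds, and looking at a few specific rows. Each is introduced below.

Viewing basic information about the table

Use the info function.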


In [28]: weather_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11 entries, 0 to 10
Data columns (total 5 columns):
id         11 non-null int64
date       11 non-null int64
rain       11 non-null int64
snow       11 non-null int64
weather    11 non-null int64
dtypes: int64(5)
memory usage: 512.0 bytes

Viewing the start or end of the table

To view the start: use the head function
To view the end: use the tail function


In [30]: weather_df.head(5)
Out[30]:
   id      date  rain  snow  weather
0   0  20160101     0     0        0
1   1  20160102     0     0        0
2   2  20160103     0     0        0
3   3  20160104     0     0        0
4   4  20160105     0     0        0

In [31]: weather_df.tail(5)
Out[31]:
    id      date  rain  snow  weather
6    6  20160107     0     0        0
7    7  20160108     0     0        0
8    8  20160109     1     0        1
9    9  20160110     1     0        1
10  10  20160111     0     0        0
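
For the rough sense of the data mentioned at the start of this section, a numeric summary can complement info, head and tail. The sketch below uses shape and describe, which the session above does not show; weather_df is rebuilt by hand here so that the block runs on its own.

```python
import pandas as pd

# Rebuild the weather table used throughout this article.
rain = [0] * 8 + [1, 1, 0]
weather_df = pd.DataFrame({
    "id": list(range(11)),
    "date": list(range(20160101, 20160112)),
    "rain": rain,
    "snow": [0] * 11,
    "weather": rain,
})

# (number of rows, number of columns)
print(weather_df.shape)

# count, mean, std, min, quartiles and max for each numeric column
summary = weather_df.describe()
print(summary)
```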

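Retrieving data

For efficiency, the following four accessors are recommended for retrieving data: iloc, loc, iat and at.

Retrieving rows or columns

Retrieving multiple rows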


In [32]: weather_df[3:8]
Out[32]:
   id      date  rain  snow  weather
3   3  20160104     0     0        0
4   4  20160105     0     0        0
5   5  20160106     0     0        0
6   6  20160107     0     0        0
7   7  20160108     0     0        0

Retrieving a single row

Use iloc or loc.


In [33]: weather_df.iloc[3]
Out[33]:
id                3
date       20160104
rain              0
snow              0
weather           0
Name: 3, dtype: int64

In [34]: weather_df.loc[3]
Out[34]:
id                3
date       20160104
rain              0
snow              0
weather           0
Name: 3, dtype: int64

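iloc and loc return the same result here, but that does not mean they have the same meaning. loc selects a row by its index label, whereas iloc selects a row by its integer position. For example (pay attention to the index):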


In [35]: df = weather_df[3:8]

In [36]: df
Out[36]:
   id      date  rain  snow  weather
3   3  20160104     0     0        0
4   4  20160105     0     0        0
5   5  20160106     0     0        0
6   6  20160107     0     0        0
7   7  20160108     0     0        0

In [37]: df.iloc[3]
Out[37]:
id                6
date       20160107
rain              0
snow              0
weather           0
Name: 6, dtype: int64

In [38]: df.loc[3]
Out[38]:
id                3
date       20160104
rain              0
snow              0
weather           0
Name: 3, dtype: int64
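
Two further consequences of the label versus position distinction are sketched below on a reduced copy of the table (only the id and date columns, rebuilt by hand): a label that is missing from the index makes loc raise KeyError, and a loc slice includes its end label while an iloc slice excludes its end position.

```python
import pandas as pd

weather_df = pd.DataFrame({
    "id": list(range(11)),
    "date": list(range(20160101, 20160112)),
})
df = weather_df[3:8]  # index labels 3, 4, 5, 6, 7

# Position 0 is the first row, which carries the label 3.
first_by_position = df.iloc[0]

# There is no row labelled 0 in the slice, so loc raises KeyError.
try:
    df.loc[0]
    label_found = True
except KeyError:
    label_found = False

# loc slices by label and includes the end label: rows 3, 4, 5.
by_label = df.loc[3:5]

# iloc slices by position and excludes the end: positions 3 and 4,
# which are the rows labelled 6 and 7.
by_position = df.iloc[3:5]

print(label_found, list(by_label.index), list(by_position.index))
```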

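Retrieving a single column

There are two ways. Attribute access (df.date) only works when the column name is a valid Python identifier and does not clash with an existing dataframe attribute or method; the bracket form (df['date']) always works.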


In [39]: df.date
Out[39]:
3    20160104
4    20160105
5    20160106
6    20160107
7    20160108
Name: date, dtype: int64

In [40]: df['date']
Out[40]:
3    20160104
4    20160105
5    20160106
6    20160107
7    20160108
Name: date, dtype: int64

Retrieving multiple columns

Use loc or iloc.


In [41]: df.loc[:, ['date', 'snow']]
Out[41]:
       date  snow
3  20160104     0
4  20160105     0
5  20160106     0
6  20160107     0
7  20160108     0

In [42]: df.iloc[:, :2]
Out[42]:
   id      date
3   3  20160104
4   4  20160105
5   5  20160106
6   6  20160107
7   7  20160108

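The difference between the two is that loc uses index labels and column names to select the columns you want, whereas iloc uses row and column positions.

Retrieving a single value

To retrieve a single value, any of iloc, loc, iat and at will do.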


In [43]: df.iloc[3, 1]
Out[43]: 20160107

In [44]: df.iat[3, 1]
Out[44]: 20160107

In [45]: df.loc[3, 'date']
Out[45]: 20160104

In [46]: df.at[3, 'date']
Out[46]: 20160104

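As this example shows, iloc and iat both retrieve a single element by row and column position, whereas loc and at both retrieve it by index label and column name.

Retrieving data with boolean indexing

Sometimes we only want the part of the data that passes some test or comparison. For example, suppose I want the following sets of data from weather_df:

All rows whose date is between 20160106 and 20160110 (inclusive)
All rows where the column named weather has the value 1
All rows whose date is in the list [20160106, 20160104, 20160108]
For each of the three sets above, only the columns id, date and weather

The examples below work through these in order: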


In [47]: weather_df[(weather_df['date'] >= 20160106) & (weather_df['date'] <= 20160110)]
Out[47]:
   id      date  rain  snow  weather
5   5  20160106     0     0        0
6   6  20160107     0     0        0
7   7  20160108     0     0        0
8   8  20160109     1     0        1
9   9  20160110     1     0        1

In [48]: weather_df[weather_df['weather'] == 1]
Out[48]:
   id      date  rain  snow  weather
8   8  20160109     1     0        1
9   9  20160110     1     0        1

In [49]: weather_df[weather_df.date.isin([20160106, 20160104, 20160108])]
Out[49]:
   id      date  rain  snow  weather
3   3  20160104     0     0        0
5   5  20160106     0     0        0
7   7  20160108     0     0        0

In [50]: weather_df[(weather_df['date'] >= 20160106) & (weather_df['date'] <= 20160110)].loc[:, ['id', 'date', 'weather']]
Out[50]:
   id      date  weather
5   5  20160106        0
6   6  20160107        0
7   7  20160108        0
8   8  20160109        1
9   9  20160110        1

In [51]: weather_df[weather_df['weather'] == 1].loc[:, ['id', 'date', 'weather']]
Out[51]:
   id      date  weather
8   8  20160109        1
9   9  20160110        1

In [52]: weather_df[weather_df.date.isin([20160106, 20160104, 20160108])].loc[:, ['id', 'date', 'weather']]
Out[52]:
   id      date  weather
3   3  20160104        0
5   5  20160106        0
7   7  20160108        0

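These examples show that the basic pattern for retrieving part of the data through a condition looks like dataframe[dataframe.column > value] or dataframe[dataframe.column.isin(list)].

So how does it work?

The condition inside the square brackets is evaluated first, producing a series of bool values: (weather_df['date'] >= 20160106) & (weather_df['date'] <= 20160110), weather_df['weather'] == 1, or weather_df.date.isin([20160106, 20160104, 20160108]).
This new series is then used to select the rows whose bool value is True.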


In [53]: weather_df
Out[53]:
    id      date  rain  snow  weather
0    0  20160101     0     0        0
1    1  20160102     0     0        0
2    2  20160103     0     0        0
3    3  20160104     0     0        0
4    4  20160105     0     0        0
5    5  20160106     0     0        0
6    6  20160107     0     0        0
7    7  20160108     0     0        0
8    8  20160109     1     0        1
9    9  20160110     1     0        1
10  10  20160111     0     0        0

In [54]: weather_df.date.isin([20160106, 20160104, 20160108])
Out[54]:
0     False
1     False
2     False
3      True
4     False
5      True
6     False
7      True
8     False
9     False
10    False
Name: date, dtype: bool

In [55]: weather_df[weather_df.date.isin([20160106, 20160104, 20160108])]
Out[55]:
   id      date  rain  snow  weather
3   3  20160104     0     0        0
5   5  20160106     0     0        0
7   7  20160108     0     0        0
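
Because the condition is just a boolean series, it can be stored in a variable and handed to loc together with a list of columns, which selects rows and columns in one step. The sketch below is an alternative to the chained form used in the session above, not a replacement the original prescribes; weather_df is rebuilt by hand and the names in_range and subset are made up here.

```python
import pandas as pd

rain = [0] * 8 + [1, 1, 0]
weather_df = pd.DataFrame({
    "id": list(range(11)),
    "date": list(range(20160101, 20160112)),
    "rain": rain,
    "snow": [0] * 11,
    "weather": rain,
})

# The mask is an ordinary series of bool values.
in_range = (weather_df["date"] >= 20160106) & (weather_df["date"] <= 20160110)

# Rows where the mask is True, restricted to three columns.
subset = weather_df.loc[in_range, ["id", "date", "weather"]]

# The chained form from the session gives the same frame.
chained = weather_df[in_range].loc[:, ["id", "date", "weather"]]

print(subset)
```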

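The most commonly used tests are numeric comparisons and membership in a given list or set; the former use >, >=, < or <=, and the latter uses .isin. Multiple conditions are joined with & or |, and each condition must be wrapped in parentheses because & and | bind more tightly than the comparison operators.
More complex conditions also exist, for example requiring that the substring formed by the third through sixth characters of every element in a string column falls within some given range. My usual approach is the apply function, which will be explained with examples later on.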
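
The operators listed above can be sketched as follows. The | example and the use of ~ to negate a mask are additions for illustration (the session above only shows &), and weather_df is rebuilt by hand so the block runs on its own.

```python
import pandas as pd

rain = [0] * 8 + [1, 1, 0]
weather_df = pd.DataFrame({
    "id": list(range(11)),
    "date": list(range(20160101, 20160112)),
    "rain": rain,
    "snow": [0] * 11,
    "weather": rain,
})

# OR: rows that are rainy or fall before 20160103.
# The parentheses around each comparison are required.
rainy_or_early = weather_df[
    (weather_df["rain"] == 1) | (weather_df["date"] < 20160103)
]

# NOT: ~ inverts a boolean series, here "date not in the list".
not_listed = weather_df[
    ~weather_df["date"].isin([20160104, 20160106, 20160108])
]

print(list(rainy_or_early.index))
print(len(not_listed))
```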

The rest will be covered in the second part.
