python pandas的Dataframe 使用总结

1 数据选取操作

1.1 isin和is not in 的使用和操作

按照pandas作者的说法，pandas可以实现几乎所有的类似sql的操作，这其中当然包括sql中的in和not in。

在pandas中（包括Dataframe和Series）in和not in的实现方法分别是obj.isin([list] or {set} or (tuple))和~obj.isin([list])，例如：

import pandas as pd
data = pd.DataFrame(
    {"name": ["Tom", "Mary", "Jerry", "Mark"], 
    "age":[18, 22, 19, 31], 
    "gender":["M", "F", "M", "M"]})

data.age.isin((18, 22, 19))
~data.age.isin([18, 22, 19])
data[data.age.isin([18, 22])]

1.2 filter 和 select

pandas中的filter和select用来选择符合条件的行或者列

1.2.1 filter的使用

def filter(self, items=None, like=None, regex=None, axis=None):
"""
使用list、正则表达式或者like语法来选择行或者列

参数
----------
items : list-like
    索引list、set、tuple或者其他list-like类型
like : string
    Keep info axis where "arg in col == True"
regex : string (regular expression)
   正则表达式
axis : int or None
    默认情况下为0

# 使用方法 以上例的data为例
data.filter(items=[0,1], axis=0)
data.filter(items=[0,1], axis=0)
data.filter(like="ge", axis=1)
data.filter(regex="[a-z]_\d")

1.2.2 select的使用

def select(self, crit, axis=0):
"""
Return data corresponding to axis labels matching criteria

Parameters
----------
crit : function
    To be called on each index (label). Should return True or False
axis : int 默认情况下axis=0 即按行选择

Returns
-------
selection : type of caller
"""
# 使用方法如下
import re
# 选择含有age的列
data.select(lambda x: bool(x) if "age" in x else False, axis=1)
# 按照正则表达式选择列
data.select(lambda x: bool(re.match("^[a-z]*_\d$", x)), axis=1)

1.3 多条件选取

“&”、“|”分别表示“与”和“或”
只需使用 & 或 | 连接多个条件就可以实现多条件的选取了
示例代码如下:

```
data[data.age.isin([18, 22, 19])&~data.gender.isin(["M"])]
newseries = df.loc[(df['BBB'] < 25) & (df['CCC'] >= -40), 'AAA']
```

1.4 ix、loc与iloc

在pandas中ix、loc和iloc都用于选取数据，其中ix和loc用于pandas的索引选取，即根据数据中的行索引和列索引的名字选取，iloc是使用类似列表的顺序索引进行数据选取的

2 DataFrame 中重复项的处理

3 DataFrame的排序

3.1 pandas中排序的方法

pandas中的排序方法有三个sort_values()、sort()
和sort_index()三个，其中sort()和sort_index()都已经不推荐使用。

def sort_values(self, by, axis=0, ascending=True, inplace=False,
                    kind='quicksort', na_position='last'):
    """
    Parameters
    ----------
    by : string 或list
        键的名字，当为list时是多值排序
    axis : int 默认情况下axis=0 即按行选择
    ascending : boolean or list, default True
        是否按照升序排序 list用于多值排序
    inplace ： bool
        是否原地操作
    kind ： {'quicksort', 'mergesort', 'heapsort'}, optional
        排序方法
    na_position : {'first', 'last'} (optional, default='last')
        'first' puts NaNs at the beginning
        'last' puts NaNs at the end
    """
    pass
data = pd.DataFrame(
    {"name": ["Tom", "Mary", "Jerry", "Mark"], 
     "age":[18, 22, 19, 31], 
     "gender":["M", "F", "M", "M"]})
data["age_2"] = [32, 32, 35, 15]
data.ix[0, "age"] = 25

data.sort_values(by="age")

3.2 多值排序

data.sort(by=["age", "age_2"])
#   age gender  name    age_2
# 2 19  M   Jerry   35
# 1 22  F   Mary    32
# 0 25  M   Tom 32
# 3 31  M   Mark    15
data.sort_values(by=["age_2", "age"], ascending=[True, False])
    age gender  name    age_2
# 3 31.0    M   Mark    15
# 4 NaN     F   Cris    18
# 0 25.0    M   Tom     32
# 1 22.0    F   Mary    32
# 2 19.0    M   Jerry   35

3.3 正序与倒序

# 正序与倒序排列通过**ascending**参数控制
data.sort_values(by=["age_2", "age"], ascending=[True, False])
    age gender  name    age_2

3.4 na值的处理

# 参考1

4 在DataFrame中添加行

4.1 concate

import pandas as pd
df = pd.DataFrame([[1,2,3],[4,5,6]],columns=['a','b','c'])
#>>df
#      a  b  c
#   0  1  2  3
#   1  4  5  6

#先将要添加进去的记录转置后连接在一起
t = pd.concat([pd.DataFrame([7,8,9]).T,pd.DataFrame([10,11,12]).T])
#然后修改columns 使得和df的columns一致
t.columns = df.columns
#最后把两个DataFrame合并并且忽略index
df = pd.concat([df,t],ignore_index=True)
#>>df
#       a   b   c
#   0   1   2   3
#   1   4   5   6
#   2   7   8   9
#   3  10  11  12

4.2 loc

df.loc[i]={'a':1,'b':2}
df.loc[df.shape[0]+1] = {'ds':strToDate('2017-07-21'),'y':0}

5 删除列

5.1 del关键字删除列

# 只能使用del data["column_name"] 不能使用 del data.column_name
del data["age_2"]

5.2 pop删除（内部使用del实现）

data.pop("age")

5.3 drop删除

obj.drop("index_name")

6 NaN的声明

a = pd.NaT
np.NaN

7 重新索引与修改列名

7.1 rename和rename_axis

参考pandas.DataFrame.reindex 和 pandas.DataFrame.rename

DataFrame.rename(index=None, columns=None, **kwargs)
通过函数或字典修改DataFrame的index或columns. Function或者dict的值必须是一对一的(1-to-1). 没有包含在 dict或Series中的列或者索引保持不变。 多余的labels不会抛出异常。Alternatively, change Series.name with a scalar value (Series only).

Parameters: 
    index, columns : scalar, list-like, dict-like or function,optional
        Scalar or list-like 仅仅用于修改 the Series.name attribute, 
        在DataFrame和 Panel中会抛出异常. 修改DataFrame的columns和index
        需要使用 dict-like or functions 
        are transformations to apply to that axis’ values
    copy : boolean, default True
        Also copy underlying data
    inplace : boolean, default False
        是否复制一个DataFrame并返回. 如果为True，参数copy的值将会被忽略.
    level : int or level name, default None
        In case of a MultiIndex, only rename labels in the specified level.
Returns:    
    renamed : DataFrame (new object)

>>> s = pd.Series([1, 2, 3])
>>> s
0    1
1    2
2    3
dtype: int64
>>> s.rename("my_name") # scalar, changes Series.name
0    1
1    2
2    3
Name: my_name, dtype: int64
>>> s.rename(lambda x: x ** 2)  # function, changes labels
0    1
1    2
4    3
dtype: int64
>>> s.rename({1: 3, 2: 5})  # mapping, changes labels
0    1
3    2
5    3
dtype: int64
>>> df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
>>> df.rename(2)
Traceback (most recent call last):
...
TypeError: 'int' object is not callable
>>> df.rename(index=str, columns={"A": "a", "B": "c"})
   a  c
0  1  4
1  2  5
2  3  6
>>> df.rename(index=str, columns={"A": "a", "C": "c"})
   a  B
0  1  4
1  2  5
2  3  6

7.2 reindex和reindex_axis

参考pandas.DataFrame.reindex

DataFrame.reindex(index=None, columns=None, **kwargs)

Conform DataFrame to new index with optional filling logic, placing NA/NaN in locations having no value in the previous index. A new object is produced unless the new index is equivalent to the current one and copy=False

Parameters:

index, columns : array-like, optional (can be specified in order, or as
keywords) New labels / index to conform to. Preferably an Index object to avoid duplicating data

method : {None, ‘backfill’/’bfill’, ‘pad’/’ffill’, ‘nearest’}, optional

s.rename("my_name") # scalar, changes Series.name
0 1
1 2
2 3
Name: my_name, dtype: int64
>>> s.rename(lambda x: x ** 2) # function, changes labels
0 1
1 2
4 3
dtype: int64
>>> s.rename({1: 3, 2: 5}) # mapping, changes labels
0 1
3 2
5 3
dtype: int64
>>> df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
>>> df.rename(2)
Traceback (most recent call last):
...
TypeError: 'int' object is not callable
>>> df.rename(index=str, columns={"A": "a", "B": "c"})
a c
0 1 4
1 2 5
2 3 6
>>> df.rename(index=str, columns={"A": "a", "C": "c"})
a B
0 1 4
1 2 5
2 3 6
```

### 7.2 reindex和reindex_axis

参考pandas.DataFrame.reindex

> DataFrame.reindex(index=None, columns=None, kwargs)
> > Conform DataFrame to new index with optional filling logic, placing NA/NaN in locations having no value in the previous index. A new object is produced unless the new index is equivalent to the current one and copy=False
>
> >Parameters:**
>>> index, columns : array-like, optional (can be specified in order, or as
keywords) New labels / index to conform to. Preferably an Index object to avoid duplicating data

>>> method : {None, ‘backfill’/’bfill’, ‘pad’/’ffill’, ‘nearest’}, optional
>>>>
method to use for filling holes in reindexed DataFrame. Please note: this is only applicable to DataFrames/Series with a monotonically increasing/decreasing index.

default: don’t fill gaps

pad / ffill: propagate last valid observation forward to next valid
backfill / bfill: use next valid observation to fill gap
nearest: use nearest valid observations to fill gap

copy : boolean, default True

Return a new object, even if the passed indexes are the same

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level
fill_value : scalar, default np.NaN
Value to use for missing values. Defaults to NaN, but can be any “compatible” value

limit : int, default None

Maximum number of consecutive elements to forward or backward fill

tolerance : optional

Maximum distance between original and new labels for inexact matches. The values of the index at the matching locations most satisfy the equation abs(index[indexer] - target) <= tolerance.

New in version 0.17.0.

Returns:

reindexed : DataFrame

7.3 column和index

import pandas as pd
df = pd.DataFrame({'$a': [1], '$b': [1], '$c': [1], '$d': [1], '$e': [1]})
    
# 暴力
df.columns = ['a', 'b', 'c', 'd', 'e']
    
# 修改
df.columns = df.columns.str.strip('$')

# 修改
df.columns = df.columns.map(lambda x:x[1:])

# 暴力（好处：也可只修改特定的列）
df.rename(columns=('$a': 'a', '$b': 'b', '$c': 'c', '$d': 'd', '$e': 'e'}, inplace=True) 

# 修改
df.rename(columns=lambda x:x.replace('$',''), inplace=True)

data.rename(columns={"age_2": "age_3"}, inplace=True)

8 Group By操作

8.1帮助文档

pd.DataFrame.groupby(self, by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=False, **kwargs)
Group series using mapper (dict or key function, apply given function to group, return result as series) or by a series of columns.

    Parameters
    ----------
    by : mapping function / list of functions, dict, Series, or tuple /
        list of column names.
        Called on each element of the object index to determine the groups.
        If a dict or Series is passed, the Series or dict VALUES will be
        used to determine the groups
    axis : int, default 0
    level : int, level name, or sequence of such, default None
        If the axis is a MultiIndex (hierarchical), group by a particular
        level or levels
    as_index : boolean, default True
        For aggregated output, return object with group labels as the
        index. Only relevant for DataFrame input. as_index=False is
        effectively "SQL-style" grouped output
    sort : boolean, default True
        Sort group keys. Get better performance by turning this off.
        Note this does not influence the order of observations within each
        group.  groupby preserves the order of rows within each group.
    group_keys : boolean, default True
        When calling apply, add group keys to index to identify pieces
    squeeze : boolean, default False
        reduce the dimensionality of the return type if possible,
        otherwise return a consistent type

8.2 基本操作

最常见的操作是对一个和多个建进行group by

# 操作方法
df.groupby(["key1", "key2"])

8.3 聚合操作

在pandas中可以实现很多的聚合操作，如mean、median、sum、avg、表准差等，各种操作引诱尽有，实在不能满足自己的情况下可以使用apply函数实现自己的聚合操作

```
# 操作方法
group.sum()
group.sum()["key1"] # 选取聚合结果的某一列
group.sum()[["key1", "key2"]]  # 选取聚合结果的多列
# 将聚合结果的索引转换为列 使用reset_index 函数
res = group.sum()[["key1", "key2"]]
res.reset_index()
```

8.4 将索引groupby索引转换为列

在DataFrame中有一个函数是reset_index，使用方法见8.3节

8.5 groupby结果的块操作(chunk)

可以多group的结果进行循环操作，循w环中的每一项是一个tuple，形如(索引, 数据)

```
group = data.groupby(["key1", "key2"])
for item in group:
    print(item[0])
    print(item[1])
```