pandas data structures

Fundamental principle: data alignment is intrinsic. The link between labels and data is never broken unless you break it explicitly.

Series

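A Series is a one-dimensional labeled array that can hold any data type (integers, strings, floating-point numbers, Python objects, and so on). The axis labels are collectively called the index.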

>>> s = pd.Series(data, index=index)

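Here, data can be many different things:

a Python dict

an ndarray

a scalar value (such as 5)

The passed index is a list of axis labels. The behavior therefore depends on the type of data, as described in the cases below: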

From ndarray

If data is an ndarray, index must be the same length as data. If no index is passed, one is created with the values [0, ..., len(data) - 1].

>>> s = pd.Series(data=np.random.randn(5), dtype=np.float32, index=['a', 'b', 'c', 'd', 'e'])
>>> s
a    0.755472
b   -0.565683
c    0.325820
d   -0.039064
e   -0.469198
dtype: float32

>>> s.index
Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

>>> s = pd.Series(data=np.random.randn(5), dtype=np.float64)
>>> s
0   -1.194776
1    1.028467
2   -0.607143
3    1.729527
4   -0.029426
dtype: float64

>>> s.index
RangeIndex(start=0, stop=5, step=1)

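Note: starting with v0.8.0, pandas supports non-unique index values. If you attempt an operation that does not support duplicate index values, an exception is raised at that point. The check is deferred in this way almost entirely for performance reasons (many computations, such as parts of GroupBy, never use the index).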
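The note above can be illustrated with a minimal sketch. The variable names and sample values are made up for illustration, and reindex is just one example of an operation that requires unique labels:

```python
import pandas as pd

# A Series whose index contains the label 'a' twice (illustrative data).
dup = pd.Series([1, 2, 3], index=['a', 'a', 'b'])

# Construction succeeds; looking up a duplicated label returns every match.
matches = dup['a']
print(matches.tolist())

# reindex needs unique labels, so the error only surfaces at this point.
try:
    dup.reindex(['a', 'b'])
    raised = False
except ValueError as exc:
    raised = True
    print("reindex failed:", exc)
```

If duplicates are unexpected, checking dup.index.is_unique right after loading the data is one way to catch them early.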

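From dict

If data is a dict and an index is passed, the values in data that correspond to the labels in the index are pulled out. Otherwise, an index is constructed from the keys of the dict: in sorted order in older pandas versions, and in insertion order from pandas 0.23 on Python 3.6 or later.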

>>> pd.Series(dict(zip(('a','b','c'),(1.,2.,3.))))
a    1.0
b    2.0
c    3.0
dtype: float64

>>> pd.Series(dict(zip(('a','b','c'),(1,2,3))), index=['b', 'c', 'd', 'a'])
b    2.0
c    3.0
d    NaN
a    1.0
dtype: float64

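Note: NaN (not a number) is the standard missing-data marker used in pandas.

Checking whether a value is NaN:

>>> s = pd.Series(dict(zip(('a','b','c'),(1,2,3))), index=['b', 'c', 'd', 'a'])
>>> pd.isnull(s['d'])
True

>>> np.isnan(s['d'])
True

>>> import math
>>> math.isnan(s['d'])
True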
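The checks above test one value at a time. As a small sketch of the same idea applied to a whole Series, isna() and notna() return boolean masks; the sample data mirrors the dict example and the variable names are illustrative:

```python
import pandas as pd

# Same shape as the dict example: label 'd' has no value in the dict.
s = pd.Series({'a': 1, 'b': 2, 'c': 3}, index=['b', 'c', 'd', 'a'])

# Boolean mask that is True where the value is missing.
mask = s.isna()

# Labels whose value is missing, and the number of present values.
missing_labels = s.index[mask].tolist()
present_count = int(s.notna().sum())
print(missing_labels, present_count)
```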

From scalar value

If data is a scalar value, an index must be provided. The value is repeated to match the length of the index.

>>> pd.Series(data=3, index=range(5))
0    3
1    3
2    3
3    3
4    3
dtype: int64

Series is ndarray-like

>>> s = pd.Series(np.random.randn(5), index=[chr(i) for i in range(97, 102)])
>>> s
a    1.762795
b    0.710000
c    1.372860
d    1.267486
e    0.141515
dtype: float64

>>> s.iloc[0]  # position-based access; s[0] is deprecated on a labeled index
1.762795268543718

>>> s[:3]
a    1.762795
b    0.710000
c    1.372860
dtype: float64

>>> s[s>s.median()]
a    1.762795
c    1.372860
dtype: float64

>>> s.iloc[[4, 3, 1]]
e    0.141515
d    1.267486
b    0.710000
dtype: float64
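As a short sketch of how far the ndarray analogy goes, the snippet below shows that slicing a Series keeps the matching labels, while to_numpy() returns a plain array without them. The data is illustrative:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.5, 0.5, 2.5], index=['a', 'b', 'c'])

# Positional slicing also slices the index, unlike a bare ndarray.
head = s.iloc[:2]

# to_numpy() drops the labels and returns the underlying values.
arr = s.to_numpy()

print(list(head.index), type(arr).__name__)
```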

Series is dict-like

A Series is like a fixed-size dict: you can get and set values by using labels as keys:

>>> s['a']
1.762795268543718

>>> s['a'] = 1200
>>> s['f'] = 1200
>>> s
a    1200.000000
b       0.710000
c       1.372860
d       1.267486
e       0.141515
f    1200.000000
dtype: float64

>>> 'e' in s
True
>>> 'g' in s
False

If the label does not exist, an exception is raised:

>>> s['g']
KeyError: 'g'

With the get method, a missing label returns None or the specified default:

>>> s.get('g','不存在')
'不存在'

Use the del statement to remove a label:

>>> del s['f']
>>> s
a    1200.000000
b       0.710000
c       1.372860
d       1.267486
e       0.141515
dtype: float64

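Vectorized operations and label alignment with Series

In data analysis, as with raw NumPy arrays, looping through a Series value by value is usually unnecessary. A Series can also be passed to most NumPy methods that expect an ndarray.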

>>> s+s
a    2400.000000
b       1.420001
c       2.745720
d       2.534972
e       0.283029
dtype: float64

>>> s*2
a    2400.000000
b       1.420001
c       2.745720
d       2.534972
e       0.283029
dtype: float64

>>> np.exp(s)
a         inf
b    2.033992
c    3.946623
d    3.551912
e    1.152017
dtype: float64

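The key difference between Series and ndarray is that operations between Series automatically align the data by label. You can therefore write computations without checking whether the Series involved have the same labels.

>>> s[1:] + s[:-1]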
a         NaN
b    1.420001
c    2.745720
d    2.534972
e         NaN
dtype: float64

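The result of an operation between unaligned Series has the union of the indexes involved. If a label is not found in one Series or the other, the result is marked as missing (NaN). Being able to write code without any explicit data alignment gives great freedom and flexibility in interactive data analysis and research. The data alignment built into the pandas data structures sets pandas apart from most related tools for working with labeled data.

Note: in general, we chose to make the default result of operations between differently indexed objects the union of the indexes, in order to avoid losing information. An index label is usually important information for a computation even when its data is missing. You can of course drop the labels with missing data by using the dropna function.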

>>> t = s[1:] + s[:-1]
>>> t.dropna()
b    1.420001
c    2.745720
d    2.534972
dtype: float64
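Dropping the unmatched labels is one option. Another, sketched below on illustrative data, is to treat a label missing from one side as a default value by using Series.add with the fill_value argument:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0], index=['a', 'b', 'c', 'd'])
left, right = s.iloc[1:], s.iloc[:-1]   # labels b..d and a..c

# Plain addition: labels present on only one side become NaN.
plain = left + right

# add() with fill_value=0 substitutes 0 for the side that lacks the label.
filled = left.add(right, fill_value=0)

print(plain.isna().tolist())
print(filled.tolist())
```

Whether 0 is a sensible stand-in depends on the data; for a sum of counts it often is, for measurements it may hide real gaps.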

Name attribute

A Series can also have a name attribute:

>>> series = pd.Series(np.random.randn(5), name='this is range')
>>> series
0    2.346803
1    0.073170
2   -0.940341
3    0.876354
4   -0.109891
Name: this is range, dtype: float64

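You can rename a Series with the pandas.Series.rename() method. Passing a scalar changes the name; passing a function or a dict changes the index labels instead.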

>>> series.rename(None)
0    2.346803
1    0.073170
2   -0.940341
3    0.876354
4   -0.109891
dtype: float64
>>> series.rename(lambda x: x ** 2)  # a function changes the index labels
0     2.346803
1     0.073170
4    -0.940341
9     0.876354
16   -0.109891
Name: this is range, dtype: float64
>>> series.rename({1: 10, 2: 30})  # a dict changes the index labels
0     2.346803
10    0.073170
30   -0.940341
3     0.876354
4    -0.109891
Name: this is range, dtype: float64

Note that rename does not modify the original object; it returns a new one.
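A minimal sketch of that behavior, using made-up names and values: the original Series keeps its name and labels after each rename call.

```python
import pandas as pd

series = pd.Series([10, 20, 30], name='original')

# A scalar argument sets the name on a new Series.
renamed = series.rename('changed')

# A dict argument relabels the index on a new Series and keeps the name.
relabeled = series.rename({0: 'x'})

print(series.name, renamed.name, list(relabeled.index))
```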

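DataFrame

A DataFrame is a two-dimensional labeled data structure whose columns may have different types. You can think of it as a spreadsheet, a SQL table, or a dict of Series objects. It is generally the most commonly used pandas object. Like Series, DataFrame accepts many different kinds of input: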

a dict of one-dimensional ndarrays, lists, dicts, or Series

a two-dimensional numpy.ndarray

a structured or record ndarray

a Series

another DataFrame

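Along with the data, you can optionally pass index (row labels) and columns (column labels) arguments. If you pass an index or columns, they are used as the index or columns of the resulting DataFrame. A dict of Series plus a specific index therefore discards all data that does not match the passed index.

If axis labels are not passed, they are constructed from the input data using common-sense rules.

From dict of Series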

>>> d = {
...     'one': pd.Series(list(range(1, 4)), index=[chr(i) for i in range(97, 100)]),
...     'two': pd.Series(list(range(1, 5)), index=[chr(i) for i in range(97, 101)]),
... }

>>> df = pd.DataFrame(d)
>>> df
   one  two
a  1.0    1
b  2.0    2
c  3.0    3
d  NaN    4

>>> pd.DataFrame(d, index=['d','b','a'])
   one  two
d  NaN    4
b  2.0    2
a  1.0    1

>>> pd.DataFrame(d, index=['d','b','a'], columns=['two', 'three'])
   two three
d    4   NaN
b    2   NaN
a    1   NaN

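Note: when a particular set of columns is passed along with a dict of data, the passed columns override the keys in the dict.

The row and column labels can be accessed through the index and columns attributes, respectively:

>>> df.index
Index(['a', 'b', 'c', 'd'], dtype='object')
>>> df.columns
Index(['one', 'two'], dtype='object')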

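From dict of ndarrays / lists

The ndarrays must all be the same length. If an index is passed, it must also be the same length as the arrays. If no index is passed, the index is range(n), where n is the array length.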

>>> d = {
...     'one': list(range(1, 5)),
...     'two': list(range(1, 5))[::-1],
... }
>>> pd.DataFrame(d)
   one  two
0    1    4
1    2    3
2    3    2
3    4    1

>>> pd.DataFrame(d, index=[chr(i) for i in range(97,101)])
   one  two
a    1    4
b    2    3
c    3    2
d    4    1
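The length rules stated for this case can be checked with a short sketch. The helper build_error is hypothetical, written only for this illustration, and the sample data is made up:

```python
import numpy as np
import pandas as pd


def build_error(data, index=None):
    """Hypothetical helper: return the ValueError raised while building
    the DataFrame, or None if construction succeeds."""
    try:
        pd.DataFrame(data, index=index)
    except ValueError as exc:
        return exc
    return None


# Arrays of length 4 and 3 cannot share one row axis.
ragged = build_error({'one': np.arange(4), 'two': np.arange(3)})

# Four values per column, but only three index labels.
short_index = build_error({'one': [1, 2, 3, 4]}, index=['a', 'b', 'c'])

# Matching lengths succeed.
ok = build_error({'one': [1, 2, 3, 4]}, index=['a', 'b', 'c', 'd'])

print(type(ragged).__name__, type(short_index).__name__, ok)
```

Note that this strictness applies to plain arrays and lists; a dict of Series is aligned by label instead, as in the earlier section.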

From ndarrays / lists

>>> pd.DataFrame(np.arange(12).reshape(3,4))
   0  1   2   3
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11

>>> pd.DataFrame(np.arange(12).reshape(3,4),index=[chr(i) for i in range(97,100)], columns=[chr(i) for i in range(97,101)])
   a  b   c   d
a  0  1   2   3
b  4  5   6   7
c  8  9  10  11
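The list of accepted inputs also mentions structured arrays, a Series, and another DataFrame, which have no example above. The following is a minimal sketch of those three cases; the field names, values, and variable names are made up for illustration:

```python
import numpy as np
import pandas as pd

# Structured ndarray: the field names become the column labels.
records = np.array(
    [(1, 2.0, 'x'), (2, 3.0, 'y')],
    dtype=[('A', 'i4'), ('B', 'f4'), ('C', 'U1')],
)
from_records = pd.DataFrame(records)

# A named Series becomes a single column labeled with its name.
ser = pd.Series([1, 2, 3], index=['a', 'b', 'c'], name='ser')
from_series = pd.DataFrame(ser)

# Another DataFrame: the labels and values carry over.
clone = pd.DataFrame(from_series)

print(list(from_records.columns), from_series.shape, clone.equals(from_series))
```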