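[3] Data Analysis: Summarization

Unit 7: Getting Started with Pandas

Introduction to Pandas

Official website: http://pandas.pydata.org
Pandas is a third-party Python library that provides high-performance, easy-to-use data types and analysis tools.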

import pandas as pd

Pandas is built on NumPy and is often used together with NumPy and Matplotlib.
Quick test:

import pandas as pd

d = pd.Series(range(20))


d

Out[3]: 
0      0
1      1
2      2
3      3
4      4
5      5
6      6
7      7
8      8
9      9
10    10
11    11
12    12
13    13
14    14
15    15
16    16
17    17
18    18
19    19
dtype: int64

The left column is the index; the right column holds the values.
Computing the running sum of the first n items:

d.cumsum()
Out[4]: 
0       0
1       1
2       3
3       6
4      10
5      15
6      21
7      28
8      36
9      45
10     55
11     66
12     78
13     91
14    105
15    120
16    136
17    153
18    171
19    190
dtype: int64

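Understanding Pandas

Two data types: Series (one-dimensional data) and DataFrame (two-dimensional data, also usable for higher-dimensional data)
Operations built on these data types:
basic operations, arithmetic operations, feature (statistical) operations, and relational operations
Comparison with NumPy

NumPy	Pandas
Basic data type	Extended data types
Focuses on the structural representation of data (the dimensions that relate data items)	Focuses on the application-level representation of data
Dimension: relationships among data items	Relationship between data and index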

The Series Type in Pandas

A Series consists of a sequence of values and the index associated with them.

import pandas as pd

a = pd.Series([9, 8, 7, 6])

a
Out[3]: 
0    9
1    8
2    7
3    6
dtype: int64

The left column is the automatically generated index. The dtype is a NumPy data type.

b = pd.Series([9, 8, 7, 6], index = ['a', 'b', 'c', 'd'])

b
Out[5]: 
a    9
b    8
c    7
d    6
dtype: int64

Here the left column is a custom index.

A Series can be created from:

a Python list
a scalar value
a Python dict
an ndarray
other functions

Creating from a scalar value:

s = pd.Series(25, index = ['a', 'b', 'c'])

s
Out[7]: 
a    25
b    25
c    25
dtype: int64

Creating from a dict:

d = pd.Series({'a':9, 'b':9, 'c':7})

d
Out[9]: 
a    9
b    9
c    7
dtype: int64

To change the shape or the index order, pass index as well; labels that are not keys of the dict get NaN:

e = pd.Series({'a':9, 'b':9, 'c':7}, index = ['c', 'a', 'b', 'd'])

e
Out[11]: 
c    7.0
a    9.0
b    9.0
d    NaN
dtype: float64
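
The NaN entry and the switch to float64 above can be checked directly. The following is a minimal sketch; the variable names data and e_int are illustrative and not part of the original notes.

```python
import pandas as pd

data = {'a': 9, 'b': 9, 'c': 7}

# Labels listed in `index` select and reorder the dict entries;
# 'd' is not a key of `data`, so its value becomes NaN.
e = pd.Series(data, index=['c', 'a', 'b', 'd'])

print(e.isna())   # True only for label 'd'
print(e.dtype)    # float64, because NaN is a floating-point value

# Dropping the missing entry and casting gives an integer Series again.
e_int = e.dropna().astype('int64')
print(e_int)
```

Whether to drop or fill the missing label depends on the analysis; dropna() is used here only to show that the float dtype comes from the NaN.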

Creating from an ndarray:

import numpy as np

n = pd.Series(np.arange(5))

n
Out[15]: 
0    0
1    1
2    2
3    3
4    4
dtype: int32

m = pd.Series(np.arange(5), index = np.arange(9, 4, -1))

m
Out[17]: 
9    0
8    1
7    2
6    3
5    4
dtype: int32

Basic Operations on Series

A Series has two parts: index and values.

It behaves like both an ndarray and a Python dict.

b = pd.Series([9, 8, 7, 6], ['a', 'b', 'c', 'd'])

b
Out[19]: 
a    9
b    8
c    7
d    6
dtype: int64

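b.index  ##get the index
Out[20]: Index(['a', 'b', 'c', 'd'], dtype='object')  ##Index type

b.values  ##get the data
Out[21]: array([9, 8, 7, 6], dtype=int64)  ##NumPy ndarray type

b['b']  ##custom index
Out[22]: 8

b[1]  ##automatic (positional) index; deprecated in recent pandas releases, where b.iloc[1] is preferred
Out[23]: 8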

b[['c', 'd', 0]]

Out[24]: 
c    7.0
d    6.0
0    NaN  ##the two kinds of index cannot be mixed; recent pandas versions raise KeyError here instead
dtype: float64

b[['c', 'd', 'a']]
Out[25]: 
c    7
d    6
a    9
dtype: int64

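Indexing uses [].
NumPy operations and functions can be applied to a Series.
A list of custom index labels can be used to select several entries.
Slicing by automatic (positional) index is supported; if a custom index exists, it is sliced along with the values.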

b = pd.Series([9, 8, 7, 6], ['a', 'b', 'c', 'd'])

b
Out[27]: 
a    9
b    8
c    7
d    6
dtype: int64

b[3]  ##indexing returns a value
Out[28]: 6

b[:3]  ##slicing returns a Series
Out[29]: 
a    9
b    8
c    7
dtype: int64

b[b > b.median()]  
Out[31]: 
a    9
b    8
dtype: int64

np.exp(b)
Out[32]: 
a    8103.083928
b    2980.957987
c    1096.633158
d     403.428793
dtype: float64

Values can be accessed by custom index.
The in keyword is supported.
The get() method is available.

b['b']
Out[33]: 8

'c' in b  ##checks whether 'c' is in the index of b
Out[34]: True

0 in b
Out[35]: False

b.get('f', 100)
Out[36]: 100
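
Because [] accepts both labels and positions, the intent can be ambiguous. A minimal sketch of the explicit accessors .loc (labels) and .iloc (positions) follows; the variable names are illustrative only.

```python
import pandas as pd

b = pd.Series([9, 8, 7, 6], index=['a', 'b', 'c', 'd'])

by_label = b.loc['b']          # label-based lookup
by_position = b.iloc[1]        # position-based lookup

label_slice = b.loc['a':'c']   # label slices include the end label
pos_slice = b.iloc[:3]         # position slices exclude the end position

fallback = b.get('f', 100)     # 'f' is not a label, so the default is returned

print(by_label, by_position, fallback)
print(label_slice)
print(pos_slice)
```

Note that the two slices select the same three entries even though one names its end point inclusively and the other exclusively.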

Alignment of Series

Series + Series

a = pd.Series([1, 2, 3], ['c', 'd', 'e'])

b = pd.Series([9, 8, 7, 6], ['a', 'b', 'c', 'd'])

a + b
Out[40]: 
a    NaN
b    NaN
c    8.0
d    8.0
e    NaN
dtype: float64

Arithmetic automatically aligns data by index label; a label present in only one operand produces NaN.
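
If NaN is not wanted for labels that appear in only one operand, the method form .add() with fill_value is one option. This is a small sketch under the assumption that a missing value should count as 0; the name total is illustrative.

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=['c', 'd', 'e'])
b = pd.Series([9, 8, 7, 6], index=['a', 'b', 'c', 'd'])

# Labels missing from one operand are treated as 0 before adding,
# so the result contains no NaN.
total = a.add(b, fill_value=0)
print(total)
```

The result still covers the union of both indexes; only the filling of unmatched labels changes.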

The name Attribute of Series

Both a Series object and its index can have a name, stored in the .name attribute.

b = pd.Series([9, 8, 7, 6], ['a', 'b', 'c', 'd'])

b.name

b.name = 'Series对象'

b.index.name = '索引列'

b
Out[45]: 
索引列
a    9
b    8
c    7
d    6
Name: Series对象, dtype: int64

Modifying a Series

A Series can be modified at any time, and changes take effect immediately.

b = pd.Series([9, 8, 7, 6], ['a', 'b', 'c', 'd'])

b['a'] = 15

b.name = 'Series'

b
Out[49]: 
a    15
b     8
c     7
d     6
Name: Series, dtype: int64

b.name = 'New Series'

b[['b', 'c']] = 20

b
Out[52]: 
a    15
b    20
c    20
d     6
Name: New Series, dtype: int64

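A Series is a one-dimensional labeled array. Its basic operations resemble those of ndarray and dict, and operations align on the index.

The DataFrame Type in Pandas

A DataFrame consists of a group of columns that share the same index.
It is a tabular data type; each column can hold a different value type.
It has both a row index (index) and a column index (columns).
It is usually used for two-dimensional data but can also represent higher-dimensional data.

A DataFrame can be created from:

a two-dimensional ndarray
a dict of one-dimensional ndarrays, lists, dicts, tuples, or Series
a Series

Creating from a two-dimensional ndarray: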

import pandas as pd

import numpy as np

d = pd.DataFrame(np.arange(10).reshape(2,5))

d
Out[5]: 
   0  1  2  3  4
0  0  1  2  3  4
1  5  6  7  8  9

Automatic row and column indexes are generated.

Creating from a dict of Series objects:

dt = {'one': pd.Series([1, 2, 3], index = ['a', 'b', 'c']), 'two': pd.Series([9, 8, 7, 6], index = ['a', 'b', 'c', 'd'])}



d = pd.DataFrame(dt)

d
Out[8]: 
   one  two
a  1.0    9
b  2.0    8
c  3.0    7
d  NaN    6

d = pd.DataFrame(dt, index = ['b', 'c', 'd'], columns = ['two', 'three'])

d
Out[10]: 
   two three
b    8   NaN
c    7   NaN
d    6   NaN  ##data is aligned to the requested rows and columns; missing entries are filled with NaN

Creating from a dict of lists:

dl = {'one': [1, 2, 3, 4], 'two': [9, 8, 7, 6]}

d = pd.DataFrame(dl, index = ['a', 'b', 'c', 'd'])

d
Out[13]: 
   one  two
a    1    9
b    2    8
c    3    7
d    4    6

Columns can be indexed directly with []; .loc selects a row by label.

d['one']
Out[14]: 
a    1
b    2
c    3
d    4
Name: one, dtype: int64

d.loc['b']
Out[16]: 
one    2
two    8
Name: b, dtype: int64

d['one']['b']
Out[17]: 2

A DataFrame is a two-dimensional labeled array.
Its basic operations resemble those of Series and work by row and column.
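
Chained indexing such as d['one']['b'] works for reading, but .loc can address a row and a column in one step. The sketch below reuses the dict-of-lists example; the names cell, row, and sub are illustrative.

```python
import pandas as pd

dl = {'one': [1, 2, 3, 4], 'two': [9, 8, 7, 6]}
d = pd.DataFrame(dl, index=['a', 'b', 'c', 'd'])

cell = d.loc['b', 'one']            # single value: row 'b', column 'one'
row = d.loc['b']                    # one row, returned as a Series
sub = d.loc[['a', 'c'], ['two']]    # sub-table, returned as a DataFrame

print(cell)
print(row)
print(sub)
```

Passing lists for both axes always returns a DataFrame, even when only one column is selected.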

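Operations on Pandas Data Types

How can Series and DataFrame objects be changed?

Reindexing
.reindex() changes or reorders the index. In the example below, the column labels are 城市 (city), 环比 (month-on-month index), 同比 (year-on-year index), and 定基 (fixed-base index).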

dl = {'城市': ['北京', '上海', '广州', '深圳', '沈阳'],
    '环比': [101.5, 101.2, 101.3, 102.0, 100.1 ],
    '同比': [120.7, 127.3, 119.4, 140.9, 101.4 ],
    '定基': [121.4, 127.8, 120.0, 145.5, 101.6 ]}

d = pd.DataFrame(dl, index = ['c1', 'c2', 'c3', 'c4', 'c5'])

d
Out[22]: 
    城市     环比     同比     定基
c1  北京  101.5  120.7  121.4
c2  上海  101.2  127.3  127.8
c3  广州  101.3  119.4  120.0
c4  深圳  102.0  140.9  145.5
c5  沈阳  100.1  101.4  101.6

d = d.reindex(index = ['c5', 'c4', 'c3', 'c2', 'c1'])

d
Out[24]: 
    城市     环比     同比     定基
c5  沈阳  100.1  101.4  101.6
c4  深圳  102.0  140.9  145.5
c3  广州  101.3  119.4  120.0
c2  上海  101.2  127.3  127.8
c1  北京  101.5  120.7  121.4

d = d.reindex(columns=['城市', '同比', '环比', '定基'])

d
Out[27]: 
    城市     同比     环比     定基
c5  沈阳  101.4  100.1  101.6
c4  深圳  140.9  102.0  145.5
c3  广州  119.4  101.3  120.0
c2  上海  127.3  101.2  127.8
c1  北京  120.7  101.5  121.4

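Parameters of .reindex(index = None, columns = None, ...)

Parameter	Description
index, columns	New custom row and column indexes
fill_value	Value used to fill the missing positions created by reindexing
method	Fill method: ffill propagates the last valid value forward, bfill fills backward from the next valid value
limit	Maximum number of consecutive fills
copy	Default True: return a new object; if False, no copy is made when the new and old indexes are equal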
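
The fill parameters in the table are easier to see on a small numeric index. This is a minimal sketch with made-up data (the Series s is not from the original notes); method requires a monotonic index, which s has.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=[0, 2, 4])

# New labels 1, 3, 5 do not exist in s.
filled_const = s.reindex(range(6), fill_value=0)      # fill gaps with a constant
filled_ffill = s.reindex(range(6), method='ffill')    # copy the previous valid value

# limit caps how many consecutive new labels may be filled.
filled_limit = s.reindex(range(8), method='ffill', limit=1)

print(filled_const.tolist())
print(filled_ffill.tolist())
print(filled_limit)
```

In the last call, label 5 is filled from label 4, while labels 6 and 7 stay NaN because the limit of one consecutive fill has been reached.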

newc = d.columns.insert(4, '新增')

newd = d.reindex(columns = newc, fill_value = 200)

newd
Out[31]: 
    城市     同比     环比     定基   新增
c5  沈阳  101.4  100.1  101.6  200
c4  深圳  140.9  102.0  145.5  200
c3  广州  119.4  101.3  120.0  200
c2  上海  127.3  101.2  127.8  200
c1  北京  120.7  101.5  121.4  200

Index Types

d.index
Out[32]: Index(['c5', 'c4', 'c3', 'c2', 'c1'], dtype='object')

d.columns
Out[33]: Index(['城市', '同比', '环比', '定基'], dtype='object')

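Both are of type Index. Index objects are immutable.
Common Index methods:

Method	Description
.append(idx)	Concatenates another Index object and returns a new Index
.difference(idx)	Computes the set difference and returns a new Index (older pandas releases spelled this .diff(idx))
.intersection(idx)	Computes the intersection
.union(idx)	Computes the union
.delete(loc)	Removes the element at position loc
.insert(loc, e)	Inserts element e at position loc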
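
The methods in the table all return new Index objects. The following sketch exercises each one on two small, made-up indexes and also shows that in-place assignment is rejected; all variable names are illustrative.

```python
import pandas as pd

idx = pd.Index(['c5', 'c4', 'c3'])
other = pd.Index(['c3', 'c2'])

appended = idx.append(other)          # concatenation, duplicates are kept
difference = idx.difference(other)    # labels in idx but not in other
common = idx.intersection(other)      # labels present in both
combined = idx.union(other)           # labels present in either
removed = idx.delete(0)               # drop the element at position 0
inserted = idx.insert(1, 'new')       # insert 'new' at position 1

# An Index is immutable, so item assignment raises TypeError.
try:
    idx[0] = 'c9'
    mutated = True
except TypeError:
    mutated = False

print(list(appended), list(removed), list(inserted), mutated)
```

Since idx itself never changes, reindexing a DataFrame with a modified index always goes through a new Index object, as in the d.columns.insert(4, '新增') call earlier.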

Deleting Specified Index Entries
.drop() removes the specified row or column labels and returns a new object.

a = pd.Series([9, 8, 7, 6], index =  ['a', 'b', 'c', 'd'])

a
Out[39]: 
a    9
b    8
c    7
d    6
dtype: int64

a.drop(['b', 'c'])
Out[40]: 
a    9
d    6
dtype: int64
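
The same method works on a DataFrame, where axis selects between row labels and column labels. A minimal sketch with a made-up table (the labels w, x, y, z are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4),
                  index=['a', 'b', 'c'],
                  columns=['w', 'x', 'y', 'z'])

rows_dropped = df.drop(['b'])               # axis=0 (default): drop row labels
cols_dropped = df.drop(['x', 'z'], axis=1)  # axis=1: drop column labels

print(rows_dropped)
print(cols_dropped)
print(df.shape)   # the original DataFrame is unchanged
```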

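Arithmetic on Pandas Data Types

Rules for Arithmetic Operations

Operands are aligned on row and column indexes before the operation; positions missing from either operand are filled with NaN, so results are floating point by default.
Operations between two-dimensional and one-dimensional data, or between one-dimensional and zero-dimensional data, are broadcast.
The binary operators + - * / produce new objects.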

import pandas as pd

import numpy as np

a = pd.DataFrame(np.arange(12).reshape(3,4))

a
Out[4]: 
   0  1   2   3
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11

b = pd.DataFrame(np.arange(20).reshape(4,5))

b
Out[6]: 
    0   1   2   3   4
0   0   1   2   3   4
1   5   6   7   8   9
2  10  11  12  13  14
3  15  16  17  18  19

a + b
Out[7]: 
      0     1     2     3   4
0   0.0   2.0   4.0   6.0 NaN
1   9.0  11.0  13.0  15.0 NaN
2  18.0  20.0  22.0  24.0 NaN
3   NaN   NaN   NaN   NaN NaN

a * b
Out[8]: 
      0     1      2      3   4
0   0.0   1.0    4.0    9.0 NaN
1  20.0  30.0   42.0   56.0 NaN
2  80.0  99.0  120.0  143.0 NaN
3   NaN   NaN    NaN    NaN NaN

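The operations also exist as methods, which accept optional parameters.

Method	Description
.add(d, **kwargs)	Addition
.sub(d, **kwargs)	Subtraction
.mul(d, **kwargs)	Multiplication
.div(d, **kwargs)	Division

b.add(a, fill_value = 100)  ##positions missing from either a or b are treated as 100 instead of NaN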
Out[9]: 
       0      1      2      3      4
0    0.0    2.0    4.0    6.0  104.0
1    9.0   11.0   13.0   15.0  109.0
2   18.0   20.0   22.0   24.0  114.0
3  115.0  116.0  117.0  118.0  119.0

a.mul(b, fill_value = 0)
Out[10]: 
      0     1      2      3    4
0   0.0   1.0    4.0    9.0  0.0
1  20.0  30.0   42.0   56.0  0.0
2  80.0  99.0  120.0  143.0  0.0
3   0.0   0.0    0.0    0.0  0.0

Operations Across Dimensions


c = pd.Series(np.arange(4))


c
Out[12]: 
0    0
1    1
2    2
3    3
dtype: int32

c - 10  #broadcast: the scalar is applied to every element of the Series
Out[14]: 
0   -10
1    -9
2    -8
3    -7
dtype: int32

b - c  #each row of b is combined with c; a Series aligns with the column labels (axis 1) by default
Out[15]: 
      0     1     2     3   4
0   0.0   0.0   0.0   0.0 NaN
1   5.0   5.0   5.0   5.0 NaN
2  10.0  10.0  10.0  10.0 NaN
3  15.0  15.0  15.0  15.0 NaN

b.sub(c, axis = 0)  #align c with the row labels (axis 0), so each column of b is combined with c
Out[16]: 
    0   1   2   3   4
0   0   1   2   3   4
1   4   5   6   7   8
2   8   9  10  11  12
3  12  13  14  15  16
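
One common use of axis = 0 broadcasting is subtracting a per-row statistic. The sketch below centers each row on its own mean; the names df, row_means, and centered are illustrative and not from the original notes.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape(4, 5))

# One mean per row; the resulting Series is indexed by the row labels.
row_means = df.mean(axis=1)

# axis=0 aligns the Series with the row labels, so each row has
# its own mean subtracted from every column.
centered = df.sub(row_means, axis=0)
print(centered)
```

Without axis=0 the Series would be aligned with the column labels instead, which is the default shown by b - c above.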

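Rules for Comparison Operations

Comparison operators only compare elements with identical indexes; no alignment fill is performed.
Operations between two-dimensional and one-dimensional data, or between one-dimensional and zero-dimensional data, are broadcast.
Binary operations with > < >= <= == != produce Boolean objects.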

a = pd.DataFrame(np.arange(12).reshape(3,4))

a
Out[18]: 
   0  1   2   3
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11

d = pd.DataFrame(np.arange(12, 0 ,-1).reshape(3,4))

d
Out[20]: 
    0   1   2  3
0  12  11  10  9
1   8   7   6  5
2   4   3   2  1

a > d  #same dimensions and same shape
Out[21]: 
       0      1      2      3
0  False  False  False  False
1  False  False  False   True
2   True   True   True   True

a == d
Out[22]: 
       0      1      2      3
0  False  False  False  False
1  False  False   True  False
2  False  False  False  False

Across dimensions

a > c  #different dimensions: broadcast, along axis 1 (row by row) by default
Out[23]: 
       0      1      2      3
0  False  False  False  False
1   True   True   True   True
2   True   True   True   True

c > 0
Out[24]: 
0    False
1     True
2     True
3     True
dtype: bool

c
Out[25]: 
0    0
1    1
2    2
3    3
dtype: int32

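Understand the relationship between data and index: operating on data means operating on indexes.

Unit 8: Feature Analysis of Data with Pandas

Sorting Data

Understanding a Set of Data

A set of data expresses one or more meanings.
Summarization: the lossy process of extracting features from data

Basic statistics (including sorting)
Distribution and cumulative statistics
Data features
Correlation, periodicity, etc.
Data mining (forming knowledge)

Sorting Data in Pandas

.sort_index(): sorts by index labels along the given axis (default axis 0, the row index), ascending by default
.sort_index(axis = 0, ascending = True)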

b = pd.DataFrame(np.arange(20).reshape(4,5), index = ['c', 'a', 'd', 'b'])

b
Out[27]: 
    0   1   2   3   4
c   0   1   2   3   4
a   5   6   7   8   9
d  10  11  12  13  14
b  15  16  17  18  19

b.sort_index()
Out[28]: 
    0   1   2   3   4
a   5   6   7   8   9
b  15  16  17  18  19
c   0   1   2   3   4
d  10  11  12  13  14

b.sort_index(ascending = False)  #sort in descending order
Out[29]: 
    0   1   2   3   4
d  10  11  12  13  14
c   0   1   2   3   4
b  15  16  17  18  19
a   5   6   7   8   9

c = b.sort_index(axis = 1, ascending = False)  #sort the column labels in descending order

c
Out[31]: 
    4   3   2   1   0
c   4   3   2   1   0
a   9   8   7   6   5
d  14  13  12  11  10
b  19  18  17  16  15

c = c.sort_index()

c
Out[33]: 
    4   3   2   1   0
a   9   8   7   6   5
b  19  18  17  16  15
c   4   3   2   1   0
d  14  13  12  11  10

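.sort_values(): sorts by value along the given axis, ascending by default.
Series.sort_values(axis = 0, ascending = True)
DataFrame.sort_values(by, axis = 0, ascending = True)

by: a label or list of labels to sort by (a column label when axis = 0, a row label when axis = 1)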

b
Out[34]: 
    0   1   2   3   4
c   0   1   2   3   4
a   5   6   7   8   9
d  10  11  12  13  14
b  15  16  17  18  19

c = b.sort_values(2, ascending = False)  #sort the rows in descending order by the values in column 2

c
Out[36]: 
    0   1   2   3   4
b  15  16  17  18  19
d  10  11  12  13  14
a   5   6   7   8   9
c   0   1   2   3   4

c = c.sort_values('a', axis = 1, ascending = False)  #sort the columns in descending order by the values in row 'a'

c
Out[38]: 
    4   3   2   1   0
b  19  18  17  16  15
d  14  13  12  11  10
a   9   8   7   6   5
c   4   3   2   1   0

NaN values are always placed at the end of the sort result, whether ascending or descending.

a = pd.DataFrame(np.arange(12).reshape(3,4), index = ['a', 'b', 'c'])

a

Out[40]: 
   0  1   2   3
a  0  1   2   3
b  4  5   6   7
c  8  9  10  11

b
Out[41]: 
    0   1   2   3   4
c   0   1   2   3   4
a   5   6   7   8   9
d  10  11  12  13  14
b  15  16  17  18  19

c = a + b

c
Out[43]: 
      0     1     2     3   4
a   5.0   7.0   9.0  11.0 NaN
b  19.0  21.0  23.0  25.0 NaN
c   8.0  10.0  12.0  14.0 NaN
d   NaN   NaN   NaN   NaN NaN

c.sort_values(2, ascending = False)
Out[44]: 
      0     1     2     3   4
b  19.0  21.0  23.0  25.0 NaN
c   8.0  10.0  12.0  14.0 NaN
a   5.0   7.0   9.0  11.0 NaN
d   NaN   NaN   NaN   NaN NaN

c.sort_values(2, ascending = True)
Out[45]: 
      0     1     2     3   4
a   5.0   7.0   9.0  11.0 NaN
c   8.0  10.0  12.0  14.0 NaN
b  19.0  21.0  23.0  25.0 NaN
d   NaN   NaN   NaN   NaN NaN
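
Placing NaN last is the default; sort_values() also accepts na_position to put missing values first. A minimal sketch on a made-up Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([3.0, np.nan, 1.0, 2.0])

asc = s.sort_values()                        # NaN last (default)
desc = s.sort_values(ascending=False)        # NaN still last
first = s.sort_values(na_position='first')   # NaN moved to the front

print(asc.tolist())
print(desc.tolist())
print(first.tolist())
```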

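Basic Statistical Analysis of Data

Basic statistical functions
Apply to both Series and DataFrame.
They operate along axis 0 by default (one result per column of a DataFrame).

Method	Description
.sum()	Sum of the data
.count()	Number of non-NaN values
.mean() .median()	Arithmetic mean, median
.var() .std()	Variance, standard deviation
.min() .max()	Minimum, maximum

Apply to Series

Method	Description
.argmin() .argmax()	Integer position (automatic index) of the minimum, maximum value
.idxmin() .idxmax()	Index label (custom index) of the minimum, maximum value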
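
The difference between the position-based and label-based variants shows up as soon as the Series has a custom index. A short sketch using the same values as the other Series examples:

```python
import pandas as pd

a = pd.Series([9, 8, 7, 6], index=['a', 'b', 'c', 'd'])

pos_max = a.argmax()     # integer position of the largest value
pos_min = a.argmin()     # integer position of the smallest value
label_max = a.idxmax()   # index label of the largest value
label_min = a.idxmin()   # index label of the smallest value

print(pos_max, pos_min, label_max, label_min)
```

The positions can be passed to .iloc and the labels to .loc to retrieve the corresponding values.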

Basic statistical functions

Apply to both Series and DataFrame

Method	Description
.describe()	Statistical summary along axis 0 (per column)

a = pd.Series([9, 8, 7, 6], index = ['a', 'b', 'c', 'd'])

a
Out[3]: 
a    9
b    8
c    7
d    6
dtype: int64

a.describe()  #prints a set of summary statistics at once
Out[4]: 
count    4.000000
mean     7.500000
std      1.290994
min      6.000000
25%      6.750000
50%      7.500000
75%      8.250000
max      9.000000
dtype: float64

type(a.describe())  #the result is a Series
Out[5]: pandas.core.series.Series

a.describe()['count']  #index it like any other Series
Out[6]: 4.0

import numpy as np

b = pd.DataFrame(np.arange(20).reshape(4,5), index = ['c', 'a', 'b', 'd'])

b
Out[9]: 
    0   1   2   3   4
c   0   1   2   3   4
a   5   6   7   8   9
b  10  11  12  13  14
d  15  16  17  18  19

b.describe()
Out[10]: 
               0          1          2          3          4
count   4.000000   4.000000   4.000000   4.000000   4.000000
mean    7.500000   8.500000   9.500000  10.500000  11.500000
std     6.454972   6.454972   6.454972   6.454972   6.454972
min     0.000000   1.000000   2.000000   3.000000   4.000000
25%     3.750000   4.750000   5.750000   6.750000   7.750000
50%     7.500000   8.500000   9.500000  10.500000  11.500000
75%    11.250000  12.250000  13.250000  14.250000  15.250000
max    15.000000  16.000000  17.000000  18.000000  19.000000

type(b.describe())
Out[11]: pandas.core.frame.DataFrame

b.describe().loc['max']  #get the max row
Out[13]: 
0    15.0
1    16.0
2    17.0
3    18.0
4    19.0
Name: max, dtype: float64

b.describe()[2]  #get the third column
Out[14]: 
count     4.000000
mean      9.500000
std       6.454972
min       2.000000
25%       5.750000
50%       9.500000
75%      13.250000
max      17.000000
Name: 2, dtype: float64

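Cumulative Statistical Analysis of Data

Cumulative operations work over the first n items and reduce the need for explicit loops.

Cumulative statistical functions

Apply to both Series and DataFrame

Method	Description
.cumsum()	Sum of the first 1, 2, ..., n items
.cumprod()	Product of the first 1, 2, ..., n items
.cummax()	Maximum of the first 1, 2, ..., n items
.cummin()	Minimum of the first 1, 2, ..., n items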

b
Out[15]: 
    0   1   2   3   4
c   0   1   2   3   4
a   5   6   7   8   9
b  10  11  12  13  14
d  15  16  17  18  19

b.cumsum()  #running sum down each column
Out[16]: 
    0   1   2   3   4
c   0   1   2   3   4
a   5   7   9  11  13
b  15  18  21  24  27
d  30  34  38  42  46

b.cumprod()
Out[17]: 
   0     1     2     3     4
c  0     1     2     3     4
a  0     6    14    24    36
b  0    66   168   312   504
d  0  1056  2856  5616  9576
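
The remaining two cumulative functions, and the axis argument, can be sketched on small made-up inputs (s and df below are illustrative and not the b used above):

```python
import numpy as np
import pandas as pd

s = pd.Series([3, 1, 4, 1, 5])

running_max = s.cummax()   # largest value seen so far
running_min = s.cummin()   # smallest value seen so far

# axis=1 accumulates across the columns of each row instead of down the rows.
df = pd.DataFrame(np.arange(6).reshape(2, 3))
row_cumsum = df.cumsum(axis=1)

print(running_max.tolist())
print(running_min.tolist())
print(row_cumsum)
```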

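Rolling (window) functions apply to both Series and DataFrame. Unlike cumulative functions, they do not accumulate from the first item.

Method	Description
.rolling(w).sum()	Sum of each run of w adjacent elements
.rolling(w).mean()	Arithmetic mean of each run of w adjacent elements
.rolling(w).var()	Variance of each run of w adjacent elements
.rolling(w).std()	Standard deviation of each run of w adjacent elements
.rolling(w).min() .max()	Minimum, maximum of each run of w adjacent elements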

b
Out[18]: 
    0   1   2   3   4
c   0   1   2   3   4
a   5   6   7   8   9
b  10  11  12  13  14
d  15  16  17  18  19

b.rolling(2).sum() #down each column, sum each element with the 1 element before it
Out[19]: 
      0     1     2     3     4
c   NaN   NaN   NaN   NaN   NaN
a   5.0   7.0   9.0  11.0  13.0
b  15.0  17.0  19.0  21.0  23.0
d  25.0  27.0  29.0  31.0  33.0

b.rolling(3).sum()  #down each column, sum each element with the 2 elements before it
Out[20]: 
      0     1     2     3     4
c   NaN   NaN   NaN   NaN   NaN
a   NaN   NaN   NaN   NaN   NaN
b  15.0  18.0  21.0  24.0  27.0
d  30.0  33.0  36.0  39.0  42.0
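
The leading NaN rows appear because a window of w elements is incomplete for the first w - 1 positions. The min_periods parameter of rolling() relaxes that requirement. A minimal sketch on one column of the data above (the names strict and partial are illustrative):

```python
import pandas as pd

s = pd.Series([0, 5, 10, 15])

# Default: a result is produced only when all 3 window elements exist.
strict = s.rolling(3).sum()

# min_periods=1: partial windows at the start are summed as well.
partial = s.rolling(3, min_periods=1).sum()

print(strict.tolist())
print(partial.tolist())
```

Whether partial windows are meaningful depends on the statistic; for a mean they are usually acceptable, for a fixed-width sum they may mislead.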

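Correlation Analysis of Data

Covariance

Covariance formula (sample form, as computed by .cov()): $\mathrm{cov}(X, Y) = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})$

Covariance > 0: X and Y are positively correlated
Covariance < 0: X and Y are negatively correlated
Covariance = 0: X and Y are uncorrelated

Pearson Correlation Coefficient

Pearson correlation coefficient formula: $r = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{n}(Y_i - \bar{Y})^2}}$

Range: [-1, 1]
Interpreting the absolute value:

0.8~1.0: very strong correlation
0.6~0.8: strong correlation
0.4~0.6: moderate correlation
0.2~0.4: weak correlation
0.0~0.2: very weak or no correlation

Method	Description
.cov()	Computes the covariance matrix
.corr()	Computes the correlation matrix; Pearson, Spearman, Kendall and other coefficients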
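
On a DataFrame, both methods return a square matrix with one row and one column per input column. The sketch below uses made-up columns x, y, z chosen so that the expected correlations are obvious (y doubles x, z runs opposite to x).

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3, 4],
                   'y': [2, 4, 6, 8],    # y = 2 * x, perfectly positively related
                   'z': [4, 3, 2, 1]})   # z = 5 - x, perfectly negatively related

cov_matrix = df.cov()     # pairwise sample covariances
corr_matrix = df.corr()   # pairwise Pearson coefficients (the default method)

print(cov_matrix)
print(corr_matrix)
```

The diagonal of the covariance matrix holds each column's variance, and the diagonal of the correlation matrix is 1.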

Example: correlation between house price growth and M2 growth

hprice = pd.Series([3.04, 22.93, 12.75, 22.6, 12.33], index = ['2008', '2009', '2010', '2011', '2012'])

m2 = pd.Series([8.18, 18.38, 9.13, 7.82, 6.69], index = ['2008', '2009', '2010', '2011', '2012'])

hprice.corr(m2)  #compute the correlation
Out[23]: 0.5239439145220387
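
As a cross-check, the Pearson coefficient can be computed by hand from the deviations from the mean and compared with the result of .corr(). This is a sketch of the standard definition, not of how pandas implements it internally; the names x, y, r_manual, and cov_manual are illustrative.

```python
import numpy as np
import pandas as pd

years = ['2008', '2009', '2010', '2011', '2012']
hprice = pd.Series([3.04, 22.93, 12.75, 22.6, 12.33], index=years)
m2 = pd.Series([8.18, 18.38, 9.13, 7.82, 6.69], index=years)

# Deviations from the mean for each series.
x = hprice - hprice.mean()
y = m2 - m2.mean()

# Sample covariance: sum of products of deviations divided by n - 1.
cov_manual = (x * y).sum() / (len(x) - 1)

# Pearson r: covariance normalised by both standard deviations;
# the n - 1 factors cancel, leaving sums of squares.
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

print(cov_manual, hprice.cov(m2))
print(r_manual, hprice.corr(m2))
```

By the scale listed earlier, a coefficient of about 0.52 falls in the moderate range.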
