2020-08-09--Pandas-12--Data Analysis in Practice: Stock Analysis

Installing the data-access libraries used with pandas

pip install pandas_datareader       # data-access interface for pandas (slow)
pip install baostock           # baostock data-access interface
pip install tushare              # Tushare data-access interface

Fetching data with tushare and plotting it

1. Import the libraries and set the configuration

import tushare as ts
import pandas as pd
import matplotlib.pyplot as plt
plt.rcParams["font.sans-serif"] = ["SimHei"]
plt.rcParams["axes.unicode_minus"] = False

2. Call the interface to fetch the source data

Fetch the stock data of two companies.
Parameters: stock code, start date, end date

'''Fetch the data'''
# 浙江传媒财新 (stock code 600633)
caixin = ts.get_hist_data('600633', start='2017-07-01', end='2020-05-08')
print(type(caixin))         # <class 'pandas.core.frame.DataFrame'>
print(caixin)
# Notice printed by tushare: this interface will stop being updated soon; please switch to the Pro interface: https://tushare.pro/document/2
#              open   high  close  ...     v_ma10     v_ma20  turnover
# date                             ...                                
# 2020-05-08   9.89   9.98   9.76  ...  174862.54  157315.40      1.57
# 2020-05-07   9.80   9.95   9.80  ...  178983.13  155860.60      1.08
# 2020-05-06   9.45   9.88   9.78  ...  175347.03  157914.33      1.30
# 2020-04-30   9.28   9.64   9.57  ...  177440.81  163938.70      1.29
# 2020-04-29   9.26   9.37   9.20  ...  177899.23  180245.58      0.72
# ...           ...    ...    ...  ...        ...        ...       ...
# 2018-02-13  13.30  13.36  13.19  ...   76922.75   76922.75      0.48
# 2018-02-12  12.93  13.37  13.21  ...   80603.96   80603.96      0.49
# 2018-02-09  12.70  13.11  12.87  ...   86435.40   86435.40      0.64
# 2018-02-08  12.97  13.34  13.19  ...   88378.98   88378.98      0.64
# 2018-02-07  13.12  13.29  13.00  ...   94070.77   94070.77      0.73
# 
# [543 rows x 14 columns]

# 南方传媒 (stock code 601900)
nanfang = ts.get_hist_data('601900',start='2017-07-01', end='2020-05-08')
print(nanfang)
# Notice printed by tushare: this interface will stop being updated soon; please switch to the Pro interface: https://tushare.pro/document/2
#              open   high  close    low  ...     v_ma5    v_ma10    v_ma20  turnover
# date                                    ...                                        
# 2020-05-08   9.56   9.68   9.66   9.53  ...  30763.41  36500.26  35120.70      0.29
# 2020-05-07   9.52   9.67   9.53   9.50  ...  33324.80  37446.65  36674.61      0.26
# 2020-05-06   9.52   9.60   9.58   9.31  ...  33322.07  37897.68  38018.07      0.57
# 2020-04-30   9.58   9.70   9.61   9.53  ...  29903.02  36151.07  38027.82      0.30
# 2020-04-29   9.50   9.68   9.55   9.46  ...  32541.47  37241.41  38554.26      0.31
# ...           ...    ...    ...    ...  ...       ...       ...       ...       ...
# 2018-02-13  10.03  10.14  10.05   9.98  ...  19288.10  19288.10  19288.10      1.04
# 2018-02-12   9.72  10.08  10.01   9.72  ...  19534.70  19534.70  19534.70      1.06
# 2018-02-09   9.91   9.95   9.70   9.65  ...  19820.23  19820.23  19820.23      1.39
# 2018-02-08  10.09  10.21  10.13  10.02  ...  17561.04  17561.04  17561.04      0.85
# 2018-02-07  10.03  10.09  10.05   9.85  ...  20174.98  20174.98  20174.98      1.15
# 
# [543 rows x 14 columns]

The returned data is a DataFrame.

3. Organize the data structure

Because the closing-price lines of several companies are drawn in one figure, the data is organized as a single DataFrame.

'''Organize the data structure as a DataFrame'''
data = {'浙江传媒财新': caixin.close, '南方传媒':nanfang.close}
print(type(data))         # <class 'dict'>
print(data)
# {'浙江传媒财新': date
# 2020-05-08     9.76
# 2020-05-07     9.80
# 2020-05-06     9.78
# 2020-04-30     9.57
# 2020-04-29     9.20
#               ...
# 2018-02-13    13.19
# 2018-02-12    13.21
# 2018-02-09    12.87
# 2018-02-08    13.19
# 2018-02-07    13.00
# Name: close, Length: 543, dtype: float64, '南方传媒': date
# 2020-05-08     9.66
# 2020-05-07     9.53
# 2020-05-06     9.58
# 2020-04-30     9.61
# 2020-04-29     9.55
#               ...
# 2018-02-13    10.05
# 2018-02-12    10.01
# 2018-02-09     9.70
# 2018-02-08    10.13
# 2018-02-07    10.05
# Name: close, Length: 543, dtype: float64}

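# Convert to a DataFrame; date (the time column) becomes the index.
# Each column name is a key of the data dict, and each column's data is the corresponding value.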
df = pd.DataFrame(data)
print(df)
#             浙江传媒财新   南方传媒
# date
# 2020-05-08    9.76   9.66
# 2020-05-07    9.80   9.53
# 2020-05-06    9.78   9.58
# 2020-04-30    9.57   9.61
# 2020-04-29    9.20   9.55
# ...            ...    ...
# 2018-02-13   13.19  10.05
# 2018-02-12   13.21  10.01
# 2018-02-09   12.87   9.70
# 2018-02-08   13.19  10.13
# 2018-02-07   13.00  10.05
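
The construction above relies on pandas aligning the Series by their index. A minimal sketch on made-up prices (the names stock_a and stock_b are hypothetical) shows what happens when one stock has no quote on a given date:

```python
import pandas as pd

# Made-up closing prices; stock_b has no quote on 2020-05-07.
stock_a = pd.Series([9.78, 9.80, 9.76],
                    index=["2020-05-06", "2020-05-07", "2020-05-08"], name="close")
stock_b = pd.Series([9.58, 9.66],
                    index=["2020-05-06", "2020-05-08"], name="close")

# Dict keys become column names; the row index is the union of both indexes.
aligned = pd.DataFrame({"stock_a": stock_a, "stock_b": stock_b})
print(aligned)

# Dates missing from one series appear as NaN and can be dropped if needed.
complete = aligned.dropna()
```

In the data fetched here both stocks have 543 rows, so this may not matter, but it is worth checking when the two stocks were suspended on different days.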

4. Sort the index

To make the plot easier to read, the dates on the x-axis should be in ascending order.


'''Sort by the index, in place'''
df.sort_index(ascending=True, inplace=True)
print(df)
#             浙江传媒财新   南方传媒
# date
# 2018-02-07   13.00  10.05
# 2018-02-08   13.19  10.13
# 2018-02-09   12.87   9.70
# 2018-02-12   13.21  10.01
# 2018-02-13   13.19  10.05
# ...            ...    ...
# 2020-04-29    9.20   9.55
# 2020-04-30    9.57   9.61
# 2020-05-06    9.78   9.58
# 2020-05-07    9.80   9.53
# 2020-05-08    9.76   9.66
#
# [543 rows x 2 columns]

5. Plot

# Plot directly from the DataFrame (x-axis ticks are thinned automatically)
df.plot(kind='line')
plt.xticks(rotation=45)       # tilt the x-axis labels
plt.show()

Result: (figure: line chart of the two closing-price series)

Plotting the volume column

# Draw a bar chart of the volume column: x-axis is the date, y-axis is the volume
plt.bar(caixin.index,caixin.volume)
# plt.gcf().set_size_inches(15,8)
plt.show()

Unlike the previous plot, the x-axis labels here are not thinned out automatically; I have not found a solution for this yet.
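
One possible workaround, not part of the original post: the index returned by tushare appears to hold date strings, so matplotlib treats every date as its own category and labels each bar. Converting the index to real datetimes lets matplotlib choose a few date ticks. The sketch below uses a synthetic frame named demo as a stand-in for caixin; the column values are made up.

```python
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

# Synthetic stand-in for the fetched data: string dates as the index.
dates = pd.bdate_range("2020-01-01", periods=60).strftime("%Y-%m-%d")
demo = pd.DataFrame({"volume": range(1000, 1060)}, index=dates)

# Convert the string index to datetimes so matplotlib uses a date axis.
demo.index = pd.to_datetime(demo.index)

fig, ax = plt.subplots(figsize=(10, 4))
ax.bar(demo.index, demo["volume"])

# Let matplotlib pick a small number of date ticks and format them compactly.
locator = mdates.AutoDateLocator()
ax.xaxis.set_major_locator(locator)
ax.xaxis.set_major_formatter(mdates.ConciseDateFormatter(locator))
fig.autofmt_xdate()  # tilt the tick labels
```

Call plt.show() afterwards to display the figure. The same conversion should apply to caixin.index before calling plt.bar.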

Showing the close and volume columns in one figure

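Creating subplots with plt.subplot2grid(shape, loc, rowspan, colspan)

shape: the (rows, columns) grid into which the whole figure is divided
loc: the (row, column) starting cell; the top-left cell is (0, 0), the row index increases downward and the column index increases to the right.
rowspan: the number of grid rows the subplot occupies, counted from the starting cell
colspan: the number of grid columns the subplot occupies, counted from the starting cell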

# Draw the price history on top
# Create the subplot at its grid position
top = plt.subplot2grid((4,4), (0, 0), rowspan=3, colspan=4)
# Plot the closing prices
top.plot(caixin.index, caixin.close,
         label='600633 Close')
# Set the title and the legend position
plt.title('600633 Close Price')
plt.legend(loc='best')

# and the volume along the bottom
bottom = plt.subplot2grid((4,4), (3,0), rowspan=1, colspan=4)
bottom.bar(caixin.index, caixin.volume)
plt.title('600633 Trading Volume')

plt.subplots_adjust(hspace=0.75)
plt.gcf().set_size_inches(15,8)

plt.show()

Computing the daily change ratio

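'''Daily ratio'''
# Divide each day's close by the previous day's close to get the ratio.
# The data arrives newest-first, so sort by date first; otherwise shift(1) picks the following day.
close = caixin.close.sort_index()
x = close/close.shift(1)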
# print(x)
x.plot()
plt.show()
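
The ratio can also be expressed as a percentage change with pandas' pct_change. A minimal sketch on made-up prices, stored newest-first to mimic the order of the fetched data:

```python
import pandas as pd

# Made-up closing prices, newest first.
close = pd.Series([11.0, 10.0, 12.5, 10.0],
                  index=["2020-05-08", "2020-05-07", "2020-05-06", "2020-04-30"])

# Sort oldest-first so that "previous row" really means "previous trading day".
close = close.sort_index()

ratio = close / close.shift(1)    # today / yesterday
pct = close.pct_change() * 100    # percentage change relative to yesterday
print(pct)
```

The first row has no previous day, so both results start with NaN. Apart from that, pct is (ratio - 1) * 100.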

Histograms

Draw a histogram for every data column of 南方传媒 (the whole DataFrame)

nanfang.hist(bins=50)
plt.show()

Draw a histogram of the close column

nanfang.close.hist(bins=50)
plt.show()

Computing a moving average

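'''Moving average'''
# Sort oldest-first so each window covers the preceding 30 trading days
se = nanfang.close.sort_index()
# Compute the 30-day moving average
ava = se.rolling(window=30).mean()
# Wrap both series in a DataFrame
df = pd.DataFrame({
    'Close':se,
    '30-day MA':ava,
})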
df.plot()
plt.show()
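
To see what rolling(window=...).mean() computes, here is a small sketch on made-up numbers with a window of 3 instead of 30. The min_periods variant is an optional extra, not something the original code uses.

```python
import pandas as pd

prices = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0],
                   index=pd.date_range("2020-01-01", periods=5))

# Each value is the mean of the current row and the two rows before it;
# the first two rows have too little history and come out as NaN.
ma3 = prices.rolling(window=3).mean()

# min_periods=1 averages whatever history exists instead of returning NaN.
ma3_partial = prices.rolling(window=3, min_periods=1).mean()

print(pd.DataFrame({"price": prices, "ma3": ma3, "ma3_partial": ma3_partial}))
```

With window=30, the first 29 rows of the moving-average column are NaN for the same reason, which is why that line starts later than the closing-price line in the plot.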

Correlation analysis between the data columns

# Correlation statistics
# corr() computes pairwise correlation coefficients (not covariance); Pearson by default
re = nanfang.corr()
print(re)
#               open      high     close  ...    v_ma10  \
# open      1.000000  0.985553  0.975340  ...  0.516442
# high      0.985553  1.000000  0.991277  ...  0.545002
# close     0.975340  0.991277  1.000000  ...  0.509707
# low       0.989004  0.981960  0.986465  ...  0.482628
# volume    0.505106  0.570144  0.523075  ...  0.748281
# ...            ...       ...       ...  ...       ...
# ma20      0.865912  0.850304  0.855717  ...  0.455114
# v_ma5     0.544599  0.589638  0.545738  ...  0.933110
# v_ma10    0.516442  0.545002  0.509707  ...  1.000000
# v_ma20    0.464613  0.478883  0.454400  ...  0.889177
# turnover  0.525195  0.587116  0.546245  ...  0.483505
#
#             v_ma20  turnover
# open      0.464613  0.525195
# high      0.478883  0.587116
# close     0.454400  0.546245
# low       0.437511  0.484344
# volume    0.595617  0.754537
# ...            ...       ...
# ma20      0.531830  0.374942
# v_ma5     0.763993  0.575646
# v_ma10    0.889177  0.483505
# v_ma20    1.000000  0.330785
# turnover  0.330785  1.000000
#
# [14 rows x 14 columns]
plt.imshow(re, cmap='hot', interpolation='none')
plt.colorbar()
plt.xticks(range(len(re)), re.columns)
plt.yticks(range(len(re)), re.columns)

plt.show()
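
The matrix above correlates the raw column values of one stock. If the aim is to compare the daily change rates of closing prices between stocks, one option is to correlate the pct_change of each close series. A minimal sketch with made-up prices and hypothetical stock names:

```python
import pandas as pd

# Made-up closing prices for three hypothetical stocks, oldest first.
closes = pd.DataFrame({
    "stock_a": [10.0, 10.5, 10.2, 10.8, 11.0],
    "stock_b": [20.0, 21.0, 20.4, 21.6, 22.0],   # always twice stock_a
    "stock_c": [5.0, 4.9, 5.1, 4.8, 4.9],        # moves against stock_a
})

# Daily change rates; the first row has no previous day and is dropped.
returns = closes.pct_change().dropna()

# Pearson correlation between the daily change rates.
corr = returns.corr()
print(corr)
```

Here stock_a and stock_b have identical change rates, so their correlation is 1, while stock_c correlates negatively with both.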

Supplementary pandas notes

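The pivot function reshapes one DataFrame into another table. index gives the new table's row index, columns gives the new table's columns, and values names the column(s) whose data fills the new table. In other words, the DataFrame is reorganized along index and columns.

For a DataFrame with columns foo, bar and baz, df.pivot(index='foo', columns='bar', values='baz') reorganizes the data with foo and bar as the axes and baz as the values. Each (foo, bar) pair must be unique; if a pair is repeated, the call fails with a ValueError.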
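
The reshaping described above can be sketched as follows. The data values are made up for illustration, and the pivot_table call at the end is an optional alternative for handling duplicates.

```python
import pandas as pd

long_df = pd.DataFrame({
    "foo": ["one", "one", "two", "two"],
    "bar": ["A", "B", "A", "B"],
    "baz": [1, 2, 3, 4],
})

# foo becomes the row index, bar the columns, baz the cell values.
wide = long_df.pivot(index="foo", columns="bar", values="baz")
print(wide)

# A repeated (foo, bar) pair makes the reshape ambiguous and raises ValueError.
dup = pd.concat([long_df, long_df.iloc[[0]]], ignore_index=True)
try:
    dup.pivot(index="foo", columns="bar", values="baz")
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True

# pivot_table aggregates the duplicates instead of failing.
agg = dup.pivot_table(index="foo", columns="bar", values="baz", aggfunc="mean")
```

When duplicates are expected, pivot_table with an explicit aggfunc is usually the safer choice, because it states how repeated cells are combined.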