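Before starting this time, I need to sort out my thinking first. So far, none of what I have studied has been tied to my actual work or life; it still feels like learning for the sake of learning.
What is the use of learning this?
That is a question I have not thought through. Could these few Python lessons really become a skill reserve for a future move into the big-data industry?
Sometimes I feel naive.
But I have also always felt that getting exposure to this skill is really meaningful.
Have I loaded learning Python with the expectation that it must become "an essential skill for a future career change", and lost the joy of learning along the way?
All right, enough rambling. I don't know where my big-data studies should go from here, so I'll just keep studying.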
import pandas as pd
import numpy as np
import scipy.stats
import matplotlib.pyplot as plt
import matplotlib  # ?? What is the difference between importing only matplotlib and importing matplotlib.pyplot?
matplotlib.style.use('ggplot')
%matplotlib inline
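#%config BackendInline.forture_commat('retina')
#%config InlineBackend.fortune_format = 'retina'
%config InlineBackend.figure_format = 'retina'
# Error log: I keep getting this retina figure-format line wrong. 1. figure_format was plainly misspelled. 2. It is set with an equals sign, not called with parentheses.
# New knowledge: ggplot is a plotting package for R that builds a figure layer by layer and is pleasant to use. matplotlib.style.use('ggplot') only applies a matplotlib style sheet that imitates its look.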
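On the question in the import cell: `import matplotlib` loads the top-level package (configuration, backends, version), while `matplotlib.pyplot` is the submodule with the plotting functions, conventionally imported as `plt`. A minimal sketch, assuming a headless `Agg` backend so it also runs outside a notebook; the three data points are invented.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, not needed inside Jupyter
import matplotlib.pyplot as plt

# The 'ggplot' style sheet ships with matplotlib itself;
# it only imitates the look of R's ggplot2.
has_ggplot = "ggplot" in plt.style.available
plt.style.use("ggplot")

# pyplot provides the figure/axes functions used for the actual drawing.
fig, ax = plt.subplots()
line, = ax.plot([1, 2, 3], [2, 4, 8], marker="o")
n_points = len(line.get_xdata())
plt.close(fig)
```

In a notebook the `matplotlib.use("Agg")` line can be dropped, since `%matplotlib inline` already selects a backend.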
ls data
 Volume in drive C is WIN7
 Volume Serial Number is CCED-57EE
 Directory of C:\Users\Administrator.USER-20170623BT\Desktop\第七课材料\第七课材料\codes\data
2017/08/31  12:13    <DIR>          .
2017/08/31  12:13    <DIR>          ..
2017/08/14  17:39             6,148 .DS_Store
2017/08/14  12:37            14,142 data_75_12.csv
2017/08/14  17:50             3,930 evolution.csv
2017/08/14  13:15             8,568 fortis_heritability.csv
               4 File(s)         32,788 bytes
               2 Dir(s)  40,510,488,576 bytes free
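On my own computer I cannot run Linux commands with the exclamation-mark prefix (`!ls`). Without the exclamation mark, `ls data` does run, but it prints the much more verbose Windows directory listing shown above, presumably because IPython maps `ls` to the Windows `dir` command.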
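A listing that does not depend on the operating system's shell can be done from Python itself. This is only a sketch: it builds a temporary folder with two empty files (the file names are borrowed from the listing above, the folder itself is made up) and lists it with `os.listdir` and `pathlib`.

```python
import os
import tempfile
from pathlib import Path

# Build a throwaway folder with two empty files so the sketch is self-contained.
tmp_dir = Path(tempfile.mkdtemp())
(tmp_dir / "evolution.csv").touch()
(tmp_dir / "data_75_12.csv").touch()

# os.listdir behaves the same on Windows, macOS and Linux.
names = sorted(os.listdir(tmp_dir))

# pathlib can filter by pattern, for example only CSV files.
csv_files = sorted(p.name for p in tmp_dir.glob("*.csv"))
print(names)
```

In the notebook, `os.listdir('data')` would list the real data folder in the same way.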
Data import and cleaning
This part cleans up the raw data.
evolution = pd.read_csv(r'data/evolution.csv')
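# Error log: 1. The notebook and the file are not in the same folder, so the path needs the folder prefix "data/"; I naively assumed I could load the file directly. 2. I typed csv as scv.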
evolution.head()
evolution.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 83 entries, 0 to 82
Data columns (total 8 columns):
Year 82 non-null object
Species 81 non-null object
Beak length 81 non-null object
Beak depth 80 non-null float64
Beak width 80 non-null float64
CI Beak length 80 non-null object
CI Beak depth 80 non-null object
CI Beak width 80 non-null object
dtypes: float64(2), object(6)
memory usage: 5.3+ KB
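Only two columns are float64, and every column has fewer non-null values than the 83 rows in the index, so there are missing values.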
evolution = evolution.dropna()
evolution.info()
# New knowledge: data.dropna() removes missing values. By default it drops every row that contains at least one missing value (not columns).
<class 'pandas.core.frame.DataFrame'>
Int64Index: 80 entries, 0 to 82
Data columns (total 8 columns):
Year 80 non-null object
Species 80 non-null object
Beak length 80 non-null object
Beak depth 80 non-null float64
Beak width 80 non-null float64
CI Beak length 80 non-null object
CI Beak depth 80 non-null object
CI Beak width 80 non-null object
dtypes: float64(2), object(6)
memory usage: 5.6+ KB
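The `info()` output above keeps the index range `0 to 82` with only 80 entries, which fits `dropna()` removing rows and leaving the remaining index labels untouched. A small sketch on an invented three-row frame (not the finch data) shows the row-wise default next to the column-wise and single-column variants.

```python
import numpy as np
import pandas as pd

# Toy frame with invented values: rows 1 and 2 each contain one NaN.
toy = pd.DataFrame({
    "Year": [1973, 1974, 1975],
    "Beak depth": [9.48, np.nan, 9.19],
    "Beak width": [8.69, 8.66, np.nan],
})

rows_dropped = toy.dropna()                          # default axis=0: drop rows with any NaN
cols_dropped = toy.dropna(axis=1)                    # axis=1: drop columns with any NaN
subset_dropped = toy.dropna(subset=["Beak depth"])   # only look at one column

print(rows_dropped.index.tolist())    # surviving row labels keep their original values
```

If a clean 0..n-1 index is wanted afterwards, `reset_index(drop=True)` can be chained on.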
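#evolution['Beak length']= pd.numeric['Beak length']
# Error log: 1. A conversion to numeric needs an object to act on, and that object normally goes inside the parentheses. Python feels like a huge toolbox to me: I ask for tool A and it hands it over at once, but there are so many tools that I have to be explicit.
# So I have to say that I want to use tool A on object a.
# 2. "Convert to something" needs the to_ prefix, which I left out; a basic spelling mistake.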
evolution['Beak length'] = pd.to_numeric(evolution['Beak length'])
evolution['Beak length'].head()
0 10.76
1 10.72
2 10.57
3 10.64
4 10.73
Name: Beak length, dtype: float64
evolution['CI Beak length'] = pd.to_numeric(evolution['CI Beak length'],errors='coerce')
# New knowledge: errors='coerce' forces the conversion to a numeric type; values that cannot be converted become NaN.
evolution['CI Beak depth'] = pd.to_numeric(evolution['CI Beak depth'],errors='coerce')
evolution['CI Beak width'] = pd.to_numeric(evolution['CI Beak width'],errors='coerce')
evolution.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 80 entries, 0 to 82
Data columns (total 8 columns):
Year 80 non-null object
Species 80 non-null object
Beak length 80 non-null float64
Beak depth 80 non-null float64
Beak width 80 non-null float64
CI Beak length 79 non-null float64
CI Beak depth 79 non-null float64
CI Beak width 79 non-null float64
dtypes: float64(6), object(2)
memory usage: 5.6+ KB
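The three CI columns drop from 80 to 79 non-null values above, which is what `errors='coerce'` does when it meets an entry it cannot parse. A minimal sketch on an invented Series (the string "not measured" is made up for illustration, the real offending value is not shown in the notebook):

```python
import pandas as pd

raw = pd.Series(["10.76", "10.72", "not measured", "10.64"])  # invented values

# With the default errors='raise', the unparseable entry raises ValueError.
try:
    pd.to_numeric(raw)
    raised = False
except ValueError:
    raised = True

# errors='coerce' turns anything unparseable into NaN instead.
converted = pd.to_numeric(raw, errors="coerce")
n_missing = int(converted.isna().sum())
print(converted)
```

Comparing `isna().sum()` before and after the conversion is a quick way to see how many values were silently turned into NaN.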
That completes the initial data cleaning. But what if I want to delete certain columns? How would I do that?
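One common way to delete columns is `DataFrame.drop(columns=...)`. The sketch below uses an invented two-row frame with column names borrowed from the finch table; it is an illustration, not a step from the lesson.

```python
import pandas as pd

toy = pd.DataFrame({
    "Year": [1973, 1974],
    "Beak length": [10.76, 10.72],
    "CI Beak length": [0.097, 0.144],
})

# drop returns a new DataFrame; the original is left unchanged.
without_ci = toy.drop(columns=["CI Beak length"])

# Alternatively, select only the columns to keep.
kept = toy[["Year", "Beak length"]]
```

Assigning the result back (`toy = toy.drop(columns=[...])`) makes the change stick, in the same way as `evolution = evolution.dropna()` earlier.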
Data exploration
evolution.head()
evolution.tail()
    Year   Species  Beak length  Beak depth  Beak width  CI Beak length  CI Beak depth  CI Beak width
78  2008  scandens        13.31        9.08        8.72           0.109          0.087          0.084
79  2009  scandens        13.33        9.08        8.73           0.099          0.085          0.081
80  2010  scandens        13.30        9.07        8.71           0.102          0.081          0.076
81  2011  scandens        13.35        9.10        8.75           0.106          0.085          0.078
82  2012  scandens        13.41        9.19        8.82           0.131          0.120          0.109
evolution.Species.value_counts()
# Error log: value_counts is meant here as a summary of a single column. I typed evolution.value_counts(), which targets the whole DataFrame, when I only wanted one of its columns.
fortis 40
scandens 40
Name: Species, dtype: int64
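Counting the distinct values of one column can be sketched as follows, on an invented three-row frame. The `groupby(...).size()` line is an alternative added for comparison and is not part of the lesson.

```python
import pandas as pd

toy = pd.DataFrame({
    "Species": ["fortis", "fortis", "scandens"],
    "Year": [1973, 1974, 1973],
})

# Series.value_counts counts how often each distinct value occurs in one column.
counts = toy["Species"].value_counts()

# The same numbers via groupby, which extends naturally to several key columns.
sizes = toy.groupby("Species").size()
print(counts)
```

Both `toy.Species` and `toy["Species"]` select the column; the bracket form also works for names with spaces such as `Beak length`.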
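# By the way, how do I rename a column again?
evolution.rename(columns = {'Species':'species'},inplace=True)
# Error log: I forgot the quotes around the column names.
evolution.head()
# @@ Looked this one up on Baidu myself.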
   Year species  Beak length  Beak depth  Beak width  CI Beak length  CI Beak depth  CI Beak width
0  1973  fortis        10.76        9.48        8.69           0.097          0.130          0.081
1  1974  fortis        10.72        9.42        8.66           0.144          0.170          0.112
2  1975  fortis        10.57        9.19        8.55           0.075          0.084          0.057
3  1976  fortis        10.64        9.23        8.58           0.048          0.053          0.039
4  1977  fortis        10.73        9.35        8.63           0.085          0.092          0.066
fortis = evolution[evolution.species=='fortis']
fortis.tail()
    Year species  Beak length  Beak depth  Beak width  CI Beak length  CI Beak depth  CI Beak width
35  2008  fortis        10.28        8.57        8.29           0.099          0.094          0.076
36  2009  fortis        10.28        8.51        8.27           0.095          0.087          0.070
37  2010  fortis        10.42        8.52        8.33           0.097          0.086          0.071
38  2011  fortis        10.46        8.57        8.34           0.103          0.096          0.076
39  2012  fortis        10.51        8.65        8.38           0.150          0.146          0.117
scandens = evolution[evolution.species=='scandens']
scandens.tail()
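# Whatever stands outside the square brackets sets the scope of the data; the condition inside the brackets filters it.
# evolution.species[evolution.species=='scandens'] vs evolution[evolution.species=='scandens'] give completely different results: the first returns only the species column (a Series), the second returns every column of the matching rows (a DataFrame).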
    Year   species  Beak length  Beak depth  Beak width  CI Beak length  CI Beak depth  CI Beak width
78  2008  scandens        13.31        9.08        8.72           0.109          0.087          0.084
79  2009  scandens        13.33        9.08        8.73           0.099          0.085          0.081
80  2010  scandens        13.30        9.07        8.71           0.102          0.081          0.076
81  2011  scandens        13.35        9.10        8.75           0.106          0.085          0.078
82  2012  scandens        13.41        9.19        8.82           0.131          0.120          0.109
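The difference between filtering the whole frame and filtering a single column comes down to what the boolean mask is applied to. A sketch on an invented three-row frame; the `.loc` line is an equivalent spelling added here for comparison.

```python
import pandas as pd

toy = pd.DataFrame({
    "species": ["fortis", "scandens", "scandens"],
    "Beak depth": [9.48, 9.08, 9.19],
})

mask = toy["species"] == "scandens"      # boolean Series, one value per row

rows = toy[mask]                         # DataFrame: matching rows, all columns
one_col = toy["species"][mask]           # Series: matching rows of that column only
same_as_loc = toy.loc[mask, "species"]   # same Series, selected in a single step
print(rows)
```

The mask is the same in both cases; only the object in front of the brackets decides whether a DataFrame or a Series comes back.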
fortis.plot(x='Year',y=['Beak length','Beak depth','Beak width'])
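# Error log: 1. y= takes a single object. My comma-separated names were read as several arguments, so multiple columns have to be wrapped in one list with square brackets.
# 2. I started plotting straight away with plt.plot, without any data. It should be fortis.plot. Silly me!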
<matplotlib.axes._subplots.AxesSubplot at 0xa4d6128>
scandens.plot(x='Year',y=['Beak length','Beak depth','Beak width'])
<matplotlib.axes._subplots.AxesSubplot at 0xa9006d8>
scandens.plot(x='Year',y=['Beak length','Beak depth','Beak width'],subplots = True,figsize=(10,6))
# New knowledge: 1. subplots=True draws each column on its own subplot instead of putting them all on one axes.
array([<matplotlib.axes._subplots.AxesSubplot object at 0x000000000BA26898>,
<matplotlib.axes._subplots.AxesSubplot object at 0x000000000BB0FC88>,
<matplotlib.axes._subplots.AxesSubplot object at 0x000000000BB5F240>], dtype=object)
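The two return values above (a single axes object versus an array of them) can be reproduced on a small invented frame. This is a sketch that assumes a headless `Agg` backend so it runs outside the notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, not needed inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

toy = pd.DataFrame({
    "Year": [1973, 1974, 1975],
    "Beak length": [10.76, 10.72, 10.57],
    "Beak depth": [9.48, 9.42, 9.19],
})

# A list for y: both columns are drawn as lines on one shared axes.
ax = toy.plot(x="Year", y=["Beak length", "Beak depth"])
n_lines = len(ax.get_lines())

# subplots=True: one axes per column, returned together as an array.
axes = toy.plot(x="Year", y=["Beak length", "Beak depth"],
                subplots=True, figsize=(10, 6))
n_axes = len(axes)
plt.close("all")
```

Separate subplots help when the columns live on different scales, as beak length and beak depth do for scandens.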
fortis.plot(x='Year',y='Beak depth',yerr='CI Beak depth',color='Red',marker='o',figsize=(10,5))
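# New knowledge: yerr (draws error bars using the values of the given column), marker (the symbol drawn at each data point)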
<matplotlib.axes._subplots.AxesSubplot at 0xd339b00>
scandens.plot(x='Year',y='Beak depth',yerr='CI Beak depth',color='Orange',marker='o',figsize=(11,4))
<matplotlib.axes._subplots.AxesSubplot at 0xe03fc88>
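A sketch of the `yerr` usage on an invented three-row frame, again assuming a headless `Agg` backend. The bar drawn at each point extends by the column's value above and below the point; whether the CI columns in the finch data are meant as such half-widths is an assumption here.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, not needed inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

toy = pd.DataFrame({
    "Year": [1973, 1974, 1975],
    "Beak depth": [9.48, 9.42, 9.19],
    "CI Beak depth": [0.130, 0.170, 0.084],
})

# yerr names the column whose values set the length of the vertical error bars.
ax = toy.plot(x="Year", y="Beak depth", yerr="CI Beak depth",
              color="red", marker="o", figsize=(10, 5))

x_label = ax.get_xlabel()
n_errorbar_groups = len(ax.containers)   # matplotlib registers error bars as a container
plt.close("all")
```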
Comparing the 1975 and 2012 data
data = pd.read_csv('data/data_75_12.csv')
data.head()
This dataset contains every individual sample from 1975 and 2012, whereas the previous one held a single mean value per year.
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 651 entries, 0 to 650
Data columns (total 4 columns):
species 651 non-null object
length 651 non-null float64
depth 651 non-null float64
year 651 non-null int64
dtypes: float64(2), int64(1), object(1)
memory usage: 20.4+ KB
data.year.value_counts()
1975 403
2012 248
Name: year, dtype: int64
data.species.value_counts()
fortis 437
scandens 214
Name: species, dtype: int64
data.groupby(['species','year']).count()
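# Error log: 1. The argument to groupby goes inside parentheses; square brackets are typically used to bundle several items into one object (a list). 2. The groupby counting method is count, without an s.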
data.groupby(['species','year']).mean()
                  length     depth
species  year
fortis   1975  10.565190  9.171646
         2012  10.517355  8.605372
scandens 1975  14.120920  8.960000
         2012  13.421024  9.186220
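Grouping by a list of two keys gives one row per (species, year) pair, as in the table of means above. A sketch on four invented rows; the `agg` line is an addition showing several statistics at once.

```python
import pandas as pd

toy = pd.DataFrame({
    "species": ["fortis", "fortis", "fortis", "scandens"],
    "year": [1975, 1975, 2012, 2012],
    "length": [10.0, 11.0, 10.5, 13.4],
    "depth": [9.0, 9.4, 8.6, 9.2],
})

grouped = toy.groupby(["species", "year"])   # the list bundles two key columns

counts = grouped.count()    # non-null values per column in each group
means = grouped.mean()      # per-group mean of each remaining column
both = grouped["length"].agg(["count", "mean"])   # several statistics in one call

# Rows of the result are addressed with a (species, year) tuple.
fortis_1975_mean = means.loc[("fortis", 1975), "length"]
print(both)
```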
fortis2 = data[data.species=='fortis']
fortis2.head()
fortis2.boxplot(by='year',figsize=(10,5))
array([<matplotlib.axes._subplots.AxesSubplot object at 0x000000000F518898>,
<matplotlib.axes._subplots.AxesSubplot object at 0x000000000F02D320>], dtype=object)
scandens2 = data[data.species=='scandens']
scandens2.head()
scandens2.boxplot(by='year',figsize=(10,5))
array([<matplotlib.axes._subplots.AxesSubplot object at 0x000000000D009908>,
<matplotlib.axes._subplots.AxesSubplot object at 0x000000000F6F94E0>], dtype=object)
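The arrays of two axes printed above correspond to one panel per numeric column, each split by year. A sketch on invented data, assuming a headless `Agg` backend; the `column=` argument is spelled out here for clarity, whereas the notebook lets pandas pick the numeric columns itself.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, not needed inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

toy = pd.DataFrame({
    "year": [1975, 1975, 1975, 2012, 2012, 2012],
    "length": [10.2, 10.6, 10.9, 10.3, 10.5, 10.8],
    "depth": [9.0, 9.2, 9.4, 8.4, 8.6, 8.8],
})

# by='year' puts one box per year into each panel; one panel per listed column.
axes = toy.boxplot(column=["length", "depth"], by="year", figsize=(10, 5))
n_panels = len(axes)
plt.close("all")
```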
Changes in beak depth and length of the medium ground finch
## Group the medium ground finch samples by year
fortis75 = fortis2[fortis2.year==1975]
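# Why are no quotes needed when filtering by year? Are quotes only for str values? (data.info() shows year as int64, so it is compared with a number; quotes are for string values.)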
fortis12 = fortis2[fortis2.year==2012]
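The quoting question can be checked directly: a comparison only matches when the value has the same type as the column. A sketch on an invented three-row frame:

```python
import pandas as pd

toy = pd.DataFrame({
    "species": ["fortis", "fortis", "scandens"],
    "year": [1975, 2012, 1975],
})

year_dtype = str(toy["year"].dtype)                  # an integer dtype

n_int_match = int((toy["year"] == 1975).sum())       # number compared with numbers
n_str_match = int((toy["year"] == "1975").sum())     # a string never equals an integer
n_species = int((toy["species"] == "fortis").sum())  # string compared with strings
print(year_dtype, n_int_match, n_str_match, n_species)
```

Had `year` been read in as text (dtype `object`), the quoted form would be the one that matches, so checking `info()` or `dtypes` first avoids an empty filter result.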
fortis75.hist()
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x00000000121EF518>,
<matplotlib.axes._subplots.AxesSubplot object at 0x0000000011EFFD68>],
[<matplotlib.axes._subplots.AxesSubplot object at 0x0000000011F3ACC0>,
<matplotlib.axes._subplots.AxesSubplot object at 0x0000000011F79080>]], dtype=object)