First, import the libraries we will use.
import numpy as np
import pandas as pd
We also need to read and write CSV files.
from pandas import read_csv
Read the test data.
df=pd.read_csv('data_study.csv')
>>> df
num class name sex english sport army math possity space
0 10 1 mary woman 80 80 90 75.0 60 65
1 28 1 land man 80 50 69 70.0 58 70
2 15 2 asnx man 80 69 80 75.0 90 94
3 18 4 david man 90 80 86 85.0 95 62
4 19 2 gry woman 90 50 64 NaN 64 85
5 20 2 kitty woman 84 58 97 94.0 63 21
6 14 3 lury woman 98 77 88 0.0 55 40
7 21 1 facy man 55 68 94 52.0 36 48
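Next, we clean the data.
df.duplicated()#flag every row that repeats an earlier row
df=df.drop_duplicates()#drop the repeated rows; the result has to be assigned back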
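As a small illustration of the two calls above, here is a sketch on a made-up three-row frame (the `demo` data is hypothetical, not taken from data_study.csv). It shows that `duplicated()` only reports, and that `drop_duplicates()` returns a new DataFrame instead of changing the original.

```python
import pandas as pd

# Hypothetical toy data; the third row repeats the first one.
demo = pd.DataFrame({"name": ["mary", "land", "mary"],
                     "math": [75, 70, 75]})

dup_mask = demo.duplicated()      # True for rows that repeat an earlier row
deduped = demo.drop_duplicates()  # new DataFrame; demo itself is unchanged

print(dup_mask.tolist())
print(len(demo), len(deduped))
```

If the cleaned frame is what later steps should use, assign the result back to a variable.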
Fill the missing values with 0.
df=df.fillna(0)#fillna returns a new DataFrame, so assign it back
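The filled values only show up in later steps if the result of `fillna` is kept. A minimal sketch, using a hypothetical two-row frame named `scores`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing math score.
scores = pd.DataFrame({"name": ["gry", "kitty"],
                       "math": [np.nan, 94.0]})

filled = scores.fillna(0)  # returns a new DataFrame with NaN replaced by 0

print(scores["math"].isna().sum())  # the original still holds the NaN
print(filled["math"].tolist())
```

Note that a filled 0 is treated as a real score by every later calculation, such as totals and minimums.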
Make two copies of the data so that each can be processed separately.
df1=df.copy()
df2=df.copy()
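Check the data types and collect the columns whose dtype is object, that is, the non-numeric ones.
>>> ty=list(df1.columns)#assumed definition; the original session does not show it
>>> noint=[]#assumed definition; collects the non-numeric column names
>>> for i in ty:
...     if df1[i].dtype=='O':
...         noint.append(i)
...
>>>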
>>> noint
['name', 'sex']
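The same list can probably be obtained without an explicit loop. The sketch below uses `DataFrame.select_dtypes` on a hypothetical frame named `people` that has the same kinds of columns; it is an alternative to the loop above, not the method used in this session.

```python
import pandas as pd

# Hypothetical toy data with two numeric and two text columns.
people = pd.DataFrame({"num": [10, 28],
                       "name": ["mary", "land"],
                       "sex": ["woman", "man"],
                       "english": [80, 80]})

# Keep only the columns that are not numeric and list their names.
non_numeric = people.select_dtypes(exclude="number").columns.tolist()

print(non_numeric)
```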
Add a total score column.
df2['total_score']=df2['english']+df2['sport']+df2['army']+df2['math']+df2['possity']+df2['space']
>>> df2
num class name sex english sport army math possity space total_score
0 10 1 mary woman 80 80 90 75 60 65 450
1 28 1 land man 80 50 69 70 58 70 397
2 15 2 asnx man 80 69 80 75 90 94 488
3 18 4 david man 90 80 86 85 95 62 498
4 19 2 gry woman 90 50 64 0 64 85 353
5 20 2 kitty woman 84 58 97 94 63 21 417
6 14 3 lury woman 98 77 88 0 55 40 358
7 21 1 facy man 55 68 94 52 36 48 353
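Adding the columns one by one works, but the sum can also be written over a list of column names. A minimal sketch, using the first two rows of the table above and a helper list named `subjects` (an assumed name, not part of the original session):

```python
import pandas as pd

# Column names of the six subjects (list name is an assumption).
subjects = ["english", "sport", "army", "math", "possity", "space"]

# First two rows of the tutorial's table.
demo = pd.DataFrame({"name": ["mary", "land"],
                     "english": [80, 80], "sport": [80, 50],
                     "army": [90, 69], "math": [75, 70],
                     "possity": [60, 58], "space": [65, 70]})

# axis=1 sums across the columns of each row.
demo["total_score"] = demo[subjects].sum(axis=1)

print(demo["total_score"].tolist())
```

This form keeps working if a subject is added: only the `subjects` list changes.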
Bin the total scores into groups.
>>> bins=[df2.total_score.min()-1,400,450,df2.total_score.max()+1]
>>> label=['common','good','perfect']
>>> df2_list=pd.cut(df2.total_score,bins,right=False,labels=label)
>>> df2['catalogy']=df2_list
>>> df2
num class name sex english ... math possity space total_score catalogy
0 10 1 mary woman 80 ... 75 60 65 450 perfect
1 28 1 land man 80 ... 70 58 70 397 common
2 15 2 asnx man 80 ... 75 90 94 488 perfect
3 18 4 david man 90 ... 85 95 62 498 perfect
4 19 2 gry woman 90 ... 0 64 85 353 common
5 20 2 kitty woman 84 ... 94 63 21 417 good
6 14 3 lury woman 98 ... 0 55 40 358 common
7 21 1 facy man 55 ... 52 36 48 353 common
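To make the effect of `right=False` concrete, the sketch below bins the eight totals from the table above as a plain Series (the names `totals`, `groups` and `groups_right` are illustrative). With `right=False` each interval includes its left edge, so a total of exactly 450 lands in 'perfect'; with the default `right=True` the same value would land in 'good'.

```python
import pandas as pd

# The eight total scores from the table above.
totals = pd.Series([450, 397, 488, 498, 353, 417, 358, 353])

bins = [totals.min() - 1, 400, 450, totals.max() + 1]
labels = ["common", "good", "perfect"]

# Intervals closed on the left: [352, 400), [400, 450), [450, 499)
groups = pd.cut(totals, bins, right=False, labels=labels)

# Default behaviour, closed on the right: (352, 400], (400, 450], (450, 499]
groups_right = pd.cut(totals, bins, labels=labels)

print(groups.tolist())
print(groups[0], groups_right[0])
```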
In addition, we apply min-max normalization, which rescales every score column to the range [0, 1].
>>> for i in list(df1.columns[4:]):
...     df1[i]=(df1[i]-df1[i].min())/(df1[i].max()-df1[i].min())
...
>>> df1
num class name sex english sport army math possity space
0 10 1 mary woman 0.581395 1.000000 0.787879 0.797872 0.406780 0.602740
1 28 1 land man 0.581395 0.000000 0.151515 0.744681 0.372881 0.671233
2 15 2 asnx man 0.581395 0.633333 0.484848 0.797872 0.915254 1.000000
3 18 4 david man 0.813953 1.000000 0.666667 0.904255 1.000000 0.561644
4 19 2 gry woman 0.813953 0.000000 0.000000 0.000000 0.474576 0.876712
5 20 2 kitty woman 0.674419 0.266667 1.000000 1.000000 0.457627 0.000000
6 14 3 lury woman 1.000000 0.900000 0.727273 0.000000 0.322034 0.260274
7 21 1 facy man 0.000000 0.600000 0.909091 0.553191 0.000000 0.369863
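The column-by-column loop can also be written as a single expression, because pandas aligns the per-column minimum and maximum with the matching columns. A minimal sketch on a hypothetical four-row frame named `scores`, whose values are taken from the english and sport columns above:

```python
import pandas as pd

# A few english and sport scores from the table (frame name is illustrative).
scores = pd.DataFrame({"english": [80, 90, 98, 55],
                       "sport": [80, 50, 77, 68]})

# Min-max normalization applied to every column at once.
normalized = (scores - scores.min()) / (scores.max() - scores.min())

print(normalized)
```

One caveat under this formula: a column whose values are all equal has max minus min equal to 0, so the division yields NaN for that column.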
>>> df1['total_score']=df1['english']+df1['sport']+df1['army']+df1['math']+df1['possity']+df1['space']
>>> df1
num class name sex english sport army math possity space total_score
0 10 1 mary woman 0.581395 1.000000 0.787879 0.797872 0.406780 0.602740 4.176666
1 28 1 land man 0.581395 0.000000 0.151515 0.744681 0.372881 0.671233 2.521706
2 15 2 asnx man 0.581395 0.633333 0.484848 0.797872 0.915254 1.000000 4.412704
3 18 4 david man 0.813953 1.000000 0.666667 0.904255 1.000000 0.561644 4.946519
4 19 2 gry woman 0.813953 0.000000 0.000000 0.000000 0.474576 0.876712 2.165242
5 20 2 kitty woman 0.674419 0.266667 1.000000 1.000000 0.457627 0.000000 3.398712
6 14 3 lury woman 1.000000 0.900000 0.727273 0.000000 0.322034 0.260274 3.209581
7 21 1 facy man 0.000000 0.600000 0.909091 0.553191 0.000000 0.369863 2.432145
>>> bins=[df1.total_score.min()-1,3,4,df1.total_score.max()+1]
>>> label=['common','good','perfect']
>>> df1_list=pd.cut(df1.total_score,bins,right=False,labels=label)
>>> df1['catalogy']=df1_list
>>>
>>> df1
num class name sex english ... math possity space total_score catalogy
0 10 1 mary woman 0.581395 ... 0.797872 0.406780 0.602740 4.176666 perfect
1 28 1 land man 0.581395 ... 0.744681 0.372881 0.671233 2.521706 common
2 15 2 asnx man 0.581395 ... 0.797872 0.915254 1.000000 4.412704 perfect
3 18 4 david man 0.813953 ... 0.904255 1.000000 0.561644 4.946519 perfect
4 19 2 gry woman 0.813953 ... 0.000000 0.474576 0.876712 2.165242 common
5 20 2 kitty woman 0.674419 ... 1.000000 0.457627 0.000000 3.398712 good
6 14 3 lury woman 1.000000 ... 0.000000 0.322034 0.260274 3.209581 good
7 21 1 facy man 0.000000 ... 0.553191 0.000000 0.369863 2.432145 common
That covers the basics of simple data processing.