index对象
import numpy as np
import pandas as pd
pd1 = pd.Series(np.arange(8),index =['a','b','c','d','e','f','g','h'])
pd1.index
Index(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'], dtype='object')
pd1.index[3]
'd'
pd2 = pd.DataFrame(np.arange(12).reshape(3,4),index=['a','b','c'])
pd2.index
Index(['a', 'b', 'c'], dtype='object')
pd2.index[2]
'c'
data = {
"姓名":['张三','李四','王五','小明','小红','小刚','小亮'],
"语文":[89,78,79,89,90,87,83],
"数学":[59,83,85,92,67,81,77],
"英语":[84,97,88,83,67,73,71],
"体育":[0,0,0,0,0,0,0]
}
df = pd.DataFrame(data)
df
姓名 语文 数学 英语 体育
0 张三 89 59 84 0
1 李四 78 83 97 0
2 王五 79 85 88 0
3 小明 89 92 83 0
4 小红 90 67 67 0
5 小刚 87 81 73 0
6 小亮 83 77 71 0
df[df.values ==88]
姓名 语文 数学 英语 体育
2 王五 79 85 88 0
df[df.values ==88].index
Int64Index([2], dtype='int64')
score = pd.Series(data=[90,95,85,78,np.NAN,96,94,np.NAN,80,87,86,83],
index=range(1,13),name='score')
score.index.name = "class"
score
class
1 90.0
2 95.0
3 85.0
4 78.0
5 NaN
6 96.0
7 94.0
8 NaN
9 80.0
10 87.0
11 86.0
12 83.0
Name: score, dtype: float64
score1 = score.fillna(0)
score1
class
1 90.0
2 95.0
3 85.0
4 78.0
5 0.0
6 96.0
7 94.0
8 0.0
9 80.0
10 87.0
11 86.0
12 83.0
Name: score, dtype: float64
score1[score1.values ==0].index
Int64Index([5, 8], dtype='int64', name='class')
pandas增删改查基本操作
- Series对象增删改查操作
- 修改某元素、
import numpy as np import pandas as pd s = pd.Series(np.arange(6),index =['a','b','c','d','e','f']) s['d'] =9 s a 0 b 1 c 2 d 9 e 4 f 5 dtype: int32
- 增加行操作,append()
s1 = pd.Series({'g':10}) s1 g 10 dtype: int64 s2 = s.append(s1) s2 a 0 b 1 c 2 d 9 e 4 f 5 g 10 dtype: int64
- 查询操作
s2['b'] 1 #上述查询也可以简化成s2.b s2.b 1 s3=s2
- 删除操作
del(s3['g']) s3 a 0 b 1 c 2 d 9 e 4 f 5 dtype: int64 s3.drop(['e'],axis = 0) a 0 b 1 c 2 d 9 f 5 dtype: int64
- 修改某元素、
- DataFrame对象增删改查操作
1、建立DataFrame对象
import numpy as np
import pandas as pd
data = {'A':[0,3,6],'B':[1,4,7],'C':[2,5,8]}
pd1 = pd.DataFrame(data,index =['a','b','c'])
pd1
A B C
a 0 1 2
b 3 4 5
c 6 7 8
2、增加列
第一种方法:在最后一列增加列
pd1['D'] = 0
第二种方法:用insert()方法
pd1.insert(2,'E',10) 代表在第二列后插入列名为E的全部为0的数据,如下所示:
pd1.insert(2,'E',10)
pd1
A B E C D
a 0 1 10 2 0
b 3 4 10 5 0
c 6 7 10 8 0
3、增加行
同样也是两种方法
第一种方法:用loc直接赋值法
pd1.loc['d'] = 9
第二种方法用append()
pd2 = pd1.append({'A':10,'B':10,'C':13},ignore_index = True)
还可以用
pd4 = pd3.loc[4]
pd5 = pd3.append(pd4)
如下所示:
pd2 = pd1.append({'A':10,'B':10,'C':13},ignore_index = True)
pd3 = pd2.fillna(0)
pd4 = pd3.loc[4]
pd5 = pd3.append(pd4)
pd5
A B E C D
0 0.0 1.0 10.0 2.0 0.0
1 3.0 4.0 10.0 5.0 0.0
2 6.0 7.0 10.0 8.0 0.0
3 9.0 9.0 9.0 9.0 9.0
4 10.0 10.0 0.0 13.0 0.0
4 10.0 10.0 0.0 13.0 0.0
4、删除操作
删除多列操作:del(pd5['C'],pd5['D'])
删除列和行操作,axis =1 代表列,axis =0 代表行
pd5.drop('4',axis =0)
A B E
0 0.0 1.0 10.0
1 3.0 4.0 10.0
2 6.0 7.0 10.0
3 9.0 9.0 9.0
高级索引
- loc 标签索引,顾头又顾尾
- iloc 位置索引,顾头不顾尾
- ix 混合索引,不推荐使用,了解
仍以上述pd1对象为例
pd1.loc['a'] 等价于pd1.iloc[0]
pd1.loc['a']
A 0
B 1
C 2
Name: a, dtype: int64
pd1.loc['a':'b']等价于pd1.iloc[0:2]
pd1.iloc[0:2]
A B C
a 0 1 2
b 3 4 5
pd1.loc[['a','c']]花哨索引取a行和c行
pd1.loc[['a','c']]等价于pd1.iloc[[0,2]]
pd1.loc[['a','c']]
A B C
a 0 1 2
c 6 7 8
取其中两行两列的数据
pd1.loc['a':'b',['B','C']]等价于pd1.iloc[0:2,[1,2]],也可以用花哨索引
pd1.loc[['a','b'],['B','C']]
pd1.loc['a':'b',['B','C']]
B C
a 1 2
b 4 5
pd1.iloc[[0,2],[1,2]]
B C
a 1 2
c 7 8
pd1.loc[['a','c'],['B','C']]
B C
a 1 2
c 7 8
重新索引
reindex重新排序索引,相当于两个对象合并得并集,没有关取的索引,会用默认值NaN表示
- Series重新索引
import numpy as np
import pandas as pd
s = pd.Series(np.arange(4),index =['a','b','c','d'])
s
a 0
b 1
c 2
d 3
dtype: int32
s1 = s.reindex(['d','h','c','k','b','a'])
s1
d 3.0
h NaN
c 2.0
k NaN
b 1.0
a 0.0
dtype: float64
- DataFrame重新索引
import numpy as np
import pandas as pd
s = np.arange(12).reshape(3,4)
pd1 = pd.DataFrame(s,index=['a','b','c'],columns=['A','B','C','D'])
pd1
A B C D
a 0 1 2 3
b 4 5 6 7
c 8 9 10 11
pd2 = pd1.reindex(['D','J','A','C','K','A'],axis =1)
pd2
D J A C K A
a 3 NaN 0 2 NaN 0
b 7 NaN 4 6 NaN 4
c 11 NaN 8 10 NaN 8
pd3 = pd1.reindex(['a','c','j','b'],axis =0)
pd3
A B C D
a 0.0 1.0 2.0 3.0
c 8.0 9.0 10.0 11.0
j NaN NaN NaN NaN
b 4.0 5.0 6.0 7.0
pd4 = pd1.reindex(['a','j','b'],axis =0)
pd4
A B C D
a 0.0 1.0 2.0 3.0
j NaN NaN NaN NaN
b 4.0 5.0 6.0 7.0
行与列重命名操作(rename)
pd5 = pd1.rename(columns={'A':'A1','B':'B1','C':'C1','D':'D1'})
pd5
A1 B1 C1 D1
a 0 1 2 3
b 4 5 6 7
c 8 9 10 11
pd5 = pd1.rename(columns={'A':'A1','B':'B1','C':'C1'})
pd5
A1 B1 C1 D
a 0 1 2 3
b 4 5 6 7
c 8 9 10 11
pd6 = pd1.rename({'a':'a1','b':'b1'})
pd6
A B C D
a1 0 1 2 3
b1 4 5 6 7
c 8 9 10 11
pandas算术运算
实际上,通过 + - * / // ** 等符号可以直接对DataFrame与DataFrame之间或者DataFrame以及Series之间进行运算。但秉承的原则就是对应索引运算,存在索引不同时,返回结果为索引对的并集。
但是实际操作会发现,当存在索引不同时,返回的值自动填充NaN。
使用填充值的算术方法
方法 描述
add, radd 加法(+)
sub, rsub 减法(-)
div, rdiv 除法(/)
floordiv, rfloordiv 整除(//)
mul, rmul 乘法()
pow, rpow 幂次方(*)
注意
Series使用算术方法,不支持指定填充值
div(2)2代表被除数,rdiv(2),2代表除数,为分母。其他同理可证
- Series运算
fill_value =0表示忽略NAN值,并把NAN值=0然后相加
import numpy as np
import pandas as pd
s1 = pd.Series(np.arange(3),index=['a','b','c'])
s2 = pd.Series(np.arange(5,9),index=['a','d','c','f'])
s1+s2
a 5.0
b NaN
c 9.0
d NaN
f NaN
dtype: float64
s1.add(s2)
a 5.0
b NaN
c 9.0
d NaN
f NaN
dtype: float64
s1.add(s2,fill_value =0)
a 5.0
b 1.0
c 9.0
d 6.0
f 8.0
dtype: float64
- DataFrame运算
s = np.arange(12).reshape(3,4)
pd1 = pd.DataFrame(s,index=['a','b','c'],columns=['A','B','C','D'])
s = np.arange(9).reshape(3,3)
pd2 = pd.DataFrame(s,index=['a','b','c'],columns=['A','B','C'])
pd1+pd2
A B C D
a 0 2 4 NaN
b 7 9 11 NaN
c 14 16 18 NaN
pd1.add(pd2,fill_value =0)
A B C D
a 0 2 4 3.0
b 7 9 11 7.0
c 14 16 18 11.0
pd1.div(pd2)
A B C D
a NaN 1.000000 1.00 NaN
b 1.333333 1.250000 1.20 NaN
c 1.333333 1.285714 1.25 NaN
pd1.rdiv(pd2)
A B C D
a NaN 1.000000 1.000000 NaN
b 0.75 0.800000 0.833333 NaN
c 0.75 0.777778 0.800000 NaN
混合运算
pd2 = pd1.loc['b']
pd1-pd2
A B C D
a -4 -4 -4 -4
b 0 0 0 0
c 4 4 4 4
pd3=pd1['D']
pd1-pd3
A B C D a b c
a NaN NaN NaN NaN NaN NaN NaN
b NaN NaN NaN NaN NaN NaN NaN
c NaN NaN NaN NaN NaN NaN NaN
pd1.sub(pd3,axis =0)
A B C D
a -3 -2 -1 0
b -3 -2 -1 0
c -3 -2 -1 0
pandas统计运算
描述性统计的概述和计算
方法 描述
count 非NA值的个数
min,max 最小值,最大值
idxmin,idxmax 最小值,最大值的标签索引
sum 求和
mean 平均值
median 中位数
var 方差
std 标准差
cumsum 累计值
cummin,cummax 累计值的最小值或最大值
cumprod 值的累计积
diff 计算第一个算术差值(时间序列)
pct_change 百分比
corr 按索引对其的值的相关性
cov 协方差
df=pd.DataFrame([[1.4,np.nan],[7.1,-4.5],
[np.nan,np.nan],[0.75,-1.3]],
index=['a','b','c','d'],
columns=['one','two'])
df
one two
a 1.40 NaN
b 7.10 -4.5
c NaN NaN
d 0.75 -1.3
- sum()或sum(axis =0)每列相加,忽略缺失值
df.sum()
one 9.25
two -5.80
dtype: float64
- df.sum(axis =1)每行相加,忽略缺失值
df.sum(axis =1)
a 1.40
b 2.60
c 0.00
d -0.55
dtype: float64
-idxmax()返回最大值
df.idxmax()
one b
two d
dtype: object
df.idxmax(axis=1)
a one
b one
c NaN
d one
dtype: object
- cumsum()向下求和累加
df.cumsum()
one two
a 1.40 NaN
b 8.50 -4.5
c NaN NaN
d 9.25 -5.8
- diff差分环比计算,shift(-1)代表从下往上替换
df2.diff() = df2-df2.shift(1)
df1 = df.drop('two',axis =1)
df2 = df1.fillna(1)
df3 = df2.shift(1)
df2-df3
one
a NaN
b 5.70
c -6.10
d -0.25
df2.diff()
one
a NaN
b 5.70
c -6.10
d -0.25
查看汇总统计
- describe()必须是数值型,如果存在字符串,那么字符串不参与运算,它可以统计缺失值,平均值及最大与最小值等
df.describe()
one two
count 3.000000 2.000000
mean 3.083333 -2.900000
std 3.493685 2.262742
min 0.750000 -4.500000
25% 1.075000 -3.700000
50% 1.400000 -2.900000
75% 4.250000 -2.100000
max 7.100000 -1.300000
查看表的信息info
df.info()
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, a to d
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 one 3 non-null float64
1 two 2 non-null float64
dtypes: float64(2)
memory usage: 256.0+ bytes
apply 函数映射
- 最大值,默认为列
df.apply(lambda x:x.max())
one 7.1
two -1.3
dtype: float64
df.apply(lambda x:x.max(),axis =1)
a 1.40
b 7.10
c NaN
d 0.75
dtype: float64
- 保留N位小数
df.applymap(lambda x:'%.1f'%x)
one two
a 1.4 nan
b 7.1 -4.5
c nan nan
d 0.8 -1.3
作业:
1、读取 五粮液2020.xlsx 数据
import numpy as np
import pandas as pd
data = pd.read_csv('五粮液2020.csv')
data.head(5)
ts_code trade_date open high low close pre_close
0 000858.SZ 20200903 235.40 243.00 235.19 238.64 235.00
1 000858.SZ 20200902 235.20 239.78 233.80 235.00 236.24
2 000858.SZ 20200901 237.48 239.80 233.10 236.24 240.00
3 000858.SZ 20200831 242.00 246.70 240.00 240.00 240.50
4 000858.SZ 20200828 235.98 244.86 233.20 240.50 238.08
2、查看该数据的基本信息
data.describe()
trade_date open high low close pre_close
count 1.640000e+02 164.000000 164.000000 164.000000 164.000000 164.000000
mean 2.020049e+07 156.550610 159.834329 154.414207 157.405427 156.747927
std 2.325634e+02 38.868944 40.042722 38.157768 39.326017 38.846828
min 2.020010e+07 104.000000 106.500000 98.630000 102.110000 102.110000
25% 2.020031e+07 126.480000 128.032500 124.875000 126.145000 126.145000
50% 2.020051e+07 141.955000 144.445000 140.910000 142.405000 141.880000
75% 2.020071e+07 196.500000 205.945000 192.007500 201.590000 201.432500
max 2.020090e+07 242.000000 246.700000 240.000000 240.500000 240.500000
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 164 entries, 0 to 163
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ts_code 164 non-null object
1 trade_date 164 non-null int64
2 open 164 non-null float64
3 high 164 non-null float64
4 low 164 non-null float64
5 close 164 non-null float64
6 pre_close 164 non-null float64
dtypes: float64(5), int64(1), object(1)
memory usage: 9.1+ KB
3、计算每一天各指标的差异值
data['pre_close']-data['open']
0 -0.40
1 1.04
2 2.52
3 -1.50
4 2.10
...
159 0.38
160 -0.30
161 0.55
162 0.48
163 1.01
Length: 164, dtype: float64
4、计算其 pre_close 的增长率
pre_close = data['pre_close']
open = data['open']
zyl =(pre_close-open)/pre_close
zyl
0 -0.001702
1 0.004402
2 0.010500
3 -0.006237
4 0.008821
...
159 0.002937
160 -0.002322
161 0.004213
162 0.003634
163 0.007593
Length: 164, dtype: float64
5、将 pre_close 的增长率添加至数据中(新增列
data['pct_change '] = zyl
data
ts_code trade_date open high low close pre_close pct_change
0 000858.SZ 20200903 235.40 243.00 235.19 238.64 235.00 -0.001702
1 000858.SZ 20200902 235.20 239.78 233.80 235.00 236.24 0.004402
2 000858.SZ 20200901 237.48 239.80 233.10 236.24 240.00 0.010500
3 000858.SZ 20200831 242.00 246.70 240.00 240.00 240.50 -0.006237
4 000858.SZ 20200828 235.98 244.86 233.20 240.50 238.08 0.008821
... ... ... ... ... ... ... ... ...
159 000858.SZ 20200108 128.99 129.76 128.05 128.89 129.37 0.002937
160 000858.SZ 20200107 129.50 131.07 129.00 129.37 129.20 -0.002322
161 000858.SZ 20200106 130.00 130.25 128.52 129.20 130.55 0.004213
162 000858.SZ 20200103 131.60 132.07 129.61 130.55 132.08 0.003634
163 000858.SZ 20200102 132.00 133.50 129.59 132.08 133.01 0.007593
164 rows × 8 columns
6、将 pct_change 该列 呈现的 NaN 用0填充
data.fillna(0)
7、将 pct_change 这列乘以100 保留两位小数 成为百分比
z = pd.Series(zyl)*100
data['pct_change']=z
data[['open','pct_change']].applymap(lambda x:'%.2f'%x)
open pct_change
0 235.40 -0.17
1 235.20 0.44
2 237.48 1.05
3 242.00 -0.62
4 235.98 0.88
... ... ...
159 128.99 0.29
160 129.50 -0.23
161 130.00 0.42
162 131.60 0.36
163 132.00 0.76
164 rows × 2 columns