Python Data Science Handbook — Study Notes
Part 2: NumPy (2)
NumPy's vectorized operations are very similar to Matlab's. Keep in mind that vectorized operations are far more efficient than explicit loops, so use vectorization instead of loops whenever possible.
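As a rough, self-contained illustration of that claim (the reciprocal computation and the array size below are my own choices, not from the book's transcript), compare an explicit Python loop with the equivalent vectorized expression; timing both with %timeit typically shows the vectorized form to be orders of magnitude faster:

import numpy as np

rng = np.random.RandomState(0)
values = rng.randint(1, 100, size=1000000)

def loop_reciprocal(arr):
    # Interpreted Python loop: every element passes through Python-level bytecode.
    out = np.empty(len(arr))
    for idx in range(len(arr)):
        out[idx] = 1.0 / arr[idx]
    return out

vectorized = 1.0 / values          # ufunc: the loop runs in compiled code
assert np.allclose(vectorized, loop_reciprocal(values))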
"ufunc"是一些列能够对array进行整体操作的函数
有一些特殊的函数,我们可以通过scipy包来获取
In [29]: from scipy import special
In [30]: x = np.random.randint(15, size = (5,5), dtype = 'int32')
In [31]: x
Out[31]:
array([[ 4, 14, 8, 5, 7],
[ 0, 8, 8, 14, 9],
[ 9, 9, 10, 14, 1],
[13, 10, 0, 12, 12],
[ 7, 3, 2, 14, 2]], dtype=int32)
In [32]: special.erf(x)
Out[32]:
array([[ 0.99999998, 1. , 1. , 1. , 1. ],
[ 0. , 1. , 1. , 1. , 1. ],
[ 1. , 1. , 1. , 1. , 0.84270079],
[ 1. , 1. , 0. , 1. , 1. ],
[ 1. , 0.99997791, 0.99532227, 1. , 0.99532227]])
In [33]: x
Out[33]:
array([[ 4, 14, 8, 5, 7],
[ 0, 8, 8, 14, 9],
[ 9, 9, 10, 14, 1],
[13, 10, 0, 12, 12],
[ 7, 3, 2, 14, 2]], dtype=int32)
Specifying output
In [24]:
x = np.arange(5)
y = np.empty(5)
np.multiply(x, 10, out=y)
print(y)
[ 0. 10. 20. 30. 40.]
y = np.zeros(10)
np.power(2, x, out=y[::2])
print(y)
[ 1. 0. 2. 0. 4. 0. 8. 0. 16. 0.]
You might ask what the benefit of this is compared with a direct assignment.
With y[::2] = 2 ** x, a temporary array is created to hold the value of the right-hand side, and that temporary is then copied into the sub-array on the left. Writing the result straight into the output with out= avoids the temporary array, which is clearly more efficient (the difference matters mainly for large arrays).
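A minimal sketch for checking this yourself (the array size is my own choice; for small arrays the difference is negligible). Time each assignment with %timeit to see the effect:

import numpy as np

x = np.linspace(0, 1, 500000)
y = np.zeros(1000000)

# Version 1: builds a temporary array for 2.0 ** x, then copies it into the view y[::2].
y[::2] = 2.0 ** x

# Version 2: the ufunc writes its result directly into the strided view, with no temporary.
np.power(2.0, x, out=y[::2])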
Aggregation
In [36]: x = np.linspace(0, 10, 5)
In [37]: x
Out[37]: array([ 0. , 2.5, 5. , 7.5, 10. ])
In [38]: np.add.reduce(x)
Out[38]: 25.0
In [39]: np.multiply.reduce(x)
Out[39]: 0.0
In [40]: np.add.accumulate(x)
Out[40]: array([ 0. , 2.5, 7.5, 15. , 25. ])
Outer products
In [41]: x = np.arange(1, 5)
In [42]: x
Out[42]: array([1, 2, 3, 4])
In [43]: np.multiply.outer(x, x)
Out[43]:
array([[ 1, 2, 3, 4],
[ 2, 4, 6, 8],
[ 3, 6, 9, 12],
[ 4, 8, 12, 16]])
Aggregation functions in NumPy: min, max, etc.
In [44]: x = np.arange(1, 10)
In [45]: x
Out[45]: array([1, 2, 3, 4, 5, 6, 7, 8, 9])
In [46]: %timeit x.sum()
1.11 µs ± 72.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [47]: %timeit sum(x) #Be careful, don't use the python-version sum()
1.3 µs ± 5.45 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [48]: x.min()
Out[48]: 1
In [49]: x.max()
Out[49]: 9
We can also aggregate over rows or columns by setting the axis argument.
In [50]: Mat = np.random.random((3,4))
In [51]: Mat.sum(axis = 1)
Out[51]: array([ 2.54634383, 2.42121143, 1.28962794])
In [52]: Mat
Out[52]:
array([[ 0.77880176, 0.57543626, 0.6840498 , 0.508056 ],
[ 0.75612961, 0.15132258, 0.65047932, 0.86327992],
[ 0.25738888, 0.5731711 , 0.03401482, 0.42505314]])
In [53]: Mat.sum(axis = 0)
Out[53]: array([ 1.79232025, 1.29992993, 1.36854395, 1.79638906])
In [54]: # axis=0 sums down each column (the row axis is collapsed)
Broadcasting
The simplest case of broadcasting:
In [1]: import numpy as np
In [2]: a = np.array([1, 2, 3])
In [3]: b = 3
In [4]: a + b
Out[4]: array([4, 5, 6])
Some more complex examples:
In [5]: M = np.ones((3, 3))
In [6]: M + a
Out[6]:
array([[ 2., 3., 4.],
[ 2., 3., 4.],
[ 2., 3., 4.]])
In [7]: a = np.arange(3)
In [8]: b = np.arange(3)[:, np.newaxis]
In [9]: a
Out[9]: array([0, 1, 2])
In [10]: b
Out[10]:
array([[0],
[1],
[2]])
In [11]: a + b
Out[11]:
array([[0, 1, 2],
[1, 2, 3],
[2, 3, 4]])
Note that in this process, arrays of different shapes are "stretched" to match each other.
The three rules of broadcasting (a shape-by-shape walk-through follows the list):
- Rule 1: If the two arrays differ in their number of dimensions, the shape of the one with fewer dimensions is padded with ones on its leading (left) side.
- Rule 2: If the shape of the two arrays does not match in any dimension, the array with shape equal to 1 in that dimension is stretched to match the other shape.
- Rule 3: If in any dimension the sizes disagree and neither is equal to 1, an error is raised.
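A minimal sketch applying the three rules to concrete shapes (the shapes are chosen here for illustration and are not from the book's transcript):

import numpy as np

M = np.ones((2, 3))                # shape (2, 3)
a = np.arange(3)                   # shape (3,)

# Rule 1: a has fewer dimensions, so its shape is padded on the left: (3,) -> (1, 3)
# Rule 2: the size-1 dimension is stretched to match M:               (1, 3) -> (2, 3)
print((M + a).shape)               # (2, 3)

b = np.arange(3).reshape((3, 1))   # shape (3, 1)
# Rule 1: a -> (1, 3); Rule 2: both size-1 dimensions are stretched -> result (3, 3)
print((a + b).shape)               # (3, 3)

c = np.ones((3, 2))                # shape (3, 2)
# Rule 3: a -> (1, 3); the trailing dimensions are 3 vs 2 and neither is 1, so this fails.
try:
    a + c
except ValueError as err:
    print("broadcast error:", err)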
A worked example
Create a dataset z = f(x, y):
# x and y have 50 steps from 0 to 5
x = np.linspace(0, 5, 50)
y = np.linspace(0, 5, 50)[:, np.newaxis]
z = np.sin(x) ** 10 + np.cos(10 + y * x) * np.cos(x)
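In the book this two-dimensional array is then visualized with matplotlib; a sketch of that step (the colormap and axis extent are my choices, and x, y, z are repeated here for completeness):

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 5, 50)
y = np.linspace(0, 5, 50)[:, np.newaxis]
z = np.sin(x) ** 10 + np.cos(10 + y * x) * np.cos(x)   # broadcasts to shape (50, 50)

# Show z as an image; origin='lower' puts (0, 0) in the bottom-left corner.
plt.imshow(z, origin='lower', extent=[0, 5, 0, 5], cmap='viridis')
plt.colorbar()
plt.show()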
Boolean masking
Here the book uses a rainfall dataset to demonstrate the power of boolean masking.
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: rainfall = pd.read_csv('data/Seattle2014.csv')['PRCP'].values
In [4]: inches = rainfall / 254.0
In [5]: inches.shape
Out[5]: (365,)
We can then visualize these data to look for patterns.
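In the book the first look at the data is a plain histogram; a sketch of that step, reusing the inches array loaded above (the number of bins and the axis labels are my choices):

import matplotlib.pyplot as plt
import seaborn; seaborn.set()      # plot styling, also used later in these notes

# Histogram of daily precipitation; most days in Seattle 2014 had no rain at all.
plt.hist(inches, 40)
plt.xlabel('inches')
plt.ylabel('number of days')
plt.show()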
ufuncs
We mentioned earlier that ufuncs are functions that operate on an entire array; here we combine them with boolean masking.
In [1]: import numpy as np
In [2]: rng = np.random.RandomState(0)
In [3]: x = rng.randint(10, size = (3, 4))
In [4]: x
Out[4]:
array([[5, 0, 3, 3],
[7, 9, 3, 5],
[2, 4, 7, 6]])
In [5]: x < 6
Out[5]:
array([[ True, True, True, True],
[False, False, True, True],
[ True, True, False, False]], dtype=bool)
The comparison ufunc above gives us a boolean array; the author then shows what these boolean arrays are good for.
In [6]: np.count_nonzero(_)
Out[6]: 8
In [7]: np.sum(x < 6)
Out[7]: 8
In [8]: np.any(x > 8)
Out[8]: True
In [9]: np.all(x < 8, axis = 1)
Out[9]: array([ True, False, True], dtype=bool)
In [10]: # Working together with boolean operators
In [11]: np.sum((x < 6) & (x >= 0))
Out[11]: 8
Finally, a boolean array can also be used as a mask, very much like logical arrays in Matlab.
In [12]: x[x < 6]
Out[12]: array([5, 0, 3, 3, 3, 5, 2, 4])
Back to the rainfall example: with masks we can extract the data we want very elegantly.
# construct a mask of all rainy days
rainy = (inches > 0)
# construct a mask of all summer days (June 21st is the 172nd day)
days = np.arange(365)
summer = (days > 172) & (days < 262)
print("Median precip on rainy days in 2014 (inches): ",
np.median(inches[rainy]))
print("Median precip on summer days in 2014 (inches): ",
np.median(inches[summer]))
print("Maximum precip on summer days in 2014 (inches): ",
np.max(inches[summer]))
print("Median precip on non-summer rainy days (inches):",
np.median(inches[rainy & ~summer]))
Median precip on rainy days in 2014 (inches): 0.194881889764
Median precip on summer days in 2014 (inches): 0.0
Maximum precip on summer days in 2014 (inches): 0.850393700787
Median precip on non-summer rainy days (inches): 0.200787401575
Finally, note the difference between and/or and &/|: the latter are bitwise operators that act element-wise on boolean arrays, whereas and/or try to evaluate the truth value of the whole object.
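A short sketch of the difference (the toy array is my own):

import numpy as np

x = np.arange(10)

# & and | combine boolean arrays element-wise:
mask = (x > 2) & (x < 7)
print(x[mask])                     # [3 4 5 6]

# `and` asks for the truth value of a whole array, which NumPy refuses to define:
try:
    (x > 2) and (x < 7)
except ValueError as err:
    print(err)                     # "The truth value of an array ... is ambiguous"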
Fancy Indexing
Fancy indexing means using an array as the index into another array (for example, the boolean masks of the previous section).
In [14]: ind = np.array([[3, 7], [4, 5]])
In [15]: rand = np.random.RandomState(45)
In [16]: x= rand.randint(100, size = (10, 5))
In [17]: x
Out[17]:
array([[75, 30, 3, 32, 95],
[61, 85, 35, 68, 15],
[65, 14, 53, 57, 72],
[87, 46, 8, 53, 12],
[34, 24, 12, 17, 68],
[30, 56, 14, 36, 31],
[86, 36, 57, 61, 79],
[17, 6, 42, 11, 8],
[49, 77, 75, 63, 42],
[54, 16, 24, 95, 63]])
In [18]: x[ind]
Out[18]:
array([[[87, 46, 8, 53, 12],
[17, 6, 42, 11, 8]],
[[34, 24, 12, 17, 68],
[30, 56, 14, 36, 31]]])
In [19]: # Shape of the result reflects the shape of the index arrays rather than the shape of the array being indexed
In [20]: X = np.arange(12).reshape((3, 4))
In [21]: X
Out[21]:
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
In [22]: row = np.array([0, 1, 2])
In [23]: col = np.array([2, 1, 3])
In [24]: X[row, col]
Out[24]: array([ 2, 5, 11])
In [25]: # We get the elements at (0, 2), (1, 1), and (2, 3)
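The transcript below switches to a different X: a set of 100 random two-dimensional points used for the scatter-plot example. Its construction is not shown here; in the book it is drawn from a multivariate normal, roughly as in the following sketch (the mean, covariance, and seed are reproduced from memory and may differ):

import numpy as np

rand = np.random.RandomState(42)
mean = [0, 0]
cov = [[1, 2],
       [2, 5]]
X = rand.multivariate_normal(mean, cov, 100)   # 100 correlated points in 2-D
print(X.shape)                                  # (100, 2)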
In [34]: X.shape
Out[34]: (100, 2)
In [35]: import matplotlib.pyplot as plt
In [36]: import seaborn; seaborn.set()
In [37]: plt.scatter(X[:, 0], X[:, 1])
Out[37]: <matplotlib.collections.PathCollection at 0x7f0cc9c461d0>
<matplotlib.figure.Figure at 0x7f0cc9c6b5f8>
In [38]: plt.show()
In [39]: indices = np.random.choice(X.shape[0], 20, replace = False)
In [40]: indices
Out[40]:
array([15, 87, 73, 17, 44, 66, 89, 91, 8, 25, 19, 39, 85, 49, 26, 20, 58,
41, 55, 24])
In [41]: selection = X[indices] # fancy indexing
In [42]: selection
Out[42]:
array([[ -1.80623391e-01, -2.15707232e+00],
[ -8.04178492e-01, -1.34828994e+00],
[ -1.24272035e+00, -2.42157557e+00],
[ 3.57111518e-01, 8.94495954e-02],
[ 2.15274973e+00, 3.24279140e+00],
[ -4.18439156e-01, -8.58736471e-01],
[ 6.08859877e-01, -2.59284917e-01],
[ -6.29633042e-01, 1.32258627e-01],
[ 1.11113414e+00, 1.77185490e+00],
[ 1.65522319e+00, 4.23558698e+00],
[ -1.40629915e-01, -1.62069848e-01],
[ 5.21162541e-01, 2.89756456e+00],
[ -1.11282410e+00, -1.82987036e+00],
[ -5.71948987e-01, -3.34258009e+00],
[ -2.34528800e+00, -3.77554207e+00],
[ -2.58467915e-01, -8.69598951e-01],
[ -1.46270269e-01, -1.27384266e-04],
[ -7.79152780e-02, -2.01423478e+00],
[ -1.79097697e+00, -1.08351482e+00],
[ -1.31637907e+00, -1.86128924e+00]])
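In the book, the selected points are then highlighted on top of the full scatter plot; a sketch of that step (the marker styling is my choice):

import matplotlib.pyplot as plt

# Draw all 100 points faintly, then circle the 20 randomly selected ones.
plt.scatter(X[:, 0], X[:, 1], alpha=0.3)
plt.scatter(selection[:, 0], selection[:, 1],
            facecolor='none', edgecolor='b', s=200)
plt.show()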
Using fancy indexing to modify values
In [53]: x
Out[53]: array([ 0., 0., 2., 3., 4., 0.])
In [54]: i
Out[54]: [2, 3, 3, 4, 4, 4]
In [55]: x[i] += 1
In [56]: x
Out[56]: array([ 0., 0., 3., 4., 5., 0.])
Note that x[i] += 1 does not accumulate over repeated indices: x[i] + 1 is evaluated once and then assigned back, so indices 3 and 4 are each incremented only a single time. To accumulate, use the unbuffered at() method of the ufunc:
In [57]: x = np.zeros(10)
In [58]: np.add.at(x, i, 1) # the proper way: repeated indices are accumulated
In [59]: x
Out[59]: array([ 0., 0., 1., 2., 3., 0., 0., 0., 0., 0.])
Binning Data
In [67]: np.random.seed(42)
In [68]: x = np.random.randn(100)
In [69]: np.size(x)
Out[69]: 100
In [70]: bins = np.linspace(-5, 5, 20)
In [71]: counts = np.zeros_like(bins)
In [72]: np.size(counts)
Out[72]: 20
In [73]: i = np.searchsorted(bins, x)
In [74]: i
Out[74]:
array([11, 10, 11, 13, 10, 10, 13, 11, 9, 11, 9, 9, 10, 6, 7, 9, 8,
11, 8, 7, 13, 10, 10, 7, 9, 10, 8, 11, 9, 9, 9, 14, 10, 8,
12, 8, 10, 6, 7, 10, 11, 10, 10, 9, 7, 9, 9, 12, 11, 7, 11,
9, 9, 11, 12, 12, 8, 9, 11, 12, 9, 10, 8, 8, 12, 13, 10, 12,
11, 9, 11, 13, 10, 13, 5, 12, 10, 9, 10, 6, 10, 11, 13, 9, 8,
9, 12, 11, 9, 11, 10, 12, 9, 9, 9, 7, 11, 10, 10, 10])
In [75]: x
Out[75]:
array([ 0.49671415, -0.1382643 , 0.64768854, 1.52302986, -0.23415337,
-0.23413696, 1.57921282, 0.76743473, -0.46947439, 0.54256004,
-0.46341769, -0.46572975, 0.24196227, -1.91328024, -1.72491783,
-0.56228753, -1.01283112, 0.31424733, -0.90802408, -1.4123037 ,
1.46564877, -0.2257763 , 0.0675282 , -1.42474819, -0.54438272,
0.11092259, -1.15099358, 0.37569802, -0.60063869, -0.29169375,
-0.60170661, 1.85227818, -0.01349722, -1.05771093, 0.82254491,
-1.22084365, 0.2088636 , -1.95967012, -1.32818605, 0.19686124,
0.73846658, 0.17136828, -0.11564828, -0.3011037 , -1.47852199,
-0.71984421, -0.46063877, 1.05712223, 0.34361829, -1.76304016,
0.32408397, -0.38508228, -0.676922 , 0.61167629, 1.03099952,
0.93128012, -0.83921752, -0.30921238, 0.33126343, 0.97554513,
-0.47917424, -0.18565898, -1.10633497, -1.19620662, 0.81252582,
1.35624003, -0.07201012, 1.0035329 , 0.36163603, -0.64511975,
0.36139561, 1.53803657, -0.03582604, 1.56464366, -2.6197451 ,
0.8219025 , 0.08704707, -0.29900735, 0.09176078, -1.98756891,
-0.21967189, 0.35711257, 1.47789404, -0.51827022, -0.8084936 ,
-0.50175704, 0.91540212, 0.32875111, -0.5297602 , 0.51326743,
0.09707755, 0.96864499, -0.70205309, -0.32766215, -0.39210815,
-1.46351495, 0.29612028, 0.26105527, 0.00511346, -0.23458713])
In [76]: np.add.at(counts, i, 1)
In [77]: counts
Out[77]:
array([ 0., 0., 0., 0., 0., 1., 3., 7., 9., 23., 22.,
17., 10., 7., 1., 0., 0., 0., 0., 0.])
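For routine work, the same binning can be done in a single call with np.histogram; a minimal sketch (note that it returns one count per interval, i.e. len(bins) - 1 values, rather than one count per bin edge as above):

import numpy as np

np.random.seed(42)
x = np.random.randn(100)
bins = np.linspace(-5, 5, 20)

# counts has 19 entries, one per interval between consecutive bin edges.
counts, edges = np.histogram(x, bins)
print(counts)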
Sorting
NumPy provides two main sorting-related functions: sort() and argsort().
In [18]: x
Out[18]: array([14, 92, 58, 74, 22])
In [19]: i = np.argsort(x)
In [20]: x[i]
Out[20]: array([14, 22, 58, 74, 92])
From the index array returned by argsort, we can reconstruct the sorted array via fancy indexing.
In [21]: x = np.arange(1,10)
In [22]: np.random.shuffle(x)
In [23]: x
Out[23]: array([2, 9, 4, 3, 8, 6, 7, 5, 1])
In [24]: np.partition(x, 5)
Out[24]: array([1, 2, 3, 4, 5, 6, 7, 9, 8])
Using partition instead of a full sort gives us the k smallest elements: they end up in the first k positions, though in no particular order.
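A small sketch of pulling out the k smallest entries with partition/argpartition (k and the array here are my own illustration):

import numpy as np

x = np.array([2, 9, 4, 3, 8, 6, 7, 5, 1])
k = 3

# partition guarantees the k smallest values occupy the first k slots (unordered);
# argpartition returns the corresponding indices instead of the values.
smallest_values = np.partition(x, k)[:k]
smallest_indices = np.argpartition(x, k)[:k]
print(smallest_values)       # the 3 smallest values, in arbitrary order
print(x[smallest_indices])   # the same values, recovered via fancy indexing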
Structured arrays
In [25]: name = ['Alice', 'Bob', 'Cathy', 'Doug']
...: age = [25, 45, 37, 19]
...: weight = [55.0, 85.5, 68.0, 61.5]
...:
In [26]: x = np.zeros(4, dtype=int)
In [27]: # compound data type
In [28]: data = np.zeros(4, dtype={'names':('name', 'age', 'weight'), 'formats'
...: :('U10', 'i4', 'f8')})
In [29]: data.dtype
Out[29]: dtype([('name', '<U10'), ('age', '<i4'), ('weight', '<f8')])
In [30]: data['name']=name;data['age']=age;data['weight']=weight
In [31]: data
Out[31]:
array([('Alice', 25, 55. ), ('Bob', 45, 85.5), ('Cathy', 37, 68. ),
('Doug', 19, 61.5)],
dtype=[('name', '<U10'), ('age', '<i4'), ('weight', '<f8')])
In [32]: data[data['age'] < 30]['name']
Out[32]:
array(['Alice', 'Doug'],
dtype='<U10')
Besides structured arrays, NumPy also has built-in record arrays. The main difference is that fields can be accessed as attributes rather than as dictionary keys; the downside is that attribute access is slower than key access.
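A minimal sketch of the record-array view, reusing the structured array built above (the variable names are mine):

import numpy as np

data = np.zeros(4, dtype={'names': ('name', 'age', 'weight'),
                          'formats': ('U10', 'i4', 'f8')})
data['name'] = ['Alice', 'Bob', 'Cathy', 'Doug']
data['age'] = [25, 45, 37, 19]
data['weight'] = [55.0, 85.5, 68.0, 61.5]

# A recarray view exposes each field as an attribute.
data_rec = data.view(np.recarray)
print(data_rec.age)          # -> [25 45 37 19]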
Finally, pandas provides much more powerful and efficient tools for working with this kind of data.