Python Data Science Handbook — Study Notes
Part 2: NumPy (2)
NumPy's vectorized operations are very similar to Matlab's. Keep in mind that vectorized operations are far more efficient than explicit loops, so use vectorization instead of loops whenever possible.
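As a rough, self-contained illustration of that claim (the reciprocal computation and the array size below are my own choices, not from the book's transcript), compare an explicit Python loop with the equivalent vectorized expression; timing both with %timeit typically shows the vectorized form to be orders of magnitude faster:

import numpy as np

rng = np.random.RandomState(0)
values = rng.randint(1, 100, size=1000000)

def loop_reciprocal(arr):
    # Interpreted Python loop: every element passes through Python-level bytecode.
    out = np.empty(len(arr))
    for idx in range(len(arr)):
        out[idx] = 1.0 / arr[idx]
    return out

vectorized = 1.0 / values          # ufunc: the loop runs in compiled code
assert np.allclose(vectorized, loop_reciprocal(values))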
"ufunc"是一些列能够对array进行整体操作的函数
有一些特殊的函数,我们可以通过scipy包来获取
In [29]: from scipy import special
In [30]: x = np.random.randint(15, size = (5,5), dtype = 'int32')
In [31]: x
Out[31]:
array([[ 4, 14, 8, 5, 7],
[ 0, 8, 8, 14, 9],
[ 9, 9, 10, 14, 1],
[13, 10, 0, 12, 12],
[ 7, 3, 2, 14, 2]], dtype=int32)
In [32]: special.erf(x)
Out[32]:
array([[ 0.99999998, 1. , 1. , 1. , 1. ],
[ 0. , 1. , 1. , 1. , 1. ],
[ 1. , 1. , 1. , 1. , 0.84270079],
[ 1. , 1. , 0. , 1. , 1. ],
[ 1. , 0.99997791, 0.99532227, 1. , 0.99532227]])
In [33]: x
Out[33]:
array([[ 4, 14, 8, 5, 7],
[ 0, 8, 8, 14, 9],
[ 9, 9, 10, 14, 1],
[13, 10, 0, 12, 12],
[ 7, 3, 2, 14, 2]], dtype=int32)
Specifying output
In [24]:
x = np.arange(5)
y = np.empty(5)
np.multiply(x, 10, out=y)
print(y)
[ 0. 10. 20. 30. 40.]
y = np.zeros(10)
np.power(2, x, out=y[::2])
print(y)
[ 1. 0. 2. 0. 4. 0. 8. 0. 16. 0.]
You might ask what the benefit of this is compared with a direct assignment.
With y[::2] = 2 ** x, a temporary array is created to hold the value of the right-hand side, and that temporary is then copied into the sub-array on the left. Writing the result straight into the output with out= avoids the temporary array, which is clearly more efficient (the difference matters mainly for large arrays).
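A minimal sketch for checking this yourself (the array size is my own choice; for small arrays the difference is negligible). Time each assignment with %timeit to see the effect:

import numpy as np

x = np.linspace(0, 1, 500000)
y = np.zeros(1000000)

# Version 1: builds a temporary array for 2.0 ** x, then copies it into the view y[::2].
y[::2] = 2.0 ** x

# Version 2: the ufunc writes its result directly into the strided view, with no temporary.
np.power(2.0, x, out=y[::2])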
Aggregation
In [36]: x = np.linspace(0, 10, 5)
In [37]: x
Out[37]: array([ 0. , 2.5, 5. , 7.5, 10. ])
In [38]: np.add.reduce(x)
Out[38]: 25.0
In [39]: np.multiply.reduce(x)
Out[39]: 0.0
In [40]: np.add.accumulate(x)
Out[40]: array([ 0. , 2.5, 7.5, 15. , 25. ])
Outer products
In [41]: x = np.arange(1, 5)
In [42]: x
Out[42]: array([1, 2, 3, 4])
In [43]: np.multiply.outer(x, x)
Out[43]:
array([[ 1, 2, 3, 4],
[ 2, 4, 6, 8],
[ 3, 6, 9, 12],
[ 4, 8, 12, 16]])
Aggregation functions in NumPy: min, max, etc.
In [44]: x = np.arange(1, 10)
In [45]: x
Out[45]: array([1, 2, 3, 4, 5, 6, 7, 8, 9])
In [46]: %timeit x.sum()
1.11 µs ± 72.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [47]: %timeit sum(x) #Be careful, don't use the python-version sum()
1.3 µs ± 5.45 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [48]: x.min()
Out[48]: 1
In [49]: x.max()
Out[49]: 9
We can also aggregate over rows or columns by setting the axis argument.
In [50]: Mat = np.random.random((3,4))
In [51]: Mat.sum(axis = 1)
Out[51]: array([ 2.54634383, 2.42121143, 1.28962794])
In [52]: Mat
Out[52]:
array([[ 0.77880176, 0.57543626, 0.6840498 , 0.508056 ],
[ 0.75612961, 0.15132258, 0.65047932, 0.86327992],
[ 0.25738888, 0.5731711 , 0.03401482, 0.42505314]])
In [53]: Mat.sum(axis = 0)
Out[53]: array([ 1.79232025, 1.29992993, 1.36854395, 1.79638906])
In [54]: # axis=0 sums down each column (the row axis is collapsed)
Broadcasting
The simplest case of broadcasting:
In [1]: import numpy as np
In [2]: a = np.array([1, 2, 3])
In [3]: b = 3
In [4]: a + b
Out[4]: array([4, 5, 6])
Some more complex examples:
In [5]: M = np.ones((3, 3))
In [6]: M + a
Out[6]:
array([[ 2., 3., 4.],
[ 2., 3., 4.],
[ 2., 3., 4.]])
In [7]: a = np.arange(3)
In [8]: b = np.arange(3)[:, np.newaxis]
In [9]: a
Out[9]: array([0, 1, 2])
In [10]: b
Out[10]:
array([[0],
[1],
[2]])
In [11]: a + b
Out[11]:
array([[0, 1, 2],
[1, 2, 3],
[2, 3, 4]])
Note that in this process, arrays of different shapes are "stretched" to match each other.
The three rules of broadcasting (a shape-by-shape walk-through follows the list):
- Rule 1: If the two arrays differ in their number of dimensions, the shape of the one with fewer dimensions is padded with ones on its leading (left) side.
- Rule 2: If the shape of the two arrays does not match in any dimension, the array with shape equal to 1 in that dimension is stretched to match the other shape.
- Rule 3: If in any dimension the sizes disagree and neither is equal to 1, an error is raised.
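A minimal sketch applying the three rules to concrete shapes (the shapes are chosen here for illustration and are not from the book's transcript):

import numpy as np

M = np.ones((2, 3))                # shape (2, 3)
a = np.arange(3)                   # shape (3,)

# Rule 1: a has fewer dimensions, so its shape is padded on the left: (3,) -> (1, 3)
# Rule 2: the size-1 dimension is stretched to match M:               (1, 3) -> (2, 3)
print((M + a).shape)               # (2, 3)

b = np.arange(3).reshape((3, 1))   # shape (3, 1)
# Rule 1: a -> (1, 3); Rule 2: both size-1 dimensions are stretched -> result (3, 3)
print((a + b).shape)               # (3, 3)

c = np.ones((3, 2))                # shape (3, 2)
# Rule 3: a -> (1, 3); the trailing dimensions are 3 vs 2 and neither is 1, so this fails.
try:
    a + c
except ValueError as err:
    print("broadcast error:", err)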
A worked example
Create a dataset z = f(x, y):
# x and y have 50 steps from 0 to 5
x = np.linspace(0, 5, 50)
y = np.linspace(0, 5, 50)[:, np.newaxis]
z = np.sin(x) ** 10 + np.cos(10 + y * x) * np.cos(x)
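In the book this two-dimensional array is then visualized with matplotlib; a sketch of that step (the colormap and axis extent are my choices, and x, y, z are repeated here for completeness):

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 5, 50)
y = np.linspace(0, 5, 50)[:, np.newaxis]
z = np.sin(x) ** 10 + np.cos(10 + y * x) * np.cos(x)   # broadcasts to shape (50, 50)

# Show z as an image; origin='lower' puts (0, 0) in the bottom-left corner.
plt.imshow(z, origin='lower', extent=[0, 5, 0, 5], cmap='viridis')
plt.colorbar()
plt.show()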
Boolean masking
Here the book uses a rainfall dataset to demonstrate the power of boolean masking.
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: rainfall = pd.read_csv('data/Seattle2014.csv')['PRCP'].values
In [4]: inches = rainfall / 254.0
In [5]: inches.shape
Out[5]: (365,)
We can then visualize these data to look for patterns.
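In the book the first look at the data is a plain histogram; a sketch of that step, reusing the inches array loaded above (the number of bins and the axis labels are my choices):

import matplotlib.pyplot as plt
import seaborn; seaborn.set()      # plot styling, also used later in these notes

# Histogram of daily precipitation; most days in Seattle 2014 had no rain at all.
plt.hist(inches, 40)
plt.xlabel('inches')
plt.ylabel('number of days')
plt.show()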
ufuncs
We mentioned earlier that ufuncs are functions that operate on an entire array; here we combine them with boolean masking.
In [1]: import numpy as np
In [2]: rng = np.random.RandomState(0)
In [3]: x = rng.randint(10, size = (3, 4))
In [4]: x
Out[4]:
array([[5, 0, 3, 3],
[7, 9, 3, 5],
[2, 4, 7, 6]])
In [5]: x < 6
Out[5]:
array([[ True, True, True, True],
[False, False, True, True],
[ True, True, False, False]], dtype=bool)
The comparison ufunc above gives us a boolean array; the author then shows what these boolean arrays are good for.
In [6]: np.count_nonzero(_)
Out[6]: 8
In [7]: np.sum(x < 6)
Out[7]: 8
In [8]: np.any(x > 8)
Out[8]: True
In [9]: np.all(x < 8, axis = 1)
Out[9]: array([ True, False, True], dtype=bool)
In [10]: # Working together with boolean operators
In [11]: np.sum((x < 6) & (x >= 0))
Out[11]: 8
Finally, a boolean array can also be used as a mask, very much like logical arrays in Matlab.
In [12]: x[x < 6]
Out[12]: array([5, 0, 3, 3, 3, 5, 2, 4])
Back to the rainfall example: with masks we can extract the data we want very elegantly.
# construct a mask of all rainy days
rainy = (inches > 0)
# construct a mask of all summer days (June 21st is the 172nd day)
days = np.arange(365)
summer = (days > 172) & (days < 262)
print("Median precip on rainy days in 2014 (inches): ",
np.median(inches[rainy]))
print("Median precip on summer days in 2014 (inches): ",
np.median(inches[summer]))
print("Maximum precip on summer days in 2014 (inches): ",
np.max(inches[summer]))
print("Median precip on non-summer rainy days (inches):",
np.median(inches[rainy & ~summer]))
Median precip on rainy days in 2014 (inches): 0.194881889764
Median precip on summer days in 2014 (inches): 0.0
Maximum precip on summer days in 2014 (inches): 0.850393700787
Median precip on non-summer rainy days (inches): 0.200787401575
Finally, note the difference between and/or and &/|: the latter are bitwise operators that act element-wise on boolean arrays, whereas and/or try to evaluate the truth value of the whole object.
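A short sketch of the difference (the toy array is my own):

import numpy as np

x = np.arange(10)

# & and | combine boolean arrays element-wise:
mask = (x > 2) & (x < 7)
print(x[mask])                     # [3 4 5 6]

# `and` asks for the truth value of a whole array, which NumPy refuses to define:
try:
    (x > 2) and (x < 7)
except ValueError as err:
    print(err)                     # "The truth value of an array ... is ambiguous"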
Fancy Indexing
Fancy indexing means using an array as the index into another array (for example, the boolean masks of the previous section).
In [14]: ind = np.array([[3, 7], [4, 5]])
In [15]: rand = np.random.RandomState(45)
In [16]: x= rand.randint(100, size = (10, 5))
In [17]: x
Out[17]:
array([[75, 30, 3, 32, 95],
[61, 85, 35, 68, 15],
[65, 14, 53, 57, 72],
[87, 46, 8, 53, 12],
[34, 24, 12, 17, 68],
[30, 56, 14, 36, 31],
[86, 36, 57, 61, 79],
[17, 6, 42, 11, 8],
[49, 77, 75, 63, 42],
[54, 16, 24, 95, 63]])
In [18]: x[ind]
Out[18]:
array([[[87, 46, 8, 53, 12],
[17, 6, 42, 11, 8]],
[[34, 24, 12, 17, 68],
[30, 56, 14, 36, 31]]])
In [19]: # Shape of the result reflects the shape of the index arrays rather than the shape of the array being indexed
In [20]: X = np.arange(12).reshape((3, 4))
In [21]: X
Out[21]:
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
In [22]: row = np.array([0, 1, 2])
In [23]: col = np.array([2, 1, 3])
In [24]: X[row, col]
Out[24]: array([ 2, 5, 11])
In [25]: # We get the elements at (0, 2), (1, 1), and (2, 3)
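The transcript below switches to a different X: a set of 100 random two-dimensional points used for the scatter-plot example. Its construction is not shown here; in the book it is drawn from a multivariate normal, roughly as in the following sketch (the mean, covariance, and seed are reproduced from memory and may differ):

import numpy as np

rand = np.random.RandomState(42)
mean = [0, 0]
cov = [[1, 2],
       [2, 5]]
X = rand.multivariate_normal(mean, cov, 100)   # 100 correlated points in 2-D
print(X.shape)                                  # (100, 2)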
In [34]: X.shape
Out[34]: (100, 2)
In [35]: import matplotlib.pyplot as plt
In [36]: import seaborn; seaborn.set()
In [37]: plt.scatter(X[:, 0], X[:, 1])
Out[37]: <matplotlib.collections.PathCollection at 0x7f0cc9c461d0>
<matplotlib.figure.Figure at 0x7f0cc9c6b5f8>
In [38]: plt.show()
In [39]: indices = np.random.choice(X.shape[0], 20, replace = False)
In [40]: indices
Out[40]:
array([15, 87, 73, 17, 44, 66, 89, 91, 8, 25, 19, 39, 85, 49, 26, 20, 58,
41, 55, 24])
In [41]: selection = X[indices] # fancy indexing
In [42]: selection
Out[42]:
array([[ -1.80623391e-01, -2.15707232e+00],
[ -8.04178492e-01, -1.34828994e+00],
[ -1.24272035e+00, -2.42157557e+00],
[ 3.57111518e-01, 8.94495954e-02],
[ 2.15274973e+00, 3.24279140e+00],
[ -4.18439156e-01, -8.58736471e-01],
[ 6.08859877e-01, -2.59284917e-01],
[ -6.29633042e-01, 1.32258627e-01],
[ 1.11113414e+00, 1.77185490e+00],
[ 1.65522319e+00, 4.23558698e+00],
[ -1.40629915e-01, -1.62069848e-01],
[ 5.21162541e-01, 2.89756456e+00],
[ -1.11282410e+00, -1.82987036e+00],
[ -5.71948987e-01, -3.34258009e+00],
[ -2.34528800e+00, -3.77554207e+00],
[ -2.58467915e-01, -8.69598951e-01],
[ -1.46270269e-01, -1.27384266e-04],
[ -7.79152780e-02, -2.01423478e+00],
[ -1.79097697e+00, -1.08351482e+00],
[ -1.31637907e+00, -1.86128924e+00]])
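In the book, the selected points are then highlighted on top of the full scatter plot; a sketch of that step (the marker styling is my choice):

import matplotlib.pyplot as plt

# Draw all 100 points faintly, then circle the 20 randomly selected ones.
plt.scatter(X[:, 0], X[:, 1], alpha=0.3)
plt.scatter(selection[:, 0], selection[:, 1],
            facecolor='none', edgecolor='b', s=200)
plt.show()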
Using fancy indexing to modify values
In [53]: x
Out[53]: array([ 0., 0., 2., 3., 4., 0.])
In [54]: i
Out[54]: [2, 3, 3, 4, 4, 4]
In [55]: x[i] += 1
In [56]: x
Out[56]: array([ 0., 0., 3., 4., 5., 0.])
Note that x[i] += 1 does not accumulate over repeated indices: x[i] + 1 is evaluated once and then assigned back, so indices 3 and 4 are each incremented only a single time. To accumulate, use the unbuffered at() method of the ufunc:
In [57]: x = np.zeros(10)
In [58]: np.add.at(x, i, 1) # the proper way: repeated indices are accumulated
In [59]: x
Out[59]: array([ 0., 0., 1., 2., 3., 0., 0., 0., 0., 0.])
Binning Data
In [67]: np.random.seed(42)
In [68]: x = np.random.randn(100)
In [69]: np.size(x)
Out[69]: 100
In [70]: bins = np.linspace(-5, 5, 20)
In [71]: counts = np.zeros_like(bins)
In [72]: np.size(counts)
Out[72]: 20
In [73]: i = np.searchsorted(bins, x)
In [74]: i
Out[74]:
array([11, 10, 11, 13, 10, 10, 13, 11, 9, 11, 9, 9, 10, 6, 7, 9, 8,
11, 8, 7, 13, 10, 10, 7, 9, 10, 8, 11, 9, 9, 9, 14, 10, 8,
12, 8, 10, 6, 7, 10, 11, 10, 10, 9, 7, 9, 9, 12, 11, 7, 11,
9, 9, 11, 12, 12, 8, 9, 11, 12, 9, 10, 8, 8, 12, 13, 10, 12,
11, 9, 11, 13, 10, 13, 5, 12, 10, 9, 10, 6, 10, 11, 13, 9, 8,
9, 12, 11, 9, 11, 10, 12, 9, 9, 9, 7, 11, 10, 10, 10])
In [75]: x
Out[75]:
array([ 0.49671415, -0.1382643 , 0.64768854, 1.52302986, -0.23415337,
-0.23413696, 1.57921282, 0.76743473, -0.46947439, 0.54256004,
-0.46341769, -0.46572975, 0.24196227, -1.91328024, -1.72491783,
-0.56228753, -1.01283112, 0.31424733, -0.90802408, -1.4123037 ,
1.46564877, -0.2257763 , 0.0675282 , -1.42474819, -0.54438272,
0.11092259, -1.15099358, 0.37569802, -0.60063869, -0.29169375,
-0.60170661, 1.85227818, -0.01349722, -1.05771093, 0.82254491,
-1.22084365, 0.2088636 , -1.95967012, -1.32818605, 0.19686124,
0.73846658, 0.17136828, -0.11564828, -0.3011037 , -1.47852199,
-0.71984421, -0.46063877, 1.05712223, 0.34361829, -1.76304016,
0.32408397, -0.38508228, -0.676922 , 0.61167629, 1.03099952,
0.93128012, -0.83921752, -0.30921238, 0.33126343, 0.97554513,
-0.47917424, -0.18565898, -1.10633497, -1.19620662, 0.81252582,
1.35624003, -0.07201012, 1.0035329 , 0.36163603, -0.64511975,
0.36139561, 1.53803657, -0.03582604, 1.56464366, -2.6197451 ,
0.8219025 , 0.08704707, -0.29900735, 0.09176078, -1.98756891,
-0.21967189, 0.35711257, 1.47789404, -0.51827022, -0.8084936 ,
-0.50175704, 0.91540212, 0.32875111, -0.5297602 , 0.51326743,
0.09707755, 0.96864499, -0.70205309, -0.32766215, -0.39210815,
-1.46351495, 0.29612028, 0.26105527, 0.00511346, -0.23458713])
In [76]: np.add.at(counts, i, 1)
In [77]: counts
Out[77]:
array([ 0., 0., 0., 0., 0., 1., 3., 7., 9., 23., 22.,
17., 10., 7., 1., 0., 0., 0., 0., 0.])
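For routine work, the same binning can be done in a single call with np.histogram; a minimal sketch (note that it returns one count per interval, i.e. len(bins) - 1 values, rather than one count per bin edge as above):

import numpy as np

np.random.seed(42)
x = np.random.randn(100)
bins = np.linspace(-5, 5, 20)

# counts has 19 entries, one per interval between consecutive bin edges.
counts, edges = np.histogram(x, bins)
print(counts)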
Sorting
NumPy provides two main sorting-related functions: sort() and argsort().
In [18]: x
Out[18]: array([14, 92, 58, 74, 22])
In [19]: i = np.argsort(x)
In [20]: x[i]
Out[20]: array([14, 22, 58, 74, 92])
From the index array returned by argsort, we can reconstruct the sorted array via fancy indexing.
In [21]: x = np.arange(1,10)
In [22]: np.random.shuffle(x)
In [23]: x
Out[23]: array([2, 9, 4, 3, 8, 6, 7, 5, 1])
In [24]: np.partition(x, 5)
Out[24]: array([1, 2, 3, 4, 5, 6, 7, 9, 8])
Using partition instead of a full sort gives us the k smallest elements: they end up in the first k positions, though in no particular order.
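A small sketch of pulling out the k smallest entries with partition/argpartition (k and the array here are my own illustration):

import numpy as np

x = np.array([2, 9, 4, 3, 8, 6, 7, 5, 1])
k = 3

# partition guarantees the k smallest values occupy the first k slots (unordered);
# argpartition returns the corresponding indices instead of the values.
smallest_values = np.partition(x, k)[:k]
smallest_indices = np.argpartition(x, k)[:k]
print(smallest_values)       # the 3 smallest values, in arbitrary order
print(x[smallest_indices])   # the same values, recovered via fancy indexing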
Structured arrays
In [25]: name = ['Alice', 'Bob', 'Cathy', 'Doug']
...: age = [25, 45, 37, 19]
...: weight = [55.0, 85.5, 68.0, 61.5]
...:
In [26]: x = np.zeros(4, dtype=int)
In [27]: # compound data type
In [28]: data = np.zeros(4, dtype={'names':('name', 'age', 'weight'), 'formats'
...: :('U10', 'i4', 'f8')})
In [29]: data.dtype
Out[29]: dtype([('name', '<U10'), ('age', '<i4'), ('weight', '<f8')])
In [30]: data['name']=name;data['age']=age;data['weight']=weight
In [31]: data
Out[31]:
array([('Alice', 25, 55. ), ('Bob', 45, 85.5), ('Cathy', 37, 68. ),
('Doug', 19, 61.5)],
dtype=[('name', '<U10'), ('age', '<i4'), ('weight', '<f8')])
In [32]: data[data['age'] < 30]['name']
Out[32]:
array(['Alice', 'Doug'],
dtype='<U10')
Besides structured arrays, NumPy also has built-in record arrays. The main difference is that fields can be accessed as attributes rather than as dictionary keys; the downside is that attribute access is slower than key access.
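A minimal sketch of the record-array view, reusing the structured array built above (the variable names are mine):

import numpy as np

data = np.zeros(4, dtype={'names': ('name', 'age', 'weight'),
                          'formats': ('U10', 'i4', 'f8')})
data['name'] = ['Alice', 'Bob', 'Cathy', 'Doug']
data['age'] = [25, 45, 37, 19]
data['weight'] = [55.0, 85.5, 68.0, 61.5]

# A recarray view exposes each field as an attribute.
data_rec = data.view(np.recarray)
print(data_rec.age)          # -> [25 45 37 19]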
Finally, pandas provides much more powerful and efficient tools for working with this kind of data.