CS190 Scalable Machine Learning Spark - 1: Python Basics

Python Basics

Part 1: NumPy

NumPy is a Python library for working with arrays.

     # It is convention to import NumPy with the alias np
     import numpy as np

(1a) Scalar multiplication

Multiplying a vector by a scalar multiplies every element of the vector by that scalar. Here $ a $ is the scalar (constant) and $ \mathbf{v} $ is the vector:
$$ a \mathbf{v} = \begin{bmatrix} a v_1 \\ a v_2 \\ \vdots \\ a v_n \end{bmatrix} $$

# Create a numpy array with the values 1, 2, 3
simpleArray = np.array([1,2,3])
# Perform the scalar product of 5 and the numpy array
timesFive = simpleArray * 5
print(simpleArray)
print(timesFive)
-----
#result
[1 2 3]
[ 5 10 15]
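
As a quick check of the formula above, the following sketch compares NumPy's scalar multiplication with an explicit loop over the elements. The names a, v, scaled, and byLoop are illustrative and are not part of the original exercise.

```python
import numpy as np

a = 5
v = np.array([1, 2, 3])

# NumPy applies the scalar to every element of the array
scaled = a * v

# The same result computed element by element, following the formula
byLoop = np.array([a * v_i for v_i in v])

print(scaled)
print(np.array_equal(scaled, byLoop))
```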

(1b) Element-wise multiplication and dot product

The element-wise calculation is as follows:

$$ \mathbf{x} \odot \mathbf{y} = \begin{bmatrix} x_1 y_1 \\ x_2 y_2 \\ \vdots \\ x_n y_n \end{bmatrix} $$

The dot product is equivalent to performing element-wise multiplication and then summing the result.

$ w \cdot x $ can also be written as $ w^\top x $

$$ w \cdot x = \sum_{i=1}^n w_i x_i $$

For element-wise multiplication, use the * operator to multiply two ndarray objects of the same length.
For the dot product, you can use either np.dot() or np.ndarray.dot().


# Create a ndarray based on a range and step size.
u = np.arange(0, 5, .5)
v = np.arange(5, 10, .5)

elementWise = u * v 
dotProduct = np.dot(u,v)

print('u: {0}'.format(u))
print('v: {0}'.format(v))
print('\nelementWise\n{0}'.format(elementWise))
print('\ndotProduct\n{0}'.format(dotProduct))

----
#result
u: [ 0.   0.5  1.   1.5  2.   2.5  3.   3.5  4.   4.5]
v: [ 5.   5.5  6.   6.5  7.   7.5  8.   8.5  9.   9.5]

elementWise
[  0.     2.75   6.     9.75  14.    18.75  24.    29.75  36.    42.75]

dotProduct
183.75
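
The equivalence described above (dot product = element-wise product, then sum) can be checked directly. This is a small sketch reusing the u and v arrays from the example; the names viaFunction, viaMethod, and viaSum are illustrative.

```python
import numpy as np

u = np.arange(0, 5, .5)
v = np.arange(5, 10, .5)

# Three equivalent ways to compute the dot product
viaFunction = np.dot(u, v)    # module-level function
viaMethod = u.dot(v)          # ndarray method
viaSum = (u * v).sum()        # element-wise product, then sum

print(viaFunction)
print(np.isclose(viaFunction, viaMethod) and np.isclose(viaFunction, viaSum))
```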

(1c) Matrix math

np.matrix() creates a matrix.

For np.matrix objects, the * operator performs matrix multiplication rather than element-wise multiplication.

Transpose: you can transpose a matrix by calling numpy.matrix.transpose() or by using .T on the matrix object (e.g. myMatrix.T).

Transposing a matrix produces a matrix where the new rows are the columns from the old matrix. For example: $$ \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix}^\mathbf{\top} = \begin{bmatrix} 1 & 4 \\ 2 & 5 \\ 3 & 6 \end{bmatrix} $$
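
The transpose example above can be reproduced in a few lines. This sketch uses a plain ndarray rather than np.matrix (an assumption made for brevity; .T behaves the same way on both), and the names M and Mt are illustrative.

```python
import numpy as np

M = np.array([[1, 2, 3],
              [4, 5, 6]])

# .T swaps rows and columns: shape (2, 3) becomes (3, 2)
Mt = M.T

print(M.shape)
print(Mt.shape)
print(Mt)
```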

Inverse: a matrix can be inverted using numpy.linalg.inv().

Note that only square matrices can be inverted, and square matrices are not guaranteed to have an inverse. If the inverse exists, then multiplying a matrix by its inverse will produce the identity matrix: $ \mathbf{A}^{-1} \mathbf{A} = \mathbf{I_n} $. The identity matrix $ \mathbf{I_n} $ has ones along its diagonal and zeros elsewhere. $$ \mathbf{I_n} = \begin{bmatrix} 1 & 0 & 0 & \dots & 0 \\ 0 & 1 & 0 & \dots & 0 \\ 0 & 0 & 1 & \dots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \dots & 1 \end{bmatrix} $$
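
To illustrate the point that a square matrix may have no inverse, here is a minimal sketch. The matrices singular and B are made-up examples, and plain ndarrays with np.dot are used instead of np.matrix.

```python
import numpy as np

# The second row is twice the first, so this matrix has no inverse
singular = np.array([[1., 2.],
                     [2., 4.]])

try:
    np.linalg.inv(singular)
    invertible = True
except np.linalg.LinAlgError:
    # NumPy signals a singular matrix with LinAlgError
    invertible = False

print(invertible)

# An invertible matrix times its inverse gives the identity (up to rounding)
B = np.array([[2., 1.],
              [1., 1.]])
product = np.dot(np.linalg.inv(B), B)
print(np.allclose(product, np.eye(2)))
```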

For this exercise, multiply $ \mathbf{A} $ times its transpose $ ( \mathbf{A}^\top ) $ and then calculate the inverse of the result $ ( [ \mathbf{A} \mathbf{A}^\top ]^{-1} ) $.

from numpy.linalg import inv

A = np.matrix([[1,2,3,4],[5,6,7,8]])
print('A:\n{0}'.format(A))
# Print A transpose
print('\nA transpose:\n{0}'.format(A.T))

# Multiply A by A transpose
AAt = A * A.T
print('\nAAt:\n{0}'.format(AAt))

# Invert AAt with np.linalg.inv()
AAtInv = np.linalg.inv(AAt)
print('\nAAtInv:\n{0}'.format(AAtInv))

# Show inverse times matrix equals identity
# We round due to numerical precision
print('\nAAtInv * AAt:\n{0}'.format((AAtInv * AAt).round(4)))

result

A:
[[1 2 3 4]
 [5 6 7 8]]

A transpose:
[[1 5]
 [2 6]
 [3 7]
 [4 8]]

AAt:
[[ 30  70]
 [ 70 174]]

AAtInv:
[[ 0.54375 -0.21875]
 [-0.21875  0.09375]]

AAtInv * AAt:
[[ 1.  0.]
 [-0.  1.]]


Part 2: Additional NumPy and Spark linear algebra

(2a) Slices

features = np.array([1, 2, 3, 4])
print('features:\n{0}'.format(features))

# The first three elements of features
firstThree = features[0:3]

# The last three elements of features
lastThree = features[-3:]

(2b) Combining ndarray objects

NumPy provides np.hstack(), which combines arrays column-wise, and np.vstack(), which combines arrays row-wise. Both functions take a tuple of arrays as their first argument. For example, to horizontally combine three arrays a, b, and c, you would run np.hstack((a, b, c)).

If we had two arrays: a = [1, 2, 3, 4] and b = [5, 6, 7, 8], we could use np.vstack((a, b)) to produce the two-dimensional array: $$ \begin{bmatrix} 1 & 2 & 3 & 4 \\ 5 & 6 & 7 & 8 \end{bmatrix} $$
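
The a and b arrays mentioned above make a compact test case. The sketch below shows the shapes produced by each function; the names sideBySide and stacked are illustrative.

```python
import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([5, 6, 7, 8])

# Both functions take a single tuple of arrays
sideBySide = np.hstack((a, b))   # 1-D array with 8 elements
stacked = np.vstack((a, b))      # 2-D array with 2 rows and 4 columns

print(sideBySide.shape)
print(stacked.shape)
print(stacked)
```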

zeros = np.zeros(8)
ones = np.ones(8)
print('zeros:\n{0}'.format(zeros))
print('\nones:\n{0}'.format(ones))

zerosThenOnes = np.hstack((zeros, ones))   # A 1-D array of length 16
zerosAboveOnes = np.vstack((zeros, ones))  # A 2 by 8 array

print('\nzerosThenOnes:\n{0}'.format(zerosThenOnes))
print('\nzerosAboveOnes:\n{0}'.format(zerosAboveOnes))

result:
zeros:
[ 0.  0.  0.  0.  0.  0.  0.  0.]

ones:
[ 1.  1.  1.  1.  1.  1.  1.  1.]

zerosThenOnes:
[ 0.  0.  0.  0.  0.  0.  0.  0.  1.  1.  1.  1.  1.  1.  1.  1.]

zerosAboveOnes:
[[ 0.  0.  0.  0.  0.  0.  0.  0.]
 [ 1.  1.  1.  1.  1.  1.  1.  1.]]


(2c) PySpark's DenseVector

PySpark provides a DenseVector class within the module pyspark.mllib.linalg.

DenseVector is used to store arrays of values for use in PySpark. DenseVector actually stores values in a NumPy array and delegates calculations to that object. You can create a new DenseVector using DenseVector() and passing in a NumPy array or a Python list.

Note that DenseVector stores all values as np.float64.

DenseVector objects exist locally and are not inherently distributed. DenseVector objects can be used in the distributed setting by either passing functions that contain them to resilient distributed dataset (RDD) transformations or by distributing them directly as RDDs.

from pyspark.mllib.linalg import DenseVector

numpyVector = np.array([-3, -4, 5])
print('\nnumpyVector:\n{0}'.format(numpyVector))

# Create a DenseVector consisting of the values [3.0, 4.0, 5.0]
myDenseVector = DenseVector([3, 4, 5])
# Calculate the dot product between the two vectors
denseDotProduct = myDenseVector.dot(numpyVector)

print('myDenseVector:\n{0}'.format(myDenseVector))
print('\ndenseDotProduct:\n{0}'.format(denseDotProduct))

result:

numpyVector:
[-3 -4  5]
myDenseVector:
[3.0,4.0,5.0]

denseDotProduct:
0.0
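
The "store as float64, delegate to NumPy" idea described in this section can be sketched without a Spark installation. TinyDenseVector below is a hypothetical stand-in written only for illustration; it is not PySpark's implementation and covers only the dot product.

```python
import numpy as np

class TinyDenseVector(object):
    """Hypothetical stand-in for illustration; not PySpark's DenseVector."""

    def __init__(self, values):
        # Convert a list or array to a float64 NumPy array
        self.array = np.asarray(values, dtype=np.float64)

    def dot(self, other):
        # Delegate the calculation to NumPy
        return np.dot(self.array, np.asarray(other, dtype=np.float64))

tiny = TinyDenseVector([3, 4, 5])
print(tiny.array.dtype)
print(tiny.dot(np.array([-3, -4, 5])))
```

Integer input ends up stored as floating-point values, which mirrors the note above about np.float64.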


Part 3: Python lambda expressions

A lambda is an anonymous function.

Further reading: Lambda Functions, Lambda Tutorial, and Python Functions.

# Example function
def addS(x):
    return x + 's'

# The same function in lambda form
addSLambda = lambda x: x + 's'

# Multiplication
multiplyByTen = lambda x: x * 10
print(multiplyByTen(5))

# Lambdas take fewer steps than def.
# The first function adds two values, while the second subtracts the second value from the first.
def plus(x, y):
    return x + y

def minus(x, y):
    return x - y

functions = [plus, minus]
print(functions[0](4, 5))
print(functions[1](4, 5))

# The same two functions written as lambdas
lambdaFunctions = [lambda x, y: x + y, lambda x, y: x - y]
print(lambdaFunctions[0](4, 5))
print(lambdaFunctions[1](4, 5))

Lambda expressions consist of a single expression statement and cannot contain other simple statements. In short, this means that the lambda expression needs to evaluate to a value and exist on a single logical line. If more complex logic is necessary, use def in place of lambda.
Expression statements evaluate to a value (sometimes that value is None). Lambda expressions automatically return the value of their expression statement. In fact, a return statement in a lambda would raise a SyntaxError.
The following Python keywords refer to simple statements that cannot be used in a lambda expression: assert, pass, del, print, return, yield, raise, break, continue, import, global, and exec (print and exec are statements only in Python 2). Also, note that assignment statements (=) and augmented assignment statements (e.g. +=) cannot be used either.
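
The restrictions above can be seen in a short sketch. The names absoluteValue, rejected, and describe are illustrative, and compile() is used here only as a convenient way to show the SyntaxError without stopping the script.

```python
# A lambda body must be a single expression, so a conditional
# expression works where an if statement would not
absoluteValue = lambda x: x if x >= 0 else -x
print(absoluteValue(-7))

# A statement such as return inside a lambda is rejected by the parser
try:
    compile('lambda x: return x', '<example>', 'eval')
    rejected = False
except SyntaxError:
    rejected = True
print(rejected)

# When the logic needs statements, use def instead
def describe(x):
    assert x is not None
    return 'negative' if x < 0 else 'non-negative'

print(describe(-7))
```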
