Deep Learning | 1 Neural Networks And Deep Learning

0 Quick Reference (This section is not for beginners)

This part is for when you have already learned deep learning but have forgotten the formulas. If you are a beginner, just skip this section.

forward propagation
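In vectorized form (using the notation defined later in this note), the forward step for layer l = 1, ..., L, with A^{[0]} = X, is:

$$Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}, \qquad A^{[l]} = g^{[l]}(Z^{[l]})$$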

backward propagation
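The corresponding backward step, starting from dZ^{[L]} = A^{[L]} - Y for a sigmoid output with cross-entropy cost (* denotes element-wise multiplication), is:

$$dW^{[l]} = \frac{1}{m}\, dZ^{[l]} A^{[l-1]T}, \qquad db^{[l]} = \frac{1}{m} \sum_{\text{examples}} dZ^{[l]}, \qquad dZ^{[l-1]} = W^{[l]T} dZ^{[l]} * g^{[l-1]\prime}(Z^{[l-1]})$$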

shape check
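And the expected shapes:

$$W^{[l]}, dW^{[l]} : (n^{[l]}, n^{[l-1]}), \qquad b^{[l]}, db^{[l]} : (n^{[l]}, 1), \qquad Z^{[l]}, A^{[l]}, dZ^{[l]} : (n^{[l]}, m)$$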


Many thanks to Professor Andrew Ng and his deep learning materials. This note is based on Andrew Ng's videos.

If you are new to deep learning, I strongly recommend watching these videos; they are easy to understand and completely free.

The rest of this note is a brief summary of the material; take a look if you like.

1 Introduction

Architecture       Application
Standard NN        General purpose
Convolutional NN   Images
Recurrent NN       One-dimensional sequence data
  • Structured data: data stored in databases. This kind of data has a well-defined meaning, such as a price or an age;
  • Unstructured data: for example, raw audio, text, and images. This kind of data is harder for a computer to understand.

2 Basics of Neural Network Programming

Notation

Notation   Description
(x, y)     A single training example, where x is an n_x-dimensional feature vector and y is either 1 or 0.
m          The number of training examples, i.e. the training set is {(x(1), y(1)), (x(2), y(2)), ..., (x(m), y(m))}. More precisely, it is also denoted m_train.
m_test     The number of test examples.
X          The matrix of training examples, i.e. X = [x(1), x(2), ..., x(m)].
Y          The matrix of labels, i.e. Y = [y(1), y(2), ..., y(m)].

In python,

X.shape = (n_x, m)
Y.shape = (1, m)

2.1 Logistic Regression


  • Logistic regression is an algorithm for binary classification

When z is large, sigmoid(z) is close to 1, and when z is small, sigmoid(z) is close to 0.
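For reference, the model and the sigmoid function are:

$$\hat{y} = a = \sigma(w^T x + b), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}$$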

If the loss function were defined as the squared error, gradient descent could get stuck in a local optimum, so it is not a good choice here.

So to train the logistic regression model, we try to find parameters w and b that minimize the overall cost function J(w, b).
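Concretely, the definitions used in the course are the cross-entropy loss for a single example and the cost averaged over the training set:

$$\mathcal{L}(\hat{y}, y) = -\big(y \log \hat{y} + (1 - y)\log(1 - \hat{y})\big), \qquad J(w, b) = \frac{1}{m}\sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)})$$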

Cost Function Is A Convex Function

The cost function is a convex function, so it has a single lowest point. Therefore we use the gradient descent algorithm to learn the optimal parameters.

Gradient Descent Algorithm

Repeat updating the parameters w and b using the following expressions (":=" means "update the left-hand side with the right-hand side").

Gradient Descent Algorithm
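The update rule for each step is:

$$w := w - \alpha\, \frac{\partial J(w, b)}{\partial w}, \qquad b := b - \alpha\, \frac{\partial J(w, b)}{\partial b}$$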

(You can get an intuition for this process by considering the optimization of just the parameter w in J(w).)

alpha is called the learning rate.

Gradient Descent Algorithm Implementation

(1) For one training example, the computational graph could be

So if we apply the gradient descent algorithm to this logistic regression example, we can compute the derivatives in the following steps.
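For a single example, with z = w^T x + b and a = sigmoid(z), the derivatives worked out in the course are:

$$da = -\frac{y}{a} + \frac{1 - y}{1 - a}, \qquad dz = a - y, \qquad dw_i = x_i \, dz, \qquad db = dz$$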

(2) We can then apply the single-example gradient computation to all m training examples.

Combining the previous results, this is easy to implement with for-loops (see the sketch below). But in practice we do not want nested for-loops, because they make the program run slowly. Instead, we can use a technique called vectorization to eliminate unnecessary for-loops.
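A minimal non-vectorized sketch of one such pass over the m examples, assuming w of shape (n_x, 1), a scalar b, and X, Y as defined in the notation section:

import numpy as np

# Non-vectorized pass over the m training examples (for illustration only)
J = 0.0
dw = np.zeros((n_x, 1))
db = 0.0
for i in range(m):
    z = np.dot(w[:, 0], X[:, i]) + b    # scalar z for example i
    a = 1 / (1 + np.exp(-z))            # sigmoid activation
    dz = a - Y[0, i]
    J += -(Y[0, i] * np.log(a) + (1 - Y[0, i]) * np.log(1 - a))
    dw += X[:, i:i+1] * dz              # accumulate gradient w.r.t. w
    db += dz
J /= m
dw /= m
db /= m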

Vectorization

For n_x-dimensional vectors w and x, we can compute their inner product either with a non-vectorized implementation or with vectorization.

For example, we are going to implement z = w^T x + b.

# in python, using numpy
# implement z = w^T x+b

import numpy as np

#1 Non-vectorized implementation
z = 0
for i in range(n_x):
    z += w[i] * x[i]
z += b

#2 Vectorized implementation
z = np.dot(w, x) + b  # uses fast, parallelized low-level routines, much faster

Whenever possible, avoid explicit for-loops.

Vectorization in the implementation of the forward propagation

Vectorization in the implementation of the backward propagation

Implementation with Python

import numpy as np

# Given training data (X, Y), parameters (w, b), and learning rate alpha,
# a single iteration of vectorized gradient descent could be:

Z = np.dot(w.T, X) + b    # b is a scalar; broadcasting expands it to a (1, m) row during the addition
A = 1 / (1 + np.exp(-Z))  # sigmoid activation
dZ = A - Y
dw = np.dot(X, dZ.T) / m
db = np.sum(dZ) / m

w = w - alpha * dw  # update vector w shaped (n_x, 1)
b = b - alpha * db  # update scalar b

2.2 Tips About Coding Neural Network In Python And Numpy


  • In Python and NumPy, np.sum(A, axis=0) sums vertically (down each column) and np.sum(A, axis=1) sums horizontally (along each row)

  • Use the reshape command liberally to make sure a matrix has the shape you intend, such as a column vector or a row vector

  • In Python and NumPy there is a feature called broadcasting: when you apply +, -, *, or / to an (m, n) matrix A and a matrix B whose shape is (1, n) or (m, 1), NumPy automatically expands B to an (m, n) matrix by copying the vector m or n times before performing the operation
  • In NumPy, rank-1 arrays, whose shape looks like (n,), are not recommended, because this kind of data structure can cause subtle bugs if you are not familiar with all the features of NumPy (see the sketch below)
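A small self-contained sketch illustrating these three points (axis sums, broadcasting, and avoiding rank-1 arrays):

import numpy as np

A = np.array([[1., 2., 3.],
              [4., 5., 6.]])        # shape (2, 3)

col_sums = np.sum(A, axis=0)        # sum vertically   -> shape (3,)
row_sums = np.sum(A, axis=1)        # sum horizontally -> shape (2,)

B = np.array([[10., 20., 30.]])     # shape (1, 3)
C = A + B                           # broadcasting expands B to shape (2, 3)

v = np.random.randn(5)              # rank-1 array, shape (5,) -- better avoided
v = v.reshape(5, 1)                 # explicit column vector, shape (5, 1)
assert v.shape == (5, 1)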

3 One Hidden Layer Neural Network

Input Layer -> Hidden Layer -> Output Layer

"Hidden layer" means that the true values of the nodes in this layer are not observed in the training set.

We use the notation a_i[n] for the value that the i-th unit of layer n passes to the next layer (its output). The input layer is a[0], i.e. the input layer is layer 0. The letter a stands for "activation".

Neural Network Representation

Compute the output of the neural network for a single training example:

Forward Propagation For A Single Training Example

Compute the output of the neural network for m training examples (vectorization):

Forward Propagation For m Training Examples

In the large matrices in the figure above, the horizontal direction indexes the different training examples and the vertical direction indexes the different neural network units.
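For reference, the standard vectorized forward-propagation formulas for one hidden layer (as used in the course) are:

$$Z^{[1]} = W^{[1]} X + b^{[1]}, \qquad A^{[1]} = g^{[1]}(Z^{[1]})$$
$$Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]}, \qquad A^{[2]} = \sigma(Z^{[2]}) = \hat{Y}$$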

3.1 Different Activation Functions


For this part, I strongly suggest watching the video directly rather than reading my summary, because the lecture is so rich in content that I can only summarize the conclusions here. If you watch the whole video, you will learn much more.

Activation Functions

Summary

  • Four types of activation functions, whose graphs are shown in the figures above:

    • Sigmoid Function
    • Tanh Function
    • ReLU (rectified linear unit) Function
    • Leaky ReLU Function
  • In practice, the tanh function works better than the sigmoid function in almost all cases

  • The disadvantage of both sigmoid and tanh is that when z is very large or very small, the slope becomes small, which slows down learning

  • For binary classification, the sigmoid function should be used only as the activation function of the output layer

  • You can simply define the derivative of the ReLU function at z = 0 to be either 1 or 0

  • In most cases, ReLU function is the default activation function for hidden layers

Why should we choose non-linear activation functions?

  • If we choose a linear activation function, such as a = z, then the whole network is just equivalent to a linear function;

  • If we choose linear activation functions for the hidden layers and the sigmoid function for the output layer, then the whole network is no more expressive than logistic regression. In some sense, such a network is equivalent to a logistic regression model.

  • There is one exception: when the output can take any value from minus infinity to plus infinity, you might use a linear activation function for the output layer. Even then, the hidden layers must use non-linear activation functions.

  • So the point is that a linear hidden layer is useless.

Derivatives for activation functions
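For reference, the derivatives of these activation functions are:

$$\sigma'(z) = \sigma(z)\,(1 - \sigma(z)), \qquad \tanh'(z) = 1 - \tanh^2(z)$$
$$\mathrm{ReLU}'(z) = \begin{cases} 0, & z < 0 \\ 1, & z > 0 \end{cases} \qquad \mathrm{LeakyReLU}'(z) = \begin{cases} 0.01, & z < 0 \\ 1, & z > 0 \end{cases}$$

(The value at z = 0 can be defined as either branch, as noted above; 0.01 is a typical leak coefficient.)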

3.2 Formulas for one hidden layer neural network


Cost Function and Gradient Descent
Forward Propagation
Backward Propagation
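As a sketch, for one hidden layer with activation g^{[1]} and a sigmoid output trained with the cross-entropy cost, the vectorized backward-propagation formulas from the course are (* denotes element-wise multiplication):

$$dZ^{[2]} = A^{[2]} - Y, \qquad dW^{[2]} = \frac{1}{m}\, dZ^{[2]} A^{[1]T}, \qquad db^{[2]} = \frac{1}{m} \sum_{\text{examples}} dZ^{[2]}$$
$$dZ^{[1]} = W^{[2]T} dZ^{[2]} * g^{[1]\prime}(Z^{[1]}), \qquad dW^{[1]} = \frac{1}{m}\, dZ^{[1]} X^{T}, \qquad db^{[1]} = \frac{1}{m} \sum_{\text{examples}} dZ^{[1]}$$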

3.3 Random Parameter Initialization

  • Symmetry breaking problem: if you initialize all the weights of your neural network to 0, then every node in a layer computes exactly the same function, no matter how many iterations you run. (This can be proved by induction.)

  • The solution could be to initialize the parameters randomly.

# Draw the initial weights from a Gaussian distribution; multiply by 0.01 so the initial
# values are small and avoid the small-slope regions of the activation function
W[i] = np.random.randn(n[i], n[i-1]) * 0.01

# Once W is initialized randomly, b does not need to be initialized randomly
b[i] = np.zeros((n[i], 1))

4 Deep Neural Networks

Why do we need more layers rather than more units?

  • Here is a simple explanation. Informally speaking, there are functions that a "small" L-layer deep neural network can compute which shallower networks require exponentially more hidden units to compute (an argument from circuit theory).

  • But as Professor Ng mentioned, when we are working on an actual problem, we'd better start by using logistic regression, and then try to add one or two layers to the network. Just try to find the right depth.

Notation

Notation / example    Description
L                     The number of layers in the neural network
  L = 4               This neural network has 4 layers
n[l]                  The number of units in layer l
  n[1] = 5            The first layer has 5 units
a[l]                  The vector of activations (outputs) of layer l
  a[l] = g[l](z[l])   g[l] is the activation function of layer l

4.1 Forward Propagation


  • The general formulas are written out below
  • It is fine to use an explicit for-loop over the layers (from layer 1 to layer L) in each iteration
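Concretely, for l = 1, ..., L with A^{[0]} = X:

$$Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}, \qquad A^{[l]} = g^{[l]}(Z^{[l]}), \qquad \hat{Y} = A^{[L]}$$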

Tips: checking matrix shapes

  • z[l] and a[l] must have shape (n[l], 1)
  • Z[l], A[l], dZ[l] and dA[l] must have shape (n[l], m)

  • W[l] and dW[l] must have shape (n[l], n[l-1])

  • b[l] and db[l] have shape (n[l], 1); during the computation NumPy broadcasting automatically expands them to (n[l], m)
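A minimal shape check along these lines could look as follows; the dictionary names parameters and cache and the helper check_shapes are hypothetical, introduced here only for illustration.

def check_shapes(parameters, cache, layer_dims, m):
    # layer_dims = [n_x, n[1], ..., n[L]]; m is the number of training examples
    L = len(layer_dims) - 1
    for l in range(1, L + 1):
        assert parameters["W" + str(l)].shape == (layer_dims[l], layer_dims[l - 1])
        assert parameters["b" + str(l)].shape == (layer_dims[l], 1)
        assert cache["Z" + str(l)].shape == (layer_dims[l], m)
        assert cache["A" + str(l)].shape == (layer_dims[l], m)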

4.2 Backward Propagation


Even though I have worked on machine learning for a long time, it still sometimes surprises me a bit when my learning algorithms work, because a lot of the complexity of a learning algorithm comes from the data rather than from the code you write.

4.3 Hyper-parameters


  • Parameters: W[i], b[i], i = 1, 2, ..., L

  • Hyper-parameters:

Hyper-parameter   Description
alpha             The learning rate
Iterations        The number of iterations of training
L                 The number of hidden layers
n[i]              The number of hidden units in layer i
g[i]              The choice of activation function for layer i
...               ...
  • It's the hyper-parameters that determine the final values of the parameters

  • Applied deep learning is a very empirical process

    • We have to experiment to find the best hyper-parameters
    • We can plot the cost function J against the number of iterations for different settings of the hyper-parameters (see the sketch below)
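A minimal plotting sketch, using matplotlib and purely placeholder cost curves (the numbers below are invented for illustration, not real training results):

import numpy as np
import matplotlib.pyplot as plt

iterations = np.arange(1000)
costs_by_alpha = {                       # placeholder curves, one per learning rate
    0.01:  np.exp(-0.005 * iterations),
    0.001: np.exp(-0.001 * iterations),
}

for alpha, costs in costs_by_alpha.items():
    plt.plot(iterations, costs, label="alpha = " + str(alpha))

plt.xlabel("iterations")
plt.ylabel("cost J")
plt.legend()
plt.show()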