Deep Learning | 1 Neural Networks And Deep Learning

0 Quick Reference (This section is not for beginners)

This part is for when you have already learned deep learning but have forgotten the formulas. If you are a beginner, just skip this section.

forward propagation
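In vectorized form (using the notation defined later in this note), the forward step for layer l = 1, ..., L, with A^{[0]} = X, is:

$$Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}, \qquad A^{[l]} = g^{[l]}(Z^{[l]})$$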

backward propagation
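The corresponding backward step, starting from dZ^{[L]} = A^{[L]} - Y for a sigmoid output with cross-entropy cost (* denotes element-wise multiplication), is:

$$dW^{[l]} = \frac{1}{m}\, dZ^{[l]} A^{[l-1]T}, \qquad db^{[l]} = \frac{1}{m} \sum_{\text{examples}} dZ^{[l]}, \qquad dZ^{[l-1]} = W^{[l]T} dZ^{[l]} * g^{[l-1]\prime}(Z^{[l-1]})$$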

shape check
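And the expected shapes:

$$W^{[l]}, dW^{[l]} : (n^{[l]}, n^{[l-1]}), \qquad b^{[l]}, db^{[l]} : (n^{[l]}, 1), \qquad Z^{[l]}, A^{[l]}, dZ^{[l]} : (n^{[l]}, m)$$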


Many thanks to Professor Andrew Ng and his deep learning materials. This note is based on Andrew Ng's videos.

If you are new to deep learning, I strongly recommend watching these videos; they are easy to understand and completely free.

The rest of this note is a brief summary of the material; take a look if you like.

1 Introduction

Architecture       Application
Standard NN        General purpose
Convolutional NN   Images
Recurrent NN       One-dimensional sequence data
  • Structured data: data stored in databases. This kind of data has a well-defined meaning, such as a price or an age;
  • Unstructured data: for example, raw audio, text, and images. This kind of data is harder for a computer to understand.

2 Basics of Neural Network Programming

Notation

Notation   Description
(x, y)     A single training example, where x is an n_x-dimensional feature vector and y is either 1 or 0.
m          The number of training examples, i.e. the training set is {(x(1), y(1)), (x(2), y(2)), ..., (x(m), y(m))}. More precisely, it is also denoted m_train.
m_test     The number of test examples.
X          The matrix of training examples, i.e. X = [x(1), x(2), ..., x(m)].
Y          The matrix of labels, i.e. Y = [y(1), y(2), ..., y(m)].

In python,

X.shape = (n_x, m)
Y.shape = (1, m)

2.1 Logistic Regression


  • Logistic regression is an algorithm for binary classification

When z is large, sigmoid(z) is close to 1, and when z is small, sigmoid(z) is close to 0.
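For reference, the model and the sigmoid function are:

$$\hat{y} = a = \sigma(w^T x + b), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}$$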

If the loss function were defined as the squared error, gradient descent could get stuck in a local optimum, so it is not a good choice here.

So to train the logistic regression model, we try to find parameters w and b that minimize the overall cost function J(w, b).
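Concretely, the definitions used in the course are the cross-entropy loss for a single example and the cost averaged over the training set:

$$\mathcal{L}(\hat{y}, y) = -\big(y \log \hat{y} + (1 - y)\log(1 - \hat{y})\big), \qquad J(w, b) = \frac{1}{m}\sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)})$$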

Cost Function Is A Convex Function

The cost function is a convex function, so it has a single lowest point. Therefore we use the gradient descent algorithm to learn the optimal parameters.

Gradient Descent Algorithm

Repeat updating the parameters w and b using the following expressions (":=" means "update the left-hand side with the right-hand side").

Gradient Descent Algorithm
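The update rule for each step is:

$$w := w - \alpha\, \frac{\partial J(w, b)}{\partial w}, \qquad b := b - \alpha\, \frac{\partial J(w, b)}{\partial b}$$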

(You can get an intuition for this process by considering the optimization of just the parameter w in J(w).)

alpha is called the learning rate.

Gradient Descent Algorithm Implementation

(1) For one training example, the computational graph could be

So if we apply the gradient descent algorithm to this logistic regression example, we can compute the derivatives in the following steps.
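For a single example, with z = w^T x + b and a = sigmoid(z), the derivatives worked out in the course are:

$$da = -\frac{y}{a} + \frac{1 - y}{1 - a}, \qquad dz = a - y, \qquad dw_i = x_i \, dz, \qquad db = dz$$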

(2) We can then apply the single-example gradient computation to all m training examples.

Combining the previous results, this is easy to implement with for-loops (see the sketch below). But in practice we do not want nested for-loops, because they make the program run slowly. Instead, we can use a technique called vectorization to eliminate unnecessary for-loops.
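A minimal non-vectorized sketch of one such pass over the m examples, assuming w of shape (n_x, 1), a scalar b, and X, Y as defined in the notation section:

import numpy as np

# Non-vectorized pass over the m training examples (for illustration only)
J = 0.0
dw = np.zeros((n_x, 1))
db = 0.0
for i in range(m):
    z = np.dot(w[:, 0], X[:, i]) + b    # scalar z for example i
    a = 1 / (1 + np.exp(-z))            # sigmoid activation
    dz = a - Y[0, i]
    J += -(Y[0, i] * np.log(a) + (1 - Y[0, i]) * np.log(1 - a))
    dw += X[:, i:i+1] * dz              # accumulate gradient w.r.t. w
    db += dz
J /= m
dw /= m
db /= m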

Vectorization

For n_x-dimensional vectors w and x, we can compute their inner product either with a non-vectorized implementation or with vectorization.

For example, we are going to implement z = w^T x + b.

# in python, using numpy
# implement z = w^T x+b

import numpy as np

#1 Non-vectorized implementation
z = 0
for i in range(n_x):
    z += w[i] * x[i]
z += b

#2 Vectorized implementation
z = np.dot(w, x) + b  # uses fast, parallelized low-level routines, much faster

Whenever possible, avoid explicit for-loops.

Vectorization in the implementation of the forward propagation

Vectorization in the implementation of the backward propagation

Implementation with Python

import numpy as np

# Given training data (X, Y), parameters (w, b), and learning rate alpha,
# a single iteration of vectorized gradient descent could be:

Z = np.dot(w.T, X) + b    # b is a scalar; broadcasting expands it to a (1, m) row during the addition
A = 1 / (1 + np.exp(-Z))  # sigmoid activation
dZ = A - Y
dw = np.dot(X, dZ.T) / m
db = np.sum(dZ) / m

w = w - alpha * dw  # update vector w shaped (n_x, 1)
b = b - alpha * db  # update scalar b

2.2 Tips About Coding Neural Network In Python And Numpy


  • In Python and NumPy, np.sum(A, axis=0) sums vertically (down each column) and np.sum(A, axis=1) sums horizontally (along each row)

  • Use the reshape command liberally to make sure a matrix has the shape you intend, such as a column vector or a row vector

  • In Python and NumPy there is a feature called broadcasting: when you apply +, -, *, or / to an (m, n) matrix A and a matrix B whose shape is (1, n) or (m, 1), NumPy automatically expands B to an (m, n) matrix by copying the vector m or n times before performing the operation
  • In NumPy, rank-1 arrays, whose shape looks like (n,), are not recommended, because this kind of data structure can cause subtle bugs if you are not familiar with all the features of NumPy (see the sketch below)
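A small self-contained sketch illustrating these three points (axis sums, broadcasting, and avoiding rank-1 arrays):

import numpy as np

A = np.array([[1., 2., 3.],
              [4., 5., 6.]])        # shape (2, 3)

col_sums = np.sum(A, axis=0)        # sum vertically   -> shape (3,)
row_sums = np.sum(A, axis=1)        # sum horizontally -> shape (2,)

B = np.array([[10., 20., 30.]])     # shape (1, 3)
C = A + B                           # broadcasting expands B to shape (2, 3)

v = np.random.randn(5)              # rank-1 array, shape (5,) -- better avoided
v = v.reshape(5, 1)                 # explicit column vector, shape (5, 1)
assert v.shape == (5, 1)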

3 One Hidden Layer Neural Network

Input Layer -> Hidden Layer -> Output Layer

"Hidden layer" means that the true values of the nodes in this layer are not observed in the training set.

We use the notation a_i[n] for the value that the i-th unit of layer n passes to the next layer (its output). The input layer is a[0], i.e. the input layer is layer 0. The letter a stands for "activation".

Neural Network Representation

Compute the output of the neural network for a single training example:

Forward Propagation For A Single Training Example

Compute the output of the neural network for m training examples (vectorization):

Forward Propagation For m Training Examples

In the large matrices in the figure above, the horizontal direction indexes the different training examples and the vertical direction indexes the different neural network units.
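For reference, the standard vectorized forward-propagation formulas for one hidden layer (as used in the course) are:

$$Z^{[1]} = W^{[1]} X + b^{[1]}, \qquad A^{[1]} = g^{[1]}(Z^{[1]})$$
$$Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]}, \qquad A^{[2]} = \sigma(Z^{[2]}) = \hat{Y}$$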

3.1 Different Activation Functions


For this part, I strongly suggest watching the video directly rather than reading my summary, because the lecture is so rich in content that I can only summarize the conclusions here. If you watch the whole video, you will learn much more.

Activation Functions

Summary

  • Four types of activation functions, whose graphs are shown in the figures above:

    • Sigmoid Function
    • Tanh Function
    • ReLU (rectified linear unit) Function
    • Leaky ReLU Function
  • In practice, the tanh function works better than the sigmoid function in almost all cases

  • The disadvantage of both sigmoid and tanh is that when z is very large or very small, the slope becomes small, which slows down learning

  • For binary classification, the sigmoid function should be used only as the activation function of the output layer

  • You can simply define the derivative of the ReLU function at z = 0 to be either 1 or 0

  • In most cases, ReLU function is the default activation function for hidden layers

Why should we choose non-linear activation functions?

  • If we choose a linear activation function, such as a = z, then the whole network is just equivalent to a linear function;

  • If we choose linear activation functions for the hidden layers and the sigmoid function for the output layer, then the whole network is no more expressive than logistic regression. In some sense, such a network is equivalent to a logistic regression model.

  • There is one exception: when the output can take any value from minus infinity to plus infinity, you might use a linear activation function for the output layer. Even then, the hidden layers must use non-linear activation functions.

  • So the point is that a linear hidden layer is useless.

Derivatives for activation functions
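For reference, the derivatives of these activation functions are:

$$\sigma'(z) = \sigma(z)\,(1 - \sigma(z)), \qquad \tanh'(z) = 1 - \tanh^2(z)$$
$$\mathrm{ReLU}'(z) = \begin{cases} 0, & z < 0 \\ 1, & z > 0 \end{cases} \qquad \mathrm{LeakyReLU}'(z) = \begin{cases} 0.01, & z < 0 \\ 1, & z > 0 \end{cases}$$

(The value at z = 0 can be defined as either branch, as noted above; 0.01 is a typical leak coefficient.)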

3.2 Formulas for one hidden layer neural network


Cost Function and Gradient Descent
Forward Propagation
Backward Propagation
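As a sketch, for one hidden layer with activation g^{[1]} and a sigmoid output trained with the cross-entropy cost, the vectorized backward-propagation formulas from the course are (* denotes element-wise multiplication):

$$dZ^{[2]} = A^{[2]} - Y, \qquad dW^{[2]} = \frac{1}{m}\, dZ^{[2]} A^{[1]T}, \qquad db^{[2]} = \frac{1}{m} \sum_{\text{examples}} dZ^{[2]}$$
$$dZ^{[1]} = W^{[2]T} dZ^{[2]} * g^{[1]\prime}(Z^{[1]}), \qquad dW^{[1]} = \frac{1}{m}\, dZ^{[1]} X^{T}, \qquad db^{[1]} = \frac{1}{m} \sum_{\text{examples}} dZ^{[1]}$$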

3.3 Random Parameter Initialization

  • Symmetry breaking problem: if you initialize all the weights of your neural network to 0, then every node in a layer computes exactly the same function, no matter how many iterations you run. (This can be proved by induction.)

  • The solution could be to initialize the parameters randomly.

# Draw the initial weights from a Gaussian distribution; multiply by 0.01 so the initial
# values are small and avoid the small-slope regions of the activation function
W[i] = np.random.randn(n[i], n[i-1]) * 0.01

# Once W is initialized randomly, b does not need to be initialized randomly
b[i] = np.zeros((n[i], 1))

4 Deep Neural Networks

Why do we need more layers rather than more units?

  • Here is a simple explanation. Informally speaking, there are functions that a "small" L-layer deep neural network can compute which shallower networks require exponentially more hidden units to compute (an argument from circuit theory).

  • But as Professor Ng mentioned, when we are working on an actual problem, we'd better start by using logistic regression, and then try to add one or two layers to the network. Just try to find the right depth.

Notation

Notation / example    Description
L                     The number of layers in the neural network
  L = 4               This neural network has 4 layers
n[l]                  The number of units in layer l
  n[1] = 5            The first layer has 5 units
a[l]                  The vector of activations (outputs) of layer l
  a[l] = g[l](z[l])   g[l] is the activation function of layer l

4.1 Forward Propagation


  • The general formulas are written out below
  • It is fine to use an explicit for-loop over the layers (from layer 1 to layer L) in each iteration
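Concretely, for l = 1, ..., L with A^{[0]} = X:

$$Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}, \qquad A^{[l]} = g^{[l]}(Z^{[l]}), \qquad \hat{Y} = A^{[L]}$$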

Tips: checking matrix shapes

  • z[l] and a[l] must have shape (n[l], 1)
  • Z[l], A[l], dZ[l] and dA[l] must have shape (n[l], m)

  • W[l] and dW[l] must have shape (n[l], n[l-1])

  • b[l] and db[l] have shape (n[l], 1); during the computation NumPy broadcasting automatically expands them to (n[l], m)
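A minimal shape check along these lines could look as follows; the dictionary names parameters and cache and the helper check_shapes are hypothetical, introduced here only for illustration.

def check_shapes(parameters, cache, layer_dims, m):
    # layer_dims = [n_x, n[1], ..., n[L]]; m is the number of training examples
    L = len(layer_dims) - 1
    for l in range(1, L + 1):
        assert parameters["W" + str(l)].shape == (layer_dims[l], layer_dims[l - 1])
        assert parameters["b" + str(l)].shape == (layer_dims[l], 1)
        assert cache["Z" + str(l)].shape == (layer_dims[l], m)
        assert cache["A" + str(l)].shape == (layer_dims[l], m)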

4.2 Backward Propagation


Even though I have worked on machine learning for a long time, it still sometimes surprises me a bit when my learning algorithms work, because a lot of the complexity of a learning algorithm comes from the data rather than from the code you write.

4.3 Hyper-parameters


  • Parameters: W[i], b[i], i = 1, 2, ..., L

  • Hyper-parameters:

Hyper-parameter   Description
alpha             The learning rate
Iterations        The number of iterations of training
L                 The number of hidden layers
n[i]              The number of hidden units in layer i
g[i]              The choice of activation function for layer i
...               ...
  • It's the hyper-parameters that determine the final values of the parameters

  • Applied deep learning is a very empirical process

    • We have to experiment to find the best hyper-parameters
    • We can plot the cost function J against the number of iterations for different settings of the hyper-parameters (see the sketch below)
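A minimal plotting sketch, using matplotlib and purely placeholder cost curves (the numbers below are invented for illustration, not real training results):

import numpy as np
import matplotlib.pyplot as plt

iterations = np.arange(1000)
costs_by_alpha = {                       # placeholder curves, one per learning rate
    0.01:  np.exp(-0.005 * iterations),
    0.001: np.exp(-0.001 * iterations),
}

for alpha, costs in costs_by_alpha.items():
    plt.plot(iterations, costs, label="alpha = " + str(alpha))

plt.xlabel("iterations")
plt.ylabel("cost J")
plt.legend()
plt.show()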