Implementing Multiple Layer Neural Network from Scratch

All code can be find here.

Implementing Multiple Layer Neural Network from Scratch

This post is inspired by http://www.wildml.com/2015/09/implementing-a-neural-network-from-scratch.

In this post, we will implement a multiple layer neural network from scratch. You can regard the number of layers and dimension of each layer as parameter. For example, [2, 3, 2] represents inputs with 2 dimension, one hidden layer with 3 dimension and output with 2 dimension (binary classification) (using softmax as output).

We won’t derive all the math that’s required, but I will try to give an intuitive explanation of what we are doing. I will also point to resources for you read up on the details.

Generating a dataset

Let’s start by generating a dataset we can play with. Fortunately, scikit-learn has some useful dataset generators, so we don’t need to write the code ourselves. We will go with the make_moons function.

# Generate a dataset and plot it
np.random.seed(0)
X, y = sklearn.datasets.make_moons(200, noise=0.20)
plt.scatter(X[:,0], X[:,1], s=40, c=y, cmap=plt.cm.Spectral)

The dataset we generated has two classes, plotted as red and blue points. Our goal is to train a Machine Learning classifier that predicts the correct class given the x- and y- coordinates. Note that the data is not linearly separable, we can’t draw a straight line that separates the two classes. This means that linear classifiers, such as Logistic Regression, won’t be able to fit the data unless you hand-engineer non-linear features (such as polynomials) that work well for the given dataset.

In fact, that’s one of the major advantages of Neural Networks. You don’t need to worry about feature engineering. The hidden layer of a neural network will learn features for you.

Neural Network

Neural Network Architecture

You can read this tutorial (http://cs231n.github.io/neural-networks-1/) to learn the basic concepts of neural network. Like activation functions, feed-forward computation and so on.

Because we want our network to output probabilities the activation function for the output layer will be the softmax, which is simply a way to convert raw scores to probabilities. If you’re familiar with the logistic function you can think of softmax as its generalization to multiple classes.

When you choose softmax as output, you can use cross-entropy loss (also known as negative log likelihood) as loss function. More about Loss Function can be find in http://cs231n.github.io/neural-networks-2/#losses.

Learning the Parameters

Learning the parameters for our network means finding parameters (such as (W_1, b_1, W_2, b_2)) that minimize the error on our training data (loss function).

We can use gradient descent to find the minimum and I will implement the most vanilla version of gradient descent, also called batch gradient descent with a fixed learning rate. Variations such as SGD (stochastic gradient descent) or minibatch gradient descent typically perform better in practice. So if you are serious you’ll want to use one of these, and ideally you would also decay the learning rate over time.

The key of gradient descent method is how to calculate the gradient of loss function by the parameters. One approach is called Back Propagation. You can learn it more from http://colah.github.io/posts/2015-08-Backprop/ and http://cs231n.github.io/optimization-2/.

Implementation

We start by given the computation graph of neural network.


In the computation graph, you can see that it contains three components (gate, layer and output), there is two kinds of gate (multiply and add), and you can use tanh layer and softmax output.

gate, layer and output can all be seen as operation unit of computation graph, so they will implement the inner derivatives of their inputs (we call it backward), and use chain rule according to the computation graph. You can see the following figure for nice explanation.

gate.py

import numpy as np

class MultiplyGate:
    def forward(self,W, X):
        return np.dot(X, W)

    def backward(self, W, X, dZ):
        dW = np.dot(np.transpose(X), dZ)
        dX = np.dot(dZ, np.transpose(W))
        return dW, dX

class AddGate:
    def forward(self, X, b):
        return X + b

    def backward(self, X, b, dZ):
        dX = dZ * np.ones_like(X)
        db = np.dot(np.ones((1, dZ.shape[0]), dtype=np.float64), dZ)
        return db, dX

layer.py

import numpy as np

class Sigmoid:
    def forward(self, X):
        return 1.0 / (1.0 + np.exp(-X))

    def backward(self, X, top_diff):
        output = self.forward(X)
        return (1.0 - output) * output * top_diff

class Tanh:
    def forward(self, X):
        return np.tanh(X)

    def backward(self, X, top_diff):
        output = self.forward(X)
        return (1.0 - np.square(output)) * top_diff

output.py

import numpy as np

class Softmax:
    def predict(self, X):
        exp_scores = np.exp(X)
        return exp_scores / np.sum(exp_scores, axis=1, keepdims=True)

    def loss(self, X, y):
        num_examples = X.shape[0]
        probs = self.predict(X)
        corect_logprobs = -np.log(probs[range(num_examples), y])
        data_loss = np.sum(corect_logprobs)
        return 1./num_examples * data_loss

    def diff(self, X, y):
        num_examples = X.shape[0]
        probs = self.predict(X)
        probs[range(num_examples), y] -= 1
        return probs

We can implement out neural network by a class Model and initialize the parameters in the __init__ function. You can pass the parameter layers_dim = [2, 3, 2], which represents inputs with 2 dimension, one hidden layer with 3 dimension and output with 2 dimension

class Model:
    def __init__(self, layers_dim):
        self.b = []
        self.W = []
        for i in range(len(layers_dim)-1):
            self.W.append(np.random.randn(layers_dim[i], layers_dim[i+1]) / np.sqrt(layers_dim[i]))
            self.b.append(np.random.randn(layers_dim[i+1]).reshape(1, layers_dim[i+1]))

First let’s implement the loss function we defined above. It is just a forward propagation computation of out neural network. We use this to evaluate how well our model is doing:

def calculate_loss(self, X, y):
    mulGate = MultiplyGate()
    addGate = AddGate()
    layer = Tanh()
    softmaxOutput = Softmax()

    input = X
    for i in range(len(self.W)):
        mul = mulGate.forward(self.W[i], input)
        add = addGate.forward(mul, self.b[i])
        input = layer.forward(add)

    return softmaxOutput.loss(input, y)

We also implement a helper function to calculate the output of the network. It does forward propagation as defined above and returns the class with the highest probability.

def predict(self, X):
    mulGate = MultiplyGate()
    addGate = AddGate()
    layer = Tanh()
    softmaxOutput = Softmax()

    input = X
    for i in range(len(self.W)):
        mul = mulGate.forward(self.W[i], input)
        add = addGate.forward(mul, self.b[i])
        input = layer.forward(add)

    probs = softmaxOutput.predict(input)
    return np.argmax(probs, axis=1)

Finally, here comes the function to train our Neural Network. It implements batch gradient descent using the backpropagation algorithms we have learned above.

def train(self, X, y, num_passes=20000, epsilon=0.01, reg_lambda=0.01, print_loss=False):
    mulGate = MultiplyGate()
    addGate = AddGate()
    layer = Tanh()
    softmaxOutput = Softmax()

    for epoch in range(num_passes):
        # Forward propagation
        input = X
        forward = [(None, None, input)]
        for i in range(len(self.W)):
            mul = mulGate.forward(self.W[i], input)
            add = addGate.forward(mul, self.b[i])
            input = layer.forward(add)
            forward.append((mul, add, input))

        # Back propagation
        dtanh = softmaxOutput.diff(forward[len(forward)-1][2], y)
        for i in range(len(forward)-1, 0, -1):
            dadd = layer.backward(forward[i][1], dtanh)
            db, dmul = addGate.backward(forward[i][0], self.b[i-1], dadd)
            dW, dtanh = mulGate.backward(self.W[i-1], forward[i-1][2], dmul)
            # Add regularization terms (b1 and b2 don't have regularization terms)
            dW += reg_lambda * self.W[i-1]
            # Gradient descent parameter update
            self.b[i-1] += -epsilon * db
            self.W[i-1] += -epsilon * dW

        if print_loss and epoch % 1000 == 0:
            print("Loss after iteration %i: %f" %(epoch, self.calculate_loss(X, y)))

A network with a hidden layer of size 3

Let’s see what happens if we train a network with a hidden layer size of 3.

import matplotlib.pyplot as plt
import numpy as np
import sklearn
import sklearn.datasets
import sklearn.linear_model
import mlnn
from utils import plot_decision_boundary

# Generate a dataset and plot it
np.random.seed(0)
X, y = sklearn.datasets.make_moons(200, noise=0.20)
plt.scatter(X[:,0], X[:,1], s=40, c=y, cmap=plt.cm.Spectral)
plt.show()

layers_dim = [2, 3, 2]

model = mlnn.Model(layers_dim)
model.train(X, y, num_passes=20000, epsilon=0.01, reg_lambda=0.01, print_loss=True)

# Plot the decision boundary
plot_decision_boundary(lambda x: model.predict(x), X, y)
plt.title("Decision Boundary for hidden layer size 3")
plt.show()

This looks pretty good. Our neural networks was able to find a decision boundary that successfully separates the classes.

The plot_decision_boundary function is referenced by http://www.wildml.com/2015/09/implementing-a-neural-network-from-scratch.

import matplotlib.pyplot as plt
import numpy as np

# Helper function to plot a decision boundary.
def plot_decision_boundary(pred_func, X, y):
    # Set min and max values and give it some padding
    x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
    y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
    h = 0.01
    # Generate a grid of points with distance h between them
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    # Predict the function value for the whole gid
    Z = pred_func(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    # Plot the contour and training examples
    plt.contourf(xx, yy, Z, cmap=plt.cm.Spectral)
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Spectral)

Further more

  1. Instead of batch gradient descent, use minibatch gradient to train the network. Minibatch gradient descent typically performs better in practice (more info).
  2. We used a fixed learning rate epsilon for gradient descent. Implement an annealing schedule for the gradient descent learning rate (more info).
  3. We used a tanh activation function for our hidden layer. Experiment with other activation functions (more info).
  4. Extend the network from two to three classes. You will need to generate an appropriate dataset for this.
  5. Try some other Parameter updates method, like Momentum update, Nesterov momentum, Adagrad, RMSprop and Adam(more info).
  6. Some other tricks of training neural network can be find http://cs231n.github.io/neural-networks-2 and http://cs231n.github.io/neural-networks-3, like dropout reglarization, batch normazation, Gradient checks and Model Ensembles.

Some useful resources

  1. http://cs231n.github.io/
  2. http://www.wildml.com/2015/09/implementing-a-neural-network-from-scratch
  3. http://colah.github.io/posts/2015-08-Backprop/
最后编辑于
©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 204,445评论 6 478
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 85,889评论 2 381
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 151,047评论 0 337
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 54,760评论 1 276
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 63,745评论 5 367
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 48,638评论 1 281
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 38,011评论 3 398
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 36,669评论 0 258
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 40,923评论 1 299
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 35,655评论 2 321
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 37,740评论 1 330
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 33,406评论 4 320
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 38,995评论 3 307
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 29,961评论 0 19
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 31,197评论 1 260
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 45,023评论 2 350
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 42,483评论 2 342

推荐阅读更多精彩内容