Implementing Recurrent Neural Network from Scratch

All code can be find here.

Implementing Recurrent Neural Network from Scratch

I’m assuming that you are somewhat familiar with basic Neural Networks. If you’re not, you may want to head over to Implementing A Neural Network From Scratch, which guides you through the ideas and implementation behind non-recurrent networks.

Introduction

This post is inspired by recurrent-neural-networks-tutorial from WildML. And you can deeply read it to know the basic knowledge about RNN, which I will not include in this tutorial.

In this tutorial, we will focus on how to train RNN by Backpropagation Through Time (BPTT), based on the computation graph of RNN and do automatic differentiation. You can find that it is more simple and reliable to calculate the gradient in this way than you do it by hand.

This post will take RNN language model (rnnlm) as example. More about the fancy applications of RNN can be found here.

How to train RNN

The architecture of RNN can be as the following figure.

You can find that the parameters (W, U, V) are shared in different time steps. And the output in each time step can be softmax. So you can use cross entropy loss as an error function and use some optimizing method (e.g. gradient descent) to calculate the optimized parameters (W, U, V).

Let recap the equations of our RNN:

We also defined our loss, or error, to be the cross entropy loss, given by:

Here y_t is the correct word at time step t, and y^_t is our prediction. We typically treat the full sequence (sentence) as one training example, so the total error is just the sum of the errors at each time step (word).

Remember that our goal is to calculate the gradients of the error with respect to our parameters U, V and W and then learn good parameters using optimizing method (in this post we use Stochastic Gradient Descent). Just like we sum up the errors, we also sum up the gradients at each time step for one training example:

. That is we should calculate dEt/dW, dEt/dU and dEt/dV, then sum up all time steps.

It is simple to calculate dEt/dV, because it only depends on the values at the current time step. But the story is different for dEt/dW and dEt/dV. Note that s_3 = tanh(Ux_3 + Ws_2) depend on s_2, which depends on W, U and s_1, and so on. So if we take the derivative with respect to W we can't treat s_2 as a constant! We need to apply the chain rule again. You can have a view from the following figure.

Now use computation graph to represent E1 as an example and calculate dE1/dW, dE1/dU is the same idea.

Note that this is exactly the same as the standard backpropagation algorithm that we use in deep Feedforward Neural Networks. The key difference is that we sum up the gradients for W at each time step. In a traditional NN we don’t share parameters across layers, so we don’t need to sum anything. But in my opinion BPTT is just a fancy name for standard backpropagation on an unrolled RNN.

To simplify the computation graph to make it efficient, we can integrate some small operation units to a big operation unit. You can have a look the following figure. Note that the operation unit should also implement the forward function and backward function.

The implementation of all operation unit and softmax output can be found as follows:

mulGate = MultiplyGate()
addGate = AddGate()
activation = Tanh()

class RNNLayer:
    def forward(self, x, prev_s, U, W, V):
        self.mulu = mulGate.forward(U, x)
        self.mulw = mulGate.forward(W, prev_s)
        self.add = addGate.forward(self.mulw, self.mulu)
        self.s = activation.forward(self.add)
        self.mulv = mulGate.forward(V, prev_s)
        
    def backward(self, x, prev_s, U, W, V, diff_s, dmulv):
        self.forward(x, prev_s, U, W, V)
        dV, dsv = mulGate.backward(V, self.s, dmulv)
        ds = dsv + diff_s
        dadd = activation.backward(self.add, ds)
        dmulw, dmulu = addGate.backward(self.mulw, self.mulu, dadd)
        dW, dprev_s = mulGate.backward(W, prev_s, dmulw)
        dU, dx = mulGate.backward(U, x, dmulu)
        return (dprev_s, dU, dW, dV)
class MultiplyGate:
    def forward(self,W, x):
        return np.dot(W, x)
    def backward(self, W, x, dz):
        dW = np.asarray(np.dot(np.transpose(np.asmatrix(dz)), np.asmatrix(x)))
        dx = np.dot(np.transpose(W), dz)
        return dW, dx

class AddGate:
    def forward(self, x1, x2):
        return x1 + x2
    def backward(self, x1, x2, dz):
        dx1 = dz * np.ones_like(x1)
        dx2 = dz * np.ones_like(x2)
        return dx1, dx2
class Sigmoid:
    def forward(self, x):
        return 1.0 / (1.0 + np.exp(-x))
    def backward(self, x, top_diff):
        output = self.forward(x)
        return (1.0 - output) * output * top_diff

class Tanh:
    def forward(self, x):
        return np.tanh(x)
    def backward(self, x, top_diff):
        output = self.forward(x)
        return (1.0 - np.square(output)) * top_diff
class Softmax:
    def predict(self, x):
        exp_scores = np.exp(x)
        return exp_scores / np.sum(exp_scores)
    def loss(self, x, y):
        probs = self.predict(x)
        return -np.log(probs[y])
    def diff(self, x, y):
        probs = self.predict(x)
        probs[y] -= 1.0
        return probs

These implementation is just the same with Implementing A Neural Network From Scratch, except that in this post the input x or s is 1-D array, but in previous post input X is a batch of data represented as a matrix (each row is an example).

Now that we are able to calculate the gradients for our parameters we can use SGD to train the model.

Implement

Initialization

Initializing the parameters U, V and W is a bit tricky. We can’t just initialize them to 0’s because that would result in symmetric calculations in all our layers. We must initialize them randomly. Because proper initialization seems to have an impact on training results there has been lot of research in this area. It turns out that the best initialization depends on the activation function (tanh in our case) and one recommended approach is to initialize the weights randomly in the interval from

where n is the number of incoming connections from the previous layer.

class Model:
    def __init__(self, word_dim, hidden_dim=100, bptt_truncate=4):
        self.word_dim = word_dim
        self.hidden_dim = hidden_dim
        self.bptt_truncate = bptt_truncate
        self.U = np.random.uniform(-np.sqrt(1. / word_dim), np.sqrt(1. / word_dim), (hidden_dim, word_dim))
        self.W = np.random.uniform(-np.sqrt(1. / hidden_dim), np.sqrt(1. / hidden_dim), (hidden_dim, hidden_dim))
        self.V = np.random.uniform(-np.sqrt(1. / hidden_dim), np.sqrt(1. / hidden_dim), (word_dim, hidden_dim))

Above, word_dim is the size of our vocabulary, and hidden_dim is the size of our hidden layer (we can pick it). Don’t worry about the bptt_truncate parameter for now, we’ll explain what that is later.

Forward Propagation

Next, let’s implement the forward propagation (predicting word probabilities) defined by our equations above:

'''
    forward propagation (predicting word probabilities)
    x is one single data, and a batch of data
    for example x = [0, 179, 341, 416], then its y = [179, 341, 416, 1]
'''
def forward_propagation(self, x):
    # The total number of time steps
    T = len(x)
    layers = []
    prev_s = np.zeros(self.hidden_dim)
    # For each time step...
    for t in range(T):
        layer = RNNLayer()
        input = np.zeros(self.word_dim)
        input[x[t]] = 1
        layer.forward(input, prev_s, self.U, self.W, self.V)
        prev_s = layer.s
        layers.append(layer)
    return layers

We also implement a predict function to generate the results.

def predict(self, x):
    output = Softmax()
    layers = self.forward_propagation(x)
    return [np.argmax(output.predict(layer.mulv)) for layer in layers]

Calculating the Loss

To train our network we need a way to measure the errors it makes. We call this the loss function L, and our goal is find the parameters U, V and W that minimize the loss function for our training data. A common choice for the loss function is the cross-entropy loss.

def calculate_loss(self, x, y):
    assert len(x) == len(y)
    output = Softmax()
    layers = self.forward_propagation(x)
    loss = 0.0
    for i, layer in enumerate(layers):
        loss += output.loss(layer.mulv, y[i])
    return loss / float(len(y))

def calculate_total_loss(self, X, Y):
    loss = 0.0
    for i in range(len(Y)):
        loss += self.calculate_loss(X[i], Y[i])
    return loss / float(len(Y))

Backpropagation Through Time (BPTT)

Just as what we have introduced, we implement BPTT algorithm. It takes as input a training example (x, y) and returns the gradients dL/dW, dL/dU and dL/dV.

def bptt(self, x, y):
    assert len(x) == len(y)
    output = Softmax()
    layers = self.forward_propagation(x)
    dU = np.zeros(self.U.shape)
    dV = np.zeros(self.V.shape)
    dW = np.zeros(self.W.shape)

    T = len(layers)
    prev_s_t = np.zeros(self.hidden_dim)
    diff_s = np.zeros(self.hidden_dim)
    for t in range(0, T):
        dmulv = output.diff(layers[t].mulv, y[t])
        input = np.zeros(self.word_dim)
        input[x[t]] = 1
        dprev_s, dU_t, dW_t, dV_t = layers[t].backward(input, prev_s_t, self.U, self.W, self.V, diff_s, dmulv)
        prev_s_t = layers[t].s
        dmulv = np.zeros(self.word_dim)
        for i in range(t-1, max(-1, t-self.bptt_truncate-1), -1):
            input = np.zeros(self.word_dim)
            input[x[i]] = 1
            prev_s_i = np.zeros(self.hidden_dim) if i == 0 else layers[i-1].s
            dprev_s, dU_i, dW_i, dV_i = layers[i].backward(input, prev_s_i, self.U, self.W, self.V, dprev_s, dmulv)
            dU_t += dU_i
            dW_t += dW_i
        dV += dV_t
        dU += dU_t
        dW += dW_t
    return (dU, dW, dV)

SGD Implementation

Now that we are able to calculate the gradients for our parameters we can implement SGD. I like to do this in two steps:

  1. A function sdg_step that calculates the gradients and performs the updates for one batch.
  2. An outer loop that iterates through the training set and adjusts the learning rate.
def sgd_step(self, x, y, learning_rate):
    dU, dW, dV = self.bptt(x, y)
    self.U -= learning_rate * dU
    self.V -= learning_rate * dV
    self.W -= learning_rate * dW
   
def train(self, X, Y, learning_rate=0.005, nepoch=100, evaluate_loss_after=5):
    num_examples_seen = 0
    losses = []
    for epoch in range(nepoch):
        if (epoch % evaluate_loss_after == 0):
            loss = self.calculate_total_loss(X, Y)
            losses.append((num_examples_seen, loss))
            time = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
            print("%s: Loss after num_examples_seen=%d epoch=%d: %f" % (time, num_examples_seen, epoch, loss))
            # Adjust the learning rate if loss increases
            if len(losses) > 1 and losses[-1][1] > losses[-2][1]:
                learning_rate = learning_rate * 0.5
                print("Setting learning rate to %f" % learning_rate)
            sys.stdout.flush()
        # For each training example...
        for i in range(len(Y)):
            self.sgd_step(X[i], Y[i], learning_rate)
            num_examples_seen += 1
    return losses 

Here, we annealing the learning rate by 0.5 if we find the loss increases in this epoch. More about the decay of learning rate can be found here.

Evaluation

Done! Let’s try to get a sense of how long it would take to train our network:

word_dim = 8000
hidden_dim = 100
X_train, y_train = getSentenceData('data/reddit-comments-2015-08.csv', word_dim)

np.random.seed(10)
rnn = Model(word_dim, hidden_dim)
rnn.sgd_step(X_train[10], y_train[10], 0.005)

Bad new is that one step of SGD takes a few seconds on my laptop. We have about 80,000 examples in our training data, so one epoch (iteration over the whole data set) would take several hours. Multiple epochs would take days, or even weeks!

There are many ways to speed up our code. One is to implement our code on GPU with some library like Theano. But in this tutorial, let’s just try to run SGD with a small dataset and check if the loss actually decreases:

word_dim = 8000
hidden_dim = 100
X_train, y_train = getSentenceData('data/reddit-comments-2015-08.csv', word_dim)

np.random.seed(10)
rnn = Model(word_dim, hidden_dim)

losses = rnn.train(X_train[:100], y_train[:100], learning_rate=0.005, nepoch=10, evaluate_loss_after=1)

Good, it seems like our implementation is at least doing something useful and decreasing the loss, just like we wanted.

Further more

There is a problem about RNN called vanishing gradient problem. That's why traditional RNN cannot capture the long term dependency, so we use bptt_truncate parameter to constrain the length of dependency. This will motivate our move to more sophisticated RNN models, such as LSTMs, which are the current state of the art for many tasks in NLP.

More about vanishing gradient problem and LSTM can be found here and here and here.

Reference

  1. https://github.com/pangolulu/neural-network-from-scratch
  2. http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/
  3. http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-2-implementing-a-language-model-rnn-with-python-numpy-and-theano/
  4. http://www.wildml.com/2015/10/recurrent-neural-networks-tutorial-part-3-backpropagation-through-time-and-vanishing-gradients/
最后编辑于
©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 215,923评论 6 498
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 92,154评论 3 392
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 161,775评论 0 351
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 57,960评论 1 290
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 66,976评论 6 388
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 50,972评论 1 295
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 39,893评论 3 416
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 38,709评论 0 271
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 45,159评论 1 308
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 37,400评论 2 331
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 39,552评论 1 346
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 35,265评论 5 341
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 40,876评论 3 325
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 31,528评论 0 21
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 32,701评论 1 268
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 47,552评论 2 368
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 44,451评论 2 352

推荐阅读更多精彩内容

  • 成为一个人的意义是什么?怎样理解人的处境?如何在这个世界上生存?什么是人的尊严和价值? 对这些问题的回答,甚至于仅...
    張佑芡阅读 531评论 0 1
  • 以前我喜欢关注日本娱乐圈,现在觉得有点看厌了他们的套路,现在开始研究韩国女团。中国市场来讲,明显是韩国女团比较吃香...
    花花骚年阅读 796评论 0 0
  • 80后的中年危机,已经默默拉开帷幕。去参加饭局就能看出来,表面上还是推杯换盏觥筹交错高谈阔论指点江山,但是很显然已...
    刀疤阿姨dede阅读 190评论 2 1
  • 转载自:读懂财富 这一个令人震撼的调查,这是一段触目惊心的视频:央视记者历时8个月,调查6家大型医院,发现医生普遍...
    c86d4df3c7f2阅读 192评论 0 0
  • 作者:zengzhiyan 转自:网易用户体验中心 “一个产品没用,体验再好用户也不会去用。——你觉得这句话是对还...
    Sting阅读 1,568评论 0 13