
Calculus on Computational Graphs: Backpropagation

Posted on August 31, 2015

Introduction

Backpropagation is the key algorithm that makes training deep models computationally tractable. For modern neural networks, it can make training with gradient descent as much as ten million times faster, relative to a naive implementation. That's the difference between a model taking a week to train and taking 200,000 years.

Beyond its use in deep learning, backpropagation is a powerful computational tool in many other areas, ranging from weather forecasting to analyzing numerical stability – it just goes by different names. In fact, the algorithm has been reinvented at least dozens of times in different fields (see Griewank (2010)). The general, application-independent name is "reverse-mode differentiation."

Fundamentally, it's a technique for calculating derivatives quickly. And it's an essential trick to have in your bag, not only in deep learning, but in a wide variety of numerical computing situations.

Computational Graphs

Computational graphs are a nice way to think about mathematical expressions. For example, consider the expression $e = (a+b)*(b+1)$. There are three operations: two additions and one multiplication. To help us talk about this, let's introduce two intermediary variables, $c$ and $d$, so that every function's output has a variable. We now have:

$$c = a + b$$

$$d = b + 1$$

$$e = c * d$$

To create a computational graph, we make each of these operations, along with the input variables, into nodes. When one node's value is the input to another node, an arrow goes from one to another.

These sorts of graphs come up all the time in computer science, especially in talking about functional programs. They are very closely related to the notions of dependency graphs and call graphs. They're also the core abstraction behind the popular deep learning framework Theano.

We can evaluate the expression by setting the input variables to certain values and computing nodes up through the graph. For example, let's set $a = 2$ and $b = 1$:

The expression evaluates to $6$.
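
To make this concrete, here is a minimal Python sketch of the same graph (the class names `Variable`, `Add`, and `Mul` are my own, not from any particular framework); each operation and input becomes a node, and evaluation walks up the graph:

```python
# A minimal computational-graph sketch for e = (a + b) * (b + 1).
# Class names are illustrative only, not from Theano or any other library.

class Variable:
    def __init__(self, value):
        self.value = value

    def evaluate(self):
        return self.value


class Add:
    def __init__(self, x, y):
        self.x, self.y = x, y

    def evaluate(self):
        return self.x.evaluate() + self.y.evaluate()


class Mul:
    def __init__(self, x, y):
        self.x, self.y = x, y

    def evaluate(self):
        return self.x.evaluate() * self.y.evaluate()


# Build the graph: c = a + b, d = b + 1, e = c * d, with a = 2, b = 1.
a = Variable(2)
b = Variable(1)
one = Variable(1)
c = Add(a, b)
d = Add(b, one)
e = Mul(c, d)

print(e.evaluate())  # 6
```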

Derivatives on Computational Graphs

If one wants to understand derivatives in a computational graph, the key is to understand derivatives on the edges. If $a$ directly affects $c$, then we want to know how it affects $c$. If $a$ changes a little bit, how does $c$ change? We call this the partial derivative of $c$ with respect to $a$.

To evaluate the partial derivatives in this graph, we need the sum rule and the product rule:

$$\frac{\partial}{\partial a}(a+b) = \frac{\partial a}{\partial a} + \frac{\partial b}{\partial a} = 1$$

$$\frac{\partial}{\partial u}uv = u\frac{\partial v}{\partial u} + v\frac{\partial u}{\partial u} = v$$

Below, the graph has the derivative on each edge labeled.
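
For reference, applying these two rules at $a = 2$ and $b = 1$ gives the five edge derivatives:

$$\frac{\partial c}{\partial a} = 1, \qquad \frac{\partial c}{\partial b} = 1, \qquad \frac{\partial d}{\partial b} = 1, \qquad \frac{\partial e}{\partial c} = d = 2, \qquad \frac{\partial e}{\partial d} = c = 3$$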

What if we want to understand how nodes that aren't directly connected affect each other? Let's consider how $e$ is affected by $a$. If we change $a$ at a speed of $1$, $c$ also changes at a speed of $1$. In turn, $c$ changing at a speed of $1$ causes $e$ to change at a speed of $2$. So $e$ changes at a rate of $1*2$ with respect to $a$.

The general rule is to sum over all possible paths from one node to the other, multiplying the derivatives on each edge of the path together. For example, to get the derivative of $e$ with respect to $b$ we get:

$$\frac{\partial e}{\partial b} = 1*2 + 1*3$$

This accounts for how $b$ affects $e$ through $c$ and also how it affects it through $d$.

This general "sum over paths" rule is just a different way of thinking about the multivariate chain rule.
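
As an illustration (my own sketch, not code from the post), the following Python snippet enumerates the two paths from $b$ to $e$, multiplies the edge derivatives along each, sums them, and checks the result against a finite-difference estimate:

```python
# Sum-over-paths for de/db in the graph c = a + b, d = b + 1, e = c * d,
# evaluated at a = 2, b = 1.
a, b = 2.0, 1.0
c, d = a + b, b + 1.0

# Each path is the list of edge derivatives along it.
paths = [
    [1.0, d],  # b -> c -> e: (dc/db) * (de/dc), with de/dc = d
    [1.0, c],  # b -> d -> e: (dd/db) * (de/dd), with de/dd = c
]

de_db = sum(e1 * e2 for e1, e2 in paths)
print(de_db)  # 5.0, i.e. 1*2 + 1*3


# Quick numerical check of the same derivative.
def e_of(a, b):
    return (a + b) * (b + 1.0)

eps = 1e-6
print((e_of(a, b + eps) - e_of(a, b)) / eps)  # approximately 5.0
```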

Factoring Paths

The problem with just "summing over the paths" is that it's very easy to get a combinatorial explosion in the number of possible paths.

In the above diagram, there are three paths from $X$ to $Y$, and a further three paths from $Y$ to $Z$. If we want to get the derivative $\frac{\partial Z}{\partial X}$ by summing over all paths, we need to sum over $3*3 = 9$ paths:

$$\frac{\partial Z}{\partial X} = \alpha\delta + \alpha\epsilon + \alpha\zeta + \beta\delta + \beta\epsilon + \beta\zeta + \gamma\delta + \gamma\epsilon + \gamma\zeta$$

The above only has nine paths, but it would be easy to have the number of paths grow exponentially as the graph becomes more complicated.

Instead of just naively summing over the paths, it would be much better to factor them:

$$\frac{\partial Z}{\partial X} = (\alpha + \beta + \gamma)(\delta + \epsilon + \zeta)$$
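
A quick way to see that nothing is lost by factoring: with arbitrary (made-up) values for the Greek edge derivatives, the factored product and the nine-term path sum agree, as in this small sketch:

```python
# Factored form vs. explicit sum over all nine paths, for hypothetical
# edge derivatives on X -> Y and Y -> Z.
from itertools import product

alpha, beta, gamma = 0.5, 2.0, -1.0    # made-up derivatives on the X -> Y edges
delta, epsilon, zeta = 3.0, 0.25, 4.0  # made-up derivatives on the Y -> Z edges

sum_over_paths = sum(p * q for p, q in product((alpha, beta, gamma),
                                               (delta, epsilon, zeta)))
factored = (alpha + beta + gamma) * (delta + epsilon + zeta)

print(sum_over_paths, factored)  # both 10.875
```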

This is where "forward-mode differentiation" and "reverse-mode differentiation" come in. They're algorithms for efficiently computing the sum by factoring the paths. Instead of summing over all of the paths explicitly, they compute the same sum more efficiently by merging paths back together at every node. In fact, both algorithms touch each edge exactly once!

Forward-mode differentiation starts at an input to the graph and moves towards the end. At every node, it sums all the paths feeding in. Each of those paths represents one way in which the input affects that node. By adding them up, we get the total way in which the node is affected by the input, its derivative.

Though you probably didn't think of it in terms of graphs, forward-mode differentiation is very similar to what you implicitly learned to do if you took an introduction to calculus class.

Reverse-mode differentiation, on the other hand, starts at an output of the graph and moves towards the beginning. At each node, it merges all paths which originated at that node.

Forward-mode differentiation tracks how one input affects every node. Reverse-mode differentiation tracks how every node affects one output. That is, forward-mode differentiation applies the operator $\frac{\partial}{\partial X}$ to every node, while reverse-mode differentiation applies the operator $\frac{\partial Z}{\partial}$ to every node.[1]
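
Here is a hand-written sketch of both passes on the earlier graph $c = a + b$, $d = b + 1$, $e = c * d$ (the variable names are mine; a real system would derive these passes automatically from the graph):

```python
# Forward-mode vs. reverse-mode differentiation on c = a + b, d = b + 1,
# e = c * d, evaluated at a = 2, b = 1.
a, b = 2.0, 1.0
c, d = a + b, b + 1.0
e = c * d

# Forward mode: apply d(.)/db to every node, moving from inputs to output.
da_db = 0.0
db_db = 1.0
dc_db = da_db + db_db          # c = a + b
dd_db = db_db                  # d = b + 1
de_db = d * dc_db + c * dd_db  # e = c * d, product rule
print(de_db)                   # 5.0

# Reverse mode: apply de/d(.) to every node, moving from output to inputs.
de_de = 1.0
de_dc = d * de_de                  # e = c * d
de_dd = c * de_de
de_da = de_dc * 1.0                # a only feeds into c
de_db = de_dc * 1.0 + de_dd * 1.0  # b feeds into both c and d
print(de_da, de_db)                # 2.0 5.0
```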

Computational Victories

At this point, you might wonder why anyone would care about reverse-mode differentiation. It looks like a strange way of doing the same thing as the forward-mode. Is there some advantage?

Let’s consider our original example again:

We can use forward-mode differentiation from $b$ up. This gives us the derivative of every node with respect to $b$.
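
Written out node by node, that forward pass gives:

$$\frac{\partial b}{\partial b} = 1, \qquad \frac{\partial c}{\partial b} = 1, \qquad \frac{\partial d}{\partial b} = 1, \qquad \frac{\partial e}{\partial b} = 1 \cdot 2 + 1 \cdot 3 = 5$$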

We’ve computed∂e∂b

, the derivative of our output with respect to one of our inputs.

What if we do reverse-mode differentiation from $e$ down? This gives us the derivative of $e$ with respect to every node:
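
Written out node by node, that reverse pass gives:

$$\frac{\partial e}{\partial e} = 1, \qquad \frac{\partial e}{\partial d} = 3, \qquad \frac{\partial e}{\partial c} = 2, \qquad \frac{\partial e}{\partial b} = 2 + 3 = 5, \qquad \frac{\partial e}{\partial a} = 2$$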

When I say that reverse-mode differentiation gives us the derivative of $e$ with respect to every node, I really do mean every node. We get both $\frac{\partial e}{\partial a}$ and $\frac{\partial e}{\partial b}$, the derivatives of $e$ with respect to both inputs. Forward-mode differentiation gave us the derivative of our output with respect to a single input, but reverse-mode differentiation gives us all of them.

For this graph, that's only a factor of two speed up, but imagine a function with a million inputs and one output. Forward-mode differentiation would require us to go through the graph a million times to get the derivatives. Reverse-mode differentiation can get them all in one fell swoop! A speed up of a factor of a million is pretty nice!

When training neural networks, we think of the cost (a value describing how badly a neural network performs) as a function of the parameters (numbers describing how the network behaves). We want to calculate the derivatives of the cost with respect to all the parameters, for use in gradient descent. Now, there are often millions, or even tens of millions, of parameters in a neural network. So, reverse-mode differentiation, called backpropagation in the context of neural networks, gives us a massive speed up!
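
As a toy sketch of that setup (purely illustrative: pretend the tiny graph above is a network, $e$ is its cost, and $a$ and $b$ are its parameters), one reverse-mode pass per step yields the gradient for every parameter at once, which is exactly what a gradient-descent loop needs:

```python
# Toy gradient descent on e = (a + b) * (b + 1), treating a and b as the
# "parameters" and e as the "cost". One reverse-mode pass per step gives
# the derivative of e with respect to every parameter.
a, b = 2.0, 1.0
learning_rate = 0.1

for step in range(5):
    # Forward pass: evaluate the graph.
    c, d = a + b, b + 1.0
    e = c * d
    # Reverse-mode pass: derivatives of e with respect to every node.
    de_dc, de_dd = d, c
    de_da = de_dc
    de_db = de_dc + de_dd
    # Gradient-descent update on all parameters simultaneously.
    a -= learning_rate * de_da
    b -= learning_rate * de_db
    print(step, round(e, 4))  # the "cost" e shrinks over the steps
```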

(Are there any cases where forward-mode differentiation makes more sense? Yes, there are! Where the reverse-mode gives the derivatives of one output with respect to all inputs, the forward-mode gives us the derivatives of all outputs with respect to one input. If one has a function with lots of outputs, forward-mode differentiation can be much, much, much faster.)

Isn’t This Trivial?

When I first understood what backpropagation was, my reaction was: "Oh, that's just the chain rule! How did it take us so long to figure out?" I'm not the only one who's had that reaction. It's true that if you ask "is there a smart way to calculate derivatives in feedforward neural networks?" the answer isn't that difficult.

But I think it was much more difficult than it might seem. You see, at the time backpropagation was invented, people weren't very focused on the feedforward neural networks that we study. It also wasn't obvious that derivatives were the right way to train them. Those are only obvious once you realize you can quickly calculate derivatives. There was a circular dependency.

Worse, it would be very easy to write off any piece of the circular dependency as impossible on casual thought. Training neural networks with derivatives? Surely you'd just get stuck in local minima. And obviously it would be expensive to compute all those derivatives. It's only because we know this approach works that we don't immediately start listing reasons it's likely not to.

That’s the benefit of hindsight. Once you’ve framed the question, the hardest work is already done.

Conclusion

Derivatives are cheaper than you think. That's the main lesson to take away from this post. In fact, they're unintuitively cheap, and us silly humans have had to repeatedly rediscover this fact. That's an important thing to understand in deep learning. It's also a really useful thing to know in other fields, and only more so if it isn't common knowledge.

Are there other lessons? I think there are.

Backpropagation is also a useful lens for understanding how derivatives flow through a model. This can be extremely helpful in reasoning about why some models are difficult to optimize. The classic example of this is the problem of vanishing gradients in recurrent neural networks.

Finally, I claim there is a broad algorithmic lesson to take away from these techniques. Backpropagation and forward-mode differentiation use a powerful pair of tricks (linearization and dynamic programming) to compute derivatives more efficiently than one might think possible. If you really understand these techniques, you can use them to efficiently calculate several other interesting expressions involving derivatives. We'll explore this in a later blog post.

This post gives a very abstract treatment of backpropagation. I strongly recommend reading Michael Nielsen's chapter on it for an excellent discussion, more concretely focused on neural networks.

Acknowledgments

Thank you to Greg Corrado, Jon Shlens, Samy Bengio and Anelia Angelova for taking the time to proofread this post.

Thanks also to Dario Amodei, Michael Nielsen and Yoshua Bengio for discussion of approaches to explaining backpropagation. Also thanks to all those who tolerated me practicing explaining backpropagation in talks and seminar series!

[1] This might feel a bit like dynamic programming. That's because it is!

