Recycling Deep Learning Models with Transfer Learning

Deep learning exploits gigantic datasets to produce powerful models. But what can we do when our datasets are comparatively small? Transfer learning by fine-tuning deep nets offers a way to leverage existing datasets to perform well on new tasks.

By Zachary Chase Lipton

Deep learning models are indisputably the state of the art for many problems in machine perception. Using neural networks with many hidden layers of artificial neurons and millions of parameters, deep learning algorithms exploit both hardware capabilities and the abundance of gigantic datasets. In comparison, linear models saturate quickly and under-fit. But what if you don't have big data?

Say you have a novel research problem, such as identifying cancerous moles given photographs. Assume further that you generated a dataset of 10,000 labeled images. While this might seem like big data, it's a pittance compared to the authoritative datasets on which deep learning has had its greatest successes. It might seem that all is lost. Fortunately, there's hope.

Consider the famous ImageNet database created by Stanford Professor Fei-Fei Li. The dataset contains millions of images, each belonging to 1 of 1000 distinct, hierarchically organized object categories. As Dr. Li relates in her TED talk, a child sees millions of distinct images while growing up. Intuitively, to train a model strong enough to compete with human object recognition capabilities, a similarly large training set might be required. Further, the methods which dominate at this scale of data might not be those which dominate on the same task given smaller datasets. Dr. Li's insight quickly paid off. Alex Krizhevsky, Ilya Sutskever and Geoffrey Hinton handily won the 2012 ImageNet competition, establishing convolutional neural networks (CNNs) as the state of the art in computer vision.

To extend Dr. Li's metaphor, although a child sees millions of images throughout development, once grown, a human can easily learn to recognize a new class of object without learning to see from scratch. Now consider a CNN trained for object recognition on the ImageNet dataset. Given an image from any of the 1000 ImageNet categories, the topmost hidden layer of the neural network constitutes a single vector representation sufficiently flexible to be useful for classifying images into each of the 1000 categories. It seems reasonable to guess that this representation would also be useful for an additional, as-yet-unseen object category.
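
To make this concrete, here is a minimal sketch of using a pre-trained CNN as a fixed feature extractor, taking the activations of the top hidden layer as the representation for new images. The framework (PyTorch with torchvision's AlexNet) and the stand-in data are my own illustrative assumptions, not something specified in the article.

```python
import torch
import torch.nn as nn
from torchvision import models

# Load a CNN already trained on the 1000-category ImageNet task.
model = models.alexnet(pretrained=True)

# Drop the final 1000-way classification layer; the network now outputs
# the 4096-dimensional activations of the top hidden layer.
model.classifier[6] = nn.Identity()
model.eval()  # inference mode: disables dropout

# Stand-in for a batch of images from a new, unseen object category.
images = torch.randn(8, 3, 224, 224)
with torch.no_grad():
    features = model(images)
print(features.shape)  # torch.Size([8, 4096])
```

These fixed vectors can then be fed to a simple classifier, such as logistic regression or an SVM, trained only on the small new dataset.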

Following this line of thinking, computer vision researchers now commonly use pre-trained CNNs to generate representations for novel tasks, where the dataset may not be large enough to train an entire CNN from scratch. Another common tactic is to take the pre-trained ImageNet network and then to fine-tune the entire network to the novel task. This is typically done by training end-to-end with backpropagation and stochastic gradient descent. I first learned about this fine-tuning technique last year in conversations with Oscar Beijbom. Beijbom, a vision researcher at UCSD, has successfully used this technique to differentiate pictures of various types of coral.
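
A minimal sketch of the fine-tuning tactic might look like the following. The data loader here is a hypothetical stand-in, and the framework and hyperparameters are my own illustrative assumptions rather than the setup Beijbom or others actually used.

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_NEW_CLASSES = 2  # e.g., cancerous vs. benign moles

# Start from the ImageNet-pre-trained network and swap in a new output layer.
model = models.alexnet(pretrained=True)
model.classifier[6] = nn.Linear(4096, NUM_NEW_CLASSES)

criterion = nn.CrossEntropyLoss()
# A modest learning rate nudges the pre-trained weights rather than erasing them.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

# Stand-in loader: replace with real labeled images for the new task.
loader = [(torch.randn(4, 3, 224, 224),
           torch.randint(0, NUM_NEW_CLASSES, (4,))) for _ in range(3)]

model.train()
for images, labels in loader:
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()   # backpropagation through the entire network
    optimizer.step()  # stochastic gradient descent update, end-to-end
```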

A terrific paper from last year's NIPS conference by Jason Yosinski of Cornell, in collaboration with Jeff Clune, Yoshua Bengio and Hod Lipson, tackles this issue with a rigorous empirical investigation. The authors focus on the ImageNet dataset and the 8-layer AlexNet architecture of ImageNet fame. They note that the lowest layers of convolutional neural networks have long been known to resemble conventional computer vision features like edge detectors and Gabor filters. Further, they note that the topmost hidden layer is somewhat specialized to the task it was trained for. Therefore, they systematically explore the following questions:

Where does the transition from broadly useful to highly specialized features take place?

What is the relative performance of fine-tuning an entire network vs. freezing the transferred layers?

To study these questions, the authors restrict their attention to the ImageNet dataset, but consider two 50-50 splits of the data into subsets A and B. In the first set of experiments, they split the data by randomly assigning each image category to either subset A or subset B. In the second set of experiments, the authors consider a split with greater contrast, assigning to one set only natural images and to the other only images of man-made objects.

The authors examine the case in which the network is pre-trained on task A and then trained further on task B, keeping the first k layers and randomly initializing the remaining 8-k layers. They also consider a control in which the network is pre-trained on task B, the top 8-k layers are randomly re-initialized, and training then continues on the same task B.
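
A sketch of this setup, under the assumption that torchvision's AlexNet (five convolutional plus three fully connected layers) stands in for the paper's 8-layer architecture, might copy the first k learned layers from a network trained on task A into a freshly initialized network and then either freeze or fine-tune them. This is my own illustration of the idea, not the authors' code.

```python
import torch.nn as nn
from torchvision import models

def transfer_first_k(net_A, k, num_classes_B, freeze):
    """Copy the first k learned layers of net_A into a randomly
    initialized network for task B, optionally freezing them.
    Valid for k < 8, since the final layer is task-specific."""
    net_B = models.alexnet(num_classes=num_classes_B)  # random initialization
    layers_A = [m for m in net_A.modules() if isinstance(m, (nn.Conv2d, nn.Linear))]
    layers_B = [m for m in net_B.modules() if isinstance(m, (nn.Conv2d, nn.Linear))]
    for src, dst in zip(layers_A[:k], layers_B[:k]):
        dst.weight.data.copy_(src.weight.data)  # transfer pre-trained weights
        dst.bias.data.copy_(src.bias.data)
        if freeze:
            for p in dst.parameters():
                p.requires_grad = False
    return net_B

net_A = models.alexnet(pretrained=True)  # stand-in for "trained on task A"
frozen    = transfer_first_k(net_A, k=3, num_classes_B=500, freeze=True)
fine_tune = transfer_first_k(net_A, k=3, num_classes_B=500, freeze=False)
```

Training `fine_tune` end-to-end on task B corresponds to the fine-tuned condition, while training only the unfrozen layers of `frozen` corresponds to the frozen condition the paper compares against.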

In all cases, they find that fine-tuning end-to-end on a pre-trained network outperforms freezing the transferred k layers completely. Interestingly, they also find that when pre-trained layers are frozen, the transferability of features is nonlinear in the number of pre-trained layers. They hypothesize that this owes to complex interactions between the nodes in adjacent layers, which are difficult to relearn.

To wax poetic, this might be analogous to performing a hemispherectomy on a fully grown human, replacing the right half of the brain with a blank-slate child's brain. Ignoring the biological absurdity of this example, it seems intuitive that a fresh right hemisphere would have trouble filling the role of one to which the left hemisphere had fully adapted.

I defer to the original paper for a complete description of the experiments. One question which is not addressed in this paper, but which could inspire future research, is how these results hold up when the new task has far fewer examples than the original task. In practice, this imbalance in the number of labeled examples between the original task and the new one is often what motivates transfer learning. A preliminary approach could be to repeat this exact work but with a 999-to-1 split of the categories instead of a 500-500 split.

Generally, data labeling is expensive. For many tasks it may even be prohibitively expensive. Thus it's of economic and academic interest to develop a deeper understanding of how we could leverage the labels we do have to perform well on tasks not as well endowed. Yosinski's work is a great step in this direction.

Zachary Chase Lipton is a PhD student in the Computer Science Engineering department at the University of California, San Diego. Funded by the Division of Biomedical Informatics, he is interested in both theoretical foundations and applications of machine learning. In addition to his work at UCSD, he has interned at Microsoft Research Labs.
