Abstract
Expanding the domain that a deep neural network has already learned, without accessing the old-domain data, is a challenging task because deep neural networks forget previously learned information when learning new data from a new domain.
This paper proposes a less-forgetful learning method for the domain expansion scenario. The method works well on both the old and new domains without needing to know whether an input comes from the old domain or the new domain.
Introduction
Domain adaptation: the same task but in different domains. The domain adaptation problem concerns how well a DNN works in a new domain that it has not learned. Existing domain adaptation techniques focus only on adapting to the new domain.
In practice, however, applications often need to remember the old domain without ever seeing the old-domain data again. This paper raises this issue and calls it the DNN domain expansion problem.
-
Three main reasons why the DNN domain expansion problem is important:
- It enables DNNs to continually learn from a continuous stream of input data. (an advantage in how learning proceeds)
- In practice, users can fine-tune their DNNs using only the new data collected from the new environment, without accessing any data from the old domain. (the training data come only from the new domain)
- It makes it possible to build a single unified network that performs well in multiple domains. (the result of learning is one network that works across domains)
-
Two challenging issues:
- The network's performance on the old domain must not degrade; that is, the catastrophic forgetting problem must not occur.
- The DNN must work without any prior knowledge of which domain the input data come from.
Domain Expansion Problem
Figure 2 (a) and (b) require prior knowledge of the data's domain; Figure 2 (c) is the method proposed in this paper, which requires no prior knowledge of the data's domain.
The domain expansion problem is a special case of the continual learning problem. Continual learning usually considers multi-task learning or sequential learning over more than two domains, whereas the domain expansion problem considers only two domains: the old domain and the new domain.
Related work
- The dropout method combined with the maxout activation function helps reduce forgetting of learned information.
- A large DNN combined with dropout can mitigate the catastrophic forgetting problem.
- Learning without Forgetting (LwF) (Figure 2 (a)) uses the knowledge distillation loss to maintain performance on the old domain.
- Progressive learning (PL) (Figure 2 (b)) reuses previously learned features through lateral connections when learning a new task.
- Elastic weight consolidation (EWC) uses the Fisher information matrix computed from the old-domain training data; its diagonal elements serve as the coefficients of an l2 regularization term so that, while learning the new-domain data, the weight parameters of the new network stay similar to those of the old network (see the sketch after this list).
- Generative adversarial networks have been used to learn to generate old-domain data.
- One type of method extracts useful information from the old-domain data through an ad-hoc training process;
- The other type trains the network on the old-domain data in the usual way; such methods can be applied directly to pre-trained models.
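As a rough illustration of the EWC-style penalty described above (a generic sketch, not the exact formulation from the EWC paper), assuming `old_params` and `fisher_diag` were precomputed on the old-domain data and that `lam` is an illustrative coefficient:

```python
import torch

def ewc_penalty(model, old_params, fisher_diag, lam=100.0):
    """Quadratic penalty pulling the new weights toward the old network's
    weights, scaled per parameter by the diagonal Fisher information
    estimated on the old-domain data before it was discarded."""
    penalty = torch.zeros(())
    for name, p in model.named_parameters():
        if name in fisher_diag:
            penalty = penalty + (fisher_diag[name] * (p - old_params[name]) ** 2).sum()
    return 0.5 * lam * penalty

# Usage during new-domain training (old_params / fisher_diag precomputed):
# total_loss = cross_entropy_loss + ewc_penalty(model, old_params, fisher_diag)
```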
Naive Approach
Fine-tuning only the softmax classifier layer
- Freeze the lower layers and fine-tune only the final softmax classifier layer.
- The feature extractor is shared between the old and new domains.
- The idea is that the weight parameters shared between the old and new domains should not change at all (a sketch follows below).
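A minimal PyTorch sketch of this naive approach; the layer sizes and hyperparameters are illustrative assumptions, not values from the paper:

```python
import torch
import torch.nn as nn

# A hypothetical pre-trained network: a feature extractor plus a softmax classifier.
feature_extractor = nn.Sequential(
    nn.Flatten(),
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(),
)
classifier = nn.Linear(128, 10)  # final softmax classifier layer

# Freeze the lower layers so the shared feature extractor stays unchanged.
for p in feature_extractor.parameters():
    p.requires_grad = False

# Only the classifier's parameters are passed to the optimizer.
optimizer = torch.optim.SGD(classifier.parameters(), lr=1e-2)
criterion = nn.CrossEntropyLoss()

def train_step(x, y):
    """One fine-tuning step on a new-domain batch (x, y)."""
    with torch.no_grad():                  # features are not updated
        feats = feature_extractor(x)
    loss = criterion(classifier(feats), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```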
Weight constraint approach
- Uses l2 regularization to keep the weight parameters of the new network similar to those of the old network.
- The expectation is that, by not letting the weight parameters change too much, the previously learned information is preserved (sketched below).
Comment: Both approaches try to keep the difference between the new network and the old network small. Fine-tuning only the softmax classifier layer forces the weight parameters shared between the old and new domains to have zero difference, while the weight constraint approach only keeps the weight parameters of the two networks similar, allowing small differences. The two approaches are essentially the same and differ only in degree.
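A minimal sketch of the weight constraint idea, assuming a generic PyTorch `model`; `old_params` is a snapshot of the old network's weights and the coefficient `alpha` is an illustrative choice:

```python
import torch

def weight_constraint_penalty(model, old_params, alpha=1e-3):
    """l2 penalty that keeps the new network's weights close to the old ones."""
    penalty = torch.zeros(())
    for name, p in model.named_parameters():
        penalty = penalty + ((p - old_params[name]) ** 2).sum()
    return alpha * penalty

# Snapshot the old network's weights before fine-tuning on the new domain:
# old_params = {n: p.detach().clone() for n, p in model.named_parameters()}
# total_loss = cross_entropy_loss + weight_constraint_penalty(model, old_params)
```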
Less-forgetful learning
- The weights of the softmax classifier represent the decision boundaries used to classify the features; the features extracted from the top hidden layer are linearly separable because the top-layer classifier is linear.
Property 1. The decision boundaries should be unchanged.
Property 2. The features extracted by the new network from the data of the old domain should be close to the features extracted by the old network from the data of the old domain.
Property 1 can be satisfied by setting the learning rate of the boundary (the softmax classifier layer) to zero.
For Property 2, since the old-domain data cannot be accessed, the training data of the new domain are used instead.
Like the traditional fine-tuning method, the proposed method first copies the old network to the new network. Then, to keep the decision boundaries unchanged, the weights of the softmax classifier layer are frozen. The loss function is as follows:
L_t = λ_c L_c + λ_e L_e + R(θ_new), where L_c is the cross-entropy loss, L_e is the Euclidean loss, and λ_e is usually set much smaller than λ_c.
The Euclidean loss L_e is computed on the features just before the softmax classifier layer; it makes the new network learn to extract features similar to those extracted by the old network.
R(θ_new) is a general regularization term, such as weight decay; this part is the same as in the weight constraint approach.
Comment: Less-forgetful learning combines fine-tuning only the softmax classifier layer with the weight constraint approach. (1) It requires the decision boundaries to stay unchanged, so the weights of the softmax classifier are frozen. (2) It also introduces the feature-preserving Euclidean loss L_e, which plays a role similar to the distillation loss. A sketch combining these pieces is given below.
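A minimal PyTorch sketch of the LF training step as summarized above; the network sizes, the value of lambda_e, and the weight-decay setting are illustrative assumptions, not values from the paper:

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

# Old (pre-trained) network: feature extractor + softmax classifier.
feature_extractor = nn.Sequential(nn.Flatten(),
                                  nn.Linear(784, 128), nn.ReLU())
classifier = nn.Linear(128, 10)

# The new network starts as a copy of the old one (as in ordinary fine-tuning);
# the old feature extractor is kept frozen as the target for Property 2.
old_feature_extractor = copy.deepcopy(feature_extractor).eval()
for p in old_feature_extractor.parameters():
    p.requires_grad = False

# Property 1: freeze the softmax classifier so the decision boundaries stay fixed.
for p in classifier.parameters():
    p.requires_grad = False

lambda_e = 1e-3  # feature-preserving coefficient, much smaller than the CE weight
optimizer = torch.optim.SGD(feature_extractor.parameters(),
                            lr=1e-3, weight_decay=5e-4)  # weight decay acts as R(theta)

def lf_train_step(x, y):
    """One less-forgetful training step on a new-domain batch (x, y)."""
    feats_new = feature_extractor(x)
    with torch.no_grad():
        feats_old = old_feature_extractor(x)          # Property 2 target
    ce = F.cross_entropy(classifier(feats_new), y)    # L_c on new-domain labels
    eu = 0.5 * (feats_new - feats_old).pow(2).sum(dim=1).mean()  # Euclidean loss L_e
    loss = ce + lambda_e * eu
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```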
Experimental results
-
Task: image classification
Datasets: CIFAR-10, MNIST, SVHN, ImageNet
Comparison methods: (1) baselines: fine-tuning and the weight constraint approach with ReLU/Maxout/LWTA activations; (2) LwF; (3) EWC
-
Results
-
Ablation study
- Relationship between the classification rates for the old and new domains with different values of λ_e.
- Average classification rates with respect to λ_e.
- Classification rates according to the size of the old-domain data using the CIFAR-10 dataset.
- Further analysis of scratch learning, fine-tuning, and LF learning
- Feasibility for Continual Learning