Abstract
Expanding the domain that a deep neural network has already learned, without accessing the old-domain data, is a challenging task because deep neural networks forget previously learned information when learning new data from a new domain.
This paper proposes a less-forgetful learning method for the domain expansion scenario. The method works well on both the old and new domains without needing to know whether an input comes from the old domain or the new domain.
Introduction
Domain adaptation: the same task but in different domains. The domain adaptation problem concerns how well a DNN works in a new domain that it has not learned. Existing domain adaptation techniques focus only on adapting to the new domain.
In practice, however, applications often need to remember the old domain without ever seeing the old-domain data again. This paper raises this issue and calls it the DNN domain expansion problem.
-
Three main reasons why the DNN domain expansion problem is important:
- It enables DNNs to continually learn from a continuous stream of input data. (an advantage in how learning proceeds)
- In practice, users can fine-tune their DNNs using only the new data collected from the new environment, without accessing any data from the old domain. (the training data come only from the new domain)
- It makes it possible to build a single unified network that performs well in multiple domains. (the result of learning is one network that works across domains)
-
Two challenging issues:
- The network's performance on the old domain must not degrade; that is, the catastrophic forgetting problem must not occur.
- The DNN must work without any prior knowledge of which domain the input data come from.
Domain Expansion Problem
Figure 2 (a) and (b) require prior knowledge of the data's domain; Figure 2 (c) is the method proposed in this paper, which requires no prior knowledge of the data's domain.
The domain expansion problem is a special case of the continual learning problem. Continual learning usually considers multi-task learning or sequential learning over more than two domains, whereas the domain expansion problem considers only two domains: the old domain and the new domain.
Related work
- The dropout method combined with the maxout activation function helps reduce forgetting of learned information.
- A large DNN combined with dropout can mitigate the catastrophic forgetting problem.
- Learning without Forgetting (LwF) (Figure 2 (a)) uses the knowledge distillation loss to maintain performance on the old domain.
- Progressive learning (PL) (Figure 2 (b)) reuses previously learned features through lateral connections when learning a new task.
- Elastic weight consolidation (EWC) uses the Fisher information matrix computed from the old-domain training data; its diagonal elements serve as the coefficients of an l2 regularization term so that, while learning the new-domain data, the weight parameters of the new network stay similar to those of the old network (see the sketch after this list).
- Generative adversarial networks have been used to learn to generate old-domain data.
- One type of method extracts useful information from the old-domain data through an ad-hoc training process;
- The other type trains the network on the old-domain data in the usual way; such methods can be applied directly to pre-trained models.
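As a rough illustration of the EWC-style penalty described above (a generic sketch, not the exact formulation from the EWC paper), assuming `old_params` and `fisher_diag` were precomputed on the old-domain data and that `lam` is an illustrative coefficient:

```python
import torch

def ewc_penalty(model, old_params, fisher_diag, lam=100.0):
    """Quadratic penalty pulling the new weights toward the old network's
    weights, scaled per parameter by the diagonal Fisher information
    estimated on the old-domain data before it was discarded."""
    penalty = torch.zeros(())
    for name, p in model.named_parameters():
        if name in fisher_diag:
            penalty = penalty + (fisher_diag[name] * (p - old_params[name]) ** 2).sum()
    return 0.5 * lam * penalty

# Usage during new-domain training (old_params / fisher_diag precomputed):
# total_loss = cross_entropy_loss + ewc_penalty(model, old_params, fisher_diag)
```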
Naive Approach
Fine-tuning only the softmax classifier layer
- Freeze the lower layers and fine-tune only the final softmax classifier layer.
- The feature extractor is shared between the old and new domains.
- The idea is that the weight parameters shared between the old and new domains should not change at all (a sketch follows below).
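A minimal PyTorch sketch of this naive approach; the layer sizes and hyperparameters are illustrative assumptions, not values from the paper:

```python
import torch
import torch.nn as nn

# A hypothetical pre-trained network: a feature extractor plus a softmax classifier.
feature_extractor = nn.Sequential(
    nn.Flatten(),
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(),
)
classifier = nn.Linear(128, 10)  # final softmax classifier layer

# Freeze the lower layers so the shared feature extractor stays unchanged.
for p in feature_extractor.parameters():
    p.requires_grad = False

# Only the classifier's parameters are passed to the optimizer.
optimizer = torch.optim.SGD(classifier.parameters(), lr=1e-2)
criterion = nn.CrossEntropyLoss()

def train_step(x, y):
    """One fine-tuning step on a new-domain batch (x, y)."""
    with torch.no_grad():                  # features are not updated
        feats = feature_extractor(x)
    loss = criterion(classifier(feats), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```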
Weight constraint approach
- Uses l2 regularization to keep the weight parameters of the new network similar to those of the old network.
- The expectation is that, by not letting the weight parameters change too much, the previously learned information is preserved (sketched below).
Comment: Both approaches try to keep the difference between the new network and the old network small. Fine-tuning only the softmax classifier layer forces the weight parameters shared between the old and new domains to have zero difference, while the weight constraint approach only keeps the weight parameters of the two networks similar, allowing small differences. The two approaches are essentially the same and differ only in degree.
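A minimal sketch of the weight constraint idea, assuming a generic PyTorch `model`; `old_params` is a snapshot of the old network's weights and the coefficient `alpha` is an illustrative choice:

```python
import torch

def weight_constraint_penalty(model, old_params, alpha=1e-3):
    """l2 penalty that keeps the new network's weights close to the old ones."""
    penalty = torch.zeros(())
    for name, p in model.named_parameters():
        penalty = penalty + ((p - old_params[name]) ** 2).sum()
    return alpha * penalty

# Snapshot the old network's weights before fine-tuning on the new domain:
# old_params = {n: p.detach().clone() for n, p in model.named_parameters()}
# total_loss = cross_entropy_loss + weight_constraint_penalty(model, old_params)
```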
Less-forgetful learning
- The weights of the softmax classifier represent the decision boundaries used to classify the features; the features extracted from the top hidden layer are linearly separable because the top-layer classifier is linear.
Property 1. The decision boundaries should be unchanged.
Property 2. The features extracted by the new network from the data of the old domain should be close to the features extracted by the old network from the data of the old domain.
Property 1 can be satisfied by setting the learning rate of the boundary (the softmax classifier layer) to zero.
For Property 2, since the old-domain data cannot be accessed, the training data of the new domain are used instead.
Like the traditional fine-tuning method, the proposed method first copies the old network to the new network. Then, to keep the decision boundaries unchanged, the weights of the softmax classifier layer are frozen. The loss function is as follows:
L_t = λ_c L_c + λ_e L_e + R(θ_new), where L_c is the cross-entropy loss, L_e is the Euclidean loss, and λ_e is usually set much smaller than λ_c.
The Euclidean loss L_e is computed on the features just before the softmax classifier layer; it makes the new network learn to extract features similar to those extracted by the old network.
R(θ_new) is a general regularization term, such as weight decay; this part is the same as in the weight constraint approach.
Comment: Less-forgetful learning combines fine-tuning only the softmax classifier layer with the weight constraint approach. (1) It requires the decision boundaries to stay unchanged, so the weights of the softmax classifier are frozen. (2) It also introduces the feature-preserving Euclidean loss L_e, which plays a role similar to the distillation loss. A sketch combining these pieces is given below.
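A minimal PyTorch sketch of the LF training step as summarized above; the network sizes, the value of lambda_e, and the weight-decay setting are illustrative assumptions, not values from the paper:

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

# Old (pre-trained) network: feature extractor + softmax classifier.
feature_extractor = nn.Sequential(nn.Flatten(),
                                  nn.Linear(784, 128), nn.ReLU())
classifier = nn.Linear(128, 10)

# The new network starts as a copy of the old one (as in ordinary fine-tuning);
# the old feature extractor is kept frozen as the target for Property 2.
old_feature_extractor = copy.deepcopy(feature_extractor).eval()
for p in old_feature_extractor.parameters():
    p.requires_grad = False

# Property 1: freeze the softmax classifier so the decision boundaries stay fixed.
for p in classifier.parameters():
    p.requires_grad = False

lambda_e = 1e-3  # feature-preserving coefficient, much smaller than the CE weight
optimizer = torch.optim.SGD(feature_extractor.parameters(),
                            lr=1e-3, weight_decay=5e-4)  # weight decay acts as R(theta)

def lf_train_step(x, y):
    """One less-forgetful training step on a new-domain batch (x, y)."""
    feats_new = feature_extractor(x)
    with torch.no_grad():
        feats_old = old_feature_extractor(x)          # Property 2 target
    ce = F.cross_entropy(classifier(feats_new), y)    # L_c on new-domain labels
    eu = 0.5 * (feats_new - feats_old).pow(2).sum(dim=1).mean()  # Euclidean loss L_e
    loss = ce + lambda_e * eu
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```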
Experimental results
-
Task: image classification
Datasets: CIFAR-10, MNIST, SVHN, ImageNet
Comparison methods: (1) baselines: fine-tuning and the weight constraint approach with ReLU/Maxout/LWTA activations; (2) LwF; (3) EWC
-
Results
-
Ablation study
- Relationship between the classification rates for the old and new domains with different values of λ_e.
- Average classification rates with respect to λ_e.
- Classification rates according to the size of the old-domain data using the CIFAR-10 dataset.
- Further analysis of scratch learning, fine-tuning, and LF learning
- Feasibility for Continual Learning