Less-forgetful Learning for Domain Expansion in Deep Neural Networks

Abstract

Expanding the domain that a deep neural network has already learned, without access to the old-domain data, is challenging because deep neural networks forget previously learned information when they learn new data from a new domain.

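This paper proposes a less-forgetful learning method for the domain expansion scenario. The method works well on both the old and the new domain without needing to know whether an input comes from the old domain or the new one.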

Introduction

  • Domain adaptation: the same tasks but in different domains. The domain adaptation problem concerns how well a DNN works in a new domain that has not been learned. These domain adaptation techniques focus on adapting only to new domains.

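  • In practice, however, applications often need to retain what was learned on the old domain without seeing the old-domain data again. The paper poses this problem and calls it the DNN domain expansion problem.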

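  • The DNN domain expansion problem matters for three main reasons:

    1. It lets DNNs learn continually from a continuous stream of input data. (an advantage in how learning proceeds)
    2. In practice, users can fine-tune their DNN using only new data collected from the new environment, without access to data from the old domain. (the training data comes only from the new domain)
    3. It makes it possible to build a single unified network that performs in multiple domains. (the result is one network that works across domains)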
  • Two challenging issues:

    1. The network's performance on the old domain must not degrade; that is, catastrophic forgetting must not occur.
    2. The DNN must work well without prior knowledge of which domain the input data comes from.

Domain Expansion Problem

[Figure 2]

Figure 2 (a) and (b) require prior knowledge of the data's domain. Figure 2 (c) is the method proposed in the paper, which does not require it.

The domain expansion problem is a special case of the continual learning problem. Continual learning usually considers multiple-task learning or sequence learning (more than two domains), whereas the domain expansion problem considers only two domains: the old and the new.

Related work

  • Dropout combined with the maxout activation function helps reduce forgetting of learned information.
  • A large DNN trained with dropout can address the catastrophic forgetting problem.
  • Learning without forgetting (LwF) (Figure 2 (a)) uses a knowledge distillation loss to preserve performance.
  • Progressive learning (PL) (Figure 2 (b)) reuses previously learned features through lateral connections when learning a new task.
  • Elastic weight consolidation (EWC) computes the Fisher information matrix from the old-domain training data and uses its diagonal elements as the coefficients of an l2 regularizer, so that the weight parameters of the old and new networks stay similar while the new-domain data is learned.
  • Generative adversarial networks have been used to learn to generate old-domain data.
  • Type A^{'} extracts useful information from the old-domain data through an ad-hoc training process.
  • Type B^{'} trains the network on old-domain data in the usual way, so it can be applied directly to pre-trained models.
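
The EWC bullet above describes an l2 penalty whose per-weight coefficients are the diagonal Fisher information values. A minimal sketch of that penalty follows, assuming flat weight lists; the function name, the `lam` scale factor, and the example numbers are illustrative and not taken from the notes.

```python
def ewc_penalty(theta_old, theta_new, fisher_diag, lam=1.0):
    """Quadratic EWC-style penalty on flat weight lists.

    Computes 0.5 * lam * sum_i F_i * (theta_new_i - theta_old_i)^2, where
    fisher_diag holds the diagonal Fisher information values estimated on
    old-domain data. `lam` is a hypothetical tuning parameter.
    """
    return 0.5 * lam * sum(
        f * (n - o) ** 2
        for f, n, o in zip(fisher_diag, theta_new, theta_old)
    )


# Weights with a large Fisher value are penalized more for moving.
penalty = ewc_penalty(
    theta_old=[0.0, 0.0],
    theta_new=[1.0, 2.0],
    fisher_diag=[2.0, 0.5],
)
```

In training, this term would be added to the new-domain classification loss, so weights that mattered for the old domain move less.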

Naive Approach

Fine-tuning only the softmax classifier layer

  • Freeze the lower layers and fine-tune only the final softmax classifier layer.
  • The feature extractor is shared between the old and new domains.
  • The intent is that the weight parameters shared between the old and new domains do not change.
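
Freezing a layer can be expressed as giving it a learning rate of zero. The sketch below assumes a hypothetical two-layer network stored as flat weight lists, with made-up gradients; the layer names and numbers are illustrative only.

```python
def sgd_step(params, grads, lr_per_layer):
    """Plain SGD where every named layer has its own learning rate.

    A learning rate of 0.0 freezes that layer, because w - 0.0 * g == w.
    """
    return {
        name: [w - lr_per_layer[name] * g for w, g in zip(weights, grads[name])]
        for name, weights in params.items()
    }


# Hypothetical network: a shared feature extractor and a softmax classifier.
params = {"features": [0.5, -0.2], "classifier": [1.0, 2.0]}
grads = {"features": [0.1, 0.1], "classifier": [0.3, -0.4]}

# Naive approach: lower layers frozen, only the classifier is fine-tuned.
updated = sgd_step(params, grads, {"features": 0.0, "classifier": 0.1})
```

After the step, `updated["features"]` is identical to the original feature weights, while the classifier weights have moved against their gradients.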

Weight constraint approach

  • Use l2 regularization to keep the weight parameters for the old and new domains similar.

\mathcal{L}_w(x;\theta^{(o)},\theta^{(n)})=\lambda_c \mathcal{L}_c(x;\theta^{(n)})+\lambda_w\|\theta^{(o)}-\theta^{(n)}\|_2
\mathcal{L}_c(x;\theta^{(n)})=-\sum^C_{i=1}t_i\log(o_i(x;\theta^{(n)}))

  • The intent is to preserve the learned information by preventing the weight parameters from changing too much.
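
As a rough illustration of the weight constraint loss, the sketch below evaluates \mathcal{L}_w for a single sample. It assumes t_i is a one-hot target and o_i is the softmax output of the new network; the default \lambda_w = 0.1 and the example values are arbitrary choices, not values from the paper.

```python
import math


def cross_entropy(targets, probs, eps=1e-12):
    """L_c = -sum_i t_i * log(o_i) for one sample."""
    return -sum(t * math.log(max(o, eps)) for t, o in zip(targets, probs))


def weight_constraint_loss(targets, probs, theta_old, theta_new,
                           lambda_c=1.0, lambda_w=0.1):
    """L_w = lambda_c * L_c + lambda_w * ||theta_old - theta_new||_2."""
    l2_distance = math.sqrt(
        sum((o - n) ** 2 for o, n in zip(theta_old, theta_new))
    )
    return lambda_c * cross_entropy(targets, probs) + lambda_w * l2_distance


# Two classes, uniform prediction, weights that moved by distance 5.
loss = weight_constraint_loss(
    targets=[0.0, 1.0],
    probs=[0.5, 0.5],
    theta_old=[0.0, 0.0],
    theta_new=[3.0, 4.0],
)
```

Here the cross-entropy term is log 2 and the constraint term is 0.1 × 5, so increasing \lambda_w makes any drift from the old weights more expensive.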

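Comment: both approaches try to keep the difference between the new network and the old network small. Fine-tuning only the softmax classifier layer forces the difference in the shared weight parameters to be zero, while the weight constraint approach keeps the weight parameters for the two domains similar, so the difference stays small. The two are the same in essence and differ only in degree.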

Less-forgetful learning

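  • The weights of the softmax classifier represent the decision boundaries for classifying features. The features extracted from the top hidden layer are linearly separable because of the linear nature of the top-layer classifier.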
  • Property 1. The decision boundaries should be unchanged.
    Property 2. The features extracted by the new network from the data of the old domain should be present in a position close to the features extracted by the old network from the data of the old domain.

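Property 1 can be satisfied by setting the learning rate of the boundary (the softmax classifier weights) to 0.
Property 2 has to be satisfied using the training data of the new domain, because the old-domain data cannot be accessed.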


As in conventional fine-tuning, the proposed method first copies the old network to the new network. Then, to keep the decision boundaries unchanged, it freezes the weights of the softmax classifier layer. The loss function is:

\mathcal{L}_t(x;\theta^{(o)},\theta^{(n)})=\lambda_c\mathcal{L}_c(x;\theta^{(n)})+\lambda_e\mathcal{L}_e(x;\theta^{(o)},\theta^{(n)})
\mathcal{L}_c is the cross-entropy loss and \mathcal{L}_e is the Euclidean loss. \lambda_c=1, and \lambda_e is usually smaller than \lambda_c.

\mathcal{L}_e(x;\theta^{(o)},\theta^{(n)})=\frac{1}{2}\|\mathbf{f}_{L-1}(x;\theta^{(o)})-\mathbf{f}_{L-1}(x;\theta^{(n)})\|_2^2
\mathbf{f}_{L-1} is the feature just before the softmax classifier layer. The new network learns to extract features similar to those extracted by the old network.
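
The two loss terms can be evaluated by hand on small vectors. The sketch below assumes the features and softmax outputs have already been computed for one new-domain sample; the default \lambda_e = 0.1 and the example values are arbitrary illustrations.

```python
import math


def cross_entropy(targets, probs, eps=1e-12):
    """L_c = -sum_i t_i * log(o_i) for one sample."""
    return -sum(t * math.log(max(o, eps)) for t, o in zip(targets, probs))


def euclidean_loss(feat_old, feat_new):
    """L_e = 0.5 * ||f_old - f_new||_2^2 on the features before the softmax layer."""
    return 0.5 * sum((a - b) ** 2 for a, b in zip(feat_old, feat_new))


def total_loss(targets, probs_new, feat_old, feat_new,
               lambda_c=1.0, lambda_e=0.1):
    """L_t = lambda_c * L_c + lambda_e * L_e."""
    return (lambda_c * cross_entropy(targets, probs_new)
            + lambda_e * euclidean_loss(feat_old, feat_new))


# The old network produced features [1, 2]; the new one has drifted to [0, 0].
l_e = euclidean_loss([1.0, 2.0], [0.0, 0.0])
l_t = total_loss([1.0, 0.0], [0.5, 0.5], [1.0, 2.0], [0.0, 0.0])
```

Both feature vectors come from the same new-domain input, which is why no old-domain data is needed to compute \mathcal{L}_e.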

\hat{\theta}^{(n)}=\arg\min_{\theta^{(n)}}\mathcal{L}_t(x;\theta^{(o)},\theta^{(n)})+\mathcal{R}(\theta^{(n)})
\mathcal{R}(\cdot) is a general regularization term, such as weight decay. This is the same as in the weight constraint approach.
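
To make the optimization concrete, here is a toy sketch of one gradient step on \mathcal{L}_t. It assumes a deliberately tiny network, with a linear feature extractor f = W x and a frozen softmax classifier V; this architecture, the helper names, and all hyperparameter values are illustrative assumptions and not the networks used in the paper.

```python
import math


def matvec(M, v):
    """Matrix-vector product on nested lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def softmax(z):
    """Numerically stable softmax."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def lf_step(W_new, W_old, V, x, t, lr=0.1,
            lambda_c=1.0, lambda_e=0.5, weight_decay=0.0):
    """One SGD step on L_t for a toy net: f = W x, output = softmax(V f).

    The classifier V is frozen (never updated); only W_new changes.
    """
    f_old = matvec(W_old, x)
    f_new = matvec(W_new, x)
    p = softmax(matvec(V, f_new))
    err = [pi - ti for pi, ti in zip(p, t)]
    # dL_c/df = V^T (p - t) and dL_e/df = f_new - f_old
    grad_f = [
        lambda_c * sum(V[k][j] * err[k] for k in range(len(V)))
        + lambda_e * (f_new[j] - f_old[j])
        for j in range(len(f_new))
    ]
    # dL/dW[j][i] = grad_f[j] * x[i]; weight decay stands in for R(theta)
    return [
        [W_new[j][i] - lr * (grad_f[j] * x[i] + weight_decay * W_new[j][i])
         for i in range(len(x))]
        for j in range(len(W_new))
    ]


def feature_drift(lambda_e, steps=50):
    """Distance between old and new features after training on one sample."""
    W_old = [[1.0, 0.0], [0.0, 1.0]]
    V = [[1.0, 0.0], [0.0, 1.0]]
    x, t = [1.0, 0.0], [0.0, 1.0]
    W = [row[:] for row in W_old]  # copy the old network to the new one
    for _ in range(steps):
        W = lf_step(W, W_old, V, x, t, lambda_e=lambda_e)
    diff = [a - b for a, b in zip(matvec(W, x), matvec(W_old, x))]
    return math.sqrt(sum(d * d for d in diff))
```

In this toy setting, `feature_drift(5.0)` comes out smaller than `feature_drift(0.0)`: with \lambda_e = 0 the step reduces to ordinary fine-tuning with a frozen classifier, while a larger \lambda_e keeps the new features close to the old ones.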


Comment: less-forgetful learning combines fine-tuning only the softmax classifier layer with the weight constraint approach. (1) It argues that the decision boundaries should stay unchanged, so it freezes the softmax weights. (2) It also introduces \mathcal{L}_e, which is similar to a distillation loss.

Experimental results

  • task: image classification
    dataset: CIFAR-10, MNIST, SVHN, ImageNet


  • comparison methods: (1) baseline: fine-tuning + weight constraint + ReLU/Maxout/LWTA; (2) LwF; (3) EWC

  • result


  • ablation study

    1. Relationship between the classification rates for the old and new domains with different values of \lambda_e.
    2. Average classification rates with respect to \lambda_e.
    3. Classification rates according to the size of the old-domain data using the CIFAR-10 dataset.
    4. Further analysis of scratch learning, fine-tuning, and LF learning
    5. Feasibility for Continual Learning