Abstract
The paper addresses catastrophic forgetting with an approach that learns deep neural networks incrementally, using new data and only a small exemplar set corresponding to samples from the old classes.
Loss: a distillation loss (to retain the knowledge of the old classes) combined with a cross-entropy classification loss.
The whole framework is trained end-to-end.
Evaluated on CIFAR-100 and ImageNet (ILSVRC 2012).
Introduction
While learning new classes incrementally is trivial for most people (we learn to recognize the faces of new people we meet every day), it is not the case for a machine learning system.
Desired properties of incremental deep learning: (1) it can be trained from a stream of data, with classes appearing in any order and at any time; (2) it gives good performance on both old and new classes; (3) it has reasonable requirements on model parameters and memory; (4) it is an end-to-end learning mechanism that updates the classifier and the feature representation jointly.
A representative memory component that stores exemplars of the old classes.
A cross-distilled loss, a combination of two loss functions: a distillation loss for the old classes and a cross-entropy classification loss over all classes.
Any deep learning architecture can be adapted to this incremental learning framework; the only requirement is replacing its original loss function with the new incremental loss.
Related Work
Lifelong learning is akin to transferring knowledge acquired on old tasks to the new ones.
Never-ending learning, on the other hand, focuses on continuously acquiring data to improve existing classifiers or to learn new ones.
The proposed method adds an exemplar set to strengthen the knowledge representation of the old classes.
As new classes are added, the size of the original network changes very little.
In iCaRL, the data representation and the classifier are decoupled; here they are learned jointly, in an end-to-end fashion.
Model
To help our model retain the knowledge acquired from the old classes, we use a representative memory (Sec. 3.1) that stores and manages the most representative samples from the old classes.
Representative memory
Two memory setups: (1) the total capacity K is fixed, so the more classes there are, the smaller the memory allotted to each class; (2) the memory per class is fixed, so the more classes there are, the larger K grows.
The memory performs two operations: selection of new samples to store, and removal of leftover samples (sketched below).
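A minimal sketch of how such a memory could be managed under setup (1), with a fixed total budget K. The herding-style selection criterion (keep the samples closest to the class mean in feature space) and the `feature_fn` helper are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np

class RepresentativeMemory:
    """Exemplar memory with a fixed total budget K shared across all classes seen so far."""

    def __init__(self, K=2000):
        self.K = K
        self.exemplars = {}  # class id -> np.ndarray of stored samples, most representative first

    def per_class_capacity(self, num_classes):
        # Setup (1): total K is fixed, so more classes means fewer exemplars per class.
        return self.K // num_classes

    def update(self, new_data, feature_fn):
        """new_data: dict mapping each new class id to an np.ndarray of its samples."""
        num_classes = len(self.exemplars) + len(new_data)
        m = self.per_class_capacity(num_classes)

        # Removal: keep only the first m exemplars of each old class; since they are
        # stored sorted by representativeness, the leftover (least representative) ones go.
        for c in list(self.exemplars):
            self.exemplars[c] = self.exemplars[c][:m]

        # Selection: for each new class, store the m samples closest to the class mean
        # in feature space (herding-style criterion, assumed here for illustration).
        for c, samples in new_data.items():
            feats = np.stack([feature_fn(x) for x in samples])
            order = np.argsort(np.linalg.norm(feats - feats.mean(axis=0), axis=1))
            self.exemplars[c] = samples[order[:m]]
```

Because each class's exemplars are kept sorted by representativeness, removal reduces to truncating each class's list to the new per-class budget.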
Cross-distilled loss function
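The cross-distilled loss combines a cross-entropy classification term over all classes with a distillation term that keeps the network's outputs on the old classes close to those of the previous model. Below is a minimal PyTorch-style sketch, assuming the distillation term compares temperature-softened probabilities with T = 2 (the value used in the experiments); the function name and argument layout are illustrative, not the paper's code.

```python
import torch
import torch.nn.functional as F

def cross_distilled_loss(logits, targets, old_logits, num_old, T=2.0):
    """logits:     [B, num_old + num_new] outputs of the current model.
    targets:    [B] ground-truth labels over all classes seen so far.
    old_logits: [B, num_old] outputs recorded from the previous model for the same batch.
    """
    # Classification term: standard cross-entropy over old + new classes.
    loss_ce = F.cross_entropy(logits, targets)

    # Distillation term: cross-entropy between temperature-softened distributions
    # of the previous model and the current model, restricted to the old classes.
    old_probs = F.softmax(old_logits / T, dim=1)
    new_log_probs = F.log_softmax(logits[:, :num_old] / T, dim=1)
    loss_dist = -(old_probs * new_log_probs).sum(dim=1).mean()

    return loss_ce + loss_dist
```

For the first group of classes there is no previous model, so only the cross-entropy term would apply.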
Implementation Details
Framework: MatConvNet.
Training: 40 epochs, followed by 30 epochs of balanced fine-tuning.
Learning rate: 0.1 for the first 40 epochs, divided by 10 every 10 epochs.
Mini-batch size: 128.
Weight decay: 0.0001.
Momentum: 0.9.
Regularization: L2 regularization and random noise added to the gradients.
Datasets: CIFAR-100 (ResNet-32) and ImageNet ILSVRC 2012 (ResNet-18).
Memory size: K = 2000.
CIFAR-100 preprocessing: normalize by dividing by 255 and subtracting the mean.
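The original implementation is in MatConvNet; the sketch below expresses an equivalent optimizer and learning-rate schedule in PyTorch. Only the hyperparameters listed above come from the notes; `model`, the number of output classes, and `noise_std` are placeholders.

```python
import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import StepLR
from torchvision.models import resnet18  # ImageNet setting; CIFAR-100 uses a ResNet-32

model = resnet18(num_classes=100)  # placeholder; the output layer grows as classes are added

# SGD with momentum 0.9 and weight decay 1e-4 (the L2 regularization).
optimizer = SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)

# Learning rate 0.1, divided by 10 every 10 epochs; call scheduler.step() once per epoch.
scheduler = StepLR(optimizer, step_size=10, gamma=0.1)

def training_step(inputs, targets, loss_fn, noise_std=1e-4):
    """One mini-batch update (batch size 128 in this setting)."""
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    # Random noise on the gradients (noise_std is an assumed value).
    for p in model.parameters():
        if p.grad is not None:
            p.grad.add_(torch.randn_like(p.grad) * noise_std)
    optimizer.step()
    return loss.item()
```

The 30-epoch balanced fine-tuning stage would reuse the same kind of setup on a class-balanced set built from the new data and the exemplar memory.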
Useful words and phrases
state-of-the-art results
catastrophic forgetting
dramatic decrease
when training with new classes added incrementally
using new data and only a small exemplar set corresponding to samples from the old classes.
a loss composed of a distillation measure
Our incremental training is achieved while keeping the entire framework end-to-end, i.e., learning the data representation and the classifier jointly.
targeted at real-world applications
Although some attempts have been made to address this, most of the previous models still suffer from ...
(see Sec. 3.1)
As detailed in Sec. 4
Alternative strategies
The main drawback of all these approaches is ...
Based on our empirical results, we set T to 2 for all our experiments.
Best viewed in color.
Structure
Introduction
challenges
example
traditional models
ideal system
address task in this paper
desired properties of incremental deep learning
existing approaches for incremental learning
main contribution