Class-Incremental Learning

A survey of challenges and recent innovations

Charudatta Manwatkar
Jan 8, 2024

Imagine you have a pre-trained image classification model for dog vs. cat classification, but you don’t have access to the original dataset. Now you have acquired a new dataset containing lions and tigers, and you want to extend your model to classify images among all four classes, i.e. dog vs. cat vs. lion vs. tiger. How would you achieve this?

Keep in mind that at inference time, you will not know whether an unseen image belongs to the (cat/dog) pair or the (lion/tiger) pair. So you can’t just train two separate models and call it a day. Naively fine-tuning the model on the new data is also not a good idea, as we will see later in the article.

Why is this problem practically relevant?

Changing business requirements

In a real-world setting, the problem statement for an ML task often evolves over time. As your organization grows, you may need to enhance the capabilities of existing models to cater to more clients.

Data Silos

Often, different parts of the relevant data cannot be combined directly because of:

  1. Legal Requirements: For example, different organizations may not be allowed to share their data with each other due to privacy laws.
  2. Conflict of interest: Data may be split between organizations that compete with each other. They may be unwilling to pool their data directly, but still want to reap the benefits of joint training.

Thus you may have access to only a subset of classes in the data at a time.

Training Efficiency

Even if you could access the previous data, it would waste a lot of time, money, and effort to retrain your model from scratch every time you want to add new capabilities.

Some Terminology

Let’s look at some common terminology used in the literature.

  • Task-IL: A subset of incremental learning that aims to train a single model to perform multiple related tasks (e.g. translation, question answering, etc.), where the specific task-ID is known at inference time.
  • Class-IL: Another subset of incremental learning that also aims to extend the model’s capabilities, but where the task-ID is not known at inference time. An example is the problem described at the beginning.
  • Learning Task: In Task-IL, it is clear what a ‘task’ refers to. In Class-IL, a ‘task’ refers to a set of classes that the model has to learn together at one time. E.g. (cat/dog) is one task and (lion/tiger) is another in our opening example.
Image source: https://arxiv.org/pdf/2010.15277.pdf

Primitive Solutions

The following are some solutions that you might come up with off the top of your head, but that wouldn’t work very well.

Multiple Models

  1. Train a separate model for each task
  2. At inference time, pass the data to each model and take the class with the highest output value.
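
Here is a minimal PyTorch sketch of that inference step; `task_models` and `task_class_ids` are illustrative names, not from any particular library:

```python
import torch

def multi_model_predict(x, task_models, task_class_ids):
    """Run every per-task model on x and return the globally highest-scoring class.

    task_models    : one trained classifier per task
    task_class_ids : for each model, the global class IDs its outputs correspond to
    """
    scores, ids = [], []
    with torch.no_grad():
        for model, class_ids in zip(task_models, task_class_ids):
            scores.append(model(x))              # (batch, classes_in_this_task)
            ids.append(torch.tensor(class_ids))
    scores = torch.cat(scores, dim=1)            # concatenate scores across all tasks
    ids = torch.cat(ids)
    return ids[scores.argmax(dim=1)]             # class with the highest output wins
```

Note that scores produced by separately trained models are not calibrated against each other, which is exactly where inter-task confusion comes from.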

Pros

  • Training is fast
  • Approach is simple to understand

Cons

  • Inference is very slow (data must be passed through several models).
  • Several models need to be stored.
  • Suffers from inter-task confusion
This is an example of inter-task confusion. One model was first trained on the (square vs. rhombus) task and another on the (circle vs. triangle) task. However, neither model ever learned to discriminate between squares and circles. Source: https://arxiv.org/pdf/2010.15277.pdf

Fixed Representations

  1. Training:
    - Train a classifier for the very first task from scratch.
    - For the new classes, remove the previous classification nodes/layers and add new (untrained) nodes/layers.
    - Freeze the previous layers and only train the new nodes/layers on the new data.
  2. For inference, append the old nodes/layers back.
Source: https://arxiv.org/pdf/1810.12448.pdf
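
A minimal PyTorch sketch of the recipe above, assuming a backbone whose classification head is stored in a `.fc` attribute (as in torchvision’s ResNets):

```python
import torch.nn as nn

def add_new_head(model, num_new_classes):
    """Freeze the trained feature extractor and attach an untrained head for the new classes."""
    for param in model.parameters():
        param.requires_grad = False              # freeze everything learned so far

    old_head = model.fc                          # keep the old classification layer for inference
    model.fc = nn.Linear(old_head.in_features, num_new_classes)  # only this layer is trained
    return model, old_head
```

At inference time, the image features are passed through both heads and the resulting logits are concatenated, so the old classes remain available.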

Pros

  • Training is fast
  • Inference is fast
  • Preserves model performance on old classes

Cons

  • Model performance is poor for newer classes. This is referred to as intransigence.
Model performance by class with Fixed Representation approach. Here, a new class was added every 500 epochs. The model performs well for old classes, but struggles to replicate the same performance for newer classes. Source: https://arxiv.org/pdf/1810.12448.pdf

Fine-tuning

We use the same approach as above, but we train the whole network (instead of just the last layer) when adding new classes.

Pros

Performance improves for new classes

Cons

Performance degrades significantly for old classes. This is also called Catastrophic Forgetting in the literature.

Model performance by class with the naive fine-tuning approach. The model learns to recognize new classes, but its performance on older classes degrades. Source: https://arxiv.org/pdf/1810.12448.pdf

Smarter Solutions

Now let’s look at some of the smarter solutions that researchers have come up with in recent years. These solutions fall into two broad types:

  1. Rehearsal-free solutions: These assume that we have no access to old data whatsoever.
  2. Rehearsal-based solutions: These store a tiny fraction of the old data and use it to balance model training during new tasks.

Note: In both categories, we append new output nodes/layers for new classes, but otherwise keep the same model architecture.

A simple baseline

Researchers from the University of Oxford have come up with a simple but interesting baseline called GDumb (Greedy Sampler and Dumb Learner). They greedily store a small number of examples per class (keeping the memory as class-balanced as possible within a fixed, tiny budget) and simply retrain a multi-class classifier from scratch whenever new classes arrive. This naive approach beats several sophisticated approaches in their experiments.
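
A rough sketch of the greedy, class-balanced sampler in Python (the class and method names are mine, not from the paper’s code):

```python
import random
from collections import defaultdict

class GreedyBalancedMemory:
    """Keep the stored examples as class-balanced as possible within a fixed budget."""

    def __init__(self, budget):
        self.budget = budget
        self.memory = defaultdict(list)          # class label -> stored examples

    def add(self, example, label):
        stored = sum(len(v) for v in self.memory.values())
        if stored < self.budget:
            self.memory[label].append(example)
            return
        # Memory is full: make room by evicting from the currently largest class.
        largest = max(self.memory, key=lambda c: len(self.memory[c]))
        if len(self.memory[largest]) > len(self.memory[label]):
            self.memory[largest].pop(random.randrange(len(self.memory[largest])))
            self.memory[label].append(example)
```

At every task boundary, the “dumb learner” part simply trains a fresh classifier from scratch on whatever the memory contains.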

Elastic Weight Consolidation

This is a rehearsal-free solution.

Core idea: Identify a subset of model weights which are important for a given task and avoid changing these in the subsequent tasks.

Visualization of EWC in the loss landscape. EWC attempts to learn the new task while still keeping the important weights unchanged for the old task. Source: https://arxiv.org/pdf/1612.00796.pdf

The importance of each weight is calculated using the Fisher information matrix of the posterior distribution over model weights.

Here, A and B denote the previous and current tasks respectively, and Fᵢ denotes the diagonal entry of the Fisher information matrix corresponding to weight θᵢ. (Source: https://arxiv.org/pdf/1612.00796.pdf)
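
Written out, the regularized loss from the paper is

L(θ) = L_B(θ) + Σᵢ (λ/2) · Fᵢ · (θᵢ − θ*ᵢ)²

where L_B(θ) is the loss on the current task B, θ*ᵢ is the value of θᵢ learned on task A, and λ controls how strongly the old weights are protected.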

The authors use the fact that, near a minimum, the Fisher matrix is equivalent to the second derivative of the loss. The diagonal values of this matrix from the old task are then used as per-weight regularization constants for the next task.

Link to paper: https://arxiv.org/pdf/1612.00796.pdf
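
A minimal PyTorch-style sketch of the two ingredients (estimating the diagonal Fisher after task A, and the quadratic penalty added to the loss while training on task B); the function names and the per-batch averaging are my own simplifications:

```python
import torch

def fisher_diagonal(model, data_loader, loss_fn):
    """Estimate the diagonal of the Fisher information matrix from squared gradients."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    model.eval()
    for x, y in data_loader:
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
    return {n: f / len(data_loader) for n, f in fisher.items()}

def ewc_penalty(model, old_params, fisher, lam):
    """Quadratic penalty keeping important weights close to their old-task values."""
    penalty = 0.0
    for n, p in model.named_parameters():
        penalty += (fisher[n] * (p - old_params[n]) ** 2).sum()
    return 0.5 * lam * penalty   # added to the new-task loss while training on task B
```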

Path Integral Based Approaches

This is a group of rehearsal-free approaches.

Core idea: In EWC, weight importances are calculated only once the model is fully trained on a task. These approaches instead accumulate weight importances along the entire training trajectory.

Examples: [RWalk] [Memory Aware Synapses]
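
A rough sketch of the bookkeeping, in the spirit of these methods rather than any single paper’s exact formulation: at every optimizer step, accumulate how much each weight’s movement contributed to reducing the loss.

```python
import torch

def train_step_with_importance(model, optimizer, loss, importance):
    """One training step that also accumulates per-weight importance along the path.

    importance: dict mapping parameter name -> running importance tensor (initialized to zeros)
    """
    params_before = {n: p.detach().clone() for n, p in model.named_parameters()}
    optimizer.zero_grad()
    loss.backward()
    grads = {n: p.grad.detach().clone()
             for n, p in model.named_parameters() if p.grad is not None}
    optimizer.step()
    for n, p in model.named_parameters():
        if n in grads:
            # -grad * (weight change) approximates this step's contribution to the loss drop
            importance[n] += -grads[n] * (p.detach() - params_before[n])
    return importance
```

After a task ends, this accumulated importance plays the same role that the Fisher diagonal plays in EWC.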

Learning Without Forgetting

This is a rehearsal-free solution.

Core idea: This is a combination of Knowledge Distillation and Fine-tuning.

Knowledge distillation

  1. Take a trained model for an old task.
  2. Discard the data for the old task.
  3. Record the outputs of the trained model on the new task’s inputs. These input/output pairs allow us to “distill” the knowledge of the old task into the new data.

Fine-tuning

  • The outputs of the layers corresponding to the old task should match the “distilled” outputs.
  • The layers corresponding to new tasks should minimize classification loss.
Learning without forgetting. Source: https://arxiv.org/pdf/1606.09282.pdf

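A minimal sketch of the combined objective (the distillation term here uses the standard temperature-scaled formulation; LwF’s exact loss differs slightly in form but has the same intent):

```python
import torch
import torch.nn.functional as F

def lwf_loss(old_logits_new_model, old_logits_recorded, new_logits, new_labels,
             temperature=2.0, distill_weight=1.0):
    """Learning without Forgetting: distillation on old classes + cross-entropy on new ones."""
    # Old-task outputs of the updated model should match the recorded "distilled" outputs.
    soft_targets = F.softmax(old_logits_recorded / temperature, dim=1)
    log_probs = F.log_softmax(old_logits_new_model / temperature, dim=1)
    distill = F.kl_div(log_probs, soft_targets, reduction="batchmean") * temperature ** 2
    # New-task outputs should minimize the usual classification loss.
    classify = F.cross_entropy(new_logits, new_labels)
    return classify + distill_weight * distill
```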

Learning Without Memorizing

This is a rehearsal-free solution.

Core idea: Similar to Learning Without Forgetting, but use attention maps in addition to final outputs for knowledge distillation.

An attention map is a heat map over the image that tells us which pixels were most important in determining the output. Here, attention maps are generated using the Grad-CAM approach.

The authors say “We hypothesize that attention regions encode the models’ representation more precisely […] instead of finding which base classes are resembled in the new data, attention maps explain ‘why’ hints of a base class are present.”

Source: https://arxiv.org/pdf/1811.08051.pdf
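
A rough sketch of the extra attention-distillation term, assuming you already have a Grad-CAM heat map per image from both the old (frozen) model and the model being trained; the distance used here is the L1 distance between normalized, vectorized maps:

```python
import torch
import torch.nn.functional as F

def attention_distillation_loss(cam_old_model, cam_new_model, eps=1e-8):
    """Penalize differences between the old and new models' attention maps."""
    # Flatten and L2-normalize each heat map so only the spatial pattern matters.
    old = cam_old_model.flatten(1)
    new = cam_new_model.flatten(1)
    old = old / (old.norm(dim=1, keepdim=True) + eps)
    new = new / (new.norm(dim=1, keepdim=True) + eps)
    return F.l1_loss(new, old)   # added to the LwF-style loss during new-task training
```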

Incremental Classifier and Representation Learning (iCaRL)

This is a very important rehearsal-based solution which has had a big impact on the field of Class-IL.

Initial Training

  1. Train a classifier model.
  2. Choose a few “exemplar” images of each class (and save them).
  3. Take their representations from an intermediate layer of the model (i.e. the vector encoding of each image).
  4. Calculate the mean of these encodings for each class (called the “class prototype”).

Inference

  1. Calculate the vector encoding of the unseen image
  2. Find the class prototype nearest to this encoding (as measured by L2 distance)
  3. Classify the image into the class of this prototype.
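
A minimal sketch of this nearest-prototype rule (`encode` stands for the network up to the intermediate layer; iCaRL additionally L2-normalizes the feature vectors, which is reflected here):

```python
import torch
import torch.nn.functional as F

def nearest_prototype_predict(encode, x, class_prototypes):
    """Classify by the nearest class mean in feature space.

    class_prototypes: tensor of shape (num_classes, feature_dim), one mean per class
    """
    with torch.no_grad():
        features = F.normalize(encode(x), dim=1)          # (batch, feature_dim)
        protos = F.normalize(class_prototypes, dim=1)
        distances = torch.cdist(features, protos)          # L2 distance to each prototype
    return distances.argmin(dim=1)                         # index of the nearest prototype
```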

New class addition

  1. Fine-tune the model with images from new classes. Also choose their exemplars.
  2. Choose new exemplars in each old class such that their prototype is as close to their old prototype as possible.

Exemplar management

This method maintains a few images from each class within a fixed total memory budget. As new classes are added, some exemplars from the old classes are removed to make room for exemplars of the new classes.

Source: https://arxiv.org/pdf/1611.07725.pdf
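
A rough sketch of the greedy exemplar selection (“herding”) used by iCaRL: repeatedly pick the image that keeps the running mean of the chosen exemplars closest to the true class mean. The names here are illustrative.

```python
import torch

def select_exemplars(features, m):
    """Greedily choose m exemplars whose mean approximates the class mean.

    features: tensor of shape (num_images, feature_dim) for one class
    """
    class_mean = features.mean(dim=0)
    chosen, chosen_sum = [], torch.zeros_like(class_mean)
    for k in range(1, m + 1):
        # Pick the image that brings the exemplar mean closest to the class mean.
        candidate_means = (chosen_sum + features) / k
        distances = (candidate_means - class_mean).norm(dim=1)
        if chosen:
            distances[chosen] = float("inf")   # don't pick the same image twice
        best = distances.argmin().item()
        chosen.append(best)
        chosen_sum += features[best]
    return chosen
```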

Extension to other areas of ML

The techniques we discussed here were all developed with image classification as the focus. A lot of these ideas are directly portable to other areas like image segmentation, object detection, etc.

However, these areas also present their own unique challenges, which have given rise to many interesting avenues of research. I will try to discuss them in a separate article.

Summary

Class-Incremental Learning is a fascinating and practically important area of research. It comes with its own challenges, such as catastrophic forgetting, intransigence, and inter-task confusion. Researchers have made great progress on these problems by combining techniques like regularization, knowledge distillation, and careful exemplar selection. Learning about these ideas also provides great insight into concepts like model capacity, model compression, and generalization.
