Class-Incremental Learning
A survey of challenges and recent innovations
Imagine you have a pre-trained image classification model for dog vs. cat, but you don’t have access to the original dataset. Now you have acquired a new dataset containing lions & tigers, and you want to extend your model to classify images among all four classes, i.e. dog vs. cat vs. lion vs. tiger. How would you achieve this?
Keep in mind that at inference time, you will not know whether an unseen image belongs to the (cat/dog) pair or the (lion/tiger) pair. So you can’t just train two separate models and call it a day. Naively fine-tuning the model on the new data is also not a good idea, as we will see later in the article.
Why is this problem practically relevant?
Changing business requirements
In a real-world setting, the problem statement for an ML task often evolves over time. As your organization grows, you may need to enhance the capabilities of existing models to cater to more clients.
Data Silos
Often, different parts of the relevant data cannot be combined directly because of:
- Legal Requirements: For example, different organizations may not be allowed to share their data with each other due to privacy laws.
- Conflict of interest: Data may be split between organizations that compete with each other. They may be unwilling to pool their data directly, but still want to reap the benefits of joint training.
Thus you may have access to only a subset of classes in the data at a time.
Training Efficiency
Even if you could access the previous data, it would waste a lot of time, money, and effort to retrain your model from scratch every time you want to add new capabilities to it.
Some Terminology
Let’s look at some common terminology used in the literature.
- Task-IL : A subset of incremental learning with the aim to train a single model that can perform multiple related tasks (e.g. Translation, Question-Answering, etc.), where the specific task-ID is known at inference time.
- Class-IL : Another subset of incremental learning with the aim to extend the model’s capabilities, but where the task-ID is not known at inference time. An example is the problem mentioned at the beginning.
- Learning Task : In the case of Task-IL, it is clear what a ‘task’ refers to. In the case of Class-IL, a ‘task’ refers to a set of classes that the model has to learn together at one time. E.g. (cat/dog) is one task and (lion/tiger) is another task in the example at the beginning.
Primitive Solutions
The following are some solutions that you may come up with off the top of your head, but that wouldn’t work very well:
Multiple Models
- Train a separate model for each task
- At inference time, pass the data to each model and take the class with the highest output value (see the sketch after the pros and cons below).
Pros
- Training is fast
- Approach is simple to understand
Cons
- Inference is very slow (the data needs to pass through several models).
- Several models need to be stored.
- Suffers from inter-task confusion: the outputs of independently trained models are not calibrated against each other.
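A minimal sketch of the inference step, assuming each per-task model is a PyTorch module that outputs logits only for its own classes (the function name is illustrative, not from any paper):

```python
import torch

@torch.no_grad()
def predict_multi_model(x, models):
    """Pass the batch `x` through every task's model and pick the class
    with the highest output value across all of them."""
    all_logits = []
    for model in models:
        model.eval()
        all_logits.append(model(x))        # shape: (batch, classes_in_task)
    logits = torch.cat(all_logits, dim=1)  # concatenate over tasks
    # Note: the logits of independently trained models are not calibrated
    # against each other, which is where inter-task confusion comes from.
    return logits.argmax(dim=1)            # global class index
```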
Fixed Representations
- Training:
- Train a classifier for the very first task from scratch.
- For the new classes, remove previous classification nodes/layers and add new nodes/layers (untrained)
- Freeze the previous layers and only train the new nodes/layers on the new data.
- For inference, append the old nodes/layers back (see the sketch after the cons below).
Pros
- Training is fast
- Inference is fast
- Preserves model performance on old classes
Cons
- Model performance is poor for newer classes. This is referred to as intransigence.
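A minimal sketch of the freeze-and-replace step, assuming a torchvision-style backbone with an `fc` classification head (e.g. a ResNet); the function name and `num_new_classes` are placeholders:

```python
import torch.nn as nn

def prepare_fixed_representation_model(model, num_new_classes):
    # Freeze every existing parameter so the old-task features are preserved.
    for param in model.parameters():
        param.requires_grad = False
    # Replace the head with a fresh, trainable layer for the new classes
    # (the old head is kept aside and appended back at inference time).
    model.fc = nn.Linear(model.fc.in_features, num_new_classes)
    return model
```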
Fine-tuning
We use the same approach as above, but we train the whole network (instead of just the last layer) when adding new classes. A sketch follows the cons below.
Pros
- Performance improves for new classes
Cons
- Performance degrades significantly for old classes. This is also called Catastrophic Forgetting in the literature.
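The same sketch with the freezing removed: every layer stays trainable, which is exactly what lets the new task overwrite the old one (again assuming an `fc`-style head; names are placeholders):

```python
import torch.nn as nn

def prepare_finetuning_model(model, num_new_classes):
    # All parameters remain trainable, so gradients from the new task can
    # overwrite weights that mattered for the old classes.
    for param in model.parameters():
        param.requires_grad = True
    model.fc = nn.Linear(model.fc.in_features, num_new_classes)
    return model
```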
Smarter Solutions
Now, let’s look at some of the smarter solutions that researchers have come up with in recent years. These solutions fall into two broad types:
- Rehearsal-Free Solutions: These solutions assume that we have no access to old data whatsoever.
- Rehearsal-Based Solutions: These solutions store a tiny fraction of old data and use it to balance model training during new tasks.
Note: In both these categories, we will append new layers for new classes, but otherwise keep the same architecture of the model.
A simple baseline
Researchers from the University of Oxford have come up with a simple but interesting baseline called GDumb (Greedy Sampler and Dumb Learner). They store a random fraction of examples from each old class (within a fixed, tiny memory budget) and simply retrain a multi-class classifier from scratch whenever new classes are added. This naive approach beats several sophisticated approaches in their experiments.
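A rough sketch of the greedy, class-balanced sampler under a fixed memory budget (the class name and eviction details are simplified; the “dumb learner” part is just ordinary training from scratch on whatever the memory holds):

```python
import random
from collections import defaultdict

class GreedyBalancedMemory:
    def __init__(self, budget):
        self.budget = budget              # total number of stored samples
        self.store = defaultdict(list)    # class label -> list of samples

    def add(self, sample, label):
        total = sum(len(v) for v in self.store.values())
        num_classes = len(self.store) + (0 if label in self.store else 1)
        if total < self.budget:
            self.store[label].append(sample)
        elif len(self.store[label]) < self.budget // num_classes:
            # Memory is full: evict a random sample from the largest class
            # so the buffer stays roughly class-balanced, then add the new one.
            largest = max(self.store, key=lambda c: len(self.store[c]))
            self.store[largest].pop(random.randrange(len(self.store[largest])))
            self.store[label].append(sample)

# Whenever new classes arrive, GDumb retrains a classifier from scratch on
# the contents of the memory only -- no other continual-learning machinery.
```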
Elastic Weight Consolidation
This is a rehearsal free solution.
Core idea: Identify a subset of model weights which are important for a given task and avoid changing these in the subsequent tasks.
The importance is calculated using the Fisher information matrix of the posterior distribution of the model weights.
The authors use the fact that the Fisher matrix is equivalent to the second derivative of the loss near a minimum. They use the diagonal values of this matrix from the old task as the regularization constants for the next task.
Link to paper: https://arxiv.org/pdf/1612.00796.pdf
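A minimal sketch in PyTorch, where the Fisher diagonal is approximated with squared gradients on the old task’s data and then used as a quadratic penalty during the next task (function names and the `lam` value are illustrative):

```python
import torch

def fisher_diagonal(model, loader, loss_fn):
    # Empirical Fisher: average squared gradients over the old task's data.
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for x, y in loader:
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
    return {n: f / max(len(loader), 1) for n, f in fisher.items()}

def ewc_penalty(model, fisher_diag, old_params, lam=100.0):
    # Weights with large Fisher values are "important" for the old task,
    # so moving them away from their old values is penalized quadratically.
    penalty = 0.0
    for n, p in model.named_parameters():
        if n in fisher_diag:
            penalty = penalty + (fisher_diag[n] * (p - old_params[n]) ** 2).sum()
    return 0.5 * lam * penalty

# During training on the new task:
#   loss = task_loss + ewc_penalty(model, fisher_diag, old_params)
```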
Path Integral Based Approaches
This is a group of rehearsal free approaches.
Core idea: In EWC, the weight importances are calculated only once the model is fully trained on a task. These approaches instead accumulate the weight importances along the entire training trajectory.
Examples: [RWalk] [Memory Aware Synapses]
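A rough sketch of the general idea, closest in spirit to Synaptic Intelligence rather than the exact RWalk or MAS formulations: importance is accumulated inside the ordinary training loop and can then be plugged into an EWC-style penalty.

```python
import torch

def accumulate_importance(model, loader, optimizer, loss_fn):
    """Train on one task while accumulating a per-weight importance score
    along the whole optimization trajectory, not just at the end."""
    omega = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    prev = {n: p.detach().clone() for n, p in model.named_parameters()}
    for x, y in loader:
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()
        for n, p in model.named_parameters():
            if p.grad is not None:
                # A weight gains importance when its update contributed to
                # reducing the loss (negative gradient times parameter change).
                omega[n] += (-p.grad.detach() * (p.detach() - prev[n])).clamp(min=0)
                prev[n] = p.detach().clone()
    return omega  # used like EWC's Fisher diagonal in the next task's penalty
```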
Learning Without Forgetting
This is a rehearsal-free solution.
Core idea: This is a combination of Knowledge Distillation and Fine-tuning.
Knowledge distillation
- Take a trained model for an old task.
- Discard the data for the old task.
- Record the outputs of the trained model over new inputs. These input/output pairs allow us to “distill” the knowledge of the old task into new data.
Fine-tuning
- The outputs of the layers corresponding to the old task should match the “distilled” outputs.
- The layers corresponding to new tasks should minimize classification loss.
Source: https://arxiv.org/pdf/1606.09282.pdf
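A minimal sketch of the combined loss, assuming the current model’s output is split into old-class and new-class logits and `old_logits` are the recorded outputs of the frozen old model on the new inputs (the temperature `T` and weight `alpha` are illustrative):

```python
import torch.nn.functional as F

def lwf_loss(new_logits_old, new_logits_new, old_logits, targets, T=2.0, alpha=1.0):
    # Distillation term: outputs on the old classes should match the
    # temperature-softened outputs recorded from the old model.
    distill = F.kl_div(
        F.log_softmax(new_logits_old / T, dim=1),
        F.softmax(old_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    # Standard classification loss on the new classes.
    ce = F.cross_entropy(new_logits_new, targets)
    return ce + alpha * distill
```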
Learning Without Memorizing
This is a rehearsal-free solution.
Core idea: Similar to Learning Without Forgetting, but use attention maps in addition to final outputs for knowledge distillation.
An attention map is a heat-map over the image that tells us which pixels were most important in determining the output. Here, the attention maps are generated using the Grad-CAM approach.
The authors say “We hypothesize that attention regions encode the models’ representation more precisely […] instead of finding which base classes are resembled in the new data, attention maps explain ‘why’ hints of a base class are present.”
Source: https://arxiv.org/pdf/1811.08051.pdf
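A rough sketch of the attention-distillation term, assuming `features` holds the activations of the last convolutional layer (kept in the autograd graph) and `logits` the corresponding class scores; this follows the general Grad-CAM recipe, and the names are placeholders rather than the paper’s code:

```python
import torch
import torch.nn.functional as F

def grad_cam_map(features, logits, class_idx, create_graph=False):
    # Gradient of the chosen class score w.r.t. the conv feature maps.
    score = logits[:, class_idx].sum()
    grads = torch.autograd.grad(score, features, create_graph=create_graph)[0]
    weights = grads.mean(dim=(2, 3), keepdim=True)   # global-average-pooled
    cam = F.relu((weights * features).sum(dim=1))    # (batch, H, W) heat-map
    return cam.flatten(1)

def attention_distillation_loss(cam_old, cam_new):
    # L1 distance between L2-normalized, vectorized attention maps; added to
    # the usual distillation + classification losses during the new task.
    return (F.normalize(cam_old, dim=1) - F.normalize(cam_new, dim=1)).abs().sum(dim=1).mean()

# Use create_graph=True for the current model's map so this loss can be
# backpropagated; the frozen old model's map needs no graph.
```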
Incremental Classifier and Representation Learning (iCaRL)
This is a very important rehearsal-based solution which has had a big impact on the field of Class-IL.
Initial Training
- Train a classifier model.
- Choose a few “exemplar” images of each class (and save them)
- Consider their representation in an intermediate layer of the model (i.e. the vector encoding of the image).
- Calculate the mean of such encodings for each class (called “class prototype”)
Inference
- Calculate the vector encoding of the unseen image
- Find the class prototype nearest to this encoding (as measured by L2 distance)
- Classify the image into the class of this prototype.
New class addition
- Fine-tune the model with images from new classes. Also choose their exemplars.
- Choose new exemplars for each old class such that their prototype stays as close to the old prototype as possible.
Exemplar management
This method maintains a few images from each class within a fixed total memory budget. As new classes are added, exemplars from old classes are removed to make room for the exemplars of the new classes.
Source: https://arxiv.org/pdf/1611.07725.pdf
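A minimal sketch of the nearest-mean-of-exemplars inference step, assuming `feature_extractor(x)` returns the intermediate encoding mentioned above and `prototypes` is a (num_classes, dim) tensor of per-class means of the stored exemplar encodings (the encodings are L2-normalized here, which iCaRL also does):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def icarl_predict(x, feature_extractor, prototypes):
    feats = F.normalize(feature_extractor(x), dim=1)   # encode the images
    protos = F.normalize(prototypes, dim=1)
    # Classify each image into the class whose prototype is nearest (L2).
    dists = torch.cdist(feats, protos)                 # (batch, num_classes)
    return dists.argmin(dim=1)
```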
Extension to other areas of ML
The techniques we discussed here were all developed with image classification as the focus. A lot of these ideas are directly portable to other areas like image segmentation, object detection, etc.
However, these areas also present challenges unique to them. They have given rise to many interesting avenues of research. I will try to discuss them in a separate article.
Summary
Class-Incremental Learning is a fascinating and practically important area of research. It has its own challenges, like Catastrophic Forgetting, Intransigence, Inter-task Confusion, etc. Researchers have made great progress in dealing with these problems through a combination of techniques like regularization, knowledge distillation, careful exemplar selection, etc. Learning about these ideas will provide great insights into concepts like model capacity, model compression, generalization, and so on.