
Image Classification with No Data?

by Clement Wang, Dec 2022



Theoretical review of Machine learning algorithms using less data

Photo by R. Makhecha on Unsplash

Want to build a Machine learning model without much data? Machine learning is known to be data-hungry, while gathering and annotating data takes time and money. This article presents some methods to build an efficient image classifier with far less data!

Introduction

1. Transfer learning

2. Leveraging unlabeled data

3. Few-shot learning

4. Weakly supervised learning and text-based zero-shot classifier

Conclusion

References

Data is always needed to train Machine learning models. Huge datasets are necessary to build robust pipelines, avoid overfitting, and generalize well to new images.

However, given some assumptions, you can greatly decrease the amount of data that you need — annotated or not. This article presents some ways to do image classification without much data or labeled data. I will present the most important aspects of transfer learning, self-supervised learning, semi-supervised learning, active learning, few-shot image classification, and text-based zero-shot classifiers.

Keep in mind that each method I am presenting can be used in other Machine learning problems.

Sections 1 and 2 present ways to build classifiers with around one thousand annotated samples with well-studied methods. Sections 3 and 4 push data efficiency to its limits with hot research topics.

I will not talk about data augmentation and generative networks for the sake of conciseness. For the same reason, I won’t go into mathematical or implementation details. For more information, I advise you to directly read the papers in the reference section. That being said, let’s get started!

Image by the author

Transfer learning

When you train a deep neural network from scratch, you usually initialize its weights randomly. Is this the optimal way to initialize a neural network? The answer is generally no.

First things first, Deep learning is all about representation. In classical Machine learning, features need to be handcrafted. The idea behind Deep learning is that you let your neural network learn a feature representation by itself while training.

Between each layer of your neural network, you have a representation of your input data. The deeper you go into the network, the more global this representation should be. Typically, the first layers of a classifier network are known to detect blobs of color and edges. Middle layers take the first layers’ representations as input and compute more complex concepts: for instance, they might detect the presence of a cat’s eye or a dog’s ear. The last layer gives the probability of the image belonging to each class.

Deep features representation — Image by the author

The idea behind Transfer learning [1] is that some representations learned on another classification task might be useful for your task. Transfer learning consists in taking the first layers of a network pre-trained on another task, adding new layers on top of them, and fine-tuning the whole network on the dataset of interest.

As a comparison, if your goal is to learn to win football games, transfer learning would consist in learning to play basketball first to get used to moving your body, working on your endurance, etc., before starting to play football games.

Transfer learning pipeline — Image by the author

How is this going to affect the performance of your final network? Where should you cut your pre-trained network? These questions are extensively addressed in [1].

To sum up the most important ideas:

  • The first layers of a neural network are very general, while the deepest layers are the most specialized in the pre-training task. Thus, the closer your pre-training task is to your target task, the more layers you can expect to benefit from keeping.
  • Cutting in the middle of the network often results in poor performance, because intermediate layers contain fragilely co-adapted features.
  • Using pre-trained weights is always a better idea than using randomly initialized weights, because the model has already learned features on another task that it would not have learned otherwise.
  • The best performance is achieved when the pre-trained weights are also retrained, possibly with a lower learning rate or by unfreezing them only after a few epochs (see the sketch below).
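
As an illustration, here is a minimal transfer-learning sketch in PyTorch. It assumes torchvision is installed and that `train_loader` and `NUM_CLASSES` correspond to your own dataset; the chosen model and learning rates are placeholders, not recommendations from [1].

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 10  # hypothetical number of target classes

# 1. Load a network pre-trained on ImageNet and replace its classification head.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

# 2. Fine-tune the whole network, but with a lower learning rate for the
#    pre-trained backbone than for the freshly initialized head.
backbone_params = [p for name, p in model.named_parameters() if not name.startswith("fc")]
optimizer = torch.optim.AdamW(
    [
        {"params": backbone_params, "lr": 1e-5},        # pre-trained layers
        {"params": model.fc.parameters(), "lr": 1e-3},  # new head
    ]
)
criterion = nn.CrossEntropyLoss()

# 3. Standard training loop on the small labeled dataset.
# for images, labels in train_loader:
#     optimizer.zero_grad()
#     loss = criterion(model(images), labels)
#     loss.backward()
#     optimizer.step()
```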

Leveraging unlabeled data

Unlabeled data is usually much more accessible than labeled data. It would be a waste not to take advantage of it!

Self-supervised learning

Self-supervised learning [2] tackles the problem of learning deep features from unlabeled data. After training a self-supervised model, the feature extractor can be used just as in Transfer learning, so you still need some annotated data for fine-tuning.

So, how do you train a deep feature extractor from unlabeled data? In short, you need a pretext task that is hard enough to force the network to learn features that are useful for your classification task.

If you want to win football games without playing actual games, you can, for instance, train on juggling a ball as much as possible. Juggling a ball will improve your ball-handling technique which will come in handy when playing games.

Self-supervised learning pipeline [2]

An example of a pretext task is to predict the rotation angle of images. Basically, for each image, you apply a rotation z to get the rotated image x. Then you train a neural network to predict z from x. This rotation prediction task forces your network to deeply understand your data. Indeed, to predict the rotation of an image of a dog, your network will first need to understand that there is a dog in the image and that a dog should be oriented in a particular way.

Example of pretext task: rotation transformation prediction — (Left) [2] | (Right) image by the author
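
As a rough illustration of the rotation pretext task, here is a minimal PyTorch sketch. The encoder, the data loader `unlabeled_loader`, and all hyperparameters are toy placeholders, not the setup used in [2].

```python
import torch
import torch.nn as nn

def random_rotation(images: torch.Tensor):
    """Rotate each (square) image by 0, 90, 180 or 270 degrees and return the label z."""
    z = torch.randint(0, 4, (images.size(0),))
    rotated = torch.stack(
        [torch.rot90(img, k=int(k), dims=(-2, -1)) for img, k in zip(images, z)]
    )
    return rotated, z

encoder = nn.Sequential(  # stand-in feature extractor
    nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)
rotation_head = nn.Linear(64, 4)  # 4 possible rotations
optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(rotation_head.parameters()), lr=1e-3
)
criterion = nn.CrossEntropyLoss()

# for unlabeled_images in unlabeled_loader:      # no labels needed
#     x, z = random_rotation(unlabeled_images)
#     loss = criterion(rotation_head(encoder(x)), z)
#     optimizer.zero_grad(); loss.backward(); optimizer.step()
# After pre-training, keep `encoder` and fine-tune it on your few labeled images.
```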

Pretext tasks can vary a lot depending on your specific goal. Commonly used pretext tasks include:

  • Transformation prediction: A sample from a dataset is modified by a transformation and your network learns to predict the transformation.
  • Masked prediction: A random square of the input image is masked and the network has to predict the masked portion of the image.
  • Instance discrimination: Learn a representation that separates all data samples. For instance, each data point can be considered a class and a classifier can be trained on this task.
Examples of self-supervised pretext tasks [2]

Semi-supervised learning

Semi-supervised learning [4] is very similar to self-supervised learning in that it also involves a small annotated dataset and a large unlabeled dataset. Depending on the definition you take, self-supervised learning can even be seen as a special case of semi-supervised learning. In a stricter definition, however, semi-supervised learning consists in exploiting unlabeled data directly as training samples, and not only as a way to learn meaningful features.

The most popular way to do semi-supervised learning is self-training [3]. Basically, you train a baseline model on the labeled data, run inference on the unlabeled data to get pseudo-labels, keep only the pseudo-labels with the highest confidence, and train on the union of the labeled data and the remaining pseudo-labeled data.

In other words, self-training consists in training a model on labeled data; this first model then becomes a teacher for a student model, the student can in turn become a teacher, and so on. In practice, one or two iterations are usually enough to maximize your metrics.
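
Here is a minimal pseudo-labeling sketch of this loop, assuming `train_model`, `labeled_ds`, and `unlabeled_images` are your own training function and data; the confidence threshold is an arbitrary placeholder.

```python
import torch

CONFIDENCE_THRESHOLD = 0.9  # hypothetical threshold

# 1. Train a teacher on the small labeled dataset.
teacher = train_model(labeled_ds)

# 2. Pseudo-label the unlabeled images and keep only confident predictions.
with torch.no_grad():
    probs = torch.softmax(teacher(unlabeled_images), dim=1)
confidence, pseudo_labels = probs.max(dim=1)
keep = confidence > CONFIDENCE_THRESHOLD
pseudo_ds = torch.utils.data.TensorDataset(unlabeled_images[keep], pseudo_labels[keep])

# 3. Train a student on labeled + confidently pseudo-labeled data
#    (optionally iterate: the student becomes the next teacher).
student = train_model(torch.utils.data.ConcatDataset([labeled_ds, pseudo_ds]))
```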

To continue the football comparison, semi-supervised learning would be like watching recordings of football games with coaches who analyze each play.

Semi-supervised learning with self-training [3]

Semi-supervised and self-supervised learning are known to be more effective than transfer learning, simply because the data in these settings is closer to the target task. Nevertheless, there is no clear winner between the two.

Active learning

All the methods presented above assume that you already annotated some data and that you cannot or do not want to annotate more. Active learning [5] gives you a set of rules to choose which data to annotate to get the best results with the fewest annotated samples.

Which data should you annotate? The answer comes rather naturally: you should annotate the samples that are hard for your model. That way, you force your model to learn more precise features to recognize your classes.

Annotating with Active learning [5]. The “Oracle” is a human annotator.

When learning to win football games, classic training focuses on your flaws. Let’s say you are a forward: you will want to master ball handling, build explosiveness, and develop a great shot. Rather than only playing games, the most efficient way to master all of these is to train each skill separately, starting with your weakest ones.

So now, you might ask how to determine these “hard samples”. This is where it gets complicated. A simple baseline is to label a small random subset of the data, train a model on it, run inference on the remaining unlabeled data, and annotate the samples on which the model is least confident. However, because large networks are often overconfident, it is usually recommended to estimate uncertainty properly rather than using the raw output of the network, for instance with Gaussian processes or Bayesian neural networks.
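
A minimal sketch of the simple baseline described above (least-confidence sampling on the raw softmax output) could look like this; `model` and `unlabeled_images` are assumed to be defined elsewhere, and real pipelines often replace the raw softmax with a better uncertainty estimate.

```python
import torch

N_TO_ANNOTATE = 100  # hypothetical annotation budget

with torch.no_grad():
    probs = torch.softmax(model(unlabeled_images), dim=1)

confidence = probs.max(dim=1).values                  # confidence of the predicted class
hardest = torch.argsort(confidence)[:N_TO_ANNOTATE]   # least confident samples first

# Send `unlabeled_images[hardest]` to the human annotator (the "oracle"),
# add the new labels to the training set, retrain, and repeat.
```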

Few-shot learning

Few-shot learning is about learning from very few samples, typically 1 to 20 per class. While deep learning usually needs at least thousands of samples, few-shot learning tackles the problem of having almost no data at all. This field is currently a hot research topic, so you probably do not want to put it into production right now.

How does it work? How can you learn to classify from so little data? There are many ways to perform few-shot image classification. I will only present a few of them, mainly those with the most important impact on research.

Basic setting: Metric learning and ProtoNet

Remember when I said that Deep learning is all about representation? Since you only have tens of samples, you cannot train a neural network to learn each class from scratch and classify images among them. The idea is instead to compute a distance between a new image and each class, taking advantage of a pre-trained feature extractor. This feature extractor can be trained with any of the methods mentioned above.

Let’s say that you have only one image per class and you want to classify a new, unlabeled image. Given a feature extractor, you can extract a feature representation of each labeled image as well as of the unlabeled image. If the extracted features are good enough, the features of the unlabeled image should be most similar to those of the image from the same class. You can then compute a distance to each class, either with a small neural network or with a regular metric such as the cosine distance or the mean squared error. This is called metric learning [6][7].

Metric learning setting [6]
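
In the one-image-per-class case, a minimal sketch of this nearest-neighbor classification with cosine similarity could look as follows; `feature_extractor`, `support_images` (one per class, in class order), and `query_image` are assumed to be defined elsewhere.

```python
import torch
import torch.nn.functional as F

with torch.no_grad():
    support_feats = F.normalize(feature_extractor(support_images), dim=1)        # (num_classes, d)
    query_feat = F.normalize(feature_extractor(query_image.unsqueeze(0)), dim=1) # (1, d)

similarities = query_feat @ support_feats.T          # cosine similarity to each class
predicted_class = similarities.argmax(dim=1).item()  # index of the most similar support image
```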

If you have several images per class, research papers usually use the ProtoNet setting [8], which consists in computing a prototype for each class: you extract the features of all the images of a class with the pre-trained feature extractor and average them to get the class prototype. Then everything works as if you only had one image per class.

Prototypical networks [8]
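
Extending the previous sketch to several images per class, a ProtoNet-style prototype is simply the mean of the support features of each class; `feature_extractor`, `support_images`, `support_labels`, and `query_image` are again assumed to be defined elsewhere.

```python
import torch
import torch.nn.functional as F

def compute_prototypes(features: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Return one mean feature vector (prototype) per class, in label order."""
    classes = labels.unique(sorted=True)
    return torch.stack([features[labels == c].mean(dim=0) for c in classes])

with torch.no_grad():
    feats = feature_extractor(support_images)
    prototypes = F.normalize(compute_prototypes(feats, support_labels), dim=1)
    query_feat = F.normalize(feature_extractor(query_image.unsqueeze(0)), dim=1)

predicted_class = (query_feat @ prototypes.T).argmax(dim=1).item()
```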

Meta-learning

In parallel to metric learning, meta-learning [9] is a totally different way to perform few-shot learning. Here, the idea is to train a model so that, after being fine-tuned for only a few epochs, it reaches the best possible metrics.

In other words, you want to find the weight initialization from which a few epochs of fine-tuning on little data are enough. A dedicated training scheme is used to find this initialization, which is why meta-learning is also called learning to learn.

This idea of learning to learn can be confusing at first. Meta-learning for football would be training in as many sports as possible (tennis, basketball, rugby, volleyball, weight training, and so on), so that you can then adapt to any sport, including football, with very little training.

Meta-learning to adapt quickly to target tasks [9]

Let’s assume that you have three different tasks with little data. θ* represents the theoretical optimum for each task. Where should you start your training to get the best results after a few epochs of fine-tuning? Meta-learning tries to find the best initialization θ to get the closest to the optimum of all tasks after fine-tuning.
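
To make this loop concrete, here is a heavily simplified first-order sketch in the spirit of Reptile rather than the full second-order MAML algorithm of [9]; `meta_model`, `loss_fn`, and the task sampler are assumed to be defined elsewhere.

```python
import copy
import torch

def adapt(meta_model, loss_fn, support_x, support_y, inner_lr=0.01, inner_steps=5):
    """Clone the meta-model and fine-tune the clone on one task's support set."""
    task_model = copy.deepcopy(meta_model)
    opt = torch.optim.SGD(task_model.parameters(), lr=inner_lr)
    for _ in range(inner_steps):
        opt.zero_grad()
        loss_fn(task_model(support_x), support_y).backward()
        opt.step()
    return task_model

def meta_update(meta_model, task_model, outer_lr=0.1):
    """Move the meta-initialization a small step toward the adapted weights."""
    with torch.no_grad():
        for meta_p, task_p in zip(meta_model.parameters(), task_model.parameters()):
            meta_p += outer_lr * (task_p - meta_p)

# for support_x, support_y in sample_tasks():   # loop over many small tasks
#     task_model = adapt(meta_model, loss_fn, support_x, support_y)
#     meta_update(meta_model, task_model)
```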

I won’t go into mathematical details to avoid getting you lost. I would recommend you read the paper [9] to get a better understanding of it.

The meta-learning idea is really interesting, but in practice it has not reached performance comparable to other methods. Moreover, meta-training your model requires several different training tasks, which is rarely available in practice.

State-of-the-art and limitations

The current best model on the leaderboards is P>M>F [10]. The paper unifies many previous ones and studies the influence of each part of the pipeline: the feature extractor is a Vision Transformer pre-trained with a masked self-supervised method, the model is then meta-trained in a ProtoNet setting, and it finally needs fine-tuning to perform your particular classification task.

Papers with Code benchmark on Mini-ImageNet, 5-way 1-shot

The best model achieves 95% accuracy! Does this mean that we no longer need labeled data? Unfortunately, for the moment, research focuses on academic datasets with unrealistic settings [11]: benchmarks are often computed with a small number of classes (5 in the benchmark of the figure) and with test samples drawn from a uniform distribution. That makes current methods unreliable in real applications. Few-shot image classification is still a challenging field, and it will surely improve over time!

Weakly supervised learning and text-based zero-shot classifier

This section presents CLIP from OpenAI [12]. This architecture revolutionized Deep learning in 2021 by bringing images and text together. Here, zero-shot means that the model can classify images without having been trained specifically on your labeled classes.

CLIP is composed of two encoders, or feature extractors: one for images and one for text. Both encoders map images and texts into the same latent space, which makes it possible to compare an image and a text with a simple cosine distance on the extracted features.

Image by the author

How is it trained? CLIP is trained with weak supervision, which consists in using a massive amount of noisy data: here, the image-caption pairs that can be scraped from the internet. Even though data from the internet is extremely noisy, it turns out that CLIP is quite accurate and has very strong zero-shot capabilities.

Thanks to its massive, if noisy, training dataset, CLIP can classify common images out of the box without any fine-tuning, while being more robust and more accurate than classic classifiers. OpenAI presents a much more detailed analysis on its blog.
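
As an illustration, zero-shot classification with CLIP can be sketched as follows, assuming the official `clip` package from OpenAI's GitHub repository (github.com/openai/CLIP) is installed and `img` is a PIL image; the class names are arbitrary examples.

```python
import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # a smaller CLIP variant

class_names = ["cat", "dog", "horse"]  # hypothetical classes, never seen as labels during training
text = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
image = preprocess(img).unsqueeze(0).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    similarity = image_features @ text_features.T   # cosine similarity to each class prompt

print(class_names[similarity.argmax().item()])
```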

Does CLIP have any drawbacks?! Well, everything comes at a price…

The largest ResNet model, RN50x64, took 18 days to train on 592 V100 GPUs while the largest Vision Transformer took 12 days on 256 V100 GPUs.

Not everyone (actually, nearly no one) can afford to train such a huge neural network, and I will not even talk about the size of the training dataset. In terms of inference, it is all a trade-off between computational resources and accuracy; several smaller versions of CLIP are available.

Though huge and slow, CLIP can for instance be used to generate high-quality pseudo-labels with which to train a smaller network. Moreover, human annotation is usually faster when pseudo-labels are available, as the annotator only has to choose between a few candidate classes instead of tens or hundreds.

Conclusion

Research is so exciting! I find it crazy to be able to do this well without much data. These approaches wouldn’t have been possible without creativity, and creativity unlocks so many possibilities!

To sum up the ideas:

  • Do transfer learning when you have a huge labeled dataset close to your target task.
  • Do self-supervised or semi-supervised learning when you have a lot of unlabeled data and a small annotated subset.
  • Use active learning if you want to annotate your data efficiently.
  • Few-shot image classification can classify images with nearly no annotated data. Even though it is still an active research area, current benchmarks are encouraging.
  • Text-based zero-shot learners might also take a leading role in the future. They can be used out of the box without needing any data.

I did not present transductive models for few-shot learning because I personally think that they are not usable in real applications. You might still want to check them out, as they once dominated the Papers with Code leaderboards.

Don’t forget to check out regularization, data augmentation, and generative models as well; they may solve your problem more easily.

Congrats! You reached the end of this article. I hope you enjoyed reading it. You are now prepared to read all the papers without any fear. Don’t hesitate to comment and reach out!

References

[1] J. Yosinski, J. Clune, Y. Bengio, et al., How transferable are features in deep neural networks? (2014), NIPS 2014

[2] L. Ericsson, H. Gouk, C. C. Loy, et al., Self-Supervised Representation Learning: Introduction, Advances and Challenges (2021), IEEE 2022

[3] Q. Xie, M.-T. Luong, E. Hovy, et al., Self-training with Noisy Student improves ImageNet classification (2020), CVPR 2020

[4] Y. Ouali, C. Hudelot, and M. Tami, An Overview of Deep Semi-Supervised Learning (2020), arXiv preprint arXiv:2006.05278

[5] P. Ren, Y. Xiao, X. Chang, et al., A Survey of Deep Active Learning (2021), ACM Computing Surveys 2022

[6] O. Vinyals, C. Blundell, T. Lillicrap, et al., Matching Networks for One Shot Learning (2016), NIPS 2016

[7] F. Sung, Y. Yang, L. Zhang, et al., Learning to Compare: Relation Network for Few-Shot Learning (2018), CVPR 2018

[8] J. Snell, K. Swersky, and R. S. Zemel, Prototypical Networks for Few-shot Learning (2017), NIPS 2017

[9] C. Finn, P. Abbeel, and S. Levine, Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks (2017), ICML 2017

[10] S. X. Hu, D. Li, J. Stühmer, et al., Pushing the Limits of Simple Pipelines for Few-Shot Learning: External Data and Fine-Tuning Make a Difference (2022), CVPR 2022

[11] E. Bennequin, M. Tami, A. Toubhans, et al., Few-Shot Image Classification Benchmarks are Too Far From Reality: Build Back Better with Semantic Task Sampling (2022), CVPR 2022

[12] A. Radford, J. W. Kim, C. Hallacy, et al., Learning Transferable Visual Models From Natural Language Supervision (2021), PMLR 2021

