MultiMAE: An Inspiration to Leverage Labeled Data in Unsupervised Pre-training | by Shuchen Du | Jul, 2022

Boost your model performance via multimodal masked auto-encoders

Photo by Pablo Arenas on Unsplash

Self-supervised pre-training is a major approach to improving on traditional supervised learning, which needs large amounts of labeled data that are costly to obtain. Among self-supervised methods, contrastive learning is popular for its simplicity and efficacy. However, most contrastive learning methods operate on global vectors, in which pixel-level detail is lost, and this leaves room for improvement when transferring to dense downstream tasks. I recommend my earlier articles [1,2,3] on contrastive learning methods to interested readers.

For transfer learning on dense downstream tasks, we need self-supervised pre-training methods that can recover the details of the whole feature maps, not only pooled global vectors. Only in this way can the detailed distribution of the feature maps be learned, and, more importantly, learned without labels.

Within contrastive learning, some methods have been proposed that train on image patches [4], local features [5], or even individual pixels [6]. However, these methods either rely on momentum encoders and queues [4,5], which are memory-consuming, or on a semi-supervised scheme with pseudo-labels, which is difficult to train [6].

Masked Auto-encoder

A method that does not rely on contrastive thinking was proposed by Kaiming He et al. in 2021 [7]. The approach is similar to BERT [8] in NLP, where the masked tokens of a sentence are classified and trained with a cross-entropy loss. In computer vision, however, masked tokens cannot be classified, because image patterns are practically infinite, whereas NLP tokens come from a finite, pre-defined vocabulary. Masked tokens in computer vision can therefore only be predicted by regression rather than by classification.
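To make the contrast concrete, here is a minimal illustration of the two loss formulations (my own toy PyTorch snippet with made-up tensor shapes, not code from either paper): a cross-entropy classification over a finite vocabulary for a BERT-style masked token, versus a direct MSE regression on pixel values for an MAE-style masked patch.

```python
import torch
import torch.nn.functional as F

# BERT-style: each masked token is classified over a finite vocabulary.
logits = torch.randn(8, 30000)                 # 8 masked tokens, 30k-word vocabulary
token_ids = torch.randint(0, 30000, (8,))      # ground-truth token ids
nlp_loss = F.cross_entropy(logits, token_ids)

# MAE-style: an image patch has no vocabulary entry, so its raw pixel
# values are regressed directly.
pred_patches = torch.randn(8, 16 * 16 * 3)     # predicted 16x16 RGB patches
true_patches = torch.randn(8, 16 * 16 * 3)     # ground-truth pixel values
vision_loss = F.mse_loss(pred_patches, true_patches)
```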

(MAE asymmetric architecture)

The architecture is called the masked auto-encoder (MAE): the unmasked tokens are encoded, and the masked tokens are reconstructed and trained with an MSE loss. It is a simple but effective architecture built on vision transformers [9]. Since the encoder processes only the unmasked tokens, it scales to large input images with light computational overhead. And since the decoder is shallow and the loss is computed only on the masked tokens, the decoder also adds little extra cost. The authors used random masking of tokens, but I think the training budget could instead be concentrated on the tokens that are not yet reconstructed well, in the spirit of focal loss [10].
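As a rough sketch of this asymmetric design (my own simplified PyTorch code, not the authors' implementation; positional embeddings and per-patch target normalization are omitted), the key points are that the encoder sees only the visible tokens and that the MSE loss is computed only over the masked ones:

```python
import torch
import torch.nn as nn


class TinyMAE(nn.Module):
    """Simplified MAE-style sketch: encode only visible patches, reconstruct masked ones."""

    def __init__(self, patch_dim=768, enc_dim=256, dec_dim=128, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.patch_embed = nn.Linear(patch_dim, enc_dim)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(enc_dim, nhead=8, batch_first=True), num_layers=4)
        self.enc_to_dec = nn.Linear(enc_dim, dec_dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dec_dim))
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dec_dim, nhead=4, batch_first=True), num_layers=1)
        self.pred = nn.Linear(dec_dim, patch_dim)    # reconstruct raw patch pixels

    def forward(self, patches):                      # patches: (B, N, patch_dim)
        B, N, D = patches.shape
        num_keep = int(N * (1 - self.mask_ratio))

        # Random per-sample masking: keep the first num_keep of a shuffled order.
        ids_shuffle = torch.rand(B, N, device=patches.device).argsort(dim=1)
        ids_keep, ids_mask = ids_shuffle[:, :num_keep], ids_shuffle[:, num_keep:]

        # The encoder processes visible tokens only, which keeps it cheap.
        visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
        latent = self.encoder(self.patch_embed(visible))

        # A shallow decoder sees encoded visible tokens plus learned mask tokens.
        dec_in = torch.cat(
            [self.enc_to_dec(latent), self.mask_token.expand(B, N - num_keep, -1)], dim=1)
        pred_masked = self.pred(self.decoder(dec_in)[:, num_keep:])

        # The MSE loss is computed on the masked patches only.
        target = torch.gather(patches, 1, ids_mask.unsqueeze(-1).expand(-1, -1, D))
        return ((pred_masked - target) ** 2).mean()


# Toy usage: a batch of 196 patches, each 16x16x3 pixels flattened to 768 values.
loss = TinyMAE()(torch.rand(2, 196, 768))
loss.backward()
```

The asymmetry in this sketch, a deeper encoder over roughly a quarter of the tokens and a one-layer decoder, mirrors the scalability argument above.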

(Reconstruction results of MAE)

We can see that the images are reconstructed plausibly even when 95% of the patches are masked, showing the model's powerful ability to learn fine details.

Multimodal Multi-task Masked Auto-encoders

(MultiMAE architectures)

When a model succeeds, researchers in the community usually extend it in many directions, and multimodal variants are one such extension. Since multimodal models produce more robust features, training one is usually a good choice if you have the data.

In MultiMAE [11], the authors use three modalities: RGB, depth, and semantic segmentation. Since it is difficult to collect a large amount of aligned data across these three modalities, the authors propose using pseudo-labels generated by off-the-shelf models. They also show, however, that models trained with pseudo-labels are less performant than those trained with real labels.

The model is extended as shown above. It is straightforward: patches from each modality are projected to token vectors by modality-wise linear projectors. The token vectors of all three modalities are encoded by a single shared encoder, but decoded separately by modality-wise decoders. After pre-training, the encoder can be fine-tuned in both single-modal and multi-modal settings with the corresponding linear projectors and task-specific heads.
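A minimal sketch of that layout (again my own simplified PyTorch code, not the official implementation; masking and positional embeddings are omitted, and the paper's shallow transformer decoders are reduced to linear heads) could look like this:

```python
import torch
import torch.nn as nn


class TinyMultiMAE(nn.Module):
    """Simplified MultiMAE-style layout: modality-wise projectors, one shared
    encoder, modality-wise decoders. Masking and positional embeddings omitted."""

    def __init__(self, patch_dims, dim=256):
        super().__init__()
        # Modality-wise linear projectors into a shared token space.
        self.projectors = nn.ModuleDict(
            {m: nn.Linear(d, dim) for m, d in patch_dims.items()})
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=4)
        # Modality-wise decoders, reduced to linear heads in this sketch.
        self.decoders = nn.ModuleDict(
            {m: nn.Linear(dim, d) for m, d in patch_dims.items()})

    def forward(self, tokens):              # tokens: dict of modality -> (B, N_m, D_m)
        projected, lengths = [], []
        for m, x in tokens.items():
            projected.append(self.projectors[m](x))
            lengths.append(x.size(1))
        # Tokens of all modalities are concatenated and encoded by the same encoder.
        encoded = self.encoder(torch.cat(projected, dim=1))
        # Split back per modality and decode with the corresponding head.
        chunks = torch.split(encoded, lengths, dim=1)
        return {m: self.decoders[m](c) for m, c in zip(tokens, chunks)}
```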

One interesting thing I noticed about MultiMAE is that labeled data can be leveraged in both the pre-training and the fine-tuning phase. Suppose you have some RGB-semantic pairs for training a semantic segmentation model. Rather than training in the traditional supervised manner, you can pre-train the model with MultiMAE, masking and reconstructing both the RGB images and the semantic maps. Since the model has then learned the distributions of both the RGB and the semantic details, label efficiency can improve considerably in the subsequent single-modal fine-tuning phase. If you lack ground-truth semantic labels for the data-hungry pre-training phase, you can also use pseudo-labels produced by off-the-shelf models, as noted above.
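As a toy illustration of this workflow, reusing the TinyMultiMAE sketch above with random tensors in place of a real dataset (and with masking still omitted), phase one reconstructs both modalities, including the labels, while phase two would reuse only the RGB projector and the shared encoder:

```python
import torch
import torch.nn.functional as F


def patchify(img, p=16):
    """Flatten (B, C, H, W) images into (B, N, C*p*p) patch tokens."""
    B, C, H, W = img.shape
    x = img.unfold(2, p, p).unfold(3, p, p)              # (B, C, H/p, W/p, p, p)
    return x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)


# Phase 1: pre-training that also consumes the segmentation labels. Both the RGB
# image and the one-hot semantic map are reconstructed.
model = TinyMultiMAE(patch_dims={'rgb': 3 * 16 * 16, 'semseg': 4 * 16 * 16})
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

rgb = torch.rand(2, 3, 64, 64)                           # toy RGB batch
labels = torch.randint(0, 4, (2, 64, 64))                # toy 4-class label maps
sem = F.one_hot(labels, 4).permute(0, 3, 1, 2).float()

targets = {'rgb': patchify(rgb), 'semseg': patchify(sem)}
recon = model(targets)
loss = sum(F.mse_loss(recon[m], targets[m]) for m in targets)
loss.backward()
opt.step()

# Phase 2: single-modal fine-tuning would reuse model.projectors['rgb'] and
# model.encoder underneath a task-specific segmentation head, trained on the
# same RGB-semantic pairs (or fewer of them) with a standard supervised loss.
```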

References

[1] Understanding Contrastive Learning and MoCo, 2021

[2] Pixel-level Dense Contrastive Learning, 2022

[3] Contrastive Pre-training of Visual-Language Models, 2022

[4] DetCo: Unsupervised Contrastive Learning for Object Detection, 2021

[5] Dense Contrastive Learning for Self-Supervised Visual Pre-Training, 2021

[6] Bootstrapping Semantic Segmentation with Regional Contrast, 2022

[7] Masked Autoencoders Are Scalable Vision Learners, 2021

[8] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019

[9] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, 2021

[10] Focal Loss for Dense Object Detection, 2018

[11] MultiMAE: Multi-modal Multi-task Masked Autoencoders, 2022

