
DINO-ViT — Beyond Self-Supervised Classifications
by Ta-Ying Cheng | Sep 2022



Distill Fine-Grained Features Without Supervision

Figure 1. Self-supervised learning is an important step towards true artificial intelligence. Image retrieved from Unsplash.

Previously, I have written several articles briefly discussing self-supervised learning and, in particular, contrastive learning. What I have not yet covered, however, is a parallel branch of self-supervised approaches based on the interaction of multiple networks, which has recently emerged and excelled. As of today, one of the state-of-the-art training methods is a predominantly knowledge-distillation-based method named DINO, applied to vision transformers (DINO-ViT). The most surprising element of this architecture, however, is not its strong classification performance, but its dense features, which are capable of much more fine-grained tasks such as part segmentation and even finding correspondences across multiple objects.

In this article, we will go over how DINO-ViT is trained, followed by a brief tutorial on how to utilise existing libraries for part co-segmentation and correspondence finding.

The name DINO stands for self-DIstillation with NO labels. As it suggests, DINO applies a variant of the traditional knowledge distillation method to the powerful vision transformer (ViT) architecture. The idea is partly inspired by the technique Bootstrap Your Own Latent (BYOL), which we will cover more thoroughly in upcoming articles.

How does knowledge distillation work?

Figure 2. Overview of the DINO training method. Image retrieved from https://arxiv.org/abs/2104.14294.

In simple terms, the goal of knowledge distillation is to have a student network Ps learn from a teacher network Pt. From a computer vision standpoint, this translates to updating the student network to minimise the cross-entropy loss H between the two networks' output distributions given an image x:

H(Pt(x), Ps(x)) = −Pt(x) · log Ps(x), minimised over the student's parameters.
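As a rough PyTorch sketch of this loss (the temperature values below are illustrative defaults of this sketch, not necessarily the paper's exact settings):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits,
                      student_temp=0.1, teacher_temp=0.04):
    """H(P_t, P_s): cross-entropy between teacher and student outputs.

    Temperatures are illustrative; DINO sharpens the teacher by giving
    it a lower temperature than the student's.
    """
    p_teacher = F.softmax(teacher_logits / teacher_temp, dim=-1)
    log_p_student = F.log_softmax(student_logits / student_temp, dim=-1)
    # H(P_t, P_s) = -sum_k P_t(k) * log P_s(k), averaged over the batch
    return -(p_teacher * log_p_student).sum(dim=-1).mean()
```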

Self-supervised Knowledge Distillation

The following depicts how the distillation method is extended to DINO training:

Given a particular image x, we crop it into two global views and several smaller local views. All crops (global and local) are fed into the student network, but only the global views are fed into the teacher network. We then minimise the cross-entropy between the two networks' outputs, as in standard knowledge distillation. To satisfy this objective, the student must predict, from a small local crop, what the teacher sees in a global view; training thus builds "local-to-global" correspondences within an image, encouraging the student network to learn meaningful image features.
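A simplified sketch of this multi-crop objective, building on the distillation_loss above (the view ordering and uniform averaging here are assumptions of this sketch, not the paper's exact implementation):

```python
def multicrop_loss(student_outputs, teacher_outputs):
    """Average distillation loss over (teacher global, student crop) pairs.

    teacher_outputs: list of 2 tensors, one per global view.
    student_outputs: list of tensors over all crops (2 global + N local),
        ordered so that its first two entries are the same global views.
    Uses distillation_loss from the sketch above.
    """
    total, n_terms = 0.0, 0
    for t_idx, t_out in enumerate(teacher_outputs):
        for s_idx, s_out in enumerate(student_outputs):
            if s_idx == t_idx:
                continue  # skip the pair where both networks see the same crop
            total += distillation_loss(s_out, t_out.detach())
            n_terms += 1
    return total / n_terms
```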

Getting the Teacher from the Student

Unlike the original problem setting of knowledge distillation, where the weights of the teacher network are given a priori, DINO has no pretrained teacher. Instead, the teacher network is derived from past iterations of the student via an exponential moving average of its weights. The teacher's outputs are additionally centred to keep the two models from collapsing to a trivial solution.
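A minimal sketch of these two updates (the momentum values are typical defaults, used here only for illustration; the centre is subtracted from the teacher's logits before its softmax):

```python
import torch

@torch.no_grad()
def update_teacher(student, teacher, momentum=0.996):
    """Teacher weights track an exponential moving average of the student's."""
    for p_s, p_t in zip(student.parameters(), teacher.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)

@torch.no_grad()
def update_center(center, teacher_logits, momentum=0.9):
    """Running mean of the teacher's outputs, used for centring."""
    batch_mean = teacher_logits.mean(dim=0, keepdim=True)
    return center * momentum + batch_mean * (1.0 - momentum)
```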

Figure 3. Attention maps of DINO-ViT. Image retrieved from https://arxiv.org/abs/2104.14294.

Classification has been a standard benchmark for all self-supervised methods, and DINO-ViT is undoubtedly one of the leading scorers. What is truly fascinating about DINO-ViT, however, is the fine-grained dense features it learns via self-distillation. Figure 3 illustrates how, without any supervision, the attention of DINO-ViT focuses surprisingly well on the foreground object. This implies that the high classification accuracy is in fact merely one outcome of representations that can do much more.

What’s Next?

Figure 4. Part co-segmentation results using DINO-ViT features. Image retrieved from https://arxiv.org/abs/2112.05814.

Following DINO-ViT came a paper named Deep ViT Features as Dense Visual Descriptors. In this work, Amir et al. show that the dense patch-wise features of DINO-ViT, combined with simple clustering and other unsupervised techniques, can perform very challenging tasks such as part co-segmentation and correspondence finding (as exemplified in Figure 4). In fact, the features are strong enough to find correspondences even across images of different categories.
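To give a flavour of why dense descriptors enable correspondence finding: once each image is reduced to a set of per-patch feature vectors, matching can be as simple as a nearest-neighbour search in feature space. The sketch below is a bare-bones illustration of that idea only; Amir et al. use more careful mechanisms, such as mutual nearest neighbours, in their actual method.

```python
import torch
import torch.nn.functional as F

def match_patches(feats_a, feats_b):
    """Nearest-neighbour patch correspondences by cosine similarity.

    feats_a: (Na, D) patch descriptors of image A.
    feats_b: (Nb, D) patch descriptors of image B.
    Returns, for each patch of A, the index of its closest patch in B.
    """
    a = F.normalize(feats_a, dim=-1)
    b = F.normalize(feats_b, dim=-1)
    sim = a @ b.T              # (Na, Nb) matrix of cosine similarities
    return sim.argmax(dim=-1)
```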

Using the Dense Features

Amir et al. have published off-the-shelf Google Colab notebooks on their project page for anyone interested. You can also follow the instructions on their GitHub repository to install the required packages and run these methods in batches.

The link to the code can be found on their project page.
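If you simply want to play with the backbone itself, the pretrained DINO models are also published on PyTorch Hub by the DINO authors. A minimal loading sketch (the random tensor below merely stands in for a properly preprocessed image):

```python
import torch

# Load the pretrained ViT-S/8 DINO backbone from the authors' repository.
model = torch.hub.load('facebookresearch/dino:main', 'dino_vits8')
model.eval()

# Dummy 224x224 RGB batch; in practice, resize and normalise a real image
# with the standard ImageNet mean and standard deviation.
img = torch.randn(1, 3, 224, 224)

with torch.no_grad():
    cls_embedding = model(img)  # global image descriptor, shape (1, 384)
print(cls_embedding.shape)
```

Note that this forward pass returns a single global embedding; for the dense, patch-wise descriptors used by Amir et al., their repository provides its own extractor that reads features from intermediate transformer layers.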

The long journey towards true AI seems endless, but self-supervised learning is certainly a great step in the right direction. DINO-ViT has offered insights into how networks learn, and the vast potential behind these learnt features makes it a significant stepping stone for the computer vision domain.

Thank you for making it this far 🙏! I regularly write about different areas of computer vision/deep learning, so join and subscribe if you are interested in knowing more!


