Unboxing DINOv2, Meta’s new all-purpose computer vision backbone

by Michał Oleszak, May 2023




Self-supervised training methods continue to deliver breakthrough after breakthrough. Last week, Meta AI released the second version of their self-DIstillation with NO labels, or DINO, model. The model can supposedly be used as a backbone to solve virtually any computer vision task without fine-tuning! Have foundation models in computer vision finally caught up to the versatility that Large Language Models have enjoyed for some time? Let’s take DINO for a walk to see what it can do!

If you’re mainly interested in playing with the new DINO, feel free to scroll down to the “Testing DINOv2” section. Before that, we look in more detail at the model’s architecture and training routine.

Self-supervision has been gaining popularity in computer vision for a couple of years now. And no surprise: training models without labeled examples lets us tap into a much larger pool of training data, and in applications where labels are hard or expensive to obtain, it can even make training possible where it previously wasn’t.

Models trained in a self-supervised way learn from the images alone, without annotations. Instead, they create their own pseudo-labels from the unlabeled data.

This has been an established practice in NLP for some time now, where language models are often trained to predict the next word in a sentence. Given an input body of text, the features and labels for training can be created automatically.
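To make this concrete, here is a tiny sketch (the text and the context size are made up for illustration) of how next-word prediction turns raw, unlabeled text into training pairs without any human annotation:

text = "self supervised learning creates its own labels from raw data"
tokens = text.split()

context_size = 3
# Each training example pairs a short context with the word that follows it.
pairs = [
    (tokens[i : i + context_size], tokens[i + context_size])
    for i in range(len(tokens) - context_size)
]
print(pairs[0])  # (['self', 'supervised', 'learning'], 'creates')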

In computer vision, however, self-supervised approaches didn’t really take off until a handful of contrastive models from Google and Meta (SimCLR, MoCo, SwAV, and BYOL) showed state-of-the-art results, sometimes matching or even exceeding those of fully supervised models with access to labeled training data. In my earlier work, I showed how MoCo improves the performance of X-ray diagnosis in a setting where annotated training examples are scarce.

In 2021, Meta described their first DINO in the paper titled Emerging Properties in Self-Supervised Vision Transformers. Their model, although inspired by the previously reigning contrastive architectures, took a slightly different approach. Let’s take a look at the original DINO first since its second version is very similar to it.

“DINO” is actually sort of an acronym, standing for self-DIstillation with NO labels. As the name suggests, it combines two learning techniques: self-supervised learning with no labels, which we have already discussed, and knowledge distillation.

Knowledge distillation is a method typically used to compress models. In it, a smaller model (referred to as the “student”) is trained to produce the same predictions as a larger, already-trained model (the “teacher”). If the student learns to mimic the teacher faithfully, we can keep roughly the same performance while using a smaller model.
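As a quick illustration, here is a minimal sketch of the classic distillation objective: a temperature-softened cross-entropy (KL divergence) between the teacher’s and the student’s outputs. The temperature and the tensor shapes are illustrative assumptions, not values from any particular paper.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    # Soften both distributions with a temperature, then push the student's
    # predicted distribution towards the teacher's.
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature ** 2

# Illustrative shapes: a batch of 8 examples with 10 classes.
student_logits = torch.randn(8, 10, requires_grad=True)
teacher_logits = torch.randn(8, 10)
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()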

DINO uses what the authors call self-distillation, in which the two models — the student and the teacher — are effectively the same model: they have the same size and architecture. They only differ in how their parameters get updated during training.

DINO’s training process. Image source: arXiv:2104.14294

To train DINO, we set up two identical networks — the authors originally use Vision Transformers (ViTs). As mentioned, both networks have the same architecture but different parameters.

Then, from each training image, a number of random crops are cut out. Some of these crops cover just a small area of the original image — we will call them local views. Other crops are larger and cover a significant part of the original image — these are global views.

Next, all the crops are passed through the student network, while only the global views are passed through the teacher network. Each network produces latent representations, or embeddings, of the crops it received as inputs. The similarity between the student’s and the teacher’s embeddings is then evaluated with a cross-entropy loss. This idea is borrowed from SwAV, and its goal is to encourage the model to learn global-to-local correspondence.

Finally, the gradients based on the loss are propagated back through the student network to teach it to produce representations similar to those of the teacher. The teacher’s weights, on the other hand, are updated with an exponential moving average of the student’s weights. This idea is based on the MoCo model, but in contrast to it, DINO doesn’t use any memory bank.
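Putting these pieces together, here is a heavily simplified sketch of one DINO-style training step. It omits details of the actual implementation (most notably the centering and sharpening of the teacher outputs and the multi-crop scheduling), and the temperatures, momentum value, and toy networks below are illustrative assumptions.

import copy
import torch
import torch.nn.functional as F

def dino_step(student, teacher, optimizer, global_views, local_views,
              student_temp=0.1, teacher_temp=0.04, momentum=0.996):
    # The teacher only sees the global views; the student sees every crop.
    with torch.no_grad():
        teacher_out = [F.softmax(teacher(v) / teacher_temp, dim=-1) for v in global_views]
    student_out = [F.log_softmax(student(v) / student_temp, dim=-1)
                   for v in global_views + local_views]

    # Cross-entropy between each teacher view and every *other* student view
    # encourages global-to-local correspondence.
    loss, n_terms = 0.0, 0
    for t_idx, t in enumerate(teacher_out):
        for s_idx, s in enumerate(student_out):
            if s_idx == t_idx:
                continue  # skip comparing a global view with itself
            loss = loss + torch.sum(-t * s, dim=-1).mean()
            n_terms += 1
    loss = loss / n_terms

    # Only the student is updated by backpropagation...
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # ...while the teacher tracks the student via an exponential moving average.
    with torch.no_grad():
        for p_t, p_s in zip(teacher.parameters(), student.parameters()):
            p_t.mul_(momentum).add_(p_s, alpha=1 - momentum)
    return loss.item()

# Toy usage with small MLPs standing in for the two ViTs (crops kept the same
# size here purely for simplicity).
student = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 64))
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)

global_views = [torch.randn(8, 3, 32, 32) for _ in range(2)]
local_views = [torch.randn(8, 3, 32, 32) for _ in range(4)]
dino_step(student, teacher, optimizer, global_views, local_views)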

The original DINO paper was titled “Emerging Properties in Self-Supervised Vision Transformers” since the authors were somewhat amazed at the properties that emerged from the model. The DINO backbone turned out to contain information about the semantic segmentation of an image, as well as to deliver strong performance on downstream image classification tasks.

What’s new in V2?

How does DINOv2 differ from its predecessor, I hear you asking. Well, not that much, at least not in terms of the model architecture or training routine. The authors themselves admit in the DINOv2 paper that “most of the technical contributions aim at accelerating and stabilizing the training at scale”.

The one thing that is different is the data that DINOv2 was trained on. So far, most advances in self-supervised learning for vision have been made while pre-training models on relatively small datasets such as the ubiquitous ImageNet, whose limited diversity impedes learning broadly useful features.

The DINOv2 authors built a data pipeline that allowed them to curate a relatively large and diverse dataset. To do this, they employed a clustering algorithm to group candidate images into semantically similar clusters, and then rebalanced the clusters to prevent the model from overfitting to a few dominant modes in the data.
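To give a flavor of the idea (and only a flavor: Meta’s actual pipeline works at web scale with embedding-based deduplication and retrieval), here is a toy sketch of cluster-based rebalancing using k-means on image embeddings; the cluster count, the per-cluster cap, and the helper function are arbitrary assumptions.

import numpy as np
from sklearn.cluster import KMeans

def rebalance_by_clusters(embeddings, n_clusters=100, per_cluster=50, seed=42):
    # Group images by embedding similarity, then cap how many images each
    # cluster contributes so a few dominant visual modes don't dominate the data.
    rng = np.random.default_rng(seed)
    cluster_ids = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(embeddings)

    selected = []
    for c in range(n_clusters):
        members = np.where(cluster_ids == c)[0]
        take = min(per_cluster, len(members))
        selected.extend(rng.choice(members, size=take, replace=False))
    return np.array(selected)  # indices of the curated subset

# Illustrative usage on random vectors standing in for real image embeddings.
fake_embeddings = np.random.randn(10_000, 384)
subset_idx = rebalance_by_clusters(fake_embeddings)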

Testing DINOv2

Let’s put the model to a simple test! The paper claims the DINOv2 backbone can be used as a feature extractor without fine-tuning. Let’s see how well it does.

As the test task, we will have DINO recognize what alphabet a handwritten character comes from using a subset of the Omniglot dataset.

A sample from the Omniglot dataset. Source: https://github.com/brendenlake/omniglot.

Specifically, we will pass 9543 character drawings (964 different characters from 30 different alphabets) through the DINOv2 backbone. Then, we will split the embeddings we get into training and testing sets, and train a logistic regression classifier on top of them to classify the images into one of the 30 alphabets. This evaluation method is known as a linear readout — we just read the embeddings from the frozen backbone and put a single linear layer (or a linear classifier) on top.

This is quite a challenging task: with around 9.6k images and around 960 distinct characters, there are only about 10 images per character (and only around 7 of them end up in the training data — the rest are used for testing). Effectively, we create a few-shot learning problem in which a random classifier would score an accuracy of 1/30, or 3.3%.

Let’s start with setting up a dataloader.

import torch
from torch.utils.data import DataLoader
from torchvision import transforms
from torchvision.datasets import ImageFolder

# Resize to 98x98 — a multiple of DINOv2's patch size of 14.
dataset = ImageFolder(
    "omniglot",
    transform=transforms.Compose([
        transforms.ToTensor(),
        transforms.Resize((98, 98)),
    ]),
)
dataloader = DataLoader(
    dataset, shuffle=True, batch_size=64
)

Then, we load the DINOv2 model. Four different architectures are available from PyTorch Hub, with varying sizes and performance. Let’s use the lightest one, the distilled ViT-S/14 with 21M parameters, and the heaviest distilled one, ViT-L/14 with 300M parameters (there is also an undistilled version with 1,100M parameters, but it’s quite heavy and very close in performance to the 300M-parameter version). Here is the snippet to load the distilled ViT-S/14.

dinov2_vits14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14')
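The other checkpoints follow the same naming pattern on PyTorch Hub; for instance, the distilled ViT-L/14 used in the comparison below can (assuming the same entry-point naming scheme) be loaded with:

dinov2_vitl14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitl14')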

With that done, let’s pass all the images through the DINOv2 backbone and collect the embeddings and their associated target labels.

from tqdm import tqdm

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
dinov2_vits14 = dinov2_vits14.to(device)
dinov2_vits14.eval()  # inference only: we just extract features

all_embeddings, all_targets = [], []

with torch.no_grad():
    for images, targets in tqdm(dataloader):
        images = images.to(device)
        embeddings = dinov2_vits14(images)  # one embedding vector per image
        all_embeddings.append(embeddings)
        all_targets.append(targets)

all_embeddings = torch.cat(all_embeddings, dim=0)
all_targets = torch.cat(all_targets, dim=0)

Next, we split the data into train and test sets and train a logistic regression classifier on top of it.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    all_embeddings.cpu().numpy(),
    all_targets.cpu().numpy(),
    test_size=0.3,
    random_state=42,
)

model = LogisticRegression()
model.fit(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(f'Test accuracy: {test_acc}')

We got a test accuracy of just over 54%. Much better than random guessing, but far from perfect. Let’s see how it compares to the larger, 300M-parameter DINO, and to a ResNet50.

Model comparison: two DINOs and a ResNet.

ResNet50 and the small DINOv2 based on ViT-S/14 are of similar size — DINO is actually even smaller — but DINO yields an accuracy roughly 15 percentage points higher. The larger DINO bumps the score by another 10 to 15 percentage points, that is, to 65–70% accuracy.
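For reference, here is one plausible way to set up the ResNet50 baseline: an ImageNet-pretrained ResNet50 from torchvision (recent versions expose the weights enum used below) with its classification head replaced by an identity, so it outputs 2048-dimensional features that can feed the same logistic regression. Whether this matches the exact setup behind the numbers above is an assumption.

import torch
from torchvision.models import resnet50, ResNet50_Weights

# ImageNet-pretrained ResNet50 with the final classification layer removed,
# so the model returns 2048-dimensional pooled features.
resnet = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2)
resnet.fc = torch.nn.Identity()
resnet.eval()

with torch.no_grad():
    features = resnet(torch.randn(4, 3, 98, 98))  # shape: (4, 2048)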

Is this a good score? Upon getting the results, my first reaction was slight disappointment. Unconsciously, I must have been hoping for an accuracy score in the 90s. But after all, the task is not easy, and we only trained (the equivalent of) a single linear layer on top. DINOv2 definitely does a better job than a similarly sized ResNet, which is often used as a go-to visual feature extractor.

What do you think about these results? Let me know in the comments!

Thanks for reading!

If you liked this post, why don’t you subscribe for email updates on my new articles? And by becoming a Medium member, you can support my writing and get unlimited access to all stories by other authors and yours truly.

Want to always keep your finger on the pulse of the increasingly faster-developing field of machine learning and AI? Check out my new newsletter, AI Pulse. Need consulting? You can ask me anything or book me for a 1:1 here.

You can also try one of my other articles.


