DINOv2: A Deep Dive into Meta's Self-Supervised Vision Transformer
The field of computer vision is being reshaped by the remarkable progress in self-supervised learning. Leading this revolution is DINOv2, a cutting-edge self-supervised vision transformer developed by Meta AI. DINOv2 epitomizes the power of self-supervised learning, building upon the original DINO framework published by Meta in 2021.
Introduction to DINOv2
DINOv2, short for self-DIstillation with NO labels v2, improves self-supervised visual learning by producing more semantically meaningful patch-level features, without the need for labeled training data. It is a self-supervised Vision Transformer model released by Meta AI in 2023 that learns robust visual features from unlabeled images. It is essentially a foundation model for computer vision, producing universal image features that can be used across many tasks without fine-tuning.
DINOv2 builds on the original DINO method by scaling up training and architecture: it was trained on 142 million diverse images (versus the smaller datasets used previously) and includes new training innovations. Its features achieve state-of-the-art performance on various vision benchmarks without requiring labeled data or task-specific fine-tuning, surpassing even some supervised or text-supervised approaches.
Understanding DINOv2's Core Concepts
Self-Supervised Learning
Traditional deep learning models for computer vision often rely on massive amounts of labeled data, which can be expensive and time-consuming to acquire. Self-supervised learning offers a compelling alternative by enabling models to learn from unlabeled data, leveraging the inherent structure and patterns within the data itself.
Student-Teacher Framework
DINOv2 leverages a student-teacher framework, where a student network learns to mimic the representations learned by a momentum teacher network.
The teacher's output serves as the training target, and the student network is the model we are actually training. This encourages global image-level alignment without needing labels, and it works remarkably well with vision transformers (ViTs).
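The teacher itself is never trained by gradient descent; it is maintained as an exponential moving average (EMA) of the student's weights. The sketch below is purely illustrative: `ema_update` and the toy parameters are hypothetical names, not code from the DINOv2 repository.

```python
import numpy as np

def ema_update(teacher_params, student_params, momentum=0.996):
    """Update each teacher parameter as an exponential moving average
    of the corresponding student parameter (no gradients reach the teacher)."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]

# Toy example: one "parameter" tensor per network.
teacher = [np.array([1.0, 1.0])]
student = [np.array([0.0, 2.0])]
teacher = ema_update(teacher, student, momentum=0.9)
print(teacher[0])  # [0.9 1.1]
```

With a momentum close to 1, the teacher changes slowly, giving the student a stable target to match.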
Key Improvements in DINOv2
DINOv2 builds on the original by making improvements to multiple areas:
- Data curation: Meta started from a pool of 1.2 billion uncurated images. Importantly, they applied deduplication and filtering steps to avoid domain bias and near-duplicate image contamination. After filtering and curation, they were left with a dataset of 142M images.
- Efficiency: DINOv2 introduces efficiency improvements that make training easy to parallelize across GPUs. It also uses an improved implementation of FlashAttention to speed up training.
- Regularization: DINOv2 uses the KoLeo regularizer, a technique that encourages features within a batch to spread uniformly over the hypersphere. This helps prevent feature collapse and forces better coverage of the feature space, resulting in more semantically meaningful representations.
- Loss: The most substantial improvement is the addition of a new patch-level loss, inspired by iBOT, another self-supervised learning method. This patch-level loss allows DINOv2 to learn semantically meaningful representations not just at the image level, but at the patch level as well.
How DINOv2 Works: A Detailed Look
Training Process
A large Vision Transformer (ViT) teacher model (with ~1 billion parameters) is first trained on the images, then a student ViT is trained to match the teacher’s outputs (knowledge distillation). Training leveraged massive scale (fully sharded training across many GPUs, FlashAttention for efficient ViT attention, huge batch sizes ~65k) and an automatic data pipeline to curate a balanced dataset from uncurated web images.
Discriminative Self-Supervised Pre-training
To avoid using predetermined labels, DIstillation with NO labels (DINO) requires a teacher network and a student network. The loss compares the probabilities produced by the student network against those produced by the teacher, but how do we get each value?
Let’s start with one image. We create two different augmented views of it, then give one to the teacher and one to the student. Both networks output tokens, which we feed into a corresponding DINO head (think of it as an MLP) that outputs a vector of “prototype scores” for that token. These prototype scores can be read as the chances that the class token represents each learned feature. We take a softmax of the student vector to get ps; for the teacher, we apply a centering operation (using a running average to determine the center) followed by a softmax to get pt.
With these two probability distributions, we can compute a cross-entropy loss, just as in natural language processing. The student is updated via gradient descent on this loss, while the teacher is built as an exponential moving average of past student weights (with a high momentum coefficient to avoid instability).
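The image-level objective described above can be sketched in a few lines of NumPy. This is a minimal illustration, not Meta's implementation: the temperatures are typical but simplified values, the `center` would in practice be a running mean of teacher outputs, and all names are hypothetical.

```python
import numpy as np

def softmax(x, temperature):
    """Temperature-scaled softmax; lower temperature sharpens the distribution."""
    z = (x - x.max()) / temperature
    e = np.exp(z)
    return e / e.sum()

def dino_loss(student_logits, teacher_logits, center,
              student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between the centered, sharpened teacher distribution
    and the student distribution over prototype scores."""
    p_s = softmax(student_logits, student_temp)
    p_t = softmax(teacher_logits - center, teacher_temp)  # center, then sharpen
    return -np.sum(p_t * np.log(p_s + 1e-9))

# Two augmented views of one image yield two sets of prototype scores.
student_logits = np.array([2.0, 1.0, 0.1])
teacher_logits = np.array([2.5, 0.5, 0.2])
center = np.zeros(3)  # in practice: a running mean of teacher outputs
loss = dino_loss(student_logits, teacher_logits, center)
print(loss)
```

Only the student's parameters receive gradients from this loss; the teacher is updated separately via its moving average.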
DINO here is training our model on the whole image level, but we also want it to understand pieces of the image as well. This is where iBOT comes in.
The Patch-Level Objective from iBOT
DINOv2 incorporates iBOT's patch-level loss to go beyond global image-level alignment. The key idea is to match patch embeddings between teacher and student encoded views, even when parts of the image are masked.
iBOT (Image BERT pre-training with Online Tokenizer) is designed to give us patch-level understanding of the image. The idea is to randomly mask out parts of the image and then see what the model thinks should go there. As we do not have predetermined labels, we again lean on the student-teacher dynamic.
For one image, parts of the image are randomly masked. The masked view goes to the student iBOT head, while the teacher iBOT head sees the content under the mask. Each head outputs prototype scores, which are softmaxed and, in the teacher's case, centered via a moving average. Just like DINO above, the teacher is updated using an exponential moving average with high momentum to maintain stability.
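A simplified sketch of this masked patch objective, assuming per-patch prototype scores are already available. Shapes and names here are illustrative inventions, not from the iBOT codebase; the key point is that the cross-entropy is averaged only over masked patch positions.

```python
import numpy as np

def softmax_rows(x, temperature):
    """Row-wise temperature-scaled softmax over prototype scores."""
    z = (x - x.max(axis=-1, keepdims=True)) / temperature
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ibot_patch_loss(student_patch_logits, teacher_patch_logits, mask, center,
                    student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between teacher and student patch distributions,
    computed only at the masked patch positions."""
    p_s = softmax_rows(student_patch_logits, student_temp)
    p_t = softmax_rows(teacher_patch_logits - center, teacher_temp)
    per_patch = -np.sum(p_t * np.log(p_s + 1e-9), axis=-1)
    return per_patch[mask].mean()

# 4 patches, 3 prototypes; patches 1 and 3 are masked for the student.
rng = np.random.default_rng(0)
student = rng.normal(size=(4, 3))
teacher = rng.normal(size=(4, 3))
mask = np.array([False, True, False, True])
loss = ibot_patch_loss(student, teacher, mask, center=np.zeros(3))
print(loss)
```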
Combining DINO and iBOT
Meta has open-sourced the DINOv2 code base, so we can see exactly how the two loss functions are brought together. The code is fairly complex, so here is an abridged version (the full file is in the DINOv2 GitHub repository).
```python
# Combined loss calculation
loss_accumulator = 0
if do_dino:
    dino_global_crops_loss = self.dino_loss(...)
    loss_accumulator += self.dino_loss_weight * dino_global_crops_loss
if do_ibot:
    ibot_patch_loss = self.ibot_patch_loss.forward_masked(...)
    loss_accumulator += self.ibot_loss_weight * ibot_patch_loss
```

Put simply, we calculate the loss for each method and then use a hyperparameter weight to determine how much each contributes to the final loss. This way, the best combination can be determined empirically to balance global understanding (DINO) with local understanding (iBOT).
Effective Implementation Details
Let’s delve into the crucial effective implementation details that contribute significantly to DINOv2’s remarkable performance. The paper highlights several key techniques which provided significantly better results as compared to DINO.
- The paper leverages fully-sharded data parallelism (FSDP) which distributes both the model and the data across multiple GPUs. This enables training with batches exceeding 65k images.
- Instead of directly applying momentum to the gradients, DINOv2 updates a running average of the normalized gradients.
- DINOv2’s approach: Employs an exponential moving average (EMA) of model weights, effectively accumulating knowledge from past iterations.
- DINOv2’s strategy: Applies centering and sharpening to the teacher network’s output. This helps prevent collapse and promotes more diverse feature representations.
- Leverage custom kernels for specific operations to further enhance performance.
Data Curation: Building a Massive and Diverse Dataset
DINOv2’s success hinges significantly on the quality and scale of its training data. The authors curated a massive dataset of 142 million images, emphasizing diversity and real-world representation.
The authors assembled their own diverse dataset, LVD-142M, by retrieving images from a “large pool of uncurated data” that are close to those in curated datasets such as ImageNet-22k and Google Landmarks. These uncurated data sources are publicly available repositories of crawled web data. In particular, they extract image URLs from HTML `<img>` tags.
While curating the image dataset, the authors discard image URLs that are unsafe or restricted by domains. They also pre-process using standard techniques such as PCA-hash deduplication, NSFW filtering, and blurring of identifiable faces. They additionally employ the copy-detection pipeline of SSCD (Self-Supervised Descriptor for Image Copy Detection) to remove near-duplicate images. A major challenge when dealing with images from the wild is to rebalance concepts and avoid overfitting to a few dominant modes.
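The real pipeline uses SSCD embeddings and large-scale similarity search, but the core idea of embedding-based near-duplicate removal can be sketched with a greedy cosine-similarity threshold. Everything below is an illustrative simplification with made-up names and toy two-dimensional "embeddings".

```python
import numpy as np

def deduplicate(embeddings, threshold=0.95):
    """Greedy near-duplicate removal: keep an image only if its embedding's
    cosine similarity to every already-kept image is below the threshold."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept = []
    for i, vec in enumerate(normed):
        if all(vec @ normed[j] < threshold for j in kept):
            kept.append(i)
    return kept

# Image 1 is a near-duplicate of image 0, so only images 0 and 2 survive.
embs = np.array([[1.0, 0.0], [0.99, 0.01], [0.0, 1.0]])
print(deduplicate(embs))  # [0, 2]
```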
- Cleaning: The authors employed rigorous filtering techniques to remove low-quality images, duplicates, and those with potentially sensitive content.
- Pre-processing: Images were resized to a standard resolution (typically 224×224 pixels) and normalized using standard ImageNet statistics (mean and standard deviation).
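The normalization step can be reproduced with the standard ImageNet channel statistics. The sketch below assumes the image has already been resized to 224×224 (normally done with an image library) and scaled to the [0, 1] range; the function name is an illustrative choice.

```python
import numpy as np

# Standard ImageNet channel statistics (RGB, for values in [0, 1]).
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])

def preprocess(image):
    """Normalize an HxWx3 float image in [0, 1] with ImageNet statistics."""
    return (image - IMAGENET_MEAN) / IMAGENET_STD

img = np.full((224, 224, 3), 0.5)  # a uniform gray image
out = preprocess(img)
print(out.shape)  # (224, 224, 3)
```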
The emphasis on data quality, diversity, and scale, combined with thoughtful pre-processing techniques, contributes significantly to DINOv2’s exceptional performance and its ability to learn generic visual features applicable to a wide range of downstream tasks.
Self-Supervised Training (No Labels)
Most traditional methods for pretraining vision foundation models use a form of text-guided pretraining, i.e., textual supervision in which captions guide the training of image features. This form of pretraining limits the information that can be retained, since image captions only weakly approximate the information present in an image; moreover, pixel-level information is not used during text-guided training. These methods also require image encoders that align text with images. Self-supervised training, on the other hand, uses similarities between images rather than external metadata such as captions. One can thus avoid manual image annotation altogether and train on raw data alone, allowing more flexibility than text-guided counterparts.
Comparison: Self-Supervised vs. Weakly-Supervised (CLIP) Features
The authors also validated the performance of DINOv2 against state-of-the-art weakly supervised models. They report that DINOv2 surpasses OpenCLIP and EVA-CLIP on linear evaluation using large ViT models. Moreover, they report higher scores on held-out test sets, suggesting better generalization through robust visual features.
DINOv2's Applications and Capabilities
Because it learns general-purpose visual features, DINOv2 works impressively well on a wide range of vision tasks: image classification (competitive ImageNet accuracy with just a linear classifier), object detection, semantic segmentation, monocular depth estimation, image/instance retrieval, and even transfer to video understanding tasks. Notably, it produces high-quality segmentation maps and depth predictions without any supervised training, often matching or beating task-specific models.
DINOv2’s pre-trained features can be plugged into numerous computer vision tasks and deliver top-tier results.
Specific Applications
- Image Classification: DINOv2 is suitable for use in image classification tasks. According to Meta Research, the performance of DINOv2 is “competitive or better than the performance of text-image models such as CLIP and OpenCLIP on a wide array of tasks”.
- Object Detection
- Semantic Segmentation: DINOv2 is capable of segmenting objects in an image. Meta Research evaluated DINOv2 against the ADE20K and Cityscapes benchmarks and achieved “competitive results” compared to other relevant models without any fine-tuning.
- Monocular Depth Estimation
- Image/Instance Retrieval: DINOv2 can be used as part of an image information retrieval system that accepts images and returns related images. To do so, one would embed all of the images in a dataset. For each search, the provided image would be embedded and then images with a high cosine similarity to the embedded query image would be returned.
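The retrieval loop described above can be sketched with plain cosine similarity over precomputed embeddings. The toy three-dimensional vectors below stand in for real DINOv2 embeddings, and all names are illustrative, not from any library.

```python
import numpy as np

def retrieve(query_emb, database_embs, top_k=2):
    """Return indices of the database embeddings most similar to the query,
    ranked by descending cosine similarity."""
    q = query_emb / np.linalg.norm(query_emb)
    db = database_embs / np.linalg.norm(database_embs, axis=1, keepdims=True)
    sims = db @ q
    return np.argsort(-sims)[:top_k]

# Toy "embeddings"; in practice these would come from the DINOv2 encoder.
database = np.array([[1.0, 0.0, 0.0],
                     [0.9, 0.1, 0.0],
                     [0.0, 0.0, 1.0]])
query = np.array([1.0, 0.05, 0.0])
print(retrieve(query, database))  # nearest: image 0, then image 1
```

For large image repositories, the same ranking would typically be delegated to an approximate nearest-neighbor index rather than a brute-force matrix product.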
DINOv2 in Action: Real-World Scenarios
- Depth Estimation: DINOv2 can be used to predict the depth of each pixel in an image, achieving state-of-the-art performance on the NYU Depth and SUN RGB-D depth estimation benchmarks. These capabilities are useful for solving many problems. For instance, suppose you want to know how close a forklift is to a parking dock: you could position a camera on the dock and use DINOv2 to estimate the forklift's distance, then wire the output to an alarm that warns the driver before they reach the end of the dock and hit a wall.
- Image Segmentation: DINOv2 is capable of segmenting objects in an image. Consider a scenario where you want to classify images of vehicles on a construction site. You could use DINOv2 to classify vehicles into specified classes using a nearest neighbor approach or linear classification.
- Instance Retrieval: DINOv2 also works well in scenarios where fine-grained object or scene understanding is required and where specific instances of objects or scenes must be identified and located with high precision. This is the perfect scenario for instance retrieval. You could build an image-to-image retrieval system using the out-of-the-box code for loading the model and encoding images. Then, you could use a vector store like faiss to hold embeddings for a repository of images, and a distance measure such as cosine similarity to find images related to a query image.
DINOv2 vs. Other Foundation Models
How does DINOv2 stack up against other prominent foundation models in computer vision? Here we compare it with a few key models: CLIP (OpenAI’s image-text model), MAE (Masked Autoencoder), and iBOT (a previous self-supervised ViT method), among others. Each of these approaches has a different training strategy and thus different strengths.
DINOv2 vs. CLIP and OpenCLIP
DINOv2 does not require fine-tuning for specific tasks unlike other models like CLIP and OpenCLIP. According to Meta Research, the performance of DINOv2 is “competitive or better than the performance of text-image models such as CLIP and OpenCLIP on a wide array of tasks”.
DINOv2 vs. DINOv1
After testing both DINO and DINOv2, we found the latter to provide more accurate attention maps for objects in both ambiguous and unambiguous scenes. DINOv2 allows remarkable properties to emerge, such as a robust understanding of object parts and robust semantic and low-level understanding of images.
DINOv2 works better than DINO because of the following reasons:
- A larger curated training dataset.
- Improvements on the training algorithm and implementation.
- A functional distillation pipeline.
Why DINOv2 Matters
DINOv2 signifies a major advancement in self-supervised learning for computer vision. Its ability to learn powerful visual representations from vast unlabeled data, combined with improved efficiency, establishes it as a key model for diverse applications.
- General-Purpose “Out of the Box” Model: DINOv2 is an important model because it is a general-purpose model whose “frozen features” can be used for any number of tasks, such as image retrieval or depth estimation.
- Strong Transfer to Downstream Tasks: The authors report strong performance when using a linear probe and domain generalization. Moreover, compared to other self-supervised learning methods DINOv2 performs strongly on semantic segmentation, depth estimation and out-of-domain generalization.
- No Labels = More Flexibility: Since DINOv2 was trained using raw image samples with no labels or captions, its use is not limited to image classification. The features generated by this model can be used for classification on custom labels or for video recognition using the same model.
- Robust Semantic and Visual Features: Compared to fine-tuned model architectures, DINOv2 performs on par on semantic segmentation while greatly simplifying the model architecture. Moreover, the authors report that DINOv2's frozen features outperform SOTA self-supervised and weakly supervised features, suggesting robust understanding.
- Reduction in Fine-Tuning Costs: DINOv2 combines many tricks learned from other SSL methods and is more stable at scale. This makes training roughly 2x faster with 3x lower memory consumption, enabling longer training runs with larger batch sizes.
Challenges and Limitations of DINOv2
While DINOv2 is a major leap forward, it’s not without limitations and challenges. It’s important to recognize these, both to set proper expectations and to guide future improvements:
- When evaluated for geographical fairness on the Dollar Street dataset, the authors report a 25.7% performance drop on regions in Africa compared to Europe, suggesting the model is still biased toward Western countries.
- DINOv2 also performs better on high-income households compared to low-income households with a difference of 31.7% on the Dollar Street dataset.
Getting Started with DINOv2
While training DINOv2 from scratch demands substantial computational resources, leveraging its pre-trained weights for feature extraction is readily accessible.
The code behind the DINOv2 paper is available on GitHub. Four checkpoints accompany the code, ranging in size from 84 MB for the smallest model to 4.2 GB for the largest model. There are also training scripts that you can use to train the model on different datasets.
The DINOv2 repository is licensed under a Creative Commons Attribution-NonCommercial 4.0 International license.

