Self-Supervised Learning: A Comprehensive Guide to Algorithms That Teach Themselves
Self-supervised learning (SSL) is a machine learning technique that leverages unsupervised learning principles for tasks traditionally requiring supervised learning. It is particularly valuable in areas like computer vision and natural language processing (NLP), where state-of-the-art artificial intelligence (AI) models need vast amounts of labeled data. Acquiring such datasets is challenging because annotation by human experts is slow and expensive.
Introduction to Self-Supervised Learning
In essence, SSL aims to minimize or eliminate the need for labeled data, which is often scarce and expensive, by using abundant and relatively inexpensive unlabeled data. The core idea is to generate supervisory signals from the unlabeled data itself, allowing the model to learn meaningful representations without explicit human annotation.
The Core Concepts of Self-Supervised Learning
Pretext Tasks and Downstream Tasks
SSL involves two main categories of tasks: pretext tasks and downstream tasks. In a pretext task, SSL is employed to train an AI system to learn meaningful representations of unstructured data. These learned representations can then be used as input for a downstream task, such as supervised learning or reinforcement learning.
The term "pretext" suggests that the training task is not necessarily useful in itself but rather serves as a means to teach models data representations that are valuable for subsequent downstream tasks. Essentially, pretext tasks yield "pseudo-labels" from unlabeled data.
Self-Supervised Learning vs. Supervised Learning
Supervised learning involves training a model with data that has high-quality manual labels to tune the model weights accordingly. In contrast, self-supervised learning also trains a model against labels, but those labels are generated automatically from the data itself rather than being supplied up front.
Self-Supervised Learning vs. Unsupervised Learning
Neither unsupervised nor self-supervised learning uses labels in the training process. Both methods learn intrinsic correlations and patterns in unlabeled data rather than externally imposed correlations from annotated datasets. However, problems using conventional unsupervised learning do not measure results against any pre-known ground truth. For example, an unsupervised association model could power an e-commerce recommendation engine by learning which products are frequently purchased together.
Self-supervised learning, on the other hand, measures results against a ground truth, albeit one implicitly derived from unlabeled training data. Like supervised models, self-supervised models are optimized using a loss function: an algorithm measuring the divergence ("loss") between ground truth and model predictions.
Self-Supervised Learning vs. Semi-Supervised Learning
Unlike self-supervised learning, which does not involve human-labeled data, semi-supervised learning uses both labeled and unlabeled data to train models. For example, a semi-supervised model might use a small amount of labeled data points to infer labels for the rest of an otherwise unlabeled set of training data, then proceed to use the entire dataset for supervised learning.
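The label-inference step described above can be sketched with a toy nearest-centroid pseudo-labeler. This is an illustrative sketch under simplifying assumptions (two well-separated Gaussian clusters, made-up variable names), not a specific published algorithm:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two Gaussian clusters; only 2 points per cluster carry human labels.
X0 = rng.normal(loc=0.0, size=(50, 2))
X1 = rng.normal(loc=5.0, size=(50, 2))
X = np.vstack([X0, X1])
labeled_idx = np.array([0, 1, 50, 51])
labels = np.array([0, 0, 1, 1])

# Step 1: infer pseudo-labels for every point from the nearest
# centroid of the few human-labeled examples.
centroids = np.array([X[labeled_idx[labels == c]].mean(axis=0) for c in (0, 1)])
dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
pseudo = dists.argmin(axis=1)

# Step 2: the now fully-"labeled" dataset (X, pseudo) can be fed
# to any ordinary supervised learner.
```

In practice a real semi-supervised pipeline would iterate this process and keep only high-confidence pseudo-labels, but the two-step shape, infer labels then train supervised, is the same.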
Techniques in Self-Supervised Learning
Self-supervised learning tasks are designed such that a loss function can use unlabeled input data as ground truth. Several techniques are employed to achieve this:
Autoassociative Self-Supervised Learning (Self-Prediction)
Self-prediction methods, also known as autoassociative self-supervised learning, train a model to predict part of an individual data sample given information about its other parts. An autoencoder is a neural network trained to compress (or encode) input data, then reconstruct (or decode) the original input using that compressed representation. Though autoencoder architectures vary, they typically introduce some form of bottleneck: as data traverses the encoder network, each layer's capacity is progressively reduced.
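A minimal sketch of this idea, using a linear autoencoder in NumPy (a toy illustration, not a production architecture): 8-dimensional inputs are squeezed through a 2-unit bottleneck, and the reconstruction error alone, with no external labels, drives learning.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 samples in 8 dimensions that actually lie on a
# 2-D subspace, so a 2-unit bottleneck can reconstruct them well.
basis = rng.normal(size=(2, 8))
X = rng.normal(size=(200, 2)) @ basis

# Linear autoencoder: encoder (8 -> 2) and decoder (2 -> 8).
W_enc = rng.normal(scale=0.1, size=(8, 2))
W_dec = rng.normal(scale=0.1, size=(2, 8))

def loss(X, W_enc, W_dec):
    recon = X @ W_enc @ W_dec
    return np.mean((X - recon) ** 2)

lr = 0.05
initial = loss(X, W_enc, W_dec)
for _ in range(1000):
    Z = X @ W_enc      # encode: compress through the bottleneck
    recon = Z @ W_dec  # decode: reconstruct the input
    err = recon - X    # the input itself serves as the ground truth
    # Gradient descent on the mean squared reconstruction error.
    grad_dec = (Z.T @ err) / len(X)
    grad_enc = (X.T @ (err @ W_dec.T)) / len(X)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

final = loss(X, W_enc, W_dec)
```

Printing `initial` and `final` shows the reconstruction error dropping as the bottleneck representation is learned; real autoencoders add nonlinearities and deeper stacks, but the self-supervised objective is the same.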
Denoising autoencoders are given partially corrupted input data and trained to restore the original input by removing the useless information ("noise"). In both cases, the objective is to minimize the difference between the input and the reconstructed output. Autoencoders are widely used for image and text data; nonlinear principal component analysis using autoassociative neural networks is another application.
Variational autoencoders (VAEs) are an important tool in image synthesis. OpenAI’s original DALL-E model used a VAE to generate images.
Autoregressive Models
Autoregressive models use past behavior to predict future behavior: autoregression algorithms model sequential data, using the value(s) of previous time steps to predict the value of the next. Autoregression features prominently in causal language models such as the GPT, Llama, and Claude families of LLMs, which excel at tasks like text generation and question answering. Autoregressive modeling has also been used for image synthesis in models like PixelRNN and PixelCNN.
The GPT (Generative Pre-trained Transformer) family of models is pre-trained on the classic language modeling task: predicting the next word given all the previous ones.
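The next-token objective can be illustrated with a deliberately tiny character-level model: the training "labels" are simply the input shifted by one position, so no annotation is needed. This bigram counter is a toy sketch of the objective, not how GPT-style models are actually implemented:

```python
from collections import Counter, defaultdict

text = "self-supervised learning learns from the data itself"

# Pretext task: predict the next character from the current one.
# The labels are just the shifted input -- no human annotation needed.
counts = defaultdict(Counter)
for cur, nxt in zip(text, text[1:]):
    counts[cur][nxt] += 1

def predict_next(ch):
    """Most likely next character under the learned bigram statistics."""
    return counts[ch].most_common(1)[0][0]

print(predict_next("s"))  # -> "e"
```

A causal language model does exactly this at scale: a neural network instead of a count table, tokens instead of characters, and long contexts instead of a single previous symbol.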
Masking
Another self-supervised learning method involves masking certain parts of an unlabeled data sample and tasking models with predicting or reconstructing the missing information. Loss functions use the original (pre-masking) input as ground truth. Masking is also used in the training of masked language models: random words are omitted from sample sentences and models are trained to fill them in.
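Generating such masked training pairs from raw text takes only a few lines. The sketch below uses a `[MASK]` placeholder token in the style of BERT; the key point is that the ground-truth label comes from the sentence itself:

```python
import random

random.seed(0)

def make_masked_example(sentence, mask_token="[MASK]"):
    """Create a (masked input, target) training pair from raw text.
    The target word is the pseudo-label the model must recover."""
    words = sentence.split()
    i = random.randrange(len(words))  # pick a random position to hide
    target = words[i]
    words[i] = mask_token
    return " ".join(words), target

masked, target = make_masked_example("the cat sat on the mat")
print(masked, "->", target)
```

Real masked language modeling pipelines mask about 15 percent of subword tokens per sequence rather than one word per sentence, but the supervisory signal is constructed the same way.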
Though masked language models like BERT (and the many models built on its architecture, like BART and RoBERTa) are often less adept at text generation than autoregressive models, they have the advantage of being bidirectional: they can predict not only the next word but also previous words or words found later in a sequence.
Innate Relationship Prediction
Innate relationship prediction trains a model to maintain its understanding of a data sample after it is transformed in some way.
Contrastive Learning
Contrastive self-supervised learning methods provide models with multiple data samples and task them to predict the relationship between them. Contrastive models generally operate on data-data pairs for training, whereas autoassociative models operate on data-label pairs (in which the label is self-generated from the data). These pairs are often created via data augmentation: applying different kinds of transformations or perturbations to unlabeled data to create new instances or augmented views.
For example, common augmentation techniques for image data include rotation, random cropping, flipping, noising, filtering, and colorization. In computer vision, such methods (like SimCLR or MoCo) typically begin with a batch of unlabeled raw images and apply a random combination of transformations to generate pairs (or sets) of augmented image samples. Instance discrimination methods thus train models to learn representations of different categories that, thanks to random data augmentations, are robust to trivial variations (like the color, perspective, or visible parts of a specific image).
A contrastive learning approach trains a model to distinguish between similar and dissimilar pairs of data points. A popular algorithm in this category is Contrastive Predictive Coding (CPC), which learns representations by predicting future data given the current context.
SimCLR is a simple framework for contrastive learning of visual representations. The model maximizes the agreement between different augmentations of the same image. A SimCLR model is trained to recognize the same image under different transformations, such as rotation, cropping, or color changes.
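The agreement-maximizing objective SimCLR uses is the NT-Xent (normalized temperature-scaled cross-entropy) loss. Below is a NumPy sketch, assuming a batch layout where rows i and i+N of the embedding matrix are the two augmented views of image i; it is a simplified illustration of the loss, not the reference implementation:

```python
import numpy as np

def nt_xent(z, temperature=0.5):
    """NT-Xent loss over 2N embeddings: rows i and i+N are the two
    augmented views of the same image (the positive pair), and every
    other row in the batch serves as a negative."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # cosine similarity
    sim = z @ z.T / temperature
    n2 = len(z)
    np.fill_diagonal(sim, -np.inf)        # a view never pairs with itself
    pos = (np.arange(n2) + n2 // 2) % n2  # index of each row's positive
    log_denom = np.log(np.exp(sim).sum(axis=1))
    return np.mean(log_denom - sim[np.arange(n2), pos])
```

Embeddings whose augmented views coincide incur a much lower loss than embeddings whose views are mismatched, which is exactly the pressure that makes the learned representations invariant to the augmentations.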
Non-Contrastive Learning
Somewhat counterintuitively, "non-contrastive learning" refers to a method closely related to contrastive learning (rather than, as one might guess, a general catch-all for methods that are not contrastive learning). Compared to contrastive learning, non-contrastive approaches are relatively simple: because they operate only on positive samples, they utilize smaller batch sizes for training epochs and don’t need a memory bank to store negative samples.
Cross-Modal Learning
Given data points of different types, or modalities, contrastive methods can learn mappings between those modalities. For example, Contrastive Language-Image Pre-training (CLIP) jointly trains an image encoder and a text encoder to predict which caption goes with which image, using millions of readily available unlabeled (image, text) pairings collected from the internet. Cross-modal retrieval is a related self-supervised technique in which the model is trained to retrieve semantically similar objects across different modalities, such as images and text.
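A CLIP-style objective can be sketched as a symmetric cross-entropy over the batch's image-text similarity matrix. This is a simplified NumPy illustration of the idea, not OpenAI's implementation:

```python
import numpy as np

def clip_style_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive loss in the style of CLIP: row i of
    img_emb and row i of txt_emb come from the same (image, caption)
    pair; every other pairing in the batch acts as a negative."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (N, N) similarity matrix
    n = len(logits)

    def xent(l):
        # Cross-entropy where the matching pair (the diagonal) is correct.
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[np.arange(n), np.arange(n)].mean()

    # Classify the right caption for each image, and vice versa.
    return (xent(logits) + xent(logits.T)) / 2
```

Minimizing this loss pulls each image embedding toward its own caption's embedding and pushes it away from every other caption in the batch, aligning the two modalities in a shared space.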
Applications of Self-Supervised Learning
Self-supervised learning has drawn massive attention for its excellent data efficiency and generalization ability. Recent self-supervised frameworks include pre-trained language models (PTMs), generative adversarial networks (GANs), autoencoders and their extensions, Deep InfoMax, and contrastive coding.
Computer Vision
In computer vision, self-supervised learning algorithms can learn representations by solving tasks such as image reconstruction, colorization, and video frame prediction, among others. Approaches such as contrastive learning and autoencoding have shown promising results in representation learning.
- Image Colorization: A self-supervised learning technique where a model predicts the colored version of an image from its grayscale input.
- Motion and Depth Estimation: A self-supervised learning technique used to predict motion and depth from video frames.
Self-supervised learning is popular due to the availability of large amounts of unlabeled image data.
Depth Estimation
Self-supervised learning is a powerful tool to learn deep networks for depth estimation using only raw data and our knowledge about 3D geometry.
Eigen et al. were the first to show that a calibrated camera and LiDAR sensor rig can be used to turn monocular depth estimation into a supervised learning problem. The setup is simple: a neural network converts an input image into per-pixel distance estimates (a depth map). Accurate LiDAR measurements, reprojected onto the camera image, then supervise the learning of the depth network weights via standard backpropagation of prediction errors, using deep learning libraries like PyTorch.
Building on the work of Godard et al, we explored a self-supervised learning approach that used only images captured by a stereo pair (two cameras next to each other) instead of LiDAR. Images of the same scene captured from different viewpoints are indeed geometrically consistent, and we can use this property to learn a depth network. If it makes a correct prediction on the left image of the stereo pair, simple geometric equations explain how to reconstruct the left image from pixels in the right image only, a task called view synthesis. If the depth prediction is wrong, then the reconstruction will be poor, giving an error, called the photometric loss, to minimize via backpropagation.
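In one dimension, the photometric loss can be sketched as follows. This is a toy illustration with a single scalar disparity and nearest-pixel sampling; real systems predict per-pixel disparities and warp full 2-D images with sub-pixel interpolation:

```python
import numpy as np

def photometric_loss(left, right, disparity):
    """Reconstruct the left view by sampling the right view at pixel
    positions shifted by the predicted disparity, then compare the
    reconstruction with the real left view."""
    cols = np.arange(left.shape[0])
    sample = np.clip(cols - disparity, 0, right.shape[0] - 1).astype(int)
    recon = right[sample]                 # view synthesis
    return np.mean(np.abs(left - recon))  # photometric (L1) error

# A synthetic stereo pair: the right view is the left view shifted by 3 px.
left = np.sin(0.3 * np.arange(80))
right = np.roll(left, -3)

# The correct disparity yields a much smaller photometric error:
# that gap is the signal that supervises the depth network.
good = photometric_loss(left, right, 3)
bad = photometric_loss(left, right, 20)
```

Since disparity is inversely proportional to depth for a calibrated stereo rig, driving this loss down with backpropagation teaches the network depth without a single labeled depth map.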
Note that the network is still monocular: it is trained only on left images, whereas the right image, together with prior knowledge about projective geometry, is used only to self-supervise the learning process. This also differs from most self-supervised learning work in computer vision, which only learns representations: here we learn the full model for the task of depth estimation without any labels at all!
Applications in Robotics
The ability to maintain successful object detection and image segmentation despite changes to an object’s orientation is essential to many robotics tasks.
Natural Language Processing (NLP)
Pre-trained language models (PTMs) are self-supervised learning systems for natural language processing (NLP) in which a model is trained on large amounts of text data to predict missing words or masked tokens.
The Yarowsky algorithm, which bootstraps word-sense disambiguation from a small set of seed examples, is an early example of self-supervised learning in natural language processing.
Medical Imaging
Self-supervised learning is a rapidly growing subset of deep learning techniques used for medical imaging, for which expertly annotated images are relatively scarce. SSL-based methods can often match or exceed the accuracy of models trained using fully supervised methods.
Speech Recognition
Self-supervised learning is particularly well suited to speech: models can be pre-trained on large amounts of unlabeled audio, for example by predicting masked portions of the signal, and then fine-tuned to recognize spoken words or musical notes.
Benefits of Self-Supervised Learning
- Data Efficiency: SSL allows models to learn from vast amounts of unlabeled data, reducing the reliance on expensive and time-consuming labeled datasets.
- Generalization Ability: Models pre-trained with SSL often exhibit improved generalization performance on downstream tasks, even with limited labeled data.
- Adaptability: SSL can adapt to changing data distributions and new tasks more readily than supervised learning.
- Mimicking Human Learning: SSL explores a machine's capability to learn on its own, like humans, by automatically generating labels without any human in the loop. The model itself must decide whether the generated labels are reliable and, accordingly, use them in the next iteration to tune its weights.
- Scalability: SSL works with unstructured data and can train on massive amounts of it.
When fine-tuned using labels for only one percent of the training data, models pre-trained with SSL have achieved over 80 percent accuracy on the ImageNet dataset.
Limitations of Self-Supervised Learning
- Computational Power: SSL models require significant computational resources due to the need to process large amounts of unlabeled data and generate pseudo-labels.
- Accuracy: SSL models may not achieve the same level of accuracy as supervised learning models, especially when the quality of pseudo-labels is limited.
- Pretext Task Selection: Choosing an appropriate pretext task is crucial for successful SSL. The pretext task should encourage the model to learn meaningful representations that are relevant to the downstream task.
- "One Method Solves All" Fallacy: Self-supervised learning aspires to a single approach that works across tasks and domains, but it remains far from that goal; pretext tasks that work well in one domain often transfer poorly to another.
- Error Reinforcement: If the model generates a wrong pseudo-label with a very high confidence score, there is no external signal to correct it; the model keeps believing the prediction is correct and tunes its weights accordingly in subsequent iterations.
The Future of Self-Supervised Learning
Self-supervised learning continues to gain prominence across diverse fields. It is especially valuable in use cases where labeled data is scarce or costly, and its pre-trained representations transfer well to downstream tasks. It holds the potential for many applications that could benefit society and increase opportunities for mobility for all.
AI researchers are prioritizing the development of self-learning mechanisms with unstructured data that can scale the research and development of generic AI systems at a low cost.

