Active Learning in Machine Learning: A Comprehensive Tutorial

In the realm of machine learning, the pursuit of optimal model performance often hinges on the availability of large, meticulously labeled datasets. However, the process of annotating vast amounts of data can be a costly, time-consuming, and labor-intensive undertaking. Active learning emerges as a powerful paradigm to address these challenges, offering a strategic approach to data annotation that optimizes learning efficiency and model accuracy.

Introduction: The Quest for Efficient Learning

Active learning is a supervised machine learning approach that aims to minimize annotation effort by training on a small, carefully chosen set of labeled samples. It is a "human-in-the-loop" machine learning framework that works with a large dataset of which only a small portion is labeled for model training. The Active Learning framework is interactive, and that is how the name "Active" was coined. A useful analogy is a student and a teacher: the student solves problems on their own (predictions on the unlabeled data) and asks the teacher for help only when really stuck (the handful of samples with very low-confidence predictions).

Unlike traditional supervised learning, where models are trained on a fixed, pre-defined dataset of labeled examples, active learning algorithms actively query a human annotator or oracle for the most informative data points to label. By intelligently selecting which instances to label, active learning algorithms can achieve better learning efficiency and performance than passive learning approaches.

Core Principles of Active Learning

At its core, active learning operates through an iterative process of selection, labeling, and retraining. The process typically unfolds as follows:

  1. Initialization: The process begins with a small set of labeled data points, which serve as the starting point for training the model.


  2. Model Training: A machine learning model is trained using the initial labeled data. This model forms the basis for selecting the most informative, unlabeled data points.

  3. Query Strategy: A query strategy guides selecting which data points to label next. Various strategies, such as uncertainty sampling, diversity sampling, or query by committee, can be employed based on the nature of the data and the learning task.

  4. Human Annotation or Human-in-the-Loop: The selected data points are annotated by a human annotator, providing the ground truth labels for these instances.

  5. Model Update: After labeling, the newly annotated data points are incorporated into the training set, and the model is retrained using this augmented dataset. The updated model now benefits from the additional labeled data.

  6. Active Learner Loop: Steps 2 through 5 are repeated iteratively. The model continues to select the most informative data points for labeling, incorporating them into the training set and updating itself until a stopping criterion is met or labeling additional data ceases to provide significant improvements.


Through this iterative process, active learning algorithms make the most of a limited labeling budget, improving learning efficiency and model performance compared to traditional supervised learning methods.
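The six steps above can be sketched in a few lines of Python. This is a minimal illustration, assuming scikit-learn, a synthetic dataset, and simple least-confidence uncertainty sampling; in a real system, step 4 would involve a human annotator rather than reading from the ground-truth array.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# step 1: a small seed set with both classes represented
labeled = list(np.where(y == 0)[0][:5]) + list(np.where(y == 1)[0][:5])
unlabeled = [i for i in range(len(X)) if i not in labeled]

model = LogisticRegression(max_iter=1000)
for _ in range(5):                                   # step 6: the learner loop
    model.fit(X[labeled], y[labeled])                # step 2: train the model
    proba = model.predict_proba(X[unlabeled])
    uncertainty = 1.0 - proba.max(axis=1)            # step 3: query strategy
    query = unlabeled[int(np.argmax(uncertainty))]
    # step 4: a human annotator would supply the label; here we read it
    # from the ground-truth array to keep the sketch runnable
    labeled.append(query)                            # step 5: augment and retrain
    unlabeled.remove(query)
```

Each pass through the loop retrains on the enlarged labeled set, so the budget of five queries is spent on the points the current model finds hardest.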

Active Learning vs. Passive Learning: A Comparative Analysis

Passive learning and active learning are two different approaches to machine learning. In passive learning, the model is trained on a pre-defined labeled dataset, and the learning process is complete once the model is trained. In active learning, informative data points are selected using query strategies rather than drawn from a pre-defined labeled dataset; an annotator then labels them before they are used to train the model.

By iterating this process with informative samples, we steadily improve the performance of the predictive model.

Here are some key differences between active and passive learning:

  • Labeling: In active learning, a query strategy determines which data points to label and annotate; in passive learning, all data is labeled up front.
  • Data Selection: Active learning selects its training data via a query strategy, while passive learning trains on whatever pre-defined dataset it is given.
  • Cost: Active learning requires human annotators, sometimes domain experts depending on the field (e.g., healthcare), although costs can be controlled with automated, AI-based labeling tools and active learning software.
  • Performance: Active learning needs fewer labels because of the impact of informative samples; passive learning needs more data, labels, and time to achieve the same results.
  • Adaptability: Active learning is more adaptable than passive learning, especially with dynamic datasets.

Active learning is a powerful approach for improving the performance of machine learning models by reducing labeling costs and improving accuracy and generalization.


Active Learning vs. Reinforcement Learning

Active Learning and Reinforcement Learning are distinct machine learning algorithms but share some conceptual similarities. As discussed above, active learning is a framework where the learning algorithm can actively choose the data it wants to learn from, aiming to minimize the amount of labeled data required to achieve a desired level of performance.

In contrast, reinforcement learning is a framework where an agent learns by interacting with an environment and receiving feedback in the form of rewards or penalties to learn a policy that maximizes the cumulative reward.

While active learning typically draws from a fixed pool of (mostly unlabeled) training data and uses query strategies to select the most informative points to label, reinforcement learning does not require a pre-defined dataset; it learns by continuously exploring the environment and updating its internal models based on the feedback received.

Advantages of Active Learning: Unveiling the Benefits

There are various advantages to using active learning for machine learning tasks, including:

  • Reduced Labeling Costs: Labeling large datasets is time-consuming and expensive. Active learning reduces labeling costs by selecting only the most informative samples for annotation: those expected to reduce the model's uncertainty the most and therefore provide the most significant improvements to performance. Because fewer samples need to be labeled, the overall labeling cost drops.

  • Improved Accuracy: Active learning improves model accuracy by directing the labeling budget toward the samples expected to reduce the model's uncertainty the most. Concentrating on these samples yields larger accuracy gains per label than labeling samples at random.

  • Faster Convergence: Active learning helps machine learning models converge faster by prioritizing the most informative samples. Traditional training relies on random sampling, or sampling based on fixed criteria, which does not necessarily prioritize informative samples; active learning algorithms identify those samples and add them to the training set first, so the model learns quickly and converges in fewer iterations.

  • Improved Generalization: Active learning helps ML models generalize to new data by selecting the most diverse samples for labeling. Diverse samples, including outliers, cover a broad range of the feature space, ensuring the model learns patterns relevant to a wide range of scenarios rather than overfitting to a narrow region of the data, even when the dataset is large.

  • Robustness to Noise: Active learning can also improve the robustness of machine learning models to noise in the data. By selecting the most informative samples, active learning trains the model on samples that are representative of the entire dataset, so it performs well on both typical data points and outliers.

Active Learning Query Strategies: Methods for Selecting Informative Data Points

Active learning improves the efficiency of the training process by selecting the most valuable data points from an unlabeled dataset. The strategies for selecting these data points, known as query strategies, fall into three broad categories.

Stream-based Selective Sampling

Stream-based selective sampling is a query strategy used in active learning when data is generated continuously, such as in online or real-time data analysis. A model is trained incrementally on a stream of data, and at each step it decides whether to request a label for the incoming sample. A sampling strategy measures the informativeness of each sample and determines which ones the model should request labels for in order to improve its performance. For example, uncertainty sampling selects the samples the model is most uncertain about, while diversity sampling selects the samples most dissimilar to those already seen.

Stream-based sampling is particularly useful in applications where data is continuously generated, such as processing real-time video, where waiting for a batch of data to accumulate before selecting samples for labeling may not be feasible. Instead, the model must continuously adapt to new data and select the most informative samples as they arrive. This approach has several advantages and disadvantages, which should be weighed before selecting this query strategy.

Advantages of Stream-Based Selective Sampling

  • Reduced labeling cost: Stream-based selective sampling reduces the cost of labeling by allowing the algorithm to selectively label only the most informative samples in the data stream. This can be especially useful when the cost of labeling is high and labeling all incoming data is not feasible.

  • Adaptability to changing data distribution: This strategy is highly adaptive to changes in the data distribution. As new data constantly arrives in the stream, the model can quickly adapt to changes and adjust its predictions accordingly.

  • Improved scalability: Stream-based selective sampling allows for improved scalability since it can handle large amounts of incoming data without storing all the data.

Disadvantages of Stream-Based Selective Sampling

  • Potential for bias: Stream-based selective sampling can introduce bias into the model if it only labels certain data types. This can lead to a model that is only optimized for certain data types and may not generalize well to new data.

  • Difficulty in sample selection: This sampling strategy requires careful selection of which samples to label, as the algorithm only labels a small subset of the incoming data. Selecting the wrong samples can result in a model less accurate than one trained on a randomly selected labeled dataset.

  • Dependency on the streaming platform: Stream-based selective sampling depends on the streaming platform and its capabilities.

Pool-based Sampling

Pool-based sampling is a popular method used in active learning to select the most informative examples for labeling. This approach maintains a pool of unlabeled data, and the model selects the most informative examples from this pool to be labeled by an expert or a human annotator. The newly labeled examples are then used to retrain the model, and the process is repeated until the desired level of model performance is achieved. Pool-based sampling can be further categorized into uncertainty sampling, query-by-committee, and density-weighted sampling.
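Of these variants, density-weighted sampling can be sketched with plain NumPy: a point's uncertainty is weighted by its average similarity to the rest of the pool, so isolated outliers are down-ranked. The toy pool, the stand-in uncertainty scores, and the Gaussian similarity kernel below are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
pool = rng.normal(size=(100, 2))            # unlabeled pool of 100 points
uncertainty = rng.uniform(size=100)         # stand-in for model uncertainty scores

# pairwise Gaussian similarities between pool points
d2 = ((pool[:, None, :] - pool[None, :, :]) ** 2).sum(axis=-1)
similarity = np.exp(-d2)
density = similarity.mean(axis=1)           # average similarity to the pool

score = uncertainty * density               # density-weighted informativeness
query = int(np.argmax(score))               # index of the point to label next
```

A point that is both uncertain and sits in a dense region of the pool wins the query, which is exactly the trade-off density weighting is meant to make.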

Advantages of Pool-Based Sampling

  • Reduced labeling cost: Pool-based sampling reduces the overall labeling cost compared to traditional supervised learning methods since it only requires labeling the most informative sample. This can lead to significant cost savings, especially when dealing with large datasets.

  • Efficient use of expert time: Since the expert is only required to label the most informative samples, this strategy allows for efficient use of expert time, saving time and resources.

  • Improves model performance: The selected samples are more likely to be informative and representative of the data, so pool-based sampling can improve the model's accuracy.

Disadvantages of Pool-Based Sampling

  • Selection of the pool of unlabeled data: The quality of the selected data affects the performance of the model, so careful selection of the pool of unlabeled data is essential. This can be challenging, especially for large and complex datasets.

  • Quality of the selection method: The quality of the selection method used to choose the most informative samples can affect the model’s accuracy. The model's accuracy may suffer if the selection method is poorly designed or inappropriate for the data.

  • Not suitable for all data types: Pool-based sampling may not be suitable for all data types, such as unstructured or noisy data. In these cases, other active learning approaches may be more appropriate.

Query Synthesis Methods

Query synthesis methods are a group of active learning strategies that generate new samples for labeling by synthesizing them from the existing labeled data. These methods are useful when the labeled dataset is small and the cost of obtaining new labeled samples is high. One approach to query synthesis is to perturb the existing labeled data, for example by adding noise or applying transformations.
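A minimal sketch of the perturbation approach with NumPy; the tiny labeled set, the number of copies, and the noise scale are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(42)
X_labeled = np.array([[0.1, 0.2], [0.9, 0.8]])   # tiny existing labeled set

def synthesize(X, n_per_sample=3, scale=0.05, rng=rng):
    """Generate perturbed copies of each labeled sample as new queries."""
    repeats = np.repeat(X, n_per_sample, axis=0)
    return repeats + rng.normal(0.0, scale, size=repeats.shape)

candidates = synthesize(X_labeled)
print(candidates.shape)   # (6, 2): 3 synthetic queries per labeled sample
```

The synthesized candidates would then be sent to the annotator, just like queries selected from a pool or a stream.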

Informativeness Measures: Guiding the Selection Process

Several informativeness measures have been proposed to decide which samples in a batch (or data stream) to query the user.

Uncertainty-based Sampling

In an Uncertainty Sampling framework, an active learner queries the instances about which it is least certain how to label, i.e., the samples which lie close to the classification boundary. This approach is often straightforward for probabilistic learning models. For example, when using a probabilistic model for binary classification, uncertainty sampling simply queries the instance whose posterior probability of being positive is nearest 0.5. Alternatively, for multi-class problems, the model's confidence in its prediction can be used as the uncertainty measure.

Another common measure of uncertainty is entropy, computed over the model's entire predicted distribution. Entropy does not favor an instance merely because one of its labels is highly unlikely: the model may be fairly certain that label is not the true one while still being confident in its overall prediction.
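The measures discussed above, together with the closely related margin measure, can be written directly in NumPy. The example posterior is made up for illustration.

```python
import numpy as np

def least_confidence(p):
    """1 minus the top predicted probability; higher = more uncertain."""
    return 1.0 - p.max(axis=1)

def margin(p):
    """Gap between the top two probabilities; smaller = more uncertain."""
    part = np.sort(p, axis=1)
    return part[:, -1] - part[:, -2]

def entropy(p):
    """Shannon entropy over the full distribution; higher = more uncertain."""
    return -(p * np.log(np.clip(p, 1e-12, 1.0))).sum(axis=1)

proba = np.array([
    [0.50, 0.49, 0.01],   # torn between two classes
    [0.98, 0.01, 0.01],   # confident
])
print(least_confidence(proba))  # [0.5  0.02]
```

Note that the first row scores as more uncertain under all three measures, even though one of its labels (0.01) is all but ruled out.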

Uncertainty Sampling with Diversity Maximization (USDM) is a multi-class active learning framework that leverages all the data in the active pool for uncertainty evaluation. USDM is able to globally evaluate the informativeness of the pool data across multiple classes, leading to a more accurate evaluation.

In contrast to traditional active learning algorithms that consider only the uncertainty score, USDM selects data that is both highly uncertain and as diverse as possible, meaning the samples chosen for labeling should be sufficiently different from one another.

Query-By-Committee Sampling

The Query-By-Committee (QBC) approach involves maintaining a committee of models which are all trained on the current labeled set but represent competing hypotheses. Each committee member is then allowed to vote on the labelings of query candidates. The most informative query is considered to be the instance about which they most disagree. This method is inspired by Ensemble Learning frameworks.
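The QBC idea can be sketched with a committee of bootstrap-trained logistic regressions and vote entropy as the disagreement measure; both choices are common illustrations rather than the only options.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_lab, y_lab, X_pool = X[:50], y[:50], X[50:]   # labeled set and unlabeled pool

rng = np.random.default_rng(0)
committee = []
for _ in range(5):                               # bootstrap resamples of the labeled set
    idx = rng.choice(50, size=50, replace=True)
    committee.append(LogisticRegression(max_iter=1000).fit(X_lab[idx], y_lab[idx]))

votes = np.stack([m.predict(X_pool) for m in committee])   # shape (5, 250)
frac = np.clip(votes.mean(axis=0), 1e-12, 1 - 1e-12)       # fraction voting class 1
vote_entropy = -(frac * np.log(frac) + (1 - frac) * np.log(1 - frac))
query = int(np.argmax(vote_entropy))             # most-disagreed-upon pool point
```

The point where the committee's votes are split most evenly maximizes the vote entropy and is the one sent for labeling.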

Since deep architectures have a huge number of trainable parameters, training a committee of full deep networks is often infeasible. One proposed workaround trains a single Convolutional Neural Network (CNN) on a selection of training samples, with the selection handled by a committee of partial CNNs. The committee is built by applying batch-wise dropout to the current full CNN, each dropout run defining one partial CNN, thus reducing the computational cost of the standard QBC technique.

Expected Model-change-based Sampling

Another general active learning framework uses a decision-theoretic approach, which involves selecting the instance that would impart the most significant change to the current model if we knew its label.

One instantiation of this idea is the Expected Model Change Maximization (EMCM) framework, which measures the change as the difference between the current model parameters and the new parameters trained with the enlarged labeled training set. EMCM uses the gradient of the error with respect to a candidate example to estimate the model change.
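The gradient-based estimate can be illustrated for a linear model with squared loss. This simplified estimator, which averages the gradient norm uniformly over the candidate's possible labels, is an assumption made for illustration, not the exact EMCM formulation.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=3)                  # current linear model parameters
pool = rng.normal(size=(40, 3))         # unlabeled candidate examples

def expected_change(x, w, labels=(0.0, 1.0)):
    """Approximate model change for candidate x: norm of the squared-loss
    gradient (pred - y) * x, averaged over the possible (unknown) labels."""
    pred = x @ w
    return np.mean([np.linalg.norm((pred - y) * x) for y in labels])

scores = np.array([expected_change(x, w) for x in pool])
query = int(np.argmax(scores))          # candidate expected to change the model most
```

Candidates whose gradients would pull the parameters furthest score highest, which is the decision-theoretic intuition behind expected-model-change sampling.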

Expected Error Reduction

Expected Error Reduction (EER) measures how much the generalization error is likely to be reduced, rather than how much the model is likely to change as in the previous approach. The idea is to estimate the expected future error, on the remaining unlabeled samples, of a model trained with the existing labeled set plus the sample being considered for a query. The instance with the minimal expected future error (called “risk”) is queried.

EER carries a huge time cost owing to its error-reduction estimation: for each data point, the classifier has to be re-optimized under each of its possible labels, and the labels of the other data points must be re-inferred to compute the expected generalization error. A criterion called approximated error reduction (AER) has been proposed to address this limitation. AER estimates the error reduction of a candidate from its expected impact over all data points and an approximated ratio between the error reduction and the impact over its nearby data points.

tags: #active-learning #machine-learning #tutorial
