Machine Learning Techniques for Fusing Non-IID Datasets

Introduction

In the realm of machine learning, a fundamental assumption often made is that data is Independent and Identically Distributed (IID). This means that each data point is independent of the others and that all points are drawn from the same underlying distribution. However, this assumption often breaks down in real-world scenarios, leading to what is known as non-IID data. This article explores the challenges posed by non-IID data, particularly within the context of federated learning, and surveys machine learning techniques developed to address them.

The Challenge of Non-IID Data

When data is not IID, it introduces complexities that can significantly impact the performance of machine learning models. For instance, in a time-series dataset, the distribution of data may drift over time, rendering older data less relevant to future predictions. Similarly, when collecting image data from the web, one might inadvertently capture multiple frames from the same video, leading to an overrepresentation of the video's content.

Impact on Federated Learning

Federated learning, a distributed machine learning paradigm designed to preserve data privacy, is particularly vulnerable to the effects of non-IID data. In federated learning, multiple entities (clients) collaboratively train a model without sharing their local data. Each client trains a local model on its own dataset, and the trained weights are then sent to a central server for aggregation. The aggregated global model is subsequently distributed back to the clients for further training.

However, the data held by each client is often non-IID, meaning that the data distributions vary across clients. This heterogeneity can lead to slower convergence and a significant drop in the performance of the aggregated global model.
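The aggregation loop described above can be sketched in a few lines (a minimal illustration in which NumPy arrays stand in for model weights; `fedavg_round` and `local_train` are placeholder names, not a real library API):

```python
import numpy as np

def fedavg_round(global_weights, client_datasets, local_train):
    """One communication round of FedAvg-style weight aggregation.

    local_train(weights, dataset) -> (new_weights, n_samples) is assumed
    to run a few epochs of local optimization on the client's data.
    """
    updates, sizes = [], []
    for dataset in client_datasets:
        w, n = local_train(global_weights, dataset)
        updates.append(w)
        sizes.append(n)
    # Weighted average: clients with more data contribute more.
    total = sum(sizes)
    return sum((n / total) * w for w, n in zip(updates, sizes))
```

The weighted average is what makes the global model sensitive to skewed client data: a client with many samples from one class pulls the aggregate toward that class.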

Specific Challenges Posed by Non-IID Data in Federated Learning

  1. Statistical Heterogeneity: Data sampled from different clients exhibits varying distributions.
  2. Convergence Challenges: Slower model convergence, potentially leading to divergence.
  3. Sampling Bias: Models biased toward specific subpopulations.
  4. Adaptability Issues: Data from various clients may change over time.

Techniques for Addressing Non-IID Data

To mitigate the adverse effects of non-IID data, researchers have developed various machine learning techniques, particularly within the context of federated learning. These techniques can be broadly categorized as follows:


1. Distribution Fusion Based Model Aggregation

One approach to tackle non-IID data in federated learning is to infer the unknown global distributions without compromising privacy. FedFusion, a novel data-agnostic distribution fusion-based model aggregation method, optimizes federated learning with non-IID local datasets. This method represents heterogeneous client data distributions using a global distribution comprising several virtual fusion components with varying parameters and weights. A Variational AutoEncoder (VAE) learns the optimal parameters of these fusion components based on limited statistical information extracted from local models. The derived distribution fusion model then optimizes federated model aggregation with non-IID data.

2. Personalized Federated Learning

Personalized federated learning aims to train models that cater to the specific characteristics of each client's data distribution. This approach recognizes that a one-size-fits-all global model may not be optimal for all clients due to data heterogeneity.

3. Ensemble Learning and Knowledge Distillation

Ensemble learning combines the predictions of multiple machine learning models to improve overall performance. In the context of federated learning, ensemble methods can be used to aggregate the knowledge learned by individual clients, even when their data is non-IID.

Federated Distillation Fusion (FedDF) leverages ensemble learning and knowledge distillation. In each communication round, participating clients train their local models, and the server initializes a fusion model via FedAvg. Mini-batches sampled from an unlabeled or generated dataset are then used to train this fusion model: the ensemble of teacher models (the participating clients' local models) is distilled into a single server-side student model by training it to match the averaged logits of the client teachers.
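The server-side distillation objective can be illustrated as follows (a simplified NumPy sketch of the averaged-logits idea, not the FedDF authors' implementation; function names and the temperature parameter are our assumptions):

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def distillation_targets(teacher_logits, T=2.0):
    """Average the logits of the client (teacher) models on a shared
    unlabeled mini-batch, then soften them with temperature T.
    teacher_logits has shape (num_teachers, batch, classes)."""
    avg_logits = np.mean(teacher_logits, axis=0)
    return softmax(avg_logits, T)

def kd_loss(student_logits, targets, T=2.0):
    """KL divergence between the soft teacher targets and the student's
    predictions: the quantity minimized when training the fusion model."""
    q = softmax(student_logits, T)
    return float(np.mean(np.sum(
        targets * (np.log(targets + 1e-12) - np.log(q + 1e-12)), axis=1)))
```

In a full implementation the server would backpropagate `kd_loss` through the student network for several mini-batches per round.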

4. Algorithmic Modifications

Several federated learning algorithms have been developed or adapted to address the challenges of non-IID data. Some notable examples include:


  • FedAvg (Federated Averaging): The vanilla algorithm; the server averages the locally trained weights sent by clients, typically weighting each client by the size of its local dataset.
  • FedProx (Federated Learning with Proximal Term): Extends FedAvg by adding a proximal term to the local optimization objective, penalizing divergence from the current global model to mitigate the impact of non-IID data. Clients train on their local datasets with this modified objective, and the central server aggregates the updates as in FedAvg.
  • FedNova (Federated Normalized Averaging): Normalizes local updates before aggregation to correct the objective inconsistency that arises when clients perform different numbers of local steps, improving convergence under non-IID data.
  • SCAFFOLD: Uses control variates to estimate and correct client-side drift in local updates, a problem exacerbated by non-IID data.
  • FedAdam: An adaptation of the Adam optimizer applied on the server side during aggregation, which can improve convergence in the presence of non-IID data.
  • FedMA (Federated Matched Averaging): Constructs the global model layer by layer, matching and averaging hidden units with similar feature-extraction signatures rather than averaging weights coordinate-wise.
  • FedDP: Federated learning with differential privacy, which helps protect the privacy of client data while still allowing effective model training.
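As an illustration of how FedProx modifies the local objective, a single local update step might look like this (a minimal sketch; `grad_fn`, `lr`, and `mu` are illustrative placeholders, not names from the FedProx paper or any library):

```python
import numpy as np

def fedprox_local_step(w, w_global, grad_fn, lr=0.1, mu=0.01):
    """One gradient step on the FedProx local objective:
        F_k(w) + (mu / 2) * ||w - w_global||^2
    The proximal term pulls the local model toward the current global
    weights, limiting client drift under non-IID data.
    grad_fn(w) returns the gradient of the client's local loss F_k."""
    grad = grad_fn(w) + mu * (w - w_global)
    return w - lr * grad
```

Setting `mu = 0` recovers plain local SGD as used in FedAvg; larger `mu` trades local fit for closeness to the global model.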

5. Data Partitioning Strategies

Comprehensive data partitioning strategies are crucial to cover the typical non-IID data setting in Federated learning. These strategies encompass various types of skews that can be observed in non-IID data, including:

  • Label Skew: An unequal distribution of labels or classes amongst the participating clients.
    • Quantity-based label imbalance: Different sets of labels are randomly assigned to clients, and the samples of each label are then randomly and equally divided amongst the parties that own that label.
  • Feature Skew: Feature distributions vary across parties.
    • Noise-based feature imbalance: The entire dataset is divided amongst the parties randomly and equally, and a different level of noise (e.g., Gaussian) is then added to each party's data to differentiate their feature distributions.
    • Synthetic feature imbalance: A synthetic federated dataset is created by distributing data points in a cube that is partitioned into 8 parts by three planes, where each part contains data points of a particular label. Each party is then allocated a subset drawn from two parts that are symmetric about the origin (0, 0, 0).
    • Real-world feature imbalance: Data collected from different sources tends to have a natural variance in its feature distribution.
  • Quantity Skew: Each party is allocated a varying size of the local dataset.
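Quantity-based label imbalance, for example, can be simulated with a short partitioning routine (a sketch of the scheme described above; the function and parameter names are ours):

```python
import numpy as np

def quantity_based_label_skew(labels, n_clients, labels_per_client, seed=0):
    """Quantity-based label imbalance: each client is assigned a fixed
    number of label classes; the samples of each class are then split
    equally among the clients that own that class."""
    rng = np.random.default_rng(seed)
    classes = np.unique(labels)
    # Randomly assign `labels_per_client` classes to each client.
    owners = {c: [] for c in classes}
    for client in range(n_clients):
        chosen = rng.choice(classes, size=labels_per_client, replace=False)
        for c in chosen:
            owners[c].append(client)
    parts = [[] for _ in range(n_clients)]
    for c in classes:
        idx = np.where(labels == c)[0]
        rng.shuffle(idx)
        if owners[c]:  # split this class's samples among its owners
            for i, chunk in enumerate(np.array_split(idx, len(owners[c]))):
                parts[owners[c][i]].extend(chunk.tolist())
    return parts
```

Each returned part indexes into the original dataset, so the same routine works for any array-backed dataset.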

6. Federated Labels (FedLbl)

This method aggregates client model updates by weighting them based not only on the amount of data but also on the number of classes or labels present in each client's local dataset. Local models are divided into two groups according to the heterogeneity level of the data they were trained on, as measured by the number of classes or labels seen during local training.
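A rough sketch of this weighting idea follows (the product combination of dataset size and label count is our illustrative assumption, not the published FedLbl formula):

```python
import numpy as np

def fedlbl_style_weights(sample_counts, class_counts):
    """Illustrative aggregation weights in the spirit of FedLbl:
    a client's weight grows with both the size of its local dataset
    and the number of distinct labels it trained on.
    NOTE: the product combination below is an assumption chosen for
    illustration; the actual FedLbl weighting may differ."""
    raw = np.asarray(sample_counts, float) * np.asarray(class_counts, float)
    return raw / raw.sum()
```

Under this scheme a client with many samples but only one or two classes is down-weighted relative to a client whose data covers more of the label space.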

7. Decentralized Federated Learning with Mutual Knowledge Transfer

Def-KT is a mutual knowledge transfer algorithm for a decentralized federated learning setup. In every iteration, Q out of K clients are selected and train locally on their own datasets. After training, each of the Q clients sends its model to a client in a second, disjoint set of Q clients. Each receiving client then trains its own local model and the received model simultaneously on its local dataset, using an interdependent loss function that transfers knowledge between the two models.
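The interdependent loss can be illustrated in the style of deep mutual learning (a hedged sketch of one model's objective; the exact Def-KT formulation may differ):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def mutual_loss(logits_a, logits_b, y, alpha=0.5):
    """Objective for one of the two models (model A): cross-entropy on
    the local labels plus a KL term pulling A's predictions toward its
    partner B's. (A mutual-learning-style sketch; alpha balancing the
    two terms is our illustrative assumption.)"""
    p_a, p_b = softmax(logits_a), softmax(logits_b)
    n = len(y)
    ce = -np.mean(np.log(p_a[np.arange(n), y] + 1e-12))
    kl = np.mean(np.sum(
        p_b * (np.log(p_b + 1e-12) - np.log(p_a + 1e-12)), axis=1))
    return ce + alpha * kl
```

The partner model is trained with the symmetric loss (roles of A and B swapped), which is what makes the two objectives interdependent.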

8. Detecting Non-IID Data

The cleanlab library offers functionalities to detect non-IID data. The key idea is to consider whether datapoints closer in the data ordering (i.e., time) tend to have more similar feature values. The library constructs a k-Nearest Neighbor graph of the dataset based on feature values and computes the index distance between datapoints. A two-sample permutation test using the Kolmogorov-Smirnov statistic determines if there is a statistically significant difference between the distributions of index distances between kNN-neighbors vs. index distances between arbitrary datapoint pairs.

The cleanlab library also assigns a score (between 0 and 1) for each individual datapoint, which can help determine why a dataset received a low p-value from the non-IID test.
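The underlying idea, comparing index distances of feature-space neighbors against those of random pairs, can be sketched from scratch (a simplified illustration of the statistical test, not cleanlab's actual implementation; names and defaults are ours):

```python
import numpy as np

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic (max gap between ECDFs)."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return np.max(np.abs(cdf_a - cdf_b))

def non_iid_pvalue(X, k=5, n_perm=200, seed=0):
    """If datapoints nearby in feature space are also nearby in the
    dataset ordering, the data is likely not IID. Compares index
    distances of k-nearest neighbors against index distances of
    random pairs via a permutation test on the KS statistic."""
    rng = np.random.default_rng(seed)
    n = len(X)
    # Brute-force kNN in feature space (fine for small n).
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)
    nbrs = np.argsort(d2, axis=1)[:, :k]
    idx = np.arange(n)
    knn_gaps = np.abs(nbrs - idx[:, None]).ravel()   # index gaps to kNN
    rand_gaps = np.abs(rng.integers(0, n, n * k) - rng.integers(0, n, n * k))
    observed = ks_statistic(knn_gaps, rand_gaps)
    pooled = np.concatenate([knn_gaps, rand_gaps])
    count = 0
    for _ in range(n_perm):  # permutation null distribution
        rng.shuffle(pooled)
        if ks_statistic(pooled[:len(knn_gaps)], pooled[len(knn_gaps):]) >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)
```

A small p-value suggests that feature values depend on the data ordering (e.g., drift over time), i.e., that the IID assumption is violated.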


Experimental Results and Analysis

Experiments have been conducted to evaluate the performance of federated learning algorithms under varying degrees of data heterogeneity. These experiments often involve partitioning datasets like CIFAR-10 among clients using a Dirichlet distribution, which allows for controlled manipulation of the non-IIDness of the data.
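A Dirichlet-based partition of this kind can be implemented in a few lines (a common experimental recipe; the function name and defaults here are illustrative):

```python
import numpy as np

def dirichlet_partition(labels, n_clients, alpha, seed=0):
    """Partition a dataset among clients with per-class proportions drawn
    from Dir(alpha). Small alpha -> highly skewed (strongly non-IID)
    clients; large alpha -> near-identical label distributions."""
    rng = np.random.default_rng(seed)
    parts = [[] for _ in range(n_clients)]
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        rng.shuffle(idx)
        # Split this class's samples according to one Dirichlet draw.
        props = rng.dirichlet(alpha * np.ones(n_clients))
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for client, chunk in enumerate(np.split(idx, cuts)):
            parts[client].extend(chunk.tolist())
    return parts
```

Sweeping `alpha` (e.g., from 0.1 to 100) then gives a controlled spectrum from extreme label skew to nearly IID client data.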

Impact of Batch Size

Models trained with larger batch sizes tend to converge more slowly and may underperform on both IID and non-IID data. Balancing batch size against learning rate and other regularization techniques is therefore crucial for achieving good performance in federated learning, for both IID and non-IID data distributions.

Impact of Dirichlet Concentration Parameter (α) and Reporting Fraction (C)

The Dirichlet concentration parameter (α) controls the degree of data heterogeneity: larger values yield more nearly identical (IID-like) distributions across clients, while smaller values yield more skewed ones. The reporting fraction (C) is the fraction of clients participating in each communication round.

Accuracy improves as more clients contribute to the learning process in each round. Training error exhibits greater volatility when the data is more heterogeneous. With small reporting fractions, the model may struggle to converge within a reasonable number of communication rounds, even with small batch sizes.
