Deep Learning Optimization Techniques: A Comprehensive Overview

Deep learning has revolutionized various fields, achieving remarkable results in image recognition, natural language processing, and more. However, training deep learning models can be computationally expensive and time-consuming. Optimization techniques play a crucial role in efficiently training these models, reducing losses, and achieving the most accurate results possible. This article provides a comprehensive overview of widely used deep learning optimization techniques.

Introduction

Optimization algorithms are responsible for iteratively adjusting the model's parameters (weights and biases) to minimize a loss function. The weights are initialized using an initialization strategy and are updated at each step according to an update rule; the algorithms that define these updates are called optimizers. These algorithms navigate the complex, high-dimensional parameter space to find the set of parameters that minimizes the difference between the model's predictions and the actual values.

Gradient Descent

Gradient Descent is an iterative algorithm that starts from a random point on the function and traverses down its slope in steps until it reaches the lowest point of that function. This algorithm is apt for cases where optimal points cannot be found by equating the slope of the function to 0. For the function to reach its minimum value, the weights must be altered. With the help of backpropagation, the loss is propagated from one layer to another, and the weights are modified according to the loss so that it can be minimized.

The Algorithm

  1. Objective: Gradient Descent aims to find a function’s parameters (weights) that minimize the cost function. In the case of a deep learning model, the cost function is the average of the loss for all training samples as given by the loss function.

  2. Gradient Computation: Calculate the gradient of the cost function with respect to each parameter. The gradient is a vector that points in the direction of the steepest increase of the function.


  3. Update Parameters: Adjust the model’s parameters in the direction opposite to the gradient. This step is done by subtracting a fraction of the gradient from the current values of the parameters.

    θ=θ−α⋅∇J(θ)

    where θ represents the model’s parameters (weights), α is the learning rate, and ∇J(θ) is the gradient of the cost function.

  4. Iterate: Repeat the gradient computation and parameter update until the cost function converges or a stopping criterion is met.

  5. Learning Rate: The learning rate is a crucial hyperparameter that needs to be chosen carefully. If it’s too small, the algorithm will converge very slowly. If the learning rate is too high, the optimization overshoots the minimum.


  6. Local Minima and Saddle Points: In complex cost landscapes, Gradient Descent can get stuck in local minima or saddle points, especially in non-convex optimization problems common in deep learning.
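The steps above can be sketched in a few lines of plain Python. The quadratic cost J(θ) = (θ − 3)² and the constants here are illustrative assumptions, not values from any particular model:

```python
def grad_J(theta):
    # Gradient of the illustrative cost J(theta) = (theta - 3)^2
    return 2.0 * (theta - 3.0)

theta = 0.0   # arbitrary starting point
alpha = 0.1   # learning rate

for _ in range(100):
    # Update rule: theta = theta - alpha * grad J(theta)
    theta = theta - alpha * grad_J(theta)
```

With α = 0.1 the iterates contract toward the minimum at θ = 3; a much larger α would overshoot it, as noted in step 5.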

Drawbacks

The Gradient Descent algorithm loads the entire dataset at a time to compute the gradient, which makes it computationally intensive. Another drawback is that the iterates may get stuck at a local minimum or saddle point and never converge to the minimum. To obtain the best solution, the algorithm must reach the global minimum.

Stochastic Gradient Descent (SGD)

Stochastic Gradient Descent is an extension of Gradient Descent that overcomes some of its disadvantages. SGD reduces the computational cost per update by computing the derivative on one data point at a time. Because of this, SGD takes a larger number of iterations than GD to reach the minimum, and its updates contain more noise. Since SGD computes the derivative of only one point at a time, the time taken to complete one epoch is large compared to the Gradient Descent algorithm.

The Algorithm

  1. Objective: Like Gradient Descent, the primary goal of SGD is to minimize the cost function of a model by iteratively adjusting its parameters (weights).
  2. Update Parameters: Compute the gradient on a single randomly chosen training example and update the model’s parameters using this gradient.
  3. Variance: The updates can be noisy due to the reliance on a single example, potentially causing the cost function to fluctuate.
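A minimal sketch of these steps on a toy linear-regression problem; the synthetic data, learning rate, and iteration count are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=200)   # toy inputs
y = 2.0 * X + 1.0                      # targets from the line y = 2x + 1

w, b = 0.0, 0.0
alpha = 0.1
for _ in range(5000):
    i = rng.integers(len(X))           # a single randomly chosen example
    err = (w * X[i] + b) - y[i]        # gradient of 0.5 * err^2 w.r.t. w and b
    w -= alpha * err * X[i]
    b -= alpha * err
```

Because each update uses a single example, the trajectory of (w, b) is noisy, but on this noiseless data it still settles near the true values w = 2 and b = 1.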

Advantages

SGD reduces the per-iteration computational cost by computing the derivative of one point at a time, and the noise in its updates can help it escape shallow local minima.

Mini-Batch Stochastic Gradient Descent (MB-SGD)

MB-SGD is an extension of the SGD algorithm. It overcomes the time-consuming nature of SGD by taking a batch of points (a subset of the dataset) to compute the derivative.


The Algorithm

  1. Objective: Similar to other gradient descent variants, the aim of Mini-batch Gradient Descent is to optimize the model’s parameters to minimize the cost function.
  2. Update Parameters: Adjust the parameters in the direction opposite to the computed gradient.
  3. Hyperparameter Tuning: Like with the other variants we’ve discussed so far, selecting the learning rate requires experimentation. Further, we need to choose the batch size.
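The same toy line-fitting task as before shows how the batch changes the update; the batch size and other constants are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=200)
y = 2.0 * X + 1.0                       # same toy task: fit y = 2x + 1

w, b = 0.0, 0.0
alpha, batch_size = 0.1, 32
for _ in range(2000):
    idx = rng.choice(len(X), size=batch_size, replace=False)
    err = (w * X[idx] + b) - y[idx]
    w -= alpha * np.mean(err * X[idx])  # gradient averaged over the mini-batch
    b -= alpha * np.mean(err)
```

Averaging over 32 points smooths out much of the noise seen with single-sample SGD while still avoiding a full pass over the dataset per update.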

Note

It is observed that the derivative of the loss function for MB-SGD is similar to that of GD after some iterations. However, the number of iterations needed to achieve the minimum is large for MB-SGD compared to GD, and it is computationally expensive. The weight updates are also noisier, because the derivative does not always point towards the minimum.

Momentum-based Optimizer

It is an adaptive optimization algorithm that uses an exponentially weighted average of gradients over previous iterations to stabilize convergence, resulting in quicker optimization. This is done by adding a fraction (gamma) of the previous update to the current one. Essentially, the momentum term increases when successive gradients point in the same direction and decreases when the gradients fluctuate. As a result, the loss function converges faster than expected.

The Algorithm

  1. Velocity update: V(t)=γV(t−1)+α⋅∇J(θ)
  2. Parameter update: θ=θ−V(t)
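These two update equations translate directly into code; the quadratic cost and hyperparameter values are illustrative assumptions:

```python
def grad_J(theta):
    # Gradient of the illustrative cost J(theta) = (theta - 3)^2
    return 2.0 * (theta - 3.0)

theta, v = 0.0, 0.0
alpha, gamma = 0.05, 0.9   # learning rate and momentum fraction

for _ in range(300):
    v = gamma * v + alpha * grad_J(theta)   # V(t) = gamma*V(t-1) + alpha*grad J(theta)
    theta = theta - v                       # theta = theta - V(t)
```

With γ = 0.9, each step carries most of the previous step, so consistent gradients compound and the iterates approach the minimum at θ = 3 faster than plain gradient descent with the same α.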

Nesterov Accelerated Gradient (NAG)

In momentum-based optimization, the current gradient step builds on the accumulated values of previous iterations. But we need a smarter algorithm that knows when to slow down so that the update doesn’t overshoot. To do this, the algorithm should have an approximate idea of its parameter values in the next iteration. We can then efficiently look ahead by calculating the gradient with respect to the approximate future position of the parameters.

The Algorithm

  1. Velocity update: V(t)=γV(t−1)+α⋅∇J[θ−γV(t−1)]
  2. Parameter update: θ=θ−V(t)

By using the NAG technique, the error function is adapted with the help of both previous and look-ahead values, which eventually speeds up convergence.
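The only change from plain momentum is evaluating the gradient at the look-ahead point; again the quadratic cost and constants are illustrative assumptions:

```python
def grad_J(theta):
    # Gradient of the illustrative cost J(theta) = (theta - 3)^2
    return 2.0 * (theta - 3.0)

theta, v = 0.0, 0.0
alpha, gamma = 0.05, 0.9

for _ in range(300):
    lookahead = theta - gamma * v              # approximate future position of theta
    v = gamma * v + alpha * grad_J(lookahead)  # V(t) = gamma*V(t-1) + alpha*grad J[theta - gamma*V(t-1)]
    theta = theta - v
```

Evaluating the gradient at θ − γV(t−1) lets the update "brake" before overshooting, which typically damps the oscillations seen with plain momentum.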

AdaGrad

Adaptive Gradient, as the name suggests, adapts the learning rate per parameter, updating it at each iteration depending on that parameter’s history: it uses smaller learning rates for features that occur frequently and higher learning rates for features that are infrequent.

Technically, it acts on the learning rate by dividing it by the square root of the accumulated sum of all past squared gradients (plus a small constant for numerical stability).

The Algorithm

In the update rule, AdaGrad modifies the general learning rate η at each step for every parameter, based on the past gradients computed for that parameter.
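A single-parameter sketch of this rule, using the same illustrative quadratic cost as before (the master step size η is an assumption):

```python
import numpy as np

def grad_J(theta):
    # Gradient of the illustrative cost J(theta) = (theta - 3)^2
    return 2.0 * (theta - 3.0)

theta = 0.0
eta, eps = 1.0, 1e-8   # master step size and a small stability constant
G = 0.0                # accumulated sum of squared gradients

for _ in range(500):
    g = grad_J(theta)
    G += g * g                                # the accumulation never shrinks...
    theta -= (eta / (np.sqrt(G) + eps)) * g   # ...so the effective step only decays
```

The division by √G is exactly what makes large, frequent gradients receive smaller steps, and it is also the source of the diminishing-learning-rate problem discussed next.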

Diminishing learning rates

As training progresses, the accumulated squared gradients can grow very large, causing the learning rates to shrink and become infinitesimally small.

Disadvantages

One of the biggest disadvantages is the accumulation of squared gradients in the denominator. Since every added term is positive, the accumulated sum keeps growing during training. This makes the learning rate shrink and eventually become very small. On the plus side, the method is not very sensitive to the master step size and converges quickly at first.

AdaDelta

It is an extension of AdaGrad that seeks to fix its monotonically decreasing learning rate. Instead of summing all past squared gradients, AdaDelta restricts the accumulation to a window of recent values (w), implemented as a decaying average of all past squared gradients. The current average at each iteration then depends only on the previous average and the current gradient.

The Algorithm

AdaDelta eliminates the need for an external learning rate parameter entirely.
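This can be seen in a sketch of the update, where the step size is derived from the decaying averages themselves rather than from a hand-set learning rate (the toy cost and ρ are illustrative assumptions):

```python
import numpy as np

def grad_J(theta):
    # Gradient of the illustrative cost J(theta) = (theta - 3)^2
    return 2.0 * (theta - 3.0)

theta = 0.0
rho, eps = 0.9, 1e-6
Eg2, Edx2 = 0.0, 0.0          # decaying averages of squared gradients / squared updates

for _ in range(10000):
    g = grad_J(theta)
    Eg2 = rho * Eg2 + (1 - rho) * g * g
    # The step scale comes from past updates, so no external learning rate appears
    dx = -(np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps)) * g
    Edx2 = rho * Edx2 + (1 - rho) * dx * dx
    theta += dx
```

The early steps are tiny because the average of past updates starts at zero, which illustrates the slower convergence discussed under "Convergence Rate".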

Complexity

AdaDelta’s mechanism, involving the tracking and updating of two separate state variables for gradients and parameter updates, adds complexity to the algorithm.

Convergence Rate

AdaDelta might converge more slowly than other optimizers like Adam, especially on problems where the learning rate needs more aggressive tuning.

RMSProp

Root Mean Square Propagation is another adaptive learning rate method that tries to improve on AdaGrad. Instead of taking the cumulative sum of squared gradients as in AdaGrad, it takes an exponential moving average. The first step in both AdaGrad and RMSProp is identical; RMSProp then divides the learning rate by this exponentially decaying average.

The Algorithm

Update the running average of squared gradients using a decay factor, γ, often set to 0.9.
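Swapping AdaGrad's cumulative sum for a decaying average gives the following sketch (the cost, η, and γ = 0.9 are illustrative assumptions):

```python
import numpy as np

def grad_J(theta):
    # Gradient of the illustrative cost J(theta) = (theta - 3)^2
    return 2.0 * (theta - 3.0)

theta = 0.0
eta, gamma, eps = 0.01, 0.9, 1e-8
Eg2 = 0.0                      # exponential moving average of squared gradients

for _ in range(2000):
    g = grad_J(theta)
    Eg2 = gamma * Eg2 + (1 - gamma) * g * g   # decay factor gamma, often 0.9
    theta -= (eta / (np.sqrt(Eg2) + eps)) * g
```

Because old gradients decay away, the effective step size no longer shrinks monotonically as it does in AdaGrad.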

Adaptive Moment Estimation (Adam)

It is a combination of RMSProp and Momentum. This method computes an adaptive learning rate for each parameter. In addition to storing the exponentially decaying average of past squared gradients, it also keeps a decaying average of past gradients, similar to Momentum. Adam thus behaves like a heavy ball with friction, which prefers flat minima in the error surface.

The Algorithm

  1. Initialization: Start with random initial parameter values and initialize a first moment vector (m) and a second moment vector (v). Both moment vectors are initialized to zero at the start of the optimization.
  2. Update moments: Update the first moment vector (m) with a moving average of the gradients and the second moment vector (v) with a moving average of the squared gradients, then apply bias correction to both.

Bias correction mechanism

Adam also introduces a bias correction mechanism to account for these vectors being initialized as zeros. The vectors’ initial state leads to a bias towards zero, especially in the early stages of training, because they haven’t yet accumulated enough gradient information. To correct this bias, Adam adjusts the calculations of the adaptive learning rate by applying a correction factor to both moment vectors.
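Putting the moment updates and the bias correction together gives this minimal single-parameter sketch (the cost and hyperparameters are illustrative assumptions; β values follow the common defaults):

```python
import numpy as np

def grad_J(theta):
    # Gradient of the illustrative cost J(theta) = (theta - 3)^2
    return 2.0 * (theta - 3.0)

theta = 0.0
alpha, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8
m, v = 0.0, 0.0

for t in range(1, 5001):
    g = grad_J(theta)
    m = beta1 * m + (1 - beta1) * g          # first moment: mean of gradients
    v = beta2 * v + (1 - beta2) * g * g      # second moment: uncentered variance
    m_hat = m / (1 - beta1 ** t)             # bias correction for the zero init
    v_hat = v / (1 - beta2 ** t)
    theta -= alpha * m_hat / (np.sqrt(v_hat) + eps)
```

Without the corrections, m_hat and v_hat would be biased toward zero in the first iterations, making the early steps too small.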

Nesterov Accelerated Adaptive Moment Estimator (Nadam)

As we saw in the previous section, Adam is a combination of RMSProp and Momentum, and we have also seen that NAG is superior to plain momentum. Nadam thus combines Adam and NAG, which requires modifying Adam's momentum term.
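One common formulation (following Dozat's Nadam) replaces Adam's bias-corrected momentum with a Nesterov-style blend of the momentum term and the current gradient; the toy cost and hyperparameters below are illustrative assumptions:

```python
import numpy as np

def grad_J(theta):
    # Gradient of the illustrative cost J(theta) = (theta - 3)^2
    return 2.0 * (theta - 3.0)

theta = 0.0
alpha, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8
m, v = 0.0, 0.0

for t in range(1, 5001):
    g = grad_J(theta)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    # Nesterov-style look-ahead: blend bias-corrected momentum with the
    # bias-corrected current gradient
    m_bar = (beta1 * m / (1 - beta1 ** (t + 1))
             + (1 - beta1) * g / (1 - beta1 ** t))
    v_hat = v / (1 - beta2 ** t)
    theta -= alpha * m_bar / (np.sqrt(v_hat) + eps)
```

The only change from the Adam sketch is the m_bar line, which plays the same look-ahead role that the gradient at θ − γV(t−1) plays in NAG.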

Optimization Techniques in Keras

Though we have discussed several optimization techniques, not all of them are provided by the popular Keras package.

The methods provided by Keras are listed below:

  • SGD
  • RMSprop
  • Adam
  • Adadelta
  • Adagrad
  • Adamax
  • Nadam
  • Ftrl

Model Optimization Approaches

Deep learning models are a vital component of solutions across a large number of industries. As this trend continues, model compression and optimization are critical to reducing the size of models so they can run faster and more efficiently than before. These techniques provide a substantial reduction in the amount of energy used, but at their core the end solution is still a neural network. As a community, it is both an incredible challenge and an imperative that we find more ways to reduce energy usage while simultaneously driving innovation.

Knowledge Distillation

As the name suggests, the goal of knowledge distillation is to take functionality from one model and move it into another. By leveraging a model that is already a working solution to a problem, we can create a similar, less complex model that can perform the same task. Obviously, the smaller model must perform with similar accuracy to be a successful distillation. In many recent publications on the topic, a teacher/student analogy is used to describe how knowledge distillation works.
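In the teacher/student setup, the student is trained against the teacher's softened output distribution. A sketch of that distillation loss with made-up logits (the temperature T is the usual softening knob; all values here are illustrative assumptions):

```python
import numpy as np

def softmax(logits, T=1.0):
    z = np.asarray(logits, dtype=float) / T   # temperature softens the distribution
    z = z - z.max()                           # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical logits for one input from a large teacher and a small student
teacher_logits = np.array([4.0, 1.0, 0.2])
student_logits = np.array([3.0, 1.5, 0.1])
T = 2.0

p_teacher = softmax(teacher_logits, T)        # soft targets
p_student = softmax(student_logits, T)

# Cross-entropy of the student's predictions against the soft targets;
# minimizing this pulls the student's outputs toward the teacher's
distill_loss = -np.sum(p_teacher * np.log(p_student))
```

The soft targets carry more information than hard labels (e.g. how confusable the classes are), which is what lets a much smaller student approach the teacher's accuracy.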

Quantization

Perhaps the most well-known type of deep learning optimization is quantization. Quantization involves taking a model trained using higher precision number formats, like 32- or 64-bit floating point representations, and reproducing the functionality with a neural network that uses lower precision number formats, typically an 8-bit integer (INT8). There are a few approaches to quantization. One can perform quantization after the initial model is trained. The subsequent INT8 model can then be computed by scaling the weights within the original model to generate a new model. This has the benefit of being able to run against existing models that you are trying to optimize after fine-tuning them. Another option is to include quantization techniques as part of the initial training process. In both of these cases, the result of using an INT8 representation provides significant savings in terms of model size, which translates into lower memory and compute requirements.
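The post-training path described above can be sketched with a simple symmetric per-tensor INT8 scheme; the weight values are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
w_fp32 = rng.normal(0.0, 0.5, size=(4, 4)).astype(np.float32)  # "trained" weights

scale = np.abs(w_fp32).max() / 127.0        # map the largest magnitude to 127
w_int8 = np.round(w_fp32 / scale).clip(-127, 127).astype(np.int8)

# At inference, the INT8 weights are rescaled to approximate the originals
w_dequant = w_int8.astype(np.float32) * scale
max_error = float(np.abs(w_fp32 - w_dequant).max())
```

Storage drops 4x (32-bit to 8-bit), and the per-weight reconstruction error is bounded by roughly half the scale; real quantization schemes add refinements such as per-channel scales and calibration data.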

Parameter-Efficient Fine-Tuning (PEFT)

PEFT is a family of methods designed to efficiently adapt large-scale models by training only a small subset of parameters. A neural network contains many dense layers that perform matrix multiplication, and the weight matrices in these layers typically have full rank when training for a broad, complex task, which minimizes redundancy. However, when fine-tuning this universal model for a specialized task, not all the knowledge from the original model is necessary, so only a small fraction of the parameters needs to be trained. In simpler terms, the weight updates can be represented by smaller matrices with fewer parameters.

Low-Rank Adaptation (LoRA)

For a pre-trained weight matrix W0 ∈ ℝ^(d×d), we constrain its update by representing the latter with a low-rank decomposition W0 + ΔW = W0 + BA, where B ∈ ℝ^(d×r), A ∈ ℝ^(r×d), and the rank r ≪ d. During training, W0 is frozen and does not receive gradient updates, while A and B contain trainable parameters. Note that both W0 and ΔW = BA are multiplied with the same input, and their respective output vectors are summed coordinate-wise. In essence, we freeze the original model, insert low-rank adapters alongside the relevant weight matrices, and train these adapters to simulate the updates that would normally come from gradients. The most significant benefit comes from the reduction in memory and storage usage: for a large Transformer trained with Adam, VRAM usage is reduced by up to 2/3 if r ≪ d, as we do not need to store the gradients and optimizer states for the frozen parameters.
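A dimensional sketch of this decomposition; d, r, and the random matrices are illustrative assumptions, and B starts at zero so that ΔW = BA is initially zero and training begins from the pretrained behavior:

```python
import numpy as np

d, r = 512, 8                         # layer width and LoRA rank, r << d
rng = np.random.default_rng(0)

W0 = rng.normal(size=(d, d))          # pretrained weight, frozen during training
A = 0.01 * rng.normal(size=(r, d))    # trainable down-projection
B = np.zeros((d, r))                  # trainable up-projection, zero-initialized

x = rng.normal(size=d)
y = W0 @ x + B @ (A @ x)              # outputs of W0 and Delta W = B A are summed

trainable = A.size + B.size           # 2 * d * r parameters
frozen = W0.size                      # d * d parameters
```

With d = 512 and r = 8, only 8,192 of the 262,144 weights are trained, about 3 percent, which is where the memory savings come from.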

QLoRA

QLoRA uses 4-bit quantization to compress a pretrained language model. The LM parameters are then frozen and a relatively small number of trainable parameters are added to the model in the form of Low-Rank Adapters. During finetuning, QLoRA backpropagates gradients through the frozen 4-bit quantized pretrained language model into the Low-Rank Adapters. QLoRA has one storage data type (usually 4-bit NormalFloat) for the base model weights and a computation data type (16-bit BrainFloat) used to perform computations. QLoRA dequantizes weights from the storage data type to the computation data type to perform the forward and backward passes, but only computes weight gradients for the LoRA parameters which use 16-bit bfloat.

First-Order Optimization Algorithms

First-order optimization algorithms use the first derivative (gradient) of the objective (loss) function to update parameters and move toward a minimum or maximum. They are widely used in machine learning because they are computationally efficient and scale well to large datasets.

Stochastic Optimization Techniques

Stochastic optimization techniques introduce randomness to the search process which can be advantageous for tackling complex optimization problems where traditional methods might struggle.

Simulated Annealing

It is an optimization algorithm inspired by the annealing process in metallurgy. It escapes local minima by occasionally accepting worse solutions, and it gradually reduces this randomness to settle on near-optimal solutions.
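A compact sketch of the accept/cool loop; the objective, cooling schedule, and step size are illustrative assumptions:

```python
import math, random

random.seed(0)

def f(x):
    # Toy objective with several local minima
    return x * x + 10.0 * math.sin(x)

x, best = 5.0, 5.0
T = 10.0                       # initial temperature (amount of randomness)
while T > 1e-3:
    candidate = x + random.uniform(-1.0, 1.0)
    delta = f(candidate) - f(x)
    # Always accept improvements; accept worse moves with prob. exp(-delta/T),
    # which lets the search climb out of local minima while T is still high
    if delta < 0 or random.random() < math.exp(-delta / T):
        x = candidate
    if f(x) < f(best):
        best = x
    T *= 0.99                  # cooling schedule: gradually reduce randomness
```

Early on, almost any move is accepted; as T shrinks, the search behaves more and more like greedy local descent.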

Random Search

This simple method randomly chooses points in the search space and evaluates them. Random search is surprisingly effective, particularly for high-dimensional optimization problems.
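The whole method fits in a few lines; the toy objective and search bounds are illustrative assumptions:

```python
import random

random.seed(0)

def f(x, y):
    # Toy objective to minimize; optimum at (2, -1)
    return (x - 2.0) ** 2 + (y + 1.0) ** 2

best_point, best_val = None, float("inf")
for _ in range(10000):
    point = (random.uniform(-5, 5), random.uniform(-5, 5))  # random sample
    val = f(*point)
    if val < best_val:                                      # keep the best so far
        best_point, best_val = point, val
```

There is no gradient and no state between samples, which is exactly why it parallelizes trivially and is a common baseline for hyperparameter search.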

Evolutionary Algorithms

Evolutionary Algorithms (EAs) are inspired by natural selection and mimic biological evolution to solve complex optimization problems that are difficult for traditional methods. They evolve a population of solutions over generations using selection, crossover and mutation to find near-optimal results.

Key components of Evolutionary Algorithms

  1. Population: A set of candidate solutions to explore the search space.
  2. Fitness Function: Evaluates the quality of each candidate.
  3. Selection: Chooses the fittest candidates for reproduction.
  4. Genetic Operators: Modify candidates to create new solutions.
  5. Termination: Defines stopping conditions.

Genetic Algorithms

A Genetic Algorithm (GA) is a population-based optimization technique inspired by natural selection and genetics. It iteratively evolves a set of candidate solutions using operators like selection, crossover and mutation to find optimal or near-optimal solutions for complex problems where traditional methods may fail.
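A minimal GA on the classic OneMax problem (maximize the number of 1-bits) shows selection, crossover, and mutation together; the problem, population size, and rates are illustrative assumptions:

```python
import random

random.seed(0)

N, POP, GENS = 20, 30, 60      # bitstring length, population size, generations

def fitness(bits):
    # OneMax: the more 1-bits, the fitter the candidate
    return sum(bits)

def crossover(a, b):
    p = random.randint(1, N - 1)            # single-point crossover
    return a[:p] + b[p:]

def mutate(bits, rate=0.02):
    return [1 - v if random.random() < rate else v for v in bits]

population = [[random.randint(0, 1) for _ in range(N)] for _ in range(POP)]
for _ in range(GENS):
    population.sort(key=fitness, reverse=True)
    parents = population[: POP // 2]        # selection: the fittest half survives
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(POP - len(parents))]
    population = parents + children

best = max(population, key=fitness)
```

Keeping the parents unchanged (elitism) guarantees the best fitness never decreases between generations.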

Differential Evolution (DE)

Differential Evolution is a population-based stochastic optimization technique that evolves candidate solutions using vector differences. It is effective for continuous, high-dimensional and non-convex optimization problems.
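A sketch of the classic DE/rand/1/bin variant on the sphere function; the dimension, population size, scale factor F, and crossover rate CR are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Sphere function: a standard continuous test objective, minimum at the origin
    return float(np.sum(x * x))

dim, NP, F, CR = 5, 20, 0.8, 0.9
pop = rng.uniform(-5.0, 5.0, size=(NP, dim))

for _ in range(200):
    for i in range(NP):
        others = [j for j in range(NP) if j != i]
        a, b, c = pop[rng.choice(others, size=3, replace=False)]
        mutant = a + F * (b - c)                    # vector-difference mutation
        cross = rng.random(dim) < CR                # binomial crossover mask
        trial = np.where(cross, mutant, pop[i])
        if f(trial) <= f(pop[i]):                   # greedy one-to-one selection
            pop[i] = trial

best = min(pop, key=f)
```

The mutation step uses the difference between two population members, so the step sizes automatically shrink as the population converges.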

Metaheuristic Optimization

Metaheuristic optimization algorithms are high-level strategies used to solve complex optimization problems by efficiently exploring large search spaces. They combine exploration and exploitation techniques to find near-optimal solutions and are independent of gradient information, making them suitable for problems where traditional methods may fail.

Tabu Search (TS)

Tabu Search improves the efficiency of local search by using memory structures that prevent cycling back to recently visited solutions. This helps the algorithm escape local optima and explore new regions of the search space.

Iterated Local Search (ILS)

Iterated Local Search is another strategy for enhancing local search but unlike Tabu…
