Deep Learning Accelerator: An Overview

Deep Learning Accelerators (DLAs) have emerged as crucial components in modern AI systems, addressing the ever-increasing computational demands of deep neural networks (DNNs). DNNs have become ubiquitous in various AI applications, including computer vision, speech recognition, and robotics, achieving state-of-the-art accuracy. However, this accuracy comes at the expense of significant computational complexity. DLAs are designed to mitigate this challenge by providing specialized hardware acceleration for DNN operations. This article provides an overview of DLAs, their architecture, software stack, and applications, with a focus on NVIDIA's DLA.

The Need for Deep Learning Accelerators

Deep neural networks (DNNs) are now widely used across AI applications. While they deliver state-of-the-art accuracy on many tasks, that accuracy comes at the cost of high computational complexity, which is difficult to sustain on general-purpose processors within the power budgets of edge devices.

What is a Deep Learning Accelerator?

A Deep Learning Accelerator (DLA) is a specialized hardware component designed to accelerate deep learning workloads. Its purpose is either to execute already-trained AI models efficiently (inference) or to train AI models. DLAs are often manycore or spatial designs and focus on low-precision arithmetic, novel dataflow architectures, or in-memory computing capability. On consumer devices, the NPU is intended to be small and power-efficient, yet reasonably fast when running small models. To achieve this, such accelerators are designed to support low-bitwidth operations using data types such as INT4, INT8, FP8, and FP16.
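To make the low-bitwidth idea concrete, here is a minimal sketch of symmetric INT8 quantization, the kind of mapping such data types rely on. This is illustrative only, not any vendor's implementation; real toolchains calibrate scales per-tensor or per-channel.

```python
# Sketch of symmetric INT8 quantization: map floats onto the
# integer range [-128, 127] via a single scale factor.

def quantize_int8(values, scale):
    """q = clamp(round(v / scale), -128, 127) for each value v."""
    return [max(-128, min(127, round(v / scale))) for v in values]

def dequantize_int8(codes, scale):
    """Recover approximate floats from the INT8 codes."""
    return [q * scale for q in codes]

# Choose a scale covering the observed dynamic range of the tensor.
weights = [0.51, -1.20, 0.003, 0.99]
scale = max(abs(w) for w in weights) / 127  # symmetric range

codes = quantize_int8(weights, scale)
approx = dequantize_int8(codes, scale)
print(codes)
print([round(a, 3) for a in approx])
```

Each 8-bit code occupies a quarter of the storage of a 32-bit float, and integer multiply-accumulate units are far cheaper in silicon, which is why accelerators favor these formats.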

NVIDIA's AI Platform and DLA

NVIDIA offers a comprehensive AI platform for the edge, encompassing both hardware and software and delivering best-in-class compute for deep learning workloads. The DLA is the fixed-function hardware that accelerates deep learning on these platforms, paired with an optimized software stack for inference. NVIDIA Jetson brings accelerated AI performance to the edge in a power-efficient and compact form factor. NVIDIA DRIVE embedded supercomputing solutions process data from camera, radar, and lidar sensors to perceive the surrounding environment, localize the car to a map, and then plan and execute a safe path forward.

NVIDIA DLA Hardware

NVIDIA DLA hardware is a fixed-function accelerator engine targeted for deep learning operations. It’s designed to do full hardware acceleration of convolutional neural networks, supporting various layers such as convolution, deconvolution, fully connected, activation, pooling, batch normalization, and others. NVIDIA’s Orin SoCs feature up to two second-generation DLAs while Xavier SoCs feature up to two first-generation DLAs.


The DLA is available on Jetson AGX Xavier, Xavier NX, Jetson AGX Orin and Jetson Orin NX modules. The DLA is an application-specific integrated circuit that is capable of efficiently performing fixed operations, like convolutions and pooling, that are common in modern neural network architectures.

NVIDIA DLA Software

NVIDIA DLA software consists of the DLA compiler and the DLA runtime stack. DLA performance is enabled by both hardware acceleration and software; for example, the DLA software performs layer fusions to reduce the number of passes to and from system memory. TensorRT provides a higher-level abstraction over the DLA software stack, delivering a unified platform and common interface for AI inference on the GPU, the DLA, or both. The TensorRT builder provides the build-time interface that invokes the DLA compiler. Once the plan file is generated, the TensorRT runtime calls into the DLA runtime stack to execute the workload on the DLA cores. TensorRT also makes it easy to port from GPU to DLA by specifying only a few additional flags.
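As a hedged sketch, the "few additional flags" typically look like the following in the TensorRT Python API (API names as in recent TensorRT releases; building actually requires a Jetson or DRIVE device with a DLA and TensorRT installed):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
config = builder.create_builder_config()

# Route supported layers to the DLA; fall back to the GPU for the rest.
config.default_device_type = trt.DeviceType.DLA
config.DLA_core = 0  # Orin and Xavier expose up to two cores (0 and 1)
config.set_flag(trt.BuilderFlag.GPU_FALLBACK)

# DLA inference runs in reduced precision (e.g. FP16 or INT8).
config.set_flag(trt.BuilderFlag.FP16)
```

With `GPU_FALLBACK` set, layers the DLA does not support are scheduled on the GPU instead of failing the build, which is what makes porting an existing GPU engine incremental.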

cuDLA

cuDLA is an extension of NVIDIA® CUDA® that integrates GPU and DLA under the same programming model.

Benefits of Using NVIDIA DLA

  • Offloading the GPU and CPU: Port your AI-heavy workloads over to the Deep Learning Accelerator to free up the GPU and CPU for more compute-intensive tasks. Offloading lets you add more functionality to your embedded application or increase throughput by parallelizing your workload across the GPU and DLA.
  • Performance Improvement: The two DLAs on Orin can offer up to 9X the performance of the two DLAs on Xavier.
  • Power Efficiency: The DLA delivers the highest AI performance in a power-efficient architecture. It accelerates the NVIDIA AI software stack with almost 2.5X the power efficiency of a GPU.
  • Robustness: Design more robust applications with independent pipelines on the GPU and DLA to avoid a single point of failure.
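The parallelization benefit above can be sketched as a simple scheduler that alternates work between two independent engines. This is purely illustrative: `run_on_gpu` and `run_on_dla` are placeholders for real TensorRT execution contexts, not NVIDIA APIs.

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative sketch: split a stream of inference requests across two
# independent engines (e.g. one GPU engine and one DLA engine) so that
# both stay busy and total throughput increases.

def run_on_gpu(frame):
    # Placeholder for a GPU inference call.
    return ("gpu", frame)

def run_on_dla(frame):
    # Placeholder for a DLA inference call.
    return ("dla", frame)

def dispatch(frames):
    """Alternate frames between the two engines."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [
            pool.submit(run_on_gpu if i % 2 == 0 else run_on_dla, f)
            for i, f in enumerate(frames)
        ]
        return [fut.result() for fut in futures]

print(dispatch(["f0", "f1", "f2", "f3"]))
```

In a real application each worker would hold its own engine and CUDA stream, so a fault in one pipeline does not stall the other.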

Using the DLA

To use the DLA, you first need to train your model with a deep learning framework like PyTorch or TensorFlow. Next, you need to import and optimize your model with NVIDIA TensorRT. TensorRT is responsible for generating the DLA engines, and can also be used as a runtime for executing them.

Optimizing Applications with DLA

Many NVIDIA Jetson developers are already using the DLA to successfully optimize their applications. Postmates optimized their delivery robot application on Jetson AGX Xavier leveraging the DLA along with the GPU.


Design Considerations for Deep Neural Network Accelerators

Understanding the design space for deep neural network accelerators involves several key considerations, including:

  • Managing data movement
  • Handling sparsity
  • Importance of flexibility
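"Handling sparsity" means skipping work on zero operands, since multiplying by zero contributes nothing. A toy illustration of the idea (not any particular accelerator's scheme):

```python
# Toy illustration of exploiting sparsity: perform multiply-accumulates
# only over nonzero weights, as a sparse accelerator skips zero operands.

def dense_dot(weights, activations):
    """Dense baseline: one MAC per element, zeros included."""
    return sum(w * a for w, a in zip(weights, activations))

def sparse_dot(weights, activations):
    """Store weights as (index, value) pairs and skip zeros entirely."""
    nonzero = [(i, w) for i, w in enumerate(weights) if w != 0]
    macs = len(nonzero)  # MACs actually performed
    result = sum(w * activations[i] for i, w in nonzero)
    return result, macs

w = [0, 2, 0, 0, 3, 0, 1, 0]
a = [1, 1, 1, 1, 1, 1, 1, 1]
result, macs = sparse_dot(w, a)
print(result, macs)  # same result as dense_dot, but fewer multiplies
```

Here the sparse version performs 3 MACs instead of 8 for the same answer; real designs must also pay for the index bookkeeping, which is why sparsity support is a genuine design trade-off rather than a free win.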

Other AI Accelerators

Since the late 2010s, graphics processing units designed by companies such as Nvidia and AMD often include AI-specific hardware in the form of dedicated functional units for low-precision matrix-multiplication operations. Mobile NPU vendors typically provide their own application programming interface such as the Snapdragon Neural Processing Engine. Consumer CPU-integrated NPUs are accessible through vendor-specific APIs. GPUs generally use existing GPGPU pipelines such as CUDA and OpenCL adapted for lower precisions.

Examples of AI accelerators:

  • Apple Neural Engine (ANE), part of Apple Silicon chips (note that Apple's MLX framework targets the CPU and GPU, not the ANE)
  • Intel Movidius Neural Compute Stick, a USB AI accelerator
  • Inspur GX4 AI Accelerator
  • Google Tensor Processing Unit (TPU)
  • NVIDIA Hopper H100 GPU
  • AMD Xilinx Versal AI Engines
  • Huawei Kirin NPUs
  • Qualcomm Snapdragon NPUs (e.g. Snapdragon 8 Gen 3)
  • Intel Lunar Lake and Meteor Lake NPUs
  • Amazon Web Services (AWS) Inferentia and Trainium
  • NVIDIA's export-compliant AI chips for the China market
