Deep Learning File Types: A Comprehensive Guide for Machine Learning Practitioners

The landscape of machine learning, particularly deep learning, is intrinsically tied to the efficient handling and storage of data and models. At the core of this process are file types, which dictate how information is structured, encoded, and accessed. This article serves as a comprehensive guide to the popular file formats encountered in open-source machine learning frameworks, with a focus on Python-based ecosystems like TensorFlow/Keras, PyTorch, Scikit-Learn, and PySpark. We will also explore how specialized tools, such as a Feature Store, can significantly streamline the data scientist's workflow by facilitating the generation of training and testing data in chosen file formats and on preferred file systems.

Understanding the Data Ecosystem: Files, File Systems, and Object Stores

A file format is fundamentally defined by the structure and encoding of the data it contains, typically identified by its file extension. While files are ubiquitous for data storage, not all formats are directly amenable to training machine learning models. This article primarily focuses on file formats for structured data, but also touches upon model storage formats. We will explicitly exclude image formats (e.g., .png, .jpeg), video formats (e.g., .mp4, .mkv), archive formats (e.g., .zip, .gz), document formats (e.g., .docx, .pdf), and web formats (e.g., .html).

Data utilized in machine learning resides on file systems. In cloud environments, this increasingly translates to object stores. File systems themselves come in various forms. A local file system (POSIX) typically stores data on disks, whether magnetic (hard drives), SSD (solid-state storage), or NVMe drives. These can be accessed over a network (e.g., NFS), or for massive storage needs, distributed file systems like HDFS, HopsFS, and CephFS are employed, scaling to petabytes across thousands of servers. Cloud object stores offer the most cost-effective storage, providing reasonable performance for applications to read and write files.

The choice of file format is often intertwined with the file system. For instance, training an ImageNet model on images stored in S3 (an object store) using TensorFlow or PyTorch on an NVIDIA V100 GPU might be bottlenecked by I/O to the file system: while a single client can read 100 images per second from S3, a V100 GPU can process 1000 images per second. This I/O limitation highlights the need for new file formats designed to overcome such bottlenecks. For example, Uber developed the Petastorm file format to efficiently store and access petabytes of self-driving vehicle data on HDFS. Petastorm files are large, splittable, and compressed, with readers for TensorFlow and PyTorch that feed data to multiple GPUs in parallel, preventing file I/O from becoming a bottleneck. An alternative, though significantly more expensive, solution would be to use traditional file formats on storage devices composed of thousands of NVMe disks.

Machine learning frameworks aim to consume training data as a sequence of samples. Therefore, file formats intended for training ML models should offer easily consumable layouts that avoid impedance mismatches with the storage platform or the programming language used for access. Furthermore, distributed training, which involves training ML models across multiple GPUs simultaneously for faster results, necessitates files that are splittable and accessible over distributed file systems or object stores. This allows different GPUs to read distinct data shards in parallel from various servers.


From Binary to Structured Data: The Evolution of Big Data in ML

Machine learning has revolutionized domains such as image classification, voice recognition, natural language processing, and neural machine translation, typically employing compressed binary or plaintext data formats. Its impact has expanded into enterprise data, addressing business problems that can be framed as supervised machine learning tasks. Enterprise data, often sourced from data warehouses, databases, document repositories, and data lakes, is frequently structured. This structured enterprise data can be stored in various text-based and binary file formats. For large datasets, binary file formats are generally preferred over text-based formats like CSV, as they can significantly boost import pipeline throughput, thereby reducing model training time. Binary formats require less disk space and are faster to read. This efficiency is particularly crucial given the trend towards deep learning, which is notoriously data-hungry and benefits immensely from larger datasets. Efficient, compressed file formats play a pivotal role in this escalating demand for data.

File Formats within Machine Learning Frameworks

Older file formats have well-known drawbacks: CSV lacks compression and type information, while HDF5 and NetCDF are not splittable, which hinders seamless parallel training with multiple workers; combining datasets across formats is also cumbersome. However, the benefits of modern file formats are only realized if the machine learning framework (TensorFlow, PyTorch, Scikit-Learn) provides integrated data import and preprocessing functionalities. For instance, the TFRecord file format is designed for TensorFlow and is fully supported by tf.data. PyTorch's DataLoader was initially optimized for NumPy files and later extended to other formats. Similarly, Scikit-Learn was first designed to work with CSV and Pandas, with subsequent extensions to other formats. While adopting modern file formats is advantageous, the challenge of data conversion and potential scarcity of documentation can be daunting. Fortunately, tools like Feature Stores simplify this process, enabling effortless conversion of data into essential ML file formats.

File Formats for Data

This section categorizes widely used ML file formats into established types: columnar, tabular, nested, array-based, and hierarchical. We also cover newer formats designed for model serving.

Columnar Data File Formats

Enterprise data commonly resides in data warehouses or data lakes, often accessed via SQL. Data can be stored in row-oriented formats (typical for OLTP databases, offering low latency and high write throughput) or, more frequently, in column-oriented formats (common in OLAP/columnar databases, scaling from terabytes to petabytes and providing faster queries and aggregations). In data lakes, structured data is often stored as files (e.g., .parquet, .orc) accessible via SQL using scalable engines like SparkSQL, Hive, and Presto. These columnar formats and backend databases are frequent sources for enterprise ML training data. Feature engineering on them typically requires data-parallel processing frameworks (Spark, Beam, Flink) to scale across numerous servers. This scalability is enabled because the path to data in Parquet/ORC/Petastorm is a directory, not a single file, containing multiple files processed in parallel. When reading columnar data, the base directory path is provided, and the processing engine identifies the relevant files. If only a subset of columns is needed, files containing excluded columns are not read from disk. For range scans, statistics (min/max values per column within files) facilitate data skipping, preventing files with values outside the query range from being accessed.

While Parquet and ORC share similar properties, Petastorm is uniquely tailored for ML data, being the only columnar format that natively supports multi-dimensional data. Columnar formats typically assume 2D relational data, but tensors can have much higher dimensionality. Petastorm extends Parquet by incorporating its own Unischema, designed explicitly for ML use cases. This Unischema allows Petastorm files to store multi-dimensional tensors natively within Parquet. The Unischema is compatible with PyTorch and TensorFlow, enabling direct conversion to their respective schemas for native readers. Columnar file formats are designed for distributed file systems (HDFS, HopsFS) and object stores (S3, GCS, ADL), allowing parallel reading by workers.


  • File formats: .parquet, .orc, .petastorm
  • Feature Engineering: PySpark, Beam, Flink
  • Training: .petastorm has native readers in TensorFlow and PyTorch; .orc, .parquet have native readers in Spark; JDBC/Hive sources are supported by Spark.

Tabular Text-based File Formats

Tabular data for machine learning is commonly found in .csv files: text-based files with comma-separated values. CSV files are popular because they are easy to view, debug, and read and write programmatically. However, they carry no column type information (text and numeric columns are indistinguishable) and perform poorly with large datasets, supporting neither splitting, indexing, nor column filtering. CSV files can be compressed with GZIP to save space. Other tabular formats, such as spreadsheet files (e.g., .xlsx, .xls) and unstructured text files (.txt), are not typically used in ML.

  • File formats: .csv, .xlsx
  • Feature Engineering: Pandas, Scikit-Learn, PySpark, Beam, and many others.
  • Training: .csv has native readers in TensorFlow, PyTorch, Scikit-Learn, and Spark.
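The GZIP-compressed CSV workflow mentioned above can be sketched with only the standard library (the data here is illustrative):

```python
import csv
import gzip
import io

# Round-trip a small table through GZIP-compressed CSV, entirely in memory.
rows = [["id", "amount"], ["1", "9.99"], ["2", "4.50"]]

buf = io.BytesIO()
with gzip.open(buf, "wt", newline="") as f:
    csv.writer(f).writerows(rows)  # write compressed text

buf.seek(0)
with gzip.open(buf, "rt", newline="") as f:
    restored = list(csv.reader(f))  # every value comes back as a string
```

Note that every value is read back as a string: the type information must be reimposed by the reader, which is exactly the weakness discussed above.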

Nested File Formats

Nested file formats store records in an n-level hierarchical structure with a schema to define their organization. A record can have a parent or be a root, and can also possess children. Schemas in nested formats are extensible (allowing attribute additions while maintaining backward compatibility), and attribute order is typically not significant. .json and .xml are well-known plaintext nested formats, while binary nested formats include Protocol Buffers (.pb) and Avro (.avro).

A TFRecord file contains a sequence of binary records, usually serialized protocol buffers with either an "Example" or "SequenceExample" schema; developers choose "SequenceExample" when features are lists of identically typed data. A TFRecord dataset can be spread across multiple .tfrecords files in a directory, and the format supports GZIP compression. NumPy (.npy), an array-based format, is also high-performance thanks to its vectorization support: a NumPy array is a densely packed array of elements of the same type, and a .npy file stores a single NumPy array (including nested record and object arrays).

  • File formats: .tfrecords, .npy
  • Feature Engineering: PyTorch, NumPy, Scikit-Learn, TensorFlow.
  • Training: .npy has native readers in PyTorch, TensorFlow, and Scikit-Learn.
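Saving and reloading a single array, assuming NumPy is installed (the file name is illustrative):

```python
import numpy as np

# Persist a dense float32 array to .npy; dtype and shape round-trip exactly.
arr = np.arange(12, dtype=np.float32).reshape(3, 4)
np.save("batch.npy", arr)

# np.load can also memory-map large files (mmap_mode="r") instead of
# reading them fully into memory.
loaded = np.load("batch.npy")
```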

Hierarchical Data Formats

HDF5 (.h5 or .hdf5) and NetCDF (.nc) are popular hierarchical data formats designed for large, heterogeneous, and complex datasets. They are particularly suitable for high-dimensional data that doesn't map well to columnar formats like Parquet, though Petastorm bridges this gap by being both columnar and supporting high-dimensional data. Medical device data and genomic data (BAM, VCF) are often stored in HDF or related formats. Internally, HDF5 and NetCDF use compressed layouts. NetCDF is prevalent in climate science and astronomy, while HDF5 is common in GIS systems. These formats are not splittable, making them unsuitable for distributed processing with engines like Spark.

  • File formats: .h5 (HDF5), .nc (NetCDF)
  • Feature Engineering: Pandas, Dask, XArray.
  • Training: .h5 has no native readers in TensorFlow or PyTorch; .nc also lacks known native readers.

Model File Formats

In supervised machine learning, the artifact generated after training, used for predictions, is called a model. For a deep neural network (DNN), a trained model is essentially a file containing its layers and weights. Models are often saved in compressed binary files. TensorFlow saves models as Protocol Buffer files with a .pb extension. Keras natively saves models as .h5 files. Scikit-Learn saves models as pickled Python objects with a .pkl extension. An older format for model serving, Predictive Model Markup Language (.pmml), based on XML, is still usable in some frameworks like Scikit-Learn.


Model files are used for predictions either by batch applications that load the model file or by real-time model serving servers (e.g., TensorFlow Serving Server) that load the model into memory, potentially managing multiple versions for A/B testing.

Other model file formats include SparkML models saved in the MLeap file format and served in real time by an MLeap model server (files packaged as .zip). Apple's .mlmodel format is for models embedded in iOS applications via its Core ML framework, offering superior support for Objective-C and Swift. Models trained in TensorFlow, Scikit-Learn, and other frameworks must be converted to .mlmodel for iOS deployment, with tools like coremltools and tensorflow_converter assisting in this process. Theoretically, any ML framework can export models to the .onnx file format, promising unified model serving across different frameworks.

PyTorch Model File Extensions: .pt, .pth, and .pwf

PyTorch, a widely adopted machine learning library, provides several methods for saving and loading models, commonly associated with the file extensions .pt, .pth, and .pwf. These extensions serve distinct purposes and possess unique characteristics.

Understanding the Extensions:

  1. .pt Extension (PyTorch Tensor): This is the most frequently used format for saving PyTorch models and is recommended by PyTorch for storing model weights and architectures. The .pt extension is primarily used for saving the entire model, encompassing both its architecture and trained weights, making it efficient for both saving and loading. A model can be saved to a .pt file using the torch.save() function and loaded back with torch.load().

  2. .pth Extension (PyTorch): The .pth extension is often employed for checkpointing models and storing dictionaries of model parameters. However, it collides with Python's path configuration files: Python's site machinery treats .pth files found in site-packages as path configuration files and can even execute lines from them at interpreter startup, so the shared extension can cause confusion or unintended behavior. In PyTorch, .pth files are typically used to save model state dictionaries, which include model parameters and, for checkpoints, optimizer states.

  3. .pwf Extension (PyTorch Weights Format): This extension is less common and not officially recognized in PyTorch documentation. It appears to be used in specific projects or contexts, often for lightweight models or custom implementations for saving model weights. If .pwf files are encountered, referring to the specific project documentation is essential to understand their usage.

Functional Equivalence and Best Practices:

From a functional standpoint, there is no inherent difference between .pt, .pth, and .pwf when saving PyTorch models. The torch.save() function does not interpret the extension; it serializes the model and its metadata regardless of the extension used. Consequently, models saved with any of these extensions can be loaded by PyTorch without issue.
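This equivalence can be sketched as follows, assuming PyTorch is installed (the model and file names are illustrative):

```python
import torch
from torch import nn

model = nn.Linear(4, 2)

# torch.save never inspects the extension; saving to "weights.pt",
# "weights.pth", or "weights.pwf" would produce the same serialized file.
torch.save(model.state_dict(), "weights.pt")

# Loading works regardless of what the file was named.
state = torch.load("weights.pt")
restored = nn.Linear(4, 2)
restored.load_state_dict(state)
```

Saving the state dict rather than the whole model object, as above, is the approach PyTorch's documentation generally recommends, since it decouples the file from the exact class definition.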

When working with PyTorch models, selecting the appropriate file extension and saving/loading method is crucial for effective model management and deployment.

  • Use .pt for Model Checkpointing: The .pt extension is recommended for saving and loading entire models, offering a straightforward and efficient serialization method.
  • Avoid .pth for Checkpoints (if possible): Given its potential conflict with Python path files, it is advisable to use .pt or other extensions like .pth.tar for saving checkpoints.
  • Understand Project-Specific Extensions: If non-standard extensions like .pwf are encountered, clarify their specific use case within the project context.
  • Use .pt for Clarity and Consistency: For general model saving, .pt is the preferred choice for its clarity and consistency within the PyTorch community.
  • Avoid .pwf: This extension is not standard and should be avoided to maintain consistency.

Emerging and Specialized Model Formats

As ML model applications grow, optimizing models for specific use cases becomes paramount.

ONNX (Open Neural Network Exchange): ONNX provides an open-source, vendor-neutral format for AI models. It defines an extensible computation graph model, along with standard operators and data types, enabling interoperability between various frameworks, tools, and hardware. ONNX models are saved in a single file with the .onnx extension. The computation graph within the model file offers flexibility. However, ONNX has limited support for quantized tensors, decomposing them into integer and scale factor tensors. Complex architectures may require operator fallbacks or custom implementations for unsupported layers. ONNX model inference depends on the runtime library's supported Execution Provider (CPU, GPU, edge, etc.). ONNX Runtime offers tools for quantizing ONNX models, with support based on the operators present. ONNX uses "Opsets" (operator sets), which evolve with ONNX package releases, introducing new operators. Some users report slower inference after converting models to ONNX compared to their base format, indicating that conversion is not always straightforward.

GGML and GGUF: Developed by Georgi Gerganov, GGML is a tensor library for machine learning, enabling large models and high performance on commodity hardware. GGML defines a binary format for distributing large language models (LLMs), supporting 16-bit float and integer quantization (e.g., 4-bit, 5-bit, 8-bit), offering trade-offs between efficiency and performance. GGML uses versioning for format improvements without sacrificing backward compatibility. Valid GGML files list hyperparameters, defining model behavior, and vocabulary (supported tokens). Weights, also called parameters, constitute the "size" of a model. Quantization, the conversion of high-precision floating-point values to lower precision, reduces resource requirements. GGML supports various quantization strategies. The project facilitates high-quality speech-to-text solutions across multiple platforms and inference/training for many open-source models (StarCoder, Falcon, Bert, etc.).
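The core idea of quantization can be illustrated with a toy symmetric quantizer in plain Python. This is a deliberate simplification, not GGML's actual block-wise bit-packing scheme:

```python
# Toy symmetric quantization: store low-precision integers plus one float
# scale factor. Real formats like GGML quantize per block of weights and
# pack 4-/5-/8-bit values far more compactly.
weights = [0.82, -0.41, 0.05, -1.30]

scale = max(abs(w) for w in weights) / 127       # map the largest weight to int8 range
quantized = [round(w / scale) for w in weights]  # integers in [-127, 127]
dequantized = [q * scale for q in quantized]     # approximate reconstruction
```

The reconstruction error is bounded by half the scale factor, which is why smaller quantization blocks (each with its own scale) preserve accuracy better at the cost of extra metadata.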

A successor to GGML, GGUF, was introduced by the llama.cpp team. GGUF is an extensible, future-proof format storing more metadata about the model, including significantly improved tokenization code and full support for special tokens. It features a metadata section organized in key-value pairs and a section for tensor metadata. GGUF, along with the GGML library, offers flexible quantization schemes (e.g., Q4_K_M, IQ4_XS, IQ2_M, Q8_0), enabling efficient storage while maintaining accuracy, though not all models are convertible. GGUF is primarily used for serving models in production where fast loading times are critical.

Safetensors: Developed by Hugging Face, safetensors addresses the security and efficiency limitations of pickle-based formats like PyTorch's .pt. It is designed for fast model loading and saving, with lazy-loading and partial data loading capabilities, leading to faster load times and lower memory usage. A metadata section in JSON format details model tensors (shape, data type, name). While safetensors offers flexibility, its quantization scheme is less adaptable than GGUF, and a JSON parser is required for metadata. safetensors is the default serialization format for Hugging Face's transformers library and is widely used for sharing, training, fine-tuning, and serving AI models.
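The container layout can be sketched in pure Python. This is a simplified illustration of the documented format (an 8-byte little-endian header length, a JSON header, then raw tensor bytes), not a substitute for the safetensors library; the tensor name is invented:

```python
import json
import struct

# Build a minimal safetensors-style payload for one float32 tensor "w" of
# shape [2]: header length, JSON header, then the raw bytes.
header = {"w": {"dtype": "F32", "shape": [2], "data_offsets": [0, 8]}}
header_bytes = json.dumps(header).encode("utf-8")
payload = (
    struct.pack("<Q", len(header_bytes))   # 8-byte little-endian header size
    + header_bytes                         # JSON metadata section
    + struct.pack("<2f", 1.0, 2.0)         # tensor data buffer
)

# "Lazy" read: parse only the header to learn names, dtypes, and shapes
# without deserializing (or even reading) any weight data.
(header_len,) = struct.unpack_from("<Q", payload, 0)
meta = json.loads(payload[8 : 8 + header_len])
```

Because the header fully describes every tensor's byte range, a reader can memory-map the file and materialize individual tensors on demand, which is what makes partial loading cheap.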

TensorRT (.engine): TensorRT is NVIDIA's high-performance Deep Learning Inference library. Its .engine format represents optimized runtime models. ONNX has a TensorRT backend that parses ONNX models for execution with TensorRT, supporting both Python and C++. The list of supported ONNX operators for TensorRT is maintained separately. Currently, each model checkpoint needs to be compiled first to ONNX and then to TensorRT. INT4 and INT16 quantization are not currently supported by TensorRT.

TensorFlow Lite (.tflite): This is a FlatBuffer format developed by Google to optimize TensorFlow models for edge and mobile devices.

Core ML (.mlmodel): Apple's .mlmodel format is designed for deploying ML models within Apple's ecosystem (iOS, macOS, watchOS).

Keras (.h5): Adopted early by Keras, the .h5 (HDF5) format allows storing the entire model architecture, weights, and optimizer state in a single file.

TensorFlow Checkpoint (.ckpt): The checkpoint format in TensorFlow allows models to save training states, weights, and optimizer configurations.

Python Pickle (.pkl): Python's pickle module is foundational for many ML file formats, particularly in the PyTorch and broader Python ecosystem. It serializes Python objects into a byte stream and deserializes them back. However, loading pickled files from untrusted sources can execute arbitrary code, posing a security risk.
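A minimal round trip with the standard library illustrates both the convenience and the caveat:

```python
import pickle

# Serialize an arbitrary Python object to bytes and restore it.
params = {"weights": [0.1, -0.3], "bias": 0.05}
blob = pickle.dumps(params)

# WARNING: never unpickle data from an untrusted source; deserialization
# can execute arbitrary code embedded in the byte stream.
restored = pickle.loads(blob)
```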

Model Definition Files in ArcGIS

In ArcGIS, a deep learning model package typically includes a folder structure with files like loss_graph.png, show_results.png, training_validation_loss.json, model_metrics.html, and the model weights file (e.g., .pth).

The Esri Model Definition File (.emd) is a JSON file that describes the trained deep learning model. It contains parameters required for inference tools and should be modified by the data scientist. The .emd file, once completed, can be used for multiple inferences as long as the input imagery originates from the same sensor and targets the same classes.

Key parameters within the .emd file include:

  • Framework: The deep learning framework used (e.g., TensorFlow, Keras, PyTorch).
  • ModelConfiguration: The type of model training (e.g., ObjectDetectionAPI, DeepLab, MaskRCNN).
  • ModelType: The type of model (e.g., ImageClassification, ObjectDetection, ObjectClassification).
  • ModelFile: The path to the trained deep learning model file.
  • Description: Information about the model.
  • InferenceFunction (Optional): Path to an inference function if custom logic is required.
  • SensorName (Optional): Name of the sensor used for training imagery.
  • Classes (Optional): Information about output class categories.
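A hypothetical, abbreviated .emd file illustrating these parameters might look like the following (all values are invented for illustration):

```json
{
  "Framework": "PyTorch",
  "ModelConfiguration": "MaskRCNN",
  "ModelType": "ObjectDetection",
  "ModelFile": "model.pth",
  "Description": "Hypothetical building-footprint detector",
  "Classes": [
    {"Value": 1, "Name": "Building", "Color": [255, 0, 0]}
  ]
}
```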

These .emd files, along with trained model files and any necessary supporting files, can be packaged into a .dlpk file for easy distribution and use within ArcGIS Pro.
