Improving Application Performance with NVMe Storage - Part 1

The Rise of AI and ML Driving Parallel Computing Requirements
Zivan Ori

As computing technology and data algorithms have advanced over the years, the ways in which technology is applied to real-world challenges have grown more automated and autonomous. This has given rise to a completely new set of computing workloads for Machine Learning, which drives Artificial Intelligence applications (together, AI / ML).

AI / ML can be applied across a broad spectrum of applications and industries. Financial analysis with real-time analytics is used for predicting investments and drives the FinTech industry's need for high performance computing. Real-time image recognition is a key enabler for self-driving vehicles, while facial recognition is used by law enforcement across the globe. Manufacturing uses image recognition technology to spot defects in materials; organizations such as NOAA use satellite imagery to spot changes in weather; and social media platforms use image recognition to tag photos of friends and family.

What is common among these use cases is the need for a high level of parallel computing power, coupled with a high-performance, low-latency architecture that enables parallel processing of data in real time across the compute cluster. The "training" phase of machine learning is critical and can take an excessively long time, especially as the training data set grows exponentially to enable deep learning for AI.

With storage performance now recognized as a critical component of AI / ML application performance, the next step is to identify the ideal storage platform. Non-Volatile Memory Express (NVMe) based storage systems have gained traction as the storage of choice for delivering the best throughput and latency. Shared NVMe storage systems unlock the performance of NVMe and offer a strong alternative to using local NVMe SSDs inside GPU nodes.

The Rise of GPUs for AI / ML

GPUs were originally created for high performance image creation and are very efficient at manipulating computer graphics and processing images. Their highly parallel structure makes them much more efficient than general-purpose CPUs for algorithms that process large blocks of data in parallel. For this reason, GPUs have found strong adoption in AI / ML: they allow a high degree of parallel computing, and current AI-focused applications have been optimized to run on GPU-based computing clusters.

With the powerful compute performance of GPUs, the bottleneck moves to other areas of the AI / ML architecture. For example, feeding machine learning with the volume of data it needs demands massive parallel read access to shared files in the storage subsystem from all nodes in the GPU cluster. This creates a performance challenge that shared NVMe storage systems are ideally suited to address.
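The access pattern described above, many readers pulling shared files at once, can be illustrated with a minimal sketch. It is not taken from any particular storage product: the function names (`read_shard`, `parallel_read`, `make_demo_shards`), the shard file names, and the use of threads on a single machine are all assumptions made for illustration. In a real cluster, the readers would be data-loader processes spread across GPU nodes.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def read_shard(path: str, chunk_size: int = 1 << 20) -> int:
    """Read one file sequentially in chunks and return the number of bytes read."""
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total


def parallel_read(paths: list[str], workers: int = 8) -> int:
    """Read all shards concurrently and return the total bytes read.

    Each worker thread stands in for one reader issuing its own stream of
    requests against the shared storage (hypothetical setup).
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(read_shard, paths))


def make_demo_shards(directory: str, count: int = 4, size: int = 4096) -> list[str]:
    """Create small random files that stand in for training-data shards."""
    paths = []
    for i in range(count):
        path = os.path.join(directory, f"shard_{i}.bin")
        with open(path, "wb") as f:
            f.write(os.urandom(size))
        paths.append(path)
    return paths


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as demo_dir:
        shards = make_demo_shards(demo_dir)
        print(parallel_read(shards, workers=4), "bytes read")
```

With more concurrent readers, the storage subsystem has to sustain more simultaneous request streams, which is the load the article refers to.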

Shared NVMe Storage for High Performance Machine Learning (ML)

One of the benefits of shared NVMe storage is the ability to create even deeper neural networks, thanks to the inherent high performance of shared storage. This opens the door to future models that cannot be achieved today with non-shared NVMe storage solutions.

Today, there are storage solutions that offer patented architectures built from the ground up to leverage NVMe. The key to performance and scalability is the separation of control and data path operations between the storage controller software and the host-side agents. The storage controller software provides centralized control and management, while the agents manage data path operations with direct access to shared storage volumes.
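The split between control path and data path can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the design of any specific product: the `Controller` and `Agent` classes, the extent layout, and the in-memory `drives` dictionary standing in for NVMe drives are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Controller:
    """Control path (hypothetical): owns volume layouts, never touches user data."""
    layouts: dict = field(default_factory=dict)  # volume name -> [(drive_id, offset), ...]
    lookups: int = 0                             # how often agents asked for a layout

    def create_volume(self, name, extents):
        self.layouts[name] = list(extents)

    def get_layout(self, name):
        self.lookups += 1
        return list(self.layouts[name])


@dataclass
class Agent:
    """Data path (hypothetical): runs on the host and reads the drives directly."""
    controller: Controller
    drives: dict          # drive_id -> bytes; stands in for directly reachable drives
    block_size: int = 4
    _cache: dict = field(default_factory=dict)

    def read_block(self, volume, index):
        # Ask the controller for the layout once, then serve reads without it.
        if volume not in self._cache:
            self._cache[volume] = self.controller.get_layout(volume)
        drive_id, offset = self._cache[volume][index]
        return self.drives[drive_id][offset:offset + self.block_size]


if __name__ == "__main__":
    drives = {"d0": b"AAAABBBB", "d1": b"CCCCDDDD"}
    ctl = Controller()
    # Blocks are striped across the two drives.
    ctl.create_volume("vol", [("d0", 0), ("d1", 0), ("d0", 4), ("d1", 4)])
    agent = Agent(ctl, drives)
    print(b"".join(agent.read_block("vol", i) for i in range(4)))
    print("controller lookups:", ctl.lookups)
```

In this sketch the controller is consulted once for the volume layout, and every subsequent read goes straight from the agent to the drive. That mirrors the idea that centralized management stays out of the per-I/O path.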

While AI / ML workloads are run exclusively on the GPUs within the cluster, that doesn't mean that CPUs have been eliminated from the GPU clusters completely. The operating system and drivers still leverage the CPUs, but while the machine learning training is in progress, the CPU is relatively idle. This provides the perfect opportunity for an NVMe based storage architecture to leverage the idle CPU computing capacity for a high performance distributed storage approach.

With the NVMe protocol supporting far more parallel connections (queues) per SSD than earlier storage protocols, the storage agents use RDMA to give each GPU node a direct connection to the drives. This approach enables the agents to perform up to 90% of the data path operations between the GPU nodes and storage, reducing latency to be on par with local SSDs.

In this scenario, running the NVMe based storage agent on the idle CPU cores of the GPU nodes enables the NVMe based storage to deliver 10x better performance than competing all-flash solutions, while leveraging existing compute resources that are already installed and available to use.

Read Part 2: Local versus Shared Storage for Artificial Intelligence (AI) and Machine Learning (ML)
