Improving Application Performance with NVMe Storage - Part 2
Local versus Shared Storage for Artificial Intelligence (AI) and Machine Learning (ML)
April 30, 2019

Zivan Ori
E8 Storage

Using local SSDs inside of the GPU node delivers fast access to data during training, but introduces challenges that impact the overall solution in terms of scalability, data access and data protection.

Start with Part 1: The Rise of AI and ML Driving Parallel Computing Requirements

Normally, GPU nodes don't have much room for SSDs, which limits the opportunity to train very deep neural networks that need more data. For example, one well-respected vendor's standard solution is limited to 7.5TB of internal storage and can scale only to 30TB. In contrast, generally available NVMe solutions can scale from 100TB to 1PB of shared NVMe storage at the performance of local NVMe SSDs, providing the opportunity to significantly increase the depth of training for neural networks.

A number of today's GPU-based servers have the power to perform entire processing operations on their own. However, some workloads require more than a single GPU node, either to speed up operations by processing across multiple GPUs, or to process a machine learning model too large to fit into a single GPU. If the clustered GPU nodes all need access to the same dataset for their machine learning training, the data has to be copied to each GPU node, leading to capacity limitations and inefficient storage utilization. Alternatively, if the dataset is split among the nodes in the GPU cluster, then each portion is stored only locally and cannot be shared between the nodes, and there is no redundancy scheme (RAID / replication) to protect the data.
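
The capacity cost of copying a dataset to every node can be illustrated with a little arithmetic. The following is a minimal sketch, not a sizing tool: the function names, the 50TB dataset, and the 4-node cluster are hypothetical, and the 30TB per-node limit is taken from the vendor example earlier in this article. RAID overhead on the shared array is ignored.

```python
def replicated_capacity_tb(dataset_tb, nodes):
    """Raw capacity consumed when every node keeps a full local copy."""
    return dataset_tb * nodes


def storage_efficiency(dataset_tb, consumed_tb):
    """Fraction of the consumed capacity that holds unique data."""
    return dataset_tb / consumed_tb


def fits_on_node(dataset_tb, node_capacity_tb):
    """True if a full copy of the dataset fits on one node's local SSDs."""
    return dataset_tb <= node_capacity_tb


# Hypothetical cluster: 4 GPU nodes, 50 TB dataset (assumed values).
# The 30 TB per-node limit is the figure quoted in the article.
dataset_tb, nodes, node_capacity_tb = 50.0, 4, 30.0

consumed = replicated_capacity_tb(dataset_tb, nodes)
print(f"Full copy fits on one node: {fits_on_node(dataset_tb, node_capacity_tb)}")
print(f"Local copies would consume {consumed:.0f} TB "
      f"({storage_efficiency(dataset_tb, consumed):.0%} efficiency)")
print(f"One shared copy consumes {dataset_tb:.0f} TB before RAID overhead")
```

Under these assumptions the dataset does not fit on a single node at all, and even if it did, each added node would store another full copy of the same data.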

Because local SSDs may not have the capacity to store the full dataset for machine learning or deep learning, some installations instead use local SSDs as a cache in front of a slower storage array to accelerate access to the working dataset. This creates performance bottlenecks, because the volume of data movement delays the availability of cached data on the SSDs. As datasets grow, local SSD caching becomes ineffective for feeding the GPU training models at the required speeds.
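
One way to reason about why caching loses effectiveness is a simple blended-throughput model. The sketch below is an assumption of this example rather than a measurement from the article: it treats every read as either a cache hit served at local SSD speed or a miss served at the speed of the slower array, and the 3.0 GB/s and 0.5 GB/s figures are hypothetical.

```python
def effective_throughput_gb_s(hit_ratio, cache_gb_s, backend_gb_s):
    """Blended read throughput for a cache in front of slower storage.

    Time to read 1 GB is the hit fraction at cache speed plus the miss
    fraction at backend speed; throughput is the inverse of that time.
    """
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    seconds_per_gb = hit_ratio / cache_gb_s + (1.0 - hit_ratio) / backend_gb_s
    return 1.0 / seconds_per_gb


# Hypothetical speeds: local SSD cache at 3.0 GB/s, backing array at 0.5 GB/s.
for hit_ratio in (1.0, 0.9, 0.5, 0.0):
    rate = effective_throughput_gb_s(hit_ratio, 3.0, 0.5)
    print(f"hit ratio {hit_ratio:.0%}: {rate:.2f} GB/s")
```

In this model, throughput falls off quickly as the hit ratio drops, which is what happens when the working dataset outgrows the local cache.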

Shared NVMe storage can solve the performance challenge for GPU clusters by giving all nodes in the cluster shared read / write data access at the performance of local SSDs. The need to cache or replicate datasets to all nodes in the GPU cluster is eliminated, improving the overall storage efficiency of the cluster. With some solutions offering support for up to 1PB of RAID-protected, shared NVMe data, the GPU cluster can tackle massive deep learning training for improved results. For clustered applications, this type of solution is well suited to global filesystems such as IBM Spectrum Scale, Lustre, Ceph and others.

Use Case Scenario Example: Deep Learning Datasets

One vendor provides the hardware infrastructure that its customers use to test a variety of applications. With simple connectivity via Ethernet (or InfiniBand), shared NVMe storage provides more capacity for deep learning datasets, which would allow the vendor to expand the use cases that it offers to its customers.

Moving to Shared NVMe-oF Storage

Having now discussed the performance of NVMe inside of GPU nodes, let's explore the performance impact of moving to shared NVMe-oF storage. For this discussion, we will use an example in which performance testing focuses on assessing the single-node performance of shared NVMe storage relative to the local SSD inside the GPU node.

Reasonable benchmark parameters and test objectives could be:

1. RDMA Performance: Test whether RDMA-based (remote direct memory access) connectivity at the core of the storage architecture could enable low-latency and high data throughput.

2. Network Performance: How large quantities of data affect the network, and whether the network becomes a bottleneck during data transfers.

3. CPU Consumption: How much CPU power is used during large data transfers over the RDMA-enabled NICs.

4. In general, whether RDMA technology could be a key component of an AI / ML computing cluster.
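
The throughput and CPU-consumption objectives above can be approximated at a very small scale with a sequential read test. The following is a minimal sketch and not the benchmark described in this article: `measure_read` and `summarize` are hypothetical helper names, the file is a small temporary file rather than a dataset on an NVMe-oF volume, and the operating system's page cache will inflate the numbers. A real assessment would use a dedicated I/O benchmarking tool against the actual storage.

```python
import os
import tempfile
import time


def measure_read(path, block_size=1 << 20):
    """Read a file sequentially; return (bytes_read, wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()  # user + system CPU time of this process
    total = 0
    with open(path, "rb", buffering=0) as f:  # no Python-level buffering
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break
            total += len(chunk)
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return total, wall, cpu


def summarize(total, wall, cpu):
    """Convert raw counters into throughput (MB/s) and CPU time per wall second."""
    wall = max(wall, 1e-9)  # guard against a zero-length interval
    return {"throughput_mb_s": total / 1e6 / wall, "cpu_fraction": cpu / wall}


# Demo on an 8 MiB temporary file (a stand-in for a training dataset).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(8 * 1024 * 1024))
    demo_path = tmp.name
try:
    stats = summarize(*measure_read(demo_path))
    print(f"throughput: {stats['throughput_mb_s']:.0f} MB/s, "
          f"CPU fraction: {stats['cpu_fraction']:.2f}")
finally:
    os.remove(demo_path)
```

Pointing the same function at a file on local NVMe and then at a file on a shared volume gives a rough side-by-side view of throughput and of the CPU time spent by the reading process. It does not capture CPU time spent in the NIC driver or kernel on behalf of other processes.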

I have in fact been privy to similar benchmarks. For side-by-side testing, a TensorFlow benchmark with two different models was utilized: ResNet-50, a 50-layer residual neural network, as well as VGG-19, a 19-layer convolutional neural network that was trained on more than a million images from the ImageNet database. Both models were read-intensive, as the neural network ingests massive amounts of data during both the training and processing phases of the benchmark. A single GPU node was used for all testing to maintain a common compute platform for all of the test runs. The storage appliance was connected to the node via the NVMe-oF protocol over 50GbE / 100GbE ports for the shared NVMe storage testing. For the final results, all of the test runs used a common configuration of training batch size and quantity. During initial testing, different batch sizes were tested (32, 64, 128), but ultimately the testing was performed using the recommended settings.
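
To judge whether a 50GbE or 100GbE link is likely to become the bottleneck in a setup like this, the required ingest rate can be compared with the raw link capacity. This is a back-of-the-envelope sketch under stated assumptions: the training rate of 1,000 images per second and the average image size of 110 KB are hypothetical values chosen for illustration, and protocol overhead is ignored.

```python
def link_capacity_gb_s(gbit_per_s):
    """Raw link capacity in gigabytes per second (8 bits per byte)."""
    return gbit_per_s / 8.0


def required_ingest_gb_s(images_per_s, avg_image_kb):
    """Data rate needed to feed training, in gigabytes per second."""
    return images_per_s * avg_image_kb * 1e3 / 1e9


def link_utilization(images_per_s, avg_image_kb, gbit_per_s):
    """Fraction of the raw link capacity consumed by the training reads."""
    return required_ingest_gb_s(images_per_s, avg_image_kb) / link_capacity_gb_s(gbit_per_s)


# Hypothetical workload: 1,000 images/s at an average of 110 KB per image.
images_per_s, avg_image_kb = 1000.0, 110.0

for gbit in (50, 100):
    util = link_utilization(images_per_s, avg_image_kb, gbit)
    print(f"{gbit}GbE: {link_capacity_gb_s(gbit):.2f} GB/s raw, "
          f"utilization {util:.1%}")
```

With real numbers from a benchmark run substituted in, a utilization approaching 100% would suggest the network rather than the storage is limiting training.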

Benchmark Results

In both image throughput and overall training time, the shared NVMe appliance outperformed the local NVMe SSD inside the GPU node by a couple of percentage points. This highlights one of the performance advantages of shared NVMe storage: spreading volumes across all drives in the array gains the throughput of multiple SSDs, which compensates for any latency impact of moving to external storage. In other words, the improved image throughput means that more images can be processed in an hour / day / week when using shared NVMe storage than with local SSDs. Although the difference is just a few percentage points, this advantage will scale up as more GPU nodes are added to the compute cluster.
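
The cumulative effect of a small throughput gain can be sketched with simple arithmetic. The figures below are hypothetical and are not the benchmark results: a baseline of 1,000 images per second per node and a 2% gain are assumed, and the gain is assumed to apply equally to every node added to the cluster.

```python
SECONDS_PER_HOUR = 3600
SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR
SECONDS_PER_WEEK = 7 * SECONDS_PER_DAY


def extra_images(base_images_per_s, gain_pct, seconds, nodes=1):
    """Additional images processed thanks to a percentage throughput gain.

    Assumes each node sees the same gain (linear scaling across nodes).
    """
    return base_images_per_s * (gain_pct / 100.0) * seconds * nodes


# Hypothetical inputs: 1,000 images/s per node baseline, 2% gain.
base, gain_pct = 1000.0, 2.0

for nodes in (1, 4):
    for label, seconds in (("hour", SECONDS_PER_HOUR),
                           ("day", SECONDS_PER_DAY),
                           ("week", SECONDS_PER_WEEK)):
        gained = extra_images(base, gain_pct, seconds, nodes)
        print(f"{nodes} node(s), per {label}: {gained:,.0f} extra images")
```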

In addition, the training time with shared NVMe storage was shorter than with local SSDs, again highlighting the advantage of being able to bring the performance of multiple NVMe SSDs to bear in a shared volume. Combined with the scalability of shared NVMe storage, this enables customers not only to speed up training, but also to leverage datasets of 100TB or more to enable deep learning for improved results.

Read Part 3: Benefits of NVMe Storage for AI/ML

Zivan Ori is CEO and Co-Founder of E8 Storage