Improving Application Performance with NVMe Storage - Part 2
Local versus Shared Storage for Artificial Intelligence (AI) and Machine Learning (ML)
April 30, 2019

Zivan Ori
E8 Storage

Using local SSDs inside the GPU node delivers fast access to data during training, but introduces challenges for the overall solution in terms of scalability, data access and data protection.

Start with Part 1: The Rise of AI and ML Driving Parallel Computing Requirements

Normally, GPU nodes don't have much room for SSDs, which limits the opportunity to train very deep neural networks that need more data. For example, one well respected vendor's standard solution is limited to 7.5TB of internal storage, and it can only scale to 30TB. In contrast, there are generally available NVMe solutions that can scale from 100TB to 1PB of shared NVMe storage at the performance of local NVMe SSDs, providing the opportunity to significantly increase the depth of the training for neural networks.

A number of today's GPU-based servers have the power to perform entire processing operations on their own; however, some workloads require more than a single GPU node, either to speed up operations by processing across multiple GPUs, or to process a machine learning model too large to fit into a single GPU. If the clustered GPU nodes all need access to the same dataset for their machine learning training, the data has to be copied to each GPU node, leading to capacity limitations and inefficient storage utilization. Alternatively, if the dataset is split among the nodes in the GPU cluster, then data is only stored locally and cannot be shared between the nodes, and there is no redundancy scheme (RAID / replication) to protect the data.
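To see the storage-efficiency penalty of copying the dataset to every node, here is a minimal back-of-the-envelope sketch; the dataset size, node count, and RAID overhead factor are illustrative assumptions, not figures from any specific deployment:

```python
def replicated_capacity_tb(dataset_tb, num_nodes):
    # Every GPU node stores its own full copy of the training set.
    return dataset_tb * num_nodes

def shared_capacity_tb(dataset_tb, raid_overhead=1.25):
    # A single RAID-protected copy on the shared array serves all nodes;
    # raid_overhead approximates the parity/replication cost.
    return dataset_tb * raid_overhead

# Illustrative only: a 20TB dataset on an 8-node GPU cluster.
print(replicated_capacity_tb(20, 8))  # 160 TB of local SSD consumed
print(shared_capacity_tb(20))         # 25.0 TB on the shared array
```

The gap widens linearly with cluster size, which is why per-node copies stop scaling well before the shared-array approach does.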

Because local SSDs may not have the capacity to store the full dataset for machine learning or deep learning, some installations instead use the local SSD as a cache for a slower storage array to accelerate access to the working dataset. This creates performance bottlenecks, as the constant data movement delays cached data becoming available on the SSDs. As datasets grow, local SSD caching becomes ineffective for feeding the GPU training models at the required speeds.
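The caching penalty can be sketched with a simple harmonic-mean throughput model: as the cache hit rate falls, the effective read rate collapses toward the backend array's speed. The SSD and array speeds below are assumed, illustrative numbers:

```python
def effective_throughput_gbps(hit_rate, ssd_gbps, backend_gbps):
    # Average service rate when cache misses fall through to the slower
    # array: time per unit of data = hit_rate/ssd + (1 - hit_rate)/backend.
    return 1.0 / (hit_rate / ssd_gbps + (1.0 - hit_rate) / backend_gbps)

# Illustrative: 3 GB/s local SSD cache in front of a 0.5 GB/s array.
for hit_rate in (0.9, 0.7, 0.5):
    print(hit_rate, round(effective_throughput_gbps(hit_rate, 3.0, 0.5), 2))
```

Even a 10% miss rate drags the effective rate well below the SSD's native speed, which is the bottleneck the paragraph above describes.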

Shared NVMe storage can solve the performance challenge for GPU clusters by giving shared read / write data access to all nodes in the cluster at the performance of local SSDs. The need to cache or replicate datasets to all nodes in the GPU cluster is eliminated, improving the overall storage efficiency of the cluster. With some solutions offering support for up to 1PB of RAID protected, shared NVMe data, the GPU cluster can tackle massive deep learning training for improved results. For clustered applications, this type of solution is ideal for global filesystems such as IBM Spectrum Scale, Lustre, CEPH and others.

Use Case Scenario Example: Deep Learning Datasets

One vendor provides the hardware infrastructure that its customers use to test a variety of applications. With simple connectivity via Ethernet (or InfiniBand), shared NVMe storage provides more capacity for deep learning datasets, allowing the vendor to expand the use cases it offers to its customers.

Moving to Shared NVMe-oF Storage

Having discussed the performance of NVMe inside GPU nodes, let's explore the performance impact of moving to shared NVMe-oF storage. For this discussion, we will use an example in which performance testing focuses on comparing the single-node performance of shared NVMe storage against that of the local SSD inside the GPU node.

Reasonable benchmark parameters and test objectives could be:

1. RDMA Performance: Test whether RDMA-based (remote direct memory access) connectivity at the core of the storage architecture can deliver low latency and high data throughput.

2. Network Performance: Determine how large quantities of data affect the network, and whether the network becomes a bottleneck during data transfers.

3. CPU Consumption: Measure how much CPU power is used during large data transfers over the RDMA-enabled NICs.

4. Overall Suitability: Assess whether RDMA technology can be a key component of an AI / ML computing cluster.
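As a rough single-client sketch of objectives 2 and 3, the snippet below streams a file from a mount point (a local SSD or a shared NVMe-oF volume) and reports throughput and the CPU time consumed. It is a simplified stand-in for real benchmark tooling, not the tool used in the tests described here:

```python
import time

def measure_read(path, block_size=1 << 20):
    # Stream the file in 1 MiB blocks, tracking wall-clock time and
    # process CPU time so throughput and CPU consumption can both
    # be reported for the transfer.
    cpu0, wall0 = time.process_time(), time.perf_counter()
    total = 0
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break
            total += len(chunk)
    cpu = time.process_time() - cpu0
    wall = time.perf_counter() - wall0
    return {
        "bytes": total,
        "gb_per_sec": total / wall / 1e9 if wall else 0.0,
        "cpu_fraction": cpu / wall if wall else 0.0,
    }
```

Running it against the same file on both mount points gives a crude side-by-side view of whether the network path, rather than the drives, is the limiting factor.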

I have in fact been privy to similar benchmarks. For side-by-side testing, a TensorFlow benchmark with two different data models was utilized: ResNet-50, a 50-layer residual neural network, as well as VGG-19, a 19-layer convolutional neural network that was trained on more than a million images from the ImageNet database. Both models are read-intensive, as the neural network ingests massive amounts of data during both the training and processing phases of the benchmark. A single GPU node was used for all testing to maintain a common compute platform across all test runs. The storage appliance was connected to the node via the NVMe-oF protocol over 50GbE / 100GbE ports for the shared NVMe storage testing. For the final results, all of the tests used a common configuration of training batch size and quantity. During initial testing, different batch sizes were tested (32, 64, 128), but ultimately the testing was performed using the recommended settings.
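The images/sec metric behind this kind of side-by-side run can be captured with a generic timing harness like the hypothetical sketch below; `step_fn` stands in for one training step of whatever model is under test (the actual benchmarks used TensorFlow models, omitted here to keep the sketch self-contained):

```python
import statistics
import time

def images_per_sec(step_fn, batch_size, warmup=2, steps=10):
    # Run a few warmup steps so caches and pipelines settle, then time
    # each measured step and convert the median step time into an
    # images/sec figure: batch_size images are processed per step.
    for _ in range(warmup):
        step_fn()
    samples = []
    for _ in range(steps):
        t0 = time.perf_counter()
        step_fn()
        samples.append(time.perf_counter() - t0)
    return batch_size / statistics.median(samples)
```

Running the same harness with the training data on the local SSD and then on the shared NVMe-oF mount isolates the storage path as the only variable, mirroring the methodology above.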

Benchmark Results

In both image throughput and overall training time, the appliance exceeded the performance of the local NVMe SSD inside the GPU node by a couple of percentage points. This highlights one of the performance advantages of shared NVMe storage: spreading volumes across all drives in the array gains the throughput of multiple SSDs, which compensates for any latency impact of moving to external storage. In other words, the improved image throughput means that more images can be processed in an hour / day / week with shared NVMe storage than with local SSDs. Although the difference is just a few percentage points, this advantage will scale up as more GPU nodes are added to the compute cluster.
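A simple model shows why striping helps: the shared volume's read bandwidth is the aggregate of all drives in the array, capped only by the client's network link. The drive count and per-device speeds below are illustrative assumptions:

```python
def striped_read_gbps(num_drives, per_drive_gbps, link_gbps):
    # A volume striped across all array drives aggregates their
    # bandwidth, capped by the client's NVMe-oF network link.
    return min(num_drives * per_drive_gbps, link_gbps)

# Illustrative: 24 drives at 3 GB/s each behind a 100GbE link
# (~12.5 GB/s) saturate the link, versus one local SSD at 3 GB/s.
print(striped_read_gbps(24, 3.0, 12.5))  # 12.5
print(striped_read_gbps(1, 3.0, 12.5))   # 3.0
```

Under these assumptions the remote volume can deliver several times the bandwidth of a single local SSD, which is enough headroom to absorb the added fabric latency.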

In addition, training time with shared NVMe storage was much faster than with local SSDs, again highlighting the advantage of bringing the performance of multiple NVMe SSDs to bear in a shared volume. Combined with the scalability of NVMe storage, this enables customers not only to speed up training, but also to leverage datasets of 100TB or more to enable deep learning for improved results.

Read Part 3: Benefits of NVMe Storage for AI/ML

Zivan Ori is CEO and Co-Founder of E8 Storage