Using local SSDs inside of the GPU node delivers fast access to data during training, but introduces challenges that impact the overall solution in terms of scalability, data access and data protection.
Start with Part 1: The Rise of AI and ML Driving Parallel Computing Requirements
Normally, GPU nodes don't have much room for SSDs, which limits the opportunity to train very deep neural networks that need more data. For example, one well respected vendor's standard solution is limited to 7.5TB of internal storage, and it can only scale to 30TB. In contrast, there are generally available NVMe solutions that can scale from 100TB to 1PB of shared NVMe storage at the performance of local NVMe SSDs, providing the opportunity to significantly increase the depth of the training for neural networks.
A number of today's GPU-based servers have the power to perform entire processing operations on their own, however some workloads require more than a single GPU node, either to speed up operations by processing across multiple GPUs, or to process a machine learning model to large to fit into a single GPU. If the clustered GPU nodes all need access to the same dataset for their machine learning training, the data has to be copied to each CPU node, leading to capacity limitations and inefficient storage utilization. Alternatively, if the dataset is split among the nodes in the GPU cluster, then data is only stored locally and cannot be shared between the nodes, and there is no redundancy scheme (RAID / replication) to protect the data.
Because using local SSDs may not have the capacity to store the full dataset for machine learning or deep learning, some installations instead use local SSD as cache for a slower storage array to accelerate access to the working dataset. This leads to performance bottlenecks as the amount of data movement leads to delays in cached data being available on the SSDs. As datasets grow, local SSD caching becomes ineffective for feeding the GPU training models at the required speeds.
Shared NVMe storage can solve the performance challenge for GPU clusters by giving shared read / write data access to all nodes in the cluster at the performance of local SSDs. The need to cache or replicate datasets to all nodes in the GPU cluster is eliminated, improving the overall storage efficiency of the cluster. With some solutions offering support for up to 1PB of RAID protected, shared NVMe data, the GPU cluster can tackle massive deep learning training for improved results. For clustered applications, this type of solution is ideal for global filesystems such as IBM Spectrum Scale, Lustre, CEPH and others.
Use Case Scenario Example: Deep Learning Datasets
One vendor provides the hardware infrastructure that their customers use to test a variety of applications. With simple connectivity via Ethernet (or InfiniBand), shared NVMe storage provides more capacity for deep learning datasets, which would allow them to expand the use cases that it offers to its customers.
Moving to Shared NVMe-oF Storage
Having now discussed the performance of NVMe inside of GPU nodes, let's explore the performance impacts of moving to shared NVMe-oF storage. For this discussion, we will use an example where performance testing would be focused on assessing single node performance of using shared NVMe storage relative to the local SSD inside of the GPU node.
Reasonable benchmark parameters and test objectives could be:
1. RDMA Performance: Test whether RDMA-based (remote direct memory access) connectivity at the core of the storage architecture could enable low-latency and high data throughput.
2. Network Performance: How would large quantities of data affect the network, and whether the network became a bottleneck during data transfers.
3. CPU Consumption: How much CPU power is used during large data transfers over the RDMA enabled NICs.
4. In general, whether RDMA technology could be a key component of an AI / ML computing cluster.
I have in fact been privy to similar benchmarks. For side-by-side testing, a TensorFlow benchmark with two different data models was utilized: ResNet-50, a 50-layer residual neural network, as well VGG-19, a 19-layer convolutional neural network that was trained on more than a million images from the ImageNet database. Both models were read-intensive as the neural network ingests massive amounts of data during both the training and processing phases of the benchmark. A single GPU node was used for all testing to maintain a common compute platform for all of the tests runs. The storage appliance was connected to the node via the NVMe-oF protocol over 50GbE / 100GbE ports for the shared NVMe storage testing. For the final results, all of the tests used a common configuration of training batch size and quantity. During initial testing, different batch sizes were tested (32, 64, 128), but ultimately the testing was performed using the recommended settings.
A single GPU node was used for all testing to maintain a common compute platform for all of the tests runs. The NVMe appliance was connected to the node via the NVMe-oF protocol over 50GbE / 100GbE ports for the shared NVMe storage testing. For the final results, all of the test runs used a common configuration of training batch size and quantity. During initial testing, different batch sizes were tested (32, 64, 128), but ultimately the testing was performed using the recommended settings.
In both image throughput and overall training time, the appliance exceeded the performance of the local NVMe SSD inside the GPU node by a couple of percentage points. This highlights one of the performance advantages of shared NVMe storage: the ability to spread volumes across all drives in the array gains the throughput advantages of multiple SSDs, which compensates for the any latency impacts of moving to external storage. In other words, the improved image throughput performance means that more images can be processed in an hour / day / week when using shared NVMe storage than with local SSDs. Although the difference is just a few percentage points, this advantage will scale up as more GPU nodes are added to the compute cluster.
In addition, the training time with NVMe storage was much faster than with local SSDs, again highlighting the advantage of being able to bring the performance of multiple NVMe SSDs to bear in a shared volume. Combined with the scalability of the NVMe storage, this enables customers to not only speed up the performance of training, but to also leverage 100TB or more datasets to enable deep learning for improved results.
Read Part 3: Benefits of NVMe Storage for AI/ML
Digital Experience Monitoring is a tool that should be integrated with an organization's change management strategy. A key benefit of SaaS/cloud is no longer being responsible for software and hardware upgrades, maintenance, and patch cycles. Migrating to Microsoft Office 365 means no longer spending precious time and resources on Windows, Exchange or SharePoint upgrades for example. But that doesn't mean that IT can ignore changes or doesn't need to monitor for their effects ...
As systems become more complex and IT loses direct control of infrastructure (hello cloud), it becomes both more difficult and more important to capture and observe, holistically, the user experience. SaaS or cloud apps like Salesforce, Microsoft Office 365, and Workday have become mission-critical to most businesses and therefore need to be examined when it comes to experience monitoring ...
Newly distributed operations teams are struggling to cope with the sudden change to the WFH (work from home) concept. IT operations teams were traditionally set up to work from centralized locations, unlike software and engineering teams. Some organizations have overcome that by implementing AIOps solutions; others are using a brute force method of employing more IT operations analysts to keep the distributed NOCs going ...
Enterprises that halted their cloud migration journey during the current global pandemic are two and a half times more likely than those that continued their move to the cloud to have experienced IT outages that negatively impacted their SLAs, according to Virtana's latest survey report The Current State of Hybrid Cloud and IT ...