30 Dec VP Jonathan Cartu Publishes – Integrated and automated management is a key component to…
NVIDIA DGX SuperPOD and DDN AI400 storage speed AI deployment and workflows
Supercomputing systems and tools such as artificial intelligence (AI) and deep learning (DL) have recently gained wider recognition as a way for enterprises to create more value from their data. However, AI and DL pose exceptional infrastructure and data management challenges when processing and storing data at scale. This article describes how the NVIDIA DGX SuperPOD™ supercomputing cluster, paired with the DDN A3I (Accelerated, Any-Scale AI) storage solution, makes it easier to deploy a supercomputing infrastructure with minimal complexity and reduced timelines. According to DDN, the combined solution also meets the most challenging AI and DL workload needs and speeds data science workflows by up to 20 times.
Solving the challenges in system deployment
“Historically, the deployment of the largest supercomputing systems was a months-long process which involved extensive customization and tuning to extract the maximum performance for available resources. NVIDIA’s introduction of their NVIDIA DGX SuperPOD infrastructure is a game-changer for the world of complex AI modeling and other high performance computing (HPC)-like workloads that require extreme multi-node scale. Additionally, DDN and NVIDIA have committed an extensive effort to create an end-to-end deployment that pairs the power of the NVIDIA DGX-2 and the parallel data-delivery system of DDN’s A3I appliances for high-performance environments, which are easy to deploy and manage. In our testing, DDN successfully deployed ten AI400 appliances in only four hours,” states Kurt Kuckein, Vice President, Marketing at DDN.
Increasing performance for DL training model datasets
DL models used in areas such as classification, object detection, and natural language processing require large amounts of training data. Datasets in automotive and other computer vision tasks can exceed 30 terabytes (TB) in size and may require 1 GB/s of read performance per graphics processing unit (GPU). During DL training, the same data may be read repeatedly as the model is iterated toward the most accurate result. Processing this data requires a system that can sustain massive throughput across many I/O patterns, including large blocks (greater than 1 megabyte), smaller blocks (less than 1 megabyte and even less than 32 kilobytes), and memory-mapped files. NVIDIA’s DGX SuperPOD is specifically designed for this level of data processing. For a storage solution to meet the needs of the DGX SuperPOD, it must handle these I/O patterns and scale to tens of gigabytes per second of read performance to all nodes simultaneously. DDN developed the DDN® AI400™ appliance to meet these demanding mixed-I/O patterns and to support intensive DL workloads when connected to the DGX SuperPOD.
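The three access patterns named above (large blocks, small blocks, and memory-mapped files) can be compared on any file with a short Python sketch. This is only an illustration under stated assumptions, not a DDN or NVIDIA benchmark: the function names and the 4 MiB and 16 KiB block sizes are hypothetical choices that fall inside the ranges the article gives.

```python
import mmap
import time


def read_in_blocks(path, block_size):
    """Sequentially read the file in fixed-size blocks; return bytes read."""
    total = 0
    # buffering=0 avoids Python-level buffering so each read() maps to one request.
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break
            total += len(chunk)
    return total


def read_memory_mapped(path, stride=4096):
    """Touch one byte per stride through a read-only memory map; return mapped size."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            checksum = 0
            for offset in range(0, len(mm), stride):
                checksum ^= mm[offset]  # forces the page to be faulted in
            return len(mm)


def profile(path):
    """Time each access pattern; return {pattern: (bytes, GB/s)}."""
    patterns = {
        # Block sizes are illustrative assumptions, not values from the article.
        "large blocks (4 MiB)": lambda: read_in_blocks(path, 4 * 1024 * 1024),
        "small blocks (16 KiB)": lambda: read_in_blocks(path, 16 * 1024),
        "memory-mapped": lambda: read_memory_mapped(path),
    }
    results = {}
    for name, run in patterns.items():
        start = time.perf_counter()
        num_bytes = run()
        elapsed = time.perf_counter() - start
        gbps = num_bytes / elapsed / 1e9 if elapsed > 0 else float("inf")
        results[name] = (num_bytes, gbps)
    return results
```

Calling `profile("/path/to/sample.bin")` on a file of your own returns the bytes covered and an approximate rate for each pattern. Repeated runs on a small file mostly measure the operating system page cache, so meaningful storage numbers need a dataset larger than memory or a dedicated benchmarking tool.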
Introducing the NVIDIA SuperPOD – DDN A3I AI400 solution
The NVIDIA DGX SuperPOD is a first-of-its-kind AI supercomputing infrastructure that delivers groundbreaking performance, quickly deploys as a fully integrated system, and is designed to solve the world’s most challenging AI problems. The DGX SuperPOD implements a reference architecture integrating 64 NVIDIA DGX-2 systems with Mellanox InfiniBand™ networking, and the DDN AI400 to create a shared supercomputing infrastructure designed not just for the lab world, but for businesses exploring data science at scale.
The DDN AI400 appliance is a compact and low-power storage solution that provides high raw performance by using Non-Volatile Memory Express (NVMe) drives for storage and InfiniBand as its network transport. The AI400 appliance leverages the EXAScaler® EXA5 file system, a high-performance enterprise parallel file system with expanded data management capabilities.
The AI400 appliance communicates with DGX SuperPOD clients using multiple EDR InfiniBand or 100 GbE network connections for performance, load balancing, and resiliency. The DDN parallel protocol allows each storage appliance to be accessed at over 48 GB/s, supplying plenty of headroom to feed multiple GPUs at full speed simultaneously. This performance is necessary for training image-based networks as image sizes grow to 1080p, 4K, and beyond. In addition, the all-NVMe architecture of the DDN AI400 appliance provides excellent random read performance, often as fast as sequential read patterns.
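A rough sizing calculation shows how the figures quoted in this article relate to each other. The Python sketch below is back-of-the-envelope arithmetic only: the function names are hypothetical, the 48 GB/s figure is treated as exact even though the article says "over 48 GB/s", and the 16 GPUs per DGX-2 comes from NVIDIA's published DGX-2 specification rather than from this article. The last function assumes every GPU reads at its peak rate at the same moment with no local caching, so it is a worst-case bound and not a description of the reference architecture.

```python
import math

GPU_READ_GBPS = 1.0         # per-GPU read rate cited in the article (GB/s)
APPLIANCE_READ_GBPS = 48.0  # per-appliance rate cited in the article (GB/s), treated as exact
DATASET_TB = 30.0           # dataset size cited in the article (decimal TB)
DGX2_SYSTEMS = 64           # DGX-2 systems in the reference architecture
GPUS_PER_DGX2 = 16          # assumption: NVIDIA DGX-2 specification, not stated in the article


def gpus_fed_per_appliance(appliance_gbps=APPLIANCE_READ_GBPS, gpu_gbps=GPU_READ_GBPS):
    """Number of GPUs one appliance can feed if each reads at its peak rate."""
    return math.floor(appliance_gbps / gpu_gbps)


def seconds_to_read_dataset(dataset_tb=DATASET_TB, gbps=APPLIANCE_READ_GBPS):
    """Time for one full pass over the dataset at a given read rate (1 TB = 1000 GB)."""
    return dataset_tb * 1000.0 / gbps


def appliances_for_peak_demand(num_gpus, gpu_gbps=GPU_READ_GBPS,
                               appliance_gbps=APPLIANCE_READ_GBPS):
    """Worst-case appliance count if all GPUs read at peak rate simultaneously."""
    return math.ceil(num_gpus * gpu_gbps / appliance_gbps)


total_gpus = DGX2_SYSTEMS * GPUS_PER_DGX2
print(gpus_fed_per_appliance())                # 48
print(seconds_to_read_dataset())               # 625.0
print(total_gpus)                              # 1024
print(appliances_for_peak_demand(total_gpus))  # 22
```

Under these assumptions, one appliance could feed 48 GPUs at the 1 GB/s rate, and a single pass over a 30 TB dataset at 48 GB/s would take about 625 seconds. Real training jobs rarely pull peak bandwidth on every GPU at once, so actual storage sizing should follow the published reference architecture.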
Enterprises are gaining business insights from their data using supercomputing systems and tools such as AI and DL, but they face challenges in deploying these systems and in processing and storing data at scale. NVIDIA and DDN have collaborated to combine the NVIDIA DGX SuperPOD system with the DDN A3I data management system and the DDN AI400 storage appliance, so that a supercomputing infrastructure can be deployed with minimal configuration and reduced timelines. Enterprises now have access to leadership-class supercomputing resources without the complexity historically associated with this level of infrastructure. Both companies are drawing on decades of experience with data-intensive computing to provide this integrated solution.
“Now, commercial customers that are struggling to deploy their AI models at scale with massive data sets have a readily available recipe that requires little to no customization to drive business innovation. IT can consolidate silos of data science within their organization. In addition, by leveraging the combined capabilities of NVIDIA and DDN, enterprises can speed up data workflows up to 20 times,” states Kurt Kuckein, Vice President, Marketing.
To get your copy of this reference architecture, please visit the DDN website.
DDN is the world’s leading data management supplier to data-intensive, global organizations. The rapidly evolving competitive landscape makes it essential to ensure projects like AI initiatives can move quickly from investigation to production. For more than 20 years, DDN has focused on designing, deploying and optimizing solutions for production level AI, HPC and Big Data. DDN enables businesses to generate more value and accelerate time to insight from their…