DDN Cranks the Data Throughput with AI400X2 Turbo
DDN today launched a new version of its high-end storage appliance for AI and high-performance computing, the line Nvidia uses to power its massive Eos supercomputer. The AI400X2 Turbo delivers a 30% performance boost over the system it replaces, which DDN says will enable customers to train large language models more efficiently when paired with Nvidia GPUs.
DDN has a long history of developing storage solutions for the HPC business. In the new AI era, it has leveraged that leadership to serve the exploding need for high-speed storage to train large language models (LLMs) and other AI models.
While the training data for an LLM is rather modest by big data standards, the need to continually back up, or checkpoint, the model during a training run has driven demand. For instance, when Nvidia started working with AI400X2 systems two years ago, it required a collection of storage systems capable of delivering 1 TB/sec for reads and 500 GB/sec for writes, according to James Coomer, senior vice president of products for DDN.
“That was very critical to them,” Coomer says. “Even though this was an LLM and rationally you think that’s only words, that’s not huge volumes, the model size becomes very large and they need to be checkpointed a lot.”
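Some back-of-the-envelope arithmetic shows why checkpoint writes dominate the storage requirement. The model sizes, byte counts, and bandwidth below are illustrative assumptions, not DDN or Nvidia figures:

```python
# Rough sketch of checkpoint cost during LLM training. All sizes and
# bandwidths here are illustrative assumptions, not vendor figures.

def checkpoint_bytes(params: int, bytes_per_param: int = 2,
                     optimizer_bytes_per_param: int = 6) -> int:
    """Rough checkpoint size: fp16 weights (2 bytes/param) plus
    optimizer state (Adam-style trackers, assumed here to add
    6 bytes/param)."""
    return params * (bytes_per_param + optimizer_bytes_per_param)

def checkpoint_seconds(params: int, write_gb_per_sec: float) -> float:
    """Time to flush one checkpoint at a given aggregate write rate."""
    return checkpoint_bytes(params) / (write_gb_per_sec * 1e9)

# A hypothetical 70B-parameter model checkpointed at 500 GB/sec:
secs = checkpoint_seconds(70_000_000_000, 500.0)
print(f"{secs:.2f} sec per checkpoint")  # training stalls while it writes
```

Since training typically stalls while the checkpoint flushes, and checkpoints recur throughout a multi-week run, aggregate write bandwidth translates directly into reclaimed GPU time.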
Nvidia, which is holding its GPU Technology Conference this week in San Jose, California, adopted the AI400X2 for its own supercomputer, dubbed Eos, which was launched in March 2022. The 18-exaflop cluster sports 48 AI400X2 appliances, which deliver 4.3 TB/sec reads and 3.1 TB/sec writes to a SuperPOD loaded with 576 DGX systems and more than 4,600 H100 GPUs.
“That write performance was a really big goal for them because of the checkpointing operations,” says Kurt Kuckein, vice president of marketing for DDN. “Their whole goal was to ensure around 2 TB/sec and we were able to achieve above 3 [TB/sec] for the write performance.”
That total throughput should, in theory, rise 30% with the new AI400X2 Turbo that DDN announced today. As a 2U appliance, the AI400X2 Turbo can read data at up to 120 GB/s and write at up to 75 GB/s, with total IOPS of 3 million. That compares with 90 GB/s reads and 65 GB/s writes for the AI400X2, which the AI400X2 Turbo replaces atop the DDN stack.
Customers will be able to leverage that 30% benefit in multiple ways: cranking through more work in the same amount of time, getting the same job done quicker, or getting the same job done with fewer storage systems, DDN says.
“We can reduce the number of appliances provisioned, and so potentially you get 30% savings in power as opposed to just raw performance, training times and things like that,” Kuckein says. “Depending on the number of GPUs and things that you have, potentially you’re just decreasing the storage footprint.”
When customers connect multiple AI400X2 appliances to Nvidia DGX systems or SuperPODs over 200Gb InfiniBand or Ethernet networks, total throughput scales accordingly. But it’s not just about the hardware investment, Coomer says.
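As a rough sanity check on that scaling claim, the Eos figures cited earlier line up with simple near-linear aggregation of per-appliance throughput (a sketch assuming the fabric itself is not the bottleneck):

```python
# Near-linear scaling sketch: aggregate throughput of n appliances,
# assuming the InfiniBand/Ethernet fabric is not the bottleneck.

def aggregate_tb_per_sec(appliances: int, per_appliance_gb_sec: float) -> float:
    """Total cluster throughput in TB/sec from per-appliance GB/sec."""
    return appliances * per_appliance_gb_sec / 1000.0

# 48 AI400X2 appliances at ~90 GB/s reads and ~65 GB/s writes each:
print(aggregate_tb_per_sec(48, 90))  # ~4.3 TB/sec reads, as reported for Eos
print(aggregate_tb_per_sec(48, 65))  # ~3.1 TB/sec writes, as reported for Eos
```

The same arithmetic suggests a Turbo-based cluster of equal size would land around 5.8 TB/sec reads and 3.6 TB/sec writes, though that is an extrapolation rather than a published figure.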
“For us of course the argument isn’t really that we do 120 GB/sec. The biggest argument by far is customers of ours have spent like $100 million in infrastructure and cooling and networks and data scientists and data centers and stuff. There’s a big competitive play out there to get your models done faster. It’s about spending 5% of that budget max on storage if you choose DDN, then you get more productive output.”
DDN has experienced a large jump in sales thanks to the GenAI boom. The company says its 2023 AI storage sales were double the 2022 level.
“We didn’t know it was going to be like this,” Coomer said. “We posted a press release last year saying we shipped as much in the first quarter as we did in the previous year. This year, it kind of looks like it might turn out to be similar.”
The AI400X2 Turbo will be available soon. The appliances can be fitted with 2.5-inch NVMe drives in configurations from 30TB to 500TB of capacity. In addition to DDN’s file system, the appliance includes quality of service, port zoning detection, and data integrity check/correction features.
Related Items:
AWS Delivers ‘Lightning’ Fast LLM Checkpointing for PyTorch
GenAI Doesn’t Need Bigger LLMs. It Needs Better Data
Why Object Storage Is the Answer to AI’s Biggest Challenge