The AI Data Cycle: Understanding the Optimal Storage Mix for AI Workloads at Scale
While AI is transforming lives and inspiring a world of new applications, at its core, it’s fundamentally about data utilization and data generation.
As the AI industry builds out a massive new infrastructure to train AI models and offer AI services (inference), there are important implications for data storage. First, storage technology plays an important role in the cost and power efficiency of each stage of this new infrastructure. As AI systems process and analyze existing data, they create new data, much of which will be stored because it is useful or entertaining. And new AI use cases and ever more sophisticated models make existing repositories and additional data sources more valuable for model context and training, powering a cycle in which increased data generation fuels expanded data storage, which in turn fuels further data generation: a virtuous AI Data Cycle.
It’s important for enterprise data center planners to understand the dynamic interplay between AI and data storage. The AI Data Cycle outlines storage priorities for AI workloads at scale at each of its six stages. Storage component manufacturers are tuning their product roadmaps to these accelerating AI-driven requirements to maximize performance and minimize TCO.
Let’s take a quick walk through the stages of the AI Data Cycle:
Raw Data Archives, Content Storage
Raw data is collected and stored from various sources securely and efficiently. The quality and diversity of collected data are critical, setting the foundation for everything that follows.
Storage needs: Capacity enterprise hard disk drives (eHDDs) remain the technology of choice for low-cost bulk data storage, continuing to deliver the highest capacity per drive and the lowest cost per bit.
Data Preparation & Ingestion
Data is processed, cleaned, and transformed for input to model training. Data center owners are implementing upgraded storage infrastructure such as fast data lakes to support preparation and ingestion.
Storage needs: All-flash storage systems incorporating high-capacity enterprise solid state drives (eSSDs) are being deployed to augment existing HDD-based repositories, or within new all-flash storage tiers.
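The cleaning and transformation work in this stage can be sketched in a few lines. This is a minimal, illustrative example only; the record fields ("id", "text") and the cleaning rules are assumptions for the sketch, not something prescribed by any particular data platform.

```python
def prepare_records(raw_records):
    """Drop incomplete records, normalize text, and remove duplicates
    before ingestion into a training data lake."""
    seen_ids = set()
    cleaned = []
    for rec in raw_records:
        # Skip records missing required fields.
        if not rec.get("id") or not rec.get("text"):
            continue
        # Skip duplicate records by id.
        if rec["id"] in seen_ids:
            continue
        seen_ids.add(rec["id"])
        # Normalize whitespace and case for downstream tokenization.
        normalized = " ".join(rec["text"].split()).lower()
        cleaned.append({"id": rec["id"], "text": normalized})
    return cleaned
```

At production scale, steps like these run as distributed jobs over a data lake, which is why fast, high-capacity flash tiers matter here: the pipeline is read- and write-intensive across the entire corpus.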
AI Model Training
During this stage, AI models are trained iteratively to make accurate predictions based on the training data. Models are trained on high-performance supercomputers, and training efficiency relies heavily on maximizing GPU utilization.
Storage needs: Very high-bandwidth flash storage near the training servers is important for maximizing GPU utilization. High-performance (PCIe® Gen. 5), low-latency, compute-optimized eSSDs are designed to meet these stringent requirements.
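The link between GPU utilization and storage bandwidth comes down to simple arithmetic: the local flash tier must sustain at least the aggregate rate at which the GPUs consume training data, or the GPUs stall waiting for input. A back-of-envelope sketch, with purely illustrative numbers:

```python
def required_storage_bandwidth(num_gpus, gb_per_sec_per_gpu):
    """Minimum sequential-read bandwidth (GB/s) the local flash tier
    must sustain to keep every GPU fed with training data."""
    return num_gpus * gb_per_sec_per_gpu

# Hypothetical example: a training server with 8 GPUs, each streaming
# 2 GB/s of training data, needs a storage tier sustaining 16 GB/s.
demand = required_storage_bandwidth(8, 2)
```

Real sizing also accounts for checkpoint writes, caching, and burst patterns, but this is the basic reason high-bandwidth eSSDs sit close to the training servers.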
Inference & Prompting
This stage involves creating user-friendly interfaces for AI models, including APIs, dashboards, and tools that combine context-specific data with end-user prompts. AI models will be integrated into existing internet and client applications, enhancing them without replacing current systems. This means maintaining current systems alongside new AI compute, driving further storage needs.
Storage needs: Current storage systems will be upgraded with additional data center eHDD and eSSD capacity to accommodate AI integration into existing processes. Similarly, larger and higher-performance client SSDs (cSSDs) for PCs and laptops, and higher-capacity embedded flash devices for mobile phones, IoT systems, and automotive applications, will be needed for AI enhancements to existing applications.
AI Inference Engine
Stage 5 is where the magic happens in real time. This stage involves deploying the trained models into production environments where they can analyze new data and provide real-time predictions or generate new content. The efficiency of the inference engine is crucial for timely and accurate AI responses.
Storage needs: High-capacity eSSDs for streaming context or model data to inference servers; depending on scale or response-time targets, high-performance compute eSSDs for caching; and high-capacity cSSDs and larger embedded flash modules in AI-enabled edge devices.
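The caching role mentioned above can be illustrated with a toy least-recently-used (LRU) cache. This is a simplified in-memory sketch standing in for an SSD-backed cache tier; in a real deployment the eviction policy, capacity, and backing store would be engineered to the service's response-time targets.

```python
from collections import OrderedDict

class ResponseCache:
    """Toy LRU cache: frequently requested prompts are served from the
    fast cache tier without re-running the inference engine."""

    def __init__(self, capacity=1024):
        self.capacity = capacity
        self._store = OrderedDict()

    def get(self, prompt):
        """Return the cached response, or None on a cache miss."""
        if prompt not in self._store:
            return None
        self._store.move_to_end(prompt)  # mark as recently used
        return self._store[prompt]

    def put(self, prompt, response):
        """Insert a response, evicting the least recently used entry if full."""
        self._store[prompt] = response
        self._store.move_to_end(prompt)
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict oldest entry
```

The same pattern applies whether the cache holds generated responses, retrieved context, or hot model shards: keeping frequently used data on fast local flash reduces both latency and load on the inference servers.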
New Content Generation
The final stage is where new content is created. The insights produced by the AI models often generate new data, which is stored because it proves valuable or engaging. While this stage closes the loop, it also feeds back into the data cycle, driving continuous improvement and innovation by increasing the value of data for training or analysis by future models.
Storage needs: Generated content will land back on capacity eHDDs for archival data center storage, and on high-capacity cSSDs and embedded flash devices in AI-enabled edge devices.
A Self-Perpetuating Cycle of Increased Data Generation
This continuous loop of data generation and consumption is accelerating the need for performant, scalable storage technologies that can manage large AI data sets and process complex data efficiently, driving further innovation.
Ed Burns, research director at IDC, noted: “The implications for storage are expected to be significant as the role of storage, and access to data, influences the speed, efficiency and accuracy of AI Models, especially as larger and higher-quality data sets become more prevalent.”
There’s no doubt that AI is the next transformational technology. As AI technologies become embedded across virtually every industry sector, expect to see storage component providers increasingly tailor products to the needs of each stage in the cycle.
About the author: Dan Steere is Senior Vice President of Corporate Business Development at Western Digital, where he leads initiatives improving growth and profitability across the company. His responsibilities include overseeing Business Development, Western Digital Ventures, Corporate Development, and Strategic Programs. Before joining Western Digital, Dan co-founded and served as CEO of Abundant Robotics. With a background that spans various industries, including semiconductors, mobile electronics, enterprise software, robotics, and space technology, Dan’s career is marked by a passion for innovation and creating positive work environments. He holds a bachelor’s degree in computer science from Harvard, and an MBA from Stanford, where he was an Arjay Miller Scholar.