
July 23, 2024

Cutting-Edge Infrastructure Best Practices for Enterprise AI Data Pipelines

Molly Presley


The ability to harness, process, and leverage vast amounts of data is what sets leading organizations apart in today’s data-driven landscape. To stay ahead, enterprises must master the complexities of artificial intelligence (AI) data pipelines.

The use of data analytics, BI applications, and data warehouses for structured data is a mature industry, and the strategies to extract value from structured data are well known. However, the explosive growth of generative AI now holds the promise of extracting hidden value from unstructured data as well. Enterprise data often resides in disparate silos, each with its own structure, format, and access protocols. Integrating these diverse data sources is a significant challenge but a crucial first step in building an effective AI data pipeline.
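
As a rough illustration of that first integration step, the short Python sketch below walks a few hypothetical silos (a local scratch area, an NFS mount, an archive mount; the paths and the catalog_sources helper are assumptions for the example, not references to any real system) and gathers basic file metadata into a single catalog that later pipeline stages could query.

```python
from pathlib import Path
from typing import Dict, List


def catalog_sources(sources: Dict[str, str]) -> List[dict]:
    """Walk each named source and collect basic file metadata into one catalog.

    `sources` maps a logical silo name (e.g. "lab-nfs") to a locally
    accessible mount point or directory. All paths here are hypothetical.
    """
    catalog = []
    for silo, root in sources.items():
        for path in Path(root).rglob("*"):
            if path.is_file():
                stat = path.stat()
                catalog.append({
                    "silo": silo,             # which silo the file came from
                    "path": str(path),        # physical location today
                    "size_bytes": stat.st_size,
                    "modified": stat.st_mtime,
                })
    return catalog


if __name__ == "__main__":
    # Hypothetical mount points standing in for separate unstructured-data silos.
    sources = {
        "local-scratch": "/data/scratch",
        "lab-nfs": "/mnt/nfs/projects",
        "archive": "/mnt/archive",
    }
    catalog = catalog_sources(sources)
    print(f"cataloged {len(catalog)} files across {len(sources)} silos")
```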

In the rapidly evolving landscape of AI, enterprises are constantly striving to harness the full potential of AI-driven insights. The backbone of any successful AI initiative is a robust data pipeline, which ensures that data flows seamlessly from source to insight.

Overcoming Data Silo Barriers to Accelerate AI Pipeline Implementation

The barriers separating unstructured data silos have become a severe limitation on how quickly IT organizations can implement AI pipelines without costs, governance controls, and complexity spiraling out of control.

Organizations need to be able to leverage their existing data, and they can’t afford to overhaul their infrastructure or migrate all of their unstructured data to new platforms in order to implement AI strategies. AI use cases and technologies are changing so rapidly that data owners need the freedom to pivot at any time, whether to scale up or down or to bridge multiple sites with their existing infrastructure, all without disrupting data access for existing users or applications. As diverse as AI use cases are, the common denominator among them is the need to collect data from many different sources, and often from different locations.


The fundamental challenge is that access to data by both humans and AI models is always funneled through a file system at some point – and file systems have traditionally been embedded within the storage infrastructure. The result of this infrastructure-centric approach is that when data outgrows the storage platform on which it resides, or if different performance requirements or cost profiles dictate the use of other storage types, users and applications must navigate across multiple access paths to incompatible systems to get to their data.

This problem is particularly acute for AI workloads, where a critical first step is consolidating data from multiple sources to enable a global view across them all. AI workloads must have access to the complete dataset in order to classify and/or label the files and determine which ones should move on to the next step in the process.

With each phase in the AI journey, the data is refined further. This refinement might include cleansing and large language model (LLM) training or, in some cases, tuning existing LLMs for iterative inferencing runs to get closer to the desired output. Each step also has different compute and storage performance requirements, ranging from slower, less expensive mass storage systems and archives to high-performance, more costly NVMe storage.
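
One way to picture those shifting requirements is a simple stage-to-tier mapping. The sketch below is only illustrative: the stage names, tier labels, and rationales are assumptions for the example, not vendor specifications or benchmarks.

```python
from dataclasses import dataclass


@dataclass
class StageProfile:
    """Rough placement profile for one stage of an AI data pipeline."""
    stage: str
    storage_tier: str   # where the working set should live for this stage
    rationale: str


# Assumed, illustrative mapping of pipeline stages to storage tiers.
PIPELINE_PROFILES = [
    StageProfile("ingest & classify", "capacity object/NAS tier",
                 "full dataset must be visible, but throughput needs are modest"),
    StageProfile("cleansing & curation", "mid-tier NAS/SSD",
                 "iterative reads and rewrites of the selected subset"),
    StageProfile("LLM training / tuning", "NVMe / parallel file system",
                 "keep GPUs saturated with high-throughput, low-latency reads"),
    StageProfile("inference & archive", "low-cost archive tier",
                 "models and provenance data retained cheaply for reuse"),
]


def tier_for(stage: str) -> str:
    """Look up the assumed target tier for a pipeline stage."""
    for profile in PIPELINE_PROFILES:
        if profile.stage == stage:
            return profile.storage_tier
    raise KeyError(f"unknown stage: {stage}")


if __name__ == "__main__":
    for p in PIPELINE_PROFILES:
        print(f"{p.stage:24s} -> {p.storage_tier:30s} ({p.rationale})")
```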

The fragmentation caused by the storage-centric lock-in of file systems at the infrastructure layer is not a new problem unique to AI use cases. For decades, IT professionals have faced the choice of overprovisioning their storage infrastructure to accommodate the subset of data that needed high performance, or paying the “data copy tax” and added complexity of shuffling file copies between different systems. This long-standing problem is now evident in the training of AI models as well as in the ETL process.

Separating the File System from the Infrastructure Layer


Conventional storage platforms embed the file system within the infrastructure layer. However, a software-defined solution that is compatible with any on-premises or cloud-based storage platform from any vendor can create a high-performance, cross-platform Parallel Global File System that spans incompatible storage silos across one or more locations.

With the file system decoupled from the underlying infrastructure, automated data orchestration provides high performance to GPU clusters, AI models, and data engineers. All users and applications in all locations have read/write access to all data everywhere: not to file copies, but to the same files, via this unified, global metadata control plane.
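
A minimal sketch of that idea, assuming nothing more than an in-memory catalog rather than any particular product’s implementation: a single logical path resolves through a metadata layer to whichever physical instance currently holds the data, so users and applications never address a silo-specific copy directly.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FileRecord:
    """Metadata for one logical file in a global namespace (illustrative only)."""
    logical_path: str                                     # what users and applications see
    instances: List[str] = field(default_factory=list)    # physical locations holding the data


class MetadataControlPlane:
    """Toy stand-in for a global metadata layer decoupled from the storage it fronts."""

    def __init__(self) -> None:
        self._records: Dict[str, FileRecord] = {}

    def register(self, logical_path: str, physical_location: str) -> None:
        """Record that a physical instance of the logical file exists somewhere."""
        record = self._records.setdefault(logical_path, FileRecord(logical_path))
        if physical_location not in record.instances:
            record.instances.append(physical_location)

    def resolve(self, logical_path: str) -> str:
        """Return one physical location for a logical path; callers never pick a silo."""
        return self._records[logical_path].instances[0]


if __name__ == "__main__":
    mcp = MetadataControlPlane()
    # The same logical file may be instantiated on NVMe near the GPUs and on a capacity tier.
    mcp.register("/projects/genomics/run42/sample.fastq",
                 "nvme-cluster-a:/vol1/run42/sample.fastq")
    mcp.register("/projects/genomics/run42/sample.fastq",
                 "s3://cold-bucket/run42/sample.fastq")
    print(mcp.resolve("/projects/genomics/run42/sample.fastq"))
```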

Empowering IT Organizations with Self-Service Workflow Automation

Since many industries, such as pharma, financial services, and biotechnology, require the archiving of both training data and the resulting models, the ability to automate the placement of this data into low-cost resources is critical. With custom metadata tags tracking data provenance, iteration details, and other steps in the workflow, recalling old model data for reuse or applying a new algorithm becomes a simple operation that can be automated in the background.
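
The sketch below illustrates the concept in a simplified, hedged form: files carry custom tags for model, iteration, and run status, a background rule demotes completed runs to an assumed low-cost archive tier, and a lookup by model name makes old training data easy to recall for reuse. The tag names and tier labels are invented for the example, not a specific product’s schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TaggedFile:
    path: str
    tags: Dict[str, str] = field(default_factory=dict)  # e.g. provenance, iteration, status
    tier: str = "nvme"                                   # current placement


def apply_placement_policy(files: List[TaggedFile]) -> None:
    """Background rule: data from completed training runs migrates to a low-cost archive tier."""
    for f in files:
        if f.tags.get("status") == "run-complete" and f.tier != "archive":
            f.tier = "archive"   # in practice this would trigger a data-movement job


def recall_for_reuse(files: List[TaggedFile], model: str) -> List[TaggedFile]:
    """Find archived data for a given model so it can be promoted and reused."""
    return [f for f in files if f.tags.get("model") == model]


if __name__ == "__main__":
    corpus = [
        TaggedFile("/train/run7/shard-0001.parquet",
                   {"model": "llm-v2", "iteration": "7", "status": "run-complete"}),
        TaggedFile("/train/run8/shard-0001.parquet",
                   {"model": "llm-v3", "iteration": "8", "status": "in-progress"}),
    ]
    apply_placement_policy(corpus)
    for f in corpus:
        print(f.path, "->", f.tier)
    print("recallable for llm-v2:", [f.path for f in recall_for_reuse(corpus, "llm-v2")])
```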

The rapid shift to accommodate AI workloads has created a challenge that exacerbates the silo problems that IT organizations have faced for years. And the problems have been additive:

To remain competitive while managing new AI workloads, data access needs to be seamless across local silos, locations, and clouds, and must also support very high-performance workloads.

There is also a need to be agile in a dynamic environment where fixed infrastructure may be difficult to expand due to cost or logistics. As a result, the ability of companies to automate data orchestration across different siloed resources, or to rapidly burst to cloud compute and storage resources, has become essential (see the sketch below for a minimal example of such a burst decision).

At the same time, enterprises need to bridge their existing infrastructure with these new distributed resources cost-effectively and ensure that the cost of implementing AI workloads does not crush the expected return.
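
As a hedged illustration of that kind of agility, the sketch below shows one possible burst decision: if the local GPU cluster has no free capacity and a deep job queue, the job is scheduled against cloud compute and its dataset is flagged for staging alongside it. The threshold, field names, and policy are assumptions for the example, not recommendations.

```python
from dataclasses import dataclass


@dataclass
class ClusterState:
    """Snapshot of an on-prem GPU cluster (illustrative fields only)."""
    queued_jobs: int
    free_gpus: int


def should_burst_to_cloud(state: ClusterState, max_queue_depth: int = 10) -> bool:
    """Assumed policy: burst when the local queue is deep and no GPUs are free."""
    return state.free_gpus == 0 and state.queued_jobs > max_queue_depth


def place_job(job_name: str, dataset: str, state: ClusterState) -> str:
    """Decide where a training job (and its dataset) should run."""
    if should_burst_to_cloud(state):
        # In a real pipeline this would also trigger orchestration of `dataset`
        # to storage adjacent to the cloud compute before the job starts.
        return f"{job_name}: burst to cloud, staging {dataset} to cloud storage"
    return f"{job_name}: run on-prem against existing silos"


if __name__ == "__main__":
    busy = ClusterState(queued_jobs=24, free_gpus=0)
    idle = ClusterState(queued_jobs=1, free_gpus=8)
    print(place_job("llm-finetune-42", "/projects/corpus-v3", busy))
    print(place_job("llm-finetune-43", "/projects/corpus-v3", idle))
```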

To keep up with the many performance requirements of AI pipelines, a new paradigm is needed, one that can effectively bridge the gaps between on-premises silos and the cloud. Such a solution requires new technology and a revolutionary approach that lifts the file system out of the infrastructure layer, enabling AI pipelines to utilize existing infrastructure from any vendor without compromising results.

About the author: Molly Presley brings over 15 years of product and growth marketing leadership experience to the Hammerspace team. Molly has led the marketing organization and strategy at fast-growth innovators such as Pantheon Platform, Qumulo, Quantum Corporation, DataDirect Networks (DDN), and Spectra Logic. She was responsible for the go-to-market strategy for SaaS, hybrid cloud, and data center solutions across various data-intensive verticals and use cases in these companies. At Hammerspace, Molly leads the marketing organization and inspires data creators and users to take full advantage of a truly global data environment.

Related Items:

Three Ways to Connect the Dots in a Decentralized Big Data World

Object Storage a ‘Total Cop Out,’ Hammerspace CEO Says. ‘You All Got Duped’

Hammerspace Hits the Market with Global Parallel File System

 
