July 25, 2024

Iterative Unveils DataChain to Revolutionize Unstructured Data Management with AI Models

Iterative, a startup dedicated to improving and streamlining workflows for AI engineers, has unveiled DataChain, a new open-source tool for the evaluation and processing of unstructured data.

The startup claims that DataChain will transform how unstructured data is curated, processed, and evaluated by large language models (LLMs).

McKinsey’s Global Survey on the state of AI, published in early 2024, revealed that only 15% of companies had realized a meaningful impact of GenAI on their business outcomes. Data inefficiencies within many organizations account for a large part of this problem. According to Iterative, the inability to process unstructured data is a major barrier to AI success, and it highlights a significant gap between structured data technologies and the newer, Python-based AI workflows.

Unstructured data makes up the bulk of the information stored on company systems, and it is vital for training and fine-tuning AI models. However, effectively leveraging this data is complicated by issues such as scalability, data complexity, and integration difficulties. 

Existing tools are designed for structured data, such as spreadsheets and databases. Unstructured data, such as images, videos, and PDFs, is proving much harder to access, evaluate, and improve at scale. AI engineers often write custom code to manage unstructured data. However, this approach is labor-intensive and prone to scalability problems, which makes it difficult to manage unstructured data efficiently.

“The biggest challenge in adopting artificial intelligence in the enterprise today is the lack of practices and tools for data curation and generative AI evaluation that can ensure the quality of results,” said Dmitry Petrov, CEO of Iterative. 

“As the next step, we need AI models that can evaluate and improve AI models. So far this has only happened at the industry forefront – take a look at DeepMind’s AlphaGo training against itself, or OpenAI’s DALL-E3 curating its own dataset. Our goal is to change this.”

Petrov believes the solution to this issue lies in leveraging AI itself. With AI-based analytical capabilities such as “large language models (LLMs) judging LLMs” and multimodal GenAI evaluations, DataChain can automate the assessment and enhancement of AI models, minimizing the need for extensive manual intervention.
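The “LLMs judging LLMs” pattern can be illustrated with a minimal Python sketch. It does not use DataChain's API: `keyword_judge` is a hypothetical stand-in for a real judge model, and `filter_by_judge` and the sample data are likewise invented for illustration. The point is only the shape of the loop, in which one model's outputs are scored by another and low-scoring samples are dropped without manual review.

```python
from typing import Callable, List, Tuple

Sample = Tuple[str, str]  # (prompt, model answer)


def _words(text: str) -> set:
    # Lowercase the text and strip basic punctuation from each word.
    return {w.strip("?.,!").lower() for w in text.split()}


def keyword_judge(prompt: str, answer: str) -> float:
    """Hypothetical stand-in for a judge LLM.

    Returns the fraction of longer prompt words that reappear in the answer.
    A real setup would call a second model here instead.
    """
    keywords = {w for w in _words(prompt) if len(w) > 3}
    if not keywords:
        return 0.0
    return len(keywords & _words(answer)) / len(keywords)


def filter_by_judge(
    samples: List[Sample],
    judge: Callable[[str, str], float],
    threshold: float,
) -> List[Sample]:
    # Keep only the samples whose judge score meets the threshold.
    return [s for s in samples if judge(s[0], s[1]) >= threshold]


samples = [
    ("What color is the sky?", "The sky is blue in color."),
    ("What color is the sky?", "I like trains."),
]
kept = filter_by_judge(samples, keyword_judge, threshold=0.5)
print(kept)
```

Because the judge is passed in as a plain function, the toy scorer could be swapped for a call to an actual model without changing the filtering step.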

Additionally, Iterative’s DataChain democratizes the use of AI models by making them more accessible for evaluating and processing unstructured data. It does this by adding a “meta-layer” that stores information about the files together with their associated metadata.

DataChain works in a way that mirrors the efficiency of SQL querying for structured data but extends this capability to handle unstructured and multimodal data by interacting with files and their associated meta attributes. The natural language capabilities enable users to easily query their data.
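To make the idea of SQL-style querying over files and their meta attributes concrete, here is a small Python sketch built on the standard-library `sqlite3` module. It is not DataChain's implementation or API. The table layout, the file paths, and the `build_meta_layer` and `query_paths` helpers are all assumptions chosen for illustration. The files themselves stay where they are, and only references and attributes go into the queryable layer.

```python
import sqlite3

# Hypothetical file references with metadata attributes (not real data).
FILES = [
    ("s3://bucket/images/cat_001.jpg", "image", 204_800, "cat"),
    ("s3://bucket/images/dog_007.jpg", "image", 512_000, "dog"),
    ("s3://bucket/docs/report.pdf", "pdf", 1_048_576, None),
]


def build_meta_layer(records):
    """Load file references and their attributes into an in-memory table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE files ("
        "path TEXT PRIMARY KEY, kind TEXT, size_bytes INTEGER, label TEXT)"
    )
    conn.executemany("INSERT INTO files VALUES (?, ?, ?, ?)", records)
    return conn


def query_paths(conn, kind, min_size):
    """Return paths of files of a given kind at or above a size threshold."""
    rows = conn.execute(
        "SELECT path FROM files "
        "WHERE kind = ? AND size_bytes >= ? ORDER BY path",
        (kind, min_size),
    )
    return [row[0] for row in rows]


conn = build_meta_layer(FILES)
large_images = query_paths(conn, "image", 300_000)
print(large_images)
```

The query selects files by attribute without opening any of them, which is the property the comparison to SQL is pointing at.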

Founded in 2018, Iterative has reached more than 20 million downloads for its open-source software Data Version Control (DVC). It has over 400 contributors across different tools and more than 20 enterprise customers, including Fortune 500 companies.

The introduction of DataChain represents significant progress in leveraging the full potential of unstructured data. However, such tools may have a long way to go before they can fully address all the complexities and challenges of managing and curating diverse data types. DataChain may be able to increase its visibility and adoption across industries through integration into larger enterprise platforms.

BigDATAwire