Data Quality Is A Mess, But GenAI Can Help
A recurring theme in big data over the past two decades is the poor quality of data. No matter how much ink is spilled on the topic, organizations continually seem surprised that the data they want to use for analytics or AI is not in good shape and needs attention. Ataccama has made a business out of helping organizations solve their data quality problems, and with generative AI, the solutions are getting better.
There’s no shortage of studies pointing to data quality issues being one of the most pressing concerns among data professionals. Dbt Labs issued a report in March indicating worsening data quality. In February, Informatica issued a report that found data quality to be the number one issue preventing companies from succeeding with generative AI (GenAI) initiatives. A pair of data observability vendors, Bigeye and Monte Carlo, published their own studies last year finding data quality is getting worse, not better.
The folks at Ataccama–a Toronto, Canada-based data management software firm that competes with these other vendors–have also run into the data quality beastie.
“I think a lot of times people don’t fully understand the picture of what data their enterprise has and the quality of the data that they have access to, or maybe lack of quality of their data,” said Jessica Smith, Ataccama’s vice president of data quality. “It’s still very common for enterprise data quality to be a huge issue across organizations.”
There are many sources of data quality bugs, but one of the biggest is the sheer complexity of enterprise IT systems and the size of IT estates, Smith says.
“There’s a huge amount of complexity in today’s enterprise data landscapes,” Smith says. “I’ve been doing this for 10-plus years and I don’t think I’ve talked to a single customer that has said to me, I fully understand my data landscape. I can do everything I want. There’s no issues. We’re off and running.”
The good news is that some organizations are getting the message. Since the General Data Protection Regulation (GDPR) went into effect in 2018, there has been a financial incentive to avoid poorly managed data. That has led to a concerted effort among some larger companies to get serious about data governance in general and–dare we say it?–slay the data quality monster.
Not every company has gotten serious about data governance, and overall the level of data quality is getting worse, the data shows. But for the few who “get” it, the hard data governance work is paying off and better positioning them to take advantage of GenAI.
“I think for those organizations who put in the work to start to kind of comply with more of these data governance initiatives are absolutely further ahead,” Smith said. “I think the higher up you go in the executive chain, the less understanding they might have around the importance of data quality. We saw that a lot, especially probably four or five years ago. When data governance really became a mainstream thing, it was very common for a CDO, if they existed in the organization, to kind of make a case to their boss or the CEO to invest in this governance initiative.”
GenAI and Data Quality
GenAI and data quality have a symbiotic relationship. On the one hand, high-quality data is needed for a business to succeed at AI, GenAI or otherwise. On the other hand, AI, and GenAI in particular, can help an organization accelerate its data quality initiatives.
Having a successful data governance program that’s producing good quality data is a prerequisite for having a successful AI project, Smith says.
“Being able to understand what you have and being able to appropriately classify it are really important first steps that we encourage a lot of our customers to do if they want to do AI projects,” Smith says. “You don’t want things to go off the rails. You don’t want to build something that exposes any internal customer data. So that’s a really good first step.”
Ataccama customers typically will start with an internal-facing AI project, which allows them to minimize the damage if something does go wrong with it. That gives customers the chance to get a better understanding of their data, how it looks, and whether it’s in the appropriate shape to do AI initiatives with it, Smith says.
“Obviously generative AI is top of mind right now to so many people, but we still find a lot of organizations just doing traditional AI as well,” she says. “Statistical analysis–that’s still a core competency that a lot of organizations are focusing on, and that’s where a lot of the kind of traditional data quality capabilities also come into play.”
On the flip side, Ataccama is also adopting AI and GenAI within its offerings to improve the data quality experience. The company sells a full suite of data management tools spanning data catalogs, governance, metadata management, and other disciplines, but its core specialty remains data quality. It was recently named a leader in the Gartner Magic Quadrant for Augmented Data Quality Solutions, and Smith says that reflects the company’s long-term investment and commitment to the data quality space.
Having already built much of the underlying functionality to improve data quality–such as being able to track how a customer’s data quality changes over time–Ataccama has a foundation on which it can use new technologies like GenAI to start automating some tasks.
“This is really where you have the data profiling, classification and cataloging of data,” Smith tells Datanami. “So being able to actually write rules to be able to monitor data quality, be able to look at anomalous behavior over time, being able to proactively catch data quality issues. [It’s about] not only understanding your data, but then actually being able to remediate data quality issues.”
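To make the idea of rule-based monitoring concrete, here is a minimal Python sketch of one data quality rule (column completeness) and a simple check for anomalous behavior over time. It is an illustration only and does not reflect how Ataccama ONE works internally; the function names, the `email` column, the sample scores, and the three-standard-deviation threshold are all assumptions made for the example.

```python
from statistics import mean, stdev


def completeness(rows, column):
    """Share of rows where `column` is present and non-empty."""
    if not rows:
        return 0.0
    filled = sum(1 for row in rows if row.get(column) not in (None, ""))
    return filled / len(rows)


def is_anomalous(history, latest, threshold=3.0):
    """Flag `latest` if it is more than `threshold` standard deviations
    away from the mean of the historical scores."""
    if len(history) < 2:
        return False  # not enough history to judge
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) / sigma > threshold


# Hypothetical daily completeness scores for an "email" column.
history = [0.98, 0.97, 0.99, 0.98, 0.97]

# Hypothetical batch of records that arrived today.
todays_rows = [
    {"email": "a@example.com"},
    {"email": ""},
    {"email": None},
    {"email": "b@example.com"},
]

score = completeness(todays_rows, "email")
alert = is_anomalous(history, score)
```

In this toy data, only half of today's records carry an email address, so the score falls far outside the historical range and the check raises an alert. A production system would likely use more robust statistics and account for seasonality.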
In February, Ataccama unveiled Version 15 of the company’s Ataccama ONE platform. This release introduced a host of new GenAI-powered features for helping users track, manage, and cleanse their data. Smith explains.
“We can do things like natural-language-to-SQL conversions,” she says. “You can chat with our documentation. We can generate table descriptions in terms of our catalog and automatically suggest business terms based on a glossary that’s been defined. We can do automated rule generation and create data quality rules for you. We can profile your data and then suggest some data quality rules for you based on the profile of that data.”
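The last capability Smith mentions, profiling data and then suggesting rules from the profile, can be sketched in a few lines of Python. This is a simplified, hypothetical illustration and not Ataccama's implementation or API: the function names, the `order_id` column, and the three heuristics (not null, unique, numeric range) are assumptions chosen for the example, and it uses plain heuristics rather than GenAI.

```python
def profile_column(values):
    """Collect basic statistics about one column's values."""
    non_null = [v for v in values if v is not None]
    numeric = bool(non_null) and all(
        isinstance(v, (int, float)) and not isinstance(v, bool)
        for v in non_null
    )
    return {
        "count": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "numeric": numeric,
        "min": min(non_null) if numeric else None,
        "max": max(non_null) if numeric else None,
    }


def suggest_rules(name, profile):
    """Turn a column profile into candidate rules for a human to review."""
    rules = []
    non_null = profile["count"] - profile["nulls"]
    if profile["count"] and profile["nulls"] == 0:
        rules.append(f"{name} must not be null")
    if non_null > 1 and profile["distinct"] == non_null:
        rules.append(f"{name} must be unique")
    if profile["numeric"]:
        rules.append(
            f"{name} must be between {profile['min']} and {profile['max']}"
        )
    return rules


# Hypothetical sample of an "order_id" column.
order_ids = [101, 102, 103, 104]
suggestions = suggest_rules("order_id", profile_column(order_ids))
```

The output is a list of candidate rules, not enforced constraints. A range inferred from four sample values would be too narrow in practice, which is why suggestions of this kind are best treated as a starting point for review.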
The company has just begun to implement GenAI into its product, and more GenAI capabilities are on the roadmap, Smith says. “This year we are really doubling down on our AI capabilities,” she says.