What to Look for in a Data Catalog
There’s no mistaking it: Data catalogs are hot. The product category has exploded in recent years as a way to drive data discovery and, increasingly, to control access to data. But how do you select the right data catalog for your particular needs? Potential answers to this question could be found at the recent Eckerson Group CDO TechVent on data catalogs.
At the most basic level, the core function of a data catalog is to provide a bridge between how business talks about data and how that data is technically stored. Nearly every data catalog in the market–and there are close to 100 of them now–can do that.
But not every data catalog is the same, and there are important differences among the various offerings. According to Lauren LeRoy, the director of product marketing at BigID, the use case should dictate the type of data catalog that would best fit a user.
“If you know the problem you’re trying to solve, then you know what kind of catalog you’re looking for,” LeRoy said during the vendor panel for the Eckerson Group CDO TechVent on data catalogs, which took place December 15. “We all talk about data catalogs, but what each catalog offers is a little bit different…On the BigID side it’s all based on discovery. How do you discover your data at scale [and] add classification?”
Every data catalog customer wants to get value from their purchase. But desiring that isn’t quite enough to get you there. According to Mitesh Shah, the vice president of product marketing for Alation, a successful adoption requires clarity about what the customer is trying to achieve.
There are two product characteristics that are important to gaining customer traction, starting with the user interface being “dead simple” to use. “You really want to make sure that the product is simplifying people’s lives and not making it more difficult,” Shah said.
The second factor is the intelligence of the product under the hood. “You want to make sure that technology is applying machine learning, as Sanjeev [Mohan] mentioned earlier, and making sure that it’s simplifying things.”
At data catalog vendor data.world, the company is “relentlessly focused on adoption,” said Tim Gasper, vice president of product. That focus stems in part from the large number of data catalog projects gone bad, he said.
“So many times, we see these sort of failure stories of companies trying to implement a catalog, but it’s only for five or 10 users, or it’s a very specialized use case, and ultimately they don’t end up getting the adoption that they need for it, [not] to be just sticky, but actually to have an impact in the organization,” Gasper said.
Implementation Tips
Data catalogs may bear a heavier load in data literacy than other tool types because they are one of the primary new ways that people are discovering and interacting with data in their organizations. Perhaps that gives data catalog developers a greater responsibility to be a positive force for data literacy, according to Jeffrey Giles, the principal architect at Sandhill Consultants, an implementation partner for data governance and catalog provider erwin (now owned by Quest Software).
“You can teach people how the tool works. ‘Click this menu item, fill in this box, click this button, and it does stuff,’” Giles said. “But I find a lot of people [say], ‘Why am I doing that? What does all this mean? How do I get value out of this?’ That kind of starts sometimes with education about data literacy, to orient them towards how the tool creates business value to you, not necessarily that I could type something in and go find the definition of something.”
While the workflow details ultimately are important, a more important discussion may occur around how data is defined in the first place. “Let’s say ‘customer’ is not really related to something called a ‘wholesale customer’ or a ‘retail customer.’ What’s the difference between all of these things? How does this affect the fact that I don’t have [a] 360-degree view of the customer?” Giles said. “Literacy helps a lot with that.”
Complexity is one of the biggest bugaboos afflicting data catalogs as a category, according to Eckerson’s research. To escape the complexity trap, you need to avoid overly ambitious deployments, Gasper said.
“A lot of companies try to boil the ocean. They say, ‘Oh, we’re going to buy a catalog, and we’re going to address compliance and quality and better integration, and we’re going to use machine learning, and we’re going to use knowledge graphs and we’re going to create connections between things and we’re going to assign stewards,’” the data.world VP said. “And so they create this list 100 miles long of all the things they are going to do, and they’re going to do it all by yesterday. And that’s not the right way to approach it.”
A data catalog can be a catalyst for data-driven change within an organization, but like anything else, a little patience–not to mention having a plan and sticking to it–can go a long way.
“You create a use case backlog. What is the most important thing that we should work on first?” Gasper said. “Prioritize that, and then pair up the right producers and consumers to iterate together, sprint-style on those use cases and then work through the backlog.”
Avoiding the ‘Frankenstack’
Companies should be careful to avoid taking the “Frankenstack” approach, where they’re trying to add features to the catalog after the fact, such as data quality or privacy and compliance, according to Alation’s Shah. “You can’t bake too much into the product,” he said.
At the same time, many data catalogs are expandable. Many vendors, including Alation and BigID, have taken to building a “core” platform and then enabling customers to add “apps” after the fact.
“What we’re doing at BigID is based on that core platform,” LeRoy said. “We do have a data quality solution, but it’s based on the fact that you use that core machine learning. You use that core catalog and then it integrates with that, so that you’re not completely Frankensteining things together, which I’d argue that other vendors do.”
Data observability is another hot product area in the big data market, and the line separating where a data catalog stops and where data observability tools pick up is not always clear. In some cases, the data monitoring and observability features may be baked into the data catalog, while in other cases, an integration with a third-party tool may be in order. It comes back to understanding your particular use case, says data.world’s Gasper.
“These are all different sub use-cases around quality,” Gasper said. “You’re going to find that either maybe your catalog is providing some of those capabilities–data.world has some of these data quality capabilities–or you’re going to find you’re really interested in observability. How we can start monitoring these different signals, do anomaly detection on how these things are changing over time, and I want to use something like a Monte Carlo or a Bigeye or one of these types of vendors.”
Alation has partnerships with Bigeye and Soda for data observability, Shah said. “These tools are great for looking at data drift and checking out what’s going wrong with data and doing that introspection and sort of alerting the data engineer, the folks who are responsible for investigating and fixing those issues,” he said. “Because ultimately you want your business users, you want everybody in the organization, as they’re consuming the data, to understand whether the data they’re looking at is quality data or not.”
Catalogs and Governance
When it comes to data governance, there’s a natural tension with data catalogs, which was evident during the CDO TechVent panel on data catalogs. It was also clear that users can sometimes get themselves into chicken-or-the-egg situations when trying to roll both out simultaneously.
BigID, with its strong heritage in data discovery, comes down heavily on the side of data discovery as a driver for governance. “You can’t govern what you don’t know,” LeRoy said. “So we say that it all starts with discovery.”
Alation, which partners with BigID for discovery, has a more moderated view. Shah noted how Bob Seiner, whom he called the “Grandfather of data governance,” had a saying: “Everybody is already doing data governance. It’s really formalizing people’s behaviors around data, and the catalog really helps you [do] that.”
Users should get away from the notion that data governance is a formalized 12-step journey, “where it takes years and years and suddenly you reach Nirvana at the end,” Shah said. “It’s not the case. Everybody has a starting point. It’s all about continuous improvement. The catalog can help you get there.”
Giles takes a more traditional view towards the relationship of catalog and governance, which meshes with erwin’s history as a provider of data governance solutions.
“What we find is a lot of people will buy a tool, a technology, and they will start to discover stuff, but they don’t know who’s responsible for that and how does change happen when we find that something is not right?” he said. “If we have that in place first, and then I put that on top of the catalog, things go a lot smoother.”
Gasper has also seen how jumping too quickly into data discovery without a firm foundation in governance can cause things to go sideways. “I’m sure even the BigID folks probably see this as well,” he said. The good news is customers can move fairly quickly on implementing a formalized data governance program once a few key items are in place.
At a certain point, the line separating a data catalog and a data governance solution starts to blur. Erwin has been in the data governance space for a long time, and Alation–the vendor that kicked off the data catalog craze a few years back–recently launched its first data governance solution. Not coincidentally, Eckerson Group’s next CDO TechVent will be focused on data governance tools. Registration is now open for that free event taking place on April 26.
“The data governance concept and the data management concept will begin to merge together over time, so that data governance essentially is just a business-as-usual kind of integration into data catalog work,” Gasper said. “The user interfaces will be simplified and more customizable. But at the end of the day, people aren’t going to be saying ‘Hey, I’m doing data governance.’ They’re going to be saying, ‘hey I’m doing data work.’”
To view a recording of the CDO TechVent on data catalogs, go to www.techvent.eckerson.com/data-catalogs.
Related Items:
Data Catalogs Take Center Stage in Eckerson CDO TechVent
A Guide to Maximum Data Lake Value
Data Mesh Vs. Data Fabric: Understanding the Differences
Editor’s note: This article has been corrected. Bob Seiner is known as the “Grandfather of data governance,” not the “Godfather of data discovery.” Datanami regrets the errors.