

Data catalogs and metadata catalogs share some similarities, particularly in their nearly identical names. And while they have some common functions, there are also important differences between the two entities that big data practitioners should know about.
Metadata catalogs, which are sometimes called metastores or technical data catalogs, have been in the news lately. If you’re a regular Datanami reader (and we certainly hope you are!), you will have read a lot about metadata catalogs at the Snowflake and Databricks conferences last month, when the two competitors committed to open sourcing their respective metadata catalogs, Polaris and Unity Catalog.
So what is a metadata catalog, and why do they matter? (We’re glad you asked!) Read on to learn more.
Metadata Catalogs
A metadata catalog is the place where you store the technical metadata describing the data held in tabular form in a data lake or a lakehouse.
The most commonly used metadata catalog is the Hive Metastore, which was the central repository for metadata describing the contents of Apache Hive tables. Hive, of course, was the relational framework that allowed Hadoop users to query HDFS-based data using good old SQL, as opposed to MapReduce.
Hive and the Hive Metastore are still around, but they’re in the process of being replaced by a newer generation of technology. Table formats, such as Apache Iceberg, Apache Hudi, and Databricks’ Delta Lake, bring many advantages over Hive tables, including support for transactions, which improves the consistency and reliability of data.
These table formats also require a technical layer, the metadata catalog, to help users know what data exists in the tables and to grant or deny access to that data. Databricks supports this function in its Unity Catalog. For Iceberg, products such as Project Nessie, which was developed by engineers at Dremio, sought to be the “transactional catalog” brokering data access to various open and commercial data engines, including Hive, Dremio, Spark, and AWS Athena (based on Presto), among others.
Snowflake developed and released (or pledged to release, anyway) Polaris to be the standard metadata catalog for the Apache Iceberg ecosystem. Like Nessie, Polaris uses Iceberg’s open REST-based API to access the descriptive metadata of the Parquet data that Iceberg stores. This REST API then serves as the interface between the data stored in Iceberg tables and data processing engines, such as Snowflake’s native SQL engine as well as a variety of open-source engines.
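Because the Iceberg REST catalog is an ordinary HTTP API, any engine or tool can discover tables the same way. Here is a minimal sketch in Python, assuming a catalog that implements the spec is serving at `http://localhost:8181` (the host, port, and base path are illustrative, not any vendor’s actual endpoint):

```python
import json
import urllib.request

# Hypothetical base URL for a REST-spec catalog such as Polaris or Nessie.
BASE = "http://localhost:8181/api/catalog"

def tables_path(namespace: str) -> str:
    """Build the Iceberg REST path that lists tables in a namespace."""
    return f"/v1/namespaces/{namespace}/tables"

def list_tables(namespace: str) -> list:
    """Ask the catalog which Iceberg tables exist in a namespace."""
    with urllib.request.urlopen(BASE + tables_path(namespace), timeout=10) as resp:
        body = json.load(resp)
    # Per the Iceberg REST spec, the response carries an "identifiers" list
    # of {"namespace": [...], "name": ...} objects.
    return body["identifiers"]
```

The point of the open API is exactly this neutrality: the same few HTTP calls work whether the client is Snowflake’s engine, Spark, or a ten-line script.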
Data Catalogs
Data catalogs are typically third-party tools that companies use to organize all of the data they have stored across their organizations. They typically include some facility that allows users to search for data their organization may own, which means data catalogs often have some data discovery component.
Many data catalogs, such as Alation’s catalog, have also evolved to include access control functionality, as well as data lineage tracking and governance capabilities. In some cases, data management tool vendors that started out providing data governance and access control, such as Collibra, have evolved the other way, to also include data catalogs and data discovery capabilities.
And like metadata catalogs, regular data catalogs, or what some in the industry term “enterprise” data catalogs, are also fully involved in gobbling up metadata to help them track various data assets. One enterprise data catalog vendor, Atlan, focuses its efforts on unifying the metadata generated by different datasets and synchronizing them through a metadata “control plane,” thereby ensuring that the business metrics don’t get too out of whack.
By now, you’re probably wondering, “So what the heck is the difference?! They both track metadata, and they both have ‘catalog’ in their names.”
So What’s The Difference?!
To help us decode the differences between these two catalog types, Datanami recently talked to Felix Van de Maele, the CEO and co-founder of Collibra, one of the leading data catalog vendors in the big data space.
“They’re very different things,” Van de Maele said. “If you think about Polaris catalog and Unity Catalog from Databricks (and AWS and Google and Microsoft all have their catalogs), it’s really this idea that you’re able to store your data anywhere, on any clouds…And I can use any kind of data engine like a Databricks, like a Snowflake, like a Google, AWS, and so forth, to consume that data.”
But what Collibra and other enterprise data catalogs do is quite different, Van de Maele said.
“What we do is we provide much more of the business context,” he said. “We provide what we call that knowledge graph, that business context where you’re actually defining and managing your policies. Policies such as what’s the quality of my data? What business rules does my data need to comply to? What privacy policies does my data need to comply to? Who needs to approve it? How do we capture attestations? How do we do certification? How do I build a business glossary with business terms and clear definitions?
“That’s very different than a Polaris catalog on top of Iceberg that’s the physical metadata. And that’s a real differentiation,” he said.
Van de Maele supports the open data lakehouse architecture that has emerged, which gives customers the freedom to store their data in open table formats, such as Iceberg, Delta, and Hudi, and query it with any engine. His customers, many of which are Fortune 500 enterprises, store data across many data platforms and use the Collibra Data Intelligence platform to help control and govern access to that data.
Different Roles
Customers should understand that, while the names are similar, metadata catalogs and data catalogs play very different roles.
“The way I differentiate between the two is we do policy definition and management, they do policy enforcement,” Van de Maele said. “And actually I think that’s the right architecture.”
The metadata catalogs typically do not have functionality to allow users to set up business policies around data access. For instance, they won’t let you set up access controls to enable a marketing team to access all customer data except for anything that’s been marked “classified,” in which case it must be masked, Van de Maele said.
“We can have marketing data in Databricks, we have marketing data in Salesforce, we have marketing data in Google, and anywhere people are using marketing data, I need to make sure that the right data is classified and masked,” he said. “So we push that down in Databricks, in Snowflake, in Google, in Amazon and in Microsoft.”
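The split Van de Maele describes, policy definition in one central place and enforcement in each platform, can be illustrated with a toy sketch. The tag names and masking rule below are invented for illustration and don’t reflect any vendor’s actual API:

```python
# A policy defined once, centrally: mask any column tagged "classified".
POLICY = {"masked_tags": {"classified"}, "mask_value": "****"}

# Column tags, as an enterprise catalog might sync them to each platform.
COLUMN_TAGS = {
    "email": {"pii"},
    "ssn": {"classified", "pii"},
    "campaign": set(),
}

def enforce(row: dict, tags=COLUMN_TAGS, policy=POLICY) -> dict:
    """Enforcement step: each platform applies the central policy locally,
    masking any column whose tags intersect the policy's masked tags."""
    return {
        col: policy["mask_value"]
        if tags.get(col, set()) & policy["masked_tags"]
        else val
        for col, val in row.items()
    }
```

In the real architecture, the definition lives in the enterprise catalog and the `enforce` step is pushed down as native controls (masking policies, grants) in Databricks, Snowflake, and the cloud platforms, so the policy is written once but applied everywhere the data lives.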
Customers could define their own data access policies without a tool like Collibra’s, Van de Maele said. After all, it’s just SQL at the end of the day. But then they would need some other method to keep track of the millions of columns spread across various data platforms. Providing insight into what data exists and where, and then ensuring customers are accessing it according to the company’s governance rules, is the role Collibra serves.
At the same time, Collibra is dependent upon metadata catalogs for the enforcement mechanisms. Other enforcement mechanisms have been tried, such as proxies and drivers, Van de Maele said, but none of them works well.
“We think the metadata catalog approach with open table format is actually the right approach,” he said. “We want to have those data platforms be able to do that natively, otherwise scalability and performance always become a problem.”
Databricks Unity Catalog appears to be the exception here. Unity Catalog, which Databricks just open sourced last month, provides the low-level control over technical metadata as well as higher-level functions, such as data governance, access control, auditing, and lineage. In that respect, Unity Catalog appears to compete with the enterprise data catalog vendors.
Related Items:
What the Big Fuss Over Table Formats and Metadata Catalogs Is All About
Databricks to Open Source Unity Catalog
What to Look for in a Data Catalog