

(Jezper/Shutterstock)
If the exhibitors at last week’s Strata + Hadoop World expo are any indication of what’s happening down on the street, data cataloging is evolving from a nice-to-have into a necessity for organizations looking to capitalize on big data.
Hadoop’s so-called “junk drawer” problem has been well-documented. It stems largely from the flexible schema-on-read approach, where data is structured only when it’s finally accessed from the data lake, as opposed to the traditional ETL approach of transforming data when it’s originally loaded into the data warehouse.
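The difference between the two approaches is easier to see in a small example. The following is a minimal Python sketch, not taken from any vendor's product: the sample records and the `ingest` and `read_with_schema` helpers are hypothetical names chosen for illustration. Ingestion stores every line untouched, and a schema is applied only when someone reads the data back.

```python
import json

# Raw records as they might land in a data lake (hypothetical sample data).
# Nothing is validated or structured on the way in.
RAW_LINES = [
    '{"user": "ana", "amount": "19.99", "ts": "2016-10-01"}',
    '{"user": "bo", "amount": 5}',
    'not even json',
]


def ingest(lines):
    """Schema-on-read ingestion: keep every line exactly as it arrived."""
    return list(lines)


def read_with_schema(stored, schema):
    """Apply a schema at read time and skip records that do not fit it.

    `schema` maps a field name to a callable that casts the raw value.
    """
    rows = []
    for line in stored:
        try:
            record = json.loads(line)
            rows.append({field: cast(record[field]) for field, cast in schema.items()})
        except (ValueError, KeyError, TypeError):
            # Malformed JSON, a missing field, or an uncastable value.
            continue
    return rows


lake = ingest(RAW_LINES)
purchases = read_with_schema(lake, {"user": str, "amount": float})
```

All three lines are accepted at ingestion, but only two survive the read with this particular schema, and a different schema would surface a different subset. That is the flexibility, and also the root of the "junk drawer" problem: nothing in the lake itself says which records are usable for which question.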
In short, getting data into Hadoop is easy, but finding it and getting it back out again can be hard. All sorts of vendors are now looking to address this dilemma, which touches many aspects of big data analytics, including data quality and security. Having a catalog of the data stored in Hadoop seems like a good idea, and there are a number of vendors providing that.
Alex Gorelik, CEO and founder of Waterline Data, which provides data cataloging software for Hadoop and other big data systems, says data professionals are reluctant to open Hadoop to downstream users without a better accounting of the actual data.
“The data lake looks like a flea market,” Gorelik tells Datanami. “It’s all in there somewhere, but how do you find it? It’s a problem for data scientists and data stewards because they can’t give people access until they know what’s in there.”
Gorelik says that while tools like the open source Apache Atlas, which is backed by Hortonworks (NASDAQ: HDP), and Cloudera Navigator provide a good technical foundation for addressing data cataloging and master data management (MDM) challenges, they don’t go far enough to solve the problem. Waterline addresses the problem by using “tags” to track the lineage of every piece of data.

Data lakes resemble “flea markets,” says Alex Gorelik, CEO and founder of Waterline Data (Toniflap/Shutterstock)
With Waterline, Hadoop users can continue ingesting data as they did before, while relying on the software to keep it somewhat organized. Apache Lucene sits under the covers to power searches, while an Amazon-like user interface and “shopping cart” process let analysts check out when they’ve found their data.
It’s not a license to be messy with your data, but at least it relieves users of the burden of tracking their data manually. “People used to have careful directories. But these days, they can’t keep track of their directories,” Gorelik says. “You have millions of files. You should organize them as well as you can. [With Waterline software] it doesn’t matter where the file is, as long as you can find it.”
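The idea that location stops mattering once data is tagged can be sketched in a few lines of Python. This is an illustrative toy, not Waterline's implementation and not how Lucene works internally: the `TagCatalog` class, the tags, and the file paths are all hypothetical. It keeps a simple inverted index from each tag to the files that carry it.

```python
from collections import defaultdict


class TagCatalog:
    """Toy data catalog: an inverted index from tag to file paths."""

    def __init__(self):
        self._paths_by_tag = defaultdict(set)

    def tag(self, path, *tags):
        """Attach one or more case-insensitive tags to a file path."""
        for t in tags:
            self._paths_by_tag[t.lower()].add(path)

    def find(self, *tags):
        """Return the sorted paths that carry every one of the given tags."""
        if not tags:
            return []
        matches = [self._paths_by_tag.get(t.lower(), set()) for t in tags]
        return sorted(set.intersection(*matches))


catalog = TagCatalog()
# Hypothetical files scattered across unrelated directories.
catalog.tag("/landing/2016/10/dump_0042.csv", "customer", "email")
catalog.tag("/tmp/old/export.parquet", "customer", "orders")

customer_files = catalog.find("customer")
customer_orders = catalog.find("Customer", "orders")
```

Here `customer_files` holds both paths and `customer_orders` holds only the Parquet export, even though neither file sits in a tidy directory. A production catalog would also need to generate tags automatically, record lineage, and enforce access rules, none of which this sketch attempts.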
Collibra is another master data management (MDM) software vendor helping customers keep track of their Hadoop-resident data using the catalog approach. The company, which recently moved its headquarters to New York City, has an eight-year history of providing data governance solutions to customers in healthcare, financial services, and other industries.
“What we have is a technology platform that has the capability to keep track of processes around data, the metadata and organization and roles and responsibility for data,” Daniel Sholler, director of product marketing at Collibra, tells Datanami. “We keep track of all the technical connections of all the data because you need to know that stuff. But it turns out that stuff isn’t the interesting stuff.”
Instead, Collibra exposes a set of applications that make it relatively easy for end users to get access to data, if they are authorized to access it. That’s the “interesting” stuff that Sholler was referring to. Data access is one component of a collection of data governance solutions that Collibra is offering, and the scope of that offering will expand in the coming weeks.
Another vendor that’s plying the fruitful waters of data cataloging is Alation. The company originally designed its product to “learn” about data connections by observing how analysts interact. But just providing data cataloging wasn’t enough, the company says. So last week Alation announced that in its version 4.0 update, it will also track the queries that are run, along with the data that’s collected.
Tracking queries and data, says Alation CTO Venky Ganti, will provide critical context that’s required for addressing the needs of data stewards and customers, including answering questions like “Where can I find data to answer my question?”, “Can I trust this data?”, “What are the data semantics in order to use it?” and “Who can answer my question about this data set?”

Will you organize the data in your lake by hand, or rely on power tools to simplify the task? (cherezoff/Shutterstock)
“Experts who understand certain datasets often play the stewardship role of ensuring that data consumers can make accurate and effective use of data,” Ganti says in a blog post. “More recently, data governance initiatives have started to assign formal stewardship responsibility.”
Other companies offering data cataloging functionality include Podium Data, which announced a $9.5-million Series A round just prior to the show. Zaloni also unveiled its Bedrock Data Lake Manager (DLM) product, which uses data cataloging to help manage storage more effectively. At Strata, it launched a new version of Mica, its data preparation tool, which introduces a new “shopping cart”-like experience.
That “shopping cart” metaphor was heard often on the Strata expo floor during discussions of data catalogs and big data management. You can expect to see that show up in MDM and data quality tools more often.
Informatica, the big dog of last-gen ETL tools that’s hungering for a piece of the big data pie, also updated its data lake management product, called Data Lake Management, to include more capabilities. Specifically, the product combines data cataloging, stream data capture, Hadoop job management, security, and cloud connectors in a single unified product.
The lack of a centralized data lake management point eats up analysts’ time and hurts productivity, says Amit Walia, executive vice president and chief product officer for Informatica. “Ease of use and a delightful user experience along with robust governance and metadata capabilities are critical for getting business value out of data lakes,” he says in a statement.
According to Gartner analysts Guido De Simoni and Roxane Edjlali, enterprise metadata management, including data cataloging, has become a “required discipline.” “Failure to recognize this will lead to sustained siloed behavior and loss of business value,” they wrote earlier this year.
While data silos will inevitably be with us for a while, we don’t have to behave as if the data is trapped in a single location. As the Gartner analysts rightly point out, organizations that can get a unified view of their data will find greater business value. It’s becoming clear that data catalogs will be one way of providing that visibility.
Related Items:
Delivering on the Data Lake Promise
8 Tips for Achieving ROI with Your Data Lake
Taming the Wild Side of Hadoop Data