
Selecting a Data Lake ETL Platform? Here Are 6 Questions to Ask

(Alexander Supertramp/Shutterstock)
Not all data lakes are created equal. If your organization wants to adopt a data lake solution to simplify your IT infrastructure, make it easier to operate, and store enormous quantities of data without requiring extensive up-front transformation, then go for it.
But before you do, understand that simply dumping all your data into object storage such as AWS S3 doesn’t exactly mean you will have a working data lake.
The ability to use that data in analytics or machine learning requires converting that raw information into organized datasets you can use for SQL queries, and this can only be done via extract-transform-load (ETL) flows.
Data lake ETL platforms are available in a full range of options – from open-source to managed solutions to custom-built. Whichever tool you select, it’s important to differentiate data lake ETL challenges from traditional database ELT demands – and seek the platform that overcomes these obstacles.
Ask yourself which ETL solution:
1. Effectively Conducts Stateful Transformations
Traditional ETL frameworks allow for stateful operations such as joins and aggregations, which let analysts work with data from multiple sources; these operations are difficult to implement in a decoupled architecture.
Traditionally, stateful transformations rely on extract-load-transform (ELT) – i.e., sending data to an “intermediary” database and using that database’s SQL, processing power, and already amassed historical data. After transformation, the information is loaded into the data warehouse tables.
Data lakes, which reduce cost and complexity precisely through that decoupled architecture, cannot depend on databases for every activity. You’ll need to look for an ETL tool that can conduct stateful transformations in-memory and needs no additional database to sustain joins and aggregations.
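To make this concrete, here is a minimal sketch, in plain Python with hypothetical event and user data, of what an in-memory stateful transformation looks like: a stream of click events is joined against a small reference table and aggregated, with all state held in process rather than in an intermediary database.

```python
from collections import defaultdict

# Hypothetical reference data that would otherwise sit in a database table.
users = {
    "u1": {"country": "US"},
    "u2": {"country": "DE"},
}

# Hypothetical raw click events arriving from the lake's ingestion layer.
events = [
    {"user_id": "u1", "page": "/home"},
    {"user_id": "u2", "page": "/pricing"},
    {"user_id": "u1", "page": "/docs"},
]

# In-memory state for the aggregation: clicks per country.
clicks_by_country = defaultdict(int)

for event in events:
    user = users.get(event["user_id"])  # the join step, done in memory
    if user is None:
        continue  # skip events with no matching reference record
    clicks_by_country[user["country"]] += 1

print(dict(clicks_by_country))  # {'US': 2, 'DE': 1}
```

A production ETL engine does the same thing at scale, typically spilling state to disk and checkpointing it for fault tolerance, but the principle is the same: the join and the aggregation never leave the processing layer.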
2. Extracts Schema from Raw Data
Organizations customarily use data lakes as a repository for raw data in structured or semi-structured form, whereas databases are predicated on structured tables. This poses numerous challenges.
First, in order to query the data, can the data lake ETL tool extract a schema from the raw data (without one, querying is impossible) and keep it up to date as the data and its structure change? And second (an ongoing struggle), can the ETL tool efficiently query nested data?
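As an illustration, here is a minimal sketch of schema inference over raw nested JSON, using hypothetical records. Note how nested fields surface as dotted paths, and how a field that first appears in a later record (ts below) simply extends the schema; a real ETL tool would also have to reconcile type conflicts between records.

```python
import json

def infer_schema(records):
    """Infer a flat {dotted.path: type_name} schema from nested JSON records."""
    schema = {}

    def walk(obj, prefix=""):
        for key, value in obj.items():
            path = f"{prefix}{key}"
            if isinstance(value, dict):
                walk(value, prefix=f"{path}.")  # recurse into nested objects
            else:
                # Last writer wins here; real tools reconcile type conflicts.
                schema[path] = type(value).__name__

    for record in records:
        walk(record)
    return schema

raw_lines = [
    '{"id": 1, "user": {"name": "ada", "plan": "pro"}}',
    '{"id": 2, "user": {"name": "bob", "plan": "free"}, "ts": 1700000000}',
]
print(infer_schema(json.loads(line) for line in raw_lines))
# {'id': 'int', 'user.name': 'str', 'user.plan': 'str', 'ts': 'int'}
```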
3. Improves Query Performance Via Optimized Object Storage
Have you tried to read raw data straight from a data lake? Unlike a database’s optimized file system, which quickly sends back query results, the same operation against a data lake can be quite frustrating performance-wise.
To get optimal results, your ETL framework should continuously store data in columnar formats and merge small files into the 200 MB to 1 GB range. And unlike traditional ELT tools, which only need to write the data once to a target database, data lake ETL should be able to write multiple copies of the same data, organized according to the queries you will want to run and the optimizations your query engines require to be performant.
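For example, the small-file compaction step might look like the following sketch, which uses pyarrow and assumes a hypothetical partition directory of small Parquet files left behind by a streaming writer.

```python
import pyarrow as pa
import pyarrow.parquet as pq
from pathlib import Path

# Hypothetical layout: many small Parquet files under one partition directory.
partition_dir = Path("lake/events/dt=2020-01-01")
small_files = sorted(partition_dir.glob("part-*.parquet"))

if small_files:
    # Read and concatenate the small files into one in-memory table.
    table = pa.concat_tables([pq.read_table(f) for f in small_files])

    # Rewrite as a single columnar file. In practice a compactor targets
    # output files in the ~200 MB to 1 GB range, not one file per partition.
    pq.write_table(table, partition_dir / "compacted.parquet", compression="snappy")

    # Once the compacted file is verified, the small files can be dropped.
    for f in small_files:
        f.unlink()
```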
4. Easily Integrates with the Metadata Catalog
You’ve chosen the data lake approach for its flexibility (store large quantities of data now, analyze it later) and for its ability to handle a wide range of analytics use cases. Such an open architecture should keep metadata separate from the engine that queries it, so you can easily swap query engines or use several simultaneously on the same data.
This means the data lake ETL tool you select should reinforce this open architecture, i.e., integrate seamlessly with the metadata catalog. The metadata then stays easily “queryable” by various services, because it is stored in the catalog and kept in sync with every change in schema, partitions, and object locations.
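As a sketch of what that integration involves, here is how an ETL job might register a newly written partition, assuming an AWS Glue Data Catalog and hypothetical database, table, and bucket names; other catalogs, such as a Hive metastore, expose equivalent calls.

```python
import boto3

glue = boto3.client("glue")  # assumes AWS credentials are configured

# Register a newly written partition so that any engine reading the
# catalog (Athena, Spark, Presto, ...) immediately sees the new data.
glue.create_partition(
    DatabaseName="analytics",  # hypothetical database name
    TableName="events",        # hypothetical table name
    PartitionInput={
        "Values": ["2020-01-01"],  # value of the dt partition key
        "StorageDescriptor": {
            "Location": "s3://my-lake/events/dt=2020-01-01/",  # hypothetical bucket
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
        },
    },
)
```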
5. Replays Historical Data
Say you want to test a hypothesis by looking at stored data on a historical basis. This is difficult to accomplish with the traditional database option, where data is stored in a mutable state; running such a query could be prohibitively expensive and create tension between operational and analytical workloads.
It’s easy to do with a data lake. In a data lake, stored raw data remains continuously available; it is only transformed after extraction. Having a data lake therefore allows you to “travel back in time” and see the exact state of the data as it was collected.
“Traditional” databases don’t allow for that, as the data is only stored in its transformed state.
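A replay can be as simple as re-running a transformation over the immutable raw zone. Here is a minimal sketch assuming a hypothetical date-partitioned directory of JSON-lines files; the key point is that the raw files are read as-is, so the same events can be reprocessed under a new hypothesis at any time.

```python
import json
from pathlib import Path

def replay(raw_dir, start_date, end_date, transform):
    """Re-run `transform` over immutable raw files whose date partition
    falls inside [start_date, end_date] (dates as 'YYYY-MM-DD' strings)."""
    for day_dir in sorted(Path(raw_dir).iterdir()):
        day = day_dir.name.split("=")[-1]  # e.g. "dt=2020-01-15" -> "2020-01-15"
        if not (start_date <= day <= end_date):
            continue
        for f in sorted(day_dir.glob("*.json")):
            for line in f.read_text().splitlines():
                yield transform(json.loads(line))

# Hypothetical usage: test a new sessionization rule against January's raw
# events without touching any production table.
# sessions = list(replay("lake/raw/events", "2020-01-01", "2020-01-31", sessionize))
```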
6. Updates Tables Periodically
Unlike databases, which let you update and delete rows in tables, data lakes store data in partitioned files that are append-only. If you want to store transactional data, implement change data capture in the data lake, or delete particular records for GDPR compliance, you’ll have difficulty doing so.
Make sure the data lake ETL tool you choose can sidestep this obstacle. Your solution should support upserts, i.e., inserting new records or updating existing ones, both in the storage layer and in the output tables.
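Under the hood, an upsert on append-only storage usually means rewriting the affected files with new records merged in by key (copy-on-write); open table formats such as Delta Lake, Apache Hudi, and Apache Iceberg industrialize this pattern. Here is a minimal sketch of the merge itself, with hypothetical rows:

```python
def upsert(existing_rows, new_rows, key="id"):
    """Copy-on-write upsert: merge new records into an existing partition,
    letting a new row replace any old row that shares the same key."""
    merged = {row[key]: row for row in existing_rows}  # current state by key
    for row in new_rows:
        merged[row[key]] = row  # insert a new key or overwrite an existing one
    return list(merged.values())  # this list becomes the rewritten partition

current = [{"id": 1, "plan": "free"}, {"id": 2, "plan": "pro"}]
changes = [{"id": 2, "plan": "enterprise"}, {"id": 3, "plan": "free"}]
print(upsert(current, changes))
# [{'id': 1, 'plan': 'free'}, {'id': 2, 'plan': 'enterprise'}, {'id': 3, 'plan': 'free'}]
```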
About the author: Ori Rafael is the CEO and co-founder of Upsolver, a provider of a self-service data lake ETL platform that bridges the gap between data lakes and data consumers. Ori has worked in IT for nearly two decades and has an MBA from Tel Aviv University.
Related Items:
Merging Batch and Stream Processing in a Post Lambda World