Will the Data Lakehouse Lead to Warehouse-Style Lock-In?
Lakehouse architectures are gaining steam as a preferred method for doing big data analytics in the cloud, thanks to the way they blend traditional data warehousing concepts with today’s cloud tech. But could lakehouses end up trapping customers with limited and proprietary build-outs, just as the last generation of dedicated data warehouses did? For Ori Rafael, the CEO of data pipeline company Upsolver, it’s a distinct possibility.
One big drawback of the lakehouse approach is that it solves just one use case today, says Rafael, who co-founded Upsolver in 2014. That use case is the data warehouse: a columnar relational database with a SQL query engine, implemented atop AWS S3 or an S3-compatible object store, such as Google Cloud Storage, he says.
“The lakehouse is kind of the rebranding of the warehouse,” Rafael tells Datanami. “I want to take that use case and implement on top of the data lake, so if I’m not using a data warehouse, I’m basically using the lakehouse.”
But there are many other ways to process data besides SQL, and there is a strong demand for different data storage patterns and query mechanisms that optimize the delivery of insight from that raw data. Customers will need NoSQL database approaches as embodied by Cassandra or Redis, search engines like Elastic, and decoupled query engines like Trino and Presto, Rafael says.
“There’s a lot of different use cases for data and no single vendor solves everything,” he says. “I can count at least six or seven different patterns I want to query the data, so I need to store it in multiple ways. I need to allow in my architecture that there will be multiple databases, not just one database.”
When AWS originally described its concept for a data lakehouse, the idea was for a hub-and-spoke type model that offered many different ways to work with the data in the lake. But over the years, the meaning has shifted a bit to mean predominantly the data warehouse pattern, which is the most mature data access pattern today but won’t be the only one, Rafael says.
Before co-founding Upsolver, Rafael worked as an Oracle database administrator. As an Oracle DBA, Rafael strived to learn everything about how the internals of the RDBMS worked so he could optimize its performance. He sees the same process playing out with the new lakehouses of the world, but it’s taking place across a deconstructed stack with many moving parts.
“So who is responsible for storing the data in the right way, the right file format, the right compression, the right file size–all of that?” Rafael says. “Now the customer needs to do that, so that’s additional data engineering I need to do in the lakehouse world.”
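To make one of those chores concrete, here is a minimal sketch of planning a small-file compaction, the kind of file-size housekeeping Rafael describes. It is an illustration only, not how Upsolver or any lakehouse engine does it: the `DataFile` type, the `plan_compaction` helper, and the 128 MB target are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List

# Assumed target size for a data file; the right value depends on the query engine.
TARGET_FILE_BYTES = 128 * 1024 * 1024


@dataclass
class DataFile:
    """Hypothetical descriptor for one file in the lake."""
    path: str
    size_bytes: int


def plan_compaction(
    files: List[DataFile], target_bytes: int = TARGET_FILE_BYTES
) -> List[List[DataFile]]:
    """Group undersized files into batches to be rewritten as larger files.

    Files at or above the target are left alone. Small files are walked in
    path order and packed greedily; a batch is closed as soon as adding the
    next file would push it past the target. Batches holding a single file
    are dropped because there is nothing to merge.
    """
    small = sorted(
        (f for f in files if f.size_bytes < target_bytes),
        key=lambda f: f.path,
    )
    batches: List[List[DataFile]] = []
    current: List[DataFile] = []
    current_size = 0
    for f in small:
        if current and current_size + f.size_bytes > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(f)
        current_size += f.size_bytes
    if current:
        batches.append(current)
    return [b for b in batches if len(b) > 1]
```

The planner only decides which files to merge. Rewriting them in a given format and compression codec is a separate step, and in the deconstructed stack Rafael describes, that step also falls to the customer.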
Databricks takes care of these implementation details for customers who adopt its Delta Lake, and that has worked well for many customers. Databricks and Snowflake have emerged as the two largest independent contenders for big data workloads in the cloud, besides the cloud giants themselves. But neither company has embraced the open data ecosystem in the way that Rafael and others believe is in the best interest of customers who want to take advantage of the diversity of data processing engines.
While Databricks ostensibly embraces open source to a higher degree than Snowflake, which came out against open data platforms almost exactly a year ago (before later delivering Python and Java support in Snowpark), the reality of using non-Databricks products with Delta Lake tables leaves some “openness” to be desired, Rafael says.
“I think what Databricks is doing is another data warehouse because the product used to query Delta Lake is Databricks, in almost all cases,” he says. “It’s not really performant or easy to go and query Delta Lake from other engines. So I think what they’re doing is a rebranding of the data warehouse and I think what the open source community is doing is a real open lakehouse.”
A year ago, Databricks launched Delta Sharing, which enables Delta Lake users to share data with others via Pandas or Spark DataFrames, or to load it directly into Power BI. The company’s website says support for additional targets, like Presto, Trino, R, Hive, and Tableau, is coming soon.
Data Loch-In?
Rafael sees echoes of the data warehouse lock-in that customers experienced with Oracle and Teradata in the new cloud data warehouse offerings. As with Oracle, you primarily store the data and consume the queries in the same place, he says.
The lakehouse is a good idea in concept, Rafael says, but in practice it’s just another data warehouse. While the ability to easily query data using non-SQL engines from a simple API is still a bit green, that capability will soon be a core requirement for running big data workloads in the cloud, and the current incarnation of the lakehouse doesn’t quite cut it, he says.
“I think the lakehouse [has] too much branding around it and not enough essence,” Rafael says. “I really believe in the concept of the lakehouse. This is why we are doing the company, basically deconstructing databases on top of data lake is the reason we founded the company and the reason for the vision.”
Rafael and his team developed Upsolver to improve the way customers build and run data pipelines. The product, which was written primarily in Scala, focuses on automating the transformation of streaming data and files into data formats that can be more easily queried by cloud query engines, like Presto, Trino, and Athena.
The Upsolver process relies on a declarative, SQL-based data pipeline that creates Parquet or Iceberg tables in S3 storage from sources like Kafka and Kinesis. It provides better performance with lower complexity than Spark code, Rafael says. The offering runs in AWS and Microsoft Azure today, and is certified to work with Amazon Athena, a serverless Presto runtime.
“In ETL, the transformation is usually the hard part, and we turned this problem from many different separate steps into one step,” Rafael says. “So we call this declarative pipelines and you basically define a pipelines as your transformation, and that’s it.”
The customer constructs the data transformation in SQL, and the Upsolver product handles the rest. Rafael provided an example of a Spark pipeline that contained 487 lines of code, whereas the Upsolver pipeline contained just nine lines.
“That’s the difference of being declarative, so the only code you have here is transformation,” Rafael says. “You’re just writing SQL like you do a query, but you get a pipeline.”
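The idea of a pipeline defined entirely by its transformation can be sketched in a few lines of Python. This is not Upsolver's syntax or engine. It is a toy under stated assumptions: SQLite stands in for the processing engine, an in-memory list of JSON strings stands in for a Kafka or Kinesis stream, the result is returned as rows rather than written as Parquet or Iceberg tables, and the `events` table, its columns, and the `run_pipeline` helper are all hypothetical.

```python
import json
import sqlite3
from typing import Iterable, List, Tuple

# The only part a user would write: the transformation, expressed as a query.
# Table and column names are hypothetical.
TRANSFORM_SQL = """
SELECT user_id, COUNT(*) AS orders, SUM(amount) AS total_amount
FROM events
WHERE event_type = 'order'
GROUP BY user_id
ORDER BY user_id
"""


def run_pipeline(raw_events: Iterable[str], transform_sql: str) -> List[Tuple]:
    """Stage JSON events in an in-memory table, then apply the user's SQL.

    Everything in this function is plumbing that a declarative tool would
    hide from the user: parsing, staging, executing, and cleaning up.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE events (user_id TEXT, event_type TEXT, amount REAL)"
        )
        rows = []
        for line in raw_events:
            event = json.loads(line)
            rows.append(
                (event["user_id"], event["event_type"], event.get("amount", 0.0))
            )
        conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)
        return conn.execute(transform_sql).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    sample = [
        '{"user_id": "u1", "event_type": "order", "amount": 10.0}',
        '{"user_id": "u2", "event_type": "order", "amount": 5.5}',
        '{"user_id": "u1", "event_type": "click"}',
        '{"user_id": "u1", "event_type": "order", "amount": 2.5}',
    ]
    print(run_pipeline(sample, TRANSFORM_SQL))
```

Changing what the pipeline computes means editing only `TRANSFORM_SQL`. The staging and execution code stays the same, which is the division of labor Rafael describes.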