Databricks Unveils LakeFlow: A Unified and Intelligent Tool for Data Engineering
Data engineering is a cornerstone of the democratization of data and AI. However, it faces significant challenges: complex and brittle connectors, difficulty integrating data from disparate and often proprietary sources, and operational disruptions. Databricks has addressed some of these challenges with the introduction of a new platform.
At its annual Data + AI Summit, Databricks announced a new data engineering solution, LakeFlow, designed to streamline all aspects of data engineering, from data ingestion to transformation and orchestration.
Built on top of the company's Data Intelligence Platform, LakeFlow ingests data from a range of systems, including databases, cloud sources, and enterprise apps, and then automates pipeline deployment, operation, and monitoring at scale in production.
During his keynote address at the Data + AI Summit, Ali Ghodsi, CEO and Co-Founder of Databricks, shared that data fragmentation is one of the key hurdles to enterprise adoption of GenAI. According to Ghodsi, dealing with the high costs and proprietary lock-in of multiple platforms is a “complexity nightmare.”
Until now, Databricks has relied on partners such as dbt and Fivetran to provide tools for data preparation and loading, but the introduction of LakeFlow eliminates the need for third-party solutions. Databricks now has a unified platform with deep Unity Catalog integration for end-to-end governance and serverless compute for a more efficient and scalable setup.
A significant share of Databricks customers do not use the Databricks partner ecosystem. This major segment of the market builds its own customized solutions based on its specific requirements. These customers want a service that is built into the platform so they don’t have to build connectors, maintain separate data pipelines, or buy and configure new platforms.
A key component of the new platform is LakeFlow Connect, which provides built-in connectors between various data sources and the Databricks service. Users can ingest data from Oracle, MySQL, Postgres, and other databases, as well as enterprise apps such as Google Analytics, SharePoint, and Salesforce.
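LakeFlow Connect’s own interface had not been published at the time of the announcement, but the hand-rolled ingestion it aims to replace typically looks like the following minimal PySpark sketch, which pulls a Postgres table into a Delta table over JDBC. The hostname, credentials, and table names are placeholders, not real endpoints.

```python
# Minimal PySpark sketch of the do-it-yourself JDBC ingestion that built-in
# connectors such as LakeFlow Connect aim to replace. Connection details and
# table names below are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("manual-postgres-ingest").getOrCreate()

# Read a source table from Postgres over JDBC (requires the Postgres JDBC
# driver on the cluster and network access to the database).
orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://db.example.com:5432/sales")
    .option("dbtable", "public.orders")
    .option("user", "ingest_user")
    .option("password", "REDACTED")
    .load()
)

# Land the snapshot as a Delta table in the lakehouse. Incremental change
# capture, scheduling, and failure handling are all left to the pipeline
# author in this approach.
orders.write.format("delta").mode("overwrite").saveAsTable("raw.orders_snapshot")
```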
Built on Databricks’ Delta Live Tables technology, LakeFlow Pipelines enable users to implement data transformation and ETL in either Python or SQL. The feature also offers a low-latency mode for data delivery and near-real-time incremental data processing.
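Because LakeFlow Pipelines builds on Delta Live Tables, a transformation step can already be expressed with the existing DLT Python API. The sketch below, with placeholder source paths and table names, declares a bronze ingestion table and a cleaned silver table with a simple data-quality expectation; the `spark` session is provided by the pipeline runtime.

```python
# Minimal Delta Live Tables-style pipeline sketch; the landing path and table
# names are placeholders. LakeFlow Pipelines builds on this declarative model.
import dlt
from pyspark.sql import functions as F


@dlt.table(comment="Raw order events ingested incrementally with Auto Loader")
def orders_bronze():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/Volumes/sales/raw/orders/")  # placeholder landing path
    )


@dlt.table(comment="Cleaned orders with basic validation")
@dlt.expect_or_drop("valid_amount", "amount > 0")  # drop rows failing the check
def orders_silver():
    return (
        dlt.read_stream("orders_bronze")
        .withColumn("ingested_at", F.current_timestamp())
        .select("order_id", "customer_id", "amount", "ingested_at")
    )
```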
Users can also monitor the health of their data pipelines using the LakeFlow Jobs feature, which provides automated orchestration and data recovery. The tool integrates with alerting systems such as PagerDuty, so when an issue is detected, administrators are automatically notified of the problem.
“So far, we’ve talked about getting the data in, that’s Connectors. And then we said: let’s transform the data. That’s Pipelines. But what if I want to do other things? What if I want to update a dashboard? What if I want to train a machine-learning model on this data? What are other actions in Databricks that I need to take? For that, Jobs is the orchestrator,” Ghodsi explained.
With its control flow capabilities and centralized management, LakeFlow Jobs makes it easier for data teams to automate the deployment, orchestration, and monitoring of data pipelines in a single place.
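LakeFlow Jobs layers on Databricks’ existing Workflows orchestration, and while LakeFlow-specific options are not yet documented, the pattern Ghodsi describes, refresh a pipeline and then act on the result, can already be declared with the Databricks Python SDK. The sketch below uses placeholder pipeline, notebook, cluster, and email values and is not a definitive LakeFlow Jobs API.

```python
# Sketch of job orchestration with the Databricks Python SDK (databricks-sdk).
# Pipeline ID, notebook path, cluster ID, and the notification address are
# placeholders; LakeFlow Jobs adds richer control flow on top of this model.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()  # picks up workspace credentials from the environment

job = w.jobs.create(
    name="nightly-orders-refresh",
    tasks=[
        # Step 1: refresh the ingestion/transformation pipeline.
        jobs.Task(
            task_key="refresh_pipeline",
            pipeline_task=jobs.PipelineTask(pipeline_id="PLACEHOLDER_PIPELINE_ID"),
        ),
        # Step 2: once the pipeline succeeds, rebuild the reporting dashboard.
        jobs.Task(
            task_key="update_dashboard",
            depends_on=[jobs.TaskDependency(task_key="refresh_pipeline")],
            notebook_task=jobs.NotebookTask(
                notebook_path="/Repos/analytics/update_dashboard"
            ),
            existing_cluster_id="PLACEHOLDER_CLUSTER_ID",
        ),
    ],
    # Route failures to on-call; PagerDuty-style alerting is wired up through
    # the workspace's notification destinations rather than in this payload.
    email_notifications=jobs.JobEmailNotifications(on_failure=["oncall@example.com"]),
)
print(f"Created job {job.job_id}")
```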
The introduction of LakeFlow marks a significant milestone in Databricks’ journey toward its mission to simplify and democratize data and AI, helping data teams solve the world’s toughest problems. While LakeFlow is not yet available in preview, Databricks has opened a waitlist for users to sign up for early access.