
June 30, 2022

Databricks Bolsters Governance and Secure Sharing in the Lakehouse

Data governance is one of the four pillars necessary for the future of AI, alongside backward-looking analytics, forward-looking AI, and real-time decision-making. To that end, Databricks rolled out several new governance capabilities for its unified lakehouse architecture at Data + AI Summit, including the general availability (GA) of Unity Catalog and Delta Sharing and the unveiling of Databricks Marketplace and Cleanrooms.

Anyone who’s had to manage big data knows that data governance for AI is very complex, says Matei Zaharia, a Databricks co-founder and its CTO. For starters, controlling permissions across disparate data repositories is difficult. Some repositories support fine-grained row- and column-level restrictions, while others, like Amazon S3, don’t.

“And it’s also very hard to change your models for how you organize data,” Zaharia said during his keynote address at the Data + AI Summit Tuesday in San Francisco. “You have to move all the files around if you want to change your directory structure. So that’s already kind of awkward.

“On top of that, you probably want to think of your data as tables and views,” the Spark creator continued. “So you might have something like Hive metastore where you set permission on tables and views. And it sounds great. But the problem is those permissions can be out of sync with the underlying data, and so that leads to a lot of confusion.”

Managing data permissions and access control in a busy lakehouse can be a big challenge (lucadp/Shutterstock)

Data warehouses typically support a richer approach based on SQL GRANT statements, he said. “And then you have many other systems, like your machine learning platform, dashboards, and so on, and they all have their own way of doing permissions, and you have to somehow make sure your policies are consistent across all of these.”

The company is addressing this hodge-podge of data governance approaches with Unity Catalog. Databricks first unveiled Unity Catalog a year ago at Data + AI Summit, and yesterday announced that it will become generally available on AWS and Microsoft Azure in the coming weeks.

Unity Catalog provides a centralized governance solution that brings features like built-in search and discovery and automated lineage to all data workloads. The product enforces permissions on tables using ANSI SQL GRANT statements, Zaharia said, and it can also control access to other data assets, such as files stored in an object store, via REST.
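That single GRANT model is what keeps policies from drifting between systems. As a rough illustration, here is a minimal sketch of what table-level grants might look like when issued from a Databricks notebook; the catalog, schema, table, and group names are hypothetical, and `spark` is the SparkSession that Databricks notebooks provide automatically.

```python
# Minimal sketch: expressing table permissions as ANSI SQL GRANT statements.
# Assumes a Databricks notebook, where `spark` (a SparkSession) is predefined.
# The securables (main.sales.orders) and the group (analysts) are hypothetical.

spark.sql("GRANT USAGE ON CATALOG main TO `analysts`")              # let the group see the catalog
spark.sql("GRANT USAGE ON SCHEMA main.sales TO `analysts`")         # ...and the schema within it
spark.sql("GRANT SELECT ON TABLE main.sales.orders TO `analysts`")  # read-only access to one table

# Audit what has been granted on the table.
spark.sql("SHOW GRANTS ON TABLE main.sales.orders").show()
```

Because the grants live in the catalog rather than in each engine, the same policy applies whether the table is reached from a notebook, a dashboard, or a SQL client.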

Databricks recently added support for lineage tracking, which Zaharia said will be very useful for a range of data assets. “This allows you to set up and track lineage on tables, columns, dashboards, notebooks, jobs, basically anything that you can run in the Databricks platform, and see what kind of data and who’s using it downstream,” he said.

Delta Sharing

Companies are beginning to ramp up their data sharing with partners and others. The reason, of course, is the potential to develop better insights and train more powerful AI by augmenting their own data with data from other organizations in the same industry. According to Gartner, organizations that participate in a data sharing ecosystem can expect a 3x boost in economic performance compared to their non-sharing peers.

Databricks Delta Sharing is now GA

The challenge, then, becomes how to enable data sharing while maintaining some semblance of control over the data and minimizing the need for extensive manual data handling. One mechanism created by Databricks is Delta Sharing, another previously announced feature of its lakehouse that will become GA in the weeks to come.

Delta Sharing enables customers to share data across multiple platforms via a REST API. “Basically, any system that can process Parquet can read data through Delta Sharing,” Zaharia says.

Any customer with a Delta table can share their data, even if the parties are on different clouds. All that’s required is a client with a Delta Sharing connector, such as a Spark shell, pandas, or even Power BI, he says. The transfers happen quickly and efficiently, Zaharia says, since Delta Sharing uses “a feature of cloud object store that allows you to give someone temporary access to read just one file.”
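On the consumer side, a minimal sketch using the open-source delta-sharing Python client is below; the profile file path and the share, schema, and table names are hypothetical placeholders for whatever a provider actually issues.

```python
import delta_sharing  # open-source client: pip install delta-sharing

# A profile file issued by the data provider contains the sharing server
# endpoint and a bearer token. This path is a hypothetical placeholder.
profile = "/tmp/provider.share"

# Discover everything the provider has shared with this recipient.
client = delta_sharing.SharingClient(profile)
for table in client.list_all_tables():
    print(table)

# Read one shared table straight into a pandas DataFrame. Table URLs
# take the form "<profile-file>#<share>.<schema>.<table>".
df = delta_sharing.load_as_pandas(profile + "#sample_share.default.trips")
print(df.head())
```

Under the hood, the protocol hands the client short-lived, pre-signed URLs to the underlying Parquet files, which is why reads stay fast and the provider never has to copy data out to each recipient.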

Since unveiling Delta Sharing a year ago, usage has started to take off. According to Zaharia, more than 1PB of data is shared every day using Delta Sharing on the Databricks platform.

Marketplace and Cleanrooms

The maturation of Delta Sharing has led to two new products: Databricks Marketplace and Cleanrooms.

The new Databricks Marketplace is based on Delta Sharing and will enable anybody with a Delta Sharing-compatible client to buy, sell, and share data and data solutions. The offering will fill in the gaps left by data marketplaces that aren’t meeting the needs of data providers, Zaharia said.

Data cleanrooms are emerging as a way to securely share data with other organizations (hvostik/Shutterstock)

“One limitation is that each marketplace is closed,” he said. “It’s for a specific cloud or a specific data warehouse or software platform, because the goal of these vendors is to get more people computing on their platform and paying them money. That’s nice for those vendors. But if you’re a data provider and you worked hard to create a data set, it’s really annoying to have to publish to 10 different platforms just to reach all the users who want to use your data set.”

The Databricks Marketplace isn’t restricted to trading data; it will also host code assets, such as notebooks, machine learning models, and dashboards, Zaharia said. “We’ve…set it up so pretty much anything you can build on the Databricks platform, you can publish on the Databricks marketplace to give someone a complete application,” he said.

Databricks Cleanrooms will become available in the months to come; the company is not planning to charge a fee at this point.

Last but not least, Databricks is launching a new Cleanrooms service, which will also be available in the coming months. According to Databricks, the service will provide a way to share and join data across organizations within a secure, hosted environment.

One key aspect of Cleanrooms, which is also based on Delta Sharing, is that it eliminates the need to manually replicate data. It will enable users to collaborate with customers and partners on any cloud, with the flexibility not only to share data but also to run computations and workloads using SQL and data science tools in Python, R, and Scala.

Related Items:

It’s Not ‘Mobile Spark,’ But It’s Close

Why the Open Sourcing of Databricks Delta Lake Table Format Is a Big Deal

Databricks Unveils Data Sharing, ETL, and Governance Solutions
