Finding the Data Access Governance Sweet Spot
In part one of this two-part series, I presented the three most common “tried and failed” approaches that large enterprises take when implementing data access controls to increase security and enable compliance with evolving privacy regulations. The three failed approaches all reveal ways in which complexity is the enemy of security. Creating secure copies of data, defining policies as “views,” and using Apache Ranger to enable fine-grained access controls all lead to fragmentation and mounting complexity, opening the door to a data management nightmare, potential security gaps and compliance failures. Increasing complexity can also make it impossible to provide the right access to the right people at the right time, inhibiting business productivity and innovation.
In this follow-on article, I will discuss three additional lessons learned that many successful large enterprises have applied to reach that “sweet spot” where big data can be used responsibly, compliance can be automated, and data management can be made easier.
Lesson 1: Strive for a Single Source of Authoritative Data
The opposite of curating secure copies or views of data is the ability to implement dynamic data access policies on top of a single source of authoritative data. This is the foundation of a successful data access management program. A single source of truth eliminates the proliferation of redundant and ungovernable data silos – while making access management far simpler.
This doesn’t mean you have to consolidate all your data in one place. If you subscribe to the idea of a data lakehouse, for example, great! But if your organization wants or needs to operate disparate data platforms in the style of a data mesh, that’s fine too. The lesson here is that within each system, you shouldn’t curate multiple versions of the same data for security purposes. It gets ugly and you quickly lose control, which is the opposite of what you’re trying to achieve.
Instead, implement dynamic data access policies on your authoritative data sources. Modern, universal data authorization platforms allow you to apply fine-grained access controls – mask, hide or tokenize information at the file, column, row and cell level – in real time based on the user’s entitlements and query context. Dynamic data access policies serve the twin goals of effective governance and user productivity.
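To make this concrete, here is a minimal Python sketch of what a dynamic, attribute-aware masking decision can look like at query time. It is purely illustrative: the user structure, the column classification tags and the masking rules are invented for this example and do not represent any particular vendor’s policy engine.

```python
# Hypothetical illustration of a dynamic data access policy; nothing here is a
# real vendor API. The classes, tags and rules are invented for the example.
from dataclasses import dataclass, field

@dataclass
class User:
    name: str
    roles: set = field(default_factory=set)   # e.g. {"card_analyst"}
    purpose: str = ""                          # query context, e.g. "fraud_review"

# Column-level classification tags on the single authoritative table.
COLUMN_TAGS = {"customer_name": "PII", "ssn": "SSN", "txn_amount": "FINANCIAL"}

def mask(value, tag):
    """Redact or tokenize a value instead of copying data into a 'safe' silo."""
    if tag == "SSN":
        return "XXX-XX-" + str(value)[-4:]     # show last four digits only
    if tag == "PII":
        return "<redacted>"
    return value

def apply_policy(user, row):
    """Evaluate the policy at query time against the single source of truth."""
    result = {}
    for column, value in row.items():
        tag = COLUMN_TAGS.get(column)
        # Attribute-aware decision: full values only for a privileged role
        # querying for an approved purpose.
        privileged = "fraud_team" in user.roles and user.purpose == "fraud_review"
        if tag in {"SSN", "PII"} and not privileged:
            result[column] = mask(value, tag)
        else:
            result[column] = value
    return result

row = {"customer_name": "Ada Lovelace", "ssn": "123-45-6789", "txn_amount": 42.50}
print(apply_policy(User("analyst1", {"card_analyst"}, "reporting"), row))
# {'customer_name': '<redacted>', 'ssn': 'XXX-XX-6789', 'txn_amount': 42.5}
```

The point of the sketch is that the decision happens when the query runs, against the one authoritative copy of the data, rather than by pre-building a sanitized copy or view for each audience.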
A single source of truth also makes it easier to manage and standardize a continuous integration/continuous delivery (CI/CD) pipeline, enabling administrators to catalog and classify only a single data set. This in turn enables a change-once-implement-immediately approach. It also supports efficient auditing to allow the business to demonstrate compliance to regulators.
Lesson 2: Separate the Policy from the Platform
To fully understand an organization’s requirements for security and compliance, the data governance team must collaborate with all other data stakeholders. And instead of dwelling on the technical complexity of policies and policy enforcement, the collaborative discussion should focus on which data consumer roles get to use which classifications of data.
Collaboration with and input from the following teams will help create the optimal foundation for your data program.
- Compliance – Regulatory compliance requirements, such as the right to be forgotten and the personally identifiable information (PII) that must be redacted or obfuscated
- Security – Requirements for Zero Trust data access policies and how to optimize them to minimize risks
- IT – Requirements for a modern data platform, such as cloud-first, containerization and sufficient scalability to support massive data lakes and the required number of users, use cases and computing nodes, etc.
- Lines of business – Their needs for the data program, such as dashboards, machine learning (ML) models, customer 360 views, etc.
By working together within the context of a collaborative platform that recognizes all data stakeholders, the organization can define what consistent policy enforcement across the enterprise looks like – which then allows for automation of policy enforcement. This information is essential for shifting from a limited role-based access control (RBAC) strategy to a combined RBAC and attribute-based access control (ABAC) strategy.
Why RBAC + ABAC? Role-based access control (RBAC) is the standard in most organizations today. But it is insufficient in our post-big data era, when the three Vs of volume, velocity and variety are real and present problems. For example, every data analyst in a financial firm – or group of analysts in a line of business (LOB) within the firm – may be assigned a “card analyst” role so that only they can access transaction databases. While this simple RBAC strategy works for simple use cases, the roles must be managed manually, and every new use case requires the creation of a new role, with new permissions granted to the user or users. Further, RBAC is usually limited to coarse-grained access (e.g., an entire table or file), and each system handles role definition and permission management differently. So as the data platform grows in scale, the organization experiences “role explosion,” and complexity abounds.
Attribute-based access control (ABAC), by contrast, allows for far more flexible access policy definitions by leveraging attributes to make a context-aware decision about any individual request for access. For example, if data is classified “SSN,” only people with certain roles should be able to work with it; you no longer have to assign roles to individual resources by name. Combined with RBAC, ABAC lets very granular policy requirements scale to more people and use cases without hard coding, manual configuration or role explosion. And since the definitions are abstracted out, administrators benefit from easy repeatability and policy reusability across multiple data sources.
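The Python sketch below illustrates the difference. Instead of granting a named role on each individual table, a single rule keyed on a classification such as “SSN” governs every resource carrying that tag; the rule format and attribute names here are hypothetical, not drawn from any specific product.

```python
# Hypothetical RBAC + ABAC decision sketch. One abstract rule covers every
# resource carrying a given classification, so a new table or column never
# requires a new role or a new per-resource grant.
POLICY_RULES = [
    {"classification": "SSN",       "allowed_roles": {"fraud_team"},                 "regions": {"US"}},
    {"classification": "FINANCIAL", "allowed_roles": {"card_analyst", "fraud_team"}, "regions": {"US", "EU"}},
]

def is_allowed(user_roles, user_region, resource_classification):
    """Combine the user's role (RBAC) with resource and request attributes (ABAC)."""
    for rule in POLICY_RULES:
        if rule["classification"] != resource_classification:
            continue
        if user_roles & rule["allowed_roles"] and user_region in rule["regions"]:
            return True
    return False

# A card analyst can read financial columns but not SSN-tagged ones, regardless
# of which table or file those columns live in.
print(is_allowed({"card_analyst"}, "US", "FINANCIAL"))  # True
print(is_allowed({"card_analyst"}, "US", "SSN"))        # False
```

Because the rules reference classifications and request attributes rather than named tables, onboarding a new data set only requires classifying it; no new roles or per-resource grants are needed.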
The benefits of ABAC include ensuring reliable policy change management, avoiding policy drift across the enterprise, eliminating manual effort to stay in compliance as policies change over time, and increasing data usage intelligence thanks to full visibility.
Lesson 3: Choose Universal Policy Enforcement
Abstract policies need concrete enforcement. Choose a universal data authorization platform that dynamically applies policies consistently and reliably. For example, policies should apply equally to data scientists running Spark on AWS EMR and LOB analysts running Looker queries against Snowflake. Only a universal platform approach enables policies to be automatically and intelligently enforced everywhere without the need for user intervention.
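As a simplified illustration of that idea, the following Python sketch “compiles” one abstract rule into SQL expressions that different engines could enforce inline. The rule structure, the role expressions and the generated SQL are assumptions made for this example, not the output of any real authorization platform.

```python
# Sketch of "define once, enforce everywhere": a single abstract rule is
# rendered into engine-specific SQL. Everything here is illustrative only.
ABSTRACT_RULE = {
    "classification": "SSN",
    "allowed_roles": {"fraud_team"},
    "action": "mask_last_four",
}

# Engine-neutral masking expressions, keyed by the rule's action.
MASK_SQL = {
    "mask_last_four": "CONCAT('XXX-XX-', RIGHT({col}, 4))",
    "redact": "'<redacted>'",
}

def to_case_expression(column, rule, current_role_expr):
    """Render one abstract rule as a CASE expression a query engine can enforce inline."""
    allowed = ", ".join(f"'{role}'" for role in sorted(rule["allowed_roles"]))
    masked = MASK_SQL[rule["action"]].format(col=column)
    return f"CASE WHEN {current_role_expr} IN ({allowed}) THEN {column} ELSE {masked} END"

# The same rule, rendered for two different enforcement points.
print(to_case_expression("ssn", ABSTRACT_RULE, "CURRENT_ROLE()"))  # e.g. inside a warehouse view
print(to_case_expression("ssn", ABSTRACT_RULE, "session_role"))    # hypothetical Spark SQL variable
```

The design point is that the policy is authored once, in abstract terms, and the enforcement detail for each engine is generated rather than hand-maintained per system.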
As enterprises are finding in multiple disciplines, from network security to content marketing, relying on a technology platform that can seamlessly integrate partner technologies is the most efficient way to implement and manage a particular strategy. A data policy platform also centralizes auditing and can position an organization to implement distributed stewardship. When evaluating a platform for data access governance, be sure it is agnostic to both the underlying technologies and the data platforms it governs. This is the only way to allow for a single policy that is understandable and usable for every data system and stakeholder, independent of the underlying solutions.
Make Simplicity the Ally of Security
Big data and evolving privacy regulations have introduced unprecedented information management complexity for enterprises, making security and compliance more difficult than ever. However, as some of the world’s most well-known brands have learned through trial and error, this complexity can be reduced and effectively managed – and security and compliance can be enhanced – when organizations:
- Strive for a single source of authoritative data and enforce fine-grained access controls using ABAC.
- Take a collaborative approach to implementing controls for who can access what sensitive data.
- Adopt a technology-agnostic universal data authorization platform.
About the author: Nong Li is the co-founder and CTO of Okera. Prior to co-founding Okera in 2016, he led performance engineering for Spark core and SparkSQL at Databricks. Before Databricks, he served as the tech lead for the Impala project at Cloudera. Nong is also one of the original authors of the Apache Parquet project. He has a bachelor’s in computer science from Brown University.