Designing a Data Lake Management and Security Strategy

Updated: 2022-09-13


Problem

Organizations that own established legacy data platforms undergo a mindset shift as they adopt modern cloud data Lakehouse platforms. Traditional data warehouse platforms are highly mature storage technologies that deliver numerous benefits, including the capability to record data in an ACID-compliant manner and ensure high levels of integrity. While the traditional data warehouse has served the industry well for data analytics and business intelligence use cases, it often lacks the flexibility to process semi-structured and unstructured data, and it can be expensive to implement and maintain. Data providers and consumers alike experience a number of disadvantages with these traditional platforms: data scientists struggle to conduct advanced analytics use cases, DBAs find it challenging and time-intensive to maintain the systems, and business stakeholders cringe at the monthly bills, which can be extremely expensive.

The Data Lakehouse platform solves a number of these challenges. By decoupling storage from compute, cost efficiency can be achieved on the Lakehouse platform. With its object storage capabilities for seamlessly managing semi-structured and unstructured data files, the Lakehouse platform is well suited for advanced analytics use cases such as AI and ML. Built on distributed cloud computing platforms, the Lakehouse reaps the benefits of pay-as-you-go and pay-per-query pricing models. In addition to solving the age-old challenges faced by traditional data warehousing systems, the modern data Lakehouse also brings with it the mature capabilities of traditional data systems, such as ACID compliance, indexing, partitioning, optimization techniques, metadata management, querying with standard ANSI SQL, and more. However, because Data Lakes, the underlying storage technology for the Lakehouse platform, take a file-based rather than relational approach to recording data, they call for different data management and security strategies. As an organization's Lakehouse implementation takes flight, it is frequently interested in deepening its understanding of how to design and implement a Data Lake management and security strategy.

Solution

The Data Lakehouse paradigm continues to win over numerous organizations as they embark on their digital innovation and transformation journeys to modernize their big data and advanced analytics platforms and enable growth and value-added business outcomes. Components of this modern Lakehouse, such as Apache Spark, are winning prestigious industry awards, such as the recent one from the Association for Computing Machinery's Special Interest Group on Management of Data (SIGMOD). These awards are presented to systems whose technical contributions have significantly impacted the theory or practice of large-scale data management systems. As a result of this and other continued upticks in Lakehouse innovation, numerous cloud providers, vendors, and solution integrators are supporting this upward cloud adoption trajectory by building products and providing service solutions to design and implement the Modern Data Lakehouse Platform through well-architected frameworks.

Data Lake

The Data Lake plays a critical role in the Modern Data Lakehouse Platform, primarily because all enterprise data is stored within the lake in various formats. The lake can support structured, semi-structured, and unstructured data. Since storage within the lake is cheaper than OLTP databases or OLAP data warehouses, several big data and advanced analytics use cases with differing velocities of data, such as batch and real-time, can easily be accommodated within the data lake. Also, most modern versions of Data Lakes across cloud providers can enable a hierarchical namespace (HNS) feature, which allows the objects and files within the lake containers to be organized into pre-defined hierarchies of directories and nested subdirectories. A storage account with the hierarchical namespace enabled combines the scalability and cost-effectiveness of object storage with file system semantics that are familiar to analytics engines and frameworks. The HNS feature is also the key differentiator between this kind of account and a general Blob Storage account. Within the Azure Cloud ecosystem, for example, this HNS-enabled storage account is called Data Lake Storage Gen2.

From a file format perspective, snappy-compressed parquet files are typically the most optimal format for highly performant analytics workloads due to their columnar layout and compression ratios that can approach ~97.5%. An advancement of the Azure Data Lake Storage Gen2 account is the addition of support for 'Delta,' which builds on the columnar parquet file format best suited for analytics workloads. Delta transaction log files provide ACID transactions and isolation levels to Spark for processing data within your Lakehouse. Delta supports caching, time travel, merging datasets, optimization, and schema evolution, among other features, which makes it a prime candidate for data storage within the modern Data Lakehouse Platform. The figure below of an Azure Modern Data Lakehouse Platform highlights the data lake's role within this platform. Zones (Bronze, Silver, Gold), which we will cover in a later section, can be designed to capture various stages of data storage and processing in Delta format as your enterprise data gets ingested, transformed, and served downstream to a variety of consumers through workspaces and reporting tools. In addition to consumption-layer tools, most cloud technologies also have the capability of connecting to the Data Lake.

Data Lake Design and Management

When designing and managing data within the Data Lake, it is important to focus on security implications early and to design data partitions together with authorization. We will explore security in a later section, but authorization is typically defined by Active Directory (AD) groups, Role-Based Access Control (RBAC), and Access Control Lists (ACL). Also, when designing a data lake, keep in mind that some data redundancy might be an acceptable trade-off for security. Multiple levels of folder nesting are acceptable as long as they are applied consistently, and it is good practice to collocate similar file formats and datasets within a single folder structure. A good folder structure should not begin with date partitions; dates should reside within the lower folder levels. Defining and adhering to a naming convention is critical to a good data lake design, as is including time elements in the folder structure and file name, when appropriate.

Zones typically define the root-level folder hierarchies in a data lake container. Zones do not always need to reside in the same physical data lake and could also reside as separate file systems, storage accounts, or even in different subscriptions. Multiple storage accounts in different subscriptions may be a good idea for large throughput requirements exceeding a rate of 20,000 requests per second. The figure below illustrates the typical zones within a Data Lake and their purpose. Modern Data Lakehouse platforms follow a Medallion-style design wherein the bronze zone is a raw storage layer that contains data sourced from various on-premises and cloud-based systems. It is segregated by source system, dataset, and time-based (e.g., Year, Month, Day) hierarchies, and it serves as the enterprise's landing zone. Similarly, the silver zone contains data that is sanitized, enhanced, and staged for further analysis; for example, this can include the de-sensitization of PII data and the deduplication of raw data from the bronze zone. Lastly, the gold zone contains transformed, aggregated, and modeled data processed from the silver zone. For example, this zone can include Facts and Dimensions, and it typically stores data ready for consumption by end users. Organizations introduce a variety of other zones as needed, dependent on the use case. We will explore the design of these and other zones in a later section.

There are several options for storing and managing data within a Data Lake. Since the underlying storage of the lake is essentially object-based, folder and file hierarchies can be defined in many ways to meet the specific needs of the organization, its customers, and their departmental or program-specific use cases. The figure below illustrates some of these options. Zones can be defined by multiple folders in a container, as shown in Option 1. Alternatively, zones can be scoped at the container level by using multiple containers within a storage account, as shown in Option 2. Finally, zones can also be defined by multiple storage accounts, as shown in Option 3.

Bronze Zone

The Bronze Zone serves the purpose of storing vast quantities, varieties, and velocities of raw data. This data can be batch or streaming, and structured, unstructured, or semi-structured. The figure below shows a sample naming convention and hierarchical folder and file structure for designing a high-quality Bronze Zone. Notice that the naming conventions adhere to good design patterns, the dates reside within the lower levels of the folder structure, and the file name contains a timestamp.

Here is an example of how this structure would appear after it is implemented.

Example:


Adventureworks–Dev>rg-lakedata-dev-eu2>stglakedatadeu2\medallion\BRONZE\CUSTOMER\ABC_ELECTRONIC\Landing\2022\05\06\ABC_ELECTRONIC_Full_12:45:34.parquet
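A path like this can be generated programmatically so that every ingestion job follows the same convention. The helper below is an illustrative sketch, not part of the article's reference design (the function name and parameters are assumptions); it also writes the timestamp without colons, since colons are awkward or invalid in many file systems.

```python
from datetime import datetime

def bronze_landing_path(subject_area: str, source_system: str,
                        ingested_at: datetime, load_type: str = "Full",
                        fmt: str = "parquet") -> str:
    """Build a Bronze-zone landing path: dates stay in the lower folder
    levels and the file name carries a timestamp, per the conventions above."""
    date_part = ingested_at.strftime("%Y/%m/%d")
    time_part = ingested_at.strftime("%H%M%S")  # colon-free timestamp
    file_name = f"{source_system}_{load_type}_{time_part}.{fmt}"
    return (f"medallion/BRONZE/{subject_area}/{source_system}/Landing/"
            f"{date_part}/{file_name}")

path = bronze_landing_path("CUSTOMER", "ABC_ELECTRONIC",
                           datetime(2022, 5, 6, 12, 45, 34))
# -> 'medallion/BRONZE/CUSTOMER/ABC_ELECTRONIC/Landing/2022/05/06/ABC_ELECTRONIC_Full_124534.parquet'
```

Centralizing the convention in one function keeps ingestion jobs from drifting into inconsistent folder layouts over time.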

Silver Zone

The Silver Zone is the location where cleansed datasets are stored. A typical Silver Zone design structure might look like the figure below. In this scenario, if the data is stored in Delta format, it will bring ACID-compliance features such as time travel, transaction logging, merging capabilities, and more. For that reason, the date-level folders would not be necessary: any required inserts, updates, and deletes would simply merge into the existing datasets.

Here is an example of how this structure would appear after it is implemented.

Example:

 Adventureworks–Dev>rg-lakedata-dev-eu2>stglakedatadeu2\medallion\SILVER\CUSTOMER\ABC_ELECTRONIC\FileName001.delta
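Delta's merge capability is what removes the need for date folders here: in Spark, changes are applied with a `MERGE INTO` against the Delta table. The pure-Python sketch below only illustrates the upsert semantics of that operation (the table shape and key column are hypothetical, not Spark code):

```python
def merge_into(target: dict, updates: list[dict], key: str = "customer_id") -> dict:
    """Upsert semantics comparable to Delta's MERGE: matched rows are
    updated, unmatched rows are inserted. A delete branch would mirror
    Delta's whenMatchedDelete, keyed on a change-flag column."""
    merged = dict(target)  # leave the original 'table' untouched
    for row in updates:
        merged[row[key]] = row  # update if the key exists, insert otherwise
    return merged

silver = {1: {"customer_id": 1, "name": "Ada"}}
changes = [{"customer_id": 1, "name": "Ada L."},
           {"customer_id": 2, "name": "Grace"}]
silver = merge_into(silver, changes)
# silver now holds the updated row for key 1 and the new row for key 2
```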
 

Gold Zone

The Gold Zone is the location where curated datasets are stored. Like the Silver Zone, the Gold Zone also stores data as Delta format. A typical Gold Zone design structure might look like the figure below. Finalized consumption-ready data, such as data products that are aggregated or further modeled into dimensions and facts, would typically reside in this zone. The data can then be queried through the federated querying capabilities of workspaces such as Databricks and Synapse. Furthermore, this Delta formatted data can be connected and consumed by modern cloud-based reporting tools like Power BI.

Here is an example of how this structure would appear after it is implemented.

Example:

 Adventureworks–Dev>rg-lakedata-dev-eu2>stglakedatadeu2\medallion\GOLD\CUSTOMER\ABC_ELECTRONIC\FileName001.delta
 

Masked Zone

In certain scenarios, organizations may need to mask, encrypt, and protect Personally Identifiable Information (PII) and then provide these masked datasets to either internal stakeholders or external vendors for analysis or further processing. In this case, a Masked Zone would need to be created. The structure of a Masked Zone may look like the figure shown below. In this scenario, the Masked Zone would only contain masked data to be shared with external vendors. Furthermore, fine-grained access controls can be implemented to granularly control which folders are read-only and which allow write operations as well. For even more fine-grained access control, each unique project and vendor can have its own container, to which custom security policies would apply.

Here is an example of how this structure would appear after it is implemented.

Example:


Adventureworks–Dev>rg-lakedata-dev-eu2>stglakedatadeu2\masked-vendorA\ToVendor-VendorA-CustomerLifeTimeValue\CustomerLifeTimeValue001.parquet
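One common way to produce the masked copies that land in this zone is deterministic, salted hashing, which hides raw values while still letting consumers join on the masked key. The sketch below is illustrative only (the column list and salt are placeholders); a production deployment would typically rely on a managed key or tokenization service rather than a hard-coded salt.

```python
import hashlib

PII_COLUMNS = {"email", "ssn"}  # placeholder; driven by your data classification

def mask_record(record: dict, salt: str = "per-environment-secret") -> dict:
    """Replace PII values with a salted SHA-256 digest. Hashing is
    deterministic, so masked datasets can still be joined on the masked
    key without exposing the raw value."""
    masked = {}
    for column, value in record.items():
        if column in PII_COLUMNS:
            digest = hashlib.sha256(f"{salt}:{value}".encode()).hexdigest()
            masked[column] = digest[:16]  # truncated for readability
        else:
            masked[column] = value
    return masked

row = {"customer_id": 42, "email": "ada@example.com", "spend": 129.95}
masked = mask_record(row)  # 'email' is now an opaque 16-character token
```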

Sensitive Zone

Like the Masked Zone, which stores masked data, a Sensitive Zone serves the purpose of storing unmasked and sensitive data, which may need to be analyzed and processed by Data Scientists and Machine Learning models. This could include customer ages, demographics, income, and other PII data that could be used to determine customer lifetime value, churn, or other advanced analytics use cases. This data may have varying levels of sensitivity that could further be split out by red, yellow, and green tag indicators. These color-coded folders could have their own granular access and security controls applied. The structure of a Sensitive Zone may look like the figure shown below.

Here is an example of how this structure would appear after it is implemented.

Example:

 Adventureworks–Dev>rg-lakedata-dev-eu2>stglakedatadeu2\medallion\RED\CustomerLifeTimeValue001.parquet
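Routing a dataset into its color-coded sensitivity folder can be enforced in ingestion code so that a mis-tagged dataset fails fast instead of landing in the wrong tier. A minimal sketch, with assumed folder and container names:

```python
SENSITIVITY_FOLDERS = {"red": "RED", "yellow": "YELLOW", "green": "GREEN"}

def sensitive_path(tag: str, file_name: str, container: str = "medallion") -> str:
    """Route a file into its color-coded sensitivity folder; unknown
    tags raise rather than silently landing data in the wrong tier."""
    try:
        folder = SENSITIVITY_FOLDERS[tag.lower()]
    except KeyError:
        raise ValueError(f"unknown sensitivity tag: {tag!r}")
    return f"{container}/{folder}/{file_name}"

p = sensitive_path("red", "CustomerLifeTimeValue001.parquet")
# -> 'medallion/RED/CustomerLifeTimeValue001.parquet'
```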
 

Archive Zone

An Archive Zone could serve the purpose of cold storage for unused historical datasets. Organizations may choose to retain historical data and store it in a different storage account, since this data can be quite voluminous. The structure would follow the same pattern as the original hot data and could be split out by program at the container level and by zone at the root hierarchical level. Since this is a separate storage account, archive-centric security groups could be created and assigned RBAC roles on the account. This way, only members of the Archive AD groups would be able to access the data in this Archive Zone. In addition, granular ACLs could also be applied when necessary. The structure of an Archive Zone may look like the figure shown below.

Here is an example of how this structure would appear after it is implemented.

Example:


Adventureworks–Dev>rg-lakedata-dev-eu2>stglakearchivedatadeu2\CLV\GOLD\CUSTOMER_LIFETIME_VALUE\CustomerLifeTimeValue001.delta

Data Lake Security

Security and access management within a Lakehouse is like that of traditional Relational Database Management Systems (RDBMS) in that user and group permissions can be managed via Active Directory groups. Azure Storage has a management layer that accounts for subscription and storage account access, while containers, folders, and files are accessed through the data layer. Service API endpoints are available for working with both the management and data layers. The following sections detail the capabilities of the management and data layers within the Data Lake. Data Lake authorization mechanisms can include Shared Key authorization, Shared Access Signature (SAS) authorization, Role-Based Access Control (RBAC), and Access Control Lists (ACL).

Shared Key and SAS authorization grant access without requiring users to have an identity in Azure Active Directory (Azure AD). RBAC and ACLs, on the other hand, require users to have an identity in Azure AD. RBAC grants broad access to storage account data, such as read or write access to all data in a storage account, while ACLs grant granular access, such as write access to a specific directory or file. It is important to note that with Shared Key and SAS authorization, Azure RBAC and ACLs have no effect. It is also important to note that when both RBAC and ACLs are applied, RBAC is evaluated first and takes priority over any ACL assignments: the ACL will not be evaluated if the operation is fully authorized based on RBAC. Shared Key and SAS authorizations are exceptions in that they are evaluated independently. Also, users holding the Storage Blob Data Owner built-in role are an exception, since they have super-user access. The following sections focus on RBAC and ACL.
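This evaluation order can be summarized in a small model. The function below is a simplified illustration of that logic, not an Azure implementation; the role names match the built-in roles, but the request shape and permission sets are hypothetical:

```python
def is_authorized(request: dict) -> bool:
    """Model the evaluation order: Shared Key / SAS bypass identity checks
    entirely; Storage Blob Data Owner is a super-user; otherwise RBAC is
    evaluated first, and ACLs are only consulted when RBAC does not
    fully authorize the operation."""
    if request.get("shared_key") or request.get("sas"):
        return True  # RBAC and ACLs have no effect here
    roles = request.get("rbac_roles", set())
    if "Storage Blob Data Owner" in roles:
        return True  # super-user access
    needed = request["operation"]  # e.g. "read" or "write"
    rbac_grants = {"Storage Blob Data Contributor": {"read", "write"},
                   "Storage Blob Data Reader": {"read"}}
    if any(needed in rbac_grants.get(role, set()) for role in roles):
        return True  # fully authorized by RBAC; ACL not evaluated
    # Fall through to the granular directory/file ACL
    return needed in request.get("acl_permissions", set())
```

For example, a Storage Blob Data Reader cannot write, but a directory-level ACL granting write on a specific folder can still authorize the operation.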

Role Based Access Controls

Role-Based Access Control (RBAC) can contain permissions for management- or data-layer access. The figure below lists the typical built-in roles available for storage accounts and briefly describes what permissions each role grants. For example, the Owner, Contributor, Storage Account Contributor, and Reader roles are management-level RBAC roles that grant specific access to the storage account but do not grant access to the underlying data within the account. In addition to these management-plane roles, a user needing access to the data will need a data-plane role assigned, such as Storage Blob Data Owner, Storage Blob Data Contributor, or Storage Blob Data Reader.

Access Control Lists

Access Control Lists (ACL) support granular-level access to directories and files. The following figure shows the available ACL permissions and how they correspond to accessing directories and files within the Data Lake. ACL may be a good authorization candidate for granting granular access to files and folders. For example, when granting vendors access to data within folders of the same container, a ToVendor folder could have read-only permissions and a FromVendor folder could have read-write permissions, both managed by ACL.
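The ToVendor/FromVendor pattern maps naturally onto the POSIX-style `rwx` entries that ADLS Gen2 ACLs use, where a directory needs `x` to be traversed and `r` to be listed. The snippet below is a toy model of those permissions (group and folder names are illustrative), not a call to the Azure SDK:

```python
# Illustrative ACL entries for a vendor exchange container.
VENDOR_ACLS = {
    "ToVendor":   {"vendor-a-group": "r-x"},  # read-only: list and download
    "FromVendor": {"vendor-a-group": "rwx"},  # drop zone: vendor may write
}

def can(group: str, folder: str, action: str) -> bool:
    """Check whether a group's ACL entry on a folder permits an action
    ('r', 'w', or 'x')."""
    return action in VENDOR_ACLS.get(folder, {}).get(group, "")
```

In practice, these entries would be set as default ACLs on the vendor folders so that newly created files inherit them.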

Active Directory Groups and Permissions

As you start to tie together the AD groups, RBAC (management and data layers), and ACL authorizations to devise an outline of your enterprise's AD groups and permissions, it may look like the figure below. This illustration clearly lists your organization's AD group naming convention, crisp descriptions of the AD groups, sample AD groups, and the RBAC- and ACL-level access that will be granted to each group. The membership approvers column begins to capture the automated approval workflows for granting members access to the groups.

Enterprise Data Lake Structure and Security

As we continue to tie folder structures, RBAC, and ACL together, the illustration shown in the figure below will help depict which AD Groups have either RBAC or ACL assigned at the resource and directory level.

Workflows

After a thoroughly designed and secured Data Lake is in place, with corresponding AD groups holding the relevant access, an automated approval workflow that grants membership to the AD groups, and thereby to the accompanying granular data, will save administrators the time of manually managing security and permissions within the lake. For example, in the figure below, a user creates a ticket via a Solution Portal with a membership request for the desired AD group and directs the request to the pre-defined administrating team. Membership in the AD group Az-lakedata-Dev, for instance, will grant access to the Data Lake storage account via RBAC.

When working with your workflow automation team, it would be useful to provide the workflow details, as shown in the figure below, to support them as they build out the required automation and UI experience for users submitting their access requests.

Summary

Designing and managing a Data Lake is a highly customized undertaking that requires deep planning and collaboration between technology, security, governance, infrastructure, and business stakeholders. Since there is so much opportunity for design and management customization, the Data Lake can support different levels of governance and control over organizational data. The Azure ecosystem contains numerous resources that all integrate with Active Directory (AD) and are constantly finding ways to integrate more deeply with it. For example, Databricks recently released the capability to provision users and groups using SCIM directly from AD within the Azure portal. The Azure Data Lake uses AD for its RBAC authorization, meaning that membership in a specific AD group can grant access to multiple Azure resources. With ACLs, security can be fine-grained to secure data access within the internal organization and across external vendors. A good data lake design accounts for zones, the hierarchical folder structures within those zones, and the security authorizations granted to the data within them. Once the secured Data Lake has been designed and implemented, membership to AD resources governed by RBAC can be granted through automated workflows containing approval gates. These are the key recommended steps for building a secure, performant, well-governed Modern Data Lakehouse Platform.

About the author
Ron L'Esteve is a seasoned Data Architect who holds an MBA and MSF. Ron has over 15 years of consulting experience with Microsoft Business Intelligence, data engineering, emerging cloud and big data technologies.
