How to build a Data Lakehouse

Initial considerations

Today, let's talk about the design of a data lakehouse - the key areas and the tools you need to consider when approaching the concept. Strap in and get ready for a ride, as today's subject is quite extensive, and I'll do my best to guide you through it painlessly.

Medallion architecture

Before you start loading any data, the first thing you’ll need to design is the storage layer. 

The medallion architecture is a multi-layered approach to building a data lakehouse. It is the current market standard, introduced by Databricks, and its name refers to the colors of medals - bronze, silver and gold. Each color indicates a separate logical layer, and the names also denote the quality of the data stored in them. This layering lets us define roles, responsibilities and access control in a granular way.

Source: https://www.databricks.com/glossary/medallion-architecture
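
To make the layering concrete, one common way to organize the lake is a separate container (or top-level folder) per layer in ADLS Gen2 - the storage account and dataset names below are placeholders:

    abfss://bronze@<storageaccount>.dfs.core.windows.net/sales/orders/2023/03/15/   (raw extracts, original formats)
    abfss://silver@<storageaccount>.dfs.core.windows.net/sales/orders/              (cleansed Delta table)
    abfss://gold@<storageaccount>.dfs.core.windows.net/sales/sales_daily/           (aggregated Delta table)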

Bronze - raw layer

Source data is extracted and stored in the Bronze layer in its raw state. The files keep their original form and are not cleaned or transformed in any way. This zone will contain data in various formats - tabular data in Parquet, CSV or ORC, semi-structured data in XML or JSON, and even unstructured data such as JPEGs or PDFs.

The ingestion process is incremental, which keeps it efficient and performant. The Bronze layer reflects the full, unprocessed history of each dataset and gives us the ability to recreate the state of a given source system at any point in time, with accuracy equal to the frequency of ingestion. This characteristic allows the upper layers to be reprocessed without reading the data from the source systems again.
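
As a minimal illustration of the idea - and certainly not the only way to implement it - the snippet below copies a day's extract from a hypothetical landing zone into a date-stamped Bronze folder from a Databricks notebook, leaving the files untouched:

    from datetime import date

    # Hypothetical paths - adjust to your storage account and source system
    landing_path = "abfss://landing@<storageaccount>.dfs.core.windows.net/crm/orders/"
    bronze_path = f"abfss://bronze@<storageaccount>.dfs.core.windows.net/crm/orders/{date.today():%Y/%m/%d}/"

    # Copy the raw files as-is; Bronze accumulates the full, unprocessed history run by run
    dbutils.fs.cp(landing_path, bronze_path, recurse=True)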

Silver - curated layer

Next comes the Silver layer, usually described as a curated or refined zone. The goal here is to validate, cleanse, standardize and enrich the ingested data. Deduplication also happens at this point if you are combining datasets coming from different sources.

From a data modeling perspective, the Silver layer can be structured similarly to models in 3rd Normal Form. The focus should be on creating write-performant models, similar to the Data Vault approach.

When it comes to storage, the Parquet file format, while very efficient, is immutable. You can add files and partitions, but you can't edit the data in place, so merging incremental changes requires rewriting whole files. This is where the Delta format comes into play, as it lets us apply update, delete and merge operations. It also brings ACID transactions to the data lake, just like the ones we know from database environments.
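
As a minimal sketch of what this enables, the snippet below merges an incremental batch into a Silver Delta table using the Delta Lake Python API - the paths, key column and DataFrame are assumptions:

    from delta.tables import DeltaTable

    silver_path = "abfss://silver@<storageaccount>.dfs.core.windows.net/crm/customers"
    updates_df = spark.read.parquet(
        "abfss://bronze@<storageaccount>.dfs.core.windows.net/crm/customers/2023/03/15/"
    )

    # Upsert the incremental batch into the Silver table - something plain Parquet cannot do in place
    target = DeltaTable.forPath(spark, silver_path)
    (target.alias("t")
           .merge(updates_df.alias("s"), "t.customer_id = s.customer_id")
           .whenMatchedUpdateAll()
           .whenNotMatchedInsertAll()
           .execute())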

The Silver layer should be seen as a reliable foundation for analyses and for building business-oriented models in the Gold layer.

Gold - final layer

Eventually we arrive at the final Gold layer, where data is highly refined and aggregated - the equivalent of a typical data warehouse or data mart. Entities here are built around specific business requirements, optimized for read performance and often taking data sensitivity into account through anonymization or masking.
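
Below is a hedged sketch of such a Gold model - a daily sales aggregate built from a hypothetical Silver orders table, with the customer e-mail masked before it reaches the Gold layer:

    from pyspark.sql import functions as F

    orders = spark.read.format("delta").load(
        "abfss://silver@<storageaccount>.dfs.core.windows.net/sales/orders"
    )

    sales_daily = (
        orders
        # Mask PII before the data is exposed downstream
        .withColumn("customer_email", F.sha2(F.col("customer_email"), 256))
        .groupBy("order_date", "country")
        .agg(
            F.sum("amount").alias("total_amount"),
            F.countDistinct("customer_email").alias("distinct_customers"),
        )
    )

    (sales_daily.write.format("delta")
        .mode("overwrite")
        .save("abfss://gold@<storageaccount>.dfs.core.windows.net/sales/sales_daily"))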

Just like in the Silver layer, data is stored in the Delta format. Apart from the features already mentioned, it provides Time Travel functionality, which automatically versions the data and keeps its full history. Thanks to it, auditing and rollbacks are available out of the box.
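
For example, with a hypothetical Gold table path, time travel, auditing and rollback look like this (restoring a version requires a reasonably recent Delta Lake release):

    from delta.tables import DeltaTable

    gold_path = "abfss://gold@<storageaccount>.dfs.core.windows.net/sales/sales_daily"

    # Read the table as it looked at an earlier version or point in time
    v3 = spark.read.format("delta").option("versionAsOf", 3).load(gold_path)
    last_month = spark.read.format("delta").option("timestampAsOf", "2023-02-15").load(gold_path)

    # Audit: inspect the full change history of the table
    DeltaTable.forPath(spark, gold_path).history().show(truncate=False)

    # Rollback: restore the table to a previous version
    DeltaTable.forPath(spark, gold_path).restoreToVersion(3)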

Users should be able to retrieve data with low latency as most aggregations and transformations should already be handled at this point. Data exposed to users, reporting and analysis tools would rarely be stored outside of the gold layer. 

Ingestion and processing

There are two main tools we will consider for ingesting and processing data through the layers - pipelines available in Azure Synapse Analytics and Azure Databricks notebooks.

Another option would be Azure Data Factory, which shares most of its functionality with Synapse Pipelines, but since you are building a lakehouse you will probably make use of Synapse anyway, for instance because of its serverless capabilities. The differences between the two services are well documented by Microsoft.

Ingestion to bronze layer

Depending on your company's requirements, you have a couple of simple solutions to consider at this stage. The Copy Data tool will be an efficient choice to ingest and store the files in their original format, especially when used with its binary copy option.

The next consideration is how to narrow down the number of files to go through:

  • If the source files have been exposed directly to you and no one else in the organization is using them, you can consider automatically deleting the source files after the copy task is completed.
  • Another scenario would be to utilize a proper folder structure and/or naming convention that follows the “yyyy/mm/dd/” pattern.
  • A third option might be filtering on the “LastModifiedDate” file property, which lets you distinguish old files from the ones that have changed since your previous load (a sketch follows below).
  • If none of the above is an option, you can perform the copy operation only for selected files by explicitly stating their path and filename.

The Copy Data tool provides all of those options, and you can configure them according to the relevant scenario.
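
If you ever implement the “LastModifiedDate” idea yourself instead of relying on the Copy Data tool, a rough sketch could look like the one below. It assumes a Databricks runtime recent enough for dbutils.fs.ls to expose a modificationTime field, and the watermark value and paths are placeholders:

    # Hypothetical watermark from the previous run (epoch milliseconds)
    last_load_ts = 1678838400000

    source_path = "abfss://landing@<storageaccount>.dfs.core.windows.net/crm/orders/"
    bronze_path = "abfss://bronze@<storageaccount>.dfs.core.windows.net/crm/orders/2023/03/15/"

    # Copy only the files that changed since the previous load
    for f in dbutils.fs.ls(source_path):
        if f.modificationTime > last_load_ts:
            dbutils.fs.cp(f.path, bronze_path + f.name)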

When it comes to OLTP sources, be smart and build a generic, preferably metadata-driven, solution. Such an approach lets you define which tables/objects in the source, and which of their attributes, should be ingested into the Bronze layer. If the requirement is to ingest all of them, it makes things even simpler. Define a metadata task that retrieves schema and object names and feeds them as inputs to the subsequent data flow. Make sure to store the output as Parquet files, which allows efficient storage thanks to columnar compression.
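
A minimal sketch of the metadata-driven idea, assuming a hypothetical list of source objects, an Azure SQL source reachable over JDBC and credentials kept in a secret scope, could look like this:

    from datetime import date

    # Hypothetical metadata - in practice this would come from a config table or file
    source_objects = [
        {"schema": "sales", "table": "orders"},
        {"schema": "sales", "table": "customers"},
    ]

    jdbc_url = "jdbc:sqlserver://<server>.database.windows.net:1433;database=<sourcedb>"
    load_date = f"{date.today():%Y/%m/%d}"

    for obj in source_objects:
        df = (spark.read.format("jdbc")
              .option("url", jdbc_url)
              .option("dbtable", f"{obj['schema']}.{obj['table']}")
              .option("user", dbutils.secrets.get("kv", "sql-user"))
              .option("password", dbutils.secrets.get("kv", "sql-password"))
              .load())

        # Land each object as Parquet in a date-stamped Bronze folder
        (df.write.mode("overwrite")
           .parquet(f"abfss://bronze@<storageaccount>.dfs.core.windows.net/"
                    f"{obj['schema']}/{obj['table']}/{load_date}/"))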

Processing to silver and gold layers

At this stage there are multiple choices to consider based on the requirements and use cases in your organization.

Synapse Pipelines and their Mapping Data Flows allow us to develop data transformations with little to no code. They are compatible with the Delta format because, behind the scenes, they are translated into Spark jobs executed on scaled-out clusters.

Databricks notebooks are a go-to choice here, thanks to their versatility and support for multiple programming languages such as Python, SQL, Scala and R. Read the data into data frames and consider creating temporary views based on them. Temporary views give you the ability to reference the data in separate cells while using different languages. For example, you can write the transformations in SQL as you traditionally would and save the outcome in the Delta format.
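
To illustrate the pattern - with hypothetical paths and columns - the cell below registers a temporary view in Python and hands the actual transformation over to SQL, saving the result in Delta:

    # Read Bronze data into a DataFrame and expose it to SQL as a temporary view
    orders_raw = spark.read.parquet(
        "abfss://bronze@<storageaccount>.dfs.core.windows.net/sales/orders/2023/03/15/"
    )
    orders_raw.createOrReplaceTempView("orders_raw")

    # Write the transformation in plain SQL and persist the outcome as a Delta table
    orders_clean = spark.sql("""
        SELECT DISTINCT
            CAST(order_id   AS INT)             AS order_id,
            CAST(order_date AS DATE)            AS order_date,
            UPPER(TRIM(country))                AS country,
            CAST(amount     AS DECIMAL(18, 2))  AS amount
        FROM orders_raw
        WHERE order_id IS NOT NULL
    """)

    (orders_clean.write.format("delta")
        .mode("overwrite")
        .save("abfss://silver@<storageaccount>.dfs.core.windows.net/sales/orders"))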

No matter what background your data engineers have, they should easily find their way around Databricks. Developers can mix and match programming languages as they go, based on their preference and the task at hand.

Going back to the choice of pipelines, you should abstain from developing transformations through the graphical interface. Most experienced data engineers will tell you that such solutions are slower to develop and much harder to maintain.

Think about Pipelines as your orchestration tool and perform your transformations using code. And don’t go overboard with graphical orchestration either. Instead go for DAGs, described by our team here.

Also, build your solutions in a generic way and make use of the ability to parameterize both the data flows and the notebooks. Design them to be idempotent, meaning each rerun produces exactly the same results without any manual action being required beforehand.
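
A sketch of what that can look like in a Databricks notebook is shown below - the parameter is passed in by the orchestrating pipeline, and a rerun overwrites only the slice it is responsible for. It assumes the Silver table is partitioned by order_date (or a Delta version that allows replaceWhere on arbitrary columns):

    # Parameter supplied by the orchestration pipeline (hypothetical name and default)
    dbutils.widgets.text("load_date", "2023-03-15")
    load_date = dbutils.widgets.get("load_date")

    bronze_path = ("abfss://bronze@<storageaccount>.dfs.core.windows.net/sales/orders/"
                   + load_date.replace("-", "/"))
    silver_path = "abfss://silver@<storageaccount>.dfs.core.windows.net/sales/orders"

    df = spark.read.parquet(bronze_path)

    # Idempotent write: rerunning for the same load_date replaces exactly the same
    # slice of the table and nothing else
    (df.write.format("delta")
       .mode("overwrite")
       .option("replaceWhere", f"order_date = '{load_date}'")
       .save(silver_path))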

Serving layer

Our data has finally reached the final Gold layer, and it is time to expose it to the end users. Typically there are two ways to go about it in Synapse - serverless or dedicated SQL pools.

Serverless pools give you the possibility to create external tables on top of the Gold layer in the data lake. Further transformation is also an option, using T-SQL and another layer of views. This way you are building a logical data warehouse without actually storing the data in a database, and for the end user this approach is completely transparent.
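
A minimal sketch of that logical layer is shown below - it defines a view over OPENROWSET rather than an external table, and it uses pyodbc purely as a convenient way to send the T-SQL to a hypothetical serverless endpoint; all names and credentials are placeholders:

    import pyodbc

    # Connect to the Synapse serverless SQL endpoint
    conn = pyodbc.connect(
        "Driver={ODBC Driver 18 for SQL Server};"
        "Server=<workspace>-ondemand.sql.azuresynapse.net;"
        "Database=lakehouse;UID=<user>;PWD=<password>",
        autocommit=True,
    )

    # Expose a Gold Delta table as a view - no data is copied into a database
    conn.cursor().execute("""
        CREATE VIEW gold.sales_daily AS
        SELECT *
        FROM OPENROWSET(
            BULK 'https://<storageaccount>.dfs.core.windows.net/gold/sales/sales_daily/',
            FORMAT = 'DELTA'
        ) AS rows;
    """)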

The second option is to use T-SQL to ingest the data from storage into tables in a dedicated SQL pool. From a technical perspective there are a couple of ways to do it, but PolyBase (or the newer COPY statement) is generally the fastest and most scalable option. This is the common approach when your data has a complex structure and requires complicated SQL queries.
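
And a rough sketch of loading into a dedicated pool - here using the COPY statement, a simpler alternative to a full PolyBase external-table setup, and assuming the Gold entities are also exported as plain Parquet, since dedicated pool loads read Parquet/CSV/ORC rather than Delta tables directly; all names are placeholders:

    import pyodbc

    # Connect to the dedicated SQL pool
    conn = pyodbc.connect(
        "Driver={ODBC Driver 18 for SQL Server};"
        "Server=<workspace>.sql.azuresynapse.net;"
        "Database=<dedicated_pool>;UID=<user>;PWD=<password>",
        autocommit=True,
    )
    cursor = conn.cursor()

    # Distributed target table in the dedicated pool
    cursor.execute("""
        CREATE TABLE dbo.sales_daily (
            order_date DATE, country VARCHAR(10), total_amount DECIMAL(18, 2)
        )
        WITH (DISTRIBUTION = HASH(country), CLUSTERED COLUMNSTORE INDEX);
    """)

    # Bulk load the Parquet files exported from the Gold layer
    cursor.execute("""
        COPY INTO dbo.sales_daily
        FROM 'https://<storageaccount>.dfs.core.windows.net/gold/sales/sales_daily_parquet/*.parquet'
        WITH (FILE_TYPE = 'PARQUET');
    """)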

At this point, everything is ready to expose the data for access by BI tools, analysts, scientists and business users to make data-driven decisions.

Conclusions

The integration of Azure and Databricks is seamless and allows you to design a data lakehouse in a relatively simple manner. Of course, the presented approach should be treated as a baseline and adapted to your organization's needs and practical scenarios.

When it comes to the development process, keep the following keywords in mind - idempotent, generic, code-based. Stay away from GUI-based development, as it leads to data platforms that are inefficient and hard to maintain. Invest in professionals who are up to date with current trends and adjust fluently to change, which nowadays is rapid in cloud environments regardless of the vendor.

And don't forget about Data Governance! This article focuses on the high-level architecture and the ingestion process, but that doesn't mean your Lakehouse will function properly without it.

Lakehouses are a book-sized topic that I sincerely encourage you to dive into. Hopefully I was able to shed some light on the subject!

Thanks for reading!

Hope you enjoy it and if you'd like to talk more about it, please reach out to me via email: piotrek@faro.team


Published on March 15, 2023
