What is a Data Lakehouse?

A data lakehouse is a new data architecture that combines the structure and data modeling of data warehouses with the low-cost storage of data lakes. Merging the two into one system gives all users access to a single source of high-quality data, regardless of their field of work: reporting, analytics or machine learning. And while the lakehouse was long just a theoretical concept, recent advances in data engineering have enabled enterprises to bring it to reality.

Why is it needed?

The most general answer would be that the need comes from the disadvantages of its predecessors.

Data warehouses have been supporting business decisions since the 1980s by providing highly refined and transformed data models. Trust in data and in data-driven decisions was built on the quality that data warehouses are able to provide. That quality comes at a price, though: the substantial initial labor needed to set them up, their overall financial cost and their lack of flexibility.

Those drawbacks gained importance as companies continued to evolve. Enterprises began to generate large amounts of data in various formats, much of it semi-structured or unstructured. Unfortunately, warehouses were not designed to ingest or work with such data, which made it impossible to use it in a reliable way.

This problem sparked the need for data lakes, which promised to make these new sources of information available and usable. The lack of constraints on data structure and the ability to handle vast volumes of data were their major advantages, along with another key benefit: the ability to store ingested files at low cost.

Data lakes succeeded in providing data for analytical purposes but lacked features known from data warehousing. The lack of isolation meant that simultaneous reads and writes were not possible; data quality was not guaranteed, and neither was referential integrity between datasets. As a result, data lakes failed to deliver the value promised by access to such an extensive amount of information.

Companies still required a system able to gather data from different sources, in both structured and unstructured formats; a system that ensures quality and usability and delivers data quickly, in line with frequently changing needs.

How does the lakehouse solve those problems?

Arguably, the biggest milestone was the introduction of the Delta Lake table format in 2019, which placed a structured transactional layer on top of data files. This enabled the lakehouse to move from a theoretical concept to a usable, reliable architecture. To keep things simple, let's analyze it point by point.

Low cost storage

This characteristic has been inherited directly from data lakes. Cloud storage is cheap and can be treated as practically unlimited. Data is stored as files in efficient open-source formats such as Parquet.

Metadata layer

The Delta Lake format stores metadata about the contents of each table and introduces versioning by default. This allows a serving layer that lets users query data directly from the files, without copying it into relational tables. It also introduces time travel, which allows rolling back either to a previous version of a table or to a specific point in time.
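The versioning idea can be illustrated with a toy sketch in plain Python. This is not the real Delta Lake implementation (which keeps a transaction log of JSON files alongside Parquet data); it is just a minimal, hypothetical model of the core idea: every write commits a new version, and old versions remain readable.

```python
import json

class ToyVersionedTable:
    """A minimal, purely illustrative sketch of a Delta-style
    versioned table: each write appends a new snapshot to a log
    instead of overwriting data, so every earlier version stays
    queryable ("time travel")."""

    def __init__(self):
        self._log = []  # ordered list of committed snapshots

    def write(self, rows):
        # Commit the given rows as the next version of the table.
        self._log.append(json.dumps(rows))

    def read(self, version=None):
        # Default: latest version; pass `version` to time-travel back.
        if version is None:
            version = len(self._log) - 1
        return json.loads(self._log[version])

table = ToyVersionedTable()
table.write([{"id": 1, "name": "Ada"}])                        # version 0
table.write([{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}])  # version 1

latest = table.read()            # current state: two rows
original = table.read(version=0)  # time travel: one row
print(latest, original)
```

In real Delta Lake the same effect is exposed through queries such as reading a table "as of" a version or timestamp; the toy above only shows why keeping the log makes that possible.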

Transaction support

This characteristic has been inherited directly from data warehouses, and its absence is one of the main disadvantages of data lakes. The lakehouse supports ACID (Atomicity, Consistency, Isolation, Durability) transactions, which ensure consistency and allow concurrent reads and writes.
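One building block behind such guarantees is the atomic commit: readers must see either the old state or the new state, never a half-written one. A simplified, file-based sketch of that idea (not Delta Lake's actual commit protocol) is the write-then-rename pattern:

```python
import json
import os
import tempfile

def atomic_write(path, rows):
    """Write `rows` to `path` so that a concurrent reader never
    observes a partially written file. The data goes to a temporary
    file first; os.replace() then swaps it in as a single atomic
    operation. This is a toy stand-in for how table formats commit
    transactions: old version or new version, never a mix."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(rows, f)
    os.replace(tmp, path)  # atomic rename on POSIX and Windows

# Any reader of "table.json" always sees a complete snapshot.
atomic_write("table.json", [{"id": 1}, {"id": 2}])
with open("table.json") as f:
    print(json.load(f))
```

Real lakehouse formats layer optimistic concurrency control and a transaction log on top of this kind of primitive, but the core promise to readers is the same.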

Decoupled storage and compute

Storage and compute are decoupled, which simply means that they run on separate clusters. This makes it possible to scale each one independently and at different rates.

Real-time data

Streaming functionality is available by default, so ingesting and querying real-time data is possible. The output can also be served to the end user as a table.
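Conceptually, streaming ingestion boils down to processing data incrementally: each micro-batch picks up where the previous one stopped, tracked by an offset. The sketch below is a deliberately tiny, hypothetical illustration of that offset-tracking idea, not an actual streaming engine:

```python
def follow(records, last_seen=0):
    """Yield only the records appended since `last_seen`.
    A toy model of incremental (micro-batch) ingestion: the caller
    remembers the new offset after each batch, so no record is
    processed twice and none is skipped."""
    for i in range(last_seen, len(records)):
        yield records[i]

log = ["event-1", "event-2"]
batch1 = list(follow(log))          # first batch reads everything so far
offset = len(log)                   # remember where we stopped

log.append("event-3")               # new data arrives
batch2 = list(follow(log, offset))  # next batch reads only the new event
print(batch1, batch2)
```

Streaming engines in real lakehouse platforms apply the same principle at scale, with checkpointed offsets and fault tolerance handled for you.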

[Diagram: data lakehouse architecture. Source: https://www.databricks.com/wp-content/uploads/2020/01/data-lakehouse-new.png]

How to build it?

Great that you are interested! The architecture and design of a data lakehouse built with Azure technologies will be the topic of upcoming articles from Faro.

Make sure to follow us on LinkedIn so you don't miss any posts!

Thanks for reading!

Hope you enjoyed it, and if you'd like to talk more about it, please reach out to me via email: piotrek@faro.team

Resources:

  • https://www.databricks.com/blog/2020/01/30/what-is-a-data-lakehouse.html
  • https://learn.microsoft.com/en-gb/azure/databricks/lakehouse/
  • https://sesamesoftware.com/wp-content/uploads/2022/07/LinkedIn_Post_Lakehouse_Final.png
Published on January 18, 2023

Let’s talk!

tomek@faro.team