Looking at the emergence of data lakehouses let’s take a step back and understand the key differences between the concepts it has emerged from - data warehousing and data lakes.
Data Warehouse
Data warehouse is a centralized system that stores both current and historical data from various sources which are often based on different technology stacks. It is purpose-built and designed to enable business intelligence and analytics. This in turn allows companies to make more informed decisions.
Traditionally, data ingestion is realized in batches by ETL (Extract, Transform, Load) process. Such an approach makes it possible to improve data quality by cleanup and enrichment, consolidation of several tables into one or restructuring the schema to simplify relationships between objects. This results in a structured data model with high consistency between related objects.
Processing data before storing is highly beneficial but it does introduce some restrictions. Data warehousing requires initial time investment to standardize terms and formats across the company and properly model business concepts. Implementing changes to existing systems might not be easy either. Processing at this stage also constrains the ability to ingest streaming data, although using small, incremental batches might be possible to allow near-real time ingestion.
Data orchestration is another consideration in this approach as you need to decide how and when to copy data from the source transactional system to the data warehouse, so it doesn’t impact any of the platforms and takes all object dependencies into account. Fortunately DataOps practices come with proven and reliable solutions in that subject as well.
Data warehouses make it possible to access data in a business friendly manner with little or no preparation to be done by the user. They act as a single source of truth, which builds trust in insights derived from the data and assures company-wide consistency. On top of that, separation of analytical processing (DWH) from transactional databases (sources), improves the performance of both systems as well.
Data Lake
Similar to data warehouse, data lake is a centralized system that stores data from various sources. Unlike a data warehouse, data arrives in raw format and is stored without being processed. Another difference is the lack of limitation when it comes to data types, as data lake can store structured, semistructured, unstructured and any unconventional data.
The key benefits of this solution are high scalability and the possibility of cost effective storage of massive volumes of data. Refraining from processing the data and keeping it in a raw state has several implications. First of all, it means that data is available for use far faster. It also implies that nothing is thrown away and a wider range of data can be analyzed to gain insights that were previously unavailable.
Just like data warehouses, data lakes come with their own challenges. Lack of a schema or descriptive metadata makes the data hard to consume or query by inexperienced users and thus data lakes are mostly used by people possessing a technical skill set. No enforced consistency across the data can cause difficulties or errors during data analysis. There is also no guarantee of the data quality.
Despite some sources, data lake is not a single source of truth as it is up to users how they clean and transform data. Starting with the same input they most likely will arrive with different results, making it impossible to achieve a unified overview for the organization.
Without proper governance, access control and GDPR processes can be problematic. Organizations must have supervision of what information is stored in the data lake, who can access it and for what uses. Besides, there is a significant risk of data lake becoming a dumping ground “just in case” some data might be useful in the future.
Summary
In short, a data warehouse is a repository of highly structured historical data which has been processed for a defined purpose. Data lake is also a repository but it is not limited only to structured and processed data. It stores this data in a raw format and its purpose might not have been defined yet.
It is important to understand that Data warehouses and Data lakes are complementary solutions, which serve different needs of an organization. Use this infographic to review the key differences and as a reminder for future reference.
What’s next?
Today, the lines between the two concepts are getting blurred and merged into the idea of a data lakehouse, which combines their advantages by enabling ACID transactional processes, removing the constraint of structured data, assuring data quality and low-cost cloud storage.
Make sure to follow Faro.team and learn more about the data lakehouse concept and how DataOps enabled it to become such a potent and accessible solution in the upcoming publications.
Thanks for reading!
Hope you enjoy it and if you'd like to talk more about it, please reach out to me via email: piotrek@faro.team
Sources: