Have you ever considered how important metadata is in a data warehouse?

When you design a data warehouse, your decisions will have a big impact on the further maintenance and lifecycle of the solution.

One of the most important decisions is whether or not to automate the data sourcing processes, and how you will store and version changes.

A common approach is to build ingestion processes manually, using the GUI in tools like SSIS or ADF. It gives you fine control over individual flows and lets you prepare custom processes tuned for specific source objects. The entry level for developers is also usually lower in such solutions, as they can focus on specific flows without deep knowledge of the entire system. In this approach, entire ETL packages are stored in the repository, which is not a convenient way to track changes. But the biggest problem arises when you create a data warehouse from scratch: this approach is very time consuming, as you need to manually prepare hundreds or thousands of packages.


But what is the alternative?

Imagine there’s no need to click a thousand times, placing pieces in a GUI tool. Imagine you can build hundreds of ingestion flows in minutes. Imagine you can add a new source with a few simple inserts. But how do you do that?


The answer is a metamodel and automation!

Storing proper metadata in warehousing systems is one of the most important things, but it is also often very underrated. Instead of keeping entire flows in the repository, it’s far better to store the metadata of these processes and build the solution on demand. In the metadata approach you have tables containing information about what your solution should look like. You need to store at least three main groups of information: the source data model, the target data model, and the mappings or transformations between them.

Diagram: High-level metamodel concept

The source and target data models might be implemented together, as their metadata is similar. It’s usually a list of tables, columns, datatypes and relations between tables. This will allow you to generate the proper objects in your database, but not only that! A well-designed metamodel allows you to source or ingest data from/to a file storage or even, for example, Azure Event Hub!
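
To make this concrete, here is a minimal sketch of what model metadata could look like and how target objects could be generated from it. The layout and names (schema, table, column, datatype, nullability) are illustrative assumptions, not a fixed standard:

```python
from collections import defaultdict

# Example rows as they could live in a metadata table (or a versioned CSV file):
# (schema, table, column, datatype, is_nullable)
meta_columns = [
    ("stg", "Customer", "CustomerId",   "int",           False),
    ("stg", "Customer", "CustomerName", "nvarchar(200)", True),
    ("stg", "Customer", "ModifiedDate", "datetime2",     True),
]

def generate_create_statements(rows):
    """Group column metadata per table and emit one CREATE TABLE per target object."""
    tables = defaultdict(list)
    for schema, table, column, datatype, nullable in rows:
        tables[(schema, table)].append(
            f"    [{column}] {datatype} {'NULL' if nullable else 'NOT NULL'}"
        )
    for (schema, table), cols in tables.items():
        yield f"CREATE TABLE [{schema}].[{table}] (\n" + ",\n".join(cols) + "\n);"

for ddl in generate_create_statements(meta_columns):
    print(ddl)
```

The same metadata can just as well drive generators for file schemas or other targets instead of database DDL.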

Having the source and target models defined, we can then handle mappings and transformations. Again, you need to store them in your metadata. They might be just a simple list of columns with datatype mappings to be queried on the source, but they might also be something more complex. If you want to apply business logic during the extraction, you can put queries joining many tables, or even Python code, here.
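
As a rough illustration, the mapping metadata can be rendered into the extraction query an ingestion flow would run against the source. The mapping structure, the sample business rule, and the object names below are assumptions made for the example:

```python
# (source_expression, target_column) pairs as they could be stored in a mapping table
column_mappings = [
    ("c.CustomerID",         "CustomerId"),
    ("UPPER(c.CompanyName)", "CustomerName"),  # simple business rule applied on extract
    ("c.ModifiedDate",       "ModifiedDate"),
]

source_object = "Sales.Customer AS c"

def build_extract_query(mappings, source):
    """Render the SELECT statement the generated ingestion flow would execute."""
    select_list = ",\n    ".join(f"{expr} AS [{target}]" for expr, target in mappings)
    return f"SELECT\n    {select_list}\nFROM {source};"

print(build_extract_query(column_mappings, source_object))
```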

What’s the advantage of this approach?

First of all, you store and track information about WHAT you want to have in your warehouse, not HOW it’s done. Preparing a set of metadata and generating the flows automatically will tremendously decrease the amount of time needed to build your solution. Just as important, this approach generates fewer errors: you don’t edit every flow independently, so there are fewer places where you can make a mistake.

But how do you track changes when the metamodel lives inside a database?

This can be achieved in many ways. The simplest approach is to use native solutions like temporal tables in SQL Server or Azure SQL databases. Every time you insert new entities or update some column attributes, your changes will be tracked automatically. Although it’s the easiest way, it has some major cons. The history is strongly tied to a specific database and the backup/restore strategy performed on it, and you also need to design custom processes to move the metadata to a different environment.
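
For illustration, the sketch below (assuming SQL Server / Azure SQL and an illustrative meta.SourceColumn table) generates the DDL that turns an existing metadata table into a system-versioned temporal table; after that, every insert or update is tracked automatically and the history can be queried with FOR SYSTEM_TIME:

```python
def enable_system_versioning(schema: str, table: str) -> str:
    """Emit DDL that converts an existing table into a system-versioned temporal table."""
    return f"""
ALTER TABLE [{schema}].[{table}] ADD
    ValidFrom datetime2 GENERATED ALWAYS AS ROW START
        CONSTRAINT DF_{table}_ValidFrom DEFAULT SYSUTCDATETIME(),
    ValidTo datetime2 GENERATED ALWAYS AS ROW END
        CONSTRAINT DF_{table}_ValidTo DEFAULT CONVERT(datetime2, '9999-12-31 23:59:59.9999999'),
    PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo);

ALTER TABLE [{schema}].[{table}]
    SET (SYSTEM_VERSIONING = ON (HISTORY_TABLE = [{schema}].[{table}_History]));
"""

# Table name is an assumption; run the output against your metadata database once.
print(enable_system_versioning("meta", "SourceColumn"))
```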

What seems the most efficient and flexible is to store the metadata “as code” in your Git repository and replicate it to the database using Azure DevOps pipelines. You can choose the file format that suits you, but a common choice is CSV. If you would like to focus more on the user experience of editing these files manually, you can use Excel files as well. Then you benefit from everything Git gives you out of the box and for free: history tracking, reviewing changes with pull requests, and merging and branching to better reflect your architecture.
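
As a rough sketch of the replication step such a pipeline could run, the script below reads a metadata CSV from the repository and emits a MERGE statement that makes the database copy match the file. The file path, column names and target table are assumptions for illustration:

```python
import csv
from pathlib import Path

def load_metadata(csv_path: Path) -> list[dict]:
    """Read one metadata file from the repository."""
    with csv_path.open(newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))

def to_merge_statement(rows: list[dict], target_table: str = "meta.SourceColumn") -> str:
    """Emit a MERGE that synchronizes the database table with the versioned file."""
    values = ",\n        ".join(
        "('{SchemaName}', '{TableName}', '{ColumnName}', '{DataType}')".format(**r)
        for r in rows
    )
    return f"""
MERGE {target_table} AS tgt
USING (VALUES
        {values}
) AS src (SchemaName, TableName, ColumnName, DataType)
    ON  tgt.SchemaName = src.SchemaName
    AND tgt.TableName  = src.TableName
    AND tgt.ColumnName = src.ColumnName
WHEN MATCHED THEN
    UPDATE SET tgt.DataType = src.DataType
WHEN NOT MATCHED BY TARGET THEN
    INSERT (SchemaName, TableName, ColumnName, DataType)
    VALUES (src.SchemaName, src.TableName, src.ColumnName, src.DataType)
WHEN NOT MATCHED BY SOURCE THEN DELETE;
"""

rows = load_metadata(Path("metadata/source_columns.csv"))
print(to_merge_statement(rows))
```

Running such a step in an Azure DevOps pipeline on every merge to the main branch keeps each environment in sync with the repository.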

Below you can find a brief summary of the selected features.

Table: pros and cons of the discussed architectural patterns

So now that we have a good metamodel, how do we put it to work and automate the build?

Stay tuned, to be continued.

Thanks for reading!

I hope you enjoyed it, and if you'd like to talk more about it, please reach out to me via email: maciek@faro.team

Published on November 15, 2022