How did your ETLs become data factories?

Things used to be easy in the old days. You had a few ETL flows, moving data here and there with some transformations. You had simple time-based scheduling, maybe some logging to check how things were going from time to time. And now all this DataOps stuff, with its Agile and DevOps… and Lean Manufacturing! That one is way too much. Why should I even bother? I don't produce agricultural machinery or anything tangible at all, I just want to process my data! Why can't I live as I used to?

The answer is - because nowadays data flows are much more complicated. Very often you don’t have just a source, a destination and some transformation in between. Your data flows contain a lot of different steps and stages to reliably deliver valuable data from all your sources directly to your consumers. And there is a lot of data. And a lot of data flows. And a lot of dependencies between them. Fortunately, there are some tools to help you - tools invented to aid a completely different industry.

Lean Manufacturing

We already know (thanks to this article: Faro Team - Your Lighthouse In The Ocean of Data) the three fundamentals of DataOps: Agile, DevOps and Lean Manufacturing. Agile stands for analytics development. DevOps stands for analytics deployment. And Lean Manufacturing gives you the tools to operate your data pipelines - to orchestrate, monitor and manage them. Originally introduced by Toyota after World War II, Lean Manufacturing aimed to shorten times within the production system, help employees reduce non-value-adding activities, and improve the overall efficiency and quality of the products.

We can adapt this concept to our data flows. A manufacturing line is actually a pipeline - you start with raw materials, they go through different workstations, and in the end you get a finished product. In data analytics you start with raw data from your sources, process it through a series of steps and also end with a product - which can be a data model or a report. Each step of the process takes the input from the previous one, applies some transformations and provides the result to the following step as an output. The whole process must be correctly orchestrated, managed and monitored to efficiently and predictably deliver a good quality product.
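
As a rough illustration, here is a minimal Python sketch of such a pipeline - all step names and data are made up, it only shows how each step consumes the output of the previous one and hands its own output to the next:

```python
# A minimal sketch of a data pipeline built like a manufacturing line:
# each workstation (step) takes the previous step's output as its input.
from typing import Callable, Iterable

Step = Callable[[list[dict]], list[dict]]


def extract_orders() -> list[dict]:
    # Stand-in for reading raw data from a source system.
    return [
        {"order_id": 1, "amount": "120.50", "city": "Warsaw"},
        {"order_id": 2, "amount": "75.00", "city": "Krakow"},
    ]


def cast_amounts(rows: list[dict]) -> list[dict]:
    # First workstation: convert string amounts into numbers.
    return [{**row, "amount": float(row["amount"])} for row in rows]


def add_gross_amount(rows: list[dict]) -> list[dict]:
    # Second workstation: enrich each row with a derived column.
    return [{**row, "amount_gross": round(row["amount"] * 1.23, 2)} for row in rows]


def run_pipeline(raw: list[dict], steps: Iterable[Step]) -> list[dict]:
    # Each step's output becomes the next step's input.
    data = raw
    for step in steps:
        data = step(data)
    return data


if __name__ == "__main__":
    product = run_pipeline(extract_orders(), [cast_amounts, add_gross_amount])
    print(product)  # the "finished product" handed over to consumers
```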

Statistical Process Control

But how to achieve it? How to monitor and control the quality of such complex processes? Data flows are just too complicated to monitor manually - you need a smart tool to help you with that. The answer is Statistical Process Control (SPC). SPC uses real-time measurements to give you insight into what is going on while your data flows are being processed (while a product is being fabricated in your data factory). If the measurements are within the specified limits, the process is considered to be running properly. In practice you want to use a lot of very simple checks to monitor data trends at each step of your data pipeline. You can apply tests at the input or output of your data transformation steps, and you can set severity levels for anomalies - to notify you with a warning, or even stop the processing flow in case of an error - as early as possible, so that wrong data never reaches production.
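
As a sketch of the idea (the thresholds, names and logging setup below are just assumptions, not a prescribed implementation), a check placed between two pipeline steps could look like this:

```python
# An SPC-style check on one measurement of a pipeline step: values inside the
# limits pass, a soft breach only raises a warning, a hard breach stops the flow.
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("spc")


class DataCheckError(Exception):
    """Raised when a measurement breaches the hard limits - stop the flow."""


def check_row_count(row_count: int,
                    warn_limits: tuple[int, int] = (900, 1_100),
                    error_limits: tuple[int, int] = (500, 2_000)) -> None:
    lo_err, hi_err = error_limits
    lo_warn, hi_warn = warn_limits
    if not lo_err <= row_count <= hi_err:
        # Hard limits breached: fail fast so wrong data never reaches production.
        raise DataCheckError(f"Row count {row_count} outside error limits {error_limits}")
    if not lo_warn <= row_count <= hi_warn:
        # Soft limits breached: keep processing, but notify someone.
        log.warning("Row count %s outside warning limits %s", row_count, warn_limits)
    else:
        log.info("Row count %s within control limits", row_count)


check_row_count(1_050)   # inside the limits - just an info entry
check_row_count(1_150)   # outside the warning limits - logged, flow continues
# check_row_count(300)   # outside the error limits - raises DataCheckError
```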

You can implement a variety of tests (a few of them are sketched in code right after this list):

  • Business logic tests - validation of your data (Does each customer have a Social Security Number? Do all customer cities exist in my dimension table?)
  • Input tests - check the data prior to each stage of the pipeline (Is the input row count in the right range? Are all order dates in the past? Are all required fields correctly filled in? Do sales vary by no more than 15%?)
  • Output tests - check the result of your operation (Are there any duplicates in the data? Is the number of Polish citizens smaller than 40 million?)
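
As a rough sketch of how such checks might look in code (the table, column names and thresholds below are purely hypothetical), each one can be a simple function returning a list of issues:

```python
# Simple data checks of the three kinds listed above, written with pandas.
import pandas as pd


def business_logic_checks(customers: pd.DataFrame, cities_dim: pd.DataFrame) -> list[str]:
    issues = []
    if customers["ssn"].isna().any():
        issues.append("Some customers are missing a Social Security Number")
    if not customers["city"].isin(cities_dim["city"]).all():
        issues.append("Some customer cities are missing from the city dimension")
    return issues


def input_checks(orders: pd.DataFrame, expected_rows: tuple[int, int] = (1_000, 50_000)) -> list[str]:
    issues = []
    lo, hi = expected_rows
    if not lo <= len(orders) <= hi:
        issues.append(f"Input row count {len(orders)} outside expected range {expected_rows}")
    if (orders["order_date"] > pd.Timestamp.today()).any():
        issues.append("Some order dates are in the future")
    return issues


def output_checks(result: pd.DataFrame) -> list[str]:
    issues = []
    if result.duplicated().any():
        issues.append("Result contains duplicate rows")
    return issues
```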

Knowing your data, you can think of dozens of such simple data checks. And these little ones are very powerful tools that help you automate the monitoring of your data flows and improve their quality, efficiency and transparency. They work for you constantly, 24x7, doing the whole dirty job - because you want to be the first to know that something went wrong, before your customer notifies you. Additionally, you don't need to build everything in one go. You can start with a very small set of checks and keep adding new ones continuously, each time something new comes to your mind.

And keep in mind that your data factory never consists of only one pipeline - in fact, you always have at least two! But that’s a little bit of a different story. Stay in touch!

Thanks for reading!

Hope you enjoyed it, and if you'd like to talk more about it, please reach out to me via email: mariusz@faro.team

References:

https://en.wikipedia.org/wiki/Lean_manufacturing

https://en.wikipedia.org/wiki/Statistical_process_control

https://medium.com/data-ops/lean-manufacturing-secrets-that-you-can-apply-to-data-analytics-31d1a319cbf0

Published on December 13, 2022
