What is Data Orchestration?

Let’s start with some definitions, just to be sure that everybody is on the same page. Orchestration is the coordination and management of multiple computer systems, applications and/or services, stringing together multiple tasks in order to execute a larger workflow or process.
The goal is to streamline and optimize the execution of frequent, repeatable processes and thus to help data teams more easily manage complex tasks and workflows.
Simply speaking, orchestration is the process of constructing bigger data workflows from smaller, atomic data loads that serve as building blocks.

Traditional Approaches to Data Orchestration

Orchestration of data jobs has become such a common practice that most organizations have worked out their own ways of creating and scheduling workflows. Let’s try to quickly go through some traditional approaches.

I hope you have never had to work with the strategy where data loading scripts call other scripts. It is a nightmare, the equivalent of using a GoTo statement. Much better, and fortunately far more common, is the concept called “master packages”. It means nothing more than creating a workflow in the same tool where the data loads have been created. For example, if we have created ETL/ELT processes in SSIS or ADF, we create a special package or pipeline whose only job is to hold the order and dependencies of the child ETL/ELT processes.

This approach works really well for small and simple solutions. It is super easy to implement, understand and maintain. However, in complex scenarios you will quickly get lost in a spider's web of dependencies. Nobody likes spiders, and certainly nobody likes spider's webs. Sooner or later such a solution becomes unmaintainable. Another drawback is that every change in dependencies requires a redeployment of at least the “master package”, since the orchestration information is hard coded together with the data workflows. Maybe we can make it more dynamic? If so, we will need a way and a place to store orchestration information dynamically, as sketched below.
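As a simple illustration of what “storing orchestration information dynamically” could look like, the ordering below lives in a small piece of configuration data instead of inside a master package. The file name, step names and structure are purely hypothetical, just a minimal sketch of the idea:

```python
import json

# Hypothetical orchestration metadata kept outside the ETL tool, e.g. in a
# small JSON file under source control. Changing the order only means
# changing this data, not redeploying a "master package".
config = json.loads("""
{
  "steps": ["stage_customers", "stage_orders", "build_dwh", "refresh_reports"]
}
""")

for step in config["steps"]:
    print(f"would trigger: {step}")  # placeholder for the real ETL/ELT call
```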

DAGs - Directed Acyclic Graphs

Without a doubt, there is no sense in speaking about orchestration without dependencies. In old-school, traditional batch processing, dependencies were so simple that we didn't even think about them. Let’s say we had 3 batches: G dependent on A, while A was dependent on D. So the processing order was as follows: start with processing D, when it is completed process A, and finally run the G batch.

D -> A -> G

Consequently, what we simply call dependencies here are in fact the edges of a Directed Acyclic Graph, which in real-life scenarios and big solutions can be very complex. The key point is to store, modify and visualize them efficiently. How can we achieve this?
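To make this concrete, here is a minimal sketch of how dependency edges can be stored as plain data and turned into an execution order using Python's standard library. The task names simply mirror the D -> A -> G example above:

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each task maps to the set of tasks it depends on - these are the edges of the DAG.
dependencies = {
    "A": {"D"},   # A waits for D
    "G": {"A"},   # G waits for A
    "D": set(),   # D has no upstream dependencies
}

# TopologicalSorter derives a valid execution order from the edges
# and raises CycleError if the graph accidentally contains a cycle.
print(list(TopologicalSorter(dependencies).static_order()))  # ['D', 'A', 'G']
```

Because the dependencies are just data, they can be versioned, visualized and changed without touching the data loads themselves.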

Modern Data Orchestration

So, what are the building blocks of modern Data Orchestration?

  • DAGs - store detailed dependency information in the form of Directed Acyclic Graphs. Treat it as metadata and preferably follow the “Configuration as Code” principle: just put it into your source control!
  • Array of Workers - of course you should be able to run many workloads in parallel, but you should also be able to control, in real time, how much pressure you put on your data systems. That’s why you need a manageable worker thread pool.
  • Scheduler process - to be in full control of a complex data analytics system you need an asynchronous, queue-based processing engine. The first big advantage is the ability to control the workload: you can limit the number of processes running in parallel without brutally rejecting incoming requests when the limit is exceeded. You gently put them into the queue and let them wait for an available worker thread. Of course, processes are taken from the queue according to the priorities you have defined upfront (see the sketch after this list).
  • Automated monitoring of data workflows - since you are using one common orchestration engine for all your data processes across the whole company, you can easily implement simple automated remedial actions, like retrying after a timeout or killing hanging processes.
  • Graphical interface to manage the orchestration - since DAGs can become really large and complex, it is very helpful to have a graphical tool to explore the DAG definitions, but also to monitor the workflows and check the execution logs.
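Here is a minimal sketch of the scheduler and worker-pool ideas described above: a priority queue feeding a fixed pool of workers. The queue type, priorities, task names and worker count are illustrative assumptions, not a reference implementation:

```python
import queue
import threading
import time

MAX_WORKERS = 3                      # caps the pressure put on the data systems
task_queue = queue.PriorityQueue()   # lower number = higher priority

def worker():
    """Pick the highest-priority task from the queue and run it."""
    while True:
        priority, name = task_queue.get()
        try:
            print(f"running {name} (priority {priority})")
            time.sleep(1)            # placeholder for the real data load
        finally:
            task_queue.task_done()

# A fixed, manageable pool of workers: requests above the limit are not
# rejected, they simply wait in the queue for an available worker thread.
for _ in range(MAX_WORKERS):
    threading.Thread(target=worker, daemon=True).start()

# Enqueue data loads with the priorities defined upfront.
task_queue.put((1, "load_sales"))        # most important, processed first
task_queue.put((5, "load_marketing"))
task_queue.put((9, "refresh_reports"))

task_queue.join()                        # wait until every queued task completes
```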

A modern workflow management system gives you full control and insight into even the most complex workflows. What are the key benefits?

  • Thanks to DAGs kept as code, it supports the DataOps principle of “Configuration as Code”, which in turn gives you traceability and control of your system.
  • You get better performance! Thanks to precise dependencies, your workflow starts as soon as the source data is ready. You don’t waste time on unnecessary waiting.
  • Queue-based processing makes sure that the more important data loads are processed first.
  • Asynchronous processing and a manageable Array of Workers ensure you can control the performance of your data system.
  • Automated monitoring and the graphical interface give you control and insight into your complex data analytics workflows. Simple maintenance actions are taken automatically for you; other actions you can take immediately, since you finally understand what is going on with your data!

At Faro we see that many companies tend to build their own simple frameworks to solve problems like storing dependencies and job statuses. However, concepts like DAGs and asynchronous, queue-based processing aren’t that popular yet. That’s why we have built our own Workflow Management System. On the other hand, solutions like Apache Airflow are gaining in popularity. We will definitely touch upon both solutions in future blog posts. Just stay in touch with our blog!

Thanks for reading!

Hope you enjoyed it, and if you'd like to talk more about it, please reach out to me via email: adrian@faro.team


Published on December 6, 2022
