Let’s start with some definitions, just to be sure that everybody is on the same page. Orchestration is the coordination and management of multiple computer systems, applications and/or services, stringing multiple tasks together to execute a larger workflow or process.
The goal is to streamline and optimize the execution of frequent, repeatable processes and thus to help data teams more easily manage complex tasks and workflows.
Simply speaking, orchestration is the process of constructing bigger data workflows from smaller, atomic building blocks: the individual data loads.
Orchestration of data jobs has become such a common practice that most organizations have worked out their own ways of creating and scheduling workflows. Let’s quickly go through some traditional approaches.
I hope you never had to work with the strategy where data loading scripts call other scripts. It is a nightmare equivalent to using a GoTo statement. Much better, and fortunately far more common, is the concept called “master packages”. It means nothing more than creating a workflow in the same tool where the data loads have been created. For example, if we have created ETL/ELT processes in SSIS or ADF, then we create a special package or pipeline just to hold the order and dependencies of the child ETL/ELT processes. This approach works really well for small and simple solutions. It is super easy to implement, understand and maintain. However, in complex scenarios you will quickly get lost in a spider’s web of dependencies. Nobody likes spiders. For sure nobody likes spider’s webs. Sooner or later such a solution becomes unmaintainable. Another drawback is that every change in dependencies requires a redeployment of at least the “master package”, since the orchestration information is hard-coded together with the data workflows. Maybe we can make it more dynamic in nature? For sure we will need a way and a place to store orchestration information dynamically.
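Just to illustrate the idea, here is a minimal sketch of what such dynamic storage could look like, assuming a simple SQLite metadata store; the table and column names (jobs, job_dependencies and so on) are made up for the example and do not come from any particular tool or framework.

```python
import sqlite3

# Hypothetical metadata store for orchestration information.
# Schema and names are illustrative only, not taken from any specific product.
conn = sqlite3.connect("orchestration_metadata.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS jobs (
    job_name   TEXT PRIMARY KEY,
    is_enabled INTEGER NOT NULL DEFAULT 1
);
CREATE TABLE IF NOT EXISTS job_dependencies (
    job_name   TEXT NOT NULL REFERENCES jobs(job_name),
    depends_on TEXT NOT NULL REFERENCES jobs(job_name),
    PRIMARY KEY (job_name, depends_on)
);
""")

# Register the three batches used in the example below: D -> A -> G.
conn.executemany("INSERT OR IGNORE INTO jobs (job_name) VALUES (?)",
                 [("D",), ("A",), ("G",)])
conn.executemany(
    "INSERT OR IGNORE INTO job_dependencies (job_name, depends_on) VALUES (?, ?)",
    [("A", "D"), ("G", "A")])
conn.commit()

for job, dependency in conn.execute(
        "SELECT job_name, depends_on FROM job_dependencies"):
    print(f"{job} depends on {dependency}")
```

With dependencies stored as plain rows, changing the order of execution becomes a data change instead of a redeployment of the master package.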
Without a doubt, there is no sense in speaking about orchestration without dependencies. In old-school, traditional batch processing, dependencies were so simple that we didn’t even think about them. Let’s say we had 3 batches: G dependent on A, and A dependent on D. So, the processing order was the following: start with processing D, when it is completed process A, and finally run the G batch.
D -> A -> G
Consequently, what we simply call dependencies here are in fact edges of a Directed Acyclic Graph (DAG), which in real-life scenarios and big solutions can become very complex. The key point is to store, modify and visualize them efficiently. How can we achieve this?
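To make this a bit more concrete, here is a minimal sketch, assuming the dependencies are kept as a plain edge list like the one above; the variable names are just for illustration. It derives an execution order by topologically sorting the graph, which is essentially what every workflow engine does under the hood.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Edges of the DAG: each job maps to the set of jobs it depends on.
# We reuse the D -> A -> G example from above.
dependencies = {
    "A": {"D"},
    "G": {"A"},
    "D": set(),
}

# static_order() yields the jobs in an order that respects every dependency.
execution_order = list(TopologicalSorter(dependencies).static_order())
print(execution_order)  # ['D', 'A', 'G']
```

The same sort also tells you which jobs can run in parallel: everything whose predecessors have already finished is ready to start.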
So, what are the building blocks of modern Data Orchestration?
A modern workflow management system gives you full control and insight into even the most complex workflows. What are the key benefits?
In Faro we can see that many companies tend to build their own simple frameworks to solve problems like storing dependencies and the status of jobs. However, concepts like DAGs, asynchronous, queue based processing aren’t that popular yet. That’s why we have built our own Workflow Management System. On the other hand, solutions like Apache Airflow are gaining in popularity. For sure we will touch upon both solutions in future blog posts. Just stay in touch with our blog!
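Just to give a flavour of how the same D -> A -> G chain could look in Apache Airflow, here is a minimal sketch; the DAG id, schedule and bash commands are placeholders for the example, not part of any real project.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# A minimal Airflow DAG expressing the D -> A -> G chain from earlier.
with DAG(
    dag_id="example_d_a_g",           # placeholder name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    load_d = BashOperator(task_id="load_D", bash_command="echo 'loading D'")
    load_a = BashOperator(task_id="load_A", bash_command="echo 'loading A'")
    load_g = BashOperator(task_id="load_G", bash_command="echo 'loading G'")

    # The >> operator declares the edges of the DAG: D before A before G.
    load_d >> load_a >> load_g
```

The dependencies still live in code here, but the scheduler keeps run state, retries and scheduling separate from the data loads themselves, which is exactly the separation the master-package approach lacks.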
Thanks for reading!
Hope you enjoy it and if you'd like to talk more about it, please reach out to me via email: adrian@faro.team