Airflow DAG with CASFS+

May 4th, 2021


About Apache Airflow

Apache Airflow is a platform to programmatically author, schedule, and monitor workflows. In Airflow, a workflow is known as a DAG, or “Directed Acyclic Graph,” and each DAG can consist of multiple interdependent tasks. Airflow is built on the Python programming language: DAGs are constructed using Airflow's Python library, and developers can apply all the standard practices of Python programming to DAG construction. Building workflows programmatically in code reduces the possibility of human error and is standard practice in enterprise data engineering solutions.
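
As a concrete illustration, a minimal DAG definition might look like the following sketch; the DAG id, task names, and schedule are placeholders for illustration, not any particular production pipeline.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# A minimal, illustrative DAG: the id, schedule, and commands are placeholders.
with DAG(
    dag_id="example_etl",
    start_date=datetime(2021, 5, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extract")
    transform = BashOperator(task_id="transform", bash_command="echo transform")
    load = BashOperator(task_id="load", bash_command="echo load")

    # The dependencies between tasks are what form the directed acyclic graph.
    extract >> transform >> load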


Airflow provides built-in tools, such as file sensors and cron scheduling, that give developers ways to fine-tune DAG and task execution. To fully utilize Airflow with the CASFS+ file system, Code Willing has implemented a number of these tools in production. The following sections provide some detail as to how this is currently being done.
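
For example, a file sensor combined with a cron schedule might look like the sketch below; the file path, connection id, and timings are assumptions made for illustration, not Code Willing's production settings.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.sensors.filesystem import FileSensor

with DAG(
    dag_id="wait_for_raw_file",          # hypothetical DAG id
    start_date=datetime(2021, 5, 1),
    schedule_interval="0 6 * * *",       # cron expression: 06:00 every day
    catchup=False,
) as dag:
    wait_for_file = FileSensor(
        task_id="wait_for_file",
        fs_conn_id="fs_default",         # filesystem connection (assumed default)
        filepath="/casfs/incoming/raw_data.csv",  # hypothetical path on the shared mount
        poke_interval=60,                # check for the file every 60 seconds
        timeout=60 * 60,                 # give up after one hour
    )
    process = BashOperator(task_id="process", bash_command="echo processing")

    # Downstream processing only starts once the expected file has arrived.
    wait_for_file >> process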


About the System

CASFS+ facilitates building quick and dynamic workflow pipelines by providing its users with Apache Airflow out-of-the-box. Combining the power of the CASFS+ cloud workspace with the precision offered by Apache Airflow has allowed Code Willing to build robust, stable data pipelines for its clients. In addition, the ease of creating and maintaining DAGs via Python code means Code Willing can turn ad-hoc projects into production-ready code swiftly and effectively.


In Code Willing's CASFS+ Cloud Workspace environment, Apache Airflow is configured to run a variety of ETL workflows for clients, typically including raw data processing, data migration, and general quality checks. These jobs do not necessarily run on the same node or even in the same Airflow environment. It is not uncommon for more than one node to exist in the CASFS+ workspace, with each node running its own instance of Airflow, all under the same user. To help manage this, CASFS+ lets users easily switch between the dashboards of all running Airflow instances so that jobs can be tracked across nodes.


Let us consider a typical data ETL pipeline managed by Code Willing. There are three main types of DAGs in this pipeline: data processing DAGs, data quality check DAGs, and alert/monitoring DAGs. Each of these DAGs consists of many interdependent tasks and plays an important role in the overall data life cycle.


The data processing DAGs ingest pre-specified raw data and extract, transform, and load it (i.e. ETL). Next, the data quality check DAGs verify that the extracted and cleaned data passes several fidelity tests, usually specified by the client. In the last stage of the process, alert and monitoring DAGs track details of the DAGs and tasks themselves: they check for timely file arrivals, make sure there are no missing files, and flag failed jobs. If a data quality check does not pass, or a file arrives later than expected, the quality and monitoring DAGs sound the alarm, sending email and Slack alerts that ask the support team to look into the issue. This system has allowed Code Willing to prevent nearly all the errors that stem from the day-to-day data ETL process before clients ever know they exist.
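
A hedged sketch of how such alerting can be wired into a DAG is shown below; the webhook URL, email address, and DAG id are placeholders, and the Slack call uses a plain incoming-webhook POST rather than any particular provider package.

from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.bash import BashOperator

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder


def notify_slack(context):
    """Post a short failure message to a Slack incoming webhook."""
    ti = context["task_instance"]
    message = f"Task {ti.task_id} in DAG {ti.dag_id} failed."
    requests.post(SLACK_WEBHOOK_URL, json={"text": message})


default_args = {
    "email": ["support@example.com"],   # placeholder address
    "email_on_failure": True,           # Airflow's built-in email alerting
    "on_failure_callback": notify_slack,
}

with DAG(
    dag_id="quality_checks",            # hypothetical DAG id
    start_date=datetime(2021, 5, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args=default_args,
) as dag:
    check = BashOperator(
        task_id="row_count_check",
        bash_command="exit 1",          # a deliberately failing check, for illustration
    )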


However, there is one catch: what happens if the entire Airflow process shuts down? If the whole instance fails, DAGs that monitor the pipeline for errors or late files are of little use. Luckily, Airflow exposes a dedicated API health check endpoint for exactly this situation. Using AWS CloudWatch, external health checks monitor this endpoint to make sure the pipeline stays up and running. If the system appears to be down, the health checking system sends email and Slack notifications asking the support team to look into the issue.
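
The sketch below shows the kind of check that can be pointed at Airflow's /health endpoint; the host name is a placeholder, and the actual CloudWatch alarm wiring is configured separately in AWS rather than shown here.

import requests

AIRFLOW_BASE_URL = "http://airflow.example.internal:8080"  # placeholder host


def airflow_is_healthy() -> bool:
    """Return True if the metadatabase and scheduler both report healthy."""
    resp = requests.get(f"{AIRFLOW_BASE_URL}/health", timeout=10)
    resp.raise_for_status()
    status = resp.json()
    return (
        status["metadatabase"]["status"] == "healthy"
        and status["scheduler"]["status"] == "healthy"
    )


if __name__ == "__main__":
    print("healthy" if airflow_is_healthy() else "unhealthy")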


When performing tasks that have high memory requirements or otherwise need to run in parallel, Airflow and CASFS+ can be combined into a powerful solution. Because CASFS+ provides a shared network file system across all nodes, it is easy to send Airflow tasks to one node and retrieve the results on another. Using CASFS+'s built-in node balancer, tasks are off-loaded to worker nodes and their results are retrieved by the “master” machine (usually the machine running Airflow). By never running tasks on the “master” machine and delegating them only to dedicated worker nodes, we ensure that the server running Airflow remains stable and free from memory strain, and that the nodes running production jobs are dedicated solely to the task at hand.
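
As an illustrative sketch of this pattern, the heavy step below is dispatched away from the Airflow machine while a lightweight downstream task reads its output from the shared CASFS+ mount. The cw-submit command, paths, and DAG id are hypothetical stand-ins, not the node balancer's actual interface.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

SHARED_RESULT = "/casfs/results/daily_summary.parquet"  # hypothetical shared path


def summarize_result(**_):
    """Runs on the Airflow machine; only touches the small result file."""
    import os
    size = os.path.getsize(SHARED_RESULT)
    print(f"Result file ready on shared storage ({size} bytes)")


with DAG(
    dag_id="offloaded_processing",       # hypothetical DAG id
    start_date=datetime(2021, 5, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # The memory-heavy step is dispatched to a worker node; "cw-submit" is a
    # hypothetical stand-in for whatever actually submits work to the balancer.
    heavy_job = BashOperator(
        task_id="heavy_job",
        bash_command=f"cw-submit -- python /casfs/jobs/build_summary.py --out {SHARED_RESULT}",
    )

    # The result comes back via the shared file system, not the Airflow node's memory.
    collect = PythonOperator(task_id="collect", python_callable=summarize_result)

    heavy_job >> collect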


Conclusion

Code Willing's first-hand experience with Airflow's stability and flexibility has made it a core part of the data services Code Willing offers to its clients. That experience, along with Airflow's wide adoption in the data engineering space, where it has become one of the most popular workflow management tools in recent years, is why Airflow is included out-of-the-box in the CASFS+ cloud workspace platform. Code Willing's goal is to accelerate the research abilities of end users by providing a suite of tools that facilitate powerful data analysis, and Airflow is undoubtedly one of the most powerful tools implemented in that suite to date. In turn, the CASFS+ cloud workspace platform turns Airflow into a robust big-data ETL workflow and execution manager that works seamlessly in a multi-node parallel processing environment.


References:


https://airflow.apache.org/docs/apache-airflow/2.0.1/