
Here Is What I Learned Using Apache Airflow over 6 Years | by Chengzhi Zhao | Jan, 2023



Photo by Karsten Würth on Unsplash

Apache Airflow has undoubtedly been the most popular open-source project in data engineering for years. It gained popularity at the right time, alongside The Rise of the Data Engineer, and its core concept of making code the first-class citizen instead of drag-and-drop for data pipelines (aka ETL) was a milestone. Apache Airflow entered the Apache Incubator in March 2016 and became a top-level project in January 2019. I have worked with Apache Airflow since 2017 as a user, and along the way I have also contributed to it. Today, I want to share my journey with Airflow and what I learned over 6 years.

Airflow is a platform created by the community to programmatically author, schedule and monitor workflows. — Airflow Official Documentation

Apache Airflow was developed by Maxime Beauchemin, an ex-Airbnb and ex-Facebook engineer. He developed Airflow at Airbnb (you can tell from the project name), but the core idea that inspired him was an internal tool used at Facebook.

The primary users of Apache Airflow are data engineers, or any engineers who need to schedule workflows, mainly ETL (Extraction, Transformation, Loading) jobs. Those ETL jobs usually run at a daily or hourly cadence, and the jobs themselves perform the data manipulation that turns unstructured raw data into insights.

I have worked with Apache Airflow since 2017, before it became widespread as a must-have skill for any data engineer. I still recall how much effort we, as early adopters, “hacked” into the Airflow scheduler to get it to run stably. If you want to learn more, I wrote an article about it years back.

Airflow shone from the beginning as an incubating project, and it has kept improving and stabilizing along the way. Here are a couple of reasons I noted that made Airflow so fascinating initially:

Code First

Coding (especially in Python) is now the norm for a data engineer. However, writing a data pipeline 10 years ago mostly meant drag & drop in UI-based tools like SSIS or Informatica. Drag & drop is easy at first, but it soon reaches a point where the development cycle is hard to scale, and the UI can be tricky to drag and buggy when you drop a box.

Code is a good form of abstraction. We can write the DAG with all of its logic in Python, share it, and deploy it, and that fits the purpose well. It forces some people who only know SQL to learn Python, but it is rewarding.
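
To make the “code first” idea concrete, here is a minimal sketch of a daily DAG; the DAG id, schedule, and the extract/load callables are hypothetical placeholders rather than anything from the original pipeline:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract():
        # Placeholder: pull raw data from a source system.
        print("extracting")

    def load():
        # Placeholder: write the transformed data to the warehouse.
        print("loading")

    with DAG(
        dag_id="example_daily_etl",        # hypothetical DAG name
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",        # one run per day
        catchup=False,
    ) as dag:
        extract_task = PythonOperator(task_id="extract", python_callable=extract)
        load_task = PythonOperator(task_id="load", python_callable=load)
        extract_task >> load_task          # the dependency is plain Python code

Everything about the pipeline (what runs, when, and in what order) lives in version-controlled Python instead of a drag-and-drop canvas.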

Nice Visualization

No matter how much code we put together, an orchestrator still needs a way to visualize the process. A nice visualization looks engaging and solves multiple problems at once: it makes it manageable to catch failures, understand dependencies, and monitor job states.

As an end user, the visualization isn’t something you need to polish like a dashboard. Once the coding is done, Airflow handles the rest and renders the visualization for you. Your DAG can be viewed in seven different views, each with its own focus. The ones I use most daily are the Grid view (formerly the Tree view) and the Graph view.

On the UI, you can perform actions such as restarting a job, resetting the states of some tasks within a DAG, checking a failure log, or examining a long-running job. Although the UI doesn’t provide 100% of the functionality, it covers the day-to-day use cases.

Extensibility

To enable more people to use it as an ETL orchestration tool, the project has to be extensible so that other projects can integrate easily. The richness of its integrations sets the foundation for Airflow to become one of the top Apache projects. Furthermore, Airflow allows users to write their own logic with the PythonOperator or custom operators, which further encourages developers to build their logic in code instead of waiting for a new plugin release to accomplish their ETL needs.

The great extensibility also invites more innovative projects and plugins into Airflow’s ecosystem. The core idea of plugins is to avoid everyone reinventing the wheel and writing the same logic themselves. It also helps services and products provide seamless integration for end users.

The comprehensive set of Airflow plugins makes it harder for end users to migrate to something else. For example, EDI 835 (Electronic Remittance Advice) is a special format that carries claim payment information in the healthcare industry, and it requires special parsing logic to read correctly. Those kinds of domain-specific use cases make a great “moat” that keeps Airflow competitive.
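
As a hedged illustration of that extensibility (the operator name and parsing logic below are hypothetical, not an existing plugin), a domain-specific need like EDI 835 parsing can be wrapped once in a custom operator and reused across DAGs:

    from airflow.models.baseoperator import BaseOperator

    class Edi835ParseOperator(BaseOperator):
        """Hypothetical custom operator that parses an EDI 835 remittance file."""

        template_fields = ("input_path",)   # lets callers template the path with macros

        def __init__(self, input_path: str, **kwargs):
            super().__init__(**kwargs)
            self.input_path = input_path

        def execute(self, context):
            # Placeholder: a real implementation would parse the 835 segments here.
            self.log.info("Parsing EDI 835 file at %s", self.input_path)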

Community

Finally, Airflow is part of the Apache Software Foundation, and we all know what it means to be an Apache top-level project. The reputation speaks for itself and attracts organic traffic from more people wanting to try Airflow.

The community is growing healthily. On Stack Overflow, you can find over 9k questions tagged [airflow]. Airflow hosts its yearly Airflow Summit. Astronomer.io, the company that backs Airflow, also provides great documentation and certification. The Airflow Slack has been active since the early days when I joined back in 2017, and it now has 25k+ members.

When I joined Meetup, Airflow was an experimental tool brought in by a principal engineer as a proof of concept back in April 2017.

Introducing Airflow at that time aimed to resolve multiple issues with our custom Scala-based ETL. I still admire the people who built the complex DAG dependency logic that made the ETL work in the first place. But we all know ETL will fail some days, and that’s the headache with a custom Scala-based ETL: it’s not easy to debug and restart a job, and you have to be extremely careful to keep jobs idempotent.

Ultimately, we figured out a way to run Airflow 1.7 and deployed new Python code for the POC. It went exceptionally well. Once Airflow 1.9 and the Kubernetes executor were introduced, Airflow reached a much more mature stage. I also contributed to the Airflow project and gained insights into its internal core code. In production, we encountered fewer issues than in the early days and could focus more on improving scalability and onboarding more use cases.

However, there are still a few things I wish I had known earlier that would have saved me some time-consuming investigation.

Not leveraging macros in the first place

You read that right: I didn’t know about the macro concept in the early days of using Airflow, so I didn’t leverage Jinja templates like {{ ds }} in the first place. To trigger the ETL for a new day, the code simply computed “today minus one day” with a plain Python date call.

The biggest problem was backfilling. I hacked around it using an Airflow parameter, but users had to provide that information themselves and could only backfill one day at a time, which made the whole thing nontrivial. After carefully reading the Airflow documentation, it turned out that macros are perfect for getting a job’s metadata. With them, backfilling is as simple as clearing the existing DAG runs, with no additional parameter hacks.
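
Here is a minimal sketch of the difference, with a hypothetical DAG and task. The {{ ds }} macro is rendered per DAG run to that run’s logical date, so clearing an old run re-processes the right day automatically, whereas a date computed in plain Python is the same for every run:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="example_macro_etl",        # hypothetical DAG
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        # Anti-pattern (what I did at first): compute "yesterday" in Python at
        # parse time, so backfills and reruns all see the same, wrong date.
        #
        # run_date = (date.today() - timedelta(days=1)).isoformat()

        # With the macro, the date is rendered per run from the run's metadata.
        load_partition = BashOperator(
            task_id="load_partition",
            bash_command="echo 'loading partition for {{ ds }}'",
        )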

ETL in EST

Dates are critical in any data system. Our data system was partitioned by EST. Yes, by EST!!! If you have ever been in the same boat, you’d understand how much pain I had during every daylight saving change. Everything breaks!

We had multiple DAG dependencies using the external_task_sensor, which relies on execution_delta, the time difference from the upstream execution it should look at. Think about a case where some tasks have finished and the pending ones suddenly shift by +1/-1 hour: it becomes chaotic to manage. I had to build a workaround to avoid that, and it was not fun to maintain. If you are starting to build any data system, I’d suggest sticking with UTC whenever possible for simplicity and peace of mind.
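
For context, this is roughly what that dependency pattern looks like (the DAG and task names here are hypothetical, using the current sensor import path). With local-time schedules, a daylight saving shift changes the real gap between the two DAGs by an hour, and the sensor ends up waiting for an upstream run that never exists:

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.sensors.external_task import ExternalTaskSensor

    with DAG(
        dag_id="downstream_report",           # hypothetical downstream DAG
        start_date=datetime(2023, 1, 1),
        schedule_interval="0 6 * * *",        # 06:00 every day
        catchup=False,
    ) as dag:
        wait_for_upstream = ExternalTaskSensor(
            task_id="wait_for_upstream_etl",
            external_dag_id="upstream_etl",   # hypothetical upstream DAG
            external_task_id="load",          # hypothetical upstream task
            # The sensor looks for an upstream run exactly this far back in time,
            # which is what breaks when the wall-clock offset shifts with DST.
            execution_delta=timedelta(hours=1),
        )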

Be gentle in scheduling the DAG

Airflow is not a data streaming solution. This is mentioned at the very beginning of the Airflow official documentation, in the Beyond the Horizon section.

One reason is that Airflow doesn’t pass data between tasks; you’d need external storage for that part. The Airflow architecture wasn’t designed with streaming in mind. If you have a job that needs to trigger every minute, Airflow can accomplish that, but it won’t scale well. For data streaming use cases, use Flink or Spark, and don’t overload Airflow with them.

Another reason that is not mentioned often is that the Airflow scheduler is a beast when it comes to updating all the states of your DAGs. It scans for the next candidate DAGs and triggers them when their execution dates hit the wall clock. Even if the scheduler goes down, the states must be materialized so nothing gets lost. Massive state updates happen behind the scenes in an OLTP database, for which Airflow still recommends MySQL or Postgres. In this setup, scheduler_heartbeat_sec needs to be configured properly; the default is 5 seconds, and this value is the knob to tune.

Once we reached the point of having hundreds of DAGs and thousands of tasks, jobs were getting scheduled and started slowly. Increasing this value to more than 60 seconds cooled down the Airflow scheduler. To learn more about the Airflow scheduler, I have written an article, Airflow Schedule Interval 101.
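
For reference, this knob lives in the [scheduler] section of airflow.cfg (or the matching environment variable); the 60-second value below reflects the kind of tuning described above, not a universal recommendation:

    # airflow.cfg
    [scheduler]
    scheduler_heartbeat_sec = 60

    # or equivalently, via environment variable:
    # AIRFLOW__SCHEDULER__SCHEDULER_HEARTBEAT_SEC=60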

Apache Airflow is a remarkable open-source project for data engineers. As I look back, many of its concepts were leading implementations that paved the road for its popularity and success today. Many companies list Airflow as a required skill when hiring data engineers. I was fortunate to get involved with Apache Airflow early and to witness the project grow. Some new projects have started to challenge Apache Airflow, for example mage-ai and Prefect. I will cover them in future posts.

Maxime, the creator of Airflow and also of Apache Superset, is one of the most admirable data engineers behind those two projects. Below is a talk he gave back in 2015 about best practices for Airflow. It’s still a good reference for using Airflow today and for learning about its design philosophy.

Best practices with Airflow- an open source platform for workflows & schedules | By Maxime Beauchemin

