
Setting Up Apache Airflow with Docker-Compose in 5 Minutes | by Marvin Lanhenke | May, 2022



Create a development environment and start building DAGs

Photo by Fabio Ballasina on Unsplash

Although I'm pretty late to the party (Airflow became an Apache Top-Level Project in 2019), I still had trouble finding an easy-to-understand, up-to-date, and lightweight way to install Airflow.

Today, we’re about to change all that.

In the following sections, we will create a lightweight, standalone, and easily deployed Apache Airflow development environment in just a few minutes.

Docker-Compose will be our close companion, allowing us to create a smooth development workflow with quick iteration cycles. We simply spin up a few Docker containers and can start creating our own workflows.

Note: The following setup will not be suitable for any production purposes and is intended to be used in a development environment only.

Apache Airflow is a batch-oriented framework that allows us to easily build scheduled data pipelines in Python. Think of “workflow as code” capable of executing any operation we can implement in Python.

Airflow is not a data processing tool itself but orchestration software. We can imagine Airflow as a spider in a web: sitting in the middle, pulling all the strings, and coordinating the workload of our data pipelines.

A data pipeline typically consists of several tasks or actions that need to be executed in a specific order. Apache Airflow models such a pipeline as a DAG (directed acyclic graph): a graph of tasks connected by directed edges (dependencies), with no loops or cycles.

A simple example DAG [Image by Author]

This approach allows us to run independent tasks in parallel, saving time and money. Moreover, since the pipeline is split into several smaller tasks, a failed run only requires rerunning the failed task and its downstream tasks instead of executing the complete workflow all over again.

Airflow is composed of three main components:

  1. Airflow Scheduler — the “heart” of Airflow, which parses the DAGs, checks the schedule intervals, and passes the tasks over to the workers.
  2. Airflow Worker — picks up the tasks and actually performs the work.
  3. Airflow Webserver — provides the main user interface to visualize and monitor the DAGs and their results.
A high-level overview of Airflow components [Image by Author]

Now that we have briefly introduced Apache Airflow, it’s time to get started.

Step 0: Prerequisites

Since we will use docker-compose to get Airflow up and running, we have to install Docker first. Simply head over to the official Docker site and download the appropriate installation file for your OS.
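Once the installation has finished, we can quickly verify that both Docker and the Compose plugin are available by checking their versions in the terminal:

docker --version
docker compose version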

Step 1: Create a new folder

We start nice and slow by simply creating a new folder for Airflow.

Just navigate via your preferred terminal to a directory, create a new folder, and change into it by running:

mkdir airflow
cd airflow

Step 2: Create a docker-compose file

Next, we need to get our hands on a docker-compose file that specifies the required services or docker containers.

Via the terminal, we can run the following command inside the newly created Airflow folder

curl https://raw.githubusercontent.com/marvinlanhenke/Airflow/main/01GettingStarted/docker-compose.yml -o docker-compose.yml

or simply create a new file named docker-compose.yml and copy the below content.
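The exact content is available at the URL above; as a rough orientation, a minimal compose file for this kind of setup might look like the following sketch (the image tag, container names, volumes, and health checks are assumptions for a local setup rather than a verbatim copy of the linked file). It expects the dags, logs, and plugins folders next to it and the .env file from Step 3:

# common settings reused by every Airflow container (YAML anchor);
# credentials and Airflow configuration come from the .env file created in Step 3
x-airflow-common: &airflow-common
  image: apache/airflow:2.3.0
  user: "${AIRFLOW_UID:-50000}:0"
  env_file:
    - .env
  volumes:
    - ./dags:/opt/airflow/dags
    - ./logs:/opt/airflow/logs
    - ./plugins:/opt/airflow/plugins

services:
  postgres:
    image: postgres:13
    container_name: postgres
    env_file:
      - .env
    healthcheck:
      # assumes the POSTGRES_USER defined in the .env file
      test: ["CMD", "pg_isready", "-U", "airflow"]
      interval: 5s
      retries: 5
    restart: always

  airflow-init:
    <<: *airflow-common
    container_name: airflow-init
    depends_on:
      postgres:
        condition: service_healthy
    environment:
      # the image entrypoint migrates the metadata DB and creates the web UI user
      _AIRFLOW_DB_UPGRADE: 'true'
      _AIRFLOW_WWW_USER_CREATE: 'true'
    command: version

  airflow-scheduler:
    <<: *airflow-common
    container_name: airflow-scheduler
    command: scheduler
    restart: always
    depends_on:
      postgres:
        condition: service_healthy
      airflow-init:
        condition: service_completed_successfully

  airflow-webserver:
    <<: *airflow-common
    container_name: airflow-webserver
    command: webserver
    ports:
      - "8080:8080"
    restart: always
    depends_on:
      postgres:
        condition: service_healthy
      airflow-init:
        condition: service_completed_successfully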

The above docker-compose file specifies the required services we need to get Airflow up and running: most importantly the scheduler, the webserver, the metadata database (PostgreSQL), and the airflow-init job that initializes the database.

At the top of the file, we define some common settings (as a YAML anchor) that are reused by every Docker container or service.

Step 3: Environment variables

We successfully created a docker-compose file with the mandatory services inside. However, to complete the installation process and configure Airflow properly, we need to provide some environment variables.

Still inside your Airflow folder, create a .env file with the following content:
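A sketch of the kind of variables such a file contains (all values here are assumptions meant for local development only; the exact file is in the linked repository):

# metadata database (PostgreSQL) credentials
POSTGRES_USER=airflow
POSTGRES_PASSWORD=airflow
POSTGRES_DB=airflow

# Airflow core configuration
AIRFLOW__CORE__EXECUTOR=LocalExecutor
AIRFLOW__DATABASE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow:airflow@postgres:5432/airflow
AIRFLOW__CORE__LOAD_EXAMPLES=False

# credentials for the Airflow web UI user (created by the airflow-init job)
_AIRFLOW_WWW_USER_USERNAME=airflow
_AIRFLOW_WWW_USER_PASSWORD=airflow

# run the Airflow containers with your host user id to avoid permission issues (Linux: `id -u`)
AIRFLOW_UID=50000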

The above variables set the database credentials, the airflow user, and some further configurations.

Most importantly, they define the kind of executor Airflow will utilize. In our case, we make use of the LocalExecutor.

Note: More information on the different kinds of executors can be found here.

Step 4: Run docker-compose

And this is already it!

Just head over to the terminal and spin up all the necessary containers by running

docker compose up -d
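While the containers are starting, we can watch their status with:

docker compose ps

Apart from the short-lived airflow-init job, all services should eventually report a running or healthy state.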

After a short period of time, we can check the results in the Airflow Web UI by visiting http://localhost:8080. Once we sign in with our credentials (airflow:airflow), we gain access to the user interface.

Airflow 2 Web UI [Screenshot by Author]

With a working Airflow environment, we can now create a simple DAG for testing purposes.

First of all, make sure to run pip install apache-airflow locally, so the required Python modules are available and your editor can resolve the Airflow imports (the DAG itself will be executed inside the containers).

Now, inside your Airflow folder, navigate to the dags folder and create a new file called sample_dag.py.

We define a new DAG and some pretty simple tasks.
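A minimal sketch of such a DAG (the dag_id, task ids, and start date are illustrative assumptions; the full example is in the repository linked at the end):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="sample_dag",
    start_date=datetime(2022, 5, 1),
    schedule_interval=None,  # we only trigger this DAG manually from the UI
    catchup=False,
) as dag:

    # placeholder task that does nothing; it only shows up in the Web UI
    start = EmptyOperator(task_id="start")

    # prints "HelloWorld!"; the last line of stdout is pushed to XCom automatically
    say_hello = BashOperator(
        task_id="say_hello",
        bash_command='echo "HelloWorld!"',
    )

    start >> say_hello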

The EmptyOperator serves no real purpose other than to create a mockup task inside the Web UI. By utilizing the BashOperator, we create a somewhat creative output of “HelloWorld!”. This allows us to visually confirm a properly running Airflow setup.

Save the file and head over to the Web UI. We can now start the DAG by manually triggering it.

Manually triggering a DAG [Screenshot by Author]

Note: It may take a while before your DAG appears in the UI. We can speed things up by running the following command in our terminal:

docker exec -it --user airflow airflow-scheduler bash -c "airflow dags list"

Running the DAG shouldn’t take any longer than a couple of seconds.

Once it has finished, we can navigate to XComs and inspect the output (the BashOperator pushes the last line of its stdout to XCom by default).

Navigating to Airflow XComs [Screenshot by Author]
Inspecting the output [Screenshot by Author]

And this is it!

We successfully installed Airflow with docker-compose and gave it a quick test ride.

Note: We can stop the running containers by simply executing docker compose down.

Airflow is a batch-oriented framework that allows us to create complex data pipelines in Python.

In this article, we created a simple and easy-to-use environment to quickly iterate and develop new workflows in Apache Airflow. By leveraging docker-compose we can get straight to work and code new workflows.

However, such an environment should only be used for development purposes and is not suitable for any production environment that requires a more sophisticated and distributed setup of Apache Airflow.

You can find the full code here on my GitHub.

