
How to Build Simple ETL Pipelines With GitHub Actions

ETLs don't have to be complex. When yours are simple, GitHub Actions may be all you need.

Photo by Roman Synkevych 🇺🇦 on Unsplash

If you're into software development, you probably know what GitHub Actions is. It's a utility from GitHub to automate dev tasks. Or, in popular language, a DevOps tool.

But people hardly use it for building ETL pipelines.

The first things that come to mind when discussing ETLs are Airflow, Prefect, and related tools. They are, without a doubt, the best in the market for task orchestration. But many of the ETLs we build are simple, and hosting a separate tool for them is often overkill.

You can use GitHub Actions instead.

This article focuses on GitHub Actions. But if you're on Bitbucket or GitLab, you could use their respective alternatives (Bitbucket Pipelines and GitLab CI/CD) too.

We can run Python, R, or Julia scripts on GitHub Actions. So as a data scientist, you don't have to learn a new language or tool for this. You can even get email notifications when any of your ETL tasks fail.

Free accounts still get 2,000 minutes of computation per month. If you can estimate that your ETL workload fits within that range, GitHub Actions is worth a try.

How do we start building ETLs on GitHub Actions?

Getting started with GitHub Actions is simple. You could follow the official docs, or take the simple steps below.

In your repository, create a directory at .github/workflows. Then create the YAML config file actions.yaml inside it with the following content.

name: ETL Pipeline

on:
  schedule:
    - cron: '0 0 * * *' # Runs at 12:00 AM UTC every day

jobs:
  etl:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v2

      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: '3.9'

      - name: Extract data
        run: python extract.py

      - name: Transform data
        run: python transform.py

      - name: Load data
        run: python load.py

The above YAML automates an ETL (Extract, Transform, Load) pipeline. The workflow is triggered every day at 12:00 AM UTC, and it consists of a single job that runs on the ubuntu-latest environment (whatever image is available at the time).

The configuration is simple. The job has five steps: the first two check out the code and set up the Python environment, while the next three execute the extract.py, transform.py, and load.py scripts in sequence.

This workflow provides an automated and efficient way of extracting, transforming, and loading data on a daily basis using GitHub Actions.
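
One optional addition (my suggestion, not part of the original setup): besides the schedule, GitHub Actions supports a workflow_dispatch trigger that lets you start the workflow manually from the repository's Actions tab. That is handy for testing the pipeline without waiting for the next scheduled run.

on:
  schedule:
    - cron: '0 0 * * *' # Runs at 12:00 AM UTC every day
  workflow_dispatch: # Enables the manual "Run workflow" button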

The Python scripts may vary depending on the scenario. Here’s one of many ways.

# extract.py
# --------------------------------
import requests

# Fetch the raw data; fail the step early on network or HTTP errors
response = requests.get("https://api.example.com/data", timeout=30)
response.raise_for_status()

with open("data.json", "w") as f:
    f.write(response.text)

# transform.py
# --------------------------------
import json

with open("data.json", "r") as f:
    data = json.load(f)

# Perform transformation
transformed_data = [item for item in data if item["key"] == "value"]

# Save transformed data
with open("transformed_data.json", "w") as f:
    json.dump(transformed_data, f)

# load.py
# --------------------------------
import json
from sqlalchemy import create_engine, Table, Column, Integer, String, MetaData

# Connect to database (localhost is a placeholder; replace it with
# a host your workflow can actually reach)
engine = create_engine("postgresql://myuser:mypassword@localhost:5432/mydatabase")

# Create metadata object
metadata = MetaData()

# Define table schema
mytable = Table(
    "mytable",
    metadata,
    Column("id", Integer, primary_key=True),
    Column("column1", String),
    Column("column2", String),
)

# Create the table if it doesn't exist yet
metadata.create_all(engine)

# Read transformed data from file
with open("transformed_data.json", "r") as f:
    data = json.load(f)

# Load data into database inside a transaction (commits on success)
with engine.begin() as conn:
    for item in data:
        conn.execute(
            mytable.insert().values(column1=item["column1"], column2=item["column2"])
        )

The above scripts read data from a placeholder API and push it to a Postgres database.
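
These scripts also assume a few third-party packages are available on the runner. A minimal requirements.txt for this example might look like the one below; psycopg2-binary is my assumption for the Postgres driver SQLAlchemy needs, and you should pin versions that suit your project.

# requirements.txt
# --------------------------------
requests
SQLAlchemy
psycopg2-binary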

Things to consider when deploying ETL pipelines to GitHub Actions.

1. Security: Keep your secrets secure by using GitHub’s secret store and avoid hardcoding secrets into your workflows.

Did you notice that the sample code above hardcodes database credentials? That won't do for a production system.

We have better ways to embed secrets such as database credentials securely.

If you don't encrypt your secrets in GitHub Actions, they will be visible to anyone who has access to the repository's source code. If an attacker gains access to the repository, or the source code leaks, they will be able to see your secret values.

To protect your secrets, GitHub provides a feature called encrypted secrets, which allows you to store your secret values securely in the repository settings. Encrypted secrets are only accessible to authorized users and are never exposed in plaintext in your GitHub Actions workflows.

Here’s how it works.

In the repository settings sidebar, you'll find Secrets and variables for Actions. You can create your secrets here.

Screenshot by the author.

Secrets created here are not visible to anyone. They are encrypted and can be used in the workflow. Even you can't read them back; you can only overwrite them with a new value.

Once you've created the secrets, you can pass them to the workflow as environment variables in the GitHub Actions configuration. Here's how it works:

name: ETL Pipeline

on:
  schedule:
    - cron: '0 0 * * *' # Runs at 12:00 AM UTC every day

jobs:
  etl:
    runs-on: ubuntu-latest
    steps:
      ...

      - name: Load data
        env: # Secrets are exposed to this step as environment variables
          DB_USER: ${{ secrets.DB_USER }}
          DB_PASS: ${{ secrets.DB_PASS }}
        run: python load.py

Now, we can modify the Python scripts to read credentials from environment variables.

# load.py
# --------------------------------
import json
import os
from sqlalchemy import create_engine, Table, Column, Integer, String, MetaData

# Connect to database using credentials from environment variables
engine = create_engine(
    f"postgresql://{os.environ['DB_USER']}:{os.environ['DB_PASS']}@localhost:5432/mydatabase"
)
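
A small defensive tweak worth considering (my addition, not from the original): look the variables up once and fail with a readable message if a secret wasn't passed to the step, since os.environ[...] otherwise raises a bare KeyError.

import os

# Hypothetical helper: fail fast if a required secret is missing
def require_env(name: str) -> str:
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value

db_user = require_env("DB_USER")
db_pass = require_env("DB_PASS")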

2. Dependencies: Make sure to use the correct versions of your dependencies to avoid any issues.

Your Python project may already have a requirements.txt file that specifies dependencies along with their versions. Or, for more sophisticated projects, you may be using a modern dependency-management tool like Poetry.

Either way, you should have a step that sets up your environment before the rest of your ETL runs. You can do this by specifying the following in your YAML configuration.

- name: Install dependencies
  run: pip install -r requirements.txt
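
If dependency installation becomes slow, actions/setup-python can also cache pip downloads between runs. Here's a minimal sketch, assuming a requirements.txt at the repository root:

- name: Set up Python
  uses: actions/setup-python@v4
  with:
    python-version: '3.9'
    cache: 'pip' # Reuses downloaded packages across workflow runs

- name: Install dependencies
  run: pip install -r requirements.txt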

3. Timezone settings: GitHub Actions uses the UTC timezone for schedules, and as of writing this post, you can't change it.

Thus, you must make sure you're converting to the correct timezone. You can use an online converter, or manually adjust your local time to UTC before configuring the schedule.
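
For example (a worked conversion, with IST as an assumed local timezone): to run the pipeline at 8:00 AM IST (UTC+5:30), subtract the offset to get 2:30 AM UTC:

on:
  schedule:
    - cron: '30 2 * * *' # 2:30 AM UTC == 8:00 AM IST (UTC+5:30)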

The biggest caveat of GitHub Actions scheduling is the uncertainty of the execution time. Even if you've configured a job to run at a specific point in time, it may be queued when demand is high, so there can be a short delay before the job actually starts.

If your job depends on exact execution times, GitHub Actions scheduling is probably not a good option. Using a self-hosted runner in GitHub Actions may help.
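
Pointing the job at a self-hosted runner is a one-line change in the job definition; the runner itself must first be registered with your repository or organization:

jobs:
  etl:
    runs-on: self-hosted # Requires a runner registered to your repo or org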

4. Resource Usage: Avoid overloading the resources provided by GitHub.

Even on a free account, GitHub Actions gives you 2,000 minutes of run time per month, but the rules change a bit if you use an OS other than Linux.

Windows runners consume minutes at twice the rate, and macOS runners at ten times the rate, so the free quota is effectively 1,000 Windows minutes or 200 macOS minutes. For example, a daily 10-minute job on macOS burns 100 quota minutes per run, roughly 3,000 per month, well beyond the free allowance.

Conclusion

GitHub Actions is a DevOps tool, but we can use it to run any scheduled task. In this post, we've discussed how to create an ETL pipeline that periodically fetches data from an API and pushes it to a Postgres database.

For simple ETLs, this approach is easy to develop and deploy.

But scheduled jobs in GitHub Actions aren't guaranteed to start at the exact scheduled time. So for tightly time-bound tasks, this approach isn't suitable.


