CI/CD for Machine Learning Model Training with GitHub Actions | by Zoumana Keita | Aug, 2022


A comprehensive guide to using an EC2 instance as a server for training your Machine Learning model

Image by ThisisEngineering RAEng on Unsplash

Proper orchestration of a machine learning pipeline can be performed with multiple open-source tools, and GitHub Actions is one of the best known. It is a tool built into GitHub, primarily designed to automate the building, testing, and deployment of software.

Nowadays, machine learning practitioners use it to automate the entire workflow of their projects. Each workflow is made of specific jobs that can be executed either on GitHub-hosted runners or on your own servers.

At the end of this conceptual blog, you will understand:

  • The benefits of hosting your own runners.
  • How to create an EC2 instance and configure it for the task at hand.
  • How to implement a machine learning workflow with GitHub Actions using your own runners.
  • How to use DVC to store your model metadata.
  • How to use MLFlow to track the performance of your model.

Hosting your own runner allows you to execute jobs within a custom hardware environment with the required processing power, memory, and storage.

Doing so has the following benefits:

  • The user can easily increase or decrease the number of runners, which can be beneficial when it comes to training models in parallel.
  • There is no restriction in terms of operating systems: Linux, Windows, and macOS are all supported.
  • When using cloud services like AWS, GCP, or Azure, the runner can benefit from all the services depending on the subscription level.

To complete this section, we need to perform two main tasks: first, get the AWS and GitHub credentials, then set up the encrypted secrets to synchronize EC2 and GitHub.

Get your AWS and Github credentials

Before starting, you need to first create an EC2 instance, which can be done from this article. There are three main credentials required and the process of acquiring them is detailed below.

→ Get your PERSONAL_ACCESS_TOKEN from your GitHub account. It is used as an alternative to a password and is required to interact with the GitHub API.

→ The ACCESS_KEY_ID and SECRET_ACCESS_KEY can be retrieved by following these 5 steps:

  • Click your username near the top right.
  • Select the security credentials tab.
  • Select Access keys (access key ID and secret access key).
  • Create a new access key.
  • Click Show Access Key to view your Access Key ID & Secret Access Key.
Illustration of the 5 steps to get your AWS Access Key ID and Secret Access Key (Image by Author)

Set up encrypted secrets for synchronization

This step is performed from your project repository on Github. All the steps are described in the Encrypted Secrets section of the following article by Khuyen Tran.

In the end, you should have something similar to the following.

Environment secrets added to Github secrets (Image by Author)

The machine learning task covered in this section is a 3-class classification using a BERT model. It is performed with the following workflow:

Data Acquisition → Data Processing → Model Training & Evaluation → Model & metadata serialization.

We will start by explaining the underlying tasks performed in each step, along with their source code, before explaining how to run the model training using our self-hosted runner.

Data acquisition: responsible for collecting data from DVC storage.

get_data.py
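
The original gist is not reproduced in this extract, so here is a minimal sketch of what get_data.py could look like, assuming the raw data is a CSV file tracked in a DagsHub DVC remote (the repository URL and file path below are placeholders, not the author's actual values):

# get_data.py -- minimal sketch; repo URL and path are placeholders
import pandas as pd
import dvc.api

# Stream the raw dataset directly from the DVC remote
with dvc.api.open(
    "data/raw_data.csv",                       # hypothetical path inside the repo
    repo="https://dagshub.com/<user>/<repo>",  # placeholder DagsHub repository
) as f:
    data = pd.read_csv(f)

# Persist a local copy so the next pipeline step can pick it up
data.to_csv("data/raw_data.csv", index=False)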

Data processing: limited to special character removal for simplicity.

prepare_data.py
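
Again as a hedged sketch rather than the author's exact code: since the processing is described as simple special character removal, prepare_data.py might boil down to the following (the file paths and the text column name are assumptions):

# prepare_data.py -- minimal sketch; paths and column names are assumptions
import re
import pandas as pd

def remove_special_characters(text: str) -> str:
    # Keep letters, digits, and whitespace; replace everything else with a space
    return re.sub(r"[^A-Za-z0-9\s]", " ", str(text)).strip()

data = pd.read_csv("data/raw_data.csv")
data["text"] = data["text"].apply(remove_special_characters)
data.to_csv("data/prepared_data.csv", index=False)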

Model training & evaluation: fine-tune a BERT transformer model.

The model training consists of two main parts: (1) training the model and evaluating its performance, and (2) using MLFlow to track the metrics.

The training generates two main files; a short sketch of how they can be written follows the list:

  • model/finetuned_BERT_epoch{x}.model: corresponding to the fine-tuned BERT model generated after epoch n°x.
  • metrics/metrics.json: containing the precision, recall, and F1-score of the previously generated model.
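
The article does not show this part of the source code; the snippet below is a minimal sketch of how the two artifacts could be produced, assuming a PyTorch model and scikit-learn metrics (the variables y_true, y_pred, model, and epoch are assumed to come from the training loop):

import json
import torch
from sklearn.metrics import precision_recall_fscore_support

# y_true, y_pred, model, and epoch are assumed defined by the training loop
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted"
)

# Save the fine-tuned weights for this epoch
torch.save(model.state_dict(), f"model/finetuned_BERT_epoch{epoch}.model")

# Save the metrics alongside the model (cast to float for JSON)
with open("metrics/metrics.json", "w") as f:
    json.dump(
        {"precision": float(precision), "recall": float(recall), "f1-score": float(f1)},
        f,
    )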

For the model tracking, we need to acquire the MLFlow credentials, which can be found in the top right corner of your DagsHub project repository.

Steps to get your MLFlow credentials from your DagsHub project (Image by Author)

The script below only shows the model tracking section with MLFlow because showing all the source code would be too long. However, the complete code is available at the end of the article.

MLFlow section from the training file.
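
Since the gist is not embedded in this extract, the snippet below sketches what the MLFlow section likely amounts to, using the standard MLFlow API against the DagsHub tracking server (the tracking URI is a placeholder, and the logged parameter and variable names are assumptions coming from the training step):

import mlflow

# DagsHub exposes an MLFlow tracking server per repository; the
# MLFLOW_TRACKING_USERNAME and MLFLOW_TRACKING_PASSWORD environment
# variables are expected to hold the credentials shown above
mlflow.set_tracking_uri("https://dagshub.com/<user>/<repo>.mlflow")  # placeholder

# epochs, precision, recall, and f1 are assumed to come from the training step
with mlflow.start_run():
    mlflow.log_param("epochs", epochs)
    mlflow.log_metric("precision", precision)
    mlflow.log_metric("recall", recall)
    mlflow.log_metric("f1-score", f1)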

Model & metadata serialization: responsible for saving the following metadata into DVC storage: metrics.json and finetuned_BERT_epoch{x}.model.

Train your model using an EC2 self-hosted runner

This section focuses on launching the training of the model after provisioning the self-hosted runner that will execute it, instead of using a default GitHub Actions runner.

The training workflow is implemented in .github/workflows/training.yaml

You can name the training.yaml file whatever you want. However, it must be located in the .github/workflows folder and have the .yaml extension; GitHub will then recognize it as a workflow file.

Below is the general format of the training workflow:

training.yaml
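
The full file lives in the project repository; as a minimal skeleton (job bodies elided, workflow name assumed), its structure looks like this:

name: model-training            # the name of the workflow (assumed)
on: [push, pull_request]        # the events that trigger the workflow
jobs:
  deploy-runner:                # job 1: provision the EC2 self-hosted runner
    # ...
  train-model:                  # job 2: run the training on that runner
    # ...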
  • name: the name of the workflow.
  • on: the events that trigger the entire workflow; in our case, push and pull_request.
  • jobs: contains the set of jobs in our workflow which are: (1) the deployment of the EC2 runner, and (2) the training of the model.

What are the steps within each job?

Before diving into the source code, let’s understand the underlying visual illustration of the workflow.

General workflow of the blog scope (Image by Author)
  1. The workflow is started by a push or pull request by the developer/Machine Learning Engineer.
  2. The training is triggered in a provisioned EC2 instance.
  3. The metadata (metrics.json and model) is stored on DVC and the metrics values are tracked by MLFlow on DagsHub.
The complete workflow of the training process in the self-hosted runner (training.yaml)

→ First job: deploy-runner

We start by using the Continuous Machine Learning (CML) library by Iterative to automate the server provisioning and model training. Then we acquire all the credentials required to run the provisioning; a sketch of the job follows the list below.

The EC2 instance being used is:

  • a free-tier t2.micro instance.
  • located in the us-east-1a region.
  • tagged with the cml-runner label that will be used to identify the instance when training the model.
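
This is not the author's exact job, but a hedged sketch of what deploy-runner could look like using the CML runner command (action versions are assumptions, and the region flag is given in its region form, us-east-1, since CML expects a region rather than an availability zone):

deploy-runner:
  runs-on: ubuntu-latest              # the provisioning itself runs on a GitHub-hosted runner
  steps:
    - uses: actions/checkout@v3
    - uses: iterative/setup-cml@v1    # installs the CML command-line tool
    - name: Deploy runner on EC2
      env:
        REPO_TOKEN: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
        AWS_ACCESS_KEY_ID: ${{ secrets.ACCESS_KEY_ID }}
        AWS_SECRET_ACCESS_KEY: ${{ secrets.SECRET_ACCESS_KEY }}
      run: |
        cml runner \
          --cloud=aws \
          --cloud-region=us-east-1 \
          --cloud-type=t2.micro \
          --labels=cml-runner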

→ Second job: train-model

By using the previously provisioned runner, we are able to perform the training with the CML library, as sketched below.
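
Again a hedged sketch rather than the definitive implementation: the job targets the cml-runner label set above, and the script names follow the ones used in this article (the requirements file, secret names, and training entry point are assumptions):

train-model:
  needs: deploy-runner                  # wait until the EC2 runner is up
  runs-on: [self-hosted, cml-runner]    # route the job to the provisioned instance
  steps:
    - uses: actions/checkout@v3
    - name: Train the model
      env:
        MLFLOW_TRACKING_USERNAME: ${{ secrets.MLFLOW_TRACKING_USERNAME }}  # assumed secret names
        MLFLOW_TRACKING_PASSWORD: ${{ secrets.MLFLOW_TRACKING_PASSWORD }}
      run: |
        pip install -r requirements.txt  # assumed dependency file
        python get_data.py
        python prepare_data.py
        python train.py                  # assumed training entry point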

Save the metadata in DVC storage.

All the files generated by the training step are stored on the local machine. But we might want to keep track of all the changes to those files, which is why we use DVC in this section.

There are overall two main steps: (1) get your credentials, and (2) implement the logic to push the data.

→ Get your DVC credentials.

Getting your DVC credentials follows the same process as for MLFlow:

Steps to get your DVC credentials from your DagsHub project (Image by Author)

Implement the logic

First, we implement the DVC configuration and data-saving logic in the save_metadata.py file.

save_metadata.py
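
A hedged sketch of that configuration and push logic, driving the DVC command-line tool from Python (the remote URL and the credential environment variables are assumptions, not the author's actual names):

# save_metadata.py -- minimal sketch; remote URL and env var names are assumptions
import os
import subprocess

def run(cmd):
    # Fail loudly if any DVC command errors out
    subprocess.run(cmd, check=True)

# Point DVC at the DagsHub remote and authenticate with the stored credentials
run(["dvc", "remote", "add", "-f", "origin", "https://dagshub.com/<user>/<repo>.dvc"])
run(["dvc", "remote", "modify", "origin", "--local", "auth", "basic"])
run(["dvc", "remote", "modify", "origin", "--local", "user", os.environ["DVC_USER"]])
run(["dvc", "remote", "modify", "origin", "--local", "password", os.environ["DVC_PASSWORD"]])

# Track the training artifacts and push them to the remote
run(["dvc", "add", "metrics/metrics.json", "model"])
run(["dvc", "push"])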

Then, this file is called by the save_metadata job in the workflow.

jobs:
  ...  # deploy-runner & train-model jobs
  save_metadata:
    runs-on: [self-hosted, cml-runner]  # assumed: reuse the provisioned runner
    steps:
      - name: Save metadata into DVC
        run: python save_metadata.py

Now I can commit the changes and push the code to GitHub in order to perform a pull request.

git add .
git commit -m "Pushing the code for pull request experimentation"
git push -u origin main

After changing the number of epochs to 2, we get a new version of the training script, which triggers the workflow through a pull request. Below is an illustration.

Pull request

The following illustration shows the model metrics before and after the pull request, with 1 epoch and 2 epochs respectively.

When implementing a complete machine learning pipeline with GitHub Actions, using a self-hosted server can be beneficial in many ways, as illustrated at the beginning of the article.

In this conceptual blog, we explored how to provision an EC2 instance, trigger the training of the model from push and pull requests, save the metadata into DVC storage, and track the model performance using MLFlow.

Mastering these skills will help you take the entire machine learning pipeline of your organization to the next level.

Also, if you like reading my stories and wish to support my writing, consider becoming a Medium member to unlock unlimited access to stories on Medium.

Feel free to follow me on Medium, Twitter, or say Hi on LinkedIn. It is always a pleasure to discuss AI, ML, Data Science, NLP, and MLOps stuff!

  • Source code of the project
  • About self-hosted runners
  • Training and saving models with CML on a self-hosted AWS EC2 runner