Techno Blender
Digitally Yours.

Practical MLOps using Azure ML. Automating ML pipelines using Azure ML… | by Anupam Misra | Feb, 2023



Photo by Luca Bravo on Unsplash

Automating ML pipelines using Azure ML CLI (v2) & GitHub Actions

Machine learning models affect our interaction with the world as much as software products we use on a regular basis. Just like DevOps is required for seamless CI/CD, MLOps has become imperative for continuously building up-to-date models and utilising their predictions.

In this article, we are going to build an end-to-end MLOps pipeline using the Azure ML CLI (v2) and GitHub Actions. I hope this article serves as the starting point for your next MLOps project!

This article will help you simulate this scenario:

Data drifts frequently and is available through an API. Hence the model needs to be retrained and redeployed to the online endpoint at a defined frequency.

This is achieved via weekly cron jobs through the following steps :

  1. Download the data using an API and register it as an Azure dataset.
  2. Manage compute and trigger model training jobs in Azure ML Studio.
  3. Register the model created in the latest job.
  4. Deploy the new model to an online endpoint.

There are two options to automate your ML pipelines in Azure ML:
1. Azure DevOps
2. GitHub Actions

You can read about them here. I chose GitHub Actions for its ease of use.

A quick recap about MLOps

The need for MLOps and the steps to achieve it:

  1. Quality control via consistency and change tracking:
    a. Initial setup for entire project — IDE, workspace, permissions
    b. Environment versioning
    c. Data versioning
    d. Code versioning
    e. Versioning of other components
  2. Fast experimentation with models:
    a. Tracking model hyperparameters
    b. Tracking model metrics, bias, fairness and explainability on different slices of data
    c. Maintaining links between changing parts of the ML pipeline
  3. Seamless model deployment and comprehensive model monitoring:
    a. Fast model deployment into production environment
    b. Staged rollout, blue-green or other deployment strategy
    c. Tracking model efficacy to trigger retraining
    d. Tracking data drift to trigger retraining

I have structured the entire ML project into three different pipelines to achieve the aforementioned MLOps goals:

  1. Build pipeline
  2. Training pipeline
  3. Deployment pipeline

After discussing these pipelines, we will look into the code implementation.

1.1 Initial setup

Step 1: Setting up Azure ML Studio:

We will be doing our data versioning, model training and deployment using MS Azure. Please follow the below steps to create your Azure ML workspace:

  1. Log in to your Azure account or get a free Azure subscription from here.
  2. Create a Resource group (ref).
  3. Create an ML workspace: navigate to ml.azure.com, click on Create Workspace and follow the on-screen instructions.

When your Azure ML workspace is created, you should be able to see a screen like this:

Azure ML workspace via ml.azure.com

You will not see any jobs there yet; don't worry!

Step 2: Linking GitHub Actions with Azure ML Studio:

Initialise a repository in GitHub and go to Settings > Secrets and variables > Actions > New repository secret.

Open a new browser tab to create a Service principal for accessing your ML workspace (ref).
Save the generated JSON as AZURE_CREDENTIALS in your repository secrets.

GitHub actions secrets in the repository

Step 3: Generating a personal access token (PAT)

With the default GITHUB_TOKEN you cannot edit workflow files (files in .github/workflows), so you need to add a PAT. You need to edit workflow files in order to automatically version training runs, model versions, etc.

In GitHub, generate the PAT from Settings > Developer settings > Personal access tokens.

Personal access token for editing workflow

Save the PAT to your repository secrets with the below repository permissions:

Repository permissions for PAT

1.2 Environment versioning

There are two environments needed:

  1. Training environment — For model training dependencies
  2. Deployment environment — For model serving dependencies

You could also keep them as the same environment for smaller projects.

1.3 Data versioning

In our example, data is downloaded every week and registered in the Azure datastore as a dataset. Data versioning is important for tracking the lineage of models. In our example, it is done by jobs/data_download.save_to_data_upload(…).

1.4 Code versioning

Code versioning is done through GitHub.

1.5 Versioning of other components

You also need to version runs, models and other components. To automate their naming, you have to edit them in a preceding cron job. In our example, this is done by jobs/update_training_yamls.py.

1.6 Automated testing

To automatically test the Python files we can use pytest, and also track code coverage.
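For instance, a minimal test file could look like the sketch below. The frame-building helper and its columns are my assumptions for illustration, not the repository's actual tests:

```python
# test_data_download.py -- run with: pytest --cov=jobs
# Hypothetical sketch of tests for the data step; the real repository's
# functions and columns may differ.
import pandas as pd

def make_frame():
    # stand-in for the dataframe that data_download.py would produce
    return pd.DataFrame({"Date": ["2023-02-13"], "Close": [413.7]})

def test_frame_has_expected_columns():
    assert {"Date", "Close"} <= set(make_frame().columns)

def test_close_prices_are_positive():
    assert (make_frame()["Close"] > 0).all()
```

Running pytest with the `--cov` flag (from the pytest-cov plugin) reports line coverage alongside the test results.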

2.1 Tracking model hyperparameters

In our example, we have used MLflow with PyTorch to log the following details:

Metrics:

Metrics captured during each training job

Runtime model parameters:

Parameters versioned during each run

2.2 Tracking model metrics, bias, fairness and explainability on different slices of data

Since I have used stock data to simulate constantly changing data, I have skipped this part. However, in most ML use cases, these are very important metrics to judge the model performance.

2.3 Maintaining links between changing parts of the ML pipeline

Azure ML Studio automatically links everything.
Example lineage tracking during model training:

Training job snapshot

Similarly data and models are also automatically versioned and their lineage is tracked.

3.1 Fast model deployment into production

Creating the endpoint and deploying the first model initially takes some time. After that, newly registered models can be used to update the endpoint much faster.

3.2 Staged rollout, blue-green or different deployment strategy

Different deployment strategies can be used. In our case we overwrite the previously deployed model. However, with Azure ML, blue-green deployment can be achieved very easily.

3.3 Tracking model efficacy to trigger retraining

After model deployment, we need to track model performance against known labels. This helps us identify the data strata in which the model performs poorly, and whether we need to collect more data or undertake other measures during the next retraining.
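As a sketch of what such tracking could look like, mean absolute error per data stratum flags where the deployed model struggles. The strata and the error metric are my assumptions; the repository does not implement this step:

```python
def mae_by_stratum(records):
    """Mean absolute error per stratum from (stratum, y_true, y_pred) triples.

    The strata themselves (e.g. volatility buckets) are an assumption
    for illustration.
    """
    totals = {}
    for stratum, y_true, y_pred in records:
        acc = totals.setdefault(stratum, [0.0, 0])  # [error sum, count]
        acc[0] += abs(y_true - y_pred)
        acc[1] += 1
    return {stratum: err / n for stratum, (err, n) in totals.items()}
```

A stratum whose error is much higher than the rest is a candidate for targeted data collection before the next retraining.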

3.4 Tracking data drift to trigger retraining

There are two schools of thought on model retraining:
1. Schedule-based
2. Drift-based

Both options have advantages and disadvantages. In this example I follow schedule-based retraining, so I have not implemented any data drift monitors. However, Azure has tools to monitor data drift.
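Azure's drift monitors aside, a simple drift check can also be hand-rolled. Below is a hedged sketch using the population stability index (PSI) in plain numpy; the 0.2 alert threshold is a common rule of thumb, not something the repository implements:

```python
import numpy as np

def population_stability_index(reference, current, bins=10):
    """PSI between a reference sample and the current sample.

    Bins are reference quantiles; a small epsilon guards against empty bins.
    """
    cut_points = np.quantile(reference, np.linspace(0, 1, bins + 1))[1:-1]
    ref_prop = np.bincount(np.digitize(reference, cut_points), minlength=bins) / len(reference) + 1e-6
    cur_prop = np.bincount(np.digitize(current, cut_points), minlength=bins) / len(current) + 1e-6
    return float(np.sum((cur_prop - ref_prop) * np.log(cur_prop / ref_prop)))

# Rule of thumb: PSI > 0.2 signals meaningful drift worth investigating a retrain.
```

A drift-based pipeline would run this weekly against the newly downloaded data and trigger the training workflow only when the index crosses the threshold.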

As engineers, things become much clearer when we code something from the ground up. So here you go!

Code repository: coderkol95/MLOps_stock_prediction

Project folder structure:

Brief information about the folders and files:

.github/workflows/

The yml files for pipeline control are placed here. These trigger data download/upload, model training, registration and deployment via cron jobs.

data_pipeline.yml: Frequency – Every Monday at 1:01 am
– download ticker data and update yml file
Download data via API to a csv file and update the data_upload.yml file with dataset tags, version and path.
– edit yaml files
Update versions of other components in the yml files, like job_name, model version etc., which will be used during the run.
– push files to github
Push the updated yml files and downloaded csv to the repository.
– upload to azure
Register the dataset in the Azure datastore.

name: data upload to azure

env:
  ticker: WIPRO.NS
  start: 366
  end: 1

on:
  schedule:
    - cron: "1 1 * * 1"

jobs:
  datawork:
    runs-on: ubuntu-latest
    steps:
      - name: checkout repository
        uses: actions/checkout@v2
        with:
          token: ${{ secrets.PAT }}
          repository: 'coderkol95/MLOps_stock_prediction'
      - name: setup python 3.9
        uses: actions/setup-python@v4
        with:
          python-version: "3.9"
      - name: install python packages
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt
      - name: download ticker data and update yml file
        run: python data_download.py --ticker $ticker --start $start --end $end
        id: data
        working-directory: jobs
      - name: edit yaml files
        run: python update_training_yamls.py
        working-directory: jobs
      - name: push files to github
        run: |
          git config --local user.email "[email protected]"
          git config --local user.name "GitHub Action"
          git add -A
          git commit -m "Ticker data for $ticker downloaded and YAML file updated." || exit 0
          git push https://x-access-token:${GITHUB_TOKEN}@github.com/${GITHUB_REPOSITORY}.git HEAD
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
      - name: azure login
        uses: azure/login@v1
        with:
          creds: ${{ secrets.AZURE_CREDENTIALS }}
      - name: setup
        run: bash setup.sh
        working-directory: cli
        continue-on-error: true
      - name: upload to azure
        run: az ml data create -f jobs/data_upload.yml

model_pipeline.yml: Frequency – Every Monday at 2:01 am
* train-job
Compute creation and model training on the latest dataset.
Dataset preparation and model training are done via PyTorch Lightning. All logging is done via MLflow. For code details, please refer to my repository.

* register-job
Registration of the model from the latest run.

* delete-compute
Compute deletion after training has completed.

name: training and registering model

env:
  job_name: ga-run-10
  compute_name: computer456
  registered_model_name: GA_model

on:
  schedule:
    - cron: "1 2 * * 1"

jobs:
  train-job:
    runs-on: ubuntu-latest
    steps:
      - name: check out repo
        uses: actions/checkout@v2
      - name: azure login
        uses: azure/login@v1
        with:
          creds: ${{ secrets.AZURE_CREDENTIALS }}
      - name: setup
        run: bash setup.sh
        working-directory: cli
        continue-on-error: true
      - name: create-compute
        run: az ml compute create --name $compute_name --size STANDARD_DS11_v2 --min-instances 1 --max-instances 2 --type AmlCompute
      - name: train-job
        working-directory: jobs
        run: az ml job create --file train.yml --debug --stream # --stream keeps the step running for as long as the model trains.

  # If training is expected to take a long time, registration can be scheduled in a separate cron job, triggered later.
  register-job:
    needs: [train-job]
    runs-on: ubuntu-latest
    steps:
      - name: check out repo
        uses: actions/checkout@v2
      - name: azure login
        uses: azure/login@v1
        with:
          creds: ${{ secrets.AZURE_CREDENTIALS }}
      - name: setup
        run: bash setup.sh
        working-directory: cli
        continue-on-error: true
      - name: register-model
        run: >
          az ml model create
          --name $registered_model_name
          --version 10
          --path azureml://jobs/ga-run-10/outputs/artifacts/paths/outputs/
          --type custom_model

  delete-compute:
    needs: [train-job]
    runs-on: ubuntu-latest
    steps:
      - name: check out repo
        uses: actions/checkout@v2
      - name: azure login
        uses: azure/login@v1
        with:
          creds: ${{ secrets.AZURE_CREDENTIALS }}
      - name: setup
        run: bash setup.sh
        working-directory: cli
        continue-on-error: true
      - name: delete-step
        run: az ml compute delete --name $compute_name --yes

deployment_pipeline.yml: Frequency – Every Monday at 3:01 am
* Endpoint & deployment creation (if it's the first time)
* Updating the online deployment with the latest model (shown below)

name: model deployment

on:
  schedule:
    - cron: "1 3 * * 1"

jobs:
  # compare-job:
  #   Compare whether the model is good enough
  #   Profile the model
  #   If it is good enough, proceed to the next step
  deployment-job:
    runs-on: ubuntu-latest
    steps:
      - name: check out repo
        uses: actions/checkout@v2
      - name: azure login
        uses: azure/login@v1
        with:
          creds: ${{ secrets.AZURE_CREDENTIALS }}
      - name: setup
        run: bash setup.sh
        working-directory: cli
        continue-on-error: true
      # Commented out as endpoint creation is only needed during the first run
      # - name: create-endpoint
      #   run: az ml online-endpoint create --name ga-deployment
      - name: deployment-step
        working-directory: jobs
        run: az ml online-deployment update -f deploy.yml # first time it'll be `az ml online-deployment create -f deploy.yml --all-traffic`

I have kept the cron jobs' executions 1 hour apart as each completes within an hour; you may keep a longer gap if required. You may also set flags to capture job completion.

cli/

setup.sh: Configures the VM on which the code runs, for Azure ML

GROUP="RG"
LOCATION="eastus"
WORKSPACE="AzureMLWorkspace"

az configure --defaults group=$GROUP workspace=$WORKSPACE location=$LOCATION

az extension remove -n ml
az extension add -n ml

jobs/

Azure-specific YAML files are kept here, along with Python scripts for individual pipeline step execution.

data_download.py
get_ticker_data(…)
Downloads the data via an API call and saves it to a csv file. I have downloaded the data using the Yahoo Finance API.
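The workflow passes the window as day offsets (--start 366 --end 1). A hedged sketch of how those offsets could map to a download; yfinance is my assumed client here, not verified against the repository:

```python
from datetime import date, timedelta

def window(start_days_ago, end_days_ago, today=None):
    """Turn the workflow's integer offsets into ISO date strings."""
    today = today or date.today()
    start = (today - timedelta(days=start_days_ago)).isoformat()
    end = (today - timedelta(days=end_days_ago)).isoformat()
    return start, end

def download_ticker(ticker="WIPRO.NS", start_days_ago=366, end_days_ago=1):
    # yfinance is an assumption; the repository may use a different Yahoo Finance client
    import yfinance as yf
    start, end = window(start_days_ago, end_days_ago)
    frame = yf.download(ticker, start=start, end=end, interval="1d")
    frame.to_csv(f"{ticker}.csv")
    return frame
```

Computing the window separately keeps the date arithmetic testable without a network call.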

get_dataset_tags(…)
Versioning the data and adding tags.

Dataset tags generated before uploading it to Azure

save_to_data_upload(…)
Writes the dataset specification to the Azure yml file for uploading to the Azure datastore.

For the code, you may refer to my repository.
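A hedged re-creation of these two helpers, using only the tags visible in data_upload.yml; the exact computations and the versioning scheme are my assumptions:

```python
import statistics

def get_dataset_tags(dates, closes):
    # summary statistics used as Azure dataset tags, mirroring those in data_upload.yml
    return {
        "Length": len(closes),
        "Start": min(dates),
        "End": max(dates),
        "Median": round(statistics.median(closes), 2),
        "SD": round(statistics.stdev(closes), 2),
    }

def save_to_data_upload(name, csv_path, tags, version, out_path="data_upload.yml"):
    # write the data-asset spec consumed by `az ml data create -f`
    spec = (
        "$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json\n"
        "type: uri_file\n"
        f"name: '{name}'\n"
        f"path: '{csv_path}'\n"
        f"tags: {tags}\n"
        f"version: {version}\n"
    )
    with open(out_path, "w") as handle:
        handle.write(spec)
```

Writing the spec as plain text avoids a YAML-library dependency for such a small file.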

data_upload.yml
This yml file is updated by jobs/data_download.save_to_data_upload(…) and registers the dataset in Azure.

$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json

type: uri_file
name: 'WIPRO'
description: Stock data for WIPRO.NS during 2022-02-14:2023-02-13 in 1d interval.
path: '../data/WIPRO.NS.csv'
tags: {'Length': 249, 'Start': '2022-02-14', 'End': '2023-02-13', 'Median': 413.7, 'SD': 69.09}
version: 20230215

deploy.yml
This yml file is updated by jobs/update_training_yamls.py before each cron job.

$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment.schema.json
name: green
endpoint_name: ga-deployment
model: azureml:GA_model:10
code_configuration:
  code: ../jobs
  scoring_script: deployment.py
environment: azureml:stock-pricing:5
instance_type: Standard_DS1_v2
instance_count: 1

deployment.py
The script used at the endpoint to generate online predictions. For the code, you may refer to my repository.

init()
Initializes the model and the datamodule used by pytorch lightning.

run()
Used to serve online predictions from the model.
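A hedged sketch of the init()/run() contract of an Azure ML scoring script; the placeholder predictor stands in for the repository's PyTorch Lightning model:

```python
import json
import os

model = None

def init():
    # Azure ML sets AZUREML_MODEL_DIR to the folder of the deployed model;
    # the repository would load its PyTorch Lightning checkpoint from there.
    global model
    model_dir = os.getenv("AZUREML_MODEL_DIR", ".")  # unused by this stub
    model = lambda batches: [sum(xs) / len(xs) for xs in batches]  # placeholder predictor

def run(raw_data):
    # raw_data arrives as a JSON string; the return value goes back to the caller
    batches = json.loads(raw_data)["data"]
    return json.dumps({"predictions": model(batches)})
```

The endpoint calls init() once at container start and run() per request, so model loading should stay out of run().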

train.py
Model training script using PyTorch Lightning and MLflow. For the code you may refer to my repository.

train.yml
This yml file is updated by jobs/update_training_yamls.py before each cron job.

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
name: ga-run-10
tags:
  modeltype: pytorch
code: ../jobs
command: >-
  python train.py
  --data ${{inputs.data}}
inputs:
  data:
    type: uri_file
    path: azureml:WIPRO@latest
environment: azureml:stock-pricing:4
compute: azureml:computer456
display_name: stock
experiment_name: ga_train_job
description: Training job via Github actions

update_training_yamls.py
Updates several components' versions which need to change before every run, like the run ID, the model version to be registered and the model to be deployed. For the code you may refer to my repository.
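A hedged sketch of one such update: bumping the trailing run number in a line like name: ga-run-10. The real script also updates model versions and artifact paths:

```python
import re

def bump_run_number(yaml_text, prefix="ga-run-"):
    # replace e.g. "ga-run-10" with "ga-run-11" wherever it appears in the yml text;
    # the prefix is taken from the workflows above, the mechanics are an assumption
    pattern = re.compile(re.escape(prefix) + r"(\d+)")
    return pattern.sub(lambda m: f"{prefix}{int(m.group(1)) + 1}", yaml_text)
```

Applied to train.yml and deploy.yml before each cron cycle, this keeps run names and registered model versions in lockstep.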

I hope you now have an idea of how to implement an automated end-to-end MLOps project using MS Azure. For the detailed code implementation you may refer to my repository.


