
Deploying a Data Science Platform on AWS: Parallelizing Experiments (Part III) | by Eduardo Blancas | Nov, 2022



Data Science Cloud Infrastructure

Photo by Chris Ried on Unsplash

In our previous post, we configured Amazon ECR to push a Docker image to AWS and configured an S3 bucket to write the output of our Data Science experiments.

In this final post, we’ll show you how to use Ploomber and Soopervisor to create grids of experiments that you can run in parallel on AWS Batch, and how to request resources dynamically (CPUs, RAM, and GPUs).

Hi! My name is Eduardo, and I like writing about all things MLOps. If you want to keep up-to-date with my content, follow me on Medium or Twitter. Thanks for reading!

This is what our architecture looks like:

Platform’s architecture. Image by author.

We’ll be using the aws CLI again to configure the infrastructure, so ensure you’re authenticated and have enough permissions:
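For example, the following prints the identity the CLI is using (and fails if you're not authenticated):

# verify the aws CLI can authenticate and show the active identity
aws sts get-caller-identity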

We’ll be using Docker for this part, so ensure it’s up and running:
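A quick check: this lists running containers and errors out if the Docker daemon isn't running:

docker ps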

First, let’s create another ECR repository to host our Docker image:
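For example (the repository name below is a placeholder; pick whatever name you prefer):

# create an ECR repository for the project image
aws ecr create-repository --repository-name ds-platform-grid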

Output:

Assign the REPOSITORY variable to the output of the previous command:
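Copy the repositoryUri field from the JSON output; it will look roughly like this (the account ID, region, and name below are placeholders):

REPOSITORY=123456789012.dkr.ecr.us-east-1.amazonaws.com/ds-platform-grid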

We’ll now get a sample project. First, let’s install the required packages.

Note: We recommend you install them in a virtual environment.
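The two packages we'll use are Ploomber and Soopervisor:

pip install ploomber soopervisor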

Download the example in the grid directory:
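Ploomber ships a command to fetch example projects. The example name below is an assumption on my part (run ploomber examples with no arguments to list the available ones):

ploomber examples -n cookbook/grid -o grid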

Output:

This downloaded a full project:
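You can list the files to confirm, for example:

ls grid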

Output:

The example we downloaded prepares some data and trains a dozen Machine Learning models in parallel. Here's a graphical representation:

Graphical representation of our workflow. Image by author.

Let’s look at the pipeline.yaml file, which specifies the tasks in our workflow:

Output:

The pipeline.yaml file is one of the interfaces Ploomber offers for describing computational workflows (you can also declare them with Python).

The tasks section contains five entries, one per task. The first four are Python functions that process some input data (tasks.raw.get, tasks.features.sepal, tasks.features.petal, tasks.features.features), and the last one is a script that fits a model (scripts/fit.py).

Note the last entry is longer because it’s a grid task: it’ll use the same script and execute it multiple times with different parameters. In total, the script will be executed 12 times, but this could be a larger number.
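To make the grid idea concrete, here's a minimal sketch of such an entry (the parameter names, values, and product paths are illustrative, not the exact contents of the downloaded example):

- source: scripts/fit.py
  # Ploomber generates one task per combination of the parameter lists below
  name: fit-
  product:
    nb: output/report.html
    model: output/model.pickle
  grid:
    n_estimators: [1, 3, 5]
    criterion: [gini, entropy]

The cross product of the lists determines how many copies of the script run.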

To learn more about the pipeline.yaml file and Ploomber, check our documentation.

Let’s now configure AWS Batch as our cloud environment (Kubernetes, SLURM, and Airflow are supported as well).
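The command below creates the aws-env target shown in the output (run it from the grid directory):

soopervisor add aws-env --backend aws-batch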

Output:

========================= Loading DAG =========================
No pipeline.aws-env.yaml found, looking for pipeline.yaml instead
Found /Users/Edu/dev/ploomber.io/raw/ds-platform-part-iii/grid/pipeline.yaml.
Loading...
Adding /Users/Edu/dev/ploomber.io/raw/ds-platform-part-iii/grid/aws-env/Dockerfile...
============================= Done ============================
Fill in the configuration in the 'aws-env' section in soopervisor.yaml then submit to AWS Batch with: soopervisor export aws-env
Environment added, to export it:
$ soopervisor export aws-env
To force execution of all tasks:
$ soopervisor export aws-env --mode force

There are a few extra things we need to configure. To facilitate the setup, we created a script that automates these tasks depending on your AWS infrastructure; let's download it:

Output:

Now, set the values for the AWS Batch job queue and artifacts bucket you want to use. (If in doubt, you might want to revisit the previous tutorials: Part I, and Part II).
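For example, as shell variables (the names below are placeholders; substitute the queue and bucket you created in Parts I and II):

JOB_QUEUE=ds-platform-queue
BUCKET_NAME=ds-platform-artifacts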

Let’s generate the configuration file that specifies the job queue to use and the ECR repository to upload our code:

Output:

Now, let’s specify the S3 client so the outputs of the pipeline are uploaded to the bucket:

Output:

Modify the pipeline.yaml so it uses the client we created in the step above:
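In Ploomber, this means adding a clients section to pipeline.yaml that points File products to a function returning the storage client. The dotted path below is illustrative; it should match the client function created in the previous step:

clients:
  # dotted path to a function that returns the S3 client (name is illustrative)
  File: clients.get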

Let’s upload our project to ECR:
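The image build and push happen later via soopervisor export, so this step typically amounts to authenticating Docker against the ECR registry; a sketch (the region is an assumption):

# strip the repository name to get the registry host, then log Docker in
aws ecr get-login-password --region us-east-1 | \
  docker login --username AWS --password-stdin ${REPOSITORY%%/*}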

Output:

Ensure boto3 is installed as part of our project, since we need it to upload files to S3:
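A minimal way to do it, assuming the example pins its dependencies in a requirements.lock.txt file:

# add boto3 to the project's dependencies (filename is an assumption)
echo 'boto3' >> requirements.lock.txt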

We’re now ready to schedule our workflow! Let’s use the soopervisor export command to build the Docker image, push it to ECR and schedule the jobs on AWS Batch:
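This is the command soopervisor printed earlier:

soopervisor export aws-env
# add --mode force to re-run all tasks from scratch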

You can monitor execution in the AWS Batch console, or use the following command (just ensure you change the job name); it retrieves the status of the fit-random-forest-1-gini task:
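For instance, with the AWS CLI (a sketch; note that list-jobs only returns RUNNING jobs unless you pass --job-status):

aws batch list-jobs --job-queue $JOB_QUEUE --job-status SUCCEEDED \
  --query "jobSummaryList[?jobName=='fit-random-forest-1-gini'].[jobName,status]"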

Output:

After a few minutes, all tasks should be executed!

Let’s check the outputs in the S3 bucket:
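For example:

aws s3 ls s3://$BUCKET_NAME --recursive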

Output:

You can see there’s a combination of .pickle files (the trained models), .csv (processed data), and .html (reports generated from the training script).

Let’s download one of the reports:
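Something like this (the object key below is illustrative; use one of the .html keys from the listing above):

aws s3 cp s3://$BUCKET_NAME/outputs/report.html report.html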

Output:

Open the report.html and you’ll see the outputs of the training script!

Let’s take a look at the grid/soopervisor.yaml file which configures the cloud environment:

Output:

The soopervisor.yaml file specifies the backend to use (aws-batch), the resources to use by default ({memory: 16384, vcpus: 8}), the job queue, the AWS region, and the ECR repository.

We can add a new section to specify per-task resources, to override the default value:
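As a rough sketch (the key names reflect soopervisor's AWS Batch settings as I recall them, and the repository, queue, and region are placeholders; check the soopervisor documentation for the exact schema), the per-task section can look like this, with wildcards matching the grid-generated task names:

aws-env:
  backend: aws-batch
  repository: 123456789012.dkr.ecr.us-east-1.amazonaws.com/ds-platform-grid
  job_queue: ds-platform-queue
  region_name: us-east-1
  # defaults applied to every task
  container_properties:
    memory: 16384
    vcpus: 8
  # overrides for specific tasks, e.g. more memory, CPUs, or a GPU for training
  task_resources:
    fit-*:
      memory: 32768
      vcpus: 16
      gpu: 1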

In this final part, we showed how to create multi-step workflows, and how to parametrize a script to create a grid of experiments that can run in parallel. Now you have a scalable infrastructure to run Data Science and Machine Learning experiments!

If you need help customizing the infrastructure or want to share your feedback, please join our community!

To keep up-to-date with our content, follow us on Twitter, LinkedIn, or subscribe to our newsletter!

Here's the command you need to run to delete the ECR repository we created in this post. To delete all the infrastructure, revisit the previous tutorials.
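Assuming the repository name used earlier:

aws ecr delete-repository --repository-name ds-platform-grid --force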

Output:



