
Running Jupyter Notebooks On Docker Containers | by Nadim Kawwa | Aug, 2022



A Project with SageMaker Studio Lab and Docker

Photo by Ian Taylor on Unsplash

The objective of this post is to run a data science workflow on AWS and then ship it using Docker, thus creating an end-to-end machine learning task.

Furthermore, I will focus more on how to dockerize a data science project than on why this project is cool. That said, there are many benefits to using Docker:

  • Portability
  • Performance
  • Agility
  • Isolation
  • Scalability

On the other hand, AWS SageMaker Studio Lab provides the power of SageMaker without the need to explicitly define every subprocess.

For this project we will be using 34,686,770 Amazon reviews from 6,643,669 users on 2,441,053 products, from the Stanford Network Analysis Project (SNAP). The data is distributed under the BSD license and is free for both academic and commercial use.

The data used is a subset of that data, found on Kaggle. It contains 1,800,000 training samples and 200,000 testing samples, and each review is labeled either “positive” or “negative”.

Since this is a large dataset and we are working with AWS, the data is stored in an S3 bucket. In addition, make sure that the bucket and the AWS services you’re using are in the same region. For this project, the region of choice is us-east-2 (US East, Ohio).
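For illustration, a boto3 sketch of staging the files in S3, with a hypothetical bucket name and key layout, might look like this:

# Upload the Kaggle CSVs to S3 in the same region as the SageMaker resources.
# The bucket name and object keys below are placeholders.
import boto3

s3 = boto3.client("s3", region_name="us-east-2")
bucket = "amazon-reviews-sentiment-demo"   # hypothetical bucket name

for local_file, key in [("train.csv", "data/train.csv"), ("test.csv", "data/test.csv")]:
    s3.upload_file(local_file, bucket, key)
    print(f"Uploaded {local_file} to s3://{bucket}/{key}")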

Working with AWS, one of the common pitfalls is not having the required permissions, or not knowing the extent of your permissions. Go to Identity and Access Management (IAM), create a role, and attach the following policies to it:

  • AmazonS3FullAccess
  • AmazonSageMakerFullAccess
  • AWSGlueConsoleSageMakerNotebookFullAccess
  • AWSImageBuilderFullAccess
  • AmazonSageMakerPipelinesIntegrations
  • AmazonSageMakerGroundTruthExecution

SageMaker Studio Lab provides a lightweight environment in which to perform tasks faster than you would on SageMaker. It comes with some drawbacks, such as no access to AWS Data Wrangler and no large-scale distributed training.

To get started with SageMaker Studio Lab, all you need is to request an account and wait about 24 hours for it to be approved. Once inside the lab, the environment looks like a Jupyter notebook.

You can request a free account here: https://studiolab.sagemaker.aws/

The code blurb below is a standard way to connect to your AWS account and define the essential resources. Think of it as a sort of shebang for getting started with AWS.
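A minimal sketch of that preamble, assuming the SageMaker Python SDK is installed and an execution role is attached (role handling may differ slightly inside Studio Lab), could look like this:

# Standard SageMaker preamble: region, session, execution role, and working bucket.
import boto3
import sagemaker

region = boto3.Session().region_name            # expected to be us-east-2 here
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()           # IAM role carrying the policies listed above
bucket = sagemaker_session.default_bucket()     # or the bucket that holds the review data

print(region, role, bucket)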

Inside SageMaker Studio, the first step is to write a preprocessing script such as the one shown below. This script, and all subsequent scripts, needs to import all of its required libraries and be able to run independently.
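A minimal sketch of such a script, with assumed column names and an illustrative train/test split, might be:

# preprocessing.py -- self-contained sketch of a SageMaker Processing script.
# Input/output paths follow the /opt/ml/processing convention discussed below;
# the column names and split ratio are illustrative assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split

INPUT_PATH = "/opt/ml/processing/input/reviews.csv"
TRAIN_PATH = "/opt/ml/processing/train/train.csv"
TEST_PATH = "/opt/ml/processing/test/test.csv"

if __name__ == "__main__":
    df = pd.read_csv(INPUT_PATH, names=["label", "title", "review"])
    df["text"] = df["title"].fillna("") + " " + df["review"].fillna("")
    train, test = train_test_split(df[["label", "text"]], test_size=0.1, random_state=42)
    train.to_csv(TRAIN_PATH, index=False)
    test.to_csv(TEST_PATH, index=False)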

Moreover, the mindset to adopt in an ML workflow is “this will be running on someone else’s machine”. Hence it’s crucial that what is written in a Jupyter Notebook environment can be contained in a concise Python script.

For those getting started with SageMaker, remember to stick to the prescribed directory format: “/opt/ml/…”. In fact, “/opt/ml” and all its subdirectories are reserved by SageMaker per the documentation. For example, “/opt/ml/model/” is the directory where you write the model that your algorithm generates.

After preprocessing comes training, which, like the previous step, is contained in a script. The specific model isn’t as important as having the right workflow set up. Once the workflow is functioning as intended, feel free to plug in your algorithm of choice.
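A sketch of such a training script, using a stand-in TF-IDF plus logistic regression model rather than any particular algorithm, might look like this:

# train.py -- self-contained sketch of a training script.
# The input channel name "train" and the model choice are assumptions; swap in your own.
import os
import joblib
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

TRAIN_PATH = "/opt/ml/input/data/train/train.csv"   # SageMaker training input channel
MODEL_DIR = "/opt/ml/model"                         # reserved directory for the model artifact

if __name__ == "__main__":
    train = pd.read_csv(TRAIN_PATH)
    model = Pipeline([
        ("tfidf", TfidfVectorizer(max_features=50_000)),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    model.fit(train["text"], train["label"])
    joblib.dump(model, os.path.join(MODEL_DIR, "model.joblib"))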

The data is perfectly balanced: 50% positive and 50% negative. Therefore accuracy is a sufficient metric. Precision and recall are also included to uncover any issues in prediction.
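A small scikit-learn helper along these lines (assuming string labels, with “positive” as the positive class) can report all three:

# Evaluation helper: accuracy as the headline metric on the balanced data,
# with precision and recall as sanity checks.
from sklearn.metrics import accuracy_score, precision_score, recall_score

def report_metrics(y_true, y_pred, positive_label="positive"):
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, pos_label=positive_label),
        "recall": recall_score(y_true, y_pred, pos_label=positive_label),
    }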

Prior to any parameter optimization, accuracy hovers around 80%, as do precision and recall.

The SageMaker processing module allows us to add dependencies using a FrameworkProcessor, which lets a set of Python scripts run as part of the processing job. In addition, it gives the processor access to every script in a directory instead of a single specified script.
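A sketch of that setup, assuming the scikit-learn framework image and a placeholder instance type, scripts directory, and S3 paths:

# FrameworkProcessor sketch: source_dir exposes every script in the directory
# to the processing job, not just the entry point.
from sagemaker.processing import FrameworkProcessor, ProcessingInput, ProcessingOutput
from sagemaker.sklearn.estimator import SKLearn

processor = FrameworkProcessor(
    estimator_cls=SKLearn,
    framework_version="1.0-1",
    role=role,                      # execution role from the preamble
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

processor.run(
    code="preprocessing.py",
    source_dir="scripts",           # directory holding all processing scripts
    inputs=[ProcessingInput(source=f"s3://{bucket}/data",
                            destination="/opt/ml/processing/input")],
    outputs=[ProcessingOutput(source="/opt/ml/processing/train"),
             ProcessingOutput(source="/opt/ml/processing/test")],
)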

Now that our workflow is running smoothly, it’s time to ship! First create a directory called docker inside the lab and write the Dockerfile to create the processing container:
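A minimal version of such a Dockerfile, with an assumed base image and library set, might be:

# docker/Dockerfile -- sketch of a processing container.
FROM python:3.9-slim

RUN pip install --no-cache-dir pandas scikit-learn joblib

ENV PYTHONUNBUFFERED=TRUE
ENTRYPOINT ["python3"]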

The next step is to build the container with the docker command, create an Amazon Elastic Container Registry (ECR) repository, and push the Docker image to it:
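A sketch of those commands, with a placeholder account ID, region, and repository name:

# Build the image, create an ECR repository, and push the image to it.
account_id=$(aws sts get-caller-identity --query Account --output text)
region=us-east-2
repo=sagemaker-processing-container   # placeholder repository name

aws ecr create-repository --repository-name "$repo" --region "$region"

# On Apple Silicon, add --platform linux/amd64 (see the note below)
docker build -t "$repo" docker/

aws ecr get-login-password --region "$region" | \
    docker login --username AWS --password-stdin "$account_id.dkr.ecr.$region.amazonaws.com"

docker tag "$repo:latest" "$account_id.dkr.ecr.$region.amazonaws.com/$repo:latest"
docker push "$account_id.dkr.ecr.$region.amazonaws.com/$repo:latest"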

P.S. If you’re using an Apple M1 computer or newer, make sure to explicitly specify the target platform when building the Docker image.

First we call the ScriptProcessor class, which lets you run a command inside a container; now we can run the same scripts as before, only inside our Docker container.
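A sketch of that call, pointing at the image pushed above (the account ID, image URI, instance type, and S3 layout are placeholders):

# Run the same preprocessing script, this time inside the custom container.
from sagemaker.processing import ScriptProcessor, ProcessingInput, ProcessingOutput

account_id = "123456789012"   # placeholder AWS account ID
image_uri = f"{account_id}.dkr.ecr.us-east-2.amazonaws.com/sagemaker-processing-container:latest"

script_processor = ScriptProcessor(
    image_uri=image_uri,
    command=["python3"],
    role=role,                    # execution role from the preamble
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

script_processor.run(
    code="preprocessing.py",
    inputs=[ProcessingInput(source=f"s3://{bucket}/data",
                            destination="/opt/ml/processing/input")],
    outputs=[ProcessingOutput(source="/opt/ml/processing/train"),
             ProcessingOutput(source="/opt/ml/processing/test")],
)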

And that’s it, we successfully ran a processing job in a Docker container. The training and evaluation containers are not shown for brevity, since they follow similar steps.

With AWS services and Docker, we are able to package Python code in a container and productize the data science workflow.

If you’re interested in seeing the whole code along with outputs to validate your own work, check out the GitHub repo below:

