
Modularise your Notebook into Scripts | by Leon Sun | May, 2022



A simple guide to transform your code from notebooks to executable scripts

Photo by James Harrison on Unsplash

Hello World! In this article, I will present a simple guide on modularising your notebooks into executable scripts.

Previously, Geoffrey Hung shared an extremely comprehensive article on how you can transform your Jupyter Notebooks into Scripts. However, during my quest to productionise models, I found that there were still some gaps in modularising notebooks from .ipynb into .py and running the entire pipeline of scripts.

If you work in the analytical space, chances are you will find yourself writing Python code in notebooks at some point. You may or may not have experienced issues with this approach, but if you are looking for a way to run your notebook code as scripts instead, then this article is for you.

I will not be focusing on the benefits of writing code in scripts, nor attempt to compare both approaches as notebooks and scripts have their own benefits and drawbacks. If you’re wondering why you should make the switch, this article may provide more clarity.

I have created a demo repository to perform a clustering analysis on a credit card dataset obtained from Kaggle. I’ll be using this repository throughout the article to share snippets of examples.

Table of Contents:

  1. Project and Code Structure
  2. Abstraction and Refactoring
  3. Executing the Pipeline

Project and Code Structure

Having a proper repository structure is essential. Instead of a gigantic notebook, or multiple notebooks with different models containing the entire pipeline from data extraction to modelling, we first have to compartmentalise this complexity by breaking the pipeline down into parts with distinct purposes.

The typical data analytical workflow broadly consists of 3 components: Extraction, Transformation/Preprocessing and Analysis/Modelling. This means that you can already dissect the notebook into at least 3 separate scripts — extraction.py, preprocessing.py and model.py.

In my demo repository, extraction.py is missing as the dataset was obtained from Kaggle, so an extraction script is unnecessary. However, if you’re leveraging APIs, web scraping or dumping data from a data lake, it will be useful to have an extraction script. Depending on the type of data models adopted by your team, you will likely find yourself having to write an array of queries to extract the data and perform join statements to merge them into a single table or dataframe.

A typical structure of a project might look something like this. To view your project structure:

$ tree .
├── LICENSE
├── README.md
├── config.yml
├── data
│   ├── CC_GENERAL.csv
│   └── data_preprocessed.csv
├── main.py
├── notebooks
│   ├── dbscan.ipynb
│   ├── kmeans.ipynb
│   └── preprocessing.ipynb
├── requirements.txt
└── src
    ├── dbscan.py
    ├── executor.py
    ├── kmeans.py
    ├── preprocessing.py
    └── utility.py

In this project repository, we break down the pipeline into its key components. Let’s look into each of their purposes.

  • /notebooks: This folder serves as a playground where your original code is written in a flat structure for simpler display of outputs and iterating through code chunks.
  • /data: This folder contains the various data files your scripts will be using. Some of the popular data formats stored here include .csv, .parquet, .json etc.
  • /src: This folder stores all your executable scripts.
  • main.py: The main script to run the entire pipeline, with the code abstracted and refactored from the notebooks.
  • config.yml: A human-readable file to store the configurable parameters used to run the scripts.

Abstraction and Refactoring

Once you have your project structure in place, the next step is to refactor and abstract your code to reduce complexity. Instead of writing code that forces the reader to work through the how of every step, abstracting code into functions and classes (coupled with proper variable naming) helps to compartmentalise the complexity.

The article below provides an amazing summary of how you can refactor your notebook.

Preprocessing

After extracting your data, it’s common to clean it up before using it for your analysis or models. Common preprocessing steps include imputing missing values, removing outliers and transforming the data.

The preprocessing in the demo repository includes removing outliers and imputing missing values. These specific jobs can be abstracted into functions and stored in a utility.py script, which is then imported into preprocessing.py.

For instance, the function for imputing missing values with the median was placed in the utility.py file.
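The exact implementation lives in the repository’s utility.py, but a minimal sketch of such a helper might look like this, assuming the data sits in a pandas DataFrame (the function name impute_median and the column name are placeholders, not the repository’s actual names):

import pandas as pd

def impute_median(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Fill missing values in `column` with that column's median."""
    df[column] = df[column].fillna(df[column].median())
    return df

# Example usage inside preprocessing.py (column name is illustrative):
# df = impute_median(df, "MINIMUM_PAYMENTS")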

Models

If you have to use different models on the same set of preprocessed data, you can also create classes to encapsulate each model. In the demo repository, I explored two types of algorithms when performing the clustering, with each model separated into its own executable script. For instance, kmeans was abstracted into kmeans.py while DBSCAN was abstracted into dbscan.py.

Let’s import the necessary packages and create a class for the kmeans model.
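The snippet from the repository isn’t reproduced here, so below is a minimal, hypothetical sketch of what such a class could look like, written to be consistent with the usage shown next and assuming scikit-learn’s KMeans; the actual kmeans.py may differ in its details.

import pandas as pd
from sklearn.cluster import KMeans

class kmeans_model:
    """Wraps a preprocessed dataframe and fits k-means over a range of cluster counts."""

    def __init__(self, df: pd.DataFrame):
        self.df = df

    def kmeans_model(self, min_clusters: int, max_clusters: int) -> dict:
        """Fit one KMeans model per value of k and return them keyed by k."""
        models = {}
        for k in range(min_clusters, max_clusters + 1):
            models[k] = KMeans(n_clusters=k, random_state=42).fit(self.df)
        return models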

If we want to create a model instance, we can simply define an object to initialise a model and store the kmeans model instance.

kmeans = kmeans_model(df)  # instantiate kmeans model
kmeans_models = kmeans.kmeans_model(min_clusters=1, max_clusters=10)  # run multiple iterations of kmeans model

This article by Sadrach Pierre provides an extensive elaboration on how you can utilise classes when building models.

Executing the Pipeline

With the various key components of the analytical pipeline now abstracted into functions and classes and transformed into modularised scripts, we can now simply run the entire pipeline. This is achieved using two scripts — main.py and executor.py.

Main

The main script, main.py, will run the entire pipeline when executed, taking in the necessary configurations that were loaded. In the demo repository, I leveraged a config file to store the parameters, and click to interface with it from the command line.
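As a rough sketch rather than the repository’s exact code, a main.py along these lines would tie the two together, assuming the config file maps each model to its parameters and that executor.py exposes a run function like the one sketched in the next section:

import click
import yaml

from src import executor  # hypothetical import path mirroring the project structure above


@click.command()
@click.argument("config_path", default="config.yml", required=False)
@click.option("--model", default="kmeans", help="Which clustering model to run (kmeans or dbscan).")
def main(config_path, model):
    """Load the config file and hand the chosen model and its parameters to the executor."""
    with open(config_path) as f:
        config = yaml.safe_load(f)
    executor.run(model=model, params=config.get(model, {}))


if __name__ == "__main__":
    main()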

Executor

Once the model choice and its respective parameters have been loaded, we can pass these model inputs to the execution script, executor.py, which runs the chosen model. The steps for instantiating the model, optimising it and stitching the cluster labels back onto the data are then laid out within the executor function.
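Again as an illustration rather than the repository’s actual code, an executor function following those steps might look roughly like this; only the kmeans branch is sketched, and the import path, parameter names and the way the final k is chosen are assumptions:

import pandas as pd

from src.kmeans import kmeans_model  # hypothetical import path; the real module layout may differ


def run(model, params):
    """Instantiate the chosen model, fit it, and stitch the cluster labels back onto the data."""
    df = pd.read_csv("data/data_preprocessed.csv")

    if model == "kmeans":
        clusterer = kmeans_model(df)
        models = clusterer.kmeans_model(
            min_clusters=params.get("min_clusters", 1),
            max_clusters=params.get("max_clusters", 10),
        )
        # Attach the labels of the chosen k; the real repository may pick k differently,
        # e.g. via an elbow plot or silhouette score.
        chosen = models[params.get("n_clusters", 3)]
        df["cluster"] = chosen.labels_
    else:
        # A dbscan branch would follow the same pattern using dbscan.py.
        raise NotImplementedError(f"This sketch only covers kmeans, got: {model}")

    return df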

To run the entire pipeline:

# execute entire pipeline with default model
python3 main.py
# execute entire pipeline using DBSCAN
python3 main.py --model=dbscan
# execute entire pipeline using another config file and DBSCAN
python3 main.py another_config.yml --model=dbscan

Conclusion

Putting it all together, we now have a logical project structure with modular scripts, each carrying out a specific purpose with the underlying code abstracted and refactored. The pipeline can then be executed using a configuration file which stores the input parameters for the models.

Please leave comments 💬 if there’s more to add and I will be happy to include them in an edit!

Thanks for reading! 🙂


