
From Jupyter to Kubernetes: Refactoring and Deploying Notebooks Using Open-Source Tools | by Eduardo Blancas | Jun, 2022

A step-by-step guide to going from a messy notebook to a pipeline running in Kubernetes

Photo by Myriam Jessier on Unsplash

Notebooks are great for rapid iterations and prototyping but quickly get messy. After working on a notebook, my code becomes difficult to manage and unsuitable for deployment. In production, code organization is essential for maintainability (it’s much easier to improve and debug organized code than a long, messy notebook).

In this post, I’ll describe how you can use our open-source tools to cover the entire life cycle of a Data Science project: from a messy notebook to code running in production. Let’s get started!

The first step is to clean up our notebook with automated tools; then, we’ll automatically refactor our monolithic notebook into a modular pipeline with soorgeon; after that, we’ll test that our pipeline runs; and, finally, we’ll deploy our pipeline to Kubernetes. The main benefit of this workflow is that all steps are fully automated, so we can return to Jupyter, iterate (or fix bugs), and deploy again effortlessly.

Image by author.

The interactivity of notebooks makes it simple to try out new ideas, but it also yields messy code. While exploring data, we often rush to write code without considering readability. Luckily, tools like isort and black let us re-format our code to improve readability. Unfortunately, these tools only work with .py files; however, soorgeon enables us to run them on notebook files (.ipynb):
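
A minimal sketch of the commands (assuming soorgeon is installed from PyPI and the notebook is named nb.ipynb, the same name used in the project listing later in this post):

# install soorgeon
pip install soorgeon

# apply the auto-formatters mentioned above (black and isort) to the notebook
soorgeon clean nb.ipynb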

Note: if you need an example notebook to try these commands, any notebook organized into sections with Markdown headings will work.

Check out the image at the beginning of this section: I introduced some extra whitespace in the notebook on the left. After applying soorgeon clean (notebook on the right), the extra whitespace is gone. Now we can focus on writing code and let soorgeon clean take care of auto-formatting!

Developing an analysis in a single notebook is convenient: we can move sections around and edit them easily. However, it has drawbacks: it’s hard to collaborate on and hard to test. Organizing our analysis into multiple files lets us define clear boundaries so several people can work on the project without getting in each other’s way.

The process of going from a single notebook to a modular pipeline is time-consuming and error-prone; fortunately, soorgeon can do the heavy lifting for us:
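
Here is the refactoring command in its simplest form (a sketch; nb.ipynb is the same notebook name used throughout this example):

# split the notebook into one task per Markdown heading
soorgeon refactor nb.ipynb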

Upon refactoring, we’ll see a bunch of new files:

Image by author.

soorgeon turns our notebook into a modularized project automatically! It generates a README.md with basic instructions and a requirements.txt (extracting package names from the import statements). Furthermore, it creates a tasks/ directory with one .ipynb file per section of the original notebook, where sections are delimited by Markdown headings; soorgeon refactor also figures out which sections depend on which ones.

If you prefer to export .py files, you can pass the --file-format option:

soorgeon refactor nb.ipynb --file-format py

The tasks/ directory will have .py files this time:

├── README.md
├── nb.ipynb
├── pipeline.yaml
├── requirements.txt
└── tasks
    ├── clean.py
    ├── linear-regression.py
    ├── load.py
    ├── random-forest-regressor.py
    └── train-test-split.py

soorgeon uses Markdown headings to determine how many tasks to generate; in our case, there are five. Then, soorgeon analyzes the code to resolve the dependencies among sections and adds the necessary code to pass outputs from each task to the ones that need them.

For example, our “Train test split” section creates the variables X, y, X_train, X_test, y_train, and y_test; the last four are used by the “Linear regression” section:

Image by author.

By analyzing each section’s inputs and outputs, soorgeon determines that the “Linear regression” section depends on the “Train test split” section. The “Random Forest Regressor” section also depends on “Train test split” since it uses the same variables. With this information, soorgeon builds the dependency graph.

Now it’s time to ensure that our modular pipeline runs correctly. To do so, we’ll use the second package in our toolbox: ploomber. Ploomber allows us to develop and execute our pipelines locally.
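
A minimal sketch of the commands (ploomber is on PyPI; the requirements.txt file was generated during refactoring):

# install ploomber and the pipeline's dependencies
pip install ploomber
pip install -r requirements.txt

# run all tasks in the pipeline
ploomber build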

ploomber build runs each task and prints a summary of the execution.

ploomber offers a lot of tools to manage our pipeline; for example, we can generate a plot:
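
The plotting command looks like this (depending on your ploomber version, rendering the diagram may require an extra dependency such as pygraphviz):

# generate a diagram of the pipeline's dependency graph
ploomber plot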

Image by author.

We can see the dependency graph: there are three serial tasks (load, clean, and train-test-split), followed by two independent tasks (linear-regression and random-forest-regressor). The advantage of modularizing our work is that team members can work independently, we can test tasks in isolation, and we can run independent tasks in parallel. With ploomber, we can keep developing the pipeline in Jupyter until we’re ready to deploy!

To keep things simple, you may deploy your Ploomber pipeline with cron and run ploomber build on a schedule. However, in some cases, you may want to leverage existing infrastructure. We’ve got you covered! With soopervisor, you can export your pipeline to Airflow, AWS Batch, Kubernetes, SLURM, or Kubeflow.

soopervisor add adds some files to our project, such as a preconfigured Dockerfile (which we can modify if needed). soopervisor export then takes our existing pipeline and exports it to Argo Workflows so we can run it on Kubernetes.
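
A sketch of both commands for the Kubernetes/Argo case (the target name training is arbitrary; choose whatever name fits your project):

# create an export target configured for Argo Workflows,
# including a preconfigured Dockerfile
soopervisor add training --backend argo-workflows

# package the pipeline and generate the Argo Workflows spec
soopervisor export training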

By changing the --backend argument in the soopervisor add command, you can switch to any of the other supported platforms. Alternatively, you may sign up for the free cloud service, which allows you to run your notebooks in the cloud with a single command.
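
For example, to target Airflow instead of Argo Workflows (backend names follow the soopervisor documentation; the target name train-airflow is hypothetical):

# create a target that exports the pipeline to Airflow
soopervisor add train-airflow --backend airflow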

Notebook cleaning and refactoring are time-consuming and error-prone, and we are developing tools to make the process a breeze. In this blog post, we went from a monolithic notebook to a modular pipeline running in production, all in an automated way using open-source tools. Please let us know what features you’d like to see. Join our community and share your thoughts!

