
6 Underdog Data Science Libraries That Deserve Much More Attention | by Bex T. | Apr, 2023



Image by me via Midjourney.

While the big guys (Pandas, Scikit-learn, NumPy, Matplotlib, TensorFlow, and the rest) hog all your attention, it is easy to miss some down-to-earth yet incredible libraries.

They may not be GitHub rock stars or taught in expensive Coursera specializations, but thousands of open-source developers pour their blood and sweat into writing them. From the shadows, they quietly fill the gaps the popular libraries leave behind.

The purpose of this article is to shine a light on some of these libraries and marvel together at how powerful the open-source community can be.

Let’s get started!

0. Manim

Image from the Manim GitHub page. MIT License.

We are all wowed and stunned by just how beautiful 3Blue1Brown videos are. But most of us don’t know that all the animations are created using the Mathematical Animation Engine (Manim) library, written by Grant Sanderson himself. (We take Grant Sanderson so much for granted.)

Each 3b1b video is powered by thousands of lines of code written in Manim. As an example, the legendary “The Essence of Calculus” series took Grant Sanderson over 22k lines of code.

In Manim, each animation is represented by a scene class like the following (don’t worry if you don’t understand it):

import numpy as np
from manim import *


class FunctionExample(Scene):
    def construct(self):
        axes = Axes(...)
        axes_labels = axes.get_axis_labels()

        # Get the graph of a simple function
        graph = axes.get_graph(lambda x: np.sin(1 / x), color=RED)
        # Set up its label
        graph_label = axes.get_graph_label(
            graph, x_val=1, direction=2 * UP + RIGHT,
            label=r'f(x) = \sin(\frac{1}{x})', color=DARK_BLUE
        )

        # Group the axes components together
        axes_group = VGroup(axes, axes_labels)

        # Animate
        self.play(Create(axes_group), run_time=2)
        self.wait(0.25)
        self.play(Create(graph), run_time=3)
        self.play(Write(graph_label), run_time=2)

This produces the following animation of the function sin(1/x):

GIF by the author using Manim.

Unfortunately, Manim is not well maintained or documented since, understandably, Grant Sanderson spends most of his effort on making the awesome videos.

But there is a community fork of the library, Manim Community, that provides better support, documentation, and learning resources.

If you are already too excited (you math lover!), here is my gentle but thorough introduction to the Manim API:

Stats and links:

Because of its steep learning curve and complex installation, Manim gets very few downloads each month. It deserves so much more attention.

1. PyTorch Lightning

Screenshot of the PyTorch Lightning GitHub page. Apache-2.0 license.

When I started learning PyTorch after TensorFlow, I became very grumpy. It was obvious that PyTorch was powerful but I couldn’t help but say “TensorFlow does this better”, or “That would have been much shorter in TF”, or even worse, “I almost wish I never learned PyTorch”.

That’s because PyTorch is a low-level library. Yes, this means PyTorch gives you complete control over the model training process, but it also requires a lot of boilerplate code. It is like TensorFlow, but five years younger, if I am not mistaken.

Turns out, quite a few people feel this way. More specifically, nearly 830 contributors at Lightning AI felt it strongly enough to develop PyTorch Lightning.

GIF from the PyTorch Lightning GitHub page. Apache-2.0 license.

PyTorch Lightning is a high-level wrapper library built around PyTorch that abstracts away most of its boilerplate code and soothes all its pain points:

  • Hardware-agnostic models
  • Code is highly readable because engineering code is handled by Lightning modules
  • Flexibility is intact (all Lightning modules are still PyTorch modules)
  • Multi-GPU, multi-node, TPU support
  • 16-bit precision
  • Experiment tracking
  • Early stopping and model checkpointing (finally!)

and close to 40 other advanced features, all designed to delight AI researchers rather than infuriate them.
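
To make that concrete, here is a minimal sketch of a Lightning module (a toy autoencoder of my own, not an official example). Notice what is absent: no manual training loop, no .to(device) calls, no checkpointing code, because the Trainer owns all of that engineering:

import torch
from torch import nn
import pytorch_lightning as pl


class LitAutoEncoder(pl.LightningModule):
    def __init__(self):
        super().__init__()
        # Plain PyTorch modules; Lightning only organizes them
        self.encoder = nn.Sequential(nn.Linear(28 * 28, 64), nn.ReLU(), nn.Linear(64, 3))
        self.decoder = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 28 * 28))

    def training_step(self, batch, batch_idx):
        # Define what a single training step does; Lightning runs the loop
        x, _ = batch
        x = x.view(x.size(0), -1)
        x_hat = self.decoder(self.encoder(x))
        loss = nn.functional.mse_loss(x_hat, x)
        self.log("train_loss", loss)  # plugs into experiment tracking
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


# The Trainer handles devices, epochs, checkpointing, and early stopping
trainer = pl.Trainer(max_epochs=1, accelerator="auto")
# trainer.fit(LitAutoEncoder(), train_dataloaders=your_dataloader)  # your_dataloader: any PyTorch DataLoader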

Stats and links:

Learn from the official tutorials:

2. Optuna

Yes, hyperparameter tuning with GridSearch is easy, comfortable, and only a single import statement away. But you must surely admit that it is slower than a hungover snail and very inefficient.

bexgboost_infinite_number_of_supermarket_aisles_in_an_orderly_f_471fc79d-cd62-40a0-bb3a-1eeb75d35507.png
Image by me via Midjourney.

For a moment, think of hyperparameter tuning as grocery shopping. Using GridSearch means going down every single aisle in a supermarket and checking every product. It is a systematic and orderly approach, but you waste so much time.

On the other hand, if you have an intelligent personal shopping assistant with Bayesian roots, you will know exactly what you need and where to go. It is a more efficient and targeted approach.

If you’d like such an assistant, its name is Optuna. It is a Bayesian hyperparameter optimization framework that searches a given hyperparameter space efficiently and finds the golden set of hyperparameters that gives the best model performance.

Here are some of its best features:

  • Framework-agnostic: tunes models from any machine learning framework you can think of
  • Pythonic API to define search spaces: instead of manually listing possible values for a hyperparameter, Optuna lets you sample them linearly, randomly, or logarithmically from a given range
  • Visualization: supports hyperparameter importance plots, parallel coordinate plots, optimization history plots, and slice plots
  • Control the number or duration of trials: set the exact number of trials or the maximum duration of the tuning process
  • Pause and resume the search
  • Pruning: stop unpromising trials early, before they eat up more time and resources
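
Here is what those pieces look like together: a minimal sketch of a study, with a toy objective of my own in place of a real model so that it runs in seconds:

import optuna


def objective(trial):
    # Sample a hyperparameter from a continuous range (the "Pythonic search space")
    x = trial.suggest_float("x", -10, 10)
    # Return the value to minimize; for a real model this would be a validation metric
    return (x - 2) ** 2


study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=100)  # or timeout=60 to cap the duration instead
print(study.best_params)  # should land close to {'x': 2}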

All these features are designed to save time and resources. If you want to see them in action, check out my tutorial on Optuna (it is one of my best-performing articles among 150):

Stats and links:

3. PyCaret

Screenshot of the PyCaret GitHub page. MIT license.

I have enormous respect for Moez Ali for creating this library from the ground up on his own. Currently, PyCaret is the best low-code machine learning library out there.

If PyCaret were advertised on TV, here is what the ad would say:

“Are you tired of spending hours writing virtually the same code in your machine learning workflows? Then, PyCaret is the answer!

Our all-in-one machine learning library helps you build and deploy machine learning models in as few lines of code as possible. Think of it as a cocktail containing code from all your favorite machine learning libraries: Scikit-learn, XGBoost, CatBoost, LightGBM, Optuna, and many others.”

Then, the ad would show this snippet of code, with dramatic popping noises to display each line:

# Classification OOP API Example

# loading sample dataset
from pycaret.datasets import get_data
data = get_data('juice')

# init setup
from pycaret.classification import ClassificationExperiment
s = ClassificationExperiment()
s.setup(data, target='Purchase', session_id=123)

# model training and selection
best = s.compare_models()

# evaluate trained model
s.evaluate_model(best)

# predict on hold-out/test set
pred_holdout = s.predict_model(best)

# predict on new data
new_data = data.copy().drop('Purchase', axis=1)
predictions = s.predict_model(best, data=new_data)

# save model
s.save_model(best, 'best_pipeline')

The narrator would say on voiceover as the code is being displayed:

“With a few lines of code, you can train dozens of models from different frameworks, choose the best one, evaluate it on a hold-out set, and save it for deployment. It is so easy to use, anyone can do it!

Hurry up and grab a copy of our software from GitHub or through pip, and thank us later!”

Stats and links:

4. BentoML

Web developers love FastAPI like their pets. It is one of the most popular GitHub projects and, admittedly, it makes API development stupidly easy and intuitive.

Because of this popularity, it also made its way into machine learning. It is common to see engineers deploying their models as APIs using FastAPI, thinking the whole process couldn’t get any better or easier.

But most are under an illusion. Just because FastAPI is so much better than its predecessor (Flask) doesn’t mean it is the best tool for the job.

Well, then, what is the best tool for the job? I am so glad you asked — BentoML!

BentoML, though relatively young, is an end-to-end framework for packaging and shipping models from any machine learning library to any cloud platform.

Image from the BentoML home page, used with permission.

FastAPI was designed for web developers, so it has many obvious shortcomings when it comes to deploying ML models. BentoML solves them all:

  • Standard API to save/load models
  • Model store to version and keep track of models
  • Dockerization of models with a single line of terminal code
  • Serving models on GPUs
  • Deploying models as APIs to any cloud provider with a single short script and a few terminal commands, as sketched below
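
Here is a minimal sketch of such a script, assuming the BentoML 1.x API and a scikit-learn model already saved to the local model store under the hypothetical tag "iris_clf":

import bentoml
from bentoml.io import NumpyNdarray

# Load a saved model from the local model store and wrap it in a runner
# (it would have been saved earlier with bentoml.sklearn.save_model("iris_clf", clf))
iris_runner = bentoml.sklearn.get("iris_clf:latest").to_runner()

svc = bentoml.Service("iris_classifier", runners=[iris_runner])


# Declare an API endpoint; BentoML handles serialization, batching, and docs
@svc.api(input=NumpyNdarray(), output=NumpyNdarray())
def classify(input_array):
    return iris_runner.predict.run(input_array)

# Serve locally with: bentoml serve service.py:svc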

I’ve already written a few tutorials on BentoML. Here is one of them:

Stats and links:

5. PyOD

Image by me via Midjourney.

This library is an underdog because the problem it solves, outlier detection, is also an underdog.

Virtually any machine learning course you take only teaches z-scores for outlier detection and moves on to fancier concepts and tools like R (sarcasm).

But outlier detection is so much more than plain z-scores. There are modified z-scores, Isolation Forest (cool name), KNN for anomalies, Local Outlier Factor, and 30+ other state-of-the-art anomaly detection algorithms packed into the Python Outlier Detection toolkit (PyOD).

When not detected and dealt with properly, outliers will skew the mean and standard deviation of features and create noise in training data — scenarios you don’t want happening at all.

That’s PyOD’s life purpose — provide tools to facilitate finding anomalies. Apart from its wide range of algorithms, it is fully compatible with Scikit-learn, making it easy to use in existing machine-learning pipelines.
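
To show just how Scikit-learn-like it feels, here is a minimal sketch using Isolation Forest, with synthetic data generated purely for illustration:

import numpy as np
from pyod.models.iforest import IForest

# Synthetic 2D data: 95 inliers around the origin plus 5 obvious outliers
rng = np.random.default_rng(42)
X_train = np.vstack([rng.normal(0, 1, size=(95, 2)), rng.normal(8, 1, size=(5, 2))])

clf = IForest(contamination=0.05)  # the expected share of outliers
clf.fit(X_train)

print(clf.labels_[:10])           # 0 = inlier, 1 = outlier, for the training data
print(clf.decision_scores_[:10])  # raw anomaly scores for the training data
print(clf.predict(np.array([[0.0, 0.0], [8.0, 8.0]])))  # expected: [0 1]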

If you are still not convinced about the importance of anomaly detection and the role PyOD plays in it, I highly recommend giving this article a read (written by yours truly):

Stats and links:

6. Sktime

Image from the Sktime GitHub page. BSD 3-Clause License.

Time machines are no longer the stuff of science fiction. They are a reality in the form of Sktime.

Instead of jumping between time periods, Sktime performs the slightly less cool task of time series analysis.

It borrows the best tools of its big brother, Scikit-learn, to perform the following time series tasks:

  • Classification
  • Regression
  • Clustering (this one is fun!)
  • Annotation
  • Forecasting

It features over 30 state-of-the-art algorithms with a familiar Scikit-learn syntax and also offers pipelining, ensembling, and model tuning for both univariate and multivariate time series data.
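
Here is a minimal sketch of the forecasting workflow, adapted from the Sktime quickstart, using the bundled univariate airline dataset:

from sktime.datasets import load_airline
from sktime.forecasting.naive import NaiveForecaster

y = load_airline()  # monthly airline passenger counts as a pandas Series

# A seasonal-naive baseline: repeat the value from the same month one year earlier
forecaster = NaiveForecaster(strategy="last", sp=12)
forecaster.fit(y)
y_pred = forecaster.predict(fh=[1, 2, 3])  # forecast the next three months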

It is also very well maintained — Sktime contributors work like bees.

Here is a tutorial on it (not mine, alas):

Stats and links:

Wrap

While our daily workflows are dominated by popular tools like Scikit-learn, TensorFlow, or PyTorch, it is important not to overlook the lesser-known libraries.

They may not have the same level of recognition or support, but in the right hands, they provide elegant solutions to problems not addressed by their popular counterparts.

This article focused on only six of them, but you can be sure there are hundreds of others. All you have to do is some exploring!

Loved this article and, let’s face it, its bizarre writing style? Imagine having access to dozens more just like it, all written by a brilliant, charming, witty author (that’s me, by the way :).

For only a $4.99 membership, you will get access not just to my stories, but to a treasure trove of knowledge from the best and brightest minds on Medium. And if you use my referral link, you will earn my supernova of gratitude and a virtual high-five for supporting my work.



