
Training XGBoost with MLflow Experiments and HyperOpt

by Ani Madurkar



Colors of the Adirondacks. Image by author

As you evolve in your Machine Learning journey, you'll soon find yourself gravitating toward MLOps whether you like it or not. Building efficient, scalable, and resilient machine learning systems is a challenge, and (in my opinion) that is the real job of a Data Scientist, as opposed to just doing modeling.

The modeling part has been largely figured out for most use cases. Unless you're trying to be at the bleeding edge of the craft, you're likely dealing with structured, tabular datasets. The choice of model can vary depending on the dataset size, assumptions, and technical restrictions, but for the most part it is fairly repeatable. My workflow for supervised learning during the experimentation phase has converged to using XGBoost with HyperOpt and MLflow: XGBoost as the model of choice, HyperOpt for hyperparameter tuning, and MLflow for experiment tracking.

This also represents a phenomenal first step as you embark on the MLOps journey, because I think it's easiest to start doing MLOps work during the experimentation phase (model tracking, versioning, registry, etc.). This stack is lightweight and highly configurable, which makes it easy to scale up and down as needed.

Although I briefly discuss XGBoost, MLflow, and HyperOpt, this isn't a deep walkthrough of each. Some hands-on familiarity with each will be helpful for understanding how the pieces here work in more depth. I'll be working with the UCI ML Breast Cancer Wisconsin (Diagnostic) dataset (CC BY 4.0).

To begin, we can start an MLflow server (I discuss what's happening here a bit later):

mlflow server \
  --backend-store-uri sqlite:///mlflow.db \
  --default-artifact-root ./mlruns \
  --host 0.0.0.0 \
  --port 5000
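With the server up, the Python side only needs to know where to send runs. Here is a minimal sketch of that; the experiment name is my own placeholder, not something from the original code:

import mlflow

# Point the Python client at the tracking server started above
mlflow.set_tracking_uri("http://127.0.0.1:5000")

# Group all runs for this project under one experiment
# (the experiment name here is illustrative)
mlflow.set_experiment("breast-cancer-xgboost-hyperopt")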

pandas-profiling is a fantastic open-source library for running a quick exploratory data analysis report on a dataset. Descriptive statistics, null counts, anomaly detection, distribution analysis, and more are all shown in the report.

Sample view of the HTML Profiler output

I save the report as an HTML file so I can interact with the analysis in a web page instead of inside a Jupyter Notebook, which could run into memory errors depending on dataset size. See the Quickstart guide for how pandas-profiling works: https://pandas-profiling.ydata.ai/docs/master/pages/getting_started/quickstart.html
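Generating and saving the report takes only a few lines. A minimal sketch, assuming we load scikit-learn's copy of the dataset as a DataFrame (the author may load the UCI data differently):

from sklearn.datasets import load_breast_cancer
from pandas_profiling import ProfileReport

# Load the dataset as a DataFrame (scikit-learn ships a copy of the UCI data)
df = load_breast_cancer(as_frame=True).frame

profile = ProfileReport(df, title="Breast Cancer Wisconsin (Diagnostic) EDA")
profile.to_file("breast_cancer_eda.html")  # open the HTML report in a browser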

Lastly, we can create training, validation, and testing datasets.
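One way to do this, continuing from the DataFrame above (the 60/20/20 split ratios are my own assumption, not necessarily what the original code uses):

from sklearn.model_selection import train_test_split

# Features and target from the scikit-learn DataFrame
X = df.drop(columns=["target"])
y = df["target"]

# First carve off 40% of the data, then split that half-and-half
# into validation and test sets
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42
)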

XGBoost (eXtreme Gradient Boosting) has become the de facto model of choice for a large number of tabular modeling tasks. It's still highly recommended to try simpler models like Linear/Logistic Regression first, but in my experience almost all structured, tabular modeling projects with more than 50–100K rows have ended with gradient-boosted trees winning out by a significant margin.

How do they work?

XGBoost is an open-source gradient-boosting algorithm that uses decision trees as weak learners to build up a stronger model; it is considered an ensemble model because it combines multiple models together. There are two common ensemble methods: Bagging and Boosting.

Bagging, or bootstrap aggregating, trains each learner on a random bootstrap sample of the data and aggregates their predictions. It typically yields low variance, though it can have higher bias, and it can lead to better training stability, stronger accuracy, and a lower tendency to overfit. Random Forest models leverage bagging by combining decision trees, where each tree can only pick from a random subset of features.

An illustration of the concept of bootstrap aggregating. Public domain, Wikipedia

Boosting, in contrast, works to convert weak learners into strong ones. Each learner, or model, is trained on the same set of samples, but each sample is weighted differently in each iteration, with misclassified samples weighted more heavily. This results in the sequence of weak learners progressively correcting each other's mistakes and improving model performance over time.

An illustration of the concept of boosting. Public domain, Wikipedia

The "gradient" part of gradient boosting refers to the fact that each new tree is fit to the gradient of the loss function, effectively performing gradient descent in function space to minimize it. The objective for XGBoost is a regularized (L1 and L2) objective function that combines a convex loss term with a model-complexity penalty.
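For reference, the regularized objective from the XGBoost paper (Chen and Guestrin, 2016) can be written as follows, where l is a convex loss, the f_k are the individual trees, T is the number of leaves in a tree, and w are its leaf weights (L1 regularization on the weights is also available in the implementation via the alpha parameter):

\mathcal{L}(\phi) = \sum_i l(\hat{y}_i, y_i) + \sum_k \Omega(f_k),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2}\lambda \lVert w \rVert^2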

XGBoost rose to fame as it became the standard for winning a multitude of Kaggle competitions, but lately people have also been using Microsoft's LightGBM, as it can be faster on large datasets. I've typically found phenomenal performance with both; which one to use can depend on your needs.
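Before any tuning, a baseline XGBoost classifier on the splits from earlier looks something like this. A minimal sketch; the hyperparameter values are illustrative, not the tuned ones:

from xgboost import XGBClassifier
from sklearn.metrics import roc_auc_score

# Baseline model with a handful of illustrative hyperparameters
baseline = XGBClassifier(
    n_estimators=200,
    max_depth=4,
    learning_rate=0.1,
    random_state=42,
)
baseline.fit(X_train, y_train)

# Score on the held-out validation set
val_auc = roc_auc_score(y_val, baseline.predict_proba(X_val)[:, 1])
print(f"Baseline validation ROC AUC: {val_auc:.3f}")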

MLflow is open-source machine learning experiment tracking software. It makes it incredibly easy to spin up a local web interface to monitor your machine learning models, compare them, and stage them.

As you're experimenting to find the right modeling algorithm and architecture, it can be a nightmare to efficiently evaluate which one is best, especially when you're running hundreds of experiments at once. Ideally, you want a way to store each model run, its hyperparameters, its evaluation criteria, and more. MLflow makes this all possible with minimal code around your training code.

As you experiment with different modeling architectures, you can add new experiments and compare each one against the same criteria. Logging artifacts is extremely easy and fast. MLflow tracks and stores two main kinds of things: entities and artifacts.

Entities: runs, parameters, metrics, tags, notes, metadata, etc. These are stored in the backend store.

Artifacts: files, models, images, in-memory objects, model summaries, etc. These are stored in the artifact store.

The default storage location for both is the local filesystem, but you can configure the backend store to be a database such as SQLite (the minimum needed if you want to stage models to Staging or Production) or PostgreSQL (which enables user-authenticated access), and the artifact store to be remote object storage such as an S3 bucket. This page clarifies the different store configurations really well: https://mlflow.org/docs/latest/tracking.html#how-runs-and-artifacts-are-recorded
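To make the entities/artifacts split concrete, here is a minimal sketch of logging one run, reusing the baseline model and EDA report from earlier (the parameter, metric, and path names are illustrative):

import mlflow
import mlflow.xgboost

with mlflow.start_run(run_name="xgb-baseline"):
    # Entities (parameters, metrics) go to the backend store
    mlflow.log_param("max_depth", 4)
    mlflow.log_metric("validation_auc", val_auc)

    # Artifacts (files, serialized models) go to the artifact store
    mlflow.log_artifact("breast_cancer_eda.html")
    mlflow.xgboost.log_model(baseline, artifact_path="model")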

I won’t be going into the basics of how to use MLflow because their documentation covers a lot of what you’d need already. In case you need a quick tutorial for how to spin up MLflow, I highly recommend starting here: https://mlflow.org/docs/latest/tutorials-and-examples/tutorial.html

I will be leveraging a lightweight architecture that is independent of the cloud, but as long as you have read/write access to an S3 bucket, switching is as easy as configuring your AWS credentials and changing the artifact store path. I'll also be enabling a Tracking Server that exposes the runs and artifacts via a REST API so you're able to see the results in a web UI.

MLflow on localhost with Tracking Server architecture. Adapted from MLflow documentation.

HyperOpt is an open-source library for Bayesian optimization that helps find the right model hyperparameters and architecture. It is designed for large-scale optimization of models with hundreds of parameters and allows the optimization procedure to be scaled across multiple cores and multiple machines.

The different algorithms it can leverage are:

  • Random Search
  • Tree of Parzen Estimators
  • Annealing
  • Tree
  • Gaussian Process Tree

Most hyperparameter tuning I see done in practice is either manual or Grid Search. This exhaustive process can sometimes yield good results, but it's often highly expensive (compute and time) and unnecessary. What's more, Grid Search doesn't selectively pursue hyperparameters that perform better or worse when trying to find the global minimum of your loss function. It simply sweeps your entire search space.

Random Search is often a better baseline, as it can be faster and still provide good enough starting points to narrow your search space down. A better method still is Bayesian Optimization, because it takes prior runs into account at each iteration to guide future selections. Bayesian Optimization builds a probabilistic model over the hyperparameter space and uses it to find the hyperparameters that best minimize the loss function.

I've found the Tree of Parzen Estimators (TPE) to be a great default, but here's a paper that dives deeper into each of the algorithms: Algorithms for Hyper-Parameter Optimization (Bergstra et al.). TPE typically outperforms basic Gaussian-process Bayesian Optimization because it leverages a tree structure to traverse complex, conditional search spaces, including categorical hyperparameters, whereas standard Bayesian Optimization expects numerical values. TPE has been found to be extremely robust and efficient for large-scale hyperparameter optimization.

Michael Berk has written a phenomenal writeup on HyperOpt and its algorithms if you’re interested in a deeper dive: HyperOpt Demystified.

HyperOpt's fmin function is what ties all of this together. Here are its key parameters (a short sketch of putting them together follows the list):

  • fn: the objective (training) function to minimize
  • space: the hyperparameter search space
  • algo: the optimization algorithm
  • trials: a Trials object that records every run; it can be saved, passed to the built-in plotting routines, or analyzed with your own custom code
  • max_evals: the number of modeling experiments to run
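Putting those pieces together looks roughly like this. The search space ranges and the train_model name are my own assumptions (train_model is sketched after the trials discussion below); the original code's space may differ:

from hyperopt import fmin, tpe, hp, Trials

# Hypothetical search space over a few common XGBoost hyperparameters
search_space = {
    "max_depth": hp.quniform("max_depth", 3, 10, 1),
    "learning_rate": hp.loguniform("learning_rate", -5, 0),
    "min_child_weight": hp.loguniform("min_child_weight", -1, 3),
    "subsample": hp.uniform("subsample", 0.5, 1.0),
}

trials = Trials()
best_params = fmin(
    fn=train_model,      # objective function, sketched below
    space=search_space,
    algo=tpe.suggest,    # Tree of Parzen Estimators
    trials=trials,
    max_evals=50,        # 50 modeling experiments, as seen later in the MLflow UI
)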

HyperOpt's Trials object helps explain why you'd return a dictionary from the training function.

  • trials.trials – a list of dictionaries representing everything about the search
  • trials.results – a list of dictionaries returned by ‘objective’ during the search
  • trials.losses() – a list of losses (float for each ‘ok’ trial)
  • trials.statuses() – a list of status strings

Another thing to note: the return value of the training function is where you define the loss (which can be a custom metric if you choose). fmin always minimizes that value, so if you want to maximize a metric, as we do with ROC AUC score, multiply it by -1.
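A sketch of what the train_model function referenced in the fmin call above could look like for this workflow; the function and metric names are mine, and the author's version likely logs the model itself and more metrics:

from hyperopt import STATUS_OK
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier
import mlflow

def train_model(params):
    # hp.quniform returns floats, so cast integer-valued hyperparameters
    params = {**params, "max_depth": int(params["max_depth"])}

    # nested=True lets each trial appear as its own run, even if fmin is
    # wrapped in a parent MLflow run
    with mlflow.start_run(nested=True):
        model = XGBClassifier(**params, random_state=42)
        model.fit(X_train, y_train)

        val_auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
        mlflow.log_params(params)
        mlflow.log_metric("validation_auc", val_auc)

        # fmin minimizes the returned loss, so negate the metric we maximize
        return {"loss": -val_auc, "status": STATUS_OK}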

MLflow UI for analyzing ML experiments

We can see all 50 modeling runs as HyperOpt searches the hyperparameter space for the XGBoost model with the highest validation ROC AUC score. We can easily filter the columns to see different parameters or metrics, and change the sort order to rapidly analyze the results.

Model Evaluation & Registry

Let’s compare the top two modeling results.

Comparing the top two modeling results

We can choose different parameters and metrics to analyze via a Parallel Coordinates Plot, Scatter Plot, Box Plot, and Contour Plot to gauge how changes in parameters affect the metrics.

Furthermore, we can click into a single run to see all of the model artifacts saved for it, along with some code snippets for making predictions from this run.

Details of best model run

We can now load the best model and evaluate it on the test set before we register it.
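A sketch of loading and scoring the best run's model, assuming it was logged with mlflow.xgboost.log_model; the run ID is a placeholder, and the artifact path depends on how the model was logged:

import mlflow
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
)

# Placeholder: take the best run's ID from the MLflow UI or mlflow.search_runs()
best_run_id = "<best-run-id>"
best_model = mlflow.xgboost.load_model(f"runs:/{best_run_id}/model")

y_pred = best_model.predict(X_test)
y_proba = best_model.predict_proba(X_test)[:, 1]

print(f"Testing Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print(f"Testing Precision: {precision_score(y_test, y_pred):.3f}")
print(f"Testing Recall: {recall_score(y_test, y_pred):.3f}")
print(f"Testing F1: {f1_score(y_test, y_pred):.3f}")
print(f"Testing AUCROC: {roc_auc_score(y_test, y_proba):.3f}")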

Testing Accuracy: 0.982
Testing Precision: 1.0
Testing Recall: 0.971
Testing F1: 0.986
Testing AUCROC: 0.999

Looks great! We can now register the model into the Model Registry like so:
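A sketch of the registration call, reusing the best run's ID and artifact path from the evaluation step above (the registered model name matches the one in the log output below):

model_uri = f"runs:/{best_run_id}/model"
registered_model = mlflow.register_model(model_uri, "BreastCancerClassification-XGBHP")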

Successfully registered model 'BreastCancerClassification-XGBHP'.
2023/01/08 17:19:08 INFO mlflow.tracking._model_registry.client: Waiting up to 300 seconds for model version to finish creation. Model name: BreastCancerClassification-XGBHP, version 1
Created version '1' of model 'BreastCancerClassification-XGBHP'.

Now let’s update some information such as the description and the version information of the model.
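This is done through the MlflowClient; a minimal sketch, with the registered-model description abbreviated here and the version description taken from the output below:

from mlflow.tracking import MlflowClient

client = MlflowClient()

# Description for the registered model as a whole
client.update_registered_model(
    name="BreastCancerClassification-XGBHP",
    description="This model classifies breast cancer as malignant or benign ...",
)

# Description for this specific model version
client.update_model_version(
    name="BreastCancerClassification-XGBHP",
    version=1,
    description=(
        "This model version is the first XGBoost model trained with HyperOpt "
        "for bayesian hyperparameter tuning."
    ),
)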

<RegisteredModel: creation_timestamp=1673216348938, description=('This model classifies breast cancer as malignant or benign given certain '
'numerical features of cell nuclei such as \n'
' a) radius (mean of distances from center to points on the perimeter)\n'
' b) texture (standard deviation of gray-scale values)\n'
' c) perimeter\n'
' d) area\n'
' e) smoothness (local variation in radius lengths)\n'
' f) compactness (perimeter^2 / area - 1.0)\n'
' g) concavity (severity of concave portions of the contour)\n'
' h) concave points (number of concave portions of the contour)\n'
' i) symmetry\n'
' j) fractal dimension ("coastline approximation" - 1).'), last_updated_timestamp=1673216621429, latest_versions=[<ModelVersion: creation_timestamp=1673216348973, current_stage='None', description='', last_updated_timestamp=1673216348973, name='BreastCancerClassification-XGBHP', run_id='61c3dddaf07d4d5ab316f36e7f6d1541', run_link='', source='./mlruns/1/61c3dddaf07d4d5ab316f36e7f6d1541/artifacts/artifacts/model', status='READY', status_message='', tags={}, user_id='', version='1'>], name='BreastCancerClassification-XGBHP', tags={}>
<ModelVersion: creation_timestamp=1673216348973, current_stage='None', description=('This model version is the first XGBoost model trained with HyperOpt for '
'bayesian hyperparameter tuning.'), last_updated_timestamp=1673216628186, name='BreastCancerClassification-XGBHP', run_id='61c3dddaf07d4d5ab316f36e7f6d1541', run_link='', source='./mlruns/1/61c3dddaf07d4d5ab316f36e7f6d1541/artifacts/artifacts/model', status='READY', status_message='', tags={}, user_id='', version='1'>

Once we know we want to push the model to production, we can do this step easily in MLflow as well. One [very] important thing to note: MLflow doesn't have great access controls for this by default, so promoting models this way is very much not the recommended approach in a real setting; you don't want just anyone to be able to push a model to production. I've typically found that the Staging and Production environments for modeling live in different places (i.e., in the cloud) so you can manage permissions and access better.
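The promotion itself is a single client call; a sketch, continuing with the client from above:

# Transition version 1 of the registered model to the Production stage
client.transition_model_version_stage(
    name="BreastCancerClassification-XGBHP",
    version=1,
    stage="Production",
)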

<ModelVersion: creation_timestamp=1673216348973, current_stage='Production', description=('This model version is the first XGBoost model trained with HyperOpt for '
'bayesian hyperparameter tuning.'), last_updated_timestamp=1673217222517, name='BreastCancerClassification-XGBHP', run_id='61c3dddaf07d4d5ab316f36e7f6d1541', run_link='', source='./mlruns/1/61c3dddaf07d4d5ab316f36e7f6d1541/artifacts/artifacts/model', status='READY', status_message='', tags={}, user_id='', version='1'>

What's the value of pushing to Production here? Keep in mind this is version 1 from our experimentation phase, which happened to perform great. Over time we may acquire more data or better knowledge of modeling techniques that could beat our version 1 model. When that happens and we run a second experiment, we can evaluate the second experiment's results against the model in Production and gauge whether or not to replace it. This is when we would want to leverage the "Staging" stage between "None" and "Production".

In this story, we started with the UCI ML Breast Cancer Wisconsin (Diagnostic) dataset and walked through a standard supervised learning workflow for structured, tabular data (on a simple dataset) that leveraged:

  • pandas-profiling for creating an Exploratory Data Analysis report
  • scikit-learn for preprocessing
  • XGBoost for model training
  • HyperOpt for hyperparameter tuning
  • MLflow for experiment tracking, model evaluation, model logging/versioning, and model registry

Hope this helps you jumpstart your journey into the MLOps world and levels up your machine learning workflows!

[1] Breast Cancer Wisconsin (Diagnostic) Data Set (Wolberg, Street, Mangasarian)

[2] Designing Machine Learning Systems (Huyen)

[3] MLflow

[4] HyperOpt

[5] HyperOpt Demystified (Berk)

[6] Algorithms for Hyper-Parameter Optimization (Bergstra et al.)

All images unless otherwise noted are by the author.

