Browsing Tag: Blancas

Stop Using 0.5 as the Threshold for Your Binary Classifier | by Eduardo Blancas | Nov, 2022

Statistics for Machine Learning. Learn how to set the optimal threshold for your Machine Learning model. Image by author, using image files from flaticon.com. To produce a binary response, classifiers output a real-valued score that is thresholded. For example, logistic regression outputs a probability (a value between 0.0 and 1.0), and observations with a score equal to or higher than 0.5 produce a positive binary output (many other models use the 0.5 threshold by default). However, using the default 0.5 threshold is…
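A minimal sketch of the thresholding described above (not code from the article): a classifier's predicted probabilities are compared against 0.5 by default, but the cutoff can be changed. The dataset and the 0.3 threshold below are illustrative assumptions.

```python
# Illustrative sketch: thresholding predicted probabilities at a custom cutoff
# instead of the default 0.5.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1_000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1_000).fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]  # real-valued scores in [0, 1]

default_preds = (proba >= 0.5).astype(int)  # what .predict() does by default
custom_preds = (proba >= 0.3).astype(int)   # hypothetical tuned threshold

print(default_preds.sum(), custom_preds.sum())  # lower cutoff flags more positives
```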

Can I Trust My Model’s Probabilities? A Deep Dive into Probability Calibration | by Eduardo Blancas | Nov, 2022

Statistics for Data Science. A practical guide on probability calibration. Photo by Edge2Edge Media on Unsplash. Suppose you have a binary classifier and two observations; the model scores them as 0.6 and 0.99, respectively. Is there a higher chance that the sample with the 0.99 score belongs to the positive class? For some models, this is true, but for others it might not. This blog post will dive deeply into probability calibration, an essential tool for every data scientist and machine learning engineer. Probability…
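A minimal sketch of how one can check whether scores behave like probabilities, using scikit-learn's reliability curve (the model and data are illustrative assumptions, not the article's example):

```python
# Illustrative sketch: reliability curve to check whether a model's scores
# match observed frequencies (i.e., whether the model is calibrated).
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5_000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]

# If the model is well calibrated, observed fractions track predicted values.
frac_positives, mean_predicted = calibration_curve(y_test, proba, n_bins=10)
for frac, mean in zip(frac_positives, mean_predicted):
    print(f"predicted ~{mean:.2f} -> observed {frac:.2f}")
```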

Deploying a Data Science Platform on AWS: Parallelizing Experiments (Part III) | by Eduardo Blancas | Nov, 2022

Data Science Cloud Infrastructure. A step-by-step guide to deploy a Data Science platform on AWS with open-source software. Photo by Chris Ried on Unsplash. In our previous post, we configured Amazon ECR to push a Docker image to AWS and configured an S3 bucket to write the output of our Data Science experiments. In this final post, we'll show you how to use Ploomber and Soopervisor to create grids of experiments that you can run in parallel on AWS Batch, and how to request resources dynamically (CPUs, RAM, and GPUs). Hi! My name…
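The post itself uses Ploomber and Soopervisor; as a hedged sketch of what a parallel grid of experiments reduces to on AWS Batch, here is the underlying submission call via boto3. The queue, job definition, script, and parameter names are hypothetical.

```python
# Hedged sketch (not the article's code): submitting a grid of experiments as
# parallel AWS Batch jobs, requesting CPUs and memory per job.
import boto3

batch = boto3.client("batch", region_name="us-east-1")

grid = [{"lr": lr, "n_estimators": n} for lr in (0.01, 0.1) for n in (50, 100)]

for i, params in enumerate(grid):
    batch.submit_job(
        jobName=f"experiment-{i}",
        jobQueue="ds-platform-queue",     # hypothetical queue name
        jobDefinition="ds-platform-job",  # hypothetical job definition
        containerOverrides={
            "command": ["python", "train.py", str(params["lr"]), str(params["n_estimators"])],
            # per-job resources: vCPUs and memory (MiB); GPUs are requested similarly
            "resourceRequirements": [
                {"type": "VCPU", "value": "4"},
                {"type": "MEMORY", "value": "8192"},
            ],
        },
    )
```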

Deploying a Data Science Platform on AWS: Running containerized experiments (Part II) | by Eduardo Blancas | Oct, 2022

Data Science Cloud Infrastructure. A step-by-step guide to deploy a Data Science platform on AWS with open-source software. Photo by Guillaume Bolduc on Unsplash. In our previous post, we saw how to configure AWS Batch and tested our infrastructure by executing a task that spun up a container, waited for 3 seconds, and shut down. In this post, we'll leverage the existing infrastructure, but this time, we'll execute a more interesting example. We'll ship our code to AWS by building a container and storing it in Amazon ECR, a…
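As a hedged sketch of the "store it in Amazon ECR" step, the repository can be created with boto3 as below; the repository name is hypothetical, and the build and push themselves happen with docker (or whichever tooling the post uses).

```python
# Hedged sketch (not the article's code): create the ECR repository that the
# container image will be pushed to.
import boto3

ecr = boto3.client("ecr", region_name="us-east-1")

repo = ecr.create_repository(repositoryName="ds-platform")  # hypothetical name
uri = repo["repository"]["repositoryUri"]
print(uri)

# The image is then built, tagged with this URI, and pushed, e.g.:
#   docker build -t <uri>:latest .
#   docker push <uri>:latest
# (after authenticating docker to ECR with `aws ecr get-login-password`)
```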

Deploying a Data Science Platform on AWS: Setting Up AWS Batch (Part I) | by Eduardo Blancas | Oct, 2022

Data Science Cloud Infrastructure. A step-by-step guide to deploy a Data Science platform on AWS with open-source software. Your laptop isn't enough, let's use the cloud. Photo by CHUTTERSNAP on Unsplash. In this series of tutorials, we'll show you how to deploy a Data Science platform with AWS and open-source software. By the end of the series, you'll be able to submit computational jobs to AWS scalable infrastructure with a single command. Architecture of the Data Science platform we'll deploy. Image by author. Screenshot of…

Introducing Snapshot Testing for Jupyter Notebooks | by Eduardo Blancas | Jul, 2022

Software Engineering for Data Science. nbsnapshot is an open-source package that benchmarks notebook outputs to detect issues automatically. Image by author. If you want to keep up-to-date with my content, follow me on Medium or Twitter. Thanks for reading! When analyzing data in a Jupyter notebook, I unconsciously memorize "rules of thumb" to determine if my results are correct. For example, I might print some summary statistics and become skeptical of some outputs if they deviate too much from what I've seen historically.…
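To make the idea concrete, here is a conceptual sketch of benchmarking a notebook output against its history; this is not nbsnapshot's API, just the underlying check, and the file and metric names are hypothetical.

```python
# Conceptual sketch of snapshot-style checking (NOT nbsnapshot's API): flag a
# new output value if it deviates too much from previously recorded values.
import json
import statistics
from pathlib import Path

HISTORY = Path("output_history.json")  # hypothetical file of past outputs


def check_output(name: str, value: float, z_threshold: float = 3.0) -> None:
    history = json.loads(HISTORY.read_text()) if HISTORY.exists() else {}
    past = history.get(name, [])
    if len(past) >= 2:
        mean, std = statistics.mean(past), statistics.stdev(past)
        if std > 0 and abs(value - mean) / std > z_threshold:
            raise ValueError(f"{name}={value} deviates from historical mean {mean:.3f}")
    history[name] = past + [value]
    HISTORY.write_text(json.dumps(history))


check_output("model_accuracy", 0.87)  # benchmark this run against previous ones
```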

From Jupyter to Kubernetes: Refactoring and Deploying Notebooks Using Open-Source Tools | by Eduardo Blancas | Jun, 2022

Software Engineering For Data Science. A step-by-step guide to going from a messy notebook to a pipeline running in Kubernetes. Photo by Myriam Jessier on Unsplash. Notebooks are great for rapid iterations and prototyping but quickly get messy. After working on a notebook, my code becomes difficult to manage and unsuitable for deployment. In production, code organization is essential for maintainability (it's much easier to improve and debug organized code than a long, messy notebook). In this post, I'll describe how you can use…

Analyze and plot 5.5M records in 20s with BigQuery and Ploomber | by Eduardo Blancas | May, 2022

Software Engineering for Data Science. Develop scalable pipelines on Google Cloud using open-source software. Image by author. This tutorial will show how you can use Google Cloud and Ploomber to develop a scalable and production-ready pipeline. We'll use Google BigQuery (data warehouse) and Cloud Storage to show how we can transform big datasets with ease using SQL, plot the results with Python, and store the results in the cloud. Thanks to BigQuery's scalability (we'll use a dataset with 5.5M records!) and Ploomber's…
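A minimal sketch of the pattern the teaser describes: run the aggregation as SQL in BigQuery, pull the small result into pandas, and plot it with Python. The project, dataset, table, and column names are hypothetical, not the article's.

```python
# Illustrative sketch (not the article's pipeline): BigQuery does the heavy
# lifting via SQL; the aggregated result is plotted locally with Python.
import matplotlib.pyplot as plt
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")  # hypothetical project

sql = """
SELECT category, COUNT(*) AS n
FROM `my-gcp-project.my_dataset.my_table`  -- hypothetical table
GROUP BY category
ORDER BY n DESC
"""

df = client.query(sql).to_dataframe()  # only the aggregated rows leave BigQuery
df.plot.bar(x="category", y="n")
plt.savefig("counts.png")
```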