Techno Blender

How to Handle Outliers, Anomalies, and Skews | by TDS Editors | Sep, 2022



Data science is about finding patterns and extracting meaningful insights from their analysis. As any practitioner knows, however, data loves throwing us the occasional curveball: a weird spike, an unexpected dip, or (gasp!) an oddly shaped cluster.

This week, we turn our attention to those jarring moments when things (and our graphs) turn out to be less smooth than we’d hoped. Our selection of highlights covers different approaches for tackling irregularity and coming to terms with the unpredictable.

  • Finding outliers, the right way. As Hennie de Harder observes, “In most projects the data will contain multiple dimensions, and this makes it hard to spot the outliers by eye.” Rather than rely on our fallible powers of observation, Hennie shows how to leverage Cook’s distance, DBSCAN, and Isolation Forest for identifying data points that require extra scrutiny.
  • How to minimize the potential dangers of skewed data. Bias has been a charged buzzword among data and ML professionals in the past few years. Adam Brownell invites us to think about bias as “a skew that produces a type of harm,” and walks us through three strategies to measure it effectively in the context of natural language processing models.
Photo by Jennifer Boyle on Unsplash
  • Adversarial training to the rescue? Anomaly detection is particularly hard in computer vision, where small data volumes and an often-limited variety of images make model training a challenge. Eugenia Anello’s helpful explainer walks us through a novel approach, GANomaly, which leverages the power of generative adversarial networks to address the shortcomings of previous methods.
  • Keeping linear regressions outlier-proof. For a hands-on demonstration of robust linear algorithms and how you can use them to handle outliers lurking within your data, check out Eryk Lewinson’s recent tutorial. It covers Huber regression, random sample consensus (RANSAC) regression, and Theil-Sen regression, and benchmarks their performance on the same dataset.
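To make the first highlight concrete: the detection methods Hennie discusses are available in scikit-learn, and the sketch below (my own illustration on synthetic data, not code from the article) shows two of them flagging the same planted outliers. Both methods use the label -1 for suspicious points.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# 200 inliers clustered around the origin, plus 5 far-away planted outliers
X = np.vstack([
    rng.normal(0, 1, size=(200, 2)),
    rng.uniform(6, 8, size=(5, 2)),
])

# Isolation Forest: points that are easy to isolate get the label -1
iso_labels = IsolationForest(contamination=0.025, random_state=0).fit_predict(X)

# DBSCAN: points that belong to no dense neighbourhood get the noise label -1
db_labels = DBSCAN(eps=0.8, min_samples=5).fit(X).labels_

print("Isolation Forest outliers:", np.where(iso_labels == -1)[0])
print("DBSCAN noise points:     ", np.where(db_labels == -1)[0])
```

Note that the two methods answer slightly different questions: Isolation Forest ranks every point by how anomalous it is (with `contamination` setting the cutoff), while DBSCAN simply leaves sparse points out of any cluster, so its noise set can be larger and includes the tails of the inlier cloud.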
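Likewise, the three robust estimators Eryk benchmarks all ship with scikit-learn. A minimal sketch (again my own synthetic example, not the tutorial’s code) fits each of them, plus ordinary least squares for contrast, to data whose high-x tail is contaminated:

```python
import numpy as np
from sklearn.linear_model import (HuberRegressor, LinearRegression,
                                  RANSACRegressor, TheilSenRegressor)

rng = np.random.default_rng(0)
X = np.linspace(0, 10, 100).reshape(-1, 1)
y = 3.0 * X.ravel() + 1.0 + rng.normal(0, 0.5, size=100)  # true slope: 3.0
y[-10:] += 30  # gross outliers at high x, which drag the OLS slope upward

models = {
    "OLS": LinearRegression(),
    "Huber": HuberRegressor(),
    "RANSAC": RANSACRegressor(random_state=0),
    "Theil-Sen": TheilSenRegressor(random_state=0),
}
for name, model in models.items():
    model.fit(X, y)
    # RANSAC wraps its final inlier-only fit in the estimator_ attribute
    slope = (model.estimator_.coef_[0] if name == "RANSAC"
             else model.coef_[0])
    print(f"{name:>9}: slope = {slope:.2f}")
```

Plain OLS gets pulled well above the true slope of 3.0, while the three robust variants stay close to it; that contrast is the core point of the benchmarking in the tutorial.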
