
Every Scaler and Its Application in Data Science

by Emmett Boudreau | Nov 2022



Venerable continuous scalers and applying them to machine-learning

Photo by Antoine Dautry on Unsplash

Introduction

The most important factor in your model is always going to be your data. The data given to a model will ultimately determine how well that model performs, and features are therefore an integral part of creating a model that performs well. There are several ways to improve model validity by adjusting different aspects of the data. There are two types of features where these sorts of alterations are often beneficial: categorical features and continuous features. For categorical features, we use encoders. For continuous features, we use scalers. Though they serve somewhat different purposes, as preprocessing techniques they both contribute to higher validation accuracy. I have gone over encoders in the past, so today we will take the opposite trajectory. If you would like to learn more about encoders, I also have an article on that which you may read here:

Scalers are an incredibly important tool for Data Scientists. They are applied to data in order to make it more interpretable by machine-learning algorithms. This type of math can help us make generalizations and draw conclusions from data a lot more decisively. Probably the most well-known continuous scaler has a very prominent application in both statistics and machine learning: the normal, or standard, scaler.

The standard scaler standardizes our data by determining how many standard deviations each value is from the mean of the population. To demonstrate this, we will look at its formula and explain how exactly it improves our ability to quantify continuous features. We will also touch on the other continuous scalers, both those useful to machine learning and science and those that are not so useful.

The standard scaler

As I touched on before, the standard scaler is most likely the most well-known scaler. Rather conveniently, this scaler is built directly on the normal distribution. The normal distribution is a perfect introduction and foundational principle to inferential and Gaussian statistics, so it is very important that we as scientists stress the importance and use of this scaler. If you would like to learn more about the normal distribution, I have an article that I think explains it quite well which you may read here:

Let us briefly review the math behind the normal distribution. The standardization formula at its core, often called the z-score, is:

z = (x̄ - µ) / σ

That is, the sample minus the population mean, divided by the population standard deviation. Examining that math, we are finding the difference between our sample and the average and then seeing how many standard deviations fit into that difference. This is why the normal distribution is a visualization of standard deviations from the mean. Given that most of a population rests in the center near the mean (that is why it is the mean), we see the bell shape so often associated with distributions and data. The standard deviation parameterizes how spread out our data is, and the mean provides the center that each sample is compared against.
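As a quick worked example with purely illustrative numbers (not from any real dataset), a sample value of 12 drawn from a population with mean 10 and standard deviation 2 sits exactly one standard deviation above the mean:

x, μ, σ = 12, 10, 2    # hypothetical sample value, population mean, population standard deviation
z = (x - μ) / σ        # (12 - 10) / 2 = 1.0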

This technique has proven extremely effective in machine learning. As a matter of fact, most pipelines that utilize continuous data like this are very likely to include this form of standardization. If you are working on a project, this is probably the best tool for this sort of job. Here is a simple implementation of a standard scaler in Julia:

using Statistics: mean, std

function standard_scale(x::Vector{<:Number})
    # subtract the mean from each value, then divide by the standard deviation
    σ = std(x)
    μ = mean(x)
    [(i - μ) / σ for i in x]
end
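As a quick sanity check, the scaled values should come out with a mean of roughly 0 and a standard deviation of roughly 1. The results below are what we would expect, not output captured from a live session:

scaled = standard_scale([5, 10, 15, 20, 25])
mean(scaled), std(scaled)    # expected: approximately (0.0, 1.0)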

Unit length scaler

A scaler that is much less known is the unit-length scaler. Sorry for another link, but there is another resource I wrote on unit-length scalers and machine learning for those who want to learn more:

While rescalers, arbitrary rescalers, and mean normalizers typically take a back seat to standardization, the unit-length scaler stands on its own in the machine-learning world. It works by dividing each element by the Euclidean length (norm) of the vector.

Here is an example of this scaler in Julia:

using LinearAlgebra: norm

function unitl_scale(data::AbstractVector{<:Number})
    # the Euclidean length of the vector is its distance from the origin
    euclidlen = norm(data)
    [i / euclidlen for i in data]
end
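To see it in action, a vector like [3, 4] has a Euclidean length of 5, so scaling should give [0.6, 0.8], which itself has a length of 1. These are expected values, not captured output:

scaled = unitl_scale([3, 4])    # expected: [0.6, 0.8]
norm(scaled)                    # expected: 1.0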

Mean normalization

Mean normalization is another scaler that has some useful applications. Generally, however, there are better options, and this is a scaler I have not seen used much in my experience. That being said, there could definitely be times where this sort of normalization comes in handy and is an apt choice over the standard scaler. Here is an implementation in Julia:

using Statistics: mean

function mean_norm(array::Array{<:Number})
    # subtract the mean, then divide by the range (maximum minus minimum)
    avg = mean(array)
    b = minimum(array)
    a = maximum(array)
    [(i - avg) / (a - b) for i in array]
end

While normalization via the normal distribution involves determining the number of standard deviations from the mean, mean normalization forms its scale by subtracting the smallest value in the data from the largest. Compared to the standard scaler, all that changes in this formula is that the standard deviation is replaced with a - b, the range of the data.
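A short illustration with made-up numbers: for the vector [2, 4, 6, 8, 10], the mean is 6 and the range is 8, so each value maps to (x - 6) / 8. The result shown is expected output, not a captured session:

scaled = mean_norm([2, 4, 6, 8, 10])    # expected: [-0.5, -0.25, 0.0, 0.25, 0.5]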

Rescaling

Rescaling is one of the simpler options, and it is not typically seen in Data Science. This is not to say that this scaler has no purpose, but rather that its role in Data Science has been quite minimal. It would certainly be interesting if there were some unexplored uses for this scaler, but for the time being it is not a scaler I use. Regardless, it is pretty cool to see the math behind some of these scalers, so I would at least like to write one. Here is an example of a rescaler written in Julia:

function rescaler(x::Array{<:Number})
    # min-max scaling: map every value onto the range [0, 1]
    low = minimum(x)
    high = maximum(x)
    [(i - low) / (high - low) for i in x]
end
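For illustration, the smallest value maps to 0 and the largest to 1. Again, this is the expected output rather than a captured session:

scaled = rescaler([10, 15, 20])    # expected: [0.0, 0.5, 1.0]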

This is probably the scaler least likely to be applied in Data Science. That being said, you never know, and it is certainly still an interesting formula to know about. Fortunately, it is also rather simple mathematically, with only the minimum and maximum required to scale the data.

Closing thoughts

Scaling data is a foundational building block of statistical practice, and it is definitely an important thing to get familiar with for anyone who wants to get acclimated to Data Science. It is easy to get stuck in the trap of simply using the standard scaler for every application and never exploring the other tools available. This is understandable, as the standard scaler will often be the best tool for the job. However, I would argue that it makes a lot of sense to do some exploration on this front.

Mastering these different scalers can provide a substantial benefit toward better prediction accuracy in many different circumstances. Furthermore, having a diverse and firm understanding of different scalers can come in handy for diagnosing what is going on when things go wrong with them. I hope this article was helpful in providing some context on that front. Thank you for reading!

