6 Dimensionality Reduction Techniques | by Kevin Berlemont, PhD | May, 2022

How and when to use them

Photo by Rodion Kutsaiev from Pexels

In the age of big data, data scientists work with datasets that have more and more features. This leads to a well-known effect: the curse of dimensionality. Past a certain point, adding features degrades the performance of the model, because the density of the data points decreases as the dimensionality increases (without adding any samples). One of the main consequences is that it becomes extremely easy for a model to overfit.

To overcome the overfitting, training-time, and storage issues caused by high dimensionality, a popular approach is to apply a dimensionality reduction technique to the original dataset.

In this post I will describe six dimensionality reduction methods that you should know when doing a data science project. I will show an application of each method to the well-known MNIST dataset. Finally, I will compare them and detail when to use which method in the last part.

The first method, Principal Component Analysis (PCA), is one of the most well-known feature extraction algorithms. PCA is an unsupervised, linear transformation that produces new features by finding the directions of maximum variance in the data. The algorithm is as follows:

  • Construct the covariance matrix of the data
  • Apply the eigendecomposition to this matrix and sort the eigenvalues in decreasing order
  • Transform the data by projecting it onto the top-k eigenvectors
PCA applied to MNIST dataset

PCA has several benefits, such as being non-iterative and thus fast. Moreover, it reduces overfitting well. However, it is limited to linear projections and therefore does not handle non-linear data well. This is why none of the digits form clearly separated clusters in the PCA projection.
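As a minimal sketch (not the exact code behind the figure above), PCA can be applied with scikit-learn along these lines, using the small load_digits data as a lightweight stand-in for the full MNIST images (which could instead be fetched with fetch_openml('mnist_784')):

```python
# Minimal sketch: project the digits onto the top-2 principal components.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)

# Standardize so no single pixel dominates the covariance matrix
X_scaled = StandardScaler().fit_transform(X)

# Eigendecomposition of the covariance matrix, keeping the top-2 components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

print(X_pca.shape)                    # (1797, 2)
print(pca.explained_variance_ratio_)  # variance captured by each component
```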

Like the previous one, Linear Discriminant Analysis (LDA) is a linear feature extraction algorithm, but this time a supervised one. It creates a new feature space onto which the data is projected with the goal of maximizing class separability. It does so by building two distance matrices: between-class and within-class. The first one measures the distance between the means of the classes. The second one measures the distance between each class mean and the data points within that class.

The algorithm is the following:

  • Build the two distance matrices
  • Compute the eigenvalues of the combination of these matrices
  • Rank the eigenvectors and build the top-k matrix
  • Project the data onto the new subspace
LDA applied to MNIST dataset

Selecting the top-k eigenvectors makes LDA similar to PCA. However, one major drawback of this algorithm is that in the case of binary classification, only one feature is available after LDA, regardless of the number of original features.
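A minimal sketch with scikit-learn, again using load_digits as a stand-in for MNIST; note that, being supervised, LDA needs the labels y:

```python
# Minimal sketch: supervised projection with Linear Discriminant Analysis.
from sklearn.datasets import load_digits
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_digits(return_X_y=True)

# With 10 digit classes, LDA can return at most 10 - 1 = 9 components
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)  # labels are required, unlike PCA

print(X_lda.shape)  # (1797, 2)
```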

Before presenting some non-linear algorithms, I want to highlight Independent Component Analysis (ICA). This is a linear, unsupervised feature extraction algorithm that generates new features that are statistically independent. It does so by reducing the second- and higher-order dependencies in a given dataset.

The algorithm is thus:

  • Decompose the data into a mixing matrix A and a matrix of independent sources S
  • Select the top-k independent components
  • Build the new features by projecting the data onto those k components
ICA applied to MNIST dataset

In contrast to the two previous algorithms, ICA searches for features that are statistically independent. It does so by looking for non-Gaussian components. This algorithm is commonly used for the blind source separation problem. One constraint is that it cannot separate Gaussian sources.
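A minimal sketch with scikit-learn's FastICA implementation (load_digits again stands in for MNIST):

```python
# Minimal sketch: extract statistically independent components with FastICA.
from sklearn.datasets import load_digits
from sklearn.decomposition import FastICA

X, y = load_digits(return_X_y=True)

# Estimate 2 independent components; the data is modeled as X = S @ A.T
ica = FastICA(n_components=2, random_state=0)
X_ica = ica.fit_transform(X)   # S: the independent sources

print(X_ica.shape)        # (1797, 2)
print(ica.mixing_.shape)  # A: the estimated mixing matrix, (64, 2)
```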

The next three methods are non-linear dimensionality reduction techniques. The first one, Multidimensional Scaling (MDS), is a non-linear, unsupervised method that focuses on the relationships, such as similarities or dissimilarities, between data points in a multi-dimensional space. MDS places similar points close together and dissimilar points far apart. It does so by finding an embedding whose distance matrix matches the distance matrix of the original features as closely as possible.

The algorithm is:

  • Compute the dissimilarity matrix of the data
  • Compute a kernel matrix K by double-centering this dissimilarity matrix
  • Apply the eigendecomposition to K
  • Obtain the new feature space by selecting the top-k eigenvectors
MDS applied to MNIST dataset

Be careful with this method as it is computationally expensive.
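A minimal sketch with scikit-learn; note that sklearn.manifold.MDS minimizes the stress iteratively (SMACOF) rather than using the eigendecomposition described above, and only a subset of the digits is used here because of the cost:

```python
# Minimal sketch: metric MDS embedding that tries to preserve pairwise distances.
from sklearn.datasets import load_digits
from sklearn.manifold import MDS

X, y = load_digits(return_X_y=True)
X_small = X[:500]  # MDS scales badly with the number of samples, keep it small

mds = MDS(n_components=2, random_state=0)
X_mds = mds.fit_transform(X_small)

print(X_mds.shape)  # (500, 2)
print(mds.stress_)  # mismatch between original and embedded distances (lower is better)
```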

The next method, Isomap, stems from the fact that classical scaling methods do not capture the possible non-linear patterns in a dataset. This non-linear, unsupervised method solves the issue by preserving the pairwise geodesic distances between data points in the lower-dimensional space. It computes the shortest paths between all pairs of points in a neighborhood graph to approximate the geodesic distances between them.

The steps are:

  • Construct the neighborhood graph of the data
  • Compute the geodesic distance matrix
  • Apply MDS to this geodesic matrix
ISOMAP applied to MNIST dataset

As we can observe from the application of the method to the MNIST dataset, some of the digits are very well separated, but the task is harder for similar digits such as 2 and 7. Isomap is one of the earliest manifold learning algorithms, but because it relies on the neighborhood graph it can sometimes build wrong connections between points. A non-convex manifold will also be difficult to handle with this algorithm.
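A minimal sketch with scikit-learn's Isomap (load_digits stands in for MNIST; the neighborhood size is the main knob to tune):

```python
# Minimal sketch: Isomap embedding built from a k-nearest-neighbor graph.
from sklearn.datasets import load_digits
from sklearn.manifold import Isomap

X, y = load_digits(return_X_y=True)

# Too few neighbors can disconnect the graph; too many can create the
# wrong "shortcut" connections mentioned above
iso = Isomap(n_neighbors=10, n_components=2)
X_iso = iso.fit_transform(X)

print(X_iso.shape)                 # (1797, 2)
print(iso.reconstruction_error())  # how well geodesic distances are preserved
```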

The last algorithm I will present, t-SNE, is a manifold-based dimensionality reduction technique. It is a non-linear, unsupervised method that is also handy for visualization, as it retains the data's local structure. The algorithm first converts the Euclidean distances between data points into conditional probabilities that represent similarities. t-SNE then minimizes the difference between the probabilities computed in the high-dimensional and low-dimensional spaces (using the Kullback-Leibler divergence).

The summarized algorithm is:

  • Apply Stochastic Neighbor Embedding to obtain the conditional probabilities
  • Map the high dimensional space to the low dimensional one by minimizing the distance between the probabilities
t-SNE applied to MNIST dataset

By preserving the local structure, this algorithm is particularly well suited to data visualization in a low-dimensional space, as the MNIST plot above attests: most of the digit classes are well separated.
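A minimal sketch with scikit-learn's TSNE (load_digits stands in for MNIST; perplexity is the main hyperparameter and roughly sets the size of the neighborhood considered around each point):

```python
# Minimal sketch: 2-D t-SNE embedding for visualization.
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)

tsne = TSNE(n_components=2, perplexity=30, random_state=0)
X_tsne = tsne.fit_transform(X)

print(tsne.kl_divergence_)  # final Kullback-Leibler divergence

# Color the embedding by digit class to check how well the classes separate
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap="tab10", s=5)
plt.colorbar(label="digit")
plt.show()
```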

These six dimensionality reduction techniques illustrate the variety of methods available for this purpose. Each of them has its strengths and weaknesses, and some, such as t-SNE, are mostly used for data visualization. All of them are available in Scikit-Learn, which provides an easy way to integrate them into a machine learning pipeline.

With the summary below, I recap when and how to use these different feature extraction algorithms. We do have a set of tools to reduce the dimensionality of our data, but each of them has to be used in the right context in order to be truly efficient.
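  • PCA: linear, unsupervised; fast and non-iterative, but limited to linear structure
  • LDA: linear, supervised; maximizes class separability, returns at most (number of classes − 1) components
  • ICA: linear, unsupervised; extracts statistically independent, non-Gaussian components (blind source separation)
  • MDS: non-linear, unsupervised; preserves pairwise dissimilarities, but computationally expensive
  • Isomap: non-linear, unsupervised; preserves geodesic distances via a neighborhood graph, which can introduce wrong connections
  • t-SNE: non-linear, unsupervised; preserves local structure, best suited for visualization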

