# UMAP Variance Explained. A simple way to interpret UMAP… | by Nikolay Oskolkov | Mar, 2023

## A simple way to interpret UMAP components

This is the **twenty fifth** post from my column **Mathematical Statistics and Machine Learning for Life Sciences**, where I use plain language to discuss analytical methods from Computational Biology and Life Sciences. **UMAP**** **is a **dimensionality reduction** technique that together with **tSNE** became increasingly popular and *de-facto* a standard tool for analysis of **single cell genomics** data, where traditional methods such as **PCA** **have limitations**. However, one disadvantage of UMAP and tSNE in comparison with PCA is their **non-interpretable** components that are not straightforward to link to the variation in the original data. In this article, I suggest a simple method for **estimating amount of variation explained** **by leading UMAP and tSNE components**. Using the classic MNIST dataset of black and white images of handwritten digits as a benchmark, I demonstrate that **leading UMAP and tSNE components are inferior to PCA components in terms of explaining total data variation**, however, surprisingly, demonstrate a better linkage to the labels of data points, i.e. they explain more** biological rather than total variation in the data**.

As a benchmark dataset, we will be using the **MNIST**** **which includes 70 000 black and white images of handwritten digits of **28 x 28 pixels** **resolution, **i.e.** 784 pixels** per image. First, we will download the MNIST dataset, check its dimensions and visualize a few random images.

`from sklearn.datasets import fetch_openml`mnist = fetch_openml('mnist_784', version = 1)

labels = mnist.target.astype(int)

print(mnist.data.shape)

#(70000, 784)

from matplotlib import pyplot as plt

plt.figure(figsize = (20, 15))

for i in range(25):

plt.subplot(5, 5, i + 1)

plt.imshow(mnist.data[i].reshape(28, 28), cmap = plt.get_cmap('gray'))

plt.show()

The values of MNIST dataset represent **pixel intensities** varying from 0 to 255, where 0 corresponds to the black background. Therefore, MNIST is a **“zero-inflated”** data set which is very **similar to a** **typical single cell** gene expression dataset. It is usually recommended to normalize the black and white pixel intensities by the largest 255 value. However, here, by analogy with single cell gene expression, we will use a **log-transform** as another mild normalization strategy. Also, for the sake of time, we will not use all the 70 000 images but **randomly pick 10 000 images**, later we will draw 10 000 images a few times to make sure that our conclusions are robust.

`import numpy as np`N_points = 10000

X = np.log10(mnist.data + 1)

#X = mnist.data / 255

np.random.seed(123)

random_indices = np.random.choice(X.shape[0], size=N_points, replace=False)

X = X[random_indices,:]

labels = labels[random_indices]

As a first exploratory step, in order to **understand the variation** **in MNIST**, we will perform a PCA analysis on the MNIST itself and a **shuffled** version of MNIST. This will help us to estimate the number of meaningful, i.e. non-redundant, dimensions out of the initial 784 dimensions, i.e. we will figure our the number of **informative** dimensions to keep for all the future tested dimension reduction and clustering techniques. More details are here.

`import pandas as pd; import matplotlib.pyplot as plt`

from sklearn.decomposition import PCA; import seaborn as snsN_pca_comps = 100

sns.set(font_scale = 1.5); figure = plt.figure(figsize = (20, 15))

plt.subplot(221)

X_reduced = PCA(n_components = 2).fit_transform(X)

plt.scatter(X_reduced[:,0], X_reduced[:,1], c=labels, cmap='tab10', s=10)

plt.title('PCA: MNIST', fontsize = 25)

plt.xlabel('PC1', fontsize = 22); plt.ylabel('PC2', fontsize = 22)

plt.subplot(222)

pca = PCA(n_components = N_pca_comps).fit(X)

print('Observed variance explained:')

print(pca.explained_variance_ratio_[0:10]); print('\n')

plt.bar(range(len(pca.explained_variance_ratio_)),

pca.explained_variance_ratio_)

plt.xlabel('Number of Principal Components', fontsize = 22)

plt.ylabel('Explained Variance', fontsize = 22)

N_perm = 10

X_flat = X.flatten()

expl_var_perm_df = pd.DataFrame(index = list(range(N_perm)),

columns = list(range(X.shape[1])))

for i in range(N_perm):

np.random.shuffle(X_flat)

X_perm = X_flat.reshape(X.shape[0], X.shape[1])

pca_perm = PCA().fit(X_perm)

expl_var_perm_df.loc[i] = pca_perm.explained_variance_ratio_

print('Finished {} permutations'.format(i + 1))

X_perm = list(expl_var_perm_df.mean(axis = 0) +

2*expl_var_perm_df.std(axis = 0))

print('\nPermuted variance explained:')

print(X_perm[0:10])

plt.subplot(223)

plt.plot(pca.explained_variance_ratio_, c = 'blue')

plt.plot(X_perm, c = 'red'); plt.xlim([-1, N_pca_comps])

plt.xlabel('Number of Principal Components', fontsize = 22)

plt.ylabel('Explained Variance', fontsize = 22)

plt.gca().legend(('Observed variance explained',

'Permuted variance explained'), fontsize = 20)

plt.subplot(224)

pval = list()

for j in range(N_pca_comps):

pval.append(np.sum(expl_var_perm_df.iloc[:, j] +

2*expl_var_perm_df.std(axis = 0) >=

pca.explained_variance_ratio_[j]) / N_perm)

plt.plot(pval, c = 'darkgreen')

plt.xlabel('Number of Principal Components', fontsize = 22)

plt.ylabel('P-value', fontsize = 22); plt.xlim([-1, N_pca_comps])

N_opt_pcs = np.where(np.array(pval) >= 0.05)[0][0]

print('\nNumber of significant Principal Components: {}'.format(N_opt_pcs))

print('Together they explain {}% of variation in the data'.\

format(int(round(sum(pca.explained_variance_ratio_[0:\

np.where(np.array(pval) >= 0.05)[0][0]])*100,0))))

figure.tight_layout()

plt.show()

An important outcome of the PCA analysis on MNIST and shuffled MNIST is that the dataset seems to have **62 informative **principal components** **that all together **capture 86% of variation in the data**. Therefore, if we replace the original dataset including 784 features by 62 PCs, we will keep most of the variation in the data, but will **reduce the dimensionality of the data by more than 10 times**.

Now, let us visualize the MNIST dataset as a 2D map using UMAP and tSNE dimension reduction techniques. In both cases, we will use 2 top principal components, i.e. 2 PCs, for initialization. The perplexity hyperparameter** **for tSNE and the number of nearest neighbors for UMAP will be computed as a square root of the number of data points (images). See here for details.

`import umap; import numpy as np`

import seaborn as sns; import matplotlib.pyplot as plt

from sklearn.manifold import TSNE; from sklearn.decomposition import PCAopt_perp = np.int(np.round(np.sqrt(X.shape[0]), 0))

X_reduced = PCA(n_components = N_opt_pcs).fit_transform(X)

umap_embedding = umap.UMAP(n_components = 2, n_neighbors = opt_perp,

init = X_reduced[:, 0:2],

min_dist=0.3, n_epochs = 1000, random_state = 123,

verbose = 0).fit_transform(X_reduced)

tsne_embedding = TSNE(n_components=2, perplexity=opt_perp,

init=X_reduced[:, 0:2],

learning_rate = 200, n_iter = 1000, random_state = 123,

verbose = 0).fit_transform(X_reduced)

sns.set(font_scale = 1.5); plt.figure(figsize = (20, 10))

plt.subplot(121)

plt.scatter(tsne_embedding[:, 0], tsne_embedding[:, 1], c = labels, s = 10,

cmap = 'tab20')

plt.title('tSNE: MNIST', fontsize = 25)

plt.xlabel("tSNE1", fontsize = 22); plt.ylabel("tSNE2", fontsize = 22)

plt.subplot(122)

plt.scatter(umap_embedding[:, 0], umap_embedding[:, 1], c = labels, s = 10,

cmap = 'tab20')

plt.title('UMAP: MNIST', fontsize = 25)

plt.xlabel("UMAP1", fontsize = 22); plt.ylabel("UMAP2", fontsize = 22)

plt.show()

One obvious conclusion that can be drawn from the comparison of 2D PCA, tSNE and UMAP pictures is that the **classes** **of handwritten digits** are much more resolved by the non-linear **neighbor-graph** dimensionality reduction methods such as tSNE / UMAP compared to the **linear** **matrix factorization** techniques such as PCA. Therefore if we assume the classes of handwritten digits (aka biological phenotype, e.g. cell type in scRNAseq experiments) to comprise **most of the variation** in the MNIST data, it probably makes sense to hypothesize that **two tSNE/UMAP components are capabale of capturing more biological variation than two PCA components (at least in the MNIST dataset)**. In the next sections, we are going to try to **quantify and prove this hypothesis**.

As non-linear** **neighbor-graph based techniques, **UMAP / tSNE do not seem to have the concept of variance explained by its components** **in contrast to matrix factorization linear dimension reduction techniques such as PCA**, see for example the answer of Leland McInnes, the author of UMAP. Here, nevertheless, we will try to quantify the amount of MNIST variation in pixel intensities explained by UMAP / tSNE components and compare it with the amount of variation explained by PCA components using the **Partial Least Square (PLS) regression** and the **R-squared statistic** generalized for matrix operations.

Let us check the percentage of MNIST variation explained by first principal component (PC1), we can easily compute and extract it from the PCA as the normalized **first eigen value** of MNIST. Below, we can see that **PC1 explains ~11% of MNIST variation**.

`#First, we will compute variance explained by PC1 via PCA in sklearn`

import numpy as np

from sklearn.decomposition import PCApca = PCA(n_components = X.shape[1]).fit(X)

pca_comps = PCA().fit_transform(X)

print(pca.explained_variance_ratio_[0])

#0.11043073983593521

Now, **we will** **reproduce this number from the** **Partial Least Squares (PLS)** **regression**. We are going to use the following reasoning. Suppose we want to **approximate a matrix X by another matrix PCA_matrix**, which for now includes only one column, that is PC1, but generally can include up to 784 columns for MNIST dataset. Then, we can fit a PLS linear regression model of **X = B * PCA_matrix** and compute a **R-squared statistic** which will reflect the amount of variation in **X** explained by **PCA_matrix**. In the matrix form, the R-squared statistic will be given by the following equation:

where **B*PCA_matrix** represents a prediction of **X** from its approximation **PCA_matrix** and will be found from the PLS regression (first equation). To technically implement this procedure in **scikit-learn Python module**, we will first fit a PLS model with **X** as a response variable, and **PCA_matrix** as an explanatory variable, we compute a prediction **y_pred = B*PCA_matrix**, and, finally, we can either use the **r2_score** function in scikit-learn, or the second equation above for computing the R-squared statistic. Let us check that both will give identical answers:

`#Now let us compute variance explained by PC1 through PLS procedure`

from sklearn.metrics import r2_score

from sklearn.cross_decomposition import PLSRegressionPCA_matrix = pd.DataFrame(pca_comps[:, 0:1])

pls = PLSRegression(n_components = 1)

pls.fit(PCA_matrix, X)

y_pred = pls.predict(PCA_matrix)

print(r2_score(X, y_pred, multioutput = 'variance_weighted'))

#0.11043073983593246

#Finally, let us compute variance explained by PC1 from scratch

print(1 - np.sum((np.array(X) - np.array(y_pred))**2) / np.sum((X - \

np.mean(X, axis = 0))**2))

#0.11043073983593554

Thus, by utilizing PLS regression, we computed the fraction of total MNIST **variation explained by the first principal component** **outside of the PCA algorithm**. Compare the fraction calculated by the PLS with the respective variance explained by the PCA algorithm pca.explained_variance_ratio_[0] which is the ratio of first eigen value divided by the sum of all eigen values. They are nearly identical. Now let **PCA_matrix** consist of a number of PCs. Below, we demonstrate that the PLS computation of cumulative fraction of variance explained by principal components will result in **nearly identical **to PCA cumulative variance explained.

`from sklearn.metrics import r2_score`

from sklearn.cross_decomposition import PLSRegressionpredicted_var_expl = []

for i in range(1, 21):

PCA_matrix_current = pd.DataFrame(pca_comps[:, 0:i])

pls_current = PLSRegression(n_components = i)

pls_current.fit(PCA_matrix_current, X)

y_pred_current = pls_current.predict(PCA_matrix_current)

predicted_var_expl.append(r2_score(X, y_pred_current,

multioutput = 'variance_weighted'))

import seaborn as sns

import matplotlib.pyplot as plt

sns.set(font_scale = 1.5); plt.figure(figsize = (20, 15))

plt.plot(np.cumsum(pca.explained_variance_ratio_[0:100]),linewidth=5)

plt.plot(predicted_var_expl, linewidth = 5)

plt.ylabel('Cumulative Explained Variance', fontsize = 20)

plt.xlabel('Number of Principal Components', fontsize = 20)

plt.legend(['PCA - computed cumulative variance explained',

'PLS - computed cumulative variance explained'],

fontsize = 20); plt.show()

For clarity, we plot PLS-computed MNIST variance explained for PC1-PC20 on the top of the native PCA eigen value based variance explained which is done for PC1 — PC100. We can see, that **the** **curves of cumulative variance explained computed by PLS and PCA nicely coincide**. This is no surprise since PLS computation essentially mimics what is going under the hood in the PCA algorithm. However, very importantly, this gives us an **instrument** of computing variance of **X** explained by **any approximation**, i.e. not only the **PCA_matrix** but also any other matrix. In the next session, we will use **UMAP_matrix** as an approximation of the original **X** matrix of MNIST pixel intensities and **compare the cumulative variance of X explained by UMAP and PCA components**.

In this section we will use the methodology of PLS estimation of variance explained developed in the previous section. Again, we will use the same reasoning: since UMAP gives some sort of **approximation** **of the original data** **X**, and people even run **clustering on 2D UMAP** for discovering cell types in scRNAseq field, we can use the **UMAP_matrix** (and we can do the same for tSNE) as an approximation of **X**, then we fit the PLS model **X=B * UMAP_matrix**, and estimate the fraction of MNIST variance explained by UMAP components via R-squared statistic as

Now, we will use the PLS methodology developed in the previous section, and compute the fraction of MNIST variance explained by the first UMAP component.

`from sklearn.metrics import r2_score`

from sklearn.cross_decomposition import PLSRegression#MNIST variation explained by UMAP1

UMAP_matrix = pd.DataFrame(umap_embedding[:, 0:1])

pls = PLSRegression(n_components = 1)

pls.fit(UMAP_matrix, X)

y_pred = pls.predict(UMAP_matrix)

print(r2_score(X, y_pred, multioutput = 'variance_weighted'))

#0.07335034485651613

#Here the same but more explicitly via the R^2 equation above

print(1 - np.sum((np.array(X) - np.array(y_pred))**2) / np.sum((X - \

np.mean(X, axis = 0))**2))

#0.07335034485652026

#MNIST variation explained by tSNE1

tSNE_matrix = pd.DataFrame(tsne_embedding[:, 0:1])

pls = PLSRegression(n_components = 1)

pls.fit(tSNE_matrix, X)

y_pred = pls.predict(tSNE_matrix)

print(r2_score(X, y_pred, multioutput = 'variance_weighted'))

#0.07265918428990921

#Here the same but explicitly via the R^2 equation above

print(1 - np.sum((np.array(X) - np.array(y_pred))**2) / np.sum((X - \

np.mean(X, axis = 0))**2))

#0.07265918428991347

We conclude that **first components of both UMAP and tSNE explain about 7% of MNIST variation which is lower than the 11% explained by the first PCA component**. This is not surprising. Intuitively, it is hard to expect that there can be another latent variable alternative to PC1 that explains more than 11% of MNIST variation without redefining the concept of “explained variance”. The current definition comes from the PCA (**normalized eigen values**), and linear regression, that is R-squared, analyses. Both are linear frameworks. **In addition,** **PC1 by definition corresponds to a direction of maximal data variation**. Therefore, if another latent variable comes from a nonlinear analysis such as UMAP, that does not *per se* attempts to maximize variation in the data, **it would be hard to expect that e.g. UMAP1 explains more variation in MNIST** **within linear definition of “variance explained”**. Now we will calculate the variance explained by a few top tSNE and UMAP components. To make our analysis more robust towards sampling of the 10 000 images, we will independently draw them **N_iter** times. Also, for each iteration, we will slightly **change /** **perturb** **the hyperparameters of UMAP and tSNE, as well as the number of data points for PCA**. This will allow us to build **confidence intervals** and address the sensitivity of our analysis.

`import umap; import numpy as np; import seaborn as sns`

import matplotlib.pyplot as plt; from sklearn.manifold import TSNE

from sklearn.metrics import r2_score; from sklearn.decomposition import PCA

from sklearn.cross_decomposition import PLSRegressionN_iter = 3; N_comps = 3

N_points_list = [5000, 10000, 15000]

perp_list = [50, 100, 150]; min_dist_list = [0.1, 0.2, 0.3]

predicted_var_expl_matrix = np.zeros(shape = (N_iter, N_comps))

predicted_var_expl_umap_matrix = np.zeros(shape = (N_iter, N_comps))

predicted_var_expl_tsne_matrix = np.zeros(shape = (N_iter, N_comps))

for j in range(N_iter):

#MNIST variance explained by PCA components

np.random.seed(j)

X = np.log10(mnist.data + 1); labels = mnist.target.astype(int)

random_indices = np.random.choice(X.shape[0], size = N_points_list[j],

replace = False)

X_sample = X[random_indices,:]; labels_sample = labels[random_indices]

pca_comps_sample = PCA(n_components = N_comps).fit_transform(X_sample)

predicted_var_expl = []

for i in range(1, (N_comps + 1)):

PCA_matrix_current = pd.DataFrame(pca_comps_sample[:, 0:i])

pls_current = PLSRegression(n_components = i)

pls_current.fit(PCA_matrix_current, X_sample)

y_pred_current = pls_current.predict(PCA_matrix_current)

predicted_var_expl.append(r2_score(X_sample, y_pred_current,

multioutput='variance_weighted'))

predicted_var_expl_matrix[j,:] = predicted_var_expl

#MNIST variance explained by UMAP components

X = np.log10(mnist.data + 1); labels = mnist.target.astype(int)

random_indices = np.random.choice(X.shape[0], size = N_points,

replace = False)

X_sample = X[random_indices,:]; labels_sample = labels[random_indices]

opt_perp = np.int(np.round(np.sqrt(X_sample.shape[0]), 0))

X_reduced_sample = PCA(n_components = N_opt_pcs).fit_transform(X_sample)

umap_embedding_sample = umap.UMAP(n_components = N_comps,

n_neighbors = opt_perp,

init = X_reduced_sample[:, 0:N_comps],

min_dist = min_dist_list[j],

n_epochs = 1000, verbose = \

0).fit_transform(X_reduced_sample)

predicted_var_expl_umap = []

for i in range(1, (N_comps + 1)):

UMAP_matrix_current = pd.DataFrame(umap_embedding_sample[:, 0:i])

pls_current = PLSRegression(n_components = i)

pls_current.fit(UMAP_matrix_current, X_sample)

y_pred_current = pls_current.predict(UMAP_matrix_current)

predicted_var_expl_umap.append(r2_score(X_sample, y_pred_current, \

multioutput = 'variance_weighted'))

predicted_var_expl_umap_matrix[j,:] = predicted_var_expl_umap

#MNIST variance explained by tSNE components

X = np.log10(mnist.data + 1); labels = mnist.target.astype(int)

random_indices = np.random.choice(X.shape[0], size = N_points,

replace = False)

X_sample = X[random_indices,:]; labels_sample = labels[random_indices]

X_reduced_sample = PCA(n_components = N_opt_pcs).fit_transform(X_sample)

tsne_embedding_sample = TSNE(n_components = N_comps,

perplexity = perp_list[j],

init = X_reduced_sample[:, 0:N_comps],

learning_rate = 200, n_iter = 1000,

verbose = 0).fit_transform(X_reduced_sample)

predicted_var_expl_tsne = []

for i in range(1, (N_comps + 1)):

tSNE_matrix_current = pd.DataFrame(tsne_embedding_sample[:, 0:i])

pls_current = PLSRegression(n_components = i)

pls_current.fit(tSNE_matrix_current, X_sample)

y_pred_current = pls_current.predict(tSNE_matrix_current)

predicted_var_expl_tsne.append(r2_score(X_sample, y_pred_current, \

multioutput = 'variance_weighted'))

predicted_var_expl_tsne_matrix[j,:] = predicted_var_expl_tsne

print("MNIST variance explained by PCA components:")

print(predicted_var_expl_matrix)

print("\nMNIST variance explained by UMAP components:")

print(predicted_var_expl_umap_matrix)

print("\nMNIST variance explained by tSNE components:")

print(predicted_var_expl_tsne_matrix)

#Plot MNIST variance explained by leading PCA, tSNE and UMAP components

sns.set(font_scale = 1.5); plt.figure(figsize = (20, 15))

plt.errorbar(range(1, (N_comps + 1)), np.mean(predicted_var_expl_matrix,

axis = 0), yerr = 2*np.std(predicted_var_expl_matrix, axis = 0),

linewidth = 3, color = 'red', marker = 'o', markersize = 10,

capsize = 5, capthick = 3)

plt.errorbar(range(1, (N_comps + 1)), np.mean(predicted_var_expl_tsne_matrix,

axis = 0), yerr = 2*np.std(predicted_var_expl_tsne_matrix,

axis = 0), linewidth = 3, color = 'blue', marker = 'o',

markersize = 10, capsize = 5, capthick = 3)

plt.errorbar(range(1, (N_comps + 1)), np.mean(predicted_var_expl_umap_matrix,

axis = 0), yerr = 2*np.std(predicted_var_expl_umap_matrix,

axis = 0), linewidth = 3, color = 'green', marker = 'o',

markersize = 10, capsize = 5, capthick = 3)

plt.ylabel('Cumulative MNIST Explained Variance', fontsize = 20)

plt.xlabel('Number of Components', fontsize = 20)

plt.legend(['PCA variance explained', 'tSNE variance explained',

'UMAP variance explained'], fontsize = 20)

plt.xlim([0.8, (N_comps + 0.2)]); plt.xticks([1, 2, 3]); plt.show()

Here, we conclude that leading PCA components explain consistently more variation in MNIST dataset than leading tSNE and UMAP components. This result was expected since UMAP and tSNE do not aim at building directions of maximal variation in contrast to PCA. Further, **the amounts of variation in the MNIST data explained by leading tSNE and UMAP components are comparable, and both lower than the ones explained by PCA components**. Therefore, we seem to observe a systematic difference in MNIST explained variance **between matrix factorization and neghbor-graph dimensionality reduction techniques**.

In the previous section, we showed that leading components of UMAP and tSNE explain significantly less variation in MNIST dataset compared to the leading PCA components, which is a sort of **expected result** since PCA by definition attempts to find components of maximal variation in the data in contrast to UMAP and tSNE.

This is however an interesting paradox: looking at the 2D UMAP and tSNE figures of MNIST data we can clearly see more

distinctclusters of handwritten digits compared to the ones in the corresponding 2D PCA figure.Despite 2D UMAP and tSNE explain less variation inMNISTthan PCA.

Here, we are going to hypothesize that UMAP / tSNE components are linked to a **phenotype of interest**, that is MNIST **labels** in our case, or cell types for scRNAseq, **rather than total variation in the data**. To check this hypothesis, let us first explore how leading PCA, UMAP and tSNE components correlate with the MNIST labels.

`import umap; import seaborn as sns`

from sklearn.datasets import fetch_openml

from sklearn.decomposition import PCA; import numpy as np

import matplotlib.pyplot as plt; from sklearn.manifold import TSNEN_comps = 3; N_points = 10000

mnist = fetch_openml('mnist_784', version = 1)

labels = mnist.target.astype(int); X = np.log10(mnist.data + 1)

#Subsample MNIST data down to 10 000 images

np.random.seed(123)

random_indices = np.random.choice(X.shape[0], size=N_points, replace=False)

X = X[random_indices,:]; labels = labels[random_indices]

#Compute top 3 PCA components on the subsampled MNIST data

pca = PCA(n_components = N_comps).fit(X)

pca_comps = PCA().fit_transform(X)

#Compute top 3 UMAP components on the subsampled MNIST data

opt_perp = np.int(np.round(np.sqrt(X.shape[0]), 0))

X_reduced = PCA(n_components = N_opt_pcs).fit_transform(X)

umap_embedding = umap.UMAP(n_components = N_comps, n_neighbors = opt_perp,

init = X_reduced[:, 0:N_comps],

min_dist = 0.3, n_epochs = 1000, random_state=123,

verbose = 0).fit_transform(X_reduced)

#Compute top 3 tSNE components on the subsampled MNIST data

tsne_embedding = TSNE(n_components = N_comps, perplexity = opt_perp,

init = X_reduced[:, 0:N_comps],

learning_rate = 200, n_iter = 1000, random_state = 123,

verbose = 0).fit_transform(X_reduced)

#Pairwise correlations between MNIST labels and PCA / tSNE / UMAP components

from scipy.stats import spearmanr

rho_matrix = np.zeros(shape = (3, 3))

for i in range(N_comps):

rho, pval = spearmanr(pca_comps[:, i], labels)

rho_matrix[0, i] = np.abs(rho)

for i in range(N_comps):

rho, pval = spearmanr(tsne_embedding[:, i], labels)

rho_matrix[1, i] = np.abs(rho)

for i in range(N_comps):

rho, pval = spearmanr(umap_embedding[:, i], labels)

rho_matrix[2, i] = np.abs(rho)

rho_df = pd.DataFrame(rho_matrix, columns = ['Comp1', 'Comp2', 'Comp3'],

index = ['PCA', 'tSNE', 'UMAP'])

sns.set(font_scale = 1.5); plt.figure(figsize = (20, 10))

sns.heatmap(rho_df.T, cmap = "Blues", annot = True)

plt.title('Spearman correlation of leading PCA / tSNE / UMAP components'+

' with MNIST labels', fontsize = 20); plt.show()

We can observe, that **both tSNE1 and UMAP1 have a stronger correlation with the MNIST labels compared to PC1**. Further, the second components of PCA, tSNE and UMAP have comparable strength of correlation with the MNIST labels, **whereas the** **third UMAP component has a much stronger correlation with MNIST labels compared to the third components of PCA and tSNE**, that is PC3 and tSNE3, respectively, which **appear to be almost uncorrelated with the MNIST labels**. In the complete notebook, available on my github, I show that the **obtained heatmap is robust** **with respect to sampling and perturbations of PCA / tSNE / UMAP hyperparameters**.

**Alternatively**, we can also quantify the linkage between the PCA, tSNE and UMAP components on one side, and the MNIST labels (classes of digits, aka cell types in scRNAseq data) on the other side, through the **PLS regression** methodology developed in the previous sections. By analogy, if we consider **UMAP_matrix** (or **PCA_matrix** or **tSNE_matrix**) as an approximation of the MNIST vector of **labels**, we can fit the PLS model **labels= B * UMAP_matrix**, and estimate the fraction of **variation in the MNIST** **labels** **explained by the UMAP components** as

Let us test how much variance in MNIST **labels** is explained by PC1, tSNE1 and UMAP1:

`from sklearn.metrics import r2_score`

from sklearn.cross_decomposition import PLSRegression#Variance in MNIST labels explained by PC1

pca = PCA(n_components = X.shape[1]).fit(X)

pca_comps = PCA().fit_transform(X)

PCA_matrix = pd.DataFrame(pca_comps[:, 0:1])

pls = PLSRegression(n_components = 1)

pls.fit(PCA_matrix, labels)

y_pred = pls.predict(PCA_matrix)

r2_score(labels, y_pred, multioutput = 'variance_weighted')

#0.01570798235844606

#Variance in MNIST labels explained by tSNE1

tSNE_matrix = pd.DataFrame(tsne_embedding[:, 0:1])

pls = PLSRegression(n_components = 1)

pls.fit(tSNE_matrix, labels)

y_pred = pls.predict(tSNE_matrix)

r2_score(labels, y_pred, multioutput = 'variance_weighted')

#0.04531724676013893

#Variance in MNIST labels explained by UMAP1

UMAP_matrix = pd.DataFrame(umap_embedding[:, 0:1])

pls = PLSRegression(n_components = 1)

pls.fit(UMAP_matrix, labels)

y_pred = pls.predict(UMAP_matrix)

r2_score(labels, y_pred, multioutput = 'variance_weighted')

#0.1369512765129124

We can clearly see that tSNE1, and especially UMAP1, explain much more variation in MNIST labels compared to PC1, which confirms what we have previously observed with Spearman correlation in the heatmap above. We can also reproduce the **R-squared** values above from the linear regression model in **statsmodels** package.

`import statsmodels.formula.api as smf`my_df_comps = pd.DataFrame({'LABELS': labels,

'PC1': np.array(PCA_matrix).flatten(),

'tSNE1': np.array(tSNE_matrix).flatten(),

'UMAP1': np.array(UMAP_matrix).flatten()})

smf.ols(formula = 'LABELS ~ PC1', data = my_df_comps).fit().summary()

smf.ols(formula = 'LABELS ~ tSNE1', data = my_df_comps).fit().summary()

smf.ols(formula = 'LABELS ~ UMAP1', data = my_df_comps).fit().summary()

Now, **once we have made sure that** **we can correctly compute** the variance in MNIST labels explained by PC1, tSNE1 and UMAP1 by using the PLS and R — squared methodology, we can extend this procedure for other leading PCA /tSNE/ UMAP components, and visualize how the** **cumulative variance explained changes** **as more and more components are added.** **As per usual, we will perturb the PCA, tSNE and UMAP with respect to the sub-sampling iteration, number of images and hyperparameters of the methods.

`from sklearn.decomposition import PCA`

from sklearn.cross_decomposition import PLSRegression

import seaborn as sns; import matplotlib.pyplot as plt

import numpy as np; from sklearn.metrics import r2_scoreN_iter = 3; N_comps = 3

N_points_list = [5000, 10000, 15000]

min_dist_list = [0.1, 0.2, 0.3]; perp_list = [90, 100, 110]

predicted_var_expl_labels_matrix = np.zeros(shape = (N_iter, N_comps))

predicted_var_expl_labels_umap_matrix=np.zeros(shape=(N_iter, N_comps))

predicted_var_expl_labels_tsne_matrix=np.zeros(shape=(N_iter, N_comps))

for j in range(N_iter):

#Variance in MNIST labels explained by PCA components

np.random.seed(j)

X = np.log10(mnist.data + 1); labels = mnist.target.astype(int)

random_indices = np.random.choice(X.shape[0], size = N_points_list[j],

replace = False)

X_sample = X[random_indices,:]; labels_sample = labels[random_indices]

pca_comps_sample = PCA(n_components = N_comps).fit_transform(X_sample)

predicted_var_expl_labels = []

for i in range(1, (N_comps + 1)):

PCA_matrix_current_labels = pd.DataFrame(pca_comps_sample[:, 0:i])

pls_current_labels = PLSRegression(n_components = i)

pls_current_labels.fit(PCA_matrix_current_labels, labels_sample)

y_pred_current_labels = pls_current_labels.predict(\

PCA_matrix_current_labels)

predicted_var_expl_labels.append(r2_score(labels_sample, \

y_pred_current_labels, multioutput = 'variance_weighted'))

predicted_var_expl_labels_matrix[j,:] = predicted_var_expl_labels

#Variance in MNIST labels explained by UMAP components

X = np.log10(mnist.data + 1); labels = mnist.target.astype(int)

random_indices = np.random.choice(X.shape[0], size = N_points,

replace = False)

X_sample = X[random_indices,:]; labels_sample = labels[random_indices]

opt_perp = np.int(np.round(np.sqrt(X_sample.shape[0]), 0))

X_reduced_sample = PCA(n_components=N_opt_pcs).fit_transform(X_sample)

umap_embedding_sample = umap.UMAP(n_components = N_comps,

n_neighbors = opt_perp,

init=X_reduced_sample[:, 0:N_comps],

min_dist = min_dist_list[j],

n_epochs = 1000, verbose = \

0).fit_transform(X_reduced_sample)

predicted_var_expl_labels_umap = []

for i in range(1, (N_comps + 1)):

UMAP_matrix_current_labels=pd.DataFrame(umap_embedding_sample[:,0:i])

pls_current_labels_umap = PLSRegression(n_components = i)

pls_current_labels_umap.fit(UMAP_matrix_current_labels,labels_sample)

y_pred_current_labels_umap = pls_current_labels_umap.predict(\

UMAP_matrix_current_labels)

predicted_var_expl_labels_umap.append(r2_score(\

labels_sample, y_pred_current_labels_umap, \

multioutput = 'variance_weighted'))

predicted_var_expl_labels_umap_matrix[j,:]=predicted_var_expl_labels_umap

#Variance in MNIST labels explained by tSNE components

X = np.log10(mnist.data + 1); labels = mnist.target.astype(int)

random_indices = np.random.choice(X.shape[0], size = N_points,

replace = False)

X_sample = X[random_indices,:]; labels_sample = labels[random_indices]

opt_perp = np.int(np.round(np.sqrt(X_sample.shape[0]), 0))

X_reduced_sample = PCA(n_components = N_opt_pcs).fit_transform(X_sample)

tsne_embedding_sample = TSNE(n_components = N_comps,

perplexity = perp_list[j],

init = X_reduced_sample[:, 0:N_comps],

learning_rate = 200, n_iter = 1000,

verbose=0).fit_transform(X_reduced_sample)

predicted_var_expl_labels_tsne = []

for i in range(1, (N_comps + 1)):

tSNE_matrix_current_labels=pd.DataFrame(tsne_embedding_sample[:,0:i])

pls_current_labels_tsne = PLSRegression(n_components = i)

pls_current_labels_tsne.fit(tSNE_matrix_current_labels,labels_sample)

y_pred_current_labels_tsne = pls_current_labels_tsne.predict(\

tSNE_matrix_current_labels)

predicted_var_expl_labels_tsne.append(r2_score(labels_sample, \

y_pred_current_labels_tsne, multioutput = 'variance_weighted'))

predicted_var_expl_labels_tsne_matrix[j,:]=predicted_var_expl_labels_tsne

print("Variance in MNIST labels explained by PCA components:")

print(predicted_var_expl_labels_matrix)

print("\nVariance in MNIST labels explained by UMAP components:")

print(predicted_var_expl_labels_umap_matrix)

print("\nVariance in MNIST labels explained by tSNE components:")

print(predicted_var_expl_labels_tsne_matrix)

#Plot MNIST labels variance explained by PCA, tSNE and UMAP components

sns.set(font_scale = 1.5); plt.figure(figsize = (20, 15))

plt.errorbar(range(1, (N_comps + 1)),

np.mean(predicted_var_expl_labels_matrix, axis = 0),

yerr = 2*np.std(predicted_var_expl_labels_matrix, axis = 0),

linewidth = 3, color = 'red', marker = 'o', markersize = 10,

capsize = 5, capthick = 3)

plt.errorbar(range(1, (N_comps + 1)),

np.mean(predicted_var_expl_labels_tsne_matrix, axis = 0),

yerr=2*np.std(predicted_var_expl_labels_tsne_matrix,axis=0),

linewidth = 3, color = 'blue', marker = 'o', markersize = 10,

capsize = 5, capthick = 3)

plt.errorbar(range(1, (N_comps + 1)),

np.mean(predicted_var_expl_labels_umap_matrix, axis = 0),

yerr=2*np.std(predicted_var_expl_labels_umap_matrix,axis=0),

linewidth = 3, color = 'green', marker = 'o', markersize = 10,

capsize = 5, capthick = 3)

plt.ylabel('Cumulative MNIST Labels Explained Variance', fontsize = 20)

plt.xlabel('Number of Components', fontsize = 20)

plt.legend(['PCA variance explained', 'tSNE variance explained',

'UMAP variance explained'], fontsize = 20)

plt.xlim([0.8, (N_comps + 0.2)]); plt.xticks([1, 2, 3]); plt.show()

**Here,** **surprisingly, we observe that** **leading UMAP and tSNE components seem to explain much more of variation in the MNIST labels compared to leading PCA components. **This is somehow counter-intuitive as we saw in the previous section that leading tSNE and UMAP components explain less variation in MNIST image pixel intensities. Nevertheless, this cumulative explained variance plot above basically confirms the correlation heatmap between the MNIST labels, and PCA / tSNE / UMAP components obtained previously.

Here, we have an interesting effect. We saw in the previous section that the leading tSNE and UMAP components can not capture more variation in the MNIST than leading PCA components. However, surpisingly, they are **able to capture more variation in MNIST labels, that is classes of handwritten digits, despite the classes were not provided to tSNE / UMAP at all, when performing the dimension reduction.** In other words, all three dimension reduction techniques are completely unsupervised, i.e. they know nothing about the classes of handwritten digits. Nevertheless, **tSNE and especially UMAP components seem to be, surprisingly, linked to the classes of digits** without being able to capture substantial proportion of variation in MNIST pixel intensities across the images. I do not fully understand this effect but believe it is an interesting observation, which I would be curious to further explore. **Let me know in the comments, if you have more insights into the nature of this effect.**

In this article, we have learnt that a **Partial Least Squares (PLS)** **approach** can be used for building **intuition about data variation explained by tSNE and UMAP components**. Using this approach, we demonstrated that **tSNE and UMAP components explain (unsurprisingly) less variation in MNIST image pixel intensities** **data compared to PCA components**, however, are surprisingly strongly** linked to the labels of MNIST images**. The nature of this effect is unclear taken into account the unsupervised design of all the three dimension reduction techniques.

As per usual, let me know in the comments below what analytical methods from Life Sciences** **and Computational Biology seem **especially mysterious** to you and I will try to address them in this column. Check the files used for the post on my github. Follow me at **Medium ****Nikolay Oskolkov**, in **Twitter** @NikolayOskolkov, in **Mastodon** @[email protected] and connect via **Linkedin**. In the next post, I will discuss** how to cluster in UMAP space**, stay tuned.

## A simple way to interpret UMAP components

This is the **twenty fifth** post from my column **Mathematical Statistics and Machine Learning for Life Sciences**, where I use plain language to discuss analytical methods from Computational Biology and Life Sciences. **UMAP**** **is a **dimensionality reduction** technique that together with **tSNE** became increasingly popular and *de-facto* a standard tool for analysis of **single cell genomics** data, where traditional methods such as **PCA** **have limitations**. However, one disadvantage of UMAP and tSNE in comparison with PCA is their **non-interpretable** components that are not straightforward to link to the variation in the original data. In this article, I suggest a simple method for **estimating amount of variation explained** **by leading UMAP and tSNE components**. Using the classic MNIST dataset of black and white images of handwritten digits as a benchmark, I demonstrate that **leading UMAP and tSNE components are inferior to PCA components in terms of explaining total data variation**, however, surprisingly, demonstrate a better linkage to the labels of data points, i.e. they explain more** biological rather than total variation in the data**.

As a benchmark dataset, we will be using the **MNIST**** **which includes 70 000 black and white images of handwritten digits of **28 x 28 pixels** **resolution, **i.e.** 784 pixels** per image. First, we will download the MNIST dataset, check its dimensions and visualize a few random images.

`from sklearn.datasets import fetch_openml`mnist = fetch_openml('mnist_784', version = 1)

labels = mnist.target.astype(int)

print(mnist.data.shape)

#(70000, 784)

from matplotlib import pyplot as plt

plt.figure(figsize = (20, 15))

for i in range(25):

plt.subplot(5, 5, i + 1)

plt.imshow(mnist.data[i].reshape(28, 28), cmap = plt.get_cmap('gray'))

plt.show()

The values of MNIST dataset represent **pixel intensities** varying from 0 to 255, where 0 corresponds to the black background. Therefore, MNIST is a **“zero-inflated”** data set which is very **similar to a** **typical single cell** gene expression dataset. It is usually recommended to normalize the black and white pixel intensities by the largest 255 value. However, here, by analogy with single cell gene expression, we will use a **log-transform** as another mild normalization strategy. Also, for the sake of time, we will not use all the 70 000 images but **randomly pick 10 000 images**, later we will draw 10 000 images a few times to make sure that our conclusions are robust.

`import numpy as np`N_points = 10000

X = np.log10(mnist.data + 1)

#X = mnist.data / 255

np.random.seed(123)

random_indices = np.random.choice(X.shape[0], size=N_points, replace=False)

X = X[random_indices,:]

labels = labels[random_indices]

As a first exploratory step, in order to **understand the variation** **in MNIST**, we will perform a PCA analysis on the MNIST itself and a **shuffled** version of MNIST. This will help us to estimate the number of meaningful, i.e. non-redundant, dimensions out of the initial 784 dimensions, i.e. we will figure our the number of **informative** dimensions to keep for all the future tested dimension reduction and clustering techniques. More details are here.

`import pandas as pd; import matplotlib.pyplot as plt`

from sklearn.decomposition import PCA; import seaborn as snsN_pca_comps = 100

sns.set(font_scale = 1.5); figure = plt.figure(figsize = (20, 15))

plt.subplot(221)

X_reduced = PCA(n_components = 2).fit_transform(X)

plt.scatter(X_reduced[:,0], X_reduced[:,1], c=labels, cmap='tab10', s=10)

plt.title('PCA: MNIST', fontsize = 25)

plt.xlabel('PC1', fontsize = 22); plt.ylabel('PC2', fontsize = 22)

plt.subplot(222)

pca = PCA(n_components = N_pca_comps).fit(X)

print('Observed variance explained:')

print(pca.explained_variance_ratio_[0:10]); print('\n')

plt.bar(range(len(pca.explained_variance_ratio_)),

pca.explained_variance_ratio_)

plt.xlabel('Number of Principal Components', fontsize = 22)

plt.ylabel('Explained Variance', fontsize = 22)

N_perm = 10

X_flat = X.flatten()

expl_var_perm_df = pd.DataFrame(index = list(range(N_perm)),

columns = list(range(X.shape[1])))

for i in range(N_perm):

np.random.shuffle(X_flat)

X_perm = X_flat.reshape(X.shape[0], X.shape[1])

pca_perm = PCA().fit(X_perm)

expl_var_perm_df.loc[i] = pca_perm.explained_variance_ratio_

print('Finished {} permutations'.format(i + 1))

X_perm = list(expl_var_perm_df.mean(axis = 0) +

2*expl_var_perm_df.std(axis = 0))

print('\nPermuted variance explained:')

print(X_perm[0:10])

plt.subplot(223)

plt.plot(pca.explained_variance_ratio_, c = 'blue')

plt.plot(X_perm, c = 'red'); plt.xlim([-1, N_pca_comps])

plt.xlabel('Number of Principal Components', fontsize = 22)

plt.ylabel('Explained Variance', fontsize = 22)

plt.gca().legend(('Observed variance explained',

'Permuted variance explained'), fontsize = 20)

plt.subplot(224)

pval = list()

for j in range(N_pca_comps):

pval.append(np.sum(expl_var_perm_df.iloc[:, j] +

2*expl_var_perm_df.std(axis = 0) >=

pca.explained_variance_ratio_[j]) / N_perm)

plt.plot(pval, c = 'darkgreen')

plt.xlabel('Number of Principal Components', fontsize = 22)

plt.ylabel('P-value', fontsize = 22); plt.xlim([-1, N_pca_comps])

N_opt_pcs = np.where(np.array(pval) >= 0.05)[0][0]

print('\nNumber of significant Principal Components: {}'.format(N_opt_pcs))

print('Together they explain {}% of variation in the data'.\

format(int(round(sum(pca.explained_variance_ratio_[0:\

np.where(np.array(pval) >= 0.05)[0][0]])*100,0))))

figure.tight_layout()

plt.show()

An important outcome of the PCA analysis on MNIST and shuffled MNIST is that the dataset seems to have **62 informative **principal components** **that all together **capture 86% of variation in the data**. Therefore, if we replace the original dataset including 784 features by 62 PCs, we will keep most of the variation in the data, but will **reduce the dimensionality of the data by more than 10 times**.

Now, let us visualize the MNIST dataset as a 2D map using UMAP and tSNE dimension reduction techniques. In both cases, we will use 2 top principal components, i.e. 2 PCs, for initialization. The perplexity hyperparameter** **for tSNE and the number of nearest neighbors for UMAP will be computed as a square root of the number of data points (images). See here for details.

`import umap; import numpy as np`

import seaborn as sns; import matplotlib.pyplot as plt

from sklearn.manifold import TSNE; from sklearn.decomposition import PCAopt_perp = np.int(np.round(np.sqrt(X.shape[0]), 0))

X_reduced = PCA(n_components = N_opt_pcs).fit_transform(X)

umap_embedding = umap.UMAP(n_components = 2, n_neighbors = opt_perp,

init = X_reduced[:, 0:2],

min_dist=0.3, n_epochs = 1000, random_state = 123,

verbose = 0).fit_transform(X_reduced)

tsne_embedding = TSNE(n_components=2, perplexity=opt_perp,

init=X_reduced[:, 0:2],

learning_rate = 200, n_iter = 1000, random_state = 123,

verbose = 0).fit_transform(X_reduced)

sns.set(font_scale = 1.5); plt.figure(figsize = (20, 10))

plt.subplot(121)

plt.scatter(tsne_embedding[:, 0], tsne_embedding[:, 1], c = labels, s = 10,

cmap = 'tab20')

plt.title('tSNE: MNIST', fontsize = 25)

plt.xlabel("tSNE1", fontsize = 22); plt.ylabel("tSNE2", fontsize = 22)

plt.subplot(122)

plt.scatter(umap_embedding[:, 0], umap_embedding[:, 1], c = labels, s = 10,

cmap = 'tab20')

plt.title('UMAP: MNIST', fontsize = 25)

plt.xlabel("UMAP1", fontsize = 22); plt.ylabel("UMAP2", fontsize = 22)

plt.show()

One obvious conclusion that can be drawn from the comparison of 2D PCA, tSNE and UMAP pictures is that the **classes** **of handwritten digits** are much more resolved by the non-linear **neighbor-graph** dimensionality reduction methods such as tSNE / UMAP compared to the **linear** **matrix factorization** techniques such as PCA. Therefore if we assume the classes of handwritten digits (aka biological phenotype, e.g. cell type in scRNAseq experiments) to comprise **most of the variation** in the MNIST data, it probably makes sense to hypothesize that **two tSNE/UMAP components are capabale of capturing more biological variation than two PCA components (at least in the MNIST dataset)**. In the next sections, we are going to try to **quantify and prove this hypothesis**.

As non-linear** **neighbor-graph based techniques, **UMAP / tSNE do not seem to have the concept of variance explained by its components** **in contrast to matrix factorization linear dimension reduction techniques such as PCA**, see for example the answer of Leland McInnes, the author of UMAP. Here, nevertheless, we will try to quantify the amount of MNIST variation in pixel intensities explained by UMAP / tSNE components and compare it with the amount of variation explained by PCA components using the **Partial Least Square (PLS) regression** and the **R-squared statistic** generalized for matrix operations.

Let us check the percentage of MNIST variation explained by first principal component (PC1), we can easily compute and extract it from the PCA as the normalized **first eigen value** of MNIST. Below, we can see that **PC1 explains ~11% of MNIST variation**.

`#First, we will compute variance explained by PC1 via PCA in sklearn`

import numpy as np

from sklearn.decomposition import PCApca = PCA(n_components = X.shape[1]).fit(X)

pca_comps = PCA().fit_transform(X)

print(pca.explained_variance_ratio_[0])

#0.11043073983593521

Now, **we will** **reproduce this number from the** **Partial Least Squares (PLS)** **regression**. We are going to use the following reasoning. Suppose we want to **approximate a matrix X by another matrix PCA_matrix**, which for now includes only one column, that is PC1, but generally can include up to 784 columns for MNIST dataset. Then, we can fit a PLS linear regression model of **X = B * PCA_matrix** and compute a **R-squared statistic** which will reflect the amount of variation in **X** explained by **PCA_matrix**. In the matrix form, the R-squared statistic will be given by the following equation:

where **B*PCA_matrix** represents a prediction of **X** from its approximation **PCA_matrix** and will be found from the PLS regression (first equation). To technically implement this procedure in **scikit-learn Python module**, we will first fit a PLS model with **X** as a response variable, and **PCA_matrix** as an explanatory variable, we compute a prediction **y_pred = B*PCA_matrix**, and, finally, we can either use the **r2_score** function in scikit-learn, or the second equation above for computing the R-squared statistic. Let us check that both will give identical answers:

`#Now let us compute variance explained by PC1 through PLS procedure`

from sklearn.metrics import r2_score

from sklearn.cross_decomposition import PLSRegressionPCA_matrix = pd.DataFrame(pca_comps[:, 0:1])

pls = PLSRegression(n_components = 1)

pls.fit(PCA_matrix, X)

y_pred = pls.predict(PCA_matrix)

print(r2_score(X, y_pred, multioutput = 'variance_weighted'))

#0.11043073983593246

#Finally, let us compute variance explained by PC1 from scratch

print(1 - np.sum((np.array(X) - np.array(y_pred))**2) / np.sum((X - \

np.mean(X, axis = 0))**2))

#0.11043073983593554

Thus, by utilizing PLS regression, we computed the fraction of total MNIST **variation explained by the first principal component** **outside of the PCA algorithm**. Compare the fraction calculated by the PLS with the respective variance explained by the PCA algorithm pca.explained_variance_ratio_[0] which is the ratio of first eigen value divided by the sum of all eigen values. They are nearly identical. Now let **PCA_matrix** consist of a number of PCs. Below, we demonstrate that the PLS computation of cumulative fraction of variance explained by principal components will result in **nearly identical **to PCA cumulative variance explained.

`from sklearn.metrics import r2_score`

from sklearn.cross_decomposition import PLSRegressionpredicted_var_expl = []

for i in range(1, 21):

PCA_matrix_current = pd.DataFrame(pca_comps[:, 0:i])

pls_current = PLSRegression(n_components = i)

pls_current.fit(PCA_matrix_current, X)

y_pred_current = pls_current.predict(PCA_matrix_current)

predicted_var_expl.append(r2_score(X, y_pred_current,

multioutput = 'variance_weighted'))

import seaborn as sns

import matplotlib.pyplot as plt

sns.set(font_scale = 1.5); plt.figure(figsize = (20, 15))

plt.plot(np.cumsum(pca.explained_variance_ratio_[0:100]),linewidth=5)

plt.plot(predicted_var_expl, linewidth = 5)

plt.ylabel('Cumulative Explained Variance', fontsize = 20)

plt.xlabel('Number of Principal Components', fontsize = 20)

plt.legend(['PCA - computed cumulative variance explained',

'PLS - computed cumulative variance explained'],

fontsize = 20); plt.show()

For clarity, we plot PLS-computed MNIST variance explained for PC1-PC20 on the top of the native PCA eigen value based variance explained which is done for PC1 — PC100. We can see, that **the** **curves of cumulative variance explained computed by PLS and PCA nicely coincide**. This is no surprise since PLS computation essentially mimics what is going under the hood in the PCA algorithm. However, very importantly, this gives us an **instrument** of computing variance of **X** explained by **any approximation**, i.e. not only the **PCA_matrix** but also any other matrix. In the next session, we will use **UMAP_matrix** as an approximation of the original **X** matrix of MNIST pixel intensities and **compare the cumulative variance of X explained by UMAP and PCA components**.

In this section we will use the methodology of PLS estimation of variance explained developed in the previous section. Again, we will use the same reasoning: since UMAP gives some sort of **approximation** **of the original data** **X**, and people even run **clustering on 2D UMAP** for discovering cell types in scRNAseq field, we can use the **UMAP_matrix** (and we can do the same for tSNE) as an approximation of **X**, then we fit the PLS model **X=B * UMAP_matrix**, and estimate the fraction of MNIST variance explained by UMAP components via R-squared statistic as

Now, we will use the PLS methodology developed in the previous section, and compute the fraction of MNIST variance explained by the first UMAP component.

`from sklearn.metrics import r2_score`

from sklearn.cross_decomposition import PLSRegression#MNIST variation explained by UMAP1

UMAP_matrix = pd.DataFrame(umap_embedding[:, 0:1])

pls = PLSRegression(n_components = 1)

pls.fit(UMAP_matrix, X)

y_pred = pls.predict(UMAP_matrix)

print(r2_score(X, y_pred, multioutput = 'variance_weighted'))

#0.07335034485651613

#Here the same but more explicitly via the R^2 equation above

print(1 - np.sum((np.array(X) - np.array(y_pred))**2) / np.sum((X - \

np.mean(X, axis = 0))**2))

#0.07335034485652026

#MNIST variation explained by tSNE1

tSNE_matrix = pd.DataFrame(tsne_embedding[:, 0:1])

pls = PLSRegression(n_components = 1)

pls.fit(tSNE_matrix, X)

y_pred = pls.predict(tSNE_matrix)

print(r2_score(X, y_pred, multioutput = 'variance_weighted'))

#0.07265918428990921

#Here the same but explicitly via the R^2 equation above

print(1 - np.sum((np.array(X) - np.array(y_pred))**2) / np.sum((X - \

np.mean(X, axis = 0))**2))

#0.07265918428991347

We conclude that **first components of both UMAP and tSNE explain about 7% of MNIST variation which is lower than the 11% explained by the first PCA component**. This is not surprising. Intuitively, it is hard to expect that there can be another latent variable alternative to PC1 that explains more than 11% of MNIST variation without redefining the concept of “explained variance”. The current definition comes from the PCA (**normalized eigen values**), and linear regression, that is R-squared, analyses. Both are linear frameworks. **In addition,** **PC1 by definition corresponds to a direction of maximal data variation**. Therefore, if another latent variable comes from a nonlinear analysis such as UMAP, that does not *per se* attempts to maximize variation in the data, **it would be hard to expect that e.g. UMAP1 explains more variation in MNIST** **within linear definition of “variance explained”**. Now we will calculate the variance explained by a few top tSNE and UMAP components. To make our analysis more robust towards sampling of the 10 000 images, we will independently draw them **N_iter** times. Also, for each iteration, we will slightly **change /** **perturb** **the hyperparameters of UMAP and tSNE, as well as the number of data points for PCA**. This will allow us to build **confidence intervals** and address the sensitivity of our analysis.

`import umap; import numpy as np; import seaborn as sns`

import matplotlib.pyplot as plt; from sklearn.manifold import TSNE

from sklearn.metrics import r2_score; from sklearn.decomposition import PCA

from sklearn.cross_decomposition import PLSRegressionN_iter = 3; N_comps = 3

N_points_list = [5000, 10000, 15000]

perp_list = [50, 100, 150]; min_dist_list = [0.1, 0.2, 0.3]

predicted_var_expl_matrix = np.zeros(shape = (N_iter, N_comps))

predicted_var_expl_umap_matrix = np.zeros(shape = (N_iter, N_comps))

predicted_var_expl_tsne_matrix = np.zeros(shape = (N_iter, N_comps))

for j in range(N_iter):

#MNIST variance explained by PCA components

np.random.seed(j)

X = np.log10(mnist.data + 1); labels = mnist.target.astype(int)

random_indices = np.random.choice(X.shape[0], size = N_points_list[j],

replace = False)

X_sample = X[random_indices,:]; labels_sample = labels[random_indices]

pca_comps_sample = PCA(n_components = N_comps).fit_transform(X_sample)

predicted_var_expl = []

for i in range(1, (N_comps + 1)):

PCA_matrix_current = pd.DataFrame(pca_comps_sample[:, 0:i])

pls_current = PLSRegression(n_components = i)

pls_current.fit(PCA_matrix_current, X_sample)

y_pred_current = pls_current.predict(PCA_matrix_current)

predicted_var_expl.append(r2_score(X_sample, y_pred_current,

multioutput='variance_weighted'))

predicted_var_expl_matrix[j,:] = predicted_var_expl

#MNIST variance explained by UMAP components

X = np.log10(mnist.data + 1); labels = mnist.target.astype(int)

random_indices = np.random.choice(X.shape[0], size = N_points,

replace = False)

X_sample = X[random_indices,:]; labels_sample = labels[random_indices]

opt_perp = np.int(np.round(np.sqrt(X_sample.shape[0]), 0))

X_reduced_sample = PCA(n_components = N_opt_pcs).fit_transform(X_sample)

umap_embedding_sample = umap.UMAP(n_components = N_comps,

n_neighbors = opt_perp,

init = X_reduced_sample[:, 0:N_comps],

min_dist = min_dist_list[j],

n_epochs = 1000, verbose = \

0).fit_transform(X_reduced_sample)

predicted_var_expl_umap = []

for i in range(1, (N_comps + 1)):

UMAP_matrix_current = pd.DataFrame(umap_embedding_sample[:, 0:i])

pls_current = PLSRegression(n_components = i)

pls_current.fit(UMAP_matrix_current, X_sample)

y_pred_current = pls_current.predict(UMAP_matrix_current)

predicted_var_expl_umap.append(r2_score(X_sample, y_pred_current, \

multioutput = 'variance_weighted'))

predicted_var_expl_umap_matrix[j,:] = predicted_var_expl_umap

#MNIST variance explained by tSNE components

X = np.log10(mnist.data + 1); labels = mnist.target.astype(int)

random_indices = np.random.choice(X.shape[0], size = N_points,

replace = False)

X_sample = X[random_indices,:]; labels_sample = labels[random_indices]

X_reduced_sample = PCA(n_components = N_opt_pcs).fit_transform(X_sample)

tsne_embedding_sample = TSNE(n_components = N_comps,

perplexity = perp_list[j],

init = X_reduced_sample[:, 0:N_comps],

learning_rate = 200, n_iter = 1000,

verbose = 0).fit_transform(X_reduced_sample)

predicted_var_expl_tsne = []

for i in range(1, (N_comps + 1)):

tSNE_matrix_current = pd.DataFrame(tsne_embedding_sample[:, 0:i])

pls_current = PLSRegression(n_components = i)

pls_current.fit(tSNE_matrix_current, X_sample)

y_pred_current = pls_current.predict(tSNE_matrix_current)

predicted_var_expl_tsne.append(r2_score(X_sample, y_pred_current, \

multioutput = 'variance_weighted'))

predicted_var_expl_tsne_matrix[j,:] = predicted_var_expl_tsne

print("MNIST variance explained by PCA components:")

print(predicted_var_expl_matrix)

print("\nMNIST variance explained by UMAP components:")

print(predicted_var_expl_umap_matrix)

print("\nMNIST variance explained by tSNE components:")

print(predicted_var_expl_tsne_matrix)

#Plot MNIST variance explained by leading PCA, tSNE and UMAP components

sns.set(font_scale = 1.5); plt.figure(figsize = (20, 15))

plt.errorbar(range(1, (N_comps + 1)), np.mean(predicted_var_expl_matrix,

axis = 0), yerr = 2*np.std(predicted_var_expl_matrix, axis = 0),

linewidth = 3, color = 'red', marker = 'o', markersize = 10,

capsize = 5, capthick = 3)

plt.errorbar(range(1, (N_comps + 1)), np.mean(predicted_var_expl_tsne_matrix,

axis = 0), yerr = 2*np.std(predicted_var_expl_tsne_matrix,

axis = 0), linewidth = 3, color = 'blue', marker = 'o',

markersize = 10, capsize = 5, capthick = 3)

plt.errorbar(range(1, (N_comps + 1)), np.mean(predicted_var_expl_umap_matrix,

axis = 0), yerr = 2*np.std(predicted_var_expl_umap_matrix,

axis = 0), linewidth = 3, color = 'green', marker = 'o',

markersize = 10, capsize = 5, capthick = 3)

plt.ylabel('Cumulative MNIST Explained Variance', fontsize = 20)

plt.xlabel('Number of Components', fontsize = 20)

plt.legend(['PCA variance explained', 'tSNE variance explained',

'UMAP variance explained'], fontsize = 20)

plt.xlim([0.8, (N_comps + 0.2)]); plt.xticks([1, 2, 3]); plt.show()

Here, we conclude that leading PCA components explain consistently more variation in MNIST dataset than leading tSNE and UMAP components. This result was expected since UMAP and tSNE do not aim at building directions of maximal variation in contrast to PCA. Further, **the amounts of variation in the MNIST data explained by leading tSNE and UMAP components are comparable, and both lower than the ones explained by PCA components**. Therefore, we seem to observe a systematic difference in MNIST explained variance **between matrix factorization and neghbor-graph dimensionality reduction techniques**.

In the previous section, we showed that leading components of UMAP and tSNE explain significantly less variation in MNIST dataset compared to the leading PCA components, which is a sort of **expected result** since PCA by definition attempts to find components of maximal variation in the data in contrast to UMAP and tSNE.

This is however an interesting paradox: looking at the 2D UMAP and tSNE figures of MNIST data we can clearly see more

distinctclusters of handwritten digits compared to the ones in the corresponding 2D PCA figure.Despite 2D UMAP and tSNE explain less variation inMNISTthan PCA.

Here, we are going to hypothesize that UMAP / tSNE components are linked to a **phenotype of interest**, that is MNIST **labels** in our case, or cell types for scRNAseq, **rather than total variation in the data**. To check this hypothesis, let us first explore how leading PCA, UMAP and tSNE components correlate with the MNIST labels.

`import umap; import seaborn as sns`

from sklearn.datasets import fetch_openml

from sklearn.decomposition import PCA; import numpy as np

import matplotlib.pyplot as plt; from sklearn.manifold import TSNEN_comps = 3; N_points = 10000

mnist = fetch_openml('mnist_784', version = 1)

labels = mnist.target.astype(int); X = np.log10(mnist.data + 1)

#Subsample MNIST data down to 10 000 images

np.random.seed(123)

random_indices = np.random.choice(X.shape[0], size=N_points, replace=False)

X = X[random_indices,:]; labels = labels[random_indices]

#Compute top 3 PCA components on the subsampled MNIST data

pca = PCA(n_components = N_comps).fit(X)

pca_comps = PCA().fit_transform(X)

#Compute top 3 UMAP components on the subsampled MNIST data

opt_perp = np.int(np.round(np.sqrt(X.shape[0]), 0))

X_reduced = PCA(n_components = N_opt_pcs).fit_transform(X)

umap_embedding = umap.UMAP(n_components = N_comps, n_neighbors = opt_perp,

init = X_reduced[:, 0:N_comps],

min_dist = 0.3, n_epochs = 1000, random_state=123,

verbose = 0).fit_transform(X_reduced)

#Compute top 3 tSNE components on the subsampled MNIST data

tsne_embedding = TSNE(n_components = N_comps, perplexity = opt_perp,

init = X_reduced[:, 0:N_comps],

learning_rate = 200, n_iter = 1000, random_state = 123,

verbose = 0).fit_transform(X_reduced)

#Pairwise correlations between MNIST labels and PCA / tSNE / UMAP components

from scipy.stats import spearmanr

rho_matrix = np.zeros(shape = (3, 3))

for i in range(N_comps):

rho, pval = spearmanr(pca_comps[:, i], labels)

rho_matrix[0, i] = np.abs(rho)

for i in range(N_comps):

rho, pval = spearmanr(tsne_embedding[:, i], labels)

rho_matrix[1, i] = np.abs(rho)

for i in range(N_comps):

rho, pval = spearmanr(umap_embedding[:, i], labels)

rho_matrix[2, i] = np.abs(rho)

rho_df = pd.DataFrame(rho_matrix, columns = ['Comp1', 'Comp2', 'Comp3'],

index = ['PCA', 'tSNE', 'UMAP'])

sns.set(font_scale = 1.5); plt.figure(figsize = (20, 10))

sns.heatmap(rho_df.T, cmap = "Blues", annot = True)

plt.title('Spearman correlation of leading PCA / tSNE / UMAP components'+

' with MNIST labels', fontsize = 20); plt.show()

We can observe, that **both tSNE1 and UMAP1 have a stronger correlation with the MNIST labels compared to PC1**. Further, the second components of PCA, tSNE and UMAP have comparable strength of correlation with the MNIST labels, **whereas the** **third UMAP component has a much stronger correlation with MNIST labels compared to the third components of PCA and tSNE**, that is PC3 and tSNE3, respectively, which **appear to be almost uncorrelated with the MNIST labels**. In the complete notebook, available on my github, I show that the **obtained heatmap is robust** **with respect to sampling and perturbations of PCA / tSNE / UMAP hyperparameters**.

**Alternatively**, we can also quantify the linkage between the PCA, tSNE and UMAP components on one side, and the MNIST labels (classes of digits, aka cell types in scRNAseq data) on the other side, through the **PLS regression** methodology developed in the previous sections. By analogy, if we consider **UMAP_matrix** (or **PCA_matrix** or **tSNE_matrix**) as an approximation of the MNIST vector of **labels**, we can fit the PLS model **labels= B * UMAP_matrix**, and estimate the fraction of **variation in the MNIST** **labels** **explained by the UMAP components** as

Let us test how much variance in MNIST **labels** is explained by PC1, tSNE1 and UMAP1:

`from sklearn.metrics import r2_score`

from sklearn.cross_decomposition import PLSRegression#Variance in MNIST labels explained by PC1

pca = PCA(n_components = X.shape[1]).fit(X)

pca_comps = PCA().fit_transform(X)

PCA_matrix = pd.DataFrame(pca_comps[:, 0:1])

pls = PLSRegression(n_components = 1)

pls.fit(PCA_matrix, labels)

y_pred = pls.predict(PCA_matrix)

r2_score(labels, y_pred, multioutput = 'variance_weighted')

#0.01570798235844606

#Variance in MNIST labels explained by tSNE1

tSNE_matrix = pd.DataFrame(tsne_embedding[:, 0:1])

pls = PLSRegression(n_components = 1)

pls.fit(tSNE_matrix, labels)

y_pred = pls.predict(tSNE_matrix)

r2_score(labels, y_pred, multioutput = 'variance_weighted')

#0.04531724676013893

#Variance in MNIST labels explained by UMAP1

UMAP_matrix = pd.DataFrame(umap_embedding[:, 0:1])

pls = PLSRegression(n_components = 1)

pls.fit(UMAP_matrix, labels)

y_pred = pls.predict(UMAP_matrix)

r2_score(labels, y_pred, multioutput = 'variance_weighted')

#0.1369512765129124

We can clearly see that tSNE1, and especially UMAP1, explain much more variation in MNIST labels compared to PC1, which confirms what we have previously observed with Spearman correlation in the heatmap above. We can also reproduce the **R-squared** values above from the linear regression model in **statsmodels** package.

`import statsmodels.formula.api as smf`my_df_comps = pd.DataFrame({'LABELS': labels,

'PC1': np.array(PCA_matrix).flatten(),

'tSNE1': np.array(tSNE_matrix).flatten(),

'UMAP1': np.array(UMAP_matrix).flatten()})

smf.ols(formula = 'LABELS ~ PC1', data = my_df_comps).fit().summary()

smf.ols(formula = 'LABELS ~ tSNE1', data = my_df_comps).fit().summary()

smf.ols(formula = 'LABELS ~ UMAP1', data = my_df_comps).fit().summary()

Now, **once we have made sure that** **we can correctly compute** the variance in MNIST labels explained by PC1, tSNE1 and UMAP1 by using the PLS and R — squared methodology, we can extend this procedure for other leading PCA /tSNE/ UMAP components, and visualize how the** **cumulative variance explained changes** **as more and more components are added.** **As per usual, we will perturb the PCA, tSNE and UMAP with respect to the sub-sampling iteration, number of images and hyperparameters of the methods.

`from sklearn.decomposition import PCA`

from sklearn.cross_decomposition import PLSRegression

import seaborn as sns; import matplotlib.pyplot as plt

import numpy as np; from sklearn.metrics import r2_scoreN_iter = 3; N_comps = 3

N_points_list = [5000, 10000, 15000]

min_dist_list = [0.1, 0.2, 0.3]; perp_list = [90, 100, 110]

predicted_var_expl_labels_matrix = np.zeros(shape = (N_iter, N_comps))

predicted_var_expl_labels_umap_matrix=np.zeros(shape=(N_iter, N_comps))

predicted_var_expl_labels_tsne_matrix=np.zeros(shape=(N_iter, N_comps))

for j in range(N_iter):

#Variance in MNIST labels explained by PCA components

np.random.seed(j)

X = np.log10(mnist.data + 1); labels = mnist.target.astype(int)

random_indices = np.random.choice(X.shape[0], size = N_points_list[j],

replace = False)

X_sample = X[random_indices,:]; labels_sample = labels[random_indices]

pca_comps_sample = PCA(n_components = N_comps).fit_transform(X_sample)

predicted_var_expl_labels = []

for i in range(1, (N_comps + 1)):

PCA_matrix_current_labels = pd.DataFrame(pca_comps_sample[:, 0:i])

pls_current_labels = PLSRegression(n_components = i)

pls_current_labels.fit(PCA_matrix_current_labels, labels_sample)

y_pred_current_labels = pls_current_labels.predict(\

PCA_matrix_current_labels)

predicted_var_expl_labels.append(r2_score(labels_sample, \

y_pred_current_labels, multioutput = 'variance_weighted'))

predicted_var_expl_labels_matrix[j,:] = predicted_var_expl_labels

#Variance in MNIST labels explained by UMAP components

X = np.log10(mnist.data + 1); labels = mnist.target.astype(int)

random_indices = np.random.choice(X.shape[0], size = N_points,

replace = False)

X_sample = X[random_indices,:]; labels_sample = labels[random_indices]

opt_perp = np.int(np.round(np.sqrt(X_sample.shape[0]), 0))

X_reduced_sample = PCA(n_components=N_opt_pcs).fit_transform(X_sample)

umap_embedding_sample = umap.UMAP(n_components = N_comps,

n_neighbors = opt_perp,

init=X_reduced_sample[:, 0:N_comps],

min_dist = min_dist_list[j],

n_epochs = 1000, verbose = \

0).fit_transform(X_reduced_sample)

predicted_var_expl_labels_umap = []

for i in range(1, (N_comps + 1)):

UMAP_matrix_current_labels=pd.DataFrame(umap_embedding_sample[:,0:i])

pls_current_labels_umap = PLSRegression(n_components = i)

pls_current_labels_umap.fit(UMAP_matrix_current_labels,labels_sample)

y_pred_current_labels_umap = pls_current_labels_umap.predict(\

UMAP_matrix_current_labels)

predicted_var_expl_labels_umap.append(r2_score(\

labels_sample, y_pred_current_labels_umap, \

multioutput = 'variance_weighted'))

predicted_var_expl_labels_umap_matrix[j,:]=predicted_var_expl_labels_umap

#Variance in MNIST labels explained by tSNE components

X = np.log10(mnist.data + 1); labels = mnist.target.astype(int)

random_indices = np.random.choice(X.shape[0], size = N_points,

replace = False)

X_sample = X[random_indices,:]; labels_sample = labels[random_indices]

opt_perp = np.int(np.round(np.sqrt(X_sample.shape[0]), 0))

X_reduced_sample = PCA(n_components = N_opt_pcs).fit_transform(X_sample)

tsne_embedding_sample = TSNE(n_components = N_comps,

perplexity = perp_list[j],

init = X_reduced_sample[:, 0:N_comps],

learning_rate = 200, n_iter = 1000,

verbose=0).fit_transform(X_reduced_sample)

predicted_var_expl_labels_tsne = []

for i in range(1, (N_comps + 1)):

tSNE_matrix_current_labels=pd.DataFrame(tsne_embedding_sample[:,0:i])

pls_current_labels_tsne = PLSRegression(n_components = i)

pls_current_labels_tsne.fit(tSNE_matrix_current_labels,labels_sample)

y_pred_current_labels_tsne = pls_current_labels_tsne.predict(\

tSNE_matrix_current_labels)

predicted_var_expl_labels_tsne.append(r2_score(labels_sample, \

y_pred_current_labels_tsne, multioutput = 'variance_weighted'))

predicted_var_expl_labels_tsne_matrix[j,:]=predicted_var_expl_labels_tsne

print("Variance in MNIST labels explained by PCA components:")

print(predicted_var_expl_labels_matrix)

print("\nVariance in MNIST labels explained by UMAP components:")

print(predicted_var_expl_labels_umap_matrix)

print("\nVariance in MNIST labels explained by tSNE components:")

print(predicted_var_expl_labels_tsne_matrix)

#Plot MNIST labels variance explained by PCA, tSNE and UMAP components

sns.set(font_scale = 1.5); plt.figure(figsize = (20, 15))

plt.errorbar(range(1, (N_comps + 1)),

np.mean(predicted_var_expl_labels_matrix, axis = 0),

yerr = 2*np.std(predicted_var_expl_labels_matrix, axis = 0),

linewidth = 3, color = 'red', marker = 'o', markersize = 10,

capsize = 5, capthick = 3)

plt.errorbar(range(1, (N_comps + 1)),

np.mean(predicted_var_expl_labels_tsne_matrix, axis = 0),

yerr=2*np.std(predicted_var_expl_labels_tsne_matrix,axis=0),

linewidth = 3, color = 'blue', marker = 'o', markersize = 10,

capsize = 5, capthick = 3)

plt.errorbar(range(1, (N_comps + 1)),

np.mean(predicted_var_expl_labels_umap_matrix, axis = 0),

yerr=2*np.std(predicted_var_expl_labels_umap_matrix,axis=0),

linewidth = 3, color = 'green', marker = 'o', markersize = 10,

capsize = 5, capthick = 3)

plt.ylabel('Cumulative MNIST Labels Explained Variance', fontsize = 20)

plt.xlabel('Number of Components', fontsize = 20)

plt.legend(['PCA variance explained', 'tSNE variance explained',

'UMAP variance explained'], fontsize = 20)

plt.xlim([0.8, (N_comps + 0.2)]); plt.xticks([1, 2, 3]); plt.show()

**Here,** **surprisingly, we observe that** **leading UMAP and tSNE components seem to explain much more of variation in the MNIST labels compared to leading PCA components. **This is somehow counter-intuitive as we saw in the previous section that leading tSNE and UMAP components explain less variation in MNIST image pixel intensities. Nevertheless, this cumulative explained variance plot above basically confirms the correlation heatmap between the MNIST labels, and PCA / tSNE / UMAP components obtained previously.

Here, we have an interesting effect. We saw in the previous section that the leading tSNE and UMAP components can not capture more variation in the MNIST than leading PCA components. However, surpisingly, they are **able to capture more variation in MNIST labels, that is classes of handwritten digits, despite the classes were not provided to tSNE / UMAP at all, when performing the dimension reduction.** In other words, all three dimension reduction techniques are completely unsupervised, i.e. they know nothing about the classes of handwritten digits. Nevertheless, **tSNE and especially UMAP components seem to be, surprisingly, linked to the classes of digits** without being able to capture substantial proportion of variation in MNIST pixel intensities across the images. I do not fully understand this effect but believe it is an interesting observation, which I would be curious to further explore. **Let me know in the comments, if you have more insights into the nature of this effect.**

In this article, we have learnt that a **Partial Least Squares (PLS)** **approach** can be used for building **intuition about data variation explained by tSNE and UMAP components**. Using this approach, we demonstrated that **tSNE and UMAP components explain (unsurprisingly) less variation in MNIST image pixel intensities** **data compared to PCA components**, however, are surprisingly strongly** linked to the labels of MNIST images**. The nature of this effect is unclear taken into account the unsupervised design of all the three dimension reduction techniques.

As per usual, let me know in the comments below what analytical methods from Life Sciences** **and Computational Biology seem **especially mysterious** to you and I will try to address them in this column. Check the files used for the post on my github. Follow me at **Medium ****Nikolay Oskolkov**, in **Twitter** @NikolayOskolkov, in **Mastodon** @[email protected] and connect via **Linkedin**. In the next post, I will discuss** how to cluster in UMAP space**, stay tuned.

**Denial of responsibility!**Techno Blender is an automatic aggregator of the all world’s media. In each content, the hyperlink to the primary source is specified. All trademarks belong to their rightful owners, all materials to their authors. If you are the owner of the content and do not want us to publish your materials, please contact us by email – [email protected]. The content will be deleted within 24 hours.