UMAP Variance Explained
A simple way to interpret UMAP components
This is the twenty-fifth post in my column Mathematical Statistics and Machine Learning for Life Sciences, where I use plain language to discuss analytical methods from Computational Biology and Life Sciences. UMAP is a dimensionality reduction technique that, together with tSNE, has become increasingly popular and a de-facto standard tool for the analysis of single cell genomics data, where traditional methods such as PCA have limitations. However, one disadvantage of UMAP and tSNE compared with PCA is that their components are not easily interpretable and not straightforward to link to the variation in the original data. In this article, I suggest a simple method for estimating the amount of variation explained by leading UMAP and tSNE components. Using the classic MNIST dataset of black and white images of handwritten digits as a benchmark, I demonstrate that leading UMAP and tSNE components are inferior to PCA components in terms of explaining total data variation but, surprisingly, show a stronger linkage to the labels of the data points, i.e. they explain more of the biological rather than the total variation in the data.
As a benchmark dataset, we will be using MNIST, which includes 70 000 black and white images of handwritten digits at 28 x 28 pixel resolution, i.e. 784 pixels per image. First, we will download the MNIST dataset, check its dimensions and visualize a few random images.
from sklearn.datasets import fetch_openml

#as_frame = False returns numpy arrays, which the code below relies on
mnist = fetch_openml('mnist_784', version = 1, as_frame = False)
labels = mnist.target.astype(int)
print(mnist.data.shape)
#(70000, 784)
from matplotlib import pyplot as plt
plt.figure(figsize = (20, 15))
for i in range(25):
    plt.subplot(5, 5, i + 1)
    plt.imshow(mnist.data[i].reshape(28, 28), cmap = plt.get_cmap('gray'))
plt.show()
The values of the MNIST dataset represent pixel intensities ranging from 0 to 255, where 0 corresponds to the black background. Therefore, MNIST is a “zero-inflated” dataset, which makes it very similar to a typical single cell gene expression dataset. It is usually recommended to normalize the black and white pixel intensities by the maximum value of 255. However, here, by analogy with single cell gene expression, we will use a log-transform as another mild normalization strategy. Also, for the sake of time, we will not use all 70 000 images but randomly pick 10 000; later we will draw 10 000 images a few times to make sure that our conclusions are robust.
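To get a feel for how zero-inflated MNIST actually is, here is a quick check (my own addition, not in the original post) that counts the fraction of exactly-zero pixels in the full dataset, using the mnist object loaded above:
#Quick check of zero-inflation: fraction of pixels that are exactly zero
#(mnist was loaded above with as_frame = False, so mnist.data is a numpy array)
import numpy as np
print(np.mean(mnist.data == 0))
#Roughly 0.8, i.e. about 80% of all pixel values are pure black background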
import numpy as np

N_points = 10000
X = np.log10(mnist.data + 1)
#X = mnist.data / 255
np.random.seed(123)
random_indices = np.random.choice(X.shape[0], size=N_points, replace=False)
X = X[random_indices,:]
labels = labels[random_indices]
As a first exploratory step, in order to understand the variation in MNIST, we will perform a PCA analysis on MNIST itself and on a shuffled version of MNIST. This will help us estimate the number of meaningful, i.e. non-redundant, dimensions out of the initial 784 dimensions; in other words, we will figure out the number of informative dimensions to keep for all the dimension reduction and clustering techniques tested below. More details are here.
import pandas as pd; import matplotlib.pyplot as plt
from sklearn.decomposition import PCA; import seaborn as sns

N_pca_comps = 100
sns.set(font_scale = 1.5); figure = plt.figure(figsize = (20, 15))
plt.subplot(221)
X_reduced = PCA(n_components = 2).fit_transform(X)
plt.scatter(X_reduced[:,0], X_reduced[:,1], c=labels, cmap='tab10', s=10)
plt.title('PCA: MNIST', fontsize = 25)
plt.xlabel('PC1', fontsize = 22); plt.ylabel('PC2', fontsize = 22)
plt.subplot(222)
pca = PCA(n_components = N_pca_comps).fit(X)
print('Observed variance explained:')
print(pca.explained_variance_ratio_[0:10]); print('\n')
plt.bar(range(len(pca.explained_variance_ratio_)),
pca.explained_variance_ratio_)
plt.xlabel('Number of Principal Components', fontsize = 22)
plt.ylabel('Explained Variance', fontsize = 22)
N_perm = 10
X_flat = X.flatten()
expl_var_perm_df = pd.DataFrame(index = list(range(N_perm)),
columns = list(range(X.shape[1])))
for i in range(N_perm):
    np.random.shuffle(X_flat)
    X_perm = X_flat.reshape(X.shape[0], X.shape[1])
    pca_perm = PCA().fit(X_perm)
    expl_var_perm_df.loc[i] = pca_perm.explained_variance_ratio_
    print('Finished {} permutations'.format(i + 1))
X_perm = list(expl_var_perm_df.mean(axis = 0) +
2*expl_var_perm_df.std(axis = 0))
print('\nPermuted variance explained:')
print(X_perm[0:10])
plt.subplot(223)
plt.plot(pca.explained_variance_ratio_, c = 'blue')
plt.plot(X_perm, c = 'red'); plt.xlim([-1, N_pca_comps])
plt.xlabel('Number of Principal Components', fontsize = 22)
plt.ylabel('Explained Variance', fontsize = 22)
plt.gca().legend(('Observed variance explained',
'Permuted variance explained'), fontsize = 20)
plt.subplot(224)
pval = list()
for j in range(N_pca_comps):
    pval.append(np.sum(expl_var_perm_df.iloc[:, j] +
                       2*expl_var_perm_df.std(axis = 0) >=
                       pca.explained_variance_ratio_[j]) / N_perm)
plt.plot(pval, c = 'darkgreen')
plt.xlabel('Number of Principal Components', fontsize = 22)
plt.ylabel('P-value', fontsize = 22); plt.xlim([-1, N_pca_comps])
N_opt_pcs = np.where(np.array(pval) >= 0.05)[0][0]
print('\nNumber of significant Principal Components: {}'.format(N_opt_pcs))
print('Together they explain {}% of variation in the data'.\
format(int(round(sum(pca.explained_variance_ratio_[0:\
np.where(np.array(pval) >= 0.05)[0][0]])*100,0))))
figure.tight_layout()
plt.show()
An important outcome of the PCA analysis on MNIST and shuffled MNIST is that the dataset appears to have 62 informative principal components that altogether capture 86% of the variation in the data. Therefore, if we replace the original dataset of 784 features by 62 PCs, we will keep most of the variation in the data while reducing its dimensionality by more than 10 times.
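As a minimal illustration of this compression (not part of the original analysis; it assumes the X, pca and N_opt_pcs objects from the code above), we can replace the 784 raw pixel features by the 62 significant PCs and check the cumulative variance they retain:
#Replace the 784 pixel features by the N_opt_pcs significant PCs
from sklearn.decomposition import PCA
import numpy as np
X_pcs = PCA(n_components = N_opt_pcs).fit_transform(X)
print(X_pcs.shape)
#(10000, 62), i.e. a more than 10-fold reduction in dimensionality
print(np.sum(pca.explained_variance_ratio_[0:N_opt_pcs]))
#~0.86, i.e. the cumulative variance retained by these components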
Now, let us visualize the MNIST dataset as a 2D map using the UMAP and tSNE dimension reduction techniques. In both cases, we will use the top 2 principal components, i.e. 2 PCs, for initialization. The perplexity hyperparameter for tSNE and the number of nearest neighbors for UMAP will be computed as the square root of the number of data points (images). See here for details.
import umap; import numpy as np
import seaborn as sns; import matplotlib.pyplot as plt
from sklearn.manifold import TSNE; from sklearn.decomposition import PCA

opt_perp = int(np.round(np.sqrt(X.shape[0]), 0))
X_reduced = PCA(n_components = N_opt_pcs).fit_transform(X)
umap_embedding = umap.UMAP(n_components = 2, n_neighbors = opt_perp,
init = X_reduced[:, 0:2],
min_dist=0.3, n_epochs = 1000, random_state = 123,
verbose = 0).fit_transform(X_reduced)
tsne_embedding = TSNE(n_components=2, perplexity=opt_perp,
init=X_reduced[:, 0:2],
learning_rate = 200, n_iter = 1000, random_state = 123,
verbose = 0).fit_transform(X_reduced)
sns.set(font_scale = 1.5); plt.figure(figsize = (20, 10))
plt.subplot(121)
plt.scatter(tsne_embedding[:, 0], tsne_embedding[:, 1], c = labels, s = 10,
cmap = 'tab20')
plt.title('tSNE: MNIST', fontsize = 25)
plt.xlabel("tSNE1", fontsize = 22); plt.ylabel("tSNE2", fontsize = 22)
plt.subplot(122)
plt.scatter(umap_embedding[:, 0], umap_embedding[:, 1], c = labels, s = 10,
cmap = 'tab20')
plt.title('UMAP: MNIST', fontsize = 25)
plt.xlabel("UMAP1", fontsize = 22); plt.ylabel("UMAP2", fontsize = 22)
plt.show()
One obvious conclusion that can be drawn from comparing the 2D PCA, tSNE and UMAP pictures is that the classes of handwritten digits are much better resolved by the non-linear neighbor-graph dimensionality reduction methods, tSNE and UMAP, than by linear matrix factorization techniques such as PCA. Therefore, if we assume that the classes of handwritten digits (aka the biological phenotype, e.g. cell type in scRNAseq experiments) comprise most of the variation in the MNIST data, it makes sense to hypothesize that two tSNE / UMAP components are capable of capturing more biological variation than two PCA components (at least in the MNIST dataset). In the next sections, we are going to try to quantify and test this hypothesis.
As non-linear neighbor-graph based techniques, UMAP and tSNE do not have a built-in concept of variance explained by their components, in contrast to matrix factorization linear dimension reduction techniques such as PCA; see, for example, the answer of Leland McInnes, the author of UMAP. Here, nevertheless, we will try to quantify the amount of MNIST variation in pixel intensities explained by UMAP / tSNE components and compare it with the amount of variation explained by PCA components, using Partial Least Squares (PLS) regression and the R-squared statistic generalized to matrix responses.
Let us check the percentage of MNIST variation explained by the first principal component (PC1); we can easily compute and extract it from the PCA as the first eigenvalue normalized by the sum of all eigenvalues. Below, we can see that PC1 explains ~11% of the MNIST variation.
#First, we will compute variance explained by PC1 via PCA in sklearn
import numpy as np
from sklearn.decomposition import PCA

pca = PCA(n_components = X.shape[1]).fit(X)
pca_comps = PCA().fit_transform(X)
print(pca.explained_variance_ratio_[0])
#0.11043073983593521
Now, we will reproduce this number from Partial Least Squares (PLS) regression, using the following reasoning. Suppose we want to approximate a matrix X by another matrix PCA_matrix, which for now includes only one column, that is PC1, but generally can include up to 784 columns for the MNIST dataset. Then, we can fit a PLS linear regression model X = B * PCA_matrix and compute an R-squared statistic which will reflect the amount of variation in X explained by PCA_matrix. In matrix form, the R-squared statistic is given by the following equation:

R^2 = 1 - sum((X - B * PCA_matrix)^2) / sum((X - mean(X))^2)
where B * PCA_matrix represents the prediction of X from its approximation PCA_matrix, with B found from the PLS regression. To implement this procedure with the scikit-learn Python module, we first fit a PLS model with X as the response variable and PCA_matrix as the explanatory variable, then compute the prediction y_pred = B * PCA_matrix, and, finally, compute the R-squared statistic either with the r2_score function in scikit-learn or directly from the equation above. Let us check that both give identical answers:
#Now let us compute variance explained by PC1 through PLS procedure
from sklearn.metrics import r2_score
from sklearn.cross_decomposition import PLSRegression

PCA_matrix = pd.DataFrame(pca_comps[:, 0:1])
pls = PLSRegression(n_components = 1)
pls.fit(PCA_matrix, X)
y_pred = pls.predict(PCA_matrix)
print(r2_score(X, y_pred, multioutput = 'variance_weighted'))
#0.11043073983593246
#Finally, let us compute variance explained by PC1 from scratch
print(1 - np.sum((np.array(X) - np.array(y_pred))**2) / np.sum((X - \
np.mean(X, axis = 0))**2))
#0.11043073983593554
Thus, by utilizing PLS regression, we computed the fraction of total MNIST variation explained by the first principal component outside of the PCA algorithm. Compare the fraction calculated by PLS with the corresponding variance explained by the PCA algorithm, pca.explained_variance_ratio_[0], which is the first eigenvalue divided by the sum of all eigenvalues: they are nearly identical. Now let PCA_matrix consist of several PCs. Below, we demonstrate that the PLS computation of the cumulative fraction of variance explained by the principal components gives results nearly identical to the PCA cumulative variance explained.
from sklearn.metrics import r2_score
from sklearn.cross_decomposition import PLSRegression

predicted_var_expl = []
for i in range(1, 21):
    PCA_matrix_current = pd.DataFrame(pca_comps[:, 0:i])
    pls_current = PLSRegression(n_components = i)
    pls_current.fit(PCA_matrix_current, X)
    y_pred_current = pls_current.predict(PCA_matrix_current)
    predicted_var_expl.append(r2_score(X, y_pred_current,
                                       multioutput = 'variance_weighted'))
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(font_scale = 1.5); plt.figure(figsize = (20, 15))
plt.plot(np.cumsum(pca.explained_variance_ratio_[0:100]),linewidth=5)
plt.plot(predicted_var_expl, linewidth = 5)
plt.ylabel('Cumulative Explained Variance', fontsize = 20)
plt.xlabel('Number of Principal Components', fontsize = 20)
plt.legend(['PCA - computed cumulative variance explained',
'PLS - computed cumulative variance explained'],
fontsize = 20); plt.show()
For clarity, we plot the PLS-computed MNIST variance explained for PC1-PC20 on top of the native, eigenvalue-based PCA variance explained, which is computed for PC1-PC100. We can see that the curves of cumulative variance explained computed by PLS and PCA nicely coincide. This is no surprise, since the PLS computation essentially mimics what is going on under the hood of the PCA algorithm. Very importantly, however, this gives us an instrument for computing the variance of X explained by any approximation, i.e. not only the PCA_matrix but any other matrix as well. In the next section, we will use UMAP_matrix as an approximation of the original matrix X of MNIST pixel intensities and compare the cumulative variance of X explained by UMAP and PCA components.
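Since the same PLS plus R-squared recipe will be reused for UMAP and tSNE below, it may be convenient to wrap it into a small helper function. The pls_variance_explained function below is my own convenience wrapper, not code from the original post; it simply bundles the steps shown above:
import pandas as pd
from sklearn.metrics import r2_score
from sklearn.cross_decomposition import PLSRegression

def pls_variance_explained(Z, X):
    #Fraction of variance in X explained by the columns of Z, estimated by
    #regressing X on Z with PLS and computing the R-squared statistic
    Z = pd.DataFrame(Z)
    pls = PLSRegression(n_components = Z.shape[1])
    pls.fit(Z, X)
    return r2_score(X, pls.predict(Z), multioutput = 'variance_weighted')

#For example, the variance of X explained by the first two principal components
print(pls_variance_explained(pca_comps[:, 0:2], X))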
In this section we will apply the PLS methodology for estimating variance explained that was developed in the previous section. The reasoning is the same: since UMAP provides some sort of approximation of the original data X (people even run clustering on the 2D UMAP to discover cell types in the scRNAseq field), we can use the UMAP_matrix (and likewise the tSNE_matrix) as an approximation of X, fit the PLS model X = B * UMAP_matrix, and estimate the fraction of MNIST variance explained by the UMAP components via the R-squared statistic

R^2 = 1 - sum((X - B * UMAP_matrix)^2) / sum((X - mean(X))^2)
Let us now compute the fraction of MNIST variance explained by the first UMAP component, and likewise by the first tSNE component.
from sklearn.metrics import r2_score
from sklearn.cross_decomposition import PLSRegression

#MNIST variation explained by UMAP1
UMAP_matrix = pd.DataFrame(umap_embedding[:, 0:1])
pls = PLSRegression(n_components = 1)
pls.fit(UMAP_matrix, X)
y_pred = pls.predict(UMAP_matrix)
print(r2_score(X, y_pred, multioutput = 'variance_weighted'))
#0.07335034485651613
#Here the same but more explicitly via the R^2 equation above
print(1 - np.sum((np.array(X) - np.array(y_pred))**2) / np.sum((X - \
np.mean(X, axis = 0))**2))
#0.07335034485652026
#MNIST variation explained by tSNE1
tSNE_matrix = pd.DataFrame(tsne_embedding[:, 0:1])
pls = PLSRegression(n_components = 1)
pls.fit(tSNE_matrix, X)
y_pred = pls.predict(tSNE_matrix)
print(r2_score(X, y_pred, multioutput = 'variance_weighted'))
#0.07265918428990921
#Here the same but explicitly via the R^2 equation above
print(1 - np.sum((np.array(X) - np.array(y_pred))**2) / np.sum((X - \
np.mean(X, axis = 0))**2))
#0.07265918428991347
We conclude that the first components of both UMAP and tSNE explain about 7% of the MNIST variation, which is lower than the 11% explained by the first PCA component. This is not surprising. Intuitively, it is hard to expect that there could be another latent variable, alternative to PC1, that explains more than 11% of the MNIST variation without redefining the concept of “explained variance”. The current definition comes from PCA (normalized eigenvalues) and from linear regression (R-squared), and both are linear frameworks. In addition, PC1 by definition corresponds to the direction of maximal data variation. Therefore, if another latent variable comes from a nonlinear analysis such as UMAP, which does not per se attempt to maximize variation in the data, it would be hard to expect that, e.g., UMAP1 explains more variation in MNIST within the linear definition of “variance explained”. Now we will calculate the variance explained by a few top tSNE and UMAP components. To make our analysis more robust towards sampling of the 10 000 images, we will independently draw them N_iter times. Also, for each iteration, we will slightly perturb the hyperparameters of UMAP and tSNE, as well as the number of data points for PCA. This will allow us to build confidence intervals and address the sensitivity of our analysis.
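Before running the robustness analysis, here is a quick sanity check of the reasoning above (my own addition, not in the original post): projecting the data onto an arbitrary random unit direction captures only a tiny fraction of the total variance, far below the ~11% achieved by PC1, which by construction maximizes this quantity.
#Sanity check: variance explained by a random linear direction vs PC1
import numpy as np
np.random.seed(0)
w = np.random.randn(X.shape[1])
w = w / np.linalg.norm(w)
X_centered = np.array(X) - np.mean(np.array(X), axis = 0)
#Fraction of total variance captured by the projection onto w
print(np.var(X_centered @ w) / np.sum(np.var(X_centered, axis = 0)))
#A tiny fraction, far below the ~0.11 explained by PC1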
import umap; import numpy as np; import seaborn as sns
import matplotlib.pyplot as plt; from sklearn.manifold import TSNE
from sklearn.metrics import r2_score; from sklearn.decomposition import PCA
from sklearn.cross_decomposition import PLSRegression

N_iter = 3; N_comps = 3
N_points_list = [5000, 10000, 15000]
perp_list = [50, 100, 150]; min_dist_list = [0.1, 0.2, 0.3]
predicted_var_expl_matrix = np.zeros(shape = (N_iter, N_comps))
predicted_var_expl_umap_matrix = np.zeros(shape = (N_iter, N_comps))
predicted_var_expl_tsne_matrix = np.zeros(shape = (N_iter, N_comps))
for j in range(N_iter):

    #MNIST variance explained by PCA components
    np.random.seed(j)
    X = np.log10(mnist.data + 1); labels = mnist.target.astype(int)
    random_indices = np.random.choice(X.shape[0], size = N_points_list[j],
                                      replace = False)
    X_sample = X[random_indices,:]; labels_sample = labels[random_indices]
    pca_comps_sample = PCA(n_components = N_comps).fit_transform(X_sample)
    predicted_var_expl = []
    for i in range(1, (N_comps + 1)):
        PCA_matrix_current = pd.DataFrame(pca_comps_sample[:, 0:i])
        pls_current = PLSRegression(n_components = i)
        pls_current.fit(PCA_matrix_current, X_sample)
        y_pred_current = pls_current.predict(PCA_matrix_current)
        predicted_var_expl.append(r2_score(X_sample, y_pred_current,
                                           multioutput = 'variance_weighted'))
    predicted_var_expl_matrix[j,:] = predicted_var_expl

    #MNIST variance explained by UMAP components
    X = np.log10(mnist.data + 1); labels = mnist.target.astype(int)
    random_indices = np.random.choice(X.shape[0], size = N_points,
                                      replace = False)
    X_sample = X[random_indices,:]; labels_sample = labels[random_indices]
    opt_perp = int(np.round(np.sqrt(X_sample.shape[0]), 0))
    X_reduced_sample = PCA(n_components = N_opt_pcs).fit_transform(X_sample)
    umap_embedding_sample = umap.UMAP(n_components = N_comps,
                                      n_neighbors = opt_perp,
                                      init = X_reduced_sample[:, 0:N_comps],
                                      min_dist = min_dist_list[j],
                                      n_epochs = 1000,
                                      verbose = 0).fit_transform(X_reduced_sample)
    predicted_var_expl_umap = []
    for i in range(1, (N_comps + 1)):
        UMAP_matrix_current = pd.DataFrame(umap_embedding_sample[:, 0:i])
        pls_current = PLSRegression(n_components = i)
        pls_current.fit(UMAP_matrix_current, X_sample)
        y_pred_current = pls_current.predict(UMAP_matrix_current)
        predicted_var_expl_umap.append(r2_score(X_sample, y_pred_current,
                                                multioutput = 'variance_weighted'))
    predicted_var_expl_umap_matrix[j,:] = predicted_var_expl_umap

    #MNIST variance explained by tSNE components
    X = np.log10(mnist.data + 1); labels = mnist.target.astype(int)
    random_indices = np.random.choice(X.shape[0], size = N_points,
                                      replace = False)
    X_sample = X[random_indices,:]; labels_sample = labels[random_indices]
    X_reduced_sample = PCA(n_components = N_opt_pcs).fit_transform(X_sample)
    tsne_embedding_sample = TSNE(n_components = N_comps,
                                 perplexity = perp_list[j],
                                 init = X_reduced_sample[:, 0:N_comps],
                                 learning_rate = 200, n_iter = 1000,
                                 verbose = 0).fit_transform(X_reduced_sample)
    predicted_var_expl_tsne = []
    for i in range(1, (N_comps + 1)):
        tSNE_matrix_current = pd.DataFrame(tsne_embedding_sample[:, 0:i])
        pls_current = PLSRegression(n_components = i)
        pls_current.fit(tSNE_matrix_current, X_sample)
        y_pred_current = pls_current.predict(tSNE_matrix_current)
        predicted_var_expl_tsne.append(r2_score(X_sample, y_pred_current,
                                                multioutput = 'variance_weighted'))
    predicted_var_expl_tsne_matrix[j,:] = predicted_var_expl_tsne
print("MNIST variance explained by PCA components:")
print(predicted_var_expl_matrix)
print("\nMNIST variance explained by UMAP components:")
print(predicted_var_expl_umap_matrix)
print("\nMNIST variance explained by tSNE components:")
print(predicted_var_expl_tsne_matrix)
#Plot MNIST variance explained by leading PCA, tSNE and UMAP components
sns.set(font_scale = 1.5); plt.figure(figsize = (20, 15))
plt.errorbar(range(1, (N_comps + 1)), np.mean(predicted_var_expl_matrix,
axis = 0), yerr = 2*np.std(predicted_var_expl_matrix, axis = 0),
linewidth = 3, color = 'red', marker = 'o', markersize = 10,
capsize = 5, capthick = 3)
plt.errorbar(range(1, (N_comps + 1)), np.mean(predicted_var_expl_tsne_matrix,
axis = 0), yerr = 2*np.std(predicted_var_expl_tsne_matrix,
axis = 0), linewidth = 3, color = 'blue', marker = 'o',
markersize = 10, capsize = 5, capthick = 3)
plt.errorbar(range(1, (N_comps + 1)), np.mean(predicted_var_expl_umap_matrix,
axis = 0), yerr = 2*np.std(predicted_var_expl_umap_matrix,
axis = 0), linewidth = 3, color = 'green', marker = 'o',
markersize = 10, capsize = 5, capthick = 3)
plt.ylabel('Cumulative MNIST Explained Variance', fontsize = 20)
plt.xlabel('Number of Components', fontsize = 20)
plt.legend(['PCA variance explained', 'tSNE variance explained',
'UMAP variance explained'], fontsize = 20)
plt.xlim([0.8, (N_comps + 0.2)]); plt.xticks([1, 2, 3]); plt.show()
Here, we conclude that leading PCA components consistently explain more variation in the MNIST dataset than leading tSNE and UMAP components. This result was expected, since UMAP and tSNE, in contrast to PCA, do not aim at building directions of maximal variation. Further, the amounts of variation in the MNIST data explained by leading tSNE and UMAP components are comparable, and both are lower than those explained by PCA components. Therefore, we seem to observe a systematic difference in MNIST explained variance between matrix factorization and neighbor-graph dimensionality reduction techniques.
In the previous section, we showed that leading UMAP and tSNE components explain significantly less variation in the MNIST dataset than leading PCA components, which is an expected result since PCA, in contrast to UMAP and tSNE, by definition attempts to find components of maximal variation in the data.
There is, however, an interesting paradox: looking at the 2D UMAP and tSNE figures of the MNIST data, we can clearly see more distinct clusters of handwritten digits than in the corresponding 2D PCA figure, even though 2D UMAP and tSNE explain less variation in MNIST than PCA.
Here, we hypothesize that UMAP / tSNE components are linked to a phenotype of interest, that is the MNIST labels in our case (or cell types for scRNAseq), rather than to the total variation in the data. To check this hypothesis, let us first explore how the leading PCA, UMAP and tSNE components correlate with the MNIST labels.
import umap; import seaborn as sns
from sklearn.datasets import fetch_openml
from sklearn.decomposition import PCA; import numpy as np
import matplotlib.pyplot as plt; from sklearn.manifold import TSNE

N_comps = 3; N_points = 10000
mnist = fetch_openml('mnist_784', version = 1, as_frame = False)
labels = mnist.target.astype(int); X = np.log10(mnist.data + 1)
#Subsample MNIST data down to 10 000 images
np.random.seed(123)
random_indices = np.random.choice(X.shape[0], size=N_points, replace=False)
X = X[random_indices,:]; labels = labels[random_indices]
#Compute top 3 PCA components on the subsampled MNIST data
pca = PCA(n_components = N_comps).fit(X)
pca_comps = PCA().fit_transform(X)
#Compute top 3 UMAP components on the subsampled MNIST data
opt_perp = int(np.round(np.sqrt(X.shape[0]), 0))
X_reduced = PCA(n_components = N_opt_pcs).fit_transform(X)
umap_embedding = umap.UMAP(n_components = N_comps, n_neighbors = opt_perp,
init = X_reduced[:, 0:N_comps],
min_dist = 0.3, n_epochs = 1000, random_state=123,
verbose = 0).fit_transform(X_reduced)
#Compute top 3 tSNE components on the subsampled MNIST data
tsne_embedding = TSNE(n_components = N_comps, perplexity = opt_perp,
init = X_reduced[:, 0:N_comps],
learning_rate = 200, n_iter = 1000, random_state = 123,
verbose = 0).fit_transform(X_reduced)
#Pairwise correlations between MNIST labels and PCA / tSNE / UMAP components
from scipy.stats import spearmanr
rho_matrix = np.zeros(shape = (3, 3))
for i in range(N_comps):
    rho, pval = spearmanr(pca_comps[:, i], labels)
    rho_matrix[0, i] = np.abs(rho)
for i in range(N_comps):
    rho, pval = spearmanr(tsne_embedding[:, i], labels)
    rho_matrix[1, i] = np.abs(rho)
for i in range(N_comps):
    rho, pval = spearmanr(umap_embedding[:, i], labels)
    rho_matrix[2, i] = np.abs(rho)
rho_df = pd.DataFrame(rho_matrix, columns = ['Comp1', 'Comp2', 'Comp3'],
index = ['PCA', 'tSNE', 'UMAP'])
sns.set(font_scale = 1.5); plt.figure(figsize = (20, 10))
sns.heatmap(rho_df.T, cmap = "Blues", annot = True)
plt.title('Spearman correlation of leading PCA / tSNE / UMAP components'+
' with MNIST labels', fontsize = 20); plt.show()
We can observe that both tSNE1 and UMAP1 have a stronger correlation with the MNIST labels than PC1. Further, the second components of PCA, tSNE and UMAP have comparable strength of correlation with the MNIST labels, whereas the third UMAP component correlates with the MNIST labels much more strongly than the third components of PCA and tSNE, that is PC3 and tSNE3, which appear to be almost uncorrelated with the MNIST labels. In the complete notebook, available on my github, I show that the obtained heatmap is robust with respect to sampling and to perturbations of the PCA / tSNE / UMAP hyperparameters.
Alternatively, we can quantify the linkage between the PCA, tSNE and UMAP components on one side and the MNIST labels (classes of digits, aka cell types in scRNAseq data) on the other side through the PLS regression methodology developed in the previous sections. By analogy, if we consider UMAP_matrix (or PCA_matrix or tSNE_matrix) as an approximation of the MNIST vector of labels, we can fit the PLS model labels = B * UMAP_matrix and estimate the fraction of variation in the MNIST labels explained by the UMAP components as

R^2 = 1 - sum((labels - B * UMAP_matrix)^2) / sum((labels - mean(labels))^2)
Let us test how much variance in MNIST labels is explained by PC1, tSNE1 and UMAP1:
from sklearn.metrics import r2_score
from sklearn.cross_decomposition import PLSRegression

#Variance in MNIST labels explained by PC1
pca = PCA(n_components = X.shape[1]).fit(X)
pca_comps = PCA().fit_transform(X)
PCA_matrix = pd.DataFrame(pca_comps[:, 0:1])
pls = PLSRegression(n_components = 1)
pls.fit(PCA_matrix, labels)
y_pred = pls.predict(PCA_matrix)
print(r2_score(labels, y_pred, multioutput = 'variance_weighted'))
#0.01570798235844606
#Variance in MNIST labels explained by tSNE1
tSNE_matrix = pd.DataFrame(tsne_embedding[:, 0:1])
pls = PLSRegression(n_components = 1)
pls.fit(tSNE_matrix, labels)
y_pred = pls.predict(tSNE_matrix)
print(r2_score(labels, y_pred, multioutput = 'variance_weighted'))
#0.04531724676013893
#Variance in MNIST labels explained by UMAP1
UMAP_matrix = pd.DataFrame(umap_embedding[:, 0:1])
pls = PLSRegression(n_components = 1)
pls.fit(UMAP_matrix, labels)
y_pred = pls.predict(UMAP_matrix)
print(r2_score(labels, y_pred, multioutput = 'variance_weighted'))
#0.1369512765129124
We can clearly see that tSNE1, and especially UMAP1, explain much more variation in the MNIST labels than PC1, which confirms what we previously observed with the Spearman correlation heatmap above. We can also reproduce the R-squared values above with a linear regression model from the statsmodels package.
import statsmodels.formula.api as smf

my_df_comps = pd.DataFrame({'LABELS': labels,
                            'PC1': np.array(PCA_matrix).flatten(),
                            'tSNE1': np.array(tSNE_matrix).flatten(),
                            'UMAP1': np.array(UMAP_matrix).flatten()})
print(smf.ols(formula = 'LABELS ~ PC1', data = my_df_comps).fit().summary())
print(smf.ols(formula = 'LABELS ~ tSNE1', data = my_df_comps).fit().summary())
print(smf.ols(formula = 'LABELS ~ UMAP1', data = my_df_comps).fit().summary())
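If only the R-squared value is needed, it can also be extracted directly from the fitted OLS results via the rsquared attribute, for example:
#Extract only the R-squared statistic from the fitted OLS model
print(smf.ols(formula = 'LABELS ~ UMAP1', data = my_df_comps).fit().rsquared)
#Should reproduce the ~0.137 value computed via PLS above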
Now that we have made sure that we can correctly compute the variance in the MNIST labels explained by PC1, tSNE1 and UMAP1 using the PLS and R-squared methodology, we can extend this procedure to the other leading PCA / tSNE / UMAP components and visualize how the cumulative explained variance changes as more and more components are added. As usual, we will perturb the PCA, tSNE and UMAP results with respect to the sub-sampling iteration, the number of images and the hyperparameters of the methods.
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import PLSRegression
import seaborn as sns; import matplotlib.pyplot as plt
import numpy as np; from sklearn.metrics import r2_score

N_iter = 3; N_comps = 3
N_points_list = [5000, 10000, 15000]
min_dist_list = [0.1, 0.2, 0.3]; perp_list = [90, 100, 110]
predicted_var_expl_labels_matrix = np.zeros(shape = (N_iter, N_comps))
predicted_var_expl_labels_umap_matrix=np.zeros(shape=(N_iter, N_comps))
predicted_var_expl_labels_tsne_matrix=np.zeros(shape=(N_iter, N_comps))
for j in range(N_iter):

    #Variance in MNIST labels explained by PCA components
    np.random.seed(j)
    X = np.log10(mnist.data + 1); labels = mnist.target.astype(int)
    random_indices = np.random.choice(X.shape[0], size = N_points_list[j],
                                      replace = False)
    X_sample = X[random_indices,:]; labels_sample = labels[random_indices]
    pca_comps_sample = PCA(n_components = N_comps).fit_transform(X_sample)
    predicted_var_expl_labels = []
    for i in range(1, (N_comps + 1)):
        PCA_matrix_current_labels = pd.DataFrame(pca_comps_sample[:, 0:i])
        pls_current_labels = PLSRegression(n_components = i)
        pls_current_labels.fit(PCA_matrix_current_labels, labels_sample)
        y_pred_current_labels = pls_current_labels.predict(PCA_matrix_current_labels)
        predicted_var_expl_labels.append(r2_score(labels_sample,
                                                  y_pred_current_labels,
                                                  multioutput = 'variance_weighted'))
    predicted_var_expl_labels_matrix[j,:] = predicted_var_expl_labels

    #Variance in MNIST labels explained by UMAP components
    X = np.log10(mnist.data + 1); labels = mnist.target.astype(int)
    random_indices = np.random.choice(X.shape[0], size = N_points,
                                      replace = False)
    X_sample = X[random_indices,:]; labels_sample = labels[random_indices]
    opt_perp = int(np.round(np.sqrt(X_sample.shape[0]), 0))
    X_reduced_sample = PCA(n_components = N_opt_pcs).fit_transform(X_sample)
    umap_embedding_sample = umap.UMAP(n_components = N_comps,
                                      n_neighbors = opt_perp,
                                      init = X_reduced_sample[:, 0:N_comps],
                                      min_dist = min_dist_list[j],
                                      n_epochs = 1000,
                                      verbose = 0).fit_transform(X_reduced_sample)
    predicted_var_expl_labels_umap = []
    for i in range(1, (N_comps + 1)):
        UMAP_matrix_current_labels = pd.DataFrame(umap_embedding_sample[:, 0:i])
        pls_current_labels_umap = PLSRegression(n_components = i)
        pls_current_labels_umap.fit(UMAP_matrix_current_labels, labels_sample)
        y_pred_current_labels_umap = pls_current_labels_umap.predict(UMAP_matrix_current_labels)
        predicted_var_expl_labels_umap.append(r2_score(labels_sample,
                                                       y_pred_current_labels_umap,
                                                       multioutput = 'variance_weighted'))
    predicted_var_expl_labels_umap_matrix[j,:] = predicted_var_expl_labels_umap

    #Variance in MNIST labels explained by tSNE components
    X = np.log10(mnist.data + 1); labels = mnist.target.astype(int)
    random_indices = np.random.choice(X.shape[0], size = N_points,
                                      replace = False)
    X_sample = X[random_indices,:]; labels_sample = labels[random_indices]
    opt_perp = int(np.round(np.sqrt(X_sample.shape[0]), 0))
    X_reduced_sample = PCA(n_components = N_opt_pcs).fit_transform(X_sample)
    tsne_embedding_sample = TSNE(n_components = N_comps,
                                 perplexity = perp_list[j],
                                 init = X_reduced_sample[:, 0:N_comps],
                                 learning_rate = 200, n_iter = 1000,
                                 verbose = 0).fit_transform(X_reduced_sample)
    predicted_var_expl_labels_tsne = []
    for i in range(1, (N_comps + 1)):
        tSNE_matrix_current_labels = pd.DataFrame(tsne_embedding_sample[:, 0:i])
        pls_current_labels_tsne = PLSRegression(n_components = i)
        pls_current_labels_tsne.fit(tSNE_matrix_current_labels, labels_sample)
        y_pred_current_labels_tsne = pls_current_labels_tsne.predict(tSNE_matrix_current_labels)
        predicted_var_expl_labels_tsne.append(r2_score(labels_sample,
                                                       y_pred_current_labels_tsne,
                                                       multioutput = 'variance_weighted'))
    predicted_var_expl_labels_tsne_matrix[j,:] = predicted_var_expl_labels_tsne
print("Variance in MNIST labels explained by PCA components:")
print(predicted_var_expl_labels_matrix)
print("\nVariance in MNIST labels explained by UMAP components:")
print(predicted_var_expl_labels_umap_matrix)
print("\nVariance in MNIST labels explained by tSNE components:")
print(predicted_var_expl_labels_tsne_matrix)
#Plot MNIST labels variance explained by PCA, tSNE and UMAP components
sns.set(font_scale = 1.5); plt.figure(figsize = (20, 15))
plt.errorbar(range(1, (N_comps + 1)),
np.mean(predicted_var_expl_labels_matrix, axis = 0),
yerr = 2*np.std(predicted_var_expl_labels_matrix, axis = 0),
linewidth = 3, color = 'red', marker = 'o', markersize = 10,
capsize = 5, capthick = 3)
plt.errorbar(range(1, (N_comps + 1)),
np.mean(predicted_var_expl_labels_tsne_matrix, axis = 0),
yerr=2*np.std(predicted_var_expl_labels_tsne_matrix,axis=0),
linewidth = 3, color = 'blue', marker = 'o', markersize = 10,
capsize = 5, capthick = 3)
plt.errorbar(range(1, (N_comps + 1)),
np.mean(predicted_var_expl_labels_umap_matrix, axis = 0),
yerr=2*np.std(predicted_var_expl_labels_umap_matrix,axis=0),
linewidth = 3, color = 'green', marker = 'o', markersize = 10,
capsize = 5, capthick = 3)
plt.ylabel('Cumulative MNIST Labels Explained Variance', fontsize = 20)
plt.xlabel('Number of Components', fontsize = 20)
plt.legend(['PCA variance explained', 'tSNE variance explained',
'UMAP variance explained'], fontsize = 20)
plt.xlim([0.8, (N_comps + 0.2)]); plt.xticks([1, 2, 3]); plt.show()
Here, surprisingly, we observe that leading UMAP and tSNE components seem to explain much more of the variation in the MNIST labels than leading PCA components. This is somewhat counterintuitive, as we saw in the previous section that leading tSNE and UMAP components explain less variation in the MNIST image pixel intensities. Nevertheless, the cumulative explained variance plot above essentially confirms the correlation heatmap between the MNIST labels and the PCA / tSNE / UMAP components obtained previously.
Here, we have an interesting effect. We saw in the previous section that leading tSNE and UMAP components cannot capture more variation in MNIST than leading PCA components. However, surprisingly, they are able to capture more variation in the MNIST labels, that is the classes of handwritten digits, even though the classes were not provided to tSNE / UMAP at all when performing the dimension reduction. In other words, all three dimension reduction techniques are completely unsupervised, i.e. they know nothing about the classes of handwritten digits. Nevertheless, tSNE and especially UMAP components seem to be surprisingly linked to the classes of digits without capturing a substantial proportion of the variation in MNIST pixel intensities across the images. I do not fully understand this effect but believe it is an interesting observation, which I would be curious to explore further. Let me know in the comments if you have more insights into the nature of this effect.
In this article, we have learnt that a Partial Least Squares (PLS) approach can be used for building intuition about the data variation explained by tSNE and UMAP components. Using this approach, we demonstrated that tSNE and UMAP components explain (unsurprisingly) less variation in the MNIST image pixel intensities than PCA components, yet are surprisingly strongly linked to the labels of the MNIST images. The nature of this effect is unclear, taking into account the unsupervised design of all three dimension reduction techniques.
As usual, let me know in the comments below which analytical methods from Life Sciences and Computational Biology seem especially mysterious to you, and I will try to address them in this column. Check the files used for this post on my github. Follow me on Medium Nikolay Oskolkov, on Twitter @NikolayOskolkov, on Mastodon @[email protected] and connect via LinkedIn. In the next post, I will discuss how to cluster in UMAP space, stay tuned.
A simple way to interpret UMAP components
This is the twenty fifth post from my column Mathematical Statistics and Machine Learning for Life Sciences, where I use plain language to discuss analytical methods from Computational Biology and Life Sciences. UMAP is a dimensionality reduction technique that together with tSNE became increasingly popular and de-facto a standard tool for analysis of single cell genomics data, where traditional methods such as PCA have limitations. However, one disadvantage of UMAP and tSNE in comparison with PCA is their non-interpretable components that are not straightforward to link to the variation in the original data. In this article, I suggest a simple method for estimating amount of variation explained by leading UMAP and tSNE components. Using the classic MNIST dataset of black and white images of handwritten digits as a benchmark, I demonstrate that leading UMAP and tSNE components are inferior to PCA components in terms of explaining total data variation, however, surprisingly, demonstrate a better linkage to the labels of data points, i.e. they explain more biological rather than total variation in the data.
As a benchmark dataset, we will be using the MNIST which includes 70 000 black and white images of handwritten digits of 28 x 28 pixels resolution, i.e. 784 pixels per image. First, we will download the MNIST dataset, check its dimensions and visualize a few random images.
from sklearn.datasets import fetch_openmlmnist = fetch_openml('mnist_784', version = 1)
labels = mnist.target.astype(int)
print(mnist.data.shape)
#(70000, 784)
from matplotlib import pyplot as plt
plt.figure(figsize = (20, 15))
for i in range(25):
plt.subplot(5, 5, i + 1)
plt.imshow(mnist.data[i].reshape(28, 28), cmap = plt.get_cmap('gray'))
plt.show()
The values of MNIST dataset represent pixel intensities varying from 0 to 255, where 0 corresponds to the black background. Therefore, MNIST is a “zero-inflated” data set which is very similar to a typical single cell gene expression dataset. It is usually recommended to normalize the black and white pixel intensities by the largest 255 value. However, here, by analogy with single cell gene expression, we will use a log-transform as another mild normalization strategy. Also, for the sake of time, we will not use all the 70 000 images but randomly pick 10 000 images, later we will draw 10 000 images a few times to make sure that our conclusions are robust.
import numpy as npN_points = 10000
X = np.log10(mnist.data + 1)
#X = mnist.data / 255
np.random.seed(123)
random_indices = np.random.choice(X.shape[0], size=N_points, replace=False)
X = X[random_indices,:]
labels = labels[random_indices]
As a first exploratory step, in order to understand the variation in MNIST, we will perform a PCA analysis on the MNIST itself and a shuffled version of MNIST. This will help us to estimate the number of meaningful, i.e. non-redundant, dimensions out of the initial 784 dimensions, i.e. we will figure our the number of informative dimensions to keep for all the future tested dimension reduction and clustering techniques. More details are here.
import pandas as pd; import matplotlib.pyplot as plt
from sklearn.decomposition import PCA; import seaborn as snsN_pca_comps = 100
sns.set(font_scale = 1.5); figure = plt.figure(figsize = (20, 15))
plt.subplot(221)
X_reduced = PCA(n_components = 2).fit_transform(X)
plt.scatter(X_reduced[:,0], X_reduced[:,1], c=labels, cmap='tab10', s=10)
plt.title('PCA: MNIST', fontsize = 25)
plt.xlabel('PC1', fontsize = 22); plt.ylabel('PC2', fontsize = 22)
plt.subplot(222)
pca = PCA(n_components = N_pca_comps).fit(X)
print('Observed variance explained:')
print(pca.explained_variance_ratio_[0:10]); print('\n')
plt.bar(range(len(pca.explained_variance_ratio_)),
pca.explained_variance_ratio_)
plt.xlabel('Number of Principal Components', fontsize = 22)
plt.ylabel('Explained Variance', fontsize = 22)
N_perm = 10
X_flat = X.flatten()
expl_var_perm_df = pd.DataFrame(index = list(range(N_perm)),
columns = list(range(X.shape[1])))
for i in range(N_perm):
np.random.shuffle(X_flat)
X_perm = X_flat.reshape(X.shape[0], X.shape[1])
pca_perm = PCA().fit(X_perm)
expl_var_perm_df.loc[i] = pca_perm.explained_variance_ratio_
print('Finished {} permutations'.format(i + 1))
X_perm = list(expl_var_perm_df.mean(axis = 0) +
2*expl_var_perm_df.std(axis = 0))
print('\nPermuted variance explained:')
print(X_perm[0:10])
plt.subplot(223)
plt.plot(pca.explained_variance_ratio_, c = 'blue')
plt.plot(X_perm, c = 'red'); plt.xlim([-1, N_pca_comps])
plt.xlabel('Number of Principal Components', fontsize = 22)
plt.ylabel('Explained Variance', fontsize = 22)
plt.gca().legend(('Observed variance explained',
'Permuted variance explained'), fontsize = 20)
plt.subplot(224)
pval = list()
for j in range(N_pca_comps):
pval.append(np.sum(expl_var_perm_df.iloc[:, j] +
2*expl_var_perm_df.std(axis = 0) >=
pca.explained_variance_ratio_[j]) / N_perm)
plt.plot(pval, c = 'darkgreen')
plt.xlabel('Number of Principal Components', fontsize = 22)
plt.ylabel('P-value', fontsize = 22); plt.xlim([-1, N_pca_comps])
N_opt_pcs = np.where(np.array(pval) >= 0.05)[0][0]
print('\nNumber of significant Principal Components: {}'.format(N_opt_pcs))
print('Together they explain {}% of variation in the data'.\
format(int(round(sum(pca.explained_variance_ratio_[0:\
np.where(np.array(pval) >= 0.05)[0][0]])*100,0))))
figure.tight_layout()
plt.show()
An important outcome of the PCA analysis on MNIST and shuffled MNIST is that the dataset seems to have 62 informative principal components that all together capture 86% of variation in the data. Therefore, if we replace the original dataset including 784 features by 62 PCs, we will keep most of the variation in the data, but will reduce the dimensionality of the data by more than 10 times.
Now, let us visualize the MNIST dataset as a 2D map using UMAP and tSNE dimension reduction techniques. In both cases, we will use 2 top principal components, i.e. 2 PCs, for initialization. The perplexity hyperparameter for tSNE and the number of nearest neighbors for UMAP will be computed as a square root of the number of data points (images). See here for details.
import umap; import numpy as np
import seaborn as sns; import matplotlib.pyplot as plt
from sklearn.manifold import TSNE; from sklearn.decomposition import PCAopt_perp = np.int(np.round(np.sqrt(X.shape[0]), 0))
X_reduced = PCA(n_components = N_opt_pcs).fit_transform(X)
umap_embedding = umap.UMAP(n_components = 2, n_neighbors = opt_perp,
init = X_reduced[:, 0:2],
min_dist=0.3, n_epochs = 1000, random_state = 123,
verbose = 0).fit_transform(X_reduced)
tsne_embedding = TSNE(n_components=2, perplexity=opt_perp,
init=X_reduced[:, 0:2],
learning_rate = 200, n_iter = 1000, random_state = 123,
verbose = 0).fit_transform(X_reduced)
sns.set(font_scale = 1.5); plt.figure(figsize = (20, 10))
plt.subplot(121)
plt.scatter(tsne_embedding[:, 0], tsne_embedding[:, 1], c = labels, s = 10,
cmap = 'tab20')
plt.title('tSNE: MNIST', fontsize = 25)
plt.xlabel("tSNE1", fontsize = 22); plt.ylabel("tSNE2", fontsize = 22)
plt.subplot(122)
plt.scatter(umap_embedding[:, 0], umap_embedding[:, 1], c = labels, s = 10,
cmap = 'tab20')
plt.title('UMAP: MNIST', fontsize = 25)
plt.xlabel("UMAP1", fontsize = 22); plt.ylabel("UMAP2", fontsize = 22)
plt.show()
One obvious conclusion that can be drawn from the comparison of 2D PCA, tSNE and UMAP pictures is that the classes of handwritten digits are much more resolved by the non-linear neighbor-graph dimensionality reduction methods such as tSNE / UMAP compared to the linear matrix factorization techniques such as PCA. Therefore if we assume the classes of handwritten digits (aka biological phenotype, e.g. cell type in scRNAseq experiments) to comprise most of the variation in the MNIST data, it probably makes sense to hypothesize that two tSNE/UMAP components are capabale of capturing more biological variation than two PCA components (at least in the MNIST dataset). In the next sections, we are going to try to quantify and prove this hypothesis.
As non-linear neighbor-graph based techniques, UMAP / tSNE do not seem to have the concept of variance explained by its components in contrast to matrix factorization linear dimension reduction techniques such as PCA, see for example the answer of Leland McInnes, the author of UMAP. Here, nevertheless, we will try to quantify the amount of MNIST variation in pixel intensities explained by UMAP / tSNE components and compare it with the amount of variation explained by PCA components using the Partial Least Square (PLS) regression and the R-squared statistic generalized for matrix operations.
Let us check the percentage of MNIST variation explained by first principal component (PC1), we can easily compute and extract it from the PCA as the normalized first eigen value of MNIST. Below, we can see that PC1 explains ~11% of MNIST variation.
#First, we will compute variance explained by PC1 via PCA in sklearn
import numpy as np
from sklearn.decomposition import PCApca = PCA(n_components = X.shape[1]).fit(X)
pca_comps = PCA().fit_transform(X)
print(pca.explained_variance_ratio_[0])
#0.11043073983593521
Now, we will reproduce this number from the Partial Least Squares (PLS) regression. We are going to use the following reasoning. Suppose we want to approximate a matrix X by another matrix PCA_matrix, which for now includes only one column, that is PC1, but generally can include up to 784 columns for MNIST dataset. Then, we can fit a PLS linear regression model of X = B * PCA_matrix and compute a R-squared statistic which will reflect the amount of variation in X explained by PCA_matrix. In the matrix form, the R-squared statistic will be given by the following equation:
where B*PCA_matrix represents a prediction of X from its approximation PCA_matrix and will be found from the PLS regression (first equation). To technically implement this procedure in scikit-learn Python module, we will first fit a PLS model with X as a response variable, and PCA_matrix as an explanatory variable, we compute a prediction y_pred = B*PCA_matrix, and, finally, we can either use the r2_score function in scikit-learn, or the second equation above for computing the R-squared statistic. Let us check that both will give identical answers:
#Now let us compute variance explained by PC1 through PLS procedure
from sklearn.metrics import r2_score
from sklearn.cross_decomposition import PLSRegressionPCA_matrix = pd.DataFrame(pca_comps[:, 0:1])
pls = PLSRegression(n_components = 1)
pls.fit(PCA_matrix, X)
y_pred = pls.predict(PCA_matrix)
print(r2_score(X, y_pred, multioutput = 'variance_weighted'))
#0.11043073983593246
#Finally, let us compute variance explained by PC1 from scratch
print(1 - np.sum((np.array(X) - np.array(y_pred))**2) / np.sum((X - \
np.mean(X, axis = 0))**2))
#0.11043073983593554
Thus, by utilizing PLS regression, we computed the fraction of total MNIST variation explained by the first principal component outside of the PCA algorithm. Compare the fraction calculated by the PLS with the respective variance explained by the PCA algorithm pca.explained_variance_ratio_[0] which is the ratio of first eigen value divided by the sum of all eigen values. They are nearly identical. Now let PCA_matrix consist of a number of PCs. Below, we demonstrate that the PLS computation of cumulative fraction of variance explained by principal components will result in nearly identical to PCA cumulative variance explained.
from sklearn.metrics import r2_score
from sklearn.cross_decomposition import PLSRegressionpredicted_var_expl = []
for i in range(1, 21):
PCA_matrix_current = pd.DataFrame(pca_comps[:, 0:i])
pls_current = PLSRegression(n_components = i)
pls_current.fit(PCA_matrix_current, X)
y_pred_current = pls_current.predict(PCA_matrix_current)
predicted_var_expl.append(r2_score(X, y_pred_current,
multioutput = 'variance_weighted'))
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(font_scale = 1.5); plt.figure(figsize = (20, 15))
plt.plot(np.cumsum(pca.explained_variance_ratio_[0:100]),linewidth=5)
plt.plot(predicted_var_expl, linewidth = 5)
plt.ylabel('Cumulative Explained Variance', fontsize = 20)
plt.xlabel('Number of Principal Components', fontsize = 20)
plt.legend(['PCA - computed cumulative variance explained',
'PLS - computed cumulative variance explained'],
fontsize = 20); plt.show()
For clarity, we plot PLS-computed MNIST variance explained for PC1-PC20 on the top of the native PCA eigen value based variance explained which is done for PC1 — PC100. We can see, that the curves of cumulative variance explained computed by PLS and PCA nicely coincide. This is no surprise since PLS computation essentially mimics what is going under the hood in the PCA algorithm. However, very importantly, this gives us an instrument of computing variance of X explained by any approximation, i.e. not only the PCA_matrix but also any other matrix. In the next session, we will use UMAP_matrix as an approximation of the original X matrix of MNIST pixel intensities and compare the cumulative variance of X explained by UMAP and PCA components.
In this section we will use the methodology of PLS estimation of variance explained developed in the previous section. Again, we will use the same reasoning: since UMAP gives some sort of approximation of the original data X, and people even run clustering on 2D UMAP for discovering cell types in scRNAseq field, we can use the UMAP_matrix (and we can do the same for tSNE) as an approximation of X, then we fit the PLS model X=B * UMAP_matrix, and estimate the fraction of MNIST variance explained by UMAP components via R-squared statistic as
Now, we will use the PLS methodology developed in the previous section, and compute the fraction of MNIST variance explained by the first UMAP component.
from sklearn.metrics import r2_score
from sklearn.cross_decomposition import PLSRegression#MNIST variation explained by UMAP1
UMAP_matrix = pd.DataFrame(umap_embedding[:, 0:1])
pls = PLSRegression(n_components = 1)
pls.fit(UMAP_matrix, X)
y_pred = pls.predict(UMAP_matrix)
print(r2_score(X, y_pred, multioutput = 'variance_weighted'))
#0.07335034485651613
#Here the same but more explicitly via the R^2 equation above
print(1 - np.sum((np.array(X) - np.array(y_pred))**2) / np.sum((X - \
np.mean(X, axis = 0))**2))
#0.07335034485652026
#MNIST variation explained by tSNE1
tSNE_matrix = pd.DataFrame(tsne_embedding[:, 0:1])
pls = PLSRegression(n_components = 1)
pls.fit(tSNE_matrix, X)
y_pred = pls.predict(tSNE_matrix)
print(r2_score(X, y_pred, multioutput = 'variance_weighted'))
#0.07265918428990921
#Here the same but explicitly via the R^2 equation above
print(1 - np.sum((np.array(X) - np.array(y_pred))**2) / np.sum((X - \
np.mean(X, axis = 0))**2))
#0.07265918428991347
We conclude that first components of both UMAP and tSNE explain about 7% of MNIST variation which is lower than the 11% explained by the first PCA component. This is not surprising. Intuitively, it is hard to expect that there can be another latent variable alternative to PC1 that explains more than 11% of MNIST variation without redefining the concept of “explained variance”. The current definition comes from the PCA (normalized eigen values), and linear regression, that is R-squared, analyses. Both are linear frameworks. In addition, PC1 by definition corresponds to a direction of maximal data variation. Therefore, if another latent variable comes from a nonlinear analysis such as UMAP, that does not per se attempts to maximize variation in the data, it would be hard to expect that e.g. UMAP1 explains more variation in MNIST within linear definition of “variance explained”. Now we will calculate the variance explained by a few top tSNE and UMAP components. To make our analysis more robust towards sampling of the 10 000 images, we will independently draw them N_iter times. Also, for each iteration, we will slightly change / perturb the hyperparameters of UMAP and tSNE, as well as the number of data points for PCA. This will allow us to build confidence intervals and address the sensitivity of our analysis.
import umap; import numpy as np; import seaborn as sns
import matplotlib.pyplot as plt; from sklearn.manifold import TSNE
from sklearn.metrics import r2_score; from sklearn.decomposition import PCA
from sklearn.cross_decomposition import PLSRegressionN_iter = 3; N_comps = 3
N_points_list = [5000, 10000, 15000]
perp_list = [50, 100, 150]; min_dist_list = [0.1, 0.2, 0.3]
predicted_var_expl_matrix = np.zeros(shape = (N_iter, N_comps))
predicted_var_expl_umap_matrix = np.zeros(shape = (N_iter, N_comps))
predicted_var_expl_tsne_matrix = np.zeros(shape = (N_iter, N_comps))
for j in range(N_iter):
#MNIST variance explained by PCA components
np.random.seed(j)
X = np.log10(mnist.data + 1); labels = mnist.target.astype(int)
random_indices = np.random.choice(X.shape[0], size = N_points_list[j],
replace = False)
X_sample = X[random_indices,:]; labels_sample = labels[random_indices]
pca_comps_sample = PCA(n_components = N_comps).fit_transform(X_sample)
predicted_var_expl = []
for i in range(1, (N_comps + 1)):
PCA_matrix_current = pd.DataFrame(pca_comps_sample[:, 0:i])
pls_current = PLSRegression(n_components = i)
pls_current.fit(PCA_matrix_current, X_sample)
y_pred_current = pls_current.predict(PCA_matrix_current)
predicted_var_expl.append(r2_score(X_sample, y_pred_current,
multioutput='variance_weighted'))
predicted_var_expl_matrix[j,:] = predicted_var_expl
#MNIST variance explained by UMAP components
X = np.log10(mnist.data + 1); labels = mnist.target.astype(int)
random_indices = np.random.choice(X.shape[0], size = N_points,
replace = False)
X_sample = X[random_indices,:]; labels_sample = labels[random_indices]
opt_perp = np.int(np.round(np.sqrt(X_sample.shape[0]), 0))
X_reduced_sample = PCA(n_components = N_opt_pcs).fit_transform(X_sample)
umap_embedding_sample = umap.UMAP(n_components = N_comps,
n_neighbors = opt_perp,
init = X_reduced_sample[:, 0:N_comps],
min_dist = min_dist_list[j],
n_epochs = 1000, verbose = \
0).fit_transform(X_reduced_sample)
predicted_var_expl_umap = []
for i in range(1, (N_comps + 1)):
UMAP_matrix_current = pd.DataFrame(umap_embedding_sample[:, 0:i])
pls_current = PLSRegression(n_components = i)
pls_current.fit(UMAP_matrix_current, X_sample)
y_pred_current = pls_current.predict(UMAP_matrix_current)
predicted_var_expl_umap.append(r2_score(X_sample, y_pred_current, \
multioutput = 'variance_weighted'))
predicted_var_expl_umap_matrix[j,:] = predicted_var_expl_umap
#MNIST variance explained by tSNE components
X = np.log10(mnist.data + 1); labels = mnist.target.astype(int)
random_indices = np.random.choice(X.shape[0], size = N_points,
replace = False)
X_sample = X[random_indices,:]; labels_sample = labels[random_indices]
X_reduced_sample = PCA(n_components = N_opt_pcs).fit_transform(X_sample)
tsne_embedding_sample = TSNE(n_components = N_comps,
perplexity = perp_list[j],
init = X_reduced_sample[:, 0:N_comps],
learning_rate = 200, n_iter = 1000,
verbose = 0).fit_transform(X_reduced_sample)
predicted_var_expl_tsne = []
for i in range(1, (N_comps + 1)):
tSNE_matrix_current = pd.DataFrame(tsne_embedding_sample[:, 0:i])
pls_current = PLSRegression(n_components = i)
pls_current.fit(tSNE_matrix_current, X_sample)
y_pred_current = pls_current.predict(tSNE_matrix_current)
predicted_var_expl_tsne.append(r2_score(X_sample, y_pred_current, \
multioutput = 'variance_weighted'))
predicted_var_expl_tsne_matrix[j,:] = predicted_var_expl_tsne
print("MNIST variance explained by PCA components:")
print(predicted_var_expl_matrix)
print("\nMNIST variance explained by UMAP components:")
print(predicted_var_expl_umap_matrix)
print("\nMNIST variance explained by tSNE components:")
print(predicted_var_expl_tsne_matrix)
#Plot MNIST variance explained by leading PCA, tSNE and UMAP components
sns.set(font_scale = 1.5); plt.figure(figsize = (20, 15))
plt.errorbar(range(1, (N_comps + 1)), np.mean(predicted_var_expl_matrix,
axis = 0), yerr = 2*np.std(predicted_var_expl_matrix, axis = 0),
linewidth = 3, color = 'red', marker = 'o', markersize = 10,
capsize = 5, capthick = 3)
plt.errorbar(range(1, (N_comps + 1)), np.mean(predicted_var_expl_tsne_matrix,
axis = 0), yerr = 2*np.std(predicted_var_expl_tsne_matrix,
axis = 0), linewidth = 3, color = 'blue', marker = 'o',
markersize = 10, capsize = 5, capthick = 3)
plt.errorbar(range(1, (N_comps + 1)), np.mean(predicted_var_expl_umap_matrix,
axis = 0), yerr = 2*np.std(predicted_var_expl_umap_matrix,
axis = 0), linewidth = 3, color = 'green', marker = 'o',
markersize = 10, capsize = 5, capthick = 3)
plt.ylabel('Cumulative MNIST Explained Variance', fontsize = 20)
plt.xlabel('Number of Components', fontsize = 20)
plt.legend(['PCA variance explained', 'tSNE variance explained',
'UMAP variance explained'], fontsize = 20)
plt.xlim([0.8, (N_comps + 0.2)]); plt.xticks([1, 2, 3]); plt.show()
Here, we conclude that the leading PCA components consistently explain more variation in the MNIST dataset than the leading tSNE and UMAP components. This result was expected, since UMAP and tSNE, in contrast to PCA, do not aim at building directions of maximal variation. Further, the amounts of variation explained by the leading tSNE and UMAP components are comparable with each other, and both are lower than for PCA. Therefore, we seem to observe a systematic difference in explained MNIST variance between matrix-factorization and neighbor-graph dimensionality reduction techniques.
In the previous section, we showed that the leading UMAP and tSNE components explain significantly less variation in the MNIST dataset than the leading PCA components. This is an expected result, since PCA, by definition, attempts to find components of maximal variation in the data, in contrast to UMAP and tSNE.
There is, however, an interesting paradox: looking at the 2D UMAP and tSNE figures of the MNIST data, we can clearly see more distinct clusters of handwritten digits than in the corresponding 2D PCA figure, even though 2D UMAP and tSNE explain less variation in MNIST than PCA.
Here, we are going to hypothesize that UMAP / tSNE components are linked to a phenotype of interest, that is the MNIST labels in our case, or cell types for scRNAseq, rather than to the total variation in the data. To check this hypothesis, let us first explore how the leading PCA, UMAP and tSNE components correlate with the MNIST labels.
import umap; import seaborn as sns
from sklearn.datasets import fetch_openml
from sklearn.decomposition import PCA; import numpy as np
import pandas as pd; import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
N_comps = 3; N_points = 10000
mnist = fetch_openml('mnist_784', version = 1, as_frame = False) #keep mnist.data as a numpy array
labels = mnist.target.astype(int); X = np.log10(mnist.data + 1)
#Subsample MNIST data down to 10 000 images
np.random.seed(123)
random_indices = np.random.choice(X.shape[0], size=N_points, replace=False)
X = X[random_indices,:]; labels = labels[random_indices]
#Compute top 3 PCA components on the subsampled MNIST data
pca = PCA(n_components = N_comps).fit(X)
pca_comps = PCA().fit_transform(X)
#Compute top 3 UMAP components on the subsampled MNIST data
opt_perp = int(np.round(np.sqrt(X.shape[0])))
X_reduced = PCA(n_components = N_opt_pcs).fit_transform(X)
umap_embedding = umap.UMAP(n_components = N_comps, n_neighbors = opt_perp,
init = X_reduced[:, 0:N_comps],
min_dist = 0.3, n_epochs = 1000, random_state=123,
verbose = 0).fit_transform(X_reduced)
#Compute top 3 tSNE components on the subsampled MNIST data
tsne_embedding = TSNE(n_components = N_comps, perplexity = opt_perp,
init = X_reduced[:, 0:N_comps],
learning_rate = 200, n_iter = 1000, random_state = 123,
verbose = 0).fit_transform(X_reduced)
#Pairwise correlations between MNIST labels and PCA / tSNE / UMAP components
from scipy.stats import spearmanr
rho_matrix = np.zeros(shape = (3, 3))
for i in range(N_comps):
    rho, pval = spearmanr(pca_comps[:, i], labels)
    rho_matrix[0, i] = np.abs(rho)
for i in range(N_comps):
    rho, pval = spearmanr(tsne_embedding[:, i], labels)
    rho_matrix[1, i] = np.abs(rho)
for i in range(N_comps):
    rho, pval = spearmanr(umap_embedding[:, i], labels)
    rho_matrix[2, i] = np.abs(rho)
rho_df = pd.DataFrame(rho_matrix, columns = ['Comp1', 'Comp2', 'Comp3'],
index = ['PCA', 'tSNE', 'UMAP'])
sns.set(font_scale = 1.5); plt.figure(figsize = (20, 10))
sns.heatmap(rho_df.T, cmap = "Blues", annot = True)
plt.title('Spearman correlation of leading PCA / tSNE / UMAP components'+
' with MNIST labels', fontsize = 20); plt.show()
We can observe that both tSNE1 and UMAP1 have a stronger correlation with the MNIST labels than PC1. Further, the second components of PCA, tSNE and UMAP have a comparable strength of correlation with the MNIST labels, whereas the third UMAP component correlates much more strongly with the MNIST labels than the third components of PCA and tSNE, that is PC3 and tSNE3, which appear to be almost uncorrelated with the labels. In the complete notebook, available on my GitHub, I show that the obtained heatmap is robust with respect to sampling and to perturbations of the PCA / tSNE / UMAP hyperparameters.
Alternatively, we can also quantify the linkage between the PCA, tSNE and UMAP components on one side, and the MNIST labels (classes of digits, aka cell types in scRNAseq data) on the other side, through the PLS regression methodology developed in the previous sections. By analogy, if we consider UMAP_matrix (or PCA_matrix or tSNE_matrix) as an approximation of the MNIST vector of labels, we can fit the PLS model labels = B * UMAP_matrix, and estimate the fraction of variation in the MNIST labels explained by the UMAP components as the variance-weighted R-squared between the true labels and the labels predicted by the PLS model.
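For a single target such as the label vector, the variance-weighted R-squared reduces to the ordinary R-squared, i.e. one minus the ratio of the residual sum of squares to the total sum of squares. As a quick sanity check, here is a minimal helper of my own (for illustration only, not part of the original analysis) that reproduces what r2_score computes in this case:
import numpy as np
def r2_by_hand(y_true, y_pred):
    #R-squared = 1 - residual sum of squares / total sum of squares
    y_true = np.asarray(y_true, dtype = float).ravel()
    y_pred = np.asarray(y_pred, dtype = float).ravel()
    ss_res = np.sum((y_true - y_pred)**2)
    ss_tot = np.sum((y_true - np.mean(y_true))**2)
    return 1 - ss_res / ss_tot
Calling r2_by_hand(labels, y_pred) after fitting any of the PLS models below should agree with the corresponding r2_score value.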
Let us test how much variance in MNIST labels is explained by PC1, tSNE1 and UMAP1:
from sklearn.metrics import r2_score
from sklearn.cross_decomposition import PLSRegression
#Variance in MNIST labels explained by PC1
pca = PCA(n_components = X.shape[1]).fit(X)
pca_comps = PCA().fit_transform(X)
PCA_matrix = pd.DataFrame(pca_comps[:, 0:1])
pls = PLSRegression(n_components = 1)
pls.fit(PCA_matrix, labels)
y_pred = pls.predict(PCA_matrix)
r2_score(labels, y_pred, multioutput = 'variance_weighted')
#0.01570798235844606
#Variance in MNIST labels explained by tSNE1
tSNE_matrix = pd.DataFrame(tsne_embedding[:, 0:1])
pls = PLSRegression(n_components = 1)
pls.fit(tSNE_matrix, labels)
y_pred = pls.predict(tSNE_matrix)
r2_score(labels, y_pred, multioutput = 'variance_weighted')
#0.04531724676013893
#Variance in MNIST labels explained by UMAP1
UMAP_matrix = pd.DataFrame(umap_embedding[:, 0:1])
pls = PLSRegression(n_components = 1)
pls.fit(UMAP_matrix, labels)
y_pred = pls.predict(UMAP_matrix)
r2_score(labels, y_pred, multioutput = 'variance_weighted')
#0.1369512765129124
We can clearly see that tSNE1, and especially UMAP1, explain much more variation in the MNIST labels than PC1, which confirms what we previously observed with the Spearman correlations in the heatmap above. We can also reproduce the R-squared values above with a linear regression model from the statsmodels package.
import statsmodels.formula.api as smf
my_df_comps = pd.DataFrame({'LABELS': labels,
                            'PC1': np.array(PCA_matrix).flatten(),
                            'tSNE1': np.array(tSNE_matrix).flatten(),
                            'UMAP1': np.array(UMAP_matrix).flatten()})
smf.ols(formula = 'LABELS ~ PC1', data = my_df_comps).fit().summary()
smf.ols(formula = 'LABELS ~ tSNE1', data = my_df_comps).fit().summary()
smf.ols(formula = 'LABELS ~ UMAP1', data = my_df_comps).fit().summary()
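If we only need the R-squared values rather than the full regression summaries, we can pull them directly out of the fitted models (a small convenience snippet of my own, assuming the my_df_comps data frame defined above):
#Compact comparison of the three univariate OLS R-squared values
for comp in ['PC1', 'tSNE1', 'UMAP1']:
    ols_fit = smf.ols(formula = 'LABELS ~ ' + comp, data = my_df_comps).fit()
    print(comp, 'R-squared:', round(ols_fit.rsquared, 4))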
Now, once we have made sure that we can correctly compute the variance in the MNIST labels explained by PC1, tSNE1 and UMAP1 using the PLS and R-squared methodology, we can extend this procedure to the other leading PCA / tSNE / UMAP components, and visualize how the cumulative variance explained changes as more and more components are added. As per usual, we will perturb PCA, tSNE and UMAP with respect to the sub-sampling iteration, the number of images and the hyperparameters of the methods.
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import PLSRegression
import seaborn as sns; import matplotlib.pyplot as plt
import numpy as np; from sklearn.metrics import r2_score
N_iter = 3; N_comps = 3
N_points_list = [5000, 10000, 15000]
min_dist_list = [0.1, 0.2, 0.3]; perp_list = [90, 100, 110]
predicted_var_expl_labels_matrix = np.zeros(shape = (N_iter, N_comps))
predicted_var_expl_labels_umap_matrix=np.zeros(shape=(N_iter, N_comps))
predicted_var_expl_labels_tsne_matrix=np.zeros(shape=(N_iter, N_comps))
for j in range(N_iter):
    #Variance in MNIST labels explained by PCA components
    np.random.seed(j)
    X = np.log10(mnist.data + 1); labels = mnist.target.astype(int)
    random_indices = np.random.choice(X.shape[0], size = N_points_list[j],
                                      replace = False)
    X_sample = X[random_indices,:]; labels_sample = labels[random_indices]
    pca_comps_sample = PCA(n_components = N_comps).fit_transform(X_sample)
    predicted_var_expl_labels = []
    for i in range(1, (N_comps + 1)):
        PCA_matrix_current_labels = pd.DataFrame(pca_comps_sample[:, 0:i])
        pls_current_labels = PLSRegression(n_components = i)
        pls_current_labels.fit(PCA_matrix_current_labels, labels_sample)
        y_pred_current_labels = pls_current_labels.predict(PCA_matrix_current_labels)
        predicted_var_expl_labels.append(r2_score(labels_sample,
                                                  y_pred_current_labels,
                                                  multioutput = 'variance_weighted'))
    predicted_var_expl_labels_matrix[j,:] = predicted_var_expl_labels
    #Variance in MNIST labels explained by UMAP components
    X = np.log10(mnist.data + 1); labels = mnist.target.astype(int)
    random_indices = np.random.choice(X.shape[0], size = N_points,
                                      replace = False)
    X_sample = X[random_indices,:]; labels_sample = labels[random_indices]
    opt_perp = int(np.round(np.sqrt(X_sample.shape[0])))
    X_reduced_sample = PCA(n_components = N_opt_pcs).fit_transform(X_sample)
    umap_embedding_sample = umap.UMAP(n_components = N_comps,
                                      n_neighbors = opt_perp,
                                      init = X_reduced_sample[:, 0:N_comps],
                                      min_dist = min_dist_list[j],
                                      n_epochs = 1000,
                                      verbose = 0).fit_transform(X_reduced_sample)
    predicted_var_expl_labels_umap = []
    for i in range(1, (N_comps + 1)):
        UMAP_matrix_current_labels = pd.DataFrame(umap_embedding_sample[:, 0:i])
        pls_current_labels_umap = PLSRegression(n_components = i)
        pls_current_labels_umap.fit(UMAP_matrix_current_labels, labels_sample)
        y_pred_current_labels_umap = pls_current_labels_umap.predict(UMAP_matrix_current_labels)
        predicted_var_expl_labels_umap.append(r2_score(labels_sample,
                                                       y_pred_current_labels_umap,
                                                       multioutput = 'variance_weighted'))
    predicted_var_expl_labels_umap_matrix[j,:] = predicted_var_expl_labels_umap
    #Variance in MNIST labels explained by tSNE components
    X = np.log10(mnist.data + 1); labels = mnist.target.astype(int)
    random_indices = np.random.choice(X.shape[0], size = N_points,
                                      replace = False)
    X_sample = X[random_indices,:]; labels_sample = labels[random_indices]
    opt_perp = int(np.round(np.sqrt(X_sample.shape[0])))
    X_reduced_sample = PCA(n_components = N_opt_pcs).fit_transform(X_sample)
    tsne_embedding_sample = TSNE(n_components = N_comps,
                                 perplexity = perp_list[j],
                                 init = X_reduced_sample[:, 0:N_comps],
                                 learning_rate = 200, n_iter = 1000,
                                 verbose = 0).fit_transform(X_reduced_sample)
    predicted_var_expl_labels_tsne = []
    for i in range(1, (N_comps + 1)):
        tSNE_matrix_current_labels = pd.DataFrame(tsne_embedding_sample[:, 0:i])
        pls_current_labels_tsne = PLSRegression(n_components = i)
        pls_current_labels_tsne.fit(tSNE_matrix_current_labels, labels_sample)
        y_pred_current_labels_tsne = pls_current_labels_tsne.predict(tSNE_matrix_current_labels)
        predicted_var_expl_labels_tsne.append(r2_score(labels_sample,
                                                       y_pred_current_labels_tsne,
                                                       multioutput = 'variance_weighted'))
    predicted_var_expl_labels_tsne_matrix[j,:] = predicted_var_expl_labels_tsne
print("Variance in MNIST labels explained by PCA components:")
print(predicted_var_expl_labels_matrix)
print("\nVariance in MNIST labels explained by UMAP components:")
print(predicted_var_expl_labels_umap_matrix)
print("\nVariance in MNIST labels explained by tSNE components:")
print(predicted_var_expl_labels_tsne_matrix)
#Plot MNIST labels variance explained by PCA, tSNE and UMAP components
sns.set(font_scale = 1.5); plt.figure(figsize = (20, 15))
plt.errorbar(range(1, (N_comps + 1)),
np.mean(predicted_var_expl_labels_matrix, axis = 0),
yerr = 2*np.std(predicted_var_expl_labels_matrix, axis = 0),
linewidth = 3, color = 'red', marker = 'o', markersize = 10,
capsize = 5, capthick = 3)
plt.errorbar(range(1, (N_comps + 1)),
np.mean(predicted_var_expl_labels_tsne_matrix, axis = 0),
yerr=2*np.std(predicted_var_expl_labels_tsne_matrix,axis=0),
linewidth = 3, color = 'blue', marker = 'o', markersize = 10,
capsize = 5, capthick = 3)
plt.errorbar(range(1, (N_comps + 1)),
np.mean(predicted_var_expl_labels_umap_matrix, axis = 0),
yerr=2*np.std(predicted_var_expl_labels_umap_matrix,axis=0),
linewidth = 3, color = 'green', marker = 'o', markersize = 10,
capsize = 5, capthick = 3)
plt.ylabel('Cumulative MNIST Labels Explained Variance', fontsize = 20)
plt.xlabel('Number of Components', fontsize = 20)
plt.legend(['PCA variance explained', 'tSNE variance explained',
'UMAP variance explained'], fontsize = 20)
plt.xlim([0.8, (N_comps + 0.2)]); plt.xticks([1, 2, 3]); plt.show()
Here, surprisingly, we observe that the leading UMAP and tSNE components seem to explain much more of the variation in the MNIST labels than the leading PCA components. This is somewhat counter-intuitive, as we saw in the previous section that the leading tSNE and UMAP components explain less variation in the MNIST image pixel intensities. Nevertheless, the cumulative explained variance plot above essentially confirms the correlation heatmap between the MNIST labels and the PCA / tSNE / UMAP components obtained previously.
Here, we have an interesting effect. We saw in the previous section that the leading tSNE and UMAP components cannot capture more variation in the MNIST data than the leading PCA components. However, surprisingly, they are able to capture more variation in the MNIST labels, that is the classes of handwritten digits, even though the classes were not provided to tSNE / UMAP at all when performing the dimension reduction. In other words, all three dimension reduction techniques are completely unsupervised, i.e. they know nothing about the classes of handwritten digits. Nevertheless, tSNE and especially UMAP components seem to be, surprisingly, linked to the classes of digits without capturing a substantial proportion of the variation in MNIST pixel intensities across the images. I do not fully understand this effect but believe it is an interesting observation, which I would be curious to explore further. Let me know in the comments if you have more insights into the nature of this effect.
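For readers who would like to try this analysis on their own data, the PLS + R-squared recipe used throughout this article can be wrapped into a small reusable helper; below is a minimal sketch of my own (the function and argument names are mine) that returns the cumulative variance explained in any target (pixel intensities or labels) by the first 1, 2, ..., n_comps columns of any embedding (PCA, tSNE or UMAP scores):
import numpy as np; import pandas as pd
from sklearn.cross_decomposition import PLSRegression
from sklearn.metrics import r2_score
def cumulative_variance_explained(embedding, target, n_comps):
    #Fraction of variation in 'target' explained by the first i components
    #of 'embedding', estimated via PLS regression, for i = 1, ..., n_comps
    var_expl = []
    for i in range(1, n_comps + 1):
        comps = pd.DataFrame(np.asarray(embedding)[:, 0:i])
        pls = PLSRegression(n_components = i)
        pls.fit(comps, target)
        y_pred = pls.predict(comps)
        var_expl.append(r2_score(target, y_pred,
                                 multioutput = 'variance_weighted'))
    return var_expl
For example, cumulative_variance_explained(umap_embedding, labels, 3) should give values close to the green UMAP curve in the figure above, up to the sub-sampling and hyperparameter perturbations.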
In this article, we have learnt that a Partial Least Squares (PLS) approach can be used to build intuition about the amount of data variation explained by tSNE and UMAP components. Using this approach, we demonstrated that tSNE and UMAP components explain (unsurprisingly) less variation in the MNIST pixel intensity data than PCA components, but are, surprisingly, more strongly linked to the labels of the MNIST images. The nature of this effect is unclear, given the unsupervised design of all three dimension reduction techniques.
As per usual, let me know in the comments below what analytical methods from Life Sciences and Computational Biology seem especially mysterious to you, and I will try to address them in this column. Check the files used for this post on my GitHub. Follow me at Medium Nikolay Oskolkov, on Twitter @NikolayOskolkov, on Mastodon @[email protected], and connect via LinkedIn. In the next post, I will discuss how to cluster in UMAP space, stay tuned.