
A Visual Learner’s Guide to Explain, Implement and Interpret Principal Component Analysis
by Destin Gong, January 2023



Principal Component Analysis for ML (image by author)

In my previous article, we talked about applying linear algebra to data representation in machine learning algorithms, but the applications of linear algebra in ML are much broader than that.

This article will introduce more linear algebra concepts, with the main focus on how these concepts are applied for dimensionality reduction, specifically Principal Component Analysis (PCA). In the second half of this post, we will also implement and interpret PCA using a few lines of code with the help of Python scikit-learn.

High-dimensional data is a common issue in machine learning practice, as we typically feed a large number of features into model training. This makes models harder to interpret and more complex, a problem known as the curse of dimensionality. PCA is beneficial when the dataset is high-dimensional (i.e. contains many features), and it is widely applied for dimensionality reduction.

Additionally, PCA is also used for discovering hidden relationships among features and revealing underlying patterns that can be very insightful. PCA attempts to find linear components that capture as much of the data variance as possible, and the first principal component (PC1) is typically dominated by the features that contribute most to that variance.

The objective of PCA is to find the principal components that represent the data variance in a lower dimension, and we are going to unfold the process into the following steps:

  1. represent the data variance using the covariance matrix
  2. capture the data variance in a lower dimensionality using eigenvectors and eigenvalues
  3. compute the principal components as the eigenvectors of the covariance matrix

To understand how PCA works, we need to answer two questions: what is a covariance matrix, and what are eigenvectors and eigenvalues? It also helps to shift our perspective from viewing matrix multiplication as a math operation to viewing it as a visual transformation.

Matrix Transformation

We have previously introduced how the matrix dot product is computed from a math operation perspective. We can also interpret the dot product as a visual transformation, which helps in understanding more complex linear algebra concepts. As illustrated below, let us use a 2×2 matrix as an example. We split the matrix vertically into two vectors, where the left one represents the basis vector of the x-axis and the right one represents the basis vector of the y-axis. Therefore, a matrix represents a 2D space constructed by the x-axis and y-axis.

matrix transformation — identity matrix (image by author)

It is not hard to see that an identity matrix has [1,0] as the basis vector on the x-axis and [0,1] as the basis vector on the y-axis, so that the dot product between any vector and the identity matrix returns the vector itself.
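As a quick sanity check, the snippet below (a minimal sketch, not part of the original walkthrough) confirms this property with NumPy.

import numpy as np

identity_matrix = np.array([[1, 0], [0, 1]])  # basis vectors [1,0] and [0,1]
v = np.array([1, 2])

# the dot product with the identity matrix returns the vector unchanged
print(identity_matrix.dot(v))  # [1 2]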

Matrix transformation boils down to changing the scale and shifting the direction of the axes. For example, changing the basis vector of the x-axis from [1,0] to [2,0] means that the mapped space has been scaled two times in the x direction.

matrix transformation — x-axis scaled matrix (image by author)

We can additionally combine both the x-axis and y-axis for more complicated scaling, rotation or shearing transformations. A typical example is the mirror matrix, where we swap the x and y axes. For a given vector [1,2], we get [2,1] after the mirror transformation.

matrix transformation — mirror matrix (image by author)

If you would like to practice these transformations in Python and skip the manual calculations, you can use the following code to perform these dot products and visualize the results of the transformations using the plt.quiver() function.

import numpy as np
import matplotlib.pyplot as plt
# define matrices and vector
x_scaled_matrix = np.array([[2,0],[0,1]])
mirror_matrix = np.array([[0,1],[1,0]])
v = np.array([1,2])
# matrix transformation
mirrored_v = mirror_matrix.dot(v)
x_scaled_v = x_scaled_matrix.dot(v)
# plot transformed vectors
origin = np.array([[0, 0], [0, 0]])
plt.quiver(*origin, v[0], v[1], color=['black'], scale=10, label='original vector')
plt.quiver(*origin, mirrored_v[0], mirrored_v[1], color=['#D3E7EE'], scale=10, label='mirrored vector')
plt.quiver(*origin, x_scaled_v[0], x_scaled_v[1], color=['#C6A477'], scale=10, label='x_scaled vector')
plt.legend(loc="lower right")
matrix transformation result in python (image by author)

Covariance Matrix

In Short: the covariance matrix represents the pairwise covariances, and hence the linear relationships, among a group of variables in matrix form.

The covariance matrix is another critical concept in the PCA process, as it represents the data variance in the dataset. To understand it, we first need to know that covariance measures how strongly one random variable varies with another. For two random variables x and y, their covariance is formulated as below, and a larger absolute covariance indicates a stronger linear relationship between the two variables.

covariance formula (image by author)

Given a set of variables (e.g. x1, x2, … xn) in a dataset, the covariance matrix represents the covariance between each pair of variables in matrix format.

covariance matrix (image by author)
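To make the formula concrete, here is a small sketch (with made-up numbers, not from the original post) that computes the covariance of two variables by hand and checks it against np.cov.

import numpy as np

# two small made-up samples for illustration
x = np.array([2.1, 2.5, 3.6, 4.0])
y = np.array([8.0, 10.0, 12.0, 14.0])

# sample covariance: sum of products of deviations from the mean, divided by (n - 1)
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)
print(cov_xy)

# np.cov returns the full 2x2 covariance matrix; its off-diagonal entries
# should match the manual value above
print(np.cov(x, y))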

Multiplying a vector by the covariance matrix transforms it towards the direction that captures the trend of variance in the original dataset.

Let us use a simple example to simulate the effect of this transformation. Firstly, we randomly generate the variables x0 and x1, and then compute the covariance matrix.

# generate random variables x0 and x1
import random
x0 = [round(random.uniform(-1, 1),2) for i in range(0,100)]
x1 = [round(2 * i + random.uniform(-1, 1) ,2) for i in x0]

# compute covariance matrix
X = np.stack((x0, x1), axis=0)
covariance_matrix = np.cov(X)
print('covariance matrix\n', covariance_matrix)

We then transform some random vectors by taking the dot product between each of them and the covariance matrix.

# plot original data points
plt.scatter(x0, x1, color=['#D3E7EE'])

# vectors before transformation
v_original = [np.array([[1,0.2]]), np.array([[-1,1.5]]), np.array([[1.5,-1.3]]), np.array([[1,1.4]])]

# vectors after transformation
for v in v_original:
    v_transformed = v.dot(covariance_matrix)
    origin = np.array([[0, 0], [0, 0]])
    plt.quiver(*origin, v[:, 0], v[:, 1], color=['black'], scale=4)
    plt.quiver(*origin, v_transformed[:, 0], v_transformed[:, 1], color=['#C6A477'], scale=10)

# plot formatting
plt.axis('scaled')
plt.xlim([-2.5,2.5])
plt.ylim([-2.5,2.5])

vectors transformed by covariance matrix (image by author)

Original vectors prior to the transformation are shown in black, and the transformed vectors in brown. As you can see, the original vectors, which point in different directions, have become more aligned with the general trend displayed in the original dataset (i.e. the blue dots). Because of this property, the covariance matrix is important to PCA for describing the relationships between features.

Eigenvalue and Eigenvector

In Short: an eigenvector (v) of a matrix (A) keeps the same direction after the matrix transformation, so Av = λv, where λ is the corresponding eigenvalue. Representing data using eigenvectors and eigenvalues reduces the dimensionality while maintaining as much of the data variance as possible.

To bring more intuition to this concept, we can use a simple demonstration. For example, take the matrix [[0,1],[1,0]]; one of its eigenvectors is [1,1] and the corresponding eigenvalue is 1.

eigenvector and eigenvalue (image by author)

From the matrix transformation section, we know that the matrix [[0,1],[1,0]] acts as a mirror matrix that swaps the x and y coordinates of a vector. Therefore, the direction of the vector [1,1] does not change after the mirror transformation, so it meets the criteria for being an eigenvector of the matrix. The eigenvalue 1 indicates that the vector keeps the same scale and direction as prior to the transformation. Consequently, we are able to represent the effect of the matrix transformation A (2-dimensional) using a scalar λ (1-dimensional), and the eigenvalue tells us how much variance is preserved by the eigenvector.
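We can verify Av = λv for this example directly in NumPy (a small check of my own, not part of the original walkthrough).

import numpy as np

A = np.array([[0, 1], [1, 0]])  # mirror matrix
v = np.array([1, 1])            # candidate eigenvector

print(A.dot(v))  # [1 1] -> direction and scale unchanged, so v is an eigenvector
print(1 * v)     # λv with λ = 1 gives the same result

# numpy recovers both eigenvalues of the mirror matrix (1 and -1) and their eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)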

Let’s continue with the example above and use this code snippet to overlay the eigenvector with the greatest eigenvalue (in red). As you can see, it is aligned with the direction of greatest data variance.

from numpy.linalg import eig

# eigendecomposition of the covariance matrix computed above
eigenvalue, eigenvector = eig(covariance_matrix)

# overlay the eigenvector with the larger eigenvalue (the second column in this example)
plt.quiver(*origin, eigenvector[:, 1][0], eigenvector[:, 1][1], color=['red'], scale=4, label='eigenvector')
visualize eigenvector (image by author)

Principal Components

We have now seen that the covariance matrix represents the data variance when multiple variables are present, and that eigenvectors capture that variance in a lower dimensionality. By computing the eigenvectors and eigenvalues of the covariance matrix, we get the principal components. A matrix has more than one eigenvector, and they are typically arranged in descending order of their eigenvalues, denoted PC1, PC2, …, PCn. The first principal component (PC1) is the eigenvector with the highest eigenvalue (the red vector shown in the image above), and it explains the maximum variance in the data. Therefore, when using principal components to reduce data dimensionality, we select the ones with higher eigenvalues, as they preserve more of the information in the original dataset.
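To connect the pieces, here is a minimal sketch of PCA “by hand” on the toy data from earlier, assuming the covariance_matrix, x0 and x1 variables defined above: we eigendecompose the covariance matrix, sort the eigenvectors by descending eigenvalue, and project the centred data onto PC1.

import numpy as np
from numpy.linalg import eig

# eigendecomposition of the covariance matrix computed earlier
eigenvalues, eigenvectors = eig(covariance_matrix)

# sort the eigenvectors (columns) by descending eigenvalue
order = np.argsort(eigenvalues)[::-1]
principal_components = eigenvectors[:, order]

# centre the data and project it onto the first principal component (PC1)
data = np.stack((x0, x1), axis=1)          # shape (100, 2)
centered = data - data.mean(axis=0)
pc1_scores = centered.dot(principal_components[:, 0])

print(principal_components)
print(pc1_scores[:5])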

We have walked through the theory behind PCA, so now let’s step into the practical part. Luckily, scikit-learn provides an easy implementation of PCA. We will use the public “college majors” dataset from the fivethirtyeight GitHub repository [1].
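The original post does not show the data-loading step. The sketch below is one way the feature dataframe df and the label label_df used later could be prepared from fivethirtyeight’s recent-grads.csv file; the exact file path, the column selection and the missing-value handling are all assumptions for illustration.

import pandas as pd

# assumed source file for the "college majors" dataset
url = ("https://raw.githubusercontent.com/fivethirtyeight/data/"
       "master/college-majors/recent-grads.csv")
raw_df = pd.read_csv(url)

# keep numeric feature columns; drop identifier/text columns and the label (assumption)
label_df = raw_df["Rank"]
df = raw_df.drop(columns=["Rank", "Major", "Major_code", "Major_category"])
df = df.fillna(df.mean())  # fill the few missing values with column means (assumption)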

1. Standardize data into the same scale

PCA is sensitive to features with different scales, as the covariance matrix requires the data to be on the same scale to measure the relationships between features with a consistent standard. To achieve that, data standardization is applied before PCA, so that each feature has a mean of zero and a standard deviation of one. We use the following code snippet to perform data standardization. If you wish to know more about data transformation techniques such as normalization and min-max scaling, please visit my article on “3 Common Techniques for Data Transformation”.

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled_df = scaler.fit_transform(df)
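As a quick sanity check (not in the original post), the scaled array should now have column means close to zero and standard deviations close to one.

print(scaled_df.mean(axis=0).round(2))  # ~0 for every feature
print(scaled_df.std(axis=0).round(2))   # ~1 for every feature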

2. Apply PCA on the scaled data

We then import PCA from sklearn.decomposition and specify the number of components to generate. The number of components is determined by how much of the data variance the principal components should explain. Here we generate 3 components to balance the trade-off between explained variance and dimensionality.

from sklearn.decomposition import PCA
pca = PCA(n_components=3)
pca_data = pca.fit_transform(scaled_df)

3. Visualize explained variance using scree plot

Some information in the original dataset is lost after shrinking it to a lower dimensionality, so it is important to keep as much information as possible while limiting the number of principal components. To help with the interpretation, we can visualize the explained variance using a scree plot. The explained variance of a principal component indicates the magnitude of data variance in the direction of its eigenvector, and it is proportional to the eigenvalue. Higher explained variance means more information is preserved, and the component with the highest explained variance is the first principal component. We can use the explained_variance_ratio_ attribute to get the explained variance. The code snippet below visualizes the explained variance and also the cumulative variance (i.e. the running sum of explained variance as principal components are added).

import matplotlib.pyplot as plt
principal_components = ['PC1', 'PC2', 'PC3']
explained_variance = pca.explained_variance_ratio_
cumulative_variance = np.cumsum(explained_variance)
plt.figure(figsize=(10, 6))
plt.bar(principal_components, explained_variance, color='#D3E7EE')
plt.plot(principal_components, cumulative_variance, 'o-', linewidth=2, color='#C6A477')

# add cumulative variance as the annotation
for i, j in zip(principal_components, cumulative_variance):
    plt.annotate(str(round(j, 2)), xy=(i, j))

scree plot (image by author)

The scree plot tells us the explained variances when three principal components are generated. The first principal component (PC1) explains 60% of the variance, and 84% of the variance is explained by the first 3 components combined.
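As an alternative to hard-coding the number of components, scikit-learn also accepts a float between 0 and 1 for n_components and keeps the smallest number of components that explains at least that share of the variance; the 85% threshold below is just an illustrative choice, not one made in the original article.

from sklearn.decomposition import PCA

pca_85 = PCA(n_components=0.85)          # keep enough components for >= 85% variance
pca_data_85 = pca_85.fit_transform(scaled_df)
print(pca_85.n_components_, pca_85.explained_variance_ratio_.sum())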

4. Interpret the principal components composition

Principal components additionally provide some evidence of the importance of the original features. By evaluating the magnitude and direction of the coefficients for each original feature, we can tell whether a feature is strongly correlated with a component. As shown below, we generate the coefficients of the features with respect to the components.

import pandas as pd

pca_component_df = pd.DataFrame(pca.components_, columns=df.columns)
pca_component_df
component coefficients (image by author)

Additionally, we can use a heatmap from the seaborn library to highlight the features with high absolute coefficient values.

import seaborn as sns
# create custom color palette
customPalette = sns.color_palette("blend:#D3E7EE,#C6A477", as_cmap=True)

# create heatmap
plt.figure(figsize=(24,3))
sns.heatmap(pca_component_df, cmap=customPalette, annot=True)

component coefficients heatmap (image by author)

If we interpret PC1 (i.e. row 0), we can see that multiple features have a relatively high association with PC1, such as “Total” (number of enrolled students), “Employed”, “Full_time”, “Unemployed”, etc., indicating that these features contribute more to the data variance. Additionally, you may notice that some of these features are strongly correlated with each other, and PCA brings the extra benefit of removing multicollinearity among them.
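A quick way to see the multicollinearity benefit (a small check of my own, not from the original post) is to look at the correlation matrix of the component scores, which should be close to the identity matrix.

import numpy as np

# columns of pca_data are the principal component scores; their pairwise
# correlations should be approximately zero off the diagonal
print(np.corrcoef(pca_data, rowvar=False).round(2))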

5. Use principal components in ML algorithm

Finally, we have reduced the dimensionality to a handful of principal components that are ready to be used as the new features in machine learning algorithms. To do so, we take the transformed output of the PCA process, pca_data, and wrap it in a dataframe, pca_df. We can examine the shape of this dataset using pca_df.shape, which gives 173 rows and 3 columns. We then add the label (e.g. “Rank”) back to the dataset with the 3 principal components from the PCA process, and this becomes the new dataframe for building the ML model.

pca_df = pd.DataFrame(pca_data)
new_df = pd.concat([pca_df,label_df], axis = 1)
new_df.columns = ["PC1", "PC2", "PC3", "Rank"]
new_df (image by author)

The remaining process follows the standard procedure of a machine learning lifecycle: split the dataset into train and test sets, build the model, and then evaluate it; a minimal sketch of this step follows below. Here we won’t dive into the details of building ML models, but if you are interested, please have a look at my article on classification algorithms as a starting point.
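For completeness, here is a hedged sketch of that downstream step; the choice of a plain linear regression predicting “Rank” is only an illustrative assumption, not the model used in the original article.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# split the PCA-based features and the label into train and test sets
X = new_df[["PC1", "PC2", "PC3"]]
y = new_df["Rank"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# fit a simple model on the principal components and evaluate on the hold-out set
model = LinearRegression().fit(X_train, y_train)
print(r2_score(y_test, model.predict(X_test)))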


