Light on Math ML: Intuitive Guide to Matrix Factorization (Part 1) | by Thushan Ganegedara | Jul, 2022
You’ll never be afraid to see an allegedly intimidating matrix factorization equation in your life!
Matrix factorization code: [here]
In this article, you will learn about matrix factorization, the bread and butter of many classical machine learning approaches. The article focuses on explaining the real-world applications of matrix factorization (MF) (with code examples) and the intuition underpinning it. Have you ever thought matrix factorization might be used for drug repurposing ❓
If you’re anything like me (hopefully not), by the end of this article, you will regret downplaying matrix factorization in your undergraduate algebra course, because you never thought you would use it.
Mathematics was never my strong suit during my undergraduate years. I never understood why the angle of the tangent at a given point of a curve is important, why the integral of a curve giving the area under it is important, or why matrix factorization is important. Sure, a lot of the data we see in the digital realm can be represented as matrices (or tensors, more generally). But that’s where it stopped making sense for me. Questions like “if you already have the full ground-truth matrix, why would you break it into two?” haunted me during algebra classes.

Then came graduate studies. I was doing my PhD in deep learning, and that was a perfect storm for ignoring topics like matrix factorization, since it seldom comes up in the deep learning literature.
Now, as an MLE working on recommendation algorithms, I had the much-needed epiphany straight away: “A-ha, so that’s why you break your matrix down into two smaller ones”.
And looking back at my naive undergraduate self from 10 years ago, I feel nothing but deep regret for not paying more attention. To redeem myself, I’m going to show you how cool matrix factorization and its concepts are, with real applications. I’m hoping this will be a lighthouse for those shipwrecked in rough seas of mathematics and algebra.
Long before matrix factorization came … matrices. Lots of the data we see and work with in data science can be represented as matrices. Here are a few examples.
- 🖼 A grayscale image is a matrix of pixels organized into rows (height) and columns (width)
- 📹 A video can be represented as a matrix with one row per frame, each frame unwrapped into a 1D vector along the columns
- A corpus of documents can be represented as a matrix with rows being the documents and columns being all the terms (i.e. the vocabulary) present in the corpus
- 👟 Item ratings (such as movies or products) can be represented as a matrix with rows (one for each user) and columns (one for each item).
I hope that’s convincing enough to appreciate how widespread matrices are in the data we come across daily.

Now that we’ve established matrices are everywhere, our next goal is to understand the most important question in this discussion.
Why do we need to split up (i.e. factorize) matrices?
There’s no universal answer I can give, as the specific benefit of MF usually depends on the application and the algorithm itself. So let me motivate it as follows.
The matrices I outlined above aren’t very useful as they are. Think of them as a layer on top of the underlying patterns in the data, which is what we are typically interested in uncovering. Just like you would cut and polish a diamond 🔹 dug from the earth to increase its appeal and value, these observed matrices need to be manipulated in certain ways to get to the sweet center. Matrix factorization is the vessel for that journey.
The data matrices we observe contain the patterns/behaviors of the sources that generated the data
Going into a bit more detail, data typically has intrinsic patterns in it. For example, if part of an image is occluded, you can still infer the missing content from the surrounding information. As another example, if some words are modified or deleted in a document, you can still understand the meaning or message it conveys.
Why is this possible? Because the data we see has underlying patterns. Pixels in an image aren’t random. Text in a document isn’t random. With matrix factorization, we are trying to cut through the observed data/noise to surface such patterns. Why is it important to learn patterns, you may ask? Having access to such patterns is the life-support of the decision-support systems we develop. After all, learning patterns is what machine learning models do.
It sounds a bit surreal that splitting data into multiple parts reveals the patterns underpinning it. It turns out that projecting data into a latent space forces the resulting matrices to learn patterns and separate out noise.
There are many different MF algorithms which enforce various properties on the resulting factorized matrices. Some constrain the values to be non-negative, while others enforce sparsity. Therefore, the algorithm you use will dictate the nature of the result you get. We’ll see quite a few of them in our journey to the center of it all!
Now you might be tired of hearing me going on and on about MF. It’s time to see these algorithms in action.
Image compression
The first example we’ll see is image compression. Matrix factorization can be used to store the important content of an image needed to reconstruct it later (with a smaller memory footprint). Here we will learn about a technique called singular value decomposition (SVD). The idea behind SVD is to represent a matrix A as the product of three matrices: U, Σ and V.

That is, A = UΣVᵀ, where A is n×m, U is n×n, Σ is n×m, and V is m×m.
The matrix Σ is a diagonal matrix that contains the singular values on its diagonal, an important by-product of SVD. These singular values indicate how much variance is captured by the corresponding columns of U and V (the singular vectors). The more variance you capture, the better the reconstruction will be. You may recognize that this is the same principle behind PCA.
The squares of the singular values are proportional to the amount of information (i.e. variance) captured by the corresponding singular vectors
The other great thing is that the singular values come sorted in decreasing order. This means that to keep the p most important singular values, you simply truncate your matrices to contain only that many singular values (and the corresponding singular vectors).

There are different variants of SVD. Computing only the p largest singular values, without computing the full factorization, is called truncated SVD, and a more efficient, approximate variant of that is known as randomized SVD. Here’s what we get with different numbers of components for randomized SVD.

Here’s the implementation in scikit-learn. Let’s compute the compression for k=10 components on a 512×512 image.
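As a rough sketch (not the article’s exact code), scikit-learn’s TruncatedSVD with the randomized solver can be used like this; the `image` array here is a stand-in for the real image:

```python
# A minimal sketch of compressing a grayscale image with randomized/truncated SVD
# using scikit-learn. The `image` array is a placeholder; load your own image instead.
import numpy as np
from sklearn.decomposition import TruncatedSVD

k = 10                                   # number of singular values to keep
image = np.random.rand(512, 512)         # stand-in for a 512x512 grayscale image

svd = TruncatedSVD(n_components=k, algorithm="randomized", random_state=0)
us = svd.fit_transform(image)            # U @ diag(sigma), shape (512, k)
vt = svd.components_                     # V^T, shape (k, 512)

reconstructed = us @ vt                  # rank-k approximation of the image

# Storage cost of the factors vs. the original image
stored = us.size + svd.singular_values_.size + vt.size
print(f"{stored} values vs {image.size} ({stored / image.size:.1%})")
```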
- Original image = 512×512 = 262,144 values
- Compressed representation = 512×10 + 10×10 + 10×512 = 10,340 values (just ~4% of the original)
Nice! We just reconstructed an approximation of the image using only ~4% of the original memory footprint. You can find the full end-to-end code here.
Foreground detection
Next in line, we have foreground detection. If you thought compressing images was cool, wait till you see this one! You can separate the background and foreground using matrix factorization 🤯 *insert mind-blown GIF* 🤯. Think about the amazing things you could do in video surveillance; you could detect people or cars in videos. Let’s see how that can be done.

First of all, we represent our video as a matrix of size l x f, where l is the length of a single frame (i.e. the height and width of the grayscale frame unwrapped into a 1D vector) and f is the number of frames in the video. In our case, we have a matrix of size 76800 x 794 (each frame is 320×240 = 76800 pixels).
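As a rough sketch (assuming OpenCV and an illustrative file name, not the article’s exact code), such a matrix can be assembled by unwrapping each grayscale frame into a column:

```python
# A minimal sketch of stacking grayscale video frames into a pixels-by-frames matrix M.
# The file name is hypothetical; shapes follow the 320x240, 794-frame example above.
import numpy as np
import cv2  # opencv-python

cap = cv2.VideoCapture("surveillance.mp4")  # hypothetical input video
frames = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)  # drop colour
    frames.append(gray.reshape(-1))                 # unwrap each frame to a 1D vector
cap.release()

# Each column is one frame, so M has shape (pixels, num_frames), e.g. (76800, 794)
M = np.stack(frames, axis=1).astype(np.float64)
print(M.shape)
```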
We can plot it and see what it looks like.

The squiggly lines you see are the movements of people in the video. For this task we will use a different matrix factorization technique called robust PCA (rPCA). The idea is to factorize a given matrix M as follows:

M = L + S

Here, L is a low-rank approximation and S is a sparse matrix. Hold up! 🛑 Didn’t you just say you weren’t going to intimidate us with mind-numbing mathematics? So let’s understand what that means intuitively.
The rank of a matrix is the number of linearly independent columns. A column is linearly independent if it cannot be derived as a linear combination of the other columns in the matrix. Why is that important? Because the more linearly dependent columns you have in a matrix, the more redundant information there is, since they can be derived from the independent ones. If you think about the video feed from a CCTV camera, the background is static and thus contains very little information (or low entropy). So we can probably represent the background with a small number of linearly independent columns. In other words, if I were to represent the content of M as a low-rank matrix L, L would capture the background present in M.
Side note: In SVD, the number of non-zero singular values equals the rank of the matrix.
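As a quick sanity check of that side note (a hypothetical toy example, not from the article), a matrix built from two independent columns has rank 2 and exactly two non-zero singular values:

```python
# Toy check: a product of (100 x 2) and (2 x 50) matrices has rank 2,
# and its SVD shows only two non-zero singular values.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 2)) @ rng.standard_normal((2, 50))

print(np.linalg.matrix_rank(A))                              # -> 2
print(np.round(np.linalg.svd(A, compute_uv=False)[:5], 8))   # only the first two are non-zero
```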
An example of a low-rank matrix: the static background of a video feed represented as a matrix
What about the sparse matrix? That one might make more sense. In a video feed, the static background accounts for most of the data (data, not information) by volume. The remaining, sparse information belongs to the foreground, because the foreground typically takes up only a small portion of the frame. Therefore, if I force what remains of M into a sparse matrix S, I’d probably capture the foreground movements in S. Another way to think about it: S captures the outliers in your data!
Adding up L and S should give us back the original video, which is exactly what the robust PCA equation says. Easier said than done, though! How do we actually ensure these properties for these matrices? That’s out of scope for this article, but to give you a little bit of flavour, you can use an algorithm like Principal Component Pursuit or an alternating directions method. For rPCA, I have used and modified the code originally found here.
At a high level, let’s understand how we can optimize for these special properties.
To minimize the rank of L, you can minimize the nuclear norm of L, which acts as a convex proxy for the rank. On the other hand, if you have worked with linear regression, you might remember that the L1 norm encourages sparsity in your weights. The same principle is used here to make S a sparse matrix.
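To make that concrete, here is a minimal sketch of Principal Component Pursuit via the augmented Lagrangian / alternating directions approach. This is a generic, textbook-style implementation, not the exact code the article’s results were produced with:

```python
# A minimal Principal Component Pursuit sketch: M ~ L + S, where L is recovered
# with singular value thresholding (nuclear norm) and S with soft thresholding (L1).
import numpy as np

def soft_threshold(X, tau):
    """Proximal operator of the L1 norm: shrinks entries towards zero."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def svd_threshold(X, tau):
    """Proximal operator of the nuclear norm: shrinks singular values."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(soft_threshold(s, tau)) @ Vt

def robust_pca(M, tol=1e-7, max_iter=500):
    """Split M into a low-rank L and a sparse S (M = L + S)."""
    mu = M.size / (4.0 * np.abs(M).sum())   # common heuristic step size
    lam = 1.0 / np.sqrt(max(M.shape))       # sparsity vs. low-rank trade-off
    L = np.zeros_like(M)
    S = np.zeros_like(M)
    Y = np.zeros_like(M)                    # dual variable
    for _ in range(max_iter):
        L = svd_threshold(M - S + Y / mu, 1.0 / mu)
        S = soft_threshold(M - L + Y / mu, lam / mu)
        residual = M - L - S
        Y = Y + mu * residual
        if np.linalg.norm(residual) <= tol * np.linalg.norm(M):
            break
    return L, S
```

Running robust_pca on the pixels-by-frames matrix should then give the (mostly static) background in L and the moving people in S.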
Looking at the 250th frame of the original frames_t, L and S gives us the following three subplots (in that order).

And here’s the video.

The full end-to-end code is available here.
Cool trick! Let’s test our knowledge
We have learned two techniques so far: SVD and rPCA. We also learned that when a video is organized into a matrix, the low-rank component captures the background.
Now, recall that interesting property of SVD: the rank equals the number of non-zero singular values. So what would happen if we perform a truncated SVD with just a few components (randomized SVD in our case) and reconstruct the video matrix from them?
We should be able to separate out the background. Then extracting the foreground is trivial: simply subtract the background pixels from the original frame pixels.
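Here’s a rough sketch of that trick, assuming the pixels-by-frames matrix from earlier (the variable names and reshape dimensions are illustrative):

```python
# Background/foreground split via a low-rank (rank-2) randomized SVD reconstruction.
# M stands in for the (76800 x 794) pixels-by-frames video matrix from before.
import numpy as np
from sklearn.utils.extmath import randomized_svd

M = np.random.rand(76800, 794)       # placeholder for the real video matrix

U, s, Vt = randomized_svd(M, n_components=2, random_state=0)
background = U @ np.diag(s) @ Vt     # low-rank reconstruction ~ static background
foreground = M - background          # residual ~ moving people/objects

# Fold one column back into an image (frames here are 240 rows x 320 columns)
fg_frame_250 = foreground[:, 250].reshape(240, 320)
```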

Looks pretty good. We can see that increasing the number of components leads to more of the foreground leaking into the reconstruction. This is a pretty good example of how powerful intuition is: it allows you to combine different techniques confidently and efficiently to get what you need!
But why do we need rPCA here then? That leads to more technical details and is out of scope for this introduction. The short answer is that different algorithms have different strengths and weaknesses, and they do not all get highlighted in every single use case. One obvious difference I can point out is that rPCA is more robust to outliers in the data.
This is where I’ll end the first part of the discussion. We first understood why matrices are important, which led us to why matrix factorization is important. Then we discussed two applications in computer vision: image compression and foreground detection. We covered two algorithms, SVD and rPCA, and even did a cool trick of adapting SVD for foreground detection.
In the next article we are going to explore other parts of the machine learning realm where matrix factorization has ruled (and still rules)! We’ll discuss:
- How matrix factorization is used in NLP to learn concepts
- How matrix factorization is used for item recommendation
Until then, goodbye!
[TBA] Part 2 — Matrix factorization
[1] fast.ai’s course on linear algebra — A very comprehensive lay of the land of matrix factorization and lots of applications in Python
[2] Singular Value Decomposition
[3] Similarities between SVD and PCA
[4] Robust PCA