
Uncovering Anomalies with Variational Autoencoders (VAE): A Deep Dive into the World of Unsupervised Learning

By Will Badr | Jan 2023




In an earlier post, I explained what autoencoders are, what they are used for, and how to leverage them to train an anomaly detection model. As a reminder, autoencoders are a type of neural network commonly used for dimensionality reduction and feature learning. They are also commonly used for anomaly detection, as they learn to reconstruct normal data well but tend to struggle to reconstruct anomalous or outlier data.

An autoencoder network consists of two components: an encoder and a decoder. The encoder maps the input data to a lower-dimensional latent space, and the decoder maps the latent representation back to the original input space. During training, the autoencoder is trained to reconstruct the input data as accurately as possible.

To use an autoencoder for anomaly detection, the autoencoder is first trained on a dataset of normal, non-anomalous data. Once trained, the autoencoder can be used to reconstruct new data samples. If a new data sample is significantly different from the normal data that the autoencoder was trained on, it may be reconstructed poorly, indicating that it is potentially anomalous.
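
As a minimal sketch of this idea (the names are illustrative and not from the original post), the anomaly score is simply the per-sample reconstruction error of a trained autoencoder, compared against a threshold estimated from normal data:

import torch

# Hypothetical helper: score each sample by how poorly a trained autoencoder
# (any model that maps x back to a reconstruction of x) reproduces it.
def reconstruction_error(autoencoder, x):
    reconstructed = autoencoder(x)
    return ((x - reconstructed) ** 2).mean(dim=1)  # per-sample mean squared error

# A sample is flagged as anomalous when its error exceeds a threshold, e.g. a high
# percentile of the errors observed on the normal training data:
# threshold = torch.quantile(reconstruction_error(autoencoder, normal_data), 0.95)
# is_anomaly = reconstruction_error(autoencoder, new_sample) > threshold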

In this article, I will focus on using a variation of the autoencoder network called the Variational Autoencoder (VAE) to detect anomalies, and on what makes it different from regular autoencoders for this task.

VAEs are a type of neural network architecture used for generative modeling. They are unique in that they learn a compact, compressed latent representation of a given dataset and can then generate new samples from that representation.

One of the key features of VAEs is that they are designed to learn a probabilistic model of the data, which means that they can be used to generate new samples that are similar to the training data but not necessarily identical to it. This allows VAEs to be used for tasks such as image generation, text generation, and other types of data generation.

You might still be wondering: what does a generative model have to do with the anomaly detection task? To answer this question, let’s review what anomaly detection is. Anomaly detection is the task of identifying unusual or unexpected patterns in a dataset, i.e. any pattern that deviates from what is normal. Since VAEs learn a probabilistic model of the data, they can generate new samples from the latent space. These new samples are drawn from the same probability distribution as the original data used to train the model, which makes VAEs more robust and tolerant to variations in the data than regular autoencoders. This can be very useful for detecting anomalies in data that has a clear normal behavior.

VAE Network Structure

Figure: Reparameterized Variational Autoencoder. Source: https://commons.wikimedia.org/wiki/File:Reparameterized_Variational_Autoencoder.png

VAE networks are typically composed of multiple components:

  1. An Encoder: The encoder is a neural network that maps the input data to a lower-dimensional latent space. The encoder is typically parameterized by a set of weights and biases that are learned during training.
  2. Latent space: The latent space is the lower-dimensional space that the encoder maps the input data to. This latent space typically has a continuous structure, which means that each dimension of the latent space can take on any real value within a certain range. This is in contrast to a discrete latent space, which would only allow a finite set of values for each dimension. A continuous latent space gives the VAE more flexibility and expressive power: it allows the VAE to capture subtle variations and nuances in the input data, and then generate new data samples that are close to the training data but not necessarily identical to it. This enables the VAE to capture the uncertainty and variability in the data and to generate new samples that are diverse and varied (hence the name variational autoencoder).
  3. Decoder: The decoder is a neural network that maps the latent representation above back to the original input space. The decoder is also typically parameterized by a set of weights and biases that are learned during training.
  4. Reconstruction loss: The reconstruction loss measures how well the decoder is able to reconstruct the input data from the latent representation. This loss is typically used to train the model.

So far, the 4 components above are similar to the ones from the regular autoencoders. VAEs have two extra components:

5. A Prior: The prior is a probability distribution that is used to model the latent space. In VAEs, the prior is often assumed to be a standard normal distribution.

6. A Posterior: The posterior is the distribution that models the latent variables given the input data. The posterior is typically approximated using a function that is parameterized by the encoder.
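
In the standard VAE notation (added here for reference; the symbols are not spelled out in the original post), these two distributions are:

p(z) = N(0, I)                          # the prior over the latent variables z
q(z | x) = N(mu(x), diag(sigma(x)^2))   # the approximate posterior produced by the encoder

where mu(x) and log sigma(x)^2 are the two outputs of the encoder (the fc_mu and fc_logvar layers in the code below).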

Now that you know what a VAE network involves, let’s implement a basic version of it using PyTorch:

import torch
import torch.nn as nn

# Define the VAE model
class VAE(nn.Module):
    def __init__(self, input_dim, latent_dim):
        super(VAE, self).__init__()
        # Define the encoder: input_dim -> 32 -> 16
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 32),
            nn.ReLU(),
            nn.Linear(32, 16),
            nn.ReLU()
        )
        # Define the latent representation (mean and log variance of the posterior)
        self.fc_mu = nn.Linear(16, latent_dim)
        self.fc_logvar = nn.Linear(16, latent_dim)

        # Define the decoder: latent_dim -> 16 -> 32 -> input_dim
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 16),
            nn.ReLU(),
            nn.Linear(16, 32),
            nn.ReLU(),
            nn.Linear(32, input_dim),
            nn.Sigmoid()  # outputs in [0, 1]
        )

    def forward(self, x):
        x = self.encoder(x)
        mu = self.fc_mu(x)
        logvar = self.fc_logvar(x)
        z = self.reparameterize(mu, logvar)
        reconstructed = self.decoder(z)
        return reconstructed, mu, logvar

    def reparameterize(self, mu, logvar):
        # Sample z = mu + sigma * eps, with eps ~ N(0, I)
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + std * eps

# Instantiate the VAE (we will train it on the "normal" data below)
vae = VAE(input_dim=30, latent_dim=10)

# Generate random input data to test the model. Values are kept in [0, 1]
# because the decoder ends with a Sigmoid and we will use a BCE reconstruction loss.
data = torch.rand(100, 30)
optimizer = torch.optim.Adam(vae.parameters())

Let me explain each section in the code snippet above:

  • The encoder and decoder are the same as in a regular autoencoder. The encoder goes from the input dimension size (input_dim) down to a small number of dimensions, i.e. a compressed representation (latent_dim), and the decoder decompresses it back to the original input dimensions.
  • fc_mu: a fully connected layer that maps the intermediate representation of the input data produced by the encoder to the mean of the posterior distribution.
  • fc_logvar: also a fully connected layer, mapping the intermediate representation of the input data to the log variance of the posterior distribution. The posterior distribution is then used to model the latent variables given the input data.
  • reparameterize(): the mean and log variance of the posterior produced by the two fully connected layers are used to sample latent variables via z = mu + exp(0.5 * logvar) * eps, where eps ~ N(0, I). This is known as the reparameterization trick, and it is what allows the VAE to be trained with gradient-based optimization methods.
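
As a side note, and as discussed earlier, VAEs are generative: once the VAE has been trained, the decoder alone can be used to generate new samples by drawing latent vectors from the prior and decoding them. A sketch (not part of the original code):

with torch.no_grad():
    z = torch.randn(5, 10)           # 5 latent vectors drawn from the prior N(0, I); latent_dim = 10
    new_samples = vae.decoder(z)     # shape (5, 30), values in [0, 1] because of the Sigmoid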

Now that we have defined the model and optimizer, we need to define the loss function and the training loop. The loss function in our case is a combination of two different losses: the reconstruction loss, which measures the difference between the input and the output, and the KL divergence loss. The KL divergence loss encourages the posterior distribution to be similar to the prior distribution, which helps prevent overfitting and ensures that the latent variables capture the underlying structure and variability of the input data.

Still doesn’t make sense? Let’s break it down even further:

The prior distribution refers to the distribution of the latent variables before they are conditioned on the input data. The prior distribution is typically assumed to be a standard normal distribution, which represents the belief that the latent variables are independent and have a simple distribution. The posterior distribution refers to the distribution of the latent variables after they are conditioned on the input data. The posterior distribution is modeled using the mean and log variance of the latent variables produced by the encoder.

The prior distribution is used as a regularization term in the VAE, as it encourages the posterior distribution of the latent variables (given the input data) to be similar to the prior distribution. This helps to prevent overfitting and ensures that the latent variables capture the underlying structure and variability of the input data, rather than just memorizing the training data. We use the KL divergence loss to achieve this; it is calculated as the sum, over the latent dimensions, of the divergence between the posterior and the prior:

kl_loss = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())

This expression is the closed-form KL divergence between the diagonal Gaussian posterior and the standard normal prior, written in terms of the mean and log variance produced by the encoder rather than probability densities. The leading -0.5 is part of that closed form, and the resulting loss is always non-negative, as the KL divergence is a non-negative measure of the difference between two distributions.
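
For reference, the closed form (a standard result, not derived in the original post) is:

KL( N(mu, sigma^2) || N(0, I) ) = -0.5 * sum_j ( 1 + log sigma_j^2 - mu_j^2 - sigma_j^2 )

which, with logvar = log sigma^2, is exactly the kl_loss line above.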

Let’s look at the training code:

# Instantiate the model and an optimizer for its parameters
model = VAE(input_dim=30, latent_dim=10)
optimizer = torch.optim.Adam(model.parameters())

# Define our reconstruction loss function
loss_fn = nn.BCELoss()

# Train the model
for epoch in range(100):
    # Compute the reconstruction loss
    reconstructed, mu, logvar = model(data)
    reconstruction_loss = loss_fn(reconstructed, data)

    # Compute the KL divergence loss
    kl_loss = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())

    # Compute the total loss
    total_loss = reconstruction_loss + kl_loss

    # Backpropagate the gradients and update the model weights
    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()

    # Print the loss values
    print(f"Epoch {epoch}: reconstruction_loss = {reconstruction_loss:.4f}, "
          f"kl_loss = {kl_loss:.4f}, total_loss = {total_loss:.4f}")

After instantiating the model, we pass in the data and the model returns three outputs: the reconstructed output, mu, and logvar. We then use mu and logvar to calculate the KL divergence loss. The total loss is the sum of the reconstruction loss and the KL divergence loss, hence we calculate the gradients w.r.t. the total_loss variable.
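
To actually use the trained VAE for anomaly detection, one straightforward approach (a sketch under the assumptions of the toy data above, not code from the original post) is to score each sample by its reconstruction error and flag samples whose error exceeds a threshold estimated from the normal training data:

# Score the training samples by per-sample reconstruction error
with torch.no_grad():
    reconstructed, mu, logvar = model(data)
    train_errors = ((data - reconstructed) ** 2).mean(dim=1)

# Choose a threshold, e.g. the 95th percentile of errors on normal data
threshold = torch.quantile(train_errors, 0.95)

# Score a new (hypothetical) sample and flag it if its error exceeds the threshold
new_sample = torch.rand(1, 30)
with torch.no_grad():
    recon, _, _ = model(new_sample)
anomaly_score = ((new_sample - recon) ** 2).mean()
is_anomaly = bool(anomaly_score > threshold)

Other scores are possible, for example adding the per-sample KL term to the reconstruction error, but the reconstruction error alone is the most common starting point.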

AutoEncoders vs Variational AutoEncoders (VAE):

The main difference between a VAE and a regular autoencoder is the probabilistic treatment of the latent space and the use of the variational lower bound as the objective function, which consists of two terms: the reconstruction loss and the KL divergence between the approximate posterior distribution over the latent space and the prior distribution.
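
Written out in standard notation (added here for reference), the variational lower bound (ELBO) that a VAE maximizes is:

ELBO(x) = E_{q(z|x)} [ log p(x|z) ] - KL( q(z|x) || p(z) )

Minimizing the training loss above (reconstruction loss + KL loss) corresponds to maximizing this bound: the first term corresponds to the reconstruction loss and the second term to the KL divergence loss.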

In a regular autoencoder, the encoder network maps the input data to a latent representation, and the decoder network maps the latent representation back to the original data. The objective function is typically the reconstruction loss, which measures the difference between the input data and the reconstructed data.

In a VAE, the encoder network still maps the input data to a latent representation, but the latent representation is split into two parts: a mean vector and a log variance vector. These two vectors are used to define a Gaussian distribution over the latent space, which allows the VAE to generate new samples from the latent space by sampling from this distribution. The decoder network is then used to map these latent samples back to the original data space.

To summarize, the main differences between a VAE and a regular autoencoder are the use of a latent space with a probabilistic interpretation, and the use of the variational lower bound as the objective function.

Disadvantages of VAEs:

Now that we have explored the benefits and advantages of VAEs, particularly in the anomaly detection domain, let’s look at some of their disadvantages:

  • VAEs may not be as effective for data that is highly variable or has multiple modes of normal behavior.
  • VAEs can also be sensitive to the different choices of hyperparameters, such as the latent dimension and the learning rate, which can make them difficult to optimize.
  • VAEs can be computationally expensive to train, as they require sampling from the latent space and back-propagating through the sampling process.
  • VAEs may struggle to capture complex relationships between the input data and the latent variables, particularly when the data is highly structured or correlated.
  • VAEs may produce blurry or low-quality reconstructions, particularly when the latent dimension is small or the training data is noisy.

In conclusion, VAEs are a powerful and flexible tool for learning the underlying structure and variability of a dataset, and for generating new samples. However, they also have some limitations and challenges that should be considered when deciding whether to use a VAE for a particular task.

