
Modeling DNA Sequences with PyTorch | by Erin Wilson | Sep, 2022



A beginner-friendly tutorial

Image by author

DNA is a complex data stream. While it can be represented by a string of ACTGs, it is filled with intricate patterns and structural nuances that are difficult for humans to understand just by looking at a raw sequence of nucleotides. In recent years, much progress has been made towards modeling DNA data using deep learning.

Researchers have applied methods such as convolutional neural networks (CNNs), long short-term memory networks (LSTMs), and even transformers to predict various genomic measurements directly from DNA sequences. These models are particularly useful because with enough high-quality training data, they can automatically pick up on sequence patterns — or motifs — that are relevant to your prediction task rather than requiring an expert to specify which patterns to look for ahead of time. Overall, enthusiasm is growing for using deep learning in genomics to help map DNA sequences to their biological functions!

As a grad student interested in using computational approaches to address challenges in sustainability and synthetic biology, I’ve been learning how to use PyTorch to study DNA sequence patterns. There’s no shortage of tutorials on how to get started with PyTorch; however, many tend to focus on image or language input data. For using DNA as an input, there are many great projects out there that have developed PyTorch frameworks to model all sorts of biological phenomena [1,2,3], but they can be quite sophisticated and difficult to dive into as a beginner.

I had some trouble finding beginner examples for those new to PyTorch that were also focused on DNA data, so I compiled a quick tutorial in case any future DNA modelers find it helpful for getting started!

The tutorial itself can be run interactively as a Jupyter Notebook, or you may follow along with the summary of key concepts and Github gists in the rest of this article.

This tutorial shows an example of a PyTorch framework that can use raw DNA sequences as input, feed these into a neural network model, and predict a quantitative label directly from the sequence.

Tutorial Overview:

  1. Generate synthetic DNA data
  2. Prepare data for PyTorch training
  3. Define PyTorch models
  4. Define training loop functions
  5. Run the models
  6. Check model predictions on test set
  7. Visualize convolutional filters
  8. Conclusion

It assumes the reader is already familiar with ML concepts like:

  • What is a neural network, including the basics of a convolutional neural network (CNN)
  • Model training over epochs
  • Splitting data into train/val/test sets
  • Loss functions and comparing train vs. val loss curves

It also assumes some familiarity with biological concepts like:

  • DNA nucleotides
  • What is a regulatory motif?
  • Visualizing DNA motifs

Note: The following methods aren’t necessarily the optimal way to do this! I’m sure there are more elegant solutions, this is just my attempt while learning. But if you’re just getting started with PyTorch and are also using DNA sequences as your input, perhaps this tutorial can be a helpful example of how to “connect some PyTorch tubes together” in the context of DNA sequence analysis.

Usually scientists might be interested in predicting something like a binding score, an expression strength, or classifying a transcription factor binding event. But here, we are going to keep it simple: the goal in this tutorial is to observe if a deep learning model can learn to detect a very small, simple pattern in a DNA sequence and score it appropriately (again, just a practice task to convince ourselves that we have actually set up the PyTorch pieces correctly such that it can learn from input that looks like a DNA sequence).

So, arbitrarily, let’s say that given an 8-mer DNA sequence, we give it points for each letter as follows:

  • A = +20 points
  • C = +17 points
  • G = +14 points
  • T = +11 points

For every 8-mer, sum its points and divide by 8. For example,

AAAAAAAA would score 20.0

(20 + 20 + 20 + 20 + 20 + 20 + 20 + 20) / 8 = 20.0

ACAAAAAA would score 19.625

(20 + 17 + 20 + 20 + 20 + 20 + 20 + 20) / 8 = 19.625

These values for the nucleotides are arbitrary — there’s no real biology here! It’s just a way to assign sequences a score for the purposes of our PyTorch practice.

However, since many recent papers use methods like CNNs to automatically detect “motifs,” or short patterns in the DNA that can activate or repress a biological response, let’s add one more piece to our scoring system. To simulate something like motifs influencing gene expression, let’s say a given sequence gets a +10 bump if TAT appears anywhere in the 8-mer, and a -10 bump if it has a GCG in it. Again, these motifs don’t mean anything in real life, they are just a mechanism for simulating a really simple activation or repression effect.

A simple scoring system for 8-mer DNA sequences. Image by author.

Here’s an implementation of this simple scoring system:
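The gist embedded in the original article doesn’t render here, so below is a minimal sketch that just implements the rules as stated above (the helper name `score_seq` is a placeholder of mine; the article’s actual code may differ):

```python
def score_seq(seq, motif_bonus=10):
    """Score an 8-mer: average the per-nucleotide points, then add
    +10 for a TAT occurrence and -10 for a GCG occurrence."""
    points = {"A": 20, "C": 17, "G": 14, "T": 11}
    score = sum(points[nt] for nt in seq) / len(seq)
    if "TAT" in seq:
        score += motif_bonus  # simulated activating motif
    if "GCG" in seq:
        score -= motif_bonus  # simulated repressing motif
    return score
```

For example, `score_seq("AAAAAAAA")` returns 20.0, and a sequence containing TAT gets bumped up by 10.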

Plotting the score distribution for 8-mer sequences, we see them fall into 3 groups:

  • sequences with GCG (score ≈ 5)
  • sequences without a motif (score ≈ 15)
  • sequences with TAT (score ≈ 25)

Distribution of 8-mer scores. Image by author.

Our goal is now to train a model to predict this score by looking at the DNA sequence.

For a neural network to make predictions, you have to give it your input as a matrix of numbers. For example, to classify images by whether or not they contain a cat, a network “sees” the image as a matrix of pixel values and learns relevant patterns in the relative arrangement of pixels (e.g. patterns that correspond to cat ears, or a nose with whiskers).

We similarly need to turn our DNA sequences (strings of ACGTs) into a matrix of numbers. So how do we pretend our DNA is a cat?

One common strategy is to one-hot encode the DNA: treat each nucleotide as a vector of length 4, where 3 positions are 0 and one position is a 1, depending on the nucleotide.

This one-hot encoding scheme has the nice property that it makes your DNA appear like how a computer sees a picture of a cat! Image by author.
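A minimal encoder sketch (the column order A, C, G, T and the function name are my own assumptions; the tutorial’s helper may differ):

```python
import numpy as np

def one_hot_encode(seq):
    """Return a (len(seq), 4) one-hot matrix; columns ordered A, C, G, T."""
    mapping = {"A": 0, "C": 1, "G": 2, "T": 3}
    onehot = np.zeros((len(seq), 4), dtype=np.float32)
    for i, nt in enumerate(seq):
        onehot[i, mapping[nt]] = 1.0
    return onehot
```

Encoding "ACGT" with this scheme produces the 4x4 identity matrix, which is a handy sanity check.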

With this one-hot encoding scheme, we can prepare our train, val, and test sets. This quick_split just randomly picks some indices in the pandas dataframe to split (sklearn has a function to do this too).

Note: In real/non-synthetic tasks, you might need to be more clever about your splitting strategy depending on your prediction task: often papers will create train/test splits by chromosome or other genome location features.
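For the simple random split used here, `quick_split` might look something like this sketch (a hypothetical reconstruction; the article’s version may take different arguments):

```python
import numpy as np
import pandas as pd

def quick_split(df, split_frac=0.8):
    """Randomly partition a dataframe's rows into two disjoint sets."""
    idxs = np.random.permutation(len(df))
    cut = int(split_frac * len(df))
    return df.iloc[idxs[:cut]], df.iloc[idxs[cut:]]
```

sklearn’s `train_test_split` does essentially the same thing with more options.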

A big step when preparing your data for PyTorch is using DataLoader and Dataset objects. It took me a lot of googling around to figure something out, but this is a solution I was able to concoct from a lot of combing through docs and stack overflow posts!

In short, a Dataset wraps your data in an object that can smoothly serve properly formatted X examples and Y labels to the model you’re training. The DataLoader accepts a Dataset, along with details about how to form batches from your data, and makes it easier to iterate through training steps.
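A sketch of what this can look like for one-hot DNA (the class name `SeqDatasetOHE` and the channels-first layout are my assumptions, chosen to match what `Conv1d` expects):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class SeqDatasetOHE(Dataset):
    """Serve (one-hot sequence, score) pairs in the (4, seq_len)
    channels-first layout that Conv1d expects."""
    def __init__(self, seqs, scores):
        mapping = {"A": 0, "C": 1, "G": 2, "T": 3}
        X = torch.zeros(len(seqs), len(seqs[0]), 4)
        for i, seq in enumerate(seqs):
            for j, nt in enumerate(seq):
                X[i, j, mapping[nt]] = 1.0
        self.X = X.permute(0, 2, 1)  # (batch, 4, seq_len)
        self.y = torch.tensor(scores).unsqueeze(1)

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

# yields batches of (batch_size, 4, 8) inputs and (batch_size, 1) labels
train_dl = DataLoader(SeqDatasetOHE(["AAAAAAAA", "ACGTACGT"], [20.0, 15.5]),
                      batch_size=2, shuffle=True)
```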

These DataLoaders are now ready to be used in a training loop!

The primary model I was interested in trying was a Convolutional Neural Network, as these have been shown to be useful for learning motifs from genomic data. But as a point of comparison, I included a simple Linear model. Here are some model definitions:

Note: These are not optimized models, just something to start with (again, we’re just practicing connecting the PyTorch tubes in the context of DNA).

  • The Linear model tries to predict the score by simply weighting the nucleotides that appear in each position.
  • The CNN model uses 32 filters of length (kernel_size) 3 to scan across the 8-mer sequences for informative 3-mer patterns.
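The two bullets above can be sketched roughly like this (class names are my own paraphrase, and these are deliberately un-tuned, matching the article’s “not optimized” caveat):

```python
import torch
import torch.nn as nn

class DNALinear(nn.Module):
    """One weight per (nucleotide, position) pair; no local context."""
    def __init__(self, seq_len=8):
        super().__init__()
        self.lin = nn.Linear(4 * seq_len, 1)

    def forward(self, x):              # x: (batch, 4, seq_len)
        return self.lin(x.flatten(1))  # -> (batch, 1)

class DNACNN(nn.Module):
    """32 filters of width 3 slide along the sequence, so 3-mer
    patterns like TAT/GCG can be detected at any position."""
    def __init__(self, seq_len=8, num_filters=32, kernel_size=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(4, num_filters, kernel_size=kernel_size),
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(num_filters * (seq_len - kernel_size + 1), 1),
        )

    def forward(self, x):              # x: (batch, 4, seq_len)
        return self.net(x)
```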

Next, we need to define the training/fit loop. I admit I’m not super confident here and spent a lot of time wading through matrix dimension mismatch errors — there are likely more elegant ways to do this! But maybe this is ok? –shrug– (Shoot me a message if you have feedback 🤓 )

In any case, I defined functions that stack like this:

run_model()          # adds default optimizer and loss function
  fit()              # loops through epochs
    train_step()     # loops through batches
      loss_batch()   # calcs train loss for a batch
    val_step()
      loss_batch()   # calcs val loss for a batch
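A condensed sketch of that stack (I merge the train/val steps into `fit` for brevity, so this isn’t a line-for-line match with the article’s gist; SGD and MSE as defaults and the learning rate are assumptions):

```python
import torch
import torch.nn as nn

def loss_batch(model, loss_func, xb, yb, opt=None):
    """Loss for one batch; takes an optimizer step when opt is given."""
    loss = loss_func(model(xb), yb)
    if opt is not None:
        loss.backward()
        opt.step()
        opt.zero_grad()
    return loss.item(), len(xb)

def fit(epochs, model, loss_func, opt, train_dl, val_dl):
    """Loop over epochs, recording size-weighted mean train/val losses."""
    train_losses, val_losses = [], []
    for _ in range(epochs):
        model.train()  # train step
        losses, ns = zip(*[loss_batch(model, loss_func, xb, yb, opt)
                           for xb, yb in train_dl])
        train_losses.append(sum(l * n for l, n in zip(losses, ns)) / sum(ns))
        model.eval()   # val step
        with torch.no_grad():
            losses, ns = zip(*[loss_batch(model, loss_func, xb, yb)
                               for xb, yb in val_dl])
        val_losses.append(sum(l * n for l, n in zip(losses, ns)) / sum(ns))
    return train_losses, val_losses

def run_model(train_dl, val_dl, model, epochs=20, lr=0.05):
    """Attach a default optimizer and loss function, then fit."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    return fit(epochs, model, nn.MSELoss(), opt, train_dl, val_dl)
```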

First let’s try running a Linear Model on our 8-mer sequences.

After collecting the train and val losses, let’s look at them in a quick plot:

Linear model training and validation loss curves. Image by author.

At first glance, not much learning appears to be happening.

Next let’s try the CNN and plot the loss curves.

Loss curves for both CNN and Linear model. Image by author.

It seems clear from the loss curves that the CNN is able to capture a pattern in the data that the Linear model is not! Let’s spot check a few sequences to see what’s going on.

From the above examples, it appears that the Linear model is really under-predicting sequences with a lot of G’s and over-predicting those with many T’s. This is probably because it noticed GCG made sequences have unusually low scores and TAT made sequences have unusually high scores. However, since the Linear model doesn’t have a way to take into account the different context of GCG vs GAG, it just predicts that sequences with G’s should be lower. We know from our scoring scheme that this isn’t the case: it’s not that G’s in general are detrimental, but specifically GCG is.

The CNN is better able to adapt to the differences between 3-mer motifs! It predicts quite well on both the sequences with and without motifs.

An important evaluation step in any machine learning task is to check if your model can make good predictions on the test set, which it never saw during training. Here, we can use a parity plot to visualize the difference between the actual test sequence scores vs the model’s predicted scores.

Comparison of actual vs predicted scores for test set sequences. Image by author.

Parity plots are useful for visualizing how well your model predicts individual sequences: in a perfect model, they would all land on the y=x line, meaning that the model prediction was exactly the sequence’s actual value. But if it is off the y=x line, it means the model is over- or under-predicting.
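A parity-plot helper along these lines (a sketch of mine, not the article’s plotting code; the R² formula is the standard coefficient of determination):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

def parity_plot(actual, predicted, model_name="model"):
    """Scatter actual vs. predicted scores against the y = x line,
    and report the R^2 of the predictions."""
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    ss_res = np.sum((actual - predicted) ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    fig, ax = plt.subplots()
    ax.scatter(actual, predicted, alpha=0.3)
    lo, hi = actual.min(), actual.max()
    ax.plot([lo, hi], [lo, hi], "r--", label="y = x")
    ax.set_xlabel("Actual score")
    ax.set_ylabel(f"Predicted score ({model_name})")
    ax.set_title(f"{model_name}: R^2 = {r2:.3f}")
    ax.legend()
    return fig, r2
```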

In the Linear model, we can see that it can somewhat predict a trend in the test set sequences, but it really gets confused by the clusters of sequences in the high and low areas of the distribution (the ones with a motif).

The CNN, however, predicts scores much closer to the actual values! This is expected, given that our CNN’s architecture uses 3-mer kernels to scan along the sequence for influential motifs.

But the CNN isn’t perfect. We could probably train it longer or adjust the hyperparameters, but the goal here isn’t perfection — this is a very simple task relative to actual regulatory grammars. Instead, I thought it would be interesting to use the Altair visualization library to interactively inspect which sequences the models get wrong:

Notice that the sequences off the diagonal tend to have multiple instances of the motifs! The scoring function only gave a sequence a +/- bump if it contained at least 1 motif occurrence, but it would have been just as reasonable to add the bonus once per occurrence; I arbitrarily chose the former.

In any case, I thought it was cool that the model noticed the multiple occurrences and predicted them to be important. I suppose we did fool it a little, though an R² of 0.95 is pretty respectable 🙂

When training CNN models, it can be useful to visualize the first layer convolutional filters to try to understand more about what the model is learning. With image data, the first layer convolutional filters often learn patterns such as borders or colors or textures — basic image elements that can be recombined to make more complex features.

In DNA, convolutional filters can be thought of like motif scanners. Similar to a position weight matrix for visualizing sequence logos, a convolutional filter is like a matrix showing a particular DNA pattern, but instead of being an exact sequence, it can hold some uncertainty about which nucleotides show up in which part of the pattern. Some positions might be very certain (e.g., there’s always an A in position 2; high information content) while other positions could hold a variety of nucleotides with about equal probability (high entropy; low information content).

The calculations that occur within the hidden layers of neural networks can get very complex and not every convolutional filter will be an obviously relevant pattern, but sometimes patterns in the filters do emerge and can be informative for helping to explain the model’s predictions.

Below are some functions to visualize the first layer convolutional filters, both as a raw heatmap and as a motif logo.

Ok, maybe this is a little helpful, but usually people like to visualize sequences with some uncertainty as motif logos: the x-axis has positions in the motif and the y-axis is the probability of each nucleotide appearing in each position. Often these probabilities are converted into bits (aka information content) for easier visualization.

To convert raw convolutional filters into position weight matrix visuals, it is common to collect filter activations: apply the weights of the filter along a one-hot encoded sequence and measure the filter activation (aka how well the weights match the sequence).

Filter weight matrices that correspond to a close match to a given sequence will activate strongly (yield higher match scores). By collecting the subsequences of DNA that yield the highest activation scores, we can create a position weight matrix of “highly activated sequences” for each filter, and therefore visualize the convolutional filter as a motif logo.
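One way this activation-collection step might be sketched (function name and the top-k strategy are my assumptions; the article’s helpers may differ):

```python
import torch
import torch.nn.functional as F

def filter_pwm(filter_weights, onehot_seqs, top_k=20):
    """Approximate a filter's motif as a position weight matrix built
    from its most strongly activating subsequences.

    filter_weights: (4, k) weights of a single first-layer filter
    onehot_seqs:    (N, 4, L) batch of one-hot encoded sequences
    """
    k = filter_weights.shape[1]
    # slide the filter over every sequence: activations are (N, 1, L-k+1)
    acts = F.conv1d(onehot_seqs, filter_weights.unsqueeze(0))
    n_pos = acts.shape[2]
    flat = acts.flatten()
    top = torch.topk(flat, min(top_k, flat.numel())).indices
    pfm = torch.zeros(4, k)  # position frequency matrix
    for idx in top.tolist():
        seq_i, pos = divmod(idx, n_pos)
        pfm += onehot_seqs[seq_i, :, pos:pos + k]  # add this k-mer's one-hot
    return pfm / pfm.sum(dim=0, keepdim=True)      # normalize columns
```

The normalized matrix can then be handed to a logo-plotting library (e.g. logomaker) for visualization.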

Diagram of how strongly activating subsequences can be collected and converted to a motif logo for a given convolutional filter. Image by author.

From this particular CNN training, we can see a few filters have picked up on the strong TAT and GCG motifs, but other filters have focused on other patterns as well.

There is some debate about how relevant convolutional filter visualizations are for model interpretability. In deep models with multiple convolutional layers, convolutional filters can be recombined in more complex ways inside the hidden layers, so the first layer filters may not be as informative on their own (Koo and Eddy, 2019). Much of the field has since moved towards attention mechanisms and other explainability methods, but should you be curious to visualize your filters as potential motifs, these functions may help get you started!

This tutorial shows some basic PyTorch structure for building CNN models that work with DNA sequences. The practice task used in this demo is not reflective of real biological signals; rather, we designed the scoring method to simulate the presence of regulatory motifs in very short sequences that were easy for us humans to inspect and verify that PyTorch was behaving as expected. From this small example, we observed how a basic CNN with sliding filters was able to predict our scoring scheme better than a basic linear model that only accounted for absolute nucleotide position (without local context).

To read more about CNNs applied to DNA in the wild, check out the following foundational papers:

I hope other new-to-ML folks interested in tackling biological questions may find this helpful for getting started with using PyTorch to model DNA sequences 🙂

FOOTNOTE 1

In this tutorial, the CNN model definition uses a 1D convolutional layer: since DNA is not a 2-dimensional image, Conv1d only needs to slide along the length dimension rather than also scanning up and down. (In fact, sliding a filter “up” and “down” doesn’t make sense for one-hot encoded DNA matrices: separating the A and C rows from the G and T rows is meaningless, since you need all 4 rows to accurately represent a DNA sequence.)

However, I once found myself needing to use an analysis tool built with Keras and found a pytorch2keras conversion script. The conversion script only knew how to handle Conv2d layers and gave errors for models with Conv1d layers 🙁

In case this happens to you, here is an example of how to reformat the CNN definition using a Conv2d while ensuring that it still scans along the DNA as if it were a Conv1d:
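The gist is missing here, so below is a sketch of the idea: make the kernel’s height span all 4 nucleotide rows so it can only slide lengthwise (class name is my placeholder):

```python
import torch
import torch.nn as nn

class DNACNN2D(nn.Module):
    """Conv2d stand-in for the 1D CNN: the kernel's height (4) spans all
    nucleotide rows, so the filter can only slide along the length axis,
    exactly as a Conv1d with the same kernel width would."""
    def __init__(self, seq_len=8, num_filters=32, kernel_size=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, num_filters, kernel_size=(4, kernel_size)),
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(num_filters * (seq_len - kernel_size + 1), 1),
        )

    def forward(self, x):      # x: (batch, 4, seq_len) one-hot
        x = x.unsqueeze(1)     # -> (batch, 1, 4, seq_len), a 1-channel "image"
        return self.net(x)
```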

FOOTNOTE 2

If you’re doing a classification task instead of a regression task, you may want to use CrossEntropyLoss. However, CrossEntropyLoss expects a slightly different format than MSELoss – try this:

loss = loss_func(xb_out, yb.long().squeeze(1))


