
Protein Structure Prediction a Million Times Faster than AlphaFold 2 Using a Protein Language Model

By LucianoSphere, November 2022



Summary of the method and its relevance; trying to run it in Google Colab; and testing it out to see if it really is that good.

Concept art by the author and DALL·E 2 generations.

With a new round of CASP (the competition on protein structure prediction) just a month away from revealing its results, we continue to see papers about using AlphaFold 2 and AlphaFold-inspired neural networks to predict protein structures even better, faster, or with fully open-source programs; to predict protein interfaces with other biological molecules; and even to design new proteins. Progress is happening too fast for me to cover all these developments, so I select now and then some of special relevance and interest. Here’s one that I think deserves special attention: a neural network that predicts protein structures using a language model, with accuracy similar to that of AlphaFold 2 but up to a million times faster and less sensitive to the lack of homologous sequences.

This and other emerging models for protein structure prediction based on language models highlight the role of language models in general, far beyond the initial interest in using them to process natural language, as exemplified today by GPT-3.

Let’s see what this is all about.

Recap of AlphaFold for protein structure prediction

Predicting the three-dimensional (3D) structures of proteins from their amino acid sequences is central to modern biology. Given this, almost 30 years ago a biennial competition on protein structure prediction (CASP, more here and here) was launched; it has fostered developments in the field, accompanied by critical evaluations of the models and analyses of progress, if any.

For years, CASP showed that protein structure prediction only worked for easy proteins, i.e., those for which a similar sequence already has a known structure. That is, only homology modeling worked, and hard targets (those with no homologs of known structure) remained very challenging; predictors were barely able to capture even the overall shape:

This slide (produced and presented by the author during the CASP13 conference) shows how predicting hard targets was very difficult and displayed no or very little progress until CASP12-CASP13.

By CASP12 (when I myself came in as an assessor), one breakthrough started to change things: contacts between pairs of residues could now be computed by applying “residue coevolution detection” methods (formally, direct information analysis and the like) to alignments of multiple sequences related to the sequence one wants to model. The rationale is that pairs of residues that make physical contact in the protein structure must mutate together; therefore, detecting such co-occurring changes in the multiple sequence alignment points at pairs of residues that should be forced together during structure prediction. This worked, and indeed in my analysis I observed that proteins for which larger alignments could be built (because there are more sequences like them in the databases) turned out to be better modeled:

Highest GDT_TS (the main CASP metric used to evaluate models) of the best model produced for each target vs. the log of the number of sequences available, normalized by the length of the modeled sequence, in CASP12. Each point is one target, from the hardest (FM) or mid-difficulty (FM/TBM) sets; the line is a linear regression over the FM cases. Figure produced by the author from freely available CASP data.
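To make the coevolution idea above more concrete, here is a minimal sketch (my own illustration, not any specific published method) that scores pairs of alignment columns by mutual information. Real pipelines go further, using direct-coupling or direct-information analysis to separate direct from indirect correlations, but the intuition is the same: columns that mutate together hint at residues in contact.

```python
# Minimal sketch of the coevolution intuition: score pairs of MSA columns by
# mutual information. Toy example only; real methods (direct-coupling analysis)
# disentangle direct from indirect correlations and handle gaps and weighting.
from collections import Counter
from itertools import combinations
import math

def column(msa, i):
    return [seq[i] for seq in msa]

def mutual_information(col_i, col_j):
    n = len(col_i)
    pi, pj = Counter(col_i), Counter(col_j)
    pij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), c in pij.items():
        p_ab = c / n
        mi += p_ab * math.log(p_ab / ((pi[a] / n) * (pj[b] / n)))
    return mi

def coevolution_scores(msa):
    """Return {(i, j): MI} for all column pairs of an aligned set of sequences."""
    length = len(msa[0])
    return {(i, j): mutual_information(column(msa, i), column(msa, j))
            for i, j in combinations(range(length), 2)}

# Toy alignment: positions 1 and 3 covary (A..T vs G..C), hinting at a contact.
msa = ["MAKTL", "MGKCL", "MAKTL", "MGKCL", "MAKTI"]
scores = coevolution_scores(msa)
print(max(scores, key=scores.get))  # -> (1, 3)
```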

Then in CASP13, some groups refined this idea, leading to a further improvement in the predictions: not only contacts were predicted, but also distances and even relative orientations between residues. DeepMind entered CASP at this point and applied this trick, but they were not the only ones. All groups that used it performed very well, certainly better than in CASP12 and much better than in previous CASP editions. DeepMind pushed the procedure to the maximum, “winning” CASP13 by a small margin over the runners-up.

As an assessor in both CASP12 and CASP13, I could see clearly that the requirement for large sequence alignments was still there in CASP13, with a higher offset (better models at lower numbers of sequences) but still a steep slope:

Same as the previous figure, but for CASP13. Notice the similar slope but the higher intercept on the y axis, which means better models were obtained than in CASP12 for proteins with similar numbers of sequences in their alignments. Figure produced by the author from freely available CASP data.

Ideally, one would want to predict structures from individual sequences, without the need for auxiliary alignments at all. That is, when Neff_max in the plots above is 1, or Neff_max/length is very small, say 0.001 for a 1,000-residue protein (hence log = -3). Why does this matter? Because many proteins simply lack similar sequences in the databases, while others are entirely human-invented (“designed”, as explained here and here); for these proteins, any prediction must rely exclusively on the isolated sequence, without the help of alignments at all.

Today, not even AlphaFold 2, with its totally redesigned protocol and having beaten all other groups by far in CASP14, can produce perfect models from single sequences. This is the current “holy grail” of protein structure prediction, a problem on which the paper discussed here made significant progress by applying a language model to protein structure prediction.

The paper in question is by Chowdhury et al. in Nature Biotechnology, published in early October 2022. As we have seen above, the major challenge right now in protein structure prediction is achieving good to perfect predictions for sequences that lack known homologs, so that no sequence alignments can be built for them. These are called “orphan” or “single” sequences; they are much more common than you might expect (around 20% of all known protein sequences, and over 10% of eukaryotic and viral proteins), and even as sequence databases keep growing, new orphan sequences keep showing up. It is therefore important to develop methods to predict their structures.

Another limitation of current methods, including AlphaFold 2, less critical but still desirable to overcome, is speed. AlphaFold 2 takes some minutes per prediction, which is reasonable for low-throughput needs but too slow for processing full proteomes or exploring sequence space for protein design.

The new method presented by Chowdhury et al. from the AlQuraishi group, called RGN2, builds on their previous work predicting protein structures from position-specific scoring matrices derived from alignments. In the previous model, structure was parameterized as torsion angles between adjacent residues to sequentially place the protein backbone in 3D space; all components were differentiable so the network could be optimized from end to end to minimize prediction error (RMSD of predicted models compared to actual structures used for training). Although this previous method does not explicitly use coevolution data, it does require an alignment to compute the position-specific scoring matrices, so it is sensitive to the availability of sequences.
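To make the torsion-angle parameterization more concrete, here is a minimal sketch (my own illustration, not the actual RGN code) of how a Cα-only chain can be extended atom by atom from internal coordinates, using the standard "NeRF" construction; a differentiable version of this kind of geometry is what allows such networks to be trained end to end against the final coordinates.

```python
# Sketch of sequentially placing a Cα trace in 3D from internal coordinates
# (pseudo bond angles and pseudo torsions), the parameterization described above.
import numpy as np

def extend(a, b, c, length, theta, phi):
    """Place atom d given the three previous atoms and internal coordinates:
    |c-d| = length, angle(b,c,d) = theta, dihedral(a,b,c,d) = phi (radians).
    Standard NeRF construction."""
    bc = c - b
    bc /= np.linalg.norm(bc)
    n = np.cross(b - a, bc)
    n /= np.linalg.norm(n)
    m = np.cross(n, bc)
    d_local = np.array([-length * np.cos(theta),
                        length * np.sin(theta) * np.cos(phi),
                        length * np.sin(theta) * np.sin(phi)])
    return c + d_local[0] * bc + d_local[1] * m + d_local[2] * n

def ca_trace(thetas, phis, length=3.8):
    """Build a Cα-only trace from N-2 pseudo bond angles and N-3 pseudo
    torsions, with a fixed ~3.8 Å virtual Cα-Cα bond."""
    coords = [np.zeros(3), np.array([length, 0.0, 0.0])]
    # third atom placed in the xy-plane using the first pseudo bond angle
    coords.append(coords[1] + length * np.array([-np.cos(thetas[0]),
                                                  np.sin(thetas[0]), 0.0]))
    for theta, phi in zip(thetas[1:], phis):
        coords.append(extend(coords[-3], coords[-2], coords[-1],
                             length, theta, phi))
    return np.stack(coords)
```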

The new RGN2 predicts protein structures from single protein sequences, without any need for alignments, using a protein language model trained on protein sequences. Just as language models attempt to extract information from a sequence of words, the protein language model used here attempts to capture the latent information in a string of amino acids, which in turn specifies the protein’s structure. The information that the earlier method obtained from alignment-derived position-specific scoring matrices is, in RGN2, replaced by information computed by the language model directly from the input sequence.

The “protein language” model used by RGN2, called AminoBERT, is inspired by earlier work by Rives et al., who developed a deep contextual language model trained through unsupervised learning on protein sequences. That model learned fundamental properties of proteins and is thus able to perform remote homology detection and to predict secondary structure and long-range inter-residue contacts, among other things: exactly the kind of information one needs to produce protein structures.
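To get a feel for what such a model provides, here is a minimal sketch using the openly released ESM-1b model from Rives et al. rather than AminoBERT itself; it assumes the fair-esm package (pip install fair-esm) and follows the usage from its README. The per-residue embeddings it returns are the kind of single-sequence features that replace alignment-derived profiles.

```python
# Sketch: per-residue embeddings from a protein language model (ESM-1b from
# Rives et al., not AminoBERT), to illustrate the features used in place of
# alignment-derived profiles. Assumes: pip install fair-esm torch
import torch
import esm

model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()
model.eval()
batch_converter = alphabet.get_batch_converter()

# Hypothetical single ("orphan") sequence; no alignment needed.
data = [("my_orphan_protein",
         "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQ")]
labels, strs, tokens = batch_converter(data)

with torch.no_grad():
    out = model(tokens, repr_layers=[33], return_contacts=False)

# One embedding vector per residue (dropping the BOS/EOS tokens at the ends)
per_residue = out["representations"][33][0, 1:len(data[0][1]) + 1]
print(per_residue.shape)  # (sequence length, 1280)
```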

By coupling the protein language model (trained on sequences) to a network module for structure generation that obeys the usual requirements of rotational and translational invariance, RGN2 produces Cα traces with backbone geometries that display realistic protein structure features; moreover, as the authors claim, they actually seem to predict protein folds in space properly. The full RGN2 is on average not as good as AlphaFold 2 when alignments can be created, but presumably it is better for proteins for which no alignments can be produced. Even in the former case, RGN2 is up to a million times faster than AlphaFold 2.

Note, if you missed it in the previous paragraph, that RGN2’s predictions consist only of Cα traces, not full-atom models. In the pipeline developed by the authors, the remaining backbone and side-chain atoms are first built on top of this trace with ModRefiner, and the resulting structure is used as a template for AlphaFold 2 run without alignments, which then provides the final model.

Trying out this tool is very straightforward, because the authors created a Colab notebook for it, just as DeepMind did to open AlphaFold 2 to the public, and as other scientists have been doing since.

The notebook is here:

And I’ve just tried it out.

My first observation is that loading the model takes quite some time. I suppose that in a real application to multiple proteins, you would load the model once and then run each prediction in seconds.

Second, the output is not graphical at all, so don’t expect an app like ColabFold, which was developed to run AlphaFold 2 both easily and efficiently (a quick workaround for inspecting the output is sketched after these observations).

Third, remember that the main output is only a Cα trace, after which regular AlphaFold 2 runs (just without alignments) to produce the final model.
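Since the notebook’s output is not graphical, a few lines of py3Dmol are enough to inspect whatever PDB file it produces directly in Colab. This is my own quick workaround, and the file name below is a placeholder for the notebook’s actual output file.

```python
# Quick inspection of a predicted structure inside Colab; the RGN2 notebook
# itself does not render anything. "prediction.pdb" is a placeholder name.
# !pip install py3Dmol
import py3Dmol

view = py3Dmol.view(width=600, height=450)
view.addModel(open("prediction.pdb").read(), "pdb")
view.setStyle({"cartoon": {"color": "spectrum"}})
# A Cα-only trace may not draw well as cartoon; spheres are a safe fallback:
# view.setStyle({"sphere": {"radius": 0.5}})
view.zoomTo()
view.show()
```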

So, how well did it work?

I tried two relatively easy proteins (“easy” in that there are structures in the PDB and several related sequences in databases), and the predictions were quite far off from the known structures. In both cases the global shape of the protein was captured correctly, but the RMSD to the actual structures is, in my opinion, still too high for the coolest applications you could imagine.

Here’s one of the examples, where the RMSD after superposition of the produced Cα trace (third from the left) onto the Cα trace of the actual structure is almost 14 Å, and almost 18 Å for the final AlphaFold 2 model guided by the Cα trace prediction; both quite a lot!

From left to right: expected structure shown as cartoons, expected structure as Cα atoms only, Cα trace model obtained from RGN2, and final model obtained with AlphaFold 2 using the RGN2 trace as a template. Figure by the author using a free version of PyMOL.
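For anyone who wants to reproduce this kind of comparison, the numbers quoted above are RMSDs computed after optimal superposition. Here is a minimal sketch of the standard Kabsch algorithm in NumPy (any structural biology package offers an equivalent), assuming the two Cα traces are already matched residue by residue.

```python
# Kabsch superposition and RMSD between two matched (N, 3) Cα coordinate arrays.
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD after optimal rigid-body superposition of P onto Q."""
    P = P - P.mean(axis=0)                      # center both sets
    Q = Q - Q.mean(axis=0)
    U, S, Vt = np.linalg.svd(P.T @ Q)           # covariance SVD
    d = np.sign(np.linalg.det(Vt.T @ U.T))      # guard against reflections
    D = np.diag([1.0, 1.0, d])
    R = Vt.T @ D @ U.T                          # rotation taking P onto Q
    diff = (P @ R.T) - Q
    return np.sqrt((diff ** 2).sum() / len(P))

# Usage: print(kabsch_rmsd(predicted_ca, experimental_ca))
```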

Summarizing from this and some other examples, my verdict is that it’s not there yet. All the proteins I tried are in the PDB, and related sequences are available, so something about them must have been seen during training of the protein language model. Yet the predicted structures are somewhat far from the actual structures, compared with what AlphaFold 2 (and its confidence estimates, which are lacking here) got us used to.

However, the model obviously has potential, and it certainly is very interesting (i) as a new technology, in particular showing that language models do a reasonable job of replacing alignments for extracting structural information; and (ii) regarding the main practical aim it advances: rapid predictions of reasonable quality.

More globally, the successful application of language models to handling protein sequences (already present in a BERT-like module inside AlphaFold 2 that handles sequence alignments, and used in a few other ML models for chemistry and biology) and to protein structure prediction (in particular, check the next note and my upcoming articles) highlights the role of language models in general, far beyond the initial interest in using them to process natural language, as exemplified today by GPT-3.

A team working on protein structure prediction with language models at Meta has just released a preprint showing that their model can infer structure from primary sequences, at high speed, using a large language model. Just like AlphaFold 2 and RGN2, the model is available to run in Google Colab notebooks; moreover, Meta provides an API endpoint that makes running it even easier, through simple API calls. My preliminary tests suggest this model might work better than RGN2, and indeed the preprint claims predictions up to 60x faster than the state of the art (AlphaFold 2) while maintaining resolution and accuracy; my own addition is that it may also be less sensitive to shallow alignments. I hope to bring news on Meta’s take on the problem and approach in an article soon. The field is evolving extremely fast, and the dream of biologists might become real sooner than expected.
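For completeness, here is a minimal sketch of what calling such an endpoint from Python could look like. The URL and the POST-the-sequence, get-a-PDB-back behavior are my assumptions based on Meta’s announcement at the time of writing, so check their documentation before relying on it.

```python
# Sketch of calling Meta's ESMFold API. The endpoint URL and behavior are
# assumptions based on the announcement at the time of writing and may change.
import requests

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQ"
resp = requests.post("https://api.esmatlas.com/foldSequence/v1/pdb/",
                     data=sequence)
resp.raise_for_status()

with open("esmfold_prediction.pdb", "w") as fh:
    fh.write(resp.text)  # the response body is expected to be a PDB file
```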

The peer-reviewed paper in Nature Biotechnology:

A comment about the paper, in the same journal:

To know more about AI-powered protein structure prediction, CASP, etc.:

