AlphaFold2 Year 1: Did It Change the World? | by Salvatore Raieli | Sep, 2022


DeepMind promised us a revolution. Did it happen?

image by Greg Rakozy at unsplash.com

A year ago, AlphaFold2 was published in Nature. Its predecessor, AlphaFold, was presented during CASP13 (a competition where the goal is to predict the structure of proteins whose structure has been obtained but not yet published), where it outperformed the other competitors without causing much of a stir. During the subsequent CASP14 competition, AlphaFold2 not only defeated all other competitors, but its predicted structures were also close to those obtained experimentally.

In the following days, some researchers claimed that AlphaFold2 meant the problem of protein structure prediction could finally be considered solved. Others claimed that this was a scientific revolution that would open up incredible new perspectives.

This article asks: one year after AlphaFold2, what has happened? What are the recent developments? In what applications has AlphaFold2 been used?

Index of the article
- The protein: a self-assembling origami
- Does AlphaFold2 have its Achilles’ heel?
- Back in the ring: will AlphaFold2 keep the belt?
- No protein left behind
- How men used the fire that Prometheus gave them
- What course does the future take?
photo by Joshua Dixon at unsplash.com

In short, the question of how a protein's sequence gives rise to its structure has been central to twentieth-century biology. Proteins are the engine of life: while the information is stored in DNA, it is the proteins that perform the functions within every living thing. Their sequence determines their structure, and their structure determines their function.

Biology is the most powerful technology ever created. DNA is software, protein are hardware, cells are factories. — Arvind Gupta

Knowing the structure of a protein provides a better understanding of how it functions in both health and disease. The structure is also critical for designing molecules that can interact with proteins (practically all drugs have been designed knowing the structure of the protein). Moreover, predicting structure and then function opens up almost endless possibilities (from nanotechnology to medicine, agriculture to pharmaceuticals, and so on).

Predicting structure, however, is difficult for a variety of reasons:

  • there is a huge number of potential structures for a sequence, but only one is the right one
  • the biophysical rules of how a protein is assembled are unknown
  • testing or simulating all potential combinations is too computationally expensive
  • there are both local interactions between the various chemical groups and long-range interactions
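
To get a feel for the first and third points, here is a back-of-the-envelope sketch in Python. The numbers are illustrative assumptions, not measurements: three possible conformations per residue, and a hypothetical machine that checks 10^13 conformations per second.

```python
# Back-of-the-envelope estimate of the size of the conformational search space.
# All parameters are illustrative assumptions, not values from the article.

STATES_PER_RESIDUE = 3          # assumed number of conformations per residue
CHECKS_PER_SECOND = 10**13      # assumed speed of a hypothetical brute-force search
SECONDS_PER_YEAR = 60 * 60 * 24 * 365


def conformation_count(n_residues: int, states: int = STATES_PER_RESIDUE) -> int:
    """Number of conformations if every residue independently picks one state."""
    return states ** n_residues


def brute_force_years(n_residues: int) -> float:
    """Years needed to enumerate every conformation at the assumed speed."""
    return conformation_count(n_residues) / (CHECKS_PER_SECOND * SECONDS_PER_YEAR)


if __name__ == "__main__":
    for n in (10, 50, 100):
        print(f"{n:>3} residues: {conformation_count(n):.2e} conformations, "
              f"{brute_force_years(n):.2e} years")
```

Even with these generous assumptions, a protein of only 100 residues is far out of reach for exhaustive enumeration, which is why prediction methods have to rely on something smarter than brute force.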

In a previous article, I explained why this is difficult and why knowing the structure is important, and gave a more detailed introduction to AlphaFold2, for further study:

After one year of experimenting, some limitations were identified. Image by Aron Visuals at Unsplash.com

AlphaFold2 has certainly achieved outstanding results in a challenge as complex as protein structure prediction. Its impact can already be seen from the number of citations it has obtained. PubMed lists a total of 1935 citations (380 in 2021 and 1639 in 2022), while according to Google Scholar there are more than 5000:

Google Scholar screenshot taken by the author.

This year, some researchers have also addressed the limitations of AlphaFold2. Before discussing the new applications and recent developments, it is worth looking at the limitations of DeepMind’s model.

At present, AlphaFold2 cannot predict interactions with other proteins. This is an important aspect, since almost all functions performed by proteins are carried out together with several other partners. It is also a complex problem, because proteins can form homodimers (i.e., interact with copies of themselves) or heterodimers (with other, different proteins). These interactions can alter the structure of the protein itself, changing its conformation.

The authors of AlphaFold2 themselves published a follow-up paper in which they presented AlphaFold-Multimer, a model specifically designed to model protein interaction and assembly. Although the results are promising, there is still room for improvement.

“Structure examples predicted with the AlphaFold-Multimer. Visualized are the ground truth structures (green) and predicted structures (colored by chain).” image source: original paper

In addition, at present AlphaFold2 does not predict other important components of protein structure: metal ions, cofactors, and other ligands. Moreover, the structure of a protein is not determined by the amino acid sequence alone. Proteins can undergo modifications that regulate their function and half-life (glycosylation, phosphorylation, sumoylation, ubiquitination, and so on). These post-translational modifications are important and are often altered in different diseases.

For example, AlphaFold2 correctly predicts the structure of hemoglobin. However, it predicts the structure without the heme cofactor, even though hemoglobin under physiological conditions is always bound to heme. In another case, the kinesin CENP-E, AlphaFold2 correctly predicts the so-called molecular motor but not the coiled-coil region. More generally, AlphaFold2 has difficulties with intrinsically disordered regions of proteins.

Limitations of AlphaFold2: structures of the proteins hemoglobin and CENP-E with the corresponding AlphaFold2 predictions. Image source: here, license: here

Moreover, the side chains of amino acids are not always precisely positioned. This information is important to determine, for example, the active site of the protein and how to design molecules that can regulate it.

Another problem is that proteins can have different conformations (they are not static but dynamic entities), and AlphaFold2 returns only one of them. In disease conditions, protein sequences may carry mutations that alter the protein structure, and AlphaFold2 is currently unable to predict this.

An important example is the human potassium voltage-gated channel subfamily H member 2 (hERG) protein that can exist in three conformations (open, closed, inactive) during the heartbeat. Mutations of this channel or interactions with particular drugs lead to long QT syndrome, so it is important to predict the three different conformational structures.

As noted by MIT researchers, AlphaFold2 is currently useful in only one step of drug discovery: modeling the structure of the protein. The model does not allow for modeling how a drug physically interacts with the protein. The researchers attempted to simulate how bacterial proteins interact with antibiotics (molecules that bind more tightly to the protein are potentially better antibiotics), but AlphaFold2 was not very effective.

“Utilizing these standard molecular docking simulations, we obtained an auROC value of roughly 0.5, which basically says you’re doing no better than if you were randomly guessing,” said J. J. Collins, an MIT researcher, about using AlphaFold2 for docking antibiotics to proteins (source: here).
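
To make the 0.5 figure concrete: the auROC can be read as the probability that a randomly chosen true binder receives a better score than a randomly chosen non-binder. Below is a minimal Python sketch of that definition. The docking scores are invented for illustration (with higher assumed to mean tighter predicted binding), and the function name is my own, not taken from the MIT study.

```python
# Minimal auROC computation via pairwise comparison (equivalent to the
# Mann-Whitney U statistic). The scores below are invented for illustration.
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random positive outscores a random negative.

    labels: 1 for a true binder, 0 for a non-binder. Ties count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    labels = [1, 1, 1, 0, 0, 0]
    informative = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]    # binders always score higher
    uninformative = [0.5, 0.5, 0.5, 0.5, 0.5, 0.5]  # score carries no information
    print(auroc(informative, labels))    # 1.0
    print(auroc(uninformative, labels))  # 0.5
```

A score that separates binders from non-binders perfectly gives 1.0, while a score that carries no information gives 0.5, which is the situation described in the quote.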

However, it is critical, as always, that users take the limitations of the method into account. If structure predictions are used and interpreted naively, it can lead to erroneous hypotheses or blatantly wrong mechanistic models. — The joys and perils of AlphaFold, EMBO reports

AlphaFold2 certainly represented a breakthrough in the prediction of protein structures. However, it shows several limitations and, as the MIT researchers noted, these must be taken into account both when forming hypotheses and when using the model in some applications.

Whatever the purpose of a model, what it succeeds in achieving is defined by the nature of the data. AlphaFold2 was trained using data in the PDB (Protein Data Bank, a repository of experimentally determined protein structures), and thus predicts a protein structure as if it were another PDB entry. The problem is that many proteins exist under physiological conditions only bound to cofactors, to other proteins, or in large complexes, but one cannot always crystallize proteins under physiological conditions, with cofactors or with other proteins.

Moreover, for more than 40% of all UniProt families, no protein has been crystallized; the figure is still more than 20% when looking at superfamilies (a broader class).

Furthermore, the publication associated with an experimentally determined structure (for example, one obtained by crystallography) explains how and why the structure was obtained, along with the possible limitations of the model, whereas AlphaFold2 does not consider this context in its predictions. So, as outstanding a tool as AlphaFold2 is, one must still take its limitations into account.

image by Nathan Dumlao at Unsplash.com

AlphaFold2 showed that it is possible to predict the structure of a single protein with good accuracy using deep learning. Given the recent results of AlphaFold2, CASP decided to increase the complexity of the challenge. As mentioned before, CASP is a challenge dedicated to protein structure prediction algorithms, now in its 15th edition (the first, CASP1, took place in 1994, and the last two, CASP13 and CASP14, were won by DeepMind).

CASP15 will therefore focus on several categories, including:

  • Single Protein and Domain Modeling. This edition will again include several single protein domains, but with an emphasis on fine-grained accuracy (local main chain motifs and side chains).
  • Assembly. This category considers protein interactions (interactions between different protein domains, between various subunits, etc…)
  • RNA structures and complexes. RNA molecules can also assemble into 3D structures, and there is much less knowledge on the subject.
  • Protein-ligand complexes. Proteins can have physiological ligands such as various metabolites (or even other proteins), and therapeutic molecules can also be regarded as ligands. This is a crucial step in drug discovery: most drugs on the market are in fact small molecules that regulate their targets (proteins) by binding to them. CASP15 is also intended as a challenge to push research in drug design.
  • Protein conformational ensembles. As mentioned, proteins should not be viewed as static entities, since they can adopt several conformations. These changes can occur under specific conditions such as binding to a particular substrate, phosphorylation, or binding to a drug. Obtaining data on these conformations is difficult even under experimental conditions (they are usually obtained by cryo-EM and NMR).

Note that while previous editions focused on a static view of proteins, here the emphasis is on a dynamic view (behavior in solution, interactions with other protein partners or with other biological macromolecules). Thus, in light of the limitations we have seen above, these categories pose challenges for AlphaFold2.

Submissions for CASP15 closed on August 19, and the results will be announced in December at the conference in Antalya. We will thus soon know whether DeepMind will retain the championship belt. In the meantime, for those who like spoilers, CAMEO (a community project sustained by different institutions) continuously monitors the quality of predictions from different sources (servers hosting the predictions of different competitors). CAMEO offers different scores for the predictions, and you can also select the time interval.

For the same targets, CAMEO also includes predictions from AlphaFold2 itself (the version released last year). As can be seen in the graph, the naive version of AlphaFold2 does not fare badly (the closer to the top-right, the better).

Screenshot from CAMEO website by the author.
Trends for the keyword “AlphaFold” on Google Trends, screenshot by the author

Just twelve months later, AlphaFold has been accessed by more than half a million researchers and used to accelerate progress on important real-world problems ranging from plastic pollution to antibiotic resistance — DeepMind press release

One year later, AlphaFold2 is still present in Google searches, and there have been recent developments that we will discuss here.

Immediately after the article was published, DeepMind made the source code available on GitHub and also provided a Google Colab notebook where it is possible to enter the sequence of a protein and get an image of the predicted structure.

For example, simply visit the UniProt site to retrieve the sequence of a protein, paste it into the Google Colab cell, and run it to get the prediction. Here is the sequence of human hemoglobin (beta subunit):

MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPK
VKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFG
KEFTPPVQAAYQKVVAGVANALAHKYH
Example of a prediction by AlphaFold2, obtained by the author using the provided Google Colab; color represents the confidence of the model (you can also use the GitHub source code, but the download is 2.2 TB and you need a GPU to run it)
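
Before pasting a sequence into the notebook, it can help to check that it is clean. The following is a minimal Python sketch, not part of the AlphaFold2 code: it parses a FASTA record and verifies that the sequence contains only the 20 standard amino acid letters. The function names are my own, and the header line in the example is a placeholder rather than the real UniProt header.

```python
# Parse a single FASTA record and sanity-check it before submitting it for
# structure prediction. The header text below is a placeholder, not the real
# UniProt header.
from typing import Set, Tuple

STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def parse_fasta(text: str) -> Tuple[str, str]:
    """Return (header, sequence) for the first record in a FASTA string."""
    header = ""
    chunks = []
    for line in text.strip().splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header or chunks:
                break  # a second record starts here; keep only the first
            header = line[1:]
        else:
            chunks.append(line.upper())
    return header, "".join(chunks)


def invalid_residues(sequence: str) -> Set[str]:
    """Letters in the sequence that are not standard amino acid codes."""
    return set(sequence) - STANDARD_AMINO_ACIDS


HEMOGLOBIN_FASTA = """>placeholder_header human hemoglobin sequence from the text
MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPK
VKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFG
KEFTPPVQAAYQKVVAGVANALAHKYH
"""

if __name__ == "__main__":
    header, sequence = parse_fasta(HEMOGLOBIN_FASTA)
    print(header)
    print(len(sequence), "residues")
    print("unexpected letters:", invalid_residues(sequence) or "none")
```

The joined, single-line sequence is what goes into the notebook's input cell.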

As mentioned before, the same group has presented a specific version of AlphaFold2 for multimeric complexes and the source code has also been published (GitHub, Google Colab).

In addition, DeepMind, in collaboration with EMBL’s European Bioinformatics Institute (EMBL-EBI), initially released the structures of about 350,000 proteins from about 20 species (including roughly 20,000 human proteins).

Recently, they released the structures of almost all proteins cataloged in the databases (they ran AlphaFold2 on the protein sequences that had been collected). DeepMind built a database where all these predicted structures can be accessed (AlphaFold DB) or downloaded in bulk here. The site now contains about 200 million structures (up from about 1 million before this update).

Recently, DeepMind announced they have won the 2023 Breakthrough Prize in Life Sciences for the development of AlphaFold.

AlphaFold1 had already proven useful to scientists. This first version was not capable of accurately modeling amino acid side chains, but scientists found that the model did a very good job with the backbone. For example, AlphaFold1 was used in this scientific paper to help determine the structure of a large glutamate dehydrogenase enzyme (the authors used the model to guide their study of the structure of this protein).

structure of the large glutamate dehydrogenase enzyme. image source (original article)

This illustrates how AlphaFold can be integrated with experimental work. One of the fastest-growing techniques for the experimental determination of protein structures is cryo-electron microscopy. This technique can be used on proteins that fail to crystallize (spoiler: many important ones) and allows us to observe proteins in more physiological situations (conformations, interactions, etc…). On the other hand, it produces images with a low signal-to-noise ratio and often does not reach atomic resolution. In these cases, AlphaFold2 can help predict the structure and facilitate experimental validation.

Science recently devoted a special issue to the resolution of the nuclear pore structure. The nuclear pore is a complex protein structure that regulates the passage of biological macromolecules between the nucleus and the cytoplasm. Nuclear pore complexes (NPCs) are composed of more than 1,000 protein subunits and are involved in several cellular processes (in addition to regulating transport), which makes them crucial in several diseases and in interactions with pathogens (for example, they are the target of several viruses that need to enter the nucleus).

However, obtaining a high-resolution 3D image of the more than thirty different NPC proteins has proven exceedingly difficult. Researchers have now used AlphaFold to help construct the model, and three articles have been published presenting the structure of the NPC at near-atomic resolution.

Scaffold architecture of the human NPC. Image source: here

In 2004, researchers showed that mutations in the PINK1 gene lead to the early development of Parkinson’s disease. Obtaining the structure of human PINK1 was difficult (the protein is unstable), although the structure of the insect PINK1 protein had been solved. The researchers then combined cryo-EM and AlphaFold2 to obtain the structure of human PINK1. As the researchers said, this opens up possibilities for studying molecules that can be used in the treatment of Parkinson’s disease:

We can start to think about, ‘What kind of drugs do we have to develop to fix the protein, rather than just deal with the fact that it’s broken’ — David Komander (source)

As highlighted by professor Beltrao, we can use the structure for many research purposes. From paleontology to drug discovery, wherever living things are discussed, protein structure can be useful.

We can now trace back the evolution of proteins for longer periods of the evolutionary timescale — Pedro Beltrao (here)

AlphaFold2 has opened up possibilities beyond drug discovery: proteins are also used in various industrial processes (enzymes in detergents, the food industry, etc…). In addition, the study of proteins can be useful for environmentally important challenges: for example, AlphaFold2 has been used to study the structure of vitellogenin (a protein important for immunity in honey bees). Protein structure can also be useful in designing proteins that degrade plastics or other pollutants.

A thankful honey bee, happy that we are trying to save it from the extinction we are causing. Image source: Dmitry Grigoriev from Unsplash.com

We discussed above how AlphaFold2 can be used in different applications. In addition, the fact that the model is open source means that it can be used and incorporated into other models. For example, that is what Meta has done with ESM-IF1, a model whose goal is to recover the sequence from the structure.

Another recent application, Making-it-rain, aims to describe the dynamics of a protein. In a dynamical system, applying or changing conditions makes the atoms of a molecule move along a trajectory that can be followed. Making-it-rain simulates the behavior of the atoms of a protein in water or when other molecules are present (nucleic acids, small molecules). As described in the article, Making-it-rain also incorporates AlphaFold2 in addition to several other models. The code can be found both on GitHub and in several Google Colabs (the links are in the repository).

MD simulation run in Google Colab’s “Making-it-rain” notebook
Image by Torsten Dederichs at Unsplash.com

EMBL in a report lists some of the possible applications of AlphaFold2, briefly:

  • Accelerating structure studies. AlphaFold2 will be able to help when a protein is difficult to crystallize and help resolve structures obtained by low-resolution experimental methods when the entire protein sequence is not crystallizable.
  • Filling in the components of protein complexes. AlphaFold2 can help with the study of large protein complexes, interaction with other biological molecules (RNA, DNA), or even help in suggesting hypotheses about the function of a protein.
  • Generating hypotheses for analysis of protein dynamics. The structure is the first step in studying the dynamics of a protein.
  • Predicting RNA structure. We can learn from protein structure prediction and work on this new frontier.
  • Predicting the evolution of a protein or the effect of sequence mutations.

We have seen different applications and the possibility of integrating AlphaFold2 into different models. Other groups have been concerned with democratizing access to it. ColabFold is a project that aims to make AlphaFold2 and other structure prediction algorithms, such as RoseTTAFold, easier to use and more accessible.

In addition, other groups have begun working to overcome the limitations of AlphaFold2, for example by predicting alternative conformational states of transporters and receptors. The researchers behind this article have shown that it is possible and have also released the code (GitHub).

Predicting different conformations of a molecule. source: here

Even more challenging is the estimation of how strong a ligand might bind to a certain pocket (scoring problem). This is the holy grail of drug discovery and multiple methods exist to describe protein-ligand binding with varying accuracy. — J. Chem. Inf. Model. 2022, 62, 3142−3156

Certainly, the pharmaceutical sector will benefit the most from the release of AlphaFold2. Protein structure is the first step in designing a molecule. In fact, most drugs on the market are “small molecules”: compounds that insert into a “pocket” of the protein and block or activate it. AlphaFold2 can help researchers identify new pockets and thereby enable the design of new small molecules.

Another frontier opening up is the design of proteins from scratch. Last June, South Korea approved a vaccine based on a protein that was designed from scratch. It took 10 years of intensive work to arrive at this result.

The advent of AlphaFold2 has brought new hype to the field. Several companies have decided to focus on the use of AI in protein design. After all, designing proteins could be useful in various fields such as medicine, waste cleaning, nanotechnology, and so on.

For example, some of the proposed methods, such as varying existing structures or assembling local structures, are much easier with AlphaFold2. There are also more unusual methods such as “protein hallucination”, where you start from random amino acids while predicting the structure (although this method seems inefficient, it was tested on 100 peptides and one-fifth of them adopted the predicted shape).

The field is moving rapidly, as described in this review. In addition, several tools and algorithms are being developed to help design proteins (ProteinMPNN is one example: it acts as a kind of ‘spellcheck’ when designing a protein). Another interesting approach is to use a language model such as GPT-3: instead of generating text, it is used to generate protein sequences.

Examples of how to leverage deep learning for protein design. image source: here
Image by Mantas Hesthaven at Unsplash.com

The results obtained by AlphaFold2 are impressive, so much so that the CASP consortium eliminated some categories because they were considered too simple for the available algorithms.
However, as we have seen, AlphaFold2 has several limitations (different conformations, interactions with other biological molecules, ligands, etc…). In part, these problems derive from the incompleteness of experimental databases, since an algorithm relies on the data it has seen during training to make predictions.

Despite its limitations, AlphaFold2 has been used for different purposes and in different publications. It is still a starting point; after all, structure prediction is only one of the steps in drug discovery. In any case, it is premature to start drug development using only the predictions of an algorithm, without experimental evidence.

In addition, the more structures are added, the more these prediction models will improve. AlphaFold2 uses alignment with similar sequences, but upcoming models will be even more sophisticated by including other information, physics and biology principles, etc.

The more intricate questions about conformational states, dynamics, and transient interactions will still need attention and experiments to answer. It is thus important that funders — and peer-reviewers — do not come to believe that “the folding problem has been solved”. — EMBL report

That said, structural biologists will not go away; rather, they will take advantage of the algorithm, which will allow them to proceed more quickly. For years, obtaining the structure of all proteins was a dream: too expensive to achieve by experimental methods alone. On the other hand, many protein families are still not covered, gaps need to be filled to help models learn, and many aspects of the folding process remain unclear.

As we have seen, obtaining the structure is the first step; in the future, we will use this information to design new proteins, obtain their sequences, and produce them for various applications.

To date, more than 500,000 researchers from 190 countries have accessed the AlphaFold DB to view over 2 million structures. — DeepMind press release

Researchers have leveraged AlphaFold2 for several research projects and for creating new tools, and many new applications are on the horizon. One year is still too short to see all the new possibilities opened up by AlphaFold2.

In drug discovery, however, we will only see the results of AlphaFold2 and these new approaches in the next few years. Unfortunately, before a new therapy can reach the market, it must pass several steps in both preclinical and clinical trials. From conception to arrival in the clinic, it can take up to ten years and billions of dollars (not to mention that many clinical trials fail). Artificial intelligence will be crucial to shortening both time and expense, but that is another story (perhaps to be told in another article).



DeepMind promised us a revolution. Did it happen?

image by Greg Rakozy at unsplash.com

A year ago, AlphaFold2 was published in Nature. AlphaFold was presented during CASP13 (a competition where the goal is to predict the structure of proteins whose structure has been obtained but not yet published) where it outperformed the other competitors but without causing a stir. During the subsequent CASP14 competition, AlphaFold2 defeated not only all other competitors but the results predicted by the algorithm were similar to those obtained experimentally.

In the following days, the researchers claimed that AlphaFold2 meant that the problem of protein structure prediction could be called finally over. Others claimed that this was a scientific revolution and would open up incredible new perspectives.

This article asks this question: one year after AlphaFold2, what happened? What have been the recent developments? In what applications has AlphaFold2 been used?

Index of the article
- The protein: a self-assembling origami
- Does AlphaFold2 have its Achille’s heel?
- Again on the ring: AlphaFold2 will keep the belt?
- No protein left behind
- How men used the fire that Prometheus gave them
- What course does the future take?
photo by Joshua Dixon at unsplash.com

In short, the problem of how from a sequence we get the structure of the protein has been central to twentieth-century biology. Proteins are the engine of life; if the information is stored in DNA, it is the proteins that perform all the functions within every living thing. Their sequence decides their structure, and structure commands their function.

Biology is the most powerful technology ever created. DNA is software, protein are hardware, cells are factories. — Arvind Gupta

Knowing the structure of a protein provides a better understanding of how it functions under both health and disease conditions. The structure is also critical for designing molecules that can interact with proteins (practically all drugs have been designed knowing the structure of the protein). However, predicting structure and then function opens up almost endless possibilities (from nanotechnology to medicine, agriculture to pharmaceuticals, and so on).

Predicting structure, however, is difficult for a variety of reasons:

  • there are a huge number of potential structures for a sequence but only one is the right one
  • the biophysical rules of how a protein is assembled are unknown
  • testing or simulating all potential combinations is too computationally expensive
  • there are both local interactions between the various chemical groups but also remote interactions

In a previous article, I told why this is difficult and the importance of knowing the structure and more detailed introduction of AlphaFold2, for further study:

After one here of experimenting some limitations were identified. Image by Aron Visuals at Unsplash.com

AlphaFold2 has certainly achieved outstanding results in such a complex challenge as protein prediction. The impact of AlphaFold2 could already be seen from the number of citations it obtained. Pubmed marks as the total number of citations 1935(380 in 2021 and 1639 in 2022). While according to Google Scholar it is even more than 5000:

Google Scholar screenshot took by the author.

This year, some researchers have also addressed the limitations of AlphaFold2. it is interesting before discussing the new applications and recent developments what are the limitations of the DeepMind’s model.

At present, AlphaFold2 cannot predict interactions with other proteins. In fact, this is an important aspect since almost all functions performed by proteins are carried out together with several other partners. This is a complex problem because proteins can interact in homodimers (i.e., with themselves) or in heterodimers (with other different proteins). These interactions can alter the structure of the protein itself, which can change the configuration.

The authors of AlphaFold2 themselves published a follow-up paper where they presented AlphaFold-Multimer a model specifically designed to model protein interaction and assembly. Although the results are promising there is still room for improvement.

“Structure examples predicted with the AlphaFold-Multimer. Visualized are the ground truth structures (green) and predicted structures (colored by chain).” image source: original paper

In addition, at present AlphaFold2 does not predict other important aspects of protein structure: metal ions, cofactors, and other ligands. In addition, the structure of the protein is not determined by the amino acid structure alone. Protein can undergo modifications that regulate its function and half-life (glycosylation, phosphorylation, sumoylation, ubiquitination, and so on). These post-translational modifications are important and are often altered in different diseases.

For example, AlphaFold2 correctly predicts the structure of hemoglobin. However, it predicts the structure without the heme cofactor, but hemoglobin under physiological conditions is always bound to the heme cofactor. In another case, CENP-E kinesin, AlphaFold2 correctly predicts the so-called molecular motor but not the coiled-coil region. In fact, AlphaFold2 has difficulties with intrinsically disordered regions of the proteins (like the coiled-coil region).

Limitation of AlphaFold2: structure of the protein Hemoglobin and CEMPE with the corresponding AlphaFold2 predictions. Image source: here, license: here

Moreover, the side chains of amino acids are not always precisely positioned. This information is important to determine, for example, the active site of the protein and how to design molecules that can regulate it.

Another problem is that proteins can have different conformations (they are not static but dynamical entities) and AlphaFold2 return only one of them. In disease conditions, protein sequences may have mutations that alter the protein structure, and AlphaFold2 is currently unable to predict this.

An important example is the human potassium voltage-gated channel subfamily H member 2 (hERG) protein that can exist in three conformations (open, closed, inactive) during the heartbeat. Mutations of this channel or interactions with particular drugs lead to long QT syndrome, so it is important to predict the three different conformational structures.

As noted by MIT researchers, AlphaFold2 is currently useful in only one step of drug discovery: modeling the structure of the protein. The model does not capture how a drug physically interacts with the protein. The researchers attempted to simulate how bacterial proteins interact with antibiotics (molecules that bind more tightly to the protein are potentially better antibiotics), but docking against AlphaFold2 structures was not very effective.

“Utilizing these standard molecular docking simulations, we obtained an auROC value of roughly 0.5, which basically says you’re doing no better than if you were randomly guessing,” Collins JJ, MIT researcher about using AlphaFold2 for docking antibiotics and proteins (source: here).
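
To make the "no better than random guessing" remark concrete, here is a minimal sketch of how an auROC value can be computed from scratch. The labels and scores are made-up illustrative data (hypothetical binders and non-binders), not the data from the MIT study; the point is only that uninformative scores land near 0.5 while a perfect ranking gives 1.0.

```python
import random


def auroc(labels, scores):
    """Area under the ROC curve, computed as the probability that a randomly
    chosen positive example gets a higher score than a randomly chosen
    negative one (ties count as half a win)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Illustrative data only: 200 hypothetical binders (1) and 200 non-binders (0).
rng = random.Random(0)
labels = [1] * 200 + [0] * 200
random_scores = [rng.random() for _ in labels]           # scores carry no information
perfect_scores = [1.0 if y == 1 else 0.0 for y in labels]  # scores rank perfectly

print("random scores :", auroc(labels, random_scores))
print("perfect scores:", auroc(labels, perfect_scores))
```

An auROC close to 0.5 therefore means that the docking scores did not separate true binders from non-binders any better than chance.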

However, it is critical, as always, that users take the limitations of the method into account. If structure predictions are used and interpreted naively, it can lead to erroneous hypotheses or blatantly wrong mechanistic models. — The joys and perils of AlphaFold, EMBO reports

AlphaFold2 certainly represented a breakthrough in the prediction of protein structures. However, it has several limitations, and, as the MIT researchers noted, these must be taken into account both when forming hypotheses and when using it in specific applications.

Whatever the purpose of a model, what it can achieve is defined by the nature of its data. AlphaFold2 was trained using data in the PDB (Protein Data Bank, a repository of experimentally determined protein structures), and thus predicts a protein structure as if it were another PDB entry. The problem is that many proteins exist under physiological conditions only bound to cofactors, to other proteins, or in large complexes, but one cannot always crystallize proteins under physiological conditions, with cofactors or with other proteins.

Moreover, for more than 40% of all UniProt families, no protein has been crystallized; the figure is still more than 20% at the level of superfamilies (a broader class).

In addition, the publication associated with an experimental structure (for example, one obtained by crystallography) explains how and why the structure was obtained, along with the possible limitations of the model, whereas AlphaFold2 does not consider this context in its predictions. So, as outstanding a tool as AlphaFold2 is, one must still take its limitations into account.

image by Nathan Dumlao at Unsplash.com

AlphaFold2 showed that it is possible to predict the structure of a single protein with good accuracy using deep learning. Given these results, CASP decided to increase the complexity of the challenge. As mentioned before, CASP is a challenge dedicated to protein structure prediction algorithms, now in its 15th edition (the first, CASP1, was held in 1994, and the last two, CASP13 and CASP14, were won by DeepMind).

Therefore, CASP15 will focus on several categories, including:

  • Single Protein and Domain Modeling. This edition will again include several single protein domains, but with an emphasis on fine-grained accuracy (local main chain motifs and side chains).
  • Assembly. This category considers protein interactions (interactions between different protein domains, between various subunits, etc.).
  • RNA structures and complexes. RNA molecules can also assemble into 3D structures, and there is much less knowledge on the subject.
  • Protein-ligand complexes. Proteins can have physiological ligands such as various metabolites (or even other proteins), and therapeutic molecules can also be understood as ligands. This is a crucial step in drug discovery: most drugs on the market are in fact small molecules that regulate their targets (proteins) by binding to them. CASP15 is also intended as a challenge to push research in drug design.
  • Protein conformational ensembles. As mentioned, proteins should not be viewed as static entities; they can undergo several conformational changes. These changes can occur under specific conditions such as binding to a particular substrate, phosphorylation, or binding to a drug. Obtaining data on these conformations is difficult even experimentally (they are usually obtained by cryo-EM and NMR).

Note that while previous editions focused on a static view of proteins, here the emphasis is on a dynamic view (behavior in solution, interactions with other protein partners or with other biological macromolecules). Thus, in light of the limitations we have seen above, these categories pose challenges for AlphaFold2.

Submissions for CASP15 closed on August 19, and the results will be announced in December at the conference in Antalya. Thus we will know soon whether DeepMind will retain the championship belt. In the meantime, for those who like spoilers, CAMEO (a community project sustained by different institutions) continuously monitors the quality of predictions from different sources (different servers hosting the predictions of different competitors). CAMEO offers several scores for the predictions, and you can also select the time interval.

For the same targets, CAMEO also includes predictions from the standard AlphaFold2 (the version released last year). As can be seen in the graph, this baseline version of AlphaFold2 does not fare badly (the closer to the top-right, the better).

Screenshot from CAMEO website by the author.
Trends for keyword “Alphafold” on Google trends, screenshot by the author

Just twelve months later, AlphaFold has been accessed by more than half a million researchers and used to accelerate progress on important real-world problems ranging from plastic pollution to antibiotic resistance — DeepMind press release

One year later, AlphaFold2 is still present in Google searches, and there have been recent developments that we will discuss here.

Immediately after the article was published, DeepMind made the source code available on GitHub and also provided a Google Colab where it is possible to enter the sequence of a protein and get an image of the predicted structure.

For example, simply visit the UniProt site to retrieve the sequence of a protein, insert it inside the Google Colab cell, and run it to get the prediction. Here is the sequence of human hemoglobin (beta subunit):

MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPK
VKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFG
KEFTPPVQAAYQKVVAGVANALAHKYH

Here is an example of a prediction by AlphaFold2; color represents the confidence of the model. Obtained by the author using the provided Google Colab (you can also use the GitHub source code, but the download size is 2.2 TB and you need a GPU to run it).
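
Since the sequence above is printed across several lines, it can help to tidy it up before pasting it anywhere. The following is a minimal sketch, assuming the notebook expects one uninterrupted string of one-letter residue codes; the helper names `clean_sequence` and `to_fasta` and the header `example_query` are hypothetical and not part of AlphaFold2 or its Colab.

```python
# The 20 standard amino acids in one-letter code.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def clean_sequence(raw):
    """Remove whitespace and line breaks, upper-case, and reject unknown letters."""
    seq = "".join(raw.split()).upper()
    unknown = sorted(set(seq) - VALID_RESIDUES)
    if unknown:
        raise ValueError(f"unexpected residue codes: {unknown}")
    return seq


def to_fasta(header, seq, width=60):
    """Format a sequence as a FASTA record with fixed-width lines."""
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return f">{header}\n" + "\n".join(lines) + "\n"


raw = """
MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPK
VKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFG
KEFTPPVQAAYQKVVAGVANALAHKYH
"""

seq = clean_sequence(raw)
print(len(seq), "residues")
print(to_fasta("example_query", seq))
```

The single-line string in `seq` is what would be pasted into the sequence field, while the FASTA record is a convenient way to keep the query on disk.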

As mentioned before, the same group has presented a specific version of AlphaFold2 for multimeric complexes and the source code has also been published (GitHub, Google Colab).

In addition, DeepMind, in collaboration with EMBL's European Bioinformatics Institute (EMBL-EBI), initially released about 350,000 predicted structures covering the proteomes of about 20 species (including roughly 20,000 human proteins).

Recently, they released the structures of almost all proteins cataloged in the databases (they ran AlphaFold2 on the protein sequences that had been collected). DeepMind built a database where all these predicted structures can be accessed (AlphaFold DB) or downloaded in bulk here. The database now contains more than 200 million structures (up from about 1 million before this expansion).

Recently, DeepMind announced they have won the 2023 Breakthrough Prize in Life Sciences for the development of AlphaFold.

AlphaFold1 had already proven useful to scientists. The first version of AlphaFold was not capable of accurately modeling amino acid side chains, but scientists found that the model did a very good job with the backbone. For example, AlphaFold1 was used in this scientific paper to determine the structure of a large glutamate dehydrogenase enzyme (the authors used the model to guide the study of the structure of this protein).

structure of the large glutamate dehydrogenase enzyme. image source (original article)

This illustrates how AlphaFold can be integrated with experimental work. In fact, one of the fastest growing techniques for the experimental determination of protein structures is cryo-electron microscopy. This technique can be used on proteins that fail to crystallize (spoiler: many important ones) and allows us to observe proteins in more physiological situations (conformations, interactions, etc.). On the other hand, it produces images with a low signal-to-noise ratio and does not always reach atomic resolution. In these cases, AlphaFold2 can help predict the structure and facilitate the experimental validation.

Science recently devoted a special issue to the resolution of the nuclear pore structure. The nuclear pore is a complex protein structure that regulates the entry and exit of biological macromolecules between the nucleus and the cytoplasm. Nuclear pore complexes (NPCs) are composed of more than 1,000 protein subunits and are involved in several cellular processes (in addition to regulating transport), which makes them crucial in several diseases and in interactions with pathogens (for example, they are the target of several viruses that need to enter the nucleus).

However, obtaining a high-resolution 3D image of the more than thirty NPC proteins has proven exceedingly difficult. Researchers have now used AlphaFold to help construct the model, and three articles have been published presenting the structure of the NPC at near-atomic resolution.

Scaffold architecture of the human NPC. Image source: here

In 2004, researchers showed that mutations in the PINK1 gene lead to the early development of Parkinson's disease. Obtaining the structure of human PINK1 was difficult (the protein was unstable), although researchers had succeeded in obtaining the structure of the insect PINK1 protein. The researchers then used cryo-EM and AlphaFold2 to obtain the structure of human PINK1. As the researchers said, this opens up possibilities for studying molecules that can be used in the treatment of Parkinson's disease:

We can start to think about, ‘What kind of drugs do we have to develop to fix the protein, rather than just deal with the fact that it’s broken’ — David Komander (source)

As highlighted by professor Beltrao, we can use the structure for many research purposes. From paleontology to drug discovery, wherever living things are discussed, protein structure can be useful.

We can now trace back the evolution of proteins for longer periods of the evolutionary timescale — Pedro Beltrao (here)

AlphaFold2 has opened possibilities beyond drug discovery: proteins are also used in various industrial processes (enzymes in detergents, the food industry, etc.). In addition, the study of proteins can be useful for environmentally important challenges: for example, AlphaFold2 has been used to study the structure of vitellogenin (a protein important for immunity in honey bees). Protein structure can also be useful in designing proteins that degrade plastics or other pollutants.

A thankful honey bee, happy that we are trying to save it from the extinction we are causing. Image source: Dmitry Grigoriev from Unsplash.com

We discussed above how AlphaFold2 can be used in different applications. In addition, the fact that the model is open source means that it can be used and incorporated into other models. For example, that is what Meta has done with ESM-IF1, where the idea is to recover the sequence from the structure.

Another recent application, Making-it-rain, aims to describe the dynamics of a protein. In a dynamical system, applying or changing conditions makes the atoms of a molecule move along a trajectory that can be followed. Making-it-rain simulates the behavior of the atoms of a protein in water or when other molecules are present (nucleic acids, small molecules). As described in the article, Making-it-rain also incorporates AlphaFold2 alongside several other models. In addition, the code can be found both on GitHub and in several Google Colabs (the links are in the repository).
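
To give an intuition for what "atoms moving along a trajectory" means, here is a toy sketch of the velocity Verlet integration scheme commonly used in molecular dynamics, applied to a single particle on a harmonic spring in one dimension. This is not the Making-it-rain code and makes no claim about its internals; the spring constant, mass, and time step are arbitrary illustrative values in reduced units.

```python
def velocity_verlet(x, v, force, mass, dt, n_steps):
    """Integrate Newton's equations with the velocity Verlet scheme.

    Returns the list of positions (the trajectory) and the final velocity.
    """
    trajectory = [x]
    f = force(x)
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / mass   # half-step velocity update
        x = x + dt * v_half                # full-step position update
        f = force(x)                       # force at the new position
        v = v_half + 0.5 * dt * f / mass   # second half-step velocity update
        trajectory.append(x)
    return trajectory, v


# Toy system: one particle on a harmonic spring (illustrative values only).
K = 1.0      # spring constant
MASS = 1.0   # particle mass


def spring_force(x):
    return -K * x


def total_energy(x, v):
    return 0.5 * MASS * v * v + 0.5 * K * x * x


x0, v0 = 1.0, 0.0
traj, v_end = velocity_verlet(x0, v0, spring_force, MASS, dt=0.01, n_steps=1000)

e_start = total_energy(x0, v0)
e_end = total_energy(traj[-1], v_end)
print("frames in trajectory:", len(traj))
print("relative energy drift:", abs(e_end - e_start) / e_start)
```

A real simulation follows the same loop, but with thousands of atoms in three dimensions and a force field describing bonds, angles, electrostatics, and the surrounding water.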

MD simulation run in Google Colab’s “Making-it-rain” notebook
Image by Torsten Dederichs at Unsplash.com

In a report, EMBL lists some of the possible applications of AlphaFold2. Briefly:

  • Accelerating structure studies. AlphaFold2 can help when a protein is difficult to crystallize, and can help resolve structures obtained by low-resolution experimental methods or when the entire protein sequence is not crystallizable.
  • Filling in the components of protein complexes. AlphaFold2 can help with the study of large protein complexes and of interactions with other biological molecules (RNA, DNA), or even help in suggesting hypotheses about the function of a protein.
  • Generating hypotheses for analysis of protein dynamics. The structure is the first step in studying the dynamics of a protein.
  • Predicting RNA structure. We can learn from predicting protein structure and work on this new frontier.
  • Predicting the evolution of a protein or the effect of sequence mutations.

We have seen different applications and the possibility of integrating AlphaFold2 into different models. Other groups have been concerned with democratizing access to it. ColabFold is a project that aims to make AlphaFold2 and other structure prediction algorithms, such as RoseTTAFold, easier to use and more accessible.

In addition, other groups have begun working to overcome the limitations of AlphaFold2, such as predicting alternative conformational states of transporters and receptors. The researchers in this article have proven that it is possible and have also released the code (GitHub).

Predicting different conformations of a molecule. Source: here

Even more challenging is the estimation of how strong a ligand might bind to a certain pocket (scoring problem). This is the holy grail of drug discovery and multiple methods exist to describe protein-ligand binding with varying accuracy. — J. Chem. Inf. Model. 2022, 62, 3142−3156

Certainly, the pharmaceutical sector will benefit the most from the release of AlphaFold2. Protein structure is the first step in designing a molecule. In fact, most drugs on the market are "small molecules" that insert into a "pocket" of the protein and block or activate it. AlphaFold2 can help researchers identify new pockets and therefore enable the design of small molecules.

Another frontier opening up is designing proteins from scratch. Last June, South Korea approved a vaccine based on a protein that was created from scratch. It took 10 years of intensive work to arrive at this result.

The advent of AlphaFold2 has brought new hype to the field. Several companies have decided to focus on the use of AI in protein design. After all, designing proteins could be useful in various fields such as medicine, waste cleaning, nanotechnology, and so on.

For example, some of the proposed methods, such as varying existing structures or assembling local structures, are much easier with AlphaFold2. There are also more unusual methods such as "protein hallucination", where you start from random amino acids and optimize the sequence while predicting the structure (although this method seems inefficient: it was tested on 100 peptides, and one-fifth of them achieved the predicted shape).

The field is moving rapidly, as described in this review. In addition, several tools and algorithms are being developed to help design proteins (ProteinMPNN is one example; it acts as a kind of "spellcheck" when designing a protein). Another interesting approach is to use GPT-3 to generate protein sequences instead of text.

Examples of how to leverage deep learning for protein design. image source: here
Image by Mantas Hesthaven at Unsplash.com

The results obtained by AlphaFold2 are impressive, so much so that the CASP consortium eliminated some categories because they were considered too simple for the available algorithms.
However, as we have seen, AlphaFold2 has several limitations (different conformations, interactions with other biological molecules, ligands, etc.). In part, these problems derive from the incompleteness of experimental databases, since an algorithm relies on the data it has seen during training to make predictions.

Despite these limitations, AlphaFold2 has been used for different purposes and in different publications. AlphaFold2 is still a starting point; after all, structure prediction is only one of the steps in drug discovery. In any case, it is still premature to start drug development without experimental evidence, using only the predictions of an algorithm.

In addition, the more structures are added, the more these prediction models will improve. AlphaFold2 uses alignment with similar sequences, but upcoming models will be even more sophisticated by including other information, physics and biology principles, etc.

The more intricate questions about conformational states, dynamics, and transient interactions will still need attention and experiments to answer. It is thus important that funders — and peer-reviewers — do not come to believe that “the folding problem has been solved”. — EMBL report

That said, structural biologists will not go away but will rather take advantage of the algorithm to proceed more quickly. For years, obtaining the structure of all proteins was a dream: too expensive to achieve by experimental methods alone. On the other hand, many protein families are not covered, gaps need to be filled to help models learn, and many aspects of the folding process are still not clear.

As we have seen, obtaining the structure is the first step, and in the future we will use this information to design new proteins, obtain their sequences, and produce them for various applications.

To date, more than 500,000 researchers from 190 countries have accessed the AlphaFold DB to view over 2 million structures. — DeepMind press release

Researchers have leveraged AlphaFold2 for several research projects and for creating new tools, and many new applications are on the horizon. One year is still too short to see all the new possibilities opened up by AlphaFold2.

In drug discovery, however, we will only see the results of AlphaFold2 and these new approaches in the next few years. Unfortunately, before a new therapy can reach the market, it must pass several steps in both preclinical and clinical trials. From conception to arrival in the clinic, it can take up to ten years and billions of dollars (not to mention that many clinical trials fail). Artificial intelligence will be crucial to shortening the time and expense, but that is another story (perhaps to be told in another article).
