Techno Blender
Digitally Yours.

How to Deploy and Interpret AlphaFold2 with Minimal Compute | by Temitope Sobodu | Feb, 2023



Photo by Luke Jones on Unsplash

It is no longer news that one of the steepest challenges in biology, protein-structure prediction, has been cracked by DeepMind, a London-based artificial intelligence company. DeepMind won the 14th edition of the Critical Assessment of protein Structure Prediction (CASP14) with a median Global Distance Test score above 90, a level widely regarded as competitive with experimental methods, and went on to publish a landmark paper in the summer of 2021.

DeepMind named their protein-folding platform AlphaFold (the latest release was AlphaFold v2.3.0 at the time of writing) and released the source code on GitHub for open access. However, deploying the open-source AlphaFold2 code demands heavyweight computational resources: the recommended setup for downloading the databases and running predictions includes 12 vCPUs, 85 GB of RAM, a 100 GB boot disk, a 3 TB data disk, and an A100 GPU. The user must also be well-versed in Linux and comfortable deploying Docker containers and other dependencies. The purpose of this article is to guide readers through accessible alternative tools for tapping into the AlphaFold miracle.

This article is of medium length (no pun intended), so please stay with me.

I will be discussing:

· How AlphaFold works

· Performance evaluation metrics

· The EMBL-EBI method

· The Colab notebook method

· AlphaFold2’s limitations

· Conclusion

So how does AlphaFold2 work?

At its core, AlphaFold2 takes as input a multiple sequence alignment (MSA), paired-residue representations, and structural templates drawn from a library of over 100,000 known protein structures in the PDB (validated experimentally by NMR, X-ray crystallography, or cryo-EM), supplemented by sequences from metagenomic databases. Its central component, the Evoformer, is a 48-block neural network built on concepts familiar from large language models (LLMs): tokenization, transformers, and attention.

Image Source: Jumper et al.

The Evoformer outputs MSA and pair representations, which are fed into the structure prediction module. This module employs invariant point attention on the single representation (derived from the first row of the MSA representation) and predicts the 𝛘 torsion angles between predicted residue atoms: in essence, the placement of interconnected atoms around the peptide backbone in XYZ coordinates. The final predicted structure is then relaxed with OpenMM using the Amber force field to optimize its energy landscape, reducing steric clashes and geometry violations.

Performance evaluation metrics

AlphaFold2 (AF2) outputs a 3D protein structure, and model performance is evaluated with three confidence metrics.

The predicted Local Distance Difference Test (pLDDT), ranging from 0 to 100, is a per-residue confidence score, signifying how confident the model is in its prediction for each amino-acid residue, computed relative to the 𝛂-carbon atoms.

Image Source: Author

For instance, the image above depicts a pLDDT color-coded prediction of the human mTOR (mechanistic Target of Rapamycin) serine/threonine kinase. pLDDT > 90 signifies very high confidence, 70–90 reasonable confidence, while < 70 connotes low model confidence. The pLDDT helps us gauge how well the model has performed across individual protein regions or domains.
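One handy fact for working with these predictions: AlphaFold writes each residue's pLDDT into the B-factor column of the PDB files it outputs. A minimal Python sketch for pulling those scores out (it reads the fixed PDB columns directly, so it assumes a well-formed single-model file):

```python
def plddt_per_residue(pdb_path):
    """Extract per-residue pLDDT from an AlphaFold PDB file.

    AlphaFold stores the pLDDT score in the B-factor column
    (columns 61-66), so reading it for each alpha-carbon (CA)
    atom yields one confidence score per residue.
    """
    scores = {}
    with open(pdb_path) as fh:
        for line in fh:
            if line.startswith("ATOM") and line[12:16].strip() == "CA":
                resnum = int(line[22:26])       # residue sequence number
                scores[resnum] = float(line[60:66])  # B-factor = pLDDT
    return scores
```

Averaging these values, or counting the fraction above 90, gives a quick quality summary of a downloaded model without opening a structure viewer.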

The Predicted Aligned Error (PAE) reports the expected positional error, in Ångströms, at one residue X when the predicted and true structures are aligned on another residue Y. Simply put: how confidently the residues are arranged relative to other residues in space. Values run from 0 up to roughly 35 Å, with low values indicating a confident relative placement. AF2 is generally better at predicting the relative positions of residues within the same domain (intra-domain) than of residues in different domains (inter-domain). This makes sense, because residues within a domain move less relative to one another than whole domains do relative to each other. The model outputs a PAE plot that juxtaposes residue positions along the X and Y axes.

Image Source: Author
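If you download the PAE as JSON (the EMBL-EBI database discussed below exposes it this way), a short sketch like the following can load and summarize the matrix. The key name `predicted_aligned_error` follows the EMBL-EBI layout, and the `pae` fallback covers some notebook outputs; treat both key names as assumptions about your particular source:

```python
import json

def load_pae(json_path):
    """Load an N x N PAE matrix (Angstroms) from an AlphaFold-style
    JSON file. Assumes the EMBL-EBI layout: a list holding one object
    keyed by 'predicted_aligned_error'; falls back to a 'pae' key."""
    with open(json_path) as fh:
        data = json.load(fh)
    entry = data[0] if isinstance(data, list) else data
    return entry.get("predicted_aligned_error", entry.get("pae"))

def mean_offdiagonal_pae(matrix):
    """Average error between *different* residues; a rough proxy for
    how confidently domains are placed relative to each other."""
    n = len(matrix)
    if n < 2:
        return 0.0
    total = sum(matrix[i][j] for i in range(n) for j in range(n) if i != j)
    return total / (n * (n - 1))
```

A low off-diagonal average suggests the inter-residue arrangement, not just each local fold, is predicted confidently.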

The predicted Template Modeling score (pTM) measures the structural congruency between two folded protein structures. AlphaFold2 optionally accepts PDB templates as part of its modeling options; although templates are not required for predictions, they can improve model performance. The pTM ranges from 0 to 1 and provides a framework for AF2 to rank its five predicted outputs. Predictions with pTM < 0.2 are either stochastically assigned residue patterns with negligible or no correlation to the supposed native structure, or intrinsically disordered proteins. A pTM > 0.5 is usually strong enough to support inference.

Now that we have discussed AlphaFold’s fundamentals and its metrics, we can deep-dive into low-compute methods to explore AlphaFold.

The EMBL-EBI method

To provide easy access to AlphaFold2, DeepMind collaborated with the European Bioinformatics Institute (EMBL-EBI) to curate predicted structures for 48 commonly studied species, including humans, mice, rats, apes, fruit flies, and yeast. The database contains over 1 million structures spanning these 48 proteomes: a modest number compared with the 200 million predictions the AlphaFold effort has since made available, but this easy-to-use service eliminates the heavy computational cost associated with the open-source code. The webpage can be accessed via the link below.

On the webpage, users can search for their protein of interest by protein name, gene name, UniProt accession, or UniProt ID. Alternatively, users can search by sequence if the protein name is unknown; unknown proteins can be identified by inputting the protein sequence into a separate tool called FASTA (link below).

With FASTA, users can search for macromolecules, including proteins, by inputting a known sequence, which is matched against a universal protein database such as UniProt.

The EMBL-EBI webpage renders the folded structure in an interactive viewer, color-coded by pLDDT, and generates a PAE plot. Users can download the predicted protein structure in PDB, mmCIF, or JSON format. The EMBL-EBI service is fast, nearly instantaneous, and requires no installations or dependencies. Its cons include a lack of flexibility with sequence inputs and support for monomeric structures only. Further, these models cannot be tweaked to predict destabilizing mutant residues or domains; they are essentially ready-made predictions, and contribute little to custom inference or hypothesis generation.
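For scripted workflows, the database also serves these files at predictable URLs. The sketch below builds a download URL following the AF-&lt;accession&gt;-F1-model_v4 naming the database used at the time of writing; the version suffix changes as the database is updated, so treat it as an assumption to verify against the site:

```python
import urllib.request

ALPHAFOLD_FILES = "https://alphafold.ebi.ac.uk/files"

def model_url(uniprot_acc, fmt="pdb", version=4):
    """Build the download URL for an AlphaFold DB predicted structure.

    Follows the AF-<accession>-F1-model_v<N> naming scheme used by
    the database at the time of writing; the version number is an
    assumption and will change as the database is updated.
    """
    if fmt not in ("pdb", "cif"):
        raise ValueError("fmt must be 'pdb' or 'cif'")
    return f"{ALPHAFOLD_FILES}/AF-{uniprot_acc}-F1-model_v{version}.{fmt}"

def download_model(uniprot_acc, dest, fmt="pdb"):
    """Fetch the predicted structure to a local file."""
    urllib.request.urlretrieve(model_url(uniprot_acc, fmt), dest)
```

For example, model_url("P42345") points at the prediction for human mTOR (UniProt accession P42345), which can then be fed to a local viewer or the pLDDT parser above.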

The Colab notebook method

A Colab notebook is a web-hosted document that runs Python code on Google Cloud servers. Users initialize and connect to a server, which allocates GPU or TPU resources (plus RAM and disk) for each notebook session. In user-experience terms, this is the closest in function to running the open-source code. The resources allocated to users are finite, however, which imposes a practical limit of around 800 residues for prediction optimization; above this ceiling, AF2 performance may drop drastically. Other downsides of the Colab notebook include substantial wait times (dependent on the resources allocated for that instance) and unsuitability for large batch jobs; it is essentially optimal for single-sequence predictions. Nonetheless, compared with the EMBL-EBI database, the Colab notebook permits some modification of the prediction settings.

Step-by-step instructions for using the AlphaFold2 Colab notebook

Search ‘AlphaFold2 colab notebook’ on Google or use this link.

The Reconnect dropdown menu connects users to Google Cloud servers and allocates a GPU for the session. Use your sessions prudently: the GPU allocation is time-limited, and the connection is terminated when the allotment is exhausted. Google does offer a paid tier that allocates substantially more GPU time. The protein sequence goes in the query_sequence box.

Also, be sure to inspect the sequence and remove any spaces and line breaks before running the job. Inter-chain breaks are signified with a colon (:); for example, EQVTNVGGAVVTGVTAVA:EQVTNVGGAVVTGVTAVA denotes a homodimer. The optional Amber force-field relaxation step relaxes the 3D structure, letting the side chains settle with minimal steric clashes and energy violations; it does not significantly improve model performance.
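Cleaning a pasted sequence by hand is error-prone, so a small helper like the one below can strip the whitespace and catch stray characters before you paste into query_sequence. The ':' chain-break convention is taken from the notebook; the helper itself is just an illustrative sketch:

```python
# The 20 standard one-letter amino-acid codes.
VALID_AA = set("ACDEFGHIKLMNPQRSTVWY")

def clean_query(raw):
    """Strip all whitespace/newlines from a pasted sequence and
    validate it. Chains of a complex stay separated by ':' (the
    notebook's inter-chain break convention); every other character
    must be a standard one-letter amino-acid code."""
    seq = "".join(raw.split()).upper()
    for chain in seq.split(":"):
        bad = set(chain) - VALID_AA
        if bad:
            raise ValueError(f"invalid residue code(s): {sorted(bad)}")
    return seq

def chain_lengths(seq):
    """Lengths of each chain in a ':'-separated query, useful for
    checking the total against the notebook's residue ceiling."""
    return [len(c) for c in seq.split(":")]
```

Running the homodimer example above through clean_query returns it unchanged, while a sequence pasted with line breaks or lowercase letters is normalized before submission.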

Furthermore, the msa_mode setting selects the sequence-search modality. The default is MMseqs2 searching against UniRef plus environmental (metagenomic) sequences; users can instead choose MMseqs2 against UniRef only, or supply a custom MSA.

The model_type option lets users match the model to the structure being predicted, enabling predictions for oligomeric or multimeric proteins. Choose the AlphaFold2-multimer (v1 or v2) option for an oligomeric sequence, or AlphaFold2-ptm for a monomeric structure. The auto option, which is the default, lets the notebook decide.

The num_recycles setting makes the model iterate over the sequence and its own intermediate predictions multiple times, feeding each output back in as input. The default is 3 recycles; scaling to 6 or more is advisable for difficult targets, until the prediction stabilizes. Note that runtime increases with the number of recycles.

To run the job, click the Runtime dropdown menu and choose ‘Run all’. This is the last and probably the easiest step. However, if the job is interrupted by a loss of internet connection or GPU availability, the whole process may stall, so keep the internet connection intact and the notebook page open until the job completes.

After completion, the AlphaFold2 Colab notebook returns five model predictions with their respective pTM and pLDDT scores. The predictions are ranked, and the output can be downloaded to the user’s local drive or Google Drive. The zip file contains the PDB structures for the five predictions (ten if you used the Amber relax option), a corresponding PAE plot, and five JSON files holding the PAE matrices and confidence scores.
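To work with the downloaded archive programmatically, a sketch like the following lists and extracts the predicted structures. It assumes the notebook's convention of encoding the rank in each filename (e.g. rank_001), which is how ColabFold-style notebooks name their outputs; verify the naming against your own download:

```python
import zipfile

def list_predictions(zip_path):
    """List the PDB files inside a results archive, sorted by name.

    ColabFold-style notebooks encode the rank in the filename
    (e.g. '..._rank_001_...'), so a lexicographic sort puts the
    top-ranked model first; that naming scheme is an assumption.
    """
    with zipfile.ZipFile(zip_path) as zf:
        return sorted(n for n in zf.namelist() if n.endswith(".pdb"))

def extract_best(zip_path, dest_dir="."):
    """Extract only the first-ranked PDB and return its name."""
    names = list_predictions(zip_path)
    if not names:
        raise FileNotFoundError("no .pdb files in archive")
    with zipfile.ZipFile(zip_path) as zf:
        zf.extract(names[0], dest_dir)
    return names[0]
```

The extracted top model can then be opened in a viewer such as PyMOL or ChimeraX, or passed to the pLDDT parser shown earlier.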


