Proteins are large biomolecules and macromolecules made from long chains of amino acids. A linear chain of amino acid residues is called a polypeptide; a protein contains at least one long polypeptide (short polypeptides, containing fewer than 20–30 residues, are rarely considered to be proteins and are commonly just called peptides).
Proteins perform a vast array of functions within organisms including catalyzing metabolic reactions, DNA replication, responding to stimuli, providing structure to cells and organisms, and transporting molecules from one location to another.
Proteins differ from one another primarily in their sequence of amino acids, which is dictated by the sequence of nucleotides in their genes, and which usually results in protein folding into a specific 3D structure that determines its activity. Methods commonly used to study protein structures and function include immunohistochemistry, site-directed mutagenesis, X-ray crystallography, nuclear magnetic resonance and mass spectrometry.
Since protein synthesis is driven by the genetic code of stretches of DNA, and since the human genome has already been sequenced, there is great interest in determining the possible structure of proteins computationally. We should be able to extrapolate the peptide sequence coded by our DNA and solve how the protein folds into its final conformation.
Protein structure prediction is one of the most important goals pursued by computational biology, as it matters for the design of both drugs and novel enzymes. Computational prediction could deliver structures faster than experimental determination, allow the in-silico examination of the effect of mutations on protein function, and provide novel insights into poorly understood molecular mechanisms.
Predicting folded conformations is, however, a formidable challenge for several reasons. First, computational complexity arises from the vast number of possible conformations: the search space grows exponentially with chain length. Second, the energy landscape is rugged, with many local minima, so locating the low free energy conformation that a protein tends to adopt is difficult.
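To give a feel for the exponential growth described above, here is a minimal back-of-the-envelope sketch. It assumes, purely for illustration, that each residue independently adopts one of three discrete backbone states and that a hypothetical computer evaluates 10^13 conformations per second; neither number comes from the sources of this article, and the function names are my own.

```python
def conformation_count(n_residues: int, states_per_residue: int = 3) -> int:
    """Number of chain conformations if every residue independently adopts
    one of `states_per_residue` discrete states (an illustrative assumption)."""
    return states_per_residue ** n_residues


def exhaustive_search_years(n_residues: int,
                            states_per_residue: int = 3,
                            samples_per_second: float = 1e13) -> float:
    """Years needed to enumerate every conformation at a fixed, assumed rate."""
    seconds = conformation_count(n_residues, states_per_residue) / samples_per_second
    return seconds / (365.25 * 24 * 3600)


if __name__ == "__main__":
    for n in (10, 50, 100):
        print(f"{n:>4} residues: {conformation_count(n):.3e} conformations, "
              f"{exhaustive_search_years(n):.3e} years to enumerate")
```

Even under these generous assumptions, enumerating every conformation of a 100-residue chain is out of reach, which is why prediction methods have to search or learn rather than enumerate.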
Additionally, entropic effects introduce a balance between stability gain and entropy loss during folding. Interactions with water molecules, essential in the aqueous protein environment, must be accurately modeled for precise predictions. Long-range interactions between distant amino acids, a crucial aspect of folding, pose challenges in capturing non-local details.
The cooperative nature of protein folding, where different regions fold simultaneously, further complicates predictions. Proteins also undergo post-translational modifications and dynamic changes that affect their properties, stability, and function. Experimental techniques like X-ray crystallography have limitations and provide incomplete data for predictive models. Overall, these constraints present a multifaceted challenge in accurately predicting protein folding.
Despite these challenges, significant progress has been made in recent years thanks to advancements in computational techniques, machine learning approaches, and increased understanding of the biophysical principles governing protein folding. Researchers continue to explore innovative methods to improve the accuracy of protein folding predictions; in the following paragraphs I review the recent advancements.
A community-wide experiment, the Critical Assessment of Techniques for Protein Structure Prediction (CASP), takes place every two years. In 2018, the Google division DeepMind presented the software AlphaFold at CASP13, a breakthrough method that not only combined previous methodologies but also used artificial intelligence. Its algorithm beat the other tools in CASP13 and became the state of the art in protein structure prediction.
The first version of AlphaFold used deep learning to predict the structure, demonstrating that a neural network can learn protein-specific structure from the protein sequence. A convolutional neural network was trained on PDB structures to produce distograms, which are predicted distributions of the distances between pairs of residues. Starting from the amino acid sequence and features derived from its multiple sequence alignment (MSA), the network predicted a distogram and probability distributions for the backbone torsion angles.
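To make the idea of a distogram more concrete, the sketch below turns residue coordinates into a map of binned pairwise distances, which is the kind of target such a network can be trained to reproduce. It is not DeepMind's code: the function names are hypothetical, and the binning (64 equal bins between 2 and 22 Å) is an assumed parameter choice. A predicted distogram would hold a probability for every bin of every residue pair, whereas this sketch only computes the single "true" bin from known coordinates.

```python
import math


def pairwise_distances(coords):
    """Return an L x L matrix of Euclidean distances between residue positions."""
    return [[math.dist(a, b) for b in coords] for a in coords]


def distance_to_bin(d, min_d=2.0, max_d=22.0, n_bins=64):
    """Map a distance in angstroms to a bin index; out-of-range values are clamped."""
    if d <= min_d:
        return 0
    if d >= max_d:
        return n_bins - 1
    width = (max_d - min_d) / n_bins
    return min(n_bins - 1, int((d - min_d) / width))


def binned_distance_map(coords, **bin_options):
    """Discretize every pairwise distance into its bin index."""
    return [[distance_to_bin(d, **bin_options) for d in row]
            for row in pairwise_distances(coords)]


if __name__ == "__main__":
    # Three made-up residue positions spaced 3.8 angstroms apart on a line.
    toy_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
    for row in binned_distance_map(toy_coords):
        print(row)
```

The resulting map is symmetric, and distant pairs all fall into the last bin, so most of the information sits in the bins for residues that are close in space.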
Presented at CASP14, held between May and July 2020, AlphaFold2 predicted protein structures more accurately than the competing methods, with a root-mean-square deviation (RMSD) between predicted and experimental backbone structures of 0.8 Å versus 2.8 Å for the next best performing method. AlphaFold2 increased the structural coverage of human protein residues from 48% to 76%, reducing the number of human proteins without structural coverage from 5027 to 29. AlphaFold2 uses the amino acid sequence to build an MSA from multiple protein sequence databases, identifying mutation-prone regions and detecting correlations between positions.
AlphaFold2 owes its breakthrough to the Evoformer and the structure module, both neural network components. By exchanging information between the MSA and the templates, the Evoformer improves the assessment of the MSA and refines the protein structures hypothesized by the templates. The structure module, which also contains an attention architecture, prioritizes the orientation of the protein backbone: it treats each residue as a rotation and a translation, places the side chain of each residue in a highly constrained frame, and finishes with local refinement and minimization by gradient descent.
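For readers unfamiliar with RMSD, the following minimal sketch shows how the metric is computed from two sets of matched atom coordinates. It assumes the predicted and experimental structures have already been superimposed and list the same atoms in the same order; published evaluations perform that superposition first, which this sketch leaves out. The function name and the toy coordinates are my own.

```python
import math


def rmsd(predicted, reference):
    """Root-mean-square deviation between two equally long lists of 3D points.

    Assumes both structures are already superimposed and that point i in
    `predicted` corresponds to point i in `reference`.
    """
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared_sum = sum(math.dist(p, r) ** 2 for p, r in zip(predicted, reference))
    return math.sqrt(squared_sum / len(predicted))


if __name__ == "__main__":
    # Toy backbone: every predicted atom is displaced by 1 angstrom along z.
    experimental = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
    predicted = [(x, y, z + 1.0) for x, y, z in experimental]
    print(f"RMSD = {rmsd(predicted, experimental):.2f} angstroms")
```

Because the deviations are squared before averaging, a few badly placed atoms raise the RMSD much more than many slightly misplaced ones.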
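The idea of describing each residue by a rotation and a translation can be illustrated with a small sketch: backbone atoms are stored once in a local frame and then placed in space by applying the residue's rotation and translation. This is only an illustration of the rigid-frame concept, not AlphaFold2's implementation; the local atom coordinates are made-up values and all names are hypothetical.

```python
import math


def rotation_z(theta):
    """3x3 matrix for a rotation by `theta` radians about the z axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0],
            [s, c, 0.0],
            [0.0, 0.0, 1.0]]


def apply_frame(rotation, translation, local_point):
    """Map a point from the residue's local frame to global coordinates: R @ x + t."""
    return tuple(
        sum(rotation[i][j] * local_point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


# Illustrative (not experimentally exact) local coordinates in angstroms,
# with the alpha carbon at the origin of the residue frame.
LOCAL_BACKBONE = {"N": (-0.5, 1.4, 0.0), "CA": (0.0, 0.0, 0.0), "C": (1.5, 0.0, 0.0)}


def place_residue(rotation, translation):
    """Place the backbone atoms of one residue using its rigid frame."""
    return {name: apply_frame(rotation, translation, xyz)
            for name, xyz in LOCAL_BACKBONE.items()}


if __name__ == "__main__":
    placed = place_residue(rotation_z(math.pi / 2), (10.0, 0.0, 0.0))
    for name, xyz in placed.items():
        print(name, tuple(round(v, 3) for v in xyz))
```

Since the transform is rigid, distances between atoms within a residue are unchanged by the placement, so a model working with such frames only has to choose the rotation and translation of each residue.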
In 2021, in a partnership between Google DeepMind and the EMBL-European Bioinformatics Institute (EMBL-EBI), the AlphaFold Protein Structure Database (AlphaFold DB) was created, making available over 360,000 predicted structures from 21 organism proteomes. Today, AlphaFold DB has over 200 million entries, including the proteomes of humans and 47 other organisms, with the structure predictions and their respective analyses freely available to the scientific community.
AlphaFold2 has also been used to predict protein-protein interactions, using flexible linkers or artificial gaps. In general, it predicted heterodimeric protein complexes accurately, outperforming the docking approaches usually used in these analyses.
Despite its success, AlphaFold2 has difficulty predicting intrinsically disordered protein regions and loops. This is troubling considering the importance of loops for drug screening and design, since they are exposed on the protein surface and readily accessible to solvent and other proteins.
Also in 2021, the Institute for Protein Design at the University of Washington introduced RoseTTAFold, a “three-track” neural network that simultaneously considers patterns in protein sequences, how a protein’s amino acids interact with one another, and a protein’s possible three-dimensional structure. In this architecture, one-, two-, and three-dimensional information flows back and forth, allowing the network to collectively reason about the relationship between a protein’s chemical parts and its folded structure.
The computational method OmegaFold was launched by the Chinese biotech firm Helixon in July 2022, marking a breakthrough in predicting high-resolution protein structure from a single primary sequence. The researchers reported in a study that they used a novel combination of a protein language model, which enabled predictions from individual sequences, and a geometry-inspired transformer model trained on protein structures. The team said that the overall model of OmegaFold is conceptually inspired by advances in language models for NLP coupled with the deep neural networks used in AlphaFold2. In addition, OmegaFold enables accurate predictions for orphan proteins without ties to any characterized protein family, as well as for antibodies, which are known for having noisy MSAs resulting from rapid evolution.
In November 2022, Meta AI introduced ESMFold, a structure predictor built on its large-scale protein language model, Evolutionary Scale Modeling (ESM), with language modeling scaled up to 15 billion parameters. The model was reported to have accuracy similar to that of AlphaFold2 and RoseTTAFold, but ESMFold inference is faster, which makes it practical to explore the structural space of metagenomic proteins.
The natural world contains a vast number of proteins beyond the ones that have been cataloged and annotated in well-studied organisms. Metagenomics uses gene sequencing to discover proteins in samples from environments across the earth: from microbes living in the soil, deep in the ocean, in extreme environments like hydrothermal vents, and even in our guts and on our skin. Meta also released the ESM Metagenomic Atlas, which contains more than 600 million predicted protein structures and covers nearly the entire MGnify90 database, a public resource cataloging metagenomic sequences.
Machine learning and artificial intelligence have revolutionized the field of protein folding computation, marking a significant leap forward in our ability to predict and understand the intricate 3D structures of proteins. Despite the inherent challenges posed by computational complexity, energy landscapes, and dynamic molecular interactions, recent breakthroughs exemplified by AlphaFold2, RoseTTAFold, OmegaFold, and ESMFold underscore the transformative power of cutting-edge technologies in overcoming these hurdles. These advancements not only enhance our comprehension of protein folding but also hold immense promise for drug design, enzyme engineering, and a deeper exploration of metagenomic proteins. As we venture further into the realm of computational biology, the continued collaboration between AI and biotechnology promises to unveil the mysteries of countless unexplored proteins, opening new frontiers in understanding the complex machinery of life.
(Author’s note: along with a bit of my own writing and editing, this article contains direct excerpts from the references listed below and text generated by ChatGPT)
References:
“Before and after AlphaFold2: An overview of protein structure prediction” by Letícia M. F. Bertoline, Angélica N. Lima, Jose E. Krieger, Samantha K. Teixeira
Wikipedia: Protein_structure_prediction
Protein Wars by Amit Raja Naik
ESM Metagenomic Atlas, from Meta AI