Review Article
Three-dimensional protein structure prediction: Methods and computational strategies
Introduction
Structural Bioinformatics is one of the key research areas in the field of Computational Biology (Zhang et al., 2005, Altman and Dugan, 2005, Clote and Backofen, 2000, Pevzner, 2000, Liljas et al., 2001, Gopakumar, 2012). It concerns the analysis and prediction of the three-dimensional (3-D) structures of biological macromolecules such as proteins, RNA and DNA (Zhang et al., 2005, Altman and Dugan, 2005). The underlying structural information comes from 3-D macromolecular structures obtained through experimental methods such as protein crystallography (X-ray diffraction), electron microscopy or nuclear magnetic resonance (NMR). This information allows one to study folds and local motifs in proteins, molecular folding, evolution and structure/function relationships.
One of the main research problems in structural bioinformatics is the prediction of three-dimensional protein structures. Proteins are long sequences built from 20 different amino acid residues that, under physiological conditions, adopt a unique 3-D structure (Anfinsen et al., 1961). Knowledge of the protein structure allows biological processes to be investigated more directly, with higher resolution and finer detail. The sequence–protein–structure paradigm (also known as the “lock-and-key” hypothesis) says that a protein can achieve its biological function only by folding into a unique, structured state determined by its amino acid sequence (Anfinsen, 1973). Nevertheless, it is now recognized that not all protein functions are associated with a folded state (Dunker et al., 2008, Dunker et al., 2001, Uversky, 2001, Tompa and Csermely, 2004, Tompa, 2002, Wright and Dyson, 1999). In some cases proteins must be unfolded or disordered to perform their functions (Gunasekaran et al., 2003). These proteins are called intrinsically disordered proteins (IDPs) and represent around 30% of protein sequences. Despite the existence of IDPs, an important aspect of understanding and interpreting the function of a given protein is the characterization of its molecular interactions. These interactions can be intramolecular (ionic bonds, covalent bonds, metallic bonds, etc.) or intermolecular (hydrogen bonds and other non-covalent bonds such as van der Waals forces). Knowledge of the 3-D structure of polypeptides gives researchers very important information from which to infer the function of the protein in the cell (Branden and Tooze, 1998, Laskowiski et al., 2005a, Laskowiski et al., 2005b, Lesk, 2002): structural functions; catalysis in chemical reactions; transport and storage; regulatory functions; gene transcription control; recognition functions.
Further details about protein function prediction can be found in Whisstock and Lesk (2003), Rentzsch and Orengo (2009) and Lee et al. (2007).
The experimental determination of protein structure is both expensive (due to the costs associated with crystallography, electron microscopy or NMR) and time consuming (Guntert, 2004). The difficulty of determining the 3-D structure of proteins has generated a large discrepancy between the volume of data (sequences of amino acid residues) generated by the Genome Projects and the number of 3-D protein structures currently known. Only a tiny portion of protein sequences have experimentally solved three-dimensional structures. This gap not only clearly illustrates the need for, but also motivates further research in, computational protein structure prediction methods. Over the last 10 years several computational methodologies, systems and algorithms have been proposed as solutions to the three-dimensional protein structure prediction (3-D PSP) problem (Bujnicki, 2006, Moult, 2005, Osguthorpe, 2000, Tramontano, 2006). These methods are divided into four classes, which are described in detail in this review (Floudas et al., 2006): (1) First principle methods without database information (Osguthorpe, 2000); (2) First principle methods with database information (Rohl et al., 2004, Srinivasan and Rose, 1995); (3) Fold recognition and threading methods (Bowie et al., 1991, Jones et al., 1992, Bryant and Altschul, 1995, Turcotte et al., 1998); and (4) Comparative modeling methods and sequence alignment strategies (Martí-Renom et al., 2000, Sánchez and Sali, 1997). The first group of methods aims at predicting new folds only through (computational) simulation of the physicochemical properties of the folding process of proteins in nature. The other groups comprise methods that are capable of fast and effective prediction of protein 3-D structures when known template structures and fold libraries are available (Kolinski, 2004).
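As a rough illustration of why comparative modeling depends on the availability of similar known structures, the following minimal sketch scores candidate templates by percent sequence identity over pre-aligned sequences. The function names, the template names and the 30% cutoff are assumptions made for this example; they are not taken from the methods reviewed here, and real pipelines use far more elaborate alignment and scoring schemes.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns in which both sequences carry a residue.
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != gap and b != gap]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


def best_template(alignments, min_identity=30.0):
    """Pick the most similar template.

    `alignments` maps a template name to a pair
    (aligned target sequence, aligned template sequence).
    `min_identity` is a hypothetical cutoff chosen for illustration only.
    Returns (name, identity), or None if no template reaches the cutoff.
    """
    scored = {name: percent_identity(t, s) for name, (t, s) in alignments.items()}
    if not scored:
        return None
    name = max(scored, key=scored.get)
    if scored[name] < min_identity:
        return None
    return name, scored[name]


if __name__ == "__main__":
    candidates = {
        "template_1": ("ACDE", "ACDF"),  # toy aligned pairs, not real proteins
        "template_2": ("ACDE", "WWWW"),
    }
    print(best_template(candidates))
```

When no template passes the cutoff, the sketch returns `None`, which corresponds to the situation in which template-free (first principle) methods would be needed.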
Predicting the correct 3-D structure of a protein molecule is an intricate and arduous task. The 3-D PSP and Protein Folding (PF) problems are classified in computational complexity theory as NP-complete problems (Crescenzi et al., 1998, Fraenkel, 1993, Hart and Istrail, 1997, Levinthal, 1968, Ngo et al., 1997), i.e., they are among the hardest problems in terms of computational requirements. For a formal definition of NP-completeness see Garey and Johnson (1979). This complexity arises because the folding process of a protein is highly selective: a long amino acid chain ends up in one out of a huge number of possible 3-D conformations. In contrast, the conformational preferences of single amino acid residues are weak. Thus, the high selectivity of protein folding is only possible through the interaction of many residues. Therefore, non-local interactions play an important role in protein three-dimensional structure, as local sequence–structure relationships are not absolute (Rackovsky, 2010). Ab initio methods (first principle methods without database information) can obtain novel and unknown protein folds. Nevertheless, the complexity and high dimensionality of the search space (Ngo et al., 1997), even for a small protein molecule, make the problem intractable (Levinthal, 1968). The direct simulation of protein folding in atomic detail, as used in Molecular Dynamics (MD), is not tractable for large proteins of medical and scientific interest (van Gunsteren and Berendsen, 1990) because of its high computational cost, despite efforts towards the development of distributed computing platforms. Homology modeling, on the other hand, does not suffer from this problem; however, it can only predict structures of protein sequences that are similar or nearly identical to sequences of known structure. Fold recognition via threading, in turn, is limited to the fold library derived from the Protein Data Bank (PDB) structures (Berman et al., 2000).
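The size of the conformational search space can be made concrete with a back-of-the-envelope estimate in the spirit of Levinthal's argument. The sketch below assumes, purely for illustration, two backbone torsion angles per peptide link, three discrete states per angle, and a sampling rate of 10^13 conformations per second; none of these numbers come from the text above. It works in log10 so that long chains do not overflow floating-point arithmetic.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def levinthal_log10(n_residues, states_per_angle=3, samples_per_second=1e13):
    """Return (log10 of the conformation count, log10 of years to enumerate them).

    All default parameters are illustrative assumptions.
    """
    if n_residues < 2:
        raise ValueError("need at least two residues")
    # Two backbone torsion angles (phi, psi) per peptide link.
    n_angles = 2 * (n_residues - 1)
    log10_conformations = n_angles * math.log10(states_per_angle)
    log10_years = (
        log10_conformations
        - math.log10(samples_per_second)
        - math.log10(SECONDS_PER_YEAR)
    )
    return log10_conformations, log10_years


if __name__ == "__main__":
    for n in (10, 50, 100):
        log_c, log_y = levinthal_log10(n)
        print(f"{n:4d} residues: ~10^{log_c:.0f} conformations, "
              f"~10^{log_y:.0f} years of exhaustive sampling")
```

Under these assumptions even a 100-residue chain yields an exponent above 90, which is why exhaustive enumeration is ruled out and why search heuristics and database-derived restraints are used instead.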
In order to tackle the computational complexity of the 3-D PSP problem, current 3-D protein structure prediction methods make use of a wide range of optimization algorithms (Klepeis et al., 2003). Metaheuristics are used to provide near-optimal solutions. In addition, considering the limitations of the four classes of protein structure prediction methods, researchers have recently developed hybrid methods that combine principles of the four classes, as can be observed in recent CASP editions (Moult et al., 2014, Moult et al., 2011). For example, the accuracy of homology modeling methods is combined with the capacity of ab initio methods to predict novel folds (Dhingra and Jayaram, 2013, Dorn et al., 2008, Fan and Mark, 2004). In order to reduce the complexity and the high dimensionality of the conformational search space inherent to ab initio methods, information about structural motifs found in known protein structures can be used to construct approximate conformations. These approximate conformations are expected to be sufficient to allow later refinement by means of Molecular Mechanics (MM) techniques such as MD simulation (van Gunsteren and Berendsen, 1990). In a refinement step, global interactions between all atoms in the molecule (including, e.g., non-bonded interactions) are evaluated, and deviations in the polypeptide main-chain and side-chain torsion angles can be corrected (Fan and Mark, 2004). This in turn reduces the total time that ab initio methods (which usually start from a fully extended conformation of the polypeptide) need to fold a sequence of unknown structure (Breda et al., 2007). First principle methods that make use of database information fall into this class. Such methods use protein structural information from existing databases in order to construct starting-point 3-D protein structures. Machine learning and data mining techniques are also applied in order to extract useful information from known protein 3-D structures.
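To illustrate how a metaheuristic searches for a near-optimal conformation, the following minimal sketch applies simulated annealing to a vector of torsion angles. The energy function is a hypothetical stand-in with a single minimum at -60 degrees per angle; it is not a force field, and the move size, cooling schedule and step count are arbitrary assumptions chosen only to keep the example small.

```python
import math
import random


def toy_energy(angles):
    """Hypothetical energy: zero when every torsion angle equals -60 degrees."""
    return sum(1.0 - math.cos(math.radians(a + 60.0)) for a in angles)


def simulated_annealing(n_angles, steps=20000, t_start=2.0, t_end=0.01, seed=0):
    """Minimize toy_energy over n_angles torsion angles (in degrees)."""
    rng = random.Random(seed)
    angles = [rng.uniform(-180.0, 180.0) for _ in range(n_angles)]
    energy = toy_energy(angles)
    best_angles, best_energy = list(angles), energy
    for step in range(steps):
        # Geometric cooling from t_start down to t_end.
        t = t_start * (t_end / t_start) ** (step / (steps - 1))
        # Perturb one randomly chosen angle and wrap it into [-180, 180).
        i = rng.randrange(n_angles)
        old = angles[i]
        angles[i] = (old + rng.gauss(0.0, 20.0) + 180.0) % 360.0 - 180.0
        new_energy = toy_energy(angles)
        delta = new_energy - energy
        # Metropolis criterion: always accept downhill moves,
        # accept uphill moves with probability exp(-delta / t).
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            energy = new_energy
            if energy < best_energy:
                best_angles, best_energy = list(angles), energy
        else:
            angles[i] = old  # reject the move
    return best_angles, best_energy


if __name__ == "__main__":
    angles, energy = simulated_annealing(10)
    print(f"best toy energy: {energy:.4f}")
```

In an actual prediction method the toy energy would be replaced by a physics-based or knowledge-based potential, and the starting angles could be taken from structural motifs in known proteins rather than drawn at random, which is the idea behind the database-assisted approaches described above.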
Our main goal is to review the methods and computational strategies that are currently used in 3-D protein structure prediction. To do so, we present the most important results needed to understand the four classes of prediction methods. The main contribution of this review is the organization and description of the main computational techniques and strategies currently used in the development of in silico methods for the 3-D PSP problem, which we expect to support the development of new computational methods and strategies for this problem. The review is structured as follows. In Section 2 the fundamental concepts of proteins are briefly described (readers familiar with these concepts can safely skip this section). Section 3 describes the four classes into which 3-D protein structure prediction methods and algorithms are classified. In addition, we present details of the main prediction methods and outline the computational strategies they employ. Section 4 concludes the paper and points out directions for further research.
Section snippets
On proteins, structure and representation
From a structural perspective, a protein is an ordered linear chain of building blocks known as amino acid residues. Each protein is defined by its unique sequence of amino acids. This sequence causes the protein to fold into a particular three-dimensional shape. Predicting the folded structure of a protein only from its amino acid sequence remains a challenging problem in mathematical optimization (Lander and Waterman, 1999). The challenge arises due to the combinatorial explosion of plausible
Three-dimensional protein structure prediction methods
The prediction of the 3-D structure of polypeptides based only on the amino acid sequence (primary structure) is a problem that has, over the last decades, challenged biochemists, biologists, computer scientists and mathematicians (Baxevanis and Quellette, 1990). The Protein Structure Prediction Problem (Creighton, 1990) is one of the main research problems in Structural Bioinformatics. The main challenge is to understand how the information encoded in the linear sequence of amino acid residues
Conclusions
The study of proteins and the prediction of their three-dimensional (3-D) structures is one of the key research problems in Structural Bioinformatics. Predicting the three-dimensional structure of a protein that has no templates in the Protein Data Bank is a very hard and sometimes virtually intractable task. In recent years, many computational methods, systems and algorithms have been developed with the purpose of solving this complex problem. However, the problem still challenges
Acknowledgements
This work was supported by grants from FAPERGS (002021-25.51/13) and MCT/CNPq (473692/2013-9) – Brazil.
References (471)
- et al. Biased probability Monte Carlo conformational searches and electrostatic calculations for peptides and proteins. J. Mol. Biol. (1994)
- et al. Basic local alignment search tool. J. Mol. Biol. (1990)
- et al. New advances in chemistry and materials science with CPMD and parallel computing. Parallel Comput. (2000)
- et al. Homology modeling by distance geometry. Fold. Des. (1996)
- et al. Multi-canonical algorithms for first order phase transitions. Phys. Lett. B (1991)
- et al. Potential energy functions for protein design. Curr. Opin. Struct. Biol. (2007)
- et al. Statistics of sequence-structure threading. Curr. Opin. Struct. Biol. (1995)
- et al. Structure-derived hydrophobic potential. Hydrophobic potential derived from X-ray structures of globular proteins is able to identify native folds. J. Mol. Biol. (1992)
- et al. Folding the main chain of small proteins with the genetic algorithm. J. Mol. Biol. (1994)
- et al. Structure prediction for CASP7 targets using extensive all-atom refinement with Rosetta@home. Proteins: Struct. Funct. Gen. (2007)
- Development and status of MINDO/3 and MNDO. J. Mol. Struct.
- A3N: an artificial neural network n-gram-based method to approximate 3-D polypeptides structure prediction. Expert Syst. Appl.
- Intrinsically disordered protein. J. Mol. Graph. Model.
- Function and structure of inherently disordered proteins. Curr. Opin. Struct. Biol.
- SMMP a modern package for simulation of proteins. Comput. Phys. Commun.
- An enhanced version of SMMP – open-source software package for simulation of proteins. Comput. Phys. Commun.
- Computer simulations of protein folding: classical trajectories by optimization of action. Comput. Phys. Commun.
- A method for determining reaction paths in large molecules: application to myoglobin. Chem. Phys. Lett.
- MOIL – a program for simulation of macromolecules. Comput. Phys. Commun.
- Why do globular proteins fit the limited set of folding patterns? Prog. Biophys. Mol. Biol.
- Servers for protein structure prediction. Curr. Opin. Struct. Biol.
- Advances in protein structure prediction and de novo protein design: a review. Chem. Eng. Sci.
- Fast Protein Fold Recognition Via Sequence to Structure Alignment and Contact Capacity Potentials
- Defining Bioinformatics and Structural Bioinformatics
- Issues in searching molecular sequence databases. Nat. Genet.
- Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res.
- BOINC: A System for Public-resource Computing and Storage
- Molecular dynamics on graphic processing units: HOOMD to the rescue. Comput. Sci. Eng.
- Principles that govern the folding of protein chains. Science
- The kinetics of formation of native ribonuclease during oxidation of the reduced polypeptide chain. Proc. Natl. Acad. Sci. U. S. A.
- Sequence alignment in molecular biology. J. Comput. Biol.
- The SWISS-MODEL workspace: a web-based environment for protein structure homology modeling. Bioinformatics
- Energetics of base pairs in B-DNA in solution: An appraisal of potential functions and dielectric treatments. J. Phys. Chem. B
- Protein Tertiary Structure Prediction Using Artificial Bee Colony Algorithm
- Knowledge-based model building of proteins: concepts and examples. Protein Sci.
- ProCKSI: a decision support system for protein (structure) comparison, knowledge, similarity and information. BMC Bioinf.
- Enhancement of protein modeling by human intervention in applying the automatic programs 3D-JIGSAW and 3D-PSSM. Proteins: Struct. Funct. Gen.
- Practical aspects of multiple sequence alignment. Methods Biochem. Anal.
- Bioinformatics: a practical guide to the analysis of genes and proteins
- Assessments of CASP8 structure predictions for template free targets. Proteins: Struct. Funct. Bioinf.
- The protein data bank. Nucleic Acids Res.
- Optimization of the additive CHARMM all-atom protein force field targeting improved sampling of the backbone ϕ, ψ and side-chain χ1 and χ2 dihedral angles. J. Chem. Theory Comput.
- SWISS-MODEL: modelling protein tertiary and quaternary structure using evolutionary information. Nucleic Acids Res.
- AMPLE: a cluster-and-truncate approach to solve the crystal structures of small proteins using rapidly computed ab initio models. Acta Crystallogr. Sect. D: Biol. Crystallogr.
- Knowledge-based prediction of protein structures and the design of novel molecules. Nature
- Ab initio protein structure prediction: progress and prospects. Annu. Rev. Biophys. Biomol. Struct.
- An evolutionary approach to folding small alpha-helical proteins that uses sequence information and empirical guiding fitness function. Proc. Natl. Acad. Sci. U. S. A.
- A method to identify protein sequences that fold into a known three-dimensional structure. Science
- Toward high-resolution de novo structure prediction for small proteins. Science
- PyMod: sequence similarity searches, multiple sequence-structure alignments, and homology modeling within PyMOL. BMC Bioinf.