Review Article
Three-dimensional protein structure prediction: Methods and computational strategies

https://doi.org/10.1016/j.compbiolchem.2014.10.001

Highlights

  • Predicting the correct three-dimensional structure of a protein molecule is an intricate and arduous task.

  • First principle methods without database information.

  • First principle methods with database information.

  • Fold recognition and threading methods.

  • Comparative modeling methods and sequence alignment strategies.

Abstract

A long-standing problem in structural bioinformatics is to determine the three-dimensional (3-D) structure of a protein when only a sequence of amino acid residues is given. Many computational methodologies and algorithms have been proposed as a solution to the 3-D Protein Structure Prediction (3-D-PSP) problem. These methods can be divided into four main classes: (a) first principle methods without database information; (b) first principle methods with database information; (c) fold recognition and threading methods; and (d) comparative modeling methods and sequence alignment strategies. Deterministic computational techniques, optimization techniques, data mining and machine learning approaches are typically used in the construction of computational solutions for the PSP problem. Our main goal with this work is to review the methods and computational strategies that are currently used in 3-D protein structure prediction.

Introduction

Structural Bioinformatics is one of the key research areas in the field of Computational Biology (Zhang et al., 2005, Altman and Dugan, 2005, Clote and Backofen, 2000, Pevzner, 2000, Liljas et al., 2001, Gopakumar, 2012). Structural Bioinformatics concerns the analysis and prediction of three-dimensional (3-D) structures of biological macromolecules such as proteins, RNA and DNA (Zhang et al., 2005, Altman and Dugan, 2005). This structural information corresponds to 3-D macromolecular structures obtained through different experimental methods such as protein crystallography (X-ray diffraction), electron microscopy or nuclear magnetic resonance (NMR). This information allows one to study folds and local motifs in proteins, molecular folding, evolution and structure/function relationships.

One of the main research problems in structural bioinformatics is the prediction of three-dimensional protein structures. Proteins are long sequences formed out of 20 different amino acid residues that in physiological conditions adopt a unique 3-D structure (Anfinsen et al., 1961). Knowledge of the protein structure allows the investigation of biological processes more directly, with higher resolution and finer detail. The sequence–protein–structure paradigm (also known as the “lock-and-key” hypothesis) says that a protein can achieve its biological function only by folding into a unique, structured state determined by its amino acid sequence (Anfinsen, 1973). Nevertheless, it is now recognized that not all protein functions are associated with a folded state (Dunker et al., 2008, Dunker et al., 2001, Uversky, 2001, Tompa and Csermely, 2004, Tompa, 2002, Wright and Dyson, 1999). In some cases proteins must be unfolded or disordered to perform their functions (Gunasekaran et al., 2003). These proteins are called intrinsically disordered proteins (IDPs) and represent around 30% of the protein sequences. Despite the existence of IDPs, an important aspect of understanding and interpreting the function of a given protein involves characterizing molecular interactions. These interactions can be intramolecular (ionic bonds, covalent bonds, metallic bonds, etc.) or intermolecular (hydrogen bonds and other non-covalent bonds such as van der Waals forces). Knowledge of the 3-D structure of polypeptides gives researchers important information for inferring the function of the protein in the cell (Branden and Tooze, 1998, Laskowiski et al., 2005a, Laskowiski et al., 2005b, Lesk, 2002): structural functions; catalysis in chemical reactions; transport and storage; regulatory functions; gene transcription control; recognition functions. Further details about protein function prediction can be found in Whisstock and Lesk (2003), Rentzsch and Orengo (2009) and Lee et al. (2007).

The determination of protein structure is both experimentally expensive (due to the costs associated with crystallography, electron microscopy or NMR) and time consuming (Guntert, 2004). The difficulty of determining the 3-D structure of proteins has generated a large discrepancy between the volume of data (sequences of amino acid residues) generated by the Genome Projects and the number of 3-D structures of proteins which are currently known. Only a tiny portion of protein sequences have experimentally solved three-dimensional structures. This gap not only illustrates the need for computational protein structure prediction methods, but also motivates further research on them. Over the last 10 years several computational methodologies, systems and algorithms have been proposed as a solution to the three-dimensional protein structure prediction (3-D PSP) problem (Bujnicki, 2006, Moult, 2005, Osguthorpe, 2000, Tramontano, 2006). These methods are divided into four classes, which are described in detail in this review (Floudas et al., 2006): (1) first principle methods without database information (Osguthorpe, 2000); (2) first principle methods with database information (Rohl et al., 2004, Srinivasan and Rose, 1995); (3) fold recognition and threading methods (Bowie et al., 1991, Jones et al., 1992, Bryant and Altschul, 1995, Turcotte et al., 1998); and (4) comparative modeling methods and sequence alignment strategies (Martí-Renom et al., 2000, Sánchez and Sali, 1997). The first group of methods aims at predicting new folds only through (computational) simulation of the physicochemical properties of the folding process of proteins in nature. The other groups represent the methods that are capable of fast and effective prediction of protein 3-D structures when known template structures and fold libraries are available (Kolinski, 2004).
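Class (4) relies on aligning a target sequence against sequences of known structure. As an illustration only, the following sketch implements a basic global alignment by dynamic programming (the Needleman-Wunsch scheme) together with a percent-identity measure. The function names and the scoring values (match = 1, mismatch = -1, gap = -1) are assumptions chosen for readability; practical comparative modeling pipelines typically use substitution matrices and more elaborate gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of sequences a and b (toy scoring; values are assumptions)."""
    n, m = len(a), len(b)

    def sub(i, j):
        # Substitution score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


def percent_identity(top, bottom):
    """Identical aligned residues as a percentage of alignment columns."""
    same = sum(1 for x, y in zip(top, bottom) if x == y and x != "-")
    return 100.0 * same / len(top)


if __name__ == "__main__":
    s, top, bottom = needleman_wunsch("ACDE", "ACE")
    print(s, top, bottom, percent_identity(top, bottom))
```

In this toy setting, a high percent identity between a target and a sequence of known structure is the kind of signal that would suggest a usable template; the threshold at which that holds is not specified here.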

Predicting the correct 3-D structure of a protein molecule is an intricate and arduous task. The 3-D PSP and Protein Folding (PF) problems are classified in computational complexity theory as NP-complete problems (Crescenzi et al., 1998, Fraenkel, 1993, Hart and Istrail, 1997, Levinthal, 1968, Ngo et al., 1997), i.e., they are among the hardest problems in terms of computational requirements. For a formal definition of NP-completeness see Garey and Johnson (1979). This complexity is due to the folding process of a protein being highly selective. A long amino acid chain ends up in one out of a huge number of 3-D conformations. In contrast, the conformational preferences of single amino acid residues are weak. Thus, the high selectivity of protein folding is only possible through the interaction of many residues. Therefore, non-local interactions play an important role in protein three-dimensional structure, as local sequence–structure relationships are not absolute (Rackovsky, 2010). Ab initio methods (first principle methods without database information) can obtain novel and unknown protein folds. Nevertheless, the complexity and the high dimensionality of the search space (Ngo et al., 1997), even for a small protein molecule, make the problem intractable (Levinthal, 1968). The direct simulation of protein folding in atomic detail, as used in Molecular Dynamics (MD), is not tractable for large proteins of medical and scientific interest (van Gunsteren and Berendsen, 1990) due to high computational costs, despite the efforts towards the development of distributed computing platforms. On the other hand, homology modelling does not suffer from such problems; however, it can only predict structures of protein sequences which are similar or nearly identical to other sequences of known structures. Fold recognition via threading, in turn, is limited to the fold library derived from the Protein Data Bank (PDB) structures (Berman et al., 2000).
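To give a feel for the size of the conformational search space mentioned above, the following back-of-the-envelope sketch counts discrete backbone conformations in the spirit of Levinthal's argument. All numbers are assumptions chosen for illustration, not values taken from this review: two backbone torsion angles per residue, three allowed states per angle, and a hypothetical sampling rate of 10^13 conformations per second.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def conformation_count(n_residues, states_per_angle=3, angles_per_residue=2):
    """Number of discrete conformations under the toy assumptions above."""
    return states_per_angle ** (angles_per_residue * n_residues)


def exhaustive_search_years(n_residues, samples_per_second=1e13):
    """Years needed to enumerate every conformation at the assumed rate."""
    seconds = conformation_count(n_residues) / samples_per_second
    return seconds / SECONDS_PER_YEAR


if __name__ == "__main__":
    for n in (10, 50, 100):
        print(n, conformation_count(n), f"{exhaustive_search_years(n):.3e} years")
```

Even with these coarse assumptions, the count grows exponentially with chain length, which is why exhaustive enumeration is ruled out and heuristic search is used instead.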

In order to tackle the computational complexity of the 3-D PSP problem, current 3-D protein structure prediction methods make use of a wide range of optimization algorithms (Klepeis et al., 2003). Metaheuristics are used to provide near-optimal solutions. In addition, considering the limitations of the four classes of protein structure prediction methods, researchers have recently developed hybrid methods which combine principles of the four classes, as can be observed in recent CASP editions (Moult et al., 2014, Moult et al., 2011). For example, the accuracy of homology modeling methods is combined with the capacity of ab initio methods to predict novel folds (Dhingra and Jayaram, 2013, Dorn et al., 2008, Fan and Mark, 2004). In order to reduce the complexity and the high dimensionality of the conformational search space inherent to ab initio methods, information about structural motifs found in known protein structures can be used to construct approximate conformations. These approximate conformations are expected to be sufficient to allow later refinement by means of Molecular Mechanics (MM) methods such as MD simulation (van Gunsteren and Berendsen, 1990). In a refinement step, global interactions between all atoms in the molecule (including e.g. non-bonded interactions) are evaluated, and deviations in the polypeptide main-chain and side-chain torsion angles can be corrected (Fan and Mark, 2004). Starting from approximate conformations in turn reduces the total time spent by ab initio methods, which usually start from a fully extended conformation of a polypeptide, to fold a sequence of unknown structure (Breda et al., 2007). The first principle methods that make use of database information follow this approach. Such methods use protein structural information from existing databases in order to construct starting-point 3-D protein structures. Machine learning and data mining techniques are also applied in order to extract useful information from known protein 3-D structures.
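As a rough illustration of how a metaheuristic searches for near-optimal conformations, the sketch below applies simulated annealing to a toy two-dimensional hydrophobic-polar (HP) lattice model. The model, the move set, the cooling schedule and every function name are assumptions made for this example; they are not the methods reviewed here, which operate on far more detailed representations and energy functions.

```python
import math
import random

MOVES = "LFR"  # relative moves on a square lattice: left turn, forward, right turn


def fold(moves):
    """Lattice coordinates of the chain, or None if it overlaps itself."""
    coords = [(0, 0), (1, 0)]        # the first bond is fixed along +x
    occupied = set(coords)
    x, y, dx, dy = 1, 0, 1, 0
    for m in moves:
        if m == "L":
            dx, dy = -dy, dx         # counterclockwise turn
        elif m == "R":
            dx, dy = dy, -dx         # clockwise turn
        x, y = x + dx, y + dy
        if (x, y) in occupied:
            return None
        occupied.add((x, y))
        coords.append((x, y))
    return coords


def energy(sequence, coords):
    """Toy HP energy: -1 per lattice contact between non-bonded H residues."""
    index = {c: i for i, c in enumerate(coords)}
    e = 0
    for i, (x, y) in enumerate(coords):
        if sequence[i] != "H":
            continue
        for nb in ((x + 1, y), (x, y + 1)):   # right/up only: each pair once
            j = index.get(nb)
            if j is not None and abs(i - j) > 1 and sequence[j] == "H":
                e -= 1
    return e


def anneal(sequence, steps=20000, t_start=2.0, t_end=0.05, seed=0):
    """Simulated annealing over move strings; returns (best energy, best moves)."""
    rng = random.Random(seed)
    moves = ["F"] * (len(sequence) - 2)       # start from a fully extended chain
    cur_e = energy(sequence, fold(moves))
    best_e, best_moves = cur_e, moves[:]
    for step in range(steps):
        t = t_start * (t_end / t_start) ** (step / steps)  # geometric cooling
        cand = moves[:]
        cand[rng.randrange(len(cand))] = rng.choice(MOVES)
        coords = fold(cand)
        if coords is None:                    # reject self-overlapping chains
            continue
        cand_e = energy(sequence, coords)
        # Metropolis criterion: always accept improvements, sometimes accept worse.
        if cand_e <= cur_e or rng.random() < math.exp((cur_e - cand_e) / t):
            moves, cur_e = cand, cand_e
            if cur_e < best_e:
                best_e, best_moves = cur_e, moves[:]
    return best_e, "".join(best_moves)


if __name__ == "__main__":
    print(anneal("HPHPPHHPHPPHPHHPPHPH"))
```

The search starts from a fully extended chain, mirroring the starting point usually attributed to ab initio methods, and returns the lowest-energy conformation it happened to visit, with no guarantee that this is the global minimum.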

Our main goal is to review the methods and computational strategies that are currently used in 3-D protein structure prediction. In order to do so, we present the most important results needed to understand the four classes of prediction methods. The main contribution of this review is the organization and description of the main computational techniques and strategies that are currently used in the development of in silico methods for the 3-D PSP problem. This should support the development of new computational methods and strategies for the 3-D PSP problem. The review is structured as follows. In Section 2 the fundamental concepts of proteins are briefly described (readers familiar with these concepts can skip this section). Section 3 describes the four classes into which the 3-D protein structure prediction methods and algorithms are classified. In addition, we present details of the main prediction methods and outline the computational strategies they employ. Section 4 concludes the paper and points out directions for further research.

Section snippets

On proteins, structure and representation

From a structural perspective, a protein is an ordered linear chain of building blocks known as amino acid residues. Each protein is defined by its unique sequence of amino acids. This sequence causes the protein to fold into a particular three-dimensional shape. Predicting the folded structure of a protein only from its amino acid sequence remains a challenging problem in mathematical optimization (Lander and Waterman, 1999). The challenge arises due to the combinatorial explosion of plausible

Three-dimensional protein structure prediction methods

The prediction of the 3-D structure of polypeptides based only on the amino acid sequence (primary structure) is a problem that has, over the last decades, challenged biochemists, biologists, computer scientists and mathematicians (Baxevanis and Quellette, 1990). The Protein Structure Prediction Problem (Creighton, 1990) is one of the main research problems in Structural Bioinformatics. The main challenge is to understand how the information encoded in the linear sequence of amino acid residues

Conclusions

The study of proteins and the prediction of their three-dimensional (3-D) structures is one of the key research problems in Structural Bioinformatics. Predicting the three-dimensional structure of a protein that has no templates in the Protein Data Bank is a very hard and sometimes virtually intractable task. In recent years, many computational methods, systems and algorithms have been developed with the purpose of solving this complex problem. However, the problem still challenges

Acknowledgements

This work was supported by grants from FAPERGS (002021-25.51/13) and MCT/CNPq (473692/2013-9) – Brazil.

References (471)

  • M. Dewar. Development and status of mindo/3 and mndo. J. Mol. Struct. (1983)
  • M. Dorn et al. A3n: an artificial neural network n-gram-based method to approximate 3-d polypeptides structure prediction. Expert Syst. Appl. (2010)
  • A. Dunker et al. Intrinsically disordered protein. J. Mol. Graph. Model. (2001)
  • A. Dunker et al. Function and structure of inherently disordered proteins. Curr. Opin. Struct. Biol. (2008)
  • F. Eisenmenger et al. SMMP a modern package for simulation of proteins. Comput. Phys. Commun. (2001)
  • F. Eisenmenger et al. An enhanced version of SMMP – open-source software package for simulation of proteins. Comput. Phys. Commun. (2006)
  • R. Elber. Computer simulations of protein folding: classical trajectories by optimization of action. Comput. Phys. Commun. (2005)
  • R. Elber et al. A method for determining reaction paths in large molecules: application to myoglobin. Chem. Phys. Lett. (1987)
  • R. Elber et al. Moil – a program for simulation of macromolecules. Comput. Phys. Commun. (1995)
  • A. Finkelstein et al. Why do globular proteins fit the limited set of folding patterns? Prog. Biophys. Mol. Biol. (1987)
  • D. Fischer. Servers for protein structure prediction. Curr. Opin. Struct. Biol. (2006)
  • C. Floudas et al. Advances in protein structure prediction and de novo protein design: a review. Chem. Eng. Sci. (2006)
  • N. Alexandrov et al. Fast Protein Fold Recognition Via Sequence to Structure Alignment and Contact Capacity Potentials (1996)
  • R.B. Altman et al. Defining Bioinformatics and Structural Bioinformatics (2005)
  • S. Altschul et al. Issues in searching molecular sequence databases. Nat. Genet. (1994)
  • S. Altschul et al. Gapped blast and psi-blast: a new generation of protein database search programs. Nucleic Acids Res. (1997)
  • D. Anderson. BOINC: A System for Public-resource Computing and Storage (2004)
  • J. Anderson et al. Molecular dynamics on graphic processing units: Hoomd to the rescue. Comput. Sci. Eng. (2008)
  • C. Anfinsen. Principles that govern the folding of protein chains. Science (1973)
  • C. Anfinsen et al. The kinetics of formation of native ribonuclease during oxidation of the reduced polypeptide chain. Proc. Natl. Acad. Sci. U. S. A. (1961)
  • A. Apostolico et al. Sequence alignment in molecular biology. J. Comput. Biol. (1998)
  • K. Arnold et al. The swiss-model workspace: a web-based environment for protein structure homology modeling. Bioinformatics (2006)
  • N. Arora et al. Energetics of base pairs in b-dna in solution: An appraisal of potential functions and dielectric treatments. J. Phys. Chem. B (1998)
  • H. Bahamish et al. Protein Tertiary Structure Prediction Using Artificial Bee Colony Algorithm (2009)
  • J. Bajorath et al. Knowledge-based model building of proteins: concepts and examples. Protein Sci. (1994)
  • D. Barthel et al. Procksi: a decision support system for protein (structure) comparison, knowledge, similarity and information. BMC Bioinf. (2007)
  • P.A. Bates et al. Enhancement of protein modeling by human intervention in applying the automatic programs 3d-jigsaw and 3d-pssm. Proteins: Struct. Funct. Gen. (2001)
  • A. Baxevanis. Practical aspects of multiple sequence alignment. Methods Biochem. Anal. (1998)
  • A. Baxevanis et al. Bioinformatics: a practical guide to the analysis of genes and proteins (1990)
  • M. Ben-David et al. Assessments of casp8 structure predictions for template free targets. Proteins: Struct. Funct. Bioinf. (2009)
  • H. Berman et al. The protein data bank. Nucleic Acids Res. (2000)
  • R.B. Best et al. Optimization of the additive charmm all-atom protein force field targeting improved sampling of the backbone ϕ, ψ and side-chain χ1 and χ2 dihedral angles. J. Chem. Theory Comput. (2012)
  • M. Biasini et al. Swiss-model: modelling protein tertiary and quaternary structure using evolutionary information. Nucleic Acids Res. (2014)
  • J. Bibby et al. Ample: a cluster-and-truncate approach to solve the crystal structures of small proteins using rapidly computed ab initio models. Acta Crystallogr. Sect. D: Biol. Crystallogr. (2012)
  • T. Blundell et al. Knowledge-based prediction of protein structures and the design of novel molecules. Nature (1987)
  • R. Bonneau et al. Ab initio protein structure prediction: progress and prospects. Annu. Rev. Biophys. Biomol. Struct. (2001)
  • J.U. Bowie et al. An evolutionary approach to folding small alpha-helical proteins that uses sequence information and empirical guiding fitness function. Proc. Natl. Acad. Sci. U. S. A. (1994)
  • J.U. Bowie et al. A method to identify protein sequences that fold into a known three-dimensional structure. Science (1991)
  • P. Bradley et al. Toward high-resolution de novo structure prediction for small proteins. Science (2005)
  • E. Bramucci et al. Pymod: sequence similarity searches, multiple sequence-structure alignments, and homology modeling within pymol. BMC Bioinf. (2012)