Volume 15, Issue 4 p. 808-824
research-article
Free Access

A comparative study of available software for high-accuracy homology modeling: From sequence alignments to structural models

Akbar Nayeem

Corresponding Author

Akbar Nayeem

Computer-Assisted Drug Design, Pharmaceutical Research Institute, Bristol-Myers Squibb, Princeton, New Jersey 08543, USA

Bristol-Myers Squibb, P.O. Box 5400, Princeton, NJ 08543, USA; fax: (609) 818-3545.Search for more papers by this author
Doree Sitkoff

Doree Sitkoff

Computer-Assisted Drug Design, Pharmaceutical Research Institute, Bristol-Myers Squibb, Princeton, New Jersey 08543, USA

Search for more papers by this author
Stanley Krystek Jr.

Stanley Krystek Jr.

Computer-Assisted Drug Design, Pharmaceutical Research Institute, Bristol-Myers Squibb, Princeton, New Jersey 08543, USA

Search for more papers by this author
First published: 01 January 2009
Citations: 111

Abstract

An open question in protein homology modeling is, how well do current modeling packages satisfy the dual criteria of quality of results and practical ease of use? To address this question objectively, we examined homology-built models of a variety of therapeutically relevant proteins. The sequence identities across these proteins range from 19% to 76%. A novel metric, the difference alignment index (DAI), is developed to aid in quantifying the quality of local sequence alignments. The DAI is also used to construct the relative sequence alignment (RSA), a new representation of global sequence alignment that facilitates comparison of sequence alignments from different methods. Comparisons of the sequence alignments in terms of the RSA and alignment methodologies are made to better understand the advantages and caveats of each method. All sequence alignments and corresponding 3D models are compared to their respective structure-based alignments and crystal structures. A variety of protein modeling software was used. We find that at sequence identities >40%, all packages give similar (and satisfactory) results; at lower sequence identities (<25%), the sequence alignments generated by Profit and Prime, which incorporate structural information in their sequence alignment, stand out from the rest. Moreover, the model generated by Prime in this low sequence identity region is noted to be superior to the rest. Additionally, we note that DSModeler and MOE, which generate reasonable models for sequence identities >25%, are significantly more functional and easier to use when compared with the other structure-building software.

Abbreviations: NMR, nuclear magnetic resonance; PDB, Protein Data Bank; SCOP, structural classification of proteins; MOE, molecular operating environment; RMSD, root-mean-squared deviation; RSA, relative sequence alignment; DAI, difference alignment index; MMP, matrix metalloprotease; NHR, nuclear hormone receptor; SCD, short-chain dehydrogenase; CASP, critical assessment of techniques for protein structure prediction; GPCR, G-protein coupled receptor.

Despite significant progress in X-ray crystallography and high-field NMR spectroscopy for solving protein structures (Berman et al. 2000; also http://www.rcsb.org/pdb/holdings.html), structures for many therapeutically relevant target receptors remain unavailable. Such structures are particularly desirable in the early phases of drug discovery projects before experimental structural methods are in place, and one therefore has to rely on comparative modeling based on homologous templates for constructing three-dimensional (3D) models (Hillisch and Hilgenfeld 2003; Hillisch et al. 2004). The performance of homology modeling is well understood and documented (Johnson et al. 1994; Congreve et al. 2005). The method relies on the observation that in nature the structural conformation of a protein is more highly conserved than its amino acid sequence, and that small or medium changes in sequence typically result in only small changes in the 3D structure (Lesk and Chothia 1986). Several improvements and modifications of this general homology modeling strategy have been developed and applied to the prediction of protein structures. Indeed, Ring et al. (1993) have identified inhibitors of serine and cysteine proteases on the basis of homology models. Schafferhans and Klebe (2001) have used homology models to correctly identify crystal-binding modes of bound ligands. Vangrevelinghe et al. (2003) were able to use homology modeling combined with virtual screening to identify potent inhibitors from a Novartis collection of 400,000 compounds. A particularly interesting and recent example, where homology modeling is pushed to its limits, is provided by the successful application of homology modeling to virtual screening for GPCR (G-protein coupled receptor) antagonists (Evers and Klebe 2004; Evers and Klabunde 2005). Subjecting the available structure prediction methods to a blind test, community-wide experiments on the critical assessment of techniques for protein structure prediction (CASP 1-5) have been performed, and their results presented and published (Tramontano and Morea 2003). As a result of CASP, the current state-of-the-art in protein structure prediction has been established, the progress made has been documented, and the areas where future efforts might be most productively concentrated have been highlighted.

Homology modeling techniques have been captured in a number of publicly available programs, both in the commercial and public realms. For users of these homology modeling tools, an important question is whether any of the available programs distinguish themselves from the rest in terms of performance and/or usability. As modelers in the pharmaceutical field, our interest in this area is particularly strong in that in the absence of X-ray or NMR structures, good and reliable homology models are needed for the docking and scoring (virtual screening) of 3D databases of compounds (lead discovery) and for lead optimization (see Oshiro et al. 2004, and references therein). A recent paper on this topic (Wallner and Elofsson 2005) evaluated the performance of six different homology modeling programs. This study compared 3D models constructed from the various homology modeling programs starting from a common sequence alignment. The work is an important step toward evaluating the relative strengths and weaknesses of the model building capabilities of the modeling packages available in the field. Absent from the article, however, was a discussion of the relative sequence alignment capacities of the programs. Moreover, several software tools commonly used in pharmaceutical industrial modeling were not considered in that work. Finally, the sequence identities of the target/template pairs studied ranged from 30% to 100%, and it is of interest to consider how well software can perform below 30% sequence identity.

Building on this recent report, the present study provides a systematic analysis of the performance of commonly used (and commercially available) homology modeling software for therapeutically relevant proteins in terms of comparing both (1) the sequence alignments, and (2) the 3D models generated by them. Alignments are assessed by comparing to a structure-based alignment, derived from structural overlays of the target and template. A novel metric is developed to aid in quantifying the quality of local sequence alignments. Model quality is addressed directly by comparing the model and crystal structure in terms of the root-mean-squared deviation (RMSD). The models for each target are built based upon a single common alignment to avoid differences in alignments affecting the final models. The question of usability is somewhat subjective and can vary from user to user, since different users may have different levels of experience with the packages. In order to alleviate the effects of subjective bias in evaluating them, the authors of this work, who have different levels of experience with the packages, assessed the packages independently and tabulated our findings. The results reported for usability should be considered an average that is hoped to reduce effects of personal bias to the extent possible. The software packages used in this study for sequence alignment and model building are shown in Table 1. Since sequence alignment and model building were evaluated independently, Table 1 has two columns. This table also shows the input and output parameters and the criteria used for evaluating the quality of the results.

In this paper, we document the results of a retrospective analysis of sequence alignments and homology models from the various packages shown in Table 1 using seven datasets from the Protein Data Bank (Berman et al. 2000). The datasets were chosen to cover a wide structural range of proteins with sequence identities ranging between 19% and 76%, all of clinical interest (Table 2). Each set consists of a target protein and a template protein, the aim being to build a model of the target based on the template. Although the structures of both target and template are known, for the purpose of this study the structure of the target will be treated as an “unknown” and will only be used for comparing the final model with the “known” target structure in order to assess model accuracy.

Our analysis shows that when the sequence identities are >40%, the homology models derived from the different packages are comparable to each other. When the sequence identity is lower, the results tend to vary, with some packages performing noticeably better than others. While such models often correctly predict the overall fold, in regions of low sequence homology the model is usually less accurate. This is of special concern when poorly modeled regions happen to be at or near the ligand binding site. Next, the question of usability is raised and discussed in terms of the features one expects from homology modeling software, highlighting the relative strengths and weaknesses of the software tools. Finally, possible reasons for the better performance of some packages over others are discussed, along with the utility of homology models when sequence identities are low.

Materials and methods

Choice of protein classes and sequence pairs

Seven sets of target-template pairs were selected to cover a broad range of protein structural and functional classes, with sequence identities ranging from 19% to 76%. The test sets were chosen to represent a range of therapeutically interesting and, at the same time, structurally diverse proteins with X-ray resolution ≥2.5 Å. The datasets chosen are shown in Table 2. For each pair, the functional class, structural topology as classified by SCOP (Murzin et al. 1995; Lo Conte et al. 2002; Andreeva et al. 2004), and percent sequence identity from structural alignments between target and template are shown. For each target shown, the corresponding template was selected from the same structural class. When more than a single template was available, the template chosen was that which had the highest sequence identity and a complete set of coordinates. Both the CDK kinases (CDK-2 and CDK-6) were in the inactive conformation, as judged by the absence of phophorylated residues in the PDB files and the conformations of their activation loops. In the case of the nuclear hormone receptors Liver-X receptor (1P8D, target) and Retinoid Acid Receptor Gamma (2LBD, template), the agonist conformation of the AF2 helix (which in fact was the only conformation available) was used. The “Target” column contains the name of the target protein and the PDB code for the structural coordinates of that protein in parentheses. A similar column is provided for the “Template” protein.

The column marked “RMSD” reflects the results of a structural superposition of the target structure with respect to the template. The results of these superpositions are shown in Figure 1, where the structure in green denotes the template, and the target is displayed in red. The data in the RMSD column in Table 2 contain the root-mean-squared deviation (in Å) of the target protein with respect to the template protein when optimally superimposed, taken over the overlapping Cα coordinates. Following this number, the number in the next column (nres) denotes the number of residues over which the RMSD was calculated (hence the number of equivalent residues), and the two numbers that follow denote what percentage of the full sequence is covered by the alignment. For example, for the “aspartyl protease” set, Table 2 indicates that this set belongs to the all-β structural class of proteins, and when the target (renin, PDB code 1HRN) is structurally superimposed on the template (pepsin, PDB code 1MPP), there are 274 residues that align. The Cα RMSD over these 274 residues is 1.9 Å, and the 274 residues cover 82% of the number of residues in renin and 77% of the residues in pepsin. The sequence identity measured over these 274 residues is 24%.

Software

The commercial software packages used in this paper for comparing sequence alignments and homology models are shown in Table 1. For each of the two studies (sequence alignments and model construction), a commonly used software program was chosen that served as a benchmark for comparison. For sequence alignment studies, the benchmark used was CLUSTALW (Higgins et al. 1994; Chenna et al. 2003), and for model building studies it was SWISS-MODEL (Schwede et al. 2003). Both of these programs are accessible through their Web sites and are freely available.

It should be noted that, in order to compare the results as objectively as possible, all software packages were run with their default parameters. However, in cases where the option of including a ligand and/or cofactor in the model building process was available, it was used. No attempt was made to optimize the model beyond what was automatically done by the program. The results shown here should therefore be interpreted with the following caveat in mind: The programs could give different results (than those shown here) at the hands of an experienced user who uses other than the default settings for sequence alignment and/or model building, especially since some of the default parameters may not be optimal. Nonetheless, for a user with a limited budget trying to decide between the various programs, the results shown here provide a source of guidance.

A brief mention of the essential features of each of these software packages is made here since these features will be relevant to a comparative discussion of the results later. These descriptions are not meant to be exhaustive; please see the software documentation for complete descriptions and references.

CLUSTALW

This is a program used to produce a multiple sequence alignment, designed by Higgins et al. (1994) for proteins and nucleic acids. It works by first constructing a similarity tree based on pairwise sequence alignments, and then combining the alignments in the order specified by the tree (for detailed explanation, see Chenna et al. 2003). Using the Web form (for example, http://www.ebi. ac.uk/clustalw/#), the user need only input or upload a file of the sequences that they want to align in an accepted format. The other options on the form are set to the default values for producing a multiple alignment. The user can use the defaults, or they can make some changes on the form to customize their run. A multiple sequence alignment of the sequences submitted will be returned to the user (.aln file).

Sybyl

The Biopolymer module of Sybyl (distributed by Tripos) was used for the sequence alignment study, whereas the Composer module of Sybyl (Sutcliffe et al. 1987a, b) was used for model construction. The sequence alignment step used the Needleman and Wüncsh algorithm (Needleman and Wüncsh 1970) with the Dayhoff similarity matrix (Dayhoff et al. 1979). The Composer program uses a multiple sequence alignment from structurally aligned homologs to build a complete protein model, including loops. For success, it depends upon the accurate identification of structurally conserved and structurally variable regions that are deduced from the multiple structural alignment of the homologs; the regions are then used to construct the homology model based upon a fragment-assembly approach. However, since in this study only a single template was provided for model building, the structurally conserved regions were artificially defined to be those regions in the target–template alignment where there were no gaps.

GENEMINE/LOOK

The Molecular Applications group developed LOOK, which was ultimately renamed GENEMINE (version 3.5). The default GENEMINE/LOOK alignment uses a modification of the CLUSTALW algorithm to construct alignments. Alignments are based upon a “master” sequence to which all other sequences are aligned. Homology modeling in GENEMINE/LOOK (GENEMINE) uses the SEGMOD, segment match modeling protocol, developed by Levitt (1992). Briefly, the sequence to be modeled is divided into short segments. Corresponding structural fragments are taken from a structural database and then matched to the sequence, and then the fragments are fitted onto the framework of the template structure. The process is reiterated, building 10 independent models. The 10 models are then averaged and undergo stereochemical refinement to minimize conformational strain.

MOE

The software MOE (Molecular Operating Environment) is a suite of comprehensive software tools developed by Chemical Computing Group Inc. (CCG). MOE software has numerous protein modeling and bioinformatics algorithms within the software suite. The sequence alignment feature in MOE (MOE-Align), though available, was not tested in this study. The homology modeling algorithm within MOE consists of the following steps: (1) Initial partial geometry, where all coordinates are copied if residue identity is conserved; (2) Boltzmann-weighted randomized sampling, which consists of a data-collection step and model building step. During data collection, backbone fragments are collected from a high-resolution structural database, and alternative side chain conformations for non-identical residues are assembled from an extensive rotamer library. During model building, a number of independent models are created based upon loop and side chain placements scored by a contact energy function. Optionally, bound ligands and conserved waters can be included in the structural template. The final model is either a minimized Cartesian average structure or the best intermediate model.

Prime

Developed by Schrödinger, LLC, Prime is a protein structure prediction suite. Prime calculates alignments using a combination of sequence and secondary structure information. Structures are built using atom positions from the aligned portions of the template(s), taking solvent, ligand, force field, and other contributions into account via a series of algorithms. Briefly, the OPLS2000 all-atom force field is used for energy scoring of proteins; the OPLS2001 force field for ligands and other non-amino acid residues, a Surface Generalized Born (SGB) continuum solvation model, is used for treating solvation effects; and side chain rotomer and backbone dihedral libraries derived from PDB non-redundant structures are used for building backbone and side chains. Portions of the query sequence that do not align to the template, such as loops, are built using an ab initio procedure that incorporates solvation (Jacobson et al. 2004).

DSModeler

The protein homology modeling program DSModeler, distributed by Accelrys Software Inc., is a graphical interface wrapped around the software tool MODELLER (Sali and Blundell 1993). DSModeler produces protein homology models, given a template(s) and sequence alignment. The structures are predicted based on distance restraints obtained from the template, from the database of crystal structures in the PDB, and from a molecular force field. Ligand structures and constraints such as disulfide bonds and cis-prolines can be incorporated into the model building step. Loops are generated de novo, by a process that incorporates knowledge-based potentials from known crystal structures.

Insight/Homology

Only the alignment portion of Insight's Insight II Homology tools (distributed by Accelrys) is examined here. For simple alignment based on a single query and template, the Needleman and Wüncsh algorithm is used (Needleman and Wüncsh 1970). Insight has additional algorithms to detect structurally conserved regions given multiple templates and to use this information in the alignment process; however, these were not utilized in the examples described in this paper since they are not the software defaults and we were not comparing the multiple template alignment algorithms.

ICM

The ICM (Internal Coordinate Mechanics) software was designed to predict low energy conformations of bio-molecules. Sequence alignments in ICM utilize the Needle-man-Wüncsh algorithm modified to include zero end gap penalties (ZEGA, Abagyan and Batalov 1997). The homology modeling option in version 3.1.02 was utilized from the ICM graphical user interface using the Build-Model-by-Homology option. This is a completely automated method that inherits the backbone from the aligned (not necessarily identical) parts of the template and adds the extracted side chain conformations for residues identical to the template. Loops are inserted based upon conformational database searches with matching loop ends, and upon insertion into the model non-identical side chains are assigned the most likely rotamer and optimized by torsional scan and minimization.

Profit

PROFIT software (version 2.2.1) was developed based upon protein fold recognition (threading) technology (Sippl and Weitckus 1992) by Proceryon Biosciences (http://www.proceryon.com/). Sequence alignments are derived from knowledge-based potentials for atom pair and protein–solvent interactions. These potentials are used in combination with dynamic programming and frozen approximation to generate an alignment of the target sequence with a template structure. The alignment is also based upon optimized gap penalties, gap restrictions, and potentials (Domingues et al. 1999).

SWISS-MODEL

SWISS-MODEL is a fully automated protein structure homology modeling server accessible via the ExPASy Web server (Schwede et al. 2003; http://swissmodel. expasy.org/). SWISS-MODEL was initiated in 1993 by Manuel Peitsch (see Schwede et al. 2003, and references therein). It takes as input a sequence alignment and a PDB file for the template. These are submitted over a server, and the knowledge-based homology model is constructed using the ProModII program (Peitsch 1995). Model construction includes complete backbone and side chain building, loop building, and verification of model quality, including packing. The model thus built is energy minimized using the Gromos96 force field (Van Gunsteren et al. 1996). The model coordinates are returned in PDB format. Very little or no user intervention is allowed.

Homology model construction

Homology modeling consists of building a protein model using a structural template, the template being a protein of known structure. The steps in homology modeling are detailed elsewhere, but the basic outline of the procedure is shown in Figure 2. The sequences of the two proteins, the target (or unknown) protein and the template, are first aligned. The Cα coordinates of the aligned residues from the template are then copied over to the target to form the skeletal backbone. The residue side chains, and relative insertions and deletions, are then modeled using automated or semi-automated procedures. Finally, the protein model thusly obtained may be subjected to energy minimization or molecular dynamics to relax unfavorable contacts.

The quality of the sequence alignment is critical in determining model quality. In comparing the homology modeling software, we therefore decided to assess this step independently from the model construction. Furthermore, in order to evaluate model building as an independent step, we started from a common sequence alignment for all software packages. Accordingly, this study is composed of two parts: (1) evaluation of sequence alignment and (2) evaluation of the constructed model starting from a given sequence alignment.

Comparing sequence alignments

For the purposes of evaluating sequence alignment quality, it was first necessary to establish a reference to which the alignment must be compared, and then define a suitable local metric that provided a measure of alignment quality at each point of the alignment. The constructions of the reference alignment and the comparison metric, which we shall call the “Relative Sequence Alignment” (RSA), are described below.

(1) Reference alignment

In order to obtain the reference alignment for a given target–template pair, each target was structurally aligned to its cognate template. This was accomplished using the ICM program and is described later (see “Structural Alignment of Target and Template” below). Following structural alignment, a residue in the target protein whose Cα atom lay within 3.5 Å of a Cα atom in the template protein was considered equivalent to that residue. Thus, most residues from the target protein were mapped onto corresponding residues in the template protein, while some residues could not be mapped. This mapping formed the basis for a (structurally based) sequence alignment. The sequence alignment thus produced, which for each target–template set serves as the “reference alignment” for that set, was derived from a structural superposition that does not use sequence identities to influence the superposition. Therefore, the sequence alignment thus produced is independent of similarity matrices used for sequence alignment.

(2) “Relative Sequence Alignment” (RSA) and “Difference Alignment Index” (DAI)

Once the reference alignment was obtained (see above), the sequence alignment between the target and template produced by each software package was compared with the reference alignment in order to obtain the relative sequence alignment (RSA). A flow chart illustrating this work flow is shown in Figure 3. More explicitly, each residue in both target and template was first assigned a number from 1 to n, corresponding to its position in the sequence starting from the N terminus, where n is the number of residues in the sequence. As shown in Figure 4, a given sequence alignment can then be viewed as the alignment between two vectors, with dashes (–) denoting those positions at which the template sequence does not have a residue, and numbers denoting the position of the residue in the sequence. It is important to note that in this representation of the sequence alignment, the target sequence is always represented by numbers from 1 to n, the number of residues in the target sequence. In contrast, the template sequence may not always be represented by all numbers starting from 1 to the end of the sequence, since those residues in the template that are unaligned will be missing their numbers.

Once we transform the two sets of sequence alignments to this new representation (see Fig. 4), the difference between the two alignments is obtained as follows. First, a “difference alignment” is produced according to the following rules: At each residue position in the target sequence, (1) if a “–” occurs in either the sequence alignment produced by the software package being tested or the reference alignment, a “–” is assigned at that position; (2) if numbers occur, then the absolute value of the difference between the numbers is assigned at that position. Thus, a “–” denotes an insertion or deletion, a “0” (zero) denotes that the test alignment agrees with the reference alignment at that position, and a number >0 denotes a shift in the test alignment relative to the reference alignment. Since dashes correspond to positions that have no equivalent residue, we decided to omit these positions from comparison and arbitrarily assign a value of −1 whenever a “–” is encountered. The n-dimensional vector string (where n is the number of residues in the target sequence) consisting of −1, 0, or some integer, is our measure of the “Relative Sequence Alignment.” The components will be called the Difference Alignment Index (DAI).

Structural alignment of target and template

ICM was used to calculate the structural alignment for each target–template pair. For example, the coordinates of target 2C9 (PDB code 1OG5) and template 2C5 (PDB code 1NR6) were structurally aligned using the backbone topology (Cα atom coordinates) resulting in an optimal superposition. The sequence identity in the structural alignment was 76%, which is lower that the optimal sequence alignment that has identity of 77%. For structure–structure alignments, the sequence identity is often lower than sequence-only alignments because conservation of a protein fold does not necessarily require sequence conservation. Thus, protein function and 3D fold are often conserved across functional homologs even though sequence identity is not required.

Comparing structural alignments

All target models were compared directly with the experimental structures, whose PDB codes are shown in Table 2. Sybyl was used for structural alignment of the model with its crystal structure. For each such alignment, RMSDs with respect to Cα atoms, backbone atoms, side chain atoms, and all heavy atoms were computed. The RMSDs based on Cα did not differ appreciably from those based on backbone atoms, and therefore only the former are reported.

Results

Sequence alignment

The results of the sequence alignment studies are summarized in Table 3, where the fractions of correctly predicted residues (ignoring gaps in sequence alignment) for the target sequence are shown. The sequence alignment quality for a given pair was measured in terms of the relative sequence alignment and quantified in terms of the difference alignment index (DAI) defined and explained in detail above (see Materials and Methods). Briefly, the DAI is a local measure of the deviation of a given alignment from a reference alignment. In the context of our studies, the DAI measures the deviation of a calculated alignment from the correct alignment based on structural superposition. A value of 0 for the DAI at a given residue position signifies agreement with the ideal alignment at that position, a positive value denotes the number of residues away (upstream or downstream) the template residue in the ideal alignment is away from the predicted residue in the alignment, and a value of −1 signifies that there is a gap in the alignment at that residue position.

Representative results are shown in Figure 5A–C, for the Cyp-450, serine protease, and SCD families, covering the cases corresponding to high (76%), moderate (39%), and low (19%) sequence identity. In these figures, the DAI is shown as a function of amino acid position in the query sequence. We have found this parameter to be very helpful in quantifying the quality of the local sequence alignment. As Figure 5A shows, for the Cyp-450 case with high sequence identities, all six software suites give identical results and are therefore equally good for sequence alignment purposes. For the serine protease family, where the target–template sequence identity is moderate (Fig. 5B), CLUSTALW, ICM, Prime, and Profit give essentially identical results, with each showing a misaligned residue at position 22 (misaligned by one residue). These results are very impressive, showing that there is just one misaligned residue in the sequence alignment. The results from Insight are in agreement with the results from the four aforementioned software suites for residue positions 100 and higher, but show significant deviations (misalignments up to five residues) in alignment in the positional range 85–100. The results from Sybyl, on the other hand, agree with the four software programs previously mentioned in the positional range 1–150, but show a systematic misalignment by four residues that starts at position 155 and persists for ∼100 residues. It should be noted that such deviations often correct themselves upstream from where they occur, as can be seen from the Insight example where a five-residue misalignment at position 90 is corrected upstream, causing the remaining alignment to be correct. The 100-residue misalignment seen with Sybyl in this case should be regarded as an unusual aberration and not reflective of typical behavior.

The case of SCD, where the sequence identity is low (19%), is the most interesting, for it is here that the differences between the six software suites become more apparent (Fig. 5C). Ignoring regions in alignment where the DAI is –1 (gaps in alignment), we note here that all six software suites get the region between residue positions 90 and 190 correct (apart from one or two misaligned residues around position 130), whereas all six software suites get the region from 190 to the C-terminal end wrong—predicting either gaps or misalignments by as many as 20–25 residues. Possible reasons for this are discussed later (see Discussion). The N-terminal region between residues 1–90 is predicted to varying extents, but it is in this region that Prime and Profit are most accurate. This difference in behavior between Prime and Profit vis-à-vis the others may be a direct consequence of the additional use of structural information in effecting alignment, a matter discussed later (see Discussion).

The behavior of all six software suites for all the protein sets for sequence alignment is summarized in Table 3. The trends observed for the three cases shown in Figure 5A–C are reflected here across the entire data-set. The data are arranged in seven columns in descending order of sequence identity (the percent identity being shown in the first row), and in seven rows showing the different software tested. The entries represent the percentage of correctly predicted residues in the sequence alignment.

A general trend observed here (Table 3) is that as target–template sequence identity decreases, the percentage of correctly predicted residues decreases. For sequence identities ≥38%, ≥75% target residues are correctly predicted in their alignment with the template sequence. (As mentioned before, the case of Sybyl correctly predicting only 65% of the residues for the 39% example is an exception, and this does not reflect the general behavior of Sybyl, as can be discerned by looking at its other entries.) The performance of Prime in this range, for the 38% identity case (NHR), is noteworthy. While the others predict 75% correctly in this region, Prime actually gets 88% of the alignment correct. At sequence identities <30%, Sybyl, Insight, CLUSTALW, and GENEMINE/LOOK are seen to predict 50%–60% of the residues correctly, whereas 60%–70% are correctly predicted by Prime, ICM, and Profit.

Model building

In order to assess model building independently of sequence alignment (to prevent alignment errors from influencing the model), all models were constructed from a common sequence alignment for each target–template pair. The common alignment was the “reference alignment” alluded to earlier, which formed the starting point for the model building calculations. This alignment was that which corresponded to the optimal structural superposition between target structure and template structure.

For reasons discussed above, all software for comparison was run using the default parameters. It is entirely possible that when run at the hands of an expert, some software may give better results than those obtained using the default set of parameters, but this would deviate from a truly objective comparison of the programs themselves (the possibility that some of the programs may have unreasonable default values notwithstanding) and is therefore probably better left to organized contests such as CASP. However, in cases where additional features that were deemed important for performing a realistic construction of the model were available, those features were used. Two examples will suffice to illustrate the point: (1) In the case of CYP450, which includes a heme group, or the SCDs, which include NADPH as a cofactor, the option of including the cofactor in model building was considered important and was therefore turned on, where available. (2) Allowance for cysteine bridges was made when available, when certain pairs of cysteines were known to be involved in forming disulfide bonds.

With each software suite, complete models were constructed, but no attempt was made to optimize or refine them beyond individual software defaults. The models were compared with their corresponding crystal structures. The RMSDs for Cα, side chain, and all heavy atoms are shown in Table 4, AC, respectively. The main-chain RMSDs are not shown since they follow the same trends as the Cα results and do not add any more insight to the results. In each of these tables, the rows represent the proteins while the columns represent the programs. The rows from top (P450) to bottom (SCD) are arranged in descending order of sequence identity. The RMSD values in the cells are shown in gray-shaded cells, boxed cells, or normal depending on the extent to which the result stands out from the others in the same category (in a given row). These differences in display style are meant to aid the eye in quickly discerning the best results. Looking down a given column (software) at the number of gray and boxed entries, one can quickly estimate the relative performance of the software suites for that column across the different protein sets; likewise, looking across a given row (protein), one can assess the relative performance of the different programs for that protein class.

The data show that the differences between the software suites become apparent as one goes from high to low sequence identity. No significant differences are observed between the packages for P450 and MMP, where the sequence identities are high (>50%). With lower sequence identities, Prime, DSModeler, and Sybyl (respectively) perform better than the others, as seen by the number of “gray” cells for these entries, with GENEMINE/LOOK and ICM performing somewhat less well (boxed). In particular, the ability of Prime to model the SCD class stands out from the others (the only “gray” entry in the SCD category), a possible consequence of including the steric and electrostatic effects of the cofactor (NADPH) and ligand during the final stages of model building. In their recent paper comparing model building capabilities of six software packages on 1037 target–template pairs, Wallner and Elofsson (2005) found that Modeller and GENEMINE/LOOK (they used SegMod/ENCAD [Levitt 1992]) were among the best overall performers. It is interesting that Modeller also performs quite well in our study, considering that different software is being compared (the overlapping programs between the Wallner and Elofsson work and the model building portion of our study are Modeller, SWISS-MODEL, and GENEMINE/LOOK) and different proteins are studied.

Functionality and usability

While quality of results is clearly of key importance in assessing homology modeling software, quality of use can also be a critical factor. This is particularly true in the field of drug discovery, where accuracy and speed are required to succeed in the competitive pharmaceutical environment. We have therefore rated the alignment and homology structure-building tools utilized here by their functionality and usability (the results are shown in Table 5). The alignment tools were ranked according to four categories: (1) ability to read and write multiple alignment file formats, (2) annotation capabilities, (3) ease of generation of the alignment, and (4) ease of editing the alignment. The structure-building tools were ranked according to, (1 and 2) ease of building backbone and side chain atoms, (3) ability to incorporate disulfide bridge information, (4) ability to build loops, and (5) ability to utilize ligand and/or cofactor information. The modeling software programs are rated best (score =1), neutral (score = 0), and worst (score = −1) in each category. The results pertaining to ease of use are clearly subjective; the rankings reported here are an average of our experiences with each software program, but other users could have different opinions. The ratings pertaining to functionality are more objective (e.g., whether ligand information can be incorporated in building the protein model).

In the alignment categories, Insight/Homology, ICM, and GENEMINE/LOOK stand out among the software in terms of usability and functionality. GENEMINE/ LOOK and ICM read and write the widest variety of file formats, and can thus integrate more smoothly into an environment in which multiple modeling software suites are used. GENEMINE/LOOK and Profit are easy to annotate, and GENEMINE/LOOK and Insight/ Homology have editing tools that are straightforward to use. Profit and Prime are limited in usability in that they require a 3D template, so two sequences cannot be aligned unless a 3D structure is also present.

With regard to structure building, DSModeler, ICM, GENEMINE/LOOK, and MOE show superiority with respect to usability and functionality. Sybyl, DSModeler, and MOE can utilize disulfide bridge information in building models; this is a critical feature for modeling proteins containing disulfide bonds. Some of the other packages (Prime, ICM) have less straightforward work-arounds to include disulfide information. DSModeler, ICM, and MOE are superior at loop building; Prime cannot build loops for longer sequences due to computational requirements; and loops built in SWISS-MODEL, Sybyl, and GENEMINE/LOOK are of poorer quality. Only DSModeler, Prime, and MOE can include bound ligand information in structure building. During model building in PRIME, the bound ligand will become part of the energy function. DSModeler incorporates the ATOM and HETATM records of the PDB template file when generating the spatial restraints used for model generation. The process of including bound ligand information is slightly easier in MOE and DSModeler than in the version of Prime that we tested, which is why we ranked MOE and DSModeler higher than Prime in this category (in a later version of Prime released after this work was done, the ease of including ligands was improved).

Overall, among the five programs for which a total score was reported, the combination of the Accelrys tools, Insight/Homology, and DSModeler stand out among the software suites in both alignment and model building categories in terms of usability and functionality. DSModeler alone performs well at building loops and can incorporate disulfide bond and ligand information. ICM and GENE-MINE/LOOK closely follow the Accelrys tools in overall quality of use for combined (alignment and model building) score, with different ranges of capabilities but similar summed numerical rankings. While MOE and DSModeler are comparable in the Build category, the alignment category was not tested for MOE, and therefore a total score for MOE was not reported.

Discussion

Sequence alignment

The results for comparative sequence alignment shown above demonstrate that at sequence identity ≥38%, the different software suites are indistinguishable in terms of quality. For low sequence identities, Profit and Prime stand out from the rest. As noted in the brief description of these software suites (see Materials and Methods), both Profit and Prime, in addition to using sequence information, also use structural information to effect the alignment. Profit is a protein threading program designed to produce reliable alignments when sequence identity is low, and works by measuring the compatibility of a sequence with a template structure (Sippl and Weitckus 1992). Prime works by comparing the target and template not only in terms of the sequence identity between them, but in addition the target's predicted secondary structure with the template's known secondary structure. It is the use of structural information for sequence alignment that sets these two packages apart from the rest, underscoring the importance of incorporating structural information (when available) for improved sequence alignment. Our experience with Prime and Profit clearly shows that it makes good sense to expect that incorporating predicted secondary structure should lead to better sequence alignments.

Homology model

Our studies show that when the sequence identity to the target structure is >40%, the homology models are equally satisfactory. At lower sequence identity, some performed better than others, while at truly low identity (19%), only Prime performed reasonably well. The value of 4.3 Å for the Cα RMSD obtained with 19% identity probably reflects the limit of how well one can hope to model a target on a template when the extent of homology is so low.

The results were discussed in terms of backbone building, side chain modeling, and overall model building. While loop modeling is an important and integral part of homology model building, its relative role in the different programs was only considered implicitly. A detailed study similar to ours comparing different software in the specific context of loop modeling is presently underway and will be presented in a separate publication (K. Rossi, C. Weigelt, A. Nayeem, and S. Krystek, in prep.).

How utilizable are such models for structure-based drug design? This question has been raised and addressed by Oshiro et al. (2004) in the context of virtual screening and docking using CDK2 and factor VIIa screening data sets, and more recently by Bernacki et al. (2005) for a wider set of protein classes. Their results showed that when the sequence identity between target and template was >50%, the docking performance was comparable to crystal structures. The implication is that in these situations the active site is preserved between the target and template so that binding is largely unaffected. Similarly, others who have used homology modeling for docking applications have found that a 40% sequence identity was needed in order to obtain reliable results (Ring et al. 1993; Schafferhans and Klebe 2001). Our own experience with the datasets shown here corroborates these findings. Indeed, RMSDs taken over atoms in the vicinity of the active site are lower than overall RMSDs reported in Table 4A–C.

But what can we say about the quality of the binding site when sequence identity falls to 19%? Consider the case of the short-chain dehydrogenases (SCD) example where the sequence identity is 19%. The sequence alignments of target (1A27) to template (1CYD), which are correct between residue positions 90–190 (see Fig. 5C), contain the three key residues defining the catalytic triad. Their positions are indicated by the vertical bars, and it is seen that in all six cases, despite the overall predicted alignment being rather poor, the catalytic residues are correctly placed. The structural alignment between the target and template is shown in Fig. 6A. The alignment in and around the binding site that lies between residues 90 and 190 is reasonable and devoid of any alignment gaps. Therefore, the assignment of Cα coordinates for these residues can be made based on the equivalent template coordinates. When the final model is constructed, this region is modeled quite well, at least in terms of the Cα positions (Fig. 6B). However, since the sequence similarity between target and template in this region is modest (Fig. 6A), we cannot expect the model derived from here to be suitable for drug design applications such as lead optimization. Nonetheless, the overall structure at this level of sequence identity is useful in providing an idea of the overall topology.

It is clear that a sequence identity <50% would be sufficient to generate a homology model that could consistently identify active compounds. The 20%–30% homology band has been called, by some investigators, the “twilight zone” because the quality of homology models may vary widely (Sternberg 1996). Generally, when the sequence identity exceeds 30%, reliable homology models can be constructed (Sanchez and Sali 1997). It should be noted that our 50% cutoff was derived using homology models generated with only a single template structure. Many homology modeling programs can use multiple structures rather than single structures (Sybyl, Insight, DSModeler, Prime, ICM). The advantages of using multiple templates rather than a single template for homology modeling are especially apparent when homology is low (<30%) and when different templates have similar sequence identities to the target. Furthermore, even if the overall sequence identities of different templates with respect to the target sequence may be similar, the different templates could have varying degrees of sequence similarity in different regions of the sequence alignment. (For example, when using templates A and B to model a target sequence, template A may be more similar than template B at the N terminus of the target, whereas template B may be more similar to the target than template A near the C terminus.) Multiple-template-based models could have an RMSD relative to the true target crystal structure of <2 Å and, as published data suggest, can help identify novel leads and show enrichments comparable to that of the true crystal structure. In a recent review, Kopp and Schwede (2004) describe several examples where homology models provide sufficient accuracy to be useful for a wide variety of applications. The automation of homology modeling described by Kopp and Schwede using SWISS-MODEL and the development of protein model repositories requires those who build and use homology models to rely upon studies like those presented in this manuscript to compare the results of automated model building with current state-of-the-art software. The comparison of models built using multiple templates versus those built using single templates is, however, beyond the scope of this study.

These studies provide a basis for assessing the accuracy of homology modeling. In particular, the accuracy of comparative homology models depends upon the quality of the alignment between the target protein and the template from which the model is to be produced. (It may therefore be argued that had one started from a different sequence alignment than the one that was used in this study, one could have built a better model than the one for which the RMSD was reported in Table 4A–C.) Our purpose, however, was to start from a common input alignment for all programs, and the ICM suite produced structure-based sequence alignments that were reasonable and which agreed with published structural alignments from DALI (see FSSP in the DALI server at http://www.ebi.ac.uk/dali/.) The Difference Alignment Index and the resulting “Relative Sequence Alignments” have been shown to provide useful measures of local deviations from reference sequence alignments. As expected, small differences in the sequence alignments can lead to significant changes in the models produced. Furthermore, we have demonstrated that the protein modeling protocols from alternative software can produce dramatically different models even when using the exact same target–template alignments.

Table Table 1. Software packages and evaluation parameters used in the sequence alignment and homology model building portions of this work
image

Table Table 2. Protein classes studied
image

aFor each protein class, the structural class, and the target and template proteins are listed.

bThe parameters that describe the structural alignment of the target with respect to the template.

dCα root-mean-squared deviations.

eNumber of aligned residues within a 3.5-Å cutoff.

fPercentage of residues that “nres” represents with respect to the entire amino acid sequence of target (% targ) and temlate (% temp).

Details are in the caption following the image

Structural overlays of target proteins (red) on their cognate templates (green). The target–template pairs are those shown in Table 2. The program ICM was used for the structural alignment shown here.

Details are in the caption following the image

Outline of steps for homology modeling process. The steps are discussed in the text.

Details are in the caption following the image

Flow chart showing the method for calculating the relative sequence alignment (see text). This was done by comparing the purely sequence-based alignment with the sequence alignment generated by aligning the 3D structures.

Details are in the caption following the image

Explicit example illustrating the calculations of the “relative sequence alignment” and the “difference alignment index” at each target residue position. See text for details.

Table Table 3. Relative performance of different software suites in the sequence alignments of the protein pairs
image

Protein classes are arranged in descending order of sequence identity (shown as a percentage in parentheses for each class) going from left to right. The numbers in the table denote the fraction of correctly aligned residues. Results that stand out in a given set are enclosed in open boxes.

Details are in the caption following the image

Relative sequence alignments for cases corresponding to high (A), moderate (B), and low (C) sequence identity. The protein pairs representing these cases were CYP450 (A), serine protease (B), and short-chain dehydrogenases (C). The three vertical lines in panel C denote the residue positions of the active site residues in SCD.

Table Table 4. Relative performance of the different software for building the homology models starting from a common sequence alignment
image

The numbers in the second column refer to the number of atoms over which the RMSD values are calculated, and the RMSD values (in Å) themselves are displayed in the cells for each target model (as produced by the software identifying that column of data) relative to the template. The RMSDs are measured in terms of Cα atoms (A), side chain heavy atoms (B), and all heavy atoms (C). The best sets of results are in the shaded gray boxes, followed by the next best, which are in unshaded boxes.

Table Table 5. Usability and functionality in different categories for the software packages
image

Sequence alignment and structure building are evaluated separately. Within each category, the software are rated high (+), medium (blank), or low (−) and given corresponding weightings of 1, 0, or −1. The overall score for each software suite was produced by adding the scores for each category. The total score was obtained by adding the contributions from the alignment and build scores.

Details are in the caption following the image

(A) Sequence alignment for the structural overlay corresponding to the SCD example of 19% sequence identity. The alignment is annotated by secondary structure (α-helices in red and β-sheets in green), and similar and identical residues (one letter for identical residue and “#” for similar but not identical). Regions of high insertions and/or deletions are boxed. (B) Different structural regions corresponding to the boxed areas in the sequence alignment shown in panel A.

Acknowledgements

We thank Ramy Farid (Schrodinger) and Teresa Lyons (formerly Accelrys) for valuable discussions, and Deborah Loughney and Dr. Wesley Cosand (BMS) for a critical reading of the manuscript

      The full text of this article hosted at iucr.org is unavailable due to technical difficulties.