Volume 16, Issue 2 p. 165-175

Prediction of structures of multidomain proteins from structures of the individual domains

Andrew M. Wollacott

Department of Biochemistry, University of Washington, Seattle, Washington 98195, USA

These authors contributed equally to this work.

Alexandre Zanghellini

Biomolecular Structure and Design, University of Washington, Seattle, Washington 98195, USA

Department of Biochemistry, University of Washington, Seattle, Washington 98195, USA

These authors contributed equally to this work.

Paul Murphy

Department of Biochemistry, University of Washington, Seattle, Washington 98195, USA

David Baker

Corresponding Author

Department of Biochemistry, University of Washington, Seattle, Washington 98195, USA

Department of Biochemistry Health Sciences Building, Box 357350, Seattle, WA 98195, USA; fax: (206) 685-1792.
First published: 01 January 2009

Abstract

We describe the development of a method for assembling structures of multidomain proteins from structures of isolated domains. The method consists of an initial low-resolution search in which the conformational space of the domain linker is explored using the Rosetta de novo structure prediction method, followed by a high-resolution search in which all atoms are treated explicitly and backbone and side chain degrees of freedom are simultaneously optimized. The method recapitulates, often with very high accuracy, the structures of existing multidomain proteins.

Abbreviations: RMSD, root mean square deviation.

Proteins are frequently composed of multiple domains (Ponting and Russell 2002; Vogel et al. 2004) that are likely to fold independently (Shen et al. 2005). Determining the structure of multidomain complexes at atomic resolution is critical to understanding the underpinnings of much of biology (Lupas et al. 2001; Aloy and Russell 2006). While structures of single domains can be readily determined through X-ray or NMR techniques, the structures of large multipart proteins are often more difficult to elucidate (Aloy et al. 2003).

There are two general approaches to predicting structures of multidomain proteins from structures of individual domains. First, the domain assembly problem may be treated as a docking problem. For example, Inbar et al. (2005) used rigid body docking methods to predict the structure of the resulting complex. A second approach to domain assembly, which we describe here, is to explicitly sample the degrees of freedom of the linker rather than the rigid body degrees of freedom of the two domains. Approached in this manner, the domain assembly problem may be viewed as an ab initio prediction problem for a relatively short amino acid sequence with preformed N- and C-terminal structures.

The Rosetta protein modeling method has had success in folding small protein chains ab initio (Bradley et al. 2005), and in protein–protein docking with flexible side chains (Wang et al. 2005). Here we combine these methods to assemble structures of isolated domains into a multidomain complex. The conformation of the linker is explored, keeping the backbone of the individual domains fixed but allowing the side chains in the linker and at the domain interface to sample a full range of rotamer conformations. The lowest energy models found are often very close to the correct structure.

Results and Discussion

Seventy-six two-domain proteins were culled from a nonredundant database of proteins (Berman et al. 2000), as described in Materials and Methods. These proteins contained no cofactors or ligands near the interface of the domains, as the focus was on modeling the interface between protein domains only. All systems were first subjected to a low-resolution (side chains represented by centroids) search to generate 5000 candidate decoys, starting from an extended structure of the linker. In this search, the Rosetta de novo fragment assembly method is used to sample the conformational space of the linker; residues in the domains at the N and C termini of the linker interact with each other and with the linker according to the Rosetta low-resolution potential, but no fragment insertions are done within the domains.

The resulting low-resolution models were then subjected to high-resolution refinement using the standard Rosetta Monte Carlo minimization plus side chain repacking protocol (Schueler-Furman et al. 2005). In each attempted move, small torsion angle changes are made in the linker, interface side chain conformations are repacked using the Dunbrack backbone-dependent rotamer library (Dunbrack and Cohen 1997), and continuous quasi-Newton optimization (Press et al. 2002) of the linker and side chain degrees of freedom is carried out. The move is accepted or rejected according to the standard Metropolis criterion. Side chain conformations from the native complex were not included in the rotamer search, as including them has previously been shown to favor lower RMSD structures (Wang et al. 2005). Using backbone geometries from the native complex may bias the search toward near-native structures; we excluded native side chain information in these studies to reduce this effect.
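The accept/reject step mentioned above can be illustrated with a minimal sketch of the Metropolis criterion. The function name, the `kT` parameter, and its default value are assumptions made for illustration; the text does not state the temperature used in Rosetta.

```python
import math
import random


def metropolis_accept(e_old, e_new, kT=1.0, rng=random):
    """Return True if a move from energy e_old to e_new is accepted.

    Downhill (or equal-energy) moves are always accepted; uphill moves are
    accepted with probability exp(-(e_new - e_old) / kT).
    kT is a hypothetical parameter; its value is not given in the text.
    """
    if e_new <= e_old:
        return True
    return rng.random() < math.exp(-(e_new - e_old) / kT)
```

In a refinement loop, `e_new` would be the energy after the perturbation, repacking, and minimization described above, so the criterion is applied to the minimized energies rather than to the raw perturbed ones.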

In most cases, there were only modest changes to the low-resolution structure upon high-resolution refinement, but in some cases, large numbers of clashes generated in the initial side chain grafting caused the domains to separate during refinement and led to large structural changes. Near-native decoys sometimes snapped into a more native-like orientation upon addition and refinement of side chains. Overall, 10.8% of low-resolution decoys with an RMSD <3.0 Å showed a noticeable improvement in RMSD (>0.5 Å difference) after high-resolution refinement. In contrast, only 0.44% of low-resolution decoys with an RMSD >3.0 Å were refined to an RMSD <2.0 Å by the high-resolution refinement protocol; the radius of convergence of the refinement protocol is clearly <3 Å RMSD.

Decoys were ranked based on their interdomain interaction energy to eliminate the effects of energetic differences due to alternative side chain packing away from the interface. For each assembly, the domains were separated, and the energy of each domain was evaluated with side chains in the same conformation as the complex. The interdomain interaction energy was defined as:
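$$E_{\text{interaction}} = E_{\text{assembly}} - \left(E_{\text{domain 1}} + E_{\text{domain 2}} + E_{\text{linker}}\right) \tag{1}$$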

The domain interaction energy does not include entropic effects associated with complexation, but these are, to a first approximation, independent of the structure of the complex.

Figure 1 shows plots of interdomain interaction energies as a function of RMSD for a number of systems after high-resolution refinement. In many cases, shown in Figure 1A, there is a striking energy funnel with near-native decoys possessing significantly lower energies than higher RMSD models. For comparison purposes, the energies of relaxed native structures (see Materials and Methods) are also plotted. In the majority of cases, the relaxed natives maintained a very low RMSD and had lower interaction energies than the decoys. This further illustrates the deep energy funnel around the native minimum, and suggests that increased sampling could lead to lower RMSD structures.

Figure 1. Plots of binding energy (Y-axis) vs. Cα RMSD (X-axis) (Å) for decoys (blue dots), and relaxed natives (red dots) for select complexes. (A) Successes with funnel-like energy distributions. (B) Successes with low-energy and RMSD models but less funnel-like energy distributions. (C) Failures due to insufficient sampling. (D) Failures with low RMSD decoys that were not identified. (*) RMSDs do not include linker regions for these systems.

Predictions of domain structures

We divide our predictions into three major categories: successes, low-resolution failures, and high-resolution failures. Successes are cases where the lowest energy decoys had an RMSD <2 Å after high-resolution refinement. Thirty-eight of the 76 test cases were successful according to this criterion (Table 1). Systems for which no decoys were generated with an RMSD <3 Å after low-resolution refinement were considered failures at the centroid level, representing 13 of the 76 systems (Table 2). Failure to generate a near-native decoy (RMSD <2 Å) (Table 3) and inability to identify these near-native models in a decoy set (Table 4) were considered failures at the high-resolution level (25 of the systems in this study).
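The three categories can be written down as a small decision function. This is only a sketch of the criteria as stated in the text (a decoy under 3 Å in the low-resolution search, and a decoy under 2 Å among the five lowest energy refined models); the function name and the data layout are assumptions.

```python
def classify_system(lowres_rmsds, refined, n_top=5):
    """Classify one test system by the criteria described in the text.

    lowres_rmsds: Cα RMSDs (Å) of the centroid-level decoys.
    refined: list of (interaction_energy, rmsd) pairs after refinement.
    """
    # Centroid-level failure: no low-resolution decoy within 3 Å.
    if not any(r < 3.0 for r in lowres_rmsds):
        return "low-resolution failure"
    # Success: a decoy under 2 Å among the n_top lowest energy models.
    ranked = sorted(refined)  # tuples sort by energy first
    if any(rmsd < 2.0 for _, rmsd in ranked[:n_top]):
        return "success"
    return "high-resolution failure"
```

The reported counts are consistent with this partition: 38 + 13 + 25 = 76 systems, and 38/76 = 50% successes.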

Table 1. Systems with low-RMSD decoys in the five lowest energy models (after high-resolution refinement)

Table 2. Systems that failed to yield low-RMSD decoys during the low-resolution search

Table 3. Systems with no low-RMSD decoys in the top 5% by energy after high-resolution refinement

Table 4. Systems with no low-RMSD decoys in the five lowest energy decoys after high-resolution refinement

Successful predictions

Table 1 lists those proteins for which the structure of the assembly was accurately predicted, in that one of the five lowest energy models has a Cα RMSD <2 Å. In these cases, there was generally a significant fraction of near-native decoys, and the energy function was able to accurately identify the low RMSD models. For a large number of these successful predictions (70%), the lowest energy model was within 1 Å RMSD of the native complex. Plots in Figure 1A illustrate the funnel-like energy distribution for successful predictions, while those in Figure 1B show moderate success cases where a funnel distribution was not evident.

In most cases, the very low RMSD structures had native-like side chain packing at the interface, even though crystal structure side chains were not included. Side chain packing at the interface is illustrated in Figure 2. Even though Cα RMSDs for near-native decoys were very low (on the order of 0.2 Å), the heavy atom RMSD was generally over 1.0 Å. As expected, surface side chains away from the interface were generally unconstrained and so did not match well with the native side chain orientations at those positions. However, at the interface, the side chains more closely resembled native structures, as highlighted in Figure 2, A and B.

Figure 2. Examples of accurate predictions with domain orientation and side chain packing close to the native structure. (A) 1a62 (Cα RMSD = 0.12 Å; heavy-atom RMSD = 1.05 Å). (B) 1cli (Cα RMSD = 0.23 Å; heavy-atom RMSD = 1.27 Å). (C) 1i39 (Cα RMSD = 0.20 Å; heavy-atom RMSD = 1.20 Å). (D) 1cdy (Cα RMSD = 0.32 Å; heavy-atom RMSD = 1.40 Å). The native structures are in red, the native linker in yellow, the decoy in blue, and the decoy linker in orange. Structures were superimposed onto only one domain.

Rosetta was also able to correctly model proteins with large linkers, especially when those linkers were constrained and structured. This is shown in Figure 2, C and D; the linker forms a large α-helix in 1i39 and a large β-sheet in 1cdy.

For several proteins in Table 1, the RMSD reported did not take into account the linker residues. These systems, 1d09, 1eov, and 1han, contain large unstructured linkers. In several cases, as shown in Figure 3, A and B, the correct relative positions of the domains were predicted for these systems. However, since the linker was generally unconstrained, there was a large degree of flexibility for these linkers. Thus, the linker was poorly predicted even for the cases where the orientation of the two domains was correctly identified. These cases were considered successful since the packing of the domains was correctly determined.

Figure 3. Correct prediction of relative positions of the domains but not the structure of linker. (A) 1aoa (Cα RMSD = 1.04 Å; Cα RMSD neglecting linker = 0.22 Å). (B) 1han (Cα RMSD = 5.07 Å; Cα RMSD neglecting linker = 0.24 Å). The native structures are in red, the native linker in yellow, the decoy in blue, and the decoy linker in orange. Structures were superimposed onto only one domain.
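The difference between the full Cα RMSD and the Cα RMSD neglecting the linker can be made concrete with a short sketch. It assumes the model and native coordinates have already been superimposed (the text superimposes onto one domain) and simply skips the linker positions; the function name and the index-based `exclude` argument are illustrative assumptions.

```python
import math


def ca_rmsd(model, native, exclude=frozenset()):
    """Cα RMSD (Å) over residues whose index is not in `exclude`.

    model, native: equal-length lists of (x, y, z) Cα coordinates that are
    assumed to be already superimposed; no fitting is done here.
    exclude: residue indices to skip, e.g. the linker positions.
    """
    if len(model) != len(native):
        raise ValueError("model and native must have the same length")
    sq_dists = [
        sum((a - b) ** 2 for a, b in zip(m, n))
        for i, (m, n) in enumerate(zip(model, native))
        if i not in exclude
    ]
    if not sq_dists:
        raise ValueError("no residues left after exclusion")
    return math.sqrt(sum(sq_dists) / len(sq_dists))
```

With a flexible linker, a few badly placed linker residues can dominate the full RMSD even when both domains are positioned correctly, which is the situation reported for 1han.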

Failures

Low-resolution failures

Table 2 lists the proteins for which no decoy was found with a Cα RMSD <3 Å in the low-resolution search. Due to the limited radius of convergence of high-resolution refinement, the low-resolution search has to sample close to the native structure for predictions to be successful. These cases were, therefore, considered failures at the early stage of modeling. Further high-resolution refinement was not carried out on these systems in this study to save computer time, as it is unlikely that refinement would convert failures to successes.

The majority of these failures contained large unstructured linkers. The native structures for two of these systems, 1rhs and 1j8m, are shown in Figure 4. In both cases, the linker wraps around one of the domains. In these systems, insufficient sampling is the likely reason for the inability to generate near-native decoys. With few constraints, and an interface far removed from the endpoints of the linker, only a small fraction of decoys might be expected to sample conformational space near the native state. It is possible that by increasing the number of low-resolution models generated, near-native low energy decoys may be found.

Figure 4. Challenging complexes where sampling was a problem, with long linkers stretching around a domain. Native structures for 1rhs (A) and 1j8m (B).

It might be expected that native states that contain a large number of interdomain contacts would be more likely to be recapitulated during low-resolution refinement. However, an analysis of the number of contacts between domains, or between the domains and their linker, shows no strong correlation for systems that yielded near-native decoys and those that failed. This further indicates that the centroid-level failures listed in Table 2 suffered from insufficient sampling and not deficiencies in the energy function. Indeed, as shown in Table 2, with the exception of 1qcs, these proteins did not have sufficient sampling within 5 Å of the native structure.

Several systems in Table 2 contained short linkers and yet did not yield low RMSD decoys, such as 1qcs and 1nkr. In these cases, the problem may be that the linker was so short that it became overly restrictive. The bond lengths and angles in the linker are kept fixed during refinement, and with only the torsional angles of a short linker free to vary, near-native complexes may not be conformationally accessible. It is possible, therefore, that increasing the size of these linkers may actually improve the likelihood of generating low RMSD decoys.

In order to test this hypothesis, the linker of 1qcs was extended by one residue on either side so that the linker was seven residues in length instead of five. The low-resolution search was repeated with this new linker, yielding 7% of decoys with an RMSD <3 Å (data not shown). Using this new definition of the linker, the assembly procedure would be considered a success after low-resolution refinement. This result indicates that if the linker is too small the conformational space available to sample may become too restrictive to obtain near-native decoys, especially when there is a high shape complementarity between the domains. Increasing the size of the linker can allow for better sampling of the conformational space near the native state.

High-resolution failures

Table 3 lists the systems, after high-resolution refinement, for which no decoy <3 Å was present in the lowest 5% interdomain interaction energy subset of the population. As with the centroid-level failures, these systems are hampered by insufficient sampling. Only a small fraction of near-native low-resolution decoys were created for these complexes; consequently, after high-resolution refinement, there remained very few low RMSD decoys. Only one of these systems (1f5n) had a decoy with an RMSD <2 Å, and that model did not score favorably. Plots in Figure 1C illustrate high-resolution failures that exhibited insufficient sampling. With the exception of 1a6q, the relaxed natives had significantly lower interaction energies, further indicating that this was a sampling problem.

Table 4 lists additional failures after high-resolution refinement in which near-native decoys (Cα RMSD <2 Å) were present in the lowest energy 5% of structures but were not among the five lowest energy models. Overall, only a small fraction of the decoys sampled for these systems were near-native. For 1f1z, 1a79, and 1dzf, which contain a significant number of lower RMSD decoys, there appears to be a discrete bottleneck to achieving the low energy native minimum, as the relaxed native structures are considerably lower in energy than the decoys. Thus, for the majority of these systems, the problem again appears to be insufficient sampling. Notable exceptions in Table 4 are 1qam and 1qto. Here the problem is most likely due to the inability of the energy function to discriminate near-native models from high RMSD decoys (Fig. 1D). As shown in Figure 5A, the low energy 1qto decoy is stabilized by a nonnative strand pairing between the two domains, while maintaining a number of contacts equivalent to that in the native structure. For 1qam (Fig. 5B), the lowest energy decoy is stabilized by an increased number of contacts between the two domains. Thus, this remains a scoring problem, as the energy function is unable to discriminate native-like complexes from incorrect complexes with larger numbers of contacts. This may require improved measures of packing.

Figure 5. Low-energy high RMSD decoys for several complexes. (A) 1qto (Cα RMSD = 11.90 Å) and (B) 1qam (Cα RMSD = 5.06 Å). The native structures are on the left and decoys on the right.

Improvements with high-resolution refinement

In the protocol described here, models are first created using a low-resolution approach where the protein is modeled at the centroid level. This allows for a more rapid sampling of the conformational degrees of freedom of the linker region. Subsequent high-resolution refinement allows small changes in the linker region to optimize the details of side chain packing at the interface so that the best models can be identified using a physically realistic atomic level energy function. Although the changes are often small, they can dramatically reduce the energy; without backbone refinement, many models contain significant interatomic clashes. While the primary purpose of the high-resolution refinement is to improve recognition of the best models based on the all-atom energy function, in many cases the refinement protocol improves model quality. As shown in Tables 1, 3, and 4, 65% of the systems had a lower RMSD decoy after high-resolution refinement than after low-resolution refinement.

Figure 6 illustrates the different stages in the model generation process. Figure 6A shows the centroid-level energy distribution for 1cli after the low-resolution de novo buildup of the linker. While many near-native models are produced, the scoring function is unable to discriminate near-native decoys from structurally divergent models. Figure 6B shows the all-atom energy distribution after side chains are grafted onto low-resolution models and optimally repacked, keeping the linker fixed. This leads to an abundance of models with large numbers of atomic clashes, as the relative orientations of the domains cannot accommodate the native sequence. By allowing the linker region to relax, a more dramatic energy funnel is obtained, as shown in Figure 6C, allowing for the identification of near-native decoys using the scoring metric. Figure 6D summarizes the changes in structure that occur upon high-resolution refinement. In the majority of cases, the structures diverge from the native model due to clashes introduced by the side chain grafting, but a subset of the lower RMSD structures become more native-like.

Figure 6. Energy distributions at successive steps in the prediction protocol. (A–C) Plots of energy vs. RMSD to native for a population of models generated for 1cli (A) after de novo modeling of the linker using the centroid-based energy function, (B) after grafting and repacking interface side chains onto models in A, and (C) after high-resolution refinement of models in B. (D) Comparison of model RMSDs to native before and after high-resolution refinement. For this system, high-resolution refinement significantly increased the extent to which the best models could be recognized based on their energies (cf. panels A and B to panel C), and improved the quality of many of the models (D).

Overall, the Rosetta domain assembly protocol appears to be quite successful at predicting the structure of two-domain complexes, and the methodology can be readily extended to multidomain assemblies.

Conclusion

The Rosetta domain assembly method is successful in predicting and identifying near-native complexes for domain assembly problems in 50% of cases studied here. By explicitly modeling the polypeptide linker that tethers both domains, the conformational space available for the docking of each domain is reduced, and thus treating domain assembly as a linker folding problem can be more powerful than restricted docking methods. Most of the failures with this method appear to be due to insufficient sampling; with increased computational resources, the success rate should increase considerably. The high accuracy of many of the lowest energy models (Fig. 2) illustrates recent progress in high-resolution modeling (Schueler-Furman et al. 2005). To better treat problems in which the domain structures are produced by homology modeling, the next step would be to incorporate domain flexibility, particularly in loops, into the assembly protocol.

Materials and methods

Data set and linker definition

A nonredundant subset of the Protein Data Bank (Berman et al. 2000) was used to build the data set used in this study. The initial data set was obtained through the PISCES server (Wang and Dunbrack 2003) with the following parameters: the cutoff for redundancy at the sequence level was set at <40% sequence identity, and X-ray structures with a resolution of ≤2.5 Å were chosen. This culling reduced the data set to 601 structures.

An automatic domain parsing and linker definition procedure was implemented. Three independent domain prediction algorithms were used: Dali (Holm and Sander 1998), CATH (Orengo et al. 1997), and Taylor's (Taylor 1999). The sequence residues that define the linker between the two domains were determined for each method, and the consensus of the three methods was chosen as the linker region. A secondary structure assignment was determined for each protein (Kabsch and Sander 1983), and the linker region was extended on both sides up to the boundaries of contiguous secondary structure segments. The linker size was limited to 21 residues. This method allows for a systematic and automatic definition of linker regions.
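One possible reading of this linker definition is sketched below. Several details are assumptions, since the text does not specify them: the consensus is taken as the overlap of the three per-method linker ranges, "extension" grows the linker through coil residues until it meets a helix (H) or strand (E) residue in a DSSP-style string, and the 21-residue limit is applied by stopping the extension. The function names are hypothetical.

```python
MAX_LINKER = 21  # size limit stated in the text


def consensus_linker(ranges):
    """Overlap of inclusive (start, end) linker ranges, one per method.

    Taking the intersection is an assumption; returns None if the
    ranges do not overlap.
    """
    start = max(s for s, _ in ranges)
    end = min(e for _, e in ranges)
    return (start, end) if start <= end else None


def extend_linker(start, end, ss, max_len=MAX_LINKER, structured="HE"):
    """Grow [start, end] outward through coil residues of `ss`.

    ss: per-residue secondary structure string (DSSP-style codes).
    Growth stops at helix/strand residues, at the chain ends, or when
    the linker reaches max_len residues.
    """
    grew = True
    while grew and (end - start + 1) < max_len:
        grew = False
        if start > 0 and ss[start - 1] not in structured:
            start -= 1
            grew = True
        if ((end - start + 1) < max_len and end < len(ss) - 1
                and ss[end + 1] not in structured):
            end += 1
            grew = True
    return start, end
```

For example, with three method-specific ranges that overlap on two coil residues between a helix and a strand, the consensus would be widened to cover the whole coil segment between the two elements.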

Proteins in our data set were further filtered to consider only two-domain proteins that were not part of oligomeric complexes. Also, structures containing ligand groups or metal cofactors near the interface of the two domains were removed from the benchmark set. This yielded a total of 76 structures, listed in Tables 1–4.

Low-resolution domain assembly

A low-resolution search was performed for each system using a centroid representation of the protein (backbone and Cβ atoms only) in a manner similar to the Rosetta ab initio folding method (Bradley et al. 2005). The linker between each domain was initially set in an extended conformation (φ = −150°, ψ = 150°, ω = 180°), with ideal bond lengths and angles. The Rosetta de novo fragment assembly protocol was then used to build 5000 alternative conformations of the linker (Rohl et al. 2004) while the two flanking domains were held internally rigid. The centroid-level energy function used to guide the Monte Carlo sampling through alternative conformations favors burial of hydrophobic residues, pairing of β-strands, and penalizes clashes between backbone atoms and the residue centroids. The Cα RMSD to the native structure was calculated for each conformation (decoy) generated in this de novo buildup process, and if any decoys were found with an RMSD <3 Å, then the high-resolution refinement protocol was applied to all decoy models for that protein. Otherwise, the system was considered a failure at the low-resolution level, since it is unlikely that low RMSD models would be produced by high-resolution refinement given the limited radius of convergence. The centroid score (Rohl et al. 2004), evaluated for the entire structure, could not reliably identify low RMSD domain assemblies, so it was not used to reduce the number of models that were subjected to high-resolution refinement.
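Two small pieces of this step lend themselves to a sketch: the extended starting conformation of the linker and the fraction of decoys that fall under the 3 Å cutoff. The torsion values and the cutoff come from the text; the function names and the tuple layout are assumptions, and the fragment assembly itself is not reproduced here.

```python
# Extended starting torsions from the text, in degrees: (phi, psi, omega).
EXTENDED = (-150.0, 150.0, 180.0)


def extended_linker_torsions(n_residues):
    """Starting backbone torsions for a linker of n_residues residues."""
    return [EXTENDED for _ in range(n_residues)]


def fraction_below(decoy_rmsds, cutoff=3.0):
    """Fraction of decoys with Cα RMSD (Å) strictly below `cutoff`."""
    if not decoy_rmsds:
        return 0.0
    return sum(1 for r in decoy_rmsds if r < cutoff) / len(decoy_rmsds)
```

Under the rule described above, a system proceeds to high-resolution refinement whenever `fraction_below(rmsds) > 0`, that is, when at least one of the 5000 decoys is within 3 Å of the native structure.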

High-resolution refinement

Side chains were grafted onto the centroid-level models using the following protocol. First, for each system, the individual domains in the native structure were separated and the side chains repacked using a Monte Carlo sampling protocol (Kuhlman and Baker 2000) and the Dunbrack backbone-dependent rotamer library (native side chain conformations were not used as they could unfairly bias refinements toward the native structure). The repacked side chain orientations were then grafted onto each of the centroid-level decoys, and the residues near the domain–domain interface (with Cβ–Cβ distances <8 Å across the interface) were then repacked in each decoy. The rationale behind this approach was to reduce energy differences due to rotamer packing differences far from the domain interface.
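The interface selection by Cβ–Cβ distance can be sketched as follows. The 8 Å cutoff is from the text; the dictionary layout, the function name, and the handling of glycine (which has no Cβ and would need a substitute atom) are assumptions not addressed in the text.

```python
import math


def interface_residues(cb_domain1, cb_domain2, cutoff=8.0):
    """Residues with a Cβ within `cutoff` Å of a Cβ in the other domain.

    cb_domain1, cb_domain2: dicts mapping residue id -> (x, y, z) Cβ
    coordinates for each domain. Returns two sets of residue ids, one
    per domain.
    """
    iface1, iface2 = set(), set()
    for res_i, xyz_i in cb_domain1.items():
        for res_j, xyz_j in cb_domain2.items():
            if math.dist(xyz_i, xyz_j) < cutoff:
                iface1.add(res_i)
                iface2.add(res_j)
    return iface1, iface2
```

Only the residues returned here (plus the linker) would be repacked in each decoy, so that rotamer differences far from the interface do not contribute to energy differences between decoys.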

Following side chain grafting and repacking, the linker was relaxed using the Monte Carlo Minimization (MCM) protocol developed for high-resolution refinement of protein models in Rosetta (Bradley et al. 2005; Misura and Baker 2005). Each step in the MCM protocol consists first of small random perturbations that are applied to the backbone φ and ψ dihedral angles at random positions in the linker. The side chains in the linker as well as at the interface are then optimized using the greedy “rotamer-trials” algorithm (Rohl et al. 2004), and subsequently the backbone degrees of freedom of the linker and the side chain chi angles of all residues were minimized using a quasi-Newton algorithm. The move is then accepted or rejected based on the standard Metropolis criterion. Full combinatorial side chain repacking of the linker and interface between domains was carried out after every 25 MCM cycles. The repulsive Lennard-Jones term was initially damped and then ramped up over the first 10 cycles to more smoothly transition from the centroid to all-atom representations of the protein chain.
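The scheduling part of this protocol, namely the ramped repulsive term and the periodic full repack, can be sketched independently of any energy function. The 10-cycle ramp and the 25-cycle repack interval are from the text; the linear ramp shape, the starting weight of 0.1, and the function names are assumptions.

```python
def repulsive_weight(cycle, ramp_cycles=10, start=0.1, full=1.0):
    """Weight on the repulsive Lennard-Jones term at a 1-based cycle.

    A linear ramp from `start` to `full` over the first `ramp_cycles`
    cycles is assumed; the text gives only the ramp length.
    """
    if cycle >= ramp_cycles:
        return full
    return start + (full - start) * (cycle - 1) / (ramp_cycles - 1)


def mcm_schedule(n_cycles, repack_every=25):
    """Yield (cycle, repulsive_weight, do_full_repack) for each cycle."""
    for cycle in range(1, n_cycles + 1):
        yield cycle, repulsive_weight(cycle), cycle % repack_every == 0
```

Within each cycle, the perturbation, rotamer-trials optimization, minimization, and Metropolis test described above would run with the repulsive weight supplied by the schedule, and a full combinatorial repack would follow whenever the third element is true.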

The interdomain interaction energy was computed for each decoy by subtracting the energy of individual domains and the linker from the energy of the assembly. For comparison purposes, the interdomain interaction energies of the X-ray structure and relaxed natives were also calculated. Relaxed native structures in this context were generated from a centroid model of the complex that contained the linker in the same orientation as the native state. Starting with this model would be equivalent to obtaining a decoy of near 0 Å RMSD after a low-resolution search.
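The subtraction described above, and the ranking of decoys by the result, can be sketched as follows. The dictionary keys and function names are hypothetical; the component energies would come from the all-atom energy function, which is not reproduced here.

```python
def interaction_energy(e_assembly, e_domain1, e_domain2, e_linker):
    """Assembly energy minus the energies of the separated parts."""
    return e_assembly - (e_domain1 + e_domain2 + e_linker)


def rank_decoys(decoys):
    """Sort decoys from lowest to highest interdomain interaction energy.

    decoys: list of dicts with keys 'e_assembly', 'e_domain1',
    'e_domain2', and 'e_linker' (hypothetical layout).
    """
    return sorted(
        decoys,
        key=lambda d: interaction_energy(
            d["e_assembly"], d["e_domain1"], d["e_domain2"], d["e_linker"]
        ),
    )
```

Because the domain and linker terms are subtracted, a decoy with favorable side chain packing far from the interface gains no advantage in the ranking; only the energy of bringing the parts together is compared.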

In several cases, the domains in the relaxed natives appear to drift apart, leading to large deviations from the X-ray structure. The 1eov system demonstrates this effect as shown in Figure 1, where the decoys exhibit a funnel shape while the relaxed natives have a high RMSD. The reason for the large deviations in the near-native structures is that the linkers were idealized before relaxation; that is, bond lengths and angles were replaced by ideal values. For tight complexes, idealization of only the linker can lead to small backbone clashes that can cause the complex to drift apart upon relaxation.

Acknowledgements

The authors thank Keith Laidig for maintaining the computational resources used in this work. We also acknowledge David Kim and Carol Rohl for the initial culling and analysis of the data set. This work was supported by the HHMI.
