Support vector machines for prediction of dihedral angle regions

Zimmermann, Olav; Hansmann, Ulrich H. E.

doi:10.1093/bioinformatics/btl489

Abstract

Motivation: Most secondary structure prediction programs target only alpha helix and beta sheet structures and summarize all other structures in the random coil pseudo class. However, such an assignment often ignores existing local ordering in so-called random coil regions. Signatures for such ordering are distinct dihedral angle patterns. For this reason, we propose as an alternative approach to predict dihedral regions directly for each residue, as this yields more structural information.

Results: We propose a multi-step support vector machine (SVM) procedure, dihedral prediction (DHPRED), to predict the dihedral angle state of residues from sequence. Trained on 20 000 residues, our approach leads to dihedral region predictions whose accuracy, in regions without alpha helices or beta sheets, is higher than that of secondary structure prediction programs.

Availability: DHPRED has been implemented as a web service, which academic researchers can access from our webpage Author Webpage

Contact: u.hansmann@fz-juelich.de

1 INTRODUCTION

Despite decades of research, the prediction of protein structure and function solely from sequence information has remained one of the defining challenges in computational biology. However, there has been considerable progress in the prediction of the local secondary structure elements (SSE) that build up globular proteins. Based on neural networks (NN) (Qian and Sejnowski, 1988; Rost and Sander, 1994), hidden Markov models (HMMs) (Bystroff et al., 2000) and support vector machines (SVMs) (Hua and Sun 2001; Kim and Park, 2003; Ward et al., 2003), the secondary structure state of a residue can be predicted as either helix, extended (beta sheet) or coil with an accuracy of ∼76% if evolutionary information is used (Rost, 2001).

The primary target of secondary structure prediction programs is the detection of alpha helices and beta sheets. These SSE are macroscopic features defined by combinations of dihedral angles, hydrogen bonds and number of residues. The complex IUPAC–IUB definition utilized in secondary structure analysis programs like DSSP (Kabsch, 1983) makes it difficult to predict the state of an individual residue. For instance, an individual residue may be at the border between two different SSE and thus belong to both. Some prediction programs therefore give individual probability scores for each of the three states [e.g. PsiPred (Jones, 1999)].

In the present paper, we choose another approach and restrict ourselves to the prediction of dihedral angle regions. Such dihedral constraints were originally formulated by Ramachandran et al. (Ramachandran, 1968), but for a long time regarded as frequently violated and therefore of limited usability. However, recent analyses by Lovell et al. have demonstrated that violations are largely due to inaccurate assignment of atom positions in experimental structures (Lovell et al., 2003). Using carefully filtered high-resolution structures and excluding atoms with high B-factors, they derive surprisingly sharp boundaries for allowed and generously allowed regions of the Ramachandran plot. Analyses by Betancourt et al. revealed a strong correlation between the dihedral state of a residue and the state of its immediate sequence neighbors irrespective of the amino acid sequence (Betancourt and Skolnick, 2004). In the same study, it is demonstrated that these correlations can be used as a folding potential. Hence, dihedral angle regions do indeed describe accurately local ordering in proteins.

Most studies denote those parts of a structure that belong neither to beta strands nor to alpha helices as random coil. According to this definition, ∼45% of the residues in the PDB are random coil. However, this assignment does not exclude local ordering that is frequently observed even in these random coil regions (Vucetic et al., 2005). Several of these structures are mixed but distinct patterns of residues with dihedral angles as observed in alpha or beta conformations. Prediction of the dihedral state of individual residues in the coil region is a prerequisite to identifying elements of a general conformational alphabet and thereby augments the amount of structural information that can be predicted from sequence.

For these reasons, we describe in the present study an SVM-based method, DHPRED (dihedral prediction), to predict in which region of the Ramachandran plot the dihedral configuration of each residue lies. We analyze the dependencies on the sequence and the dihedral environment for each of these dihedral angle regions. We then describe a multi-step algorithm that exploits the influence of the dihedral neighborhood (Betancourt and Skolnick, 2004) using information from local predicted dihedral preferences. Using Comparative Assessment of Structure Prediction (CASP6) targets from new-folds as examples, we analyze the approach's performance and discuss further improvements.

2 METHODS

Sequence and structure datasets are derived from the representative subsets of the Protein Data Bank (PDB, Berman et al., 2000). The pdb50 library provided by the Research Collaboratory for Structural Bioinformatics (RCSB) contains structures of protein chains with a pairwise sequence identity <50%. This non-redundant set of protein chains was searched for all chains longer than 100 residues from X-ray structures with a resolution better than 2.0 Å. Omitting the N- and C-termini, as their dihedral conformation is less reliable, our dataset contains 424 609 residues from 1929 different protein chains. We estimate some of the dihedral angle regions from the figures in the publication of Lovell et al. (2003) and store these regions as grids with 1° spacing. Figure 1 shows the regions as we estimated them. Due to the low number of samples and for comparability to secondary structure prediction programs we have only used the generously allowed regions for helical (H) and extended (E) states. All other regions are merged into an outlier class (O), which is not to be confused with the random coil pseudo class mentioned above. In contrast to the random coil class, our outlier class contains only ∼7% of all residues.

Fig. 1

Dihedral regions estimated from (Lovell, 2003). The region interfaces of the generously allowed regions were defined manually by us and are partially overlapping.

Table 1 shows the distribution of the different dihedral angle regions for our dataset. Over 93% are located in the generously allowed alpha and beta regions. Our prediction algorithm belongs to the class of SVMs, i.e. supervised machine-learning algorithms that require positive and negative examples for training. For a comprehensive introduction to SVMs see Schoelkopf and Smola (2002). The C-SVM algorithm implementation of the LIBSVM-library (Author Webpage) with a radial basis function (RBF) kernel is used throughout this study. Input data for training are vectors composed of a class label and several numerical input values (features). The resulting model is an abstract specification of the hyperplane that separates two classes with the largest margin. This model is then used to classify previously unseen examples. In order to allow the algorithm to harness homology information, we have encoded each amino acid residue of the local sequence neighborhood by a profile vector of amino acid propensities obtained from the position specific scoring matrices of a PSI-BLAST run (Altschul et al., 1997). We use a sliding window of length 15 to define the local sequence environment of a residue. Accordingly, the feature vectors to encode the sequence information are of length 15 × 20 = 300 (Fig. 2).
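The window encoding described above can be illustrated with a minimal sketch. The function name `encode_window` and the zero-padding of positions beyond the chain ends are assumptions made for illustration; the paper omits the termini from its dataset and does not state how window positions outside the chain are handled.

```python
from typing import List

WINDOW = 15    # sliding window length stated in the text
N_AMINO = 20   # one PSSM column per standard amino acid


def encode_window(pssm: List[List[float]], center: int) -> List[float]:
    """Flatten the PSSM rows in a 15-residue window around `center`.

    `pssm` holds one row of 20 scores per residue. Window positions that
    fall outside the chain are filled with zeros (an assumed padding rule,
    not taken from the paper).
    """
    half = WINDOW // 2
    features: List[float] = []
    for pos in range(center - half, center + half + 1):
        if 0 <= pos < len(pssm):
            features.extend(pssm[pos])
        else:
            features.extend([0.0] * N_AMINO)
    return features
```

Each residue thus yields 15 × 20 = 300 numerical features, which would then be paired with a class label of +1 or −1 for training.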

Fig. 2

Encoding of vectors for SVM training.

Table 1

Distribution of dihedral regions where: core = allowed region (union contains 99% of all data according to Lovell et al., 2003), gen = generously allowed region (union contains 99.9% of all data according to Lovell et al., 2003)

Dihedral region (abbrev.)	# of samples
Right handed helix, core	210 840
Right handed helix, gen	215 391
Beta sheet, core	174 534
Beta sheet, gen	182 435
Left handed helix, core	11 406
Left handed helix, gen	17 904
II'-region, gen	2677
Gamma turn, gen	355
In none of these regions	7770
In none of these regions, not glycine	1003
Total number of residues	424 609

For a second set of classifiers, we also use the predicted class labels obtained from prediction runs using the first SVM-models. We employ a sequence window of length seven and three separate predictions: helix (alpha generously allowed region), extended (beta generously allowed region) and outlier (all others). This gives 21 features, which increase the total length of the vectors for the second set of SVM-models to 321. A sketch of the encoding scheme for both types of classifiers is plotted in Figure 2.
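As a sketch of how the 21 additional features could be assembled, the snippet below gathers the +1/−1 outputs of the three first-stage classifiers in a window of seven residues and appends them to the 300 profile features. The function names, the per-position ordering of the three labels and the use of 0 for positions outside the chain are illustrative assumptions, not details given in the paper.

```python
from typing import List

LABEL_WINDOW = 7   # window length for predicted labels, as stated in the text


def encode_labels(h: List[int], e: List[int], o: List[int], center: int) -> List[int]:
    """Collect helix, extended and outlier labels (+1/-1) around `center`.

    Returns 7 x 3 = 21 values. Positions outside the chain are encoded
    as 0 (an assumed padding rule).
    """
    half = LABEL_WINDOW // 2
    feats: List[int] = []
    for pos in range(center - half, center + half + 1):
        if 0 <= pos < len(h):
            feats.extend([h[pos], e[pos], o[pos]])
        else:
            feats.extend([0, 0, 0])
    return feats


def second_stage_vector(profile_features: List[float], label_features: List[int]) -> List[float]:
    """Append the 21 label features to the 300 profile features (321 in total)."""
    return list(profile_features) + [float(x) for x in label_features]
```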

Predictions start by performing a PSI-BLAST run for the target sequence, deriving vectors from the resulting PSSM and obtaining class labels using the first set of SVM-models (step 1). These class labels are then combined with the profile features and passed to the second set of SVM-models (step 2). The output of the second step is again a set of three independent predictions for the membership of a residue in the alpha, beta or outlier class, respectively. We find that repeating the second step using the updated dihedral neighborhood information from the previous prediction round leads to further improvement (step 3). In particular, residues showing ambiguous predictions become less frequent. As convergence of this iterative step is not guaranteed, we limit the number of additional rounds to nine. Due to the low number of ambiguous predictions, our use of discrete class labels +1 and −1 (instead of real-valued class probabilities) and a narrow sequence window of only 7 residues for the dihedral neighborhood, we always observe convergence after two to three additional rounds. Remaining ambiguities are resolved by assigning the class label of the nearest non-ambiguous residue (step 4).
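The control flow of steps 3 and 4 can be sketched as follows. This is a simplified illustration rather than the DHPRED implementation: `predict_round` stands in for a call to the second set of SVM-models, a residue is treated as ambiguous when not exactly one of the three classifiers returns +1, and ties in step 4 go to the left neighbor. All three choices are assumptions.

```python
from typing import Callable, List, Optional

Labels = List[Optional[str]]   # 'H', 'E', 'O', or None for an ambiguous residue


def combine(h: int, e: int, o: int) -> Optional[str]:
    """Merge three independent +1/-1 predictions into one label.

    Returns None unless exactly one classifier votes +1 (assumed
    definition of an ambiguous prediction).
    """
    votes = [name for name, v in (("H", h), ("E", e), ("O", o)) if v == 1]
    return votes[0] if len(votes) == 1 else None


def iterate_until_stable(labels: Labels,
                         predict_round: Callable[[Labels], Labels],
                         max_rounds: int = 9) -> Labels:
    """Step 3: repeat the second-stage prediction until the labels stop
    changing, using at most `max_rounds` additional rounds."""
    for _ in range(max_rounds):
        new_labels = predict_round(labels)
        if new_labels == labels:
            break
        labels = new_labels
    return labels


def resolve_ambiguous(labels: Labels) -> Labels:
    """Step 4: give each ambiguous residue the label of the nearest
    non-ambiguous residue (ties go to the left neighbor, an assumption)."""
    known = [i for i, lab in enumerate(labels) if lab is not None]
    if not known:
        return list(labels)
    resolved = list(labels)
    for i, lab in enumerate(labels):
        if lab is None:
            nearest = min(known, key=lambda j: (abs(j - i), j))
            resolved[i] = labels[nearest]
    return resolved
```

Capping the loop at nine rounds mirrors the limit stated above; the early exit corresponds to the observed convergence after two to three rounds.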

The Matthews correlation coefficient (MCC) is used throughout this study as the main evaluator of classification performance (Matthews, 1975):

MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP) \cdot (TP + FN) \cdot (TN + FP) \cdot (TN + FN)}},

using the definitions in Table 2.

Table 2

Definition of prediction categories for calculation of MCC, specificity and sensitivity

Prediction	Observation
	+1	−1
+1	TP (true positive)	FP (false positive)
−1	FN (false negative)	TN (true negative)

For some tests, we also give the sensitivity and specificity:

\mathrm{Sensitivity} = \frac{TP}{TP + FN}

\mathrm{Specificity} = \frac{TN}{TN + FP}.
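For readers who want to recompute these quantities from a confusion matrix, a minimal sketch follows. The function names are illustrative, and returning 0 for an undefined MCC (zero denominator) is a common convention rather than something specified in the paper. The example counts are those of the Alpha H row of Table 3.

```python
from math import sqrt


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from the four confusion counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Returning 0.0 for an undefined MCC is a convention, not from the paper.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def sensitivity(tp: int, fn: int) -> float:
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    return tn / (tn + fp)


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    return (tp + tn) / (tp + tn + fp + fn)


# Alpha H row of Table 3: TP = 7564, TN = 7924, FP = 1271, FN = 2113
alpha = dict(tp=7564, tn=7924, fp=1271, fn=2113)
alpha_acc = accuracy(**alpha)   # about 0.821, i.e. the 82.1% reported
alpha_mcc = mcc(**alpha)        # about 0.645, as reported
```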

3 RESULTS

3.1 SVM classifier performance

We have initially trained individual classifiers for each dihedral angle region. However, due to the low number of available examples for left-handed helices, gamma turns and II′ turns, classifiers for predicting these classes show only low correlation on the test set (data not shown). We therefore use here only the alpha and beta regions as targets and train classifiers for only two generously allowed regions: right-handed alpha helix and beta strand (denoted ralpha-gen and beta-gen in Fig. 1). A third classifier is trained on residues outside both of these regions. In a first step, we utilize only sequence profile information from PSI-BLAST. For computational reasons we restrict the training set to 20 000 residues from 499 proteins. Our prediction algorithm is then evaluated on an independent test set of 18 872 residues from 97 proteins. Results are shown in Table 3. Even the profile-only SVM classifiers show a prediction accuracy of ∼80%, in the range of one of the best secondary structure prediction programs, PSIPRED (Jones, 1999). However, note that the models show a marked tendency to over-predict extended residues and to under-predict residues in helical state.

Table 3

Performance of SVM PSSM-only classifiers

Region	TP	TN	FP	FN	Acc %	MCC
Alpha H	7564	7924	1271	2113	82.1	0.645
Beta E	6793	8631	2198	1250	81.7	0.635
Outlier O	906	16 705	920	341	93.3	0.567

In a second iteration, we improve on these results by adding dihedral neighborhood information obtained from prediction runs using the first classifiers to the training set features. As dihedral neighborhood information, we use the class labels of the first classifiers in a sequence window of length 7 (Fig. 2). Results presented in Table 4 are for the same independent test set of 18 872 residues. As expected, the results in this iteration show a moderate improvement over the predictions from the profile-only classifiers, validating the prediction approach described. The bias towards over-prediction of extended state remains, although less pronounced.

Table 4

Performance of SVM PSSM + dihedral classifiers

Region	TP	TN	FP	FN	Acc %	MCC
Alpha H	7780	7971	1224	1897	83.5	0.671
Beta E	6825	8861	1968	1218	83.1	0.661
Outlier O	905	16 724	901	342	93.4	0.570

3.2 Comparison to secondary structure prediction programs

Although we are not aware of any programs that yield predictions of a residue's dihedral state, some secondary structure prediction programs give probabilities for the secondary structure state of individual residues. Hence, we use this type of output from the GOR-IV and PSIPRED programs as an approximate measure for the dihedral region prediction. We have used the prediction scores without regard to the coil probability, as this purely macroscopic category does not imply any dihedral preference. To estimate the improvement gained by including information on the 3D environment of similar sequences, we compared our data with PSIPRED predictions obtained in single mode as well as with PSIPRED predictions that use position specific profiles from PSI-BLAST (Table 5).
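One way to read "without regard to the coil probability" is to compare only the helix and strand scores of each residue. The sketch below illustrates this reading; the function names, the argument order and the tie-breaking in favor of helix are assumptions, and the scores are generic per-residue values rather than the actual output format of PSIPRED or GOR-IV.

```python
def label_ignoring_coil(p_helix: float, p_strand: float, p_coil: float) -> str:
    """Return 'H' or 'E' from the helix and strand scores alone.

    The coil score is accepted for symmetry but deliberately unused.
    Ties go to helix (an assumption).
    """
    return "H" if p_helix >= p_strand else "E"


def label_including_coil(p_helix: float, p_strand: float, p_coil: float) -> str:
    """Conventional three-state assignment: the largest score wins."""
    scores = {"H": p_helix, "E": p_strand, "C": p_coil}
    return max(scores, key=scores.get)
```

A residue with scores (helix 0.3, strand 0.1, coil 0.6) would be labeled coil in the conventional assignment but helix-like when the coil score is ignored, which is how a dihedral preference can be recovered for residues in coil regions.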

Table 5

Prediction test for individual residues (n ≈ 17500)

		DHPRED step 1,2	PSIPRED profile	PSIPRED single	GOR-IV
Alpha	Sens %	77	86 (64)	79 (51)	72 (50)
	Spec %	90	73 (96)	60 (85)	64 (83)
	MCC	0.67	0.60 (0.62)	0.40 (0.38)	0.35 (0.35)
Beta	Sens %	80	73 (42)	60 (32)	64 (34)
	Spec %	86	86 (96)	79 (91)	72 (87)
	MCC	0.66	0.60 (0.47)	0.40 (0.28)	0.35 (0.25)

Sensitivity, specificity in % and Matthews correlation coefficient (MCC). Figures in brackets denote predictions obtained including coil predictions.

Although trained on a smaller database than GOR-IV or PSIPRED, the first two steps of our procedure give the same amount of information on local secondary structure as current secondary structure programs. Our method gives an MCC higher even than that of PSIPRED. This suggests that PSIPRED's unrivaled ability to detect SSEs comes at the price of a lower ability to detect less uniform local ordering. The low correlation coefficients for predictions including coil underline that a lot of information about the dihedral state at residue level can be recovered just by ignoring coil prediction probabilities. Note the pronounced improvement in the MCC of ∼0.2 when PSIPRED uses PSI-BLAST profiles. Our own experiments with SVM-based methods, with and without profile information, show a similar gain (data not shown).

3.3 Detailed analysis of CASP6 examples

The gold standard for any prediction method is its application to situations where no structure is known for any protein with similar sequence. Consequently, we have tested the performance of our approach for three targets of CASP6, among them two from the new-fold category. The first test case, Target 242 (PDB-code: 2blk, chain A), shown in Figure 3, is a new-fold and contains long stretches where according to DSSP there are no SSE. DHPRED correctly assigns 72 of the 88 core residues (81.8%) including all three ‘outliers’, while PSIPRED predicted 70 (79.6%). GOR-IV, in contrast, predicts less than half of the residues correctly, emphasizing that, even for new-folds, implicit information on the 3D environment can be obtained using sequence profiles. The correctly predicted regions are colored in black in Figure 3, while white denotes false predictions and gray the termini and outlier residues, which have not been evaluated in the comparison. A more detailed listing of our results for this protein and a comparison with competing techniques can be found in Figure 4. The false predictions for target 242 are mainly located in four clusters. The C-terminal part of the first alpha helix is not recognized, an error that is even more pronounced in the PSIPRED prediction. Before the second helix, an alternating pattern is missed, and two patterns where DSSP reports turns are not correctly predicted. The correlation of mispredictions between PSIPRED and DHPRED makes it likely that in these regions either rare H-bonding patterns occur or the normal local structure is strongly influenced by non-local interactions.

Fig. 3

Prediction for CASP6 target 242 (2blkA). Black: correct prediction, white: wrong prediction, gray: not evaluated (chain ends and outliers).

Fig. 4

True dihedral state and predictions for CASP6 target 242 (2blkA) with different algorithms. Normal face: correct, bold: wrong, gray: not evaluated. seq = amino acid sequence in one-letter-code, DSSP = secondary structure annotation by DSSP, DH = dihedral region according to the Lovell definitions: E = within beta-gen (extended), H = within ralpha-gen, O = outside of both regions (turn), DHPRED: dihedral region predicted by our SVM approach, GOR′: predicted preference by GOR-IV when ignoring coil prediction, PSI1′: same for PSIPRED without using profile information, PSIP′: same for PSIPRED using profile information.

Our second test case is the new-fold target 238 (PDB-code: 1w33, complement protein), which has an all-alpha structure and is shown in Figure 5. In spite of the tendency of DHPRED to under-predict residues in helical state, it assigns 86.9% of the 145 core residues to the correct class. Here PSIPRED, which favors helix predictions, achieves slightly better results (89.0%). The detailed analysis in Figure 6 demonstrates that false predictions by the SVM method cluster at the C-terminal half of the first and second helix. The first cluster of mispredictions is shared with PSIPRED. A scattered cluster of mispredictions is also located at the complex loop structure between the first and second helix. The first cluster of mispredictions contains the subpattern Ile-Gln-Ile (IQI), which is found more frequently in beta sheets than in alpha helices. The same is true for the second missed pattern, Lys-Tyr-Ser-Ser (LYSS). Due to our small number of training residues, we may have missed the less frequent sequence profiles that belong to helical conformations of this pattern.

Fig. 5

Prediction for CASP6 target 238 (1w33A). Black: correct prediction, white: wrong prediction, gray: not evaluated (chain ends and outliers).

Fig. 6

True dihedral state and predictions for CASP6 target 238 (1w33A) with different algorithms. Normal font: correct prediction, bold font: wrong prediction, gray: not evaluated (ambiguous or outliers). (See Fig. 4 for detailed legend.)

Although not a new-fold, the third test case, Target 273 (PDB-code: 1wdj), was chosen for its complex alpha-beta topology that includes a beta barrel at the C-terminus. The molecule is displayed in Figure 7. The prediction accuracy of DHPRED (82.4%) is even higher than that of PSIPRED (80.4%, Table 6). Although the large number of different loop structures connecting the SSE is the main problem for the DHPRED predictor, it assigns 33 of the 51 residues outside SSEs (64.7%) correctly, while the dihedral state derived from PSIPRED is correct in only 49.0% of the cases (Fig. 8 and Table 7).

Fig. 7

Prediction for CASP6 target 273 (1wdjA). Black: correct prediction, white: wrong prediction, gray: not evaluated (chain ends and outliers).

Fig. 8

True dihedral state and predictions for CASP6 target 273 (1wdjA) with different algorithms. Normal font: correct prediction, bold font: wrong prediction, gray: not evaluated (ambiguous or outliers). (See Fig. 4 for detailed legend.)

Tables 6 and 7 summarize the results for the three targets. Note that in all three cases false predictions tend to cluster and that all methods show strong correlations on the residues for which they predict the wrong class. While this is not surprising for residues within ‘coil’ regions with their irregular H-bond pattern, we find such ‘difficult residues’ also in helices that have neither strong kinks nor bends. The observed correlation of false predictions between three independent methods implies that in these particular regions the local structures strongly deviate from the average structures observed for similar sequences. We conjecture that in these cases the local secondary structure is more strongly determined by the non-local environment of the surrounding protein than it is on average. This is a principal limitation of all techniques that use only local sequence information.

Table 6

Performance comparison on three targets from CASP6

CASP target (pdb)	T0238 (1w33A)	T0242 (2blkA)	T0273 (1wdjA)
Res	85–235	17–104	17–172
# Res	151	88	156
# Res eval	145	85	148
% Correctly assigned
GOR-IV″	80.0	48.2	75.7
PSIPRED single″	84.1	62.4	68.9
PSIPRED profile″	89.0	82.4	80.4
SVM-DH	69.6	60.0	79.1
DHPRED step 1	75.2	74.1	75.0
DHPRED step 1,2	85.5	77.7	80.4
DHPRED step 1,2,3	86.9	78.8	81.1
DHPRED step 1,2,3,4	86.9	81.2	82.4

Table 7

Performance comparison in regions without SSEs

CASP target (pdb)	T0238 (1w33A)	T0242 (2blkA)	T0273 (1wdjA)
# Res eval	145	85	148
# Res w/o SSE	22	35	51
% Correctly assigned
PSIPRED profile″	59.1	65.7	49.0
DHPRED step 1,2,3,4	77.3	65.7	64.7

4 CONCLUSION AND OUTLOOK

We have developed a multi-step SVM-procedure DHPRED for predicting the dihedral class of individual residues. The advantage of such an approach over conventional secondary structure prediction methods is twofold. First, some of the difficulties arising from the inherent complexity of secondary structure definitions are avoided and second, it leads to additional information in ‘coil’ regions. Our approach is based solely on sequence profiles. However, each step generates additional information on the dihedral neighborhood that is used in the following step to improve the prediction performance. The method compares favorably to non-profile methods and is on par with PSIPRED regarding the overall prediction quality.

While PSIPRED excels especially on proteins with high helix content, DHPRED shows much higher prediction accuracy in regions between SSE. For computational reasons, we have used a rather small training set (20 000 residues from 499 proteins). We expect that larger training sets and rigorous parameter optimization will improve the prediction results considerably. In the future, we plan to use parallelized implementations of SVM algorithms that will allow for the weighting of features. We will also try to address some of the shortcomings of DHPRED, e.g. by employing special training sets for glycine and proline, which have dihedral preferences that deviate considerably from those of the other amino acid residues. Starting from microscopic predictions, as in DHPRED, we intend to target the prediction of macroscopic secondary structure in a bottom-up approach.

This work is supported in part by a research grant (GM62838) of the National Institutes of Health. The computations were performed on computers at the John v. Neumann Institute for Computing in Jülich, Germany.

Conflict of Interest: none declared.

REFERENCES

Altschul

S.F.

, et al.

Gapped BLAST and PSI-BLAST: a new generation of protein database search programs

,

Nucleic Acids Res.

,

1997

, vol.

25

(pg.

3389

-

3402

)

Google Scholar

Crossref

PubMed

WorldCat

Berman

H.M.

, et al.

The Protein Data Bank

,

Nucleic Acids Res.

,

2000

, vol.

28

(pg.

235

-

242

)

Google Scholar

Crossref

PubMed

WorldCat

Betancourt

M.R.

,

Skolnick

J.

.

Local propensities and statistical potentials of backbone dihedral angles in proteins

,

J. Mol. Biol.

,

2004

, vol.

342

(pg.

635

-

649

)

Google Scholar

Crossref

PubMed

WorldCat

Bhaskaran

R.

,

Ponnuswamy

P.K.

.

Positional flexibilities of amino acid residues in globular proteins

,

Int. J. Peptide Protein Res.

,

1988

, vol.

32

(pg.

241

-

255

)

Google Scholar

Crossref

WorldCat

Bystroff

C.

, et al.

HMMSTR: a hidden Markov model for local sequence-structure correlations in proteins

,

J. Mol. Biol.

,

2000

, vol.

301

(pg.

173

-

190

)

Google Scholar

Crossref

PubMed

WorldCat

Camproux

A.C.

, et al.

A hidden Markov model derived structural alphabet for proteins

,

J. Mol. Biol.

,

2004

, vol.

339

(pg.

591

-

605

)

Google Scholar

Crossref

PubMed

WorldCat

Chou

P.Y.

,

Fasman

G.D.

.

Prediction of protein conformation

,

Biochemistry

,

1974

, vol.

13

(pg.

222

-

245

)

Google Scholar

Crossref

PubMed

WorldCat

Fauchere

J.L.

, et al.

Amino acid side chain parameters for correlation studies in biology and pharmacology

,

Int. J. Pept. Protein Res.

,

1988

, vol.

32

(pg.

269

-

278

)

Google Scholar

Crossref

PubMed

WorldCat

Garnier

J.

, et al.

Analysis of the accuracy and implications of simple methods for predicting the secondary structure of globular proteins

,

J. Mol. Biol.

,

1978

, vol.

120

(pg.

97

-

120

)

Google Scholar

Crossref

PubMed

WorldCat

Hua

S.

,

Sun

Z.

.

A novel method of protein secondary structure prediction with high segment overlap measure-support vector machine approach

,

J. Mol. Biol.

,

2001

, vol.

308

(pg.

397

-

407

)

Google Scholar

Crossref

PubMed

WorldCat

Jones T.D. Protein secondary structure prediction based on position specific matrices, J. Mol. Biol., 1999, vol. 292 (pg. 195-202)

Kawashima S., et al. AAindex: amino acid index database, Nucleic Acids Res., 1999, vol. 27 (pg. 368-369)

Kihara D. The effect of long-range interactions on the secondary structure formation of proteins, Protein Sci., 2005, vol. 14 (pg. 1955-1963)

Kim H., Park H. Protein secondary structure prediction based on an improved support vector machines approach, Protein Eng., 2003, vol. 16 (pg. 553-560)

Klein P., et al. Prediction of protein function from sequence properties: discriminant analysis of a data base, Biochim. Biophys. Acta, 1984, vol. 787 (pg. 221-226)

Lewis P.N., et al. Chain reversals in proteins, Biochim. Biophys. Acta, 1973, vol. 303 (pg. 211-229)

Lovell S.C., et al. Structure validation by Calpha geometry: phi, psi and Cbeta deviation, Proteins, 2003, vol. 50 (pg. 437-450)

Matthews B.W. Comparison of the predicted and observed secondary structure of T4 phage lysozyme, Biochim. Biophys. Acta, 1975, vol. 405 (pg. 442-451)

Mitaku S., et al. Amphiphilicity index of polar amino acids as an aid in the characterization of amino acid preference at membrane-water interfaces, Bioinformatics, 2002, vol. 18 (pg. 608-616)

Nguyen M.N., Rajapakse J.C. Multi-class support vector machines for protein secondary structure prediction, Genome Inform. Ser. Workshop Genome Inform., 2003, vol. 14 (pg. 218-227)

Oobatake M., et al. Optimization of amino acid parameters for correspondence of sequence to tertiary structures of proteins, Bull. Inst. Chem. Res. Kyoto Univ., 1985, vol. 63 (pg. 82-94)

Petersen T.N., et al. Prediction of protein secondary structure at 80% accuracy, Proteins, 2000, vol. 41 (pg. 17-20)

Ptitsyn O.B., Finkelstein A.V. Theory of protein secondary structure and algorithm of its prediction, Biopolymers, 1983, vol. 22 (pg. 15-25)

Qian N., Sejnowski T.J. Predicting the secondary structure of globular proteins using neural network models, J. Mol. Biol., 1988, vol. 202 (pg. 865-884)

Robson B., et al. GOR method for predicting protein secondary structure from amino acid sequence, Meth. Enzymol., 1996, vol. 266 (pg. 540-553)

Rost B., Sander C. Combining evolutionary information and neural networks to predict protein secondary structure, Proteins, 1994, vol. 19 (pg. 55-72)

Rost B. Review: protein secondary structure prediction continues to rise, J. Struct. Biol., 2001, vol. 134 (pg. 204-218)

Schölkopf B., Smola A.J. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, 2002, Cambridge, MA, MIT Press

Takano K., Yutani K. A new scale for side-chain contribution to protein stability based on the empirical stability analysis of mutant proteins, Protein Eng., 2001, vol. 14 (pg. 525-528)

Tsai J., et al. The packing density in proteins: standard radii and volumes, J. Mol. Biol., 1999, vol. 290 (pg. 253-266)

Vihinen M., et al. Accuracy of protein flexibility predictions, Proteins, 1994, vol. 19 (pg. 141-149)

Vucetic S., et al. DisProt: a database of protein disorder, Bioinformatics, 2005, vol. 21 (pg. 137-140)

Ward J.J., et al. Secondary structure prediction with support vector machines, Bioinformatics, 2003, vol. 19 (pg. 1650-1655)

Author notes

Associate Editor: Anna Tramontano

