Abstract

Motivation

Domain boundary prediction is one of the most important problems in the study of protein structure and function. Many sequence-based domain boundary prediction methods are either template-based or machine learning (ML) based. ML-based methods often perform poorly due to their use of only local (i.e. short-range) features. These conventional features such as sequence profiles, secondary structures and solvent accessibilities are typically restricted to be within 20 residues of the domain boundary candidate.

Results

To address the performance of ML-based methods, we developed a new protein domain boundary prediction method (ConDo) that utilizes novel long-range features such as coevolutionary information in addition to the aforementioned local window features as inputs for ML. Toward this purpose, two types of coevolutionary information were extracted from multiple sequence alignment using direct coupling analysis: (i) partially aligned sequences, and (ii) correlated mutation information. Both the partially aligned sequence information and the modularity of residue–residue couplings possess long-range correlation information.

Availability and implementation

https://github.com/gicsaw/ConDo.git

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

A protein chain consists of one or more domains, each comprised of its own independent compact (Richardson, 1981) and folded (Wetlaufer, 1973) substructures. The protein domain is a stable unit of structure, function, and evolution (Bork, 1991). Studying protein domains requires one of two approaches: one is the domain identification of proteins whose 3D structures are already known (Alden et al., 2010; Alexandrov and Shindyalov, 2003; Guo et al., 2003; Xu et al., 2000; Zhou et al., 2007); the other is predicting domain boundaries from protein sequences whose 3D structures are yet unknown (Bondugula et al., 2008; Cheng et al., 2006; Eickholt et al., 2011; Sim et al., 2005; Xue et al., 2013). Domain boundary prediction has become a fundamental step in predicting and determining (Longhi et al., 2007) 3D protein structures, famously seen as a high complexity problem. Since the domain is considered a unit of folding within the protein, 3D structure prediction can be simulated separately for each domain; this approach can reduce the overall complexity of the problem.

Domain boundary prediction can be carried out via template-based methods (TBMs) (Bondugula et al., 2008; Xue et al., 2013) and ab initio methods. TBM predict the domain boundary of a target by referring to the domain information of another protein that has a known 3D structure which shows relatively high sequence similarity to the target. This method is highly accurate but can be used only when an appropriate template with known domain information is found. For ab initio methods, machine learning (ML) is mainly used (Cheng et al., 2006; Eickholt et al., 2011; Sim et al., 2005). Most ML-based methods prepare features for every residue positions on a chain, and the learned machine predicts whether or not a residue position is located at a domain boundary. Conveniently, most TBM and ML-based methods use the same input features. Sequence profiles (SPs) such as position-specific scoring matrices are the most commonly used input features with predicted secondary structure (SS) and predicted solvent accessibility (SA) not far behind.

When compared with TBMs, however, ab initio methods show relatively poor performance (Xue et al., 2013). Most ML-based methods use short-range sequence information (only about tens of residues away from the target position) in order to predict the domain boundary. However, the protein domain prediction may not be accurately determined solely by local interactions; in order to improve the prediction performance, long-range interactions between residues should be considered.

For example, many structure-based domain identification methods already use residue–residue contact information which is one kind of long-range interaction (Alexandrov and Shindyalov, 2003; Guo et al., 2003; Xu et al., 2000; Zhou et al., 2007). The frequency of contacts between residues belonging to an identical domain is higher than that between residues belonging to different domains. Although residue–residue contact information is useful for identifying domains, it cannot be readily used in the sequence-based domain boundary identification for proteins whose 3D structures are unknown. Although it is possible to use predicted contact information (Simkovic et al., 2017) instead of true contact information, previous studies rarely used predicted contact information (Kikuchi et al., 1988; Rigden, 2002; Sadowski, 2013) because of the poor quality of contact prediction methods at the time.

Because the coevolutionary (i.e. correlated mutation) information between residues is structurally related to the contact between (Göbel et al., 1994), it can also be used to predict the contacts. Recently, direct coupling analysis (Cocco et al., 2013; Ekeberg et al., 2013; Jones et al., 2012; Morcos et al., 2011; Seemayer et al., 2014) and ML-based contact prediction methods (Jones et al., 2015; Wang et al., 2017a) have been developed to improve the performance of this contact prediction. As a result, many newly develop methods for predicting protein 3D structures utilize these higher-quality predicted contacts. Residue–residue contact prediction was one of the most important issues in the 12th Critical Assessment of Structure Prediction (CASP12) (Buchan and Jones, 2018; Kosciolek and Jones, 2016; Wang et al., 2017b). Contact prediction using coevolutionary information (CI) is also associated with the observation that a protein domain is an evolutionary unit. Based on this observation, it is assumed that coevolution between residues belonging to the same domain is more frequent than coevolution between residues belonging to different domains.

It should be noted that TBMs utilizes features in a different way than ML-based methods do; TBMs use features to select and align templates. The SPs, SSs and SAs of both a target and its template are compared so as to align them properly. Because the templates selected in this process are aligned over a large sequence space, this approach is considered as a long-range method. The domain boundary of a target sequence is predicted using the domain information of the selected templates with the template reflecting the interaction of all residues.

Even when the 3D structure of a protein is not known, its sequence can serve as a template. In the multiple sequence alignment (MSA) of a target and its templates, some templates may cover every sequence region overall, whereas other templates only cover partial regions. Boundaries of these partially aligned sequences (PASs) may contain important information on domain boundaries (Eickholt et al., 2011).

Recently, it was shown that variation of domain structure can take place via a rearrangement of loops (Berezovsky, 2003; Koczyk and Berezovsky, 2008). Although protein domains contain many loops despite the fact that loops typically divide domains, paying attention to protein loops and the (re)arrangement of loops may indicate a possible hierarchy of domain structures, which reconciles different domain definitions in some cases (Berezovsky et al., 1999).

In this work, we have developed a domain boundary prediction method (ConDo) that utilizes long-range sequence information such as both PAS in MSA and CI, as well as the usual short-range sequence information such as SP, predicted SS and predicted SA. A deep-learning method was used to extract as much information on domain boundaries as possible from the various kinds of features collected. Recently, deep learning has been successfully used in various fields of studies such as image processing (Krizhevsky et al., 2012), speech recognition (Hinton et al., 2012) and language translation (Wu et al., 2016). Deep learning has also been used for many protein research areas including SS prediction (Wang et al., 2016) and contact prediction (Wang et al., 2017a).

In this article, we show how PAS and contact information are related to the study of domain boundaries (See the Supplementary Material). We generated datasets separately for training and testing from protein structure domain databases of SCOPe (Chandonia et al., 2017) and CATH (Sillitoe et al., 2015). We used a fully connected deep neural network (DNN) in our ML method. To test how each feature contributes to the domain boundary prediction, a 10-fold cross-validation was performed using a variety of input features combinations. The performance of our domain boundary prediction method is compared with existing ML-based methods.

2 Datasets

We generated training and test datasets with the following considerations. We selected protein chains if the number of their domains defined by SCOPe 2.06 (released on September 15, 2017) (Chandonia et al., 2017) and CATH 4.2 (released on August 21, 2017) (Sillitoe et al., 2015) were identical. For multi-domain chains, chains whose domain boundaries (as defined by SCOPe and CATH) that did not agree with each other within 20 residues were excluded. Small chains with <80 residues and large chains with >1000 residues were also excluded. Using CD-HIT (Huang et al., 2010), we also excluded chains with sequence identities greater than 25% of each other. Among the remaining chains, those released before February 27, 2016 were assigned as the training set and the rest as the test set. The training set contained 2583 single-domain chains and 790 multi-domain chains, and the test set contained 260 single-domain chains and 119 multi-domain chains (see Supplementary Material). We adopted domain boundary information as defined by SCOPe.

Additionally, in the test set, we excluded single-domain chains whose exterior boundaries were located over 40 residues away from the N- or C-terminus, as well as those chains whose domain sizes were smaller than 40 residues. The reason for this was to eliminate such cases where predicted domain boundaries divided a disordered region from a single domain. For example, 5l3dB is a single-domain target with a domain definition of ‘D1: 376/440’. Here, even when the domain boundary of 376 can be accurately predicted, it will be classified as a multi-domain target, and this would be technically treated as a false prediction.

3 Input features

In this section, we introduce and summarize all of the input features used in our ML approach. Figure 1 shows the pipeline of our domain boundary prediction method. Table 1 summarizes the input feature in the order of six subgroups: common (C), SP, SS, SA, PAS and CI. The modularity (Newman and Girvan, 2004) of the network generated by the predicted residue–residue contact map and the PAS is described in the section ‘Partially aligned sequence and coevolutionary information’ of the Supplementary Material.

Fig. 1.

The input feature generation pipeline for domain boundary prediction is shown as a diagram. MSADD represents MSA-derived data that includes NMSA, NPAS, BS1 and BS2. BS1 and BS2 are boundary signal related quantities and defined in Section 3. SDD represents sequence-derived data that includes Nres and the residue position

Table 1.

Input features used in this work are summarized. Each feature belongs to one of the following six groups; common (C), CI, PAS, SP, SS and SA

Features Group Window size Number per residue Total number
Outside of chain C 100 1 100
Residue position C 1 3 3
N  res C 1 1 1
SP SP 21 20 420
SS SS 41 3 123
SA SA 41 4 164
BS1 PAS 1 3 3
BS2 PAS 1 6 6
Max of AvgNPAS PAS 1 1 1
Max of AvgCPAS PAS 1 1 1
AvgNPAS PAS 101 1 101
AvgCPAS PAS 101 1 101
N  PAS PAS 1 1 1
N  PAS/NMSA PAS/CI 1 1 1
Modularity CI 100 1 100
Max and Min of modularity CI 1 2 2
N  MSA CI 1 1 1
Total 1129
Features Group Window size Number per residue Total number
Outside of chain C 100 1 100
Residue position C 1 3 3
N  res C 1 1 1
SP SP 21 20 420
SS SS 41 3 123
SA SA 41 4 164
BS1 PAS 1 3 3
BS2 PAS 1 6 6
Max of AvgNPAS PAS 1 1 1
Max of AvgCPAS PAS 1 1 1
AvgNPAS PAS 101 1 101
AvgCPAS PAS 101 1 101
N  PAS PAS 1 1 1
N  PAS/NMSA PAS/CI 1 1 1
Modularity CI 100 1 100
Max and Min of modularity CI 1 2 2
N  MSA CI 1 1 1
Total 1129

Note: BS1 and BS2 are boundary signal related quantities and defined in Section 3.

Table 1.

Input features used in this work are summarized. Each feature belongs to one of the following six groups; common (C), CI, PAS, SP, SS and SA

Features Group Window size Number per residue Total number
Outside of chain C 100 1 100
Residue position C 1 3 3
N  res C 1 1 1
SP SP 21 20 420
SS SS 41 3 123
SA SA 41 4 164
BS1 PAS 1 3 3
BS2 PAS 1 6 6
Max of AvgNPAS PAS 1 1 1
Max of AvgCPAS PAS 1 1 1
AvgNPAS PAS 101 1 101
AvgCPAS PAS 101 1 101
N  PAS PAS 1 1 1
N  PAS/NMSA PAS/CI 1 1 1
Modularity CI 100 1 100
Max and Min of modularity CI 1 2 2
N  MSA CI 1 1 1
Total 1129
Features Group Window size Number per residue Total number
Outside of chain C 100 1 100
Residue position C 1 3 3
N  res C 1 1 1
SP SP 21 20 420
SS SS 41 3 123
SA SA 41 4 164
BS1 PAS 1 3 3
BS2 PAS 1 6 6
Max of AvgNPAS PAS 1 1 1
Max of AvgCPAS PAS 1 1 1
AvgNPAS PAS 101 1 101
AvgCPAS PAS 101 1 101
N  PAS PAS 1 1 1
N  PAS/NMSA PAS/CI 1 1 1
Modularity CI 100 1 100
Max and Min of modularity CI 1 2 2
N  MSA CI 1 1 1
Total 1129

Note: BS1 and BS2 are boundary signal related quantities and defined in Section 3.

Sliding windows were used to generate a certain number of input features; and the window size, the number of feature elements per residue and the total number of feature elements per each feature are shown in Table 1. For all residues of a given protein chain, we calculated all the input features and predicted if a residue was at the domain boundary. Below, k refers to the residue index to be predicted as a domain boundary.

The ‘outside of the chain’ entry specifies a sliding window that contains 50 residues to the C-terminal direction and 50 residues to the N-terminal direction, and 0 is assigned to the residue outside of the target chain and 1 to the residue inside of the target chain. ‘Residue position’ k was used in three ways as follows:
ratio of k and N res = k N res ,
(1)
normalized distance from N - terminus = { k 1 1000 , k 1000 1 , k > 1000 ,
(2)
normalized distance from C - terminus = { N res k 1000 , k N res 1000 1 , k < N res 1000 .
(3)
The chain size N res was used as
normalized N res = { N res 1000 , N res 1000 1 , N res > 1000 .
(4)

SP was generated by HHblits (Remmert et al., 2012) with UniRef20, and a window size of 21 was used with the total number of elements being 21 × 20 = 420. SS and SA were used with a window size of 41 residues. For each residue’s SS, three features were used corresponding to the propensities of it being in the coil, helix and strand states, respectively. SS was predicted by PSIPRED (McGuffin et al., 2000). For each residues SA, four features were used corresponding to the predicted relative surface area, the confidence score and the two propensities of it being in the buried and exposed states, respectively. SA was predicted by SANN (Joo et al., 2012).

The boundary signal refers to the boundary of a PAS in MSA (Eickholt et al., 2011). At least one of the terminal residues of a PAS is positioned >20 residues away from the corresponding terminal residue of the query, and these terminals of a PAS may signal the presence of a domain boundary. A boundary signal at position k may arise either from an N-terminal residue of a PAS or from a C-terminal residue of another PAS. Three numbers are assigned to position k as (iN, iC, iN + iC) where i(N/C) is set to 1 if an N/C-terminal residue of a PAS is positioned at k, otherwise it is set to 0. For smoothing and normalization purposes, the final three boundary signal values were summed over the range of (k−5, k + 5) and divided by (11, 11 and 22) respectively. We denote this as BS1. It should be noted that in the calculation of BS1, the frequency of the boundary signal arising from multiple PASs is not reflected. Because of this, we defined a second kind of boundary signal, BS2, by counting the total number of boundary signals. Smoothing over (k−5, k + 5) was performed as above, but two additional normalization methods were used by dividing the total number of boundary signals by (1) (NMSA, NMSA, 2 × NMSA) and by (2) the maximum value of each quantity.

AvgNPAS and AvgCPAS were used with a sliding window of 101 residues centered at the k-th position. The maximum and minimum values of modularity for each target were used as input features. The ‘max of AvgNPAS/AvgCPAS’ which means maximum value of AvgNPASk(i)/AvgCPASk(i) for all i (not just within the window size of 101) was also used as an input feature.

The number of sequences in MSA (NMSA) and PAS (NPAS) were used in the log scale as
normalized log scale of N MSA = { 1 5 log 10 ( N MSA ) , 10 N MSA 100 000 0 , N MSA < 10 1 , N MSA > 100 000 ,
(5)
normalized log scale of N PAS = { 1 5 log 10 ( N PAS ) , 10 N PAS 100 000 0 , N PAS < 10 1 , N PAS > 100 000 .
(6)

Here we note that the maximum number of MSA sequences used for residue–residue contact prediction is 10 000 i.e. <100 000. The ratio of NPAS and NMSA was also used as NPAS/NMSA.

MSA for PAS and contact prediction was generated by jackHMMer (Finn et al., 2011) with UniRef90. Contacts were predicted by CCMpred (Seemayer et al., 2014). To save computational time, only up to 10 000 sequences in MSA were used for contact prediction. For the modularity, we used a window comprised of 100 residue positions from k − 49 to k + 50. The total number of input features used in this study is 1129.

4 Methods of machine learning

We used the KERAS ML toolkit with the Theano background for the fully connected DNN. Figure 2 shows our DNN model for domain boundary prediction. For each residue position, our machine predicted if it was near the true domain boundary or not. We used multi-task learning (Caruana, 1997) along with four output labels which corresponded to whether the residue position was within 5, 10, 15 or 20 residues from the true boundary (1) or not (0). We also used four hidden layers which had 1500 hidden units, and the rectified linear unit was used as activation functions. The drop-out rate of 0.5 was used only for the first and the last hidden layer. For the output label, we used the sigmoid function as the final activation function. The loss function was binary cross-entropy, and the optimizer was Adadelta (Zeiler, 2012) with the learning rate of 1.0. Mini-batch sizes were 128.

Fig. 2.

Diagram of the fully connected DNN for domain boundary prediction is shown. FCNN represents fully connected neural network

5 Ten-fold cross-validation for feature selection

In this section, we evaluate the performance of our domain boundary predictions. A 10-fold cross-validation was used to benchmark the predictive performance of various combinations of input features. Figure 3 and Table 2 show the precision-recall curve (PRC) and the area under the precision-recall curve (AUC-PR) of 10-fold cross-validation for the training set. Each data line in Figure 3 corresponds to the result of a particular combination of input features used.

Fig. 3.

The PRC of 10-fold cross-validation for training set. Input features used for prediction are indicated by colored lines (Color version of this figure is available at Bioinformatics online.)

Table 2.

The AUC-PR of 10-fold cross-validation

Features SP SP+SS SP+SS+SA PAS+CI SP+SS+SA+PAS SP+SS+SA+CI All
AUC-PR 0.2595 0.3516 0.4487 0.5592 0.6093 0.6690 0.7196
Features SP SP+SS SP+SS+SA PAS+CI SP+SS+SA+PAS SP+SS+SA+CI All
AUC-PR 0.2595 0.3516 0.4487 0.5592 0.6093 0.6690 0.7196
Table 2.

The AUC-PR of 10-fold cross-validation

Features SP SP+SS SP+SS+SA PAS+CI SP+SS+SA+PAS SP+SS+SA+CI All
AUC-PR 0.2595 0.3516 0.4487 0.5592 0.6093 0.6690 0.7196
Features SP SP+SS SP+SS+SA PAS+CI SP+SS+SA+PAS SP+SS+SA+CI All
AUC-PR 0.2595 0.3516 0.4487 0.5592 0.6093 0.6690 0.7196

All features within the common (C) group were always used. SP, SS, and SA are the usual short-range features. In previous studies, either SP was used as the only input feature (PPRODO), or SP+SS+SA were used altogether (DOMPRO). By examining the results of SP, SP+SS, and SP+SS+SA, we observe that adding SS and SA improves the performance of the boundary prediction. PAS and CI contain long-range features; the result of PAS+CI is better than that of SP+SS+SA, implying that long-range features are more useful for domain boundary prediction than short-range features. The result of SP+SS+SA+CI is better than that of SP+SS+SA+PAS, implying that CI is more useful than PAS. Not surprisingly, the best performance was achieved by using all of our chosen input features.

In our method (ConDo), long-range features and short-range features play separate roles; long-range features can provide information about the membership of two given residues belonging to either an identical domain or separate domains while short-range features are typically useful for determining if a given residue is near a domain boundary.

6 Benchmark results and discussion

In this section, we discuss the benchmark result of ConDo on the test set and compare our result against those from other methods in the literature. The evaluation criteria for benchmark are described in the section ‘Evaluation criteria for benchmark’ of the Supplementary Material. Before showing any statistics, we first explain how one can interpret the results of two examples. Figure 4 shows the examples of 1c7cA and 1sxjH, and the four outputs from our machine are shown in Figure 4A and B for 1c7cA and 1sxjH, respectively. These four output scores correspond to the domain boundary prediction scores using 5, 10, 15 and 20 residue criteria, and the final score which is shown in Figure 4C and D is the sum of these four scores.

Fig. 4.

Four output scores of domain boundaries within 5, 10, 15 and 20 residue separations from the residue position k for 1c7cA (A) and 1sxjH (B) are also shown. The total summation scores of 1c7cA (C) and 1sxjH (D) are shown. Domain definitions of 1c7cA and 1sxjH are ‘D1: 1/142 and D2: 143/283’ and ‘D1: 26/151 and D2: 152/279’, respectively

Using the final score, our domain boundary prediction is made as follows: (i) residue positions are sorted in the order of the final score; (ii) positions with less than the cutoff value (1.2) are eliminated; (iii) remaining positions are selected as domain boundaries in the order of the final score; (iv) for each selected position, neighboring positions within 40 residue separations are eliminated. Obviously, targets with no final scores above the cutoff value (1.2) are predicted as single-domain proteins. The predicted domain boundaries of 1c7cA and 1sxjH are respectively ‘140’ and ‘21 153’. These predicted domain boundaries are true positives. However, the first domain boundary of 1sxjH at 21 is an exterior domain boundary, meaning that there exist no additional domains toward the N-terminal direction. It should be noted that we trained our machine to allow exterior domain boundaries in the context of disorder prediction, but chose to exclude them in the evaluation. So, in the final prediction, we excluded boundaries located within 40 residues from either the N- or the C-terminus. Targets with at least one predicted domain boundary are classified as multi-domain proteins and the others as single-domain proteins.

Using our test data described in Section 2, we compared the performance of ConDo with three existing methods: PPRODO (Sim et al., 2005), DOMPRO (Cheng et al., 2006) and DoBo (Eickholt et al., 2011). PPRODO uses neural networks with a single hidden layer and, as its input features, only uses SP. DOMpro uses recursive neural networks and, as its input features, uses SP, SS and SA. DoBo uses a support vector machine and, as its input features, uses SP, SS, SA and boundary signals. It should be noted that DoBo predicts domain boundaries only for residue positions where boundary signals appear, while ConDo, PPRODO and DOMpro predict domain boundaries for all positions. In addition, ConDo, PPRODO and DOMpro perform classifications of single- and multi-domain proteins based on the domain boundary prediction result (i.e. targets with no predicted domain boundaries are classified as single-domain proteins), while DoBo first predicts if a given target is a single- or multi-domain protein, and then only for predicted multi-domain targets does it subsequently predict its domain boundaries.

Figure 5 and Table 3 show the classification results for the test set. DOMpro and DoBo are excluded because they do not provide the predicted confidence value. AUC-PRs in Table 3 correspond to Figure 5. Accuracy, Matthew's correlation coefficient (MCC), precision and recall are calculated with the cutoff confidence values of 1.2 (ConDo) and 0.35 (PPRODO). Based on these observations, it appears that ConDo outperforms the other ML-based methods.

Fig. 5.

The PRC of domain classification is shown. The classification results on single-domain targets (A) and multi-domain targets (B) are separately shown. All targets are from the test set described in Section 2

Table 3.

Classification results for the test set. Cutoff values of confidence are 1.2 (ConDo) and 0.35 (PPRODO)

Method Accuracy MCC Multi
Single
Precision Recall AUC-PR Precision Recall AUC-PR
ConDo 0.8496 0.6768 0.7153 0.8655 0.8789 0.9319 0.8423 0.8165
PPRODO 0.0.7704 0.5012 0.6111 0.7395 0.7278 0.8681 0.7846 0.7721
DOMPRO 0.7230 0.3925 0.5897 0.3866 0.7575 0.8769
DoBo 0.7414 0.3971 0.5897 0.5798 0.8092 0.8154
Method Accuracy MCC Multi
Single
Precision Recall AUC-PR Precision Recall AUC-PR
ConDo 0.8496 0.6768 0.7153 0.8655 0.8789 0.9319 0.8423 0.8165
PPRODO 0.0.7704 0.5012 0.6111 0.7395 0.7278 0.8681 0.7846 0.7721
DOMPRO 0.7230 0.3925 0.5897 0.3866 0.7575 0.8769
DoBo 0.7414 0.3971 0.5897 0.5798 0.8092 0.8154
Table 3.

Classification results for the test set. Cutoff values of confidence are 1.2 (ConDo) and 0.35 (PPRODO)

Method Accuracy MCC Multi
Single
Precision Recall AUC-PR Precision Recall AUC-PR
ConDo 0.8496 0.6768 0.7153 0.8655 0.8789 0.9319 0.8423 0.8165
PPRODO 0.0.7704 0.5012 0.6111 0.7395 0.7278 0.8681 0.7846 0.7721
DOMPRO 0.7230 0.3925 0.5897 0.3866 0.7575 0.8769
DoBo 0.7414 0.3971 0.5897 0.5798 0.8092 0.8154
Method Accuracy MCC Multi
Single
Precision Recall AUC-PR Precision Recall AUC-PR
ConDo 0.8496 0.6768 0.7153 0.8655 0.8789 0.9319 0.8423 0.8165
PPRODO 0.0.7704 0.5012 0.6111 0.7395 0.7278 0.8681 0.7846 0.7721
DOMPRO 0.7230 0.3925 0.5897 0.3866 0.7575 0.8769
DoBo 0.7414 0.3971 0.5897 0.5798 0.8092 0.8154

The recall of single-domains by DOMPRO (0.8769) is slightly better than that by ConDo (0.8423). However, the corresponding precision by ConDo (0.9319) is significantly better than that by DOMPRO (0.7575). Because the ratio of precision to recall depends on the cutoff value, obviously, one can adjust it to increase recall at the expense of decreasing precision, which we did not do in this study. Recently, Sadowski (2013) proposed a domain prediction method where the predicted contact information was utilized. In this method, Kernel density estimation was used instead of ML or modularity. However, we could not compare our results with Sadowski’s method due to its lack of availability at the time.

In Figure 6, PRC of domain boundary prediction for multi-domain targets is shown. The results are separately shown by excluding/including exterior boundaries (Fig. 6A and B). The data for DoBo cover only a small fraction of the recall range. DOMpro data could not be included in Figure 6 because DOMpro does not provide confidence values. The results are also summarized in Table 4. NDO, precision, and recall were calculated with the cutoff confidence values of 1.2 (ConDo), 0.35 (PPRODO) and 0.6 (DoBo). In all cases shown in the table, ConDo provided the best performance. We believe the reasons for this improvement are as follows: ConDo appropriately combines both local features and global features such as PAS and modularity of contact map while PPRODO and DOMpro use only local features (SP, SS and SA). Since we included many features in our machine as compared with previous studies, we had to use deep learning for better training. In addition, we used a bigger dataset for training as compared with previous studies; our training set contains 2583 single-domain chains and 790 multi-domain chains, while the corresponding numbers for DOMpro and DoBo are, respectively, 963/354 and 442/186 (single-/multi-domain), with PPRODO’s dataset containing 522 two-domain chains. We also note that, because DoBo predicts domain boundaries only for residues with boundary signals, the actual amount of data that DoBo contains is much less than that of the other methods.

Fig. 6.

The precision-recall curve of the domain boundary position prediction for multi-domain targets of the test set is shown by excluding exterior boundaries (A). The precision-recall curve without this exclusion is shown in (B)

Table 4.

Protein domain boundary prediction results for multi-domain targets in the test set

Methods NDO
Exclude exterior boundaries
Include exterior boundaries
All Multi Precision Recall AUC-PR Precision Recall AUC-PR
ConDo 0.9031 0.8427 0.7481 0.7153 0.7267 0.8244 0.7297 0.7994
PPRODO 0.8483 0.7337 0.4857 0.4964 0.4705 0.5429 0.5135 0.5301
DOMPRO 0.8573 0.6501 0.4807 0.1825 0.5192 0.1824
DoBo 0.8156 0.6048 0.2222 0.3066 0.2593 0.3311
Methods NDO
Exclude exterior boundaries
Include exterior boundaries
All Multi Precision Recall AUC-PR Precision Recall AUC-PR
ConDo 0.9031 0.8427 0.7481 0.7153 0.7267 0.8244 0.7297 0.7994
PPRODO 0.8483 0.7337 0.4857 0.4964 0.4705 0.5429 0.5135 0.5301
DOMPRO 0.8573 0.6501 0.4807 0.1825 0.5192 0.1824
DoBo 0.8156 0.6048 0.2222 0.3066 0.2593 0.3311

Note: Cutoff values of confidence are 1.2 (ConDo), 0.35 (PPRODO) and 0.6 (DoBo).

Table 4.

Protein domain boundary prediction results for multi-domain targets in the test set

Methods NDO
Exclude exterior boundaries
Include exterior boundaries
All Multi Precision Recall AUC-PR Precision Recall AUC-PR
ConDo 0.9031 0.8427 0.7481 0.7153 0.7267 0.8244 0.7297 0.7994
PPRODO 0.8483 0.7337 0.4857 0.4964 0.4705 0.5429 0.5135 0.5301
DOMPRO 0.8573 0.6501 0.4807 0.1825 0.5192 0.1824
DoBo 0.8156 0.6048 0.2222 0.3066 0.2593 0.3311
Methods NDO
Exclude exterior boundaries
Include exterior boundaries
All Multi Precision Recall AUC-PR Precision Recall AUC-PR
ConDo 0.9031 0.8427 0.7481 0.7153 0.7267 0.8244 0.7297 0.7994
PPRODO 0.8483 0.7337 0.4857 0.4964 0.4705 0.5429 0.5135 0.5301
DOMPRO 0.8573 0.6501 0.4807 0.1825 0.5192 0.1824
DoBo 0.8156 0.6048 0.2222 0.3066 0.2593 0.3311

Note: Cutoff values of confidence are 1.2 (ConDo), 0.35 (PPRODO) and 0.6 (DoBo).

7 Conclusion

We have developed a new domain boundary prediction method (named ConDo) using ML with CI. We first had to search for homologous sequences for given target sequences. From the MSA of homologous sequences, ConDo extracts two types of long-range sequence information: the PAS information and the predicted residue–residue contact information. Boundaries of PAS are candidates for the domain boundary of the target sequence. In this study, we assumed that the frequency of contacts between residues belonging to separate domains is lower than that between residues from the same domain. Based on this assumption, we performed residue-residue contact prediction to generate a weighted network where each residue position is a node and weighted edges were assigned between nodes according to their predicted contact score. The modularity of the network was calculated at each residue position by dividing the target chain into two consecutive subdivisions to generate features that represent the global nature of the chain. The proposed method outperformed existing domain boundary prediction methods, namely PPRODO, DOMpro and DoBo; this success is mainly due to the appropriate usage of both short-range and global features. We used fully connected DNNs for training and testing.

Because the success of ConDo can be attributed to the efficient utilization of CI from MSA, large numbers of homologous sequences are crucial for better prediction performance. In the case of residue–residue contact prediction (Jones et al., 2012; Marks et al., 2012), typically more than 1000 homologous sequences are required for a reliable prediction; we expect a similar minimum for ConDo. In the case of TBM, a large presence of relevant structural templates is critical for accurate domain boundary prediction. We expect ConDo to be more useful than TBM in the cases where many >1000) homologous sequences exist and where no significant structural templates are available.

In this study, we did not perform any special treatment for discontinuous domains. However, PAS and the predicted contact contain some information about discontinuous domains. For example, a special pattern of local alignments in MSA may signify the presence of discontinuous domains. Additionally, in the predicted contact map, block diagonal components could be scattered and off-diagonal block components could be relatively more prominent. These are important clues for discontinuous domain prediction and should be studied in the future.

For 1c7cA and 1sxjH (whose sequence lengths are about 300 residues long), the total run time was ∼20 min on the AMD Ryzen 7 1700 eight-core processor. These examples are included in the distribution.

Acknowledgements

We thank Dr Il Gu Yi, and Jong Yun Kim for their advice on machine learning. We thank Andrew Brooks for his critical reading of and suggestions for the article. We thank KIAS Center for Advanced Computation for providing computing resources.

Funding

This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science and ICT [NRF-2017R1E1A1A01077717].

Conflict of Interest: none declared.

References

Alden
 
K.
 et al.  (
2010
)
dConsensus: a tool for displaying domain assignments by multiple structure-based algorithms and for construction of a consensus assignment
.
BMC Bioinformatics
,
11
,
310.

Alexandrov
 
N.
,
Shindyalov
I.
(
2003
)
PDP: protein domain parser
.
Bioinformatics
,
19
,
429
430
.

Berezovsky
 
I.N.
(
2003
)
Discrete structure of van der waals domains in globular proteins
.
Protein Eng
.,
16
,
161
167
.

Berezovsky
 
I.N.
 et al.  (
1999
)
Hierarchy of the interaction energy distribution in the spatial structure of globular proteins and the problem of domain definition
.
J. Biomol. Struct. Dyn
.,
17
,
133
155
.

Bondugula
 
R.
 et al.  (
2008
)
FIEFDom: a transparent domain boundary recognition system using a fuzzy mean operator
.
Nucleic Acids Res
.,
37
,
452
462
.

Bork
 
P.
(
1991
)
Shuffled domains in extracellular proteins
.
FEBS Lett
.,
286
,
47
54
.

Buchan
 
D.W.
,
Jones
D.T.
(
2018
)
Improved protein contact predictions with the MetaPSICOV2 server in CASP12
.
Proteins: Struct. Funct. Bioinf.
,
86
,
78
83
.

Caruana
 
R.
(
1997
)
Multitask learning
.
Mach. Learn.
,
28
,
41
75
.

Chandonia
 
J.-M.
 et al.  (
2017
)
SCOPe: manual curation and artifact removal in the structural classification of proteins–extended database
.
J. Mol. Biol
.,
429
,
348
355
.

Cheng
 
J.
 et al.  (
2006
)
DOMpro: protein domain prediction using profiles, secondary structure, relative solvent accessibility, and recursive neural networks
.
Data Mining Knowl. Discov
.,
13
,
1
10
.

Cocco
 
S.
 et al.  (
2013
)
From principal component to direct coupling analysis of coevolution in proteins: low-eigenvalue modes are needed for structure prediction
.
PLoS Comput. Biol
.,
9
,
e1003176.

Eickholt
 
J.
 et al.  (
2011
)
DoBo: Protein domain boundary prediction by integrating evolutionary signals and machine learning
.
BMC Bioinformatics
,
12
,
43.

Ekeberg
 
M.
 et al.  (
2013
)
Improved contact prediction in proteins: using pseudolikelihoods to infer Potts models
.
Phys. Rev. E
,
87
,
012707
.

Finn
 
R.D.
 et al.  (
2011
)
HMMER web server: interactive sequence similarity searching
.
Nucleic Acids Res
.,
39 (Suppl. 2)
,
W29
W37
.

Göbel
 
U.
 et al.  (
1994
)
Correlated mutations and residue contacts in proteins
.
Proteins
,
18
,
309
317
.

Guo
 
J.-T.
 et al.  (
2003
)
Improving the performance of domainparser for structural domain partition using neural network
.
Nucleic Acids Res
.,
31
,
944
952
.

Hinton
 
G.
 et al.  (
2012
)
Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups
.
IEEE Sig. Process. Mag
.,
29
,
82
97
.

Huang
 
Y.
 et al.  (
2010
)
CD-HIT Suite: a web server for clustering and comparing biological sequences
.
Bioinformatics
,
26
,
680
682
.

Jones
 
D.T.
 et al.  (
2012
)
PSICOV: precise structural contact prediction using sparse inverse covariance estimation on large multiple sequence alignments
.
Bioinformatics
,
28
,
184
190
.

Jones
 
D.T.
 et al.  (
2015
)
MetaPSICOV: combining coevolution methods for accurate prediction of contacts and long range hydrogen bonding in proteins
.
Bioinformatics
,
31
,
999
1006
.

Joo
 
K.
 et al.  (
2012
)
SANN: solvent accessibility prediction of proteins by nearest neighbor method
.
Proteins
,
80
,
1791
1797
.

Kikuchi
 
T.
 et al.  (
1988
)
Prediction of the location of structural domains in globular proteins
.
J. Protein Chem
.,
7
,
427
471
.

Koczyk
 
G.
,
Berezovsky
I.N.
(
2008
)
Domain hierarchy and closed loops (DHcL): a server for exploring hierarchy of protein domain structure
.
Nucleic Acids Res
.,
36 (Suppl. 2)
,
W239
W245
.

Kosciolek
 
T.
,
Jones
D.T.
(
2016
)
Accurate contact predictions using covariation techniques and machine learning
.
Proteins
,
84 (Suppl. 1)
,
145
151
.

Krizhevsky
 
A.
 et al.  (
2012
) Imagenet classification with deep convolutional neural networks. In:
Pereira
F.
(eds) Advances in Neural Information Processing Systems, Curran Associates, Inc., pp.
1097
1105
. http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf.

Longhi
 
S.
 et al.  (
2007
) Protein engineering. In:
Macromolecular Crystallography Protocols
.
Springer
, pp.
59
90
.

Marks
 
D.S.
 et al.  (
2012
)
Protein structure prediction from sequence variation
.
Nat. Biotechnol
.,
30
,
1072.

McGuffin
 
L.J.
 et al.  (
2000
)
The PSIPRED protein structure prediction server
.
Bioinformatics
,
16
,
404
405
.

Morcos
 
F.
 et al.  (
2011
)
Direct-coupling analysis of residue coevolution captures native contacts across many protein families
.
Proc. Natl. Acad. Sci. USA
,
108
,
E1293
E1301
.

Newman
 
M.E.
,
Girvan
M.
(
2004
)
Finding and evaluating community structure in networks
.
Phys. Rev. E
,
69
,
026113
.

Remmert
 
M.
 et al.  (
2012
)
HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment
.
Nat. Methods
,
9
,
173
175
.

Richardson
 
J.S.
(
1981
)
The anatomy and taxonomy of protein structure
.
Adv. Protein Chem
.,
34
,
167
339
.

Rigden
 
D.J.
(
2002
)
Use of covariance analysis for the prediction of structural domain boundaries from multiple protein sequence alignments
.
Protein Eng
.,
15
,
65
77
.

Sadowski
 
M.I.
(
2013
)
Prediction of protein domain boundaries from inverse covariances
.
Proteins
,
81
,
253
260
.

Seemayer
 
S.
 et al.  (
2014
)
CCMpred-fast and precise prediction of protein residue–residue contacts from correlated mutations
.
Bioinformatics
,
30
,
3128
3130
.

Sillitoe
 
I.
 et al.  (
2015
)
CATH: comprehensive structural and functional annotations for genome sequences
.
Nucleic Acids Res
.,
43
,
D376
D381
.

Sim
 
J.
 et al.  (
2005
)
PPRODO: prediction of protein domain boundaries using neural networks
.
Proteins
,
59
,
627
632
.

Simkovic
 
F.
 et al.  (
2017
)
Applications of contact predictions to structural biology
.
IUCrJ
,
4
,
291
300
.

Wang
 
S.
 et al.  (
2016
)
Protein secondary structure prediction using deep convolutional neural fields
.
Sci. Rep
.,
6
,
18962
.

Wang
 
S.
 et al.  (
2017a
)
Accurate de novo prediction of protein contact map by ultra-deep learning model
.
PLoS Comput. Biol
.,
13
,
e1005324.

Wang
 
S.
 et al.  (
2017b
) Analysis of deep learning methods for blind protein contact prediction in casp12. Proteins.

Wetlaufer
 
D.B.
(
1973
)
Nucleation, rapid folding, and globular intrachain regions in proteins
.
Proc. Natl. Acad. Sci. USA
,
70
,
697
701
.

Wu
 
Y.
 et al.  (
2016
) Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv: 1609.08144.

Xu
 
Y.
 et al.  (
2000
)
Protein domain decomposition using a graph-theoretic approach
.
Bioinformatics
,
16
,
1091
1104
.

Xue
 
Z.
 et al.  (
2013
)
ThreaDom: extracting protein domain boundary information from multiple threading alignments
.
Bioinformatics
,
29
,
i247
i256
.

Zeiler
 
M.D.
(
2012
) ADADELTA: an adaptive learning rate method. arXiv preprint arXiv: 1212.5701.

Zhou
 
H.
 et al.  (
2007
)
DDOMAIN: Dividing structures into domains using a normalized domain–domain interaction profile
.
Protein Sci
.,
16
,
947
955
.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
Associate Editor: Alfonso Valencia
Alfonso Valencia
Associate Editor
Search for other works by this author on:

Supplementary data