Use of covariance analysis for the prediction of structural domain boundaries from multiple protein sequence alignments

Protein Eng. 2002 Feb;15(2):65-77. doi: 10.1093/protein/15.2.65.

Abstract

Current methods for identification of domains within protein sequences require either structural information or the identification of homologous domain sequences in different sequence contexts. Knowledge of structural domain boundaries is important for fold recognition experiments and structural determination by X-ray crystallography or nuclear magnetic resonance spectroscopy using the divide-and-conquer approach. Here, a new and conceptually simple method for the identification of structural domain boundaries in multiple protein sequence alignments is presented. Analysis of covariance at positions within the alignment is first used to predict 3D contacts. By the nature of the domain as an independent folding unit, inter-domain predicted contacts are fewer than intra-domain predicted contacts. By analysing all possible domain boundaries and constructing a smoothed profile of predicted contact density (PCD), true structural domain boundaries are predicted as local profile minima associated with low PCD. A training data set is constructed from 52 non-homologous two-domain protein sequences of known 3D structure and used to determine optimal parameters for the profile analysis. The alignments in the training data set contained 48 ± 17 (mean ± SD) sequences and lengths of 257 ± 121 residues. Of the 47 alignments yielding predictions, 35% of true domain boundaries are predicted to within 15 amino acids by the local profile minimum with the lowest profile value. Including predictions from the second- and third-lowest local minima increases the correct domain boundary coverage to 60%, whereas the lowest five local minima cover 79% of correct domain boundaries. Through further profile analysis, criteria are presented which reliably identify subsets of more accurate predictions. Retrospective analysis of CASP3 targets shows predictions of sufficient accuracy to enable dramatically improved fold recognition results. Finally, a prediction is made for geminivirus AL1 protein which is in full agreement with biochemical data, yielding a plausible, novel threading result.
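
The profile step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the PCD formula, the smoothing scheme or the optimised parameters, so the normalisation (crossing contacts divided by the number of possible inter-domain residue pairs), the moving-average window and all function names below are assumptions made for illustration. The covariance analysis itself is skipped; predicted contacts are taken as given input.

```python
def contact_density_profile(contacts, length):
    """Score every candidate boundary b (domains [0, b) and [b, length)).

    Assumed definition: predicted contacts that cross the boundary,
    divided by the number of possible inter-domain residue pairs.
    """
    boundaries = list(range(1, length))
    density = []
    for b in boundaries:
        crossing = sum(1 for i, j in contacts if min(i, j) < b <= max(i, j))
        density.append(crossing / (b * (length - b)))
    return boundaries, density


def smooth(values, window=3):
    """Moving average with the window truncated at both ends (assumed scheme)."""
    half = window // 2
    out = []
    for k in range(len(values)):
        lo, hi = max(0, k - half), min(len(values), k + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out


def ranked_minima(boundaries, profile):
    """Return strict local minima as (boundary, value), lowest value first.

    Flat plateaus are ignored in this simple sketch.
    """
    minima = [
        (boundaries[k], profile[k])
        for k in range(1, len(profile) - 1)
        if profile[k] < profile[k - 1] and profile[k] < profile[k + 1]
    ]
    return sorted(minima, key=lambda m: m[1])


# Toy input (hypothetical): 20 residues, two domains of 10 residues each.
# Every intra-domain pair at least 3 positions apart is a predicted contact,
# plus a single inter-domain contact.
toy_contacts = [
    (i, j)
    for start in (0, 10)
    for i in range(start, start + 10)
    for j in range(i + 3, start + 10)
]
toy_contacts.append((8, 12))

boundaries, raw = contact_density_profile(toy_contacts, length=20)
ranked = ranked_minima(boundaries, smooth(raw, window=3))
print(ranked[:5])  # candidate boundaries, lowest smoothed density first
```

In this toy case the lowest-ranked minimum falls at boundary 10, between the two constructed domains. Taking the first few entries of `ranked` mirrors the abstract's evaluation of the lowest one, three and five local minima.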

MeSH terms

  • DNA-Binding Proteins / chemistry
  • Databases, Factual
  • Geminiviridae
  • Models, Chemical
  • Protein Conformation
  • Protein Structure, Tertiary*
  • Proteins / chemistry*
  • Sequence Alignment*
  • Sequence Analysis / methods*
  • Structure-Activity Relationship
  • Viral Proteins / chemistry
  • Virus Replication

Substances

  • DNA-Binding Proteins
  • Proteins
  • Viral Proteins
  • replication protein AL1, Begomovirus