Theobald reply

Theobald, D. L.

doi:10.1038/nature09483

Brief Communications Arising
Published: 15 December 2010

Nature volume 468, page E10 (2010)

Abstract

Replying to T. Yonezawa & M. Hasegawa Nature 468, 10.1038/nature09482 (2010)

Yonezawa and Hasegawa¹ provide an example from two apparently unrelated families of nucleic acid coding sequences for which an Akaike information criterion (AIC) model selection test, similar to mine², chooses a common origin hypothesis. Although this may seem surprising, the coding sequences in this example were aligned in the same reading frame. The constraints of the genetic code are expected to induce correlations between these sequences (and among all coding sequences) that are not due to common ancestry. For instance, owing to codon bias and the structure of the genetic code, in these sequences the second codon position is biased towards T (about twofold over average), whereas the third position is usually an A (∼50%) and rarely a G (∼4%).
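
The position-specific composition described above can be checked directly on any set of in-frame coding sequences. The following is a minimal sketch, not the analysis used in this reply: the function name `codon_position_frequencies` and the toy sequences are invented for illustration, and codons containing gaps or ambiguous characters are simply skipped.

```python
from collections import Counter

def codon_position_frequencies(seqs):
    """Return base frequencies at codon positions 1, 2 and 3.

    `seqs` is an iterable of DNA strings that begin in reading frame.
    Codons with gaps or ambiguous characters are skipped (an assumption
    made for this sketch).
    """
    counts = [Counter(), Counter(), Counter()]
    for seq in seqs:
        seq = seq.upper()
        # Walk the sequence one complete codon at a time.
        for i in range(0, len(seq) - len(seq) % 3, 3):
            codon = seq[i:i + 3]
            if set(codon) <= set("ACGT"):
                for pos, base in enumerate(codon):
                    counts[pos][base] += 1
    freqs = []
    for c in counts:
        total = sum(c.values())
        freqs.append({b: (c[b] / total if total else 0.0) for b in "ACGT"})
    return freqs

# Toy input, invented for illustration only (not the sequences discussed here).
toy = ["ATGCTAGTA", "ATGTTAGCA"]
for position, f in enumerate(codon_position_frequencies(toy), start=1):
    print(position, f)
```

If two unrelated gene families show the same skew at each codon position, a nucleotide model that ignores codon structure can mistake that shared composition for a signal of common ancestry.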

Main

One can account for these correlations explicitly by using codon models (as implemented in PAML³, codonFreq = 2 or 3) or standard amino acid models (as in PhyML⁴). With these more realistic models, independent ancestry is the strongly preferred hypothesis. Furthermore, the raw likelihoods and AIC scores increase significantly (by hundreds to thousands of logs), indicating that codon and amino acid models are greatly superior to the naive nucleotide models.
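
The bookkeeping behind such a comparison can be sketched as follows. This is an illustrative outline under stated assumptions, not the procedure run in PAML or PhyML: the score is written as ln L minus the number of parameters (the conventional AIC divided by -2), so that higher scores are better, matching the way scores are described as increasing here. The function names and all numbers are hypothetical.

```python
def aic_score(log_likelihood, n_params):
    """Score in log units: ln L - k, i.e. the conventional AIC divided by -2.

    Higher is better under this convention (an assumption of this sketch).
    """
    return log_likelihood - n_params

def independent_origins_score(groups):
    """Score for a hypothesis in which each group evolved independently.

    `groups` is a list of (log_likelihood, n_params) pairs, one per group.
    Independent groups share no parameters, so both quantities add.
    """
    total_lnl = sum(lnl for lnl, _ in groups)
    total_k = sum(k for _, k in groups)
    return aic_score(total_lnl, total_k)

# Hypothetical values, chosen only to show the arithmetic.
common = aic_score(log_likelihood=-10500.0, n_params=40)
independent = independent_origins_score([(-5200.0, 22), (-5250.0, 22)])
delta = independent - common  # positive values favour independent origins
print(common, independent, delta)
```

The independent-origins hypothesis pays for its extra parameters, so it is preferred only when the summed likelihood of the separate trees improves by more than that penalty.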

Yonezawa and Hasegawa¹ point out that I² did not explicitly test models in which selection or biophysical constraints generate sequence correlations among proteins with independent origins. Formal phylogenetic models accounting for such factors are currently unavailable; their development would be a welcome advance. Although these are important considerations for proteins with low sequence similarity, neither selection nor physical constraints alone can plausibly generate the high levels of sequence similarity (>55% average sequence identity) observed in the universal protein data set that I used²,⁵. The amount of adaptive convergence necessary to produce thousands of identical amino acids among 23 different proteins from completely independent beginnings is not comparable to the limited molecular convergence seen with, for example, homologous digestive lysozymes⁶, in which already highly similar proteins (in function, structure and sequence) later acquired a handful of identical substitutions in parallel.

How could selection or biophysical constraints induce correlations among unrelated sequences? If certain similar amino acid sequences are necessary for performing specific functions (or for adopting a specific tertiary conformation that is necessary for function), then selection for function may ‘lead’ proteins with independent origins to neighbouring regions of sequence space. However, no particular protein sequence or fold is necessary for any given function. There are abundant examples of proteins with undetectable sequence similarity and different folds that perform the same biochemical and cellular functions⁷. For example, the proteases subtilisin, trypsin and carboxypeptidase have the same active site and mechanism, whereas papain, renin and thermolysin have different active sites and different mechanisms. All six proteases have radically different folds and sequences. Because different folds in general have different sequence requirements, proteins with the same function need not have similar sequences.

Even assuming that a certain protein fold is necessary for a given function, current molecular evidence indicates that sequence requirements for a fold are extremely low, nearly indistinguishable from random. This evidence comes from many independent sources throughout biology.

Many large classes of proteins with identical folds have no detectable sequence similarity (for example, families of TIM barrels, carbonic anhydrases, OB-folds, SH3 domains, Rossmann folds and immunoglobulin domains). These proteins provide prima facie evidence that sequence requirements for any particular fold and function are nearly indistinguishable from random. Protein domains in the SCOP database⁸ from different superfamilies yet with the same fold share ∼9% sequence identity⁹.
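
For readers who want to reproduce identity figures of this kind, a minimal sketch of pairwise percent identity is given below. The definition used (identical residues divided by aligned, ungapped columns) is one common convention and is an assumption here; published values may use a different denominator. The function name and the toy alignment are invented for illustration.

```python
def percent_identity(a, b, gap="-"):
    """Percent identity between two rows of a pairwise alignment.

    Columns where either sequence has a gap are ignored; the denominator
    is the number of remaining columns (one convention among several).
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == gap or y == gap:
            continue
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0

# Toy alignment, invented for illustration only.
print(percent_identity("MKV-LA", "MRVQLS"))
```

With 20 amino acids, unrelated sequences are expected to match at a small fraction of positions by chance, which is the baseline against which the figures above are judged.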

Identical folds with known independent origins have nearly random sequence similarity⁹,¹⁰. For example, unrelated proteins with the same fold from the MALISAM database share 8.5 ± 0.4% sequence identity⁹,¹⁰. This data can be used to estimate the correlations among independently evolved and created proteins with the same fold, and the correlations are nearly random. In the universal protein data set that I used², the average sequence correlation induced by common ancestry is roughly one log-likelihood per site for the most divergent proteins. In contrast, the correlations among independent proteins with the same fold are ∼100 times weaker. From this we can estimate that model selection scores for common ancestry hypotheses will be many thousands of logs greater than competing selection hypotheses.
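
The order of magnitude of this estimate can be illustrated with a back-of-the-envelope calculation. The sketch below treats the "correlation" as a per-site gain in log-likelihood and ignores parameter penalties; both are simplifying assumptions. The alignment length of 5,000 sites is a hypothetical placeholder, not a figure taken from the data set.

```python
def score_gap(n_sites, per_site_ancestry=1.0, weakening=100.0):
    """Rough gap, in log units, between common-ancestry and convergence scores.

    per_site_ancestry: log-likelihood gained per site from common ancestry
                       (roughly one, per the text).
    weakening:         factor by which correlations among independent
                       same-fold proteins are weaker (about 100, per the text).
    Parameter-count penalties are ignored in this sketch.
    """
    per_site_convergence = per_site_ancestry / weakening
    return n_sites * (per_site_ancestry - per_site_convergence)

# Hypothetical alignment length, for illustration only.
print(score_gap(5000))
```

Under these assumptions the gap scales almost linearly with alignment length, so any alignment of several thousand sites yields a difference of thousands of logs.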

Even the most conserved proteins have not yet reached the limits of sequence space, which has been estimated to be near the random expectation for any given fold and function¹¹.

These arguments are largely circumstantial and informal. I have not tested all possible competing hypotheses, and my analysis will not be the “last word on common ancestry”¹². I emphasize that I have in no sense provided an absolute ‘proof’ of universal common ancestry. One of the great advantages of the model selection framework that I presented is that if a novel model is proposed with a well-defined likelihood function, then we can easily compare it to the common ancestry models and see how it fares.

References

1. Yonezawa, T. & Hasegawa, M. Was the universal common ancestry proved? Nature 468, 10.1038/nature09482 (2010)
2. Theobald, D. L. A formal test of the theory of universal common ancestry. Nature 465, 219–222 (2010)
3. Yang, Z. PAML: a program package for phylogenetic analysis by maximum likelihood. Comput. Appl. Biosci. 13, 555–556 (1997)
4. Guindon, S. & Gascuel, O. A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst. Biol. 52, 696–704 (2003)
5. Brown, J. R., Douady, C. J., Italia, M. J., Marshall, W. E. & Stanhope, M. J. Universal trees based on large combined protein sequence data sets. Nature Genet. 28, 281–285 (2001)
6. Stewart, C. B., Schilling, J. W. & Wilson, A. C. Adaptive evolution in the stomach lysozymes of foregut fermenters. Nature 330, 401–404 (1987)
7. Omelchenko, M. V., Galperin, M. Y., Wolf, Y. I. & Koonin, E. V. Non-homologous isofunctional enzymes: a systematic analysis of alternative solutions in enzyme evolution. Biol. Direct 5, 31 (2010)
8. Andreeva, A. et al. Data growth and its impact on the SCOP database: new developments. Nucleic Acids Res. 36, D419–D425 (2008)
9. Cheng, H., Kim, B. H. & Grishin, N. V. Discrimination between distant homologs and structural analogs: lessons from manually constructed, reliable data sets. J. Mol. Biol. 377, 1265–1278 (2008)
10. Cheng, H., Kim, B. H. & Grishin, N. V. MALISAM: a database of structurally analogous motifs in proteins. Nucleic Acids Res. 36, D211–D217 (2008)
11. Povolotskaya, I. S. & Kondrashov, F. A. Sequence space and the ongoing expansion of the protein universe. Nature 465, 922–926 (2010)
12. Steel, M. & Penny, D. Origins of life: Common ancestry put to the test. Nature 465, 168–169 (2010)

Author information

Authors and Affiliations

Department of Biochemistry, Brandeis University, Waltham, 01778, Massachusetts, USA
D. L. Theobald

About this article

Cite this article

Theobald, D. Theobald reply. Nature 468, E10 (2010). https://doi.org/10.1038/nature09483

Published: 15 December 2010
Issue Date: 16 December 2010
DOI: https://doi.org/10.1038/nature09483

This article is cited by

On universal common ancestry, sequence similarity, and phylogenetic structure: the sins of P-values and the virtues of Bayesian evidence
- Douglas L Theobald
Biology Direct (2011)
