Elsevier

Biochimie

Volume 138, July 2017, Pages 168-183

Research paper
Akaryotes and Eukaryotes are independent descendants of a universal common ancestor

https://doi.org/10.1016/j.biochi.2017.04.013

Highlights

  • Bayesian tests estimate the reliability of four prominent ToL models.

  • The ESP model is the most parsimonious and most probable ToL model.

  • A nonreversible evolution model is 10^277 times more probable than a reversible model.

  • Mayr's two empires ToL is 10^87 times more probable than Woese's three-domains ToL.

  • The Eocyte ToL model is by far the least probable model.

Abstract

We reconstructed a global tree of life (ToL) with non-reversible and non-stationary models of genome evolution that root trees intrinsically. We implemented Bayesian model selection tests and compared the statistical support for four conflicting ToL hypotheses. We show that reconstructions obtained with a Bayesian implementation (Klopfstein et al., 2015) are consistent with reconstructions obtained with an empirical Sankoff parsimony (ESP) implementation (Harish et al., 2013). Both are based on the genome contents of coding sequences for protein domains (superfamilies) from hundreds of genomes. Thus, we conclude that the independent descent of Eukaryotes and Akaryotes (archaea and bacteria) from the universal common ancestor (UCA) is the most probable as well as the most parsimonious hypothesis for the evolutionary origins of extant genomes. Reconstructions of ancestral proteomes by both Bayesian and ESP methods suggest that at least 70% of unique domain-superfamilies known in extant species were present in the UCA. In addition, identification of a vast majority (96%) of the mitochondrial superfamilies in the UCA proteome precludes a symbiotic hypothesis for the origin of eukaryotes. Accordingly, neither the archaeal origin of eukaryotes nor the bacterial origin of mitochondria is supported by the data. The proteomic complexity of the UCA suggests that the evolution of cellular phenotypes in the two primordial lineages, Akaryotes and Eukaryotes, was driven largely by duplication of common superfamilies as well as by loss of unique superfamilies. Finally, innovation of novel superfamilies has played a surprisingly small role in the evolution of Akaryotes and only a marginal role in the evolution of Eukaryotes.

Introduction

Since Darwin's assertion of the theory of common ancestry, phylogeny has depicted the genealogical relationships of organisms [1], [2]. Thus, the fundamentals of phylogenetic theory describe the evolution and speciation of organisms [2]. Nevertheless, for the past several decades it has been standard practice to equate gene phylogenies with species phylogeny [3], [4]. In this genotypic tradition of molecular evolution, gene trees inferred from “marker genes” or from the so-called “genealogy defining core” of genes have been promoted as species trees [5], [6], [7]. Here, we emphasize the distinction between gene trees and species trees. Further, we show that the use of gene trees as proxies for species trees is responsible for much confusion. The more general thrust of this study is to emphasize the importance of choosing the appropriate phylogenetic characters as well as sufficiently realistic models with which to describe complex evolutionary processes to reconstruct species phylogeny.

There are obvious differences in the levels of character complexity appropriate to species or organisms as opposed to individual genes. But theory and methods that were initially developed to infer the phylogeny of species have been directly adapted to infer gene phylogenies [3], [4]. Here we identify the genome as the smallest molecular genetic unit that describes the evolution of an organism. Gene phylogenies that summarize patterns of point mutations in the form of nucleotide or amino acid substitutions convey limited information about the evolution of the corresponding organisms or genomes. That is to say, gene phylogenies do not record genome-scale processes such as the birth, duplication, loss and death of novel genes and gene families [8], [9]. The gains and losses of coding sequences constitute the modes of genome evolution. In effect, genomic modes of evolution together describe a higher hierarchical level of evolutionary change than that represented by a limited selection of concatenated or coalesced gene phylogenies [10], [11], [12], [13], [14].

For these reasons the distinction between microevolution and macroevolution is helpful [9], [15], [16], [17]. We reserve the term microevolutionary change for character state changes that involve amino acid or nucleotide substitutions or small indels that are expressed at the level of a gene or its expressed products. In contrast, genome-level features such as gains and losses including duplications or rearrangements of coding sequences for genes as well as for protein-domains are examples of some well-studied macroevolutionary character state changes [18], [19], [20], [21], [22]. Microevolutionary changes are expressed in the point mutations affecting gene sequences [23]. Macroevolutionary changes influence the presence or absence of coding sequences. Accordingly, microevolution is about allelic sequence drift (meandering) [23] and macroevolution is about change in the composition of genomic characters that may lead to speciation events [17].

A practical advantage of modeling phylogeny with complex macroevolutionary characters is that the frequencies of changes in these are likely to be relatively modest compared to the frequent instances of point mutations expressed as sequence drift or meandering [23]. This means that complex characters such as protein domains are likely to be more stable over long evolutionary times. In effect, rates of macroevolution are slower than rates of microevolution. As a result, gains and losses of complex characters can be used to track the speciation events associated with the lineage sorting events.

Highly variable point mutation rates, combined with mutational saturation due to frequent substitutions at the same position in a sequence, are the most common sources of phylogenetic error, and these are particularly troublesome for resolving the deeper divergences [12], [23], [24], [25]. The underlying problem is a mismatch of time scales: divergences among the major groups of animal species take place over millennia, while the compositions of coding sequences are known to meander on a time scale corresponding to generation times [12], [23], [24], [25].

Finally, a reliable phylogenetic reconstruction of the evolutionary history of contemporary species depends on an explicit identification of the hypothetical ancestral species at the root of the ToL, the universal common ancestor (UCA). This is the critical distinction between phenetics and phylogeny [2]. The root polarizes the tree in time or direction so that the evolutionary succession of ancestors to descendants is discernable [2]. In effect, the root determines the direction of character evolution.

In our previous studies we utilized protein domains classified at the level of superfamily in the structural classification of proteins (SCOP) hierarchy [26] as complex characters for implementation of phylogeny [27], [28]. Superfamilies, unlike orthologous gene families, are well represented in the species of the crown. There are other macroevolutionary characters that are restricted to specific groups of species in the crown. For instance, introns are commonly found in Eukaryotes but rarely seen in Akaryotes (archaea and bacteria). Accordingly, the implementation of phylogeny with introns is restricted to Eukaryotes.

An important obstacle to the rooting of characters in a ToL is that appropriate outgroup species or other “external” data are conspicuously absent. Nevertheless, it is possible to implement phylogenetic models of evolution that describe non-reversible and non-stationary processes of evolution in order to root a ToL [27], [28], [29], [30]. Our approach to rooting the ToL is based on genome content of protein domains (superfamilies) that are reconstructed in a generalized (Sankoff) parsimony model that specifies asymmetric state transition “costs”. This particular approach to implement an asymmetric cost-matrix exploits the observed heterogeneity in the patterns of genomic superfamily composition in groups of closely related species. Here, such heterogeneity is taken to be due to the non-reversible and non-stationary evolution of superfamilies. We referred to this model as an Empirical Sankoff Parsimony (ESP) model [27], [28].

The ESP model constructs an intrinsically rooted tree that resolves two primary lineages: Akaryotes and Eukaryotes. Here, Archaea and Bacteria are sister clades within the Akaryotes [27], [28]. This genome phylogeny supports the taxonomic classification proposed by Mayr, which was based on gross cellular phenotypes [31]. However, it does not support Woese's classification at the highest level of taxonomy, the three domains classification based on a single-gene tree, that of rRNA [4], [32].

Both proposals, that is, Mayr's ‘two empires’ of life classification as well as Woese's ‘three domains’ of life classification, were based on phenetic methods and not strictly deduced from phylogenetic reconstructions. Mayr proposed a classification based on overall similarities of cellular structures (phenotypes) while Woese proposed a classification based on overall similarities of rRNA sequences (genotype). The root of the unrooted rRNA tree was assumed to be the same as the root of the paralogous gene pairs such as EF-Tu and EF-G. For these rootings the Dayhoff outgroup method was exploited [33], [34], [35]. It is now well established that long branch attraction (LBA) artifacts arising from highly variable mutational substitution rates corrupt the rRNA rootings [25], [36], [37].

In the present study, we have further examined the support for the ESP model [27], [28]. In particular, we have verified the fundamental assumption of the ESP model, namely that character state transitions must be non-reversible as well as non-stationary. We do so by implementing objective Bayesian model selection tests for superfamily data. Likewise, Bayesian phylogenetic tests show that the most parsimonious hypothesis for the evolution of the genomic cohorts of superfamilies is also the most probable hypothesis, which is realized as the independent divergence of akaryotes and eukaryotes from the root of the ToL, the UCA. Most important, multiple independent reconstructions of the ancestral repertoire of superfamilies in the UCA indicate that it is very complex and that it features more than two thirds of the nearly 1800 known protein domain superfamilies encoded by the genomes sampled here. In other words, there is close agreement between the Bayesian reconstruction and that supported by the ESP model for the same data sets.

Finally, we have exploited the Bayesian inference method [38] to directly test and compare the phylogenetic support for four conflicting ToL hypotheses: (i) Eocyte tree (Lake, 1984), (ii) Three Domains tree (Woese, 1990), (iii) Gupta tree (for lack of a popular name) (Gupta, 1995) and (iv) Two empires tree (Mayr, 1998). We find that the Mayr tree with separate descent of akaryotes and eukaryotes is far and away the most probable ToL. In addition, in all our reconstructions of the ancestral superfamily proteome there is a major subset of superfamilies corresponding to the mitochondrial proteomes of Eukaryotes. These data together rather strongly imply that eukaryogenesis was autogenic. None of our data support the symbiosis or fusion model of eukaryogenesis [39], [40].
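
A comparison of competing hypotheses of this kind is commonly summarized with Bayes factors. The sketch below assumes that each hypothesis has been reduced to a log marginal likelihood (natural log); the labels `H1` to `H4` and all numerical values are placeholders invented for illustration and are not results from this study.

```python
import math

# Placeholder log marginal likelihoods (natural log), one per competing hypothesis.
# These numbers are invented for illustration; they are NOT values from the study.
log_ml = {"H1": -1040.0, "H2": -1025.0, "H3": -1010.0, "H4": -1000.0}

def log10_bayes_factor(log_ml_a, log_ml_b):
    """log10 of the Bayes factor in favour of hypothesis A over hypothesis B."""
    return (log_ml_a - log_ml_b) / math.log(10)

def posterior_probabilities(log_mls):
    """Posterior probability of each hypothesis, assuming equal prior probabilities."""
    top = max(log_mls.values())          # subtract the maximum to avoid underflow
    weights = {name: math.exp(value - top) for name, value in log_mls.items()}
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}

best = max(log_ml, key=log_ml.get)       # hypothesis with the highest marginal likelihood
bf_h4_vs_h3 = log10_bayes_factor(log_ml["H4"], log_ml["H3"])
probs = posterior_probabilities(log_ml)
```

Working on the log10 scale keeps very large ratios readable: a value of x means that one hypothesis is 10^x times more probable than the other, given equal priors.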

Data sources and character matrix construction

Protein domain annotations for genomes at the superfamily level of the SCOP hierarchy were obtained from the SUPERFAMILY (1.75) database [41]. The number of distinct superfamilies in an individual genome is referred to as the superfamily occurrence number (presence/absence), and the genomic abundance of superfamily members is referred to as superfamily abundance (copy number). Distinct character matrices were assembled either as binary-state characters representing presence or absence of

Results

As described in the previous section, we reconstructed phylogeny using the genomic composition of superfamilies. The genomic compositions of superfamilies come in two flavors. Occurrence refers to the total number of unique superfamilies encoded by a genome. Abundance refers to the total number of different homologous protein domains encoded by a genome or, when referring to a single superfamily, to the number of distinguishable domains that are included in that superfamily.
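
The distinction between occurrence and abundance can be made concrete with a small sketch. The genome names, superfamily identifiers and the `assignments` structure are hypothetical; the sketch only shows how a copy-number matrix and a presence/absence matrix could be derived from the same domain assignments.

```python
from collections import Counter

# Hypothetical domain assignments: genome -> one superfamily ID per annotated domain.
assignments = {
    "genome_A": ["sf1", "sf1", "sf2"],
    "genome_B": ["sf2", "sf3", "sf3", "sf3"],
}

# Columns of both matrices: every superfamily seen in at least one genome.
superfamilies = sorted({sf for domains in assignments.values() for sf in domains})

# Abundance matrix: copy number of each superfamily in each genome.
abundance = {
    genome: [Counter(domains)[sf] for sf in superfamilies]
    for genome, domains in assignments.items()
}

# Occurrence matrix: binary presence (1) or absence (0) of each superfamily.
occurrence = {
    genome: [1 if count > 0 else 0 for count in row]
    for genome, row in abundance.items()
}

# Occurrence number: how many distinct superfamilies a genome encodes.
occurrence_number = {genome: sum(row) for genome, row in occurrence.items()}
```

In this toy case both genomes encode two distinct superfamilies, although their abundance profiles differ.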

In

Discussion

We previously described a species ToL based on the genomic compositions of protein-domains at the superfamily level of the SCOP hierarchy [27]. Two features of this intrinsically rooted genome tree stand out from the standard gene tree depictions of the ToL [4], [6], [7]: First, rather than a root that separates the bacterial lineage from the sister clades of archaea and eukaryotes [4], [6], [68] the root of this genome tree separates the akaryote lineages from the eukaryote lineages, as had

Data availability

All data matrices and the resulting trees are available on request.

Author contributions

A.H. conceived and designed the study, acquired data and performed the analyses; C.G.K. contributed to the study design. A.H. and C.G.K. analyzed and interpreted the results. The manuscript was drafted by A.H., with critical revision by C.G.K. and A.H.

Acknowledgements

We thank J. Gough, D. Morrison and D. Theobald for stimulating discussions; S. Klopfstein for providing the non-stationary implementation in MrBayes; D. Morrison, J. Roth and I. Winkler for comments on earlier versions of the manuscript; A.H. acknowledges support from The Swedish Research Council (to M. Ehrenberg) and the Knut and Alice Wallenberg Foundation, RiboCORE (to M. Ehrenberg and D. Andersson) and C.G.K. acknowledges support from the Nobel Committee for Chemistry of the Royal Swedish

References (93)

  • C. Darwin

    On the Origin of Species by Means of Natural Selection, or the Preservation of Favoured Races in the Struggle for Life

    (1859)
  • W. Hennig

    Phylogenetic systematics

    Annu. Rev. Entomol.

    (1965)
  • E. Zuckerkandl et al.

    Evolutionary divergence and convergence in proteins

  • C.R. Woese et al.

    Towards a natural system of organisms: proposal for the domains Archaea, Bacteria, and Eucarya

    Proc. Natl. Acad. Sci.

    (1990)
  • C.R. Woese

    On the evolution of cells

    Proc. Natl. Acad. Sci. U. S. A.

    (2002)
  • F.D. Ciccarelli et al.

    Toward automatic reconstruction of a highly resolved tree of life

    Sci. (New York, N.Y.)

    (2006)
  • T.A. Williams et al.

    A congruent phylogenomic signal places eukaryotes within the Archaea

    Proc. Biol. Sci.

    (2012)
  • B. Boussau et al.

    Genomes as documents of evolutionary history

    Trends Ecol. Evol.

    (2010)
  • C.G. Kurland et al.

    The phylogenomics of protein structures: the backstory

    Biochimie

    (2015)
  • P. Pamilo et al.

    Relationships between gene trees and species trees

    Mol. Biol. Evol.

    (1988)
  • R. Nichols

    Gene trees and species trees are not the same

    Trends Ecol. Evol.

    (2001)
  • L. Salichos et al.

    Inferring ancient divergences requires genes with strong phylogenetic signals

    Nature

    (2013)
  • G.J. Szöllősi et al.

    The inference of gene trees with species trees

    Syst. Biol.

    (2015)
  • D. Posada

    Phylogenomics for systematic biology

    Syst. Biol.

    (2016)
  • G.G. Simpson

    Tempo and Mode in Evolution

    (1944)
  • T. Ryan Gregory

    Macroevolution, hierarchy theory, and the C-value enigma

    Paleobiology

    (2004)
  • C.G. Kurland et al.

    Structural biology and genome evolution: an introduction

    Biochimie

    (2015)
  • A. Rokas et al.

    Rare genomic changes as a tool for phylogenetics

    Trends Ecol. Evol.

    (2000)
  • F. Delsuc et al.

    Phylogenomics and the reconstruction of the tree of life

    Nat. Rev. Genet.

    (2005)
  • J.L. Boore

    The use of genome-level characters for phylogenetic reconstruction

    Trends Ecol. Evol.

    (2006)
  • B. Snel et al.

    Genome phylogeny based on gene content

    Nat. Genet.

    (1999)
  • S. Yang et al.

    Phylogeny determined by protein domain content

    Proc. Natl. Acad. Sci. U. S. A.

    (2005)
  • M.A. DePristo et al.

    Missense meanderings in sequence space: a biophysical view of protein evolution

    Nat. Rev. Genet.

    (2005)
  • T.L. Blundell et al.

    Is the evolution of insulin Darwinian or due to selectively neutral mutation?

    Nature

    (1975)
  • R. Gouy et al.

    Rooting the tree of life: the phylogenetic jury is still out

    Philos. Trans. R. Soc. Lond. Ser. B, Biol. Sci.

    (2015)
  • A.G. Murzin et al.

    SCOP: a structural classification of proteins database for the investigation of sequences and structures

    J. Mol. Biol.

    (1995)
  • A. Harish et al.

    Rooted phylogeny of the three superkingdoms

    Biochimie

    (2013)
  • A. Harish et al.

    Empirical genome evolution models root the tree of life

    Biochimie

    (2017)
  • Z. Yang et al.

    On the use of nucleic acid sequences to infer early branchings in the tree of life

    Mol. Biol. Evol.

    (1995)
  • J.P. Huelsenbeck et al.

    Inferring the root of a phylogenetic tree

    Syst. Biol.

    (2002)
  • E. Mayr

    Two empires or three?

    Proc. Natl. Acad. Sci. U. S. A.

    (1998)
  • C.R. Woese

    Default taxonomy: Ernst Mayr's view of the microbial world

    Proc. Natl. Acad. Sci. U. S. A.

    (1998)
  • R.M. Schwartz et al.

    Origins of prokaryotes, eukaryotes, mitochondria, and chloroplasts

    Sci. (New York, N.Y.)

    (1978)
  • N. Iwabe et al.

    Evolutionary relationship of archaebacteria, eubacteria, and eukaryotes inferred from phylogenetic trees of duplicated genes

    Proc. Natl. Acad. Sci.

    (1989)
  • J.P. Gogarten et al.

    Evolution of the vacuolar H+-ATPase: implications for the origin of eukaryotes

    Proc. Natl. Acad. Sci. U. S. A.

    (1989)
  • P. Forterre et al.

    The nature of the last universal ancestor and the root of the tree of life, still open questions

    Bio Syst.

    (1992)
  • H. Philippe et al.

    The rooting of the universal tree of life is not reliable

    J. Mol. Evol.

    (1999)
  • Seraina Klopfstein et al.

    A nonstationary Markov model detects directional evolution in hymenopteran morphology

    Syst. Biol.

    (2015)
  • L. Sagan

    On the origin of mitosing cells

    J. Theor. Biol.

    (1967)
  • T.A. Williams et al.

    An archaeal origin of eukaryotes supports only two primary domains of life

    Nature

    (2013)
  • J. Gough et al.

    Assignment of homology to genome sequences using a library of hidden Markov models that represent all proteins of known structure

    J. Mol. Biol.

    (2001)
  • E.O. Wiley et al.

    Phylogenetics: theory and practice of phylogenetic systematics

    (2011)
  • D.L. Swofford

    PAUP*. Phylogenetic Analysis Using Parsimony (*and Other Methods). Version 4

    (2003)
  • J.-M. Chandonia et al.

    SCOPe: manual curation and artifact removal in the structural classification of proteins – extended database

    J. Mol. Biol.

    (2017)
  • C. Chothia et al.

    Evolution of the protein repertoire

    Sci. (New York, N.Y.)

    (2003)
  • C. Chothia et al.

    Genomic and structural aspects of protein evolution

    Biochem. J.

    (2009)