Volume 2, Issue 1 p. 34-42
Free Access

Selecting ribosomal protein genes for invertebrate phylogenetic inferences: how many genes to resolve the Mollusca?

Achim Meyer

Corresponding Author

Achim Meyer

Institute of Zoology, Johannes Gutenberg-University, D-55099 Mainz, Germany

Correspondence author. E-mail: [email protected]Search for more papers by this author
Alexander Witek

Alexander Witek

Institute of Molecular Genetics, Johannes Gutenberg-University, D-55099 Mainz, Germany

Search for more papers by this author
Bernhard Lieb

Bernhard Lieb

Institute of Zoology, Johannes Gutenberg-University, D-55099 Mainz, Germany

Search for more papers by this author
First published: 03 February 2011
Citations: 14

Summary

1. Multi-gene analyses are currently the gold standard in phylogenetics, despite limited taxon sampling. To facilitate broad taxon representation on an economically tolerable level, we optimize the gene selection for future PCR-based sampling strategies.

2. Highly expressed ribosomal proteins (RP) were sampled chiefly for molluscs, the second largest metazoan phylum with largely unknown internal relationships. Thirty-two new sequences for Lepidochitona cinerea (Polyplacophora) were integrated into a data-matrix of 79 RP genes, comprising 16 mollusc species (five classes). The resulting maximum likelihood tree was used to evaluate each single-gene tree according to its topological fit. Genes were sorted in ascending order, successively concatenated and then evaluated.

3. The most efficient sub-alignment consists of 18 selected genes. All analyses reveal the Mollusca as monophyletic. The phylogenetic utility of the selected and concatenated 18 ribosomal genes is comparable to results from phylogenomic analyses.

4. This optimized data set can be assembled using PCR methods and will allow expanding the taxon coverage in deep molluscan phylogeny far beyond the current sampling.

Introduction

Multi-gene data sets using expressed sequence tag (EST) data provide high resolution when inferring relationships of major metazoan taxa (Steinke, Salzburger & Meyer 2006; Roeding et al. 2007; Dunn et al. 2008; Hejnol et al. 2009; Philippe et al. 2009). Despite decreasing sequencing costs, these data sets are generally restricted in taxon sampling, which may lead to biased species representation. Standard PCR methods thus will remain an important tool for phylogenetic studies, but need a careful selection of marker genes, adapted to the question asked (Townsend 2007). In practice, the selection of marker genes usually follows historical or technical criteria, but rarely investigates their phylogenetic utility in detail (but see Graybeal 1994; Collins, Fedrigo & Naylor 2005; Konstantinidis, Ramette & Tiedje 2006; Kuramae et al. 2007; Aguileta et al. 2008). As a start we restrict the gene selection to ribosomal protein (RP) genes due to their exceptionally high expression levels, which facilitate successful PCR amplification. The importance of this initial restriction suggests briefly summarizing some phylogenetically important aspects of the current knowledge about this macromolecular complex.

Most detailed information is available on mammal RP genes and their scattered distribution across the genome indicates unlinked loci (Uechi, Tanaka & Kenmochi 2001; Marygold et al. 2007). As a rule, each mammalian RP is encoded by a single functional gene, but duplicated functional (paralogous) genes may exist (Dudov & Perry 1984; Kuzumaki et al. 1987; Wool, Chan & Gluck 1995; Kenmochi et al. 1998). Dunn et al. (2008) discarded most of the RP genes due to low taxon coverage or paralogy problems, but comparing RP genes from nine metazoan model organisms deposited in the Ribosomal Protein Gene Database (Nakao, Yoshihama & Kenmochi 2004), on average only three out of eighty RP genes, not more than 2·4%, are duplicated per species. The maximum of nine paralogous RP genes (10%) is found in Drosophila melanogaster, but all nine duplications appear to have arisen exclusively within the Drosophilidae and in eight of the nine duplicates the younger copy has a lower expression level (Marygold et al. 2007). Within the human genome more than 2000 RP pseudogenes have been reported (Zhang, Harrison & Gerstein 2002), but again the processed pseudogenes lack significant homology between humans and rodents (Balasubramanian et al. 2009). Considering the relatively recent origin of duplicated RP genes and RP pseudogenes and the low expression levels of RP paralogues (Nakao, Yoshihama & Kenmochi 2004; Marygold et al. 2007), the paralogy problem appears to be minor when using RP genes for deep metazoan phylogenetics. Horizontal gene transfer has frequently been observed in prokaryotes (Brochier, Philippe & Moreira 2000) but is unlikely in the case of Mollusca. This encouraged us to sample and to apply the entire set of 79 RP genes available for molluscs.

The metazoan phylum Mollusca is the second largest after the Arthropoda, but knowledge about molluscan evolutionary history is surprisingly incomplete despite an abundant fossil record and extensively documented morphological, molecular and developmental data (Lindberg, Ponder & Haszprunar 2004). Currently nearly all molluscan class-level relationships are under dispute and molecular-phylogenetic analyses of such relationships are hampered by limited character and taxon sampling (Todt et al. 2008). We enlarged the available data on molluscan RP genes by generating 1000 ESTs for the chiton Lepidochitona cinerea (Polyplacophora, Aculifera) to densify the data coverage for this presumably early branching mollusc lineage. The resulting maximum likelihood (ML) tree allows important conclusions concerning mollusc evolution but above all, the objective of our analyses is to evaluate the phylogenetic value of single genes and combined data sets to infer mollusc relationships. Aguileta et al. (2008) assessed the phylogenetic performance of single-gene trees from fungal whole genome sequences. Here we adapt this method to a ‘gappy’ EST data set in using the agreement metric (Goddard et al. 1994) to compare topological differences. We sort all single-gene trees in order to their fit to the reference tree and then use ML inferences, that maximize the fit of alignment, substitution model and tree topology, to find the most efficient subset of concatenated RP genes.

Materials and methods

Generation and processing of ESTs

A specimen of the chiton L. cinerea was collected at Helgoland Island (Germany) and stored at −80 °C. RNA was extracted using the Trizol reagent (Invitrogen, Karlsruhe, Germany). mRNA was purified applying the Dynabeads mRNA Purification Kit (Invitrogen) and transcribed by primer extension using SuperScript II (Invitrogen). Size fractioning and directional cloning was done by the CloneMiner cDNA Library Kit (Invitrogen) and the pDONR222 vector. Clones containing cDNA inserts were sequenced from the 5′-end on ABI 3730 capillary sequencer systems using the BigDye chemistry (Applied Biosystems, Darmstadt, Germany). EST sequencing of 1000 reads led to 32 RP contigs from L. cinerea. ‘Positive’ clones were additionally sequenced from the 3′-end or using an oligoT primer. EST processing for all species was accomplished at the Center for Integrative Bioinformatics in Vienna as described in Hausdorf et al. (2007).

Amino acid (aa) alignments from the Hausdorf et al. (2007) alignment were supplemented by adding more mollusc and metazoan RP genes that we found via tBlastn (Altschul et al. 1997) searches using Human RP genes as queries. All contigs were translated using GeneWise (Birney, Clamp & Durbin 2004). If any RP was found with multiple hits in a species, i.e. paralogous genes, we selected the copy with the lowest divergence to the query sequence. The Capitella sp. and Helobdella robusta ESTs caused difficulty because they were 100% identical for a number of sequences. This implies a miss-specified batch of ESTs in GenBank, and to compensate we only used hits where these two taxa differed.

Alignment, phylogenetic reconstructions and tree evaluation

The final data matrix consists of 79 RP genes for 16 molluscs and a large outgroup comprising 46 species. Ecdysozoa and Deuterostomia act as indicators for the reliability of the inferred topologies and might improve phylogenetic accuracy by increasing the number of taxa analysed (Graybeal 1998). Single genes were aligned using Muscle 3.6 (Edgar 2004) and trimmed with GBlock (Castresana 2000) using options for a less stringent excision (−b2 = 55, −b4 = 5, −b5 = h). The best substitution models for each RP were determined with MultiPhyl (Keane, Naughton & McInerney 2007) applying the AICc model criterion (Sugiura 1978) before concatenation of RP genes in BioEdit (Hall 1999). Single-gene ML phylogenies were calculated, using their respective models, with Treefinder (October 2008 version; Jobb, von Haeseler & Strimmer 2004). We performed 100 inferences on the original 79 gene alignment and 1000 bootstrap replicates, with RAxML 7.0.4 (Stamatakis 2006). The programme settings were adapted to this alignment with an initial rearrangement setting of 10, using 25 rate categories and the rtREV model with empirical frequencies. Phylobayes 2.1 (Blanquart & Lartillot 2006) ran with two chains for 23 000 points under the CAT model (Lartillot & Philippe 2004). Three thousand points were discarded as burnin before summarizing over every tenth of the remaining trees (maxdiff. <0·09; meandiff. <0·002).

Tree topologies of all single-gene trees were compared with the 79 gene ML tree in calculating the agreement metric (Goddard et al. 1994) and the symmetric distance (Penny & Hendy 1985) using paup 4.0 (Swofford 2003). The agreement metric results comparing single-gene trees with the 79 gene ML tree were used to successively concatenate RPs in the order of goodness of their fit to the total tree. Additionally, two alignments comprising half of the genes that produced poorly matching tree topologies were assembled (labelled ‘worst 39’ and ‘worst 40’). ML trees from all aligned subsets were calculated using Treefinder to estimate their likelihood values and RAxML was used to confirm the results in the region between 10 and 30 genes and to directly compare the ML topologies with the 79 gene tree by applying the implemented (RAxML) SH test (Shimodaira & Hasegawa 1999).

Results

Thirty-two RP genes were obtained from the EST data set of L. cinerea (accession nos FJ429199FJ429230). The final, concatenated alignment spanned 79 genes, covered 62 taxa and comprised 11 911 aa positions after the exclusion of ambiguous sites using GBlock. All alignments are available on at our institution’s website (http://www.staff.uni-mainz.de/lieb/data.html). Percentages of missing data for the Mollusca, Kamptozoa (entoprocts) and a few of the other outgroup species (with some condensed to chimerical taxonomic units) are given in Appendix S1.

The likelihood of the reference ML tree in Appendix S2 (and tree labelled ‘all 79’ in Fig. 4) is −ln L = 416 928·85 (Γ shape parameter of α = 0·600). The clade Mollusca got full support (100% bs, 1·00 pp) with gastropods joining bivalves in a robustly supported clade. The topologies of the inferred ML and Bayesian trees are largely congruent except for the varying positions of the caudofoveate Chaetoderma nitidulum, of Syndermata, and of Chaetognatha. Positions of the latter two clades are highly insignificant (low support values), but C. nitidulum got 78% bootstrap support (bs) for a grouping with Cephalopoda in the ML inference and 1·00 posterior probability (pp) as sistertaxon to Polyplacophora in the Bayesian analysis (yellow branch in Appendix S2). Minor differences between the two inferences were additionally observed for the branching pattern within Bivalvia, where Crassostrea was found as sistertaxon to Mytilus in the Bayesian analysis (0·84 pp), but as sister to all remaining bivalves in the ML inference. The molluscan classes with more than one representative species in the alignment were recovered as monophyletic with high support values despite the presence of long branched taxa such as the true limpet Lottia gigantea. The tree topology inferred with a constraint on Cyrtosoma (Gastropoda + Cephalopoda) is significantly worse. SH test result: Likelihood: −ln L = 417 052·97, Δ(LH): −124·12, SD: 40·2941.

Details are in the caption following the image

Comparison of the maximum likelihood (ML) trees using all 79 ribosomal proteins (RPs; left tree) and a subset of 18 selected genes (right tree). The ‘best 18’ ML tree (right) was inferred using 18 RPs concatenated according their agreement metric results and listed on the white backgrounds in Fig. 1. Arrows indicate the three topological differences. The taxa inducing these differences are highlighted in bold (Priapulida, Polyplacophora and Nemertea). Detailed labelling of taxonomic units is given for the ‘all 79’-gene tree in Appendix S2.

Maximum likelihood analyses of all single ribosomal genes were compared with the ML tree of the concatenated alignment using the agreement metric and symmetric distance metric (Fig. 1). The single-gene topologies are to varying degrees in conflict with the result from the concatenated data set and may even disagree with well-settled evolutionary hypotheses, for example with the L7 tree rejecting monophyletic Mollusca (data not shown). A plot of the alignment lengths against agreement metric results shows the approximately logarithmic correlation between the number of amino acid positions available and the resolution (Fig. 2). Single-gene trees were sorted according to their agreement metric rank and concatenated for subsequent analyses. Treefinder and RAxML inferences agree with their relative likelihood per site values (Fig. 3a shows only the Treefinder results for clarity and because RAxML inferences cover only the sub-range from 10 to 30 genes), but the resulting trees offer minor topological differences between Treefinder and RAxML analyses (see triangles and dots in Fig. 2). The likelihood per site is maximized with the 18 best fitting concatenated genes comprising 3348 aa positions (Fig. 3a). The topological differences between the 18 and 79 RP trees involve three nodes, namely the positions of Priapulus caudatus, Polyplacophora and Nemertea (arrows, Fig. 4). The SH test did not reject the 18-gene (3348 aa) tree when tested against the 79-gene tree (Fig. 3b). We also compared all single genes with the ‘best18’ gene tree to estimate the influence of the reference tree on gene selection as well as the robustness of gene selection against possibly wrong assumptions implied from the guide tree. Within the batch of the 18 highest scoring RP genes we found only two gene substitutions: S3 and S23 are shifted to rank 17 and 18, respectively, whereas L21 and L23a are weighted down to rank 19 and 20.

Details are in the caption following the image

Sorted 79 ribosomal proteins (RPs) used in the phylogenetic inferences. Comparison of each single-gene maximum likelihood (ML) topology with the ML tree from the concatenated alignment that used all 79 RPs. RPs are listed in the order of their agreement metric (Goddard et al. 1994) to the 79-gene tree and were successively aligned for the likelihood per-site estimation. The 18 best-scoring RPs (see Fig. 3) are shown on the white background. Besides the differences in alignment length, paralogy issues or high substitution rates may affect the phylogenetic performance of single genes. For completeness, the symmetric difference values (Penny & Hendy 1985) are also given. OTU, operational taxonomic unit.

Details are in the caption following the image

Agreement metric results from the 79 ribosomal protein (RP)–maximum likelihood (ML) tree for all single-gene analyses (open circles in the left half of the graph) and five selected concatenated alignments against alignment length (darkened dots and triangles). The logarithmic-trend line is given. Five selected genes (1174 aa, see corresponding tree in Appendix S3); 40S: small ribosomal subunit (4723 aa); 60S: large ribosomal subunit (7188 aa). Differences between Treefinder topologies (dots) and RAxML topologies (triangles) may result from programme specific and slightly different substitution models and search strategies.

Details are in the caption following the image

Estimation of the most efficient alignment size. (a) All genes were sorted due to their agreement metric results shown in Fig. 1. Likelihood per site of maximum likelihood (ML) trees from successively concatenated ribosomal protein (RPs) is measured for accumulated alignments from gene L10a ( j =1) to gene L41 ( j =79). The global maximum (highest peak) is found with 18 genes. Several local maxima may point to unsatisfactory rank assignment of certain genes. (b) SH test (Shimodaira & Hasegawa 1999) results for topologies inferred with RAxML from sorted and concatenated alignments of different size. Each ML tree was tested against the 79-gene tree and alignment, with the null hypothesis being ‘no difference’. The 18-gene tree is close to the 79-gene tree in its Ln/L value, and outperforms trees from the alignments with more genes as far as the 29-gene tree. The gene on rank 27 (S27) is a short alignment with only 84 amino acid positions which may produce different topologies dependent on the applied substitution model (see Fig. 1).

The RAxML tree calculated from the ‘worst39’ RP genes (4739 aa) with the lowest agreement metric values was significantly worse and suggests a number of doubtful clades such as (Platyhelminthes + Nematoda) or (Chaetognatha + Kamptozoa) and [Gastropoda minus Haliotis + (Haliotis + Bivalvia)].

Discussion

Mollusca and the molluscan classes are monophyletic in our analyses using 18 and 79 RP genes, in clear contrast to previously published results based on single or a few genes (Winnepenninckx, Backeljau & De Wachter 1996; Giribet et al. 2000; Peterson & Eernisse 2001; Meyer et al. 2010). Both analyses robustly support a clade comprising gastropods and bivalves as suggested by Dunn et al. (2008), who used 150 genes for the same finding. The remarkable efficiency of the analysed data sets is based on the presented gene selection strategy.

Gene selection

To assess the phylogenetic utility of single genes, their ML tree topologies were compared with the ML tree calculated from all 79 genes in concatenation. In EST-based multi-gene studies like ours, the individual single-gene trees usually provide only a subset of the total species sampling. Thus, the number of applicable methods for tree comparison is reduced. For example, the measurement of an overall topological score (Nye, Lio & Gilks 2006) needs the same set of species in both trees, and so does Concaterpillar (Leigh et al. 2008). The symmetric difference (Penny & Hendy 1985), a partition metric equivalent to the RF distance (Robinson & Foulds 1981), can compare incomplete single-gene trees, but here the impact of missing leaves in the single-gene tree might be strong because the symmetric difference measures the distance to transfer one tree into another. Here the different positioning of a single taxon on two otherwise identical trees can lead to a large difference value (Penny & Hendy 1985). In contrast, the agreement metric (Goddard et al. 1994) is well suited when comparing two trees with incomplete species coverage in the single-gene tree. The agreement metric prunes leaves from both trees until the topologies fit each other and is designed to evaluate the concordance of subtrees. The 79-gene tree is suited to serve as a guide tree because many well-established groups are recovered within this tree. Single-gene trees that reflect these groups but may show a distinct basal branching pattern will then nevertheless get a better score than trees with more or less randomly scattered species distribution.

There are alternative methods to the agreement metric for selecting or evaluating genes prior to an analysis: Cophenetic correlation (Kuramae et al. 2007), average nucleotide identity (Konstantinidis, Ramette & Tiedje 2006), sequence divergence (Graybeal 1994), compositional heterogeneity (Nesnidal et al. in press 2010) and others (Goldman 1998; Townsend 2007). Concaterpillar (Leigh et al. 2008) and supertree bootstrapping (Burleigh, Driskell & Sanderson 2006) are able to identify congruent sets of markers, but only the method of gene selection we describe reduces the number of genes while simultaneously evaluating improvement in tree quality. Our approach is optimized for direct sampling strategies and offers the opportunity to incorporate patchy EST data sets. In this manner, a broader taxon sampling can complement the steadily growing number of genes from transcriptome sequencing. The number and identity of optimal genes will be different for other metazoan taxa because the required data depend on the particular question and taxon being considered (Gatesy, DeSalle & Wahlberg 2007).

Tree inference improved with increasing gene length (Fig. 2), showing an approximately logarithmic relation between the number of amino acid positions and agreement with the total gene tree. Therefore, shorter genes are less informative, even adding random noise compared with longer genes. Our gene selection favoured longer genes (Fig. 2), which has another advantage: larger genes are beneficial when applying direct sampling strategies, such as PCR. The effort involved in sampling one large gene is far less costly and more beneficial than sampling two small genes offering the same number of amino acids. The 18 selected genes resulted in a tree favoured by the SH test and showing only minor topological changes to the 79 gene tree (arrows, Fig. 4).

Phylogenetic implications based on 79 RP genes

The inferred topology of the analyses using all 79 RP genes (Appendix S2 and Fig. 4) largely agrees with molecular trees inferred from other multi-gene data sets (Bourlat et al. 2006; Philippe & Telford 2006; Dunn et al. 2008). The 79 gene tree supports (i) mollusc monophyly; (ii) reveals class-level mollusc taxa monophyletic; (iii) shows a close relationship of bivalves and gastropods; and (iv) significantly rejects the Cyrtosoma (Haszprunar 2000; Haszprunar, Schander & Halanych 2008) hypothesis (=no cephalopods + gastropods).

Due to their more or less sessile lifestyle and specialized deposit or filter feeding, bivalves have many anatomical modifications that hamper comparison with other molluscs. From a morphological point of view, gastropods and cephalopods appear to be sister groups (Pojeta & Runnegar 1976; Salvini-Plawen 1980), but this may again reflect somewhat similar lifestyles. Cephalopods have high substitution rates in the rRNA genes (Passamaneck, Schander & Halanych 2004; Meyer et al. 2010) and thus artefacts in molecular phylogenetic reconstructions might be possible. High substitution rates may explain the unstable position of the caudofoveate C. nitidulum, which is recovered as sister to Cephalopoda with ML, but sister to Polyplacophora with the Bayesian inference (Appendix S2). The clade unifying Cephalopoda and Caudofoveata has been observed repeatedly in molecular analyses (Passamaneck, Schander & Halanych 2004; Dunn et al. 2008; Lieb & Todt 2008; Hejnol et al. 2009; Wilson, Rouse & Giribet 2009), but low support values prevent the proposal of a new phylogenetic hypothesis (Waegele et al. 2009).

Inferences based on the 18 RP gene subset

The 18-gene tree, like the 79-gene tree, supports all the main findings for the Mollusca listed in points (i) to (iv) in the previous section. The minor differences between the trees could be due to an erosion of phylogenetic signal through the use of far fewer genes and amino acids in the 18-gene set. The large amount of missing data within EST-based alignments probably intensified this erosion, an effect that might decrease when a denser data matrix is available. Even so, the SH test indicates that this medium-sized data set is superior to inferences using up to 28 genes (Fig. 3b) and is the most efficient alignment analysed (Fig. 3a).

The RP genes are suspected to suffer from long-branch attraction (LBA) (Bleidorn et al. 2009). This assumption is confirmed here when analysing the 39 and 40 worst scoring RPs (from a total of 79 genes) revealing typical LBA artefacts, but not for the selected best 18 genes, underlining the importance of carefully selecting marker genes. Long-branch issues can be met with several strategies, first and foremost with an enlarged taxon sampling of high-quality data focusing on short-branching taxa (Graybeal 1998; Bergsten 2005).

The Mollusca comprises about 200 000 species (Heywood 1995) which display diverse morphologies. Trying to analyse mollusc relationships using only 0·01% of the clade’s extant diversity appears somewhat naive. Our subset of 18 selected RP genes at this point already resolves the phylum Mollusca plus the included mollusc classes and in future can be sampled on a much broader taxonomic scale to increase resolution and support. Collecting selective medium-scale data sets can be done by RT-PCR methods (Regier et al. 2010) and benefits from the reduction to the most essential amount of data. The high expression level of RPs, combined with their high degree of sequence conservation (Perina et al. 2006), will facilitate successful amplification in a broad range of species. Despite decreasing sequencing costs, the generation of 1000 sequencing reactions plus a number of preparative steps such as normalization and library construction has still significant costs. On the other hand, the effort to sample 18 genes will be dependent on the successful primer design and costs are difficult to estimate a priori. Regier et al. (2010) sampled 62 genes via PCR to analyse arthropod phylogeny whereas our results suggest collecting only a third of this gene number for molluscs.

Conclusions

Multi-gene data sets for phylogenetics are generally restricted in their taxon sampling, leading to biased species representation. In particular, species-poor taxa that are not in the centre of public interest are underrepresented. Furthermore, randomly assembled data (ESTs) frequently suffer from incomplete (‘gappy’) alignments and poor quality of single reads. Here we achieved success by focusing on a subset of gene data. We show that assembling 79 genes gives promising results for the Mollusca and that an efficiently trimmed subset of the 18 most informative genes still supports the main findings (Fig. 4). These findings, which have been extremely difficult to obtain for molluscs with molecular phylogenetics, are: (i) mollusc monophyly; (ii) monophyly of all class level taxa; (iii) a robust clade of Gastropoda and Bivalvia; and (iv) the earlier proposed clade Cyrtosoma refuted.

Acknowledgements

We thank the German Research Foundation (DFG) for funding this study within the priority programme ‘Deep Metazoan Phylogeny’ project (Li 998/3-1) as well as I. Ebersberger, S. Strauss and A. von Haeseler (Max F. Perutz Laboratories, Center for Integrative Bioinformatics, Vienna) for processing EST data. Christiane Todt (Department of Biology, University of Bergen) kindly improved the language of the manuscript and we also thank Alexander Gruhl (Natural History Museum London) for critical reading. We thank the reviewers from Methods in Ecology and Evolution for their valuable and helpful suggestions. Finally, we are grateful for the comprehensive remarks of the anonymous reviewers from a provisional draft of this study.