Volume 94, Issue 2 p. 173-183

Inferring phylogeny at low taxonomic levels: utility of rapidly evolving cpDNA and nuclear ITS loci

Mark E. Mort, Jenny K. Archibald, Christopher P. Randle (author for correspondence), Nicholas D. Levsen, T. Ryan O'Leary, Katarina Topalov, Catherine M. Wiegand, Daniel J. Crawford

Department of Ecology and Evolutionary Biology, Natural History Museum, University of Kansas, 1200 Sunnyside Ave., Lawrence, Kansas 66045 USA; and

Department of Biological Sciences, Sam Houston State University, P.O. Box 2116, Huntsville, Texas 77340 USA

First published: 01 February 2007

The authors thank A. Santos-Guerra for help in obtaining material of Aichryson and Tolpis species; P. Neeff for help in obtaining material of Sempervivum and Jovibarba species; P. Burgoyne, J. Burrows, S. Burrows, P. Chesselet, G. F. Smith, E. v. Jaarsveld, and P. Winter for help in obtaining specimens of Crassula; S. D. Johnson and C. I. Peter for help in obtaining material of Zaluzianskya species; and J. Foote, D. Murray, B. Bennett, and L. Pietarinen for help in obtaining specimens of Chrysosplenium species. C.P.R. thanks J. W. Wenzel for enlightening discussions on the nature of homoplasy. This work was supported by NSF award DEB-0089640 to M.E.M. and by Kansas NSF EPSCoR.

Abstract

Plant molecular systematic studies of closely related taxa have relied heavily on sequence data from nuclear ITS and cpDNA. Positive attributes of using ITS sequence data include the rapid rate of evolution compared to most plastid loci and availability of universal primers for amplification and sequencing. On the other hand, ITS sequence data may not adequately track organismal phylogeny if concerted evolution and high rDNA array copy number do not permit identification of orthologous copies. Shaw et al. (American Journal of Botany 92: 142–166) recently identified nine plastid regions that appear to provide more potentially informative characters than many other plastid loci. In the present study, sequences of these loci and ITS were obtained for six taxonomic groups in which phylogenetic relationships have been difficult to establish using other data. The relative utility of these regions was compared by assessing the number of parsimony informative characters, character congruence, resolution of inferred trees, clade support, and accuracy. No single locus emerged as the best in all lineages for any of these measures of utility. Results further indicated that in preliminary studies, sampling strategy should include at least four exemplar taxa. The importance of sampling data from independent distributions is also discussed.

Systematists working with taxa that are recently derived or the result of a rapid radiation often seek rapidly evolving loci for phylogenetic analysis. Many of the best-characterized genetic loci play an important role in the function of an organism, such as enzyme-coding regions or ribosomal DNA. Many such loci have proven valuable in elucidating phylogeny at high taxonomic levels (Chase et al., 1993; Kellogg and Juliano, 1997; APG I, 1998; APG II, 2003). However, these loci are often the worst candidates for providing phylogenetic resolution among recently derived or rapidly evolving species because they evolve slowly and may be quite invariable across a taxonomic sample.

Since it was first used in phylogenetic inference (Baldwin, 1992), the nuclear rDNA internal transcribed spacer (ITS) has become the workhorse of plant molecular systematics, particularly at lower taxonomic levels. As a nuclear gene region, it is biparentally inherited and has higher rates of base substitution than would be expected of organellar genes. This region is also easily amplified by standard PCR methodologies because it exists in high copy numbers and because priming sites in the surrounding 18S and 26S regions are highly conserved, thus allowing the use of one set of amplification primers for most plants and fungi (White, 1990). Although the use of ITS sequence data is very common in phylogenetic studies, characteristics of nrDNA evolution may result in comparison of paralogs if sufficient caution is not exercised (reviewed extensively by Alvarez and Wendel, 2003).

Among the confounding phenomena discussed by Alvarez and Wendel (2003) is the large number of rDNA arrays that may be included in a nuclear genome. Concerted evolution may present a confounding effect when two or more paralogous copies of ITS exist in the same genome because homogenization may result in the loss of all but one of the differing copy sequences. This is phylogenetically misleading in that it is analogous to lineage sorting: the gene tree and species tree are no longer the same. Further, if multiple copies exist as the result of hybridization or introgression, evidence of the reticulation has been erased. If homogenization is incomplete, recombination of arrays (even without interspecific hybridization) may result in chimeric nuclear sequences, in which case no bifurcating hypothesis of gene evolution is correct. Bailey et al. (2003) investigated the phylogenetic implications of paralogy resulting from the tandem existence of both nrDNA pseudogenes and functional copies within an individual. While the presence of pseudogenes is not necessarily confounding to phylogenetic analysis if all copies have been sampled from each individual, rates of evolution among functional and nonfunctional paralogs may be sufficiently different to result in long-branch artifacts (Bailey et al., 2003).

Issues of orthology aside, Alvarez and Wendel (2003) discussed other confounding phenomena: the effect of secondary structure on base substitution resulting in non-independence of characters, difficulties with alignment, problems with contamination resulting from universal primers, generally high levels of homoplasy due to rapid evolution, and difficulties with amplification that arise from secondary structure and the existence of multiple rDNA arrays. Although several of these concerns are common to most rapidly evolving regions, they did emphasize the need for caution in molecular phylogenetic studies.

Chloroplast loci are often assumed to be uniparentally inherited, nonrecombining and, by and large, free of the kinds of phenomena that lead to violation of assumptions of orthology. Recently, Shaw et al. (2005) examined sequence variation of 21 plastid loci, most of which were noncoding spacer regions. These were hypothesized to evolve more rapidly than many of the more commonly used cpDNA regions, thus providing greater phylogenetic structure at low taxonomic levels. Shaw et al. (2005) obtained sequences from three species each of 10 phylogenetically disparate lineages and assessed the number of variable or Potentially Informative Characters (PICs). Note that PICs are not the same as parsimony informative characters because they are derived from three-taxon comparisons. Quantification of PICs allowed them to separate the 21 loci into three tiers of varying phylogenetic potential (tier 1 having the greatest variability and tier 3 having the least). Importantly, Shaw et al. (2005) recognized in cost–benefit analysis that the ratio of variable characters to overall sequence length is only important to consider if competing regions require different numbers of sequencing reactions. The cost of sequencing a 900-bp region is the same as that for a 600-bp region if both sequences can be obtained with two sequencing primers. If the 900-bp region provides a greater number of variable characters, it is to be preferred, all else being equal.

PICs enumerated from three-taxon sets may only be considered a good measure of phylogenetic content if they are correlated with the number of variable characters for increased taxonomic sampling. Shaw et al. (2005) tested this correlation and found it to be significant. However, a better measure of phylogenetic information content is the number of parsimony informative characters present in a data set, which cannot be enumerated without sampling at least four taxa. There are also other traits of data sets that may be considered when choosing loci for cladistic analysis, including high character congruence (or low homoplasy) as measured by the consistency index or retention index. However, relatively noninformative data sets may have little homoplasy because variation is a necessary component of conflict. Therefore, another important measure of utility is resolution. Loci that resolve many relationships but exhibit homoplasy are more useful than those that provide neither resolution nor conflict. A related measure of utility is the degree to which data provide support for resolved clades. Of course, the most desirable aspect of phylogenetic data is accuracy, that is, the ability of a data set to obtain the correct tree. Assessing accuracy is much more slippery than other measures of phylogenetic utility. Rarely are “true” trees known, and a “true” gene tree may not be the same as a “true” species tree (Doyle, 1992).
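
The distinction between variable and parsimony informative characters can be illustrated with a minimal Python sketch. It is not the procedure used in this study or by Shaw et al. (2005). It assumes that only nucleotide substitutions are counted, that gaps, `?`, and `N` are treated as missing data, and that the function names (`classify_site`, `count_sites`) are purely illustrative. A column is scored as parsimony informative when at least two states are each shared by at least two taxa, which is why a three-taxon alignment can never yield an informative column.

```python
from collections import Counter

# Assumption: gaps, missing data, and N are ignored; other IUPAC
# ambiguity codes (R, Y, ...) would be counted as ordinary states.
AMBIGUOUS = set("-?N")


def classify_site(column):
    """Return 'constant', 'variable', or 'informative' for one alignment column."""
    counts = Counter(base for base in column.upper() if base not in AMBIGUOUS)
    if len(counts) < 2:
        return "constant"
    # Parsimony informative: at least two states, each present in >= 2 taxa.
    shared_states = sum(1 for n in counts.values() if n >= 2)
    return "informative" if shared_states >= 2 else "variable"


def count_sites(sequences):
    """Count variable and parsimony informative columns in aligned sequences.

    'variable' includes the informative columns, so informative <= variable.
    """
    lengths = {len(s) for s in sequences}
    if len(lengths) != 1:
        raise ValueError("sequences must be aligned to the same length")
    labels = [classify_site("".join(col)) for col in zip(*sequences)]
    informative = labels.count("informative")
    variable = informative + labels.count("variable")
    return {"variable": variable, "informative": informative}
```

Calling `count_sites` on the same columns with three and then four sequences shows how autapomorphic variation in a three-taxon survey can become informative only once a fourth terminal is added.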

In this study, the findings of Shaw et al. (2005) are further explored. For six infrageneric groups of 13–23 terminals, sequence data were obtained for cpDNA loci that were found to be among the most variable in the Shaw et al. (2005) study. Taxonomic groups were chosen that have been shown to be problematic with respect to obtaining phylogenetic resolution using conventional loci and for which ITS sequence data were readily available (with one exception): Tolpis (Asteraceae), Aichryson (Crassulaceae), Crassula subgenus Crassula (Crassulaceae), Zaluzianskya section Nycterinia (Scrophulariaceae), Chrysosplenium section Alternifolia (Saxifragaceae), and Sempervivum s.l. (Crassulaceae).

Two of these, Tolpis and Aichryson, represent Macaronesian island radiations, having originated probably no more than 20 million years ago (the age of the oldest Canary Island, Fuerteventura). Allelic and genetic divergence among island taxa is often low compared to that among continental species (Crawford et al., 1987), decreasing the ability of neutral loci to resolve relationships. Preliminary investigations reveal that this trend also characterizes members of Tolpis and Aichryson in Macaronesia (Park et al., 2001; Mort et al., 2003; Fairfield et al., 2004; Archibald et al., 2006; Crawford et al., 2006).

Crassula subgenus Crassula (Crassulaceae) and Zaluzianskya section Nycterinia (Scrophulariaceae) both have an impressive degree of morphological diversity. In subgenus Crassula, vegetative features such as the degree of woodiness and succulence and the number and position of hydathodes vary considerably, and many of these may be adaptations to periodic drought (Tölken, 1977; Jaarsveld, 2003). Preliminary analyses of cpDNA (matK and the trnL-trnF spacer) and nuclear ITS from many of the species of Crassula strongly support the monophyly of subgenus Crassula. However, these genes provide little resolution within the subgenus, and terminal branch lengths are extremely short (Randle et al., 2005). In Zaluzianskya section Nycterinia, morphological variation occurs in habit and floral characters (among others), such as floral symmetry, vestiture, and, most pronouncedly, the time of day that flowers open (Archibald et al., 2005b). Despite morphological and ecological variation, several loci have performed poorly in resolving relationships within either subgenus Crassula or section Nycterinia because of low sequence divergence (Archibald et al., 2005b; Randle et al., 2005).

Chrysosplenium section Alternifolia is composed of a small number of boreal and alpine species that have shown little plastid sequence divergence in past studies (Soltis et al., 2001). Ecological characteristics of the North American Alternifolia species indicate that species are well adapted to the Arctic environment, a biome that is only 3–15 million years old.

Sempervivum is distributed in alpine habitats of Europe and Turkey. While the genus is cytologically quite diverse, species differ little in gross morphology or phytochemistry ('t Hart et al., 2003). Often treated as distinct from Sempervivum s.s. based on morphology, Jovibarba species occur in similar habitats, overlap extensively in geographic distribution, and readily hybridize with those of Sempervivum. Phylogenetic investigation utilizing matK sequences strongly supports the monophyly of both Sempervivum s.s. (99% bootstrap) and Jovibarba (98% bootstrap) as well as a sister group relationship (100% bootstrap) between these clades (Mort et al., 2001). However, the chromosome number of Jovibarba (2n = 38) falls within the range of chromosome numbers present in Sempervivum s.s. (Favarger et al., 1968; 't Hart et al., 2003). Speciation may occur as the result of chromosomal change, either by rearrangement of chromosomal material or changes in chromosome number (Lewis, 1966). Such events, if recent enough, may be marked by little nucleotide substitution. For example, annual species of Coreopsis in eastern North America are isolated by chromosome restructuring but exhibit low ITS sequence divergence (Smith, 1976; Archibald et al., 2005a). In preliminary studies, Sempervivum showed little sequence variation among commonly used plastid loci.

In the present study, the phylogenetic utility of rapidly evolving plastid loci was measured indirectly by the ease of amplification and sequencing. Utility was also assessed more directly by the number of parsimony informative characters provided, character congruence, the degree of resolution and clade support of most parsimonious trees, and the degree to which resolution was concordant with trees inferred from combined cpDNA matrices. We also obtained ITS sequence data for five of the six taxon sets and assessed utility as described for comparison.

MATERIALS AND METHODS

Taxon sampling

Taxonomic groups for analysis were chosen based on difficulty in resolving relationships in previous studies using DNA sequence data. When possible, outgroup taxa were included in analyses to simulate realistic taxon sampling. Taxa chosen as outgroups were judged to be closely related to the ingroup based on previous phylogenetic studies. Six taxon sets of 13–23 taxa were used to assess the utility of rapidly evolving cpDNA loci.

Aichryson (Crassulaceae)

Aichryson consists of approximately 17 species and subspecific taxa endemic to Macaronesia, with the Canary Islands as the center of diversity. DNA sequence data from the plastid (matK, trnL-trnF spacer, and psbA-trnH spacer) and nuclear (ITS) genomes strongly support the monophyly of Aichryson, which is resolved as sister to the remaining Macaronesian clades of Crassulaceae (Mort et al., 2002; Fairfield et al., 2004). In this study, sequence data were obtained for 13 Canary Island species and an additional four subspecific taxa. Sedum modestum and S. jaccardianum, both northern African species, have been shown to be closely related to the Macaronesian clade (Mort et al., 2001); these species were included as outgroups.

Chrysosplenium section Alternifolia (Saxifragaceae)

In the current study, sequences were obtained from 15 individuals representing three species (C. iowense, C. tetrandrum, and C. rosendahlii, but see Packer [1963] for discussion of alternative taxonomic hypotheses). Individuals were collected from widespread geographic locations throughout North America.

Crassula subgenus Crassula (Crassulaceae)

Preliminary analyses of cpDNA (matK and the trnL-trnF spacer) and nuclear ITS from many of the species of Crassula strongly support the monophyly of subgenus Crassula. However, these genes provide little resolution within the subgenus, and terminal branch lengths are extremely short (Randle et al., 2005). In this study, sequences were obtained for 13 species that span the range of diversity in subgenus Crassula based on taxonomic, morphological, and phylogenetic criteria.

Sempervivum and Jovibarba (Crassulaceae)

Sempervivum and Jovibarba include 40–50 species of hardy, alpine leaf-succulents. In this study, sequences were obtained from 19 species of Sempervivum and two species of Jovibarba that have been included in Sempervivum in the past. In the Mort et al. (2001) study, the Sempervivum + Jovibarba clade is unresolved within a much larger clade that includes the Leucosedum, “Acre”, and Aeonium (i.e., Macaronesian clade) clades. Therefore, no single taxon stands out as the best outgroup. In this study, two species of Aichryson were included as outgroup taxa.

Canary Island clade of Tolpis (Asteraceae)

Tolpis includes at least 10 species native to Macaronesia and at least two others that occur in Europe and Africa (Crawford et al., 2006). Eighteen individuals from putatively 11 species were sampled in this study. Additionally, T. succulenta (an endemic to Madeira) was included as an outgroup.

Zaluzianskya section Nycterinia (Scrophulariaceae)

Zaluzianskya comprises 55 species of herbaceous annuals and perennials native to southern Africa. In the current study, sequences were obtained from 14 individuals (11 species) of section Nycterinia, Z. mirabilis, and one species of section Holomeria (Z. divaricata) to serve as an outgroup.

DNA extraction and sequencing

DNA was isolated from fresh or silica-dried material using either a modified CTAB protocol (Doyle and Doyle, 1987; Mort et al., 2001) or the DNeasy kit (Qiagen, Valencia, California, USA). Plastid genes for amplification were selected from among those providing the greatest number of PICs following Shaw et al. (2005). Regions attempted included psbM-trnD, rpl16, rpoB-trnC, rps16, trnD-trnT, trnS-trnfM, trnS(GCU)-trnG(UUC) (including the entire trnG(UUC) sequence), trnT-trnL, and ycf6-psbM using published primer sequences (Shaw et al., 2005). Nuclear ITS (including ITS1, 5.8S, and ITS2) was amplified using standard primers (Wen and Zimmer, 1996), forward: NNC-18S10 (AGGAGAAGTCGTAACAAG), and reverse: C26A (GTTTCTTTTCCTCCGCT). PCR reactions included 1× Biomix (Midwest Scientific, St. Louis, Missouri, USA) and 0.64 μM forward and reverse primer. ITS amplification reactions also contained 0.5% dimethyl sulfoxide to reduce secondary structure. Thermocycler conditions for amplification of plastid loci followed those of Shaw et al. (2005). Reaction conditions for ITS amplifications were as follows: 2 min at 95°C; 30 cycles of 45 s at 95°C, 45 s at 48°C, and 4 min at 72°C; and a final extension of 10 min at 72°C. Sequencing was performed as a service of Macrogen, Inc. (Seoul, South Korea) with the same primers used for PCR amplification. Sequences will be deposited in GenBank and released upon publication of the phylogenetic results for each taxon group; sequences may be requested from the author for correspondence prior to release in GenBank.

Analyses

Contigs were assembled using Sequencher 4.5 (Gene Codes Corp., Ann Arbor, Michigan, USA). Sequences were aligned primarily by eye using the program Se-Al (Rambaut, 1996), but in cases in which alignment was not straightforward, multiple sequence alignment was performed using Clustal X (Thompson et al., 1997), followed by manual adjustment. Positions 5′ or 3′ to the region of interest that were obtainable from sequencing reactions were included in matrices, contra Shaw et al. (2005). Given the small degree of sequence variation expected for each group, it was more desirable to include as many characters as possible than to limit analyses to characters belonging exclusively to a “region” per se. All phylogenetic analyses were performed using PAUP* version 4.01 (Swofford, 2002). For each taxon set, cpDNA loci were analyzed separately and in combination. Tree search consisted of 1000 random addition searches using tree-bisection-reconnection (TBR) branch swapping and saving all most parsimonious trees. Nodes that were not supported by at least one character state change under all character state optimizations were collapsed (“amb-” in PAUP*).

Phylogenetic utility was assessed in several ways. One assessment of utility is the number of parsimony informative characters provided by a locus. Complete sequences from all regions analyzed here were obtainable using two sequencing reactions, one with the forward primer and one with the reverse. Further, reactions that failed in several attempts were not repeated. Consequently, each sequence obtained represents roughly the same quantity of effort and expense (but see comments on trnS-trnG in Discussion). Following Shaw et al. (2005), the absolute number of parsimony informative characters was interpreted as a more critical determinant of utility than the percentage of characters that were parsimony informative.

Another measure of utility is resolution. The degree of resolution was assessed by examining the number of resolved nodes in the strict consensus of most parsimonious trees, which is reported as a percentage of the t − 3 nodes of a fully bifurcating tree, where t is the number of terminals. Nodal support was estimated using 1000 parsimony jackknife replicates, in which 37% of the characters were deleted using the “emulate Jac” option. Jackknife support was summarized by placement of nodes into one of three categories: nodes receiving less than 63% support, nodes supported by 63% or more (support equivalent to one uncontroverted synapomorphy or more), and nodes with support of 87% or more (support equivalent to two uncontroverted synapomorphies or more; Farris et al., 1996).
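
The link between these cutoffs and the number of supporting characters can be sketched as follows. The sketch assumes the usual reasoning behind parsimony jackknifing: with each character deleted independently with probability e⁻¹ (about 37%), a clade supported by k uncontradicted characters is lost only when all k are deleted, so its expected frequency is 1 − e⁻ᵏ. This gives roughly 0.63 for k = 1 and 0.86 for k = 2, close to the 63% and 87% cutoffs used here. The function names are illustrative, and the nested tallies are only one plausible way to report the three categories.

```python
import math


def expected_jackknife_frequency(k, deletion_probability=math.exp(-1)):
    """Expected recovery frequency of a clade with k uncontradicted characters.

    The clade disappears from a replicate only if all k supporting
    characters are deleted, which happens with probability p**k.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    return 1.0 - deletion_probability ** k


def summarize_support(percentages, moderate=63.0, strong=87.0):
    """Tally nodes below 63%, at or above 63%, and at or above 87%.

    The last two tallies are nested: a node with >= 87% support is also
    counted among the nodes with >= 63% support.
    """
    return {
        "lt63": sum(1 for p in percentages if p < moderate),
        "ge63": sum(1 for p in percentages if p >= moderate),
        "ge87": sum(1 for p in percentages if p >= strong),
    }
```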

To assess the accuracy of resolution using empirical data without actual observations of evolutionary species divergence, we used the trees inferred from the combined plastid data for comparison with those obtained from individual loci, assuming that the combined trees were the best estimate of relationships among these taxa. This assumption was based on several features of chloroplast data. It is expected that all base pairs in a plastid data matrix are obtained from the plastid itself and are the result of the shared branching history of the plastid (although not necessarily of the organism). This assumption is probably warranted in most instances, although there are documented cases where sequenced genes have been relocated from the plastid to another genomic compartment (Baldauf and Palmer, 1990; Gantt et al., 1991; Huang et al., 2003). Plastid recombination has also been documented (Vivjerberg, 1999; Huang et al., 2001; Marshall et al., 2001) but is probably quite rare. If transferred genes have not been sampled, recombination has not taken place, and rates of change were not high enough to cause long-branch attraction in analyses, then phylogenetic estimates should grow more accurate with the inclusion of more data. This supports our use of the combined trees as representatives of the most accurate phylogeny.

A test was formulated to determine whether a given locus was significantly better or worse at resolving relationships than other rapidly evolving loci. The null hypothesis is that a given locus provides no better or worse resolution than any combination of characters of equal size to that locus drawn from the combined cpDNA data set. In this case, the metric of quality of resolution is equivalent to the number of clades “correctly” resolved (i.e., resolved in combined cpDNA data) minus the number of “incorrectly” resolved nodes (i.e., conflicting resolution compared with the combined cpDNA data). This metric was termed the overall success of resolution (OSR; Simmons and Miya, 2004; Simmons et al., 2004). No penalty was assigned to nodes resolved in trees generated from individual loci that were unresolved (part of a polytomy) in the combined cpDNA tree.
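
The metric described above can be sketched in a few lines of Python. This is an illustration of the verbal definition only, not the PEST implementation, which may differ in detail. It assumes rooted trees represented as collections of clades (each clade a set of terminal names), with trivial clades left out by the caller; `compatible` and `overall_success_of_resolution` are hypothetical names. A clade from the single-locus tree counts as correct if it also occurs in the reference tree, as incorrect if it conflicts with any reference clade, and carries no penalty if it merely resolves a reference polytomy.

```python
def compatible(a, b):
    """Two clades can coexist on one tree if they are disjoint or nested."""
    return a.isdisjoint(b) or a <= b or b <= a


def overall_success_of_resolution(locus_clades, reference_clades):
    """Number of correctly resolved clades minus conflicting clades."""
    reference = {frozenset(c) for c in reference_clades}
    correct = incorrect = 0
    for clade in {frozenset(c) for c in locus_clades}:
        if clade in reference:
            correct += 1  # also resolved in the reference tree
        elif any(not compatible(clade, ref) for ref in reference):
            incorrect += 1  # contradicts a reference clade
        # otherwise: resolves a reference polytomy, so no penalty
    return correct - incorrect
```

With a reference tree containing the clades {A, B} and {A, B, C}, a locus tree containing {A, B}, {C, D}, and {D, E} scores one correct clade, one conflicting clade ({C, D} overlaps {A, B, C} without nesting), and one unpenalized clade, for an OSR of 0.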

To create a null distribution, 500 bootstrap data sets were created by sampling from the entire plastid data set, with the number of sampled characters equal to the number of characters in the locus in question. For each replicate, a strict consensus of the most parsimonious trees was compared with the tree inferred from the combined plastid matrix. For each comparison, OSR was calculated using PEST 3.0 (Zujko-Miller and Miller, 2003). The OSR obtained from the actual data of the locus in question was then compared to the null distribution of bootstrap replicates. Because this was a two-tailed test, rejection of the null was inferred for tail probabilities ≤ 0.025. Some loci were missing from the Tolpis and Chrysosplenium data sets, so these taxon sets were not included in this analysis. Further, when sequence data were not available for a given locus and taxon, combined analysis of cpDNA and the generation of null trees was performed without those taxa.
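
The resampling logic of this test can be outlined as below. The sketch is schematic: the tree search and OSR calculation are represented by a caller-supplied function, `score_columns`, which is a hypothetical stand-in for the PAUP* and PEST steps and is not part of either program. It also assumes that bootstrap sampling draws character columns with replacement and that each tail probability is the fraction of null scores at least as extreme as the observed score.

```python
import random


def null_distribution(n_total_chars, locus_size, score_columns,
                      replicates=500, seed=0):
    """Score pseudo-loci of `locus_size` columns drawn from the combined matrix.

    `score_columns` (hypothetical) receives a list of column indices and
    returns a score such as OSR for the tree inferred from those columns.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(replicates):
        # Assumption: columns are sampled with replacement (bootstrap).
        columns = [rng.randrange(n_total_chars) for _ in range(locus_size)]
        scores.append(score_columns(columns))
    return scores


def tail_probabilities(observed, null_scores):
    """Fractions of null scores <= and >= the observed score."""
    n = len(null_scores)
    lower = sum(1 for s in null_scores if s <= observed) / n
    upper = sum(1 for s in null_scores if s >= observed) / n
    return lower, upper


def two_tailed_verdict(observed, null_scores, alpha_per_tail=0.025):
    """Classify a locus as better, worse, or not different from the null."""
    lower, upper = tail_probabilities(observed, null_scores)
    if upper <= alpha_per_tail:
        return "better"
    if lower <= alpha_per_tail:
        return "worse"
    return "not rejected"
```

Keeping the scoring step as a parameter separates the statistical comparison from the phylogenetic software, so the same outline applies regardless of which programs infer the trees.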

RESULTS

Data matrices

PCR amplification was not uniformly successful for all loci across taxa. ITS was not sequenced for the full complement of Tolpis species because preliminary experiments showed it to be invariant across several morphologically and ecologically divergent species. Shaw et al. (2005) had difficulty obtaining PCR amplifications for trnD-trnT in members of Asteraceae, possibly from an inversion. In this study, trnD-trnT failed to amplify in both Tolpis (Asteraceae) and Chrysosplenium (Saxifragaceae). However, trnD-trnT was obtained for the remaining taxon sets and was included in statistical analyses. In the Shaw et al. (2005) study, rpoB-trnC also proved problematic for Asteraceae as well as for gymnosperms. This region failed to amplify for all accessions of Tolpis, Chrysosplenium, and subgenus Crassula; it was therefore excluded from subsequent analysis for all taxon sets. Similarly, the trnS(GCU)-trnG(UUC) region failed to amplify in Aichryson, Chrysosplenium, and Sempervivum and was not included in subsequent analyses. Attempts to amplify trnT-trnL were largely unsuccessful across the taxa included, and therefore this region was also excluded.

The following regions were not problematic and were sequenced in their entirety for all groups: psbM-trnD, rpl16, rps16, trnS-trnfM, and ycf6-psbM. However, not every DNA accession yielded an amplicon for every region. To maximize sampling, DNA accessions that provided amplicons for at least five of the seven regions (including trnD-trnT and ITS) were included in the analysis.

Potentially informative characters vs. parsimony informative characters

In an analysis of 21 loci, Shaw et al. (2005) demonstrated that for three-taxon groups, variation in PICs could be explained to some extent by the length of the sequence obtained (r² = 0.22–0.82). In the present study, sequence length explains only 5.2% of the variation in the number of parsimony informative characters when all plastid loci are included in the analysis (P = 0.20; Fig. 1) and only 0.9% if ITS sequences are included (P = 0.55). This is not surprising, as the current analysis only includes loci representing the highest portion of the variation in PIC values as ascertained by Shaw et al. (2005). It also supports the view that if any of these loci are found to be superior in resolving relationships for any given lineage, it may not be simply a function of sequence length.

Interpretation of PIC values of three-taxon matrices (as in Shaw et al., 2005) as indicators of phylogenetic signal would be inappropriate if it were shown that PIC values were a poor measure of phylogenetically useful information (here considered to be the number of parsimony informative characters). To test the assumption that PIC values are representative of phylogenetic information content, Shaw et al. (2005) obtained the full complement of noncoding cpDNA sequence data for 10 taxa in two of the taxon sets, Hibiscus and Prunus, and showed that the number of PICs recorded in the three-taxon survey was correlated with the number of variable characters in the 10-taxon survey. However, variable characters that only provide autapomorphic state changes do not provide phylogenetic information, at least within a parsimony framework. To test the reliability of PICs as a measure of hierarchical information content, the number of variable characters in each data set examined here was compared to the number of parsimony informative characters. Regression analysis demonstrates correlation between variable and informative characters for combined plastid loci (r² = 0.61; P < 0.01) and combined plastid loci plus ITS (r² = 0.76; P << 0.01; Fig. 2). This further justifies the use of the PIC as a suitable proxy for informative characters in preliminary assessments of sequence variation.
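
For readers wishing to repeat this kind of comparison on their own matrices, the coefficient of determination for a simple linear regression can be computed as in the sketch below. It assumes ordinary least squares with one predictor; the function name is illustrative, and no values from this study are reproduced.

```python
def linear_regression_r2(x, y):
    """Ordinary least-squares fit of y on x; returns (slope, intercept, r2)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("both variables must vary")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # With a single predictor, r2 is the squared Pearson correlation.
    r2 = sxy ** 2 / (sxx * syy)
    return slope, intercept, r2
```

Here `x` would hold the number of variable characters per data set and `y` the number of parsimony informative characters; the associated P value would require an additional test not shown in this sketch.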

Relative utility of plastid loci

Data matrices varied considerably in alignment length among taxon sets (Table 1). It is inappropriate to apply statistical analyses to summary data obtained in this study, but several qualitative phenomena were observed. ITS provided the fewest nucleotides for all taxon sets, and trnS-trnfM provided the most. Consistent with the expectation that nuclear genes undergo an overall faster rate of substitution than plastid genes, ITS provided the highest percentage of parsimony informative characters, although not necessarily the highest overall number of variable characters for all data sets (Fig. 3). Taxon sets demonstrated considerable variability in the number of parsimony informative characters when all plastid sequences were combined. The most variable by far was Sempervivum with 217 parsimony informative characters, while the least was Tolpis with a mere seven parsimony informative characters distributed over 4400 aligned nucleotides for 21 taxa. When plastid and ITS data are considered together, ITS sequences contribute 15–50% of the total number of parsimony informative characters, while only contributing 9–15% of the total characters. ITS provides the greatest total number of parsimony informative characters for Aichryson, Chrysosplenium, and Sempervivum—twice as many, in fact, as the best two plastid regions combined for those taxa. In Crassula, trnS-trnfM provided the greatest number of informative characters. For Zaluzianskya, three genes (ITS, rps16, and trnS-trnfM) tied for the greatest number of characters provided.

Interpreting the relative ability of data sets to resolve clades is more complex (summarized in Table 2). In a parsimony framework, statistics relating to phylogenetic utility include measures of homoplasy, such as the consistency index (Kluge and Farris, 1969) and retention index (Farris, 1989). Homoplasy has sometimes been interpreted as an undesirable characteristic of phylogenetic data (Lyons-Weiler et al., 1996; Swofford et al., 1996). This may be true to the extent that any one character may be misleading if its state distribution does not reflect the true branching history. However, data that are homoplastic may still imply phylogenetic structure, often more than internally consistent data sets (Källersjö et al., 1999; Wenzel and Siddall, 1999). For data presented in this study, this often appears to be the case. When the percentage of nodes recovered with ≥63% jackknife support is plotted as a function of rescaled consistency index (RC = CI × RI), the relationship is significant (albeit only slightly so) and negative (Fig. 4; r² = 0.118, P = 0.04, slope = −0.44). Of all the taxon sets, combined plastid data resolve the greatest proportion of nodes with ≥63% jackknife support for the genus Crassula. This also has the lowest RC value (0.66) and therefore the greatest homoplasy. On the other hand, Tolpis has the highest RC (1.00), probably as a result of invariance, but very few clades were supported with ≥63% jackknife. In contrast to these examples, the Crassula ITS data set not only had the lowest RC (0.61) of all ITS data sets but also resolved the fewest nodes with support.
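For readers less familiar with these indices, the following sketch shows how CI, RI, and RC relate to one another, using the usual ensemble definitions: CI is the minimum possible number of steps divided by the observed number of steps, and RI is (maximum − observed) / (maximum − minimum). The function name and the step counts in the example are hypothetical. The handling of the case in which maximum equals minimum is a convention chosen for this sketch only.

```python
def homoplasy_indices(min_steps, obs_steps, max_steps):
    """Return (CI, RI, RC) from ensemble step counts on a tree.

    min_steps: fewest steps the characters could require on any tree
    obs_steps: steps actually required on the tree being evaluated
    max_steps: most steps the characters could require on any tree
    """
    if not (0 < min_steps <= obs_steps <= max_steps):
        raise ValueError("expected 0 < min_steps <= obs_steps <= max_steps")
    ci = min_steps / obs_steps
    if max_steps > min_steps:
        ri = (max_steps - obs_steps) / (max_steps - min_steps)
    else:
        ri = 0.0  # RI is undefined here; 0.0 is an arbitrary choice for this sketch
    return ci, ri, ci * ri


# Hypothetical step counts: 10 minimum, 15 observed, 30 maximum.
ci, ri, rc = homoplasy_indices(10, 15, 30)
print(round(ci, 3), round(ri, 3), round(rc, 3))  # 0.667 0.75 0.5
```

Because both CI and RI fall as extra steps accumulate, a lower RC corresponds to greater homoplasy, which is the sense in which the values are compared in this section.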

The combined plastid matrices varied in ability to resolve relationships across groups. For Crassula, tree searches resulted in two most parsimonious trees. Nine of 10 nodes of the consensus tree were resolved with ≥63% jackknife support. Of these, seven received ≥87% support. Although the greatest number of parsimony informative characters was obtained from Sempervivum plastid sequences, very few nodes were well supported when these were analyzed together. Examination of the most parsimonious trees revealed that this was largely the result of many character state changes optimizing on the branch separating the ingroup from the outgroup, which varied in length from 172–236 of 420 steps in the most parsimonious trees, depending on optimization. This branch received high (100%) jackknife support when all cpDNA characters were taken into account, as well as in analyses of each individual locus (100%) for which data for both outgroup taxa were available.

Overall, there was no clear indication that any single plastid locus or ITS was best at providing clade support for all taxa surveyed (Fig. 5). The mean proportion of nodes receiving ≥63% jackknife support varied by less than 10% across loci. Most of the variation in node support was among taxon sets rather than among loci. For instance, trnS-trnfM resolved 60% of nodes with ≥63% jackknife support in Crassula, but provided no jackknife support for any nodes in Aichryson or Zaluzianskya. Similarly, in Aichryson and Sempervivum, ITS performed better than any of the plastid loci, whereas in Crassula it yielded no clades with ≥63% jackknife support.

Overall success of resolution

The OSR metric of Simmons et al. (2004) was used to test whether a locus performed better or worse than a data matrix of the same size randomly selected from the combined plastid data set. Because multiple regions failed to amplify in Tolpis and Chrysosplenium and because these data sets yielded so little resolution when the entire plastid matrix was analyzed, they were excluded from this test. It is important to note that the characters forming the null distribution were obtained only from loci included in this study and that these loci were among the fastest-evolving regions as demonstrated by Shaw et al. (2005). Therefore, the test only distinguished performance relative to other fast-evolving loci. ITS and cpDNA may yield conflicting topologies that may both be correct if gene trees and species trees have different topologies; ITS sequence data therefore cannot be evaluated using OSR in this context.
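The logic of comparing a locus against same-sized random draws from the combined matrix can be outlined as follows. This is only a generic resampling sketch, not the procedure of Simmons et al. (2004): the OSR calculation itself is not reproduced, and the `score` callable is a placeholder for whatever performance measure is applied to a set of sampled character columns. Sampling without replacement, the number of draws, and all names below are assumptions made for illustration.

```python
import random


def resampling_p_values(n_total_chars, locus_size, observed_score, score,
                        n_draws=1000, seed=0):
    """Compare an observed score with a null built from random character subsets.

    score: placeholder callable taking a list of column indices and returning
           a number (for example, a resolution metric); supplied by the user.
    Returns (p_better, p_worse): the fractions of random draws scoring at
    least as high, and at most as high, as the observed locus.
    """
    rng = random.Random(seed)
    null_scores = []
    for _ in range(n_draws):
        # Assumption: characters are drawn without replacement.
        columns = rng.sample(range(n_total_chars), locus_size)
        null_scores.append(score(columns))
    p_better = sum(s >= observed_score for s in null_scores) / n_draws
    p_worse = sum(s <= observed_score for s in null_scores) / n_draws
    return p_better, p_worse


def demo_score(columns):
    """Stand-in score for demonstration: counts columns with index % 10 == 0."""
    return sum(1 for c in columns if c % 10 == 0)


p_better, p_worse = resampling_p_values(1000, 200, observed_score=30,
                                        score=demo_score)
print(p_better, p_worse)
```

A small `p_better` would indicate that few random draws match the locus, and a small `p_worse` that few random draws do as badly. In practice the score would require a tree search for every draw, which is where nearly all of the computational cost lies.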

The resulting P values are listed in Table 3. No region performed better or worse than the null distribution for all taxon sets. Significance was only found for the Crassula ycf6-psbM data set, which performed better than the null plastid distribution in resolving relationships.

DISCUSSION

Utility of fast-evolving plastid loci

When choosing loci for phylogenetic inference, often the first step is to determine sequence variability over a small sample of taxa. Conventionally, a number of regions are amplified and sequenced for the same set of taxa so that comparisons can be made. As a matter of practicality, regions that present difficulty in either amplification or sequencing are often eliminated in this preliminary step. This is a reasonable practice because expending resources on the development of special primers or PCR protocols seems unwise if obtaining sequences from other regions is not problematic.

Of the nine plastid loci examined in the present study, three (rpoB-trnC, trnSGCU-trnGUUC, and trnT-trnL) failed to amplify consistently using published primers and were thus eliminated. Successful amplification of rpoB-trnC was achieved only for Aichryson, Sempervivum, and Zaluzianskya. In the Shaw et al. (2005) study, rpoB-trnC provided the second greatest mean number of PICs of all loci attempted, although it also failed to provide data for members of Asteraceae or for gymnosperms. Of all plastid loci examined in this study, rpoB-trnC provided the greatest number of parsimony informative characters for each of the taxon sets for which it was obtained. Conversely, for these three taxon sets, rpoB-trnC was not clearly better at resolving or supporting clades. In this regard, it only performed better than five of the six plastid loci in Aichryson, three loci in Sempervivum, and four loci in Zaluzianskya. Given that it has been shown to be informative if obtainable, rpoB-trnC should certainly remain a candidate gene in preliminary studies. The trnSGCU-trnGUUC region was the third best at providing PICs for taxa sampled by Shaw et al. (2005). In the present study, trnSGCU-trnGUUC amplicons were only obtainable in Crassula, Tolpis, and Zaluzianskya. In Tolpis, trnSGCU-trnGUUC provided precisely as many parsimony informative characters and clades with at least 63% support as trnS-trnfM, otherwise the best plastid locus for this group. However, for the Crassula and Zaluzianskya taxon sets, trnSGCU-trnGUUC did not stand out as especially valuable in either providing parsimony informative characters or support for hypothesized relationships. Further, for many accessions of both Crassula and Zaluzianskya, PCR resulted in two fragments. This may be an important consideration when selecting loci, in that excising bands from gels or cloning adds labor and expense. In Shaw et al. (2005), trnT-trnL provided the fifth most PICs. 
In the present study, trnT-trnL was only obtained for the genus Crassula, of which two species failed to amplify over several attempts. The number of informative characters and nodes with ≥63% jackknife support were intermediate compared to other loci tested for this taxon set.

Of the plastid loci included in OSR tests, none was clearly the best for all taxon sets or for all measures of utility. The trnS-trnfM region may provide the most characters (and the most parsimony informative characters in some groups) but was not especially useful in supporting nodes in general. On the other hand, some regions that were not very useful in certain lineages were better (although without statistical significance) at resolving nodes correctly than a region of equal size drawn from other fast-evolving loci, such as rps16 in Zaluzianskya section Nycterinia. The only data set that was statistically better than the null OSR was ycf6-psbM in Crassula subgenus Crassula. However, given that 24 tests were performed, it is likely that one of them would have achieved significance by chance alone.
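The point about chance significance across 24 tests can be made concrete with a short calculation. The sketch below assumes that the tests are independent, which is unlikely to hold exactly because loci from the same taxon set share data with the combined matrix. The two thresholds shown (0.025, as given in the Table 3 caption, and the conventional 0.05) are used for comparison only, and the function names are hypothetical.

```python
def prob_at_least_one_false_positive(n_tests, alpha):
    """Chance of one or more rejections when every null is true (independent tests)."""
    return 1.0 - (1.0 - alpha) ** n_tests


def expected_false_positives(n_tests, alpha):
    """Expected number of rejections when every null is true."""
    return n_tests * alpha


for alpha in (0.025, 0.05):
    print(alpha,
          round(prob_at_least_one_false_positive(24, alpha), 3),
          round(expected_false_positives(24, alpha), 2))
```

Under these assumptions, the probability of at least one spurious rejection among 24 tests is roughly 0.46 at a threshold of 0.025 and roughly 0.71 at 0.05, so a single significant result is not strong evidence on its own.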

We suspect that among these rapidly evolving plastid loci, any determination of greatest utility will be highly contingent on (1) organismal lineage, (2) measure of utility employed, or (3) taxon sampling. First, it is clear that phylogenetic utility varies significantly among lineages. For example, trnS-trnfM performs well in providing both parsimony informative characters and jackknife support for subgenus Crassula but performs poorly for Aichryson or Zaluzianskya for either measure. For recently or rapidly evolved taxa, this may simply be an issue of sampling error. If few substitutions are to be expected per nucleotide site, a few substitutions more or less may be the difference between the utility of two loci in the same taxon, or the difference between the same locus in multiple lineages. Second, utility can be measured several ways, and these measurements are not necessarily independent of each other. The number of variable characters may serve as a proxy for the number of informative characters even if sample size is relatively small. On the other hand, the absolute number of parsimony informative characters was not entirely predictive of ability to resolve nodes with support. Sempervivum cpDNA provided many parsimony informative characters but performed poorly at resolving or providing support for nodes. Similarly, decreased consistency resulted in a slight increase in the ability of loci to resolve relationships in many instances. Third, it follows that any measure of utility is highly dependent on taxon sampling. Whether a data set is useful or not depends not only on the number of informative characters, but also on how character state changes are distributed on shortest trees. This will always be contingent on the relationships a hypothesized tree is required to explain. 
Even though rpl16 provided many more parsimony informative characters for Sempervivum (47) than for Crassula (8), a higher percentage (43%) of Crassula clades were resolved with ≥63% jackknife support than were Sempervivum clades (17%). These types of contingencies do not allow easy or universal assessment of loci across all taxa, especially among those taxa that have already been shown to have low sequence divergence.

Utility of nuclear ribosomal ITS sequences

As expected, ITS provided more parsimony informative characters than many of the rapidly evolving loci examined here (Shaw et al., 2005). Even for groups in which trnS-trnfM provided a greater number of informative characters, ITS was always better than at least three of the five other loci. Although homoplasy in ITS was relatively high compared to the cpDNA loci, the number of nodes resolved with ≥63% support was comparable. Insertion/deletion (or gap) characters were not examined in this study, though in other studies they have been useful in providing resolution. This is evidenced by the 281 published studies employing the method of gap coding devised by Simmons and Ochoterena (2000). Among data sets examined in the present study, gap characters varied considerably in hierarchical structure using the “complex” coding scheme of Simmons and Ochoterena (2000), making comparison to nucleotide characters difficult. This is because the number of state changes required by complex gap characters and the degree to which these are ordered by stipulation of a transition matrix may be extremely variable, while nucleotide characters include no more than four states, and state changes all count as a single step if unweighted parsimony is used. Due to the noncoding nature of nrDNA, ITS might reasonably be expected to provide more insertion/deletion characters than do protein-coding genes. However, neither ITS nor cpDNA characters consistently provide the most gap characters across all taxa (data not shown).

There is little doubt that ITS may provide sufficient numbers of parsimony informative characters, good resolution, and strong clade support. However, given the caveats of Alvarez and Wendel (2003), the question remains: how well does ITS estimate organismal phylogeny? This cannot be answered directly, but for any given study, congruence with cpDNA loci supports the assertion that both the ITS tree and cpDNA tree are reasonable hypotheses of species relationships. The incongruence length difference (ILD) test (Farris et al., 1994), implemented in PAUP* (Swofford, 2002) as the partition homogeneity test, was used to determine whether incongruence between ITS and cpDNA was significant. One thousand random partitions with heuristic searches were used to generate the null distribution. The null was rejected for Aichryson, Crassula, Sempervivum, and Zaluzianskya (P < 0.001). No ITS data matrix was assembled for Tolpis, so ILD was not tested. For Chrysosplenium, the null was not rejected (P > 0.05), which was not surprising as the ITS tree is perfectly congruent with the cpDNA tree, although very poorly resolved. For Aichryson, removing one or two offending taxa does not reverse incongruence. Agreement subtrees, formed by sequentially pruning taxa from two incongruent trees until congruence has been obtained, may be used to infer (if roughly) the degree to which two trees are topologically congruent. In the agreement subtrees inferred from the Aichryson cpDNA and ITS topologies, only seven of the initial 19 taxa remain. Incongruence between ITS and plastid loci may clearly be an impediment to accurate estimation of phylogeny in Aichryson. For Crassula, ITS had few parsimony informative characters, relatively high homoplasy (RC = 0.609), and failed to resolve any nodes with jackknife support ≥63%. Conversely, cpDNA loci had more than five times as many parsimony informative characters, slightly lower homoplasy (RC = 0.660), and resolved many nodes with jackknife ≥63%. 
Agreement subtrees exclude five of 13 terminals. The combined data set results in a tree identical to that inferred from cpDNA alone, with all nodes supported (≥63% jackknife). Nonetheless, ITS provides a significantly incongruent, if poorly supported, hypothesis of relationships compared to the cpDNA tree. In Sempervivum, strong topological incongruence is evident by visual comparison of ITS and cpDNA trees, though neither data set provides support for many nodes. The agreement subtree requires elimination of 15 of 23 terminals. However, very few characters in either ITS or cpDNA are useful in resolving relationships among ingroup taxa, the majority of character state changes optimizing on the node separating the ingroup from the outgroup.

The situation in Zaluzianskya appears at first similar to that of Crassula. ITS provides few informative characters and very little support or resolution with only two nodes supported by ≥63% jackknife. However, despite statistical incongruence as measured by ILD, the tree recovered when data are combined includes one of the clades recovered by ITS (and not cpDNA) but with higher support (90%) than in the ITS tree (72%). This would indicate that phylogenetic signal in cpDNA had been masked by homoplasy and was strengthened by that of the ITS—in effect, cpDNA is exhibiting “secondary” signal (Nixon and Carpenter, 1996) or “additive” signal (Wenzel and Siddall, 1999) in combination with ITS. In this instance, retaining ITS despite incongruence results in an increase in resolution and support.

Sampling strategy

PICs in three-taxon sets were significantly correlated with the number of variable characters if more taxa were sampled (Shaw et al., 2005). In this study, the number of parsimony informative characters in data sets was significantly correlated with the number of variable characters. However, we would argue that in preliminary studies, obtaining PIC values for three-taxon comparisons may not be an optimal strategy, as is best illustrated by the Sempervivum data. Given the position of the Sempervivum clade as basally derived within a clade that contains nearly the entire family (Mort et al., 2001), there is no clear, single choice of best outgroup. For most loci, almost all character state changes map to the branch separating the ingroup from the outgroup. If two ingroup and one outgroup taxa are sampled, the number of PICs would not give a fair indication of how useful a locus is at resolving ingroup relationships (because most variability is between the ingroup and the outgroup). Sampling strategy would benefit from adding an additional ingroup taxon so that variability within the ingroup and between the ingroup and outgroup could be assessed separately.
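The suggestion to assess ingroup variability separately from ingroup-outgroup divergence can be sketched as a simple tally over alignment columns. The function name and the toy sequences are hypothetical, gaps and ambiguity codes are treated as missing data by assumption, and the counts are raw site counts rather than PIC values as defined by Shaw et al. (2005).

```python
MISSING = set("-?N")  # assumption: gaps and ambiguous calls are ignored


def partition_variation(ingroup, outgroup):
    """Return (within_ingroup, between_only) site counts.

    within_ingroup: sites with more than one state among ingroup sequences
    between_only:   sites fixed within the ingroup whose state is absent
                    from the outgroup sequences
    """
    within = between_only = 0
    n_in = len(ingroup)
    for column in zip(*ingroup, *outgroup):
        in_states = {b for b in column[:n_in] if b not in MISSING}
        out_states = {b for b in column[n_in:] if b not in MISSING}
        if len(in_states) > 1:
            within += 1
        elif in_states and out_states and in_states.isdisjoint(out_states):
            between_only += 1
    return within, between_only


# Toy data (illustrative only): three ingroup sequences and one outgroup.
ingroup = ["ACGTAC", "ACGTAT", "ACGAAC"]
outgroup = ["TTGTGC"]
print(partition_variation(ingroup, outgroup))  # (2, 3)
```

In this toy case three of the five variable sites separate the ingroup from the outgroup and say nothing about relationships within the ingroup, which is the pattern described above for Sempervivum.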

Cost–benefit considerations

When deciding which loci to sample, it is important to consider the cost in labor and materials of obtaining sequences vs. the probability that such sequences will provide thorough tests of phylogenetic hypotheses. Among ITS and cpDNA loci tested here, the cost in labor and material is approximately equal, and no one DNA region generally outperformed all others according to the various measures of utility within the various taxon sets.

Other costs are more difficult to quantify. ITS may have a higher probability of giving an incorrect solution than does plastid DNA because of the confounding phenomena discussed by Alvarez and Wendel (2003). On the other hand, there is no guarantee that plastid loci will obtain the correct species phylogeny, even if they obtain the correct gene phylogeny. Further, many of the tacit assumptions held about plastid evolution (uniparental inheritance, single-copy genes, nonrecombination) have been shown in recent years to be subject to violation (Wolfe and Randle, 2004), though violations of these assumptions are probably sufficiently rare to be excluded from consideration in many studies.

One solution to problems associated with gene paralogy is to clone PCR products, obtain sequences, and infer gene trees to determine orthologs, which can then be used to infer species trees (Slowinski and Page, 1999; Simmons et al., 2000). However, cloning adds time, labor, and considerable expense. Further, inference of species trees from gene trees can be difficult (Simmons and Freudenstein, 2002; Cotton and Page, 2003). As Alvarez and Wendel (2003) point out, gene conversion may have the effect of homogenizing rDNA arrays following an introgression event, increasing the probability of obtaining an incorrect gene tree, even if all members of a gene family are obtained from each taxon.

Recently, efforts have been made to obtain nuclear sequence data from low copy number regions, combining the benefits of high substitution rates and decreased probability of obtaining paralogous sequences (Small et al., 1998; Archambault and Bruneau, 2004; Popp and Oxelman, 2004; Popp et al., 2005; Roncal et al., 2005; Syring et al., 2005). Further benefits of this approach include biparental inheritance and decrease in linkage effects. Because plastid DNA is by and large uniparentally inherited, only a single parental haplotype is represented in hybrid or introgressed lineages. Chromosomal structure of the nuclear genome also allows independent segregation of linkage groups. The choice of multiple loci therefore allows independent falsification of phylogenetic hypotheses (Small et al., 2004). Disadvantages of nuclear ITS pertain to other nuclear loci, but probably to a lesser degree. Establishing orthology will always be difficult in any gene that occurs in multiple-copy gene families or if heterozygosity is fixed (as in many recently derived allopolyploids); however, if a gene family is small enough, this might be addressed by cloning. The problems of recombination and concerted evolution apply to low-copy nuclear genes as well. Regarding cost–benefit considerations, the greatest difficulty may be in simply characterizing gene families sufficiently to obtain adequate sampling of copies (Mort and Crawford, 2004; Small et al., 2004). However, this problem is not intrinsic and is sure to be ameliorated as studies incorporating data from these regions accumulate.

In molecular systematic studies of closely related taxa, it is difficult to say which is the more costly alternative: obtaining molecular sequence data that evolve rapidly enough to obtain well supported but incorrect phylogenies (i.e., not tracking organismal phylogeny) or obtaining more slowly evolving data that do not obtain adequate resolution of any phylogeny. The practice of acquiring sequence data from two or more loci that can be reasonably expected to provide independent tests of phylogeny (e.g., from different genomes) is a proven means of avoiding at least the first scenario. Further, if such loci are sufficiently variable, they may provide useful markers for studies of population genetics or phylogeography that use explicitly reticulate models of evolution. Using rapidly evolving loci, such as those investigated here, may help avoid the second eventuality. If anything is certain, it is that loci will vary in utility across taxa. Low-copy nuclear genes may become the workhorse of intrageneric systematics once ITS has been put to pasture. However, even if this occurs, low-copy nuclear genes will be most useful if interpreted in the context of other independently evolving loci such as those described here from the chloroplast genome.

Table 1. Summary of sequence data from plastid loci and ITS
Table 2. Summary of statistics pertaining to resolution
Table 3. P values for the test of the null hypothesis that individual loci performed no better (top) or worse (bottom) than data sets randomly drawn from the combined matrix, using “overall success of resolution” as a metric. The null was rejected with probability <0.025 (marked *)

Fig. 1. Linear regression analysis did not detect significant correlation between the number of aligned characters and the number of parsimony informative characters for plastid matrices (r² = 0.052, P = 0.20) or plastid loci with ITS data included (shown; r² = 0.009, P = 0.55)

Fig. 2. Linear regression analysis detected significant correlation between the number of variable characters and parsimony informative characters in data sets, for both plastid loci (r² = 0.610, P < 0.01) and for plastid loci with ITS data included (r² = 0.765, P << 0.01)

Fig. 3. Relative performance of cpDNA loci and ITS at providing parsimony informative characters for analysis of each taxon set. Arrows indicate loci providing the greatest number for each taxon set. ITS performed best for Aichryson, Sempervivum, and Chrysosplenium, while trnS-trnfM provided the greatest number of parsimony informative characters for Tolpis and Crassula. Three loci provided the maximum of 10 characters in analyses of Zaluzianskya sequences: ITS, rps16, and trnS-trnfM

Fig. 4. Linear regression analysis detected a weakly significant and negative correlation between rescaled consistency index values and the percentage of t − 3 nodes receiving at least 63% jackknife support

Fig. 5. The percentage of t − 3 nodes receiving at least 63% jackknife support. The proportion of nodes receiving greater than 87% jackknife support is in gray. A circle represents the mean for each locus. Bars were placed on the x-axis in the order of least to greatest percentage of nodes receiving support. The taxonomic group is abbreviated below each bar as follows: A = Aichryson, C = Crassula, Ch = Chrysosplenium, S = Sempervivum, T = Tolpis, and Z = Zaluzianskya