Assigning bacteria to discrete populations, or species, can be problematic, since bacteria differ greatly in the extent and promiscuity of recombination (
8). In some bacteria, homologous recombination appears to be restricted because of either inefficient mechanisms or vectors of genetic exchange or ecological factors that limit the extent to which genetically distinct strains meet each other in nature. In other bacteria, recombination is very frequent, and in some cases, homologous recombination occurs between bacteria that differ substantially (20 to 25%) in sequence and that are assigned to different but closely related named species. Attempts to define species typically determine the relationships among a set of isolates that are considered to represent each named species, using DNA-DNA hybridization or the sequences of rRNA or conserved genes.
Homologous recombination distorts the true relationships between isolates of closely related named species and can lead to inconsistent relationships among those species inferred from the sequences of different genes (
6,
9). Consequently, defining species using single loci is inappropriate, particularly for those species where rates of recombination are high. The use of multilocus sequence-based approaches ensures that recombination at one locus is buffered by the more reliable indications of relatedness provided by the other loci. Furthermore, in defining any species, we must analyze populations of each candidate species and not just one or a few isolates (
9).
Two general types of multilocus approaches can be considered as tools to distinguish related species. Microarrays can detect differences in gene repertoire among isolates (
15) but suffer the serious disadvantage that genes that differ among isolates of a named species, or between related species, reflect the least stable part of the genome, rather than the core genome, which is likely to be the most phylogenetically informative (
12). The alternative approach is to use the sequences of multiple housekeeping genes that are part of the core genome (
22). These data are now widely available as isolates of many pathogens are characterized by sequencing internal fragments of seven housekeeping loci, a technique referred to as multilocus sequence typing (MLST) (
13).
Previously, we have shown that MLST-based approaches are capable of discriminating named species among the human
Neisseria (
9). In this paper, we test the utility of this approach in defining the boundaries of the named species
Streptococcus pneumoniae (the pneumococcus) and the extent to which it can be resolved from closely related isolates of uncertain taxonomic status that cocolonize the human nasopharynx. The confident identification of these members of the mitis group of alpha-hemolytic streptococci is fraught with difficulty. At present, presumptive members of the species
S. pneumoniae are usually identified in clinical microbiology laboratories by colonial morphology when grown on blood agar and by optochin sensitivity. In the case of optochin-insensitive isolates that otherwise appear similar to pneumococci, bile solubility may also be used. Pneumococci are further characterized by serotyping using the Quellung reaction, and isolates that can be assigned to one of the 90 recognized pneumococcal serotypes are considered unambiguously to be
S. pneumoniae. After applying these tests, a number of nonserotypeable isolates that appear to be similar to pneumococci but are of uncertain taxonomic status remain.
The relationship between nontypeable pneumococcus-like isolates and genuine pneumococci has been considered by Whatmore et al. (
23). Those authors point out the difficulty in resolving these issues by gene content. Genes previously thought to be limited to the pneumococcus, including those encoding the virulence factors pneumolysin (
ply) and autolysin, have been found in isolates assigned to other named species such as
S. oralis and
S. mitis (
14,
23).
This work considers the potential of MLST in defining the pneumococcus and in distinguishing it from closely related species. We used MLST to characterize a set of 121 nonserotypeable presumptive pneumococci from Finland and compared the sequences of a fragment of the ply gene with those of authentic serotypeable pneumococci. The sequence data clearly identify the nontypeable isolates as members of either the pneumococcal population or a clearly resolved and more diverse related population. We propose that these latter isolates should be recognized as distinct from pneumococci and demonstrate the utility of the sequence of a fragment of the pneumolysin gene for distinguishing nonserotypeable pneumococci from this related nonpneumococcal population.
MATERIALS AND METHODS
Bacterial isolates.
A reference set of 39 serotypeable pneumococci was chosen to define the diversity found within the pneumococcus. To construct this data set, the entire MLST database (
http://spneumoniae.mlst.net/ ) was divided into nonoverlapping groups of related isolates using the program eBURST (
http://eburst.mlst.net/ ) (
7), with the default setting for the group definition (six out of seven shared loci). The reference set includes examples of the founding genotypes of the major clonal complexes identified by eBURST and also isolates of some of the major internationally disseminated antibiotic-resistant clones.
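The grouping rule used to partition the database can be illustrated with a minimal sketch. It assumes that an allelic profile is a tuple of seven allele identifiers and that two sequence types are linked whenever they share at least six loci, with groups formed by following such links (single linkage). The function names and the toy profiles are hypothetical and are not taken from eBURST itself.

```python
from itertools import combinations


def shared_loci(profile_a, profile_b):
    """Count the loci at which two allelic profiles carry the same allele."""
    return sum(x == y for x, y in zip(profile_a, profile_b))


def group_profiles(profiles, min_shared=6):
    """Partition sequence types into nonoverlapping groups.

    profiles: dict mapping an ST label to a tuple of allele identifiers.
    Two STs are linked when they share at least min_shared loci; groups
    are the connected components of these links (single linkage).
    """
    parent = {st: st for st in profiles}

    def find(st):
        # Follow parent pointers to the group representative,
        # shortening the path as we go.
        while parent[st] != st:
            parent[st] = parent[parent[st]]
            st = parent[st]
        return st

    for a, b in combinations(profiles, 2):
        if shared_loci(profiles[a], profiles[b]) >= min_shared:
            parent[find(a)] = find(b)

    groups = {}
    for st in profiles:
        groups.setdefault(find(st), set()).add(st)
    return sorted(groups.values(), key=lambda g: sorted(g))


# Hypothetical toy profiles (seven loci each).
toy_profiles = {
    "ST_A": (1, 1, 1, 1, 1, 1, 1),
    "ST_B": (1, 1, 1, 1, 1, 1, 2),  # single-locus variant of ST_A
    "ST_C": (1, 1, 1, 1, 1, 3, 2),  # single-locus variant of ST_B
    "ST_D": (9, 9, 9, 9, 9, 9, 9),  # unrelated
}
toy_groups = group_profiles(toy_profiles)
```

In this toy example ST_A and ST_C share only five loci, yet they fall into the same group because each is linked to ST_B, which is why the resulting groups do not overlap.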
For comparison with the pneumococcal reference set described above, 121 isolates of nonserotypeable presumptive pneumococci were obtained from the Finnish otitis media studies, which investigated pneumococcal disease and carriage (
5,
10,
11). Details of these isolates are summarized in Table
1. Presumptive pneumococci were identified by colony morphology and sensitivity to optochin (6 μg; Biodisk PDM Diagnostic Disks, Sweden). All isolates discussed here were optochin sensitive in the first testing, but a few isolates were optochin resistant in later testing, as indicated in Table
1. All isolates were, on first inspection, not serotypeable. In some cases, however, subsequent testing revealed reactions with omniserum, and in these cases, serotype was determined by the Quellung reaction. Isolates were obtained from either middle ear fluid (MEF) of children with acute otitis media (AOM) (
11) or nasopharyngeal (NP) swabs of healthy children or children with AOM (
20). Two isolates were excluded during analysis (IOPR 4609 and IOPR 3386), as they failed to yield good-quality sequence at any MLST locus despite repeated attempts.
DNA isolation and sequencing.
Genomic DNA was isolated using QIAamp DNA Mini kits (QIAGEN). Internal fragments of MLST loci were PCR amplified (
Taq polymerase and 10× buffer; QIAGEN) with 50 nM deoxynucleoside triphosphates (Geneamp; Applied Biosystems, Foster City, Calif.) using the primers and PCR conditions described previously (
4). The PCR products were precipitated with 20% polyethylene glycol 8000-2.5 M NaCl (Sigma, St. Louis, Mo.), and the fragments were sequenced on both strands using the same primers and Big Dye II terminators (Applied Biosystems). The products of the sequencing reactions were precipitated with 185 mM sodium acetate in 70% ethanol and were resuspended in 10 μl HiDi formamide (Applied Biosystems) and loaded onto an ABI Prism 3700 sequencer. Sequences were analyzed using STARS (obtainable from
http://www.mlst.net ), a modified Staden interface developed by Man-Suen Chan for use with MLST projects. Alleles were assigned by comparing the sequences to those in the pneumococcal MLST database, and those already present were assigned the relevant allele number. For those not found in the pneumococcal database, each unique allele from nontypeable isolates was initially given an alphabetic identifier to distinguish it from those alleles found in serotypeable pneumococci. All new alleles were sequenced at least twice on each DNA strand. The pneumolysin gene fragment was amplified and sequenced on both strands using the primers Ia and Ib as described previously by Toikka et al. (
21) and the same conditions for PCR and sequencing used for the MLST loci. Sequences were trimmed to 282 bp in length, and each distinct sequence was assigned as a different
ply allele. All disagreements between strands were resolved by manual inspection of trace files, and all
ply alleles were verified by sequencing at least twice on each DNA strand.
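The allele assignment convention described above can be sketched as follows. This is an illustration only: it assumes the database for one locus is available as a mapping from sequence to integer allele number, and the function names, the toy sequences, and the order in which alphabetic identifiers are issued are hypothetical.

```python
import itertools
from string import ascii_uppercase


def alphabetic_labels():
    """Yield A, B, ..., Z, AA, AB, ... as provisional allele identifiers."""
    for length in itertools.count(1):
        for letters in itertools.product(ascii_uppercase, repeat=length):
            yield "".join(letters)


def assign_alleles(sequences, database):
    """Assign allele identifiers to trimmed sequences at a single locus.

    database: dict mapping a known sequence to its integer allele number.
    A sequence already in the database keeps its allele number; a sequence
    that is absent receives a provisional alphabetic identifier, and
    identical novel sequences share the same identifier.
    """
    labels = alphabetic_labels()
    provisional = {}
    assigned = []
    for seq in sequences:
        seq = seq.upper()
        if seq in database:
            assigned.append(database[seq])
        else:
            if seq not in provisional:
                provisional[seq] = next(labels)
            assigned.append(provisional[seq])
    return assigned, provisional


# Hypothetical toy data; real fragments are several hundred base pairs long.
toy_database = {"ACGT": 1, "ACGA": 2}
toy_assigned, toy_novel = assign_alleles(
    ["ACGT", "TTTT", "acga", "TTTT", "GGGG"], toy_database
)
```

Keeping the provisional identifiers in a separate mapping mirrors the separation between numbered database alleles and alphabetic alleles, so that the two cannot be confused.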
Phylogenetics and population genetics.
To illustrate differences between individual alleles at the MLST and
ply loci, minimum evolution trees were constructed using all nucleotide differences and the Kimura 2 parameter distance correction in MEGA 2.1. The sequences of all loci except
ddl (see below) were concatenated, maintaining the +1 reading frame, and trees were constructed from the concatenated 2,751-bp sequence using MrBayes 3.0b4 (
16). A starting neighbor-joining tree was determined in PAUP*4.0beta v.10 (
http://paup.csit.fsu.edu/ ) (
19), with distances corrected using the HKY85 model. This was input as a starting tree into MrBayes, four Markov chain Monte Carlo chains were run with default heating parameters until convergence, and 10,000 trees were sampled from the posterior probability distribution. These were then used to produce a consensus tree. The choice of evolutionary model for MrBayes was made using MrModeltest 2.2 (
http://www.ebc.uu.se/systzoo/staff/nylander.html ) and corresponded to the general reversible model with rates of substitution being gamma distributed between sites, a proportion of which were invariant. Nucleotide diversities were estimated using DNAsp (
17). Other population genetic analyses were performed using Arlequin v2.0 (
http://lgb.unige.ch/arlequin/ ).
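For readers unfamiliar with the Kimura 2-parameter correction, the distance between two aligned sequences is d = -1/2 ln(1 - 2P - Q) - 1/4 ln(1 - 2Q), where P and Q are the proportions of sites showing transitions and transversions, respectively. The sketch below computes this for a single pair of sequences. It is a standalone illustration and not the MEGA implementation; the function name and the decision to skip sites with ambiguous bases are assumptions.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq1, seq2):
    """Kimura 2-parameter distance between two aligned sequences.

    Sites where either sequence has a character other than A, C, G, or T
    are skipped (an assumption of this sketch). math.log raises ValueError
    if the sequences are too divergent for the correction to be defined.
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")

    transitions = transversions = compared = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        compared += 1
        if a == b:
            continue
        pair = {a, b}
        if pair <= PURINES or pair <= PYRIMIDINES:
            transitions += 1  # A<->G or C<->T
        else:
            transversions += 1  # purine<->pyrimidine

    if compared == 0:
        raise ValueError("no comparable sites")

    p = transitions / compared
    q = transversions / compared
    return -0.5 * math.log(1 - 2 * p - q) - 0.25 * math.log(1 - 2 * q)
```

For very similar alleles the corrected distance is close to the raw proportion of differing sites; the correction matters more as divergence increases, because multiple substitutions at the same site become more likely.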
RESULTS
Clustering of multilocus genotypes defines a monophyletic group containing all serotypeable pneumococcal isolates.
The sequences of the seven MLST loci from the pneumococcal reference set, and 121 nontypeable isolates of uncertain status, were determined. The combinations of alleles (i.e., the allelic profiles) for all isolates studied in this work are shown in Table
1. Figure
1 shows a phylogenetic tree constructed from concatenated sequences of the alleles at all loci except
ddl (see below). The tree was constructed in MrBayes 3.0b4 using all nucleotide sites. All 39 strains of the pneumococcal reference set are descended from a single node and group with several of the nontypeable isolates, most of which were found to have sequence types (STs) already present in the MLST database. Support for this node was high (99%), and we designate isolates descended from it as group 1 in Fig.
1. A small, well-supported (99%) cluster is evident within this group (1b in Fig.
1). This includes the three most common STs found among the nontypeable isolates (STs 449, 448, and 344). Isolates of each of these STs have previously been submitted to the MLST database from other locations and were found to be nontypeable in all cases. The remaining two STs in this cluster were minor variants of these previously recognized STs.
The remainder of the nontypeable isolates form a highly diverse group distinct from group 1, referred to below as group 2. The mean genetic distance (computed using all nucleotide positions with DNAsp) within this group of divergent nontypeable isolates was greater (86.19 nucleotide differences) than that within group 1 (28.28 nucleotide differences). The mean distance between group 1 and the other isolates was 146.78 nucleotide differences. Wright's
FST (
24), the proportion of genetic variation distributed among subpopulations relative to the total population (computed using Arlequin v.2.0), was 0.61, indicating significant differentiation between the two groups.
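The quantities reported above can be related to one another with a short sketch. It computes mean pairwise nucleotide differences within and between groups of aligned sequences and then applies a simple ratio estimator, FST = 1 - (mean within-group distance) / (mean between-group distance), taking the unweighted average of the two within-group means. This estimator and the function names are assumptions made for illustration; Arlequin derives FST from an analysis of molecular variance, so agreement between the two should not be taken for granted.

```python
from itertools import combinations, product


def differences(seq_a, seq_b):
    """Number of aligned positions at which two sequences differ."""
    return sum(x != y for x, y in zip(seq_a, seq_b))


def mean_within(seqs):
    """Mean pairwise nucleotide differences within one group."""
    pairs = list(combinations(seqs, 2))
    return sum(differences(a, b) for a, b in pairs) / len(pairs)


def mean_between(group_1, group_2):
    """Mean pairwise nucleotide differences between two groups."""
    pairs = list(product(group_1, group_2))
    return sum(differences(a, b) for a, b in pairs) / len(pairs)


def fst_from_means(within_1, within_2, between):
    """Ratio estimator: 1 - (unweighted mean within) / between."""
    return 1 - ((within_1 + within_2) / 2) / between


# Applied to the mean distances reported in the text.
print(round(fst_from_means(28.28, 86.19, 146.78), 2))  # prints 0.61
```

With the reported means, the arithmetic is 1 - 57.235/146.78, or approximately 0.61, which matches the value obtained from Arlequin to two decimal places.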
Relationships of alleles in nontypeable isolates to those found in the MLST database.
The nontypeable isolates that clustered within group 1 were considered to be authentic pneumococci that failed to express a capsule, and any novel alleles in these isolates were assigned allele numbers and added to the pneumococcal MLST database. For those nontypeable isolates that did not fall into group 1, each unique sequence was given an alphabetic allele identifier and retained in a separate database to prevent confusion with the pneumococcal alleles in the MLST database. Minimum evolution trees were constructed for each locus from the sequences of all known pneumococcal alleles in the MLST database together with the alleles from all nontypeable isolates (Fig.
2). In the case of the
aroE locus, the sequences cluster into two groups that are highly divergent from one another but with relatively little diversity within each group. The mean diversity of
aroE alleles in the MLST database is 2.4 nucleotide differences, in comparison to a mean of 4.1 nucleotide differences within the
aroE alleles identified in group 2 strains (significantly greater [Student's
t test;
P = 0.006]). The same pattern holds for all other loci except
ddl: the alphabetic alleles from group 2 strains are significantly more diverse (Student's
t test;
P ≪ 0.05 for each) than the alleles from typical pneumococci, as is apparent from the trees shown in Fig.
2, and diverge by >5% from them. However, certain alleles present in serotypeable pneumococci in the MLST database clearly cluster with the alphabetic alleles (e.g.,
recP allele 26) and almost certainly represent instances of lateral transfer between the groups (presumably importation of alleles into pneumococci). That this transfer is bidirectional is evident from Table
1: alleles that are present, and indeed common, in authentic pneumococci are also found among the nontypeable isolates that cluster outside group 1.
The
ddl gene is located close to the penicillin-binding protein 2b (
pbp2b) gene, and penicillin-resistant pneumococci often have highly divergent
ddl alleles, since the DNA fragment carrying the
pbp2b gene that is imported from related species during the emergence of penicillin resistance frequently includes all or part of the flanking
ddl gene (
3). The tree derived from
ddl sequences of authentic pneumococci therefore shows a cluster of similar alleles and a spectrum of increasingly divergent alleles due to the importation of divergent alleles and the generation of mosaic alleles where the recombinational junction between imported divergent sequence and the resident pneumococcal sequence is within the
ddl gene. Unlike the other loci, the levels of diversity among
ddl alleles within the pneumococcal MLST database were not significantly different from those found in group 2 strains (mean of 25.4 nucleotide differences in comparison with 24.9 nucleotide differences [Student's
t test;
P > 0.05]). As the locus is known to be subject to hitchhiking (
3),
ddl was excluded from the concatenation procedure described above.
Pneumolysin gene sequence.
The
ply gene, once considered to be a defining property of the pneumococcus, has recently been demonstrated to be present in isolates of related species (
14,
23). We therefore sequenced a 282-bp fragment of the
ply locus from all nontypeable isolates and from the pneumococcal reference set. The
ply gene was found to be present in all but one isolate, which was highly divergent at the MLST loci (IOPR 2148 [Table
1]). Another isolate, IOPR 2640 (NT34), appeared to contain more than one
ply sequence; sequencing of the amplified fragment from several DNA preparations from purified single colonies gave a mixed sequence, although this was not observed with the MLST loci. Interestingly, some of these sequences were typical pneumococcal alleles, while others were divergent, suggesting that this strain has a history of interspecific recombination. Each unique
ply sequence was assigned as a different allele following the same conventions described above for MLST genes (i.e., integers for alleles of isolates in group 1 and alphabetic identifiers for the remainder). A minimum evolution tree showing the relationships between the
ply sequences is presented in Fig.
3, and the
ply alleles assigned to individual isolates are shown in Table
1. Alleles from the reference pneumococcal set again form a distinct cluster, and all of the nontypeable isolates that fall into group 1 in Fig.
1 had
ply sequences that either clustered with or were identical to the
ply alleles from the pneumococcal reference set. The other nontypeable isolates had
ply alleles that were distinct from those of the reference pneumococci and which clustered apart from them on the tree.
DISCUSSION
The identification of clearly defined groups of isolates that are similar to each other in genotype, but which are clearly distinguished from other related groups of similar genotypes and which may be assigned as species, is a central problem in microbiology (
18). The species concept is particularly problematic for bacteria, and it may be unrealistic to expect to have a single satisfying concept of species that can encompass bacteria that are almost totally asexual through to ones in which localized homologous recombinational replacements are very frequent and may be relatively promiscuous. In this paper, we have examined one of the more highly recombinogenic species, the pneumococcus, and have attempted to discriminate it from isolates that are closely related and which colonize the same niche, the human nasopharynx, and to explore the extent to which these very similar bacteria can be resolved into distinct populations.
Single genes are clearly unsatisfactory for exploring these issues, and we have therefore used a multilocus approach. We have previously applied this approach to the human
Neisseria species and found it to be capable of discriminating named species even in the presence of recombination (
9). Here, we consider the related but distinct question of whether we can define the boundaries of a named species and distinguish it from other related species which may not currently be designated as such. Nontypeable presumptive pneumococci provide a useful source of isolates that are closely related to pneumococci and which are known to include authentic pneumococci that for various reasons may not express a capsule, in addition to isolates that are genetically distinct. Trees based on the concatenated sequences of the MLST loci (excluding
ddl) clearly resolved the nontypeable presumptive pneumococci into two groups. Approximately 58% of the isolates clustered with serotypeable pneumococci, whereas the others were a separate and more diverse population. Based on these results, we propose that
S. pneumoniae may be defined as all isolates falling into group 1 on a tree based on the concatenated sequences of the six MLST loci. The pneumococcal MLST website now contains a function that allows users to test whether isolates sequenced in their laboratory should be considered true pneumococci under this definition (
1).
The presence of the pneumolysin gene in isolates closely related to pneumococci has been previously demonstrated (
14,
23) and precludes using the presence of this gene as a means of distinguishing pneumococci from their closest relatives. However, the
ply alleles in pneumococci were different from those in isolates that grouped outside the pneumococci in the tree constructed using concatenated MLST sequences. Precisely the same groups were obtained using the sequence of the pneumolysin gene fragment or the concatenated MLST loci. The
ply sequences therefore provide a further means of identifying true pneumococci in difficult cases, although rigorous assignment of a nontypeable isolate as a pneumococcus should involve examination of the clustering obtained with both the concatenated MLST loci and the
ply gene fragment.
Those isolates not identified as pneumococci by this approach form a far more diverse grouping than the pneumococci, as demonstrated by comparing the mean genetic distance within the two groups. This is even more striking given that group 1 contains strains from a reference set specifically chosen to illustrate the diversity of the pneumococcal population, whereas the atypical isolates reported here were retrieved in longitudinal studies of carriage and AOM within a limited geographic area and are therefore unlikely to represent the full diversity of this population. The relationship of these organisms to the recently proposed species
S. pseudopneumoniae (
2) remains to be determined. It should be noted that two of the strains falling outside group 1 (IOKOR 731 and IOKOR 1362) were isolated from MEF in children suffering from AOM, suggesting that organisms of this group may harbor pathogenic potential in some disease contexts.
The approach described here clearly delineates the boundaries of the pneumococcal cluster in sequence space and, thanks to its multilocus nature, is resistant to limited shuffling of genetic information across this boundary. Unlike previous attempts to define bacterial species using conceptually similar approaches (
22), we tested the ability of our method to discriminate between the pneumococcus as a population and a large group of very closely related isolates that are indistinguishable by other methods. We are also impressed by the ability of this approach to resolve the pneumococcus with confidence, even in the presence of relatively high levels of recombination. However, it is likely that the pneumococcus is an example of a “fuzzy species,” and further sampling of the fringes of the pneumococcal cluster may require us to update our definitions. This issue can only be resolved by further work. It should be noted that, because of recombination, these trees contain no useful information about the relationships within group 1 or within group 2; recombination is nevertheless insufficient to obscure the differences between the two groups. It remains to be seen if it is possible to define combinations of phenotypic characteristics shared by all members of group 1 that are not found in any members of the diverse group of related organisms that this work has shown to cluster apart from genuine pneumococci. The recombinogenic nature of these organisms may mean that attempts to do so are misguided.
This work raises several questions about the nature of the mechanism which generates and maintains these divisions. One possibility is effective reproductive isolation, in which strains mainly undergo recombination with isolates of the same named species. While interspecific recombination does occur, and renders a single-locus approach untenable, it is not common enough to prevent the emergence of those clusters in sequence space we refer to as species. In the case of clonal bacteria, the virtual absence of recombination will necessarily lead to clusters of related strains. What we do not know is what generates such effective isolation. Nor do we know the degree to which recombination must be limited in order to achieve effective isolation and consequent speciation. To resolve these issues, further studies that combine theoretical and molecular approaches are required.
Acknowledgments
This work was supported by a grant from the Wellcome Trust (grant number 030662). B.G.S. is a Wellcome Trust Principal Research Fellow. The Finnish otitis media studies were supported by Merck & Co., Aventis Pasteur, and Wyeth-Lederle Vaccines and Pediatrics.