Introduction
DNA barcoding for specimen identification works by comparing unknown samples to a database of sequences generated from identified material available on GenBank and the Barcode of Life Data System (BOLD;
Ratnasingham and Hebert 2007). BOLD is the product of collaboration between computer programmers, molecular biologists, and taxonomists (
Smith et al. 2005;
Hajibabaei et al. 2005;
Ratnasingham and Hebert 2007). Taxonomists provide the scientific context for sequence data used for specimen identification (
Goldstein and DeSalle 2011). The advantage of this enterprise from the taxonomists’ perspective is a potential wealth of molecular data that can be used to test hypotheses of species limits (
DeSalle et al. 2005). The combination of traditional taxonomic approaches with molecular methods can create a taxonomic feedback loop, which can lead to species discovery and better-resolved taxonomies (
Page et al. 2005). Inclusion of molecular data in taxonomic studies is one part of a broader integrative approach to the science sometimes referred to as integrative taxonomy (
Dayrat 2005).
Ratnasingham and Hebert (2013) recently developed the Barcode Index Number (BIN) system for categorizing DNA barcodes into operational taxonomic units (OTUs) in the absence of taxonomic information. BINs are assigned using Refined Single Linkage Analysis (RESL), an algorithm that does not use prior taxonomic knowledge (
Ratnasingham and Hebert 2013). RESL uses a 2.2% threshold of sequence divergence to delimit preliminary OTUs and then refines them using a graphical Markov clustering analysis (
Ratnasingham and Hebert 2013). A BIN can have four possible relationships for any species pair, which
Ratnasingham and Hebert (2013) refer to as match, split, merge, and mixture (their fig. 1). When traditional taxonomy and BINs are concordant, they are said to match. Splits occur when a single species is assigned multiple BINs. Merging occurs when a single BIN is assigned to two or more species, a situation long referred to informally as lumping. In the mixture scenario, two BINs are assigned to two species, but sequences of at least one species fall into both BINs. Merging and mixture of BINs may occur in cases of introgression or incomplete lineage sorting, or when a single species has inappropriately been given multiple names (
Rheindt et al. 2009). Other methods have been developed for assigning OTUs using sequence data (
Pons et al. 2006), including the Automatic Barcode Gap Discovery (ABGD) method (
Puillandre et al. 2012). ABGD is similar to RESL in that it defines preliminary OTUs by inferring a barcode gap and then refines partitions recursively, allowing for different barcode gaps across the dataset. ABGD has been used in several barcoding studies, typically using default or slightly modified settings (
Hendrixson et al. 2013;
Kekkonen and Hebert 2014).
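The four relationships described above can be illustrated with a minimal sketch. It assumes each barcoded specimen is represented as a (species, BIN) pair and applies a simplified per-species reading of the match, split, merge, and mixture categories. The function name, species labels, and BIN identifiers are hypothetical and are not taken from BOLD.

```python
from collections import defaultdict

def classify_concordance(records):
    """Label each species as match, split, merge, or mixture.

    records: iterable of (species, bin_id) pairs, one per barcoded specimen.
    This is a simplified reading of the four categories, not BOLD's own code.
    """
    bins_of = defaultdict(set)     # species -> BINs containing its specimens
    species_in = defaultdict(set)  # BIN -> species found inside it
    for species, bin_id in records:
        bins_of[species].add(bin_id)
        species_in[bin_id].add(species)

    result = {}
    for species, bins in bins_of.items():
        # Other species that share at least one BIN with this species.
        partners = set().union(*(species_in[b] for b in bins)) - {species}
        if not partners:
            result[species] = "match" if len(bins) == 1 else "split"
        else:
            # Mixture: BINs are shared AND at least one species involved
            # has sequences in more than one BIN.
            spans = len(bins) > 1 or any(len(bins_of[p]) > 1 for p in partners)
            result[species] = "mixture" if spans else "merge"
    return result

# Hypothetical specimens and BIN identifiers, for illustration only.
example = [
    ("sp_A", "BIN1"), ("sp_A", "BIN1"),  # one species, one private BIN
    ("sp_B", "BIN2"), ("sp_B", "BIN3"),  # one species, two private BINs
    ("sp_C", "BIN4"), ("sp_D", "BIN4"),  # two species, one shared BIN
    ("sp_E", "BIN5"), ("sp_E", "BIN6"), ("sp_F", "BIN6"),  # shared and spanning
]
print(classify_concordance(example))
```

In this toy data set, sp_A is a match, sp_B is split, sp_C and sp_D are merged, and sp_E and sp_F form a mixture.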
I examined DNA barcode data generated for a group of taxonomically challenging bees in North America using BINs and barcode gaps. Sweat bees (Hymenoptera: Halictidae) have been called morphologically monotonous (
Michener 1974) and the despair of taxonomists (
Wheeler 1928), and the large subgenus
Lasioglossum (
Dialictus) is notoriously the most difficult to identify to species.
Lasioglossum (
Dialictus) are extremely abundant in surveys of North American bees (
MacKay and Knerer 1979;
Grixti and Packer 2006;
Campbell et al. 2007;
Droege et al. 2010;
Ngo et al. 2013), making identification tools crucial for studies of bee diversity. Thousands of
L. (
Dialictus) are collected each year in North America, but taxonomic keys have only become available recently for a subset of species based on geographic regions (
Gibbs 2009b,
2010a,
2011). Consequently, many studies have been published for which
L. (
Dialictus) specimens are unidentified or misidentified (
Kalhorn et al. 2003;
Giles and Ascher 2006;
Grixti and Packer 2006;
Kearns and Oliveras 2009;
B.A. Smith et al. 2012;
Wheelock and O’Neal 2016). Only a very small pool of taxonomists is available to provide reliable identifications for this group, and these are limited primarily to species included in recent revisions (
Gibbs 2010a,
2011;
Gibbs et al. 2013). DNA barcodes have proven useful for facilitating species discovery within small species groups of
L. (
Dialictus) in the past (
Gibbs 2009a,
2009b), and they have the potential to aid in the identification of difficult taxa for which taxonomic expertise is limited (
Hebert et al. 2003a;
Packer et al. 2009).
Lasioglossum (
Dialictus) has a nearly cosmopolitan distribution that includes almost all of the Nearctic, Neotropical, Palaearctic, and Afrotropical regions. The genus as a whole has a relatively recent origin (31 Mya, 95% CI 24–48 Mya) (
Gibbs et al. 2012) and has diversified rapidly into more than 1800 described species (
Gibbs et al. 2012;
Ascher and Pickering 2016). The genus has not been revised for many parts of the world (
Michener 2007), so the total species richness may be much higher. The recent and rapid diversification of
Lasioglossum is likely to blame for taxonomic challenges associated with the genus (
Gibbs et al. 2012;
Groom et al. 2013). In addition, sexual dimorphism and caste variation within
Lasioglossum have led to taxonomic errors in the past (
Knerer and Atwood 1964). It is unclear why
Lasioglossum has diversified so rapidly, but their generalist nature seems to allow them to thrive in varied conditions. The small body size of
L. (
Dialictus) may also allow for geographic isolation of populations and subsequent speciation.
I restricted the study of
L. (
Dialictus) DNA barcoding to the species occurring in Canada and the United States for two reasons. There have been recent revisions of
L. (
Dialictus) in these two areas (
Gibbs 2010a,
2011) and DNA barcode data for these species are most complete. The species in these regions as currently defined were based on an integrative approach using an evolutionary species concept. Data used in delimiting species included morphology, DNA barcodes, ecology, and geography (
Gibbs 2009a,
2010a,
2011). These data were used in combination to corroborate taxonomic hypotheses (
DeSalle et al. 2005). In cases where corroboration failed, individuals or populations were considered conspecific. Species were not defined by a single character where polymorphism could not be distinguished from species level differences. Some highly variable species were considered tentative species complexes in need of additional study (
Gibbs 2010a).
Given that taxonomic information is lacking globally, including western North America, I explore the effectiveness of molecular methods for identifying specimens and inferring species boundaries in this group. Both RESL and ABGD have been reported to be relatively successful for delimiting species boundaries using public data sets (
Puillandre et al. 2012;
Ratnasingham and Hebert 2013;
Kekkonen and Hebert 2014), but these have often been across broader taxon groups where the taxonomy overall is less problematic. I focus attention on whether current DNA barcode data are sufficient for identification and delimitation of
L. (
Dialictus) specimens based upon the best close match method (
Virgilio et al. 2012) and the BIN system. I also address remaining taxonomic issues in the group and the use of DNA barcodes for species delimitation in a difficult taxon.
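As an illustration of the best close match idea as it is commonly described, the sketch below assigns a query barcode to the species of its nearest reference sequence, reports "ambiguous" when equally close references belong to different species, and reports "no identification" when the nearest reference exceeds a distance threshold. The use of uncorrected p-distances, the function names, the species labels, and the threshold value are all assumptions made for this example rather than the settings used in this study.

```python
def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def best_close_match(query, references, threshold):
    """Identify a query barcode against a reference library.

    references: list of (species, aligned_sequence) pairs.
    threshold: maximum distance accepted for an identification
               (a user-chosen value; none is implied by the text).
    """
    scored = [(p_distance(query, seq), species) for species, seq in references]
    best = min(d for d, _ in scored)
    if best > threshold:
        return "no identification"
    # Species represented among the equally nearest references.
    nearest = {species for d, species in scored if d == best}
    return nearest.pop() if len(nearest) == 1 else "ambiguous"

# Hypothetical 10-bp "barcodes", for illustration only.
library = [
    ("sp_alpha", "AAAAAAAAAA"),
    ("sp_beta", "AAAAACCCCC"),
    ("sp_gamma", "AAAAACCCCC"),  # shares a haplotype with sp_beta
]
print(best_close_match("AAAAAAAAAC", library, threshold=0.2))
print(best_close_match("AAAAACCCCC", library, threshold=0.2))
print(best_close_match("GGGGGGGGGG", library, threshold=0.2))
```

The second query shows why species that share an identical haplotype cannot be separated by distance alone: both references are equally good matches, so the result is ambiguous.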
Discussion
The value of DNA barcodes in an integrative taxonomic approach is difficult to quantify retroactively. Species delimitation is a process that occurs over time as taxonomists familiarize themselves with a taxon and gather evidence from multiple sources. In the context of
L. (
Dialictus), the availability of an independent data set was invaluable even in cases where RESL and ABGD were deemed unsuccessful. As an illustration, I describe the situation of
L. ephialtum Gibbs, which has a specific epithet that literally means nightmare in Greek.
Lasioglossum ephialtum was not described until 2010, but it is a relatively common species, particularly in urban areas (
Gibbs 2010a). The species was often misidentified as
L. versatum (or the junior synonym
rohweri Ellis) by earlier workers leading to confusion over the diagnostic characters for
L. versatum. Although DNA barcodes of
L. ephialtum are not easily distinguished from those of members of the
L. viridatum species-group, they are clearly distinct from
L. versatum, which helped to clarify the latter species’ limits. Another advantage of incorporating DNA barcodes into these revisions was the ability to associate dimorphic sexes and female castes (
Gibbs 2010a). Many
Lasioglossum are social with queens and workers, which can be so different that they have been described as distinct species (
Knerer and Atwood 1964), and nearly two thirds of North American
L. (
Dialictus) prior to recent revisionary studies were described from a single sex (
Moure and Hurd 1987).
DNA barcodes may also benefit taxonomy by revealing clear variation within morphologically distinctive species, variation that has often gone unrecognized because of insufficient study. Species with distinct characters in otherwise problematic taxa may not be scrutinized as closely and may be incorrectly lumped (
Packer and Taylor 1997). DNA barcode sequence variation can highlight the variation in a species complex, leading to more careful taxonomic study. Examples in
L. (
Dialictus) include the
L. tegulare and
L. petrellum (Cockerell) species-groups (
Gibbs 2009a,
2009b). When DNA barcodes suggest splitting species it could be evidence of cryptic species, but it might also represent highly divergent barcodes within a single species. Heteroplasmy (i.e., the presence of multiple haplotypes within an individual) has been shown to occur in the bee genus
Hylaeus (
Magnacca and Brown 2010). Heteroplasmic individuals may still be identified using DNA barcodes, but the reference database must include multiple divergent haplotypes for each species. The best example of cryptic species among bees is in the
Halictus ligatus Say/
Halictus poeyi Lepeletier species pair (
Packer et al. 2016). Originally separated based upon extensive allozyme data (
Carman and Packer 1996), these two species were subsequently distinguished using both nuclear and mitochondrial DNA sequences including DNA barcodes (
Danforth et al. 1998,
1999). The two are sympatric in a narrow area with almost no heterozygosity at the allozyme loci (never more than one of seven loci out of hundreds of individuals from sympatric populations sampled) (
Packer 1999). No morphological differences have been discovered despite intensive investigation, although there are slight phenological differences in sympatry (
Dunn et al. 1998). DNA barcodes have recently recovered a third Neotropical species in this complex, presumably
H. townsendi Cockerell based on geography (
Packer et al. 2016).
Lasioglossum (
Dialictus) is taxonomically challenging and it is likely that species-level diversity has not been completely described even within areas that have received detailed study (
Gibbs 2010b,
2011).
Lasioglossum is the most species rich of all bee genera, but few revisionary studies exist. This group of small bees has diversified rapidly to occupy virtually every terrestrial habitat on the planet (
Michener 1979,
2007). Although broadly generalist and highly adaptable,
L. (
Dialictus) are small-bodied species that are likely limited in their dispersal ability. As a result, this subgenus seems able to successfully establish and persist in novel and often marginal habitats, but is also susceptible to physical barriers to gene flow. Species-rich taxa with recent diversification are not expected to perform well with single locus automated species delimitation methods (
Puillandre et al. 2012).
The highest level of intraspecific divergence in this study, 6.9%, is seen in
L. ruidosense, putatively a single species with an enormous north–south distribution, ranging from southern New Mexico to Alaska (
Gibbs 2010a). Southern populations of
L. ruidosense are primarily limited to high elevations, which have the potential to isolate populations leading to allopatric speciation (
Coyne and Orr 2004) as seems to be the case with another species,
L. boreale Svensson, Ebmer and Sakagami, with a similar distribution in North America (
Packer and Taylor 2002). Allozyme data from
L. boreale showed little genetic variation from individuals spanning large geographical distances, but high elevation populations in the southwestern United States were found to have fixed differences. Additional study is required to determine and describe the extent of diversity in this apparent complex. In the
L. ruidosense case, the incongruence between current taxonomy and DNA barcoding is likely caused by insufficient taxonomic study. High intraspecific genetic divergence is also present in
L. cressonii (Robertson) and
L. sagax (Sandhouse). Neither of these two species has a distinct geographical or morphological pattern correlated with sequence divergence. In fact, highly divergent sequences have been found from a single locality during the same collecting event for both species (
Gibbs 2010a). Substitutions between haplotypes are strongly biased toward the 3rd codon position, which does not suggest amplification of a nuclear pseudogene, although a recently derived pseudogene remains possible. A more detailed examination of these species that includes morphology and nuclear DNA is required. In the meantime, they can at most be considered unconfirmed candidate species (
Padial et al. 2010).
Lasioglossum cressonii is a common, distinctive species and similar cases in the past have turned out to be cryptic species (
Danforth et al. 1998;
Gibbs 2009a).
Lasioglossum sagax is a less well-known species that until recently was only known from the holotype (
Wolf and Ascher 2009;
Gibbs 2010a). Future work that includes multiple loci will be productive for delimiting cryptic species where morphological characters are lacking. It should be noted that DNA barcoding may overestimate the number of species (
Dasmahapatra et al. 2010), so the deep divergences within species like
L. cressonii may not be true evidence of speciation.
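The codon-position check mentioned above for the divergent haplotypes can be sketched as follows. In a functional protein-coding gene, substitutions tend to concentrate at third codon positions, so tallying differences by position is a quick first screen. The function name, the toy sequences, and the `frame_offset` parameter are assumptions made for this example. The reading frame of a real barcode fragment must be determined from the alignment, and the tally alone does not rule out a recently derived pseudogene.

```python
def substitutions_by_codon_position(hap_a, hap_b, frame_offset=0):
    """Count differences between two aligned haplotypes by codon position.

    frame_offset: number of leading bases before the first complete codon
                  (an assumption supplied by the user, not inferred here).
    Returns a dict {1: n, 2: n, 3: n}.
    """
    if len(hap_a) != len(hap_b):
        raise ValueError("haplotypes must be aligned to the same length")
    counts = {1: 0, 2: 0, 3: 0}
    for i, (x, y) in enumerate(zip(hap_a, hap_b)):
        if i < frame_offset:
            continue  # skip bases that precede the first full codon
        if x != y:
            counts[(i - frame_offset) % 3 + 1] += 1
    return counts

# Hypothetical in-frame fragments, for illustration only.
print(substitutions_by_codon_position("ATGGCATTA", "ATGGCGTTG"))
```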
Mitochondrial DNA has proven capable of identifying specimens and clarifying species boundaries in other taxonomically challenging bees (
Danforth et al. 1998;
Murray et al. 2007;
Bertsch 2009;
Sheffield et al. 2009;
Gibbs 2009b;
Magnacca and Brown 2010;
Rehan and Sheffield 2011). However, DNA barcoding is only moderately successful for specimen identification in
L. (
Dialictus) using either DNA barcode gaps or BINs. The mean maximum intraspecific value of 1.69% reported here is higher than mean intraspecific values reported for some taxa, including bats (
Clare et al. 2007), birds (
Hebert et al. 2004b), marine fish (
Ward et al. 2005), Lepidoptera (
Hajibabaei et al. 2006a), bumble bees (
Bertsch 2009), and a general bee fauna (
Sheffield et al. 2009) but does correspond to some previous studies of insect taxa, such as mayflies (
Alexander et al. 2009) and black flies (
Rivera and Currie 2009). Direct comparisons are not always reliable because of differences in sampling effort and because some studies report mean intraspecific divergences (e.g.,
Barrett and Hebert 2005) rather than the more relevant value, the mean maximum intraspecific divergence used here (
Collins and Cruickshank 2012). Minimum distances between species pairs were on average four times the mean intraspecific genetic divergence but less than double the mean maximum intraspecific divergence. This gap between inter- and intraspecific genetic divergences is smaller than that of most barcoding studies and falls well below the 10-fold difference suggested in an earlier study of birds (
Hebert et al. 2004b). Importantly, intra- and interspecific divergence varies among taxa and in recently diverged groups any simple criterion for delimiting species will have flaws.
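The distinction between mean and maximum intraspecific divergence, and its comparison with the minimum interspecific distance, can be made concrete with a short sketch. It assumes aligned sequences of equal length and uses uncorrected p-distances for simplicity. The distance model, function names, and toy data are assumptions for illustration and are not the values or settings of this study.

```python
from itertools import combinations
from statistics import mean

def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def barcode_gap_summary(seqs_by_species):
    """Per-species mean and maximum intraspecific distance and the
    minimum distance to any sequence of another species.

    seqs_by_species: dict mapping species name -> list of aligned sequences.
    """
    summary = {}
    for sp, seqs in seqs_by_species.items():
        intra = [p_distance(a, b) for a, b in combinations(seqs, 2)]
        inter = [p_distance(a, b)
                 for other, other_seqs in seqs_by_species.items() if other != sp
                 for a in seqs for b in other_seqs]
        summary[sp] = {
            "mean_intra": mean(intra) if intra else None,
            "max_intra": max(intra) if intra else None,
            "min_inter": min(inter) if inter else None,
        }
    return summary

# Hypothetical 10-bp sequences, for illustration only.
toy = {
    "sp_1": ["AAAAAAAAAA", "AAAAAAAAAC", "AAAAAAAACC"],
    "sp_2": ["AAAAAAACCC", "AAAAAAACCC"],
}
stats = barcode_gap_summary(toy)
for sp, s in stats.items():
    has_gap = s["max_intra"] is not None and s["min_inter"] > s["max_intra"]
    print(sp, s, "barcode gap" if has_gap else "no barcode gap")
```

In the toy data, sp_1 has a mean intraspecific distance below its minimum interspecific distance, yet its maximum intraspecific distance exceeds it. The mean would suggest a gap where the maximum shows none, which is why the maximum is the more relevant value.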
Cases where DNA barcodes do not differ among species help to emphasize the importance of traditional taxonomic practices. In another group of bees, DNA barcodes do not distinguish morphologically similar but ecologically distinct species in the
Colletes succinctus (L.) group, although differences in nuclear sequences were present (
Kuhlmann et al. 2007). The
L. viridatum species-group defined by
Gibbs (2010a) comprises many of the most taxonomically challenging
L. (
Dialictus) species in Canada and the eastern United States. Members of this complex are often distinguishable based on multiple lines of evidence including ecological differences (e.g.,
L. georgeickworti Gibbs is the only member restricted to coastal areas of the northeastern United States and
L. subviridatum nests in logs unlike the ground nests of most other species;
Gibbs 2011). Unfortunately, few members of the
L. viridatum group have had their nests discovered, and none has had its nesting biology described in detail. The nesting biology of
L. viridatum described by
Atwood (1933) is unreliable (
Zarrillo et al. 2016), and may pertain to
L. laevissimum. DNA barcodes of species in the
L. viridatum group are among the most difficult to interpret, including a surprising 19 species lumped into a single BIN. It is tempting to consider this as a case of traditional taxonomy over-splitting species. However, this BIN includes morphologically distinct, geographically separated populations that occupy distinct ecological niches (
Gibbs 2010a,
2011). It is possible that the species complex in its entirety has not been accurately delimited, but multiple lines of evidence suggest that it is not a single species (
Gibbs 2010a,
2011). The absence of clear DNA barcode clusters is presumably due to incomplete lineage sorting or introgression in a species complex that has undergone recent and rapid diversification. A neighbour-joining analysis suggested that 27% of species in this study had non-monophyletic DNA barcode clusters. Neighbour-joining is not a preferred method of phylogenetic analysis, being generally outperformed by other methods that are better able to account for evolutionary rate variation (
Huelsenbeck 1995;
Felsenstein 2004). However, neighbour-joining was assessed here because it is commonly used to examine DNA barcode sequences and is the standard tree building algorithm used in BOLD, due to its speed and relatively good performance when sequences have recently diverged (
Holder and Lewis 2003). The high level of paraphyly found here for
Lasioglossum should be taken into account when interpreting results from neighbour-joining trees as part of future DNA barcoding efforts.
Although DNA barcodes may be insufficient for species identification using RESL or ABGD in some cases, fixed substitutions can be informative in an integrative taxonomic approach.
Lasioglossum (
Dialictus) is a recently derived subgenus that has diversified into many hundreds of species (
Gibbs et al. 2012;
Ascher and Pickering 2016). Closely related species that have recently separated will have evolved fewer neutral mutations than other species. Morphological characters, if under selection, can evolve rapidly (
Owen and Harder 1995;
Thompson 1998) and could result in clearly defined species with little genetic divergence in mitochondrial haplotypes. In addition, the possibility of introgression is presumably higher in closely related species and could confound specimen identification and species discovery using DNA barcodes by reducing interspecific variation (
Pentinsaari et al. 2014).
Lasioglossum hitchensi and
L. weemsi are closely related, but distinguishable based on hair patterns of the metasoma; however, a single individual of
L. hitchensi was found to have a DNA barcode consistent with
L. weemsi. This example is a candidate for introgression between closely related species.
In some cases, character-based methods of identification may still allow correct determination of species where distance-based methods are misleading (
Burns et al. 2007). Previous results from species of
Lasioglossum suggest that unique fixed substitutions or unique polymorphism patterns may be sufficient to distinguish species in the absence of a clear DNA barcode gap (
Gibbs 2009a,
2009b;
Gibbs et al. 2013). Similar methods have also been used in the
Bombus lucorum (L.) species complex of bumble bees (
Murray et al. 2007;
Bertsch 2009) and Hawaiian
Hylaeus (
Magnacca and Brown 2010). All of the species recognized in the taxonomic revisions are diagnosable using morphological characters (
Gibbs 2010b,
2011). If voucher specimens are retained, then additional morphological or geographic characters could be used in cases where DNA barcodes fail entirely. For example, specimens with a DNA barcode sequence matching that of
L. versatum and
L. callidum, which share an identical DNA barcode haplotype, can be differentiated by examining morphological characters, including the shape of the female mandible and protrochanter and male facial pubescence (
Gibbs 2010a,
2011). Although DNA barcodes do not distinguish
L. tenax (Sandhouse) and
L. cattellae (Ellis), these two species can be distinguished by geographical distribution alone, in addition to aspects of their microsculpture, colour, and pubescence (
Gibbs 2010a).
Lasioglossum tenax is an alpine/boreal species restricted to Canada and the Rocky Mountains, while
L. cattellae is largely an open grassland species in the mid-Atlantic and Midwestern states of the United States. The BIN system on BOLD allows for expert opinion to annotate BINs believed to include multiple species (
Ratnasingham and Hebert 2013). Until BINs are fully annotated and taxonomic revisions are complete, users would be wise to treat BINs with care and use alternative data, including morphology and geographical distribution, to make species identifications using DNA barcodes.
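A character-based comparison of the kind discussed above can be sketched by scanning an alignment for fixed diagnostic sites, that is, positions where every sequence of one species carries one base and every sequence of the other species carries a different base. The function name and toy sequences are hypothetical, and the sketch ignores gaps and ambiguity codes, which a real analysis would need to handle.

```python
def fixed_diagnostic_sites(seqs_a, seqs_b):
    """Find sites fixed within each group but different between groups.

    seqs_a, seqs_b: lists of aligned sequences of equal length.
    Returns a list of (position, base_in_a, base_in_b), positions 1-based.
    """
    sites = []
    columns = zip(zip(*seqs_a), zip(*seqs_b))  # pair up alignment columns
    for pos, (col_a, col_b) in enumerate(columns, start=1):
        bases_a, bases_b = set(col_a), set(col_b)
        # Fixed within each group (one base each) and not the same base.
        if len(bases_a) == 1 and len(bases_b) == 1 and bases_a != bases_b:
            sites.append((pos, col_a[0], col_b[0]))
    return sites

# Hypothetical 6-bp sequences, for illustration only.
group_a = ["ACGTAC", "ACGTAT"]
group_b = ["ACCTAC", "ACCTAA"]
print(fixed_diagnostic_sites(group_a, group_b))
```

Here position 3 is diagnostic (G in one group, C in the other), whereas position 6 is polymorphic within both groups and is therefore not reported. An empty result would correspond to species pairs that share haplotypes, for which morphology or geography must be used instead.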
In conclusion, DNA barcodes in
L. (
Dialictus) lack a well-defined barcode gap that can be used for delimiting species, and RESL and ABGD worked unambiguously in only about half the cases. Nevertheless, DNA barcodes are sufficient for identifying specimens of many species in this taxonomically challenging group even with simple distance-based methods. Character-based identifications have the potential for a greater level of success in identifying specimens with low interspecific genetic divergence. Unique combinations of nucleotide substitutions may allow specimen identification even when interspecific divergences are low and fixed nucleotide substitutions are absent (
Gibbs et al. 2013). Although some species pairs cannot be identified using DNA barcodes, the sequences are nevertheless informative for reducing the list of possible species names. Simple annotations to the BOLD system (e.g.,
viridatum complex) could flag such species for additional study or sequencing of additional loci. For this reason, I reiterate the need for morphology-based taxonomy and integrative approaches using multiple loci as further tests of species identity. DNA barcodes, even with a weak barcode gap, are nevertheless a useful taxonomic tool for bees when used in conjunction with morphology, behaviour, and other data.