Introduction
The correct identification and delimitation of taxa are crucial to applied surveys, in addition to ecological, taxonomic, systematic, and evolutionary studies (
Hey 2009). Rapid advances in molecular biology have helped to resolve many biological problems related to organism identification. The use of DNA sequences for specimen identification started in the 1980s (e.g.,
Kloos and Wolfshohl 1982;
Rollinson et al. 1986;
Gale and Crampton 1987), stimulating scientists' interest in improving molecular practices, theories, and analytical methods. DNA barcoding was formalised at the beginning of this millennium, promising precise identification of specimens through a DNA sequence fragment from a standardised region of the genome (
Hebert et al. 2003). For animals, the primary DNA barcode is a 658 base pair (bp) region of the mitochondrial gene cytochrome
c oxidase subunit I (
cox1) (
Ratnasingham and Hebert 2007).
A recent review of interpretations and trends in DNA barcoding shows a constant rise in studies using this tool to solve different biological problems (e.g., species delimitation, species discovery, specimen identification) within distinct disciplines (
DeSalle and Goldstein 2019). For proper use of DNA barcoding, each evolutionary lineage demands careful preliminary analyses. In the absence of prior information about a specific taxon, researchers set threshold values a priori. Many studies use an arbitrary value between 2% and 3% for species-level divergence, depending on the taxonomic group (
Hebert et al. 2003;
Abdo and Golding 2007;
Clare et al. 2007). On the other hand, the Barcode of Life Data System (BOLD) initially assigns a 1% distance threshold, leading to the recognition of a higher number of operational taxonomic units (
Ratnasingham and Hebert 2007).
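To make the effect of an a priori threshold concrete, the following minimal Python sketch computes an uncorrected p-distance between two aligned sequences and applies the cut-offs mentioned above. The function names and the toy sequences are illustrative assumptions, not part of any barcoding software, and the uncorrected p-distance is used only for simplicity; barcoding studies often rely on corrected distances instead.

```python
def p_distance(seq_a: str, seq_b: str) -> float:
    """Uncorrected p-distance: proportion of differing sites between two aligned sequences.

    Sites with a gap ('-') or an 'N' in either sequence are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in "-N" or b in "-N":
            continue
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differences / compared


def same_species(distance: float, threshold: float) -> bool:
    """Hypothetical fixed-threshold rule: conspecific if the distance is below the threshold."""
    return distance < threshold


# Toy aligned fragments of 50 sites (invented for illustration, not real cox1 data).
seq_1 = "ATGC" * 12 + "AT"
seq_2 = "ATGC" * 12 + "AA"  # differs from seq_1 at the last site only

d = p_distance(seq_1, seq_2)  # 1 difference over 50 sites
for threshold in (0.01, 0.02, 0.03):
    print(f"threshold {threshold:.2f}: conspecific = {same_species(d, threshold)}")
```

In this toy case the same pair of sequences is treated as heterospecific under the lower cut-offs and as conspecific under the 3% cut-off, which illustrates why the choice of threshold changes the number of operational taxonomic units recognised.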
Contrary to the early premises of DNA barcoding, there is no a priori reason to assume a universal fixed threshold value to separate conspecific from heterospecific taxa. Since coalescent depths among species vary intrinsically for each lineage (
Fujita et al. 2012), a fixed threshold for all organisms would generate false-positive and false-negative errors, depending on the pooled species (
Goldstein et al. 2000). A shortcoming of distance-based methods is the lack of objective criteria to delineate lineages (
DeSalle et al. 2005), and only the accumulation of data, their compilation in digital libraries, and their analytical interpretation can improve the detection and optimisation of an empirical threshold value for a specific taxon, preferably for lineages close to the species level (e.g., genus level).
In addition to specimen identification, DNA barcoding may assist in species discovery, through the search for a “barcode gap” (
Meyer and Paulay 2005), defined by the interval between the highest intraspecific distances and the lowest interspecific distances (
DeSalle and Goldstein 2019). A threshold for species delimitation may then be established for the target taxonomic rank (
Hebert et al. 2003;
Meyer and Paulay 2005). Initially,
Hebert et al. (2004) proposed a standard threshold for animals: 10 times the mean intraspecific variation for the group under study, the “10-fold rule”. Other values have been refined for particular taxa (e.g.,
Meyer and Paulay 2005;
Prantoni et al. 2018), and alternative methods have been proposed to find thresholds (see
Meier et al. 2006). Subsequent studies questioned the 10-fold rule (
Frézal and Leblois 2008), mainly because of its weak biological background (
Meyer and Paulay 2005). Conversely, empirical data for different nematode groups validated this method to set thresholds (e.g.,
Ferri et al. 2009;
Derycke et al. 2010;
Martínez-Arce et al. 2020).
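The two quantities defined in this paragraph, the barcode gap and the 10-fold threshold, can be sketched in a few lines of Python. The function names, the data layout (a dictionary of pairwise distances keyed by specimen index pairs), and the toy distances below are assumptions made for illustration only.

```python
from itertools import combinations
from statistics import mean


def split_distances(labels, dist):
    """Separate pairwise distances into intraspecific and interspecific lists.

    labels: list of species names, one per specimen.
    dist:   dict mapping (i, j) index pairs with i < j to a pairwise distance.
    """
    intra, inter = [], []
    for i, j in combinations(range(len(labels)), 2):
        (intra if labels[i] == labels[j] else inter).append(dist[(i, j)])
    return intra, inter


def barcode_gap(intra, inter):
    """Lowest interspecific distance minus highest intraspecific distance.

    A positive value means the two distributions do not overlap.
    """
    return min(inter) - max(intra)


def tenfold_threshold(intra):
    """The '10-fold rule': ten times the mean intraspecific distance."""
    return 10 * mean(intra)


# Invented toy data: four specimens from two hypothetical species.
labels = ["sp_A", "sp_A", "sp_B", "sp_B"]
dist = {
    (0, 1): 0.004, (2, 3): 0.006,  # within species
    (0, 2): 0.080, (0, 3): 0.085,  # between species
    (1, 2): 0.082, (1, 3): 0.090,
}

intra, inter = split_distances(labels, dist)
print("barcode gap:", round(barcode_gap(intra, inter), 4))
print("10-fold threshold:", round(tenfold_threshold(intra), 4))
```

With these invented values the gap is positive and the 10-fold threshold falls inside it; with real data either condition may fail, which is why the threshold has to be evaluated per lineage.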
The phylum Nematoda Rudolphi, 1808 is an abundant and speciose group among metazoans. Nematodes comprise around 25 000 valid species, with estimated diversity higher than 40 million species (
Larsen et al. 2017). Nematodes occupy a wide variety of ecological niches, as both free-living and parasitic species (
Blumenthal and Davis 2004). Nematode identification often relies on morphological characters, which may be subtle, subjective, dependent on other characters, highly plastic, or present only in a specific life stage or sex (
Coomans 2002;
Nadler 2002;
Carneiro et al. 2017). For nematodes of medical and economic interest, an accurate taxonomic diagnosis is fundamental to understanding transmission mechanisms, developing management strategies, and preventing the deleterious effects of parasitism (
Jasmer et al. 2003;
Ortiz et al. 2016).
Thus, we assessed cox1 performance as a DNA barcode in seven nematode genera, seeking to (i) test efficiency based on barcode gap and Probability of Correct Identification (PCI) analyses, (ii) compare PCI between two public sequence databases, and (iii) estimate species richness in the compiled datasets through an automated distance-based species clustering tool. Issues related to operational biases and challenges in nematode taxonomy are discussed in light of DNA barcoding.
Discussion
In this study, we explored cox1 performance as a DNA barcode for different lineages of Nematoda, represented by seven genera. The barcoding gap analyses tested the applicability of this molecular marker in species discovery and delimitation. We found barcoding gaps for the seven analysed genera using GenBank sequences; for BOLD sequences, only six genera showed a barcoding gap. Moreover, we checked the hypothetical accuracy of the identifications (i.e., PCI), compared PCI between BOLD and GenBank, and estimated species richness based on cox1 for each dataset (i.e., ABGD). We found PCI rates around 70% for both databases, and ABGD results overall pointed to a higher species richness than indicated by the taxonomic labels in the databases. These results highlight the prevalence of database issues and the pitfalls of the widespread use of arbitrarily fixed species delimitation thresholds, the implications of which are relevant to a variety of metazoan lineages.
Barcoding gaps and fixed thresholds: a cautionary tale
The good performance of cox1 for all the analysed genera in the barcoding gap analyses shows the potential of this molecular marker as a tool to assess diversity in Nematoda. The only exception was the intermediate performance for Heterodera sequences retrieved from BOLD. Accordingly, we recommend caution when defining divergence thresholds for species discovery.
Some authors have assigned fixed thresholds for nematode groups. Using the 10-fold rule,
Ferri et al. (2009) estimated a 4.8% threshold for filarioid nematodes (also sampling
Onchocerca species). For free-living marine nematodes, a 5% threshold obtained through the 10-fold rule has been consistently suggested to assess closely related and cryptic species of a “wide range of taxa” (
Derycke et al. 2010;
Armenteros et al. 2014;
Martínez-Arce et al. 2020). Alternatively, a 2% threshold separated congeneric species from multiple lineages of parasites of vertebrates (
Prosser et al. 2013). Moreover, previous works often suggest a fixed threshold based on the lifecycle strategy of the scrutinised taxa (e.g., marine nematodes). We discourage this practice since the different lifestyles within the phylum have emerged independently multiple times (
Blaxter and Koutsovoulos 2015).
Here, we reiterate what
Collins and Cruickshank (2012) postulated as “the sixth deadly sin of DNA barcoding”: the inappropriate use of fixed thresholds for higher taxonomic levels. This assumption disregards the likely evolutionary heterogeneity and coalescence within diverse lineages (
Fujita et al. 2012;
Pentinsaari et al. 2016). A threshold value should be optimised from libraries of specific taxonomic groups (e.g., genus), setting aside arbitrary “magic values” of divergence for higher taxonomic levels. Hence, an advantageous feature of DNA barcoding is its retroactive nature: as the accuracy of DNA barcoding improves the detection of taxa, it reciprocally enhances the correct labelling of library data.
We recognise that
cox1 performance in our dataset is far from flawless. The result for
Heterodera (BOLD) was intermediate (
Fig. 1), and the barcode gap analyses showed a remarkable number of outlier values in the boxplots (
Figs. 1 and
2). Those outliers may be specimens from subsampled populations that present molecular distances above the conspecific average. However, we need to stress the likelihood of cryptic diversity and operational biases affecting the identification accuracy, as discussed below.
So, how should the barcoding gap be established? With caveats and care. Many methods and techniques are premised on a comprehensive sampling that would include all populations and species from a lineage (
Lim et al. 2012). As the detection of a barcoding gap is sensitive to the number of species (
Meier et al. 2008) and specimens sampled (
Fontaneto et al. 2015), the analyses should be reviewed whenever new samples are generated and deposited in the databases (see
Qing et al. 2020). Presumably, the genetic diversity within a taxon should reach an asymptote as the heterogeneity within a lineage increases. This knowledge may facilitate the establishment of a more robust barcoding threshold for specific taxa. Exploratory studies must avoid a priori threshold values. In cases where there is a lack of data for a specific taxon, we suggest the careful use of the threshold from the closest lineage possible.
In BOLD and GenBank we trust (but not blindly)
Our PCI analyses reinforce the barcoding gap results (
Figs. 1 and
2): the outlier comparisons exhibited on the histograms show an evident incongruence between the genetic distances and the database species labels. We aimed to test the performance of
cox1 to identify specimens. Considering the incipient application of this marker for Nematoda, the observed PCI is remarkable. Here, improved species delimitation methods provide a step forward for future research that seeks to accurately assess diversity and identify specimens/sequences in different lineages of Nematoda.
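The idea behind a PCI estimate can be illustrated with a short Python sketch. The exact PCI formulation used in this study is not restated in this section, so the sketch assumes a simple "best match" criterion: each sequence is queried against all others, and the identification counts as correct only if all of its nearest neighbours carry the same species label. The function name, the handling of singletons, and the toy distance matrix are illustrative assumptions.

```python
def pci_best_match(labels, dist_matrix):
    """Fraction of queries whose nearest neighbour(s) all share the query's species label.

    Assumed 'best match' criterion: ties at the minimum distance must all be
    conspecific for the identification to count as correct. Species represented
    by a single sequence have no conspecific reference and are excluded from
    the denominator.
    """
    n = len(labels)
    correct = 0
    scored = 0
    for q in range(n):
        if labels.count(labels[q]) < 2:
            continue  # singleton: cannot be matched to a conspecific
        others = [k for k in range(n) if k != q]
        best = min(dist_matrix[q][k] for k in others)
        nearest = [k for k in others if dist_matrix[q][k] == best]
        scored += 1
        if all(labels[k] == labels[q] for k in nearest):
            correct += 1
    return correct / scored if scored else float("nan")


# Invented toy data: specimen 4 is labelled sp_A but is genetically close to sp_B,
# mimicking a mismatch between genetic distance and database label.
labels = ["sp_A", "sp_A", "sp_B", "sp_B", "sp_A"]
dist_matrix = [
    [0.00, 0.01, 0.10, 0.10, 0.11],
    [0.01, 0.00, 0.10, 0.11, 0.10],
    [0.10, 0.10, 0.00, 0.01, 0.02],
    [0.10, 0.11, 0.01, 0.00, 0.02],
    [0.11, 0.10, 0.02, 0.02, 0.00],
]

pci = pci_best_match(labels, dist_matrix)
print(f"PCI = {pci:.2f}")
```

In this toy matrix a single sequence whose label disagrees with its genetic neighbourhood is enough to lower the PCI, which is the kind of incongruence between distances and labels discussed above.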
The PCIs of BOLD and GenBank were statistically similar. These results dismiss our initial supposition that BOLD sequences would exhibit higher PCI compared to GenBank (see “Data obtention and filtering” section). Although BOLD mines sequences from GenBank, this is no reason to assume that the datasets from the two databases (and the results obtained from them) would be the same (e.g.,
Meiklejohn et al. 2019;
Pentinsaari et al. 2020). Hence, GenBank sequences could equally contribute to exploratory analyses. Nevertheless, the PCIs of most datasets from both databases are far from 100%, except for
Caenorhabditis.
The use of data retrieved from any database, even curated ones, should not be done blindly. Careful screening may avoid unwanted problems. Indeed, misidentification and annotation errors are intrinsic to the way the sequences are deposited in public databases (e.g.,
Valkiūnas et al. 2008;
Kvist 2014;
Stavrou et al. 2018). The assignment errors may be related to operational biases, e.g., laboratory contamination, host cell DNA, and data entry mistakes (
Mutanen et al. 2016;
Leray et al. 2019), but mainly to specimen misidentification (see below) (
Valkiūnas et al. 2008). Free access to these sequences may propagate these errors and lead to erroneous conclusions (
Valkiūnas et al. 2008). As the number of taxonomists has been decreasing, cases of misidentified sequences may increase soon (see
Janssen et al. 2017). However, independent research groups have worked continuously to improve identification accuracy for different taxa and molecular markers (e.g.,
Heller et al. 2014;
O’Leary et al. 2016;
Dunlap et al. 2018). Curated datasets and pipelines for molecular identification have also been developed for nematode groups (e.g.,
Macheriotou et al. 2019;
Qing et al. 2020). Efforts like these are invaluable resources, mainly for taxa whose taxonomy is historically ambiguous, like Nematoda.
Hidden diversity, but to what degree?
The species richness estimated by ABGD does not match the number of species labels reported by BOLD and GenBank for any dataset. ABGD is considered a conservative approach, and recent studies reported its tendency to lump sequences belonging to different species and to seldom split conspecific sequences (
Pentinsaari et al. 2017;
Gélin et al. 2017;
Busschau et al. 2019). The conservative nature of this algorithm is desirable here since our aim was not to make taxonomic decisions, but to cautiously shed light on cryptic diversity and markedly dubious species boundaries. Nevertheless, the estimated species richness of the analysed genera was usually greater than the number of taxonomic labels reported. Some of our results stood out, showing discrepancies both for underestimation, e.g.,
Meloidogyne, and for overestimation, e.g.,
Heterodera (
Table 4). Remarkably, our sample encompasses all
Anguillicola and
Trichinella species (
Table 1), and for both genera and datasets, ABGD estimated a higher richness. These cases could be investigated in-depth, integrating multiple sources of evidence to unravel the taxonomy of these groups.
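The comparison between distance-based clusters and database labels can be illustrated with a deliberately simplified Python sketch. It is not ABGD: ABGD infers the gap from the distribution of pairwise distances and partitions the data recursively, whereas the sketch below applies single-linkage clustering at a user-supplied threshold. The function name, the threshold values, and the toy data are assumptions for illustration.

```python
def cluster_count(dist_matrix, threshold):
    """Number of single-linkage clusters when pairs closer than `threshold` are joined."""
    n = len(dist_matrix)
    parent = list(range(n))

    def find(x):
        # Follow parent pointers to the cluster representative (with path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if dist_matrix[i][j] < threshold:
                parent[find(i)] = find(j)
    return len({find(x) for x in range(n)})


# Invented toy data: six specimens in three genetic clades, but only two species labels.
clade = [0, 0, 1, 1, 2, 2]
labels = ["sp_A", "sp_A", "sp_A", "sp_A", "sp_B", "sp_B"]
between = {(0, 1): 0.08, (0, 2): 0.12, (1, 2): 0.12}


def toy_distance(i, j):
    """Hypothetical distances: 0.01 within a clade, larger values between clades."""
    if i == j:
        return 0.0
    a, b = sorted((clade[i], clade[j]))
    return 0.01 if a == b else between[(a, b)]


dist_matrix = [[toy_distance(i, j) for j in range(6)] for i in range(6)]

print("species labels:", len(set(labels)))
for threshold in (0.03, 0.10):
    print(f"clusters at {threshold:.2f}:", cluster_count(dist_matrix, threshold))
```

At the lower threshold the toy dataset yields more clusters than species labels, while the higher threshold lumps the two clades labelled sp_A. Such a mismatch is a prompt for further investigation with additional evidence, not a taxonomic decision in itself.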
The use of
cox1 as a DNA barcode usually allows taxonomic resolution at population/species level, but the peculiarities of each lineage may hinder species diagnosis (
Powers et al. 2018). In genera such as
Meloidogyne, the ancient asexuality and the hybrid origin of species have led to reticulate evolutionary patterns that hamper the delimitation of apomictic species (
Lunt 2008;
Janssen et al. 2016;
Powers et al. 2018). Still, approaches based only on mitochondrial DNA may overlook the most recent speciation events because of the time lag between speciation and the sorting of haplotype lineages to reciprocal monophyly (
Nadler 2002).
Overall, cox1 is a relevant tool for integrative taxonomy of nematodes
Our multiple analyses using cox1 show the suitability of this molecular marker for the scrutiny of Nematoda at the genus and species level. The availability of data limited the approach adopted here. Thus, any taxon sampling bias reflects, to some extent, the current trends and state of the art in nematode research. We are aware that the data available in GenBank and BOLD, and so the genera covered here, represent only a fraction of this diverse phylum. However, the use of thousands of sequences in a study spanning different lineages with distinct lifecycle strategies has no precedent among Nematoda, as far as we know.
Overall, the results point to a substantial number of specimen misidentifications or dubious species delimitations. For taxa with many cryptic species, complex morphology, and complex life histories, such as Nematoda, taxonomic impediments arise. Thus, systematics is weakened whenever a single approach (e.g., morphology, molecular, behaviour) is prioritised (
Coomans 2002). The term integrative taxonomy, coined around 15 years ago (
Dayrat 2005;
Will et al. 2005), uses multiple lines of evidence to inform taxonomy and is widespread in the literature (
Padial et al. 2010) but is rarely applied to nematodes. New species descriptions are often based exclusively on morphological comparisons of type specimens (e.g.,
Phillips et al. 2016;
Acosta et al. 2017;
Pinheiro et al. 2018).
The integration of large-scale and consistent DNA sequencing with traditional taxonomic approaches naturally improves the discovery of biological diversity and identification of specimens (
Moritz and Cicero 2004). The use of
cox1 as a metazoan barcode enriches the large public databases, such as BOLD and GenBank, making them scientifically valuable (
Fontaneto et al. 2015;
Andújar et al. 2018). However, the success of the DNA barcoding strategy requires the maintenance of a reference database that obeys rigorous taxonomic criteria at the moment of the deposit of sequences, especially concerning voucher data (
Ekrem et al. 2007). The standardisation of a molecular marker allows reliable cross-comparison between studies and databases (
Smith et al. 2009), boosting its use in, e.g., applied sciences.
Cox1 can also improve metabarcoding studies that assess nematode communities, as 18S is usually inaccurate at the species level and may even underestimate the real diversity (
Tang et al. 2012;
Blaxter 2016;
Treonis et al. 2018). For identification purposes, a growing body of evidence shows
cox1 outperforming ribosomal markers, including 18S (
Guardone et al. 2013;
Singh et al. 2013;
Armenteros et al. 2014) and ITS (
Blouin 2002;
Keskin et al. 2015). When feasible, the use of multi-locus barcode approaches should be preferred as they increase identification success (
Meiklejohn et al. 2019).
After all,
cox1 barcoding is neither the panacea nor the archenemy of nematode taxonomy. We encourage the use of multiple methods to increase the robustness of taxonomic decisions.
Cox1 has been used extensively for varied groups of organisms and different taxonomic purposes (e.g.,
Zimmermann et al. 2015;
Almerón-Souza et al. 2018;
Gibbs 2018). Without a reliable taxonomic identification, all research carried out in academic and applied branches of life sciences is virtually worthless (
Kholia and Fraser-Jenkins 2011). Therefore,
cox1 is a relevant ally in nematode systematics and taxonomy, improving other methodologies, aiding in cryptic diversity detection, and shedding light on specimen identification. In other words,
cox1 as a DNA barcode may be useful to tackle this can of worms.