Abstract
Species identification is one of the most important issues in biological studies. With the recent growth in available genomic information and advances in DNA sequencing technologies, the use of DNA sequences to identify species (commonly referred to as “DNA barcoding”) is being tested in many areas. Several methods have been suggested for identifying species from DNA sequences, including similarity scores, analysis of phylogenetic and population genetic information, and detection of species-specific sequence patterns. Although these methods have performed well under a range of circumstances, they also have limitations: they are subject to loss of information, require intensive computation, are sensitive to model misspecification, and make it difficult to evaluate the significance of an identification. Here, we propose a new DNA barcoding method based on support vector machine (SVM) procedures. The method is nonparametric and is therefore expected to be robust across a wide range of evolutionary scenarios and in multilocus analyses. We also describe bootstrap procedures for testing the significance of species identifications. Because we implemented a novel conversion technique that transforms sequence data into real-valued vectors, these bootstrap procedures can be easily combined with our SVM approach. We present the results of simulation studies and empirical data analyses to demonstrate the performance of our method and discuss its properties.
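To make the general idea concrete, the following is a minimal sketch of classifying aligned sequences with a linear SVM and attaching a bootstrap support value to the result. It is not the article's implementation. Several things here are assumptions made only for illustration: the one-hot encoding (the abstract does not specify the conversion technique actually used), the two-species setting, the batch subgradient training routine, the resampling of alignment columns with replacement, the toy sequences, and every function and parameter name.

```python
import random

BASES = "ACGT"

def encode(seq):
    """One-hot encode an aligned DNA sequence as a flat real-valued vector.
    (Assumed encoding, chosen for illustration only.)"""
    return [1.0 if base == b else 0.0 for base in seq for b in BASES]

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=50):
    """Fit a linear soft-margin SVM by batch subgradient descent on the
    L2-regularised hinge loss. Labels in y must be +1 or -1."""
    n, dim = len(X), len(X[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        hinge_w, hinge_b = [0.0] * dim, 0.0
        for xi, yi in zip(X, y):
            score = sum(wj * xj for wj, xj in zip(w, xi)) + b
            if yi * score < 1.0:  # this sample violates the margin
                hinge_w = [hj + yi * xj for hj, xj in zip(hinge_w, xi)]
                hinge_b += yi
        # Step against the subgradient of lam/2 * |w|^2 + mean hinge loss.
        w = [wj - lr * (lam * wj - hj / n) for wj, hj in zip(w, hinge_w)]
        b += lr * hinge_b / n
    return w, b

def predict(train_seqs, species, query):
    """Train on two labelled species and return the predicted species of query."""
    pos = species[0]
    neg = next(s for s in species if s != pos)
    X = [encode(s) for s in train_seqs]
    y = [1 if s == pos else -1 for s in species]
    w, b = train_linear_svm(X, y)
    score = sum(wj * xj for wj, xj in zip(w, encode(query))) + b
    return pos if score > 0 else neg

def resample_columns(seq, cols):
    """Build a pseudo-replicate sequence from the chosen alignment columns."""
    return "".join(seq[c] for c in cols)

def bootstrap_support(train_seqs, species, query, replicates=100, seed=1):
    """Resample alignment columns with replacement, retrain, and return the
    full-data prediction with the fraction of replicates that reproduce it."""
    rng = random.Random(seed)
    full = predict(train_seqs, species, query)
    length = len(query)
    hits = 0
    for _ in range(replicates):
        cols = [rng.randrange(length) for _ in range(length)]
        boot_train = [resample_columns(s, cols) for s in train_seqs]
        if predict(boot_train, species, resample_columns(query, cols)) == full:
            hits += 1
    return full, hits / replicates

# Toy alignment (hypothetical data): three sequences per species.
train = ["ACGTACGTACGT", "ACGTACGTACGA", "ACGTACCTACGT",
         "TCGAACGTTCGT", "TCGAACGTTCGA", "TCGAACCTTCGT"]
species = ["sp1"] * 3 + ["sp2"] * 3
query = "ACGTACGTACGG"

label, support = bootstrap_support(train, species, query)
print(label, support)
```

Because the sequences become fixed-length numeric vectors, resampling sites is just a matter of re-indexing columns before encoding, which is one way to see why a vector representation combines naturally with a bootstrap. A practical analysis would need a multi-class classifier, possibly a nonlinear kernel, and far more replicates than this toy run uses.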
Cite this article
Seo, TK. Classification of Nucleotide Sequences Using Support Vector Machines. J Mol Evol 71, 250–267 (2010). https://doi.org/10.1007/s00239-010-9380-9
DOI: https://doi.org/10.1007/s00239-010-9380-9