Classification of Nucleotide Sequences Using Support Vector Machines

Journal of Molecular Evolution

Abstract

Species identification is one of the most important issues in biological studies. With the recent growth in available genomic information and advances in DNA sequencing technologies, the use of DNA sequences to identify species (commonly referred to as “DNA barcoding”) is being tested in many areas. Several methods have been suggested for identifying species from DNA sequences, including similarity scores, analysis of phylogenetic and population genetic information, and detection of species-specific sequence patterns. Although these methods perform well under a range of circumstances, they also have limitations: they can lose information, require intensive computation, be sensitive to model misspecification, and make it difficult to evaluate the significance of an identification. Here, we propose a new DNA barcoding method based on support vector machine (SVM) procedures. The method is nonparametric and is therefore expected to be robust across a wide range of evolutionary scenarios and in multilocus analyses. Furthermore, we describe bootstrap procedures that can be used to test the significance of species identifications. We implemented a novel conversion technique that transforms sequence data into real-valued vectors, so that bootstrap procedures can be easily combined with our SVM approach. We present the results of simulation studies and empirical data analyses to demonstrate the performance of our method, and we discuss its properties.
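The general idea in the abstract (encode sequences as real-valued vectors, classify with an SVM, and attach bootstrap support to the assignment) can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration, not the article's procedure: the one-hot encoding stands in for the conversion technique, which the abstract does not describe; the classifier is a plain linear soft-margin SVM restricted to two species and fitted by subgradient descent; the bootstrap resamples alignment columns; and all function names, parameter values, and toy sequences are hypothetical.

```python
import random

BASES = "ACGT"

def one_hot(seq):
    """Encode an aligned DNA sequence as a flat real-valued vector (4 values per site).
    Assumed encoding; characters outside ACGT (gaps, N) become all zeros."""
    return [1.0 if base == b else 0.0 for base in seq.upper() for b in BASES]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=100):
    """Fit (w, b) by full-batch subgradient descent on the L2-regularised hinge loss.
    Labels in y must be +1.0 or -1.0."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        # Points inside the margin or misclassified contribute to the subgradient.
        violators = [i for i in range(n) if y[i] * (dot(w, X[i]) + b) < 1.0]
        w = [(1.0 - lr * lam) * wj for wj in w]  # shrinkage from the L2 penalty
        for i in violators:
            step = lr * y[i] / n
            w = [wj + step * xj for wj, xj in zip(w, X[i])]
            b += step
    return w, b

def classify(train_seqs, labels, query, positive):
    """Train on the labelled sequences and return the predicted label of `query`."""
    X = [one_hot(s) for s in train_seqs]
    y = [1.0 if lab == positive else -1.0 for lab in labels]
    w, b = train_linear_svm(X, y)
    negative = next(lab for lab in labels if lab != positive)
    return positive if dot(w, one_hot(query)) + b >= 0.0 else negative

def bootstrap_support(train_seqs, labels, query, positive, replicates=50, seed=0):
    """Resample alignment columns with replacement, retrain, and reclassify.
    Returns the fraction of replicates that assign the query to each label."""
    rng = random.Random(seed)
    n_sites = len(query)
    counts = {lab: 0 for lab in set(labels)}
    for _ in range(replicates):
        cols = [rng.randrange(n_sites) for _ in range(n_sites)]
        boot_train = ["".join(s[c] for c in cols) for s in train_seqs]
        boot_query = "".join(query[c] for c in cols)
        counts[classify(boot_train, labels, boot_query, positive)] += 1
    return {lab: k / replicates for lab, k in counts.items()}

# Hypothetical toy alignment: three sequences from each of two species.
TRAIN = ["ACGTACGTAC", "ACGTACGTAT", "ACGAACGTAC",
         "TGCATGCATG", "TGCATGCATA", "TGCTTGCATG"]
LABELS = ["A", "A", "A", "B", "B", "B"]
QUERY = "ACGTACGTAG"

if __name__ == "__main__":
    print(classify(TRAIN, LABELS, QUERY, positive="A"))
    print(bootstrap_support(TRAIN, LABELS, QUERY, positive="A"))
```

Because the sequences are plain numeric vectors after encoding, resampling columns and retraining needs no model of sequence evolution, which is one way to read the abstract's remark that bootstrap procedures combine easily with the SVM approach. A multi-species version would need a one-vs-rest or pairwise scheme on top of this binary sketch.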

Author information

Correspondence to Tae-Kun Seo.

About this article

Cite this article

Seo, TK. Classification of Nucleotide Sequences Using Support Vector Machines. J Mol Evol 71, 250–267 (2010). https://doi.org/10.1007/s00239-010-9380-9

  • DOI: https://doi.org/10.1007/s00239-010-9380-9
