Review Article
Evaluating Model Performance in Evolutionary Biology
- Jeremy M. Brown and Robert C. Thomson
- Vol. 49:95–114 (Volume publication date November 2018) https://doi.org/10.1146/annurev-ecolsys-110617-062249
Copyright © 2018 by Annual Reviews. All rights reserved
Abstract
Many fields of evolutionary biology now depend on stochastic mathematical models. These models are valuable for their ability to formalize predictions in the face of uncertainty and provide a quantitative framework for testing hypotheses. However, no mathematical model will fully capture biological complexity. Instead, these models attempt to capture the important features of biological systems using relatively simple mathematical principles. These simplifications can allow us to focus on differences that are meaningful, while ignoring those that are not. However, simplification also requires assumptions, and to the extent that these are wrong, so is our ability to predict or compare. Here, we discuss approaches for evaluating the performance of evolutionary models in light of their assumptions by comparing them against reality. We highlight general approaches, how they are applied, and remaining opportunities. Absolute tests of fit, even when not explicitly framed as such, are fundamental to progress in understanding evolution.
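As a rough illustration of what an absolute test of fit can look like, the sketch below runs a parametric bootstrap on made-up data. It is not taken from the article. It assumes a hypothetical data set of per-site substitution counts, a deliberately simple model (one Poisson rate shared by all sites), and the variance of counts across sites as the test statistic. The function names and the example data are illustrative assumptions only.

```python
import math
import random
import statistics


def poisson_draw(rng, lam):
    """Draw one Poisson(lam) variate (Knuth's method; fine for small lam)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def predictive_p_value(observed, n_sims=2000, seed=1):
    """Parametric-bootstrap p-value for the across-site variance.

    Fits a single-rate Poisson model to `observed`, simulates data sets of
    the same size from the fitted model, and returns the fraction of
    simulated data sets whose variance is at least as large as the
    observed variance. Small values suggest the model cannot reproduce
    the variation seen in the data.
    """
    rng = random.Random(seed)
    lam_hat = statistics.fmean(observed)       # ML estimate of the rate
    obs_stat = statistics.pvariance(observed)  # observed test statistic
    exceed = 0
    for _ in range(n_sims):
        sim = [poisson_draw(rng, lam_hat) for _ in observed]
        if statistics.pvariance(sim) >= obs_stat:
            exceed += 1
    return exceed / n_sims


# Hypothetical data: both sets have mean 2 substitutions per site.
even_sites = [2, 1, 3, 0, 2, 4, 1, 2, 3, 1, 2, 0, 3, 2, 1, 5, 2, 1, 3, 2]
clustered_sites = [0] * 15 + [8, 9, 10, 7, 6]  # most sites invariant

p_even = predictive_p_value(even_sites)
p_clustered = predictive_p_value(clustered_sites)
print(f"single-rate model, even data:      p = {p_even:.3f}")
print(f"single-rate model, clustered data: p = {p_clustered:.3f}")
```

The two data sets have the same mean, so a comparison based only on the fitted rate would not separate them. The simulated variances show that the single-rate model is plausible for the first set and not for the second. A Bayesian posterior predictive check follows the same logic but draws parameters from the posterior rather than plugging in a point estimate.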