mvmorph: an r package for fitting multivariate evolutionary models to morphometric data
Summary
- We present mvmorph, a package of multivariate phylogenetic comparative methods for the r statistical environment. mvmorph is freely available on the cran package repository (http://cran.r-project.org/web/packages/mvMORPH/).
- mvmorph allows fitting a range of multivariate evolutionary models under a maximum-likelihood criterion. Initially developed in the context of phylogenetic analysis of multiple morphometric traits, its use can be extended to any biological data set with one or multiple covarying continuous traits. All the fitting models include the possibility to use simmap-like mapping, which may be useful for fitting changes along lineages at a given point in time. All models provide diagnostic metrics for convergence and reliability of estimates, as well as the possibility to include trait measurement errors in model estimates.
- New features provided by the mvmorph package include the possibility of fitting models with changes in the mode of evolution along the phylogeny, which will be particularly meaningful in comparative analyses that include extinct taxa, for example when testing changes in evolutionary mode associated with global biotic/abiotic events.
- We briefly describe the models already included in mvmorph and provide some demonstration of the use of the package with two simulated worked examples.
Introduction
Comparative methods for fitting evolutionary models to continuous data are becoming increasingly popular (e.g. Hansen 1997; Pagel 1999; Butler & King 2004; O'Meara et al. 2006; Thomas, Freckleton & Székely 2006; Hansen, Pienaar & Orzack 2008; Beaulieu et al. 2012; Thomas & Freckleton 2012; Adams 2013; Ingram & Mahler 2013). Over the last years, these methods used to study evolutionary patterns of continuous traits within phylogenies have been widely applied on a trait-by-trait basis for estimating shifts in evolutionary rates associated with environmental global changes, the timing of acquisition of key innovations, or the release of biotic interactions (Collar et al. 2009; Price et al. 2011, 2013; Slater 2013), as well as changes in modes of evolution and adaptation towards different optima (Scales, King & Butler 2009; Collar, Schulte & Losos 2011; Frédérich et al. 2013; Knope & Scales 2013; Mahler et al. 2013). However, phenomic evolution is multivariate by nature, as natural selection operates on functionally related organism's traits (Walsh 2007; Walsh & Blows 2009), and because traits may covary due to pleiotropic effects, genetic linkages or different sources of developmental constraints (Lande & Arnold 1983; Felsenstein 1988; Armbruster & Schwaegerle 1996; Armbruster et al. 2014). Thus, multivariate analyses should logically be favoured over univariate ones (Fig. 1) for describing the complex structure of multiple covarying traits and measurements as routinely studied in morphometrics. Moreover, by considering the evolutionary history of phylogenetically related species, we can further assess how and when phenotypic covariations evolved, either in relation or not with biotic/abiotic events that potentially drive evolution (Fig. 2). Here, we present a new r package, mvmorph, dedicated to the phylogenetic comparative analysis of biological data sets with multiple covarying continuous traits.
Phenotypic multivariate evolution has been thoroughly studied over the last decades, particularly in quantitative genetics in relation to the evolution of genetic covariances (Lande 1979; Lande & Arnold 1983; Cheverud 1984; Felsenstein 1988; Zeng 1988; Endler 1995; Arnold, Pfrender & Jones 2001; Steppan, Phillips & Houle 2002) and has led to particular expectations over macroevolutionary times (e.g. Schluter 1996; Steppan 1997; Arnold, Pfrender & Jones 2001; Baker & Wilkinson 2003; Polly 2004; Hohenlohe & Arnold 2008; Revell & Harmon 2008; Shoval et al. 2012; Goswami et al. 2014). Importantly, modelling multivariate phenotypic evolution through trait variances and pairwise covariances allows describing several aspects (e.g. correlated evolution, correlated selection) that separate univariate analyses cannot provide (Fig. 2). Indeed, univariate analyses are merely limited for describing complex structures and generally show less statistical power than multivariate counterparts (Zheng et al. 2009). Moreover, while complex multivariate data sets are often first reduced through ordination methods such as PCA and a subset of the first principal components is then treated as independent univariate traits, evolutionary model inferences from these transformed spaces can be severely biased (Revell 2009; Monteiro 2013; Polly et al. 2013; Uyeda, Caetano & Pennell 2015). Finally, it is a rather common approach in comparative analysis to regressing traits on each other by assuming, often arbitrarily, that one trait is the dependent variable and the other the independent variable. Nevertheless, such an approach implicitly assumes that one trait is evolutionarily responding to the other (Hansen, Pienaar & Orzack 2008; Revell 2010; Hansen & Bartoszek 2012), a critical assumption of biological causality which is actually relaxed in a multivariate setting focused on mere correlations among response variables.
The multivariate models provided by the mvmorph package will be also of particular interest for evolutionary biologists integrating fossil specimens and taxa in their studies, as the package allows estimating models of changes in evolutionary modes and/or rates at a given point in time (e.g. Slater 2013). Indeed, integration of extinct clades in comparative studies is an important step forward in the study of morphological diversification (Slater, Harmon & Alfaro 2012; Slater 2013): while some studies have pointed towards the need to consider phylogenies in among-group comparisons of morphological rates of evolution (O'Meara et al. 2006; Ricklefs 2006; Adams et al. 2009; Revell & Collar 2009), it appears also necessary to consider underlying processes which may affect the realized variances (Hunt 2012). Actually, some processes such as those leading to shifts from the ancestral state to the optima can be estimated only when including fossil taxa in the analysis (Hansen 1997; Hunt, Bell & Travis 2008). Moreover, although the use of non-ultrametric trees (whether including fossil taxa or fast evolving organisms such as viruses) enhances the accuracy and identifiability of parameter estimates (Slater, Harmon & Alfaro 2012; Ho & Ané 2013), it remains computationally limited to some of the current software implementations (Slater 2014).
Description
The mvmorph Package
The mvmorph package provides in a unified environment useful tools to deal with ultrametric or non-ultrametric trees over a range of models of continuous trait evolution (Table 1). mvmorph provides fast (see drawback below) multivariate implementations of comparative approaches allowing the incorporation of measurement errors (Ives, Midford & Garland 2007; Felsenstein 2008; Revell & Reynolds 2012; Silvestro et al. 2015), missing values and various parameterizations; it allows the user to handle trees in the simmap-like format (Huelsenbeck, Nielsen & Bollback 2003; Bollback 2006) which provide a flexible way to map evolutionary hypothesis on a phylogeny thanks to existing functions from the phytools package (Revell 2012). Last, mvmorph also allows simulating the multivariate evolution of traits (mvSIM function) and provides tools for users and developers with a general wrapper mvLL for computing log-likelihood of user-specified and customized models of trait evolution with various fast methods (see Appendix S1 for details, Supporting information).
Function | Model | Number of parametersa | Short description of the associated evolutionary model |
---|---|---|---|
mvBM | BM1 | (p(p + 1)/2) + p | BM on the whole tree (unique rate matrix) |
BMM | (p(p + 1)/2)k + p | BM with multiple rates matrix (multiple selective regimes) | |
mvOU | OU1 | 2(p(p + 1)/2) + p | OU process with a unique adaptive optimum per trait |
OUM | 2(p(p + 1)/2) + kp | OU process with multiple adaptive optima per trait | |
mvEB | EB–ACDC | (p(p + 1)/2) + p + 1 | EB model or decelerating model of evolutionary rates. |
mvSHIFT | BMEB/EBBM | (p(p + 1)/2) + p + 1 | Shift of a BM to/from EB process at a given point in time |
BMEBi/EBBMi | 2(p(p + 1)/2) + p + 1 | Same as BMEB/EBBM with independent rates on each time slice | |
BMOU/OUBM | 2(p(p + 1)/2) + p | Shift of a BM to/from OU process at a given point in time | |
BMOUi/OUBMi | 3(p(p + 1)/2) + p | Same as BMOU/OUBM with independent rates on each time slice | |
EBOU/OUEB | 2(p(p + 1)/2) + p + 1 | Shift of a EB to/from OU process at a given point in time | |
EBOUi/OUEBi | 3(p(p + 1)/2) + p + 1 | Same as EBOU/OUEB with independent rates on each time slice | |
mvSIM | all | – | Simulate the evolution of traits under the models represented in mvmorph |
mvLL | – | – | Compute the log-likelihood of an user provided tree or variance–covariance matrix |
half-life | OU | – | Compute the phylogenetic half-life for a multivariate OU process |
stationary | OU | – | Compute the multivariate stationary distribution for a multivariate OU process |
LRT | – | – | Compute the Log-ratio test for nested models in mvmorph |
- p, number of traits; k, number of distinct selective regimes. Model abbreviations: BM, Brownian motion; OU, Ornstein–Uhlenbeck; EB, Early Burst; ACDC, exponentially accelerating or decelerating.
- a *The number of parameters can be actually lower, depending on matrix constraints used.
Multivariate Models Implemented in mvmorph
The multivariate models for continuous trait evolution implemented in mvmorph are summarized in Table 1. Most of these models have been already widely described in an univariate context (Hansen 1997; Blomberg, Garland & Ives 2003; Butler & King 2004; O'Meara et al. 2006; Thomas, Freckleton & Székely 2006; Harmon et al. 2010; Slater 2013). Below we give a short presentation of their multivariate extension as provided in the package (see Appendix S1 for detailed descriptions).
Log-Likelihood of Multivariate Models
For a multivariate model with m traits and n species, Y is a vector of length mn and ε is a vector of mn residuals normally distributed in , with V a mn × mn phylogenetic multivariate variance–covariance matrix. If we fit an intercept-only model (i.e. a correlational model where we are estimating the phylogenetic signal in Y), then X is a mn × m design matrix, and β – the (notional) ancestral state expectation – is a vector of parameters to estimate of length m (assuming a unique ancestral state for each trait).
Brownian motion
Ornstein–Uhlenbeck
The OU model is an appealing alternative to the BM model as it describes processes where the traits variance is constrained around one or several optima often referred as ‘selective regime' or ‘adaptive zone' optima at the macroevolutionary scale (Hansen 1997, 2012; Butler & King 2004). In comparative method studies, this process has often been misleadingly interpreted in terms of Lande's (1976) model of population undergoing stabilizing selection in a static landscape (see also discussion in Harmon et al. 2010 and Pennell & Harmon 2013).
This second approach is particularly useful for fitting multivariate models to non-ultrametric trees (e.g. including fossil species) for which the estimation of the root state is feasible. On ultrametric trees, both approaches should converge to the same estimates if the process is stationary.
This expectation is used to estimate the root and the optima by entering the coefficients of their relative contribution in the design matrix X of the GLS equation (eqn 3). The various parameterizations implemented for the root state are described and discussed in the Appendix S1.
Early Burst
Shifts models
Constraining the parameter spaces
Estimating parameters of multivariate models is a complex task because matrices of parameters must always fulfil specific conditions, for example positive definiteness for variance–covariance matrices. Generally, one of the efficient ways to solve this issue is to use an unconstrained optimization of the likelihood with a parameterization that enforces the matrices to a particular structure (Pinheiro & Bates 1996). Various parameterizations are used in mvmorph (see Appendix S1) for ensuring positive definiteness of rate matrices (Pinheiro & Bates 1996), conditioning matrices on their eigenvalues (e.g. Sy, Taylor & Cumberland 1997; Jaffrézic, Thompson & Pletcher 2004; Bartoszek et al. 2012), or testing significance of interactions between traits by constraining the parameter space and taking advantage of joint likelihood optimization (e.g. Revell & Collar 2009; Adams 2013).
Alternatives
Other r implementations of multivariate comparative approaches already exist that calculate parts of the same or alternative models to those provided in mvmorph. Multivariate extensions of Pagel (1999) branch-length transformations are proposed in motmot (Thomas & Freckleton 2012; no more on the cran repository but accessible from the archive); BM with multiple evolutionary rate matrices can be fitted with evol.vcv as well as a phylogenetic canonical correlation analysis with phyl.cca in phytools (Revell 2012), and geomorph allows computing BM rates for high-dimensional data sets (Adams & Otarola-Castillo 2013). Multivariate OU is available from the ouch package (Butler & King 2009) as well as mvslouch that allows fitting multivariate OU models with a randomly evolving predictor variable (Bartoszek et al. 2012).
Drawbacks
Computational Performances
One of the main drawback of working with multivariate dimensions is that computational time grows up exponentially with variance–covariance matrix construction, inversion and computation of the determinant (Felsenstein 1973; Henderson 1976; Hadfield & Nakagawa 2010; Freckleton 2012). Recent alternative algorithms have been published to cope with the problem of growing number of species and dimensions in comparative analyses (Hadfield & Nakagawa 2010; FitzJohn 2012; Thomas & Freckleton 2012; Ho & Ané 2014; Lartillot 2014). Nevertheless, no general solution emerged so far, and most of these fast algorithms remain actually limited with multivariate data. For instance, the algorithm proposed by Ho & Ané (2014) can be easily extended to multivariate case when phylogenetic variance–covariance submatrices are proportional and that the multivariate variance–covariance matrix maintains a three-point structure, which is unlikely for non-Brownian processes such as the multivariate OU. Similarly, the pruning algorithm used by Thomas & Freckleton (2012) in the motmot package allows estimating a limited range of multivariate models such as a single optimum OU where selection acts independently on each traits.
Facing this lack of general solution, particularly for complex models such as multivariate OU processes, mvmorph implements various approaches to speed up the computations (see Appendix S1 for details). The main implementations consist in different methods that avoid the explicit computation of the inverse and determinant of the variance–covariance matrix to solve the linear model in eqn 1 through Cholesky factorization and independent contrasts. Two efficient algorithms are used for computing the Cholesky factors. The first one (method = ‘rpf') uses the rectangular full-packed format algorithm proposed by Gustavson et al. (2010). This algorithm uses half memory for the storage of symmetric matrices and reorganizes the upper triangular matrix in a rectangular format allowing the use of fast BLAS 3 matrix multiplication routines. The second method (method = ‘sparse') takes advantage of the sparsity structure of phylogenetic variance–covariance matrices to use faster sparse methods and updating of the Cholesky factor (Furrer & Sain 2010; see Appendix S1). Finally, an extremely fast implementation of the likelihood computation using contrasts (Freckleton 2012; see Appendix S1) is also provided in mvmorph for BM-based models (method = ‘pic'). Overall, the computational solutions implemented in mvmorph allow realistic computation times even for very large-sized data sets, with relative speeds several times faster than other available packages (see Appendix S1, Table S1).
High-Dimensionality
With increased number of traits and parameters (Table 1), over-parameterized multivariate models are often highly penalized by model selection criteria (Burnham & Anderson 2002; Revell & Collar 2009). Moreover, when the number of analysed traits reaches the number of studied species, the matrix computations involved in the likelihood calculation cannot be completed and parameter estimates become less accurate (Adams 2014a,b,c). Accordingly, Adams (2014a,b,c) has proposed methods based on matrix distances to allow the use of large m/small n data sets. Although these nonparametric approaches are appealing for comparing high-dimensional multivariate data sets, they do not fully address all the multivariate questions of trait evolution as they summarize the whole covariation information through a single metric, they assume a common phylogenetic structure for each trait, and they are limited to the BM model. Indeed, high-dimensional methods in phylogenetic comparative analysis should deserve more attention in the future as common approaches used to reduce the dimensionality of such data sets (e.g. principal component analysis) are prone to several biases (Monteiro 2013; Polly et al. 2013; Uyeda, Caetano & Pennell 2015).
Worked examples
To illustrate how mvmorph works, we describe and comment two simulated examples in the Appendix S2 (also included as vignettes in the mvmorph package). In the first example, we show how multivariate models can out-compete separate univariate analyses for a phylogeny involving two distinct selective regimes corresponding to two separate species groups. In the second worked example, we illustrate how to assess significances of differences in evolutionary covariations by testing a hierarchy of nested models. In addition, the help pages embedded within the mvmorph package also provide several simulated examples illustrating how the various functions work.
Conclusion
Elaboration of the mvmorph package was initially motivated by the need for a unified r package for fitting multivariate models of phenotypic evolution incorporating measurement errors and using flexible ways to map evolutionary scenarios for morphometric data. Obviously, such objectives can be extended to any biological data set made of covarying continuous traits, as well as univariate traits. mvmorph also implements models that were not previously designed for the multivariate case, as well as new ones. Last, but not least, the fast likelihood implementations provided by the mvmorph package may be particularly appealing when dealing with large data sets and assessing the statistical power of comparative methods through modelling approaches (Boettiger, Coop & Ralph 2012). The mvmorph package is in continuous development and can be followed and contributed from the gitHub website (https://github.com/JClavel/mvMORPH). mvmorph (current version: 1.0.5) is an open-source r package available for download on the cran repository (http://cran.r-project.org/web/packages/mvMORPH/).
Acknowledgements
J.C. warmly thanks Aaron A. King and Krzysztof Bartoszek for their patience and help in understanding how the multivariate OU model works. We thank Emmanuel Paradis for sharing codes and Graham J. Slater for sharing an unpublished manuscript as well as for precisions on his models. We thank Jonathan Drury for his advices and help for correcting the package, and Fabien Dubuffet for his help with some computational procedures and access to the cluster under his care for the package testing. Thanks to Hélène Morlon and members of her laboratory (Jonathan Drury, Eric Lewitus, Marc Manceau and Olivier Missa) for helpful discussions. We finally thank Vincent Bonhomme, Matthew W. Pennell, Thimothée Poisot, Liam J. Revell, Graham J. Slater, two anonymous reviewers and the Journal editor Robert B. O'Hara for comments that greatly improved a first version of this manuscript.
Data accessibility
This manuscript does not include any data.