Volume 13, Issue 2 p. 367-382
RESEARCH ARTICLE
Free Access

A standardized effect size for evaluating and comparing the strength of phylogenetic signal

Michael L. Collyer

Corresponding Author

Michael L. Collyer

Department of Science, Chatham University, Pittsburgh, PA, USA

Correspondence

Michael L. Collyer

Email: [email protected]

Search for more papers by this author
Erica K. Baken

Erica K. Baken

Department of Science, Chatham University, Pittsburgh, PA, USA

Department of Ecology, Evolution, and Organismal Biology, Iowa State University, Ames, IA, USA

Search for more papers by this author
Dean C. Adams

Dean C. Adams

Department of Ecology, Evolution, and Organismal Biology, Iowa State University, Ames, IA, USA

Search for more papers by this author
First published: 27 October 2021
Citations: 4

Handling Editor: José Miguel Ponciano

Abstract

  1. Macroevolutionary studies frequently characterize the phylogenetic signal in phenotypes; however, analytical tools for comparing the strength of that signal across traits remain largely underdeveloped.
  2. We developed a non-parametric, permutation test for the log-likelihood of an evolutionary model, plus a standardized statistic, urn:x-wiley:2041210X:media:mee313749:mee313749-math-0001, from this test, which can be considered a phylogenetic signal effect size. This statistic can be used in two-sample tests to compare the strength of phylogenetic signal for multiple traits.
  3. We performed simulation experiments that revealed that urn:x-wiley:2041210X:media:mee313749:mee313749-math-0002 had a linear association with Pagel's urn:x-wiley:2041210X:media:mee313749:mee313749-math-0003, which could be predicted by tree size, plus could be quickly interpreted as a hypothesis for phylogenetic signal based on a standard normal distribution. We additionally found that the permutation test had greater statistical power for detecting phylogenetic signal than parametric likelihood ratio tests, especially for small trees.
  4. The analytical framework we present extends the phylogenetic comparative methods toolkit, allowing for statistical comparison of phylogenetic signal in multiple traits. Future studies could also consider this framework for the comparison of different evolutionary models, especially in light of different null processes.

1 INTRODUCTION

The shared evolutionary history of closely related species often implies the trait similarity among them, a pattern referred to as ‘phylogenetic signal’ (Blomberg et al., 2003; Felsenstein, 1985; Pagel, 1999). Many phylogenetic comparative methods (PCMs) seek to address the non-independence of species' traits (Felsenstein, 1985; Harvey & Pagel, 1991) in their analytic framing, by conditioning data on the phylogenetic relatedness among observations (Adams, 2014b; Adams & Collyer, 2018; Beaulieu et al., 2012; Garland & Ives, 2000; Grafen, 1989; Martins & Hansen, 1997; O'Meara et al., 2006; Rohlf, 2001). Indeed, under numerous evolutionary models, phylogenetic signal is expected, as stochastic character change along the hierarchical structure of the tree of life generates trait covariation among taxa (Blomberg et al., 2003; Felsenstein, 1985; Revell et al., 2008). Quantifying and comparing phylogenetic signal among traits, however, remains quite challenging.

Several analytical tools have been developed to quantify phylogenetic signal in phenotypic datasets (Abouheif, 1999; Adams, 2014a; Blomberg et al., 2003; Gittleman & Kot, 1990; Klingenberg & Gidaszewski, 2010; Pagel, 1999), and their statistical properties—namely type I error rates and statistical power—have been investigated to determine under what conditions phylogenetic signal can be detected (Adams, 2014a; Boettiger et al., 2012; Diniz-Filho et al., 2012; Molina-Venegas & Rodríguez, 2017; Münkemüller et al., 2012; Pavoine & Ricotta, 2013; Revell, 2010; Revell et al., 2008). One of the most widely used methods for characterizing phylogenetic signal is Pagel's urn:x-wiley:2041210X:media:mee313749:mee313749-math-0004 (Pagel, 1999), which transforms the lengths (by compression) of the internal branches of the phylogeny, while leaving the tips unaffected, to improve the fit of data to the phylogeny via maximum likelihood (Freckleton et al., 2002; Pagel, 1999). To infer whether phylogenetic signal differs from no signal or a Brownian motion (BM) model of evolutionary divergence, the observed model fit using urn:x-wiley:2041210X:media:mee313749:mee313749-math-0005 may be statistically compared with that using urn:x-wiley:2041210X:media:mee313749:mee313749-math-0006 or urn:x-wiley:2041210X:media:mee313749:mee313749-math-0007 via likelihood ratio tests (Bose et al., 2019; Cooper et al., 2010; Freckleton et al., 2002) or confidence limits (Vandelook et al., 2019).

Another widely used measure is Blomberg's urn:x-wiley:2041210X:media:mee313749:mee313749-math-0008 (Blomberg et al., 2003), which characterizes phylogenetic signal as the ratio of observed trait variation to the amount of variation expected under BM. Blomberg's urn:x-wiley:2041210X:media:mee313749:mee313749-math-0009 can be treated as a test statistic by using a permutation test to generate its sampling distribution (Adams, 2014a; Blomberg et al., 2003) for determining whether significant phylogenetic signal is present in data. Both urn:x-wiley:2041210X:media:mee313749:mee313749-math-0010 and urn:x-wiley:2041210X:media:mee313749:mee313749-math-0011 seem intuitive to interpret, as a value of urn:x-wiley:2041210X:media:mee313749:mee313749-math-0012 for both corresponds to no phylogenetic signal, and a value of urn:x-wiley:2041210X:media:mee313749:mee313749-math-0013 corresponds to the amount of phylogenetic signal expected under BM. Thus, it is tempting to regard both urn:x-wiley:2041210X:media:mee313749:mee313749-math-0014 and urn:x-wiley:2041210X:media:mee313749:mee313749-math-0015 as descriptive statistics (and effect sizes, Münkemüller et al., 2012) that measure the relative strength of phylogenetic signal, providing an estimate of its magnitude for comparison.

The potential appeal of Pagel's urn:x-wiley:2041210X:media:mee313749:mee313749-math-0016 and Blomberg's urn:x-wiley:2041210X:media:mee313749:mee313749-math-0017 as effect sizes is that they provide a basis for interpreting ‘weak’ versus ‘strong’ phylogenetic signal; that is, small versus large values of urn:x-wiley:2041210X:media:mee313749:mee313749-math-0018 or urn:x-wiley:2041210X:media:mee313749:mee313749-math-0019, respectively, in a comparative sense (De Meester et al., 2019; Pintanel et al., 2019; Su et al., 2019). They are also important statistics in hypothesis tests. The optimized value of lambda, urn:x-wiley:2041210X:media:mee313749:mee313749-math-0020, is the location where the log-likelihood is maximized, and is, therefore, compelling for finding the maximum phylogenetic signal in the data, which can be deemed ‘significant’ by rejecting the null hypothesis of urn:x-wiley:2041210X:media:mee313749:mee313749-math-0021 in a likelihood ratio test. Although Pagel's urn:x-wiley:2041210X:media:mee313749:mee313749-math-0022 has an upper bound of 1, Blomberg's urn:x-wiley:2041210X:media:mee313749:mee313749-math-0023 can measure phylogenetic signal that is greater than expected under BM, as it has no upper bound. Blomberg's urn:x-wiley:2041210X:media:mee313749:mee313749-math-0024—or more specifically, the GLS estimation of variance that is a part of its calculation—can serve as a test statistic in a permutation test that randomizes ‘tip data’ in random permutations. However, urn:x-wiley:2041210X:media:mee313749:mee313749-math-0025 is quite sensitive to tree size, exhibits high type II error rates for intermediate strength of phylogenetic signal, has higher type I error rates than likelihood ratio tests based on urn:x-wiley:2041210X:media:mee313749:mee313749-math-0026, and exhibits greater uncertainty for strong phylogenetic signals (whereas urn:x-wiley:2041210X:media:mee313749:mee313749-math-0027 has greater uncertainty at intermediate phylogenetic signal strength: Münkemüller et al., 2012). Both of these statistics offer good support as test statistics for determining whether phylogenetic signal exists in a trait, but they are limited for comparing phylogenetic signals between traits.

Here, we present an alternative, standardized effect size calculation, which can be used in hypothesis tests to compare phylogenetic signals for different traits, and which is based on a normalized distribution of random log-likelihoods of a phylogenetic model, generated from a model of phylogenetic independence. Much like a likelihood ratio test for Pagel's urn:x-wiley:2041210X:media:mee313749:mee313749-math-0028, this non-parametric approach can assess the significance of the observed phylogenetic signal, but unlike the parametric test, the standardized location of the observed likelihood can be used as an effect size, which can be statistically compared with similarly calculated effect sizes to consider hypotheses regarding the relative strengths of phylogenetic signal for multiple traits. We use simulation experiments to compare this standardized effect size to urn:x-wiley:2041210X:media:mee313749:mee313749-math-0029 and urn:x-wiley:2041210X:media:mee313749:mee313749-math-0030 and demonstrate its utility with an empirical example. Comprehensively we illustrate that this standardized effect size provides an additional necessary tool to the phylogenetic comparative toolkit.

2 CONCEPTUAL DEVELOPMENT

A hypothesis test for phylogenetic signal involves calculating the variance among taxa trait values, conditioned on phylogenetic covariances (evolutionary rates) and comparing this variance to a variance that assumes phylogenetic independence. This can be appreciated by the GLS log-likelihood equation for a BM phylogenetic model of a univariate trait (Blomberg et al., 2003; Freckleton, 2012; Freckleton et al., 2002; Garland & Ives, 2000):
urn:x-wiley:2041210X:media:mee313749:mee313749-math-0031(1)
where, urn:x-wiley:2041210X:media:mee313749:mee313749-math-0032 is a vector of urn:x-wiley:2041210X:media:mee313749:mee313749-math-0033 trait values, urn:x-wiley:2041210X:media:mee313749:mee313749-math-0034 is a vector of phylogenetic residuals, urn:x-wiley:2041210X:media:mee313749:mee313749-math-0035 is an urn:x-wiley:2041210X:media:mee313749:mee313749-math-0036 phylogenetic covariance matrix, equal to urn:x-wiley:2041210X:media:mee313749:mee313749-math-0037, and urn:x-wiley:2041210X:media:mee313749:mee313749-math-0038 represents its determinant. The urn:x-wiley:2041210X:media:mee313749:mee313749-math-0039 covariance matrix, urn:x-wiley:2041210X:media:mee313749:mee313749-math-0040, is a matrix of phylogenetic variances along the diagonal, and covariances that are proportional to or exactly the covariances from a BM model of evolution (Revell et al., 2008). The variance (evolutionary rate), urn:x-wiley:2041210X:media:mee313749:mee313749-math-0041, is calculated as urn:x-wiley:2041210X:media:mee313749:mee313749-math-0042, where T represents vector transposition. urn:x-wiley:2041210X:media:mee313749:mee313749-math-0043 is used for the maximum likelihood estimate of urn:x-wiley:2041210X:media:mee313749:mee313749-math-0044; urn:x-wiley:2041210X:media:mee313749:mee313749-math-0045 is used in place of urn:x-wiley:2041210X:media:mee313749:mee313749-math-0046 for the restricted maximum likelihood (REML) estimator (Freckleton, 2012). The expected value (tree root) is computed as urn:x-wiley:2041210X:media:mee313749:mee313749-math-0047, where urn:x-wiley:2041210X:media:mee313749:mee313749-math-0048 is an urn:x-wiley:2041210X:media:mee313749:mee313749-math-0049 vector of 1s. Because urn:x-wiley:2041210X:media:mee313749:mee313749-math-0050, Equation 1 can be expanded, that is,
urn:x-wiley:2041210X:media:mee313749:mee313749-math-0051(2)
This expansion helps to elucidate the portions of the log-likelihood equation that are constant when comparing traits. In Equation 2, urn:x-wiley:2041210X:media:mee313749:mee313749-math-0052 is a constant and urn:x-wiley:2041210X:media:mee313749:mee313749-math-0053 for any traits and any trees, for a BM model of evolution. We can, therefore, update Equation 2:
urn:x-wiley:2041210X:media:mee313749:mee313749-math-0054
Thus, Equation 2 can be simplified:
urn:x-wiley:2041210X:media:mee313749:mee313749-math-0055(3)
where urn:x-wiley:2041210X:media:mee313749:mee313749-math-0056 is a constant for all of the parts in Equation 1 that would not be changed by changing urn:x-wiley:2041210X:media:mee313749:mee313749-math-0057. Equation 3 helps one appreciate that in a likelihood ratio test to compare estimates of evolutionary rates (e.g. urn:x-wiley:2041210X:media:mee313749:mee313749-math-0058 and urn:x-wiley:2041210X:media:mee313749:mee313749-math-0059), urn:x-wiley:2041210X:media:mee313749:mee313749-math-0060 and urn:x-wiley:2041210X:media:mee313749:mee313749-math-0061 are the same in the two likelihood calculations; the only parts that change are urn:x-wiley:2041210X:media:mee313749:mee313749-math-0062 (as a proportional change in covariances, based on urn:x-wiley:2041210X:media:mee313749:mee313749-math-0063) and the values of urn:x-wiley:2041210X:media:mee313749:mee313749-math-0064, as a result of urn:x-wiley:2041210X:media:mee313749:mee313749-math-0065. Therefore, a likelihood ratio test is a direct comparison of evolutionary rates.

Pagel's urn:x-wiley:2041210X:media:mee313749:mee313749-math-0066 is a scaling parameter by which the covariances (i.e. the off-diagonals) of urn:x-wiley:2041210X:media:mee313749:mee313749-math-0067 are multiplied (Pagel, 1999). A value of 0 changes all covariances to 0 (a star phylogeny or phylogenetic independence), and a value of 1 does not change the covariances from those expected by a BM model of evolution. (If urn:x-wiley:2041210X:media:mee313749:mee313749-math-0068 and the tree is ultrametric, urn:x-wiley:2041210X:media:mee313749:mee313749-math-0069 is a diagonal matrix proportional to an urn:x-wiley:2041210X:media:mee313749:mee313749-math-0070 identity matrix, urn:x-wiley:2041210X:media:mee313749:mee313749-math-0071. It is convenient in this case to refer to urn:x-wiley:2041210X:media:mee313749:mee313749-math-0072 rather than urn:x-wiley:2041210X:media:mee313749:mee313749-math-0073 as a model of evolutionary independence because the lengths of branches in a star phylogeny—all equal—are inconsequential in estimation of the trait variance.) The value of urn:x-wiley:2041210X:media:mee313749:mee313749-math-0074 that minimizes urn:x-wiley:2041210X:media:mee313749:mee313749-math-0075 maximizes the log-likelihood. This value can be found for the interval between 0 and 1, yielding the optimized value, urn:x-wiley:2041210X:media:mee313749:mee313749-math-0076. In a likelihood ratio test, the log-likelihood at urn:x-wiley:2041210X:media:mee313749:mee313749-math-0077 can be compared with the log-likelihood found at urn:x-wiley:2041210X:media:mee313749:mee313749-math-0078; rejection of the null hypothesis indicates ‘significant’ phylogenetic signal. Likelihood ratio tests could also be used to explicitly test urn:x-wiley:2041210X:media:mee313749:mee313749-math-0079 against a model of pure BM (urn:x-wiley:2041210X:media:mee313749:mee313749-math-0080).

By contrast, Blomberg's urn:x-wiley:2041210X:media:mee313749:mee313749-math-0081 finds urn:x-wiley:2041210X:media:mee313749:mee313749-math-0082 and calculates variance (mean-squared error) from these residuals two different ways: urn:x-wiley:2041210X:media:mee313749:mee313749-math-0083 and urn:x-wiley:2041210X:media:mee313749:mee313749-math-0084, where urn:x-wiley:2041210X:media:mee313749:mee313749-math-0085 is typically the untransformed covariance matrix based on a BM model of evolutionary divergence (urn:x-wiley:2041210X:media:mee313749:mee313749-math-0086). The only difference between urn:x-wiley:2041210X:media:mee313749:mee313749-math-0087 and urn:x-wiley:2041210X:media:mee313749:mee313749-math-0088 in Equation 2 is the use of the REML estimator (urn:x-wiley:2041210X:media:mee313749:mee313749-math-0089 degrees of freedom) for urn:x-wiley:2041210X:media:mee313749:mee313749-math-0090. urn:x-wiley:2041210X:media:mee313749:mee313749-math-0091, however, ignores phylogenetic covariances in its estimation (does not correct for phylogenetic relatedness). Blomberg's urn:x-wiley:2041210X:media:mee313749:mee313749-math-0092 is the ratio, urn:x-wiley:2041210X:media:mee313749:mee313749-math-0093, divided by its expectation under BM for a given phylogeny; that is, urn:x-wiley:2041210X:media:mee313749:mee313749-math-0094 (Blomberg et al., 2003). This equation could be equivalently calculated (Revell et al., 2008), as
urn:x-wiley:2041210X:media:mee313749:mee313749-math-0095(4)
where urn:x-wiley:2041210X:media:mee313749:mee313749-math-0096 is the sum of diagonal elements, and we use the subscript, BM, to indicate this is an untransformed (urn:x-wiley:2041210X:media:mee313749:mee313749-math-0097) version of urn:x-wiley:2041210X:media:mee313749:mee313749-math-0098. Typically, urn:x-wiley:2041210X:media:mee313749:mee313749-math-0099 is also used in the calculation of urn:x-wiley:2041210X:media:mee313749:mee313749-math-0100, but this need not be the case, as least for considering urn:x-wiley:2041210X:media:mee313749:mee313749-math-0101 as a test statistic rather than a descriptive statistic. urn:x-wiley:2041210X:media:mee313749:mee313749-math-0102 will tend toward 0 if there is no phylogenetic signal, and tend toward or exceed 1 if there is. Whereas a likelihood ratio test can be used for Pagel's urn:x-wiley:2041210X:media:mee313749:mee313749-math-0103, a permutation test (which randomizes the trait data across the tips of the phylogeny) is used to generate random distributions of urn:x-wiley:2041210X:media:mee313749:mee313749-math-0104 (e.g. Blomberg et al., 2003) or urn:x-wiley:2041210X:media:mee313749:mee313749-math-0105 (e.g. Adams, 2014a). A p-value is found as the percentile of the observed statistic in its sampling distribution. Because a permutation test and likelihood ratio test are non-parametric and parametric solutions for different test statistics, respectively, it might not be surprising that they could produce different results with respect to the same null hypothesis of no phylogenetic signal. However, it is because of the potential difference in urn:x-wiley:2041210X:media:mee313749:mee313749-math-0106 values used in calculation of the test statistics more so than the statistic or method used that different results are possible. With the same urn:x-wiley:2041210X:media:mee313749:mee313749-math-0107 used to calculate urn:x-wiley:2041210X:media:mee313749:mee313749-math-0108, and thus, urn:x-wiley:2041210X:media:mee313749:mee313749-math-0109, the two tests should produce similar results. This can be appreciated by considering the process that generates variation in the permutation test.
Blomberg et al. (2003) proposed that urn:x-wiley:2041210X:media:mee313749:mee313749-math-0110 could be replaced in Equation 3 by urn:x-wiley:2041210X:media:mee313749:mee313749-math-0111, where urn:x-wiley:2041210X:media:mee313749:mee313749-math-0112 is the ordinary least squares (OLS) mean of urn:x-wiley:2041210X:media:mee313749:mee313749-math-0113. A permutation test that randomizes tip data performed with this altered urn:x-wiley:2041210X:media:mee313749:mee313749-math-0114 statistic or urn:x-wiley:2041210X:media:mee313749:mee313749-math-0115 produces a distribution of values that are perfectly rank correlated because the only random element recalculated in each permutation is urn:x-wiley:2041210X:media:mee313749:mee313749-math-0116. (All other portions of the urn:x-wiley:2041210X:media:mee313749:mee313749-math-0117 calculation would be constant.) It can be appreciated why Blomberg et al. (2003) suggested urn:x-wiley:2041210X:media:mee313749:mee313749-math-0118 as a statistic, and asserted that using urn:x-wiley:2041210X:media:mee313749:mee313749-math-0119 that is transformed (e.g. by urn:x-wiley:2041210X:media:mee313749:mee313749-math-0120) would mean having greater statistical power to detect phylogenetic signal. It can be seen from Equations 3 and 4 that for the same urn:x-wiley:2041210X:media:mee313749:mee313749-math-0121,
urn:x-wiley:2041210X:media:mee313749:mee313749-math-0122(5)

If one optimizes urn:x-wiley:2041210X:media:mee313749:mee313749-math-0123 via maximum likelihood, uses this value to transform urn:x-wiley:2041210X:media:mee313749:mee313749-math-0124, and performs a permutation test on urn:x-wiley:2041210X:media:mee313749:mee313749-math-0125, using urn:x-wiley:2041210X:media:mee313749:mee313749-math-0126 as the test statistic, then a test on urn:x-wiley:2041210X:media:mee313749:mee313749-math-0127 and a test on urn:x-wiley:2041210X:media:mee313749:mee313749-math-0128 are commensurate. Furthermore, such a permutation test can be considered a non-parametric alternative to a likelihood ratio test. (We provide additional empirical detail in Appendix 1 of the Supporting Information that confirms rank correlation.)

Randomizing tip data is a simplified form of randomization of residuals in a permutation procedure (RRPP). RRPP works best if residuals are the most appropriate exchangeable units under the null hypothesis (Adams & Collyer, 2018; Commenges, 2003). RRPP is a process that randomizes null model residuals and adds them to null model fitted values in every random permutation to create random pseudodata used to fit alternative models. If the null hypothesis is phylogenetic independence, a star phylogeny is assumed, urn:x-wiley:2041210X:media:mee313749:mee313749-math-0129, urn:x-wiley:2041210X:media:mee313749:mee313749-math-0130, the OLS mean, urn:x-wiley:2041210X:media:mee313749:mee313749-math-0131 are the OLS residuals, and random outcomes of urn:x-wiley:2041210X:media:mee313749:mee313749-math-0132, where urn:x-wiley:2041210X:media:mee313749:mee313749-math-0133 indicates randomization, are the pseudodata produced in each permutation. If urn:x-wiley:2041210X:media:mee313749:mee313749-math-0134, then randomizing residuals is the same as randomizing tip data (the root and data mean are the same). This process preserves first- and second-moment exchangeability; that is, the OLS mean and variance of the trait are constant across random permutations. (If phylogenetic independence is not assumed, RRPP still functions the same, but has a GLS solution with second-moment exchangeability only.) It is important to understand that when the portions of the log-likelihood expression for phylogenetic independence (summarized as urn:x-wiley:2041210X:media:mee313749:mee313749-math-0135 in Equation 3) are held constant, p-values found from the sampling distributions of either component of Equation 5, the log-likelihood, or the likelihood ratio statistic, urn:x-wiley:2041210X:media:mee313749:mee313749-math-0136, will be exactly the same (as urn:x-wiley:2041210X:media:mee313749:mee313749-math-0137 is also constant in every random permutation). Therefore, a permutation test with RRPP is a non-parametric application of the likelihood ratio statistic in a hypothesis test (Potter, 2005), which does not rely on a mixture of parametric probability distributions as a proxy of the true sampling distribution (Molenberghs & Verbeke, 2007; Self & Liang, 1987). However, it remains to be seen if a permutation test, using the log-likelihood statistic and RRPP, is as reliable as a parametric likelihood ratio test.

Assuming a comparable tree transformation, Pagel's urn:x-wiley:2041210X:media:mee313749:mee313749-math-0138 and Blomberg's urn:x-wiley:2041210X:media:mee313749:mee313749-math-0139 can be considered two phylogenetic signal effect sizes (Münkemüller et al., 2012), but a test of phylogenetic signal, as demonstrated above, is more explicitly an assessment of the rarity of the observed urn:x-wiley:2041210X:media:mee313749:mee313749-math-0140 in a hypothetical distribution of urn:x-wiley:2041210X:media:mee313749:mee313749-math-0141, if urn:x-wiley:2041210X:media:mee313749:mee313749-math-0142 is the null model process. This process can be applied with RRPP and a sampling distribution of either urn:x-wiley:2041210X:media:mee313749:mee313749-math-0143, urn:x-wiley:2041210X:media:mee313749:mee313749-math-0144, or urn:x-wiley:2041210X:media:mee313749:mee313749-math-0145 can be generated, and all would provide the same p-values for the same RRPP permutations (see Appendix 1 of the Supporting Information for an example of this outcome). However, sampling distributions from permutation tests do not need to be a means to an end, a tool to merely find a p-value. The location of the observed statistic in its sampling distribution can also be considered an effect size (Adams & Collyer, 2016, 2018, 2019; Collyer et al., 2015). From, Equation 3, either the non-constant portion of the log-likelihood equation, urn:x-wiley:2041210X:media:mee313749:mee313749-math-0146, or the log-likelihood itself, are good statistics for estimating effect sizes (Equation 3 demonstrates that the log-likelihood is a linear transformation of urn:x-wiley:2041210X:media:mee313749:mee313749-math-0147, so effect sizes estimated from the RRPP distributions of either will be the same. We will henceforth refer to the random forms of urn:x-wiley:2041210X:media:mee313749:mee313749-math-0148, for simplicity.) The location of the observed value in a standardized distribution of random urn:x-wiley:2041210X:media:mee313749:mee313749-math-0149 outcomes, based on the appropriate null hypothesis of phylogenetic independence, provides a standardized (statistical) effect size that can be used in comparative tests (Adams & Collyer, 2016, 2018, 2019; Collyer et al., 2015).

Letting urn:x-wiley:2041210X:media:mee313749:mee313749-math-0150, where urn:x-wiley:2041210X:media:mee313749:mee313749-math-0151 represents a normalizing function (if random urn:x-wiley:2041210X:media:mee313749:mee313749-math-0152 are not sufficiently normally distributed), the standardized effect size of phylogenetic signal for a trait is estimated as,
urn:x-wiley:2041210X:media:mee313749:mee313749-math-0153(6)
where urn:x-wiley:2041210X:media:mee313749:mee313749-math-0154 is the mean of the sampling distribution and urn:x-wiley:2041210X:media:mee313749:mee313749-math-0155 is the standard error (the standard deviation of the sampling distribution, not the trait). (The indicates these values are estimated, based on the number of random permutations used, which is probably fewer than the finite but large possible number of all permutations.) As a standard deviate, we would expect correspondence between a p-value estimated from the location of urn:x-wiley:2041210X:media:mee313749:mee313749-math-0156 in a standard normal distribution and the percentile of urn:x-wiley:2041210X:media:mee313749:mee313749-math-0157 in its sampling distribution. Therefore, it is obvious that, e.g. urn:x-wiley:2041210X:media:mee313749:mee313749-math-0158 means significant phylogenetic signal and urn:x-wiley:2041210X:media:mee313749:mee313749-math-0159 means phylogenetic signal that is not significant, based on a significance level of urn:x-wiley:2041210X:media:mee313749:mee313749-math-0160. More importantly, because sampling distributions are approximately normal, two effect sizes can be compared in a hypothesis test, by finding the two-sample test statistic,
urn:x-wiley:2041210X:media:mee313749:mee313749-math-0161(7)

The urn:x-wiley:2041210X:media:mee313749:mee313749-math-0162 statistic can be assumed to follow a standard normal distribution, meaning a p-value can be obtained for a null hypothesis test that the phylogenetic signals for two separate traits are the same. There is no explicit expectation that the traits have to come from the same phylogeny, but the scope for comparison of traits is something that can only be considered by examining the behavior of these effect sizes for varied tree sizes and phylogenetic signal strength.

Equation 3 implicitly assumes that the compared traits evolve independently, which might be an illogical assumption for traits measured on species from the same phylogeny. In such cases, there are two options worth considering. First, one could generalize the log-likelihood equation (Equation 2) for multivariate data (see, Revell & Harmon, 2008) and consider the relative strength of multivariate phylogenetic signal with respect to the univariate signals. This is not necessarily a simple generalization, if one allows urn:x-wiley:2041210X:media:mee313749:mee313749-math-0163 to vary among traits (requiring urn:x-wiley:2041210X:media:mee313749:mee313749-math-0164 covariance matrix estimations in the log-likelihood for urn:x-wiley:2041210X:media:mee313749:mee313749-math-0165 traits; see Appendix 2 in the Supporting Information for further details). However, one could compare multivariate urn:x-wiley:2041210X:media:mee313749:mee313749-math-0166-scores between models that assume a common urn:x-wiley:2041210X:media:mee313749:mee313749-math-0167 or allow urn:x-wiley:2041210X:media:mee313749:mee313749-math-0168 to vary among traits, as a test of evolutionary independence of traits; much like one can compare models with common or separate evolutionary rates among traits (see, Adams, 2013). (We provide further details for this future research consideration in Appendix 2 of the Supporting Information.) Second, one could compare the relative strength of phylogenetic signal between principal components of a multivariate data set. With this option, the principal components would be independent, but one would have to reconcile principal component loadings with test results to determine whether suites of traits have different phylogenetic signals.

Multivariate considerations are expansive and exceed the scope of this paper. However, RRPP is a process that generates sampling distributions of log-likelihoods in a consistent manner, irrespective of the number of traits. Research questions that require multivariate analysis should have tractable solutions, provided log-likelihoods can be estimated (residual covariance matrices are not singular). Regarding single traits, we perform simulation experiments to determine type I error rates, correspondence between hypothesis test outcomes, statistical power, and the relationship between effect size and simulated phylogenetic strength, below. However, we first provide a simple example to help illustrate the purpose of this type of analysis.

2.1 Illustrative example

As an illustrative example, we simulated two traits on a phylogeny (urn:x-wiley:2041210X:media:mee313749:mee313749-math-0169), one with moderate phylogenetic signal and one with stronger phylogenetic signal. (Simulation details are explained in the next section.) For one variable, urn:x-wiley:2041210X:media:mee313749:mee313749-math-0170, urn:x-wiley:2041210X:media:mee313749:mee313749-math-0171, and for the other, urn:x-wiley:2041210X:media:mee313749:mee313749-math-0172, urn:x-wiley:2041210X:media:mee313749:mee313749-math-0173. We performed RRPP to recalculate the GLS log-likelihoods (using a covariance matrix for a tree transformed by urn:x-wiley:2041210X:media:mee313749:mee313749-math-0174 for each variable), with 10,000 random permutations, each. These distributions were normalized (with a Box-Cox transformation) and standardized (Figure 1), yielding urn:x-wiley:2041210X:media:mee313749:mee313749-math-0175 scores of 2.21 and 6.33 standard deviations, respectively, each of which was significant at urn:x-wiley:2041210X:media:mee313749:mee313749-math-0176 (urn:x-wiley:2041210X:media:mee313749:mee313749-math-0177 for urn:x-wiley:2041210X:media:mee313749:mee313749-math-0178 and urn:x-wiley:2041210X:media:mee313749:mee313749-math-0179 for urn:x-wiley:2041210X:media:mee313749:mee313749-math-0180). Performing parametric likelihood ratio tests with a null model of urn:x-wiley:2041210X:media:mee313749:mee313749-math-0181 yielded slightly different results urn:x-wiley:2041210X:media:mee313749:mee313749-math-0182, and urn:x-wiley:2041210X:media:mee313749:mee313749-math-0183, for urn:x-wiley:2041210X:media:mee313749:mee313749-math-0184 and urn:x-wiley:2041210X:media:mee313749:mee313749-math-0185, respectively. The difference, as we show below, is likely due to the limited statistical power (type II error) of the parametric likelihood ratio test. We performed a two-sample z-test to compare the phylogenetic signal effect sizes; urn:x-wiley:2041210X:media:mee313749:mee313749-math-0186, indicating that the phylogenetic signal in urn:x-wiley:2041210X:media:mee313749:mee313749-math-0187 was significantly larger than in urn:x-wiley:2041210X:media:mee313749:mee313749-math-0188. Comparatively, urn:x-wiley:2041210X:media:mee313749:mee313749-math-0189 and urn:x-wiley:2041210X:media:mee313749:mee313749-math-0190, which like urn:x-wiley:2041210X:media:mee313749:mee313749-math-0191 is not as useful for determining a significant difference between phylogenetic signals.

Details are in the caption following the image
Plot of phylogenetic tree with urn:x-wiley:2041210X:media:mee313749:mee313749-math-0192 values, plus frequency histograms for the RRPP log-likelihood values for two variables, X and Y. Vertical lines indicate observed values. In the last panel, histograms have been combined for standardized values

3 SIMULATION METHODS AND RESULTS

We examined the behavior of RRPP-based likelihood ratio tests and standardized log-likelihood effect size with simulation experiments that varied the strength of phylogenetic signal, for various sized pure-birth phylogenetic trees. In our simulation experiments, we sought to examine the statistical power of likelihood ratio tests with RRPP compared to parametric tests, and the relationships between effect size, Pagel's urn:x-wiley:2041210X:media:mee313749:mee313749-math-0193, and Blomberg's urn:x-wiley:2041210X:media:mee313749:mee313749-math-0194.

3.1 Simulation methods

Simulating data from a model with known phylogenetic signal is challenging, as the process requires an a priori definition of phylogenetic signal, and there is no guarantee that the process will produce data that are similar to the intended effect. It is possible to simulate data with an intended urn:x-wiley:2041210X:media:mee313749:mee313749-math-0195, as this requires merely simulating a tree, rescaling the tree, and simulating BM data on the rescaled tree (see, e.g. Adams, 2014a; Molina-Venegas & Rodríguez, 2017). Alternatively, a weighted average of data simulated with BM and without BM could be used, which Münkemüller et al. (2012) described as the (simulated) BM strength. However, with either approach, there is no guarantee that urn:x-wiley:2041210X:media:mee313749:mee313749-math-0196 will resemble urn:x-wiley:2041210X:media:mee313749:mee313749-math-0197, especially for small trees (see, e.g. Figure 2 of Münkemüller et al., 2012). Furthermore, there is no easily conceivable way to simulate data from a model with known urn:x-wiley:2041210X:media:mee313749:mee313749-math-0198. For previous studies that sought to evaluate statistical properties (type I or type II errors, accuracy, and precision), defining urn:x-wiley:2041210X:media:mee313749:mee313749-math-0199 or a weight of BM strength, as simulated, was sufficient for calculating summary statistics over many simulation runs with the same input value. However, we were more interested in understanding the association of simulated phylogenetic signal strength and the effect size estimated from the log-likelihood of an evolutionary model, over a continuum from urn:x-wiley:2041210X:media:mee313749:mee313749-math-0200 to urn:x-wiley:2041210X:media:mee313749:mee313749-math-0201.

Details are in the caption following the image
Plots of simulations including relative frequencies of urn:x-wiley:2041210X:media:mee313749:mee313749-math-0202 generated (left column), urn:x-wiley:2041210X:media:mee313749:mee313749-math-0203 tested with urn:x-wiley:2041210X:media:mee313749:mee313749-math-0204 (middle column), and urn:x-wiley:2041210X:media:mee313749:mee313749-math-0205 tested with urn:x-wiley:2041210X:media:mee313749:mee313749-math-0206 (right column). Rows separate results by tree size. Points corresponding to non-significant results from permutation tests are colored gray; significant results are scaled, colored and hued according to the magnitude of urn:x-wiley:2041210X:media:mee313749:mee313749-math-0207, as in the left column

Initial trials to simulate data (sensu Adams, 2014a; Molina-Venegas & Rodríguez, 2017) from a uniform distribution of urn:x-wiley:2041210X:media:mee313749:mee313749-math-0208 revealed that, especially with smaller trees, distributions of urn:x-wiley:2041210X:media:mee313749:mee313749-math-0209 tended to be skewed toward 0 or 1, despite uniform sampling of urn:x-wiley:2041210X:media:mee313749:mee313749-math-0210. This was consistent with the research of Münkemüller et al. (2012). (see, e.g. their Figure 2 and their Table 2, which indicates skewing of urn:x-wiley:2041210X:media:mee313749:mee313749-math-0211 toward 0 or 1 for small trees.) Therefore, we used an algorithm to first simulate urn:x-wiley:2041210X:media:mee313749:mee313749-math-0212 from a uniform distribution, and then simulate data that produced urn:x-wiley:2041210X:media:mee313749:mee313749-math-0213 within 5% of urn:x-wiley:2041210X:media:mee313749:mee313749-math-0214 to assure that there was an approximately uniform distribution of urn:x-wiley:2041210X:media:mee313749:mee313749-math-0215 throughout the simulation runs.

We simulated 5,000 pure-birth, ultrametric trees (with a branching rate of 0.05) for each of urn:x-wiley:2041210X:media:mee313749:mee313749-math-0216 sized trees (25,000 trees total). All trees were created with the function, pbtree, from the phytools R package (Revell, 2010). For each tree, we randomly sampled urn:x-wiley:2041210X:media:mee313749:mee313749-math-0217 from a uniform distribution (minimum of 0, maximum of 0.99), scaled the tree branch lengths by urn:x-wiley:2041210X:media:mee313749:mee313749-math-0218, and simulated random BM data on the transformed tree, using the sim.char function of the geiger R package (Harmon et al., 2008). Subsequently, we found the maximum likelihood estimate, urn:x-wiley:2041210X:media:mee313749:mee313749-math-0219 (see code in Appendix 3 of the Supporting Information) from the data generated. We used an upper limit of urn:x-wiley:2041210X:media:mee313749:mee313749-math-0220 because like Cooper et al. (2016), we observed a rare but discernible trend for data simulated with urn:x-wiley:2041210X:media:mee313749:mee313749-math-0221 to not fit as well with a BM model of evolutionary divergence as alternative models, such as Ornstein Uhlenbeck models (Lande, 1976). By using a cut-off of 0.99, instances of urn:x-wiley:2041210X:media:mee313749:mee313749-math-0222 were still frequent, but anomalies from simulating non-BM data were largely mitigated. For every tree we simulated, we repeatedly simulated data until we found urn:x-wiley:2041210X:media:mee313749:mee313749-math-0223 within a 5% interval of urn:x-wiley:2041210X:media:mee313749:mee313749-math-0224, and then retained the data for analysis.

For every simulated tree and its corresponding data, we performed a parametric likelihood ratio test, with urn:x-wiley:2041210X:media:mee313749:mee313749-math-0225 (untransformed) and urn:x-wiley:2041210X:media:mee313749:mee313749-math-0226 (transformed) adjustments of urn:x-wiley:2041210X:media:mee313749:mee313749-math-0227. We also used RRPP to generate distributions of 1,000 random log-likelihoods for each tree:data combination, and for both untransformed and transformed urn:x-wiley:2041210X:media:mee313749:mee313749-math-0228 matrices, from which the percentile of the observed statistic was used to estimate a p-value. The parametric likelihood ratio test performed for urn:x-wiley:2041210X:media:mee313749:mee313749-math-0229 and the permutation test performed for urn:x-wiley:2041210X:media:mee313749:mee313749-math-0230 (null urn:x-wiley:2041210X:media:mee313749:mee313749-math-0231 in both cases) correspond exactly with the tests typically performed for Pagel's urn:x-wiley:2041210X:media:mee313749:mee313749-math-0232 and Blomberg's urn:x-wiley:2041210X:media:mee313749:mee313749-math-0233, respectively. We verified p-values estimated this way were the exact same as using the distribution of random urn:x-wiley:2041210X:media:mee313749:mee313749-math-0234, more typically used for a test of Blomberg's urn:x-wiley:2041210X:media:mee313749:mee313749-math-0235.

urn:x-wiley:2041210X:media:mee313749:mee313749-math-0236 results and standardized log-likelihood effect sizes were plotted against urn:x-wiley:2041210X:media:mee313749:mee313749-math-0237, with points scaled and hued in association with urn:x-wiley:2041210X:media:mee313749:mee313749-math-0238 to visualize patterns. In such plots, points were colored if significant, based on the RRPP permutation test, or gray if not significant. We anticipated that urn:x-wiley:2041210X:media:mee313749:mee313749-math-0239 should correspond to the null hypothesis rejection limit for a one-tailed test, a line we superimposed into plots to visually determine the consistency of effect sizes and hypothesis test results. (We expected colored points to lie above this line and gray points to lie beneath, if effect sizes reflected hypothesis test outcomes.)

Because we had p-values from both parametric and permutation tests, we could create urn:x-wiley:2041210X:media:mee313749:mee313749-math-0240 tables of hypothesis test outcome correspondence, to assess the consistency of parametric and non-parametric tests. These tables report both the consistencies (parametric tests and RRPP tests found same result) and two types of inconsistencies: the parametric test finds a significant result but RRPP does not, or RRPP finds a significant result, but the parametric test does not. The inconsistent results were labeled in plots, along with type I error rates (calculated from the frequency of occasions that for urn:x-wiley:2041210X:media:mee313749:mee313749-math-0241, a significant result was observed). Because we had 5,000 urn:x-wiley:2041210X:media:mee313749:mee313749-math-0242 values, approximately uniform in distribution, we were able to estimate statistical power curves as a moving proportion of null hypothesis rejections across the landscape of urn:x-wiley:2041210X:media:mee313749:mee313749-math-0243 values. We estimated the proportion of rejections for four test types: parametric/untransformed, parametric/transformed, RRPP/untransformed, and RRPP/transformed. Proportions were estimated by culling data by intervals of urn:x-wiley:2041210X:media:mee313749:mee313749-math-0244 and producing a vector of 0s (did not reject null hypothesis) and 1s (rejected the null hypothesis), for each test type. The means of these vectors were the proportion of tests for which the null hypothesis was rejected. We found that an interval length of 0.1 assured more than 500 values for all interior points (urn:x-wiley:2041210X:media:mee313749:mee313749-math-0245), and produced rather smooth curves that were qualitatively as informative as any curves produced with a greater number of intervals.

Additional functions and code for simulation experiments can be found in Appendix 3 of the Supporting Information. Several support functions from the RRPP R package (Collyer & Adams, 2018) were used to create functions to estimate log-likelihoods and effect sizes from RRPP distributions. In some initial simulations, we also considered balanced and pectinate trees. We found no qualitative differences and simulations could only produce new sets of data on the same tree, so we did not consider them further.

3.2 Simulation results

Figure 2 shows urn:x-wiley:2041210X:media:mee313749:mee313749-math-0246 and urn:x-wiley:2041210X:media:mee313749:mee313749-math-0247 results from simulations, and Figure 3 shows urn:x-wiley:2041210X:media:mee313749:mee313749-math-0248 score results, with corresponding points scaled and hued the same, based on the value of urn:x-wiley:2041210X:media:mee313749:mee313749-math-0249 in the first column of Figure 2. The relative frequencies of urn:x-wiley:2041210X:media:mee313749:mee313749-math-0250 suggested the simulations produced approximately uniformly distributed phylogenetic signals. A rejection of the null hypothesis of no phylogenetic signal (significant phylogenetic signal) resulted in points that were colored, with hue changing as urn:x-wiley:2041210X:media:mee313749:mee313749-math-0251 increased; non-significant values were gray in color. These figures allow for visual clarification of various attributes acquired from the simulation runs, such that patterns are easy to interpret. Statistical power curves are shown in Figure 4.

Details are in the caption following the image
Plots of urn:x-wiley:2041210X:media:mee313749:mee313749-math-0252 score simulations for untransformed urn:x-wiley:2041210X:media:mee313749:mee313749-math-0253 (left columns) and transformed urn:x-wiley:2041210X:media:mee313749:mee313749-math-0254 (right column). Points are scaled and colored as in Figure 2. Inconsitencies (from a urn:x-wiley:2041210X:media:mee313749:mee313749-math-0255 correspondence table) between parametric likelihood ratio tests (LRT) and RRPP permutation tests are noted, as well as type I error rates for urn:x-wiley:2041210X:media:mee313749:mee313749-math-0256. The dashed line at urn:x-wiley:2041210X:media:mee313749:mee313749-math-0257 indicates expected separation of significant and non-significant results, based on a level of significance of urn:x-wiley:2041210X:media:mee313749:mee313749-math-0258
Details are in the caption following the image
Statistical power curves for indicated methods: parametric likelihood ratio test (LRT), and RRPP. Whether the urn:x-wiley:2041210X:media:mee313749:mee313749-math-0259 matrix is transformed is noted

Regarding Pagel's urn:x-wiley:2041210X:media:mee313749:mee313749-math-0260 and Blomberg's urn:x-wiley:2041210X:media:mee313749:mee313749-math-0261, our simulation results tracked the results of Münkemüller et al. (2012) in one particular way (Figure 2). It was possible to simulate larger urn:x-wiley:2041210X:media:mee313749:mee313749-math-0262 values for smaller trees, but within any tree size, urn:x-wiley:2041210X:media:mee313749:mee313749-math-0263 tended to be less than 1 except for the largest simulated urn:x-wiley:2041210X:media:mee313749:mee313749-math-0264 values. Despite this trend, the hypothesis test results using alternative transformations of the urn:x-wiley:2041210X:media:mee313749:mee313749-math-0265 matrix for estimation of urn:x-wiley:2041210X:media:mee313749:mee313749-math-0266 as a test statistic for urn:x-wiley:2041210X:media:mee313749:mee313749-math-0267 revealed profound differences. In Figure 2, significant or non-significant urn:x-wiley:2041210X:media:mee313749:mee313749-math-0268 values can be found for any urn:x-wiley:2041210X:media:mee313749:mee313749-math-0269, if urn:x-wiley:2041210X:media:mee313749:mee313749-math-0270 is forced in the test statistic, which is the common way this test is performed (middle column). The simple act of using urn:x-wiley:2041210X:media:mee313749:mee313749-math-0271 and urn:x-wiley:2041210X:media:mee313749:mee313749-math-0272 as a test statistic (not urn:x-wiley:2041210X:media:mee313749:mee313749-math-0273) alleviated this concern, and was consistent with the assertion of Blomberg et al. (2003) that doing so increases statistical power (Figure 4). Forcing urn:x-wiley:2041210X:media:mee313749:mee313749-math-0274 for hypothesis tests of urn:x-wiley:2041210X:media:mee313749:mee313749-math-0275 also elevated type I error rates, but they were still close to the nominal urn:x-wiley:2041210X:media:mee313749:mee313749-math-0276 level.

Issues with urn:x-wiley:2041210X:media:mee313749:mee313749-math-0277 forced to be equal to 1 were also revealed by using a standardized effect size based on the location of urn:x-wiley:2041210X:media:mee313749:mee313749-math-0278 in its RRPP-generated sampling distribution. Significant and non-significant results spanned the entire range of urn:x-wiley:2041210X:media:mee313749:mee313749-math-0279 (Figure 3). These results are not surprising, as they do not seek to maximize likelihood, but help to confirm that the permutation test with urn:x-wiley:2041210X:media:mee313749:mee313749-math-0280, using urn:x-wiley:2041210X:media:mee313749:mee313749-math-0281 as a test statistic, is flawed (since urn:x-wiley:2041210X:media:mee313749:mee313749-math-0282 might not be minimized via a best fit of the tree to the comparative data). Furthermore, using urn:x-wiley:2041210X:media:mee313749:mee313749-math-0283 as an effect size if urn:x-wiley:2041210X:media:mee313749:mee313749-math-0284 is forced to equal 1 makes little sense because of its curvilinear association with urn:x-wiley:2041210X:media:mee313749:mee313749-math-0285 (Figure 3). However, for cases where urn:x-wiley:2041210X:media:mee313749:mee313749-math-0286, both the permutation test on the log-likelihood statistic as well as the urn:x-wiley:2041210X:media:mee313749:mee313749-math-0287 score from the RRPP sampling distribution, as an effect size, had several desirable attributes.

First, the permutation test for the log-likelihood of the evolutionary model had greater statistical power than the parametric likelihood ratio test (Figure 4). A statistical power advantage was greatest for smaller trees, and the power curves of the two methods tended to converge with larger trees. The cases of inconsistent results from the urn:x-wiley:2041210X:media:mee313749:mee313749-math-0288 hypothesis test correspondence tables (Figure 3) were always due to the permutation test finding significant results when the parametric likelihood ratio test did not, but the rate of inconsistencies decreased with increased tree size. (By contrast, if urn:x-wiley:2041210X:media:mee313749:mee313749-math-0289 is forced, the rate of inconsistencies increased with tree size.) A likelihood ratio statistic only asymptotically follows a urn:x-wiley:2041210X:media:mee313749:mee313749-math-0290 distribution, as urn:x-wiley:2041210X:media:mee313749:mee313749-math-0291 (Wilks, 1938), so it is not surprising that a parametric likelihood ratio test would have larger type II error rates with small tree sizes. Furthermore, the asymptotic null distribution for a one-sided likelihood ratio statistic, in which null hypotheses are at the limits of the constrained parameter space (urn:x-wiley:2041210X:media:mee313749:mee313749-math-0292 or urn:x-wiley:2041210X:media:mee313749:mee313749-math-0293), is a mixture of two urn:x-wiley:2041210X:media:mee313749:mee313749-math-0294 distributions (Molenberghs & Verbeke, 2007; Verbeke & Molenberghs, 2003). Generally, an unconstrained urn:x-wiley:2041210X:media:mee313749:mee313749-math-0295 statistic is reported, but a p-value is considered to be 1/2 of the classical urn:x-wiley:2041210X:media:mee313749:mee313749-math-0296 approximation, when mixture proportions are equal. Therefore, a tendency toward high statistical power might be expected for traits from large trees with likelihood ratio tests. Nonetheless, the statistical power was as good or better with permutation tests in our results, irrespective of tree size. In addition to having greater statistical power, the RRPP sampling distribution allows the standard deviate of the observed log-likelihood to be used as an effect size (urn:x-wiley:2041210X:media:mee313749:mee313749-math-0297), which also has nice attributes.

Second, urn:x-wiley:2041210X:media:mee313749:mee313749-math-0298 based on a maximum likelihood estimate of urn:x-wiley:2041210X:media:mee313749:mee313749-math-0299 has a linear association with urn:x-wiley:2041210X:media:mee313749:mee313749-math-0300. The slope of this linear association increases with tree size, unfortunately, as it is not possible to disentangle a goodness of fit (urn:x-wiley:2041210X:media:mee313749:mee313749-math-0301) from the size of a tree. Thus, one might consider comparing effect sizes for traits from two vastly different trees with caution. The range of simulated urn:x-wiley:2041210X:media:mee313749:mee313749-math-0302 also increased with phylogenetic signal strength and tree size (Figure 3). This result can be explained by the fact that for a large value of urn:x-wiley:2041210X:media:mee313749:mee313749-math-0303, also for a large tree, the breadth of possible urn:x-wiley:2041210X:media:mee313749:mee313749-math-0304 values in RRPP permutations increases, so it is also possible to have a larger span of possible urn:x-wiley:2041210X:media:mee313749:mee313749-math-0305 values.

Third, with urn:x-wiley:2041210X:media:mee313749:mee313749-math-0306, there was a clear demarcation of urn:x-wiley:2041210X:media:mee313749:mee313749-math-0307 above a value of 1.96 corresponding to significant hypothesis test outcomes, especially for tests with urn:x-wiley:2041210X:media:mee313749:mee313749-math-0308 (Figure 3). This is helpful, as an effect size of say, urn:x-wiley:2041210X:media:mee313749:mee313749-math-0309 reported from an empirical study, indicates significant phylogenetic signal. We found sampling distributions to be consistently normally distributed (see Figure 1 as an example), especially for larger trees. Results in Figure 3 (second column) had a distinction of significant results consistently for urn:x-wiley:2041210X:media:mee313749:mee313749-math-0310. A reliance on the normal distribution of RRPP sampling distributions means that two-sample urn:x-wiley:2041210X:media:mee313749:mee313749-math-0311 statistics are also reliable, and quick interpretation of urn:x-wiley:2041210X:media:mee313749:mee313749-math-0312 to, e.g. urn:x-wiley:2041210X:media:mee313749:mee313749-math-0313, for two traits from the same tree, indicates which has greater phylogenetic signal.

Ideally, there would have been no relationship between urn:x-wiley:2041210X:media:mee313749:mee313749-math-0314 and tree size, but such an expectation would be unwarranted, as phylogenetic signal is inherently related to the largeness of the phylogeny. However, we determined that there was a precise relationship between tree size and the slope of urn:x-wiley:2041210X:media:mee313749:mee313749-math-0315 with respect to urn:x-wiley:2041210X:media:mee313749:mee313749-math-0316. The slopes of the lines in Figure 3 fit (nearly perfectly) the relationship, urn:x-wiley:2041210X:media:mee313749:mee313749-math-0317. Thus, the expected value of urn:x-wiley:2041210X:media:mee313749:mee313749-math-0318, given urn:x-wiley:2041210X:media:mee313749:mee313749-math-0319 and urn:x-wiley:2041210X:media:mee313749:mee313749-math-0320 is urn:x-wiley:2041210X:media:mee313749:mee313749-math-0321. One can calculate urn:x-wiley:2041210X:media:mee313749:mee313749-math-0322 for the traits (see Figure 5) that are compared to ascertain if urn:x-wiley:2041210X:media:mee313749:mee313749-math-0323 is larger (more positive) or lesser (more negative) than expected, given the tree size and optimized value of urn:x-wiley:2041210X:media:mee313749:mee313749-math-0324. This adjustment could be seen at best as a tool to help understand the multifarious nature of phylogenetic signal, rather than fix urn:x-wiley:2041210X:media:mee313749:mee313749-math-0325 for comparative tests. For example, when comparing traits from two different trees, more positive values of urn:x-wiley:2041210X:media:mee313749:mee313749-math-0326 might be considered stronger phylogenetic signal, if urn:x-wiley:2041210X:media:mee313749:mee313749-math-0327 are comparable.

Details are in the caption following the image
urn:x-wiley:2041210X:media:mee313749:mee313749-math-0328 scores from Figure 3 (right column) after subtracting the expected value of urn:x-wiley:2041210X:media:mee313749:mee313749-math-0329, based on urn:x-wiley:2041210X:media:mee313749:mee313749-math-0330 and urn:x-wiley:2041210X:media:mee313749:mee313749-math-0331

4 EMPIRICAL EXAMPLE

To demonstrate the utility of urn:x-wiley:2041210X:media:mee313749:mee313749-math-0332, we compared urn:x-wiley:2041210X:media:mee313749:mee313749-math-0333 for two ecologically-relevant traits in plethodontid salamanders (Figure 6): surface area to volume ratios urn:x-wiley:2041210X:media:mee313749:mee313749-math-0334 and relative (to snout to vent length) body width urn:x-wiley:2041210X:media:mee313749:mee313749-math-0335 (Baken & Adams, 2019; Baken et al., 2020). For this example, surface area to volume ratios and relative body width measures were obtained from individuals of 305 species, from which species means were obtained (Baken & Adams, 2019; Baken et al., 2020). A time-dated molecular phylogeny for the group (Bonett & Blair, 2017) was pruned to match the species in the phenotypic dataset. The phylogenetic signal effect size in each trait was obtained from 10,000 RRPP permutations, using functions described in Appendix 3 of the Supporting Information. The absolute value of the two-sample effect size (Equation 5) was calculated, as we had no a priori expectation of direction in the hypothesis test; i.e. it was treated as a two-tailed hypothesis test.

Details are in the caption following the image
A Description of traits compared; B comparison of traits via standardized effect sizes (shown as locations in standardized sampling distributions, after transforming sampling distributions of log-likelihoods)

Although both traits contained significant phylogenetic signal (urn:x-wiley:2041210X:media:mee313749:mee313749-math-0336 and urn:x-wiley:2041210X:media:mee313749:mee313749-math-0337), a test based on urn:x-wiley:2041210X:media:mee313749:mee313749-math-0338 revealed that the degree of phylogenetic signal was significantly stronger in urn:x-wiley:2041210X:media:mee313749:mee313749-math-0339 (urn:x-wiley:2041210X:media:mee313749:mee313749-math-0340; urn:x-wiley:2041210X:media:mee313749:mee313749-math-0341: Figure 5). Biologically, this observation may be interpreted by the fact that the tropical species—which form a monophyletic group within plethodontids—display greater variation in urn:x-wiley:2041210X:media:mee313749:mee313749-math-0342, which covaries with disparity in their climatic niches (Baken et al., 2020). Thus, greater phylogenetic signal in urn:x-wiley:2041210X:media:mee313749:mee313749-math-0343 is to be expected. Coincidentally, urn:x-wiley:2041210X:media:mee313749:mee313749-math-0344 was 0.76 and 0.91, and urn:x-wiley:2041210X:media:mee313749:mee313749-math-0345 was 0.25 and 0.76, for urn:x-wiley:2041210X:media:mee313749:mee313749-math-0346 and urn:x-wiley:2041210X:media:mee313749:mee313749-math-0347, respectively.

5 DISCUSSION

To be able to ask if traits differ in their amount of phylogenetic signal, resolving how to best measure phylogenetic signal is essential. In this study, we considered the two most common measures of phylogenetic signal, and our simulation results did not dispute any issues that were already known about these measures. For example, the precision to estimate urn:x-wiley:2041210X:media:mee313749:mee313749-math-0348 is tree-dependent, with more taxa-rich trees required for better precision (Boettiger et al., 2012; Münkemüller et al., 2012). urn:x-wiley:2041210X:media:mee313749:mee313749-math-0349 does not scale linearly with increased phylogenetic signal strength, and its variance is positively associated with phylogenetic signal strength (Diniz-Filho et al., 2012; Münkemüller et al., 2012). Our simulation results confirmed these attributes. These issues make the comparison of phylogenetic signals challenging, even if only qualitatively comparing urn:x-wiley:2041210X:media:mee313749:mee313749-math-0350 or urn:x-wiley:2041210X:media:mee313749:mee313749-math-0351 between traits, for the same phylogeny. That there has been no statistical test only makes inference more speculative.

In this study, we made three important advances for the comparison of phylogenetic signals among different traits. First, we demonstrated that a permutation-based procedure (RRPP) using the log-likelihood as a statistic is not only reliable but performs better than a parametric test, especially for smaller trees. Second, we demonstrated that if the RRPP-log-likelihood permutation test is used, a test of urn:x-wiley:2041210X:media:mee313749:mee313749-math-0352 and urn:x-wiley:2041210X:media:mee313749:mee313749-math-0353 are the same, provided that urn:x-wiley:2041210X:media:mee313749:mee313749-math-0354 is transformed by urn:x-wiley:2041210X:media:mee313749:mee313749-math-0355 in the calculation of the GLS variance that is at the heart of the calculation of either statistic. Indeed, Blomberg et al. (2003) introduced urn:x-wiley:2041210X:media:mee313749:mee313749-math-0356 as a statistic that had an associated permutation test, based on a distribution of urn:x-wiley:2041210X:media:mee313749:mee313749-math-0357, not urn:x-wiley:2041210X:media:mee313749:mee313749-math-0358, noting that statistical power would be higher if urn:x-wiley:2041210X:media:mee313749:mee313749-math-0359 was calculated from a transformed version of urn:x-wiley:2041210X:media:mee313749:mee313749-math-0360 that resulted in better fit of the tree to the data. Because urn:x-wiley:2041210X:media:mee313749:mee313749-math-0361 and urn:x-wiley:2041210X:media:mee313749:mee313749-math-0362 are perfectly rank-order correlated for the same set of RRPP permutations, viewing urn:x-wiley:2041210X:media:mee313749:mee313749-math-0363 and urn:x-wiley:2041210X:media:mee313749:mee313749-math-0364 as statistics that have different hypothesis test outcomes is not necessary. Previous simulation studies have found differences between them, but did so by relying on adjudication of urn:x-wiley:2041210X:media:mee313749:mee313749-math-0365 by a parametric likelihood ratio test (urn:x-wiley:2041210X:media:mee313749:mee313749-math-0366 transformed by urn:x-wiley:2041210X:media:mee313749:mee313749-math-0367), and urn:x-wiley:2041210X:media:mee313749:mee313749-math-0368 by a permutation test with no transformation of urn:x-wiley:2041210X:media:mee313749:mee313749-math-0369 (urn:x-wiley:2041210X:media:mee313749:mee313749-math-0370) (see, e.g. Molina-Venegas & Rodríguez, 2017; Münkemüller et al., 2012). Our work reveals that these differences were the result of the incommensurate transformation step, and not in test statistic performance, per se. Third, having demonstrated that a test of phylogenetic signal is a test of the rarity of the observed urn:x-wiley:2041210X:media:mee313749:mee313749-math-0371 in a distribution of random outcomes, generated by a null model of phylogenetic independence, we can measure phylogenetic signal an alternative way: as the standardized location of the observed urn:x-wiley:2041210X:media:mee313749:mee313749-math-0372 in the RRPP-generated distribution of random values (i.e. as an effect size). This alternative makes it possible to perform hypothesis tests for the comparison of the strength of phylogenetic signal across traits.

This third advance is important but perhaps unsettling. The convenience of urn:x-wiley:2041210X:media:mee313749:mee313749-math-0373 or urn:x-wiley:2041210X:media:mee313749:mee313749-math-0374 is that a value of 0 should mean data devoid of phylogenetic signal, and a value of 1 should mean data have a phylogenetic signal that matches a BM model of evolutionary divergence. By contrast, a urn:x-wiley:2041210X:media:mee313749:mee313749-math-0375-score is a value measured in standard deviations that indicates a location in a normal distribution relative to expectation (mean), given a null model of phylogenetic independence. As a measure of the degree of phylogenetic signal, urn:x-wiley:2041210X:media:mee313749:mee313749-math-0376, might feel less intuitively comfortable. However, this discomfort is perhaps predicated on one's definition of phylogenetic signal. For example, in a tree with 128 taxa (Figure 3), for a value of urn:x-wiley:2041210X:media:mee313749:mee313749-math-0377, urn:x-wiley:2041210X:media:mee313749:mee313749-math-0378 might range from 2 to 17. If urn:x-wiley:2041210X:media:mee313749:mee313749-math-0379 is the definition of phylogenetic signal strength, the range of effect sizes suggest that urn:x-wiley:2041210X:media:mee313749:mee313749-math-0380 is not a good effect size to consider. Conversely, an effect size urn:x-wiley:2041210X:media:mee313749:mee313749-math-0381 might be found to have urn:x-wiley:2041210X:media:mee313749:mee313749-math-0382 range from 0.1 to 1.0. That is, if urn:x-wiley:2041210X:media:mee313749:mee313749-math-0383 is the measure of phylogenetic signal, a corresponding urn:x-wiley:2041210X:media:mee313749:mee313749-math-0384 indicates the tree transformation that best reveals the phylogenetic signal, not the amount of phylogenetic signal. However, the appeal of urn:x-wiley:2041210X:media:mee313749:mee313749-math-0385 is that it allows a statistical comparison of the phylogenetic signals from multiple traits, but those traits might also have quite different values of urn:x-wiley:2041210X:media:mee313749:mee313749-math-0386 or urn:x-wiley:2041210X:media:mee313749:mee313749-math-0387. The best analysis is probably one that statistically compares urn:x-wiley:2041210X:media:mee313749:mee313749-math-0388 but also reports both urn:x-wiley:2041210X:media:mee313749:mee313749-math-0389 and urn:x-wiley:2041210X:media:mee313749:mee313749-math-0390, as these two statistics have important meaning: the optimized branch-length transformation and a ratio that expresses the relative amount of BM contribution to the GLS variance estimate, respectively. That is, one can report urn:x-wiley:2041210X:media:mee313749:mee313749-math-0391, urn:x-wiley:2041210X:media:mee313749:mee313749-math-0392, urn:x-wiley:2041210X:media:mee313749:mee313749-math-0393, and one p-value, and not have to view phylogenetic signal statistics as a means to an end for different statistical tests.

It is common for researchers to report ‘weak but significant’ phylogenetic signal when urn:x-wiley:2041210X:media:mee313749:mee313749-math-0394 is considerably less than 1 but the null hypothesis test is rejected. We also demonstrated with our simulations that it is possible to find ‘significant’ phylogenetic signal when urn:x-wiley:2041210X:media:mee313749:mee313749-math-0395 is small compared to a non-significant result when urn:x-wiley:2041210X:media:mee313749:mee313749-math-0396 is forced to be equal to 1 (compare plots between left and right columns of Figure 3). Our work demonstrates that it is not helpful to declare ‘weak but significant’ phylogenetic signal (especially if not simultaneously reporting ‘strong but not significant’ phylogenetic signal by increasing urn:x-wiley:2041210X:media:mee313749:mee313749-math-0397), based on urn:x-wiley:2041210X:media:mee313749:mee313749-math-0398 or urn:x-wiley:2041210X:media:mee313749:mee313749-math-0399 values. However, we feel it is more appropriate to declare urn:x-wiley:2041210X:media:mee313749:mee313749-math-0400 as weak but significant, compared with say, urn:x-wiley:2041210X:media:mee313749:mee313749-math-0401, which is strong and significant. Phylogenetic signal ‘strength’ can be viewed as measure of rarity to generate such a strong signal by chance, which urn:x-wiley:2041210X:media:mee313749:mee313749-math-0402 describes well. Although urn:x-wiley:2041210X:media:mee313749:mee313749-math-0403 and urn:x-wiley:2041210X:media:mee313749:mee313749-math-0404 are useful statistics, their ability to discern strong versus weak phylogenetic signal is questionable. Only urn:x-wiley:2041210X:media:mee313749:mee313749-math-0405, which is a statistical effect size, affords this statistical interpretation. However, an interpretation of phylogenetic strength still cannot be made independent of phylogeny size.

One less desirable outcome of our simulations is that urn:x-wiley:2041210X:media:mee313749:mee313749-math-0406 (more precisely the slope between urn:x-wiley:2041210X:media:mee313749:mee313749-math-0407 and urn:x-wiley:2041210X:media:mee313749:mee313749-math-0408) was positively associated with urn:x-wiley:2041210X:media:mee313749:mee313749-math-0409, the number of taxa represented in a tree. We were able to demonstrate that the slope between urn:x-wiley:2041210X:media:mee313749:mee313749-math-0410 and urn:x-wiley:2041210X:media:mee313749:mee313749-math-0411 is predicted by urn:x-wiley:2041210X:media:mee313749:mee313749-math-0412, such that one could find an expected value of urn:x-wiley:2041210X:media:mee313749:mee313749-math-0413, given urn:x-wiley:2041210X:media:mee313749:mee313749-math-0414 and urn:x-wiley:2041210X:media:mee313749:mee313749-math-0415. When comparing the same or multiple traits between trees, this added step might help to better elucidate differences between a two-sample test of phylogenetic signal, especially if urn:x-wiley:2041210X:media:mee313749:mee313749-math-0416 is significant, but it is not clear if the test result is because of differences in phylogenetic signal strength or tree size. This would not be a panacea, as it would also involve using urn:x-wiley:2041210X:media:mee313749:mee313749-math-0417, which could vary between traits, but it is a tool that might assist inferences made about differences in phylogenetic signal involving multiple trees, or different urn:x-wiley:2041210X:media:mee313749:mee313749-math-0418 scores also involving different urn:x-wiley:2041210X:media:mee313749:mee313749-math-0419 transformations.

Although using urn:x-wiley:2041210X:media:mee313749:mee313749-math-0420 scores for comparative analysis offers new opportunities, it also presents new challenges. Chief among the challenges that will have to be addressed is how to generalize the urn:x-wiley:2041210X:media:mee313749:mee313749-math-0421 score as an effect size for multivariate data, especially if the number of variables precludes calculating log-likelihood. We see three possible approaches. First, like the generalization of the urn:x-wiley:2041210X:media:mee313749:mee313749-math-0422 statistic for multivariate data (Adams, 2014a), it might be possible to use the trace of the evolutionary rate matrix, rather than the matrix determinant, which would not be variable-limited (for example, urn:x-wiley:2041210X:media:mee313749:mee313749-math-0423 could be generalized by taking either the trace or determinant of urn:x-wiley:2041210X:media:mee313749:mee313749-math-0424, the multivariate generalization of urn:x-wiley:2041210X:media:mee313749:mee313749-math-0425). Research demonstrating the adequacy of this approach would be needed, and certainly, the random outcomes could not be called log-likelihoods, but if the sampling distributions of log-likelihoods and modified statistics using traces were commensurate for comparable sets of variables, and yielded similar urn:x-wiley:2041210X:media:mee313749:mee313749-math-0426 scores, then using an alternative generalization would be possible. Second, one could use a penalized-log-likelihood based on a regularization of near-singular or singular urn:x-wiley:2041210X:media:mee313749:mee313749-math-0427 matrices (Clavel et al., 2019). Because this approach assures a urn:x-wiley:2041210X:media:mee313749:mee313749-math-0428 matrix that is positive-definite and invertable, it also assures that urn:x-wiley:2041210X:media:mee313749:mee313749-math-0429 can be estimated in every random permutation. Whether, the distribution of random urn:x-wiley:2041210X:media:mee313749:mee313749-math-0430 obtained from RRPP, followed by regularization in each permutation, yields appropriate sampling distributions would remain to be seen. The statistical properties have been adjudicated using a penalized-likelihood framework for evaluating Wilks' urn:x-wiley:2041210X:media:mee313749:mee313749-math-0431, with RRPP (Clavel & Morlon, 2020), so there is promise that this framework would also work for calculating multivariate urn:x-wiley:2041210X:media:mee313749:mee313749-math-0432 scores.

The third potential solution is to use phylogenetically aligned component analysis (PACA; Collyer & Adams, 2021) as a dimension reduction method. The perils of data reduction before likelihood estimation have been clearly demonstrated (Uyeda et al., 2015), but this was for cases where the data reduction method (principal component analysis, PCA) did not find components specifically aligned to phylogenetic signal. PACA specifically aligns components to phylogenetic signal, such that greater phylogenetic signal—rather than variance—is predominantly found in the first few components. It might be possible to use a subset of data dimensions that contain most or all phylogenetic signal to estimate pseudo-likelihoods. Again, it might not be sufficient to refer to a statistic calculated this way as model likelihood, but if random outcomes across many permutations produce a sampling distribution that yields similar urn:x-wiley:2041210X:media:mee313749:mee313749-math-0433 values in fewer dimensions, it might be trusted for estimating urn:x-wiley:2041210X:media:mee313749:mee313749-math-0434 for highly multivariate data.

Regardless of these three possible solutions, another consideration is whether different variables could have different urn:x-wiley:2041210X:media:mee313749:mee313749-math-0435 in the estimation of log-likelihoods; that is, can it be assumed traits evolve independently? In Appendix 2 of the Supporting Information, we outline a method for calculating log-likelihoods for multivariate data, both assuming common and independent urn:x-wiley:2041210X:media:mee313749:mee313749-math-0436 for traits. Model selection could be used to compare these two likelihoods to ascertain if traits evolve independently, and if so, the two-sample urn:x-wiley:2041210X:media:mee313749:mee313749-math-0437 test described here could be used to determine which traits have greater phylogenetic signal. However, appropriate optimization methods for multiple urn:x-wiley:2041210X:media:mee313749:mee313749-math-0438 should be rigorously researched, in addition to the statistical properties of different likelihood estimators, before solutions for multivariate traits are eagerly embraced.

Regardless of future challenges, the ability to estimate an effect size that can be used for hypothesis tests to compare phylogenetic signal in multiple traits, as a tool, is a boon for the PCM toolkit. We feel that measuring phylogenetic signal is more nuanced than using a single statistic, but adding urn:x-wiley:2041210X:media:mee313749:mee313749-math-0439 to the suite of statistics used can help decipher between weak and strong phylogenetic signals, rather than misinterpreting values of urn:x-wiley:2041210X:media:mee313749:mee313749-math-0440 or urn:x-wiley:2041210X:media:mee313749:mee313749-math-0441. Our scope of investigation concerned BM models of evolutionary divergence and one transformation parameter, Pagel's urn:x-wiley:2041210X:media:mee313749:mee313749-math-0442. Pagel's urn:x-wiley:2041210X:media:mee313749:mee313749-math-0443 is generally considered to be most associated with phylogenetic signal, but one could also consider using RRPP with additional transformation parameters, including urn:x-wiley:2041210X:media:mee313749:mee313749-math-0444 and urn:x-wiley:2041210X:media:mee313749:mee313749-math-0445 (Pagel, 1999). Because the transformation of the urn:x-wiley:2041210X:media:mee313749:mee313749-math-0446 matrix is an a priori step and this transformation is retained through random permutations, it would be easy to extend the RRPP-log-likelihood computations to additional urn:x-wiley:2041210X:media:mee313749:mee313749-math-0447 matrix transformations. Furthermore, RRPP could be used with alternative models of evolution (e.g. multi-rate Brownian, early burst, OU, AC/DC), recognizing that the simplifications we made from Equations 1 to 3 would not be the same. Random versions of urn:x-wiley:2041210X:media:mee313749:mee313749-math-0448 in Equation 1 would have to be calculated in each RRPP permutation, accounting appropriately for parameters that are fixed or variable in each permutation. Insomuch as phylogenetic signal effect size (using urn:x-wiley:2041210X:media:mee313749:mee313749-math-0449) is a measure of the fit of tree to comparative data for a BM model of evolution, any similar approach can be considered a model effect size for an alternative evolutionary model. Thus, there could be some appeal with using the RRPP-log-likelihood effect size as a model selection criterion, especially because multiple models could be compared, not just assuming a null model of phylogenetic independence, but other null models as well. (For example, a single rate BM model could serve as null model for various multi-rate alternative models. Comparison of urn:x-wiley:2041210X:media:mee313749:mee313749-math-0450 among the models, using both phylogenetic independence and BM as different null models, could be valuable for inferring the best evolutionary model for trait data.)

In answering certain evolutionary questions, such as comparing the strength of phylogenetic signal, traditional parametric approaches offer challenges. First, the parameter space of urn:x-wiley:2041210X:media:mee313749:mee313749-math-0451 is bounded, and thus, a mixture of urn:x-wiley:2041210X:media:mee313749:mee313749-math-0452 distributions is required as a proxy for a sampling distribution (Molenberghs & Verbeke, 2007; Self & Liang, 1987; Verbeke & Molenberghs, 2003). Second, urn:x-wiley:2041210X:media:mee313749:mee313749-math-0453 distributions are asymptotically appropriate for likelihood ratio statistics for very large sample sizes (Wilks, 1938), a situation rarely afforded when working with phylogenies. The permutation test we presented is not constrained to use a parametric probability distribution as a proxy, and is additionally capable of providing effect sizes, which are comparable across datasets to evaluate comparative hypotheses. Prior work (Adams & Collyer, 2018) has shown that empirical sampling distributions generated from RRPP match nearly perfectly the parametric urn:x-wiley:2041210X:media:mee313749:mee313749-math-0454-distributions typically used in ANOVA, when data are simulated to match the assumptions of ANOVA. Based on our work here, one might speculate that RRPP-generated sampling distributions are better proxies for statistics without appropriate parametric sampling distributions and converge on parametric distributions in cases where sampling distribution solutions are tractable. When viewed from this perspective, permutation methods such as RRPP should not be considered mere analytical band-aids to be used for ill-conditioned datasets, or scenarios where standard tests are not applicable. Rather, they are equivalent to parametric procedures for ‘standard’ biostatistical problems and can supersede them in cases where parametric methods are not applicable. Thus, our perspective is that this work helps to continue to pave the way for advancement of PCMs as sets of tools that take advantage of the computational power of modern computers rather than force evolutionary biology questions into limited traditional statistics applications.

ACKNOWLEDGEMENTS

We thank Elizabeth Glynne and Bryan Juarez for comments on early drafts of the manuscript, and Simone Blomberg and Joe Felsenstein for their helpful comments on the penultimate draft. This work was sponsored in part by National Science Foundation Awards DBI-1902694 (to M.L.C.) and DBI-1902511 (to D.C.A.). National Science Foundation (DBI-1902511, DBI-1902694).

    CONFLICTS OF INTEREST

    The authors have no conficts of interest to declare.

    AUTHORS' CONTRIBUTIONS

    D.C.A. conceived the original idea for an effect size measure of phylogenetic signal; M.L.C. developed the log-likelihood permutation procedure that was implemented and D.C.A., E.K.B., and M.L.C. collaboratively developed the concept and contributed to all portions of this manuscript. All authors approve of the final product and are willingly accountable for any portion of the content.

    DATA AVAILABILITY STATEMENT

    Empirical data used in this paper are available on Dryad (associated with original article, Baken et al., 2020), https://doi.org/10.5061/dryad.59zw3r23m. R-scripts for simulated data tests are provided as Supporting Information.