Macroevolutionary studies frequently characterize the phylogenetic signal in phenotypes; however, analytical tools for comparing the strength of that signal across traits remain largely underdeveloped.
We developed a non-parametric, permutation test for the log-likelihood of an evolutionary model, plus a standardized statistic, $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0001$ , from this test, which can be considered a phylogenetic signal effect size. This statistic can be used in two-sample tests to compare the strength of phylogenetic signal for multiple traits.
We performed simulation experiments that revealed that $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0002$ had a linear association with Pagel's $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0003$ , which could be predicted by tree size, plus could be quickly interpreted as a hypothesis for phylogenetic signal based on a standard normal distribution. We additionally found that the permutation test had greater statistical power for detecting phylogenetic signal than parametric likelihood ratio tests, especially for small trees.
The analytical framework we present extends the phylogenetic comparative methods toolkit, allowing for statistical comparison of phylogenetic signal in multiple traits. Future studies could also consider this framework for the comparison of different evolutionary models, especially in light of different null processes.

1 INTRODUCTION

The shared evolutionary history of closely related species often implies the trait similarity among them, a pattern referred to as ‘phylogenetic signal’ (Blomberg et al., 2003; Felsenstein, 1985; Pagel, 1999). Many phylogenetic comparative methods (PCMs) seek to address the non-independence of species' traits (Felsenstein, 1985; Harvey & Pagel, 1991) in their analytic framing, by conditioning data on the phylogenetic relatedness among observations (Adams, 2014b; Adams & Collyer, 2018; Beaulieu et al., 2012; Garland & Ives, 2000; Grafen, 1989; Martins & Hansen, 1997; O'Meara et al., 2006; Rohlf, 2001). Indeed, under numerous evolutionary models, phylogenetic signal is expected, as stochastic character change along the hierarchical structure of the tree of life generates trait covariation among taxa (Blomberg et al., 2003; Felsenstein, 1985; Revell et al., 2008). Quantifying and comparing phylogenetic signal among traits, however, remains quite challenging.

Several analytical tools have been developed to quantify phylogenetic signal in phenotypic datasets (Abouheif, 1999; Adams, 2014a; Blomberg et al., 2003; Gittleman & Kot, 1990; Klingenberg & Gidaszewski, 2010; Pagel, 1999), and their statistical properties—namely type I error rates and statistical power—have been investigated to determine under what conditions phylogenetic signal can be detected (Adams, 2014a; Boettiger et al., 2012; Diniz-Filho et al., 2012; Molina-Venegas & Rodríguez, 2017; Münkemüller et al., 2012; Pavoine & Ricotta, 2013; Revell, 2010; Revell et al., 2008). One of the most widely used methods for characterizing phylogenetic signal is Pagel's $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0004$ (Pagel, 1999), which transforms the lengths (by compression) of the internal branches of the phylogeny, while leaving the tips unaffected, to improve the fit of data to the phylogeny via maximum likelihood (Freckleton et al., 2002; Pagel, 1999). To infer whether phylogenetic signal differs from no signal or a Brownian motion (BM) model of evolutionary divergence, the observed model fit using $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0005$ may be statistically compared with that using $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0006$ or $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0007$ via likelihood ratio tests (Bose et al., 2019; Cooper et al., 2010; Freckleton et al., 2002) or confidence limits (Vandelook et al., 2019).

Another widely used measure is Blomberg's $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0008$ (Blomberg et al., 2003), which characterizes phylogenetic signal as the ratio of observed trait variation to the amount of variation expected under BM. Blomberg's $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0009$ can be treated as a test statistic by using a permutation test to generate its sampling distribution (Adams, 2014a; Blomberg et al., 2003) for determining whether significant phylogenetic signal is present in data. Both $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0010$ and $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0011$ seem intuitive to interpret, as a value of $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0012$ for both corresponds to no phylogenetic signal, and a value of $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0013$ corresponds to the amount of phylogenetic signal expected under BM. Thus, it is tempting to regard both $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0014$ and $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0015$ as descriptive statistics (and effect sizes, Münkemüller et al., 2012) that measure the relative strength of phylogenetic signal, providing an estimate of its magnitude for comparison.

The potential appeal of Pagel's $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0016$ and Blomberg's $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0017$ as effect sizes is that they provide a basis for interpreting ‘weak’ versus ‘strong’ phylogenetic signal; that is, small versus large values of $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0018$ or $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0019$ , respectively, in a comparative sense (De Meester et al., 2019; Pintanel et al., 2019; Su et al., 2019). They are also important statistics in hypothesis tests. The optimized value of lambda, $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0020$ , is the location where the log-likelihood is maximized, and is, therefore, compelling for finding the maximum phylogenetic signal in the data, which can be deemed ‘significant’ by rejecting the null hypothesis of $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0021$ in a likelihood ratio test. Although Pagel's $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0022$ has an upper bound of 1, Blomberg's $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0023$ can measure phylogenetic signal that is greater than expected under BM, as it has no upper bound. Blomberg's $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0024$ —or more specifically, the GLS estimation of variance that is a part of its calculation—can serve as a test statistic in a permutation test that randomizes ‘tip data’ in random permutations. However, $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0025$ is quite sensitive to tree size, exhibits high type II error rates for intermediate strength of phylogenetic signal, has higher type I error rates than likelihood ratio tests based on $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0026$ , and exhibits greater uncertainty for strong phylogenetic signals (whereas $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0027$ has greater uncertainty at intermediate phylogenetic signal strength: Münkemüller et al., 2012). Both of these statistics offer good support as test statistics for determining whether phylogenetic signal exists in a trait, but they are limited for comparing phylogenetic signals between traits.

Here, we present an alternative, standardized effect size calculation, which can be used in hypothesis tests to compare phylogenetic signals for different traits, and which is based on a normalized distribution of random log-likelihoods of a phylogenetic model, generated from a model of phylogenetic independence. Much like a likelihood ratio test for Pagel's $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0028$ , this non-parametric approach can assess the significance of the observed phylogenetic signal, but unlike the parametric test, the standardized location of the observed likelihood can be used as an effect size, which can be statistically compared with similarly calculated effect sizes to consider hypotheses regarding the relative strengths of phylogenetic signal for multiple traits. We use simulation experiments to compare this standardized effect size to $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0029$ and $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0030$ and demonstrate its utility with an empirical example. Comprehensively we illustrate that this standardized effect size provides an additional necessary tool to the phylogenetic comparative toolkit.

2 CONCEPTUAL DEVELOPMENT

A hypothesis test for phylogenetic signal involves calculating the variance among taxa trait values, conditioned on phylogenetic covariances (evolutionary rates) and comparing this variance to a variance that assumes phylogenetic independence. This can be appreciated by the GLS log-likelihood equation for a BM phylogenetic model of a univariate trait (Blomberg et al., 2003; Freckleton, 2012; Freckleton et al., 2002; Garland & Ives, 2000):

$urn:x-wiley:2041210X:media:mee313749:mee313749-math-0031$ (1)

where, $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0032$ is a vector of $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0033$ trait values, $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0034$ is a vector of phylogenetic residuals, $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0035$ is an $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0036$ phylogenetic covariance matrix, equal to $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0037$ , and $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0038$ represents its determinant. The $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0039$ covariance matrix, $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0040$ , is a matrix of phylogenetic variances along the diagonal, and covariances that are proportional to or exactly the covariances from a BM model of evolution (Revell et al., 2008). The variance (evolutionary rate), $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0041$ , is calculated as $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0042$ , where ^T represents vector transposition. $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0043$ is used for the maximum likelihood estimate of $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0044$ ; $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0045$ is used in place of $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0046$ for the restricted maximum likelihood (REML) estimator (Freckleton, 2012). The expected value (tree root) is computed as $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0047$ , where $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0048$ is an $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0049$ vector of 1s. Because $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0050$ , Equation 1 can be expanded, that is,

$urn:x-wiley:2041210X:media:mee313749:mee313749-math-0051$ (2)

This expansion helps to elucidate the portions of the log-likelihood equation that are constant when comparing traits. In Equation 2, $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0052$ is a constant and $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0053$ for any traits and any trees, for a BM model of evolution. We can, therefore, update Equation 2:

$urn:x-wiley:2041210X:media:mee313749:mee313749-math-0054$

Thus, Equation 2 can be simplified:

$urn:x-wiley:2041210X:media:mee313749:mee313749-math-0055$ (3)

where $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0056$ is a constant for all of the parts in Equation 1 that would not be changed by changing $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0057$ . Equation 3 helps one appreciate that in a likelihood ratio test to compare estimates of evolutionary rates (e.g. $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0058$ and $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0059$ ), $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0060$ and $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0061$ are the same in the two likelihood calculations; the only parts that change are $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0062$ (as a proportional change in covariances, based on $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0063$ ) and the values of $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0064$ , as a result of $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0065$ . Therefore, a likelihood ratio test is a direct comparison of evolutionary rates.

Pagel's $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0066$ is a scaling parameter by which the covariances (i.e. the off-diagonals) of $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0067$ are multiplied (Pagel, 1999). A value of 0 changes all covariances to 0 (a star phylogeny or phylogenetic independence), and a value of 1 does not change the covariances from those expected by a BM model of evolution. (If $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0068$ and the tree is ultrametric, $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0069$ is a diagonal matrix proportional to an $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0070$ identity matrix, $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0071$ . It is convenient in this case to refer to $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0072$ rather than $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0073$ as a model of evolutionary independence because the lengths of branches in a star phylogeny—all equal—are inconsequential in estimation of the trait variance.) The value of $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0074$ that minimizes $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0075$ maximizes the log-likelihood. This value can be found for the interval between 0 and 1, yielding the optimized value, $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0076$ . In a likelihood ratio test, the log-likelihood at $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0077$ can be compared with the log-likelihood found at $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0078$ ; rejection of the null hypothesis indicates ‘significant’ phylogenetic signal. Likelihood ratio tests could also be used to explicitly test $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0079$ against a model of pure BM ( $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0080$ ).

By contrast, Blomberg's $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0081$ finds $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0082$ and calculates variance (mean-squared error) from these residuals two different ways: $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0083$ and $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0084$ , where $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0085$ is typically the untransformed covariance matrix based on a BM model of evolutionary divergence ( $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0086$ ). The only difference between $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0087$ and $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0088$ in Equation 2 is the use of the REML estimator ( $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0089$ degrees of freedom) for $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0090$ . $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0091$ , however, ignores phylogenetic covariances in its estimation (does not correct for phylogenetic relatedness). Blomberg's $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0092$ is the ratio, $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0093$ , divided by its expectation under BM for a given phylogeny; that is, $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0094$ (Blomberg et al., 2003). This equation could be equivalently calculated (Revell et al., 2008), as

$urn:x-wiley:2041210X:media:mee313749:mee313749-math-0095$ (4)

where $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0096$ is the sum of diagonal elements, and we use the subscript, BM, to indicate this is an untransformed ( $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0097$ ) version of $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0098$ . Typically, $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0099$ is also used in the calculation of $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0100$ , but this need not be the case, as least for considering $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0101$ as a test statistic rather than a descriptive statistic. $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0102$ will tend toward 0 if there is no phylogenetic signal, and tend toward or exceed 1 if there is. Whereas a likelihood ratio test can be used for Pagel's $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0103$ , a permutation test (which randomizes the trait data across the tips of the phylogeny) is used to generate random distributions of $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0104$ (e.g. Blomberg et al., 2003) or $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0105$ (e.g. Adams, 2014a). A p-value is found as the percentile of the observed statistic in its sampling distribution. Because a permutation test and likelihood ratio test are non-parametric and parametric solutions for different test statistics, respectively, it might not be surprising that they could produce different results with respect to the same null hypothesis of no phylogenetic signal. However, it is because of the potential difference in $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0106$ values used in calculation of the test statistics more so than the statistic or method used that different results are possible. With the same $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0107$ used to calculate $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0108$ , and thus, $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0109$ , the two tests should produce similar results. This can be appreciated by considering the process that generates variation in the permutation test.

Blomberg et al. (2003) proposed that $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0110$ could be replaced in Equation 3 by $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0111$ , where $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0112$ is the ordinary least squares (OLS) mean of $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0113$ . A permutation test that randomizes tip data performed with this altered $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0114$ statistic or $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0115$ produces a distribution of values that are perfectly rank correlated because the only random element recalculated in each permutation is $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0116$ . (All other portions of the $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0117$ calculation would be constant.) It can be appreciated why Blomberg et al. (2003) suggested $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0118$ as a statistic, and asserted that using $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0119$ that is transformed (e.g. by $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0120$ ) would mean having greater statistical power to detect phylogenetic signal. It can be seen from Equations 3 and 4 that for the same $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0121$ ,

$urn:x-wiley:2041210X:media:mee313749:mee313749-math-0122$ (5)

If one optimizes $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0123$ via maximum likelihood, uses this value to transform $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0124$ , and performs a permutation test on $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0125$ , using $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0126$ as the test statistic, then a test on $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0127$ and a test on $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0128$ are commensurate. Furthermore, such a permutation test can be considered a non-parametric alternative to a likelihood ratio test. (We provide additional empirical detail in Appendix 1 of the Supporting Information that confirms rank correlation.)

Randomizing tip data is a simplified form of randomization of residuals in a permutation procedure (RRPP). RRPP works best if residuals are the most appropriate exchangeable units under the null hypothesis (Adams & Collyer, 2018; Commenges, 2003). RRPP is a process that randomizes null model residuals and adds them to null model fitted values in every random permutation to create random pseudodata used to fit alternative models. If the null hypothesis is phylogenetic independence, a star phylogeny is assumed, $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0129$ , $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0130$ , the OLS mean, $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0131$ are the OLS residuals, and random outcomes of $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0132$ , where $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0133$ indicates randomization, are the pseudodata produced in each permutation. If $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0134$ , then randomizing residuals is the same as randomizing tip data (the root and data mean are the same). This process preserves first- and second-moment exchangeability; that is, the OLS mean and variance of the trait are constant across random permutations. (If phylogenetic independence is not assumed, RRPP still functions the same, but has a GLS solution with second-moment exchangeability only.) It is important to understand that when the portions of the log-likelihood expression for phylogenetic independence (summarized as $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0135$ in Equation 3) are held constant, p-values found from the sampling distributions of either component of Equation 5, the log-likelihood, or the likelihood ratio statistic, $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0136$ , will be exactly the same (as $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0137$ is also constant in every random permutation). Therefore, a permutation test with RRPP is a non-parametric application of the likelihood ratio statistic in a hypothesis test (Potter, 2005), which does not rely on a mixture of parametric probability distributions as a proxy of the true sampling distribution (Molenberghs & Verbeke, 2007; Self & Liang, 1987). However, it remains to be seen if a permutation test, using the log-likelihood statistic and RRPP, is as reliable as a parametric likelihood ratio test.

Assuming a comparable tree transformation, Pagel's $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0138$ and Blomberg's $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0139$ can be considered two phylogenetic signal effect sizes (Münkemüller et al., 2012), but a test of phylogenetic signal, as demonstrated above, is more explicitly an assessment of the rarity of the observed $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0140$ in a hypothetical distribution of $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0141$ , if $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0142$ is the null model process. This process can be applied with RRPP and a sampling distribution of either $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0143$ , $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0144$ , or $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0145$ can be generated, and all would provide the same p-values for the same RRPP permutations (see Appendix 1 of the Supporting Information for an example of this outcome). However, sampling distributions from permutation tests do not need to be a means to an end, a tool to merely find a p-value. The location of the observed statistic in its sampling distribution can also be considered an effect size (Adams & Collyer, 2016, 2018, 2019; Collyer et al., 2015). From, Equation 3, either the non-constant portion of the log-likelihood equation, $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0146$ , or the log-likelihood itself, are good statistics for estimating effect sizes (Equation 3 demonstrates that the log-likelihood is a linear transformation of $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0147$ , so effect sizes estimated from the RRPP distributions of either will be the same. We will henceforth refer to the random forms of $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0148$ , for simplicity.) The location of the observed value in a standardized distribution of random $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0149$ outcomes, based on the appropriate null hypothesis of phylogenetic independence, provides a standardized (statistical) effect size that can be used in comparative tests (Adams & Collyer, 2016, 2018, 2019; Collyer et al., 2015).

Letting $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0150$ , where $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0151$ represents a normalizing function (if random $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0152$ are not sufficiently normally distributed), the standardized effect size of phylogenetic signal for a trait is estimated as,

$urn:x-wiley:2041210X:media:mee313749:mee313749-math-0153$ (6)

where $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0154$ is the mean of the sampling distribution and $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0155$ is the standard error (the standard deviation of the sampling distribution, not the trait). (The ^∧ indicates these values are estimated, based on the number of random permutations used, which is probably fewer than the finite but large possible number of all permutations.) As a standard deviate, we would expect correspondence between a p-value estimated from the location of $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0156$ in a standard normal distribution and the percentile of $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0157$ in its sampling distribution. Therefore, it is obvious that, e.g. $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0158$ means significant phylogenetic signal and $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0159$ means phylogenetic signal that is not significant, based on a significance level of $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0160$ . More importantly, because sampling distributions are approximately normal, two effect sizes can be compared in a hypothesis test, by finding the two-sample test statistic,

$urn:x-wiley:2041210X:media:mee313749:mee313749-math-0161$ (7)

The $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0162$ statistic can be assumed to follow a standard normal distribution, meaning a p-value can be obtained for a null hypothesis test that the phylogenetic signals for two separate traits are the same. There is no explicit expectation that the traits have to come from the same phylogeny, but the scope for comparison of traits is something that can only be considered by examining the behavior of these effect sizes for varied tree sizes and phylogenetic signal strength.

Equation 3 implicitly assumes that the compared traits evolve independently, which might be an illogical assumption for traits measured on species from the same phylogeny. In such cases, there are two options worth considering. First, one could generalize the log-likelihood equation (Equation 2) for multivariate data (see, Revell & Harmon, 2008) and consider the relative strength of multivariate phylogenetic signal with respect to the univariate signals. This is not necessarily a simple generalization, if one allows $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0163$ to vary among traits (requiring $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0164$ covariance matrix estimations in the log-likelihood for $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0165$ traits; see Appendix 2 in the Supporting Information for further details). However, one could compare multivariate $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0166$ -scores between models that assume a common $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0167$ or allow $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0168$ to vary among traits, as a test of evolutionary independence of traits; much like one can compare models with common or separate evolutionary rates among traits (see, Adams, 2013). (We provide further details for this future research consideration in Appendix 2 of the Supporting Information.) Second, one could compare the relative strength of phylogenetic signal between principal components of a multivariate data set. With this option, the principal components would be independent, but one would have to reconcile principal component loadings with test results to determine whether suites of traits have different phylogenetic signals.

Multivariate considerations are expansive and exceed the scope of this paper. However, RRPP is a process that generates sampling distributions of log-likelihoods in a consistent manner, irrespective of the number of traits. Research questions that require multivariate analysis should have tractable solutions, provided log-likelihoods can be estimated (residual covariance matrices are not singular). Regarding single traits, we perform simulation experiments to determine type I error rates, correspondence between hypothesis test outcomes, statistical power, and the relationship between effect size and simulated phylogenetic strength, below. However, we first provide a simple example to help illustrate the purpose of this type of analysis.

2.1 Illustrative example

As an illustrative example, we simulated two traits on a phylogeny ( $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0169$ ), one with moderate phylogenetic signal and one with stronger phylogenetic signal. (Simulation details are explained in the next section.) For one variable, $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0170$ , $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0171$ , and for the other, $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0172$ , $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0173$ . We performed RRPP to recalculate the GLS log-likelihoods (using a covariance matrix for a tree transformed by $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0174$ for each variable), with 10,000 random permutations, each. These distributions were normalized (with a Box-Cox transformation) and standardized (Figure 1), yielding $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0175$ scores of 2.21 and 6.33 standard deviations, respectively, each of which was significant at $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0176$ ( $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0177$ for $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0178$ and $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0179$ for $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0180$ ). Performing parametric likelihood ratio tests with a null model of $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0181$ yielded slightly different results $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0182$ , and $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0183$ , for $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0184$ and $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0185$ , respectively. The difference, as we show below, is likely due to the limited statistical power (type II error) of the parametric likelihood ratio test. We performed a two-sample z-test to compare the phylogenetic signal effect sizes; $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0186$ , indicating that the phylogenetic signal in $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0187$ was significantly larger than in $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0188$ . Comparatively, $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0189$ and $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0190$ , which like $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0191$ is not as useful for determining a significant difference between phylogenetic signals.

Details are in the caption following the image — **FIGURE 1**
Open in figure viewer PowerPoint

Plot of phylogenetic tree with $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0192$ values, plus frequency histograms for the RRPP log-likelihood values for two variables, X and Y. Vertical lines indicate observed values. In the last panel, histograms have been combined for standardized values

**FIGURE 1**
Open in figure viewer PowerPoint

Plot of phylogenetic tree with $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0192$ values, plus frequency histograms for the RRPP log-likelihood values for two variables, X and Y. Vertical lines indicate observed values. In the last panel, histograms have been combined for standardized values

3 SIMULATION METHODS AND RESULTS

We examined the behavior of RRPP-based likelihood ratio tests and standardized log-likelihood effect size with simulation experiments that varied the strength of phylogenetic signal, for various sized pure-birth phylogenetic trees. In our simulation experiments, we sought to examine the statistical power of likelihood ratio tests with RRPP compared to parametric tests, and the relationships between effect size, Pagel's $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0193$ , and Blomberg's $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0194$ .

3.1 Simulation methods

Simulating data from a model with known phylogenetic signal is challenging, as the process requires an a priori definition of phylogenetic signal, and there is no guarantee that the process will produce data that are similar to the intended effect. It is possible to simulate data with an intended $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0195$ , as this requires merely simulating a tree, rescaling the tree, and simulating BM data on the rescaled tree (see, e.g. Adams, 2014a; Molina-Venegas & Rodríguez, 2017). Alternatively, a weighted average of data simulated with BM and without BM could be used, which Münkemüller et al. (2012) described as the (simulated) BM strength. However, with either approach, there is no guarantee that $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0196$ will resemble $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0197$ , especially for small trees (see, e.g. Figure 2 of Münkemüller et al., 2012). Furthermore, there is no easily conceivable way to simulate data from a model with known $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0198$ . For previous studies that sought to evaluate statistical properties (type I or type II errors, accuracy, and precision), defining $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0199$ or a weight of BM strength, as simulated, was sufficient for calculating summary statistics over many simulation runs with the same input value. However, we were more interested in understanding the association of simulated phylogenetic signal strength and the effect size estimated from the log-likelihood of an evolutionary model, over a continuum from $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0200$ to $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0201$ .

**FIGURE 2**
Open in figure viewer PowerPoint

Plots of simulations including relative frequencies of $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0202$ generated (left column), $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0203$ tested with $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0204$ (middle column), and $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0205$ tested with $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0206$ (right column). Rows separate results by tree size. Points corresponding to non-significant results from permutation tests are colored gray; significant results are scaled, colored and hued according to the magnitude of $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0207$ , as in the left column

Initial trials to simulate data (sensu Adams, 2014a; Molina-Venegas & Rodríguez, 2017) from a uniform distribution of $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0208$ revealed that, especially with smaller trees, distributions of $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0209$ tended to be skewed toward 0 or 1, despite uniform sampling of $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0210$ . This was consistent with the research of Münkemüller et al. (2012). (see, e.g. their Figure 2 and their Table 2, which indicates skewing of $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0211$ toward 0 or 1 for small trees.) Therefore, we used an algorithm to first simulate $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0212$ from a uniform distribution, and then simulate data that produced $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0213$ within 5% of $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0214$ to assure that there was an approximately uniform distribution of $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0215$ throughout the simulation runs.

We simulated 5,000 pure-birth, ultrametric trees (with a branching rate of 0.05) for each of $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0216$ sized trees (25,000 trees total). All trees were created with the function, pbtree, from the phytools R package (Revell, 2010). For each tree, we randomly sampled $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0217$ from a uniform distribution (minimum of 0, maximum of 0.99), scaled the tree branch lengths by $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0218$ , and simulated random BM data on the transformed tree, using the sim.char function of the geiger R package (Harmon et al., 2008). Subsequently, we found the maximum likelihood estimate, $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0219$ (see code in Appendix 3 of the Supporting Information) from the data generated. We used an upper limit of $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0220$ because like Cooper et al. (2016), we observed a rare but discernible trend for data simulated with $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0221$ to not fit as well with a BM model of evolutionary divergence as alternative models, such as Ornstein Uhlenbeck models (Lande, 1976). By using a cut-off of 0.99, instances of $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0222$ were still frequent, but anomalies from simulating non-BM data were largely mitigated. For every tree we simulated, we repeatedly simulated data until we found $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0223$ within a 5% interval of $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0224$ , and then retained the data for analysis.

For every simulated tree and its corresponding data, we performed a parametric likelihood ratio test, with $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0225$ (untransformed) and $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0226$ (transformed) adjustments of $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0227$ . We also used RRPP to generate distributions of 1,000 random log-likelihoods for each tree:data combination, and for both untransformed and transformed $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0228$ matrices, from which the percentile of the observed statistic was used to estimate a p-value. The parametric likelihood ratio test performed for $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0229$ and the permutation test performed for $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0230$ (null $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0231$ in both cases) correspond exactly with the tests typically performed for Pagel's $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0232$ and Blomberg's $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0233$ , respectively. We verified p-values estimated this way were the exact same as using the distribution of random $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0234$ , more typically used for a test of Blomberg's $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0235$ .

$urn:x-wiley:2041210X:media:mee313749:mee313749-math-0236$ results and standardized log-likelihood effect sizes were plotted against $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0237$ , with points scaled and hued in association with $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0238$ to visualize patterns. In such plots, points were colored if significant, based on the RRPP permutation test, or gray if not significant. We anticipated that $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0239$ should correspond to the null hypothesis rejection limit for a one-tailed test, a line we superimposed into plots to visually determine the consistency of effect sizes and hypothesis test results. (We expected colored points to lie above this line and gray points to lie beneath, if effect sizes reflected hypothesis test outcomes.)

Because we had p-values from both parametric and permutation tests, we could create $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0240$ tables of hypothesis test outcome correspondence, to assess the consistency of parametric and non-parametric tests. These tables report both the consistencies (parametric tests and RRPP tests found same result) and two types of inconsistencies: the parametric test finds a significant result but RRPP does not, or RRPP finds a significant result, but the parametric test does not. The inconsistent results were labeled in plots, along with type I error rates (calculated from the frequency of occasions that for $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0241$ , a significant result was observed). Because we had 5,000 $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0242$ values, approximately uniform in distribution, we were able to estimate statistical power curves as a moving proportion of null hypothesis rejections across the landscape of $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0243$ values. We estimated the proportion of rejections for four test types: parametric/untransformed, parametric/transformed, RRPP/untransformed, and RRPP/transformed. Proportions were estimated by culling data by intervals of $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0244$ and producing a vector of 0s (did not reject null hypothesis) and 1s (rejected the null hypothesis), for each test type. The means of these vectors were the proportion of tests for which the null hypothesis was rejected. We found that an interval length of 0.1 assured more than 500 values for all interior points ( $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0245$ ), and produced rather smooth curves that were qualitatively as informative as any curves produced with a greater number of intervals.

Additional functions and code for simulation experiments can be found in Appendix 3 of the Supporting Information. Several support functions from the RRPP R package (Collyer & Adams, 2018) were used to create functions to estimate log-likelihoods and effect sizes from RRPP distributions. In some initial simulations, we also considered balanced and pectinate trees. We found no qualitative differences and simulations could only produce new sets of data on the same tree, so we did not consider them further.

3.2 Simulation results

Figure 2 shows $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0246$ and $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0247$ results from simulations, and Figure 3 shows $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0248$ score results, with corresponding points scaled and hued the same, based on the value of $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0249$ in the first column of Figure 2. The relative frequencies of $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0250$ suggested the simulations produced approximately uniformly distributed phylogenetic signals. A rejection of the null hypothesis of no phylogenetic signal (significant phylogenetic signal) resulted in points that were colored, with hue changing as $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0251$ increased; non-significant values were gray in color. These figures allow for visual clarification of various attributes acquired from the simulation runs, such that patterns are easy to interpret. Statistical power curves are shown in Figure 4.

**FIGURE 3**
Open in figure viewer PowerPoint

Plots of $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0252$ score simulations for untransformed $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0253$ (left columns) and transformed $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0254$ (right column). Points are scaled and colored as in Figure 2. Inconsitencies (from a $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0255$ correspondence table) between parametric likelihood ratio tests (LRT) and RRPP permutation tests are noted, as well as type I error rates for $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0256$ . The dashed line at $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0257$ indicates expected separation of significant and non-significant results, based on a level of significance of $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0258$

**FIGURE 4**
Open in figure viewer PowerPoint

Statistical power curves for indicated methods: parametric likelihood ratio test (LRT), and RRPP. Whether the $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0259$ matrix is transformed is noted

Regarding Pagel's $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0260$ and Blomberg's $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0261$ , our simulation results tracked the results of Münkemüller et al. (2012) in one particular way (Figure 2). It was possible to simulate larger $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0262$ values for smaller trees, but within any tree size, $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0263$ tended to be less than 1 except for the largest simulated $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0264$ values. Despite this trend, the hypothesis test results using alternative transformations of the $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0265$ matrix for estimation of $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0266$ as a test statistic for $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0267$ revealed profound differences. In Figure 2, significant or non-significant $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0268$ values can be found for any $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0269$ , if $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0270$ is forced in the test statistic, which is the common way this test is performed (middle column). The simple act of using $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0271$ and $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0272$ as a test statistic (not $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0273$ ) alleviated this concern, and was consistent with the assertion of Blomberg et al. (2003) that doing so increases statistical power (Figure 4). Forcing $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0274$ for hypothesis tests of $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0275$ also elevated type I error rates, but they were still close to the nominal $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0276$ level.

Issues with $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0277$ forced to be equal to 1 were also revealed by using a standardized effect size based on the location of $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0278$ in its RRPP-generated sampling distribution. Significant and non-significant results spanned the entire range of $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0279$ (Figure 3). These results are not surprising, as they do not seek to maximize likelihood, but help to confirm that the permutation test with $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0280$ , using $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0281$ as a test statistic, is flawed (since $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0282$ might not be minimized via a best fit of the tree to the comparative data). Furthermore, using $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0283$ as an effect size if $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0284$ is forced to equal 1 makes little sense because of its curvilinear association with $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0285$ (Figure 3). However, for cases where $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0286$ , both the permutation test on the log-likelihood statistic as well as the $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0287$ score from the RRPP sampling distribution, as an effect size, had several desirable attributes.

First, the permutation test for the log-likelihood of the evolutionary model had greater statistical power than the parametric likelihood ratio test (Figure 4). A statistical power advantage was greatest for smaller trees, and the power curves of the two methods tended to converge with larger trees. The cases of inconsistent results from the $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0288$ hypothesis test correspondence tables (Figure 3) were always due to the permutation test finding significant results when the parametric likelihood ratio test did not, but the rate of inconsistencies decreased with increased tree size. (By contrast, if $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0289$ is forced, the rate of inconsistencies increased with tree size.) A likelihood ratio statistic only asymptotically follows a $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0290$ distribution, as $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0291$ (Wilks, 1938), so it is not surprising that a parametric likelihood ratio test would have larger type II error rates with small tree sizes. Furthermore, the asymptotic null distribution for a one-sided likelihood ratio statistic, in which null hypotheses are at the limits of the constrained parameter space ( $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0292$ or $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0293$ ), is a mixture of two $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0294$ distributions (Molenberghs & Verbeke, 2007; Verbeke & Molenberghs, 2003). Generally, an unconstrained $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0295$ statistic is reported, but a p-value is considered to be 1/2 of the classical $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0296$ approximation, when mixture proportions are equal. Therefore, a tendency toward high statistical power might be expected for traits from large trees with likelihood ratio tests. Nonetheless, the statistical power was as good or better with permutation tests in our results, irrespective of tree size. In addition to having greater statistical power, the RRPP sampling distribution allows the standard deviate of the observed log-likelihood to be used as an effect size ( $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0297$ ), which also has nice attributes.

Second, $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0298$ based on a maximum likelihood estimate of $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0299$ has a linear association with $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0300$ . The slope of this linear association increases with tree size, unfortunately, as it is not possible to disentangle a goodness of fit ( $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0301$ ) from the size of a tree. Thus, one might consider comparing effect sizes for traits from two vastly different trees with caution. The range of simulated $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0302$ also increased with phylogenetic signal strength and tree size (Figure 3). This result can be explained by the fact that for a large value of $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0303$ , also for a large tree, the breadth of possible $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0304$ values in RRPP permutations increases, so it is also possible to have a larger span of possible $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0305$ values.

Third, with $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0306$ , there was a clear demarcation of $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0307$ above a value of 1.96 corresponding to significant hypothesis test outcomes, especially for tests with $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0308$ (Figure 3). This is helpful, as an effect size of say, $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0309$ reported from an empirical study, indicates significant phylogenetic signal. We found sampling distributions to be consistently normally distributed (see Figure 1 as an example), especially for larger trees. Results in Figure 3 (second column) had a distinction of significant results consistently for $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0310$ . A reliance on the normal distribution of RRPP sampling distributions means that two-sample $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0311$ statistics are also reliable, and quick interpretation of $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0312$ to, e.g. $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0313$ , for two traits from the same tree, indicates which has greater phylogenetic signal.

Ideally, there would have been no relationship between $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0314$ and tree size, but such an expectation would be unwarranted, as phylogenetic signal is inherently related to the largeness of the phylogeny. However, we determined that there was a precise relationship between tree size and the slope of $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0315$ with respect to $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0316$ . The slopes of the lines in Figure 3 fit (nearly perfectly) the relationship, $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0317$ . Thus, the expected value of $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0318$ , given $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0319$ and $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0320$ is $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0321$ . One can calculate $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0322$ for the traits (see Figure 5) that are compared to ascertain if $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0323$ is larger (more positive) or lesser (more negative) than expected, given the tree size and optimized value of $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0324$ . This adjustment could be seen at best as a tool to help understand the multifarious nature of phylogenetic signal, rather than fix $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0325$ for comparative tests. For example, when comparing traits from two different trees, more positive values of $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0326$ might be considered stronger phylogenetic signal, if $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0327$ are comparable.

**FIGURE 5**
Open in figure viewer PowerPoint

$urn:x-wiley:2041210X:media:mee313749:mee313749-math-0328$ scores from Figure 3 (right column) after subtracting the expected value of $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0329$ , based on $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0330$ and $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0331$

4 EMPIRICAL EXAMPLE

To demonstrate the utility of $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0332$ , we compared $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0333$ for two ecologically-relevant traits in plethodontid salamanders (Figure 6): surface area to volume ratios $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0334$ and relative (to snout to vent length) body width $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0335$ (Baken & Adams, 2019; Baken et al., 2020). For this example, surface area to volume ratios and relative body width measures were obtained from individuals of 305 species, from which species means were obtained (Baken & Adams, 2019; Baken et al., 2020). A time-dated molecular phylogeny for the group (Bonett & Blair, 2017) was pruned to match the species in the phenotypic dataset. The phylogenetic signal effect size in each trait was obtained from 10,000 RRPP permutations, using functions described in Appendix 3 of the Supporting Information. The absolute value of the two-sample effect size (Equation 5) was calculated, as we had no a priori expectation of direction in the hypothesis test; i.e. it was treated as a two-tailed hypothesis test.

Although both traits contained significant phylogenetic signal ( $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0336$ and $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0337$ ), a test based on $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0338$ revealed that the degree of phylogenetic signal was significantly stronger in $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0339$ ( $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0340$ ; $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0341$ : Figure 5). Biologically, this observation may be interpreted by the fact that the tropical species—which form a monophyletic group within plethodontids—display greater variation in $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0342$ , which covaries with disparity in their climatic niches (Baken et al., 2020). Thus, greater phylogenetic signal in $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0343$ is to be expected. Coincidentally, $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0344$ was 0.76 and 0.91, and $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0345$ was 0.25 and 0.76, for $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0346$ and $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0347$ , respectively.

5 DISCUSSION

To be able to ask if traits differ in their amount of phylogenetic signal, resolving how to best measure phylogenetic signal is essential. In this study, we considered the two most common measures of phylogenetic signal, and our simulation results did not dispute any issues that were already known about these measures. For example, the precision to estimate $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0348$ is tree-dependent, with more taxa-rich trees required for better precision (Boettiger et al., 2012; Münkemüller et al., 2012). $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0349$ does not scale linearly with increased phylogenetic signal strength, and its variance is positively associated with phylogenetic signal strength (Diniz-Filho et al., 2012; Münkemüller et al., 2012). Our simulation results confirmed these attributes. These issues make the comparison of phylogenetic signals challenging, even if only qualitatively comparing $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0350$ or $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0351$ between traits, for the same phylogeny. That there has been no statistical test only makes inference more speculative.

In this study, we made three important advances for the comparison of phylogenetic signals among different traits. First, we demonstrated that a permutation-based procedure (RRPP) using the log-likelihood as a statistic is not only reliable but performs better than a parametric test, especially for smaller trees. Second, we demonstrated that if the RRPP-log-likelihood permutation test is used, a test of $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0352$ and $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0353$ are the same, provided that $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0354$ is transformed by $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0355$ in the calculation of the GLS variance that is at the heart of the calculation of either statistic. Indeed, Blomberg et al. (2003) introduced $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0356$ as a statistic that had an associated permutation test, based on a distribution of $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0357$ , not $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0358$ , noting that statistical power would be higher if $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0359$ was calculated from a transformed version of $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0360$ that resulted in better fit of the tree to the data. Because $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0361$ and $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0362$ are perfectly rank-order correlated for the same set of RRPP permutations, viewing $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0363$ and $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0364$ as statistics that have different hypothesis test outcomes is not necessary. Previous simulation studies have found differences between them, but did so by relying on adjudication of $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0365$ by a parametric likelihood ratio test ( $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0366$ transformed by $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0367$ ), and $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0368$ by a permutation test with no transformation of $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0369$ ( $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0370$ ) (see, e.g. Molina-Venegas & Rodríguez, 2017; Münkemüller et al., 2012). Our work reveals that these differences were the result of the incommensurate transformation step, and not in test statistic performance, per se. Third, having demonstrated that a test of phylogenetic signal is a test of the rarity of the observed $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0371$ in a distribution of random outcomes, generated by a null model of phylogenetic independence, we can measure phylogenetic signal an alternative way: as the standardized location of the observed $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0372$ in the RRPP-generated distribution of random values (i.e. as an effect size). This alternative makes it possible to perform hypothesis tests for the comparison of the strength of phylogenetic signal across traits.

This third advance is important but perhaps unsettling. The convenience of $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0373$ or $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0374$ is that a value of 0 should mean data devoid of phylogenetic signal, and a value of 1 should mean data have a phylogenetic signal that matches a BM model of evolutionary divergence. By contrast, a $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0375$ -score is a value measured in standard deviations that indicates a location in a normal distribution relative to expectation (mean), given a null model of phylogenetic independence. As a measure of the degree of phylogenetic signal, $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0376$ , might feel less intuitively comfortable. However, this discomfort is perhaps predicated on one's definition of phylogenetic signal. For example, in a tree with 128 taxa (Figure 3), for a value of $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0377$ , $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0378$ might range from 2 to 17. If $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0379$ is the definition of phylogenetic signal strength, the range of effect sizes suggest that $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0380$ is not a good effect size to consider. Conversely, an effect size $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0381$ might be found to have $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0382$ range from 0.1 to 1.0. That is, if $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0383$ is the measure of phylogenetic signal, a corresponding $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0384$ indicates the tree transformation that best reveals the phylogenetic signal, not the amount of phylogenetic signal. However, the appeal of $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0385$ is that it allows a statistical comparison of the phylogenetic signals from multiple traits, but those traits might also have quite different values of $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0386$ or $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0387$ . The best analysis is probably one that statistically compares $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0388$ but also reports both $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0389$ and $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0390$ , as these two statistics have important meaning: the optimized branch-length transformation and a ratio that expresses the relative amount of BM contribution to the GLS variance estimate, respectively. That is, one can report $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0391$ , $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0392$ , $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0393$ , and one p-value, and not have to view phylogenetic signal statistics as a means to an end for different statistical tests.

It is common for researchers to report ‘weak but significant’ phylogenetic signal when $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0394$ is considerably less than 1 but the null hypothesis test is rejected. We also demonstrated with our simulations that it is possible to find ‘significant’ phylogenetic signal when $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0395$ is small compared to a non-significant result when $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0396$ is forced to be equal to 1 (compare plots between left and right columns of Figure 3). Our work demonstrates that it is not helpful to declare ‘weak but significant’ phylogenetic signal (especially if not simultaneously reporting ‘strong but not significant’ phylogenetic signal by increasing $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0397$ ), based on $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0398$ or $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0399$ values. However, we feel it is more appropriate to declare $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0400$ as weak but significant, compared with say, $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0401$ , which is strong and significant. Phylogenetic signal ‘strength’ can be viewed as measure of rarity to generate such a strong signal by chance, which $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0402$ describes well. Although $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0403$ and $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0404$ are useful statistics, their ability to discern strong versus weak phylogenetic signal is questionable. Only $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0405$ , which is a statistical effect size, affords this statistical interpretation. However, an interpretation of phylogenetic strength still cannot be made independent of phylogeny size.

One less desirable outcome of our simulations is that $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0406$ (more precisely the slope between $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0407$ and $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0408$ ) was positively associated with $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0409$ , the number of taxa represented in a tree. We were able to demonstrate that the slope between $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0410$ and $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0411$ is predicted by $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0412$ , such that one could find an expected value of $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0413$ , given $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0414$ and $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0415$ . When comparing the same or multiple traits between trees, this added step might help to better elucidate differences between a two-sample test of phylogenetic signal, especially if $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0416$ is significant, but it is not clear if the test result is because of differences in phylogenetic signal strength or tree size. This would not be a panacea, as it would also involve using $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0417$ , which could vary between traits, but it is a tool that might assist inferences made about differences in phylogenetic signal involving multiple trees, or different $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0418$ scores also involving different $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0419$ transformations.

Although using $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0420$ scores for comparative analysis offers new opportunities, it also presents new challenges. Chief among the challenges that will have to be addressed is how to generalize the $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0421$ score as an effect size for multivariate data, especially if the number of variables precludes calculating log-likelihood. We see three possible approaches. First, like the generalization of the $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0422$ statistic for multivariate data (Adams, 2014a), it might be possible to use the trace of the evolutionary rate matrix, rather than the matrix determinant, which would not be variable-limited (for example, $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0423$ could be generalized by taking either the trace or determinant of $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0424$ , the multivariate generalization of $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0425$ ). Research demonstrating the adequacy of this approach would be needed, and certainly, the random outcomes could not be called log-likelihoods, but if the sampling distributions of log-likelihoods and modified statistics using traces were commensurate for comparable sets of variables, and yielded similar $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0426$ scores, then using an alternative generalization would be possible. Second, one could use a penalized-log-likelihood based on a regularization of near-singular or singular $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0427$ matrices (Clavel et al., 2019). Because this approach assures a $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0428$ matrix that is positive-definite and invertable, it also assures that $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0429$ can be estimated in every random permutation. Whether, the distribution of random $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0430$ obtained from RRPP, followed by regularization in each permutation, yields appropriate sampling distributions would remain to be seen. The statistical properties have been adjudicated using a penalized-likelihood framework for evaluating Wilks' $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0431$ , with RRPP (Clavel & Morlon, 2020), so there is promise that this framework would also work for calculating multivariate $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0432$ scores.

The third potential solution is to use phylogenetically aligned component analysis (PACA; Collyer & Adams, 2021) as a dimension reduction method. The perils of data reduction before likelihood estimation have been clearly demonstrated (Uyeda et al., 2015), but this was for cases where the data reduction method (principal component analysis, PCA) did not find components specifically aligned to phylogenetic signal. PACA specifically aligns components to phylogenetic signal, such that greater phylogenetic signal—rather than variance—is predominantly found in the first few components. It might be possible to use a subset of data dimensions that contain most or all phylogenetic signal to estimate pseudo-likelihoods. Again, it might not be sufficient to refer to a statistic calculated this way as model likelihood, but if random outcomes across many permutations produce a sampling distribution that yields similar $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0433$ values in fewer dimensions, it might be trusted for estimating $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0434$ for highly multivariate data.

Regardless of these three possible solutions, another consideration is whether different variables could have different $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0435$ in the estimation of log-likelihoods; that is, can it be assumed traits evolve independently? In Appendix 2 of the Supporting Information, we outline a method for calculating log-likelihoods for multivariate data, both assuming common and independent $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0436$ for traits. Model selection could be used to compare these two likelihoods to ascertain if traits evolve independently, and if so, the two-sample $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0437$ test described here could be used to determine which traits have greater phylogenetic signal. However, appropriate optimization methods for multiple $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0438$ should be rigorously researched, in addition to the statistical properties of different likelihood estimators, before solutions for multivariate traits are eagerly embraced.

Regardless of future challenges, the ability to estimate an effect size that can be used for hypothesis tests to compare phylogenetic signal in multiple traits, as a tool, is a boon for the PCM toolkit. We feel that measuring phylogenetic signal is more nuanced than using a single statistic, but adding $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0439$ to the suite of statistics used can help decipher between weak and strong phylogenetic signals, rather than misinterpreting values of $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0440$ or $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0441$ . Our scope of investigation concerned BM models of evolutionary divergence and one transformation parameter, Pagel's $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0442$ . Pagel's $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0443$ is generally considered to be most associated with phylogenetic signal, but one could also consider using RRPP with additional transformation parameters, including $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0444$ and $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0445$ (Pagel, 1999). Because the transformation of the $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0446$ matrix is an a priori step and this transformation is retained through random permutations, it would be easy to extend the RRPP-log-likelihood computations to additional $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0447$ matrix transformations. Furthermore, RRPP could be used with alternative models of evolution (e.g. multi-rate Brownian, early burst, OU, AC/DC), recognizing that the simplifications we made from Equations 1 to 3 would not be the same. Random versions of $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0448$ in Equation 1 would have to be calculated in each RRPP permutation, accounting appropriately for parameters that are fixed or variable in each permutation. Insomuch as phylogenetic signal effect size (using $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0449$ ) is a measure of the fit of tree to comparative data for a BM model of evolution, any similar approach can be considered a model effect size for an alternative evolutionary model. Thus, there could be some appeal with using the RRPP-log-likelihood effect size as a model selection criterion, especially because multiple models could be compared, not just assuming a null model of phylogenetic independence, but other null models as well. (For example, a single rate BM model could serve as null model for various multi-rate alternative models. Comparison of $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0450$ among the models, using both phylogenetic independence and BM as different null models, could be valuable for inferring the best evolutionary model for trait data.)

In answering certain evolutionary questions, such as comparing the strength of phylogenetic signal, traditional parametric approaches offer challenges. First, the parameter space of $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0451$ is bounded, and thus, a mixture of $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0452$ distributions is required as a proxy for a sampling distribution (Molenberghs & Verbeke, 2007; Self & Liang, 1987; Verbeke & Molenberghs, 2003). Second, $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0453$ distributions are asymptotically appropriate for likelihood ratio statistics for very large sample sizes (Wilks, 1938), a situation rarely afforded when working with phylogenies. The permutation test we presented is not constrained to use a parametric probability distribution as a proxy, and is additionally capable of providing effect sizes, which are comparable across datasets to evaluate comparative hypotheses. Prior work (Adams & Collyer, 2018) has shown that empirical sampling distributions generated from RRPP match nearly perfectly the parametric $urn:x-wiley:2041210X:media:mee313749:mee313749-math-0454$ -distributions typically used in ANOVA, when data are simulated to match the assumptions of ANOVA. Based on our work here, one might speculate that RRPP-generated sampling distributions are better proxies for statistics without appropriate parametric sampling distributions and converge on parametric distributions in cases where sampling distribution solutions are tractable. When viewed from this perspective, permutation methods such as RRPP should not be considered mere analytical band-aids to be used for ill-conditioned datasets, or scenarios where standard tests are not applicable. Rather, they are equivalent to parametric procedures for ‘standard’ biostatistical problems and can supersede them in cases where parametric methods are not applicable. Thus, our perspective is that this work helps to continue to pave the way for advancement of PCMs as sets of tools that take advantage of the computational power of modern computers rather than force evolutionary biology questions into limited traditional statistics applications.

ACKNOWLEDGEMENTS

We thank Elizabeth Glynne and Bryan Juarez for comments on early drafts of the manuscript, and Simone Blomberg and Joe Felsenstein for their helpful comments on the penultimate draft. This work was sponsored in part by National Science Foundation Awards DBI-1902694 (to M.L.C.) and DBI-1902511 (to D.C.A.). National Science Foundation (DBI-1902511, DBI-1902694).

CONFLICTS OF INTEREST

The authors have no conficts of interest to declare.

AUTHORS' CONTRIBUTIONS

D.C.A. conceived the original idea for an effect size measure of phylogenetic signal; M.L.C. developed the log-likelihood permutation procedure that was implemented and D.C.A., E.K.B., and M.L.C. collaboratively developed the concept and contributed to all portions of this manuscript. All authors approve of the final product and are willingly accountable for any portion of the content.

Open Research

DATA AVAILABILITY STATEMENT

Empirical data used in this paper are available on Dryad (associated with original article, Baken et al., 2020), https://doi.org/10.5061/dryad.59zw3r23m. R-scripts for simulated data tests are provided as Supporting Information.

Supporting Information

REFERENCES

Abouheif, E. (1999). A method for testing the assumption of phylogenetic independence in comparative data. Evolutionary Ecology Research, 1, 895–909.

Web of Science®Google Scholar
Adams, D. C. (2013). Comparing evolutionary rates for different phenotypic traits on a phylogeny using likelihood. Systematic Biology, 62(2), 181–192. https://doi.org/10.1093/sysbio/sys083
10.1093/sysbio/sys083
PubMedWeb of Science®Google Scholar
Adams, D. C. (2014a). A generalized K statistic for estimating phylogenetic signal from shape and other high-dimensional multivariate data. Systematic Biology, 63(5), 685–697. https://doi.org/10.1093/sysbio/syu030
10.1093/sysbio/syu030
PubMedWeb of Science®Google Scholar
Adams, D. C. (2014b). A method for assessing phylogenetic least squares models for shape and other high-dimensional multivariate data. Evolution, 68, 2675–2688. https://doi.org/10.1111/evo.12463
10.1111/evo.12463
PubMedWeb of Science®Google Scholar
Adams, D. C., & Collyer, M. L. (2016). On the comparison of the strength of morphological integration across morphometric datasets. Evolution, 70(11), 2623–2631. https://doi.org/10.1111/evo.13045
10.1111/evo.13045
PubMedWeb of Science®Google Scholar
Adams, D. C., & Collyer, M. L. (2018). Phylogenetic ANOVA: Group-clade aggregation, biological challenges, and a refined permutation procedure. Evolution, 72(6), 1204–1215. https://doi.org/10.1111/evo.13492
10.1111/evo.13492
PubMedWeb of Science®Google Scholar
Adams, D. C., & Collyer, M. L. (2019). Comparing the strength of modular signal, and evaluating alternative modular hypotheses, using covariance ratio effect sizes with morphometric data. Evolution, 73(12), 2352–2367. https://doi.org/10.1111/evo.13867
10.1111/evo.13867
PubMedWeb of Science®Google Scholar
Baken, E. K., & Adams, D. C. (2019). Macroevolution of arboreality in salamanders. Ecology and Evolution, 9(12), 7005–7016. https://doi.org/10.1002/ece3.5267
10.1002/ece3.5267
PubMedWeb of Science®Google Scholar
Baken, E. K., Mellenthin, L. E., & Adams, D. C. (2020). Macroevolution of desiccation-related morphology in plethodontid salamanders as inferred from a novel surface area to volume ratio estimation approach. Evolution, 74, 476–486. https://doi.org/10.1111/evo.13898
10.1111/evo.13898
PubMedWeb of Science®Google Scholar
Beaulieu, J. M., Jhwueng, D. C., Boettiger, C., & O'Meara, B. C. (2012). Modeling stabilizing selection: Expanding the ornstein-uhlenbeck model of adaptive evolution. Evolution, 66(8), 2369–2383. https://doi.org/10.1111/j.1558-5646.2012.01619.x
10.1111/j.1558?5646.2012.01619.x
PubMedWeb of Science®Google Scholar
Blomberg, S. P., Garland, T., & Ives, A. R. (2003). Testing for phylogenetic signal in comparative data: Behavioral traits are more labile. Evolution, 57, 717–745.
10.1111/j.0014-3820.2003.tb00285.x
PubMedWeb of Science®Google Scholar
Boettiger, C., Coop, G., & Ralph, P. (2012). Is your phylogeny informative? Measuring the power of comparative methods. Evolution, 66(7), 2240–2251. https://doi.org/10.1111/j.1558-5646.2011.01574.x
10.1111/j.1558-5646.2011.01574.x
PubMedWeb of Science®Google Scholar
Bonett, R. M., & Blair, A. L. (2017). Evidence for complex life cycle constraints on salamander body form diversification. Proceedings of the National Academy of Sciences of the United States of America, 114, 9936–9941. https://doi.org/10.1073/pnas.1703877114
10.1073/pnas.1703877114
CASPubMedWeb of Science®Google Scholar
Bose, R., Ramesh, B. R., Pélissier, R., & Munoz, F. (2019). Phylogenetic diversity in the Western Ghats biodiversity hotspot reflects environmental filtering and past niche diversification of trees. Journal of Biogeography, 46(1), 145–157. https://doi.org/10.1111/jbi.13464
10.1111/jbi.13464
Web of Science®Google Scholar
Clavel, J., Aristide, L., & Morlon, H. (2019). A penalized likelihood framework for high-dimensional phylogenetic comparative methods and an application to new-world monkeys brain evolution. Systematic Biology, 68(1), 93–116. https://doi.org/10.1093/sysbio/syy045
10.1093/sysbio/syy045
PubMedWeb of Science®Google Scholar
Clavel, J., & Morlon, H. (2020). Reliable phylogenetic regressions for multivariate comparative data: Illustration with the MANOVA and application to the effect of diet on mandible morphology in phyllostomid bats. Systematic Biology, 69(5), 927–943. https://doi.org/10.1093/sysbio/syaa010
10.1093/sysbio/syaa010
CASPubMedWeb of Science®Google Scholar
Collyer, M. L., & Adams, D. C. (2018). RRPP: An r package for fitting linear models to high-dimensional data using residual randomization. Methods in Ecology and Evolution, 9(7), 1772–1779. https://doi.org/10.1111/2041-210x.13029
10.1111/2041?210x.13029
Web of Science®Google Scholar
Collyer, M. L., & Adams, D. C. (2021). Phylogenetically aligned component analysis. Methods in Ecology and Evolution, 12(2), 359–372. https://doi.org/10.1111/2041-210X.13515
10.1111/2041-210X.13515
Web of Science®Google Scholar
Collyer, M. L., Sekora, D. J., & Adams, D. C. (2015). A method for analysis of phenotypic change for phenotypes described by high-dimensional data. Heredity, 115(4), 357–365. https://doi.org/10.1038/hdy.2014.75
10.1038/hdy.2014.75
CASPubMedWeb of Science®Google Scholar
Commenges, D. (2003). Transformations which preserve exchangeability and application to permutation tests. Journal of Nonparametric Statistics, 15, 171–185. https://doi.org/10.1080/1048525031000089310
10.1080/1048525031000089310
Web of Science®Google Scholar
Cooper, N., Jetz, W., & Freckleton, R. P. (2010). Phylogenetic comparative approaches for studying niche conservatism. Journal of Evolutionary Biology, 23(12), 2529–2539.
10.1111/j.1420-9101.2010.02144.x
CASPubMedWeb of Science®Google Scholar
Cooper, N., Thomas, G. H., Venditti, C., Meade, A., & Freckleton, R. P. (2016). A cautionary note on the use of ornstein uhlenbeck models in macroevolutionary studies. Biological Journal of the Linnean Society, 118(1), 64–77. https://doi.org/10.1111/bij.12701
10.1111/bij.12701
PubMedWeb of Science®Google Scholar
De Meester, G., Huyghe, K., & Van Damme, R. (2019). Brain size, ecology and sociality: A reptilian perspective. Biological Journal of the Linnean Society, 126(3), 381–391. https://doi.org/10.1093/biolinnean/bly206
10.1093/biolinnean/bly206
Web of Science®Google Scholar
Diniz-Filho, J. A. F., Santos, T., Rangel, T. F., & Bini, L. M. (2012). A comparison of metrics for estimating phylogenetic signal under alternative evolutionary models. Genetics and Molecular Biology, 35(3), 673–679. https://doi.org/10.1590/S1415-47572012005000053
10.1590/S1415?47572012005000053
PubMedWeb of Science®Google Scholar
Felsenstein, J. (1985). Phylogenies and the comparative method. The American Naturalist, 125(1), 1–15. https://doi.org/10.1086/284325
10.1086/284325
Web of Science®Google Scholar
Freckleton, R. P. (2012). Fast likelihood calculations for comparative analyses. Methods in Ecology and Evolution, 3(5), 940–947. https://doi.org/10.1111/j.2041-210X.2012.00220.x
10.1111/j.2041?210X.2012.00220.x
Web of Science®Google Scholar
Freckleton, R. P., Harvey, P. H., & Pagel, M. (2002). Phylogenetic analysis and comparative data: A test and review of evidence. The American Naturalist, 160, 712–726. https://doi.org/10.1086/343873
10.1086/343873
CASPubMedWeb of Science®Google Scholar
Garland, T. Jr, & Ives, A. R. (2000). Using the past to predict the present: Confidence intervals for regression equations in phylogenetic comparative methods. The American Naturalist, 155, 346–364. https://doi.org/10.1086/303327
10.1086/303327
PubMedWeb of Science®Google Scholar
Gittleman, J. L., & Kot, M. (1990). Adaptation: Statistics and a null model for estimating phylogenetic effects. Systematic Zoology, 39(3), 227. https://doi.org/10.2307/2992183
10.2307/2992183
Web of Science®Google Scholar
Grafen, A. (1989). The phylogenetic regression. Philosophical Transactions of the Royal Society of London B, Biological Sciences, 326, 119–157.
10.1098/rstb.1989.0106
CASPubMedWeb of Science®Google Scholar
Harmon, L. J., Weir, J. T., Brock, C. D., Glor, R. E., & Challenger, W. (2008). GEIGER: Investigating evolutionary radiations. Bioinformatics, 24, 129–131. https://doi.org/10.1093/bioinformatics/btm538
10.1093/bioinformatics/btm538
CASPubMedWeb of Science®Google Scholar
Harvey, P. H., & Pagel, M. D. (1991). The comparative method in evolutionary biology. 239, Oxford University Press.
10.1093/oso/9780198546412.001.0001
Google Scholar
Klingenberg, C. P., & Gidaszewski, N. A. (2010). Testing and quantifying phylogenetic signals and homoplasy in morphometric data. Systematic Biology, 59(3), 245–261. https://doi.org/10.1093/sysbio/syp106
10.1093/sysbio/syp106
CASPubMedWeb of Science®Google Scholar
Lande, R. (1976). Natural selection and random genetic drift in phenotypic evolution. Evolution, 314–334. https://doi.org/10.1111/j.1558-5646.1976.tb00911.x
10.1111/j.1558?5646.1976.tb00911.x
PubMedGoogle Scholar
Martins, E. P., & Hansen, T. F. (1997). Phylogenies and the comparative method: A general approach to incorporating phylogenetic information into the analysis of interspecific data. The American Naturalist, 149, 646–667. https://doi.org/10.1086/286013
10.1086/286013
Web of Science®Google Scholar
Molenberghs, G., & Verbeke, G. (2007). Likelihood ratio, score, and Wald tests in a constrained parameter space. The American Statistician, 61(1), 22–27. https://doi.org/10.1198/000313007X171322
10.1198/000313007X171322
Web of Science®Google Scholar
Molina-Venegas, R., & Rodríguez, M. (2017). Revisiting phylogenetic signal; strong or negligible impacts of polytomies and branch length information? BMC Evolutionary Biology, 17(1). https://doi.org/10.1186/s12862-017-0898-y
10.1186/s12862?017?0898?y
Google Scholar
Münkemüller, T., Lavergne, S., Bzeznik, B., Dray, S., Jombart, T., Schiffers, K., & Thuiller, W. (2012). How to measure and test phylogenetic signal. Methods in Ecology and Evolution, 3, 743–756. https://doi.org/10.1111/j.2041-210X.2012.00196.x
10.1111/j.2041-210X.2012.00196.x
Web of Science®Google Scholar
O'Meara, B. C., Ané, C., Sanderson, M. J., & Wainwright, P. C. (2006). Testing for different rates of continuous trait evolution using likelihood. Evolution, 60, 922–933.
10.1111/j.0014-3820.2006.tb01171.x
PubMedWeb of Science®Google Scholar
Pagel, M. D. (1999). Inferring the historical patterns of biological evolution. Nature, 401, 877–884. https://doi.org/10.1038/44766
10.1038/44766
CASPubMedWeb of Science®Google Scholar
Pavoine, S., & Ricotta, C. (2013). Testing for phylogenetic signal in biological traits: The ubiquity of cross-product statistics. Evolution, 67(3), 828–840. https://doi.org/10.1111/j.1558-5646.2012.01823.x
10.1111/j.1558?5646.2012.01823.x
PubMedWeb of Science®Google Scholar
Pintanel, P., Tejedo, M., Ron, S. R., Llorente, G. A., & Merino-Viteri, A. (2019). Elevational and microclimatic drivers of thermal tolerance in Andean Pristimantis frogs. Journal of Biogeography, 46(8), 1664–1675. https://doi.org/10.1111/jbi.13596
10.1111/jbi.13596
Web of Science®Google Scholar
Potter, D. M. (2005). A permutation test for inference in logistic regression with small-and moderate-sized data sets. Statistics in Medicine, 24(5), 693–708. https://doi.org/10.1002/sim.1931
10.1002/sim.1931
PubMedWeb of Science®Google Scholar
Revell, L. J. (2010). Phylogenetic signal and linear regression on species data. Methods in Ecology and Evolution, 1, 319–329. https://doi.org/10.1111/j.2041-210X.2010.00044.x
10.1111/j.2041-210X.2010.00044.x
Web of Science®Google Scholar
Revell, L. J., & Harmon, L. J. (2008). Testing quantitative genetic hypotheses about the evolutionary rate matrix for continuous characters. Evolutionary Ecology Research, 10, 311–331.

Web of Science®Google Scholar
Revell, L. J., Harmon, L. J., & Collar, D. C. (2008). Phylogenetic signal, evolutionary process, and rate. Systematic Biology, 57(4), 591–601. https://doi.org/10.1080/10635150802302427
10.1080/10635150802302427
PubMedWeb of Science®Google Scholar
Rohlf, F. J. (2001). Comparative methods for the analysis of continuous variables: Geometric interpretations. Evolution, 55, 2143–2160.
10.1111/j.0014-3820.2001.tb00731.x
CASPubMedWeb of Science®Google Scholar
Self, S. G., & Liang, K.-Y. (1987). Asymptotic properties of maximum likelihood estimators and likelihood ratio tests under nonstandard conditions. Journal of the American Statistical Association, 82(398), 605–610. https://doi.org/10.1080/01621459.1987.10478472
10.1080/01621459.1987.10478472
Web of Science®Google Scholar
Su, G., Villéger, S., & Brosse, S. (2019). Morphological diversity of freshwater fishes differs between realms, but morphologically extreme species are widespread. Global Ecology and Biogeography, 28(2), 211–221. https://doi.org/10.1111/geb.12843
10.1111/geb.12843
Web of Science®Google Scholar
Uyeda, J. C., Caetano, D. S., & Pennell, M. W. (2015). Comparative analysis of principal components can be misleading. Systematic Biology, 64(4), 677–689. https://doi.org/10.1093/sysbio/syv019
10.1093/sysbio/syv019
CASPubMedWeb of Science®Google Scholar
Vandelook, F., Janssens, S., Gijbels, P., Fischer, E., Van den Ende, W., Honnay, O., & Abrahamczyk, S. (2019). Nectar traits differ between pollination syndromes in balsaminaceae. Annals of Botany, 124(2), 269–279. https://doi.org/10.1093/aob/mcz072
10.1093/aob/mcz072
CASPubMedWeb of Science®Google Scholar
Verbeke, G., & Molenberghs, G. (2003). The use of score tests for inference on variance components. Biometrics, 59(2), 254–262. https://doi.org/10.1111/1541-0420.00032
10.1111/1541?0420.00032
PubMedWeb of Science®Google Scholar
Wilks, S. S. (1938). The large-sample distribution of the likelihood ratio for testing composite hypotheses. The Annals of Mathematical Statistics, 9(1), 60–62. https://doi.org/10.1214/aoms/1177732360
10.1214/aoms/1177732360
Google Scholar

Citing Literature

Volume13, Issue2

February 2022

Pages 367-382

Filename	Description
mee313749-sup-0001-Supinfo.pdfPDF document, 3.5 MB	Supplementary Material
mee313749-sup-0002-Rscript.Rapplication/R, 5.7 KB	R script

A standardized effect size for evaluating and comparing the strength of phylogenetic signal

Abstract

1 INTRODUCTION

2 CONCEPTUAL DEVELOPMENT

2.1 Illustrative example

3 SIMULATION METHODS AND RESULTS

3.1 Simulation methods

3.2 Simulation results

4 EMPIRICAL EXAMPLE

5 DISCUSSION

ACKNOWLEDGEMENTS

CONFLICTS OF INTEREST

AUTHORS' CONTRIBUTIONS

Open Research

DATA AVAILABILITY STATEMENT

Supporting Information

REFERENCES

Citing Literature

Figures

References

Information

About Wiley Online Library

Help & Support

Opportunities

Connect with Wiley

A standardized effect size for evaluating and comparing the strength of phylogenetic signal

Abstract

1 INTRODUCTION

2 CONCEPTUAL DEVELOPMENT

2.1 Illustrative example

3 SIMULATION METHODS AND RESULTS

3.1 Simulation methods

3.2 Simulation results

4 EMPIRICAL EXAMPLE

5 DISCUSSION

ACKNOWLEDGEMENTS

CONFLICTS OF INTEREST

AUTHORS' CONTRIBUTIONS

Open Research

DATA AVAILABILITY STATEMENT

Supporting Information

REFERENCES

Citing Literature

Figures

References

Related

Information