A standardized effect size for evaluating and comparing the strength of phylogenetic signal
Handling Editor: José Miguel Ponciano
Abstract
- Macroevolutionary studies frequently characterize the phylogenetic signal in phenotypes; however, analytical tools for comparing the strength of that signal across traits remain largely underdeveloped.
- We developed a non-parametric, permutation test for the log-likelihood of an evolutionary model, plus a standardized statistic, , from this test, which can be considered a phylogenetic signal effect size. This statistic can be used in two-sample tests to compare the strength of phylogenetic signal for multiple traits.
- We performed simulation experiments that revealed that had a linear association with Pagel's , which could be predicted by tree size, plus could be quickly interpreted as a hypothesis for phylogenetic signal based on a standard normal distribution. We additionally found that the permutation test had greater statistical power for detecting phylogenetic signal than parametric likelihood ratio tests, especially for small trees.
- The analytical framework we present extends the phylogenetic comparative methods toolkit, allowing for statistical comparison of phylogenetic signal in multiple traits. Future studies could also consider this framework for the comparison of different evolutionary models, especially in light of different null processes.
1 INTRODUCTION
The shared evolutionary history of closely related species often implies the trait similarity among them, a pattern referred to as ‘phylogenetic signal’ (Blomberg et al., 2003; Felsenstein, 1985; Pagel, 1999). Many phylogenetic comparative methods (PCMs) seek to address the non-independence of species' traits (Felsenstein, 1985; Harvey & Pagel, 1991) in their analytic framing, by conditioning data on the phylogenetic relatedness among observations (Adams, 2014b; Adams & Collyer, 2018; Beaulieu et al., 2012; Garland & Ives, 2000; Grafen, 1989; Martins & Hansen, 1997; O'Meara et al., 2006; Rohlf, 2001). Indeed, under numerous evolutionary models, phylogenetic signal is expected, as stochastic character change along the hierarchical structure of the tree of life generates trait covariation among taxa (Blomberg et al., 2003; Felsenstein, 1985; Revell et al., 2008). Quantifying and comparing phylogenetic signal among traits, however, remains quite challenging.
Several analytical tools have been developed to quantify phylogenetic signal in phenotypic datasets (Abouheif, 1999; Adams, 2014a; Blomberg et al., 2003; Gittleman & Kot, 1990; Klingenberg & Gidaszewski, 2010; Pagel, 1999), and their statistical properties—namely type I error rates and statistical power—have been investigated to determine under what conditions phylogenetic signal can be detected (Adams, 2014a; Boettiger et al., 2012; Diniz-Filho et al., 2012; Molina-Venegas & Rodríguez, 2017; Münkemüller et al., 2012; Pavoine & Ricotta, 2013; Revell, 2010; Revell et al., 2008). One of the most widely used methods for characterizing phylogenetic signal is Pagel's (Pagel, 1999), which transforms the lengths (by compression) of the internal branches of the phylogeny, while leaving the tips unaffected, to improve the fit of data to the phylogeny via maximum likelihood (Freckleton et al., 2002; Pagel, 1999). To infer whether phylogenetic signal differs from no signal or a Brownian motion (BM) model of evolutionary divergence, the observed model fit using may be statistically compared with that using or via likelihood ratio tests (Bose et al., 2019; Cooper et al., 2010; Freckleton et al., 2002) or confidence limits (Vandelook et al., 2019).
Another widely used measure is Blomberg's (Blomberg et al., 2003), which characterizes phylogenetic signal as the ratio of observed trait variation to the amount of variation expected under BM. Blomberg's can be treated as a test statistic by using a permutation test to generate its sampling distribution (Adams, 2014a; Blomberg et al., 2003) for determining whether significant phylogenetic signal is present in data. Both and seem intuitive to interpret, as a value of for both corresponds to no phylogenetic signal, and a value of corresponds to the amount of phylogenetic signal expected under BM. Thus, it is tempting to regard both and as descriptive statistics (and effect sizes, Münkemüller et al., 2012) that measure the relative strength of phylogenetic signal, providing an estimate of its magnitude for comparison.
The potential appeal of Pagel's and Blomberg's as effect sizes is that they provide a basis for interpreting ‘weak’ versus ‘strong’ phylogenetic signal; that is, small versus large values of or , respectively, in a comparative sense (De Meester et al., 2019; Pintanel et al., 2019; Su et al., 2019). They are also important statistics in hypothesis tests. The optimized value of lambda, , is the location where the log-likelihood is maximized, and is, therefore, compelling for finding the maximum phylogenetic signal in the data, which can be deemed ‘significant’ by rejecting the null hypothesis of in a likelihood ratio test. Although Pagel's has an upper bound of 1, Blomberg's can measure phylogenetic signal that is greater than expected under BM, as it has no upper bound. Blomberg's —or more specifically, the GLS estimation of variance that is a part of its calculation—can serve as a test statistic in a permutation test that randomizes ‘tip data’ in random permutations. However, is quite sensitive to tree size, exhibits high type II error rates for intermediate strength of phylogenetic signal, has higher type I error rates than likelihood ratio tests based on , and exhibits greater uncertainty for strong phylogenetic signals (whereas has greater uncertainty at intermediate phylogenetic signal strength: Münkemüller et al., 2012). Both of these statistics offer good support as test statistics for determining whether phylogenetic signal exists in a trait, but they are limited for comparing phylogenetic signals between traits.
Here, we present an alternative, standardized effect size calculation, which can be used in hypothesis tests to compare phylogenetic signals for different traits, and which is based on a normalized distribution of random log-likelihoods of a phylogenetic model, generated from a model of phylogenetic independence. Much like a likelihood ratio test for Pagel's , this non-parametric approach can assess the significance of the observed phylogenetic signal, but unlike the parametric test, the standardized location of the observed likelihood can be used as an effect size, which can be statistically compared with similarly calculated effect sizes to consider hypotheses regarding the relative strengths of phylogenetic signal for multiple traits. We use simulation experiments to compare this standardized effect size to and and demonstrate its utility with an empirical example. Comprehensively we illustrate that this standardized effect size provides an additional necessary tool to the phylogenetic comparative toolkit.
2 CONCEPTUAL DEVELOPMENT
Pagel's is a scaling parameter by which the covariances (i.e. the off-diagonals) of are multiplied (Pagel, 1999). A value of 0 changes all covariances to 0 (a star phylogeny or phylogenetic independence), and a value of 1 does not change the covariances from those expected by a BM model of evolution. (If and the tree is ultrametric, is a diagonal matrix proportional to an identity matrix, . It is convenient in this case to refer to rather than as a model of evolutionary independence because the lengths of branches in a star phylogeny—all equal—are inconsequential in estimation of the trait variance.) The value of that minimizes maximizes the log-likelihood. This value can be found for the interval between 0 and 1, yielding the optimized value, . In a likelihood ratio test, the log-likelihood at can be compared with the log-likelihood found at ; rejection of the null hypothesis indicates ‘significant’ phylogenetic signal. Likelihood ratio tests could also be used to explicitly test against a model of pure BM ().
If one optimizes via maximum likelihood, uses this value to transform , and performs a permutation test on , using as the test statistic, then a test on and a test on are commensurate. Furthermore, such a permutation test can be considered a non-parametric alternative to a likelihood ratio test. (We provide additional empirical detail in Appendix 1 of the Supporting Information that confirms rank correlation.)
Randomizing tip data is a simplified form of randomization of residuals in a permutation procedure (RRPP). RRPP works best if residuals are the most appropriate exchangeable units under the null hypothesis (Adams & Collyer, 2018; Commenges, 2003). RRPP is a process that randomizes null model residuals and adds them to null model fitted values in every random permutation to create random pseudodata used to fit alternative models. If the null hypothesis is phylogenetic independence, a star phylogeny is assumed, , , the OLS mean, are the OLS residuals, and random outcomes of , where indicates randomization, are the pseudodata produced in each permutation. If , then randomizing residuals is the same as randomizing tip data (the root and data mean are the same). This process preserves first- and second-moment exchangeability; that is, the OLS mean and variance of the trait are constant across random permutations. (If phylogenetic independence is not assumed, RRPP still functions the same, but has a GLS solution with second-moment exchangeability only.) It is important to understand that when the portions of the log-likelihood expression for phylogenetic independence (summarized as in Equation 3) are held constant, p-values found from the sampling distributions of either component of Equation 5, the log-likelihood, or the likelihood ratio statistic, , will be exactly the same (as is also constant in every random permutation). Therefore, a permutation test with RRPP is a non-parametric application of the likelihood ratio statistic in a hypothesis test (Potter, 2005), which does not rely on a mixture of parametric probability distributions as a proxy of the true sampling distribution (Molenberghs & Verbeke, 2007; Self & Liang, 1987). However, it remains to be seen if a permutation test, using the log-likelihood statistic and RRPP, is as reliable as a parametric likelihood ratio test.
Assuming a comparable tree transformation, Pagel's and Blomberg's can be considered two phylogenetic signal effect sizes (Münkemüller et al., 2012), but a test of phylogenetic signal, as demonstrated above, is more explicitly an assessment of the rarity of the observed in a hypothetical distribution of , if is the null model process. This process can be applied with RRPP and a sampling distribution of either , , or can be generated, and all would provide the same p-values for the same RRPP permutations (see Appendix 1 of the Supporting Information for an example of this outcome). However, sampling distributions from permutation tests do not need to be a means to an end, a tool to merely find a p-value. The location of the observed statistic in its sampling distribution can also be considered an effect size (Adams & Collyer, 2016, 2018, 2019; Collyer et al., 2015). From, Equation 3, either the non-constant portion of the log-likelihood equation, , or the log-likelihood itself, are good statistics for estimating effect sizes (Equation 3 demonstrates that the log-likelihood is a linear transformation of , so effect sizes estimated from the RRPP distributions of either will be the same. We will henceforth refer to the random forms of , for simplicity.) The location of the observed value in a standardized distribution of random outcomes, based on the appropriate null hypothesis of phylogenetic independence, provides a standardized (statistical) effect size that can be used in comparative tests (Adams & Collyer, 2016, 2018, 2019; Collyer et al., 2015).
The statistic can be assumed to follow a standard normal distribution, meaning a p-value can be obtained for a null hypothesis test that the phylogenetic signals for two separate traits are the same. There is no explicit expectation that the traits have to come from the same phylogeny, but the scope for comparison of traits is something that can only be considered by examining the behavior of these effect sizes for varied tree sizes and phylogenetic signal strength.
Equation 3 implicitly assumes that the compared traits evolve independently, which might be an illogical assumption for traits measured on species from the same phylogeny. In such cases, there are two options worth considering. First, one could generalize the log-likelihood equation (Equation 2) for multivariate data (see, Revell & Harmon, 2008) and consider the relative strength of multivariate phylogenetic signal with respect to the univariate signals. This is not necessarily a simple generalization, if one allows to vary among traits (requiring covariance matrix estimations in the log-likelihood for traits; see Appendix 2 in the Supporting Information for further details). However, one could compare multivariate -scores between models that assume a common or allow to vary among traits, as a test of evolutionary independence of traits; much like one can compare models with common or separate evolutionary rates among traits (see, Adams, 2013). (We provide further details for this future research consideration in Appendix 2 of the Supporting Information.) Second, one could compare the relative strength of phylogenetic signal between principal components of a multivariate data set. With this option, the principal components would be independent, but one would have to reconcile principal component loadings with test results to determine whether suites of traits have different phylogenetic signals.
Multivariate considerations are expansive and exceed the scope of this paper. However, RRPP is a process that generates sampling distributions of log-likelihoods in a consistent manner, irrespective of the number of traits. Research questions that require multivariate analysis should have tractable solutions, provided log-likelihoods can be estimated (residual covariance matrices are not singular). Regarding single traits, we perform simulation experiments to determine type I error rates, correspondence between hypothesis test outcomes, statistical power, and the relationship between effect size and simulated phylogenetic strength, below. However, we first provide a simple example to help illustrate the purpose of this type of analysis.
2.1 Illustrative example
As an illustrative example, we simulated two traits on a phylogeny (), one with moderate phylogenetic signal and one with stronger phylogenetic signal. (Simulation details are explained in the next section.) For one variable, , , and for the other, , . We performed RRPP to recalculate the GLS log-likelihoods (using a covariance matrix for a tree transformed by for each variable), with 10,000 random permutations, each. These distributions were normalized (with a Box-Cox transformation) and standardized (Figure 1), yielding scores of 2.21 and 6.33 standard deviations, respectively, each of which was significant at ( for and for ). Performing parametric likelihood ratio tests with a null model of yielded slightly different results , and , for and , respectively. The difference, as we show below, is likely due to the limited statistical power (type II error) of the parametric likelihood ratio test. We performed a two-sample z-test to compare the phylogenetic signal effect sizes; , indicating that the phylogenetic signal in was significantly larger than in . Comparatively, and , which like is not as useful for determining a significant difference between phylogenetic signals.
3 SIMULATION METHODS AND RESULTS
We examined the behavior of RRPP-based likelihood ratio tests and standardized log-likelihood effect size with simulation experiments that varied the strength of phylogenetic signal, for various sized pure-birth phylogenetic trees. In our simulation experiments, we sought to examine the statistical power of likelihood ratio tests with RRPP compared to parametric tests, and the relationships between effect size, Pagel's , and Blomberg's .
3.1 Simulation methods
Simulating data from a model with known phylogenetic signal is challenging, as the process requires an a priori definition of phylogenetic signal, and there is no guarantee that the process will produce data that are similar to the intended effect. It is possible to simulate data with an intended , as this requires merely simulating a tree, rescaling the tree, and simulating BM data on the rescaled tree (see, e.g. Adams, 2014a; Molina-Venegas & Rodríguez, 2017). Alternatively, a weighted average of data simulated with BM and without BM could be used, which Münkemüller et al. (2012) described as the (simulated) BM strength. However, with either approach, there is no guarantee that will resemble , especially for small trees (see, e.g. Figure 2 of Münkemüller et al., 2012). Furthermore, there is no easily conceivable way to simulate data from a model with known . For previous studies that sought to evaluate statistical properties (type I or type II errors, accuracy, and precision), defining or a weight of BM strength, as simulated, was sufficient for calculating summary statistics over many simulation runs with the same input value. However, we were more interested in understanding the association of simulated phylogenetic signal strength and the effect size estimated from the log-likelihood of an evolutionary model, over a continuum from to .
Initial trials to simulate data (sensu Adams, 2014a; Molina-Venegas & Rodríguez, 2017) from a uniform distribution of revealed that, especially with smaller trees, distributions of tended to be skewed toward 0 or 1, despite uniform sampling of . This was consistent with the research of Münkemüller et al. (2012). (see, e.g. their Figure 2 and their Table 2, which indicates skewing of toward 0 or 1 for small trees.) Therefore, we used an algorithm to first simulate from a uniform distribution, and then simulate data that produced within 5% of to assure that there was an approximately uniform distribution of throughout the simulation runs.
We simulated 5,000 pure-birth, ultrametric trees (with a branching rate of 0.05) for each of sized trees (25,000 trees total). All trees were created with the function, pbtree, from the phytools R package (Revell, 2010). For each tree, we randomly sampled from a uniform distribution (minimum of 0, maximum of 0.99), scaled the tree branch lengths by , and simulated random BM data on the transformed tree, using the sim.char function of the geiger R package (Harmon et al., 2008). Subsequently, we found the maximum likelihood estimate, (see code in Appendix 3 of the Supporting Information) from the data generated. We used an upper limit of because like Cooper et al. (2016), we observed a rare but discernible trend for data simulated with to not fit as well with a BM model of evolutionary divergence as alternative models, such as Ornstein Uhlenbeck models (Lande, 1976). By using a cut-off of 0.99, instances of were still frequent, but anomalies from simulating non-BM data were largely mitigated. For every tree we simulated, we repeatedly simulated data until we found within a 5% interval of , and then retained the data for analysis.
For every simulated tree and its corresponding data, we performed a parametric likelihood ratio test, with (untransformed) and (transformed) adjustments of . We also used RRPP to generate distributions of 1,000 random log-likelihoods for each tree:data combination, and for both untransformed and transformed matrices, from which the percentile of the observed statistic was used to estimate a p-value. The parametric likelihood ratio test performed for and the permutation test performed for (null in both cases) correspond exactly with the tests typically performed for Pagel's and Blomberg's , respectively. We verified p-values estimated this way were the exact same as using the distribution of random , more typically used for a test of Blomberg's .
results and standardized log-likelihood effect sizes were plotted against , with points scaled and hued in association with to visualize patterns. In such plots, points were colored if significant, based on the RRPP permutation test, or gray if not significant. We anticipated that should correspond to the null hypothesis rejection limit for a one-tailed test, a line we superimposed into plots to visually determine the consistency of effect sizes and hypothesis test results. (We expected colored points to lie above this line and gray points to lie beneath, if effect sizes reflected hypothesis test outcomes.)
Because we had p-values from both parametric and permutation tests, we could create tables of hypothesis test outcome correspondence, to assess the consistency of parametric and non-parametric tests. These tables report both the consistencies (parametric tests and RRPP tests found same result) and two types of inconsistencies: the parametric test finds a significant result but RRPP does not, or RRPP finds a significant result, but the parametric test does not. The inconsistent results were labeled in plots, along with type I error rates (calculated from the frequency of occasions that for , a significant result was observed). Because we had 5,000 values, approximately uniform in distribution, we were able to estimate statistical power curves as a moving proportion of null hypothesis rejections across the landscape of values. We estimated the proportion of rejections for four test types: parametric/untransformed, parametric/transformed, RRPP/untransformed, and RRPP/transformed. Proportions were estimated by culling data by intervals of and producing a vector of 0s (did not reject null hypothesis) and 1s (rejected the null hypothesis), for each test type. The means of these vectors were the proportion of tests for which the null hypothesis was rejected. We found that an interval length of 0.1 assured more than 500 values for all interior points (), and produced rather smooth curves that were qualitatively as informative as any curves produced with a greater number of intervals.
Additional functions and code for simulation experiments can be found in Appendix 3 of the Supporting Information. Several support functions from the RRPP R package (Collyer & Adams, 2018) were used to create functions to estimate log-likelihoods and effect sizes from RRPP distributions. In some initial simulations, we also considered balanced and pectinate trees. We found no qualitative differences and simulations could only produce new sets of data on the same tree, so we did not consider them further.
3.2 Simulation results
Figure 2 shows and results from simulations, and Figure 3 shows score results, with corresponding points scaled and hued the same, based on the value of in the first column of Figure 2. The relative frequencies of suggested the simulations produced approximately uniformly distributed phylogenetic signals. A rejection of the null hypothesis of no phylogenetic signal (significant phylogenetic signal) resulted in points that were colored, with hue changing as increased; non-significant values were gray in color. These figures allow for visual clarification of various attributes acquired from the simulation runs, such that patterns are easy to interpret. Statistical power curves are shown in Figure 4.
Regarding Pagel's and Blomberg's , our simulation results tracked the results of Münkemüller et al. (2012) in one particular way (Figure 2). It was possible to simulate larger values for smaller trees, but within any tree size, tended to be less than 1 except for the largest simulated values. Despite this trend, the hypothesis test results using alternative transformations of the matrix for estimation of as a test statistic for revealed profound differences. In Figure 2, significant or non-significant values can be found for any , if is forced in the test statistic, which is the common way this test is performed (middle column). The simple act of using and as a test statistic (not ) alleviated this concern, and was consistent with the assertion of Blomberg et al. (2003) that doing so increases statistical power (Figure 4). Forcing for hypothesis tests of also elevated type I error rates, but they were still close to the nominal level.
Issues with forced to be equal to 1 were also revealed by using a standardized effect size based on the location of in its RRPP-generated sampling distribution. Significant and non-significant results spanned the entire range of (Figure 3). These results are not surprising, as they do not seek to maximize likelihood, but help to confirm that the permutation test with , using as a test statistic, is flawed (since might not be minimized via a best fit of the tree to the comparative data). Furthermore, using as an effect size if is forced to equal 1 makes little sense because of its curvilinear association with (Figure 3). However, for cases where , both the permutation test on the log-likelihood statistic as well as the score from the RRPP sampling distribution, as an effect size, had several desirable attributes.
First, the permutation test for the log-likelihood of the evolutionary model had greater statistical power than the parametric likelihood ratio test (Figure 4). A statistical power advantage was greatest for smaller trees, and the power curves of the two methods tended to converge with larger trees. The cases of inconsistent results from the hypothesis test correspondence tables (Figure 3) were always due to the permutation test finding significant results when the parametric likelihood ratio test did not, but the rate of inconsistencies decreased with increased tree size. (By contrast, if is forced, the rate of inconsistencies increased with tree size.) A likelihood ratio statistic only asymptotically follows a distribution, as (Wilks, 1938), so it is not surprising that a parametric likelihood ratio test would have larger type II error rates with small tree sizes. Furthermore, the asymptotic null distribution for a one-sided likelihood ratio statistic, in which null hypotheses are at the limits of the constrained parameter space ( or ), is a mixture of two distributions (Molenberghs & Verbeke, 2007; Verbeke & Molenberghs, 2003). Generally, an unconstrained statistic is reported, but a p-value is considered to be 1/2 of the classical approximation, when mixture proportions are equal. Therefore, a tendency toward high statistical power might be expected for traits from large trees with likelihood ratio tests. Nonetheless, the statistical power was as good or better with permutation tests in our results, irrespective of tree size. In addition to having greater statistical power, the RRPP sampling distribution allows the standard deviate of the observed log-likelihood to be used as an effect size (), which also has nice attributes.
Second, based on a maximum likelihood estimate of has a linear association with . The slope of this linear association increases with tree size, unfortunately, as it is not possible to disentangle a goodness of fit () from the size of a tree. Thus, one might consider comparing effect sizes for traits from two vastly different trees with caution. The range of simulated also increased with phylogenetic signal strength and tree size (Figure 3). This result can be explained by the fact that for a large value of , also for a large tree, the breadth of possible values in RRPP permutations increases, so it is also possible to have a larger span of possible values.
Third, with , there was a clear demarcation of above a value of 1.96 corresponding to significant hypothesis test outcomes, especially for tests with (Figure 3). This is helpful, as an effect size of say, reported from an empirical study, indicates significant phylogenetic signal. We found sampling distributions to be consistently normally distributed (see Figure 1 as an example), especially for larger trees. Results in Figure 3 (second column) had a distinction of significant results consistently for . A reliance on the normal distribution of RRPP sampling distributions means that two-sample statistics are also reliable, and quick interpretation of to, e.g. , for two traits from the same tree, indicates which has greater phylogenetic signal.
Ideally, there would have been no relationship between and tree size, but such an expectation would be unwarranted, as phylogenetic signal is inherently related to the largeness of the phylogeny. However, we determined that there was a precise relationship between tree size and the slope of with respect to . The slopes of the lines in Figure 3 fit (nearly perfectly) the relationship, . Thus, the expected value of , given and is . One can calculate for the traits (see Figure 5) that are compared to ascertain if is larger (more positive) or lesser (more negative) than expected, given the tree size and optimized value of . This adjustment could be seen at best as a tool to help understand the multifarious nature of phylogenetic signal, rather than fix for comparative tests. For example, when comparing traits from two different trees, more positive values of might be considered stronger phylogenetic signal, if are comparable.
4 EMPIRICAL EXAMPLE
To demonstrate the utility of , we compared for two ecologically-relevant traits in plethodontid salamanders (Figure 6): surface area to volume ratios and relative (to snout to vent length) body width (Baken & Adams, 2019; Baken et al., 2020). For this example, surface area to volume ratios and relative body width measures were obtained from individuals of 305 species, from which species means were obtained (Baken & Adams, 2019; Baken et al., 2020). A time-dated molecular phylogeny for the group (Bonett & Blair, 2017) was pruned to match the species in the phenotypic dataset. The phylogenetic signal effect size in each trait was obtained from 10,000 RRPP permutations, using functions described in Appendix 3 of the Supporting Information. The absolute value of the two-sample effect size (Equation 5) was calculated, as we had no a priori expectation of direction in the hypothesis test; i.e. it was treated as a two-tailed hypothesis test.
Although both traits contained significant phylogenetic signal ( and ), a test based on revealed that the degree of phylogenetic signal was significantly stronger in (; : Figure 5). Biologically, this observation may be interpreted by the fact that the tropical species—which form a monophyletic group within plethodontids—display greater variation in , which covaries with disparity in their climatic niches (Baken et al., 2020). Thus, greater phylogenetic signal in is to be expected. Coincidentally, was 0.76 and 0.91, and was 0.25 and 0.76, for and , respectively.
5 DISCUSSION
To be able to ask if traits differ in their amount of phylogenetic signal, resolving how to best measure phylogenetic signal is essential. In this study, we considered the two most common measures of phylogenetic signal, and our simulation results did not dispute any issues that were already known about these measures. For example, the precision to estimate is tree-dependent, with more taxa-rich trees required for better precision (Boettiger et al., 2012; Münkemüller et al., 2012). does not scale linearly with increased phylogenetic signal strength, and its variance is positively associated with phylogenetic signal strength (Diniz-Filho et al., 2012; Münkemüller et al., 2012). Our simulation results confirmed these attributes. These issues make the comparison of phylogenetic signals challenging, even if only qualitatively comparing or between traits, for the same phylogeny. That there has been no statistical test only makes inference more speculative.
In this study, we made three important advances for the comparison of phylogenetic signals among different traits. First, we demonstrated that a permutation-based procedure (RRPP) using the log-likelihood as a statistic is not only reliable but performs better than a parametric test, especially for smaller trees. Second, we demonstrated that if the RRPP-log-likelihood permutation test is used, a test of and are the same, provided that is transformed by in the calculation of the GLS variance that is at the heart of the calculation of either statistic. Indeed, Blomberg et al. (2003) introduced as a statistic that had an associated permutation test, based on a distribution of , not , noting that statistical power would be higher if was calculated from a transformed version of that resulted in better fit of the tree to the data. Because and are perfectly rank-order correlated for the same set of RRPP permutations, viewing and as statistics that have different hypothesis test outcomes is not necessary. Previous simulation studies have found differences between them, but did so by relying on adjudication of by a parametric likelihood ratio test ( transformed by ), and by a permutation test with no transformation of () (see, e.g. Molina-Venegas & Rodríguez, 2017; Münkemüller et al., 2012). Our work reveals that these differences were the result of the incommensurate transformation step, and not in test statistic performance, per se. Third, having demonstrated that a test of phylogenetic signal is a test of the rarity of the observed in a distribution of random outcomes, generated by a null model of phylogenetic independence, we can measure phylogenetic signal an alternative way: as the standardized location of the observed in the RRPP-generated distribution of random values (i.e. as an effect size). This alternative makes it possible to perform hypothesis tests for the comparison of the strength of phylogenetic signal across traits.
This third advance is important but perhaps unsettling. The convenience of or is that a value of 0 should mean data devoid of phylogenetic signal, and a value of 1 should mean data have a phylogenetic signal that matches a BM model of evolutionary divergence. By contrast, a -score is a value measured in standard deviations that indicates a location in a normal distribution relative to expectation (mean), given a null model of phylogenetic independence. As a measure of the degree of phylogenetic signal, , might feel less intuitively comfortable. However, this discomfort is perhaps predicated on one's definition of phylogenetic signal. For example, in a tree with 128 taxa (Figure 3), for a value of , might range from 2 to 17. If is the definition of phylogenetic signal strength, the range of effect sizes suggest that is not a good effect size to consider. Conversely, an effect size might be found to have range from 0.1 to 1.0. That is, if is the measure of phylogenetic signal, a corresponding indicates the tree transformation that best reveals the phylogenetic signal, not the amount of phylogenetic signal. However, the appeal of is that it allows a statistical comparison of the phylogenetic signals from multiple traits, but those traits might also have quite different values of or . The best analysis is probably one that statistically compares but also reports both and , as these two statistics have important meaning: the optimized branch-length transformation and a ratio that expresses the relative amount of BM contribution to the GLS variance estimate, respectively. That is, one can report , , , and one p-value, and not have to view phylogenetic signal statistics as a means to an end for different statistical tests.
It is common for researchers to report ‘weak but significant’ phylogenetic signal when is considerably less than 1 but the null hypothesis test is rejected. We also demonstrated with our simulations that it is possible to find ‘significant’ phylogenetic signal when is small compared to a non-significant result when is forced to be equal to 1 (compare plots between left and right columns of Figure 3). Our work demonstrates that it is not helpful to declare ‘weak but significant’ phylogenetic signal (especially if not simultaneously reporting ‘strong but not significant’ phylogenetic signal by increasing ), based on or values. However, we feel it is more appropriate to declare as weak but significant, compared with say, , which is strong and significant. Phylogenetic signal ‘strength’ can be viewed as measure of rarity to generate such a strong signal by chance, which describes well. Although and are useful statistics, their ability to discern strong versus weak phylogenetic signal is questionable. Only , which is a statistical effect size, affords this statistical interpretation. However, an interpretation of phylogenetic strength still cannot be made independent of phylogeny size.
One less desirable outcome of our simulations is that (more precisely the slope between and ) was positively associated with , the number of taxa represented in a tree. We were able to demonstrate that the slope between and is predicted by , such that one could find an expected value of , given and . When comparing the same or multiple traits between trees, this added step might help to better elucidate differences between a two-sample test of phylogenetic signal, especially if is significant, but it is not clear if the test result is because of differences in phylogenetic signal strength or tree size. This would not be a panacea, as it would also involve using , which could vary between traits, but it is a tool that might assist inferences made about differences in phylogenetic signal involving multiple trees, or different scores also involving different transformations.
Although using scores for comparative analysis offers new opportunities, it also presents new challenges. Chief among the challenges that will have to be addressed is how to generalize the score as an effect size for multivariate data, especially if the number of variables precludes calculating log-likelihood. We see three possible approaches. First, like the generalization of the statistic for multivariate data (Adams, 2014a), it might be possible to use the trace of the evolutionary rate matrix, rather than the matrix determinant, which would not be variable-limited (for example, could be generalized by taking either the trace or determinant of , the multivariate generalization of ). Research demonstrating the adequacy of this approach would be needed, and certainly, the random outcomes could not be called log-likelihoods, but if the sampling distributions of log-likelihoods and modified statistics using traces were commensurate for comparable sets of variables, and yielded similar scores, then using an alternative generalization would be possible. Second, one could use a penalized-log-likelihood based on a regularization of near-singular or singular matrices (Clavel et al., 2019). Because this approach assures a matrix that is positive-definite and invertable, it also assures that can be estimated in every random permutation. Whether, the distribution of random obtained from RRPP, followed by regularization in each permutation, yields appropriate sampling distributions would remain to be seen. The statistical properties have been adjudicated using a penalized-likelihood framework for evaluating Wilks' , with RRPP (Clavel & Morlon, 2020), so there is promise that this framework would also work for calculating multivariate scores.
The third potential solution is to use phylogenetically aligned component analysis (PACA; Collyer & Adams, 2021) as a dimension reduction method. The perils of data reduction before likelihood estimation have been clearly demonstrated (Uyeda et al., 2015), but this was for cases where the data reduction method (principal component analysis, PCA) did not find components specifically aligned to phylogenetic signal. PACA specifically aligns components to phylogenetic signal, such that greater phylogenetic signal—rather than variance—is predominantly found in the first few components. It might be possible to use a subset of data dimensions that contain most or all phylogenetic signal to estimate pseudo-likelihoods. Again, it might not be sufficient to refer to a statistic calculated this way as model likelihood, but if random outcomes across many permutations produce a sampling distribution that yields similar values in fewer dimensions, it might be trusted for estimating for highly multivariate data.
Regardless of these three possible solutions, another consideration is whether different variables could have different in the estimation of log-likelihoods; that is, can it be assumed traits evolve independently? In Appendix 2 of the Supporting Information, we outline a method for calculating log-likelihoods for multivariate data, both assuming common and independent for traits. Model selection could be used to compare these two likelihoods to ascertain if traits evolve independently, and if so, the two-sample test described here could be used to determine which traits have greater phylogenetic signal. However, appropriate optimization methods for multiple should be rigorously researched, in addition to the statistical properties of different likelihood estimators, before solutions for multivariate traits are eagerly embraced.
Regardless of future challenges, the ability to estimate an effect size that can be used for hypothesis tests to compare phylogenetic signal in multiple traits, as a tool, is a boon for the PCM toolkit. We feel that measuring phylogenetic signal is more nuanced than using a single statistic, but adding to the suite of statistics used can help decipher between weak and strong phylogenetic signals, rather than misinterpreting values of or . Our scope of investigation concerned BM models of evolutionary divergence and one transformation parameter, Pagel's . Pagel's is generally considered to be most associated with phylogenetic signal, but one could also consider using RRPP with additional transformation parameters, including and (Pagel, 1999). Because the transformation of the matrix is an a priori step and this transformation is retained through random permutations, it would be easy to extend the RRPP-log-likelihood computations to additional matrix transformations. Furthermore, RRPP could be used with alternative models of evolution (e.g. multi-rate Brownian, early burst, OU, AC/DC), recognizing that the simplifications we made from Equations 1 to 3 would not be the same. Random versions of in Equation 1 would have to be calculated in each RRPP permutation, accounting appropriately for parameters that are fixed or variable in each permutation. Insomuch as phylogenetic signal effect size (using ) is a measure of the fit of tree to comparative data for a BM model of evolution, any similar approach can be considered a model effect size for an alternative evolutionary model. Thus, there could be some appeal with using the RRPP-log-likelihood effect size as a model selection criterion, especially because multiple models could be compared, not just assuming a null model of phylogenetic independence, but other null models as well. (For example, a single rate BM model could serve as null model for various multi-rate alternative models. Comparison of among the models, using both phylogenetic independence and BM as different null models, could be valuable for inferring the best evolutionary model for trait data.)
In answering certain evolutionary questions, such as comparing the strength of phylogenetic signal, traditional parametric approaches offer challenges. First, the parameter space of is bounded, and thus, a mixture of distributions is required as a proxy for a sampling distribution (Molenberghs & Verbeke, 2007; Self & Liang, 1987; Verbeke & Molenberghs, 2003). Second, distributions are asymptotically appropriate for likelihood ratio statistics for very large sample sizes (Wilks, 1938), a situation rarely afforded when working with phylogenies. The permutation test we presented is not constrained to use a parametric probability distribution as a proxy, and is additionally capable of providing effect sizes, which are comparable across datasets to evaluate comparative hypotheses. Prior work (Adams & Collyer, 2018) has shown that empirical sampling distributions generated from RRPP match nearly perfectly the parametric -distributions typically used in ANOVA, when data are simulated to match the assumptions of ANOVA. Based on our work here, one might speculate that RRPP-generated sampling distributions are better proxies for statistics without appropriate parametric sampling distributions and converge on parametric distributions in cases where sampling distribution solutions are tractable. When viewed from this perspective, permutation methods such as RRPP should not be considered mere analytical band-aids to be used for ill-conditioned datasets, or scenarios where standard tests are not applicable. Rather, they are equivalent to parametric procedures for ‘standard’ biostatistical problems and can supersede them in cases where parametric methods are not applicable. Thus, our perspective is that this work helps to continue to pave the way for advancement of PCMs as sets of tools that take advantage of the computational power of modern computers rather than force evolutionary biology questions into limited traditional statistics applications.
ACKNOWLEDGEMENTS
We thank Elizabeth Glynne and Bryan Juarez for comments on early drafts of the manuscript, and Simone Blomberg and Joe Felsenstein for their helpful comments on the penultimate draft. This work was sponsored in part by National Science Foundation Awards DBI-1902694 (to M.L.C.) and DBI-1902511 (to D.C.A.). National Science Foundation (DBI-1902511, DBI-1902694).
CONFLICTS OF INTEREST
The authors have no conficts of interest to declare.
AUTHORS' CONTRIBUTIONS
D.C.A. conceived the original idea for an effect size measure of phylogenetic signal; M.L.C. developed the log-likelihood permutation procedure that was implemented and D.C.A., E.K.B., and M.L.C. collaboratively developed the concept and contributed to all portions of this manuscript. All authors approve of the final product and are willingly accountable for any portion of the content.
Open Research
DATA AVAILABILITY STATEMENT
Empirical data used in this paper are available on Dryad (associated with original article, Baken et al., 2020), https://doi.org/10.5061/dryad.59zw3r23m. R-scripts for simulated data tests are provided as Supporting Information.