1 Introduction

Consider two independent random variables, \(Y_1\) and \(Y_2\). A basic way of comparing the corresponding distributions is in terms of \(\pi =P(Y_1<Y_2)\), the probability that a randomly sampled observation from the first group is less than a randomly sampled observation from the second group. Numerous methods for making inferences about \(\pi \) have been derived beyond the classic Wilcoxon–Mann–Whitney test, twelve of which were compared by Ruscio and Mullen (2012).

A fundamental issue is making inferences about \(\pi \) when there is a covariate. Wilcox (2023) proposed such a method when dealing with a single covariate, the details of which are reviewed below. The method is readily extended to two covariates, but there are no simulation results on how well the method performs in terms of controlling the probability of a Type I error and providing reasonably accurate confidence intervals. The goal in this note is to address this issue and to comment on how the covariate points might be chosen.

This note is organized as follows. Section 2 describes a method for estimating \(\pi \) when there are two covariates. The method is a simple generalization of the method used by Wilcox (2023). Included is a proposed method for computing a confidence interval for \(\pi \), followed by a description of how the covariate points might be chosen. Section 3 reports simulation results on how well the proposed confidence interval for \(\pi \) performs. Section 4 illustrates the proposed method using data dealing with depressive symptoms among older adults. The covariates are a measure of perceived health and a measure of meaningful activities. As will be seen, a plot of the estimates of \(\pi \), as a function of the covariates, provides a more detailed understanding of the role of the covariates.

2 The Proposed Method

For convenience, momentarily consider a single group, let Y denote some random variable of interest, and let \(X_1\) and \(X_2\) denote two covariates associated with this group. It is assumed that the qth quantile of Y, given that \(X_1=x_1\) and \(X_2=x_2\), is given by

$$\begin{aligned} y_q = \beta _{0q}+ \beta _{1q}x_1 + \beta _{2q}x_2. \end{aligned}$$
(1)

The unknown parameters \(\beta _{0q}\), \(\beta _{1q}\) and \(\beta _{2q}\) can be estimated via the well-known method derived by Koenker and Bassett (1978). Briefly, let \(r_i\) (\(i=1, \ldots , n\)) denote the residuals based on a random sample of size n. The Koenker–Bassett (KB) estimator is based on values for the intercept and slopes that minimize

$$\begin{aligned} \sum _{i=1}^n \psi _q(r_i), \end{aligned}$$
(2)

where

$$\begin{aligned} \psi _q(w) = w(q-I_{w<0}) \end{aligned}$$
(3)

and the indicator function \(I_{w<0}=1\) if \(w<0\); otherwise \(I_{w<0}=0\). The resulting estimates of \(\beta _{0q}\), \(\beta _{1q}\) and \(\beta _{2q}\) are denoted by \(b_{0q}\), \(b_{1q}\) and \(b_{2q}\), respectively.

Note that the KB estimator provides an estimate of the conditional distribution of Y given that \(X_1=x_1\) and \(X_2=x_2\). For convenience, let \(u=100 q\). Following Wilcox (2023), the conditional distribution is estimated by computing

$$\begin{aligned} D_{u}(x_1, x_2) = b_{0q}+ b_{1q}x_1+ b_{2q}x_2 \end{aligned}$$
(4)

for \(q=.01, .02, \ldots , .99\).
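
To make the computations concrete, the following is a minimal sketch, not the author's code, of how the conditional quantiles in (4) might be computed for a single group using the R function rq in the quantreg package; the data and the covariate point (x1, x2) are hypothetical.

library(quantreg)

set.seed(1)
n  <- 100
X1 <- rnorm(n)
X2 <- rnorm(n)
Y  <- .5 * X1 + .3 * X2 + rnorm(n)           # hypothetical data for one group

taus <- seq(.01, .99, by = .01)              # q = .01, .02, ..., .99
fit  <- rq(Y ~ X1 + X2, tau = taus)          # KB estimates b_0q, b_1q, b_2q

x1 <- 0
x2 <- 0                                      # covariate point of interest
# coef(fit) is a 3 x 99 matrix; each column holds (b_0q, b_1q, b_2q)
D <- as.vector(c(1, x1, x2) %*% coef(fit))   # D_u(x1, x2) for u = 1, ..., 99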

Next, let \(D_{uj}(x_1, x_2)\) denote the estimate of the conditional distribution for the jth group (\(j=1, 2\) ) and let \(\pi (x_1, x_2)\) denote the conditional value of \(\pi \) given that \(X_1=x_1\) and \(X_2=x_2\). An estimate of \(\pi (x_1, x_2)\) is

$$\begin{aligned} \hat{\pi }(x_1, x_2)= \frac{1}{99} \frac{1}{99} \sum _{u=1}^{99} \sum _{v=1}^{99} I_{uv}(x_1, x_2), \end{aligned}$$
(5)

where the indicator function \(I_{uv}(x_1, x_2)=1\) if \(D_{u 1}(x_1, x_2) <D_{v 2}(x_1, x_2)\); otherwise \(I_{uv}(x_1, x_2)=0\).
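
Given the two groups' conditional quantiles, (5) reduces to averaging an indicator over all 99 by 99 pairs. The following sketch assumes the two vectors of conditional quantiles have already been computed as illustrated above; here they are replaced by hypothetical normal quantiles.

# Estimate (5): the proportion of pairs for which a conditional quantile of
# the first group is less than a conditional quantile of the second group.
pi.hat <- function(D1, D2) mean(outer(D1, D2, FUN = "<"))

u  <- seq(.01, .99, by = .01)
D1 <- qnorm(u, mean = 0)       # hypothetical conditional quantiles, group 1
D2 <- qnorm(u, mean = .5)      # hypothetical conditional quantiles, group 2
pi.hat(D1, D2)                 # estimate of pi(x1, x2) at the covariate point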

When there is no covariate, the usual estimate of \(\pi \) is

$$\begin{aligned} \hat{\pi }_a= \frac{1}{n_1 n_2} \sum _{i=1}^{n_1} \sum _{k=1}^{n_2} K_{ik}, \end{aligned}$$
(6)

where \(n_j\) (\(j=1, 2\)) is the sample size for the jth group, and \(K_{ik}=1\) if \(Y_{i1} < Y_{k2}\); otherwise \(K_{ik}=0\). Wilcox (2023) reported a partial check on the accuracy of (5) when there is a single covariate by comparing its sampling distribution to the sampling distribution of \(\hat{\pi }_a\) given by (6). This was done for the situation where X and Y are independent and both have a standard normal distribution. The check was repeated here for two covariates using a simulation with 2000 replications. The covariate points were taken to be (0, 0) and (.67, .67), but for brevity only the results for the latter case are reported. The results are reported in Table 1 for sample sizes \(n=20\) and 1000. As can be seen, there is little or no bias associated with (5) and the two sampling distributions are very similar. The main difference is their standard errors when the sample size is small. When \(n=20\), the standard error of (5) was estimated to be .15, while the standard error of \(\hat{\pi }_a\) was estimated to be .09.
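
For comparison, a sketch of the no-covariate estimator (6), again using hypothetical data:

# Estimate (6): the usual estimate of pi when there is no covariate.
pi.hat.a <- function(y1, y2) mean(outer(y1, y2, FUN = "<"))

set.seed(2)
pi.hat.a(rnorm(20), rnorm(20))   # near .5 on average when the distributions match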

Table 1 Properties of the estimators given by (5) and (6)

There is the issue of testing

$$\begin{aligned} H_0: \pi (x_1, x_2)=.5, \end{aligned}$$
(7)

as well as computing a confidence interval for \(\pi (x_1, x_2)\). Following Wilcox (2023), the method begins by generating a bootstrap sample from both groups. Viewing a random sample as data stored in a matrix with \(n_j\) rows for the jth group, a bootstrap sample from the jth group consists of \(n_j\) rows randomly sampled with replacement from the matrix of observed data. Based on these bootstrap samples, let \(\pi ^*(x_1, x_2)\) denote the estimate of \(\pi (x_1, x_2)\). Repeat this process B times, yielding \(\pi _1^*(x_1, x_2), \ldots , \pi _B^*(x_1, x_2)\). An estimate of the squared standard error of \(\hat{\pi }(x_1, x_2)\) is

$$\begin{aligned} \hat{\tau }^2 = \frac{1}{B-1} \sum _{b=1}^B (\pi _b^*(x_1, x_2) - \bar{\pi }(x_1, x_2))^2, \end{aligned}$$
(8)

where \( \bar{\pi }(x_1, x_2)= \sum _{b=1}^B \pi _b^*(x_1, x_2)/B\). It is assumed that \(\hat{\pi }(x_1, x_2)\) has a normal distribution, in which case a \(1-\alpha \) confidence interval for \(\pi (x_1, x_2)\) is taken to be

$$\begin{aligned} \hat{\pi }(x_1, x_2) \pm z \hat{\tau }, \end{aligned}$$
(9)

and (7) is rejected at the \(\alpha \) level if

$$\begin{aligned} |W| \ge z, \end{aligned}$$
(10)

where z is the \(1-\alpha /2\) quantile of a standard normal distribution and

$$\begin{aligned} W= \frac{\hat{\pi }(x_1, x_2) -.5}{\hat{\tau }}. \end{aligned}$$
(11)

Here, \(B=100\) is used for two reasons. First, results reported by Efron (1987) indicate that \(B=100\) suffices in general. Second, execution time is a practical concern: when running a simulation, each replication took 68 seconds despite taking advantage of a multicore processor, so 1000 replications required 18.8 hours of execution time.
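
The following is a minimal sketch, not the author's implementation, of the bootstrap estimate of the standard error in (8), the confidence interval (9) and the test statistic (11). Each group's data are assumed to be stored in a matrix with columns \(X_1\), \(X_2\) and Y; all function and variable names are illustrative.

library(quantreg)

# Conditional quantiles D_u(x1, x2), u = 1, ..., 99, for one group.
cond.quant <- function(dat, x1, x2, taus = seq(.01, .99, by = .01)) {
  fit <- rq(dat[, 3] ~ dat[, 1] + dat[, 2], tau = taus)
  as.vector(c(1, x1, x2) %*% coef(fit))
}

# Estimate (5) based on the two groups' data matrices.
pi.hat.xy <- function(dat1, dat2, x1, x2) {
  mean(outer(cond.quant(dat1, x1, x2), cond.quant(dat2, x1, x2), FUN = "<"))
}

# Bootstrap standard error (8), confidence interval (9) and statistic (11).
wmw.cov2 <- function(dat1, dat2, x1, x2, B = 100, alpha = .05) {
  est <- pi.hat.xy(dat1, dat2, x1, x2)
  pistar <- replicate(B, {
    b1 <- dat1[sample(nrow(dat1), replace = TRUE), ]   # resample rows, group 1
    b2 <- dat2[sample(nrow(dat2), replace = TRUE), ]   # resample rows, group 2
    pi.hat.xy(b1, b2, x1, x2)
  })
  tau.hat <- sd(pistar)                     # square root of (8)
  z <- qnorm(1 - alpha / 2)
  list(estimate = est,
       ci = est + c(-1, 1) * z * tau.hat,   # the interval (9)
       W  = (est - .5) / tau.hat)           # reject (7) if |W| >= z
}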

2.1 Choosing Covariate Points

In some situations there might be substantive reasons for focusing on certain covariate points. If this is not the case, testing (7) for a reasonably wide range of points would be of interest to get some sense of where and how the two groups differ. If the number of covariate points is large, execution time is an issue if (7) is tested for each point. This section suggests one way this issue might be addressed.

The basic idea is to first choose a point that is deeply nested in the cloud of covariate points, followed by a collection of points that are increasingly far from the center of the cloud. Mahalanobis distance is perhaps the most obvious choice for implementing this approach, but the sample mean is not necessarily near the center of the data cloud, and Mahalanobis distance implicitly assumes that sampling is from an elliptically contoured distribution.

Here, Mahalanobis distance is replaced by a particular version of what are generally known as projection distances. A detailed description of projection distances, as well as related methods, is given in Wilcox (2022). For brevity, only an outline of the method is provided here.

One version of projection distances begins by computing some robust measure of location. The median of the marginal distributions could be used, but a criticism is that this approach does not deal with outliers in a manner that takes into account the overall structure of the data cloud. There are several alternatives to the marginal medians that deal with this issue (e.g., Wilcox 2022), but the practical utility of these alternative estimators, for the problem at hand, is left to future investigations. For notational convenience, denote the n covariate points by \(\textbf{X}_1, \ldots , \textbf{X}_n\). For the ith point, \(\textbf{X}_i\), orthogonally project the data onto the line connecting \(\textbf{X}_i\) and the median. Based on this projection, compute the distance of each of the n projected points from the median and then standardize this distance by dividing by some robust measure of scale associated with the projected data. Repeat this process for each \(\textbf{X}_i\) (\(i=1, \ldots , n\)). The projection distance of \(\textbf{X}_j\) is its largest standardized distance among all n projections. Typically, the measure of scale is taken to be the interquartile range or the median absolute deviation (MAD) statistic. If \(\mathcal {P}_1, \ldots , \mathcal {P}_n\) denote the projected data, MAD is

$$\begin{aligned} \textrm{Med}(|\mathcal {P}_1-\textrm{Med} (\mathcal {P})|, \ldots , |\mathcal {P}_n-\textrm{Med}(\mathcal {P})|), \end{aligned}$$
(12)

where Med denotes the median. Let \(d_j\) denote the projection distance of \(\textbf{X}_j\). The projection depth of \(\textbf{X}_j\) is

$$\begin{aligned} \mathcal {D}_j = \frac{1}{1+d_j}. \end{aligned}$$
(13)

The higher the depth, the closer the point is to the center of the data cloud. The R function pdepth in the R package WRS (Wilcox & Schönbrodt 2016) performs the calculations using the interquartile range.
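
The following base R sketch illustrates the projection distances and depths in (12) and (13), using the marginal medians as the center and MAD as the measure of scale. It is an illustration of the idea only, not the WRS or DepthProc implementation.

# Projection depths (13) for an n x p matrix of covariate points X.
proj.depth <- function(X) {
  n    <- nrow(X)
  ctr  <- apply(X, 2, median)    # center: marginal medians
  Xc   <- sweep(X, 2, ctr)       # centered points
  dmat <- matrix(0, n, n)        # dmat[i, j]: distance of X_j in projection i
  for (i in 1:n) {
    dir <- Xc[i, ]
    len <- sqrt(sum(dir^2))
    if (len == 0) next           # X_i coincides with the center
    proj <- as.vector(Xc %*% dir) / len
    # standardized distance from the median of the projected data, as in (12)
    dmat[i, ] <- abs(proj - median(proj)) / mad(proj, constant = 1)
  }
  d <- apply(dmat, 2, max)       # projection distance of each X_j
  1 / (1 + d)                    # projection depth (13)
}

# Example: locate the deepest of 50 bivariate normal points.
set.seed(3)
X <- matrix(rnorm(100), ncol = 2)
which.max(proj.depth(X))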

An alternative approach is to use random projections via the R function depthProjection, which is part of the R package DepthProc (Zawadzki et al., 2022). This function uses MAD as a measure of scale. Generally, the R functions pdepth and depthProjection give very similar results. Here, the R function depthProjection is used simply because it reduces execution time.

3 Some Simulation Results

This section reports simulation results on how well the proposed method performs when making inferences about \(\pi (x_1, x_2)\). Generally, it is when skewness and kurtosis are both relatively high that the performance of many inferential methods suffers. Because of the high execution time associated with the method used here, the strategy was to start with normal distributions and then consider a situation where the distributions are skewed with relatively heavy tails. A symmetric distribution with relatively heavy tails is considered as well.

Here, data were generated from a g-and-h distribution. If Z is generated from a standard normal distribution,

$$\begin{aligned} V = {\left\{ \begin{array}{ll} \frac{\exp (gZ)-1}{g}\exp (hZ^2/2),&{} \text {if } g > 0\\ Z \exp (hZ^2/2),&{} \text {if } g = 0 \end{array}\right. } \end{aligned}$$
(14)

has a g-and-h distribution (Hoaglin, 1985), where g and h are parameters that determine the first four moments. The choice \((g, h) =(1, 0)\) corresponds to a lognormal distribution shifted to have a median of zero, while \((g, h) =(0, .2)\) is a symmetric distribution about zero having kurtosis approximately equal to 24.8. The lognormal distribution has skewness 6.18 and kurtosis 113.9 and appears to reflect a relatively extreme departure from a normal distribution. For example, Wu (2002) reports estimates of skewness based on data from 24 dissertations; the maximum estimate was 2.7. Cain et al. (2017) gathered 1,567 estimates of skewness and kurtosis reported in various published papers. They report that 99% of the estimates were less than the skewness and kurtosis of a lognormal distribution, suggesting that if a method performs reasonably well when dealing with a lognormal distribution, it is likely to perform well in practice.
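
Generating observations from a g-and-h distribution via (14) is straightforward; the helper below is an illustrative sketch (the name rgh is not that of any package function).

# Generate n observations from a g-and-h distribution as defined in (14).
rgh <- function(n, g = 0, h = 0) {
  Z <- rnorm(n)
  if (g > 0) {
    (exp(g * Z) - 1) / g * exp(h * Z^2 / 2)
  } else {
    Z * exp(h * Z^2 / 2)
  }
}

set.seed(4)
x.lnorm <- rgh(1000, g = 1, h = 0)    # lognormal shifted to have median zero
x.heavy <- rgh(1000, g = 0, h = .2)   # symmetric with relatively heavy tails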

Pearson’s correlation for \(X_1\) and \(X_2\) was taken to be 0 or .5. For non-normal data, this was done by generating data from a bivariate normal distribution with correlation \(\rho _a\), where \(\rho _a\) was chosen so that the correlation, after transforming the marginal distributions to a g-and-h distribution, is equal to the desired value. This was done via the R function rngh in the R package WRS (Wilcox & Schönbrodt 2016). If, for example, \(\rho _a=.5\) is used, and the marginal distributions are transformed to a g-and-h distribution with \(g=1\) and \(h=0\), the resulting correlation differs from .5. To get a correlation equal to .5 after transforming, \(\rho _a=.61\) was used.
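
A sketch of the idea behind this adjustment, not the rngh function itself: generate bivariate normal data with correlation \(\rho _a=.61\), transform the marginals with \(g=1\) and \(h=0\), and inspect the resulting correlation.

library(MASS)

set.seed(5)
rho.a <- .61                                         # value reported above
Sigma <- matrix(c(1, rho.a, rho.a, 1), 2, 2)
Z     <- mvrnorm(10^5, mu = c(0, 0), Sigma = Sigma)  # bivariate normal
X     <- exp(Z) - 1                                  # g = 1, h = 0 marginals
cor(X[, 1], X[, 2])                                  # close to the target .5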

Two covariate points were used. The first was the deepest point among the covariate points associated with the first group. The second was the point midway between the deepest point and the least deep point. Data were generated according to the model

$$\begin{aligned} Y_j= \beta _{1j} X_{1j}+\beta _{2j}X_{2j} + \beta _{0j}+ \epsilon , \end{aligned}$$
(15)

where \(X_{1j}\), \(X_{2j}\) and \(\epsilon \) were taken to have identical distributions. Table 2 reports the estimated Type I error probabilities for \((n_1, n_2) = (30, 30)\), (30, 60) and (50, 50) when all of the slopes and the intercept are equal to zero. As can be seen, with one exception, the estimates range between .039 and .067. The largest estimate was .070. Simulations were also run for \((n_1, n_2) = (20, 20)\); generally, the Type I error probability was controlled reasonably well, but for \(g= h =\rho =0\) the estimate was .082. Consequently, to be safe, the proposed method should not be used when the minimum sample size is less than 30.

Table 2 Estimated type I error probabilities

An issue is why the method is unsatisfactory when \(n_1=n_2=20\). One possibility is that the estimate of the standard error, \(\hat{\tau }\), is biased. This was checked when all variables have a standard normal distribution. The actual standard error was estimated based on the 1000 estimates of \(\hat{\pi }(x_1, x_2)\) and found to be .112. The mean of the 1000 estimates of the standard error was .123. The main issue is the tails of the distribution of the test statistic, as revealed by a Q-Q plot not shown here. Testing the hypothesis that the null distribution of W is standard normal, the p-value of the Shapiro–Wilk test was less than .0001. In contrast, when \(n_1=n_2=40\), a Q-Q plot suggests that W has, to a close approximation, a standard normal distribution (the plot is very similar to the Q-Q plot in Wilcox, 2023), and the Shapiro–Wilk test returns a p-value equal to .233. For \(n_1=n_2=40\), the bias of \(\hat{\tau }\) was estimated to be negligible.
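
The diagnostic just described is easy to reproduce in principle; in the sketch below, W.sim is only a placeholder for the 1000 simulated values of the test statistic W.

W.sim <- rnorm(1000)            # placeholder for the simulated values of W
qqnorm(W.sim); qqline(W.sim)    # Q-Q plot against a standard normal
shapiro.test(W.sim)             # Shapiro-Wilk test of normality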

Some additional simulations were run where the dependent variable has a Poisson distribution with mean equal to 9. The goal was to get some sense of how the method performs when tied values are likely to occur. The results are shown in Table 3 for \(n_1=n_2=30\). Again, control over the probability of a Type I error is reasonably good.

Table 3 Estimated type I error probabilities for a Poisson distribution, \(n_1=n_2=30\)

4 Two Illustrations

This section describes results using data stemming from a study dealing with an intervention program aimed at improving the physical and emotional wellbeing of older adults. The dependent variable is a measure of depressive symptoms (CESD) developed by Radloff (1977). For a summary of how well this measure of depressive symptoms performs, google apa cesd and follow the link.

The first illustration is based on data collected prior to intervention. The two covariates are a measure of perceived health (PH) and a measure of meaningful activities (MAPA). The two groups are participants who did not graduate from high school and those who completed four years of college. The sample sizes are 136 and 48, respectively. Table 4 summarizes some basic properties of the data.

Table 4 Properties of the data used in the first illustration
Fig. 1 Regression surface showing the association between \(\hat{\pi }(x_1, x_2)\) and the two covariates, PH and MAPA

To begin, estimates of \(\pi (x_1, x_2)\) were computed for all of the covariate points. Figure 1 shows a plot of the results using the smoother derived by Cleveland (1979). As can be seen, the plot suggests that as PH increases, \(\hat{\pi }(x_1, x_2)\) decreases. The plot suggests that MAPA plays little or no role when PH is low, but there is an indication that MAPA does play some role when PH is high. Roughly, when PH is low, the estimates suggest that there is little or no difference between the groups. But as PH increases, there is an increasing likelihood that the first group will have higher CESD scores.

Fig. 2 Scatterplot of the covariate points. Covariate points where the groups were compared are indicated by *. Significant results are indicated by an o

Next, (7) was tested based on forty points selected as described in Section 2.1. Figure 2 shows a scatterplot of the covariate points. The covariate points where (7) was tested are indicated by *. Points where the p-value is less than or equal to .05 are indicated by o. Among the eleven points that have a p-value less than or equal to .05, five remain significant based on the method derived by Hochberg (1988) for controlling the probability of one or more Type I errors. The medians for PH and MAPA are 33.9 and 31, respectively. Roughly, a significant result was obtained when both PH and MAPA have values greater than their respective medians.
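
The familywise error adjustment mentioned here can be carried out with the base R function p.adjust; in the sketch below, the forty p-values are hypothetical.

set.seed(7)
p.values <- runif(40)                                 # placeholder for the 40 p-values
p.hoch   <- p.adjust(p.values, method = "hochberg")   # Hochberg (1988)
sum(p.hoch <= .05)                                    # number of significant points
# Benjamini-Hochberg control of the false discovery rate: method = "BH"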

Fig. 3 Same as Fig. 1, only now the data are based on measures taken after participants took part in a six-month intervention program

Fig. 4 Same as Fig. 2, only now the data are based on measures taken after participants took part in a six-month intervention program

The second illustration is based on data from the same study, using the same groups and the same random variables, only now the measures were taken after the participants completed a six-month intervention program. Now the sample sizes are 92 and 33. Figure 3 shows a plot of the regression surface and Fig. 4 shows the results when testing (7). Figures 2 and 4 reflect the same pattern: significant differences when both PH and MAPA are relatively large. Figure 4 also indicates one significant difference when both PH and MAPA are relatively low. This suggests that the first group will have lower CESD scores than the second group when both meaningful activities and perceived health are relatively low.

5 Concluding Remarks

Although the importance of a Type I error can depend on the situation, Bradley (1978) suggested that at a minimum, when testing at the .05 level, the actual level should be between .025 and .075. All indications are that the method used here satisfies this criterion when the minimum sample size is greater than or equal to 30. This is in contrast to the situation where there is a single covariate, in which case a minimum sample size greater than or equal to 20 suffices.

Certainly, an analog of the Wilcoxon–Mann–Whitney method, when dealing with covariates, is a useful tool that yields information that is readily understood by non-statisticians. But it is not being suggested that this approach dominates all other techniques that might be used. As seems evident, no single method tells us everything of interest regarding how two groups differ. For example, when there is a covariate, there is interest in making inferences about the heteroscedastic analog of Cohen’s d that was derived by Kulinskaya et al. (2008). Such a method provides an alternative perspective that helps build a more nuanced understanding of how two groups compare.

R functions for applying the method described here can be accessed by sourcing the file Rallfun-v41.txt, which can be downloaded from https://osf.io/xhe8u/. The R function ancovap2.wmw performs the proposed method and was used to create Figs. 2 and 4. The argument npts controls how many covariate points are used and defaults to 30. The argument method indicates how the FWE rate is controlled and defaults to ‘hoch’, which refers to the method derived by Hochberg (1988). To use the Benjamini and Hochberg (1995) method, set method=‘BH’. The R function ancovap2.wmw.plot plots an estimate of the regression surface for predicting \(\pi (x_1, x_2)\) based on values of the covariates and was used to create Figs. 1 and 3.

A broad issue is the extent to which the basic approach used here can be used with other measures of effect size. This issue was investigated with \(\hat{\pi }(x_1, x_2)\) replaced by what is known as a quantile shift measure of effect size, the details of which are summarized in Wilcox (2022, section 5.7.2). The details are not summarized here because preliminary simulations indicated that the obvious modification of the method used here is unsatisfactory.