1 Introduction

Consider two independent random variables, \(Y_1\) and \(Y_2\). A basic way of comparing the corresponding distributions is in terms of \(\pi =P(Y_1<Y_2)\), the probability that a randomly sampled observation from the first group is less than a randomly sampled observation from the second group. Numerous methods for making inferences about \(\pi \) have been derived beyond the classic Wilcoxon–Mann–Whitney test, twelve of which were compared by Ruscio and Mullen (2012).

A fundamental issue is making inferences about \(\pi \) when there is a covariate. Wilcox (2023) proposed such a method when dealing with a single covariate, the details of which are reviewed below. The method is readily extended to two covariates, but there are no simulation results on how well the method performs in terms of controlling the probability of a Type I error and providing reasonably accurate confidence intervals. The goal in this note is to address this issue and to comment on how the covariate points might be chosen.

This note is organized as follows. Section 2 describes a method for estimating \(\pi \) when there are two covariates. The method is a simple generalization of the method used by Wilcox (2023). Included is a proposed method for computing a confidence interval for \(\pi \), followed by a description of how the covariate points might be chosen. Section 3 reports simulation results on how well the proposed confidence interval for \(\pi \) performs. Section 4 illustrates the proposed method using data dealing with depressive symptoms among older adults. The covariates are a measure of perceived health and a measure of meaningful activities. As will be seen, a plot of the estimates of \(\pi \), as a function of the covariates, provides a more detailed understanding of the role of the covariates.

2 The Proposed Method

For convenience, momentarily consider a single group, let Y denote some random variable of interest, and let \(X_1\) and \(X_2\) denote two covariates associated with this group. It is assumed that the qth quantile of Y, given that \(X_1=x_1\) and \(X_2=x_2\), is given by

$$\begin{aligned} y_q = \beta _{0q}+ \beta _{1q}x_1 + \beta _{2q}x_2. \end{aligned}$$
(1)

The unknown parameters \(\beta _{0q}\), \(\beta _{1q}\) and \(\beta _{2q}\) can be estimated via the well-known method derived by Koenker and Bassett (1978). Briefly, let \(r_i\) (\(i=1, \ldots , n\)) denote the residuals based on a random sample of size n. The Koenker–Bassett (KB) estimator is based on values for the intercept and slopes that minimize

$$\begin{aligned} \sum _{i=1}^n \psi _q(r_i), \end{aligned}$$
(2)

where

$$\begin{aligned} \psi _q(w) = w(q-I_{w<0}) \end{aligned}$$
(3)

and the indicator function \(I_{w<0}=1\) if \(w<0\); otherwise \(I_{w<0}=0\). The resulting estimates of \(\beta _{0q}\), \(\beta _{1q}\) and \(\beta _{2q}\) are denoted by \(b_{0q}\), \(b_{1q}\) and \(b_{2q}\), respectively.

Note that the KB estimator provides an estimate of the conditional distribution of Y given that \(X_1=x_1\) and \(X_2=x_2\). For convenience, let \(u=100 q\). Following Wilcox (2023), the conditional distribution is estimated by computing

$$\begin{aligned} D_{u}(x_1, x_2) = b_{0q}+ b_{1q}x_1+ b_{2q}x_2 \end{aligned}$$
(4)

for \(q=.01, .02, \ldots , .99\).
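
To make the computations concrete, the following is a minimal sketch, not the author's code, of how the conditional quantiles in (4) might be computed for a single group using the R function rq in the quantreg package; the data and the covariate point (x1, x2) are hypothetical.

library(quantreg)

set.seed(1)
n  <- 100
X1 <- rnorm(n)
X2 <- rnorm(n)
Y  <- .5 * X1 + .3 * X2 + rnorm(n)           # hypothetical data for one group

taus <- seq(.01, .99, by = .01)              # q = .01, .02, ..., .99
fit  <- rq(Y ~ X1 + X2, tau = taus)          # KB estimates b_0q, b_1q, b_2q

x1 <- 0
x2 <- 0                                      # covariate point of interest
# coef(fit) is a 3 x 99 matrix; each column holds (b_0q, b_1q, b_2q)
D <- as.vector(c(1, x1, x2) %*% coef(fit))   # D_u(x1, x2) for u = 1, ..., 99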

Next, let \(D_{uj}(x_1, x_2)\) denote the estimate of the conditional distribution for the jth group (\(j=1, 2\) ) and let \(\pi (x_1, x_2)\) denote the conditional value of \(\pi \) given that \(X_1=x_1\) and \(X_2=x_2\). An estimate of \(\pi (x_1, x_2)\) is

$$\begin{aligned} \hat{\pi }(x_1, x_2)= \frac{1}{99} \frac{1}{99} \sum _{u=1}^{99} \sum _{v=1}^{99} I_{uv}(x_1, x_2), \end{aligned}$$
(5)

where the indicator function \(I_{uv}(x_1, x_2)=1\) if \(D_{u 1}(x_1, x_2) <D_{v 2}(x_1, x_2)\); otherwise \(I_{uv}(x_1, x_2)=0\).
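
Given the two groups' conditional quantiles, (5) reduces to averaging an indicator over all 99 by 99 pairs. The following sketch assumes the two vectors of conditional quantiles have already been computed as illustrated above; here they are replaced by hypothetical normal quantiles.

# Estimate (5): the proportion of pairs for which a conditional quantile of
# the first group is less than a conditional quantile of the second group.
pi.hat <- function(D1, D2) mean(outer(D1, D2, FUN = "<"))

u  <- seq(.01, .99, by = .01)
D1 <- qnorm(u, mean = 0)       # hypothetical conditional quantiles, group 1
D2 <- qnorm(u, mean = .5)      # hypothetical conditional quantiles, group 2
pi.hat(D1, D2)                 # estimate of pi(x1, x2) at the covariate point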

When there is no covariate, the usual estimate of \(\pi \) is

$$\begin{aligned} \hat{\pi }_a= \frac{1}{n_1 n_2} \sum _{i=1}^{n_1} \sum _{k=1}^{n_2} K_{ik}, \end{aligned}$$
(6)

where \(n_j\) (\(j=1, 2\)) is the sample size for the jth group, and \(K_{ik}=1\) if \(Y_{i1} < Y_{k2}\); otherwise \(K_{ik}=0\). Wilcox (2023) reported a partial check on the accuracy of (5) when there is a single covariate by comparing its sampling distribution to the sampling distribution of \(\hat{\pi }_a\) given by (6). This was done for the situation where X and Y are independent and both have a standard normal distribution. The check was repeated here for two covariates using a simulation with 2000 replications. The covariate points were taken to be (0, 0) and (.67, .67), but for brevity only the results for the latter case are reported. The results are reported in Table 1 for sample sizes \(n=20\) and 1000. As can be seen, there is little or no bias associated with (5) and the two sampling distributions are very similar. The main difference is their standard errors when the sample size is small. When \(n=20\), the standard error of (5) was estimated to be .15, while the standard error of \(\hat{\pi }_a\) was estimated to be .09.
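
For comparison, a sketch of the no-covariate estimator (6), again using hypothetical data:

# Estimate (6): the usual estimate of pi when there is no covariate.
pi.hat.a <- function(y1, y2) mean(outer(y1, y2, FUN = "<"))

set.seed(2)
pi.hat.a(rnorm(20), rnorm(20))   # near .5 on average when the distributions match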

Table 1 Properties of the estimators given by (5) and (6)

There is the issue of testing

$$\begin{aligned} H_0: \pi (x_1, x_2)=.5, \end{aligned}$$
(7)

as well as computing a confidence interval for \(\pi (x_1, x_2)\). Following Wilcox (2023), the method begins by generating a bootstrap sample from both groups. Viewing a random sample as data stored in a matrix with \(n_j\) rows for the jth group, a bootstrap sample from the jth group consists of \(n_j\) rows randomly sampled with replacement from the matrix of observed data. Based on these bootstrap samples, let \(\pi ^*(x_1, x_2)\) denote the estimate of \(\pi (x_1, x_2)\). Repeat this process B times, yielding \(\pi _1^*(x_1, x_2), \ldots , \pi _B^*(x_1, x_2)\). An estimate of the squared standard error of \(\hat{\pi }(x_1, x_2)\) is

$$\begin{aligned} \hat{\tau }^2 = \frac{1}{B-1} \sum _{b=1}^B (\pi _b^*(x_1, x_2) - \bar{\pi }(x_1, x_2))^2, \end{aligned}$$
(8)

where \( \bar{\pi }(x_1, x_2)= \sum _{b=1}^B \pi _b^*(x_1, x_2)/B\). It is assumed that \(\hat{\pi }(x_1, x_2)\) has a normal distribution, in which case a \(1-\alpha \) confidence interval for \(\pi (x_1, x_2)\) is taken to be

$$\begin{aligned} \hat{\pi }(x_1, x_2) \pm z \hat{\tau }, \end{aligned}$$
(9)

and (7) is rejected at the \(\alpha \) level if

$$\begin{aligned} |W| \ge z, \end{aligned}$$
(10)

where z is the \(1-\alpha /2\) quantile of a standard normal distribution and

$$\begin{aligned} W= \frac{\hat{\pi }(x_1, x_2) -.5}{\hat{\tau }}. \end{aligned}$$
(11)

Here, \(B=100\) is used for two reasons. First, results reported by Efron (1987) indicate that \(B=100\) suffices in general. Second, execution time is a practical concern: when running a simulation, each replication took 68 seconds despite taking advantage of a multicore processor, so 1000 replications required 18.8 hours of execution time.
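
The following is a minimal sketch, not the author's implementation, of the bootstrap estimate of the standard error in (8), the confidence interval (9) and the test statistic (11). Each group's data are assumed to be stored in a matrix with columns \(X_1\), \(X_2\) and Y; all function and variable names are illustrative.

library(quantreg)

# Conditional quantiles D_u(x1, x2), u = 1, ..., 99, for one group.
cond.quant <- function(dat, x1, x2, taus = seq(.01, .99, by = .01)) {
  fit <- rq(dat[, 3] ~ dat[, 1] + dat[, 2], tau = taus)
  as.vector(c(1, x1, x2) %*% coef(fit))
}

# Estimate (5) based on the two groups' data matrices.
pi.hat.xy <- function(dat1, dat2, x1, x2) {
  mean(outer(cond.quant(dat1, x1, x2), cond.quant(dat2, x1, x2), FUN = "<"))
}

# Bootstrap standard error (8), confidence interval (9) and statistic (11).
wmw.cov2 <- function(dat1, dat2, x1, x2, B = 100, alpha = .05) {
  est <- pi.hat.xy(dat1, dat2, x1, x2)
  pistar <- replicate(B, {
    b1 <- dat1[sample(nrow(dat1), replace = TRUE), ]   # resample rows, group 1
    b2 <- dat2[sample(nrow(dat2), replace = TRUE), ]   # resample rows, group 2
    pi.hat.xy(b1, b2, x1, x2)
  })
  tau.hat <- sd(pistar)                     # square root of (8)
  z <- qnorm(1 - alpha / 2)
  list(estimate = est,
       ci = est + c(-1, 1) * z * tau.hat,   # the interval (9)
       W  = (est - .5) / tau.hat)           # reject (7) if |W| >= z
}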

2.1 Choosing Covariate Points

In some situations there might be substantive reasons for focusing on certain covariate points. If this is not the case, testing (7) for a reasonably wide range of points would be of interest to get some sense of where and how the two groups differ. If the number of covariate points is large, execution time is an issue if (7) is tested for each point. This section suggests one way this issue might be addressed.

The basic idea is to first choose a point that is deeply nested in the cloud of covariate points, followed by a collection of points that are increasingly far from the center of the cloud. Mahalanobis distance is perhaps the most obvious choice for implementing this approach, but the sample mean is not necessarily near the center of the data cloud, and Mahalanobis distance implicitly assumes that sampling is from an elliptically contoured distribution.

Here, Mahalanobis distance is replaced by a particular version of what are generally known as projection distances. A detailed description of projection distances, as well as related methods, is given in Wilcox (2022). For brevity, only an outline of the method is provided here.

One version of projection distances begins by computing some robust measure of location. The median of the marginal distributions could be used, but a criticism is that this approach does not deal with outliers in a manner that takes into account the overall structure of the data cloud. There are several alternatives to the marginal medians that deal with this issue (e.g., Wilcox 2022), but the practical utility of these alternative estimators, for the problem at hand, is left to future investigations. For notational convenience, denote the n covariate points by \(\textbf{X}_1, \ldots , \textbf{X}_n\). For the ith point, \(\textbf{X}_i\), orthogonally project the data onto the line connecting \(\textbf{X}_i\) and the median. Based on this projection, compute the distance of each of the n projected points from the median and then standardize this distance by dividing by some robust measure of scale associated with the projected data. Repeat this process for each \(\textbf{X}_i\) (\(i=1, \ldots , n\)). The projection distance of \(\textbf{X}_j\) is its largest standardized distance among all n projections. Typically, the measure of scale is taken to be the interquartile range or the median absolute deviation (MAD) statistic. If \(\mathcal {P}_1, \ldots , \mathcal {P}_n\) denote the projected data, MAD is

$$\begin{aligned} \textrm{Med}(|\mathcal {P}_1-\textrm{Med} (\mathcal {P})|, \ldots , |\mathcal {P}_n-\textrm{Med}(\mathcal {P})|), \end{aligned}$$
(12)

where Med denotes the median. Let \(d_j\) denote the projection distance of \(\textbf{X}_j\). The projection depth of \(\textbf{X}_j\) is

$$\begin{aligned} \mathcal {D}_j = \frac{1}{1+d_j}. \end{aligned}$$
(13)

The higher the depth, the closer the point is to the center of the data cloud. The R function pdepth in the R package WRS (Wilcox & Schönbrodt 2016) performs the calculations using the interquartile range.
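
The following base R sketch illustrates the projection distances and depths in (12) and (13), using the marginal medians as the center and MAD as the measure of scale. It is an illustration of the idea only, not the WRS or DepthProc implementation.

# Projection depths (13) for an n x p matrix of covariate points X.
proj.depth <- function(X) {
  n    <- nrow(X)
  ctr  <- apply(X, 2, median)    # center: marginal medians
  Xc   <- sweep(X, 2, ctr)       # centered points
  dmat <- matrix(0, n, n)        # dmat[i, j]: distance of X_j in projection i
  for (i in 1:n) {
    dir <- Xc[i, ]
    len <- sqrt(sum(dir^2))
    if (len == 0) next           # X_i coincides with the center
    proj <- as.vector(Xc %*% dir) / len
    # standardized distance from the median of the projected data, as in (12)
    dmat[i, ] <- abs(proj - median(proj)) / mad(proj, constant = 1)
  }
  d <- apply(dmat, 2, max)       # projection distance of each X_j
  1 / (1 + d)                    # projection depth (13)
}

# Example: locate the deepest of 50 bivariate normal points.
set.seed(3)
X <- matrix(rnorm(100), ncol = 2)
which.max(proj.depth(X))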

An alternative approach is to use random projections via the R function depthProjection, which is part of the R package DepthProc (Zawadzki et al., 2022). This function uses MAD as a measure of scale. Generally, the R functions pdepth and depthProjection give very similar results. Here, the R function depthProjection is used simply because it reduces execution time.

3 Some Simulation Results

This section reports simulation results on how well the proposed method performs when making inferences about \(\pi (x_1, x_2)\). Generally, it is when skewness and kurtosis are both relatively high that the performance of many inferential methods suffers. Because of the high execution time associated with the method used here, the strategy was to start with normal distributions and then consider a situation where the distributions are skewed with relatively heavy tails. A symmetric distribution with relatively heavy tails is considered as well.

Here, data were generated from a g-and-h distribution. If Z is generated from a standard normal distribution,

$$\begin{aligned} V = {\left\{ \begin{array}{ll} \frac{\exp (gZ)-1}{g}\exp (hZ^2/2),&{} \text {if } g > 0\\ Z \exp (hZ^2/2),&{} \text {if } g = 0 \end{array}\right. } \end{aligned}$$
(14)

has a g-and-h distribution (Hoaglin, 1985), where g and h are parameters that determine the first four moments. The choice \((g, h) =(1, 0)\) corresponds to a lognormal distribution shifted to have a median of zero, while \((g, h) =(0, .2)\) is a symmetric distribution about zero having kurtosis approximately equal to 24.8. The lognormal distribution has skewness 6.18 and kurtosis 113.9 and appears to reflect a relatively extreme departure from a normal distribution. For example, Wu (2002) reports estimates of skewness based on data from 24 dissertations; the maximum estimate was 2.7. Cain et al. (2017) gathered 1,567 estimates of skewness and kurtosis reported in various published papers. They report that 99% of the estimates were less than the skewness and kurtosis of a lognormal distribution, suggesting that if a method performs reasonably well when dealing with a lognormal distribution, it is likely to perform well in practice.
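
Generating observations from a g-and-h distribution via (14) is straightforward; the helper below is an illustrative sketch (the name rgh is not that of any package function).

# Generate n observations from a g-and-h distribution as defined in (14).
rgh <- function(n, g = 0, h = 0) {
  Z <- rnorm(n)
  if (g > 0) {
    (exp(g * Z) - 1) / g * exp(h * Z^2 / 2)
  } else {
    Z * exp(h * Z^2 / 2)
  }
}

set.seed(4)
x.lnorm <- rgh(1000, g = 1, h = 0)    # lognormal shifted to have median zero
x.heavy <- rgh(1000, g = 0, h = .2)   # symmetric with relatively heavy tails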

Pearson’s correlation for \(X_1\) and \(X_2\) was taken to be 0 or .5. For non-normal data, this was done by generating data from a bivariate normal distribution with correlation \(\rho _a\), where \(\rho _a\) was chosen so that the correlation, after transforming the marginal distributions to a g-and-h distribution, is equal to the desired value. This was done via the R function rngh in the R package WRS (Wilcox & Schönbrodt 2016). If, for example, \(\rho _a=.5\) is used, and the marginal distributions are transformed to a g-and-h distribution with \(g=1\) and \(h=0\), the resulting correlation differs from .5. To get a correlation equal to .5 after transforming, \(\rho _a=.61\) was used.
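
A sketch of the idea behind this adjustment, not the rngh function itself: generate bivariate normal data with correlation \(\rho _a=.61\), transform the marginals with \(g=1\) and \(h=0\), and inspect the resulting correlation.

library(MASS)

set.seed(5)
rho.a <- .61                                         # value reported above
Sigma <- matrix(c(1, rho.a, rho.a, 1), 2, 2)
Z     <- mvrnorm(10^5, mu = c(0, 0), Sigma = Sigma)  # bivariate normal
X     <- exp(Z) - 1                                  # g = 1, h = 0 marginals
cor(X[, 1], X[, 2])                                  # close to the target .5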

Two covariate points were used. The first was the deepest point among the covariate points associated with the first group. The second was the point midway between the deepest point and the least deep point. Data were generated according to the model

$$\begin{aligned} Y_j= \beta _{1j} X_{1j}+\beta _{2j}X_{2j} + \beta _{0j}+ \epsilon , \end{aligned}$$
(15)

where \(X_{1j}\), \(X_{2j}\) and \(\epsilon \) were taken to have identical distributions. Table 2 reports the estimated Type I error probabilities for \((n_1, n_2) = (30, 30)\), (30, 60) and (50, 50) when all of the slopes and the intercept are equal to zero. As can be seen, with one exception, the estimates range between .039 and .067. The largest estimate was .070. Simulations were also run for \((n_1, n_2) = (20, 20)\); generally, the Type I error probability was controlled reasonably well, but for \(g= h =\rho =0\) the estimate was .082. Consequently, to be safe, the proposed method should not be used when the minimum sample size is less than 30.

Table 2 Estimated type I error probabilities

An issue is why the method is unsatisfactory when \(n_1=n_2=20\). One possibility is that the estimate of the standard error, \(\hat{\tau }\), is biased. This was checked when all variables have a standard normal distribution. The actual standard error was estimated based on the 1000 estimates of \(\hat{\pi }(x_1, x_2)\) and found to be .112. The mean of the 1000 estimates of the standard error was .123. The main issue is the tails of the distribution of the test statistic, as revealed by a Q-Q plot not shown here. Testing the hypothesis that the null distribution of W is standard normal, the p-value of the Shapiro–Wilk test was less than .0001. In contrast, when \(n_1=n_2=40\), a Q-Q plot suggests that W has, to a close approximation, a standard normal distribution (the plot is very similar to the Q-Q plot in Wilcox, 2023), and the Shapiro–Wilk test returns a p-value equal to .233. For \(n_1=n_2=40\), the bias of \(\hat{\tau }\) was estimated to be negligible.
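
The diagnostic just described is easy to reproduce in principle; in the sketch below, W.sim is only a placeholder for the 1000 simulated values of the test statistic W.

W.sim <- rnorm(1000)            # placeholder for the simulated values of W
qqnorm(W.sim); qqline(W.sim)    # Q-Q plot against a standard normal
shapiro.test(W.sim)             # Shapiro-Wilk test of normality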

Some additional simulations were run where the dependent variable has a Poisson distribution with mean equal to 9. The goal was to get some sense of how the method performs when tied values are likely to occur. The results are shown in Table 3 for \(n_1=n_2=30\). Again, control over the probability of a Type I error is reasonably good.

Table 3 Estimated type I error probabilities for a Poisson distribution, \(n_1=n_2=30\)

4 Two Illustrations

This section describes results using data stemming from a study dealing with an intervention program aimed at improving the physical and emotional wellbeing of older adults. The dependent variable is a measure of depressive symptoms (CESD) developed by Radloff (1977). For a summary of how well this measure of depressive symptoms performs, google apa cesd and follow the link.

The first illustration is based on data collected prior to intervention. The two covariates are a measure of perceived health (PH) and a measure of meaningful activities (MAPA). The two groups are participants who did not graduate from high school and those who completed four years of college. The sample sizes are 136 and 48, respectively. Table 4 summarizes some basic properties of the data.

Table 4 Properties of the data used in the first illustration
Fig. 1 Regression surface showing the association between \(\hat{\pi }(x_1, x_2)\) and the two covariates, PH and MAPA

To begin, estimates of \(\pi (x_1, x_2)\) were computed for all of the covariate points. Figure 1 shows a plot of the results using the smoother derived by Cleveland (1979). As can be seen, the plot suggests that as PH increases, \(\hat{\pi }(x_1, x_2)\) decreases. The plot suggests that MAPA plays little or no role when PH is low, but there is an indication that MAPA does play some role when PH is high. Roughly, when PH is low, the estimates suggest that there is little or no difference between the groups. But as PH increases, there is an increasing likelihood that the first group will have higher CESD scores.

Fig. 2 Scatterplot of the covariate points. Covariate points where the groups were compared are indicated by *. Significant results are indicated by an o

Next, (7) was tested based on forty points selected as described in Section 2.1. Figure 2 shows a scatterplot of the covariate points. The covariate points where (7) was tested are indicated by *. Points where the p-value is less than or equal to .05 are indicated by o. Among the eleven points that have a p-value less than or equal to .05, five remain significant based on the method derived by Hochberg (1988) for controlling the probability of one or more Type I errors. The medians for PH and MAPA are 33.9 and 31, respectively. Roughly, a significant result was obtained when both PH and MAPA have values greater than their respective medians.
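
The familywise error adjustment mentioned here can be carried out with the base R function p.adjust; in the sketch below, the forty p-values are hypothetical.

set.seed(7)
p.values <- runif(40)                                 # placeholder for the 40 p-values
p.hoch   <- p.adjust(p.values, method = "hochberg")   # Hochberg (1988)
sum(p.hoch <= .05)                                    # number of significant points
# Benjamini-Hochberg control of the false discovery rate: method = "BH"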

Fig. 3 Same as Fig. 1, only now the data are based on measures taken after participants took part in a six-month intervention program

Fig. 4 Same as Fig. 2, only now the data are based on measures taken after participants took part in a six-month intervention program

The second illustration is based on data from the same study, using the same groups and the same random variables, only now the measures were taken after the participants completed a six-month intervention program. Now the sample sizes are 92 and 33. Figure 3 shows a plot of the regression surface and Fig. 4 shows the results when testing (7). Figures 2 and 4 reflect the same pattern: significant differences when both PH and MAPA are relatively large. Figure 4 also indicates one significant difference when both PH and MAPA are relatively low. This suggests that the first group will have lower CESD scores than the second group when both meaningful activities and perceived health are relatively low.

5 Concluding Remarks

Although the importance of a Type I error can depend on the situation, Bradley (1978) suggested that at a minimum, when testing at the .05 level, the actual level should be between .025 and .075. All indications are that the method used here satisfies this criterion when the minimum sample size is greater than or equal to 30. This is in contrast to the situation where there is a single covariate, in which case a minimum sample size greater than or equal to 20 suffices.

Certainly, an analog of the Wilcoxon–Mann–Whitney method, when dealing with covariates, is a useful tool that yields information that is readily understood by non-statisticians. But it is not being suggested that this approach dominates all other techniques that might be used. As seems evident, no single method tells us everything of interest regarding how two groups differ. For example, when there is a covariate, there is interest in making inferences about the heteroscedastic analog of Cohen’s d that was derived by Kulinskaya et al. (2008). Such a method provides an alternative perspective that helps build a more nuanced understanding of how two groups compare.

R functions for applying the method described here can be accessed by sourcing the file Rallfun-v41.txt, which can be downloaded from https://osf.io/xhe8u/. The R function ancovap2.wmw performs the proposed method and was used to create Figs. 2 and 4. The argument npts controls how many covariate points are used and defaults to 30. The argument method indicates how the FWE rate is controlled and defaults to ‘hoch’, which refers to the method derived by Hochberg (1988). To use the Benjamini and Hochberg (1995) method, set method=‘BH’. The R function ancovap2.wmw.plot plots an estimate of the regression surface for predicting \(\pi (x_1, x_2)\) based on values of the covariates and was used to create Figs. 1 and 3.

A broad issue is the extent to which the basic approach used here can be used with other measures of effect size. This issue was investigated with \(\hat{\pi }(x_1, x_2)\) replaced by what is known as a quantile shift measure of effect size, the details of which are summarized in Wilcox (2022, section 5.7.2). The details are not summarized here because preliminary simulations indicated that the obvious modification of the method used here is unsatisfactory.