Article

Nonparametric Limits of Agreement for Small to Moderate Sample Sizes: A Simulation Study

Maria E. Frey 1, Hans C. Petersen 2 and Oke Gerke 3,4,*

1 Department of Toxicology, Charles River Laboratories Copenhagen A/S, 4623 Lille Skensved, Denmark; mariafrey93@hotmail.com
2 Department of Mathematics and Computer Science, University of Southern Denmark, 5230 Odense M, Denmark; hcpetersen@sdu.dk
3 Department of Nuclear Medicine, Odense University Hospital, 5000 Odense C, Denmark
4 Department of Clinical Research, University of Southern Denmark, 5000 Odense C, Denmark
* Correspondence: oke.gerke@rsyd.dk

Received: 7 August 2020; Accepted: 25 August 2020; Published: 28 August 2020
Stats 2020, 3, 343–355; doi:10.3390/stats3030022

Abstract: The assessment of agreement in method comparison and observer variability analysis of quantitative measurements is usually done by the Bland–Altman Limits of Agreement, where the paired differences are implicitly assumed to follow a normal distribution. Whenever this assumption does not hold, the 2.5% and 97.5% percentiles are obtained by quantile estimation. In the literature, empirical quantiles have been used for this purpose. In this simulation study, we applied sample, subsampling, and kernel quantile estimators, as well as other methods for quantile estimation, to sample sizes between 30 and 150 and different distributions of the paired differences. The performance of 15 estimators in generating prediction intervals was measured by their respective coverage probability for one newly generated observation. Our results indicated that sample quantile estimators based on one or two order statistics outperformed all of the other estimators and can be used for deriving nonparametric Limits of Agreement. For sample sizes exceeding 80 observations, more advanced quantile estimators, such as the Harrell–Davis estimator and estimators of Sfakianakis–Verginis type, which use all of the observed differences, performed equally well, but may be considered intuitively more appealing than simple sample quantile estimators that are based on only two observations per quantile.

Keywords: agreement; Bland–Altman plot; coverage; limits of agreement; method comparison; quantile estimation; repeatability; reproducibility

1. Introduction

The classical Bland–Altman Limits of Agreement (BA LoA) define a range within which approximately 95% of normally distributed differences between paired measurements are expected to lie [1–3]. In cases of non-normally distributed differences, the use of empirical quantiles has been proposed as a robust alternative [2,4,5]; however, extensive past research on nonparametric quantile estimation suggests applying more refined quantile estimators to the assessment of the 2.5% and 97.5% percentiles as nonparametric LoA. We performed a simulation study on 15 nonparametric quantile estimators to derive nonparametric prediction intervals and assessed their performance by means of the coverage probability for one newly generated observation. The aim of this study was to suggest a nonparametric and robust alternative to the classical BA LoA when the normality assumption does not hold and/or the sample sizes are small to moderate. Our findings are illustrated by an application to data from a previously published clinical study on coronary artery calcification measured by the Agatston score [6].
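As a point of reference, the following minimal R sketch (ours, for illustration only; it is not part of the published supplementary code) contrasts the classical BA LoA, computed as the mean difference plus/minus 1.96 times the standard deviation of the differences [1,2], with plain empirical 2.5% and 97.5% percentiles for a skewed sample of simulated paired differences. All variable names are illustrative assumptions.

# Classical Bland-Altman LoA vs. plain empirical percentiles (illustrative sketch).
set.seed(1)
d <- rlnorm(100, meanlog = 0, sdlog = 1) - 1     # skewed "paired differences"

ba_loa  <- mean(d) + c(-1.96, 1.96) * sd(d)      # classical BA LoA (normality assumed)
emp_loa <- quantile(d, probs = c(0.025, 0.975))  # simple empirical percentile LoA

ba_loa
emp_loa

With skewed differences, the two intervals can differ noticeably; how to estimate the two extreme percentiles well in small to moderate samples is the question addressed in the remainder of the paper.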
2. Methods

Let the differences of paired observations be independent observations of a random variable X with a cumulative distribution function (CDF) F : R → [0, 1]. If F is continuous from the right, then the quantile function of X is continuous from the left, Q(u) := inf{x : F(x) ≥ u}, u ∈ (0, 1); hence, at least 100u percent of the values of X are less than or equal to Q(u) [7,8]. In the following, we seek to estimate the 2.5% and 97.5% percentiles of F, corresponding to u = 0.025 and u = 0.975, by different methods of nonparametric quantile estimation.

Databases such as JSTOR (Journal Storage), ScienceDirect, the online journal platform "Taylor & Francis Online", and those maintained by PubMed/Medline in the NCBI (National Center for Biotechnology Information) were searched for "quantile estimator", "nonparametric quantile estimator", "nonparametric kernel quantile estimator", "subsampling quantile estimator", and "new quantile estimator". Fifteen nonparametric quantile estimators were chosen, three of which are sample quantile estimators, four are subsampling quantile estimators, two are kernel quantile estimators, and six are other quantile estimators. Hence, we chose very different types of nonparametric quantile estimators, which use one, two, or all of the observations in a sample.

2.1. Sample Quantile Estimators

The simplest way to estimate quantiles nonparametrically is by using sample quantile estimators. A random sample X_1, ..., X_n of size n is sorted in increasing order X_(1) ≤ X_(2) ≤ ... ≤ X_(n); these are the order statistics of the random sample. The CDF F can then be estimated by the step function

\tilde{F}(x) = \frac{1}{n} \sum_{i=1}^{n} I_{[x_i,\infty)}(x),    (1)

where I_A(x) is the indicator function that takes the value 1 if x ∈ A and 0 if x ∉ A; the sample quantile function is defined as \tilde{Q}(u) := \inf\{x : \tilde{F}(x) \geq u\}, u ∈ (0, 1), by Cheng [7]. The quantile estimator

SQ_{p1} = \begin{cases} X_{(np)} & \text{if } [np] = np, \\ X_{([np]+1)} & \text{if } [np] < np, \end{cases}    (2)

is based on a single order statistic, where [x] is the greatest integer that is less than or equal to x [9,10]. SQ_p1 is the smallest observation for which at least p percent of the observed values in the sample are smaller than or equal to SQ_p1. The second sample quantile estimator, SQ_p2, is a weighted average of the two order statistics that are closest to including p percent of all the observations in the sample:

SQ_{p2} = (1 - \alpha) X_{(r)} + \alpha X_{(r+1)}    (3)

with \alpha = p(n+1) - r and r = [p(n+1)] [9,11]. Finally, we considered a weighted average of X_([np+0.5]) and X_([np+0.5]+1):

SQI_{p} = (-np + 0.5 + i) X_{(i)} + (np + 0.5 - i) X_{(i+1)}    (4)

with i = [np + 0.5] and 0.5 ≤ np ≤ n - 0.5 [9,12].
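For illustration, the three sample quantile estimators can be coded in a few lines of R. The sketch below is ours and not the supplementary code of the paper; the function names are arbitrary, and the guards mirror the validity conditions stated above (for example, SQ_p2 is undefined for p = 0.025 in very small samples).

# Illustrative R sketch of the three sample quantile estimators (Equations (2)-(4)).
sq1 <- function(x, p) {                        # SQ_p1: a single order statistic
  xs <- sort(x); n <- length(x); np <- n * p
  i <- if (abs(np - round(np)) < 1e-8) round(np) else floor(np) + 1
  xs[max(1, i)]
}

sq2 <- function(x, p) {                        # SQ_p2: weighted average of X_(r), X_(r+1)
  xs <- sort(x); n <- length(x)
  r <- floor(p * (n + 1)); a <- p * (n + 1) - r
  if (r < 1 || r + 1 > n) return(NA_real_)     # undefined for extreme p in small samples
  (1 - a) * xs[r] + a * xs[r + 1]
}

sqi <- function(x, p) {                        # SQI_p: weighted average around [np + 0.5]
  xs <- sort(x); n <- length(x); np <- n * p
  if (np < 0.5 || np > n - 0.5) return(NA_real_)
  i <- floor(np + 0.5)
  (i + 0.5 - np) * xs[i] + (np + 0.5 - i) * xs[min(i + 1, n)]
}

set.seed(1)
x <- rnorm(50)
c(sq1(x, 0.025), sq1(x, 0.975))                # lower and upper nonparametric LoA
c(sq2(x, 0.025), sq2(x, 0.975))
c(sqi(x, 0.025), sqi(x, 0.975))

Related sample quantile definitions are available in base R through quantile(), which implements the nine types catalogued by Hyndman and Fan [12].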
2.2. Subsampling Quantile Estimators

The abovementioned sample quantile estimators are based on only one or two order statistics, whereas subsampling, kernel, and other quantile estimators employ linear combinations of all the available order statistics, weighting them according to their relative closeness to the target percentile. Based on the sample quantile function \tilde{Q}, linear smooth nonparametric estimators of Q(u) can be written as

\hat{Q}(u) = \int_0^1 \tilde{Q}(t) \, dG(u;t),    (5)

where G(u; ·) is a CDF with support on the unit interval. Many distributions G(u; ·) have been proposed, the choice of which depends on the sample size and, typically, a smoothing parameter. Depending on the choice of G(u; ·), there are two major classes of quantile estimators according to Cheng [7]: subsampling quantile estimators and kernel quantile estimators. Both can be given as L-statistics, that is, \sum_{j=1}^{n} W_j X_{(j)}, where W_j and X_{(j)} are the weight for the j-th order statistic and the j-th order statistic itself, respectively [13].

For subsampling quantile estimators, a discrete distribution is chosen for G(u; ·), interpreted as a resampling distribution from the set of the observed order statistics X_{(j)}, j = 1, ..., n [7]. The Harrell–Davis estimator is given by

HD_{p} = \sum_{i=1}^{n} W_i X_{(i)}    (6)

with weight function

W_i = \frac{1}{\beta\{(n+1)p, (n+1)(1-p)\}} \int_{(i-1)/n}^{i/n} y^{(n+1)p-1} (1-y)^{(n+1)(1-p)-1} \, dy
    = I_{i/n}\{p(n+1), (1-p)(n+1)\} - I_{(i-1)/n}\{p(n+1), (1-p)(n+1)\},

where I_{i/n}\{a, b\} is the incomplete beta function [10,14,15]. The quantile estimator of Kaigh and Lachenbruch,

KL_{p} = \sum_{j=r}^{r+n-k} \left[ \binom{j-1}{r-1} \binom{n-j}{k-r} \Big/ \binom{n}{k} \right] X_{(j)},    (7)

where r = [(k+1)p], is obtained by averaging a subsample quantile estimate over all \binom{n}{k} subsamples of size k, 1 ≤ k ≤ n, which are sampled without replacement. The subsample size k is an arbitrary smoothing (or reduction) parameter, and Kaigh and Lachenbruch proposed choosing k so as to minimize the mean squared error (MSE), that is, MSE = E(KL_p - \varepsilon_p)^2, where \varepsilon_p is the true value of the p-th quantile [7,16,17]. Kaigh and Cheng [18] proposed the quantile estimator

KC_{p} = \sum_{j=1}^{n} \left[ \binom{r+j-2}{r-1} \binom{n-j+k-r}{k-r} \Big/ \binom{n+k-1}{k} \right] X_{(j)}    (8)

with r = \lceil kp \rceil, where \lceil x \rceil denotes the smallest integer greater than or equal to x [7]. The value of k is again determined by minimizing the MSE of the estimator. Finally, the Bernstein polynomial quantile estimator is given by

BP_{p} = \sum_{j=1}^{n} \binom{n-1}{j-1} p^{j-1} (1-p)^{n-j} X_{(j)},    (9)

according to Cheng [7,19].

2.3. Kernel Quantile Estimators

Like subsampling quantile estimators, kernel quantile estimators can be written in the form of Equation (5). Here, G(u; ·) is a location–scale family CDF with density function K, location parameter u, and scale parameter h(n), that is, with density

\frac{1}{h(n)} K\!\left( \frac{t-u}{h(n)} \right).

Subsequently, Equation (5) becomes the kernel quantile estimator introduced by Parzen [8]:

\hat{Q}(u) = \int_0^1 \tilde{Q}(t) \, \frac{1}{h(n)} K\!\left( \frac{t-u}{h(n)} \right) dt,

with its L-statistic representation being

\hat{Q}(u) = \sum_{j=1}^{n} \left[ \int_{(j-1)/n}^{j/n} \frac{1}{h(n)} K\!\left( \frac{t-u}{h(n)} \right) dt \right] X_{(j)}.

The density K, called the kernel, is symmetric around zero and has a chosen bandwidth h(n), which satisfies h(n) → 0 as n → ∞ [7,20]. Yang [21] proposed a discretized version, which we used as the first of the two kernel quantile estimators in our study due to its closed form:

KQ_{p1} = \frac{1}{n} \sum_{j=1}^{n} \frac{1}{h(n)} K\!\left( \frac{j/n - p}{h(n)} \right) X_{(j)}.    (10)

As K\!\left( \frac{1/n - p}{h(n)} \right), \ldots, K\!\left( \frac{n/n - p}{h(n)} \right) do not generally provide a (discrete) probability distribution on [0, 1], monotonicity, translation and scale equivariance, and the symmetry relation do not hold. Translation and scale equivariance would imply, for instance, that KQ_p1 applied to (X_1 + c, ..., X_n + c) is equal to c plus KQ_p1 applied to (X_1, ..., X_n); the asymmetry means that KQ_p1 applied to (-X_1, ..., -X_n) is not equal to -KQ_p1 applied to (X_1, ..., X_n). The Nadaraya–Watson type estimator

KQ_{p2} = \sum_{j=1}^{n} \frac{ K\!\left( \frac{(j-0.5)/n - p}{h(n)} \right) }{ \sum_{i=1}^{n} K\!\left( \frac{(i-0.5)/n - p}{h(n)} \right) } X_{(j)}    (11)

overcomes these drawbacks and was used as the second of the two kernel quantile estimators in this study [7,22,23]. Its name originates from the Nadaraya–Watson estimator in kernel regression [24,25]. In the following, we chose the standard Gaussian kernel for K and selected the bandwidth h(n) by minimizing the MSE, just as the MSE was minimized to find k for both KL_p in Equation (7) and KC_p in Equation (8).
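As an illustration of the estimators that weight all order statistics, the following R sketch (ours, not the supplementary code) implements the Harrell–Davis estimator of Equation (6), exploiting that the regularized incomplete beta function is available in R as pbeta(), and Yang's kernel estimator of Equation (10) with a Gaussian kernel. The bandwidth value used below is an arbitrary illustrative choice, whereas the study selects h(n) by minimizing the MSE.

# Harrell-Davis estimator, Equation (6): beta-distributed weights over all order statistics.
hd <- function(x, p) {
  xs <- sort(x); n <- length(x); i <- seq_len(n)
  w <- pbeta(i / n,       (n + 1) * p, (n + 1) * (1 - p)) -
       pbeta((i - 1) / n, (n + 1) * p, (n + 1) * (1 - p))
  sum(w * xs)
}

# Yang's kernel quantile estimator, Equation (10), with a Gaussian kernel.
# Note: the weights need not sum to 1, which is the drawback discussed above.
kq1 <- function(x, p, h) {
  xs <- sort(x); n <- length(x); j <- seq_len(n)
  w <- dnorm((j / n - p) / h) / (n * h)
  sum(w * xs)
}

set.seed(1)
x <- rlnorm(100)
hd(x, 0.975)             # nonparametric estimate of the 97.5% percentile
kq1(x, 0.975, h = 0.05)  # same target; h = 0.05 chosen only for illustration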
2.4. Other Quantile Estimators

The kernel quantile estimators presented in the previous section are all based on the usual empirical distribution (1), with equal weights 1/n assigned to each observation. To improve the performance of quantile estimators, Huang and Brill [26] proposed using a weighted empirical distribution, for instance, the level crossing empirical distribution function

F_{lc}(x) = \sum_{i=1}^{n} w_{i,n} \, I_{(-\infty,x]}(X_{(i)})    (12)

with weight function

w_{i,n} = \begin{cases} \frac{1}{2}\left(1 - \frac{n-2}{\sqrt{n(n-1)}}\right) & \text{if } i = 1, n, \\ \frac{1}{\sqrt{n(n-1)}} & \text{if } i = 2, 3, \ldots, n-1. \end{cases}    (13)

One such kernel quantile estimator using the level crossing empirical distribution is

KQ_{plc} = \sum_{i=1}^{n} \frac{1}{h(n)} K\!\left( \frac{\sum_{j=1}^{i} w_{j,n} - p}{h(n)} \right) X_{(i)}.    (14)

Huang [27] modified the Harrell–Davis estimator (6) by applying a weighted empirical distribution function instead of the empirical distribution with equal weights 1/n. The Harrell–Davis estimator using a level crossing empirical distribution function can be written as

HD_{plc} = \frac{1}{\beta((n+1)p, (n+1)q)} \int_0^1 F_{lc}^{-1}(y) \, y^{(n+1)p-1} (1-y)^{(n+1)q-1} \, dy
         = \sum_{i=1}^{n} \left[ \frac{1}{\beta((n+1)p, (n+1)q)} \int_{q_{i-1,n}}^{q_{i,n}} y^{(n+1)p-1} (1-y)^{(n+1)q-1} \, dy \right] X_{(i)},    (15)

where \beta(\cdot,\cdot) is the beta function, q = 1 - p, F_{lc}(\cdot) is given by (12), q_{i,n} = \sum_{j=1}^{i} w_{j,n}, i = 1, ..., n, with w_{j,n} as defined in (13), and q_{0,n} \equiv 0.

Sfakianakis and Verginis [28] proposed a group of estimators, motivated by the fact that nonparametric quantile estimation of extreme quantiles close to 0 and 1 requires large samples for sufficient accuracy. These three quantile estimators are supposed to better estimate quantiles in the tails of a distribution when using small samples, and they employ the Binomial probability of observing exactly i out of n events with an event probability of p, B(i; n, p):

SV_{p1} = \frac{2B(0;n,p)+B(1;n,p)}{2} X_{(1)} + \frac{B(0;n,p)}{2} X_{(2)} - \frac{B(0;n,p)}{2} X_{(3)} + \sum_{i=2}^{n-1} \frac{B(i;n,p)+B(i-1;n,p)}{2} X_{(i)} - \frac{B(n;n,p)}{2} X_{(n-2)} + \frac{B(n;n,p)}{2} X_{(n-1)} + \frac{2B(n;n,p)+B(n-1;n,p)}{2} X_{(n)},    (16)

SV_{p2} = \sum_{i=0}^{n-1} B(i;n,p) X_{(i+1)} + \left(2X_{(n)} - X_{(n-1)}\right) B(n;n,p),    (17)

SV_{p3} = \sum_{i=1}^{n} B(i;n,p) X_{(i)} + \left(2X_{(1)} - X_{(2)}\right) B(0;n,p).    (18)

Finally, Navruz and Özdemir [29] introduced a new quantile estimator, which is a weighted average of all order statistics:

NO_{p} = \left(B(0;n,p)\,2p + B(1;n,p)\,p\right) X_{(1)} + B(0;n,p)(2-3p) X_{(2)} - B(0;n,p)(1-p) X_{(3)}
       + \sum_{i=1}^{n-2} \left(B(i;n,p)(1-p) + B(i+1;n,p)\,p\right) X_{(i+1)} - B(n;n,p)\,p\,X_{(n-2)}
       + B(n;n,p)(3p-1) X_{(n-1)} + \left(B(n-1;n,p)(1-p) + B(n;n,p)(2-2p)\right) X_{(n)}.    (19)

2.5. Simulation Setup

We contrasted nonparametric LoA constructed with the 15 abovementioned quantile estimators by comparing their coverage probabilities for the next paired difference under the given distributional assumption. Here, we employed the standard normal distribution (ND), a standard normal distribution with 1%, 2%, and 5% outliers (ND 1%, ND 2%, and ND 5%, respectively), an exponential distribution (ED) with a rate of 1, and a lognormal distribution (LND) with meanlog = 0 and sdlog = 1. For the normal distributions comprising outliers, simulated values were replaced, with a probability of 1%, 2%, or 5%, by values sampled from a normal distribution with a mean of 0 and a standard deviation of 3. To examine small to moderate sample sizes, the sample size was set to 30, 50, 80, 100, and 150. For each combination of distribution, sample size, and nonparametric quantile estimator, 20,000 simulated trials of size (n + 1) were generated with R (the code is available as Supplemental Material S1). Here, a seed was set in order to use the same simulated data for each combination of distribution and sample size across nonparametric quantile estimators. The first n observations in each simulated trial were used to derive nonparametric LoA, to which the last observation was compared. The coverage probability was then the proportion of cases out of the 20,000 trials in which the nonparametric LoA included the last observation. All of the figures were generated with Stata/MP 16.1 (College Station, TX 77845, USA).
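To make the simulation design concrete, the following simplified R sketch (ours; the code actually used for Tables 1–5 is Supplemental Material S1) estimates the coverage probability of nonparametric LoA for one additional observation, here using the hd() function from the sketch at the end of Section 2.3 and a contaminated normal scenario. The number of replications is kept deliberately small for illustration, whereas the study used 20,000 trials with a fixed seed.

# Simplified coverage simulation (illustrative; see Supplemental Material S1 for the study code).
# Relies on hd() from the sketch at the end of Section 2.3.

# Contaminated normal: each value is replaced with probability eps by a draw from N(0, 3^2).
rnorm_contaminated <- function(n, eps) {
  replace_it <- runif(n) < eps
  ifelse(replace_it, rnorm(n, mean = 0, sd = 3), rnorm(n, mean = 0, sd = 1))
}

coverage <- function(n, rdist, estimator, n_sim = 2000) {
  hits <- logical(n_sim)
  for (s in seq_len(n_sim)) {
    d     <- rdist(n + 1)                  # n observations plus one future difference
    lower <- estimator(d[1:n], 0.025)      # nonparametric lower LoA
    upper <- estimator(d[1:n], 0.975)      # nonparametric upper LoA
    hits[s] <- d[n + 1] >= lower && d[n + 1] <= upper
  }
  mean(hits)                               # proportion of covered future differences
}

set.seed(123)
coverage(50, function(m) rnorm_contaminated(m, eps = 0.05), hd)   # ND 5% scenario, n = 50

Swapping rdist and estimator reproduces, in principle, the other scenario and estimator combinations, albeit with a larger Monte Carlo error than in the study because of the smaller number of replications.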
3. Results

For n = 30 (Table 1), none of the estimators reached the nominal coverage probability of 0.95. The coverage probability of SQ_p1 was closest to 0.95, ranging from 0.934 to 0.938. Note that, for sample sizes of up to n = 40 observations, the smallest and the largest difference are used as nonparametric quantile estimates for the 2.5% and 97.5% percentiles, respectively. For SQI_p, HD_p, HD_plc, SV_p1, SV_p2, and SV_p3, the coverage probabilities were at least 0.921, 0.911, 0.910, 0.920, 0.914, and 0.923, respectively. Neither SQ_p2 nor KL_p is defined for n < 40.

Table 1. Coverage probabilities for nonparametric Limits of Agreement (n = 30). Neither SQ_p2 nor KL_p is defined for n < 40.

Estimator   ND      ND 1%   ND 2%   ND 5%   ED      LND
SQ_p1       0.937   0.938   0.937   0.937   0.934   0.937
SQ_p2       -       -       -       -       -       -
SQI_p       0.926   0.927   0.927   0.928   0.921   0.926
HD_p        0.911   0.923   0.923   0.924   0.916   0.920
KL_p        -       -       -       -       -       -
KC_p        0.900   0.890   0.877   0.851   0.885   0.880
BP_p        0.905   0.908   0.908   0.909   0.897   0.904
KQ_p1       0.904   0.882   0.865   0.832   0.936   0.922
KQ_p2       0.912   0.893   0.880   0.857   0.916   0.906
KQ_plc      0.915   0.893   0.874   0.838   0.923   0.919
HD_plc      0.916   0.917   0.917   0.919   0.910   0.915
SV_p1       0.924   0.925   0.925   0.926   0.920   0.924
SV_p2       0.925   0.927   0.926   0.927   0.914   0.919
SV_p3       0.925   0.926   0.925   0.926   0.923   0.929
NO_p        0.813   0.814   0.817   0.821   0.808   0.834

SQ_p2 was the only estimator whose coverage probabilities oscillated closely around 0.95 for all of the investigated sample sizes n ≥ 50 (Tables 2–5); for n = 50, it was the only estimator to do so. For n = 80 (Table 3), the coverage probabilities of HD_p, SV_p1, SV_p2, and SV_p3 fluctuated closely around the nominal level, except for the simulations with an ED (0.94). For n ≥ 100 (Tables 4 and 5), these estimators performed close to the 0.95 nominal level for all of the investigated distributions. The coverage probabilities of the recently proposed NO_p estimator varied between 0.945 and 0.950 for n = 200 and were very close to 0.95 for n = 250 (results not shown here).

Table 2. Coverage probabilities for nonparametric Limits of Agreement (n = 50). Coverage probabilities exceeding the nominal level of 0.95 are marked with an asterisk.
Estimator   ND      ND 1%   ND 2%   ND 5%   ED      LND
SQ_p1       0.919   0.918   0.920   0.919   0.919   0.919
SQ_p2       0.951*  0.952*  0.953*  0.955*  0.951*  0.951*
SQI_p       0.934   0.935   0.937   0.938   0.931   0.934
HD_p        0.939   0.940   0.942   0.945   0.935   0.938
KL_p        0.938   0.932   0.930   0.917   0.937   0.929
KC_p        0.924   0.906   0.898   0.879   0.915   0.912
BP_p        0.922   0.925   0.927   0.931   0.919   0.922
KQ_p1       0.919   0.902   0.896   0.861   0.931   0.933
KQ_p2       0.924   0.909   0.900   0.881   0.924   0.916
KQ_plc      0.924   0.912   0.901   0.867   0.944   0.930
HD_plc      0.933   0.935   0.936   0.940   0.928   0.931
SV_p1       0.940   0.941   0.942   0.945   0.935   0.939
SV_p2       0.937   0.939   0.939   0.943   0.933   0.934
SV_p3       0.938   0.941   0.942   0.944   0.938   0.941
NO_p        0.839   0.843   0.845   0.855   0.845   0.870

Table 3. Coverage probabilities for nonparametric Limits of Agreement (n = 80). Coverage probabilities exceeding the nominal level of 0.95 are marked with an asterisk.

Estimator   ND      ND 1%   ND 2%   ND 5%   ED      LND
SQ_p1       0.939   0.939   0.940   0.938   0.934   0.939
SQ_p2       0.950   0.951*  0.951*  0.949   0.945   0.950
SQI_p       0.941   0.941   0.942   0.941   0.935   0.941
HD_p        0.950   0.952*  0.954*  0.955*  0.940   0.949
KL_p        0.943   0.939   0.935   0.933   0.938   0.939
KC_p        0.936   0.923   0.916   0.895   0.925   0.929
BP_p        0.937   0.939   0.941   0.942   0.929   0.937
KQ_p1       0.934   0.926   0.921   0.887   0.934   0.940
KQ_p2       0.936   0.925   0.917   0.897   0.930   0.931
KQ_plc      0.933   0.928   0.922   0.890   0.942   0.939
HD_plc      0.943   0.945   0.947   0.948   0.934   0.943
SV_p1       0.951*  0.952*  0.954*  0.955*  0.940   0.949
SV_p2       0.950   0.951*  0.953*  0.953*  0.940   0.948
SV_p3       0.949   0.951*  0.952*  0.954*  0.940   0.950
NO_p        0.888   0.893   0.896   0.901   0.891   0.910

Table 4. Coverage probabilities for nonparametric Limits of Agreement (n = 100). Coverage probabilities exceeding the nominal level of 0.95 are marked with an asterisk.

Estimator   ND      ND 1%   ND 2%   ND 5%   ED      LND
SQ_p1       0.941   0.941   0.941   0.942   0.941   0.941
SQ_p2       0.952*  0.952*  0.952*  0.954*  0.953*  0.952*
SQI_p       0.941   0.941   0.941   0.942   0.941   0.941
HD_p        0.950   0.951*  0.953*  0.957*  0.948   0.950
KL_p        0.947   0.941   0.939   0.938   0.946   0.941
KC_p        0.940   0.929   0.920   0.902   0.935   0.935
BP_p        0.940   0.942   0.944   0.948   0.937   0.940
KQ_p1       0.937   0.935   0.921   0.896   0.942   0.943
KQ_p2       0.939   0.930   0.920   0.903   0.940   0.935
KQ_plc      0.938   0.935   0.932   0.905   0.940   0.941
HD_plc      0.944   0.946   0.948   0.951*  0.942   0.944
SV_p1       0.950   0.952*  0.954*  0.959*  0.948   0.950
SV_p2       0.950   0.952*  0.953*  0.957*  0.948   0.949
SV_p3       0.950   0.951*  0.953*  0.959*  0.947   0.950
NO_p        0.911   0.914   0.917   0.924   0.913   0.924

Table 5. Coverage probabilities for nonparametric Limits of Agreement (n = 150). Coverage probabilities exceeding the nominal level of 0.95 are marked with an asterisk.

Estimator   ND      ND 1%   ND 2%   ND 5%   ED      LND
SQ_p1       0.945   0.945   0.944   0.946   0.946   0.945
SQ_p2       0.949   0.949   0.948   0.951*  0.950   0.949
SQI_p       0.942   0.942   0.941   0.943   0.944   0.942
HD_p        0.948   0.949   0.950   0.954*  0.948   0.948
KL_p        0.945   0.941   0.939   0.936   0.947   0.945
KC_p        0.941   0.934   0.930   0.906   0.939   0.940
BP_p        0.942   0.944   0.943   0.947   0.941   0.942
KQ_p1       0.940   0.936   0.928   0.903   0.943   0.943
KQ_p2       0.940   0.932   0.923   0.908   0.941   0.937
KQ_plc      0.939   0.935   0.928   0.904   0.941   0.943
HD_plc      0.944   0.946   0.945   0.949   0.944   0.944
SV_p1       0.949   0.951*  0.952*  0.957*  0.949   0.949
SV_p2       0.949   0.952*  0.952*  0.956*  0.948   0.950
SV_p3       0.949   0.950   0.951*  0.954*  0.947   0.949
NO_p        0.934   0.937   0.937   0.943   0.936   0.940

4. Example

Diederichsen et al. [6] compared coronary artery calcification measurements using the Agatston score with the measurements using Framingham Heart Score in Danes of 50 and 60 years of age. Of 1825 randomly sampled citizens, 1257 consented to participation in the study, and 1156 of them were eligible.
Agatston scores were independently reanalyzed for 129 randomly chosen study participants, and the agreement measures were the proportions of agreement and the kappa statistics for dichotomized calcification status (absence vs. presence) to assess intra- and inter-rater agreement. In the following, the intra-rater differences are used for exemplification purposes.

Approximately half of the 129 participants had an Agatston score of 0. The paired intra-rater differences ranged from -683 to 130, with the first, second, and third quartiles all equal to 0; the 5th, 10th, 90th, and 95th percentiles were -23, -12, 1.1, and 5, respectively. The empirical distribution of the paired differences was, therefore, characterized by its denseness around 0 and a single, comparatively extreme outlier, clearly indicating the inappropriateness of the normality assumption in this setting (see also a histogram including an approximating normal distribution as Supplemental Material S2).

Using SQ_p2, HD_p, and SV_p1, the nonparametric, asymmetric, and robust LoA are (-61.5, 12.8), (-96.2, 26.7), and (-122.1, 30.6), respectively, whereas the symmetric BA LoA of (-129.8, 116.1) are equidistant from the estimated mean difference of -6.9 (Figure 1). The upper LoA for HD_p and SV_p1 are similar, but the respective lower LoA are differently affected by the single outlier (at a mean of 3942.5 and a difference of -683). SQ_p2 appears to be the most robust to a few outliers due to its definition. The R source code for the derivation of these nonparametric LoA as well as the example data can be found as Supplemental Materials S3 and S4, respectively.

Figure 1. SQ_p2 (magenta, long dashes), HD_p (red, short dashes), and SV_p1 (black, solid lines) contrasted with classical BA LoA (shaded area); intra-rater differences are plotted against intra-rater means.

The sensitivity of the classical BA LoA to outliers becomes evident when the single outlier is excluded: the symmetric BA LoA are then -37.4 and 34.3, and the estimated mean difference reduces to -1.6 (results not shown here). In practice, however, outliers are kept in the analysis dataset if there is no reasonable explanation for their exclusion. This underlines the importance of robust alternatives to the BA LoA.

5. Discussion

5.1. Statement of Principal Findings

The simple sample quantile estimators based on one and two order statistics performed closest to the nominal level in terms of the coverage probability for the next observation across six distributional scenarios, for n = 30 and for n = 50, 80, 100, and 150, respectively. The Harrell–Davis subsampling estimator and estimators of the Sfakianakis–Verginis type followed closely for sample sizes of at least n = 80 and may be considered intuitively more appealing, as they use the entire sample, whereas the simpler and outlier-robust sample quantile estimators are based on only a few observations from the sample.

5.2. Strengths and Limitations of the Study

The choice of distributions for the simulation study was motivated by our own experience with agreement assessments in clinical studies, especially roughly normal distributions with a few percent of outliers. We investigated a wide range of quantile estimators, comprising sample, subsampling, and kernel quantile estimators as well as other methods for quantile estimation, with sample sizes between 30 and 150.
As a measure of performance, we considered the coverage probability of nonparametric LoA for the next observation, interpreting the nonparametric LoA as a prediction interval, as the lower and upper LoA need to be assessed simultaneously. Therefore, we did not pursue evaluations using, for instance, mean squared errors.

5.3. Strengths and Limitations in Relation to Other Studies

A peculiarity of LoA is the sole focus on the 2.5th and 97.5th percentiles, two extreme quantiles. Dielman, Lowry, and Pfaffenberger [9] investigated 0.02 and 0.98, but only for small samples (n = 10, 15, 25, 30). Others examined 0.05 and 0.95 [11,14,16,21,22,29], whereas Kaigh and Cheng [17,18] assessed 0.1 and 0.9. Only Huang and Brill [26,27] also targeted 0.025 and 0.975, but only for samples of maximum size n = 30. Sfakianakis and Verginis [28] analyzed 0.01 and 0.99 as well as 0.05 and 0.95 for various sample sizes.

Compared with the more usual 2000 iterations, the chosen number of 20,000 iterations reduced the range of the coverage probabilities across distributions, for a given nonparametric estimator and sample size, by approximately 0.005 and is deemed appropriately accurate. However, the increased number of iterations implied considerably longer running times for creating the data for one table (12 as opposed to 2 h). The abovementioned studies employed between 1000 and 10,000 iterations [16,22].

Harrell and Davis [14] did not recommend HD_p for small n and extreme p, and Dielman, Lowry, and Pfaffenberger [9] concluded that there was not one best estimator across scenarios, based on maximum sample sizes of 30 and 60; however, they suggested that HD_p performs well in a wide range of cases, except when p = 0.02, 0.98. Our findings for HD_p are in line with these former conclusions, but extend to larger sample sizes of n = 80, 100, 150, for which HD_p appears to be a preferable choice for estimating extreme quantiles.

5.4. Meaning of the Findings: Possible Mechanisms and Implications

Our findings suggest using SQ_p1 in small samples of approximately n = 30, but SQI_p, SV_p1, or SV_p3 may be preferable alternatives, as SQ_p1 simply reduces to the smallest and largest observations as estimates for the 2.5% and 97.5% quantiles, respectively. The latter is, in turn, unfortunate in the case of outliers due to their unabated impact on the estimates. SQ_p2 performed closest to the nominal level for all samples with n ≥ 50 and appeared to be less prone to the single outlier in our clinical example than HD_p and SV_p1. However, the latter two estimators do involve all the observations. SQ_p2 can, therefore, be considered the first choice for samples of approximately n = 50, but, for larger n, both HD_p and Sfakianakis–Verginis type estimators are equally applicable and actually preferable if the researcher seeks to include the entire dataset in quantile estimation and not only pairs of order statistics.

The normality assumption for the paired differences may often be considered reasonable in the planning stage; however, alternative quantile estimators should equally be specified at that stage, as empirical distributions may deviate notably from ideal assumptions. Moreover, our investigation suggests several beneficial nonparametric alternatives to BA LoA instead of the simple percentile estimators that currently seem to prevail.
5.5. Unanswered Questions and Future Research

In the case of normally distributed paired differences, Bland and Altman [1,2] have already proposed approximate confidence intervals for the BA LoA. Recently, Vock [30] emphasized that only a tolerance interval or the outer confidence limits for BA LoA can provide a range that will contain a specified percentage of future differences with a known certainty. Carkeet and Goh [31,32] proposed exact confidence intervals for BA LoA using two-sided tolerance factors for a normal distribution. For any given distribution of the paired differences, several approaches to the construction of nonparametric confidence intervals for quantiles have been proposed over the past half century [33–39]. For both HD_p and KL_p, confidence intervals for quantiles have been proposed [10,14,16]. In the context of nonparametric LoA, future research will naturally lie in the proposal and evaluation of confidence intervals for the 2.5% and 97.5% quantiles in small-to-moderate samples, especially with regard to SQ_p2, SQI_p, HD_p, and Sfakianakis–Verginis type estimators.

Regression procedures for method comparison analysis have not been considered here [40–43]. Robust methods designed for data configurations with outliers, such as S- or MM-estimation, Least Trimmed Squares, or the Forward Search, are of interest in this context [44–48].

Supplementary Materials: The following are available online at http://www.mdpi.com/2571-905X/3/3/22/s1, Code S1: R source code for generating Tables 1–5; Figure S2: Histogram for the data of the clinical example in Section 4, including an approximating normal distribution; Code S3: R source code for generating Limits of Agreement for the clinical example in Section 4; Data S4: Dataset of the clinical example in Section 4.

Author Contributions: Conceptualization, O.G.; methodology, H.C.P. and O.G.; software, M.E.F.; validation, M.E.F. and O.G.; formal analysis, M.E.F., H.C.P. and O.G.; writing – original draft preparation, M.E.F. and O.G.; writing – review and editing, M.E.F., H.C.P. and O.G.; visualization, O.G.; supervision, H.C.P. and O.G. All authors have read and agreed to the published version of the manuscript.

Funding: This research received no external funding.

Acknowledgments: The authors would like to thank Axel Diederichsen (Odense University Hospital, Denmark) for the permission to reanalyze fully anonymized data from the DanRisk study and three anonymous reviewers for very helpful comments on earlier versions of the manuscript.

Conflicts of Interest: The authors declare no conflict of interest.

References

1. Bland, J.M.; Altman, D.G. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet 1986, 1, 307–310.
2. Bland, J.M.; Altman, D.G. Measuring agreement in method comparison studies. Stat. Methods Med. Res. 1999, 8, 135–160.
3. Rosner, B. Fundamentals of Biostatistics, 8th ed.; Cengage Learning: Boston, MA, USA, 2015.
4. Schmitz, S.; Krummenauer, F.; Henn, S.; Dick, H.B. Comparison of three different technologies for pupil diameter measurement. Graefe's Arch. Clin. Exp. Ophthalmol. 2003, 241, 472–477.
5. Twomey, P.J. How to use difference plots in quantitative method comparison. Ann. Clin. Biochem. 2006, 43, 124–129.
6. Diederichsen, A.C.; Sand, N.P.; Nørgaard, B.; Lambrechtsen, J.; Jensen, J.M.; Munkholm, H.; Aziz, A.; Gerke, O.; Egstrup, K.; Larsen, M.L.; et al.
Discrepancy between coronary artery calcium score and HeartScore in middle-aged Danes: The DanRisk study. Eur. J. Prev. Cardiol. 2012, 19, 558–564.
7. Cheng, C. On Estimation of Quantiles and Quantile Density Functions. Ph.D. Thesis, Texas A & M University, College Station, TX, USA, 1993.
8. Parzen, E. Nonparametric statistical data modeling. J. Am. Stat. Assoc. 1979, 74, 105–121.
9. Dielman, T.; Lowry, C.; Pfaffenberger, R. A comparison of quantile estimators. Commun. Stat. Simul. Comput. 1994, 23, 355–371.
10. Steinberg, S.M. Confidence Intervals for Functions of Quantiles Using Linear Combinations of Order Statistics. Ph.D. Thesis, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA, 1983.
11. Parrish, R.S. Comparison of quantile estimators in normal sampling. Biometrics 1990, 46, 247–257.
12. Hyndman, R.J.; Fan, Y. Sample quantiles in statistical packages. Am. Stat. 1996, 50, 361–365.
13. Serfling, R.J. Approximation Theorems of Mathematical Statistics; John Wiley & Sons: Hoboken, NJ, USA, 1980.
14. Harrell, F.E.; Davis, C.E. A new distribution-free quantile estimator. Biometrika 1982, 69, 635–640.
15. Steinberg, S.M.; Davis, C.E. Comparison of nonparametric point estimators for interquantile differences in moderate sized samples. Commun. Stat. Theory Methods 1987, 16, 1607–1616.
16. Kaigh, W.D.; Lachenbruch, P.A. A generalized quantile estimator. Commun. Stat. Theory Methods 1982, 11, 2217–2238.
17. Kaigh, W.D. Quantile interval estimation. Commun. Stat. Theory Methods 1983, 12, 2427–2443.
18. Kaigh, W.D.; Cheng, C. Subsampling quantile estimators and uniformity criteria. Commun. Stat. Theory Methods 1991, 20, 539–560.
19. Cheng, C. The Bernstein polynomial estimator of a smooth quantile function. Stat. Probab. Lett. 1995, 24, 321–330.
20. Delampady, M.; Ghosh, J.K.; Samanta, T. An Introduction to Bayesian Analysis: Theory and Methods; Springer: New York, NY, USA, 2006.
21. Yang, S.S. A smooth nonparametric estimator of a quantile function. J. Am. Stat. Assoc. 1985, 80, 1004–1011.
22. Sheather, S.J.; Marron, J.S. Kernel quantile estimators. J. Am. Stat. Assoc. 1990, 85, 410–416.
23. Zelterman, D. Smooth nonparametric estimation of the quantile function. J. Stat. Plan. Inference 1990, 26, 339–352.
24. Nadaraya, E.A. Smooth regression analysis. Sankhyā Indian J. Stat. 1964, 26, 359–372.
25. Watson, G.S. On estimating regression. Theory Probab. Appl. 1964, 9, 141–142.
26. Huang, M.L.; Brill, P. A level crossing quantile estimation method. Stat. Probab. Lett. 1999, 45, 111–119.
27. Huang, M.L. On a distribution-free quantile estimator. Comput. Stat. Data Anal. 2001, 37, 477–486.
28. Sfakianakis, M.E.; Verginis, D.G. A new family of nonparametric quantile estimators. Commun. Stat. Simul. Comput. 2008, 37, 337–345.
29. Navruz, G.; Özdemir, A.F. A new quantile estimator with weights based on a subsampling approach. Br. J. Math. Stat. Psychol. 2020, 73.
30. Vock, M. Intervals for the assessment of measurement agreement: Similarities, differences, and consequences of incorrect interpretations. Biom. J. 2016, 58, 489–501.
31. Carkeet, A. Exact parametric confidence intervals for Bland–Altman limits of agreement. Optom. Vis. Sci. 2015, 92, e71–e80.
32. Carkeet, A.; Goh, Y.T.
Confidence and coverage for Bland–Altman limits of agreement and their approximate confidence intervals. Stat. Methods Med. Res. 2018, 27, 1559–1574.
33. Chu, J.T. Some uses of quasi-ranges. Ann. Math. Stat. 1957, 28, 173–180.
34. Campbell, M.J.; Gardner, M.J. Calculating confidence intervals for some non-parametric analyses. Br. Med. J. 1988, 296, 1454–1456.
35. Beran, R.; Hall, P. Interpolated nonparametric prediction intervals and confidence intervals. J. R. Stat. Soc. Ser. B 1993, 55, 643–652.
36. Hutson, A.D. Calculating nonparametric confidence intervals for quantiles using fractional order statistics. J. Appl. Stat. 1999, 26, 343–353.
37. Hutson, A.D. 'Exact' bootstrap confidence bands for the quantile function via Steck's determinant. J. Comput. Graph. Stat. 2002, 11, 471–482.
38. Zielinski, R.; Zielinski, W. Best exact nonparametric confidence intervals for quantiles. Statistics 2005, 39, 67–71.
39. Balakrishnan, N.; Li, T. Confidence intervals for quantiles and tolerance intervals based on ordered ranked set samples. Ann. Inst. Stat. Math. 2006, 58, 757–777.
40. Cornbleet, P.J.; Gochman, N. Incorrect least-squares regression coefficients in method-comparison analysis. Clin. Chem. 1979, 25, 432–438.
41. Passing, H.; Bablok, W. A new biometrical method for testing the equality of measurements from two different analytical methods. Clin. Chem. Lab. Med. 1983, 21, 709–720.
42. Passing, H.; Bablok, W. Comparison of several regression procedures for method comparison studies and determination of sample size. Clin. Chem. Lab. Med. 1984, 22, 431–445.
43. Payne, R.B. Method comparison: Evaluation of least squares, Deming and Passing/Bablok regression procedures using computer simulation. Ann. Clin. Biochem. 1997, 34, 319–320.
44. Rousseeuw, P.J. Least median of squares regression. J. Am. Stat. Assoc. 1984, 79, 871–880.
45. Yohai, V.J.; Zamar, R. High breakdown-point estimates of regression by means of the minimization of an efficient scale. J. Am. Stat. Assoc. 1988, 83, 406–413.
46. Riani, M.; Cerioli, A.; Atkinson, A.C.; Perrotta, D. Monitoring robust regression. Electron. J. Stat. 2014, 8, 646–677.
47. Rousseeuw, P.; Perrotta, D.; Riani, M.; Hubert, M. Robust monitoring of time series with application to fraud detection. Econom. Stat. 2019, 9, 108–121.
48. Riani, M.; Atkinson, A.C.; Corbellini, A.; Perrotta, D. Robust regression with density power divergence: Theory, comparisons, and data analysis. Entropy 2020, 22, 399.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).