Treatment effects on count outcomes with non-normal covariates
The data used in the illustrative example are publicly available in OpenICPSR at https://www.openicpsr.org/openicpsr/project/128941; https://doi.org/10.3886/E128941
Abstract
The effects of a treatment or an intervention on a count outcome are often of interest in applied research. When controlling for additional covariates, a negative binomial regression model is usually applied to estimate conditional expectations of the count outcome. The difference in conditional expectations under treatment and under control is then defined as the (conditional) treatment effect. While traditionally aggregates of these conditional treatment effects (e.g., average treatment effects) are computed by averaging over the empirical distribution, a recently proposed moment-based approach allows for computing aggregate effects as a function of distribution parameters. The moment-based approach makes it possible to control for (latent) multivariate normally distributed covariates and provides more reliable inferences under certain conditions. In this paper we propose three different ways to account for non-normally distributed continuous covariates in this approach: an alternative, known non-normal distribution; a plausible factorization of the joint distribution; and an approximation using finite Gaussian mixtures. A saturated model is used for categorical covariates, making a distributional assumption obsolete. We further extend the moment-based approach to allow for multiple treatment conditions and the computation of conditional effects for categorical covariates. An illustrative example highlighting the key features of our extension is provided.
1 Introduction
The evaluation of treatment effects on count outcomes is quite common in the social and health sciences. Often, covariates are included in the analysis using a negative binomial regression model (Garrett et al., 2018; Hittner, Owens, & Swickert, 2016; Jobe et al., 2001; Mazerolle et al., 2019; Nusser & Weinert, 2017; Schaumberg & Flynn, 2017; Sridharan, Shoda, Heffner, & Bricker, 2019), that is, the conditional expectation for the count outcome is logarithmically linked to the treatment variable and the covariates. The difference in conditional expectations under treatment and under control is then defined as the (conditional) treatment effect. Traditionally, aggregates of conditional treatment effects (e.g., the average treatment effect) are computed by averaging over the empirical (joint) distribution of the covariates (Greene, 2007).
Recently, a moment-based approach was proposed by Kiefer and Mayer (2019, 2020) enabling aggregated effects to be computed using parameters of the (joint) distribution of the covariates (e.g., means, variances, and covariances). Instead of computing conditional effects given the observed covariate values and averaging over their empirical distribution, the conditional effects are integrated over all possible values of the covariates weighted by their likelihood given by the (assumed) joint density. While this procedure would usually have to be carried out by numerically solving an improper integral (e.g., an integral over conditional effects weighted by a multivariate normal density), the improper integral is replaced with a moment-generating function in the moment-based approach. Thus, as most moment-generating functions have closed-form solutions, the moment-based approach allows for a fast and analytical computation of the aggregated treatment effects.
The moment-based approach has two major advantages over the traditional empirical distribution approach. First, the covariate side of the effect calculation is stochastic instead of fixed, which enables more accurate statistical inferences about conditional and average treatment effects. In contrast, the traditional approach treats the observed covariate values as fixed by design (i.e., not varying between samples), which can lead to an underestimation of standard errors for the aggregated effects. Second, it allows common factor models for the covariates, meaning that not directly observed (latent) covariates can be accounted for as well. As the empirical distribution of latent variables is not observed, the traditional approach cannot deal with latent variables and would require fallible substitutes (i.e., sum scores, factor scores).
However, the extended moment-based approach as suggested by Kiefer and Mayer (2020) has an Achilles' heel: covariates are assumed to be multivariate normally distributed within the treatment groups. In practical settings, this assumption is often violated, as observed variables in real data sets in the social and health sciences frequently deviate from the normal distribution (Bono, Blanca, Arnau, & Gómez-Benito, 2017; Micceri, 1989). For example, Blanca, Arnau, Lopez-Montiel, Bono, and Bendayan (2013) examined 693 distributions from real psychological data and found that 74.4% presented slight or moderate deviations from the normal distribution, while 20% exhibited more extreme deviation. In a simulation study, Lei and Lomax (2005) investigated parameter and standard error bias in structural equation models when non-normal variables are introduced into normal distribution-based maximum likelihood estimation. They conclude that parameter estimation is sensitive to non-normal variables, but the bias introduced by slight non-normality is moderate. However, severe non-normality leads to substantial bias in loadings and structural parameters. Similarly, Kiefer and Mayer (2019) found in their examination of the univariate moment-based approach that missspecification of the covariates' distribution can introduce bias into the average effect estimation.
Thus, in this paper, we propose several extensions of the moment-based approach. These extensions aim to make the approach more flexible with regard to its distributional assumptions as well as to provide applied researchers with a more nuanced effect analysis. First, to properly account for non-normal covariates, we suggest differentiating between categorical covariates (e.g., gender, ethnicity) and non-normal continuous (or metric) covariates (e.g., pre-test count variables). We then describe four different possibilities to account for non-normal joint distributions. Second, we define and compute average and conditional treatment effects of interest, for example, the conditional effect given a treatment condition or given a gender. These kinds of finer-grained effects (as opposed to the average treatment effect) enable researchers to examine treatment efficacy under different conditions in a more nuanced way. Third, we consider the case in which multiple treatment conditions are to be examined, that is, treatment effects can be modelled and examined for more than one treatment condition (in comparison to a control group).
This paper is structured as follows. First, we provide general definitions of treatment effects on count outcomes based on multiple categorical and continuous covariates. Specifically, we introduce the average treatment effect as well as several conditional treatment effects. Second, we derive how the moment-based approach can be utilized to compute these effects, differentiating between a solution for categorical covariates and three possible approaches for non-normal continuous covariates. Third, we briefly present the negative binomial multi-group structural equation model as the statistical framework for simultaneously estimating the parameters of the negative binomial regression and the covariates' distribution. Finally, the moment-based approach is applied to a real data sample to highlight the key features of our proposed extensions.
2 Definition and terminology of treatment effects
2.1 Conditional effects function
In practical settings, a parameterization is needed for , which we will discuss at the end of this section. For a causal interpretation of the average and conditional effects, it is crucial that the effect functions be causally unbiased. Steyer, Mayer, and Fiege (2014) provide an overview of causality conditions ensuring unbiasedness of the effect function, one of which is the independence of the treatment variable from all potential confounders (e.g., created by randomized treatment assignment as in the ACTIVE study; see Section 6).
2.2 Average treatment effect
2.3 Conditional effects given X and K
2.4 Regression parameterization and factorization of the covariates' distribution
Up to this point, we have presented non-parametrical definitions of average and conditional effects on count outcomes. In order to derive empirically estimable quantities, we need to introduce a parameterization for the functions and an applicable factorization of the mixed joint density of the categorical covariate K and the continuous covariates .
In equation (4), the joint distribution of the covariates K and is defined as the marginal distribution of the joint distribution of the treatment variable X, the categorical covariate K, and the continuous covariate . In equation (5), the joint distribution is decomposed into a categorical group part P(X = x, K = k) and a group-conditional density of the continuous variables. This factorization serves two purposes. First, it allows us to identify the conditional densities required for the computation of the aforementioned average and conditional effects, namely , , , and . Second, we can use a multi-group approach, namely a multi-group structural equation model, to estimate parameters both of the negative binomial regression and of the distribution of the covariates.
The choice of a factorization is without loss of generality for our approach, because an alternative factorization can always be transformed into our factorization. We provide more information on this aspect in the discussion and in Appendix 1.
3 Moment-based approach assuming multivariate normality
In the previous section we introduced a distinction between categorical and continuous covariates. Such a distinction has been previously proposed for treatment effect computation, for example, in the EffectLiteR approach by Mayer et al. (2016). A saturated model is used, that is, the probability of each occurring value of the categorical covariate is examined. Thus, no further distributional assumption for the categorical covariate is required.
4 Accounting for non-normal continuous covariates
In the following subsections we present three different ways to incorporate non-normal continuous covariates into the moment-based approach. All three ways follow the basic idea of the moment-based approach, that is, (at least partly) substituting improper integrals with moment-generating functions. The suggested solutions depend on the level of information available for the non-normal continuous variables: (1) an alternative, non-normal joint distribution is known; (2) a plausible factorization of the joint distribution can be constructed; and (3) an approximation of the covariates' joint distribution is required. Note that we do not suggest that one of these approaches is generally preferable to the others. Rather, which approach to use depends on the concrete data situation one is confronted with.
All of the aforementioned approaches have in common that we use the joint density of the covariates in a maximum likelihood framework to estimate parameters and treatment effects. As these models contain negative binomial regressions, multi-group parts, measurement models and parameters of the non-normal distributions, they can involve many parameters. Thus, large sample sizes may be required for a solution to converge (Jackson, 2003). For smaller samples or cases in which none of the three proposed approaches is feasible, Bayesian modelling and estimation can be an alternative. We discuss Bayesian alternatives in the Discussion section.
4.1 Case 1: Known non-normal joint distribution
The first and probably simplest case is a known alternative non-normal (X = x, K = k)-conditional joint distribution for . In this case, we can substitute the density and moment-generating functions for effect computation with the ones from the alternative distribution. For example, if we want to account for a slight skew in our continuous variables, the multivariate skew-normal distribution (Azzalini & Valle, 1996) is a viable alternative to the normal distribution. Admittedly, the case of knowing a number of suitable non-normal multivariate distributions and their moment-generating function might not be very common, especially for applied researchers.
4.2 Case 2: Factorization of joint distribution
Sometimes, it might be impossible to find a suitable joint distribution, and the assumption of conditional independence might be too strong. A notable example of this case was chosen for our illustrative example: a continuous latent variable (i.e., depression) and a discrete, yet non-categorical count variable (i.e., baseline count of correctly answered items). To our knowledge, no joint distributions for count and continuous variables have been proposed to date. However, when examining count outcome variables, researchers often wish to account for the respective baseline count as well.
The factorization approach is challenging to integrate into the moment-based approach, because it does not yield a comprehensive moment-generating function for the factorized joint distribution. Consequently, a combination of moment-generating functions and numerical integration is required for effect estimation. In our illustrative example, we will further delineate these computations for the factorization approach.
In practical terms, the construction of a conditional distribution for some covariates can be achieved using a regression approach, that is, a parameterization of . In our illustrative example, this relation will be estimated using a regression with a logarithmic link function. From a causal perspective, such a regression can be problematic, because baseline variables do not necessarily have a temporal order, and we do not suggest that one baseline variable causes another baseline variable. Thus, the factorization approach is a technical workaround and its regression parameters are not necessarily of interest.
4.3 Case 3: Approximation with finite Gaussian mixtures
Finally, when no alternative joint distribution or plausible factorization can be found, it is possible to approximate the joint distribution of . One possible approach is to approximate the non-normal joint distribution of by using a finite mixture of M multivariate normal distributions, that is, where wm are weights and . In practice, this approximation of the distribution of can be estimated using a finite mixture or latent class approach (McLachlan & Peel, 2000). Researchers can control the degree of approximation by specifying the number M of latent classes, where only the expected values and the variance are allowed to vary among latent classes. This approach has previously been applied in the computation of average and conditional effects in a nonlinear regression setting by Mayer, Umbach, Flunger, and Kelava (2017) and can be estimated using nonlinear structural equation mixture models (Kelava & Brandt, 2014; Kelava, Nagengast, & Brandt, 2014). However, it is notable that an increasing number of latent classes M can lead to convergence problems in maximum likelihood estimation.
5 Negative binomial multi-group structural equation model
6 Illustrative example
To illustrate the use of our extension to the moment-based approach, we use data from the Advanced Cognitive Training for Independent and Vital Elderly (ACTIVE) study (Ball et al., 2002; Jobe et al., 2001; Tennstedt et al., 2005). The ACTIVE study is a large randomized controlled trial designed to examine the effectiveness of cognitive interventions among older adults. We will investigate the effects of these interventions while controlling for the (non-normal) count pre-test score, (latent) baseline depression, and participants' gender. As count outcome Y, we investigate post-test performance on an inductive reasoning assessment (i.e., letter sets). Letter sets evaluate how well an individual can recognize a pattern among several sets of letters. Thus, Y reflects the count of correctly answered items. The ACTIVE data subset analysed is publicly available; a link is provided in the online Appendix S1.
In this paper we do not present a comprehensive analysis of the ACTIVE study. Instead, the primary goal of this paper is to illustrate how non-normally distributed covariates can be integrated into the moment-based approach when estimating average and conditional treatment effects. The NB-MG-SEM and treatment effects were estimated using R (R Core Team, 2018) and the CountEffects package. The CountEffects package is an implementation of the moment-based approach assuming (X = x, K = k)-conditionally multivariate normally distributed covariates, which we extended with functions to estimate all three proposed approaches to deal with non-normal covariates for our illustrative example. In the online Appendix S1, we provide information on how to install CountEffects and how to estimate treatment effects based on the multivariate normal assumption and all three cases. Our findings indicate that the three non-normal cases yield similar results for the illustrative example, while the normal case yields differing estimates for some effects. However, in this section we will only describe the estimation and results of the case 2 scenario. We used listwise deletion in our analyses in order to keep the supplementary R code accessible for interested readers. However, we also ran a full-information maximum likelihood estimation for our model (using Mplus; Muthén & Muthén, 1998–2015) which yielded similar results.
6.1 Sample
The total sample size for our analysis was N = 2,363. The participants were randomly assigned to one of four conditions: a no-contact control group (X = 0, N = 592), a memory training (X = 1, N = 589), a reasoning training (X = 2, N = 590), and a speed of processing training condition (X = 3, N = 606). Each treatment condition consisted of a ten-session training intervention. In a baseline assessment, the participants took several cognitive functioning tests and completed a self-report questionnaire of psychological measures. The post-test assessment was conducted in the first 10 days after the last training session. The participants were predominantly female (75.8%, K = 1).
6.2 Measures
6.2.1 Depression
Depressive symptoms were assessed using the 12-item version of the Center for Epidemiological Studies Depression Scale (CESD-12; Radloff, 1977). Participants rated the frequency of several depressive symptoms (e.g., feeling sad) during the last week on a four-point scale from 0 = never to 3 = 5–7 days. We modelled baseline depression as latent variable , measured by three parcels (i.e., sum scores of four items each; Z11, Z12, Z13). We used random item parcels for simplicity in our illustrative example. For warnings about the use of parcels, see Marsh, Lüdtke, Nagengast, Morin, and von Davier (2013) and Sterba and Rights (2016).
6.2.2 Letter sets
The count of correctly answered items on a letter sets task was used as outcome Y (i.e., post-test score) and covariate (i.e., pre-test score) in our analysis. Letter sets evaluate how well an individual can find rules or patterns that make different sets of letters alike in some way. Each problem had five sets of letters with four letters in each set. Participants were given 15 different problems and had 7 min to complete the task.
6.3 Model
6.4 Average and conditional treatment effects
Standard errors for the estimated treatment effects can be derived using the delta method (cf. Boos & Stefanski, 2013, p. 237).3 In our analysis, we computed symmetric confidence intervals based on the standard errors, because the estimated treatment effects are asymptotically normally distributed and our sample was comparatively large. However, in smaller samples confidence intervals based on, for example, a bootstrap approach are recommended.
6.5 Results
It is currently not possible to evaluate the complete model fit using a χ2 test or other fit statistics, as these are based on the χ2 value of the model implied covariance matrix for the observed variables. For models with a logarithmic link function, the implied covariance matrix is not an appropriate description of the dependencies among the variables due to the nonlinearity.
6.5.1 Model parameters
The maximum likelihood estimates for the model parameters are given in Table 1. We will not discuss them in detail here, but rather present some notable findings to illustrate the interpretation of these parameters.
Parameter | Estimate | SE | p | Parameter | Estimate | SE | p |
---|---|---|---|---|---|---|---|
X = 0, K = 0 | X = 0, K = 0 | ||||||
Estimates for | Estimates for | ||||||
α000 | 1.324 | 0.103 | <.001 | α010 | 1.218 | 0.061 | <.001 |
α001 | −0.048 | 0.041 | .241 | α011 | -0.034 | 0.018 | .061 |
α002 | 0.099 | 0.011 | <.001 | α012 | 0.103 | 0.007 | <.001 |
Estimates for | Estimates for | ||||||
γ000 | 1.970 | 0.059 | <.001 | γ010 | 1.904 | 0.035 | <.001 |
γ001 | 0.148 | 0.049 | .002 | γ011 | -0.109 | 0.020 | <.001 |
Estimates for | Estimates for | ||||||
µ00 | 1.113 | 0.087 | <.001 | µ01 | 1.571 | 0.072 | <.001 |
0.792 | 0.128 | <.001 | 1.629 | 0.149 | <.001 | ||
X = 1, K = 0 | X = 1, K = 1 | ||||||
Estimates for | Estimates for | ||||||
α100 | 1.141 | 0.110 | <.001 | α110 | 1.256 | 0.057 | <.001 |
α101 | 0.016 | 0.031 | .602 | α111 | -0.037 | 0.016 | .019 |
α102 | 0.111 | 0.013 | <.001 | α112 | 0.102 | 0.007 | <.001 |
Estimates for | Estimates for | ||||||
γ100 | 1.938 | 0.053 | <.001 | γ110 | 1.890 | 0.030 | <.001 |
γ101 | -0.057 | 0.034 | .094 | γ111 | -0.077 | 0.017 | <.001 |
Estimates for | Estimates for | ||||||
µ10 | 1.296 | 0.109 | <.001 | µ11 | 1.440 | 0.074 | <.001 |
1.298 | 0.201 | <.001 | 1.955 | 0.176 | <.001 | ||
X = 2, K = 0 | X = 2, K = 1 | ||||||
Estimates for | Estimates for | ||||||
α200 | 1.335 | 0.103 | <.001 | α210 | 1.340 | 0.053 | <.001 |
α201 | 0.004 | 0.026 | .890 | α211 | -0.013 | 0.014 | .345 |
α202 | 0.102 | 0.011 | <.001 | α212 | 0.099 | 0.006 | <.001 |
Estimates for | Estimates for | ||||||
γ200 | 2.002 | 0.052 | <.001 | γ210 | 1.785 | 0.033 | <.001 |
γ201 | -0.098 | 0.031 | .001 | γ211 | -0.024 | 0.016 | .145 |
Estimates for | Estimates for | ||||||
µ20 | 1.411 | 0.127 | <.001 | µ21 | 1.641 | 0.074 | <.001 |
α20 | 1.879 | 0.303 | <.001 | α21 | 1.879 | 0.171 | <.001 |
X = 3, K = 0 | X = 3, K = 1 | ||||||
Estimates for | Estimates for | ||||||
α300 | 1.274 | 0.101 | <.001 | α310 | 1.210 | 0.057 | <.001 |
α301 | 0.016 | 0.026 | .532 | α311 | -0.009 | 0.017 | .581 |
α302 | 0.095 | 0.012 | <.001 | α312 | 0.107 | 0.007 | <.001 |
Estimates for | Estimates for | ||||||
γ300 | 1.936 | 0.051 | <.001 | γ310 | 1.823 | 0.033 | <.001 |
γ301 | -0.081 | 0.030 | .007 | γ311 | -0.052 | 0.018 | .005 |
Estimates for | Estimates for | ||||||
µ30 | 1.416 | 0.126 | <.001 | µ31 | 1.526 | 0.067 | <.001 |
α30 | 1.861 | 0.277 | <.001 | α31 | 1.515 | 0.137 | <.001 |
Pre-test depression had no significant effect on the post-test count of correctly answered items in either group (e.g., α001 = 0.048, 95% confidence interval (CI) [−0.129, 0.033]), except for female participants receiving the memory training (α111 = −0.037, 95% CI [−0.069, 0.006]). Furthermore, pre-test depression was significantly negatively related to the pre-test count of correctly answered items in most groups (e.g., γ011 = −0.109, 95% CI [−0.148, 0.070], γ201 = −0.098, 95% CI [−0.758, 0.038], and γ301 = −0.081, 95% CI [−0.139, 0.022]). For example, for male participants in the reasoning training, each one-unit change in pre-test depression was linked to a 9% decrease in correctly answered items at baseline (i.e., γ201, exp(−0.098) = −0.91). In two groups, this relationship was not significant (γ101 = −0.057, 95% CI [−0.123, 0.010] and γ211 = −0.024, 95% CI [−0.056, 0.008]).
Pre-test count of correctly answered items was a significant positive predictor of post-test count of correctly answered items in all groups. For example, for female participants in the memory group, each additional correctly answered item at baseline was linked with an 11% increase in correctly answered items at post-test (i.e., α112, exp(0.102) = 1.11). The intercept coefficients reflect the expected post-test count of correctly answered items for a male person with depression score () and zero correctly answered items at baseline (). For example, in the reasoning training (i.e., α200 = 1.335), the expected count is .
6.5.2 Average effects
An overview of all average and conditional treatment effects estimated for our illustrative example is given in Table 2. Remember that the letter sets test was part of a cognitive assessment measuring reasoning performance. Hence, we would expect the reasoning training in particular to have a significant effect, while the memory and speed of processing training might not necessarily affect reasoning performance. In line with this, the average treatment effect of the memory training (, 95% CI [−0.203, 0.372], ES = 0.030) and the average treatment effect of the speed of processing training (, 95% CI [−0.052, 0.525], ES = 0.083) on the count of correctly answered items were not significant. The average treatment effect of the reasoning training, however, was significant (, 95% CI [0.487, 1.080], ES = 0.275), that is, participants in this condition correctly answered 0.849 items more on average compared to baseline.
Effect | Estimate | SE | ES | Effect | Estimate | SE | ES |
---|---|---|---|---|---|---|---|
Estimated average and conditional effects of memory training | |||||||
Average effect | Conditional effects | ||||||
0.084 | 0.147 | 0.030 | −0.238 | 0.302 | −0.084 | ||
Conditional effects | 0.161 | 0.166 | 0.056 | ||||
0.053 | 0.147 | 0.019 | −0.130 | 0.314 | −0.046 | ||
0.094 | 0.151 | 0.033 | 0.167 | 0.171 | 0.059 | ||
0.093 | 0.147 | 0.033 | −0.109 | 0.321 | −0.038 | ||
0.095 | 0.147 | 0.034 | 0.157 | 0.167 | 0.055 | ||
Conditional effects | −0.119 | 0.309 | −0.042 | ||||
−0.152 | 0.305 | −0.053 | 0.160 | 0.167 | 0.056 | ||
0.161 | 0.167 | 0.057 | |||||
Estimated average and conditional effects of reasoning training | |||||||
Average effect | Conditional effects | ||||||
0.784 | 0.151 | 0.275 | 0.598 | 0.317 | 0.210 | ||
Conditional effects | 0.814 | 0.171 | 0.286 | ||||
0.756 | 0.152 | 0.266 | 0.697 | 0.325 | 0.245 | ||
0.786 | 0.156 | 0.276 | 0.814 | 0.178 | 0.286 | ||
0.797 | 0.152 | 0.280 | 0.715 | 0.329 | 0.251 | ||
0.789 | 0.151 | 0.277 | 0.831 | 0.171 | 0.292 | ||
Conditional effects | 0.697 | 0.319 | 0.245 | ||||
0.674 | 0.318 | 0.237 | 0.817 | 0.171 | 0.287 | ||
0.819 | 0.172 | 0.288 | |||||
Estimated average and conditional effects of speed of processing training | |||||||
Average effect | Conditional effects | ||||||
0.237 | 0.147 | 0.083 | −0.102 | 0.308 | −0.036 | ||
Conditional effects | 0.321 | 0.166 | 0.113 | ||||
0.207 | 0.147 | 0.073 | −0.025 | 0.320 | −0.009 | ||
0.234 | 0.152 | 0.082 | 0.317 | 0.172 | 0.111 | ||
0.254 | 0.148 | 0.089 | −0.015 | 0.326 | −0.005 | ||
0.250 | 0.147 | 0.088 | 0.340 | 0.167 | 0.120 | ||
Conditional effects | 0.011 | 0.314 | 0.004 | ||||
−0.035 | 0.310 | −0.012 | 0.322 | 0.166 | 0.113 | ||
0.325 | 0.167 | 0.114 |
6.5.3 Conditional effects given (K = k)
The conditional treatment effects given a gender are given in detail in Table 2. Here, we examine differential treatment effects depending on the participants' gender.
For the memory training, we found a slightly negative treatment effect for male participants (, 95% CI [−0.750, 0.446], ES = −0.053) and a positive treatment effect for female participants K = 1 (, 95% CI [−0.167, 0.489], ES = 0.057). While the effect sizes differed in direction and magnitude, both effects were not statistically significant.
The reasoning training had significant effects for both men and women. The conditional effect for female participants (, 95% CI [0.482, 1.156], ES = 0.288) was higher than the conditional effect for male participants (, 95% CI [0.052, 1.297], ES = 0.237).
While the average treatment effect of the speed of processing training was not significant, we found a slightly non-significant conditional effect for female participants (, 95% CI [−0.002, 0.652], ES = 0.114). Conversely, for male participants, the speed of processing training did not yield a significant effect (, 95% CI [−0.750, 0.446], ES = −0.053). This differential effect illustrates why examining conditional effects can be crucial when evaluating a treatment.
6.5.4 Conditional effects given (X = x) and (X = x, K = k)
The conditional treatment effects given a treatment condition (and gender) are given in detail in Table 2. As the ACTIVE study was a randomized controlled trial, the (X = x, K = k)-conditional distributions of depression and baseline test scores should not differ across levels of X. Thus, the conditional effects given a treatment condition are expected to be close to the corresponding average treatment effects. The same applies to conditional effects given treatment and gender. We will not discuss all of these effects in detail here, but rather illustrate their interpretation using one example.
For example, the conditional treatment effect of the memory training for the non-treated (, 95% CI [−0.234, 0.340], ES = 0.019) is similar to the average effect of the memory training. This refers to the expected effect if participants who were assigned to the no-contact control group had instead been assigned to the memory training. These effects being close to the average treatment effect reflects that differences in baseline variables are small across groups. Note that for observational studies or broken experiments, these effects would not be necessarily close to each other due to possible baseline differences.
7 Summary and conclusions
In this paper we presented and illustrated a new method of accounting for non-normal covariates when estimating average and conditional treatment effects for count outcomes. We extended the moment-based approach by Kiefer and Mayer (2019, 2020) in three respects. First, we presented four ways to account for non-normal covariates: (1) for categorical covariates, applying a saturated model; for continuous covariates, we suggested either (2) the use of an alternative, known non-normal joint distribution (e.g., skew-normal distribution), (3) a plausible factorization into marginal and conditional distributions, or (4) approximation via a finite Gaussian mixture distribution. Second, we extended the effect analysis to multiple treatment conditions, making it possible to evaluate the effectiveness of several treatments simultaneously. Third, we introduced conditional effects given a treatment condition and/or values of the categorical covariates into the moment-based approach, allowing for a finer-grained effect analysis. Finally, we provided an illustrative example to show how our extensions of the moment-based approach can be applied to real data. In our example, three cognitive training conditions were compared to a no-contact control group regarding the count of correctly answered items in a cognitive reasoning test. We considered as covariates the baseline count of correctly answered items (non-normal covariate), baseline depression (latent covariate), and gender (categorical covariate). The corresponding negative binomial multi-group structural equation model and the average and conditional treatment effect estimations were carried out in R. We have made these functions conveniently accessible for applied researchers in an R package.
The aforementioned advancements bring the benefits of the moment-based approach for statistical inference to a broader range of applied scenarios in psychological, social, and health sciences. In contrast to earlier approaches, the moment-based approach treats observed group sizes and covariate values as random and not predetermined by the experimenter. Ignoring this randomness would lead to underestimation of standard errors and, thus, inflated Type I error rates and decreased power, which has previously been shown for stochastic covariates (Li, McLouth, & Delaney, 2020; Liu, West, Levy, & Aiken, 2017) and stochastic group sizes (Mayer & Thoemmes, 2019). In addition, the moment-based approach allows accounting for measurement error in covariates. For an overview of consequences of measurement error in nonlinear regression models, see Carroll, Ruppert, Stefanski, and Crainiceanu (2010). However, in the social and health sciences, observed variables in real data sets often deviate from the normal distribution (Bono et al., 2017; Micceri, 1989). While Kiefer and Mayer (2020) proposed a moment-based approach assuming strictly multivariate normally distributed covariates, we relaxed this assumption in this paper, offering several alternatives for non-normally distributed variables. Thus, the moment-based approach with our extension should better fit typical applications in social and health sciences.
7.1 Limitations and further research
There are some aspects not covered in this paper that could provide starting points for further research and refinements of the approach:
In equation (3), we assumed a linear parameterization for the predictor function . While this is the standard parameterization of, for example, generalized linear models (McCullagh & Nelder, 1996) and for effect analysis with linear link function (e.g., Mayer et al., 2016), sometimes nonlinear predictor functions might be of substantial interest. For example, Mayer et al. (2017) investigated the effectiveness of autonomy support by ninth-grade teachers in reducing students' state of boredom. They hypothesized a quadratic relationship between boredom and self-efficacy, which means that the treatment could be most effective for medium values of self-efficacy and less for both extremes. Similarly, Liu and West (2015) use trigonometric terms to predict cyclic patterns in daily report (count) data. The inclusion of quadratic or other nonlinear terms might also be of interest for count outcomes and regression models with a logarithmic link function. We presume that the moment-based approach would not yield analytical effect formulas for nonlinear predictor functions, but effect computation using numerical integration would be an alternative.
Throughout the paper, we factorized the joint distribution of the covariates into a marginal distribution of the categorical variables (i.e., treatment X and covariates K) and a (X = x, K = k)-conditional distribution of the continuous covariates. Our choice can seem restrictive, as alternative factorizations might be more suitable for some cases. For example, Kiefer and Mayer (2019) used a converse factorization, that is, a marginal distributed continuous covariate Z and a (Z = z)-conditional probability of the binary treatment X. However, we would argue that for a given joint distribution multiple equivalent factorizations exist. In the aforementioned example, the joint distribution can be equivalently rewritten in our factorization, that is, a marginal distribution of the binary treatment and an (X = x)-conditional distribution of Z (for a derivation, see Appendix 1).
We used maximum likelihood estimation to obtain parameter estimates for the negative binomial multi-group structural equation model, but in applied settings alternative statistical frameworks might be more suitable. Maximum likelihood estimation can suffer from non-convergence issues in complex models (Deng, Yang, & Marcoulides, 2018). We also exhibited this phenomenon in our illustrative example, where we had to exclude overdispersion parameters from the model in order to achieve convergence. Thus, we recommend three solutions to address non-convergence issues. First, a pragmatic workaround would be to specify a more parsimonious model, because non-convergence might be a consequence of overparameterization (Jackson, 2001, 2003). Second, a Bayesian structural equation modelling framework can be applied. The Bayesian approach can typically handle complex models much easier than maximum likelihood estimation (Merkle & Rosseel, 2015). For an overview of Bayesian estimation of structural equation models based on finite-mixture distributions or distributions based on the exponential family, see Lee (2007, Chapters 11 and 13). Third, recent factor score approaches (e.g., Devlieger & Rosseel, 2017) might also help achieving convergence by simplifying the measurement model and, thereby, the numerical integration part of the maximum likelihood estimation. However, if the factor scores depend on a distributional assumption, it would be wise to use the same distributional assumption within the moment-based approach.
In our illustrative example, we examined the effectiveness of cognitive training using a pre–post design. The ACTIVE study also contains several follow-up measurements allowing long-term effectiveness of the cognitive training to be investigated. In future developments, it might be fruitful to incorporate this longitudinal information into effect analyses. We suggest two ways for doing so. First, when examining long-term effects using a follow-up measurement, earlier measurements (e.g., post-test) could be included as mediators. In this scenario, total, direct, and indirect effects of the treatment can be distinguished (Mayer, Thoemmes, Rose, Steyer, & West, 2014). This can help researchers understand, for example, which post-test characteristics foster a long-term effect of the cognitive training. Second, several measurement occasions can be summarized or contrasted, for example, using a growth curve model (McArdle & Epstein, 1987) or growth component model (Kiefer, Rosseel, Wiese, & Mayer, 2018; Mayer, Steyer, & Mueller, 2012) among follow-up measurements. In this case, the outcome of interest would be a latent slope or growth component instead of the count outcome.
In non-randomized observational studies the treatment effects discussed in this paper are not necessarily causal effects. However, causality theories, such as Rubin's causal model (Rubin, 2005) or the stochastic theory of causal effects (Steyer et al., 2014) provide causality conditions under which causal effects can be estimated. Usually, these conditions require a careful selection of covariates, which then are controlled for with regression adjustment (e.g., Mayer, 2019; Mayer et al., 2016) or propensity score methods (Rosenbaum & Rubin, 1983). For an overview of design and analysis in quasi-experimental settings, see also Reichhardt (2019).
Appendix 1
Factorization of the joint distribution
Consequently, if one version or factorization of the joint distribution is known, equivalent factorizations for the same joint distribution can be derived.
Note that these transformations are generally required for effect computation only. The maximum likelihood estimation would, nevertheless, be based on fZ(Z) and P(X = x|Z = z) in this example. In contrast, when trying to estimate the likelihood function based on P(X = x) and fZ|X = x, it might be tempting to estimate P(X = x) directly instead of . However, introducing a parameterization for P(X = x) would lead to redundancies, because P(X = x) is already determined by µZ, ,, and . Consequently, additional estimation of P(X = x) would lead to a non-invertible covariance matrix of the parameter estimates and, thus, standard errors could not be computed. The redundant parameter would remain at its starting value, and thus would not yield a trustworthy estimate. The likelihood function should only contain parameters µZ, ,, and . Corresponding parameters from our factorization and their standard errors can then be computed for effect estimation using the delta method.
Appendix 2
Model estimation
Note that marginalizing over the joint density of the complete data is only one way to optimize this likelihood. It would also be possible to use an EM algorithm (Dempster, Laird, & Rubin, 1977) to solve the complete data likelihood, but this is beyond the scope of this paper.
Appendix 3
Effect estimation
In this section we provide details on the derivation of the average and conditional effect formulas used in our illustrative example. Let us consider a count outcome variable , a treatment variable with four levels, a binary categorical covariate , and a vector of continuous covariates containing a latent covariate measured by three observed indicator variables , and a count covariate . In the model section, we already stated that we assume latent baseline depression is -conditionally marginally normally distributed and is related to the baseline count of correctly answered items via a Poisson regression with where .