Volume 74, Issue 3 p. 513-540
Article
Open Access

Treatment effects on count outcomes with non-normal covariates

Christoph Kiefer

Corresponding Author

Christoph Kiefer

Bielefeld University, Bielefeld, Germany

*Correspondence should be addressed to Christoph Kiefer, Bielefeld University, Bielefeld, Germany (email: [email protected]).

Search for more papers by this author
Axel Mayer

Axel Mayer

Bielefeld University, Bielefeld, Germany

Search for more papers by this author
First published: 05 May 2021

The data used in the illustrative example are publicly available in OpenICPSR at https://www.openicpsr.org/openicpsr/project/128941; https://doi.org/10.3886/E128941

Abstract

The effects of a treatment or an intervention on a count outcome are often of interest in applied research. When controlling for additional covariates, a negative binomial regression model is usually applied to estimate conditional expectations of the count outcome. The difference in conditional expectations under treatment and under control is then defined as the (conditional) treatment effect. While traditionally aggregates of these conditional treatment effects (e.g., average treatment effects) are computed by averaging over the empirical distribution, a recently proposed moment-based approach allows for computing aggregate effects as a function of distribution parameters. The moment-based approach makes it possible to control for (latent) multivariate normally distributed covariates and provides more reliable inferences under certain conditions. In this paper we propose three different ways to account for non-normally distributed continuous covariates in this approach: an alternative, known non-normal distribution; a plausible factorization of the joint distribution; and an approximation using finite Gaussian mixtures. A saturated model is used for categorical covariates, making a distributional assumption obsolete. We further extend the moment-based approach to allow for multiple treatment conditions and the computation of conditional effects for categorical covariates. An illustrative example highlighting the key features of our extension is provided.

1 Introduction

The evaluation of treatment effects on count outcomes is quite common in the social and health sciences. Often, covariates are included in the analysis using a negative binomial regression model (Garrett et al., 2018; Hittner, Owens, & Swickert, 2016; Jobe et al., 2001; Mazerolle et al., 2019; Nusser & Weinert, 2017; Schaumberg & Flynn, 2017; Sridharan, Shoda, Heffner, & Bricker, 2019), that is, the conditional expectation for the count outcome is logarithmically linked to the treatment variable and the covariates. The difference in conditional expectations under treatment and under control is then defined as the (conditional) treatment effect. Traditionally, aggregates of conditional treatment effects (e.g., the average treatment effect) are computed by averaging over the empirical (joint) distribution of the covariates (Greene, 2007).

Recently, a moment-based approach was proposed by Kiefer and Mayer (2019, 2020) enabling aggregated effects to be computed using parameters of the (joint) distribution of the covariates (e.g., means, variances, and covariances). Instead of computing conditional effects given the observed covariate values and averaging over their empirical distribution, the conditional effects are integrated over all possible values of the covariates weighted by their likelihood given by the (assumed) joint density. While this procedure would usually have to be carried out by numerically solving an improper integral (e.g., an integral over conditional effects weighted by a multivariate normal density), the improper integral is replaced with a moment-generating function in the moment-based approach. Thus, as most moment-generating functions have closed-form solutions, the moment-based approach allows for a fast and analytical computation of the aggregated treatment effects.

The moment-based approach has two major advantages over the traditional empirical distribution approach. First, the covariate side of the effect calculation is stochastic instead of fixed, which enables more accurate statistical inferences about conditional and average treatment effects. In contrast, the traditional approach treats the observed covariate values as fixed by design (i.e., not varying between samples), which can lead to an underestimation of standard errors for the aggregated effects. Second, it allows common factor models for the covariates, meaning that not directly observed (latent) covariates can be accounted for as well. As the empirical distribution of latent variables is not observed, the traditional approach cannot deal with latent variables and would require fallible substitutes (i.e., sum scores, factor scores).

However, the extended moment-based approach as suggested by Kiefer and Mayer (2020) has an Achilles' heel: covariates are assumed to be multivariate normally distributed within the treatment groups. In practical settings, this assumption is often violated, as observed variables in real data sets in the social and health sciences frequently deviate from the normal distribution (Bono, Blanca, Arnau, & Gómez-Benito, 2017; Micceri, 1989). For example, Blanca, Arnau, Lopez-Montiel, Bono, and Bendayan (2013) examined 693 distributions from real psychological data and found that 74.4% presented slight or moderate deviations from the normal distribution, while 20% exhibited more extreme deviation. In a simulation study, Lei and Lomax (2005) investigated parameter and standard error bias in structural equation models when non-normal variables are introduced into normal distribution-based maximum likelihood estimation. They conclude that parameter estimation is sensitive to non-normal variables, but the bias introduced by slight non-normality is moderate. However, severe non-normality leads to substantial bias in loadings and structural parameters. Similarly, Kiefer and Mayer (2019) found in their examination of the univariate moment-based approach that missspecification of the covariates' distribution can introduce bias into the average effect estimation.

Thus, in this paper, we propose several extensions of the moment-based approach. These extensions aim to make the approach more flexible with regard to its distributional assumptions as well as to provide applied researchers with a more nuanced effect analysis. First, to properly account for non-normal covariates, we suggest differentiating between categorical covariates (e.g., gender, ethnicity) and non-normal continuous (or metric) covariates (e.g., pre-test count variables). We then describe four different possibilities to account for non-normal joint distributions. Second, we define and compute average and conditional treatment effects of interest, for example, the conditional effect given a treatment condition or given a gender. These kinds of finer-grained effects (as opposed to the average treatment effect) enable researchers to examine treatment efficacy under different conditions in a more nuanced way. Third, we consider the case in which multiple treatment conditions are to be examined, that is, treatment effects can be modelled and examined for more than one treatment condition (in comparison to a control group).

This paper is structured as follows. First, we provide general definitions of treatment effects on count outcomes based on multiple categorical and continuous covariates. Specifically, we introduce the average treatment effect as well as several conditional treatment effects. Second, we derive how the moment-based approach can be utilized to compute these effects, differentiating between a solution for categorical covariates and three possible approaches for non-normal continuous covariates. Third, we briefly present the negative binomial multi-group structural equation model as the statistical framework for simultaneously estimating the parameters of the negative binomial regression and the covariates' distribution. Finally, the moment-based approach is applied to a real data sample to highlight the key features of our proposed extensions.

2 Definition and terminology of treatment effects

We start by introducing definitions of average and conditional treatment effects on a count outcome variable Y. Let the discrete treatment variable X have p+1 levels denoted with values x = 0,1,…, p. We will take X = 0 as reference or control group, and thus t = 1,…,p are the values of the remaining treatment conditions.1 Furthermore, we will consider a single (unfolded) categorical variable K with j+1 levels and values k = 0,1,…, j, and a vector of (latent) variables urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0007 with z = 1,…, q denoting the (z+1) th element of urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0008. We use urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0009 to denote values of urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0010. For this set of variables, the following parameterization of urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0011 always holds:
urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0012(1)
urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0013(2)
In equation (1) the conditional expectation of Y is decomposed into an intercept function g0 and a conditional effect function gt for treatment condition t (cf. Mayer, Dietzfelbinger, Rosseel, & Steyer, 2016; Steyer & Nagel, 2017). As Y is a count variable, the conditional expectation urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0014 can always be presented as in equation (2). Note that we use a regression with a logarithmic link function for each combination of X = x and K = k, that is, the function urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0015 is group-specific:
urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0016
we use group-specific functions urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0017, because the effect computations later in this paper will be based on a multi-group model. The term urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0018 denotes a partial conditional expectation. For its definition and further details, see Steyer and Nagel (2017, Section 14.4).

2.1 Conditional effects function

The intercept function urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0019 denotes the expected values for the control group X = 0,
urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0020
and the p effect functions urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0021 represent the expected increase/decrease in Y due to treatment t, that is, the difference between the conditional expectation of Y under treatment t compared to the control group:
urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0022
The effect function urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0023 gives a conditional treatment effect for any value of the covariates K and urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0024. The conditional expectations urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0025 reflect the expected outcome given a treatment condition t. For example, urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0026 denotes the expected count of Y given K = 1 for a person who receives training X = 2 and has a pre-test score of urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0027.

In practical settings, a parameterization is needed for urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0028, which we will discuss at the end of this section. For a causal interpretation of the average and conditional effects, it is crucial that the effect functions urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0029 be causally unbiased. Steyer, Mayer, and Fiege (2014) provide an overview of causality conditions ensuring unbiasedness of the effect function, one of which is the independence of the treatment variable from all potential confounders (e.g., created by randomized treatment assignment as in the ACTIVE study; see Section 6).

2.2 Average treatment effect

The average treatment effect of treatment X = t compared to control group X = 0 is defined as the unconditional expectation of the effect function,
urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0030
where urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0031 denotes the density of the joint distribution of K and urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0032, and therefore reflects the average over the conditional effects given all values of the covariates. The average effect AEt0 gives the expected effect for a randomly sampled person assigned to treatment condition X = t compared to the control group X = 0.

2.3 Conditional effects given X and K

In addition to the average effect, finer-grained aggregates of the effect function also exist. For example, the treatment effect for female participants can be examined. This is represented as a conditional effect given a value k of the categorical covariate K and averages over the (K = k)-conditional distribution of the continuous covariates,
urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0033
For example, if K represents a participant's gender, then urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0034 provides the aggregated conditional effect of treatment X = t for all persons with gender K = k.
In a similar vein, the conditional effect given a treatment condition is defined by
urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0035
These effects are of interest when the (X = x)-conditional covariate distribution urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0036 differs among treatment groups, as is the case, for example, in observational studies (i.e., without randomization) or in so-called broken experiments (Sagarin et al., 2014). For an overview, see Geneletti and Dawid (2011). In properly randomized trials, however, we would expect the conditional effects given a treatment condition to be equivalent to the corresponding average treatment effect.
Finally, the conditional effects above can also be combined to give the conditional effect given a value k of K and a treatment condition x of X:
urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0037
Again, for a randomized controlled trial, we would expect these effects to be equivalent to the corresponding treatment effects given a value of the categorical covariate.

2.4 Regression parameterization and factorization of the covariates' distribution

Up to this point, we have presented non-parametrical definitions of average and conditional effects on count outcomes. In order to derive empirically estimable quantities, we need to introduce a parameterization for the urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0038 functions and an applicable factorization of the mixed joint density of the categorical covariate K and the continuous covariates urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0039.

We choose a linear parameterization of the function urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0040,
urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0041(3)
where urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0042 is a real-valued vector of regression coefficients with length q+1. This corresponds to the parameterization widely used in count regression models, for example, Poisson or negative binomial regression models. Hence, a count regression is specified for each combination of (X = x, K = k).
As can be seen in the previous subsection, the various average and conditional treatment effects require unconditional, but also (X = x)-, (K = k)-, and (X = x, K = k)-conditional distributions of the covariates K and urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0043. Hence, we provide a factorization of the mixed joint distribution urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0044 allowing us to easily derive all of the aforementioned unconditional and conditional distributions:
urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0045(4)
urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0046(5)

In equation (4), the joint distribution of the covariates K and urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0047 is defined as the marginal distribution of the joint distribution of the treatment variable X, the categorical covariate K, and the continuous covariate urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0048. In equation (5), the joint distribution is decomposed into a categorical group part P(X = x, K = k) and a group-conditional density urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0049 of the continuous variables. This factorization serves two purposes. First, it allows us to identify the conditional densities required for the computation of the aforementioned average and conditional effects, namely urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0050, urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0051, urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0052, and urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0053. Second, we can use a multi-group approach, namely a multi-group structural equation model, to estimate parameters both of the negative binomial regression and of the distribution of the covariates.

The choice of a factorization is without loss of generality for our approach, because an alternative factorization can always be transformed into our factorization. We provide more information on this aspect in the discussion and in Appendix 1.

Following the linear parameterization of the function urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0054 and factorization of the joint distribution urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0055, the average treatment effect, for example, can be computed as
urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0056(6)
Note that the parameterization and factorization are chosen with regard to the statistical framework for parameter estimation we will discuss later on, that is, negative binomial regression and multi-group structural equation models. As stated before, for a causal interpretation of the treatment effects, the effect function urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0057 must be causally unbiased. Causal unbiasedness goes beyond unbiasedness of the estimated parameters and means that there are no unobserved confounders of the treatment effects. It can be achieved, for example, by an (unconditional or conditional) randomized assignment of persons to the treatment groups or by controlling for all confounders. Tools developed in the causal inference literature such as propensity scores (Rosenbaum & Rubin, 1983) can be helpful to meet the ‘no unobserved confounders’ condition. For an overview of conditions under which causal unbiasedness is achieved, see Steyer et al. (2014). Thus, applied researchers should carefully evaluate whether the parameterization and factorization described here fit their hypotheses and assumptions.

3 Moment-based approach assuming multivariate normality

In the previous section we introduced a distinction between categorical and continuous covariates. Such a distinction has been previously proposed for treatment effect computation, for example, in the EffectLiteR approach by Mayer et al. (2016). A saturated model is used, that is, the probability of each occurring value of the categorical covariate is examined. Thus, no further distributional assumption for the categorical covariate is required.

However, including continuous covariates requires specification of the (X = x, K = k)-conditional density urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0058 of urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0059. In their extension of the moment-based approach, Kiefer and Mayer (2020) suggest assuming (X = x)-conditional multivariate normality. Adapting this notion to our distinction between categorical covariates K and continuous covariates urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0060 and assuming (X = x, K = k)-conditional normality, that is, urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0061, the integration part of the average treatment effect in equation (6) can be substituted with moment-generating functions:
urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0062
Sometimes it might suffice to account for the case of categorical covariates in addition to normally distributed continuous covariates to obtain unbiased parameter and effect estimates.

4 Accounting for non-normal continuous covariates

In the following subsections we present three different ways to incorporate non-normal continuous covariates into the moment-based approach. All three ways follow the basic idea of the moment-based approach, that is, (at least partly) substituting improper integrals with moment-generating functions. The suggested solutions depend on the level of information available for the non-normal continuous variables: (1) an alternative, non-normal joint distribution is known; (2) a plausible factorization of the joint distribution can be constructed; and (3) an approximation of the covariates' joint distribution is required. Note that we do not suggest that one of these approaches is generally preferable to the others. Rather, which approach to use depends on the concrete data situation one is confronted with.

All of the aforementioned approaches have in common that we use the joint density of the covariates in a maximum likelihood framework to estimate parameters and treatment effects. As these models contain negative binomial regressions, multi-group parts, measurement models and parameters of the non-normal distributions, they can involve many parameters. Thus, large sample sizes may be required for a solution to converge (Jackson, 2003). For smaller samples or cases in which none of the three proposed approaches is feasible, Bayesian modelling and estimation can be an alternative. We discuss Bayesian alternatives in the Discussion section.

4.1 Case 1: Known non-normal joint distribution

The first and probably simplest case is a known alternative non-normal (X = x, K = k)-conditional joint distribution for urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0063. In this case, we can substitute the density and moment-generating functions for effect computation with the ones from the alternative distribution. For example, if we want to account for a slight skew in our continuous variables, the multivariate skew-normal distribution (Azzalini & Valle, 1996) is a viable alternative to the normal distribution. Admittedly, the case of knowing a number of suitable non-normal multivariate distributions and their moment-generating function might not be very common, especially for applied researchers.

However, another important scenario for known alternative distributions is (X = x, K = k)-conditional independence of the covariates urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0064 because then the product of univariate distributions corresponds to the marginal distribution. While the (X = x, K = k)-conditional independence of the covariates might be a strong assumption, it is helpful to see that the joint moment-generating function can be decomposed into the product of the univariate moment-generating functions in this case. For example, if we consider two continuous covariates urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0065 and urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0066 with (X = x, K = k)-conditional independence urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0067, the corresponding density can be decomposed urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0068, and thus the moment-generating function can be written as the product of univariate moment-generating functions,
urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0069
where t1,t2 are the evaluation points of the moment-generating functions. See Kiefer and Mayer (2019) for an overview of univariate moment-generating functions and their performance within the moment-based approach.

4.2 Case 2: Factorization of joint distribution

Sometimes, it might be impossible to find a suitable joint distribution, and the assumption of conditional independence might be too strong. A notable example of this case was chosen for our illustrative example: a continuous latent variable (i.e., depression) and a discrete, yet non-categorical count variable (i.e., baseline count of correctly answered items). To our knowledge, no joint distributions for count and continuous variables have been proposed to date. However, when examining count outcome variables, researchers often wish to account for the respective baseline count as well.

We suggest a practical workaround for these cases: finding a plausible factorization of the joint distribution. For example, this might be decomposing the joint distribution of urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0070 into a marginal and a conditional distribution, that is,
urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0071
where we specify the marginal distribution of urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0072 with index mi identifying a subset of covariates, and the conditional distribution of urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0073 given urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0074 with index ci identifying the remaining covariates in urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0075. In general, this factorization makes it possible to simplify the integration part of the average effect in equation (6) as
urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0076

The factorization approach is challenging to integrate into the moment-based approach, because it does not yield a comprehensive moment-generating function for the factorized joint distribution. Consequently, a combination of moment-generating functions and numerical integration is required for effect estimation. In our illustrative example, we will further delineate these computations for the factorization approach.

In practical terms, the construction of a conditional distribution for some covariates can be achieved using a regression approach, that is, a parameterization of urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0077. In our illustrative example, this relation will be estimated using a regression with a logarithmic link function. From a causal perspective, such a regression can be problematic, because baseline variables do not necessarily have a temporal order, and we do not suggest that one baseline variable causes another baseline variable. Thus, the factorization approach is a technical workaround and its regression parameters are not necessarily of interest.

4.3 Case 3: Approximation with finite Gaussian mixtures

Finally, when no alternative joint distribution or plausible factorization can be found, it is possible to approximate the joint distribution of urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0078. One possible approach is to approximate the non-normal joint distribution of urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0079 by using a finite mixture of M multivariate normal distributions, that is, urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0080 where wm are weights and urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0081. In practice, this approximation of the distribution of urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0082 can be estimated using a finite mixture or latent class approach (McLachlan & Peel, 2000). Researchers can control the degree of approximation by specifying the number M of latent classes, where only the expected values urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0083 and the variance urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0084 are allowed to vary among latent classes. This approach has previously been applied in the computation of average and conditional effects in a nonlinear regression setting by Mayer, Umbach, Flunger, and Kelava (2017) and can be estimated using nonlinear structural equation mixture models (Kelava & Brandt, 2014; Kelava, Nagengast, & Brandt, 2014). However, it is notable that an increasing number of latent classes M can lead to convergence problems in maximum likelihood estimation.

With respect to effect estimation, the finite-mixture approach yields a comprehensive (X = x, K = k)-conditional moment-generating function for urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0085, as we can use a weighted sum of normal moment-generating functions for the integration part from equation (6),
urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0086
which is similar to the original extended moment-based approach by Kiefer and Mayer (2020), with the exception of the summation over the latent classes.

5 Negative binomial multi-group structural equation model

In this section, we introduce the negative binomial multi-group structural equation model (NB-MG-SEM) as a statistical framework for parameter and effect estimation. The NB-MG-SEM provides maximum likelihood estimation. The model involves the count outcome variable Y, the categorical treatment variable X with p levels, the (unfolded) categorical covariate K with j levels, and the vector of continuous covariates urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0087 containing latent covariates measured by observed variables urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0088. The NB-MG-SEM consists of (at least) the following parts:
urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0089(7)
urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0090(8)
urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0091(9)
where v is a vector of measurement intercepts; urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0092 is a matrix of loadings; ɛ is a vector of measurement error variables with mean zero and covariance matrix urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0093; µY is the conditional expectation of the count outcome; αxk is a vector of regression coefficients; and κxk is a parameter for the log-transformed expected group frequency nxk of group (X = x, K = k). The probability for a group (X = x, K = k) can be computed with urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0094
The NB-MG-SEM presented in equations 7-9 is a least common denominator with regard to the cases of non-normally distributed covariates identified in the previous section and can be extended with respect to the given case at hand. In our illustrative example, we will add another regression with a logarithmic link function specifying the relation between two covariates. We provide detailed information on the maximum likelihood estimation of the NB-MG-SEM for our illustrative example in Appendix 2. For brevity, we present the four main parts and assumptions of the likelihood functions constituting the joint distribution of the variables considered:
urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0095
The joint distribution of the covariates urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0096 depends on which of the three aforementioned cases is applied. In our illustrative example, we will factorize the distribution of urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0097 into a marginal and a conditional distribution (i.e., case 2 scenario), where the conditional distribution depends on a urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0098 and the parameters of the marginal distribution are implied by the measurement model. Illustration of the other cases is provided in the online Appendix S1.

6 Illustrative example

To illustrate the use of our extension to the moment-based approach, we use data from the Advanced Cognitive Training for Independent and Vital Elderly (ACTIVE) study (Ball et al., 2002; Jobe et al., 2001; Tennstedt et al., 2005). The ACTIVE study is a large randomized controlled trial designed to examine the effectiveness of cognitive interventions among older adults. We will investigate the effects of these interventions while controlling for the (non-normal) count pre-test score, (latent) baseline depression, and participants' gender. As count outcome Y, we investigate post-test performance on an inductive reasoning assessment (i.e., letter sets). Letter sets evaluate how well an individual can recognize a pattern among several sets of letters. Thus, Y reflects the count of correctly answered items. The ACTIVE data subset analysed is publicly available; a link is provided in the online Appendix S1.

In this paper we do not present a comprehensive analysis of the ACTIVE study. Instead, the primary goal of this paper is to illustrate how non-normally distributed covariates can be integrated into the moment-based approach when estimating average and conditional treatment effects. The NB-MG-SEM and treatment effects were estimated using R (R Core Team, 2018) and the CountEffects package. The CountEffects package is an implementation of the moment-based approach assuming (X = x, K = k)-conditionally multivariate normally distributed covariates, which we extended with functions to estimate all three proposed approaches to deal with non-normal covariates for our illustrative example. In the online Appendix S1, we provide information on how to install CountEffects and how to estimate treatment effects based on the multivariate normal assumption and all three cases. Our findings indicate that the three non-normal cases yield similar results for the illustrative example, while the normal case yields differing estimates for some effects. However, in this section we will only describe the estimation and results of the case 2 scenario. We used listwise deletion in our analyses in order to keep the supplementary R code accessible for interested readers. However, we also ran a full-information maximum likelihood estimation for our model (using Mplus; Muthén & Muthén, 1998–2015) which yielded similar results.

6.1 Sample

The total sample size for our analysis was N = 2,363. The participants were randomly assigned to one of four conditions: a no-contact control group (X = 0, N = 592), a memory training (X = 1, N = 589), a reasoning training (X = 2, N = 590), and a speed of processing training condition (X = 3, N = 606). Each treatment condition consisted of a ten-session training intervention. In a baseline assessment, the participants took several cognitive functioning tests and completed a self-report questionnaire of psychological measures. The post-test assessment was conducted in the first 10 days after the last training session. The participants were predominantly female (75.8%, K = 1).

6.2 Measures

6.2.1 Depression

Depressive symptoms were assessed using the 12-item version of the Center for Epidemiological Studies Depression Scale (CESD-12; Radloff, 1977). Participants rated the frequency of several depressive symptoms (e.g., feeling sad) during the last week on a four-point scale from 0 = never to 3 = 5–7 days. We modelled baseline depression as latent variable urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0099, measured by three parcels (i.e., sum scores of four items each; Z11, Z12, Z13). We used random item parcels for simplicity in our illustrative example. For warnings about the use of parcels, see Marsh, Lüdtke, Nagengast, Morin, and von Davier (2013) and Sterba and Rights (2016).

6.2.2 Letter sets

The count of correctly answered items on a letter sets task was used as outcome Y (i.e., post-test score) and covariate urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0100 (i.e., pre-test score) in our analysis. Letter sets evaluate how well an individual can find rules or patterns that make different sets of letters alike in some way. Each problem had five sets of letters with four letters in each set. Participants were given 15 different problems and had 7 min to complete the task.

6.3 Model

Our effect analysis was based on a multi-group structural equation model with a group-invariant linear measurement model and a structural model with two linear predictors, logarithmic link functions, and a negative binomial distribution with overdispersion parameter ϕ fixed to zero (i.e., a Poisson distribution) for the respective dependent variable.2 The measurement model was chosen to be τ-congeneric, as we had no a priori assumption about the true scores (e.g., τ-equivalence). The first indicator was chosen as reference indicator. The measurement model is expressed as:
urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0101
The latent variable urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0102 was assumed to have an (X = x, K = k)-conditional normal distribution
urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0103
Our structural model with two linear predictors, logarithmic link functions, and Poisson distribution for the respective dependent variable was
urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0104
Note that the second regression urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0105 is specified for technical reasons, that is, to model a factorized joint distribution of urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0106 and urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0107. We do not assume a causal relation, where the values of urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0108 are predetermined by urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0109. Hence, it would also be possible to factorize the joint distribution with urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0110. We chose the first factorization, because it yields some facilitative properties for the numerical integration procedures required in effect estimation (i.e., Gauss–Hermite quadrature; for more information, see Appendix 3).
Finally, the model for group sizes nxk was expressed as
urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0111
The whole model is displayed in Figure 1. Note that regressions with a logarithmic link function are indicated by curved arrows, in contrast to straight lines for linear regressions.
Details are in the caption following the image
Path diagram for our illustrative example depicting the group invariant measurement models for latent baseline depression urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0112 and the baseline count of correctly answered items urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0113, the structural model specifying the regressions with a logarithmic link function (indicated by curved arrows) of the post-test count of correctly answered items Y on the baseline variables urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0114 and urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0115 in each of the eight groups.

6.4 Average and conditional treatment effects

With regard to our three proposed scenarios of non-normal continuous covariates, the specification of a baseline count variable and a continuous variable falls under case 2: as no suitable joint distribution is available, a factorization of the joint distribution into a known marginal and a known conditional distribution can be specified. In the previous section, we stated that we assumed latent baseline depression is (X = x, K = k)-conditionally marginally normally distributed urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0116 and is related to the baseline count of correctly answered items via a Poisson regression with urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0117 with urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0118. In this scenario, the average treatment effect can be computed as
urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0119
where urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0120 and ah, wh are Gauss–Hermite quadrature points and weights. A detailed derivation of this formula is provided in Appendix 3. The moment-generating function for the Poisson distribution is:
urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0121
The conditional effects are then computed analogously to our definitions above. For example, the conditional effect given a value of the gender covariate K is computed as
urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0122
where urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0123 are integration points at which the function is evaluated and ah,wh are Gauss–Hermite quadrature points and weights.

Standard errors for the estimated treatment effects can be derived using the delta method (cf. Boos & Stefanski, 2013, p. 237).3 In our analysis, we computed symmetric confidence intervals based on the standard errors, because the estimated treatment effects are asymptotically normally distributed and our sample was comparatively large. However, in smaller samples confidence intervals based on, for example, a bootstrap approach are recommended.

6.5 Results

It is currently not possible to evaluate the complete model fit using a χ2 test or other fit statistics, as these are based on the χ2 value of the model implied covariance matrix for the observed variables. For models with a logarithmic link function, the implied covariance matrix is not an appropriate description of the dependencies among the variables due to the nonlinearity.

6.5.1 Model parameters

The maximum likelihood estimates for the model parameters are given in Table 1. We will not discuss them in detail here, but rather present some notable findings to illustrate the interpretation of these parameters.

Table 1. Maximum likelihood results for group-specific model parameters
Parameter Estimate SE p Parameter Estimate SE p
X = 0, K = 0 X = 0, K = 0
Estimates for urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0124 Estimates for urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0125
α000 1.324 0.103 <.001 α010 1.218 0.061 <.001
α001 −0.048 0.041 .241 α011 -0.034 0.018 .061
α002 0.099 0.011 <.001 α012 0.103 0.007 <.001
Estimates for urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0126 Estimates for urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0127
γ000 1.970 0.059 <.001 γ010 1.904 0.035 <.001
γ001 0.148 0.049 .002 γ011 -0.109 0.020 <.001
Estimates for urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0128 Estimates for urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0129
µ00 1.113 0.087 <.001 µ01 1.571 0.072 <.001
urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0130 0.792 0.128 <.001 urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0131 1.629 0.149 <.001
X = 1, K = 0 X = 1, K = 1
Estimates for urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0132 Estimates for urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0133
α100 1.141 0.110 <.001 α110 1.256 0.057 <.001
α101 0.016 0.031 .602 α111 -0.037 0.016 .019
α102 0.111 0.013 <.001 α112 0.102 0.007 <.001
Estimates for urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0134 Estimates for urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0135
γ100 1.938 0.053 <.001 γ110 1.890 0.030 <.001
γ101 -0.057 0.034 .094 γ111 -0.077 0.017 <.001
Estimates for urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0136 Estimates for urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0137
µ10 1.296 0.109 <.001 µ11 1.440 0.074 <.001
urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0138 1.298 0.201 <.001 urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0139 1.955 0.176 <.001
X = 2, K = 0 X = 2, K = 1
Estimates for urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0140 Estimates for urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0141
α200 1.335 0.103 <.001 α210 1.340 0.053 <.001
α201 0.004 0.026 .890 α211 -0.013 0.014 .345
α202 0.102 0.011 <.001 α212 0.099 0.006 <.001
Estimates for urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0142 Estimates for urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0143
γ200 2.002 0.052 <.001 γ210 1.785 0.033 <.001
γ201 -0.098 0.031 .001 γ211 -0.024 0.016 .145
Estimates for urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0144 Estimates for urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0145
µ20 1.411 0.127 <.001 µ21 1.641 0.074 <.001
α20 1.879 0.303 <.001 α21 1.879 0.171 <.001
X = 3, K = 0 X = 3, K = 1
Estimates for urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0146 Estimates for urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0147
α300 1.274 0.101 <.001 α310 1.210 0.057 <.001
α301 0.016 0.026 .532 α311 -0.009 0.017 .581
α302 0.095 0.012 <.001 α312 0.107 0.007 <.001
Estimates for urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0148 Estimates for urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0149
γ300 1.936 0.051 <.001 γ310 1.823 0.033 <.001
γ301 -0.081 0.030 .007 γ311 -0.052 0.018 .005
Estimates for urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0150 Estimates for urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0151
µ30 1.416 0.126 <.001 µ31 1.526 0.067 <.001
α30 1.861 0.277 <.001 α31 1.515 0.137 <.001

Pre-test depression had no significant effect on the post-test count of correctly answered items in either group (e.g., α001 = 0.048, 95% confidence interval (CI) [−0.129, 0.033]), except for female participants receiving the memory training (α111 = −0.037, 95% CI [−0.069, 0.006]). Furthermore, pre-test depression was significantly negatively related to the pre-test count of correctly answered items in most groups (e.g., γ011 = −0.109, 95% CI [−0.148, 0.070], γ201 = −0.098, 95% CI [−0.758, 0.038], and γ301 = −0.081, 95% CI [−0.139, 0.022]). For example, for male participants in the reasoning training, each one-unit change in pre-test depression was linked to a 9% decrease in correctly answered items at baseline (i.e., γ201, exp(−0.098) = −0.91). In two groups, this relationship was not significant (γ101 = −0.057, 95% CI [−0.123, 0.010] and γ211 = −0.024, 95% CI [−0.056, 0.008]).

Pre-test count of correctly answered items was a significant positive predictor of post-test count of correctly answered items in all groups. For example, for female participants in the memory group, each additional correctly answered item at baseline was linked with an 11% increase in correctly answered items at post-test (i.e., α112, exp(0.102) = 1.11). The intercept coefficients reflect the expected post-test count of correctly answered items for a male person with depression score (urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0152) and zero correctly answered items at baseline (urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0153). For example, in the reasoning training (i.e., α200 = 1.335), the expected count is urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0154.

6.5.2 Average effects

An overview of all average and conditional treatment effects estimated for our illustrative example is given in Table 2. Remember that the letter sets test was part of a cognitive assessment measuring reasoning performance. Hence, we would expect the reasoning training in particular to have a significant effect, while the memory and speed of processing training might not necessarily affect reasoning performance. In line with this, the average treatment effect of the memory training (urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0155, 95% CI [−0.203, 0.372], ES = 0.030) and the average treatment effect of the speed of processing training (urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0156, 95% CI [−0.052, 0.525], ES = 0.083) on the count of correctly answered items were not significant. The average treatment effect of the reasoning training, however, was significant (urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0157, 95% CI [0.487, 1.080], ES = 0.275), that is, participants in this condition correctly answered 0.849 items more on average compared to baseline.

Table 2. Estimated average and conditional effects of memory training (X = 1 versus X = 0), reasoning training (X = 2 versus X = 0), and speed of processing training (X = 3 versus X = 0) compared to the control group
Effect Estimate SE ES Effect Estimate SE ES
Estimated average and conditional effects of memory training
Average effect urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0158 Conditional effects urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0159
urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0160 0.084 0.147 0.030 urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0161 −0.238 0.302 −0.084
Conditional effects urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0162 urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0163 0.161 0.166 0.056
urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0164 0.053 0.147 0.019 urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0165 −0.130 0.314 −0.046
urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0166 0.094 0.151 0.033 urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0167 0.167 0.171 0.059
urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0168 0.093 0.147 0.033 urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0169 −0.109 0.321 −0.038
urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0170 0.095 0.147 0.034 urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0171 0.157 0.167 0.055
Conditional effects urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0172 urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0173 −0.119 0.309 −0.042
urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0174 −0.152 0.305 −0.053 urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0175 0.160 0.167 0.056
urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0176 0.161 0.167 0.057
Estimated average and conditional effects of reasoning training
Average effect urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0177 Conditional effects urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0178
urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0179 0.784 0.151 0.275 urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0180 0.598 0.317 0.210
Conditional effects urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0181 urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0182 0.814 0.171 0.286
urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0183 0.756 0.152 0.266 urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0184 0.697 0.325 0.245
urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0185 0.786 0.156 0.276 urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0186 0.814 0.178 0.286
urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0187 0.797 0.152 0.280 urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0188 0.715 0.329 0.251
urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0189 0.789 0.151 0.277 urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0190 0.831 0.171 0.292
Conditional effects urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0191 urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0192 0.697 0.319 0.245
urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0193 0.674 0.318 0.237 urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0194 0.817 0.171 0.287
urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0195 0.819 0.172 0.288
Estimated average and conditional effects of speed of processing training
Average effect urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0196 Conditional effects urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0197
urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0198 0.237 0.147 0.083 urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0199 −0.102 0.308 −0.036
Conditional effects urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0200 urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0201 0.321 0.166 0.113
urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0202 0.207 0.147 0.073 urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0203 −0.025 0.320 −0.009
urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0204 0.234 0.152 0.082 urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0205 0.317 0.172 0.111
urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0206 0.254 0.148 0.089 urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0207 −0.015 0.326 −0.005
urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0208 0.250 0.147 0.088 urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0209 0.340 0.167 0.120
Conditional effects urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0210 urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0211 0.011 0.314 0.004
urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0212 −0.035 0.310 −0.012 urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0213 0.322 0.166 0.113
urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0214 0.325 0.167 0.114

6.5.3 Conditional effects given (K = k)

The conditional treatment effects given a gender are given in detail in Table 2. Here, we examine differential treatment effects depending on the participants' gender.

For the memory training, we found a slightly negative treatment effect for male participants (urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0215, 95% CI [−0.750, 0.446], ES = −0.053) and a positive treatment effect for female participants K = 1 (urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0216, 95% CI [−0.167, 0.489], ES = 0.057). While the effect sizes differed in direction and magnitude, both effects were not statistically significant.

The reasoning training had significant effects for both men and women. The conditional effect for female participants (urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0217, 95% CI [0.482, 1.156], ES = 0.288) was higher than the conditional effect for male participants (urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0218, 95% CI [0.052, 1.297], ES = 0.237).

While the average treatment effect of the speed of processing training was not significant, we found a slightly non-significant conditional effect for female participants (urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0219, 95% CI [−0.002, 0.652], ES = 0.114). Conversely, for male participants, the speed of processing training did not yield a significant effect (urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0220, 95% CI [−0.750, 0.446], ES = −0.053). This differential effect illustrates why examining conditional effects can be crucial when evaluating a treatment.

6.5.4 Conditional effects given (X = x) and (X = x, K = k)

The conditional treatment effects given a treatment condition (and gender) are given in detail in Table 2. As the ACTIVE study was a randomized controlled trial, the (X = x, K = k)-conditional distributions of depression urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0221 and baseline test scores urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0222 should not differ across levels of X. Thus, the conditional effects given a treatment condition are expected to be close to the corresponding average treatment effects. The same applies to conditional effects given treatment and gender. We will not discuss all of these effects in detail here, but rather illustrate their interpretation using one example.

For example, the conditional treatment effect of the memory training for the non-treated (urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0223, 95% CI [−0.234, 0.340], ES = 0.019) is similar to the average effect of the memory training. This refers to the expected effect if participants who were assigned to the no-contact control group had instead been assigned to the memory training. These effects being close to the average treatment effect urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0224 reflects that differences in baseline variables are small across groups. Note that for observational studies or broken experiments, these effects would not be necessarily close to each other due to possible baseline differences.

7 Summary and conclusions

In this paper we presented and illustrated a new method of accounting for non-normal covariates when estimating average and conditional treatment effects for count outcomes. We extended the moment-based approach by Kiefer and Mayer (2019, 2020) in three respects. First, we presented four ways to account for non-normal covariates: (1) for categorical covariates, applying a saturated model; for continuous covariates, we suggested either (2) the use of an alternative, known non-normal joint distribution (e.g., skew-normal distribution), (3) a plausible factorization into marginal and conditional distributions, or (4) approximation via a finite Gaussian mixture distribution. Second, we extended the effect analysis to multiple treatment conditions, making it possible to evaluate the effectiveness of several treatments simultaneously. Third, we introduced conditional effects given a treatment condition and/or values of the categorical covariates into the moment-based approach, allowing for a finer-grained effect analysis. Finally, we provided an illustrative example to show how our extensions of the moment-based approach can be applied to real data. In our example, three cognitive training conditions were compared to a no-contact control group regarding the count of correctly answered items in a cognitive reasoning test. We considered as covariates the baseline count of correctly answered items (non-normal covariate), baseline depression (latent covariate), and gender (categorical covariate). The corresponding negative binomial multi-group structural equation model and the average and conditional treatment effect estimations were carried out in R. We have made these functions conveniently accessible for applied researchers in an R package.

The aforementioned advancements bring the benefits of the moment-based approach for statistical inference to a broader range of applied scenarios in psychological, social, and health sciences. In contrast to earlier approaches, the moment-based approach treats observed group sizes and covariate values as random and not predetermined by the experimenter. Ignoring this randomness would lead to underestimation of standard errors and, thus, inflated Type I error rates and decreased power, which has previously been shown for stochastic covariates (Li, McLouth, & Delaney, 2020; Liu, West, Levy, & Aiken, 2017) and stochastic group sizes (Mayer & Thoemmes, 2019). In addition, the moment-based approach allows accounting for measurement error in covariates. For an overview of consequences of measurement error in nonlinear regression models, see Carroll, Ruppert, Stefanski, and Crainiceanu (2010). However, in the social and health sciences, observed variables in real data sets often deviate from the normal distribution (Bono et al., 2017; Micceri, 1989). While Kiefer and Mayer (2020) proposed a moment-based approach assuming strictly multivariate normally distributed covariates, we relaxed this assumption in this paper, offering several alternatives for non-normally distributed variables. Thus, the moment-based approach with our extension should better fit typical applications in social and health sciences.

7.1 Limitations and further research

There are some aspects not covered in this paper that could provide starting points for further research and refinements of the approach:

In equation (3), we assumed a linear parameterization for the predictor function urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0225. While this is the standard parameterization of, for example, generalized linear models (McCullagh & Nelder, 1996) and for effect analysis with linear link function (e.g., Mayer et al., 2016), sometimes nonlinear predictor functions might be of substantial interest. For example, Mayer et al. (2017) investigated the effectiveness of autonomy support by ninth-grade teachers in reducing students' state of boredom. They hypothesized a quadratic relationship between boredom and self-efficacy, which means that the treatment could be most effective for medium values of self-efficacy and less for both extremes. Similarly, Liu and West (2015) use trigonometric terms to predict cyclic patterns in daily report (count) data. The inclusion of quadratic or other nonlinear terms might also be of interest for count outcomes and regression models with a logarithmic link function. We presume that the moment-based approach would not yield analytical effect formulas for nonlinear predictor functions, but effect computation using numerical integration would be an alternative.

Throughout the paper, we factorized the joint distribution of the covariates into a marginal distribution of the categorical variables (i.e., treatment X and covariates K) and a (X = x, K = k)-conditional distribution of the continuous covariates. Our choice can seem restrictive, as alternative factorizations might be more suitable for some cases. For example, Kiefer and Mayer (2019) used a converse factorization, that is, a marginal distributed continuous covariate Z and a (Z = z)-conditional probability of the binary treatment X. However, we would argue that for a given joint distribution multiple equivalent factorizations exist. In the aforementioned example, the joint distribution can be equivalently rewritten in our factorization, that is, a marginal distribution of the binary treatment and an (X = x)-conditional distribution of Z (for a derivation, see Appendix 1).

We used maximum likelihood estimation to obtain parameter estimates for the negative binomial multi-group structural equation model, but in applied settings alternative statistical frameworks might be more suitable. Maximum likelihood estimation can suffer from non-convergence issues in complex models (Deng, Yang, & Marcoulides, 2018). We also exhibited this phenomenon in our illustrative example, where we had to exclude overdispersion parameters from the model in order to achieve convergence. Thus, we recommend three solutions to address non-convergence issues. First, a pragmatic workaround would be to specify a more parsimonious model, because non-convergence might be a consequence of overparameterization (Jackson, 2001, 2003). Second, a Bayesian structural equation modelling framework can be applied. The Bayesian approach can typically handle complex models much easier than maximum likelihood estimation (Merkle & Rosseel, 2015). For an overview of Bayesian estimation of structural equation models based on finite-mixture distributions or distributions based on the exponential family, see Lee (2007, Chapters 11 and 13). Third, recent factor score approaches (e.g., Devlieger & Rosseel, 2017) might also help achieving convergence by simplifying the measurement model and, thereby, the numerical integration part of the maximum likelihood estimation. However, if the factor scores depend on a distributional assumption, it would be wise to use the same distributional assumption within the moment-based approach.

In our illustrative example, we examined the effectiveness of cognitive training using a pre–post design. The ACTIVE study also contains several follow-up measurements allowing long-term effectiveness of the cognitive training to be investigated. In future developments, it might be fruitful to incorporate this longitudinal information into effect analyses. We suggest two ways for doing so. First, when examining long-term effects using a follow-up measurement, earlier measurements (e.g., post-test) could be included as mediators. In this scenario, total, direct, and indirect effects of the treatment can be distinguished (Mayer, Thoemmes, Rose, Steyer, & West, 2014). This can help researchers understand, for example, which post-test characteristics foster a long-term effect of the cognitive training. Second, several measurement occasions can be summarized or contrasted, for example, using a growth curve model (McArdle & Epstein, 1987) or growth component model (Kiefer, Rosseel, Wiese, & Mayer, 2018; Mayer, Steyer, & Mueller, 2012) among follow-up measurements. In this case, the outcome of interest would be a latent slope or growth component instead of the count outcome.

In non-randomized observational studies the treatment effects discussed in this paper are not necessarily causal effects. However, causality theories, such as Rubin's causal model (Rubin, 2005) or the stochastic theory of causal effects (Steyer et al., 2014) provide causality conditions under which causal effects can be estimated. Usually, these conditions require a careful selection of covariates, which then are controlled for with regression adjustment (e.g., Mayer, 2019; Mayer et al., 2016) or propensity score methods (Rosenbaum & Rubin, 1983). For an overview of design and analysis in quasi-experimental settings, see also Reichhardt (2019).

Appendix 1

Factorization of the joint distribution

In equations (4) and (5), we claimed that our factorization of the joint distribution was without loss of generality. This can be shown using two properties of distribution functions. Let us consider two random variables A and B with joint distribution fA,B(a,b). Then
urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0226
If the marginal distribution fB(b) and the conditional distribution fA|B=b(a) are known, the corresponding distributions fA(a) and fB|A=a(b) can directly be computed with
urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0227

Consequently, if one version or factorization of the joint distribution is known, equivalent factorizations for the same joint distribution can be derived.

Let us now consider a simple example of the moment-based approach, with a single covariate Z and a binary treatment variable X. In contrast to our factorization in equations (4) and (5), we now consider
urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0228
as the known factorization of the joint distribution, where urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0229 and the conditional probability of X = x is given by
urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0230
that is, as a logistic function of the covariate Z. This kind of joint distribution can be useful if self-selection of participants into treatment X = 1 or control X = 0 depending on the covariate Z is assumed.
Nevertheless, our factorization from equation (5) can be obtained from these known marginal and conditional distributions by computing
urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0231

Note that these transformations are generally required for effect computation only. The maximum likelihood estimation would, nevertheless, be based on fZ(Z) and P(X = x|Z = z) in this example. In contrast, when trying to estimate the likelihood function based on P(X = x) and fZ|X x, it might be tempting to estimate P(X = x) directly instead of urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0232. However, introducing a parameterization for P(X = x) would lead to redundancies, because P(X = x) is already determined by µZ, urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0233,urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0234, and urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0235. Consequently, additional estimation of P(X = x) would lead to a non-invertible covariance matrix of the parameter estimates and, thus, standard errors could not be computed. The redundant parameter would remain at its starting value, and thus would not yield a trustworthy estimate. The likelihood function should only contain parameters µZ, urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0236,urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0237, and urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0238. Corresponding parameters from our factorization and their standard errors can then be computed for effect estimation using the delta method.

Appendix 2

Model estimation

In this section we present details on the maximum likelihood estimation of the NB-MG-SEM for our illustrative example. Let us consider a count outcome variable Y, a treatment variable X with four levels, a binary categorical covariate K, and a vector of continuous covariates urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0239 containing a latent covariate urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0240 measured by three observed indicator variables z = (Z1,Z2,Z3), and a count covariate urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0241. We want to estimate a statistical model as presented in equations 7-9 using maximum likelihood techniques. Assuming a sample of N independently and identically distributed observations, the complete-data likelihood is given by
urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0242
where yi, xi, ki, zi urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0243 are the ith observed values of the aforementioned variables. As we do not have observed values urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0244 i.e., latent covariate), we actually have to optimize the observed-data likelihood:
urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0245

Note that marginalizing over the joint density of the complete data is only one way to optimize this likelihood. It would also be possible to use an EM algorithm (Dempster, Laird, & Rubin, 1977) to solve the complete data likelihood, but this is beyond the scope of this paper.

The joint density urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0246can be factorized into five conditional densities:
urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0247
urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0248
urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0249
urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0250
urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0251
In the last step, we decomposed the parameter vector urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0252 into four parts corresponding to the four conditional densities. The parameters to estimate in the respective densities are:
urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0253
urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0254
urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0255
urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0256
urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0257
In the NB-MG-SEM, the conditional densities are defined by the model assumptions. Here, urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0258reflects the model for group sizes urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0259, which are assumed to be Poisson distributed. The group-specific Poisson regression of urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0260 on urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0261 is given by urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0262, the group-specific and urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0263-conditional distribution of the observed indicators urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0264 is given by urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0265, the group-specific Poisson regression of urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0266 on urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0267 is given by urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0268, and the group-specific marginal distribution of the latent variable urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0269 is given by urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0270. Thus, the conditional densities in our case are
urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0271
urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0272
urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0273
urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0274
urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0275
As the group weights do not depend on the latent covariate urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0276, we can simplify the observed-data density as follows:
urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0277
Hence, the corresponding log-likelihood is
urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0278
As there is no closed-form solution for urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0279, numerical integration is required. In our implementation, we use Gauss–Hermite quadrature which approximates the integral over urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0280with a finite sum
urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0281
where urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0282 and urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0283 are Gauss–Hermite quadrature points and weights.

Appendix 3

Effect estimation

In this section we provide details on the derivation of the average and conditional effect formulas used in our illustrative example. Let us consider a count outcome variable urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0284, a treatment variable urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0285 with four levels, a binary categorical covariate urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0286, and a vector of continuous covariates urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0287 containing a latent covariate urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0288 measured by three observed indicator variables urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0289, and a count covariate urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0290. In the model section, we already stated that we assume latent baseline depression is urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0291-conditionally marginally normally distributed urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0292 and is related to the baseline count of correctly answered items via a Poisson regression with urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0293 where urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0294.

Beginning with the general definition of the average treatment effect for these variables,
urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0295
the integration part can be rewritten as a difference of integrals,
urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0296
where the decomposition of the joint distribution of urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0297 and urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0298 into a marginal and a conditional distribution enables the formation of an inner and an outer integral,
urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0299
The inner integral resembles a univariate (conditional) expectation and can therefore be substituted with a univariate moment-generating function:
urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0300
Consequently, the computation of the average treatment effect simplifies to
urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0301
and as both outer integrals have the same domain, we can go back to a single integral,
urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0302
where we integrate over the density of the urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0303-conditionally normally distributed urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0304, which allows us to approximate the integral using a Gauss–Hermite quadrature
urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0305
where urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0306 and urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0307 are Gauss–Hermite quadrature points and weights.

  • 1 We intentionally use two different indices to refer to the values of the treatment variable urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0001, namely urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0002 and urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0003. These will fulfil two distinct purposes later on, that is, we use urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0004 to denote the treatment group for which an average or conditional effect (compared to the control group) is computed, while we use urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0005 for marginalizing (i.e., summing) over urn:x-wiley:00071102:media:bmsp12237:bmsp12237-math-0006-conditional distributions. For example, in equation (6) both indices appear and help to distinguish these two aspects.
  • 2 In our analysis, we fixed the overdispersion parameter ϕ = 0 for technical reasons, as the log-likelihood estimation did not converge for non-zero overdispersion parameters. We discuss the issue of estimation difficulties at the end of the paper.
  • 3 For a historical side note on the question of who invented the delta method, see Ver Hoef (2012).