Although broad patterns of psychological distress across adulthood were similar between methods, the equipercentile-based scale-equating approaches (equipercentile linking and the calibrated cutoff) yielded lower means and standard deviations across the life course than the multiple-imputation approach. Although this held true across both index measures, differences appeared to be larger for the GHQ-12 than for the Malaise-9. Cross-cohort comparisons were more susceptible to methodological effects than life-course patterns were. In general, using the GHQ-12 as an index measure yielded higher prevalence estimates than using the Malaise-9. Sensitivity analyses that used study sweeps with both index measures suggested that a multiple-imputation approach led to more accurate mean and prevalence estimates, whereas the equipercentile approaches yielded underestimates or overestimates.
Comparison with existing literature
The previously reported (
Bell, 2014;
Gondek et al., 2020;
Spiers et al., 2012) inverse U shape across adulthood was observed in the 1946 birth cohort for calibration against the GHQ-12 and Malaise-9 for the calibrated-cutoff and equipercentile-linking methods, although for both index measures, the multiple-imputation method showed a gradual decline in prevalence of psychological distress across the life course. This pattern was less clearly observed in the 1958 and 1970 cohorts across both index measures and calibration methods used, likely because these cohorts are still in middle age and prevalence of psychological distress has started to decline only marginally.
As in previous research comparing the age of 42 years across the 1958 and 1970 cohorts (
Gondek et al., 2020;
Ploubidis et al., 2017), we found that according to most methods and measures, the 1970 cohort has a higher prevalence of psychological distress at most ages compared with the 1958 cohort. However, we can draw no clear conclusions about any trends when we also included the 1946 cohort. In the current study, we saw lower prevalence in the 1946 cohort when the GHQ-12 was used as an index measure but higher prevalence of psychological distress in this cohort when the Malaise-9 was used as an index measure. Previous studies in North America have found that some older and more recent cohorts have higher distress;
Keyes et al. (2014), for example, demonstrated a U shape in between-cohorts effects. Other UK-based studies have observed lower prevalence in more recent cohorts (
Thomson & Katikireddi, 2018). Note that the birth cohorts included in the present study were born across just 24 years in mid-20th-century Britain, and hence we cannot extrapolate findings to recent cohorts in which higher distress is increasingly reported. Our sensitivity analysis that used cohort sweeps in which both GHQ-12 and Malaise-9 were administered could not provide insight into why clear conclusions about cross-cohort trends could not be drawn, although it does suggest that multiple imputation might be more reliable than equipercentile linking. However, even when we focus only on the multiple-imputation findings, we see prevalence of psychological distress increasing in more recent cohorts when the GHQ-12 is used as the index measure yet lowest distress in the 1958 cohort when the Malaise-9 is used.
Strengths and limitations
These results should be interpreted in light of the strengths and limitations inherent to this study. We used three nationally representative birth cohorts, and our calibration sample had sufficient coverage across the whole distribution of mental health and was broadly representative of the current general population of the United Kingdom in terms of age, sex, level of education, and country of residence. Our study was methodologically robust and was designed to allow assessment of reliability across methods: We used two different index measures and scale-equating methods, which enabled us to describe differences attributable to the choice of measure and method. This is in contrast to previous literature that applied these methods but used only one method and measure (
Choi et al., 2014;
Fischer et al., 2011,
2012;
Sellers et al., 2019), which resulted in limited evaluation of the reliability of any findings based on these approaches. Using an IRT calibration model as a further methodology was unsuitable because its key assumptions could not be met (
Hambleton et al., 1991): The two questionnaires used as index measures do not have a common set of anchor items, and the remaining items do not have identical response options.
However, there are also some limitations inherent to our study design. Because we used data only from the United Kingdom, we are uncertain about the generalizability of our results to an international context. Mode of questionnaire administration differed between our calibration sample (all self-reported online) and the cohort samples (either self-report via a paper questionnaire or interviewer administered), and this might have led to higher reporting of mental-health symptoms in the calibration sample (
Epstein et al., 2001). Finally, we used responses collected today to equate responses given at earlier points in time, as far back as the early 1980s. However, there appears to be no evidence that the within-individual or cross-cohort interpretation of the Malaise-9 changes over time (
Ploubidis et al., 2019). In addition, measurement-invariance analyses (see
Table S3 in the Supplemental Material) indicate that younger and older respondents today answer the measures used in this study similarly, which increases confidence in the longitudinal comparisons made.
Interpretation
Although life-course patterns of psychological distress were similar across both index measures and scale-equating methods, point estimates were not. Comparing methodologies, we found that the imputation method yielded higher means and standard deviations than the equipercentile-linking method, and sensitivity analyses indicate the former might be the less biased approach in this scenario. Although means are not directly comparable across index measures because of different scale ranges, prevalence estimates were higher using the GHQ-12.
These differences have little bearing on the longitudinal symptom profile (we confirmed an inverted U shape over the life course). There are two hypothetical explanations for this inverted U shape: It might be artifactual because the instruments we used are poor at capturing important aspects of mental health in later life, or it might reflect genuinely better mental health in later life (either through a reduced perception of pressure through socioemotional selectivity or through eudaemonic processes) after a period of greater stress in midlife (which potentially reflects the multiple stressors of child care, career pressures, and caring for elderly parents that many people face;
Willis et al., 2010). However, these differences do have substantial implications for cross-cohort comparisons. For instance, examining means derived through the equipercentile method calibrated against the Malaise-9 (
Fig. 1), we found that means were higher in the 1958 and 1970 cohorts and that any midlife peak was earlier in these cohorts. However, when we used the same index measure and applied our multiple-imputation-based approach, the mean was highest in the 1946 birth cohort, and there was no discernible midlife peak in the other two cohorts. This method and measure dependency leaves us unable to draw strong conclusions about whether more recent generations experience poorer mental health.
Compared with most of the other measures that we calibrated, the Malaise-9 has less variance (given the range from 0 to 9 compared with, e.g., the GHQ-12’s range of 0–36). We speculate that some of the discrepancies in the results we observe between these different measures might be due to differences in their scales (e.g., a score between 8 and 10 on the GHQ-12 gives a score of 1 on the Malaise-9; see
Table S7 in the Supplemental Material). If this is indeed an important factor, then equating scales between measures with similar ranges and variances is more likely to yield reliable estimates than equating between measures with vastly different variances. This might also explain why the multiple-imputation approach appears to be more reliable than the other two approaches: It does not impose substantially larger or smaller variance. An important implication of our findings is the need for formal statistical simulation studies to investigate the conditions under which different calibration and/or harmonization methods return unbiased results, especially if, as we have shown, bias occurs when the measures being calibrated differ greatly in range and standard deviation.
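To illustrate the range-mismatch argument, the following is a minimal sketch of equipercentile linking between a long scale (0-36, matching the GHQ-12 range cited above) and a short scale (0-9, matching the Malaise-9 range). It is not the procedure or data used in this study: the scores are synthetic binomial draws, the function names are hypothetical, and the 0.5 added to each frequency is an assumed smoothing step that keeps the percentile ranks strictly increasing.

```python
import numpy as np


def mid_percentile_ranks(scores, max_score):
    """Mid-percentile rank (0-100) of every integer score from 0 to max_score."""
    # Assumption: add 0.5 to each count so no score has zero frequency,
    # which keeps the percentile ranks strictly increasing.
    counts = np.bincount(scores, minlength=max_score + 1) + 0.5
    props = counts / counts.sum()
    return 100.0 * (np.cumsum(props) - props / 2.0)


def equipercentile_table(x_scores, x_max, y_scores, y_max):
    """Map each score on scale X to the scale-Y score at the same percentile rank."""
    pr_x = mid_percentile_ranks(x_scores, x_max)
    pr_y = mid_percentile_ranks(y_scores, y_max)
    # Invert the percentile-rank function of Y by linear interpolation,
    # then round to the nearest integer score on Y.
    linked = np.interp(pr_x, pr_y, np.arange(y_max + 1))
    return np.rint(linked).astype(int)


# Synthetic illustration only; these are not GHQ-12 or Malaise-9 responses.
rng = np.random.default_rng(0)
long_scale = rng.binomial(36, 0.30, size=5000)   # scores on a 0-36 scale
short_scale = rng.binomial(9, 0.20, size=5000)   # scores on a 0-9 scale

table = equipercentile_table(long_scale, 36, short_scale, 9)
for x, y in enumerate(table):
    print(f"long-scale score {x:2d} -> short-scale score {y}")
```

Because 37 score points on the long scale must map onto at most 10 points on the short scale, several adjacent long-scale scores necessarily share a single linked value. This is the kind of compression described above for the GHQ-12 and Malaise-9.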
This methodological and measure-based heterogeneity in point estimates across the life course and between cohorts calls into question the robustness of articles that used only one index measure to estimate mental-health outcomes. We recommend that, where possible, researchers who use scale-equating approaches use two index measures with similar ranges. On the basis of our analyses, it would seem that multiple-imputation-based scale-equating methods are better suited to scenarios such as this, in which an external calibration sample is used.
The potential inadequacy of these scale-equating approaches in translating findings from one measure to another has considerable implications regarding recent announcements to mandate the use of certain measures in all research that pertains to certain common mental-health symptoms (
Farber et al., 2020). The arguments in support of this include that these measures could act as a common language, via scale-equating approaches, allowing comparable conclusions to be drawn across a range of studies in various settings, including population studies, trials, and routine clinical monitoring. The findings here highlight that these approaches are problematic even for robust population-level inferences, and individual-level inferences (in which a score of
X on this measure means a likely score of
Y on the other measure for each individual) are likely to be even more error prone (
Piantadosi et al., 1988).