How deadly is COVID-19? Data Science offers answers from Italy mortality data.

A time-series analysis of Italy’s excess mortality

--

This post is part of a series describing work by us at Berkeley Center for Cosmological Physics (BCCP, UC Berkeley) on determining the impact on COVID-19 with data science methods. In this post, we focus on the results of a time series analysis of Italian mortality data. Our analysis suggests that the total number of deaths has been underestimated by up to a factor of 2. An accompanying post explains our methodology. The submitted medRxiv version of our paper (under peer-review) is here. Our data, analysis-code and updated version of the paper is on our github. Github post will be continuously updated as new data is available.

Despite the large number of infected people and confirmed deaths, estimating the spread and lethality of COVID-19 is still plagued with uncertainties. This is primarily due to the lack of exhaustive testing. Given the shortage of tests, a common strategy is to only test people with severe symptoms. Such selective testing under-estimates the spread of the infection. As a result, simply combining the ratio of positive tests with the number of reported deaths to estimate the case fatality rates (CFR) of the disease can result in vastly varying numbers for different regions. For instance, some countries report a CFR of less than a 1% (e.g. Israel and Germany), while others go as high as 10% (Italy). To correct for this, a lot of discussion has focused on estimating the correct number of infected cases in different communities, while the number of reported deaths itself has been assumed to be accurate.

However there is increasing evidence that the reported COVID-19 fatalities are also severely under-estimated: a comparison with the historical mortality data shows a significant excess over the expected number of deaths around this time of the year. This has been observed for many countries and regions around the world (Italy, US, UK, Switzerland, Netherlands, and others). This excess seems to coincide with the pandemic, but the number of deaths officially attributed to COVID-19 do not completely account for this excess¹. (To keep our discussion succinct, we list additional relevant discussion and assumptions we have made in the footnotes marked with superscript at appropriate places). Figure 1 shows this for Lombardia in Italy, which is one of the hardest hit regions in the world.

Figure 1: Mortality due to all causes in Lombardia for the first 12 weeks of the year. Compared to 2015–2019 data (dotted lines), there an obvious excess after February 22 when Italy reported its first death due to COVID-19 (CV-19). This excess is not completely accounted for by the reported COVID-19 deaths. An important thing to note is that the historical mean is consistently higher than the pre-pandemic mortality in 2020, suggesting a milder than average flu season². We propose two different techniques, CGP and SCM, to construct a prediction for this year’s mortality after Feb 22 that take this in formation into account. Our proposed methods fit the observed mortality in this period well.

Given these uncertainties in both, the number of reported infections and deaths, accurately quantifying the impact of about COVID-19 has been challenging. In this post, we attempt to do this by addressing the following questions -

  1. What has been the total death toll of COVID-19?
  2. How can we determine the lethality of COVID-19 for people of different ages, i.e. the infection fatality ratios (IFR) for different age groups?
  3. How can we estimate the spread of the infection (infection ratio, IR) in different communities?

Our Analysis

To answer these questions, we look towards Italy and while we consider all the 20 regions independently, we will focus primarily on the region of Lombardy and province of Bergamo since they are the worst affected of all. Italy reported its first death due to COVID-19 on February 22nd. We use the total mortality data due to all causes provided by the National Statistical Institute of Italy (Istat) for the period of January 1st to March 28th, from 2015–2020. It is a complementary dataset and independent of the official COVID-19 reported deaths, as well as less affected by unknowns such as the completeness of tests. However it is not 100% complete and is not a random sample of towns. We make some assumptions to account for this³.

We use two different models —conditional Gaussian Process (CGP) and Synthetic Controls Method (SCM) — with the historical data and predict the expected deaths in the absence of the pandemic. An important advantage of these methods over simple mean estimate is that we match the 2020 observed mortality in every region before the COVID-19 pandemic and this accounts for other confounding and time-varying factors such as a milder flu-season. With these predictions, we can estimate the excess deaths in 2020 as well as uncertainty on it and compare them with COVID-19 reported deaths. Details on our methods and calculation are in the accompanying methodology post. This allows us to answer the following questions.

Results

What has been the total death toll of COVID-19 in Italy?

  • We find that the total number of deaths in Italy is 52,000 ± 2000 as of April 18, 2020 (Figure 2).
    This more than a factor of 2 higher than the official numbers of 23,000. While the official statistics are lower by about 80% in the regions with the most official COVID-19 cases, we find concerning evidence that regions with a much lower official count like Sardegna and Tuscany might have underestimated their death toll by up to a factor of 4 (Figure 2).
Figure 2: Excess mortality for different regions in Italy compared to reported COVID-19 deaths. We show cumulative excess deaths, over the predicted counterfactual deaths from two different methods — SCM and CGP. We compare this with the reported COVID-19 deaths for the period since February 23rd (available COVID-19 data). We find that our estimated deaths exceeds COVID-19 reported deaths by multiple factors for every period and every region, even those which officially do not report huge death toll. We extrapolate the our predicted excess beyond March 28th with dashed-lines under conservative assumptions.
  • There seems to be good agreement in our estimated mortality and the official COVID-19 statistics for ages under 60 years, but a large excess over reported deaths that increases with age above 70 years (Figure 3).
    This suggests that there is a large population of predominantly old people missing from official fatality statistics. This age-distribution can be explained with a triage scenario which focuses its limited testing capacity on the young population. It could also be attributed to COVID-19 deaths of older people that passed away at their homes or nursing homes, possibly very rapidly, without ever seeking a test. The fact that regions with an officially low COVID-19 mortality, also show a significant discrepancy, seems to point towards the latter scenario.
Figure 3: Age distribution of excess mortality. We find good agreement in our estimated mortality and the official COVID-19 statistics for ages under 60 years, but a large excess that increases with age above 70 years. We find similar trend across all regions in Italy, irrespective of total COVID-19 reported deaths. To get the age distribution of COVID-19 deaths, we have assumed that the regional age-distribution follows the national reported distribution.
  • The proportion of deaths has reached 0.22% of the population in Lombardia and 0.57% in Bergamo.
    More importantly, as expected, there is a steep age-dependence in the fatalities: in Lombardia (Bergamo province), 0.6% (1.7%) of the population in age-group 70–79 has died; 1.6% (4.6%) in age group 80–89, and 3.41% (10.2%) people above 90 years old.

How lethal is COVID-19 for people in different age groups?

  • We estimate lower bounds on the age-dependent Infection Fatality Ratio (IFR). The lower limit on population IFR from Bergamo is 0.57% and 0.87% from Lombardia.
    To estimate this lower bound on IFR, we use the Total Positive Ratio (TPR) i.e. the reported fraction of positive cases to total tests. This is commonly considered as an upper bound on IR. For Bergamo, we make a more conservative assumption of IR = 1. Hence this is a lower-limit on this bound of IFR. We again observe a severe age-dependence on the lower bound (Table 1).
Figure 4 : Population fraction of fatalities and lower-bound on IFR estimated for different regions. To estimate the lower bound on IFR, we assume that the reported ratio of positive cases and total tests in COVID-19 statistics is an upper-limit on the infection fraction of the region (except for Bergamo, where we assume IR=1).
  • We also estimate age-dependent IFR by combining our estimated deaths with the Princess Diamond cruise ship IFR for ages above 70 (Table 1). These estimates are consistent with our lower-limits.
    For every region, we match the DP IFR for ages above 70 after adjusting for the age-distribution. This allows us to estimate IR for this age group and under the assumption of age-independent IR, we can estimate IFR for other age groups(Table 1).
Table 1 : Age dependent IFR (in %) from Lombardia data. We quote the lower limit by assuming IR = total positive rate in COVID testing. For the estimate from Diamond Princess, we match the IFR in age group 70–89 with DP estimate after matching age distribution. The Poisson error quoted here is the majority of our error budget. For lower bound, we give 1-sigma limit (68% c.l.) and for DP estimate, we give 95% c.l..

How much has the infection spread?

  • We also estimate the IR by combining our estimated deaths with the Princess Diamond cruise ship IFR for ages above 70. We find that the infection in Lombardia has reached 23% (12%-41%, 95% c.l.), province of Bergamo at 67% (33%-100%, 95% c.l.).
    This is under the assumption of age-independent IR, which we expect is a good assumption for highly infected regions like Bergamo and Lombardia. This suggests that Bergamo has reached herd immunity, and that the number of infected people greatly exceeds the number of positive tests, by a factor of 35 in Lombardia.
Figure 5 : Age-dependent IFR estimated using Diamond Princess data. We combine our estimate of excess mortality and match the the age-weighted IFR estimated from DP for the age group 70–90. This allows us to esimate IR and corresponding IFR for other age groups if the IR is assumed to be constant across age groups. This is a good assumption for Bergamo (and possibly Lombardia) which has likely reached herd immunity.
  • High infection rate helps to explain the high estimates of CFR in Italy.
    High CFR of Italy has been discussed extensively over the course of the pandemic and we quantify how this can be explained with high infection rate in Italy. Take Lombardia for example. Reported CFR is ~20% in Lombardia. However the total number of administered tests as of April 18 2020 was only 2.5% of the population, and 0.6% of the population tested positive. Compare this to our estimated 23% infection rate. Thus, if only the most severe cases that likely required hospitalization got tested, their fatality rate can be significantly higher than for the overall infected population. This will explain the high CFR of Italy.

Outlook and Estimating spread of COVID-19 in New York City

To summarize the discussion so far, we expect the impact of our analysis to be far reaching. Estimating the fatality ratios of COVID-19 requires accurate knowledge of two numbers: the actual number of fatalities and the actual number of infected people. There has been a growing evidence that both of these numbers are under-estimated in official COVID-19 statistics. In view of this, we look at the total mortality count with proper statistical analysis to get a robust estimate of the total number of deaths. We do this analysis for Italy under conservative assumptions and estimate the total mortality count of COVID-19 to be higher than official numbers by a factor of 2. At the same time, we use this estimated mortality to get bounds on the age-dependent IFR as well as quantify the spread of infection in different regions.

Age-dependent IFR is a powerful number to have and we can use it to get an estimate of the spread of infection in different regions by combining it with reported fatalities (of course with the caveat that this number might be higher than counted in official statistics). As an interesting observation upon combining our age-dependent IFR, we find a useful expression to estimate the population averaged IFR lower bound (mean) as 0.8 (1.0) times the yearly mortality rate (YMR) of the given population. Let us use this to look at the infection in other regions. For New York City (NYC) and Santa Clara county, using our YMR approximation, we obtain an IFR lower bound of 0.45%. As of April 18 NYC official COVID-19 PFR is 0.15% , which could be a significant underestimate of the total COVID-19 PFR. Assuming a more conservative IFR of 0.6% , we estimate that 25% of NYC residents are already infected. This is a factor of 16 above the number 130,000 positive cases reported by April 18. We focus on NYC in more detail in an upcoming post under this Medium publication.

Footnotes

  1. We attribute all the excess deaths to COVID-19. It seems likely that these deaths were due to COVID-19, but not attributed to it due to insufficient testing. The most direct way to verify this is to perform COVID-19 tests on every fatality,which is currently not done in any location. Another possible explanation is that the medical facilities were overwhelmed by COVID-19 cases, such that patients with curable ailments could not be treated. The latter explanation, however, is disfavored by the observations that — 1) the excess seems to be fairly independent of the region and hence the stresses on the health care system, 2) the proportion of under-reporting decreases as the pandemic progresses. This is consistent with the hypothesis that the deaths are missed due to lack of testing initially and not consistent with the deaths due to pressurized societal and medical systems which get more burdened with time. Alternatively, there are also arguments that suggest we may have underestimated the COVID-19 death rate. Italy has been under lockdown since March 9, which may have reduced fatalities due to other common sources such as road and workplace accidents, or criminal activities. Given the lack of better data to quantify these arguments, our analysis makes the morbid assumption that all the excess deaths can be attributed to COVID-19 mortality.
  2. Some arguments have been raised that due to the milder flu season and less flu fatalities, a large older population survived to be infected with COVID-19, hence resulting in more fatalities than otherwise expected. Without weighing the ethical and social merits of this argument, we only quantify that this number. As compared to the historical mean, Lombardia has 1532 less deaths by February 22 (compared to yearly variance of 680 for this period) while our estimated excess for COVID-19 fatalities are over 20000 (with standard error of 247) since then.
  3. We have data from 1680 towns. Since complete data is only available for a subset of towns, we need to evaluate the completeness per region and re-scale our estimates to obtain regional mortality. We estimate this factor for every region to be the ratio of the sum-total population of the towns in our dataset with the total regional population, as per the 2019 population from Istat website. We validate this scaling by comparing the proportion of mortality in these towns with the region as well and find it to be consistent. We remove from the analysis all the regions with less than 10\% completeness. As we have no way to quantify the error associated with this, we will use the most complete region (Lombardia, 72% complete) and province (Bergamo, 74% complete) for much of the quantitative analysis in this article, but we show that other regions give consistent results. Furthermore, given that we find good agreement between reported COVID-19 deaths and our estimated excess fatalities for ages below 60 years, we expect this scaling to not bias our results significantly.
  4. Though we have total mortality data only up to March 28th, we extrapolate to April 11th under the conservative assumption that the official reported COVID-19 mortality will catch up with the true excess mortality. After that, we assume reported COVID-19 fatalities accurately account for excess deaths. Given increased testing, the difference between these two numbers should decrease with time. However reaching exact matching by April 11th is probably still an optimistic assumption. Furthermore, given that the pandemic will last longer than April 18th (and given the time lag between infection and death), we believe that the numbers quoted here will still be an underestimate of the total COVID-19 mortality.
  5. In using IFR from DP, we have assumed that IFR in a given age group does not depend on other factors such as differences in co-morbidities or health care systems between the locations. This can be verified with tests performed at random or for an entire population, which are currently not available except for DP. It requires a very large number of tests to accumulate large enough fatality statistics.
  6. We assume the same IR for all the age groups in a particular region, inspired by epidemiological models. This could be to some extent verified by the total positive rate in tests as a function of age. Data for this exists but is unfortunately not published. However our age dependent population fatalities from the province of Bergamo, which serve as a lower limit to the IFR do not depend on this assumption since we assume IR=1. Bergamo province has very likely reached the herd immunity where IR is less likely to be age dependent, a situation very different from the low IR environments where one may expect more age dependence