1 Background

Starting from their origins in the the work of Kermack and McKendrick [1], the use of differential equation models of the course of infectious diseases has grown to become one of the more accessible instances of the reach of mathematics. The current COVID-19 Pandemic has brought them into the common parlance. Even before this, however, the baseline Susceptible-Infected-Recovered (SIR) model had been extended to include Exposed (E) and Deceased (D) compartments and applied with considerable success to influenza, ebola, malaria, cholera, tuberculosis and several other infectious diseases [2,3,4,5]. (Some of this literature also includes agent-based models, which we do not consider here.) During the COVID-19 Pandemic, the widespread availability of data in the public domain [6,7,8,9,10,11] has served to attract methods of mathematics, computation and data science to analyzing this information, inferring the disease’s dynamics and making projections. The present communication is in this spirit, and brings our recent work in large scale computations of partial differential equations (PDEs), system inference and machine learning to this problem [12,13,14,15,16,17].

Of particular interest to us are two lines of enquiry: The first is that for a rapidly evolving disease such as COVID-19, with its public health, population-based, political, travel and economic manifestations, the classical SIR model of ordinary differential equations (ODEs) with constant coefficients seems inadequate. Driven by data that extends the compartments to the deceased (D), we have adopted the SIRD model. This choice is based entirely on the nature of the data available on the epidemic in the state of Michigan, where the numbers of deceased are reported on a daily basis. The classical SIR model can be extended to compartments additional to the exposed and deceased ones. These models are typically designated as SIS (Susceptible-Infected-Susceptible again, such as in the common cold); MSIR (Maternal-Susceptible-Infected-Recovered, where immunity is derived from the mother in the M compartment); SEIS (Susceptible-Exposed-Infected-Susceptible again, also typical of the common cold); MSEIR and MSEIRD, which combine more of the compartments. However, data are not available to us on the exposed and maternally immune-protected sub-populations in the state of Michigan, and the Maternal compartment is not known to be relevant to COVID-19. We have therefore worked with the SIRD model. The interested reader is directed to Ref. [18] for details of the other SIR model variants.

The first extension that we have undertaken is to allow the ODE coefficients to vary in time to reflect the evolving contours of testing, quarantine and treatment protocols. This is not necessarily novel, and has been addressed in other work [2, 19], although perhaps not with the inference approach of Variational System Identification (VSI) and ODE-constrained optimization that we have adopted.

The second is the fact of a mobile population. Population mobility has been addressed through metapopulation models that characterize how diseases move between population hubs, across countries, or even intercontinentally. The most widely known are gravity models (e.g. [20]), and network and agent based models [21]. Given the prominence that quarantine protocols—adorned with the current-day euphemism of “social distancing”—have played in the COVID-19 Pandemic, it appears natural to seek an extension of the SIRD model to a spatio-temporal PDE model. As the world went into lockdown, but at different rates and degrees of rigor, and then began to emerge from it, the detection of patterns of mobility in space and time presents a compelling avenue for investigation. Such an extension also has been considered—chiefly in the setting of the mathematical analysis of reaction–diffusion systems [22,23,24]. Our contribution to this aspect of the mathematical treatment is to also allow the diffusivity of the S, I and R sub-populations to vary with time.

To these tasks we have brought the abundance of high-quality, public domain, data on the evolution of the various compartment pertaining to the SIRD model in the US state of Michigan. The temporal resolution by days and spatial resolution by the 85 counties of Michigan has allowed us to apply our methods of Variational System Identification [12, 13], PDE-constrained optimization and machine learning [14,15,16,17] to these data.

In Sect. 2 we review the foundational SIRD ODE model. Section 3 is on data preparation. The application of system identification and machine learning to the ODE system are, respectively, in Sects. 4 and 5. The results for inferred parameters and forward prediction are presented in Sect. 6. The extension to inferring mobility via reaction–diffusion systems is in Sect. 7. Our conclusions appear in Sect. 8.

2 The compartmental model of infectious disease dynamics

We use the SIRD version of compartmental epidemiology models. The population, taken to remain constant at N, is divided into four disjoint compartments with time-dependent sub-populations: S(t) for susceptible, I(t) for infected, R(t) for recovered and D(t) for deceased individuals. The governing ODEs are:

$$\begin{aligned} \frac{\text {d} S}{\text {d} t}&=-\frac{\beta }{N} SI+\gamma R \end{aligned}$$
(1)
$$\begin{aligned} \frac{\text {d}I}{\text {d}t}&=\frac{\beta }{N}SI-\mu I-\alpha I \end{aligned}$$
(2)
$$\begin{aligned} \frac{\text {d}R}{\text {d}t}&=\mu I-\gamma R \end{aligned}$$
(3)
$$\begin{aligned} \frac{\text {d}D}{\text {d}t}&=\alpha I \end{aligned}$$
(4)
$$\begin{aligned} N&= S(t) + I(t) + R(t) + D(t). \end{aligned}$$
(5)

This is the canonical form of the model where the sub-populations are assumed to be well-mixed so that spatial variations can be ignored over the domain of interest. Here \(\beta (t)\) is the infection rate, \(\mu (t)\) is the recovery rate, \(\gamma (t)\) is the rate of immunity loss, and \(\alpha (t)\) is the death rate—all allowed to vary with time. Using the natural temporal unit of one day, we note that \(1/\mu (t)\) is also the number of days an individual remains infectious. It follows that \(\beta (t)/\mu (t)\) is the effective reproduction number: the total number of the susceptible population that an infectious individual passes the disease to. This quantity is commonly denoted by \(R_0\), but we use \(r_0(t) = \beta (t)/\mu (t)\), to distinguish it from the recovered population, and emphasizing that it, too, varies with time.

We reiterate what we have outlined in the Background (Sect. 1). Given the rapidly varying nature of testing, reporting, treatment protocols and quarantine conditions over the course of an epidemic, it is natural to allow the coefficients in the SIRD model, Eqs. (14) to vary with time. Such variation is evident in epidemiological data. The reader may be familiar with the time varying nature of such factors over the course of the COVID-19 Pandemic. It is a central feature of data preparation in the following section.

Fig. 1
figure 1

The map of Michigan delineating the counties and regions (modified from [25])

Fig. 2
figure 2

Cumulative data with different kernel widths and multiples of application of the smoothing filter: \(7{\text {days}}\_1\times \) represents a 7-day filter applied once. Important dates are marked with the lockdown on March 23, reopening of construction and real estate sites (C) on May 1, reopening of manufacturing sites (M) on May 7, permission. to restart laboratory research (R) on May 15, lifting of the stay-at-home order (O) on June 1, and the end of our data collection (E) on June 28

Fig. 3
figure 3

Time derivatives (daily change) of sub-population data with different kernel widths and multiples of application of the smoothing filter. Important dates are marked with the lockdown on March 23, reopening of construction and real estate sites (C) on May 1, reopening of manufacturing sites (M) on May 7, permission. to restart laboratory research (R) on May 15, lifting of the stay-at-home order (O) on June 1, and the end of our data collection (E) on June 28

3 Data preparation

Counts of new confirmed infected cases I(t) and deaths D(t) were reported in the public domain on a daily basis by the state of Michigan for each county [26], while total recovered cases in the state R(t) were reported weekly [27]. See Fig. 1 for the counties and regions that Michigan is partitioned into. Since county specific recovery data was not reported, the distribution of recovered cases across counties was approximated to be the same as the distribution of cumulative infected cases, \(\int _0^t I(\tau )\text {d}\tau \) across counties. Estimates for the populations of Michigan’s counties [28] were used to determine the susceptible population, S(t), from Eq. (5).

Some amount of data smoothing was necessary, particularly to account for the weekly instead of daily reporting of the number of recovered cases. To compare the effect of the smoothing method on the data, a moving average filter was applied using 7, 11, and 15-day windows, guided by the week-long period of oscillation in the raw data for daily new infections \(I(t) - I(t-1)\). The 7-day window was applied one, two, and three times. As seen in Fig. 2, the method of smoothing has little effect on the trends of the data. However, and as expected, there is a strong effect on the numerical time derivatives (see Fig. 3). It is clear that multiple passes of the filter are required to remove jumps in \(\text {d}R/\text {d}t\) and \(\text {d}I/\text {d}t\). Since the additional smoothing is helpful for system inference in Sect. 4 and does not negatively affect the data, the 7-day moving average filter applied three times was used for data smoothing.

The lockdown in Michigan began on March 23, 2020. For brevity, we use C for the date when the outdoor construction industry was allowed to resume on May 1, 2020, M for the restart of some manufacturing on May 7, 2020, R for reopening of research laboratories on May 15, 2020, O for broader opening of most other activities and lifting of the stay at home order on June 1, 2020 (albeit with distancing guidelines in place), and E for the end of the data period that we considered (June 28, 2020). This notation is used for the rest of this communication.

4 System identification and ODE-constrained optimization

The SIRD model, Eqs. (14) was time-discretized using the Backward Euler method and written as:

$$\begin{aligned}&\frac{S^\mathrm{d}_{m} - S^\mathrm{d}_{m-1}}{\Delta t} +\frac{\beta }{N} S^\mathrm{d}_{m}I^\mathrm{d}_{m}-\gamma R^\mathrm{d}_{m} =0 \end{aligned}$$
(6)
$$\begin{aligned}&\frac{I^\mathrm{d}_{m} - I^\mathrm{d}_{m-1}}{\Delta t} -\frac{\beta }{N}S^\mathrm{d}_{m}I^\mathrm{d}_{m}+\mu I^\mathrm{d}_{m}+\alpha I^\mathrm{d}_{m} = 0 \end{aligned}$$
(7)
$$\begin{aligned}&\frac{R^\mathrm{d}_{m} - R^\mathrm{d}_{m-1}}{\Delta t} -\mu I^\mathrm{d}_{m}+\gamma R^\mathrm{d}_{m} = 0 \end{aligned}$$
(8)
$$\begin{aligned}&\frac{D^\mathrm{d}_{m} - D^\mathrm{d}_{m-1}}{\Delta t} - \alpha I^\mathrm{d}_m = 0 \end{aligned}$$
(9)

where \(S^\mathrm{d}_m, I^\mathrm{d}_m, R^\mathrm{d}_m, D^\mathrm{d}_m\) are the corresponding data (smoothed as in Sect. 3) at time \(t_m\) (the end of the mth day), \(\Delta t = 1\) day and Eq. (5) holds: \(S^\mathrm{d}_m = N - I^\mathrm{d}_m - R^\mathrm{d}_m - D^\mathrm{d}_m\).

The system identification problem is to infer the time-dependent coefficients \(\beta (t), \gamma (t), \mu (t), \alpha (t)\), which we choose to expand in a polynomial basis (other choices of bases are admissible).

$$\begin{aligned} \beta (t)&=\theta _0+\theta _1t+\theta _2t^2+\theta _3t^3 \end{aligned}$$
(10)
$$\begin{aligned} \gamma (t)&=\theta _4+\theta _5t+\theta _6t^2+\theta _7t^3 \end{aligned}$$
(11)
$$\begin{aligned} \mu (t)&=\theta _8+\theta _9t+\theta _{10}t^2+\theta _{11}t^3 \end{aligned}$$
(12)
$$\begin{aligned} \alpha (t)&=\theta _{12}+\theta _{13}t+\theta _{14}t^2+\theta _{15}t^3 \end{aligned}$$
(13)

The parameters to be inferred are collected into a vector \(\varvec{\theta } = \langle \theta _0,\dots ,\theta _{15}\rangle ^{\text {T}}\). Since the data are known the label vector can be constructed as:

$$\begin{aligned} \varvec{y}_{m} = \left\{ \begin{array}{c} \frac{S^\mathrm{d}_{m} - S^\mathrm{d}_{m-1}}{\Delta t} \\ \frac{I^\mathrm{d}_{m} - I^\mathrm{d}_{m-1}}{\Delta t}\\ \frac{R^\mathrm{d}_{m} - R^\mathrm{d}_{m-1}}{\Delta t} \\ \frac{D^\mathrm{d}_{m} - D^\mathrm{d}_{m-1}}{\Delta t} \end{array}\right\} \end{aligned}$$
(14)

and a matrix can be assembled from the reaction terms in the time-discretized SIRD equations (69):

$$\begin{aligned}&\varvec{\Xi }_{m}\nonumber \\&= \left[ \begin{array}{cccc} \frac{S^\mathrm{d}_m I^\mathrm{d}_m}{N}\langle 1\;t_m\;t_m^2\;t_m^3\rangle &{} -R^\mathrm{d}_m\langle 1\;t_m\;t_m^2\;t_m^3\rangle &{} \langle 0\;0\;0\;0\rangle &{} \langle 0\;0\;0\;0\rangle \\ -\frac{S^\mathrm{d}_m I^\mathrm{d}_m}{N}\langle 1\;t_m\;t_m^2\;t_m^3\rangle &{} \langle 0\;0\;0\;0\rangle &{} I^\mathrm{d}_m\langle 1\;t_m\;t_m^2\;t_m^3\rangle &{} I_m\langle 1\;t_m\;t_m^2\;t_m^3\rangle \\ \langle 0\;0\;0\;0\rangle &{} R^\mathrm{d}_m\langle 1\;t_m\;t_m^2\;t_m^3\rangle &{} - I^\mathrm{d}_m\langle 1\;t_m\;t_m^2\;t_m^3\rangle &{} \langle 0\;0\;0\;0\rangle \\ \langle 0\;0\;0\;0\rangle &{} \langle 0\;0\;0\;0\rangle &{} \langle 0\;0\;0\;0\rangle &{} -I^\mathrm{d}_m\langle 1\;t_m\;t_m^2\;t_m^3\rangle \end{array}\right] \nonumber \\ \end{aligned}$$
(15)

The columns of \(\varvec{\Xi }_m\) can be regarded as discretized versions of the basis operators that appear as reaction terms on the right hand-side of the SIRD model (14). The label vectors and matrices of basis operators at times \(t_0,\dots t_M\) are collected into

$$\begin{aligned} \varvec{y} = \underbrace{\left\{ \begin{array}{c} \varvec{y}_0 \\ \vdots \\ \varvec{y}_M \end{array}\right\} }_{4(M+1)\times 1},\quad \varvec{\Xi } = \underbrace{\left[ \begin{array}{c} \varvec{\Xi }_0 \\ \vdots \\ \varvec{\Xi }_M \end{array}\right] }_{4(M+1)\times 16} \end{aligned}$$
(16)

and the residual vector is defined:

$$\begin{aligned} \mathscr {R}(\varvec{\theta }) = \varvec{y} - \varvec{\Xi }\varvec{\theta } \end{aligned}$$
(17)

Our approach to inference combines system identification by stepwise regression [12, 13] and ODE-constrained optimization using adjoints. We define a loss function that incorporates penalization on \(\varvec{\theta }\) (leading to ridge regression below):

$$\begin{aligned} \ell (\varvec{\theta }) = \vert \mathscr {R}(\varvec{\theta }) \vert ^2 + \frac{1}{2}\lambda \vert \varvec{\theta }\vert ^2 \end{aligned}$$
(18)

Our stepwise regression techniques incorporate two algorithms listed next:

figure a

There are several possible criteria for eliminating basis terms. Here, we adopt a widely used statistical criterion called the F-test, also used by us previously [12, 13]. The significance of the change between the model at iterations j and \(j-1\) is evaluated by:

$$\begin{aligned} F=\frac{ \frac{\ell _j -\ell _{j-1}}{p_{j-1}-p_{j}}}{\frac{l_{j-1}}{P-p_{j-1}}} \end{aligned}$$
(20)

where \(p_j\) is the number of bases at iteration j and \(P = 16\) is the total number of operator bases. The F-test is achieved through the application of Algorithm 2:

Model selection thus finds \(\varvec{\theta }\) consisting of a minimal set of non-zero components, ensuring that the coefficients \(\beta (t),\dots ,\alpha (t)\) admit a parsimonious representation as polynomials in t. For clarity, we collect this set of non-zero coefficients into another vector, \(\varvec{\vartheta }_0\). Using \({\text {dim}}(\bullet )\) to represent the dimension of a Euclidean vector, we have \({\text {dim}}(\varvec{\vartheta }_0) \le {\text {dim}}(\varvec{\theta })\). The next step is to further refine the values of the non-zero polynomial coefficients using ODE-constrained optimization starting from the initial guess \(\varvec{\vartheta }_0\), and regarding \(S_m(\widetilde{\varvec{\vartheta }}),I_m(\widetilde{\varvec{\vartheta }}), R_m(\widetilde{\varvec{\vartheta }}),I_m(\widetilde{\varvec{\vartheta }})\) as the forward solution to the discretized SIRD model (2225) with coefficient \(\beta (t),\dots ,\alpha (t)\) values drawn from \(\widetilde{\varvec{\vartheta }}\):

$$\begin{aligned} \varvec{\vartheta }= & {} {\text {arg}}\;\underset{\widetilde{\varvec{\vartheta }}}{\min }\quad \sum _{m=0}^M \left( \frac{S_m(\widetilde{\varvec{\vartheta }})-S^\mathrm{d}_m}{W_1}\right) ^2 +\left( \frac{I_m(\widetilde{\varvec{\vartheta }})-I^\mathrm{d}_m}{W_2}\right) ^2\nonumber \\&+\left( \frac{R_m(\widetilde{\varvec{\vartheta }})-R^\mathrm{d}_m}{W_3}\right) ^2 +\left( \frac{D_m(\widetilde{\varvec{\vartheta }})-D^\mathrm{d}_m}{W_4}\right) ^2 \end{aligned}$$
(21)

Subject to the discretized SIRD model:

$$\begin{aligned}&\forall \; m \in \{0,\dots ,M\}\nonumber \\&\frac{S_{m}(\widetilde{\varvec{\vartheta }}) - S_{m-1}(\widetilde{\varvec{\vartheta }})}{\Delta t} +\frac{\beta }{N} S_{m}(\widetilde{\varvec{\vartheta }})I_{m}(\widetilde{\varvec{\vartheta }})-\gamma R_{m}(\widetilde{\varvec{\vartheta }}) =0 \end{aligned}$$
(22)
$$\begin{aligned}&\frac{I_{m}(\widetilde{\varvec{\vartheta }}) - I_{m-1}(\widetilde{\varvec{\vartheta }})}{\Delta t} -\frac{\beta }{N}S_{m}(\widetilde{\varvec{\vartheta }})I_{m}(\widetilde{\varvec{\vartheta }})+\mu I_{m}(\widetilde{\varvec{\vartheta }})\nonumber \\&\qquad +\alpha I_{m}(\widetilde{\varvec{\vartheta }}) = 0 \end{aligned}$$
(23)
$$\begin{aligned}&\frac{R_{m}(\widetilde{\varvec{\vartheta }}) - R_{m-1}(\widetilde{\varvec{\vartheta }})}{\Delta t} -\mu I_{m}(\widetilde{\varvec{\vartheta }})+\gamma R_{m}(\widetilde{\varvec{\vartheta }}) = 0 \end{aligned}$$
(24)
$$\begin{aligned}&\frac{D_{m}(\widetilde{\varvec{\vartheta }}) - D_{m-1}(\widetilde{\varvec{\vartheta }})}{\Delta t} - \alpha I_m(\widetilde{\varvec{\vartheta }}) = 0 \end{aligned}$$
(25)
figure b

where

$$\begin{aligned} W_1&= \underset{m}{\max }\, S^\mathrm{d}_m - \underset{m}{\min }\, S^\mathrm{d}_m \\ W_2&= \underset{m}{\max }\, I^\mathrm{d}_m - \underset{m}{\min }\, I^\mathrm{d}_m \\ W_3&= \underset{m}{\max }\, R^\mathrm{d}_m - \underset{m}{\min }\, R^\mathrm{d}_m \\ W_4&= \underset{m}{\max }\, D^\mathrm{d}_m - \underset{m}{\min }\, D^\mathrm{d}_m \end{aligned}$$

The ODE-constrained optimization problem is solved iteratively, and requires the gradient of the ODE constraint (2225) with respect to \(\widetilde{\varvec{\vartheta }}\). We adopt the classical approach requiring a single solution of the adjoint equation of the original ODE-constraint in each iteration. In this work we use the L-BFGS-B optimization algorithm from SciPy [29] and the dolfin-adjoint software library [30] to compute the gradient.

5 Deep and Bayesian neural networks

We also explore multilayer feedforward neural networks (NNs), which are universal function approximators [31], to learn the disease’s dynamics via the data \(S^\mathrm{d}_m, I^\mathrm{d}_m, R^\mathrm{d}_m, D^\mathrm{d}_m\) at discrete times and to infer the coefficients in Eqs. (14), as an alternative to the approach presented in Sect. 4. Specifically, we construct two NNs to represent the data, with one as a deterministic model and the other being a probabilistic model.

Both NNs take \(\{I^\mathrm{d}_m, R^\mathrm{d}_m, D^\mathrm{d}_m\, \Delta t \}\) as features and \(\{I^\mathrm{d}_{m+k}, R^\mathrm{d}_{m+k}, D^\mathrm{d}_{m+k}\}\) as labels. Thus, the two NNs make predictions on case numbers at day \(m+k\) based on case numbers reported at day m. In this work, k is chosen to vary from 1 to \(M-m\), where \(M = 97\) is the number of days that we used data for. In both types of NNs, \(S^\mathrm{d}_{m}\) and \(S^\mathrm{d}_{m+k}\) are computed based on the constraint Eq. (5).

The deterministic model is a deep neural network (DNN) that consists of multiple fully connected layers, whose model parameters (i.e. weights and bias) can be obtained in a straightforward manner by minimizing the loss function

$$\begin{aligned} \mathcal {L}_{\text {DNN}} = {\text {MSE}} \end{aligned}$$
(26)

through an optimization algorithm, such as stochastic gradient descent, via backpropagation. The probabilistic model is a Bayesian neural network (BNN), which also consists of multiple fully connected layers, but with its model parameters (i.e. weights and bias) being sampled from a posterior distribution \(P({\varvec{\theta }} | \mathscr {D})\) that is computed based on Bayes’ theorem

$$\begin{aligned} P({\varvec{\theta }} | \mathscr {D}) = \frac{P(\mathscr {D}|{\varvec{\theta }}) P({\varvec{\theta }})}{P(\mathscr {D})}, \end{aligned}$$
(27)

where \(\mathscr {D}\) denote the i.i.d. observations (training data) and P represents the probability density function. In Eq. (27), \(P(\mathscr {D}|{\varvec{\theta }})\) is the likelihood, \(P({\varvec{\theta }})\) is the prior probability, and \(P(\mathscr {D})\) is the evidence, respectively. The posterior distribution of \({\varvec{\theta }}\) is computed based on variational inference (VI), which approximates the exact posterior distribution \(P({\varvec{\theta }} | \mathscr {D})\) with a more tractable distribution \(Q({\varvec{\theta }})\) by minimizing the Kullback–Leibler (KL) divergence [32,33,34]

$$\begin{aligned} Q^* = {\text {arg min KL}}(Q({\varvec{\theta }})||P({\varvec{\theta }} | \mathscr {D})). \end{aligned}$$
(28)

The KL divergence is computed as

$$\begin{aligned} {\text {KL}}(Q({\varvec{\theta }})||P({\varvec{\theta }} | \mathscr {D})) = \mathbb {E}[\log Q({\varvec{\theta }})] - \mathbb {E}[\log P({\varvec{\theta }}, \mathscr {D})] + \log P(\mathscr {D}), \end{aligned}$$
(29)

which requires computing the logarithm of the evidence, \({\log }P(\mathscr {D})\) in Eq. (27) [33]. Since \(P(\mathscr {D})\) is hard to compute, it is challenging to directly evaluate the objective function in Eq. (28). Alternately, we can optimize the evidence lower bound (ELBO) defined as

$$\begin{aligned} {\text {ELBO}}(Q) = \mathbb {E}[\log P({\varvec{\theta }}, \mathscr {D})] - \mathbb {E}[\log Q({\varvec{\theta }})], \end{aligned}$$
(30)

which is equivalent to the KL-divergence up to an additive constant coming from the evidence. Thus, maximizing the ELBO is equivalent to minimizing the KL-divergence. The loss function for the BNN has the following form:

$$\begin{aligned} \mathcal {L}_{\text {BNN}} = \omega _1 {\text {MSE}} + \omega _2 {\text {ELBO}}, \end{aligned}$$
(31)

where \(\omega _1\) and \(\omega _2\) are weighting parameters, with \(\omega _1 = 50\) and \(\omega _2=1\) being chosen in this work. A specific weight perturbation method, known as Flipout [35], is followed to infer \(Q({\varvec{\theta }})\) by minimizing Eq. (31) through mini-batch training via backpropagation with stochastic optimization algorithms. Flipout has been implemented in the TensorFlow Probability Library. The architectures of both NNs are summarized in Table 1. Both NNs were trained by using the Adam optimizer following an exponentially decaying learning rate

$$\begin{aligned} {\text {lr}} = {\text {lr}}_0 \cdot {\text {pow}} \left( v_{\text {decay}}, \frac{N_{\text {total}}}{N_{\text {decay}}}\right) \end{aligned}$$
(32)

with an initial learning rate \({\text {lr}}_0 = 0.001\), a decay rate \(v_{\text {decay}} = 0.91\), a decay step \(N_{\text {decay}} = 100\), and a final \(N_{\text {total}} = 10{,}000\) epochs.

Table 1 Model architecture for the DNN and BNN. Dense and DenseFlipout refer to specific NN architectures (layers) used in the TensorFlow Library

6 Results

Because of the extremely nonuniform distribution of the population of Michigan, we first studied the SIRD model for the entire state consisting of the lower and upper peninsulas (Fig. 1). Following this, the SIRD models were inferred for the eight Regions (also shown in Fig. 1) individually as one direct approach to study the effect of spatial variations in the populations and sub-populations corresponding to the model’s compartments.

6.1 System identification and ODE-constrained optimization

Figure 4 shows the progression of stepwise regression to infer the active time-dependent terms in Eqs. (1013) via Algorithms 1 and 2. The stem-and-leaf plots on the left illustrate the fate of the terms \(\theta _0\,-\,\theta _{15}t^3\) over eight iterations of stepwise regression. Each stem-and-leaf represents one term out of \(\theta _0\,-\,\theta _{15}t^3\) and the values are scaled to 1 (active) or 0 (inactive) for each iteration. On the right is the loss, which remains low until Iteration 10 and increases dramatically in Iteration 11, if any further terms are eliminated. Following the F-test used in Algorithm 2, the large increase in loss after Iteration 10 exceeds the threshold for acceptable model error. Thus system identification converges to the inferred model in ten iterations.

Fig. 4
figure 4

Left: Stem-and-leaf plot illustrating system identification of active time-dependent SIRD parameters using data for the entire state of Michigan. Each stem-and-leaf represents one term of \(\theta _0,\dots \theta _{15}t^3\), scaled to 1 (active) or 0 (inactive). Right: The changing loss as terms are eliminated from the set of time-dependent coefficients. System identification converges at Iteration 10 as the loss increases dramatically for further elimination of terms

Figure 5 shows, on the left, the evolution of SIRD model parameters and, on the right, a comparison of the predictions of the inferred model versus the data after ODE-constrained optimization that follows the system identification step. It is important to recall that these results are representative of the population of the entire state of Michigan. The SIRD model, having only four compartments, and applied to data that are the outcome of changing characteristics of testing, quarantine and treatment protocols, does not resolve many details of the public health aspects of the epidemic. The immunological characteristics of the disease itself are accounted for only in a very aggregated sense.

In Fig. 5, the important dates when the lockdown was imposed, and its gradual lifting are indicated by vertical lines to aid an understanding of the results. We first draw attention to the conclusion that \(\gamma (t) = 0\); the inference indicates that recovery from COVID-19 confers permanent immunity—an important conclusion, that remains to be confirmed by immunologists. As may be expected, the population’s infection rate, \(\beta (t)\), declined as the initially higher rates of positive diagnoses fell with fewer infected individuals. However, it began to rise again upon the opening of construction activities (C), and continued to do so through the lifting of stay at home orders (O). The recovery rate, \(\mu (t)\), showed a long initial increase as growing numbers of infected individuals recovered. Our interpretation of the initially high death rate, \(\alpha (t)\), is that many of the early cases already had advanced progression of the disease. Its rapid decline can be attributed to the ramp up of the public health campaign, hospitalization and emergency response of the medical system. The success that the state of Michigan gained by mandating an aggressive lockdown of nearly all societal, educational, commercial and industrial activity is best reflected in the rapid decline of the effective reproduction number, \(r_0(t)\). According to the inference presented here, \(r_0(t) < 1.0\) for \(m > 32\) (April 24, 2020), after which the typical infected individual passed the disease on to less than one other person. The death rate increased over the last few days for which data were obtained, perhaps as some number of individuals who had been infected for a longer time failed to recover. This affected the recovery rate as well, which fell. The close match between the simulations with the inferred ODE SIRD model and data (Fig. 5, right plot) validate the systems inferred. Such validation against the data holds for all the inferred results presented in this communication, although the non-uniqueness of inverse problems does not preclude the existence of multiple sets of inferred coefficients.

Fig. 5
figure 5

Left: The time-dependent SIRD parameters after tuning by ODE-constrained optimization following system identification are: \(\beta (t)=0.0756-0.0029t+3.33\times 10^{-5}t^3,\gamma (t)=0,\mu (t)=1.78\times 10^{-5}t^2,\alpha (t)=0.0053-2.8\times 10^{-6}t^2+2.93\times 10^{-8}t^3 \). Right: Simulation of the four compartments using the inferred ODE SIRD model, in comparison with the data

Figures 6 and 7, respectively, illustrate the time-dependent SIRD coefficients and comparison between data and simulation using the inferred model (with the inferred ODE SIRD model) of the disease for Regions 1–8 delineated in Fig. 1. This is an important step toward a more fine-grained understanding of the geographical distribution of the disease in the state. The Southeastern part of the state is more heavily populated, especially Regions 2 and 3, which also bore the greatest burden of the disease. The city of Detroit, at the Western tip of Region 3, was the worst affected, reflecting its well-known socio-economic challenges. By contrast, Washtenaw County, about 50 km to the West, but also in Region 3, bore among the lowest burdens, per capita. At the risk of stating the obvious, we note that Regions 1-4, which account for nearly 80% of the state’s population displayed very similar characteristics in the evolution of the data, as well as in SIRD coefficients and forward simulation results. We do not enter a more detailed analysis of these results here, deferring a different approach to spatial aspects of the spread of the disease to Sect. 7.

Fig. 6
figure 6

Parameters of time-dependent SIRD coefficients, \(\beta (t), \mu (t), \alpha (t)\), and the effective reproduction number, \(r_0(t)\), for Regions 1–8 (see Fig. 1) of Michigan

Fig. 7
figure 7

Comparison of the simulation using inferred SIRD parameters (Fig. 6) for Regions 1–8 of Michigan

Fig. 8
figure 8

a Time-dependent coefficients identified by DNNs, where an increased infection rate after the opening (O) of the lockdown on June 1st is observed. b DNNs learned S(t), I(t), R(t), D(t) based on the full extent of data points, and made a 30-day prediction

Fig. 9
figure 9

a Time-dependent coefficients identified by BNNs, where an increased infection rate after the opening (O) on June 1st is observed. b BNNs learned S(t), I(t), R(t), D(t) based on the full extent of data points, and made a 30-day prediction. Bands correspond to ± standard deviation over the mean

6.2 Deep and Bayesian neural networks

To infer the coefficients \(\beta (t)\), \(\gamma (t)\), \(\mu (t)\), \(\alpha (t)\), we first compute the time derivatives of S(t), I(t), R(t), D(t) by using the automatic differentiation API from TensorFlow. The coefficients are then computed by inverting Eqs. (14) at each time instant. For DNNs, we obtained deterministic results for all the coefficients. With BNNs, a Monte Carlo Sampling is performed to compute the mean and the standard deviation of the coefficients.

The constraint Eq. (5) is used to obtain \(S^\mathrm{d}_m\) from \(I^\mathrm{d}_m, R^\mathrm{d}_m, D^\mathrm{d}_m\). This ensures that the discrete time derivatives in Eqs. (2225) satisfy

$$\begin{aligned} \frac{S^\mathrm{d}_{m} - S^\mathrm{d}_{m-1}}{\Delta t} = -\frac{I^\mathrm{d}_{m} - I^\mathrm{d}_{m-1}}{\Delta t} -\frac{R^\mathrm{d}_{m} - R^\mathrm{d}_{m-1}}{\Delta t} -\frac{D^\mathrm{d}_{m} - D^\mathrm{d}_{m-1}}{\Delta t} \end{aligned}$$
(33)

The constraint Eq. (5) also has been imposed in the DNN and BNN representations by training networks for IR and D and then defining the network for S by this conservation of total population. Therefore, in using Eqs. (14) to invert the DNN/BNN representations for \(\beta (t),\gamma (t),\mu (t),\alpha (t)\) at each time instant, a linear dependence is encountered: The summed left and right hand-sides of (24) exactly equal the left and right hand-side of (1), respectively. A unique solution for \(\beta (t),\gamma (t),\mu (t),\alpha (t)\) is not possible due to linear dependence introduced by the population constraint. To circumvent this indeterminacy, we endow the system with additional information by requiring that \(\gamma (t) = 0\). This represents the conferral of immunity on the recovered population, and importantly, is detected by our inference results using system identification and ODE-constrained minimization, as discussed in Sect. 6.

The inferred values, extended to a 30-day prediction (until July 28, 2020) for \(\beta (t), \mu (t), \alpha (t), r_0(t)\) and \(S(t), I(t), R(t), D(t)\) obtained from both DNNs and BNNs for Michigan are presented in Figs. 8 and 9, while the results for the eight Regions are given in Appendices “DNN results for different regions” and “BNN results for different regions”. One can observe that these time-dependent coefficients in Figs. 8a and 9a have a similar initial trend as those inferred by the system inference approach in Fig. 5. The effective reproduction number \(r_0(t) < 1\) for \(m > 30\) (April 23), in good agreement with its value obtained via system inference in Fig. 5. As polynomial approximation is used by the system inference approach, the inferred coefficients in Fig. 5 are very smooth, whereas inversion using the NN approach captures the detailed fluctuation of these coefficients, particularly, the rising infection rate after the open of the lockdown on June 1, 2020. In Fig. 9, the band around the inferred coefficients and the NN predictions shows the mean ± one standard deviation of the corresponding results. Note the high standard deviation in parameters at early times, due to the noise in the data at small numbers. The regional results in Appendices “DNN results for different regions” and “BNN results for different regions” indicate that an accelerating infection rate for all the regions after the open of the lockdown. In particular, Region 7 and 8 have a predicted \(r_0(t)\) value that is greater than 1. In addition, we observed that the BNN inferred coefficients in the regional results have a narrower range compared to those from the DNN.

More broadly, we note the difference in trends between the inferred time-dependent coefficients with the DNNs and BNNs in Figs. 8 and 9 in comparison with those in Fig. 5. This is due to the local inversion at each data point to infer the coefficients with the DNNs and BNNs versus the global optimization of losses for system inference in Sect. 6. As was referred to above, inverse problems allow non-unique solutions. It will be instructive to compare the predictions made by the DNN and BNN representations with the data when they become available.

A result that is consistent across all inference methods: system identification with ODE-constrained optimization, DNNs and BNNs, and for the state as a whole as well as its Regions is the following: The infection rate, \(\beta (t)\), initially fell with the public health campaign, especially driven by the lockdown orders. However, it began to rise with the first step of opening (C), and even accelerated as more aspects of public, recreational, commercial and industrial activities were relaxed (M, R, O). Yet, every one of the versions of forward simulations with corresponding and consistently inferred systems matched very well with the data, which confirm that the state has largely controlled the pandemic, and continues to do so. As the number of remaining infected individuals, I(t), has fallen steeply, there are fewer conveyors of infection, and even the higher \(\beta (t)\) has not yet led to another explosion of infection. This also can be seen by the sharply rising recovery rate, \(\mu (t)\), and is verified by the effective reproduction number \(r_0(t)\) falling below 1.0 after April 20 or 23 (the later date according to the DNN and BNN inference methods). A warning bell, however, must be rung as the results also indicate that \(r_0(t) \rightarrow 1.0\) from below as we approach the end of our data and the time of writing. Michigan’s numbers for I(t) are rising, although not yet exponentially. See Figs. 5, 6, 7, 8, 9, and Appendix sections “DNN results for different regions”, “BNN results for different regions”.

7 Two dimensional SIRD model with diffusion

Classical epidemiological models hold in the well-mixed limit, which is reflected in the compartments and sub-populations, SIRD being being total numbers over some geographical region. Spatial effects have been introduced by simply resolving smaller regions and treating them individually, as demonstrated here with our inference of SIRD coefficients over the regions of Michigan’s lower peninsula (Figs. 6, 7). However, while affording a spatially finer-grained treatment, this approach cannot, of course, address the mobility of the population. This is an important consideration, especially in light of the imposition and lifting of quarantines. In the COVID-19 Pandemic, the effects of social distancing, and the possibility of surges with their lifting revolve on the question of the time (and spatially) varying mobility of the population. At the finest resolution, this must be approached via agent-based models refined to resolve individuals. However, an intriguing question to explore is whether simple reaction–diffusion models can detect the evidence of mobility in these data. With our approach to model inference, we have access to methods of identifying mechanisms from data in which their action, while weak, may hold the key to important insights to the system. In this section, we embark down such a path, while noting that reaction–diffusion models of epidemiology have been considered previously from the perspective of analysis of the corresponding PDEs [22,23,24].

We now extend the SIRD model to PDEs in two spatial dimensions using the same compartments. However, the population variables are now replaced with spatio-temporally varying densities, \(\widehat{S}(\varvec{x},t),\widehat{I}(\varvec{x},t),\widehat{R}(\varvec{x},t),\widehat{D}(\varvec{x},t)\) defined as numbers per unit area.

$$\begin{aligned} \frac{\partial \widehat{S}}{\partial t}&=\mathcal {D}_{\text {S}}\nabla ^2 \widehat{S}-\frac{\beta }{\widehat{N}} \widehat{S}\widehat{I}+\gamma \widehat{R} \end{aligned}$$
(34)
$$\begin{aligned} \frac{\partial \widehat{I}}{\partial t}&=\mathcal {D}_\mathrm{I}\nabla ^2 \widehat{I}+\frac{\beta }{\widehat{N}}\widehat{S}\widehat{I}-\mu \widehat{I}-\alpha \widehat{I} \end{aligned}$$
(35)
$$\begin{aligned} \frac{\partial \widehat{R}}{\partial t}&=\mathcal {D}_\mathrm{R}\nabla ^2 \widehat{R}+\mu \widehat{I}-\gamma \widehat{R} \end{aligned}$$
(36)
$$\begin{aligned} \frac{\partial \widehat{D}}{\partial t}&=\alpha \widehat{I} \end{aligned}$$
(37)

Where \(\mathcal {D}_{\text {S}}, \mathcal {D}_\mathrm{I}, \mathcal {D}_\mathrm{R}\) are diffusivities of the corresponding compartments, and represent the mobility of the population via random walks. We define \(\widehat{(\bullet )}=(\bullet )/\int _{\Omega }\text {d}A\) where \(\Omega \) is the domain of the lower peninsula of Michigan, to which we restrict our PDE SIRD studies. Furthermore the population constraint holds: \(\int _{\Omega }\widehat{N}\text {d}A = \int _{\Omega }\widehat{S}(t)\text {d}A + \int _{\Omega }\widehat{I}(t)\text {d}A + \int _{\Omega }\widehat{R}(t)\text {d}A + \int _{\Omega }\widehat{D}(t)\text {d}A\).

7.1 Inference on the PDE form of the SIRD model

We adopt the weak form, and specifically, the finite element framework for inference on the above system of PDEs. For a generic, finite-dimensional field \(u^h\), the problem is stated as follows: Find \(u^h\in \mathscr {S}^h \subset \mathscr {S}\), where \(\mathscr {S}^h= \{ u^h \in \mathscr {H}^1(\Omega ) ~\vert ~u^h = ~\bar{u}\; \mathrm {on}\; \Gamma ^u\}\), such that \(\forall ~w^h \in \mathscr {V}^h \subset \mathscr {V}\), where \(\mathscr {V}^h= \{ w^h \in \mathscr {H}^1(\Omega )~\vert ~w^h = ~0 \;\mathrm {on}\; \Gamma ^u\}\), the finite-dimensional (Galerkin) weak form of the problem is satisfied. The variations \(w^h\) and trial solutions \(u^h\) are defined component-wise using a finite number of basis functions,

$$\begin{aligned} w^h = \sum _{a=1}^{n_\mathrm {b}} c^a N^a, \quad \qquad u^h = \sum _{a=1}^{n_\mathrm {b}} d^a N^a, \end{aligned}$$
(38)

where \(n_\mathrm {b}\) is the dimensionality of the function spaces \(\mathscr {S}^h\) and \(\mathscr {V}^h\), and \(N^a\) represents the basis functions. To obtain the Galerkin weak forms, we multiply each strong form by the corresponding weighting function, use Backward Euler method for time-discretization, integrate by parts and apply boundary conditions appropriately, leading to:

$$\begin{aligned}&\int _{\Omega }w^h_1 \frac{\widehat{S}^h_{m} - \widehat{S}^h_{m-1}}{\Delta t} \text {d}s =-\int _{\Omega } \mathcal {D}_{\mathrm{S}}\nabla w^h_1\cdot \nabla \widehat{S}^h_m\text {d}s\nonumber \\&\quad -\int _{\Omega }w^h_1\left( \frac{\beta }{\widehat{N}} \widehat{S}^h_m\widehat{I}^h_m-\gamma \widehat{R}^h_m \right) \text {d}s \end{aligned}$$
(39)
$$\begin{aligned}&\int _{\Omega }w^h_2\frac{\widehat{I}^h_{m} - \widehat{I}^h_{m-1}}{\Delta t}\text {d}s =-\int _{\Omega }\mathcal {D}_\mathrm{I}\nabla w^h_2\cdot \nabla \widehat{I}^h_m\text {d}s\nonumber \\&\quad +\int _{\Omega }w^h_2\left( \frac{\beta }{\widehat{N}} \widehat{S}^h_m\widehat{I}^h_m-\mu \widehat{I}^h_m-\alpha \widehat{I}^h_m \right) \text {d}s \end{aligned}$$
(40)
$$\begin{aligned}&\int _{\Omega }w^h_3\frac{\widehat{R}^h_{m} - \widehat{R}^h_{m-1}}{\Delta t}\text {d}s =-\int _{\Omega }\mathcal {D}_\mathrm{R}\nabla w^h_3\cdot \nabla \widehat{R}^h_m\text {d}s\nonumber \\&\quad +\int _{\Omega }w^h_3\left( \mu \widehat{I}^h_m-\gamma \widehat{R}^h_m\right) \text {d}s \end{aligned}$$
(41)
$$\begin{aligned}&\int _{\Omega }w^h_4 \frac{\widehat{D}_{m} - \widehat{D}_{m-1}}{\Delta t}\text {d}s =\int _{\Omega }w^h_4\alpha \widehat{I}^h_m\text {d}s \end{aligned}$$
(42)

Where, boundary terms disappear because we assume that the populations do not cross the state boundary, or into the upper peninsula. The system identification problem is to infer the time-dependent coefficients \(\mathcal {D}_{\mathrm{S}}(t),\mathcal {D}_\mathrm{I}(t), \mathcal {D}_\mathrm{R}(t)\), and we also choose to expand them in a polynomial basis

$$\begin{aligned} \mathcal {D}_s(t)&=\theta _{16}+\theta _{17}t+\theta _{18}t^2+\theta _{19}t^3 \end{aligned}$$
(43)
$$\begin{aligned} \mathcal {D}_i(t)&=\theta _{20}+\theta _{21}t+\theta _{22}t^2+\theta _{23}t^3 \end{aligned}$$
(44)
$$\begin{aligned} \mathcal {D}_r(t)&=\theta _{24}+\theta _{25}t+\theta _{26}t^2+\theta _{27}t^3 \end{aligned}$$
(45)

along with the time-dependent coefficients \(\beta (t), \gamma (t), \mu (t), \alpha (t)\) shown in Eq. (1013). We expect that the effect of mobility on the evolution of population densities is small over the course of the COVID-19 Pandemic. However, our interest is in inferring the presence of this effect in the data following the relaxation of lockdown orders. In order to identify the diffusivities despite the expected dominance of the reaction terms in the data obeying Eqs. (42), we adopt two stage Variational System Identification [13].

We define Stage 1 by choosing \(w^h_i=1, i=1,\ldots ,4\), yielding:

$$\begin{aligned} \int _{\Omega } \frac{\widehat{S}^h_{m} - \widehat{S}^h_{m-1}}{\Delta t} \text {d}A&= -\beta \int _{\Omega }\frac{1}{\widehat{N}} \widehat{S}^h\widehat{I}^h \text {d}s-\gamma \int _{\Omega }\widehat{R}^h\text {d}A \end{aligned}$$
(46)
$$\begin{aligned} \int _{\Omega }\frac{\widehat{I}^h_{m} - \widehat{I}^h_{m-1}}{\Delta t}\text {d}s&=\beta \int _{\Omega }\frac{1}{\widehat{N}}\widehat{S}^h\widehat{I}^h\text {d}s-\int _{\Omega }\mu \widehat{I}^h\text {d}s\nonumber \\&\quad -\alpha \int _{\Omega }\widehat{I}^h \text {d}A \end{aligned}$$
(47)
$$\begin{aligned} \int _{\Omega }\frac{\widehat{R}^h_{m} - \widehat{R}^h_{m-1}}{\Delta t}\text {d}s&= \mu \int _{\Omega }\widehat{I}^h\text {d}s-\gamma \int _{\Omega }\widehat{R}^h\text {d}s \end{aligned}$$
(48)
$$\begin{aligned} \int _{\Omega } \frac{\widehat{D}^h_{m} - \widehat{D}^h_{m-1}}{\Delta t}\text {d}s&=\alpha \int _{\Omega }\widehat{I}^h\text {d}A \end{aligned}$$
(49)

The diffusion operators vanish since, for a constant weighting function, \(\nabla w = 0\). In order to avoid a proliferation of superscripts and subscripts, we simply denote the data interpolated over the finite element mesh at time m by \(\widehat{(\bullet )}^\mathrm{d}_m\), dispensing with the superscipt \((\bullet )^h\) for the finite-dimensional fields The label vector and matrix of bases can be constructed as:

$$\begin{aligned} \varvec{y}_{m}= & {} \left\{ \begin{array}{c} \int _{\Omega }\frac{\widehat{S}^\mathrm{d}_{m} - \widehat{S}^\mathrm{d}_{m-1}}{\Delta t}\text {d}A \\ \int _{\Omega }\frac{\widehat{I}^\mathrm{d}_{m} - \widehat{I}^\mathrm{d}_{m-1}}{\Delta t}\text {d}A\\ \int _{\Omega }\frac{\widehat{R}^\mathrm{d}_{m} - \widehat{R}^\mathrm{d}_{m-1}}{\Delta t}\text {d}A \\ \int _{\Omega }\frac{D^\mathrm{d}_{m} - D^\mathrm{d}_{m-1}}{\Delta t}\text {d}A \end{array}\right\} \end{aligned}$$
(50)
$$\begin{aligned} \varvec{\Xi }_{m}= & {} \left[ \begin{array}{cccc} \int _{\Omega }\frac{\widehat{S}^\mathrm{d}_m \widehat{I}^\mathrm{d}_m}{N}\text {d}s\langle 1\;t_m\;t_m^2\;t_m^3\rangle &{} \int _{\Omega }-\widehat{R}^\mathrm{d}_m\text {d}s\langle 1\;t_m\;t_m^2\;t_m^3\rangle &{} \langle 0\;0\;0\;0\rangle &{} \langle 0\;0\;0\;0\rangle \\ \int _{\Omega }-\frac{\widehat{S}^\mathrm{d}_m \widehat{I}^\mathrm{d}_m}{N}\text {d}s\langle 1\;t_m\;t_m^2\;t_m^3\rangle &{} \langle 0\;0\;0\;0\rangle &{} \int _{\Omega }\widehat{I}^\mathrm{d}_m\text {d}s\langle 1\;t_m\;t_m^2\;t_m^3\rangle &{} \int _{\Omega }I_m\text {d}s\langle 1\;t_m\;t_m^2\;t_m^3\rangle \\ \langle 0\;0\;0\;0\rangle &{} \int _{\Omega }\widehat{R}^\mathrm{d}_m\text {d}s\langle 1\;t_m\;t_m^2\;t_m^3\rangle &{} \int _{\Omega }- \widehat{I}^\mathrm{d}_m\text {d}s\langle 1\;t_m\;t_m^2\;t_m^3\rangle &{} \langle 0\;0\;0\;0\rangle \\ \langle 0\;0\;0\;0\rangle &{} \langle 0\;0\;0\;0\rangle &{} \langle 0\;0\;0\;0\rangle &{} \int _{\Omega }-\widehat{I}^\mathrm{d}_m\text {d}s\langle 1\;t_m\;t_m^2\;t_m^3\rangle \end{array}\right] \end{aligned}$$
(51)

Once the reaction terms are identified, we return to the original weak forms Eqs. (3942). Accounting for the arbitrariness of \(w^h\) in \(\mathscr {V}^h\), the finite-dimensionality leads to a system of residual equations for each degree of freedom (DOF):

$$\begin{aligned} \mathscr {R}_i=\mathscr {F}_i\left( S^\mathrm{d}_{m-1}, S^\mathrm{d}_m, \nabla S^\mathrm{d}_m,\ldots , D_\mathrm{s}\langle 1\;t_m\;t_m^2\;t_m^3\rangle ,\ldots , N,\nabla N\ldots \right) , \end{aligned}$$
(52)

where \(\mathscr {R}_i\) is the ith component of the residual vector. The diffusion terms can then be identified by the two stage approach to Variational System Identification detailed in [12].

Fig. 10
figure 10

A finite element mesh of the map of Michigan delineating the counties. Only Regions 1–7 were used in the PDE inference problem

7.2 Data preparation on the 2D map of Michigan

Fig. 11
figure 11

Left: stem-and-leaf plot illustrating system identification of active reaction parameters in the PDE SIRD model in Stage 1 of Variational System Identification. Each stem and leaf represents one term of \(\theta _0,\dots \theta _{15}t^3\), scaled to 1 (active) or 0 (inactive). Right: The changing loss as terms are eliminated from the set of time-dependent coefficients. System identification converges at Iteration 10 as the loss increases dramatically for further elimination of terms

Fig. 12
figure 12

Left: stem-and-leaf plot illustrating system identification of active diffusion parameters in the PDE SIRD model in Stage 2 of Variational System Identification. Each stem and leaf represents one term of \(\theta _{16},\dots \theta _{27}t^3\), scaled to 1 (active) or 0 (inactive). Right: The changing loss as terms are eliminated from the set of time-dependent coefficients. System identification converges at Iteration 8 as the loss increases dramatically for further elimination of terms

Fig. 13
figure 13

Left: The time-dependent reaction parameters in the 2D SIRD model after tuning by PDE-constrained optiization: \(\beta (t)=0.00798-1.82\times 10^{-4}t+1.49\times 10^{-6}t^2,\gamma (t)=0,\mu (t)=2.65\times 10^{-5}t^2,\alpha (t)=2.82\times 10^{-4}t-2.83\times 10^{-6}t^2 \). Right: The sole time-dependent diffusivity is for the infected sub-population: \(\mathcal {D}_\mathrm{I}=2.146-8.12\times 10^{-5}t^2-7.914\times 10^{-7}t^3\)

Fig. 14
figure 14

Comparison of the data on distributions of the infected (a) and recovered (c) sub-populations against forward PDE SIRD simulations with inferred quantities, (b) and (d), respectively. Data and simulation results are shown for Day 0 (lockdown, March 23), Day 40 (May 2, when the infected sub-population had its greatest spread across the state, but still restricted to Southeastern Michigan) and Day 96 (end of our data range, June 28, 2020)

We first construct a two-dimensional mesh that fully resolves the counties as shown in Fig. 10. Recall that only the lower peninsula, consisting of Regions 1–7 was included in the PDE inference problem. The data are available as cumulative sub-population numbers \(I^\mathrm{d}_m, R^\mathrm{d}_m, D^\mathrm{d}_m\) at the county level (Michigan’s lower peninsula has 68 counties). We use a uniform density of each sub-population to compute \(\widehat{I}^\mathrm{d}_m, \widehat{R}^\mathrm{d}_m, \widehat{D}^\mathrm{d}_m\) within the county, and applied Gaussian filtering to smooth the discontinuities between counties. Note that the discrete Gaussian filter can not be applied in a straightforward manner to unstructured meshes. Here we start with continuous Gaussian filtering over the infinite domain:

$$\begin{aligned} u(\varvec{ x}_0)&={\int _{-\infty }^\infty G(\varvec{ x}_0,\varvec{ x})u_\mathrm{raw}(\varvec{ x})\text {d}v} \end{aligned}$$
(53)
$$\begin{aligned}&=\int _\Omega G(\varvec{ x}_0,\varvec{ x})u_\mathrm{raw}(\varvec{ x})\text {d}v \end{aligned}$$
(54)

where u could be any of the four sub-population densities, and \(G(\varvec{ x}_0,\varvec{ x})=\frac{1}{2\pi \sigma ^2}e^{-\frac{||\varvec{ x}||^2}{2\sigma ^2}}\) is the two dimensional Gaussian distribution function. The parameter \(\sigma \) is the standard deviation of the Gaussian distribution which is related to the kernel size in the discrete Gaussian filter. Since \(\int _\Omega G \text {d}A < 1\) we scale up the filtered displacement at each node:

$$\begin{aligned} u(\varvec{ x}_0)&=\frac{\int _{-\infty }^\infty G(\varvec{ x}_0,\varvec{ x})\text {d}v}{{\int _\Omega G(\varvec{ x}_0,\varvec{ x})\text {d}v}}{\int _\Omega G(\varvec{ x}_0,\varvec{ x})u_\mathrm{raw}(\varvec{ x})\text {d}v} \end{aligned}$$
(55)
$$\begin{aligned}&=\frac{1}{{\int _\Omega G(\varvec{ x}_0,\varvec{ x})\text {d}v}}{\int _\Omega G(\varvec{ x}_0,\varvec{ x})u_\mathrm{raw}(\varvec{ x})\text {d}v} \end{aligned}$$
(56)

The spatio-temporal evolution of these fields was used in PDE inference via two-stage Variational System Identification as described in Sect. 7.1 followed by optimization constrained by the PDEs in (3942) using adjoints. Stem-and-leaf plots and the losses for Stage 1 of Variational System Identification appear in Fig. 11. Recall that in this stage only the reaction terms \(\beta (t),\gamma (t),\mu (t),\alpha (t)\) are identified. These inference results for active coefficients should be compared with the ODE SIRD model in Fig. 5. This is followed by Stage 2 of Variational System Identification with stem-and-leaf plots and losses appearing in Fig. 12. Note that the diffusivities of the susceptible and recovered populations, \(\mathcal {D}_\mathrm{S} = 0\) and \(\mathcal {D}_\mathrm{R} = 0\). However, the infected population has a time-varying diffusivity \(\mathcal {D}_\mathrm{I}\) that declines.

7.3 Results of system identification of two dimensional SIRD model with diffusion

Figure 13 shows the inference (two stage Variational System Identification followed by PDE-constrained optimization) for the coefficients \(\beta (t),\mu (t),\alpha (t)\), the effective reproduction number, \(r_0(t)\) as well as the diffusivity \(\mathcal {D}_\mathrm{I}(t)\) in the PDE SIRD model. On comparing with Fig. 5 some differences are revealed in the time dependence of \(\beta (t), \mu (t), r_0(t), \alpha (t)\). This is to be expected in adopting the PDE SIRD model over the ODE form. The inference of time-dependent diffusion in the mobility of the infected sub-population, \(\mathcal {D}_\mathrm{I}\), naturally affects the other quantities. While the preliminary nature of these results warrants caution, it is worth noting the inference of decreasing mobility of the infected sub-population in \(\mathcal {D}_\mathrm{I}\). Figure 14 compares data and the forward simulation with inferred quantities for the distribution of the infected and recovered sub-populations on days corresponding to the initial lockdown, the maximum spread of the infected sub-population (May 2), and at the end of our data collection. Notably, the restriction of the high density of the infected population, \(\widehat{I}\) to Southeastern Michigan reflects the success, to date, of the state’s public health response. While the correspondence is reasonable, the statewide sub-populations S(t), I(t), R(t), D(t) obtained by integrating the corresponding densities over the lower peninsula, show a poorer match in Fig. 15. While the trends are reproduced, there are notable errors over time. A major improvement is possible in the PDE SIRD model by allowing the coefficients \(\beta ,\gamma ,\mu ,\alpha , \mathcal {D}_\mathrm{S},\dots ,\mathcal {D}_\mathrm{R}\) to also vary over space. This would allow better representation of the system, in keeping with the inferred difference in \(\beta (t),\mu (t),r_0(t),\alpha (t)\) over the eight Regions in Fig. 6, which led to the excellent agreement between data and the forward ODE SIRD simulations in Fig. 7. From a purely data representation standpoint, the greater number of parameters will allow lower errors.

The code used for the inference, machine learning and forward simulations is available in the mechanoChem and mechanoChemML libraries at https://github.com/mechanoChem/.

8 Conclusion

We have brought machine learning inference techniques to bear upon the data on progression of COVID-19 across the state of Michigan by applying three distinct approaches: (a) Our methods of system identification to delineate the operational mechanisms, followed by (b) adjoint-based model-constrained optimization for refinement of the parameters, and (c) deep and Bayesian neural networks. Our interest in this study has been two-fold.

The first has been to seek to infer the time-dependence of the coefficients in the classical ODE SIRD model, motivated by the evolving characteristics of testing, quarantine and treatment protocols over the 97-day course of the pandemic as reflected in the data. As discussed in Sect. 6, our inference methods reveal the course of rates of infection, recovery and death over the state and its eight regions, assuming uniform mixing in each case. Notably, our methods suggest that recovery confers immunity, but we hasten to add that this is a very preliminary conclusion. More detailed and fine-grained studies need to be undertaken to verify it, and of course, immunology will have the final say here. Also of note are our conclusions that while the infection rate has increased after an initial decline, as the state relaxed restrictions, the lower numbers of infectious individuals has meant a lower overall extent of transmission. This is also seen in the effective reproduction rate, which, while below one, has trended dangerously closer to that threshold of exponential growth. The uncertainty in our inference, given the data, is reflected in the results of the Bayesian neural networks in the same section. Of some interest here are the predictions made by BNNs for 30 days beyond the end of the data we have considered; that is until July 28, 2020.

Fig. 15
figure 15

Simulation of the four compartments using the inferred PDE SIRD model, in comparison with the data

The second facet of our interest is to try and infer spatial dependence by extending the SIRD models to PDEs by incorporating the population’s mobility via diffusion. This is a different, and potentially intriguing, approach that complements the resolution of the problem down to the smaller Regions of the state as we did with the ODE SIRD model. On this front, we note that the inference needs to be extended to our methods of two-stage Variational System Identification followed by PDE-constrained optimization. Here, it is of note that the susceptible and recovered populations were found to have vanishing diffusivities (mobilities), while the infected population had a diffusivity that declined over the 97-day extent of the data that we used. This first extension to system inference of the PDE SIRD model returned reasonable comparisons with data on distributions of the sub-population densities, although the total numbers integrated over the state were not as well reproduced. As suggested by the notable differences in the ODE SIRD model coefficients for the eight regions of the state, the PDE SIRD model with spatially varying coefficients may be a better representation. This will allow us to make connections with the well-known metapopulation variants [20, 21] of the network-based SIR model that account for the mobility of individuals or groups, and thereby represent spatially and group-wise varying diffusivity. Building on these initial results, we see many possibilities for analysis and prediction of the future course and geographical spread of the COVID-19 Pandemic using the PDE SIRD model.