key: cord-0108438-ij825lja authors: Fan, Chun; Meng, Yuxian; Sun, Xiaofei; Wu, Fei; Zhang, Tianwei; Li, Jiwei title: Parameter Estimation for the SEIR Model Using Recurrent Nets date: 2021-05-30 journal: nan DOI: nan sha: 0314e8616a13a03562d30dd005de3299e68782a9 doc_id: 108438 cord_uid: ij825lja The standard way to estimate the parameters $Theta_text{SEIR}$ (e.g., the transmission rate $beta$) of an SEIR model is to use grid search, where simulations are performed on each set of parameters, and the parameter set leading to the least $L_2$ distance between predicted number of infections and observed infections is selected. This brute-force strategy is not only time consuming, as simulations are slow when the population is large, but also inaccurate, since it is impossible to enumerate all parameter combinations. To address these issues, in this paper, we propose to transform the non-differentiable problem of finding optimal $Theta_text{SEIR}$ to a differentiable one, where we first train a recurrent net to fit a small number of simulation data. Next, based on this recurrent net that is able to generalize SEIR simulations, we are able to transform the objective to a differentiable one with respect to $Theta_text{SEIR}$, and straightforwardly obtain its optimal value. The proposed strategy is both time efficient as it only relies on a small number of SEIR simulations, and accurate as we are able to find the optimal $Theta_text{SEIR}$ based on the differentiable objective. On two COVID-19 datasets, we observe that the proposed strategy leads to significantly better parameter estimations with a smaller number of simulations. The SEIR model [2, 14, 18, 19] is a widely-used epidemiological model to predict the macroscopic behavior of disease spread through a population, e.g., the spread of COVID-19 among different populations at different locations over time [1, 5, 9, 12, 13, 25] . A typical SEIR model models the spread dynamics of disease using four states of populations: susceptible (S), exposed (E), infectious (I) and recovered (R), where susceptible individuals can be transformed into the exposed, and later to the infectious and finally to the recovered. Each transition is associated with corresponding parameter(s), forming the parameter set Θ SEIR for the model: transmission rate for S → E, infection rate for E → I and recovery rate for I → R. Θ SEIR captures the spreading pattern of the disease, the learning of which is thus crucial for understanding the disease spread, and developing social-distancing or interventions policies. Existing SEIR models relies on grid search to estimate Θ SEIR via simulations. Model simulations are performed on different sets of parameters, and the set that has the smallest L 2 distance between the predicted number of infections and the observed number of infections is selected as the best parameter combination. This is because of the non-differentiable nature of the L2 objective with respect to Θ SEIR : predicted number of infections are obtained from discrete simulations. This learning process is (1) high time-intensive: the simulation process can be slow when the population is large, which gives a time complexity of O(M N T ) where M is the number of simulations performed, N is total number of individuals and T is maximum time steps; besides, the search space will grow exponentially with respect to the number of parameters we need to estimate; and (2) Figure 1 : An overview of a standard SEIR model (left) the mobility network (right: from time t to time t + 1, person p 3 moved from location l 1 to l 3 , and p 1 moved from l 1 to l 2 ). because it is impractical to enumerate all possible combinations of parameters in the continuous space, the resulting parameter combination would be sub-optimal given a limited amount of trials. To address these issues, in this paper, we propose to transform the original non-differentiable simulation problem into a differentiable one using neural recurrent nets for parameter estimation in SEIR models. The basic idea is that neural recurrent nets are differentiable with respect to the input learnable parameters, so that they are able to automatically learn these parameters via gradient descent and backpropagation [27] . Another advantage neural recurrent nets offer is that they intrinsically model the temporal changes of observations, and have the potentials to make accurate predictions about the number of individuals at each time step. To train the recurrent net, we first harvest the training data by running a few SEIR simulations, leading to a collection of simulated data under different parameter combinations. Next, a recurrent net is trained to fit these simulations. Then with model parameters of the the recurrent net Θ LSTM fixed, the observation data are used to train the net with respect to the SEIR parameters Θ SEIR via gradient descent, leading to the optimal parameter values Θ * SEIR . The proposed strategy is both time efficient in that it only relies on a small number of SEIR simulations, and accurate because we are able to find the optimal parameters Θ * SEIR based on the differentiable objective. Through experiments on two COVID-19 datasets, we observe that the proposed strategy leads to more accurate parameter estimates with significantly better time efficiency. This work focuses on estimating the parameters Θ SEIR in the SEIR mode. A standard SEIR model is formulated in terms of 4 populations of individuals: the susceptible population (S), the exposed population (E), the infectious population (I) and the recovered population (R), which are respectively comprised of all individuals susceptible to the infection,all individuals that have contacted infected patients but are currently during their incubation period, infected individuals that can transmit the disease to susceptible population, and recovered individuals that cannot become infected again and cannot transmit the disease to others. Given a mobility network G t = ({P, L}, E) t at time step t(t = 1, 2, · · · , T ) where P is the set of person nodes, L is the set of location nodes and E is the set of edges linking persons to locations, the SEIR model needs to predict, what the numbers of individuals for these populations are in different locations at time step t, according to the preceding populations progress, the current mobility network and a set of parameters controlling the probability of transferring from one state of population to the next state. Formally, we use the superscript l to represent a specific location l and use the subscript t to represent the time point t, e.g., S l t is the number of individuals for state S in location l at time step t. The number for all populations in location l at time step t is denoted by N l t = S l t + E l t + I l t + R l t , and S t = l S l t is the population for all locations at time t (the same for E t , I t , R t ). We further assume a constant overall population N = N t = S t + E t + I t + R t , ∀t. There are three sets of parameters β, κ and γ controlling how likely that a person currently at one particular state would transfer to the next state. β controls the probability of transferring from S to E, and κ and γ are respectively responsible for tranferring from E to I and I to R. β can be a set of values, representing the transmission rate for different location categories or cities. κ and γ can also be a set of values, representing the infection and recovery rates for different population groups. With a fixed set of parameters and the mobility network, the SEIR model can simulate the disease spread process and predict the numbers of these four populations at each time step for a location. The probability of an individual transferring from one state to the next at location l for time step t can be formulated as: During simulations, the state for each individual is sampled. An illustration is shown in Figure 1 . The sum of the L 2 distances between the observed infections and the predicted infections over all time steps is the simulation error for a particular set of parameters, and the best parameter combination is selected to minimize the simulation error: We use the hat notationˆto represent the prediction made by the model. Towards finding the (near) optimal parameter combination in Eq.(2), the SEIR models needs to run Eq.(1) over all individuals for T time steps, which gives a total time complexity of O(N T ). The core idea of the proposed framework is that, instead of using the brute force grid search strategy to obtain the parameter set that leads to the least L 2 distance between predictions and observations, we train a recurrent net to fit the time-series data of SEIR based on a small number of simulations using different sets of parameters. The trained recurrent net can then be used to directly find the optimal value of Θ * SEIR , which denotes the parameter set for SEIR that leads to the optimal predictions. We employ the widely used Long Short-Term Memory network (LSTM) [17] as the recurrent net model backbone. The first step of the proposed framework is to generate smiulation data, which will be used to train the recurrent LSTM net. We first use different sets of Θ SEIR to perform simulations on the predefined network, as in Eq.(1). We perform M simulations with M different sets of Θ SEIR , denoted by {Θ At each time step t, we iterate over all individuals, assign the previous state (i.e., S, E, I or R) of an individual, and sample its current state based on the SEIR model. Then, we sum up all individuals belonging to the same states, and obtain {S Next, we train an recurrent net to fit {S t−1 }, the populations of the four states at the previous time step. The model should also be aware of the value of Θ (m ) SEIR and the mobility network structure at time step t because they directly decide the number of individuals for the four states for the next time step. To serve this propose, we also feed Θ (m ) SEIR as input to the LSTM model at time step t. For the time-varying mobility network G t , we map it to a vector representation h Gt , which is able to capture its mobility structure and be conveniently fed to the LSTM network. Mapping G t to h Gt We use the DIFFPOOL model [33] to map the mobility network G t at time step t to its high-dimensional representation h Gt . DIFFPOOL is a differentiable pooling model that can generate hierarchical representations of graphs by progressively clustering nodes into a coarser graph. Specifically, at each layer l, DIFFPOOL learns an assignment matrix S (l) ∈ R n l ×n l+1 to assign each node at layer l to a cluster in the next layer l + 1, where n l is the number of nodes (clusters) at layer l. Given node embeddings Z (l) ∈ R n l ×d and the adjacency matrix A (l) ∈ R n l ×n l at layer l, DIFFPOOL generates the new node embeddings Z (l+1) ∈ R n l+1 ×d , the new adjacency matrix A (l+1) ∈ R n l+1 ×n l+1 and the new assignment matrix S (l+1) ∈ R n l+1 ×n l+2 by applying the following equations: GNN l,embed and GNN l,pool are two distinctly parameterized GNN, and are respectively responsible for generating new embeddings and producing a distribution over next-layer clusters. Setting the number of clusters at the last layer L to 1, DIFFPOOL outputs a single high-dimensional graph representation. We train DIFFPOOL to classify the graph G t at time step t to the category label t, and use the extracted graph representation h Gt as input to the LSTM. SEIR and h Gt , we are able to obtain the hidden vector representation h (m ) t for the time step t: where is then passed to a fully connected layer to obtainĥ (m ) t , which is mapped to scalars to predict {S The training objective is minimizing the distance between simulation outputs {S Eq. 8 can be trained in an end-to-end fashion to obtain optimal Θ LSTM . The LSTM model with Θ * LSTM is able to generalize the behavior of the SEIR model with a specific value of Θ SEIR . When training the LSTM to learn Θ * LSTM , Θ SEIR is set to a fixed value of Θ (m ) SEIR and fed as input to LSTMs at each time step. Due to the fact that Θ SEIR can also be viewed as parameters in LSTMs, i.e., the input to each time step, we can fix Θ * LSTM and relax Θ SEIR , treating Θ SEIR as learnable parameters, to minimize Eq.(2) which we write down here for reference: whereÎ t is the output from LSTM, andÎ t is the observation data rather than the simulation data used to train the LSTM. Eq. (2) is differentiable with respect to Θ SEIR and can be trained in an end-to-end fashion based on SGD [20, 27] . To this end, we learn the optimal values of Θ SEIR that minimize the L 2 distance between predicted infections and observations. In the case where we have prior knowledge about the values Θ SEIR , e.g., all values in Θ SEIR should be larger than 0, the value of β is usually smaller than 0.1 based on clinical observations for COVID-19 [5] , we can incorporate regularizers as side objectives: where prior(Θ SEIR ) denotes the human prior knowledge regarding the values of Θ SEIR , and λ controls the trade-off. We will explore the effects of prior(Θ SEIR ) and λ in experiments. We use two public datasets for evaluations: the infection network of Covid-19 in China (Covid-China) [24, 29] , and the infection network of Covid-19 in the US (Covid-US) [5] . Covid-China consists of infection networks for 31 provinces in China from Apr 2020 to Feb 2021, extracted from action tracking reports of Covid-19 patients. The network of Covid-China consists of two types of nodes: patient and location. Time-varying edges are constructed between a patient node and a location node if the patient visited the location at time t. Each location takes an attribute from 11 categories of locations: households, workplaces, hotels, supermarkets, banks, restaurants, parks,barber shops/hairdressers, trains, buses, and airplanes, and the economic city tier (first, second or third) that it belongs to. A patient node is characterized by features of age (taking the value of children, youths, adults or seniors) and gender (taking the value of male and female). Each attribute for gender, age, city-tier and location type is associated with a specific transmission rate β. The transmission rate for a certain person node of gender s, age a in location of type c in a city of tier t is the additive combination of corresponding β: β(s, a, t, c) = β s + β a + β t + β c (10) β = {β s , β a , β t , β c }, along with γ and κ are parameters to learn. The network for each city is sliced into consecutive time snippets, with the size of stride set to two weeks. For each city, we have daily gold number of infections, public by Chinese CDC. These gold numbers of infections are used to learn Θ SEIR . Snippets without any infection are removed. Each time step of each city is labeled with gold number of infections, which is used to train the SEIR parameters. Snippets are divided to 80%/10%/10% for training, dev and test. Covid-US consists of networks that capture hourly visits from each population group to each location in 10 metro areas in the US. The network is extracted from mobility data provided by the SafeGraph application. The network consists of two types of nodes: population group and location. A timevarying edge with weight w i,j is constructed, if at time t, the number of people from population group i visiting location j is w i,j . w i,j is column-normalized. Each location is associated with a location category (e.g., full-service restaurants, grocery stores, etc), and each population group is associated with its race and median income. For the SEIR model, transmissions can happen within groups or across population groups when two people from two groups visit the same location. Simulations are performed at the population group level. Each category location c is associated with a specific transmission rate β c , which captures the inter-group transmissions across groups in the locations. For each population group with race r and income decile (i), the intra-group transmission is set to β = β r + β i . Each area is associated with gold number of infections published by the The New York Times 1 . Each snippet consists of the network for a single city and lasts a week. We divided snippets to 80%/10%/10% for training, dev and test. Generating Simulated Data We first need to sample Θ SEIR . We limit the value of each β to the scope of [0, 0.1], and we randomly sample its value within the scope. For κ and γ, based on previous clinical observations [22] where κ −1 is around 96 hours and γ −1 is around 84 hours, we sample κ and γ from a normal distribution with expectation set to 96 and 84. Given a sampled set of Θ SEIR , we run simulations on the training datasets to obtain the simulation data, i.e., the number of individuals for all four states for each time step. For each episode, we take K samples of Θ SEIR , leading to a total number of K * |train| training sequences, where |train| denotes the number of training episodes. Learning LSTMs Θ LSTM to Fit Simulated Data We split the simulated data to 90/10 for training and validation. We train a three-layer LSTM with residual connections [15, 21] to fit the simulated time-series data using for training, based on Eq. (8) . Then size of hidden states is set to 128. The value of batch size is set to 256, and SGD is used for optimization. LSTM parameters and embeddings are initialized from a uniform distribution in [-0.08,008]. Gradient clipping is adopted by scaling gradients when the norm exceeds a threshold of 1. Dropout rate, learning rate and the number of training epochs are treated as hyper-parameters to be tuned on the dev set. Learning Θ SEIR We optimize Θ SEIR based on Eq.(9) on the number of gold daily infections, using the daily reported infections as labels. We use AdaGrad [7] for optimization. Dropout rate, learning rate, the number of training epochs, and the hyper-parameter λ are tuned on the dev set. For baselines, we search the optimal Θ SEIR using vanilla grid search SEIR simulations. For each set of Θ SEIR , simulations are performed on all training episodes, and the parameter set that leads to the minimum L 2 loss is selected as the final value. Suppose that we conduct K explorations for Θ SEIR . This means we need to perform K simulations on each training episode. The K is here thus comparable to and the same as the K for simulation data generation. We report the average of the square of L 2 distance between the predicted number and reported number of infections on the test episodes with varying number of simulations K on the test set. Lower values indicate superior models. Results are shown in Table 1 . Observations are as follows: (1) as the number of simulations K increases, the performances for both the vanilla model and the proposed model improve. This is in accord with our expectations: for the vanilla model, a larger number of simulations means that the model is able to explore the search space more thoroughly to obtain the optimal value; for the proposed LSTM model, the model learns better with more training data and avoids overfitting; (2) with the same number of K, the proposed LSTM model performs significantly better than the vanilla brute-force search model. This is due to the generalization ability of proposed framework: the vanilla model can only select the optimal parameters from the set it tries, while the proposed framework can generalize to the un-tried parameter set; and (3) notably, the proposed framework is able to achieve comparable performance to the vanilla search model with significantly smaller number of simulations. Specifically, for Covid-China, the performance obtained with 100 simulations (13.1) is comparable to the vanilla model with 5,000 simulations (13.5); for Covid-US, the performance obtained with 500 simulations (930) is comparable to the vanilla model with 5,000 simulations (970). This further illustrates the superiority of the proposed framework. In this subsection, we explore the effect of different modules, along with hyper-paremeters in the proposed framework to explore their influence. The effect of λ The hyper-parameter controls the trade-off between observations and external prior knowledge. The effect of λ is shown in Table 2 . As can be seen, model performance first improves and then declines as the value of λ grows. Finding the sweep spot for the balance between observations and prior knowledge leads to the best performance. We also conduct experiments using other recurrent net structures, including vanilla recurrent net (RNN), Gated Recurrent Unit (GRU) [6] , and Simple Recurrent Units (SRU) [23] . Results for different recurrent structures are shown in Table 3 . As can be seen, the LSTM structure performs comparable to GRU (slightly worse than GRU on Covid-China and slightly better on Covid-US), better than the vanilla recurrent net and SRU. How to incorporate Θ SEIR into LSTMs We explore the effects of different ways to incorporate Θ SEIR in the LSTM model, including the current strategy of (1) Θ SEIR being concatenated with the input x t at each time step (Each); (2) Θ SEIR being incorporated only at the first time step (First); and (3) Θ SEIR element-wise multiplies (Hadamard product) the input for each time step (Hadamard). For (3), since the dimensionalities of Θ SEIR and x t are different, x t is first passed to an FFN, the output of which has the same dimensionality with Θ SEIR . Results for the three strategies are shown in Table 4 . As can be seen, Θ SEIR incorporated only at the first time step significantly underperforms the strategy that incorporates Θ SEIR at every time step. This is because of the gradient vanishing effect of recurrent nets: reminding the model of Θ SEIR at each time leads to better performances. The concatenation strategy obtains comparable performances to the Hadamard product strategy. The Effect of Graph Representation The time-varying location-population network is captured by the graph representations through DIFFPOOL. This is critical since the number of infections highly relies on the spreading network for each time. We explore its necessity by comparing it with other variants: (1) Results are shown in Table 5 . As can be seen, when no network information is incorporated, the model nearly fails to learn anything. Constant networks perform slightly better than no network, but still significantly worse than the time-vary graph representations. This is in accord with our expectations since the time-varying network decides the number of infections and the disease spread patterns at each time. The classical compartmental models simplify the mathematical modeling of disease spread by simulating the population transitions between different states in the disease spread process. Since the outbreak of Covid-19, the SEIR model has been widely used to model the Covid-19 spread around the world [9, 12, 13] and provide important insights regarding isolation policies [4, 5, 8] and vaccine delivery strategies [3, 11] . To achieve a faster simulation speed and better simulation results, recent works have proposed to leverage deep neural networks in place of SEIR models to predict pandemic dynamics over time. For example, [32] used the LSTM model to predict the numbers of new infections given the contact statistics and the pre-selected transmission/incubation/recovery/death rates. [10] incorporated sequential network structures and graph attention to predict the number of infections upon a temporal and spatial mobility graph. These works aim at taking advantage of SEIR models to inform effective strategies in response to the disease spread. With regard to the estimation of the parameters in SEIR models, a simple approach is to enumerate parameter combinations, run the SEIR model with each combination and select the one with the smallest error. An alternative to grid search is to use approximate Bayesian computation (ABC) [26, 28, 30] , a technique that maintains a small fraction of simulations that are close to the targt statistics in the light of the computed distance. These simulations are treated as posterior distributions of the SEIR parameters, which are then used to infer the optimal parameters. Another approach to estimating parameters is to view the process of population transitions between states as a problem of ordinary differentiable equations (ODEs) [14, 16, 19] . However, ODEs can only give approximate numerical solutions, which could be inaccurate for real-world modeling. The most relevant work is from [31] , who proposed to make direct use of existing neural networks to predict the basic reproduction number (R0), the number of secondary cases generated by an infectious individual in a fully susceptible host population. This work is different from [31] in that (1) they sought to estimate the basic reproduction number R0 and we estimate the parameters in SEIR (or other SEIR variants) models; and more importantly (2) they propose to directly output the number to estimate given inputs, whereas we propose to automatically learn the parameters through neural network gradient descent and back-propagation. The proposed method can be extended to other fields that require time-consuming simulations to estimate necessary parameters. In this work, we propose to transform the original non-differentiable simulation problem of SEIR parameter estimation into a differentiable one by leveraging neural recurrent nets. The recurrent net is first trained to fit a small number of simulation data, and then trained on the observation data to derive the optimal SEIR parameters. This strategy bypasses the needs of time-consuming simulations, and automatically induces the optimal parameters via gradient descent, leading to both accuracy and efficiency gains. Modelling the impact of testing, contact tracing and household quarantine on second waves of covid-19 Dynamics of measles epidemics: estimating scaling of transmission rates using a time series sir model Model-informed covid-19 vaccine prioritization strategies by age and serostatus A simulation of a covid-19 epidemic based on a deterministic seir model Mobility network models of covid-19 explain inequities and inform reopening Learning phrase representations using rnn encoderdecoder for statistical machine translation Adaptive subgradient methods for online learning and stochastic optimization Estimating the overdispersion in covid-19 transmission using outbreak sizes outside china Covid-19 and sars-cov-2. modeling the present, looking at the future Stan: spatio-temporal attention network for pandemic prediction using real-world evidence An extended seir model with vaccination for forecasting the covid-19 pandemic in saudi arabia using an ensemble kalman filter Seir modeling of the italian epidemic of sars-cov-2 using computational swarm intelligence Extensions of the seir model for the analysis of tailored social distancing and tracing approaches to cope with covid-19 Exact analytical solutions of the susceptibleinfected-recovered (sir) epidemic model and of the sir model with equal death and birth rates Deep residual learning for image recognition The mathematics of infectious diseases Long short-term memory Deterministic and stochastic epidemics in closed populations Containing papers of a mathematical and physical character Stochastic estimation of the maximum of a regression function Residual lstm: Design of a deep recurrent architecture for distant speech recognition Early dynamics of transmission and control of covid-19: a mathematical modelling study. The lancet infectious diseases Simple recurrent units for highly parallelizable recurrence Mobility, exposure, and epidemiological timelines of covid-19 infections in china outside hubei province. Scientific data The effect of control strategies to reduce social mixing on outcomes of the covid-19 epidemic in wuhan, china: a modelling study Abc random forests for bayesian parameter inference Learning representations by back-propagating errors Inferring epidemiological parameters from phylogenies using regression-abc: A comparative study Analysis on action tracking reports of covid-19 informs control strategies and vaccine delivery in post-pandemic era Approximate bayesian computation Can machines learn respiratory virus epidemiology?: A comparative study of likelihood-free methods for the estimation of epidemiological dynamics Modified seir and ai prediction of the epidemics trend of covid-19 in china under public health interventions Hierarchical graph representation learning with differentiable pooling