key: cord-0301912-vks7mwtg authors: Albani, V. V. L.; Albani, R. A. S.; Massad, E.; Zubelli, J. P. title: Nowcasting and Forecasting COVID-19 Waves: The Recursive and Stochastic Nature of Transmission date: 2022-04-13 journal: nan DOI: 10.1101/2022.04.12.22273804 sha: 6af003b84700b63d58202b4278726555c952ae8a doc_id: 301912 cord_uid: vks7mwtg We propose a parsimonious, yet effective, susceptible-exposed-infected-removed-type model that incorporates the time change in the transmission and death rates. The model is calibrated by Tikhonov-type regularization from official reports from New York City (NYC), Chicago, the State of Sao Paulo, in Brazil, and British Columbia, in Canada. To forecast, we propose different ways to extend the transmission parameter, considering its estimated values. The forecast accuracy is then evaluated using real data from the above referred places. All the techniques accurately provided forecast scenarios for periods 15 days long. One of the models effectively predicted the magnitude of the four waves of infections in NYC, including the one caused by the Omicron variant for periods of 45 days long using out-of-sample data. Since the seminal work of Kermack-McKendrick [1] , susceptible-infected-removed-type and susceptible-exposed-infected-removed-type (SEIR-type) models have played a crucial role in understanding the dynamics of infectious diseases through nowcasting and helping public health authorities to address disease outbreaks by forecasting [2] . In the pandemic of COVID-19, that started in the end of 2019 and was still ongoing in the beginning of 2022, SEIR-type models were again recurrently used to nowcast the SARS-CoV-2 spread dynamics and to forecasting [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] . 1 As pointed in a series of recent works [3, 5, 24, 25] , the estimated values of the transmission parameter usually denoted by β change with time and presents some sort of randomness. To make accurate predictions, such variability must be incorporated into the model. Thus, the techniques proposed in what follows are designed to accurately predict the future pattern of transmission by accounting for past calibrated values of the time-dependent β. It is worth mentioning that using time-dependent transmission parameters also allow a remarkable adherence to reported data [3, 5, 24, 25 ]. To describe the spread dynamics of the virus SARS-CoV-2, we consider a SEIR-type model that, for simplicity, does not consider natural deaths and births. The model accounts for the following five compartments, susceptible (S), exposed (E), infected (I), recovered (R), and deceased (D). The progress between the compartments is defined by the dynamical system of ordinary differential equations below [26] : dD dt = δ(t)I. The time-dependent transmission parameter is denoted by β(t) and the infection to onset mean time is represented by 1/σ, which is set to 5.1 days [27] . The quantity γ denotes the recovery rate and δ(t) is the death rate. Following [28] , we set γ = 1/14 days −1 . For each day t, the death rate δ(t) is defined as follows, it is the total number of registered deaths on day t divided by the number of reported infection on the day t − 12. So it is dimensionless. The 12 days is the meantime that someone takes from being infected to die by COVID-19 [3, 24, 25] . Natural death and birth could be included in the model, but such terms would make little difference in the analysis that follows, as the time interval considered here is small. So, to keep the model as simple as possible, we decided to not include such terms. The estimation of the transmission parameter β from daily reports of COVID-19 infections is performed by the minimization of the Tikhonov-type functional below for each day t, as in [3] : where α is the regularization parameter, which has dimension days 2 , to make the second term in the right-hand side of Eq. (6) dimensionless, and I REP (t) denotes the reported COVID-19 3 . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted April 13, 2022. infections on the day t. The term log I REP (t)! is approximated by the Stirling's formula: The quantity Ω(t) = t t−1 σE(s)ds in Eq. (6) represents the model predicted infections for the day t. It depends on β, since it is part of the solution of the system of equations in Eqs. (2)- (5) . Therefore, the β(t) that minimizes Eq. (6) is the value that makes the model to predict correctly the infections reported in the day t. The first part of the functional in Eq. (6) accounts for the discrepancy between the model predictions and the reports of infections. The second part penalizes the first to avoid overfitting, by imposing that the discrepancy between β(t) and the already known value β(t − 1) must also be minimized. To avoid the introduction of bias in the calibration process, the parameter α must be appropriately chosen. In the present set of examples, we set α = 10 −3 /2 accordingly to an a posteriori rule [29] . The transmission parameter β, as we shall see, changes with time, presenting an uncertain pattern. Since we do not have any formulation to predict how β changes with time, we propose five different techniques to evaluate future values of beta allowing the SEIR-type model to make predictions. Average Method (AM) The simplest technique to extend β by setting its future values as the average of its estimated values. More precisely, for any t > t n , set β(t) = β, where β = β(t n ) + β(t n−1 ) + . . . + β(t n−m+1 ) m , t n−m+1 , . . . , t n are time values where β is calibrated, and m < n. Since the trend of β at the end of the dataset is the most relevant information to predict the number of infections in the following days, we set m = 5 days. If m is large, it is possible to corrupt predictions with outdated information. Moreover, keeping the number of unknowns considerably smaller than the size of "observed" quantities can help to reduce the effect of noise in the predictions. Linear Regression (LR) Another simple technique to extend β is to fit a linear model. In other words, for any t > t n , we set β(t) = a · t + b, with the scalar parameters a and b fitted to the set of calibrated values of β {β(t n−m+1 ), β(t n−m+2 ), . . . , β(t n )}. The calibration is performed by least-squares. In this case, since there are two unknowns, we set m = 10 days also to avoid outdated information and to keep the number of unknowns considerably smaller than the number of observations. Truncated Fourier Series (TFS) A more elaborated technique is to calibrate a truncated Fourier series [30] that approximates a dataset. More precisely, assume that β can be approximated by a truncated Fourier series in the time interval [t n−m+1 , t n ], 4 . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted April 13, 2022. ; For m and N fixed, the parameters a 0 , a j , and b j (j = 1, . . . , N ), which have the same dimension of β, are calibrated by least squares from the dataset {β(t n−m+1 ), β(t n−m+2 ), . . . , β(t n )}. Thus, for each time t > t n , β(t) is given by the periodic function defined in Eq. (7) with calibrated parameters. In this case, choosing the length of the time interval where the series is defined and the number of terms in the series is tricky. After testing different values and configurations, we found that setting the length of the time interval m = 30 days and the number of terms in the series N = 3 gives a computational affordable technique that fits well the data. Truncated Fourier Series with Noise (TFSWN) In order to add some uncertainty to the predicted values of β, we add to the truncated Fourier series described above a Gaussian noise with mean zero and standard deviation given as follows: 1. Evaluate the difference between the values given by the truncated Fourier series and the dataset {β(t n−m+1 ), β(t n−m+2 ), . . . , β(t n )}. For each time t > t n , β(t) is given by the function in Eq. (7) plus a Gaussian random variable with mean zero and standard deviation given as above. In this case, we use the same values for m and N , as well as the parameters of the truncated Fourier series in Eq. (7) found for the noiseless case. A Mean-Reverting (MR) Model for the Transmission Parameter As an alternative to the methods described above, we propose a mean-reverting model for the dynamics of β. Since the time evolution of β seems to return to mean values, after random changes, it is natural to test such kind of model. So, inspired by the volatility of the Heston model [31] from Quantitative Finance, we assume that, for t > t n , the values of β(t) are given by the following stochastic process where {W (s) : s > 0} represents the Wiener process also known as Brownian motion. Thus, setβ(0) = β(t n ), and β(t) =β(t − t n ). Based on the Euler-Maruyama scheme to solve the stochastic differential equation in Eq. (8), we have,β where ε has Gaussian distribution with mean zero and variance ∆t. Since the estimated values for β are given daily, we set ∆t = 1/365. The parameters κ, θ, and ξ are obtained from the set of estimated values for β, {β(t n−m+1 ), β(t n−m+2 ), . . . , β(t n )}, by minimizing the functional with respect to the parameters κ, θ, ξ, whereβ j (j = 1, . . . , m) is a function of κ, θ, ξ. It is worth noticing that, in the calibration,β 0 = β(t n−m ). . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted April 13, 2022. ; https://doi.org/10.1101/2022.04.12.22273804 doi: medRxiv preprint The quantities E β j and E β 2 j are deterministic and can be easily evaluated by the following recursive formulas: In the results, we shall use m = 30 days, which keeps the number of unknowns much smaller than the number of observations, and, in principle, allows the model to incorporate the most recent data. Next, we perform a series of numerical tests to compare the accuracy of the predicted accumulated numbers of infections obtained using the techniques described in Section 2. Firstly, the SEIR-type model in Eqs. (2)-(5) is calibrated using 7-day moving average time-series of the daily reports of COVID-19 infections and deaths from Chicago and New York City(NYC) in the United States (US), the Canadian province of British Columbia (BC), and the Brazilian state of São Paulo (SP). The datasets start mainly in the beginning of 2020 and end at the end of 2021. We consider predictions of infections for 10, 15, 30, 45, 60, and 90 days after the period used in the calibration. Predictions are made as, follows: we calibrate the model until some date t n using the reported infections until the date t n , then, for any t j > t n we evaluate the model predictions of infections using the five techniques proposed in Section 2. These predictions are updated daily until the end of the dataset. We compare the reported and predicted accumulated numbers of infections, evaluating the normalized error in the out-ofsample predictions as follows, denote by I REP the accumulated number of reported infections during the time interval of forecast and by I PRED the accumulated predicted infections for the same period, then, The error is evaluated for the daily out-of-sample predictions and classified by the number of days ahead and the technique used to generate the prediction. by out-of-sample we mean that the model calibration was performed with a dataset different from the one used to compare the model predictions. The daily errors are then used to evaluate median values and 70% confidence intervals (70% CI). The comparison between the errors median values can be found in Figure 1 . Table 1 presents the errors median values and their 70% CI. To evaluate the errors of the stochastic techniques, namely TFSWN and MR, we used the median values of 5000 samples of predicted infections. As Figure 1 and Table 1 show, all the methods presented similar error values for predicting 10 to 15 days ahead. Major differences started to appear from 30 to 90 days ahead predictions. Predictions made with the AM presented small median error values with moderate 70% CI even for prediction with larger time intervals. For NYC, Chicago, and BC, it presented one 6 . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted April 13, 2022. ; Figure 1 : Median values of the percentage error in the out-of-sample predicted accumulated numbers of infections. Predictions are based on calibrated data and evaluated from 10 to 90 days ahead using the methods of extending the transmission parameter β described in Section 2.3. The reports are different from those used in the calibration. of the smallest median error values for all prediction time intervals. For SP, this technique also presented accurate results, but for longer prediction time intervals, other techniques performed considerably better. Although it is the simplest technique considered in this work, its performance is remarkably accurate. The LR presented results similar to the AM technique. However, its errors were slightly larger than those for the AM using the datasets from NYC, Chicago, and BC. For SP, it presented a better performance than the AM. It is worth mentioning that such technique is also simple to implement and a nice alternative to the AM. The TFS and TFSWN presented similar results since their graphs matched for almost all places. They both performed better for SP, but all the other techniques outperformed both. The main reason for such results must be that truncated Fourier series fits the data quite well, which is useful for adherence to data, but may not be the case for model predictions. In other words, it seems that these techniques tend to overfit the data. However, these techniques seems to perform better for longer periods, since, in some sense, they capture better the log-term behavior of the transmission parameter. For Chicago and BC, the MR outperformed all the other techniques. For NYC it presented one of the best resutls for predictions until 30 days. For SP, it outperformed the other techniques only for predictions until 10 days. In general, for longer periods, like 60 to 90 days, the predictions 70% CIs were wider than those obtained with other techniques. Although, these results may indicate that such technique can be less accurate for larger prediction times, it was effective in predicting the magnitude of new waves of infections for NYC. In the second example, we focus on the forecast of large outbreaks that happened in New 7 . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted April 13, 2022. [32] . For MR and TFSWN, we use median values of infections. The comparison between reported and predicted accumulated infections during the periods of interest can be found in Figure 2 . Each column of panels refers to one outbreak. In the first wave of infections, the LR and MR presented the best forecast performance with median values close to the observed number of infections. The LR started to underpredict the magnitude of the wave after10 to 15 days of forecast, whereas the MR presented a wide 90%CI containing the reports. The AMpresented accurate predictions until 10 to 15 days ahead. After this period, the method over-predicted considerably the reported infections. The TFS and TFSWN forecast poorly the first outbreak probably due to its tendency for overfitting. It is worth noticing that, during the first wave, data was poorer. In the other three outbreaks, all the models presented similar results. The AM forecast well the magnitude of the three waves, despite of its 90% CI, for the second and fourth waves, did not contained the reported values for periods longer than 20 to 30 days. The LR presented more accurate predictions than the AM. It only overpredicted the the third wave. The TFS 8 . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted April 13, 2022. ; and TFSWN predicted well the magnitude of the second wave but underpredicted the other two. The MR predicted accurately all the three waves, with its median values matching the reports of the third wave. The 90% CI was not too wide and contained the reports. Overall, all the techniques performed well for shorter periods, i.e., from 10 to 15 days. For periods until 45 days, the AM, LR, and MR seem to perform considerably better than the TFS and TFSWN. However, for larger periods of forecast, predictions become less reliable and can be only considered qualitatively. Although the results of the AM, the LR, and the MR are comparable, the MR has the advantage of incorporating the randomness of the transmission, offering a series of possible future scenarios. Based on these examples we suggest that, for shorter periods of forecast, i.e., until 45 days, the AM and MR are reliable choices. For longer periods, their results must be compared with those given by the TFS or the TFSWN. The advantage of the TFSWN over the TFS, is the incorporation of uncertainty, that make it easy to provide predictions naturally inside confidence intervals. To illustrate that the results above can be extended to a larger class of SEIR-type models, . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted April 13, 2022. ; we repeat the first experiment with NYC data and the model proposed by Albani et al. [3] . The version of the model used here accounts for different levels of disease severity and their dependency on age-range. Figure 3 shows the evolution of the median values of the normalized forecast errors. The same values with their 70% CI can be found in Table 2 . Table 2 : Median values and their corresponding 70% confidence intervals of the percentage error in the predicted accumulated numbers of infections. Predictions are based on calibrated data and evaluated from 10 to 90 days ahead using the methods of extending the transmission parameter β described in Section 2.3. As Figure 3 and Table 2 show, by including more characteristics of the disease in the SEIR-type model, the forecast overall accuracy did not improve. We used all the forecast techniques from Section 2.3. All of them presented similar forecast errors. Such pattern may be linked to the increase of the uncertainties in model calibration, since more sophisticated models have larger sets of unknowns for the same experimental datasets, since NYC does not have daily reports of COVID-19 cases, hospitalizations, and deaths classified by age-range. Although the model used to test different predictions techniques may seem to be simple, by considering time varying transmission parameter and death rate, it incorporates naturally stochastic fluctuations in the pathogen transmission, like the introduction of new variants, changes in the disease virulence and mortality. In other words, providing an accurate nowcast. . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted April 13, 2022. More sophisticated models are harder to calibrate and to forecast scenarios within longer periods, it is necessary to predict the evolution of other quantities, like mobility information or the disease severity dependency on age. It is also difficult to estimate the causality relations between transmission and mobility [33, 34] . Moreover, as Figure 3 shows, the performance of more sophisticated models may not be considerably better. Thus, if the main interest is in scenarios regarding the general population, it is better to use simpler, but robust and accurate, models. Using the same dataset, predictions with different techniques presented considerably different performances in forecast. For shorter periods, the AM, the LR, and the MR outperformed the TFS and the TFSWN. It is also interesting that the MR predicted accurately the four waves of infections in NYC, as illustrated by Figure 2 . In all cases, the model prediction 90% CI contained the reported infections. This indicates that the MR incorporates better the uncertain nature of the virus transmission and shows that transmission parameter β must be modeled as a stochastic process. Reliability of model forecast decreases considerably with respect to the length of the prediction time interval, as Tables 1-2 show. This is due to unpredictable regime changes in the transmission dynamics, like the emergence of new strains of the pathogen, vaccination, or yet effectiveness of non-pharmaceutical measures like lockdowns. Thus, for periods longer than 45 days, predictions must be considered only qualitatively. Finally, the present study has some limitations since we considered only the class of SEIRtype models. There are a number of other ways to model the dynamics of infectious diseases [2] , but accounting for them is beyond the scope of this article that is intended to enlighten the stochastic nature of transmission. Moreover, we did not perform a rigorous statistical analysis of the transmission parameter and other rates defining the SEIR-type model in Eqs. (2)-(5) since the calibration techniques considered in this work are deterministic. The statistical analysis of SEIR-type models and their parameters will be subject of a future work. The datasets used in this study are publicly available [35] [36] [37] [38] . The codes in MATLAB (The MathWorks, Inc., Natick, USA) used in this work can be found in https://github.com/viniciusalbani/NowcastForecastCovid. . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted April 13, 2022. ; https://doi.org/10.1101/2022.04.12.22273804 doi: medRxiv preprint . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted April 13, 2022. ; https://doi.org/10.1101/2022.04.12.22273804 doi: medRxiv preprint A contribution to the mathematical theory of epidemics Modeling Infectious Diseases in Humans and Animals Estimating, Monitoring, and Forecasting the Covid-19 Epidemics: A Spatio-Temporal Approach Applied to NYC Data Data-based analysis, modelling and forecasting of the COVID-19 outbreak Metapopulation Network Models for Understanding, Predicting, and Managing the Coronavirus Disease COVID-19 Inferring change points in the spread of COVID-19 reveals the effectiveness of interventions Spread and dynamics of the COVID-19 epidemic in Italy: Effects of emergency containment measures Estimating clinical severity of COVID-19 from the transmission dynamics in Wuhan, China Nowcasting and forecasting the potential domestic and international spread of the 2019-nCoV outbreak originating in Wuhan, China: a modelling study Multi-generational SIR modeling: Determination of parameters, epidemiological forecasting and age-dependent vaccination policies Multi-scale modelling reveals that early super-spreader events are a likely contributor to novel variant predominance A stochastic differential equation SIS epidemic model Forecasting for COVID-19 has failed Three Quant Lessons from COVID-19. Risk Magazine New Global Coronavirus Death Forecast Is Chilling -And Controversial; 2020. Accessed on 09-01-2022. Goats and Soda STORIES OF LIFE IN A CHANGING WORLD 2020: A year of protests and civil disobedience The Quiet Hand of Conservative Groups in the Anti-Lockdown Protests; 2020. Accessed on 09-01-2022. The New York Times Protests during the 2020 COVID-19 lockdown in South Africa. MoLab Inventory of Mobilities and Socioeconomic Changes On the reliability of predictions on Covid-19 dynamics: A systematic and critical review of modelling techniques Evolution and epidemic spread of SARS-CoV-2 in Brazil Lockdowns result in changes in human mobility which may impact the epidemiologic dynamics of SARS-CoV-2 Assessing the effects of non-pharmaceutical interventions on SARS-CoV-2 transmission in Belgium by means of an extended SEIQRD model and public mobility data Assessing the spread of COVID-19 in Brazil: Mobility, morbidity and social vulnerability COVID-19 Underreporting and its Impact on Vaccination Strategies The Impact of COVID-19 Vaccination Delay: A Data-Driven Modelling Analysis for Chicago and New York City On the Role of Financial Support Programs in Mitigating the Sars-CoV-2 Spread in Brazil The Incubation Period of Coronavirus Disease 2019 (COVID-19) From Publicly Reported Confirmed Cases: Estimation and Application Baseline characteristics and outcomes of 1591 patients infected with SARS-CoV-2 admitted to ICUs of the Lombardy region A Connection Between Uniqueness of Minimizers and Morozovlike Discrepancy Principles in Tikhonov-type Regularization The Volatility Surface: A Practitioner's Guide Fitting dynamic models to epidemic outbreaks with quantified uncertainty: a primer for parameter uncertainty, identifiability, and forecasts. Infectious Disease Modelling SARS-CoV-2 in Argentina: Lockdown, mobility, and contagion Individual social contact data and population mobility data as early markers of SARS-CoV-2 transmission dynamics during the first wave in Germany-an analysis based on the COVIMOD study Covid-19 data from NYC COVID-19 data from Chicago COVID-19 data from Canada COVID-19 data from São Paulo Pesquisa e Inovação do Estado de Santa Catarina through the grants 01/2020 and 00002847/2021, respectively. RA acknowledges the financial support from Fundação Carlos Chagas Filho de Amparoà Pesquisa do Estado do Rio de Janeiro (FAPERJ) through the grants E-26/202.932/2019 and E-26/202.933/2019. EM acknowledges the financial support from Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq) and Fundação Butantan through the Grants 305544/2011-0 and 01/2020, respectively. JZ acknowledges the financial support from Khalifa University, CNPq, and Fundação Carlos Chagas Filho de Amparoà Pesquisa do Estado do All the authors contributed equally to this work. The authors declare no competing interests.