key: cord-1038675-why9gq45
authors: Sahai, Saumya Yashmohini; Gurukar, Saket; KhudaBukhsh, Wasiur R.; Parthasarathy, Srinivasan; Rempała, Grzegorz A.
title: A machine learning model for nowcasting epidemic incidence
date: 2021-11-27
journal: Math Biosci
DOI: 10.1016/j.mbs.2021.108677
sha: 1a1e6fbc2bec7fc7804f6e2b5decaa442555ce52
doc_id: 1038675
cord_uid: why9gq45

Due to delays in reporting, the daily national and statewide COVID-19 incidence counts are often unreliable and need to be estimated from recent data. This process is known in economics as nowcasting. We describe in this paper a simple random forest statistical model for nowcasting the COVID-19 daily new infection counts based on historic data along with a set of simple covariates, such as the currently reported infection counts, day of the week, and time since first reporting. We apply the model to adjust the daily infection counts in Ohio, and show that the predictions from this simple data-driven method compare favorably, both in quality and in computational burden, to those obtained from the state-of-the-art hierarchical Bayesian model employing a complex statistical algorithm. The interactive notebook for performing nowcasting is available online at https://tinyurl.com/simpleMLnowcasting.

The SARS-CoV-2 virus, first observed in the United States (USA) in January 2020 [1, 2, 3], is highly contagious [4] and has spread in both urban and rural regions [5, 6] of the USA. To gauge and combat the SARS-CoV-2 spread, governments and health organizations have set up public information systems such as COVID-19 dashboards [7, 8, 9, 10]. These dashboards are useful to brief the public [8] about the current state of COVID-19 in specific regions, make data-driven public health decisions [10], and improve transparency in governance [11].
Many of these dashboards show the number of daily new infections (daily incidence), where the infection count on a particular date refers to the number of people who started experiencing disease symptoms on that date (i.e., the onset date of illness). Whereas reporting onset dates is very useful from the viewpoint of contact tracing and disease spread monitoring, it is also challenging due to unavoidable delays [12, 13, 7]. These delays are often due to the time lags between experiencing initial symptoms and seeking care, receiving test results, and updating the statewide records [14, 15]. As a consequence, incidence reporting based on onset counts leads to under-counting of the present and most recent cases. Dashboards often explicitly warn about this problem [16, 17]. Figure 1 shows one such example from the COVID-19 dashboard maintained by the Ohio Department of Health (ODH) [7], where the region of possible under-reporting is marked with a grey rectangle.

The incomplete current count data poses huge challenges for both local and national healthcare policymakers as they strive to make difficult public health decisions (e.g., introducing lockdowns or curfews, or evaluating vaccination effects) in real time to limit the spread of the virus. The use of statistical methods to moderate the effects of incomplete data could help reduce uncertainty in public health decision-making during the COVID-19 pandemic and increase public awareness of the most recent disease trends.

[...] estimated $R_0$ in the Susceptible-Exposed-Infected-Recovered (SEIR) model [22] for nowcasting and forecasting the outbreak's size. The nowcasting problem for delayed reporting of COVID-19 cases is also addressed by Silva et al. [23] and Greene et al. [13] using a Bayesian smoothing approach [19], where the authors model the delayed number of reported cases with their proposed Markov counting processes.
In this paper, we propose a simple yet efficient machine learning model that addresses the problem of nowcasting in a way that is easily understood by non-experts and is therefore suitable for presenting to public health decision-makers. The only data our proposed model requires can be readily collected [...] value), the number of people who start experiencing COVID-19 symptoms on a particular date. We also show that our proposed model outperforms the state-of-the-art hierarchical Bayesian model [24] in terms of nowcasting accuracy while also being approximately 72,000 times faster. Our model predictions can also be used as input to other forecasting models, for instance the ones created for ODH [25] that forecast the future number of infections and the subsequent hospital burden in Ohio. Note that since the goal is to nowcast the state epidemic incidence curve, there is no accounting for non-symptomatic cases.

To perform our analysis, we used the public data available at the ODH COVID-19 dashboard, which is updated daily. It provides the daily partial incidence count, that is, the count of all individuals $i_{td}$ reported on a given day $t$ to be confirmed COVID-19 cases with the day of onset $d$, where $d \le t$. For our analysis we aggregated cases by the onset date to get the state-level progression of the onset reporting. This was done by pulling data from the dashboard every day: on each of the $t$ days on which we pull, the dashboard provides the counts for all onset dates $d$. Accordingly, the infection count $I_{TD}$ on a specific day $T$ for a given onset date $D$ is the number of cases $i_{td}$ with onset date $d = D$ that have been reported by day $t = T$, counted by means of an indicator function $\mathbb{1}_{i_{td}}$. Note that for a given $D$, $I_{TD}$ is non-decreasing as a function of $T$ and, assuming that it is also bounded, it has a limit as $T \to \infty$. This is illustrated in Figure 2, where we see that over the course of 52 days $I_{TD}$ becomes approximately constant.
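As a rough illustration of this aggregation step, the sketch below assumes that each daily pull of the dashboard yields rows of the form (county, onset date, case count); that row layout, the function names, and the numbers are all hypothetical and are not taken from the ODH data or from the authors' notebook. It sums the rows of every snapshot by onset date to obtain a state-level table of $I_{TD}$ values.

```python
from collections import defaultdict
from datetime import date


def aggregate_snapshot(rows):
    """Sum case counts over all rows that share an onset date (state-level count)."""
    totals = defaultdict(int)
    for _county, onset, count in rows:
        totals[onset] += count
    return dict(totals)


def build_progression(snapshots):
    """Return {(T, D): I_TD} for every snapshot day T and onset date D <= T."""
    table = {}
    for T, rows in snapshots.items():
        for D, count in aggregate_snapshot(rows).items():
            if D <= T:
                table[(T, D)] = count
    return table


# Made-up snapshots: keys are pull dates T, values are (county, onset date, count) rows.
onset = date(2021, 1, 9)
snapshots = {
    date(2021, 1, 10): [("Franklin", onset, 3), ("Cuyahoga", onset, 2)],
    date(2021, 1, 11): [("Franklin", onset, 6), ("Cuyahoga", onset, 4),
                        ("Franklin", date(2021, 1, 10), 1)],
    date(2021, 1, 12): [("Franklin", onset, 7), ("Cuyahoga", onset, 5),
                        ("Franklin", date(2021, 1, 10), 4)],
}

progression = build_progression(snapshots)

# Reported count for the onset date 2021-01-09 as seen on successive pull dates.
series = [progression[(T, onset)] for T in sorted(snapshots)]
print(series)
```

In this toy input the series for a fixed onset date grows from one pull to the next, which is the behaviour the text describes for $I_{TD}$ as $T$ increases.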
We denote the asymptotic stable value of $I_{TD}$ for an onset date $D$ by $I^s_D = \lim_{T \to \infty} I_{TD}$, and define $F_{TD}$ as the amount of undercounting for a specific $D$ on day $T$, given by $F_{TD} = (I^s_D - I_{TD}) / I^s_D$. We may think of $F_{TD}$ as a standardized measure of undercounting that is also robust to changes in incidence rates during the course of the pandemic. In what follows, we therefore consider $F_{TD}$ in place of $I_{TD}$. Note that although in general $F_{TD} \to 0$ as $T \to \infty$, this convergence is not necessarily monotone.

In order to cross-validate and measure the prediction testing error, the data to be used for nowcasting is split into a training and a validation (testing) set based on $t$, where all $F_{td}$ with $t < T_{\mathrm{train}}$ are in the former and those with $t > T_{\mathrm{train}}$ are in the latter.

Covariates. The model includes the following features to predict $F_{td}$.
• Days since data collection ($\Delta$). For any given infection count $I_{td}$ reported on day $t$ with onset date $d$, we define this feature as $\Delta_{td} = t - d$.
• Day of the week ($\omega_t$). This categorical variable denotes the day of the week for $t$, on which the data is being reported, $\omega_t \in \{\mathrm{Mo}, \mathrm{Tu}, \mathrm{We}, \mathrm{Th}, \mathrm{Fr}, \mathrm{Sa}, \mathrm{Su}\}$.
• Raw infection count ($I_{td}$). This is the daily partial incidence count for the pandemic, as described in equation 1.

Random forest regression. We train a random forest (RF) regression model [26] on the data partition defined in section 2.1 to predict $F_{td}$ from the covariates. Formally, we may write $F_{td} = f(\Delta_{td}, \omega_t, I_{td})$, where $f$ is the RF model.

Prediction of missingness $F_{td}$. Figure 4 shows the prediction of $F_{td}$ for different values of $\Delta_{td}$. As seen from the plot, the model predictions are close to the true $F_{td}$ when $\Delta > 4$. The good agreement at $\Delta = 0$ is trivial, since on the first date of collection $F_{td}$ is almost always close to 1.0 and thus easy to predict. It is also evident that the first 3-4 days of data collection are unreliable for predicting the correct $F_{td}$ and should therefore be used cautiously in the nowcasting predictions.

Actual count prediction.
Based on the prediction of $F_{td}$ and the currently observed count $I_{td}$, we use (4) to obtain the estimate of $I^s_d$, the stable value of the infection count for onset date $d$. The typical trends for four different days of the week can be seen in Figure 5. The infection count from the model predicts the stable value $I^s_d$ robustly after five days (starting from $\Delta = 5$), and in some cases even earlier. Figure 5 shows that, irrespective of the day of the week (Monday, Wednesday, Friday, Sunday), the model predicts the value of $I^s_d$ with good accuracy. We may also note that on Monday and Sunday the model predictions have higher uncertainty, likely due to the weekend slowdown in test processing.

[...] in the following, is more elaborate than ours as it also has a spatial component.

[Figure legend: testing data; training data; ODH data from 12-11; ODH data from 01-28; BM prediction; RF predictions]

[...] where $O_i$ is an offset of the logarithm of the population of county $i$, the spatio-temporal random variables $\alpha_{i,t}$ are the latent states of the process, the design vector $X_t$ indicates the day of the week, and the vector $\eta_i$ captures the day-of-the-week effect. It is assumed that $Y_{i,t}$ is only partially observed for time $t > T_{\max} - D$, where $T_{\max}$ stands for the last onset date and $D$ (assumed 30 in [24]) is the maximum reporting delay following onset. BM also uses a semi-local linear trend model [27] for the spatio-temporal random variables $\alpha_{i,t}$. Further, the spatial correlation is accounted for using an intrinsic conditional auto-regressive model. The reporting delay is described by a Multinomial-Dirichlet model as follows. Denoting by $Z_{i,t,d}$ the count of cases in county $i$ with onset date $t$ that are observed $d$ days after $t$, one defines $Z_{i,t} = (Z_{i,t,0}, Z_{i,t,1}, \ldots, Z_{i,t,D})$.
Then, the Multinomial-Dirichlet model prescribes [...]

In Figures 6 and 7 we visually compare the nowcasts of the two models and see, in particular, that the RF enjoys narrower uncertainty bounds and less bias than the corresponding BM model. In order to quantify this difference more formally, we calculate the $L_2$ distance between the predictions made by the RF and the Bayesian model, respectively, and the actual known stable values in the Ohio COVID-19 daily counts dataset. We report the ratio of the two $L_2$ distance values as a measure of relative closeness of the models to the true (stable) data value for days $T - 10$ to $T$ and $T - 10$ to $T - 5$, where $T$ is the last available date in the data. The results are presented in Table 2. As can be seen in the table, the predictions by the random forest model are relatively closer to the true values than those generated using the Bayesian model estimates. The ratio is smaller in the full 10-day window, indicating that the RF model makes better predictions than BM for days that are close to data collection.

We presented here a simple method for nowcasting COVID-19 cases from historic data on the daily incidence of new cases, as measured by the onset of symptoms. This type of data is now widely available for all states in the USA as well as for most countries in the world. When the need to make immediate decisions on governance or policy arises, nowcasting can be a useful tool in providing more accurate estimates of disease incidence and spread.

Specifically, our proposed nowcasting algorithm uses a random forest (RF) regression methodology and leverages covariates that are based on the day of the week, the number of days passed since first data collection, and the total incidence so far. The proposed algorithm is both conceptually simple and computationally efficient. Our results also suggest that it compares favorably with a much more elaborate Bayesian model.
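The approach summarized above can be sketched end to end on synthetic data. This is a minimal illustration under stated assumptions, not the authors' notebook: it assumes the undercount fraction takes the form $F_{td} = 1 - I_{td}/I^s_d$, so that a nowcast is obtained as $I_{td}/(1 - \hat{F}_{td})$; it uses a small hand-written ensemble of bootstrapped regression trees with random feature subsets in place of a library random forest; it encodes the day of the week as an integer rather than as a categorical variable; and the reporting curve, counts, and all function names are invented for the example.

```python
import math
import random
from datetime import date, timedelta


def sse(values):
    """Sum of squared deviations from the mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def fit_tree(X, y, rng, depth=0, max_depth=5, min_leaf=3, n_try=2):
    """Grow one regression tree; every split looks at a random subset of features."""
    if depth >= max_depth or len(y) < 2 * min_leaf:
        return sum(y) / len(y)  # leaf: mean undercount fraction
    best = None
    for j in rng.sample(range(len(X[0])), n_try):
        for thr in sorted({row[j] for row in X})[:-1]:
            left = [v for row, v in zip(X, y) if row[j] <= thr]
            right = [v for row, v in zip(X, y) if row[j] > thr]
            if min(len(left), len(right)) < min_leaf:
                continue
            score = sse(left) + sse(right)
            if best is None or score < best[0]:
                best = (score, j, thr)
    if best is None:
        return sum(y) / len(y)
    _, j, thr = best
    lo = [(r, v) for r, v in zip(X, y) if r[j] <= thr]
    hi = [(r, v) for r, v in zip(X, y) if r[j] > thr]
    return (j, thr,
            fit_tree([r for r, _ in lo], [v for _, v in lo], rng, depth + 1, max_depth, min_leaf, n_try),
            fit_tree([r for r, _ in hi], [v for _, v in hi], rng, depth + 1, max_depth, min_leaf, n_try))


def predict_tree(node, x):
    """Follow the splits down to a leaf and return its stored mean."""
    while isinstance(node, tuple):
        j, thr, left, right = node
        node = left if x[j] <= thr else right
    return node


def fit_forest(X, y, n_trees=15, seed=0):
    """Fit each tree on a bootstrap resample of the training rows."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(y)) for _ in y]
        forest.append(fit_tree([X[i] for i in idx], [y[i] for i in idx], rng))
    return forest


def predict_forest(forest, x):
    """Average the predictions of all trees."""
    return sum(predict_tree(tree, x) for tree in forest) / len(forest)


# Synthetic reporting data: rows are (t, delta, weekday, I_td, stable value).
# The stable value is known here by construction; with real data only onset
# dates old enough to have stabilized could be labeled this way.
start = date(2021, 1, 1)
rows = []
for d in range(40):
    stable = 80 + 5 * (d % 7)
    for t in range(d, 46):
        delta = t - d
        i_td = round(stable * (1 - math.exp(-0.4 * delta)))  # invented delay curve
        weekday = (start + timedelta(days=t)).weekday()
        rows.append((t, delta, weekday, i_td, stable))

T_TRAIN = 30  # split on the reporting day t, as described in the text
train = [r for r in rows if r[0] < T_TRAIN]
test = [r for r in rows if r[0] > T_TRAIN]

forest = fit_forest([[r[1], r[2], r[3]] for r in train],
                    [1 - r[3] / r[4] for r in train])  # target: assumed F_td


def nowcast(delta, weekday, i_td):
    """Invert the predicted undercount fraction to estimate the stable count."""
    f_hat = min(predict_forest(forest, [delta, weekday, i_td]), 0.99)  # avoid dividing by zero
    return i_td / (1 - f_hat)


t, delta, weekday, i_td, stable = test[-1]
print(delta, i_td, round(nowcast(delta, weekday, i_td), 1), stable)
```

The tree code is deliberately small so that the example has no dependencies; in practice a library implementation of random forest regression would be the natural choice, and the guard on `f_hat` is one simple way to handle reporting days on which almost nothing has been observed yet.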
We have illustrated the application of our approach on publicly available data on COVID-19 daily onsets in Ohio, as available from the state's COVID-19 interactive dashboard. We observed that the model is able to predict the final incidence for a day within 3 to 4 days of data collection. We also find that the number of days passed since first data collection, along with its transformations (or derivatives), is the most important covariate in predicting the final incidence.

The proposed model learns from the specific epidemic curve (in our case [...]) and depends on how this curve is updated. In our study, we have nowcasted epidemic incidence for Ohio. The process of updating is highly dependent on the part of the country, population density, availability of testing, and reporting by local health departments. It is likely that data from a different geographic region will lead to a different learned model. There could be some level of nowcasting similarity between different geographical regions of the country, and our method could be used to help identify such cases. This is a potential follow-up to our work.

In order to make our RF method predictions broadly available to interested researchers and practitioners, we have created a publicly available and accessible interactive notebook (see below). As described in the repository, the notebook allows one to use our algorithm to nowcast current COVID-19 onset occurrences based on any user-provided historic data supplied in the appropriate format.

The problem of nowcasting historic data is an important one, especially during the current COVID-19 pandemic, when delays in reporting can snowball into sub-optimal policies and actions that can cost lives and create unnecessary societal burden. Our proposed method allows both the general public and health providers to carefully monitor the pandemic trends and make informed decisions.
The ideas we presented, while focused on COVID-19, can be broadly applicable to similar public health problems in the future.

The interactive self-contained notebook for performing the nowcasting using the random forest approach described in the paper, along with installation instructions, is freely available at https://zenodo.org/badge/latestdoi/346708110. Additionally, the web-based version of the interactive notebook is available at https://tinyurl.com/simpleMLnowcasting.

The work of WKB was supported by the President's Postdoctoral Scholars Program (PPSP) of the Ohio State University. We would like to thank Harley Vossler for providing helpful feedback on the interactive notebook.

Cryptic transmission of SARS-CoV-2 in Washington State
First Travel-related Case of 2019 Novel Coronavirus Detected in United States
Coast-to-coast spread of SARS-CoV-2 during the early epidemic in the United States
Transmission of SARS-CoV-2: implications for infection prevention precautions
COVID-19 Data Dashboard
Progression of COVID-19 from urban to rural areas in the United States: a spatiotemporal analysis of prevalence rates
Impacts of the COVID-19 pandemic on rural America
Ohio Department of Health COVID-19: Data
Tracking COVID-19 in California
Utah Department of Health, Phased Guidelines for the General Public and Businesses to Maximize Public Health and Economic Reacti-
Trust and COVID-19: Implications for interpersonal, workplace, institutional, and information-based trust
Overcoming reporting delays is critical to timely epidemic monitoring: The case of COVID-19 in New York City
Nowcasting for real-time COVID-19 tracking in New York City: An evaluation using reportable disease data from early in the pandemic
Covid-19 Data Reporting System Gets Off to Rocky Start
COVID-19 Update: Antigen Testing, K-12 Education Update
World Health Organization, WHO Coronavirus Disease (COVID-19) Dashboard
COVID-19 Data Dashboard
Nowcasting the number of new symptomatic cases during infectious disease outbreaks using constrained P-spline smoothing
Nowcasting by Bayesian smoothing: A flexible, generalizable model for real-time epidemic tracking
Adjustments for reporting delays and the prediction of occurred but not reported events
Nowcasting and forecasting the potential domestic and international spread of the 2019-nCoV outbreak originating in Wuhan, China: a modelling study
Seasonality and period-doubling bifurcations in an epidemic model
Population-based seroprevalence of SARS-CoV-2 and the herd immunity threshold in Maranhão
A Bayesian spatio-temporal nowcasting model for public health decision-making and surveillance
Infectious Disease Institute (IDI) COVID-19 Response Modeling Team at The Ohio State University, Predicting COVID-19 Cases and Subsequent Hospital Burden in Ohio
Machine Learning
Inferring causal impact using Bayesian structural time-series models

This research was partially funded by NSF grants DMS-1853587 and DMS-