key: cord-0861458-gb46gdw3 authors: Kline, David; Hyder, Ayaz; Liu, Enhao; Rayo, Michael; Malloy, Samuel; Root, Elisabeth title: A Bayesian spatio-temporal nowcasting model for public health decision-making and surveillance date: 2021-02-08 journal: American journal of epidemiology DOI: 10.1093/aje/kwac034 sha: a7de7f6dc0e88599f37482f55e26aea5f6e64043 doc_id: 861458 cord_uid: gb46gdw3 As COVID-19 spread through the United States in 2020, states began to set up alert systems to inform policy decisions and serve as risk communication tools for the general public. Many of these systems, like in Ohio, included indicators based on an assessment of trends in reported cases. However, when cases are indexed by date of disease onset, reporting delays complicate the interpretation of trends. Despite a foundation of statistical literature to address this problem, these methods have not been widely applied in practice. In this paper, we develop a Bayesian spatio-temporal nowcasting model for assessing trends in county-level COVID-19 cases in Ohio. We compare the performance of our model to the current approach used in Ohio and the approach that was recommended by the Centers for Disease Control and Prevention. We demonstrate gains in performance while still retaining interpretability using our model. In addition, we are able to fully account for uncertainty in both the time series of cases and in the reporting process. While we cannot eliminate all of the uncertainty in public health surveillance and subsequent decision-making, we must use approaches that embrace these challenges and deliver more accurate and honest assessments to policymakers. The first cases of SARS-CoV-2 in the United States were reported in early March [1] , though recent phylogenetic evidence suggests the first introductions occurred in January 2020 [2, 3, 4] . As COVID-19 spread throughout the country, states began to set up risk alert systems to support data-driven decision-making, improve government accountability, and communicate health risks to the public [5] . The goal of such systems is to provide clear and consistent messaging around the current state of the COVID-19 pandemic and help people adopt protective behaviors while policymakers implement appropriate structural changes to mitigate spread. Risk or public health alert systems typically develop a series of indicators which use various sources of surveillance data [5] . In some states, these systems were linked to specific policy actions [6] while in others they serve more as a risk communication tool to inform local health departments and the general public [7] . In most systems, several key indicators are tied to the reporting of confirmed COVID-19 cases and their onset date of illness (i.e., the date an individual first began to have symptoms) [8] . However, chronic delays in outbreak investigation and case reporting have led to challenges in estimating case-based indicators and communicating the situation in a location in near real-time. Issues related to reporting lag or reporting delay are not a new challenge in public health surveillance [9, 10, 11] . It is quite common for reporting in infectious disease and vital statistics systems to not occur instantaneously with the onset or occurrence of the event of interest. For infectious diseases, this delay can be due to: 1) a prolonged interval between the time an individual recognizes symptoms and is able to seek care and receive confirmatory testing, 2) administrative backlogs and delays in the acquisition, processing, and ultimate reporting of information, and 3) the length of time necessary to conduct a full case investigation. However, particularly when facing a fast-moving epidemic, important decisions need to be made in real-time despite the fact that the most recent information is likely incomplete. This added uncertainty can reduce the confidence of both policymakers and the public in the public health decision-making process. Methodology is needed to help provide a clearer picture to decision-makers in the face of the uncertainty from delays in reporting. To address this issue and build on the foundational methodology [9, 10, 11] , a relatively recent literature around "nowcasting" has emerged for delayed reporting. In contrast to forecasting which focuses on estimating what could happen in the future, nowcasting focuses on estimating what has already happened but has not yet been reported. Nowcasting leverages historical patterns in reporting and trajectories of the disease outcome to estimate current counts given partially reported values. To enhance model flexibility and interpretability, recent work [12, 13, 14] has extended prior work for nowcasting time series [15] and aberration detection [16] within a Bayesian framework. This work has been applied to estimate COVID-19 deaths in regions of the United Kingdom [17] and to incorporate spatial dependence [18, 19] . In addition, simulation modeling approaches have also been used for nowcasting [20, 21, 22] . In contrast to much of the current epidemiological work that relies on the specification of splines to capture trends, Bayesian structural time series can be specified as hierarchical autoregressive processes [23] . Given the link between autoregressive processes and infectious disease dynamics, we propose a spatial extension of the Bayesian structural time series model to nowcast county-level counts of confirmed COVID-19 cases in Ohio while accounting for reporting delay. Despite prior and current literature on methods for accounting for reporting delay, these methods have not been fully embraced in practice. The purpose of this paper is to highlight the critical need to account for reporting lag and other potential daily reporting patterns when assessing whether case rates are increasing. This is important because an increase in case rates is an indicator in many states' alert systems, including the Ohio Public Health Alert System (OPHAS) [7] , and can also serve as an early warning signal of disease spread. We apply our method to OPHAS Indicator 2 which measures "an increasing trend of at least 5 consecutive days in overall cases by onset date over the last 3 weeks" [7] . Ohio adopted a 21 day "look-back" period in an attempt to manually curtail the effect of reporting delays. We develop an extension of a Bayesian structural time series model that incorporates spatial dependence across counties and flexibly captures temporal dynamics with an autoregressive structure. We use case data from earlier in the pandemic that is now fully reported so the true trends can be determined for each county in Ohio. We then compare indicators based on the method currently used in Ohio, the method suggested by the Centers for Disease Control and Prevention (CDC) [8] , and our Bayesian approach. We used data on confirmed cases of COVID-19 in the state of Ohio which are captured by the Ohio Disease Reporting System and reported publicly [24] . In Ohio, case investigation is done by local, typically county, health departments and entered into the state system. Confirmed cases are defined as individuals who have a positive result on a laboratory molecular amplification test [1] or other approved testing methods. For each individual case, the system records the county of residence and the onset date of illness as determined by case investigators. If onset date is unknown, the system records the earliest date associated with the record. Onset date currently provides the index date for all reporting and analysis at the state-level in Ohio. The reporting date is defined as the first date at which a case appears in the system and is often several days or possibly weeks after the onset date. Thus, when examining case counts by onset date, counts for the most recent days are incomplete because of the delay between onset date and reporting date. The reporting delay can also be impacted by system strains due to case volume and daily variation in reporting that differ by local health department. To explore the impact of reporting patterns on the calculation and subsequent interpretation of public health alert indicators, we retrospectively consider four points in time during the pandemic: June 15, 2020, July 15, 2020, August 15, 2020, and September 15, 2020. At the time of the analysis, at least one full month had passed since September 15, and we assume that case reporting was complete through this date. For each date, we examine cases reported by that date and compute indicators related to the trends in case counts. Since the data are completely reported, we can compare the estimates from the indicators to the true trend observed in the onset cases at that point in time. This will allow us to examine the performance of each proposed approach for determining if a county is experiencing an increasing trend of cases. We refer to the current approach for determining if case rates are increasing used by the Ohio Department of Health as the rolling average approach [7] . This approach computes a 7 day rolling average of case counts, indexed by onset date, for each of the last 21 days. The alert indicator for an increasing trend in cases is flagged if there are 5 consecutive days of increasing averages at any point in the 21 day window. That is, the indicator flags if for 5 consecutive days the average is greater than the average the day before. This approach crudely accounts for daily reporting variation by averaging across 7 days but makes no attempt to account for reporting lag or any other sources of variation. A slightly more sophisticated but still simple approach was recommended by the CDC for detecting rebounds [8] and will be referred to as the spline approach. This approach is similar to the rolling average approach described above but fits a spline to the time series of rolling averages. For consistency, we used 7 day rolling averages over a 21 day period to align with the temporal window of interest for the alert system. We fit a cubic spline [8] to each series with 4 knots. By using a spline, we are able to smooth daily and other systematic variation in reporting patterns. Aligned with the CDC [8] , we determine if there is an increasing trend by looking at the fitted values from the spline and determining if there are any 5 consecutive day periods where the fit for each day is greater than the previous day. Like the rolling average approach, uncertainty is not incorporated into the decision-making process. Splines were estimated using the mgcv package in R [25] . In contrast to the simpler approaches, we explicitly model both the process for new onset cases and the reporting delay process. We extend the general framework outlined by previous work [12, 18] by using an autoregressive spatial Bayesian structural time series, rather than a spline based model. While the spline based model is flexible, it relies on reasonably specifying knots and is not ideal for estimating beyond the range of the observed data. In addition, it can be more challenging to incorporate hierarchical structure when temporal trends may be quite different across locations, which has been the case for COVID-19. Instead, an autoregressive structure retains the ability to flexibly capture spatio-temporal trends while also linking more closely to the dynamics of infectious disease [26] . It also allows for added flexibility in specifying a spatially varying reporting delay process. We follow the general set up outlined in previous work [12, 16] . In Ohio, COVID-19 cases are reported daily so we use a daily time scale. To reduce computation time, we will take a moving window approach [14] that considers the past 90 days (T = 90). From April through September 2020, 94% of cases were reported within 2 weeks of onset and 98% of cases were reported within 30 days. To be conservative, we set a maximum reporting delay time of 30 days following onset Outcome Model. Let Y it be the count of reported cases in county i = 1, . . . , N with onset date t = 1, . . . , T . Note that Y it is assumed to be the true total count, which is assumed to be partially observed for time t such that t + D > T . We assume where O i is an offset of the log population of county i, α it is the latent state of the process, X t is a design vector indicating the day of the week, and η i is the day of the week effect. Note that X t is parameterized using sum to 0 effect coding so α it reflects the average of the process across days of the week. By using this structure for the model, we are able to remove daily reporting variation from the latent state, α it , through X t η i . After removing the daily "seasonal" variation, we focus on the model for the latent state or structural part of the model. We use a semi-local linear trend model [27] to allow for some degree of longer term structure while still facilitating a very flexible model. That is for t > 1, α ) and the initial value at t = 1 is α i1 ∼ N (0, 100) . Then for the model for trend, we let where δ is a common statewide trend, d i is a county-specific spatial trend, and ρ δ is an autoregres- . A benefit to this parameterization is it allows us to separate changes that are due to white noise ( α it ) from those that are due to more consistent temporal trends (δ it ). By using a stationary model for δ, we are able to provide some structure around a longer term trend while retaining flexibility for local deviations in space and time. To account for spatial correlation, we assume the trends in neighboring counties are correlated and specify an intrinsic conditional autoregressive model. That is, where d −i is the set of counties excluding county i, w ij is an indicator of whether counties i and j are adjacent, w i+ = j =i w ij , and τ 2 d is a variance. To ensure a valid process model, we enforce a sum to 0 constraint on the d i [28] . We chose to incorporate spatial dependence in the trend to reflect a belief that cases in a county are likely to change in a similar fashion as cases in neighboring counties. This choice explicitly aligns with our general surveillance and risk evaluation strategy for counties where we have implicitly considered trends in neighboring counties when making our assessments. Another added benefit is that this helps to stabilize estimates for counties with small populations by borrowing strength from neighboring counties. We also assume county-specific effects of the day of the week. We assume that while variability exists between counties, the daily patterns are similar across the state. We assume the following hierarchical model where η i is a vector of state average day of the week effects, τ 2 η is a variance, and I 6 is a 6 × 6 identity matrix. This allows each county to have its own daily pattern while borrowing strength across all counties in the state as warranted. Reporting Model. Since we know that Y it is observed with reporting lag, we must specify a model for the delay. Let Z itd be the count of cases observed in county i with onset date t that are observed d = 0, . . . , D days after t. Note that Z itd corresponds to when cases are reported d days after onset date t and so is unobserved when t + d > T . We assume where Z it = (Z it0 , . . . , Z itD ), p it is the vector of proportions of the total Y it reported on each of the D days, and GD is the generalized Dirichlet distribution. We use a generalized Dirichlet distribution to properly account for potential overdispersion of the p it [12] . This leads to the following conditional distribution: is set of counts reported with a delay that is not d days. To model more intuitive quantities, we reparameterize the distribution [12] in terms of the mean ν itd and dispersion φ d such that α itd = ν itd φ d and β itd = (1 − ν itd )φ d . Then similar to a hazard function, we let logit(ν itd ) = ψ itd and assume the following AR1 model where β d is the average log odds of remaining cases being reported by delay d, V td is a design matrix indicating the day of the week, ξ i is a day of the week effect, ρ ψ is an autoregressive parameter, and ψ itd is an error term. We assume ψ itd iid ∼ N (0, τ 2 ψ ). Note that V td is parameterized using sum to 0 effect coding. The parameterizaton of the delay model allows us to accommodate several important features of COVID-19 reporting and should, in general, be customized to reflect the actual reporting process. First, reporting in Ohio is done by county health departments who may have varying capacity and resources for timely reporting. Thus, the delay model is county-specific. We account for day of the week effects, much like in the model for the case counts, because in many counties, reporting primarily aligns with the work week. We also assume autoregressive temporal dependence to capture the potential for administrative backlogs. For example, if a smaller portion of cases are reported today, we may also expect a smaller proportion the next day because of a backlog. We do not incorporate a term to account for spatial dependence in the delay model as we assume neighboring health departments are independent agencies, and so we would not anticipate spatial structure. As with the outcome model, we allow for county-specifc variability in day of the week reporting effects. We again assume similar patterns across the state and specify the following hierarchical model: where ξ is a vector of state average day of the week effects, τ 2 ξ is a variance, and I 6 is a 6 × 6 identity matrix. Prior Model and Computation. Since we fit our model in the Bayesian paradigm, we must specify prior distributions on all unknown parameters. For each element of η and ξ, we assign independent normal priors with 0 mean and variance 1. We also assign δ a normal prior with 0 mean and variance 1. We use a variance of 1 for these prior distributions as each parameter reflects a relative daily difference on the log scale, and so these priors reflect a reasonable range for those parameters. We assign β d independent normal priors with mean 0 and variance 4, which puts adequate probability on reasonable values on the logit scale. We also assign all variance parameters inverse gamma priors with shape and scale both set to 0.5. All autoregressive parameters are assigned uniform prior distributions over -1 to 1. To compare across approaches, we fit the model for each of the four dates considered. We treat the last day in the series (i.e., the current date) as missing and forecast the expected case count, which reduces model instability due to the rarity of cases reported on the day of onset (d = 0). The model was fit using a Markov chain Monte Carlo algorithm implemented in R using nimble [29] . The algorithm was run for 30,000 iterations with the first 15,000 discarded as burn-in and then thinned by keeping every 10th iteration. Computation time was approximately 20 hours, which would enable a daily update in practice. To determine whether the cases were increasing in the most recent 21 day period, we use the posterior distribution of δ it . Since δ it reflects the trend in county i at time t, there is a net increasing trend over the past 21 days if T t=T −20 δ it > 0. Using the posterior distribution, we can directly compute the posterior probability of an increasing trend for each county. One major advantage of a model-based approach is the flexibility to address more complex questions of interest. However, the goal of this paper is to assess the method used to calculate the OPHAS indicator for when cases are increasing in a county. To most closely align with the question as currently posed by the state of Ohio, we define a true increase in cases as when the number of cases in the most recent 7 day period is greater than the number of cases two weeks prior. While there are other potential ways to define a true increase, this most closely reflects the current definition used by the state of Ohio. The results from applying each of the three methods for calculating increasing case rates are shown in Figure 1 . There are several general observations that can be made across the four time points. The rolling average indicator generally does a poor job at accurately capturing counties where the cases have increased, and in most counties, there were true increases that went undetected. The spline indicator tends to make errors in the other direction by incorrectly flagging counties that did not meet the definition of a true increase. For the model-based approach, we generate a posterior probability of an increasing trend and highlight counties in yellow with a probability greater than 0.7 and in red those with a probability greater than 0.9. In addition to visually examining the results, we calculated sensitivity and specificity for each approach in Table 1 . The rolling average approach currently in use has a very low sensitivity of 0.20 and so is not successfully identifying counties with increasing trends. The spline approach has a much higher sensitivity of 0.87 but at the cost of a specificity of 0.48. Three cut points are shown for the model-based posterior probabilities. As expected, the higher thresholds exhibit excellent specificity but lower sensitivity since it reflects stronger evidence of an increase. Using a cut-point of 0.5, which reflects that the trend is more likely increasing than decreasing, we estimate The model-based approach also provides a rich set of additional results that can provide useful insights. Typically, the main goal of these models is to nowcast case counts. In Figure 2 , we show nowcast estimates with their 90% credible interval in black and the true counts in red for an urban and rural county. Averaging across the 4 time points, the 90% credible interval coverage was 0.96 over the 30 day period with incomplete reporting. The coverage was 0.92 in the most recent 7 days which have the most incomplete reporting. Thus, our model performs as expected for nowcasting cases. In Figure 2 , we also show time series plots of the latent state, which removes the daily seasonality, and the trend. The trend can also be viewed as the derivative of the latent state curve so when it is greater than 0, it indicates increasing case counts. We applied three approaches for assessing increasing trends in cases to completely observed data at four time points during the COVID-19 pandemic. When assessments are linked to onset date, case reporting is subject to reporting lag or delay. We illustrate that the simple approach currently used in OPHAS does not perform well as it fails to account for lag and other variation in reporting. The spline approach outlined by the CDC is more sensitive as it smooths over daily reporting variation but also fails to account for lag. In contrast, the model-based approach accommodates lag, daily variation, and spatio-temporal dependence. The model-based approach can also directly summarize observed evidence of increasing trends and the associated uncertainty through posterior distributions. This results in a better trade-off between sensitivity and specificity and can allow for prioritization of areas where the evidence of an increase is strongest. We note several key advantages to the model-based approach. First, the Bayesian approach allows us to use calculated posterior summaries to directly communicate uncertainty. Public health officials are constantly considering trade offs between different policy options -e.g., stay-at-home Since the posterior probability reflects the probability of an increasing trend given the observed data, this quantity can be used to directly address the policy question of interest and provides an indication of how strong the evidence is in each county. Unlike the spline or 7-day rolling average approaches that return a binary decision, the ability of the model-based approach to convey additional meaning through continuous estimates is a clear advantage that can improve decisionmaking [30, 31] . Second, by accounting for reporting delays and fully exploiting partially reported counts, the Bayesian approach can be more responsive to changing trends and provide earlier warning of changes in trends. Finally, the output from the Bayesian models (shown in Figure 2 ) provide important additional information that can be used by surveillance teams to understand trends over time. These results do require a team of epidemiologists to review the data, but still provide more information than the spline or 7-day moving average methods. When responding to a pandemic, it is important that the public health and policy response is guided by the best available information. Often even the best information can be incomplete and uncertain. However, statistical models have been developed to overcome these issues and aid in characterizing and quantifying uncertainty. These models are not as simple as the approach currently used in Ohio, and this is one limitation of this method. Risk alert systems should be transparent and easy to understand. Complex modeling approaches are difficult to explain to the general public and can lead to mistrust in the data and, by extension, the system as a whole. However, with proper preparation, the model output can be summarized to simply communicate the core messages, while leaving much of the complexity and technical details to the experts implementing the model. Additionally, the Bayesian models provide a wide range of information that can be used internally by epidemiologists and other public health data scientists to directly address important policy questions. Given the clear improvements our Bayesian models offer, it is imperative that we take advantage of these methodological advances to better serve the public and inform the distribution of limited resources. In conclusion, we have illustrated shortcomings in using simple approaches for public health decision-making. We have also illustrated how more sophisticated statistical models can account for the real-world complexities associated with surveillance data. Despite the added complexity, the output from these models can be summarized in a relatively simple and concise form that still appropriately reflects uncertainty. While we cannot eliminate all of the uncertainty in public health surveillance and decision-making, we must use approaches that embrace these challenges and deliver more accurate and honest assessments to policymakers. COVID-19) 2020 Interim Case Definition Substantial underestimation of SARS-CoV-2 infection in the United States The emergence of SARS-CoV-2 in Europe and North America Science Coast-to-Coast Spread of SARS-CoV-2 during the Early Epidemic in the United States Cell Resolve to Save Lives . Staying Alert: Navigating COVID-19 Risk Toward a New Normal 2020 Utah Department of Health . Phased Guidelines for the General Public and Businesses to Maximize Public Health and Economic Reactivation Ohio Department of Health . Summary of Alert Indicators 2020 gov/static/OPHASM/Summary-Alert-Indicators. pdf Activities and Initiatives Supporting the COVID-19 Response and the President's Plan for Opening America Up Again Statistical methods for short-term projections of AIDS incidence Inference Based on Retrospective Ascertainment: An Analysis of the Data on Adjustments for reporting delays and the prediction of occurred but not reported events Multivariate hierarchical frameworks for modeling delayed reporting in count data Nowcasting the Number of New Symptomatic Cases During Infectious Disease Outbreaks Using Constrained P-spline Smoothing Epidemiology Nowcasting by Bayesian Smoothing: A flexible, generalizable model for real-time epidemic tracking Bayesian nowcasting during the STEC O104:H4 outbreak in Germany Bayesian outbreak detection in the presence of reporting delays Nowcasting CoVID-19 Deaths in England by Age and Region medRxiv A Hierarchical Modelling Framework for Correcting Delayed Reporting in Spatio-Temporal Disease Surveillance Data arXiv Bayesian spatiotemporal modeling with sliding windows to correct reporting delays for real-time dengue surveillance in Thailand Real-time influenza forecasts during the 2012-2013 season Combining Search, Social Media, and Traditional Data Sources to Improve Influenza Surveillance Nowcasting and forecasting the potential domestic and international spread of the 2019-nCoV outbreak originating in Wuhan, China: a modelling study The Lancet Predicting the present with Bayesian structural time series Ohio Department of Health . COVID-19 Dashoard 2020 Generalized Additive Models: An Introduction with Synchrony, Waves, and Spatial Hierarchies in the Spread of Influenza Science Inferring causal impact using Bayesian structural time-series models Hierarchical modeling and analysis for spatial data Programming with models: writing statistical algorithms for general model structures with NIMBLE Comparing the Effectiveness of Alerts and Dynamically Annotated Visualizations (DAVs) in Improving Clinical Decision Making Human Factors: the Alarm system management: evidence-based guidance encouraging direct measurement of informativeness to improve alarm response