key: cord-0973228-1y0bscqk authors: Wesner, Jeff S.; Van Peursem, Dan; Flores, José D.; Lio, Yuhlong; Wesner, Chelsea A. title: Forecasting Hospitalizations Due to COVID-19 in South Dakota, USA date: 2021-05-01 journal: J Healthc Inform Res DOI: 10.1007/s41666-021-00094-8 sha: 69b99e6309c9a085ad35420e74479b7c51765da2 doc_id: 973228 cord_uid: 1y0bscqk Anticipating the number of hospital beds needed for patients with COVID-19 remains a challenge. Early efforts to predict hospital bed needs focused on deriving predictions from SIR models, largely at the level of countries, provinces, or states. In the USA, these models rely on data reported by state health agencies. However, predicting disease and hospitalization dynamics at the state level is complicated by geographic variation in disease parameters. In addition, it is difficult to make forecasts early in a pandemic due to minimal data. Bayesian approaches that allow models to be specified with informed prior information from areas that have already completed a disease curve can serve as prior estimates for areas that are beginning their curve. Here, a Bayesian non-linear regression (Weibull function) was used to forecast cumulative and active COVID-19 hospitalizations for SD, USA, based on data available up to 2020-07-22. As expected, early forecasts were dominated by prior information, which was derived from New York City. Importantly, hospitalization trends differed within South Dakota due to early peaks in an urban area, followed by later peaks in rural areas of the state. Combining these trends led to altered forecasts with relevant policy implications. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1007/s41666-021-00094-8. The novel coronavirus (SARS-CoV-2) was first detected in December 2019 in Wuhan, China, and has since spread globally. The disease caused by SARS-CoV-2 (COVID-19) can lead to hospitalization or death in all age groups, but particularly in older age groups with comorbidities such as hypertension, obesity, and diabetes [1] . For example, in France, 3 to 4 % of infected people become hospitalized, but that rate varies from a low of 0.001 to 10 .1% for individuals that are <20 years old versus those >80 years old, respectively [2] . A central challenge for hospitals is predicting how many hospitalizations will occur due to COVID-19, and whether hospital capacities will be exceeded. Predicting hospitalization needs due to COVID-19 may be particularly challenging in rural areas. For example, relative to urban areas, rural communities in the USA have reduced access to health care [3] and increased mortality from chronic diseases [4] , both of which are key risk factors for COVID-19. South Dakota is among the most rural states in the USA with a 2017 population estimate of 869,666 [5] . Of South Dakota's 66 counties, 52% are frontier (with <15.5 persons per square kilometer) and 32% encompass or are comprised of reservation lands representing nine federally recognized tribes. Communities in rural areas also tend to have older populations. For example, among 56 US counties with the largest proportion of people ages >85, all but two counties are rural and a county in South Dakota ranks first [6] . South Dakota's public health infrastructure is limited, with a centralized state department of health that delivers public health services through a network of regional and county offices, most of which house a single public health nurse. Nearly 80% of nonprofit hospitals in South Dakota are critical access hospitals with 25 or fewer acute inpatient beds [7] , and access to medical facilities with an intensive care unit (ICU) and ventilators is limited in rural areas [8] . To our knowledge, there are no published studies that model hospitalizations due to COVID-19 in rural, low resource settings. Developing new ways to model infectious disease outbreaks in jurisdictions with limited public health and health care infrastructure is critical to preventing and reducing mortality and morbidity in communities that are at high risk of COVID-19. Early predictions of hospitalization in the USA relied on projections from SIR models and their derivatives [9] . Because they are developed primarily to simulate disease spread through a hypothetical, wellmixed population, SIR-based models require a number of assumptions to generate predictions of hospitalizations or death [10] . Unlike many US states, South Dakota has publicly released hospitalization data daily and there is now enough data to model hospitalization curves directly, rather than inferring hospitalizations through an SIR. Here, we modeled cumulative hospitalizations in an urban (Minnehaha County) versus rural population within South Dakota using a Bayesian non-linear Weibull function. Because early predictions in a disease outbreak are critical for planning, and also are data limited, we used informed priors from New York City, which began its hospitalization curve before South Dakota. While New York City is not a rural area, the use of informed prior distributions allowed our model to make reasonable, though highly uncertain, forecasts of hospitalizations in a rural setting. We obtained data on cumulative hospitalizations and active hospitalizations (number hospitalized on a given day) from the data dashboard for the South Dakota Department of Health (SD DOH) -https://doh.sd.gov/news/Coronavirus.aspx#SD. Data for cumulative hospitalizations began on 2020-03-08 and were entered by hand into a .csv each day (SD DOH only reports totals for the current day, not a timeline). Data for active hospitalizations were not released until 2020-04-20, when 56 people were actively hospitalized. Beginning on that date, we also updated our .csv with active hospitalizations each day. Data were collected up to 2020-07-22. When data collection began, most cases and hospitalizations were located in Minnehaha County, SD. Therefore, during data collection, we noted hospitalizations in Minnehaha County versus the rest of South Dakota to capture any potential divergent trends under the assumption that the disease would spread more slowly across rural South Dakota. For these two areas, data are only available for cumulative hospitalizations, not for active hospitalizations. Minnehaha County has a population density of 619 people per km 2 , which is >50 times higher than the state average population density of 28 people per km 2 [5] . Comparing these two areas allowed us to model COVID-19 hospitalizations in a rural and urban setting within the same state. We estimated cumulative hospitalizations using a Bayesian model in which hospitalizations were modeled as a sigmoid function of time using the Weibull function [11, 12] . The Weibull function is derived from the Weibull cumulative distribution [13] and has been used widely in biology to model growth curves [14] . We chose the Weibull function because it is more flexible than the logistic function and is asymmetric around the inflection point [11, 15] . We fit the Weibull function to two sets of data that describe (1) the cumulative hospitalizations for the state of South Dakota and (2) the cumulative hospitalizations for subgroups of Minnehaha County and the rest of South Dakota. Because the data were counts with positive outcomes, we used a Poisson likelihood with a log-link. With the above notation, y i is the cumulative number of people hospitalized in South Dakota on the ith date, α is the asymptote, β is the inflection point, and γ is the slope at the inflection point. Gamma priors were used because each parameter must be positive and continuous. Informative prior distributions were derived from the cumulative hospitalization curve in New York City (NYC Department of Health, https://www1.nyc.gov/site/doh/ covid/covid-19-data.page). We derived the priors from New York City because NYC had nearly completed its hospitalization curve when South Dakota's was still beginning and because the data were available as a timeline (many states either have not reported temporal hospitalization data or have not made the data easily extractable). To derive prior distributions for South Dakota, we first fit the aforementioned model to NYC's hospitalization curve. Before fitting the model, we multiplied NYC hospitalizations by 0.10 to put them on the scale of South Dakota's population (which is˜10% of NYC's population). We then fit the model to these adjusted hospitalizations using prior values of (1.2, 0.1) for α, (0.25, 0.005) for β, and (1.4, 0.3) for γ . Those reflect prior distributions with wide standard deviations that would represent a potential overload of South Dakota's˜2000 hospital beds: 10,000 ± 5000 (mean ± sd) for α, 50 ± 100 for β, and 100 ± 50 for γ (Table 1 ). To capture trends inside and outside of Minnehaha County, we fit the same model as before, but included an indicator variable with two levels (Minnehaha County or Outside Minnehaha County) for each of the three parameters. With the above notation, y i is the cumulative number of people hospitalized in each i date (x i ); α j , β j , and γ j are the parameters for each j group (Minnehaha County or the Rest of South Dakota); X minn are the priors for each X parameter (α, β, γ ); X rest are the priors for the difference in parameter values between Minnehaha and the rest of South Dakota; and r i is an indicator variable that is 0 if the data are in Minnehaha County and 1 otherwise. As before, prior values were chosen from a combination of prior information from NYC and from prior predictive simulation [16] . To do this, we simulated 300 cumulative hospitalization curves with mean values for each parameter derived from the fit of the NYC model. Because NYC has both a higher absolute population size and a higher population density (by 10-fold) than South Dakota, we adjusted prior means and standard deviations so that the prior predictive distributions estimated hospitalizations to have a maximum that is slightly below the maximum of NYC, but with standard deviations that still include positive prior probability for some extreme predictions (e.g., 50,000 cumulative hospitalizations) ( Table 1 ). Figure 1 shows the prior predictions for both models. Each model aforementioned was specified in R (version 3.6.3; R Core Team 2020) using the brms package [17] . Posterior sampling was performed using Hamiltonian Monte Carlo in rstan (version 2.19.2,) [18] . We fit four chains, each with 2000 iterations, discarding the first 1000 iterations of each chain as warm-up. Warm-up samples are similar to burn-in sampling, but are used in this case as an optimizer for the HMC algorithm. Chains were checked for convergence using trace plots to assure overlap ( Supplementary Information) , and by ensuring that the Gelman-Rubin convergence diagnosticR was < 1.1 [19] . To forecast cumulative hospitalizations, we used the posterior predictive distributions from each model by first solving for the fitted values across each iteration of the posterior: where k is the kth iteration from the posterior distribution and i is theith date. Posterior predicted values were estimated by drawing each ypred (k) i from the Poisson distribution: We then summarized the mean, median, standard deviation and credible intervals (50 and 95%) across the posterior distribution of fitted and predicted values. For visualization, we plotted fitted values within the range of the data and predicted values beyond the range of the data. To estimate active hospitalizations from the cumulative hospitalization curve, we first derived daily incidence φ for each iteration of yf it We then summed incidence over the previous 5, 10, 12, or 15 days to estimate variable lengths of hospital stays: i is the number of people actively hospitalized on the ith day for the kth iteration, φ (k) i is the incidence on the ith day for the kth iteration and n is 5, 10, 12, or 15. These lengths of stay were chosen to capture the range of reported hospital stay lengths from the literature [20] . We then plotted these predictions against active hospitalizations reported by the South Dakota Department of Health. For the group levels (model 2), we performed the same calculations as above, but for each group. In addition, we also estimated the state-level hospitalizations from model 2 by summing the predictions from each group. This allowed us to compare predictions when only state-level data were available versus predictions with data available for different areas of the state. We re-fit the model each day as data were released. To visualize how parameter values changed over time as data were added, we plotted posterior predictions and parameter values from model runs in 20 day intervals. Twenty days was arbitrarily chosen to allow for visual clarity in the plots. To determine how sensitive results were to iterations and alternative prior specifications, we re-ran each model using 20,000 iterations and also altered the priors to make them either less informative or more informative by widening or tightening the standard deviations (Table S2) . Increasing iterations had no impact on parameter estimates (Table S1) . Results from alternative priors are presented in Figs. S1 and S2. The results in the main text were robust to alternative prior specifications. One exception were the tighter priors in model 2, in which the estimate of maximum hospitalizations was ∼130 patients lower than the estimate from the main model. At the state level, model 1 predicted a total of 934 hospitalizations (median) in South Dakota (90% CrI: 897 to 977, Table 2 ), based on data available as of 2020-07-22. The inflection point was predicted at 38 days after the first hospitalization, suggesting that the peak rate of hospitalizations occurred around 2020-04-20 (Table 2 ). In contrast, the model with group-level effects clearly showed that hospitalization trends differed in Minnehaha County verses the rest of South Dakota (Fig. 2) . In Minnehaha County, the inflection point occurred around day 23 and revealed an asymptote of 332 hospitalizations (90% CrI: 326 to 338) ( Table 2 ). In the rest of South Dakota, the inflection point occurred ∼40 days later (day 62, Fig. 2) . Similarly, the maximum cumulative hospitalizations in the rest of South Dakota are estimated at a median of ∼813, but with large uncertainty (90% CrI: 742 to 892, Table 2 ). As expected, the uncertainty in predictions (Fig. 3) improved over time as data were added to the model. The largest improvement appeared to occur during the model run on day 60, when all parameter values appeared to stabilize ( Supplementary Information) . After this date, predictions for hospitalizations in Minnehaha County were stable, while predictions outside of Minnehaha County tended to under predict future hospitalizations until the most recent model runs (Fig. 3) . Converting the cumulative curve to estimate the number of people actively hospitalized yields a maximum estimate of ∼100 people actively hospitalized. This estimate is derived from the assumption that an average patient will spend 10 days in the hospital. That assumption appeared to best approximate the state reported data best (Fig. 4) . However, the state-reported data also appear to be staying consistent at around 90 people hospitalized even when model 1 predicts that active hospitalizations should be declining (Fig. 4) . It is possible that some of the difference is due to underlying differences in hospitalizations in the two subgroups, but active hospitalization data are only available for the state, so we are unable to assess the accuracy of the group-level predictions (Fig. 4 ). The most important result of this study is that modeling trends separately in urban versus rural jurisdictions reveal different projections of cumulative hospitalizations than if modeled only using state-level data. In particular, the model with urban vs. rural groups predicts that Minnehaha County will attain a maximum of ∼332 hospitalizations, while areas outside of Minnehaha will attain a maximum of ∼813. That results in approximately ∼1145 people cumulatively hospitalized. In contrast, In addition to differences in maximum hospitalizations, the models predict different trends of active hospitalizations. Active hospitalizations are more relevant than cumulative hospitalizations for measuring hospital capacity. They have been modeled for the COVID pandemic in a variety of ways with many states adopting a version of an SIR model, such as the CHIME model [9] . One challenge with estimating hospitalizations using an SIR-based approach is that they require estimates of a number of transition probabilities to convert infections into hospitalizations. These include estimating the proportion of infected people that become symptomatic, the proportion of those that require hospitalization, the proportion of those that seek hospitalization, the time lag between viral infection as modeled in an SIR, symptom onset, and actual hospitalization [9, 10] . Early in an epidemic, when hospitalization data are scarce or absent, these models are essential for predicting possible hospitalization scenarios. However, once enough data are available on hospitalizations, it is possible to model the hospitalization curve directly, as we have done here. Though we did not attempt to predict COVID-19 infections in South Dakota, it may be possible to use our approach to do so by treating infections as a latent variable that leads to subsequent hospitalizations. Flaxman et al. [10] provide an example of this approach using COVID-19-related deaths rather than hospitalizations. By modeling the dynamics of deaths, they estimated a time-varying reproduction number (R t ), which was used to estimate the number of positive cases. Using deaths (or hospitalizations) to estimate infection dynamics may help to overcome limitations in testing capacity, which in turn lead to difficulty in linking publicly reported testing results to true population-level infection rates. This is particularly true in the USA, which has limited testing capacity and no centrally coordinated testing program [21] . An advantage to the Bayesian approach is that we can use prior values of parameters in a model to fit a model with limited data. In our case, those parameters were conveniently available from New York City. However, if we were modeling a hospitalization curve that had no prior estimates, then we might derive priors from another epidemic that had similar disease characteristics or use prior predictive simulation to bound the model to reasonable prior predictions [16] . In particular, because the prior distributions for individual parameters may not be known or are difficult to interpret without the consideration of the likelihood [22] , it is important to assess the implications of prior choices using the prior predictive distribution (i.e., simulating potential from the prior distributions alone) [16] (Fig. 1 ). In our model, data simulated from the prior helped to confirm that our model was specified in a way that included a wide range of hospitalization trajectories (Fig. 1) , but excluded extreme values that might have come from more diffuse priors, such as projecting asymptotes with hospitalizations that are higher than the population of South Dakota. Presenting characteristics, comorbidities, and outcomes among 5700 patients hospitalized with COVID-19 in the New York City area Estimating the burden of SARS-CoV-2 in France National healthcare quality and disparities report chartbook on access to health care Leading causes of death in nonmetropolitan and metropolitan areas-United states American Community Survey 5-year estimates Mills S Characteristics of counties with the highest proportion of the oldest old Availability of respiratory care services in critical access and rural hospitals Locally informed simulation to predict hospital capacity needs during the COVID-19 pandemic Imperial College COVID-19 Response Team, Flaxman S, Mishra S et al Estimating the effects of non-pharmaceutical interventions on COVID-19 in Europe The potential of Weibull-type functions as flexible growth curves Sigmoid model for the evaluation of growth and production curves in laying hens A statistical distribution function of wide applicability The usefulness of ecological models: a stock-taking Visualization in Bayesian workflow An R Package for Bayesian Generalized Linear Mixed Models using Stan Stan : a probabilistic programming language Inference from iterative simulation using multiple sequences COVID-19 length of hospital stay: a systematic review and data synthesis Failing the test -the tragic data gap undermining the U.S. pandemic response The prior can often only be understood in the context of the likelihood Acknowledgements We thank the South Dakota Department of Health for making the hospitalization data publicly available. We also thank the editor and anonymous referees for their suggestions and comments that led to a significant improvement of this manuscript.Data Availability All data and code are available at https://github.com/jswesner/covid sd ms [23] . These will be permanently archived via Zenodo upon acceptance of this manuscript. The authors declare no competing interests. The online version contains supplementary material available at https://doi.org/10.1007/s41666-021-00094-8.Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.