key: cord-1018702-tw1tp9zj authors: Wesner, J. S.; Van Peursem, D.; Flores, J.; Lio, Y.; Wesner, C. title: Forecasting hospitalizations due to COVID-19 in South Dakota, USA. date: 2020-07-24 journal: nan DOI: 10.1101/2020.07.22.20160184 sha: a07b0b084e567674033771ccd88eabe01b329d2a doc_id: 1018702 cord_uid: tw1tp9zj Anticipating the number of hospital beds needed for patients with COVID-19 remains a challenge. Early efforts to predict hospital bed needs focused on deriving predictions from SIR models, largely at the level of countries, provinces, or states. In the United States, these models rely on data reported by state health agencies. However, predictive disease and hospitalization dynamics at the state level are complicated by geographic variation in disease parameters. In addition it is difficult to make forecasts early in a pandemic due to minimal data. However, Bayesian approaches that allow models to be specified with informed prior information from areas that have already completed a disease curve can serve as prior estimates for areas that are beginning their curve. Here, a Bayesian non-linear regression (Weibull function) was used to forecast cumulative and active COVID-19 hospitalizations for South Dakota, USA. As expected, early forecasts were dominated by prior information, which was derived from New York City. Importantly, hospitalization trends also differed within South Dakota due to early peaks in an urban area, followed by later peaks in other rural areas of the state. Combining these trends led to altered forecasts with relevant policy implications. >85, all but two counties are rural and a county in South Dakota ranks first. 6 South Dakota's public health infrastructure is limited with a centralized state department of health that delivers public health services through a network of regional and county offices, most of which house a single public health nurse. Nearly 80% of nonprofit hospitals in South Dakota are critical access hospitals with 25 or fewer acute inpatient beds, 7 and acess to medical facilities with an intensive care unit (ICU) and ventilators is limited in rural areas. 8 To our knowledge, there are no published studies that model hospitalizations due to COVID-19 in rural and low resource settings. Developing new ways to model infectious disease outbreaks in jurisdictions with limited public health and health care infrastructure is critical to preventing and reducing mortality and morbidity in communities that are at high risk of COVID-19. Early predictions of hospitalization in the United States relied on projections from SIR models and their derivatives. 9 Because they are developed primarily to simulate disease spread through a hypothetical, well-mixed population, SIR-based models require a number of assumptions to generate predictions of hospitalizations or death. 10 Unlike many U.S. states, South Dakota has publically released hospitalization data daily and there is now enough data to model hospitalization curves directly, rather than infering hospitalizations through an SIR. Here, we modeled cumulative hospitalizations in an urban (Minnehaha) versus rural population within South Dakota using a Bayesian non-linear Weibull function. Because early predictions in a disease outbreak are critical for planning, but also are data limited, we used informed priors from New York City, which began its hospitalization curve before South Dakota. While New York City is not a rural area, the use of informed prior distributions allowed our model to make reasonable, though highly uncertain, forecasts of hospitalizations in a rural setting. We obtained data on cumulative hospitalizations and active hospitalizations (number hospitalized on a given day) from the data dashboard for the South Dakota Department of Health (SD DOH) -https://doh.sd.gov/ news/Coronavirus.aspx#SD. Data for cumulative hospitalizations began on 2020-03-08 and were entered by hand into a .csv each day (SD DOH only reports totals for the current day, not a timeline). Data for active hospitalizations were not released until 2020-04-20, when 56 people were actively hospitalized. Beginning on that date, we also updated our .csv with active hospitalizations each day. When data collection began, most cases and hospitalizations were located in Minnehaha County, South Dakota. Therefore, during data collection, we noted hospitalizations in Minnehaha County versus the rest of South Dakota to capture any potential divergent trends under the assumption that the disease would spread more slowly across rural South Dakota. For these two areas, data are only available for cumulative hospitalizations, not for active hospitalizations. Minnehaha County has a population density of 619 people per km 2 , which is >50 times higher than the state average population density of 28 people per km 2 . 5 Comparing these two areas allowed us to model COVID-19 hospitalizations in a rural and urban setting within the same state. We estimated cumulative hospitalizations using a Bayesian model in which hospitalizations were modeled as a sigmoid function of time using the Weibull function. 11, 12 The Weibull function is derived from the Weibull cumulative distribution 13 and has been used widely in biology to model growth curves. 14 We chose the Weibull function because it is more flexible than the logistic function and is asymmetric around the inflection point. 11, 15 We fit the Weibull function to two sets of data that describe 1) the cumulative hospitalizations for the state of South Dakota and 2) the cumulative hospitalizations for subgroups of Minnehaha County and the rest of South Dakota. Because the data were counts with positive outcomes, we used a Poisson likelihood with a log-link. 2 . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted July 24, 2020. γ ∼ Γ(5.8, 4.8) With the above notation, y i is the cumulative number of people hospitalized in South Dakota on the i th date, α is the asymptote, β is the inflection point, and γ is the slope at the inflection point. Gamma priors were used because each parameter must be positive and continuous. Informative prior distributions were derived from the cumulative hospitalization curve in New York City (NYC Department of Health, https://www1.nyc.gov/site/doh/covid/covid-19-data.page). We derived the priors from New York City because NYC had nearly completed its hospitalization curve when South Dakota's was still beginning and because the data were available as a timeline (many states either have not reported temporal hospitalization data or have not made the data easily extractable). To derive prior distributions for South Dakota, we first fit the aforementioned model to NYC's hospitalization curve. Before fitting the model, we multiplied NYC hospitalizations by 0.10 to put them on the scale of South Dakota's population (which is~10% of NYC's population). We then fit the model to these adjusted hospitalizations using prior values of Γ(1.2, 0.1) for α, Γ(0.25, 0.005) for β, and Γ(1.4, 0.3) for γ. Those reflect prior distributions with wide standard deviations that would represent a potential overload of South Dakota's~2000 hospital beds: 10,000 +/-5000 (mean +/-sd) for α, 50 +/-100 for β, and 100 ± 50 for γ. To capture trends inside and outside of Minnehaha County, we fit the same model as before, but included an indicator variable with two levels (Minnehaha County or Outside Minnehaha County) for each of the three parameters. γ rest ∼ N (0, 0.5) With the above notation, y i is the cumulative number of people hospitalized in each i date (x i ), α j , β j , and γ j are the parameters for each j group (Minnehaha County or the Rest of South Dakota), X minn are the priors for each X parameter (α, β, γ), X rest are the priors for the difference in parameter values between Minnehaha and the Rest of South Dakota, and r i is an indicator variable that is 0 if the data are in Minnehaha County and 1 otherwise. As before, prior values were chosen from a combination of prior information from NYC and from prior predictive simulation. 16 To do this, we simulated 300 cumulative hospitalization curves with mean values for each parameter derived from the fit of the NYC model. Because NYC has both a higher absolute population size and a higher population density (by 10-fold) than South Dakota, we adjusted prior means and standard deviations so that the prior predictive distributions estimated hospitalizations to have a maximum that is slightly below the maximum of NYC, but with standard deviations that still include positive prior probability for some extreme prediction (e.g., 50,000 cumulative hospitalizations). Figure 1 shows the prior predictions for both models. Each model aforementioned was specified in R (version 3.6.3; R Core Team 2020) using the brms package. 17 Posterior sampling was performed using Hamiltonian Monte Carlo in rstan (version 2.19.2, 18 ). We fit four chains, each with 2000 iterations, discarding the first 1000 iterations of each chain as warm-up. Warm-up samples are similar to burn-in sampling, but are used in this case as an optimizer for the HMC algorithm. Chains were checked for convergence using trace plots to assure overlap (Supplementary Information), and by ensuring that the Gelman-Rubin convergence diagnosticR was < 1.1. 19 To forecast cumulative hospitalizations, we used the posterior predictive distributions from each model by first solving for the fitted values across each iteration of the posterior: where k is the k th iteration from the posterior distribution and i is the i th date. Posterior predicted values were estimated by drawing each ypred (k) i from the Poisson distribution: We then summarized the mean, median, standard deviation and credible intervals (50 and 95%) across the posterior distribution of fitted and predicted values. For visualization, we plotted fitted values within the range of the data and predicted values beyond the range of the data. To estimate active hospitalizations from the cumulative hospitalization curve, we first derived daily incidence φ for each iteration of yf it We then summed incidence over the previous 5, 10, 12, or 15 days to estimate variable lengths of hospital stays: ω is the number of people actively hospitalized on the i th day for the k th iteration, φ (k) i is the incidence on the i th day for the k th iteration and n is 5, 10, 12, or 15. These lengths of stay were chosen to capture the range of reported hopsital stay lengths from the literature. 20 We then plotted these predictions against active hospitalizations reported by the South Dakota Department of Health. For the group levels (Model 2 ), we performed the same calculations as above, but for each group. In addition, we also estimated the state-level hospitalizations from Model 2 by summing the predictions 4 . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted July 24, 2020. . from each group. This allowed us to compare predictions when only state-level data were available versus predictions with data available for different areas of the state. We re-fit the model each day as data were released. To visualize how parameter values changed over time as data were added, we plotted posterior predictions and parameter values from model runs in 20 day intervals. Twenty days was arbitrarily chosen to allow for visual clarity in the plots. At the state level, model 1 predicted a total of 932 hospitalizations (median) in South Dakota (90% CrI: 808-977, Table 2 ). The inflection point was predicted at 38 days after the first hospitalization, suggesting that the peak rate of hospitalizations occurred around April 20, 2020 (Table 2 ). In contrast, the model with group-level effects clearly showed that hospitalizations trends differed in Minnehaha County verses the rest of South Dakota (Figure 2 ). In Minnehaha County, the inflection point occurred around day 23 and revealed an asymptote of 332 hospitalizations (90% CrI: 326-338) ( Table 2 ). In the rest of South Dakota, the inflection point occurred~40 days later (day 62, Figure 2) . Similarly, the maximum cumulative hospitalizations in the rest of South Dakota are estimated at a median of~811, but with large uncertainty (90% CrI: 745-891, Table 2 ). As expected the uncertainty in predictions (Figure 3 ) improved over time as data were added to the model. The largest improvement appeared to occur during the model run on day 60, when all parameter values appeared to stabilize (Supplementary Information). After this date, predictions for hospitalizations in Minnehaha County were stable, while predictions outside of Minnehaha County tended to underpredict future hospitalizations until the most recent model runs (Figure 3 ). Converting the cumulative curve to estimate the number of people actively hospitalized yields a maximum estimate of~100 people actively hospitalized. This estimate is derived from the assumption that an average patient will spend 10 days in the hospital. That assumption appeared to best approximate the state reported data best (Figure 4 ). One clear difference between the two models is that the state-level model predicts a single peak in active hospitalizations, while model 2 predictions two distinct peaks (Figure 4) . The data appear consistent with two separate peaks. However, active hospitalization data are only available for the state, so we are unable to compare data to predictions at the group-level (Figure 4 ). The most important result of this study is that modeling trends separately in urban versus rural parts of a state population reveal different projections of cumulative hospitalizations than if modeled only using statelevel data. In particular, the model with urban vs. rural groups predicts that Minnehaha County will attain a maximum of~316 hospitalizations, while areas outside of Minnehaha will attain a maximum of~785. That results in approximately~1000 people cumulatively hospitalized. In contrast, the model with only state-level data indicates a 24% smaller number of total people hospitalized (762). In addition to differences in maximum hospitalizations, the models predict different trends of active hospitalizations. Active hospitalizations are more relevant than cumulative hospitalizations for measuring hospital capacity. They have been modeled for the COVID pandemic in a variety of ways with many states adopting a version of an SIR model, such as the CHIME model. 9 One challenge with estimating hospitalizations using an SIR-based approach is that they require estimates of a number of transition probabilities to convert infections into hospitalizations. These include estimating the proportion of infected people that become symptomatic, the proportion of those that require hospitalization, the proportion of those that seek hospitalization, the time lag between viral infection as modeled in an SIR, symptom onset, and actual hospitalization. 9,10 Early in an epidemic, when hospitalization data are scarce or absent, these models are essential for predicting 5 . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted July 24, 2020. . possible hospitalization scenarios. However, once enough data are available on hospitalizations, it is possible to model the hospitalization curve directly, as we have done here. Though we did not attempt to predict COVID-19 infections in South Dakota, it may be possible to use our approach to do so by treating infections as a latent variable that leads to subsequent hospitalizations. Flaxman et al. 10 provide an example of this approach using COVID-19-related deaths rather than hospitalizations. By modeling the dynamics of deaths, they estimated a time-varying reproduction number (R t ), which was used to estimate the number of positive cases. Using deaths (or hospitalizations) to estimate infection dynamics may help to overcome limitations in testing capacity, which in turn lead to difficulty in linking publically-reported testing results to true population-level infection rates. This is particularly true in the United States, which has limited testing capacity and no centrally coordinated testing program. 21 An advantage to the Bayesian approach is that we can use prior values of parameters in a model to fit a model with limited data. In our case, those parameters were conveniently available from New York City. However, if we were modeling a hospitalizaton curve that had no prior estimates, then we might derive priors from another epidemic that had similar disease characteristics or use prior predictive simulation to bound the model to reasonable prior predictions 16 . In particular, because the prior distributions for individual parameters may not be known or are difficult to interpret without the consideration of the likelihood, 22 it is important to assess the implications of prior choices using the prior predictive distribution (i.e. simulating potential from the prior distrbutions alone) 16 (Figure 1 ). In our model, data simulated from the prior helped to confirm that our model was specified in a way that included a wide range of hospitalization trajectories (Figure 1 ), but excluded extreme values that might have come from more diffuse priors, such as projecting asymptotes with hospitalizations that are higher than the population of South Dakota. All data and code are available at https://github.com/jswesner/covid_sd_ms. 23 These will be permanently archived via Zenodo upon acceptance of this manuscript. We declare no conflicts of interest. We thank the South Dakota Department of Health for making the hospitalization data publically available. This work is not affiliated with any funding agency. . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted July 24, 2020. . . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted July 24, 2020. . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted July 24, 2020. . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted July 24, 2020. 12 . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted July 24, 2020. 13 . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted July 24, 2020. Here we compare the prior and posterior distributions of each parameter in the models. Figure S1 indicates how much information was learned about each parameter from adding data. Trace plots indicate well-mixed chains for each model. 14 . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted July 24, 2020. 15 . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted July 24, 2020. 16 . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted July 24, 2020. 17 . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted July 24, 2020. . Presenting Characteristics, Comorbidities, and Outcomes Among 5700 Patients Hospitalized With COVID-19 in the New York City Area Characteristics of Counties with the Highest Proportion of the Oldest Old Availability of respiratory care services in critical access and rural hospitals Locally Informed Simulation to Predict Hospital Capacity Needs During the COVID-19 Pandemic Estimating the effects of non-pharmaceutical interventions on COVID-19 in Europe The potential of Weibull-type functions as flexible growth curves Sigmoid Model for the Evaluation of Growth and Production Curves in Laying Hens A Statistical Distribution Function Of Wide Applicability The Usefulness of Ecological Models: A Stock-Taking Epidemiological study of novel coronavirus (COVID-19) Visualization in Bayesian workflow Brms: An R Package for Bayesian Generalized Linear Mixed Models using Stan A Probabilistic Programming Language Inference from Iterative Simulation Using Multiple Sequences COVID-19 Length of Hospital Stay: A Systematic Review and Data Synthesis Model code in the Stan language is below. ## // generated with brms 2. 12