title: Ensemble Forecasts of Coronavirus Disease 2019 (COVID-19) in the U.S.
authors: Ray, E. L.; Wattanachit, N.; Niemi, J.; Kanji, A. H.; House, K.; Cramer, E. Y.; Bracher, J.; Zheng, A.; Yamana, T. K.; Xiong, X.; Woody, S.; Wang, Y.; Wang, L.; Walraven, R. L.; Tomar, V.; Sherratt, K.; Sheldon, D.; Reiner, R. C.; Prakash, B. A.; Osthus, D.; Li, M. L.; Lee, E. C.; Koyluoglu, U.; Keskinocak, P.; Gu, Y.; Gu, Q.; George, G. E.; Espana, G.; Corsetti, S.; Chhatwal, J.; Cavany, S.; Biegel, H.; Ben-Nun, M.; Walker, J.; Slayton, R.; Lopez, V.; Biggerstaff, M.; Johansson, M. A.; Reich, N. G.; COVID-19 Forecast Hub Consortium
date: 2020-08-22
DOI: 10.1101/2020.08.19.20177493

Abstract

Background: The COVID-19 pandemic has driven demand for forecasts to guide policy and planning. Previous research has suggested that combining forecasts from multiple models into a single "ensemble" forecast can make predictions more robust. Here we evaluate the real-time application of an open, collaborative ensemble to forecast deaths attributable to COVID-19 in the U.S.

Methods: Beginning on April 13, 2020, we collected and combined one- to four-week-ahead forecasts of cumulative deaths for U.S. jurisdictions in standardized, probabilistic formats to generate real-time, publicly available ensemble forecasts. We evaluated the point-prediction accuracy and calibration of these forecasts against reported deaths.

Results: Analysis of 2,512 ensemble forecasts made April 27 to July 20, with outcomes observed in the weeks ending May 23 through July 25, 2020, revealed precise short-term forecasts, with accuracy deteriorating at longer prediction horizons of up to four weeks. At all prediction horizons, the prediction intervals were well calibrated, with 92-96% of observations falling within the rounded 95% prediction intervals.

Conclusions: This analysis demonstrates that real-time, publicly available ensemble forecasts issued in April-July 2020 provided robust short-term predictions of reported COVID-19 deaths in the United States. With the ongoing need for forecasts of impacts and resource needs for the COVID-19 response, these results underscore the importance of combining multiple probabilistic models and assessing forecast skill at different prediction horizons. Careful development, assessment, and communication of ensemble forecasts can provide reliable insight to public health decision makers.

The outbreak of Coronavirus Disease 2019 in Wuhan, China, in late December 2019 quickly spread around the world, resulting in formal recognition of COVID-19 as a global threat by the World Health Organization on January 30, 2020 (Promed 2019; World Health Organization 2020). Subsequent rapid, global spread of SARS-CoV-2, the virus that causes COVID-19, drove an urgent need for forecasts of the timing and intensity of future transmission, and of the locations at highest risk of spread, to inform risk assessment and planning. Multiple studies of epidemic forecasting have shown that ensemble forecasts, which combine predictions from multiple models into a single forecast, consistently perform well and often outperform most if not all individual models (Viboud et al. 2018; Johansson et al. 2019; Reich, Brooks, et al. 2019). Ensemble approaches are in widespread use in other fields such as economics (Timmermann 2006; Busetti 2014) and weather forecasting (Leutbecher and Palmer 2008).
Ensemble models can distill information across multiple forecasts and are a robust option for decision making and policy planning, especially when extensive historical data on individual model performance are not available. Here, we summarize a collaborative effort between the U.S. Centers for Disease Control and Prevention (CDC), 21 largely academic research groups, five private industry groups, and two government-affiliated groups. Starting in April 2020, this consortium, called the COVID-19 Forecast Hub (https://covid19forecasthub.org), developed shared forecasting targets and data formats, then constructed, evaluated, and communicated the results of ensemble forecasts for U.S. deaths attributable to COVID-19. Deaths due to COVID-19 are a proximate indicator of burden on health care systems and a critical measure of health impact.

Beginning on April 13, 2020, and every Monday thereafter, we collected probabilistic one-, two-, three-, and four-week-ahead forecasts of the total number of deaths due to COVID-19 that would be reported by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University (Dong, Du, and Gardner 2020) by the Saturday of each week, for U.S. states and territories and the United States overall. Prediction intervals (e.g., 50% or 95%) characterize uncertainty that point forecasts cannot. Each participating team therefore provided the median of its predictive distribution and 11 central prediction intervals, ranging from a 10% to a 98% prediction interval. Participating groups were free to use the methods and data sources they deemed appropriate to generate their forecasts (Reich et al. 2020).

On April 27, after two weeks of consistent submissions, we began aggregating individual forecasts of cumulative deaths to construct a weekly ensemble forecast (hereafter, the "ensemble"). To be eligible for inclusion in the ensemble, a model had to submit a valid forecast for all four week-ahead horizons (details on validation in Supplemental Figure 1). The number of individual models included in the ensemble forecasts ranged from six (April 27) to 20 (June 15 and 29); ensemble forecasts for specific locations often included fewer models because some models did not forecast all locations. The ensemble forecast was constructed as an equally weighted average of forecasts from all eligible models. Specifically, the endpoints of each prediction interval in the ensemble forecast were calculated as the average of the corresponding quantiles across all individual model forecasts (Busetti 2014); a minimal computational sketch is given below. Ensemble forecasts were constructed by researchers at the University of Massachusetts Amherst and posted publicly by CDC each week.

We evaluated ensemble forecasts of cumulative deaths in the United States against reported deaths, restricting to weeks for which forecasts at all four prediction horizons (one- to four-weeks ahead) were available. For example, we did not evaluate forecasts for the week ending May 2, because only a one-week-ahead ensemble forecast was available for that week, making comparison across horizons impossible. Overall, we compared forecasts across nine weeks of observations, from the week ending May 23 through the week ending July 25, 2020.
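To make the ensemble construction concrete, here is a minimal sketch in R (the language of our analyses) for a single location and horizon. The three member forecasts and their distributions are hypothetical, invented purely for illustration.

```r
# Minimal sketch of the equally weighted quantile-averaging ensemble.
# All member forecasts below are hypothetical, for illustration only.

# The 23 quantile levels implied by a median plus 11 central prediction
# intervals (10%, 20%, ..., 90%, 95%, 98%).
quantile_levels <- c(0.01, 0.025, seq(0.05, 0.95, by = 0.05), 0.975, 0.99)

# Three hypothetical member forecasts of cumulative deaths for one
# location and one horizon, each reported as a vector of quantiles.
member_forecasts <- cbind(
  modelA = qnorm(quantile_levels, mean = 1000, sd = 50),
  modelB = qnorm(quantile_levels, mean = 1050, sd = 80),
  modelC = qnorm(quantile_levels, mean =  980, sd = 60)
)

# Each ensemble quantile is the unweighted mean of the corresponding
# quantiles across all eligible models.
ensemble_quantiles <- rowMeans(member_forecasts)

# Endpoints of the ensemble's 95% prediction interval:
ensemble_quantiles[quantile_levels %in% c(0.025, 0.975)]
```

Averaging quantiles rather than probability densities means each ensemble interval endpoint is a simple mean of the corresponding member endpoints.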
We evaluated the performance of ensemble forecasts across all locations and weeks where at least two individual models contributed to the ensemble; this included 2,512 of the 2,561 possible location-horizon forecasts during the evaluation period. Reported cumulative death data for this evaluation were downloaded from the CSSE repository on July 26, 2020.

We evaluated the error of the ensemble point forecasts with the mean absolute error (MAE), the mean of the absolute differences between forecasted and observed values, and evaluated calibration with prediction interval coverage, calculated as the frequency with which a prediction interval contained the eventually observed outcome. In a model that accurately characterizes uncertainty, the nominal prediction interval level will correspond closely to the frequency of eventually observed outcomes that fall within that interval; for example, eventually observed values should fall within the 95% prediction interval approximately 95% of the time. Because the ensemble prediction intervals were calculated by averaging individual models, their endpoints typically were not whole numbers and almost never included predictions of no new deaths in a given week. We therefore assessed interval coverage both for the original ensemble and for a conservatively rounded ensemble, rounding the lower limits of prediction intervals down and the upper limits up to whole numbers (a brief sketch of these calculations appears below).

The forecast data and submission instructions are available in a public GitHub repository (Reich et al. 2020). Analyses were performed in R (https://www.R-project.org/), and all code used to construct the ensemble forecasts and reproduce the analyses in this manuscript is available under a CC0 license in a separate public GitHub repository (E. Ray 2020).

Figure 1: Forecasts are one- through four-week-ahead predicted medians and 50%, 80%, and 95% prediction intervals from the ensemble forecasts created on May 18 and June 29, shown as examples. Ensemble forecasts were made every week starting April 27, 2020 (not shown; available at Reich et al. 2020). The mean absolute error (MAE) is the mean of the absolute differences between forecasted and observed values across all forecasts at each horizon (one- to four-weeks ahead) for which the target was observed as of July 26.

Comparing forecasts to the reported death data for the weeks ending May 23 through July 25, 2020, the ensemble point forecasts were accurate and precise at short-term prediction horizons, with error growing at longer horizons (Table 1). Specifically, on average across all locations, the MAE of four-week-ahead point predictions was more than three times the MAE of one-week-ahead predictions.
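The two evaluation steps just described reduce to a few lines of R; the sketch below uses invented forecasts and observations.

```r
# Sketch of the evaluation metrics; all numbers are invented.
point_forecasts <- c(1020, 2150, 2900, 4350)  # hypothetical predicted medians
observed        <- c(1000, 2000, 3000, 4000)  # hypothetical reported deaths

# Mean absolute error across a set of forecasts:
mae <- mean(abs(point_forecasts - observed))

# Conservative rounding of interval endpoints: lower limits are rounded
# down and upper limits up, so the interval can only widen.
round_interval <- function(lower, upper) {
  c(lower = floor(lower), upper = ceiling(upper))
}
round_interval(0.4, 17.2)  # becomes [0, 18] and now covers an observed 0
```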
For example, the national-scale one-week-ahead MAE indicated an average difference of 760 deaths from the eventually observed values, while the four-week-ahead forecasts had an MAE of about 2,400 deaths (Figure 1).

The ensemble forecasts were also well calibrated, with prediction intervals that covered the observed data with close to the expected frequency (Table 1, Supplemental Figure 3). The 50% prediction intervals were conservative: they captured 50-55% of observations at all forecast horizons with the original forecasts, and 57-65% with the conservatively rounded, whole-number forecasts. In contrast, the original 95% prediction intervals captured only 87-90% of observations; after rounding, they were better calibrated, with coverage rates of 92-96%. Rounding had a particularly clear impact on forecasts for weeks with no reported deaths, because many ensemble forecasts had lower prediction bounds just above zero whenever one or more individual forecasts were greater than zero. For example, rounding changed the 95% coverage result for 120 forecasts, and 99 of these were for weeks with no new deaths. While the forecasts were less precise at longer horizons (greater MAE), they remained calibrated.

The spread of COVID-19 has driven a continually adapting global response. Substantial uncertainty persists about the disease's trajectory, and robust forecasts of COVID-19 activity can help inform decision making in the face of this uncertainty. The COVID-19 Forecast Hub was created to rapidly collect, aggregate, ensemble, and evaluate forecasts in real time. The resulting forecasts were published on the CDC website starting on April 27, 2020 (Centers for Disease Control and Prevention 2020) and were used to develop summary messages about the trajectory of the outbreak in the United States.

These early results indicate that ensemble forecasts at the national and state level were accurate, with mean absolute errors well below the maximum number of new deaths reported per week, and well calibrated, with reasonable prediction interval coverage, especially when evaluated using rounded prediction intervals. These forecasts can serve as situational awareness and planning tools to inform a variety of response decisions, such as the implementation of mitigation strategies, the distribution of resources, or vaccine trial site selection (Dean et al. 2020; Wallinga, van Boven, and Lipsitch 2010; Lipsitch et al. 2011).

It is critical that forecasts used to inform public health decisions accurately characterize their uncertainty, and these ensemble forecasts achieved that goal. The ensemble forecasts were conservative for the more central prediction intervals, such as the 50% prediction interval, which captured 50-60% of observations, but were well calibrated at the more important, extreme intervals after rounding. In outbreak forecasting, extreme errors can lead to suboptimal decision making, such as unexpected shortages in or oversupply of resources; a well-calibrated forecast reduces this risk. For example, only 1% of outcomes should exceed the 99% prediction quantile, and this occurred for 1.4% of forecasts made by the COVID-19 Forecast Hub ensemble.
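Empirical interval coverage of the kind summarized above reduces to a simple frequency calculation; the sketch below uses invented interval endpoints and observations.

```r
# Sketch of empirical prediction-interval coverage; all numbers invented.
lower <- c(0, 950, 1800, 40)     # lower endpoints of rounded 95% intervals
upper <- c(3, 1300, 2600, 90)    # upper endpoints
obs   <- c(0, 1210, 2700, 55)    # eventually observed deaths

# Fraction of observations falling inside their intervals; for a
# well-calibrated 95% interval this should be near 0.95.
mean(obs >= lower & obs <= upper)
```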
Although the ensemble performed well, there are many avenues for further improvement. The ensemble forecast weighted all models equally, but weighting models based on historical performance may improve forecast skill (McAndrew and Reich 2019; Viboud et al. 2018; E. L. Ray and Reich 2018; Yamana, Kandula, and Shaman 2016; Reis et al. 2019; Brooks et al. 2018). However, any weighting method will need to be dynamic, accommodating the incorporation of new models, potential changes in existing models, and the different sets of locations forecasted by each model.
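As a purely hypothetical illustration of such a scheme, and not a method we used, performance-based weights could replace the unweighted mean in the quantile-averaging step. The weights and member quantiles below are invented, not estimated from historical skill.

```r
# Hypothetical performance-weighted quantile averaging; the weights and
# member quantiles are invented for illustration.
member_forecasts <- cbind(           # quantiles (rows) by models (columns)
  modelA = c(900, 1000, 1100),
  modelB = c(880, 1050, 1250),
  modelC = c(920,  980, 1060)
)
weights <- c(modelA = 0.5, modelB = 0.3, modelC = 0.2)  # must sum to 1

# Weighted ensemble quantile at each level:
as.vector(member_forecasts %*% weights)
```

Any such weights would have to be re-estimated as new models join and existing models change, which is the dynamic-updating challenge noted above.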
The ensemble forecasts evaluated here looked only four weeks into the future. Although probabilistic forecasts at longer horizons may characterize uncertainty correctly, in this study we observed significant widening of prediction intervals and degrading precision of point forecasts even at horizons of three to four weeks ahead, which raises concerns about the reliability of forecasts at longer horizons of weeks to months. In addition to current transmission dynamics and mitigation measures, long-term forecasts must also predict changes in mitigation and human behavior, and contend with uncertainties about the underlying data.

This analysis indicates that ensemble forecasts of cumulative mortality generated in real time at the national and state level during the first six months of the COVID-19 outbreak in the United States were accurate and well calibrated. Given this performance, and the previous performance of probabilistic ensemble forecasts for multiple infectious diseases, we encourage decision makers to consider multiple models and ensemble forecasts, rather than single-model forecasts, for future risk assessment and planning needs.

Supplemental Figure 1: The number of times each contributing model was included in the ensemble forecast for each location. An empty cell indicates that the model was not included in a forecast for the given location. The number of individual models included varied because of differences in the number of submissions, the locations included in submissions, and the following criteria for individual forecasts: (1) a forecast had to include all four week-ahead horizons; (2) the one-week-ahead forecast for cumulative deaths could not assign probability greater than 0.1 to a reduction in cumulative deaths relative to already reported deaths; and (3) at each quantile level, predictions had to be non-decreasing over the four prediction horizons. Abbreviations for each location are shown in Supplemental Table 1.

Supplemental Figure 2: Reported cumulative deaths due to COVID-19 for the U.S. and states and territories from March 21 to July 25, as of July 26, 2020. Forecasts are one- through four-week-ahead predicted medians and 50%, 80%, and 95% prediction intervals from the ensemble forecasts created on May 18 and June 29, shown as examples; ensemble forecasts were made every week starting April 27, 2020 (not shown; available at Reich et al. 2020). Abbreviations for each location are shown in Supplemental Table 1.

Table 1: Mean absolute error (MAE) for cumulative deaths reported May 23 through July 25, the maximum number of deaths reported per week through the week ending July 25, and the cumulative deaths reported through the week ending July 25. All reported data were collected on July 26, 2020. Locations are sorted in decreasing order of cumulative deaths. The top row summarizes MAE for all locations other than American Samoa, for which 0 deaths were observed during the evaluation period.

Supplemental Figure 3: Calibration of ensemble predictions of cumulative deaths in locations where at least two valid individual models were submitted. We report calibration of prediction intervals from forecasts for the weeks ending May 23 to July 25, 2020, the set of observed weeks with forecasts at all four time horizons at the time of writing. Lines are labeled with the nominal coverage rate; a well-calibrated forecast will have an empirical coverage rate equal to the nominal coverage rate.
References

Ray, E. L., and N. G. Reich. 2018. "Prediction of Infectious Disease Epidemics via Weighted Density Ensembles."
Reich, N. G., L. C. Brooks, et al. 2019. "A Collaborative Multiyear, Multimodel Assessment of Seasonal Influenza Forecasting in the United States."
Reich, N. G., et al. 2019. "Accuracy of Real-Time Multi-Model Ensemble Forecasts for Seasonal Influenza in the U.S."
Reis, J., et al. 2019. "Superensemble Forecast of Respiratory Syncytial Virus Outbreaks at National, Regional, and State Levels in the United States."
Timmermann, A. 2006. "Forecast Combinations." Chapter 4 in Handbook of Economic Forecasting.
Viboud, C., et al. 2018. "The RAPIDD Ebola Forecasting Challenge: Synthesis and Lessons Learnt."
Wallinga, J., M. van Boven, and M. Lipsitch. 2010. "Optimizing Infectious Disease Interventions during an Emerging Epidemic."
World Health Organization. 2020. "Emergency Committee Regarding the Outbreak of Novel Coronavirus (2019-nCoV)."

Team information

Team name: … Named author: Vishal Tomar, MS. Named author institution: Auquan Ltd. Other team authors: Auquan Ltd: Chandini Jain.

Team name: Columbia_UNC-SurvCon. Named author: Yuanjia Wang, PhD. Named author institution: Department of Biostatistics, Mailman School of Public Health, Columbia University. Other team authors: Columbia University: Qinxia Wang, Shanghong Xie; Department of Biostatistics, Gillings School of Global Public Health, University of North Carolina at Chapel Hill: Donglin Zeng. Funding: YW and DZ are supported by NIH grant GM124104.

Team name: Predictive Science Inc. Named author: Michal Ben-Nun, PhD. Named author institution: Predictive Science Inc. Other team authors: …

Team name: University of Arizona - EpiCovDA. Named author: Hannah Biegel, MSc. Named author institution: …

Team name: … Named author: …, PhD. Named author institution: Columbia University, Mailman School of Public Health, Dept. of Environmental Health Sciences. Other team authors: Columbia University, Mailman School of Public Health, Dept. of Environmental Health Sciences: Sen Pei, Marta Galanti, Jeffrey Shaman; Columbia University, Mailman School of Public Health: …

Team name: … Other team authors: …; Zhiling Gu. Funding and acknowledgments: Zhiling Gu was supported in part by National Science Foundation award DMS-1916204. Myungjin Kim was supported in part by National Science Foundation award DMS-1934884. Computing was supported by the College of Liberal Arts and Sciences, Iowa State University.

Team name: NotreDame-FRED. Named author: Guido España, PhD. Named author institution: … Chris Murray. Funding: Bill & Melinda Gates Foundation and National Science Foundation award 2031096.

Team name: NotreDame-mobility. Named author: Sean Cavany, PhD. Named author institution: …

Team name: … Named author: …, PhD. Named author institution: US Army Engineer Research and Development Center. Other team authors: Information Technology Laboratory: …