A Comparison of Aggregation Methods for Probabilistic Forecasts of COVID-19 Mortality in the United States
Kathryn S. Taylor and James W. Taylor
21 July 2020

The COVID-19 pandemic has placed forecasting models at the forefront of health policy making. Predictions of mortality and hospitalization help governments meet planning and resource allocation challenges. In this paper, we consider the weekly forecasting of the cumulative mortality due to COVID-19 at the national and state level in the U.S. Optimal decision-making requires a forecast of a probability distribution, rather than just a single point forecast. Interval forecasts are also important, as they can support decision making and provide situational awareness. We consider the case where probabilistic forecasts have been provided by multiple forecasting teams, and we aggregate the forecasts to extract the wisdom of the crowd. With only limited information available regarding the historical accuracy of the forecasting teams, we consider aggregation (i.e. combining) methods that do not rely on a record of past accuracy. In this empirical paper, we evaluate the accuracy of aggregation methods that have been previously proposed for interval forecasts and predictions of probability distributions. These include the use of the simple average, the median, and trimming methods, which enable robust estimation and allow the aggregate forecast to reduce the impact of a tendency for the forecasting teams to be under- or overconfident. We use data that has been made publicly available from the COVID-19 Forecast Hub. While the simple average performed well for the high mortality series, we obtained greater accuracy using the median and certain trimming methods for the low and medium mortality series. It will be interesting to see if this remains the case as the pandemic evolves.

Coronavirus disease 2019 (COVID-19) was declared a pandemic by the World Health Organization on 11 March 2020 (WHO 2020a), and five months later, over 21 million individuals had been infected and COVID-19 had caused over 760,000 deaths worldwide (WHO 2020b). The COVID-19 pandemic has created enormous planning and resource allocation challenges. Governments are relying upon predictions of the numbers of COVID-19 cases, people hospitalized and deaths to help decide what actions to take (Adam 2020; Phelan et al. 2020). In this paper, we consider the short-term forecasting of reported deaths from COVID-19.

Forecasting methods are well established in providing predictions of uncertain events to decision makers across a variety of settings, ranging from energy providers and individuals relying on the weather outlook, to investors eager to gain insight into future economic conditions. Epidemiological forecasting models have been applied to both vector-borne diseases, including Dengue disease (Shi et al. 2016) and the Zika virus (Kobres et al. 2019), and contagious infectious diseases, including Severe Acute Respiratory Syndrome (SARS) (Ng et al. 2003), Ebola (Viboud et al. 2017) and Middle East Respiratory Syndrome (MERS) (Da'ar et al. 2018). Numerous COVID-19 models have emerged (Adam 2020; COVID-19 Forecast Hub 2020). These models are based on different approaches, make different assumptions, and therefore answer different questions (Holmdahl and Buckee 2020).
Due to the lack of data, assumptions have to be made about several factors, including the extent of immunity, transmission among people who are asymptomatic, and how the public will react to new government restrictions. Paucity of data is a common challenge in forecasting infectious diseases (Lauer et al. 2020). Policy makers need to be aware of the limitations of the models, and need to be conscious of the uncertainty in predictions from these models (Sridhar and Majumder 2020).

Gneiting and Katzfuss (2014) describe how optimal decision-making relies on the availability of a forecast of a probability distribution, rather than just a single point forecast (see, for example, Gianfreda and Bunn 2018; Wang and Yao 2020). Interval forecasts are also valuable, as they can support real-time decision making and provide situational awareness (see, for example, Chen and Xiao 2012; Grushka-Cockayne and Jose 2020; Bracher et al. 2020). In this paper, we consider the case where probabilistic forecasts are provided by multiple forecasters for the cumulative total of deaths due to COVID-19. We aggregate the forecasts to extract the wisdom of the crowd. The benefit of collective decision making was highlighted by Galton in his seminal paper, in which he reported the success of a crowd at a country fair in guessing the weight of an ox (Galton 1907). Aggregation provides a pragmatic approach to synthesizing the information underlying different forecasting methods. It also enables diversification of the risk inherent in selecting a single forecaster who may turn out to be poor, and it offsets statistical bias associated with individual forecasters who tend to be under- or overconfident.

Forecast aggregation is often referred to as forecast combining. Since the early work by Bates and Granger (1969) on combining point forecasts, the methodology has evolved with the introduction of a variety of simple and sophisticated combining methods. Recent work has involved combining probabilistic forecasts. Winkler et al. (2019) predict that probabilistic forecast aggregation will become more common due to developments in the field, the rising popularity of forecasting competitions, and raised awareness resulting from increased reporting of probabilistic predictions in the media. The aggregation of individual forecasts could be weighted in some way if a record of past accuracy is available for each individual forecaster. However, such a record is often not available, and, indeed, this is the case for our application, which involves the forecasting of cumulative deaths during the relatively early stages of a pandemic. Even as time passes, comparable historical accuracy will not be available for all forecasters, as their involvement with the forecasting exercise commenced at different times. A further concern is that the relative accuracy of forecasters may not be stable over time (Winkler et al. 2019). In this paper, we consider aggregation methods that do not rely on a record of past accuracy for each forecaster. The success of the simple average for combining point forecasts has motivated its use for probabilistic forecasts. The median and trimmed means have also been proposed for forecasts of probability distributions and interval forecasts, as they provide simple, robust alternatives to the mean (Hora et al. 2013; Jose et al. 2014; Gaba et al. 2017).
In this empirical paper, we evaluate the usefulness of these existing aggregation methods, applied to multiple probabilistic forecasts of COVID-19 mortality at the state and national level in the U.S., using data made publicly available (see COVID-19 Forecast Hub 2020). In Section 2, we describe the dataset and the rise in mortality due to COVID-19 in the U.S. We consider interval forecast aggregation in Section 3, and we focus on the aggregation of forecasts of probability distributions in Section 4. Section 5 provides a summary and concluding comments.

In this section, we introduce the dataset that is the focus of our study. We summarize the progression in mortality due to COVID-19 across the U.S. We then describe the forecasts in the dataset, and the criteria that we applied to select forecasts from this dataset for inclusion in our analysis.

The COVID-19 Forecast Hub is curated by a group led by Nicholas Reich (COVID-19 Forecast Hub 2020). The Hub provides open access to weekly observations and forecasts for the cumulative total of reported COVID-19 deaths, as well as observations and forecasts for the total deaths each week (incident deaths). These data are available for the U.S. both at the national and state levels. The forecasts are submitted by multiple forecasting teams from academia and industry. The actual number of deaths from COVID-19 will be under-reported due to various factors, including reporting delays and the lack of testing. There will also be indirect deaths as hospitals, overwhelmed by COVID-19, have had to delay diagnosis and treatment for other conditions, such as cancer. Therefore, the impact of COVID-19 will be judged ultimately in terms of the excess mortality, that is, the number of deaths above the number that would have been expected (Weinberger et al. 2020).

The actual cumulative deaths described in this paper are obtained from the COVID-19 Forecast Hub. These data are from the daily reports issued by the Center for Systems Science and Engineering at Johns Hopkins University. Based on these data, the first reported death due to COVID-19 in the U.S. was in the State of Washington, in the week ending 29 February 2020. At the time of writing, 16 August 2020, the total number of COVID-19 deaths in the U.S. had risen to 169,481. Figure 1 shows the number of deaths across the U.S. on this date. Figure 2 shows the rise in the cumulative total number of COVID-19 deaths in the five states with the highest cumulative total on this date. As weekly cumulative and incident deaths are related, for simplicity, we focus only on cumulative deaths in our analysis. Although the COVID-19 Forecast Hub provides forecasts for all U.S. states and territories, we followed the convention adopted for the Hub's visualization by considering only the 50 states and the District of Columbia. For conciseness, we refer to these as 51 states. Given that we are also considering the national total, our dataset consists of 52 time series and associated forecasts.

Figure 3 shows that there is considerable variation among the distributional forecasts in terms of their location, spread and shape. The simple averaging provided by the ensemble forecast sits at the heart of each set of distributional forecasts. However, it is interesting to note that, at least for New York and Florida, there are a number of outlying distributional forecasts, which motivates consideration of alternative aggregation methods, based on robust estimation, such as the median and trimming.
We discuss this further in Sections 3 and 4.

The COVID-19 Forecast Hub provides information regarding the methods used by the various forecasting teams and the licensing conditions for the use of each team's data. The supplementary material to this paper summarizes this information. Approximately half of the teams use compartmental models. These involve the estimation of the rates at which the population, or sectors of the population, transfer between the states of being susceptible, exposed, infected and recovered/removed. Hence, they are widely referred to as SEIR or SIR models. The other forecasting teams used a variety of approaches, including agent-based simulation, statistical models, and deep learning. The use of data-driven machine learning methods is consistent with the increasing use of such methods in healthcare (see Guha and Kumar 2018). Surowiecki (2004) describes conditions under which wisdom can best be extracted from a crowd. These include independent contributors, diversity of opinions, and a trustworthy central convenor to collate the information provided. Forecasts from the Hub satisfy these conditions.

In our analysis, we considered forecasts made with forecast origin at Week 18 or later. Week 18 seemed a reasonable starting point for our analysis because levels of mortality were relatively low prior to this, and it was the first forecast origin for which the ensemble forecast was included in the Hub's visualization. From Week 20, the curators of the Hub have produced files listing the forecasts that they did not include in their ensemble, based on several data screening checks. We omitted these forecasts from our analysis, and also followed the curators by treating as ineligible any submission that did not provide forecasts for all 23 quantiles and all four lead times. Provided this criterion was met, we included forecasts that had not been recorded as being assessed for eligibility, because we felt we had no clear justification for omitting them. We considered data screening checks for these forecasts, but concluded that setting our own thresholds for inclusion would be arbitrary. Consequently, we applied aggregation methods to a dataset of forecasts that included 30% more forecasts than were included in the ensemble aggregation method.

Figure 4 shows the total number of forecasting teams included in our study for each forecast origin from Week 18 up to Week 33, which is the most recent origin at the time of writing. For each week, the figure also shows the split between the number of teams that used compartmental models and the number using alternatives. Note that the number of teams shown for each week in Figure 4 is an upper bound for the number available for aggregation for any one time series. This is because some teams either did not provide forecasts or did not provide eligible forecasts for all time series.

Figure 4: Number of forecasting teams included in our study for each forecast origin. The stacked bars indicate the split between teams using compartmental models, and alternatives.

For each of the 36 forecasting teams that submitted forecasts, Figure 5 shows, for each week, whether we were able to include that team in the aggregated forecasts for at least one series. A break in the horizontal line for any team indicates that, from the first week when the team submitted forecasts, it was not the case that forecasts from that team were available and eligible in all the following weeks.
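To illustrate the eligibility rule just described (a submission must provide forecasts for all 23 quantiles and all four lead times), the following is a minimal Python sketch. The function and column names are our own illustrative assumptions and do not correspond to the Hub's actual code or file schema.

```python
import pandas as pd

# Assumed requirements, taken from the eligibility rule described above.
REQUIRED_QUANTILES = 23   # number of probability levels required in a submission
REQUIRED_HORIZONS = 4     # 1- to 4-week-ahead forecasts

def is_eligible(submission: pd.DataFrame) -> bool:
    """Return True if a team's submission for one series and one forecast origin
    contains all 23 quantile levels at each of the four lead times.
    The column names 'horizon' and 'quantile' are illustrative assumptions and
    may differ from the Hub's actual file layout."""
    counts = submission.groupby("horizon")["quantile"].nunique()
    return len(counts) == REQUIRED_HORIZONS and bool((counts == REQUIRED_QUANTILES).all())
```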
The circles in Figure 5 give an indication of the extent to which each forecasting team featured in our study. The figure shows that, even when a record of past historical accuracy becomes available for all teams, accuracy will not be available for the same past periods and same time series. This implies that weighted aggregation methods, based on the historical accuracy of each method, will not be straightforward to construct. Hence our interest in aggregation methods that do not require a record of past accuracy for each team. It would be interesting to compare the accuracy of the forecasts produced by the different teams. However, Figure 5 shows that, for the full period that we use to compare the accuracy of the aggregation methods, we have forecasts for all 52 time series from only two of the participating teams. The results of these two teams were not competitive with the results of the better aggregation methods.

In this section, we describe the interval forecast aggregation methods that have been proposed in the literature, and then provide a summary list of the methods that we implemented. We then describe the measures that we use to evaluate interval forecasts, and finally use these to compare the aggregation methods.

We now briefly review methods for aggregating interval forecasts that are relevant to applications, such as ours, where there is a sizeable group of individual forecasters and a record of past accuracy is not available. The methods that have been proposed consider each bound of the interval separately. An obvious simple approach is to use the simple average of the forecasts. In the vast literature on combining point forecasts, it is well established that the simple average can be very competitive in a variety of applications. Interestingly, this is true regardless of whether a record of historical accuracy is available to enable far more sophisticated combining methods to be fitted (Larrick and Soll 2006). An advantage of the simple average is its simplicity, and its robustness to changes over time in the relative performance of the individual methods. However, consideration of robustness prompts the use of aggregation methods that are robust to outliers, with the obvious candidate being the median. The simple average and the median are both considered by Park and Budescu (2015) and Gaba et al. (2017), who also propose several novel aggregation methods. Extending the idea of robustness to outliers, Park and Budescu (2015) propose that, for each bound, a chosen percentage of the highest and lowest forecasts are discarded, followed by averaging of the rest. We refer to this as symmetric trimming. The median is an extreme version of symmetric trimming, where all but one forecast is trimmed. Rather than having robustness as motivation, Gaba et al. (2017) use trimming to address the situation where the individual forecasters tend to be either under- or overconfident. We refer to their methods as asymmetric trimming. Their exterior trimming involves removing a percentage of the highest-valued upper bounds and lowest-valued lower bounds, with the aggregate computed by averaging the remaining upper and lower bounds, respectively. This approach is suitable when the forecasters have tended to be underconfident, with interval forecasts that are generally too wide. Gaba et al. (2017) also suggest interior trimming, which involves removing a percentage of the lowest-valued upper bounds and highest-valued lower bounds, followed by averaging.
This approach to aggregation is suitable when the forecasters are overconfident. They also propose an extreme version of interior trimming, which involves discarding all but the highest upper bound and lowest lower bound. They refer to this as the envelope method. In addition, Gaba et al. (2017) describe a heuristic approach that views the bounds of each forecaster as having been produced based on a normal distribution.

The empirical studies of Park and Budescu (2015) involved 80% and 90% interval forecasts produced judgmentally by volunteers in experiments, where the interval forecasts related to general knowledge questions and estimates of financial and economic quantities. They found that the simple average and median were outperformed by trimming, and that symmetric trimming was preferable to asymmetric trimming. Gaba et al. (2017) considered 90% interval forecasts produced for financial quantities by employees at a financial brokerage firm, and for macroeconomic variables by participants in the Federal Reserve Bank of Philadelphia's Survey of Professional Forecasters. These forecasts are produced using a variety of methods, including statistical models, expert judgment, and a mixture of the two. They find that exterior trimming performs very well for some of their data, but that the ranking of aggregation methods is dependent on the characteristics of the individual forecasts, such as under- or overconfidence. In a recent empirical paper, Grushka-Cockayne and Jose (2020) applied aggregation methods to 95% interval forecasts produced by statistical time series methods for the 100,000 series from the M4-Competition. Overall, the best results were achieved with median aggregation and interior asymmetric trimming.

For each mortality series, forecast origin and lead time, we applied the following methods:

Ensemble: This is the simple average produced by the COVID-19 Forecast Hub. As discussed in Section 2, this is computed from a subset of the forecasts used in the other aggregation methods that we consider.

Simple average: For each bound, we computed the simple average of forecasts of this bound. We used our full set of forecasts for this and the other aggregation methods described below.

Median: For each bound, we computed the median of forecasts of this bound.

Symmetric trimming: For each bound, we averaged the forecasts remaining after the removal of the N lowest-valued and N highest-valued forecasts, where N is the largest integer less than or equal to the product of β/2 and the total number of forecasts. For all trimming methods in this study, we used β = 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80% and 90%.

Asymmetric exterior trimming: We first removed the N lowest-valued lower bound forecasts, as well as the N highest-valued upper bound forecasts, where N is the largest integer less than or equal to the product of β and the number of forecasts. For each bound, we averaged the remaining forecasts. This sometimes delivered a lower bound that was above the upper bound. In this situation, we replaced the two bounds by their mean. Although the resultant interval has zero width, this is preferable to having the lower bound above the upper.

Asymmetric interior trimming: We removed the N highest-valued lower bound forecasts, as well as the N lowest-valued upper bound forecasts, where N is defined as for asymmetric exterior trimming. For each bound, we averaged the remaining forecasts.

Envelope: The interval is constructed using the lowest-valued lower bound forecast and the highest-valued upper bound forecast.
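The interval aggregation methods listed above can be summarized in a short Python sketch. This is a minimal illustration under our own naming and array conventions (each team contributes one lower and one upper bound per series, forecast origin and lead time); it is not the authors' implementation, and edge cases where all forecasts would be trimmed are not handled.

```python
import numpy as np

def trimming_count(num_forecasts, beta, half=True):
    """N = largest integer <= beta/2 * count (symmetric) or beta * count (asymmetric)."""
    return int(np.floor((beta / 2 if half else beta) * num_forecasts))

def symmetric_trim(bound_forecasts, beta):
    """Average one bound after discarding the N lowest- and N highest-valued forecasts."""
    x = np.sort(np.asarray(bound_forecasts, dtype=float))
    n = trimming_count(len(x), beta)
    return x[n:len(x) - n].mean()

def asymmetric_exterior_trim(lowers, uppers, beta):
    """Drop the N lowest lower bounds and N highest upper bounds, then average each bound.
    If the resulting bounds cross, collapse the interval to their midpoint, as described above."""
    lo = np.sort(np.asarray(lowers, dtype=float))
    up = np.sort(np.asarray(uppers, dtype=float))
    n = trimming_count(len(lo), beta, half=False)
    l, u = lo[n:].mean(), up[:len(up) - n].mean()
    return ((l + u) / 2, (l + u) / 2) if l > u else (l, u)

def asymmetric_interior_trim(lowers, uppers, beta):
    """Drop the N highest lower bounds and N lowest upper bounds, then average each bound."""
    lo = np.sort(np.asarray(lowers, dtype=float))
    up = np.sort(np.asarray(uppers, dtype=float))
    n = trimming_count(len(lo), beta, half=False)
    return lo[:len(lo) - n].mean(), up[n:].mean()

def envelope(lowers, uppers):
    """Widest interval: the lowest lower bound and the highest upper bound."""
    return float(np.min(lowers)), float(np.max(uppers))
```

The simple average and median methods correspond to applying np.mean and np.median to each bound separately.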
In this paper, our interest is in central (1−α) intervals. These are bounded by the α/2 and (1−α/2) quantiles. We consider α = 5% and α = 50%, which correspond to 95% and 50% intervals, respectively. Our choice of these intervals reflects those presented in the visualization of the COVID-19 Forecast Hub.

The accuracy of a set of interval forecasts can be assessed by evaluating the forecasts of the quantiles bounding the interval. A simple measure of accuracy of forecasts of the θ quantile is to check the percentage of observations that fall below the forecast. We refer to this as the hit percentage. If this is equal to θ, the forecast is said to be calibrated. More precisely, we should refer to this as unconditional calibration, with conditional calibration being the property that the conditional expectation of the hit percentage is equal to θ (Nolde and Ziegel 2017). Given the short length of the time series considered in our analysis, we assess only unconditional calibration.

In addition to calibration, a quantile forecast should be evaluated using a score. A score can be viewed as a measure of how closely the forecast varies over time with the actual quantile. The score is said to be consistent if it is minimised by the true quantile. The use of a consistent score ensures honest reporting by a forecaster. For quantile forecasts, the most widely used consistent score is the quantile regression loss function (see Koenker and Machado 1999; Taylor 1999). We refer to it as the quantile score, and present it as follows:

$$QS_t(\theta) = 2\left(y_t - q_t(\theta)\right)\left(\theta - I\{y_t \le q_t(\theta)\}\right) \qquad (1)$$

where $y_t$ is the observation in period t, $q_t(\theta)$ is the forecast of the θ quantile, and $I\{\cdot\}$ is the indicator function. When θ = 50%, the score reduces to the absolute error, showing that the widely used mean absolute error is an appropriate score for a point forecast defined as a prediction of the median. To summarize forecasting performance across a time series, the average of the score is computed.

A consistent score for an interval forecast is produced by summing the quantile score for the quantiles bounding the interval. For an interval bounded by $q_t(\alpha/2)$ and $q_t(1-\alpha/2)$, if we sum the quantile scores and divide by α, we get the following interval score (Winkler 1972):

$$IS_t(\alpha) = \left(u_t - l_t\right) + \frac{2}{\alpha}\left(l_t - y_t\right)I\{y_t < l_t\} + \frac{2}{\alpha}\left(y_t - u_t\right)I\{y_t > u_t\} \qquad (2)$$

where $l_t$ is the interval's lower bound $q_t(\alpha/2)$ and $u_t$ is its upper bound $q_t(1-\alpha/2)$. The score has the intuitive interpretation that it rewards narrow intervals, with observations that fall outside the interval incurring a penalty, the magnitude of which depends on the value of α. In an application to influenza forecasting, Bracher et al. (2020) use this interpretation to seek insight into why the interval score for one forecasting model is lower than another.

A notable aspect of forecast evaluation for our application is that we have a relatively small number of time periods, which is of particular concern when evaluating extreme quantiles, such as the 2.5% and 97.5% quantiles. In view of this, and the fact that the relative performances of the methods were similar across the four lead times, we opted to average the hit percentage and interval score across the lead times. In terms of averaging these measures over the 52 mortality series, this is unproblematic for calibration, but as the level of mortality varies greatly across the series, it is inevitable that averaging will lead to the interval score being dominated by its value for the high mortality series.
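As a concrete illustration of expressions (1) and (2), and of the hit percentage, the following Python sketch evaluates arrays of observations against quantile and interval forecasts. The factor of two in the quantile score is what makes the θ = 50% case equal to the absolute error; the function names are ours, not part of any evaluation package.

```python
import numpy as np

def quantile_score(y, q, theta):
    """Quantile score of expression (1); at theta = 0.5 it equals the absolute error."""
    y, q = np.asarray(y, dtype=float), np.asarray(q, dtype=float)
    return 2.0 * (y - q) * (theta - (y <= q))

def interval_score(y, lower, upper, alpha):
    """Interval score of expression (2) for a central (1 - alpha) interval."""
    y = np.asarray(y, dtype=float)
    l, u = np.asarray(lower, dtype=float), np.asarray(upper, dtype=float)
    return (u - l) + (2 / alpha) * (l - y) * (y < l) + (2 / alpha) * (y - u) * (y > u)

def hit_percentage(y, q):
    """Percentage of observations at or below the quantile forecasts,
    mirroring the indicator in expression (1) (unconditional calibration)."""
    return 100.0 * np.mean(np.asarray(y) <= np.asarray(q))
```

Summing quantile_score at the α/2 and 1−α/2 levels and dividing by α reproduces interval_score, which is the relationship described above.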
In view of the interval score being dominated by the high mortality series, we report results for the following three groupings of the 52 series: the 22 states with cumulative mortality less than 1,000 at the end of the final week of our dataset; the 25 states with more than 1,000 and less than 10,000 at the end of the final week; and the series with at least 10,000 at the end of the final week, which corresponded to the four hardest hit states and the national U.S. series. We refer to these groupings as the low, medium and high mortality series.

For the three groupings, the mean of the interval score for 95% interval forecasts is presented in the first three columns of values in Table 1, and their ranks are provided in the next three columns. The unit of the score is deaths, and lower values of the score reflect greater accuracy. The final three columns present the skill score, which measures the percentage by which the score for a given method is better than a benchmark method, which we chose to be the simple average. For conciseness, we present the results for trimming with β = 20%, 40%, 60% and 80%, as we feel this provides an adequate summary.

For the low and medium mortality series, comfortably the best results were obtained using the median aggregation method and symmetric trimming with β = 40%, 60% and 80%. For these series, the ensemble, the simple average and asymmetric trimming performed quite poorly. However, for the high mortality series, the best results were produced by the simple average, with symmetric trimming and asymmetric interior trimming also performing well. For the high mortality series, a curious finding is that the simple average of our set of forecasts outperforms the ensemble forecast, which is a simple average of a narrower set of forecasts. Including the additional information seems to have benefitted our simple average aggregation. For all series, the envelope aggregation method produced poor results.

Note: The unit of the score is deaths. Lower values of the score and higher values of the skill score are better.

Table 2 presents the mean of the interval score for 50% interval forecasts. The results are broadly consistent with those for the 95% interval. Symmetric trimming and the median method were the most accurate for the low and medium mortality series. For the high mortality series, the best results were achieved with the simple average.

Figures 6 to 8 summarize calibration by using Q-Q plots to report the hit percentages for the bounds of both the 50% and 95% interval forecasts. To ensure readability of the plots, we present the results for just the ensemble, simple average, median and two of the symmetric trimming methods, as these were competitive in terms of the interval score in Tables 1 and 2. In Figures 6 to 8, the dashed line indicates the ideal performance. Figures 6 and 7 show that, for each method applied to the low and medium mortality series, forecasts of the 2.5% and 97.5% quantiles were very close to the ideal. In these figures, for the 25% and 75% quantiles, the best results were for symmetric trimming and for the median method. Figure 8 shows that, for the high mortality series, all the methods produced forecasts of the 25%, 75% and 97.5% quantiles that were too low.

Note: The unit of the score is deaths. Lower values of the score and higher values of the skill score are better.

Figure 6: Calibration hit percentages for bounds on 50% and 95% intervals for the low mortality series.
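The skill score reported in the tables is described as the percentage by which a method's mean score improves on the simple-average benchmark. The following is a minimal sketch of that calculation, under our reading of the description (the exact formula is not stated in the text).

```python
def skill_score(method_score, benchmark_score):
    """Percentage by which a method's mean score improves on the benchmark score.
    Positive values indicate the method beats the benchmark (here, the simple average)."""
    return 100.0 * (benchmark_score - method_score) / benchmark_score

# Example: a mean interval score of 80 deaths versus a simple-average benchmark of 100 deaths
# gives a skill score of 20%.
print(skill_score(80.0, 100.0))   # 20.0
```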
In this section, we briefly review the distributional forecast aggregation methods that have been proposed in the literature, and then list the methods that we implemented. We then describe measures for evaluating distributional forecast accuracy, before using these to compare the different aggregation methods.

The simple average is a well-established approach for aggregating distributional forecasts (see, for example, Stone 1961). It has typically taken the form of the linear opinion pool, which, for any chosen value of the random variable, is the average of the corresponding cumulative probabilities obtained from the distributional forecasts. However, this form of averaging has been criticized because, even when the quantiles of the individual distributions are calibrated, the linear opinion pool will not itself provide perfect calibration (Hora 2004; Ranjan and Gneiting 2010). It has been noted that, when there is diversity among the means of the individual forecasts, this will tend to lead to an exaggerated variance in the forecast of the linear opinion pool (see, for example, Dawid 1995). To address these problems, Lichtendahl et al. (2013) propose that, instead of averaging the cumulative probabilities, the distributional forecasts should be averaged by taking the mean of their corresponding quantile forecasts. In other words, for a chosen probability level θ, they suggest that aggregation is performed by averaging the forecasts of the quantile q_t(θ) provided by each individual distributional forecast. They show how this leads to more attractive theoretical properties than the linear opinion pool. For our application in this paper, it provides a more convenient approach to averaging because each forecasting team provides its distributional forecast in terms of quantile forecasts for the same 23 values of the probability level θ.

To provide robust aggregation, Hora et al. (2013) propose that the median is used. For any chosen value of the random variable, their approach finds the median of the corresponding values of the cumulative probability forecasts. In fact, they show that this delivers the same aggregate distributional forecast as an approach that uses the median of forecasts of the quantile q_t(θ) for a chosen probability level θ.

As with interval forecast aggregation, trimming has also been proposed for distributional forecasts. Jose et al. (2014) propose interior and exterior trimming approaches, which involve trimming the innermost and outermost distributions, respectively, from a set of distributional forecasts. To enable the trimming, the distributional forecasts must essentially be ordered in some way, and for this, they propose two alternative approaches: the CDF approach (CA) and the mean approach (MA). MA orders the distributional forecasts according to their means, and involves trimming entire distributional forecasts. After a proportion have been trimmed, the authors use a linear opinion pool to average the rest. CA orders the distributional forecasts separately for each of a set of values of the random variable. After the trimming is performed, the aggregate forecast is computed as the average of the cumulative probabilities given by the remaining distributional forecasts. Jose et al. (2014) note that CA could be adapted so that the trimming and averaging is performed on forecasts of the quantile q_t(θ) for any chosen value of the probability level θ, which would be more consistent with the advice of Lichtendahl et al. (2013) to average quantiles, rather than probabilities.
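The contrast between the linear opinion pool and quantile averaging can be illustrated with a small numerical example. The two normal distributions below are hypothetical and chosen only to show how pooling cumulative probabilities inflates the spread of the aggregate when the individual means are diverse; this is our own illustration, not an example taken from the papers cited above.

```python
import numpy as np
from scipy import stats

# Two illustrative (hypothetical) normal distributional forecasts with different means.
f1, f2 = stats.norm(100, 10), stats.norm(140, 10)
theta = 0.75

# Linear opinion pool: average the cumulative probabilities, then invert numerically.
grid = np.linspace(0, 300, 30001)
pool_cdf = 0.5 * (f1.cdf(grid) + f2.cdf(grid))
lop_quantile = grid[np.searchsorted(pool_cdf, theta)]

# Quantile averaging (Lichtendahl et al. 2013): average the individual quantile forecasts.
avg_quantile = 0.5 * (f1.ppf(theta) + f2.ppf(theta))

print(round(lop_quantile, 1), round(avg_quantile, 1))
# The linear opinion pool's 75% quantile is noticeably larger than the average of the
# individual 75% quantiles, reflecting the wider spread induced by pooling probabilities.
```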
For our application, averaging quantiles is a more convenient way to implement CA, as our distributional forecasts have each been provided in the form of a set of quantile forecasts. Following similar reasoning, we avoided the linear opinion pool with MA, so that, following trimming, quantile forecasts are averaged.

For interval forecast aggregation, symmetric trimming is motivated by robustness, and asymmetric trimming enables the reduction of the impact of a tendency among the individual forecasters to be either under- or overconfident. It is worth noting that analogous asymmetric methods are not straightforward for distributional forecasts because of the need to ensure that the resulting distribution function is monotonically increasing. It is also interesting to note that, although the trimming methods proposed by Jose et al. (2014) are all symmetric, and hence their exterior trimming will enable the removal of outliers, their main motivation for trimming is to address under- or overconfidence among the individual forecasters. In their study of ensembles of overfit and overconfident forecasts, Grushka-Cockayne et al. (2017) show that exterior trimming is useful when the individual distributional forecasts have diverse means. They demonstrate this with an ensemble of distributional forecasts from a quantile regression forest. They show that the tendency for this machine learning method to overfit leads to diverse means among the ensemble, which can be addressed by CA with exterior trimming.

For each mortality series, forecast origin and lead time, we implemented the following methods:

Ensemble: This is the simple average produced by the COVID-19 Forecast Hub. For each of the 23 quantiles, the average is computed of the corresponding quantile forecasts obtained from the individual distributional forecasts. As we explained in Section 2, the ensemble used a subset of the forecasts included in all other aggregation methods that we considered.

Simple average: For each of the 23 quantiles, we averaged the corresponding quantile forecasts. We used our full set of forecasts for this and the other aggregation methods described below.

Median: For each of the 23 quantiles, we found the median of the quantile forecasts.

CA exterior trimming: For each of the 23 quantiles, we computed the aggregate as the average of the quantile forecasts remaining after we had removed the N lowest-valued and N highest-valued quantile forecasts, where N is the largest integer less than or equal to the product of β/2 and the total number of forecasts.

CA interior trimming: With this method, for each of the 23 quantiles, the innermost quantile forecasts were trimmed. The aggregate was computed as the average of the quantile forecasts that were either among the N lowest-valued or N highest-valued quantile forecasts, where N is the largest integer less than or equal to the product of (1−β)/2 and the total number of forecasts.

MA exterior trimming: This method involved trimming entire distributional forecasts. The trimming was based on the mean of each distributional forecast, which we estimated using the average of the 23 quantile forecasts. The aggregate forecast was computed by averaging the distributional forecasts that remained after the removal of the N distributional forecasts with the lowest-valued means and the N distributional forecasts with the highest-valued means, where N is the largest integer less than or equal to the product of β/2 and the total number of forecasts.

MA interior trimming: This was similar to MA exterior trimming, except that the innermost distributional forecasts were trimmed. The aggregate was computed as the average of the distributional forecasts that were among the N distributional forecasts with the lowest-valued means and the N distributional forecasts with the highest-valued means, where N is the largest integer less than or equal to the product of (1−β)/2 and the total number of forecasts.
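To make the CA and MA trimming methods concrete, the following is a minimal Python sketch, assuming the individual distributional forecasts for one series, forecast origin and lead time are stacked as a teams-by-23 array of quantile forecasts. The names and conventions are ours, not the authors' code, and small-crowd edge cases (N = 0) are not handled.

```python
import numpy as np

# Q is a (teams x 23) array of quantile forecasts: one row per team,
# one column per probability level, for a given series, origin and lead time.

def ca_exterior_trim(Q, beta):
    """CA exterior trimming: for each probability level (column), drop the N lowest-
    and N highest-valued quantile forecasts, N = floor(beta/2 * number of teams),
    and average the remainder."""
    Q = np.sort(np.asarray(Q, dtype=float), axis=0)
    n = int(np.floor(beta / 2 * Q.shape[0]))
    return Q[n:Q.shape[0] - n].mean(axis=0)

def ca_interior_trim(Q, beta):
    """CA interior trimming: for each probability level, keep only the N lowest- and
    N highest-valued quantile forecasts, N = floor((1 - beta)/2 * number of teams),
    and average them."""
    Q = np.sort(np.asarray(Q, dtype=float), axis=0)
    m = Q.shape[0]
    n = int(np.floor((1 - beta) / 2 * m))
    return np.vstack([Q[:n], Q[m - n:]]).mean(axis=0)

def ma_exterior_trim(Q, beta):
    """MA exterior trimming: rank whole distributional forecasts by their mean
    (approximated by the average of the 23 quantile forecasts), drop the N lowest-
    and N highest-mean forecasts, and average the survivors quantile by quantile."""
    Q = np.asarray(Q, dtype=float)
    order = np.argsort(Q.mean(axis=1))
    n = int(np.floor(beta / 2 * Q.shape[0]))
    return Q[order][n:Q.shape[0] - n].mean(axis=0)

def ma_interior_trim(Q, beta):
    """MA interior trimming: keep only the N lowest- and N highest-mean distributional
    forecasts, N = floor((1 - beta)/2 * number of teams), and average them."""
    Q = np.asarray(Q, dtype=float)
    order = np.argsort(Q.mean(axis=1))
    m = Q.shape[0]
    n = int(np.floor((1 - beta) / 2 * m))
    return np.vstack([Q[order][:n], Q[order][m - n:]]).mean(axis=0)
```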
Gneiting et al. (2007) describe how the aim of distributional forecasting is to maximise sharpness subject to calibration. Sharpness concerns the concentration of the distributional forecast, and calibration assesses its statistical consistency with the data. Randomly sampled values from a calibrated distributional forecast are indistinguishable from the observations (Gneiting and Katzfuss 2014). To evaluate calibration for the distributional forecasts in our study, we computed the hit percentage for each of the 23 quantiles.

A score summarises calibration and sharpness, and is said to be proper if it is minimized by the true distribution. As with consistent scoring functions for quantiles, proper distributional scores are recommended to ensure forecasters report honest predictions. A widely used proper score for distributions of continuous random variables is the continuous ranked probability score (CRPS) (see Gneiting and Raftery 2007). It can be viewed in several different ways, including as the integral of the quantile score of expression (1) with respect to the probability level θ. For our application, where we have quantile forecasts for just 23 different values of θ, we approximate the CRPS using expression (3), which is the sum of the quantile scores for the 23 quantile forecasts:

$$S_t = \sum_{i=1}^{23} QS_t(\theta_i) \qquad (3)$$

where θ_1, ..., θ_23 are the 23 probability levels. This is a proper score, and this can be seen by viewing it as a quantile-weighted version of the CRPS (see Gneiting and Ranjan 2011). For simplicity, in this paper, we refer to expression (3) as the CRPS. We note that Bracher et al. (2020) present it as a form of weighted interval score, because it can be written as a weighted sum of the interval score of expression (2) and the quantile score of expression (1) for the median.

Although our main interest in this paper is probabilistic forecasting, we also consider point predictions of the median, as this conveys the accuracy of the centre of location of the distributional forecasts, and of course such point forecasts are often the main focus of attention. We evaluate these point forecasts using the mean absolute error (MAE).

Table 3 presents the mean of the CRPS for the three groupings of series, along with ranks and skill scores. The unit of the CRPS is the number of deaths, and lower values of the score reflect greater accuracy. For the low and medium mortality series, the best results were produced by median aggregation and the exterior trimming aggregation methods, regardless of whether the trimming was performed within the CA or MA approach. In view of our comments in Section 4.1, the usefulness of exterior trimming suggests either outliers or that the forecasting teams were generally underconfident. For the high mortality series, the best results were obtained using the simple average and the trimming methods with low β. As with the interval forecasting, it is notable that, for the high mortality series, the simple average of our set of forecasts outperformed the Hub's ensemble, which is a simple average of a more selective set of forecasts.

In Table 4, we evaluate point forecast accuracy for the median using the MAE. The unit of the MAE is the number of deaths. The relative performances of the methods in Table 4 are similar to those in Table 3 for the CRPS.
This is quite common when there is sizeable variation over time in the location of the probability distribution, because inaccuracy in the prediction of the distribution's location will affect the accuracy of all the quantiles, and hence the whole distribution.

Note: The unit of the score is deaths. Lower values of the score and higher values of the skill score are better. Bold indicates the best three methods in a column.

For the three groupings of series, Figures 9 to 11 present Q-Q plots to summarize the calibration hit percentages for each of the 23 quantiles. To ensure legibility of the figures, we include just the following five methods in each: the ensemble, the simple average, the median, and two of the more successful exterior trimming methods. Figure 9 shows that, for the low mortality series, on average, the quantiles of the distributional forecasts were too high, while for the high mortality series, Figure 11 shows that the quantile forecasts were generally too low. In Figures 9 and 10, we see that, for the low and medium mortality series, the best calibration results correspond to median aggregation and exterior trimming. For the high mortality series, Figure 11 shows that the best calibration corresponds to the ensemble and simple average. For the better performing aggregation methods, the greatest accuracy was achieved by aggregating all available forecasts.

We have provided an empirical comparison of aggregation methods for interval and distributional forecasts of cumulative mortality due to COVID-19. The forecasts were produced by teams using a variety of approaches, including compartmental and statistical models. Aggregation provides a pragmatic way to synthesize the diverse information underlying these models. The simple average is a natural benchmark, but it is also worth considering the median, as well as trimming methods, which enable robust estimation and adjustment for the case where forecasters tend to be under- or overconfident.

Given that we are currently at a relatively early stage of the pandemic, for each mortality time series, we do not have a record of historical accuracy for the same set of past periods for all forecasting teams, as we discussed in Section 2 in relation to Figure 5. Indeed, for new participants, we have no record of past accuracy. This influenced the aggregation methods that we chose to consider, as it is not clear how to optimize the weights within some form of weighted forecast combination. However, we should acknowledge that the trimming methods do require the choice of a trimming parameter. This could be optimized, either for each mortality series, or by recording accuracy averaged across a group of series. An appeal of the simple average and median is that they require no parameter optimization, and in view of our empirical results, we recommend the use of the median for the low and medium mortality series, and the simple average for the high mortality series. There is potential to improve the contribution of epidemiological models to the policy making process in dealing with the COVID-19 pandemic by considering aggregated predictions, as they offer several potential advantages, including greater accuracy, compared to predictions from single models.
References

Special report: The simulations driving the world's response to COVID-19
The combination of forecasts
Evaluating epidemic forecasts in an interval format
Impact of reseller's forecasting accuracy on channel member performance
Underlying trend, seasonality, prediction, forecasting and the contribution of risk factors: an analysis of globally reported cases of Middle East Respiratory Syndrome Coronavirus. Epidemiology and Infection
Coherent combination of experts' opinions
Vox populi (the wisdom of crowds)
A stochastic latent moment model for electricity price formation
Probabilistic forecasts, calibration and sharpness
Probabilistic forecasting
Comparing density forecasts using threshold- and quantile-weighted scoring rules
Strictly proper scoring rules, prediction, and estimation
Combining prediction intervals in the M4 competition
Ensembles of overfit and overconfident forecasts
Emergence of big data research in operations management, information systems, and healthcare: Past contributions and future roadmap
Wrong but Useful - What Covid-19 Epidemiologic Models Can and Cannot Tell Us
Probability judgments for continuous quantities: linear combinations and calibration
Median aggregation of distribution functions
Trimmed opinion pools and the crowd's calibration problem
A systematic review and evaluation of Zika virus forecasting and prediction research during a public health emergency of international concern
Goodness of fit and related inference processes for quantile regression
Intuitions about combining opinions: Misappreciation of the averaging principle
Infectious disease forecasting for public health
Is it better to average probabilities or quantiles?
A double epidemic model for the SARS propagation
Elicitability and backtesting: Perspectives for banking regulation
Aggregating multiple probability intervals to improve calibration
The novel coronavirus originating in Wuhan, China: Challenges for global health governance
Combining probability forecasts
Three-Month Real-Time Dengue Forecast Models: An Early Warning System for Outbreak Alerts and Policy Decision Support in Singapore
Modelling the pandemic
The opinion pool
The Wisdom of Crowds. New York: Doubleday
Taylor J W (1999) Evaluating volatility and interval forecasts
The RAPIDD Ebola forecasting challenge: Synthesis and lessons learnt
Risk hedging for production planning
Estimation of excess deaths associated with the COVID-19 pandemic in the United States
A decision-theoretic approach to interval estimation
Probability forecasts and their combination: a research perspective
World Health Organization (2020a) Coronavirus disease (COVID-19)
World Health Organization (2020b) WHO Director-General's opening remarks at the media briefing on COVID-19 - 11

Acknowledgements

We are very grateful to all the forecasting groups who generously made their forecasts available on the Reich Lab COVID-19 Forecast Hub, and to Nicholas Reich and his team, for acting as curators for the Hub, and for providing such convenient access to the data, along with useful supplementary information. We would also like to thank Nia Roberts for clarifying our understanding of the licence terms for the forecast data.

In Section 2, we discussed the different methods used by the individual forecasting teams. We described how compartmental models were used by approximately half the teams, and we illustrated this in Figure 4. We were curious to see how the aggregation methods would perform if they were applied to only the teams using compartmental models.
Given their widespread use by epidemiologists, one might surmise that aggregating only these models would be adequate, and that an aggregation using only the other types of models would deliver poor results. We investigate these issues in Table 5, where we report CRPS results produced by applying the aggregation methods to the following three different sets of the individual forecasting teams: all the teams, the teams using compartmental models, and the teams not using compartmental models. For the medium and high mortality series, Table 5 shows that aggregating only compartmental models is better than aggregating only the other types of models. For the low mortality

The terms and conditions of the forecasts that were analysed are recorded in the forecasting groups' supplementary files on the Reich Lab COVID-19 Forecast Hub GitHub website: https://github.com/reichlab/COVID19-forecast-hub/tree/master/data-processed

UT-Mobility: This model makes predictions assuming that social distancing patterns, as measured by anonymized mobile-phone GPS traces, remain constant in the future. Only models *first-wave deaths*. https://github.com/reichlab/covid19-forecast-hub/tree/master/data-processed/UT-Mobility

* Based on information recorded on GitHub.
§ Classed as other model although it has a compartmental structure (SLIR), as it also has agent-based elements. This is a hybrid model.
† Classed as other model as it is an extension to a non-linear growth model with a prior distribution SIR curve fitted. Therefore, it is not a compartmental model.