Combining Probabilistic Forecasts of COVID-19 Mortality in the United States
Taylor, James W.; Taylor, Kathryn S.
Eur J Oper Res, 28 June 2021. DOI: 10.1016/j.ejor.2021.06.044

The COVID-19 pandemic has placed forecasting models at the forefront of health policy making. Predictions of mortality, cases and hospitalisations help governments meet planning and resource allocation challenges. In this paper, we consider the weekly forecasting of the cumulative mortality due to COVID-19 at the national and state level in the U.S. Optimal decision-making requires a forecast of a probability distribution, rather than just a single point forecast. Interval forecasts are also important, as they can support decision making and provide situational awareness. We consider the case where probabilistic forecasts have been provided by multiple forecasting teams, and we combine the forecasts to extract the wisdom of the crowd. We use a dataset that has been made publicly available from the COVID-19 Forecast Hub. A notable feature of the dataset is that the availability of forecasts from participating teams varies greatly across the 40 weeks in our study. We evaluate the accuracy of combining methods that have been previously proposed for interval forecasts and predictions of probability distributions. These include the use of the simple average, the median, and trimming methods. In addition, we propose several new weighted combining methods. Our results show that, although the median was very useful for the early weeks of the pandemic, the simple average was preferable thereafter, and that, as a history of forecast accuracy accumulates, the best results can be produced by a weighted combining method that uses weights that are inversely proportional to the historical accuracy of the individual forecasting teams.

The coronavirus disease 2019 (COVID-19) was declared a pandemic by the World Health Organisation on 11 March 2020 (WHO, 2020). By the end of January 2021, over 102 million individuals had been infected, and COVID-19 had caused over 2.2 million deaths worldwide (WHO, 2021). The pandemic has created enormous planning and resource allocation challenges. Governments are relying upon predictions of the numbers of COVID-19 cases, people hospitalised and deaths to help decide what actions to take (Adam, 2020; Nikolopoulos et al., 2021; Phelan et al., 2020). In this paper, we consider the short-term forecasting of reported deaths from COVID-19. Forecasting methods are well established in providing predictions of uncertain events to decision makers across a variety of settings, ranging from energy providers and individuals relying on the weather outlook, to investors eager to gain insight into future economic conditions. Epidemiological forecasting models have been applied to both vector-borne diseases, including Dengue disease (Shi et al., 2016) and the Zika virus (Kobres et al., 2019), and contagious infectious diseases. These include the Severe Acute Respiratory Syndrome (SARS) (Ng et al., 2003), Ebola (Viboud et al., 2017) and the Middle East respiratory syndrome (MERS) (Da'ar et al., 2018). Numerous COVID-19 models have emerged (Adam, 2020; COVID-19 Forecast Hub, 2020; Nikolopoulos et al., 2021). These models are based on different assumptions and therefore answer different questions (Holmdahl and Buckee, 2020).
Due to the lack of data, assumptions have to be made about several factors, including the extent of immunity, transmission among people who are asymptomatic, and how the public will react to new government restrictions. Paucity of data is a common challenge in forecasting infectious diseases (Lauer et al., 2020). Policy makers need to be aware of the limitations of the models, and need to be conscious of the uncertainty in predictions from these models (Sridhar and Majumder, 2020).

Gneiting and Katzfuss (2014) describe how optimal decision-making relies on the availability of a forecast of a probability distribution, rather than a single point forecast (see, for example, Gianfreda and Bunn, 2018). We refer to such a probabilistic forecast as a distributional forecast. Probabilistic predictions can also take the form of interval forecasts. A variety of definitions exist for interval forecasts (Brehmer and Gneiting, 2020). In this paper we use the most common, which is that a (1-α) interval forecast is an interval predicted to contain the true outcome with probability (1-α), with equal probability of the outcome being above or below the interval. Interval forecasts are valuable, as they can support real-time decision making and provide situational awareness (see, for example, Grushka-Cockayne and Jose, 2020; Bracher et al., 2021).

In this paper, we consider the case where probabilistic forecasts of deaths are provided by multiple forecasters. We combine the forecasts to extract the wisdom of the crowd. Combining provides a pragmatic approach to synthesising the information underlying different forecasting methods. It also enables diversification of the risk inherent in selecting a single forecaster who may turn out to be poor, and it offsets statistical bias associated with individual forecasters who tend to be under- or overconfident. Harnessing the wisdom of a crowd has been found to be useful in many forecasting applications, ranging from sports betting to economics to weather and climate modelling (see Brown and Reade, 2019; Budescu and Chen, 2015; Mote et al., 2016). Since the early work by Bates and Granger (1969) on combining point forecasts, a variety of simple and sophisticated combining methods have been proposed. Recent work has involved combining probabilistic forecasts. Winkler et al. (2019) predict that probabilistic forecast combining will become more common due to developments in the field, the rising popularity of forecasting competitions, and the raised awareness resulting from increased reporting of probabilistic predictions in the media.

In this paper, we evaluate combining methods applied to multiple probabilistic forecasts of cumulative COVID-19 mortality at the state and national level in the U.S., using data made publicly available (see COVID-19 Forecast Hub, 2020). A notable characteristic of this dataset is that the availability of forecasts from the participating forecasting teams varies greatly across the duration of the dataset. Without a comparable record of past accuracy, it is not clear how best to implement a weighted combining method. This situation has led researchers to focus on combining methods that do not rely on the historical accuracy of the individual forecasters. An example is the simple average. Its success for combining point forecasts has motivated its use for probabilistic forecasts.
The median and trimmed means have also been proposed for combining interval and distributional forecasts, as they provide simple, robust alternatives to the mean (Hora et al., 2013; Gaba et al., 2017). In this paper, we evaluate these combining methods for the COVID-19 dataset. We also introduce several weighted combining methods. These address whether it is beneficial to allocate weights using the historical accuracy of each forecaster when these accuracies are not directly comparable.

In Section 2, we describe the dataset and the rise in mortality due to COVID-19 in the U.S. We consider interval forecast combining in Section 3, and the combination of distributional forecasts in Section 4. We separate our consideration of interval and distributional forecasting because the combining methods differ for these two forms of probabilistic forecasts. Section 5 provides a summary and concluding comments.

In this section, we first summarise the progression in mortality due to COVID-19 across the U.S. We then describe the forecasts in the dataset, and the criteria that we applied to select forecasts from this dataset for inclusion in our analysis. The COVID-19 Forecast Hub is curated by a group led by Nicholas Reich (COVID-19 Forecast Hub, 2020). The Hub provides open access to weekly observations and forecasts for the cumulative total of reported COVID-19 deaths, as well as observations and forecasts for the total deaths each week (incident deaths). These data are available for the U.S. both at the national and state levels. The forecasts are submitted by multiple forecasting teams from academia, industry and government-affiliated groups.

The actual number of deaths from COVID-19 will be under-reported due to various factors, including reporting delays and the lack of testing. There will also be indirect deaths as hospitals, overwhelmed by COVID-19, have had to delay diagnosis and treatment for other conditions, such as cancer. Therefore, the impact of COVID-19 will be judged ultimately in terms of the excess mortality, that is, the number of deaths above the number that would have been expected (Weinberger et al., 2020). The actual cumulative deaths described in this paper are obtained from the COVID-19 Forecast Hub. These data are from the daily reports issued by the Center for Systems Science and Engineering at Johns Hopkins University. Based on these data, the first reported death due to COVID-19 in the U.S. was in the State of Washington, in the week ending 29 February 2020. By 30 January 2021, the total number of COVID-19 deaths in the U.S. had risen to 439,530. Figure 1 shows the number of deaths across the U.S. on this date. Figure 2 shows the rise in the cumulative total number of COVID-19 deaths in the five states with the highest cumulative total on this date.

Figure 2. Rise in COVID-19 deaths in the five states with highest cumulative total on 30 January 2021.

The curators of the COVID-19 Forecast Hub ask forecasting teams to submit forecasts for one-week periods ending at midnight on Saturday evenings. The weeks are numbered starting with Week 0, defined as the week ending on Saturday 21 December 2019. At the end of each week, the numbers of incident and cumulative deaths are published, and with that week as forecast origin, the teams submit forecasts for 1 to 4 weeks ahead. For each of these lead times, and for incident and cumulative deaths, the teams provide a forecast of the probability distribution, which we refer to as the distributional forecast.
It is provided in the form of forecasts of the quantiles corresponding to the following 23 probability levels: 1%, 2.5%, 5%, 10%, 15%, 20%, 25%, 30%, 35%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, 97.5% and 99%.

Figure 3. One week-ahead distributional forecasts produced with Week 28 as forecast origin.

In Figure 3, we give examples of distributional forecasts for the national level of cumulative mortality, and for the states of New York and Florida. In each plot, the distribution function shown as the dashed line is a forecast combination, proposed and made available by the curators of the Hub. They refer to it as the ensemble forecast, and explain that, up to Week 31, it was the simple average of forecasts provided by the individual participating teams, and after that date, it was the median of these forecasts. The median distributional forecast is highlighted in black in each plot. Figure 3 shows that there is considerable variation among the distributional forecasts in terms of their location, spread and shape. It is also interesting to note that, at least for New York and Florida, there are a number of outlying distributional forecasts, which motivates consideration of alternative combining methods, based on robust estimation, such as the median and trimming. We discuss this further in Sections 3 and 4.

The COVID-19 Forecast Hub provides information regarding the methods used by the various forecasting teams and the licensing conditions for the use of each team's data. The supplementary material to this paper summarises this information. Approximately half of the teams use compartmental models. These involve the estimation of the rates at which the population, or sectors of the population, transfer between the states of being susceptible, exposed, infected and recovered/removed. Hence, they are widely referred to as SEIR or SIR models. The other forecasting teams used a variety of approaches, including agent-based simulation, statistical models, and deep learning. The use of data-driven machine learning methods is consistent with the increasing use of such methods in healthcare (see Guha and Kumar, 2018). Surowiecki (2004) describes conditions under which wisdom can best be extracted from a crowd. These include independent contributors, diversity of opinions, and a trustworthy central convenor to collate the information provided. Forecasts from the Hub satisfy these conditions.

In our analysis, we considered forecasts made with forecast origins from Week 18 up to Week 57, that is, from the week ending 25 April 2020 to the week ending 23 January 2021. In terms of observed values with which to compare the forecasts, we used weekly data up to Week 58. Week 18 seemed a reasonable starting point for our analysis because levels of mortality were relatively low prior to this, and it was the first forecast origin for which the ensemble forecast was included in the Hub's visualisation. Since Week 20, the curators of the Hub have produced files listing the forecasts that they did not include in their ensemble, based on several data screening checks. We omitted these forecasts from our analysis, and also followed the curators by treating as ineligible any submission that did not provide forecasts for all 23 quantiles and all four lead times. Provided that this criterion was met, we included forecasts not recorded as being assessed for eligibility, because we felt we had no clear justification for omitting them.
We considered data screening checks for these forecasts, but concluded that setting our own thresholds for inclusion would be arbitrary. Consequently, we applied combining methods to a dataset of forecasts that included 11% more forecasts than were included in the ensemble combining method. Figure 4 shows the total number of forecasting teams included in our study for each forecast origin from Weeks 18 to 57. For each week, the figure also shows the split between the number of teams that used compartmental models and the number using alternatives. Note that the number of teams shown for each week in Figure 4 is an upper bound for the number available for combining for any one time series. This is because some teams either did not provide forecasts or did not provide eligible forecasts for all time series.

For each of the 49 forecasting teams that submitted forecasts, Figure 5 shows, for each week, whether we were able to include that team in the combined forecasts for at least one series. A break in the horizontal line for a team indicates that forecasts from that team were not available and eligible in every week following the first week in which it submitted forecasts. The circles in Figure 5 give an indication of the extent to which each forecasting team featured in our study. The figure shows that, even when a record of past historical accuracy becomes available for all teams, accuracy will not be available for the same past periods and same time series. This has potential implications for how to implement a weighted combining method based on the historical accuracy of each method. It would be interesting to compare the accuracy of the forecasts produced by the different teams. However, as Figure 5 shows, for most forecasting teams, forecasts were not available for many of the 52 series and 40 forecast origins in our study. Indeed, a full set of forecasts for all series and forecast origins was not available from any of the teams.

In this section, we empirically compare a variety of interval forecast combining methods. After describing the structure of our empirical study, we discuss measures used to evaluate interval forecasts. We then review the literature on interval forecast combining methods, before introducing new weighted methods. We summarise the methods that we implemented, and then report the empirical results.

As we explained in Section 2.3, we used forecasts produced at Weeks 18 to 57, which amounted to 40 forecast origins. Some of the combining methods require parameter estimation, and for this we opted to use at least 10 weeks of data. Therefore, we evaluated out-of-sample forecasts for the final 30 weeks of our dataset. As we moved the forecast origin through the out-of-sample period, we re-estimated parameters using all weeks up to and including the forecast origin. As the parameter optimisation and several combinations involve forecast accuracy measures, we now discuss interval forecast evaluation.

In this paper, our interest is in (1-α) intervals that are bounded by the α/2 and (1-α/2) quantiles. We consider α=5% and 50%, which correspond to 95% and 50% intervals, respectively. Our choice of these intervals reflects those presented in the visualisation of the COVID-19 Forecast Hub. The accuracy of a set of interval forecasts can be assessed by evaluating the forecasts of the quantiles bounding the interval. A simple measure of accuracy of forecasts of the α quantile is to check the percentage of observations that fall below the forecast.
We refer to this as the hit percentage. If this is equal to α, the forecast is said to be calibrated. More precisely, we should refer to this as unconditional calibration, with conditional calibration being the property that the conditional expectation of the hit percentage is equal to α (Nolde and Ziegel, 2017). Given the short length of the time series considered in our analysis, we assess only unconditional calibration.

In addition to calibration, a quantile forecast should be evaluated using a score. A score can be viewed as a measure of how closely the forecast varies over time with the actual quantile. The score is said to be consistent if it is minimised by the true quantile. The use of a consistent score ensures honest reporting by a forecaster. For quantile forecasts, the most widely used consistent score is the quantile regression loss function (see Koenker and Machado, 1999; Taylor, 1999). We refer to it as the quantile score, and present it as follows:

QS_t(α) = 2 ( I{y_t ≤ q_t(α)} − α ) ( q_t(α) − y_t ),        (1)

where y_t is the observation in period t, q_t(α) is the α quantile, and I{·} is the indicator function. When α=50%, the score reduces to the absolute error, showing that the widely used mean absolute error is an appropriate score for a point forecast defined as a prediction of the median. To summarise forecasting performance across a time series, the average of the score is computed.

A consistent score for an interval forecast is produced by summing the quantile score for the quantiles bounding the interval. For an interval bounded by q_t(α/2) and q_t(1-α/2), if we sum the quantile scores and divide by α, we get the following interval score (Winkler, 1972):

IS_t(α) = (u_t − l_t) + (2/α)(l_t − y_t) I{y_t < l_t} + (2/α)(y_t − u_t) I{y_t > u_t},        (2)

where l_t is the interval's lower bound q_t(α/2) and u_t is its upper bound q_t(1-α/2). The score has the intuitive interpretation that it rewards narrow intervals, with observations that fall outside the interval incurring a penalty, the magnitude of which depends on the value of α. In an application to influenza forecasting, Bracher et al. (2021) use this interpretation to seek insight into why the interval score for one forecasting model is lower than another.

In reporting both interval and distributional forecasting results in this paper, we average results across the four lead times. We do this for three reasons. Firstly, the relative performances of the methods were similar across the four lead times. Secondly, with many results to report, we felt it impractical to provide results for each lead time. Thirdly, we have a relatively small out-of-sample period, which is of particular concern when evaluating forecasts of extreme quantiles, such as the 2.5% and 97.5% quantiles. To show that the results are consistent across the lead times, in Section 4.6, we provide empirical results for distributional forecasting for each lead time. This also enables us to report the results of statistical tests, which are not available for our results averaged across lead times.

The literature on combining interval forecasts is dominated by applications, such as ours, where there is a sizeable group of individual forecasters and a record of past accuracy is not available for the same past periods. The methods that have been proposed consider each bound of the interval separately. An obvious simple approach is to use the simple average of the forecasts. In the vast literature on combining point forecasts, it is well established that the simple average can be very competitive in a variety of applications.
Interestingly, this is true regardless of whether a record of historical accuracy is available to enable far more sophisticated combining methods to be fitted (Larrick and Soll, 2006). An advantage of the simple average is its simplicity, and its robustness to changes over time in the relative performance of the individual methods. However, consideration of robustness prompts the use of combining methods that are robust to outliers, with the obvious candidate being the median. The simple average and the median are both considered by Park and Budescu (2015) and Gaba et al. (2017).

Extending the idea of robustness to outliers, Park and Budescu (2015) propose that, for each bound, a chosen percentage of the highest and lowest forecasts are discarded, followed by averaging of the rest. We refer to this as symmetric trimming. The median is an extreme version of symmetric trimming, where all but one forecast is trimmed. Rather than having robustness as motivation, Gaba et al. (2017) use trimming to address the situation where the individual forecasters tend to be either under- or overconfident. We refer to their methods as asymmetric trimming. Their exterior trimming involves removing a percentage of the highest-valued upper bounds and lowest-valued lower bounds, with the combination computed by averaging the remaining upper and lower bounds, respectively. This approach is suitable when the forecasters have tended to be underconfident, with interval forecasts that are generally too wide. Gaba et al. (2017) also suggest interior trimming, which involves removing a percentage of the lowest-valued upper bounds and highest-valued lower bounds, followed by averaging. This approach to combining is suitable when the forecasters are overconfident. They also propose an extreme version of interior trimming, which involves discarding all but the highest upper bound and lowest lower bound. They refer to this as the envelope method. In addition, Gaba et al. (2017) describe a heuristic approach that views the bounds of each forecaster as having been produced from a normal distribution.

The empirical studies of Park and Budescu (2015) involved 80% and 90% interval forecasts produced judgmentally by volunteers in experiments, where the interval forecasts related to general knowledge questions and estimates of financial and economic quantities. They found that the simple average and median were outperformed by trimming, and that symmetric trimming was preferable to asymmetric trimming. Gaba et al. (2017) considered 90% interval forecasts produced for financial quantities by employees at a financial brokerage firm, and for macroeconomic variables by participants in the Federal Reserve Bank of Philadelphia's Survey of Professional Forecasters. These forecasts are produced using a variety of methods, including statistical models, expert judgment, and a mixture of the two. They found that exterior trimming performs very well for some of their data, but that the ranking of combining methods is dependent on the characteristics of the individual forecasts, such as under- or overconfidence. In a recent paper, Grushka-Cockayne and Jose (2020) compared combinations of 95% interval forecasts produced by time series methods for the 100,000 series of the M4-Competition. Overall, the best results were achieved with median combining and interior asymmetric trimming.
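To make these combining rules concrete, the following Python sketch applies the simple average, median, symmetric trimming, the two asymmetric trimming rules and the envelope to a hypothetical set of interval forecasts. The function names, the example numbers and the trimming fraction beta are our own illustrative choices, not code or data from the paper or the Forecast Hub.

```python
import numpy as np

def symmetric_trim(bound_forecasts, beta):
    """Average a bound after discarding the N lowest and N highest forecasts,
    where N = floor(beta/2 * number of forecasts)."""
    x = np.sort(np.asarray(bound_forecasts, dtype=float))
    n = int(np.floor(beta / 2 * len(x)))
    return x[n:len(x) - n].mean() if len(x) > 2 * n else np.median(x)

def asymmetric_exterior_trim(lowers, uppers, beta):
    """Drop the N lowest lower bounds and N highest upper bounds, then average.
    Suitable when forecasters tend to be underconfident (intervals too wide)."""
    lo, up = np.sort(lowers), np.sort(uppers)
    n = int(np.floor(beta * len(lo)))
    return lo[n:].mean(), up[:len(up) - n].mean()

def asymmetric_interior_trim(lowers, uppers, beta):
    """Drop the N highest lower bounds and N lowest upper bounds, then average.
    Suitable when forecasters tend to be overconfident (intervals too narrow)."""
    lo, up = np.sort(lowers), np.sort(uppers)
    n = int(np.floor(beta * len(lo)))
    return lo[:len(lo) - n].mean(), up[n:].mean()

# Hypothetical 95% interval forecasts (lower, upper) from six teams for one week.
lowers = np.array([900., 950., 980., 1000., 1020., 1200.])
uppers = np.array([1100., 1150., 1250., 1300., 1400., 1900.])

print("simple average:", lowers.mean(), uppers.mean())
print("median:        ", np.median(lowers), np.median(uppers))
print("symmetric 40%: ", symmetric_trim(lowers, 0.4), symmetric_trim(uppers, 0.4))
print("exterior 20%:  ", asymmetric_exterior_trim(lowers, uppers, 0.2))
print("interior 20%:  ", asymmetric_interior_trim(lowers, uppers, 0.2))
print("envelope:      ", lowers.min(), uppers.max())
```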
In our implementation of the combinations involving trimming, we optimised the percentage of forecasts to trim, represented by the parameter β, by finding the value that minimised the sum of the interval score of expression (2) over all lead times and all periods up to and including the forecast origin.

For point forecasting, many combining methods have been proposed that allocate weights according to the historical accuracy of the forecasters. Analogous methods have been proposed for probabilistic forecasts. However, we are not aware of approaches to weighting probabilistic forecasts when historical accuracy is not available for each forecaster for the same past periods, which is the case in our study. Capistrán and Timmermann (2009) address this situation for point forecasts from a survey of macroeconomic forecasters that had "frequent entry and exit" of forecasters. We draw on their ideas to propose weighted combinations for probabilistic forecasts. Each method is based on historical accuracy, and pragmatically disregards the fact that this accuracy has not been assessed using the same past periods for each forecaster. We do not consider the approach of Capistrán and Timmermann (2009) that involves imputation of the missing forecasts, as it is not clear how to adapt it for probabilistic forecasts, and in any case, imputation is less appealing for short time series with many missing forecasts.

A simple suggestion of Capistrán and Timmermann (2009) is to select the forecaster with the best historical accuracy. We implemented this previous best approach by selecting the forecasting team for which the in-sample interval score was lowest. With the shortest in-sample period being 10 weeks, we considered only forecasting teams for which we had forecasts for at least five past forecast origins. We imposed the same requirement on all the weighted combining methods. Clearly, five periods is a small number with which to assess interval forecast accuracy, but larger numbers led to the elimination of many forecasters for the early weeks in our out-of-sample period.

Capistrán and Timmermann (2009) implement the proposal of Bates and Granger (1969) to combine point forecasts using a convex combination with weights inversely proportional to the mean squared error. Stock and Watson (2001) explain that this simple method has the appeal of robustness when the estimation sample is small or there are many predictors, which are both issues in our study. Shan and Yang (2009) use the approach to combine quantile forecasts, with the weights computed by replacing the squared error with the quantile score. A similar approach is adopted by Taylor (2020) in a study of value at risk and expected shortfall forecasting for financial data. We implemented inverse score combining using the interval score of expression (2), and the quantile score of expression (1). The inverse score approach requires no parameter optimisation, which is appealing because our time series are quite short. However, with little past data, a simple average may be preferable. In view of this, we implemented a form of shrinkage, used for point forecasting by Stock and Watson (2004) and Capistrán and Timmermann (2009), which has the potential to reduce the combination to the simple average. The shrinkage forecast is a weighted average of the forecasts from the simple average and an inverse score method.
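As an illustration of how such inverse score weights can be formed, the sketch below computes the interval score of expression (2) for each team's past forecasts and converts the historical mean scores into convex combination weights, including a tuning exponent of the kind discussed below in expression (3). The array layout, numbers and variable names are hypothetical, and missing submissions are simply ignored when averaging a team's score; this is a schematic of the approach, not the authors' code.

```python
import numpy as np

def interval_score(lower, upper, y, alpha):
    """Interval score of expression (2): interval width plus 2/alpha penalties
    for observations falling outside the interval."""
    width = upper - lower
    below = (2.0 / alpha) * np.maximum(lower - y, 0.0)
    above = (2.0 / alpha) * np.maximum(y - upper, 0.0)
    return width + below + above

def inverse_score_weights(mean_scores, lam=1.0):
    """Convex weights inversely proportional to each team's historical mean
    score, raised to a tuning power lam. lam close to 0 gives the simple
    average; a large lam selects the historically best team."""
    inv = np.asarray(mean_scores, dtype=float) ** (-lam)
    return inv / inv.sum()

# Hypothetical history: 3 teams, 4 past weeks of 95% interval forecasts for one
# series, with the realised cumulative deaths in y. NaN marks a missing week.
lower = np.array([[950., 1000., 1080., 1150.],
                  [900.,  980., 1050., np.nan],
                  [970., 1020., 1100., 1170.]])
upper = np.array([[1150., 1250., 1350., 1450.],
                  [1300., 1400., 1500., np.nan],
                  [1120., 1200., 1280., 1360.]])
y = np.array([1100., 1180., 1290., 1400.])

scores = interval_score(lower, upper, y, alpha=0.05)   # per team, per week
mean_scores = np.nanmean(scores, axis=1)               # ignore missing weeks
weights = inverse_score_weights(mean_scores, lam=1.0)
print("mean interval scores:", np.round(mean_scores, 1))
print("combining weights:   ", np.round(weights, 3))

# Weighted combination of the teams' current 1-week-ahead bounds.
new_lower = np.array([1200., 1150., 1220.])
new_upper = np.array([1500., 1650., 1420.])
print("combined interval:", (weights @ new_lower, weights @ new_upper))
```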
We estimated the weight by minimising the interval score, after forecasts had become available for the inverse score method for a reasonable number of weeks. We set this to be the first 10 weeks of our 30-week out-of-sample period. This produced forecasts for the final 20 weeks of the out-of-sample period. For the first 10 weeks, we set the forecasts to be those of the inverse score method. We implemented shrinkage with the inverse interval score method, and separately with the inverse quantile score method.

Stock and Watson (2001), Shan and Yang (2009) and Taylor (2020) incorporate a tuning parameter λ > 0 in the inverse score approach to control the influence of the score on the combining weights. Expression (3) presents the resulting combining weight for forecaster i at forecast origin t:

w_{i,t} = (MS_{i,t})^{-λ} / Σ_{j=1}^{J} (MS_{j,t})^{-λ},        (3)

where MS_{i,t} is the historical mean of the score computed at forecast origin t from forecaster i, and J is the number of forecasting teams included in the combination. If λ is close to zero, the combination reduces to a simple average, whereas a large value for λ leads to the selection of the team with the best historical accuracy. We followed the same approach that we used for the shrinkage method to estimate λ and produce forecasts for the first 10 weeks of the out-of-sample period. We implemented the tuning method based on the interval score and another version of the method based on the quantile score.

The inverse quantile score methods, as well as asymmetric exterior trimming, sometimes delivered a lower bound above the upper bound. In this situation, we replaced the two bounds by their mean. Although the interval then has zero width, this is better than the upper bound being below the lower bound.

For each mortality series, forecast origin and lead time, we applied the following methods:

Ensemble: This is the combination produced by the COVID-19 Forecast Hub. For the first 13 of the 40 forecast origins in our study, the forecast for each bound was the simple average of the individual forecasts of that bound, and thereafter, it was the median. As noted in Section 2.3, the ensemble was computed from a subset of the forecasts used in the other combining methods that we consider.

Simple average: For each bound, we computed the arithmetic average of forecasts of this bound. We used our full set of forecasts for this and the other combining methods described below.

Geometric mean: For each bound, we computed the geometric mean of forecasts of this bound. This was the only combination using the geometric mean. It was motivated by the potentially exponential rise in mortality.

Median: For each bound, we computed the median of forecasts of this bound.

Symmetric trimming: For each bound, we averaged the forecasts remaining after the removal of the N lowest-valued and N highest-valued forecasts, where N is the largest integer less than or equal to the product of β/2 and the total number of forecasts, and β is the percentage of forecasts to trim.

Asymmetric exterior trimming: We first removed the N lowest-valued lower bound forecasts, as well as the N highest-valued upper bound forecasts, where N is the largest integer less than or equal to the product of β and the number of forecasts. For each bound, we averaged the remaining forecasts.

Asymmetric interior trimming: We removed the N highest-valued lower bound forecasts, as well as the N lowest-valued upper bound forecasts, where N is defined as for asymmetric exterior trimming. For each bound, we averaged the remaining forecasts.

Envelope: This uses the lowest-valued lower bound forecast and the highest-valued upper bound forecast.
Previous best: The interval forecast is provided by the forecasting team for which the interval score was the lowest when computed using the weeks up to and including the forecast origin.

Inverse interval score: This is a convex combination of forecasts, where the weights are inversely proportional to the interval score computed using the weeks up to and including the forecast origin.

Inverse interval score shrinkage: This is a weighted average of the simple average and inverse interval score combining methods.

Inverse interval score tuning: This applies a tuning parameter to the weights of the inverse interval score method, as shown in expression (3).

Inverse quantile score: This is a convex combination of forecasts, where the weights on the forecasts of each bound are inversely proportional to the quantile score computed for the forecasts of that bound using the weeks up to and including the forecast origin.

Inverse quantile score shrinkage: The forecast for each bound is a weighted average of the forecasts of that bound from the simple average and inverse quantile score combining methods.

Inverse quantile score tuning: This weighted combining method applies a tuning parameter to the weights of the inverse quantile score method, as shown in expression (3).

As we explained in Section 3.2, we averaged the results across the lead times. However, with the level of mortality varying greatly across the series, it is inevitable that averaging will lead to the interval score being dominated by its value for the high mortality series. In view of this, we report results for the following four categories of series: all 52 series; the 17 series with the highest cumulative mortality at the end of the final week of our dataset, which corresponded to 16 states and the national U.S. series; the 17 states with the next highest cumulative mortality at the end of the final week; and the 18 states with the lowest cumulative mortality at the end of the final week. We refer to these categories as: all, high, medium and low.

For the four categories of series, the mean of the interval score for 95% interval forecasts is presented in the first four columns of values in Table 1. The unit of the score is deaths, and lower values of the score reflect greater accuracy. To summarise performance within each category of series, we calculated the geometric mean of the ratios of the (arithmetic) mean score for each method to the (arithmetic) mean score for the simple average method, then subtracted this from 1, and multiplied the result by 100. This can be viewed as an average skill score for each category, reflecting the percentage by which a method is more accurate than the simple average. We present this measure in the final four columns of Table 1.

The first four rows of results correspond to simple benchmark combining methods. Of these, the simple average was the best for all four categories of series. In the next four rows of results, we see that asymmetric interior trimming was the most successful trimming method. This method produced similar results to the simple average. The envelope method performed poorly. Turning to the score-based methods in the final seven rows, we find that the 'previous best' method was relatively poor, and the inverse interval score and inverse quantile score methods performed well. We note that incorporating shrinkage was only beneficial for the low mortality series, and tuning was not useful for any of the four categories of series.
Overall, Table 1 shows that the best results were produced by the inverse score methods, with their improvements over the simple average being particularly good for the high and medium mortality series. As we said in Section 3.2, statistical tests are not available for the score averaged over lead times. (We return to the issue of statistical testing in Section 4.6.) As we noted in Section 2.3, for most forecasting teams, forecasts were not available for many of the 52 series and 40 forecast origins in our study. Indeed, for the out-of-sample 30-week period, a full set of forecasts for each series and forecast origin was available from only one team. The results for this team were very poor, and so we have not included them in Table 1.

Note to Table 1: The unit of the score is deaths. Lower values of the score and higher values of the skill score are better. Bold indicates the best three methods in each column.

Table 2 presents the interval score results for the 50% interval forecasts. The results are broadly consistent with those for the 95% intervals. One difference is that, for the 50% intervals, the tuning forms of the inverse score methods were the most accurate for the all and high categories of series. Further investigation revealed that this was due to very good accuracy for the national U.S. mortality series.

Figure 6 summarises calibration using Q-Q plots to report the hit percentages for the bounds of the 50% and 95% interval forecasts, which are forecasts of the quantiles with probability levels α = 2.5%, 25%, 75% and 97.5%. To ensure readability, we present the results for just four methods. We chose three of the simple benchmark methods and the inverse interval score method. This was the simplest of the inverse score methods, all of which performed well in terms of the interval score. Similar calibration results were produced by the other inverse score methods. In each plot, the dashed line indicates the ideal performance. The plots show that the ensemble and median are slightly outperformed by the simple average and inverse interval score, which have particularly good calibration for the 2.5% and 97.5% quantiles. For the 25% and 75% quantiles, the hit percentages for all four methods were too low, indicating that the forecasts of these quantiles tended to be too low.

Figure 6. Calibration hit percentages for bounds on 50% and 95% intervals, computed using the 30-week out-of-sample period.

This section empirically compares combining methods for distributional forecasts. We first briefly discuss the structure of our empirical analysis, and describe measures used to evaluate distributional forecasts. We then review combining methods that have been proposed in the literature for data of the type that we consider, and present new weighted methods. We then summarise the combining methods that we implemented, and present our empirical results.

Our empirical analysis of distributional forecasts followed the same structure as our study of interval forecasts. The first 10 weeks of data were used as the initial estimation period, and out-of-sample forecasts were evaluated for the final 30 weeks. For each forecast origin, we re-estimated parameters using all weeks up to and including the forecast origin. To help us describe parameter optimisation and several combining methods, we next discuss distributional forecast evaluation.
Gneiting et al. (2007) recommend that distributional forecasts be assessed in terms of calibration and sharpness. A score summarises calibration and sharpness, and is said to be proper if it is minimised by the true distribution. As with consistent scoring functions for quantiles, proper distributional scores are recommended to ensure forecasters report honest predictions. A widely used proper score for distributions of continuous random variables is the continuous ranked probability score (CRPS). It can be viewed in several different ways, including as the integral of the quantile score of expression (1) with respect to the probability level α. For our application, where we have quantile forecasts for just 23 different values of α, we use the linear quantile score (LQS) (see Grushka-Cockayne et al., 2017) in expression (4), which is the sum of the quantile scores for the 23 quantile forecasts:

LQS_t = Σ_{k=1}^{23} QS_t(α_k),        (4)

where the α_k are the 23 probability levels for which quantile forecasts are submitted. This is a proper score, and this can be seen by viewing it as a quantile-weighted version of the CRPS (see Gneiting and Ranjan, 2011). We note that Bracher et al. (2021) present it as a weighted sum of the interval score of expression (2) and the quantile score of expression (1) for the median.

Although our main interest in this paper is probabilistic forecasting, we also consider point predictions of the median, as this conveys the accuracy of the centre of location of the distributional forecasts, and of course such point forecasts are often the main focus of attention. We evaluate these point forecasts using the mean absolute error (MAE).

In this brief review of distributional forecast combining methods, we focus on the literature that has considered applications, like ours, where there is a large group of individual forecasters, and there is not a sizeable record of past accuracy available for each forecaster for the same past periods. The simple average is a well-established approach for combining distributional forecasts (see, for example, Stone, 1961). It has typically taken the form of the linear opinion pool, which, for any chosen value of the random variable, is the average of the corresponding cumulative probabilities obtained from the distributional forecasts. However, this form of averaging has been criticised because, even when the quantiles of the individual distributions are calibrated, the linear opinion pool will not itself provide perfect calibration (Hora, 2004; Ranjan and Gneiting, 2010). It has been noted that, when there is diversity among the means of the individual forecasts, this will tend to lead to an exaggerated variance in the forecast of the linear opinion pool (see, for example, Dawid, 1995). To address these problems, Lichtendahl et al. (2013) propose that, instead of averaging the cumulative probabilities, the distributional forecasts should be averaged by taking the mean of their corresponding quantile forecasts. In other words, for a chosen probability level α, they suggest that combining is performed by averaging the forecasts of the quantile q_t(α) provided by each individual distributional forecast. They show how this leads to more attractive theoretical properties than the linear opinion pool. For our application in this paper, it provides a more convenient approach to averaging because each forecasting team submits its distributional forecast in terms of quantile forecasts for the same 23 values of the probability level α. To provide robust combining, Hora et al. (2013) propose that the median is used. For any chosen value of the random variable, their approach finds the median of the corresponding values of the cumulative probability forecasts.
In fact, they show that this delivers the same combined distributional forecast as an approach that uses the median of the forecasts of the quantile q_t(α) for a chosen α. As with interval forecast combining, trimming has also been proposed for distributional forecasts. Jose et al. (2014) propose interior and exterior trimming approaches, which involve trimming the innermost and outermost distributions, respectively, from a set of distributional forecasts. To enable the trimming, the distributional forecasts must essentially be ordered in some way, and for this, they propose two alternative approaches: the CDF approach (CA) and the mean approach (MA). MA orders the distributional forecasts according to their means, and involves trimming entire distributional forecasts. After a proportion have been trimmed, the authors use a linear opinion pool to average the rest. CA orders the distributional forecasts separately for each of a set of values of the random variable. After the trimming is performed, the combined forecast is computed as the average of the cumulative probabilities given by the remaining distributional forecasts. Jose et al. (2014) note that CA could be adapted so that the trimming and averaging is performed on forecasts of the quantile q_t(α) for any chosen value of the probability level α, which would be more consistent with the advice of Lichtendahl et al. (2013) to average quantiles, rather than probabilities. For our application, this is a more convenient way to implement CA, as our distributional forecasts have each been provided in the form of a set of quantile forecasts. Following similar reasoning, we avoided the linear opinion pool with MA, so that, following trimming, quantile forecasts are averaged.

For interval forecast combining, symmetric trimming is motivated by robustness, and asymmetric trimming reduces the impact of a tendency among the individual forecasters to be either under- or overconfident. It is worth noting that analogous asymmetric methods are not straightforward for distributional forecasts because of the need to ensure that the resulting distribution function is monotonically increasing.

In our study, for the distributional forecast combining methods that involve trimming, we optimised the trimming percentage β by finding the value that minimised the sum of the LQS of expression (4) over all four lead times using all periods up to and including the forecast origin.

For distributional forecasting, we implemented a set of score-based combining methods analogous to those described in Section 3.4 for interval forecasting. For the 'previous best' and inverse interval score weighted combining methods, we replaced the use of the interval score by the LQS. The other three combining methods in Section 3.4 produced a forecast for each interval bound, based on the inverse of the quantile score. For distributional forecasting, we implemented these methods for each of the 23 quantiles that underlie each distributional forecast. Four of the inverse score combining methods in Section 3.4 involved a parameter, which we optimised using the interval score. For the analogous methods for distributional forecasting, we used the same optimisation procedure with the interval score replaced by the LQS. The inverse quantile score methods sometimes gave forecasts for the 23 quantiles that were not monotonically increasing with the probability level α.
When this occurred for the quantile forecasts corresponding to two adjacent values of α, we replaced both quantile forecasts by their mean.

For each mortality series, forecast origin and lead time, we implemented the following methods:

Ensemble: This is the combination produced by the COVID-19 Forecast Hub. For the first 13 of the 40 forecast origins in our study, the ensemble forecast of each of the 23 quantiles was the simple average of the corresponding quantile forecasts, and thereafter, it was the median. As noted in Section 2.3, the ensemble used a subset of the forecasts included in all other combining methods that we considered.

Simple average: For each of the 23 quantiles, we used the arithmetic mean of the corresponding quantile forecasts. We used our full set of forecasts for this and the other combining methods described below.

Geometric mean: For each of the 23 quantiles, we used the geometric mean of the corresponding quantile forecasts. This was the only combining method involving the geometric mean.

Median: For each of the 23 quantiles, we found the median of the corresponding quantile forecasts.

CA exterior trimming: For each of the 23 quantiles, we averaged the quantile forecasts remaining after we had removed the N lowest-valued and N highest-valued quantile forecasts, where N is the largest integer less than or equal to the product of β/2 and the total number of forecasts, and β is the percentage of forecasts to trim.

CA interior trimming: With this method, for each of the 23 quantiles, the innermost quantile forecasts were trimmed. The combination was computed as the average of the quantile forecasts that were either among the N lowest-valued or N highest-valued quantile forecasts, where N is the largest integer less than or equal to the product of (1-β)/2 and the total number of forecasts.

MA exterior trimming: This method involved trimming entire distributional forecasts. The trimming was based on the mean of each distributional forecast, which we estimated using the average of the 23 quantile forecasts. The forecast combination was computed by averaging the distributional forecasts that remained after the removal of the N distributional forecasts with the lowest-valued means and the N distributional forecasts with the highest-valued means, where N is the largest integer less than or equal to the product of β/2 and the total number of forecasts.

MA interior trimming: This was similar to MA exterior trimming, except that the innermost distributional forecasts were trimmed. The combination was computed as the average of the distributional forecasts that were among the N distributional forecasts with the lowest-valued means and the N distributional forecasts with the highest-valued means, where N is the largest integer less than or equal to the product of (1-β)/2 and the total number of forecasts.

Previous best: The distributional forecast is provided by the forecasting team for which the LQS was the lowest when computed for the weeks up to and including the forecast origin.

Inverse LQS: This is a convex combination of forecasts, where the weights are inversely proportional to the LQS computed for the weeks up to and including the forecast origin.

Inverse LQS shrinkage: This is a weighted average of the simple average and inverse LQS methods.

Inverse LQS tuning: This applies a tuning parameter to the weights of the inverse LQS method.

Inverse quantile score: This is a convex combination of forecasts, where the weights on the forecasts of each of the 23 quantiles are inversely proportional to the quantile score computed for each quantile forecast using the weeks up to and including the forecast origin.
Inverse quantile score shrinkage: The forecast for each quantile is a weighted average of the forecasts of that quantile from the simple average and inverse quantile score combining methods.

Inverse quantile score tuning: This applies a tuning parameter to the weights of the inverse quantile score method.

Table 3 presents the mean of the LQS for the four categories of series, along with skill scores. The unit of the LQS is the number of deaths, and lower values of the score reflect greater accuracy. There are similarities between the results of Table 3 and the interval score results of Tables 1 and 2, with the simple average being the best of the four simple benchmarks; the trimming methods performing reasonably but unremarkably; the previous best method doing poorly; and the best results being produced by the inverse score methods. For the medium and low mortality series, Table 3 shows that the inverse score methods performed similarly, while for the high mortality series, incorporating tuning was beneficial. Closer inspection revealed that this finding for the high mortality series was due to tuning improving accuracy for the national U.S. series. The incorporation of shrinkage was not beneficial.

Note to Table 3: The unit of the score is deaths. Lower values of the score and higher values of the skill score are better. Bold indicates the best three methods in each column.

As we explained in Section 3.6, for the out-of-sample 30-week period, a full set of forecasts for each series and forecast origin was available from only one of the individual forecasting teams. That team performed very poorly in terms of interval forecasting, and also in terms of distributional forecasting. To gain some insight into the relative accuracy of the individual teams, we computed the skill scores, as in Table 3, for each team using the periods for which forecasts were available from that team. This led to a wide range of skill scores, due partly to the instability of this measure when forecasts were available from a team for only a small number of periods. Nevertheless, it was interesting to find that the best skill scores computed for the individual teams for the all, high, medium and low categories of series were -1.3%, -1.3%, -8.6% and -6.1%, respectively. These negative values indicate that the best individual team for each category was less accurate than the simple average combining method.

To look in more detail at the LQS results, Figure 7 reports the LQS for each of the 52 series for the simple average and the inverse LQS method, which is a simple form of inverse score method, involving no parameter estimation. The inverse LQS method can be seen to be slightly better than the simple average for most of the series.

We also evaluated the distributional forecast combining methods of Table 3 in terms of their accuracy for producing 95% and 50% interval forecasts. The interval forecasts from the simple benchmarks were the same as those produced from these methods in our interval forecasting study of Section 3. However, this was not the case for the other methods in Table 3. We found that the best of the trimming methods in Table 3 was outperformed by the best of the interval forecast trimming methods in Section 3. For the inverse score methods, the best interval forecast accuracy achieved by the methods of Table 3 was similar to the accuracy of the best of the inverse score methods in Section 3.
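As an illustration of the quantile-based combining rules listed above, the sketch below forms the simple average, median, CA exterior and CA interior combinations from a hypothetical matrix of quantile forecasts, with rows for teams and columns for probability levels. The variable names, the example numbers and the trimming fraction beta are our own; the code is only a schematic of the approach described in the paper.

```python
import numpy as np

def combine_quantiles(q, method="mean", beta=0.2):
    """Combine a (teams x levels) matrix of quantile forecasts, level by level.
    'ca_exterior' drops the N lowest and N highest forecasts at each level
    (N = floor(beta/2 * teams)); 'ca_interior' keeps only the N lowest and
    N highest (N = floor((1 - beta)/2 * teams)) and averages them."""
    q = np.sort(np.asarray(q, dtype=float), axis=0)   # order teams at each level
    m = q.shape[0]
    if method == "mean":
        return q.mean(axis=0)
    if method == "median":
        return np.median(q, axis=0)
    if method == "ca_exterior":
        n = int(np.floor(beta / 2 * m))
        return q[n:m - n].mean(axis=0)
    if method == "ca_interior":
        n = int(np.floor((1 - beta) / 2 * m))
        return np.concatenate([q[:n], q[m - n:]]).mean(axis=0)
    raise ValueError(method)

# Hypothetical quantile forecasts from five teams at three of the 23 levels
# (10%, 50%, 90%) for one series and lead time.
q = np.array([[ 950., 1100., 1300.],
              [ 900., 1050., 1250.],
              [1000., 1150., 1400.],
              [ 980., 1120., 1500.],
              [ 700., 1000., 1200.]])

for m in ["mean", "median", "ca_exterior", "ca_interior"]:
    print(m, np.round(combine_quantiles(q, m, beta=0.4), 1))
```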
Figure 8 presents these interval forecast results for the four categories of series.

In Table 4, we evaluate point forecast accuracy for forecasts of the median using the MAE averaged across the four lead times. The unit of the MAE is the number of deaths. The relative performances of the methods in Table 4 are similar to those in Table 3 for the LQS. This is quite common when there is sizeable variation over time in the location of the probability distribution, because inaccuracy in the prediction of the distribution's location will affect the accuracy of all the quantiles, and hence the whole distribution.

Note to Table 4: The unit of the score is deaths. Lower values of the score and higher values of the skill score are better. Bold indicates the best three methods in each column.

In the rest of this section, we extend our empirical analysis to consider the following issues: statistical testing, consistency of the results across lead times and across the 40-week period, model diversity, the use of different numbers of forecasts in the combination, and the evaluation of average ranks.

In this paper, we have presented results averaged across lead times. As far as we are aware, statistical tests are not available to compare results averaged in this way from different methods. The work of Quaedvlieg (2021) considers multi-horizon comparisons, but it is not applicable to our study, where we expand the length of the estimation sample for each new forecast origin, which we feel would be done in practice because our time series are short. To consider statistical testing, we focus on each lead time separately. This also enables us to compare accuracy across lead times. Table 5 presents LQS results for each of the four lead times for a subgroup of the combining methods. The symbols * and † indicate methods with LQS that is significantly lower than that of the simple average and median combinations, respectively, using a 5% significance level. For all the series considered together and the high mortality series, the table shows that the inverse score methods were generally significantly more accurate than the simple average and the median. For the medium mortality series, the inverse score methods were significantly more accurate than the median, but generally not significantly more accurate than the simple average. There are no cases of significance for the low mortality series. Finally, we note that the table shows similar rankings of the methods for each lead time.

To address whether the ranking of the methods varies over the 40 weeks of our dataset, Table 6 presents the LQS separately for the four successive 10-week periods. We have averaged the LQS across lead times, and we consider the same subgroup of methods as in Table 5. Note that the trimming and inverse score methods were not available for the first 10-week period, as this was the first in-sample estimation period for these methods. Although insight is limited from such short periods, especially for probabilistic forecasts, it is interesting to see that the inverse score methods were reasonably consistent in performing well across the 10-week periods, and the median was very competitive for the first two 10-week periods.

In Section 2, we discussed the different methods used by the individual forecasting teams. We described how compartmental models were used by approximately half the teams, and we illustrated this in Figure 4. We were curious to see how the combining methods would perform if they were applied to only the teams using compartmental models.
Given their widespread use by epidemiologists, one might surmise that combining only these models would be adequate, and that a combination using only the other types of models would deliver poor results. We investigate these issues in Table 7, where we report LQS results produced by applying each of the combining methods to the following three different sets of individual forecasting teams: all the teams, the teams using compartmental models, and the teams not using compartmental models. For the low mortality series, Table 7 shows that combining only compartmental models was most accurate. For the medium mortality series, combining only compartmental models was more accurate than combining only non-compartmental models, and there was only a small benefit in including the latter in a combination with the former. For the high mortality series, the best results were produced by combining both types of model.

Table 7. LQS for the distributional forecasts, averaged over the 30-week out-of-sample period, for combining methods applied to three different sets of individual forecasts: all, compartmental models only, and non-compartmental models.

We investigated the impact on distributional forecast accuracy of using different numbers of forecasts in the combinations. We represent this number by K. In Section 2.3, we described how the availability of the forecasting teams varied across the series and forecast origins. The number of teams available varied between 6 and 36, with a median of 27. To investigate different values of K, for each location and forecast origin, we sampled K forecasts, with replacement, from the available forecasts, and evaluated combinations of the K forecasts. We did this 1,000 times for K=2 to 36. For each value of K, and simple average combining, Figure 9 uses a box plot to summarise the resulting 1,000 LQS values. Each panel shows a noticeable improvement in the LQS as K increases to about 20. In fact, the LQS for each of the four categories of series continued to improve slightly up to about K=30, indicating the benefit of using a large pool of forecasts.

Our final set of results consists of the ranks of the methods, averaged over the series within each of the four categories. Table 8 reports the average ranks for the LQS and MAE. (Prior to computing the ranks, each score was averaged across the four lead times.) Lower ranks are better. The average ranks provide a similar message to the average scores in Tables 3 and 4, with the inverse score methods performing the best. We implemented the statistical test for average ranks proposed by Koning et al. (2005, Section 2.2) to enable multiple comparisons with the best method in each column of Table 8. In each column, the best average rank is highlighted in bold, and * indicates a method that has an average rank significantly worse than the best method in that column, using a 5% significance level. Significance can be seen in all cases for the 'previous best' method, and in most cases for the median combining method.

Table 8. Average ranks of the LQS and MAE computed for the 30-week out-of-sample period.

We took a pragmatic approach and implemented weighted combinations based on the inverse of appropriate scoring functions, computed using whatever historical forecasts were available for each method. We are not aware of previous studies that have used this approach for probabilistic forecasting with data that has frequent entry and exit of forecasters.
For our out-of-sample period of the most recent 30 weeks, these weighted combinations outperformed all other methods for the high mortality series, while for the other series, the accuracy of these methods matched the best of the other combining methods, which was the simple average. For the first 10 weeks of our dataset, insufficient historical accuracy was available with which to construct the weighted combinations. For these early weeks, the median was overall the most accurate method.

References

Special report: The simulations driving the world's response to COVID-19
The combination of forecasts
Evaluating epidemic forecasts in an interval format
Scoring interval forecasts: Equal-tailed, shortest, and modal interval
The wisdom of amateur crowds: Evidence from an online community of sports tipsters
Identifying expertise to extract the wisdom of crowds
Forecast combination with entry and exit of experts
COVID-19 Forecast Hub
Underlying trend, seasonality, prediction, forecasting and the contribution of risk factors: an analysis of globally reported cases of Middle East Respiratory Syndrome Coronavirus
Coherent combination of experts' opinions
Comparing predictive accuracy
Combining interval forecasts
A stochastic latent moment model for electricity price formation
Probabilistic forecasts, calibration and sharpness
Probabilistic forecasting
Comparing density forecasts using threshold- and quantile-weighted scoring rules
Strictly proper scoring rules, prediction, and estimation
Combining prediction intervals in the M4 competition
Ensembles of overfit and overconfident forecasts
Emergence of big data research in operations management, information systems, and healthcare: Past contributions and future roadmap
Wrong but Useful - What Covid-19 Epidemiologic Models Can and Cannot Tell Us
Probability judgments for continuous quantities: linear combinations and calibration
Median aggregation of distribution functions
Trimmed opinion pools and the crowd's calibration problem
A systematic review and evaluation of Zika virus forecasting and prediction research during a public health emergency of international concern
Goodness of fit and related inference processes for quantile regression
The M3 competition: Statistical tests of the results
Intuitions about combining opinions: Misappreciation of the averaging principle
Infectious disease forecasting for public health
Is it better to average probabilities or quantiles
Superensemble regional climate modeling for the western United States
A double epidemic model for the SARS propagation
Forecasting and planning during a pandemic: COVID-19 growth rates, supply chain disruptions, and governmental decisions
Elicitability and backtesting: Perspectives for banking regulation
Aggregating multiple probability intervals to improve calibration
The novel coronavirus originating in Wuhan, China: Challenges for global health governance
Combining probability forecasts
Multi-horizon forecast comparison
Combining regression quantile estimators
Three-Month Real-Time Dengue Forecast Models: An Early Warning System for Outbreak Alerts and Policy Decision Support in Singapore
Modelling the pandemic
A comparison of linear and nonlinear univariate models for forecasting macroeconomic time series
Combination forecasts of output growth in a seven-country data set
The opinion pool
The wisdom of crowds: why the many are smarter than the few
Forecast combinations for value at risk and expected shortfall
The RAPIDD Ebola forecasting challenge: Synthesis and lessons learnt
Estimation of excess deaths associated with the COVID-19 pandemic in the United States
A decision-theoretic approach to interval estimation
Probability forecasts and their combination: a research perspective
Coronavirus disease (COVID-19)
WHO Coronavirus Disease (COVID-19) Dashboard

Acknowledgements

We are very grateful to all the forecasting groups who generously made their forecasts available on the Reich Lab COVID-19 Forecast Hub, and to Nicholas Reich and his team, for acting as curators for the Hub, and for providing such convenient access to the data, along with useful supplementary information. We would like to thank Nia Roberts for clarifying our understanding of the licence terms for the forecast data. We are also very grateful to four referees for providing very useful comments on the paper.