title: Testing the predictive accuracy of COVID-19 forecasts
authors: Coroneo, Laura; Iacone, Fabrizio; Paccagnini, Alessia; Monteiro, Paulo Santos
date: 2022-01-31
journal: Int J Forecast
DOI: 10.1016/j.ijforecast.2022.01.005

We test the predictive accuracy of forecasts of the number of COVID-19 fatalities produced by several forecasting teams and collected by the United States Centers for Disease Control and Prevention for the epidemic in the United States. We find three main results. First, at the short horizon (1-week ahead) no forecasting team outperforms a simple time-series benchmark. Second, at longer horizons (3- and 4-week ahead) forecasters are more successful and sometimes outperform the benchmark. Third, one of the best performing forecasts is the Ensemble forecast, which combines all available predictions using uniform weights. In view of these results, collecting a wide range of forecasts and combining them in an ensemble forecast may serve health authorities better than relying on a small number of forecasts.

We split the evaluation period into a first evaluation sub-sample (from June 20, 2020 to October 31, 2020) and a second evaluation sub-sample (from November 7, 2020 to March 20, 2021). This implies that for each evaluation sub-sample we can base our inference on only 20 observations, making the use of fixed-smoothing asymptotics crucial for obtaining reliable results. We compare the predictive accuracy of the forecasts of each team relative to the forecasts of a simple benchmark model, obtained by fitting a second-order polynomial on a rolling window of the last five available observations. We also consider two ensemble forecasts that combine the forecasts from several models using equal weights: one published by the CDC, and another (the core ensemble) computed by us combining only the forecasts included in our evaluation exercise.

A feature that makes forecast evaluation important in its own right, especially when predicting the spread of COVID-19, is that the cost of under-predicting the spread of the disease can be greater than the cost of over-predicting it. In the midst of a public health crisis, the precautionary principle implies that erring on the side of caution is less costly than predicting the tapering off of the disease too soon. Scale effects may also matter when evaluating forecasts of an epidemic outbreak, since the same forecast error may be judged differently when the realized number of fatalities is small and when it is large. These effects can be taken into account in the forecast evaluation exercise by a judicious choice of the loss function. We therefore evaluate the predictive accuracy of each forecasting team using several loss functions, including the widely used quadratic and absolute value losses, the absolute percentage loss (which takes into account the scale of the number of fatalities), and a linear exponential loss function (which penalizes under-prediction more than over-prediction).

Our findings can be summarized as follows. First, the simple polynomial benchmark outperforms the forecasters at the short horizon (1-week ahead), often significantly so. Second, at longer horizons (3- to 4-week ahead), the forecasters become more competitive and some statistically outperform the simple benchmark, especially in the first evaluation sub-sample.
This suggests that forecasters can successfully help inform forward-looking policy decisions. Third, the ensemble forecasts are among the best performing forecasts. This is particularly true in the first evaluation sub-sample, but even in the second sub-sample the ensemble forecast combinations outperform the benchmark, although in this sub-sample the Diebold-Mariano (DM) test statistics are not statistically significant. The reliability of ensemble forecasts underlines the virtues of model averaging when uncertainty prevails, and supports the view in Manski (2020) that data and modelling uncertainties limit our ability to predict the impact of alternative policies using a tight set of models. Overall, our findings hold for all the loss functions considered and caution health authorities against relying on a single forecasting team (or a small set) to predict the evolution of the pandemic. A better strategy appears to be to collect as many forecasts as possible and to use an ensemble forecast.

The remainder of the paper is organized as follows. Section 2 lays out the methodology to implement the test of equal predictive accuracy. Section 3 describes the data and the models. Results are documented and discussed in Section 4, and Section 5 concludes. Finally, in the Appendix, we perform a Monte Carlo simulation exercise to study the size and power properties of the two tests of equal predictive ability with fixed-smoothing asymptotics for the sample sizes considered in our empirical study, and consider several additional experiments, including some alternative benchmark forecasts.

We consider the time series of cumulative weekly deaths {y_1, ..., y_T}, with T the sample size for which forecasts are available. We want to compare two h-week ahead forecasts of y_t, denoted y^(1)_{t|t-h} and y^(2)_{t|t-h}, under the null hypothesis of equal predictive accuracy

H_0 : E(d_t) = 0,   (1)

where d_t = L(e^(1)_{t|t-h}) - L(e^(2)_{t|t-h}) denotes the time-t loss differential between the two forecasts, e^(i)_{t|t-h} = y_t - y^(i)_{t|t-h} is the forecast error of forecast i, and L(.) is a loss function. The sample mean of the loss differential is

d̄ = (T - w)^(-1) Σ_{t=w+1}^{T} d_t,

where w ≡ max(w_1, w_2).

When a large sample T is available, standard asymptotic theory may provide valid guidance for the statistical evaluation of d̄; see Diebold and Mariano (1995) and Giacomini and White (2006). However, the same inference may be severely biased when the sample T has only a moderate size, as is indeed the case when comparing the accuracy of forecasts of the number of COVID-19 fatalities. In this case, fixed-b and fixed-m asymptotics can be used to overcome the small-sample size bias; see Coroneo and Iacone (2020), Choi and Kiefer (2010) and Harvey et al. (2017).

As for fixed-b asymptotics, following Kiefer and Vogelsang (2005), under the null in (1) we have

t_WCE = √T d̄ / √(σ̂²_BART,M)  ⇒  Φ_b,   (2)

where σ̂²_BART,M is the weighted autocovariance estimate of the long-run variance of d_t using the Bartlett kernel with bandwidth M, b = M/T, and Φ_b is a non-standard limiting distribution whose critical values depend on the kernel and on b. For the Bartlett kernel with b ≤ 1, these can be obtained using the formula

cv(b) = α_0 + α_1 b + α_2 b^2 + α_3 b^3,

where

α_0 = 1.2816, α_1 = 1.3040, α_2 = 0.5135, α_3 = -0.3386 for the 0.900 quantile,
α_0 = 1.6449, α_1 = 2.1859, α_2 = 0.3142, α_3 = -0.3427 for the 0.950 quantile,
α_0 = 1.9600, α_1 = 2.9694, α_2 = 0.4160, α_3 = -0.5324 for the 0.975 quantile.

When testing hypotheses about the sample mean, Kiefer and Vogelsang (2005) show in Monte Carlo simulations that fixed-b asymptotics yields a remarkable improvement in size. However, while the empirical size improves (it gets closer to the theoretical size) as b gets closer to 1, the power of the test worsens, implying that there is a size-power trade-off.

For fixed-m asymptotics, following Hualde and Iacone (2017), under the null in (1) we have

t_WPE = √T d̄ / √(σ̂²_DAN,m)  ⇒  t_2m,   (3)

where σ̂²_DAN,m is the weighted periodogram estimate of the long-run variance of d_t using the Daniell kernel and truncation m, and t_2m denotes a Student-t distribution with 2m degrees of freedom.
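To fix ideas, the following is a minimal Python sketch of how the two statistics in (2) and (3) could be computed. The function names, the kernel-weight convention, the use of the cubic critical-value polynomial for the Bartlett kernel, and the Student-t(2m) critical values for the Daniell estimator follow the formulas above, but the implementation is ours: an illustration under these assumptions, not the authors' code.

```python
import numpy as np
from scipy import stats

def bartlett_lrv(d, M):
    """Weighted autocovariance (Bartlett kernel) estimate of the long-run variance of d_t,
    with kernel weights 1 - j/M so that b = M/T matches the fixed-b convention."""
    d = np.asarray(d, dtype=float)
    T = len(d)
    dc = d - d.mean()
    lrv = dc @ dc / T                          # lag-0 autocovariance
    for j in range(1, M):
        gamma_j = dc[j:] @ dc[:-j] / T         # lag-j autocovariance
        lrv += 2.0 * (1.0 - j / M) * gamma_j
    return lrv

def daniell_lrv(d, m):
    """Weighted periodogram (Daniell) estimate of the long-run variance of d_t:
    2*pi times the average periodogram over the first m Fourier frequencies."""
    d = np.asarray(d, dtype=float)
    T = len(d)
    dc = d - d.mean()
    w = np.fft.fft(dc)                                   # discrete Fourier transform
    periodogram = (np.abs(w) ** 2) / (2.0 * np.pi * T)   # I(lambda_j), j = 0, 1, ...
    return 2.0 * np.pi * periodogram[1:m + 1].mean()

# Cubic fits to the fixed-b critical values for the Bartlett kernel:
# cv(b) = a0 + a1*b + a2*b^2 + a3*b^3, coefficients as listed in the text above.
KV_BARTLETT = {0.900: (1.2816, 1.3040, 0.5135, -0.3386),
               0.950: (1.6449, 2.1859, 0.3142, -0.3427),
               0.975: (1.9600, 2.9694, 0.4160, -0.5324)}

def fixed_b_cv(b, quantile=0.975):
    a0, a1, a2, a3 = KV_BARTLETT[quantile]
    return a0 + a1 * b + a2 * b ** 2 + a3 * b ** 3

def dm_fixed_b(loss1, loss2, M=None, quantile=0.975):
    """t_WCE in (2): sqrt(T) * dbar / sigma_hat_BART, with a fixed-b critical value.
    A two-sided 5% test uses the 0.975 quantile, a two-sided 10% test the 0.950 quantile."""
    d = np.asarray(loss1, dtype=float) - np.asarray(loss2, dtype=float)   # loss differential d_t
    T = len(d)
    if M is None:
        M = int(np.floor(T ** 0.5))            # M = floor(T^(1/2)), the recommended bandwidth
    t_stat = np.sqrt(T) * d.mean() / np.sqrt(bartlett_lrv(d, M))
    return t_stat, fixed_b_cv(M / T, quantile)

def dm_fixed_m(loss1, loss2, m=None, level=0.05):
    """t_WPE in (3): sqrt(T) * dbar / sigma_hat_DAN, with Student-t(2m) critical values."""
    d = np.asarray(loss1, dtype=float) - np.asarray(loss2, dtype=float)
    T = len(d)
    if m is None:
        m = int(np.floor(T ** (1.0 / 3.0)))    # m = floor(T^(1/3)), the recommended bandwidth
    t_stat = np.sqrt(T) * d.mean() / np.sqrt(daniell_lrv(d, m))
    cv = stats.t.ppf(1.0 - level / 2.0, df=2 * m)        # two-sided critical value
    return t_stat, cv
```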
Similar results, with a slightly different standardisation and therefore a slightly different limit, are in Sun (2013). Monte Carlo simulations in Hualde and Iacone (2017) and Lazarus, Lewis, Stock and Watson (2018) show that fixed-m asymptotics has the same size-power trade-off documented for fixed-b asymptotics: the smaller the value of m, the better the empirical size, but also the weaker the power.

Coroneo and Iacone (2020) analyze the size and power properties of the tests of equal predictive accuracy in (2) and (3) in an environment with asymptotically non-vanishing estimation uncertainty, as in Giacomini and White (2006). Results indicate that the tests in (2) and (3) deliver correctly sized predictive accuracy tests for correlated loss differentials even in small samples, and that the power of these tests mimics the size-adjusted power. Considering size control and power loss in a Monte Carlo study, they recommend the bandwidth M = ⌊T^(1/2)⌋ for the weighted autocovariance estimate of the long-run variance using the Bartlett kernel (where ⌊·⌋ denotes the integer part of a number) and m = ⌊T^(1/3)⌋ for the weighted periodogram estimate of the long-run variance using the Daniell kernel.

In Appendix A, we perform a Monte Carlo simulation exercise to investigate the empirical size and power of the two tests for sample sizes that match the ones in our empirical study. Our findings indicate that both tests are, in general, correctly sized, even when only 20 observations are available and in the presence of autocorrelation of the loss differential, although the test based on the weighted covariance estimate (WCE) and fixed-b asymptotics can be slightly oversized in short samples and in the presence of strong autocorrelation. On the other hand, the test based on the weighted periodogram estimate (WPE) and fixed-m asymptotics trails slightly behind the test based on the WCE in terms of power.

In our empirical investigation, we use forecasts for the cumulative number of deaths collected by the Centers for Disease Control and Prevention (CDC). The CDC is a federal agency in charge of protecting public health through the control and prevention of diseases. It is also the official source of statistics on the evolution of the COVID-19 pandemic in the US. In particular, in collaboration with independent teams of forecasters, the CDC has set up a repository of weekly forecasts for the numbers of deaths, hospitalizations, and cases. These forecasts are developed independently by each team and shared publicly (footnote 1).

Footnote 1: Background information on each forecasting team, along with a summary explanation of their methods, is available at https://www.cdc.gov/coronavirus/2019-ncov/covid-data/forecasting-us.html.

We focus on forecasts of the number of deaths for three main reasons. First, the number of fatalities is more accurate than the number of cases and hospitalizations, since the latter ignore asymptomatic cases and other undetected infections. Second, the number of deaths is reported with less spatial and temporal bias.
Third, when faced with a pandemic, the number of fatalities is arguably the primary concern of the health authorities and of the public.

Our sample includes projections for national COVID-19 cumulative deaths made for the period between June 20, 2020 and March 20, 2021 by eight forecasting teams. The deadline for the teams to submit their weekly forecasts is on the Monday of each week, and the forecasts are usually published online on Wednesdays. Weekly cumulative data refer to the cumulative count up to and including the Saturday of each week. This means that, for example, the forecasts submitted by June 22 had as targets the cumulative number of deaths as of June 27 (1-week ahead), July 4 (2-week ahead), July 11 (3-week ahead), and July 18 (4-week ahead). Realised values are also taken from the CDC website. Notice that when COVID-19 is reported as a cause of mortality on the death certificate, it is coded and counted as a fatality due to COVID-19.

The eight forecasting teams selected are those that submitted their predictions without interruption for all the weeks in our sample. We list the selected teams in Table 1 and report the main features of the selected forecasts. (Note to Table 1: in the fourth column, "yes" means that the modelling team makes assumptions about how levels of social distancing will change in the future, while "no" means that the existing measures are assumed to continue through the projected 4-week time period.) The teams vary widely with regard to their modelling choices, their information inputs (for example, how the information on infected people is used), and their assumptions about the evolution and impact of non-pharmaceutical interventions (for example, regarding social distancing).

In our forecast evaluation exercise, we also consider two Ensemble forecasts: one published by the CDC (which combines the individual forecasts from several teams and which we label Ensemble, EN) and one computed by us (which combines the individual forecasts from the eight teams listed in Table 1 and which we label Core Ensemble, CE). Combining forecasts is an effective procedure when there is uncertainty about the model and the data, as is indeed the case here, where differences across teams also include alternative assumptions about the responses of the public and of the health authorities. In this situation, combining forecasts helps to diversify risk and to pool information (see Bates and Granger 1969). In particular, forecast combination is most advantageous when there is pervasive uncertainty, as the ranking of the best-performing forecasts may be very unstable and forecast combination therefore provides a robust alternative (see Stock and Watson 1998, Timmermann 2006). Optimal weights that give the best combination, in the sense of minimizing a given loss function, can in principle be derived, but in many practical applications estimated optimal-weight schemes result in a combined forecast that does not improve on simple averaging (see Clemen 1989, Smith and Wallis 2009, Claeskens, Magnus, Vasnev and Wang 2016). In epidemiology, forecast combination has proved its ability to improve on the performance of individual competing models. For example, Reich et al. (2019) found that ensemble forecasts for influenza performed better on average than the constituent models; similar results were obtained by Chowell et al. (2020) in the Ebola Forecasting Challenge.

The benchmark against which we compare the forecasts collected by the CDC is a polynomial function. That is, benchmark forecasts are obtained as projections from the model

y_t = β_0 + β_1 t + β_2 t^2 + u_t,   (4)

where y_t is the cumulative number of fatalities, t is the time trend, and u_t is an unobserved error term.
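To make the benchmark concrete, the following is a minimal Python sketch of the projections from model (4); as the next paragraph explains, the polynomial is fitted by least squares on a rolling window of the last five weekly observations, and the prediction is floored at the latest observed value so that cumulative deaths cannot decrease. The function name and the example numbers are ours, for illustration only.

```python
import numpy as np

def polynomial_benchmark(y, horizons=(1, 2, 3, 4), window=5):
    """h-week ahead benchmark forecasts from the second-degree polynomial in (4),
    fitted by least squares on the last `window` cumulative-death observations."""
    y = np.asarray(y, dtype=float)
    t = np.arange(len(y))
    y_win, t_win = y[-window:], t[-window:]
    beta = np.polyfit(t_win, y_win, deg=2)      # fit y_t = b0 + b1*t + b2*t^2 + u_t
    forecasts = {}
    for h in horizons:
        pred = np.polyval(beta, t[-1] + h)      # project the fitted trend h weeks ahead
        forecasts[h] = max(pred, y[-1])         # cumulative deaths cannot decrease
    return forecasts

# illustrative cumulative weekly deaths (made-up numbers)
history = [100, 130, 170, 220, 280, 350]
print(polynomial_benchmark(history))            # {1: ..., 2: ..., 3: ..., 4: ...}
```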
To accommodate the fact that the forecasted patterns may change even over a short span of time, we fit the quadratic polynomial model by least squares on a rolling window of the last five observations (with weekly data, this covers approximately one month). To ensure that the benchmark forecasts for the cumulative number of deaths are not decreasing, we compute the benchmark predictions as the maximum between the current value and the prediction from (4). This very simple statistical model has been chosen because any continuous and differentiable function can be approximated locally by a polynomial, and we take the second-degree polynomial as a local approximation. In recent work, the choice of a polynomial benchmark has also been considered by Jiang, Zhao and Shao (2020) and Li and Linton (2021), among others, although with some small differences. In Jiang et al. (2020), the intrinsic instability of the forecasted patterns is accommodated by fitting occasional breaks; Li and Linton (2021) fitted the model to the incidence of deaths, rather than to cumulative deaths.

In this section, we present some preliminary analysis of the forecasts submitted by the forecasting teams in Table 1. It is apparent that the heterogeneity in forecasts grows with the forecasting horizon and, concurrently, that the forecasts become less precise as the forecast horizon increases. This simple observation may make the case for forecast combination at longer horizons more compelling.

Table 2 reports summary statistics of the forecast errors: the sample mean, median, standard deviation, skewness, and the sample autocorrelations up to order 4 (in the columns AC(1), AC(2), AC(3) and AC(4), respectively). (Notes to the tables and figures: the forecasting teams are listed in Table 1; EN denotes the Ensemble published by the CDC, CE the Core Ensemble constructed by combining all the forecasts of the teams in Table 1, and PO the polynomial benchmark. Forecast errors are defined as the realised value minus the forecast.) With the exception of the benchmark, the average forecast errors are positive for all forecasts, meaning that the forecasters tend to under-predict the fatalities. At the 1-week horizon, the benchmark polynomial model appears to outperform all the teams, with a much smaller average error and smaller dispersion. However, its performance deteriorates at longer horizons, with the volatility of its forecast errors increasing substantially and becoming larger than those of most forecasting teams. This is not surprising, as epidemiological models are designed to predict the evolution of a pandemic in the medium and long run, and we observe here that even a very simple forecast does better when the horizon is very short. At longer horizons, however, epidemiological models should be expected to produce forecasts that are superior to simple statistical benchmarks.

From Table 2 we can also observe that the forecast errors are autocorrelated, as documented in the columns AC(1), AC(2), AC(3) and AC(4). This happens even for one-step-ahead forecasts, where the first-order sample autocorrelation may be as high as 0.83.
This is interesting because, under mean squared error loss, optimal h-step-ahead forecast errors should be at most MA(h − 1): 1-step-ahead forecast errors should be martingale differences, 2-step-ahead errors should be at most MA(1), and so on. Indeed, this is the very argument given in Diebold and Mariano (1995) to justify the choice of the rectangular kernel to estimate the long-run variance. However, this condition is clearly violated by all forecasts. One explanation for this higher-order autocorrelation in the forecast errors, and for the fact that the forecasting teams systematically under-predict the number of fatalities, could be that the forecasting teams use alternative loss functions to produce their forecasts. Indeed, Patton and Timmermann (2007) show that, under asymmetric loss and nonlinear data generating processes, forecast errors can be biased and serially correlated of arbitrarily high order.

Finally, Figure 2 shows a break in the volatility of the forecast errors across the first and the second halves of the sample (as illustrated by the vertical line in each diagram of Figures 1-2). This is also shown in Tables 3-4, where we report summary statistics for the forecast errors in the two sub-samples. In particular, we note that the volatility of the forecast errors is considerably higher in the second sub-sample. Such a decline in the quality of the forecasts in the most recent sub-sample may at first be puzzling: one would expect the forecasting teams to improve their performance as more information progressively becomes available. However, this structural break in the forecasting ability of all models could in part be related to the emergence of a new strain of the virus at the end of 2020, with specific mutations in the spike protein of SARS-CoV-2 resulting in increased transmissibility. Consistent with this explanation, research from the CDC documents the emergence of the B.1.1.7 lineage in the United States over this period. This is illustrated in Figure 3, which shows the incident deaths across the overall evaluation period. Thus, the increased forecast errors may be driven by the sudden rise in incident deaths and the associated increase in the slope of the cumulative deaths curve. At any rate, the volatility of the forecast errors increases markedly from the beginning of November 2020. We thus perform our forecast evaluation separately on two equally-sized sub-samples: the first evaluation sub-sample (from June 20, 2020 to October 31, 2020) and the second evaluation sub-sample (from November 7, 2020 to March 20, 2021). This means that for each evaluation sub-sample we base our inference on just 20 observations. With such small sample sizes, fixed-smoothing asymptotics is crucial to obtain correctly sized tests.

Our main results for the test of equal predictive ability of each forecasting team vis-à-vis the benchmark model (4) are reported in Table 5. We conduct the analysis separately for the two evaluation sub-samples identified in Figure 2. This yields 20 observations for each sub-sample, underlining the importance of alternative asymptotics in evaluating predictive ability. In the baseline analysis, we evaluate forecast errors using the absolute value loss function. We present the test statistics using both the weighted covariance estimator with Bartlett kernel (WCE) and the weighted periodogram estimator with Daniell kernel (WPE) of the long-run variance. A positive value of the test statistic indicates that the forecast in question is more accurate than the benchmark.
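To illustrate the sign convention and how the two statistics are applied in a sample of this size (T = 20), the following sketch uses the functions dm_fixed_b and dm_fixed_m from the earlier code sketch, applied to simulated forecast errors under the absolute value loss; all numbers are synthetic and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)
T = 20                                                   # one evaluation sub-sample

# synthetic h-step-ahead forecast errors (realised minus forecast), illustrative only
e_team  = rng.normal(loc=500.0,  scale=800.0,  size=T)   # team tends to under-predict
e_bench = rng.normal(loc=0.0,    scale=1200.0, size=T)   # benchmark unbiased but noisier

loss_team, loss_bench = np.abs(e_team), np.abs(e_bench)  # absolute value loss

# d_t = L(benchmark error) - L(team error): a positive mean favours the team
t_wce, cv_wce = dm_fixed_b(loss_bench, loss_team)
t_wpe, cv_wpe = dm_fixed_m(loss_bench, loss_team)
print(f"WCE: t = {t_wce:.2f}, two-sided 5% fixed-b critical value = {cv_wce:.2f}")
print(f"WPE: t = {t_wpe:.2f}, two-sided 5% fixed-m critical value = {cv_wpe:.2f}")
```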
We report two-sided significance at the 5% and 10% levels, using fixed-smoothing asymptotics. For comparison, we also report significance based on bootstrap critical values, constructed using the overlapping stationary block-bootstrap of Politis and Romano (1994) with an average block length of T^(1/4) ≈ 2 and a circular scheme, as described in Coroneo and Iacone (2020).

We consider first the upper panel of Table 5, which reports results for the first sub-sample from June 20, 2020 to October 31, 2020. Results indicate that no forecasting scheme predicts better than the benchmark at the 1-week forecasting horizon. In fact, we find that the benchmark often significantly outperforms the forecasting teams. On the other hand, at the 2-week horizon the sign of some test statistics turns from negative to positive, reflecting a smaller relative loss by the forecasting teams. The relative performance of the forecasters improves further at longer horizons (3 and 4 weeks ahead), and we observe statistically significant relative gains in performance for some forecasting teams and the ensemble forecasts. The MAEs for the first sub-sample reported in Table 6 indicate that the two ensembles performed better than most of the individual forecasting teams. This finding is consistent with the consensus in the literature about the advantages of forecast combination (see Stock and Watson 1998, Timmermann 2006).

(Note to Table 5: the table reports test statistics for the test of equal predictive accuracy using the weighted covariance estimator (WCE) and the weighted periodogram estimator (WPE) of the long-run variance. The benchmark is a second-degree polynomial fitted on a rolling window of 5 observations. The forecast errors are evaluated using the absolute value loss function, and a positive value of the test statistic indicates lower loss for the forecaster, i.e. better performance of the forecaster relative to the polynomial model. ** and * indicate, respectively, two-sided significance at the 5% and 10% level using fixed-b asymptotics for the WCE and fixed-m asymptotics for the WPE; separate markers indicate, respectively, two-sided significance at the 5% and 10% level using the bootstrap. Bootstrap critical values are constructed using the overlapping stationary block-bootstrap of Politis and Romano (1994) with an average block length of T^(1/4) ≈ 2 and a circular scheme, as described in Coroneo and Iacone (2020).)

Turning to the second sub-sample from November 7, 2020 to March 20, 2021, we can see from Table 5 that the benchmark still significantly outperforms some teams and the ensemble forecasts at the shortest horizon. However, in this sub-sample the forecasting teams and the ensembles fail to significantly outperform the benchmark even at the longer horizons. The MAEs for the second sub-sample reported in Table 6 indicate that, also in this sub-sample, at the 3- and 4-week horizons the two ensembles performed better than most of the individual teams. Finally, notice that the performance of the Ensemble forecast published by the CDC is similar to that of the Core Ensemble constructed by combining all the forecasts of the teams in Table 1. However, the MAEs for the Ensemble are in all cases smaller than those for the Core Ensemble, indicating that combining a larger set of forecasts than the ones considered in Table 1 can provide some benefits in terms of predictive ability, albeit small (footnote 6). Results are overall similar regardless of the type of estimator of the long-run variance.
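The bootstrap critical values just described can be generated along the following lines. This is a minimal sketch (relying on the bartlett_lrv helper from the earlier code sketch) of the stationary block bootstrap of Politis and Romano (1994) applied to the studentized mean loss differential, with the statistic re-centred at the sample mean; the function names, the centring and the studentization details are ours and may differ from the exact scheme used in the paper.

```python
import numpy as np

def stationary_bootstrap_indices(T, mean_block, rng):
    """Index path for the stationary bootstrap of Politis and Romano (1994):
    blocks of geometric length with the given mean, wrapped circularly."""
    p = 1.0 / mean_block
    idx = np.empty(T, dtype=int)
    idx[0] = rng.integers(T)
    for t in range(1, T):
        if rng.random() < p:
            idx[t] = rng.integers(T)        # start a new block at a random position
        else:
            idx[t] = (idx[t - 1] + 1) % T   # continue the current block, circularly
    return idx

def bootstrap_critical_value(d, n_boot=9999, level=0.05, seed=0):
    """Two-sided bootstrap critical value for the studentized mean loss differential."""
    rng = np.random.default_rng(seed)
    d = np.asarray(d, dtype=float)
    T = len(d)
    M = int(np.floor(T ** 0.5))             # same bandwidth as the WCE statistic
    mean_block = T ** 0.25                  # average block length T^(1/4), about 2 here
    t_boot = np.empty(n_boot)
    for i in range(n_boot):
        d_star = d[stationary_bootstrap_indices(T, mean_block, rng)]
        num = np.sqrt(T) * (d_star.mean() - d.mean())    # re-centre at the sample mean
        t_boot[i] = num / np.sqrt(bartlett_lrv(d_star, M))
    return np.quantile(np.abs(t_boot), 1.0 - level)
```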
We also notice that the findings from the bootstrap are largely the same, and confirm that fixed-smoothing asymptotics is a suitable and computationally much less time-consuming alternative to bootstrapping, as also found in Coroneo and Iacone (2020) and Gonçalves and Vogelsang (2011). Moreover, by using fixed-smoothing asymptotics, we have known critical values for each test, given the sample size and the choice of bandwidth.

The absolute value loss function that we use in the baseline forecast evaluation reported in Table 5 is a common choice in forecast evaluation. In particular, the null hypothesis is the equality of the mean absolute prediction error. However, in relation to predicting the spread of COVID-19 (and, more generally, the spread of an epidemic), the cost of under-predicting the spread of the disease can be greater than the cost of over-predicting it. Similarly, scale effects are important, since the same forecast error may be more costly for public health policy interventions when the number of fatalities is small than when it is large. For these reasons, in this section we consider alternative loss functions (footnote 7).

Footnote 6: The equal predictive ability of the Ensemble forecast relative to the forecasting teams and the Core Ensemble is formally tested in Appendix E.

Footnote 7: The teams submitting forecasts to the CDC were advised that their point forecasts would be evaluated with the mean absolute error loss function. The predictive median minimizes the mean absolute error and should, therefore, correspond to the optimal point forecast. However, if forecasters put greater weight on under-prediction and thus seek to minimize a linex-type loss function, Christoffersen and Diebold (1997) show that such a loss function implies an optimal point forecast that is a weighted sum of the mean and the variance.

The DM test can be applied directly to alternative loss functions. Thus, we consider three alternative loss functions that provide alternative criteria for forecast comparison. Denoting by e_t the forecast error (thus abbreviating e^(i)_{t|t-h}), the alternative loss functions considered are the following (a code sketch is given below):

• Quadratic: L(e_t) = (e_t)^2;
• Absolute Percentage: L(e_t) = |e_t| / (y_t − y_{t−1});
• Linex: L(e_t) = exp(e_t / (y_t − y_{t−1})) − e_t / (y_t − y_{t−1}) − 1.

The quadratic loss function is a popular measure that penalises large forecast errors more heavily: in this case, it seems natural to interpret it as giving more weight to fatalities that occur when the epidemic is less predictable. The absolute percentage loss considers the scale of the number of fatalities occurring in the period, thus allowing us to evaluate the same forecast error differently when only a few fatalities occur, as opposed to when there is a large number of fatalities. Finally, with the linear exponential (linex) loss function we impose asymmetric weights, with more penalty given to under-prediction than to over-prediction. This reflects the fact that the social costs of the two errors, under- and over-prediction, are different, as the cost of not responding to the pandemic and incurring a large loss of lives in the future is often regarded as much higher than the economic and social cost of responding too quickly, imposing a lockdown when it is not necessary (on the precautionary principle in public health, see Goldstein 2001).

The findings are summarized in Figures 4 and 5 for the first evaluation sub-sample, and in Figures 6 and 7 for the second evaluation sub-sample (the results for the absolute value loss function are also included, to facilitate the comparison). The dotted, dashed and continuous red horizontal lines denote, respectively, the 20%, 10% and 5% significance levels.
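As referenced above, here is a minimal sketch of the three alternative loss functions as Python functions of the forecast error e_t = realised − forecast and of the incident deaths y_t − y_{t−1}; the function names are ours and the printed numbers are illustrative.

```python
import numpy as np

def quadratic_loss(e):
    """Quadratic loss: penalizes large errors more heavily."""
    return np.asarray(e, dtype=float) ** 2

def absolute_percentage_loss(e, incident):
    """Absolute error scaled by the deaths occurring in the period, y_t - y_{t-1}."""
    return np.abs(e) / np.asarray(incident, dtype=float)

def linex_loss(e, incident):
    """Linear-exponential loss on the scaled error u = e / (y_t - y_{t-1}).
    With e = realised - forecast, positive u (under-prediction) is penalized
    more heavily than negative u (over-prediction)."""
    u = np.asarray(e, dtype=float) / np.asarray(incident, dtype=float)
    return np.exp(u) - u - 1.0

# the same 1,000-death error weighs differently in a low- and a high-incidence week
print(absolute_percentage_loss(1000.0, 5000.0), absolute_percentage_loss(1000.0, 20000.0))
# asymmetry of the linex loss around zero: under-prediction costs more than over-prediction
print(linex_loss(1000.0, 5000.0), linex_loss(-1000.0, 5000.0))
```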
[Figure 4: Test statistics for the test of equal predictive accuracy using the weighted covariance estimator (WCE) of the long-run variance and fixed-b asymptotics. The benchmark is a second-degree polynomial model fitted on a rolling window of 5 observations. A positive value of the test statistic indicates lower loss for the forecaster, i.e. better performance relative to the polynomial model. Markers distinguish the loss functions: plus for quadratic, diamond for absolute, filled circle for absolute percentage, and empty circle for the asymmetric (linex) loss. The dotted, dashed and continuous red horizontal lines denote, respectively, the 20%, 10% and 5% significance levels. Forecast horizons: 1, 2, 3 and 4 weeks ahead. Evaluation sample: June 20, 2020 to October 31, 2020.]

[Figure 5: As Figure 4, but using the weighted periodogram estimator (WPE) of the long-run variance and fixed-m asymptotics. Evaluation sample: June 20, 2020 to October 31, 2020.]

[Figure 6: As Figure 4 (WCE, fixed-b asymptotics), for the second evaluation sub-sample.]

[Figure 7: As Figure 5 (WPE, fixed-m asymptotics), for the second evaluation sub-sample.]
In general, we conclude that the ensemble forecasts deliver some of the best-performing predictions. They often achieve statistically significant outperformance against the benchmark. This is the case for the 3- and 4-week ahead predictions during the first evaluation sub-sample, when losses are evaluated using the absolute value, quadratic or absolute percentage loss functions. Even when the outperformance is not significant, the ensemble predictions perform relatively well, in the sense of not underperforming the benchmark, even during the second evaluation sub-sample when the forecast errors are larger. Between the two ensemble forecasts, the wider ensemble published by the CDC performs slightly better against the benchmark than the Core Ensemble, illustrating once again the gains from combining a large number of predictions.

Comparing the results of the tests for equal predictive ability across the two sub-samples, we note that the null hypothesis is more difficult to reject during the second evaluation sub-sample. In particular, whilst during the first sub-sample (Figures 4 and 5) many forecasting teams and the ensemble forecasts outperform the benchmark at the 3- and 4-week horizons, this is no longer true in the second sub-sample (Figures 6 and 7). Heightened incident deaths, associated with the increased transmissibility of the new strains of the virus that emerged in late 2020 and in 2021, may have affected the statistical performance of the tests of equal predictive ability in two ways: by making the task of forecasting more difficult (as evident from the larger MAEs in the second sub-sample in Table 6) and by inflating the long-run variance of the test statistic (thus reducing the power to detect a significant difference for a given MAE differential). Of course, the purpose of the tests of equal predictive ability is not to compare predictive ability across different periods, so a more pronounced failure to reject the null hypothesis during the second evaluation sub-sample is not evidence that the models were less valuable in this period.

To examine further the reasons for the apparent change in the forecastability of the epidemic, in the bottom panel of Table 6 we report the ratio of the MAE of each forecasting team to the MAE of the benchmark. This enables us to compare the relative performance of each forecasting team and the benchmark across the two evaluation sub-samples. Considering, for example, the 4-week forecasting horizon, we can notice that for some forecasting teams and the ensembles the ratios are considerably smaller in the first evaluation sub-sample than in the second. This is true for all the loss functions considered (MAE, RMSE, MAPE, and MLinex), and suggests that the forecasting environment was different across the first and the second evaluation sub-samples.

In the Appendix, we consider several additional experiments and empirical exercises.
In particular, in Appendix C, we perform the tests of equal predictive ability for the full sample, instead of considering each evaluation sub-sample separately. In Appendix D, we use an alternative benchmark model, obtained by fitting an AR(1) model to the log incidence of weekly deaths. Finally, in Appendix E, we use the CDC Ensemble forecast as the benchmark model, to test formally the null of equal predictive ability of the forecasting teams and the Core Ensemble relative to the CDC Ensemble.

We evaluate the relative predictive accuracy of real-time forecasts of the number of COVID-19 fatalities in the US, produced by several competing forecasting teams, for two evaluation periods: June 20, 2020 to October 31, 2020 (first evaluation sub-sample) and November 7, 2020 to March 20, 2021 (second evaluation sub-sample). Ensemble forecasts, which combine all available forecasts using an equal-weights scheme, are also included. Since the sample sizes are small, we use fixed-smoothing asymptotics for the limit distribution of the scaled expected loss differential between two competing forecasts.

We find that none of the forecasting teams outperforms a simple statistical benchmark at the 1-week horizon; however, at longer forecasting horizons some teams show superior predictive ability. The ensemble forecasts deliver some of the most competitive predictions. Whilst they do not yield the best forecasts overall, they are competitive in the sense of delivering predictions that significantly outperform the benchmark at longer horizons during the first evaluation sub-sample, while never performing statistically worse than the benchmark, even in the second evaluation sub-sample. In this sense, the Ensemble forecast may be seen as a robust choice. We also document that the broad-based Ensemble published by the CDC is more accurate than the Core Ensemble, which only pools forecasts from the teams that we include in our exercise. Overall, our results indicate that forecasts of the COVID-19 epidemic are valuable but need to be used with caution: decision-makers should not rely on a single forecasting team (or a small set) to predict the evolution of the pandemic, but should hold a large and diverse portfolio of forecasts.

A natural extension of our analysis is to evaluate the interval forecasts with different levels of coverage submitted by the forecasting teams to the CDC. This would require choosing an appropriate loss function, for example the weighted interval score (see Bracher, Ray, Gneiting and Reich 2021), and applying the same alternative asymptotics we have used here. In particular, recent work by Coroneo, Iacone and Profumo (2019) shows that fixed-smoothing asymptotics may also be employed successfully to evaluate density forecasts. Another interesting extension of the current work is to consider the predictive accuracy of a panel of forecasts (since the forecasting teams predict not only the national spread of the disease but also the regional evolution of the epidemic). Timmermann and Zhu (2019) propose methods for testing predictive accuracy for panel forecasts.
In particular, they develop a panel-data Diebold-Mariano test for equal predictive accuracy that pools information across both the time-series and the cross-sectional dimensions. Our analysis could, therefore, be extended to the evaluation of a panel of regional predictions.

References:
The combination of forecasts
Evaluating epidemic forecasts in an interval format
Improving robust model selection tests for dynamic models
Real-time forecasting of epidemic trajectories using computational dynamic ensembles
Optimal prediction under asymmetric loss
The forecast combination puzzle: A simple theoretical explanation
Finite-sample properties of tests for equal forecast accuracy
Advances in forecast evaluation
Combining forecasts: A review and annotated bibliography
Comparing predictive accuracy in small samples using fixed-smoothing asymptotics
A real-time density forecast evaluation of the ECB Survey of Professional Forecasters
Comparing predictive accuracy
Optimal forecast combinations under general loss functions and forecast error distributions
Emergence of SARS-CoV-2 B.1.1.7 lineage - United States
Tests of conditional predictive ability
The precautionary principle also applies to public health actions
Block bootstrap HAC robust tests: The sophistication of the naive bootstrap
Forecast evaluation tests and negative long-run variance estimates in small samples
Fixed bandwidth asymptotics for the studentized mean of fractionally integrated processes
Time series analysis of COVID-19 infection curve: A change-point perspective
A new asymptotic theory for heteroskedasticity-autocorrelation robust tests
HAR inference: practice
When will the Covid-19 pandemic peak?
Forming COVID-19 policy under uncertainty
Properties of optimal forecasts under asymmetric loss and nonlinearity
The stationary bootstrap
Accuracy of real-time multi-model ensemble forecasts for seasonal influenza in the U.S.
A simple explanation of the forecast combination puzzle
A comparison of linear and nonlinear univariate models for forecasting macroeconomic time series
A heteroskedasticity and autocorrelation robust F test using an orthonormal series variance estimator
Forecast combinations