key: cord-0315271-95rkhvix authors: Bracher, J.; Wolffram, D.; Deuschel, J.; Goergen, K.; Ketterer, J. L.; Ullrich, A.; Abbott, S.; Barbarossa, M. V.; Bertsimas, D.; Bhatia, S.; Bodych, M.; Bosse, N. I.; Burgard, J. P.; Fiedler, J.; Fuhrmann, J.; Funk, S.; Gambin, A.; Gogolewski, K.; Heyder, S.; Hotz, T.; Kheifetz, Y.; Kirsten, H.; Krueger, T.; Krymova, E.; Leithaeuser, N.; Li, M. L.; Meinke, J. H.; Miasojedow, B.; Mohring, J.; Nouvellet, P.; Nowosielski, J. M.; Ozanski, T.; Radwan, M.; Rakowski, F.; Scholz, M.; Soni, S.; Srivastava, A.; Gneiting, T.; Schienle, M. title: National and subnational short-term forecasting of COVID-19 in Germany and Poland, early 2021 date: 2021-11-08 journal: nan DOI: 10.1101/2021.11.05.21265810 sha: b10be8ab1bd3e0e3c44c510d494a1c531249625d doc_id: 315271 cord_uid: 95rkhvix We report on the second and final part of a pre-registered forecasting study on COVID-19 cases and deaths in Germany and Poland. Fifteen independent research teams provided forecasts at lead times of one through four weeks from January through mid-April 2021. Compared to the first part (October-December 2020), the number of participating teams increased, and a number of teams started providing subnational-level forecasts. The addressed time period is characterized by rather stable non-pharmaceutical interventions in both countries, making short-term predictions more straightforward than in the first part of our study. In both countries, case counts declined initially, before rebounding due to the rise of the B.1.1.7 variant. Deaths declined through most of the study period in Germany while in Poland they increased after a prolonged plateau. Many, though not all, models outperformed a simple baseline model up to four weeks ahead, with ensemble methods showing very good relative performance. Major trend changes in reported cases, however, remained challenging to predict.
medRxiv preprint, doi: https://doi.org/10.1101/2021.11.05.21265810; this version posted November 8, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC 4.0 International license.

Table 1: Forecast evaluation for Germany and Poland (incidence scale, based on RKI/MZ data). C_0.5 and C_0.95 denote coverage rates of the 50% and 95% prediction intervals; AE and WIS stand for the mean absolute error and mean weighted interval score.

[...] 13/19 and 10/19 four weeks ahead in Germany and Poland, respectively, at the 0.95 level), which reflects the severe difficulties in predicting cases in Fall 2020 as discussed in (12).

From a public health perspective, there is often a specific interest in how well models anticipated major inflection points. We therefore specifically discuss these instances. However, we note that, as will be detailed [...]

[...] (21), large-scale sequencing had been adopted by late January, but results were considered difficult to extrapolate to the whole of Germany. An updated RKI report on virus variants from 10 February 2021 (22) described a "continuous increase in the share of the VOC B.1.1.7", but cautioned that the data were "subject to biases, e.g., with respect to the selection of samples to sequence" (our translation).
Given the limited available data, and the fact that many approaches had not been designed to accommodate multiple variants, only two of the teams submitting forecasts for Germany opted to account for this aspect (a question which was repeatedly discussed during coordination calls). These exceptions were the [...]

The ITWW-county_repro model was the only one to anticipate a change in trend on 15 February (though slower than the observed one), and adapted quickly to the upward trend in the following week. This model extrapolates recently observed growth or decline at the county level and aggregates these fine-grained forecasts to the state or national level. It may therefore have been able to catch a signal of renewed growth, as a handful of German states had already experienced a slight increase in cases in the previous week (e.g., [...]

Peak of the third wave (cases). In Poland, the third wave reached its peak in the week ending on 3 April 2021. Although the peak coincided with the Easter weekend, when data quality was somewhat unclear, this turnaround was predicted quite well by two Poland-based teams, MOCOS-agent1 and ICM-agentModel. As can be seen from Figure 6, the trajectories of these two models differed substantially from those of most other models, including the ensemble, which predicted a sustained increase. This successful prediction of the turning point was in large part responsible for the good relative performance of MOCOS-agent1 and [...]

Table 2: Forecast evaluation at the regional level, Germany and Poland (incidence scale, based on RKI/MZ data). Results are averaged over the different regions (states in Germany, voivodeships in Poland).
C_0.5 and C_0.95 denote coverage rates of the 50% and 95% prediction intervals; AE and WIS stand for the mean absolute error and mean weighted interval score. *Asterisks mark entries where scores were imputed for at least one week. Weighted interval scores and absolute errors were imputed with the worst (largest) score achieved by any other forecast for the respective target and week. Models marked thus received a pessimistic assessment of their performance. If a model covered less than two thirds of the evaluation period, results are omitted.

For Germany, the peak of the third wave occurred only after the end of our pre-specified study period, but we note that numerous models overshot strongly because they expected the upward trend to continue. The exact mechanisms underlying the turnaround remain poorly understood (a new set of restrictions referred to as the Bundesnotbremse was introduced too late to explain the change on its own).

Changes in trend of deaths. In Germany, the study period coincided almost perfectly with a prolonged [...] two weeks before it actually occurred. Following the unexpectedly strong increase in the following week, it switched to extending the upward tendency, before reverting to predicting a turnaround. It seems likely that the irregular pattern in late December and early January is partly due to holiday effects in reporting, and forecast models may have been disturbed by this aspect.

At the end of the downward trend in late March, the ensemble again anticipated the turnaround to arrive earlier than it did, and predicted a more prolonged rise than was observed.
Nonetheless, in both cases the ensemble anticipated the qualitative change to some degree, and the observed trajectories were well inside the respective 95% prediction intervals (with the exception of the forecast from 4 January; however, this forecast had prospectively been excluded from the analysis as we anticipated reporting irregularities).

In Poland, deaths started to increase in early March after a prolonged period of decay. As can be seen [...]

We presented results from the second and final part of a pre-registered forecast evaluation study conducted in Germany and Poland (January-April 2021). During the period covered in this paper, ensemble approaches yielded very good performance relative to contributed individual models and baseline models. The majority of contributed models were able to outperform a simple last-observation-carried-forward model for most targets and forecast horizons up to four weeks.

The results in this manuscript differ in important aspects from those for our first evaluation period (October-December 2020), when most models struggled to meaningfully outperform the KIT-baseline model for cases. Fall 2020 was characterized by rapidly changing non-pharmaceutical intervention measures, making it hard for models to anticipate the case trajectory. Pooled across both study periods, we found ensemble forecasts of deaths to yield satisfactory reliability and clear improvement over baseline models. For cases, however, coverage was clearly below nominal from the two-week horizon onward, and in terms of mean weighted interval scores the ensemble failed to outperform the KIT-baseline model three and four weeks ahead.
This strengthens our previous conclusion (12) [...] led the organizers to suspend ensemble case forecasts beyond the one-week horizon.

The differences between our two study periods illustrate that performance relative to simple baseline models depends strongly on how well these baselines fit a given period. Cases in Germany plateaued during November and early December 2020, making the last-observation-carried-forward strategy of KIT-baseline difficult to beat. The second evaluation period was characterized by longer stretches of continued upward or downward trends, making it much easier to beat that baseline. In this situation, however, many models did not achieve strong improvements over the extrapolation approach KIT-extrapolation_baseline. Ideally one would wish complex forecast models to outperform each of these different baseline models. However, there are many ways of specifying a "simple" baseline (24), and post-hoc at least one of them will likely be [...]

To interpret these insights we note that, in principle, there are two ways of forecasting epidemiological time [...]

Death forecasts belong to category (ii), with cases and hospitalizations serving as leading indicators. This prediction task has been addressed with considerable success. Case forecasts, on the other hand, are typically based on approach (i), which largely reduces to trend extrapolation unless models are carefully tuned to changing NPIs (see Table 3).
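To make the two kinds of simple reference forecasts discussed in this section concrete, the following Python sketch contrasts a last-observation-carried-forward point forecast with a naive multiplicative trend extrapolation. It is an illustration under stated assumptions, not the implementation of the KIT baseline models, which also issue predictive quantiles; the function names and the example counts are hypothetical.

```python
from typing import List, Sequence


def locf_forecast(history: Sequence[float], horizons: int = 4) -> List[float]:
    """Last observation carried forward: repeat the latest value at every horizon."""
    if not history:
        raise ValueError("history must contain at least one observation")
    return [float(history[-1])] * horizons


def multiplicative_extrapolation(history: Sequence[float], horizons: int = 4) -> List[float]:
    """Extend the most recent week-on-week growth factor into the future."""
    if len(history) < 2:
        raise ValueError("need at least two observations")
    last, previous = float(history[-1]), float(history[-2])
    # Fall back to a flat forecast if the previous value is zero.
    growth = last / previous if previous > 0 else 1.0
    return [last * growth ** h for h in range(1, horizons + 1)]


# Made-up weekly case counts, growing by 10% per week.
weekly_cases = [1000.0, 1100.0, 1210.0]
print(locf_forecast(weekly_cases))
print(multiplicative_extrapolation(weekly_cases))
```

During a plateau the two forecasts nearly coincide, whereas during a sustained trend the carried-forward forecast lags behind, which is consistent with the observation that the choice of baseline matters for relative performance.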
Theoretical arguments on the limited predictability of turning points in such curves have been brought forward (27; 28), and empirical work including ours confirms that this is a very difficult task. The success of the two microsimulation models MOCOS-agent1 and ICM-agentModel in anticipating the downward turn in cases in Poland is encouraging, but remains a rather rare exception. Potential leading indicators to improve case forecasts could be trajectories in other countries (29) or additional data streams on, e.g., mobility, insurance claims or web searches. However, the benefits of such data for short-term forecasting have thus far been found to be modest (30). Changes in dominant variants may make changes in overall trends predictable as they arise from the superposition of adverse but stable trends for the different variants. The availability of sequencing data has improved considerably since our study period, but in practice the associated delays may still limit predictability in crucial periods.

We have extensively discussed the difficulties models encountered at turning points, both upward and [...] important from a subject-matter perspective, this is not without problems from a formal forecast evaluation standpoint. Major turning points are rare events and as such difficult to forecast. Focusing evaluation solely on these instances will benefit models with a strong tendency to predict change, and adapting scoring rules to emphasize these events in a principled way is not straightforward.
This problem is known as the forecaster's dilemma (32) in the literature and likewise occurs in, e.g., economics and meteorology (see illustrations in Table 1 from (32)).

The present paper marks the end of the German and Polish COVID-19 Forecast Hub as an independently run platform. In April 2021, the European Centre for Disease Prevention and Control (ECDC) announced the launch of a European COVID-19 Forecast Hub (4), which has since attracted submissions from more than 30 independent teams. The German and Polish COVID-19 Forecast Hub has been synchronized with this larger effort, meaning that all forecasts submitted to our platform are forwarded to the European repository, while forecasts submitted there are mirrored in our dashboard. In addition, we still collect regional-level forecasts, which are not currently covered in the European Forecast Hub. The adoption of the Forecast Hub concept by ECDC underscores the potential of collaborative forecasting systems with combined ensemble predictions as a key output, along with continuous monitoring of forecast performance. We anticipate that this closer link to public health policy making will enhance the usefulness of this system to decision makers. An important step will be the inclusion of hospitalization forecasts. Due to unclear data access, these had [...]

The methods described in the following are largely identical to those in the first part (12) of our study, but are presented in abridged form to ensure self-containedness of the present work.

[...] For a central prediction interval [l, u] at level (1 − α), thus reaching from the α/2 to the 1 − α/2 quantile, the interval score is defined as

$$\mathrm{IS}_\alpha(F, y) = (u - l) + \frac{2}{\alpha}\,(l - y)\,\chi(y < l) + \frac{2}{\alpha}\,(y - u)\,\chi(y > u),$$

where l and u are the lower and upper ends of the prediction interval, χ is the indicator function and y is the realized value. Here, the first term characterizes the spread of the forecast distribution, the second penalizes overprediction (observations fall below the prediction interval) and the third term penalizes underprediction.
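The three components of the interval score can be computed directly. The Python sketch below assumes the standard form of the score, with spread u − l and penalties of 2/α times the distance by which the observation misses the interval; the function name and the example numbers are illustrative and not taken from the study code.

```python
from typing import Dict


def interval_score(lower: float, upper: float, y: float, alpha: float) -> Dict[str, float]:
    """Interval score of a central (1 - alpha) prediction interval, with its components."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    if lower > upper:
        raise ValueError("lower must not exceed upper")
    spread = upper - lower
    # Observation below the interval: the forecast was too high (overprediction).
    overprediction = (2.0 / alpha) * max(lower - y, 0.0)
    # Observation above the interval: the forecast was too low (underprediction).
    underprediction = (2.0 / alpha) * max(y - upper, 0.0)
    return {
        "spread": spread,
        "overprediction": overprediction,
        "underprediction": underprediction,
        "total": spread + overprediction + underprediction,
    }


# A 50% interval [90, 110] (alpha = 0.5) scored against three made-up observations.
for observed in (100.0, 120.0, 85.0):
    print(observed, interval_score(90.0, 110.0, observed, alpha=0.5))
```

An observation inside the interval incurs only the spread term, so narrow intervals are rewarded as long as they still cover the realized value.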
To assess the full predictive distribution we use the weighted interval score (WIS; (17)). The WIS is a weighted average of interval scores at different nominal levels and the absolute error of the predictive median. For K nominal levels α_1, ..., α_K it takes the form

$$\mathrm{WIS}(F, y) = \frac{1}{K + 1/2} \left( \frac{1}{2}\,|y - m| + \sum_{k=1}^{K} \frac{\alpha_k}{2}\, \mathrm{IS}_{\alpha_k}(F, y) \right),$$

where m is the predictive median. The WIS is a well-known approximation of the continuous ranked probability score [...] forecast spread, overprediction and underprediction, which makes average scores more interpretable. As secondary measures of forecast quality we use the absolute error to assess the central tendency of forecasts and interval coverage rates of 50% and 95% prediction intervals to assess calibration.

As specified in our study protocol, whenever forecasts from a model were missing for a given week, we imputed the score with the worst (largest) value achieved by any other model for the respective week and target. However, almost all teams provided complete sets of forecasts and very few scores needed imputation.

During the evaluation period, forecasts from fifteen different models run by fourteen independent teams of researchers were collected. Thirteen of these were already available during the first part of our preregistered study; see Table 3 and Supplementary Note 3 of (12) for detailed descriptions. Table 3 [...] which has been taken from (34). Detailed descriptions can be found in (12) [...] we always used the most recent prediction available on a given forecast date.
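The scoring and imputation steps described in this section can be sketched in a few lines of Python. The sketch assumes the standard WIS form (weight 1/2 on the absolute error of the median, weight α_k/2 on each interval score, normalization by K + 1/2) and represents a missing score as `None`; the function names, model labels and numbers are hypothetical and not taken from the study code.

```python
from typing import Dict, Mapping, Optional, Tuple


def interval_score(lower: float, upper: float, y: float, alpha: float) -> float:
    """Interval score of a central (1 - alpha) prediction interval."""
    return (upper - lower) + (2.0 / alpha) * max(lower - y, 0.0) + (2.0 / alpha) * max(y - upper, 0.0)


def weighted_interval_score(median: float,
                            intervals: Mapping[float, Tuple[float, float]],
                            y: float) -> float:
    """WIS from a predictive median and a mapping alpha -> (lower, upper)."""
    total = 0.5 * abs(y - median)
    for alpha, (lower, upper) in intervals.items():
        total += (alpha / 2.0) * interval_score(lower, upper, y, alpha)
    return total / (len(intervals) + 0.5)


def impute_worst(scores: Mapping[str, Optional[float]]) -> Dict[str, float]:
    """Replace missing scores (None) by the largest score achieved by any other model."""
    available = [s for s in scores.values() if s is not None]
    if not available:
        raise ValueError("at least one model must have a score")
    worst = max(available)
    return {model: (worst if s is None else s) for model, s in scores.items()}


# Made-up forecast: median 100 with 50% and 95% intervals, observation 120.
wis = weighted_interval_score(100.0, {0.5: (90.0, 110.0), 0.05: (70.0, 130.0)}, 120.0)
print(round(wis, 3))

# Made-up scores for one target and week; model_c did not submit.
print(impute_worst({"model_a": 10.6, "model_b": 14.2, "model_c": None}))
```

Because the imputed value is the largest score among the submitting models, a model with gaps can never look better than it would have with a poor submission, which matches the pessimistic assessment noted in the table captions.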
[...] obtained by the member models over the last six evaluated forecasts (last three one-week-ahead, last two two-week-ahead, last three-week-ahead; missing scores are imputed by the worst score achieved by any model for the respective target). This is done separately for each time series to be predicted.

Inverse score weighting has recently also been employed by (36), who found it to perform well in a re-analysis of forecasts from the US COVID-19 Forecast Hub. In the study protocol, the median ensemble was defined as our primary ensemble approach (10), which is why we displayed this version in all figures and focused our discussion on it. We have previously discussed advantages and disadvantages of the different ensemble approaches in (12).

There were no formal inclusion criteria other than completeness of the submitted set of 23 quantiles. The Forecast Hub team did, however, occasionally exclude forecasts with highly implausible central tendency or [...]

The forecast data generated in this study have been deposited in a GitHub repository (https://github. [...]

[...] https://www.ecdc.europa.eu/en/news-events/forecasting-covid-19-cases-and-deaths-europe-new-hub.

[20] Fischer-Fels, J. Erste Hochrechnung zur Verbreitung der Coronamutationen. Ärzteblatt (2021 [...]
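The two ensemble approaches named in the methods, a quantile-wise median and a combination weighted by inverse past WIS, can be illustrated with a short Python sketch. It assumes that the weighted ensemble is a quantile-wise weighted mean with weights proportional to the inverse of each member's mean WIS; the function names, member labels and numbers are hypothetical, and the sketch uses three quantiles instead of the 23 required by the Hub.

```python
from statistics import median
from typing import Dict, List, Mapping, Sequence


def _check_lengths(member_quantiles: Mapping[str, Sequence[float]]) -> int:
    """Return the common number of quantiles, or raise if members disagree."""
    lengths = {len(q) for q in member_quantiles.values()}
    if len(lengths) != 1:
        raise ValueError("all members must provide the same number of quantiles")
    return lengths.pop()


def median_ensemble(member_quantiles: Mapping[str, Sequence[float]]) -> List[float]:
    """Quantile-wise median across member models."""
    _check_lengths(member_quantiles)
    return [median(values) for values in zip(*member_quantiles.values())]


def inverse_score_weights(mean_wis: Mapping[str, float]) -> Dict[str, float]:
    """Weights proportional to 1 / mean WIS, normalized to sum to one."""
    if any(score <= 0 for score in mean_wis.values()):
        raise ValueError("mean WIS values must be positive")
    inverse = {model: 1.0 / score for model, score in mean_wis.items()}
    total = sum(inverse.values())
    return {model: value / total for model, value in inverse.items()}


def weighted_mean_ensemble(member_quantiles: Mapping[str, Sequence[float]],
                           weights: Mapping[str, float]) -> List[float]:
    """Quantile-wise weighted mean across member models."""
    n = _check_lengths(member_quantiles)
    return [sum(weights[m] * q[i] for m, q in member_quantiles.items()) for i in range(n)]


# Made-up quantile forecasts (lower, median, upper) from three members.
members = {"a": [90.0, 100.0, 110.0], "b": [80.0, 100.0, 130.0], "c": [100.0, 120.0, 140.0]}
weights = inverse_score_weights({"a": 10.0, "b": 20.0, "c": 20.0})
print(median_ensemble(members))
print(weights)
print(weighted_mean_ensemble(members, weights))
```

The median ensemble ignores past performance entirely and is robust to a single outlying member, while the weighted variant rewards members with low recent WIS at the cost of depending on a short scoring history.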
Asterisks mark prediction intervals exceeding the upper plot limit. The figure shows forecasts from models not displayed in Figure 2.
Table 4: Forecast evaluation for Germany and Poland, 3 and 4 weeks ahead (incidence scale, based on RKI/MZ data). C_0.5 and C_0.95 denote coverage rates of the 50% and 95% prediction intervals; AE and WIS stand for the mean absolute error and mean weighted interval score.

Table 7: Forecast evaluation for Germany and Poland, 3 and 4 weeks ahead (cumulative scale, based on RKI/MZ data). C_0.5 and C_0.95 denote coverage rates of the 50% and 95% prediction intervals; AE and WIS stand for the mean absolute error and mean weighted interval score. *Asterisks mark entries where scores were imputed for at least one week. Weighted interval scores and absolute errors were imputed with the worst (largest) score achieved by any other forecast for the respective target and week. Models marked thus received a pessimistic assessment of their performance. If a model covered less than two thirds of the evaluation period, results are omitted.

Table 9: Forecast evaluation for Germany and Poland, 3 and 4 weeks ahead (incidence scale, based on JHU data).
C_0.5 and C_0.95 denote coverage rates of the 50% and 95% prediction intervals; AE and WIS stand for the mean absolute error and mean weighted interval score.

Table 10: Forecast evaluation for Germany and Poland, pooled across evaluation periods, 1 and 2 weeks ahead (incidence scale, based on RKI/MZ data). C_0.5 and C_0.95 denote coverage rates of the 50% and 95% prediction intervals; AE and WIS stand for the mean absolute error and mean weighted interval score. Asterisks mark entries where scores were imputed for at least one week. Weighted interval scores and absolute errors were imputed with the worst (largest) score achieved by any other forecast for the respective target and week. Models marked thus received a pessimistic assessment of their performance. (Table body lost in extraction; surviving row fragments: KITCOVIDhub-median_ensemble 31; KITCOVIDhub-inverse_wis_ensemble 9.)

Table 11: Forecast evaluation for Germany and Poland, pooled across evaluation periods, 3 and 4 weeks ahead (incidence scale, based on RKI/MZ data).
C_0.5 and C_0.95 denote coverage rates of the 50% and 95% prediction intervals; AE and WIS stand for the mean absolute error and mean weighted interval score.