key: cord-0519889-wucowgfq
authors: Krymova, Ekaterina; B'ejar, Benjam'in; Thanou, Dorina; Sun, Tao; Manetti, Elisa; Lee, Gavin; Namigai, Kristen; Choirat, Christine; Flahault, Antoine; Obozinski, Guillaume
title: Trend estimation and short-term forecasting of COVID-19 cases and deaths worldwide
date: 2021-06-18
journal: nan
DOI: nan
sha: 672900599dfa9d33428064ba7c9abb80e8b6ae16
doc_id: 519889
cord_uid: wucowgfq

Since the beginning of the COVID-19 pandemic, many dashboards have emerged as useful tools to monitor the evolution of the pandemic, inform the public, and assist governments in decision making. Our goal is to develop a globally applicable method, integrated in a twice daily updated dashboard that provides an estimate of the trend in the evolution of the number of cases and deaths from reported data of more than 200 countries and territories, as well as a seven-day forecast. One of the significant difficulties to manage a quickly propagating epidemic is that the details of the dynamic needed to forecast its evolution are obscured by the delays in the identification of cases and deaths and by irregular reporting. Our forecasting methodology substantially relies on estimating the underlying trend in the observed time series using robust seasonal trend decomposition techniques. This allows us to obtain forecasts with simple, yet effective extrapolation methods in linear or log scale. We present the results of an assessment of our forecasting methodology and discuss its application to the production of global and regional risk maps.

It is of utmost importance for governments and decision-makers in charge of the healthcare system response to anticipate the evolution of the current COVID-19 pandemic Velavan and Meyer [2020] . When accurate and reliable, predictions can be very informative to define appropriate policy measures and interventions, such as lockdown and containment measures, border closures, quarantines, school openings, and physical distancing. They are also useful to predict hospital surge capacity, in order to manage hospital resources Lutz et al. [2019] . Given that the pandemic is affected by these measures, testing policies, the appearance of new variants, the diffusions across borders, etc, long term forecasts are difficult and their usefulness remains unclear Ioannidis et al. [2020] , whereas accurate short-term forecasts provide useful actionable information. Nevertheless, even short-term forecasts are far from trivial as recently evidenced by Cramer et al. [2021] , where a simple baseline appears to be not so easy to beat on a one-week horizon.

In this work, we propose a general methodology to produce forecasts on a oneweek horizon, which is applicable to close to 200 countries, and as many states/regions or provinces. An additional challenge to achieve this goal is that the quality of the reported data varies significantly from country to country. This translates into different fluctuations and irregularities that can be observed in the reported time-series Wilke and Bergstrom [2020] . Many countries do not report on a daily basis or delay their reports to particular days of the week. In particular, seasonal patterns with a weekly cycle are observed for many countries. In several countries, e.g. in Switzerland, the number of reported cases shows a significant decline during or immediately after the weekend, which is probably due to the fact that, on those days, fewer patients get tested and/or the reporting is less active and thus delayed. Also, it is important to note that, as illustrated in the data from Spain on Fig. 1 (a) , seasonal patterns are nonstationary and can actually change in time, in particular, if the reporting policies change. Furthermore, delays in reporting, changes in death cause attribution protocols, as well as changes in testing policies lead to abrupt corrections that introduce backlogs on some days, such that a number of daily cases or deaths which are anomalously high or even negative are reported. To take into account these peculiarities, we propose a forecasting methodology that relies on estimating the underlying trend with a robust seasonal-trend decomposition method and using simple extrapolation techniques to make a forecast over a week.

The problem of forecasting the evolution of the COVID-19 pandemic has attracted the attention of many researchers, institutions, and individuals across the globe. As a result, a significant number of dashboards have appeared that monitor and/or make predictions about the evolution of the pandemic based on past observations. All these efforts are also being leveraged to build ensemble predictive models for different regions in the World. For instance, the United States Center for Disease Control and Prevention provides ensemble predictions for the US at the state level in US Covid-19 Forecast Hub 1 . Similarly, the German and Polish COVID-19 ForecastHub 2 provides ensemble predictions at the regional level for Germany and Poland. And more recently, the European Centre for Disease Prevention and Control (ECDC) launched a European Covid-19 Forecast Hub 3 to provide short-and long-term ensemble forecasts for Europe. The considered modelling approaches rely on different data sources used (cases or/and death data, tests data, hospital data, mobility data etc.) and aim for forecasting horizons ranging from 1 week to months. Epidemiological compartmental models or models inspired by them are among the most popular ones for the forecasting task e.g., IHME Covid [2021] , YYG-ParamSearch 4 , UMass-MechBayes Gibson et al. [2020] , IEM Health-CovidProject 5 and USC-SIkJalpha Srivastava et al. [2020] . Such models (e.g., SEIR Allen et al. [2008] ) split the population in different groups (age, demographics) and states (susceptible, infected, etc.) and model the transition dynamics of the population between the different states over time. The different parameters of the models can be deterministic or random, or be allowed to vary in time. Other approaches use statistical regression, e.g. UMich-RidgeTfReg 6 , LANL Castro et al. [2020a] , curve fitting, e.g. RobertWalraven-ESG 7 , and machine learning, e.g. GT-DeepCOVID Rodriguez et al. [2020] , to learn a predictor from past observations, or time-series (e.g., ARIMA), e.g. MUNI-ARIMA 8 , to learn a representation describing the evolution of the dynamics of the observed measurements, e.g. CMU TimeSeries 9 , Ahmad et al. [2021] . Some models make strong assumptions on the transmission dynamics Castro et al. [2020b] or specific assumptions on the effect of different policies, e.g. Keskinocak et al. [2020] , Lemaitre et al. [2020] , Bertozzi et al. [2020] . These are just a few sample references and the list is by no means exhaustive. For a more complete list of recent related literature we refer the reader to Friedman et al. [2020] , Cramer et al. [2021] , Bracher et al. [2020 Bracher et al. [ , 2021a , Ray et al. [2020] .

Our approach differs from most of the other existing approaches in two ways. First of all, given that the usefulness of long-term (e.g., several weeks or months) forecasts has been subject to debate, because of the complexity of the phenomenon and the impossibility of taking into account a number of important factors, we consider, like Jewell et al.

[2020] "that short-term projections are the most that can be expected with reasonable accuracy". We thus focus on the prediction of short-term (one to two weeks ahead) forecasts of daily numbers for deaths and cases. Second, instead of building directly a forecasting model, our approach implements first a trend estimation model from daily cases/deaths observations that make little or no assumptions about the underlying dynamics, and don't require to estimate a large number of parameters (as opposed to e.g., SEIR-type), hence are robust and easier to apply at a more global scale. Our forecast is then obtained from the trend estimated with a simple extrapolation scheme. Our trend estimates are of independent interest, and we further use them to provide an independent estimate of the R-effective, which is an important measure for decision-making.

We apply the proposed algorithms to produce daily updated forecasts available on our Our dashboard, together with an implemented forecasting methodology, has been in operation since the very beginning of the pandemic. The forecasting methodology presented in the paper was put in production already in September 2020 and has therefore been producing results on the dashboard since. The dashboard has been actively used by epidemiologists and global health experts to analyse the evolution of the epidemiological situation and to provide recommendations to several European governments. From the early phases of the pandemic, we delivered daily forecasts of cases and deaths for 192 countries as well as regions of several countries, including Switzerland, Canada, US, and France for which we also provide state or region level statistics and trend estimates. In addition, we provide estimates of the prediction intervals based on the retrospective performance of the forecasting and the effective reproduction number based on our trend estimate via the method of Huisman et al. [2020] together with risk maps that assign to each country a color corresponding to its current epidemiological status (cf. Discussion).

To model a potentially quickly varying seasonal pattern and suppress the influence of outliers, we implemented a piecewise trend estimation method based on the robust Seasonal Trend decomposition procedure based on LOESS (STL) Robert et al. [1990] . STL is a filtering procedure for decomposing time series into trend, seasonal and residual components, which is furthermore robust to outliers. Specifically, the raw daily observations are modelled as:

where τ t is a slowly changing trend, δ t is a possibly slowly changing seasonal component and r t is a residual. Since the magnitude of the seasonal term can reasonably be expected to be proportional to the trend, allowing for the seasonal component to change with time is relevant here, especially since, as discussed, reporting patterns change in time in several time series. To obtain a more quickly adaptive algorithm, we use STL to produce separate trend estimates on windows of 6 weeks and recombine them. Outliers identified as a by-product of the STL procedure are removed, the corresponding counts are redistributed in recent history, and the trend estimation procedure is run one more time on the cleaned data (see the Methods Section for more details). Fig. 1 illustrates the behavior of our trend estimation procedure for different countries that represent a certain diversity. In the cases of Germany and Brazil (b, e) the weekly seasonal effect is quite significant and clearly non-stationary, in particular between December 22nd and January 3rd, 2021. For China (d) , no particular seasonal effect can be identified and several outliers seem to co-occur with the peak of the wave. The state of Kansas (US) (c) illustrates an example of a fairly irregular seasonal effect. In all these cases, the trend estimation proposed appears to be robust to outliers, to changes in seasonality or lack thereof, and adapts to the regularity of the underlying trend. Beyond a qualitative evaluation of the trend estimation, a quantitative evaluation is difficult for the lack of any ground truth, especially since the underlying dynamics of the trend in various countries and provinces are quite different. To some extent, the trend estimation proposed can be validated quantitatively via the forecasting algorithm which relies on it, since the quality of the forecast depends on the quality of the estimation of the trend.

To predict cases and deaths one week ahead we propose to simply extrapolate linearly the daily trend, which was obtained with the above trend estimation algorithm (see Fig. 1 ) either on the original or on the log-scale by preserving the most recent slope of the estimated trend. In the case of a decreasing trend slope the extrapolation is carried out in log-scale to prevent undershooting. For the case of an increasing trend the extrapolation is performed in linear scale to prevent overshooting. To forecast the number of deaths, some models have been using lagged cases as input. Given the diversity of situations in different territories, and the fact that the relation between deaths and cases was sometimes quite unclear or changing in a short amount of time, we used the same simple forecasting approach as for cases. See the SI for a discussion and references. Following the recommendations of Cramer et al. [2021] and the requests of different forecast hubs, we also produce a probabilistic forecasts for the weekly counts, under the form of a collection of 23 quantiles corresponding to the levels 0.05k for k = 1, . . . , 19 and the extreme levels α = (0.001, 0.025, 0.975, 0.99). These quantiles are estimated from quantiles estimates of appropriately normalized errors of our forecast on a recent history, that are extrapolated for extreme levels using a tail model. See Fig. 2 and Figs. S12-14 in SI for illustrations and Materials and Methods for more detail.

We evaluate our forecasts of the number of new cases in two ways, first, by comparing them with the forecasts obtained by several methods submitted to the European Covid-19 Forecast Hub 10 , second, by comparing our forecast with a baseline on a larger set of countries, namely the naive forecast (used on several hubs) that assumes that the weekly number of cases remains constant over the following week. To obtain interval forecasts, quantiles of the baseline predictive distribution are estimated from symmetrized observed errors of the baseline as in the US and European Covid Forecast Hubs Cramer et al.

[2021] 11 .

The comparison with the methods submitting of the European Covid-19 Forecast Hub is made in terms of MAE and average weighted interval score (WIS) Bracher et al.

[2021b] (see Materials and methods). We also provide some details on the performance of our deaths forecasts in Section A of SI.

To compare our method to the baseline we compute the relative improvement in mean absolute error (RMAE), relative improvement in median absolute error (RmedianAE), and relative improvement in average WIS (RWIS). The relative improvement is positive when the proposed forecasting method has a smaller error than the baseline. It can be thought of as a rate of decrease in error with respect to the baseline. 

In order to compare the performance of our method with other methods, we used the data available at the EU Covid Forecast Hub. The methods submitted to the Hub are aggregated in order to obtain a EuroCOVIDhub-ensemble method, and a baseline (EuroCOVIDhub-baseline) is available. The weekly forecasts can be submitted once a week. Between April 1st and December 15th 2021, 43 submissions are available for both the EuroCOVIDhub-ensemble and the EuroCOVIDhub-baseline. We included in our comparison five other methods whose forecasts were all available for all the 31 countries included in the hub, on a common large subset of 32 weeks (i.e., 75% of all weeks). These methods are: MUNI-ARIMA, IEM Health-CovidProject, USC-SIkJalpha Srivastava et al. [2020] , RobertWalraven-ESG, ILM-EKF 12 . In order to obtain results comparable across different countries, we report the ratio of the MAE (or average WIS) of each method to the MAE (or average WIS) of the EuroCOVIDhub-baseline, following the reporting standards 13 of the hub. (See SI Section A).

The results show that the proposed method (SDSC ISG) performs well for most of the countries, e.g. see the histogram of average WIS in Fig. 3 (for MAE see Fig. S1 in SI). It is clear that in terms of average WIS, the performance of our method is one of the best ones and is close to the performance of the ensemble method and MUNI-ARIMA.

Additionally, we ranked the methods according to their performance from 1 to 7, where 1 corresponds to the method with the smallest value of MAE or average WIS. The results can be summarized as follows: for average WIS our method outperforms all other methods (including the ensemble) for 8 countries, it ranks second or best for 16 of them, and it is among the top three for 25 countries; for MAE it ranks first for 9 countries, second or best for 18, and within the top three for 25. More details can be found in SI Section A and Tables S1-S2. Additionally, we used our methodology to perform 2 weeks ahead forecasts and obtained similar results (see Tables S3-S4 and Fig. S2 -S3 in SI). Given that our forecast is based on a simple extrapolation of our trend estimate, this suggests that the trend estimate is accurate even on the boundary of the period where data is available.

For countries that report new cases with irregular delays, it is difficult to know whether the discrepancy between the forecast and reported weekly numbers is due to errors of the forecast or the fact that the reported numbers actually do not reflect accurately the current number of new cases.

We, therefore, present the main evaluation of our forecasting strategy on a restricted set of 80 countries, which report sufficiently frequently with a relatively low number of outliers. These countries were selected based on a set of criteria that are independent from our trend estimation and forecast methodology (see Materials and Methods). Nevertheless, we present the evaluation on the full list of countries in SI Section A. We use the data provided by Johns Hopkins University Dong et al. [2020] after April 1st, 2020, which corresponds approximately to the date after which all countries started reporting regularly.

For the 80 selected countries, we performed a retrospective analysis from April 1st, 2020 till December 15, 2022. For each day in this period, we forecast the total number of cases over the week following that day using our methodology and the baseline using the data that was available at that date (and thus without corrections made a posteriori). As ground-truth we used the weekly data available on January 10, 2022. We evaluate our forecast by reporting, for each country the RMAE, the RmedianAE, the relative improvement in coverage, and RWIS. The detailed evaluation results for the full list of considered countries can be found in SI Section A. In Fig. 4 (left) we display a scatter plot of the relative improvement in terms of RMAE and RWIS of our proposed forecast methodology over the baseline for the subset of 80 regularly reporting countries and for one week ahead forecast. As it can be seen, our method outperforms the baseline in both metrics for most of the selected countries. The same RMAE and RWIS values are displayed in Fig. 4 (right) for 30 countries with either large populations or large population density, where the impact of the pandemic is potentially more important in terms of scale.

Out of the 80 countries, 72 (i.e., 90%) show an improvement in MAE, 66 (82.5%) show an improvement in median AE, 71 (88.75%) show an improvement in WIS, and 68 countries show an improvement in both MAE and WIS. Only 5 countries do not show an improvement in either of these criteria. We also measure the coverage of the estimated prediction intervals, where our forecast is more accurate than the baseline for 66 countries. There are 53 out of 80 countries for which our method shows an improvement in all four metrics (MAE, median AE, average WIS, and coverage). This is the case, for example, for the US where the improvement in MAE and WIS over the baseline is 25%. We note that in the work of Cramer et al. [2021] , which compares forecasting algorithms focusing on US data, only 6 algorithms out of 23 achieve an improvement of more than 20% in MAE over the baseline (for forecasts on horizons of 1 to 4 weeks; see Table 2 in that paper). The countries, for which our method did not perform better than the baseline, typically have long plateaus that the baseline benefits from, or/and have quickly changing seasonality patterns and direction of the trend, where the simple and robust baseline makes smaller errors. We analysed how the AE varies as a function of the growth rate of the trend and evaluated to what extent, as soon as the trend is not flat, our method produces improved forecasts compared to the baseline (see Section F and Fig. S4 in SI). Our method outperforms the baseline predictor when the growth rate is larger than 3% in absolute value, which shows that the proposed forecast is informative as soon as the trend is not flat.

The comparisons with the forecasts submitted to the European Covid Forecast Hub and the baseline demonstrate that the proposed forecasting methodology performs well. It should be noted that, as a forecasting method, the considered baseline is uninformative in the sense that it does not attempt to characterize the evolution of the curve. In spite of this, as reported in Cramer et al. [2021] , it is not easy to outperform this type of baseline in terms of pure predictive accuracy. Thus, any forecasting method characterizing the evolution of the curves which improves over this baseline can be useful. Our forecasts are available in the US Covid-19 Forecast Hub, European Covid-19 Forecast Hub, and Apart from producing forecasts and estimates of prediction intervals, the trend estimates that we obtain are of independent interest. We use them in particular to produce a stable estimate of the R-effective (R-eff for short). The R-eff measures the expected number of people that can be infected by an individual at any given time Mahase [2020] and has been used as a key indicator in this pandemic. Since its estimation requires essentially to solve a deconvolution problem Huisman et al. [2020] , it is quite sensitive to the irregularities in the data. In the original paper the authors use LOESS smoothing in order to decrease their influence. In our case, we propose to apply the deconvolution based on our piecewise robust STL trend estimate.

Finally, we use our estimates of the R-eff together with our forecasts to produce global daily risk maps according to the following scheme: If the number of tests as reported in Our World in Data is available and is above 10,000 per 1M individuals, we compare the prediction for the number of weekly cases per 100K and R-eff with corresponding thresholds to color code the map. Green is assigned if the number of weekly cases per 100K inhabitants was below 30, orange is assigned if it is above 30 and the epidemic curve is descending (R-eff¡0.9), and red is assigned if it is above 30 the epidemic curve is ascending or plateauing (R-eff¿0.9). Other organisations use similar thresholds for the cumulative rate per 100K. For instance, our choice of a threshold coincides with the upper value of the 3rd level (out of 7) in the ECDC map of the geographic distribution of 14 https://kitmetricslab.github.io/forecasthub/forecast Covid cases. The value of the R-eff is not taken into account in the risk assessment (color code) of a country when the incidence numbers are low since its estimation becomes less reliable. Besides, it theoretically converges to 1 at the end of an epidemic, and in such regime it is no longer an indicator of the severity of the pandemic. As a consequence, if no test data is available, or the number of tests are below 10,000 per 1M population, the region is colored in grey, meaning that the data is missing/unreliable and no risk assessment can be made. An example of a risk map is given in Fig. 5 . These maps are useful to compare the levels of epidemic activity between countries based on a discrete color code. These comparisons can be insightful, especially if the R-eff and different non-pharmaceutical interventions of countries are taken into account.

The fact that our model makes minimal assumptions about the data is an advantage to make it applicable to a large number of countries and regions. But there are of course downsides. In particular, our models only take into account a fairly limited amount of information, and in particular no indicators of mobility, prophylactic measures, or lockdowns are taken into account. Our models can however detect the effect of changes of behavior, which can be quite informative for decision-makers. For example, when Ireland faced a second wave and the decision of lockdown was made on the October 21st, 2020 our models predicted exponential growth for the 7 following days. However, almost the day after the curve broke and on October 25th the R-eff was clearly below 1, with a descending epidemic trend. That could not be attributed to the lockdown measures only four days after being taken.

We have proposed a methodology for trend estimation and short-term forecasts of the evolution of the number of Covid-19 cases and deaths, which is broadly applicable to a large number of different countries, states, and regions. Beyond its use to produce forecasts, our trend estimation method is of independent value, as it aims at providing a clear view of the current local evolution of the trend. Estimating the recent behavior of the trend is important as a tool to assess the current epidemic situation and to be able subsequently to analyze the effect of various measures. We use in particular our trend estimate to produce our own estimates of the R-effective curves, which are in turn used to produce risk maps. For the forecast, our evaluation shows that, 1) the methodology performs well compared to several methods submitted to the European Covid Forecast Hub, 2) for the 80 selected countries in which we can reliably use weekly data as ground truth, we outperform the baseline in a large fraction of countries.

Before applying the trend estimation model to the data, we remove negative values corresponding to reassessment of the counts, while making sure that the cumulative counts are preserved by appropriately scaling our estimates. We infer which zero counts on a given day correspond to missing reports, and we eventually impute the counts corresponding to these missing reports. The corresponding procedures are detailed in SI Section B.

Our trend estimation algorithm leverages the Seasonal-Trend decomposition procedure based on LOESS (STL) proposed by Robert et al. [1990] . STL consists of an iterative procedure that alternates between the estimation of the seasonal and trend components, each of them being estimated using a LOESS model, as well as the reestimation of importance weights associated with each observation for robust estimation. More precisely, the algorithm consists of two nested loops: The inner loop comprises several steps involving moving average and LOESS nonparametric regressions Cleveland and Devlin [1988] in order to estimate the seasonal and trend components for the current set of robustness weights. The importance weights are then updated in the outer loop based on the residuals after the update of the seasonal and trend components. The procedure is repeated for a number of iterations.

First, we smooth with STL separately all intervals of 6 weeks that are starting every three weeks from the end of the time series (so that each interval is half overlapping with the previous one). In the absence of outliers (which should be detected by the robust STL procedure), each trend estimate computed on a six weeks interval is rescaled to account for the same total number of counts as the observed data, and to obtain the final trend estimate, on each disjoint three weeks interval defined from the end of the time-series, we compute a pointwise weighted combination of the two overlapping trend estimates computed on this interval. In the presence of outliers, these are first identified by our method and redistributed in the past. After that, the same procedure as explained before is applied to the corrected data. For further details, we refer to SI Section C.

We produce probabilisic forecasts under the form of a collection of 23 quantiles for the predicted average daily counts, which produce as well a collection of nested prediction intervals of the form [q α , q 1−α ]. These quantiles are estimated based on empirical quantiles of the retrospective deviations of the daily forecast f h,t from the actual weekly rolling meanx t+h = 1 7 3 k=−3 x t+h+k for horizons h = 1, . . . , H and 19 levels α = 0.05, 0.10, . . . , 0.95, normalized by f t,h , namely the scaled errors

The length of the history for estimation of the 19 quantiles was set to 40 + H days. The normalization by f t,h , which performed better than f t,h or 1, is motivated by a Poisson error distribution model. For the lowest/highest levels (α = 0.01, 0.025, 0.975, 0.99), we extrapolated the quantiles based on an exponential tail model whose scaling parameter is estimated from the other estimated 19 quantiles. Finally, we shift all quantilesq α i ,h on the scaled errors by a constant to enforce thatq 0.5,h = 0. This is motivated by the fact that we expect our forecast to be close to the conditional best median forecast. The prediction quantiles are finally of the form q Fig. 2 and Figs. S12-14 in SI for illustrations. Similarly, for the weekly total number of cases/deaths k weeks ahead, k = {1, 2}, the ground truth can be computed as X t+4+7(k−1) = 7 h=1 x t+h+7(k−1) and the point forecast are just F t,4+7(k−1) = 7 h=1 f h,t+7(k−1) . The probabilistic forecast can be computed using the same approach as above by replacingx t+h by X t+4+7(k−1) and f t,h by F t,4+7(k−1) . In that case, the horizon is in weeks instead of days.

Ifx t = 1 7 3 k=−3 x t+h+k is as before the rolling mean of the number of daily new cases over a week and f t is the corresponding point forecast (which we identify with the median forecast), the absolute error of f t is AE(f t ) = |f t −x t |. We consider, as evaluation metrics, the mean absolute error (MAE) and the median absolute error over the evaluation period. Given a baseline b t =x t−7 , the relative MAE is defined as

Similarly, one can define RmedianAE.

Following the methodology presented in Cramer et al. [2021] , we evaluate our probabilistic forecasts using proper scoring rules defined for forecasts taking the form of a collection of quantiles or equivalently of nested intervals, namely the weighted interval score (WIS). The interval score Bracher et al. [2021b] at level α ∈ (0, 1) for the interval [ , u] and observation ξ is defined as

where 1{·} is 1 if the condition is satisfied and zero otherwise. The weighted interval score (WIS) Bracher et al.

[2021b] is a proper scoring rule for probabilistic forecast, which is defined as follows: for a number of levels A = {α 1 , . . . , α K }, α i ∈ [0, 0.5) and the corresponding estimated quantiles of the predictive distribution P , defined as q α = inf{q | P (Ξ ≤ q) ≥ α) for the level α i ∈ {α 1 , . . . , α K , 0.5, 1 − α K , . . . , 1 − α 1 }, where Ξ is the random variable associated with the observation ξ, as follows:

The average WIS is defined as the mean of the WIS for the predictive quantiles of distributions P t constructed to predictx t over the times t in the estimation interval, i.e., MWIS = 1 T T t=1 WIS(P t , A,x t ). Relative improvement in MWIS (RWIS) is defined in similarly as RMAE.

For our main results, we kept 80 countries whose reports of cases are sufficiently frequent, have only few missing values and a limited number of outliers. We proceeded as follows. First, we excluded 52 countries that reported cases on less than 70% of the days, since these countries have either a very small number of cases or are reporting very irregularly, and among the remaining countries, the 39 countries for which more than five consecutive days were missing. Then, we performed robust outlier detection (described in SI Section D) to estimate the number of outliers in each time series and we excluded the 20 countries with the largest number of outliers among the remaining ones. It is important to note that the selection criteria proposed here are independent of our trend estimation and forecast methodology.

Code and evaluations are accessible at https://github.com/ekkrym/CovidTrendModel. We use the publicly available data collected by Johns Hopkins University, which consists of countrywise daily cases and deaths. Regional-level data for Canada, Switzerland, and France are obtained from the JHU repository, the Specialized Unit for Open Government Data of the Canton of Zürich, and the French National Health Agency, respectively.

The development of the dashboard was partly funded by the Fondation Privée des Hôpitaux Universitaires de Genève. We would like to thank Fernando Perez-Cruz for several discussions and constructive comments which led to methodological improvements. 

Global comparison with the baseline We evaluate our method by reporting, for each country the RMAE, the RmedianAE, the relative improvement in mean total coverage (see Section E), and average WIS. The relative improvement in mean total coverage is computed as RC = (MC(s) − MC(b))/MC(b) from the coverages M C of the methods, since one aims for higher coverage by the estimates of the confidence intervals. In Fig. S5 , we illustrate the evaluation for the 80 countries with reliable data. In Fig. S6 , we illustrate the evaluation score for the 101 countries that did not pass our evaluation criteria. The evaluation on the remaining countries (with the exclusion of 11 countries which had a particularly low number of cases) shows that there remains 42 countries, for which our method still obtains a better MAE than the baseline, only 29 for which we improve in median AE, 63 for which RWIS is improved and 74 with improved total coverage. Many of the countries for which our method performs clearly worse than the baseline are in fact countries with a fairly low number of cases, which were not the focus of our modelling efforts. This is justified by the fact that accurate forecasts are not critical for these countries as long as their number remains low. Furthermore, the corresponding time series of a significant number of these countries have numerous irregularities of the reporting and backlogs which makes it harder to associate the target average weekly number with the true underlying trend. In that case, the simple baseline forecast, which is the average value of the previous week, appears to be closer to the target than the forecast obtained after trend estimation. Different smoothing techniques would be needed to produce better trend estimation for these countries, which take into account the discrete nature of the count and their Poissonian distribution.

The list of methods includes (with abbreviations used in the Tables S14-S14 below):

• EuroCOVIDhub-ensemble (EUHub-ens, https://covid19forecasthub.eu/visualisation.html),

• EuroCOVIDhub-baseline (https://covid19forecasthub.eu/visualisation.html),

• MUNI-ARIMA (MUNI, https://krausstat.shinyapps.io/covid19global/),

• IEM Health-CovidProject (IEM Health, https://iem-modeling.com/),

• USC-SIkJalpha (USC, https://scc-usc.github.io/ReCOVER-COVID-19/#/),

• RobertWalraven-ESG (RW, http://rwalraven.com/COVID19/Model),

• ILM-EKF (ILM, https://github.com/Stochastik-TU-Ilmenau),

• the proposed method (SDSC ISG, https://renkulab.shinyapps.io/COVID-19-Epidemic-Forecasting/).

Fig. S1-S3 show histograms of the errors of the methods with respect to the baseline with 0.1-wide bins. For visualization purposes, errors greater than 1.55 were set to 1.55 and therefore contribute to the last bin. One week ahead forecasts MAE measured in multiples of the baseline MAE and average WIS measured in multiples of the baseline WIS are presented in Tables S14 and S14. Two week ahead forecasts MAE measured in multiples of the baseline MAE and average WIS measured in multiples of the baseline WIS are presented in Tables S14 and S14.

At first sight, in a very simple theoretical model, the number of deaths should be related to the number of cases, and correspond simply to the fraction of the cases that did not survive. The strategy of estimating the deaths from cases has been particularly successful for the USA at the country and state levels. Among the models participating in the US COVID Forecast Hub, epiforecasts-ensemble1 contains a model which estimates deaths from a convolution of cases, the model MIT Crit Data-GBCF takes 3weeks lagged deaths and cases numbers as a part of the input, etc. Forecasting the number of deaths from a lagged case curve was one of our first approaches, but the diversity of situations encountered across the world and in time for a particular region makes it that this strategy fails in a number of cases. The relation between lagged cases and the number of deaths is sometimes quite unclear: for example, if we consider the evolution of the number of cases and deaths in Egypt in November-December 2021 that we show in Fig. S7 , the number of cases is almost not changing while the number of death is increasing and then decreasing. There are many reasons why the relation between the number of cases might be more complicated, be non-stationary and potentially change relatively quickly: as the virus circulates it affects different groups in the population who are more or less fragile and who protect their senior more or less well, the testing and reporting policies of some countries have sometimes changed quite quickly (including reporting policies for deaths), there is an effect of the vaccination (which is however on a sufficiently slow timescale that it can be reestimated over time), there is the emergence of new variants, etc. Taking into account all of the above, we use the same strategy for the number of deaths forecasting as for cases, i.e. we estimate the trend based solely on the previous deaths observations and predict future numbers by the simple extrapolation. We provide a brief comparison of the performance of our strategy to forecast deaths (individually in the same way as cases) with a few methods from the European forecast Hub for 31 European countries in a similar way to what we did for cases. We identified two forecasting methods IEM Health-CovidProject and RobertWalraven-ES, which were regularly submitting to the European COVID forecast Hub apart from EuroCOVIDhubensemble and EuroCOVIDhub-baseline. The results demonstrate that for 1-week ahead forecast our methodology obtains levels of performance comparable to those of other methods submitted to EU COVID Forecast Hub: on Fig. S8 , S10 and Fig. S9 , S11 one can see that mean absolute error (MAE) and WIS normalized by the respective errors of the EU COVID hub baseline of our method (SDSC ICG) are aligned with the other methods.

In this section, we describe the preliminary data cleaning steps and smoothing for the daily cases and deaths.

When a negative count is reported following a reassessment of previously reported cases or deaths, we substitute the negative value with the estimate x t computed from the daily observations a week before multiplied by the growth factor computed from two weekly observations: during a week before and two weeks before the negative value had occurred, i.e. with

where x t and X t = 6 k=0 x t−k are daily and weekly observations respectively. Next we reduce the counts in the whole previous history by a constant multiplicative factor c to match the cumulative counts obtained by removing this negative quantity from the cumulative counts. That is we compute c = ( t s=0 x t )/( t−1 s=0 x t ) and x t is updated to x t ← c x t . For the case of large and significantly delayed reports or reassessment leading symmetrically to a large positive spike in the data, this is ignored at this stage, and will be addressed by the trend-estimation method.

One of the difficulties with the data sources that we are using is that no distinction is made between a missing report and an existing report stating that no new cases should be reported on a given day: in both cases the database contains a zero. This is of course because, in practice, there is often no distinction made in the reporting protocols. It is however important for our models to be able to distinguish between these two situations. To distinguish missing values from actual zeros, we proceed as follows: if the last observation in the data is zero, we compute an estimate of the Poisson rate by taking the average of observations during the week before zero occurs. If the probability of observing zero new cases given this estimated rate is too low, we consider the zero value to be a missing report one and exclude it from the history for further trend estimation and forecasting. The forecasting starts from the next day from the last initial observation.

Many countries have seasonal patterns in which no data is reported on certain days, typically during the weekend, and where all the new cases that appeared during these days are reported all together on the next reporting day (typically a Monday). We use as a preprocessing a simpler imputation scheme, which consists in reassigning the data declared on that last day uniformly over the previous days of missing report and the following reporting day.

Since the seasonal pattern might evolve over time due to changes in the reporting pattern we propose to apply STL in a piecewise fashion. This allows our method to better adapt to changes in the seasonal component, as the hyperparameters defining the smoothing levels can then change in each separate segment modelled. To estimate the trend locally in the whole period of observations, we split the observed time interval into halfoverlapping intervals of 6 weeks. These time intervals are defined from the end of the time series backwards, so that the time series ends with the last segment of 6 months. First, we apply STL to estimate the trend of the last subinterval [T − L + 1, T ], where L = 42 (corresponding to 4 weeks). For the last subinterval, the STL trend is rescaled to preserve the number of observations in the last L/2 days to obtain the estimate s −1 in [T − L + 1, T ]. Next, the trend estimation proceeds as follows. For the two overlapping subintervals, e.g. consider [T −L+1, T ] and [T −3L/2+1, T −L/2], we take the estimate s −1 and we estimate the trends in [T − 3L/2 + 1, T − L/2]. In order to smoothly join two local trends s −1 ands −2 in the interval [T − L + 1, T − L/2] we use a simple weighted interpolation and obtain the trend in [T − 3L/2 + 1, T ] :

where t 0 = T − L + 1, and σ(τ ) = (1 + exp(a(τ − 1) − b)) −1 with a = 21.1/L, b = 5.46. We additionally apply rescaling to redistribute the possible outliers, removed by trend estimation: we compare the sum of the numbers so far estimated by the trend, e.g.

with the corresponding number of raw daily observations κ −2 = 3L/2−1 i=0

x T −i : if the excess κ −2 − S −2 is positive, it is added to the observations before T − 3L/2 + 1 by rescaling, otherwise the trend is rescaled such that the sum of estimated numbers meets κ −2 . The procedure continues with the next local trend estimate (s) from the corrected data. Note that κ is always computed from raw observations. Local trend estimation with rescaling repeats backward until we reach the beginning of the time interval. As a result, we get a smooth trend, the sum of which is equal to the sum of raw daily observations.

One of the criteria for the inclusion of a country into our main evaluation set is that there are not too many large delayed reports. We assimilate these as outliers and use a simple estimate for each time series of the number of outliers. The corresponding outlier detection scheme is based on the Median Absolute Deviation (MAD), which is defined as MAD = median(|x i − median(x i )|) for daily observations x i . For each country, we detect outliers using a sliding window MAD estimate. More precisely, for each daily value, we compute the MAD in a symmetric window of 22 days around that day. For the MAD to be a consistent estimator for the standard deviation, we multiply it by a constant scale factor of 1.4826, which relates the MAD to the standard deviation for a Gaussian distribution. If the daily value differs from the median in the window by more than 2 standard deviations, we consider it as an outlier. Note that this procedure is only used for the construction of the set of countries in the evaluation set, and not for the trend estimation or forecast.

We define the total coverage of a probabilistic forecast P the sum over all levels considered of the coverage of the intervals [q k , q 2K+2−k ] as defined in Cramer et al. [2021] :

where 1{·} is 1 if the condition is satisfied and zero otherwise. We the define the mean total coverage as

where X t are the weekly total number of cases/deaths.

To be able to estimate the growth rate of the trend, we first compute an independent estimate of the trend using cubic B-splines on the weekly data and next compute the growth rate as the slope of the trend normalized by the trend value. To aggregate AE values for different countries, we use the MAPE (the mean of AE(F t )/X t ) on the weekly forecasts in the evaluation period instead of the MAE, to bring the errors for each country on a comparable scale. Given that the growth rate as a measure of the slope is comparable across countries, we pool the data from all countries to obtain Fig. S4 . The baseline performs best when there are no changes in the number of cases/deaths (i.e., the growth rate of 0 or constant trend). However, our method outperforms the baseline predictor as soon as the growth rate is larger than 3% in absolute value, which shows that the proposed forecast is informative as soon as the trend is not flat. Figure S3 : Histograms for the average WIS (in x-axis) based on 2 week ahead cases forecasts for 31 European countries Figure S4 : Dependence of the error on the relative slope of the trend: Median (solid line) and interquartiles (shaded region) over all countries (this excludes 11 countries with more than 90% of zero daily observations: Marshall Islands, Grenada, Vanuatu, Tanzania, Fiji, Saint Kitts and Nevis, Micronesia, Samoa, the Holy See, Solomon Islands, Laos) of the MAPE of each forecasting algorithm (blue: baseline, orange: proposed forecast) as a function of the growth rate (aka relative slope) of the trend. Since the baseline assumes zero slope it has lower median error when the absolute growth rate is less than 3%, but larger median error otherwise. Figure S7 : Cases in (a) and deaths numbers in (b) in Egypt in the end of 2021: the growth and decrease of death numbers is not preceeded by the similar behavior in the cases. weekly q weekly forecasted Figure S14 : Confidence intervals for Switzerland for two weeks ahead prediction obtained on January 14, 2022. The upper subplot demonstrates the forecast together with the predictive intervals; additional spline smoothing is applied for each confidence level to smooth the quantiles in time. Lower plot shows the estimated quantiles for weekly numbers of the second forecasted week. Table S14 : Two week ahead forecast WIS normalized by the EuroCovidhub baseline WIS. The values, which are highlighted in bold and orange color, correspond to the best performance. The lower part of the table reports for each method the number of countries for which its forecast is best, or in the top 2, 3 or 4 best performing methods.

Applying infectious disease forecasting to public health: a path forward using influenza forecasting examples

Forecasting for COVID-19 has failed

Evaluation of individual and ensemble probabilistic forecasts of COVID-19 mortality in the US. medRxiv

Predicting an epidemic trajectory is difficult

Modeling COVID-19 scenarios for the united states

Real-time mechanistic Bayesian forecasts of COVID-19 mortality. medRxiv

Fast and accurate forecasting of COVID-19 deaths using the SIkJα model

COFFEE: COVID-19 forecasts using fast evaluations and estimation

Deepcovid: An operational deep learningdriven framework for explainable real-time COVID-19 forecasting. medRxiv

Syeda Hira Fatima, Aamer Ikram, and Hajo Zeeb. Evaluating data-driven methods for short-term forecasts of cumulative SARS-CoV2 cases

The turning point and end of an expanding epidemic cannot be precisely forecast

The impact of social distancing on COVID-19 spread: State of Georgia case study