key: cord-0948510-kth6a2t0 authors: Ahmad, G.; Ahmed, F.; Rizwan, M. S.; Muhammad, J.; Fatima, H.; Ikram, A.; Zeeb, H. title: Evaluating Data-Driven Forecasting Methods for Predicting SARS-CoV2 Cases: Evidence From 173 Countries date: 2020-08-04 journal: nan DOI: 10.1101/2020.08.03.20167189 sha: 18e298af3165dc6ac6211c391fd5610cc22bcc5d doc_id: 948510 cord_uid: kth6a2t0 The WHO announced the epidemic of SARS-CoV2 as a public health emergency of international concern on 30th January 2020. To date, it has spread to more than 200 countries, and has been declared as a global pandemic. For appropriate preparedness, containment, and mitigation response, the stakeholders and policymakers require prior guidance on the propagation of SARS-CoV2. This study aims to provide such guidance by forecasting the cumulative COVID-19 cases up to 4 weeks ahead for 173 countries, using four data-driven methodologies; autoregressive integrated moving average (ARIMA), exponential smoothing model (ETS), random walk forecasts (RWF) with and without drift. We also evaluate the accuracy of these forecasts using the Mean Absolute Percentage Error (MAPE). The results show that the ARIMA and ETS methods outperform the other two forecasting methods. Additionally, using these forecasts, we generated heat maps to provide a pictorial representation of the countries at risk of having an increase in cases in the coming 4 weeks for June. Due to limited data availability during the ongoing pandemic, less data-hungry forecasting models like ARIMA and ETS can help in anticipating the future burden of SARS-CoV2 on healthcare systems. Severe Acute Respiratory Syndrome Coronavirus-2 (SARS-CoV2) is a zoonotic virus belonging to the betacoronavirus group of the coronaviridae family which also includes SARS-CoV and MERS. These viruses are known to cause severe acute respiratory diseases in humans. 1 The first confirmed case of SARS-CoV2 emerged in the Wuhan city of the Hubei province, China in December 2019. The WHO announced the epidemic of SARS-CoV2 as a public health emergency of international concern on 30th January 2020 due to the high human to human transmission rate and absence of any treatment or vaccine. 2 To date, it has spread to more than 200 countries and has been declared as a global pandemic. 3 transmits through respiratory droplets and has a binding capacity, through spike proteins, to angiotensin-converting enzyme 2 (ACE2) receptors in the human respiratory system. 3 Clinical symptoms of SARS-CoV2 include cough, fever, shortness of breath, and -in severe cases -pneumonia and multiple organ failure. 3 SARS-CoV2 has an incubation period of 1-14 days and a substantial proportion of the infected persons appear to be asymptomatic. Moreover, these individuals are highly infectious before the onset of symptoms, which makes it a challenge to diagnose, contain and control transmissions. 1, 4 Initially, the mean basic reproduction number ( 0 ) of SARS-CoV2 ranged from 1.4 to 6.49, while some studies highlighted that the 0 stabilized around 2-3 leading to an exponential increase in the number of cases. 3, 5 Globally, many public and private enterprises are exploring treatment options and are in the process of vaccine development. However, vaccines have to go through a robust and usually time-consuming process of clinical trials owing to the paradigms of human safety, health, All rights reserved. No reuse allowed without permission. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted August 4, 2020. . https://doi.org/10.1101/2020.08.03.20167189 doi: medRxiv preprint and bioethics. 6 Some experts have shown concern that a vaccine for SARS-CoV2 may never even be developed and, even if developed, its production could take 1-2 years. 7 Mutations of the virus are also a major concern in this regard since the virus has shown 14 mutations from December 2019 to March 23, 2020. 8 Globally, countries have implemented various interventions in an attempt to limit transmissions and curtail the number of deaths caused by COVID-19. These interventions include social distancing, the closing of public places, academic institutes and schools, travel restrictions, quarantine for the infected, and -in some cases -curfew or lockdown. 9 However, issues in the health security infrastructure, disease surveillance, health systems, and limited availability of health professionals, makes it a challenge for containing and mitigating infectious outbreaks. 9 Considering these concerns, for appropriate preparedness, containment, and mitigation response, the stakeholders and policymakers require prior guidance on the propagation of SARS-CoV2. This study aims to provide such guidance by forecasting the cumulative COVID-19 cases up to 4 weeks ahead, and ascertain their accuracy using the Mean Absolute Percentage Error. The daily level data for the cumulative COVID-19 cases, for 173 countries, and at the aggregated level for the entire world, was acquired from "Our World in Data" -a combined effort of the researchers at the University of Oxford and the Global Change Data Lab -which relies on the European Centre for Disease Prevention and Control (ECDC) for data collection. 10, 11 Our sample period starts from the day of the first reported case for each country till 1 st June 2020. The statistical analysis of this was performed using R 3.6.3. All rights reserved. No reuse allowed without permission. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted August 4, 2020. . https://doi.org/10.1101/2020.08.03.20167189 doi: medRxiv preprint No ethics approval was required for the study as secondary data analysis was performed on publicly available COVID-19 dataset. To forecast the cumulative COVID-19 cases, we use four different forecasting methods. Three of the forecasts are based on the autoregressive integrated moving average process which is usually denoted as ( , , ) where is the order of the autoregressive model, is the degree of differencing, and is the order of the moving average model. The model has been used for forecasting and assessing seasonality in infectious disease outbreaks. [12] [13] [14] [15] The model is a generalization of the autoregressive moving average ( ) model with an ability to address the potential non-stationarity of the variable of interest. To test for stationarity of the cumulative COVID-19 cases, we used the Augmented Dickey-Fuller ( ) and Phillips-Perron ( ) unit root tests. The null hypothesis of these tests is that the variable contains a unit root, hence non-stationary, whereas the alternative is that the time series variable was generated by a stationary process. Table 2 reports the p-values, for the unit root tests, for the 29 countries with the highest cumulative COVID-19 cases and at the aggregated level for the entire world. Table 2 in the Online Appendix provides the results of the unit root test for the 173 countries. The test results suggest non-stationarity which justifies the suitability of the model. All rights reserved. No reuse allowed without permission. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted August 4, 2020. Table 2 : Results of the unit root tests for the 29 countries with the highest cumulative COVID-19 cases and at the aggregated-level of the entire world. Let denote the cumulative COVID-19 cases on the th day for the country being analyzed. Then, ( , , ) process can be given as follows: In equation (1), is the lag operator, 's and 's are the coefficients of the autoregressive and the moving average component of the model, respectively, and is the error term which is assumed to be independently and identically distributed from a normal All rights reserved. No reuse allowed without permission. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted August 4, 2020. A particularly naïve attempt is to fit (0,1,0) which is commonly referred to as random walk and its forecasts are termed as random walk forecasts ( ). We generate the with and without drift for the variable of interest. A more systematic approach for fitting the model follows these steps: 16, 17 1. To ensure stationarity, the differencing order ( ) is selected by using the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test. 2. The lags, and , are determined by using the Akaike Information Criterion corrected for small sample sizes. Aside from the model, we also used the exponential smoothing method ( ) for generating forecasts. is a forecasting method for univariate data which deals with the systematic trend and seasonality, and can be used as an alternative to the models. 18 To evaluate the performance of forecasts, the data is divided into two mutually exclusive sets, the training and test sets. The training set is used to fit the model (without using any data from the test set) whereas the test set is kept for evaluating the forecast accuracy. We use a variant of the time series cross-validation which is a more sophisticated version of the usual training-test set methodology. 18 In this method, there is a series of test sets, and each test set is accompanied by a corresponding training set consisting of observations before the test set. Therefore, a series of training-test sets are constructed, and for each training-test All rights reserved. No reuse allowed without permission. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted August 4, 2020. . https://doi.org/10.1101/2020.08.03.20167189 doi: medRxiv preprint set forecast accuracy is determined. This method is more sophisticated than the usual training-test set methodology because it allows more comparisons of the forecasted and actual data values. The time-series cross-validation method is also referred to as evaluation on a rolling forecasting origin because the origin of the test set is rolled forward in time. In simpler words: 1. An origin for the first test set is selected. 2. Forecasts are determined for the test set using the corresponding training set. 3. The origin is rolled forward by one period generating a new training-test set for which forecasts can be evaluated, and so on. In this study, we take the 45 th day -since the first reported case in the country -as the origin which is then rolled forward one day at a time. The variation in our methodology is that, instead of taking each of the test set as a single observation, we take four different test sets for each training set: 1 week, 2 weeks, 3 weeks, and 4 weeks into the future. This allows us to ascertain the accuracy of the forecasting method up to 4 weeks ahead for each training set. Therefore, we include countries with at least 73 (45+28) observations to ensure that there is at least one available test set for the 4 weeks ahead forecasts for each country included for the forecast evaluation. Suppose the country under consideration has data available for ∈ {1,2, ⋯ , }. The following steps explain the methodology: 1. Use the data available till = 45, and forecast the values of + for ∈ {1,2, ⋯ ,28} i.e. obtain the forecasts for the next 28 days or 4 weeks. All rights reserved. No reuse allowed without permission. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted August 4, 2020. . https://doi.org/10.1101/2020.08.03.20167189 doi: medRxiv preprint 2. Construct 1 week, 2 weeks, 3 weeks, and 4 weeks ahead forecasts using the forecasted values till + 7, + 14, + 21, and + 28, respectively. 3. Increase the data sample by one day i.e. take the data till ( + 1) th day and obtain 28day ahead forecasts, and repeat this process until we reach the end of the data i.e., we reach the th day. There are several methods to determine the accuracy of the forecasted values. We used the Mean Absolute Percentage Error ( ) for this purpose which is defined as follows: In equation (2) All rights reserved. No reuse allowed without permission. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted August 4, 2020. . https://doi.org/10.1101/2020.08.03.20167189 doi: medRxiv preprint Forecasting Evaluation Figure 1 shows the for the forecasted values for the 29 countries with the highest cumulative COVID-19 cases and at the aggregated level for the entire world. Figure 1 of the Online Appendix provides these results for the 173 countries. As expected, the increases as we increase the forecasting horizon from 1 week to 4 weeks ahead. This suggests that shorter-term forecasts are more accurate compared to longer-term forecasts. Overall, forecast evaluation shows that the and forecasts outperform with and without drift. All rights reserved. No reuse allowed without permission. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted August 4, 2020. . https://doi.org/10.1101/2020.08.03.20167189 doi: medRxiv preprint Table 3 shows the average values for each forecasting horizon for the 173 countries and at the aggregated level for the entire world. In line with the observations of Figure 1 Forecasted Scenario Figure 2 shows the forecasted values generated using the and forecasting methodologies for the 29 countries with the highest cumulative COVID-19 cases and at the aggregated level for the entire world. Figure 2 of the Online Appendix provides these forecasts for the 173 countries. All rights reserved. No reuse allowed without permission. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted August 4, 2020. . https://doi.org/10.1101/2020.08.03.20167189 doi: medRxiv preprint We are only using the and forecasts because Subsection 4.1 showed that these forecasts outperform the with and without drift. This figure uses the data from the entire sample period, and the values are forecasted for 4 weeks into the future. Figure 2 shows that the and forecasts perform similar except for a few cases. The figure shows the flattening of forecasts for Belgium, China, Germany, Italy, and Spain. All rights reserved. No reuse allowed without permission. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted August 4, 2020. All rights reserved. No reuse allowed without permission. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted August 4, 2020. . https://doi.org/10.1101/2020.08.03.20167189 doi: medRxiv preprint To generate the heat maps, we selected all those countries which had at least 73 (45+23) observations. Moreover, we only use the forecasts here since its performance is comparable to the forecasts. Overall, the heat maps depict information for 165 countries as some of the countries were not matched with the countries listed in the package used for generating these maps e.g. Palestine, Taiwan, etc. In Figure 3 , during June the forecasted All rights reserved. No reuse allowed without permission. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted August 4, 2020. . https://doi.org/10.1101/2020.08.03.20167189 doi: medRxiv preprint scenarios show an increase in the cumulative COVID-19 cases in South America, South Asia, and Russia. Our results show that the and methods perform well in forecasting cumulative COVID-19 cases. Additionally, using these forecasts, we generated heat maps to provide a cases. Among these, the most used is the Susceptible-Exposed-Infectious-Recovered (SEIR) model. [19] [20] [21] [22] The SEIR model makes assumptions on the population belonging to the different compartments. However, for these assumptions to be reliable, large datasets are required. [19] [20] [21] [22] During a pandemic, not a lot of data is available to reliably run the aforementioned models. In comparison, the data-driven methods considered in this paper are less data-hungry and do not require as much level of detail in the datasets for forecasting purposes. Other advantages of these data-driven techniques include simplicity of estimation which can be performed using open-source statistical software (RStudio). determine their testing capabilities. The SEIR model can capture the propagation of the disease which means that it would be able to predict the true number of cases considering the susceptible and asymptomatic individuals. However, data for asymptomatic cases is All rights reserved. No reuse allowed without permission. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted August 4, 2020. . https://doi.org/10.1101/2020.08.03.20167189 doi: medRxiv preprint largely unavailable for SARS-CoV2 due to limited testing capabilities and a large proportion of asymptomatic cases not being detected; making it challenging to verify the predictions from the compartmental models. On the other hand, data-driven techniques can provide information on the confirmed number of cases with high accuracy. For this study we focused on the cumulative COVID-19 cases, however, the forecasting methods can be used for other indicators such as cumulative deaths, cumulative recovery, etc. The forecasts of the confirmed number of cases are sensitive to the number of tests performed, however, since the confirmed number of cases is an indicator of the anticipated burden on the healthcare system and professionals, the projections by the data-driven techniques might be insightful for the policymakers. This is important because the availability of health service resources during COVID-19 is an issue faced by many countries. One study carried out in different states of the United States predicted an excess demand of 64,175 total beds and 17,380 ICU beds at the peak of COVID-19. 23 Even with lockdown measures enacted, the peak demand for healthcare services, during the COVID-19 pandemic, is likely to exceed capacity irrespective to the current capability of the healthcare infrastructure and resources. Globally, our forecasting results reveal that the number of cases will increase in most of the countries. Forecasted scenarios for June depict that some of the hardest-hit regions include the United States, the United Kingdom, Canada, some parts of Europe, South America, and South Asia as shown in Figure 2 of the Online Appendix. Middle eastern countries including Iran, the United Arab Emirates, and Saudi Arabia also show an increase in the cumulative COVID-19 cases. All rights reserved. No reuse allowed without permission. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted August 4, 2020. . https://doi.org/10.1101/2020.08.03.20167189 doi: medRxiv preprint The future of the global pandemic greatly depends on the implementation of mitigation measures. Strict measures such as worldwide lockdowns, travel restrictions, school closures, non-essential business closures, social distancing, isolation of infected populations as well as heightened hygiene measures can potentially reduce the risk of spread. 23 The forecasted results, Figure 2 Although, the novel coronavirus pandemic is associated with many uncertainties, we believe that forecasting and predictive modeling can be an effective tool in targeted intervention strategies. Model-based predictions can help policymakers to make the right decisions in a timely way. 24 All rights reserved. No reuse allowed without permission. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted August 4, 2020. . https://doi.org/10.1101/2020.08.03.20167189 doi: medRxiv preprint Our results indicate that the and model perform well in forecasting the cumulative COVID-19 cases. We ran the model for 173 countries with varying health system resources and infrastructure, and at the aggregated level for the entire world. The results suggest that the and model can be used for SARS-CoV2 forecasting in different countries and regions with a high level of accuracy. Since these models only rely on past observations of the cumulative 0COVID-19 cases, it can also be used for forecasting provincial, district, or state level SARS-CoV2 cases and other COVID-19 indicators. All rights reserved. No reuse allowed without permission. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted August 4, 2020. . https://doi.org/10.1101/2020.08.03.20167189 doi: medRxiv preprint The proximal origin of SARS-CoV-2 Coronavirus disease 2019: What we know Discovery of a novel coronavirus associated with the recent pneumonia outbreak in humans and its potential bat origin Report from the american society for microbiology covid-19 international summit A pneumonia outbreak associated with a new coronavirus of probable bat origin COVID-19, an emerging coronavirus infection: advances and prospects in designing and developing vaccines, immunotherapeutics, and therapeutics Developing Vaccines for SARS-CoV-2 and Future Epidemics and Pandemics: Applying Lessons from Past Outbreaks Genotyping coronavirus SARS-CoV-2: methods and implications No reuse allowed without permission. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity The effect of control strategies to reduce social mixing on outcomes of the COVID-19 epidemic in Wuhan, China: a modelling study Coronavirus Pandemic (COVID-19) -Statistics and Research -Our World in Data Evaluating the performance of infectious disease forecasts: A comparison of climate-driven and seasonal dengue forecasts for Mexico Near-term forecasts of influenza-like illness: An evaluation of autoregressive time series approaches Comparison of ARIMA and Random Forest time series models for prediction of avian influenza H5N1 outbreaks Time series analysis of influenza incidence in Chinese provinces from Characteristic-based clustering for time series data stm: R Package for Structural All rights reserved. No reuse allowed without permission. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted Deep Learning for Time Series Forecasting Predict the Future with MLPs, CNNs and LSTMs in Python Global dynamics of a SEIR model with varying total population size Statistical inference in a stochastic epidemic SEIR model with control intervention: Ebola as a case study Global dynamics of an SEIR epidemic model with saturating contact rate Global stability for the SEIR model in epidemiology -19 health service utilization forecasting & Murray, C. J. Forecasting COVID-19 impact on hospital bed-days, ICU-days, ventilator-days and deaths by US state in the next 4 months How will country-based mitigation measures influence the course of the COVID-19 epidemic?