key: cord-0542519-51235tn2
authors: Said, Ahmed Ben; Erradi, Abdelkarim; Aly, Hussein; Mohamed, Abdelmonem
title: Predicting COVID-19 cases using Bidirectional LSTM on multivariate time series
date: 2020-09-10
journal: nan
DOI: nan
sha: cafb7c5b61fcc30942f5236b3c6eb48d3d389559
doc_id: 542519
cord_uid: 51235tn2

Background: To help policy makers take adequate decisions to stop the spread of the COVID-19 pandemic, accurate forecasting of the disease propagation is of paramount importance.

Materials and Methods: This paper presents a deep learning approach to forecast the cumulative number of COVID-19 cases using a Bidirectional Long Short-Term Memory (Bi-LSTM) network applied to multivariate time series. Unlike other forecasting techniques, our proposed approach first groups countries with similar demographic and socioeconomic aspects and health sector indicators using the K-Means clustering algorithm. The cumulative case data of each cluster of countries, enriched with data on the lockdown measures, are fed to the Bi-LSTM to train the forecasting model.

Results: We validate the effectiveness of the proposed approach by studying the disease outbreak in Qatar. Quantitative evaluation, using multiple evaluation metrics, shows that the proposed technique outperforms state-of-the-art forecasting approaches.

Conclusion: Using data from multiple countries in addition to lockdown measures improves the accuracy of the forecast of daily cumulative COVID-19 cases.

In December 2019, Wuhan, the capital of Central China's Hubei province, with a population of 11 million, witnessed the outbreak of a new coronavirus [1, 2]. The virus propagated in China and then all over the world. On 11 March 2020, with more than 280k cases and more than 4000 deaths worldwide, it was declared a global pandemic by the World Health Organization (WHO) [3].
Within a few months, the number of cases grew exponentially to more than 17 million, with more than 670k deaths, by the end of July 2020.

model. Both approaches are used to forecast the cumulative COVID-19 cases for ten days (1 to 10 April 2020) using the confirmed cases reported in March. In [5], the authors applied an exponential smoothing approach and conducted five rounds of forecasts of cumulative confirmed cases globally, from 1 February to 21 March 2020. The authors emphasized that forecasts related to the virus outbreak must be an integral part of any decision-making process, particularly in high-risk cases. Indeed, this enables authorities to explore various 'what if' scenarios in order to assess the implications of any decision. Ahmar et al. [6] proposed applying a variant of ARIMA, called SutteARIMA, for short-term forecasting of COVID-19 cases in Spain and of the impact on the Spanish Market Index (IBEX). Data from February 12 to April 2 are used to train the model, which then forecasts the data from April 3 to 9. The Mean Absolute Percentage Error (MAPE) is calculated to assess the fitting accuracy. The findings showed that SutteARIMA outperformed the ARIMA model. In [7], the authors compared six prediction techniques to forecast the cumulative cases in ten Brazilian states: ARIMA, cubist regression, random forest, ridge regression, support vector regression, and stacking-ensemble learning. The prediction is conducted for multiple time horizons: one, three and six days ahead. Chimula et al. [8] studied the propagation of the virus in Canada using a Long Short-Term Memory (LSTM) neural network, known to be efficient with sequential data. The results show that the number of cases in Canada grew linearly until March 16, 2020, and exponentially thereafter. The ending point of the outbreak was estimated to be around June. Maleki et al.
[9] applied TP-SMN-AR, a variation of autoregressive models, to forecast the confirmed and recovered number of cases worldwide. This prediction is conducted for the period from April 21 to April 30.

Mathematical models of infectious disease have also been applied in an attempt to obtain better insight into the virus outbreak. Kuniya [10] applied the SEIR model to data from 15 January to 29 February to predict the epidemic peak in Japan. SEIR provides a mathematical formulation to describe the transmission of a disease from one individual to another. These individuals pass through four states: susceptible (S), exposed (E), infectious (I) and recovered (R). The study showed that the basic reproduction number $R_0$, "the average number of secondary infections produced by a typical case of an infection in a population where everyone is susceptible" [11], is 2.6 with a 95% confidence interval of 2.4-2.8. The SEIR model also showed that the peak would occur in early to mid summer 2020. Furthermore, some epidemiological conclusions are drawn: intervention has a great effect on delaying the epidemic peak, and it must be maintained over a long period to ensure an effective reduction of the epidemic size. In [12], the authors applied the SIR model to predict the daily cases in Algeria. SIR takes into account the number of susceptible cases (S), the number of infected cases (I) and the number of recovered cases (R). The model showed that the peak is expected on July 24 at worst and that the disease will disappear between September and November. Roosa et al. [13] used three phenomenological models, the generalized logistic growth model [14], the Richards model [15] and a sub-epidemic wave model [16], for real-time forecasts of the cumulative number of confirmed reported COVID-19 cases in Hubei province, China. These models were previously applied to forecast several infectious diseases including Ebola, SARS, pandemic Influenza, and Dengue.
The authors in [17] studied the effect of weather on the spread of COVID-19. Using the daily cases in 50 US states between January 1 and April 9, in addition to temperature and absolute humidity information, the authors identified a narrow range of absolute humidity in which states are vulnerable. States with absolute humidity between 4 and 6 g/m³ have significant spread, with more than ten thousand cases. The findings are used to determine the Indian regions potentially vulnerable to weather-based spread.

It is widely known that lockdown measures, e.g. restrictions on gatherings, school and workplace closing, public transport shutdown and international travel controls, are needed to halt the spread of the virus. Atalan et al. [18] conducted a data analysis and showed evidence that lockdown can contribute to suppressing the COVID-19 pandemic. Dawoud [19] emphasized the importance of preventive measures, including social distancing and mask usage, for an efficient lockdown exit strategy. Sahoo et al. [20] conducted a data-driven analysis of the effect of lockdown in India. The authors showed that after six weeks of lockdown, the infection rate had fallen to one third of its initial value. Hence, lockdown measures are essential to manage such a pandemic. However, these measures are rarely considered when forecasting daily or cumulative COVID-19 cases. Furthermore, most COVID-19 forecasting methods rely on limited data from a single country. Yet countries with common demographic and socio-economic properties and similar health sector indicators can exhibit similar pandemic patterns.

Our contribution consists of first grouping countries with similar demographic and socio-economic properties and health sector indicators, then using the COVID-19 data from each cluster to build the prediction model. This yields a richer dataset for training. Furthermore, we propose a deep learning based forecasting approach using a Bidirectional LSTM (Bi-LSTM).
This type of neural network not only relies on past data to predict the future, but also enables learning from the future to predict the past. By adopting such a learning framework, Bi-LSTM provides a better understanding of the learning context [21]. Additionally, to train our Bi-LSTM-based model, we use multivariate time series consisting of the daily cumulative number of cases and time series describing the lockdown measures: school closing, workplace closing, restrictions on gatherings, public transport closing and international travel controls. The proposed Bi-LSTM on multivariate time series allows multiple dependent time series to be modelled together, accounting for the correlations across and within the series and capturing variables that change simultaneously over time.

We depict in Fig. 1 the overall approach to predict the daily cumulative cases of COVID-19. First, we collect data describing the demographic and socioeconomic properties and the health sector of the countries of the world. These data are clustered to identify groups of countries that have similar properties. We first apply the Elbow method to determine the optimal number of clusters, which is passed as an input parameter to the K-Means algorithm. Next, given a particular country, we identify its cluster. Multivariate time series are then constructed from the daily cumulative cases of all countries belonging to the cluster, together with time series describing the level of the lockdown measures associated with travel control (border closing), school closing, workplace closing, public transport shutdown and public gathering bans. The multivariate time series is used to train a deep learning Bi-LSTM network to forecast the future cumulative number of cases. It is worth mentioning that this approach can be applied to any country to forecast its daily cumulative COVID-19 cases.

We describe in this section the demographic, socioeconomic and health sector indicators used to cluster countries.
Then we present the approach used to cluster countries having similar properties. This yields a richer dataset for training a cumulative COVID-19 case prediction model per cluster of countries.

The indicator data have been collected from the Department of Economic and Social Affairs of the United Nations and the Organisation for Economic Cooperation and Development. The data include:
• Median age per country.
• Population percentage per 5-year age group, e.g. 5-9 years, 10-14 years etc.
• Country population and density.
• The percentage of urban population.
• Gross Domestic Product (GDP) per capita.
• The number of hospitals per 1000 people.
• Death rate from lung diseases per 100k people for females and males.

To discover countries having similar characteristics, we applied the K-Means clustering algorithm [22, 23] to identify similar members among the data points. Let $X = \{x_1, ..., x_n\}$ be the set of $d$-dimensional points we seek to group into $K$ clusters. In other words, we attempt to assign each $x_i$, $i = 1, ..., n$, to a cluster $c_k$, $k = 1, ..., K$. K-Means partitions the data such that the squared error between the mean of a cluster and the data points that are members of the cluster is as low as possible. Let $m_k$ be the mean of cluster $c_k$. The squared error between a cluster center and its members is defined as:

$$J(c_k) = \sum_{x_i \in c_k} \|x_i - m_k\|^2 \tag{1}$$

K-Means seeks to minimize the sum of the squared errors:

$$J(C) = \sum_{k=1}^{K} \sum_{x_i \in c_k} \|x_i - m_k\|^2 \tag{2}$$

where $C$ is the set of clusters. To minimize Eq. 2, the following algorithm is applied:
1. Randomly initialize K cluster centers, then repeat steps 2 and 3 until the cluster memberships stabilize.
2. Assign each data point to the closest cluster center.
3. Calculate the new cluster centers.

However, K-Means requires the number of clusters to be known. Hence, we applied the Elbow method to determine the optimal number of clusters for which the obtained partition is compact, i.e. has a low $J(C)$. Naturally, adding more clusters would result in an even more compact partition, which may lead to overfitting.
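As an illustration only, the three K-Means steps and the $J(C)$ values inspected by the Elbow method can be sketched in plain Python. This is not the implementation used in the study: the function names, the toy two-dimensional points and the fixed random seed are all assumptions made for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeans(points, k, n_iter=100, seed=0):
    """Plain K-Means; returns (centers, labels, J) where J is the sum of squared errors."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # step 1: pick K random points as initial centers
    labels = None
    for _ in range(n_iter):
        # step 2: assign each point to the closest center
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:  # memberships stable: stop iterating
            break
        labels = new_labels
        # step 3: recompute each center as the mean of its members
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    j_total = sum(squared_distance(p, centers[lab]) for p, lab in zip(points, labels))
    return centers, labels, j_total


def elbow_curve(points, k_max):
    """J(C) for K = 1..k_max; the elbow of this curve suggests the number of clusters."""
    return [kmeans(points, k)[2] for k in range(1, k_max + 1)]


# Hypothetical toy data: three loose groups of 2-D points (not country indicators).
TOY_POINTS = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
    (20.0, 0.0), (20.0, 1.0), (21.0, 0.0),
]

if __name__ == "__main__":
    for k, j in enumerate(elbow_curve(TOY_POINTS, 5), start=1):
        print(f"K={k}  J(C)={j:.2f}")
```

Plotting the returned list against K gives the curve whose elbow is read off to choose the number of clusters. Because the initial centers are random, a practical run would normally repeat the algorithm with several seeds and keep the partition with the lowest $J(C)$.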
Hence, the variation of $J(C)$ with respect to $K$ first exhibits a sharp decrease followed by a slow one. The Elbow method recommends selecting the number of clusters corresponding to the elbow of the curve of $J(C)$ versus $K$.

After K-Means is applied to group countries, we collect the multivariate time series data for the countries of each cluster to train a prediction model using a Bi-LSTM deep neural network. The motivation is to strengthen the prediction accuracy by forcing the network to train not only on past data to predict the future but also on future data to predict the past. A multivariate time series has more than one time-dependent variable, and these variables also depend on each other. Indeed, lockdown measures have been shown to have a significant impact on the evolution of the cumulative number of COVID-19 cases. Our time series consist of:
• Cumulative COVID-19 cases per day. These data are widely available and several APIs provided by government agencies can be queried for this information. We collect data from February 15th to July 31st.

The building block of the network is the LSTM cell depicted in Fig. 3. Given the current value $x_t$, the previous hidden state $h_{t-1}$ and the previous cell state $C_{t-1}$, the cell applies a set of transformations in which $\sigma$ and $\tanh$ are the sigmoid and hyperbolic tangent functions respectively.

The proposed technique is versatile: the forecast can be applied to any country. In our experiment, we use information from the previous 6 days to predict the next day's cumulative cases. We focus on Qatar as a use case, and we assess the forecast performance of the proposed technique against multiple techniques under multiple scenarios. We analyze the performance using four metrics, RMSE, MAE, R² and CRM, where $x_i$, $y_i$, $\bar{x}$ and $\bar{y}$ are the actual reported cumulative cases, the predicted cumulative cases, the average reported cumulative cases and the average predicted cumulative cases respectively.
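For concreteness, the four metrics can be written out as below. The defining equations are not reproduced in this text, so the sketch assumes their commonly used forms: R² is taken as the squared correlation between reported and predicted values (the only common form that uses both averages), and CRM is taken as the relative difference between the predicted and reported totals. The sign convention of CRM and all function names are assumptions, not the authors' definitions.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between reported (x_i) and predicted (y_i) values."""
    n = len(actual)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(actual, predicted)) / n)


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(x - y) for x, y in zip(actual, predicted)) / len(actual)


def r2(actual, predicted):
    """Squared correlation between the two series (assumed form of R^2).

    Undefined (division by zero) if either series is constant.
    """
    n = len(actual)
    x_bar = sum(actual) / n      # average reported cases
    y_bar = sum(predicted) / n   # average predicted cases
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(actual, predicted))
    var_x = sum((x - x_bar) ** 2 for x in actual)
    var_y = sum((y - y_bar) ** 2 for y in predicted)
    return cov ** 2 / (var_x * var_y)


def crm(actual, predicted):
    """Coefficient of residual mass (assumed sign: positive means over-prediction)."""
    return (sum(predicted) - sum(actual)) / sum(actual)


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only (not data from the study).
    reported = [100, 200, 300, 400]
    forecast = [110, 190, 310, 390]
    print(rmse(reported, forecast), mae(reported, forecast))
    print(r2(reported, forecast), crm(reported, forecast))
```

Under these assumed definitions, RMSE and MAE are expressed in numbers of cases, while R² and CRM are unitless, which is why the latter two are easier to compare across countries with very different case counts.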
The best prediction is the one achieving the lowest RMSE and MAE, the highest R² and the CRM value closest to zero. We illustrate in Fig. 2

To further assess the prediction performance, we conduct a quantitative analysis, detailed in Table. Results show that Bi-LSTM with lockdown information achieves the lowest RMSE and MAE, the highest R² score and the CRM value closest to zero, and that the best performance is obtained when the model is trained on all data of the cluster to which Qatar belongs rather than on Qatar data only. In Table 2

Predicting cumulative COVID-19 cases is a challenging task as it depends on several complex and highly interdependent parameters. The disease outbreak depends on the lockdown measures and how fast they are imposed. In our proposed approach, we aimed at incorporating several parameters to achieve accurate

[1] The deadly coronaviruses: the 2003 SARS pandemic and the 2020 novel coronavirus epidemic
[2] Comprehensive review of coronavirus disease 2019
[3] WHO, 2020. Situation report -77 coronavirus disease 2019 (COVID-19). WWW Document
[4] Forecasting the prevalence of COVID-19 outbreak in Egypt using nonlinear autoregressive artificial neural networks
[5] Forecasting the novel coronavirus COVID-19
[6] Short-term forecasting method, a case: Covid-19 and stock market in Spain
[7] Short-term forecasting COVID-19 cumulative confirmed cases: Perspectives for Brazil
[8] Time series forecasting of COVID-19 transmission in Canada using LSTM networks
[9] Time series modelling to forecast the confirmed and recovered cases of COVID-19
[10] Prediction of the Epidemic Peak of Coronavirus Disease in Japan
[11] Modern Epidemiology
[12] Predicting the COVID-19 epidemic in Algeria using the SIR model. medRxiv
[13] Real-time forecasts of the COVID-19 epidemic in China from February 5th to
[14] A generalized-growth model to characterize the early ascending phase of infectious disease outbreaks
[15] A flexible growth function for empirical use
[16] A novel sub-epidemic modeling framework for short-term forecasting epidemic waves
[17] Effect of weather on COVID-19 spread in the US: A prediction model for India in 2020
[18] Is the lockdown important to prevent the COVID-9 pandemic? Effects on psychology, environment and economy-perspective
[19] Emerging from the other end: Key measures for a successful COVID-19 lockdown exit strategy and the potential contribution of pharmacists
[20] A data driven epidemic model to analyse the lockdown effect and predict the course of COVID-19 progress in India
[21] Bidirectional recurrent neural networks
[22] Data clustering: 50 years beyond K-means
[23] A FCM and SURF based algorithm for segmentation of multispectral face images
[24] Learning representation by backpropagating errors
[25] Back-Propagation and Other Differentiation Algorithms. Deep Learning

Figure 1 : Overview of the proposed prediction approach of daily cumulative cases of COVID-