key: cord-0870145-1z86375w authors: Kafieh, Rahele; Saeedizadeh, Narges; Arian, Roya; Amini, Zahra; Serej, Nasim Dadashi; Vaezi, Atefeh; Javanmard, Shaghayegh Haghjooy title: Isfahan and Covid-19: Deep Spatiotemporal Representation date: 2020-10-05 journal: Chaos Solitons Fractals DOI: 10.1016/j.chaos.2020.110339 sha: 9332de6d2a19f671b46d0f5e74a61f8ac6d056ce doc_id: 870145 cord_uid: 1z86375w

The coronavirus COVID-19 is affecting 213 countries and territories around the world. Iran was one of the first countries affected by this virus. Isfahan, as the third most populated province of Iran, experienced a noticeable epidemic. The prediction of epidemic size, peak value, and peak time can help policymakers make correct decisions. In this study, deep learning is selected as a powerful tool for forecasting this epidemic in Isfahan. A combination of effective Social Determinants of Health (SDHs) and COVID-19 occurrence data is used as spatiotemporal input, drawing on time-series information from different locations. Different models are utilized, and the best performance is obtained with a tailored type of long short-term memory (LSTM). This new method incorporates the mutual effect of all classes (confirmed/death/recovered) in the prediction process. The future trajectory of the outbreak in Isfahan is forecasted with the proposed model. The paper demonstrates the positive effect of adding SDHs in pandemic prediction. Furthermore, the effectiveness of different SDHs is discussed, and the most effective terms are introduced. The method shows strong ability in both short- and long-term forecasting of the outbreak. The model shows that in predicting one class (like the number of confirmed cases), the effect of the other accompanying numbers (like death and recovered cases) cannot be ignored. In conclusion, the advantages of this model (particularly the long-term prediction ability) turn it into a reliable tool for helping health decision makers.
The beginning of 2020 was also the beginning of the Covid-19 outbreak, which was declared a pandemic on March 11th [1]. Iran was one of the first countries affected by this virus. The outbreak in Iran started in Qom and spread to Tehran, Isfahan, Guilan, and then to other provinces [2]. Isfahan, the third most populated province of Iran, is 270 km south of Qom and experienced a noticeable epidemic [3]. Four main hospitals were allocated to COVID-19 patients, and public health measures were implemented in alignment with national policies. The first case in Isfahan was hospitalized on Feb. 18th, and today, around three months later, the number of confirmed cases is still increasing. Figure 1 demonstrates the spread of the disease in Isfahan Province.

Due to the high false-negative ratio of the reverse transcription polymerase chain reaction (RT-PCR) test and the limited number of available tests (which led to a single measurement per patient), patients with a negative test cannot reliably be excluded. Accordingly, suspicious cases (with negative results) are also considered as confirmed cases in this study. Another issue to be discussed is the change in the total size of the tested population over time. In the first phase of facing the pandemic in Iran, the government could only manage to perform the PCR test on hospitalized cases; however, in the second phase, after passing through the rough period (around 31st March), more cases, including both hospitalized patients and outpatients, were gradually added to the tested community.

The prediction of epidemic size, curve, and peak time using mathematical models can help policymakers make evidence-informed decisions. To obtain a better prediction, based on the available data from all countries, we selected some influential Social Determinants of Health (SDHs) to be considered in our model. These factors, which are always of concern when talking about health, could be determinants of the outbreak's behavior.
These SDHs include the Gross Domestic Product (GDP), fertility, the growth rate, the country's share of the world population (World Share), and the date of school closure [4-6].

Among the different prediction methods for pandemics, deep learning has proved to be a successful and powerful tool. Pandemics like Covid-19 cannot necessarily be treated as linear systems, but conventional statistical models like Auto Regressive Moving Average (ARMA), Moving Average (MA), and Auto Regressive (AR) methods [7, 8] mostly rely on linear assumptions. In such a pandemic, however, transmission rates can change dynamically due to environmental factors like the behavior of people, government actions, etc. In contrast, the strength of deep learning methods in dealing with such nonlinearities motivates researchers to use them for forecasting time series. Furthermore, deep learning methods are known to have the capacity for automatic feature extraction, which eliminates the need for manual detection and extraction of efficient characteristics. In this paper, the proposed method is based on a novel deep learning method combined with effective SDHs and the original time series of the number of confirmed, death, and recovered cases. A limited number of works have already used this type of model for forecasting COVID-19 [9-17].

The rest of the paper is organized as follows: in section 2, the dataset and the utilized models are fully described. In section 3, optimal model selection, the effectiveness of SDHs in prediction, and the future trajectory of the outbreak in Isfahan are elaborated. Finally, the results are discussed and concluded in the last section.

We call this study a "spatiotemporal analysis" because it uses time-series information from different locations, including selected countries (Global data) and Isfahan province (Local data).
The information comprises occurrences of COVID-19 and SARS plus Social Determinants of Health (SDHs) for each location, and is used to forecast the number of COVID-19 cases in Isfahan (summarized in Table 1). We collected the complete COVID-19 Global data from Johns Hopkins University [2, 26], including all infected countries, and Local data from Isfahan province. The dataset contains the cumulative numbers of confirmed, death, and recovered COVID-19 cases for different dates. Global data from 22 January to 3 May 2020 and Local data from 18 February to 3 May 2020 were used. We also added SDHs for each location as described in Table 1. SARS data is also used to pre-train the network, that is, to provide its initial weights. Although COVID-19 is more contagious than SARS (COVID-19 has a higher R0 [27, 28]), deep learning methods need a sample dataset to mimic the overall trend. Since only a limited number of countries had experienced the full curve (both rising and falling edges) of the Covid-19 epidemic, SARS data is used to provide this missing information to the model. The occurrence data is divided into training and validation subsets with a ratio of 7:3.

In this study, we utilized different machine learning methods ranging from classic models to sophisticated deep learning models, including Random Forest (RF) [29], Extreme Gradient Boosting (XGBoost) [30], Light Gradient Boosting Machine (LGBM) [31], multi-layer perceptron (MLP) [32], convolutional neural networks (CNN) [33], long short-term memory (LSTM) [10, 11], Multi-channel LSTM, and Parallel LSTM. Figure 2 shows the complete block diagram of the proposed model, emphasizing the datasets, data preparation, and comparison of the models [34, 35]. The above-mentioned models are evaluated on Global data using COVID-19 occurrence data (confirmed/death/recovered) in time-series format.
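The data preparation described above (cumulative counts and a 7:3 split) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the text does not say whether the split is chronological or random, so a chronological split is assumed here, and the helper names `cumulative_to_daily` and `chronological_split` as well as the sample numbers are hypothetical.

```python
from typing import List, Tuple


def cumulative_to_daily(cumulative: List[int]) -> List[int]:
    """Convert cumulative counts into daily new counts (first day kept as-is)."""
    daily = [cumulative[0]] if cumulative else []
    for prev, curr in zip(cumulative, cumulative[1:]):
        # Assumption: reporting corrections that make the series dip are clamped to zero.
        daily.append(max(curr - prev, 0))
    return daily


def chronological_split(series: List[int], train_ratio: float = 0.7) -> Tuple[List[int], List[int]]:
    """Split a time series into an earlier training part and a later validation part."""
    cut = int(round(len(series) * train_ratio))
    return series[:cut], series[cut:]


# Made-up cumulative confirmed counts, for illustration only.
cumulative_confirmed = [1, 3, 6, 10, 15, 21, 28, 36, 45, 55]
daily_confirmed = cumulative_to_daily(cumulative_confirmed)
train, validation = chronological_split(daily_confirmed)

print(daily_confirmed)
print(len(train), len(validation))
```

A chronological split keeps the validation days strictly after the training days, which avoids leaking future values into training when the samples are later built from overlapping lag windows.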
We investigated different values of the time interval, termed the "lag", that is considered before the prediction date to feed the occurrence data into the model. To find the optimal lag parameter, the MAPE value is calculated on the validation data for predicting occurrences of confirmed, death, and recovered cases, for lag values ranging from 1 to 20 days. The lowest MAPE (common to confirmed, death, and recovered cases) is obtained with a lag of six days, which is selected as the "optimal lag parameter" [18]. Furthermore, the models are also fed with SDHs.

For comparison of the models, the Mean Absolute Percentage Error (MAPE) metric is used to measure the size of the error in percentage terms relative to the actual values. Metrics like the Mean Absolute Error (MAE) [36], the Mean Square Error (MSE) [12], and the Root Mean Square Error (RMSE) [10, 26, 37] are not normalized and accordingly yield higher values for countries with larger populations. Therefore, we selected MAPE, which offers a normalized measure that is more comparable between populations of different sizes. MAPE is calculated using the equation below:

$$\mathrm{MAPE} = \frac{100\%}{n} \sum_{t=1}^{n} \left| \frac{X_t - \hat{X}_t}{X_t} \right|$$

where $X_t$ is the actual value and $\hat{X}_t$ is the corresponding estimated value for the $t$-th sample of all $n$ available samples. The best model based on the MAPE error on Global data is then selected for training and forecasting on the Local dataset from Isfahan province.

Random Forest (RF) [29], or random decision forests, is a supervised learning algorithm based on ensemble learning that is popular in regression and prediction. Ensemble methods make more accurate predictions than individual models, since they combine the predictions from multiple machine learning algorithms. XGBoost [30] is a scalable end-to-end tree boosting system with a novel sparsity-aware algorithm. Using insights on cache access patterns and data compression to build a scalable tree boosting system, XGBoost scales beyond billions of examples using far fewer resources than existing systems.
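To make the MAPE metric and the lag selection described in this section concrete, the following is a minimal sketch. The function names `mape` and `select_lag`, the sample counts, and the per-lag MAPE values are all hypothetical and chosen only for illustration; they are not results from the study.

```python
from typing import Dict, Sequence


def mape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean Absolute Percentage Error, in percent."""
    if len(actual) == 0 or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for x, x_hat in zip(actual, predicted):
        if x == 0:
            raise ValueError("MAPE is undefined when an actual value is zero")
        total += abs(x - x_hat) / abs(x)
    return 100.0 * total / len(actual)


def select_lag(mape_by_lag: Dict[int, float]) -> int:
    """Return the lag (in days) with the lowest validation MAPE."""
    return min(mape_by_lag, key=lambda lag: mape_by_lag[lag])


# Made-up counts: the same 10% relative error at two population scales.
small_region = mape([100, 200], [110, 180])
large_region = mape([100_000, 200_000], [110_000, 180_000])
print(small_region, large_region)

# Hypothetical validation MAPE values per lag, for illustration only.
best_lag = select_lag({4: 3.1, 6: 1.2, 8: 1.9})
print(best_lag)
```

The two regions differ by a factor of 1000 in absolute error but receive the same MAPE, which is the normalization property that motivates the choice of this metric over MAE, MSE, and RMSE. Note that MAPE is undefined for days with zero actual cases, so such days need separate handling.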
LightGBM [31] has proved to be an effective machine learning method that uses Gradient-based One-Side Sampling (GOSS) to exclude a significant proportion of data instances with small gradients.

The Multilayer Perceptron (MLP) [32], also referred to as a deep neural network (DNN), is a feedforward method consisting of at least three layers of nodes: an input layer, a hidden layer, and an output layer. Except for the input nodes, each node is a neuron that uses a nonlinear activation function. It is trained with a supervised learning technique called backpropagation. We designed a COVID-19 prediction model using a basic MLP, alongside its more advanced descendants. The MLP uses the following variables: bias b, input x, output y, weight w, sum function s, and activation function f. Each neuron in the MLP is fired according to the following formula:

$$s = \sum_{i} w_i x_i + b, \qquad y = f(s)$$

Figure 3 shows the overall structure of a neuron and of an MLP model. Each layer of neurons in the MLP is connected to the next layer, which justifies the name "Dense layer" that is more prevalent in the DNN literature. We used the implementation in the Keras package with Python version 3.7.3 [19].

A convolutional neural network (CNN) is a class of deep neural network and a regularized version of the MLP (a fully connected, or Dense-layer, network). In an MLP, each neuron is connected to all neurons in the next layer. CNNs take advantage of the hierarchical pattern in data and assemble more complex patterns from smaller and simpler patterns [33, 38]. In this work, we use a one-dimensional CNN, which is appropriate for analyzing time sequences [19]. The CNN uses the following variables: bias b, input time series x of length T (T indicates the value of the lag parameter), result of convolution C, kernel w, and a nonlinear activation function f such as the Rectified Linear Unit (ReLU). A general form of applying the convolution for a centered time sample t is given by:

$$C_t = f\left( w * x_{t-k/2 \,:\, t+k/2} + b \right), \quad \forall\, t \in [1, T]$$

where $*$ denotes the dot product and $k$ is the length of the kernel $w$. The univariate output of this step can be considered as another time series C.
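The neuron and the one-dimensional convolution described above can be illustrated with a few lines of plain Python. This is a sketch under stated assumptions rather than the Keras implementation used in the paper: the function names are hypothetical, the weights are fixed by hand instead of being learned, and the convolution slides the kernel over un-padded windows (so the output is shorter than the input) instead of using centered, padded windows.

```python
from typing import Callable, List, Sequence


def relu(value: float) -> float:
    """Rectified Linear Unit activation."""
    return max(0.0, value)


def neuron(x: Sequence[float], w: Sequence[float], b: float,
           f: Callable[[float], float] = relu) -> float:
    """Single neuron: weighted sum of the inputs plus bias, passed through f."""
    s = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    return f(s)


def conv1d(x: Sequence[float], w: Sequence[float], b: float,
           f: Callable[[float], float] = relu) -> List[float]:
    """Slide kernel w over x without padding and apply f to each window."""
    k = len(w)
    return [
        f(sum(w[j] * x[t + j] for j in range(k)) + b)
        for t in range(len(x) - k + 1)
    ]


# One neuron with two inputs (hand-picked weights, illustration only).
y = neuron([1.0, 2.0], [0.5, 0.25], 0.1)
print(y)

# A lag window of six made-up daily counts, smoothed by an averaging kernel.
window = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
C = conv1d(window, w=[0.5, 0.5], b=0.0)
print(C)
```

The output `C` is itself a (shorter) time series, which is what allows further convolutional or dense layers to be stacked on top of it.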
The values of the kernel w are learned: the CNN model updates w during the training step to reach the best performance. Therefore, only a random initial value is set for w, and the final weights are optimized automatically. Figure 4 shows the overall structure of a 1D CNN.

Unlike regression predictive modeling, time series also add the complexity of a sequence dependence among the input variables [20, 21]. The LSTM model is a subcategory of recurrent neural networks (RNNs), a powerful type of neural network designed to handle sequence dependence. Unlike standard feed-forward neural networks, LSTM has feedback connections, so it can process not only single data points but also entire sequences of data. A common LSTM architecture has the following components: a cell (the memory part) and three gates (input gate, output gate, and forget gate). The gates use the logistic sigmoid activation function. The weights of an LSTM network consist of the connections into and out of the LSTM gates and are learned during training.

A Vanilla LSTM [22] is an LSTM model with a single hidden layer of LSTM units and an output layer used to make a prediction. A Stacked LSTM model is composed of multiple hidden LSTM layers. Each LSTM layer requires three-dimensional input, whereas the output of an LSTM layer is two-dimensional by default; to stack layers, the LSTM can be made to output a value for each time step of the input data by setting the return_sequences=True argument on the layer.

In this particular application of LSTM to prediction, we deal with three different occurrences (the daily numbers of confirmed, death, and recovered people). Multivariate LSTM models are a natural match for such problems. Multivariate time series data means data with more than one observation for each time step. There are two main models for multivariate time series data: Multiple Input Series and Multiple Parallel Series.
a) Multiple Input Series: We may consider the number of confirmed cases as the output of the network, but for forecasting purposes, two or three parallel input time series (confirmed/death/recovered in previous days) may be of importance. The input time series are parallel because each series has an observation at the same time steps. We used the Stacked LSTM model described in the previous section for this network.

b) Multiple Parallel Series: An alternative time series problem arises when there are multiple parallel time series and a value must be predicted for each. Here, we consider the number of occurrences in all classes (confirmed/death/recovered) as input data, and we want to predict the value of each of the three time series for the next time step. This might be referred to as multivariate forecasting.

The characteristics of the mentioned models are collected in Table 2. The models were built on the daily numbers of confirmed, death, and recovered COVID-19 cases in Global and Local data. A lag of six days was applied to each dataset. The networks are pre-trained on the daily numbers of confirmed, death, and recovered SARS cases. Each model was tested with many different architectures, and the best performance was achieved with the architectures described in Table 2. Each model's performance was evaluated on the test subset of the data to provide a fair comparison based on the MAPE value.

Section 3.1 elaborates on each model's performance with the different inputs, and the best model is then used to forecast the following days in section 3.3. Furthermore, the effectiveness of SDHs is explored in section 3.2.

Eight different models are chosen as candidates for this application (elaborated in section 2.2). The proposed models are tested using time-series data types. Different time intervals, termed the "lag", can be considered before the prediction date to feed the occurrence data into the model.
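The way a lag turns the three occurrence series into supervised samples can be sketched as below. This is an illustrative assumption about the windowing, not code from the paper: the helper `make_windows`, the `Row` alias, and the counts are hypothetical. Each sample holds `lag` consecutive days of (confirmed, death, recovered) values; the Multiple Parallel Series target is the full next-day triple, while the Multiple Input Series target keeps only the confirmed count.

```python
from typing import List, Tuple

Row = Tuple[float, float, float]  # (confirmed, death, recovered) for one day


def make_windows(rows: List[Row], lag: int = 6) -> Tuple[List[List[Row]], List[Row]]:
    """Build (input window, next-day target) pairs from parallel daily series."""
    inputs: List[List[Row]] = []
    targets: List[Row] = []
    for end in range(lag, len(rows)):
        inputs.append(rows[end - lag:end])  # the `lag` days before the target day
        targets.append(rows[end])           # all three classes on the target day
    return inputs, targets


# Eight days of made-up daily counts, for illustration only.
rows: List[Row] = [
    (10, 1, 0), (12, 1, 1), (15, 2, 1), (19, 2, 3),
    (24, 3, 4), (30, 3, 6), (37, 4, 9), (45, 5, 12),
]

windows, parallel_targets = make_windows(rows, lag=6)

# Multiple Parallel Series: predict all three classes for the next day.
print(parallel_targets)

# Multiple Input Series: same inputs, but only the confirmed count is predicted.
confirmed_targets = [target[0] for target in parallel_targets]
print(confirmed_targets)
```

With eight days and a lag of six, two samples are produced; each input window has shape (6 time steps, 3 features), which together with the sample axis gives the three-dimensional input that an LSTM layer expects.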
For example, with a lag of l = 4, the model predicts the occurrence on 10th March using the input values from the l prior data points (9th, 8th, 7th, and 6th March). A lag of 6 days is selected as the optimal and most effective value for the predictions in the next steps. We selected the MAPE metric to compare the different model/input combinations.

For this evaluation, a set of nine countries is chosen as Global data, covering a variety of situations. China is an obvious choice as the starting point of COVID-19. Iran, Italy, Spain, and the USA are selected due to reports of a high number of confirmed and death cases. Germany and Switzerland show a different trend, with a high number of confirmed cases and a controlled number of deaths. Finally, Korea and Japan are included to represent countries with a high degree of control over the epidemic.

Table 3 summarizes the performance of the different models on Global data. The columns in Table 3 are derived with and without adding SDH features. According to Table 3, the Multiple Parallel Series is the best-performing network in identifying the true magnitude of the pandemic, with a MAPE of 1.12% for Global data. As elaborated in section 2.3, this network uses lag information from confirmed, death, and recovered cases to predict each next occurrence. This result shows that considering the mutual effect of these three occurrences provides better modeling, and that ignoring this dependence leads to lower performance. Comparing the columns of Table 3 shows that SDHs are valuable.

The input features to the proposed models come from three types, elaborated in Table 1. The data types in the first and third rows of Table 1 are converted into time series, and the values with the optimal lag of six samples are fed into the models. One common approach to finding the optimal combination of features is to build a correlation matrix that contains the mutual correlation between the features and the desired output [24] (Figure 5.a).
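The correlation-matrix approach mentioned above can be sketched with a plain Pearson correlation. The choice of the Pearson coefficient is an assumption (the text does not name the correlation measure), the helper names are hypothetical, and the feature values are made up for illustration.

```python
import math
from typing import Dict, Sequence


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long series."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        # Assumption: a constant column has no defined correlation; report 0.
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def correlation_matrix(columns: Dict[str, Sequence[float]]) -> Dict[str, Dict[str, float]]:
    """Pairwise correlations between all named columns (features and output)."""
    return {
        row: {col: pearson(columns[row], columns[col]) for col in columns}
        for row in columns
    }


# Made-up values for two SDH features and the output, one entry per region.
columns = {
    "density": [1.0, 2.0, 3.0, 4.0],
    "gdp": [4.0, 3.0, 2.0, 1.0],
    "confirmed": [2.0, 4.0, 6.0, 8.0],
}
matrix = correlation_matrix(columns)
print(matrix["density"]["confirmed"], matrix["gdp"]["confirmed"])
```

Each entry only measures a pairwise linear relationship between two columns, independent of any trained model.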
However, such an approach ignores the ability of the network to make connections between the features. The search for effective features in this work is inspired by Class Activation Mapping (CAM) [23]. This method is most prevalent in image processing and finds the highly effective image regions by masking different regions during the classification process. Removing effective regions of an image lowers the accuracy, whereas removing less effective regions has no substantial effect. With this idea in mind, we removed each feature in turn from the inputs of a preparatory model and calculated the corresponding MAPE value. The features with the largest effect on the resulting MAPE are selected as the most effective features. Figure 5.b shows the preparatory model's MAPE values in predicting occurrences of confirmed COVID-19 cases when each feature listed as an SDH is eliminated. For each region, the four most effective SDHs are extracted with the CAM method and placed in Figure 5.c. Comparing the results in Figures 5.a and 5.c confirms that the effective SDHs for all countries are similar in both methods. They include the following SDHs: density, fertility, age, school closure, GDP, smoking, and day_from_jan_first. As an advantage over the correlation method, the CAM method is capable of extracting the effective SDHs for each separate region.

To forecast the numbers of confirmed, death, and recovered cases in the Local data (Isfahan, Iran), we ran the model forward until August 11th, 2020. As shown in Table 3, the best performance is obtained with the Multiple Parallel Series method with added SDHs. The simulation-based estimate of the trajectory of cumulative confirmed, death, and recovered values for Local data is illustrated in Figure 6.a. Furthermore, Figure 6.b shows the daily number of confirmed cases in Isfahan. The real values of the cases are shown by the green curve, and the predicted curve is shown in red (after training the network until April 15th).
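The feature-removal procedure described earlier in this section (drop one SDH at a time and observe the change in MAPE) can be sketched as follows. This is a toy illustration under explicit assumptions: `toy_model` is a hypothetical stand-in that simply sums its input features, whereas the study uses a preparatory neural network; the feature values and the "uninformative" column are made up.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Features = Dict[str, Sequence[float]]


def mape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean Absolute Percentage Error, in percent (actual values must be non-zero)."""
    return 100.0 * sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)


def toy_model(features: Features) -> List[float]:
    """Hypothetical stand-in for the trained network: sums the supplied features."""
    n = len(next(iter(features.values())))
    return [sum(column[i] for column in features.values()) for i in range(n)]


def rank_by_removal(features: Features, target: Sequence[float],
                    predict: Callable[[Features], List[float]] = toy_model
                    ) -> List[Tuple[str, float]]:
    """Rank features by how much MAPE increases when each one is removed."""
    baseline = mape(target, predict(features))
    increases = []
    for name in features:
        reduced = {key: value for key, value in features.items() if key != name}
        increases.append((name, mape(target, predict(reduced)) - baseline))
    return sorted(increases, key=lambda item: item[1], reverse=True)


# Made-up feature values, for illustration only.
features: Features = {
    "density": [1, 2, 3, 4],
    "fertility": [10, 10, 10, 10],
    "uninformative": [0, 0, 0, 0],
}
target = [11, 12, 13, 14]

ranking = rank_by_removal(features, target)
print(ranking)
```

In this toy setting, removing "fertility" degrades the prediction most and removing the all-zero column changes nothing, so the ranking separates effective from ineffective features. With a real network, the `predict` callable would wrap a model evaluated (or retrained) on the reduced feature set.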
One important parameter to be discussed regarding our Local data is the total size of the tested population over time. In the first phase of facing the pandemic in Iran, the government could only manage to perform the PCR test on hospitalized cases. However, in the second phase, after passing through the rough period (around 31st March), more cases, including both hospitalized patients and outpatients, were gradually added to the tested community. Here, we demonstrate our model's power for long-term prediction (even when training is stopped in the first phase of the pandemic in the Local data). Interestingly, the prediction (dark blue curve) in Figure 6.c captured the peak of the real data (green curve) around 1.5 months in advance. This comes from the fact that our training mostly uses Global data, which includes both hospitalized patients and outpatients. Therefore, our prediction on Local data effectively assumes that PCR tests in Isfahan were performed on both hospitalized patients and outpatients from the start (which is evident from the difference between the dark blue and green curves). Accordingly, it could predict the current situation correctly. To clarify this issue, the light blue curve in Figure 6.c is sketched, indicating the number of confirmed cases after eliminating the outpatients. The light blue curve starts to deviate from the green curve at the start of the second phase (around 31st March, when more cases were gradually added to the tested community). Figure 6.c carries the message that under- or overestimation of the number of confirmed cases leads to different curves. Therefore, if such incorrect data is used, the forecast may not match.

While the pandemic is still a global concern, it is important to have an evidence-based picture of the epidemic's behavior in order to assess the conditions under which a return to normal life is possible.
Moreover, apart from the known risk factors of COVID-19, some other determinants (SDHs) affect the outbreak and could be managed from a community-based perspective. Here we discuss the results of this study in two categories: first, the effect of SDHs on the epidemic curve and the value of these determinants; second, the behaviour of the outbreak in Isfahan with respect to the numbers of new cases, deaths, and recovered cases.

1. Adding SDH features to the model makes the prediction more precise, based on the MAPE values shown in Table 3. The critical question is which determinants are more effective, and the answer can be found in Figure 5.c: darker squares represent more effective determinants. As is evident in Figure 5, the effective SDHs are similar in most of the studied countries. As COVID-19 is transmissible via person-to-person contact, it was predictable that determinants like density and school closure date are significant factors in the spread of the disease and can affect the prediction. Furthermore, determinants like fertility, as an indicator of population growth, seem to be effective in the pandemic. According to our analysis in Figure 5,

As elaborated above, a limited number of methods based on deep learning have recently been developed for COVID-19 prediction. We provide a detailed comparison of our proposed method with these works. Regarding the geographical concentration, the proposed research is mainly developed to test the performance on Isfahan Province, Iran, but the dataset provided by the Johns Hopkins Center is used for training the algorithm. Many other works, like [9, 14, 15], are also based on the Johns Hopkins dataset or the Oxford University database, but with a geographical focus on limited areas such as India [11], Canada together with Italy and the USA [12], Denmark, Belgium, Germany, France, the United Kingdom, Finland, Switzerland, and Turkey [13], Iran [17, 18], and China [16].
Another important issue is the number of days used as input and as output of the prediction methods. Our method proposes an optimal lag of six days and a long-term forecast (extendable to months). Another aspect of the available methods is the use of complementary parameters like SDHs for better prediction. We found the effective SDHs to include density, fertility, age, school closure, GDP, smoking, and day_from_jan_first. Other works also include features like the closure of the city and travel restrictions [9], lockdown and social distancing [11], and population and population density [14]. Regarding the performance metric, we found that MAPE is more illustrative for performance evaluation. Other works additionally used the mean square error (MSE), mean absolute error (MAE), root MSE (RMSE), normalized RMSE (NRMSE), and symmetric MAPE (SMAPE) [13, 15, 16].

In conclusion, the advantages of this model (particularly the long-term prediction ability) turn it into a reliable tool for helping health decision-makers. Our study predicted that the peak of the COVID-19 outbreak in Isfahan province passed around May 2nd and that, if the governmental control measures and the population's compliance with health policies continue, the epidemic curve will end by July 28th.
Credit author statement: WHO Declares COVID-19 a Pandemic Coronavirus disease 2019 (COVID-19) situation report -31 COVID-19 Pandemic and Comparative Health Policy Learning in Iran Staying at home: mobility effects of covid-19 Covid-19: risk factors for severe disease and death COVID-19 and smoking: A systematic review of the evidence Application of the ARIMA model on the COVID-2019 epidemic dataset Forecasting of covid-19 confirmed cases in different countries with arima models An Improved Method of COVID-19 Case Fitting and Prediction Based on LSTM Predicting COVID-19 incidence through analysis of google trends data in iran: data mining and deep learning pilot study Prediction for the spread of COVID-19 in India and effectiveness of preventive measures Time series forecasting of COVID-19 transmission in Canada using LSTM networks Comperative analysis and forecasting of COVID-19 cases in various European countries with ARIMA, NARNN and LSTM approaches Statistical Explorations and Univariate Timeseries Analysis on COVID-19 Datasets to Understand the Trend of Disease Spreading and Death Machine learning approach for confirmation of covid-19 cases: Positive, negative, death and release Multiple-input deep convolutional neural network model for covid-19 forecasting in china Exponentially Increasing Trend of Infected Patients with COVID-19 in Iran: A Comparison of Neural Network and ARIMA Forecasting Models COVID-19 in Iran: A Deeper Look Into The Future Deep learning for time series classification: a review Artificial Intelligence Application in COVID-19 Diagnosis and Prediction Attention-based recurrent neural network for influenza epidemic prediction Learning to forget: Continual prediction with LSTM Learning deep features for discriminative localization A method for generating realistic correlation matrices COVID-19 incidence using Google Trends and data mining techniques: A pilot study in Iran COVID-19, SARS and MERS: are they closely related? 
The reproductive number of COVID-19 is higher compared to SARS coronavirus Preliminary Flu Outbreak Prediction Using Twitter Posts Classification and Linear Regression With Historical Centers for Disease Control and Prevention Reports: Prediction Framework Study Xgboost: A scalable tree boosting system Lightgbm: A highly efficient gradient boosting decision tree A Review of Epidemic Forecasting Using Artificial Neural Networks DEFSI: Deep learning based epidemic forecasting with synthetic information Modeling and forecasting the COVID-19 pandemic in India Forecasting the daily and cumulative number of cases for the COVID-19 pandemic in India Analyzing Crude Oil Prices under the Impact of COVID-19 by Using LSTARGARCHLSTM Time Series Prediction of COVID-19 by Mutation Rate Analysis using Recurrent Neural Network-based LSTM Model Deep learning