key: cord-0821182-iptmfpsy authors: ArunKumar, K.E.; Kalaga, Dinesh V.; Kumar, Ch. Mohan Sai; Kawaji, Masahiro; Brenza, Timothy M title: Forecasting of COVID-19 using deep layer Recurrent Neural Networks (RNNs) with Gated Recurrent Units (GRUs) and Long Short-Term Memory (LSTM) cells date: 2021-03-14 journal: Chaos Solitons Fractals DOI: 10.1016/j.chaos.2021.110861 sha: d0a7a08f42d470ae466fe2754e0173c5afdf3e7d doc_id: 821182 cord_uid: iptmfpsy

In December 2019, the first case of COVID-19 was reported in Wuhan, Hubei province, China. The World Health Organization declared the contagious coronavirus disease (COVID-19) a global pandemic in March 2020. Over the span of eleven months, it has spread rapidly all over the world, with ∼41.39 M total confirmed cases and ∼1.13 M total fatalities. At present, all of humanity is facing a serious threat, and it is believed that COVID-19 may have been around for quite some time. Therefore, it has become imperative to forecast the global impact of COVID-19 in the near future. The present work proposes state-of-the-art deep learning Recurrent Neural Network (RNN) models to predict the country-wise cumulative confirmed cases, cumulative recovered cases, and cumulative fatalities. RNNs with Gated Recurrent Unit (GRU) and Long Short-Term Memory (LSTM) cells were developed to predict the future trends of COVID-19. We used publicly available data from Johns Hopkins University's COVID-19 database. In this work, we emphasize the importance of various factors, such as age, preventive measures, healthcare facilities, and population density, that play a vital role in the rapid spread of the COVID-19 pandemic. Our forecasts can therefore help countries better prepare to control the pandemic.
Coronavirus disease (COVID-19) is a respiratory illness caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), a strain of coronavirus. It was first identified in December 2019, when a group of patients presented with a novel form of viral pneumonia and a shared history of visiting a wet market in Wuhan, China. On March 11th, 2020, the World Health Organization (WHO) declared the novel coronavirus (2019-nCoV) outbreak a global pandemic [1]. Since then, countries have applied several preventive measures, such as lockdowns, rapid testing, mask wearing, self-quarantine, and social distancing, to stop the spread of the COVID-19 pandemic. Despite these measures, COVID-19 is propagating rapidly and affecting both human health and the global economy, owing to factors such as population density, total population, lifestyle, worldwide travel, and the extent of precautionary measures. As of today (June 11th, 2020), there have been about 7.37 M total confirmed cases and about 0.41 M total fatalities across the world. Most of these 7.37 M confirmed cases are concentrated in the top 10 countries [2]: USA, Brazil, India, Russia, South Africa, Mexico, Peru, Chile, United Kingdom (UK), and Iran. The first major concern of the public and of government authorities is to estimate the daily new cases, recovery rate, total fatalities, and total number of confirmed cases by forecasting from the reported data. The second is how long it will take to stop the spread of COVID-19; answering it would ultimately help governments and healthcare systems prepare in advance for the forecasted number of cases. Recently, machine learning and deep learning techniques have gained immense interest due to the availability of abundant data. These techniques are very helpful for obtaining relationships from the data without defining them a priori [3].
Further, these techniques are also capable of forecasting trends based on time-series data reported over a known period of time. Several researchers have used machine learning and deep learning models to produce short-term forecasts of the COVID-19 pandemic. The following paragraph briefly summarizes the literature on machine learning-based forecasting of COVID-19 cases. Ghoshal et al. [4] used linear regression and multiple linear regression to predict the number of deaths in India for the upcoming six weeks. The authors predicted that the total deaths in India would double if the COVID-19 preventive measures remained unchanged or were not implemented strictly. Parbat and Chakraborty [5] used Support Vector Regression (SVR) to predict the COVID-19 cases in India for 60 days, based on the time-series data reported over a period of ~60 days (1st March 2020 to 30th April 2020). Their model has ~97% accuracy in predicting the cumulative confirmed cases, total recovered cases, and total fatalities, and 87% accuracy in predicting the daily new cases. Maleki et al. [6] employed autoregressive time-series models based on two-piece scale mixture normal distributions to forecast the confirmed and recovered COVID-19 cases. Their model performed well in forecasting the confirmed and recovered COVID-19 cases globally. Benvenuto et al. [7] reported very briefly on the application of the Autoregressive Integrated Moving Average (ARIMA) model to predict future trends in the prevalence and incidence of COVID-19. Fotios et al. [8] used non-seasonal exponential smoothing models to forecast cumulative cases of COVID-19 until 21st March 2020. In another study, Ram Kumar Singh et al. [9] applied Holt-Winters models and the Susceptible, Infected and Recovered (SIR) model to COVID-19 data from India.
They reported that COVID-19 cases in India would peak during the first week of November 2020 and that India would return to normalcy by the last week of February 2021. Similarly, in another study, Shaobo He et al. applied the SEIR (Susceptible-Exposed-Infective-Recovered) model to COVID-19 data from Hubei province, China. They used particle swarm optimization to estimate the parameters of the SEIR model. Moreover, their model considered quarantine and treatment when forecasting the COVID-19 cases [10]. Further details on the SIR and SEIR models can be found elsewhere [9-13]. Ribeiro et al. [14] used ARIMA, Cubist Regression, Random Forest, Ridge Regression, SVR, and stacking-ensemble learning for short-term forecasting of COVID-19 confirmed cases in Brazilian states. Their paper ranks the models from best to worst; the best-performing models were found to be SVR and ARIMA. Kumar et al. [15] used an ARIMA model with a machine learning approach to forecast the trajectories of the COVID-19 pandemic in the top 15 countries in April 2020. Their ARIMA model successfully predicted the COVID-19 trends in countries such as Iran, Italy, Spain, and France. Ardabili et al. [16] used machine learning techniques to predict the COVID-19 outbreak and found that the multi-layer perceptron model and the adaptive network-based fuzzy inference system gave promising results. Their study recommended that individual machine learning models be developed for each country because of the fundamental differences between countries. The following paragraph briefly summarizes the reported work on deep learning-based forecasting of COVID-19 cases. Chimmula and Zhang [17] used a state-of-the-art deep learning model based on a Long Short-Term Memory (LSTM) network to predict the possible ending time of the COVID-19 pandemic in Canada.
Based on their LSTM model, they estimated that the time required for the pandemic to end is about three months. Salgotra et al. [18] developed models based on genetic programming for forecasting the total confirmed cases and total fatalities in the highly affected states of India as well as for India as a whole. The authors reported that their model is less sensitive to the variables and highly reliable in predicting the confirmed cases and deaths. Qi et al. [19] used the generalized additive model to understand the associations of daily average temperature and relative humidity with the daily COVID-19 cases.

Based on the aforementioned reports, all the models proposed in the literature are confined to very few data points, a few countries, a few states within a country, or a very short forecast horizon. Therefore, there is still room to develop country-specific deep learning models to produce a 60-day forecast of the COVID-19 trends in the top 10 countries that are most heavily impacted by COVID-19. Hence, the aim of the present work is to predict the future trends of the cumulative confirmed cases, cumulative recovered cases, and cumulative fatalities of the top 10 countries using RNNs along with Gated Recurrent Unit (GRU) and Long Short-Term Memory (LSTM) cells.

The methods for forecasting time-series data can be broadly classified into two types: machine learning methods and deep learning methods. Deep learning models are superior to statistical machine learning models for forecasting in non-linear applications such as the prediction of weather, stock prices [25], electrocardiogram (ECG) recordings [26], and crude oil prices [27]. Feed Forward Neural Networks (FFNNs) and Recurrent Neural Networks (RNNs) are two widely used types of deep learning techniques, but FFNNs are not suitable for forecasting because they cannot take the trends in the time-series data into account.
In contrast, RNNs are a powerful and robust type of artificial neural network that uses existing time-series data to predict future data over a specified length of time. RNNs are very promising because of their internal memory, which can remember the important features of the sequential input and so allows them to predict the future accurately. Unlike in FFNNs, where information flows strictly in one direction from layer to layer, in RNNs the output from the previous time step is fed into the RNN cell along with the input from the present time step, so that the current state of the model is influenced by its previous states. The following equation explains the function of a single RNN cell:

$$h_t = \tanh\left(W [h_{t-1}, x_t] + b\right)$$

where $W$ is the weight matrix, $b$ is the bias matrix, $x_t$ is the input at the current time step, and $h_t$ and $h_{t-1}$ are the hidden states at the current and previous time steps, respectively. RNNs perform computations very similar to those of an FFNN, using the weights, biases, and activation functions for each element of the input sequence (Fig. 1A). Essentially, a neuron in an RNN has a single hyperbolic tangent function: $h_{t-1}$ and $x_t$ are combined and multiplied by a weight matrix, a bias is added, and the result is passed through the hyperbolic tangent function, which gives back $h_t$. The hyperbolic tangent function ($\tanh$) is used to scale the actual values so that they fall in the range -1 to +1. At each time step, the output of the RNN cell is updated using a sigmoid function, which is a widely used non-linear activation function in artificial neural networks. The following equation is a mathematical representation of the sigmoid function:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

An RNN can recollect only recent information and cannot recollect earlier information. Although RNNs can be trained by back-propagation, it is very difficult to train them on long input sequences because of vanishing gradients. Hence, the main drawbacks of the RNN architecture are its short memory for features and its vanishing and exploding gradients.
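The single-cell recurrence described above can be illustrated with a minimal NumPy sketch. It assumes the concatenated form in which the previous hidden state and the current input share one weight matrix; the names `rnn_step` and `run_rnn`, the small random initialization, and the toy input are hypothetical choices for illustration, not the authors' implementation.

```python
import numpy as np


def sigmoid(x):
    """Logistic sigmoid, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def rnn_step(x_t, h_prev, W, b):
    """One RNN time step: combine the previous hidden state with the
    current input, apply the weights and bias, then squash with tanh."""
    combined = np.concatenate([h_prev, x_t])
    return np.tanh(W @ combined + b)


def run_rnn(xs, hidden_size, seed=0):
    """Run the cell over a sequence xs of shape (time_steps, input_size)
    and return all hidden states, shape (time_steps, hidden_size)."""
    rng = np.random.default_rng(seed)
    input_size = xs.shape[1]
    # Illustrative initialization only; a trained model would learn W and b.
    W = rng.normal(scale=0.1, size=(hidden_size, hidden_size + input_size))
    b = np.zeros(hidden_size)
    h = np.zeros(hidden_size)
    states = []
    for x_t in xs:
        h = rnn_step(x_t, h, W, b)  # the new state depends on the old state
        states.append(h)
    return np.stack(states)


# Toy usage: a 5-step sequence with one feature per step.
states = run_rnn(np.ones((5, 1)), hidden_size=4)
```

Because each state is produced by `tanh`, every entry of `states` lies strictly between -1 and +1, which is the scaling behavior noted in the text.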
Hence, we have combined the RNN with GRU and LSTM cells to overcome these drawbacks. Hochreiter and Schmidhuber [28] proposed the LSTM to overcome the vanishing and exploding gradient problem. The memory of the LSTM cell is stored in the cell state, which carries information from input to output. The general architecture of the LSTM cell is shown in Figure 1B. The LSTM cell comprises the forget, input, update, and output gates. As the names indicate, the forget gate decides what to forget from the previous memory units, the input gate decides what to accept into the neuron, the update gate updates the cell, and the output gate generates the new long-term memory. These four components interact in a particular manner: the cell accepts the long-term memory, the short-term memory, and the input sequence at a given time step, and generates the new long-term memory, the new short-term memory, and the output sequence at that time step. The input gate decides which information must be transferred to the cell; it is mathematically represented as follows:

$$i_t = \sigma\left(W_i [h_{t-1}, x_t] + b_i\right)$$

The information to be neglected from the previous memory is controlled by the forget gate, which is mathematically defined as follows:

$$f_t = \sigma\left(W_f [h_{t-1}, x_t] + b_f\right)$$

The cell state is updated by the update gate, expressed mathematically by the following equations:

$$\tilde{C}_t = \tanh\left(W_C [h_{t-1}, x_t] + b_C\right)$$

$$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$$

The operator '*' represents the element-wise multiplication of the vectors. The hidden state of the previous time step is updated by the output gate, which is also responsible for updating the output, as given by:

$$o_t = \sigma\left(W_o [h_{t-1}, x_t] + b_o\right)$$

$$h_t = o_t * \tanh(C_t)$$

Here $C_t$ and $C_{t-1}$ are the cell states at the current and previous time steps, $\tilde{C}_t$ is the candidate cell state, and each gate has its own weight matrix and bias.

Gated Recurrent Units (GRUs), described by Chung et al. [29], are a simplified version of the LSTM that requires less training time with improved network performance. The operation of a GRU cell is similar to that of an LSTM cell, but the GRU cell uses a single hidden state and merges the forget gate and the input gate into a single update gate.
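As a minimal sketch of a single LSTM time step, the following NumPy code assumes the widely used formulation with forget, input, and output gates plus a tanh candidate state. The function names `lstm_step` and `init_params`, the parameter dictionary layout, and the random initialization are illustrative assumptions and are not taken from the authors' scripts.

```python
import numpy as np


def sigmoid(x):
    """Logistic sigmoid used by the three gates."""
    return 1.0 / (1.0 + np.exp(-x))


def init_params(input_size, hidden_size, seed=0):
    """Create one weight matrix and one bias vector per gate.
    Keys 'i', 'f', 'c', 'o' stand for input, forget, candidate, output."""
    rng = np.random.default_rng(seed)
    params = {}
    for gate in ("i", "f", "c", "o"):
        params[f"W_{gate}"] = rng.normal(
            scale=0.1, size=(hidden_size, hidden_size + input_size)
        )
        params[f"b_{gate}"] = np.zeros(hidden_size)
    return params


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step; returns the new hidden state and cell state."""
    z = np.concatenate([h_prev, x_t])
    i_t = sigmoid(params["W_i"] @ z + params["b_i"])      # what to accept
    f_t = sigmoid(params["W_f"] @ z + params["b_f"])      # what to forget
    c_tilde = np.tanh(params["W_c"] @ z + params["b_c"])  # candidate memory
    c_t = f_t * c_prev + i_t * c_tilde                    # element-wise update
    o_t = sigmoid(params["W_o"] @ z + params["b_o"])      # what to expose
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t


# Toy usage: one scalar input, three hidden units, zero initial memory.
params = init_params(input_size=1, hidden_size=3)
h, c = lstm_step(np.array([0.5]), np.zeros(3), np.zeros(3), params)
```

Keeping the cell state `c_t` separate from the hidden state `h_t` is what lets the cell carry information across many time steps; the additive update `f_t * c_prev + i_t * c_tilde` is the usual explanation for why gradients vanish less readily than in the plain RNN.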
Further, the GRU combines the cell state and hidden state into one state, and hence the total number of gates in the GRU (update and reset gates) is half the total number of gates in the LSTM. Hence, it is a popular and simplified variant of the LSTM cell. The hidden state of the GRU cell is updated by the following equation:

$$h_t = (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t$$

The update gate, which decides how much of the GRU unit gets updated, is computed by the following equation:

$$z_t = \sigma\left(W_z [h_{t-1}, x_t] + b_z\right)$$

The reset gate is computed very similarly to the update gate; it is given by the following equation:

$$r_t = \sigma\left(W_r [h_{t-1}, x_t] + b_r\right)$$

The new remember gate is generated by applying the hyperbolic tangent function to the reset gate, as described by the following function:

$$\tilde{h}_t = \tanh\left(W_h [r_t * h_{t-1}, x_t] + b_h\right)$$

The model performance was measured by the mean squared error (MSE) and the root mean squared error (RMSE):

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left(\hat{y}_i - y_i\right)^2, \qquad \mathrm{RMSE} = \sqrt{\mathrm{MSE}}$$

where $\hat{y}_i$ is the model-predicted value and $y_i$ is the actual value.

This section describes the forecasted trends of cumulative confirmed cases, cumulative recovered cases, and cumulative fatalities.

We have presented the 60-day forecast of the cumulative confirmed cases obtained from the GRU and LSTM models. For the forecast of the USA's confirmed cases, it is evident (Figure 2) that both the LSTM and GRU models performed reasonably well in the validation phase of the model development. Though the predictions of the LSTM and GRU models did not vary much from the test data, the LSTM has lower MSE and RMSE.

However, as we have seen, the performance of the models varied from country to country according to the information embedded in the data. In the case of Russia, the GRU performed well, with a lower RMSE than the LSTM, and the cumulative confirmed cases forecasted by the LSTM were ≈90,000 greater than the GRU forecast. From Figure 2, it is evident that, according to the LSTM model, the maximum number of cumulative confirmed cases in Russia will be ≈1,200,000 by the end of September. According to Figure 2, the number of cumulative confirmed cases in South Africa is increasing at a faster pace, as the trend has followed an exponential growth phase since April 2020. Both the LSTM and GRU models predicted an approaching plateau by the end of September 2020.
The LSTM and GRU models predicted that there will be ≈675,000 and ≈710,000 confirmed cases, respectively.

From the above-mentioned results, it is evident that the forecasted number of cumulative

COVID-19 infection rate is directly proportional to the population density, as the probability of exposure increases with population density [32]. The mortality rate is higher in extremely populated areas because the healthcare facilities are insufficient to meet the demand of increasing new cases [33]. However, in one study, the R² value of the relation between infection rate and population density was found to be moderate (0.67) [34]. This indicates that only a fraction of the infection rate is explained by population density, meaning that other factors contribute to the increase in new daily cases of COVID-19. Mobility restriction is another factor that plays a key role in the spread of the infection. After implementing mobility controls, the UK [35] and China [36] have seen a decline in the association between fatalities and social mobility.

To propose the best model for each country, we have developed and validated GRU and LSTM models. According to the GRU models, the cumulative recovered cases in the USA, Brazil, India, Russia, South Africa, Mexico, Peru, Chile, UK, and Iran by the end of September 2020 will be ≈2,650,000, ≈3,250,000, ≈3,000,000, ≈1,000,000, ≈850,000, ≈810,000, ≈510,000, ≈370,000, ≈1,500, and ≈350,000, respectively. According to the LSTM models of the USA, Brazil, India, Russia, South Africa, Mexico, Peru, Chile, UK, and Iran, the cumulative recovered cases will be ≈2,100,000, ≈3,770,000, ≈2,500,000, ≈1,000,000, ≈700,000, ≈870,000, ≈510,000, ≈380,000, ≈1,500, and ≈470,000, respectively. It is found that two different trends can be observed.

Further, in countries such as the UK and Chile, a negligible number of recovered cases is forecasted. This can be explained by observing Figure 3: during the period between March and July 2020, very few recovered cases were reported.
In contrast, when we examine the data of the USA, it is evident that the reported recovered cases during the same period (March to July 2020) follow an exponential trend. This contradiction between countries implies that the definition of a recovered case and the process of reporting cumulative recovered cases differ from country to country. For example, in the USA [37], 16 states have no definition of a recovered case and do not report or document recovered cases. Another 8 states count the number of hospital discharges as recovered cases. In states such as South Dakota, the recovered case is defined on a day basis, meaning that the infected individual has been free from any symptoms for 3-42 days. According to Johns Hopkins, a COVID-19 patient is considered recovered only if the patient has been symptom free for 10 days since the occurrence of the first symptom and has had no fever for 24 hours without using fever-reducing medication, whereas loss of smell and taste lasting for weeks is not considered in defining recovered cases [38]. However, reverse transcription polymerase chain reaction (RT-PCR) tests conducted on four patients who met the criteria for hospital discharge were positive 10 days after the hospital discharge [39, 40]. In a study, Fotios et al. [8] presented a brief insight into the importance of recovered cases. They reported that recovered cases increase exponentially as the number of days into the pandemic increases. However, their study was restricted to short time-series data. In India, by contrast, few tests are conducted per 1,000 population [41]; therefore, the reported cumulative cases do not represent the actual number of cumulative recovered cases. Despite the above-mentioned results and limitations, it is evident from Figure 3 that the recovery rate is increasing in most of the countries.
The increase in the number of recovered cases depends not only on the definition of a recovered case and the process of recording recovered cases but also on various factors such as age, underlying health conditions, preventive measures, and local weather conditions. The recovery statistics [42] show that 60% of patients in the 20-40 years age group have recovered. The percentage of patients recovered decreased to 56% for the age groups 50-59 and > 60 years. Moreover, susceptibility to COVID-19 and the rate of transmission of the disease varied with the age of the person. In a case study [43], it is reported that the percentage of manifestation of the clinical symptoms in age

Figure 3: 60-day ahead forecast of cumulative recovered cases for top-10 countries based on RNN-GRU and RNN-LSTM models.

The forecast of the cumulative fatalities for the top-10 countries is presented in this section. Figure 4 depicts the forecasted trends of cumulative fatalities, and the corresponding simulation details of both models for each country are given in Table 3. From Figure 4, it is evident that the number of cumulative fatalities in the USA will be between 200,000 and 240,000 according to the forecasts of the GRU- and LSTM-based models, respectively. According to both models, the fatalities are continuously increasing as the number of days into the pandemic increases. For the number of cumulative fatalities in the USA, Peru, Chile, and UK, the GRU-based model performed better than the LSTM model, which is evident from the

almost same for both LSTM and GRU models, and both models show that the cumulative fatalities will reach a plateau by the end of September 2020. Fatalities in Peru, South Africa, and Russia also followed a plateau, but the agreement between the models is not good. A similar deviation between the models was observed for the USA, Brazil, and India, but the fatalities were found to increase continuously as the pandemic progresses.
According to Figure 4, the number of cumulative deaths is either increasing (USA, Brazil, India, Russia, Mexico, Chile, and Iran) or reaching a steady value (South Africa, Peru, and UK) as the pandemic period increases. The varied trends in the graphs shown in Figure 4 could be due to the lesser availability of infrastructure, such as the number of Intensive Care Units (ICUs), the number of hospital beds [44], and the available healthcare workers per number of patients [45], in developing countries such as [47]. The rise in the number of cases is directly related to social distancing practices and the implementation of other preventive measures. This is evident from the reported data of the USA until 3rd August 2020 in Figure 2. The number of cumulative confirmed cases increased drastically from the first reported case in the USA. Similarly, other factors contribute to the higher mortality in some regions of the world and some countries. Such factors include economic status [48], race/ethnicity [49, 50], and housing conditions [51]. There is a positive correlation between poor housing conditions and COVID-19 occurrence and mortality. In the USA, the deaths related to the COVID-19 pandemic were high in communities with a higher percentage of poor housing conditions [52]. There is also strong evidence for a positive correlation between COVID-19 fatalities and ethnicity. In the USA, regions with a higher population of Black and Latino residents had a higher number of COVID-19 cases per 100,000 population [49]. The relation between COVID-19 mortality and health at the county level is more pronounced in non-urban counties. In these counties, people who work on farms and people who have lower incomes are at high risk of COVID-19 mortality [53]. Similarly, a country-level case study found that higher COVID-19 mortality is associated with lower government effectiveness, fewer hospital beds, and a lower number of tests [54].
Based on this discussion, it is important to identify the vulnerable communities of society so that public health officials can develop strategies specific to such communities. Tiwari et al. [55] reported an innovative COVID-19 impact assessment algorithm, based on a random forest machine learning model, to identify and map vulnerable counties. In such a situation, our results not only provide information on the upcoming number of fatalities but also consider the various factors that play a crucial role in increasing the mortality caused by COVID-19.

The present study reported a 60-day forecast of the COVID-19

Though the GRU model is overpredicting, the recovered data in the USA, India, and South Africa will reach a plateau according to both models. Similarly, Brazil will also reach a plateau, but with the LSTM overpredicting compared to the GRU. Recovered cases in Mexico, Iran, Peru, and Russia will continue to increase in the near future, according to the LSTM and GRU models. The recovered cases in the UK and Chile will reach a plateau in the upcoming 60 days, which might not be true because of inconsistency in the reported data. Russia and India will continue to increase according to both the LSTM and GRU models. Mexico, Chile, UK, and Iran are reported to show a plateau, with both models' predictions very consistent with each other. Fatalities in South Africa and Peru will reach a plateau, but with less agreement between the models' predictions. Based on the results of the present work, it is highly recommended to develop deep learning models that are fed all three cumulative data sets (confirmed, recovered, and fatalities) simultaneously to predict the pandemic trends. Further, more data and more accurate data are needed to develop robust and accurate models, to obtain forecasts with a smaller margin of error, and to correlate the forecasts with the factors that contribute to the rapid spread of COVID-19.
Our study also helps countries realize the importance of the various factors that contribute to the spread of COVID-19, thereby helping them better prepare for the upcoming surge.

WHO declares COVID-19 a pandemic
Role of Intelligent Computing in COVID-19 Prognosis: A State-of-the-Art Review
Prediction of the number of deaths in India due to SARS-CoV-2 at 5-6 weeks
A Python based Support Vector Regression Model for prediction of Covid19 cases in India
Time series modelling to forecast the confirmed and recovered cases of COVID-19
Application of the ARIMA model on the COVID-2019 epidemic dataset
Forecasting the novel coronavirus COVID-19
Short-term statistical forecasts of COVID-19 infections in India
SEIR modeling of the COVID-19 and its dynamics
Optimal vaccination strategies for an SEIR model of infectious diseases with logistic growth
Modeling Epidemics With Compartmental Models
A time-dependent SEIR model to analyse the evolution of the SARS-CoV-2 epidemic outbreak in Portugal
Short-term forecasting COVID-19 cumulative confirmed cases: Perspectives for Brazil
Forecasting the dynamics of COVID-19 Pandemic in
Covid-19 outbreak prediction with machine learning. Available at SSRN 3580188
Time series forecasting of covid-19 transmission in canada using lstm networks
Time Series Analysis and Forecast of the COVID-19 Pandemic in India using Genetic Programming
COVID-19 transmission in Mainland China is associated with temperature and humidity: A time-series analysis
Comperative analysis and forecasting of COVID-19 cases in various European countries with ARIMA, NARNN and LSTM approaches
Machine learning approach for confirmation of covid-19 cases: Positive, negative, death and release. medRxiv
Prediction for the spread of COVID-19 in India and effectiveness of preventive measures
COVID-19 Infection Forecasting based on Deep Learning in Iran. medRxiv
Novel Coronavirus (COVID-19) Cases Data
The development of a new statistical technique for relating financial information to stock market returns
RETRACTED: A new statistical PCA-ICA algorithm for location of R-peaks in ECG
Text-based crude oil price forecasting: A deep learning approach
Long short-term memory
Empirical evaluation of gated recurrent neural networks on sequence modeling
Social and behavioral consequences of mask policies during the COVID-19 pandemic
Deep Learning based Safe Social Distancing and Face Mask Detection in Public Areas for COVID-19 Safety Guidelines Adherence
Does density aggravate the COVID-19 pandemic? Early findings and lessons for planners
Disease and healthcare burden of COVID-19 in the United States
Impact of population density on Covid-19 infected and mortality rate in India
The impact of government measures and human mobility trend on COVID-19 related deaths in the UK. Transportation research interdisciplinary perspectives
The effect of human mobility and control measures on the COVID-19 epidemic in China
Covid-19 cases inconsistent and incomplete. Numbers elusive and may mislead on real medical impact of virus
CDC. Coronavirus Questions and Answers
Positive RT-PCR test results in patients recovered from COVID-19
Positive RT-PCR Test Results in 420 Patients Recovered From COVID-19 in Wuhan: An Observational Study
Estimation of Effective Reproduction Numbers for COVID-19 using Real-Time Bayesian Method for India and its States
Effects of age and sex on recovery from COVID-19: Analysis of 5769 Israeli patients
Age-dependent effects in the transmission and control of COVID-19 epidemics
Variation in COVID-19 hospitalizations and deaths across New York City boroughs
COVID-19): situation report, 82. 2020.
Older Population and Aging
Who is more susceptible to Covid-19 infection and mortality in the States? medRxiv
Association of race, ethnicity, and community-level factors with COVID-19 cases and deaths across US counties
Pérez-Stable, COVID-19 and Racial/Ethnic Disparities
Poor sanitation and transmission of COVID-19 in Brazil
Association of poor housing conditions with COVID-19 incidence and mortality across US counties
Social determinants of COVID-19 mortality at the county level
Covid-19 mortality is negatively associated with test number and government effectiveness
Using Machine Learning to Develop a Novel COVID-19 Vulnerability Index (C19VI)

The authors would like to acknowledge the financial aid obtained from the research project the National Science Foundation-Partnerships for International Research and Education

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.