key: cord-0801292-axip5h8f authors: Absar, Nurul; Uddin, Md Nazim; Khandaker, Mayeen Uddin; Ullah, Md Habib title: The efficacy of deep learning based LSTM model in forecasting the outbreak of contagious diseases date: 2021-12-28 journal: Infect Dis Model DOI: 10.1016/j.idm.2021.12.005 sha: 4ad0dbbdac5eee21d20e70e11d562c3d16eb3abd doc_id: 801292 cord_uid: axip5h8f
The coronavirus disease that broke out in 2019 has caused various health issues. According to the WHO, the first positive case in Bangladesh was detected on 7 March 2020; by the time of writing in June 2021, the total confirmed, recovered, and death cases were 826922, 766266, and 13118, respectively. Due to the emergence of COVID-19, Bangladesh is facing a major public health crisis. Unfortunately, the country does not have a comprehensive health policy to address this issue, which makes it hard to predict how the pandemic will affect the population. Machine learning techniques can help to track the spread of the disease. To predict the trend, parameters, and risks, and to support preventive measures in Bangladesh, this work uses a Recurrent Neural Network based deep learning methodology, Long Short-Term Memory (LSTM). We aim to predict the epidemic's progression over a period of more than a year under various scenarios in Bangladesh. We extracted the data for daily confirmed, recovered, and death cases from March 2020 to August 2021. The obtained Root Mean Square Error (RMSE) values for confirmed, recovered, and death cases indicate that our results are more accurate than those of other contemporary techniques. This study indicates that the LSTM model could be used effectively in predicting contagious diseases. The obtained results could help to explain the seriousness of the situation and may help the authorities to take precautionary steps to control it.
In December 2019, several patients in China's Hubei province presented with pneumonia that resembled viral pneumonia, and a number of them quickly progressed to severe illness and death. The International Committee on Taxonomy of Viruses named the causative virus severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). On February 11, 2020, the WHO officially named the disease COVID-19, and on March 11, 2020, it declared the outbreak a global pandemic. It is the third zoonotic virus of the century to become a global pandemic [1]. Seven types of coronaviruses are known to infect humans. The first among them was discovered in 1965 [2]; its symptoms were mild, most commonly those of a common cold. In 2002, people in more than 37 countries were affected by SARS-CoV. In 2012, the Middle East respiratory syndrome coronavirus (MERS-CoV) was discovered, and it is believed to have originated from bats [3]. Although it was largely confined to the region, it was considered a deadlier coronavirus [4]. COVID-19 exhibits the usual symptoms of viral pneumonia, such as fever, difficulty in breathing, and coughing. However, compared to SARS and other coronavirus diseases, COVID-19 is more contagious [5] [6] [7] [8] [9]. SARS-CoV-2 has been spreading globally since it was first identified in China in 2019, and the scientific community has recognized it as a global threat [10]. SARS-CoV-2 is a positive-sense RNA virus that can mutate frequently to adapt to different conditions [11]. Infected people can remain asymptomatic or experience only mild symptoms for about 6.5 days, thereby spreading the virus silently [12]. In Bangladesh, a relatively high proportion of young people were affected, and the recovery rate for elderly people is lower [13]. According to the WHO, cases are rising daily in Asia, the Americas, and Europe [14, 15], and the preventive measures for COVID-19 are maintaining social distancing, avoiding touching the mouth, nose, and face, and washing hands repeatedly.
Bangladesh is one of the most densely populated countries in the world. As a result, after the first cases were detected, the virus spread across the whole country within a very short period. Following the confirmation of the first few cases of COVID-19 in March 2020, the government of Bangladesh declared a countrywide lockdown until 30 May 2020, followed by a zone-based lockdown strategy [20]. At the beginning, only one laboratory, located in Dhaka city, was engaged in COVID-19 testing. However, because of the increasing number of COVID-19 cases, the number of laboratories has increased to 37 throughout the country. On the other hand, the country shows a lower recovery rate than other countries, which causes cases to accumulate and poses a big challenge to the country's healthcare facilities. Consequently, the lockdown has not been withdrawn completely but has been relaxed to allow limited operation of some sectors, including export-oriented industries, shopping malls, and public transport. In the meantime, the government has emphasized increased testing, in the hope that the disease can be mapped more accurately, which may help in taking the necessary preventive actions. However, the present testing capacity is not sufficient for a country of 170 million people [16]. To overcome this uncertain situation, the authorities need to take decisions that balance continuing the lockdown against stabilizing the economy. To take a viable decision, however, the transmission characteristics of the virus and the resulting effects on lives and livelihoods need to be known. Meanwhile, as in other countries, many researchers in Bangladesh are working to understand how the virus spreads and to find ways to bring it under control [17, 18]. In this situation, the transmission dynamics of the virus need to be modeled using publicly available data.
Such an effort may help in developing competent public health policy and economic activity guidelines, and in planning effective control strategies. The literature shows that machine learning (ML) models are effective in predicting not only the number of COVID-19 cases [19, 21] but also the number of deaths caused by COVID-19 [22] [23] [24]. ML-based studies have already been reported from countries such as China [25] [26] [27] [28] [29] [30], Italy [31] [32] [33], France [30, 34, 29], the USA [35, 36], and South Korea [37, 38]. Besides this, the fast-growing COVID-19 literature includes different types of modeling tools, such as compartmental deterministic models [39] [40] [41] [42] [43], network models [41], stochastic models [44], and branching process models [45, 46]. In Bangladesh, however, very few studies have been carried out so far, and no significant research using the latest data has been found. Observations from the few existing studies in the country are as follows:
- Most studies have explored a limited number of machine learning models, so their forecasting capability is restricted and their predictions are less independent.
- Bangladesh needs an adaptable forecasting model for a comprehensive study that pulls things together and supports better planning for future circumstances.
It is uncertain when this pandemic will end. At the time of writing this manuscript, more than 530,271 confirmed cases and 7,966 deaths had been reported in Bangladesh. The authorities are struggling to provide healthcare facilities to COVID-19 patients. Predictive models may therefore help by forecasting future cases [47]. The COVID-19 data show nonlinear characteristics, so linear methods may not be suitable for predicting the COVID-19 dynamics [48].
Similarly, since the data are dynamic in nature, statistical and epidemiological models such as Moving Average (MA), Auto-Regressive (AR), and Auto-Regressive Moving Average (ARMA) may provide questionable results [49] [50] [51]. To overcome such limitations, this study proposes a model for predicting the outbreak of COVID-19 that fits the real data by using a suitable Artificial Intelligence (AI) based technique. Artificial Intelligence and mobile computing have evolved as key players in the success of technology in healthcare systems [52]. In the world of smart devices, data mining is being carried out on an unprecedented scale [52], and countries share data in real time more readily than ever before. Since COVID-19 data form a time series, it is appropriate to use sequential networks to extract the correct patterns [53] [54] [55] [56]. In this context, we have adopted a data-driven, deep learning based LSTM model. The proposed method approximately fits the real data, so policymakers can take decisions that account for how COVID-19 is likely to develop in the coming days. In this work, we predicted the COVID-19 outbreak based on the data available for the last 1 year. The coherence of the input data was first analyzed to find key parameters, such as the number of new cases compared with the data reported on the previous day. Following the selection of the key parameters of the network, several trials were made to predict future infections, the recovery rate, and the death rate with minimum error. Any bias of the algorithm from the design point of view has been reduced by using LSTM networks. It is expected that, with the help of predictions obtained from real-time forecasting tools, healthcare providers and frontline staff in Bangladesh can prepare themselves to handle the COVID-19 crisis.
The methodology of this work, including the datasets and the LSTM model, is presented in Section 2; Section 3 comprises the results and discussion, and Section 4 presents the conclusion and future work.
Data related to the SARS-CoV-2 pandemic were released daily by the Health Division of the Government of the Republic of Bangladesh. Daily confirmed cases, recovered cases, and death cases of COVID-19 in Bangladesh were collected from the government website [57]. For this study, the data were extracted for the period from March 2020 to August 2021. The dataset was prepared in Excel format. It contains 6 columns and 365 rows, as shown in Figure 1. Starting with the date, the 6 columns comprise the confirmed cases, recovered cases, death cases, test cases, and infection rate ((test cases/confirmed cases) * (1/100)). The 365 rows represent 365 days of data (number of confirmed cases, recovered cases, death cases, test cases, and infection rate). The dataset is available in the dd-mm-yyyy time series format, and a wavelet transformation [58] was applied to mitigate the random noise in the dataset. Conventional functions were used to fit the data, with the primary aim of representing and forecasting the trends. Of these data, 70% was used for training the model and 30% for testing and validation.
The graphical view of COVID-19 is shown in Figure 2. In this graph, the green curve indicates the death cases, the blue curve indicates the confirmed cases, and the orange curve indicates the recovered cases. This is the graphical view of COVID-19 in Bangladesh up to the time of writing. We used Long Short-Term Memory networks to predict the trend of COVID-19 in Bangladesh. We preprocessed the data using MinMaxScaler. We added 1 dense layer and 50 hidden layers, and used the Adam optimizer. The loss was calculated using the mean squared error.
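The scaling and the 70/30 split described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' code: the function names and the sample counts are hypothetical, the min-max formula mirrors what a MinMaxScaler computes, and fitting the scaling range on the training part only is an assumption made here to avoid leaking test information.

```python
import numpy as np

def chronological_split(series, train_fraction=0.7):
    """Split a 1-D series into a leading training part and a trailing test part (no shuffling)."""
    series = np.asarray(series, dtype=float)
    cut = int(round(len(series) * train_fraction))
    return series[:cut], series[cut:]

def min_max_scale(values, lo, hi):
    """Map values linearly so that lo -> 0 and hi -> 1."""
    values = np.asarray(values, dtype=float)
    if hi == lo:
        return np.zeros_like(values)
    return (values - lo) / (hi - lo)

def min_max_invert(scaled, lo, hi):
    """Undo min_max_scale, returning values in the original units."""
    return np.asarray(scaled, dtype=float) * (hi - lo) + lo

# Hypothetical daily counts, for illustration only (not the paper's data).
daily_cases = np.array([3, 5, 8, 13, 21, 34, 55, 89, 144, 233], dtype=float)

train, test = chronological_split(daily_cases, 0.7)
lo, hi = float(train.min()), float(train.max())  # assumption: range fitted on training data only
train_scaled = min_max_scale(train, lo, hi)
test_scaled = min_max_scale(test, lo, hi)        # may exceed 1 when cases keep growing
```

Because epidemic counts often keep rising, test values scaled with the training range can fall outside [0, 1]; whether that matters depends on the model and is not discussed in the paper.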
Based on this parameter setting, the LSTM model was trained using 70% of the data, and the remaining 30% was used for validation. Using the trained model, we predicted the confirmed cases, recovered cases, and death cases of COVID-19 in Bangladesh. The system architecture is given in Figure 3. The system was implemented in the Python programming language, using several libraries such as Matlab, Keras, Matplotlib, and NumPy. TensorFlow was used as the back end.
A Time Series (TS) is a series of data points recorded at regular intervals of time. In regression-based predictive modeling, it is preferred that the observations are stationary; simply put, their statistical properties remain unchanged over time. In TS terms, this regularity is referred to as the time series being stationary, which means a constant mean and variance with respect to time. These are significant properties of a TS. On the other hand, stationarity is easily violated by the presence of trend, seasonality, and noise. A time series has a trend if its level rises or falls persistently over time, and it has seasonality if a certain pattern repeats at regular intervals. A nonstationary time series has trend or seasonal effects, so its statistical features may change over time. Due to lockdown, quarantine, social distancing, etc., the TS datasets of COVID-19 may have nonstationary patterns. Therefore, it is essential to know the characteristics of a TS before using it for forecasting. This research adopted the Augmented Dickey-Fuller (ADF) test [59] to determine the nature of the TS. ADF is a unit root test that helps us to check the impact of trends in a given TS. The outcome is interpreted through the p-value of the test. A p-value at or below the chosen threshold (1-5%) suggests rejecting the null hypothesis of a unit root; a p-value above the threshold means that the null hypothesis cannot be rejected. A p-value of >5% therefore indicates a unit root in the input data, hence a nonstationary series.
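To make the unit root test above concrete, the following sketch computes the ADF test statistic by ordinary least squares with NumPy. It is a simplified illustration rather than the procedure used in the paper: the function name, the fixed lag order, and the simulated series are assumptions, and instead of a p-value it compares the statistic with the approximate asymptotic 5% critical value (about -2.86 for a regression with a constant). In practice a library routine such as `adfuller` from statsmodels would normally be used to obtain the p-value.

```python
import numpy as np

def adf_statistic(series, lags=1):
    """t-statistic of gamma in: dy_t = alpha + gamma * y_{t-1} + sum_i delta_i * dy_{t-i} + e_t."""
    y = np.asarray(series, dtype=float)
    dy = np.diff(y)                      # dy[k] = y[k+1] - y[k]
    target = dy[lags:]                   # differences that have `lags` predecessors
    columns = [np.ones_like(target),     # constant term
               y[lags:-1]]               # lagged level y_{t-1}
    for i in range(1, lags + 1):
        columns.append(dy[lags - i:len(dy) - i])  # lagged differences dy_{t-i}
    X = np.column_stack(columns)
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ beta
    dof = len(target) - X.shape[1]
    sigma2 = resid @ resid / dof
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta[1] / np.sqrt(cov[1, 1])

CRITICAL_5PCT = -2.86  # approximate asymptotic 5% critical value (constant, no trend)

# Simulated series for illustration only.
rng = np.random.default_rng(0)
noise = rng.normal(size=400)        # stationary by construction
random_walk = np.cumsum(noise)      # has a unit root by construction

stat_noise = adf_statistic(noise)
stat_walk = adf_statistic(random_walk)
print("stationary series:", stat_noise, "reject unit root:", stat_noise < CRITICAL_5PCT)
print("random walk:", stat_walk, "reject unit root:", stat_walk < CRITICAL_5PCT)
```

A statistic below the critical value corresponds to a small p-value (reject the unit root, treat the series as stationary); a statistic above it corresponds to a large p-value (nonstationary).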
Before describing the model architecture, the internal mechanisms of LSTM networks need to be explained. The Long Short-Term Memory architecture for recurrent neural networks was first proposed in 1997 [60]. The LSTM cell was introduced by Hochreiter and Schmidhuber [78]. Compared to the RNN cell, the LSTM cell has two state components: the hidden state and the internal cell state [61]. The hidden state corresponds to the Short-Term Memory (STM) component, and the cell state corresponds to the Long-Term Memory (LTM) [61, 62]. LSTM is a special type of recurrent neural network architecture that can learn long-term dependencies and overcome the vanishing gradient problem [63]. To this end, the LSTM structure replaces the ordinary units of the recurrent neural network with LSTM cells that have a small internal memory. These LSTM cells are connected as in a standard recurrent neural network, and they retain error information in the cell's internal state. The LSTM structure makes it possible to forget cell state values and determines how much of the new input values will be accepted. As a result, the gradient does not vanish, and learning remains possible even when this process is repeated many times. The network computes the final output value through the hidden layers in the same way as a standard recurrent neural network. In computing the values of the hidden layer, gates are used to control the flow of information. Consequently, a recurrent neural network built from LSTM cells can handle long sequences without gradient loss. In this study, LSTM is used as a model to forecast the COVID-19 confirmed cases, recovered cases, and death cases. The general structure of LSTM consists of four gates (input gate, forget gate, control gate, and output gate), as shown in Figure 4. In Figure 4, the input gate, forget gate, control gate, and output gate are denoted by i_t, f_t, c_t, and o_t, respectively.
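The gating mechanism described above can be sketched as a single LSTM time step in NumPy. This follows the commonly used textbook formulation, which is an assumption here since the paper's exact equations are not reproduced in the text: all four gate pre-activations are computed from the concatenation of the previous output h_{t-1} and the input x_t, and the names `lstm_step`, `W`, and `b` and the random weights are illustrative only.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, output in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step.

    W has shape (4 * hidden, hidden + inputs) and maps [h_prev, x_t] to the
    stacked pre-activations of the input, forget, control, and output gates.
    """
    hidden = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    i_t = sigmoid(z[0:hidden])                  # input gate: how much new information to admit
    f_t = sigmoid(z[hidden:2 * hidden])         # forget gate: how much old memory to keep
    g_t = np.tanh(z[2 * hidden:3 * hidden])     # candidate values for the cell update
    o_t = sigmoid(z[3 * hidden:4 * hidden])     # output gate: how much memory to expose
    c_t = f_t * c_prev + i_t * g_t              # new cell state (long-term memory)
    h_t = o_t * np.tanh(c_t)                    # new hidden state (short-term memory)
    return h_t, c_t

# Illustrative run over a short scaled sequence with random, untrained weights.
rng = np.random.default_rng(42)
hidden_size, input_size = 4, 1
W = rng.normal(scale=0.1, size=(4 * hidden_size, hidden_size + input_size))
b = np.zeros(4 * hidden_size)
h = np.zeros(hidden_size)
c = np.zeros(hidden_size)
for value in [0.1, 0.2, 0.4, 0.8]:
    h, c = lstm_step(np.array([value]), h, c, W, b)
print(h)
```

The additive update of the cell state (f_t * c_prev + i_t * g_t) is what lets error information persist over many steps, which is the property the text attributes to LSTM cells.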
The details of these four gates are explained below. The input gate, expressed by equation (1), decides which information can be transferred from the earlier cell to the current cell. The forget gate, defined by equation (2), determines whether information from the previous memory is retained or discarded. The control gate, defined by equation (3), controls the update of the cell. In these equations, x_t is the input, w is the corresponding weight matrix of the input, b is the corresponding bias of the input, C_{t-1} is the previous block memory, C_t is the current block memory, h_{t-1} is the previous block output, and h_t is the current block output. Furthermore, tanh is the hyperbolic tangent function, which scales values to the range -1 to 1, and σ is the sigmoid activation function, which gives an output between 0 and 1. The LSTM algorithms were implemented in the following way. To evaluate the model performance, the Root Mean Square Error (RMSE) is used as a standard statistical metric of accuracy. It is generally used to evaluate the difference between the predicted value, i.e. the value estimated by the model, and the value observed in the actual environment. If A is the dataset of actual values, B is the dataset of predicted values, and n is the number of pairs, then the root mean square error can be represented as equation (5): $\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(A_i - B_i)^2}$. That is, the error of each pair of elements is calculated and squared, the squared errors are averaged, and the square root of that average is taken. In this study, an appropriate model was selected and trained to forecast three values: the confirmed cases, death cases, and recovered cases of COVID-19 in Bangladesh. In our review, LSTM models trained on accumulated big data, such as traffic data, closely follow the actual curve, demonstrating good simulation and forecasting performance.
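The RMSE described above can be sketched in a few lines of Python. The function name and the sample values are hypothetical and serve only to show the calculation.

```python
import math

def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences of numbers."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if len(actual) == 0:
        raise ValueError("at least one pair of values is required")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical daily counts, for illustration only (not results from the paper).
actual_cases = [100, 120, 130]
predicted_cases = [110, 115, 130]
print(rmse(actual_cases, predicted_cases))
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones; the value is expressed in the same units as the data (here, cases per day).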
Since lags of unknown duration can occur between important events in a time series, LSTM networks are well suited to classification and prediction on time series data. The LSTM program, together with Python and TensorFlow, was used for the prediction of the COVID-19 outbreak. The obtained results are shown in Figs. 5-8. Our study was performed with the most recent data and obtained more precise predictions than the results reported in the literature. First, we imported the test set that was used to make the predictions. To predict the confirmed cases, death cases, and recovered cases, we carried out the following steps after loading the test set:
- Merge the training set and the test set on the 0 axis.
- Set the time step to 60.
- Use MinMaxScaler to transform the new dataset.
- Reshape the dataset as done previously.
The confirmed cases from March 2020 to August 2021 are shown in Figure 5(a), and the prediction of confirmed cases is shown in Figure 5(b). The blue curve in Figure 5(b) indicates the training data, the orange curve indicates the testing data, and the green curve indicates the predicted data for the confirmed cases. In Figure 5(b), only Epoch-80 is shown. Other epochs (Epoch-100, 120, 150) are presented in supplementary Figure S1. From Figure 5(b), it is clear that the official data and the prediction curve are very close to each other. Therefore, we conclude that the model has learned the sequential patterns of COVID-19 positive cases successfully. Data validation for confirmed cases is given in Figure 5(c). The overall graphical view and the prediction results for the recovered cases from March 2020 to August 2021 in Bangladesh are given in Figure 6(a) and Figure 6(b). In Figure 6(b), the blue curve indicates the training data, the orange curve indicates the testing data, and the green curve indicates the predicted data for the recovered cases.
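The test-set preparation steps listed above (concatenation along axis 0, a time step of 60, and reshaping) can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the authors' code: the function names and the synthetic series are hypothetical, the inputs are assumed to be already scaled, and keeping only the last 60 training values before the test period is one common way to give the first test day a full window.

```python
import numpy as np

def make_windows(scaled_series, time_step=60):
    """Build (samples, time_step, 1) inputs and next-day targets from a 1-D scaled series."""
    s = np.asarray(scaled_series, dtype=float)
    if len(s) <= time_step:
        raise ValueError("series must be longer than time_step")
    # Each sample holds the `time_step` values that precede its target day.
    X = np.stack([s[i - time_step:i] for i in range(time_step, len(s))])
    y = s[time_step:]
    return X.reshape(-1, time_step, 1), y

def windows_for_test_period(train_scaled, test_scaled, time_step=60):
    """Join the tail of the training data to the test data so every test day has a full window."""
    combined = np.concatenate([train_scaled[-time_step:], test_scaled], axis=0)
    return make_windows(combined, time_step)

# Synthetic scaled series for illustration only (200 days, 70/30 split).
series = np.linspace(0.0, 1.0, 200)
train_part, test_part = series[:140], series[140:]

X_train, y_train = make_windows(train_part, time_step=60)
X_test, y_test = windows_for_test_period(train_part, test_part, time_step=60)
print(X_train.shape, X_test.shape)
```

The three-dimensional shape (samples, time steps, features) is the input layout that recurrent layers in Keras expect; with a single variable per day the last dimension is 1.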
Data validation for the recovered cases is shown in Figure 6. Official data are shown in blue and orange, and prediction data in green [71]. Varying the number of epochs is a useful way to compare forecasting results. We compared the results of our study across different numbers of epochs, and the best prediction accuracy was obtained at 120 epochs. To measure the effectiveness of the system, the RMSE was used. As an example, see Hridoy et al. [67]. According to the experimental data, the combined CNN-LSTM model agrees well with the actual results when confirming anticipated findings. Performance measure metrics are used to ensure that all defined cases in relation to COVID-19 are confirmed. The method suggests that those who have been afflicted by the disease may recover, and that a suitable vaccination will be created to minimize the death rate. The proposed technique predicts the illness patterns and assists clinicians in conducting early preventive studies. Clinical doctors have made every effort, and it is expected that they will succeed in the near future. The LSTM has provided more convincing RMSE results; a model with a lower RMSE is preferable.
A considerable number of studies are in progress on predicting the COVID-19 outbreak, since lives and livelihoods depend on the outcome of this emergency. We considered the confirmed, recovered, and death cases in Bangladesh to carry out our study of COVID-19 prediction. To predict the transmission trend of the epidemic, a model based on a deep learning approach was applied in this study. The LSTM algorithm, which shows high performance in time series forecasting, was used to develop the proposed model, and the results show good performance for the confirmed cases, recovered cases, and death cases. The obtained results may be used to support efforts to reduce the infection rate and improve the recovery rate.
Since the number of new cases continues to rise worldwide, our model shows potential for forecasting the pandemic curve and for hindering the spread of COVID-19 in other countries. Based on our model simulations, the current COVID-19 pandemic is not expected to end in the next few months. Based on the graphical analysis, the LSTM model shows a good fit for the prediction of COVID-19, because the actual curve and the prediction curve are very close to each other most of the time. Moreover, the calculated RMSE value indicates the effectiveness of the LSTM model in predicting COVID-19. It is therefore expected that the LSTM model could be effectively utilized for the prediction of contagious diseases, so that the respective authorities may be able to decide on the best precautionary steps to prevent further transmission in the country. In our next work, we plan to use several other machine learning and deep learning models to predict the trend of COVID-19 and to identify the best models for COVID-19 prediction and related contemporary issues.
Return of the coronavirus: 2019-nCoV
Seroepidemiologic studies of coronavirus infection in adults and children
Bats are natural reservoirs of SARS-like coronaviruses
Comparative therapeutic efficacy of remdesivir and combination lopinavir, ritonavir, and interferon beta against MERS-CoV
A mathematical model for the novel coronavirus epidemic in Wuhan
A new coronavirus associated with human respiratory disease in China
Isolation and characterization of viruses related to the SARS coronavirus from animals in southern China
Reservoir bats: The invisible enemy
Commentary: Middle East respiratory syndrome coronavirus (MERS-CoV): announcement of the Coronavirus Study Group
Available online at World Health Organization
Severe acute respiratory syndrome coronavirus-like virus in Chinese horseshoe bats
COVID-19 and Bangladesh: Challenges and How to Address Them. Front. Public Health 2020, 8, 154
The Estimate of the Basic Reproduction Number for COVID-19: A Systematic Review and Meta-analysis
A universal model for prediction of COVID-19 pandemic based on machine learning
COVID-19 outbreak prediction with machine learning
Predicting the growth and trend of COVID-19 pandemic using machine learning and cloud computing
Machine learning approach for confirmation of COVID-19 cases: Positive, negative, death and release
Early prediction of mortality risk among severe COVID-19 patients using machine learning
COVID-19 future forecasting using supervised machine learning models
Real-time forecasts of the COVID-19 epidemic in China from
Trend and forecasting of the COVID-19 outbreak in China
Artificial intelligence forecasting of COVID-19 in China
Predicting the cumulative number of cases for the COVID-19 epidemic in China from early data
Epidemic analysis of COVID-19 in China by dynamical modeling
Analysis and forecast of COVID-19 spreading in China, Italy and France
Critical care utilization for the COVID-19 outbreak in Lombardy, Italy: early experience and forecast during an emergency response
Extended SIR prediction of the epidemics trend of COVID-19 in Italy and compared with Hunan, China
Prediction of COVID-19 spreading profiles in South Korea, Italy and Iran by data-driven coding
COVID-19: Forecasting short term hospital needs in France
COVID-19 progression timeline and effectiveness of response-to-spread interventions across the United States
Sentinel event surveillance to estimate total SARS-CoV-2 infections, United States
Changes in risk perception and protective behavior during the first week of the COVID-19 pandemic in the United States
Machine learning model estimating number of COVID-19 infection cases over coming 24 days in every province of South Korea (XGBoost and MultiOutputRegressor). medRxiv
A mathematical model for simulating the phase-based transmissibility of a novel coronavirus
A mathematical model for the novel coronavirus epidemic in Wuhan
Outbreak dynamics of COVID-19 in China and the United States
Mathematical modeling of the spread of the coronavirus disease 2019 (COVID-19) taking into account the undetected infections. The case of China
Data-based analysis, modelling and forecasting of the COVID-19 outbreak
Early dynamics of transmission and control of COVID-19: A mathematical modeling study
Feasibility of controlling COVID-19 outbreaks by isolation of cases and contacts
The challenges of modeling and forecasting the spread of COVID-19
Updating of covariates and choice of time origin in survival analysis: problems with vaguely defined disease states
Strong consistency of least-squares estimation in linear regression models with vague concepts
The "inconvenient truth" about AI in healthcare
Machine learning approach for confirmation of COVID-19 cases: positive, negative, death and release
Multiple-input deep convolutional neural network model for COVID-19 forecasting in China
Prediction for the spread of COVID-19 in India and effectiveness of preventive measures
Neural network based country wise risk prediction of COVID-19
Wavelet transform domain filters: a spatially selective noise filtration technique
Lag order and critical values of the augmented Dickey-Fuller test
Long short-term memory
Long short-term memory
Recurrent neural networks for time series forecasting: current status and future directions
Bidirectional recurrent neural networks
Learning to forget: continual prediction with LSTM
Root mean square error (RMSE) or mean absolute error (MAE)? Arguments against avoiding RMSE in the literature
Pearson correlation coefficient
Forecasting COVID-19 Dynamics and endpoint in Bangladesh: A data-driven approach. Researchgate
CNN-LSTM Model for Verifying Predictions of COVID-19 Cases
Multivariate LSTM-FCNs for time series classification
Neural network based country wise risk prediction of COVID-19
Samir Kumar Bandyopadhyay; Machine learning approach for confirmation of COVID-19 cases: positive, negative, death and release