key: cord-0813938-dxx6usvg authors: Painuli, Deepak; Mishra, Divya; Bhardwaj, Suyash; Aggarwal, Mayank title: Forecast and prediction of COVID-19 using machine learning date: 2021-05-21 journal: Data Science for COVID-19 DOI: 10.1016/b978-0-12-824536-1.00027-7 sha: 0044e5404a906a0e00ff53c36e56dc21a4d4c37c doc_id: 813938 cord_uid: dxx6usvg

COVID-19 outbreaks not only affect the lives of people; they also have a negative impact on the economy of the country. On Jan. 30, 2020, the World Health Organization (WHO) declared COVID-19 a health emergency for the entire globe. By Apr. 28, 2020, more than 3 million people had been infected by this virus, and there was no vaccine to prevent infection. The WHO released certain guidelines for safety, but they were only precautionary measures. The use of information technology, with a focus on fields such as data science and machine learning, can help in the fight against this pandemic. It is important to have early warning methods through which one can forecast how much the disease will affect society, on the basis of which the government can take necessary actions without affecting its economy. In this chapter, we include methods for forecasting future cases based on existing data. Machine learning approaches are used and two solutions are discussed: one for predicting the chance of being infected and the other for forecasting the number of positive cases. Several algorithms were tried, and those that gave the best accuracy are covered in the chapter. The chapter discusses autoregressive integrated moving average time series for forecasting confirmed cases for various states of India. Two classifiers, random forest and extra tree classifiers, were selected; both have an accuracy of more than 90%. Of the two, the extra tree classifier has the higher accuracy, 93.62%. These results can be used by different governmental bodies to take corrective measures.
The availability of techniques for forecasting infectious disease can make it easier to fight COVID-19. COVID-19 is not just a name now. It has become a deadly, widespread virus that has affected tens of thousands of people all over the world. It originated in Wuhan City, China, in Dec. 2019. While people were still unaware of the virus, COVID-19 started to spread from one person to another; it has slowly reached almost all countries and has become a pandemic [1–3]. COVID-19 is the short form of coronavirus disease 2019, an illness caused by a novel coronavirus (nCoV) now known as severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), formerly called 2019-nCoV. COVID-19 is not the formal name of the virus itself; the International Committee on Taxonomy of Viruses named the virus SARS-CoV-2 because it is related to the virus that caused the SARS outbreak in 2003. However, this virus had not previously appeared in humans, and this time people were severely infected by it, so to avoid confusion with other viruses, the World Health Organization (WHO) uses the name COVID-19 when communicating with the public [2–5]. During its early stages, COVID-19 was identified only as an outbreak of respiratory illness cases in Wuhan City, Hubei Province, China. On Dec. 31, 2019, China reported this respiratory disease to the WHO. The WHO declared the outbreak a global health emergency on Jan. 30, 2020. According to WHO records, H1N1 was declared a global pandemic in 2009; the next such declaration came on Mar. 11, 2020, when the WHO declared COVID-19 a global pandemic [2]. The name COVID-19 was selected because the WHO did not want to associate the origins of the virus with populations, geography, or animals, which could cause stigma [5]. According to the WHO and other health agencies, coronaviruses are a family of viruses that cause illnesses ranging from the common cold to more severe diseases.
However, nCoV is a new type of virus not previously seen in humans. Countries across the globe quickly identified this respiratory disease as cases of COVID-19 rapidly increased. More and more people have been infected with COVID-19 since the day it was identified in China [1, 3]. Since it was declared a pandemic, the WHO has published guidance regarding this virus for all countries, including how people may identify whether they are infected, how to remain unaffected by the virus, what precautions should be taken, when to go to the hospital, the levels of severity among infected people, and the symptoms of the virus as established by close examination of infected people [2–5]. The WHO continuously shares information about this virus with people in different countries so that the public does not panic. During the early days of COVID-19, the WHO did not suggest avoiding travel. Its main recommendations were to keep a distance from infected persons, wash hands regularly, and cover the mouth when coughing or experiencing a cold. Later, however, travel history became one of the important identifiers of COVID-19, and based on this information, all persons traveling from other countries, especially from infected areas, were screened regularly. All persons coming from other countries were advised to isolate at home for around 14 days, because that was the period within which symptoms of this virus appear, as mentioned by the WHO. If a person showed any symptoms of illness, he or she was taken to the hospital for treatment [2]. The incubation period is the time between when someone catches the virus and when symptoms start to appear. As reported by the WHO, this virus has an incubation period of 2–14 days in the human body [4, 6]. According to the Centers for Disease Control and Prevention (CDC), mild symptoms of the virus start appearing within 5 days and become worse afterward [7].
However, more recent patient data suggested that the incubation period had increased from 14 to 20 or even 28 days as the virus started mutating, with some patients suddenly testing positive after many negative tests. It was also reported that patients tested positive for the virus without having symptoms, owing to a strong immune system. Because symptoms may not appear in a person with a strong immune system, we can still become infected through contact with such an asymptomatic carrier [4, 8, 9]. The coronavirus is transmitted from person to person when they are in direct contact with each other or when the infected person sneezes or coughs. It is a respiratory disease, so it directly affects the respiratory system [2]. According to the CDC, nCoV is highly contagious, which means it spreads easily between persons [6]. It can also spread when a person touches a surface or edible items that have come into contact with an infected person. The WHO released precautionary measures to avoid infection with the COVID-19 virus. They include covering the face with a mask or cloth, avoiding handshakes and instead greeting with a namaste, following social distancing, and enforcing a lockdown [1–5].

Several countries are affected by COVID-19. In some of those most affected, many thousands of people have died from the coronavirus [10–19]: (1) China reported its first positive case in Wuhan, the origin of COVID-19, on Dec. 31, 2019. An update on COVID-19 in China cited 82,816 cases, of which 4632 died, 77,346 recovered, and 838 remained active [11, 12]; (2) Italy reported its first positive case on Jan. 31 [15]; (9) Turkey reported its first case on Mar. 10, 2020. There were 104,912 cases, of which 2600 died and 21,737 recovered [19]. Besides all of these countries, the first case in India was reported on Jan. 30, 2020, in Kerala.
India had 24,942 cases, of which 780 died and 5498 recovered.

According to Arthur Samuel (1959), machine learning (ML) is the field of study that gives computers the ability to learn without being explicitly programmed. Thus, we can define ML as the field of computer science in which machines can be designed that can program themselves [20]. Learning here means using experience or observations from previous work, such as examples or instruction, to look for patterns in data so that the system can make better decisions. The basic aim of ML is to make computers learn automatically, with no human intervention, and adjust their actions accordingly [20, 21]. Fig. 20.1 shows the process of ML. Past data are used to train the model, and this trained model is then tested on new data and used for prediction. The trained ML model's performance is evaluated using a portion of the available past data that was not present during training. This is usually referred to as the validation process. In this process, the ML model is evaluated on a performance measure such as accuracy. Accuracy describes the ML model's performance over unseen data as the ratio of the number of correctly predicted labels to the total number of labels to be predicted. ML algorithms can be divided into types such as supervised, unsupervised, and reinforcement learning: (1) Supervised ML algorithms apply what was previously learned from labeled data to new data to predict future events or labels. In this type of learning, a supervisor (the labels) is present to guide or correct. The learning algorithm first analyzes the known training set and then predicts output values. The output produced by the learning system can be compared with the actual output; if errors are identified, they can be rectified and the model can be modified accordingly [20].
(2) Unsupervised ML algorithms: In this type, there is no supervisor to guide or correct. This type of learning algorithm is used when only unlabeled or unclassified information is available to train the system. The system does not determine the correct output; instead, it explores the data so that it can draw inferences (rules) from datasets and describe hidden structures in unlabeled data [20–22]. (4) Reinforcement ML algorithms are a type of learning method that gives rewards or punishments on the basis of the work performed by the system. If we train the system to perform a certain task and it fails, the system might be punished; if it performs perfectly, it will be rewarded. It typically works on 0 and 1, in which 0 indicates a punishment and 1 indicates a reward. The principle is the same as in training a bird or a dog: if it does exactly as we want, we give it a treat or the food it likes, or we might praise it. This is a reward. If it does not perform the task properly, we might scold it as a punishment [20–22].

ML is used in various fields, including medicine, where it is used to predict disease and forecast its outcome. In medicine, the right diagnosis at the right time is the key to successful treatment. If the treatment has a high error rate, it may cause several deaths. Therefore, researchers have started using artificial intelligence applications for medical treatment. The task is complicated because the researchers have to choose the right tool: it is a matter of life or death [23]. In this respect, ML has achieved a milestone in the field of health care. ML techniques are used to interpret and analyze large datasets and predict their output. These ML tools have been used to identify the symptoms of disease and classify samples into treatment groups. ML also helps hospitals to maintain administrative processes and treat infectious disease [24–26].
ML techniques were previously applied to cancer, pneumonia, diabetes, Parkinson disease, arthritis, neuromuscular disorders, and many more diseases; they give more than 90% accurate results in prediction and forecasting [22, 23]. COVID-19 is a deadly pandemic disease that has cost the lives of many people all over the world. There is no treatment for this virus. ML techniques have been used to predict whether patients are infected by the virus based on symptoms defined by the WHO and CDC [2, 6]. ML is also used to diagnose the disease based on x-ray images. For instance, chest images of patients can be used to detect whether a patient is infected with COVID-19 [25, 26]. Moreover, social distancing can be monitored by ML; with the help of this approach, we can keep ourselves safe from COVID-19 [2, 3, 24].

Various ML techniques are used to predict and forecast future events. Some ML techniques used for prediction are support vector machine, linear regression, logistic regression, naive Bayes, decision trees (random forest and extra tree classifier [ETC]), K-nearest neighbor, and neural networks (multilayer perceptron). Each technique has unique features and is used differently based on the accuracy results. The model with the best accuracy during the model evaluation process is chosen for prediction or forecasting. In the same way, we identified and used the ETC for the symptom-based prediction of COVID-19 and the autoregressive integrated moving average (ARIMA) forecasting model to forecast the number of confirmed cases of COVID-19 in India, because they had the best accuracy among all the classifiers and forecasting methods we evaluated. Fig. 20.2 shows a flowchart of the ML process. It defines how data are collected and preprocessed, and then divided into a training dataset and a test dataset for training and performance evaluation. A symptom-based predictive model was proposed to predict COVID-19 based on symptoms defined by the WHO and CDC [1–3, 6].
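The workflow described above (split the data, train candidate classifiers, keep the one with the best held-out accuracy) can be sketched as follows. This is a minimal illustration rather than the chapter's actual implementation: it assumes scikit-learn, uses a synthetic dataset in place of the symptom data, and the 20% hold-out, the number of trees, and all variable names are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a symptom table (hypothetical; not the chapter's data).
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4, random_state=0
)

# Hold out part of the data for performance evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Candidate classifiers to compare (hyperparameters are assumptions).
candidates = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "extra_trees": ExtraTreesClassifier(n_estimators=100, random_state=0),
}

# Train each candidate and record its accuracy on the unseen test set.
accuracy = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    accuracy[name] = accuracy_score(y_test, model.predict(X_test))

# Keep the classifier with the highest held-out accuracy.
best_name = max(accuracy, key=accuracy.get)
print(accuracy, "-> selected:", best_name)
```

Which classifier wins on synthetic data says nothing about real symptom data; the point is only the selection procedure.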
Because the WHO has not declared a definitive description of symptoms, we defined a model based on some existing symptoms and assessed its predictions by the accuracy it achieved [1, 2]. We created a symptom database in which rules were defined and used as input. These data served as the raw data, and feature selection then took place as part of data preprocessing. The data were divided into training data (80% of the data) and test data (20% of the data), usually known as the train-test split process. This split is generally done in a stratified or random manner so that both groups consist of shuffled data with a similar population distribution, which minimizes bias or skewness in the data. Training data were used to train the ML classifier used in the model, and test data were used to test that classifier in terms of the accuracy achieved over a predefined unseen portion of the dataset [29–39]. In our work, the symptoms and patient's class dataset was defined on the basis of symptoms such as fever, cough, and sneezing, whether the patient had traveled to an infected place, age, and whether the patient had a history of disease that could increase the possibility of being infected by the virus [1, 2]. This dataset was then divided into two sets (training set and testing set) using the train-test split method. The system was trained on the training set, and the accuracy of the ML classifier was then evaluated over the testing set. Finally, the model was used to predict, from new patient data, whether a patient is positive or negative for infection [34, 35].

A correlation matrix, which is a tool for the feature selection process, is a table of correlation coefficients among variables or features. Every cell in the matrix gives the correlation between two variables.
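A correlation matrix of this kind can be computed as in the following sketch. It assumes NumPy; the feature names and the synthetic binary values are hypothetical and are not taken from the chapter's symptom database.

```python
import numpy as np

# Hypothetical feature names and synthetic values, for illustration only.
feature_names = ["fever", "cough", "travel_history", "infected"]
rng = np.random.default_rng(0)
fever = rng.integers(0, 2, size=200)
cough = rng.integers(0, 2, size=200)
travel = rng.integers(0, 2, size=200)
# Make the label depend on fever and travel so some correlations are visible.
infected = ((fever + travel + rng.integers(0, 2, size=200)) >= 2).astype(int)

data = np.column_stack([fever, cough, travel, infected])

# rowvar=False: each column is a variable, each row an observation.
corr = np.corrcoef(data, rowvar=False)

# Rank input features by absolute correlation with the label (last column).
label_corr = corr[-1, :-1]
ranking = sorted(
    zip(feature_names[:-1], label_corr), key=lambda pair: abs(pair[1]), reverse=True
)
for name, r in ranking:
    print(f"{name}: {r:+.2f}")
```

Ranking by the label's row or column mirrors the idea of reading off the features most correlated with the patient's class.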
The correlation matrix is used to summarize a large dataset and also to identify the most highly correlated features (shown as gray entries in the second-to-last row and column in Fig. 20.3) in the given data [35–37, 40]. A correlation coefficient with a magnitude near 1 signifies that the two features are highly correlated; a value near 0 signifies that they are weakly correlated. Generally, correlation is of two types: positive and negative. A positive correlation means that the two features' values increase or decrease together; in contrast, a negative correlation is an inverse relation, so an increase in one feature's value goes with a decrease in the other's. Rows and columns in the correlation matrix are labeled with the feature names. Each cell contains the correlation coefficient calculated between the features named by its row and column.

Fig. 20.4 shows another representation of a correlation matrix, using a heat map. Heat maps are a popular way to visualize the interrelation between two or more variables or features, because it is easier for the human mind to distinguish between attributes' ranks through color coding than by searching for the best value in a list of numerical values, as in Fig. 20.3. One can easily identify and choose the most correlated features using heat map visualization, in which light-colored cells mark the most correlated features and dark-colored cells mark the least correlated features.

Fig. 20.5 shows the prediction performance of two classifiers, random forest and ETC. The ETC gives one wrong prediction (dark gray colored column) out of 24 data points and the random forest classifier (RFC) gives three wrong predictions out of 24 data points. Fig. 20.6 compares the outputs of the two classifiers, ETC and RFC, using line graphs. The figure shows 24 data points, on the basis of which the accuracy of these classifiers is described. The ETC misclassified at point 16, whereas the RFC misclassified at three points: 2, 16, and 24. This means the ETC is more accurate than the RFC on these points (23 of 24 correct, about 95.8%, versus 21 of 24, 87.5%). This comparison is shown for synthetic data; if real data are used, the classifiers are trained on those data and then tested for accuracy. The classifier that gives the best accuracy can be used for prediction.

For forecasting through ML, time series analysis, which is an important part of ML, may be used. It is a univariate type of regression in which the target feature (dependent feature) is forecast using only one input feature (independent feature), which is time [41–43]. It is used to forecast future values, and it has an important role in forecasting the spread of respiratory diseases such as COVID-19. Positive cases are increasing daily, so it is necessary to forecast, based on prior observations, whether the rate of increase will continue. This is helpful for the government, because based on the forecast, it can plan resources to control the spread of disease and act so that the growth rate of the infection decreases without more people being affected [30, 32, 35]. Forecasts depend completely on past trends, so forecast values cannot be guaranteed. However, this forecasted approximation of events may help authorities to plan forthcoming resources to cope with a pandemic situation such as COVID-19. We used the most widely used forecasting method for time series, the ARIMA model. ARIMA is used on time series data to predict future trends [41–50]. ARIMA is a form of univariate regression analysis that predicts future values based on differences between values rather than actual values.
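To make "differences between values rather than actual values" concrete, the following sketch applies first- and second-order differencing to a short series and shows that the original values can be recovered from the differences. It assumes NumPy, and the case counts are hypothetical.

```python
import numpy as np

# Hypothetical cumulative case counts, for illustration only.
cases = np.array([10, 14, 20, 28, 38, 50])

first_diff = np.diff(cases)        # day-to-day change (differencing once)
second_diff = np.diff(cases, n=2)  # change of the change (differencing twice)

# Differencing is invertible given the starting value: cumulative sums undo it.
restored = np.concatenate([[cases[0]], cases[0] + np.cumsum(first_diff)])

print(first_diff.tolist())   # [4, 6, 8, 10, 12]
print(second_diff.tolist())  # [2, 2, 2, 2]
print(restored.tolist())     # [10, 14, 20, 28, 38, 50]
```

In this example the raw series keeps growing, while the twice-differenced series is constant, which is the kind of stable behavior a regression on lagged values can model.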
ARIMA combines three terms: autoregression (AR), integration (I, i.e., differencing), and moving average (MA). It uses a pdq forecasting equation in which the parameters are defined as follows: p is the number of lagged observations (the lag order); d is the degree of differencing (i.e., how many times the raw observations are differenced); and q is the size of the MA window. In the AR model, the value of Y_t depends on its own lagged values. In the MA model, the value of Y_t depends on lagged forecast errors. Thus, the general equation of ARIMA is:

$$Y_t = \alpha + \beta_1 Y_{t-1} + \beta_2 Y_{t-2} + \dots + \beta_p Y_{t-p} + \varepsilon_t + \phi_1 \varepsilon_{t-1} + \phi_2 \varepsilon_{t-2} + \dots + \phi_q \varepsilon_{t-q}$$

where $Y_t$ is the target to be predicted (after differencing d times), $\alpha$ is the constant, $\beta_1 Y_{t-1} + \dots + \beta_p Y_{t-p}$ is the linear combination of lags of Y taken up to p lags, and $\phi_1 \varepsilon_{t-1} + \dots + \phi_q \varepsilon_{t-q}$ is the linear combination of lagged forecast errors taken up to q lags [41–49]. An ARIMA model is conventionally written as ARIMA(p, d, q).

Based on the dataset of [34], we forecast confirmed cases in India and in the top 10 states (with respect to COVID-19 infection cases) using the ARIMA method. As good ML practice, the dataset was preprocessed (only the required features were selected) and split into two parts: a training set (30-01-2020 to 15-04-2020) and a test (validation) set (15-04-2020 to 19-04-2020). The model was then trained on the training set using several pdq configurations of the ARIMA model and validated against the testing set. The results of this validation are listed in Table 20.1, which shows the error rate (root mean square error [RMSE]) of the ARIMA model for different states. Entries in bold show the lowest RMSE for each state, together with the pdq configuration that achieved it. The configuration with the lowest RMSE is treated as the best ARIMA configuration for forecasting future values for a particular state. From the values given in Table 20.1, we have selected two states, Telangana and West Bengal, and the total cases for India. Fig. 20.7 shows the forecast for Telangana state based on data up to Apr. 19, 2020.
In Fig. 20.7, dark gray dots signify the actual training set (past real observations) on which the model was trained; green (gray in printed version) signifies the actual testing set (partial data points from the dataset) against which the forecast values were validated (see overlapping area of light gray dots); and pink (black in printed version) is the future forecast.

The COVID-19 pandemic has affected the entire globe. It had spread to more than 85 countries as of Apr. 2020. Scientists have made every effort to find solutions to it; according to claims by the United States and India, some vaccines have been developed and are being trialed. The use of computers by scientists for early prediction has been widespread. A lot of research is taking place using ML to combat COVID-19. This chapter can be used by different researchers to learn how ML can be employed to forecast not only this situation but also other cases. The chapter specifically used the ARIMA method of time series analysis to forecast the stability and growth of COVID-19. Many countries have seen high totals of deaths owing to COVID-19. It is believed that the performance of the model can be improved, or that the model can give more accurate results, if more data are available. The model gives results on the basis of data compiled from information given by health agencies. Thus, forecasting may not be 100% accurate, but it can be used to guide corrective measures. In future work, further enhancement can be made by combining new factors and algorithms with ARIMA to get more accurate results.

Confirmed cases of Covid 19 Discusses Coronavirus Disease Epidemiological and clinical characteristics of 99 cases of 2019 novel coronavirus pneumonia in Wuhan, China: a descriptive study Coronavirus Disease (COVID 19) Outbreak.
Available from COVID 19 Article Discusses How Long Is the Incubation Period for the Coronavirus Anulekha Ray Discusses ABOUT Coronavirus: India's biggest Concerns are COVID 19 Patients with No Symptoms Teena Thacker Discusses About No Symptoms in 80% of COVID Cases Raises Concern Praveen Duddu Discusses About COVID 19 Coronavirus: Top Ten Most Affected Countries Covid 19 Cases in China Coronavirus Epidemic Keeps Growing, But Spread in China Slows Covid 19 cases in Italy Covid 19 cases in Spain Covid 19 cases in US Covid 19 cases in Germany Covid 19 cases in France Covid 19 cases in Iran Covid 19 cases in Turkey Available from: https:// towardsdatascience.com/machine-learning-an-introduction-23b84d51e6d0 Optimized cuttlefish algorithm for diagnosis of Parkinson's disease C4.5: Programs for Machine Learning Building predictive models for MERS-CoV infections using data mining techniques A Novel transfer learning based approach for pneumonia detection in chest X-ray images Chest x-ray pneumonia prediction using machine learning algorithms Diabetes Determination via Vortex Optimization Algorithm Based Support Vector Machines Disease classification using machine learning algorithms e a comparative study Disease prediction using machine learning over big data Chest diseases diagnosis using artificial neural networks Modified SEIR and AI prediction of the epidemics trend of COVID-19 in China under public health interventions MERS coronavirus neutralizing antibodies in camels Predicting active pulmonary tuberculosis using an artificial neural network Role of machine learning to predict the outbreak of covid-19 in India Predicting medical risks and appreciating uncertainty Predicting the Growth and Trend of COVID-19 Pandemic Using Machine Learning and Cloud Computing COVID-19 Epidemic Analysis Using Machine Learning and Deep Learning Algorithms Prediction and Analysis of Coronavirus Disease Predicting Malarial Outbreak Using Machine Learning and Deep Learning Approach: A 
Review and analysis A comparative study on chronic obstructive pulmonary and pneumonia diseases diagnosis using neural networks and artificial immune system Automatic Detection of Major Lung Diseases Using Chest Radiographs and Classification by Feed-Forward Artificial Neural Network Using artificial intelligence techniques for economic time series prediction An ant-lion optimizer-trained artificial neural network system for chaotic electroencephalogram (EEG) prediction, Multidisciplinary Sudalai Rajkumar discusses about datasets Comparison of ARIMA and Random Forest time series models for prediction of avian influenza H5N1 outbreaks Forcasting incidence of hemorrhagic fever with renal syndrome in China using arima model Ridge regression e some simulations A state space framework for automatic forecasting using exponential smoothing methods Sanjay Sharma discusses about How predictive models can aid in the battle against COVID-19 Machine-Learning Models for Sales Time Series Forecast: MDPI AI-driven tools for coronavirus outbreak: need of active learning and cross-population train/test models on multitudinal/multimodal data