key: cord-1052508-ohgaiv1n
authors: Dharani, N. P.; Bojja, Polaiah; Raja Kumari, Pamula
title: Evaluation of Performance of an LR and SVR models to predict COVID-19 Pandemic
date: 2021-02-16
journal: Mater Today Proc
DOI: 10.1016/j.matpr.2021.02.166
sha: ea8e11fa4c759f3488e9eed0f5dbf639e3e29666
doc_id: 1052508
cord_uid: ohgaiv1n

In December 2019, coronavirus disease emerged and, through its swift spread, came to affect the lives of millions of people worldwide. To support medical experts and doctors with the overwhelming challenge of predicting total cases in India, a machine-learning approach was developed. This article describes the possibility of predicting the total, active, death and cured COVID-19 cases in India up to 25th June 2020 by applying linear regression (LR) and support vector regression (SVR). Managing the outbreak is extremely difficult because cases are growing exponentially from day to day, while the number of doctors and beds available to treat infected individuals, and the time in which to do so, are limited. Hence, it is essential to develop a machine-learning-based computerized prediction model. The work in this article is based on publicly available data downloaded from KAGGLE and aims to estimate the spread of the disease over a short period. We calculated the RMSE, R² and MAE of the LR and SVR models and found that the RMSE of linear regression is lower than that of SVR. Therefore, the LR model can help doctors forecast cases for the next few days.

mainly through small droplets of an infected person's saliva ejected from the mouth, or through discharge from the nose. Heightened alertness is needed worldwide, particularly in densely populated countries. A person suffering from coronavirus disease has the following symptoms: cough, fever and difficulty breathing.

COVID-19 spreads very rapidly from one person to another through small respiratory droplets. Coronavirus disease is transmitted in four stages. Stage 1: imported cases, with no local origin, in people arriving from affected countries. Stage 2: local spread from people who travelled from an affected country. Stage 3: community transmission, where the contacts are untraceable. Stage 4: the most severe stage [2]. Because the available diagnostic tools are inadequate, researchers and academicians need to develop novel frameworks for early diagnosis and prediction of the disease. According to expert doctors, a person infected with coronavirus disease has to remain under treatment for one to two weeks and should be isolated from healthy persons.

Various COVID-19 datasets are available in different formats on websites such as IEEE DataPort, KAGGLE, GitHub and Johns Hopkins [3]. For this article, we gathered daily updated COVID-19 data in XLSX format, downloaded from the KAGGLE website and covering 30th January 2020 to 21st May 2020. This data is used to compute the RMSE, R² and MAE of linear regression and support vector machine models that predict the active, total, death and cured cases in India for the next few days [4]. Machine-learning tools are used to investigate various types of data, and among these tools regression analysis and support vector machine models play a vital role. Nowadays most data is analyzed with machine-learning algorithms. For cost-effective health management, researchers need to provide accurate and early prediction of the spread of the coronavirus. The spread is exponential in nature, and it is therefore hard to manage such a huge number of infected persons with too few experts in the hospitals. Hence, a suitable machine-learning or deep-learning model needs to be developed that represents the available information about the disease system, describes the nature of the disease and forecasts the total confirmed cases in India for the next few days. Several authors have published articles on predicting the disease, proposing their own mathematical models that incorporate quarantine and other government measures to reduce disease transmission [5]. With the advent of computational tools and software, such models have become much easier to develop and apply to problems related to infectious disease. Machine-learning algorithms play a vital part in this epidemic analysis and in predicting coronavirus disease. ML is a branch of artificial intelligence; here it is used to train and test the system with the data gathered from KAGGLE.

R. Sujath et al. [6] applied LR, MLP and VAR models to Kaggle coronavirus data to predict the epidemiological situation. Natrayan L. and Senthil Kumar et al. [7] employed two algorithms, Jaya and a multi-output regressor, to train and test models for binary classification. Shreshth Tuli et al. [8] developed an ML model to forecast coronavirus cases worldwide. The Generalized Inverse Weibull distribution was reported as the best model for real-time prediction of the behavior of the COVID-19 epidemic. Ramjeet Singh Yadav [9] presented six regression models, namely exponential and 2nd-, 3rd-, 4th-, 5th- and 6th-degree polynomials, fitted to COVID-19 data, computed the RMSE of all six models and concluded that the 6th-degree polynomial gives the best model for predicting the situation over the next 6 days of

The model fitting needs to be repeated until the best solutions are found, that is, solutions capable of predicting the number of confirmed, active, migrated and death cases of coronavirus in India. The goals of this study are as follows:

To use the LR and SVR models to estimate the extent of the spread of the infection.

To predict the confirmed, active, death and cured coronavirus cases in India for a further 25 days so that government authorities can take control measures.

ML is used as a computerized approach to analyzing data in a variety of fields such as therapeutic engineering, statistical engineering, commerce and the financial sector. The purpose of machine learning is to understand the structure of the data and fit it with models that users can understand and apply to their own data. Machine-learning algorithms allow computer systems to be trained on input data and to produce statistical output values within a particular range. Machine learning builds models that automate the decision-making process based on the input data.

Machine learning methods are broadly classified into three types [13]: i) supervised learning, ii) unsupervised learning and iii) reinforcement learning. In supervised learning, the system is provided with labeled inputs in order to produce the desired labeled outputs. The main objective of this method is to learn by comparing the model's outputs with the true outputs, so that the resulting errors can be used to modify the model accordingly. Supervised learning is used to forecast label values. Unsupervised learning is used to find similarities in the input data provided. In this method the data is unlabeled, and such unlabeled data is abundantly available. Unsupervised learning allows the machine to discover automatically, from the original data, the patterns needed to group it. The last method is reinforcement learning, which focuses on striking a balance between exploration and exploitation rather than on learning from labeled input and output pairs.

The coronavirus dataset used in this article was gathered from the KAGGLE website and covers January 30, 2020 to May 21, 2020. The data includes all the active, confirmed, death and migrated cases in India. It was reorganized into date-wise confirmed, active, cured and death cases in India and is available in time-series format. For prediction, the data is divided into a training dataset and a testing dataset. In simple regression analysis, the linear regression is expressed as:

$$y = b_0 + b\,x + e$$

where $y$ is the dependent variable, $x$ is the predictor, $b_0$ is the constant (intercept), $b$ is the regression weight and $e$ is the residual error.

Multi LR: The multi LR model has more than one predictor. The simple equation for multiple linear regression is given by

$$y = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + e$$

where $b_0$ is the constant term, $b_1$, $b_2$ and $b_3$ are the coefficients of the variables $x_1$, $x_2$ and $x_3$, and $e$ is the error associated with the predicted value.
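
For the single-predictor case, the intercept and weight can be estimated in closed form by ordinary least squares. The following minimal Python sketch illustrates this under stated assumptions: the source does not say which fitting procedure or software the authors used, the function names `fit_simple_lr` and `predict` are hypothetical, and the toy numbers are made up for illustration and are not taken from the KAGGLE dataset.

```python
from statistics import mean


def fit_simple_lr(x, y):
    """Estimate (b0, b) for y = b0 + b*x + e by ordinary least squares."""
    x_bar, y_bar = mean(x), mean(y)
    # Sum of squared deviations of the predictor.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x must contain at least two distinct values")
    # Sum of cross-deviations between predictor and response.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx            # regression weight (slope)
    b0 = y_bar - b * x_bar   # constant term (intercept)
    return b0, b


def predict(b0, b, x):
    """Evaluate the fitted line at each value in x."""
    return [b0 + b * xi for xi in x]


# Toy data (illustrative only): day index versus a case count.
days = [0, 1, 2, 3, 4]
cases = [10.0, 12.0, 14.0, 16.0, 18.0]

b0, b = fit_simple_lr(days, cases)
forecast = predict(b0, b, [5, 6])  # extrapolate two days ahead
```

With several predictors, the same least-squares idea applies, but the coefficients are obtained by solving a linear system rather than from the two sums above.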

The reorganized data was also applied to a support vector machine (SVM) to analyze the model performance in terms of R², RMSE and MAE. SVMs address both regression and classification problems. The SVM is a non-linear generalization and can be used as a learning algorithm. The main objective of the algorithm is to minimize the error on the observed training data so that good generalization performance is achieved. Accordingly, SVM has two variants, namely SVC and SVR. The SVR model depends only on a subset of the training data, because the cost function used to build the model ignores training points whose predictions lie close to their targets [13] [14]. The SVC model, by contrast, is a binary classifier whose output is a positive or negative result based on the input provided [13]. In general, the regression learning problem is posed as follows: the learning machine is given a training dataset D. The SVR function relates the input vector to the target and is simply expressed as

$$f(x) = w \cdot x + b$$

where $w$ is the weight vector and $b$ is the bias. SVR specifies how much error is acceptable in the model and finds an appropriate line to fit the data. The role of SVR is to keep the coefficients, i.e., the norm of the coefficient vector, small while fitting the best line within a preset error margin.
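
The idea of an acceptable error margin can be made concrete with the epsilon-insensitive loss used in the standard linear SVR formulation. The sketch below is an illustration only and assumes that standard formulation: the source does not state the authors' kernel, margin or regularization settings, so `epsilon`, `c` and the function names are hypothetical, and the toy numbers are made up.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Per-point loss: zero inside the epsilon tube, linear outside it."""
    return [max(0.0, abs(yt - yp) - epsilon) for yt, yp in zip(y_true, y_pred)]


def svr_objective(w, b, xs, ys, epsilon=0.5, c=1.0):
    """Primal objective of linear SVR: 0.5*||w||^2 + c * sum of tube violations."""
    # Linear SVR function f(x) = w . x + b for every input vector.
    preds = [sum(wj * xj for wj, xj in zip(w, x)) + b for x in xs]
    penalty = sum(epsilon_insensitive_loss(ys, preds, epsilon))
    # Small weights are rewarded; only errors larger than epsilon are penalized.
    return 0.5 * sum(wj * wj for wj in w) + c * penalty


# Toy data (illustrative only): one feature per sample.
xs = [[1.0], [2.0], [3.0]]
ys = [1.2, 1.9, 3.8]

losses = epsilon_insensitive_loss(ys, [1.0, 2.0, 3.0], epsilon=0.5)
objective = svr_objective(w=[1.0], b=0.0, xs=xs, ys=ys, epsilon=0.5, c=1.0)
```

In this toy case only the third point lies outside the tube, so it is the only one that contributes to the penalty; points inside the tube do not influence the fit, which is why the model depends on a subset of the training data.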

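The performance of the linear regression model and the support vector regression model is evaluated using the following metrics: RMSE, R² and MAE [14].

RMSE is the square root of the mean of the squared differences between the observed and forecasted values:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$

MAE gives the average absolute difference between the observed and forecasted values [14]:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$

where $y_i$ is the observed value, $\hat{y}_i$ is the forecasted value and $n$ is the number of observations.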

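The correlation coefficient measures the strength of the linear relationship between the independent parameter (x) and the dependent parameter (y). It is denoted by r and is expressed as:

$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \,\sum_{i=1}^{n}(y_i - \bar{y})^2}}$$

The R-square value expresses, as a percentage, the proportion of the variation in the dependent parameter that is explained by the independent parameter in the model.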
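
As an illustration of how these metrics are computed, the following minimal Python sketch implements MAE, RMSE, R² and the correlation coefficient r using only the standard library. The function names and the toy numbers are assumptions made for this example and are not taken from the KAGGLE data or from the authors' implementation. R² is computed here as 1 − SS_res/SS_tot, which is one common definition.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error between observed and forecasted values."""
    return sum(abs(yt - yp) for yt, yp in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean square error between observed and forecasted values."""
    return math.sqrt(sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)) / len(y_true))


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = sum(y_true) / len(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - y_bar) ** 2 for yt in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when the observed values are constant")
    return 1.0 - ss_res / ss_tot


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    x_bar, y_bar = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when either sequence is constant")
    return sxy / math.sqrt(sxx * syy)


# Toy values (illustrative only).
observed = [100.0, 150.0, 200.0, 250.0]
forecast = [110.0, 140.0, 210.0, 240.0]

metrics = {
    "MAE": mae(observed, forecast),
    "RMSE": rmse(observed, forecast),
    "R2": r2_score(observed, forecast),
    "r": pearson_r(observed, forecast),
}
```

Because RMSE squares the errors before averaging, it penalizes large individual errors more heavily than MAE does; the two coincide here only because every toy error has the same magnitude.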

Forecasting provides relevant and reliable information about past, present and future activity using well-defined numerical and scientific methods. Predicting numerical values for a specific task involves a few steps. The first step is to understand the problem through a complete analysis, and the second is to collect the relevant data for further estimation. After estimation, the actual and estimated values are compared and the necessary actions are taken.

The data was arranged so that the confirmed, active, death and recovered cases are plotted by date, i.e., as daily counts of all the cases, as shown below:

Figure 3 shows the COVID-19 data for all the states of India, where cases are rising exponentially. Among them, Maharashtra shows the highest number of cases.

Figure 4 plots the observed confirmed cases together with the values forecasted from the training dataset (figure 5).

Figure 6 shows the plot of the root mean square error between the observed and forecasted confirmed cases (figure 7); the value, approximately 1065.468, can be observed in figure 8.

The above plot shows that all the predicted data fit a straight line. Figure 9 shows the plot of forecasted and observed values of active cases in India, and the corresponding RMSE value obtained is approximately 1043. In the same way, we plotted the graphs for cured, migrated and death cases (figure 12); the RMSE values for cured and death cases are approximately 584 (figure 13) and 62, respectively, and the graphs are shown in Figures 14 to 17.

We tabulated the performance metrics of the linear regression and SVM regression models. Table 1 provides these metrics, i.e., RMSE, MAE, R², the training time needed to train the model and the prediction speed, for forecasting the cases (confirmed, active, cured/migrated, death) in India. We analyzed and predicted the cases up to 25th June 2020, and Table 2 provides the cases (confirmed, active, cured/migrated, death) predicted by the linear regression model.

These numerical values are predicted from the actual values given as input to the system model.
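
To show how a fitted line can be turned into a date-wise forecast table, the following Python sketch maps calendar dates to a day index and evaluates a linear model from the day after the last observation up to 25th June 2020. It is a sketch under stated assumptions: the use of a day index as the single predictor is an assumption about the setup, the function names are hypothetical, and the coefficients `b0` and `b` are placeholders rather than the authors' fitted values.

```python
from datetime import date, timedelta

START = date(2020, 1, 30)        # first date in the dataset, per the text
LAST_OBS = date(2020, 5, 21)     # last observed date, per the text
HORIZON_END = date(2020, 6, 25)  # last forecast date, per the text


def day_index(d):
    """Number of days elapsed since the first date in the dataset."""
    return (d - START).days


def forecast_dates(last_obs, end):
    """All calendar dates after the last observation, up to and including end."""
    n_days = (end - last_obs).days
    return [last_obs + timedelta(days=k) for k in range(1, n_days + 1)]


def linear_forecast(b0, b, dates):
    """Evaluate y = b0 + b * day_index for each date, keyed by ISO date string."""
    return {d.isoformat(): b0 + b * day_index(d) for d in dates}


# Placeholder coefficients (NOT fitted values); replace with estimates from training data.
b0, b = 0.0, 1.0

future = forecast_dates(LAST_OBS, HORIZON_END)
table = linear_forecast(b0, b, future)
```

Keying the table by ISO date string keeps the output easy to join with date-wise observed counts when the forecast is later compared with actual values.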

Machine learning approach for confirmation of COVID-19 cases: Positive, negative, death and release

Time series forecasting of COVID-19 transmission in Canada using LSTM networks

Smart clothes with bio-sensors for ECG monitoring

Analysis of RCC T-beam and prestressed concrete box girder bridges super structure under different span conditions

A Methodological Approach for Predicting COVID-19 Epidemic Using EEMD-ANN Hybrid Model

A machine learning forecasting model for COVID-19 pandemic in India

Optimization of squeeze casting process parameters on AA2024/Al2O3/SiC/Gr hybrid composite using Taguchi and Jaya algorithm

Predicting the Growth and Trend of COVID-19 Pandemic using Machine Learning and Cloud Computing

Data analysis of COVID-2019 epidemic using machine learning methods: a case study of India

Pervasive computing in the context of COVID-19 prediction with AI-based algorithms

Machine learning based approaches for detecting COVID-19 using clinical text data

Online forecasting of COVID-19 cases in Nigeria using limited data

An integrated artificial neural network and Taguchi approach to optimize the squeeze cast process parameters of AA6061/Al2O3/SiC/Gr hybrid composites prepared by novel encapsulation feeding technique

Short-term forecasting COVID-19 cumulative confirmed cases: Perspectives for Brazil

Detection of Breast Cancer by Thermal Based Sensors using Multilayered Neural Network Classifier