key: cord-0858133-qp77vl6h authors: Yadav, Milind; Perumal, Murukessan; Srinivas, M title: Analysis on Novel Coronavirus (COVID-19) Using Machine Learning Methods date: 2020-06-30 journal: Chaos Solitons Fractals DOI: 10.1016/j.chaos.2020.110050 sha: 2b8b956376a7b94e294d425f050194bc0276bc64 doc_id: 858133 cord_uid: qp77vl6h

In this paper, we study the pandemic of the novel coronavirus (COVID-19). COVID-19 is an infectious disease that causes severe damage to the lungs and has killed many people worldwide. The World Health Organization (WHO) has declared it a pandemic, and countries are trying to control it through lockdowns. The main objective of this work is to solve five tasks: I) predicting the spread of coronavirus across regions; II) analyzing the growth rates and the types of mitigation across countries; III) predicting how the epidemic will end; IV) analyzing the transmission rate of the virus; V) correlating the coronavirus and weather conditions. These tasks support efforts to minimize the virus spread through mitigation: they indicate how well the mitigations are working and how many cases they have prevented, estimate the number of patients who will recover from the infection with existing medication, estimate how long the pandemic will take to end, show how fast or slowly the virus is spreading among regions, and clarify the correlation between the spread and weather conditions so that the surroundings of infected patients can be adjusted to reduce the spread. We propose a novel Support Vector Regression method to analyze these five tasks related to the novel coronavirus. Instead of a simple regression line, we also use the support vectors to obtain better prediction accuracy.
Our approach is evaluated and compared with other well-known regression models on standard available datasets. The promising results demonstrate its superiority in both efficiency and accuracy.

COVID-19 is an infectious disease caused by a novel coronavirus that originated in Wuhan city, Hubei Province, China [1, 2]. Severe Acute Respiratory Syndrome Coronavirus-2 (SARS-CoV-2) is a new member of the coronavirus family that had not previously been identified in people. The virus appears to be transmitted mostly through minute respiratory droplets produced by coughing or sneezing, or when people interact in close proximity for some time. These droplets can be inhaled, or they can land on surfaces that others touch; those people can then become infected when they touch their eyes, mouth, or nose. The novel coronavirus can survive on surfaces for a few days (stainless steel and plastic) or a few hours (cardboard and copper). However, the amount of viable virus declines over time and may not always be present in sufficient numbers to cause infection. In humans, symptoms can appear between 1 and 14 days after infection. Since its emergence, the virus has spread at great speed, leaving no time to prepare against a newly identified infectious agent. Its fast human-to-human transmission compelled the WHO to declare COVID-19 a pandemic [3]: people had been infected on every continent and many lives had already been lost. The statistics and graphs for total cases and active cases are shown in Figure 1. Symptoms of coronavirus vary in severity, from none at all (asymptomatic) to fatigue, cough, fever, general weakness, sore throat, and muscular pain, and in the most extreme cases sepsis, severe pneumonia, acute respiratory distress syndrome, and septic shock, all potentially leading to death.
Reports show that clinical deterioration can occur quickly, often during the 14 days of disease. Of late, anosmia (loss of the sense of smell) has been reported as one of the symptoms of coronavirus infection. There is already confirmation from many regions, such as Italy, China, and South Korea, that patients with confirmed SARS-CoV-2 infection have developed anosmia/hyposmia, in some cases in the absence of any other symptoms. There is still no proper treatment, drug, or vaccine for coronavirus disease. Several drugs are being tried on severely affected coronavirus patients. However, their use needs to be more carefully assessed in randomized controlled trials. Several clinical trials are ongoing to assess their effectiveness, but results are not yet available. As this is a new virus, no vaccine is currently available. Although several research groups and pharmaceutical companies worldwide have already started work on a vaccine, it may be months to more than a year before a vaccine has been tested and is ready for use in humans [5].

To date, COVID-19 has infected citizens of more than 212 countries, leading to 3,747,356 patients, of whom 258,970 have lost their lives and 1,250,693 [6] have recovered; however, owing to the biphasic nature of the virus, recovered patients may become infected again. Because of the insufficient number of test kits, ventilators, oxygen tanks, and hospital beds, and the unavailability of proper treatment or a vaccine, it is very important to analyze the growth rates of positive cases, the number of recoveries, and other factors that affect the growth of this virus. With such analysis, proper arrangements can be made to prevent loss of life and to gain proper insight into the situation.
For example, based on the analysis of data, the government can know in advance the expected number of cases by a particular day and, before that day, arrange all the necessary medical equipment or decide which mitigations to apply to prevent loss of life. Machine learning methods are now widely used in the healthcare field [7, 8] and can provide much faster and more efficient prediction of COVID-19 infections. In this work, the Support Vector Regression (SVR) [9, 10, 11] model is used to solve four different types of COVID-19 related problems. The proposed method is fitted to a dataset containing the total number of COVID-19 positive cases and the number of recoveries for different countries: Mainland China, the US, Italy, South Korea, and India. The fitted model is then used to predict the future number of total cases, active cases, and recoveries. These tasks can help a country or region to understand the spread of the virus, make people aware, and start mitigations. They also help that region or country to prepare for what will happen in the future, which may help in saving lives and reducing agony. We compare the results of the proposed method with other well-known regression models, namely Simple Linear Regression and Polynomial Regression [12]. A further task uses weather data for New York City (NYC) and Milan (Italy); Pearson's method is used to analyze the correlation between different weather parameters and the total number of cases. This will help in understanding the effects of weather conditions on the virus spread.

This section describes the motivation and gives a detailed overview of the tasks and the proposed approach. COVID-19 is having dramatic effects on people, causing deaths, agony, and chaos. To analyze these effects, several tasks can be performed.
These tasks can help in understanding and extracting knowledge from COVID-19 data, which can help a country or region to understand the spread of the virus, make people aware, start mitigations, assess whether mitigation is having a positive effect, identify other factors affecting the virus, and so on. This helps the country to prepare for what is coming in the near future, which may help in saving lives and reducing agony. In this paper, five different tasks were performed [13]: I) predicting the spread of coronavirus across regions; II) analyzing the growth rates and the types of mitigation across countries; III) predicting how the epidemic will end; IV) analyzing the transmission rate of the virus; V) correlating the coronavirus and weather conditions. In our proposed work, the Support Vector Regression (SVR) model is used for the first four tasks, and Pearson's correlation [14] method is used for the fifth task.

Support vector machines (SVM) [15] are a supervised learning algorithm used for classification and regression problems. SVR is based on the same principles as SVM for classification, which finds a hyperplane in a d-dimensional space (d is the number of features) that distinctly classifies the data points. SVR is a non-parametric technique, meaning that the output of the SVR model does not depend on the distributions of the dependent and independent variables. Instead, SVR depends on kernel functions, which allow a non-linear model to be constructed without changing the explanatory variables; this helps in interpreting the resulting model. In these algorithms, a hyperplane is found that separates the data points of different classes. The model produced by SVM depends only on a subset of the training data, because the cost function ignores training points that lie beyond the margin. Similarly, in SVR, the support vectors are the data points closest to the regressed curve, and they determine the function that the curve represents. We get closest to the actual curve when the distance between the support vectors and the regressed curve is maximum.
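The role of kernel functions mentioned above can be illustrated with a small sketch. The paper does not state which kernel was used, so the degree-2 polynomial kernel, the RBF kernel, the `gamma` value, and all function names below are assumptions chosen for illustration only. The sketch shows that a kernel evaluated on the original inputs gives the same number as an ordinary dot product taken after an explicit mapping into a higher-dimensional feature space.

```python
import math


def dot(u, v):
    """Ordinary dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def poly2_kernel(u, v):
    """Homogeneous degree-2 polynomial kernel: K(u, v) = (u . v)^2."""
    return dot(u, v) ** 2


def poly2_features(u):
    """Explicit 3-D feature map for a 2-D input.

    The dot product of two mapped vectors equals poly2_kernel on the
    original vectors, which is what the kernel trick exploits.
    """
    x1, x2 = u
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


def rbf_kernel(u, v, gamma=0.5):
    """Gaussian (RBF) kernel: K(u, v) = exp(-gamma * ||u - v||^2).

    gamma=0.5 is an arbitrary illustrative value, not one from the paper.
    """
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)


# Two made-up 2-D points.
x, y = (1.0, 2.0), (3.0, 4.0)

implicit = poly2_kernel(x, y)                          # computed in input space
explicit = dot(poly2_features(x), poly2_features(y))   # computed in feature space
print(implicit, explicit)  # the two values agree up to floating-point rounding
print(rbf_kernel(x, y))
```

The polynomial case makes the equivalence easy to check by hand; the RBF kernel corresponds to an infinite-dimensional feature space, so only the kernel form can be evaluated directly.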
A hyperplane is a function that classifies points in a higher dimension; in other words, hyperplanes are the boundaries that help in classifying the data points. If the margin of a hyperplane is maximum, then that hyperplane is the optimal hyperplane. The points closest to the hyperplane are called support vectors, and the distance of these vectors from the hyperplane is called the margin, as shown in Figure 2. The farther the support vectors are from the hyperplane, the higher the probability that the points will be correctly classified in their respective region. In the hyperplane equation, $l_i = \{l_0, l_1, l_2, \ldots\}$, $b$ is the bias term ($l_0$), and $x$ denotes the variables.

The kernel is an important part of SVR. A kernel is a way of computing the dot product of two vectors x and y in some high-dimensional feature space. The kernel trick used in SVR simply means replacing the dot product of two vectors with the kernel function.

We present the experimental results for each task in detail for Mainland China, the US, Italy, South Korea, and India.

The outbreak of COVID-19 is developing into a major international crisis, and it is starting to influence important aspects of daily life. For example, bans have been placed on hotspot countries, international manufacturing operations have often had to throttle back production, and the supply of many goods produced solely in China has been halted altogether. In highly affected areas, people are starting to stock up on essential goods. Therefore, to predict how the virus could spread across different countries and regions, different regression models were used to predict the total number of positive cases. The main goal of this task was to build and compare regression models that can predict the progression of the total positive COVID-19 cases in different regions, which may help mitigation efforts.
The advantage of doing this is that we will have an idea of the number of cases that will be reached, which indicates the level of spread; accordingly, the government and citizens can make proper plans to handle the situation by taking measures to minimize the virus spread through various mitigations and other necessary actions. Predictions of the total number of cases in different regions are shown in Figure 3. The accuracy of the total number of positive cases in Mainland China is shown in Fig 3.

Figure 4 shows the total number of active COVID-19 cases for each region, plotted with the number of active cases on the Y-axis and the number of days on the X-axis. All X-axis and Y-axis values were scaled before use. Fig 4(a) shows the growth rate of the total number of active cases.

Since no vaccine for the virus has yet been discovered, it is important to see how many patients will recover from this virus, and how and when the epidemic will end. The main objective of this task was to predict how many people will recover based on past recovery records. To that end, records of the number of recovered patients across different countries were taken. Solving this task helps us to understand how the epidemic will end, i.e., how many patients will recover. The advantage is that we will have an idea of the number of patients who will recover from the infection with the older known methods, since no vaccine or cure has yet been discovered. By predicting the time that all the patients will take to recover, we will be able to understand how long it will take for this pandemic to end. To prepare the dataset for this task, dates were first converted to day numbers. The accuracies of Simple Linear Regression, Polynomial Regression, and SVR were 90.92%, 99.32%, and 99.47%, respectively.
The total number of recoveries in the US is shown in Fig 5(b), and Fig 5(c) plots the total number of recoveries in Italy.

The COVID-19 virus is spreading at immense rates among humans all around the globe. Since no vaccine for the virus has yet been discovered, it is important to understand how the virus is transmitting, i.e., how fast or how slowly it is spreading among different countries. To predict how many persons are infected each day, records of the total number of positive cases across different countries were taken. Solving this task helps us to understand in which countries the transmission is faster or slower. The advantage is that we can observe and analyze how fast or slowly the virus is spreading among regions and, therefore, which areas need more attention. To prepare the dataset for this task, dates were first converted to day numbers, taking 22 Jan 2020 as Day 1 and 24 April 2020 as Day 93. Then the number of newly found cases per day was calculated by subtracting the previous day's total number of cases from the present day's total. Finally, the day number and the new cases per day were scaled to give clearer results.

... and 37.9%, respectively. The total number of newly found cases per day in the US is shown in Fig 6(b). The curve rose to a local maximum and is seen to be dropping with the number of days. Fig 6(c) plots the total number of newly found cases per day in Italy. The curve rose to a local maximum and is seen to be dropping with the number of days. The predicted number of newly found cases for day 93 is 5,030 with Simple Linear Regression, 1,920 with Polynomial Regression, and 2,831 with the proposed SVR method, whereas the actual number of newly found cases was 3,021.
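The dataset preparation described for this task (day numbering from 22 Jan 2020, differencing the cumulative totals, then scaling) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the paper does not say which scaling was applied, so min-max scaling is assumed here, the function names are hypothetical, and the cumulative counts are made-up values rather than real data.

```python
from datetime import date

START = date(2020, 1, 22)  # Day 1, as stated in the text


def day_number(d, start=START):
    """Convert a calendar date to a 1-based day number counted from `start`."""
    return (d - start).days + 1


def daily_new_cases(cumulative):
    """New cases per day: today's cumulative total minus the previous day's."""
    return [today - prev for prev, today in zip(cumulative, cumulative[1:])]


def min_max_scale(values):
    """Scale values to [0, 1]. Min-max scaling is an assumption; the paper
    only says the values were 'scaled'."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Made-up cumulative totals for five consecutive days starting on Day 1.
cumulative = [10, 14, 21, 21, 30]

new_cases = daily_new_cases(cumulative)       # one value per day from Day 2 on
days = list(range(2, 2 + len(new_cases)))     # day numbers matching new_cases

scaled_days = min_max_scale(days)
scaled_new_cases = min_max_scale(new_cases)

print(day_number(date(2020, 2, 1)))
print(new_cases)
print(scaled_new_cases)
```

Differencing shortens the series by one entry, because Day 1 has no previous day to subtract; the sketch therefore pairs the new-case values with day numbers starting at Day 2.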
The accuracies of Simple Linear Regression, Polynomial Regression, and SVR were 57.33%, 83.51%, and 91.51%, respectively. The total number of newly found cases per day in South Korea is shown in Fig 6.

The weather conditions that affect the spread of COVID-19 are not the same for all regions. It is therefore important to analyze, for different regions, which weather conditions most affect the spread. The main objective of this task was to obtain date-wise counts of infected people in New York City and Milan (Italy) together with temperature, and then to analyze them with humidity, precipitation, and wind. The main motive was to observe whether these factors contribute to the spread in these cities; although the effects may be small, they cannot be ignored. The advantage is that we will be able to create better surroundings for infected patients to reduce the spread. People can also be warned whether they should avoid humidity, high temperature, and so on. Pearson's correlation method is used to understand the correlation between the spread and weather conditions. Correlation is a measure of association between two variables and the direction of their relationship.

Pearson correlation: It measures the degree of the relationship between linearly related variables and is the most widely used correlation. For the Pearson correlation, both variables are assumed to be normalized; if they are not, normalization should be performed first. Also, the relationship between the two variables should be linear, with the data equally distributed about the regression line. The following formula is used to calculate the Pearson r correlation:

$$ r = \frac{N \sum ab - \sum a \sum b}{\sqrt{\left[ N \sum a^2 - \left( \sum a \right)^2 \right] \left[ N \sum b^2 - \left( \sum b \right)^2 \right]}} $$

where r is the Pearson correlation coefficient between a and b, N is the number of observations, a is a value of x, and b is a value of y.

Figure 8(d) plots a correlation graph between total positive cases and wind speed in Milan.
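For readers who want to reproduce the correlation step, the Pearson coefficient can be computed directly from sums over the paired observations. The sketch below uses the standard computational form with N observations and paired values a and b; the function name `pearson_r` and the temperature and case series are hypothetical, made-up values chosen to be exactly linear so that the result is easy to check, and they are not data from the paper.

```python
import math


def pearson_r(a, b):
    """Pearson correlation coefficient between paired sequences a and b.

    Uses r = (N*sum(ab) - sum(a)*sum(b)) /
            sqrt((N*sum(a^2) - sum(a)^2) * (N*sum(b^2) - sum(b)^2)).
    """
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("a and b must have the same length, at least 2")
    n = len(a)
    sum_a, sum_b = sum(a), sum(b)
    sum_ab = sum(x * y for x, y in zip(a, b))
    sum_a2 = sum(x * x for x in a)
    sum_b2 = sum(y * y for y in b)
    numerator = n * sum_ab - sum_a * sum_b
    denominator = math.sqrt((n * sum_a2 - sum_a ** 2) * (n * sum_b2 - sum_b ** 2))
    if denominator == 0:
        raise ValueError("correlation is undefined when a series is constant")
    return numerator / denominator


# Made-up, exactly linear series: cases rise by 20 per degree of temperature.
temperature = [5.0, 7.0, 9.0, 11.0, 13.0]
cases = [100, 140, 180, 220, 260]

r_up = pearson_r(temperature, cases)          # perfectly increasing relation
r_down = pearson_r(temperature, cases[::-1])  # perfectly decreasing relation
print(r_up, r_down)
```

A value near +1 or -1 indicates a strong linear relationship; a value near 0 indicates no linear relationship, although a nonlinear dependence may still be present.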
It is observed that the relationship between wind speed and total positive cases has a nonlinear shape.

COVID-19 causes illness in humans, creates severe damage in the lungs, and has killed many people worldwide. In this paper, we propose a Support Vector Regression based analysis of the novel coronavirus on five different tasks. The main novelty of this work is that, instead of a simple regression line, we also use the support vectors to obtain better prediction accuracy. The main advantage of the first task is that it gives an idea of the level of spread, so that the government and citizens can make proper plans to handle the situation by taking measures to minimize the virus spread through various mitigations and other necessary actions. The second task gives an idea of how well the mitigations are working, how effective the actions taken to date have been, and how many cases they have prevented. The third task gives an idea of the number of patients who will recover from the infection with the older known methods, since no vaccine or cure has yet been discovered. By predicting the time that all the patients will take to recover, we will be able to understand how long it will take for this pandemic to end. With the fourth task, we can observe and analyze how fast or slowly the virus is spreading among regions and, therefore, which areas need more attention. Finally, the fifth task helps to create better surroundings for infected patients to reduce the spread. People can also be warned whether they should avoid humidity or high temperature. Pearson's correlation method gives a clear understanding of the correlation between the spread and weather conditions.
On the first four tasks, the proposed Support Vector Regression based coronavirus analysis gave promising results compared with other well-known regression methods.

The authors would like to thank the anonymous reviewers for their thorough review and valuable comments. This work was supported in part by grants from the Government of India, Ministry of Human Resource Development, and NIT Warangal under project NITW/CS/CSE-RSM/2018/908/3118.

Author Agreement Statement
We the undersigned declare that this manuscript is original, has not been published before and is not currently being considered for publication elsewhere. We confirm that the manuscript has been read and approved by all named authors and that there are no other persons who satisfied the criteria for authorship but are not listed. We further confirm that the order of authors listed in the manuscript has been approved by all of us. We understand that the Corresponding Author is the sole contact for the Editorial process. He/she is responsible for communicating with the other authors about progress, submissions of revisions and final approval of proofs.

A novel coronavirus from patients with pneumonia in China
A novel coronavirus genome identified in a cluster of pneumonia cases, Wuhan
Severe acute respiratory syndrome-related coronavirus: the species and its viruses, a statement of the Coronavirus Study Group
Confirmed cases and deaths by country, territory, or conveyance
Confirmed cases country, territory, or conveyance
Deep dictionary learning for fine-grained image classification
Classification of medical images using edge-based features and sparse representation
Support vector regression. Efficient learning machines
Travel-time prediction with support vector regression
Improved SVM regression using mixtures of kernels
Modelling using polynomial regression
Covid-19 challenge tasks
Pearson's correlation coefficient
Predicting time series with support vector machines

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.