key: cord-0806141-memfwbr0 authors: Khan, Y.A.; Abbas, S.Z.; Truong, Buu-Chau title: Machine Learning-Based Mortality Rate Prediction using Optimized Hyper-parameter date: 2020-08-18 journal: Comput Methods Programs Biomed DOI: 10.1016/j.cmpb.2020.105704 sha: b18be45858b3aa842c1753e50c1b945e75141413 doc_id: 806141 cord_uid: memfwbr0

OBJECTIVE AND BACKGROUND: The COVID-19 pandemic demands investigation and prediction from multiple angles. A variety of prediction models are available in the literature. Most of them extrapolate from history-oriented, disease-related parameters. The present research instead predicts the COVID-19 mortality rate by regression techniques and compares the models fitted for five countries.

METHODS: Regression with an optimized hyper-parameter is used to develop these models from training data by a machine learning technique.

RESULTS: The validity of the proposed model is supported by a case study on data for Pakistan. Five distinct mortality rate prediction models are built, using confirmed-case data as the predictor variable, for France, Spain, Turkey, Sweden, and Pakistan, respectively. The results show that Sweden, without observing a lockdown, has fewer death cases at over 20000 confirmed cases. Hence, by following the strategy adopted by Sweden, the chosen entity could control the death rate despite an increase in confirmed cases.

CONCLUSION: The evaluated results show a high mortality rate and a low RMSE for Pakistan under the GPR-based mortality model. Therefore, the mortality rate based MRP model is selected for the COVID-19 death rate in Pakistan. Hence, the Sweden model is the best fit for controlling the mortality rate.

Coronaviruses were first identified in the 1960s; their origin is still a mystery. Their crown-like shape suggested the name.
They can infect humans as well as animals [1]. Functionally, they are the kind of virus that affects the sinuses, nose, and throat. The coronavirus types NL63, 229E, HKU1, and OC43, like the common cold, usually cause illness of the upper respiratory tract. Much of the human population is infected by these types of viruses at some point in life. Such illnesses resolve within a short period. The general symptoms include headache, runny nose, dry cough, fever, and sore throat [2]. On December 31, 2019, the World Health Organization announced several pneumonia cases of unknown cause in Wuhan city, Hubei province, China. The virus was found to differ from any other known virus. For a newly emerged virus, how it affects people is unknown, which raised significant concern. A few days later, the authorities in China declared that they had identified a new type of virus. It was a coronavirus, a member of the family that causes the common cold as well as MERS and SARS. The scientific name 2019-nCoV was suggested [3]. It is now confirmed that the mysterious respiratory illness in Wuhan is caused by this new coronavirus, which is loosely related to the SARS coronavirus (SARS-CoV) [4, 5]. Among humans, these viruses normally spread from an infected person through uncovered sneezing and coughing, close physical contact, and touching contaminated objects [6].

Machine learning techniques are emerging as a key tool in several fields of intelligence. They optimize over data with different algorithms so that preemptive measures can be taken. In data science, machine learning plays a primary role in data analytics. It helps us understand data and its processes better, make predictions from historical data, and group data automatically, which is called classification.
Gaussian-based models have been commonly used in optimization applications [7]. In [8], an extensive comparative study of several surrogate models, including GPR, was carried out using a simulation-optimization methodology with uncertain parameters; it concluded that GPR models and their ensembles were efficient in terms of prediction accuracy. The classical techniques divide into supervised and unsupervised learning. Supervised learning techniques, including regression, Classification and Regression Trees (CART), and naive Bayes, train the algorithms on labeled data, where both input and output are known. Unsupervised learning techniques train on unlabeled data: raw input is given directly to the algorithms without knowing the corresponding output [9, 10]. Mortality models can be made more efficient by using machine learning techniques. One such application, which is fully machine learning-oriented, is illustrated in [11]. Further, in [12], the authors extended the Lee-Carter model to multiple populations using neural networks.

In this research, the Gaussian Process Regression model with an optimized hyper-parameter is used to develop COVID-19 mortality models for five countries (Turkey, Spain, Sweden, France, and Pakistan). Regression processes counter the flaws in these models. The model relies fully on machine learning techniques and can capture information that standard models do not cover. We evaluate the performance of these models, in both estimating and forecasting mortality rates, on the available data for Pakistan.

The remainder of the paper is organized as follows: Section 2 describes the proposed methodology for predicting deaths due to COVID-19 in Pakistan using updated dataset samples. The empirical study is discussed in Section 3.
Section 4 concludes this work and outlines possible enhancements as future work. All technical support is shown in Appendix A.

Gaussian Process Regression (GPR) is a non-parametric, kernel-based probabilistic model that can handle complex non-linear relations between response and predictor variables [7]. A Gaussian process (GP) is a random process, regarded as a set of random variables with a joint multivariate Gaussian distribution [14]. A GP is specified mainly by a mean function and a covariance function. A GP can learn a non-parametric regression function from noisy data, and it places Gaussian distributions over the data [14, 18]. The predicted mean value is computed by the GP as a linear combination of covariance function values [14]. Among many others, one of the essential applications of GPs is Gaussian process regression. GPR is a probabilistic and robust non-parametric Bayesian model that defines a prior distribution over function space [8]. It is one of the most significant Bayesian machine learning methods: it estimates the posterior of a non-linear regression by constraining the prior distribution to fit the available training data [14]. The output of the prediction is a Gaussian probability distribution, characterized by its mean and variance. The variance serves as a confidence measure for the expected mean of the output [19]. Usually, a GPR model is provided with training data, and its performance is assessed in terms of the error between training and test inputs [20].

A GP is a set of random variables such that any finite number of them have a joint Gaussian distribution. The joint distribution of the latent variables $f(x_1), \ldots, f(x_n)$ in the GPR model is as follows:

$$P(f \mid X) \sim N\big(f \mid 0, K(X, X)\big)$$

which is similar to a linear regression model, where $K(X, X)$ looks as follows:

$$K(X, X) = \begin{pmatrix} k(x_1, x_1) & \cdots & k(x_1, x_n) \\ \vdots & \ddots & \vdots \\ k(x_n, x_1) & \cdots & k(x_n, x_n) \end{pmatrix}$$

The covariance function $k(x, x')$ is usually parameterized by a set of kernel parameters or hyper-parameters $\theta$ and is often written as $k(x, x' \mid \theta)$ to explicitly indicate the dependence on $\theta$.
It represents the covariance between pairs of random variables in GPR and can be written as $k(x_i, x_j \mid \theta)$, where $\theta$ collects the hyper-parameters. In the literature, the model given under Eq. 2 is called a surrogate for the objective function. The surrogate is easier to optimize than the objective function. GP methods choose the next set of hyper-parameters to evaluate on the actual objective function by selecting the hyper-parameters that perform best on this surrogate function. In regression models, the objective of parameter optimization is to find the parameters of a given algorithm that return the best performance on a validation set while training and testing the model [14]. It is mathematically represented as

$$x^{*} = \arg\min_{x \in X} f(x)$$

Here $f(x)$ represents the objective score to minimize, namely the root mean squared error (RMSE) evaluated on the validation set; $x^{*}$ is the set of hyper-parameters that yields the lowest RMSE; and $x$ is any value in the problem domain $X$.

- Gaussian process regression is a probabilistic and robust non-parametric Bayesian model that defines a prior distribution over function space [15].
- It is one of the most significant Bayesian machine learning (BML) methods: it estimates the posterior of a non-linear regression by constraining the prior distribution to fit the available training data [14].
- It is highly flexible and predicts accurately on small data sets as well as on high-dimensional data [14, 16].
- It can be trained on noisy data using a non-parametric regression function, sidestepping simple parametric assumptions [17].

Polynomial regression is a particular form of multiple linear regression in which the maximum degree of the predictor variable is greater than 1. In this technique, the best-fit line is a curve. It can be used to approximate a complex non-linear relationship [18]. The $k$-th order polynomial model in one variable is given by

$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_k x^k + \varepsilon \quad (6)$$

where $\varepsilon$ is the error term. A model with $k$ explanatory variables is called multiple linear regression.
If a polynomial regression model is expressed in the form

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon, \qquad x_j = x^j \quad (7)$$

then the estimation methods of the linear regression model can readily be applied to fit the polynomial regression model.

Root mean square error (RMSE) is a statistic that measures the average size of the error. It is the square root of the mean of the squared differences between estimated and actual observations and can be stated as

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(\hat{y}_i - y_i\right)^2}$$

where $\hat{y}_i$ is the estimated value, $y_i$ is the actual observation, and $n$ is the number of observations.

The data used in this research are obtained from https://ourworldindata.org/coronavirus. Since January 21, 2020, the data have been updated daily as the number of infected people has increased. [...] seconds to reach the maximum iteration. An experiment was carried out to evaluate the efficiency of the proposed methodology. With the proposed optimization algorithm, the parameter of the mortality rate prediction model was optimized and then used for prediction.

Fig. 2 represents the components of COVID-19 confirmed cases. A list of countries with a higher rate of COVID-19 confirmed cases is presented in Fig. 3. Fig. 4 represents the number of COVID-19 death cases for Spain, Turkey, Pakistan, France, and Sweden. [...] Table 1. Each country has a different tuned hyper-parameter value, which reflects that each country has a distinct trend in the spread of COVID-19 and a different rate of increase in confirmed cases. The regression loss of the values predicted by the GPR model for the five countries is given in Table 2, together with the corresponding root mean square error. Table 3 shows the RMSE values of the Polynomial Regression model for the five countries. Hence, Gaussian process regression with an optimized hyper-parameter is chosen as the best model, a choice also supported by the literature, and is used to predict deaths due to COVID-19 in Pakistan after the lifting of the lockdown, when the chances of infection are very high. [...] Turkey, France, Sweden, Spain, and Pakistan using the GPR and PR models, respectively. Fig. 6 illustrates the ranking of the COVID-19 mortality rate models of the five countries. [...] 7 represents the number of predicted death cases in the five countries. Table 5 shows the mortality rate prediction for France, Sweden, Turkey, Spain, and Pakistan, respectively. Fig. 7 represents the ranking of the five countries based on the mean mortality rate. Based on the mean mortality rate, Sweden's model is the best model for MRP.

Given the situation around the world, the coronavirus has become a biological bomb whose impact is more severe than that of a nuclear weapon. It has not only taken thousands of precious lives but also damaged the world economy. Almost all countries have practiced social distancing and observed lockdown for the last two months, which has made life hell for people. Although it is essential to estimate the financial losses caused by the deadly virus, an estimate that will only be available years later, it is imperative to recognize the pattern of deaths in order to minimize, as far as possible, future losses of precious lives all over the world.

The proposed study investigated the advantages of Gaussian process regression with hyper-parameter optimization on data comprising the number of confirmed cases and deaths from March 21, 2020 to May 10, 2020. A comparison with Polynomial Regression is also given. The Gaussian Process Regression model performs better. As outlined by Hong et al. [14], Gaussian process regression has the advantage of using prior information to estimate the posterior variation of a non-linear pattern under few assumptions, which is also validated in this study. [...] Pakistan. Hence, the Sweden model is the best fit for Pakistan to control the mortality rate. The coronavirus will last for a long time. Although many countries are observing lockdown, [...]

GPR prediction of response. Note: y ("response") is the number of deaths, and X ("predictor") is the number of confirmed cases.
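The response/predictor setup in the note above can be sketched in a few lines of dependency-free Python. This is only an illustration under stated assumptions, not the authors' R code: the squared-exponential kernel, the fixed noise level, the plain grid search (standing in for whatever optimizer the paper used), and all function names are hypothetical choices, and the numbers are synthetic stand-ins rather than Our World in Data records.

```python
import math

def sq_exp_kernel(a, b, length_scale):
    """Squared-exponential covariance k(a, b | theta); an assumed kernel choice."""
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def gpr_predict(x_train, y_train, x_new, length_scale, noise_sd=0.1):
    """Posterior mean of a zero-mean GP fitted to mean-centred targets."""
    mean_y = sum(y_train) / len(y_train)
    # K(X, X) plus noise variance on the diagonal
    K = [[sq_exp_kernel(a, b, length_scale) + (noise_sd ** 2 if i == j else 0.0)
          for j, b in enumerate(x_train)] for i, a in enumerate(x_train)]
    alpha = solve(K, [y - mean_y for y in y_train])
    return [mean_y + sum(sq_exp_kernel(x, a, length_scale) * w
                         for a, w in zip(x_train, alpha)) for x in x_new]

def rmse(y_true, y_pred):
    """Root mean square error between actual and estimated values."""
    return math.sqrt(sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def tune_length_scale(x_tr, y_tr, x_val, y_val, grid):
    """Return the grid value with the lowest validation RMSE."""
    return min(grid, key=lambda ls: rmse(y_val, gpr_predict(x_tr, y_tr, x_val, ls)))

# Synthetic data: X = confirmed cases (thousands), y = deaths. Not real figures.
X = [float(i) for i in range(20)]
y = [3.0 * xi + 4.0 * math.sin(xi / 2.0) for xi in X]
x_tr, y_tr = X[0::2], y[0::2]      # even indices for training
x_val, y_val = X[1::2], y[1::2]    # odd indices for validation

grid = [0.5, 1.0, 2.0, 4.0, 8.0]   # candidate hyper-parameter values
best_ls = tune_length_scale(x_tr, y_tr, x_val, y_val, grid)
print("selected length scale:", best_ls)
print("validation RMSE:", rmse(y_val, gpr_predict(x_tr, y_tr, x_val, best_ls)))
```

The tuning step is a direct instance of choosing the hyper-parameter that minimizes validation RMSE; a surrogate-based optimizer would replace only `tune_length_scale`, leaving the prediction and scoring functions unchanged.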
R code for the parameter optimization algorithm will be provided on personal request.

Common human coronaviruses, CDC
New Virus Discovered by Chinese Scientists Investigating Pneumonia Outbreak
New-type coronavirus causes pneumonia in Wuhan: expert. Available online
Transmission of coronavirus
Gaussian Processes for Regression: A Quick Introduction
Gaussian processes in machine learning
A Neural Network Extension of the Lee-Carter Model to Multiple Populations
Review of surrogate modeling in water resources
Multistep ahead groundwater level time-series forecasting using Gaussian Process Regression and ANFIS
Trained meta-models and Evolutionary Algorithm based multi-objective management of coastal aquifers under parameter uncertainty
Application of Gaussian process regression for bearing degradation assessment
Gaussian processes in machine learning
Dynamical systems identification using Gaussian process models with incorporated local models
Recursive Gaussian process: online regression and learning
Gaussian Processes for Machine Learning
Distributed prognostic health management with gaussian process regression
Dynamical systems identification using Gaussian process models with incorporated local models

The authors declared no conflict of interest regarding this manuscript submitted to Computer Methods and Programs in Biomedicine.