key: cord-1001035-9s238474
authors: Yalçınkaya, Abdullah; Balay, İklim Gedik; Şenoğlu, Birdal
title: A new approach using the genetic algorithm for parameter estimation in multiple linear regression with long-tailed symmetric distributed error terms: An application to the Covid-19 data
date: 2021-09-15
journal: Chemometr Intell Lab Syst
DOI: 10.1016/j.chemolab.2021.104372
sha: ad6d9cc51fcf288e25917efdeea1b779b00122b3
doc_id: 1001035
cord_uid: 9s238474

Maximum likelihood (ML) estimators of the model parameters in multiple linear regression are obtained using the genetic algorithm (GA) when the distribution of the error terms is long-tailed symmetric (LTS). We compare the efficiencies of the ML estimators obtained using GA with the corresponding ML estimators obtained using other iterative techniques via an extensive Monte Carlo simulation study. Robust confidence intervals based on modified ML estimators are used as the search space in GA. Our simulation study shows that GA outperforms the traditional algorithms in most cases. Therefore, we suggest using GA to obtain the ML estimates of the multiple linear regression model parameters when the distribution of the error terms is LTS. Finally, real data from the Covid-19 pandemic, a global health crisis in early 2020, are presented for illustrative purposes.

Consider the following multiple linear regression model

Y = θ_0 1 + X θ + ϵ,   (1)

where Y = [y_1, y_2, …, y_n]' is an n × 1 vector of responses, 1 is an n × 1 vector of ones, X = [x_1, x_2, …, x_q] is an n × q non-stochastic design matrix, θ = [θ_1, θ_2, …, θ_q]' is a q × 1 vector of unknown coefficients, and ϵ = [ϵ_1, ϵ_2, …, ϵ_n]' is a vector of independent and identically distributed error terms. Linear regression models are very popular and therefore widely used in many different areas of science; see, for example, Ghosal et al. [11], Jomnonkwao et al. [19], Liao et al. [23], and Mc Garry et al. [25]. In many real-life problems, the distribution of the error terms in a linear regression model is assumed to be normal. However, the distribution of the error terms can depart from normality in practice because of the presence of heavy tails; see, for examples related to non-normality, Geary [9], Huber [14], Pearson [27], and Tiku et al. [34]. Therefore, in this paper, we choose the long-tailed symmetric (LTS) distribution as the distribution of the error terms in the multiple linear regression model, since its heavy tails make it more flexible than the normal distribution. It also accommodates outliers lying in the direction of the tails. Thus, the LTS distribution is used to obtain more robust solutions to statistical inference problems as an alternative to the normal distribution; see, for example, Tiku et al. [31] and Tiku and Suresh [33]. There are plenty of studies on regression models with non-normal error terms in the literature. Islam et al. [17] and Tiku et al. [31] studied parameter estimation and hypothesis testing for simple linear regression under non-normal error distributions. Islam and Tiku [16] then discussed the multiple linear regression model with non-normal error terms. Furthermore, multivariate linear regression models under the assumption of non-normal error terms were considered by Islam [15] and Islam et al. [18]. It is crucial to estimate the parameters of a multiple linear regression model efficiently.
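For concreteness, the following R sketch generates a sample from model (1) with LTS(0, σ, p) error terms, using the scaled Student's t representation of the LTS distribution described in Section 2; the sample size and parameter values are illustrative choices, not those used later in the paper.

```r
# Illustrative data generation from model (1): Y = theta0*1 + X*theta + eps,
# with LTS(0, sigma, p) errors via the scaled-t representation
# eps = sigma * sqrt(k/nu) * T,  T ~ t_nu,  k = 2p - 3,  nu = 2p - 1.
set.seed(123)
n <- 50; q <- 3                        # illustrative sample size and regressors
theta0 <- 0; theta <- c(1, 1, 1)       # illustrative coefficients
sigma <- 1; p <- 3                     # scale and LTS shape parameter
k <- 2 * p - 3; nu <- 2 * p - 1
X <- matrix(runif(n * q), n, q)        # design matrix with Uniform(0, 1) entries
eps <- sigma * sqrt(k / nu) * rt(n, df = nu)
y <- as.vector(theta0 + X %*% theta + eps)
```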
The least squares (LS) method is convenient for estimating the parameters of a multiple linear regression model when the error terms are normally distributed. However, LS estimators lose their efficiency when the normality assumption is not satisfied [35]. To estimate multiple linear regression model parameters under a non-normal error structure, the ML estimation method is widely used, because ML estimators are unbiased and fully efficient under certain regularity conditions. However, the ML equations cannot be solved analytically when the distribution of the error terms is non-normal. For this reason, numerical methods are generally used in the literature to obtain the ML estimates of the unknown model parameters. It should be noted that problems such as non-convergence of the iterations, convergence to a wrong root, and slow convergence may arise in numerical methods; see Refs. [1,2,28,36,37]. Unlike earlier studies, we utilize the genetic algorithm (GA), a population-based heuristic optimization method, to obtain the ML estimates of the model parameters in multiple linear regression. GA is a powerful yet easy-to-use optimization method and can comfortably be applied to large-scale and complex nonlinear optimization problems; see Xia et al. [39]. Further, as Lu et al. [24] state, satisfactory solutions can easily be obtained with respect to the optimization objectives, and the shortcomings of numerical optimization methods are overcome by GA thanks to characteristics that give it outstanding advantages in iterative optimization. Garcia et al. [8] also report that GA is one of the most robust automated heuristic methods for solving optimization problems. See also Yalcinkaya et al. [40,41] for the details and advantages of GA. The main reason for choosing GA to obtain the parameter estimates is that GA encourages convergence to the global optimal solution in optimization problems that are extremely complex and suspected to be multi-modal; see Goldberg [12]. We then compare the efficiencies of the ML estimators obtained using GA with the corresponding estimators obtained using traditional iterative techniques, namely the Newton-Raphson (NR), Nelder-Mead (NM), and iteratively reweighting algorithm (IRA). One of the main contributions of this paper is to propose confidence intervals based on Tiku's [29,30] modified maximum likelihood (MML) estimators of the corresponding multiple linear regression model parameters as the search space in GA. Yalcinkaya et al. [40,41] and Acitas et al. [3] used GA and particle swarm optimization to obtain the ML estimators of the parameters of the skew-normal distribution and the Weibull distribution, respectively. Both studies strategically used confidence intervals based on the MML estimators of the parameters of interest as the search space and showed that this approach provides a narrower search space, improving the convergence performance of the heuristic. The use of heuristic methods is very limited in the context of regression. Therefore, we aim to implement these methods, instead of classical approaches, for solving the optimization problems arising in multiple linear regression with non-normal error terms. To the best of our knowledge, this is the first study to obtain the ML estimates of the parameters of a multiple linear regression model with LTS distributed error terms using GA. It should be noted that there are many long-tailed symmetric distributions; see, for example, Lange et al. [21] and Lange and Sinsheimer [22].
Our results can easily be extended to these distributions, so they may be the topic of our future studies.

Covid-19, a disease caused by a novel coronavirus that appeared in the Wuhan province of China in late 2019, spread rapidly worldwide in early 2020. On July 7, 2020, the World Health Organization (WHO) reported that Covid-19 had infected more than 11.5 million people worldwide and killed more than 500 thousand of them. In the Covid-19 pandemic, which turned into a global health crisis, the growth rate of cases and the number of deaths per million vary from country to country because of the different characteristics of their governments and people. Modeling the mortality rate is important for fighting Covid-19. Therefore, in this study, we analyze Covid-19 data that include characteristics of governments and people, both to make a scientific contribution to the fight against Covid-19 and to demonstrate an implementation of our proposed methodology.

The rest of the paper is organized as follows. Section 2 presents the LTS distribution. Section 3 covers ML estimation of the multiple linear regression model parameters when the distribution of the error terms is LTS; the details of GA, the procedure for identifying the search space in GA, and the details of IRA are also given in Section 3. The simulation study and its results are presented in Section 4. Real data from the Covid-19 pandemic are examined in Section 5 to show the implementation of the proposed methodology. Concluding remarks are given in the final section.

The probability density function (pdf) of the LTS distribution is given by

f(ϵ) = 1 / (σ √k B(1/2, p − 1/2)) · [1 + ϵ² / (k σ²)]^(−p),  −∞ < ϵ < ∞,   (2)

where k = 2p − 3, p ≥ 2, σ > 0, and B(⋅, ⋅) is the beta function. The expected value and variance of the LTS distribution are E(ϵ) = 0 and Var(ϵ) = σ², respectively. Also, the kurtosis (γ) is given by γ = 3(p − 3/2)/(p − 5/2). Table 1 shows the kurtosis values for some representative values of the shape parameter p. The LTS distribution has the following special cases:

• If X ~ LTS(μ, σ, p), then T = √(ν/k) (X − μ)/σ follows the Student's t distribution with ν = 2p − 1 degrees of freedom.
• When p → ∞, the LTS distribution converges to the well-known normal distribution.

See Islam and Tiku [16] and Tiku and Kumra [32] for more details about the LTS distribution.

In this section, we derive the maximum likelihood estimators of the parameters θ_0, θ_1, …, θ_q and σ in multiple linear regression with LTS distributed error terms. The log-likelihood (lnL) function based on (2) is

lnL = −n ln σ − n ln(√k B(1/2, p − 1/2)) − p Σ_{i=1}^{n} ln(1 + z_i²/k),   (3)

where z_i = ϵ_i/σ = (y_i − θ_0 − Σ_{j=1}^{q} θ_j x_{ij})/σ. In order to obtain the ML estimators of the unknown parameters, the partial derivatives of the lnL function with respect to the parameters of interest are taken and set equal to 0 as follows:

∂lnL/∂θ_0 = (2p/(kσ)) Σ_{i=1}^{n} g(z_i) = 0,   (4)
∂lnL/∂θ_j = (2p/(kσ)) Σ_{i=1}^{n} x_{ij} g(z_i) = 0,  j = 1, …, q,   (5)
∂lnL/∂σ = −n/σ + (2p/(kσ)) Σ_{i=1}^{n} z_i g(z_i) = 0.   (6)

The solutions of these likelihood equations are the ML estimators of the parameters of interest. However, the equations have no explicit solutions [16], since they include intractable functions of the form g(z) = z/(1 + z²/k). Therefore, we resort to iterative techniques such as the GA, IRA, NR, and NM algorithms. The procedures of GA and IRA used here are introduced in the following subsections. See Ref. [40] for the details of the NR and NM algorithms; we do not reproduce them here because they are very common methods for obtaining ML estimators in the literature.

The genetic algorithm (GA), an iterative population-based search technique proposed by Holland [13], is a very popular heuristic algorithm for finding the optimum of an objective function.
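In our setting, the objective function to be maximized is the log-likelihood in (3). As a point of reference for the algorithmic sketches that follow, a minimal R implementation of this objective (function and variable names are illustrative, not from the paper) is:

```r
# Log-likelihood (3) of the multiple linear regression model with LTS errors.
# par = (theta0, theta1, ..., thetaq, sigma); y and X as in the data sketch above.
loglik_lts <- function(par, y, X, p) {
  q      <- ncol(X)
  theta0 <- par[1]
  theta  <- par[2:(q + 1)]
  sigma  <- par[q + 2]
  if (sigma <= 0) return(-Inf)              # keep the search inside sigma > 0
  k <- 2 * p - 3
  n <- length(y)
  z <- (y - theta0 - X %*% theta) / sigma   # standardized residuals z_i
  -n * log(sigma) - n * log(sqrt(k) * beta(1 / 2, p - 1 / 2)) -
    p * sum(log(1 + z^2 / k))
}
```

This function is used as the fitness function in the GA sketch below and implicitly underlies the IRA updates.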
The idea underlying GA is to imitate how the hereditary characteristics of chromosomes are passed from one generation to the next through evolutionary mechanisms. Each solution is called a chromosome, and the set of solutions in a generation is called a population. GA proceeds through the following stages.

Generating Initial Population: GA starts with an initial population of N chromosomes generated from the search space via an initialization strategy. Assume that the initial population is denoted by {w_r^(0); r = 1, …, N}, where w = (θ_0, θ_1, …, θ_q, σ) is the vector of unknown parameters. Here, q and N are the number of independent variables in the multiple linear regression model and the population size, respectively. The vector w_r^(m) represents the values of the rth chromosome in the population at the mth iteration.

Fitness Evaluation: Each chromosome has a fitness value evaluated according to the objective function; in this study, we use lnL as the objective function. lnL(w_r^(m)), m = 0, 1, 2, …, represents the fitness value of the rth chromosome at iteration m. The chromosomes in the mth population are first sorted and classified from best to worst according to their fitness values in order to obtain the new (m + 1)th population. Note that a "better" chromosome, i.e., one with a better fitness value, is the one with the larger fitness value in a maximization problem and the smaller fitness value in a minimization problem.

Elitism: A predetermined number (E) of chromosomes having the best fitness values are accepted as elite chromosomes and transferred directly to the new population without any modification.

Selection: At a predetermined selection rate (SR), the worst chromosomes are replaced by new chromosomes generated randomly from the search space.

Mutation and Crossover: Finally, the mutation and crossover operators are applied, with mutation probability (MP) and crossover probability (CP), to candidate chromosomes selected from the non-elite chromosomes via a selection strategy. Here, we prefer the roulette wheel selection strategy, whose basic principle is that chromosomes with better fitness values have a greater chance of being selected.

Convergence Check: These steps yield the new population {w_r^(m+1); r = 1, …, N}. If the convergence criterion is not satisfied, the process continues with m set to m + 1. When the process stops, the values of the best chromosome in the last population are taken as the estimates of the parameters.

Identifying the search space of GA: We use confidence intervals based on Tiku's MML estimators of the parameters θ_0, θ_1, …, θ_q and σ as the search space in GA. Asymptotic 100(1 − α)% confidence intervals for the parameters θ_0, θ_1, …, θ_q and σ are given by

θ̃_l ∓ z_{α/2} se(θ̃_l),  l = 0, 1, …, q,  and  σ̃ ∓ z_{α/2} se(σ̃),

where θ̃_l and σ̃ denote the MML estimators and z_{α/2} is the upper α/2 quantile of the standard normal distribution. Here, se(⋅) is the standard error of the estimator of interest. In this study, we evaluate the standard errors by using the asymptotic variance-covariance matrix of Tiku's MML estimators, which is obtained as the inverse of the expected Fisher information matrix given in Islam and Tiku [16]. MML estimators are explicit estimators of the model parameters and are derived by a non-iterative method; see Ref. [29].
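Putting these pieces together, a sketch of the GA step in R is given below. It assumes the 'GA' package's ga() interface (which provides the 'ga' function used later in Section 4) and assumes that the vectors lower and upper hold the lower and upper limits of the MML-based confidence intervals described above; all object names are illustrative.

```r
# GA-based ML estimation with the MML confidence intervals as the search space.
library(GA)
fit_ga <- ga(
  type       = "real-valued",
  fitness    = function(par) loglik_lts(par, y = y, X = X, p = p),
  lower      = lower,          # lower CI limits for (theta0, ..., thetaq, sigma)
  upper      = upper,          # upper CI limits for (theta0, ..., thetaq, sigma)
  popSize    = 500,            # N, as in the simulation settings of Section 4
  pcrossover = 0.8,            # crossover probability CP
  pmutation  = 0.2,            # mutation probability MP
  elitism    = 4,              # elite number E
  maxiter    = 1000            # maximum number of iterations
)
ml_est <- fit_ga@solution[1, ] # best chromosome = ML estimates of the parameters
```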
To obtain the MML estimators, the likelihood equations given in (4)-(6) are first expressed in terms of the ordered standardized residuals z_(1) ≤ z_(2) ≤ … ≤ z_(n) (for given θ_0 and θ_j, 1 ≤ j ≤ q); the vector of observations corresponding to the ith ordered residual ϵ_(i) is called its concomitant vector. Then the MML estimators of the multiple linear regression model parameters when the errors have an LTS distribution are obtained by linearizing the intractable function around t_(i), the expected value of the ordered statistic Z_(i) = ϵ_(i)/σ, i.e., g(z_(i)) ≅ α_i + β_i z_(i), where

α_i = (2/k) t_(i)³ / (1 + t_(i)²/k)²  and  β_i = (1 − t_(i)²/k) / (1 + t_(i)²/k)².

Solving the resulting linearized likelihood equations yields the MML estimators in closed form; their explicit expressions are given in Ref. [16]. If β_i takes a negative value, then α*_i and β*_i are used instead of α_i and β_i, respectively, for mathematical convenience; see Ref. [16] for further details. MML estimators are asymptotically equivalent to the ML estimators and are therefore fully efficient under regularity conditions. A remarkable property of MML estimators is that they are as efficient as the ML estimators even for small sample sizes; see Refs. [5,37,38]. The high performance of MML estimators has been shown in plenty of studies in the literature; see, for example, [6,17,20].

In this study, we also use IRA to compute the ML estimates of the parameters of interest. It should be noted that IRA is an EM-type algorithm and its convergence is guaranteed; see Ref. [7]. The steps of the IRA used in this study are given below.

Step 1: Identify the initial values θ_0^(0), θ_1^(0), …, θ_q^(0) and σ^(0) for θ_0, θ_1, …, θ_q and σ.

Step 2: At iteration m, compute the weights and updated estimates

w_i^(m) = 1 / (1 + (z_i^(m))²/k),  z_i^(m) = (y_i − θ_0^(m) − Σ_{j=1}^{q} θ_j^(m) x_{ij}) / σ^(m),

[θ_0^(m+1), θ_1^(m+1), …, θ_q^(m+1)]' = (X*' W^(m) X*)^(−1) X*' W^(m) Y,

σ^(m+1) = √( (2p/(nk)) Σ_{i=1}^{n} w_i^(m) (y_i − θ_0^(m+1) − Σ_{j=1}^{q} θ_j^(m+1) x_{ij})² ),

where X* = [1 X] and W^(m) = diag(w_1^(m), …, w_n^(m)).

Step 3: Stop the iterations when the conditions |θ_0^(m+1) − θ_0^(m)| < ε, |θ_l^(m+1) − θ_l^(m)| < ε (l = 1, …, q), and |σ^(m+1) − σ^(m)| < ε hold. Here, ε > 0 is a predetermined small constant.
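A compact R sketch of this IRA scheme, with the Step 2 updates written as weighted least squares as above, is given below; the choice of LS-based initial values and all object names are illustrative assumptions, not prescriptions from the paper.

```r
# Iteratively reweighting algorithm (IRA) for LTS regression.
# y: response vector, X: n x q design matrix (without the intercept column),
# p: known shape parameter, eps: convergence tolerance.
ira_lts <- function(y, X, p, eps = 1e-6, maxit = 500) {
  k  <- 2 * p - 3
  n  <- length(y)
  Xs <- cbind(1, X)                                   # add intercept column
  b  <- solve(t(Xs) %*% Xs, t(Xs) %*% y)              # Step 1: LS starting values
  s  <- sqrt(sum((y - Xs %*% b)^2) / (n - ncol(Xs)))
  for (m in seq_len(maxit)) {
    z <- (y - Xs %*% b) / s
    w <- as.vector(1 / (1 + z^2 / k))                 # Step 2: weights
    W <- diag(w)
    b_new <- solve(t(Xs) %*% W %*% Xs, t(Xs) %*% W %*% y)
    r     <- as.vector(y - Xs %*% b_new)
    s_new <- sqrt(2 * p * sum(w * r^2) / (n * k))
    conv  <- max(abs(c(b_new - b, s_new - s))) < eps  # Step 3: convergence check
    b <- b_new; s <- s_new
    if (conv) break
  }
  list(theta = as.vector(b), sigma = s)
}
```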
In this section, we compare the efficiencies of the proposed parameter estimators for the multiple linear regression model with LTS distributed error terms via an extensive Monte Carlo simulation study. We run simulations for q = 1, 2, 3, and 4 separately and observe that similar results are obtained for all q values; therefore, we report the results for q = 3 for the sake of brevity. The ML estimates of the parameters θ_0, θ_1, θ_2, θ_3, and σ are obtained using the GA, IRA, NR, and NM algorithms. It should be noted that the value of the shape parameter p is assumed to be known throughout this section. All computations are conducted in the R statistical software environment. For the GA computations, we use the 'ga' function with the following settings: population size N = 500, mutation probability MP = 0.2, crossover probability CP = 0.8, and elite number E = 4. The maximum number of iterations is taken to be 1000. The other settings are left at their defaults in the 'ga' function. Without loss of generality, we take θ_0 = 0; the elements of the design matrix are randomly generated from the Uniform(0, 1) distribution. The sample sizes are taken to be n = 20, 50, 100, 200, and 500 for the small, moderate, large, very large, and extremely large sample size scenarios, respectively. We strategically choose p = 3 and p = 5 in the LTS distribution to compare the effects of the kurtosis levels γ = 9 and γ = 4.2, respectively. It is known that the LTS distributions with p = 3 and p = 5 are equivalent to the Student's t distributions with 5 and 9 degrees of freedom, respectively. Each Monte Carlo simulation run is independently replicated ⌊100,000/n⌋ times, where ⌊⋅⌋ denotes the greatest integer function. The simulated mean, mean square error (MSE), and deficiency (DEF) criteria are used in the comparisons. The DEF criterion, used to compare the joint efficiencies of the estimators, is

DEF = Σ_{j=0}^{q} MSE(θ̂_j) + MSE(σ̂).

We report the mean, MSE, and DEF values for the ML estimators of the parameters in Table 2 and discuss the simulation results below.

It is seen from Table 2 that the NR algorithm gives non-meaningful values for the small sample size (n = 20) for all shape parameter values. Considering the mean values in Table 2, the biases of all estimators (except NR for n = 20) are negligible. For θ_0: GA performs better than NM and NR, with smaller MSE values for all shape parameter values and all sample sizes; however, IRA outperforms all methods in all cases according to the MSE criterion. GA therefore has the second-best performance in estimating θ_0, with quite small MSE values. For θ_1, θ_2, θ_3, and σ: GA has the best performance among NR, NM, and IRA according to the MSE criterion in all cases. The simulation results show that GA outperforms the NM and NR algorithms according to the DEF criterion, providing the lowest deficiency in all cases. IRA has a slightly smaller deficiency than GA in a few cases, namely n = 100, 200, 500 when p = 5 and n = 500 when p = 3; on the other hand, GA is more efficient than IRA in the remaining cases (60% of the cases). Therefore, GA outperforms IRA according to the DEF criterion in most cases. As a result, GA is preferable to the other algorithms for obtaining the ML estimators of the parameters of the multiple linear regression model with LTS distributed error terms.

In this section, a real-life data set about the Covid-19 pandemic is analyzed to show the implementation of the proposed methodology; see Ref. [10] for the detailed background of the data. The data set consists of measurements of some indicators and indices for 52 countries in the context of the Covid-19 pandemic and is therefore called the Covid-19 data in the rest of this section. The Covid-19 data include n = 52 observations on mortality rates (deaths from Covid-19 per million people), cultural tightness, government efficiency, and growth rates of the virus. We use the multiple linear regression model

Y = θ_0 + θ_1 X_1 + θ_2 X_2 + θ_3 X_3 + ϵ

and obtain the estimated regression equation. Here, the dependent (response) variable Y represents the mortality rate, and the independent (explanatory) variables X_1, X_2, and X_3 represent the cultural tightness, the government efficiency, and the growth rate, respectively. In regression analysis, it is important to identify problems such as heteroscedasticity and multicollinearity and to take preventive action. Since the mortality rates for some countries (e.g., Italy, Spain) are far greater than those of the other countries, the mortality rates have high skewness and variance. Such a response variable in a regression model can lead to heteroscedasticity. To prevent violations of homoscedasticity, we apply a log-transformation to the raw mortality rates, similar to Ref. [10]; the log-transformed mortality rates are used as the response in the regression analysis. Multicollinearity has been checked in Ref. [10] via the variance inflation factors, and no problematic multicollinearity among the explanatory variables was found. To quantify the relationships between the explanatory variables, we give the correlations between them in Fig. 1. The entries on the main diagonal show the distributions of the variables, and the entries below and above the main diagonal display the scatter plots of each pair of variables and the Pearson correlation coefficients, respectively.
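As an illustration of this preprocessing step, a short R sketch is given below; it assumes a data frame covid19 with columns mortality, tightness, gov_eff, and growth, which are hypothetical names, not the variable names of Ref. [10].

```r
# Log-transform the response and inspect the explanatory variables (cf. Fig. 1).
covid19$log_mortality <- log(covid19$mortality)
vars <- c("tightness", "gov_eff", "growth")
pairs(covid19[, vars], lower.panel = panel.smooth)   # scatter-plot matrix
cor(covid19[, vars])                                 # Pearson correlations
```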
To find the regression equation that provides the best fit to the Covid-19 data, we obtain the ML estimates of the model parameters in multiple linear regression using the GA and IRA algorithms, because of their superior performance in the simulation study. We consider two different cases for estimating the shape parameter p of the LTS distribution. In Case 1, it is estimated using the profile likelihood (PL) method given in Ref. [40]. In Case 2, we obtain the estimates of all parameters, including the shape parameter p, simultaneously using GA. To compare the fitting performances of the regression equations based on the ML estimates obtained using the GA and IRA algorithms, we use the AIC (Akaike information criterion) and RMSE (root mean square error) criteria. AIC is a popular measure proposed by Akaike [4] to compare the fitting performances of candidate models according to their log-likelihood values and the number of estimated parameters. The mathematical expression of AIC is

AIC = −2 lnL + 2p,

where p here denotes the number of estimated parameters (not the shape parameter of the LTS distribution). AIC is a useful measure when the number of estimated parameters differs between the compared models; note that the number of estimated parameters is different for Case 1 and Case 2 in our problem. The model having the lowest AIC value and the lowest RMSE value (closest to zero) among the candidate models provides the best fit to the data. To identify the distribution of the residuals, a Q-Q plot of the estimated residuals is given in Fig. 2. It is seen that there are grossly anomalous residuals in the Covid-19 data. Therefore, we set aside the observations corresponding to these outliers (i.e., −7.82, −5.77, and −5.30); see also the box-plot in Fig. 3. After excluding the outliers from the regression analysis, we estimate the residuals obtained from the regression model for the remaining n = 49 observations. A Q-Q plot of these residuals, given in Fig. 4, indicates that the LTS distribution is appropriate. To test whether the LTS distribution is appropriate for the distribution of the error terms, we also use the Kolmogorov-Smirnov (K-S) test, a well-known and widely used goodness-of-fit test. To test the null hypothesis

H_0: the error terms follow the LTS distribution, i.e., F(ϵ) = F_0(ϵ),

the K-S test statistic D_ks is obtained as

D_ks = sup_ϵ |F_n(ϵ) − F_0(ϵ)|,

where F_0(⋅) is the cdf of an LTS distribution with known shape parameter p and F_n(⋅) is the empirical cdf of the error terms. At significance level α, if the calculated value D_ks is greater than the tabulated value d_α given in Ref. [26], or equivalently if the corresponding p-value is less than α, then the null hypothesis H_0 is rejected.
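A sketch of how these criteria can be computed in R for a fitted model is given below; it assumes that res holds the estimated residuals, sigma_hat and p_hat the corresponding estimates, and loglik_hat the maximized log-likelihood, all of which are illustrative names.

```r
# Goodness of fit and model comparison for the fitted LTS regression model.
k  <- 2 * p_hat - 3
nu <- 2 * p_hat - 1
# LTS cdf via its Student's t representation (Section 2)
F0 <- function(e) pt(sqrt(nu / k) * e / sigma_hat, df = nu)
ks   <- ks.test(res, F0)          # Kolmogorov-Smirnov statistic D_ks and p-value
rmse <- sqrt(mean(res^2))         # root mean square error
npar <- 5                         # theta0, ..., theta3 and sigma (Case 1);
                                  # add one for p when estimated jointly (Case 2)
aic  <- -2 * loglik_hat + 2 * npar
```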
The estimates of the regression parameters, the bootstrap standard errors of the corresponding estimates (given in parentheses), the AIC, the RMSE, and the K-S test statistic (D_ks) values are given in Table 3. According to the K-S results based on the GA and IRA estimates of the model parameters, the null hypothesis is not rejected at significance level α = 0.05 for both tests, since the D_ks values are less than the corresponding table value d_{n=49, α=0.05} = 0.17128; see Table 3. This result implies that the distribution of the error terms obtained from the regression equations based on the GA and IRA estimates of the model parameters is LTS. The Q-Q plot given in Fig. 4 supports this result visually. It is clear from Table 3 that the proposed method using GA has lower AIC and RMSE values than IRA in Case 1. Furthermore, GA has much smaller bootstrap standard errors than IRA for all parameter estimates. Therefore, GA is more reliable and preferable than IRA. Additionally, it is evident from Table 3 that estimating the shape parameter with GA (Case 2) is superior to using PL (Case 1) in model fitting, according to the AIC and RMSE results given for Case 1 and Case 2. These conclusions show the superiority of GA. As a result, we advise using GA to obtain the estimates of the parameters (including the shape parameter p) in the multiple linear regression model with LTS distributed error terms. Additionally, to investigate the robustness of the ML estimates based on the GA and IRA methods to outliers, we give the results of the regression analysis for the data including the outliers (i.e., n = 52) in Table 4. It is seen from Table 3 and Table 4 that the model fitting performance of GA is less sensitive to the outliers than that of IRA. The AIC and RMSE values decrease for the outlier-excluded sample for both the GA- and IRA-based models, which is an indication of the negative effect of the outliers on the estimated regression equation; however, the reduction is much bigger for the model based on IRA.

In this study, we focus on the ML estimation of the parameters of the multiple linear regression model when the underlying distribution of the error terms is LTS. It should be noted that the ML estimators of the parameters cannot be obtained analytically. Therefore, we resort to GA and the traditional NR, NM, and IRA algorithms. To improve the performance of GA, we use robust confidence intervals based on the MML estimators of the regression parameters as the search space. We compare the efficiencies of the ML estimators obtained using the mentioned algorithms in terms of the MSE and DEF criteria. Our simulation study shows that GA outperforms the other, traditional algorithms in most of the cases for obtaining the ML estimates. Consequently, we strongly advise using GA to obtain the ML estimates of the parameters of the multiple linear regression model when the error terms have an LTS distribution, because of its superior performance.

References
[1] One-step M-estimators: Jones and Faddy's skewed t-distribution.
[2] Robust estimation with the skew t2 distribution.
[3] A new approach for estimating the parameters of Weibull distribution via particle swarm optimization: an application to the strengths of glass fibre data.
[4] Maximum likelihood identification of Gaussian autoregressive moving average models.
[5] The asymptotics of maximum likelihood and related estimators based on type II censored data.
[6] Robust estimation and testing in one-way ANOVA for Type II censored samples: skew normal error terms.
[7] Estimation and testing in one-way ANOVA when the errors are skew-normal.
[8] Calibration of an urban cellular automaton model by using statistical techniques and a genetic algorithm.
[9] Testing for normality.
[10] The importance of cultural tightness and government efficiency for understanding COVID-19 growth and death rates.
[11] Linear regression analysis to predict the number of deaths in India due to SARS-CoV-2 at 6 weeks from day 0 (100 cases).
[12] Genetic Algorithms in Search, Optimization, and Machine Learning.
[13] Adaptation in Natural and Artificial Systems: an Introduction with Application to Biology, Control and Artificial Intelligence.
[14] Robust Statistics.
[15] Estimation and hypothesis testing in multivariate linear regression models under non-normality.
[16] Multiple linear regression model under non-normality.
[17] Nonnormal regression. I. Skew distributions.
[18] Inference in multivariate linear regression models with elliptically distributed errors.
[19] Forecasting road traffic deaths in Thailand: applications of time-series, curve estimation, multiple linear regression, and path analysis models.
[20] A comparative study for the location and scale parameters of the Weibull distribution with given shape parameter.
[21] Robust statistical modelling using the t distribution.
[22] Normal/independent distributions and their applications in robust regression.
[23] A multiple linear regression model with multiplicative log-normal error term for atmospheric concentration data.
[24] Energy management of hybrid electric vehicles: a review of energy optimization of fuel cell hybrid power system based on genetic algorithm.
[25] Multiple linear regression models for reconstructing and exploring processes controlling the carbonate system of the northeast US from basic hydrographic data.
[26] Table of percentage points of Kolmogorov statistics.
[27] The analysis of variance in cases of non-normal variation.
[28] Modified maximum likelihood method for the robust estimation of system parameters from very noisy data.
[29] Estimating the mean and standard deviation from a censored normal sample.
[30] Estimating the parameters of normal and logistic distribution from censored samples.
[31] Nonnormal regression. II. Symmetric distributions.
[32] Expected values and variances and covariances of order statistics for a family of symmetric distributions (Student's t), Selected Tables in Mathematical Statistics.
[33] A new method of estimation for location and scale parameters.
[34] Estimating parameters in autoregressive models in non-normal situations: symmetric innovations.
[35] A survey of sampling from contaminated distributions.
[36] On the Tiku-Suresh method of estimation.
[37] The generalized secant hyperbolic distribution and its properties.
[38] Estimation and hypothesis testing for a nonnormal bivariate distribution with applications.
[39] Application of genetic algorithm-support vector regression model to predict damping of cantilever beam with particle damper.
[40] Maximum likelihood estimation for the parameters of skew normal distribution using genetic algorithm, Swarm and Evolutionary Computation.
[41] Maximum likelihood and maximum product of spacings estimations for the parameters of skew-normal distribution under doubly type II censoring using genetic algorithm.

Declaration of competing interest: The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.