key: cord-0026002-erc513nq
authors: Miao, Maoxuan; Wu, Jinran; Cai, Fengjing; Wang, You-Gan
title: A Modified Memetic Algorithm with an Application to Gene Selection in a Sheep Body Weight Study
date: 2022-01-15
journal: Animals (Basel)
DOI: 10.3390/ani12020201
sha: 6a7d389978f93a5a467d43950df32e94fbd32978
doc_id: 26002
cord_uid: erc513nq

SIMPLE SUMMARY: Because it lacks exploitation capability, the traditional genetic algorithm (GA) cannot accurately identify the minimal best gene subset. We therefore introduce an improved splicing method into the genetic algorithm to strengthen its exploitation capability and achieve a balance between exploitation and exploration. The resulting method can effectively identify true gene subsets with high probability. Furthermore, a dataset of the body weight of Hu sheep is used to show that the proposed method obtains a better minimal subset of genes within a few iterations than all considered algorithms, including the genetic algorithm and the adaptive best-subset selection algorithm.

ABSTRACT: Selecting the minimal best subset out of a huge number of factors influencing a response is a fundamental and very challenging NP-hard problem: the presence of many redundant genes easily leads to over-fitting, missing an important gene can have an even more detrimental impact on predictions, and exhaustive search is computationally prohibitive. We propose a modified memetic algorithm (MA) based on an improved splicing method to overcome the weak exploitation capability of the traditional genetic algorithm and to reduce the dimension of the predictor variables. The new algorithm accelerates the search for the minimal best subset of genes by incorporating the improved splicing method as a new local search operator. The improvement also rests on two further novel aspects: (a) subsets of genes are updated iteratively by splicing until the loss function can no longer be reduced, which increases the probability of selecting the true subset of genes; and (b) add and del operators based on backward sacrifice are introduced into the splicing method to limit the size of gene subsets. In addition, the mutation operator is replaced by the improved splicing method to enhance exploitation capability, and the initial individuals are improved by it to enhance the efficiency of the search. A dataset of the body weight of Hu sheep was used to evaluate the superiority of the modified MA over the genetic algorithm. According to our experimental results, the proposed optimizer obtains a better minimal subset of genes within a few iterations than all considered algorithms, including the most advanced adaptive best-subset selection algorithm.

In data mining, feature selection is a fundamental strategy for handling "the curse of dimensionality" [1]. With an effective feature selection procedure, redundant and irrelevant features are eliminated to improve the performance of the learning process [2,3]. Further, feature selection approaches can identify small subsets of biologically important genes that are most relevant to the target trait, such as genetic diseases and angiotensin-converting enzyme 2 [4,5].
Thus, in this paper, the improved splicing method is incorporated into the genetic algorithm to achieve a balance between exploitation and exploration. In projects where the identification of SNPs for body weights is required, the GA based on an improved splicing method can find the minimal best subset of genes compared with other heuristic methods, including GA [25], β-hill climbing [26], the salp swarm algorithm [27], the artificial bee colony algorithm [28], the sine cosine optimization algorithm [29], and binary particle swarm optimization [30], as well as adaptive best-subset selection from Zhu et al. [24].

In this section, a modified memetic algorithm is proposed by combining an improved splicing method with a genetic algorithm; that is, the improved splicing method is incorporated into GA to accelerate the search for the minimal best subset of genes. The main advantages of the proposed method are two-fold: (1) it can provide promising subsets of genes from the whole gene subset space through its selection and crossover operators; and (2) compared with the traditional GA, it has strong exploitation capability and recovers the minimal best subset of genes with high probability thanks to the improved splicing method.

The genetic algorithm is a heuristic algorithm based mainly on selection and crossover operators [31]. The selection operator directs GA towards the most promising subsets of genes in the entire gene subset space, while the crossover operator provides exploration capability, which allows GA to escape from sub-optimal locations. The individual representation, initialization, selection operator, and crossover operator are discussed below.

a. Individual Representation: For feature selection, each individual represents a subset of genes and is encoded as a binary vector a_i(t) = [f_1, f_2, ..., f_q], f_k ∈ {0, 1}, where f_k = 0 denotes that the k-th gene is not selected and f_k = 1 denotes that it is selected.

b. Initialization: N individuals are randomly generated to form a population P(0) = [a_1(0), ..., a_i(0), ..., a_N(0)]. The procedure for randomly generating an individual a_i(t) is shown in Algorithm 1. An initial probability P_initial = s/q is defined as the probability that a gene takes the value f_i = 1, where s is the expected size of a subset of genes. Each gene is randomly assigned as follows: f_i = 1 if U(0, 1) < P_initial; otherwise, f_i = 0. This is repeated until all q genes are assigned.

Algorithm 1 Initialization of an individual
1: Input: individual a_i(t) = [], probability P_initial
2: for i = 1, 2, ..., q do
3:   if U(0, 1) < P_initial then f_i = 1 else f_i = 0
4: end for
5: Output: individual a_i(t)

c. Selection Operator: Generally, the higher the quality of a gene subset, the more likely it lies within the most promising region of the entire search space. Therefore, high-quality gene subsets should be given higher survival probabilities. Based on this principle [32], proportional roulette wheel selection is used in this paper. Since error indicators are used as the fitness function, an individual a_i(t) with a lower fitness value F_i is more likely to survive. Each subset of genes is allocated a survival probability that decreases with its fitness value F_i, and roulette wheel sampling based on these survival probabilities is used to select individuals a_i(t).

d. Crossover Operator: The crossover operator has two distinct characteristics [33]: (1) genes common to both parents are preserved in the offspring, meaning that important genes shared by parents that were improved by the improved splicing method are retained in the next generation; and (2) the offspring produced are contained in the region of the search space spanned by the parents, meaning that the offspring remain in the promising area when the parents are located in a promising region of the entire search space. The uniform crossover operator, a general form of single-point and two-point crossover, is used in this paper. Its disruptive effect can be adjusted flexibly through the crossover disruption probability P_0, defined as the probability of using the bit value of the first parent [34].
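To make these operators concrete, the following minimal Python sketch implements the initialization of Algorithm 1, a proportional roulette wheel selection, and the uniform crossover. It is illustrative only: the paper does not show its survival probability formula, so the inverse-fitness weighting below is our assumption, and all function names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_individual(q, s):
    """Algorithm 1: switch each gene on with probability P_initial = s / q,
    so the expected subset size is s."""
    p_initial = s / q
    return (rng.random(q) < p_initial).astype(int)

def roulette_select(population, fitness):
    """Proportional roulette wheel selection for a minimisation problem.
    Assumption: lower error maps to higher survival probability via
    inverse-fitness weights (the paper's exact formula is not shown)."""
    weights = 1.0 / (np.asarray(fitness, dtype=float) + 1e-12)
    probs = weights / weights.sum()
    return population[rng.choice(len(population), p=probs)]

def uniform_crossover(parent_a, parent_b, p0):
    """Uniform crossover: each position uses the first parent's bit with
    disruption probability p0; genes common to both parents are
    automatically preserved in both children."""
    mask = rng.random(parent_a.size) < p0
    child_a = np.where(mask, parent_a, parent_b)
    child_b = np.where(mask, parent_b, parent_a)
    return child_a, child_b
```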
e. The Improved Splicing Method: The splicing method [24], also called the polynomial method, updates subsets of genes iteratively until the loss function can no longer be improved by splicing, which guarantees the separation of unimportant genes from a subset of genes. However, the size of the gene subsets gradually increases over generations, so the minimal best gene subset may no longer be found in later generations. Therefore, we propose an improved splicing method with three advantages: (1) the add and del operators are introduced into the splicing method to limit the maximal size S_max of the gene subsets; the del operator deletes insignificant genes from the active set to reach the desired maximal subset size, while the add operator moves the genes deleted by the del operator into the inactive set; (2) it recovers the true subset of genes with high probability; and (3) it has strong exploitation capability. Since the mutation operator has little exploitation capability and may introduce unimportant or irrelevant genes into a subset, the mutation operator is replaced by the improved splicing method. The improved splicing method consists of five parts: individual segmentation, evaluation, the add and del operators, swap, and merge. We illustrate the details with the example shown in Figure 1; the full procedure is given in Algorithm 2.

Step 1 Individual Segmentation: The individual is split into an active set A, containing the selected genes (f_k = 1), and an inactive set I, containing the unselected genes (f_k = 0).

Step 2 Judgement: If the size of the active set A is greater than the maximal subset size S_max, go to Step 3; otherwise, go to Step 5.

Step 3 Evaluation: The backward sacrifice is used to evaluate the score of each gene in the active set A:

ξ_j = (X_j^T X_j / (2n)) β_j^2, (2)

where n is the number of samples and β_j is the coefficient of the j-th gene. In Figure 1c, the score of gene f_3 is the highest and the score of gene f_6 is the lowest.

Step 4 add and del Operators: The genes with the lowest scores in the active set A are deleted and then added into the inactive set I. The del operator deletes the |A| − S_max genes with the lowest scores from the active set A, and the add operator inserts these |A| − S_max deleted genes into the inactive set I. In Figure 1d, S_max is set to 3; since the score of gene f_6 is the lowest, f_6 is deleted from the active set A and added into the inactive set I.

Step 5 Evaluation: The score of each gene in the active set A is calculated by the backward sacrifice, and the score of each gene in the inactive set I is calculated by the forward sacrifice. The backward sacrifice is given by Equation (2), while the forward sacrifice is

ζ_j = (X_j^T X_j / (2n)) (d_j / (X_j^T X_j / n))^2, (3)

where n is the number of samples and d_j = X_j^T (Y − X_A β_A)/n. Both are filter criteria based on the change in the linear loss function: the larger the change in the loss function, the more significant the gene.
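Both sacrifice scores can be computed directly from a least-squares fit on the active set. The sketch below follows Equations (2) and (3) as reconstructed above from Zhu et al. [24]; the function name and array layout are our own, and a design matrix X with one column per gene is assumed.

```python
import numpy as np

def sacrifices(X, y, active, inactive, beta_active):
    """Backward sacrifice xi_j (active genes) and forward sacrifice
    zeta_j (inactive genes) from Equations (2)-(3).
    X: n-by-q design matrix; beta_active: fitted coefficients on A."""
    n = X.shape[0]
    resid = y - X[:, active] @ beta_active
    # Backward sacrifice: loss increase if gene j were removed from A.
    xi = {j: (X[:, j] @ X[:, j]) / (2 * n) * beta_active[pos] ** 2
          for pos, j in enumerate(active)}
    # Forward sacrifice: loss decrease if gene j were added to A.
    zeta = {}
    for j in inactive:
        xj = X[:, j]
        d_j = xj @ resid / n
        zeta[j] = (xj @ xj) / (2 * n) * (d_j / (xj @ xj / n)) ** 2
    return xi, zeta
```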
Step 6 Swap: The k genes with the lowest scores in the active set A form the set A_k, and the k genes with the highest scores in the inactive set I form the set I_k. The sets A_k and I_k are then swapped with each other. The parameter k is less than or equal to min(|A|, |I|); in order to find the best minimal subset of genes, the optimal k is searched over the range {1, 2, ..., min(|A|, |I|)} by grid search. For example, in Figure 1 the parameter k is 1: the score of gene f_5 is the lowest in the active set and the score of gene f_2 is the highest in the inactive set according to Step 5, so the sets A_k = {f_5} and I_k = {f_2} are obtained and swapped with each other, yielding the new active and inactive sets.

Step 7 Update: The active set A is updated by repeating Steps 5-6 until the loss function L = (1/(2n))||Y − X_A β_A||_2^2 cannot be improved. Then go to Step 8.

Step 8 Merge: The updated active set A and inactive set I are merged to form a new individual, as shown in Figure 1f.

Algorithm 2 The improved splicing method
1: Input: individual a_i(t), maximal subset size S_max, threshold τ
2: Split the individual into active set A and inactive set I
3: if |A| > S_max then
4:   Calculate the score of each gene in active set A in terms of backward sacrifice
5:   Delete the |A| − S_max genes with the lowest scores from active set A to obtain a new active set A; add the deleted genes into inactive set I to obtain a new inactive set I
6: end if
7: repeat
8:   Set L_0 = L = (1/(2n))||Y − X_A β_A||_2^2
9:   Calculate the score ξ_j of each gene in active set A in terms of backward sacrifice
10:   Calculate the score ζ_j of each gene in inactive set I in terms of forward sacrifice
11:   for k = 1, 2, ..., min(|A|, |I|) do
12:     Swap the k lowest-scoring genes of A with the k highest-scoring genes of I to obtain Ã
13:     Calculate the loss function L_n = (1/(2n))||Y − X_Ã β_Ã||_2^2
14:     if L_n < L then update A, I, and L = L_n
15:     end if
16:   end for
17: until L_0 − L < τ
18: Merge active set A with inactive set I to generate a new individual A
19: Output: a new individual A
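A compact Python sketch of Algorithm 2 is given below. It reuses the sacrifices helper from the previous sketch and fits the active-set coefficients by ordinary least squares; the swap-acceptance rule (keep a swap when it lowers the loss by more than τ) is our reading of lines 7-17, so this is an outline under stated assumptions rather than the authors' implementation.

```python
import numpy as np

def fit_active(X, y, active):
    """Least-squares fit on the active genes; returns coefficients and
    the loss L = ||y - X_A beta_A||^2 / (2n)."""
    n = X.shape[0]
    beta, *_ = np.linalg.lstsq(X[:, active], y, rcond=None)
    loss = np.sum((y - X[:, active] @ beta) ** 2) / (2 * n)
    return beta, loss

def improved_splicing(X, y, individual, s_max, tau):
    """Apply the improved splicing method to one binary individual."""
    q = X.shape[1]
    active = list(np.flatnonzero(individual))
    inactive = list(np.flatnonzero(individual == 0))
    beta, loss = fit_active(X, y, active)
    # del/add operators: trim the active set down to s_max genes.
    if len(active) > s_max:
        xi, _ = sacrifices(X, y, active, inactive, beta)
        ranked = sorted(active, key=lambda j: xi[j])
        n_drop = len(active) - s_max
        inactive += ranked[:n_drop]          # add: dropped genes join I
        active = ranked[n_drop:]             # del: keep the s_max best
        beta, loss = fit_active(X, y, active)
    improved = True
    while improved:                          # repeat ... until L0 - L < tau
        improved = False
        xi, zeta = sacrifices(X, y, active, inactive, beta)
        a_sorted = sorted(active, key=lambda j: xi[j])       # lowest first
        i_sorted = sorted(inactive, key=lambda j: -zeta[j])  # highest first
        for k in range(1, min(len(active), len(inactive)) + 1):
            trial = (set(active) - set(a_sorted[:k])) | set(i_sorted[:k])
            new_beta, new_loss = fit_active(X, y, sorted(trial))
            if loss - new_loss > tau:        # accept the improving swap
                active = sorted(trial)
                inactive = sorted(set(range(q)) - trial)
                beta, loss = new_beta, new_loss
                improved = True
                break
    new_individual = np.zeros(q, dtype=int)
    new_individual[active] = 1               # merge A and I back together
    return new_individual
```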
The modified memetic algorithm, a genetic algorithm based on an improved splicing method, is a modified version of GA. To accelerate the search for the minimal best subset of genes, the improved splicing method is used as a new local search operator to improve the starting points, and the mutation operator is replaced with the improved splicing method to enhance exploitation capability, thereby achieving a balance between exploitation and exploration in GA. In addition, an elitist operator is introduced into GA to prevent the loss of important gene subsets. The proposed algorithm consists of the following steps, and the flow chart is shown in Figure 2.

Step 1 Initialization: Set t = 1 and the maximal number of generations T. N subsets of genes a_i(t), i = 1, ..., N, are randomly generated to form a population P(t) = [a_1(t), ..., a_N(t)]; each subset of genes is generated according to Algorithm 1.

Step 2 Improved Splicing Method: The improved splicing method (Algorithm 2) is used to improve each subset of genes a_i(t) in P(t), i = 1, 2, ..., N.

Step 3 Evaluation: Calculate the fitness value F_i of each subset of genes a_i(t), i = 1, 2, ..., N, in the initial population P(t).

Step 4 New Population: Create an empty new population NP = [].

Step 5 Elitist Operator: The best subset of genes B(t) in the current generation is added directly into the new population: NP = [B(t), NP].

Step 6 Selection Operator: Proportional roulette wheel selection is used to select parents a_i(t) and a_j(t) from the current population P(t).

Step 7 Crossover Operator: Parents a_i(t) and a_j(t) are recombined by uniform crossover to produce offspring o_i(t) and o_j(t).

Step 8 Improved Splicing Method: The improved splicing method (Algorithm 2) is used to improve the offspring o_i(t) and o_j(t), which are then added into the new population: NP = [o_i(t), o_j(t), NP].

Step 9 Repeat: Repeat Steps 6-8 until (N + 1) individuals have been produced.

Step 10 Evaluation: Set t = t + 1 and P(t) = NP as the population of the next generation; calculate the fitness value F_i of each subset of genes a_i(t), i = 1, 2, ..., N, in the new population P(t).

Step 11 Stopping Criterion: Repeat Steps 4-10 until the maximum number of generations T is reached or there is no improvement over 5 consecutive generations. Then go to Step 12.

Step 12 Output: Output the best subset of genes.

To assess the performance of the proposed optimizer, seven well-known feature selection methods are compared with it: adaptive best-subset selection (ABSS) [24], the genetic algorithm (GA) [25], binary particle swarm optimization (BPSO) [30], the binary salp swarm algorithm (BSWA) [27], the sine cosine optimization algorithm (SCA) [29], the artificial bee colony algorithm (ABC) [28], and β-hill climbing [26].

In meat production, the body weight (BW) of sheep is a key economic trait [35]. As pointed out by Cao et al. [36], BWs measured at birth and other life stages are major indicators of productivity, health, and preventive management. In genetics, some SNPs are associated with BW, and identifying these SNPs can improve the efficiency of sheep breeding programs. However, selecting the significant SNPs among sheep genes is an NP-hard problem [37]. Here, to investigate the effectiveness of the proposed method, we conduct experiments on a dataset of three body weight measures of 240 Hu sheep: birth weight, six-month weight, and weaning weight. The dataset is available under GEO accession number GSE152717 [36]. We exclude genes containing missing values; brief dataset characteristics for the three types of body weight of Hu sheep are shown in Table 1. The mean squared error (MSE) is used as the fitness function, and support vector regression (SVR) [38] is used as the regression model to predict the body weights at the three different times.
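As a sketch of this fitness evaluation, the function below trains an ε-SVR with a Gaussian kernel on the selected genes only and returns the MSE on held-out samples, using scikit-learn. Mapping the paper's σ to scikit-learn's gamma parameter is our assumption, as is the infinite penalty for an empty subset.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error

def fitness(individual, X_train, y_train, X_test, y_test, C, epsilon, sigma):
    """Fitness of one binary individual: MSE of an epsilon-SVR with a
    Gaussian (RBF) kernel, trained on the selected genes only.
    Assumption: sigma is passed as scikit-learn's gamma parameter."""
    cols = np.flatnonzero(individual)
    if cols.size == 0:
        return np.inf                      # empty subsets are invalid
    model = SVR(kernel="rbf", C=C, epsilon=epsilon, gamma=sigma)
    model.fit(X_train[:, cols], y_train)
    pred = model.predict(X_test[:, cols])
    return mean_squared_error(y_test, pred)
```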
For the body weight measured on each occasion, the hyper-parameters of the proposed method are set as follows: the number of generations T = 30; the number of individuals N = 20; the threshold τ = 0.01|A| log(p) log(log n)/n [24], where |A| is the size of the active set, n is the number of training samples, and p is the number of features. The parameters of SVR, including the penalty parameter C, the ε in the ε-insensitive loss function, and the σ in the Gaussian kernel function, are shown in Table 2. The expected size of the subset of genes is s = 2000. For the maximal subset size S_max and the crossover disruption probability P_0, grid search is used to find the optimal combination, with P_0 ranging over {0.1, 0.2, 0.3, 0.4, 0.5} and S_max over {10, 20, 30, 40, 50}. The results are shown in Figure 3: on the birth weight dataset, the MSE is smallest at S_max = 30 and P_0 = 0.5; on the six-month weight dataset, at S_max = 40 and P_0 = 0.5; and on the weaning weight dataset, at S_max = 50 and P_0 = 0.2. The other parameter settings of the proposed method for the body weights on each occasion are reported in Table 2, and the parameter settings for the considered benchmark methods are recorded in Appendix A.
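A sketch of this grid search over (S_max, P_0) is shown below; run_proposed_method is a hypothetical driver that would execute the modified MA once with the given hyper-parameters and return the resulting MSE.

```python
import itertools

def tune_smax_p0(run_proposed_method):
    """Grid search over the 5-by-5 grid of S_max and P_0 used in the paper.
    run_proposed_method is a hypothetical callable: (s_max, p0) -> MSE."""
    best_params, best_mse = None, float("inf")
    for s_max, p0 in itertools.product((10, 20, 30, 40, 50),
                                       (0.1, 0.2, 0.3, 0.4, 0.5)):
        mse = run_proposed_method(s_max, p0)
        if mse < best_mse:
            best_params, best_mse = (s_max, p0), mse
    return best_params, best_mse
```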
To test the performance of the feature selection methods in predicting the body weights on each occasion, the instances are divided into a training set (170 samples) and a test set (70 samples). The training set is used to obtain the relevant subset of genes, while the test set is used to evaluate the MSE of SVR [39]. For a fair comparison, the other heuristic methods also use Algorithm 1 to initialize individuals, MSE as the fitness function, and SVR as the regression model. The experimental results are shown in Table 3, and the main findings are as follows.

In Table 3, the SVR models built with the heuristic methods β-hill climbing, SWA, ABC, SCA, and BPSO significantly reduce the candidate genes to nearly 2000. Furthermore, according to the error indicator (MSE), the performance of these SVR models is very similar; for example, all MSE values for the birth weight are 0.239, so we can conclude that more than 52,000 genes are irrelevant to the birth weight.

In Table 3, SVR combined with the proposed method clearly outperforms SVR combined with ABSS for the body weight of the Hu sheep on each occasion. For the six-month weight, ABSS selects only a few genes and the performance of SVR worsens significantly. This suggests that ABSS may be limited by the correlation-based feature screening in its initial step: it cannot provide a promising subset of genes from the entire gene subset space, whereas the proposed method can, through its selection and crossover operators.

In Figures 4 and 5, the mean and minimum fitness values per generation of the proposed method are lower than those of GA from generation 5 onwards on each of the three body weight datasets. Both the mean and minimum fitness values gradually decrease over the generations in the proposed method, whereas they increase in GA, on each of the three datasets. This shows that the mutation operator lacks exploitation capability while the improved splicing method has strong exploitation capability. In addition, in Figure 5, a minimal fitness value is found within a few generations, which means that the initial starting points and the offspring improved by the improved splicing method enhance the efficiency of the search. Furthermore, in Table 3, the subset of genes selected by the proposed method is not only of higher quality than that selected by GA but also smaller, which shows that embedding the improved splicing method into GA helps find the minimal best subset of genes.

In Figure 4, the mean fitness remains almost unchanged after 5 generations in the other state-of-the-art optimization algorithms, while it continues to decrease in the proposed method. This shows that the improved splicing method prevents premature convergence and identifies the best gene subsets, whereas the other optimization algorithms easily settle on sub-optimal gene subsets. Furthermore, in Table 3, SVR combined with the proposed method outperforms SVR combined with the other heuristic methods (β-hill climbing, SWA, ABC, SCA, and BPSO), and the gene subset selected by the proposed method is smaller than those selected by these heuristic algorithms. We can therefore conclude that the proposed method is a successful heuristic algorithm for finding the minimal best subset of genes in gene selection problems.

The experimental results show that the proposed method yields minimal best gene subsets compared with the other feature selection methods. Table 4 lists the best subsets of genes obtained by the proposed method for the body weight on the three occasions, and Figure 6 displays a heatmap of the actual expression profiles of these subsets; the heatmap describes the degree of similarity and dissimilarity among the selected genes and the sheep.

Table 4. Selected genes obtained by the proposed method for the body weight on the three occasions.

Figure 6. The heatmap of the actual expression profiles for the best subset of genes obtained from the proposed method.

The non-parametric Friedman test is used to check whether there is any statistically significant difference among the 8 feature selection methods. The average rank of each algorithm is shown in Table 5; as the table shows, the proposed method ranks first. The Iman and Davenport statistic F_F [40] is calculated as 5.579, which is much larger than the critical value F_0.1(7, 14) = 2.19. This means that the null hypothesis is rejected, i.e., there are significant differences among the eight feature selection methods. The Nemenyi post hoc test with Holm's procedure [40] is then employed to identify significant differences between the proposed method and the other feature selection methods. As shown in Table 6, there are significant differences between the proposed method and GA and between the proposed method and BPSO.
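This comparison can be reproduced with scipy's friedmanchisquare, followed by the standard Iman-Davenport conversion F_F = (N−1)χ²_F / (N(k−1) − χ²_F). The table layout assumed below (one row per dataset, one column per method) is ours.

```python
import numpy as np
from scipy.stats import friedmanchisquare

def iman_davenport(mse_table):
    """Friedman chi-square and Iman-Davenport F statistic.
    mse_table: N datasets (rows) by k methods (columns)."""
    n_datasets, k = mse_table.shape
    chi2, _ = friedmanchisquare(*mse_table.T)   # one sample per method
    f_f = (n_datasets - 1) * chi2 / (n_datasets * (k - 1) - chi2)
    return chi2, f_f   # compare f_f against F((k-1), (k-1)(N-1))
```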
In this paper, a modified memetic algorithm, a genetic algorithm based on an improved splicing method, has been proposed for gene selection problems. Unlike the traditional genetic algorithm, the optimizer accelerates the search for the minimal best subset of genes. It retains the ability of the crossover and selection operators to provide promising gene subsets from the entire gene subset space, and the improved splicing method reduces the size of the promising gene subsets and recovers the true subset of genes with high probability. The initial points are improved by the improved splicing method to enhance the efficiency of the search, and the mutation operator is replaced by the improved splicing method to enhance exploitation capability, achieving a balance between exploration and exploitation in GA. Therefore, the proposed optimizer can effectively find the best minimal subset of genes out of thousands of genes. Moreover, using the body weights on each of the three occasions, we have demonstrated that our modified memetic algorithm finds a better minimal subset of genes than all considered algorithms, including ABSS. In addition, the proposed optimizer can be generalized to other high-dimensional optimization problems [41,42].

The data used to support the findings of this study are available from Wang, Y.-G. upon reasonable request. The authors declare no conflicts of interest.

Appendix A. The parameter settings of SVR for analyzing the body weights are as follows: for the birth weight, C = 0.1, ε = 0.01, σ = 10^(−5); for the six-month weight, C = 1, ε = 0.01, σ = 0.01; for the weaning weight, C = 0.1, ε = 0.01, σ = 10^(−5). The crossover disruption probability P_0 of GA for the body weights is set as follows: P_0 = 0.5 for the birth weight; P_0 = 0.5 for the six-month weight; P_0 = 0.2 for the weaning weight. The parameter settings of the other feature selection methods for the body weights are reported in Table A1.

Table A1. The parameter settings of the other feature selection methods in analyzing the body weights.

References
[1] Whale optimization approaches for wrapper feature selection
[2] Feature Selection for Knowledge Discovery and Data Mining
[3] Genomic prediction of breeding values using a subset of SNPs identified by three machine learning methods
[4] Gene selection using hybrid dragonfly black hole algorithm: A case study on RNA-seq COVID-19 data
[5] Gene selection using intelligent dynamic genetic algorithm and random forest
[6] Distributed reliefF-based feature selection in spark
[7] A cancer gene selection algorithm based on the KS test and CFS
[8] Significance tests for analyzing gene expression data with small sample sizes
[9] A hybrid gene selection algorithm based on interaction information for microarray-based cancer classification
[10] A hybrid feature selection method for complex diseases SNPs
[11] An efficient hybrid filter-wrapper metaheuristic-based gene selection method for high dimensional datasets
[12] Wrapper method for feature selection to classify cardiac arrhythmia
[13] A multi-objective heuristic algorithm for gene expression microarray data classification
[14] A study on metaheuristics approaches for gene selection in microarray data: algorithms, applications and open challenges
[15] An improved firefly algorithm for global continuous optimization problems
[16] Masoudi-Nejad, A. A hybrid gene selection algorithm for microarray cancer classification using genetic algorithm and learning automata
[17] Exploration and exploitation in evolutionary algorithms: A survey
[18] Hybrid genetic algorithms for feature selection
[19] Balancing exploration and exploitation in memetic algorithms: A learning automata approach
[20] Recursive memetic algorithm for gene selection in microarray data
[21] Wrapper-filter feature selection algorithm using a memetic framework
[22] Deluge based genetic algorithm for feature selection
[23] Feature selection for handwritten word recognition using memetic algorithm
[24] A polynomial algorithm for best-subset selection problem
[25] Genetic algorithm based feature selection approach for effective intrusion detection system
[26] Abu Doush, I. Binary β-hill climbing optimizer with S-shape transfer function for feature selection
[27] A new binary salp swarm algorithm: Development and application for optimization tasks
[28] A novel binary artificial bee colony algorithm
[29] Sine cosine optimization algorithm for feature selection
[30] New fitness functions in binary particle swarm optimisation for feature selection
[31] Genetic algorithm
[32] Genetic algorithm performance with different selection strategies in solving TSP
[33] Fitness landscapes and memetic algorithm design. New Ideas Optim
[34] A formal analysis of the role of multi-point crossover in genetic algorithms
[35] Review on genomic regions and candidate genes associated with economically important production and reproduction traits in sheep (Ovis aries)
[36] Genome-wide association study of body weights in Hu sheep and population verification of related single-nucleotide polymorphisms
[37] Machine learning supervised algorithms of gene selection: A review
[38] A tutorial on support vector regression
[39] Feature selection: A data perspective
[40] Statistical comparisons of classifiers over multiple data sets
[41] Profile-guided three-phase virtual resource management for energy efficiency of data centers
[42] Improved grey model by dragonfly algorithm for Chinese tourism demand forecasting