key: cord-0902765-aomrcj22
authors: Khalid, Asmaa M.; Hamza, Hanaa M.; Mirjalili, Seyedali; Hosny, Khalid M.
title: BCOVIDOA: A novel Binary Coronavirus Disease Optimization Algorithm for feature selection
date: 2022-04-18
journal: Knowl Based Syst
DOI: 10.1016/j.knosys.2022.108789
sha: 9a4bd82d204b45f8283cb42ed9fdcadac88a1fd7
doc_id: 902765
cord_uid: aomrcj22

The increased use of smart devices such as mobile devices, Internet of Things devices, cameras, and microphones has produced big data. Large data dimensionality, redundancy, and irrelevance are challenging problems. Feature selection is a necessary process to select the optimal subset of features. In this paper, the authors propose a novel Binary Coronavirus Disease Optimization Algorithm (BCOVIDOA) for feature selection, where the Coronavirus Disease Optimization Algorithm (COVIDOA) is a new optimization technique that mimics the replication mechanism used by Coronavirus when hijacking human cells. The performance of the proposed algorithm is evaluated using twenty-six standard benchmark datasets from the UCI Repository. The obtained results are compared with nine recent well-known feature selection algorithms. The experimental results demonstrate that the proposed BCOVIDOA significantly outperforms the existing algorithms in terms of accuracy, best cost, average cost (AVG), standard deviation (STD), and size of selected features. Additionally, the Wilcoxon rank-sum test is calculated to prove the statistical significance of the results.

With the rapid use of computer and internet technologies, immense quantities of data with hundreds of features are produced. In data mining, useful information must be extracted from such big data to support decision-making. Selecting only the relevant and useful features would have a significant effect in many applications such as text mining [1], image processing [2], and bioinformatics [3].

The virus uses the spike protein on its surface as a key to enter a human cell. Once inside the cell, the virus contents (RNA) are released into the cell cytoplasm [32]. The virus uses the ribosomal frameshifting technique for replication [31]. Frameshifting is a process in which a specific reading frame of an RNA molecule shifts to another reading frame to produce a new protein sequence [32, 34, 35]. Frameshifting results in the creation of several viral proteins. In the proposed algorithm, a solution (virus particle) is selected for replication. The frameshifting technique produces several viral proteins that are then combined to form a new virus particle (solution). The most popular type of frameshifting is +1 frameshifting [36], which can be modeled as follows: the parent solution's values are shifted to the right by 1, and the value in the first position is set to a random value in the range [minVal, maxVal]:

$S_k(1) = \mathrm{rand}(minVal, maxVal)$   (1)
$S_k(2{:}D) = P(1{:}D-1)$   (2)

where minVal and maxVal are the minimum and maximum values of the variables in each solution, $S_k$ is the $k$th generated viral protein, $P$ is the parent solution, and $D$ is the problem dimension.

Coronavirus accumulates random mutations to escape from the human immune system [37]. In the proposed algorithm, a mutation operator is applied to the solution created in the previous step to generate a new mutated solution as follows:

$Z_i = \begin{cases} r & \text{if } \mathrm{rand}(0,1) < MR \\ X_i & \text{otherwise} \end{cases}$

where $X_i$ and $Z_i$ are the $i$th elements of the solution before and after mutation, $r$ is a random value within the variable bounds, and $MR$ is the mutation rate. Finally, the new virion is released, trying to hijack new healthy cells. The replication lifecycle of Coronavirus is shown in Figure 2, and the flowchart of COVIDOA is shown in Figure 3.
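To make the replication step concrete, the following is a minimal Python sketch of the +1 frameshifting and mutation operators described above, assuming real-valued solution vectors. The rule for combining the generated proteins into one offspring is not specified in this excerpt, so a per-position random pick between the two proteins is used purely as an illustrative placeholder; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng()

def frameshift_plus1(parent, min_val, max_val):
    """+1 frameshifting (Eqs. (1)-(2)): shift the parent right by one
    position and place a random value in the first position."""
    protein = np.empty_like(parent)
    protein[0] = rng.uniform(min_val, max_val)   # S_k(1) = rand(minVal, maxVal)
    protein[1:] = parent[:-1]                    # S_k(2:D) = P(1:D-1)
    return protein

def mutate(solution, mutation_rate, min_val, max_val):
    """Replace each variable by a random value with probability MR."""
    mutated = solution.copy()
    flip = rng.random(solution.shape) < mutation_rate
    mutated[flip] = rng.uniform(min_val, max_val, size=flip.sum())
    return mutated

# One replication step: generate numOfProtiens proteins from a parent,
# combine them (placeholder: pick each position from a random protein),
# then apply the mutation operator to obtain the new virus particle.
parent = rng.uniform(0.0, 1.0, size=10)
proteins = np.stack([frameshift_plus1(parent, 0.0, 1.0) for _ in range(2)])
choice = rng.integers(0, proteins.shape[0], size=parent.size)
offspring = mutate(proteins[choice, np.arange(parent.size)], 0.1, 0.0, 1.0)
```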
The parameters of COVIDOA are described as follows:
 PopNo: the number of solutions in the initial population.
 MR: the mutation rate. As mentioned in [35], the mutation rate of Coronavirus is 1×10⁻⁶, which is very low; however, the mutation rate in the proposed algorithm is set to a larger value in the range [0.001, 0.1], which helps in exploring new promising regions and avoiding getting stuck in a local minimum.
 Shifting Number: the number of positions by which the variables of each solution are shifted in the frameshifting technique. The most common type of frameshifting is +1 frameshifting, which uses a shifting number of 1.
 numOfProtiens: the number of viral proteins generated from one virus particle in the replication process. Each virus particle can generate millions of viral proteins; however, we set it to 2 proteins to avoid computational complexity.

As mentioned in the coming sections, parameter tuning is applied to test the impact of changing parameter values on the performance of the proposed algorithm.

In binary problems such as feature selection, each solution is represented by a one-dimensional vector that contains only zeros and ones, where 1 indicates that the feature is selected and 0 indicates that it is ignored. The number of elements in the vector equals the number of features in the dataset. The binary representation of a COVIDOA solution for a dataset with D features is shown in Figure 4.

In the initial stage of the algorithm, a population pop of randomly generated agents is created, where each agent represents a solution to the problem. The initial population is generated as follows:

$pop_i = LB + r \times (UB - LB)$

where $pop_i$ is the solution at the $i$th location in the population pop, $r$ is a random value between 0 and 1, and $UB$ and $LB$ are the upper and lower boundaries of the problem. However, in the binary version, $UB = 1$ and $LB = 0$. In the proposed binary algorithm, each solution is converted into its binary representation using a binarization technique. The sigmoid function is one of the most used transfer functions belonging to the S-shaped family [36]. It maps each value in the real-valued solution to a value of 0 or 1:

$S(x) = \frac{1}{1 + e^{-x}}$

The curve of the sigmoid function is shown in Figure 5 (a code sketch of this binarization step is given after the classifier description below).

The KNN classifier (with K = 5) is used to obtain the classification accuracy of each solution for the following advantages [39, 40]:
 It is straightforward to implement. Only two parameters are required, i.e., the parameter K, which represents the number of neighbors, and the distance function (e.g., Euclidean or Manhattan).
 It is a memory-based approach that allows the algorithm to respond quickly to changes in the input [41].
 It is efficient in finding the best subset of features and highly robust to noisy data.
 It gives the user the flexibility to choose the distance metric while building the model (e.g., Euclidean distance, Hamming distance, Manhattan distance, etc.).

The role of the KNN classifier is to assign each data point to the class to which most of its K closest neighbors belong. Each dataset is divided into training, validation, and testing sets using cross-validation in the same way as in [42]. K−1 folds are used for training and validation, and the remaining fold is utilized for testing.
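As referenced above, here is a minimal sketch of the sigmoid-based binarization step. It assumes the common convention from S-shaped binary metaheuristics of comparing the transfer value against a uniform random number; the exact thresholding rule is not spelled out in this excerpt, and all names are illustrative.

```python
import numpy as np

def sigmoid(x):
    """S-shaped transfer function S(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def binarize(solution, rng=None):
    """Map a real-valued COVIDOA solution to a 0/1 feature mask.
    A feature is selected (1) when a uniform random number falls below
    the sigmoid of the corresponding variable (assumed thresholding rule)."""
    rng = rng or np.random.default_rng()
    probs = sigmoid(np.asarray(solution, dtype=float))
    return (rng.random(probs.shape) < probs).astype(int)

# Example: a 10-dimensional real-valued solution -> binary feature mask
mask = binarize(np.random.uniform(-4.0, 4.0, size=10))
print(mask)  # e.g. [1 0 1 1 0 0 1 0 1 1]
```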
The KNN classifier works as follows. For each sample in the test set, the algorithm determines its K closest neighbors from the training set by computing the Euclidean distance:

$dist(x, y) = \sqrt{\sum_{j=1}^{d} (x_j - y_j)^2}$

where d is the number of features in a given dataset. As shown in Figure 6, the classifier assigns the unknown sample to class B (when K = 3) because two of its three closest points belong to class B. The classification accuracy determines how accurate the class prediction of the classifier is and is obtained by dividing the number of correctly classified instances by the total number of instances in the dataset. Conversely, the classification error rate is obtained by dividing the number of incorrectly classified instances by the total number of instances.

The classification accuracy rate is calculated using the KNN classifier to evaluate each solution, where the best solution is the one with the highest accuracy rate. However, classification accuracy is not the only measure used to compare solutions; an additional objective is needed, namely the number of selected features. When two solutions have the same classification accuracy, the one with the smaller number of selected features is better. Therefore, the fitness function maximizes the classification accuracy rate (i.e., minimizes the classification error) and minimizes the number of selected features as follows:

$Fitness = \alpha\, \gamma_c + (1 - \alpha)\, \frac{S}{N}$

where α is a value between 0 and 1, $\gamma_c$ is the error rate of the classifier, S represents the number of selected features, and N is the total number of features (a code sketch of this fitness function is given later in this section).

As mentioned in [31], virus particles (solutions) use the frameshifting technique to produce multiple protein sequences, which are then combined to form a new particle (solution). In the original COVIDOA, the +1 frameshifting technique is applied to a solution by shifting its variables to the right by 1 and assigning the first position a random value in the range [minVal, maxVal], as in equations (1) and (2). In the binary version, the frameshifting technique is applied as follows:

$S_k(1) = \mathrm{round}(\mathrm{rand}(0,1)), \quad S_k(2{:}D) = P(1{:}D-1)$

where $S_k$ refers to the $k$th generated protein, P is the parent solution, D is the problem dimension, and rand(0,1) is a random value between 0 and 1. After replication, mutation in the binary version can be applied as follows:

$Z_i = \begin{cases} \mathrm{round}(\mathrm{rand}(0,1)) & \text{if } \mathrm{rand}(0,1) < MR \\ X_i & \text{otherwise} \end{cases}$

where X is the solution before mutation, Z is the mutated solution, and $X_i$ and $Z_i$ are the $i$th elements of the old and new solutions.

This section presents the results of the proposed algorithm and the comparisons with the state-of-the-art algorithms. The proposed and state-of-the-art algorithms were run on a laptop with the following specifications: Intel(R) Core(TM) i7-1065G7 CPU, 8 GB RAM, Windows 10 operating system, and the MATLAB R2016a development environment. We applied the proposed method to 26 different datasets from the UC Irvine Machine Learning Repository [44] to prove the efficiency of the proposed algorithm. Each dataset is described in terms of the number of features, number of instances, number of classes, and the area to which it belongs. We utilized many datasets to ensure the efficiency of the proposed algorithm in feature selection. These 26 datasets are selected based on the variety in dimension size (number of features) and the areas they belong to; they range from small dimensions (9, 11, 13 features) to larger ones (see Table 2).

To test the impact of changing parameter values on the performance of the COVIDOA algorithm, we used nine different scenarios by changing the values of the parameters MR (mutation rate) and numOfProtiens.
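Before turning to the parameter-tuning results, here is a minimal sketch of the wrapper fitness function referenced above, assuming scikit-learn's KNN with K = 5 and cross-validated accuracy. The value α = 0.99 is a common choice in the wrapper feature-selection literature, not a setting stated in this excerpt, and the data and names are illustrative.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def fitness(mask, X, y, alpha=0.99, k=5, folds=5):
    """Wrapper fitness to be minimized: alpha * error + (1 - alpha) * S / N,
    where `mask` is a 0/1 vector over the N features of the dataset."""
    selected = np.flatnonzero(mask)
    if selected.size == 0:                       # empty subsets get the worst score
        return 1.0
    knn = KNeighborsClassifier(n_neighbors=k)    # Euclidean distance by default
    accuracy = cross_val_score(knn, X[:, selected], y, cv=folds).mean()
    error = 1.0 - accuracy                       # gamma_c in the fitness equation
    return alpha * error + (1.0 - alpha) * selected.size / X.shape[1]

# Example (illustrative data): evaluate a random feature mask
X = np.random.rand(150, 13)
y = np.random.randint(0, 3, size=150)
mask = np.random.randint(0, 2, size=X.shape[1])
print(fitness(mask, X, y))
```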
We utilized the values 0.1, 0.01, and 0.001 for MR and 2, 4, and 6 for numOfProtiens, which produces nine scenarios, as shown in Table 3. The feature selection results (for the zoo dataset) of each scenario are shown in Table 4. It is observed that scenario 1 yields the best results, followed by scenario 3, which means that a higher mutation rate and a lower number of proteins improve the performance of the proposed algorithm.

The results of the proposed algorithm are compared to state-of-the-art feature selection algorithms, namely GA [46], DE [47], PSO [48], WOA [24], WOASA [45], GWOPSO [50], HH [51], GWO [52], and AOA [53]. The parameters of these algorithms are selected as suggested by their authors. The comparison is made according to the following measures:

 Classification accuracy: It measures how accurate the classifier is in selecting the optimum subset of features. The maximum classification accuracy over the M runs is calculated as
$Acc_{best} = \max_{n=1,\dots,M} Acc_n$
where $Acc_n$ refers to the accuracy at run n, n = 1, …, M, and M is the number of runs.

 Best cost: The best cost at run n is calculated as
$BestCost_n = \min_{i=1,\dots,T} Cost_i$
where $Cost_i$ is the cost obtained at iteration i, i = 1, …, T, and T is the maximum number of iterations. The best cost obtained over the M runs is
$BestCost = \min_{n=1,\dots,M} BestCost_n$

 Average cost: The average cost at each run n is calculated as
$AvgCost_n = \frac{1}{T} \sum_{i=1}^{T} Cost_i$
where $Cost_i$ refers to the cost at iteration i. The minimum average cost over the M runs is
$AvgCost = \min_{n=1,\dots,M} AvgCost_n$

 Standard deviation (STD): It shows how far the cost values are from the average cost. The STD at run n is calculated as
$STD_n = \sqrt{\frac{1}{T} \sum_{i=1}^{T} (Cost_i - AvgCost_n)^2}$
and the minimum STD over the M runs is
$STD = \min_{n=1,\dots,M} STD_n$

 Selection size: The selection size at run n, $Size_n$, refers to the number of selected features in the optimum solution obtained at run n. The minimum number of features obtained over the M runs is
$Size = \min_{n=1,\dots,M} Size_n$

 Wilcoxon rank-sum test: The null hypothesis [55] is a type of hypothesis widely used in statistics; under this hypothesis, researchers assume that there is no significant difference between the average values of the two compared methods. The Wilcoxon rank-sum test is used to assess the statistical significance of the results: the test results of the datasets are compared at the 5% significance level [54], and a small p-value (typically ≤ 0.05) indicates strong evidence against the null hypothesis.

This section presents the numerical results of the proposed binary COVIDOA and the comparisons with the state-of-the-art algorithms. The results of the proposed binary COVIDOA for feature selection are presented in Figure 13. The figure shows that the proposed algorithm ranks first with a value of 0.0898, followed by the GWOPSO algorithm with a value of 0.01024.

The standard deviation is one of the essential metrics for evaluating the proposed algorithm. Lower standard deviation values indicate that the fitness values are clustered closely around the mean value, proving the proposed algorithm's stability. As shown in Table 9, the proposed algorithm ranks first, as it achieves the minimum STD values in 17 out of 26 datasets and has the minimum average STD value (0.0019) compared to its peers (see Figure 14).

While maintaining the highest classification accuracy, another objective is to achieve the minimum number of selected features. As shown in Figure 15, the proposed algorithm is superior to its peers in terms of selection size for all datasets, which means that it has high size-reduction capabilities.
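The per-run statistics and the significance test above can be computed straightforwardly; the following sketch shows one way to do so with NumPy and SciPy, assuming an M × T array of recorded cost values per algorithm. Array shapes, the placeholder data, and variable names are illustrative.

```python
import numpy as np
from scipy.stats import ranksums

def run_statistics(cost_histories):
    """Summarize M independent runs; `cost_histories` is an M x T array
    with one row per run and one column per iteration."""
    costs = np.asarray(cost_histories, dtype=float)
    best_per_run = costs.min(axis=1)      # best cost reached in each run
    avg_per_run = costs.mean(axis=1)      # average cost of each run
    std_per_run = costs.std(axis=1)       # spread of the costs within each run
    return {
        "best_cost": best_per_run.min(),  # best cost over the M runs
        "avg_cost": avg_per_run.min(),    # minimum average cost over the M runs
        "std": std_per_run.min(),         # minimum STD over the M runs
    }

# Wilcoxon rank-sum test at the 5% significance level: compare the best
# costs obtained by two algorithms over the same number of independent runs.
costs_a = np.random.rand(30, 100)         # placeholder cost histories, algorithm A
costs_b = np.random.rand(30, 100)         # placeholder cost histories, algorithm B
stat, p_value = ranksums(costs_a.min(axis=1), costs_b.min(axis=1))
reject_null = p_value <= 0.05             # significant difference between the two methods
```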
The Wilcoxon rank-sum test is a nonparametric statistical test that compares two paired groups. It calculates the differences between the sets of pairs and analyzes them to establish whether they are statistically significantly different. The test results on the 26 benchmark datasets are compared using the Wilcoxon rank-sum test at the 5% significance level.

Feature selection is a way to eliminate redundant and irrelevant data. This process leads to improved learning accuracy, reduced computational time, and an enhanced understanding of learning models. This paper proposed a binary version of the Coronavirus Disease Optimization Algorithm based on a wrapper method to solve feature selection problems. The proposed algorithm uses the KNN classifier because of its simplicity: it has only two parameters, K and the distance function. Several evaluation metrics are utilized to evaluate the performance, such as classification accuracy, best fitness, average fitness, standard deviation, and selection size. The proposed algorithm is tested on 26 benchmark datasets, and the results are compared with nine well-known metaheuristics. Additionally, the Wilcoxon rank-sum test is applied to demonstrate the significance of the proposed algorithm. The statistical results reveal that the proposed algorithm performs better than the state-of-the-art algorithms, and the convergence curves show that it has a high convergence speed, as it reaches the global optimum rapidly.

Future work may apply the binary COVIDOA with different classifiers such as support vector machines (SVM). It would also be interesting to apply the proposed algorithm to real-world problems such as medical diagnosis, image processing, and industrial applications. Another possible direction is hybridizing COVIDOA with another metaheuristic algorithm such as SA or PSO.

References
[1] Improved feature selection approach TFIDF in text mining.
[2] Wavelet feature selection for image classification.
[3] A review of feature selection techniques in bioinformatics.
[4] Intelligent IoT traffic classification using novel search strategy for fast-based-correlation feature selection in industrial environments.
[5] An industrial Internet of Things feature selection method based on potential entropy evaluation criteria.
[6] Big Data Challenges for the Internet of Things (IoT) Paradigm.
[7] Feature selection: A data perspective.
[8] A hybrid genetic algorithm for feature selection wrapper based on mutual information.
[9] A survey on feature selection methods.
[10] Feature selection for classification.
[11] SVMs classification based two-side cross-domain collaborative filtering by inferring intrinsic user and item features. Knowledge-Based Systems.
[12] A cross-domain collaborative filtering algorithm with expanding user and item features via the latent factor space of auxiliary domains.
[13] Hybridizing exact methods and metaheuristics: A taxonomy.
[14] Binary grey wolf optimization approaches for feature selection.
[15] BCS: A binary cuckoo search algorithm for feature selection.
[16] Binary flower pollination algorithm and its application to feature selection. Recent advances in swarm intelligence and evolutionary computation.
[17] Binary dragonfly algorithm for feature selection.
[18] Using simulated annealing to optimize the feature selection problem in marketing applications.
[19] Using simulated annealing to optimize the feature selection problem in marketing applications.
[20] A new local search based hybrid genetic algorithm for feature selection.
[21] Feature selection with discrete binary differential evolution.
[22] Data feature selection based on Artificial Bee Colony algorithm.
[23] An advanced ACO algorithm for feature subset selection.
[24] S-shaped binary whale optimization algorithm for feature selection. Recent trends in signal and image processing.
[25] BBA: A binary bat algorithm for feature selection.
[26] A novel hybrid genetic algorithm and simulated annealing for feature selection and kernel optimization in support vector regression.
[27] Hybrid whale optimization algorithm with simulated annealing for feature selection.
[28] Binary optimization using hybrid grey wolf optimization for feature selection.
[29] Cancer data classification by quantum-inspired immune clone optimization-based optimal feature selection using gene expression data: deep learning approach. Data Technologies and Applications.
[30] Clustering-based hybrid feature selection approach for high dimensional microarray data.
[31] COVIDOA: A novel evolutionary optimization algorithm based on coronavirus replication lifecycle.
[32] Principles of virus uncoating: cues and the snooker ball.
[33] Ribosomal frameshifting in decoding antizyme mRNAs from yeast and protists to humans: close to 300 cases reveal remarkable diversity despite underlying conservation.
[34] Pharmacological approaches for targeting cystic fibrosis nonsense mutations.
[35] SARS-CoV-2 (COVID-19) by the numbers.
[36] Coronavirus SARS-CoV-2: Analysis of subgenomic mRNA transcription, 3CLpro, and PL2pro protease cleavage sites and protein synthesis.
[37] Comparative genome analysis of novel coronavirus (SARS-CoV-2) from different geographical locations and the effect of mutations on major target proteins: An in-silico insight.
[38] S-shaped versus V-shaped transfer functions for binary particle swarm optimization.
[39] Binary grey wolf optimization approaches for feature selection.
[40] An introduction to kernel and nearest-neighbor nonparametric regression.
[41] KNN classifier with self-adjusting memory for heterogeneous concept drift.
[42] Binary grey wolf optimization approaches for feature selection.
[43] An introduction to kernel and nearest-neighbor nonparametric regression.
[44] C.J. Merz, UCI repository of machine learning databases.
[45] Hybrid whale optimization algorithm with simulated annealing for feature selection.
[46] A new local search based hybrid genetic algorithm for feature selection.
[47] Feature selection with discrete binary differential evolution.
[48] Two-step particle swarm optimization to solve the feature selection problem.
[49] Hybrid whale optimization algorithm with simulated annealing for feature selection.
[50] Binary optimization using hybrid grey wolf optimization for feature selection.
[51] A new quadratic binary Harris hawk optimization for feature selection.
[52] Binary grey wolf optimization approaches for feature selection.
[53] The arithmetic optimization algorithm.
[54] A practical tutorial on the use of nonparametric statistical tests as a methodology for comparing evolutionary and swarm intelligence algorithms.
[55] When null hypothesis significance testing is unsuitable for research: a reassessment.

No conflict of interest exists. No funding was received for this work.