key: cord-0499536-k5m60klt authors: Gong, Rui; Hase, Kazunori title: A Plant Root System Algorithm Based on Swarm Intelligence for One-dimensional Biomedical Signal Feature Engineering date: 2021-07-31 journal: nan DOI: nan sha: 8eb3ea9794837e338c25fd00f2cee70a7aebcda4 doc_id: 499536 cord_uid: k5m60klt

To date, very few biomedical signals have transitioned from research applications to clinical applications. This is largely due to the lack of trust in the diagnostic ability of non-stationary signals. To reach the level of clinical diagnostic application, classification using high-quality signal features is necessary. While there has been considerable progress in machine learning in recent years, especially deep learning, progress has been quite limited in the field of feature engineering. This study proposes a feature extraction algorithm based on group intelligence, which we call the Plant Root System (PRS) algorithm. Importantly, the correlation between features produced by this PRS algorithm and traditional features is low, and the accuracy of several widely used classifiers was found to be substantially improved with the addition of PRS features. It is expected that more biomedical signals can be applied to clinical diagnosis using the proposed algorithm.

One-dimensional signals are the signals most easily collected by machine. Such signals are widely used in industry, agriculture, finance, and, in particular, the medical field. Lately, massive volumes of one-dimensional biomedical signals have been digitized and stored as time-series data on hard disks, and much consideration has been given to how to make effective use of these data. Applying biomedical signals to clinical diagnosis is a frontier multidisciplinary research subject that has great significance for the advancement of medicine [1, 2]. Classification, which is the ability to group objects by their similar features, is one of the primary time-series data processing applications.
Supervised classification, in particular, is considered to perform well, and a robust, well-established classifier built with supervised machine learning techniques can help doctors improve the accuracy of their clinical diagnoses [3, 4]. Developing a powerful classifier requires extensive training. If an untrained classifier can be regarded as an "infant," then feature training can be regarded as the "experience" required for the infant to progress towards adulthood. Before classification, many correct and appropriate features need to be extracted and transformed from signals or images in a process referred to as feature engineering [5]. Feature engineering is a key to success in applied machine learning, as it allows feature sets to become more flexible and reduces overfitting or underfitting in classification models, thus improving classification accuracy [5]. The features of one-dimensional time-series signals, especially biomedical signals, can be roughly divided into time-domain features, frequency-domain features, and nonlinear features [6]. These features appear to be unrelated; however, features derived from operations and transformations of the original signal can be restored to the original signal by inverse operations and inverse transformations. Even when these features are carefully selected, independence is still not satisfied, which introduces bias into the training process. To counteract this anchoring bias, we propose a plant root system (PRS) algorithm based on swarm intelligence that supplies auxiliary features.

Franz et al. [7, 8] hypothesized that the root system of plants is a swarm intelligence system and that such systems are not limited to animals and humans. According to this hypothesis, root apices communicate with each other through electrical signals or chemical pheromones, and then build a massive "brain-like" system in the underground world [9].
Within the complex root system, independent roots acquire water and nutrients through cooperation. At the same time, to support proper growth and development, the root system smartly competes with other root systems and chooses to start a "war" when it has an advantage or retreat when it is at a disadvantage [7, 10]. In this article, the proposed PRS approach mimics the growth of plant roots, where the root apices cooperate to absorb as much of the sustaining nutrients as possible.

In our proposed approach, the base features used by the selected machine-learning model to classify biomedical signals are equivalent to nutrients in the digital soil. These base features are obtained from common feature extraction methods. In addition to these base features, the sum of the nutrients absorbed during growth serves as a first auxiliary feature, and the growing area of the root system serves as a second auxiliary feature. These auxiliary features are referred to as PRS auxiliary features. In the model training process, the most significant advantage of the PRS features based on swarm intelligence is that they are not only ideally independent (i.e., they have low correlation with the other features), but they are also highly correlated with the dependent variable. The independence of the PRS features derives from the brain-like structure of the root system, where the co-determination of root tips reduces the correlation with the base features. When these PRS auxiliary features are appended to the training set, the increase in independence between features results in a decrease in bias and variance and an increase in the accuracy of classification by machine learning. For biomedical signals, an increase in classification accuracy means an increase in the positive diagnosis rate, which ultimately leads to greater use of biomedical signals in the clinical field.

Time-domain feature extraction carries a lower computational burden and has stronger reliability than nonlinear features [11].
Moreover, time-domain features also avoid the risk of spectral leakage caused by signal decomposition and are more dependable than frequency-domain features [12]. Time-domain feature selection is an extremely important process for base feature extraction. The number of features needs to be carefully considered: too few features will limit the growth of the root system, while too many features may adversely affect the performance of the classifier [13]. To comprise the set of base features, we selected 12 features that have relatively low computational complexity and that have been successfully used for classification with a high degree of accuracy [14] [15] [16]. Definitions of the base features are given in Table 1, where $x_i$ represents the one-dimensional biomedical signal in segment $i$, $N$ is the length of the signal, and $\bar{x}$ is the average value of $x$.

Table 1 Base features (names as recovered from the source; the defining formulas are not reproduced): root mean square (RMS), kurtosis (KURT), mean absolute value (MAV), zero crossing (ZC), slope sign change (SSC), Willison amplitude (WAMP), Simple Sign Integral (SSI), non-linear energy (NLE), waveform length (WL).

For the development of a root system, the site where the seed is planted needs nutritious soil. In this study, the distribution of nutrients is equivalent to the distribution of the base features. Each element $f_{i,j}$ of the base feature set $F_{i \times j}$ needs to be sorted. Before this is done, the dataset must be normalized using a scaling technique in the pre-processing stage to eliminate scale differences among the various elements when sorting. In this study, min-max normalization was used to scale the data to the range [0, 1]. The advantage of this method is that it maintains the relationships that exist among the original data [17]. Here, the min-max normalization that yields the normalized feature set $\dot{F}_{i \times j}$ can be written as

$$\dot{f}_{i,j} = \frac{f_{i,j} - \min(f_i)}{\max(f_i) - \min(f_i)}$$

where $i = 1, 2, \cdots, 12$ indexes the features, $j = 1, 2, \cdots, h$ indexes the signal segments, and $f_i$ denotes feature $i$ taken over all segments.
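The pre-processing step described above can be illustrated with a minimal Python sketch. It assumes the commonly used textbook definitions of three of the listed features (RMS, MAV, and WL), since the definitions in Table 1 are not available here, and the function names and the toy segments are hypothetical. It is an illustration under those assumptions, not the authors' implementation.

```python
import math

def rms(x):
    """Root mean square of one signal segment (standard definition, assumed)."""
    return math.sqrt(sum(v * v for v in x) / len(x))

def mav(x):
    """Mean absolute value of one signal segment (standard definition, assumed)."""
    return sum(abs(v) for v in x) / len(x)

def waveform_length(x):
    """Sum of absolute differences between consecutive samples (assumed definition)."""
    return sum(abs(b - a) for a, b in zip(x, x[1:]))

def min_max_normalize(column):
    """Scale one feature column to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

# Toy segments (hypothetical data): one row of features per segment.
segments = [[0.0, 1.0, -1.0, 2.0], [0.5, 0.5, 0.5, 0.5], [3.0, -3.0, 3.0, -3.0]]
feature_table = [[rms(s), mav(s), waveform_length(s)] for s in segments]

# Normalize each feature (column) independently across all segments.
columns = [list(col) for col in zip(*feature_table)]
normalized_columns = [min_max_normalize(col) for col in columns]
```

Normalizing per column rather than over the whole table keeps each feature's internal ordering intact, which is what the later sorting step relies on.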
Feature importance measurement, or feature selection, provides the means by which the initial number of features can be reduced by eliminating those features with a low importance score in order to improve the performance of the classifier [18]. In this study, we chose not to delete features, but rather to sort features based on their worth score. To establish a feature's worth score, we measured the information gain associated with the feature; the features with higher worth scores were placed closer to the center location. Information gain was first introduced for decision trees, where it is used to rank the priority of feature nodes; this was then expanded and applied to our importance measurement [19]. Let the class label of a feature have two distinct values defining binary classes $\{C_1, C_2\}$. $\mathrm{Info}(\dot{F})$, also known as the entropy of $\dot{F}$, can be defined as follows:

$$\mathrm{Info}(\dot{F}) = -\sum_{k=1}^{2} p_k \log_2 p_k$$

where $p_k$ is the nonzero probability that an arbitrary tuple in $\dot{F}$ belongs to class $C_k$ and is estimated by $|C_{k,\dot{F}}| / |\dot{F}|$. Since each feature column $\dot{f}_i$ is a set of discrete sample values, these values need to be split into two sets using an unsupervised algorithm. The K-means clustering method is a good choice for partitioning the given data set into $k$ groups, where $k = 2$ is the number of groups prespecified by the analyst [20]. In this study, this process is repeated 25 times to produce lower within-cluster variation and a more stable result. Each element of $\dot{f}_i$ now belongs to one of the splitting sets $\{S_1, S_2\}$ and has a class label. Then the expected information required to classify a tuple from $\dot{F}$ based on feature $\dot{f}_i$ is

$$\mathrm{Info}_{\dot{f}_i}(\dot{F}) = \sum_{v=1}^{2} \frac{|S_v|}{|\dot{F}|}\,\mathrm{Info}(S_v)$$

and the information gain of the feature is $\mathrm{Gain}(\dot{f}_i) = \mathrm{Info}(\dot{F}) - \mathrm{Info}_{\dot{f}_i}(\dot{F})$. The $\dot{f}_i$ with the highest information gain is placed at the center of $\{\dot{f}_1, \dot{f}_2, \cdots, \dot{f}_{12}\}$, and the other features are sorted, one by one, from the center to the two sides, with decreasing information gain. In this way, a sorted base feature set is obtained.

In this study, the digital soil consisting of the base feature set cannot have only width and not depth.
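The ranking step described above can be sketched in plain Python. This is a hedged illustration, not the authors' code: the two-group split uses a simple deterministic 1-D 2-means (initialized at the minimum and maximum) instead of 25 random restarts, and the rule for alternating sides in `center_out_order` is an assumption, since the text only states that gain decreases from the center outward. All function names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def two_means_split(values, iterations=25):
    """Assign each 1-D value to group 0 or 1 with Lloyd-style 2-means."""
    c0, c1 = min(values), max(values)  # deterministic initial centers (assumption)
    groups = [0] * len(values)
    for _ in range(iterations):
        groups = [0 if abs(v - c0) <= abs(v - c1) else 1 for v in values]
        g0 = [v for v, g in zip(values, groups) if g == 0]
        g1 = [v for v, g in zip(values, groups) if g == 1]
        if g0:
            c0 = sum(g0) / len(g0)
        if g1:
            c1 = sum(g1) / len(g1)
    return groups

def information_gain(values, labels):
    """Entropy of the labels minus the expected entropy after the 2-means split."""
    groups = two_means_split(values)
    expected = 0.0
    for g in (0, 1):
        subset = [lab for lab, grp in zip(labels, groups) if grp == g]
        if subset:
            expected += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - expected

def center_out_order(scores):
    """Feature indices ordered so the best score is central and scores fall off to both sides."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    left, right = [], []
    for rank, idx in enumerate(ranked):
        (right if rank % 2 == 0 else left).append(idx)
    return left[::-1] + right

gain = information_gain([0.1, 0.2, 0.15, 0.9, 0.8, 0.95], ["a", "a", "a", "b", "b", "b"])
order = center_out_order([0.1, 0.9, 0.5, 0.3, 0.7])
```

In the toy call, the feature separates the two classes perfectly, so its gain equals the full label entropy of one bit; `order` places the index of the highest score in the middle position.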
Therefore, continuous feature discretization is necessary (from scalar to vector). In this process, equal-width interval binning is perhaps the most common method of discretizing data to produce nominal values from continuous features [21]. With this binning method, if a feature is observed to have values bounded by $x_{min}$ and $x_{max}$, then the method computes the equal bin width $\delta = (x_{max} - x_{min})/k$ and constructs bin boundaries at $x_{min} + i\delta$, where $i = 1, 2, \cdots, k - 1$. In order to be closer to the natural environment of the root system, we set $k = 15$. In addition, the vertical distribution of soil nutrients should follow, to the extent possible, the laws of nature. Most nutrients are concentrated in the shallowest layer and decrease with depth, since nutrients return to the soil through biological cycling [22]. Thus, the 15 × 12 discrete feature set is arranged in decreasing order from top to bottom. On the other hand, the horizontal distribution of nutrients is associated with crustal movement. Two convolution kernels, denoted here $K_1$ and $K_2$, are used to calculate the reconstituted nutrient distributions. The nutritional reconstructions are divided into two types: $K_1$ affects the nutrient distribution from shallower layers, while $K_2$ affects the nutrient distribution from deeper layers. The convolution kernels are then defined as follows [kernel matrices missing from the extracted text]. The nutritious soil used to generate the root system is a 15 × 12 matrix that results from the convolution of the 15 × 12 discrete feature set with kernels $K_1$ and $K_2$. The target matrix requires zero-padding of size one before each convolution. The completed soil with the mineral nutrition generation process is shown in Fig. 1. Hereinafter, the matrix for the distribution of nutrients in the soil will be referred to as the "nutrient matrix."

Fig. 1 Top-to-bottom order of soil matrix construction. The elements of the base feature set are arranged in order of importance, from the center to either side. Each element in the set is then discretized. Finally, the soil matrix is obtained from two convolution calculations.
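Two of the building blocks named above, equal-width binning with 15 bins and a convolution with zero-padding of size one, can be sketched as follows. This is an illustrative Python sketch under stated assumptions: the kernel values are not recoverable from the text, so `placeholder_kernel` is purely hypothetical, and the sliding-window sum is computed without flipping the kernel (the usual machine-learning convention). How a scalar feature is expanded into a 15-element column is also not specified, so that step is not attempted here.

```python
def equal_width_boundaries(lo, hi, k=15):
    """Interior bin boundaries lo + i*delta for i = 1..k-1, with delta = (hi - lo) / k."""
    delta = (hi - lo) / k
    return [lo + i * delta for i in range(1, k)]

def bin_index(value, lo, hi, k=15):
    """Index (0..k-1) of the equal-width bin that contains value."""
    if hi == lo:
        return 0
    idx = int((value - lo) / (hi - lo) * k)
    return min(max(idx, 0), k - 1)  # clamp so value == hi falls in the last bin

def convolve_same(matrix, kernel):
    """Apply a 3x3 kernel after zero-padding of size one; output has the input shape."""
    rows, cols = len(matrix), len(matrix[0])
    padded = (
        [[0.0] * (cols + 2)]
        + [[0.0] + list(row) + [0.0] for row in matrix]
        + [[0.0] * (cols + 2)]
    )
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            out[r][c] = sum(
                kernel[i][j] * padded[r + i][c + j]
                for i in range(3)
                for j in range(3)
            )
    return out

# Hypothetical kernel: the identity, used only to show the padding behaviour.
placeholder_kernel = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
soil = convolve_same([[1.0, 2.0], [3.0, 4.0]], placeholder_kernel)
```

Because the padding adds one ring of zeros, a 15 × 12 input stays 15 × 12 after each 3 × 3 convolution, which matches the matrix sizes quoted in the text.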
In plants, the radicle, or primary root, is the first organ to emerge from the seed coat. Before the first leaf grows, the energy needed by the cotyledon for root development all comes from the seed itself. Additional inorganic nutrition (including water) is also essential; only a small part of this nutrition comes from the seed itself, with the remainder coming from the surrounding soil containing mineral nutrients [23, 24]. The rule inherent in the genes of higher plants is to develop a sufficiently large first leaf and sufficiently long roots before the energy stored in the seeds is exhausted. The algorithm we propose follows this same rule. Since organic matter cannot be synthesized through photosynthesis before the first leaf has grown, the task is to absorb more mineral nutrients by growing a sufficient root system using limited energy (root division) in a limited time. Hereinafter, the matrix for the distribution of roots in the soil will be referred to as the "root matrix."

In our proposed approach, the above natural processes need to be transformed into a data processing program. First, the root matrix and the nutrient matrix form a 4-dimensional tensor, as shown in Fig. 2. The initial root matrix is a zero matrix; assigning values at the upper center location creates the radicle matrix. To absorb more nutrients, the rule of root division is to preferentially proliferate at the location of the global maximum in the mapped nutrient matrix; at the same time, the new location must neighbor existing root tips. The radicle matrix, the photosynthetic day, and the daily root division are the parameters available for modification. The development process can be characterized by the root distribution area and total nutrient absorption. Fig. 3 shows the natural process and digital process of building a brain-like root system. In Fig. 3, we can see that the growth of digital roots is dependent on the base features.
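The root division rule described above can be sketched as a greedy growth loop. This Python sketch is an interpretation under assumptions, not the authors' procedure: the radicle is a single cell at the top center, "neighboring" means the four edge-adjacent cells, ties are broken toward the shallowest and then leftmost cell, and the function and parameter names are hypothetical.

```python
def grow_roots(nutrients, days, divisions_per_day):
    """Greedy root growth on a nutrient matrix.

    Returns (root_matrix, absorbed_total, occupied_cells_in_growth_order).
    """
    rows, cols = len(nutrients), len(nutrients[0])
    roots = [[0] * cols for _ in range(rows)]

    # Radicle: a single root cell at the upper center (assumed placement).
    start = (0, cols // 2)
    roots[start[0]][start[1]] = 1
    occupied = [start]
    absorbed = nutrients[start[0]][start[1]]

    for _ in range(days * divisions_per_day):
        # Free cells that are edge-adjacent to any existing root cell.
        candidates = {
            (r + dr, c + dc)
            for r, c in occupied
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
            if 0 <= r + dr < rows and 0 <= c + dc < cols
            and roots[r + dr][c + dc] == 0
        }
        if not candidates:
            break  # the soil is fully colonized
        # Richest reachable cell; ties go to the shallowest, then leftmost cell.
        best = max(candidates, key=lambda p: (nutrients[p[0]][p[1]], -p[0], -p[1]))
        roots[best[0]][best[1]] = 1
        occupied.append(best)
        absorbed += nutrients[best[0]][best[1]]

    return roots, absorbed, occupied

toy_soil = [[1, 5, 1], [2, 9, 3], [8, 1, 1]]
root_matrix, nutrition_total, cells = grow_roots(toy_soil, days=2, divisions_per_day=1)
```

In the toy soil, the cell holding 8 is the largest remaining value after the first division, but it is not adjacent to any root cell, so the second division goes to the adjacent cell holding 3. This is the neighbor constraint at work.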
Usefully, by drawing out PRS features from the brain-like root system, correlation with the base features is reduced.

Fig. 3 The natural process and digital process of building brain-like root systems.

The growth process of the roots can be written as pseudocode, as shown in Table 2. In the proposed algorithm, the core steps of root development involve seeking the location of the global maximum of the nutrient matrix map relative to the root matrix location, using a tensor of the base feature set of a biomedical signal. The entire root system makes its decisions jointly in brain-like fashion to improve total nutrient absorption. In this procedure, the root distribution can be drawn in a coordinate system as a polygon. The area of this polygon serves as a feature. For the polygon area calculation, we used the tool developed by Bourke [25], which is one of the best known of such tools. The procedure outputs two results: the total amount of nutrients absorbed during root system development, designated the nutrition feature (NF), and the area of the root distribution, designated the root feature (RF). Both are referred to as PRS features.

In this study, the accuracy of various classifiers is used for evaluation. The key measure for each classifier is the change in accuracy before and after adding the root-based features to the original feature set. The formula used for quantifying this binary accuracy is

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$$

where TP = true positive; FP = false positive; TN = true negative; FN = false negative.

When selecting a dataset of biomedical signals, a non-stationary signal is preferred. In fact, nearly all biomedical signals are non-stationary since the human system is always in dynamic equilibrium under the regulation of the brain [26].
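The two quantities used earlier in this section, the polygon area behind the root feature and the binary accuracy used for evaluation, can be sketched in a few lines of Python. The area function uses the standard shoelace formula for a simple polygon with ordered vertices; how the root distribution is turned into an ordered vertex list is not stated in the text, so the vertex input here is an assumption, and the function names are hypothetical.

```python
def polygon_area(vertices):
    """Unsigned area of a simple polygon given its (x, y) vertices in order."""
    n = len(vertices)
    twice_area = 0.0
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]  # wrap around to close the polygon
        twice_area += x0 * y1 - x1 * y0
    return abs(twice_area) / 2.0

def binary_accuracy(tp, fp, tn, fn):
    """Share of correct predictions: (TP + TN) / (TP + FP + TN + FN)."""
    return (tp + tn) / (tp + fp + tn + fn)

rectangle_area = polygon_area([(0, 0), (4, 0), (4, 3), (0, 3)])
example_accuracy = binary_accuracy(tp=45, fp=5, tn=40, fn=10)
```

Taking the absolute value makes the area independent of whether the vertices are listed clockwise or counterclockwise.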
Accordingly, three biomedical signals were selected in our study: the vibroarthrographic (VAG) signal, which is the vibration signal generated by friction between the knee cartilage during flexion-extension movement and is used to detect knee joint disorders [27, 28]; the electromyography (EMG) signal of the forearm, which is the electrical signal generated by the extensor carpi radialis longus during the fist-relax movement [29]; and the audio signal of a cough, collected via a smartphone web app [30], coming from lungs and airways that may or may not be affected by COVID-19. All the above signals are binary-class signals: young or old for the VAG signals, rest or fist for the EMG signals, and positive or negative for the cough sound signals. Table 3 describes the properties of these signals in detail.

As the principal linear classifier in this study, we chose the logistic regression (LR) model [31, 32]. LR has been widely used in classification problems and offers the benefit of outputting probabilities. However, an LR classifier is also limited by these probabilities. A result of 51% or even 99% pointing to a given class can still produce a misclassification [31]. In contrast to LR, which focuses on maximizing the probability of group membership, a support vector machine (SVM) seeks to find the separating hyperplane that maximizes the distance of the closest points to the margin. Since the feature sets involve 12 or more dimensions, we chose a polynomial kernel-based nonlinear SVM as the main nonlinear classifier in the study. In addition, we used linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA) as complementary classifiers to the LR and SVM classifiers. Together, these four machine learning-based classifiers were used to test the feature set. Implementation of the signal processing, feature engineering, the PRS algorithm, and the classifiers was supported by RStudio and related packages [33] [34] [35] [36].
Four datasets were input into each of the classification models chosen for the study. The first is the base set containing only the 12 time-domain features. The second (base + NF) and third (base + RF) sets each contain one PRS feature and the 12 time-domain features, for a total of 13 features. The final set is the PRS set with 14 features (base + NF + RF). All four datasets from the VAG signal, the EMG signal, and the cough signal were input, one by one, into the LR classifier, the SVM classifier, the LDA classifier, and the QDA classifier. Each classifier was trained with training rates of 40% to 80% of the full samples (in increments of 10%). The remaining samples were used as test data. Classification accuracy was then calculated using the accuracy formula defined earlier.

The results show an increasing trend in the accuracy rate when the training rate is increased. However, what stands out most in the results is that the sets that include PRS features show a greater increase in accuracy than when only the base set is used. In addition, the accuracy for the base set is generally lower than the accuracy of at least one of the sets that include PRS features when the training rate is higher than 50%. As shown in Fig. 5, for the VAG signal, the sets containing the nutrition feature (NF) tend to perform better. Here, the QDA classifier with the set containing both the nutrition feature and the root feature has the highest classification accuracy. Further analysis indicates that when the training rate is increased to 80% for the LR classifier, classification accuracy using the base set improves by 3.12%, while that for the base + NF set improves by 3.65%. Fig. 6 indicates that the classification accuracy for the EMG signal is particularly high for all four classifiers, in some cases as high as 98%, a value much higher than the maximum values for the other two signals.
Somewhat surprisingly, while most sets containing the RF achieved higher accuracy than the base set, the base set with an 80% training rate led the QDA classifications. Fig. 7 shows that the cough signal is less accurate and regular than the other signals. Here, the set containing the NF and the set containing the RF show higher accuracy, alternating with different training rates. In addition, the number and intensity of monotonicity changes are higher than those of the other signals.

Overall, our results indicate that the sets containing only the base features generally had a lower accuracy rate and that the sets containing the PRS features generally had a higher accuracy rate. For VAG signals, the sets containing PRS features showed a maximum accuracy of 93%, which is higher than the rates reported in the relevant past studies [27, 28]. These results agree with Rauber's (2015) findings, which showed that using only appropriate features could improve the performance of classifiers [13]. For cough signals, although the sets containing PRS features did not achieve the 80% accuracy reported by Chaudhari et al. (2020), who used thousands of samples and deep learning to produce their results, our method achieved 76% accuracy with only one-tenth of the sample size [30]. We were unable to find published classification results for EMG signals; however, we believe that the classification accuracy we achieved of up to 98% is quite satisfactory. The most significant characteristic of this study is its creative use of a swarm intelligence approach to draw out a feature set with exceptionally low correlations by simulating the development of natural plant root systems. The feature correlation for the EMG signals is shown in the corresponding figure (not reproduced here).

The proposed PRS algorithm actually performed better than we had first expected. However, we are well aware that our study is not perfect and has a number of limitations.
First, the rate of accuracy improvement is rather modest for each classification model. As shown in Fig. 6, a mere 1% improvement in accuracy is not enough to offset the error bar. One possible explanation for this might be a too-small sample size, resulting in inadequate model training [4]. Second, although the overall accuracy using the PRS features tends to be higher than that of the base (classical) feature sets for each classifier, there are still exceptions, as can be seen in the accuracy figures.

Although there remain many unanswered questions regarding the classification of biomedical signals, the PRS features proposed in this study, combined with proven classifiers, bring us closer to the cut-off point for clinical application. The limitations noted above will be addressed in subsequent studies to improve the algorithm while testing more biomedical signals.

Recognizing that existing feature extraction methods are not sufficient to produce biomedical signal classifiers that meet the requirements of clinical application, this research proposes a new swarm intelligence-based algorithm for extracting features to ensure higher classification accuracy. The proposed algorithm is based on the growth of roots in nature and offers good interpretability. Importantly, the features obtained by the PRS algorithm have a very low correlation with traditional features, and the accuracy of classifiers using features extracted by the proposed algorithm is shown to improve. This study lays the groundwork for future research focused on extending the use of biomedical signals from strictly research applications to clinical applications.
Surface EMG data aggregation processing for intelligent prosthetic action recognition
Predicting the Future - Big Data
Machine learning: A review of classification and combining techniques
Learning Feature Engineering for Classification
Feature Engineering and Computational Intelligence in ECG Monitoring
Light pollution as a biodiversity threat
Swarm intelligence in animals and humans
Picking battles wisely: plant behaviour under competition
New Fault Recognition Method for Rotary Machinery Based on Information Entropy and a Probabilistic Neural Network
The discrete Fourier transform, part 4: Spectral leakage
Heterogeneous Feature Models and Feature Selection Applied to Bearing Fault Diagnosis
Analysis of Statistical Time-Domain Features Effectiveness in Identification of Bearing Faults From Vibration Signal
Investigating long-term effects of feature extraction methods for continuous EMG pattern classification
Epileptic Seizure Prediction Using Hybrid Feature Selection Over Multiple Intracranial EEG Electrode Contacts: A Report of Four Patients
Normalization: A Preprocessing Stage
A survey on feature selection methods
Induction of decision trees
Convergence of Optimization Problems
Supervised and Unsupervised Discretization of Continuous Features
The distribution of soil nutrients with depth: Global patterns and the imprint of plants
Marschner's Mineral Nutrition of Higher Plants
Calculating the area and centroid of a polygon
Application of higher order statistics/spectra in biomedical signals - A review
Knee osteoarthritis detection based on the combination of empirical mode decomposition and wavelet analysis
Post-processing algorithm for removing soft-tissue movement artifacts from vibroarthrographic knee-joint signal
Latent Factors Limiting the Performance of sEMG-Interfaces
Virufy: Global Applicability of Crowdsourced and Clinical Datasets for AI Detection of COVID-19 from Cough Audio Samples
Interpretable Machine Learning
Advantages and disadvantages of using artificial neural networks versus logistic regression for predicting medical outcomes
kernlab - An S4 Package for Kernel Methods in R
RStudio: Integrated Development Environment for R
A Language and Environment for Statistical Computing