title: Neurochaos Feature Transformation and Classification for Imbalanced Learning
authors: Sethi, Deeksha; Nagaraj, Nithin; Harikrishnan, N. B.
date: 2022-04-20 (arXiv:2205.06742v2 [cs.NE])

Learning from limited and imbalanced data is a challenging problem in the Artificial Intelligence community. Real-time scenarios demand decision-making from rare events wherein the data are typically imbalanced. These situations commonly arise in medical applications, cybersecurity, catastrophic predictions etc. This motivates the development of learning algorithms capable of learning from imbalanced data. The human brain effortlessly learns from imbalanced data. Inspired by the chaotic neuronal firing in the human brain, a novel learning algorithm, namely Neurochaos Learning (NL), was recently proposed. NL consists of three blocks: Feature Transformation, Neurochaos Feature Extraction (CFX), and Classification. In this work, the efficacy of neurochaos feature transformation and extraction for classification in imbalanced learning is studied. We propose a unique combination of neurochaos based feature transformation and extraction with traditional ML algorithms. The datasets explored in this study span medical diagnosis, banknote fraud detection, environmental applications and spoken-digit classification. Experiments are performed in both the high and the low training sample regimes. In the former, five out of nine datasets show a performance boost in terms of macro F1-score after using CFX features. The highest performance boost obtained is 25.97% for the Statlog (Heart) dataset using CFX+Decision Tree. In the low training sample regime (from just one to nine training samples per class), the highest performance boost of 144.38% is obtained for the Haberman's Survival dataset using CFX+Random Forest. NL offers enormous flexibility of combining CFX with any ML classifier to boost its performance, especially for learning tasks with limited and imbalanced data.

Technological advancements have brought about a paradigm shift in the evolution of science. A driving force behind this shift is the high storage and computational capacity available in this era. This has given rise to computational techniques for analysis and pattern discovery from data, popularly known as data-driven science. Data can be structured or unstructured.

The effectiveness of NL has been shown in the classification of various datasets such as MNIST, Iris, Exoplanet [17] and Coronavirus Genome classification [21], especially in the low training sample regime. Recently, [22] demonstrated the efficacy of NL in the classification and preservation of cause-effect for 1D coupled chaotic maps and coupled AR processes. There is a need to rigorously test NL, and in particular the nonlinear chaotic transformation of features, on more datasets in the context of imbalanced learning. This study addresses this gap. We combine NL features with classical machine learning algorithms such as Decision Tree (DT), Random Forest (RF), AdaBoost (AB), Support Vector Machine (SVM), k-Nearest Neighbors (k-NN) and Gaussian Naive Bayes (GNB). A comparison of NL (chaos-based hybrid ML) architectures with stand-alone ML algorithms is brought out for the following balanced and imbalanced datasets: Iris, Ionosphere, Wine, Bank Note Authentication, Haberman's Survival, Breast Cancer Wisconsin, Statlog (Heart), Seeds, and the Free Spoken Digit Dataset.
A detailed analysis of this comparative study (chaos-based hybrid NL+ML vs. stand-alone ML) is carried out for both the high and the low training sample regimes. Learning from limited and/or imbalanced data is a challenging problem in the ML community. This research highlights the performance comparison of NL and classical ML algorithms in learning from limited training instances.

The organization of this paper is as follows: section 2 explains the proposed architecture and methods investigated in this paper. The description of all datasets used for the experiments is provided in section 3. The experiments and their corresponding results are available in section 4. A detailed discussion on the inferences is provided in section 5. The concluding remarks of this study are in section 6.

In this study, a modified form of the recently proposed Neurochaos Learning architecture [18] is put forward. This modified NL architecture is depicted in Figure 1. NL has mainly three blocks - (a) feature transformation, (b) neurochaos feature extraction and (c) classification. A detailed description of each block is given below.

• Input: Input is the first and most significant step of any learning process. It contains the input attributes obtained from the dataset (x_1, x_2, ..., x_n). These input attributes (after suitable normalization) are passed on to the feature transformation block.

• Feature Transformation: The feature transformation block consists of an input layer of 1D Generalized Lüroth Series (GLS) neurons. The 1D GLS neurons are piece-wise linear chaotic maps. The number of GLS neurons (G_1, G_2, ..., G_n) in the input layer is equal to the number of input attributes (n) in the dataset. Upon arrival of the input attributes (also known as stimuli) x_1, x_2, ..., x_n, each of the 1D GLS neurons (G_1, G_2, ..., G_n) starts firing independently from an initial neural activity of q units. The neural trace of each chaotic neuron halts when it reaches the neighbourhood of its stimulus (at which point we say that it has successfully recognized the stimulus). The halting of the neural trace is mathematically guaranteed by the topological transitivity property of chaos [17]. The transformed input attributes are then passed to the neurochaos feature extraction block.

• Neurochaos Feature Extraction: The following features are extracted from the chaotic neural trace:

1. Firing time (N): The time taken by the chaotic neural trace to recognise the stimulus [23].

2. Firing rate (R): Fraction of time for which the chaotic neural trace exceeds the discrimination threshold b so as to recognize the stimulus [23].

3. Energy (E): A chaotic neural trace z(t) with a firing time N has an energy (E) defined as

$E = \sum_{t=1}^{N} |z(t)|^{2}$.    (1)

4. Entropy (H): The entropy of a chaotic neural trace z(t) is computed from the symbolic sequence SS(t) of z(t). SS(t) is defined as

$SS(t_i) = \begin{cases} 0, & 0 \le z(t_i) < b, \\ 1, & b \le z(t_i) < 1, \end{cases}$    (2)

where i = 1 to N (firing time). From SS(t), the Shannon Entropy H(SS) is computed as

$H(SS) = -\left( p_1 \log_2 p_1 + p_2 \log_2 p_2 \right)$,    (3)

where p_1 and p_2 are the probabilities of occurrence of symbols 0 and 1 in SS(t) respectively.

An input stimulus x_k of a data instance visiting the k-th GLS neuron (G_k) is thus transformed to the 4D vector [N_k, R_k, E_k, H_k]. The CFX feature space contains the collection of all such 4D vectors after feature transformation. These chaos based features are passed to the third block of the NL architecture, i.e., classification.
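To make the above pipeline concrete, the following is a minimal Python sketch of the GLS feature transformation and CFX extraction. It assumes the skew-tent form of the GLS map and an eps-neighbourhood stopping rule (consistent with [17, 18]); the function names and the default values of q, b and eps are illustrative and are not taken from the ChaosFEX package.

```python
import numpy as np

def gls_map(x, b):
    """Skew-tent GLS map on [0, 1) with skew parameter b (assumed GLS form)."""
    return x / b if x < b else (1.0 - x) / (1.0 - b)

def cfx_features(stimulus, q=0.34, b=0.499, eps=0.01, max_iter=10000):
    """Return (firing time N, firing rate R, energy E, entropy H) for one stimulus.

    The neuron starts at the initial neural activity q and iterates the GLS map
    until the trace enters the eps-neighbourhood of the (normalized) stimulus.
    """
    trace = [q]
    while abs(trace[-1] - stimulus) > eps and len(trace) < max_iter:
        trace.append(gls_map(trace[-1], b))
    z = np.array(trace)
    N = len(z)                                   # firing time
    R = float(np.mean(z > b))                    # firing rate: fraction of time above b
    E = float(np.sum(np.abs(z) ** 2))            # energy of the trace, Eq. (1)
    p1 = float(np.mean(z < b))                   # probability of symbol 0
    p2 = 1.0 - p1                                # probability of symbol 1
    H = -sum(p * np.log2(p) for p in (p1, p2) if p > 0)  # Shannon entropy, Eq. (3)
    return N, R, E, H

# Each normalized input attribute becomes a 4D vector [N, R, E, H];
# concatenating them gives the CFX feature vector for the instance.
instance = np.array([0.21, 0.67, 0.05, 0.88])    # example normalized attributes
cfx_vector = np.concatenate([cfx_features(x) for x in instance])
print(cfx_vector.shape)                          # (16,)
```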
• Classification: There are mainly two kinds of NL architecture: (a) ChaosNet and (b) ChaosFEX (CFX) + ML. The ChaosNet architecture computes a mean representation vector for each class. The mean representation vector of class k contains the mean firing time, mean firing rate, mean energy and mean entropy of the k-th class. ChaosNet uses a simple decision rule, namely, the cosine similarity of test data instances with the mean representation vectors: a test data instance is assigned the label l if its cosine similarity with the l-th mean representation vector is the highest. Alternatively, we have the flexibility of choosing any other ML classifier instead of ChaosNet. In this kind of NL architecture, the ChaosFEX features are fed directly to the ML classifier (CFX+ML). In this study, we have tested the following ML algorithms - Decision Tree (DT), Random Forest (RF), AdaBoost (AB), Support Vector Machine (SVM), k-Nearest Neighbors (k-NN) and Gaussian Naive Bayes (GNB). CFX+ML is a hybrid NL architecture that combines the best of chaos and machine learning.

• Output: The output obtained from the classification block serves as the output of the respective NL architecture. When ChaosNet is used in the classification block, the output is the outcome of classification based on the mean representation vectors. When an ML classifier is used in the classification block, the output depends on the choice of the ML classifier.

In [21], it has been shown that ChaosNet satisfies the Universal Approximation Theorem, with a bound on the number of chaotic neurons needed to approximate a finite-support discrete-time function. This is due to the uncountably infinite number of dense orbits and the topological transitivity property of chaos. Another important feature of ChaosNet and CFX+ML is the natural presence of Stochastic Resonance (noise-enhanced signal processing) in the architecture [18]. Optimal performance of ChaosNet and CFX+ML is obtained for an intermediate value of the noise intensity (ε). This has been thoroughly shown in [18].

The ChaosFEX (CFX) feature extraction algorithm requires normalization of the dataset and numeric codes for the labels. Therefore, to maintain uniformity, all datasets are normalized for both the stand-alone algorithms and their integration with CFX. The labels are renamed to begin from zero in each dataset to ensure compatibility with CFX feature extraction. The rules followed for the numeric coding of the labels of all the datasets are provided in section 8 (Tables 16-24).

Iris [24, 25] aids the classification of three iris plant variants: Iris Setosa, Iris Versicolour, and Iris Virginica. There are 150 data instances in this dataset with four attributes in each data instance: sepal length, sepal width, petal length, and petal width. All attributes are in cm. The specified class distribution provided in Table 1 is in the following order: (Setosa, Versicolour, Virginica).

The Ionosphere [26, 27] dataset poses a binary classification problem. The classes represent the status of a radar signal returned from the ionosphere. Label 'g' (Good) denotes the return of the radar signal, and label 'b' (Bad) indicates no trace of a returned radar signal. The goal of this experiment is to identify the structure of the ionosphere using radar signals. This dataset has 351 data instances and 34 attributes. The specified class distribution provided in Table 1 is as follows: (Bad, Good).
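The two decision rules described above can be sketched in Python as follows. The placeholder feature matrices and the Random Forest settings are illustrative only, and the chaosnet_* helper names are ours, not part of any released package.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def chaosnet_fit(X_cfx, y):
    """Compute the mean CFX representation vector of each class."""
    classes = np.unique(y)
    means = np.stack([X_cfx[y == c].mean(axis=0) for c in classes])
    return classes, means

def chaosnet_predict(X_cfx, classes, means):
    """Assign each instance the class whose mean vector has the highest cosine similarity."""
    Xn = X_cfx / np.linalg.norm(X_cfx, axis=1, keepdims=True)
    Mn = means / np.linalg.norm(means, axis=1, keepdims=True)
    return classes[np.argmax(Xn @ Mn.T, axis=1)]

# Placeholder CFX features; in practice these come from the transformation sketched earlier.
rng = np.random.default_rng(0)
X_train_cfx, y_train = rng.random((60, 16)), rng.integers(0, 3, 60)
X_test_cfx = rng.random((15, 16))

# (a) ChaosNet decision rule
classes, means = chaosnet_fit(X_train_cfx, y_train)
y_pred_chaosnet = chaosnet_predict(X_test_cfx, classes, means)

# (b) CFX + ML: the same features fed to any scikit-learn classifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)
y_pred_cfx_ml = clf.fit(X_train_cfx, y_train).predict(X_test_cfx)
```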
The Wine [26, 28] dataset aims to identify the origin of different wines using chemical analysis. The classes are labeled '1', '2', and '3'. It has 178 data instances and 13 attributes, ranging from alcohol and malic acid to hue and proline, for the collected samples. The specified class distribution provided in Table 1 is as follows: (1, 2, 3).

Bank Note Authentication [26, 29] is a binary classification dataset. The classes are Genuine and Forgery. The Genuine class refers to an authentic banknote denoted by '0', while the Forgery class refers to a forged banknote denoted by '1'. The dataset was obtained from images of banknotes belonging to both classes, taken with an industrial camera. It contains 1372 data instances in total. The dataset has four attributes retrieved from the images using a wavelet transformation. The specified class distribution provided in Table 1 is as follows: (Genuine, Forgery).

Haberman's Survival [26, 30] is a compilation of sections of a study investigating the lifespan of patients after undergoing breast cancer surgery. This is a binary classification problem and the dataset provides information for the prediction of the survival of patients beyond five years. Class '1' denotes survival of the patient for five years or longer after the surgery. Class '2' denotes the death of the patient within five years of the surgery. It contains 306 data instances in total and three attributes. The specified class distribution provided in Table 1 is as follows: (1, 2).

The Breast Cancer Wisconsin [26, 31] dataset deals with the classification of the severity of breast cancer. Class 'M' refers to a malignant tumour and class 'B' refers to a benign tumour. It contains a total of 569 data instances and 31 attributes such as radius, perimeter, texture, smoothness, etc. for each cell nucleus. The specified class distribution provided in Table 1 is as follows: (Malignant - M, Benign - B).

Statlog (Heart) [26] enables differentiation between the presence and absence of heart disease in a patient. Class '1' denotes the absence while class '2' denotes the presence of heart disease. It contains 270 data instances in total and 13 attributes including resting blood pressure, chest pain type, exercise induced angina and so on. The specified class distribution provided in Table 1 is as follows: (Absence, Presence).

The Seeds [26] dataset examines three classes of wheat - Kama, Rosa and Canadian - using soft X-ray imaging of the wheat kernels to retrieve relevant properties. It contains 210 data instances in total and seven attributes, namely compactness, length, width etc. of each wheat kernel. The specified class distribution provided in Table 1 is as follows: (Kama, Rosa, Canadian).

The Free Spoken Digit Dataset (FSDD) [32] is a time-series dataset comprising recordings of six speakers. Each speaker recites the digits from zero to nine, with 50 recordings per digit. The speaker chosen for all experiments is Jackson. The dataset undergoes preprocessing using a Fast Fourier Transform (FFT) technique. The dataset for speaker Jackson has 500 data instances, and only recordings longer than 3000 samples are considered, to tackle the varying data lengths throughout the dataset. In these data instances, only the first 3005 samples are examined. Finally, 480 data instances are retained and fed into the algorithms. The specified class distribution provided in Table 1 is as follows: (0, 1, 2, 3, 4, 5, 6, 7, 8, 9).
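A rough sketch of the FSDD preprocessing described above (length filtering, truncation to 3005 samples, FFT) is given below. The directory layout, the FSDD file-naming convention (digit_speaker_index.wav) and the use of scipy for reading WAV files are assumptions, not details taken from the paper's code.

```python
import os
import numpy as np
from glob import glob
from scipy.io import wavfile

MIN_LEN = 3000   # length threshold mentioned in the text
KEEP = 3005      # number of samples examined per recording

X, y = [], []
for path in sorted(glob("recordings/*_jackson_*.wav")):   # assumed FSDD layout
    _, signal = wavfile.read(path)
    if len(signal) <= MIN_LEN:                             # discard recordings that are too short
        continue
    segment = signal[:KEEP]
    if len(segment) < KEEP:                                # guard so all feature vectors have equal length
        continue
    X.append(np.abs(np.fft.fft(segment)))                  # FFT magnitude of the first 3005 samples
    y.append(int(os.path.basename(path).split("_")[0]))    # digit label from the file name

X, y = np.array(X), np.array(y)                            # X shape: (num_kept_recordings, 3005)
```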
In this section, the authors evaluate the efficacy of ChaosFEX (CFX) feature engineering on various classification tasks. For this, hybrid models are developed by combining CFX features extracted from the input data with classical ML algorithms such as Decision Tree, Random Forest, AdaBoost, Support Vector Machine, k-Nearest Neighbors and Gaussian Naive Bayes. The experiments are performed for both the low and the high training sample regimes. The train-test distribution (80%-20%) for each dataset in the high training sample regime is available in Table 1. In the low training sample regime, 150 random trials for training with 1, 2, ..., 9 data instances per class are considered. The trends of all algorithms in their stand-alone form and in their implementation using CFX features are conveyed in this section. For all experiments in the paper, the following software is used: Python 3.8, scikit-learn [24] and ChaosFEX (CFX) [17].

Every ML algorithm has a set of optimal hyperparameters to be found by hyperparameter tuning. The hyperparameters for all the algorithms with respect to each dataset were tuned using five-fold cross-validation. Table 2 provides the set of hyperparameters tuned for all algorithms in this research. In the case of CFX+ML for Iris, Ionosphere and Wine, both the CFX (q, b, ε) and the ML hyperparameters were tuned. For the remaining datasets, the hyperparameters tuned for ChaosNet (q, b, ε) were retained and only the ML hyperparameters were tuned.

The most common performance metric used in ML is Accuracy [33]. In the case of imbalanced datasets, this metric can lead to misinterpretation of results. Thus, the performance metric used for all experiments in this study is the Macro F1-Score. This metric is computed from the confusion matrix. Figure 2 shows a confusion matrix for a binary classification problem. Prediction values represent the predictions made by the classifier; actual values stand for the ground-truth class labels.

The F1-Score depends on two quantities:

1. Precision - the ratio of the number of positives correctly classified as positive by the algorithm to the total number of instances classified as positive, given by

$Precision = \frac{TP}{TP + FP}$.    (4)

2. Recall - the ratio of the number of positives correctly classified as positive by the algorithm to the total number of actual positives, given by

$Recall = \frac{TP}{TP + FN}$.    (5)

Thus, the F1-Score, which is the harmonic mean of Precision and Recall, is given by

$F1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}$.    (6)

The Macro F1-Score is obtained by averaging the F1-scores of all classes in the dataset, given by

$Macro\ F1 = \frac{1}{n} \sum_{k=1}^{n} F1_k$,    (7)

where n stands for the number of distinct classes in the dataset. The F1-score of Class k considers the instances of Class k as positive and all instances of the remaining classes as negative. Thus, the task is transformed into a binary classification problem. All instances of Class k correctly classified as positive or negative by the classifier are termed true positives or true negatives respectively. All instances that are incorrectly classified by the classifier are placed under the false positive or false negative category accordingly. Hence the associated equations of precision and recall for this example are given by

$Precision_k = \frac{TP_k}{TP_k + FP_k}$,  $Recall_k = \frac{TP_k}{TP_k + FN_k}$.

The F1-score for Class k is computed by

$F1_k = \frac{2 \cdot Precision_k \cdot Recall_k}{Precision_k + Recall_k}$.

Similarly, the F1-scores for all classes in the classification problem are computed and applied in Eq. (7).

ChaosNet has three hyperparameters - initial neural activity (q), discrimination threshold (b), and noise intensity (ε) [18]. Specifics of the same are provided in Table 2.
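The tuning and evaluation protocol described above can be sketched with scikit-learn as follows. The Iris data and the Decision Tree grid are illustrative; in the paper, the same procedure is applied to CFX features and to each algorithm's own hyperparameters.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Example on Iris; in the paper this step is applied to the CFX-transformed features.
X, y = load_iris(return_X_y=True)
X = MinMaxScaler().fit_transform(X)                 # normalization, as required by CFX
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# Five-fold cross-validation over an illustrative hyperparameter grid, scored with macro F1.
grid = GridSearchCV(DecisionTreeClassifier(random_state=42),
                    param_grid={"max_depth": list(range(1, 11))},
                    scoring="f1_macro", cv=5)
grid.fit(X_tr, y_tr)

y_pred = grid.best_estimator_.predict(X_te)
print(grid.best_params_, f1_score(y_te, y_pred, average="macro"))
```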
In this section, the authors present the results in two formats: (a) bar graphs and (b) line graphs. The comparative results for ChaosNet, CFX+ML and stand-alone ML in the high training sample regime are depicted using bar graphs. All values plotted in the bar graphs for each dataset are provided in the Supplementary Information section (section 8), Tables 32-37. The line graphs, on the other hand, depict the comparative performance of ChaosNet, CFX+ML and stand-alone ML in the low training sample regime.

The tuned hyperparameters used and all experimental results are available, respectively, in Table 4 and Figure 3 for the Iris dataset, Table 5 and Figure 4 for Ionosphere, Table 6 and Figure 5 for Wine, Table 7 and Figure 6 for Bank Note Authentication, Table 8 and Figure 7 for Haberman's Survival, Table 9 and Figure 8 for Breast Cancer Wisconsin, Table 10 and Figure 9 for Statlog (Heart), Table 11 and Figure 10 for Seeds, and Table 12 and Figure 11 for FSDD. The overall comparative performance of the different algorithms is provided in Table 13.

In the high training sample regime, the efficacy of using CFX features is evident from Table 13 for Iris, Ionosphere, Haberman's Survival, Statlog (Heart) and FSDD. While Iris is a balanced dataset, Ionosphere, Haberman's Survival and Statlog (Heart) are imbalanced. This establishes the versatility of the algorithm in performing with both balanced and imbalanced datasets. The performance boost after using CFX features is calculated as follows:

$Boost\ (\%) = \frac{F1_{CFX+ML} - F1_{ML}}{F1_{ML}} \times 100$,

where $F1_{ML}$ and $F1_{CFX+ML}$ refer to the macro F1-scores of the stand-alone ML algorithm and the hybrid NL (CFX+ML) algorithm respectively.

The consistency of an algorithm can be inferred by evaluating the range of minimum and maximum macro F1-scores it produces in the high training sample regime. Table 15 shows the ranges of macro F1-scores of the different algorithms, measured across all nine datasets used in this research. The format of the range of macro F1-scores in Table 15 is [Minimum, Maximum]. ChaosNet ranks second in having the least difference between the maximum and minimum macro F1-scores. This observation reflects ChaosNet's consistency and tolerance towards dataset diversity. In algorithms such as AdaBoost (AB) and Support Vector Machine (SVM), a relatively low difference between F1-scores is realised after the usage of CFX features. Gaussian Naive Bayes (GNB) shows the least difference between F1-scores. Across the datasets from the different domains considered in this study, the performance of ChaosNet can be seen as comparable with GNB.

With the current implementation of the ChaosFEX algorithm, computation on image datasets is a costly process.
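A small sketch of this boost computation is given below; the two macro F1 values are hypothetical, chosen only to reproduce the reported 25.97% figure, and are not taken from the paper's tables.

```python
def performance_boost(f1_ml, f1_cfx_ml):
    """Percentage boost in macro F1 after using CFX features (as in the equation above)."""
    return (f1_cfx_ml - f1_ml) / f1_ml * 100.0

# Illustrative values only (not the paper's reported scores).
print(round(performance_boost(0.616, 0.776), 2))   # -> 25.97
```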
This may limit the use of NL architectures in practical situations involving images. Furthermore, the NL architectures are based on certain assumptions. One assumption concerns the separability of the data: NL assumes that applying a nonlinear chaotic transformation will result in separable data suitable for classification, which may not be true in all cases. As it stands, the input layer of NL treats the input attributes as independent of each other; it establishes no connection between the neurons for the individual input attributes. This limitation can be addressed by using coupled chaotic neurons in the input layer. Currently, we have not considered multi-layered NL (Deep-NL), which could significantly enhance performance (by a careful choice of coupling between adjacent layers). Another limitation of the NL architectures is the lack of a principled approach to tune the best hyperparameters for a classification task. Currently, cross-validation experiments are used to tune the hyperparameters. The connection between the degree of chaos (as measured by the Lyapunov exponent [34]) and learnability is also worth exploring in future research.

Decision-making in the presence of rare events is a challenging problem in the ML community. This is because rare events have limited data instances, and the problem boils down to imbalanced learning. In this work, we have evaluated the effectiveness of the ChaosFEX (CFX) feature transformation used in Neurochaos Learning (NL) architectures for imbalanced learning. Nine benchmark datasets were used in this study to bring out this evaluation; seven of the nine datasets used are imbalanced (refer to Table 1). This paper accomplishes a comparative study of the performance of the NL architectures, ChaosNet and CFX+ML, against classical Machine Learning (ML) algorithms. The obtained results reflect an evident performance boost in terms of macro F1-score after the nonlinear chaotic transformation (ChaosFEX or CFX features). Additionally, the efficacy of CFX features can be observed in five out of nine balanced and imbalanced datasets in the high training sample regime, with a boost ranging from 2.73% (Free Spoken Digit Dataset) to 25.97% (Statlog (Heart)). In the low training sample regime, the integration of CFX features has boosted the performance of classical ML algorithms in five out of the nine datasets. A maximum boost of 144.38% is obtained on the Haberman's Survival dataset using CFX+RF. Refer to Tables 13 and 14 for the detailed performance boosts using CFX features. This is the first study thoroughly evaluating the performance of NL in the imbalanced learning scenario. NL is a unique combination of chaos and noise-enhanced classification. The enormous flexibility of NL offers endless possibilities for the development of novel chaos-based hybrid NL+ML models that suit the application at hand. As new ML algorithms get invented, they can be readily combined in the NL framework. We foresee exciting combinations of CFX with DL and other ML algorithms in the future.

Deeksha Sethi is thankful to Saneesh Cleatus T, Associate Professor, BMS Institute of Technology and Management, for enabling her with this research opportunity. Sethi dedicates this work to her family. Harikrishnan N. B. thanks "The University of Trans-Disciplinary Health Sciences and Technology (TDU)" for permitting this research as part of the PhD programme. The authors gratefully acknowledge the financial support of Tata Trusts.
The authors acknowledge the computational facility supported by the NIAS Consciousness Studies Programme.

• Funding: The authors gratefully acknowledge the financial support of Tata Trusts.
• Conflicts of Interest/Competing interests: There is no conflict of interest or competing interests.
• Ethics approval: Not Applicable.
• Consent to participate: Not Applicable.
• Consent for publication: Not Applicable.
• Availability of data and materials: Not Applicable.
• Code availability: The codes used in this research are available at the following link: https://github.com/deeksha-sethi03/nl-imbalanced-learning.

This is the supplementary information pertaining to the main manuscript. It contains the following: (1) a description of the datasets used in our study, including the coding rule for the labels of the different classes, (2) hyperparameter tuning details for each dataset and for each ML algorithm (Decision Tree, Random Forest, AdaBoost, SVM, k-NN) and NL algorithm (ChaosNet) used in the study, and (3) the test-data macro F1-scores of each algorithm in the high training sample regime for each dataset.

For Random Forest, max_depth declares the maximum depth each tree in the forest is allowed to have; it is tuned from 1 to 10 with a step-size of 1, and the results are available in Tables 27 and 28. All remaining hyperparameters offered by scikit-learn are retained in their default forms. The results of hyperparameter tuning for AdaBoost are available in Table 29, and those for the Support Vector Machine in Table 30. For k-Nearest Neighbors, the hyperparameter k defines the number of nearest training data samples to the testing data sample to be chosen; it is tuned from 1 to 6 with a step-size of 2, and the results are available in Table 31. For ChaosNet, the noise intensity ε is varied from 0.001 to 0.499 with a step-size of 0.001.

The results of all experiments in the high training sample regime are shown in Table 32 (Decision Tree), Table 33 (Random Forest), Table 34 (AdaBoost), Table 35 (Support Vector Machine), Table 36 (k-Nearest Neighbors) and Table 37 (Gaussian Naive Bayes).
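For concreteness, the tuning ranges stated above can be written as parameter grids. This is an illustrative sketch only: the dictionary keys follow scikit-learn naming conventions, and only the ranges explicitly stated in this section are included.

```python
import numpy as np

# Illustrative grids covering only the ranges stated above.
rf_grid = {"max_depth": list(range(1, 11))}          # Random Forest: depth 1..10, step 1
knn_grid = {"n_neighbors": [1, 3, 5]}                # k-NN: k from 1 to 6, step 2
chaosnet_epsilon = np.arange(0.001, 0.500, 0.001)    # ChaosNet noise intensity: 0.001..0.499
```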
References

[1] An anarchy of methods: Current trends in how intelligence is abstracted in AI.
[2] Artificial intelligence: A guide for thinking humans.
[3] Learning from class-imbalanced data: Review of methods and applications.
[4] Robust weighted kernel logistic regression in imbalanced and rare events data.
[5] The COVID-19 pandemic.
[6] The 1918 "Spanish Flu" in Spain.
[7] Ransomware: Recent advances, analysis, challenges and future research directions.
[8] Credit card fraud detection: A fusion approach using Dempster-Shafer theory and Bayesian learning.
[9] The foundations of cost-sensitive learning.
[10] SMOTE: Synthetic minority over-sampling technique.
[11] A multiple expert approach to the class imbalance problem using inverse random under sampling.
[12] PCA feature extraction for change detection in multidimensional unlabeled data.
[13] Dynamic mode decomposition: A feature extraction technique based hidden Markov model for detection of mysticetes' vocalisations.
[14] A novel approach for MFCC feature extraction.
[15] Nonnegative matrix factorization: A comprehensive review.
[16] Feature extraction based on empirical mode decomposition for automatic mass classification of mammogram images.
[17] ChaosNet: A chaos based artificial neural network architecture for classification.
[18] When noise meets chaos: Stochastic resonance in neurochaos learning.
[19] Equal numbers of neuronal and nonneuronal cells make the human brain an isometrically scaled-up primate brain.
[20] Is there chaos in the brain? II. Experimental evidence and related models.
[21] A neurochaos learning architecture for genome classification.
[22] Cause-effect preservation and classification using neurochaos learning.
[23] Neurochaos inspired hybrid machine learning architecture for classification.
[24] Scikit-learn: Machine learning in Python.
[25] The use of multiple measurements in taxonomic problems.
[26] UCI Machine Learning Repository.
[27] Classification of radar returns from the ionosphere using neural networks.
[28] PARVUS: An Extendable Package of Programs for Data Exploration.
[29] Banknote authentication.
[30] The analysis of residuals in cross-classified tables.
[31] Nuclear feature extraction for breast tumor diagnosis.
[32] Jakobovski/free-spoken-digit-dataset: v1.0.8.
[33] An experimental comparison of performance measures for classification.
[34] Lyapunov exponents. Wiley Encyclopedia of Biomedical Engineering.