MIFS-ND: A Mutual Information-based Feature Selection Method

N. Hoque a,*, D. K. Bhattacharyya a,*, J. K. Kalita b,*

a Department of Computer Science & Engineering, Tezpur University, Napaam, Tezpur-784028, Assam, India
b Department of Computer Science, University of Colorado at Colorado Springs, CO 80933-7150, USA

*Corresponding authors
Email addresses: tonazrul@gmail.com (N. Hoque), dkb@tezu.ernet.in (D. K. Bhattacharyya), jkalita@uccs.edu (J. K. Kalita)
1 Nazrul Hoque is a Senior Research Fellow in the Department of Computer Science and Engineering, Tezpur University, Napaam, Tezpur, Assam, India.
2 Dhruba Kr. Bhattacharyya is a professor in the Department of Computer Science and Engineering, Tezpur University, India.
3 Jugal K. Kalita is a professor in the Department of Computer Science, University of Colorado at Colorado Springs, USA.

Preprint submitted to Expert Systems with Applications (Elsevier), April 21, 2014

Abstract

Feature selection is used to choose a subset of relevant features for effective classification of data. In high dimensional data classification, the performance of a classifier often depends on the feature subset used for classification. In this paper, we introduce a greedy feature selection method using mutual information. This method combines both feature-feature mutual information and feature-class mutual information to find an optimal subset of features that minimizes redundancy and maximizes relevance among features. The effectiveness of the selected feature subset is evaluated using multiple classifiers on multiple datasets. The performance of our method, in terms of both classification accuracy and execution time, has been found to be significantly high on twelve real-life datasets of varied dimensionality and number of instances when compared with several competing feature selection techniques.

Keywords: Features, mutual information, relevance, classification

1. Introduction

Feature selection, also known as variable, attribute, or variable subset selection, is used in machine learning and statistics to select a subset of features for constructing models that describe data [1, 2, 3, 4, 5]. Two important aspects of feature selection are (i) minimum redundancy and (ii) maximum relevance [6]. Besides these, feature selection is used for dimensionality reduction and data minimization for learning, for improving predictive accuracy, and for increasing the comprehensibility of models. To satisfy these requirements, two dimensionality reduction approaches are used, namely feature extraction and feature selection [7]. A feature selection method selects a subset of relevant features from the original feature set, whereas a feature extraction method creates new features based on combinations or transformations of the original feature set. Feature selection is used to overcome the curse of dimensionality [8] in a pattern recognition system. The main objective of feature selection is to identify the m most informative features out of the d original features, where m < d.

If I(X1; Y) > I(X2; Y) and Σ_{k=1}^{n−1} I(X2; Xk) < Σ_{k=1}^{n−1} I(X1; Xk), then the proposed method selects the feature X1, since its feature-class mutual information is higher than that of X2.

Definition 1. Feature relevance: Let F be a full set of features, fi a feature, and Si = F − {fi}. Feature fi is strongly relevant iff I(C|fi, Si) ≠ I(C|Si); otherwise, if I(C|fi, Si) = I(C|Si) and there exists S′i ⊂ Si such that I(C|fi, S′i) ≠ I(C|S′i), then fi is weakly relevant to the class C.
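Although the surviving text does not reproduce Equation 1, the quantities compared above are standard Shannon mutual information values. As a quick illustration, the following minimal sketch (our own code and helper names, not the authors' implementation) computes feature-class mutual information from empirical counts as I(X; C) = H(X) − H(X | C):

```python
# Minimal sketch (not the authors' code) of Shannon mutual information
# I(X; Y) = H(X) - H(X | Y) for discrete variables, from empirical counts.
from collections import Counter
from math import log2

def entropy(values):
    """Empirical Shannon entropy H(X) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def conditional_entropy(x, y):
    """Empirical conditional entropy H(X | Y)."""
    n = len(x)
    h = 0.0
    for y_val, c_y in Counter(y).items():
        x_given_y = [xi for xi, yi in zip(x, y) if yi == y_val]
        h += (c_y / n) * entropy(x_given_y)
    return h

def mutual_information(x, y):
    """I(X; Y) = H(X) - H(X | Y); used for both feature-class and feature-feature scores."""
    return entropy(x) - conditional_entropy(x, y)

# Toy illustration: X1 tracks the class more closely than X2, so I(X1; C) > I(X2; C).
C  = [0, 0, 0, 1, 1, 1]
X1 = [0, 0, 1, 1, 1, 1]
X2 = [0, 1, 0, 1, 0, 1]
print(mutual_information(X1, C), mutual_information(X2, C))
```

On this toy data the script prints roughly 0.46 and 0.08, i.e., I(X1; C) > I(X2; C), which is exactly the situation in which the method prefers X1.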
Definition 2. Feature-class relevance: It is defined as the degree of feature-class mutual information for a given class Ci of a given dataset D, whose data elements are described by d features. A feature fi belongs to F′, an optimal subset of relevant features for Ci, if the relevance of (fi, Ci) is high.

Definition 3. Relevance score: The relevance score of a feature fi is the degree of relevance in terms of mutual information between the feature and a class label. Based on this value, a rank can be assigned to each feature fi, for i = 1, 2, 3, ..., d. For a feature fi ∈ F′, the relevance score is high.

The following properties are straightforward, based on [26] and [38].

Property 1: For a pair of features (fi, fj) ∈ F′, the feature-feature mutual information is low, while the feature-class mutual information is high for a given class Ci of a given dataset D.
Explanation: For a feature fi to be a member of an optimal subset of relevant features F′ for a given class Ci, the feature-class mutual information, i.e., the relevance score (Definition 3), must be high. On the contrary, whenever feature-class mutual information is high, feature-feature mutual information has to be relatively low [38].

Property 2: A feature f′i ∉ F′ has low relevance w.r.t. a given class Ci of a given dataset D.
Explanation: Let f′i ∉ F′, where F′ corresponds to class Ci, and let the feature-class mutual information between f′i and Ci be high. If the feature-class mutual information score is high, then as per Definitions 1 and 2, f′i must be a member of the set F′, which is a contradiction.

Property 3: If a feature fi has higher mutual information than feature fj with the class label C, then fi will have a smaller probability of misclassification [39].
Explanation: Since fi has a higher mutual information score than fj with respect to class C, it has more relevance. So, the classification accuracy obtained using fi as one of the features will be higher than that obtained using fj.

Lemma 1: For a feature fi, if the domination count Cd is larger and the dominated count Fd is smaller than those of all other features fj (i ≠ j), then the feature fi has the highest feature-class mutual information and is the most relevant.
Proof: For a feature fi, if its domination count Cd is larger and its dominated count Fd is smaller than those of all other features fj (i ≠ j), then the NSGA-II method ensures that the feature-class mutual information of fi is the highest whereas its feature-feature mutual information is the lowest. Hence, the method selects feature fi as the strongly relevant feature, as shown in Example 1.

Lemma 2: For any two features fi and fj, if the difference of Cd and Fd for feature fi is the same as the difference of Cd and Fd for feature fj, then the feature with the higher feature-class mutual information is relevant.
Proof: For any two features fi and fj, if the difference of Cd and Fd for feature fi is the same as the difference of Cd and Fd for feature fj, then NSGA-II ensures that neither fi nor fj satisfies Lemma 1; in this situation, either fi or fj has the higher feature-class mutual information. Hence, the method selects the feature that has the higher feature-class mutual information, as shown in Example 2.
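To make Lemmas 1 and 2 concrete, the sketch below (our reading of the NSGA-II dominance step, not the authors' implementation; the function names are illustrative) computes Cd and Fd from a feature-class MI score to be maximized and an average feature-feature MI score to be minimized, and applies the Cd − Fd rule with the Lemma 2 tie-break:

```python
# Minimal sketch (assumed reading, not the authors' code) of the domination count (Cd)
# and dominated count (Fd) used in Lemmas 1 and 2. Each candidate feature carries two
# objectives: feature-class MI (fcmi, to be maximized) and average feature-feature MI
# (affmi, to be minimized). Feature i dominates feature j if it is no worse on both
# objectives and strictly better on at least one (the usual NSGA-II dominance relation).

def dominates(fcmi_i, affmi_i, fcmi_j, affmi_j):
    no_worse = fcmi_i >= fcmi_j and affmi_i <= affmi_j
    strictly_better = fcmi_i > fcmi_j or affmi_i < affmi_j
    return no_worse and strictly_better

def domination_counts(fcmi, affmi):
    """Return (Cd, Fd) per feature: how many features it dominates / is dominated by."""
    n = len(fcmi)
    cd = [0] * n
    fd = [0] * n
    for i in range(n):
        for j in range(n):
            if i != j and dominates(fcmi[i], affmi[i], fcmi[j], affmi[j]):
                cd[i] += 1
                fd[j] += 1
    return cd, fd

def pick_feature(fcmi, affmi):
    """Lemma 1: largest Cd - Fd wins; Lemma 2: ties broken by higher feature-class MI."""
    cd, fd = domination_counts(fcmi, affmi)
    return max(range(len(fcmi)), key=lambda i: (cd[i] - fd[i], fcmi[i]))
```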
4.3. MIFS-ND (Mutual Information-based Feature Selection) Method

Input: d, the number of features; dataset D; F = {f1, f2, ..., fd}, the set of features; k, the number of features to be selected
Output: F′, an optimal subset of features
Steps:
  for i = 1 to d do
      compute MI(fi, C)
  end
  select the feature fi with maximum MI(fi, C)
  F′ = F′ ∪ {fi};  F = F − {fi}
  count = 1
  while count <= k do
      for each feature fj ∈ F do
          FFMI = 0
          for each feature fi ∈ F′ do
              FFMI = FFMI + Compute_FFMI(fi, fj)
          end
          AFFMI = average FFMI for feature fj
          FCMI = Compute_FCMI(fj, C)
      end
      select the next feature fj with maximum FCMI but minimum AFFMI, using the domination counts Cd and Fd (Lemmas 1 and 2)
      F′ = F′ ∪ {fj};  F = F − {fj}
      count = count + 1
  end
  return the feature set F′

Algorithm 1: MIFS-ND

The proposed feature selection method depends on two major modules, namely Compute_FFMI and Compute_FCMI. We describe the working of each of these modules next.

Compute_FFMI(fi, fj): For any two features fi, fj ∈ F, this module computes the mutual information between fi and fj using Equation 1. It computes the marginal entropy of fi and obtains the mutual information by subtracting the conditional entropy of fi given fj from this marginal entropy.

Compute_FCMI(fj, C): For a given feature fj ∈ F and a given class label C, this module finds the mutual information between fj and C using Shannon's mutual information formula in Equation 1. It first computes the marginal entropy of fj and then subtracts the conditional entropy of fj given C.

Using these two modules, the proposed method picks a high-ranked feature that is strongly relevant but non-redundant. To select such a feature, it uses the NSGA-II method and computes the domination count (Cd) and dominated count (Fd) for every feature. If a feature has the highest difference between Cd and Fd, the method selects that feature using Lemma 1; otherwise, it uses Lemma 2 to select the relevant feature.
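A compact sketch of Algorithm 1 is given below. It is illustrative only (the authors implemented the method in MATLAB): it assumes discrete-valued features, uses sklearn.metrics.mutual_info_score as the Shannon mutual information of Equation 1, and realizes the selection step through the domination counts of Lemmas 1 and 2.

```python
# Illustrative sketch of the MIFS-ND greedy loop in Algorithm 1 (not the authors'
# MATLAB implementation). Assumes discrete features; mutual_info_score plays the role
# of the Shannon MI of Equation 1.
import numpy as np
from sklearn.metrics import mutual_info_score

def mifs_nd(X, y, k):
    """Select k feature indices from discrete data X (NumPy array, n_samples x d) with labels y."""
    d = X.shape[1]
    remaining = list(range(d))
    fcmi = [mutual_info_score(X[:, j], y) for j in range(d)]
    selected = [max(remaining, key=lambda j: fcmi[j])]   # start with max feature-class MI
    remaining.remove(selected[0])
    while len(selected) < k and remaining:
        # average feature-feature MI of each candidate against the already-selected set
        affmi = {j: np.mean([mutual_info_score(X[:, j], X[:, s]) for s in selected])
                 for j in remaining}
        cd = {j: 0 for j in remaining}   # domination count
        fd = {j: 0 for j in remaining}   # dominated count
        for i in remaining:
            for j in remaining:
                if i != j and fcmi[i] >= fcmi[j] and affmi[i] <= affmi[j] \
                        and (fcmi[i] > fcmi[j] or affmi[i] < affmi[j]):
                    cd[i] += 1
                    fd[j] += 1
        # Lemma 1: largest Cd - Fd; Lemma 2: break ties by higher feature-class MI
        best = max(remaining, key=lambda j: (cd[j] - fd[j], fcmi[j]))
        selected.append(best)
        remaining.remove(best)
    return selected
```

On continuous features, the data would first have to be discretized, since the mutual information here is estimated from value counts.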
4.4. Complexity Analysis

The overall complexity of the proposed algorithm depends on the dimensionality of the input dataset. For any dataset of dimension d, the computational complexity of our algorithm to select a subset of relevant features is O(d^2). However, the use of appropriate domain-specific heuristics can help to reduce the complexity significantly.

4.5. Comparison with other Relevant Work

The proposed feature selection method differs from other methods in the following ways.

1. Like Battiti [24], our method considers both feature-feature and feature-class mutual information, but Battiti uses an additional input parameter, β, to regulate the relative importance of the mutual information between a feature to be selected and the features that have already been selected. Instead of such an input parameter, our method calculates the domination count and dominated count, as used in the NSGA-II algorithm, to select a relevant feature.

2. Like Kwak and Choi [25], our method uses a greedy filter approach to select a subset of features. However, unlike them, our method does not consider joint mutual information among three parameters, namely a feature to be selected, the set of features already selected, and the class output.

3. Unlike Peng et al. [26], who use both filter and wrapper approaches, our method uses only the filter approach to select an optimal feature subset using a ranking statistic, which makes our scheme more computationally cost effective. Similar to Peng's mRMR method, we use the maximum relevance and minimum redundancy criterion to select a feature from the original feature set. In the subsequent section (Section 5), a detailed comparison of our method with Peng et al.'s method on two high dimensional gene expression datasets and four UCI datasets is shown.

4. Unlike Brown [40], our method uses Shannon's mutual information on two variables only, whereas Brown's method uses multivariate mutual information. Also, our method selects a feature set using a heuristic bottom-up approach and iteratively inserts a relevant but non-redundant feature into the selected feature set. Brown's method follows a top-down approach and discards features from the original feature set.

5. Unlike Kraskov et al. [35], our method computes Shannon's entropy, whereas they estimate entropy using k-nearest neighbor distances.

5. Experimental Results

Experiments were carried out on a workstation with 12 GB main memory, a 2.26 GHz Intel(R) Xeon processor, and the 64-bit Windows 7 operating system. We implemented our algorithm in MATLAB R2008a.

5.1. Dataset Description

In our experimental analysis, we use several network intrusion, text categorization, selected UCI, and gene expression datasets. These datasets contain both numerical and categorical values with various dimensionalities and numbers of instances. Descriptions of the datasets are given in Table 4.

Table 4: Dataset description

Category             Dataset              Number of instances   Number of attributes
Intrusion            NSL-KDD 99           125973                42
                     10% KDD 99           494021                42
                     Corrected KDD        311029                42
Non-intrusion        Wine                 178                   13
                     Monk1                432                   6
                     Monk2 / Monk3        432                   6
                     Iris                 150                   4
                     Ionosphere           351                   34
                     Sonar                209                   61
Text categorization  Bloggender-male      3232                  101
                     Bloggender-female    3232                  101
Gene expression      Lymphoma             45                    4026
                     Colon Cancer         62                    2000

5.2. Results

The proposed algorithm first selects a subset of relevant features from each dataset. To evaluate the performance of our algorithm, we use four well-known classifiers, namely Decision Trees, Random Forests, K-Nearest Neighbor (KNN), and Support Vector Machines (SVM). The performance of our algorithm is compared with five standard feature selection algorithms, namely the Chi-squared method, Gain Ratio, Information Gain, ReliefF, and Symmetric Uncertainty, all of which are available in Weka. We use 10-fold cross validation to evaluate the effectiveness of the selected features with the different classifiers. The comparison of the classification accuracy of our algorithm with all the aforesaid feature selection algorithms is shown here for all the mentioned datasets. From the experimental results, we observe that the proposed method gives good classification accuracy on most datasets. Since the network datasets are too large to be handled in MATLAB, we split them into smaller partitions, each of which contains class labels from all the classes. Finally, we compute the average classification accuracy over all the partitions.
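For readers who wish to reproduce the evaluation protocol, the sketch below is a rough scikit-learn equivalent (an assumption on our part: the study itself used MATLAB for MIFS-ND and Weka's implementations of the competing selectors, and the estimators below are generic stand-ins rather than the exact configurations used):

```python
# Hedged sketch of the evaluation protocol only: 10-fold cross-validated accuracy of
# each classifier restricted to the selected feature columns of a NumPy array X.
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def evaluate_subset(X, y, selected):
    """Mean 10-fold CV accuracy per classifier on the columns listed in `selected`."""
    classifiers = {
        "Decision Tree": DecisionTreeClassifier(),
        "Random Forest": RandomForestClassifier(),
        "KNN": KNeighborsClassifier(),
        "SVM": SVC(),
    }
    Xs = X[:, selected]
    return {name: cross_val_score(clf, Xs, y, cv=10, scoring="accuracy").mean()
            for name, clf in classifiers.items()}
```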
The proposed feature selection method selects an optimal subset of features for which we achieve the best classification accuracy. However, the cardinality of the feature subset identified by the proposed MIFS-ND method varies across datasets. This variation in the cardinalities of the optimal feature subsets for some of the datasets, namely Sonar, Ionosphere, Wine, and Iris, is shown in Figures 3(a), 3(b), 4(a), and 4(b), respectively. Our observations are that (i) except for the Sonar dataset, the proposed MIFS-ND method performs significantly better than the competing methods; its performance suffers on the Sonar dataset due to the lack of sufficient training instances, and (ii) with the increase in the size of the original feature sets of the different datasets, the cardinality of the optimal feature subset identified by MIFS-ND also increases. For example, in the case of the UCI datasets, when the size of the feature set varies from 4 to 61 (as shown in Table 4), the cardinality of the optimal feature subset varies from 3 to 6, whereas in the case of the intrusion and text categorization datasets, the cardinality of the feature subsets varies from 5 to 10 due to the increased size of the feature sets. Figure 1 establishes this fact for the 12 datasets.

Figure 1: Optimal range of the size of feature subsets for different datasets

5.3. Discussion

Feature selection is an unavoidable part of any classification pipeline. We use mutual information theory to select a subset of features from an original feature set based on feature-feature and feature-class information values. We evaluate our feature selection method using network security, text categorization, selected UCI, and gene expression datasets. From the experimental analysis, we observe that for most datasets, the classification accuracy is much better with a subset of features than with the full feature set. Due to the significant reduction in the number of features, better computational efficiency is also achieved. As shown in Figure 2, the computational time increases significantly for Random Forests but remains nearly constant for SVMs. The computational time increases moderately for the Decision Tree and KNN classifiers as the number of features grows.

Figure 2: Execution time performance for different classifiers

5.3.1. Performance Analysis on Intrusion Datasets

On the network datasets, we found high classification accuracy for all the classifiers with the 10% KDD dataset, as shown in Figures 6(e) to 6(h). We found better classification accuracy using Decision Trees, Random Forests, and the KNN classifier than with the SVM on the corrected KDD dataset for the proposed algorithm, as shown in Figures 6(a) to 6(d). Similarly, as shown in Figures 6(i) to 6(l), our algorithm produces better results on the NSL-KDD dataset when k ≥ 6, where k is the number of features in a subset, whereas when k ≤ 5, the classification accuracy is relatively lower.

5.3.2. Performance Analysis on UCI Datasets

We perform experiments to analyze the classification accuracy of our proposed MIFS-ND method on the UCI datasets. With the Wine dataset, Decision Trees, Random Forests, and the KNN classifier give better classification accuracy for the proposed method, but the accuracy is somewhat lower with the SVM classifier, as plotted in Figures 5(a), 5(b), 5(c), and 5(d), respectively. As shown in Figures 5(e) to 5(t), the classifiers show high classification accuracy on the Monk1, Monk2, Monk3, and Iris datasets. We also compare the performance of our method, MIFS-ND, with MIFS [24] and MIFS-U [25] on the Sonar and Ionosphere datasets, as shown in Figure 3(a) and Figure 3(b), respectively. The comparison shows that the proposed algorithm gives better classification accuracy on these two UCI datasets.

5.3.3. Performance Analysis on Text Categorization Datasets

The proposed feature selection method is also applied to text datasets to evaluate its performance.
From the experimental results, we observe that the Bloggender-female and Bloggender-male datasets show somewhat poorer classification accuracy with Decision Trees, as shown in Figures 7(a) and 7(d), respectively. However, the method shows almost equal classification accuracy for Random Forests and KNN, as plotted in Figures 7(b), 7(c), 7(e), and 7(f), respectively.

5.3.4. Performance Analysis on Gene Expression Datasets

The proposed MIFS-ND method is also tested on two high-dimensional gene expression datasets, namely Lymphoma and Colon Cancer. The classification accuracy on the Lymphoma dataset is very good for all four classifiers, as shown in Figures 8(a)-8(d). Compared to mRMR, MIFS-ND gives higher classification accuracy for all four classifiers. However, the accuracy drops from feature number 7 onwards for the SVM classifier only. Similarly, on the Colon Cancer dataset, Random Forests give high classification accuracy for MIFS-ND, as shown in Figure 8(f), compared to the other feature selection methods. But KNN and SVM give higher classification accuracy for the mRMR method, as plotted in Figure 8(g) and Figure 8(h), respectively. For the Decision Tree, the classification accuracy of mRMR and MIFS-ND is almost identical, as shown in Figure 8(e).

Figure 3: Comparison of MIFS-ND with MIFS, MIFS-U and mRMR for the (a) Ionosphere and (b) Sonar datasets

Figure 4: Comparison of MIFS-ND with the Entropy measure, GINI info, ICA-MI, Local ICA-MI and mRMR for the (a) Wine and (b) Iris datasets

5.4. Evaluation of Classification Performance based on t-Test

We use McNemar's test [41] to evaluate the performance of our method in terms of classification accuracy using the four different classifiers. According to McNemar's test, two algorithms can produce four possible outcomes on each instance: TT, TF, FT, and FF, where T and F stand for true (correct) and false (incorrect) classification, respectively. The test computes a z-score, defined in Equation 3, that measures how statistically significant the difference between the results is. The arrows (↑, ↓) in Tables 5-11 indicate which classifier performs better on a given dataset.

z = (|N_TF − N_FT| − 1) / √(N_TF + N_FT)    (3)

where N_TF and N_FT are the numbers of instances of a dataset on which the two classifiers disagree, i.e., one classifies correctly while the other misclassifies.
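Equation 3 can be computed directly from the per-instance outcomes of the two classifiers being compared; a minimal sketch (illustrative, with our own function name) follows. By the usual normal approximation, |z| greater than about 1.96 indicates a difference that is significant at the 5% level.

```python
# Small illustrative sketch of the McNemar z-score of Equation 3, computed from the
# per-instance correctness of two classifiers evaluated on the same instances.
from math import sqrt

def mcnemar_z(correct_a, correct_b):
    """correct_a, correct_b: booleans saying whether each classifier got each instance right."""
    n_tf = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)   # A right, B wrong
    n_ft = sum(1 for a, b in zip(correct_a, correct_b) if b and not a)   # B right, A wrong
    if n_tf + n_ft == 0:
        return 0.0
    return (abs(n_tf - n_ft) - 1) / sqrt(n_tf + n_ft)
```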
From the test results, it can be observed that for the Wine dataset, the Decision Tree's performance is better than that of KNN and SVM, but Random Forests perform better than Decision Trees for all the feature selection methods, as shown in Table 5. On the Monk1 dataset, KNN and Random Forests both give better performance than Decision Trees, whereas the Decision Tree's performance is better than that of SVM, as shown in Table 6. On the Monk2, Monk3, and Ionosphere datasets, the Decision Tree classifier shows better performance than KNN and SVM, whereas Random Forests perform better than Decision Trees, as shown in Tables 7, 8, and 11. As given in Table 9, the Decision Tree's performance is better than that of SVM, but KNN and Random Forests show better performance than Decision Trees. On the Sonar dataset, the performance of KNN and Random Forests is better than that of Decision Trees, whereas the Decision Tree's performance is better than that of SVM, as shown in Table 10.

Table 5: t-test for Wine dataset
Feature selection method   DT vs KNN    DT vs RF     DT vs SVM
ChiSquare                  4.3410 ↓     1.5539 ↑     8.7818 ↓
GainRatio                  3.1201 ↓     2.2159 ↑     8.6927 ↓
InfoGain                   3.4758 ↓     1.3790 ↑     8.8202 ↓
ReliefF                    3.7773 ↓     2.4859 ↑     8.6646 ↓
SymmetricU                 2.9723 ↓     1.9829 ↑     8.4879 ↓
MIFS-ND                    2.7097 ↓     2.6000 ↑     8.4917 ↓

Table 6: t-test for Monk1 dataset
Feature selection method   DT vs KNN    DT vs RF     DT vs SVM
ChiSquare                  3.1888 ↑     4.2504 ↑     12.6320 ↓
GainRatio                  3.2269 ↑     4.4338 ↑     12.5119 ↓
InfoGain                   1.8995 ↓     1.8943 ↑     10.7272 ↓
ReliefF                    5.5811 ↑     5.5811 ↑     12.2406 ↓
SymmetricU                 3.7420 ↑     4.3032 ↑     12.6151 ↓
MIFS-ND                    2.7419 ↑     3.8916 ↑     12.4085 ↓

Table 7: t-test for Monk2 dataset
Feature selection method   DT vs KNN    DT vs RF     DT vs SVM
ChiSquare                  2.3840 ↓     1.0071 ↓     11.0991 ↓
GainRatio                  2.2699 ↓     0.5800 ↓     10.9840 ↓
InfoGain                   1.7675 ↓     1.2980 ↓     10.9856 ↓
ReliefF                    0.7857 ↑     0.2196 ↑     11.2472 ↓
SymmetricU                 2.8095 ↓     1.4083 ↓     11.2910 ↓
MIFS-ND                    0.6572 ↑     0.5255 ↑     11.0219 ↓

Table 8: t-test for Monk3 dataset
Feature selection method   DT vs KNN    DT vs RF     DT vs SVM
ChiSquare                  2.6415 ↓     6.4908 ↓     13.8191 ↓
GainRatio                  2.4969 ↓     6.4944 ↑     13.9999 ↓
InfoGain                   1.3687 ↓     6.0628 ↑     13.7995 ↓
ReliefF                    0.6791 ↑     6.5169 ↑     13.7592 ↓
SymmetricU                 2.3657 ↓     6.2318 ↑     13.8910 ↓
MIFS-ND                    1.2637 ↓     5.9339 ↓     13.7014 ↓

Table 9: t-test for Iris dataset
Feature selection method   DT vs KNN    DT vs RF     DT vs SVM
ChiSquare                  0.1215 ↑     1.5815 ↑     8.2768 ↓
GainRatio                  0.1250 ↑     1.7166 ↑     8.2622 ↓
InfoGain                   0.1768 ↑     1.6919 ↑     8.2618 ↓
ReliefF                    0.1250 ↓     1.2877 ↑     8.3372 ↓
SymmetricU                 0.6637 ↑     1.4462 ↑     8.3071 ↓
MIFS-ND                    0.2754 ↑     1.6303 ↑     8.2772 ↓

Table 10: t-test for Sonar dataset
Feature selection method   DT vs KNN    DT vs RF     DT vs SVM
ChiSquare                  0.7063 ↑     1.3416 ↑     7.2530 ↓
GainRatio                  1.6916 ↑     1.6897 ↑     7.5730 ↓
InfoGain                   1.8612 ↑     1.4681 ↑     6.6444 ↓
ReliefF                    1.4308 ↑     1.8329 ↑     7.0064 ↓
SymmetricU                 1.8296 ↑     2.5176 ↑     6.9518 ↓
MIFS-ND                    1.7365 ↑     1.8857 ↑     7.2041 ↓

Table 11: t-test for Ionosphere dataset
Feature selection method   DT vs KNN    DT vs RF     DT vs SVM
ChiSquare                  0.8715 ↓     1.4005 ↑     9.9160 ↓
GainRatio                  0.8175 ↓     1.3189 ↑     9.8643 ↓
InfoGain                   0.4200 ↓     1.6768 ↑     9.7896 ↓
ReliefF                    0.7503 ↓     1.7440 ↑     9.8320 ↓
SymmetricU                 0.6462 ↓     1.9885 ↑     9.7559 ↓
MIFS-ND                    1.6916 ↑     1.6897 ↑     7.5730 ↓

6. Conclusion and Future Work

In this paper, we have described an effective feature selection method that selects a subset of high-ranked features which are strongly relevant but non-redundant, for a wide variety of real-life datasets. To select high-ranked features, an optimization criterion used in the NSGA-II algorithm is applied. The method has been evaluated in terms of classification accuracy as well as execution time using several network intrusion datasets, a few UCI datasets, text categorization datasets, and two gene expression datasets. We compare the classification accuracy of the selected features using the Decision Tree, Random Forest, KNN, and SVM classifiers. The overall performance of our method has been found to be excellent in terms of both classification accuracy and execution time on all these datasets. Development of an incremental feature selection tool based on MIFS-ND is underway to support a wide variety of application domains.

Acknowledgment

This work is supported by the Department of Information Technology (DIT). The authors are thankful to the funding agency.
Figure 5: Accuracy of different classifiers on the non-intrusion datasets. Panels (a)-(t) show Decision Tree, Random Forest, KNN and SVM accuracy for the Wine, Monk1, Monk2, Monk3 and Iris datasets; panels (u)-(z) show Decision Tree, Random Forest and KNN accuracy for the Ionosphere and Sonar datasets.

References

[1] A. Arauzo-Azofra, J. L. Aznarte, J. M. Benítez, Empirical study of feature selection methods based on individual feature evaluation for classification problems, Expert Systems with Applications 38 (7) (2011) 8170–8177.
[2] M. Dash, H. Liu, Feature selection for classification, Intelligent Data Analysis 1 (3) (1997) 131–156.
[3] K. Polat, S. Güneş, A new feature selection method on classification of medical datasets: Kernel F-score feature selection, Expert Systems with Applications 36 (7) (2009) 10367–10373.
[4] H. Liu, L. Yu, Toward integrating feature selection algorithms for classification and clustering, Knowledge and Data Engineering, IEEE Transactions on 17 (4) (2005) 491–502.
[5] J. M. Cadenas, M. C. Garrido, R. Martínez, Feature subset selection filter-wrapper based on low quality data, Expert Systems with Applications 40 (16) (2013) 6241–6252.
[6] A. Unler, A. Murat, R. B. Chinnam, mr2PSO: A maximum relevance minimum redundancy feature selection method based on swarm intelligence for support vector machine classification, Information Sciences 181 (20) (2011) 4625–4641.
[7] D. D. Lewis, Feature selection and feature extraction for text categorization, in: Proceedings of the Workshop on Speech and Natural Language, Association for Computational Linguistics, 1992, pp. 212–217.
[8] G. Hughes, On the mean accuracy of statistical pattern recognizers, Information Theory, IEEE Transactions on 14 (1) (1968) 55–63.
[9] I. Guyon, A. Elisseeff, An introduction to variable and feature selection, The Journal of Machine Learning Research 3 (2003) 1157–1182.
[10] A. L. Blum, P. Langley, Selection of relevant features and examples in machine learning, Artificial Intelligence 97 (1) (1997) 245–271.
[11] H.-H. Hsu, C.-W. Hsieh, M.-D. Lu, Hybrid feature selection by combining filters and wrappers, Expert Systems with Applications 38 (7) (2011) 8144–8150.
[12] K.-C. Khor, C.-Y. Ting, S.-P. Amnuaisuk, A feature selection approach for network intrusion detection, in: Information Management and Engineering, 2009. ICIME '09. International Conference on, IEEE, 2009, pp. 133–137.
[13] Y. Yang, J. O. Pedersen, A comparative study on feature selection in text categorization, in: ICML, Vol. 97, 1997, pp. 412–420.
[14] J.-T. Horng, L.-C. Wu, B.-J. Liu, J.-L. Kuo, W.-H. Kuo, J.-J. Zhang, An expert system to classify microarray gene expression data using gene selection by decision tree, Expert Systems with Applications 36 (5) (2009) 9072–9081.
[15] Y. Saeys, I. Inza, P. Larrañaga, A review of feature selection techniques in bioinformatics, Bioinformatics 23 (19) (2007) 2507–2517.
[16] R. Caruana, D. Freitag, Greedy attribute selection, in: ICML, Citeseer, 1994, pp. 28–36.
[17] H. Frohlich, O. Chapelle, B. Scholkopf, Feature selection for support vector machines by means of genetic algorithm, in: Tools with Artificial Intelligence, 2003. Proceedings. 15th IEEE International Conference on, IEEE, 2003, pp. 142–148.
[18] S.-W. Lin, K.-C. Ying, C.-Y. Lee, Z.-J. Lee, An intelligent algorithm with feature selection and decision rules applied to anomaly intrusion detection, Applied Soft Computing 12 (10) (2012) 3285–3290.
[19] L. Yu, H. Liu, Redundancy based feature selection for microarray data, in: Proceedings of the tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2004, pp. 737–742.
[20] D. K. Bhattacharyya, J. K. Kalita, Network Anomaly Detection: A Machine Learning Perspective, CRC Press, 2013.
[21] S. Nemati, M. E. Basiri, N. Ghasem-Aghaee, M. H. Aghdam, A novel ACO-GA hybrid algorithm for feature selection in protein function prediction, Expert Systems with Applications 36 (10) (2009) 12086–12094.
[22] P. Mitra, C. Murthy, S. K. Pal, Unsupervised feature selection using feature similarity, Pattern Analysis and Machine Intelligence, IEEE Transactions on 24 (3) (2002) 301–312.
[23] R. B. Bhatt, M. Gopal, On fuzzy-rough sets approach to feature selection, Pattern Recognition Letters 26 (7) (2005) 965–975.
[24] R. Battiti, Using mutual information for selecting features in supervised neural net learning, Neural Networks, IEEE Transactions on 5 (4) (1994) 537–550.
[25] N. Kwak, C.-H. Choi, Input feature selection for classification problems, Neural Networks, IEEE Transactions on 13 (1) (2002) 143–159.
[26] H. Peng, F. Long, C. Ding, Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy, Pattern Analysis and Machine Intelligence, IEEE Transactions on 27 (8) (2005) 1226–1238.
[27] P. A. Estévez, M. Tesmer, C. A. Perez, J. M. Zurada, Normalized mutual information feature selection, Neural Networks, IEEE Transactions on 20 (2) (2009) 189–201.
[28] L. D. Vignolo, D. H. Milone, J. Scharcanski, Feature selection for face recognition based on multi-objective evolutionary wrappers, Expert Systems with Applications 40 (13) (2013) 12086–12094.
[29] K. Kira, L. A. Rendell, A practical approach to feature selection, in: Proceedings of the Ninth International Workshop on Machine Learning, Morgan Kaufmann Publishers Inc., 1992, pp. 249–256.
[30] H. Liu, R. Setiono, Chi2: Feature selection and discretization of numeric attributes, in: Tools with Artificial Intelligence, 1995. Proceedings., Seventh International Conference on, IEEE, 1995, pp. 388–391.
[31] M. A. Hall, L. A. Smith, Feature selection for machine learning: Comparing a correlation-based filter approach to the wrapper, in: FLAIRS Conference, 1999, pp. 235–239.
[32] Y. Ke, R. Sukthankar, PCA-SIFT: A more distinctive representation for local image descriptors, in: Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on, Vol. 2, IEEE, 2004, pp. II-506.
[33] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, I. H. Witten, The WEKA data mining software: an update, ACM SIGKDD Explorations Newsletter 11 (1) (2009) 10–18.
[34] B. Swingle, Rényi entropy, mutual information, and fluctuation properties of fermi liquids, Physical Review B 86 (4) (2012) 045109.
[35] A. Kraskov, H. Stögbauer, P. Grassberger, Estimating mutual information, Phys. Rev. E 69.
[36] K. Deb, S. Agrawal, A. Pratap, T. Meyarivan, A fast elitist non-dominated sorting genetic algorithm for multi-objective optimization: NSGA-II, Lecture Notes in Computer Science 1917 (2000) 849–858.
[37] P. E. Lutu, A. P. Engelbrecht, A decision rule-based method for feature selection in predictive data mining, Expert Systems with Applications 37 (1) (2010) 602–609.
[38] Q. Song, J. Ni, G. Wang, A fast clustering-based feature subset selection algorithm for high dimensional data, Knowledge and Data Engineering, IEEE Transactions on 25 (1) (2013) 1–14.
[39] B. Frénay, G. Doquire, M. Verleysen, Theoretical and empirical study on the potential inadequacy of mutual information for feature selection in classification, Neurocomputing 112 (2013) 64–78.
[40] G. Brown, A new perspective for information theoretic feature selection, in: International Conference on Artificial Intelligence and Statistics, 2009, pp. 49–56.
[41] T. G. Dietterich, Approximate statistical tests for comparing supervised classification learning algorithms, Neural Computation 10 (7) (1998) 1895–1923.

Figure 6: Accuracy of different classifiers on the intrusion datasets. Panels (a)-(d): Decision Tree, Random Forest, KNN and SVM accuracy for Corrected KDD; panels (e)-(h): the same classifiers for 10% KDD; panels (i)-(l): the same classifiers for NSL-KDD.

Figure 7: Accuracy of different classifiers on the text categorization datasets. Panels (a)-(c): Decision Tree, Random Forest and KNN accuracy for Bloggender-Female; panels (d)-(f): the same classifiers for Bloggender-Male.

Figure 8: Accuracy of different classifiers on the gene expression datasets. Panels (a)-(d): Decision Tree, Random Forest, KNN and SVM accuracy for Lymphoma; panels (e)-(h): the same classifiers for Colon Cancer.