A Dependency-Based Search Strategy for Feature Selection

Ahmed Al-Ani
Faculty of Engineering and Information Technology
University of Technology, Sydney, Australia
ahmed@eng.uts.edu.au

Abstract

Feature selection has become an increasingly important field of research. It aims at finding optimal feature subsets that can achieve better generalization on unseen data. However, this can be a very challenging task, especially when dealing with large feature sets. Hence, a search strategy is needed to explore a relatively small portion of the search space in order to find "semi-optimal" subsets. Many search strategies have been proposed in the literature; however, most of them do not take into consideration relationships between features. Because features usually have different degrees of dependency among each other, we propose in this paper a new search strategy that utilizes dependency between feature pairs to guide the search in the feature space. When compared to other well-known search strategies, the proposed method achieved superior performance.

Key words: Feature selection, search strategy, dependency, mutual information

Preprint submitted to Expert Systems with Applications, September 23, 2008

1. Introduction

The identification of optimal feature subsets plays an important role for a wide range of classification problems, as it can lead to better performance in terms of accuracy and computational cost. The selection of good feature subsets requires an evaluation measure to estimate the goodness of subsets and a search strategy to generate candidate feature subsets (1). Evaluation measures are broadly divided into two categories: filters and wrappers. A filter method uses a measure that is independent of the predetermined classification algorithm to estimate the goodness of candidate subsets. A wrapper method, on the other hand, estimates the goodness of a candidate subset using the classification accuracy obtained by feeding that particular subset to the adopted classification algorithm. Thus, wrappers are computationally more expensive than filters, but they are usually more accurate.

Searching for the optimal subset, which can achieve the best performance according to the defined evaluation measure, is quite a challenging task. Exhaustive search, which considers all possible subsets, is guaranteed to find the optimal solution. However, it is impractical to run, even with moderate-size feature sets. A number of other search strategies that differ in their computational cost and optimality have been proposed in the literature. One of the early search strategies is branch and bound (2), which requires the evaluation function to be monotonic. This method can be computationally expensive for large data sets. Sequential search methods, such as sequential forward selection and sequential backward elimination (3), have been widely used because of their simplicity and relatively low computational cost. The major drawback of the traditional sequential search methods is the nesting effect, i.e., in backward search when a feature is deleted, it cannot be re-selected, and in forward search when a feature is selected, it cannot be deleted afterwards. A slightly more reliable sequential search method is the plus-l-minus-r (l − r), which considers removing features that were previously selected and selecting features that were previously eliminated (4).
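As a minimal illustration of the sequential strategies, forward selection around a generic wrapper-style evaluation can be sketched as follows; the evaluate callback and the subset size are placeholders rather than values taken from this paper:

```python
from typing import Callable, List, Set

def sequential_forward_selection(n_features: int,
                                 evaluate: Callable[[Set[int]], float],
                                 n_select: int) -> List[int]:
    """Greedy sequential forward selection: at each step, add the feature
    whose inclusion maximizes the evaluation measure (filter or wrapper)."""
    selected: List[int] = []
    remaining = set(range(n_features))
    while len(selected) < n_select and remaining:
        # score every single-feature extension of the current subset
        scores = {f: evaluate(set(selected) | {f}) for f in remaining}
        best = max(scores, key=scores.get)
        selected.append(best)      # nesting effect: 'best' is never removed later
        remaining.remove(best)
    return selected
```

The greedy inclusion step is precisely what produces the nesting effect: once a feature enters the selected list, it is never reconsidered.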
Another trend of search strategies is stochastic search, where it has been found that including some randomness in the search process makes it less sensitive to the dataset (1), and hence helps avoid local minima. Some of the well-known stochastic methods used in feature selection are: simulated annealing (5), Genetic Algorithms (GA) (6), Ant Colony Optimization (ACO) (7), Particle Swarm Optimization (PSO) (8) and Differential Evolution (DE) (9).

Despite the encouraging results achieved in certain cases, the main drawback of most of the above methods is that they do not take into consideration the importance of features and how they are related. This may have an effect on the performance, especially for complicated search problems. Some methods attempt to estimate the relevance of subsets of features as the basis for selection. It has been shown that estimating the relevance of individual features may not be difficult; the real challenge is to estimate the relevance of subsets of features. This issue has been studied by a number of researchers (10; 11; 12; 13; 14). Some of the interesting findings of Guyon et al. (1) are:

• a feature that is irrelevant by itself may become relevant when used with other features;
• a relevant feature may not be needed because of possible redundancies.

Because of the difficulty associated with estimating the relevance of subsets of features, the proposed search strategy adopts the wrapper approach and focuses on dependency between feature pairs as a means to guide the search.

The paper is organized as follows: the next section explains the importance of dependency between features. Section 3 describes the proposed search strategy. Experimental results are presented in section 4, and a conclusion is given in section 5.

2. Dependency Between Features

One of the promising approaches in searching the feature space is population-based search, where the current population consists of a number of feature subsets. Each of these subsets is modified, in a certain way, to produce the next generation of subsets. Examples of population-based search procedures include Genetic Algorithms (GA), Particle Swarm Optimization (PSO) and Differential Evolution (DE). None of these methods was originally developed for feature selection, and hence, their original implementations do not take into consideration relationships between features when producing future populations.

Let's presume that F is the original feature set with n = 20 features, and that S1 and S2 are two subsets of the current population, each with m = 4 features. Let's also consider that S1 is one of the best subsets tested so far, and that we would like to modify S2, with the help of S1, to produce a better subset for the next population. If S1 = {f2, f5, f9, f15}, S2 = {f1, f2, f6, f17} and features f1 and f9 are highly dependent, then it is logical to only consider replacing f6 and/or f17. Furthermore, instead of randomly choosing any feature from F as a replacement, the search space can be reduced using the dependency between features. For instance, if f6 is to be replaced, we first find which of the two features f5 and f15 is closer to it, i.e., has a higher dependency value. Let's presume that f15 is closer to f6. Then, candidate replacement features are reduced to the ones that lie (in terms of dependency) between f6 and f15.
The newly produced subset will not only have a chance of being a better subset than S2, but it might also outperform S1, if this replacement makes it closer to the optimal solution.

In addition to the above, dependency is also useful in identifying the K best subsets selected so far. For instance, we cannot choose both S1 = {f2, f5, f9, f15} and S3 = {f1, f2, f5, f15} to be among the K best subsets, since they are almost identical, given that f1 and f9 are highly dependent as mentioned earlier. This restriction enhances the exploration capability, as diversity among the "good" subsets is quite important in avoiding local minima.

A well-known approach to estimating the dependency between two random variables is based on the concept of mutual information (15). The mutual information between random variables X and Y is defined as:

I(X;Y) = \int \int P_{XY}(x,y) \log \frac{P_{XY}(x,y)}{P_X(x) P_Y(y)} \, dx \, dy    (1)

The entropy, which is a measure of the uncertainty of random variables, is usually used to represent mutual information according to the following formula:

I(X;Y) = H(X) + H(Y) - H(X,Y)    (2)
       = H(X) - H(X|Y)    (3)

where H(X) is the entropy of X, H(X,Y) is the joint entropy of the two random variables, and H(X|Y) is the conditional entropy, which represents the uncertainty in X after knowing Y.

When dealing with real data, the main problem in evaluating I(X;Y) is the estimation of the probabilities P_X(x), P_Y(y) and P_{XY}(x,y). One possible solution is to subdivide the XY plane into boxes of size ∆x∆y. By doing so, we are able to estimate the discrete values of P_X, P_Y and P_{XY}. An alternative approach was proposed in (16), which uses a variable box size over the XY plane. The method presented in (17) estimates the MI by an adaptive partitioning of the observation space. A Parzen window method is proposed in (18) to estimate the input distribution. For simplicity, we used here a fixed box size implementation. Hence, the MI can be rewritten as:

I(X;Y) = \sum_{r_x} \sum_{r_y} P_{XY}(r_x, r_y) \log \frac{P_{XY}(r_x, r_y)}{P_X(r_x) P_Y(r_y)}    (4)

where r_x and r_y are the discrete levels for X and Y respectively.
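A minimal sketch of the fixed box size estimate of Eq. (4) is given below, assuming equally spaced bins; the number of bins and the use of base-2 logarithms are illustrative choices, not values taken from the paper:

```python
import numpy as np

def mutual_information(x: np.ndarray, y: np.ndarray, bins: int = 10) -> float:
    """Estimate I(X;Y) as in Eq. (4) by discretizing the XY plane into a
    fixed grid of boxes; x and y are 1-D samples of equal length."""
    counts, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = counts / counts.sum()            # joint probabilities P_XY(rx, ry)
    px = pxy.sum(axis=1, keepdims=True)    # marginal P_X(rx)
    py = pxy.sum(axis=0, keepdims=True)    # marginal P_Y(ry)
    nz = pxy > 0                           # empty boxes contribute nothing
    return float(np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])))
```

In practice, the pairwise values I(f_a; f_b) and the entropies H(f_a) can be computed once for all feature pairs and reused throughout the search.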
Let's consider the earlier example about subset S2 and suppose that we have the entropies shown in Fig. 1, where H(C) represents the entropy of the target classes. The objective here is to select the optimal subset of features that would minimize the uncertainty in C. Features f1, f2 and f17 provide good information about the target classes, as indicated by the area covered in shaded vertical lines. The figure also shows that the additional information provided by f15 is more than that provided by f6 (the shaded grid areas). However, instead of presuming that f15 is the best feature when combined with f1, f2 and f17, why not search the area that surrounds f15, i.e., features that are closer to f15 than f6? This limited space may include good features, such as f11, which, when replacing f6, will not only make S2 a better subset, but will also make it better than the subset {f1, f2, f15, f17}.

Figure 1: Importance of features in subset S2, and identifying a candidate replacement feature for f6

3. The Proposed Search Strategy

The proposed search strategy is similar to GA, ACO, PSO and DE in the sense that they are all population-based methods. However, unlike the other methods, the proposed search strategy utilizes dependency between feature pairs to guide the search. We will refer to it as DSS (Dependency-based Search Strategy), and it is implemented using a wrapper approach as follows:

1. Randomly initialize the subsets of the first population and sort them according to their fitness values.
2. For each subset, Si, of the current population:
   • Make a copy of Si.
   • Randomly choose one of the K best subsets, Sk.
   • Identify candidate features for replacement, which are the features of Si that are not present in Sk. Place those features in S′i.
   • Initially assign 0 to the accumulated difference between the original and newly generated subset, AcDiff = 0.
   • do
     – Randomly choose one of the features of S′i, fi,j, and find the feature fk,l in Sk that maximizes I(fi,j; fk,l).
     – Identify the candidate features to substitute fi,j according to the following equation:

       S_m = \arg_{f_m} \big( I(f_{k,l}; f_m) \geq I(f_{k,l}; f_{i,j}) \big)    (5)

     – Randomly choose one of the features of Sm to substitute fi,j, and remove fi,j from the candidate list: S′i ← S′i \ {fi,j}.
     – Calculate the accumulated difference between the replaced features and their replacements, as follows:

       AcDiff = AcDiff + \big( H(f_m) - I(f_m; f_{i,j}) \big)    (6)

   • while AcDiff is less than a certain constant and S′i ≠ ∅.
   • Repeat the same procedure of step 2 for the remaining subsets.
3. Evaluate the newly generated subsets.
4. Select the K best subsets from the original and newly generated subsets, given that there is at least a certain ratio of difference between any two subsets. This is measured by averaging the mutual information between each feature from a given subset and its closest counterpart in another subset.
5. Select the remaining subsets of the population in a similar procedure as in the previous step.
6. Discard the unselected subsets.
7. If the stopping criterion is not met, go to step 2 to generate another population.

The rationale behind Eq. 5 is that if fi,j is the closest feature to fk,l and the two features are highly dependent on each other, then Sm will only consist of fi,j and fk,l. Hence, the replacement of fi,j will have little or no effect on the performance of Si. On the other hand, if there is only a small degree of dependency between fi,j and fk,l, then Sm will consist of other features that are closer to fk,l than fi,j, and in this case the replacement will most likely have an effect on the performance of Si. In other words, Sm will have an adaptive size, where the number of candidate replacement features depends upon the degree of dependency between fi,j and fk,l.
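The following sketch is one possible reading of step 2 above, assuming a precomputed pairwise mutual information matrix mi, a vector of feature entropies h, and a placeholder threshold for the accumulated difference; the role of the accumulated difference is discussed further below:

```python
import random

def modify_subset(Si, Sk, mi, h, threshold):
    """One pass of step 2: replace features of Si that are absent from the
    guiding subset Sk, using pairwise dependency, until the accumulated
    difference of Eq. (6) exceeds `threshold` or no candidates remain.
    Si, Sk : lists of feature indices
    mi     : 2-D array with mi[a, b] = I(f_a; f_b)
    h      : 1-D array with h[a] = H(f_a)
    """
    new_subset = list(Si)                          # copy of Si
    candidates = [f for f in Si if f not in Sk]    # S'_i
    ac_diff = 0.0
    while candidates and ac_diff < threshold:
        f_ij = random.choice(candidates)           # feature considered for replacement
        candidates.remove(f_ij)                    # S'_i <- S'_i \ {f_ij}
        f_kl = max(Sk, key=lambda f: mi[f_ij, f])  # closest feature of Sk to f_ij
        # Eq. (5): features at least as dependent on f_kl as f_ij is;
        # excluding features already in the subset is an added practical guard
        Sm = [f for f in range(len(h))
              if mi[f_kl, f] >= mi[f_kl, f_ij] and f not in new_subset]
        if not Sm:
            continue
        f_m = random.choice(Sm)
        new_subset[new_subset.index(f_ij)] = f_m   # substitute f_ij by f_m
        ac_diff += h[f_m] - mi[f_m, f_ij]          # Eq. (6)
    return new_subset
```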
Eq. 6 is used to specify when to stop replacing features. If features fm and fi,j are highly dependent on each other, then I(fm; fi,j) will approach H(fm) according to Eq. 3 (the uncertainty in one of them after knowing the other will be close to 0) and AcDiff will stay almost unchanged. Accordingly, this replacement is not expected to make a noticeable difference to the performance of Si, and hence, the algorithm will consider replacing another feature. On the other hand, if fm and fi,j are not highly dependent, then AcDiff will increase by a larger amount due to this replacement. When AcDiff exceeds a certain threshold, there will be some difference between the original and newly generated subsets. Note that we do not want the original and newly generated subsets to be very different, as this may lead to big jumps in the search space and will hinder the convergence of the algorithm.

Identifying the K best subsets also plays an important role in guiding the search, as it is important to choose good and diverse subsets. Concentrating on the goodness of subsets without considering how similar those subsets are may cause the search to be trapped in local minima. Thus, steps 4 and 5 aim at selecting good and yet diverse subsets to better explore the search space.
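Steps 4 and 5 can be sketched as follows; since the text only states that the mutual information between each feature and its closest counterpart is averaged, normalizing each term by the feature entropy (so that the measure lies roughly between 0 and 1 and a ratio of difference such as 0.1 can be applied) is an assumption made here:

```python
def subset_similarity(Sa, Sb, mi, h):
    """Average, over the features of Sa, of the dependency with the closest
    counterpart in Sb; normalization by the feature entropy is an assumption."""
    return sum(max(mi[f, g] for g in Sb) / h[f] for f in Sa) / len(Sa)

def select_diverse_best(subsets, fitness, mi, h, k, min_diff=0.1):
    """Steps 4 and 5: keep the fittest subsets such that any two retained
    subsets differ by at least `min_diff` under the measure above."""
    kept = []
    for s in sorted(subsets, key=fitness, reverse=True):
        if all(1.0 - subset_similarity(s, t, mi, h) >= min_diff for t in kept):
            kept.append(s)
        if len(kept) == k:
            break
    return kept
```

Calling the same routine with k equal to the population size mirrors step 5, which fills the rest of the population under the same diversity constraint.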
4. Experimental Results

Three different classification problems are considered. In the first problem, the Madelon dataset, obtained from the UCI repository, is used. EEG data and speech segments are used in the second and third problems respectively. The following methods were implemented to select feature subsets: genetic algorithms, particle swarm optimization, differential evolution and the proposed dependency-based search strategy.

A GA-based feature selection solution is typically a fixed-length binary string representing a feature subset, where the value of each position in the string represents the presence or absence of a particular feature. A traditional GA was used here with a crossover probability of 0.8 and a mutation probability of 0.05. The obtained strings are constrained to have the number of '1's matching a predefined number of desired features. Feature selection using PSO is also implemented using binary strings; it uses the particles' best and global best values to guide the search. DE-based feature selection utilizes a real-valued optimizer and applies rounding and uniform crossover operators. The proposed DSS is implemented as explained in the previous section. Reasonable values of K and the ratio of difference between subsets were found to be around 5 and 0.1 respectively.

For all experiments described below, a population size of 50 was used by the four feature selection methods. In addition, all four methods started with the same initial population, and they all have the same stopping criterion, which is reaching a pre-defined maximum number of iterations.

4.1. The Madelon Dataset

The Madelon dataset was designed for the NIPS 2003 variable selection benchmark. It is a two-class classification problem with sparse binary input features. There are 500 features, only 5 of which are useful; the rest are either redundant or irrelevant. 2000 patterns were used for training and 600 for validation. This dataset represents quite a useful benchmark for feature selection, as the optimal subset of features is already known.

For all of the four methods, the classification accuracy obtained using the k-nearest neighbor (k-NN) classifier with k = 5 was used as the fitness function, the desired number of selected features was set to 5, and the maximum number of iterations was set to 100. The experiment was repeated 30 times and the averaged classification accuracy values are shown in Fig. 2. The results indicate that DSS not only achieved a higher accuracy by the end of the 100 iterations, but also required a smaller number of iterations to find the optimal subset. PSO was found to be the second best, where it managed to find the optimal solution in one run out of the 30 repetitions. Despite being good optimization algorithms, GA and DE did not perform very well in this experiment, as they did not find the optimal solution in any run. This is mainly due to the large number of irrelevant and redundant features. The standard deviation values of the classification accuracy over the 30 runs for DSS, DE, GA and PSO are: 0.0, 3.23, 5.40 and 2.30 respectively. This indicates that GA has the highest fluctuation among the four methods, and it probably means that its performance is sensitive to the initial population when dealing with large feature sets.

Figure 2: Number of iterations vs. classification accuracy for the selection of 5 features
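The wrapper fitness used in this experiment is simply the accuracy of a k-NN classifier (k = 5) restricted to the candidate features and scored on the validation patterns; a minimal sketch, with scikit-learn chosen here only for illustration:

```python
from sklearn.neighbors import KNeighborsClassifier

def knn_fitness(subset, X_train, y_train, X_val, y_val, k=5):
    """Wrapper fitness: accuracy of a k-NN classifier trained on the selected
    columns of the training patterns and scored on the validation patterns."""
    cols = list(subset)
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(X_train[:, cols], y_train)
    return clf.score(X_val[:, cols], y_val)
```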
4.2. EEG Classification

The second problem involves the classification of Electroencephalogram (EEG) signals into one of a set of pre-determined mental tasks. The data used here was obtained from the Department of Medical Informatics, University of Technology, Graz, Austria. EEG signals were recorded from three right-handed females with 56 electrodes. The subjects were asked to imagine right or left finger movements according to stimuli on a screen. The wavelet packet transform was used to extract 3 features from each channel. Hence, the total number of features extracted was 168 (56 channels × 3 features/channel). 406 trials were recorded, 208 for left movement and 198 for right. 300 patterns were used for training and the remaining 106 for testing.

The desired number of selected features was varied from 3 to 70. The selection of feature subsets was performed using the four methods mentioned earlier, and the number of iterations was set to 400. The experiment was repeated 30 times and a Linear Discriminant Analysis (LDA) classifier was used. The averaged classification accuracy values of the selected subsets are shown in Fig. 3. It is clear that DSS outperformed the other three methods in all cases, and hence it achieved excellent performance regardless of the number of selected features. The second best algorithm is DE, followed by GA and finally PSO.

Figure 3: Number of selected features vs. classification accuracy - EEG data

The detailed results for the selection of 3, 30 and 70 features are shown in Figs. 4, 5 and 6 respectively. For the selection of 3 features, the results indicate that DSS is the only method that converged to the optimal solution in most runs (29 out of 30). DE, GA and PSO managed to achieve the optimal solution in 11, 10 and 1 runs respectively. The standard deviation values of the classification accuracy for DSS, DE, GA and PSO are: 0.52, 1.11, 1.63 and 0.98 respectively.

Figure 4: Number of iterations vs. classification accuracy for the selection of 3 EEG features

The search became more complicated as the number of selected features increased. The results obtained for the selection of subsets of 30 features indicate that more time was needed for the methods to find good solutions. When evaluating the runs individually, DSS managed to outperform the other three methods in 21 runs, and it shared the highest accuracy with another method in four other runs. However, DE was a bit more consistent in its performance, where the standard deviation was 0.94, 0.67, 2.05 and 1.17 for DSS, DE, GA and PSO respectively. Searching for subsets of 70 features was even harder. On average, higher standard deviation values were recorded (1.12, 1.20, 1.45 and 1.33 for the four methods in order), which indicates higher fluctuations in performance. As with the previous case, DSS achieved higher accuracies in more runs than any other method (19 runs as a clear winner, and it shared the highest accuracy with another method in seven other runs). Fig. 6 shows an interesting trend: unlike the other three methods, which in many runs got trapped in local minima, DSS continued to improve as the number of iterations increased. In fact, DE was a bit better than DSS in the first 250 iterations, but unlike DSS, the performance of DE hardly improved after that. This trend of continuous improvement in the performance of DSS is mainly due to its enhanced exploration capability, as explained in sections 2 and 3.

Figure 5: Number of iterations vs. classification accuracy for the selection of 30 EEG features

Figure 6: Number of iterations vs. classification accuracy for the selection of 70 EEG features

In addition to the LDA classifier, the k-NN and Support Vector Machine (SVM) classifiers were also considered. Results are shown in Figs. 7 and 8 for the selection of subsets of 30 features. These figures indicate that DSS clearly outperformed the other three methods, even when considering different classifiers. It is worth mentioning that k-NN and SVM did not perform as well as LDA, due to the limited number of available patterns.

Figure 7: Number of iterations vs. classification accuracy for the selection of 30 EEG features using a kNN classifier

Figure 8: Number of iterations vs. classification accuracy for the selection of 30 EEG features using an SVM classifier

4.3. Speech Segment Classification

In the third experiment, speech segments are classified according to their manner of articulation. Nine classes were considered: vowel, semi-vowel, nasal, fricative, stop, closure, lateral, rhotic, and silence. We used speech signals from the TIMIT database, where segment boundaries were identified. Three different sets of features were extracted from each speech frame: 16 log mel-filter bank (MFB), 12 linear predictive reflection coefficient (LPR), and 10 wavelet energy band (WVT) features. A context-dependent approach was adopted to perform the classification. Thus, the features used to represent each speech segment, Segn, were the average frame features over the first and second halves of Segn and the average frame features of the previous and following segments (Segn−1 and Segn+1 respectively). Hence, the baseline feature sets based on MFB, LPR and WVT consisted of 64, 48 and 40 features respectively.
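A minimal sketch of this context-dependent representation is given below; the function name and the handling of segment boundaries are illustrative simplifications:

```python
import numpy as np

def segment_features(frames, prev_frames, next_frames):
    """Context-dependent representation of a segment: average frame features
    over its first and second halves, plus the average frame features of the
    previous and following segments (four times the per-frame dimensionality,
    e.g. 4 x 16 = 64 for the MFB set)."""
    frames = np.asarray(frames)      # shape (n_frames, n_dims)
    half = len(frames) // 2          # assumes at least two frames per segment
    return np.concatenate([frames[:half].mean(axis=0),
                           frames[half:].mean(axis=0),
                           np.asarray(prev_frames).mean(axis=0),
                           np.asarray(next_frames).mean(axis=0)])
```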
An LDA classifier was used to classify the features of each baseline set into one of the nine manner-of-articulation classes. 4000 segments were used, the first 2000 for training and the last 2000 for validation. The obtained classification accuracies for MFB, LPR and WVT were 75.40%, 63.15% and 72.80% respectively. It is clear that MFB achieved the best performance among the three baseline sets; however, it used more features than the other two baseline sets.

The three baseline feature sets were concatenated to form a new set of 152 features. The four feature selection algorithms were used to select from these features. The desired number of selected features was varied from 7 to 70. 300 iterations were used by each method, and the experiment was repeated 20 times.

The first thing that can be noticed from the averaged classification results, shown in Fig. 9, is the enhanced performance that was achieved in comparison to the three baseline feature sets. For instance, with 40 features DSS managed to achieve an average accuracy of 78.82%, which is not only higher than the accuracy of the 40 wavelet features, but also higher than the accuracy of the 64 MFB features. The figure also indicates that the difference in performance between the four feature selection methods is not large; however, DSS still managed to outperform the other three methods in all cases. DE achieved quite good results in this experiment, as the number of irrelevant and redundant features was not large.

Figure 9: Number of selected features vs. classification accuracy - speech data

The detailed results for the selection of 7 features are shown in Fig. 10. As with the last experiment, DSS managed to converge faster than the other methods; it achieved the highest accuracy in 14 runs and shared the highest accuracy with another method in 5 other runs. The standard deviation values of the classification accuracy for DSS, DE, GA and PSO are: 0.10, 0.20, 0.52 and 0.47 respectively.

Figure 10: Number of iterations vs. classification accuracy for the selection of 7 speech features

DSS also outperformed the other three methods when selecting 30 features. It achieved the highest accuracy in 17 runs and shared the highest accuracy with another method in 2 other runs. The figure also shows that unlike DE and DSS, which managed to improve their performance as the number of iterations increased, GA and PSO got trapped in local minima. The standard deviation values of the classification accuracy for the four methods in order are: 0.24, 0.26, 0.61 and 0.33.

Figure 11: Number of iterations vs. classification accuracy for the selection of 30 speech features
Similar to the EEG experiment, the results for the selection of 70 features, shown in Fig. 12, reflect the enhanced exploration capability of DSS: despite its slow start compared to DE and GA, it continued to improve, while the other methods were trapped in local minima at different locations of the search space. The figure also shows that the second best algorithm is DE, which outperformed GA and PSO by a clear margin. The standard deviation values are: 0.22, 0.22, 0.42 and 0.24 for DSS, DE, GA and PSO respectively.

Figure 12: Number of iterations vs. classification accuracy for the selection of 70 speech features

The reason behind the lower standard deviation values in this experiment is that there is less variation in the relevance of the available features in comparison to the previous two experiments. This can be noticed when comparing the classification accuracy ranges of the different figures, for example Figs. 6 and 12.

Results for the selection of subsets of 30 features when using a k-NN classifier are shown in Fig. 13. One can notice that the k-NN classifier managed to achieve better results than LDA in this experiment. The results also show that DSS clearly outperformed the other three methods, where it achieved the highest accuracy in 17 out of 20 runs.

Figure 13: Number of iterations vs. classification accuracy for the selection of 30 speech features using kNN

5. Conclusion

This paper presented a novel search strategy for feature selection. The proposed method, which utilizes dependency between feature pairs to guide the search, proved to be very successful in exploring the feature space and avoiding local minima. Three different problems were used to compare the proposed method with other population-based optimization search methods. The proposed method achieved consistently high performance and outperformed the other methods in all cases. It proved to be particularly useful when dealing with a high number of redundant and irrelevant features.

References

[1] I. Guyon, S. Gunn, M. Nikravesh and L.A. Zadeh, Feature Extraction: Foundations and Applications, Springer-Verlag, New York, 2006.
[2] P.M. Narendra and K. Fukunaga, "A branch and bound algorithm for feature subset selection," IEEE Trans. Computers, vol. 26, pp. 917-922, 1977.
[3] R. Kohavi and G.H. John, "Wrappers for feature subset selection," Artificial Intelligence, special issue on relevance, vol. 97, pp. 273-324, 1997.
[4] P.A. Devijver and J. Kittler, Pattern Recognition: A Statistical Approach, Prentice-Hall, Englewood Cliffs, NJ, 1982.
[5] S-W. Lin, T-Y. Tseng, S-Y. Chou and S-C. Chen, "A simulated-annealing-based approach for simultaneous parameter optimization and feature selection of back-propagation networks," Expert Systems with Applications, vol. 34, no. 2, pp. 1491-1499, 2008.
[6] H. Frohlich, O. Chapelle and B. Scholkopf, "Feature Selection for Support Vector Machines by Means of Genetic Algorithms," Proc. IEEE Intl. Conf. Tools with Artificial Intelligence (ICTAI'03), pp. 142-148, 2003.
[7] A. Al-Ani, "Feature Subset Selection Using Ant Colony Optimization," Int. Journal of Computational Intelligence, vol. 2, pp. 53-58, 2005.
[8] H.A. Firpi and E. Goodman, "Swarmed Feature Selection," Proc.
Applied Imagery Pattern Recognition Workshop, pp. 112-118, 2004.
[9] R. Khushaba, A. Al-Ani and A. Al-Jumaily, "Differential Evolution based Feature Subset Selection," Proc. Intl. Conf. Pattern Recognition (ICPR'08), 2008.
[10] H. Peng, F. Long and C. Ding, "Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1226-1238, Aug. 2005.
[11] L. Yu and H. Liu, "Efficient feature selection via analysis of relevance and redundancy," Journal of Machine Learning Research, vol. 5, pp. 1205-1224, Oct. 2004.
[12] G. Qu, S. Hariri and M. Yousif, "A new dependency and correlation analysis for features," IEEE Trans. Knowledge and Data Engineering, vol. 17, no. 9, pp. 1199-1207, Sep. 2005.
[13] A. Jakulin and I. Bratko, "Analyzing attribute dependencies," Proc. PKDD 2003, Lecture Notes in Artificial Intelligence, pp. 229-240, 2003.
[14] Z. Zhao and H. Liu, "Searching for interacting features," Proc. Intl. Joint Conf. on Artificial Intelligence (IJCAI-07), pp. 1156-1161, 2007.
[15] T.M. Cover and J.A. Thomas, Elements of Information Theory, John Wiley & Sons, 1991.
[16] A.M. Fraser and H.L. Swinney, "Independent coordinates for strange attractors from mutual information," Physical Review A, vol. 33, pp. 1134-1140, 1986.
[17] G.A. Darbellay and I. Vajda, "Estimation of the mutual information by an adaptive partitioning of the observation space," IEEE Trans. Information Theory, vol. 45, pp. 1315-1321, 1999.
[18] N. Kwak and C-H. Choi, "Input feature selection by mutual information based on Parzen window," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24, no. 12, pp. 1667-1671, 2002.