Genetic wavelet packets for speech recognition

Leandro D. Vignolo∗, Diego H. Milone, Hugo L. Rufiner

Research Center for Signals, Systems and Computational Intelligence, Departamento de Informática, Facultad de Ingeniería y Ciencias Hídricas, Universidad Nacional del Litoral, CONICET, Argentina

Abstract

The most widely used speech representation is based on the mel-frequency cepstral coefficients, which incorporate biologically inspired characteristics into artificial recognizers. However, the recognition performance obtained with these features can still be enhanced, especially in adverse conditions. Recent advances have been made with the introduction of wavelet-based representations for different kinds of signals, which have been shown to improve classification performance. However, finding an adequate wavelet-based representation for a particular problem remains an important challenge. In this work we propose a genetic algorithm to evolve a speech representation, based on a non-orthogonal wavelet decomposition, for phoneme classification. The results, obtained for a set of Spanish phonemes, show that the proposed genetic algorithm is able to find a representation that improves speech recognition results. Moreover, the optimized representation was also evaluated in noise conditions.

Key words: Phoneme classification, genetic algorithms, wrappers, wavelet packets

∗Corresponding author. Research Center for Signals, Systems and Computational Intelligence, Departamento de Informática, Facultad de Ingeniería y Ciencias Hídricas, Universidad Nacional del Litoral, Ciudad Universitaria CC 217, Ruta Nacional No 168 Km 472.4, TE: +54(342)4575233 ext 191, FAX: +54(342)4575224, Santa Fe (3000), Argentina.
Email address: ldvignolo@fich.unl.edu.ar (Leandro D. Vignolo)
URL: http://fich.unl.edu.ar/sinc (Leandro D. Vignolo)

Preprint submitted to Expert Systems with Applications, September 18, 2012

1. Introduction

Automatic speech recognition (ASR) systems need a pre-processing stage to make phoneme key features more evident, in order to obtain significant improvements in the classification results [1]. This task was first addressed by signal processing techniques like filter banks, linear prediction coding and cepstrum analysis [2]. The most popular feature representation currently used for speech recognition is built from the mel-frequency cepstral coefficients (MFCC) [3], which are based on a linear model of voice production together with codification on a psycho-acoustic scale. However, due to the degradation of recognition performance in the presence of additive noise, many advances have been made in the development of alternative feature extraction approaches. In particular, techniques like perceptual linear prediction [4] and relative spectra [5] incorporate features based on the human auditory system and provide some robustness in ASR. Also, significant progress has been made with the application of different artificial intelligence techniques in the field of speech processing [6]. Besides, the use of wavelet-based analysis for speech feature extraction has recently been studied [7, 8, 9, 10].
The multi-resolution analysis associated with the discrete wavelet transform (DWT) can be implemented as a filter-bank decomposition [11]. The wavelet packet transform (WPT) is a generalization of the DWT which offers a wider range of possibilities for signal representation in the time-scale plane [12]. To obtain a representation based on this transform, a particular orthogonal basis is usually selected among all the available bases. Nevertheless, in phoneme classification applications there is no clear evidence of a benefit in working with orthogonal bases. Moreover, it is known that the analysis performed at the level of the auditory cortex is highly redundant and, therefore, non-orthogonal [13]. Without this restriction, the result of the full WPT decomposition is a highly redundant set of coefficients, from which a convenient representation for the problem at hand can be selected.

Many approaches addressing the optimization of wavelet decompositions for feature extraction have been proposed. For instance, in [14] an automatic extraction of high-quality features from continuous wavelet coefficients according to signal classification criteria was presented. In [15], an approach based on the best-basis wavelet packet entropy method was proposed for electroencephalogram classification. Also, a method for the selection of the wavelet family and parameters was proposed for the phoneme recognition task [16]. Similarly, the use of wavelet-based decompositions has also been proposed as a tool for the development of robust features for speaker recognition [17, 18]. Another interesting approach was proposed in [19], in which the wavelet that maximizes the discrimination capability of ECG beats is generated using particle swarm optimization. Also, the use of evolutionary computation techniques to optimize over-complete decompositions for signal approximation was proposed in [20]. Furthermore, the use of a genetic algorithm to optimize WPT-based features for pathology detection from speech was proposed in [21], where an entropy criterion was minimized for the selection of the wavelet tree nodes. Similar approaches propose the optimization of wavelet decomposition schemes using evolutionary computation for denoising [22, 23] and image compression [24]. Besides, different approaches have been proposed for the optimization of wavelet-based representations using swarm intelligence [25, 26]. Many other studies also rely on evolutionary algorithms for feature selection [27, 28, 29] and the optimization of speech representations [30, 31, 32]. However, the flexibility provided by the full WPT decomposition has not yet been fully exploited in the search for a set of features that improves speech recognition results. When this search is not restricted to non-redundant representations, there is a large number of non-orthogonal dictionaries to be explored, leading to a hard combinatorial problem.

Here we propose a new approach to optimize over-complete decompositions from a WPT dictionary, which consists of using a genetic algorithm (GA) for the selection of wavelet-based features.
In order to evaluate the solutions during the search, the GA uses a learning vector quantization (LVQ) classifier. Some preliminary results with this strategy were presented in [33]. The methodology, referred to as genetic wavelet packets (GWP), relies on the benefits provided by evolutionary computation to find a better signal representation. This feature selection scheme, known as a wrapper [34, 35], is widely used as it allows obtaining good solutions in comparison with other techniques [36].

The organization of this paper is as follows. In Section 2, brief descriptions of the properties of the WPT and GA are presented. Next, our wrapper method for the selection of the WPT components is described. The following section discusses the recognition results obtained for a set of Spanish phonemes. Finally, the general conclusions and future work are presented.

2. Materials and methods

2.1. Wavelet and wavelet packet transforms

In contrast with sine and cosine bases, wavelet bases are simultaneously localized in time and frequency. This feature is particularly interesting in the case of signals which present both stationary and transient behaviors. Wavelets can be defined, in a simplified manner, as functions with zero mean and unit norm, centered in the neighborhood of t = 0 [37]:

\psi(t) \in L^2(\mathbb{R}), \quad \int_{-\infty}^{\infty} \psi(t)\,dt = 0, \quad \|\psi(t)\| = 1.   (1)

A family of time-frequency atoms is obtained by scaling and translating the wavelet function:

\psi_{u,s}(t) = \frac{1}{\sqrt{s}}\,\psi\!\left(\frac{t-u}{s}\right),   (2)

with u, s \in \mathbb{R}. In this way, the continuous wavelet transform of the signal x(t) is defined as the inner product with this family of atoms

W x(u,s) = \langle x(t), \psi_{u,s}(t) \rangle = \int_{-\infty}^{+\infty} x(t)\,\psi_{u,s}^{*}(t)\,dt.   (3)

The discrete dyadic wavelet transform (DWT) of x[n] \in \mathbb{R}^N is obtained by discretizing the translation and scaling parameters in (3), as u = m and s = 2^j. A fast implementation of the DWT-based multiresolution analysis exists [11], which uses low-pass and high-pass filters to decompose a signal into detail (d_j[n]) and approximation (a_j[n]) coefficients. Since the filter outputs contain half the frequency components of the original signal, both approximation and detail can be sub-sampled by two, maintaining the total number of samples. This process is iteratively repeated on the approximation coefficients, increasing the frequency resolution at each decomposition step. As a result, a binary decomposition tree is obtained, where each level corresponds to a different scale j [38]:

d_{j+1}[m] = \sqrt{2} \sum_{n=-\infty}^{\infty} g[n-2m]\,a_{j}[n],   (4)

a_{j+1}[m] = \sqrt{2} \sum_{n=-\infty}^{\infty} h[n-2m]\,a_{j}[n],   (5)

where g[n] and h[n] are the impulse responses of the high-pass and low-pass filters associated with the wavelet and scaling functions, respectively.

The WPT can be considered an extension of the DWT which provides more flexibility in frequency band selection.
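As a concrete illustration of the recursion in (4) and (5), the following minimal sketch (not the authors' implementation) performs one analysis step and the full six-level DWT with PyWavelets; the wavelet name 'coif4' anticipates the 4th-order Coiflet family selected in Section 3.1, and the 'periodization' boundary mode is an assumption made here so that the coefficient counts stay powers of two.

    import numpy as np
    import pywt

    x = np.random.randn(256)  # stand-in for a 32 ms frame sampled at 8 kHz

    # One analysis step of (4)-(5): low-pass/high-pass filtering followed by downsampling by two.
    approx, detail = pywt.dwt(x, 'coif4', mode='periodization')  # 128 coefficients each

    # Iterating only on the approximation branch yields the six-level DWT.
    dwt_coeffs = pywt.wavedec(x, 'coif4', mode='periodization', level=6)
    print([len(c) for c in dwt_coeffs])  # [4, 4, 8, 16, 32, 64, 128]

Decomposing the detail branch as well, instead of only the approximation, leads to the wavelet packet tree described next.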
With the same reasoning as above, details (high-frequency components) can be decomposed as well as approximations (low-frequency components). In a similar way to the DWT, the full wavelet packet decomposition tree is obtained by

c_{j+1}^{2p}[m] = \sqrt{2} \sum_{n=-\infty}^{\infty} g[n-2m]\,c_{j}^{p}[n],   (6)

c_{j+1}^{2p+1}[m] = \sqrt{2} \sum_{n=-\infty}^{\infty} h[n-2m]\,c_{j}^{p}[n],   (7)

where j is the depth of the node and p indexes the nodes at the same depth; every c_j^p with p even is associated with approximations and every c_j^p with p odd is associated with details. Wavelet packet analysis allows representing the information contained in a signal in a more flexible time-scale plane, by selecting different sub-trees from the full decomposition (Figure 1). For the selection of the best tree it is possible to make use of knowledge about the characteristics of the signal and to obtain an efficient representation in the transform domain. For signal compression, the criterion is based on "entropy" measures, in the method named best orthogonal basis [39]. Another possibility, closer to the classification problem, is to use the local discriminant basis algorithm, which provides an appropriate orthogonal basis for signal classification [40]. These criteria are based on the assumption that an orthogonal basis is convenient. Nevertheless, for the case under study there is no evidence of the convenience of a non-redundant representation. Moreover, redundancy often provides additional robustness for classification tasks in adverse conditions [31]. Because of this, a method which explores a wider range of possibilities should be studied.

2.2. Genetic wavelet packets

Genetic algorithms are meta-heuristic optimization methods motivated by the process of natural evolution [41]. A classic GA consists of three kinds of operators: selection, variation and replacement [42]. Selection mimics the natural advantage of the fittest individuals, giving them more chance to reproduce. The purpose of the variation operators is to combine information from different individuals and also to maintain population diversity by randomly modifying chromosomes. The number of individuals in the current population that are substituted by the offspring is determined by the replacement strategy. The information of a possible solution is coded in a chromosome, and its fitness is measured by an objective function which is specific to a given problem. Parents, selected from the population, are mated to generate the offspring by means of the variation operators. The population is then replaced and the cycle is repeated until a desired termination criterion is reached. Once the evolution is finished, the best individual in the population is taken as the solution to the problem [43]. Genetic algorithms are inherently parallel, and one can easily benefit from this to increase the computational speed [33].

Figure 1: Wavelet packet tree with six decomposition levels for a 256-sample signal.

Figure 2: General scheme of the proposed wrapper method.

In this case, the objective function needs to evaluate the signal representation suggested by a given chromosome, providing a measure of the class separability.
Therefore, the fitness function was defined as a phoneme classifier, based on the optimized learning vector quantization (O-LVQ) technique [44]. This classifier uses a set of reference vectors (codebook) which are adapted using a set of training patterns in order to represent the distribution of the classes. O-LVQ was chosen because it requires much less processing time than the classifiers used in state-of-the-art speech recognizers, which are based on hidden Markov models (HMM). However, after the optimization, an HMM-based classifier was also used for the validation of the evolved representation. For the evaluation of every individual in the population, the classifier is trained and tested on a phoneme corpus, and the recognition rate is used as the fitness value. The scheme of the proposed wrapper method is shown in Figure 2. The GA uses the roulette wheel selection method, classic mutation and one-point crossover. Also, an elitist replacement strategy was applied, which keeps the best individual in the next generation.

The feature extraction scheme was designed for signals of 256 samples in length, that is, 32 ms frames at 8 kHz sampling frequency. The iterative process of filtering and decimation was performed to obtain six decomposition levels, yielding a full wavelet packet tree consisting of 1792 coefficients. Then, in order to reduce the dimensionality of the search space, the coefficients inside each frequency band were "integrated" by groups. This means that each band was subdivided into groups, and an energy coefficient for each group was obtained by

e_{j}^{s} = \sum_{\forall c_k \in G_{j}^{s}} |c_k|^2,   (8)

where e_j^s is the energy coefficient for integration group j in scale s, G_j^s, and c_k is the k-th coefficient belonging to this group. Figure 3 illustrates the integration scheme for the first half of the WPT decomposition tree, while the other half is integrated in the same manner. Each small square represents a single component: in the first row (level 0) it is a sample of the temporal signal, and in the other rows (levels 1 to 6) each square corresponds to a single wavelet coefficient. White and gray zones delimit the different integration groups, and the tree nodes in each decomposition level are indicated with thick-line rectangles. Table 1 shows the number of components and coefficients in each integration group. This integration scheme was heuristically designed, considering the most relevant frequency bands in speech and their temporal resolutions.

Figure 3: Frequency band integration scheme (half tree).

Table 1: Integration scheme applied to the WPT tree for a 256-sample signal, which reduces the 1792 wavelet coefficients to 208 integration coefficients.

Level                                1    2    3    4    5    6
Nodes                                2    4    8   16   32   64
Integration groups per node          8    8    4    2    1    1
Wavelet coefficients per group      16    8    8    8    8    4
Integration coefficients per level  16   32   32   32   32   64
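Under the same assumptions as before (PyWavelets, 'coif4', 'periodization' boundary handling, which gives 256 coefficients per level), the following sketch reproduces the dimensionality reduction of Table 1 by splitting each tree node uniformly into the listed number of groups and computing the group energies of (8); it is an approximation of the heuristic scheme of Figure 3, not the authors' code.

    import numpy as np
    import pywt

    GROUPS_PER_NODE = {1: 8, 2: 8, 3: 4, 4: 2, 5: 1, 6: 1}  # from Table 1

    def band_integrated_features(frame):
        """Full six-level WPT of a 256-sample frame, then one energy per integration group as in (8)."""
        wp = pywt.WaveletPacket(data=frame, wavelet='coif4',
                                mode='periodization', maxlevel=6)
        energies = []
        for level in range(1, 7):
            for node in wp.get_level(level, order='natural'):
                for group in np.split(node.data, GROUPS_PER_NODE[level]):
                    energies.append(np.sum(np.abs(group) ** 2))
        return np.asarray(energies)

    feats = band_integrated_features(np.random.randn(256))
    print(feats.shape)  # (208,) integration coefficients, matching Table 1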
After band integration, a normalization was applied: if w_p[k] is the k-th energy coefficient corresponding to the pattern p, then the normalized coefficient is given by

\hat{w}_p[k] = \frac{w_p[k]}{\max_i \{ w_i[k] \}}.   (9)

A canonical evolution model with binary chromosomes was used, in which every individual represents a different selection of the WPT band-integrated coefficients. Each gene in a chromosome represents one normalized coefficient, and its value indicates whether that coefficient should be used to parametrize the signals (Figure 4). Once the database is processed, each feature vector is composed of the normalized and band-integrated WPT coefficients. These labeled patterns are used to train and test the O-LVQ based classifier. When a particular individual is evaluated, each feature vector is reduced to the subset of coefficients indicated by the chromosome.

The selection of individuals should be done considering the set of coefficients represented by each chromosome: the chromosomes which codify the best signal representations, those which allow better classification results, should be assigned a high selection probability. As the codification may be redundant and no restriction is imposed on the combination of coefficients, the GA initialization consists of randomly setting the genes in the chromosomes. All the steps involved in the GWP feature selection strategy are summarized in Algorithm 1, while the details of the population evaluation are shown in Algorithm 2.

Figure 4: Codification example with an 80-gene chromosome and the corresponding WPT tree. Dark boxes represent the coefficients that are used and white boxes represent those that are discarded.

Algorithm 1: Optimization for GWP.
  Obtain the full WPT for each phoneme example in the corpus using (6) and (7)
  Apply the band-integration scheme to the WPT coefficients using (8)
  Normalize the integrated coefficients for each pattern according to (9)
  Initialize the GA population
  Evaluate the population (Algorithm 2)
  repeat
    Select parents (roulette wheel)
    Create a new population from the selected parents
    Replace the population
    Evaluate the population
  until the stopping criterion is met

2.3. Phoneme corpus

The speech data used for experimentation is a subset of the Albayzin geographic corpus [45, 46], named Minigeo. This subset consists of 600 utterances, spoken by twelve different speakers (six men and six women) who were between fifteen and fifty-five years old. This speech data has been phonetically segmented using a speech recognition system based on hidden Markov models [47]. In this way, the temporal localization of every phoneme within each utterance was obtained. The extracted phonetic corpus was partitioned into three groups: a training set and a testing set to be used during evolution, and a third data set which was used only for the validation of the best solutions after the feature selection process.

Algorithm 2: Evaluate population.
  for each individual in the population do
    Re-parameterize the training/test patterns according to the chromosome
    Train the LVQ-based classifier on the training set
    Test the LVQ-based classifier on the test set
    Assign the classification rate as the current individual's fitness
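A rough, self-contained counterpart of Algorithm 2 is sketched below; it is only an illustration: scikit-learn's nearest-centroid classifier stands in for the O-LVQ codebook, the data are random placeholders with the sizes reported in Section 3.1, and the max-normalization of (9) is computed from the training patterns.

    import numpy as np
    from sklearn.neighbors import NearestCentroid
    from sklearn.metrics import accuracy_score

    def normalize(train, test):
        """Max-normalization of (9), with the maxima estimated on the training patterns."""
        peak = np.abs(train).max(axis=0) + 1e-12
        return train / peak, test / peak

    def fitness(chromosome, x_train, y_train, x_test, y_test):
        """Classification rate obtained with the coefficients switched on by the chromosome."""
        mask = chromosome.astype(bool)
        clf = NearestCentroid()  # stand-in for the O-LVQ classifier of the paper
        clf.fit(x_train[:, mask], y_train)
        return accuracy_score(y_test, clf.predict(x_test[:, mask]))

    # Placeholder data: 208 band-integrated coefficients, 9 phoneme classes.
    rng = np.random.default_rng(0)
    x_tr, x_te = rng.random((1449, 208)), rng.random((252, 208))
    y_tr, y_te = rng.integers(0, 9, 1449), rng.integers(0, 9, 252)
    x_tr, x_te = normalize(x_tr, x_te)

    individual = rng.integers(0, 2, 208)  # one binary chromosome of the GA population
    print(fitness(individual, x_tr, y_tr, x_te, y_te))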
The experiments included the phonemes /a/, /e/, /i/, /o/, /u/, /b/, /d/, /p/ and /t/ from Spanish. The five vowels were included because of their obvious importance in the language, while the four occlusive phonemes were included because their similar characteristics make them a particularly difficult set to distinguish [48]. Even though it could be argued that all phonemes should be included, our hypothesis is that this strategy simplifies the task of the GA while still allowing it to find features that are useful in continuous ASR. In order to avoid adding further complexity to the GA search, every sample used in the optimization consisted of a single 32 ms speech frame, extracted from the center of the phoneme utterance.

3. Results and discussion

3.1. Genetic optimization of wavelet packets

In [49], various wavelet families were tested in order to find which one is the most convenient for signal classification. Those tests included the most widely used families, among which we can mention Meyer, Daubechies, Symmlets, Coiflets and Splines [50]. As a result, the 4th-order Coiflet family was chosen for the following experiments.

For the first experiment, a codebook of 117 vectors (13 per phoneme) was used within the LVQ classifier and the initial learning rate was set to 0.02. The classifier was trained for 6 epochs with 1449 patterns, and 252 patterns were left for testing. For the GA, the population size was set to 100 individuals, while the crossover and mutation probabilities were set to 0.9 and 0.05, respectively. The performance of the best solution found was 57.94% correct classification.

As the LVQ codebook initialization has a random component, repeated evaluations of the same individual may result in different fitness values. Then, in order to obtain a good estimation of the performance of the best individual, after the evolution it was evaluated ten times with a validation data set. In this process we used 2637 training patterns and 450 test patterns which were not used during the evolution. As a result, an average of 53.69% correct classification with a standard deviation of 2.33% was obtained.

Also, in order to analyze the effect of the random LVQ initialization on the evolution, two different alternatives were considered. In the first case, the randomness was eliminated from the codebook initialization. In this case, the GA was able to find an individual with a fitness (classification rate) of 57.78%. With the validation procedure described earlier, an average classification rate of 56.67% was obtained.

Table 2: Summary of the obtained fitness and validation results.

Strategy                          Convergence (# generations)   Fitness   Validation average   Validation STD
Random LVQ initialization                     26                 57.94%        53.69%              2.33%
Fixed LVQ initialization                     216                 57.78%        56.67%              2.9%
Generational LVQ initialization              355                 64.07%        59.16%              2.91%
Even though an improvement was obtained, it is possible that the evolution was biased by this fixed codebook initialization. Therefore, another strategy was considered, in which a fixed initialization sequence was used for each generation. In order to allow individuals evaluated under different conditions (initializations) to coexist in the same generation, a generational gap of 10 individuals was used, maintaining more information from one generation to the next. This means that, in addition to the best individual, which is preserved by the elitist strategy, another 10 individuals are chosen by the selection algorithm to be carried over to the next generation. In this case the best solution achieved 64.07% correct classification, and an average classification rate of 59.16% was obtained with the validation data. Table 2 summarizes these results, showing that this last strategy improved the generalization capability.

Table 3 shows a confusion matrix obtained from the validation results. As this matrix shows, the /t/ phoneme is mainly (61.51%) classified as /p/. This error arises because the experimental data were taken from the central part of the samples, and plosive phonemes like /p/ and /t/ have their most distinctive attributes at the beginning (the plosion). A similar problem happens with phoneme /d/, and this might be solved for all plosive phonemes by considering their context (i.e. a number of preceding and posterior frames).

Table 3: Confusion matrix obtained from the validation of the best GA solution, giving a 59.16% average classification rate.

       /a/    /e/    /i/    /o/    /u/    /b/    /d/    /p/    /t/
/a/   84.85  00.30  00.00  11.82  01.21  01.21  00.30  00.30  00.00
/e/   02.73  76.06  01.82  05.15  03.64  03.33  01.51  03.64  02.12
/i/   00.00  08.18  86.97  00.00  00.30  02.73  00.61  00.30  00.91
/o/   25.15  10.61  00.00  42.42  14.24  04.24  02.73  00.61  00.00
/u/   08.79  00.00  01.51  08.48  59.39  14.24  00.61  05.15  01.82
/b/   00.30  02.42  00.00  04.54  09.70  62.12  06.06  13.03  01.82
/d/   10.30  31.82  09.09  07.57  04.54  04.54  10.61  17.27  04.24
/p/   00.00  00.00  00.00  00.00  00.00  03.03  02.42  78.18  16.36
/t/   00.00  00.00  00.00  00.30  00.00  04.54  01.82  61.51  31.82

3.2. Comparative analysis

In order to compare these results, the same classifier was trained with other state-of-the-art speech features: the classic MFCC [3], the alternative cepstral features based on Slaney's filter bank [51], and a representation based on the standard DWT. It is worth pointing out that, as the speech data used in these experiments was sampled at 8 kHz, the original parameters of the filter bank proposed by Slaney were modified following the reasoning in [52]. Table 4 shows the results obtained with the above mentioned representations, using a validation data set consisting of 450 patterns which were not used for the optimization. As can be seen, the best average classification rate was obtained with GWP. This shows that, through the genetic optimization of the full WPT decomposition, it is possible to improve the classification results with respect to the classic cepstrum-based representations.
On the other hand, it can be noticed that the phoneme /d/ seems to be the most difficult one for all representations, except for DWT.

Even though state-of-the-art speech recognizers use HMM for acoustic modeling, the significant improvements provided by GWP when using an LVQ-based classifier should not be disregarded. These results clearly show that a more efficient class separation is provided in the GWP feature space. It should be taken into account that this straightforward classifier was used as the objective function to guide the evolution, and only the central frame was considered for each phoneme pattern during the optimization.

Table 4: Classification results obtained with an LVQ-based classifier.

           Cepstral coefficients       Wavelets
Phoneme    MFCC(13)   Slaney(18)    DWT(256)   GWP(104)
/a/          82.80       70.40        69.39      84.85
/e/          77.40       61.60        54.54      76.06
/i/          90.00       61.60        84.54      86.97
/o/          46.20       28.40        54.54      42.42
/u/          31.60       21.40        31.51      59.39
/b/          45.20       39.60        58.48      62.12
/d/          08.80       09.00        59.69      10.61
/p/          55.60       45.20        04.85      78.18
/t/          48.60       58.40        31.21      31.82
Average      54.02       43.96        49.86      59.16

3.3. Evaluation with hidden Markov models

In this work we have raised the hypothesis that, by using a simple classifier in the optimization, the class separability would be maximized and this could also be beneficial to a more sophisticated classifier, like HMM [53]. In order to verify this, the performance of an HMM-based classifier was evaluated for each of the representations in Table 4. This classifier is based on a continuous HMM, using Gaussian mixtures with diagonal covariance matrices for the observation densities, as is common in ASR [54]. We used a three-state model with mixtures of four Gaussians, constructed with the tools provided in the HMM Toolkit (HTK) [47]. These tools use the Baum-Welch algorithm [55] to train the HMM parameters and the Viterbi algorithm [53] to search for the most likely state sequence. This classifier was evaluated in a ten-fold cross-validation process with random partitions, each of which consisted of 2484 and 621 patterns for training and testing, respectively. It is important to point out that, because of the nature of HMM, in the evaluation of this classifier all the successive frames composing a phoneme were used, while for the LVQ-based classifier only the central frame of each phoneme was used.

For the features based on the DWT, the training of the HMM classifier did not converge, mainly because the Gaussian mixtures are not able to adequately model the probability densities of these coefficients [56]. Another problem in training the models with DWT coefficients is the high dimensionality of this representation. Therefore, a post-processing based on principal component analysis (PCA) [57] was applied in order to obtain a representation of lower dimensionality, with probability densities more similar to Gaussians.
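A minimal sketch of this DWT+PCA post-processing, assuming scikit-learn and random placeholder data (on the actual DWT features the paper reports 134 retained dimensions): passing a fractional n_components keeps as many principal components as needed to explain 99% of the variance.

    import numpy as np
    from sklearn.decomposition import PCA

    dwt_features = np.random.randn(2484, 256)  # placeholder for the 256-dimensional DWT vectors

    pca = PCA(n_components=0.99)               # keep 99% of the variance
    reduced = pca.fit_transform(dwt_features)  # HMM training then uses these lower-dimensional vectors
    print(reduced.shape, pca.n_components_)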
Table 5: Classification results obtained with an HMM-based classifier.

           Cepstral coefficients         Wavelets
Phoneme    MFCC(13)   Slaney(18)    DWT+PCA(134)   GWP(104)
/a/          59.70       58.26          49.71        54.21
/e/          67.55       64.21          45.50        60.30
/i/          59.00       63.49          62.02        76.82
/o/          31.74       27.53          27.38        34.78
/u/          37.68       51.02          37.82        58.99
/b/          43.76       26.08          42.61        30.59
/d/          30.72       16.52          22.45        26.81
/p/          36.96       40.00          36.37        49.44
/t/          71.46       60.00          59.43        53.33
Average      48.74       45.24          42.60        49.48

The best result for DWT+PCA was obtained when preserving 99% of the variance, giving a representation of 134 dimensions. Even though our optimized representation is also based on wavelets, and the same problem could be expected, no post-processing was necessary for GWP. Thus, we can assume that the band integration, besides reducing the dimension, produced coefficients with probability densities more appropriate for Gaussian mixture modeling.

Table 5 shows the phoneme classification results obtained with HMM, comparing GWP with other state-of-the-art speech representations. The optimized representation is the same as in Table 4, which was evolved using an LVQ-based classifier. It can be noticed that the best results are those obtained by means of the optimized representation and MFCC, similarly to the case of the validation with LVQ (Table 4). Even though the fitness was measured by means of an LVQ-based classifier in the optimization, the evolved representation provided satisfactory results when using HMM. This means that the optimized representation captures information which is relevant for the discrimination, regardless of the type of classifier. Moreover, by using this low-cost classifier we saved significant computational time in the optimization. It should be taken into account that, if we had used an HMM-based classifier and, therefore, considered all the frames within a phoneme example, the evaluation of each individual would have taken approximately ten times longer.

It is also important to note that the proposed GWP representation, which yielded the best classification result when using an LVQ-based classifier, provides relatively lower performance with HMM. This is because, as explained before, the probability densities of the coefficients provided by wavelet-based representations are not entirely suitable for Gaussian mixture models [56]. Hence, different alternatives besides band integration and PCA post-processing remain to be explored in order to obtain coefficients more suitable for Gaussian mixture models. It should also be considered that the dimensionalities of the wavelet-based representations are much higher than those of the cepstral representations, which makes the training of the classifier more difficult.

Despite the previous considerations, the results from Tables 4 and 5 show that the proposed method is useful for the optimization of wavelet-based representations. Moreover, the results obtained with the GWP features show that, using this evolutionary methodology, it is possible to improve the performance of the classical representations in ASR.
3.4. Evaluation in noise conditions

In order to evaluate the robustness of the optimized representation, white noise was added to the original utterances. The tests were made at several noise levels, and the mismatch training and test (MMTT) condition was considered, which means that the classifiers were trained with clean signals and tested with noisy signals. In general, the input of an ASR system consists of speech signals with SNRs different from those in the training set. For this reason, the evaluation of the recognition performance in MMTT conditions is more realistic than the case where the SNR is the same in both the training and test sets.

Each test consisted of a ten-fold cross-validation process with random partitions of 2484 and 621 patterns for training and testing, respectively. The process of training and testing was repeated for these ten partitions and the results were averaged, ensuring that the resultant accuracy would not be biased by a particular data partition. Figure 5 shows the average results, as well as the estimated standard deviations, showing that GWP improves on MFCC in all cases. At 0 and 5 dB SNR the DWT+PCA representation gives better results; however, for most of the noise conditions its performance is noticeably worse than that of GWP. It can be noticed that, on average, the results given by GWP are significantly better than those of the other representations.

Figure 5: Classification rate [%] as a function of SNR (clean to 0 dB) obtained with an HMM-based classifier evaluated in MMTT conditions, for GWP, DWT+PCA, MFCC and Slaney.

In order to evaluate the statistical significance of these results, we estimated the probability that GWP is better than each of the reference representations for the given tasks. To perform this test, statistical independence of the classification errors for each phoneme was assumed, and the binomial distribution of the errors was approximated by means of a Gaussian distribution. Table 6 shows the statistical significance of the classification rates obtained with the HMM-based classifier: results improved by GWP with a statistical significance higher than 97% are indicated with the superscript △. It can be noticed that most of the classification improvements obtained with the proposed optimized representation are statistically significant. For instance, the probability that GWP performs better than the cepstral coefficients obtained with Slaney's filter bank is higher than 0.99 for all SNRs.
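One way to realize the test described above is sketched below (an illustration under the stated independence assumption, not the authors' exact procedure): with independent errors, the number of correct classifications is binomial, so the difference between two observed rates is approximately Gaussian, and the probability that one representation is truly better follows from the standard normal CDF. Pooling the ten cross-validation folds (10 x 621 test patterns) is an additional assumption made here.

    from math import erf, sqrt

    def prob_better(p_a, p_b, n):
        """P(system A better than B): Gaussian approximation to the difference of binomial rates.

        p_a, p_b are observed classification rates and n is the number of test patterns per system.
        """
        var = p_a * (1.0 - p_a) / n + p_b * (1.0 - p_b) / n
        z = (p_a - p_b) / sqrt(var)
        return 0.5 * (1.0 + erf(z / sqrt(2.0)))  # standard normal CDF evaluated at z

    # Example with the clean-speech rates of Table 6, pooling the ten folds.
    print(prob_better(0.4948, 0.4524, 10 * 621))  # GWP vs. Slaney, greater than 0.99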
Table 7 shows more detailed information about the classification performance, comparing MFCC and GWP for the case of 40 dB SNR in MMTT conditions. In these confusion matrices, the rows correspond to the actual phoneme and the columns to the predicted phoneme, and the percentages of correct classification are on the diagonal. These matrices show coincidences between the phonemes which are most confused with MFCC and the ones that are confused with GWP. For example, in both cases phoneme /t/ was confused with /p/ and vice versa, which is reasonable since these two plosive consonants share many spectral features. In a similar way, the vowels /o/ and /u/, which are close in the formant map, are often confused in both cases.

Similarly, Table 8 shows the confusion matrices obtained in the classification with MFCC and GWP in the case of 15 dB SNR and MMTT conditions. It can be noticed that the vowels /a/ and /u/ are mostly misclassified with MFCC, but they are distinguished significantly better with GWP. The optimized features also introduced an important improvement for the vowel /i/, which is confused with the phoneme /d/ when using MFCC. It is also interesting to note that the phonemes whose classification rates are most affected by noise do not coincide for MFCC and GWP. However, when the noise level is increased, the number of confusions between phonemes /t/ and /d/ increases for both MFCC and GWP. Similarly, the number of confusions between phonemes /t/ and /p/ also increases in both cases. Another interesting remark is that, for both representations, phoneme /t/ is better classified when the SNR is 15 dB than when it is 40 dB.

Table 6: Statistical significance of the classification results obtained with an HMM-based classifier evaluated in MMTT conditions. The superscript △ indicates that the statistical significance of the improvement of GWP is higher than 97%.

           Cepstral coefficients         Wavelets
SNR        MFCC(13)   Slaney(18)    DWT+PCA(134)   GWP(104)
clean        48.74      45.24△         42.60△        49.48
40 dB        48.58△     45.15△         43.40△        50.19
30 dB        47.38      43.82△         41.18△        47.77
20 dB        40.42△     43.64△         42.23△        46.72
15 dB        39.12△     39.19△         41.08△        42.93
10 dB        33.77△     23.45△         36.10         36.11
 5 dB        23.32△     12.55△         27.78         25.05
 0 dB        12.84△     11.17△         22.34         16.42

Table 7: Confusion matrices obtained from the validations with MFCC and GWP in MMTT conditions and 40 dB SNR.

MFCC (average: 48.58%)
       /a/   /e/   /i/   /o/   /u/   /b/   /d/   /p/   /t/
/a/   61.7  08.8  00.2  11.9  02.8  04.5  09.6  00.3  00.3
/e/   11.0  66.4  06.5  04.8  03.9  03.7  03.8  00.0  00.0
/i/   00.2  24.6  60.9  02.6  06.0  03.8  02.0  00.0  00.0
/o/   15.5  17.1  03.2  32.3  13.6  13.3  03.5  00.3  01.2
/u/   04.8  04.2  07.4  19.6  38.0  15.9  09.0  00.9  00.3
/b/   04.7  01.3  03.4  09.9  05.9  45.2  18.6  09.9  01.3
/d/   06.1  37.0  00.9  03.1  02.8  07.4  27.8  05.7  09.4
/p/   00.0  00.4  00.0  00.0  01.3  06.0  16.7  34.1  41.6
/t/   00.0  00.0  02.3  00.2  00.2  02.6  07.0  16.8  71.0

GWP (average: 50.19%)
       /a/   /e/   /i/   /o/   /u/   /b/   /d/   /p/   /t/
/a/   57.8  08.3  00.0  13.2  11.6  05.1  04.1  00.0  00.0
/e/   09.1  61.0  08.4  04.4  01.2  02.8  11.7  00.3  01.2
/i/   00.2  06.8  77.5  00.7  03.1  00.6  09.4  00.7  01.0
/o/   08.6  10.3  01.2  35.7  29.3  05.5  05.7  02.6  01.3
/u/   02.3  01.8  02.9  16.1  57.1  13.8  04.1  02.1  00.0
/b/   04.5  00.3  00.0  11.5  25.8  29.7  17.0  06.2  05.1
/d/   06.1  23.2  09.4  05.4  07.7  05.4  26.8  07.4  08.7
/p/   00.0  00.0  00.0  00.2  00.7  03.2  10.0  52.5  33.5
/t/   00.0  01.0  00.7  00.4  00.9  00.6  12.0  30.7  53.6

Table 8: Confusion matrices obtained from the validations with MFCC and GWP in MMTT conditions and 15 dB SNR.

MFCC (average: 39.12%)
       /a/   /e/   /i/   /o/   /u/   /b/   /d/   /p/   /t/
/a/   04.5  31.0  00.0  20.1  00.0  03.2  32.8  08.1  00.3
/e/   00.0  80.6  03.3  00.0  00.0  00.3  15.8  00.0  00.0
/i/   00.0  19.1  53.1  00.0  00.2  00.2  24.4  00.0  03.2
/o/   00.0  25.2  01.8  26.7  01.3  09.6  24.6  05.4  05.5
/u/   00.0  08.4  11.6  16.8  03.6  27.8  29.4  01.3  01.0
/b/   00.0  03.7  04.1  03.5  00.4  12.0  56.7  08.7  11.0
/d/   00.0  29.6  01.2  00.6  00.0  02.8  49.0  03.4  13.6
/p/   00.0  00.5  00.0  00.0  00.0  00.0  05.1  48.1  46.4
/t/   00.0  00.0  00.0  00.0  00.0  00.0  13.3  12.2  74.5

GWP (average: 42.93%)
       /a/   /e/   /i/   /o/   /u/   /b/   /d/   /p/   /t/
/a/   43.8  17.3  00.5  18.1  09.8  03.4  07.3  00.0  00.0
/e/   04.2  60.3  17.7  03.5  01.3  00.9  11.0  00.0  01.2
/i/   00.0  11.0  82.5  01.2  03.4  00.3  01.7  00.0  00.0
/o/   07.1  13.6  00.8  31.9  29.6  02.5  12.5  00.2  02.1
/u/   02.3  04.4  06.1  19.1  57.5  05.2  05.4  00.0  00.0
/b/   01.5  09.3  08.6  18.1  29.9  09.6  14.9  01.6  06.7
/d/   04.5  27.7  12.3  02.1  11.4  00.9  27.5  02.2  11.5
/p/   00.0  00.5  06.8  01.6  08.7  01.6  19.4  12.0  49.4
/t/   00.0  01.9  06.7  00.0  03.2  00.7  15.7  10.6  61.3
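For reference, the MMTT tests above require noisy versions of the test signals at prescribed SNRs; the following is a generic recipe for adding white Gaussian noise at a target SNR (a sketch, not the authors' exact setup):

    import numpy as np

    def add_white_noise(signal, snr_db, rng=None):
        """Return the signal plus white Gaussian noise scaled to the requested SNR in dB."""
        rng = rng if rng is not None else np.random.default_rng()
        noise = rng.standard_normal(signal.shape)
        p_signal = np.mean(signal ** 2)
        p_noise = np.mean(noise ** 2)
        scale = np.sqrt(p_signal / (p_noise * 10.0 ** (snr_db / 10.0)))
        return signal + scale * noise

    clean = np.random.randn(256)               # placeholder clean frame
    noisy_10db = add_white_noise(clean, 10.0)  # test-only corruption in the MMTT setting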
These results show that the classification performance of the classical representations can be improved. This suggests that, by means of the GWP methodology, additional robustness against white noise can be provided to a state-of-the-art ASR system.

In order to perform a qualitative analysis of the optimized representation, the tiling of the time-frequency plane was constructed from the selected decomposition using the criteria proposed in [58]. The result is shown in Figure 6, where each decomposition level is depicted separately for easier interpretation. Each ellipse represents a selected group from the integration scheme (Table 1), so its width and time localization are determined by the time-frequency atoms corresponding to the integration group (Figure 3). Therefore, each element in the tiling represents a time-frequency atom that was obtained by adding the original wavelet atoms according to the integration scheme. This explains why the atoms of levels 1 and 2 have the same time width, since the number of coefficients integrated in the groups of level 1 is twice the number of coefficients in the groups of level 2 (Figure 3). The same explanation applies to the width of the atoms in levels 5 and 6. Note that the whole time-frequency tiling is obtained by the superposition of these six sub-figures, yielding a large overlap between atoms from different decomposition levels.

A first observation is that the optimization of the WPT-based decomposition led to a highly redundant, non-orthogonal representation, which exploits this redundancy to increase robustness against additive noise. However, the optimized representation uses only 50% of the coefficients obtained from the integration of the whole WPT tree. This also suggests that it could be possible to achieve even more redundant and robust representations. It is also interesting to note that some selected atoms are concentrated at the center of the time axis, which could be related to how the phonemes were sampled from the speech corpus, as only the frames extracted from the center of each utterance were considered for the optimization.
Also, some atoms are concentrated at the sides of the time axis, which could be related to the plosive phonemes.

Figure 6: Tiling of the time-frequency plane obtained for the optimized decomposition (panels: selected atoms for levels 1 to 6; axes: time [ms] vs. frequency [kHz]). For a better visualization, each decomposition level is schematized separately.

4. Conclusion and future work

A wrapper optimization strategy has been proposed, taking advantage of the benefits provided by evolutionary computation techniques, in order to carry out the search for an advantageous wavelet-based speech representation. The results, obtained in the classification of a group of nine Spanish phonemes, show that the optimized representation provides important improvements in comparison with the classical features. This suggests that the task of a classifier is simplified when using this optimized representation, due to a better class separation in the feature space. Therefore, the proposed strategy provides an alternative feature set for speech signals, which improves the classification results in the presence of noise.

In future work we will design more specific genetic operators, so that more information about the problem at hand can be incorporated into the search. Pursuing the objective of finding a representation more suitable for HMM with Gaussian mixture modeling, one interesting idea is to incorporate a measure of the Gaussianity of the probability densities of the GWP coefficients into the fitness function. Besides, we will study different alternatives to the proposed band integration scheme, as well as the use of temporal information from successive speech frames, such as first and second derivatives. In order to obtain a representation which allows improving the results of a continuous speech recognition system, future experiments will include more phonemes in the data sets used for the optimization. Also, the robustness of the representation will be evaluated in comparison with different state-of-the-art robust representations, considering different noise types.

References

[1] L. Rabiner, B. Juang, Fundamentals of Speech Recognition, Prentice Hall, NJ, 1993.

[2] L. Rabiner, R. Schafer, Digital Processing of Speech Signals, Prentice Hall, NJ, 1978.

[3] S. B. Davis, P.
Mermelstein, Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences, IEEE Transactions on Acoustics, Speech and Signal Processing 28 (1980) 357–366.

[4] H. Hermansky, Perceptual linear predictive (PLP) analysis of speech, J. Acoust. Soc. Am. 87 (4) (1990) 1738–1752. doi:10.1121/1.399423.

[5] H. Hermansky, N. Morgan, RASTA processing of speech, IEEE Trans. Speech Audio Process. 2 (1994) 578–589. doi:10.1109/89.326616.

[6] A. Hassanien, G. Schaefer, A. Darwish, Computational intelligence in speech and audio processing: Recent advances, in: X.-Z. Gao, A. Gaspar-Cunha, M. Köppen, G. Schaefer, J. Wang (Eds.), Soft Computing in Industrial Applications, Vol. 75 of Advances in Intelligent and Soft Computing, Springer Berlin / Heidelberg, 2010, pp. 303–311. doi:10.1007/978-3-642-11282-9-32.

[7] N. Nehe, R. Holambe, DWT and LPC based feature extraction methods for isolated word recognition, EURASIP Journal on Audio, Speech, and Music Processing 2012 (1) (2012) 7. doi:10.1186/1687-4722-2012-7. URL http://asmp.eurasipjournals.com/content/2012/1/7

[8] S. Patil, M. Dixit, Speaker independent speech recognition for diseased patients using wavelet, Journal of the Institution of Engineers (India): Series B 93 (2012) 63–66. doi:10.1007/s40031-012-0010-3. URL http://dx.doi.org/10.1007/s40031-012-0010-3

[9] E. Avci, Z. H. Akpolat, Speech recognition using a wavelet packet adaptive network based fuzzy inference system, Expert Systems with Applications 31 (3) (2006) 495–503. doi:10.1016/j.eswa.2005.09.058.

[10] J.-D. Wu, B.-F. Lin, Speaker identification using discrete wavelet packet transform technique with irregular decomposition, Expert Systems with Applications 36 (2, Part 2) (2009) 3136–3143.

[11] M. Vetterli, C. Herley, Wavelets and filter banks: Theory and design, IEEE Trans. Signal Proc. 40 (10) (1992) 2207–2232.

[12] N. Hess-Nielsen, M. V. Wickerhauser, Wavelets and time-frequency analysis, Proceedings of the IEEE 84 (4) (1996) 523–540.

[13] R. Munkong, B.-H. Juang, Auditory perception and cognition, IEEE Signal Processing Magazine 25 (3) (2008) 98–117. doi:10.1109/MSP.2008.918418.

[14] S. Ray, A. Chan, Automatic feature extraction from wavelet coefficients using genetic algorithms, in: Proceedings of the 2001 IEEE Signal Processing Society Workshop, Neural Networks for Signal Processing XI, 2001, pp. 233–241.

[15] D. Wang, D. Miao, C. Xie, Best basis-based wavelet packet entropy feature extraction and hierarchical EEG classification for epileptic detection, Expert Systems with Applications 38 (11) (2011) 14314–14320. doi:10.1016/j.eswa.2011.05.096.

[16] H. Rufiner, J. Goddard, A method of wavelet selection in phoneme recognition, in: Proceedings of the 40th Midwest Symposium on Circuits and Systems, Vol. 2, 1997, pp. 889–891.

[17] P. Kumar, M. Chandra, Hybrid of wavelet and MFCC features for speaker verification, in: 2011 World Congress on Information and Communication Technologies (WICT), 2011, pp. 1150–1154. doi:10.1109/WICT.2011.6141410.

[18] V. Tiwari, J. Singhai, Wavelet based noise robust features for speaker recognition, International Journal of Signal Processing 5 (2) (2011) 52–64.

[19] A.
Daamouche, L. Hamami, N. Alajlan, F. Melgani, A wavelet optimization approach for ECG signal classification, Biomedical Signal Processing and Control, in press. doi:10.1016/j.bspc.2011.07.001.

[20] A. R. Ferreira da Silva, Approximations with evolutionary pursuit, Signal Processing 83 (3) (2003) 465–481.

[21] R. Behroozmand, F. Almasganj, Optimal selection of wavelet-packet-based features using genetic algorithm in pathological assessment of patients' speech signal with unilateral vocal fold paralysis, Computers in Biology and Medicine 37 (4) (2007) 474–485. doi:10.1016/j.compbiomed.2006.08.016.

[22] A. R. Ferreira da Silva, Wavelet denoising with evolutionary algorithms, Digital Signal Processing 15 (4) (2005) 382–399. doi:10.1016/j.dsp.2004.11.003.

[23] E.-S. El-Dahshan, Genetic algorithm and wavelet hybrid scheme for ECG signal denoising, Telecommunication Systems 46 (2011) 209–215. doi:10.1007/s11235-010-9286-2.

[24] R. Salvador, F. Moreno, T. Riesgo, L. Sekanina, Evolutionary approach to improve wavelet transforms for image compression in embedded systems, EURASIP Journal on Advances in Signal Processing 2011 (2011). doi:10.1155/2011/973806.

[25] W. Zhao, C. E. Davis, Swarm intelligence based wavelet coefficient feature selection for mass spectral classification: An application to proteomics data, Analytica Chimica Acta 651 (1) (2009) 15–23. doi:10.1016/j.aca.2009.08.008.

[26] A. Daamouche, F. Melgani, Swarm intelligence approach to wavelet design for hyperspectral image classification, IEEE Geoscience and Remote Sensing Letters 6 (4) (2009) 825–829. doi:10.1109/LGRS.2009.2026191.

[27] W. Pedrycz, S. S. Ahmad, Evolutionary feature selection via structure retention, Expert Systems with Applications 39 (15) (2012) 11801–11807. doi:10.1016/j.eswa.2011.09.154.

[28] S. Chatterjee, A. Bhattacherjee, Genetic algorithms for feature selection of image analysis-based quality monitoring model: An application to an iron mine, Engineering Applications of Artificial Intelligence 24 (5) (2011) 786–795. doi:10.1016/j.engappai.2010.11.009.

[29] Y.-X. Li, S. Kwong, Q.-H. He, J. He, J.-C. Yang, Genetic algorithm based simultaneous optimization of feature subsets and hidden Markov model parameters for discrimination between speech and non-speech events, International Journal of Speech Technology 13 (2010) 61–73. doi:10.1007/s10772-010-9070-4.

[30] L. D. Vignolo, H. L. Rufiner, D. H. Milone, J. C. Goddard, Evolutionary splines for cepstral filterbank optimization in phoneme classification, EURASIP Journal on Advances in Signal Processing 2011 (2011). doi:10.1155/2011/284791.

[31] L. D. Vignolo, H. L. Rufiner, D. H. Milone, J. C. Goddard, Evolutionary cepstral coefficients, Applied Soft Computing 11 (4) (2011) 3419–3428. doi:10.1016/j.asoc.2011.01.012.

[32] L. Vignolo, H. Rufiner, D. Milone, J.
Goddard, Genetic optimization of cepstrum filterbank for phoneme classification, in: Proceedings of the Second International Conference on Bio-inspired Systems and Signal Processing (Biosignals 2009), INSTICC Press, Porto (Portugal), 2009, pp. 179–185.

[33] L. Vignolo, D. Milone, H. Rufiner, E. Albornoz, Parallel implementation for wavelet dictionary optimization applied to pattern recognition, in: Proceedings of the 7th Argentine Symposium on Computing Technology, Mendoza, Argentina, 2006.

[34] S. Durbha, R. King, N. Younan, Wrapper-based feature subset selection for rapid image information mining, IEEE Geoscience and Remote Sensing Letters 7 (1) (2010) 43–47. doi:10.1109/LGRS.2009.2028585.

[35] H.-H. Hsu, C.-W. Hsieh, M.-D. Lu, Hybrid feature selection by combining filters and wrappers, Expert Systems with Applications 38 (7) (2011) 8144–8150. doi:10.1016/j.eswa.2010.12.156.

[36] R. Kohavi, G. H. John, Wrappers for feature subset selection, Artificial Intelligence 97 (1-2) (1997) 273–324. doi:10.1016/S0004-3702(97)00043-X.

[37] S. Mallat, A Wavelet Tour of Signal Processing, 3rd Edition, Academic Press, 2008.

[38] S. Mallat, A theory for multiresolution signal decomposition: the wavelet representation, IEEE Trans. Pattern Anal. Machine Intell. 11 (7) (1989) 674–693.

[39] R. Coifman, M. V. Wickerhauser, Entropy-based algorithms for best basis selection, IEEE Transactions on Information Theory 38 (2) (1992) 713–718.

[40] N. Saito, Local feature extraction and its applications using a library of bases, Ph.D. thesis, Yale University, New Haven, USA, director: Ronald R. Coifman (1994).

[41] S. N. Sivanandam, S. N. Deepa, Introduction to Genetic Algorithms, Springer London, Limited, 2008.

[42] H. Youssef, S. M. Sait, H. Adiche, Evolutionary algorithms, simulated annealing and tabu search: a comparative study, Engineering Applications of Artificial Intelligence 14 (2) (2001) 167–181. doi:10.1016/S0952-1976(00)00065-8.

[43] Z. Michalewicz, Genetic Algorithms + Data Structures = Evolution Programs, Springer-Verlag, 1992.

[44] T. Kohonen, Improved versions of learning vector quantization, in: Proc. of the Int. Joint Conf. on Neural Networks, San Diego, 1990, pp. 545–550.

[45] A. Echeverría, J. Tejedor, D. Wang, An evolutionary confidence measure for spotting words in speech recognition, in: Y. Demazeau, F. Dignum, J. Corchado, J. Bajo, R. Corchuelo, E. Corchado, F. Fernández-Riverola, V. Julián, P. Pawlewski, A. Campbell (Eds.), Trends in Practical Applications of Agents and Multiagent Systems, Vol. 71 of Advances in Intelligent and Soft Computing, Springer Berlin / Heidelberg, 2010, pp. 419–427.

[46] A. Moreno, D. Poch, A. Bonafonte, E. Lleida, J. Llisterri, J. Marino, C. Nadeu, Albayzin speech database: design of the phonetic corpus, Tech. rep., Universitat Politecnica de Catalunya (UPC), Dpto. DTSC (1993).

[47] S. Young, G. Evermann, D. Kershaw, G. Moore, J. Odell, D. Ollason, V. Valtchev, P. Woodland, The HTK Book, HTK version 3.1, Cambridge University (2001). URL http://htk.eng.cam.ac.uk

[48] A. Quilis, Tratado de Fonología y Fonética Españolas, Biblioteca Románica Hispánica, Editorial Gredos, Madrid, 1993.
[49] H. Rufiner, Comparación entre análisis onditas y Fourier aplicados al reconocimiento automático del habla, Master's thesis, Universidad Autónoma Metropolitana, Iztapalapa (1996).

[50] I. Daubechies, Ten Lectures on Wavelets, Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 1992.

[51] M. Slaney, Auditory Toolbox, Version 2, Technical Report 1998-010, Interval Research Corporation, Apple Computer Inc. (1998).

[52] T. Ganchev, N. Fakotakis, G. Kokkinakis, Comparative evaluation of various MFCC implementations on the speaker verification task, in: Proceedings of SPECOM-2005, 2005, pp. 191–194.

[53] X. D. Huang, Y. Ariki, M. A. Jack, Hidden Markov Models for Speech Recognition, Edinburgh University Press, 1990.

[54] K. Demuynck, J. Duchateau, D. Van Compernolle, P. Wambacq, Improved feature decorrelation for HMM-based speech recognition, in: Proceedings of the 5th International Conference on Spoken Language Processing (ICSLP 98), Sydney, Australia, 1998.

[55] F. Jelinek, Statistical Methods for Speech Recognition, MIT Press, Cambridge, Massachusetts, 1999.

[56] D. H. Milone, L. E. Di Persia, M. E. Torres, Denoising and recognition using hidden Markov models with observation distributions modeled by hidden Markov trees, Pattern Recognition 43 (4) (2010) 1577–1589. doi:10.1016/j.patcog.2009.11.010.

[57] C. M. Bishop, Pattern Recognition and Machine Learning, 1st Edition, Springer, 2007.

[58] M. Lewicki, Efficient coding of natural sounds, Nature Neuroscience 5 (4) (2002) 356–363.