Evolving linear transformations with a rotation-angles/scaling representation

A. Echeverría (a,*), J.M. Valls (b), R. Aler (b)
(a) Universidad Autónoma de Madrid, Madrid, Spain
(b) Universidad Carlos III de Madrid, Madrid, Spain
(*) Corresponding author. E-mail addresses: alejandro.echeverria@uam.es (A. Echeverría), jvalls@inf.uc3m.es (J.M. Valls), aler@inf.uc3m.es (R. Aler).

Published in: Expert Systems with Applications 39 (2012), 3276-3282. doi:10.1016/j.eswa.2011.09.015

Keywords: Data transformation, CMA-ES, Rotation angles, Scaling, Classification

Abstract

Similarity between patterns is commonly used in many distance-based classification algorithms like KNN or RBF. Generalized Euclidean Distances (GED) can be optimized in order to improve the classification success rate of distance-based algorithms. This idea can be extended to any classification algorithm, because it can be shown that a GED is equivalent to a linear transformation of the dataset. In this paper, the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) is applied to the optimization of linear transformations represented as matrices. The method has been tested on several domains and the results show that the classification success rate can be improved for some of them. However, in some domains, diagonal matrices get higher accuracies than full square ones. In order to solve this problem, the second part of the paper proposes to represent linear transformations by means of rotation angles and scaling factors, based on the Singular Value Decomposition theorem (SVD). This new representation solves the problems found in the first part.

1. Introduction

Many classification algorithms use the concept of distance or similarity between patterns. This is especially true for local classification methods such as Radial Basis Neural Networks (Moody & Darken, 1989) or Nearest Neighbor classifiers (Cover & Hart, 1967). Typically, the Euclidean distance is used, but this is not necessarily the best option for all classification domains. Therefore, finding the most appropriate distance for a classification task may be an important part of obtaining high classification rates.

The Generalized Euclidean Distance (GED), displayed in Eq. (1), is a family of distances that depends on a parameter S, where S is a symmetric positive definite square matrix. This distance is usually called the Mahalanobis distance (Atkeson, Moore, & Schaal, 1997; Tou & Gonzalez, 1974; Weisberg, 1985), although originally this term applied to cases where S is the inverse of the covariance matrix of the data. Eq. (1) shows the GED $d_{ij}$ between vectors $x_i$ and $x_j$:

$d_{ij} = \left[(x_i - x_j)^T S (x_i - x_j)\right]^{1/2}$    (1)

In Valls, Aler, and Fernández (2007), Genetic Algorithms were used to optimize GED distance matrices for Radial Basis Neural Networks. However, this approach can only be used in classification algorithms that explicitly use distances, like RBNN, but not in others like C4.5. In order to make the approach more general, and taking into account that a symmetric positive definite matrix S can always be decomposed into $M^T M$ for some matrix M, the definition of a GED results in Eq. (2):

$d_{ij} = \left[(x_i - x_j)^T M^T M (x_i - x_j)\right]^{1/2}$    (2)

It can be shown that Eq. (2) is equivalent to computing a Euclidean distance on a transformed dataset, where the patterns in the new space are linear transformations of the original data: $x' = Mx$. Thus, the GED can be redefined as shown in Eq. (3):

$d_{ij} = \left[(Mx_i - Mx_j)^T (Mx_i - Mx_j)\right]^{1/2} = \left[(x'_i - x'_j)^T (x'_i - x'_j)\right]^{1/2}$    (3)
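As a quick numerical illustration of the equivalence stated in Eqs. (1)-(3), the following sketch (a minimal example, not part of the original paper; the matrix M and the two sample vectors are arbitrary) checks that the GED with $S = M^T M$ coincides with the ordinary Euclidean distance computed on the transformed patterns $x' = Mx$:

```python
import numpy as np

# Arbitrary (hypothetical) transformation matrix and two sample patterns.
M = np.array([[2.0, 0.5],
              [0.0, 1.5]])
S = M.T @ M                      # S = M^T M is symmetric positive definite
xi = np.array([1.0, 2.0])
xj = np.array([-0.5, 3.0])

# Eq. (1)/(2): generalized Euclidean distance with matrix S.
d_ged = np.sqrt((xi - xj) @ S @ (xi - xj))

# Eq. (3): ordinary Euclidean distance on the transformed patterns x' = Mx.
d_euclid = np.linalg.norm(M @ xi - M @ xj)

print(d_ged, d_euclid)           # both values coincide (up to rounding)
assert np.isclose(d_ged, d_euclid)
```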
Table 1 displays several examples of such transformations: (a) rotation of the data points by an angle θ, (b) scaling of the dimensions, and (c) projection onto the y coordinate.

Table 1. Examples of linear transformations: (a) rotation, (b) scaling, (c) projection.

(a) $M = \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix}$    (b) $M = \begin{pmatrix} k & 0 \\ 0 & 1 \end{pmatrix}$    (c) $M = \begin{pmatrix} 0 & 0 \\ 0 & 1 \end{pmatrix}$

The main goal of this paper is to use an optimization method, the Covariance Matrix Adaptation Evolution Strategy (CMA-ES), to look for a transformation matrix M as defined in Eq. (3) (instead of a GED) in order to improve the classification rate of a classification algorithm. CMA-ES is one of the best evolutionary techniques for continuous optimization in difficult non-linear domains (Hansen & Ostermeier, 2001; Ostermeier, Gawelczyk, & Hansen, 1994). CMA-ES is an Evolution Strategy in which the search is guided by a covariance matrix, which is adjusted and updated during the search process. CMA-ES works well in both local and global optimization tasks. One of the most interesting features of CMA-ES for our purposes is the self-adaptation of mutation (the update of the covariance matrix), allowing for a finely tuned local search at the end of the optimization process. In this paper, we will use CMA-ES to optimize the transformation matrix M for a classifier. Any learning algorithm could have been used, but in the present work we will consider a nearest neighbor algorithm (KNN) with K = 1.

This paper reports the results of experiments using diagonal and square matrices M. (In this paper, the term "square matrix" refers to matrices where all components can be non-zero, as opposed to diagonal matrices, where only the diagonal components are non-zero.) Experiments show that classification rates can be improved by both kinds of matrices, but that in some cases diagonal matrices outperform square matrices. This outcome is rather unexpected, given that diagonal matrices are a subset of square ones. In order to deal with this issue, the second part of the paper tests a second representation of transformation matrices based on the Singular Value Decomposition theorem (SVD). The SVD allows any real matrix to be decomposed into two rotation matrices and one scaling matrix, as displayed in Eq. (4):

$M = U \Sigma V^{*}$    (4)

where $V^{*}$ is the conjugate transpose of V. In our case, all the matrices have real values and then $V^{*} = V^{T}$. Thus,

$M = U \Sigma V^{T}$    (5)

The SVD is valid for any matrix, including rectangular ones. In this paper, the matrices M are always square. In that case, U and V are both n × n unitary matrices and Σ is an n × n diagonal matrix with nonnegative real numbers on the diagonal. Both U and $V^T$ are rotation matrices, that is, they transform data by performing a rotation in n dimensions. This means that the matrix M works by first rotating the data ($V^T$), then scaling it (Σ), and then rotating it again (U). However, let us remember that in this paper we are using a nearest neighbor algorithm, which is rotation-invariant. This means that the final rotation U will not affect the classification results (this is also true of many other classification algorithms, like the Fisher Discriminant or Neural Networks, but not of others like C4.5). Therefore, in the current work and without losing generality, we will consider matrices of the form $\Sigma V^T$ (i.e. a rotation followed by a scaling).
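The claim that the leading rotation U can be dropped without affecting a distance-based classifier can be checked numerically. The following sketch (an illustration with an arbitrary random matrix, not code from the paper) decomposes M with the SVD and verifies that pairwise Euclidean distances under the full transformation M and under the reduced transformation $\Sigma V^T$ are identical:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(3, 3))          # arbitrary square transformation
X = rng.normal(size=(10, 3))         # a few sample patterns (one per row)

# SVD: M = U diag(s) Vt, with U and Vt orthogonal (rotations/reflections).
U, s, Vt = np.linalg.svd(M)
M_reduced = np.diag(s) @ Vt          # drop the final rotation U

def pairwise_dists(A, X):
    """Euclidean distances between all pairs of transformed patterns A @ x."""
    Z = X @ A.T
    diff = Z[:, None, :] - Z[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))

# Distances are unchanged when the leading orthogonal factor U is removed,
# so a nearest neighbor classifier behaves identically under both.
assert np.allclose(pairwise_dists(M, X), pairwise_dists(M_reduced, X))
```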
Our second method will therefore evolve the matrices Σ and $V^T$ separately, instead of letting CMA-ES evolve the matrix M directly. Moreover, the rotation matrix $V^T$ can be encoded very naturally by using the rotation angles explicitly. For instance, transformation (a) in Table 1 is a rotation by θ in two dimensions represented in matrix form, but this transformation can be represented more naturally by encoding just the rotation angle θ.

We expected that this natural representation of linear transformations would not fall into the same pitfall as the square matrices mentioned before, namely that diagonal matrices outperformed square ones. The rationale for this is that the SVD allows the diagonal part (which is basically a scaling of the data, represented by the matrix Σ) to be factored out from the rotation part. By allowing the algorithm to separate both components, it is expected that CMA-ES will be able to find the proper scaling when only a scaling is needed. Experimental results will show that this is the case.

The structure of this article is as follows. Section 2 describes the CMA-ES algorithm. Section 3 describes the proposed methods, both evolving the transformation matrices directly and evolving the rotation angles and the scaling factors. Section 4 describes the synthetic and real domains that have been used to test the approach, and also reports the results of the experiments. Finally, Section 5 draws some conclusions and points to future work.

2. The Covariance Matrix Adaptation Evolution Strategy (CMA-ES)

The CMA-ES algorithm (Covariance Matrix Adaptation Evolution Strategy) was proposed by Hansen (2009) and Hansen and Ostermeier (2001). It is an evolution strategy for difficult optimization problems in continuous domains, characterized by the use of a covariance matrix to guide the search process. In this section, we give a very short overview of a (μ, λ) CMA-ES, where μ is the number of parents and λ the number of offspring. The full algorithm is quite complex, so the interested reader should consult the CMA tutorial for a complete description (Hansen, 2009).

CMA-ES estimates a probability distribution from the best performing samples in order to guide the search towards promising regions of the search space. Let us suppose that we wish to optimize a fitness function $f(x): \mathbb{R}^p \rightarrow \mathbb{R}$, where p is the dimensionality of the problem. The basic CMA-ES algorithm follows a randomized black box search, described as follows:

1. Initialize the distribution parameters θ.
2. For each generation (iteration) t = 0, 1, 2, ...:
   (a) Sample λ points $x_1, \ldots, x_\lambda$ from the distribution $P(x \mid \theta^{(t)})$.
   (b) Evaluate the fitness f of $x_1, \ldots, x_\lambda$.
   (c) Update the distribution parameters θ based on the best performers $x_1, \ldots, x_\mu$ (with $f(x_1) \le f(x_2) \le \cdots \le f(x_\mu)$, for a minimization problem).

In CMA-ES, the probability distribution to be estimated is a multivariate normal $N(m, \delta^2 C)$, whose parameters θ are the mean m and the covariance matrix $\delta^2 C$. The mean represents the current location of the search and it moves towards better locations as the search progresses. The covariance matrix controls mutations and is used to guide the search. In some sense, the covariance matrix "points" towards better solutions and is estimated from past samples $x_i$ that performed well with respect to f.
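The three-step loop above can be written down in a few lines. The sketch below is a deliberately simplified illustration, not the actual CMA-ES implementation used in the paper (which relies on Hansen's code): it uses an isotropic Gaussian with a fixed mutation strength as the sampling distribution and a plain average of the best μ samples as the parameter update, just to make the sample-evaluate-update structure concrete:

```python
import numpy as np

def black_box_search(f, p, lam=20, mu=5, sigma=0.5, generations=100, seed=0):
    """Toy randomized black-box search following the sample/evaluate/update loop.

    f        : fitness function R^p -> R (to be minimized)
    p        : problem dimensionality
    lam, mu  : number of offspring (lambda) and selected parents (mu)
    sigma    : fixed mutation strength (CMA-ES adapts this; here it is constant)
    """
    rng = np.random.default_rng(seed)
    m = rng.normal(size=p)                        # 1. initialize distribution parameters
    for _ in range(generations):                  # 2. generation loop
        X = m + sigma * rng.normal(size=(lam, p)) # (a) sample lambda points
        fitness = np.array([f(x) for x in X])     # (b) evaluate them
        best = X[np.argsort(fitness)[:mu]]        # (c) keep the mu best performers...
        m = best.mean(axis=0)                     #     ...and update the mean
    return m

# Example usage on the sphere function.
print(black_box_search(lambda x: np.sum(x ** 2), p=5))
```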
It is important to remark that CMA-ES decomposes the total covariance matrix into a covariance matrix C and the overall variance $\delta^2$, also called the step-size control. The designers of the algorithm have found it useful to control the size of the mutation steps $\delta^2$ independently of the direction of the search C. The reason is that estimating C requires many samples, and therefore takes longer to adapt, whereas $\delta^2$ needs to be adapted within shorter time frames.

It has to be clarified that the mean, the step-size, and the covariance matrix are updated every generation t, so they should be written as $m^{(t)}$, $\delta^{(t)}$, $C^{(t)}$. However, in order to simplify the notation, the generation t will be omitted, with the understanding that a parameter appearing on the right side of an equation belongs to generation t, and to generation t + 1 if it appears on the left side.

Sampling λ points from $N(m, \delta^2 C)$ can easily be done according to Eq. (6), where I is the identity matrix and $z \in \mathbb{R}^p$ is a random vector generated from N(0, I), a spherical multivariate normal distribution:

$x_i \leftarrow m + \delta \cdot \sqrt{C} \cdot z$    (6)

After sampling, the best μ samples $x_i$ are used to update the mean (m), the covariance matrix (C), and the step-size (δ). The update of the mean is straightforward, by means of Eq. (7), where the $w_i$ are user-selected weights:

$m \leftarrow \sum_{i=1}^{\mu} w_i x_i$    (7)

The update of C is done according to Eq. (8):

$C \leftarrow (1 - c_{cov}) \cdot C + \alpha_1 \, \mathrm{Cov}(s_c) + \alpha_2 \, \mathrm{Cov}(x_{i=1\ldots\mu})$    (8)

where $s_c$ is called the evolution path, and summarizes the best $x_i$ samples from the beginning of the run (not just the ones in the current generation). In some sense, it contains a summary of the global performance so far. $\mathrm{Cov}(s_c)$ is the covariance due to the evolution path and $\mathrm{Cov}(x_{i=1\ldots\mu})$ is the covariance due to the best performers in the current generation. $c_{cov} \in (0, 1]$ and $\alpha_1 + \alpha_2 = c_{cov}$ are learning ratios.

Finally, with respect to the adaptation of the step-size, δ may be either too long (and therefore it should be shortened) or too short (and then it should be made larger). CMA-ES updates δ by taking into account the samples $x_i$ along the evolution path. When the $x_i$ in the path are correlated (all $x_i$ go in the same direction), the same distance could be covered in fewer but longer steps, and therefore δ should be increased. On the other hand, when the evolution path is anticorrelated (the $x_i$ go in contrary directions and cancel each other out), the step-size δ should be made smaller, so that the region of the search space can be explored at a finer grain. In short, it is desirable that there are no correlations among the samples along an evolution path. This can be checked by comparing the length of the evolution path with the length of an evolution path under random selection, because random selection produces uncorrelated samples. Thus, the step-size is updated by means of Eq. (9):

$\delta \leftarrow \delta \cdot \exp\left( \beta \cdot \left( \frac{\| s_\delta \|}{E\left[\| N(0, I) \|\right]} - 1 \right) \right)$    (9)

where β is a parameter that determines the step-size variation between successive generations and $E[\| N(0, I) \|]$ is the expected length of an N(0, I)-distributed random vector in $\mathbb{R}^p$. Thus, Eq. (9) updates δ by comparing the length of the current evolution path $\| s_\delta \|$ with the expected length of an evolution path under random selection, $E[\| N(0, I) \|]$.
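To make Eqs. (6), (7) and (9) more concrete, the following fragment is a minimal sketch under simplifying assumptions (equal weights $w_i = 1/\mu$, a fixed β, and the mean of the selected z vectors used as a stand-in for a properly maintained evolution path); it is not the full CMA-ES of Hansen's code. It shows how one generation samples with $x_i = m + \delta \sqrt{C} z$ and then updates the mean and the step-size:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_and_update(f, m, delta, C, lam=20, mu=5, beta=0.2):
    """One simplified CMA-ES-style generation.

    Sampling follows Eq. (6): x_i = m + delta * sqrt(C) * z, z ~ N(0, I).
    The mean update follows Eq. (7) with equal weights w_i = 1/mu.
    The step-size update mimics Eq. (9), but uses the weighted mean of the
    selected z vectors as a stand-in for the evolution path s_delta.
    """
    p = m.size
    sqrtC = np.linalg.cholesky(C)                      # a matrix square root of C
    Z = rng.normal(size=(lam, p))
    X = m + delta * Z @ sqrtC.T                        # Eq. (6), one row per sample
    order = np.argsort([f(x) for x in X])[:mu]
    w = np.full(mu, 1.0 / mu)                          # user-selected weights
    m_new = w @ X[order]                               # Eq. (7)

    # Eq. (9): compare a path-like statistic with its expected length under
    # random selection, E||N(0, I)|| (standard approximation below).
    s = np.sqrt(mu) * (w @ Z[order])                   # stand-in for s_delta
    expected_len = np.sqrt(p) * (1 - 1 / (4 * p) + 1 / (21 * p ** 2))
    delta_new = delta * np.exp(beta * (np.linalg.norm(s) / expected_len - 1))
    return m_new, delta_new

m, d = sample_and_update(lambda x: np.sum(x ** 2), np.ones(4), 1.0, np.eye(4))
print(m, d)
```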
3. The method

In this paper we have applied CMA-ES to find data transformations that minimize the classification error (i.e. maximize the classification rate) of a learning technique. Two different ways of representing transformations have been tested. In the first method, the coefficients of the transformation matrices are encoded directly. In the second method, transformations are encoded as rotation angles and scaling factors. CMA-ES is extensively described in Ostermeier et al. (1994) and Hansen and Ostermeier (2001). (We have used the Matlab code available at http://www.lri.fr/~hansen/cmaes_inmatlab.html.)

3.1. First method: square and diagonal matrices

In the first method, we have considered two types of matrices: diagonal matrices (where all elements outside the diagonal are zero) and arbitrary non-diagonal matrices (n × n square matrices). Using diagonal matrices amounts to just a weighting of the attributes by the corresponding element of the diagonal, as shown in Eq. (10):

$\begin{pmatrix} k_1 & 0 \\ 0 & k_2 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} k_1 x_1 \\ k_2 x_2 \end{pmatrix}$    (10)

The reason for testing these two types of structure is to check whether square matrices are actually useful beyond a mere attribute weighting. Square matrices involve fitting many parameters (n × n) and it is important to know whether they improve accuracy, or rather produce overfitting because of the extra degrees of freedom, or underfitting because of not being able to adjust all the parameters properly.

In order to describe the application of any evolutionary algorithm, three elements have to be defined: the parameters, the chromosome, and the fitness function.

With regard to parameter setting, CMA-ES is almost parameter-free in the sense that it self-adjusts some of its own parameters during the search and also provides reasonable default values for the rest, which usually work well. Typically, CMA-ES starts from a random individual, but in order to reduce some randomness in the results, CMA-ES will start from the identity matrix in all runs. This is equivalent to starting with no transformation (or starting with the Euclidean distance). Also, the initial mutation step size has been fixed to 1.0 (although CMA-ES will adjust this value in the course of the search). Finally, a stopping criterion has to be provided. In this paper, we have checked empirically that 3000 fitness computations are enough for all domains.

With respect to the chromosome, CMA-ES individuals are matrices that represent linear transformations. Matrices can be directly encoded by flattening them (i.e. converting them into an array). If the matrix is square, the chromosome has $n^2$ components and is represented in Eq. (11):

$v = (x_{11}, x_{12}, \ldots, x_{21}, x_{22}, \ldots, x_{nn})$    (11)

If the matrix is diagonal, the chromosome has n components and is represented in Eq. (12):

$v = (x_{11}, x_{22}, x_{33}, \ldots, x_{nn})$    (12)

Finally, in order to evaluate the fitness of each individual, the original training dataset is transformed by the matrix encoded in the individual:

$T_M = M \cdot T$    (13)

where M is the transformation matrix corresponding to the individual, T is the original training set and $T_M$ is the transformed training dataset. Then, a classifier is trained on $T_M$. Any rotation-invariant learning algorithm L could have been used; in this paper, the neighborhood-based algorithm KNN (with k = 1) has been selected. It is easy to think of what kind of linear transformations could be useful for KNN, and this has helped in the design of some of the synthetic domains that will be described in the experimental section (Section 4). Fitness is then computed by performing a fivefold crossvalidation on $T_M$ (Eq. (14)). Crossvalidation is required here because the training error of KNN is always zero:

$E_M = \mathrm{error}(L, T_M)$    (14)
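As an illustration of this fitness evaluation, the sketch below gives our own reading of Eqs. (11)-(14); it is not the authors' code, it assumes scikit-learn is available, and it stores patterns as rows, so the transform is applied as $X M^T$. It decodes a flat chromosome into a square or diagonal matrix, transforms the training set, and returns the fivefold crossvalidation error of a 1-NN classifier:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def decode(chromosome, n, diagonal=False):
    """Rebuild the transformation matrix M from the flat chromosome (Eqs. 11-12)."""
    v = np.asarray(chromosome, dtype=float)
    return np.diag(v) if diagonal else v.reshape(n, n)

def fitness(chromosome, X, y, diagonal=False):
    """Fivefold crossvalidation error of 1-NN on the transformed data (Eqs. 13-14)."""
    n = X.shape[1]
    M = decode(chromosome, n, diagonal)
    X_transformed = X @ M.T                      # x' = M x, applied row-wise
    knn = KNeighborsClassifier(n_neighbors=1)
    accuracy = cross_val_score(knn, X_transformed, y, cv=5).mean()
    return 1.0 - accuracy                        # CMA-ES minimizes this error

# Example: the identity chromosome gives the error on the untransformed data.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 4)), rng.integers(0, 2, size=100)
print(fitness(np.eye(4).ravel(), X, y))
```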
3.2. Second method: rotation angles and scaling matrix

In the second method, a representation based on rotation angles and scaling factors is used directly, in order to check whether the poor results produced in some domains by the square matrices of the first method can be improved. Let us remember that in this second method, transformation matrices M are represented by means of a scaling (diagonal) matrix Σ and a rotation matrix V (Eq. (15)). The latter will be represented directly by rotation angles:

$M = \Sigma V^T$    (15)

The diagonal of the scaling matrix Σ is represented as a vector $s = (s_1, \ldots, s_n)$, where n is the dimension of the problem. The matrix V is represented as a vector of rotation angles $r = (r_1, \ldots, r_a)$, where $a = n(n-1)/2$. Both are coded in the chromosome v as follows:

$v = (r, s) = (r_1, \ldots, r_a, s_1, \ldots, s_n)$

where the first a elements are the rotation angles and the last n are the scaling factors. This means that our algorithm performs n(n − 1)/2 rotations and n scalings on the data. Algorithmically, an n-dimensional rotation can be generated by multiplication of rotation matrices R(α). In two dimensions, the matrix that rotates a given vector by a counterclockwise angle α is:

$R(\alpha) = \begin{pmatrix} \cos\alpha & -\sin\alpha \\ \sin\alpha & \cos\alpha \end{pmatrix}$    (16)

In three dimensions, rotations about the three axes in a counterclockwise direction give the matrices in Eqs. (17)-(19), respectively (Arfken, 1985; Goldstein, 1980):

$R_x(\alpha) = \begin{pmatrix} 1 & 0 & 0 \\ 0 & \cos\alpha & -\sin\alpha \\ 0 & \sin\alpha & \cos\alpha \end{pmatrix}$    (17)

$R_y(\alpha) = \begin{pmatrix} \cos\alpha & 0 & \sin\alpha \\ 0 & 1 & 0 \\ -\sin\alpha & 0 & \cos\alpha \end{pmatrix}$    (18)

$R_z(\alpha) = \begin{pmatrix} \cos\alpha & -\sin\alpha & 0 \\ \sin\alpha & \cos\alpha & 0 \\ 0 & 0 & 1 \end{pmatrix}$    (19)

The composition of rotation matrices $R_x(\alpha) R_y(\beta) R_z(\gamma)$ in three dimensions gives the transformation matrix of Eq. (20):

$\begin{pmatrix} \cos\beta\cos\gamma & \cos\beta\sin\gamma & \sin\beta \\ -\cos\alpha\sin\gamma - \cos\gamma\sin\alpha\sin\beta & \cos\alpha\cos\gamma - \sin\alpha\sin\beta\sin\gamma & \cos\beta\sin\alpha \\ \sin\alpha\sin\gamma - \cos\alpha\cos\gamma\sin\beta & -\cos\gamma\sin\alpha - \cos\alpha\sin\beta\sin\gamma & \cos\alpha\cos\beta \end{pmatrix}$    (20)

As we can see in Bäck (1996), only a few lines of code (Schwefel, 1980) are sufficient to perform these calculations. To implement our rotation-scaling transformation routine, we have added a scale transformation to this code, producing Algorithm 1, explained below. First, the main loop is entered for all data instances p, initializing the counter a to the number of rotation angles, i.e., a = n(n − 1)/2, where n is the domain dimension. Then, the next two loops (lines 3 and 6) move the indexes over the data T to perform 2D rotations, obtaining the rotated data TR (lines 8 and 9) in a counterclockwise direction, using the angles α taken from the vector r (line 7). At the end of each iteration over a data instance, we perform a scale transformation, multiplying it by the vector s and producing the final transformed data TM (line 14).

Algorithm 1. Rotation Angles Routine
1:  for i = 1 to p do
2:    a = n(n − 1)/2
3:    for j = 1 to n − 1 do
4:      n1 = n − j
5:      n2 = n
6:      for k = 1 to j do
7:        α = r(a)
8:        TR(i, n2) = T(i, n1) sin(α) + T(i, n2) cos(α)
9:        TR(i, n1) = T(i, n1) cos(α) − T(i, n2) sin(α)
10:       n2 = n2 − 1
11:       a = a − 1
12:     end for
13:   end for
14:   TM(i) = TR(i) s′
15: end for
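For readers who prefer executable code, here is a Python transcription of Algorithm 1. It is a sketch reflecting our reading of the routine, not the authors' original Matlab/Schwefel code: indices are shifted to 0-based, each planar rotation is applied to the already-rotated coordinates so that successive rotations compose, and the scaling in line 14 is taken as a component-wise multiplication by s.

```python
import numpy as np

def rotate_and_scale(T, r, s):
    """Apply the rotation-angles/scaling transformation of Algorithm 1.

    T : (p, n) array of data instances (one row per pattern)
    r : n*(n-1)/2 rotation angles (the r part of the chromosome)
    s : n scaling factors (the s part of the chromosome)
    """
    T = np.asarray(T, dtype=float)
    p, n = T.shape
    TM = np.empty_like(T)
    for i in range(p):                       # loop over data instances
        x = T[i].copy()
        a = n * (n - 1) // 2                 # number of rotation angles
        for j in range(1, n):                # lines 3-13: all coordinate planes
            n1 = n - j - 1                   # 0-based first coordinate of the plane
            n2 = n - 1                       # 0-based second coordinate of the plane
            for _ in range(j):
                a -= 1                       # 0-based index into r
                alpha = r[a]
                x1, x2 = x[n1], x[n2]
                x[n2] = x1 * np.sin(alpha) + x2 * np.cos(alpha)   # line 8
                x[n1] = x1 * np.cos(alpha) - x2 * np.sin(alpha)   # line 9
                n2 -= 1
        TM[i] = x * s                        # line 14: component-wise scaling
    return TM

# With all angles at 0 and all scalings at 1 the data is left unchanged
# (the identity transformation used to initialize CMA-ES).
X = np.arange(12, dtype=float).reshape(4, 3)
assert np.allclose(rotate_and_scale(X, np.zeros(3), np.ones(3)), X)
```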
The parameter setting for CMA-ES in the second method is similar to that of the first one: it starts with all the rotation angles set to zero and the scaling factors set to 1 (which corresponds to the identity transformation matrix). As in the first method, the initial step-size is set to 1.0 and the number of matrix evaluations allowed is the same as in the first method for each domain. For the same reasons explained in Section 3.1, and with the aim of comparing both methods, KNN (with k = 1) is also used as the classification algorithm.

4. Experiments

In this section, we carry out experiments on several artificial and real world domains.

4.1. Domains

We have used four synthetic domains and six real world domains. All of them correspond to classification problems and have numerical attributes. They are described next.

4.1.1. Artificial data domains

We have used a well-known synthetic dataset, the Ripley (Ripley, 1996) data domain, which has been widely used in the Machine Learning literature, and three more artificial domains that we have called RandomAttr, Straight-0 and Straight-45, specifically designed to check whether the approach works properly.

In the Ripley (RP) dataset each pattern has two real-valued coordinates and a class which can be 0 or 1. Each class corresponds to a bimodal distribution that is a balanced mixture of two normal distributions. The covariance matrices are identical for all the distributions and the centers are different. One of the issues that make this domain interesting is the large overlap between both classes.

RandomAttr (RA) is a two-class domain with four real-valued attributes x1, x2, x3, x4. The examples have been randomly generated following a uniform distribution in the interval [0, 1] for attributes x1, x2 and in the interval [0, 100] for attributes x3, x4. If x1 < x2 then the example is labeled as class '1'; otherwise, if x1 > x2 the example belongs to class '0'. Thus, attributes x3 and x4 are irrelevant. Because the ranges of the irrelevant attributes are much bigger, the classification accuracy of KNN is very poor (about 50%). The dataset is composed of 300 examples, 150 from each class.
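The description of RandomAttr is precise enough to be reproduced directly. The following sketch is our reconstruction, not the authors' generator; the random seed and the class-balancing scheme (rejection sampling until each class has 150 examples) are assumptions:

```python
import numpy as np

def make_random_attr(n_per_class=150, seed=0):
    """Generate the RandomAttr domain: the class is decided by x1 < x2,
    while x3 and x4 are irrelevant attributes with a much larger range."""
    rng = np.random.default_rng(seed)
    X, y = [], []
    while len(X) < 2 * n_per_class:
        x1, x2 = rng.uniform(0, 1, size=2)
        x3, x4 = rng.uniform(0, 100, size=2)
        label = 1 if x1 < x2 else 0
        # keep the classes balanced: 150 examples of each class
        if sum(1 for c in y if c == label) < n_per_class:
            X.append([x1, x2, x3, x4])
            y.append(label)
    return np.array(X), np.array(y)

X, y = make_random_attr()
print(X.shape, np.bincount(y))    # (300, 4) [150 150]
```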
Straight-45 (S45) is a two-class domain with two real-valued attributes. The examples have been generated in the following way: initially, 100 examples of class 1 are located at regular intervals on a straight line passing through the origin (0, 0) at an angle of 45° with respect to the horizontal axis. The distance between two consecutive points is 1. One hundred examples of class 0 are generated in the same way on a parallel straight line passing through the point (0, −1), in such a way that the nearest point to a given point always belongs to the opposite class, because it is located on the opposite parallel straight line. Then all points are perturbed by adding to each coordinate a random number uniformly distributed in [−0.5, 0.5]. The idea behind this domain is that the nearest neighbor algorithm will achieve a very bad result, because most of the time the nearest neighbor of a given point belongs to the opposite class, but certain transformations of the data involving rotations and coordinate scaling will allow a good classification rate.

Straight-0 (S0) is very similar to Straight-45. The only difference is that all the points have been rotated 45° clockwise. The motivation for using this domain is that in this case the data can be properly classified with a simpler transformation, because no rotation is needed. In short, Straight-45 requires both rotation and scaling (a square matrix) whereas Straight-0 requires scaling only (a diagonal matrix).
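The two Straight domains can likewise be reconstructed from their textual description. The sketch below is our reading, not the authors' generator; the random seed and whether the perturbation is applied before or after the 45° rotation are assumptions (here the perturbed Straight-45 points are rotated to obtain Straight-0):

```python
import numpy as np

def make_straight45(n_per_class=100, seed=0):
    """Straight-45: two parallel lines of points at 45 degrees, one per class,
    designed so that a point's nearest neighbor tends to lie on the opposite
    line, plus a uniform perturbation in [-0.5, 0.5] on each coordinate."""
    rng = np.random.default_rng(seed)
    t = np.arange(n_per_class, dtype=float)               # spacing 1 along the line
    d = np.array([np.cos(np.pi / 4), np.sin(np.pi / 4)])  # 45-degree direction
    class1 = t[:, None] * d                               # line through (0, 0)
    class0 = t[:, None] * d + np.array([0.0, -1.0])       # parallel line through (0, -1)
    X = np.vstack([class1, class0])
    X += rng.uniform(-0.5, 0.5, size=X.shape)             # perturb every coordinate
    y = np.array([1] * n_per_class + [0] * n_per_class)
    return X, y

def make_straight0(n_per_class=100, seed=0):
    """Straight-0: the same configuration rotated 45 degrees clockwise."""
    X, y = make_straight45(n_per_class, seed)
    c, s = np.cos(-np.pi / 4), np.sin(-np.pi / 4)         # clockwise rotation
    R = np.array([[c, -s], [s, c]])
    return X @ R.T, y

X45, y45 = make_straight45()
X0, y0 = make_straight0()
print(X45.shape, X0.shape)
```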
4.1.2. Real world data domains

We have used the well-known Blood (BL), Car (CR), Diabetes (DB), Ionosphere (IO), Iris (IR) and Wine (WI) domains from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/). Table 2 displays the characteristics of these UCI domains.

Table 2. UCI domains characteristics.

Domain            Instances   Attributes   Classes
Blood (BL)        748         5            2
Car (CR)          1728        6            4
Diabetes (DB)     768         8            2
Ionosphere (IO)   351         34           2
Iris (IR)         150         4            3
Wine (WI)         178         13           3

4.2. Experimental results: square vs. diagonal matrices

In this subsection we show the experimental results obtained when square and diagonal matrix transformations are evolved by CMA-ES and KNN (with k = 1) is used as the classifier in the transformed space.

The parameters for CMA-ES are the following: the initial standard deviation has been set to 1.0 for all the experiments. The maximum number of fitness evaluations has been set to the same value (3000) for both methods by means of preliminary experiments. The rest of the CMA-ES parameters are set to their default values.

In all domains we compare the classification results in three situations: for each fold, KNN classifies the original data, the data transformed by a diagonal matrix optimized by CMA-ES, and the data transformed by a square matrix optimized by CMA-ES. In all cases the linear transformations are done by means of complete matrices and thus the dimension of the transformed space remains unaltered.

Table 3 shows the mean classification accuracy rates obtained for all the domains and their standard deviations. The means have been obtained by averaging 10 executions of a 10-fold crossvalidation procedure. A Corrected Repeated 10 × 10-fold Crossvalidation T-Test (Bouckaert & Frank, 2004) has been used in all cases for testing whether the differences between an algorithm and the results obtained with the original data are statistically significant (α = 0.05). Statistically significant results are marked with a dagger (†). The last row of Table 3 counts in how many domains the differences are significant.

Table 3. Classification rate (percentage) with KNN (k = 1) for the original dataset and for the transformed data when square and diagonal matrix transformations are used. Statistically significant differences with respect to the original data are marked with a dagger (†).

                           Original data    CMA-ES diagonal   CMA-ES square
Straight-0 (S0)            9.20 ± 2.26      †99.70 ± 0.79     †99.85 ± 0.24
Straight-45 (S45)          9.20 ± 1.62      8.95 ± 1.89       †99.15 ± 0.58
RandomAtt (RA)             50.50 ± 2.38     †95.60 ± 1.81     †79.63 ± 6.28
Ripley (RP)                88.58 ± 0.36     88.91 ± 0.44      88.41 ± 0.57
Blood (BL)                 68.41 ± 0.57     67.92 ± 0.69      68.12 ± 1.01
Car (CR)                   87.49 ± 0.17     †95.82 ± 0.33     †97.52 ± 0.18
Diabetes (DB)              68.11 ± 1.02     67.10 ± 1.51      67.28 ± 0.95
Ionosphere (IO)            86.19 ± 0.39     89.94 ± 1.57      88.31 ± 0.63
Iris (IR)                  96.00 ± 0.00     94.80 ± 0.76      95.27 ± 1.24
Wine (WI)                  75.99 ± 1.47     †94.39 ± 1.14     †85.69 ± 1.64
Significantly better/worse                  4/0 (of 10)       5/0 (of 10)

In the Straight-0 domain, a good classification rate can be achieved with a simple transformation of the space, just compressing the x1 coordinate, because points belonging to the same class can then get as close to each other as needed. This simple transformation can be done with a diagonal matrix and therefore with a square matrix too. The results show that KNN only obtains a classification rate of 9.2% on the original dataset, but if the data is linearly transformed with a diagonal matrix the rate is 99.7%, and 99.8% if a square matrix is used. In this case both matrix types achieve practically the same results.

In the Straight-45 domain, a more complex transformation is needed: it is not enough to scale the coordinates, the points must be rotated as well. This linear transformation cannot be obtained with a diagonal matrix. The second row of Table 3 shows the results as expected: KNN obtains a very poor classification accuracy on the original data (9.20%). A diagonal matrix cannot produce an adequate transformation, and the results are also very poor, around 8.95%. On the contrary, when a square matrix is used, the results are close to 100%.

The RandomAttr domain has two irrelevant attributes whose numeric range is much bigger than the range of the relevant attributes. That is the reason why the accuracy of KNN on the original data is 50.5%. Scaling the irrelevant attributes should be enough to attain a much better accuracy, so both diagonal and square matrices should be appropriate. The results show that diagonal matrices obtain results over 90%. It can also be seen that in this domain, square matrices do not perform as well as diagonal matrices.

With regard to the remaining domains, CMA-ES diagonal obtains significantly better results than KNN on the original data for Car (95.82% vs. 87.49%) and Wine (94.39% vs. 75.99%). CMA-ES square also obtains significantly better results than KNN on the original data for the same domains: Car, 97.52% vs. 87.49%, and Wine, 85.69% vs. 75.99%.

Summarizing the results, it can be observed that CMA-ES diagonal and CMA-ES square behave as expected in the artificial domains, except in RandomAttr, where CMA-ES square obtains much worse results than expected. Also, it is never the case that the original data significantly outperforms any of the other two methods. There are some cases where diagonal matrices largely outperform square ones (RandomAttr and Wine). In those two domains, the square matrix results are much worse than the diagonal ones. This is remarkable, given that the set of square matrices includes the diagonal ones.

This issue deserves more attention because it would be useful to apply just one method to all domains: the evolution of square matrices. Square matrices have more parameters to fit than diagonal ones, but the worse results of square matrices in RandomAttr and Wine are not a case of overfitting.
We have extended the number of evaluations from 3000 to 10,000, and the results in RandomAttr improve from 79.63% to 88.86%, and in Wine from 85.69% to 87.14%. These improvements hint that the issue here is not overfitting but rather underfitting due to a very slow convergence. In any case, RandomAttr is a very simple problem, and requiring more than 10,000 evaluations in order to obtain the same results as a diagonal matrix does not seem appropriate. In order to deal with this issue, we have proposed a second way of representing the transformation matrix, which was described in Section 3.2. The next section (Section 4.3) reports the experimental results.

4.3. Experimental results: rotation angles and scaling

In this subsection we show the experimental results obtained by the second method described in Section 3.2, when scaling factors and rotation angles are evolved and KNN (with k = 1) is used as the classifier in the transformed space. The parameters for CMA-ES are the same as in the previous subsection. We apply the method to the same data folds previously used when the first method was applied. Thus, the results can be compared with the ones displayed in Table 3, obtained when the transformation matrices were directly evolved.

In order to allow a good comparison of both methods, Table 4 shows all the results: column 1 shows the classification rates when KNN is applied to the original data. Columns 2 and 3 show the results corresponding to the first method explained in Section 4.2, when either diagonal or square matrices were directly evolved. Finally, column 4 displays the results obtained by the second method, when scaling factors and rotation angles are evolved. In all cases the results correspond to the mean classification accuracy rates obtained for all the domains and their standard deviations. As in the previous subsection, the means have been obtained by averaging 10 executions of a 10-fold crossvalidation procedure. As before, a Corrected Repeated 10 × 10-fold Crossvalidation T-Test (Bouckaert & Frank, 2004) has been used for testing whether each method is significantly better (or worse) than KNN on the original data (α = 0.05). A dagger (†) marks significant differences.

Table 4. Classification rate (percentage) with KNN (k = 1); this table adds to Table 3 a new column for the transformed data when scaling factors and rotation angles are directly evolved. Statistically significant differences with respect to the original data are marked with a dagger (†).

                           Original data    CMA-ES diagonal   CMA-ES square    CMA-ES rotation angles
Straight-0 (S0)            9.20 ± 2.26      †99.70 ± 0.79     †99.85 ± 0.24    †98.90 ± 0.66
Straight-45 (S45)          9.20 ± 1.62      8.95 ± 1.89       †99.15 ± 0.58    †98.60 ± 0.32
RandomAtt (RA)             50.50 ± 2.38     †95.60 ± 1.81     †79.63 ± 6.28    †97.70 ± 0.46
Ripley (RP)                88.58 ± 0.36     88.91 ± 0.44      88.41 ± 0.57     88.33 ± 0.68
Blood (BL)                 68.41 ± 0.57     67.92 ± 0.69      68.12 ± 1.01     67.31 ± 0.89
Car (CR)                   87.49 ± 0.17     †95.82 ± 0.33     †97.52 ± 0.18    †96.23 ± 0.37
Diabetes (DB)              68.11 ± 1.02     67.10 ± 1.51      67.28 ± 0.95     67.95 ± 1.07
Ionosphere (IO)            86.19 ± 0.39     89.94 ± 1.57      88.31 ± 0.63     †91.51 ± 0.98
Iris (IR)                  96.00 ± 0.00     94.80 ± 0.76      95.27 ± 1.24     94.93 ± 0.95
Wine (WI)                  75.99 ± 1.47     †94.39 ± 1.14     †85.69 ± 1.64    †96.37 ± 0.80
Significantly better/worse                  4/0 (of 10)       5/0 (of 10)      6/0 (of 10)

In the Straight-0 domain, as we said before, a good classification rate can be achieved just by compressing the x1 coordinate, because points belonging to the same class can then get as close to each other as needed.
This simple transformation can be done with a diagonal matrix, and therefore with a square matrix and a rotation-angles matrix too. The results show that KNN only obtains a classification rate of 9.2% on the original data set, but if the data is linearly transformed with a square matrix the rate is 99.8%, and 98.9% if a rotation-angles matrix is used.

In the Straight-45 domain, a rotation is needed. Table 4 shows the results as expected: KNN obtains a very poor classification accuracy on the original data (9.2%) and when a diagonal matrix is used (8.95%). On the contrary, when a square matrix or a rotation-angles matrix is used, the results are close to 100%.

In the RandomAttr domain, scaling the irrelevant attributes should be enough to attain a much better accuracy. Therefore, diagonal matrices should be enough and, given that square matrices include diagonal ones, square matrices should get similar results. But we saw in Table 3 that square matrices did not reach the expected performance. On the other hand, rotation-angles matrices perform very well, and they obtain a rate of 97.70% vs. 79.63% for square matrices. In this domain we observe that encoding the scaling factors and the rotation angles directly facilitates the search for the appropriate transformation.

Regarding the real data domains, we can see that the rotation-angles method also solves the stagnation problem for the Wine domain: 96.37% (rotation angles) vs. 85.69% (square matrix) vs. 94.39% (diagonal matrix).

Summarizing the results, it can be observed that the CMA-ES rotation-angles method behaves as expected in all domains, avoiding the situations where the first method (evolving the square matrices directly) stagnated in some domains (RandomAttr and Wine). Rotation-angles is also the method that obtains the highest number of statistically significant differences with the original data: 6 out of 10, vs. 5 out of 10 (square matrices) and 4 out of 10 (diagonal matrices).

5. Conclusions

In this work, the evolutionary method CMA-ES is used to search for appropriate linear transformations of the data that improve the success rate of KNN. Working with linear transformations has an important advantage: any learning algorithm with numerical attributes can be used, even if it is not based on distances.

In a first stage, linear transformations have been represented as straightforward matrices. Both diagonal and full square matrices have been considered. The method has been tested on different domains, both synthetic and real, and the results show that, in general, both the diagonal and the square matrices found by CMA-ES either outperform or match the classifier on the untransformed data. Square matrices have the advantage that they allow for the more complex transformations that are required in some of the domains. However, in certain domains diagonal matrices attain significantly better results than square ones. This is unexpected, given that diagonal matrices are a subset of square ones. We have checked that this is not due to overfitting but rather to a very slow convergence in some of the domains.

In order to deal with this issue, we have modified the method using a different representation of the transformation matrices, which is based on the Singular Value Decomposition theorem (SVD). According to the SVD, any linear transformation can be split into rotations and scalings. Thus, in the second part of this work, the transformations are directly represented by rotation angles and scaling factors.
By allowing the algorithm to separate both components, CMA-ES is expected to find the proper scaling when only a scaling (or diagonal matrix) is needed. Experimental results show that the rotation-angles representation manages to avoid the stagnation problem of the square matrices.

In the future, we will test which other learning methods benefit from data transformations. Also, the rotation-angles method could be used as a dimensionality reduction technique, in a similar vein to Sierra and Echeverría (2006).

Acknowledgment

This article has been financed by the Spanish MICINN research project MSTAR::UC3M, Ref. TIN2008-06491-C04-03, and by project A3::UAM, Ref. TIN2007-66862-C02-02.

References

Arfken, G. (1985). Mathematical methods for physicists (3rd ed.). Orlando, FL: Academic Press.
Atkeson, C. G., Moore, A. W., & Schaal, S. (1997). Locally weighted learning. Artificial Intelligence Review, 11, 11-73.
Bäck, T. (1996). Evolutionary algorithms in theory and practice. New York: Oxford University Press.
Bouckaert, R. R., & Frank, E. (2004). Evaluating the replicability of significance tests for comparing learning algorithms. Advances in Knowledge Discovery and Data Mining (PAKDD), 3-12.
Cover, T. M., & Hart, P. E. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1), 21-27.
Goldstein, H. (1980). Classical mechanics (2nd ed.). Reading, MA: Addison-Wesley.
Hansen, N. (2009). The CMA evolution strategy: A tutorial. Berlin: Technische Universität Berlin.
Hansen, N., & Ostermeier, A. (2001). Completely derandomized self-adaptation in evolution strategies. Evolutionary Computation, 9(2), 159-195.
Moody, J. E., & Darken, C. (1989). Fast learning in networks of locally tuned processing units. Neural Computation, 1, 281-294.
Ostermeier, A., Gawelczyk, A., & Hansen, N. (1994). A derandomized approach to self-adaptation of evolution strategies. Evolutionary Computation, 4(2), 369-380.
Ripley, B. D. (1996). Pattern recognition and neural networks. Cambridge: Cambridge University Press.
Schwefel, H.-P. (1980). Subroutines EVOL, GRUP, KORR - Listings and user's guides. Nuclear Research Centre (KFA) Jülich.
Sierra, A., & Echeverría, A. (2006). Evolutionary discriminant analysis. IEEE Transactions on Evolutionary Computation, 10(1), 81-92.
Tou, J. T., & Gonzalez, R. C. (1974). Pattern recognition principles. Addison-Wesley.
Valls, J. M., Aler, R., & Fernández, O. (2007). Evolving generalized Euclidean distances for training RBNN. Computing and Informatics, 26, 33-43.
Weisberg, S. (1985). Applied linear regression. New York: John Wiley and Sons.