Microsoft Word - 26_BankoZoltan.doc Magyar Kutatók 8. Nemzetközi Szimpóziuma 8th International Symposium of Hungarian Researchers on Computational Intelligence and Informatics 295 Correlation Based Dynamic Time Warping Zoltán Bankó, János Abonyi Department of Process Engineering, University of Pannonia, Veszprém, Hungary www.fmt.uni-pannon.hu/softcomp, e-mail: abonyij@fmt.uni-pannon.hu Abstract: Dynamic Time Warping (DTW) is a widely used technique for univariate time series comparison. This paper proposes a new algorithm for the comparison of multivariate time series which generalize DTW for the needs of correlated multivariate time series. Keywords: multivariate time series, principal component analysis, dynamic time warping, segmentation 1 Introduction A time series is a sequence of values measured as a function of time. These kinds of data are widely used in the field of process engineering, medicine, bioinformatics, chemistry and finance. The increasing popularity of knowledge discovery and data mining tasks for discrete data such as clustering (partitioning the data into coherent, but not predefined subsets), classification (placing the data into predefined, labeled groups) and kNN (k-nearest neighbor) search has indicated the growing need for similarly efficient methods for time series databases. These methods share a common requirement: a similarity measure has to be defined between the elements of a given database. The similarity of multivariate time series can be approached from two different perspectives. The first way is the application of metrics based warping measures such as Dynamic Time Warping (DTW) and Longest Common SubSequence (LCSS). These techniques are perfectly suitable for univariate tasks like speech recognition, where the analyzed process represented by one variable only. In most cases these methods can be easily generalized for the needs of the multivariate time series, where the process depends on two or more variables. However, their application for comparing correlated multivariate time series is often not as effective as it is expected. The direct comparison of the variables used by these approaches ignores the hidden process, i.e. the correlation between process variables and often this hidden process carries the real information. Hence, Principal Component Analysis (PCA) based similarity measures are used to overcome this problem. PCA considers the Z. Bankó et al. Correlation Based Dynamic Time Warping 296 time series as a whole but does not take into account the alterations in the relationship between the variables. Our main goal is to construct a similarity measure which not only compares the variables directly, but the process behind them and is able to deal with the changes in the correlation structure of the variables. 1.1 Contribution In this paper the problem of creating similarity measures for multivariate time series is investigated. We present a new and intuitive multivariate dynamic time warping method which is based on the correlation of variables and uses principal component analysis. This makes our similarity measure invariant to phase shifts of the time axis and it is capable to compensate the differences in sampling rates and ‘locally elastic’ shifts (local time warping) of the time series. We do not examine the data validity and we omit the preprocessing steps for brevity. We treat that the time series are sampled (or can be resampled) at equispaced time points. The rest of the paper is organized as follows. Section 2 details the theoretical background of the proposed similarity measure, i.e. dynamic time warping, principal component analysis and segmentation. In Section 3 we introduce our novel similarity measure which allows to combine the DTW and PCA, while Section 4 conducts a detailed empirical comparison of our method with basic techniques on verification databases widely used by the time series data mining community [1]. Finally in Section 6 we present our conclusions and suggestions for future work. 2 Background An n-variable, m-element time series, [ ]nn xxxX ,,, 21 K= , is an m-by-n element matrix, where [ ]Tiiii mxxxx )(,),2(),1( K= , is the ith variable and )( jxi its jth element. [ ])(,),(),()( 21 jxjxjxjX nn K= is the jth sample of nX . The similarity between nX and nY is denoted by ),( nn YXs , where 1),(0 ≤≤ nn YXs , ),(),( nnnn XYsYXs = and 1),( =nn XXs . Obviously, the similarity is nothing more than a real number between zero and one which expresses the tightness of connection between the processes behind the time series. The closer is the number to one the more similar the processes are. In Magyar Kutatók 8. Nemzetközi Szimpóziuma 8th International Symposium of Hungarian Researchers on Computational Intelligence and Informatics 297 practice, the term distance or dissimilarity ( d ) is used instead of similarity. The value of the distance is given by a number ranged from zero to one. This can be associated with similarity: ),(1),( nnnn YXsYXd −= . Note that is not expected from an applied distance to satisfy the triangle inequality when the distance is not used as the basic distance of an indexing method. 2.1 Dynamic Time Warping Getting valuable information from time series data by data mining methods require a good similarity measure; even so, the traditional approaches are rarely precise enough for the most applications. This is caused by the brittleness of the conventional comparison methods. Distance measures such as Euclidean distance, are unable to handle the distortions in time axis, so these distortions almost randomly affect the distance between time series. The solution for this problem is the application of dynamic time warping which can ‘warp’ the original time series (nonlinearly dilate or compress their time axes) to be similar in shape to the query series as much as possible. In the following the DTW algorithm is reviewed for univariate time series x and y . To align these sequences, firstly we define a grid D with size n x m where each cell represents a distance between the appropriate indices of the two time series. In this step we can choose any application-dependent distance like L1 and L∞ norms, but generally Euclidean distance is suggested because it allows of the efficient indexing of dynamic time warping [2]. Considering this we have: 2 2))()((),( jyixjiD −= (1) Using grid D, we can create arbitrary mappings – called warping paths – between x and y. The construction of a warping path [ ])(,),2(),1( lppp K has to be restricted with the following constraints: - Boundary condition: The path has to start in )1,1(D and end in ),( mnD . - Monotonicity: The path has to be monotonous, i.e. always heading from )1;1(D to ),( mnD . If ),()( jiDkp = and )','()1( jiDkp =+ then 0' ≥−ii and 0' ≥− jj . - Continuity: The path has to be continuous. If ),()( jiDkp = and )','()1( jiDkp =+ then 1' ≤−ii and 1' ≤− jj . Z. Bankó et al. Correlation Based Dynamic Time Warping 298 Figure 1 The cumulative distance matrix and the optimal warping path on it To find the optimal warping path (the DTW distance of the two time series), every warping path has an assigned cost which is the sum of values of the affected cells divided by the normalization constant K: ⎟ ⎟ ⎟ ⎟ ⎠ ⎞ ⎜ ⎜ ⎜ ⎜ ⎝ ⎛ = ∑ = K ip yxd l i DTW 1 )( min),( (2) The value of K depends on the application and in most cases this is the length of the path, but it can also be omitted. More information about the method of defining K and its significance can be found in Reference [3]. Note that the Euclidean distance is a special case of DTW, i.e. we choose the path that is located on the diagonal of grid D and K = 1. Obviously, the number of paths exponentially grows by the size of time series. Fortunately, the optimal path can be found in O(nm) time with the help of dynamic programming using cumulative distance matrix D̂ . A cell of the cumulative matrix contains the sum of the appropriate cell value in matrix D and the minimum of the three cells from where the cell can be reached: ⎥ ⎥ ⎥ ⎦ ⎤ ⎢ ⎢ ⎢ ⎣ ⎡ +−− +− +− = ),()1,1(ˆ ),()1,(ˆ ),(),1(ˆ min),(ˆ jiDjiD jiDjiD jiDjiD jiD (3) The DTW distance between the two time series can be found in ),(ˆ mnD . Magyar Kutatók 8. Nemzetközi Szimpóziuma 8th International Symposium of Hungarian Researchers on Computational Intelligence and Informatics 299 2.2 PCA Similarity Factor PCA is a well known and widely used multivariate statistical technique for reducing the dimensionality of the data. Fundamentally, this is a linear transformation that projects the original (Gaussian distributed) data to a new coordinate system with minimal loss of information. In multivariate cases the information is the structure of the original data, i.e. the correlation between the variables and the alteration of the correlation structure among them. To create a projection based on this, PCA selects the coordinate axes of the new coordinate system one by one according to the greatest variance of any projection. In projected space the first coordinate (first principal component) is the vector which has maximal variance in the original data space, the second coordinate (second principal component) has maximal variance in the original data space among all vectors which orthogonal to the first coordinate, the third coordinate orthogonal to the first two and has maximal variance in original data space, etc. Technically speaking, we select the eigenvectors of the covariance matrix of the original data (according to the descending order of corresponding eigenvalues) for axes in the new dimensional space. Figure 2 Dimension reduction with the help of PCA from 3 (filled dots) to 2 (patched dots) dimensions. Notice that the correlation between the dots is maximally preserved. In most real multivariate processes the high-order principal components do not add any significant information to the low-order components so we can omit them to perform dimensionality reduction. Krzanowski defined [4] the PCA similarity factor to measure the similarity between different data by comparing the hyperplanes (the dimensionality reduced, new coordinate systems). Z. Bankó et al. Correlation Based Dynamic Time Warping 300 Let nX and nY two multivariate time series with n variable. pX nU , and pYnU , indicates the matrices of eigenvectors which belong the most important p eigenvalues of nX 's and nY 's covariance matrix, i.e. the two hyperplanes. The similarity between them: p UUUUtr YXs pX T pYpY T pX nnPCA nnnn )( ),( ,,,,= (4) The similarity factor has a geometrical explanation, because it measures the similarity between the two hyperplanes by computing the squared cosine values between all the combinations of the first p principal components from pX nU , and pYn U , : ∑∑ = = Θ= p i p j jinnPCA p YXs 1 1 , 2cos 1 ),( (5) where ji,Θ is the angle between the ith principal component of nX and the jth principal component of nY . 2.3 Segmentation The ith segment of nX is a set of consecutive time points, [ ])(;);1();(),( bXaXaXbaS nnni K+= . The c-segmentation of time series nX is a partition of nX to c non-overlapping segments, [ ]),1(;);,1();,1( 21 mkSbaSaSS ccX n ++= K . In other words, a c- segmentation splits nX to c disjoint time intervals, where a≤1 and mk ≤ . The main goal of segmentation is to find homogenous segments by the definition of a cost function. This function can be any arbitrary function which projects from the space of multivariate time series to the space of the positive real numbers and can be framed in several ways [5]. Abonyi et al. presented a PCA based segmentation in Reference [6], where the authors used the Hotelling’s T2 statistics and the Q reconstruction error as the measure of the homogeneity of the segments, i.e. the cost function. Magyar Kutatók 8. Nemzetközi Szimpóziuma 8th International Symposium of Hungarian Researchers on Computational Intelligence and Informatics 301 Figure 3 The distance measures used in PCA based segmentation 2.4 Related Work The data mining community has been investigating different similarity measures for univariate time series for a long time. The majority of these works have focused on the Minkowski metrics (Lp norms), especially on the Euclidean distance. This popularity does not only arise from its simple geometric representation and fast computability, but also from the fact that the Euclidean distance is preserved under orthogonal transformations such as Fourier and Wavelet Transform and satisfies the triangle inequality. From the viewpoint of data mining the main advantage of these transforms is that the original time series – in most cases – are well approximated using the first few coefficients of their transformed forms and the relations of the distances between the elements are also preserved. Technically speaking, the Euclidean distance allows the efficient indexing of large time series databases. The GEMINI framework generalized the creation of the indexing methods, thus, it provided facilities for new approaches which held out higher accuracy. In parallel with this, the increasing and less and less costly computational power made it possible to create similarity measures without indexing capabilities for tasks where the quality of the result is much more important than the speed of the calculation. Hence, new Z. Bankó et al. Correlation Based Dynamic Time Warping 302 techniques such as Dynamic Time Warping, Longest Common Subsequences and Symbolic Aggregate approXimation has been studied extensively from the requirements of ECG signal analysis through handwriting recognition and verification to tornado prediction. On the other hand, the PCA based similarity factor also gains in popularity thanks to its outstanding property which makes possible to recognize the direction of the changes in the distance between the variables. Another advantage of this method is its optimal variable reduction behavior which makes it ideal for tasks with numerous variables such as process engineering jobs. Johannesmayer [6] modified the PCA similarity factor by weighting each principal component with the square root of its eigenvalue: , )( cos)( ),( 1 , 2 1 1 ∑ ∑∑ = = = Θ = p i Y i X i ji p i p j Y j X i nnPCA nn nn YXs λλ λλ λ (6) where nXiλ and n Y iλ are the corresponding eigenvalues for the ith and jth principal component of nX and nY . This principle was developed by K. Yang [7] who presented the logical extension of PCA similarity factor and λPCAs called Eros (Extended Frobenius Norm): ∑∑ == Θ== n i ii n i T YXinnEros wiUiUwYXs nn 11 ,cos)()(),( (7) where iΘcos is the angle between the two corresponding principal components and iw is the weighting vector which is based on the eigenvalues of the sequences in the data set. 3 Correlation Based Dynamic Time Warping of Multivariate Time Series In the previous chapters, all of the required techniques have been reviewed for correlation based dynamic time warping of multivariate time series. Magyar Kutatók 8. Nemzetközi Szimpóziuma 8th International Symposium of Hungarian Researchers on Computational Intelligence and Informatics 303 3.1 Proposed Algorithm The following steps have to be performed for the realization of correlation based multivariate time warping: • Segment the time series of the given database by correlation to create the most homogenous segments. The projection error or Hotelling’s statistic can be used as the cost function. This segmentation can be done off-line in most cases. • Segment the query time series when it arrives similarly. • Calculate the distance between the query and the time series of the database with dynamic time warping. The base distance of DTW can be any distance which is based on the previously presented covariance based similarities (e.g. 1− PCAs ). The size of matrix D̂ (the computation time of DTW) depends on the number of the segments. 4 Evolution Method and Experimental Results A modified leave-one-out k-nearest neighbor search algorithm [8] was used to perform the validation of multivariate dynamic time warping. After the algorithm executes the recall-precision graph, which has been frequently used to measure the performance of Content Based Image Retrieval (CBIR) systems as well as Information Retrieval (IR) systems, can be plotted. The precision expresses the proportion of the relevant sequences from all the sequences are retrieved. Similarly, the recall is the percentage of the total relevant documents in a database retrieved by the k-nearest neighbor search. 4.1 Comparison of Similarity Measures According to Reference [9], two dataset were used for the validation and both of them are available from the Internet. The first one is the database of SVC2004 (Signature Verification Conference 2004) which contains 40 sets of signature data1. This is an ideal dataset for correlation based multivariate time warping because the correlation alternates many times between the variables. 1 Each set contains 20 genuine signatures from one signature contributor and 20 skilled forgeries from five other contributors. Although both genuine signatures and skilled forgeries are available, obviously the 800 genuine signatures were used for the validation. Z. Bankó et al. Correlation Based Dynamic Time Warping 304 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Recall P re ci si on CbDTW SegPCA SPCA Figure 4 The Recall-Precision graph of SVC2004 dataset Firstly, the best projection was selected for similarity factor based comparison. The best result was the one which compares the 2 dimensional hyperplanes. The first two eigenvalues describe more than the 99% of the total variance. This case is signed by the red line. After this, the sequences were segmented into 20 parts and the distances between the corresponding segments were calculated with the help of the Krzanowski similarity factor and the segmentation cost was the Hotelling’s T2 statistics. The graph of SegPCA shows that this technique really outperforms the segmentation free version. For fairness to this method, the correlation based DTW (CbDTW) used the same segmentation while performed the non-linear pairing between the segments. The superiority of this method is obvious for this dataset. Magyar Kutatók 8. Nemzetközi Szimpóziuma 8th International Symposium of Hungarian Researchers on Computational Intelligence and Informatics 305 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Recall P re ci si on CbDTW SegPCA SPCA Figure 5 The Recall-Precision graph of AUSLAN dataset The second validation was processed on the Australian Language Sign dataset (AUSLAN) collected by Mohammed Waleed Kadous [10]. This contains 95 signs and each of them has 27 examples, so the total number of sings is 2565. The average length of the time series in the database is only 57. In Figure [5], the high values show that the correlation between the variables is much stronger than in SVC2004 dataset and this characterize the classes much more. The data was projected into 2 dimensional hyperplanes similarly to the first validation. These principal components describe more than the 95% of the total variance. For the last two methods, interpolation has to be executed and segmentation was applied on these interpolated versions of the original data. The segmentation cost was selected to Hotelling’s T2 statistic and the number of segments was 20. Conclusions The paper presented a new similarity measure for multivariate time series by combining dynamic time warping and principal component analysis. The results of the validations are more then promising, however, there is still a lot of work that has to be done and there are lots of questions that have to be answered. The other similarity measures (such as Eross and λ PCAs ) do not provide as good results Z. Bankó et al. Correlation Based Dynamic Time Warping 306 as the PCA similarity factor. Presumably the reason for this is the segmentation. It has to be examined how the best segmentation result can be achieved for a given similarity measure. Acknowledgement The authors acknowledge the financial support of the Cooperative Research Centre (VIKKK, project 2004-I), the Hungarian Research Found (OTKA 49534), the Bolyai János fellowship of the Hungarian Academy of Science, and the Öveges fellowship. Thanks to Prof. Eamonn J. Keogh for providing datasets from Time Series Data Mining Archive. References [1] S. Hettich, S. D. Bay: The UCI KDD Archive, http://kdd.ics.uci.edu, University of California, Department of Information and Computer Science, 1999 [2] E. Keogh: Exact Indexing of Dynamic Time Warping, in Proceedings of 28th VLDB Conference, Hong Kong, China, 2002, pp. 406-417 [3] H. Sakoe, S. Chiba: Dynamic Programming Algorithm Optimization for Spoken Word Recognition, IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. 27(1), 1978, pp. 43-49 [4] W. J. Krzanowski: Between-Groups Comparison of Principal Components, in Journal of the American Statistical Society, pp. 74:703-707, 1979 [5] E. Keogh, S. Chu, D. Hart, M. J. Pazzani: An Online Algorithm for Segmenting Time Series, in Proceedings of ICDM, 2001, pp. 289-296 [6] J. Abonyi, B. Feil, S. Németh, P. Árva: Principal Component Analysis- based Time Series Segmentation, in Proceedings of the IEEE International Conference on Computational Cybernetics, 2003 [7] M. C. Johannesmeyer: Abnormal Situation Analysis Using Pattern Recognition Techniques and Historical Data, M.sc. thesis, University of California, Santa Barbara, CA, 1999 [8] K. Yang, C. Shahabi: A PCA-based Similarity Measure for Multivariate Time Series, in Proceedings of the 2nd ACM International Workshop on Multimedia Databases, 2004, pp. 65-74 [9] E. Keogh, S. Kasetty: On the Need for Time Series Data Mining Benchmarks: a Survey and Empirical Demonstration, in Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002, pp. 102-111 [10] Mohammed Waleed Kadous: Temporal Classification: Extending the Classification Paradigm to Multivariate Time Series, PhD thesis, School of Computer Science & Engineering, University of New South Wales, 2002