key: cord-0546711-18h3uavm authors: James, Nick; Menzies, Max; Azizi, Lamiae; Chan, Jennifer title: Novel semi-metrics for multivariate change point analysis and anomaly detection date: 2019-11-04 journal: nan DOI: nan sha: 817a1500b7505c1be62d1b69a238d696b3c9c975 doc_id: 546711 cord_uid: 18h3uavm

This paper proposes a new method for determining similarity and anomalies between time series, most practically effective in large collections of (likely related) time series, with a particular focus on measuring distances between structural breaks within such a collection. We consolidate and generalise a class of semi-metric distance measures, which we term MJ distances. Experiments on simulated data demonstrate that our proposed family of distances uncovers similarity within collections of time series more effectively than measures such as the Hausdorff and Wasserstein metrics. Although our class of distances does not necessarily satisfy the triangle inequality requirement of a metric, we analyse the transitivity properties of the respective distance matrices in various contextual scenarios. There, we demonstrate a trade-off between robust performance in the presence of outliers and the triangle inequality property. We show in experiments using real data that the contrived scenarios which severely violate the transitivity property rarely exhibit themselves in real data; instead, our family of measures satisfies all the properties of a metric most of the time. We illustrate three ways of analysing the distance and similarity matrices: via eigenvalue analysis, hierarchical clustering, and spectral clustering. The results from our hierarchical and spectral clustering experiments on simulated data demonstrate that the Hausdorff and Wasserstein metrics may lead to erroneous inference as to which time series are most similar with respect to their structural breaks, while our semi-metrics provide an improvement.

Metric spaces (X, d) appear throughout mathematics. One particular field of study that has arisen in image detection and other applications is the study of metrics on the power set P(X) of (certain) subsets of X. The most utilized metric in this context is the Hausdorff metric, which we introduce and summarize in section 2. This provides a metric between closed and bounded subsets of any ambient metric space (X, d). In addition to the Hausdorff metric there are several semi-metrics, satisfying some but not all of the properties of a metric, which are still useful. These properties will also be summarized in section 2. Conci and Kubrusly [1] give an overview of certain such (semi-)metrics on the space of subsets and some applications. This review breaks down the applications of such distances between subsets into three primary areas, which we will outline with reference to other elements of the literature:
1. Computational aspects
2. Distances between fuzzy sets
3. Distances in image analysis.
First, there are three subcategories of computational applications of the Hausdorff family of metrics: distance in graphs, distance between polygons, and numerical procedures and algorithms. In [42], Eiter and Mannila considered the Hausdorff metric along with some variants, and compared these measures both in theory and in practical settings such as link distances in graphs. In [30], Atallah proposed algorithms for computing Hausdorff distances between pairs of convex polygons, which he later extended in [6]. In [29], Barton et al.
proposed algorithms for computing the Hausdorff distances between general polyhedra represented by triangular meshes. Shonkwiler proposed an algorithm in [38] that computes a Hausdorff distance between images in linear time. In [13] and [16], Huttenlocher proposed novel advancements in computational costs for Hausdorff distance calculation in various Euclidean-geometric scenarios. In [33], Aspert, Santa-Cruz and Ebrahimi considered numerical comparisons for reducing the computational effort and memory usage of Hausdorff distance calculation in triangular meshes.

Second, there have been extensions of the Hausdorff metric applied to measurements between fuzzy sets. Brass [36] outlined systems of metric axioms for fuzzy sets, and Fujita [34] applied a variety of distance measures to fuzzy sets, starting with the Hausdorff metric and the Marczewski-Steinhaus finite-version metric. In [2], Gardner, Kanno, Duncan and Selmic extended the work of Marczewski-Steinhaus and showed a certain measure is indeed a metric and suitable for various machine learning applications. Hausdorff-like distance measures between fuzzy sets have been considered by many authors including Rosenfeld [4], Chaudhuri and Rosenfeld [9, 10], Boxer [28] and Fan [27].

Third, the Hausdorff distance and variations have been applied widely in the field of image analysis. Baddeley [5] cited the Hausdorff metric's sensitivity to outliers and lack of suitability in image processing problems. Fujita [34] and Gardner et al. [2] proposed original metrics. Dubuisson and Jain [32] considered most of the Hausdorff family in image processing problems. Rote [40], Li et al. [7], Huttenlocher et al. [13, 16] and Rucklidge [43, 44, 45] investigated applications of the Hausdorff distance under a variety of geometric operations such as translation and rotation. Takacs [8] proposed an algorithm for face matching using the Hausdorff family of distances, while Sim [41], Zhao [12] and Jesorsky [35] proposed asymmetric variations for object matching and showed encouraging experimental results. Semi-metrics, where the triangle inequality is not satisfied, have been considered in various applications of distance measurement, including by Jacobs, Weinshall and Gdalyahu [17]. For an extensive review of more traditional Hausdorff-based distance applications and algorithmic computational aspects for image processing, see Rucklidge's text [44].

Change point detection is an important task in a wide variety of statistical modelling problems. Essentially, one estimates the location of changes in a time series' statistical properties, points in time at which the estimated probability density functions change. The change point model framework was initially introduced by Hawkins, Qiu and Kang in [24] and Hawkins and Zamba in [14]. More recently, the framework has been extended by Hawkins and Deng in [15], Ross, Tasoulis and Adams in [22], and Ross and Adams in [23]. The latter demonstrate non-parametric methods for change point detection in non-Gaussian sequences. Note that change point detection algorithms carry a critical restrictive assumption: the observations are required to be independent. Dependence may be handled by first modelling the autocorrelation structure in the data, and then applying change point detection to either the model residuals or the one-step-ahead prediction errors. For a thorough review, interested readers should consult Gustafsson [19].
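As a concrete illustration of this pre-whitening step, the sketch below is our own minimal example rather than any author's code; it assumes an AR(1) structure and uses the statsmodels AutoReg class to obtain residuals that a downstream change point routine can then be applied to.

```python
import numpy as np
from statsmodels.tsa.ar_model import AutoReg

def residuals_for_cpd(x, lags=1):
    """Fit an AR(lags) model and return its residuals.

    Change point detection can then be applied to these residuals
    (or to one-step-ahead prediction errors) rather than to the
    autocorrelated raw series.
    """
    fitted = AutoReg(x, lags=lags).fit()
    return fitted.resid

# Example: an AR(1) series (an assumed dependence structure) with a mean shift half-way through.
rng = np.random.default_rng(0)
n = 400
x = np.zeros(n)
for t in range(1, n):
    x[t] = 0.7 * x[t - 1] + rng.normal()
x[n // 2:] += 3.0                      # structural break in the mean

resid = residuals_for_cpd(x, lags=1)   # approximately uncorrelated errors
```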
Different change point detection algorithms may yield different results when applied to the same sequence of data. Each algorithm has its own set of assumptions and may be applied appropriately in different scenarios. Various change point algorithms test for shifts in different underlying distributional properties. The two-sample test statistic framework requires choosing an appropriate test based on the assumptions of the statistical model. The Student-t, Bartlett and generalized likelihood ratio tests detect changes in a Gaussian sequence of random variables [24, 14]. The Student-t and Bartlett tests detect changes in the mean and variance respectively, while the GLR may detect changes in both the mean and variance. Ross proposes Fisher's exact test, which may detect changes in a sequence of Bernoulli random variables [20]. Note that the tests described so far make strong assumptions on the statistical properties of the random variables. The Mann-Whitney test detects changes in the mean, while the Mood test detects changes in the variance in sequences of random variables [15, 23]. The Lepage, Kolmogorov-Smirnov and Cramer-von-Mises statistics detect general distributional changes, explored in Ross and Adams [39]. These tests make no assumptions about the underlying distribution of the data and random variables.

There has been extensive work in determining similarity between time series. Moeckel and Murray [37] survey the shortcomings of possible distance functions between time series, stating the Hausdorff metric's limitation in ignoring the frequency with which one set visits parts of the comparable set. That is, the metric only focuses on one measurement between two candidate sets, and is sensitive to outliers. Instead, they propose a distance function developed by Kantorovich that is based on geometric and probabilistic factors; this proves to be robust with respect to outliers, noise and discretisation errors. Moeckel and Murray also demonstrate that the transportation distance proposed by Kantorovich can be used to evaluate mathematical models for chaotic or stochastic systems, and for parameter estimation within a dynamical systems context. Hutchinson [26] used a related function in the setting of iterated function systems.

Unsupervised learning algorithms have been widely applied in unearthing patterns in data over the last several decades. Standard algorithms such as EM and K-means clustering are commonly applied; however, these methods have several drawbacks. First, parametric density estimation methods assume that each cluster is Gaussian. Second, the log-likelihood may have multiple local optima, requiring multiple restarts to find good solutions. More recently, there has been interest in spectral clustering techniques, which overcome these problems. Spectral clustering was first applied to computer vision and VLSI design systems by Malik and Alpert [25, 11]; the technique gained real traction in the machine learning community with the work of Ng, Jordan and Weiss [3]. Their algorithm demonstrated the potential failures of direct application of K-means clustering to an initial data set, and how, after projecting points to a different space, data may form tight clusters. Prior to Ng, Jordan and Weiss, there was disagreement among authors as to which eigenvectors should be used and how to derive clusters. For details, one should review Weiss' work [46]. Spectral clustering has been widely applied in different domains since its rise in popularity.
Applications include image segmentation, educational data mining, entity resolution, speech separation, clustering protein sequences and more; interested readers should see Suryanarayana [18].

To our knowledge, there has been no work in detecting the similarity between time series' change points. Change points signify changes in the statistical properties of a time series' distribution, so determining which time series are most similar with respect to the number and location of these changes is of interest to analysts, in a wide variety of disciplines, who wish to assess the underlying structure in larger collections of time series. Equally of interest are time series whose change points are dissimilar to the rest of the collection and exhibit anomalous behaviour with respect to their structural breaks.

Many statistical modelling problems require the identification of change points in sequential data. The general setup for this problem is the following: a sequence of observations x_1, x_2, ..., x_n are drawn from random variables X_1, X_2, ..., X_n and undergo an unknown number of changes in distribution at points τ_1, ..., τ_m. One assumes observations are independent and identically distributed between change points; that is, between consecutive change points the observations form a random sample from a fixed distribution. Following Ross [21], we notate this as follows:

X_i ∼ F_0 if i ≤ τ_1, X_i ∼ F_1 if τ_1 < i ≤ τ_2, ..., X_i ∼ F_m if i > τ_m.

While this requirement of independence may appear restrictive, dependence can generally be accounted for by modelling the underlying dynamics or drift process, then applying a change point algorithm to the model residuals or one-step-ahead prediction errors, as described by Gustafsson [19]. The change point models applied in this paper follow Ross et al. We briefly outline the two change point detection methods.

The first method, Phase I detection, is retrospective. We are given a fixed-length sequence of observations x_1, . . . , x_n from random variables X_1, . . . , X_n. For simplicity, assume at most one change point exists. If a change point exists at time k, observations have a distribution of F_0 prior to the change point, and a distribution of F_1 following the change point, where F_0 ≠ F_1. That is, one must test between the following two hypotheses for each k:

H_0: X_i ∼ F_0 for i = 1, . . . , n;
H_1: X_i ∼ F_0 for i = 1, . . . , k and X_i ∼ F_1 for i = k + 1, . . . , n;

and end with the choice of the most suitable k. One proceeds with a two-sample hypothesis test, where the choice of test is dependent on the assumptions about the underlying distributions. To avoid distributional assumptions, non-parametric tests can be used. One appropriately chooses a two-sample test statistic D_{k,n} and a threshold h_{k,n}. If D_{k,n} > h_{k,n} then the null hypothesis is rejected and we provisionally assume that a change point has occurred after x_k. These test statistics D_{k,n} are normalized to have mean 0 and variance 1 and evaluated at all values 1 < k < n, and the largest value is assumed to be coincident with the existence of our sole change point. That is, the test statistic is then

D_n = max_{1 < k < n} D_{k,n},

where each D_{k,n} is the standardized version of the corresponding unnormalized statistic. The null hypothesis of no change is then rejected if D_n > h_n for some appropriately chosen threshold h_n. In this circumstance, we conclude that a (unique) change point has occurred and its location is the value of k which maximizes D_{k,n}. That is,

τ̂ = argmax_{1 < k < n} D_{k,n}.

This threshold h_n is chosen to bound the Type 1 error rate, as is standard in statistical hypothesis testing.
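A minimal sketch of this Phase I scan follows (our own illustration rather than the implementation in Ross's cpm package [21]); it uses the Mann-Whitney statistic as the two-sample test, standardises it with the usual null mean and variance of the U statistic, and treats the threshold h_n as a user-supplied value rather than deriving it from the null distribution discussed next.

```python
import numpy as np
from scipy.stats import mannwhitneyu

def phase1_change_point(x, h_n, min_seg=5):
    """Retrospective (Phase I) single change point detection.

    For each candidate split k, compute a two-sample Mann-Whitney U
    statistic between x[:k] and x[k:], standardise it under the null
    of no change, and take D_n = max_k D_{k,n}.  A change point is
    flagged at tau_hat = argmax_k D_{k,n} if D_n exceeds h_n, which is
    supplied by the user here rather than derived from the null
    distribution of D_n.
    """
    n = len(x)
    d = np.full(n, -np.inf)
    for k in range(min_seg, n - min_seg):
        u = mannwhitneyu(x[:k], x[k:], alternative="two-sided").statistic
        mu = k * (n - k) / 2.0                          # E[U] under the null
        sigma = np.sqrt(k * (n - k) * (n + 1) / 12.0)   # sd of U under the null
        d[k] = abs(u - mu) / sigma                      # standardised statistic D_{k,n}
    d_n = d.max()
    tau_hat = int(d.argmax())
    return (tau_hat, d_n) if d_n > h_n else (None, d_n)

# Example: a shift in mean after observation 120.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0, 1, 120), rng.normal(2, 1, 80)])
print(phase1_change_point(x, h_n=3.0))
```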
To choose this threshold, one first specifies an acceptable level α for the proportion of false positives, that is, the probability of falsely declaring that a change has occurred if in fact no change has occurred. Then, h_n should be chosen as the upper α quantile of the distribution of D_n under the null hypothesis. For the difficult details of computation of this distribution, see [21]. Computation can often be made easier by appropriate choice and storage of the D_{k,n}.

In the second method, Phase II (sequential) detection, the sequence (x_t)_{t≥1} does not have a fixed length. New observations are received over time, and multiple change points may be present. Assuming no change point exists so far, this approach treats x_1, ..., x_t as a fixed-length sequence and computes D_t as in Phase I. A change is then flagged if D_t > h_t for some appropriately chosen threshold. If no change is detected, the next observation x_{t+1} is brought into the sequence. If a change is detected, the process restarts from the following observation in the sequence. The procedure therefore consists of a repeated sequence of hypothesis tests. In this sequential setting, h_t is chosen so that the probability of incurring a Type 1 error is constant over time, so that under the null hypothesis of no change, the following holds:

P(D_t > h_t | D_1 ≤ h_1, . . . , D_{t−1} ≤ h_{t−1}) = α, for all t.

In this case, assuming that no change occurs, the average number of observations received before a false positive detection occurs is equal to 1/α. This quantity is referred to as the average run length, or ARL0. Once again, there are computational difficulties with this conditional distribution and the appropriate values of h_t, detailed in Ross [21].

A metric space is a pair (X, d) where X is a set and d : X × X → R satisfies the following axioms for all x, y, z ∈ X:
1. d(x, y) ≥ 0, with d(x, y) = 0 if and only if x = y;
2. d(x, y) = d(y, x);
3. d(x, z) ≤ d(x, y) + d(y, z).
Given a subset S ⊂ X and a point x ∈ X we can define the minimal distance from x to S as d(x, S) = inf_{s∈S} d(x, s). As an aside, d(x, S) ≥ 0 with equality if and only if x lies in the closure of S. Let S, T ⊂ X. There is a common notion of distance between these subsets, given by d_min(S, T) = inf_{s∈S, t∈T} d(s, t). Note d_min(S, T) = 0 if S and T intersect, so this is not an effective metric between subsets. The Hausdorff distance considers how separated S and T are at most, rather than at least. It is defined by

d_H(S, T) = max { sup_{s∈S} d(s, T), sup_{t∈T} d(t, S) }.

That is, it is the sup, or L^∞ norm, of all minimal distances from points s ∈ S to T and points t ∈ T to S. The Hausdorff distance satisfies the triangle inequality. This supremum is highly sensitive to even a single outlier. We propose using the L^p norms instead. Henceforth, S and T will be finite sets of real numbers, which will be the sets of change points.

The first modified Hausdorff distance is presented in Deza [31] and Dubuisson [32]:

d^1_MH(S, T) = max { (1/|S|) Σ_{s∈S} d(s, T), (1/|T|) Σ_{t∈T} d(t, S) }.

As is the case with most modified Hausdorff metrics, the primary application to date has been in computer vision tasks, where semi-metrics and metrics focused on geometric averaging provide a more robust distance measure in comparison to metrics like the Hausdorff distance. Eiter [42] and Dubuisson [32] present the d^2_MH distance. This measure captures the total distance between one set and another: removing the sup operator from outside the bracket gives essentially a measure of total deviation between all points of the two sets, which will also be more robust to outliers than the Hausdorff metric. Deza [31] and Dubuisson [32] propose a variant of d^1_MH, denoted d^3_MH, where there is a different averaging component. This measure is referred to as the geometric mean error between two images.

Proposition 2.1. All three distances above are semi-metrics. They satisfy properties 1. and 2. of a metric but do not satisfy 3.
Moreover, they fail the triangle inequality up to any constant. That is, there is no constant k such that d(S, R) ≤ k(d(S, T) + d(T, R)) for all subsets S, T, R.

Proof. All three distances are symmetric in S, T by definition, so 2. holds. All three distances are sums of non-negative real numbers, so clearly non-negative. For d^2_MH, the ratio d(S, R)/(d(S, T) + d(T, R)) can be made arbitrarily large, and so there is no universal k such that a modified triangle inequality holds. Now turn to MH_1 and MH_3. Both of these contain averaging terms, so the triangle inequality can be violated by "bunching" of elements. For example, take S = {a} and R = {a, b} with d(a, b) = d, and let T = {b_1, . . . , b_{p−1}, b}, where a, b_1, . . . , b_{p−1} are all within ε of each other. Note we choose p this time so that T has p elements, with p − 1 bunched together. Then d^1_MH(S, T) is of order d/p and d^1_MH(T, R) is of order ε, yet d^1_MH(S, R) = d/2. If p is large and ε is small relative to d, we see there is no universal k such that a modified triangle inequality holds.

The Wasserstein metric is commonly used as a distance measure between two probability distributions. Intuitively, it gives the work (in the sense of physics) required to mould one probability distribution into another. The Wasserstein metric may be applied as a distance measure between two sets S, T and is written as

W_p(S, T) = ( inf_µ ∫ d(x, y)^p dµ(x, y) )^{1/p},

where the inf is taken over all joint probability distributions µ with marginals given by S and T. The implicit assumption when using the Wasserstein metric between finite sets is that each change point set is a (discrete sum of) Dirac delta function(s), and we are computing the distances between the respective probability distributions. In subsequent sections, when we experiment with the Wasserstein metric, we set p = 1.

In [32], Dubuisson and Jain state their view that their measure MH_1 is the best for image matching. To reach this conclusion, they take two steps. First, they compare three favourable operators f_2, f_3, f_4 (page 567) and briefly argue that f_2 (equivalent to taking the max in the MH_1) is preferable to other operators, citing a 'larger spread'. Secondly, they argue (page 568) that a process of averaging distances is superior to taking the Kth ranked distances such as the median. We modify both these steps of reasoning. For the first, we replace the max in their MH_1 with the L^1 norm average of all the minimum distances from S to T and T to S:

d^1_MJ(S, T) = (1/2) ( (1/|S|) Σ_{s∈S} d(s, T) + (1/|T|) Σ_{t∈T} d(t, S) ).

Since d^1_MJ ≤ d^1_MH ≤ 2 d^1_MJ, these are equivalent as semi-metrics. However, d^1_MJ is more precise than d^1_MH; it varies as S or T changes individually, whereas the max of the MH_1 may often not vary. For example, suppose s_0 ∈ S is the closest point to every element in T, and assume the max in d^1_MH(S, T) is attained by the term (1/|T|) Σ_{t∈T} d(t, S). Then, by deforming S in such a way that s_0 remains the closest point to every element of T, the measure d^1_MH(S, T) will not vary. It is also preferable to the average used in the MH_3, as duplicating points has a greater effect on that than on the MJ_1. For example, if T is a general set and S is a single point, duplicating the point in S will change the MH_3 but not the d^1_MJ above. With the second step, we agree that an averaging process is much less sensitive to outlier error than the alternative processes. However, we may generalize this process by using other L^p norm averages. And so we present generalized semi-metrics:

d^p_MJ(S, T) = ( (Σ_{t∈T} d(t, S)^p) / (2|T|) + (Σ_{s∈S} d(s, T)^p) / (2|S|) )^{1/p}.

This normalisation is chosen so that d^p_MJ(S, T) → d_H(S, T) as p → ∞. After all, d_H can now be viewed as the L^∞ norm of these distances. Thus, our family of semi-metrics includes the Hausdorff distance as its p = ∞ element, placing the existing Hausdorff metric in a new family of semi-metrics.

Remark 2.2. Usually in the context of L^p norms, p must be in the range p ≥ 1 to preserve the triangle inequality. Since these measures do not preserve the triangle inequality, we can take p > 0.
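As a concrete illustration of this family, the following NumPy sketch (our own, with arbitrary example sets) implements d(x, S), the Hausdorff distance, and d^p_MJ as defined above, and compares them with scipy.stats.wasserstein_distance, which treats each finite set as a uniform empirical distribution.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def min_dists(a, b):
    """d(a_i, B) for each a_i in A: distance from each point of A to the set B."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return np.abs(a[:, None] - b[None, :]).min(axis=1)

def hausdorff(s, t):
    """Classical Hausdorff distance: the largest of all minimal distances."""
    return max(min_dists(s, t).max(), min_dists(t, s).max())

def mj_distance(s, t, p=1.0):
    """MJ_p semi-metric: an L^p average of the minimal distances, as defined above."""
    ds, dt = min_dists(s, t), min_dists(t, s)
    return ((ds ** p).mean() / 2 + (dt ** p).mean() / 2) ** (1.0 / p)

# Two sets of change point locations; T carries one outlier at 400.
S = [50, 120, 200, 260]
T = [48, 118, 204, 258, 400]

print(hausdorff(S, T))             # dominated by the outlier
print(mj_distance(S, T, p=1))      # much less sensitive to the outlier
print(mj_distance(S, T, p=20))     # approaches the Hausdorff value as p grows
print(wasserstein_distance(S, T))  # 1-Wasserstein between the empirical distributions
```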
The fact that we may take any p > 0 means that p = 1/2, for example, is even less sensitive to outliers than the MH_1 and d^1_MJ measures. As p grows larger, d^p_MJ approaches the Hausdorff metric, which satisfies the triangle inequality. As p grows smaller, these distances are less sensitive to outliers. Thus, this continuum of p allows us to compromise between the triangle inequality property of the metric and the sensitivity to outliers. If we were guaranteed that all elements d(t, S) and d(s, T) were non-zero, we could also consider p ≤ 0. For p = 0 the above norm is properly interpreted as a limit, which equals the geometric mean of the minimal distances. These d^p_MJ distances are again semi-metrics that fail the triangle inequality up to any constant; the proof is identical to that of Proposition 2.1.

To generate our distance matrix D, we compute the distance between all sets of time series change points within our collection of time series. Suppose we have n time series with sets of change points S_1, . . . , S_n. The Hausdorff distance matrix is given by (D_H)_{i,j} = d_H(S_i, S_j), the Wasserstein distance matrix by (D_W)_{i,j} = W_1(S_i, S_j), and the MJ_p distance matrix by (D_p)_{i,j} = d^p_MJ(S_i, S_j).

Like the modified Hausdorff distances, the novel proposed distance measures MJ_p do not satisfy the triangle inequality in generality. This is significant because it is possible that sets S, T and T, R are each close with respect to these measures, but S, R are not close. Then the property of closeness would not be transitive. However, in practice, the distance measures respect transitivity quite well, at least for p ≥ 1. We examine two questions: how frequently is the triangle inequality violated, and how severe are the violations when they do occur. To explore these questions, we empirically generate a three-dimensional array and test whether the triangle inequality is satisfied for all possible combinations of elements within the distance matrix. We construct this array as follows: for every triple (i, j, k) we record the ratio D_{i,j} / (D_{i,k} + D_{k,j}); a triple fails the triangle inequality if this ratio exceeds 1, and the size of the ratio measures the severity of the failure.

Analysing the eigenvalues and eigenvectors of a system in the physical and applied sciences is of great importance. Matrix diagonalisation and understanding a matrix's eigenspectrum arise in many applications such as stability analysis and oscillations of vibrating systems. In our context, we analyse the distance matrices D, all of which are symmetric real matrices with trace 0. As such, they can be diagonalised over the real numbers with real eigenvalues. To determine similarity of time series with respect to their change points, we plot the absolute values of the matrix's eigenvalues. Note all eigenvalues are real and sum to zero.

Consider the following real world heuristic: many real time series, such as stock returns, are not necessarily highly correlated on a regular basis. The returns of Microsoft and Ford may have little to do with each other over time. However, it is expected that a significant market event or crash would significantly affect both Microsoft and Ford at essentially the same time, and yield a change point in the stochastic properties of both time series at the same time. Thus, even if the overall properties of time series may be uncorrelated or negatively correlated, change points are likely to cluster. It would be of considerable interest if a third time series, say the returns of a new green energy company, had change points different from the majority. Perhaps this third time series would then be concluded to be immune from the market crash. In our analysis of time series we expect a majority, say 70%, of time series to follow similar change points, and from these we will be able to examine the exceptional ones for any opportunities. Mathematically, if 70 out of 100 time series have very similar change points, then the distance matrix D should have the following structure:
Rows r_1, . . . , r_70 are highly similar to one another, and the entries D_{i,j} with i, j ≤ 70 are close to zero. By inspection, this means that this matrix is a small deformation of a rank 31 matrix. That is, if 70 of 100 time series have very similar change points, then 69 of the eigenvalues should be close to zero. Given a threshold ε, we can rank the absolute values of the eigenvalues |λ_1|, ..., |λ_n|. If |λ_1|, ..., |λ_k| < ε then we can deduce that k + 1 of the time series are similar with respect to their structural breaks. This may be the most pithy way of expressing the number of time series that are similar in terms of their change points within large collections of time series.

Spectral clustering applies a graph theoretic interpretation of our problem, and projects our data into a lower dimensional space, the eigenvector domain, where it may be more easily separated by standard algorithms such as K-means. We transform our distance matrix D into an affinity matrix A of pairwise similarities. The graph Laplacian matrix is then given by L = E − A, where E is the diagonal degree matrix with E_{ii} = Σ_j A_{i,j}. The graph Laplacian matrix possesses four key properties: for every vector f ∈ R^n, f^T L f = (1/2) Σ_{i,j} A_{i,j} (f_i − f_j)^2; it is symmetric and positive semi-definite; its smallest eigenvalue is 0, with the constant vector as a corresponding eigenvector; and it has n non-negative, real-valued eigenvalues. We compute the first k eigenvectors f_1, f_2, ..., f_k of L and construct the matrix F ∈ R^{n×k} from f_1, f_2, ..., f_k. Finally, we treat each row of F as a vertex in the low dimensional projection of our graph and cluster these vertices using the standard k-means algorithm. Following Ng et al., K is chosen to be the value that maximizes the eigengap ∆_k = |λ_k − λ_{k−1}|.

A dendrogram displays the hierarchical relationships between objects in a dataset. Hierarchical clustering falls into two categories:
1. Agglomerative clustering - a bottom-up approach where all data points start as individual clusters; or
2. Divisive clustering - a top-down approach where all data points start in the same cluster and are recursively split.
The dendrogram and hierarchical clustering results are highly dependent on the distance measure used to determine clusters. We display the respective dendrograms of our candidate distance matrices and assess which method displays similarity between time series most appropriately for our change point problem. The colours of the dendrogram correspond to the closeness of any two sets of time series change points.

We generate three collections of ten time series. The first collection exhibits very few change point outliers, the second exhibits a moderate number of less severe change point outliers and the final collection of time series exhibits multiple extreme change point outliers. The Hausdorff, three modified Hausdorff varieties, Wasserstein, MJ_0.5, MJ_1 and MJ_2 distances are compared between the time series.

Table 1: Spectral clustering cluster assignments (TS1-TS10) under each distance measure - no outliers.

Figure 1 displays the ten time series of candidate change points. When assessing if two time series are similar with respect to their change points, we are interested in both the location and number of change points. In this example, one should consider the first five time series (1-5 incl.) as similar, the next three (6-8 incl.) as similar and the final two time series (9 and 10) as dissimilar to all other time series. Although this collection of time series exhibits far fewer outliers than time series from real phenomena, it is instructive to measure how various distance measures perform without the presence of outliers. Interpreting similarity among large collections of time series' change points may be a difficult task.
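The sketch below illustrates how the three analyses described above can be run on a candidate distance matrix D (our own illustration, not the authors' code): sorted absolute eigenvalues, an average-linkage hierarchical clustering for the dendrogram, and the spectral clustering step. The Gaussian-kernel affinity and the unnormalised Laplacian L = E − A are assumed choices on our part, and K is selected via the eigengap heuristic.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform
from sklearn.cluster import KMeans

def analyse_distance_matrix(D, sigma=None, k_max=6):
    """Eigenvalue, hierarchical and spectral analysis of a distance matrix D."""
    D = np.asarray(D, float)

    # (i) Absolute eigenvalues of the symmetric, zero-trace matrix D, sorted ascending.
    eigvals = np.sort(np.abs(np.linalg.eigvalsh(D)))

    # (ii) Agglomerative (average-linkage) hierarchical clustering on D,
    # from which a dendrogram can be drawn.
    Z = linkage(squareform(D, checks=False), method="average")

    # (iii) Spectral clustering: Gaussian-kernel affinity (an assumed choice),
    # unnormalised graph Laplacian L = E - A, eigengap heuristic for K,
    # then k-means on the rows of the first K eigenvectors.
    sigma = sigma if sigma is not None else np.median(D[D > 0])
    A = np.exp(-(D ** 2) / (2 * sigma ** 2))
    np.fill_diagonal(A, 0.0)
    E = np.diag(A.sum(axis=1))          # diagonal degree matrix
    L = E - A
    lam, vecs = np.linalg.eigh(L)       # eigenvalues of L in ascending order
    gaps = np.diff(lam[: k_max + 1])
    K = int(np.argmax(gaps)) + 1        # eigengap heuristic for the number of clusters
    F = vecs[:, :K]                     # first K eigenvectors as columns
    labels = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(F)

    return eigvals, Z, labels

# Example with three change point sets: the first two similar, the third an outlier.
D = np.array([[0.0, 1.0, 8.0],
              [1.0, 0.0, 7.5],
              [8.0, 7.5, 0.0]])
eigvals, Z, labels = analyse_distance_matrix(D)
```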
Accordingly, we make inference using all three of our proposed methods of analysing a candidate distance matrix. Perhaps the most concise and expressive display of general similarity or dissimilarity within any such collection is to plot the absolute value of the eigenvalues of the distance matrices. Figure 2 plots the absolute value of the eigenvalues for all comparative distance measures. All distance measures appear to indicate that there are five time series that are highly similar, three that are slightly less similar and two far more dissimilar to the rest of the collection. There are no change point outliers in this scenario, and there is an average spacing of ∼ 35 units between change points; this simulates realistic outputs from a change point detection algorithm, which generally requires a minimum number of data points within locally stationary segments. Of our eight distance measures, six cluster the time series correctly, as seen in Table 1: Hausdorff, MH_1, MH_3, MJ_0.5, MJ_1 and MJ_2. Both the Wasserstein and MH_2 distances fail to determine appropriate clusters within the spectral clustering. The dendrograms in Figure 3 must be analysed carefully. All distance measures indicate that there is a cluster of five change point sets that are similar, another cluster of three, and two unrelated change point sets. However, the spectral clustering highlights that the Wasserstein and MH_2 distance measures are incorrectly identifying which time series should be considered similar.

We also analyse the transitivity, described in section 2.4.1, over all sets of change points for the six semi-metrics within our analysis. The MJ_0.5 fails most significantly, with 51% of potential triples failing, and an average fail ratio of 1.28. Both the MJ_1 and MJ_2 distances fail the triangle inequality in this example, with 4% of triples failing for the MJ_1 distance and 3% of triples failing for the MJ_2 distance. Of those triples that fail the triangle inequality, the average fail ratio is 1.32 and 1.14 for the MJ_1 and MJ_2 distances respectively. As expected, as p increases, transitivity seems to improve. All of the modified Hausdorff distances fail the triangle inequality too. 8.5% of MH_1 triples fail the triangle inequality, with an average fail ratio of 1.06. The MH_2 has a lower percentage of failed triples than the MH_1, with only 4% failing; however, those that do fail perform significantly worse, with an average score of 1.49. The MH_3 also has 4% of triples fail, with a less severe average fail ratio of 1.41. So in this scenario the MJ_0.5 has the most failed triples by a significant margin; however, the MH_2 has the triple ratios that violate the triangle inequality most severely.

Figure 5 shows ten simulated time series change points that exhibit outliers with moderate frequency and severity. This is a more realistic scenario than Figure 1, as outliers occur regularly when applying change point detection algorithms to multiple time series. Again, time series 1-5, 6-8 and 9-10 should be identified as separate clusters. Figure 6 displays the increasing absolute value of the eigenvalues. In this example, we see that the Hausdorff distance in Figure 6a has detected eight time series that are highly similar, and two that are dissimilar to the remainder of the change point sets.
Other distance measures have identified the general similarity more appropriately.

Table 2: Spectral clustering distance matrices - moderate outliers
Metric       TS1 TS2 TS3 TS4 TS5 TS6 TS7 TS8 TS9 TS10
Hausdorff     3   1   1   3   3   1   1   1   4   2
MH 1          2   2   2   2   2   3   3   3   1   4
MH 2          1   1   1   1   1   2   2   2   3   4
MH 3          3   3   3   3   3   1   1   1   2   4
Wasserstein   4   1   3   1   2   1   1   1   1

The MH_1, MH_3, MJ_0.5, MJ_1 and MJ_2 in particular appear to be producing sensible outputs. The dendrograms displayed in Figure 7 illustrate the Hausdorff distance's sensitivity to outliers. The remaining distance measures correctly identify the general structure in the time series collection. That is, there are two separate clusters of highly similar time series and two unrelated time series. The spectral clustering results in Table 2 indicate that six of the distance measures correctly identified similar groupings of time series, namely MH_1, MH_2, MH_3, MJ_0.5, MJ_1 and MJ_2. Once again, the Wasserstein distance, although producing eigenvalue and dendrogram outputs consistent with the successful distance measures, proposed inappropriate collections of time series within candidate clusters.

In the presence of moderate outliers, the MJ_0.5 violates the triangle inequality worst, both in terms of the percentage of failed triples (58%) and the average fail ratio (1.54). Both the MJ_1 and MJ_2 distances fail the triangle inequality (Figure 8) too. 10% of potential triples did not satisfy the triangle inequality when using the MJ_1 distance and 5.6% of potential triples did not satisfy the triangle inequality when using the MJ_2 distance. The average fail ratio for the MJ_1 distance is 1.13 and 1.08 for the MJ_2 distance. For the modified Hausdorff distances, the MH_1 had the worst failed triple ratio (1.19) with 5.6% of potential triples violating the triangle inequality. 8.4% of MH_2 triples failed the triangle inequality, with an average fail ratio of 1.14, while 10.4% of MH_3 triples failed with an average fail ratio of 1.11. In this scenario the MJ_0.5 is the only semi-metric whose average fail ratio and percentage of failed triples may make it unusable. This may however depend on the context of usage.

Figure 9 displays the third collection of time series' change points, where we analyse the effects of extreme outliers on candidate distance measures. The presence of outliers has certainly impacted most of the distance measures when analysing the magnitude of the eigenvalues (seen in Figure 10). The Hausdorff distance in particular (Figure 10a) fails to identify the similarity in the first five time series (1-5). Interestingly, the MH_1 does not display the appropriate degree of similarity in the first five time series, while the MH_2 and MH_3 measures produce plots that indicate similarity among the change point sets more appropriately. The Wasserstein distance in Figure 10e and the MJ_1 in Figure 10g both produce outputs consistent with the phenomenology of the collection.

Table 3: Spectral clustering results - extreme outliers
Metric       TS1 TS2 TS3 TS4 TS5 TS6 TS7 TS8 TS9 TS10
Hausdorff     1   1   2   2   2   2   2   2   3   4

The dendrograms displayed in Figure 11 indicate that the Hausdorff distance performs worst. Its heat map fails to identify appropriate similarity among the time series. The MH_1, MH_2, MH_3, MJ_0.5 and MJ_1 distances produce heat maps that are most representative of the true similarity in the set.
Interestingly, the MJ_2 (Figure 11h) distance does not perform as well as the MJ_1 or MJ_0.5 (Figure 11g) in this scenario, perhaps due to the higher order of p. That is, the lower orders of p provide stronger geometric mean averaging. In particular, the MJ_2 distance has difficulty distinguishing between clusters 2, 3 and 4, which should contain time series 6-8, 9 and 10 respectively. The spectral clustering highlights that the MH_1, MH_2, MH_3, MJ_0.5, MJ_1 and MJ_2 correctly identify clusters of similar time series. Again, both the Hausdorff and Wasserstein metrics do not perform well. The Hausdorff metric in particular has severe sensitivity to outliers.

The transitivity analysis under this (contrived) scenario of extreme outliers demonstrates that the MJ_0.5 may be unusable, with 47.7% of triples failing the triangle inequality and an average fail ratio of 8.63. The MJ_1 and MJ_2 distances have a sizeable proportion of candidate triples that fail the triangle inequality, but significantly fewer than the MJ_0.5. In the case of the MJ_1 distance, 14.2% of candidate triples fail the triangle inequality, with failed triples yielding an average score of 1.86. In the case of the MJ_2 distance, 11% of candidate triples failed the triangle inequality with an average score of 1.45. 14.4% of MH_1 triples fail the triangle inequality with an average ratio of 2.18. 15% of MH_2 triples fail the triangle inequality with an average score of 1.85, and 15% of MH_3 triples also fail the triangle inequality with an average fail ratio of 1.86. The MJ_1, MJ_2 and modified Hausdorff semi-metrics all perform similarly in terms of the proportion of failed triples and their severity in all three scenarios. The MJ_0.5 is clearly the worst violator of the triangle inequality.

We have noted above various cases where the MH_1, MH_2 or MH_3 exhibit errors in clustering. With this in mind, recalling section 2.3 and [32], we again differ with Dubuisson and Jain's conclusion that MH_1 is the best for image matching. Both the MJ_1 and MJ_2 have advantages over MH_1. Given that there is a clear trade-off between the various distance measures' transitivity and robust performance in the presence of outliers, one potential avenue to be explored would be optimising the order p in the L^p norm. That is, p should be large enough to satisfy the metric property, yet small enough to allow for geometric averaging. We find that when p gets beyond 2 or 3, the geometric averaging property is lost, and the measure becomes sensitive to outliers. Alternatively, one may insist that only an acceptable percentage of measurements within the time series collection violate the triangle inequality. That is, we could insist that the proportion of failed triples within the matrix is at most ε, where ε is some acceptable percentage. We set ε = 0.05. Figure 13a demonstrates that once p reaches 7, less than 5% of triples fail the triangle inequality. However, the dendrogram in Figure 13 demonstrates that the MJ_7 distance no longer provides the geometric averaging required to produce robust measurements. In fact, inference gained with the MJ_7 distance would be entirely erroneous. When considered in conjunction with the results from other simulated experiments, our findings suggest that p needs to be low for powerful geometric averaging and robustness to outliers.

The cryptocurrency market is in its relative infancy in comparison to most exchange traded financial products.
Cryptocurrencies are infamous for their volatile price behaviour and high degree of correlation within the market, due to crowd behaviour often referred to as "herding". We analyse the similarity in the change points of the thirty largest cryptocurrencies by market capitalisation. We apply the Mann-Whitney test to the log returns of each cryptocurrency and use the MJ_1 distance as our measure of choice. Note that the log returns provide normally distributed data with mean zero. Our method provides the following insights:
1. The cryptocurrency market is characterised by a high degree of similarity between time series. This is seen in the plot of the eigenvalues (Figure 14b), which confirms that ∼ 24 cryptocurrencies behave very similarly. The dendrogram in Figure 14a confirms this, with a large proportion of the market exhibiting a small distance between change points.
2. Our method highlights the presence of anomalous cryptocurrencies, and indicates which cryptocurrencies are behaving dissimilarly to the rest of the market. In particular, Figure 14a shows that XMR and DCR behave very differently to the rest of the cryptocurrency market. They do, however, have a high degree of similarity between themselves.
3. Spectral clustering on the distance matrix confirms the presence of anomalous cryptocurrencies warranting their own cluster.
4. Finally, the dendrogram in Figure 14a highlights clusters of cryptocurrencies that behave similarly to one another, and less similarly to the rest of the market. This is often the case in financial markets, where companies in similar sectors or geographic regions may become correlated due to (possibly) multiple exogenous variables.
To summarize, we have uncovered a high degree of correlation within the cryptocurrency market, and unearthed anomalies, including clusters thereof. Our analysis gives insight into anomalous and similar behaviours with respect to structural breaks, a feature of time series of key interest to analysts and market participants.

Applying our method to measles counts in 19th century UK cities, one can determine the similarity between structural changes in time series and perform anomaly detection simultaneously. First, we apply the Kolmogorov-Smirnov test to detect change points in each city's counts. Analysis of the UK measles data yields the following insights:
1. Figure 16b indicates that there are 5 time series that are similar and two relative outliers.
2. Figure 16a highlights that there are 4 time series that are similar with respect to their structural breaks, 2 moderately similar and 1 anomalous city.
The time series deemed to be most similar are: Sheffield, London, Manchester and Newcastle. This insight is consistent with visual inspection of the time series in Figure 16, where these time series exhibit multiple periodicities that are detected via the change point algorithm. Although our algorithm does not explicitly measure the periodic nature of each time series, our method manages to capture the most pertinent latent component in this collection of time series.

Prior work indicates that semi-metrics adapting L^p distance measures have mostly been used in computer vision applications. Although prior work has shown that the triangle inequality fails for these measurements, we are unaware of any work examining theoretically or empirically how badly the triangle inequality fails, and under what conditions.
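A minimal sketch of such an empirical check (our own illustration, mirroring the fail-ratio statistics reported in our experiments): for every ordered triple (i, j, k) of change point sets it tests whether D[i, j] ≤ D[i, k] + D[k, j], and reports the proportion of failing triples together with their average fail ratio D[i, j] / (D[i, k] + D[k, j]).

```python
import numpy as np
from itertools import permutations

def triangle_inequality_report(D, eps=1e-12):
    """Empirical transitivity check for a (semi-)metric distance matrix D.

    Returns the fraction of ordered triples (i, j, k) violating
    D[i, j] <= D[i, k] + D[k, j], and the mean ratio
    D[i, j] / (D[i, k] + D[k, j]) over the violating triples.
    """
    D = np.asarray(D, float)
    n = D.shape[0]
    ratios = []
    for i, j, k in permutations(range(n), 3):
        denom = D[i, k] + D[k, j]
        if denom > eps and D[i, j] > denom:
            ratios.append(D[i, j] / denom)
    total = n * (n - 1) * (n - 2)
    frac_failed = len(ratios) / total if total else 0.0
    avg_fail_ratio = float(np.mean(ratios)) if ratios else 0.0
    return frac_failed, avg_fail_ratio

# Example with an obviously non-metric matrix.
D = np.array([[0.0, 1.0, 10.0],
              [1.0, 0.0, 1.0],
              [10.0, 1.0, 0.0]])
print(triangle_inequality_report(D))   # D[0,2]=10 > D[0,1]+D[1,2]=2, so some triples fail
```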
Our experiments on simulated and real data indicate that when there are more outliers within a collection of time series, and when there is a smaller average distance between successive change points, there is generally a higher percentage of triples which fail the triangle inequality. This is reflected in Proposition 2.1, which shows that the bunching of points can cause the triangle inequality to fail up to any arbitrary constant. While the Hausdorff metric satisfies the triangle inequality, it is extremely sensitive to outliers, as is confirmed in prior work among varying applications. This presents an interesting trade-off. Semi-metrics such as the ones investigated in this paper may lose universal transitivity; however, different averaging methods (other than the max operation) produce more appropriate distance measurements between time series.

After applying a distance measure between sets of change points, we demonstrate the insights one can generate. First, we show a pithy means of eigenvalue analysis where we plot the absolute value of the ordered eigenvalues. This analysis illustrates the number of time series that are similar within any candidate scenario. We demonstrate the dendrogram's ability to simultaneously uncover similarity between time series and perform anomaly detection. Finally, we use spectral clustering on a transformation of the distance matrix (the affinity matrix) and illustrate how spectral clustering may determine groups of similar time series and detect clusters of anomalous time series. Our results on simulated data show that semi-metrics perform as well as or better than traditional metrics such as the Hausdorff and Wasserstein distances in all settings (no outliers, moderate outliers and extreme outliers). Our two applications (measles and cryptocurrency) demonstrate that our method may detect similar time series while also identifying anomalous phenomena.

In determining an appropriate distance measure between sets of change points, we have introduced a new computationally useful continuum of semi-metrics which we term MJ_p distances. As p → ∞, MJ_p distances approach the Hausdorff distance. Thus, we have understood the Hausdorff distance in a new context within a family of MJ_p distances. Eigenvalue analysis, hierarchical clustering and spectral clustering can be used to analyse matrices of distances between sets of change points. Our analysis indicates that there is a trade-off between transitivity and sufficient geometric mean averaging (robust results in the presence of outliers). Traditional metrics such as the Hausdorff and Wasserstein distances are severely impacted by outliers, while semi-metrics with strong averaging properties such as the MJ_0.5 fail the triangle inequality frequently, and in some cases, severely. Our experiments indicate that both the MJ_1 and MJ_2 distances are better compromises than the existing modified Hausdorff framework. In future work, we will aim to investigate and generalise other distance metrics outside the Hausdorff framework.
[1] Distances between sets - a survey
[2] Measuring distance between unordered sets of different sizes
[3] On spectral clustering: analysis and an algorithm
[4] Distances between fuzzy sets
[5] Errors in binary images and an L^p version of the Hausdorff metric
[6] Computing some distance functions between polygons
[7] A new algorithm for computing the minimum Hausdorff distance between two point sets on a line under translation
[8] Comparing face images using the modified Hausdorff distance
[9] On a metric distance between fuzzy sets
[10] A modified Hausdorff distance between fuzzy sets
[11] Spectral partitioning: the more eigenvectors, the better
[12] A new Hausdorff distance for image matching
[13] Computing the minimum Hausdorff distance for point sets under translation
[14] Statistical process control for shifts in mean or variance using a changepoint formulation
[15] A nonparametric change-point control chart
[16] On dynamic Voronoi diagrams and the minimum Hausdorff distance for point sets under Euclidean motion in the plane
[17] Classification with nonmetric distances: image retrieval and class representation
[18] A survey: spectral clustering applications and its enhancements
[19] Adaptive filtering and change detection
[20] Sequential change detection in the presence of unknown parameters
[21] Parametric and nonparametric sequential change detection in R: the cpm package
[22] Nonparametric monitoring of data streams for changes in location and scale
[23] Sequential monitoring of a Bernoulli sequence: when the pre-change parameter is unknown
[24] A change-point model for a shift in variance
[25] Perceptual organization for artificial vision systems. Contour and texture analysis for image segmentation
[26] Fractals and self-similarity
[27] Note on Hausdorff-like metrics for fuzzy sets
[28] On Hausdorff-like metrics for fuzzy sets
[29] Precise Hausdorff distance computation between polygonal meshes
[30] A linear time algorithm for the Hausdorff distance between convex polygons
[31] Encyclopedia of distances
[32] A modified Hausdorff distance for object matching
[33] MESH: measuring errors between surfaces using the Hausdorff distance
[34] Metrics based on average distance between sets
[35] Robust face detection using the Hausdorff distance
[36] On the nonexistence of Hausdorff-like metrics for fuzzy sets
[37] Measuring the distance between time series
[38] An image algorithm for computing the Hausdorff distance efficiently in linear time
[39] Two nonparametric control charts for detecting arbitrary distribution changes
[40] Computing the minimum Hausdorff-distance between two point sets on a line under translation
[41] Object matching algorithms using robust Hausdorff distance measures
[42] Distance measures for point sets and their computation
[43] Locating objects using the Hausdorff distance
[44] Efficient visual recognition using Hausdorff distance
[45] Efficiently locating objects using the Hausdorff distance
[46] Segmentation using eigenvectors: a unifying view