title: A new measure between sets of probability distributions with applications to erratic financial behavior
authors: James, Nick; Menzies, Max
date: 2021-06-10
DOI: 10.1088/1742-5468/ac3d91

This paper introduces a new framework to quantify distance between finite sets with uncertainty present, where probability distributions determine the locations of individual elements. Combining this with a Bayesian change point detection algorithm, we produce a new measure of similarity between time series with respect to their structural breaks. First, we demonstrate the algorithm's effectiveness on a collection of piecewise autoregressive processes. Next, we apply this to financial data to study the erratic behavior profiles of 19 countries and 11 sectors over the past 20 years. Our measure provides quantitative evidence that there is greater collective similarity among sectors' erratic behavior profiles than those of countries, which we observe upon individual inspection of these time series. Our measure could be used as a new framework or complementary tool for investors seeking to make asset allocation decisions for financial portfolios.

In this paper, we marry two widely researched areas within statistics and the natural sciences: similarity and anomaly detection. Our primary means of detecting similarity that we build upon is the use of distance measures such as metrics; our primary means of anomaly detection is change point detection, incorporating a Bayesian perspective around change points. We introduce a new framework of sets with uncertainty (which are sets of probability distributions) and a new family of semi-metrics between them. Then, we apply this to measure the similarity between time series' sets of change points, as proxies for their erratic behavior profiles. Metric spaces appear throughout mathematics.
One particular subfield that has seen substantial recent activity is the study of metrics between sets, specifically subsets of an ambient metric space. The most utilized metric in this context is the Hausdorff metric, which we introduce and summarize in Section 2. Distances between sets have proven useful in many applications, including image detection and matching [1, 2, 3, 4, 5] , the study of fuzzy sets [6, 7, 8, 9, 10, 11] , and efficient computational geometry [12, 13, 14, 15, 16, 17, 18, 19, 20, 21] . The primary challenge in this area is that the Hausdorff and other metrics [22] are highly sensitive to outliers [23] , while alternative semi-metrics may not satisfy the triangle inequality property of a metric. Relevant definitions and recent advances will be reviewed in Section 2. In recent years, there has been substantial interest in the use of distance measures between time series [24] , for example to understand similarity in movement between financial assets [25, 26, 27] . More recent work has prioritized distances between time series based on certain critical points, requiring the use of similarity measures between finite sets. These critical points often carry particular importance, capturing information about the broader behavior of a time series. For example, [28] studied turning points in COVID-19 case time series, which may summarize undulating wave behavior and separate the data into different waves of the disease. Outside time series, metric learning has become a popular topic within the field of computational statistics and machine learning more broadly, where a distance function is optimally tuned for a candidate task. Various applications include computer vision [29, 30] , text analysis [31, 32] and program analysis [33] . A natural corollary of the use of distance measures between time series (and more broadly, any sort of data) is the detection of anomalies. 
Anomaly detection is a well-researched problem across many data sets and spaces, incorporating various techniques from statistics and machine learning [34, 35, 36, 37, 38]. Our paper follows a line of recent work that aims to detect entire time series, not just single data points, as anomalous. Several means to define anomalous time series may be used, including geometric measures, principal components, and shapelet transformations [39, 40, 41]. Our primary means of observing anomalous time series is agglomerative hierarchical clustering on our new distance measure. Change point detection is an important subfield of time series analysis. Developed by Hawkins et al. [42, 43], change point (or structural break) detection algorithms estimate the points in time at which the stochastic properties of a time series, such as its underlying mean or variance, change. Traditionally, most algorithms apply hypothesis testing frameworks [44, 45, 46, 47, 48, 49], which do not quantify the uncertainty surrounding the change points and typically require the assumption of independent data. In a financial setting, this assumption is inappropriate, as rich patterns in correlation structure have been observed in the literature [50, 51, 52, 53, 54, 55, 56]. Bayesian methodologies have been introduced to ameliorate both these issues, providing uncertainty intervals around candidate change points in a setting where time series may exhibit strong autocorrelation [57, 58]. Recent work combined hypothesis testing change point algorithms with semi-metrics between finite sets to produce new distance measures between time series [59], and applied this to measure similarity in erratic behavior [60]. However, that work did not consider the uncertainty inherently present in change point detection, and we are not aware of any existing work that computes distances between sets where uncertainty intervals are associated with individual elements of the sets.
Thus, we introduce a new framework of sets with uncertainty (which are certain sets of probability density functions), where probability distributions determine the uncertainty surrounding various elements' locations, and appropriate semi-metrics between them. Then, we combine this new family of semi-metrics with a Bayesian change point detection algorithm, which records the uncertainty around change points, and apply this to quantify distance between time series' sets of change points with uncertainty. This is presented in Section 3. We apply our procedure to a financial context in Section 4. Analyzing both countries and sectors, we reveal similarity and anomalies in the long-term dynamics of various indices with respect to change points in their behavior. We also demonstrate that our methodology may fit well with more conventional statistical or time series analysis. In our case, we complement our analysis with a closer examination of some of the most frequently observed change points across countries and sectors, associated with the global financial crisis (GFC) and COVID-19. We discuss insights and implications from both the new and existing methodologies we employ in Section 5. In this section, we review some properties of a metric space and the existing (semi)-metrics we draw upon, including the Wasserstein metric between probability distributions, the Hausdorff metric between sets, and recently introduced semi-metrics between finite sets.

Definition 2.1 (Metric space). Let X be a set and d : X × X → R≥0 a function. Suppose d satisfies the following properties for all x, y, z ∈ X:
(i) d(x, y) = 0 if and only if x = y;
(ii) d(x, y) = d(y, x);
(iii) d(x, z) ≤ d(x, y) + d(y, z).
Then we say d is a metric and the pair (X, d) is a metric space [61]. Alternatively, suppose d : X × X → R≥0 satisfies only conditions (i) and (ii). Then we say d is a semi-metric on X. Condition (iii) is known as the triangle inequality. A particularly important metric is the Wasserstein metric, which is used as a distance between two probability measures.
Definition 2.2 (Wasserstein metric). Let (X, d) be a complete and separable metric space. Suppose μ and ν are Borel probability measures on X, and q ≥ 1 [62]. Let Γ(μ, ν) be the set of all Borel probability measures on X × X with marginal distributions μ and ν respectively. The Wasserstein metric [62] is defined as follows:

W_q(μ, ν) = ( inf_{γ ∈ Γ(μ,ν)} ∫_{X×X} d(x, y)^q dγ(x, y) )^{1/q}. (1)

The Wasserstein distance is most commonly applied to Euclidean space X = R^m with d as the standard metric. On X = R, the computation of the Wasserstein metric simplifies considerably. If probability measures μ, ν on R have associated cumulative distribution functions (cdf's) F, G and quantile functions [63] F^{-1}, G^{-1} respectively, the Wasserstein metric can be computed [64] as

W_q(μ, ν) = ( ∫_0^1 |F^{-1}(t) − G^{-1}(t)|^q dt )^{1/q}. (2)

In this paper, we shall only compute the Wasserstein metric between discrete probability density functions valued on the real line, which always have associated cdf's and quantile functions. The other class of (semi-)metrics we will use are those between subsets of a given metric space. We begin with a preliminary definition. Let S be a subset of a metric space X, and x an element of X. Then the distance from the element to the set is defined as the minimal distance from x to S, computed as follows:

d(x, S) = inf_{s ∈ S} d(x, s). (3)

When the metric d is the Wasserstein metric W_q, we denote this minimal distance d_W, suppressing q from the notation.

Definition 2.3 (Hausdorff metric). Let (X, d) be a metric space. Suppose S and T are closed and bounded subsets of X. The Hausdorff metric [65] between S and T is defined as follows:

D_H(S, T) = max{ sup_{s ∈ S} d(s, T), sup_{t ∈ T} d(t, S) }. (4)

This is the supremum or L^∞ norm of all minimal distances from elements s ∈ S to T and t ∈ T to S. The Hausdorff metric is a metric on the set of all closed bounded subsets of X. The Hausdorff metric satisfies the triangle inequality, but is highly sensitive to even a single outlier. In [59], we introduced the following semi-metric, which replaces the L^∞ norm with an L^p average (p ≥ 1). It can only be defined for finite sets S, T. Definition 2.4.
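As a concrete illustration of the quantile-function formula (2), the following sketch computes W_q between two discrete distributions on the real line using only numpy; the function name and interface are ours, not from the paper.

```python
import numpy as np

def wasserstein_1d(u_values, u_weights, v_values, v_weights, q=1):
    """W_q between two discrete distributions on R, via the quantile
    functions F^{-1}, G^{-1}: integrate |F^{-1}(t) - G^{-1}(t)|^q over (0, 1)."""
    u_values = np.asarray(u_values, float)
    v_values = np.asarray(v_values, float)
    u_w = np.asarray(u_weights, float); u_w = u_w / u_w.sum()
    v_w = np.asarray(v_weights, float); v_w = v_w / v_w.sum()
    iu, iv = np.argsort(u_values), np.argsort(v_values)
    u_values, u_w = u_values[iu], u_w[iu]
    v_values, v_w = v_values[iv], v_w[iv]
    u_cdf, v_cdf = np.cumsum(u_w), np.cumsum(v_w)
    # Probability levels where either quantile function may jump.
    levels = np.unique(np.concatenate([u_cdf, v_cdf]))
    widths = np.diff(np.concatenate([[0.0], levels]))
    # Quantile at level t: smallest value whose CDF reaches t.
    uq = u_values[np.searchsorted(u_cdf, levels - 1e-12)]
    vq = v_values[np.searchsorted(v_cdf, levels - 1e-12)]
    return float(np.sum(widths * np.abs(uq - vq) ** q) ** (1.0 / q))
```

For two Dirac masses δ_0 and δ_3, this returns |0 − 3| = 3 for every q, matching the identity W_q(δ_a, δ_b) = |a − b| used later in Remark 3.3.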
Let (X, d) be a metric space, and p ≥ 1. Suppose S and T are finite subsets of X. The MJ_p distance [59] between S and T is defined as follows:

d_p^{MJ}(S, T) = ( (1/(2|T|)) Σ_{t ∈ T} d(t, S)^p + (1/(2|S|)) Σ_{s ∈ S} d(s, T)^p )^{1/p}. (6)

As p → ∞, this family of semi-metrics includes the Hausdorff metric as its limit element. In particular, changing p may provide a trade-off between reduced sensitivity to outliers (better when p is small) and greater satisfaction of the triangle inequality (better when p is large). Several theoretical and experimental properties of these semi-metrics are presented in [59]. The primary application of the semi-metrics in that work was between sets of time series' change points. Specifically, given a collection of real-valued time series (X_t^{(i)}), i = 1, ..., n, a change point detection algorithm was applied that produced sets of structural breaks S_i for each time series, i = 1, ..., n. Then, the distance between time series was defined as d_p^{MJ}(S_i, S_j). However, this work did not take into account any potential uncertainty in the change points. In this section, we introduce our framework of sets with uncertainty, define appropriate distances between such objects, and describe the primary application of this manuscript, namely quantifying similarity between sets of change points with uncertainty present.

Definition 3.1 (Set with uncertainty). A set with uncertainty S̃ = {f_1, ..., f_k} is a finite set of probability density functions on R whose intervals of support I_1, ..., I_k are pairwise disjoint. That is, a set with uncertainty allows the positions of each element to vary according to specified probability distributions. The disjointness of the intervals is analogous to the requirement that the elements of a regular set be distinct. We will use the notation S̃, T̃ for sets with uncertainty and S, T for regular sets. In the specific case when the probability densities are just delta functions supported at single points, that is, S̃ = {δ_{x_1}, ..., δ_{x_k}}, we can associate a regular set S = {x_1, ..., x_k}, and vice versa.

Definition 3.2 (MJ-Wasserstein semi-metric). Let S̃, T̃ be sets with uncertainty and p ≥ 1. The MJ-Wasserstein distance between S̃ and T̃ is defined as follows:

d_p^{MJW}(S̃, T̃) = ( (1/(2|T̃|)) Σ_{g ∈ T̃} d_W(g, S̃)^p + (1/(2|S̃|)) Σ_{f ∈ S̃} d_W(f, T̃)^p )^{1/p}. (7)

That is, for each distribution f_i, its closest distribution g_j in T̃ is found according to the Wasserstein metric, and this minimal distance is computed.
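For finite subsets of the real line, the MJ_p averaging of minimal distances can be sketched in a few lines of numpy. The normalization below (averaging the p-th powers of minimal distances over each set and halving each side) is our reading of the definition in [59] and should be checked against that reference; the function name is ours.

```python
import numpy as np

def mj_p(S, T, p=1):
    """MJ_p semi-metric between finite subsets S, T of R:
    averages p-th powers of minimal distances in both directions."""
    S, T = np.asarray(S, float), np.asarray(T, float)
    d_t_S = np.abs(T[:, None] - S[None, :]).min(axis=1)  # d(t, S) for each t in T
    d_s_T = np.abs(S[:, None] - T[None, :]).min(axis=1)  # d(s, T) for each s in S
    return float(((d_t_S ** p).mean() / 2 + (d_s_T ** p).mean() / 2) ** (1.0 / p))
```

Unlike the Hausdorff metric, a single outlier in S or T shifts only one term of the average, so its influence shrinks as the sets grow; this is the reduced outlier sensitivity for small p noted above.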
Essentially, this combines the MJ_p semi-metric on a general metric space with the Wasserstein metric on probability distributions.

Remark 3.3. In this remark, we briefly explain why the MJ-Wasserstein semi-metric can be viewed as a direct generalization of the MJ_p defined in (6). Indeed, let S̃, T̃ be two sets with uncertainty consisting only of Dirac delta functions. To these we can associate regular sets S_0, T_0 as discussed above. By (6), (7) and the fact that W_q(δ_a, δ_b) = |a − b|, it follows that d_p^{MJW}(S̃, T̃) = d_p^{MJ}(S_0, T_0). That is, the MJ-Wasserstein distance between S̃, T̃ reduces to the existing MJ_p distance between S_0, T_0.

Remark 3.4. This property is by no means the only reason we select the Wasserstein metric for our methodology. The well-known Kullback-Leibler divergence [66] is unsuitable in our context because it returns a value of infinity when two probability distributions have disjoint support [67], as will frequently be the case (where time series' change points are not located close to each other). Similarly, the Jensen-Shannon metric [68] always returns its greatest possible value of 1 (depending on normalization convention [69]) between two densities of disjoint support, so is usually uninformative in our application. On the other hand, the Wasserstein metric is informative regarding where the change points occur, taking uncertainty into account. This is crucial in our application, as we want to keep track of the positions in time when time series behaviors change, and which time series change at similar points in time. Finally, we select the Wasserstein over the related Radon metric, as the latter lacks desirable analytical properties, such as sequential compactness [70].

Proposition 3.5. For p > 0, the MJ-Wasserstein distance measures are semi-metrics. However, they fail the triangle inequality up to any constant. That is, there is no constant k such that d_p^{MJW}(S̃, R̃) ≤ k ( d_p^{MJW}(S̃, T̃) + d_p^{MJW}(T̃, R̃) ) for all sets with uncertainty S̃, T̃, R̃.

Proof.
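A minimal sketch of this combination follows. The code is generic over how distributions are represented, taking a user-supplied `wasserstein` callable (all names are ours). Feeding it Dirac deltas represented as floats, with `wasserstein = lambda a, b: abs(a - b)`, recovers the MJ_p reduction of Remark 3.3.

```python
import numpy as np

def d_w(f, others, wasserstein):
    """Minimal Wasserstein distance from distribution f to a set of distributions."""
    return min(wasserstein(f, g) for g in others)

def mjw(S, T, wasserstein, p=1):
    """MJ-Wasserstein semi-metric: MJ_p averaging with the ground
    distance replaced by a Wasserstein distance between densities."""
    a = np.mean([d_w(g, S, wasserstein) ** p for g in T]) / 2
    b = np.mean([d_w(f, T, wasserstein) ** p for f in S]) / 2
    return float((a + b) ** (1.0 / p))
```

For example, with S̃ = {δ_0, δ_4} and T̃ = {δ_1}, the two one-sided averages are 1/2 and (1 + 3)/4 = 1, giving a distance of 1.5 when p = 1.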
First, (7) is defined symmetrically, so we clearly see that d_p^{MJW}(S̃, T̃) = d_p^{MJW}(T̃, S̃). Next, if d_p^{MJW}(S̃, T̃) = 0, this forces every term d_W(g, S̃) = 0 for each g ∈ T̃. As the Wasserstein distance is a metric, this implies g ∈ S̃. Reversing the reasoning, we also see that each probability density function in S̃ must lie in T̃. So the sets with uncertainty S̃ and T̃ must be equal, with the exact same probability distributions as members. This establishes properties (i) and (ii) of Definition 2.1. Finally, we demonstrate the failure of the triangle inequality up to any constant. We can find an appropriate counterexample when S̃, T̃, R̃ are regular sets S, T, R. We modify the example of failure of the triangle inequality for the MJ semi-metric, presented in [59], Proposition 3.6. Specifically, suppose a, b are two elements in an ambient metric space with d(a, b) = d. Essentially, T has n elements, with n − 1 of them closely bunched together within ε of each other. Suppose ε is chosen such that ε < d(2n)^{−1/p}. Carefully noting the above, and choosing n sufficiently large with ε < d(2n)^{−1/p}, we deduce there is no universal modified triangle inequality for the MJ-Wasserstein distance.

Remark 3.6. In the proof above, a key role is played by the one-sided quantity

( (1/|S̃|) Σ_{f ∈ S̃} d_W(f, T̃)^p )^{1/p}. (11)

This measures a one-way distance from S̃ to T̃. Then, increasing the size of T̃ with bunched elements may reduce (11) relative to d_p^{MJW}(S̃, R̃). That is, the triangle inequality may be violated when T̃ is excessively "large" relative to S̃ and R̃, for example if T̃ contains S̃ and R̃ as subsets as well as bunched or irrelevant elements.

Remark 3.7. For this reason, we require sets with uncertainty, by definition, to have disjoint intervals of support. Not only is this always the case in the output of our change point algorithm, it also helps to reduce the effects of "bunching," which may cause a failure of the triangle inequality.
The choice of the Wasserstein metric is useful here too: disjoint supports of the constituent probability distributions in S̃ = {f_1, ..., f_k} prevent bunching in the values of d_W(f_i, f_j); this was the cause of the violation of the triangle inequality in the proof of Proposition 3.5. Even more simply, the disjointness condition prohibits duplication (or near-duplication) of elements of S̃. (Near-)duplication of elements may also throw off the triangle inequality.

Remark 3.8. We briefly compare our framework of sets with uncertainty to the existing notion of fuzzy sets. A fuzzy set is a pair A = (X, m) where X is a reference set and m : X → [0, 1] is a membership function. The membership function describes whether elements x ∈ X are not included in the fuzzy set (m(x) = 0), partially included (0 < m(x) < 1) or fully included (m(x) = 1). Fuzzy sets are frequently used in image detection and pattern recognition [6, 7, 8, 9, 10, 11]. Fuzzy sets are typically not considered probabilistic in nature; there is no requirement for values of m(x) to add to 1 over particular intervals, and the membership function does not model uncertainty in the location of an element, only degree of membership. In addition, existing distance measures between fuzzy sets are based on measures between sets, not probability density functions. Specifically, (semi)-metrics between fuzzy sets use the Hausdorff distance [7], typically between sets A_{≥r} = {x ∈ X : m(x) ≥ r}. That is, our semi-metrics between sets with uncertainty are of a different nature, prioritizing the probability distributions under which the elements vary, rather than using the Hausdorff metric between specified subsets of the reference set X. There is a method to associate a fuzzy set to a set with uncertainty. If S̃ = {f_1, ..., f_k}, where the probability distributions have supports I_1, ..., I_k, we may set X = I_1 ∪ ... ∪ I_k and m(x) = f_j(x) if x ∈ I_j, zero otherwise.
As the supports are assumed to be disjoint, this defines a fuzzy set. However, no existing measures between fuzzy sets produce the MJ-Wasserstein distance measure between sets with uncertainty. Thus, the sets with uncertainty framework is necessary to construct the new distance measures, and the existing fuzzy sets category is not sufficient. Now, we describe our application to time series' sets of change points with uncertainty. Let (X_t^{(i)}), i = 1, ..., n be a collection of n time series over a period t = 1, ..., T. Seeking to determine similarity between time series' erratic behavior profiles, taking into account uncertainty, we apply the Bayesian change point detection algorithm of Rosen et al. [58, 71] to each time series. Initially, this produces a distribution over the number of change points m, and conditional on m, a set of m points with uncertainty intervals. We select the maximally likely number of change points m*, yielding a set of change points with uncertainty S̃_i for each time series. We then define the distance between time series i and j as d_1^{MJW}(S̃_i, S̃_j)/T, setting p = 1 in the MJ-Wasserstein parameter and q = 1 in the Wasserstein metric. We remark that this is an abuse of notation, and that the distance can only be defined after a chosen change point detection algorithm has been performed. Nonetheless, the application of this entire methodology provides a useful framework for quantifying affinity between an attribute of time series behavior that is notoriously hard to capture, while considering uncertainty. We normalize by the length of the time series T so we can compare features of collections of time series over different period lengths, which will be required in Section 4. We provide full details of the change point algorithm in Appendix A. As established in Proposition 3.5, the MJ-Wasserstein semi-metric does not satisfy the triangle inequality (condition (iii) of Definition 2.1), even up to a constant. This could present a practical problem, as it means that closeness of sets with uncertainty under the semi-metric may not be transitive.
That is, perhaps d_p^{MJW}(S̃, T̃) and d_p^{MJW}(T̃, R̃) could be sufficiently small, indicating substantial similarity between S̃, T̃ and T̃, R̃, but d_p^{MJW}(S̃, R̃) might not be small. Thus, we include an empirical analysis of the failure of the triangle inequality property with our use of the new semi-metric. Specifically, we examine two questions: (i) how often does the d_p^{MJW} semi-metric fail the triangle inequality, and (ii) how badly do these quantities violate the triangle inequality? To explore this empirically, we form the full distance matrix and test whether the triangle inequality is satisfied for all possible triples of its elements. The matrix is defined as

D = (D_ij)_{1 ≤ i,j ≤ n}, (12)

where the D_ij give the distances between a collection of sets with uncertainty, as defined in Section 3.1. We then record the proportion of triples (i, j, k) that fail the triangle inequality, as well as the mean value of the ratio D_ik/(D_ij + D_jk) among the failed triples. In this section, we calculate the cost of our entire procedure, both the change point detection algorithm in Appendix A and our subsequent calculation of the MJ-Wasserstein distances. As in Section 3.1, suppose we have n time series (X_t^{(i)}) over a period t = 1, ..., T. Further, suppose as a result of implementing the change point algorithm, every considered number of change points m is bounded above by a constant M. Finally, suppose that every constituent probability distribution within each S̃_i is supported on at most P elements. Alternatively, one may suppose the sizes of the support intervals I_j^{(i)} are uniformly at most P. First, the Bayesian change point detection algorithm proceeds via a reversible jump Markov chain Monte Carlo algorithm. We use a fixed number K = 10000 of iterations (including 5000 for burn-in). The two most costly procedures within a single iteration are as follows. One is the search for optimal points t within so-far-determined segments [ξ_{k*−1}, ξ_{k*+1}]. This has a cost of O(T).
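The empirical audit just described can be sketched as follows, assuming a symmetric distance matrix with zero diagonal (the function name is ours):

```python
import numpy as np

def triangle_audit(D):
    """For all triples (i, j, k), check D[i][k] <= D[i][j] + D[j][k].
    Returns (proportion of failed triples,
             mean ratio D[i][k] / (D[i][j] + D[j][k]) among failures)."""
    n = len(D)
    ratios, total = [], 0
    for i in range(n):
        for k in range(i + 1, n):
            for j in range(n):
                if j in (i, k):
                    continue
                total += 1
                bound = D[i][j] + D[j][k]
                if D[i][k] > bound:
                    ratios.append(D[i][k] / bound)
    frac = len(ratios) / total if total else 0.0
    return frac, (float(np.mean(ratios)) if ratios else float("nan"))
```

The O(n^3) loop is affordable here, since the collections studied in Section 4 contain at most 19 elements.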
The other is the inversion of a matrix of size at most M × M. We reiterate that we assume our bound M is greater than every possible considered number of change points m within the sampling scheme of Appendix A. So the inversion has a cost of at most O(M^3) (but in practice is more efficient). Thus, the cost of the change point algorithm for a single time series is O(K(M^3 + T)). It follows that the total cost of implementing the change point algorithm across a collection of n time series is O(nKM^3 + nKT). At this point, n sets with uncertainty S̃_i, i = 1, ..., n have been obtained. Next, we consider the cost of computing a single MJ-Wasserstein distance d_1^{MJW}(S̃_i, S̃_j). This computation involves at most O(|S̃_i||S̃_j|) comparisons between probability densities, as seen in (7). By (2), computing each individual Wasserstein distance d_W between probability distributions supported on intervals of length at most P is of cost O(P). Indeed, (2) calculates the difference between two step functions with at most P steps. Hence, the computation of the full matrix of MJ-Wasserstein distances has cost O(n^2 M^2 P). Finally, with these matrix entries computed, the cost of the procedure to empirically analyze the triangle inequality is O(n^3). Overall, the full cost of our procedure is O(nKM^3 + nKT + n^2 M^2 P + n^3). In practice, K and T are much larger than any other parameter, so the total cost is O(nKT), with the greatest cost occurring as a result of the sampling scheme. In this section, we validate our methodology on six synthetic time series of length T = 1500. Each is formed by concatenating a sequence of autoregressive processes whose parameters change at specified locations. Four of the generated time series are chosen to share similar locations of structural breaks, while two are quite different from the rest, exhibiting similarity only with each other.
This will serve as a simplified representative of the real data in Section 4, where there will frequently be a majority collection of similar time series with a smaller number of outlier sectors or countries. The first four time series are chosen to have change points at approximately t = 200, 500, 700, 900, 1100, 1300 while the latter two have change points at approximately t = 750. We remark that the detection of change points in this context has traditionally been very difficult due to limited or no changes in the process mean or variance between autoregressive processes and the high amount of autocorrelation. Nonetheless, our methodology performs well at identifying change points in these synthetic time series, and our distance measure clearly identifies the similarity between the first four time series and the outlier status of the final two. In Figure 1, we display hierarchical clustering on the obtained 6 × 6 matrix D_ij, which identifies the similarity of the first four time series and the outlier status of the final two. Here and elsewhere in this paper, we implement agglomerative hierarchical clustering via the average-linkage method [72]. In Appendix B, we provide full details of these synthetic time series, including all piecewise autoregressive components, and show the change points detected. Finally, we include a brief sensitivity analysis of the ability of our change point algorithm to detect small changes in adjacent autoregressive processes. Results are promising, indicating successful detection except for very small changes in the autoregressive parameters. This experiment is also detailed in Appendix B. In this section, we study the erratic behavior profiles of 19 country and 11 sector indices. Country data runs from 01/01/2002-10/10/2020, while sector data runs from 01/01/2000-10/10/2020. First, we generate a log return time series for each country and sector index.
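The clustering step can be reproduced with scipy's average-linkage implementation on a precomputed distance matrix. The toy 6 × 6 matrix below only mimics the structure of the synthetic experiment (four mutually close series, two outliers similar only to each other); it is not the actual MJ-Wasserstein output.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Toy distance matrix: series 0-3 mutually close, series 4-5 close
# to each other but far from the rest.
D = np.full((6, 6), 5.0)
D[:4, :4] = 1.0
D[4, 5] = D[5, 4] = 1.0
np.fill_diagonal(D, 0.0)

# Average-linkage agglomerative clustering on the condensed distance matrix.
Z = linkage(squareform(D), method="average")
labels = fcluster(Z, t=2, criterion="maxclust")
```

Cutting the dendrogram at two clusters recovers the planted structure: series 0-3 share one label and series 4-5 share another.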
Next, we apply a Bayesian change point detection algorithm, as detailed in Appendix A, to generate a set of change points with uncertainty for each index. Then, we may apply our semi-metrics between such sets with uncertainty, as defined in Section 3. First, we apply the methodology of Section 3 to the log return time series of 19 country indices. We produce a 19 × 19 distance matrix D c ij that measures similarity between the erratic behavior profiles of these time series, as summarized by their sets of change points with uncertainties. We display hierarchical clustering of D c in Figure 2 . The cluster structure of these indices reveals one predominant cluster of countries, and a minority collection of various outliers. The primary cluster consists of Australia, Canada, France, Germany, Italy, Korea, the Netherlands, Saudi Arabia, Switzerland, Turkey, the United Kingdom (UK) and the United States (US). The outlier countries include Brazil, China, India, Indonesia, Japan, Russia and Spain. The primary cluster contains a strong subcluster of mostly European composition, such as France, Italy, the Netherlands, Switzerland and the UK. This may be due to substantial similarity in the geographic location and political association between such countries, leading to similar change point propagation surrounding events such as Brexit. Notably, Spain is determined to be markedly less similar than its European counterparts. In this section and elsewhere, we refer to a time series (in this case pertaining to countries) as anomalous if it lies in its own cluster as determined by our implementation of agglomerative hierarchical clustering via the average-linkage method [72] . In this instance, the unique anomalous country within the collection is China. This is unsurprising, as many of the idiosyncratic restrictions related to their local equity market may provide different dynamics to those exhibited by other countries. 
We take a closer look at the similarity of select country indices in Figure 3. First, we observe clear similarity in the log return time series for France, Germany, and the UK, in Figures 3a, 3b and 3c, respectively. All three of these time series exhibit change points that are, broadly speaking, evenly distributed over the entire period of analysis. Further, all three experience a change point in 2008-2009, associated with the global financial crisis (GFC); in mid-2016, associated with Brexit; and in early 2020, associated with COVID-19. These three examples are also representative of the behavior of Australia, Canada, Italy, the Netherlands and Switzerland, all of which lie in the same subcluster of similarity. Next, we display the time series for the US (3d); this is characterized by a long period of stability from 2012-2019, associated with a prolonged bull market. The strong similarity between France, Germany and the UK and the slightly weaker affinity between these and the US are all visible in Figure 2. In Figures 3e and 3f, we display the time series for India and Brazil, respectively. These are both characterized by a period of stability from 2010-2016; the primary difference is that India exhibits more change points in the first decade, while Brazil is more stable during this time. Russia is displayed in Figure 3g, distinguished by a lengthy period of stability in the first decade followed by regularly spaced change points. Finally, China (3h) is distinguished as the only country that lacks a change point in 2020 associated with COVID-19. Indeed, the Chinese market recovered more quickly from the impact of the pandemic than any other country [73]. Next, we apply the methodology of Section 3 to the log return time series of 11 sector indices to produce an 11 × 11 distance matrix D^s_ij, and display hierarchical clustering of this matrix in Figure 4. This dendrogram consists of a dominant cluster and two outliers.
The primary cluster consists of communications, consumer discretionary, energy, financials, healthcare, information technology (IT), industrials, real estate and utilities. Of these, consumer discretionary, energy, financials, healthcare, industrials and real estate comprise a subcluster of similarity. The two outlier sectors are consumer staples and materials. We examine select sectors and their change points in Figure 5 . In Figures 5a, 5b and 5c, we display the log return time series for the consumer discretionary, energy and financial sectors, respectively. All three experience a change point at the start of 2020, and approximately evenly distributed breaks throughout the entire period. These three time series are also representative of the other members of the subcluster, that is, healthcare, industrials and real estate. Next, we display the log returns for communication services (5d) and utilities (5e), which represent the remainder of the primary cluster in Figure 4 . Both of these are characterized by a period of stability in 2012-2019, much like that of the US time series (3d). Finally, we display one of the outlier sectors, materials, in Figure 5f . This is characterized by relatively few change points early on. Broadly speaking, we can see that the sector time series have more similarity among themselves than the country indices. For example, every sector has a change point associated with the GFC and COVID-19, and sector change points are more evenly distributed throughout the entire period than for the countries. We shall observe this quantitatively as well. The most similarity is observed between consumer discretionary ("Discretionary"), energy, financials, healthcare, industrials and real estate. Communication services ("Comms"), IT and utilities show slightly different erratic behavior. 
In this section, we compare the properties of the two distance matrices between countries and sectors as a whole, analyzing different properties of their collective similarity. First, we analyze different matrix norms of D^c and D^s. Let D be an arbitrary symmetric n × n matrix with diagonal entries equal to zero. Let

||D||_{L^1} = (1/(n(n−1))) Σ_{i≠j} |D_ij|,

||D||_{L^2} = (1/(n(n−1))) ( Σ_{i≠j} |D_ij|^2 )^{1/2},

||D||_{op} = max_{v ∈ R^n − {0}} ||Dv||/||v|| = max{|λ| : λ is an eigenvalue of D}.

These are referred to as the L^1, L^2 and operator norms of the matrix D. We have rescaled the L^1 and L^2 norms by the number of non-zero elements of D, so we may compare matrices of different sizes. The operator norm does not require rescaling, as for example, the scalar matrix kI_n has operator norm equal to |k| regardless of n. The equality of the operator norm [74] with the greatest eigenvalue holds as D is symmetric, hence diagonalizable; this is known as the spectral theorem [75]. We record the different norms of D^c and D^s in Table 1. As described in Section 3, the distance matrices are normalized by the slightly differing period lengths of the time series of countries and sectors (19 and 21 years, respectively). There, we confirm what we qualitatively observed in the above sections: the country distance matrix has consistently greater norms than the sectors, even after normalization by the number of elements and period length.

(Figure 6 caption: A blue dot represents a triple that satisfies the triangle inequality, a yellow dot represents a failure of the triangle inequality by a factor of at most 2, and a red dot, of which there are none, indicates failure by a factor greater than 2. Although neither countries nor sectors fail the triangle inequality frequently or severely, countries do fail the triangle inequality slightly more frequently and severely than sectors. 2.45% of candidate country triples fail, with an average fail ratio of 1.10; 1.05% of sector triples fail, with an average fail ratio of 1.08.)
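A sketch of the three norms for a symmetric zero-diagonal matrix follows; the exact rescaling convention for the L^1 and L^2 norms (dividing by the n(n − 1) off-diagonal entries) is our assumption about the paper's normalization, and the function name is ours.

```python
import numpy as np

def matrix_norms(D):
    """Rescaled L1 and L2 norms plus the operator (spectral) norm.
    Assumes D is symmetric with zero diagonal."""
    n = D.shape[0]
    off = n * (n - 1)                       # number of off-diagonal entries
    l1 = np.abs(D).sum() / off              # rescaled L1 norm (assumed convention)
    l2 = np.sqrt((D ** 2).sum()) / off      # rescaled L2 norm (assumed convention)
    op = np.max(np.abs(np.linalg.eigvalsh(D)))  # spectral theorem: max |eigenvalue|
    return l1, l2, op
```

Using `eigvalsh` exploits the symmetry of D, which is exactly why the operator norm equals the largest absolute eigenvalue here.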
This quantitatively shows greater discrepancy in distances between change points among countries than among sectors. That is, the erratic behavior profiles are more similar among sectors than among countries. In this section, we compare the results obtained from our methodology with several classical distances and similarity measures. We consider five measures between time series $x(t)$ and $y(t)$ over $t = 1, \ldots, T$:
(i) cosine similarity, defined as $\frac{\sum_{t=1}^{T} x(t)y(t)}{\|x\| \|y\|}$;
(ii) (Pearson) correlation, defined as the cosine similarity between $x(t) - \bar{x}$ and $y(t) - \bar{y}$;
(iii) the Manhattan ($L^1$) distance [76], defined as $\sum_{t=1}^{T} |x(t) - y(t)|$;
(iv) the Euclidean ($L^2$) distance, defined as $\left( \sum_{t=1}^{T} |x(t) - y(t)|^2 \right)^{\frac{1}{2}}$;
(v) the Chebyshev ($L^\infty$) distance [76], defined as $\max_{t} |x(t) - y(t)|$.
In Appendix C, we display the ten heat maps obtained by applying the five classical measures above to the collections of both countries and sectors. Broadly, none of the observations concerning the erratic behavior of country and sector indices discussed previously in Section 4 can be obtained. We believe this demonstrates the necessity of a new methodology to draw inference on the similarity of sector or country indices with respect to their change points, taking uncertainty into account. Finally, we analyze the failure of the triangle inequality when computing the semi-metric $d^1_{MJW}$ among the elements of each collection. As discussed in Section 3, this semi-metric does not satisfy the triangle inequality in general, so we propose an empirical method to test the triangle inequality among triples of elements in Section 3.2. We form the matrix in (12) for each collection, and display this in Figure 6 for countries and sectors. The results highlight that neither sectors nor countries fail the triangle inequality frequently or severely, but triangle inequality failures are slightly more frequent and severe between country structural breaks.
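The empirical transitivity test just described can be sketched as follows: for every unordered triple of elements we check the three triangle inequalities $d(i,j) \le d(i,k) + d(k,j)$, recording the fraction of failures and the average failure ratio. This is a generic sketch of such a test, not the paper's code.

```python
from itertools import combinations

def triangle_failures(D):
    """Return (fraction of failed inequalities, average failure ratio) for D."""
    checks, fails, ratios = 0, 0, []
    n = len(D)
    for i, j, k in combinations(range(n), 3):
        # Each unordered triple yields three triangle inequalities.
        for a, b, c in ((i, j, k), (i, k, j), (j, k, i)):
            checks += 1
            detour = D[a][c] + D[c][b]
            if D[a][b] > detour:
                fails += 1
                ratios.append(D[a][b] / detour)  # failure ratio > 1
    avg = sum(ratios) / len(ratios) if ratios else 0.0
    return fails / checks, avg

# One violating triple: d(0,1) = 10 exceeds d(0,2) + d(2,1) = 2, a ratio of 5.
D = [[0, 10, 1], [10, 0, 1], [1, 1, 0]]
frac, avg = triangle_failures(D)
```

A failure ratio near 1 corresponds to a mild violation (a yellow dot in Figure 6), while ratios above 2 would correspond to red dots.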
1.05% of sector triples fail the triangle inequality, with an average fail ratio of 1.08, while 2.45% of country triples fail the triangle inequality, with an average fail ratio of 1.10. We also record this in Table 1.
Table 1: Matrix norms and transitivity analysis results for $D_c$ and $D_s$, the distance matrices between country and sector indices, respectively. We see quantitatively that country indices exhibit greater overall discrepancy in sets of change points than sectors.
This paper proposes a new method for measuring similarity in erratic behavior among a collection of time series. Our mathematical framework defines sets with uncertainty, introduces a new class of semi-metrics between them, and combines this with a suitable change point algorithm to quantify discrepancy between time series based on their change points, taking uncertainty into account. Our change point detection algorithm is judiciously chosen to perform well on dependent data and to record the uncertainty around change points. Although our semi-metric may violate the triangle inequality, we include an empirical investigation and show that, in all applications of the paper, the triangle inequality is violated infrequently and mildly. Thus, the semi-metric still respects a transitive property of closeness, as discussed in Section 3. Applying this methodology to collections of country and sector indices over 20 years, we obtain immediate and visible insight into the structure of these collections with respect to their sets of change points, which are representative of the indices' erratic behavior. We see clear similarity between the developed countries of our collection, particularly European countries. Examining these countries individually, we confirm similarity in several key change points, including the GFC, COVID-19, and Brexit in the case of European countries. Our methodology is also able to quickly identify outlier countries, such as China.
Examining China reveals a clearly anomalous feature: no change point associated with COVID-19. Applying our methodology to sectors and analyzing the resulting distance matrix compared to that of countries also produces interesting results. We show quantitatively that there is greater collective similarity among the erratic behavior profiles of sectors than countries. This can be qualitatively observed by examining the individual sector time series, where change points are consistently observed to be fairly equidistributed over time. In addition, every sector exhibits a change point around both the GFC and COVID-19. This observation has meaningful implications for investors, especially with regard to portfolio risk. Since there is greater similarity in sector erratic behaviors than in those of countries, this suggests that diversification benefits are more substantial to investors diversifying with respect to countries (as opposed to sectors), as there is greater potential for reduction in change point variance. The findings of the entire paper have several interpretations regarding modern trends in portfolio management. The measure introduced could accompany traditional risk management tools and provide a framework for measuring distance between country and sector erratic behavior profiles. Our measure may capture dynamics and associations that are not explicitly identified by traditional measures such as correlation. The experiments in this paper may indicate potential use cases for investors concerned with asset allocation across countries and sectors. Future work could analyze the erratic behavior profiles of different time series, such as individual equities, to provide an alternative to, or modification of, traditional portfolio diversification.
In summary, this paper introduces a new measure to study similarity between time series' change points (as determined by a Bayesian change point algorithm), while capturing the uncertainty around such erratic shifts. We present promising findings on country and sector indices, with substantial insights into the structure of the two collections and the differences therein. Our methodology and findings pair well with existing statistical techniques, with the identification of frequently observed change points inspiring a closer examination of the associated crises. This closer analysis identifies relationships between erratic behavior profiles, correlations, dynamics and trajectories, with noteworthy implications for financial practitioners and new understandings of risk management during crises.
In this section, we describe the Bayesian change point detection algorithm that we use in the paper. Specifically, we use a reversible jump Markov chain Monte Carlo (RJMCMC) algorithm [71] that continually updates a partitioning of the time series into $m$ segments, where $m$ also changes over the course of the sampling. The sampling scheme produces a distribution over $m$ and a set of $m$ change points. As described in Section 3, we conclude by selecting the most likely value $m_0$; our set of change points then varies with uncertainty conditional on $m_0$. Our methodology detects change points based on changes in the time-varying power spectrum [57]. This addresses many of the limitations of traditional change point detection algorithms in handling dependent data. Prior work [71] has demonstrated that the methodology applied in this paper appropriately detects changes in piecewise autoregressive processes, highlighting the algorithm's ability to partition dependent data. This is crucial when analyzing financial time series, where rich autocorrelation structure has been observed in the literature. We follow Rosen et al. [58] in our implementation of the RJMCMC sampling scheme.
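The final selection step (choosing the most likely $m_0$, then conditioning the change point draws on it) can be sketched as follows. The pair format of the posterior samples is a hypothetical illustration, not the paper's data structure.

```python
from collections import Counter

def conditional_changepoints(samples):
    """samples: list of (m, changepoints) pairs drawn by an RJMCMC sampler.
    Returns (m0, list of change point draws conditional on m = m0)."""
    m0 = Counter(m for m, _ in samples).most_common(1)[0][0]  # most likely m
    return m0, [cps for m, cps in samples if m == m0]

# Toy posterior output: m = 2 is the modal number of change points.
samples = [(2, [400, 900]), (3, [100, 400, 900]), (2, [410, 905]), (2, [395, 898])]
m0, draws = conditional_changepoints(samples)
```

The retained draws then form the empirical distribution of each change point's location, which is exactly the uncertainty the semi-metric consumes.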
We denote a varying partition of the time series by $\xi_m = (\xi_{0,m}, \ldots, \xi_{m,m})$; these are our $m$ change points (excluding $\xi_{m,m}$, which by convention is always the final time point). The algorithm requires the consideration of a vector of amplitude parameters $\tau^2_m = (\tau^2_{1,m}, \ldots, \tau^2_{m,m})$ and regression coefficients $\beta_m = (\beta_{1,m}, \ldots, \beta_{m,m})$ that we wish to estimate, for the $j$th component within a partition of $m$ segments, $j = 1, \ldots, m$. For notational convenience, $\beta_{j,m}$, $j = 1, \ldots, m$, is assumed to include the first entry, $\alpha_{0j,m}$. In the following sections, superscripts $c$ and $p$ refer to the current and proposed values in the RJMCMC sampling scheme, respectively. First, we describe the between-model moves, where the number of change points $m$ may be changed. Let $\theta_m = (\xi_m, \tau^2_m, \beta_m)$ be the model parameters at some point in the RJMCMC sampling scheme, and assume that the chain starts at $(m^c, \theta^c_{m^c})$. The algorithm proposes a move to $(m^p, \theta^p_{m^p})$ by drawing $(m^p, \theta^p_{m^p})$ from a proposal distribution $q(m^p, \theta^p_{m^p} | m^c, \theta^c_{m^c})$. This draw is accepted with probability $\alpha = \min\left\{1, \frac{p(m^p, \theta^p_{m^p}) \, q(m^c, \theta^c_{m^c} | m^p, \theta^p_{m^p})}{p(m^c, \theta^c_{m^c}) \, q(m^p, \theta^p_{m^p} | m^c, \theta^c_{m^c})}\right\}$, where $p(\cdot)$ is a target distribution, namely the product of the likelihood and the prior. The target and proposal distributions vary based on the type of move taken in the sampling scheme. The support of $q_1$ has $n_{k^*} + n_{k^*+1} - 2t_{min} + 1$ points, while $q_2$ has at most three. The term $q_2$ alone would result in a high acceptance rate for the Metropolis-Hastings step, but it would explore the parameter space slowly. The $q_1$ component allows for larger steps, producing a compromise between a high acceptance rate and a thorough exploration of the parameter space. Then, $\beta^p_j$, $j = k^*, k^*+1$, is drawn from an approximation to $\prod_{j=k^*}^{k^*+1} p(\beta_j | x^p_j, \tau^2_j)$, following the corresponding step in the between-model move. The proposal distribution is evaluated at $\beta^p_j$, $j = k^*, k^*+1$, where $\beta^p_* = (\beta^p_{k^*}, \beta^p_{k^*+1})$ and $\tau^2_* = (\tau^2_{k^*}, \tau^2_{k^*+1})$.
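The accept/reject decisions above all share the standard Metropolis-Hastings form; a minimal sketch, with `ratio` standing for the relevant target-times-proposal ratio:

```python
import random

def mh_accept(ratio, u=None):
    """Accept a proposal with probability min(1, ratio).
    `u` may be supplied for deterministic testing; otherwise drawn uniformly."""
    if u is None:
        u = random.random()
    return u < min(1.0, ratio)
```

A ratio of at least 1 is always accepted; smaller ratios are accepted with probability equal to the ratio, which preserves the target distribution.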
This proposal distribution is also evaluated at the current values $\beta^c_* = (\beta^c_{k^*}, \beta^c_{k^*+1})$. $\beta^p_*$ is then accepted with a Metropolis-Hastings acceptance probability. When the draw is accepted, we update the partition and regression coefficients $(\xi^c_{k^*}, \beta^c_*) = (\xi^p_{k^*}, \beta^p_*)$. Finally, we draw $\tau^{2p}$ from its full conditional distribution. This is a Gibbs sampling step, and as such the draw is accepted with probability 1. This concludes the description of the algorithm. For our purposes, we do not need the vectors of amplitude parameters or regression coefficients, just the distribution of the change points.
First, we explicitly detail the processes governing the six synthetic time series discussed in Section 3.4. In what follows, $\epsilon_t$ are independent and identically distributed $N(0, 0.1)$ Gaussian noise terms. Each $x^{(i)}_t$ is a piecewise autoregressive process whose segments alternate among the components $-0.9x_{t-1} + \epsilon_t$, $0.9x_{t-1} + \epsilon_t$, $1.5x_{t-1} - 0.75x_{t-2} + \epsilon_t$ and $0.9x_{t-1} - 0.8x_{t-2} + \epsilon_t$, with breaks at times such as $t = 190, 500, 685, 750, 900, 1105, 1110$ and $1300$. For example,
$x^{(4)}_t$: $1.5x^{(4)}_{t-1} - 0.75x^{(4)}_{t-2} + \epsilon_t$ for $1 \le t \le 190$; $0.9x^{(4)}_{t-1} + \epsilon_t$ for $191 \le t \le 500$; $0.9x^{(4)}_{t-1} - 0.8x^{(4)}_{t-2} + \epsilon_t$ for $501 \le t \le 685$; $1.5x^{(4)}_{t-1} - 0.75x^{(4)}_{t-2} + \epsilon_t$ for $686 \le t \le 900$; $-0.9x^{(4)}_{t-1} + \epsilon_t$ for $901 \le t \le 1105$; $0.9x^{(4)}_{t-1} - 0.8x^{(4)}_{t-2} + \epsilon_t$ for $1106 \le t \le 1300$; $-0.9x^{(4)}_{t-1} + \epsilon_t$ for $1301 \le t \le 1500$. (B.4)
$x^{(5)}_t$: $-0.9x^{(5)}_{t-1} + \epsilon_t$ for $1 \le t \le 750$; $1.5x^{(5)}_{t-1} - 0.75x^{(5)}_{t-2} + \epsilon_t$ for $751 \le t \le 1500$. (B.5)
$x^{(6)}_t$: $0.9x^{(6)}_{t-1} - 0.8x^{(6)}_{t-2} + \epsilon_t$ for $1 \le t \le 750$; $-0.9x^{(6)}_{t-1} + \epsilon_t$ for $751 \le t \le 1500$. (B.6)
Next, we plot the above six time series in Figure B1, as well as the determined sets of change points with uncertainty. While not perfect, we believe this validates our methodology as determining difficult-to-detect change points in time series with a high level of autocorrelation. Finally, we describe our sensitivity analysis regarding small changes in autoregressive parameters. We form four time series of length $T = 2000$, each with a subtle break at $t = 1000$. Each time series takes the same form: an AR(1) process with autoregressive coefficient 0.9 for $1 \le t \le 1000$ and coefficient $c_i$ for $1001 \le t \le 2000$, where $c_i$ is a specified constant for each time series $i = 1, \ldots, 4$.
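Processes such as $x^{(5)}$ in (B.5) can be simulated directly. A sketch, reading $N(0, 0.1)$ as variance 0.1 (if standard deviation 0.1 is intended, only the noise scale changes); the zero initial conditions are our assumption:

```python
import random

def simulate_x5(T=1500, seed=0):
    """Piecewise AR process as in (B.5): AR(1) with coefficient -0.9 for
    t <= 750, then AR(2) with coefficients (1.5, -0.75) thereafter."""
    rng = random.Random(seed)
    sd = 0.1 ** 0.5  # N(0, 0.1) noise, read as variance 0.1
    x = []
    for t in range(1, T + 1):
        eps = rng.gauss(0.0, sd)
        x1 = x[-1] if len(x) >= 1 else 0.0  # zero initial conditions assumed
        x2 = x[-2] if len(x) >= 2 else 0.0
        if t <= 750:
            x.append(-0.9 * x1 + eps)
        else:
            x.append(1.5 * x1 - 0.75 * x2 + eps)
    return x

x5 = simulate_x5()
```

Both regimes are stationary (the AR(2) characteristic roots have modulus below one), so the break manifests as a change in the autocorrelation structure and power spectrum rather than an obvious shift in level.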
We select constants $c_1 = 0.5$, $c_2 = 0.6$, $c_3 = 0.7$, $c_4 = 0.8$. Applying our change point methodology, change points near $t = 1000$ are successfully detected for $c_1 = 0.5$ and $c_2 = 0.6$, but not for the smaller changes $c_3 = 0.7$ and $c_4 = 0.8$ relative to 0.9. For $c_1$ and $c_2$, change points are estimated at $t = 1009$ and $t = 977$, respectively. Thus, we observe successful detection of change points in all but the slightest changes of autoregressive parameters.
In this brief section, we plot ten heat maps obtained by applying the five classical measures defined in Section 4.4 to the collections of country and sector indices. For a consistent comparison, we make a small adjustment to the latter three metrics (the Manhattan, Euclidean and Chebyshev distances). The cosine similarity and correlation take values between −1 and 1, with 1 indicating the greatest possible similarity. The latter three metrics indicate greatest similarity with a value of zero. For ease of comparison, we linearly adjust the latter three metrics as follows: given an $n \times n$ distance matrix $D$, let its associated affinity matrix $A$ be defined by
$$A_{ij} = 1 - \frac{D_{ij}}{\max_{k,l} D_{kl}}, \quad i, j = 1, \ldots, n. \quad \text{(C.1)}$$
Then an affinity value of 1 between two time series indicates the greatest possible similarity. Having performed this transformation, we plot heat maps of the cosine similarity and correlation matrices, as well as the affinity matrices associated to the Manhattan, Euclidean and Chebyshev distances, in Figure C1. Essentially none of the findings reported in Section 4 can be drawn from these figures. We believe this demonstrates the utility of the methodology of the paper. Figure C1: Heat maps of classical similarity measures between country and sector indices. We display the cosine similarity for (a) countries and (b) sectors, as well as correlation for (c) countries and (d) sectors.
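The three classical distances and the adjustment (C.1) can be sketched together; a minimal pure-Python illustration (function names are ours, not from the paper):

```python
import math

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def chebyshev(x, y):
    return max(abs(a - b) for a, b in zip(x, y))

def affinity(D):
    """A_ij = 1 - D_ij / max(D), as in (C.1); affinity 1 = greatest similarity."""
    d_max = max(max(row) for row in D)
    return [[1.0 - d / d_max for d in row] for row in D]

# Pairwise Chebyshev distances between three toy series, then the (C.1) transform.
series = [[0.0, 1.0, 2.0], [0.0, 1.0, 4.0], [1.0, 1.0, 2.0]]
D = [[chebyshev(a, b) for b in series] for a in series]
A = affinity(D)
```

After the transform, the diagonal of $A$ is 1 and the most distant pair maps to 0, putting all five measures on a comparable color scale for the heat maps.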
Subsequently, we plot the affinity matrices associated to three metrics, the Manhattan metric in (e) and (f), the Euclidean metric in (g) and (h), and the Chebyshev metric in (i) and (j), each for countries and sectors, respectively.
The distribution $q(m^p, \theta^p_{m^p} | m^c, \theta^c_{m^c})$ is described as follows:
$$q(m^p, \theta^p_{m^p} | m^c, \theta^c_{m^c}) = q(m^p | m^c) \, q(\theta^p_{m^p} | m^p, m^c, \theta^c_{m^c}) = q(m^p | m^c) \, q(\xi^p_{m^p}, \tau^{2p}_{m^p}, \beta^p_{m^p} | m^p, m^c, \theta^c_{m^c}) = q(m^p | m^c) \, q(\xi^p_{m^p} | m^p, m^c, \theta^c_{m^c}) \, q(\tau^{2p}_{m^p} | \xi^p_{m^p}, m^p, m^c, \theta^c_{m^c}) \, q(\beta^p_{m^p} | \tau^{2p}_{m^p}, \xi^p_{m^p}, m^p, m^c, \theta^c_{m^c}).$$
To draw $(m^p, \theta^p_{m^p})$, one must first draw $m^p$, and then $\xi^p_{m^p}$, $\tau^{2p}_{m^p}$, and $\beta^p_{m^p}$. The number of segments $m^p$ is drawn from the proposal distribution $q(m^p | m^c)$.
Let $M$ be the maximum number of segments and $m^c_{2t_{min}}$ be the number of current segments containing at least $2t_{min}$ data points. The proposal distribution $q(m^p | m^c)$ proposes either a birth move, $m^p = m^c + 1$ (possible only when $m^c < M$ and at least one segment contains $2t_{min}$ data points), or a death move, $m^p = m^c - 1$ (possible only when $m^c > 1$). Conditional on the proposed number of change points $m^p$, a new partition $\xi^p_{m^p}$, a new vector of amplitude parameters $\tau^{2p}_{m^p}$, and a new vector of regression coefficients $\beta^p_{m^p}$ are proposed. $\tau^2$ is described as a smoothing parameter [58] or amplitude parameter. Now, we describe the birth of new segments, where the number of proposed change points increases. Suppose that $m^p = m^c + 1$. As defined before, a time series partition $\xi^p_{m^p}$ is drawn from the proposal distribution $q(\xi^p_{m^p} | m^p, m^c, \theta^c_{m^c})$. The algorithm first proposes a partition by selecting a random segment $j = k^*$ to split. Then, a random point $t^*$ within the segment $j = k^*$ is selected to be the proposed partition point. This is subject to the constraint $\xi^c_{k^*-1,m^c} + t_{min} \le t^* \le \xi^c_{k^*,m^c} - t_{min}$, and the proposal distribution is computed accordingly. Then, the vector of amplitude parameters is updated following the RJMCMC algorithm of [77]. This draws from a uniform distribution $u \sim U[0, 1]$ and defines $\tau^{2p}_{k^*,m^p}$ and $\tau^{2p}_{k^*+1,m^p}$ in terms of $u$ and $\tau^{2c}_{k^*,m^c}$ via equations (A.1) and (A.2). Then, the vector of coefficients is updated: $\beta^p_{k^*,m^p}$ and $\beta^p_{k^*+1,m^p}$ are drawn from Gaussian approximations to the respective posterior conditional distributions $p(\beta^p_{k^*,m^p} | x^p_{k^*}, \tau^{2p}_{k^*,m^p}, m^p)$ and $p(\beta^p_{k^*+1,m^p} | x^p_{k^*+1}, \tau^{2p}_{k^*+1,m^p}, m^p)$. Here, $x^p_{k^*}$ and $x^p_{k^*+1}$ refer to the subsets of the time series across segments $k^*$ and $k^*+1$, respectively.
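The split step of the birth move can be sketched as follows: pick a segment uniformly at random among those long enough to split, then pick the new change point uniformly, honoring the $t_{min}$ constraint on both resulting segments. A generic sketch, not the paper's implementation:

```python
import random

def propose_split(xi, t_min, rng=random):
    """xi: sorted change points including both endpoints, e.g. [0, 500, 1500].
    Returns (k_star, t_star): the chosen segment and a new change point in it."""
    # Only segments with at least 2 * t_min points admit a valid split point.
    eligible = [k for k in range(1, len(xi)) if xi[k] - xi[k - 1] >= 2 * t_min]
    k_star = rng.choice(eligible)
    # Uniform over [xi[k*-1] + t_min, xi[k*] - t_min], inclusive at both ends.
    t_star = rng.randint(xi[k_star - 1] + t_min, xi[k_star] - t_min)
    return k_star, t_star

xi = [0, 500, 1500]
k, t = propose_split(xi, t_min=40, rng=random.Random(1))
```

The constraint guarantees both child segments retain at least $t_{min}$ observations, so every proposed partition remains valid.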
Then, $\xi^p_{m^p}$ determines the proposed segmentation of the time series. For the birth move (that is, the increase in $m$), the probability of acceptance is $\alpha = \min\{1, A\}$, where $A$ is the product of the ratios of the target and proposal densities together with the Jacobian of the transformation. Above, $p(u) = 1$ for $0 \le u \le 1$, while $p(\beta^p_{k^*,m^p})$ and $p(\beta^p_{k^*+1,m^p})$ are the density functions of the Gaussian proposal distributions $N(\beta^{max}_{k^*}, \Sigma^{max}_{k^*})$ and $N(\beta^{max}_{k^*+1}, \Sigma^{max}_{k^*+1})$, respectively. The death move, where the number of proposed change points decreases, proceeds in reverse. The updated vector of amplitude parameters is drawn from the proposal distribution $q(\tau^{2p}_{m^p} | m^p, \xi^p_{m^p}, m^c, \theta^c_{m^c}) = q(\tau^{2p}_{m^p} | m^p, \tau^{2c}_{m^c})$. One amplitude parameter $\tau^{2p}_{k^*,m^p}$ is formed from two current amplitude parameters, $\tau^{2c}_{k^*,m^c}$ and $\tau^{2c}_{k^*+1,m^c}$, by reversing equations (A.1) and (A.2). Finally, the updated vector of regression coefficients is drawn from a Gaussian approximation to the posterior distribution $p(\beta_{k^*,m^p} | x, \tau^{2p}_{k^*,m^p}, \xi^p_{m^p}, m^p)$, with the same procedure as for the vector of coefficients in the birth step. The probability of acceptance is the inverse of the corresponding birth step. If the move is accepted, then the following updates of the current values occur: $m^c = m^p$ and $\theta^c_{m^c} = \theta^p_{m^p}$. Finally, we describe the within-model moves; henceforth, the number of change points $m$ is fixed and notation describing the dependence on the number of change points is suppressed. There are two parts to a within-model move. First, the change points may be relocated (with the same number), and, conditional on the relocation, the basis function coefficients are updated. These steps are jointly accepted or rejected in a Metropolis-Hastings step, and the amplitude parameters are updated within a separate Gibbs sampling step. We assume the chain is located at $\theta^c = (\xi^c, \beta^c)$. The proposed move $\theta^p = (\xi^p, \beta^p)$ is as follows: first, a change point $\xi_{k^*}$ is selected for relocation from the $m - 1$ candidate change points.
Next, a position within the interval $[\xi_{k^*-1}, \xi_{k^*+1}]$ is chosen, subject to the constraint that the new location is at least $t_{min}$ data points away from $\xi_{k^*-1}$ and $\xi_{k^*+1}$, where $\Pr(j = k^*) = (m-1)^{-1}$. A mixture distribution for $\Pr(\xi^p_{k^*} = t | j = k^*)$ is constructed to explore the space efficiently, so
$$\Pr(\xi^p_{k^*} = t | j = k^*) = \pi \, q_1(\xi^p_{k^*} = t | \xi^c_{k^*}) + (1 - \pi) \, q_2(\xi^p_{k^*} = t | \xi^c_{k^*}),$$
where $q_1(\xi^p_{k^*} = t | \xi^c_{k^*}) = (n_{k^*} + n_{k^*+1} - 2t_{min} + 1)^{-1}$ for $\xi_{k^*-1} + t_{min} \le t \le \xi_{k^*+1} - t_{min}$, and
$$q_2(\xi^p_{k^*} = t | \xi^c_{k^*}) = \begin{cases} 0 & \text{if } |t - \xi^c_{k^*}| > 1; \\ 1/3 & \text{if } |t - \xi^c_{k^*}| \le 1, \; n_{k^*} \ne t_{min} \text{ and } n_{k^*+1} \ne t_{min}; \\ 1/2 & \text{if } 0 \le t - \xi^c_{k^*} \le 1, \; n_{k^*} = t_{min} \text{ and } n_{k^*+1} \ne t_{min}; \\ 1/2 & \text{if } 0 \le \xi^c_{k^*} - t \le 1, \; n_{k^*} \ne t_{min} \text{ and } n_{k^*+1} = t_{min}; \\ 1 & \text{if } t = \xi^c_{k^*}, \; n_{k^*} = t_{min} \text{ and } n_{k^*+1} = t_{min}. \end{cases}$$
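The relocation proposal can be sketched as a draw from this two-component mixture: with probability $\pi$, a uniform draw over all admissible locations ($q_1$); otherwise a local move of at most one point ($q_2$), restricted so that neither adjacent segment falls below $t_{min}$. A sketch under those assumptions, with $\pi$ as a free parameter:

```python
import random

def propose_relocation(xi_left, xi_cur, xi_right, t_min, pi=0.5, rng=random):
    """Draw a new location for the change point currently at xi_cur,
    lying between its neighboring change points xi_left and xi_right."""
    lo, hi = xi_left + t_min, xi_right - t_min  # admissible range
    if rng.random() < pi:
        # q1: uniform over all n_k + n_{k+1} - 2*t_min + 1 admissible points.
        return rng.randint(lo, hi)
    # q2: stay or move one step, never shrinking a segment below t_min.
    candidates = [t for t in (xi_cur - 1, xi_cur, xi_cur + 1) if lo <= t <= hi]
    return rng.choice(candidates)

rng = random.Random(7)
reloc_draws = [propose_relocation(100, 160, 300, t_min=30, rng=rng) for _ in range(200)]
```

The mixture trades off the large jumps of $q_1$, which explore the space, against the local moves of $q_2$, which keep the acceptance rate high, mirroring the discussion of $q_1$ and $q_2$ above.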