title: A Linear Transportation $\mathrm{L}^p$ Distance for Pattern Recognition
authors: Crook, Oliver M.; Cucuringu, Mihai; Hurst, Tim; Schönlieb, Carola-Bibiane; Thorpe, Matthew; Zygalakis, Konstantinos C.
date: 2020-09-23

The transportation $\mathrm{L}^p$ distance, denoted $\mathrm{TL}^p$, has been proposed as a generalisation of Wasserstein $\mathrm{W}^p$ distances, motivated by the property that it can be applied directly to colour or multi-channelled images, as well as multivariate time series, without normalisation or mass constraints. These distances, as with $\mathrm{W}^p$, are powerful tools for modelling data with spatial or temporal perturbations. However, their computational cost can make them infeasible to apply to even moderate pattern recognition tasks. We propose linear versions of these distances and show that the linear $\mathrm{TL}^p$ distance significantly improves over the linear $\mathrm{W}^p$ distance on signal processing tasks, whilst being several orders of magnitude faster to compute than the $\mathrm{TL}^p$ distance.

Optimal transport has recently gained popularity because of its ability to model diverse data distributions in the signal and image processing fields [44]. Transportation-based methods have been successfully applied to image analysis, including medical images [6] and facial recognition [45], as well as cosmology [21, 22] and voice recognition. Machine learning and Bayesian statistics have also benefited from transport-based approaches [17, 23, 49, 59-61]. The popularity of optimal transport is, in part, due to the rise in the number of problems in the experimental and social sciences in which techniques are required to compare signal perturbations across spatial or temporal domains. For example, optimal transport based methods for image registration and warping [38] and image morphing [68] have existed for many years. Transportation techniques provide non-linear methods that jointly model locations and intensities, making transportation-based approaches a powerful tool in many problems.

Optimal transport methods, in particular Wasserstein W^p distances, are grounded in a wealth of mathematical theory. Excellent introductions to the advanced mathematical theory of optimal transport are presented in [64, 65], whilst [57] presents the theory from a more applied perspective. Many technical aspects of optimal transport have been explored, including geometric properties [25] and links to evolutionary PDEs [4].

There is much interest in developing efficient methods to compute optimal transport distances and maps. For discrete measures, the optimal transport problem can be solved using linear programming approaches [57]. Since solving a linear programme can be costly, [51] proposed a multi-scale linear programme for efficient computation. Other approaches include flow minimisation techniques [4, 5, 37, 38] and gradient descent approaches [11], as well as multi-scale methods [36, 48]. More recently, Cuturi proposed entropy-regularisation based approaches to compute approximations of the optimal transport problem [14]. These methods have been explored in depth with many extensions [1-3, 7, 15, 35, 47, 59]. In order to efficiently compute pairwise optimal transport distances on a large data set, a framework called linear optimal transport was proposed in [46, 67] for the W^p distance.
In [67] the linear transportation distance was applied to classification tasks for medical, facial and galaxy images, [46] applied the distance to classification problems in medical, facial, galaxy and bird images, [54] used the framework to generate new images, and [9] used the distance to classify jets in collider data.

Higher-order transportation methods have been proposed in the mathematical analysis literature [31, 62]. In [31] García Trillos and Slepčev proposed the transportation L^p (TL^p) distance to define discrete-to-continuum convergence of variational problems on point clouds. This was further extended in [62] to a transportation W^{k,p} distance (where the notation relates to Sobolev spaces). These transportation distances have several key advantages over the optimal transport distance; these include [62]: direct applicability to unnormalised, multi-channel signals, sensitivity to high-frequency perturbations, and improved performance over W^p distances in classification tasks. However, these higher-order transportation distances require the computation of an optimal transport map (on a higher-dimensional space) and this thwarts their application to larger-scale pattern recognition tasks, where pairwise distances are needed.

In this paper we propose a linear approximation of the TL^p distance, which is orders of magnitude faster to compute than the full TL^p distance first proposed in [31], whilst retaining some of its favourable properties. Our proposed method can be seen as an extension of the linear Wasserstein framework (LW^p) [67] to higher-order transportation distances. The linearisation is essentially a projection onto the tangent manifold at a given reference point. The geodesic distance (in this case corresponding to the Wasserstein distance) is approximated by the Euclidean distance in the tangent space. Suppose we wish to compare N signals/images; then application of optimal transport methods would require the computation of N(N−1)/2 distances. In the linear optimal transport framework only N distances need to be computed. From here, signals/images are embedded into Euclidean space, allowing linear statistical methods to be applied whilst preserving much of the geometry of the original optimal transport space. In [67] the optimal transport distance of choice is the W^p distance; here we choose the TL^p distance. This choice allows a more general set of unnormalised, not necessarily non-negative, multi-channel signals to be modelled, which is not possible in the linear W^p framework.

This manuscript begins by reviewing optimal transport and the W^p and TL^p distances. The LW^p framework is reviewed in Section 2.3 and we propose our extension to TL^p in Section 2.4. We then give an overview of interpolation in the TL^p space. A background on numerical methods, more precisely how existing methods for W^p can be adapted to TL^p, is included in the appendix. In Section 3 we apply our method to classification problems in Australian sign language (Section 3.1), breast cancer histopathology (Section 3.2) and financial time series (Section 3.3). We show that linear TL^p (LTL^p) outperforms LW^p and that it has similar performance to TL^p, but is several orders of magnitude faster. Other applications to synthetic data sets and cell morphometry are given in the appendix.

To fix notation, we review the modern Monge-Kantorovich formulation of optimal transport and refer to the excellent monographs [55, 57, 64, 65] for a thorough exposition. Let µ and ν be probability measures on measure spaces X and Y respectively, i.e. µ ∈ P(X) and ν ∈ P(Y). Further, let B(X) denote the Borel σ-algebra on X.
We define the pushforward of a measure µ ∈ P(X) by a function h : X → Z by h_*µ(A) := µ(h^{−1}(A)) for all A ∈ B(Z). The inverse of h is understood in the set-theoretic sense, i.e. h^{−1}(A) := {x : h(x) ∈ A}. We denote by Π(µ, ν) the set of all measures on X × Y such that the first marginal is µ and the second marginal is ν; to be precise, if P_X : X × Y → X and P_Y : X × Y → Y are the canonical projections, then (P_X)_*π = µ and (P_Y)_*π = ν. We call any π ∈ Π(µ, ν) a transportation plan between µ and ν (also called a coupling between µ and ν).

The Kantorovich formulation of optimal transport, for a cost function c : X × Y → [0, ∞), is

$$K(\mu,\nu) = \min_{\pi\in\Pi(\mu,\nu)} \int_{X\times Y} c(x,y)\,\mathrm{d}\pi(x,y), \qquad (2.1)$$

whilst the Monge formulation is

$$M(\mu,\nu) = \min_{T\,:\,T_*\mu=\nu} \int_X c(x,T(x))\,\mathrm{d}\mu(x). \qquad (2.2)$$

It is worthwhile noting that the formulations of the optimal transport problem in equations (2.1) and (2.2) are not, in general, equivalent. However, if the optimal transport plan π† can be written in the form π† = (Id × T†)_*µ, then it follows that T† is an optimal transport map and the two formulations are equivalent, i.e. K(µ, ν) = M(µ, ν). A sufficient condition for such an optimal transport plan to exist is that µ is absolutely continuous with respect to the Lebesgue measure on a compact domain Ω ⊂ R^d and, in addition, c(x, y) = h(x − y) where h : Ω → [0, ∞) is strictly convex and superlinear; see [64, Theorem 2.44]. Note that it is easy to find examples where no transport map exists at all: for example, if µ = (1/3)(δ_{x_1} + δ_{x_2} + δ_{x_3}) and ν = (1/2)(δ_{y_1} + δ_{y_2}), then no map can push µ forward to ν, since the mass of a point of µ would have to be split.

Let Ω ⊂ R^d and let P_p(Ω) be the set of Radon measures on Ω with finite p-th moment. For p ∈ [1, ∞), we define the Wasserstein distance between µ and ν in P_p(Ω) by

$$d_{W^p}(\mu,\nu) = \left( \min_{\pi\in\Pi(\mu,\nu)} \int_{\Omega\times\Omega} |x-y|_p^p \,\mathrm{d}\pi(x,y) \right)^{1/p}.$$

The Wasserstein space W^p is the metric space (P_p(Ω), d_{W^p}). For the case p = ∞, we can define a distance on P_∞(Ω) by

$$d_{W^\infty}(\mu,\nu) = \inf_{\pi\in\Pi(\mu,\nu)}\ \pi\text{-}\operatorname{ess\,sup}_{(x,y)\in\Omega\times\Omega} |x-y|.$$

We briefly review the features that make optimal transport particularly suited to signal and image processing; for an extended survey of these ideas see [44]. Optimal transport is able to provide generative models which can represent diverse data distributions and can capture signal variations that result from spatial perturbations. Furthermore, there is a well-formulated theoretical basis (particularly for W^2) with interesting geometrical properties, such as the existence of minimisers [24, 65], the Riemannian structure of Wasserstein spaces when p = 2 [4, 53] and the characterisation of weak* convergence when Ω is compact [57]. The Riemannian structure allows the characterisation of geodesics (shortest curves) on the space P_2(Ω). In addition, there are many methods to compute optimal transport distances; for convenience, we review a selection in the appendix in the context of computing the TL^p distance.

The TL^p distance was first introduced in [31] to define a discrete-to-continuum convergence on point clouds. This tool has been extensively used to study similar statistical problems, e.g. [12, 16, 26-30, 32-34, 52, 58, 63], and recently has been shown to be a valuable tool in signal analysis [20, 62]. In this section, we review the definitions and properties of the TL^p distance and space. Given an open and bounded domain Ω ⊂ R^d we define the TL^p space as the set of pairs (µ, f) such that f ∈ L^p(µ; R^m) and µ ∈ P_p(Ω). We do not make any assumption on the dimension m of the range of f. Working in this formal and abstract framework of measure theory allows us to formulate our methods for both discrete and continuous signals simultaneously.
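To make the discrete case concrete: for two uniform measures with supports of equal size, the Kantorovich problem (2.1) reduces to an assignment problem and can be solved exactly. The following is a minimal sketch of computing d_{W^p} this way — our own illustration, not code from the paper; the function name and the choice of SciPy solver are assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def wasserstein_p(x, y, p=2):
    """W^p distance between uniform measures on equal-size point sets x, y."""
    cost = cdist(x, y, metric="minkowski", p=p) ** p  # C_ij = |x_i - y_j|_p^p
    row, col = linear_sum_assignment(cost)            # optimal permutation (a Monge map)
    return cost[row, col].mean() ** (1.0 / p)         # uniform weights 1/n

# Example: two random point clouds in the plane.
x = np.random.rand(100, 2)
y = np.random.rand(100, 2)
print(wasserstein_p(x, y))
```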
Importantly, similarly to the Wasserstein distance but unlike the L^p distance, this framework allows us to compare signals with different discretisations, since µ and ν need not have the same support. We define the TL^p space as

$$\mathrm{TL}^p(\Omega) = \left\{ (\mu, f) \,:\, \mu\in\mathcal{P}_p(\Omega),\ f\in L^p(\mu;\mathbb{R}^m) \right\}.$$

We construct the TL^p metric between pairs (µ, f) ∈ TL^p and (ν, g) ∈ TL^p as follows:

$$d_{TL^p}((\mu,f),(\nu,g)) = \left( \min_{\pi\in\Pi(\mu,\nu)} \int_{\Omega\times\Omega} |x-y|_p^p + |f(x)-g(y)|_p^p \,\mathrm{d}\pi(x,y) \right)^{1/p}. \qquad (2.4)$$

Intuitively, we see that TL^p optimal transport plans (that is, plans in Π(µ, ν) that achieve the minimum in the above variational problem) strike a balance between matching spatially, i.e. minimising ∫_{Ω×Ω} |x − y|_p^p dπ(x, y), and matching signal features, i.e. minimising ∫_{Ω×Ω} |f(x) − g(y)|_p^p dπ(x, y).

Example 2.1. Let us explain here how images can be represented in TL^p. Let {x_i}_{i=1}^n be the locations of pixels (which usually form a grid over [0, 1] × [0, 1]). We apply TL^p by choosing a base measure µ; this is commonly the uniform measure over {x_i}_{i=1}^n. For RGB images, f(x_i) ∈ R^3 are the RGB values for the pixel at location x_i. Similarly, for greyscale images one would have f(x_i) ∈ R, where f(x_i) is now the greyscale value for the pixel at location x_i. Of course, one can make different choices for µ in order to emphasise regions/features of the images. We note that, as we represent the image as a function f, we do not need to assume that the image has unit mass. To apply W^p we would represent the image as a probability measure µ, and we would therefore have to assume that the image has unit mass (or renormalise). This strong assumption limits the applicability of optimal transport in image processing because it requires ad-hoc renormalisation which may suppress features of the image.

To further understand the TL^p distance, we reformulate it as a W^p distance between measures supported on the graphs of functions. Recall that the graph of a function is defined as

$$\mathrm{Gra}(f) := \left\{ (x, f(x)) \,:\, x\in\Omega \right\}.$$

Note that Gra(f) ⊂ Λ := Ω × R^m. We define the following lifted measure on Gra(f):

$$\tilde\mu(A\times B) := \mu\left( \{ x\in A \,:\, f(x)\in B \} \right),$$

where A × B ⊂ Λ. It is clear that μ̃ is a well-defined measure on Gra(f). We can characterise the TL^p distance as a W^p distance in the following way [31]:

$$d_{TL^p}((\mu,f),(\nu,g)) = d_{W^p}(\tilde\mu,\tilde\nu). \qquad (2.5)$$

Thus, we can see that the TL^p distance is the W^p distance between the appropriate measures on the graphs of functions. This allows us to make the following identification between W^p and TL^p through the mapping

$$\mathrm{TL}^p \ni (\mu,f) \mapsto \tilde\mu = (\mathrm{Id}\times f)_*\mu \in \mathcal{P}_p(\Lambda). \qquad (2.6)$$

This connection of the TL^p distance and the W^p distance facilitates the transfer of certain Wasserstein properties to the TL^p setting; for example, metric properties and the existence of minimisers. It is easy to see that, for any µ, ν ∈ P_p(Ω) and f, g ∈ L^p(µ),

$$d_{W^p}(\mu,\nu) \le d_{TL^p}((\mu,f),(\nu,g)) \quad\text{and}\quad d_{TL^p}((\mu,f),(\mu,g)) \le \|f-g\|_{L^p(\mu)}.$$

In fact, one can also prove the converse inequalities (up to a constant) and hence TL^p can be seen to generalise both weak* convergence of measures and L^p convergence of functions (see [31] or Proposition 2.2 below). The TL^p distance can be seen as a special case of optimal transport by observing that d_{TL^p}((µ, f), (ν, g)) coincides with the Kantorovich optimal transport problem between two measures µ and ν with cost function c(x, y; f, g) = |x − y|_p^p + |f(x) − g(y)|_p^p. For reference, we also state a Monge-type formulation of the TL^p distance as follows:

$$d_{TL^p}((\mu,f),(\nu,g)) = \left( \min_{T\,:\,T_*\mu=\nu} \int_\Omega |x-T(x)|_p^p + |f(x)-g(T(x))|_p^p \,\mathrm{d}\mu(x) \right)^{1/p}, \qquad (2.7)$$

where T is a transportation map. When we write the Monge formulation of optimal transport we are assuming that there is an equivalence between (2.4) and (2.7). This is in general difficult to verify, since the application of Brenier's theorem does not lead to natural conditions.
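Since the TL^p cost |x − y|_p^p + |f(x) − g(y)|_p^p is exactly the p-th power of the p-norm on the graph space Λ = Ω × R^m, the discrete TL^p distance can be computed by lifting each signal onto its graph and reusing any W^p solver. A hedged sketch follows; `wasserstein_p` is the illustrative helper from the earlier block, not an API from the paper.

```python
import numpy as np

def tlp_distance(x, f, y, g, p=2):
    """TL^p distance between (mu, f) and (nu, g) for uniform discrete measures:
    W^p between the lifted measures supported on Gra(f) and Gra(g)."""
    graph_f = np.column_stack([x, f])  # points (x_i, f(x_i)) in Omega x R^m
    graph_g = np.column_stack([y, g])  # points (y_j, g(y_j))
    # |(x, f(x)) - (y, g(y))|_p^p = |x - y|_p^p + |f(x) - g(y)|_p^p
    return wasserstein_p(graph_f, graph_g, p=p)
```

Note that rescaling the spatial coordinates relative to the signal values before stacking changes the balance between spatial and feature matching, which is why later experiments put both on the same scale.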
(Assuming that µ does not give mass to small sets and that both µ and ν have a sufficient number of bounded moments, one can apply Brenier's theorem to K(µ, ν), where c(x, y) = |x − y|_p^p + |f(x) − g(y)|_p^p and K is defined by (2.1), provided c is strictly convex; in practice this is not a reasonable assumption.) However, when µ and ν are discrete uniform measures with supports of equal size, the Monge formulation (2.7) coincides with the Kantorovich formulation (2.4) (see the proposition below).

Let us recall the identification (2.5)-(2.6) and the identity d_{W^p}(μ̃, ν̃) = d_{TL^p}((µ, f), (ν, g)). Then, there is a corresponding equivalence between transport maps. That is (assuming all transport maps exist and are unique), let T† achieve the minimum in (2.7) and let T̃† achieve the minimum in M(μ̃, ν̃), where M is given by (2.2) with c(x, y) = |x − y|_p^p. It follows that T̃†(x, f(x)) = (T†(x), g(T†(x))) for µ-almost every x.

Proposition 2.2 (equivalence for empirical measures; see [31] for the full statement). If µ = (1/n)Σ_{i=1}^n δ_{x_i} and ν = (1/n)Σ_{j=1}^n δ_{y_j}, then for any f ∈ L^p(µ) and g ∈ L^p(ν) there exists T† : {x_i}_{i=1}^n → {y_j}_{j=1}^n with T†_*µ = ν achieving the minimum in (2.7); i.e. the Monge and Kantorovich formulations of TL^p (given by (2.7) and (2.4) respectively) are equivalent for point masses.

Note that not all properties of W^p carry through to TL^p. For example, (TL^p, d_{TL^p}) is not complete. Indeed, following [31], let Ω = (0, 1) and note that the sequence {(µ_n, f_n)}, where f_n(x) = sign(sin(2^n πx)) and µ_n = L|_{(0,1)} (the Lebesgue measure on (0, 1)), is a Cauchy sequence in (TL^p, d_{TL^p}). However, {f_n} does not converge in L^p and therefore {(µ_n, f_n)} cannot converge in TL^p, by part 3 of the above proposition. The completion of TL^p can be identified with the set of Young measures, and therefore with the space P(Ω × R^m); see [31, Remark 3.6]. We note also that geodesics do not exist in TL^p; however, we develop an approach to interpolate in TL^p (see Section 2.5).

By part 3 of the above proposition, TL^p inherits some sensitivity to high-frequency perturbations from the L^p norm. In contrast, as W^p metricises weak* convergence (in compact Euclidean spaces), W^p is insensitive to high-frequency perturbations; see [62, Section 2.2]. We can also deduce that the TL^p distance inherits translation sensitivity from W^p. In particular, L^p distances are insensitive to translations once the supports of the images are disjoint. On the other hand, the W^p distance scales linearly with the size of the translation, no matter how large the translation. The TL^p distance, although not scaling linearly with translation, is monotonically non-decreasing as a function of translation.

The TL^p distance therefore appears to be an excellent tool to exploit in pattern recognition problems involving images or time series; however, it is as computationally demanding as W^p and, despite recent advances in the computation of optimal transport, e.g. [14], it is still challenging to apply to large-scale problems. In the next section, we review the linear Wasserstein (LW^p) framework, which was introduced to allow the application of optimal transport methods to classification problems [67]. In later sections, we apply the ideas of linear optimal transport to TL^p.

The LW^p framework was proposed in [67] as a way to apply optimal transport techniques (in particular W^p) to large-scale classification problems in image analysis. Given a set of N images, one would need to compute all pairwise W^p distances in order to use methods such as k-nearest neighbour classifiers. The LW^p framework was developed so that the Wasserstein distance needs to be computed only N times. In particular, it is the optimal Wasserstein transport maps from a reference to each signal that are computed.
From here, the images are embedded in Euclidean space, therefore allowing linear statistical techniques to be applied [66]. This technique was successfully applied in [6] to detect morphological differences in cancer cells. The technique has been further refined and extended to super-resolution images [45, 46]. In this section, we briefly review the ideas of LW^p.

The idea behind the LW^p framework is to find the optimal Wasserstein transport maps with respect to one (reference) measure. For simplicity we assume that the Monge problem is equivalent to the Kantorovich problem and, in particular, that optimal transport maps exist. Via an embedding of the transport maps into Euclidean space, the W^p distance between any two measures is estimated. More precisely, the LW^p framework provides a linear embedding for P_p(Ω) with respect to a fixed measure σ ∈ P_p(Ω) [44]. This means the Euclidean distance between an embedded measure and the fixed measure σ is equal to the W^p distance between that measure and σ. The Euclidean distance between any two measures is then an approximation to the Wasserstein distance between these measures. These linear embeddings then facilitate the application of standard statistical techniques such as PCA, LDA and K-means. The LW^p framework is also invertible, and so synthetic, but physically plausible, signals can be realised [54].

Let µ_1, µ_2 ∈ P(Ω) and let σ ∈ P(Ω) be our reference measure. Throughout this section, we assume that optimal transport maps T_{µ_i} : Ω → Ω exist between σ and µ_i, i.e. T_{µ_i} achieves the minimum in (2.2) with c(x, y) = |x − y|_p^p and (T_{µ_i})_*σ = µ_i. If optimal transport maps do not exist then one can still define the LW^p distance, but there is not a natural way to embed this distance into Euclidean space; we refer to [67, Section 2.3] on how to define the LW^p distance using generalised geodesics. In the setting considered here the linear Wasserstein distance, LW^p, is defined by [67]:

$$d_{LW^p,\sigma}(\mu_1,\mu_2) := \left( \int_\Omega |T_{\mu_1}(x) - T_{\mu_2}(x)|_p^p \,\mathrm{d}\sigma(x) \right)^{1/p}.$$

We observe that d_{LW^p,σ} is a metric and that d_{LW^p,σ}(σ, µ) = d_{W^p}(σ, µ). Let us assume that σ has a density ρ with respect to the Lebesgue measure and define

$$P_c(\mu) := (T_\mu - \mathrm{Id})\,\rho^{1/p}. \qquad (2.8)$$

The map P_c is our linear embedding from the Wasserstein space to Euclidean space; by construction,

$$\| P_c(\mu_1) - P_c(\mu_2) \|_{L^p(\Omega)} = d_{LW^p,\sigma}(\mu_1,\mu_2). \qquad (2.9)$$

We make the following claims on the embedding.

Proposition 2.3. Assume Ω ⊂ R^d is bounded and σ ∈ P(Ω) has a density ρ with respect to the Lebesgue measure. Define P = P_c where P_c is given by (2.8). Then, the following holds:
1. P(µ) ∈ L^p(Ω) for any µ ∈ P(Ω);
2. P(σ) = 0;
3. ‖P(µ_1) − P(µ_2)‖_{L^p(Ω)} = d_{LW^p,σ}(µ_1, µ_2);
4. ‖P(µ) − P(σ)‖_{L^p(Ω)} = d_{W^p}(σ, µ).

Proof. Since σ has a density with respect to the Lebesgue measure, the transport maps T_µ exist, which gives (1); (2) is immediate from T_σ = Id; (3) was shown already in (2.9). Finally, (4) follows from (3), where we use P(σ) = 0.

We make a similar definition for discrete measures. Let σ = (1/n)Σ_{i=1}^n δ_{x_i} and define

$$P_d(\mu) := n^{-1/p}\left( T_\mu(x_1) - x_1, \dots, T_\mu(x_n) - x_n \right). \qquad (2.10)$$

Analogously to the Lebesgue density case we have |P_d(µ_1) − P_d(µ_2)|_p = d_{LW^p,σ}(µ_1, µ_2), where we recall that |·|_p is the Euclidean p-norm: |x|_p := (Σ_{j=1}^n |x_j|^p)^{1/p}. For discrete σ the map P_d is our linear embedding from the Wasserstein space to Euclidean space.

Proposition 2.4. Let σ = (1/n)Σ_{i=1}^n δ_{x_i} and let µ, µ_1, µ_2 be uniform discrete measures, each supported on n points. Then the analogues of the claims in Proposition 2.3 hold with P = P_d given by (2.10).

Proof. The particular forms of the measures σ, µ, µ_1, µ_2 are enough to guarantee that the transport maps T_µ, T_{µ_1}, T_{µ_2} all exist. The proof is then analogous to the proof of Proposition 2.3.

Example 2.5. Let us consider how to generate a new image using the linear embedding. Suppose we have a reference measure σ ∈ P(R^d) with density ρ and a set of measures {µ_i}_{i=1}^N with optimal transport maps T_{µ_i}, which form the linear embedding through α_i = P_c(µ_i) = (T_{µ_i} − Id)ρ^{1/p}. Given a new point α in the linear space we can define a transport map by T = αρ^{−1/p} + Id. We generate a new image by µ = T_*σ. Note that to generate the new image we only required the reference measure σ and a new point α in the linear space.
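In the uniform discrete setting, the embedding (2.10) amounts to solving one assignment problem per measure against the reference and stacking the scaled displacements. A minimal sketch under the same assumptions as the earlier blocks; the function names are illustrative and not from the paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def lwp_embed(ref, measures, p=2):
    """Linear W^p embedding (2.10) of uniform discrete measures w.r.t. ref.
    ref: (n, d) support of sigma; measures: list of (n, d) supports."""
    n = len(ref)
    alphas = []
    for mu in measures:
        cost = cdist(ref, mu, metric="minkowski", p=p) ** p
        row, col = linear_sum_assignment(cost)     # row indices come back sorted
        T = mu[col]                                # optimal map x_i -> T(x_i)
        alphas.append((T - ref) / n ** (1.0 / p))  # P_d(mu)
    return np.stack(alphas)                        # shape (N, n, d)

def lwp_distance(alpha_1, alpha_2, p=2):
    """Euclidean p-norm between embeddings, i.e. d_{LW^p, sigma}."""
    return float((np.abs(alpha_1 - alpha_2) ** p).sum() ** (1.0 / p))
```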
However, in order to generate the new point α it will often be sensible to use the statistics of {α_i}_{i=1}^N; for example, see [54]. In both the Lebesgue density and uniform discrete cases, P preserves the W^p distance between the reference measure and any given µ (where for discrete measures µ is also uniform discrete). Between two measures µ_1 and µ_2, neither of which is the reference, the Euclidean distance |P(µ_1) − P(µ_2)|_p is only an approximation of d_{W^p}(µ_1, µ_2). The next section proposes our extension of the LW^p framework to the TL^p distance.

In this section, we propose a linear TL^p framework. Recall that the TL^p distance can be defined as an optimal transport distance between measures supported on the graph of a function. Let (σ, h) ∈ TL^p be the TL^p reference signal and σ̃ = (Id × h)_*σ ∈ P_p(Λ) the measure in Λ = Ω × R^m with support on the graph of h. Let (µ_i, f_i) ∈ TL^p, i = 1, 2, and define μ̃_i = (Id × f_i)_*µ_i. As in the previous section, we assume that optimal transport maps T̃_{μ̃_i} : Λ → Λ exist between σ̃ and μ̃_i, i.e. (T̃_{μ̃_i})_*σ̃ = μ̃_i. The linear TL^p distance is then defined by

$$d_{LTL^p,(\sigma,h)}((\mu_1,f_1),(\mu_2,f_2)) := \left( \int_\Lambda |\tilde{T}_{\tilde\mu_1}(z) - \tilde{T}_{\tilde\mu_2}(z)|_p^p \,\mathrm{d}\tilde\sigma(z) \right)^{1/p}.$$

Simple manipulations of the LTL^p distance imply that d_{LTL^p,(σ,h)}((σ, h), (µ_i, f_i)) = d_{TL^p}((σ, h), (µ_i, f_i)).

Following the construction of the embedding in the previous section, we go directly to the discrete case (since σ̃ has support on the graph, it cannot have a density with respect to the Lebesgue measure). We assume that σ = (1/n)Σ_{i=1}^n δ_{x_i}, and hence σ̃ = (1/n)Σ_{i=1}^n δ_{(x_i, h(x_i))}. Given this definition we can write

$$\tilde{P}_d(\mu,f) := n^{-1/p}\left( \tilde{T}_{\tilde\mu}(x_i, h(x_i)) - (x_i, h(x_i)) \right)_{i=1}^n. \qquad (2.11)$$

The map P̃_d embeds our signals into Euclidean space. We have the following properties of the embedding (analogous to Propositions 2.3 and 2.4).

Proposition 2.6. Let (σ, h), (µ, f), (µ_1, f_1), (µ_2, f_2) ∈ TL^p, where σ, µ, µ_1, µ_2 are uniform discrete measures, each supported on n points, and let P̃ = P̃_d be given by (2.11). Then the analogues of the claims in Proposition 2.3 hold.

Proof. By Proposition 2.2(5) the transport maps T_µ, T_{µ_1}, T_{µ_2} exist. The rest of the proof follows as in the proof of Proposition 2.3.

As for LW^p, we have that LTL^p is exactly TL^p when comparing with the reference measure, i.e. d_{LTL^p,(σ,h)}((σ, h), (µ, f)) = d_{TL^p}((σ, h), (µ, f)). When we are comparing two measures, neither of which is the base measure, then we make the approximation

$$d_{TL^p}((\mu_1,f_1),(\mu_2,f_2)) \approx d_{LTL^p,(\sigma,h)}((\mu_1,f_1),(\mu_2,f_2)).$$

When (µ_i, f_i) ≠ (σ, h) the approximation is only formal. In particular, deriving quantitative bounds between LTL^p and TL^p requires a detailed analysis of the TL^p space, including quantitative estimates on curvature. To the authors' knowledge there is not yet such a bound between the W^p and LW^p distances, although recently [50] obtained bounds for some perturbations.

Example 2.7. Let us consider how to generate a new TL^p image from the linear space. We recall that colour images can be represented by (µ, f) with µ = (1/n)Σ_{i=1}^n δ_{x_i}, x_i ∈ R^2 and f(x_i) ∈ R^3, so that the lifted measure is supported in R^5. To generate a new image we need to invert this mapping. Let α = (α_1, . . . , α_n) ∈ R^{5n} and define T_i = n^{1/p}α_i + (x_i, h(x_i)) ∈ R^5. Then y_i = (T_i)_{1:2} ∈ R^2 are the locations of the pixels and c_i = (T_i)_{3:5} are the RGB values in the new image. In the TL^p space the new image is represented by (ν, g), where ν = (1/n)Σ_{i=1}^n δ_{y_i} and g(y_i) = c_i. This is only well defined if the y_i are all unique. If not, then we use barycentric projection (see also Section 2.5): for example, if y_i = y for all i ∈ I then we define g(y) = (1/|I|)Σ_{i∈I} c_i to be the empirical average.

The space (P_p(Ω), d_{W^p}) is a geodesic space, with easily characterisable geodesics. Letting π† ∈ Π(σ, µ) be the optimal transport plan that minimises the transport problem given by (2.1), we define I_t : Ω × Ω → Ω, where t ∈ [0, 1], to be the linear interpolation

$$I_t(x,y) = (1-t)x + ty.$$

Then the geodesic in W^p is given by µ(t) = [I_t]_*π†. When there exist transport maps, i.e. π† = (Id × T_µ)_*σ, the geodesic can be written µ(t) = ((1 − t)Id + tT_µ)_*σ. Since P(µ(t)) = ((1 − t)Id + tT_µ − Id)ρ^{1/p} = t(T_µ − Id)ρ^{1/p} = tP(µ), we see that the projection of the geodesic onto the Euclidean space is the geodesic between the projections. In particular, the geodesic between P(σ) = 0 and P(µ) in Euclidean space is simply tP(µ).
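Computationally, the LTL^p embedding (2.11) is just the LW^p embedding applied to the lifted point sets. A sketch reusing the illustrative `lwp_embed` helper from the earlier block (again our own presentation, not the authors' code):

```python
import numpy as np

def ltlp_embed(ref_x, ref_h, signals, p=2):
    """Linear TL^p embedding (2.11): lift (sigma, h) and each (mu_i, f_i) onto
    their graphs in Omega x R^m, then embed with the linear W^p machinery."""
    ref_graph = np.column_stack([ref_x, ref_h])
    lifted = [np.column_stack([x_i, f_i]) for x_i, f_i in signals]
    return lwp_embed(ref_graph, lifted, p=p)
```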
The same argument holds in the discrete case where P = P_d is defined by (2.10). Since the projection is invertible (at least in some open neighbourhood of the reference measure) we can map from the Euclidean embedding back to W^p. Notably, this allows one to translate principal eigenvectors in PCA space (of the linear embedding) into modes of variation in W^p; see [67] for more details. This argument does not directly apply to the TL^p space since, by the following remark, the TL^p space does not permit geodesics.

Remark 2.8. Consider the measure µ = (1/2)δ_0 + (1/2)δ_1 and the functions f(0) = 0, f(1) = 10, g(0) = 10, g(1) = 0. Then the transport between (µ, f) and (µ, g) is from (0, 0) to (1, 0) and from (1, 10) to (0, 10). The "half way" point would be the measure µ_{1/2} = δ_{1/2} together with the "function" that takes both values 10 and 0 at x = 1/2, which is not a function.

However, this does not prevent us from interpolating and visualising modes of variation. The curve

$$\tilde\mu_t = \left( (1-t)\mathrm{Id} + t\tilde{T}_{\tilde\mu} \right)_*\tilde\sigma, \qquad t\in[0,1],$$

interpolates between the signals (σ, h) and (µ, f). In fact, this is the geodesic in W^p(Ω × R^m) between σ̃ and μ̃. Although we can invert P = P̃_d (defined in (2.11)) in the Wasserstein space, i.e. for all p ∈ R^{n(d+m)} there exists ν̃ ∈ P(Ω × R^m) such that P(ν̃) = p (note that this is P and not P̃ since we are inverting with respect to W^p(Ω × R^m)), we cannot guarantee that ν̃ can be written in the form ν̃ = (Id × g)_*ν for some g ∈ L^p(ν). Hence, we cannot in general invert the linear embedding back into TL^p. Instead we use an approximate inversion. We define P̃^{−1}(p) = (ν, ḡ), where ν̃ satisfies P(ν̃) = p, ḡ(x) = ∫_{R^m} y dν̃_x(y) is the mean of the fibre ν̃_x, and (where we use the disintegration of measures) ν̃ = ν̃_x ⊗ ν, with the latter meaning

$$\int_{\Omega\times\mathbb{R}^m} \varphi(x,y)\,\mathrm{d}\tilde\nu(x,y) = \int_\Omega \int_{\mathbb{R}^m} \varphi(x,y)\,\mathrm{d}\tilde\nu_x(y)\,\mathrm{d}\nu(x) \quad\text{for all bounded continuous } \varphi.$$

In other words, we define the "inverse" map from the linear embedding of TL^p back into TL^p as the inverse map in W^p(Ω × R^m) projected onto TL^p(Ω). The projection onto TL^p is done by taking the mean across each fibre in x ∈ Ω. We note that the second step does not have to be performed in the linear Wasserstein setting. With this definition we are also able to visualise any point in the linear embedding in TL^p space.

In this section we apply the LTL^p framework to three real-world examples: Auslan (Australian sign language), breast cancer histopathology and financial time series. We also include two synthetic examples and a further application to cell morphometry in the appendix. Throughout we choose p = 2.

We apply the transportation methodology presented in this manuscript to the Australian Sign Language (Auslan) dataset [43]. A native Auslan signer was recorded, using Fifth Dimension Technologies gloves, making 95 different signs repeated over a period of 9 weeks. Each sign was repeated 3 times at each recording; thus each word was measured 27 times, giving a total of 2565 signs in the dataset. Each measurement is considered as a multivariate time series. The measurements taken for each hand are the x, y, z positions, along with roll, pitch and yaw. In addition, the bend of each of the 5 fingers is recorded. Thus at each frame 22 measurements are observed. We consider the Auslan data as functions f_i taking values in R^{22} on a grid of 44 time frames; we truncate the number of time frames to 44 because little variation was observed past this point. The goal of this task is to classify signs, given f_i, i = 1, ..., 2565, as input, with the 95 possible words (labels) as output. We apply the LTL^p and LW^p frameworks to this dataset. Since TL^p can handle multi-channel signals, no additional pre-processing was needed.
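The approximate inversion above takes the mean across each fibre; in practice, when a solver returns a transport plan rather than a map (e.g. the entropy-regularised solvers of Section A.1), the analogous operation is the barycentric projection of the plan. A sketch, with an illustrative name of our own:

```python
import numpy as np

def barycentric_projection(plan, targets):
    """Collapse a transport plan to an approximate map:
    T(x_i) = (sum_j pi_ij * y_j) / (sum_j pi_ij), the mean of each fibre."""
    return (plan @ targets) / plan.sum(axis=1, keepdims=True)
```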
However, to apply W^p additional pre-processing was required. Firstly, all signals were shifted to be positive and then averaged across channels so that there was only a single channel with positive values. We then normalised so that each signal integrated to unity. A linear embedding was obtained as described in the previous sections. Once this linear embedding is obtained, we use the 1-nearest-neighbour (1NN) algorithm to predict the signs from the linear embedding of the signals. As an assessment of performance we use the macro-F1 score (the harmonic mean of the precision and recall) [39]. We assess performance with a 5-fold cross-validation framework and repeat 100 times to produce a distribution of scores. In addition, we compare to the standard TL^p methodology, since for this particular dataset, even though costly, it is possible to perform the full computation. We also recorded timings for each of the methods.

Table 1 shows that the linear transportation methods are considerably faster than the full transportation methods. Indeed, the LTL^p distance took on the order of tens of seconds, whilst the full TL^p took several hours. Figure 1 demonstrates that our proposed LTL^p method significantly outperforms the LW^p approach (t-test, p < 10^{−4}) on the Auslan data. There is a loss in classification ability of LTL^p versus the TL^p method, which is unsurprising as the linear transportation method is approximate. However, on the Auslan dataset we observe this difference to be small, and this minor improvement comes at a computational cost orders of magnitude greater.

In this section, we demonstrate the applicability of LTL^p to images from breast cancer histopathology. Invasive ductal carcinoma is an aggressive and common form of breast cancer, and deep learning based approaches have been used to construct classifiers to analyse such data [13, 42]. We analyse these datasets using transportation-based approaches. We randomly sample 100 images each from two patients, one healthy and one cancerous, totalling 200 images. Each image is on a 50 × 50 pixel grid. To apply LW^p to these images we first convert the images to a single intensity channel and renormalise so that the intensities integrate to unity. We apply the LTL^p approach without any ad-hoc pre-processing, since it can be directly applied to unnormalised multi-channelled images. We compute transport maps in each case using entropy-regularised approaches and then linearly embed these images as described in earlier sections. We visualise the linear embeddings using PCA. In addition, we again employ the 1NN classifier using the same framework as in the Auslan application. We report distributions of macro-F1 scores for both LW^p and LTL^p.

From Table 1 we see that the computational costs of the LTL^p and LW^p distances differ more than might be expected. This is due to the LTL^p distance converging more quickly (and therefore needing fewer iterations) than the LW^p distance (and conversely in the financial time series example). The TL^p distance was not included as a comparison as it was too expensive to compute. Figure 2 demonstrates clear differences between the LTL^p embedding and the LW^p embedding. The cancerous and healthy images separate more obviously in the PCA representation of the LTL^p embeddings. This is supported when using the 1NN classifier, where a mean macro-F1 score of 0.70 is reported in the LW^p case, whilst for LTL^p the mean macro-F1 score is 0.88, representing a greater than 25% improvement.
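The evaluation pipeline shared by the Auslan and histopathology experiments (1NN on the flattened embedding, macro-F1 under 5-fold cross-validation) can be sketched as follows. The scikit-learn estimators are our choice of tooling, not necessarily the authors':

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import f1_score

def one_nn_macro_f1(embeddings, labels, n_splits=5, seed=0):
    """Mean macro-F1 of a 1NN classifier on a linear embedding, k-fold CV."""
    labels = np.asarray(labels)
    X = embeddings.reshape(len(labels), -1)  # flatten (N, n, d) -> (N, n*d)
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train, test in cv.split(X, labels):
        clf = KNeighborsClassifier(n_neighbors=1).fit(X[train], labels[train])
        scores.append(f1_score(labels[test], clf.predict(X[test]), average="macro"))
    return float(np.mean(scores))
```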
It is clear from the box plots that LTL^p outperforms LW^p (t-test, p < 10^{−16}). Using linear interpolation, we visualise perturbations, in units of standard deviation, in principal component space as synthetic images (Figure 3). The interpolation in the embedding produced by LTL^p demonstrates localised mass moving from the centre of the image towards the edges. This corresponds to cancer invading the milk ducts in cancerous tissue, with open milk ducts in non-cancerous tissue. Linear interpolation in the LW^p embedding visualises mass moving from the lower right to the upper left of the plot; there is no physical interpretation for this variation. It is clear that the LTL^p synthetic images are more interpretable than the LW^p synthetic images.

[Figure 3: Variations along the first principal component in the application to breast cancer histopathology, for both the LTL^p embedding and the LW^p embedding.]

In this section, we consider an application of transportation distances to financial time series data. More specifically, we use daily close prices for constituents of the S&P 1500 index, for the period 2nd January 2004 to 23rd March 2020. We only consider instruments (stocks) available throughout the entire history, which amounts to approximately n = 1150. We use daily log-returns, with the return of instrument i between times t_1 and t_2 defined as

$$R_{i,(t_1,t_2)} = \log\left( \frac{P_{i,t_2}}{P_{i,t_1}} \right),$$

where P_{i,t} denotes the price of instrument i at the end of day t. On any given day t, we consider the matrix S_t = [P_{t−m+1}, ···, P_t] of size n × m (where P_t is the vector [P_{1,t}, ···, P_{n,t}]^⊤), capturing the daily returns for the past m days (with fixed m = 20 throughout the experiments). We refer to S_t as the sliding window, as we vary the time component. The goal is to use and compare various techniques for computing the k nearest neighbours of S_t, which we denote by

$$N_k(S_t) = \left\{ t_j : j = 1, \dots, k \ \text{ s.t. } S_{t_1}, S_{t_2}, \dots, S_{t_k} \text{ are the } k \text{ nearest neighbours of } S_t \right\},$$

by pooling together the similarities between the multivariate time series comprising S_t and the prior historical time series S_m, . . . , S_{t−h}, where h is the future horizon at which we aim to predict. Note that, at any given time t, the available history to query for the k nearest neighbours of S_t ends at S_{t−h}, in order to avoid forward-looking data snooping. Figure 4 is a schematic diagram of our pipeline.

Future returns. We consider forward-looking returns (referred to as targets) of two different types, and at various horizons. The future return can be the raw return itself: let

$$f^{(h,\mathrm{RR})}_{i,t} = \log\left( \frac{P_{i,t+h}}{P_{i,t}} \right)$$

denote the raw return (RR) of instrument i at time t, with future horizon h. In our setting, we consider the future horizons h ∈ {1, 3, 5, 10} days. One often uses the S&P 500 as a proxy for the entire market return. The S&P 500 is a stock market index that measures the stock performance of 500 large companies listed on stock exchanges in the US market. The index has a corresponding ETF (Exchange Traded Fund), which can be traded much like any regular stock (with symbol SPY), and we denote its raw return by f^{(h,RR)}_{SPY,t} = log(P_{SPY,t+h}/P_{SPY,t}). With this in mind, for any instrument i, one can then consider its future market-excess return (MR) as

$$f^{(h,\mathrm{MR})}_{i,t} = f^{(h,\mathrm{RR})}_{i,t} - \beta_i\, f^{(h,\mathrm{RR})}_{\mathrm{SPY},t}.$$

For simplicity, we assume β_i = 1 across all instruments i = 1, . . . , n, though there are various techniques to infer the individual betas from historical prices [40].
We remark that the main reason for benchmarking our predictions against market-excess returns, as opposed to only raw returns, is essentially to hedge away the market risk and increase the Sharpe ratio score defined further below.

Estimates of future returns. Once we have identified the k-nearest-neighbour (knn) periods of the current window S_t, the prediction made at time t, for a given horizon h, is a weighted sum of the corresponding historical future returns, where the weights are inversely proportional to the knn distances. More precisely, if we denote by f̂^{(h,RR)}_t the n × 1 vector of forecasts for the future h-day raw returns at time t, its values are given by

$$\hat{f}^{(h,\mathrm{RR})}_t = \sum_{t_j\in N_k(S_t)} w_j\, f^{(h,\mathrm{RR})}_{t_j},$$

where the weights are given by w_j ∼ 1/d(S_t, S_{t_j}), normalised such that Σ_{t_j∈N_k(S_t)} w_j = 1, and where d(S_t, S_{t_j}) denotes the distance between the current window S_t and its nearest historical neighbours S_{t_j}, t_j ∈ N_k(S_t). Similarly, we compute estimates for the future h-day market-excess returns (MR) by pooling together knn historical forward-looking market-excess returns.

P&L. For performance evaluation, we rely on standard metrics from the finance literature. For a given set of forecasts, the corresponding PnL (Profit and Loss) on day t, for a given return f_t (shortly chosen to be the raw return f^{(h,RR)}_t or the market-excess return f^{(h,MR)}_t), is

$$\mathrm{PnL}_t = \sum_{i=1}^n \operatorname{sign}(\alpha_{i,t})\, f_{i,t},$$

where α_{i,t} denotes our forecast for stock i on day t. Note that the PnL increases if and only if the sign of the forecast α agrees with the sign of the future return f_{i,t}, and decreases otherwise. The sum is across all n instruments, and f_{i,t} is the future return (either raw return (RR) or market-excess return (MR)) of stock i on day t. We explore the forward-looking horizons h ∈ {1, 3, 5, 10}. We add a superscript to the PnL calculation to indicate its dependency on the horizon h and the type of return considered (RR or MR). For instance, in our setting, the h-day forward-looking market-excess-return PnL, computed daily, is given by

$$\mathrm{PnL}^{(h,\mathrm{MR})}_t = \sum_{i=1}^n \operatorname{sign}\!\left(\hat{f}^{(h,\mathrm{MR})}_{i,t}\right) f^{(h,\mathrm{MR})}_{i,t}.$$

Sharpe ratio. After computing the daily PnL time series for all available days, we capture the risk-adjusted performance by computing the corresponding (annualised) Sharpe ratio

$$\mathrm{SR} = \frac{\mu}{\sigma}\cdot\sqrt{252},$$

where µ and σ are the mean and standard deviation of the daily PnL, and the scaling is due to the fact that there are 252 trading days within a calendar year. For simplicity, we apply the same scaling √252 also to the longer horizons h > 1, and refer the reader to [8] for an in-depth discussion of Sharpe ratios¹ and practical considerations arising from the fact that typical forecasts for equity returns are usually serially correlated. We attribute the future h-day PnL to each day t (leading to overlapping windows), as opposed to maintaining h parallel portfolios and computing their daily total PnL. We are mainly interested in the relative performance of the methods, in terms of Sharpe ratio and PnL, and less in the actual magnitudes of these performance metrics and their practical considerations.

Average PnL in basis points. In the financial literature, a typical performance measure is the average return per dollar traded, in percentage. For example, one is typically interested in the annualised return of the portfolio. In what follows, we denote by PPT (PnL Per Trade) the average daily PnL per unit of notional (e.g., $1). For instance, if at time t_0 the available capital is $100, and the cumulative PnL at time t_252 (thus after one year) is $10, then the annualised return amounts to 10%. Recalling that 1% amounts to 100 basis points (bpts), an annualised return of 10% translates to approximately PPT = 4 bpts per day (since 4 × 252 ≈ 1000 bpts, which amounts to 10%).
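The forecasting step is thus a plain inverse-distance-weighted k-NN average, and the headline metric is the annualised Sharpe ratio of the resulting daily PnL. A sketch under the stated assumptions (zero risk-free rate; all names are ours):

```python
import numpy as np

def knn_forecast(dists, hist_future_returns, k=100):
    """Inverse-distance-weighted k-NN forecast. dists[j] = d(S_t, S_{t_j});
    hist_future_returns[j] = n-vector of realised h-day returns after window j."""
    idx = np.argsort(dists)[:k]
    w = 1.0 / dists[idx]
    w /= w.sum()                          # normalise weights to sum to one
    return w @ hist_future_returns[idx]   # n-vector of forecasts

def daily_pnl(forecasts, realised):
    """PnL_t = sum_i sign(alpha_{i,t}) * f_{i,t} across the n instruments."""
    return float((np.sign(forecasts) * realised).sum())

def sharpe_ratio(pnl_series):
    """Annualised Sharpe ratio with the risk-free rate taken as zero."""
    return pnl_series.mean() / pnl_series.std() * np.sqrt(252)
```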
Essentially, the PPT tells us how much we would earn for each $1 traded in the markets (excluding transaction costs and fees). For simplicity, we ignore sizing/liquidity effects and assume that each day we invest $1 in each of the n instruments, which leads us to the following simplified notion of PPT, averaged over the entire trading period comprised of T days:

$$\mathrm{PPT} = \frac{1}{nT}\sum_{t=1}^T \mathrm{PnL}_t.$$

¹The annualised Sharpe ratio is calculated from daily observations as SR = ((µ − r_f)/σ)·√252, where µ is the average daily PnL return, r_f denotes the risk-free rate, and σ the standard deviation of the PnL returns. Since the risk-free rate is close to zero over the period of study, we compute the Sharpe ratio as (µ/σ)·√252.

Quintile portfolios. One is often interested in understanding the performance of the forecasts as a function of their respective magnitudes. To this end, one typically considers only a subset (e.g., the top q% strongest-in-magnitude forecasts) of the universe of stocks, usually referred to as quantile portfolios in the literature [18]. A quantile-based analysis simply constructs and evaluates portfolios composed of stocks which fall in a specific quantile bucket, or above a quantile threshold. We choose to use upward-contained quintile buckets, which we denote by qr_i, i = 1, 2, . . . , 5, indicating the quintile rank of each stock, meaning that stocks with quintile rank qr_i correspond to the top 1 − (i−1)/5 fraction of largest-in-magnitude forecasts. The colours in Figure 7 denote the quintile portfolios traded based on the magnitude of the forecasts. For example, the red bars qr_1 correspond to the full universe of stocks, while the green bars qr_4 denote the top 40% largest-in-magnitude forecasts.

We compare the prediction performance of W^p, LW^p and LTL^p, and leave out the full TL^p due to its prohibitive computational running cost. In addition, we compare to another, more classical, approach that does not rely on transportation distance methodology, given by the simple Pearson correlation between the original time series. More precisely, for each window S_t (a matrix of size n × m, t = 1, . . . , T), we first standardise the returns in each row (i.e., corresponding to each stock), and denote the resulting matrix by S̄_t, t = 1, . . . , T. Next, we unwrap each matrix S̄_t into a vector ψ_t ∈ R^{nm}, and finally compute the pairwise distance between a pair of time windows S_i and S_j using a correlation-based distance between their corresponding flattened versions: COR_ij := 1 − Corr(ψ_i, ψ_j). Note that, in light of the pre-processing step that standardised each stock, this corresponds, up to a scaling constant, to the squared Euclidean distance ‖ψ_i − ψ_j‖²_2 between the corresponding vectors.

Figure 6 shows numerical results comparing the various methods considered. The left column is a heatmap showing the T × T pairwise distance matrix between all days available in history, in the interval 2004-2020. We note that LTL^p clearly highlights the financial crisis that occurred in 2008, followed by LW^p and, to some extent, W^p, while COR shows barely visible signs of this event. The middle column shows the distribution of the pairwise distances, while the right column shows the row sums of the distance matrix. Construing the distance matrix as a network with distance/dissimilarity information, this plot effectively shows the degree of each node (i.e., of each time period corresponding to a sliding window of length 20). The periods of time with the largest dissimilarity degree correspond to the financial crisis of 2008.
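The correlation baseline is straightforward to state in code; a sketch of the pre-processing just described (our formulation):

```python
import numpy as np

def correlation_distance(S_i, S_j):
    """COR_ij = 1 - Corr(psi_i, psi_j) between two n x m return windows."""
    def standardise(S):  # standardise each stock's (row's) returns
        return (S - S.mean(axis=1, keepdims=True)) / S.std(axis=1, keepdims=True)
    psi_i = standardise(S_i).ravel()  # unwrap the window into a vector
    psi_j = standardise(S_j).ravel()
    return 1.0 - np.corrcoef(psi_i, psi_j)[0, 1]
```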
Note that, for ease of visualisation, we standardise the degree vector. Next, we zoom into the LTL^p pairwise distance matrix and show the resulting degrees in Figure 5. After computing the LTL^p distance matrix, we interpret it as a distance network and compute the total distance degree of each node, after standardisation. In this plot, we are able to recognise many of the major financial market events that have happened over the last two decades: the big financial crisis of 2007-2008, the 2010 Flash Crash, the August 2011 markets fall (between May and October 2011), the Chinese market crash from January 2016, the period October-November 2018 (when the stock market lost more than $2 trillion), August 2019 (a highly volatile month in the global stock markets) and, finally, the February 2020 stock market crash triggered by the COVID-19 pandemic. It is interesting to observe that the distance degree corresponding to the COVID-19 pandemic is matched in magnitude only by the 2007-2008 financial crisis. Furthermore, in the top right of Figure 5 we plot the top 50 eigenvalues of the LTL^p distance matrix, highlighting the usual market-mode top eigenvector, with the second eigenvector highly localised on the 2007-2008 financial crisis and the February-March 2020 COVID-19 pandemic period (plots of the top 5 eigenvectors are shown in the appendix; see Figure 13).

[Figure 5: Normalised total distance for each day, as computed via LTL^p, annotated with the major market events. More explicitly, we compute the distance matrix between all pairs of days (where the data for a given day is given by the previous m = 20 days, including the day itself), construe this as a distance network, and compute the total degree of each node (shown after standardisation, for ease of visualisation).]

Here, we fixed the number of nearest neighbours to k = 100, and allow the knn search to span back to the start of the available history. When forecasting raw returns (left column in Figure 7), all methods perform rather poorly, with COR and LW^p showing the best performance for h = 1, while for h ∈ {5, 10}, LTL^p clearly outperforms all other methods. This supports the hypothesis that TL^p is better able to model similarities in financial time series. In the market-excess returns setting, for h = 1, all methods return a similar performance in terms of Sharpe ratio (SR), around 1, except for W^p, which has an SR of around 0.5; in terms of PnL, most methods achieve a PPT of 1-3 basis points (bpts). However, for longer horizons, LTL^p clearly outperforms all other methods, both in terms of Sharpe ratio and PPT.

We have presented a linear optimal transport distance for use in pattern recognition tasks based on the TL^p distance. The TL^p distance generalises the Wasserstein distance such that it can be applied to multi-channelled data, for example colour images and multivariate time series. However, when pairwise distances are needed, the computation of the TL^p distance can make its routine application in pattern recognition tasks computationally infeasible. We proposed a method to alleviate this problem by extending the LW^p framework to the TL^p setting, and call this the LTL^p framework. For a dataset of N signals, these linear transportation approaches need only compute N transport maps/distances, and thus the cost scales linearly in the number of signals.
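The annotated series in Figure 5 is simply the standardised row sum of the distance matrix; as a sketch:

```python
import numpy as np

def standardised_degree(D):
    """Total distance degree of each node of the distance network D (T x T),
    standardised for ease of visualisation."""
    deg = D.sum(axis=1)
    return (deg - deg.mean()) / deg.std()
```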
In contrast, if pairwise distances are needed then the cost scales quadratically with the number of signals, which can render the problem infeasible. We developed the theory required to allow these transportation methods to be applied, showing the preservation of these transportation distances with respect to a reference signal, as well as showing that LTL^p defines a bona fide metric on the TL^p space. We additionally showed that the linear transportation methods facilitate a linear embedding of our signals, allowing simple statistical methods to be applied to the data. We compared our methods on Auslan, breast cancer histopathology and financial time series problems. In each case, the LTL^p approach significantly outperformed the LW^p approach; furthermore, this came at minimal increased computational cost. Even though TL^p often outperformed LTL^p, this improved performance comes at an unreasonable additional computational cost.

Our method still relies on the computation of transport maps, and this comes at a cost. We have found entropy-regularised methods to perform well, but they suffer from instability. Fast and stable algorithms which are memory efficient are still needed by the community for reliable computation of transport maps. Extensions of the TL^p framework could be to the manifold setting [19], which would allow computation of the TL^p distance on a graph. The TL^p barycentre and other transformations have also yet to be explored. Since TL^p has a spatial penalty, it could also be combined with other penalty terms for more complex applications; for example, one could also include a penalty on derivatives of signals, as in the TW^{k,p} distance, see [62].

We assume the two pairs (µ, f), (ν, g) ∈ TL^p can be written in the form µ = Σ_{i=1}^n p_i δ_{x_i} and ν = Σ_{j=1}^m q_j δ_{y_j}, with f_i = f(x_i) and g_j = g(y_j). It was proposed in [14] to consider the entropy-regularised problem

$$S_\varepsilon((\mu,f),(\nu,g)) = \min_\pi \sum_{i,j} \pi_{ij} C_{ij} + \varepsilon H(\pi), \qquad (A.1)$$

where ε > 0 is a positive parameter that controls the amount of regularisation, H is the entropy, defined by H(π) = Σ_{ij} π_{ij} log π_{ij}, C_{ij} = |x_i − y_j|_p^p + |f_i − g_j|_p^p, and the minimum in (A.1) is taken over matrices π ∈ R^{n×m}_+ such that the row sums are p = (p_1, . . . , p_n) and the column sums are q = (q_1, . . . , q_m). When ε → 0, the results of [10] imply that S_ε((µ, f), (ν, g)) → d^p_{TL^p}((µ, f), (ν, g)). Subsequent developments of the entropy-regularised approach have appeared in [7, 15]. The measure S_ε is referred to as the Sinkhorn distance. It is easy to see that

$$S_\varepsilon((\mu,f),(\nu,g)) = \varepsilon\,\min_\pi \mathrm{KL}(\pi\,\|\,\mathcal{K}),$$

where K_{ij} = e^{−C_{ij}/ε} is the Gibbs distribution and KL denotes the Kullback-Leibler divergence; the minimisation is taken over the same set as in (A.1). The optimal choice for π can be written in the form

$$\pi_{ij} = u_i\,\mathcal{K}_{ij}\,v_j,$$

where u, v are the limits, as r → ∞, of the sequence

$$u^{(r+1)} = \frac{p}{\mathcal{K}v^{(r)}}, \qquad v^{(r+1)} = \frac{q}{\mathcal{K}^\top u^{(r+1)}}$$

(with the division taken elementwise); see [7]. The entropy regularisation means the optimal π for S_ε cannot be written as a transport map. To obtain an approximation to the optimal transport map one can use barycentric projection, as in [51, Section 2.3].

Following [38] we derive a flow-minimisation method for finding the transportation map in TL^2. Let Ω ⊂ R^d be a compact domain with smooth boundary and let (µ, f) and (ν, g) be signals in TL^2(Ω, R^m), where f, g : Ω → R^m are square-integrable functions. Furthermore, we assume that the measures µ and ν admit densities with respect to the Lebesgue measure; abusing notation, we write dµ(x) = µ(x)dx. The variational TL^2 problem is to find the diffeomorphic map T : Ω → Ω which minimises the energy

$$\varepsilon(T) = \int_\Omega |x - T(x)|^2 + |f(x) - g(T(x))|^2 \,\mathrm{d}\mu(x) \qquad (A.2)$$

subject to the constraint

$$T_*\mu = \nu. \qquad (A.3)$$

We assume the following polar factorisation of T. Let s : Ω × [0, ∞) → Ω, where the second coordinate is time.
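The Sinkhorn iteration above is only a few lines of NumPy. The following hedged sketch computes an entropy-regularised TL^p plan; the parameter values are illustrative, and no log-domain stabilisation is included (which the instability remarks in the conclusion would warrant in practice).

```python
import numpy as np

def sinkhorn_tlp(x, f, y, g, p_w, q_w, eps=0.05, n_iter=1000, p=2):
    """Entropy-regularised TL^p plan via Sinkhorn iterations for (A.1).
    x: (n, d), f: (n, m), y: (n2, d), g: (n2, m); p_w, q_w: marginal weights."""
    # C_ij = |x_i - y_j|_p^p + |f_i - g_j|_p^p
    C = (np.abs(x[:, None] - y[None, :]) ** p).sum(-1) \
      + (np.abs(f[:, None] - g[None, :]) ** p).sum(-1)
    K = np.exp(-C / eps)                  # Gibbs kernel
    u = np.ones_like(p_w)
    for _ in range(n_iter):
        v = q_w / (K.T @ u)               # enforce column marginals
        u = p_w / (K @ v)                 # enforce row marginals
    return u[:, None] * K * v[None, :]    # pi_ij = u_i K_ij v_j
```

The returned plan can then be collapsed to an approximate map with the barycentric projection sketched in Section 2.5.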
We further assume that, for any fixed t, [s(·, t)]_*µ = µ; that is, s(·, t) : Ω → Ω is a mass-preserving rearrangement of µ. Let T⁰ : Ω → Ω be an initial mass-preserving map between µ and ν, i.e. T⁰_*µ = ν, for example the Knothe-Rosenblatt coupling [65]. We assume that s(·, t) is invertible in x for every t and, with an abuse of notation, we write s^{−1} for this inverse, i.e.

$$s^{-1}(s(x,t),t) = x = s(s^{-1}(x,t),t). \qquad (A.4)$$

We require that T = T⁰ ∘ s^{−1}. The strategy in [38] is to evolve s(·, t) using a gradient descent step such that it converges to a minimiser of (A.2) satisfying the constraint (A.3) as t → ∞. We first consider sufficient conditions on s in order to guarantee that (A.3) holds for all t > 0. The proof of the proposition can be found in [38, Section A.2].

Proposition A.1. Let Ω ⊂ R^d be a compact domain with a smooth boundary and µ, ν ∈ P(Ω). Assume that µ and ν have C¹ densities with respect to the Lebesgue measure on Ω and, with an abuse of notation, write dµ(x) = µ(x)dx and dν(x) = ν(x)dx. Let χ be a C¹ vector field on Ω satisfying div(χ) = 0 on Ω and χ · n = 0 on ∂Ω, where n is the normal to the boundary of Ω. Assume s : Ω × [0, ∞) → Ω is differentiable and invertible in the sense of (A.4), and that T⁰ satisfies T⁰_*µ = ν. If s(·, 0) = Id and for all t > 0

$$\frac{\partial s}{\partial t}(x,t) = \frac{1}{\mu(s(x,t))}\,\chi(s(x,t)), \qquad (A.5)$$

then [s(·, t)]_*µ = µ for all t ≥ 0, and the map

$$T^t := T^0 \circ s^{-1}(\cdot,t) \qquad (A.6)$$

satisfies T^t_*µ = ν.

If we restrict ourselves to transport maps of the form (A.5)-(A.6), then we must decide how to choose χ. Let us define s_χ : Ω × [0, ∞) → Ω by (A.5) with s_χ(·, 0) = Id, and T^t_χ : Ω → Ω by (A.6) with s = s_χ. An obvious criterion is to choose χ so that ε(T^t_χ) decreases quickest over all choices of χ. To this end we compute the derivative of ε(T^t_χ) with respect to t.

Lemma A.2. In addition to the assumptions and notation of Proposition A.1, let f ∈ C¹(Ω; R^m) and g ∈ L²(ν), and define ε by (A.2). Define s_χ : Ω × [0, ∞) → Ω by (A.5) with s_χ(·, 0) = Id, and T^t_χ : Ω → Ω by (A.6) with s = s_χ. Then we have

$$\frac{\mathrm{d}}{\mathrm{d}t}\varepsilon(T^t_\chi) = -\int_\Omega Q(x,t)\cdot\chi(x)\,\mathrm{d}x,$$

where

$$Q(x,t) = -2\left( x - T^t_\chi(x) + \nabla f(x)^\top\left( f(x) - g(T^t_\chi(x)) \right) \right). \qquad (A.7)$$

Proof. We define

$$\tilde\varepsilon(T; a, b) := \int_\Omega |a(x) - b(T(x))|^2 \,\mathrm{d}\mu(x),$$

which, by a change of variables y = s_χ^{−1}(x, t), and since [s_χ(·, t)]_*µ = µ, we can also write as

$$\tilde\varepsilon(T^t_\chi; a, b) = \int_\Omega |a(s_\chi(y,t)) - b(T^0(y))|^2 \,\mathrm{d}\mu(y).$$

Differentiating the above, using (A.5) and changing variables back, we obtain the stated expression (A.7). We note that ε(T^t_χ) = ε̃(T^t_χ; Id, Id) + ε̃(T^t_χ; f, g); hence d/dt ε(T^t_χ) = −∫_Ω Q(x, t) · χ(x) dx.

When d = 2, by the Helmholtz decomposition (in 2D) we can find, for each t > 0, two scalar fields w : Ω → R and α : Ω → R such that Q(·, t) = ∇w + ∇⊥α (where the t-dependence of α and w is suppressed), with ∇⊥f = (−∂f/∂x_2, ∂f/∂x_1) for a function f(x) = f(x_1, x_2). To find the direction of steepest descent we let ψ = ∇⊥α and χ = ∇⊥β and compute

$$\frac{\mathrm{d}}{\mathrm{d}t}\varepsilon(T^t_\chi) = -\int_\Omega (\nabla w + \nabla^\perp\alpha)\cdot\nabla^\perp\beta \,\mathrm{d}x = -\int_\Omega \nabla^\perp\alpha\cdot\nabla^\perp\beta \,\mathrm{d}x,$$

where the second equality follows from the divergence theorem, since div(χ) = 0 on Ω and χ(x) · n(x) = 0 on ∂Ω. It follows that the direction of steepest descent is α = β. To find α, we observe that ∇α = −Q⊥ − ∇⊥w, where ⊥ is rotation clockwise by π/2, i.e. Q⊥ = (−Q_2, Q_1). Taking the divergence, and using div(∇⊥w) = 0, we have Δα = −div(Q⊥). Hence, α solves the Poisson equation with Dirichlet boundary conditions:

$$\Delta\alpha = -\mathrm{div}(Q^\perp) \quad\text{in } \Omega, \qquad (A.9)$$

$$\alpha = 0 \quad\text{on } \partial\Omega. \qquad (A.10)$$

To summarise, the flow-minimisation scheme for TL^p, given a step size τ, is as follows.
1. Construct T⁰ and set t = 0.
2. Compute Q(·, t) defined by (A.7).
3. Find α by solving (A.9)-(A.10).
4. Set χ = ∇⊥α.
5. Take a gradient step of (A.5), i.e. s(x, t + τ) = s(x, t) + τ χ(s(x, t))/µ(s(x, t)), update T^{t+τ} = T⁰ ∘ s^{−1}(·, t + τ) and set t ← t + τ.
6. Repeat 2-5 until convergence.
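Step 3 of the scheme requires a Poisson solve with homogeneous Dirichlet boundary conditions at every iteration. A minimal sketch of this building block on a uniform grid — a standard five-point discretisation, our own illustrative implementation rather than the authors' — solving −Δα = div(Q⊥) for (A.9)-(A.10), with α = 0 enforced implicitly on the boundary:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

def poisson_dirichlet(rhs, h):
    """Solve -Laplacian(alpha) = rhs on the interior of a uniform grid with
    spacing h and alpha = 0 on the boundary (five-point stencil)."""
    ny, nx = rhs.shape
    d1 = sp.diags([1.0, -2.0, 1.0], [-1, 0, 1], shape=(nx, nx))
    d2 = sp.diags([1.0, -2.0, 1.0], [-1, 0, 1], shape=(ny, ny))
    lap = (sp.kron(sp.identity(ny), d1) + sp.kron(d2, sp.identity(nx))) / h**2
    alpha = spsolve((-lap).tocsc(), rhs.ravel())
    return alpha.reshape(ny, nx)
```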
We repeat Table 1 here for convenience. We first consider a one-dimensional signal processing problem, to test the ability of W^p and TL^p to discriminate between different signals. Throughout, we take p = 2. We consider the task of discriminating between a double hump function and a high-frequency perturbation of the hump function: a chirp-hump function. A double hump function is of the form

$$f(x) = K_1\left( \mathbb{1}_{[l,\,l+r]}(x) + \mathbb{1}_{[l+r+b,\,l+2r+b]}(x) \right), \qquad (B.1)$$

where 1_{[α,β]} denotes the indicator function of the interval [α, β] and l ∈ [0, 1 − b − 2r]. The constant K_1 is chosen such that f integrates to unity. A chirp-hump function is given analogously, with a high-frequency chirp superimposed on the humps, where γ controls the high-frequency perturbation and K_2 is chosen so that f integrates to unity.

To generate our synthetic dataset we proceed as follows. Fixing l, r and b, we generate f_1, ..., f_30 from (B.1). We corrupt each signal with standard Gaussian noise to obtain 30 noisy double hump functions. We then obtain two separate classes from the chirp-hump functions by first randomly sampling γ ∈ {γ_1, γ_2} with equal probability. Each chirp-hump function is then corrupted with standard Gaussian noise. We thus obtain functions f_31, ..., f_60 as chirp-hump functions, with proportion R_1 having perturbation parameter γ_1 and proportion R_2 having perturbation parameter γ_2. All functions are defined on [0, 1], discretised on a uniform grid of length N = 150.

To apply LW^p to discriminate between these signals we first need to satisfy positivity and mass constraints. Thus, each function f is normalised as follows: g = (f + χ)/∫(f + χ), for a small number χ. Discrete measures µ_1, ..., µ_60 are then defined to be the probability measures with densities g_1, . . . , g_60 with respect to the uniform grid on [0, 1]. A reference measure σ is constructed as an empirical average of all these measures. We then use entropy-regularised methods (see [7, 14] or Section A.1) to compute optimal transport plans between σ and µ_i, i = 1, ..., 60, after which an optimal transport map is computed using barycentric projection. We then embed the measures into Euclidean space as described in Section 2.3. Note that this linear embedding is E ∈ R^{150×60}. This method requires only the computation of 60 transport plans, rather than the 59 × 30 = 1770 required if all pairwise W^p distances were computed.

The LTL^p framework can be applied directly, without ad-hoc pre-processing and normalisation. The base measure is taken as the uniform measure on [0, 1]. The reference measure σ is taken to be the base measure and the reference signal h is the empirical average of all signals. Optimal TL^p plans are computed, again using entropy-regularised methods, from (σ, h) to (µ_i, f_i) for i = 1, ..., 60. Recall that this requires the computation of the optimal transport plan from σ̃ to μ̃_i for i = 1, ..., 60. The map is then obtained from the plan via barycentric projection. A linear embedding U is obtained as detailed in Section 2.4. Note that this linear embedding is higher dimensional: U ∈ R^{300×60}.

To assess the discriminating ability of LW^p and LTL^p we apply K-means clustering with K = 3 to the linear embeddings to see if we can recover the true underlying classes. Since K-means attempts to minimise the within-class distance and maximise the between-class distance, we expect a distance which is able to detect the differences between the classes to have the best performance. We take the clustering returned by the K-means algorithm and compare it to the true clustering using the adjusted Rand index (ARI) [41, 56].
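The scoring of each synthetic experiment — K-means on the embedding compared against the ground-truth labels via the ARI — can be sketched as follows (scikit-learn names; our illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def clustering_ari(embedding, true_labels, n_clusters=3, seed=0):
    """K-means on a linear embedding, scored against ground truth with the ARI."""
    X = embedding.reshape(len(true_labels), -1)
    pred = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(X)
    return adjusted_rand_score(true_labels, pred)
```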
The ARI is a score for which 1 indicates perfect agreement and 0 indicates that the method performs as well as one would expect under random assignment; the ARI can be negative if the method is worse than random. We repeat our method 100 times to produce a distribution of scores. Figure 9 demonstrates the improved performance obtained by using the TL^p distance to form a linear embedding of the data. The median ARI using the TL^p distance was 1, whilst the median using the W^p distance was 0.8129. The LTL^p approach significantly outperforms the LW^p approach (Kolmogorov-Smirnov (KS) test, p-value less than 10^{−4}). Furthermore, Figure 9 panels (b) and (c) demonstrate that the linear embedding produces much tighter, and therefore more interpretable, clusters in LTL^p as compared to LW^p.

We now consider a more challenging synthetic two-dimensional signal processing problem. As in the previous section, we take p = 2 as the exponent of the cost function in both settings. Consider the following classes of functions, defined on a grid on [0, 1]²:

$$M_0 = \left\{ f : f(x_i, y_j) = \alpha e^{-x_i^2 - y_j^2} + \sigma_{ij},\ \alpha\sim N(0,1),\ \sigma_{ij}\sim N(0,1) \right\},$$

and

$$M_1 = \left\{ f : f(x_i, y_j) = \alpha e^{-x_i^2 - y_j^2} + \sigma_{ij},\ \alpha\sim N(-4, 1.5),\ \sigma_{ij}\sim N(0,1) \right\}.$$

Furthermore, we introduce a perturbation to functions in M_1 by first sampling an integer n ∈ {10, ..., 20}, each with equal probability, and then setting f = −2 at n randomly chosen coordinates on the grid, with each coordinate having equal probability of being chosen. We note that these signals take positive and negative values, and thus to apply W^p to this problem we perform normalisation. We add a constant to each (random) function and then ensure that mass still integrates to unity. This ad-hoc normalisation procedure introduces signal compression into the problem, causing important features to become suppressed. The TL^p distance can be applied without normalisation or pre-processing.

We generate 25 random functions from each class. We then apply both the LW^p and LTL^p methods to the resulting dataset. We performed PCA and K-means clustering, with K = 2, on the linear embeddings to see if we can discriminate between the two classes. As in the previous section, we compare the resulting clustering with the ground truth using the ARI. We repeat this process 100 times to obtain a distribution of scores. Figure 10 shows that the LTL^p embedding outperforms the LW^p embedding. The median ARI for the LTL^p approach is 0.92 and the median ARI for the LW^p approach is 0.5689. The LTL^p distance produced significantly better results (KS test, p < 10^{−4}) and the PCA plots in Figure 10 demonstrate that the two classes overlap when using W^p but separate when using TL^p.

In this section, we analyse the liver dataset of [6], containing 250 normal and 250 cancerous liver cells. [6] proposed a transportation-based morphometry analysis, which facilitated high-accuracy classification. Furthermore, the generative nature of optimal transport allowed them to visualise the modes of variation in the dataset, allowing for superior interpretation of the data. Each cell image is defined on a 192 × 192 pixel grid with a single intensity channel; thus both LW^p and LTL^p are applicable. For consistency we apply flow-minimisation techniques to compute the transport maps in each setting [46]. To alleviate numerical issues, as in [46], we apply a Gaussian low-pass filter with standard deviation 2 to smooth the data.
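The pre-processing for the cell images — Gaussian smoothing followed by mass normalisation — might look as follows. A sketch; the clipping constant is our assumption, added to keep densities strictly positive for the flow-minimisation solver:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def preprocess_cell_image(img, sigma=2.0, floor=1e-8):
    """Gaussian low-pass filter (std 2) then normalisation to unit mass."""
    smooth = gaussian_filter(img.astype(float), sigma=sigma)
    smooth = np.clip(smooth, floor, None)  # keep the density strictly positive
    return smooth / smooth.sum()
```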
We perform mass normalisation so that optimal transport can be applied; the same normalised signals are used for TL^p, so that the spatial and intensity features are on the same scale. A linear embedding is obtained from the transport maps, and we then use the 1-nearest-neighbour (1NN) algorithm to predict normal or cancerous status from the linear embedding of the signals. We assess performance with 5-fold cross-validation, that is, an 80/20 split between training and testing partitions. Figure 11 shows that on this particular task the LW^p and LTL^p frameworks have similar classification performance when differentiating between cancerous and normal cells. This is unsurprising, as these images have only a single intensity channel. We also visualise, in a PCA plot, the LTL^p embedding, showing the variability across the dataset; Figure 11 panel (c) visualises the modes of variation in image space for the first 5 principal components of the data.

We show here additional numerical results for the financial time series application. Figure 12 shows the PnL curves corresponding to the full portfolio of stocks, attained by each of the methods, for the 1-, 3-, 5- and 10-day future horizons, for both raw returns and market-excess returns.

[Figure 12: PnL curves for the 1-, 3-, 5- and 10-day future horizons, for raw returns and market-excess returns, across different methods. The legend contains performance statistics of each forecast method: PnL Per Trade (PPT) in basis points and Sharpe Ratio (SR), while N denotes the size of the portfolio.]

As similarly observed earlier for the top-quintile portfolio shown in Figure 8, LTL^p outperforms all other methods at the longer horizons (3, 5 and 10 days), especially in the more realistic scenario of using market-excess returns (corresponding to a hedged portfolio). Figure 13 shows the top eigenvectors of the LTL^p distance matrix, which localise on known crises, especially those during 2007-2008 and 2020.
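The crisis-localisation check in Figure 13 can be reproduced, in outline, by inspecting the leading eigenvectors of the pairwise distance matrix. A minimal sketch, assuming D is the symmetric matrix of pairwise LTL^p distances between time windows (variable names illustrative):

```python
import numpy as np

def top_eigenvectors(D, k=3):
    """Leading eigenvectors of a symmetric matrix, ranked by eigenvalue magnitude."""
    vals, vecs = np.linalg.eigh(D)              # eigenvalues in ascending order
    order = np.argsort(np.abs(vals))[::-1]      # rank by |eigenvalue|
    return vecs[:, order[:k]], vals[order[:k]]

# V, _ = top_eigenvectors(D)
# Plotting |V[:, 0]| against the time axis shows where the leading eigenvector
# concentrates, e.g. on the 2007-2008 and 2020 crisis periods.
```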
References

[1] Stochastic algorithms for entropy-regularized optimal transport problems.
[2] Screening Sinkhorn algorithm for regularized optimal transport.
[3] Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration.
[4] Gradient flows: in metric spaces and in the space of probability measures.
[5] Minimizing flows for the Monge-Kantorovich problem.
[6] Detecting and visualizing cell phenotype differences from microscopy images using transport-based morphometry.
[7] Iterative Bregman projections for regularized transportation problems.
[8] Testing Sharpe ratio: luck or skill?
[9] Linearized optimal transport for collider events.
[10] Convergence of entropic schemes for optimal transport and gradient flows.
[11] A gradient descent solution to the Monge-Kantorovich problem.
[12] Large data limit for a phase transition model with the p-Laplacian on point clouds.
[13] Automatic detection of invasive ductal carcinoma in whole slide images with convolutional neural networks.
[14] Sinkhorn distances: lightspeed computation of optimal transport.
[15] Fast computation of Wasserstein barycenters.
[16] Large data and zero noise limits of graph-based semi-supervised learning algorithms.
[17] Bayesian inference with optimal maps.
[18] The cross-section of expected stock returns.
[19] Monge's transport problem on a Riemannian manifold.
[20] Optimal transport for manifold-valued images.
[21] A reconstruction of the initial conditions of the universe by optimal mass transportation.
[22] Application of optimal transport theory to reconstruction of the early universe.
[23] Learning with a Wasserstein loss.
[24] Optimal maps in Monge's mass transport problem.
[25] The geometry of optimal transportation.
[26] Variational limits of k-NN graph-based functionals on data clouds.
[27] Error estimates for spectral convergence of the graph Laplacian on random geometric graphs toward the Laplace-Beltrami operator.
[28] On the consistency of graph-based Bayesian semi-supervised learning and the scalability of sampling algorithms.
[29] A new analytical approach to consistency and overfitting in regularized empirical risk minimization.
[30] Continuum limit of posteriors in graph Bayesian inverse problems.
[31] Continuum limit of total variation on point clouds.
[32] A variational approach to the consistency of spectral clustering.
[33] Estimating perimeter using graph cuts.
[34] Consistency of Cheeger and ratio graph cuts.
[35] Stochastic optimization for large-scale optimal transport.
[36] Multiscale strategies for computing optimal transport.
[37] An efficient numerical method for the solution of the L^2 optimal mass transfer problem.
[38] Optimal mass transport for registration and warping.
[39] Learning from imbalanced data.
[40] Estimating beta.
[41] Comparing partitions.
[42] Deep learning for digital pathology image analysis: a comprehensive tutorial with selected use cases.
[43] Temporal classification: extending the classification paradigm to multivariate time series.
[44] Optimal mass transport: signal processing and machine-learning applications.
[45] Transport-based single frame super resolution of very low resolution face images.
[46] A continuous linear optimal transport approach for pattern analysis in image datasets.
[47] On efficient optimal transport: an analysis of greedy and accelerated mirror descent algorithms.
[48] A multiscale approach to optimal transport.
[49] Robust and scalable Bayes via a median of subset posterior measures.
[50] Linear optimal transport embedding: provable fast Wasserstein distance computation and classification for nonlinear problems.
[51] An efficient linear programming method for optimal transportation.
[52] Consistency of Dirichlet partitions.
[53] The geometry of dissipative evolution equations: the porous medium equation.
[54] Representing and learning high dimensional data with the optimal transport map from a probabilistic viewpoint.
[55] Computational optimal transport (Foundations and Trends in Machine Learning).
[56] Objective criteria for the evaluation of clustering methods.
[57] Optimal transport for applied mathematicians.
[58] Analysis of p-Laplacian regularization in semi-supervised learning.
[59] Convolutional Wasserstein distances: efficient optimal transportation on geometric domains.
[60] WASP: scalable Bayes via barycenters of subset posteriors.
[61] Scalable Bayes via barycenter in Wasserstein space.
[62] A transportation L^p distance for signal analysis.
[63] Asymptotic analysis of the Ginzburg-Landau functional on point clouds.
[64] Topics in optimal transportation.
[65] Optimal transport: old and new.
[66] Penalized Fisher discriminant analysis and its application to image-based morphometry.
[67] A linear optimal transportation framework for quantifying and visualizing variations in sets of images.
[68] An image morphing technique based on optimal mass preserving mapping.

Acknowledgements

This work was supported by The Alan Turing Institute under EPSRC grant EP/N510129/1. In addition, the authors are grateful for discussions with Elizabeth Soilleux, whose interest in machine learning methods for diagnosing coeliac disease motivated this work, and with Dejan Slepčev. OMC is a Wellcome Trust Mathematical Genomics and Medicine student and is grateful for generous funding from the Cambridge School of Clinical Medicine.
MC acknowledges support from the EPSRC grant EP/N510129/1 at The Alan Turing Institute. TDH was supported by The Maxwell Institute Graduate School in Analysis and its Applications, a Centre for Doctoral Training funded by the EPSRC (grant EP/L016508/01), the SFC, Heriot-Watt University and the University of Edinburgh. CBS acknowledges support from the Leverhulme Trust project on 'Breaking the non-convexity barrier', the Philip Leverhulme Prize, the Royal Society Wolfson Fellowship, the EPSRC grants EP/S026045/1 and EP/T003553/1, the EPSRC Centre EP/N014588/1 and the Wellcome Innovator Award.

In the appendix we include some additional background on computing the TL^p distance and further applications.

In this section we review some methods for computing the LTL^p embedding. In principle, any algorithm that can compute optimal transport distances can be adapted to compute TL^p, by either interpreting TL^p as a Wasserstein distance on the graphs of functions, or as an optimal transport problem with cost c(x, y; f, g) = |x − y|_p^p + |f(x) − g(y)|_p^p. We refer to [55] for a thorough review of computational methods for optimal transport. Here, we review entropy-regularised optimal transport and flow minimisation in the setting of TL^p. We note that, as the Kantorovich problem is a linear program, one can use algorithms such as the simplex or interior-point methods. Although multi-scale approaches exist [51], these are typically not state-of-the-art for high-dimensional images/signals, so we omit them from this review.

Once we have obtained optimal TL^p maps T̃_{μ_i} : Ω × R^m → Ω × R^m for each of the transportation problems between σ̃ ∈ P(Ω × R^m) and μ̃_i ∈ P(Ω × R^m) for i = 1, ..., N, we can embed into Euclidean space by (2.11)-(2.13), where T̃_{μ_i} = (T_{μ_i}, f_i ∘ T_{μ_i}). Hence, linear statistical methods can be applied.
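As an illustration of the graph-lifting viewpoint just described, the following is a minimal sketch of computing an entropy-regularised TL^2 plan, assuming the POT library; the function name, regularisation value and array layout are illustrative, not the authors' implementation.

```python
import numpy as np
import ot

def tl2_plan(x, f, y, g, wx, wy, reg=1e-3):
    """Entropic OT plan for the TL^2 cost |x - y|^2 + |f(x) - g(y)|^2.

    x, y   : (N, d) and (M, d) support points of the base measures
    f, g   : (N, m) and (M, m) signal values on those supports
    wx, wy : (N,) and (M,) base measure weights (each summing to one)
    """
    X = np.hstack([x, f])   # lift to the graph of f: points (x, f(x))
    Y = np.hstack([y, g])   # lift to the graph of g
    C = ot.dist(X, Y, metric="sqeuclidean")   # squared Euclidean cost on the graphs
    return ot.sinkhorn(wx, wy, C, reg)        # entropy-regularised transport plan

# A TL^2 map is then recovered from the plan by barycentric projection onto the
# lifted points Y, exactly as in the Wasserstein case, after which the linear
# embedding (2.11)-(2.13) applies with the lifted reference (sigma, h).
```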