Aligning Time Series on Incomparable Spaces
Samuel Cohen, Giulia Luise, Alexander Terenin, Brandon Amos, Marc Peter Deisenroth
2020-06-22

Abstract. Dynamic time warping (DTW) is a useful method for aligning, comparing and combining time series, but it requires them to live in comparable spaces. In this work, we consider a setting in which time series live on different spaces without a sensible ground metric, causing DTW to become ill-defined. To alleviate this, we propose Gromov dynamic time warping (GDTW), a distance between time series on potentially incomparable spaces that avoids the comparability requirement by instead considering intra-relational geometry. We derive a Frank-Wolfe algorithm for computing it and demonstrate its effectiveness at aligning, combining and comparing time series living on incomparable spaces. We further propose a smoothed version of GDTW as a differentiable loss and assess its properties in a variety of settings, including barycentric averaging, generative modeling and imitation learning.

Data is often gathered sequentially in the form of a time series, which consists of sequences of data points observed at successive time points. Elements of such sequences are correlated through time, and comparing time series requires one to take the direction of time into account. To define a sensible similarity measure between time series, Sakoe and Chiba [21] proposed dynamic time warping (DTW), a distance over the space of time series, which has recently been extended by Cuturi and Blondel [6] into soft DTW for use as a differentiable loss. DTW consists of a minimal-cost alignment problem and is solved via a Bellman recursion, while soft DTW leverages a soft-min operation to smooth the DTW objective. Such distances enable tackling a large range of temporal problems, including aligning time series, averaging them, or making long-term predictions.

DTW-based approaches require a sensible cost function between samples from the two time series. The specification of such cost functions is often hard, which limits the applicability of DTW. For example, in cases where the time series are invariant under symmetries, such as sequences of word embeddings which are only identified up to a rotation of latent space, one needs to solve an alignment problem to compare the two sequences. Vayer et al. [25] propose an extension of DTW that addresses this issue by making the cost invariant with respect to specific sets of symmetries. In this case, one still requires the definition of a cost function between samples from the two time series, along with a potentially large set of transformations to optimize over. On the other hand, in multi-modal settings, one considers time series that live on incomparable spaces: for example, the configuration space of a physical system and its representation as pixels of a video frame. In such cases, defining a sensible distance between samples from the two sequences is impractical, as it would require detailed understanding of the objects we wish to study. In this work, we propose to tackle the problem by relaxing our notion of equality in a manner inspired by recent ideas from the optimal transport literature.
Using connections between DTW and the Wasserstein distance, we propose Gromov dynamic time warping (GDTW), which compares two time series by contrasting their intra-relational geometries, analogously to the Gromov-Wasserstein distance of isometry classes of metric-measure spaces [16]. This allows one to compare two time series without defining a sensible ground cost between their domains, and automatically incorporates invariances into the distance.

Contributions. (1) We introduce Gromov DTW, a distance between time series, along with a smoothed version, extending DTW variants to handle time series on incomparable spaces; (2) we present an efficient algorithm for computing it; (3) we apply Gromov DTW to a range of settings including barycentric averaging, generative modeling and imitation learning.

Notation. Let (X, d_X) be a compact metric space, and let a time series x of length T ∈ N be an element of X^T. Let A(m, n) ⊆ {0, 1}^{m×n} be the set of alignment matrices, which are binary matrices containing a path of ones from the top-left to the bottom-right corner, allowing only bottom, right or diagonal bottom-right moves. Given a matrix A ∈ A(m, n) and a 4-dimensional array L ∈ R^{m×n×m×n}, define the matrix (L ⊗ A)_{ij} = Σ_{kl} L_{ijkl} A_{kl}. Denote the Frobenius matrix inner product by ⟨·, ·⟩_F. Define the probability simplex Δ_J := {q ∈ R^J : q_j ≥ 0 for j = 1, …, J, Σ_j q_j = 1}. Finally, x_{:i} denotes the first i time steps of x.

We now introduce the concepts and definitions needed in the rest of the work, beginning with DTW and its smoothed version. Sakoe and Chiba [21] consider the problem of aligning two time series x ∈ X^{T_x} and y ∈ X^{T_y}, where potentially T_x ≠ T_y. This is formalized as

DTW(x, y) = min_{A ∈ A(T_x, T_y)} ⟨D, A⟩_F,   (1)

where D_{ij} = d_X(x_i, y_j) is the pairwise distance matrix. This problem amounts to finding an alignment matrix that minimizes the total alignment cost. The objective (1) can be computed in O(T_x T_y) by leveraging the dynamic programming forward recursion

DTW(x_{:i}, y_{:j}) = d_X(x_i, y_j) + min{ DTW(x_{:i−1}, y_{:j−1}), DTW(x_{:i−1}, y_{:j}), DTW(x_{:i}, y_{:j−1}) }.   (2)

The optimal alignment matrix A* can then be obtained by tracking the optimal path backwards. DTW is a more flexible choice for comparing time series than element-wise Euclidean distances, because it allows one to compare time series of different sampling frequencies due to its ability to warp time. In particular, two time series can be close in DTW even if T_x ≠ T_y. DTW has been used in a number of settings, including time series averaging and clustering [18, 22] and feature extraction [2, 12].
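As an illustration of the forward recursion (2) and the backtracking step, here is a minimal NumPy sketch (the function name dtw and its interface are ours, not the authors' code); it takes a precomputed cost matrix D and returns the DTW value together with an optimal alignment matrix.

```python
import numpy as np

def dtw(D):
    """Hard DTW via the forward recursion in (2), for a precomputed
    T_x x T_y cost matrix D (e.g. pairwise distances d_X(x_i, y_j)).
    Returns the DTW value and an optimal binary alignment matrix."""
    Tx, Ty = D.shape
    R = np.full((Tx + 1, Ty + 1), np.inf)
    R[0, 0] = 0.0
    for i in range(1, Tx + 1):
        for j in range(1, Ty + 1):
            R[i, j] = D[i - 1, j - 1] + min(R[i - 1, j - 1], R[i - 1, j], R[i, j - 1])
    # Backtrack from (Tx, Ty) to (1, 1) to recover the optimal path.
    A = np.zeros((Tx, Ty), dtype=int)
    i, j = Tx, Ty
    while i > 0 and j > 0:
        A[i - 1, j - 1] = 1
        moves = {(i - 1, j - 1): R[i - 1, j - 1],
                 (i - 1, j): R[i - 1, j],
                 (i, j - 1): R[i, j - 1]}
        i, j = min(moves, key=moves.get)
    return R[Tx, Ty], A
```

The same routine is reused later for Gromov DTW, where the fixed cost matrix D is replaced by the alignment-dependent cost L ⊗ A.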
A limitation of DTW is the discontinuity of its gradient, which can affect the performance of gradient descent algorithms. To address this, Cuturi and Blondel [6] introduced a soft version of DTW. The minimum in (1) is replaced with a softened version, yielding

DTW_γ(x, y) = −γ log Σ_{A ∈ A(T_x, T_y)} exp( −⟨D, A⟩_F / γ ),   (3)

and DTW is recovered in the limit γ → 0. They also discuss a softened version of the optimal alignment matrix A*, given by the softened argmin

A*_γ = (1 / C_{x,y}) Σ_{A ∈ A(T_x, T_y)} A exp( −⟨D, A⟩_F / γ ),   (4)

where C_{x,y} is the normalizing constant of the unnormalized density e^{−⟨D,A⟩_F/γ}.

While it accounts for temporal variability, DTW is not invariant under transformations, such as translations and rotations, which can limit its application in settings where time series are only obtained up to isometric transformations, such as word embeddings. To alleviate this, Vayer et al. [25] propose DTW with global invariances (DTW-GI),

DTW-GI(x, y) = min_{f ∈ F} DTW(x, f(y)),   (5)

which gives a distance between time series that is invariant under a set of transformations F, where f is applied elementwise to the points of the time series; Vayer et al. [25] consider orthonormal transformations. However, in more general settings this requires one to optimize over a potentially large space of transformations F, which becomes infeasible if x and y are too different.

Optimal transport [19] allows one to compare and average measures in a way that incorporates the geometry of the underlying space on which they are defined. Such approaches can be intuitively connected to DTW by observing that time series are essentially discrete measures equipped with an ordering. This allows one to view the alignment matrices in the DTW objective as analogues of the coupling matrices that appear in the Kantorovich formulation of the classical optimal transport problem [27]. To formalize this, we introduce the Wasserstein distance between discrete measures. Let µ_x = Σ_{i=1}^m p_i δ_{x_i} and µ_y = Σ_{j=1}^n q_j δ_{y_j} be discrete probability measures with p ∈ Δ_m, q ∈ Δ_n, and set D_{ij} = d_X(x_i, y_j). Define the Wasserstein distance between the discrete measures µ_x and µ_y as

W(µ_x, µ_y) = min_{T ∈ Π(p, q)} ⟨D, T⟩_F,   (6)

where Π(p, q) is the set of coupling matrices with marginals p and q. Equation (6) clearly resembles (1): in both cases the objective consists of the minimization of the element-wise dot product between a distance matrix and another matrix, which we call the plan. In the DTW case, the plan is an alignment matrix, and in the Wasserstein case it is a coupling matrix. Moreover, the optimal coupling T*_{ij} describes the optimal amount of probability mass to move from point x_i to y_j, whilst the optimal alignment A*_{ij} describes whether or not x_i and y_j are aligned at optimality.

The Wasserstein distance is limited by the requirement that a sensible ground metric be defined between samples x_i ∈ X and y_j ∈ Y, which is impossible if the spaces are unregistered [24]. The Wasserstein distance is also not invariant under isometries, such as rotations and translations, and generally leads to a large distance between measures that are equivalent up to such transformations. To relax these requirements, Mémoli [16] proposed the Gromov-Wasserstein (GW) distance between metric-measure triples (X, d_X, µ_x) and (Y, d_Y, µ_y), up to isometry. It is defined as

GW(µ_x, µ_y) = min_{T ∈ Π(p, q)} Σ_{i,j,k,l} L( d_X(x_i, x_k), d_Y(y_j, y_l) ) T_{ij} T_{kl},   (7)

where L is typically the squared error loss or the Kullback-Leibler divergence, and does not rely on a cost or metric to compare x_i with y_j. GW compares the intra-relational metric geometries of the two measures by comparing the distributions of their pairwise distances, and only requires the definition of metrics d_X and d_Y on X and Y, respectively, which can be arbitrarily different. GW has been used as a tool for comparing measures on incomparable spaces, notably for training generative models [1], graph matching [29], and graph averaging [28]. Vayer et al. [26] also proposed an extension of both Wasserstein and GW, named Fused Gromov-Wasserstein, to deal with structured measures such as graphs; it combines a Wasserstein distance on the feature components with GW on the structural components.
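To make the Gromov-Wasserstein objective (7) and the ⊗ notation concrete, here is a small illustrative NumPy sketch (our own, with hypothetical function names) for the squared error loss; the direct einsum evaluation is O(m²n²) in memory and time and is only meant for small examples — Peyré et al. [20] describe a more efficient factorization.

```python
import numpy as np

def gw_cost(Dx, Dy, T):
    """Gromov-Wasserstein objective (7) with squared error loss, for pairwise
    distance matrices Dx (m x m), Dy (n x n) and a coupling T (m x n):
        sum_{i,j,k,l} (Dx[i,k] - Dy[j,l])**2 * T[i,j] * T[k,l]."""
    L = (Dx[:, None, :, None] - Dy[None, :, None, :]) ** 2   # L[i, j, k, l]
    return np.einsum("ijkl,ij,kl->", L, T, T)

def tensor_dot(Dx, Dy, A):
    """The matrix (L ⊗ A)_{ij} = sum_{k,l} L[i,j,k,l] * A[k,l] from the
    Notation paragraph, again for the squared error loss."""
    L = (Dx[:, None, :, None] - Dy[None, :, None, :]) ** 2
    return np.einsum("ijkl,kl->ij", L, A)
```

With an alignment matrix in place of the coupling, the same quadratic form ⟨L ⊗ A, A⟩_F is exactly the Gromov DTW objective introduced next.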
Motivated by the connections between DTW and optimal transport described in Sections 2.1 and 2.2, respectively, we introduce a distance between time series x ∈ X^{T_x} and y ∈ Y^{T_y} defined on potentially incomparable compact metric spaces. Define the Gromov dynamic time warping distance between the metric-time-series triples (X, d_X, x) and (Y, d_Y, y) as

GDTW(x, y) = min_{A ∈ A(T_x, T_y)} Σ_{i,j,k,l} L( d_X(x_i, x_k), d_Y(y_j, y_l) ) A_{ij} A_{kl},   (8)

where L : R² → R_+ is a loss function measuring the alignment of the pairwise distances, and the first two elements of each triple are omitted to ease notation. We think of L as a proxy for measuring the alignment of the time series (e.g., the squared error loss L(a, b) = (a − b)² or a Kullback-Leibler-type loss). Under the optimal alignment, for any two pairs (x_i, y_j) and (x_k, y_l), if x_i is close to x_k, then y_j is close to y_l. Some optimal alignments are given in Figure 1. Provided L is a pre-metric (and so induces a Hausdorff topology), GDTW possesses the following properties: it is non-negative, it vanishes if and only if the two triples are isomorphic in the sense of Appendix A, and it is symmetric if and only if L is symmetric. As DTW fails to satisfy the triangle inequality, GDTW does not generally satisfy it either. Thus, GDTW is a pre-metric over equivalence classes of triples (X, d_X, x), up to metric isometry. A formal treatment is given in Appendix A.

We now present a straightforward algorithm for computing GDTW. Following ideas proposed in the optimal transport setting for computing the Gromov-Wasserstein distance, one can introduce the 4-dimensional array L ∈ R^{T_x×T_y×T_x×T_y} with L_{ijkl} = L( d_X(x_i, x_k), d_Y(y_j, y_l) ), and rewrite (8) as

GDTW(x, y) = min_{A ∈ A(T_x, T_y)} ⟨L ⊗ A, A⟩_F.   (9)

This expression is similar to the DTW objective in (1), but with a cost matrix D = L ⊗ A that now depends on the alignment matrix A. We apply the Frank-Wolfe algorithm [4, 9] to (9). This consists of (i) solving the linear minimization oracle

S^{(t)} ∈ arg min_{S ∈ A(T_x, T_y)} ⟨L ⊗ A^{(t)}, S⟩_F,   (10)

which can be performed exactly in O(T_x T_y) by a DTW iteration with cost matrix L ⊗ A^{(t)}; here we note that the gradient of the objective takes the form 2(L ⊗ A^{(t)}), following Peyré et al. [20]. This is followed by (ii) a line-search step

η^{(t)} = arg min_{η ∈ [0,1]} ⟨L ⊗ (A^{(t)} + η(S^{(t)} − A^{(t)})), A^{(t)} + η(S^{(t)} − A^{(t)})⟩_F.   (11)

We prove in Appendix A that the optimal step size is always 0 or 1. The final step (iii) is the update

A^{(t+1)} = A^{(t)} + η^{(t)} (S^{(t)} − A^{(t)}).   (12)

In summary: if S^{(t)} improves the objective, update A^{(t+1)} = S^{(t)}; otherwise the algorithm has converged to a local optimum at A^{(t)}. Note that the optimal step size η^{(t)} ∈ {0, 1} remedies the non-convexity of the constraint set, as iterates are guaranteed to remain in A(T_x, T_y) in spite of this non-convexity.

Proposition 1. Algorithm 1 for computing Gromov DTW converges to a stationary point.

Proof. Appendix A.

Algorithm 1: Frank-Wolfe algorithm for Gromov DTW.

Gromov DTW can itself be used as a differentiable loss function. Here, we apply the envelope theorem [3, 17] to (9) and obtain that the gradient of GDTW with respect to the time series is the gradient of ⟨L ⊗ A*, A*⟩_F with the optimal alignment A* held fixed.   (13)

Similarly to DTW, GDTW suffers from unpredictability when the time series is close to a change point of the optimal alignment matrix, because of the discontinuity of its derivatives. To remedy this, we describe how GDTW can be softened analogously to soft DTW. This allows smoother derivatives when applying it to generative modeling of time series and imitation learning. Our algorithm for computing Gromov DTW consists of successive DTW iterations. Following ideas from the Gromov-Wasserstein literature, we propose to replace the DTW operation in the iterations with a softened version, by replacing the argmin with the soft argmin in (4). A priori, it may seem that computing this is significantly more involved; however, Cuturi and Blondel [6] observed that

∇_D DTW_γ(x, y) = A*_γ,   (14)

where A*_γ is the softened argmin in (4). Hence, (14) can be computed by reverse-mode automatic differentiation in quadratic time, and soft GDTW iterations can be performed by plugging in D = L ⊗ A. We approximate the derivatives by using the optimal soft alignment matrix in (13).
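A minimal sketch of Algorithm 1 under the squared error loss, reusing the dtw and tensor_dot helpers sketched earlier (all of this is our own illustrative code, not the authors' implementation; the initialization choice is an assumption).

```python
import numpy as np

def gromov_dtw(Dx, Dy, max_iter=25):
    """Frank-Wolfe iterations for GDTW (Algorithm 1) with squared error loss.
    Dx, Dy are the within-series pairwise distance matrices of x and y.
    Returns the GDTW value and a locally optimal alignment matrix."""
    Tx, Ty = Dx.shape[0], Dy.shape[0]
    # Initialize with a feasible alignment, e.g. the near-diagonal path that
    # DTW returns on a constant cost matrix.
    _, A = dtw(np.ones((Tx, Ty)))
    obj = np.einsum("ij,ij->", tensor_dot(Dx, Dy, A), A)
    for _ in range(max_iter):
        grad = 2.0 * tensor_dot(Dx, Dy, A)       # gradient of <L ⊗ A, A>_F
        _, S = dtw(grad)                          # linear minimization oracle (10)
        new_obj = np.einsum("ij,ij->", tensor_dot(Dx, Dy, S), S)
        if new_obj < obj:                         # optimal step size is 1
            A, obj = S, new_obj
        else:                                     # optimal step size is 0: converged
            break
    return obj, A
```

Here Dx and Dy can be built from the raw time series with, for example, scipy.spatial.distance.cdist(x, x).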
We now present a range of applications of Gromov DTW, including barycentric averaging, generative modeling and imitation learning.

To compute barycenters of Gromov DTW (9), we extend the algorithm from Peyré et al. [20] to the sequential setting. Given time series x^1, …, x^J with pairwise distance matrices D_1, …, D_J and weights λ ∈ Δ_J, and for a fixed T ∈ N (the length of the barycentric time series), the barycenter is defined as any triple (X, d_X, x) whose distance matrix satisfies

D* ∈ arg min_{D ∈ R^{T×T}} Σ_{j=1}^J λ_j GDTW(D, D_j),   (15)

where with some abuse of notation we denote GDTW purely in terms of distance matrices. The barycentric time series can then be reconstructed by applying multi-dimensional scaling (MDS) [13] to D*, and is illustrated in Figure 2. We solve (15) by rewriting it as

min_{D} Σ_{j=1}^J λ_j min_{A_j ∈ A(T, T_j)} ⟨L(D, D_j) ⊗ A_j, A_j⟩_F,   (16)

which is solved by alternating between minimizing over the A_j, j = 1, …, J, via Algorithm 1, and minimizing over D for fixed A_j. The latter step admits a closed-form solution, given as follows.

Proposition 2. If L is the squared error loss, the solution to the minimization in (16) for fixed A_j is

D* = ( Σ_{j=1}^J λ_j A_j D_j A_j^T ) / ( Σ_{j=1}^J λ_j A_j 1 1^T A_j^T ),   (17)

where the division ·/· is performed element-wise, and 1 is a vector of ones.

Proof. Appendix A.

We now use GDTW as an approach for training generative models of time series. Here, we view our dataset of time series x^1, …, x^N as an empirical measure µ. We define a generative model µ_θ = G_{θ#}ν, where ν is a latent measure, such as an isotropic Gaussian, G_θ : Z → X^T is a neural network, and G_{θ#}ν is the pushforward measure. By the nature of Gromov DTW, the generated time series do not have to live in the same space as the data. We train the model µ_θ by minimizing the entropic Wasserstein distance W_ε [5] between µ and µ_θ. For the ground cost d of W_ε, we use DTW_γ and GDTW_γ. For GDTW_γ, the objective is

W_ε(µ, µ_θ) = min_{T ∈ Π} Σ_{i,j} T_{ij} GDTW_γ(x^i, y^j_θ) + ε H(T),   (18)

where the y^j_θ are samples from µ_θ and H is the relative entropic regularization term. Following Genevay et al. [10], it is also possible to use its debiased analogue. W_ε(µ, µ_θ) is computed efficiently using the Sinkhorn algorithm [5, 23], and the objective is minimized over θ by gradient descent. This approach extends the Sinkhorn GAN of Genevay et al. [10] and the GWGAN of Bunne et al. [1] to sequential data.

We consider an imitation learning setting in which an agent needs to solve a task given the demonstration of an expert. We assume the agent has access to the true transition function T over the agent's state space X, and define a state trajectory as a time series x ∈ X^{T_x}. An expert state trajectory y_exp ∈ Y^{T_y} solving a specific task, such as traversing a maze, is given. The goal is to train the agent's parametrized policy π_θ : X → A to solve the given task by imitating the expert's behavior, where A is the action space. To find this policy, the agent uses the model of the environment to predict state trajectories x_θ under the current policy π_θ, compares these predictions with the expert's trajectory y_exp, and then optimizes the controller parameters θ to minimize the distance between the predicted agent trajectory and the observed expert trajectory. Using GDTW, our objective is

min_θ GDTW(x_θ, y_exp).   (19)

The flexibility of GDTW allows for expert trajectories defined in pixel space Y = R^{32×32}, while the agent lives in X = R². Similarly, rollouts obtained with π_θ mimic the expert's trajectory up to isometry. For comparison, we also consider DTW in (19), which aims to learn the same trajectory in the same space as the expert; this requires X = Y, as well as identical starting positions for the agent and the expert. From a reinforcement learning perspective, the use of GDTW in (19) can be interpreted as a value estimate, and gradient-based policy learning can be seen as taking value gradients [7, 11].

We now assess the effectiveness of our proposals in settings in which (i) the time series live in comparable spaces, where previous approaches apply, and (ii) the spaces are incomparable.
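A sketch of the closed-form distance-matrix update of Proposition 2 (our own illustrative code; the alignments A_j would come from Algorithm 1, e.g. the gromov_dtw sketch above, and lam is assumed to be a list of barycentric weights in the simplex).

```python
import numpy as np

def barycenter_distance_update(D_list, A_list, lam):
    """Closed-form minimization over the barycenter distance matrix D in (16)
    for fixed alignments A_j (each of shape T x T_j) and squared error loss:
        D* = (sum_j lam_j A_j D_j A_j^T) / (sum_j lam_j A_j 1 1^T A_j^T),
    with element-wise division, as in Proposition 2."""
    num = sum(l * A @ Dj @ A.T for l, Dj, A in zip(lam, D_list, A_list))
    den = sum(l * np.outer(A.sum(axis=1), A.sum(axis=1)) for l, A in zip(lam, A_list))
    return num / den
```

The resulting D can then be embedded back into points with multi-dimensional scaling, e.g. sklearn.manifold.MDS(dissimilarity='precomputed').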
Baselines. Throughout the experiments, we compare our proposal GDTW_γ to, in the settings in which they apply, DTW_γ [6, 21] and its rotationally-invariant extension DTW-GI_γ, for γ ≥ 0 [25].

We first evaluate GDTW on alignment tasks. We consider two settings in which y is obtained by applying to x (i) a rotation, and (ii) a translation followed by a rotation. DTW-GI is invariant under rotations, and is therefore expected to work in setting (i) only, whilst GDTW is invariant under isometries, and is expected to work in both. In Figure 1, we see that GDTW recovers the right alignment in both settings, while DTW-GI only works in the rotational setting 1(a), and ordinary DTW fails in both 1(a) and 1(b). Further experiments with soft DTW and soft GDTW are given in Appendix B.

We now investigate barycentric averaging with GDTW, on both toy data and the QuickDraw dataset. We compare Gromov DTW to DTW and DTW-GI, where barycenters for the latter two methods are computed using DTW barycentric averaging [18].

Toy data. In Figure 2, we see that in comparable settings DTW barycenters fail if the time series are rotated or translated. DTW-GI is robust to rotation, but fails when both rotations and translations are applied. By contrast, GDTW is robust to both, and leads to meaningful barycenters in all settings.

QuickDraw dataset. The QuickDraw dataset consists of time series of drawings in R², belonging to 345 categories. Among those categories, we selected hands, clouds, fish, and blueberries. To address the high variability within classes, we selected input data following a preprocessing routine described in Appendix B. A sample of the datasets, together with barycenters computed with DTW, DTW-GI, and GDTW, is displayed in Figure 3. While DTW and DTW-GI fail to reproduce the shape of the inputs for most classes, GDTW provides meaningful barycenters across the range of examples. This shows that GDTW is more robust in recovering the geometric shape of the time series, whilst the DTW variants are sensitive to moderate isometries.

We evaluate the generative modeling proposal of Section 4.2, and compare the behavior of the learned model when using DTW and GDTW. Here, we consider the sequential-MNIST dataset, which consists of time series of digits in R² being drawn, where each time step corresponds to a stroke. In Figure 4 we see that samples generated using GDTW as the ground cost in (18) reproduce the data more faithfully than samples generated using DTW. This can be explained by the variability in the dataset: slight translations significantly affect DTW, but not GDTW. Note that the GDTW samples are rotated and reflected, since GDTW only recovers the learned samples up to metric isometries.

We now apply Gromov DTW to the imitation learning setting of Section 4.3. Here, we are given an expert trajectory y_exp, and our goal is to find a policy π_θ such that the agent's simulated trajectory x_θ mimics y_exp. We consider maze navigation tasks in two settings: (i) both the expert trajectories and the agent's domain are X = Y = R², and (ii) expert trajectories consist of a video sequence of 32 × 32 images, giving Y = R^{32×32}, whilst the agent's domain is X = R². In the first setting, both DTW and GDTW apply, whilst in the second setting only GDTW can be used. Figure 5(h) displays the loss (19), i.e. the GDTW distance to the given trajectory, obtained by learning with GDTW and with DTW in setting (i), averaged across 20 seeds. We see that GDTW slightly outperforms DTW, and both agents recover the spiral trajectory provided by the expert.
Finally, we consider a setting in which an agent living in R² is provided with an expert trajectory y_exp consisting of a video of a car driving through a spiral, illustrated in Figures 5(a)-5(e) (before down-scaling the images). Here, the state space of the agent, X = R², differs from the state space of the expert, Y = R^{32×32}. We define the cost on image space d_Y to be the 2-Wasserstein distance, defined on images interpreted as densities on a grid, and the cost on the Euclidean space d_X to be the Euclidean distance. Figure 5(f) shows the agent's trajectory under the learned policy π_θ, and Figure 5(g) shows the loss (19) against the number of training steps. We see that, using GDTW, the agent successfully learns to solve the task despite never having access to trajectories in the space of interest.

We propose Gromov DTW, a distance between time series living on potentially incomparable spaces. The idea is to consider the intra-relational geometries of the time series under consideration, alleviating the need for a ground metric defined across spaces, which is a requirement of previous approaches. Moreover, Gromov DTW is by nature invariant under isometries, which is an important inductive bias for generalization, and makes it significantly more robust under transformations of the spaces than DTW and DTW-GI. The generality of our proposed distance enables applying it to a wide range of problems that previous approaches could not tackle, in particular when comparing time series on unregistered spaces. We considered applications ranging from alignment to barycentric averaging, generative modeling and imitation learning.

In this work, we develop techniques for aligning, averaging, and learning using multiple time series in potentially different domains. This makes it easier for practitioners to use a number of different tools in these settings. In particular, we envision this might help reduce the demand for manually labeled data in robotics. For example, one might use techniques derived from ours to train robots on expert demonstrations, without manually transcribing those demonstrations into a computer-friendly format. This can play an important role in human-robot interaction, for instance in construction or elderly care. Similarly, aligning time series can be helpful in epidemiology. For example, it could allow scientists to compare the shapes of infection curves with different starting points, thereby allowing them to compare countries at different stages of an epidemic. This could help with understanding the evolution of diseases, such as COVID-19, which would in turn benefit the general public. Finally, our work can be used in a climate science context to align time series from computer-generated numerical weather prediction and heterogeneous observational data, which might be available at different sampling frequencies. This in turn might improve the quality of data assimilation, thereby improving weather forecasting, which we hope might contribute to the United Nations' sustainable development goal 13 on climate action.

Here we develop the theory of Gromov dynamic time warping distances. We begin by introducing the necessary preliminaries.

Definition 3 (Time series). Let (X, d_X) be a compact metric space, and let I_X = {1, 2, …, T_X} ⊂ N. We call a finite sequence x : I_X → X a TIME SERIES. Let 𝕏 be the space of all time series.

Definition 4. Let x and y be time series. Define a premetric D : X × Y → R, which we call the COST.
Define the m × n COST MATRIX D ∈ R^{m×n} by D_{ij} = D(x_i, y_j).

Definition 5. We say that a binary matrix A is an ALIGNMENT MATRIX if A_{11} = 1, A_{mn} = 1, and A_{ij} = 1 implies that exactly one of A_{i−1,j} = 1, A_{i,j−1} = 1, and A_{i−1,j−1} = 1 holds. Let A(m, n) be the set of ALIGNMENT MATRICES.

Definition 6 (Dynamic time warping). Let x and y be time series. Define the DYNAMIC TIME WARPING distance by

DTW(x, y) = min_{A ∈ A(m, n)} ⟨D, A⟩_F,

where ⟨·, ·⟩_F is the Frobenius inner product over real matrices.

Proposition 7. If D is a premetric, then DTW : 𝕏 × 𝕏 → R is a premetric on the space of time series. If we take the cost to be the metric d_X, then DTW : 𝕏 × 𝕏 → R is a symmetric premetric on 𝕏.

Proof. Lemire [15].

A premetric induces a Hausdorff topology on the set it is defined over, and so is suitable for many of the purposes that ordinary metrics are used for. To proceed along the path suggested by the Gromov-Hausdorff and Gromov-Wasserstein distances over metric-measure spaces, we need to define the time series analog.

Definition 8. Define a METRIC SPACE EQUIPPED WITH A TIME SERIES to be a triple (X, d_X, x).

Definition 9. Let (X, d_X, x) and (Y, d_Y, y) be metric spaces equipped with time series, and let x̃ and ỹ denote x and y with consecutive repeated elements removed. Define X|_x = {x' ∈ X : x' ∈ img x̃}, and Y|_y similarly, and equip both sets with their respective subset metrics. We say that (X, d_X, x) and (Y, d_Y, y) are ISOMORPHIC if there is a metric isometry φ : X|_x → Y|_y such that φ(x̃_i) = ỹ_i.

At this stage it is not clear whether the class of all such triples under isometry forms a set, or is instead a proper class. To avoid set-theoretic complications, we need the following technical result. It follows immediately that the class of all metric spaces equipped with time series is a set, provided that identification by isometry extends to the time series. We are now ready to define GDTW.

Definition 11. Let L be a premetric on R_+, and define L ∈ R^{m×n×m×n} by L_{ijkl} = L( d_X(x_i, x_k), d_Y(y_j, y_l) ). Define the GROMOV DYNAMIC TIME WARPING distance by

GDTW( (X, d_X, x), (Y, d_Y, y) ) = min_{A ∈ A(m, n)} ⟨L ⊗ A, A⟩_F,

where (L ⊗ A)_{ij} = Σ_{kl} L_{ijkl} A_{kl}.

Proposition 12. GDTW is a premetric on the set of all metric spaces equipped with time series.

Proof. We check the conditions. Non-negativity is immediate by definition. It also follows immediately that if (X, d_X, x) and (Y, d_Y, y) are isomorphic, then GDTW between them is zero. Conversely, suppose GDTW is zero, so that there is an alignment matrix A with Σ_{ijkl} L( d_X(x_i, x_k), d_Y(y_j, y_l) ) A_{ij} A_{kl} = 0, where all elements of the last sum are non-negative, and hence every term vanishes. Suppose without loss of generality that x and y contain no duplicate elements. We argue inductively that the optimal A is the identity matrix.
1. First, note that A_{11} = 1 by definition of A.
2. Now, consider A_{21}. If we suppose A_{21} = 1, then we must have L_{2111} = 0, and hence d_X(x_2, x_1) = d_Y(y_1, y_1) = 0. But then x_2 = x_1, contradicting the assumption that there are no duplicates. Hence, A_{21} = 0.
3. By mirroring the above argument, A_{12} = 0. Hence, by definition of A, the only remaining possibility is A_{22} = 1. Inductively, we conclude A_{ii} = 1 for all i, and A_{ij} = 0 for i ≠ j.
4. Finally, since the lower-right corner of A also has to equal one by definition, it follows that A is the square identity matrix.
Hence A_{ij} = 1 and A_{kl} = 1 if and only if i = j and k = l. Plugging this into the previous equality yields d_X(x_i, x_k) = d_Y(y_i, y_k) for all i, k, which together with the diagonal A gives the isomorphism. Finally, to see that the lack of duplicates truly is assumed without loss of generality, note that if there are duplicates in x and y, then we apply the above argument to the x̃ and ỹ of Definition 9, which no longer contain duplicates. The claim follows.

One can easily see that GDTW will be symmetric if L is symmetric.
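As a quick numerical sanity check of the invariance and identity-of-indiscernibles discussion above, one can reuse the illustrative dtw and gromov_dtw sketches from earlier (hypothetical helpers, not the authors' code): an isometric copy of a trajectory leaves the GDTW objective at (numerically) zero, while the plain DTW cost between the two is large.

```python
import numpy as np
from scipy.spatial.distance import cdist

# A toy 2D spiral and an isometric (rotated + translated) copy of it.
t = np.linspace(0, 4 * np.pi, 50)
x = np.stack([t * np.cos(t), t * np.sin(t)], axis=1)
theta = 1.3
R = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
y = x @ R.T + np.array([5.0, -2.0])

print(dtw(cdist(x, y))[0])                       # large: DTW is not isometry-invariant
print(gromov_dtw(cdist(x, x), cdist(y, y))[0])   # ~0: intra-relational geometries match
```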
Since DTW itself does not satisfy a triangle inequality [15], GDTW will not satisfy it either.

We formulate an algorithm for computing GDTW within the Frank-Wolfe (FW) framework [9]. These algorithms tackle problems of the form

min_{A ∈ C} G(A),

where G : R^d → R is the objective to be minimized, assumed differentiable with Lipschitz continuous gradient, and C ⊆ R^d is the (usually convex) constraint set. We first describe the Frank-Wolfe algorithm in its general form. Let A^{(0)} ∈ C be the initial point. At each iteration t:
1. Solve the linear minimization oracle S^{(t)} ∈ arg min_{S ∈ C} ⟨∇G(A^{(t)}), S⟩.
2. Find the optimal step size η^{(t)} = arg min_{η ∈ [0,1]} G( A^{(t)} + η(S^{(t)} − A^{(t)}) ).
3. Perform the update A^{(t+1)} = A^{(t)} + η^{(t)} (S^{(t)} − A^{(t)}).

We now describe the algorithm in our setting. The objective is

min_{A ∈ A(T_x, T_y)} G(A), with G(A) = ⟨L ⊗ A, A⟩_F.

The constraint domain A(T_x, T_y) in our case is not convex. Convexity is usually a requirement of Frank-Wolfe algorithms, but we derive a result in the sequel that enables us to bypass this requirement.

Step 1: linear minimization oracle. We note that the gradient is of the form ∇G(A^{(t)}) = 2 (L ⊗ A^{(t)}), and step 1 thus consists in solving

arg min_{S ∈ A(T_x, T_y)} ⟨L ⊗ A^{(t)}, S⟩_F.

This can be minimized exactly in O(T_x T_y) time by plugging D = L ⊗ A^{(t)} into the DTW objective (1) and solving via dynamic programming.

Step 2: optimal step size.

Proposition 13. The optimal step size in the line-search step above is either 0 or 1.

Proof. This follows by applying the argument of Chapel et al. [4], who derive a similar result in the Gromov-Wasserstein setting, with one minor modification. In their equation (9), an optimal-transport-based argument is used to obtain an inequality; in our setting, an analogous inequality holds for DTW, given by

⟨L ⊗ A^{(t)}, S^{(t)}⟩_F ≤ ⟨L ⊗ A^{(t)}, A^{(t)}⟩_F,

where S^{(t)} is given by dynamic programming. The claim follows.

We observe from Proposition 13 that the optimal step size is 0 or 1: if the proposal S^{(t)} improves on the objective, then η^{(t)} = 1, otherwise η^{(t)} = 0.

Step 3: iterative updates. We set A^{(t+1)} = A^{(t)} + η^{(t)} (S^{(t)} − A^{(t)}). Frank-Wolfe algorithms typically require convexity of the constraint set C, since otherwise the iterates A^{(t+1)} = A^{(t)} + η^{(t)} (S^{(t)} − A^{(t)}) might escape from the constraint domain. In our setting, since the optimal step size is either 0 or 1, this never happens: A^{(t+1)} is equal to A^{(t)} or S^{(t)}, both of which belong to the constraint set, and convexity is not needed to guarantee that the iterates remain in the constraint domain. Summarizing, we obtain the following.

Proposition 14. Algorithm 1 for computing GDTW converges to a stationary point.

Proof. The result follows by a minor modification of Theorem 1 of Lacoste-Julien [14], which proves convergence of the Frank-Wolfe algorithm for possibly non-convex optimization objectives over convex constraint sets. Here, we use Proposition 13 instead of convexity to ensure the iterates remain in the constraint set.

Barycenter computation.

Proposition 15. If L is the squared error loss, the solution to the minimization in (16) for fixed A_j is

D* = ( Σ_{j=1}^J λ_j A_j D_j A_j^T ) / ( Σ_{j=1}^J λ_j A_j 1 1^T A_j^T ),

where the division ·/· is performed element-wise, and 1 is a vector of ones.

Proof. If L is the squared error loss, then, up to terms that do not depend on D, (16) can be written for fixed A_j as

min_D Σ_{j=1}^J λ_j [ ⟨ D ⊙ D, A_j 1 1^T A_j^T ⟩_F − 2 ⟨ D, A_j D_j A_j^T ⟩_F ],

where ⊙ is element-wise matrix multiplication. Differentiating the objective with respect to D and setting it equal to 0, we get

D ⊙ ( Σ_{j=1}^J λ_j A_j 1 1^T A_j^T ) = Σ_{j=1}^J λ_j A_j D_j A_j^T,

which, dividing both sides element-wise, gives the result.

Alignments. In Figures 7-10, we provide further alignment experiments. Here, we set the entropic term γ to 1 for soft alignments, and we use normalized distance matrices. We observe that GDTW and soft GDTW are robust to scaling, rotations and translations, whilst DTW and soft DTW are sensitive to rotations and translations. Finally, DTW-GI is robust to rotations, but sensitive to translations, which further corroborates the observations from Figure 1.
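For the soft alignments used above, the hard minimum in the DTW recursion is replaced by a soft-min. A minimal illustrative sketch of the smoothed forward pass (our own code, with hypothetical function names):

```python
import numpy as np

def soft_min(values, gamma):
    """Smoothed minimum: -gamma * log(sum(exp(-v / gamma))), the soft-min
    operator of Cuturi and Blondel [6]; recovers min as gamma -> 0."""
    v = np.asarray(values) / -gamma
    m = v.max()
    return -gamma * (m + np.log(np.exp(v - m).sum()))

def soft_dtw(D, gamma=1.0):
    """Soft DTW value for a cost matrix D; for soft GDTW iterations, D would
    be replaced by L ⊗ A as in the main text."""
    Tx, Ty = D.shape
    R = np.full((Tx + 1, Ty + 1), np.inf)
    R[0, 0] = 0.0
    for i in range(1, Tx + 1):
        for j in range(1, Ty + 1):
            R[i, j] = D[i - 1, j - 1] + soft_min(
                [R[i - 1, j - 1], R[i - 1, j], R[i, j - 1]], gamma)
    return R[Tx, Ty]
```

The soft alignment matrix in (4) can then be obtained by differentiating soft_dtw with respect to D, e.g. via automatic differentiation, in line with (14).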
QuickDraw barycenters. In this experiment, we compute barycenters of 30 elements from each of 4 QuickDraw classes with respect to DTW, DTW-GI and GDTW.

Data selection and pre-processing. The classes considered in the experiment are fish, blueberries, clouds and hands. The variability in each class of QuickDraw is extremely high: we created datasets of 30 elements such that the category of each element is straightforward to recognize, such that each element is drawn with a single stroke, and such that the elements share a common style. The full datasets are displayed in Figure 6. Before running the algorithms, we rescaled the data, applying the transformation x → (x − min(x))/max(x) to each data point. Finally, we down-sampled the length of the time series, reducing it by 1/3 for hands and by 1/2 for fish, clouds and blueberries.

Algorithms. For GDTW barycenters, we applied our algorithm of Section 4.1, using the entropy-regularized version of GDTW with γ = 1. For DTW and DTW-GI, we used standard DBA procedures. For both algorithms, we set the barycentric length to 60 for fish and hands and to 40 for clouds and blueberries. Also, we set the maximum number of FW iterations for GDTW to 25, and the number of DTW-GI iterations to 30.

Generative modeling. In this experiment, we use the Sinkhorn divergence objective. We use a latent dimension of 15, and the generator is a 4-layer MLP with 1000 neurons per layer. The length of the generated time series is set to T = 40, and the dimension of the space is p = 2, so the MLP's output dimension is T × p = 80. We set the batch size to 25. We use the ADAM optimizer, with β = (0.5, 0.99) and the learning rate set to 5 × 10^−5. We set γ = 1, and the maximum number of iterations in the GDTW computation to 10. We use the sequential MNIST dataset and normalize the data, which consists of time series in R², into the unit square.

Imitation learning. In this experiment, we use a two-layer MLP policy, with an input dimension of dim(X), a hidden dimension of 64, and an output dimension of 2. The learning rate is set to 5 × 10^−5, and we use the ADAM optimizer with β = (0.5, 0.99). In the video/2D experiment, the ground cost for the video is the entropic 2-Wasserstein distance, computed efficiently using GEOMLOSS [8], and the ground cost on the 2D space is the squared error loss. We plot mean scores along with standard deviations (across 20 random seeds).
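A sketch of a generator with the settings listed above for the generative modeling experiment (our own PyTorch illustration; the ReLU activations and the exact layer arrangement are assumptions beyond the stated "4-layer MLP with 1000 neurons per layer").

```python
import torch
import torch.nn as nn

T, p, latent_dim, batch_size = 40, 2, 15, 25

# 4-layer MLP generator with 1000 units per hidden layer; the output is
# reshaped into a length-T time series in R^p.
generator = nn.Sequential(
    nn.Linear(latent_dim, 1000), nn.ReLU(),
    nn.Linear(1000, 1000), nn.ReLU(),
    nn.Linear(1000, 1000), nn.ReLU(),
    nn.Linear(1000, T * p),
)
optimizer = torch.optim.Adam(generator.parameters(), lr=5e-5, betas=(0.5, 0.99))

z = torch.randn(batch_size, latent_dim)            # latent samples from ν
fake_series = generator(z).view(batch_size, T, p)  # generated time series
```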
References.
[1] Learning Generative Models Across Incomparable Spaces
[2] Efficient Retrieval of Similar Time Sequences Under Time Warping
[3] Foundations of Mathematical Economics
[4] Partial Gromov-Wasserstein with Applications on Positive-Unlabeled Learning
[5] Sinkhorn Distances: Lightspeed Computation of Optimal Transport
[6] Soft-DTW: A Differentiable Loss Function for Time-Series
[7] Value-Gradient Learning
[8] Interpolating Between Optimal Transport and MMD Using Sinkhorn Divergences
[9] An Algorithm for Quadratic Programming
[10] Learning Generative Models with Sinkhorn Divergences
[11] Learning Continuous Control Policies by Stochastic Value Gradients
[12] Using Dynamic Time Warping Distances as Features for Improved Time Series Classification. Data Mining and Knowledge Discovery
[13] Multidimensional Scaling
[14] Convergence Rate of Frank-Wolfe for Non-Convex Objectives
[15] Faster Retrieval with a Two-Pass Dynamic-Time-Warping Lower Bound
[16] Gromov-Wasserstein Distances and the Metric Approach to Object Matching
[17] Envelope Theorems for Arbitrary Choice Sets
[18] Summarizing a Set of Time Series by Averaging: From Steiner Sequence to Compact Multiple Alignment
[19] Computational Optimal Transport. Foundations and Trends in Machine Learning
[20] Gromov-Wasserstein Averaging of Kernel and Distance Matrices
[21] Dynamic Programming Algorithm Optimization for Spoken Word Recognition
[22] Nonsmooth Analysis and Subgradient Methods for Averaging in Dynamic Time Warping Spaces. Pattern Recognition
[23] Diagonal Equivalence to Matrices with Prescribed Row and Column Sums
[24] Entropic Metric Alignment for Correspondence Problems. SIGGRAPH
[25] Time Series Alignment with Global Invariances
[26] Optimal Transport for Structured Data with Application on Graphs
[27] Optimal Transport: Old and New
[28] Scalable Gromov-Wasserstein Learning for Graph Partitioning and Matching
[29] Gromov-Wasserstein Learning for Graph Matching and Node Embedding

Acknowledgments. We are grateful to K. S. Sesh Kumar for ideas on the Frank-Wolfe algorithm. SC was supported by the Engineering and Physical Sciences Research Council (grant number EP/S021566/1).