title: Distributed Optimization with Quantization for Computing Wasserstein Barycenters authors: Krawtschenko, Roman; Uribe, César A.; Gasnikov, Alexander; Dvurechensky, Pavel date: 2020-10-27 We study the problem of the decentralized computation of entropy-regularized semi-discrete Wasserstein barycenters over a network. Building upon recent primal-dual approaches, we propose a sampling gradient quantization scheme that allows efficient communication and computation of approximate barycenters where the factor distributions are stored distributedly on arbitrary networks. The communication and algorithmic complexity of the proposed algorithm are shown, with explicit dependency on the size of the support, the number of distributions, and the desired accuracy. Numerical results validate our algorithmic analysis. Optimal transport (OT) has become an important part of modern machine learning for its ability to take into account the geometry of the data for computations. Applications range from image retrieval [54] and image classification [16] to Generative Adversarial Networks [3]. An immediate use of Wasserstein distances is the definition of Fréchet means of distributions [1, 52], which is usually called the Wasserstein barycenter (WB) [8, 28]. Informally, the WB allows us to define, for example, an average image when images are interpreted as discrete probability distributions. Such flexibility, alongside the geometric and statistical properties of the WB, has led to a large number of applications, e.g., image morphing and image interpolation of natural images [57], averaging atmospheric gas concentration data [6], graph representation learning [58], fairness in ML [13], geometric clustering [48], Bayesian learning [4], stain normalization and augmentation [49], probability and density forecast combination [15], multimedia analysis and fusion [36], unsupervised multilingual alignment [44], clustering patterns for COVID-19 dynamics [51], channel pruning [56], and many others. As OT and, in particular, the WB framework are used in machine learning applications, the scale of the problem increases as well. Thus, the geometric and statistical advantages of the WB come with a high computational cost. For example, calculating the WB of two images of one million pixels translates into a large-scale optimization problem, where the definition of the Wasserstein distance itself contains a minimization problem with approximately $10^{12}$ variables. Approximating the WB with high accuracy requires processing a large number of samples to get good statistical estimates, and the dependencies on the distribution support, the number of distributions, and the desired accuracy are usually high [10]. Therefore, the complexity and scalability of approximating the WB are also a main research thrust within the ML community [29, 45, 53], where some heuristics [11], a strongly-polynomial 2-approximation [9], fast computation [34, 32], saddle-point [61], and proximal methods [60, 65] have been proposed, as well as approaches based on multimarginal optimal transport [46, 62]. In addition to scale, modern ML applications require other considerations. In particular, decentralized approaches [59] have recently emerged over centralized ones due to their ability to take into account data ownership, privacy, fault tolerance, and scalability, e.g., parameter server [17], federated learning [40, 47, 37], among others.
In this paper, we focus on the • communication-efficient, • stochastic (semi-discrete), and • decentralized computation of Wasserstein barycenters. Decentralized, because we assume it is not possible to store the whole dataset on one machine, and the probability distributions are stored locally on a number of agents or workers connected over some arbitrary network. The network structure defines limited communication capabilities between the nodes. Semi-discrete, because in contrast with most of the available WB computation approaches, we assume the base probability distributions are continuous, while we want to recover their best discrete (finite) barycenter. Moreover, the semi-discrete setup implies that agents or machines are oblivious to their local probability distribution and can only learn about it from realizations of an associated random variable. Finally, communication-efficient, because we exploit the fact that the gradients of the associated dual formulation [55, 64] of the WB lie in the probability simplex, i.e., they are discrete distributions. Thus, we propose a quantized communication approach that drastically reduces the network's communication load by sampling from such a discrete distribution, creating a sparse approximation of the stochastic gradient. Applied to the WB problem, this means that instead of sending a whole barycenter estimate (e.g., an image) in each iteration, agents share a histogram built from a finite number of samples drawn from their current WB estimate. The main contributions of this paper are:
• We propose a general quantized communication-based algorithm for the distributed computation of semi-discrete WB over networks.
• We analyze the communication and computational complexity of the proposed algorithm in two setups: 1) the number of samples from the distribution and the number of samples from the gradient are allowed to increase with time; 2) the number of samples is fixed to some predefined constant.
• Our specific results for the computation of WB are based on a new general primal-dual analysis of the accelerated stochastic gradient descent method, for which we provide a new convergence rate and sample complexity analysis.
• We provide numerical experiments that empirically support our theoretical findings on the convergence of the proposed algorithm.
Related works: Efficient gradient quantization was studied in [35, 39, 38]. However, only non-accelerated stochastic gradient approaches are available there. A similar approach to ours was recently studied in [5], which uses SGD (stochastic gradient descent) and a sampling scheme in the discrete setting. In contrast, we use AGM (accelerated gradient method) instead of SGD, and we study the WB problem in the semi-discrete setting, giving precise computation and communication complexities in terms of the dimension n of the barycenter space, the number of considered distributions m, and the network architecture. The centralized semi-discrete WB [14] was studied in [59], where the convergence of discrete approximations of the continuous measures was analyzed. The communication and computation complexity of the decentralized discrete WB was studied in [63], and the corresponding semi-discrete analysis is available in [21]; both approaches use no quantization for communication efficiency. Empirical studies of the approximation of the fully continuous barycenter have recently become available in [43].
To the best of the authors' knowledge, the best estimate of the complexity of computing the WB in the centralized discrete setup is available in [20]. This paper is organized as follows. Section 2 introduces the problem of distributed WB computation. Section 3 analyzes the primal-dual properties of the WB problem. Section 4 presents the general primal-dual accelerated gradient method. Section 5 states our proposed algorithm and main results. Section 6 shows the numerical results. Finally, Section 7 provides conclusions and future work. Notations: We denote by $\mathcal{M}^1_+(\mathcal{X})$ the set of positive Radon probability measures on a metric space $\mathcal{X}$. $Y \sim \mu$ means that the random variable $Y$ is distributed according to the measure $\mu$. Let $e_i$ denote the $i$-th unit vector. Let $W$ be a positive semi-definite matrix with maximal and minimal nonzero eigenvalues denoted by $\lambda_{\max}(W)$ and $\lambda^+_{\min}(W)$. We denote the condition number of a matrix $W$ by $\chi(W) \triangleq \lambda_{\max}(W)/\lambda^+_{\min}(W)$. $S_1(n) \triangleq \{p \in \mathbb{R}^n_+ : \sum_{l=1}^n p_l = 1\}$ denotes the probability simplex of dimension $n$. $\tilde O(\cdot)$ denotes the complexity up to polylogarithmic factors. $\delta$ denotes the Dirac mass and $[p]_i$ denotes the $i$-th component of the vector $p$. $\otimes$ denotes the Kronecker product of matrices and $\odot$ denotes the component-wise vector multiplication. $\mathrm{nnz}(x)$ denotes the number of non-zero elements of a vector $x$. Given a matrix $W$ with rows $W_i$, we write $\kappa(W) \triangleq \sum_{i=1}^m \mathrm{nnz}(W_i)$. In this section, we describe the problem of the distributed computation of the entropy-regularized semi-discrete Wasserstein barycenter. Moreover, we present a summary of the main results of this paper regarding the communication, sample, and arithmetic complexity of the proposed algorithm. Consider a network of $m$ agents defined over a connected, fixed, and undirected graph $G = (V, E)$, where $V = \{1, \cdots, m\}$ is the set of agents and $E$ is the set of edges, such that $(i, j) \in E$ if agent $i$ and agent $j$ can communicate with each other. We assume that the graph $G$ does not have self-loops. Moreover, at each time $k \ge 0$, each agent $i$ has access to realizations or samples (the number of samples will be specified later) from a stationary random process with $Y_i \sim \mu_i$, where $\mu_i \in \mathcal{M}^1_+(\mathcal{Y})$ is a continuous probability distribution with density $q_i(y)$ on a metric space $\mathcal{Y}$, see Figure 1. (Figure 1: A network of agents, where each agent is able to sample from a different distribution, i.e., agent $i \in V$ samples from $\mu_i$.) The group of agents tries to collaboratively compute the entropy-regularized semi-discrete Wasserstein barycenter, defined as the discrete probability distribution $\nu = \sum_{i=1}^n [p^*]_i \delta(z_i)$ on a fixed support $z_1, \ldots, z_n \in \mathcal{Z}$, where $p^* \in S_1(n)$ solves $$\min_{p \in S_1(n)} \sum_{i=1}^m \omega_i\, W_{\gamma,\mu_i}(p). \qquad (1)$$ Without loss of generality we set $\omega_i = 1/m$. Moreover, we denote $W_{\gamma,\mu}(p) \triangleq W_\gamma(\mu, p)$ for a fixed probability measure $\mu$, where $W_\gamma(\mu, p)$ is the entropy-regularized semi-discrete Wasserstein distance $$W_\gamma(\mu, p) \triangleq \min_{\pi \in \Pi(\mu, p)} \int_{\mathcal{Y} \times \mathcal{Z}} c(z, y)\, d\pi(z, y) + \gamma\, \mathrm{KL}(\pi \,\|\, \xi), \qquad (2)$$ where $\xi$ is the density of the uniform distribution on $\mathcal{Y} \times \mathcal{Z}$, $c_i(y) = c(z_i, y)$ denotes the cost of transporting a unit of mass from point $z_i \in \mathcal{Z}$ to point $y \in \mathcal{Y}$, $\Pi(\mu, p)$ is the set of admissible coupling measures $\pi$ on $\mathcal{Y} \times \mathcal{Z}$ whose marginals on $\mathcal{Y}$ and $\mathcal{Z}$ are $\mu$ and $p$, respectively, and $\gamma$ is the regularization parameter. Note that unlike [31], we regularize the problem by the Kullback-Leibler divergence from the uniform distribution $\xi$, which allows us to find explicitly the Fenchel conjugate of $W_\gamma(\mu, p)$, to be defined later in Lemma 2. We are interested in the distributed computation of approximate solutions of (1), such that each agent finds an approximate distribution $\hat p_i$ satisfying an $\varepsilon$-accuracy condition of the form (3), and $\hat p_1, \cdots, \hat p_m$ are close to each other in a sense to be defined soon.
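As a concrete illustration of the semi-discrete setup just described, the following sketch (our own illustration, not part of the paper) builds the per-sample cost vectors $c_i(y) = c(z_i, y)$ from a fixed support and a batch of samples from $\mu$; the squared-Euclidean ground cost and all sizes are assumptions made only for this example, since the formulation above does not require metric costs.

```python
import numpy as np

def cost_columns(z, y_batch):
    """Ground costs c_i(y) = ||z_i - y||^2 for a batch of samples y ~ mu.

    z:       (n, d) array of fixed support points z_1, ..., z_n.
    y_batch: (M, d) array of samples drawn from the continuous measure mu.
    Returns an (M, n) matrix whose r-th row is (c_1(Y_r), ..., c_n(Y_r)).
    """
    diff = y_batch[:, None, :] - z[None, :, :]   # (M, n, d) pairwise differences
    return np.sum(diff ** 2, axis=-1)            # squared Euclidean costs (illustrative choice)

# Toy usage: support on the unit square, samples from a Gaussian mu (all sizes illustrative).
rng = np.random.default_rng(0)
n, d, M = 16, 2, 8
z = rng.uniform(size=(n, d))                     # fixed support of the barycenter
y = rng.normal(loc=0.5, scale=0.2, size=(M, d))  # samples Y_r ~ mu; the density itself is never needed
C = cost_columns(z, y)
p = np.full(n, 1.0 / n)                          # a candidate barycenter weight vector in S_1(n)
assert np.isclose(p.sum(), 1.0)
```

The point of the semi-discrete setting is visible here: the continuous measure $\mu$ enters only through its samples, while the barycenter lives on the fixed finite support $z_1, \ldots, z_n$.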
Also, note that with an appropriate selection of the regularization parameter $\gamma$, one can also compute an approximate solution to the non-regularized Wasserstein barycenter, as described in the next proposition. Proposition 1. Assume that the space $\mathcal{Y} \times \mathcal{Z}$ is compact with volume $\Omega \triangleq \mathrm{Vol}(\mathcal{Y} \times \mathcal{Z})$ and that there exists $\hat p$ satisfying (3) with $\gamma = \bar\gamma \triangleq \varepsilon/(4\log(\Omega))$. Then, $\hat p$ is also an approximate solution of the non-regularized Wasserstein barycenter problem. For simplicity, we refer to (2) as the regularized Wasserstein distance. However, we want to emphasize that our algorithm does not rely on any specific choice of the cost $c_i(y)$ and can be used for Monge-Kantorovich distances where the transportation costs do not have to satisfy the metric properties. The objective of the group of agents is to solve (1). However, the distributed structure induced by the network $G$ imposes communication constraints that we need to consider. We assume that at each time $k \ge 0$, each agent $i \in V$ has access to samples from the distribution $\mu_i$ only, i.e., agent $i$ has no information about $\mu_j$ for $j \ne i$. Therefore, cooperation is needed. Such cooperation is modeled as the ability of the agents to share information over the set of edges $E$. In Table 1, we informally summarize the main results of this paper. We also describe the communication and the computational complexity of the proposed algorithm, the quantized distributed primal-dual stochastic accelerated gradient method Q-DecPDSAG. We compare the obtained bounds with the non-quantized version, denoted DecPDSAG [21]. The specifics of the proposed quantization scheme will be described in Section 3.2. Moreover, we compare two different regimes for Q-DecPDSAG, namely, having an increasing number of samples per iteration and having a constant number of samples per iteration. We analyze the computational complexity of Q-DecPDSAG in terms of arithmetic operations and the total communication complexity, i.e., the total number of coordinates sent per node over all the iterations of the algorithm, for general graphs as well as for the specific architecture where we use expander graphs. In the next section, we study the primal-dual properties of the WB problem. We specifically exploit the dual problem structure and propose an effective sampling strategy for the construction of estimates of the stochastic gradient. This will be the main building block of our algorithm. (Table 1: Informal summary of the main results: the number of bits sent at iteration $k$, the total number of bits sent, and the total number of arithmetic operations for Q-DecPDSAG with increasing and constant batch sizes and for the non-quantized DecPDSAG, for general graphs and for expander graphs.) In this section, we build upon the dual properties of the entropy-regularized Wasserstein distance to develop a distributed algorithm that allows for the computation of a Wasserstein barycenter. A commonly used way to introduce the distributed structure of the problem is to reformulate (1) so as to take into account the constraints imposed by the graph $G$ explicitly. To do so, we define the Laplacian matrix $\bar W \in \mathbb{R}^{m \times m}$ of the graph $G$ by $[\bar W]_{ij} = -1$ if $(i, j) \in E$, $[\bar W]_{ii} = \deg(i)$, and $[\bar W]_{ij} = 0$ otherwise, where $\deg(i)$ is the number of neighbors of node $i$, i.e., the degree of node $i$. Similarly, we define the auxiliary matrix $W \triangleq \bar W \otimes I_n$, which takes into account the dimension $n$ of the decision variable. Therefore, assuming that the graph $G$ is undirected and connected, $W$ is symmetric and positive semidefinite. Moreover, the matrix $W$ inherits the spectral properties of $\bar W$.
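For concreteness, the following minimal sketch (an illustration under our own assumptions, not the authors' code) builds $\bar W$ for a small cycle graph, forms $W = \bar W \otimes I_n$, and computes the condition number $\chi(W) = \lambda_{\max}(W)/\lambda^+_{\min}(W)$ that appears in the complexity bounds; the graph and the sizes are chosen arbitrarily.

```python
import numpy as np

def graph_laplacian(m, edges):
    """Laplacian matrix of an undirected graph on m nodes without self-loops."""
    L = np.zeros((m, m))
    for i, j in edges:
        L[i, j] -= 1.0
        L[j, i] -= 1.0
        L[i, i] += 1.0
        L[j, j] += 1.0
    return L

# Cycle graph on m = 5 agents, decision dimension n = 3 (sizes chosen only for illustration).
m, n = 5, 3
edges = [(i, (i + 1) % m) for i in range(m)]
W_bar = graph_laplacian(m, edges)        # the matrix denoted \bar W in the text
W = np.kron(W_bar, np.eye(n))            # W = \bar W Kronecker I_n

eigs = np.linalg.eigvalsh(W)             # W is symmetric positive semidefinite
lam_max = eigs[-1]
lam_min_pos = eigs[eigs > 1e-10][0]      # smallest nonzero eigenvalue, lambda^+_min(W)
chi = lam_max / lam_min_pos              # condition number chi(W)
print(f"chi(W) = {chi:.3f}")
```

Since the Kronecker product with $I_n$ only replicates the spectrum of $\bar W$, $\chi(W) = \chi(\bar W)$, so the network topology enters the bounds only through this quantity.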
More importantly, the Laplacian matrix has the property that $\sqrt{W}\,\mathbf{p} = 0$ if and only if $p_1 = \cdots = p_m$, which allows us to rewrite (1) as the equivalent linearly constrained problem $$\min_{\mathbf{p} = (p_1^\top, \ldots, p_m^\top)^\top,\; p_i \in S_1(n),\; \sqrt{W}\mathbf{p} = 0}\; \frac1m \sum_{i=1}^m W_{\gamma,\mu_i}(p_i). \qquad (5)$$ Given that (5) is an optimization problem with linear constraints, we introduce a vector of dual variables $\lambda = [\lambda_1^\top, \ldots, \lambda_m^\top]^\top \in \mathbb{R}^{mn}$. Then, the Lagrangian dual problem for (1) can be written as the minimization over $\lambda$ of the smooth dual function $W^*_\gamma(\lambda)$, built from the Fenchel-Legendre conjugates $W^*_{\gamma,\mu_i}(\cdot)$; we refer to this dual problem as (6). The next two auxiliary lemmas state properties of $W^*_{\gamma,\mu_i}(\cdot)$ which will be useful for our analysis [21]. In particular, $W^*_{\gamma,\mu_i}(\cdot)$ is a smooth function with Lipschitz-continuous gradient and can be expressed as an expectation of a function of an additional random argument. Lemma 2 (see [21]). Given a positive Radon probability measure $\mu \in \mathcal{M}^1_+(\mathcal{Y})$ with density $q(y)$ on a metric space $\mathcal{Y}$, the Fenchel-Legendre dual function of $W_{\gamma,\mu}(p)$ can be written as an expectation, with respect to $Y \sim \mu$, of an explicit smoothed (log-sum-exp) function of $\lambda$ and the costs $c_l(Y)$. Moreover, $W^*_{\gamma,\mu}(\lambda)$ has $m/\gamma$-Lipschitz gradients w.r.t. the 2-norm, and its $l$-th coordinate, for $l = 1, \ldots, n$, is the expectation over $Y \sim \mu$ of the softmax-type probability $$[p(\lambda, Y)]_l = \frac{\exp\big((\bar\lambda_l - c_l(Y))/\gamma\big)}{\sum_{j=1}^n \exp\big((\bar\lambda_j - c_j(Y))/\gamma\big)},$$ where we denoted $\bar\lambda_j = m[\sqrt{W}\lambda]_j$. Lemma 2 states that the dual problem (6) is a smooth stochastic convex optimization problem. This is in contrast to [42], where the primal problem is a stochastic optimization problem. This will be the main observation that allows us to propose an efficient sampling and communication strategy to find an approximate solution of (6), and of (1), respectively. More importantly, the gradient of the dual function can be computed in a distributed manner over the network, where each agent $j \in V$ computes $\nabla W^*_{\gamma,\mu_j}(\lambda_j)$ using local information only. Then, the gradient is shared with neighbors following the graph topology, and a full gradient step can be taken. Next, we describe our sampling and quantization approach. We take advantage of the smooth stochastic form of the dual problem, for which each agent can obtain an estimate of the gradient based on samples from the random variables $Y_i$. Moreover, such an approximate gradient will be a probability distribution itself. Thus, instead of communicating the full gradient, an agent can share a histogram of samples drawn from the approximate gradient itself. We build our main result on the idea of double sampling of the gradients. In [21], the authors propose to use sampling from the measures $\mu_i$ to approximate the gradients of the dual problem. Given that each agent $i$ can obtain at each iteration $M_1$ samples of the random variable $Y^i_k \sim \mu_i$, one can define an approximate gradient as $$\widehat\nabla W^*_{\gamma,\mu_i}(\lambda_i) = \frac{1}{M_1} \sum_{r=1}^{M_1} p(\lambda_i, Y^i_r), \qquad (8)$$ where, for all $l = 1, \cdots, n$, the vector $p(\lambda_i, Y^i_r) \in S_1(n)$ is the softmax-type probability vector of Lemma 2. However, such an approximate gradient will be a dense vector of dimension $n$. For the case where $n$ is large, this might be prohibitively expensive in terms of communications. Thus, we take advantage of the fact that $\nabla W^*_{\gamma,\mu}(\lambda) \in S_1(n)$ to propose a strategy that guarantees lower communication costs. We propose to use the approximate stochastic gradient defined in the next lemma. Lemma 3. Let each agent $i \in V$, at iteration $k$, take $M_{i,1}$ samples of the random variable $Y_i$ and build a stochastic gradient as defined in (8). Now, let each agent $i \in V$ take $M_{i,2}$ samples of a discrete random variable $\xi$ on $l \in \{1, \ldots, n\}$ with $\xi = l$ with probability $[\widehat\nabla W^*_{\gamma,\mu_i}(\lambda_i)]_l$, and construct a histogram, or approximate quantized sampled gradient, of its local dual function as $$\widetilde\nabla W^*_{\gamma,\mu_i}(\lambda_i) = \frac{1}{M_{i,2}} \sum_{r=1}^{M_{i,2}} e_{\xi_r},$$ with the local estimates then combined into the stochastic approximation of the gradient of the whole dual objective (6). Then, the quantized stochastic gradient is unbiased, i.e., its expectation over the samples $Y$ and $\xi$ equals the exact gradient of the dual objective, so that (10) holds. Furthermore, its variance admits an explicit bound that decreases as the batch sizes $M_{i,1}$ and $M_{i,2}$ grow. There are two main advantages in the approach described in Lemma 3.
First, the size of the communicated messages between the nodes diminishes with a smaller second batch, while the double-batched algorithm preserves roughly the same complexity as the accelerated distributed computation of Wasserstein barycenters in [21]. The second advantage is the quantization aspect: agents exchange sparse integer vectors instead of dense vectors of floating-point variables. The main observation of this section is that the dual problem of the WB setup is a stochastic optimization problem. Moreover, its stochastic gradients live in the probability simplex, i.e., each of them defines a discrete probability distribution. Such a structure allows for effective sampling strategies, which translate into effective communication due to the primal-dual properties of the problem. In the next section, we propose a new analysis of stochastic optimization problems with such a structure. Later we specialize it to the WB problem, but our general results could be of independent interest. For any finite-dimensional real vector space $E$, we denote by $E^*$ its dual and by $\langle\lambda, x\rangle$ the value of a linear function $\lambda \in E^*$ at $x \in E$. Let $\|\cdot\|$ denote some norm on $E$ and $\|\cdot\|_*$ denote the norm on $E^*$ which is dual to $\|\cdot\|$, i.e., $\|\lambda\|_* = \max\{\langle\lambda, x\rangle : \|x\| \le 1\}$. For a linear operator $A : E \to H$, the adjoint operator is denoted by $A^T$. The main problem we consider in this section is $$\min_{x \in Q \subseteq E,\; Ax = b} f(x), \qquad (11)$$ where $Q$ is a simple closed convex set, $A : E \to H$ is a given linear operator, $b \in H$ is given, and $\Lambda = H^*$. The dual problem for (11) consists in the minimization of the function $$\varphi(\lambda) = \langle\lambda, b\rangle + \max_{x \in Q}\big(-\langle A^T\lambda, x\rangle - f(x)\big) = \langle\lambda, b\rangle + f^*(-A^T\lambda). \qquad (12)$$ Here $f^*$ is the Fenchel-Legendre conjugate function of $f$. We say that a (random) point $\hat x$ is an $\varepsilon$-solution to Problem (11) if $$\mathbb{E} f(\hat x) - f(x^*) \le \varepsilon \quad\text{and}\quad \mathbb{E}\|A\hat x - b\|_2 \le \varepsilon/R,$$ where $R$ is such that the dual problem (12) has a solution $\lambda^*$ satisfying $\|\lambda^*\|_2 \le R < +\infty$. Further, we assume that the dual problem (12) can be accessed by a stochastic oracle $(\Phi(\lambda, \xi), \nabla\Phi(\lambda, \xi))$ with $\Phi(\lambda, \xi) = \langle\lambda, b\rangle + F^*(-A^T\lambda, \xi)$ and $\nabla\Phi(\lambda, \xi) = b - A\nabla F^*(-A^T\lambda, \xi)$. Let us summarize our main assumptions as follows. Assumptions 1. • The gradient of the objective function $\varphi$ is $L$-Lipschitz continuous. • The stochastic oracle pair $(\Phi(\lambda, \xi), \nabla\Phi(\lambda, \xi))$ for the dual problem satisfies $\mathbb{E}_\xi \nabla\Phi(\lambda, \xi) = \nabla\varphi(\lambda)$. • There is a random variable $\tilde\xi$, which is independent of $\xi$, and a stochastic approximation $\widetilde\nabla\Phi(\lambda, \xi, \tilde\xi)$ of $\nabla\Phi(\lambda, \xi)$ satisfying $\mathbb{E}_{\tilde\xi}\widetilde\nabla\Phi(\lambda, \xi, \tilde\xi) = \nabla\Phi(\lambda, \xi)$. • The dual problem (12) has a solution $\lambda^*$ and there exists some $R > 0$ such that $\|\lambda^*\|_2 \le R < +\infty$. (Algorithm 1: Primal-Dual Accelerated Stochastic Gradient Method; select the coefficient sequences $\alpha_k$, $\beta_k$, $A_k$ and run the accelerated dual updates with the primal reconstruction step detailed in the appendix.) We propose Algorithm 1 for solving (11) using the stochastic oracle described in Assumptions 1. Contrary to [21], we extend the accelerated stochastic gradient method from [18] and propose its primal-dual version. The main benefit is flexibility in the choice of the batch size, which is assumed fixed in [21]. Also, the algorithm from [18] cannot be used directly since it does not allow us to reconstruct the solution of the primal problem, i.e., the barycenter in our application. We believe that this general primal-dual accelerated stochastic gradient method with flexible batch size can be of independent interest. The next theorem presents the convergence properties of Algorithm 1. Theorem 1. Let Algorithm 1 be applied to solve Problem (11) under the above assumptions. Case A. Assume also that at each iteration the variance $\sigma_k$ of the stochastic gradient satisfies $\sigma_k^2 \le \varepsilon L\alpha_k/A_k$. Then, Algorithm 1 outputs an $\varepsilon$-solution to Problem (11) after $\sqrt{8LR^2/\varepsilon}$ iterations. Case B. Assume the variance $\sigma^2$ is constant. Then, Algorithm 1 outputs an $\varepsilon$-solution to Problem (11) after $\max\{\sqrt{4LR^2/\varepsilon},\; 9\sigma^2 R^2/\varepsilon^2\}$ iterations.
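To make the structure of Algorithm 1 concrete, here is a minimal Euclidean-prox sketch of an accelerated dual loop with primal reconstruction by weighted averaging, run with exact gradients ($\sigma = 0$) on a toy equality-constrained quadratic. The toy problem, the coefficient schedule $\alpha_k = (k+1)/2$ with constant $\beta = 2L$, and all names are our own illustrative assumptions, not the tuned choices of Theorem 1.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy instance of (11): min_x 0.5 * ||x - x0||^2  subject to  A x = b,  with Q = R^d.
d, h = 6, 3
A = rng.normal(size=(h, d))
x0 = rng.normal(size=d)
b = A @ rng.normal(size=d)                         # b chosen so that the constraint is feasible

def x_of(s):
    # Maximizer of <s, x> - f(x) for f(x) = 0.5||x - x0||^2, i.e. x(s) = x0 + s.
    return x0 + s

def grad_phi(lam):
    # Gradient of the dual phi(lam) = <lam, b> + f*(-A^T lam), namely b - A x(-A^T lam).
    return b - A @ x_of(-A.T @ lam)

L_dual = np.linalg.norm(A @ A.T, 2)                # Lipschitz constant of grad_phi
beta = 2.0 * L_dual                                # constant beta (illustrative choice)

eta = z = np.zeros(h)
A_sum, x_accum = 0.0, np.zeros(d)

for k in range(3000):
    alpha = (k + 1) / 2.0                          # illustrative coefficient schedule
    A_new = A_sum + alpha
    tau = alpha / A_new
    lam = tau * z + (1.0 - tau) * eta              # extrapolation point lambda_{k+1}
    g = grad_phi(lam)                              # a stochastic oracle would be queried here
    z = z - (alpha / beta) * g                     # analytic prox step (Euclidean distance generator)
    eta = tau * z + (1.0 - tau) * eta              # averaging step
    x_accum += alpha * x_of(-A.T @ lam)            # primal reconstruction: weighted average of x(-A^T lam)
    A_sum = A_new

x_hat = x_accum / A_sum
print("constraint violation:", np.linalg.norm(A @ x_hat - b))
```

In the stochastic case one would replace grad_phi by the oracle $\widetilde\nabla\Phi(\lambda, \xi, \tilde\xi)$ and choose the batch sizes so that the variance conditions of Theorem 1 hold.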
In this section, we apply the main primal-dual algorithm presented in Section 4 to the specifics of the Wasserstein barycenter problem of Section 2. The algorithm has two different options for the batch size: one with an increasing batch size, similar to the main algorithm in [21], and one with a constant batch size. We adapt the structure of the general Algorithm 1 to the specific properties of the WB. This allows us to propose Algorithm 2 for the distributed computation of the WB in the semi-discrete setting with quantized communications. Algorithm 2 (Q-DecPDSAG: Decentralized Quantized and Stochastic Computation of the Semi-Discrete Entropy-Regularized Wasserstein Barycenter). Initialization: each agent $i \in V$ is assigned a distribution $\mu_i$ and $\gamma = \varepsilon/(4\ln(\Omega))$; the number of iterations $N$ is set and the sampling scheme is selected (A: increasing batch size, or B: constant batch size); each agent computes $\widetilde\nabla W^*_{\gamma,\mu_i}(\lambda^0_i)$ according to Lemma 3 and sends it to its neighbors, i.e., $\{j \mid (i, j) \in E\}$. Then, for $k = 0, \ldots, N-1$, each agent $i \in V$ exchanges the quantized gradients $\widetilde\nabla W^*_{\gamma,\mu_j}(\lambda^{k+1}_i)$ with its neighbors, weighted by $W_{ij}$, and performs the primal-dual updates of Algorithm 1 applied to the dual problem (6). Output: the point $\hat p^N$. Finally, the following theorems establish the complexity results in Table 1. Theorem 2. Let $G$ be a connected, undirected, fixed graph, and let $\{\mu_i\}_{i \in V}$ be a set of probability distributions with $\mu_i \in \mathcal{M}^1_+(\mathcal{Y})$. Moreover, set $N = O\big(\sqrt{\chi(W)n}/\varepsilon\big)$ (communication rounds), and pick the increasing batch size strategy with $M^k_{i,1} = M^k_{i,2} = \max\{1, \lceil (k+2)/\log(\Omega)\rceil\}$ for all $i \in V$ and all $k \ge 0$. Then, the output of Algorithm 2, i.e., $\hat p^N$, is an $\varepsilon$-solution of (1). Moreover, the total number of samples drawn from $\{\mu_i\}_{i \in V}$ is $O\big(\chi(W)mn/\varepsilon^2\big)$, the total number of arithmetic operations is $O\big(\max\{\chi(W)mn^2/\varepsilon^2,\; \chi(W)\kappa(W)n/\varepsilon^2\}\big)$, and the total number of bits sent per node is $O\big(\chi(W)\kappa(W)\min\{n^{3/2}/\varepsilon,\; n/\varepsilon^2\}\big)$. The results in Theorem 2 are two-fold. On the one hand, it provides the number of communication rounds required for Algorithm 2 to obtain an arbitrary approximation of the non-regularized WB when the sampling scheme of Case A) is selected. Moreover, we provide explicit sample, arithmetic, and communication complexities for generating such an approximation. In particular, if one uses expander graphs, which are in general well connected, the total number of arithmetic operations is $O\big(mn^2/\varepsilon^2\big)$ and the total number of bits sent per node is $O\big(\min\{n^{3/2}/\varepsilon,\; n/\varepsilon^2\}\big)$. Next, we present a result for the sampling scheme of Case B) in Algorithm 2 (Theorem 3). As in Theorem 2, when expander graphs are used, the total number of arithmetic operations is $O\big(mn^2/\varepsilon^2\big)$, and the total number of bits sent per node is $O\big(mn/\varepsilon^2\big)$. We present numerical experiments to support Theorems 2 and 3. Figure 2 shows the consensus gap and the dual function value of the WB problem generated by Algorithm 2 for $m = 10$ and $m = 100$ agents. The distributions are set to be Gaussian with randomly selected and assigned means and variances. Moreover, we compare the performance with the non-quantized AGM for WB computation proposed in [21]. We test the algorithm for various combinations of the sampling rates $M_1$ and $M_2$. As expected, the algorithm from [21] has the best results because it is a non-quantized communication approach and, at each iteration, each agent communicates its full gradient with its neighbors. However, the experiments also show that even with a small number of samples, like $M_1 = 1$ and $M_2 = 10$, Algorithm 2 rapidly converges to the barycenter with significantly less communication overhead.
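The inner gradient construction used by Algorithm 2 can be illustrated as follows: a first stage averages softmax-type vectors over $M_1$ samples from $\mu$ (the form assumed here follows Lemma 2, with a squared-Euclidean cost used only for illustration), and a second stage quantizes the result into a histogram of $M_2$ index draws as in Lemma 3. The sizes, the sampling distribution, and all names below are hypothetical choices made only for this example.

```python
import numpy as np

rng = np.random.default_rng(1)

def sampled_dual_gradient(lam_bar, z, sample_mu, gamma, M1):
    """First sampling stage: average of softmax vectors over M1 samples Y_r ~ mu.

    Assumes the per-sample gradient has the softmax form
    [p(lam, y)]_l proportional to exp((lam_bar_l - c_l(y)) / gamma), cf. Lemma 2.
    """
    Y = sample_mu(M1)                                          # (M1, d) samples from mu
    C = np.sum((Y[:, None, :] - z[None, :, :]) ** 2, axis=-1)  # (M1, n) costs c_l(Y_r)
    logits = (lam_bar[None, :] - C) / gamma
    logits -= logits.max(axis=1, keepdims=True)                # numerical stabilization
    P = np.exp(logits)
    P /= P.sum(axis=1, keepdims=True)                          # each row is a probability vector
    return P.mean(axis=0)                                      # dense estimate in S_1(n)

def quantize_gradient(grad, M2):
    """Second sampling stage (Lemma 3): histogram of M2 index draws from the gradient itself."""
    idx = rng.choice(len(grad), size=M2, p=grad)
    return np.bincount(idx, minlength=len(grad)) / M2          # sparse, unbiased estimate

# Toy usage with hypothetical sizes and a Gaussian mu.
n, d, gamma = 10, 2, 0.1
z = rng.uniform(size=(n, d))
lam_bar = rng.normal(size=n)
grad = sampled_dual_gradient(lam_bar, z, lambda M: rng.normal(0.5, 0.2, size=(M, d)), gamma, M1=5)
msg = quantize_gradient(grad, M2=20)       # at most 20 nonzero coordinates need to be communicated
```

Only the nonzero entries of msg (at most $M_2$ of them) need to be transmitted, which is the source of the communication savings discussed above.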
We presented an algorithm for the computation of approximate semi-discrete WBs over networks. We analyzed the primal-dual structure of the optimization problem associated with the WB and proposed a novel accelerated stochastic method to minimize functions with a stochastic dual structure. The proposed method's main advantage is its flexibility in the sampling scheme of the dual stochastic gradient. Two sampling approaches were studied: increasing batch size and constant batch size. The constant batch size is particularly useful when only a specific sample size is available at each iteration step. These sampling schemes drastically reduce the communication complexity of the WB computation. Moreover, we explicitly analyzed the communication, sample, and arithmetic complexity of the proposed methods. Future work should focus on studying whether the total number of arithmetic operations can be improved beyond $mn^2/\varepsilon^2$, as in the discrete case [41, 45]. Proof of Proposition 1. It is clear that the density $\xi$ of the uniform distribution on $\mathcal{Y} \times \mathcal{Z}$ is $1/\Omega$, and that the entropy term in (2) is bounded from below by zero and from above by the value achieved when $\pi$ is a Dirac mass. In the latter case the maximum value is $\gamma\log\Omega$, which is no greater than $\varepsilon/4$ by our choice of $\gamma = \bar\gamma = \varepsilon/(4\log(\Omega))$. Thus, for any $\mu, p$, $W_0(\mu, p) \in [W_{\bar\gamma}(\mu, p), W_{\bar\gamma}(\mu, p) + \varepsilon/4]$. Substituting this in (3), we obtain the desired result. Proof of Lemma 3. By the definition of $\xi_r$, $\mathbb{E}_{\xi_r}[e_{\xi_r}] = \widehat\nabla W^*_{\gamma,\mu_j}(\lambda_j)$, and it is easy to verify that the stochastic approximation $\widetilde\nabla W^*_{\gamma,\mu_j}(\lambda_j)$ is unbiased, so (10) holds. Let us now estimate the variance. For the samples $Y^i_r$, $r = 1, \ldots, M_{i,1}$, and $\xi^i_r$, $r = 1, \ldots, M_{i,2}$, where $\xi^i_r$ is distributed as described in Lemma 3, we have the following chain of estimates. Using the triangle inequality for the square of the norm, the Lipschitz-gradient estimate from [21], and computing, for a fixed $\lambda$ and $p(Y_r) := p(\lambda, Y_r)$, the first term in inequality (18), we arrive at the stated variance bound, and the desired result follows. As mentioned in Section 4, we consider the primal-dual pair of problems (21)-(22) of the form (11)-(12). As before, we assume that Assumptions 1 are satisfied. Algorithm 1 generalizes Algorithm 6.1 of [18] (see also [22, 30]); here we present Algorithm 6.1 to give the reader a better overview, together with its convergence theorem. Then we analyze the convergence rate of our primal-dual version, Algorithm 1. Note that the primal-dual analysis of the existing accelerated methods [66, 2, 7, 12, 23, 24, 25, 26, 33, 50] does not apply, since the dual problem is a stochastic optimization problem and we use additional randomization. Algorithm 6.1 of [18], applied to the dual problem (22) with the stochastic inexact oracle $\widetilde\nabla\Phi(\lambda, \xi, \tilde\xi)$, is listed as Algorithm C3. Since the objective $\varphi(\lambda)$ has a Lipschitz-continuous gradient, in the setting of [18] we have $\delta = 0$. The following conditions for the convergence analysis of Algorithm C3 are stated at the beginning of Section 6.1 in [18]. Assumptions 2. The following conditions hold for the sequences $\{\alpha_k\}$ and $\{\beta_k\}$: $\alpha_0 \in (0, 1]$ and $\beta_{k+1} \ge \beta_k > L$ for all $k \ge 0$. Algorithm C3 (Stochastic Fast Gradient Method [18, Algorithm 6.1]). Input: sequences $\alpha_k$, $\beta_k$, $A_k = \sum_{i=0}^k \alpha_i$ satisfying Assumptions 2; a distance-generating function $d(\lambda)$ and the corresponding Bregman divergence $V[z_k](\lambda)$. 1: Set $\eta_0 = \zeta_0 = z_0 = \lambda_0$ and choose a number of iterations $N$. 2: for $k = 0, \ldots, N-1$ do 3-7: perform the gradient and mirror steps (23)-(24) with the stochastic oracle $\widetilde\nabla\Phi$. 8: $\eta_{k+1} = \tau_k\zeta_{k+1} + (1 - \tau_k)\eta_k$. 9: end for. Output: the point $\eta_N$. We denote, for any $N \ge 0$, by $\Psi_N(\lambda)$ the model function from [18] and by $\Psi^*_N$ its minimal value. Theorem C1 (Lemma 2 in [18]).
Assume that Assumptions 1 and 2 are satisfied. Then, for all $N \ge 0$, the output of Algorithm C3 satisfies the bound (26), which relates $A_N\varphi(\eta_N)$ to $\Psi^*_N$ and to error terms coming from the stochastic oracle. Let us now move to our generalization of Algorithm C3. In our setting, we take Algorithm C3 with the particular choice of the prox-function $d(\lambda) = \frac12\|\lambda\|_2^2$ and the corresponding Bregman divergence $V[\tilde\lambda](\lambda) = \frac12\|\lambda - \tilde\lambda\|_2^2$. Then, the steps (23) and (24) of Algorithm C3 can be computed analytically. Indeed, the optimality conditions result in the equations $\beta_k\lambda + \sum_{l=0}^k \alpha_l\widetilde\nabla\Phi(\lambda_l, \xi_l, \tilde\xi_l) = 0$ and $\beta_k(\lambda - z_k) + \alpha_{k+1}\widetilde\nabla\Phi(\lambda_{k+1}, \xi_{k+1}, \tilde\xi_{k+1}) = 0$. The steps therefore read as $z_k = -\frac{1}{\beta_k}\sum_{l=0}^k \alpha_l\widetilde\nabla\Phi(\lambda_l, \xi_l, \tilde\xi_l)$ and $\zeta_{k+1} = z_k - \frac{\alpha_{k+1}}{\beta_k}\widetilde\nabla\Phi(\lambda_{k+1}, \xi_{k+1}, \tilde\xi_{k+1})$. Our generalization consists in adding an update for the primal variable, such that not only the dual but also the primal problem (21) is solved with an optimal rate. Thus, in our primal-dual algorithm, we add the step $\hat x_{k+1} = \frac{1}{A_{k+1}}\sum_{l=0}^{k+1}\alpha_l\, x(-A^T\lambda_l, \xi_l)$ for the update of the primal variable. Here the vector $x(-A^T\lambda, \xi)$ is defined as the maximizer over $x \in Q$ in the definition of the conjugate $F^*(-A^T\lambda, \xi)$. We also specify two choices of the sequences $\alpha_k$, $\beta_k$, $A_k = \sum_{i=0}^k \alpha_i$. Having made these adjustments, we arrive at Algorithm 1. The following lemma is needed for the proof of Theorem C2. Lemma C1. Let the assumptions of Section 4 in the main paper hold. Then the estimate of [21, Theorem 2] holds. Proof. The proof can be found in [21, Theorem 2]. Theorem C2 gives a convergence result for Algorithm 1, which is a primal-dual version of Algorithm C3. Theorem C2. Let Assumptions 1 hold and let the variance in each iteration be bounded as $\mathbb{E}_{\xi_k, \tilde\xi_k}\|\widetilde\nabla\Phi(\lambda_k, \xi_k, \tilde\xi_k) - \nabla\varphi(\lambda_k)\|_2^2 \le \sigma_k^2$. Let an accuracy $\varepsilon > 0$ and a number of steps $N \ge 0$ be given. Then, for the primal sequence $\hat x_N$, the bound (28) on the expected primal objective gap and the bound (29) on the expected constraint violation $\mathbb{E}\|A\hat x_N - b\|_2$ hold, where the expectation is taken w.r.t. all the randomness $\xi_1, \ldots, \xi_N, \tilde\xi_1, \ldots, \tilde\xi_N$. Proof. The dualization argument relies on the weak duality and the convexity of the considered problem. Taking the expectation in inequality (26) of Theorem C1 for the unbiased stochastic approximation cancels out certain terms, and we obtain a bound on $\mathbb{E}[A_N\varphi(\eta_N)]$ in terms of $\mathbb{E}\,\Psi^*_N$. Let us first estimate $\mathbb{E}\,\Psi^*_N$. We introduce the set $\Lambda_R := \{\lambda \in H^* : \|\lambda\|_2 \le 2R\}$. Then, applying Lemma C1 and using $\mathbb{E}_{\xi_k, \tilde\xi_k}\widetilde\nabla\Phi(\lambda_k, \xi_k, \tilde\xi_k) = \mathbb{E}_{\xi_k}\nabla\Phi(\lambda_k, \xi_k)$, we obtain the required estimate of $\mathbb{E}\,\Psi^*_N$; here we also used the convexity of $f$, the fact that $A_N$ is the sum of the coefficients $\alpha_k$, and the definitions of $\hat x_N$ and $\Lambda_R$. Moreover, using (30) and (32), the fact that $d(\lambda) \le R^2/2$, and the unbiasedness of the stochastic approximation, we obtain the bound on the term $A_N\varphi(\eta_N)$. Now, let us estimate the convergence of the primal objective $f(\mathbb{E}\hat x_N)$. Using the weak duality $-f(x^*) \le \varphi(\lambda^*)$, dividing all terms of inequality (33) by $A_N$, and combining the previous estimates, we obtain (28). Regarding the convergence of $\|A\,\mathbb{E}\hat x_N - b\|_2$, we have, for any $x \in Q$, $f(x^*) \le f(x) + \langle\lambda^*, Ax - b\rangle$. Using the bound $\|\lambda^*\|_2 \le R$ and using this estimate in inequality (38) gives (29), and the desired result follows. Using the techniques developed in [27, 19], these convergence rate guarantees can be extended to bounds in terms of probabilities of large deviations. We next analyze Case A of Algorithm 1 and prove Theorem 1, Case A. Theorem C3 (Proof of Theorem 1, Case A). Select Case A in Algorithm 1 by choosing $A_k = \frac12(k+1)(k+2)$ and the corresponding $\alpha_k$, $\beta_k$. Assume that the variance $\sigma_k$ satisfies $\sigma_k^2 \le \varepsilon L\alpha_k/A_k$. Then, Algorithm 1 gives an $\varepsilon$-solution after $\sqrt{8LR^2/\varepsilon}$ iterations. Proof. By (28), we can see that after $N = \sqrt{8LR^2/\varepsilon}$ iterations the r.h.s.
becomes smaller than $\varepsilon/2$. The bound for $\|A\,\mathbb{E}\hat x_N - b\|_2$ is obtained in the same way using (29). We next analyze Case B of Algorithm 1 and prove Theorem 1, Case B. Theorem C4 (Proof of Theorem 1, Case B). Assume that we have some constant variance $\sigma_k^2 = \sigma^2$. Given $c \ge 1$, $a = 2^c$, and $b > 0$, let the coefficients be defined by $\alpha_k = \frac{k+1}{a}$ and $\beta_k = L + \frac{b\,\sigma}{R}(k+2)^c$. Select Case B of Algorithm 1 by setting $c = 3/2$, with the resulting step sizes $\alpha_k = \frac{k+1}{2\sqrt 2}$ and $\beta_k = L + \frac{2^{1/4}\sigma}{\sqrt 3\, R}(k+2)^{3/2}$. Then Algorithm 1 gives an $\varepsilon$-solution after $\max\{\sqrt{4LR^2/\varepsilon},\; 9\sigma^2 R^2/\varepsilon^2\}$ iterations. Proof. Using (28) and the choice of the number of iterations $N = \max\{\sqrt{4LR^2/\varepsilon},\; 9\sigma^2 R^2/\varepsilon^2\}$, we obtain that the r.h.s. is smaller than $\varepsilon$. The bound for $\|A\,\mathbb{E}\hat x_N - b\|_2$ is obtained in the same way using (29). Proof of Theorem 2. For the Wasserstein barycenter problem, the dual objective function is $\varphi(\lambda) = W^*_\gamma(\lambda)$, the primal objective is $f(\mathbf{p}) = \frac1m\sum_{i=1}^m W_{\gamma,\mu_i}(p_i)$, and the linear operator is $A = \sqrt W$. We set the approximate gradient $\widetilde\nabla\Phi(\lambda, Y_1, \ldots, Y_{M_1}, \xi_1, \ldots, \xi_{M_2}) := \widetilde\nabla W^*_\gamma(\lambda, Y_1, \ldots, Y_{M_1}, \xi_1, \ldots, \xi_{M_2})$. With this setting we can apply Theorem C3. We set $\gamma = \varepsilon/(4\ln(\Omega))$ and the weights to $\omega_i = \frac1m$. According to Theorem C3, the variance has to satisfy the condition $\sigma_k^2 \le \varepsilon L\alpha_k/A_k$, where we also used Lemma 2 to estimate $L$. According to Lemma 3, the total variance is bounded through the batch sizes, and summing the batch sizes over the iterations gives the total number of samples. Let us compute the computational complexity for a particular node $i$. At each iteration we have to process a batch of $M_k$ samples. Sampling an individual vector $p_i(\lambda_i, Y^i_r)$ costs $n$ operations, for $r = 1, \ldots, M_k$. Therefore, the overall sampling cost is $M_k n$ at the $k$-th iteration. At iteration $k$, the cost of the summation of the individual gradients (Equations (12) and (16) in Algorithm 2) is $\mathrm{nnz}(W_i)M_k$. Obtaining realizations costs $n$. Other operations cost $O(n)$; thus the computational complexity is $O(M_k n)$ at each iteration. Summing the complexity over the iterations results in $\sum_{k=1}^N\big(n\log_2 n + M_k(n + \mathrm{nnz}(W_i))\big) = Nn\log_2 n + (n + \mathrm{nnz}(W_i))\sum_{k=1}^N M_k$. Compared to the non-quantized version, the number of bit communications per node per round for the quantized version is now $\mathrm{nnz}(W_i)\min\{M_k, n\}$ instead of $\mathrm{nnz}(W_i)\cdot n$, because we send $M_k$ coordinates and at most the full $n$-dimensional vector. Summing this over the number of iterations and over the nodes gives the stated communication complexity. Proof of Theorem 3. We choose the dual and primal objectives and the approximate gradient as in the proof of Theorem 2. According to Theorem C4, Algorithm 2, Case B, converges after $N \ge \max\{\sqrt{4LR^2/\varepsilon},\; 9\sigma^2 R^2/\varepsilon^2\}$ iterations. We choose Case B of Algorithm 2 with a constant batch size $M = M^k_{i,1} = M^k_{i,2}$. Then, the variance is bounded by Lemma 3. Using again the estimates for $R$ and $L$, the algorithm converges after $N = \max\{\sqrt{32\chi(W)n}/\varepsilon,\; 72\chi(W)n/(\varepsilon^2 M)\} = O\big(\chi(W)n/\varepsilon^2\big)$ iterations for a small constant batch size. The total number of samples is $mMN$. The following useful lemma gives an estimate of the region in which we have to search for the dual solution. Lemma F2. The norm of the dual solution $\lambda^*$ can be upper bounded for $\gamma$ proportional to $\varepsilon$ and a constant $c$ close to 2 (see [42] and Lemma 8 in [41]). We normalize the cost matrix so that $\|C\|_\infty^2 \le 1$. For the weights $\omega_i = \frac1m$, the estimate becomes $R \le \frac{2n}{m\lambda^+_{\min}(W)}$. In this section, we show additional simulation results for the proposed Algorithm 2. In each case, we assume there is a network with 30 nodes, and we randomly generate 30 different Gaussian distributions, shown in Figure 4. Each node is assigned one of those Gaussian distributions, with uniformly sampled means and variances.
Agents have the ability to query samples from those distributions, but are oblivious to the distributions themselves. We test the proposed algorithm on five different graph classes, including path, cycle, star, and Erdős-Rényi graphs. We test three different setups for the number of samples, namely, $M_1 = 1$ and $M_2 = 100$; $M_1 = 10$ and $M_2 = 10$; and $M_1 = 100$ and $M_2 = 1$. A video of the resulting barycenter estimates for the path graph can be found at https://youtu.be/aSwNZvCkrCw.
References:
Barycenters in the Wasserstein space
Dual approaches to the minimization of strongly convex functionals with a simple structure under affine constraints
Bayesian learning with Wasserstein barycenters
Stochastic optimization for regularized Wasserstein estimators
Averaging atmospheric gas concentration data using Wasserstein barycenters
Mirror descent and convex optimization problems with nonsmooth inequality constraints
Distribution's template estimate with Wasserstein metrics
An LP-based, strongly-polynomial 2-approximation algorithm for sparse Wasserstein barycenters
Fair regression with Wasserstein barycenters
Stochastic Wasserstein barycenters
Probability forecast combination via entropy regularized Wasserstein distance
Sinkhorn distances: Lightspeed computation of optimal transport
Large scale distributed deep networks
Stochastic first order methods in smooth convex optimization (CORE Discussion Paper)
Construction of non-asymptotic confidence sets in 2-Wasserstein space
Scalable computations of Wasserstein barycenter via input convex neural networks
Stochastic intermediate gradient method for convex optimization problems
Stochastic optimization for large-scale optimal transport
Accelerated alternating minimization, accelerated Sinkhorn's algorithm and accelerated Iterative Bregman Projections
Accelerated primal-dual gradient descent with linesearch for convex, nonconvex, and nonsmooth optimization problems
Fast algorithms for computational optimal transport and Wasserstein barycenter
Stochastic distributed learning with gradient quantization and variance reduction
Multimedia analysis and fusion via Wasserstein barycenter
Advances and open problems in federated learning
Decentralized deep learning with arbitrary communication compression
Decentralized stochastic optimization and gossip algorithms with compressed communication
Federated optimization: Distributed machine learning for on-device intelligence
On the complexity of approximating Wasserstein barycenters
Communication-efficient algorithms for decentralized and stochastic optimization
Continuous regularized Wasserstein barycenters
Unsupervised multilingual alignment using Wasserstein barycenter
Computational hardness and fast algorithm for fixed-support Wasserstein barycenter
On the complexity of approximating multimarginal optimal transport
Federated learning of deep networks using model averaging
Variational Wasserstein barycenters for geometric clustering
Multimarginal Wasserstein barycenter for stain normalization and augmentation
Primal-dual accelerated gradient methods with small-dimensional relaxation oracle
Clustering patterns connecting COVID-19 dynamics and human mobility using optimal transport
An Invitation to Statistics in Wasserstein Space: Fréchet means in the Wasserstein space W2
On the computation of Wasserstein barycenters
The earth mover's distance as a metric for image retrieval
Optimal algorithms for smooth and strongly convex distributed optimization in networks
CPOT: Channel pruning via optimal transport
Barycenters of natural images - constrained Wasserstein barycenters for image morphing
Graph representation learning with Wasserstein barycenters
Parallel streaming Wasserstein barycenters
Gradient methods for problems with inexact model of the objective
Stochastic saddle-point optimization for Wasserstein barycenters
Multimarginal optimal transport by accelerated alternating minimization
Distributed computation of Wasserstein barycenters over networks
A dual approach for optimal algorithms in distributed optimization over networks
A fast proximal point method for computing exact Wasserstein distance
A universal primal-dual convex optimization framework