TTRISK: Tensor Train Decomposition Algorithm for Risk Averse Optimization
Harbir Antil, Sergey Dolgov, Akwum Onwunta
Date: 2021-11-09

Abstract. This article develops a new algorithm named TTRISK to solve high-dimensional risk-averse optimization problems governed by differential equations (ODEs and/or PDEs) under uncertainty. As an example, we focus on the so-called Conditional Value at Risk (CVaR), but the approach is equally applicable to other coherent risk measures. Both the full and reduced space formulations are considered. The algorithm is based on low-rank tensor approximations of random fields discretized using stochastic collocation. To avoid the non-smoothness of the objective function underpinning the CVaR, we propose an adaptive strategy to select the width parameter of the smoothed CVaR so as to balance the smoothing and tensor approximation errors. Moreover, an unbiased Monte Carlo CVaR estimate can be computed by using the smoothed CVaR as a control variate. To accelerate the computations, we introduce an efficient preconditioner for the KKT system in the full space formulation. The numerical experiments demonstrate that the proposed method enables accurate CVaR optimization constrained by large-scale discretized systems. In particular, the first example consists of an elliptic PDE with random coefficients as constraints. The second example is motivated by a realistic application to devise a lockdown plan for the United Kingdom under COVID-19. The results indicate that the risk-averse framework is feasible with the tensor approximations with tens of random variables.

1. Introduction. Uncertainty is ubiquitous in science and engineering applications. It may arise from noisy measurements, unknown parameters, or unverifiable modeling assumptions. Examples include infectious disease models for COVID-19 [8] and partial differential equations (PDEs) with random coefficients, boundary conditions or right-hand sides [7, 30, 38, 22]. Control or design optimization problems constrained by such systems must produce controls or optimal designs which are resilient to this uncertainty. To tackle this, the authors of [21, 20, 22, 9] have recently created risk-averse optimization frameworks targeting engineering applications. The goal of this paper is to introduce a new algorithm, TTRISK, which uses a Tensor Train (TT) decomposition to solve risk-averse optimization problems constrained by differential equations (ODEs and/or PDEs). Let (Ω, A, P) be a complete probability space. Let U, Y be real reflexive Banach spaces, and let Z be a real Banach space. Here Y denotes the deterministic state space, U is the space of optimization variables (controls or designs, etc.), and Z is the differential equation residual space. Let U_ad ⊆ U be a closed convex subset and let c : Y × U_ad × Ω → Z denote, e.g., a PDE in weak form; then consider the equality constraint c(y, u; ω) = 0 in Z for a.a. ω ∈ Ω, where a.a. indicates "almost all" with respect to the probability measure P. The goal of this article is to consider optimization problems of the form min_{u ∈ U_ad} R[J(y, u; ω)] + αP(u) subject to c(y, u; ω) = 0, in Z, a.a.
ω ∈ Ω, (1.1) where u ∈ U ad is the deterministic control and y ∈ Y is the state, P is the cost of the control, α ≥ 0 is the regularization parameter, J is the uncertain variable objective function and R is the risk-measure functional which maps random variables to extended real numbers. We assume that R is based on expectation, i.e., f : R × R N → R and T ⊆ R N , with N ∈ N, is a closed convex set. The problem class ( 1.2) consists of a large number of risk-measures which are of practical interest. In particular, it includes the coherent risk measures, which are sub-additive, monotonic, translation equivariant and positive homogeneous [32, 37] . Notice that sub-additivity and positive homogeneity implies convexity. These risk measures have several advantages, for instance, they preserve desirable properties of the original objective function such as convexity. In addition, in engineering or in finance applications, tail-probability events are rare but detrimental events that often lead to failure of a system. It is therefore, essential to minimize the risk of failure, i.e., R and obtain controls u which are resilient to uncertainty in the system. A typical example of coherent risk measure R is the conditional value-at-risk (CVaR β ), where f in ( 1.2) is given by with T = R, β ∈ (0, 1) is the confidence level and (x) + = max{x, 0}. CVaR β is also known as expected shortfall. It's origin lies in financial mathematics [32, 33] , but owing to Kouri [19] and Kouri and Surowiec [21] , it is now being widely used in engineering applications. Our work in particular, focuses on minimization problems (1.1) with R given by CVaR β but it can be extended to other coherent risk measures, such as buffered probability of exceedence (BPOE), of type ( 1.2) . Notice that, since risk measures, such as CVaR β focus on the upper tail events, the traditional sampling techniques to solve these stochastic PDE-constrained optimization problems are often computationally expensive. More precisely, CVaR β captures the cost associated with rare events, but it requires more samples in order to be accurately approximated, which leads to many differential equation solves [21] . Moreover, the presence of the non-smooth function (·) + in CVaR β poses several challenges, including, nondifferentiable cost functional, wasted Monte Carlo samples outside of the support of (·) + = max{·, 0}, or slowly converging polynomial and other function approximation methods. To tackle some of these challenges, [21] has proposed a smoothing of (·) + which requires solving a sequence of smoothed optimization problems using Newton-based methods. Another solution strategy is to reformulate the problem and use interior-point methods [9] . A duality based approach has also been recently proposed in [23] . In this paper we develop an efficient method to tackle the above challenges associated with minimization of CVaR β subject to constraints given by differential equations with random inputs. We consider two formulations of (1.1). The first one is the implicit approach where we remove the equality constraint c(y, u; ω) = 0 via a control to solution map u → y [21, 9, 23] . The second case is the full space approach, where we directly tackle the full problem (1.1) using the Lagrangian formulation. The latter formulation appears to be new in the context of risk-averse optimization. Numerical experiments demonstrate that the full formulation converges more reliably for extreme parameters, e.g. large β and small α. 
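Concretely, the Rockafellar–Uryasev representation behind (1.3) reads CVaR_β(X) = min_t { t + E[(X − t)_+]/(1 − β) }, and for a continuous random variable the minimizer t* is the β-quantile VaR_β(X). The following Python sketch estimates both quantities by plain Monte Carlo; it only illustrates the definition (the paper's implementation is MATLAB/TT-Toolbox based), and the generator `sample_cost` is a hypothetical stand-in for the PDE-constrained cost j(u; ξ).

```python
import numpy as np

def cvar_mc(samples, beta):
    """Monte Carlo estimate of VaR_beta and CVaR_beta of a scalar cost,
    using the Rockafellar-Uryasev formula
        CVaR_beta(X) = min_t { t + E[(X - t)_+] / (1 - beta) },
    whose minimizer is the beta-quantile (VaR_beta)."""
    t = np.quantile(samples, beta)                       # VaR_beta estimate
    cvar = t + np.mean(np.maximum(samples - t, 0.0)) / (1.0 - beta)
    return t, cvar

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Illustrative stand-in for the random cost j(u; xi): a log-normal variable.
    sample_cost = lambda m: np.exp(0.5 * rng.standard_normal(m))
    var95, cvar95 = cvar_mc(sample_cost(100_000), beta=0.95)
    print(f"VaR_0.95 ~ {var95:.3f}, CVaR_0.95 ~ {cvar95:.3f}")
```

Such plain sampling is exactly what becomes expensive under differential-equation constraints, since every sample requires a forward solve; this motivates the tensor surrogates introduced next.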
Our framework builds rigorously on tensor decomposition methods, which emerged in the past two decades [13, 17] as an efficient approximation of multi-index arrays, in particular when those contain expansion coefficients of high-dimensional functions [2, 12] . The idea starts from the classical separation of variables. Functions of certain structure [14] or regularity [35] have been shown to admit rapidly (often exponentially) convergent series, where each term is a product of univariate functions. Later, instead of a simple sum of products, it was found more practical to consider hierarchical separation of variables [13] . A particularly simple instance of such is the Tensor Train (TT) decomposition [27] that admits efficient numerical computations. One of the most powerful algorithms of this kind is the cross approximation, as well as its variants [28, 34, 5] . Those allow one to compute a TT approximation to potentially any function, using a number of samples from the sought function that is a small multiple of the number of degrees of freedom in the tensor decomposition . Once TT decompositions are computed, integration, differentiation and linear algebra of the original functions can be implemented using their TT formats instead with a linear cost in the dimension. However, irregular functions, such as the (·) + function in (1.3), may lack an efficient TT decomposition. This is another reason for switching to the smoothed CVaR formulation, since smooth functions do admit convergent TT approximation. Nevertheless, this convergence can still be slow, and for practically feasible smoothness parameters, the solution of the smoothed problem can be biased. To obtain an unbiased, asymptotically exact solution, we propose a version of Multilevel Monte Carlo methods [10] , namely, we use a smoothed solution as a control variate [31] . The numerical experiments demonstrate that the stochastic risk-averse control problem can be solved with a cost that depends at most polynomially on the dimension. This allows us to solve a realistic risk-averse ODE control problem with 20 random variables. Outline: Section 2 and Appendix set up the relevant notation and provide the necessary background on risk-averse-optimization and tensor-train decomposition. In Section 3, we introduce the control-to-state map, i.e., u → y, and eliminate the equality constraints c(·) = 0. The resulting optimization problem (1.1) is only a function of the control variable u and is known as reduced problem. A control variate correction of the problem is considered in Section 3.2. When the constraint c(·) is handled directly, the resulting formulation is called full-space problem and is discussed in Section 4. This is followed by Section 4.2 where the Gauss-Newton system for the full-space formulation is considered. Next, in Section 4.3 we consider preconditioning strategies for this formulation. Section 5 focuses on our numerical examples, where we first consider an optimal control problem constrained by an elliptic PDE with random coefficients. This is followed by the risk-averse optimal control of an infectious disease model which has been recently developed to propose lockdown strategies in the United Kingdom due to COVID-19. In this section, we first provide background on CVaR β and introduce a regularized problem with CVaR β replaced by CVaR ε β with ε > 0. This is followed by a discussion on tensor-train decomposition. The aim of this section is to set the stage for TTRISK. Risk-averse optimization. 
Let L := (Ω, A, P) be a complete probability space. Here, Ω is the set of outcomes, A ⊂ 2 Ω is the σ-algebra of events, and P : A → [0, 1] is an appropriate probability measure. Let X be a random variable defined on L. Then, the expectation of X denoted by E[X] is given by As stated in the introduction, this paper focuses on optimization problems of type (1.1) where R is the risk measure given by (1.2) . In particular, we consider the conditional valueat-risk (CVaR β ) at confidence level β, β ∈ (0, 1) where f in (1.2) is given by (1.3) , with T = R. CVaR β builds on the concept of value-at-risk (VaR) at level β ∈ (0, 1), which is the β-quantile of a given random variable. More precisely, let X be a random variable and let β ∈ (0, 1) be fixed. Then, VaR β (X) is given by where P[X ≤ t] denotes the probability that the random variable X is less than or equal to t. VaR β is unfortunately not coherent because it violates sub-additivity / convexity. This is why CVaR β is preferred as a risk measure. Even though now we know that coherent risk-measures can be written in the abstract form (1.2), however, the 1D minimization formulation of CVaR β was first introduced by Rockafellar and Uryasev in [32, 33] : Moreover, if X is a continuous random variable, then where again U ad denotes the set of admissible controls. We use (y, u) to denote the state and control, respectively. In our setting u is always deterministic. The constraint equation c(y, u; ω) = 0 represents a differential equation with uncertain coefficients, P(u) is a deterministic cost function, α is the regularization parameter, and J (y, u; ω) is a random variable cost function. Here, we make the finite dimensional noise assumption on the equality constraint [21] . More precisely, the constraint c(y, u; ω) = 0 is represented by a finite random vector ξ : Ω → Ξ, where Ξ := ξ(Ω) ⊂ R d with d ∈ N. This allows us to redefine the probability space to (Ξ, Σ, ρ), where Σ = ξ(A) is the σ-algebra of regions, and ρ(ξ) is the continuous probability density function such that E[X] =´Ξ X(ξ)ρ(ξ)dξ. The random variable X(ξ) can be considered as a function of the random vector ξ = (ξ (1) , . . . , ξ (d) ), belonging to the Hilbert space F = {X(ξ) : X < ∞}, equipped with the inner product X, Y = Ξ X(ξ)Y (ξ)ρ(ξ)dξ and the Euclidean norm X = X, X . Then, (2.1) reads min u∈U ad CVaR β [J (y, u; ξ)] + αP(u) subject to c(y, u; ξ) = 0, in Z, a.a. ξ ∈ Ξ. (2.2) Here and in what follows, bracketed superscripts (e.g. ξ (d) ) denote a component, not a power or derivative. To tackle nonsmoothness in CVaR β , we employ smoothing based approach from [21] . The smoothing approach is essentially aimed at approximating the positive part function (·) + in CVaR β by a smooth function g ε : R → R, which depends on some ε > 0. Various examples of g ε are available in [21, section 4. 1.1] . In particular, we consider the following C ∞ -smoothing function (2.4) Thus, the optimization problem for smooth CVaR ε β is given by For convergence analysis of (2.5) to (2.2), we refer to [21] . Cartesian function space. The dimension d of the random vector ξ can be arbitrarily high. For instance, ξ may be a tuple of tens of model tuning parameters, or it can be a vector of coefficients of a Karhunen-Loeve (KL) approximation of an infinite-dimensional continuous random field. In this case expectations as in (2.6) become high-dimensional integrals. Instead of a direct Monte Carlo average (which may converge too slowly), we introduce a highorder quadrature rule (e.g. 
Gauss-Legendre) with n ξ ∈ N points in each of the components ξ (1) , . . . , ξ (d) independently. For this, we assume that each ξ (k) has a probability density function ρ (k) (ξ (k) ), and that the space of functions f (ξ) ∈ F is isomorphic to a Cartesian product of spaces of univariate functions, However, the exponential total number of quadrature points in all variables n d ξ becomes intractable even for moderate dimensions. 2 . 3 . Tensor Train decomposition. We circumvent this "curse of dimensionality" problem by approximating all functions depending on ξ by a Tensor Train (TT) decomposition [27] , which admits efficient integration or differentiation. is said to be approximated by a (functional) TT decomposition [2, 12] with a relative approximation error if there exist univariate where the subscripts s k−1 , s k denote elements of a matrix, and f −f = f . The factors F (k) are called TT cores, and their image dimensions r 0 , . . . , r d ∈ N are called TT ranks. Without loss of generality we can let r 0 = r d = 1, but the other TT ranks r 1 , . . . , r d−1 can vary depending on the approximation error. The simplest example is a bi-variate truncated Fourier seriesf (ξ (1) , ξ (2) ) = r s=−r f s (ξ (1) ) exp(isξ (2) ). From (2.7), we notice that the expectation off factorizes into univariate integrations, For practical computations with (2.7) we introduce univariate bases { i (ξ (k) )} n ξ i=1 , and the multivariate basis constructed from a Cartesian product, Now we can collect the expansion coefficients off into a tensor F ∈ R n ξ ×···×n ξ , Similarly, TT cores in (2.7) can be written using three-dimensional tensors F (k) ∈ R r k−1 ×n ξ ×r k , The original (discrete) TT decomposition [27] was introduced for tensors, (2.10) Note that F contains n d ξ elements, whereas storing F (1) , . . . , F (d) needs only k r k−1 n ξ r k elements. For brevity we can define the maximal TT rank r := max k r k , which gives us a linear storage complexity of the TT decomposition, O(dn ξ r 2 ). If is a Lagrange polynomial basis, we can choose it together with quadrature points Ξ (k) := {ξ (k) i } and weights {w i } such that i (ξ (k) j ) = δ i,j , and hence The summation over s 0 , . . . , s d in the right hand side can be computed recursively by multiplying only two tensors at a time. Assuming that a partial result R k ∈ R r 0 ×r k is given, we can compute as a matrix product with a O(n ξ r 2 ) complexity. Starting with R 0 = 1 and finishing with R d = E[f ], we complete the entire integration in O(dn ξ r 2 ) operations. Such a recursive sweep over TT cores is paramount to computing TT approximations of arbitrary functions, or to solution of operator equations in the TT format. For example, the TT-Cross method (see Appendix A) requires O(dn ξ r 2 ) samples from a function f (ξ) and O(dn ξ r 3 ) further floating point operations to compute a TT approximationf (ξ) ≈ f (ξ). Similarly, linear algebra on functions can be recast to linear algebra on their TT cores with a linear complexity in the dimension (see Appendix B). This paper considers two approaches to tackle (2.5) . In this section, we first introduce TTRISK for the so-called reduced form of (2.5). To ensure that CVaR ε β is a statistically unbiased estimator of CVaR β , we introduce a control variate correction in subsection 3.2. Assume that c(y, u; ξ) = 0 is uniquely solvable, i.e., for each u ∈ U ad there exists a unique solution mapping y(u; ·) : Ξ → Y for P a.a ξ ∈ Ξ. 
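Before continuing with the reduced-space formulation, a brief numerical aside illustrates the recursive integration (2.11)–(2.12). The sketch below contracts discrete TT cores F^(1), ..., F^(d) with univariate quadrature weights one core at a time, at a cost linear in d; the cores and weights here are random placeholders rather than an approximation of any particular function.

```python
import numpy as np

def tt_expectation(cores, weights):
    """E[f] for a discrete TT decomposition, cf. (2.11)-(2.12).

    cores[k] has shape (r_{k-1}, n_xi, r_k); weights[k] has shape (n_xi,).
    Cost: O(d * n_xi * r^2) instead of n_xi^d for the full tensor.
    """
    R = np.ones((1, 1))                       # running product, R_0 = 1
    for F, w in zip(cores, weights):
        Fw = np.tensordot(w, F, axes=(0, 1))  # integrate out variable k: (r_{k-1}, r_k)
        R = R @ Fw                            # accumulate R_k
    return R.item()                           # R_d is 1-by-1

if __name__ == "__main__":
    d, n_xi, r = 6, 9, 4
    rng = np.random.default_rng(1)
    ranks = [1] + [r] * (d - 1) + [1]
    cores = [0.3 * rng.standard_normal((ranks[k], n_xi, ranks[k + 1]))
             for k in range(d)]
    w = np.full(n_xi, 1.0 / n_xi)             # placeholder quadrature weights
    print("E[f_tt] =", tt_expectation(cores, [w] * d))
```

We now return to the reduced-space formulation.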
The resulting optimization problem (2.5) only depends on u and is given by where j(u; ξ) := J (y(u; ξ), u; ξ). The exact expectation in CVaR ε β can be approximated by a quadrature similarly to (2.11). We denote the total number of quadrature points N (which is formally N = n d ξ in the Cartesian formulation), and the approximate expectation E N [f ] := E[f ], wheref is a TT approximation (2.7) to f . This leads to the following approximation of (3.1): The function G(ξ) := g ε (j(u; ξ) − t) under the expectation can be approximated by a TT-Cross algorithm as described in Appendix A, and the integration carried out as shown in (2.12). However, to optimize the entire cost J N we need to calculate the first and second order derivatives. We readily obtain that the first order derivatives are and the second order derivatives are Observe that the second derivatives of J N computed above depend on g ε . From (2.4), we see that g ε decays rapidly away from the interval (−ε, ε). Consequently, the Hessian . . , N, since all but the (1, 1) block of H are zeros [25] . To circumvent this problem, we proceed similarly to the augmented Lagrangian methods. We start with t 0 = ε 0 = E N [j(u 0 ; ξ)]. In many cases there will be a lot of points j(u; ξ i ) − t in a significant support of g ε . During the course of Newton iterations, we decrease ε geometrically, where 0 < µ ε < 1 is a tuning factor (e.g. µ ε = 1/2). However, the next iterate t m+1 may end up far on the tail of g ε again. To prevent this from happening, we perform a line search, where in addition to the standard Armijo rule (with zero Armijo parameter) where u m+1 = u m + hδ um , t m+1 = t m + hδ tm with a step size h > 0 and Newton directions δ um , δ tm , we require that for some 0 < θ < 1/4. This ensures that E N [g ε (j(u m+1 ; ξ) − t m+1 )] stays away from zero by a fixed fraction of the maximum of g ε (t), which is equal to 1/(4ε). In the numerical experiments we have found the iterations to be robust and insensitive for θ between 10 −2 and 10 −1 . Remark 3. 1 . If we assume that (3.11) holds for the initial guess (m = 0) for the chosen θ, then, since j(u, ξ) and g ε (t) are continuous in u and t, by induction it is always possible to find a step size h > 0 in the Newton iteration If the dimension of discretized u is small, we can compute the TT-Cross approximation of ∇ u j(u m , ξ), ∇ uu j(u m , ξ), g ε (j(u m ; ξ) − t m ) and g ε (j(u m ; ξ) − t m ) directly, compute the expectations in (3.3) and (3.4) , and solve for the Newton directions. However, if u is large, ∇ uu j(u m , ξ) is large and dense, and its TT decomposition becomes too expensive. In fact, even multiplying ∇ uu J N (u, t) by a vector in an iterative solver would require recomputation of the TT decomposition in each iteration. To avoid this problem, we propose to replace the expectation in (3.4 ) by sampling at a fixed pointξ. This gives and consequently a fixed-point Hessian The choice ofξ is motivated by the mean value theorem. We can treat as an expectation of ∇ uu j(u; ξ) alone over a probability density functionρ( , so we can take the right hand side as an approximation also in a general case. This gives . ( 3.15) We focus on ∇ uu j (and hence on g ε ) since ∇ uu j is usually the dominant part of ∇ uu J N . Note that the action ∇ uu j(u;ξ)·δ u can usually be applied efficiently, since this requires the solution of one forward and one adjoint deterministic problems at fixed ξ =ξ. 
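The explicit formula for the smoothing function (2.4) is not reproduced above, but a common C-infinity choice, consistent with the properties used in this section (its derivative is a logistic sigmoid and its second derivative attains the maximum 1/(4ε)), is the scaled softplus g_ε(x) = ε log(1 + exp(x/ε)); the paper's exact g_ε may differ. The following sketch evaluates this smoothing, its derivatives, and the corresponding smoothed CVaR at a fixed t; all function names are illustrative.

```python
import numpy as np

def g_eps(x, eps):
    """A common C-infinity smoothing of (x)_+ = max(x, 0):
    the scaled softplus g_eps(x) = eps * log(1 + exp(x / eps)).
    (Assumption: the paper's g_eps in (2.4) may be a different smoothing.)"""
    return eps * np.logaddexp(0.0, x / eps)   # numerically stable softplus

def dg_eps(x, eps):
    return 1.0 / (1.0 + np.exp(-x / eps))     # logistic sigmoid

def d2g_eps(x, eps):
    s = dg_eps(x, eps)
    return s * (1.0 - s) / eps                # peak value 1/(4*eps) at x = 0,
                                              # matching the bound quoted with (3.11)

def smoothed_cvar(costs, t, beta, eps, weights=None):
    """Quadrature/Monte Carlo approximation of CVaR^eps_beta at a fixed t."""
    w = np.full(costs.shape, 1.0 / costs.size) if weights is None else weights
    return t + np.sum(w * g_eps(costs - t, eps)) / (1.0 - beta)

if __name__ == "__main__":
    x = np.linspace(-1, 1, 5)
    print(g_eps(x, 0.1), d2g_eps(0.0, 0.1))   # second derivative ~ 2.5 = 1/(4*0.1)
```

Returning to the fixed-point Hessian (3.13)–(3.15): as noted above, the dominant action ∇_uu j(u; ξ̄)·δ_u amounts to one forward and one adjoint deterministic solve.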
Similarly, ∇ uu P(u) is a sparse, and ∇ u j(u;ξ)∇ u j(u;ξ) * is a rank-1 matrix after discretization. This allows us to solve the Newton system (3.12) with H instead of H iteratively with fast matrix-vector products. The entire procedure is summarized in Algorithm 1. Require: α, β, line search parameter θ, number of iterations I max , smoothness reduction factor µ ε , stopping tolerance, procedure to compute j(u, ξ). (3.13) and (3.15) using (2.11),(2.12). Solve (3.12) with H or H using Conjugate Gradients method. 6 : for control constraints 7: Find 0 < h ≤ 1 such that (3.10) and (3.11) hold. 8 : Set m = m + 1 10: end while 11: Smoothed CVaR as control variate for Monte Carlo. Assuming thatg ε andj are TT approximations to g ε and j, respectively, we can write since we assume that the quadrature is exact for the polynomial basis used. Using add-andsubtract trick, we can write the exact risk-measure functional as follows, However, for the last term we can use a different quadrature for computing the expectation, such as Monte Carlo with M samples. This defines a corrected smoothed functional: is small (where δ is the total error from the TT-Cross (Algorithm 2) and smoothing), by the law of large numbers the variance of this estimator reads where the latter term is the variance of the straightforward Monte Carlo approximation of R t . Consequently, one needs a much smaller M to achieve the same variance (i.e. Mean Square Error) threshold. One can say thatg ε is used as a control variate for variance reduction of Monte Carlo [31, 26] , or vice versa, that (3.16) is the 2nd level correction [10, 24] to the surrogate modelg ε . This makes R ε,M t,β,N a statistically unbiased estimator of R t . Similarly we can correct the cost function and its gradient (3.3): where θ(t) = 1 if t ≥ 0, and 0, otherwise. These gradients can be plugged into Line 4 of Algorithm 1 instead of (3.3). Since the variance of the correction is expected to be small, we omit it in the Hessian, turning Alg. 1 into a Gauss-Newton method. In this section, we first focus on the full-space formulation. This is followed by a Gauss-Newton system setup for the problem and a preconditioning strategy for this system. 4 be all weights in the quadrature (2.11). Moreover, we assume a Lagrangian basis expansion (2.8) , and for brevity we let L i (ξ) : Applying this formalism to y instead of f , and assuming that the constraint c(y, u, ξ) = 0 holds pointwise in ξ, we obtain semi-discrete equations Likewise, we can introduce the smoothed CVaR with the Monte Carlo correction (cf. (3.16 )) using the quadrature Let y ∈ Y ⊗N and p ∈ (Y * ) ⊗N be y i and p i stacked together. We introduce the Lagrangian In the differentiation, we will distinguish the components corresponding to different coefficients. This gives, for all j = 1, . . . , N , ∇ p j L = w j c(y j , u, ξ j ), where we have defined error correction shortcuts with θ(t) = 1 for t ≥ 0 and 0 otherwise. The second derivatives of (4.2) are difficult both notationally and computationally, since arbitrary points ξ , leading to nonzero Lagrangian polynomial values L j (ξ ), produce dense matrices. However, if we assume that the correction is small in magnitude, we can follow the arguments of Gauss-Newton methods and remove the correction derivatives in the Hessian, as well as second derivatives of c(y, u, ξ) and J (y, u, ξ). This way we obtain (where δ j,k denotes the Kronecker delta, and [f (y, u, ξ)] j evaluates f (y j , u, ξ j )) as well as symmetric terms. 
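Stepping back briefly to the control-variate correction (3.16)–(3.18) of Section 3.2, the construction can be summarized in a few lines: a deterministic expectation of the smoothed surrogate is corrected by a small Monte Carlo average of the difference between the exact non-smooth integrand and the surrogate. In the sketch below, `surrogate` stands in for the TT approximation g̃_ε(j̃(u; ξ) − t) and `exact_cost` for j(u; ξ); both are illustrative placeholders, and the distribution of ξ is simplified to a standard normal.

```python
import numpy as np

def cvar_control_variate(exact_cost, surrogate, surrogate_expectation,
                         t, beta, M, rng):
    """Unbiased CVaR estimate using a smoothed surrogate as control variate,
    in the spirit of (3.16):
        R ~ t + ( E_det[g_surr] + (1/M) sum_i [ (j_i - t)_+ - g_surr_i ] ) / (1 - beta).
    surrogate_expectation plays the role of the deterministic TT quadrature of
    the smoothed integrand; only the small correction term is sampled."""
    xi = rng.standard_normal(M)                    # placeholder random inputs
    j = exact_cost(xi)
    g = surrogate(xi)                              # smoothed/approximate integrand
    correction = np.maximum(j - t, 0.0) - g
    est = t + (surrogate_expectation + correction.mean()) / (1.0 - beta)
    stderr = correction.std(ddof=1) / np.sqrt(M) / (1.0 - beta)
    return est, stderr

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    eps, t, beta = 0.05, 1.0, 0.9
    exact_cost = lambda xi: np.exp(xi)             # toy j(u; xi)
    softplus = lambda x: eps * np.logaddexp(0.0, x / eps)
    surrogate = lambda xi: softplus(exact_cost(xi) - t)
    # Deterministic expectation of the surrogate, here by brute-force quadrature.
    z = np.linspace(-8, 8, 20001)
    pdf = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    surrogate_expectation = np.sum(surrogate(z) * pdf) * (z[1] - z[0])
    print(cvar_control_variate(exact_cost, surrogate, surrogate_expectation,
                               t, beta, M=20000, rng=rng))
```

Because only the small correction is sampled, a modest number of samples M suffices, as quantified in Section 3.2. We now return to the full-space Gauss–Newton terms derived above.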
For both notational and computational brevity, let us introduce First, we can note that all terms corresponding to differentiating L in y j or p j first contain the quadrature weight w j . In higher dimensions w j may vary over many orders of magnitude, and keeping them in the Hessian may render it extremely ill-conditioned. Instead, we divide the corresponding rows of the Hessian together with the right hand side (the derivatives of L) by w j in our formulation directly. This makes the Hessian non-symmetric, but a much lower condition number together with a potent preconditioner developed below allows one to use GMRES or BiCGStab with only a few iterations. Second, in the case of a linear-quadratic control we have P(u) = 1 2 u, M u u , and c(y, u, ξ) = c(y, ξ) + Bu. In a weakly nonlinear case we can consider an inexact Newton method, where P(u) and c(y, u, ξ) are approximated in this form. This allows one to resolve (4.1) explicitly and plug the corresponding component u = Proj U ad ((αM u ) −1 (−B) * N i=1 w i p i ) back into the Lagrangian, reducing it to three variables ( y, p, t). The reduced Gauss-Newton Hessian term appears: where w j is pending to removal as described above, and Proj U ad is a semi-smooth derivative of the projector (e.g. for box constraints this is just an indicator of U ad ). In addition to being quadratic, the control penalty is equivalent to a (weighted) squared L 2 -norm in many cases. This renders discretization of M u spectrally close to a diagonal matrixM u ; for example, one may use a standard lumping of the finite element mass matrix. This makes the discretized Hessian (4.3) easy to assemble, e.g. sparse when M u is replaced byM u . The purpose of this reduction of u becomes more apparent for the solution of the Gauss-Newton system in the TT format. An Alternating Least Squares method (cf. Appendix B) tailored to KKT systems [1] requires that all solution components are represented in the same TT decomposition [6] . Since y and p have the same size, this holds naturally; the only additional variable t is a single number that can be embedded into the same TT decomposition at a little cost. In contrast, a (potentially large) component u needs a nontrivial padding that may inflate the TT ranks and/or make the Hessian more ill-conditioned. Finally, we obtain the following linear system on the solution increments: H yy and A are (block) diagonal matrices with H yy j and A j := ∇ yĉ (y j , ξ j ) on the diagonal, W = 1 i · w j N i,j=1 ∈ R N ×N is a rank-1 matrix (after discretization), and H yt = H yt j , H yt w = H yt j w j whereas the righthand side components are Having solved (4.4), we perform the Newton update similarly to Algorithm 1, by setting y m+1 = y m + hδ y , p m+1 = p m + hδ p , t m+1 = t m + hδ t , and ε m+1 = µ ε ε m , where the step size h > 0 is chosen ensuring (3.10) and (3.11) . Observe that the system (4.4) can be ill-conditioned and thus requires a good preconditioner to solve it efficiently. In what follows, we discuss a Schur complement-based preconditioner. 4 . 3 . Preconditioning. We propose a matching strategy [29] to approximate the Schur complement to the Gauss-Newton matrix (4.4). First, since δ t is a single number, we can compute the corresponding Schur complement matrix where we can denote S yy := H yy − H yt 1 H tt (H yt w ) * for brevity. Since A is a linearization of c(y, u, ξ) (for example, the PDE operator), it is often invertible. 
In this case, the Schur complement towards the (1, 2) block of this matrix reads We propose a matching approximation consisting of three factors: where η = B ⊗ W / S yy is the scaling constant. Note that which was shown [1] to be small in norm compared to S for both limits α → ∞ when S = O(1), and α → 0 when S = O(α −1 ). Ultimately, we obtain the following right preconditioner: Note that this is a permuted block-triangular matrix, solving a linear system which requires the solution of smaller systems with H tt ,S and A. In turn, the solution withS requires the solution with (A * + ηS yy ) and (A + 1 η B ⊗ W ). If the constraints are defined by a PDE, A, A * and g ε (J − t)∇ yy J (inside H yy ) are sparse matrices, whereas the remaining terms g ε (J − t)∇ y J (∇ y J ) * , H yt 1 H tt (H yt w ) * and W are low-rank matrices, and can be accounted for using the Sherman-Morrison formula. We this section, we present various numerical examples to illustrate the efficiency of the proposed approach in both the reduced and full-space formulations. This section in fact goes beyond the above theoretical presentation in multiple ways. In section 5.1, we consider an optimal control problem in one spatial dimension and random coefficient. We study the approximation error in CVAR β due to each of the variables ε, n y (spatial discretization), n ξ ,, d, and TT truncation tolerance. We propose a strategy to select these parameters by equidistribution of the total error. Control variate strategy is applied to this problem in section 5.2. Section 5.3 focuses on the impact of the quantile β and the ε growth factor µ ε . In section 5.4 , we consider the two spatial dimension version of the problem and carry out a comparison between the reduced and full space formulations. We conclude with a realistic problem in section 5.5 , where we propose a risk-averse strategy for lockdown due to pandemics such as COVID- 19 . The TTRISK (Algorithm 1) is implemented based on TT-Toolbox 1 , whereas for the TT-Cross (Algorithm 2) we use a rank-adaptive implementation amen cross s from TT-IRT 2 . We run the computations using a default multithreading in Matlab R2019b which can spawn up to 10 threads in BLAS on an Intel Xeon E5-2640 v4 CPU. 5 . 1 . Elliptic PDE with affine coefficient. We first test the reduced formulation. Consider a PDE in one space dimension with random coefficients κ. The Karhunen-Loeve (KL) expansion of κ(x, ξ) (truncated to d terms) is defined by mean κ 0 (x) = 10 and covariance function x, x ∈ (0, 1). The control u is defined on a subdomain (0.25, 0.75), and B is the identity insertion operator: Bu(x) = u(x), x ∈ (0.25, 0.75), 0, otherwise. The PDE is discretised using piecewise linear finite elements on a uniform grid with n y points 0 = x 1 < · · · < x ny = 1, with the coefficient κ(x, ξ) and the control u(x) discretised by collocation at the midpoints {x i+1/2 }. The random variables ξ (1) , . . . , ξ (d) are discretised by collocation at n ξ Gauss-Legendre points on the interval (− √ 3, √ 3). We aim at estimating the total cost-error scaling. However, since the computation depends on a number of approximation parameters (ε, n y , n ξ , d and the TT truncation tolerance), those need to be selected judiciously to obtain the optimal total complexity. Next, we estimate the error (in CVaR) contributed by each of the parameters. This will allow us to select the parameter values by equalizing their corresponding error estimates. 
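To make the setup of Section 5.1 concrete, the sketch below assembles a truncated KL expansion of an affine random coefficient with mean κ_0 = 10, collocated at element midpoints, and solves a one-dimensional piecewise-linear finite element problem with the control supported on (0.25, 0.75) and ξ^(k) drawn uniformly from (−√3, √3). The Gaussian covariance parameters, the right-hand side and the state equation −(κ y′)′ = Bu used here are illustrative assumptions; they are not the exact problem (5.1) or objective of the paper.

```python
import numpy as np

def kl_affine_coefficient(x_mid, d, kappa0=10.0, sigma=1.0, corr_len=0.3):
    """Truncated KL expansion of an affine random field on midpoints x_mid:
        kappa(x, xi) = kappa0 + sum_k sqrt(lam_k) * phi_k(x) * xi_k,
    with a Gaussian covariance sigma^2 * exp(-|x - x'|^2 / (2*corr_len^2)).
    sigma and corr_len are illustrative values, not those of Section 5.1."""
    C = sigma**2 * np.exp(-0.5 * ((x_mid[:, None] - x_mid[None, :]) / corr_len) ** 2)
    lam, phi = np.linalg.eigh(C / x_mid.size)      # discrete KL eigenpairs
    lam, phi = lam[::-1][:d], phi[:, ::-1][:, :d]  # keep d dominant modes
    return lambda xi: kappa0 + phi @ (np.sqrt(np.maximum(lam, 0.0)) * xi)

def solve_diffusion_1d(kappa_mid, n_y, u):
    """P1 finite elements for -(kappa y')' = B u on (0,1), y(0)=y(1)=0, with
    kappa and the control collocated at element midpoints (an illustrative
    stand-in for the state equation in (5.1))."""
    h = 1.0 / (n_y - 1)
    x_mid = (np.arange(n_y - 1) + 0.5) * h
    # Right-hand side: control supported on (0.25, 0.75), zero elsewhere.
    f_mid = np.where((x_mid > 0.25) & (x_mid < 0.75), u, 0.0)
    K = np.zeros((n_y, n_y)); b = np.zeros(n_y)
    for e in range(n_y - 1):                       # element-by-element assembly
        K[e:e + 2, e:e + 2] += kappa_mid[e] / h * np.array([[1, -1], [-1, 1]])
        b[e:e + 2] += 0.5 * h * f_mid[e]
    y = np.zeros(n_y)
    y[1:-1] = np.linalg.solve(K[1:-1, 1:-1], b[1:-1])   # Dirichlet conditions
    return y

if __name__ == "__main__":
    n_y, d = 129, 10
    x_mid = (np.arange(n_y - 1) + 0.5) / (n_y - 1)
    kappa = kl_affine_coefficient(x_mid, d)
    xi = np.random.default_rng(3).uniform(-np.sqrt(3), np.sqrt(3), d)
    y = solve_diffusion_1d(kappa(xi), n_y, u=1.0)
    print("max |y| =", np.abs(y).max())
```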
The (relative) CVaR error at given parameters is defined as where t m is the output of Algorithm 1, and R * is the reference solution computed at the finest parameters ε = 3 · 10 −4 , tolerance= 10 −5 , n y = 1025, n ξ = 33, d = 10. We take the control regularization parameter α = 10 −6 and the confidence threshold β = 0. 5 ε TT rank Figure 1 shows that the error in CVaR depends almost linearly on ε (in fact the decay is slightly faster, which may be due to particular symmetries in the solution). This is consistent with [25, Lemma 3. 4 .2] where a linear theoretical convergence was established. However, the TT ranks of the logistic function derivatives grow also linearly with 1/ε. This will eventually lead to noticeable computing costs as ε decreases. It is thus crucial that the cross approximation of g ε uses a precomputed TT approximation of j(u, ξ) instead of original PDE solutions. In turn, the TT ranks of j(u, ξ) and ∇ u j(u, ξ) stay almost constant near 40, and hence the number of PDE solutions is almost independent of ε in the considered range. In Figure 2 we vary the number of quadrature points introduced in each of the random variables ξ. As expected from many previous works (see e.g. [18, 3, 15] ), the approximation converges exponentially in n ξ . The TT ranks stabilize towards the value prescribed by ε in Figure 1 . Figure 3 benchmarks the scheme against the relative error tolerance used for stopping both the TT-Cross Algorithm 2 and TTRISK Algorithm 1, as well as for the TT approximation. The convergence is linear except the very large values of the threshold, when the TT-Cross may stop prematurely after an accidental drop of the iteration increment below the threshold. The TT ranks grow logarithmically or even slower with 1/tol, which is the enabling observation for many applications of tensor methods. In Figure 4 we vary the number of finite elements in the spatial discretization of (5.1). As expected from the second order of consistency of the continuous linear elements, the CVaR error converges quadratically with n y . The TT ranks stay almost constant, which shows that even the coarsest grid resembles enough qualitative features of the solution. Lastly in this series, Figure 5 varies the dimension of the random vector ξ, that is, the number of terms in the KL expansion. For the Gaussian covariance matrix of κ we observe the expected exponential convergence in d, and the stabilization of the TT ranks as long as the contribution of the latter random variables (ξ (k) , . . . , ξ (d) for k ≥ k 0 with some k 0 > 1) becomes negligible compared to the (fixed) TT truncation threshold. Equipped with the individual error estimates, we can vary only n y (the spatial grid size), since the number of grid intervals is restricted to a power of 2 to ensure that the control subdomain is aligned to all grids considered. For each n y , this gives (note the division by 5 In Figure 6 we show the TT ranks, the corresponding numbers of PDE solutions required for the TT-Cross approximation of j(u, ξ) and ∇ u j(u, ξ), as well as the total computing time. We observe that both complexity indicators depend linearly on 1/error or slower. This is to be compared with the solution of a deterministic PDE (5.1) with linear finite elements, which provide an error = O(n −2 y ) with the computing cost = O(n y ), that is, the deterministic problem scales as cost = error −1/2 . We see that the TT solution of a highdimensional stochastic problem contributes the same amount of complexity. 
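The parameter selection by error equidistribution used above can be written down explicitly once the empirical rates of Figures 1–5 are available: the error is roughly linear in ε and in the TT/stopping tolerance, quadratic in 1/n_y, and exponentially decaying in n_ξ and d. The sketch below inverts such error models so that each of the five sources contributes about a fifth of the target error; all constants are placeholders to be fitted from the convergence studies.

```python
import numpy as np

def equilibrate_parameters(target_error,
                           c_eps=1.0,           # error ~ c_eps * eps          (cf. Fig. 1)
                           c_y=1.0,             # error ~ c_y * n_y**(-2)      (cf. Fig. 4)
                           c_xi=1.0, q_xi=0.5,  # error ~ c_xi * exp(-q_xi*n_xi) (cf. Fig. 2)
                           c_d=1.0, q_d=0.5):   # error ~ c_d * exp(-q_d*d)    (cf. Fig. 5)
    """Pick (eps, tol, n_y, n_xi, d) so that each of the five error sources
    contributes roughly target_error / 5, mirroring the equidistribution
    strategy of Section 5.1.  All constants are illustrative placeholders."""
    share = target_error / 5.0
    eps = share / c_eps
    tol = share                                   # TT / stopping tolerance ~ error
    n_y = int(np.ceil(np.sqrt(c_y / share)))
    n_xi = int(np.ceil(np.log(c_xi / share) / q_xi))
    d = int(np.ceil(np.log(c_d / share) / q_d))
    return {"eps": eps, "tol": tol, "n_y": n_y, "n_xi": n_xi, "d": d}

if __name__ == "__main__":
    for err in [1e-2, 1e-3, 1e-4]:
        print(err, equilibrate_parameters(err))
```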
Figure 1 we notice that ε is the slowest in terms of convergence. In this experiment we add the Monte Carlo control variate correction (3.17) , (3.18) to the gradient of the CVaR cost function during the optimization, and to the CVaR computation (3.16) in the end. To estimate the accuracy of the correction itself, we split the entire budget of M Monte Carlo samples into 16 batches of M/16 samples each. Then we compute the average over those M/16 samples in each batch independently. This gives us 16 iid samples of the correction. The ultimate correction is computed as an average of those 16 corrections, which is equivalent to the simple averaging of all M samples. However, we can also compute the standard deviation over the 16 correction samples. This estimates the statistical error in the correction term computed using M/16 samples, and, since the TT approximation is a deterministic quantity, the error in the final result as well. Using the central limit theorem, we can then expect that the error in the actually applied correction (computed from M samples) is 4 times smaller. In Figure 7 we show both the average corrections over M samples, and standard deviations over the 16 correction samples divided by 4, as an estimate of the standard deviations of the actual correction terms. We see that the standard deviations decay with the law of large numbers, as expected. Moreover, both means and standard deviations decrease linearly with ε. In multilevel Monte Carlo, the estimated standard deviation can also be used to adapt the number of Monte Carlo samples towards the desired error threshold by extrapolating the law of large numbers [24] . However, the convergence (and error estimation) of the TT approximation can be more complicated. Therefore, we suggest to decrease ε until the TT ranks are still manageable, and then compute the control variate correction to both estimate and improve the error. This also allows us to decouple the TT and Monte Carlo steps, or to reuse previously computed TT approximations of e.g. the forward model solution. 5 . 3 . Effect of the quantile. In the last experiment with the 1D PDE, we vary the quantile of the confidence interval in CVaR. The results are reported in Figure 8 . As β increases, the model minimizes the cost g with higher confidence, which requires a stronger control (see the right plot of Figure 8) . However, the increasing 1/(1 − β) term in the gradient and Hessian renders the Newton method slower and less reliable. Recall that we start with a large ε and decrease it geometrically with Newton iterations according to (3.9) . This allows us to avoid getting too small values of g ε . The default value used in the previous experiments was µ ε = 0.5, which was a reasonable balance between the stability of the method and its convergence speed. However, for larger β the Newton method may break at exact machine zeros in g ε , leading to infinities in the solution increment. In the left plot of Figure 8 we vary both β and µ ε independently. Dashed circles denote the experiments where the Newton method failed to converge. We see that larger β requires slower iterations with larger µ ε . Of course, too large µ ε will just lead to unnecessarily many iterations. This means that the parameter µ ε may need problem-dependent tuning, although we believe that the observations in Figure 8 should serve as a good initial guess in a variety of cases. 4 . Elliptic PDE with log-normal coefficient. 
Now we consider a larger problem: we solve a PDE in two spatial dimensions, where the coefficient is a log-normal random field. Namely, we assume that log κ(x, ξ) is a normal field with mean zero and a covariance function with correlation length 0.5 and variance σ² = 0.05, and parametrize log κ(x, ξ) with d = 10 terms of the KL expansion. This accounts for 98% of the variance. The control u is defined on a disk inside the domain, and B is the corresponding identity insertion operator. The PDE is discretised using piecewise linear finite elements on a triangular grid with 1829 nodes, which is shown in Figure 9. The random variables ξ^(1), . . . , ξ^(d) are discretised by collocation at n_ξ = 9 Gauss-Hermite points. In this test we set β = 0.8, and correspondingly µ_ε = 0.7 (cf. Figure 8). The final smoothness width is ε = 3 · 10^−3, consistent with the TT approximation error threshold tol = 10^−3, the discretization error, and the KL truncation error. In Figure 10 we vary the control regularization parameter α and compare the reduced formulation introduced in Section 3 with the Lagrangian formulation from Section 4. We see that the TT ranks of the derivatives of the sigmoid function g_ε are of the same scale in both formulations. However, since the forward model involves solving a PDE, the bottleneck is actually the computation of a surrogate of the PDE solution: the TT approximation of the cost function j(u; ξ) in the reduced formulation, and the block TT format of δ_y and δ_p in the Lagrangian formulation. We see that the TT ranks of the Lagrangian solution are 50% larger than those of j, which is actually lower than the factor of 2 expected for a TT format representing two components (δ_y and δ_p) simultaneously. We also see that the TT ranks of j, δ_y and δ_p depend little on α. However, if we look at the CPU times (Fig. 10, right), we notice that the time of the Lagrangian method grows rapidly with α despite a nearly constant number of Newton iterations and TT ranks. This is due to the growth of the number of GMRES iterations in solving Eq. (4.4). The condition number of the KKT matrix preconditioned with Eq. (4.6) may still deteriorate with α in the considered case of a control on a subdomain. A better preconditioner may reduce the complexity. Nevertheless, for moderate values of α the Lagrangian formulation is preferable to the reduced one. This is because the KKT matrix is the exact Hessian of the Lagrangian, whereas the reduced formulation uses only an approximate (fixed-point) Hessian of j(u, ξ). This is even more crucial for large β, as we have also seen in the previous test. The imperfection of the reduced Hessian can be seen in the growing number of Newton iterations.

5.5. Infection ODE model. In the last example, we experiment with an epidemiological ODE model from [8], which was used to estimate the progression of COVID-19 in the UK for 90 days starting from 1st March 2020. This model splits the population dynamics into the following compartments:
• Susceptible (S): individuals who are not in contact with the virus at the moment.
• Exposed (E) to the virus, but not yet infectious.
• Infected SubClinical (I_SC1) at the moment, but who may require hospitalization.
• Infected SubClinical (I_SC2), recovering without any medical intervention.
• Infected Clinical (I_C1): individuals in hospital who may die after some time.
• Infected Clinical (I_C2): individuals in hospital who are recovering.
• Recovered (R) and immune to reinfection.
• Deceased (D).
Each of those categories is further stratified into 5 age groups: 0-19, 20-39, 40-59, 60-79 and 80+. The age group is denoted by a subscript, e.g. E i is the number of exposed individuals in the ith age group (i = 1, . . . , 5) , and similarly for other compartments. Each variable of the list above is thus a vector of length 5. On the other hand, three simplifications are in order. First, in the early stage of the pandemic the number of individuals affected by the virus is a small fraction of the entire population. This allows us to take S constant equal to the initial population, and exclude it from the ODE system. Similarly, R and D are terminal states in the sense that none of the other variables depend on them. This again allows us to decouple R and D from the system of ODEs. Thus, we arrive at a linear system involving only Exposed and Infected numbers: Here I is the identity matrix of size 5, diag(·) stretches a vector into a diagonal matrix, A u = χ · diag(S) · C u · diag( 1 N ), and the remaining variables are model parameters: • χ: probability of contact between S and I SC individuals. • κ = 1/d L : transition rate of Exposed becoming SubClinical. d L is the number of days in the Exposed state. • γ C = 1/d C : transition rate of SubClinical turning into Clinical. Similarly, d C is the number of days this transition takes. • γ R = 1/d R : recovery rate from I SC2 . • γ R,C = 1/d R,C : recovery rate from I C2 . • ν = 1/d D : death rate from I C1 . • ρ = (ρ 1 , . . . , ρ 5 ) ∈ R 5 : age-dependent probabilities of Exposed turning into the first SubClinical category. • ρ = (ρ 1 , . . . , ρ 5 ) ∈ R 5 : age-dependent probabilities of SubClinical becoming the first Clinical category. • N = (N 1 , . . . , N 5 ) ∈ R 5 : population sizes in each age group. • C u ∈ R 5×5 : the contact matrix, which depends on the control u. Scalar-vector operations (1/N , 1 − ρ, etc.) are understood as elementwise operations. Another parameter is the number of infected individuals on day 0 (1st March) N 0 . It is split further across the age groups as follows: The population size S = N is taken from the office of national statistics, mid 2018 estimate. The contact matrix consists of four contributions, corresponding to people activities: where C * are pre-pandemic contact matrices (for details of their estimation see [8] ), and α * are coefficients of reduction of activities introduced in March 2020. (Here * stands for home, work, school or other.) In turn, those are constructed as follows. Firstly, we set α home = (1, . . . , 1), since home contacts cannot be influenced. For the remaining activities, noting that the lockdown in the UK was called on day 17 (18th March), we set where u work , u school and u other are the control signals (the lockdown measures) that we are going to optimize, and α 123 , α 4 , α 5 are their proportions in the corresponding age groups. In the cost function we penalize the total fatalities and the hospital capacity exceedance. As long as ( 5.4 ) is solved, the number of deaths can be calculated directly as Moreover, we aggregate and penalize I C exceeding 10000. Finally, we penalize the strength of the lockdown measures, i.e. the norm of the control u(τ ) = (u work (τ ), u school (τ ), u other (τ )), as well as constraining u work ∈ [0, 0.69], u school ∈ [0, 0.9] and u other ∈ [0, 0.59]. This gives us a deterministic cost where T = 90 is the simulation interval, and is the regularization parameter. 
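For illustration, the sketch below integrates a linear compartmental system of the type (5.4) with the implicit Euler method (the time step 0.1 used in the experiments is adopted). The 25-dimensional state stacks the five age groups of E, I_SC1, I_SC2, I_C1 and I_C2, and the block layout of the generator is only a plausible reading of the compartment descriptions above; the calibrated UK model of [8], its contact matrices and parameter values are not reproduced, so all numbers are placeholders.

```python
import numpy as np

def build_generator(Au, kappa, gC, gR, gRC, nu, rho, rho_p):
    """Schematic generator of a linear compartment system of the form (5.4) for
    the stacked state x = (E, I_SC1, I_SC2, I_C1, I_C2), each a length-5 age
    vector.  The block structure is a plausible reading of the descriptions in
    the text; the calibrated model of [8] may differ in detail."""
    I5 = np.eye(5)
    Z = np.zeros((5, 5))
    return np.block([
        [-kappa * I5,               Au,                      Au,        Z,         Z],
        [kappa * np.diag(rho),      -gC * I5,                Z,         Z,         Z],
        [kappa * np.diag(1 - rho),  Z,                       -gR * I5,  Z,         Z],
        [Z,                         gC * np.diag(rho_p),     Z,         -nu * I5,  Z],
        [Z,                         gC * np.diag(1 - rho_p), Z,         Z,         -gRC * I5],
    ])

def implicit_euler(M_of_t, x0, T=90.0, dt=0.1):
    """Implicit Euler for x' = M(t) x: (I - dt*M(t_{n+1})) x_{n+1} = x_n."""
    n = x0.size
    steps = int(round(T / dt))
    x = x0.copy()
    traj = np.empty((steps + 1, n)); traj[0] = x
    for k in range(steps):
        x = np.linalg.solve(np.eye(n) - dt * M_of_t((k + 1) * dt), x)
        traj[k + 1] = x
    return traj

if __name__ == "__main__":
    rng = np.random.default_rng(4)
    Au = 0.2 * rng.random((5, 5))                  # placeholder contact operator
    rho = np.full(5, 0.4); rho_p = np.full(5, 0.3)
    M = build_generator(Au, kappa=1/4, gC=1/5, gR=1/7, gRC=1/10, nu=1/12,
                        rho=rho, rho_p=rho_p)
    x0 = np.concatenate([np.full(5, 100.0), np.zeros(20)])   # initially Exposed only
    traj = implicit_euler(lambda t: M, x0)
    print("state after 90 days:", traj[-1].round(1))
```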
However, optimization of (5.8) may be misleading, since the model parameters are not known in advance, and can only be estimated. In particular, [8] employed an Approximate Bayesian Computation (ABC), which used existing observations of daily deaths and hospitalizations in the UK to form the likelihood. This renders model parameters into random variables, which are distributed according to the posterior density. In turn, this motives a modification of (5.8) into a risk-averse cost function, for example, using CVaR with β = 0. 5 . More precisely, we aim to minimize Ideally, the expectations in CVaR need to be computed with respect to the posterior density from ABC. However, the latter is a complicated multivariate function, which lacks an independent variable parametrization necessary to set up the discretization and TT approximation. (A possible solution to this using optimal transport [4] can be a matter of future research.) As a proof of concept, we simplify the distribution to independent uniform, reflecting means and variances estimated by ABC. Thus, we assume χ ∼ U(0. 13 21 . This allows us to use the reduced optimization formulation. Moreover, since the cost contains non-smooth functions, we abandon the Newton scheme, resorting to the Projected Gradient Descent method with a finite difference computation of the gradient of (5.9). The ODE (5.4) is solved using the implicit Euler method with time step 0.1. The control regularization = 100 is taken from [8] , and the CVaR smoothing parameter ε = 1000 corresponds to the relative width of the smoothed region ε/t < 10 −2 , matching the bias of the smoothed CVaR and the TT approximation error. In Table 1 we vary σ and investigate the behavior of the reduced TT formulation for CVaR. Note that we need more iterations and higher TT ranks for the larger variance. This is also reflected in a larger total cost, which is dominated by the CVaR term. In Figure 11 , we compare the controls computed with the two values of σ, as well as the minimizer of the deterministic cost (5.8). We see that a small σ yields the solution which is similar to the deterministic one. However, the risk-averse control for a larger variance of the parameters is more conservative: it tends to be larger, hitting the constraints at almost the entire interval except the final point, where the control stops making any influence on the system. In the bottom right panel of Figure 11 we plot the predicted hospitalization numbers. The historic numbers are obtained by simulating the deterministic model (using mean values of the parameters in (5.10)) with the control derived from the smoothed Google daily mobility data 3 . We see that both optimization techniques allow one to reduce the hospital occupancy. However, the deterministic approach tends to underestimate I C compared to the mean risk-averse value. In addition, the CVaR frameworks enables a rigorous uncertainty quantification. This paper has introduced a new algorithm called TTRISK to solve risk-averse optimization problems constrained by differential equations (PDEs or ODEs). TTRISK can be applied to both the reduced and full-space formulations. The article also introduces a control variate correction to get unbiased estimators. Various strategies to choose the underlying algorithmic parameters have been outlined throughout the paper, especially in the numerics section. This is carried out by carefully taking into account all the approximation errors. 
The TT framework offers multiple advantages, for instance, our numerical examples illustrate that it can help overcome the so-called "curse of dimensionality". Indeed, the approach introduced here has been successfully applied to a realistic problem, with 20 random variables, to study the propagation of COVID-19 and to devise optimal lockdown strategies. This article aims to initiate new research directions in the field of risk-averse optimization. There are many open questions that one could pursue, a few examples are: Figure 11 . Control signals and predicted hospitalizations numbers for different optimization strategies. In the bottom right plot, blue circles denote mean I C after CVaR optimization, shaded area denotes 95% confidence interval. Suppose f (ξ) is defined as a minimizer of a cost function C(f ). We aim at minimizing C over a TT decomposition instead by driving ∇C(f ) = 0. However,f is a product expansion in the actual degrees of freedom {F (k) }, which makes the problem nonlinear even if ∇ f C(f ) was linear. Instead of running a generic nonlinear optimization of all TT cores simultaneously, we can resort to the Alternating Least Squares [16] and Density Matrix Renormalization Group [39, 36] methods. Those alleviate the nonlinearity issue by iterating over k = 1, . . . , d, solving only ∇ F (k) C(f ) = 0 in each step, similarly to the coordinate descent method. Note thatf is linear in each individual factor F (k) . Applying this coordinate descent idea to the problem of interpolating a given function with a TT decomposition yields a family of TT-Cross algorithms [28, 34, 5] . Suppose we are given a procedure to evaluate a continuous function f (ξ) at any given ξ. We iterate over k = 1, . . . , d and in each step compute F (k) by solving an interpolation conditioñ j ) ∈ R d : j = 1, . . . , r k−1 n ξ r k }, (A.1) over some carefully chosen sampling sets Ξ k . Stretching the tensor F (k) into a vector f (k) = [F (k) (s k−1 , i, s k )] ∈ R r k−1 n ξ r k , we can write (A.1) as a linear equation F =k f (k) = f (Ξ k ), where each row of the matrix F =k ∈ R r k−1 n ξ r k ×r k−1 n ξ r k is given by (A.2) where ξ j ∈ Ξ k , j = 1, . . . , r k−1 n ξ r k . To compute this efficiently in a recursive manner similar to (2.12), we restrict Ξ k to the Cartesian form Moreover, for the left and right sets Ξ k we assume nestedness conditions This way, given F k , we can continue the recursion by computing i ), and (A.8) and extracting F k−1 simply as submatrices of F ≤k and F ≥k , respectively. This needs O(n ξ r 3 ) operations. The particular elements extracted from e.g. Ξ k−1 . The entire iteration, which we call TT-cross, is outlined in Algorithm 2. in each step. Ifà andb are given by TT decompositions, A k and b k can be computed recurrently in the course of iteration using matrix products similar to (A.8), (A.9). One full iteration over k = 1, . . . , d needs O(dn 2 ξ R 2 r 2 + dn ξ Rr 3 ) operations [16] , where R := max k R k . 
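The essence of the cross interpolation behind Algorithm 2 is easiest to see in two dimensions (d = 2), where a rank-r approximation is recovered from r adaptively chosen rows and columns. The toy sketch below uses full pivoting on an explicitly formed matrix purely for clarity; the actual TT-Cross applies this principle core by core with nested, maxvol-selected index sets and never forms the full tensor. The function names are illustrative.

```python
import numpy as np

def cross_2d(get_entry, shape, rank, tol=1e-10):
    """Greedy cross (skeleton) approximation of a matrix accessible entry-wise
    through get_entry(i, j): a toy 2D analogue of the TT-Cross interpolation
    (A.1).  For clarity the full matrix is formed to search pivots; practical
    cross methods restrict the search to a few sampled rows and columns."""
    n, m = shape
    A = np.array([[get_entry(i, j) for j in range(m)] for i in range(n)])
    E = A.copy()                                   # residual, used for pivoting
    approx = np.zeros_like(A)
    for _ in range(rank):
        i, j = np.unravel_index(np.argmax(np.abs(E)), E.shape)
        if abs(E[i, j]) < tol:
            break
        approx += np.outer(E[:, j], E[i, :]) / E[i, j]   # rank-1 cross update
        E = A - approx
    return approx

if __name__ == "__main__":
    n = 200
    x = np.linspace(0.0, 1.0, n)
    f = lambda i, j: 1.0 / (1.0 + x[i] + x[j])     # smooth bivariate function
    Af = cross_2d(f, (n, n), rank=8)
    A = 1.0 / (1.0 + x[:, None] + x[None, :])
    print("relative error:", np.linalg.norm(A - Af) / np.linalg.norm(A))
```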
Choose initial sets Ξ 0 ) samples of f 5: Solve F =k f (k) = f (Ξ k ) using the matrix Let Ξ >k−1 = [Ξ (k) × Ξ >k ]| I ≥k , and F >k−1 = F ≥k O(n ξ r 2 ) samples of f Low-rank solvers for unsteady Stokes-Brinkman optimal control problem with random data Spectral tensor-train decomposition Analytic regularity and polynomial approximation of parametric and stochastic elliptic PDE's Deep composition of Tensor-Trains using squared inverse Rosenblatt transports Parallel cross interpolation for high-precision calculation of highdimensional integrals Computation of extreme eigenvalues in higher dimensions using block tensor train format Uncertainty quantification for subsurface flow problems using coarsescale models, in Numerical analysis of multiscale problems Using mobility data in the design of optimal lockdown strategies for the COVID-19 pandemic An interior-point approach for solving risk-averse PDE-constrained optimization problems with coherent risk measures Multilevel Monte Carlo methods Zamarashkin, How to find a good submatrix A continuous analogue of the tensor-train decomposition Tensor Spaces And Numerical Tensor Calculus Low-rank Kronecker-product approximation to multidimensional nonlocal operators. I. Separable approximation of multi-variate functions Deep neural network expression of posterior expectations in Bayesian PDE inversion The alternating linear scheme for tensor optimization in the tensor train format Tensor numerical methods in scientific computing Tensor-structured Galerkin approximation of parametric and stochastic elliptic PDEs An Approach for the Adaptive Solution of Optimization Problems Governed by Partial Differential Equations with Uncertain Coefficients Optimization of PDEs with uncertain inputs, in Frontiers in PDE-Constrained Optimization Risk-averse PDE-constrained optimization using the conditional value-at-risk Risk-averse optimal control of semilinear elliptic PDEs A primal-dual algorithm for risk minimization Multilevel quasi-Monte Carlo methods for lognormal diffusion problems Newton-Based Methods for Smoothed Risk-Averse PDE-constrained Optimization Problems A multi level Monte Carlo method with control variate for elliptic PDEs with log-normal coefficients Tensor train decomposition TT-cross approximation for multidimensional arrays Regularization-robust preconditioners for timedependent PDE-constrained optimization problems An inexact gauss-newton method for inversion of basal sliding and rheology parameters in a nonlinear stokes ice sheet model Optimization of conditional value-at-risk Conditional value-at-risk for general loss distributions Fast adaptive interpolation of multi-dimensional arrays in tensor train format Approximation rates for the hierarchical tensor format in periodic Sobolev spaces The density matrix renormalization group Lectures on stochastic programming Stochastic averaging of nonlinear flows in heterogeneous porous media Density matrix algorithms for quantum renormalization groups The Center for Mathematics and Artificial Intelligence (CMAI) and Department of Mathematical Sciences Email address: s.dolgov@bath.ac.uk Appendix A. Cross approximation Appendix B. Tensor Train algebra Once a TT decomposition is constructed, simple operations can be performed directly with the TT cores. 
Besides the fast integration (2.12), additions of functions approximated by their TT formats, pointwise products and actions of linear operators can be written in the TT format with explicitly computable TT cores, and hence with a linear complexity in d [27]. For example, the addition of functions f̃(ξ) and g̃(ξ), defined by their TT cores {F^(k)} and {G^(k)} and TT ranks (r_0, . . . , r_d) and (p_0, . . . , p_d), respectively, is realized by the TT decomposition with block-diagonal cores

H^(k)(ξ^(k)) = diag( F^(k)(ξ^(k)), G^(k)(ξ^(k)) ), k = 1, . . . , d,

of TT ranks (r_0 + p_0, . . . , r_d + p_d). However, such explicit decompositions are likely to be redundant. For example, we can immediately compress H^(1) to [F^(1)(ξ^(1)) G^(1)(ξ^(1))], and similarly the summation over s_d collapses H^(d). In general, a quasi-optimal approximate re-compression of a TT decomposition [27] can be performed in O(d n_ξ r^3) operations by using truncated Singular Value Decompositions (SVD) in a recursive manner similar to (2.12).

Linear operators acting on multivariate functions can also be decomposed (or approximated) in a similar TT format that enables a fast computation of their action. An operator A ∈ L(F, F) with F = F^(1) × · · · × F^(d) can be approximated by a TT operator Ã of the form

Ã = Σ_{t_0, . . . , t_d} A^(1)_{t_0, t_1} ⊗ · · · ⊗ A^(d)_{t_{d−1}, t_d},

where A^(k)_{t_{k−1}, t_k} is an operator acting on F^(k)(ξ^(k)), and the corresponding ranks are denoted R_0, . . . , R_d. The image Ãf̃ can now be written as a TT decomposition with cores

(Ãf̃)^(k)_{m_{k−1}, m_k}(ξ^(k)) = ( A^(k)_{t_{k−1}, t_k} F^(k)_{s_{k−1}, s_k} )(ξ^(k)),

where t_k = mod(m_k − 1, R_k) + 1 and s_k = ⌊(m_k − 1)/R_k⌋ + 1, so that the TT ranks of the image are R_k r_k. Linear equations can be solved by Alternating Least Squares. We iterate over k = 1, . . . , d, solving in each step Ãf̃ = b̃ with respect to the TT core F^(k) in the representation of f̃. Constructing a vector-function f^(≠k)(ξ) similarly to (A.2), we can write f̃(ξ) = f^(≠k)(ξ) f^(k), and solve a linear system A_k f^(k) = b_k for the active core, where A_k and b_k are the projections of Ã and b̃ induced by f^(≠k).
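As a minimal numerical illustration of the core-wise addition above, the sketch below builds the block-diagonal cores of the sum of two discrete TT decompositions and verifies the result against full evaluation; the helper names `tt_add` and `tt_full` are illustrative, and the SVD-based re-compression mentioned above is omitted.

```python
import numpy as np

def tt_full(cores):
    """Evaluate a discrete TT decomposition as a full tensor (for checking only)."""
    T = cores[0]                                   # shape (1, n, r_1)
    for F in cores[1:]:
        T = np.tensordot(T, F, axes=(-1, 0))       # contract the shared rank index
    return T.reshape(T.shape[1:-1])                # drop the boundary ranks r_0 = r_d = 1

def tt_add(A, B):
    """Cores of the sum of two TT decompositions: block-diagonal middle cores,
    with the first (last) cores concatenated along the right (left) rank index."""
    d = len(A)
    C = []
    for k, (F, G) in enumerate(zip(A, B)):
        rF0, n, rF1 = F.shape
        rG0, _, rG1 = G.shape
        if k == 0:
            H = np.concatenate([F, G], axis=2)     # (1, n, rF1 + rG1)
        elif k == d - 1:
            H = np.concatenate([F, G], axis=0)     # (rF0 + rG0, n, 1)
        else:
            H = np.zeros((rF0 + rG0, n, rF1 + rG1))
            H[:rF0, :, :rF1] = F
            H[rF0:, :, rF1:] = G
        C.append(H)
    return C

if __name__ == "__main__":
    rng = np.random.default_rng(5)
    d, n, r = 4, 5, 3
    ranks = [1] + [r] * (d - 1) + [1]
    make = lambda: [rng.standard_normal((ranks[k], n, ranks[k + 1])) for k in range(d)]
    A, B = make(), make()
    err = np.linalg.norm(tt_full(tt_add(A, B)) - (tt_full(A) + tt_full(B)))
    print("addition check, error =", err)          # machine precision
```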