key: cord-0650234-2562ltxc authors: Tong, Shanyin; Subramanyam, Anirudh; Rao, Vishwas title: Optimization under rare chance constraints date: 2020-11-11 journal: nan DOI: nan sha: 2b7fbbdd3ecfbd3e959e06b995b20eb7c6af0335 doc_id: 650234 cord_uid: 2562ltxc Chance constraints provide a principled framework to mitigate the risk of high-impact extreme events by modifying the controllable properties of a system. The low probability and rare occurrence of such events, however, impose severe sampling and computational requirements on classical solution methods that render them impractical. This work proposes a novel sampling-free method for solving rare chance constrained optimization problems affected by uncertainties that follow general Gaussian mixture distributions. By integrating modern developments in large deviation theory with tools from convex analysis and bilevel optimization, we propose tractable formulations that can be solved by off-the-shelf solvers. Our formulations enjoy several advantages compared to classical methods: their size and complexity is independent of event rarity, they do not require linearity or convexity assumptions on system constraints, and under easily verifiable conditions, serve as safe conservative approximations or asymptotically exact reformulations of the true problem. Computational experiments on linear, nonlinear and PDE-constrained problems from applications in portfolio management, structural engineering and fluid dynamics illustrate the broad applicability of our method and its advantages over classical sampling-based approaches in terms of both accuracy and efficiency. 1. Introduction. Rare high-impact events are phenomena that occur with low probability, but can have widespread-and sometimes even catastrophic-impacts on the economy, society and environment. Examples of such phenomena include cascading blackouts in the power grid, stress fractures and material failures in bridges and other civil structures, extreme oceanic turbulence leading to natural disasters like hurricanes and tsunamis, financial stock market crashes, and large-scale pandemics like the ongoing COVID-19 outbreak. It is of paramount importance to quantify the risk of such rare high-impact events in a given system, by estimating their probability of occurrence, and to actively reduce this risk by tuning the controllable properties of the system. Chance constraints provide an attractive modeling framework to mitigate the risk of such rare high-impact phenomena, since it is often difficult-or even impossible-to estimate the costs associated with the underlying events. This motivates our study of chance constrained optimization problems of the following form. Here, u represents the set of control or decision variables with domain U, J : U → R represents the cost associated with decisions u, and ξ denotes uncertain parameters that can influence the decisions via the function F . Thus, U represents deterministic constraints on our decisions that are unaffected by uncertainties, and we consider U ⊆ R m or U ⊆ L 2 (D) for some D ⊆ R m . We shall assume that the uncertain parameters ξ are modeled as random vectors with probability distribution P supported on Ξ ⊆ R n , and use P(A) to denote the probability of an event A under the measure induced by P. Therefore, F : U × Ξ → R represents the uncertaintyaffected system metric (also variously referred to as the parameter-to-observation map or limit-state function), and z is a specified upper bound for the metric F such that, for any fixed decision vector u and fixed realization ξ of the random vector, F (u, ξ) ≥ z leads to an "unsafe condition" or "bad event" that we seek to mitigate. Problem (1.1) thus seeks to determine decisions u such that the probability of the system being unsafe, P (F (u, ξ) ≥ z), remains below a certain specified risk threshold α ∈ (0, 1). We are particularly interested in the rare event regime where, for reasonable values of u ∈ U, the latter quantities-the probability of being unsafe and the risk threshold α-are very close to 0. We defer further assumptions and concrete examples to the subsequent sections. One of the main challenges in solving (1.1) is to tackle the chance constraint for extreme values of the risk threshold α 1 or the upper bound z → ∞, since the resulting probability P (F (u, ξ) ≥ z) can be very small in these cases. For general nonlinear F , classical methods approximate this probability by generating samples from the original distribution of (or some other 'trial' distribution closely related to) ξ. However, the rare nature of the event coupled with the potential nonlinearity and complexity of F render sampling-based methods generally intractable and potentially inaccurate. As Figure 1 illustrates, such samples concentrate around regions of high likelihood, and relatively few samples cover the unsafe region. Indeed, assuming that F satisfies certain convexity properties, the number of samples required to obtain a feasible solution to (1.1) is O(α −1 ) (e.g., see [9] ), which becomes impractical when α is 10 −6 or 10 −10 (say). When taking into account problem-dependent constants as well, the actual required sample size (even for modest α and Gaussian ξ) can become arbitrarily large [25, Example 1] . This is made doubly worse when these samples are included in the optimization problem, since F must be evaluated at each generated sample and this can lead to computational difficulties if F itself is complex. This motivates the development of an alternate sampling-free method to solve (1.1), whose computational complexity does not scale with extreme values of z or α. 1.1. Overview of methodology and contributions. We propose a sampling-free method to solve (1.1) by exploiting-and extending-recently developed results based on large deviation theory (LDT). Under certain assumptions on the distribution of ξ and the function F , it was shown in [15, 48] that, for fixed u and z, the rare event probability P(F (u, ξ) ≥ z) can be estimated using only local derivative information of F evaluated at the LDT minimizer : where I is the so-called rate function that depends only on the distribution of ξ. Although the method is more generally applicable, we focus our attention on the case where ξ follows a general Gaussian mixture distribution. Depending on the extent of local derivative information {ξ : F (u, ξ) ≥ z} µ Figure 1 . Drawbacks of sampling-based methods for estimating rare events. The blue dots represent samples from the original distribution P with mean µ, whereas the shaded region represents the rare event set {ξ : F (u, ξ) ≥ z} for some fixed decision vector u. Observe how very few samples fall inside the rare event set. of F that one can exploit or access, we then propose first-and second-order approximations of the rare event probability by quantifying the measure of the corresponding set bounded by the first-and second-order Taylor expansions of F at ξ , and provide easily verifiable conditions under which the proposed estimates are exact or constitute upper bounds on the true probability. However, using these estimates in place of the true probability in (1.1) amounts to a bilevel optimization problem, since the LDT minimizer (1.2) is a function of the decision variables u; to tackle this, we replace (1.2) with its optimality conditions, and elucidate when the latter are sufficient to characterize the minimizer. To demonstrate the effectiveness of our method, and its potential benefits over classical sampling-based methods, we consider a variety of applications from different domains. Specifically, via problems in financial portfolio management, structural column design and optimal control of a system of partial differential equations (PDE), we illustrate the accuracy of our probability estimates, feasibility of the obtained optimal solutions, improvement in computation time, and ability to handle complex nonlinear models. The main contributions of our work may be summarized as follows. 1. We provide explicit formulas for the first-and second-order LDT approximations of rare event probabilities when the constraint function F is a continuously differentiable function of random parameters that follow a general Gaussian mixture distribution. This generalizes existing results [48] for Gaussians, and can be used to tackle general continuous distributions by approximating them with a Gaussian mixture. We also provide an alternate characterization of the second-order LDT approximation for Gaussians that is more numerically efficient for our purposes. 2. We apply the LDT approximations to optimization problems affected by rare chance constraints, and provide reformulations that can be solved by off-the-shelf solvers. Notably, our formulations are sampling-free and scalable: their size and complexity does not scale with the rarity of the chance constraints. We also contribute easily verifiable conditions under which their optimal values are exact or provide conservative upper bounds to the true optimal values. Finally, we do not limit ourselves to linear or convex F and only require that it is twice differentiable, thus enabling fast local solutions to nonlinear and nonconvex problems. 3. We conduct computational experiments on linear, nonlinear and PDE-constrained optimization problems from several application areas to demonstrate the viability of our method and compare its performance with traditional sampling-based methods. This paper is structured as follows. After introducing some notation and basic assumptions in subsection 1.2, we provide a literature review of solution methods for chance constrained problems in Section 2. In Section 3, we provide an overview of our techniques, the large deviation results we use, as well as easily verifiable sufficient conditions under which we can establish optimality and feasibility of our formulations; notably, this section does not make specific assumptions on the nature of the distributions. In Section 4 and Section 5, we then specalize our results to Gaussian and Gaussian mixture distributions, respectively, and provide closed-form expressions of the probability estimates and optimization formulations. In Section 6, we illustrate the computational performance of our method on several applications, and finally, in Section 7, we provide concluding remarks and directions for future work. We use R + to denote the set of non-negative reals, and I n to denote the n × n identity matrix. We define [x] + := max{x, 0}, and 1 X to be the indicator function of set X: 1 X (x) is equal to 1 if x ∈ X and 0 otherwise. Given a symmetric matrix Q ∈ R n×n , we use Q 0 ( 0) to denote that it is positive definite (positive semidefinite), and in such cases, we denote the corresponding weighted inner product between x ∈ R n and y ∈ R n by x, y Q := x, Qy , and the induced norm by x Q . We denote the standard Euclidean inner product interchangeably using both x, y and x y. With a slight abuse of notation, we use the same symbol ξ to denote the random parameter vector as well as its realizations; the meaning should be clear from the context. We use P to denote the distribution of ξ, Ξ to denote its support, and P(A) to denote the probability of an event A under the measure P. Given a measurable function f (ξ), we let E(f (ξ)) denote the expectation of f (ξ). Finally, we use S : R n → R to denote the cumulant generating function: and define the rate function I : Ξ → R as the convex (Fenchel) conjugate of S: Throughout the paper, we assume that µ := E(ξ) exists and that S is continuously differentiable on R n . These are easily verified to be true whenever P is a Gaussian (or a mixture of Gaussians). In particular, when working with multivariate Gaussian parameters in R n , we use ξ ∼ N (µ, Σ) to denote that ξ is a multivariate Gaussian with mean µ ∈ R n and covariance matrix Σ 0, and ξ ∼ M i=1 w i N (µ i , Σ i ) to indicate that ξ follows a Gaussian mixture distribution with M components, where component i ∈ {1, . . . , M } is a Gaussian with mean µ i , covariance Σ i 0 and weight w i > 0 such that M i=1 w i = 1. Finally, we use Φ to denote the cumulative distribution function (CDF) of the standard normal distribution N (0, 1). Regarding the structure of F , we shall assume that F (·, ξ) is twice continuously differentiable on U for every ξ ∈ Ξ and F (u, ·) is twice continuously differentiable on closure of Ξ for every u ∈ U. In addition, we assume that there exist ∅ = U 0 ⊂ U, z 0 < ∞ and K 0 > 0 such that, for all u ∈ U 0 and all z ∈ [z 0 , ∞), we have F (u, µ) < z 0 , Ξ(u, z) := {ξ ∈ Ξ : F (u, ξ) = z} = ∅ and ∇ ξ F (u, ξ) ≥ K 0 > 0 for all ξ ∈ Ξ(u, z). We note that the existence of u ∈ U and x in a neighborhood of z for which Ξ(u, x) = ∅ is not restrictive; indeed, if one has F (u, ξ) < z for all ξ ∈ Ξ and u ∈ U, then (1.1) is trivially feasible for all u ∈ U (i.e., the chance constraint can be ignored), and similarly it is trivially infeasible if F (u, ξ) > z for all ξ ∈ Ξ and all u ∈ U. The assumption ∇ ξ F (u, ξ) > 0 ensures that the hypersurface Ξ(u, x) ⊂ Ξ does not degenerate and remains smooth and simply connected, and that the minimization problem defined in (3.6) is always feasible for any u ∈ U 0 . We defer further assumptions to the sections where they are used. Finally, note that we only consider the case of a single individual chance constraint. The extension to multiple individual chance constraints is straightforward and we do not present it for ease of exposition. On the other hand, the extension to joint chance constraints where F : U × Ξ → R p and z ∈ R p , with p > 1 is not straightforward. In such cases, one can either resort to a Bonferroni approximation (i.e., union bound) to convert the joint chance constraint into p individual chance constraints, or apply our method to a smooth approximation of the function φ(x, u) = min{F 1 (u, ξ) − z 1 , . . . , F p (u, ξ) − z p } (e.g., see [41] ). 2. Literature review. Chance constrained problems have been intensely studied for over five decades [43, 47] . They are generically NP-hard since the set of feasible decisions is usually nonconvex, and admit tractable reformulations only in restrictive settings (e.g., when F is bi-affine and ξ follows a normal distribution [43] ). Therefore, the majority of existing research takes one of two approaches: (1) exploit or assume some special structure in F or the distribution of ξ, including linear or convex F [32, 38] , finitely supported distributions of ξ [35] , and elliptically symmetric Gaussian-like distributions of ξ [52] , or (2) resort to some form of sampling to approximate the chance constraint; we discuss these approaches in subsection 2.1. In subsection 2.2, we discuss connections of our method to so-called reliability-based methods that are prevalent in engineering design. 2.1. Sampling-based methods. Importance sampling is the most commonly used sampling method to estimate rare event probabilities [30] . These methods belong to the class of variance reduction techniques, and sample from a (typically problem-specific) biasing distribution that is constructed so that probability mass is concentrated in the region of interest (e.g., see [3, 8, 20] ). Most of these methods construct the biasing distribution for a fixed decision vector u. Therefore, they are not directly applicable to chance constrained optimization, where u may change across algorithmic iterations. The quality of a single biasing distribution may not be uniform over these iterations, and this in turn may warrant the construction of multiple such distributions, albeit at the expense of additional computational effort. Nevertheless, existing methods for rare chance constrained optimization are also based on variants of importance sampling [4, 45] . For instance, [4] constructs an importance sampling distribution that is uniformly efficient across the entire decision space; however, this is achieved by making the (restrictive) assumption that the components of ξ are mutually independent random variables. Given N independent samples or realizations, ξ 1 , . . . , ξ N , that have been obtained via importance sampling (or some other method), the probability, P (F (u, ξ) ≥ z) in (1.1), can then be approximated via one of two broad classes of approaches which we discuss in subsection 2.1.1 and subsection 2.1.2. Examples of generic sampling-based solution methods that do not fall into these classes can be found in [12, 41] . An alternative to importance sampling is quasi Monte Carlo sampling of the unit hypersphere that arises in the spherical-radial decomposition method for elliptically symmetric distributions, which we discuss in subsection 2.1.3. However, none of these approaches can be expected to work well in general for rare chance constraints, since the required number of samples can be O(α −1 ) in the worst case. Indeed, even if we manage to reduce the required number of samples using importance sampling or by spherical radial decomposition, the former still entail modeling additional (at least O(N )) variables and constraints, whereas the latter entail high per-iteration cost of constraint and gradient evaluations in the optimization problem, which can make its solution computationally difficult. [34, 40] : The discontinuous indicator function 1 [z,∞) is then either exactly reformulated using O(N ) binary variables (e.g., see [35] ) resulting in a mixed-integer nonlinear optimization problem, or approximated using continuous functions to avoid introducing binary variables. Examples of the latter include difference-of-convex approximations [27] , sigmoidal or other smooth approximations [1, 10, 23] as well as convex outer approximations [38] . An example of a smooth approximation (that we also compare in our numerical experiments) is the sigmoidal approximation proposed in [10] . It is based on the observation that, for any ν, τ > 0, one can construct an outer-approximation of the indicator function that is asymptotically optimal as ν, τ → ∞. This allows us to obtain the following approximation to (1.1): Another class of general outer approximations was proposed in [38] by using any nondecreasing convex function ψ : R → R + satisfying ψ(0) = 1 and ψ(x) > 1 for all x > 0. In particular, by noting that 1 [z,∞) (x) ≤ ψ t −1 (x − z) is satisfied for any such function and scalar t > 0, it can be shown that satisfaction of the inequality, inf t>0 tE ψ t −1 (x − z) − tα ≤ 0, implies satisfaction of the probability constraint in (1.1). Under the choice ψ(x) = [1 + x] + , the above inequality is equivalent to CVaR 1−α (F (u, ξ) − z) ≤ 0, where, for any measurable function f (ξ) and any α ∈ (0, 1), CVaR 1−α (f (ξ)) := inf t∈R t + α −1 E [f (ξ) − t] + is the conditional-value-at-risk (CVaR) of the random variable f (ξ). By approximating the expectation in the expression for CVaR with a sample average, we obtain the following conservative approximation to (1.1): The CVaR approximation has the advantage of preserving any convexity structure in F (with respect to u), and so we use it as a benchmark in our numerical experiments. An alternative approach is to enforce the constraints deterministically for each of the N realizations of ξ: F (u, ξ i ) ≤ z, for all i ∈ {1, . . . , N }. As in the case of the CVaR approximation, this preserves any convexity structure in F and the hope is that for sufficiently large N , optimal solutions of the scenario approximation will be feasible in (1.1) with high probability. Under the assumption that F (·, ξ) is convex for all ξ ∈ Ξ, the work [9] shows that choosing N = O(mα −1 log(α −1 )) suffices, where m is the number of decision variables u. This size can be prohibitively large for small values of α. Therefore, [5, 37] consider extensions of this approach to reduce the required sample size by making further assumptions on F and the distribution of ξ. 2.1.3. Spherical-radial decomposition. Generic Monte Carlo sampling over the entire sample space can be inefficient when dealing with rare chance constraints, since the majority of samples will lie outside the rare event set. The spherical-radial decomposition (SRD) is an alternative approach that can achieve significant variance reduction whenever ξ follows an elliptically symmetric distribution. For example, when ξ ∼ N (µ, Σ) follows a Gaussian distribution with a Cholesky factorization Σ = LL , one can write ξ = µ + rLζ, where ζ is a random direction with uniform distribution ν on the unit sphere S n−1 := {x ∈ R n : x = 1}, and r is the random magnitude of that direction following a chi-distribution with degree n. This allows us to express P( where m χ,n is the probability measure of the chi-distribution with degree n, R(u, ζ) = {r > 0 : F (u, µ + rLζ) ≤ z} are the magnitudes of direction ζ which intersect the complement of the rare event set {ξ : F (u, ξ) ≤ z}, and ζ 1 , . . . , ζ N are uniform samples on the unit sphere S n−1 (e.g., obtained via quasi Monte Carlo sampling). Notably, [51, 52] have shown that the same set of samples can also be used to compute the gradient (with respect to u) of P(F (u, ξ) ≤ z). The crucial step in SRD is multiple root-finding to determine R(u, ζ), which can become computationally expensive whenever explicit closed-form solutions of the roots (of F (u, µ + rLζ) = z in r for fixed u and ζ) may not be available. This is especially true in problems where F depends nonlinearly on ξ (for e.g., via a PDE) and a good initial guess for r may not be readily available. Moreover, this has to be repeated for each sampled direction ζ and each candidate decision u that will change over optimization iterations. Nevertheless, the method has found applications in several problems, including those involving joint chance constraints, where F may be a vector-valued function. Successful applications include the probabilistic capacity maximization of gas networks [24] and optimal Neumann boundary control of a vibrating string with uncertain initial data and probabilistic terminal constraints [22] . Reliability-based methods. In engineering, the structural reliability of buildings or bridges is assessed by solving a probability estimation problem similar to the one we consider, and reliability is enforced as part of the structural design by solving a so-called inverse reliability problem [17] or reliability-based design optimization problem [50] that is similar in spirit to the chance constrained problem (1.1). These methods approximate the distribution of ξ using a (standard) multivariate normal distribution, and then compute the so-called most probable point (i.e., the point with maximum probability density) over the failure region {ξ ∈ Ξ : F (u, ξ) ≥ z} (e.g., see [19] ). The distance of the most probable point from the origin is referred to as the reliability index, which is not only used for estimating probabilities, but also explicitly constrained as part of the optimal design problem. In particular, the probability estimates are obtained by employing first-or second-order Taylor expansions of the limit-state function around the most probable point and then integrating the Gaussian densities over the set bounded by the Taylor approximations, resulting in the so-called first-and second-order reliability methods (FORM and SORM), respectively [19, 46] . In contrast to a chance constraint on the limit-state function F , the reliability-based design problem explicitly constrains the reliability index and is solved using specialized procedures; e.g., see [33, 50] . These methods share similarities with our LDT approach, especially for Gaussian distributions. In particular, both methods are sampling-free, and connect the probability estimations to an optimization problem. A crucial difference is that whereas reliability-based methods focus on finding the point with maximal likelihood, our LDT approach uses the point which characterizes the asymptotic behavior of the rare event probability, and thus includes information beyond just local probability densities. Another caveat is that reliability-based methods typically require transformation of the distribution to a standard normal, which can be difficult for a general high-dimensional distribution, whereas our methods are generally applicable for distributions with well-defined rate functions, including Gaussian mixtures that can, in limit, approximate arbitrary continuous distributions. Finally, an important conceptual difference lies in the formulation of the optimization problem: whereas reliability-based methods use the reliability index as a surrogate for the probability to enforce constraints, we treat the more traditional chance constrained formulation (1.1), and use the LDT minimizer along with local derivative information directly to approximate the probability in the chance constraint. 3. Large deviation theory for rare chance constraints. Large deviation theory (LDT) [16, 49, 53] is concerned with the asymptotic behavior of tails of probability distributions, and specifically the rates of exponential decay of probability measures of extreme events. Classical LDT enables characterizing the asymptotic behavior of rare probabilities, such as the one in (1.1), with respect to some small (or large) parameter associated with the random variable ξ, while keeping z fixed. It states that the behavior of the probability with respect to that small (or large) parameter is dominated by a decaying exponential term that can be characterized by the rate function of the random variable. Recently, [15] adapted classical theory and sharp asymptotics based on the dominating point of large deviations [6, 7, 29, 39] to study the asymptotic behavior with respect to z → ∞, while keeping the distribution of ξ fixed. The key idea is to find a dominating point in the rare event set by solving an appropriately defined optimization problem and estimating probabilities solely using this dominating point. In [48] , it was shown that improved estimates can be obtained by exploiting Gaussianity as well as local approximations of F around this dominating point. The approach has been successfully applied to quantify rare probabilities of extreme oceanic phenomena [13, 14, 48] . In the following subsections, we present an overview of this approach and show how it can be used for rare chance constrained optimization. Specifically, in subsection 3.1, we first reduce the problem of estimating the measure of the set bounded by (the potentially nonlinear) F to estimating the measure of the set bounded by its first-and second-order Taylor expansions around a dominating point. In subsection 3.2, we provide a specific choice of the dominating point based on large deviation arguments; and finally, in Section subsection 3.3 we incorporate these probability estimates in the chance constrained problem. 3.1. Taylor approximation of nonlinear limit state function. For a fixed choice of u ∈ U and z ∈ R, a key challenge in computing the probability in (1.1), P (F (u, ξ) ≥ z) is the potential nonlinearity of F with respect to ξ. A simple way to tackle this is to construct firstand second-order Taylor approximations of F around some point ξ ∈ Ξ, We can then compute first-and second-order probability estimates, P 1 and P 2 of P (F (u, ξ) ≥ z), by computing the measure of the sets bounded by the corresponding Taylor approximations: It is important to note that this approach approximates the nonlinear (in ξ) chance constrained problem (1.1) with a linear and quadratic (in ξ) chance constrained problem, respectively. In particular, it allows us to exploit special structure (if any) in the distribution of ξ to efficiently compute the first-and second-order approximations in (3.2). We provide examples of this computation for standard Gaussian and mixture distributions in the subsequent sections. First, we make the immediate observation that P 1 provides a conservative approximation to the true probability whenever F is concave in the uncertainty ξ. Proposition 3.1 (First-order probability estimate for concave functions). Fix u ∈ U. Suppose that Ξ is convex and the function F (u, ·) is concave on Ξ. Then, P (F (u, ξ) ≥ z) ≤ P 1 (u, z, ξ ) for any z ∈ R and ξ ∈ Ξ. Proof. Since F (u, ·) is continuously differentiable (refer to subsection 1.2), it is concave on Ξ if and only if Therefore, we have the relationship i.e., the rare event set is included in the half-space bounded by F 1 (u, ξ; ξ ) = z. Therefore, , and the claim follows. level sets of I(ξ) Figure 2 . Illustration of first and second-order approximations of probability using Taylor expansions at the LDT minimizer ξ , which is the solution to (3.3) . The blue dotted lines are level sets of the rate function I(·), and the red shaded region is the extreme event set {ξ : F (u, ξ) ≥ z}. The normaln aligns with the gradients ∇ ξ F (u, ξ ) and ∇I(ξ ). The figure is adapted from [48] . Although the Taylor approximations of F provide a tractable means to estimate the true probability, their accuracy depends critically on the choice of ξ . In the following, we invoke large deviation results from [15] to justify a choice of ξ that is asymptotically optimal as z → ∞. The formal statement of the result needs additional assumptions on the distribution of ξ and the structure of F . The procedure for choosing ξ based on this result, and the first-and second-order approximations of F at this point are shown in Figure 2 . Theorem 3.2 (Theorem 2.1 in [15] ). Fix u ∈ U 0 . Suppose the following assumptions hold: (A1) The optimization problem min ξ∈Ξ {I(ξ) : F (u, ξ) ≥ z} has a unique global minimizer for all z ∈ [z 0 , ∞). Therefore, we can define: (A2) The function ξ in (3.3) is continuously differentiable, and there exists K 0 > 0 such that the rate function I(ξ (u, ·)) in (1.4) is strictly increasing with (A3) The following relation holds for all z ∈ [z 0 , ∞): Then, the following result holds. Theorem 3.2 implies that under its stated assumptions, the rare event probability in (1.1) satisfies the asymptotic equality where indicates that the ratio of logarithms of both sides tends to 1 as z → ∞, For large but finite values of z, the right-hand of (3.6) can therefore be expected to serve as a good approximation of the true probability. This justifies the LDT minimizer ξ (u, z) defined in (3.3) as the ideal point around which one should Taylor approximate the nonlinear F . We now review the assumptions leading to the result of Theorem 3.2 in detail, with the goal of simplifying them. Although not explicitly stated in [15] , assumption (A1) is necessary to ensure that the LDT minimizer in (3.3) is well defined. Assumption (A2) formally states that the event {F (u, ξ) ≥ z} becomes rare as z → ∞, whereas assumption (A3) requires that ξ is a so-called dominating point [39] : that is, the rare event set {ξ ∈ Ξ : F (u, ξ) ≥ z} is completely enclosed in the halfspace that is tangent to the hypersurface Ξ(u, z) at ξ . Finally, Assumption (A4) is an equivalent rephrasing of Assumption 5 in [15] . Specifically, it ensures that the true rare event probability does not decay much faster than exp(−I(ξ (u, z))) as z → ∞; i.e., that the right-hand side of (3.5) is a lower bound on its left-hand side expression. The assumptions leading to the result of Theorem 3.2 may not always be straightforward to verify. With a goal of obtaining easily verifiable conditions, the following theorem provides alternative sufficient conditions that lead to the same large deviation result as Theorem 3.2. Then, the following inequality is valid: Additionally, suppose that the following conditions also hold: where B(r,ξ) ⊂ Ξ denotes a closed ball of radius r that containsξ ∈ Ξ. Then, Equation (3.5) holds. Proof. To show the first part, we will show that condition (C1) implies assumptions (A1), (A2) and (A3) stated in Theorem 3.2. The proof for the validity of (3.7) then follows from the proof of Theorem 2.1 in [15] . By a similar argument, we will show that the conditions taken together also imply assumption (A4). The proof that the left-hand side of (3.7) is ≥ to its right-hand side then follows from similar arguments as Theorem 2.1 in [15] . Verification of (A1). Since the cumulant generating function S is finite on R n (see subsection 1.2), [42, Theorem 2] implies that it is strictly convex and twice continuously differentiable. Therefore, the rate function I, which by definition (1.4) is the convex conjugate of S, satisfies the following properties: (i) I is strictly convex and twice continuously differentiable on R n [26, Corollary 4.2.10], and (ii) I is coercive [44, Theorem 11.8] , i.e., I(ξ) → +∞ as ξ → ∞. Together with the concavity of F from condition (C1), these properties imply that min ξ∈Ξ {I(ξ) : F (u, ξ) ≥ z} is a convex optimization problem (which is always feasible for z ∈ [z 0 , ∞), see subsection 1.2) with a finite optimal value that is attained by a unique global minimizer ξ (u, z) defined in (3.3) . Notably, this implies assumption (A1). Verification of (A2). We show the implication of assumption (A2) in two parts. • First, note that continuity of F (u, ·) implies that sup ξ∈Ξ {F (u, ξ) : ξ ≤ r 0 } < ∞ for all finite r 0 > 0. This, in turn, implies that any point in the set {ξ ∈ Ξ : F (u, ξ) ≥ z} must satisfy ξ → ∞ as z → ∞. In particular, we have ξ (u, z) → ∞ and from the coercivity of I established in (ii) above, this implies that I(ξ (u, z)) → ∞. • Second, for any z ∈ [z 0 , ∞), there always exists a point ξ ∈ Ξ = R n satisfying F (u, ξ) > z, see subsection 1.2. This implies that the convex problem min ξ∈Ξ {I(ξ) : F (u, ξ) ≥ z} satisfies Slater's constraint qualification, and strong Lagrangian duality holds: For sufficiently large values of z, the maximizer in the right-hand side of the above expression is always attained at some λ ≥ λ 0 > 0; indeed, if it is attained at λ = 0, then I(ξ (u, z)) = min ξ∈Ξ I(ξ) < +∞ which would contradict I(ξ (u, z)) → ∞. Therefore, as a function of z, the right-hand side of the above expression, and hence I(ξ (u, z)), is convex and strictly increasing. Moreover, ∇I(ξ (u, z)) = λ ∇ ξ F (u, ξ) ≥ λ 0 K 0 > 0, (Note that, by the same reasoning, we also have F (u, ξ (u, z)) = z which will be used in the next paragraph). Verification of (A3). This follows from the definition of concavity in condition (C1). Verification of (A4). Let L > 0 denote the Lipschitz constant in condition (C2). Consider now the closed ball B(r, ξ (u, z)) of radius r = K 0 L −1 > 0 centered at ξ (u, z) + rn(u, z), where K 0 was defined in subsection 1.2, andn(u, z) is the unit vector along ∇ ξ F (u, ξ (u, z)). We claim that B(r, ξ (u, z)) ⊆ Λ(u, z, 2r), where Λ is defined in assumption (A4). If true, then observe that, the fact that ξ (u, z) → ∞ as z → ∞ (established above), along with condition (C3) evaluated atξ = ξ (u, z) implies (A4) is satisfied with K 1 = 2K 0 L −1 > 0. To show B(r, ξ (u, z)) ⊆ Λ(u, z, K 1 ), we proceed as follows, dropping the dependence on u and z for simplicity of exposition. with its twice differentiability implies that ∇ 2 ξ F (u, ξ) −LI n holds for all ξ ∈ Ξ. Along with Taylor's theorem, this implies Thus, we have also verified assumption (A4). Unlike Theorem 3.2, it is crucial to note that the conditions in Theorem 3.3 either depend only on the structure of F or on the measure P, and hence, are relatively easier to verify in general. For example, Theorem 5.1 in Section 5 shows that when P is a Gaussian or mixture of Gaussian distributions, then condition (C3) is automatically satisfied. 3.3. Incorporating large deviation estimates in the chance constrained problem. We can now incorporate in the chance constrained problem (1.1), the probability estimates obtained using the first-and second-order Taylor expansions of F that we introduced in subsection 3.1, around the dominating point ξ from subsection 3.2. Depending on whether we use the first-(k = 1) or second-order (k = 2) approximation, we obtain the following conceptual problem: Observe that, since the dominating point depends on the choice of the decision variables u, the resulting problem is a generic nonlinear bilevel optimization problem. Moreover, even though we use Taylor approximations of F in constructing P k , see (3.2) , the lower-level problem that characterizes the dominating point ξ utilizes the original nonlinear function. This is crucial for the validity of the asymptotic results in Theorem 3.2 and Theorem 3.3. To tackle the bilevel structure, we replace the lower-level problem with its first-order optimality conditions. Although it is straightforward to consider the extension where the support Ξ = {ξ ∈ R n : G j (ξ) ≤ 0, j ∈ {1, . . . , J}} has an explicit algebraic description in terms of inequalities, we assume that Ξ = R n for simplicity of exposition. We have the following single-level approximation of the chance-constrained problem (1.1): Note that we have replaced F (u, ξ ) ≥ z with equality. Doing so relies on an implicit (yet mild) assumption that F (u,ξ) < z, whereξ := argmin ξ∈Ξ I(ξ), and we provide a formal argument in Theorem 3.4 below. In any case, note thatξ is unique and always exists (see proof of Theorem 3.3), and continuity of F implies that this assumption is expected to hold for sufficiently large values of z. It is important to note that this argument does not rely on any convexity assumptions on F . Finally, we note that the rate function I may not always be explicitly available depending on the distribution of ξ. In such cases, we can instead rely on the fact that I is the conjugate of the cumulant generating function S, for which we typically have efficient statistical codes that can compute its values and gradients for several distributions. Indeed, from the definition of the rate function (1.4) and convexity and differentiability of S (see subsection 1.2), we have It is straightforward to verify that we then have ∇I(ξ ) = η . We then obtain the following single-level approximation of the chance-constrained problem (1.1): We establish some properties of the single-level approximation (3.9) or equivalently (3.11). Theorem 3.4 (Properties of the LDT chance-constrained optimization). Suppose that Ξ = R n and F (u,ξ) < z for all feasible solutions u ∈ U of (1.1) and (3.8), whereξ := argmin ξ∈Ξ I(ξ). Then, the following statements are true. 1. If (u, ξ ) is feasible in the bilevel problem (3.8), then there exists λ ∈ R + such that (u, ξ , λ) is feasible in the single-level problem (3.9). 2. Under condition (C1), the converse is also true: if (u, ξ , λ) is feasible in (3.9), then (u, ξ ) is feasible in (3.8). Therefore, (3.9) is an exact single-level reformulation of (3.8). 3. Under condition (C1) and k = 1, the first-order LDT problem (3.9) is a conservative approximation of the chance constrained problem (1.1) for any z ∈ R; that is, if (u, ξ , λ) is feasible in and second-order (k = 2) LDT problems (3.9) are in one-one correspondence with those of the chance constrained problem (1.1) and therefore, the former constitutes an asymptotic reformulation of the latter, as z → ∞. Proof. We consider each of the claims separately. Proof of claim 1. From subsection 1.2, we know that ∇ ξ F (u, ξ) > 0 for all u ∈ U 0 . Therefore, all local minimizers of the lower-level problem in (3.8) satisfy the linear independence constraint qualification. Therefore, if (u, ξ ) is feasible in (3.8), then ξ must satisfy the first-order optimality conditions of the lower-level problem for some λ ∈ R + . If λ = 0, then the stationarity condition ∇I(ξ ) = λ∇ ξ F (u, ξ ) would imply that ξ =ξ; the primal feasibility of the lower-level problem then implies that F (u, ξ ) = F (u,ξ) ≥ z which leads to a contradiction since F (u,ξ) < z by construction. Therefore, we must have λ > 0 and complementarity slackness then requires that F (u, ξ ) = z. Proof of claim 2. Under condition (C1), the lower-level problem in (3.8) satisfies Slater's constraint qualification, see also the proof of Theorem 3.3. Therefore, its first-order optimality conditions are necessary and sufficient. Proof of claim 3. This is a direct consequence of Proposition 3.1 and claim 2. Proof of claim 4. Suppose that (u, ξ , λ) is feasible in (3.9). Then, under conditions (C1), (C2) and (C3), Theorem 3.3 ensures that this ξ is precisely the unique global minimizer appearing in (3.3) . Moreover, for this u and ξ , the measure of the rare event set P ({ξ ∈ Ξ : F (u, ξ) ≥ z}) exp(−I(ξ )) as z → ∞. An application of the same theorem to F k (u, ξ; ξ ) (see (3.1a) and (3.1b)) instead of F , must require that ξ again be the unique global minimizer appearing in (3.3) ; this is because of ∇ ξ F (u, ξ ) = ∇ ξ F k (u, ξ ; ξ ) (by construction) along with the strict convexity of the rate function I. Therefore, again we have P ({ξ ∈ Ξ : F k (u, ξ; ξ ) ≥ z}) exp(−I(ξ )) as z → ∞. The claim now follows from claim 2. We remark that the single-level reformulations (3.9) and (3.11) are conceptual in the sense that they require an instantiation of the probability estimate P k as well as gradients of the rate function I and/or the cumulant generating function S. In Section 4 and Section 5, we provide explicit single-level formulations in the case where ξ is a Gaussian and a mixture of Gaussians, respectively. 4. Explicit formulations for Gaussian random parameter. Throughout this section, we shall assume that ξ ∼ N (µ, Σ) follows a Gaussian distribution with mean µ and covariance Σ 0. Our goal will be to provide explicit formulations of the first-and second-order LDT chance constrained optimization problems (3.9). The moment generating function of ξ is exp η µ + 1 2 η Ση , and therefore, its cumulant generating function, S(η) = η µ + 1 2 η Ση, is quadratic. Its rate function can be computed explicitly through the Legendre transform of S, and is also quadratic: Observe that I(ξ) is, up to a normalization constant, the negative log-probability density of N (µ, Σ). Hence, for a Gaussian distribution, minimizing the rate function over some domain, such as in (1.2), is equivalent to maximizing the probability density function over that domain. We note, however, this is a special property of Gaussians, and it does not hold in general; indeed, the next section shows that this property no longer holds for non-Gaussian parameters. First-and second-order probability estimates. For ξ ∼ N (µ, Σ), the probability estimates in (3.2) are Gaussian integrals over a half-space (k = 1), or a set bounded by a quadric (k = 2). The explicit formulas for P k can be computed by first splitting the respective domains into a normal and orthogonal component (with respect to ∇ ξ F (u, ξ )), and then by computing the integral along each direction directly [48] . In the remainder of this discussion, we fix u and ξ and assume that they satisfy F (u, ξ ) = z. The first-order approximation is a Gaussian integral over a half-space bounded by a plane perpendicular ton := ∇ ξ F (u, ξ )/ ∇ ξ F (u, ξ ) , and passing through ξ . It is straightforward to verify that this integral degenerates to a one-dimensional Gaussian integral and can be computed using the dominating point ξ as follows: The second-order probability estimate P 2 is slightly more complicated, but can still be approximated using an analytical expression. Following [48] , we first approximate the surface F 2 (u, ξ; ξ ) = z by a paraboloid with axis of symmetryn at ξ , and curvatures from ∇ 2 ξ F (u, ξ ) in the directions on the orthogonal plane. P 2 (u, z, ξ ) can then be approximated as the measure of the set bounded by that paraboloid, which turns out to be equal to P 1 (u, z, ξ ) multiplied by a correction term. This correction term can be computed either through principle curvatures of F at ξ [19] , or by using the eigenvalues of the rotated covariancepreconditioned Hessian of F at ξ ; see [48] for details. In this paper, we introduce another way to compute this correction term, which turns out to be more numerically tractable for optimization purposes. Theorem 4.1 (Second-order probability estimate for Gaussians). Suppose that ξ ∼ N (µ, Σ), u ∈ U and ξ ∈ Ξ satisfy F (u, ξ ) = z, and that the following matrix is invertible: Then the second-order approximation of the probability P(F (u, ξ) ≥ z) is given by: where the covariance-preconditioned normaln = and orthogonal determinant (4.4) det ⊥n H = n H −1n det H. Proof. Following the proof of Theorem 4.4 in [48] , the second-order probability approximation can be shown to be equal to: where C(u, ξ ) is a correction term computed through the curvatures of F at ξ . To define it, we first denote S(ϑ) to be the Lebesgue measure on the orthogonal component ϑ ∈ {ϑ ∈ R n : n, ϑ = 0} ofn at ξ . It can then be shown that where we have used ∇I(ξ ) = Σ −1 (ξ −µ) along with the definition of H. The above expression can be further simplified by defining the unit vector: Then, for all ξ ∈ R n , there exist s ∈ R and ϑ ∈ {ϑ ∈ R n : n, ϑ = 0} such that ξ = ϑ + st. From Fubini's theorem and the orthogonality betweenn and ξ, we have Applying the Gauss integral formula to both sides of (4.6), we obtain ϑ∈R n : n,ϑ =0} e − 1 2 ϑ,Hϑ dS(ϑ). Plugging in the definition oft in (4.5), we obtain (2π) − n−1 2ˆ{ ϑ∈R n : n,ϑ =0} e − 1 2 ϑ,Hϑ dS(ϑ) = n,t −1 t , Ht This establishes the statement of the theorem. Note that the expression (4.3) in Theorem 4.1 is not the same as P 2 in (3.2), and only provides an approximation of the latter. Nevertheless, under the assumption that ξ is the LDT minimizer in the lower-level problem of (3.8), it is asymptotically equal to (3.2) as z → ∞. Crucially, it has the added advantage of being computable in closed-form. An application of the approximations in (4.2) and (4.3) to (3.9) yield the following first-(k = 1) and second-order (k = 2) LDT approximations of the chance-constrained problem (1.1), respectively: Specifically, we replace the first-order probability constraint P 1 (u, z, ξ ) ≤ α as follows: Similarly, we replace the second-order probability constraint P 2 (u, z, ξ ) ≤ α as follows: where we have used the fact that We close this section with a few important remarks. First, note that since all of the functions involved in the closed-form expressions (4.8) and (4.9) are fairly standard, we can use standard automatic differentiation routines to compute their function values and gradients. Notably, this allows us to use off-the-shelf solvers for the solution of (4.7). Second, the matrix H = I n − λΣ 1 2 ∇ 2 ξ F (u, ξ )Σ 1 2 appearing in (4.9) is guaranteed to be positive definite (and hence, invertible) whenever the local solutions of the lower-level problem in (3.8) satisfy the sufficient second-order optimality conditions. Although the solutions of (4.7) cannot guarantee this in general, we have nonetheless empirically found in our numerical experiments that the linear system Hx =n was almost always solvable; in such cases, we can equivalently reformulate (4.4) asn x det(H) and evaluate its function values and gradients. Third, and finally, although the above derivations are for Gaussian random parameters, the probability estimates (4.2) and (4.3) as well as the resulting LDT chance-constrained problem (4.7) can be also used for random parameters that can be transformed to a Gaussian distribution. Specifically, if for a random parameter ζ, there exists an invertible function f , such that f (ζ) is Gaussian, then we can use ξ = f (ζ) instead as the random parameter and apply the above results to ξ while involving f −1 in F . For example, when ζ follows a lognormal distribution, we can use ξ = log(ζ) as the random parameter, while replacing ζ with exp(ξ) in F . We discuss these techniques further in subsections 6.1 and 6.2. Explicit formulations for Gaussian mixture random parameter. We now generalize the approach from the last section for non-Gaussian random parameters to provide explicit formulations of the first-and second-order LDT chance constrained optimization problems (3.9). Specifically, we assume that ξ ∼ M i=1 w i N (µ i , Σ i ) follows a Gaussian mixture distribution with M components, where w i > 0 and M i=1 w i = 1. Here, we use P i = N (µ i , Σ i ) to denote the distribution of the i th component with mean µ i and covariance Σ i 0. Note that P is a convex combination of P i , i ∈ {1, . . . , M }: Unlike in the case of Gaussians, the rate function is no longer equal to the negative log-density of ξ, and is available only implicitly. We first compute the cumulant generating function using its definition as follows: Since the rate function I is the Legendre transform of S, it is non-trivial to compute it explicitly in this case. Nevertheless, as noted in Section 3, we can still compute the LDT minimizer ξ by solving (3.10) instead: Moreover, even though the rate function is not available in closed form, the following theorem establishes that Gaussian mixture distributions (and hence, Gaussian distributions) always satisfy condition (C3) in Theorem 3.3. Therefore, the conditions of Theorem 3.3 depend only on the behavior of the constraint function F with respect to the uncertain parameters ξ, and not on the form of the mixture distribution. Proof. We shall show that (a) the cumulant generating function S is strongly convex, and (b) the rate function I satisfies I(ξ) ≥ min j∈{1,...,M } I j (ξ) for all ξ ∈ R n , where I j (ξ) = 1 2 ξ − µ j 2 Σ −1 j denotes the rate function of the j th Gaussian N (µ j , Σ j ), see (4.1). Assume for the moment that the above statements are true, and let σ > 0 denote the strong convexity parameter of S in (a). Since the rate function (1.4) is the convex conjugate of S, strong convexity of S implies that the gradient of the rate function ∇I(ξ) is Lipschitz continuous with constant σ; that is, Now, fix r > 0 andξ ∈ R n , and letξ be the (unique) minimizer of the integrand in (C3) over B(r,ξ). Since for the given Gaussian mixture distribution P, (see Section 5) Combining the above with (5.4), we obtain B(r,ξ) where we have used the definition of the j th component rate function along with the fact thatξ,ξ ∈ B(r,ξ) =⇒ ξ −ξ ≤ r. Claim (b) implies that there exists some k ∈ {1, . . . , M } such that I(ξ) ≥ I k (ξ). Therefore, we have shown that Also, we havê where the equality can be verified easily (e.g., see the proof of Theorem 2.1 in [15] ). Therefore, the left-hand side expression in condition (C3) is bounded away from 0 by a constant that is independent ofξ, whereas its right-hand side → ∞ as ξ → ∞ (see proof of Theorem 3.3). This proves the statement of this theorem. It remains to prove claims (a) and (b). To show (a), let S j (η) = η µ j + 1 2 η Σ j η denote the cumulant generating function of the j th Gaussian N (µ j , Σ j ) . Then, the cumulant generating function of the Gaussian mixture S(η) = log M j=1 w j exp(S j (η)) , see (5.2) . A straightforward calculation shows that its Hessian satisfies where the first inequality follows from the fact that w j ≥ 0, M j=1 w j = 1, the first equality follows from the monotonicity of exp, the second and final equalities follow from the definition of the convex conjugate, and the third equality follows because we can interchange the order of the max over j with the max over ξ. Finally, the order reversing property of the conjugate f * ≤ g * =⇒ f ≥ g, implies that I(ξ) ≥ min j∈{1,...,M } I j (ξ) which is claim (b). First-and second-order probability estimates. Recall from subsection 3.1 that the first-order probability estimate P 1 in (3.2) is the measure of the half-space bounded by F 1 (u, ξ; ξ ) = z, where ξ solves (5.3). Since a Gaussian mixture is just a convex combination of Gaussians, we can exploit the analogous result for the latter (4.2) to compute the first-order estimate analytically. N (µi, Σi) . The left figure shows the procedure for the first-order approximation, namely of finding a point on the hyperplane F1(u, ξ; ξ ) = z where the contour of negative log-density of N (µi, Σi) is tangential to the plane, by solving (5.7) . Similarly, the right figure shows the procedure for the second-order approximation, by findingξi on F2(u, ξ; ξ ) = z by solving (5.9). Theorem 5.2 (First-order probability estimate for Gaussian mixtures). Suppose that ξ ∼ M i=1 w i N (µ i , Σ i ) and for some fixed u ∈ U, ξ is the solution to (5.3). Then the first-order approximation of the probability P(F (u, ξ) ≥ z) is given by Proof. Substituting (5.1) in (3.2) for k = 1, we obtain: Therefore, we can consider each P i ({ξ : F 1 (u, ξ; ξ ) ≥ z}), i ∈ {1, . . . , M }, separately. Similar to (4.2) for Gaussians, this quantity is a Gaussian integral over a half-space. However, unlike the latter, the contours of negative log-density of N (µ i , Σ i ), i.e., of 1 necessarily be tangential to the half-space F 1 (u, ξ; ξ ) = z at ξ . Therefore, to compute P i ({ξ : F 1 (u, ξ; ξ ) ≥ z}), we need to find the pointξ i on the hyperplane F 1 (u, ξ; ξ ) = z where the contour of 1 2 is tangential to the plane (see Figure 3 ): where we used the definition of the first-order Taylor expansion (3.1a) and the fact that ξ satisfies F (u, ξ ) = z since it solves (5.3). The optimality conditions of (5.7) imply that there exists λ i ≥ 0 such that: . Therefore, we have that Finally, replacing ξ in the formula of the first-order probability estimate for Gaussians (4.2) withξ i yields Combining this with (5.6) gives us the claimed formula (5.5). The second-order probability estimate P 2 satisfies a linear relationship similar to the firstorder estimate P 1 ; that is, . . , M }, we can then use formula (4.3) for Gaussians to compute P i ({ξ : F 2 (u, ξ; ξ ) ≥ z}). As in the case of the first-order estimate, this requires us to first find the point on the surface F 2 (u, ξ; ξ ) = z where the contours of 1 2 are tangential to the surface, see Figure 3 . The final closed-form expression is presented in the following theorem, whose proof is omitted for brevity. Suppose that ξ ∼ M i=1 w i N (µ i , Σ i ) and for some fixed u ∈ U, ξ is the solution to (5.3) . Then the second-order approximation of the probability P(F (u, ξ) ≥ z) is given by (5.8) To obtain the second-order LDT approximation, we note that we first have to reformulate the nonconvex quadratic optimization problem (5.9) that appears in the closed-form expression of the second-order probability estimate (5.8). The following lemma provides a characterization ofξ i that we shall use subsequently. only if there existsλ i ∈ R + such that Proof. First, note that ∇ ξ F 2 (u, ξ ; ξ ) = ∇ ξ F (u, ξ ) = 0 (see subsection 1.2). Therefore, [36, Theorem 3.2] shows thatξ i is a global minimizer if and only if (5.11a) and (5.11b) are satisfied, along with Σ −1 is seen to be equivalent to (5.11c ). To obtain the second-order LDT chance-constrained approximation, we relax the positive definiteness condition (5.11c), and combine (5.3), (5.8), (5.11a) and (5.11b) along with (3.11) (k = 2) to obtain the following formulation: subject to u ∈ U, ξ ∈ R n , η ∈ R n , λ ∈ R + , As in the case of the LDT chance-constrained formulations for Gaussians, the function values and gradients of all functions involved in (5.10) and (5.12) can be computed using standard automatic differentiation codes, enabling their solution using off-the-shelf solvers. 6. Applications. We illustrate the computational performance of our first-and secondorder LDT probability estimates and corresponding chance-constrained optimization problems using three examples from different applications. In subsection 6.1, we consider a portfolio selection problem with random stock prices; in subsection 6.2, we consider a structural short column design with uncertain material properties and loads; and finally in subsection 6.3, we consider optimal control of a steady-state advection-diffusion PDE. We implemented all of our formulations in Julia using JuMP [21] , and solved all optimization problems using Ipopt [54] with default solver options. We used the forward mode automatic differentiation provided by JuMP for gradient computations. Since JuMP does not support Hessian computations of user-defined nonlinear constraint functions, which is needed for the second-order LDT chanceconstrained formulations (for example, to encode the orthogonal determinant (4.4)), we used Ipopt's in-built quasi-newton method for the solution of these problems. It is important to note that our goal is to demonstrate the advantage and competitive performance of our method using a purely off-the-shelf implementation. Reported solution times correspond to wall clock times on a single thread of an Intel Xeon Gold CPU at 2.30GHz with 512 GB RAM. 6.1. Portfolio management. Consider an optimal asset allocation problem, where we wish to allocate our wealth among n stocks which will be held over T time periods. Let u i ∈ R + denote the fraction of wealth invested in the i th stock. Our goal is to determine the optimal allocation u = [u 1 , . . . , u n ] ∈ R n + of our wealth for each stock. Let v i denote the value of the i th stock at the end of T time periods. We model the uncertainty in stock price v i using a traditional log-normal model [28, 31] , where θ i ∈ R and σ i ∈ R represent the drift and infinitesimal standard deviation of the Lévy process for the i th stock, respectively, and ξ i ∈ R is a zero-mean continuously compounded rate of return which represents the true uncertainty driver. Since choosing the right distribution for ξ is an unresolved question (e.g., see [31] ), we model ξ using Gaussian (as in traditional models) as well as Gaussian mixture distributions. Our goal is to ensure, with very high probability, that the total return after T time periods, n i=1 v i (ξ i )u i , is higher than some value z; note that this is equivalent to constraining the probability P( n i=1 v i (ξ i )u i ≤ z). 6.1.1. Probability estimation. Before we consider the chance-contrained formulation, we empirically assess the accuracy of the first-and second-order LDT estimates of the rare event probability P( n i=1 v i (ξ i )u i ≤ z) for a fixed stock allocation u and fixed value of z. Defining this is equivalent to estimating P(F (u, ξ) ≥ −z). Note that the function F is concave in ξ and linear in u, and in particular, it satisfies condition (C1) in Theorem 3.3. We downloaded 1,000 days' worth of daily NASDAQ stock price data for 50 stocks to estimate the drift parameter θ i , standard deviation σ i and the distribution for the random parameter ξ (Gaussian or Gaussian mixture). Specifically, we first compute the compounded rates of v i ; i.e., the logarithm of the ratio of stock prices in two consecutive days. For a Gaussian distribution, we then compute the empirical mean and covariance corresponding to these rates to obtain θ i −σ 2 i /2 and covariance for ξ i respectively. For a Gaussian mixture, we use the GaussianMixtures.jl package to estimate the parameters of a 2-and 3-component Gaussian mixture using the expectation-maximization algorithm. The allocation u is fixed and chosen randomly from a uniform distribution in [0, 1] n subject to the constraint n i=1 u i = 1. We fix T = 10, and compare the accuracy of estimating P(F (u, ξ) ≥ −z) for different fixed values of z using: (i) our first-(P 1 ) and second-order (P 2 ) LDT probability approximations, with (ii) a Monte Carlo sample average approximation P M C N (u, z) using N independent samples. We use N = 10 7 samples to estimate the true probability as P true = P M C 10 7 (u, z), and define errors LDT k (u, z) := log 10 P k (u, z, ξ ) − log 10 P true , k ∈ {1, 2}, M C N (u, z) := log 10 P M C N (u, z) − log 10 P true , N ∈ {10 4 , 10 5 }. We note that positive errors denote an over-estimation of the true probability, whereas negative values imply under-estimation. Table 1 The comparison of the first-(P1) and second-order (P2) LDT probability estimates with sample average approximations (P M C N ) for estimating P(F (u, ξ) ≥ −z) in the portfolio management problem. We drop the dependence on u and z for simplicity. Table 1 shows that, as z becomes smaller, the event F (u, ξ) ≥ −z becomes rarer and a sample average approximation with N = 10 4 or 10 5 samples is insufficient to accurately estimate the true probability. For example we observe that for z = 0.80, the true probability is much smaller than 10 −4 ; in such cases, none of the N = 10 4 samples record the rare event, on average, resulting in a probability estimate of 0. Stated differently, P M C 10 4 and P M C 10 5 fail to approximate probabilities lower than 10 −4 and 10 −5 , respectively, illustrating their high sample size requirement for accurate rare event estimation. In contrast, our proposed LDT approximations are relatively less sensitive to the extremeness of the event (i.e., to values of z). In particular, since F is concave in ξ, Proposition 3.1 ensures that the first-order approximation P 1 (u, z, ξ ) always overestimates the probability, and this is verified in Table 1 , since the errors LDT 1 are positive in all cases. Nevertheless, the first-order estimates can be about three times the magnitude of the true probability, with log-errors of roughly 0.5, across different z. In contrast, the second-order estimates are highly accurate, with log-errors consistently less than 10 −1 , regardless of values of z, and they typically outperform the Monte Carlo estimates for both Gaussians and mixture distributions. We highlight that, for rare events, it is much more important to correctly estimate the order of magnitude of the probability rather than its exact value; for example, it matters less that we estimate an event probability of 10 −6 as 2 · 10 −6 , as compared to estimating it as 10 −3 . We now turn to determining the optimal stock allocation. We first define the value-at-risk (VaR) of a fixed allocation vector u: Our goal is to maximize the VaR of the total return at level 1−α, and consider α 1 reflecting a conservative investment approach. This problem can be reformulated as the following variant of the chance-constrained optimization problem (1.1) using the definition of F in (6.2): For T = 10 and α = 10 −4 , Table 2 compares the computational performance of our LDT chance-constrained formulations with the smooth sigmoidal sample average formulation (2.1) (with ν = τ = 1) and the CVaR sample average formulation (2.2), each with N = 10 5 samples. Here, u denotes the optimal allocation determined by each method, z denotes the corresponding objective value of (6.4), and VaR 1−α (u ) denotes the Value-at-Risk (6.3) of the corresponding allocation u which we would like to be maximized. As before, we estimate the true probability P true (u , z ) = P M C 10 7 (u , z ) using N = 10 7 Monte Carlo samples. Therefore, the column log 10 P true (u , z ) − log 10 α shows the extent of feasibility of the chance constraint in each case, where negative values indicate satisfaction of the chance constraint whereas positive values indicate that the corresponding u and z fail to satisfy the chance constraint in (6.4) . We also report solution times of the different methods. Table 2 Comparison of solution methods for the optimal stock allocation problem (6.4) with T = 10 and α = 10 −4 . z VaR1−α(u ) P true (u , z ) log 10 P true (u , z ) − log 10 α Time [sec] Gaussian LDT (k = 1) 9.53 · 10 −1 9.54 · 10 −1 7.7 · 10 −5 −0.11 0.5 LDT (k = 2) 9.54 · 10 −1 9.54 · 10 −1 9.5 · 10 −5 −0.02 197.0 SAA (N = 10 5 ) 9.02 · 10 −1 9.53 · 10 −1 0.0 −∞ 638.7 CVaR (N = 10 5 ) 9.53 · 10 −1 9.52 · 10 −1 1.5 · 10 −4 0.17 527.6 Gaussian mixture (M = 2) LDT (k = 1) 9.15 · 10 −1 9.20 · 10 −1 5.4 · 10 −5 −0.26 11.3 LDT (k = 2) 9.21 · 10 −1 9.23 · 10 −1 8.3 · 10 −5 −0.08 85.7 SAA (N = 10 5 ) 9.20 · 10 −1 9.18 · 10 −1 1.2 · 10 −4 0.09 1, 147.4 CVaR (N = 10 5 ) 9.19 · 10 −1 9.17 · 10 −1 1.2 · 10 −4 0.07 420.4 From Table 2 , we observe that our proposed second-order LDT formulation consistently attains the maximum VaR. Moreover, its optimal objective value z is quite close to the true VaR, indicating that the second-order probability estimate is also accurate. Notably, these improved objective values also come at a much lower solution time compared to the samplingbased methods. The first-order LDT formulation attains a slightly smaller VaR compared to the second-order formulation, but it is still better than the sampling-based methods. Moreover, the solutions (u , z ) obtained by the sampling-based methods are not feasible in the original problem (6.4), since the corresponding log 10 P true (u , z ) − log 10 α > 0, implying that we need to increase the number of samples N for these methods to obtain a feasible solution. However, given that their solution times are already quite excessive with N = 10 5 samples; further increasing N will only amplify these times even more. In contrast, the first-and secondorder LDT formulations solve in ∼10 and ∼100 seconds respectively, with their solution times being largely unaffected by the choice of α. In conclusion, the LDT formulations outperform the sampling-based methods both in terms of objective value and solution time. The choice between the first-and second-order formulations can be made on a case-by-case basis; whereas the former is more computationally efficient, the latter is more accurate, in general. Figure 4 shows the optimal stocks allocations obtained by different methods in the case of the Gaussian mixture. Interestingly, all methods indicate that we should invest majority of the initial wealth in the stock "AGZD",and never invest in some of the other stocks. However, the first-order LDT formulation determines a slightly different allocation for this stock compared with other methods, implying that we should use the second-order LDT allocation to obtain higher returns. Curiously, a closer look at Table 2 reveals that the VaR of the total return is less than the initial wealth (= 1) in all cases, indicating that it may not be wise to invest at all over the short time period T = 10 with an overly conservative risk threshold α = 10 −4 . 6.2. Short column design. In the last application, the function F was linear in the decisions u. We now consider a nonlinear model for the design of a short column [2, 11] with uncertain material properties, and subject to a chance constraint on material failure. Consider a short column with a rectangular cross-section of width w and height h, that is subject to an uncertain axial load ξ F , bending moment ξ M and an uncertain yield stress ξ Y . We model these as a three-dimensional random vector ξ = [ξ F , ξ M , ξ Y ]. Our goal is to determine w and h so that the area of the cross-section wh is minimized, and the design should not experience material failure with very high probability of 1 − α, where α ∈ (0, 1) is some given threshold. The parameter-to-observation map F can be derived from the elasticplastic constitutive law, and it has a general nonlinear dependence on both the decisions u = (w, h) ∈ R 2 as well as the uncertainty ξ ∈ R 3 : The optimization problem is where L w , L h and U w , U h are given lower and upper bounds on w and h, respectively, and we set [L w , U w , L h , U h ] = [5, 15, 15, 25] . 6.2.1. Gaussian random parameter. We first consider the random vector ξ ∼ N (µ, Σ) with mean and covariance given as follows, adapted from the example in [11] : The covariance matrix indicates that the correlation coefficient between the axial force ξ F and bending moment ξ M is 0.5, whereas the yield stress exp(ξ Y ) is uncorrelated to them and follows a log-normal distribution with mean 5 and standard deviation 0.5. In Figure 5 , we compare the performance of different methods for solving the short column design problem with Gaussian random parameter (6.7), as a function of the risk threshold α varying from 10 −1 to 10 −6 . The left-most figure shows that our LDT chance constrained formulations provide similar optimal solutions, indicating that F is near-linear. Moreover, when α becomes smaller, although the optimal areas determined by the sampling-based methods appear to be smaller than the ones determined by the LDT approaches (near-horizontal parts in the left-most figure) , the middle figure shows that these solutions are, in fact, not feasible because they fail to satisfy the probability constraint (since their curves fall outside the negative part highlighted in blue). In fact, as far as the sampling-based methods go, the solutions determined using N samples can give reliable solutions only for α ≥ 10 −(N −1) . In contrast, the middle figure shows that the solutions determined by the LDT formulations always satisfy the chance constraint for all values of α, and their corresponding optimal areas coincide with those of the sampling-based methods, whenever the latter are also feasible. In fact, they are able to find feasible designs even for extreme cases where sampling-based methods with N ≤ 10 5 samples fail. Finally, the right-most figure shows that the computational time increases severely for sampling-based methods as the number of samples N increases, since the problem size in (2.1) and (2.2) scales as O(N ). The middle figure suggests that, for very high reliability α = 10 −6 , we need to increase N to obtain a feasible solution, and thus the corresponding time is also expected to increase. The LDT approaches are much more efficient, with both the first-and second-order formulations solving in roughly 10% of the time compared to SAA with N = 10 2 samples. 6.2.2. Gaussian mixture random parameter. We now consider ξ ∼ 0.5N (µ 1 , Σ 1 ) + 0.5N (µ 2 , Σ 2 ) to follow a Gaussian mixture distribution with two components, where the mean µ 1 and covariance Σ 1 of the first Gaussian component have the same values as in (6.7), whereas the mean and covariance of the second Gaussian component is given as follows: The results are summarized in Figure 6 , and are largely similar to the case of Gaussians. The main differences are the different optimal areas determined by the two LDT approaches, with the second-order formulation yielding a slightly better objective value while still obtaining a feasible design. As before, both LDT approaches are significantly more efficient compared with sampling-based methods in terms of computational time and sensitivity to event extremeness. 6.3. Optimal boundary control for steady-state advection-diffusion. In the previous sections, we showed the performance of our proposed formulations approach for problems that are nonlinear in the uncertainty, but still admit closed-form expressions of the parameterto-observation map F . In this section, we consider a much more complicated PDE system, where we do not have closed-form expressions. In particular, these represent examples where it is infeasible to use sampling-based methods, since they would need to incorporate as many PDE solves as the number of samples. In contrast, we demonstrate that our LDT approaches continue to provide satisfactory results. Consider a 2D steady-state advection-diffusion equation in the unit square Ω = [0, 1] × [0, 1], and denote the boundary Γ c = 0 × [0, 1] and Γ n = ∂Ω\Γ c . −∇ · (κ(x, ξ)∇y(x)) + w(x) · ∇y(x) = f (x, ξ), x ∈ Ω, (6.9a) (κ(x, ξ)∇y(x)) · n(x) = 1 0 (u(x) − y(x)) , on Γ c , (6.9b) (κ(x, ξ)∇y(x)) · n(x) = 0, on Γ n . (6.9c) Here, y is the state representing the temperature at each location x and u is the Dirichlet boundary control on the boundary Γ c . We enforce y(x) = u(x) on Γ c through a Robin-type condition (6.9b) to avoid regularity limitations of Dirichlet boundary control. Also, 0 = 10 −4 is a penalty parameter, w = [1, 0] is a fixed velocity field, and n(x) represents the outwardpointing normal at the boundary. The diffusion coefficient κ(x, ξ) and force f (x, ξ) are subject to random uncertainty ξ with the following nonlinear dependence: Observe that, since the temperature y is a solution of (6.9), it is uncertain due to the randomness of ξ. In applications like hyperthermia treatment for cancer therapy [18] , it is important to ensure that the (uncertain) temperature in some areas (e.g., essential organs) do not exceed some critical threshold to protect organs from thermal damage. Motivated by these applications, our goal is to ensure that the average temperature in the sub-domain Ω 0 = [0.4, 0.6] × [0.4, 0.6] remains less than z with high probability 1 − α. This leads to the following definition of the parameter-to-observation F (u, ξ) as the average temperature in Ω 0 for a given realization of the random parameter ξ and control vector u: where y(x; u, ξ) is the solution of PDE (6.9) at location x given ξ and u. Our goal is to control the temperature at the boundary Γ c as close to 0 as possible, while ensuring that the average temperature in Ω 0 does not exceed z with high probability: where z is a given upper bound on the average temperature in Ω 0 , and α 1. Unlike the other applications, note that we do not have a closed-form expression for F (u, ξ) with respect to u or ξ, and we can only access it through the solution of the PDE in (6.9). Nevertheless, it can still be characterized implicitly as a linear system upon discretization of the PDE. If we denote the dimension of the design u to be m and use a finite difference method to dicretize the PDE, then the resulting linear system has size m 2 × m 2 . Since the complexity of solving one linear problem is O(m 2 ), resorting to a sampling-based methods will require us to model N realizations of F as constraints in the optimization problem, each involving a PDE-based linear solve. This would result in a very large-scale problem even for modest numbers of samples and control inputs, making sampling-based methods practically inefficient. This motivates alternate LDT approaches. We use a finite-difference method to discretize the PDE (6.9) with m = 30 elements. Therefore, the mesh is 30×30 and the state y is 900-dimensional. To avoid issues of infeasibility Table 3 Optimal objective values and constraint feasibilities of the boundary control problem (6.12) with z = 1 for different α's using our first-and second order LDT chance-constrained formulations. . The optimal boundary condition u from problem (6.12) with z = 1 for different α's using the second-order LDT approximation. Figure 8 . The temperature profile y under the optimal boundary condition u in Figure 7 and ξ for the minimization problem (6.12) with z = 1 for different α's using the second-order LDT approximation. and convergence to non-optimal stationary points, we successively warm-start the solution of the α = 10 −(k+1) problem using the optimal values determined for the α = 10 −k problem. In case of the first-order LDT approximation, we use a simple finite difference method to compute gradients, since the random parameter is low-dimensional n = 2 (this also explains the faster solution times in Table 3 ). In case of the second-order LDT approximation, we use forwardmode automatic differentiation to compute gradients, since we need accurate gradients of the correction term (4.9) to ensure convergence. Table 3 summarizes the computational performance. We observe that the optimal objective values become larger for more extreme events; i.e., the optimal boundary control u has larger variations. The terms P(F (u , ξ) ≥ z) show chance constraint feasibilities, and the fact that they are all less than α indicates that all obtained solutions are feasible. Comparing the two approaches, we observe that the optimal objective values of the second-order approximations are better compared with those of first-order approximations; however, this improvement comes at a significantly higher computational cost. We note that a large portion of the computational time is due to the inability of forward-mode automatic differentiation in JuMP to differentiate a sparse solve that required us to modify the PDE solve to a dense solve. The latter can be reduced by using a more efficient automatic differentiation implementation (or perhaps by using tailored codes for evaluating adjoints and gradients). Finally, Figures 7 and 8 show the optimal boundary control u and corresponding temperature profile y under the optimal u and ξ obtained through (4.7). The optimal control inputs u have similar shapes but scale with log α; moreover, they all have a jump at x 2 = 0.6, which comes from the discontinuity of the diffusion coefficient κ at x 2 = 0.6. The temperature profile y under the optimal boundary condition u and the LDT realization ξ , corresponds to the most representative profile among all feasible profiles in (6.12) . Similar to the boundary conditions in Figure 7 , the temperatures for smaller risk thresholds have larger variations. Despite their rare occurrence, factors such as climate change and human population growth are causing a steady increase in the frequency of extreme high-impact events. The paucity of available data and inability to accurately model socioeconomic costs during these events motivates chance constraints as a viable modeling framework to mitigate the risk of extreme events. In this paper, we combined ideas from large deviation theory and convex and bilevel optimization to propose a novel solution method for decision-making problems constrained by probabilities of rare events. Unlike classical approaches, our formulations are sampling-free, independent of event extremeness, and applicable to a broad class of nonlinear problems. Under certain regularity assumptions on the system constraints, they constitute safe conservative approximations or even asymptotically equivalent reformulations of the true problem. Furthermore, by utilizing local Taylor approximations of the system functions, we provide refined approximations over classical large deviation estimates that turn out to be empirically accurate even in non-asymptotic regimes. Although we only considered problems affected by Gaussian mixture uncertainties, it should be noted that our methods are also directly applicable to more general settings where the uncertainties can be transformed to a Gaussian or approximated well using a mixture of Gaussians, for which there are efficient off-the-shelf codes. Our proposed formulations complement this well since they can also be solved by off-the-shelf optimization solvers. Computational experiments on applications from diverse domains confirm the broad applicability, and improved efficiency and accuracy of our method over classical sampling-based methods in the rare event regime: the portfolio application illustrates its applicability to nonlinear high-dimensional uncertainties, the structural design application showcases its potential advantages over sampling-based methods in systems that are nonlinear in the decisions as well as uncertainties, whereas the optimal control application shows its ability to address complicated PDE-constrained systems. We believe that our approach is but a small step towards tackling rare chance constraints, and much more work needs to be done. Various regularity assumptions on the structure of the constraint function (e.g., smoothness, concavity etc.) as well as the uncertainty (e.g., sufficiently smooth rate functions) need to be lifted and generalized. For example, our findings suggest that we cannot expect our method, and perhaps even large deviation theory, to work well when the system constraints are convex functions of the uncertainty. We would also like to extend our methodology to nonsmooth constraint functions, where the current Taylorbased probability estimates would be inapplicable, as well as to joint chance constrained problems. From a computational viewpoint, although our method has the advantage of using off-the-shelf codes, this also means that it can be inefficient at tackling more structured largescale problems. For example, we need more efficient methods to compute derivatives of the probability estimates, and particularly of the second-order correction term. We believe this may also spur the development of new techniques in automatic differentiation. Solving joint chance constrained problems using regularization and benders' decomposition Benchmark study of numerical methods for reliability-based design optimization, Structural and multidisciplinary optimization Stochastic simulation: algorithms and analysis Chanceconstrained problems and rare events: an importance sampling approach Optimal scenario generation for heavy-tailed chance constrained optimization On the multi-dimensional central limit theorem Tauberian theorems, Chernoff inequality, and the tail behavior of finite convolutions of distribution functions Introduction to rare event simulation Uncertain convex programs: randomized solutions and confidence levels A sigmoidal approximation for chance-constrained nonlinear programs Risk-based design optimization via probability of failure, conditional Value-at-Risk, and buffered probability of failure A sequential algorithm for solving nonlinear optimization problems with chance constraints Experimental evidence of hydrodynamic instantons: The universal route to rogue waves Rogue waves and large deviations in deep sea Extreme event quantification in dynamical systems with random components Large Deviations Techniques and Applications, Applications of mathematics Inverse reliability problem Mathematical cancer therapy planning in deep regional hyperthermia A most probable point-based method for efficient uncertainty analysis Exploring Monte Carlo methods JuMP: A modeling language for mathematical optimization Optimal Neumann boundary control of a vibrating string with uncertain initial data and probabilistic terminal constraints An inner-outer approximation approach to chance constrained optimization On probabilistic capacity maximization in a stationary gas network, Optimization A critical note on empirical (sample average, Monte Carlo) approximation of solutions to chance constrained programs Convex Analysis and Minimization Algorithms II: Advanced Theory and Bundle Methods Sequential convex approximations to joint chance constrained programs: A Monte Carlo approach Options, futures and other derivatives Sharp asymptotics of large deviations for general state-space Markov-additive chains in R d Methods of reducing sample size in Monte Carlo computations A log-robust optimization approach to portfolio management Probabilistically constrained linear programs and risk-adjusted controller design Second-order reliability method-based inverse reliability analysis using hessian update for accurate and efficient reliability-based design optimization A sample approximation approach for optimization with probabilistic constraints An integer programming approach for linear programs with probabilistic constraints Generalizations of the trust region problem Scenario approximations of chance constraints, in Probabilistic and randomized methods for design under uncertainty Convex approximations of chance constrained programs Dominating points and the asymptotics of large deviations for random walk on R d Sample average approximation method for chance constrained programming: theory and applications Solving chance-constrained problems via a smooth sample-based nonlinear approximation Finitely generated cumulants Probabilistic programming Variational Analysis Optimization of computer simulation models with rare events A critical appraisal of methods to determine failure probabilities Lectures on stochastic programming: modeling and theory Extreme event probability estimation using PDEconstrained optimization and large deviation theory, with application to tsunamis A basic introduction to large deviations: Theory, applications, simulations A survey on approaches for reliability-based optimization, Structural and Multidisciplinary Optimization Sub-) differentiability of probability functions with elliptical distributions, Set-Valued and Variational Analysis Gradient formulae for nonlinear probabilistic constraints with Gaussian and Gaussian-like distributions Large deviations and applications On the implementation of an interior-point filter line-search algorithm for large-scale nonlinear programming AAL AAME AAON AAWW AAXJ AAXN ABCB ABIO ABMD ABTX ABUS ACBI ACGL ACHC ACIA ACIW ACNB ACRX ACTG ACWI ACWX ADAP ADES ADI ADMA ADP ADSK ADUS ADVM AEGN AEHR AEIS AEMD AERI AEY AEYE AFH AFMD AGEN AGFS AGFSW AGIO AGLE AGRX AGTC AGYS AGZD AHPI AIA Acknowledgments. We thank Eric Vanden-Eijnden and Georg Stadler from Courant Institute at New York University for helpful suggestions at various stages of this work.