key: cord-0549565-lojso9za authors: Cakmak, Sait; Astudillo, Raul; Frazier, Peter; Zhou, Enlu title: Bayesian Optimization of Risk Measures date: 2020-07-10 journal: nan DOI: nan sha: 1bc3ded1e98a96119e0bfd13ea3694ee7c394ac4 doc_id: 549565 cord_uid: lojso9za We consider Bayesian optimization of objective functions of the form $\rho[F(x, W)]$, where $F$ is a black-box, expensive-to-evaluate function and $\rho$ denotes either the VaR or CVaR risk measure, computed with respect to the randomness induced by the environmental random variable $W$. Such problems arise in decision making under uncertainty, such as in portfolio optimization and robust systems design. We propose a family of novel Bayesian optimization algorithms that exploit the structure of the objective function to substantially improve sampling efficiency. Instead of modeling the objective function directly, as is typical in Bayesian optimization, these algorithms model $F$ as a Gaussian process and use the implied posterior on the objective function to decide which points to evaluate. We demonstrate the effectiveness of our approach in a variety of numerical experiments. Traditional Bayesian optimization (BO) has focused on problems of the form min_x F(x), or more generally, min_x E[F(x, W)], where F(x, W) is a time-consuming black-box function that does not provide derivatives, and W is a random variable. This has seen enormous impact, and has expanded from hyper-parameter tuning [1, 2] to real-world decisions [3, 4, 5, 6]. However, for many truly high-stakes decisions the expectation is inappropriate: we must be risk-averse. In such settings, risk measures have become a crucial tool for quantifying risk. For instance, by law, banks are regulated using the Value-at-Risk [7]. Other examples where risk measures have been used include cancer treatment planning [8, 9], healthcare operations [10], natural resource management [11], and disaster management [12]. In this work, we consider risk-averse optimization of the form min_x ρ[F(x, W)], where ρ is a risk measure that maps the probability distribution of F(x, W) (induced by a random W and fixed F) onto a real number describing its level of risk. When F is inexpensive and has convenient analytic structure, optimizing against a risk measure is well understood [13, 14, 15]. However, to the best of our knowledge, when F is expensive-to-evaluate, derivative-free, or a black box, there are no existing methods able to tackle the problem. While one can sample from W and obtain a high-accuracy estimate of ρ[F(x, W)], this typically requires a large number of samples. For example, if each evaluation of F requires 1 hour, and we need 10^2 samples to obtain a high-accuracy estimate of ρ[F(x, W)], this would require more than 4 days per evaluation. Moreover, if we do not obtain a high-accuracy estimate, estimators are typically biased [16] in a way that is unaccounted for by traditional Gaussian process regression. In this paper, we develop a novel approach that overcomes these challenges. We provide a new one-step Bayes optimal algorithm, ρKG, and a fast approximation, ρKG_apx, that works well in numerical experiments. This significantly improves the sampling efficiency over state-of-the-art BO methods. We develop asymptotically unbiased and consistent gradient estimators for efficient optimization of ρKG.
To further improve the numerical efficiency, we introduce a novel two-time-scale optimization approach that is broadly applicable in many BO methods that require nested optimization. We focus on two risk measures commonly used in practice: Value-at-Risk (VaR) and Conditional Value-at-Risk (CVaR). VaR and CVaR have been applied in a wide range of settings, including robust optimization [14], simulation optimization under input uncertainty [17], insurance [18], and risk management [19]. We refer the reader to [20] for a broader discussion of VaR and CVaR. The remainder of this paper is organized as follows. Section 2 gives a brief background on risk measures and BO. Section 3 formally introduces the problem setting. Section 4 introduces the statistical model on F, a Gaussian process (GP) prior, and explains how to estimate the VaR and CVaR of a GP. We introduce ρKG, a value-of-information type of acquisition function for optimization of VaR and CVaR, in Section 5; a cheaper approximation, ρKG_apx, and efficient optimization of both algorithms are also discussed there. Finally, Section 6 lists a number of problems that we consider for testing and demonstrating the performance of the algorithms developed here. The BO problem has been studied extensively over the past 20 years. Popular BO methods include expected improvement [21], knowledge gradient [22], and (predictive) entropy search [23]; they can work on discrete [22] or continuous [24] domains, with noisy or noise-free observations, and have been extended to utilize parallel evaluations [25, 26], multi-fidelity observations [27], and gradient observations [28]. We refer the reader to [29] for a review of BO methods. Within the BO literature, a closely related work to ours is [30], where (1) is studied when ρ is the expectation operator. Due to the linearity of the expectation, the GP prior on F(x, w) translates into a GP prior on E[F(x, W)], a fact that is leveraged to efficiently compute the one-step Bayes optimal policy, also known as the knowledge gradient (KG) policy. Although it is not the focus of this study, since CVaR at risk level α = 0 is the expectation, our methods can be directly applied to solve the expectation problem, and the resulting algorithm is equivalent to the algorithm proposed in [30]. Other BO methods that consider the effect of a random environmental variable include [31, 32]. We recall that a risk measure (cf. [33, 34, 35, 36]) is a functional that maps probability distributions onto a real number. For a generic random variable Z, we often use the notation ρ(Z) to indicate the risk value evaluated on the distribution of Z. The most widely used risk measure is the Value-at-Risk (VaR) [37], given by ρ(Z) = VaR_α[Z] = inf{t : P_Z(Z ≤ t) ≥ α}, where P_Z indicates the distribution of Z. Another widely used risk measure is the Conditional Value-at-Risk (CVaR), CVaR_α[Z] = (1/(1 − α)) ∫_α^1 VaR_θ[Z] dθ, which, for a continuous random variable Z, equals the expectation of Z conditional on Z exceeding VaR_α[Z]. Risk measures offer a middle ground between risk-neutral expectation operators and a worst-case performance measure, and are often more interpretable than expected utility [17]. They are widely used in finance, and Value-at-Risk is encoded in the Basel II accord [38].
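For concreteness, the following is a minimal NumPy sketch of the standard sample-based VaR/CVaR estimators of the kind reviewed in [20]; the function name and the tail-averaging convention are our illustrative choices, not part of the paper.

```python
import numpy as np

def var_cvar(samples, alpha):
    """Empirical VaR and CVaR at risk level alpha from samples of Z.

    With the sample sorted in ascending order, VaR_alpha is taken as the
    ceil(alpha * L)-th order statistic, and CVaR_alpha as the average of
    the upper tail starting at that order statistic (one common convention).
    """
    z = np.sort(np.asarray(samples))
    L = z.shape[0]
    idx = max(int(np.ceil(alpha * L)), 1) - 1  # 0-based index of the VaR order statistic
    return z[idx], z[idx:].mean()

# Example: VaR/CVaR of a standard normal at alpha = 0.7.
rng = np.random.default_rng(0)
print(var_cvar(rng.standard_normal(100_000), alpha=0.7))
```

Note that at α = 0 the CVaR estimate reduces to the sample mean, matching the observation above that CVaR at risk level 0 is the expectation.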
We consider the optimization problem min_{x∈X} ρ[F(x, W)], (1) where X ⊂ ℝ^{d_X} is a simple compact set, e.g. a hyper-rectangle; W is a random variable with probability distribution P(w) and compact support W ⊂ ℝ^{d_W}; and ρ is a known risk measure, mapping the random variable F(x, W) (induced by W) to a real number. We assume that F : X × W → ℝ is a continuous black-box function whose evaluations are expensive, i.e., each evaluation takes hours or days, has significant monetary cost, or the number of evaluations is limited for some other reason. We also assume that evaluations of F are either noise-free or observed with independent normally distributed noise of known variance. We emphasize that the risk measure ρ is only over the randomness in W, and ρ[F(x, W)] is calculated holding F fixed. For example, if ρ is the Value-at-Risk and W is a continuous random variable, then this is explicitly written as VaR_α[F(x, W)] = inf{t : P_W(F(x, W) ≤ t) ≥ α}. (2) Later, we will model F as being drawn from a GP, but we will continue to compute ρ[F(x, W)] only over the randomness in W. Thus, this quantity is a function of F and x, and will be random with a distribution induced by the distribution on F (and more precisely by the distribution on F(x, ·)). We assume a level of familiarity with Gaussian processes and refer the reader to [39] for details. For notational purposes, we recall that, given a GP prior on F, the posterior distribution of F given the history after n evaluations of F, F_n, is again a GP, F | F_n ∼ GP(µ_n, Σ_n), where µ_n and Σ_n are the posterior mean and covariance functions. We build on this standard analysis to build a Bayesian posterior over ρ[F(·, W)]. The GP posterior distribution on F implies a posterior distribution on the mapping x → ρ[F(x, W)]. In contrast with the case where ρ is the expectation operator, this distribution is in general non-Gaussian, thus rendering computations more challenging. The quantity of interest is the posterior mean of ρ[F(x, W)], i.e., E_n[ρ[F(x, W)]], (3) where E_n denotes the conditional expectation given the history up to time n, F_n. We can estimate (3) following a simple Monte Carlo approach by first drawing multiple samples of the (possibly infinite-dimensional) random vector (F(x, w) : w ∈ W) according to the time-n GP posterior distribution on F, and then averaging over the value of ρ computed with respect to each of these samples. We note that if W is a finite set, sampling from (F(x, w) : w ∈ W) reduces to sampling from a multivariate normal distribution with mean vector (µ_n(x, w) : w ∈ W) and covariance matrix (Σ_n(x, w, x, w'))_{w,w'∈W}. We further discuss estimation of (3) and its derivative in Section 5.2. As is standard in the BO literature, our algorithm's search is guided by an acquisition function, whose maximization indicates the next point to evaluate. Our proposed acquisition function generalizes the well-known knowledge gradient acquisition function [24], which has been generalized to other settings such as parallel and multi-fidelity optimization [40, 41]. Before formally introducing our acquisition function, we note that, in our setting, choosing a decision x to be implemented after the evaluation stage is complete is not a straightforward task. If the cardinality of W is large, then chances are ρ[F(x, W)] will not be known exactly, as this would require evaluating F(x, w) for all w ∈ W. A common approach in such scenarios is to choose the decision with the best expected objective value according to the posterior distribution (cf. [29]), x^* ∈ arg min_{x∈X} E_n[ρ[F(x, W)]]. (4)
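A minimal sketch of the Monte Carlo estimator of (3) for a finite W, assuming access to callables mu_n and Sigma_n that evaluate a trained GP's posterior mean and covariance (these stand-ins, and the choice of CVaR for ρ, are illustrative):

```python
import numpy as np

def posterior_mean_rho(mu_n, Sigma_n, x, w_set, alpha, n_gp_samples=64, rng=None):
    """Monte Carlo estimate of E_n[rho[F(x, W)]] for a finite, uniform W.

    mu_n(x, w_set) -> posterior mean vector of shape (L,)
    Sigma_n(x, w_set) -> posterior covariance matrix of shape (L, L)
    Both are hypothetical stand-ins for a fitted GP model.
    """
    rng = rng or np.random.default_rng()
    mean, cov = mu_n(x, w_set), Sigma_n(x, w_set)
    # Joint samples of (F(x, w) : w in W) from the time-n posterior on F.
    paths = rng.multivariate_normal(mean, cov, size=n_gp_samples)
    idx = max(int(np.ceil(alpha * len(w_set))), 1) - 1
    # Empirical CVaR of each sample path (uniform W), then average over paths.
    return float(np.mean([np.sort(p)[idx:].mean() for p in paths]))
```

The same routine, minimized over x, gives the decision rule (4).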
Having defined the choice of the decision x to be implemented after the evaluation stage is complete, we are now in a position to introduce our acquisition function, the knowledge gradient for risk measures. We motivate this acquisition function by noting that, if we had to choose a decision with the information available at time n, the expected objective value we would get according to equation (4) is simply min_{x∈X} E_n[ρ[F(x, W)]]. On the other hand, if we were allowed to make one additional evaluation, the expected objective value we would get becomes min_{x∈X} E_{n+1}[ρ[F(x, W)]], and the difference min_{x∈X} E_n[ρ[F(x, W)]] − min_{x∈X} E_{n+1}[ρ[F(x, W)]] can be interpreted as a measure of improvement due to making one additional evaluation. We note that, given the information available at time n, min_{x∈X} E_{n+1}[ρ[F(x, W)]] is random due to its dependence on the yet unobserved (n+1)-th evaluation, and thus the quantity above is itself random. The knowledge gradient for risk measures acquisition function is defined as the expected value of this quantity given the information available at time n and the candidate (x, w): ρKG(x, w) = min_{x'∈X} E_n[ρ[F(x', W)]] − E_n[min_{x'∈X} E_{n+1}[ρ[F(x', W)]] | (x_{n+1}, w_{n+1}) = (x, w)]. (5) Our algorithm sequentially chooses the next point to evaluate as the one that maximizes (5), and is thus, by construction, one-step Bayes optimal. In this subsection, we explain how to estimate the value of ρKG for a given candidate (x, w) and given solutions x_i^* to the inner optimization problems. We note that the first term in (5) does not involve the conditional expectation, and start by replacing the intractable expectation operators with their Monte Carlo estimators and the inner optimization problem(s) with the given solutions: ρKG(x, w) ≈ (1/M) Σ_{j=1}^M ρ[F̃_j(x^*, W̃)] − (1/K) Σ_{i=1}^K (1/M) Σ_{j=1}^M ρ[F̃_ij(x_i^*, W̃)], (6) where F̃_j is the j-th sample drawn from the current GP, and F̃_ij is the j-th sample drawn from the i-th fantasy GP model. A fantasy GP model is the GP model obtained by conditioning the current GP model on a fantasy observation simulated, at the candidate point, from the distribution implied by the GP model. Let w_{1:L} denote the vector [w_1, ..., w_L]. We approximate ρ[F̃_ij(x, W)] (and ρ[F̃_j(x, W)]) as follows. If W is a small finite set, let W̃ = W; otherwise, let W̃ = {w_{1:L}} be drawn i.i.d. from P(w). Draw the random vector {F̃_ij(x, w), w ∈ W̃} from the i-th fantasy GP, GP_f^i, with mean and covariance functions µ_f^i, Σ_f^i; sort the resulting values in ascending order as F̃_ij(x, w_(1)) ≤ ... ≤ F̃_ij(x, w_(L)); and use F̃_ij(x, w_(⌈αL⌉)) and (1/(L − ⌈αL⌉ + 1)) Σ_{l=⌈αL⌉}^L F̃_ij(x, w_(l)), where ⌈·⌉ is the ceiling operator, as estimates of VaR and CVaR, respectively (cf. [20]). We note that the last step of the procedure outlined here assumes that W̃ has either a continuous or a discrete uniform distribution. Otherwise, one should follow the definitions given in Section 2 and calculate the sample ρ while accounting for the probability mass of each w ∈ W̃. Since the first term in (5) is independent of the candidate point, the optimization problem reduces to min_{(x,w)∈X×W} E_n[min_{x'∈X} E_{n+1}[ρ[F(x', W)]]]. (7) The objective function involves two expectations that are not analytically available. Replacing these with their Monte Carlo estimators, we obtain (1/K) Σ_{i=1}^K min_{x'∈X} (1/M) Σ_{j=1}^M ρ[F̃_ij(x', W̃)], (8) where F̃_ij is defined in Section 5.1. Let ∇F̃(x_i, w_l) be the gradient (w.r.t. x_i) corresponding to the sample F̃(x_i, w_l). Based on the ordering defined in Section 5.1, let ∇v_j(x_i) := ∇F̃_j(x_i, w_(⌈αL⌉)) and ∇c_j(x_i) := (1/(L − ⌈αL⌉ + 1)) Σ_{l=⌈αL⌉}^L ∇F̃_j(x_i, w_(l)) be the estimators of the gradients of VaR and CVaR, respectively. Under mild regularity conditions, [42] and [43] show that these are asymptotically unbiased and consistent estimators of ∇_x VaR_α[F̃_j(x, W)] and ∇_x CVaR_α[F̃_j(x, W)]. Combining with results from the IPA literature (e.g. [44], Proposition 1) and averaging over M i.i.d. samples of the estimators, we obtain asymptotically unbiased and consistent estimators of ∇_x E_{n+1}[ρ[F(x, W)]], which can be used to solve the inner optimization problem. Let x_i^* be the solution to the i-th inner optimization problem. Using the envelope theorem ([45], Corollary 4), we can propagate gradients one step further to obtain (1/K) Σ_{i=1}^K (1/M) Σ_{j=1}^M ∇_u c/v_j(x_i^*), (9) where u := (x, w) denotes the candidate and ∇_u c/v_j(x_i^*) is the estimator from the j-th sample, as asymptotically unbiased and consistent gradient estimators for (7). We use these with the L-BFGS-B algorithm [46] to optimize ρKG.
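Abstracting away the gradient machinery, the value computation of Section 5.1 can be sketched as follows; the gp.posterior and gp.fantasize interfaces are hypothetical stand-ins for a GP library, and a finite grid replaces the continuous inner minimization:

```python
import numpy as np

def rho_kg(gp, candidate, x_grid, w_set, alpha, K=10, M=40, rng=None):
    """Monte Carlo sketch of the rhoKG value (6) at a candidate (x, w).

    Assumed (hypothetical) GP interface:
      gp.posterior(x, w_set) -> (mean, cov) of the vector (F(x, w) : w in w_set)
      gp.fantasize(candidate) -> a GP conditioned on one observation simulated
                                 at `candidate` from the current posterior
    """
    rng = rng or np.random.default_rng()

    def mean_rho(model, x):  # MC estimate of E[rho[F(x, W)]] under `model`
        mean, cov = model.posterior(x, w_set)
        paths = rng.multivariate_normal(mean, cov, size=M)
        idx = max(int(np.ceil(alpha * len(w_set))), 1) - 1
        return np.mean([np.sort(p)[idx:].mean() for p in paths])

    best_now = min(mean_rho(gp, x) for x in x_grid)       # first term of (5)
    best_after = []
    for _ in range(K):
        fantasy = gp.fantasize(candidate)                 # fix one fantasy model
        best_after.append(min(mean_rho(fantasy, x) for x in x_grid))
    return best_now - float(np.mean(best_after))
```

Note that each fantasy model is generated once and then held fixed across its inner minimization, mirroring the nested structure of (7).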
Proposition 1. Under mild regularity conditions, the estimators given by (9) are asymptotically unbiased and consistent gradient estimators of the ρKG optimization problem (7). A formal statement of the proposition and the proof are given in the supplement for the case of d_W = 1. We show in the supplement that selecting the n-th candidate to evaluate using the ρKG algorithm, including the training of the GP model, has a computational complexity of O(Q_1 n^3 + Q_2 Q_3 KL[n^2 + Ln + L^2 + ML]), where Q_1, Q_2 and Q_3 denote the number of L-BFGS-B [46] iterations performed for training the GP model, and for the outer and inner optimization loops, respectively. The ρKG algorithm is computationally intensive, as it requires solving a nested non-convex optimization problem. While we are confident that, in a wide range of settings, the sampling efficiency it provides justifies its computational cost, we also recognize that in certain not-so-expensive settings a faster algorithm is desirable. Inspired by the EI [21] and KGCP [24] acquisition functions, we propose the following acquisition function, which replaces the inner optimization problem of ρKG with a much simpler one. In the following, the set X = {x_{1:n}} collects the x components of all the solutions evaluated so far, and the set X^+ = {x_{1:n}, x_{n+1}} additionally includes the x component of the current candidate: ρKG_apx(x, w) = min_{x'∈X^+} E_n[ρ[F(x', W)]] − E_n[min_{x'∈X^+} E_{n+1}[ρ[F(x', W)]] | (x_{n+1}, w_{n+1}) = (x, w)]. We note that, like ρKG, ρKG_apx has an appealing one-step Bayes optimal interpretation; the maximizer of ρKG_apx is the one-step optimal point to evaluate if we restrict the choice of the decision to be implemented, x, to those that have been evaluated for at least one environmental condition, w. In this paper, we introduce two acquisition functions, ρKG and ρKG_apx. These acquisition functions share a nested structure, making optimization computationally challenging. Here we describe a novel two-time-scale optimization approach overcoming this challenge. If we fix the seed used to generate the fantasy model and the GP sample path, then continuity of the posterior mean and kernel implies that a small perturbation of the candidate solution shifts the sample paths only slightly. Thus, the optimal inner solutions obtained for the previous candidate should lie in a small neighborhood of the current optimal solutions, and we can use the inner solutions obtained in the previous iteration to obtain a good approximation of the acquisition function value and gradients for the current candidate. We utilize this observation by solving the inner optimization problem once every T ≈ 10 iterations, and using this solution to evaluate ρKG for the remaining T − 1 iterations. We refer to this approach as two-time-scale optimization and present a more detailed algorithmic description in the supplement. The two-time-scale idea outlined here is not limited to ρKG, and can be applied to other acquisition functions that require nested optimization, such as [24, 25, 30]. In numerical testing, the two-time-scale optimization approach did not affect the performance of either algorithm while offering significant computational savings. In Figure 1, we provide a plot of a GP model based on 6 random samples taken from a 2D Branin test function, and the corresponding acquisition function values (VOI, short for value of information) of ρKG and ρKG_apx over the candidate space X × W. In the example, we consider a CVaR objective with risk level α = 0.7 and a uniform distribution over W = {0/9, 1/9, ..., 9/9}.
Our first observation is that the VOI plots closely resemble each other, supporting our claim that ρKG_apx is a good approximation to ρKG. When we look at the implied posterior on the objective, we see a large uncertainty for large x values, and expect a large VOI to encourage exploration. When we look at the VOI plots, we see that this is indeed the case, but more so for certain w than others. Looking at the µ_n plot, we see that the subset of W contributing to the CVaR objective is where w is large, and the VOI is indeed larger in this region. Thus, for a given x, ρKG prefers a sample from a w which directly affects the value of the objective, and can efficiently search the candidate space. We observe a similar behavior in the promising region x ∈ [0.2, 0.5]. Certain w in this region both have a small uncertainty and do not directly affect the value of the objective, and thus have small VOI despite being in the promising region of the GP model. However, the w at the two ends directly affect the value of the CVaR objective in the promising region, and thus have the largest VOI in the whole candidate space. We hope that these plots and the accompanying discussion help the reader appreciate the value of considering the context variables in designing the acquisition function. In this section, we present several numerical examples that demonstrate the sampling efficiency of our algorithms. We compare our algorithms with the Expected Improvement (EI, [21]), Knowledge Gradient (KG, [25]), Upper Confidence Bound (UCB), and Max Value Entropy Search (MES, [47]) algorithms. We use the readily available implementations from the BoTorch package [48], with the default parameter values given there. For the benchmark algorithms, we fit a GP on observations of ρ[F(x, W)], which are obtained by evaluating a given x ∈ X for all w ∈ W (or a subset W̃) and then calculating the value of the risk measure on these samples. Thus, the benchmark algorithms require |W| (or |W̃|) samples per iteration, whereas ρKG and ρKG_apx require only one. We optimize each acquisition function using the L-BFGS-B [46] algorithm with 10 × (d_X + d_W) restart points. The restart points are selected from 500 × (d_X + d_W) raw samples using a heuristic. For the inner optimization problem of ρKG, we use 5 × d_X random restarts with 25 × d_X raw samples. For both ρKG and ρKG_apx, we use the two-time-scale optimization where we solve the inner optimization problem once every 10 optimization iterations. ρKG and ρKG_apx are both estimated using K = 10 fantasy GP models, and M = 40 sample paths for each fantasy model. We initialize each run of the test problems with 2d_X + 2 starting points from the X space. We evaluate F(x, w) at each of the 2d_X + 2 points for each w ∈ W (or W̃, to be specified for each problem). ρKG and ρKG_apx then use each of these evaluations to initialize their respective GP models, and the benchmark algorithms use ρ[F(x, W)] computed using these evaluations to initialize their GP models. Further details on the experiment settings are given in the supplement. The code for our implementation of the algorithms and experiments can be found at https://github.com/saitcakmak/BoRisk.

[Figure 2: The VaR objective of the Branin-Williams function. Left to right, full domain and close-ups of the optimal region. The objective function is quite flat across most of its domain, making it challenging to identify the optimal solution using a small number of evaluations.]

The first two problems we consider are synthetic test functions from the BO literature.
The first problem is the 4-dimensional Branin-Williams problem formulated in [49]. We consider the minimization of VaR at risk level α = 0.7 w.r.t. the distribution of the context variables w = (x_2, x_3). A plot of the VaR objective of the Branin-Williams function is given in Figure 2. The second problem we consider is the 7-dimensional f_6(x_c, x_e) function from [50]. We formulate this problem for minimization of CVaR at risk level α w.r.t. the distribution of the 3-dimensional environmental variable x_e. More details can be found in the supplement. In this test problem, our goal is to tune the hyper-parameters of a trading strategy so as to maximize return under risk-aversion to random environmental conditions. We use CVXPortfolio [51] to simulate and optimize the evolution of a portfolio over a period of four years using open-source market data. Each evaluation of this simulator returns the average daily return over this period of time under the given combination of hyper-parameters and environmental conditions. The details of this simulator can be found in Sections 7.1-7.3 of [51]. The hyper-parameters to be optimized are the risk and trade aversion parameters and the holding cost multiplier, over the ranges [0.1, 1000], [5.5, 8], and [0.1, 100], respectively. The environmental variables are the bid-ask spread and the borrow cost, which we assume are uniform over [10^{-4}, 10^{-2}] and [10^{-4}, 10^{-3}], respectively. For this problem, we use the VaR risk measure at risk level α = 0.8. We use a random subset W̃ of size 40 for the inner computations of our algorithms, and a random subset of size 10 for the benchmark evaluations of VaR_{0.8}. Since this simulator is indeed expensive-to-evaluate, with each evaluation taking around 3 minutes, evaluating the performance of the various algorithms becomes prohibitively expensive. Therefore, in our experiments we do not use the simulator directly. Instead, we build a surrogate function obtained as the mean function of a GP trained using evaluations of the actual simulator across 3,000 points chosen according to a Sobol sampling design [52]. In this example, we study the allocation of a joint COVID-19 testing capacity between three neighboring populations (e.g. cities, counties). The objective is to allocate the testing capacity between the populations to minimize the total number of people infected with COVID-19 over a span of two weeks. We use the COVID-19 simulator provided in [53] and discussed in [54, 55]. The simulator models the interactions between individuals within the population, the disease spread and progression, and the testing and quarantining of positive cases. Contact tracing is performed for people who test positive, and the contacts that are identified are placed in quarantine. We study three populations of size 5 × 10^4, 7.5 × 10^4, and 10^5 that share a combined testing capacity of 10^4 tests per day. The initial disease prevalence within each population is estimated to be in the range of 0.1-0.4%, 0.2-0.6% and 0.2-0.8%, respectively. We assign a probability of 0.5 to the middle of the range and 0.25 to the two extremes, independently for each population. Thus, the initial prevalences within the populations define the context variables. We pick the fractions of testing capacity allocated to the first two populations as the decision variable (the remaining capacity is allocated to the third), with the corresponding decision space X = {x ∈ ℝ²_+ : x_1 + x_2 ≤ 1}.
For the inner computations of ρKG_apx, we use the full W set; however, for the evaluations of the benchmark algorithms, we randomly sample a subset W̃ of size 10 to avoid using 27 evaluations at once. The results of the experiments are plotted in Figure 3. Evaluations reported exclude initialization, and error bars denote one standard error. In each experiment, ρKG and ρKG_apx match the performance of all benchmarks using fewer evaluations, thus demonstrating superior sampling efficiency. In the experiments, W is intentionally kept small to avoid giving ρKG an outsize advantage. If W were larger, the benchmarks would use up more evaluations per iteration, and our algorithms would provide an even larger benefit. To demonstrate the benefit of using our novel statistical model on ρ[F(x, W)], we compare two random sampling strategies. The one labeled "random" evaluates ρ[F(x, W)] at a randomly selected x and uses the corresponding GP model over X. "ρ-random", on the other hand, evaluates F(x, w) at a randomly selected (x, w), uses our statistical model, and reports arg min_x E_n[ρ[F(x, W)]] as its solution. The ability to survey X × W more freely gives "ρ-random" an edge over "random" in all experiments except for f_6(x_c, x_e), which we attribute to the size of W. Moreover, in the COVID-19 problem, we see that "ρ-random" performs better than all the benchmarks considered. This demonstrates the added value of our statistical model, and suggests an additional cheap-to-implement algorithm that is useful whenever F(x, w) is cheap enough to render typical BO algorithms too expensive. Our work is of interest whenever one needs to make a decision guided by an expensive simulator and subject to environmental uncertainties. Such scenarios arise in simulation-assisted medical decision making [4], in financial risk management [7, 56, 57], in public policy, and in disaster management [58]. The impact of our algorithms could be summarized as facilitating risk-averse decision making. In many scenarios, risk-averse approaches result in decisions that are more robust to environmental uncertainties compared to decisions resulting from common risk-neutral alternatives. For example, in a financial crisis, an earlier risk-averse decision of holding more cash and other low-risk securities might prevent large losses or even the default of a financial institution. As another example, a risk-averse decision of stockpiling excess medical supplies in non-crisis times would alleviate the shortages faced during crisis times, such as the COVID-19 pandemic we are facing today. On the negative side of things, the risk-averse decisions we facilitate may not always benefit all stakeholders. For a commercial bank, a risk-averse approach may suggest a higher credit score threshold for making loans, which might end up preventing women and minorities, who systematically have lower credit scores, from accessing much-needed credit. In our case, a failure of the system would mean a poor optimization of the objective and the recommendation of a bad solution. Implementation of a bad decision can have harmful consequences in many settings; however, we imagine that any solution recommended by our algorithms would first be evaluated using the simulator, thus preventing the implementation of said decision. Our methods do not rely on any training data, and only require noisy evaluations of the function value. Thus, it can be said that our method does not leverage any bias in any training data.
However, the solutions recommended by our algorithms are only as good as the supplied function evaluations, and thus are directly affected by any biases built into the simulator used for these evaluations. In this section, we analyze the computational complexity of using ρKG to pick the n-th candidate to evaluate. For simplicity, we ignore the multiple restart points used for optimization and present the analysis for a single restart point. The resulting complexity involves terms for the number of L-BFGS-B iterations performed; these terms can be adjusted if needed to account for the cost of multiple restart points. In this analysis, we assume that evaluating the GP prior mean and kernel functions (and the corresponding derivatives) takes O(1) time. We break down the computations into several chunks and analyze them one by one. An iteration starts by fitting the GP hyper-parameters using MAP estimation with Q_1 iterations of the L-BFGS-B [3] algorithm. Each iteration requires calculating the likelihood of the data, which requires computing A_n = Σ_0(x_{1:n}, w_{1:n}, x_{1:n}, w_{1:n}) + diag(σ²(x_1, w_1), ..., σ²(x_n, w_n)) (O(n²) kernel evaluations), its Cholesky factor and inverse (O(n³)), and solving a triangular set of equations (O(n²)), putting the total cost of fitting the GP at O(Q_1 n³). We can use the Cholesky factor of A_n and the inverse A_n^{-1} computed in the previous step to calculate the posterior mean and variance at the candidate point at a cost of O(n²), which are then used to draw K fantasy observations at a cost of O(K). Let A_{n+1} be the matrix obtained by appending to A_n one row and column corresponding to the candidate (x_{n+1}, w_{n+1}), at a cost of O(n). Using the pre-computed Cholesky factor of A_n, we can compute the Cholesky factor of A_{n+1} at a cost of O(n²). The inverse A_{n+1}^{-1} can also be obtained from A_n^{-1} at a cost of O(n²). Note that A_{n+1} is identical for each fantasy. For each fantasy model, we need to compute the posterior mean and covariance matrix for the L points on which we draw the sample paths. Given A_{n+1}^{-1}, we first compute Σ_0(x_i, w_{1:L}, x_{1:n+1}, w_{1:n+1}) A_{n+1}^{-1} at a cost of O(Ln²), then use it to compute the mean vector and the covariance matrix at a total cost of O(L²n). To sample from the multivariate normal distribution, we also need the Cholesky factor of the covariance matrix, which is obtained at a cost of O(L³). Repeating this for each fantasy, we obtain all K mean vectors, covariance matrices and Cholesky factors at a total cost of O(KLn² + KL²n + KL³). Given the mean vector and the Cholesky factor of the covariance matrix, we can generate each sample path with a matrix-vector product at a cost of O(L²), i.e., O(KML²) across all K fantasies and M sample paths per fantasy. If we then use Q_2 iterations of L-BFGS-B to optimize the acquisition function, we end up with a total cost of O(Q_2 Q_3 KL[n² + Ln + L² + ML]) for optimization, and a total cost of O(Q_1 n³ + Q_2 Q_3 KL[n² + Ln + L² + ML]) to pick the n-th candidate to evaluate, including the GP fitting.
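To make the incremental factorization step concrete, the following is a minimal sketch (helper name ours) of extending a Cholesky factor by one row and column in O(n²), as used above when conditioning on the candidate observation:

```python
import numpy as np
from scipy.linalg import solve_triangular

def chol_append(L_prev, b, c):
    """Extend the Cholesky factor of A_n to that of A_{n+1} in O(n^2).

    A_{n+1} = [[A_n, b], [b^T, c]] is A_n with one row and column appended,
    as when conditioning the GP on one additional observation; L_prev is the
    lower-triangular Cholesky factor of A_n.
    """
    l = solve_triangular(L_prev, b, lower=True)  # forward substitution, O(n^2)
    d = np.sqrt(c - l @ l)                       # new diagonal entry
    n = L_prev.shape[0]
    out = np.zeros((n + 1, n + 1))
    out[:n, :n] = L_prev
    out[n, :n] = l
    out[n, n] = d
    return out

# Quick check against a direct O(n^3) factorization.
rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4))
A = X @ X.T + 6.0 * np.eye(6)  # SPD matrix
assert np.allclose(chol_append(np.linalg.cholesky(A[:5, :5]), A[:5, 5], A[5, 5]),
                   np.linalg.cholesky(A))
```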
Among the state-of-the-art Bayesian optimization algorithms, knowledge gradient (KG) algorithms are on the more expensive side due to the inherent nested optimization structure. Although the added computational cost is typically justified by the superior sampling efficiency they offer (see [1, 2] for numerical comparisons), reducing the computational cost would make KG algorithms applicable in much broader settings. To reduce the computational cost of KG optimization, [2] proposes a one-shot approach for optimizing the KG acquisition function, which converts the d_X-dimensional nested optimization problem into a (non-nested) (K+1)d_X-dimensional optimization problem. The objective function of the one-shot problem is substantially cheaper to evaluate, as it does not require solving an optimization problem. Moreover, this new problem is deterministic and smooth, allowing the use of efficient gradient-based algorithms such as L-BFGS-B. However, the one-shot problem is again non-convex, and its large dimension makes it challenging to find a global optimal solution. Using a gradient-based approach, the one-shot problem is optimized locally from a given set of restart points, resulting in local solutions to the inner problems, rather than the global solution one would get with a nested approach. This does not cause any significant issues if one has a large enough K and the majority of the inner solutions are at their global optimum; however, this requires starting with a large enough number of raw samples. When evaluations of the objective are cheap, as in the classical setting, one can afford to evaluate the one-shot acquisition function at a large number of (carefully crafted) raw samples, and select a good set of restart points for optimization. As K increases, both the number of raw samples and the number of restart points need to increase as well. When the acquisition function evaluations (even without the inner optimization) are significantly more expensive than in the classical setting, one is restricted to a relatively small number of raw samples, which requires using a small K (< 10) to achieve good performance within a reasonable time. If one uses an insufficient number of raw samples to generate the restart points, it is possible to end up with a bad candidate whose inner solutions are all at their global optima, and thus has a very large acquisition function value, while at the same time the optimal candidate has bad local inner solutions, and thus a relatively small acquisition function value. In such a scenario, the one-shot approach will recommend the bad candidate with the large acquisition function value, leading to a poor sampling decision. This so-called instability is what became an issue with the one-shot implementation of ρKG in our preliminary testing, and led to a search for a stable alternative. In this paper, we propose two-time-scale (TTS) optimization, a novel approach for optimizing KG-type acquisition functions that addresses the issues raised here about the one-shot method while still providing significant computational savings. The main idea behind TTS optimization is that the objective of the inner optimization problem is a continuous function of the candidate, and the optimal inner solution changes only a small amount between iterations when we update the candidate using a gradient-based algorithm.

Algorithm 1: TTS optimization of ρKG
Input: Q_2, Q_3, M, K, L, α, starting candidate (x_0, w_0), and TTS-frequency T.
for t = 0 to Q_2 − 1 do
    Generate the fantasy models GP_f^i, i = 1, ..., K, using the candidate (x_t, w_t).
    for i = 1 to K do
        if t mod T = 0 then
            Generate a set of random restart points for optimization of the i-th inner problem, including the previous solution x_i^T if available.
            Use Q_3 iterations of the multi-start L-BFGS-B algorithm to optimize the i-th inner problem, and set x_i^T as the optimizer.
        end if
        Return x_i^T as the solution to the i-th inner problem.
    end for
    Compute the estimator (9) at the inner solutions x_i^T, i = 1, ..., K, as the gradient at the current candidate.
    Do an iteration of L-BFGS-B using this gradient to obtain the new candidate (x_{t+1}, w_{t+1}).
end for
Return: (x_{Q_2}, w_{Q_2}) as the next candidate to evaluate.
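A compact sketch of Algorithm 1's control flow is given below; plain gradient descent stands in for L-BFGS-B to keep the sketch short, and both callables are hypothetical stand-ins rather than the paper's implementation:

```python
import numpy as np

def tts_optimize(solve_inner, grad_at, u0, T=10, n_iters=80, step=0.01):
    """Two-time-scale (TTS) sketch of Algorithm 1.

    solve_inner(u) -> inner solutions x*_1..x*_K for candidate u (expensive;
                      refreshed on the slow time scale, once every T iterations)
    grad_at(u, inner) -> gradient estimator (9) at u with the inner solutions
                         held fixed (cheap; fast time scale)
    """
    u = np.asarray(u0, dtype=float)
    inner = None
    for t in range(n_iters):
        if t % T == 0:                        # slow time scale: re-solve inner problems
            inner = solve_inner(u)
        u = u - step * grad_at(u, inner)      # fast time scale: update the candidate
    return u
```

With T = 1 this reduces to standard nested optimization; larger T trades inner-solution freshness for roughly a factor-of-T cost reduction.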
Thus, the solution to the inner problem obtained with the previous candidate provides a good approximation of the acquisition function value and derivatives. TTS optimization leverages this observation to provide computational savings by solving the inner optimization problem once every T iterations. We provide a detailed description of TTS in Algorithm 1. We note that L-BFGS-B performs multiple line searches within an iteration and evaluates the solution found at the end of each line search. The description of Algorithm 1 treats each line search as an iteration of L-BFGS-B, and the candidate (x_t, w_t) is updated at the end of each line search. This essentially means that we optimize the inner solutions at every T-th evaluation of ρKG, and not necessarily at every T-th iteration of L-BFGS-B. In the algorithm description, the parameter T is the TTS-frequency that determines how often the inner problem is solved. If T = 1, TTS optimization reduces to the standard nested optimization. In numerical testing, we observed that TTS optimization reduces the cost of optimizing ρKG roughly by a factor of T. In our experiments, we used T = 10 without any noticeable change in algorithm performance. Notice that the only ρKG-specific parts of TTS optimization are the inner optimizer and the resulting derivative. By modifying these parts, TTS optimization can be used with many BO algorithms that involve a nested optimization, including all the knowledge gradient algorithms. We mentioned earlier that TTS optimization addresses the instability we faced with the one-shot optimization. Going back to our example with one good and one bad candidate, TTS, by periodically solving the inner problem globally for each candidate, ensures that each candidate's acquisition function value is calculated using a global solution, reducing the chances of misidentifying the bad candidate as the better one. We conclude by noting that the TTS implementation presented here is a primitive one, and leaves considerable room for improvement. We would also like to point out that our discussion of one-shot optimization is mostly an observational one, based on certain shortcomings we faced. An in-depth comparison and further development of the two optimization methods is left as a subject for future research.

• Common Parameters: Each algorithm is optimized using the L-BFGS-B algorithm [3] (except for Covid-19, where we use SLSQP [?] due to the linear constraint on the decision variables), with 10 × (d_X + d_W) random restart points, which are selected from 500 × (d_X + d_W) raw samples. We use low-discrepancy Sobol sequences [4] to generate the raw samples. The restart points are selected using a heuristic implemented in the BoTorch package [2], which assigns each raw sample a weight that is an exponential factor of its acquisition function value, then selects the restart points with probability proportional to this weight.
• EI: Does not have any specific parameters.
• MES: A set of 500 × (d_X + d_W) randomly drawn points is used to discretize the design space, and the max values are sampled over this discretization using the Gumbel approximation. We draw 10 max value samples over the discretization, and use 128 fantasies to compute the approximate information gain from using the candidate.
• KG: We use the one-shot KG implementation with 64 fantasies.
• UCB: We use the parameter β = 0.2, i.e. the acquisition function is given by UCB(x) = µ_n(x) + sqrt(0.2 Σ_n(x, x)).
• ρKG and ρKG_apx: We use K = 10 fantasy models and M = 40 sample paths per fantasy. For the inner optimization problem of ρKG, we use 5 × d_X random restart points selected from 50 × d_X raw samples. We use TTS optimization with T = 10. All fantasy models and sample paths are generated using low-discrepancy Sobol sequences [4].

For our algorithms, we observed a small improvement when increasing the number of fantasy models; however, this comes with a linear increase in run time and memory usage. The number of sample paths was chosen rather arbitrarily, since the majority of the computational effort is spent calculating the Cholesky factor, which is then used for generating all sample paths. We observed very similar performance using M = 4 and M = 40, and went with the larger value since it only increased the run time by about 50%. We use the "SingleTaskGP" model from the BoTorch package [2] with the given priors on the hyper-parameters, which, quoting from the code base, "... work best when covariates are normalized to the unit cube and outcomes are standardized (zero mean, unit variance).". We use the Matérn 5/2 kernel and fit the hyper-parameters using MAP estimation. We project the function domain X × W to the unit hypercube [0, 1]^{d_X + d_W}, and all calculations are done on the unit hypercube. For calculations involving the GP, we use the "Standardize" transformation available in BoTorch, which transforms the outcomes (observations) into zero-mean, unit-variance observations for better compatibility with the hyper-parameter priors.
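A minimal construction of this model setup is sketched below; the fitting entry point varies across BoTorch versions (fit_gpytorch_mll in recent releases, fit_gpytorch_model in older ones), and the random data here is purely illustrative:

```python
import torch
from botorch.models import SingleTaskGP
from botorch.models.transforms.outcome import Standardize
from botorch.fit import fit_gpytorch_mll  # fit_gpytorch_model in older releases
from gpytorch.mlls import ExactMarginalLogLikelihood

# Inputs normalized to the unit cube [0, 1]^(d_X + d_W); outcomes of shape (n, 1).
train_X = torch.rand(20, 3, dtype=torch.double)
train_Y = torch.randn(20, 1, dtype=torch.double)

# SingleTaskGP uses a Matern 5/2 kernel with the package's default
# hyper-parameter priors; Standardize maps outcomes to zero mean, unit variance.
model = SingleTaskGP(train_X, train_Y, outcome_transform=Standardize(m=1))
mll = ExactMarginalLogLikelihood(model.likelihood, model)
fit_gpytorch_mll(mll)  # MAP fitting of the GP hyper-parameters
```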
[Figure 1: The VaR objective of the Branin-Williams function. Left to right, full domain and close-ups of the optimal region. The objective function is quite flat across most of its domain, making it challenging to identify the optimal solution using a small number of evaluations.]

The first two problems we use are synthetic test functions from the BO literature. The first problem is the 4-dimensional Branin-Williams problem formulated in [5], defined for x ∈ [0, 1]^4. The function is observed with additive Gaussian noise with standard deviation 10. The decision variables are taken to be x_1 and x_4, and the context variables are w = (x_2, x_3). The set W of size 12 and the corresponding probability distribution are provided in the supplementary material. The problem is originally formulated for the expectation objective; however, here we consider the minimization of VaR at risk level α = 0.7. A plot of the VaR objective of the Branin-Williams function is given in Figure 1. The second problem we consider is the f_6(x_c, x_e) function from [6]. To initialize each experiment, we draw 2d_X + 2 random samples x_i ∈ X, and for each x_i we evaluate F(x_i, w), w ∈ W̃, where W̃ is the set corresponding to the benchmark algorithms as described below. For ρKG and ρKG_apx, we use these (2d_X + 2) × |W̃| evaluations for initialization, and for the benchmarks we use the 2d_X + 2 values of ρ[F(x_i, W̃)] calculated using these observations for initialization. These random samples are redrawn for each replication of the experiments, and are synchronized across different algorithms, i.e. the k-th replication of each experiment uses the same set of samples. In what follows, we describe the use of W̃ in each experiment. Note that all the random sets W̃ are independently redrawn at each iteration. 1. Branin-Williams: The set W and the corresponding probability distribution are given in Table 1. 2. f_6(x_c, x_e): For the inner computation of ρKG_apx, we use the full set W. For evaluations of the benchmarks, we draw a random sample W̃ of size 10 and use it to calculate CVaR[F(x, W̃)], thus using 10 evaluations per iteration. 3. Covid-19: The output of the Covid-19 simulator is stochastic with a large variation. To enable a fair comparison between the algorithms, we fix a set of 10 seeds, and each function evaluation is done using a randomly selected seed from this set. The reported performance is computed as the average performance over all 10 seeds. We ran the benchmark algorithms EI, MES, KG, UCB and random for 100 replications in each experiment. For ρ-random, we ran 100 replications in all experiments except for Covid-19, where we used 50 replications. We ran 10 replications of ρKG in all experiments except Covid-19, and we ran 20 replications of ρKG_apx in all experiments except Covid-19, where we ran 30 replications. In the main body of the paper, we present the results using a moving average window of 3 iterations, to smooth out the oscillations and make the plots more readable. In Figure 2, we present the non-smoothed plots to address any questions this may raise. The comparatively large variance of our algorithms is due to: (i) having roughly 10 times as many data points; (ii) having fewer replications than the benchmark algorithms. We start with a simple result showing that the value of information defined by ρKG is non-negative. The rest of the section is dedicated to showing that the derivative estimators we propose in this paper are asymptotically unbiased and consistent. Here, we only present the proofs for the case of d_W = 1. The derivative of a GP is again a GP (cf. [7] and Section 9.4 of [8]). If we model F(x) as a GP with mean and covariance functions µ, Σ, then F and ∇F follow a multi-output GP with mean and covariance functions given by [7] µ̃(x) = (µ(x), ∇µ(x)) and Σ̃(x, x') = [Σ(x, x'), ∇_{x'}Σ(x, x'); ∇_x Σ(x, x'), ∇²Σ(x, x')], where ∇Σ and ∇²Σ are the Jacobian and Hessian of the covariance function. We note that, although this result is found in the literature without any assumptions on the GP, it in fact requires the GP to possess differentiable sample paths. For example, Brownian motion is a well-known example of a GP that has nowhere-differentiable sample paths, and thus violates this framework. To this end, we refer the reader to [9] for precise conditions on the differentiability of GP sample paths, and assume that these conditions are satisfied. Remark 1. Before discussing the assumptions below, we note that the assumptions listed here are more restrictive than needed. The main results we build on require certain conditions on the distribution of the function in a neighborhood of its VaR. However, when the function is drawn from a GP, it is impossible to know where this neighborhood, which would be different for each draw, might be, and thus it is impossible to impose assumptions on such an unknown and ever-changing neighborhood. The main purpose of the results presented here is to show that there indeed exists a set of conditions under which our claims hold. However, we are confident that our algorithms are applicable in much broader settings where no such assumption can be verified.
We assume that the kernel Σ̃ of the joint GP of the GP and its derivative is strictly positive definite, and that the random context variable W admits a strictly positive and continuous density p(w) over its domain. Observe that when the kernel is strictly positive definite, the sample paths of the derivative GP are non-zero w.p.1, i.e. the equation ∂_w F(x, w) = 0 has at most countably many solutions, where F(·, ·) denotes a sample drawn from the GP. Lemma 1. Under the conditions outlined above, for any t such that P_W(F(x, W) ≤ t) ∈ (0, 1), the CDF P_W(F(x, W) ≤ t) is continuously differentiable w.r.t. both x and t, and the density ∂_t P_W(F(x, W) ≤ t) is strictly positive for each x. Proof. Since there are at most countably many points where the derivative is zero, we can construct distinct intervals W_i, i ∈ I, where F(x, w) is monotone (in w) in each W_i, and the W_i are such that P_W(∪_{i∈I} W_i) = 1 and P_W(W_i) > 0, i ∈ I. Then P_W(F(x, W) ≤ t) = Σ_{i∈I} P_W(F(x, W) ≤ t, W ∈ W_i), where the interchange of derivative and summation used below is justified as the sum is bounded. From here on, we ignore the W_i that do not produce any root of the equation F(x, w) = t, as the corresponding derivatives are zero. Since F(x, w) is continuously differentiable and monotone in w in each W_i, by the inverse function theorem [10], there exists a continuously differentiable inverse function F_i^{-1}(t; x) mapping t to w ∈ W_i for a given x. For the remaining W_i, define the sets I^+ and I^-, where I^+ is the set of W_i that have F_i^{-1}(t; x) as the upper bound of the corresponding integral, and I^- the set of those with F_i^{-1}(t; x) as the lower bound. For i ∈ I^+, let w_i^- be the lower bound of the integral, and similarly let w_i^+ be the upper bound for i ∈ I^-. Then ∂_x P_W(F(x, W) ≤ t) = Σ_{i∈I^+} p(F_i^{-1}(t; x)) ∂_x F_i^{-1}(t; x) − Σ_{i∈I^-} p(F_i^{-1}(t; x)) ∂_x F_i^{-1}(t; x), where the result follows by the Leibniz rule. Applying the same steps with ∂_t, we get ∂_t P_W(F(x, W) ≤ t) = Σ_{i∈I^+} p(F_i^{-1}(t; x)) ∂_t F_i^{-1}(t; x) − Σ_{i∈I^-} p(F_i^{-1}(t; x)) ∂_t F_i^{-1}(t; x). Since p(w) is continuous and F_i^{-1}(t; x) is continuously differentiable, P_W(F(x, W) ≤ t) is continuously differentiable w.r.t. both x and t. Since p(w) is strictly positive and the derivative is non-zero, it follows that ∂_t P_W(F(x, W) ≤ t) is strictly positive, thus completing the proof. Proposition 1. Under the conditions outlined above, for α ∈ (0, 1), as M, L → ∞, (1/M) Σ_{j=1}^M ∇v_j(x) →_p ∇_x E_{n+1}[VaR_α[F(x, W)]], where →_p denotes convergence in probability. Moreover, the gradient ∇_x E_{n+1}[VaR_α[F(x, W)]] is continuous. Proof. As a sample path of the GP, F(x, w)(ω) is continuously differentiable w.r.t. w w.p.1. By Lemma 1, the CDF P_W(F(x, W) ≤ t) is continuously differentiable w.r.t. both x and t, and the density ∂_t P_W(F(x, W) ≤ t) is strictly positive. Then, by Lemma 2 of [11], ∇_x VaR_α[F(x, W)] = E[∇_x F(x, W) | F(x, W) = VaR_α[F(x, W)]]. Since the gradient is also a GP defined on a compact space, ∇_x F(x, w)(ω) is bounded w.p.1, and so is ∇_x VaR_α[F(x, W)]. Then, by Proposition 1 of [12], ∇v_j(x) is an asymptotically unbiased estimator of ∇_x E_{n+1}[VaR_α[F(x, W)]] as L → ∞. F(x, w) is bounded by the same argument as ∇_x F(x, w), and thus has finite second moment. Therefore, the main result follows by an application of Chebyshev's inequality. The continuity of ∇_x E_{n+1}[VaR_α[F(x, W)]] follows by Theorem 1 of [11] and the fact that both gradients are continuous by Lemma 1. Before going into the main result, we discuss the mapping from the candidate point (x, w) to the sample F̃_ij(·, ·). We start with the mapping from the candidate to the mean and covariance functions of GP_f^i. Let Z_i be a fixed realization of the standard normal random variable used to generate the fantasy.
The mean and covariance functions of GP_f^i are given by µ_f^i(x', w') = µ_n(x', w') + Σ_n(x', w', x, w) Z_i / (Σ_n(x, w, x, w) + σ²(x, w))^{1/2} and Σ_f^i(x', w', x'', w'') = Σ_n(x', w', x'', w'') − Σ_n(x', w', x, w) Σ_n(x, w, x'', w'') / (Σ_n(x, w, x, w) + σ²(x, w)), where σ²(x, w) is the known observation noise level (see [13] for more details). Assuming that the prior mean and covariance functions are continuously differentiable, µ_f^i and Σ_f^i are continuously differentiable w.r.t. the candidate (x, w). For a given x_i and finite set W̃ = {w_{1:L}}, and a realization of the L-dimensional standard normal random vector Z_L^{ij}, the j-th sample path of the i-th fantasy model is generated as F̃_ij(x_i, w_{1:L}) = µ_f^i(x_i, w_{1:L}) + C(x_i, w_{1:L}) Z_L^{ij}, (3) where C(x_i, w_{1:L}) is the Cholesky factor of the L × L covariance matrix Σ_f^i(x_i, w_{1:L}, x_i, w_{1:L}). Since Σ_f^i is continuously differentiable w.r.t. the candidate, so is C(x_i, w_{1:L}), making the sample F̃_ij(·, ·) continuously differentiable w.r.t. the candidate (x, w). The discussion above establishes continuous differentiability of the sample F̃_ij(·, ·) w.r.t. the candidate (x, w) only for a finite collection of points (x_i, w_{1:L}). However, to be able to use the analysis of Proposition 1, we need continuous differentiability of F̃_ij(x, w), w ∈ W, which, with W being a continuous random variable, is an infinite-dimensional random variable. The concept of an infinite-dimensional standard Gaussian random variable, or a standard Gaussian random function, is not well defined, and we can no longer use equation (3). In the proposition below, we assume that F̃(·, ·) is indeed differentiable w.r.t. the candidate (x, w) and hope that this discussion justifies the assumption. Below, we use u := (x, w) as the candidate to simplify the notation. Proposition 2. Under the conditions outlined above, and assuming that the solution to the inner problem x^* is unique w.p.1, the estimators given by (9) are asymptotically unbiased and consistent estimators of the gradient of the ρKG objective (7).
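For concreteness, a NumPy sketch of the fantasy-model update and sample-path generation described above; the posterior callables mu_n and Sigma_n are illustrative stand-ins for a fitted GP, not the paper's implementation:

```python
import numpy as np

def fantasy_sample_path(mu_n, Sigma_n, cand, sigma2, x_i, w_grid, rng):
    """One fantasy model GP_f^i and one sample path, following the update above.

    mu_n(pts) and Sigma_n(pts_a, pts_b) are assumed callables evaluating the
    time-n posterior mean and covariance at lists of (x, w) points; `cand` is
    the candidate (x, w), and sigma2 its observation-noise variance.
    """
    pts = [(x_i, w) for w in w_grid]
    s = Sigma_n(pts, [cand]).ravel()                  # Sigma_n(., x, w), shape (L,)
    v = Sigma_n([cand], [cand]).item() + sigma2       # predictive variance at cand
    z_i = rng.standard_normal()                       # fantasy draw Z_i
    mu_f = mu_n(pts) + s * z_i / np.sqrt(v)           # fantasy mean mu_f^i
    Sigma_f = Sigma_n(pts, pts) - np.outer(s, s) / v  # fantasy covariance Sigma_f^i
    C = np.linalg.cholesky(Sigma_f + 1e-10 * np.eye(len(pts)))  # jitter for stability
    return mu_f + C @ rng.standard_normal(len(pts))   # sample path, equation (3)
```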
References
[1] Practical bayesian optimization of machine learning algorithms
[2] Benchmark and Survey of Automated Machine Learning Frameworks
[3] The knowledge-gradient algorithm for sequencing experiments in drug discovery
[4] Optimization of computationally expensive simulations with gaussian processes and parameter uncertainty: Application to cardiovascular surgery
[5] Efficient tuning of online systems using Bayesian optimization
[6] Using simulation to improve sample-efficiency of bayesian optimization for bipedal robots
[7] Regulatory Evaluation of Value-at-Risk Models
[8] Robust treatment planning with conditional value at risk chance constraints in intensity-modulated proton therapy
[9] A risk-based modeling approach for radiation therapy treatment planning under tumor shrinkage uncertainty
[10] Quantile regression: A statistical tool for out-of-hospital research
[11] Risk measures for natural resource management: Description, simulation testing, and r code with fisheries examples
[12] Risk aversion to parameter uncertainty in markov decision processes with an application to slow-onset disaster relief
[13] Optimization of risk measures
[14] Robust Optimization
[15] Theory and applications of robust optimization
[16] Risk Management and Quantitative Approaches in Finance
[17] A bayesian risk approach to data-driven stochastic optimization: Formulations and asymptotics
[18] Robust portfolio optimization with derivative insurance guarantees
[19] Financial Risk Manager Handbook
[20] Monte carlo methods for value-at-risk and conditional value-at-risk: A review
[21] Efficient global optimization of expensive black-box functions
[22] The knowledge-gradient policy for correlated normal beliefs
[23] Predictive entropy search for bayesian optimization with unknown constraints
[24] The correlated knowledge gradient for simulation optimization of continuous parameters using gaussian process regression
[25] The parallel knowledge gradient method for batch bayesian optimization
[26] Parallel Bayesian Global Optimization of Expensive Functions
[27] Freeze-Thaw Bayesian Optimization
[28] Bayesian optimization with gradients
[29] A tutorial on bayesian optimization
[30] Bayesian optimization with expensive integrands
[31] Distributionally robust bayesian optimization
[32] Adversarially robust optimization with gaussian processes
[33] Coherent measures of risk
[34] Conditional value-at-risk for general loss distributions
[35] Coherent risk measures on general probability spaces
[36] The fundamental risk quadrangle in risk management, optimization and statistical estimation
[37] Value at Risk: The New Benchmark for Managing Financial Risk
[38] Basel Committee on Banking Supervision
[39] Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning)
[40] The parallel knowledge gradient method for batch bayesian optimization
[41] Multi-information source optimization
[42] Technical note: on estimating quantile sensitivities via infinitesimal perturbation analysis
[43] Simulating sensitivities of conditional value at risk
[44] Estimating security price derivatives using simulation
[45] Envelope theorems for arbitrary choice sets
[46] Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization
[47] Max-value entropy search for efficient Bayesian optimization
[48] BoTorch: Programmable Bayesian Optimization in PyTorch
[49] Sequential design of computer experiments to minimize integrated response functions
[50] Worst-case global optimization of black-box functions through kriging and relaxation
[51] Multi-period trading via convex optimization
[52] Scrambling sobol' and niederreiter-xing points
[53] Group Testing
[54] Feasibility of COVID-19 Screening for the U.S. Population with Group Testing
[55] Here's A Way To Contain Covid-19 And Reopen The Economy In As Little As One Month
[56] Quantitative Risk Management: Concepts, Techniques and Tools
[57] Nested simulation in portfolio risk measurement
[58] The role of simulation and modeling in disaster management

Supplementary references (numbered separately in the supplement)
The parallel knowledge gradient method for batch bayesian optimization
BoTorch: Programmable Bayesian Optimization in PyTorch
Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization
Scrambling sobol' and niederreiter-xing points
Sequential design of computer experiments to minimize integrated response functions
Worst-case global optimization of black-box functions through kriging and relaxation
Bayesian optimization with gradients
Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning)
Principles of Mathematical Analysis
Technical note: on estimating quantile sensitivities via infinitesimal perturbation analysis
Estimating security price derivatives using simulation
The parallel knowledge gradient method for batch bayesian optimization
Envelope theorems for arbitrary choice sets
Simulating sensitivities of conditional value at risk

Acknowledgments. The authors gratefully acknowledge the support by the National Science Foundation under Grants CAREER CMMI-1453934 and CCF-1740822, and the Air Force Office of Scientific Research under Grants FA9550-19-1-0283 and FA9550-15-1-0038.