Bayesian Optimization of Function Networks
Raul Astudillo, Peter I. Frazier

We consider Bayesian optimization of the output of a network of functions, where each function takes as input the output of its parent nodes, and where the network takes significant time to evaluate. Such problems arise, for example, in reinforcement learning, engineering design, and manufacturing. While the standard Bayesian optimization approach observes only the final output, our approach delivers greater query efficiency by leveraging information that the former ignores: intermediate output within the network. This is achieved by modeling the nodes of the network using Gaussian processes and choosing the points to evaluate using, as our acquisition function, the expected improvement computed with respect to the implied posterior on the objective. Although the non-Gaussian nature of this posterior prevents computing our acquisition function in closed form, we show that it can be efficiently maximized via sample average approximation. In addition, we prove that our method is asymptotically consistent, meaning that it finds a globally optimal solution as the number of evaluations grows to infinity, thus generalizing previously known convergence results for the expected improvement. Notably, this holds even though our method might not evaluate the domain densely, instead leveraging problem structure to leave regions unexplored. Finally, we show that our approach dramatically outperforms standard Bayesian optimization methods in several synthetic and real-world problems.

We consider Bayesian optimization (BO) of objective functions defined by a series of time-consuming-to-evaluate functions, f_1, ..., f_K, arranged in a directed acyclic network, so that each function takes as input the output of its parent nodes. As we detail below, these problems arise in BO-based policy search in reinforcement learning (Lizotte et al., 2007), optimization of complex systems modeled via simulation, and calibration of time-consuming physics-based models. To illustrate, we introduce a running example of vaccine manufacturing (Sekhon and Saluja, 2011), focusing on the portion of the manufacturing process that uses live cells to produce proteins needed in a vaccine. It begins with a cell culture, in which living cells are grown and used as "factories" to produce proteins. This process is controlled by a vector, x_1, containing the temperature, pH, and CO2 content used when growing these cells. The output of this process is the quantity of the desired protein, y_1 = f_1(x_1), i.e., the yield of this step, along with other byproducts. This output is passed into a second process, purification, which removes byproducts and is controlled by a vector x_2 comprising temperature, pressure, and flow rate. The yield of the desired protein from this second step is y_2 = f_2(x_2, y_1). This output enters a third step, formulation, in which we formulate the raw protein into a form that can be distributed, controlled by a third vector of parameters, x_3. This determines the yield of the overall process, y_3 = f_3(x_3, y_2). We wish to choose (x_1, x_2, x_3) to maximize overall protein yield. This problem is summarized as a function network in Figure 1.
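To make the running example concrete, here is a minimal Python sketch of the three-stage network. The yield functions f1, f2, and f3 below are hypothetical stand-ins (in a real pipeline each would be a time-consuming experiment or simulator); the key point is that evaluating the network exposes the intermediate yields y1 and y2 in addition to the final yield y3.

```python
import numpy as np

# Hypothetical stand-ins for the expensive stage functions f1, f2, f3.
def f1(x1):  # cell culture: x1 = (temperature, pH, CO2)
    return np.exp(-np.sum((x1 - 0.5) ** 2))  # protein quantity y1

def f2(x2, y1):  # purification: bounded above by its input yield y1
    return y1 / (1.0 + np.sum((x2 - 0.3) ** 2))

def f3(x3, y2):  # formulation: likewise cannot create new protein
    return y2 / (1.0 + np.sum((x3 - 0.7) ** 2))

def evaluate_network(x1, x2, x3):
    y1 = f1(x1)
    y2 = f2(x2, y1)
    y3 = f3(x3, y2)
    # Standard BO sees only y3; the method in this paper also uses y1, y2.
    return y1, y2, y3
```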
The problem described above and other similar problems can be tackled with Bayesian optimization (BO), which has been shown to perform well compared to other derivative-free global optimization methods for time-consuming-to-evaluate objective functions (Snoek et al., 2012; Frazier, 2018). A standard BO algorithm would fit a Gaussian process (GP) (Rasmussen and Williams, 2006) model on the objective function (y_3, which depends on (x_1, x_2, x_3)) and use it, along with an acquisition function, to sequentially choose the points to evaluate. Under this standard approach, however, evaluations of the intermediate nodes, f_1, ..., f_{K-1}, would be ignored despite being available when computing the objective function. In the example above, this corresponds to looking only at the yield of the overall process, and not of each individual step.

Figure 1: Vaccine manufacturing as a function network. Protein y_1 = f_1(x_1) is created, then purified with yield y_2 = f_2(x_2, y_1), and formulated with yield y_3 = f_3(x_3, y_2). The goal is to find (x_1, x_2, x_3) that maximizes y_3.

In this paper, we introduce a novel BO approach that leverages function network structure for substantially more efficient optimization. This approach models the individual nodes of the network using distinct GPs. This allows incorporating observations of each node's output recursively into a non-Gaussian posterior on the network's overall output. Our approach then chooses the points to evaluate using the expected improvement (Jones et al., 1998) computed with respect to this implied posterior on the objective function. The non-Gaussian nature of this posterior prevents the expected improvement from having a closed form. However, we show that it can still be efficiently maximized via sample average approximation (Kleywegt et al., 2002). Our approach can outperform standard BO by leveraging information from internal nodes unavailable to standard methods. We briefly explain one way this can happen, in the context of the example above. In vaccine manufacturing, each function f_k(x_k, y_{k-1}), k = 2, 3, is bounded between 0 and y_{k-1} because new protein cannot be created in the purification and formulation stages. Moreover, application experts have a prior on what values of f_k(x_k, y_{k-1}) / y_{k-1} should be achievable if x_k is set well. Thus, if we see that y_3 is unexpectedly poor, information from intermediate nodes can be extremely valuable: if y_1 and y_2 are as expected, then this suggests the problem is with x_3; if y_1 is as expected but y_2 is poor, then the problem is likely with x_2; and if y_1 is poor, then the problem is likely with x_1. (If there is a problem with x_k, there may also be a problem with x_{k'}, k' > k, but we can focus on fixing x_k first.) Thus, by observing intermediate nodes, we can instantly reduce the effective dimensionality of the input space by a factor of K = 3. We show that our method is asymptotically consistent, i.e., that it discovers the global optimum given sufficiently many samples. Remarkably, in contrast with most BO methods, it may do so without measuring densely over the feasible domain, instead leveraging function network structure to exclude regions as unnecessary to explore. This indicates the power of function network structure to improve query efficiency.
We demonstrate through numerical experiments that access to the additional information available in a problem formulated as a function network can dramatically accelerate optimization. We study four synthetic problems and four real-world problems: a manufacturing problem similar in spirit to the vaccine example above, an active learning problem with a robotic arm, and two problems arising in epidemiology, one calibrating an epidemic model and the other designing a testing strategy to control the spread of COVID-19. Our method significantly outperforms competing methods that utilize less information, in some cases by ∼5% and in other cases by several orders of magnitude.

Our work lies within BO, a framework for global optimization of expensive-to-evaluate black-box functions that originated with the work of Zhilinskas (1975) and Močkus (1975), and has recently become popular due to its remarkable performance in hyperparameter tuning of machine learning algorithms (Snoek et al., 2012; Swersky et al., 2013; Wu et al., 2019). We refer the reader to Shahriari et al. (2016) and Frazier (2018) for modern introductions to BO. Our approach can be catalogued as a grey-box BO method, since it does not treat the objective function entirely as a black box, and instead exploits known structure to improve sampling efficiency. Other examples of grey-box BO approaches include multi-fidelity BO (Kandasamy et al., 2017; Wu et al., 2019), which leverages cheaper approximations of the objective function; BO of objective functions that are the integral of an expensive-to-evaluate integrand (Williams et al., 2000; Toscano-Palmerin and Frazier, 2018; Cakmak et al., 2020); BO of objective functions that are a sum of squared errors (Uhrenholt and Jensen, 2019); and, more generally, BO of objective functions that are the composition of an expensive-to-evaluate inner function and a known inexpensive-to-evaluate outer function (Astudillo and Frazier, 2019), of which our work can be seen as a significant generalization. We refer the reader to Astudillo and Frazier (2021) for a tutorial on grey-box BO. Our work is also closely related to Marque-Pucheu et al. (2019), which proposes a method for efficient sequential experimental design of nested computer codes, also using GPs. That work focuses on the case where there are only two node functions, one of which takes as input the output of the other. In contrast with our work, the goal of that method is to learn the output of the nested code as accurately as possible within a limited evaluation budget; optimization is not pursued. Optimization of composite (a.k.a. nested) functions has also been considered in the gradient-based optimization literature (Shapiro, 2003; Drusvyatskiy and Paquette, 2019; Charisopoulos et al., 2021; Balasubramanian et al., 2020). In contrast with ours, these works assume that evaluations are inexpensive, and often also assume some form of convexity, along with the availability of gradients. However, this literature is similar in spirit to ours in that information from inner functions improves efficiency. Function networks arise in many application areas. While these applications have not, to our knowledge, been previously formulated as specific instances of the general function network model we propose, their literatures are nonetheless relevant to our work.
• In engineering and aerospace design, function networks arise in multidisciplinary optimization (Cramer et al., 1994; Amaral et al., 2014; Benaouali and Kachel, 2019), where simulators focusing on different physical laws are coupled into a function network. For example, a simulation of airflow over a wing may output the forces on the wing to another simulation of mechanical stress on the wing structure.
• In drug discovery and materials design, function networks arise from the sequence-structure-function (Sadowski and Jones, 2009) and composition-structure-property (Hattrick-Simpers et al., 2016) paradigms. Here, decision variables (composition, e.g., what fraction of what raw materials are used) determine the structure of the material (how the atoms combine together), which in turn determines how the material behaves (properties).
• Function networks arise in the design of queuing networks (Fu and Henderson, 2017; Arcelli, 2020). This includes manufacturing systems (Ghasemi et al., 2018), where the partially-processed output of one workstation is input to the next workstation, the design of service systems such as hospitals and airport security checkpoints, and choosing traffic signal timings for a city's street network (Osorio and Bierlaire, 2013).
• Finally, function networks arise in reinforcement learning (Sutton and Barto, 2018) and Markov decision processes (Puterman, 1990), where the transition kernel transforms the state variable at the start of a time range to another state variable at the end of this range. This outputted state variable is the input to the next time range. §5.2 shows an example.

We consider objective functions evaluated via a series of functions, f_1, ..., f_K, arranged in a directed network so that each function in this network takes as input the output of its parent nodes, and we assume that evaluating this collection of functions is time-consuming. The network structure is encoded as a directed graph with nodes V = {1, ..., K} and directed edges E = {(j, k) : f_k takes as input the output of f_j}. We assume that this graph is acyclic and has a single leaf node whose output is the objective to optimize. Let J(k) denote the set of parent nodes of node k. We assume, without loss of generality, that the nodes are ordered such that j < k for all j ∈ J(k). In addition to the output of its parent nodes, we assume that each function, f_k, takes as input a (potentially empty) subset, I(k) ⊂ {1, ..., D}, of the components of the decision vector x ∈ R^D. Let h_1(x), ..., h_K(x) denote the values of the K nodes in the function network when it is evaluated at x. These values are defined recursively by

h_k(x) = f_k(x_{I(k)}, h_{J(k)}(x)), k = 1, ..., K,

i.e., by evaluating each function in the network as the values of its parent nodes become available. This recursion is well-defined by our assumption that the graph is acyclic. The objective function is then g = h_K, and the goal is to solve

max_{x ∈ X} g(x),

where X ⊂ R^D is a simple compact set, such as a hyper-rectangle.
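The recursion defining h_1(x), ..., h_K(x) translates directly into code. The sketch below is a minimal illustration, assuming the user supplies the node functions together with the index sets I(k) and J(k); it exploits the ordering j < k for all j ∈ J(k) to evaluate all nodes in a single forward pass.

```python
import numpy as np

def evaluate_function_network(x, nodes, I, J):
    """Evaluate all node values h_1(x), ..., h_K(x).

    x     : 1-D decision vector of dimension D.
    nodes : list of K callables; nodes[k](x_inputs, parent_outputs) -> float.
    I     : list of K index lists into x (possibly empty).
    J     : list of K parent index lists, with j < k for all j in J[k].
    """
    K = len(nodes)
    h = np.empty(K)
    for k in range(K):  # topological order: parents are evaluated first
        h[k] = nodes[k](x[I[k]], h[J[k]])
    return h  # the objective is the leaf value h[K - 1]
```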
The standard BO approach to this problem models g using a GP prior distribution. This approach iteratively chooses the next point at which to evaluate g as follows. Given n observations of the objective function, g(x_1), ..., g(x_n), it computes the posterior distribution on g, which is itself a GP. It then uses this posterior distribution to compute an acquisition function (Frazier, 2018) that quantifies the value of the information that would result from a function evaluation at any given point. Finally, it chooses the next point to evaluate, x_{n+1}, as the point that maximizes this acquisition function. Notably, although the outputs of f_1, ..., f_{K-1} would be observed as part of this approach, these evaluations would be ignored by standard BO when calculating the posterior distribution and corresponding acquisition function.

This section describes our approach. Like standard BO, it consists of a statistical model and an acquisition function. Unlike standard BO, however, our approach leverages the network structure of the problem by utilizing intermediate outputs within the network. As we describe below, this is achieved by modeling the node functions, f_1, ..., f_K, as GPs, which in turn implies a non-Gaussian distribution on g (§4.1). Our acquisition function is the expected improvement under this posterior distribution (§4.2), which no longer has a closed form, and thus we maximize it via sample average approximation (§4.3). We end this section by proving that our method is asymptotically consistent despite not necessarily sampling X densely (§4.4).

Instead of modeling g directly, we model f_1, ..., f_K as drawn from independent GP prior distributions. Let µ_{0,k} and Σ_{0,k} denote the prior mean and covariance functions of f_k, k = 1, ..., K, respectively. When g is evaluated at x, we also get to observe the value of f_k at (x_{I(k)}, h_{J(k)}(x)). Thus, after querying g at n points, x_ℓ, ℓ = 1, ..., n, the observations of the values of f_k, k = 1, ..., K, at (x_{ℓ,I(k)}, h_{J(k)}(x_ℓ)), ℓ = 1, ..., n, imply posterior distributions on f_1, ..., f_K, which are again (conditionally) independent GPs with mean and covariance functions µ_{n,k} and Σ_{n,k}, k = 1, ..., K. These can be computed in closed form using the standard GP regression equations (see, e.g., Rasmussen and Williams 2006). For completeness, we include these equations in §F of the supplement.

The posterior distributions on f_1, ..., f_K described above imply a posterior distribution on g. This distribution is in general non-Gaussian. Thus, unlike in the standard setting, where g is modeled as a GP, classical acquisition functions such as the expected improvement cannot be computed in closed form. However, as we describe next, drawing samples from this distribution is simple. Thanks to the acyclic structure of the underlying network that defines g, a sample, ĝ(x) = ĥ_K(x), from the posterior distribution on g at x can be obtained following the iterative process described next. In each iteration, k = 1, ..., K, we obtain a sample, ĥ_k(x), from the posterior distribution on h_k(x) by drawing a sample from the normal distribution with mean µ_{n,k}(x_{I(k)}, ĥ_{J(k)}(x)) and standard deviation σ_{n,k}(x_{I(k)}, ĥ_{J(k)}(x)), where σ_{n,k} denotes the posterior standard deviation of f_k implied by Σ_{n,k}. Supporting efficient calculation, the samples ĥ_{J(k)}(x) will have already been obtained in previous iterations of the for loop, since j < k for all j ∈ J(k) (note that J(1) = ∅). This procedure is summarized in Algorithm 1.

Algorithm 1 Draw one sample from the posterior on g(x)
Require: x ∈ X
1: for k = 1, ..., K do
2:   draw ĥ_k(x) ~ N(µ_{n,k}(x_{I(k)}, ĥ_{J(k)}(x)), σ_{n,k}(x_{I(k)}, ĥ_{J(k)}(x))²)
3: end for
4: return ĝ(x) = ĥ_K(x)

We end this section by noting that, while the statistical model described above could be catalogued as a deep GP (Damianou and Lawrence, 2013), in the sense that we have GP layers in an architecture described by a directed acyclic graph, inference in our model is faster. Typically, deep GP inference requires marginalization over latent values of GP layers. In our setting, however, the observation structure creates conditional independence across layers, avoiding the need to marginalize.
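In code, Algorithm 1 is a single forward pass through the nodes, drawing one standard normal variate per node. The sketch below assumes hypothetical callables posterior_mean[k] and posterior_std[k] that return µ_{n,k} and σ_{n,k} at a given input; any GP library could supply these.

```python
import numpy as np

def sample_network_output(x, posterior_mean, posterior_std, I, J, rng):
    """Draw one sample of g(x) = h_K(x) under the implied posterior."""
    K = len(posterior_mean)
    h = np.empty(K)
    for k in range(K):
        # Parent samples h[J[k]] were already drawn, since j < k for j in J[k].
        inp = np.concatenate([x[I[k]], h[J[k]]])
        h[k] = posterior_mean[k](inp) + posterior_std[k](inp) * rng.standard_normal()
    return h[K - 1]
```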
Our acquisition function is the expected improvement, computed with respect to the implied posterior distribution on g under the statistical model described in §4.1:

EI-FN_n(x) = E_n[(g(x) − g*_n)^+],

where g*_n = max_{ℓ=1,...,n} g(x_ℓ) is the best value observed so far, and E_n is the expectation computed with respect to the GP time-n posterior distributions on f_1, ..., f_K. To distinguish it from the classical expected improvement, we refer to our acquisition function as the expected improvement for function networks (EI-FN) in the remainder of this work. Since the posterior distribution on g is non-Gaussian, in contrast with the classical expected improvement acquisition function, EI-FN_n does not admit a closed-form expression. However, EI-FN_n(x) can be computed with arbitrary precision following a simple Monte Carlo (MC) approach:

EI-FN_n(x) ≈ (1/M) Σ_{m=1}^M (ĝ(x)^{(m)} − g*_n)^+,

where ĝ(x)^{(1)}, ..., ĝ(x)^{(M)} are samples drawn from the posterior distribution on g(x), which can be obtained following the approach described in §4.1. Following the above MC scheme to approximately compute EI-FN, one could aim to maximize this function using a derivative-free global optimization algorithm for inexpensive-to-evaluate functions, such as CMA-ES (Hansen, 2016), by drawing samples ĝ(x)^{(1)}, ..., ĝ(x)^{(M)} independently for each x at which EI-FN_n is evaluated. However, this approach is slow, since evaluations of EI-FN are noisy and derivative information is not leveraged. Instead, we propose to maximize EI-FN following a sample average approximation (SAA) approach (Kleywegt et al., 2002; Balandat et al., 2020). Succinctly, the SAA approach works by building an MC approximation of EI-FN that is deterministic given a finite set of random variables not depending on x, and maximizing this approximation instead of EI-FN itself. The key observation for creating this approximation is that a sample, ĝ(x), can be obtained as ĝ(x) = ĥ_K(x), where ĥ_1(x), ..., ĥ_K(x) are defined recursively by

ĥ_k(x) = µ_{n,k}(x_{I(k)}, ĥ_{J(k)}(x)) + σ_{n,k}(x_{I(k)}, ĥ_{J(k)}(x)) Z_k, k = 1, ..., K,

where Z = (Z_1, ..., Z_K) is a standard normal random vector, and we write ĥ_k(x) as h_k(x; Z) to make its dependence on Z explicit. Analogously, we also write ĝ(x) as g(x; Z). This can be seen as an extension of the so-called reparametrization trick for acquisition functions (Wilson et al., 2018). If we now fix M samples drawn from the K-dimensional standard normal distribution, Z^{(1)}, ..., Z^{(M)}, then

EI-FN_n(x; Z^{(1:M)}) = (1/M) Σ_{m=1}^M (g(x; Z^{(m)}) − g*_n)^+

is an MC approximation of EI-FN that is deterministic given Z^{(1)}, ..., Z^{(M)}, as desired. Moreover, Proposition 1 below shows that, under mild regularity conditions, any maximizer of the above SAA converges in probability exponentially fast to a maximizer of EI-FN_n as M → ∞, thus suggesting that in practice it is safe to use small values of M. This result is a generalization of Theorem 1 in Balandat et al. (2020). Its proof can be found in §A of the supplement.

Proposition 1. Suppose that X is compact and that µ_{n,k} and σ_{n,k}, k = 1, ..., K, are Lipschitz continuous, and let x̂_M ∈ argmax_{x∈X} EI-FN_n(x; Z^{(1:M)}). Then, for every ε > 0, there exist A, α > 0 such that P(dist(x̂_M, argmax_{x∈X} EI-FN_n(x)) > ε) ≤ A e^{−αM} for all M.

Finally, we note that EI-FN_n(x; Z^{(1:M)}) is differentiable with respect to x, provided that µ_{n,k} and Σ_{n,k}, k = 1, ..., K, are all differentiable. Thus, it can be maximized using a gradient-based deterministic optimization algorithm, such as L-BFGS-B (Byrd et al., 1995), with multiple restarts.
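The following sketch illustrates the SAA construction: the base samples Z^{(1)}, ..., Z^{(M)} are drawn once and held fixed, which makes the MC approximation of EI-FN a deterministic function of x that can be passed to a multi-start gradient-based optimizer. It reuses the hypothetical posterior_mean/posterior_std interface from above and uses scipy's L-BFGS-B with numerical gradients; the paper's implementation instead uses BoTorch, which provides exact gradients.

```python
import numpy as np
from scipy.optimize import minimize

def make_ei_fn_saa(best_f, posterior_mean, posterior_std, I, J, M=128, seed=0):
    K = len(posterior_mean)
    Z = np.random.default_rng(seed).standard_normal((M, K))  # fixed base samples

    def ei_fn(x):  # deterministic SAA of EI-FN at x
        total = 0.0
        for m in range(M):  # reparametrized sample path g(x; Z^(m))
            h = np.empty(K)
            for k in range(K):
                inp = np.concatenate([x[I[k]], h[J[k]]])
                h[k] = posterior_mean[k](inp) + posterior_std[k](inp) * Z[m, k]
            total += max(h[K - 1] - best_f, 0.0)
        return total / M

    return ei_fn

def maximize_ei_fn(ei_fn, bounds, n_restarts=10, seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = np.array(bounds).T
    best_x, best_val = None, -np.inf
    for _ in range(n_restarts):  # multi-start L-BFGS-B on the deterministic SAA
        x0 = rng.uniform(lo, hi)
        res = minimize(lambda x: -ei_fn(x), x0, method="L-BFGS-B",
                       bounds=list(zip(lo, hi)))
        if -res.fun > best_val:
            best_x, best_val = res.x, -res.fun
    return best_x, best_val
```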
To shed light on the convergence properties of the expected improvement acquisition function under our statistical model, we prove that, under suitable regularity conditions, EI-FN is asymptotically consistent, meaning that it finds the global optimum of the objective function as the number of evaluations grows to infinity. This builds on results shown for the classical expected improvement (Vazquez and Bect, 2010; Bect et al., 2019) and the expected improvement for composite functions (Astudillo and Frazier, 2019), and we rely on similar assumptions. This result is stated in Theorem 1 below, and its proof can be found in §B of the supplement.

Theorem 1. Suppose that x_{n+1} ∈ argmax_{x∈X} EI-FN_n(x) for all n. Then, under regularity conditions stated precisely in the supplement, g*_n → max_{x∈X} g(x) as n → ∞ almost surely under the prior distribution on f_1, ..., f_K.

Significant novelty in our proof arises from the fact that EI-FN's measurements are not necessarily dense in X. In nearly all existing work, consistency of BO methods is shown by first showing that the measurements are dense in the objective's domain (see, e.g., Vazquez and Bect 2010). Surprisingly, the function network structure may allow us to exclude certain regions of X as suboptimal after only finitely many measurements, allowing our method to be consistent without measuring everywhere. This ability gives insight into EI-FN's strong empirical performance compared to methods ignoring function network structure. As stated in Proposition 2 below, we provide a function network in which EI-FN provides a consistent estimate of the global optimum without sampling densely.

Proposition 2. There exists a function network (detailed in §C of the supplement) in which EI-FN is consistent but whose measurements are not necessarily dense in X.

We compare the performance of our algorithm (EI-FN) against the classical expected improvement (EI), i.e., the expected improvement under a GP model over g. We also compare with the performance of two other standard algorithms: the algorithm that chooses the points to evaluate uniformly at random over X (Random), and the knowledge gradient (KG) (also under a GP model over g), a more sophisticated acquisition function that often delivers better performance than the expected improvement (Wu and Frazier, 2016). The problems in §5.2 and §5.3 fall within the framework of Astudillo and Frazier (2019), and we include its proposed EI-CF algorithm as a benchmark. All algorithms were implemented in BoTorch (Balandat et al., 2020). We evaluate on 4 synthetic problems and 5 real-world problems. These are described below or in §D of the supplement, with the supplement describing a manufacturing problem building on §1, a COVID-19 testing problem building on §5.3, and a robot control problem similar to the one described in §5.2. In all problems, a first stage of evaluations is performed using 2(d + 1) points chosen uniformly at random over X. A second stage (pictured in plots) is then performed using each of the algorithms. Figure 2 shows the mean of the best objective value observed so far, with error bars of plus and minus 1.96 times the standard deviation divided by the square root of the number of replications. Since the difference in performance in some of our experiments is better appreciated on a logarithmic scale, we also include plots showing the log10-regret for such experiments in Figure 3. Each experiment was replicated 30 times.
Experimental setup details and runtimes are available in the supplement. Code to reproduce our numerical experiments can be found at https://github.com/RaulAstudillo06/BOFN.

We create synthetic test problems by arranging standard test functions from the global optimization literature (Jamil and Yang, 2013) into function networks. These explore a variety of network structures in an easy-to-reproduce form, and are named after the standard test function used to define the function network. We describe these briefly here and then in full detail in §D of the supplement.

Alpine2 and Rosenbrock both arrange K nodes in series, where each node except the first takes the output of the previous node as input. Additionally, in Alpine2, each node takes a distinct dimension of the decision vector x as input. In Rosenbrock, x_1 and x_2 are inputs to the first node, x_2 and x_3 are inputs to the second, and so on. These network architectures arise in manufacturing problems like the example in §1, as well as in business operations with queues, like boarding an aircraft or fulfilling drive-through orders. For Alpine2 we set K = 6, and for Rosenbrock we set K = 4.

Ackley has 3 nodes. The first two nodes each take the same 6-dimensional input. Their outputs are passed to the third node, which produces the final output. This type of function network arises in algorithm design for two-sided markets (Li et al., 2021), like Uber and AirBnB, where the first node simulates an intervention's effect on riders (or guests), the second simulates its effect on drivers (or hosts), and the third simulates the matching process where riders and drivers (or guests and hosts) interact to produce transactions.

Drop-Wave has two nodes. The first node takes a two-dimensional vector x as input. This node's output is passed to the second node, which produces the objective value. This network architecture is representative of multidisciplinary engineering design (Benaouali and Kachel, 2019), for example in aerospace, where a small number of distinct black-box simulators simulate processes governed by physical laws that affect each other through a small number of channels, such as an aircraft engine simulation (the first node) determining the heat generated while flying, which is then inputted to a temperature-dependent simulation of mechanical stress on the aircraft's frame (the second node).

This test problem is obtained by adapting the Fetch environment from OpenAI Gym (Plappert et al., 2018). The goal is to move the gripper of a robotic arm to a target location using only three movements. We formulate this problem as a function network with 3 nodes, each representing a movement of the robotic arm. Each of these nodes takes as input the current location of the gripper along with a vector of forces to be applied to the robotic arm in that step, and produces as output the location of the gripper after this movement is complete. (Note that the output of each node is 3-dimensional, and thus this can also be thought of as a function network with 9 single-output nodes.) The objective to minimize is the distance between the gripper and the target in the final step. Figure 4 illustrates this problem.

We formalize the above problem as follows. Let z_init, z_target ∈ R^3 denote the gripper's initial and target locations, respectively. At each time step, t, we choose the vector of forces to be applied to the robotic arm, x_t ∈ [−1, 1]^3. After this movement, the location of the gripper becomes z_{t+1}. The goal is to choose x_t for t = 1, ..., T to minimize ‖z_target − z_T‖_2. We set z_init = (0, 0, 0), z_target = (−12, 13, 0.2), and T = 3. This can be interpreted as a function network by associating each time step with a triplet of node functions f_t = (f_{t,1}, f_{t,2}, f_{t,3}), which take z_t and x_t as input and produce z_{t+1} = f_t(z_t, x_t) as output.

Figure 3: Unlike Figure 2 above, which shows the best objective value found, here we plot the log10-regret.
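A minimal sketch of this formulation follows, with a hypothetical placeholder for the arm dynamics (the real node is the Fetch simulator): each movement node maps the current gripper location and the applied forces to the next location, and the objective is the final distance to the target.

```python
import numpy as np

# Hypothetical stand-in for the black-box arm dynamics: each movement node
# maps (current gripper location z_t, applied forces x_t) to z_{t+1}.
def movement_node(z_t, x_t):
    return z_t + 5.0 * x_t  # placeholder linear dynamics for illustration

def gripper_objective(x, z_init=(0.0, 0.0, 0.0),
                      z_target=(-12.0, 13.0, 0.2), T=3):
    """Distance between gripper and target after T movements; x has shape (T, 3)."""
    z = np.asarray(z_init, dtype=float)
    for t in range(T):
        z = movement_node(z, x[t])  # one 3-output node per time step
    return np.linalg.norm(np.asarray(z_target) - z)
```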
A very similar experiment to the one described above can be found in §E of the supplement. It considers a variation of the active learning for robot pushing problem introduced by Wang and Jegelka (2017), whose goal is to teach a robot to push an object to a predetermined target location. As in the experiments here, EI-FN outperforms other methods significantly, including EI-CF.

Here we consider calibration of compartmental stochastic models to data, a widely used tool in epidemiology, medical modeling, and ecology (Sandberg, 1978). Function networks are well-suited to exploit the structure of such models. We focus on calibration of a specific epidemic model for influenza, building toward a COVID-19 mitigation benchmark in the next section. We first describe the epidemic model, then the calibration problem and, finally, the formulation as a function network.

We calibrate to data a widely used epidemiological model, the SIS model (see, e.g., Garnett 2002), which models diseases like influenza capable of reinfecting individuals multiple times. In this model, individuals either do not have the disease and are "susceptible" (S) or have the disease and are "infectious" (I). We consider a SIS model of two interacting populations, where infections occur at population-dependent rates. The model is dynamic, with time indexed by t = 0, 1, ..., T. At the beginning of each time period t, the fraction of population i ∈ {0, 1} that is infectious is I_{i,t}. We assume each population is of equal size, N. During this time period, each person in group i comes into close physical contact with β_{i,j,t} people from group j. When this contact is between an infectious person and a susceptible one, it infects the susceptible person. A fraction I_{j,t} of the people from group j involved in such interactions are infectious, and a fraction (1 − I_{i,t}) of the people from group i are susceptible. In addition, infections resolve at a rate of γ per period, resulting in a decrease in the infectious population in group i of γ I_{i,t}. Putting this together, the fraction of group i that is infectious at the start of the next time period is

I_{i,t+1} = I_{i,t} + (1 − I_{i,t}) Σ_j β_{i,j,t} I_{j,t} − γ I_{i,t}.

Calibration: The SIS model has parameters I_{i,0}, γ, and β_{i,j,t}, where 0 ≤ t < T and i, j ∈ {0, 1}. We calibrate the parameters β = (β_{i,j,t} : i, j, t) while fixing γ = 0.5 and I_{i,0} = 0.01 (for both i). We simulate a trajectory of infections from t = 0 up to T = 3, using a held-out value for β. We let I^obs_{i,t} denote the fraction of group i observed to be infected at time t in this trajectory. We then search for the vector β that, when passed to the SIS model, minimizes the mean squared error between this trajectory and the SIS model predictions. Letting I_{i,t}(β) denote this predicted value, the goal is to minimize the mean squared error (MSE), which is proportional to Σ_{i=0,1} Σ_{t=1}^T (I^obs_{i,t} − I_{i,t}(β))².

Formulation as a Function Network: We encode this as a function network using 2T + 1 nodes, as illustrated in Figure 5. For each time period t and each group i, a node takes input I_t := (I_{j,t} : j = 0, 1) and β_t := (β_{j,j',t} : j, j' ∈ {0, 1}) and produces output I_{i,t+1}. (For t = 0, the input I_0 is not needed, since it is the same for all β.) Then, one additional node takes the outputs of the other nodes, (I_{i,t} : i = 0, 1, t = 1, ..., T), as its input and produces the sum of squared errors Σ_{i=0,1} Σ_{t=1}^T (I^obs_{i,t} − I_{i,t}(β))² as output. We treat this final node as known (its GP prior has a kernel that is identically 0).
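The sketch below implements the SIS forward model and the calibration objective as described above; the per-period update follows the equation given earlier, and the held-out β used to generate the observed trajectory is drawn from an arbitrary illustrative range.

```python
import numpy as np

def simulate_sis(beta, T=3, gamma=0.5, I0=0.01):
    """beta has shape (T, 2, 2); returns I_{i,t} for t = 1..T as shape (T, 2)."""
    I = np.array([I0, I0])
    out = []
    for t in range(T):
        # New infections: susceptible fraction times contacts with infectious people.
        new_infections = (1.0 - I) * (beta[t] @ I)
        I = I + new_infections - gamma * I
        out.append(I.copy())
    return np.array(out)

def calibration_mse(beta_flat, I_obs, T=3):
    I_pred = simulate_sis(beta_flat.reshape(T, 2, 2), T=T)
    return np.mean((I_obs - I_pred) ** 2)  # known final node of the network

# Usage: generate data with a held-out beta, then minimize calibration_mse.
rng = np.random.default_rng(0)
beta_true = rng.uniform(0.1, 1.0, size=(3, 2, 2))  # illustrative held-out value
I_obs = simulate_sis(beta_true)
```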
EI-CF benchmark: The fact that the final node in this problem (denoted "MSE" in Figure 5) has known structure permits comparing against the EI-CF method for BO of composite functions (Astudillo and Frazier, 2019) as a benchmark. EI-CF is substantially less general than our method (EI-FN): it is restricted to settings with one time-consuming black-box multi-output node that provides input to one fast-to-evaluate node with known structure. To apply EI-CF to this problem, the single black-box multi-output node takes β as input and produces the vector (I_{i,t}(β) : i, t) as output. This output is then supplied to the "MSE" node. This approach ignores the fact that I_{i,t} does not depend on β_{t'} for t' ≥ t, and depends on β_{t'} for t' < t − 1 only indirectly, through I_{j,t−1}, j = 0, 1.

Across the wide range of problems considered, EI-FN significantly improves query efficiency over standard BO methods that ignore the function network structure of evaluations. The benefits range from modest improvements to improvements of several orders of magnitude. The largest benefits arise when the control input is high-dimensional but the input to individual nodes is low-dimensional. On the epidemic model calibration problem, we see EI-CF (Astudillo and Frazier, 2019) outperforming EI and KG by several orders of magnitude, and EI-FN outperforming EI-CF by several additional orders of magnitude. As noted above, EI-CF can be seen as a special case of EI-FN using a less informative function network that hides observations from some nodes. This is consistent with our observation that function network structure allows substantial improvement in query efficiency, and that observing more of the internal function network structure provides more value.

We introduced a novel BO approach for objective functions defined by a series of expensive-to-evaluate functions, arranged in a network so that each function takes as input the output of its parent nodes. These objective functions arise in a wide range of application domains. However, existing methods cannot leverage this structure. Our approach models the outputs of the functions in this network instead of only the objective function, as is standard in BO. Our experiments show that, by doing so, this approach can dramatically outperform standard BO methods.

Though we see substantial benefits from our approach, there are limitations. First, it requires more computation than standard BO methods, as explored in the supplement. (When the objective is time-consuming, the improved query efficiency more than makes up for the additional computation required.) Second, while we have demonstrated our method in problems with up to 9 nodes, and computational speed would support more, our method does not (yet) scale to hundreds of nodes. Third, while we show consistency, it would be instructive to complement our empirical results showing fast convergence with a theoretical understanding of convergence rates. Existing approaches to proving convergence rates for the classical expected improvement rely heavily on properties of its analytical expression (Bull, 2011; Ryzhov, 2016), and thus are not directly generalizable to our setting. This is, however, an exciting direction for future work.
As with any powerful new method for optimizing time-consuming-to-compute black-box functions, ours can accelerate many applications. While this includes innovations that generally benefit society, such as improvements to public health and vaccine manufacturing, it also includes the design of weapons and other engineering systems that could harm individuals. Thus, it is important that society enact guardrails that ensure proper use of our methodology.

In this section, we provide a formal statement and proof of Proposition 1. We begin by proving the following auxiliary result.

Lemma A.1. Suppose that f : R^{B_1} × ... × R^{B_J} → R and h_j : R^A → R^{B_j}, j = 1, ..., J, are all Lipschitz continuous with Lipschitz constants L_f and L_{h_j}, j = 1, ..., J, respectively. Then, the function g : R^A → R defined by g(x) = f(h_1(x), ..., h_J(x)) is Lipschitz continuous with constant L_f Σ_{j=1}^J L_{h_j}.

We are now in a position to show Proposition 1, which can be seen as a simple generalization of Theorem 1 in Balandat et al. (2020).

Proposition A.1 (Proposition 1). Suppose that X is compact, and that the functions µ_{n,k}, σ_{n,k} : R^{|I(k)|} × R^{|J(k)|} → R, k = 1, ..., K, are Lipschitz continuous. Let x̂_M ∈ argmax_{x∈X} EI-FN_n(x; Z^{(1:M)}), where Z^{(1)}, Z^{(2)}, ... are independent K-dimensional standard normal random vectors. Then, for every ε > 0, there exist A, α > 0 such that

P(dist(x̂_M, argmax_{x∈X} EI-FN_n(x)) > ε) ≤ A e^{−αM} for all M.

Proof. Let L_{µ_{n,k}} and L_{σ_{n,k}} be the Lipschitz constants of µ_{n,k} and σ_{n,k}, respectively, and consider the functions f_{n,k} : R^{|I(k)|} × R^{|J(k)|} × R → R, k = 1, ..., K, given by

f_{n,k}(x_{I(k)}, y_{J(k)}, z_k) = µ_{n,k}(x_{I(k)}, y_{J(k)}) + σ_{n,k}(x_{I(k)}, y_{J(k)}) z_k.

We note that, for any fixed z_k, the function (x_{I(k)}, y_{J(k)}) ↦ f_{n,k}(x_{I(k)}, y_{J(k)}, z_k) is (L_{µ_{n,k}} + L_{σ_{n,k}} |z_k|)-Lipschitz. Now consider the functions h_{n,1}, ..., h_{n,K} : X × R^K → R defined recursively by

h_{n,k}(x, z) = f_{n,k}(x_{I(k)}, h_{n,J(k)}(x, z), z_k), k = 1, ..., K.

Applying Lemma A.1 repeatedly, we find that, for any fixed z ∈ R^K, the functions x ↦ h_{n,k}(x, z), k = 1, ..., K, are Lipschitz continuous with Lipschitz constants L_{h_{n,k}}(z), k = 1, ..., K, defined recursively by

L_{h_{n,k}}(z) = (L_{µ_{n,k}} + L_{σ_{n,k}} |z_k|) (1 + Σ_{j∈J(k)} L_{h_{n,j}}(z)), k = 1, ..., K.

Let g_n = h_{n,K} and L_{g_n} = L_{h_{n,K}}. Then, for any fixed z, the function x ↦ g_n(x, z) is L_{g_n}(z)-Lipschitz. Moreover, by definition, g_n satisfies

EI-FN_n(x) = E[(g_n(x, Z) − g*_n)^+],

where Z is a K-dimensional standard normal random vector, and also

EI-FN_n(x; Z^{(1:M)}) = (1/M) Σ_{m=1}^M (g_n(x, Z^{(m)}) − g*_n)^+.

Now observe that L_{g_n}(z) is a polynomial in the variables |z_1|, ..., |z_K| with degree at most 1 in each variable. Since the folded normal distribution has a finite moment generating function everywhere, if Z is a K-dimensional standard normal random vector, then L_{g_n}(Z) has a finite moment generating function in a neighborhood of 0. A similar argument can be used to show that, for every x, g_n(x, Z) has a finite moment generating function in a neighborhood of zero. The desired result is now a direct consequence of Proposition 2 in the supplement of Balandat et al. (2020), which is in turn a consequence of Theorem 2.3 in Homem-de-Mello (2008).

In this section, we prove Theorem 1. Throughout this section, we let (x_n : n) denote the sequence of points at which the function network is evaluated. We begin by introducing the following definition, which is analogous to Definition 2.1 in Bect et al. (2019).

Definition B.1. Let F_n be the sigma-algebra generated by the function network observations up to time n. The sequence (x_n : n) is said to be a (non-randomized) sequential design if x_n is F_{n−1}-measurable for all n.

Throughout this section, we assume that (x_n : n) is a sequential design. We note, in particular, that if x_n ∈ argmax_{x∈X} EI-FN_{n−1}(x) for all n, then (x_n : n) is a sequential design.
Our proof relies on the following assumptions. Assumptions B.1–B.3 are standard regularity conditions on the prior distributions; in particular, they hold when the prior mean functions are continuous and the prior covariance functions are bounded and positive definite with almost surely continuous sample paths (cf. the proof of Proposition C.1 below). Assumption B.4 requires that, for any f_1, ..., f_K and any sequential design (x_n : n), there exists a function β(z) such that |g_{n−1}(x_n, z)| ≤ β(z) for all n and z, and ∫ φ(z) β(z) dz < ∞, where φ is the standard normal pdf.

Assumptions B.1, B.2, and B.3 are standard. Assumption B.4 is bespoke to our arguments, but holds, for example, when the posterior mean of f_K is uniformly bounded. (This bound can depend on f_1, ..., f_K.) This occurs, for example, when each f_k is in the reproducing kernel Hilbert space (RKHS) corresponding to the prior covariance function. In particular, if f_K is in this RKHS and the prior kernel is bounded, then f_K is bounded, and there is a uniform bound (depending on the RKHS norm of f_K and the prior covariance) on the deviation between f_K and the sequence of posterior means resulting from our observations. The sum of these two bounds and a term that is linear in z, arising from the posterior standard deviation term in the definition of g_{n−1}(x_n, z), provides β(z). We also believe that Assumption B.4 holds more broadly.

Note that our proof does not rely on the no-empty-ball assumption (NEB) of Vazquez and Bect (2010). Thus, our proof also extends the proof of Astudillo and Frazier (2019) to a broader class of prior distributions.

As in the main paper, we refer to the "time-n" posterior, which is the conditional distribution of f_1, ..., f_K given (x_m : m ≤ n) and (h_k(x_m) : k ≤ K, m ≤ n). E_n denotes expectation with respect to this conditional distribution, and P_n denotes the corresponding probability operator.

Recall the sampling procedure from Section 4.3 of the main paper. This defined a function g(x, Z) that depends on the current posterior distribution, a point x in the feasible domain, and a vector Z. When Z is generated as a standard normal random vector, the distribution of g(x, Z) is the same as that of g(x) under the current posterior. To support working with this over a sequence of posterior distributions, we use g_n(x, Z) to indicate this function calculated using the posterior at time n. We similarly use the notation h_{n,k}(x, Z) to represent the function h_k(x; Z) from the main paper, computed with respect to the posterior at time n.

Lemma B.1. g*_∞ := lim_n g*_n exists and is finite almost surely. Moreover, any limit point x_∞ of the sequence (x_n : n) satisfies g(x_∞) ≤ g*_∞.

Proof. The sequence (g*_n : n) is non-decreasing and bounded above by the random variable g* := max_{x'∈X} g(x'). The random variable g* is almost surely finite, since g is almost surely continuous (it is the composition of a collection of almost surely continuous functions h_k) and X is compact. Thus g*_∞ := lim_n g*_n exists and is finite almost surely. Let x_∞ be the limit of a convergent subsequence (x_{n_m} : m) of (x_n : n). Since g is almost surely continuous, g(x_∞) = lim_m g(x_{n_m}) ≤ lim_m g*_{n_m} = g*_∞.

Lemma B.2. Consider the almost sure event that f_k is continuous for all k = 1, ..., K. On this event, the function h_k is continuous for all k = 1, ..., K.

Proof. We show this via induction. The base case, k = 1, follows since f_1 is continuous on the event considered, x_{I(1)} is a continuous function of x, and so h_1(x) = f_1(x_{I(1)}) is the composition of two continuous functions and is thus a continuous function of x. We now show the induction step. Fix k > 1 and suppose h_{k'}(x) is continuous for all k' < k. Applying the induction hypothesis to each j ∈ J(k), the function x ↦ h_{J(k)}(x) is continuous.
This, together with the fact that f_k is continuous on the event considered, implies that h_k(x) = f_k(x_{I(k)}, h_{J(k)}(x)) is a composition of continuous functions and is thus continuous.

Lemma B.3. For each k = 1, ..., K, the functions µ_{n,k} and σ_{n,k} converge pointwise to some continuous functions µ_{∞,k} and σ_{∞,k} almost surely; moreover, this convergence is uniform over compact subsets of R^{|I(k)|} × R^{|J(k)|}.

Proof. Fix k and consider the almost sure event that f_1, ..., f_{k−1} are continuous. Condition on continuous realizations of f_1, ..., f_{k−1}, thus fixing h_{J(k)} as well. Let A be an arbitrary compact subset of R^{|I(k)|} × R^{|J(k)|} and define B = {(x_{I(k)}, h_{J(k)}(x)) : x ∈ X}. B is compact by Lemma B.2, since x ↦ (x_{I(k)}, h_{J(k)}(x)) is continuous, X is compact, and the image of a compact set under a continuous function is compact. The observations of f_k occur at input points {(x_{n,I(k)}, h_{J(k)}(x_n)) : n} ⊂ B. Let C = A ∪ B and note that C is compact. Then, by Proposition 2.9 in Bect et al. (2019), µ_{n,k} and σ_{n,k} converge uniformly over C, and thus also over A.

Lemma B.4. Let (x_{n_m} : m) be a convergent subsequence of (x_n : n) with limit x_∞. Then, lim_{m→∞} g_{n_m−1}(x_{n_m}, z) = g(x_∞) for each z ∈ R^K almost surely.

Proof. All the convergence claims made in this proof hold almost surely. First note that Lemma B.3 implies that, for each k = 1, ..., K, the function f_{n,k} defined in the proof of Proposition 1 converges to the function f_{∞,k} defined by

f_{∞,k}(x_{I(k)}, y_{J(k)}, z_k) = µ_{∞,k}(x_{I(k)}, y_{J(k)}) + σ_{∞,k}(x_{I(k)}, y_{J(k)}) z_k,

uniformly in (x_{I(k)}, y_{J(k)}) (but not necessarily in z_k). This in turn implies that, for each k = 1, ..., K, h_{n,k} converges to the function h_{∞,k} defined recursively by

h_{∞,k}(x, z) = f_{∞,k}(x_{I(k)}, h_{∞,J(k)}(x, z), z_k), k = 1, ..., K.

We show the following two claims by induction on k: (i) σ_{n_m−1,k}(x_{n_m,I(k)}, h_{n_m−1,J(k)}(x_{n_m}, z)) → 0; and (ii) h_{n_m−1,k}(x_{n_m}, z) → h_k(x_∞).

We first show the induction step, where we assume the induction hypothesis is true for all k' < k and show it for k. Each element of J(k) is strictly less than k, and so the induction hypothesis implies that h_{n_m−1,J(k)}(x_{n_m}, z) → h_{J(k)}(x_∞). Moreover, since x_{n_m} → x_∞ and σ_{n,k} converges uniformly to σ_{∞,k}, it follows that σ_{n_m−1,k}(x_{n_m,I(k)}, h_{n_m−1,J(k)}(x_{n_m}, z)) converges to σ_{∞,k}(x_{∞,I(k)}, h_{J(k)}(x_∞)) = 0, where the second equation is obtained by noting that, since h_{n_m−1,J(k)}(x_{n_m}, z) → h_{J(k)}(x_∞), the observation inputs (x_{n_m,I(k)}, h_{J(k)}(x_{n_m})) converge to (x_{∞,I(k)}, h_{J(k)}(x_∞)), and the limiting posterior standard deviation vanishes at limits of points where f_k is observed. This proves the first part of the induction step. Similarly, note that µ_{n_m−1,k}(x_{n_m,I(k)}, h_{n_m−1,J(k)}(x_{n_m}, z)) converges to µ_{∞,k}(x_{∞,I(k)}, h_{J(k)}(x_∞)) = h_k(x_∞), which, combined with the first part, yields h_{n_m−1,k}(x_{n_m}, z) → h_k(x_∞). This proves the second part of the induction step. The proof of the base case (k = 1) is analogous, except that J(1) = ∅ eliminates terms that depend on k' < k.

Lemma B.5. lim inf_n EI-FN_{n−1}(x_n) = 0 almost surely.

Proof. Since (x_n : n) is contained in a compact set, it has a convergent subsequence, (x_{n_m} : m), with some limit x_∞. Then, letting φ(z) be the standard normal probability density function,

EI-FN_{n_m−1}(x_{n_m}) = ∫ φ(z) (g_{n_m−1}(x_{n_m}, z) − g*_{n_m−1})^+ dz,

and, by Lemmas B.1 and B.4, (g_{n_m−1}(x_{n_m}, z) − g*_{n_m−1})^+ → (g(x_∞) − g*_∞)^+ = 0 for each z. Thus, by Assumption B.4 and the dominated convergence theorem, lim_m EI-FN_{n_m−1}(x_{n_m}) = 0, implying that lim inf_n EI-FN_{n−1}(x_n) ≤ 0. This and the fact that EI-FN_{n−1}(x) ≥ 0 for all x implies that lim inf_n EI-FN_{n−1}(x_n) = 0.

The following lemma considers a sequence of random variables I_n that we will later take to be the random improvements generated within our statistical model under the posterior after n measurements. The quantity E[I_n^+] will then be the EI-FN under this posterior.

Lemma B.6. Consider a sequence of scalar random variables I_n and fix ε > 0. If lim inf_n P(I_n ≥ ε) > 0, then lim inf_n E[I_n^+] > 0.

Proof. We have E[I_n^+] ≥ ε P(I_n ≥ ε). Thus, lim inf_n E[I_n^+] ≥ ε lim inf_n P(I_n ≥ ε) > 0.

We are now ready to prove Theorem 1.

Proof of Theorem 1. Pick any point x ∈ X. Since we choose to evaluate at the point with largest EI-FN_n, EI-FN_n(x) ≤ EI-FN_n(x_{n+1}) for each n.
Lemma B.5 then implies that there is a subsequence (n_m : m) on which lim_m EI-FN_{n_m}(x_{n_m+1}) = 0. This and the fact that EI-FN_n(x) ≥ 0 imply that lim_m EI-FN_{n_m}(x) = 0. Thus, lim inf_n EI-FN_n(x) = 0.

Consider the conditional distribution of g(x) − g*_n under the time-n posterior. This is the same as the conditional distribution of g_n(x, Z) − g*_n, where only Z is random and the other quantities are completely determined by the observations of the function network at x_1, ..., x_n. Taking I_n to be a random variable with this distribution for each n, so that E_n[I_n^+] = EI-FN_n(x), the fact that lim inf_n EI-FN_n(x) = 0 and the contrapositive of Lemma B.6 imply that lim inf_n P_n(g(x) − g*_n ≥ ε) = 0 for each ε > 0. Since the random variable g*_∞ defined in Lemma B.1 bounds each g*_n above, P_n(g(x) − g*_n ≥ ε) ≥ P_n(g(x) − g*_∞ ≥ ε), and we have lim inf_n P_n(g(x) − g*_∞ ≥ ε) = 0 for each ε > 0.

For any event W, (P_n(W) : n) is a uniformly integrable martingale, and thus converges almost surely to a limiting random variable P_∞(W), where P_∞ is the conditional probability given the sigma-algebra generated by (x_n, h_k(x_n) : n < ∞, k ≤ K) (by Theorem 5.13 of Çınlar (2011)). Taking W to be the event that g(x) − g*_∞ ≥ ε, we have that P_n(g(x) − g*_∞ ≥ ε) has a limit, P_∞(g(x) − g*_∞ ≥ ε). Moreover, this limit must be the same as the lim inf, which we showed above was 0. Thus,

P_∞(g(x) − g*_∞ ≥ ε) = 0 almost surely.

Since this is true for each ε > 0, taking the limit as ε → 0 and using the monotone convergence theorem shows

P_∞(g(x) > g*_∞) = 0 almost surely.

Taking the expectation under the prior and applying the law of iterated conditional expectations, we have that P(g(x) > g*_∞) = 0. Thus, the value of g(x) is almost surely less than or equal to the limiting value of the sequence of best points found.

Let X̃ be a countable set that is dense in X. Such a set exists because X is compact. Then, because the countable union of events with probability zero also has probability zero,

0 = P(g(x) > g*_∞ for some x ∈ X̃) = P(sup_{x∈X̃} g(x) > g*_∞).

Moreover, because g is almost surely continuous and X̃ is dense in X, sup_{x∈X̃} g(x) = sup_{x∈X} g(x) almost surely. Hence, P(sup_{x∈X} g(x) > g*_∞) = 0, which concludes the proof.

In this section, we prove Proposition 2 by providing a function network and a set of initial conditions where EI-FN does not measure the optimization domain densely. While the example we provide is very simple, we believe such behavior also arises in more complex networks.

Proposition C.1 (Proposition 2). There exists a function network in which EI-FN is consistent but whose measurements are not necessarily dense in X.

Proof. Let X = [0, 1] and consider a function network with two nodes, f_1 : X → R and f_2 : X × R → R, where f_2 is deterministic, given by f_2(x, y) = min{1, y} − x, and the objective function is given by g(x) = f_2(x, f_1(x)). Suppose that f_1 is drawn from a GP prior with a continuous mean function and a bounded positive definite covariance function whose sample paths are almost surely continuous. From this and the fact that f_2 is deterministic and continuous, it follows that Assumptions B.1–B.3 are satisfied. Assumption B.4 is also satisfied because f_2 is bounded over X × R. Thus, Theorem 1 implies that EI-FN is consistent almost surely. Let τ = inf{n ≥ 1 : f_1(x_n) > 1} be the first time that we measure a point whose value for f_1 is strictly greater than 1. (If we never measure such a point, then τ is infinity.) If P(τ < ∞) = 0 under EI-FN, then this problem is one in which EI-FN does not measure densely.
This is because there is a strictly positive probability that sup_{x∈[0,1]} f_1(x) > 1. Moreover, the fact that f_1 is almost surely continuous implies that, on this event, there is a non-empty interval on which f_1(x) is strictly above 1 over the entire interval. If EI-FN were to never measure in this interval, then it would not have measured densely. Thus, going forward, we assume P(τ < ∞) > 0. In fact, using a similar argument, we may assume P(τ < ∞, f_1(0) < 1, f_1(1) < 1) > 0. For any x > x_τ, we have g(x) ≤ 1 − x < 1 − x_τ = g(x_τ) almost surely. Thus, any such x is almost surely strictly suboptimal under the posterior, and EI-FN_n(x) = 0 for all n ≥ τ. Now let n ≥ τ and consider any unmeasured point x with 1 − x > g*_n; such a point exists on the event under consideration, since f_1(0) < 1 implies g*_n ≤ max{f_1(0), 1 − min_{m≤n} x_m} < 1, and we may take x arbitrarily close to 0. Because the prior covariance function is positive definite, the posterior probability distribution of f_1(x) has full support over the real line. Thus, in particular, there is a strictly positive posterior probability that f_1(x) ≥ 1. On this event, g(x) = 1 − x > g*_n, and so EI-FN_n(x) is strictly positive. It follows that EI-FN would not measure at a point in the interval (x_τ, 1], which concludes the proof.

All GPs in our experiments have a constant mean function and an ARD Matérn covariance function with smoothness parameter equal to 5/2, which is a standard choice in practice. The length scales of these GPs are estimated via maximum a posteriori (MAP) estimation with Gamma priors. We use M = 128 quasi-MC samples obtained via scrambled Sobol sequences (see Balandat et al. (2020) for details) for computing the SAA of EI-FN. We use the same number of samples for EI-CF and 8 samples for KG. KG is maximized following the one-shot approach proposed by Balandat et al. (2020). Under this approach, the dimension of the optimization problem that arises when maximizing KG grows linearly with the number of samples, and thus one is restricted to a small number of samples. The average runtimes of the BO methods for each of the problems are summarized in Table 1. We emphasize that, while optimizing EI-FN is more expensive than optimizing EI, the additional computation required by our method is compensated by its excellent performance, and is thus justified for problems where each function network evaluation takes several minutes or more. We also note that KG becomes very expensive to optimize for problems with relatively high input dimension, and can be even more expensive than EI-FN. The BoTorch Python package and the source code for the robot pushing problem are both publicly available under an MIT license. Our code is also publicly available under an MIT license.

Here we describe in detail how each of the synthetic test functions is arranged as a function network.

The Alpine2 test function (Jamil and Yang, 2013) is defined by

g(x) = ∏_{d=1}^D √(x_d) sin(x_d).

We adapt this function to our setting by letting

f_1(x_1) = √(x_1) sin(x_1); f_k(x_k, y_{k−1}) = √(x_k) sin(x_k) y_{k−1}, k = 2, ..., K.

The network structure of this test function can be summarized as a series of nodes where the output of each node is governed by one decision variable of its own and the output of the previous node.

The Ackley test function (Jamil and Yang, 2013) has been widely used as a benchmark function in the BO literature. It is defined by

g(x) = −20 exp(−0.2 √((1/D) Σ_{d=1}^D x_d²)) − exp((1/D) Σ_{d=1}^D cos(2πx_d)) + 20 + e.

We adapt it to our setting by letting

f_1(x) = (1/D) Σ_{d=1}^D x_d²; f_2(x) = (1/D) Σ_{d=1}^D cos(2πx_d); f_3(y_1, y_2) = −20 exp(−0.2 √y_1) − exp(y_2) + 20 + e.

The Rosenbrock test function (Jamil and Yang, 2013) is also a widely used benchmark function in the BO literature. It is defined by

g(x) = Σ_{d=1}^{D−1} [100 (x_{d+1} − x_d²)² + (1 − x_d)²].

We adapt it to our setting by letting

f_1(x_1, x_2) = 100 (x_2 − x_1²)² + (1 − x_1)²; f_k(x_k, x_{k+1}, y_{k−1}) = y_{k−1} + 100 (x_{k+1} − x_k²)² + (1 − x_k)², k = 2, ..., K.

The Drop-Wave test function (Surjanovic and Bingham, 2013) is highly multimodal and complex. It is defined by

g(x) = (1 + cos(12 √(x_1² + x_2²))) / (2 + 0.5 (x_1² + x_2²)).

We adapt it to our setting by taking

f_1(x) = √(x_1² + x_2²); f_2(y_1) = (1 + cos(12 y_1)) / (2 + 0.5 y_1²).
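As a concrete illustration of these decompositions, the following sketch implements the two-node Drop-Wave network and checks that composing the nodes reproduces the monolithic test function:

```python
import numpy as np

def drop_wave_f1(x):   # first node: radius of the two-dimensional input
    return np.sqrt(x[0] ** 2 + x[1] ** 2)

def drop_wave_f2(y1):  # second node: produces the objective value
    return (1.0 + np.cos(12.0 * y1)) / (2.0 + 0.5 * y1 ** 2)

def drop_wave_network(x):
    return drop_wave_f2(drop_wave_f1(x))

# Sanity check against the monolithic definition.
x = np.array([0.3, -0.7])
r2 = x[0] ** 2 + x[1] ** 2
assert np.isclose(drop_wave_network(x),
                  (1.0 + np.cos(12.0 * np.sqrt(r2))) / (2.0 + 0.5 * r2))
```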
Here we describe the manufacturing throughput test problem. This problem is similar in spirit to the biomanufacturing example in the introduction, but focuses on more traditional manufacturing in which workproduct is discrete rather than continuous. We have a manufacturing line with a series of stations that perform operations: e.g., steel is cut to size, then bent to shape, then holes are drilled, and finally the piece is painted. We consider "make-to-order" manufacturing, in which custom features of the part (e.g., color, size, orientation of the holes) require waiting until a customer order arrives to begin processing the part. Orders for parts arrive randomly according to a homogeneous Poisson process. Orders move to the first station in the manufacturing line and enter a queue where they wait. The first order in the queue requires a processing time exponentially distributed with a service rate that decreases with the amount of resource devoted to that station (e.g., more workers, better machines). Parts not being processed wait in the queue until they reach the front of the queue. Once processed, a part moves to the second station, where it similarly waits in a queue until it is at the front, then waits an exponential amount of time that depends on a second service rate, which we can again control through staffing. This continues, with each part completing service moving to the next queue until it has completed service at all stations. Our goal is to choose a collection of service rates, one for each station, to maximize the number of parts finished in a fixed amount of time. We constrain the sum of the service rates across the stations to represent a limit on the total resources that can be allocated.

In our experiment, we consider a manufacturing line with 4 stations. The objective to maximize is the throughput of the network in steady state, f(x), where x_i is the service rate of station i, over the feasible domain X = {x : x_i ≥ 0, i = 1, ..., 4, and Σ_{i=1}^4 x_i ≤ 1}. Let f_i(x_i, y_{i−1}) be the throughput of station i in steady state, given that the service rate of station i is x_i and station i−1 has throughput y_{i−1} in steady state. Then, our objective function can be written as a function network by taking I(k) = {k}, k = 1, ..., K; J(1) = ∅; and J(k) = {k − 1}, k = 2, 3, 4. This network is illustrated in Figure 6.

Figure 6: Manufacturing line as a function network.

Here we describe our COVID-19 testing benchmark problem, which considers reinforcement learning for large-scale testing to prevent the spread of COVID-19. It builds on the model for infection dynamics described in the epidemic model calibration problem in §5.3 of the main paper.

Overview: We design an asymptomatic screening protocol for controlling the spread of COVID-19 in a city. This approach regularly tests the entire population to identify people who are infected but do not have symptoms and may be unknowingly spreading virus. This approach has been employed successfully to control the spread of COVID-19 among students at several US universities (Denny et al., 2020) and also in Wuhan, China (Cao et al., 2020).
To limit the resources needed for testing, we use pooled testing, which is described in detail below. This has a parameter (the pool size) that, when increased, reduces the testing resources used per person tested but also degrades test accuracy in a complex way that depends on the prevalence of the virus in the population. We simulate the effect of asymptomatic screening using pooled testing on a single population, indexing time by t = 1, 2, 3. As in §5.3, we track the fraction of the population that is infectious and susceptible. To model the immunity that follows infection with COVID-19, an individual can be "recovered" (R), which means that they were previously infected, are no longer infectious, and cannot be infected again. At the start of time period t, I_t is the fraction of the population that is infectious, and R_t is the fraction that is recovered. The rest is susceptible. During each period t, the entire population is tested using a pool size of x_t. A black-box simulator determines the accuracy of these tests and the testing resources used (which depend on both x_t and the prevalence I_t). Individuals testing positive are isolated so that they cannot infect others during the period, and infectious individuals missed in testing infect others. Lower accuracy results in more individuals missed in testing. At the end of the period, all individuals in isolation are modeled as having recovered and leave isolation. This process results in a loss L_t incorporating infections, testing resources used, and individuals isolated. Our goal is to choose the pool sizes x_1, x_2, x_3 to minimize the total loss Σ_t L_t. This is encoded as a function network in Figure 3 in the main paper. As described above and in detail below, each time period performs a calculation that takes the pool size x_t and a bivariate description (I_t, R_t) of the population's current infection status. (The details of this computation are unknown to our function networks Bayesian optimization model.) It then produces as output a loss L_t and the corresponding description (I_{t+1}, R_{t+1}) of the population's infection status at the start of the next period. The objective function is Σ_t L_t. The known form of the final node Σ_t L_t is leveraged, while the other nodes are treated as black boxes. Given this overview of the problem, we now describe in detail how these black boxes are computed.

Pooled Testing: We first describe pooled testing in more detail. Pooled testing is a method for testing a large number of people for the presence of virus or some other pathogen (Cleary et al., 2021; Dorfman, 1943) in a way that reduces the amount of resources (specifically, chemical reagent and machine time) required per test performed, compared with testing each person individually. As in any COVID-19 test, we first collect a nasal or saliva sample from each individual being tested. Each sample is placed in a separate tube of fluid. Pooled testing relies on the ability to test several people (a "pool") simultaneously, returning a signal that tells us whether (1) no one in the pool is positive; or (2) at least one person in the pool is positive. To accomplish this, a bit of fluid from each sample in the pool is taken and mixed together. Then a single chemical reaction (PCR, or polymerase chain reaction) is run to assess whether anyone in the pool is positive. We specifically consider square array pooled testing (Westreich et al., 2008). This approach considers x² saliva samples as occupying an x × x grid. It then forms 2x pools: one pool from the samples in each row, and one from the samples in each column. Pools are tested, as described above. If a sample's row or column pool tests negative, then that sample is considered free of virus. All other samples (those whose row and column pools both test positive) are tested individually using a chemical reaction performed on additional fluid from that sample. This is illustrated in Figure 3 in the main paper.

The chemical reactions used to check for virus sometimes make errors: both false negatives, in which a pool or individual sample including material from an infected person tests negative; and false positives, in which a pool that does not contain virus is nevertheless deemed positive. Moreover, the probability of a false negative rises with the pool size (Cleary et al., 2021). This results in errors from the overall pooled testing procedure, where an individual who is virus-free is deemed positive (a false positive) or an individual who is infected with virus is deemed negative (a false negative). In addition to depending on the pool size, the probability of these two kinds of errors (false positives and false negatives) in the overall testing procedure depends on the prevalence, i.e., the fraction of the population infected. When prevalence is high, there are sometimes two positive individuals providing fluid to a pool. This increases the chance that the pool tests positive. Also, more pools contain positive individuals, increasing the number of negative people whose row and column pools both test positive. This increases the chance that an overall test of a virus-free person will come back positive.

The level of resources used is proportional to the number of chemical reactions performed. This also depends on the pool size and the prevalence. If the prevalence is small, then the number of chemical reactions used per person is approximately 2/x, since the number of chemical reactions performed on individual samples is small. As prevalence rises, larger pool sizes require more follow-up testing (because the pools become likely to contain at least one positive individual) and smaller pool sizes become more efficient.
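The following sketch estimates the expected number of chemical reactions per person for square-array pooling by direct simulation, under the simplifying assumptions of error-free reactions and independent infections (the paper's black-box simulator additionally models pool-size- and prevalence-dependent test errors):

```python
import numpy as np

def reactions_per_person(pool_size, prevalence, n_arrays=2000, seed=0):
    """Monte Carlo estimate of chemical reactions per person for square-array
    pooling, assuming error-free reactions and independent infections."""
    rng = np.random.default_rng(seed)
    x = pool_size
    total = 0
    for _ in range(n_arrays):
        grid = rng.random((x, x)) < prevalence        # infected samples
        row_pos = grid.any(axis=1)                    # row pool results
        col_pos = grid.any(axis=0)                    # column pool results
        followups = np.outer(row_pos, col_pos).sum()  # both pools positive
        total += 2 * x + followups                    # pools + individual tests
    return total / (n_arrays * x * x)

# At low prevalence this approaches 2/x reactions per person.
print(reactions_per_person(10, 0.01))
```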
Then, it forms 2x pools: one pool from the samples in each row, and one from the samples in each column. Pools are tested as described above. If a sample's row or column pool tests negative, then that sample is considered free of virus. All other samples (those whose row and column pools both test positive) are tested individually using a chemical reaction performed on additional fluid from that sample. This is illustrated in Figure 3 in the main paper.

The chemical reactions used to check for virus sometimes make errors: both false negatives, in which a pool or individual sample including material from an infected person tests negative; and false positives, in which a pool that does not contain virus is nevertheless deemed positive. Moreover, the probability of a false negative rises with the pool size (Cleary et al., 2021). This results in errors from the overall pooled testing procedure, in which an individual who is virus-free is deemed positive (a false positive) or an individual who is infected with virus is deemed negative (a false negative). In addition to depending on the pool size, the probability of these two kinds of errors in the overall testing procedure depends on the prevalence, i.e., the fraction of the population infected. When prevalence is high, there are sometimes two or more positive individuals providing fluid to a pool, which increases the chance that the pool tests positive. Also, more pools contain positive individuals, increasing the number of virus-free people whose row and column pools both test positive. This increases the chance that an overall test of a virus-free person will come back positive.

The level of resource used is proportional to the number of chemical reactions performed. This also depends on the pool size and the prevalence. If the prevalence is small, then the number of chemical reactions used is approximately 2/x per person tested (2x pool-level reactions shared among the x² people in the array), since the number of chemical reactions performed on individual samples is small. As prevalence rises, larger pool sizes require more follow-up testing (because the pools become likely to contain at least one positive individual) and smaller pool sizes become more efficient.

Infection Dynamics without Pooled Testing As described above, time is divided into discrete time points t = 1, 2, 3, each representing a distinct two-week period. At the start of each period, the population is described by two numbers: I_t, the fraction of the population that is infectious; and R_t, the fraction of the population that is recovered and cannot be infected again. These numbers are both in [0, 1]. The remaining fraction, S_t = 1 − I_t − R_t, is susceptible and can be infected. This division of a population into three groups (susceptible, infectious, and recovered) is widely used in epidemiology (Frazier et al., 2020).

We first describe our assumed infection dynamics in the absence of asymptomatic screening. These are obtained by integrating continuous-time dynamics within a given two-week period. We denote time strictly within a two-week period by t + u, where t is an integer and u ∈ (0, 1). During this period, we assume that individuals who were infectious at the start of the period remain so for the full two weeks. Each comes into physical contact with other people at a rate of β people per unit time; a fraction S_t of those contacted are susceptible and become infected. This gives us the differential equation
$$\frac{d I_{t+u}}{d u} = \beta S_t I_{t+u},$$
which has solution $I_{t+u} = I_t \exp(\beta S_t u)$.
At the end of the two-week period, we then assume that all individuals who were infectious at the start of the period convalesce and become recovered. Putting this together, in the absence of asymptomatic screening, the resulting dynamics would be:
$$S_{t+1} = S_t - I_t\left(\exp(\beta S_t) - 1\right), \quad (1)$$
$$I_{t+1} = I_t\left(\exp(\beta S_t) - 1\right), \quad (2)$$
$$R_{t+1} = R_t + I_t. \quad (3)$$
Although the details of these dynamics differ from those in §5.3, they result in qualitatively similar behavior. In particular, if β is small enough, then I_t shrinks to 0; but if it is large enough, then the fraction of the population infected grows large over a small number of time periods. In our implementation, we set β = (14/3) ln(2) ≈ 3.23, corresponding to an epidemic that doubles in size every 3 days in the absence of any interventions. We now incorporate the effect of asymptomatic screening.

Infection Dynamics with Pooled Testing Our simulation includes pooled testing as follows. Pooled testing with pool size x_t is performed at the start of the period. As described above, its error rates (false positive and false negative) and its efficiency depend on both the pool size and the prevalence (I_t). We use a black-box computation, using the logic described above, to calculate three quantities:
• α_FP(x_t, I_t), the fraction of virus-free individuals tested who test positive (i.e., the false positive rate of the overall pooled testing procedure);
• α_TP(x_t, I_t), the fraction of infected individuals tested who test positive (i.e., the true positive rate of the overall pooled testing procedure);
• α_C(x_t, I_t), the number of chemical reactions performed across the entire population.

Individuals who test positive are immediately removed from the population and placed into isolation. This includes both infectious individuals (a fraction α_TP(x_t, I_t) I_t of the overall population) and susceptible and recovered individuals who were incorrectly classified (fractions α_FP(x_t, I_t) S_t and α_FP(x_t, I_t) R_t of the overall population, respectively). Thus, the fraction of the population isolated in period t is
$$Q_t = \alpha_{\mathrm{TP}}(x_t, I_t)\, I_t + \alpha_{\mathrm{FP}}(x_t, I_t)\,(S_t + R_t).$$
This results in a term c_Q Q_t that is added to our loss, representing the social costs of isolation.

Because some infectious individuals are in isolation, the number of new infections is smaller than in the setting described above without asymptomatic screening: the fraction of the population that is infectious and not isolated is I_t(1 − α_TP(x_t, I_t)). Following the infection dynamics described above, this results in an additional I_t(1 − α_TP(x_t, I_t))(exp(βS_t) − 1) new infections drawn from the susceptible population. In addition, all individuals who were infectious at the start of period t recover. Thus, our dynamics are:
$$S_{t+1} = S_t - I_t\left(1 - \alpha_{\mathrm{TP}}(x_t, I_t)\right)\left(\exp(\beta S_t) - 1\right),$$
$$I_{t+1} = I_t\left(1 - \alpha_{\mathrm{TP}}(x_t, I_t)\right)\left(\exp(\beta S_t) - 1\right),$$
$$R_{t+1} = R_t + I_t.$$

One may wonder about two modeling details. First, susceptible people who are erroneously in isolation are nevertheless modeled as eligible for infection. Second, recovered people are modeled as being tested, although in practice one might choose not to test these individuals. These assumptions have little impact on outcomes because (1) false positive rates are small enough that the susceptible population in isolation is a very small fraction of the overall susceptible population; and (2) in the regimes where good solutions lie, few people are ever infected, making the recovered population also small. Making these assumptions simplifies the description and implementation.
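To make the period update concrete, here is a minimal sketch of the dynamics above. The functions alpha_tp and alpha_fp stand in for the black-box pooled-testing accuracy computation; the specific accuracy curves used in the example are hypothetical placeholders, not the simulator's actual behavior.

```python
import numpy as np

BETA = (14.0 / 3.0) * np.log(2.0)  # doubling every 3 days absent intervention

def screen_and_update(S, I, R, x, alpha_tp, alpha_fp):
    """Advance (S_t, I_t, R_t) through one screened two-week period.

    alpha_tp(x, I) and alpha_fp(x, I) are placeholders for the black-box
    pooled-testing computation described above, not the paper's simulator.
    """
    a_tp = alpha_tp(x, I)
    # Fraction of the population isolated this period (Q_t above).
    Q = a_tp * I + alpha_fp(x, I) * (S + R)
    # Free (non-isolated) infectious individuals drive the new infections.
    new_inf = I * (1.0 - a_tp) * (np.exp(BETA * S) - 1.0)
    # All period-t infectious recover; isolated susceptibles stay susceptible.
    return S - new_inf, new_inf, R + I, Q

# Hypothetical accuracy curves: sensitivity decays mildly with pool size.
tp = lambda x, I: max(0.5, 0.95 - 0.01 * x)
fp = lambda x, I: 0.01
S, I, R = 0.98, 0.01, 0.01
for x in (4, 4, 4):  # a constant pool-size policy over t = 1, 2, 3
    S, I, R, Q = screen_and_update(S, I, R, x, tp, fp)
```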
The loss at time t is the sum of the social cost of isolation described above, c_Q Q_t; the cost of the testing supplies consumed, c_T α_C(x_t, I_t); and the social cost associated with the new infections, c_I I_{t+1}:
$$L_t = c_T\, \alpha_C(x_t, I_t) + c_Q Q_t + c_I I_{t+1}.$$

Here we describe one additional experiment. We consider a variation of the active learning for robot pushing problem introduced by Wang and Jegelka (2017), whose goal is to teach a robot to push an object to a predetermined target location. We modify the problem by allowing the robot to push the object several times instead of only once. We formalize this problem as follows. Let x_init, x_target ∈ [−5, 5]² denote the object's initial and target locations, respectively, with x_1 = x_init. At each time step t, we choose the location of the robot's arm, r_t ∈ [−5, 5]², and the duration of the push, d_t ∈ [1, 12]. The robot then moves its arm from r_t toward the object's current location, x_t, for d_t units of time. After this push, the location of the object becomes x_{t+1} (if the robot fails to push the object, x_{t+1} = x_t). The goal is to choose (r_t, d_t) for t = 1, ..., T to minimize $\|x_{\mathrm{target}} - x_{T+1}\|_2^2$. We set x_init = (0, 0), x_target = (2.9, 1.6), and T = 3. This can be interpreted as a function network by associating each time step with a pair of node functions, p_{t,1} and p_{t,2}, which take (r_t, d_t, x_t) as input and produce x_{t+1} = (p_{t,1}(r_t, d_t, x_t), p_{t,2}(r_t, d_t, x_t)) as output. The results of this experiment are shown in Figure 7. EI-FN improves over EI-CF and improves substantially over EI and Random.

In this section, we write explicit formulas for the posterior mean and covariance functions of the GP distributions associated with the node functions f_1, ..., f_K. To write these more simply, we define some additional notation. Given generic vectors x ∈ R^D and y ∈ R^K, we define z_k = (x_{I(k)}, y_{J(k)}) as the elements of these vectors supplied as input to node k. Similarly, given historical observations of the values of the node functions, y_ℓ = (h_1(x_ℓ), ..., h_K(x_ℓ)), ℓ = 1, ..., n, we define z_{ℓ,k} = (x_{ℓ,I(k)}, y_{ℓ,J(k)}). Using this notation, our posterior mean and covariance functions can be written as
$$\mu_{n,k}(z_k) = \mu_{0,k}(z_k) + \Sigma_{0,k}(z_k, z_{1:n,k})\, \Sigma_{0,k}(z_{1:n,k}, z_{1:n,k})^{-1} \left(y_{1:n,k} - \mu_{0,k}(z_{1:n,k})\right)$$
and
$$\Sigma_{n,k}(z_k, z_k') = \Sigma_{0,k}(z_k, z_k') - \Sigma_{0,k}(z_k, z_{1:n,k})\, \Sigma_{0,k}(z_{1:n,k}, z_{1:n,k})^{-1}\, \Sigma_{0,k}(z_{1:n,k}, z_k'),$$
respectively.
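These are the standard GP conditioning formulas applied node by node. Below is a minimal numpy sketch, assuming noise-free observations and a user-supplied prior mean and kernel; a production implementation (for example, via BoTorch) would handle noise, batching, and numerical stability more carefully.

```python
import numpy as np

def node_posterior(mu0, k0, Z_hist, y_hist, Z_query, jitter=1e-8):
    """Posterior mean and covariance of one node's GP, per the formulas above.

    mu0: prior mean, (m, d) -> (m,); k0: prior kernel, (m, d), (n, d) -> (m, n).
    Z_hist, y_hist: the node's observed inputs z_{1:n,k} and outputs y_{1:n,k}.
    """
    K = k0(Z_hist, Z_hist) + jitter * np.eye(len(Z_hist))  # noiseless obs.
    resid = np.linalg.solve(K, y_hist - mu0(Z_hist))
    Kq = k0(Z_query, Z_hist)
    mean = mu0(Z_query) + Kq @ resid
    cov = k0(Z_query, Z_query) - Kq @ np.linalg.solve(K, Kq.T)
    return mean, cov

# Example with a zero prior mean and a squared-exponential kernel.
sqexp = lambda A, B: np.exp(-0.5 * ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))
Z = np.array([[0.0], [1.0]])
y = np.array([0.5, -0.2])
m, C = node_posterior(lambda Z: np.zeros(len(Z)), sqexp, Z, y, np.array([[0.5]]))
```

Because each node is conditioned only on its own observed inputs and outputs, the K node posteriors can be computed independently and then composed through the network to obtain the (non-Gaussian) posterior on the objective.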
References

A decomposition-based approach to uncertainty analysis of feed-forward multicomponent systems
Exploiting queuing networks to model and assess the performance of self-adaptive software systems: A survey
Bayesian optimization of composite functions
Thinking inside the box: A tutorial on grey-box Bayesian optimization
BoTorch: A framework for efficient Monte-Carlo Bayesian optimization
Stochastic multi-level composition optimization algorithms with level-independent convergence rates
A supermartingale approach to Gaussian process based sequential design of experiments
Multidisciplinary design optimization of aircraft wing using commercial software integration
Convergence rates of efficient global optimization algorithms
A limited memory algorithm for bound constrained optimization
Bayesian optimization of risk measures
Post-lockdown SARS-CoV-2 nucleic acid screening in nearly ten million residents of Wuhan, China
Low-rank matrix recovery with composite optimization: Good conditioning and rapid convergence
Probability and stochastics
Using viral load and epidemic dynamics to optimize pooled testing in resource-constrained settings
Problem formulation for multidisciplinary optimization
Deep Gaussian processes
On Bayesian methods for seeking the extremum
A simulation-based optimization framework for urban transportation problems
Multi-goal reinforcement learning: Challenging robotics environments and request for research
Markov decision processes
Gaussian processes for machine learning
On the convergence rates of expected improvement methods
The sequence-structure relationship and protein function prediction
On the mathematical foundations of compartmental analysis in biology, medicine, and ecology
Biosimilars: An overview
Taking the human out of the loop: A review of Bayesian optimization
On a class of nonsmooth composite functions
Practical Bayesian optimization of machine learning algorithms
Drop-Wave function
Reinforcement learning: An introduction
Multi-task Bayesian optimization
Bayesian optimization with expensive integrands
Efficient Bayesian optimization for target vector estimation
Convergence properties of the expected improvement algorithm with fixed mean and covariance functions
Metamodel-based simulation optimisation for bed allocation
Max-value entropy search for efficient Bayesian optimization
Optimizing screening for acute human immunodeficiency virus infection with pooled nucleic acid amplification tests
Sequential design of computer experiments to minimize integrated response functions
Maximizing acquisition functions for Bayesian optimization
The parallel knowledge gradient method for batch Bayesian optimization
Practical multi-fidelity Bayesian optimization for hyperparameter tuning
Single-step Bayesian search method for an extremum of functions of a single variable

The authors were partially supported by AFOSR FA9550-19-1-0283 and FA9550-20-1-0351.