Thul, Lawrence; Powell, Warren: Stochastic Optimization for Vaccine and Testing Kit Allocation for the COVID-19 Pandemic. Eur J Oper Res, 2021-11-11. DOI: 10.1016/j.ejor.2021.11.007

We present a formal mathematical modeling framework for a multi-agent sequential decision problem during an epidemic. The problem is formulated as a collaboration between a vaccination agent and a learning agent to allocate stockpiles of vaccines and tests to a set of zones under various types of uncertainty. The model is able to capture passive information processes and maintain beliefs over the uncertain state of the world. We design a parameterized direct lookahead approximation which is robust and scalable under different scenarios, levels of resource scarcity, and beliefs about the environment. We design a test allocation policy that captures the value of information and demonstrate that it outperforms other learning policies when there is an extreme shortage of resources (information is scarce). We simulate the model on two scenarios: a resource allocation problem across the states of the United States, and another covering the nursing homes in Nevada. The US example demonstrates the scalability of the model, and the nursing home example demonstrates its robustness under extreme resource shortages. During the early months of 2020, it became evident that the SARS-CoV-2 virus was spreading through the global population at an alarming rate. The mitigation strategies in place were not sufficient to handle a crisis at this scale, and global economies and supply chains were devastated. After the tragic losses of life and economic damage suffered, it is imperative to reflect on the nature of the problem that was faced and how to act differently in the future. The greatest challenge decision-makers face at the onset of an epidemic is the huge set of unknowns.
There is uncertainty about the features of the disease, such as transmission rates, recovery rates, and death rates. There is uncertainty about the dynamics of the disease, such as exposure time to infection, reinfection rates, or asymptomatic spreading. Once personal protective equipment is available, there is uncertainty about its effectiveness and public use. Once testing kits are available, there is uncertainty about testing accuracy and infectivity measurements in the population. Once vaccines are available, there is uncertainty about efficacy rates and public confidence. As the resources available to fight the disease are manufactured, there is uncertainty about the production rates. In the face of all these unknowns, decision-makers must act swiftly and strategically to mitigate the spread of the disease in a crucial period of time. The epidemic problem setting has innumerable complexities associated with it. In this paper, we focus on a subset of the problems faced by decision-makers. Specifically, we focus on the problem of allocating vaccines throughout a region when the state of the epidemic is not known perfectly to the decision-maker. We assume that the sequential decision problem begins at the onset of vaccine production, so there will be extreme shortages of vaccines, which will roll out as they are manufactured. Additionally, a limited stockpile of testing kits is also produced, which implies there is a limited number of observations available to the vaccine distributors. Hence, the decision-maker must capture how valuable the observations are with respect to learning the true state of the epidemic in local zones. Figure 1 illustrates the problem of allocating stockpiles of vaccines and testing kits to zones.

(This work is supported by the Air Force Office of Scientific Research, Award Number FA9550-19-1-0203. Corresponding author: Lawrence Thul, lathul@princeton.edu.)
During the SARS-CoV-2 epidemic, the initial vaccine distribution strategy was to allocate vaccines proportional to the number of adults in each state as soon as they became available (Simunaci (2020)). In this paper, we design a policy using a parameterized rolling horizon stochastic optimization technique and compare it to other classes of policies. The formulation of a proper model to design allocation policies which can adapt to the non-stationary stream of data allows for more robust management of resources. In reality, there are different goals for allocating testing kits and vaccines, but when these problems are considered jointly, the limited resources can be used more effectively. It is uncommon in the broad literature to find a multi-agent problem in which agents that can change the state of the environment and agents that learn about the environment are considered jointly. There are many modeling and algorithmic challenges presented in this application setting. The region is partitioned into a set of zones and each zone will get individual allocations of vaccines and testing kits. This leads to the set of possible decisions becoming very high dimensional. Each zone also has sets of individuals in different states related to the epidemic. For example, a percentage of the population is infected with the disease, a percentage is susceptible to the disease, and a percentage is vaccinated or immune to the disease. From the decision-maker's perspective, the true state of the infection within the population is not known perfectly, so probability distributions must be maintained as information is processed over time. This leads to state spaces over parameters of probability distributions, which grow very large and become difficult to handle. The set of possible observations from each zone is a function of the number of tests allocated to it, so the observation spaces become very high dimensional as the number of zones increases.
The high dimensionality of multiple aspects of the problem limits the set of approaches we can consider from the existing literature. There are many different agents allocating resources during a pandemic at all levels (federal, state, local) of government or organizations. The framework in this paper develops a model with hyperparameters that can be adjusted to simulate the scenario facing the decision-maker. This paper considers two scenarios to highlight the robustness and scalability of the framework. We simulate the federal allocations of vaccines and tests, as separate agents in the federal government, to each state. This scenario demonstrates that the model and policies we design can scale to populations of hundreds of millions of people and millions of available resources. The second scenario models state-level agents allocating tests and vaccines to nursing homes in the state of Nevada. This scenario is designed to highlight the robustness of the framework under extreme resource shortages. Some local areas will not receive as many resources due to budgets at higher levels in the supply chain or worse outbreaks in other areas of the country. Therefore, it is imperative to ensure the framework is robust when the availability of resources is scarce. Hence, this scenario demonstrates that the model is able to capture the value of the information collected and that the vaccine allocation policies can effectively adapt as new information streams in. This paper makes the following contributions: • We present the first formal multi-agent modeling extension to the unified framework for an epidemic application. We formulate a mathematical model for a multi-agent stochastic resource management problem that combines resource allocation (for the vaccines) with active learning (through testing). This model is able to capture passive information processes and perform active learning to improve the belief states by querying valuable observations.
• We propose a vaccine allocation policy which solves a parameterized direct lookahead model. The parameterization must be tuned using policy search. Furthermore, we demonstrate the necessary, but rare in the literature, search over policies across multiple communities of stochastic optimization. In our search, we tested all four classes of policies, but omitted the worst-performing policies due to space constraints. • We propose a test kit allocation policy by formulating a surrogate function and drawing from one-step lookahead acquisition functions in the Bayesian optimization literature. We demonstrate the utility of active learning through the test kit allocation policy when resources are extremely scarce. • We demonstrate that under extreme resource shortages the proposed vaccination allocation and learning policies work best in conjunction, compared to all other combinations of policies. The nursing home simulation highlights the power of using active learning to guide an implementation decision under resource scarcity. The paper is organized as follows. Section 2 summarizes the literature about vaccine distribution strategies, stochastic optimization, and areas of research similar to this paper. Section 3 describes the multi-agent mathematical model using the unified framework. Section 3 is broken down into the environment agent model and the controlling agent models. The controlling agent section presents the learning model and the vaccination model. Section 4 describes the formulation of policies for the vaccination agent and the learning agent. Section 5 discusses the results of implementing the model on simulators designed for two different scenarios of the environment agent. Section 6 concludes and summarizes the results and contributions of the research. There have been various computational and mathematical strategies for simulation, forecasting, and control of epidemics. One of the most common ways to model a pandemic is to use compartmental models.
Kermack and McKendrick (1927) introduce the SIR model, the most basic compartmental model, consisting of three groups within a population: those susceptible (S) to the disease, those infected (I) with the disease, and those removed (R) from the population (by death, recovery, or immunity). Tang et al. (2020) reviews the literature about compartmental models and provides various extensions of the SIR model such as susceptible-exposed-infected-recovered (SEIR), spatial SIR models, spatiotemporal SIR models, and other possible multi-compartment extensions. Greenwood and Gordillo (2009) provides a review of the SIR model with stochastic transmission rates. The literature regarding decision-making strategies to combat an epidemic is large and spans many disciplines. There are strategies regarding control via public policy and pharmaceutical or vaccine intervention. Köhler et al. (2020) and Morato et al. (2020) use public policy controls (e.g. social distancing/lockdowns) to mitigate the spread of infection when a vaccine is unavailable. Buhat et al. (2021) develops equitable testing kit allocation strategies to medical centers in the Philippines. Lin et al. (2020) models a problem to decide whether a distributor will transport vaccines through a cold chain or a non-cold chain to ensure that they are still viable at administration. Ekici et al. (2008) … We seek to perform active learning through the Bayesian optimization frameworks discussed in Frazier (2018) and Shahriari et al. (2015). Active learning has been used for optimizing nonlinear belief models (Han and Powell (2020)). It has also been used for materials science (e.g. Packwood (2017)), engineering design (e.g. Imani and Ghoreishi (2020)), medical decision making (e.g. Wang and Powell (2016)), and drug discovery (e.g. Reyes and Powell (2020)). There are various other stochastic optimization approaches for resource allocation problems throughout the literature.
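The basic Kermack-McKendrick SIR dynamics reviewed above can be sketched in a few lines. The discrete-time form, the function name, and the parameter values below are illustrative assumptions, not taken from any of the cited papers.

```python
def sir_step(S, I, R, beta, gamma, N):
    """One discrete-time step of the basic SIR model (illustrative sketch)."""
    new_infections = beta * S * I / N   # susceptible-infected mixing
    new_recoveries = gamma * I          # removals at rate gamma
    return S - new_infections, I + new_infections - new_recoveries, R + new_recoveries

# Example: a population of 1000 with a single initial infection.
S, I, R = 999.0, 1.0, 0.0
for _ in range(50):
    S, I, R = sir_step(S, I, R, beta=0.3, gamma=0.1, N=1000)
```

Note that the total population S + I + R is conserved at every step, which is the property the belief model later exploits.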
Gülpınar et al. (2018) proposes an approximate dynamic programming algorithm for assigning a limited number of resources to as many tasks as possible. Creemers (… Decision-making with a partially observable state of the world can be modeled as a partially observable Markov decision process (POMDP) (e.g. Cassandra et al. (1994)). This modeling approach is widely used for problems with unobservable parameters or quantities, but it suffers from severe computational limitations (it probably could not be applied to the problem in this paper with more than 3 or 4 zones). The exact optimal solution is almost never attainable for real-world problems; in fact, the finite horizon POMDP is PSPACE-complete (e.g. Pineau et al. (2006)). Often overlooked, however, are subtle modeling assumptions that would not apply in our epidemic setting. In particular, the policy derived from the belief MDP uses the one-step transition matrix which, aside from being computationally intractable, implicitly assumes that the transition function is known to the controller. This means that the controller actually knows the dynamics of how the disease is communicated, which is not the case with COVID-19. First, our model maintains beliefs over the state of the environment instead of just the parameters of the model. Second, our controller is robust to changes in the environment model. In fact, any increasingly complex epidemic which can be tested for infections and responds to a vaccine decision could plug and play with our controller models and adaptively mitigate the spread of a virus, because the environment is a black box. We demonstrate the versatility of our framework by implementing the model on scenarios with very large populations and moderate resource availability, as well as smaller populations under extreme resource shortages. Third, they present a scenario-based rolling horizon model, whereas we present a parameterized multi-stage lookahead approximation which can be tuned to work best under different scenarios.
This section presents a mathematical framework extending the unified framework presented in Powell (2019) to a multi-agent setting with an epidemic application under partial observability. The standard unified framework is designed with the philosophy of modeling first, then solving the problem. The model consists of five components: the state variable, decision variables, exogenous information, transition function, and objective function. After the model is constructed, the problem is solved by designing policies, searching over the four classes of policies which encompass any stochastic optimization solution strategy. We then extend the standard unified framework modeling process to a multi-agent formulation for partially observable systems. In this paper, we have an environment agent and two controlling agents. The environment agent represents the epidemic system and does not make decisions, but it can be observed through tests and impacted by vaccines. There are also two controlling agents which collaborate to complete a joint goal of minimizing the cumulative number of new infections. There is a vaccination agent responsible for allocating a dynamic stockpile of vaccines, n^vac_t, to a set of zones and a learning agent responsible for allocating a dynamic stockpile of testing kits, n^test_t, to the same set of zones. Each agent has its own model built from the five components of the unified framework and its own policy function for making decisions. The agents can characterize their own perspectives with individual models and make decisions according to their own individual objectives using separate policies. We will demonstrate the multi-agent collaboration between a learning agent and a vaccination agent. The agents have unique abilities because they have different resources. The learning agent is responsible for constructing and maintaining a belief model describing the probability distributions over the uncertain state of the environment: the belief state, B_t.
The learning agent communicates the belief state to the vaccination agent, which can utilize the new information to make the most impactful vaccine allocation decisions. Each agent has its own model, but the actual dynamics of the environment will be represented as general functions. The learning agent makes test allocation decisions, x^lrn_t, to receive samples of infected individuals, Î_{t+1}. It then updates the belief model, which it communicates to the vaccination agent to inform the vaccine allocation decisions, x^vac_t. The flow of information between the agents is displayed in Figure 2. At each discrete time step there is a sequence of events that occur. For example, at time t, the samples from the previous test allocation at time t − 1 are realized, which leads to a new belief state, B_t. Then, the belief state is transferred to the vaccination agent to make a vaccine allocation decision x^vac_t. The learning agent uses the knowledge of the vaccine allocation to strategically allocate the tests at time t, which are distributed to collect the new samples for time t + 1. The following sections present the mathematical modeling framework for each agent. Section 3.1 presents the general model for the environment agent. Section 3.2 presents the learning agent model, which is prefaced by the belief model in section 3.2.1. Section 3.3 presents the vaccination agent model. The region in this problem is partitioned into a set of zones z ∈ Z. Each zone within the region has a population of individuals, N_z. There is a disease present within the population which evolves according to some fixed dynamics. Throughout the time horizon T, an exogenous process will produce a stockpile of vaccines, n^vac_t, and testing kits, n^test_t, at the beginning of each time step. The environment model is a passive agent, so it evolves through time without making decisions. However, it has its own dynamics and can be impacted by controller decisions.
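The per-step sequence of events described above (realize samples, update the belief, vaccinate, then allocate tests) can be sketched as a control loop. The agent interfaces (`update_belief`, `decide`, `decide_tests`, `step`) are hypothetical placeholders for illustration, not the paper's implementation.

```python
def run_horizon(env, learner, vaccinator, T):
    """One pass over the horizon, following the event sequence in the text
    (all agent methods are hypothetical placeholders)."""
    samples = None  # no observations exist before the first test allocation
    for t in range(T):
        belief = learner.update_belief(samples)       # form B_t from samples Î_t
        x_vac = vaccinator.decide(belief)             # vaccine allocation x^vac_t
        x_test = learner.decide_tests(belief, x_vac)  # test allocation uses x^vac_t
        samples = env.step(x_vac, x_test)             # environment returns Î_{t+1}
```

The key ordering constraint encoded here is that the learning agent sees the vaccine decision before choosing where to test.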
A passive agent has only three of the five components of the unified framework because it does not make decisions or have an objective. It has a state variable, exogenous information, and transition functions. The ground truth components of this model may be a complex simulator or the real world. For the purposes of this paper, we limit the environment state variable to the states of the SIR model; however, other compartmental extensions are easily appended to the model. The general set of true parameters at time t is packaged into Ψ_t. Environment State Variable. The environment state variable, S^env_t, represents the information the environment would need to transition to the next state from time t onward. It has the following form:

S^env_t = (S_tz, I_tz, R_tz, Ψ_t : z ∈ Z),

where S_tz = true number of individuals susceptible to the disease in zone z at time t, I_tz = true number of individuals infected with the disease in zone z at time t, R_tz = true number of individuals removed (immune/vaccinated) from susceptibility to the disease in zone z at time t, and Ψ_t = true parameters of the SIR model at time t. The assumed environment state at time 0 is separated from the dynamic state because it includes latent variables which are fixed over time, given by

S^env_0 = (S_0z, I_0z, R_0z, Ψ_0, N_z : z ∈ Z),

where N_z = the population of each zone. Environment Exogenous Information. The environment has dynamically changing parameters. For example, the transmission rates, recovery rates, and vaccine efficacy are all streaming over time, but their distributions are unknown. Additionally, from the environment agent's perspective, the vaccine allocations are an exogenous information process. The real world will almost always be more complex than any simulator of the environment, and a controlling agent would have to approximate the real world to the best of its ability.
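The environment state variable above is a simple container of per-zone counts plus the parameter set Ψ_t, which can be mirrored in code as a small data structure. The class and field names are illustrative assumptions, not part of the paper.

```python
from dataclasses import dataclass, field

@dataclass
class EnvState:
    """Sketch of the environment state S^env_t; names are illustrative."""
    S: dict    # zone -> true susceptible count S_tz
    I: dict    # zone -> true infected count I_tz
    R: dict    # zone -> true removed count R_tz
    psi: dict = field(default_factory=dict)  # true SIR parameters Ψ_t
```

The latent populations N_z would live alongside this structure in the initial state S^env_0, since they never change.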
The simulator designed for this paper was made as complex and realistic as possible by including more stochasticity and complexity than the controlling agent model, and by biasing the sampling to approximate dynamic human behavior and asymptomatic spread. The true simulator models used to test this framework are given in Appendix 8.2. This section proposes a model for the learning agent using the five components of the unified framework. The controller does not have access to the environment agent's state variable, S^env_t, or the transition functions describing how it evolves, f^env(S^env_t, W^env_{t+1}). The distributions over the dynamic components of the environment state variable are maintained by the following belief model. In a sequential decision problem with imperfectly known states and state transitions, the controller must maintain a belief model. This belief model contains three major components: 1) the environment model assumptions, 2) the belief state, and 3) the updating equations for the belief state. Environment Model Assumptions. The states of the environment are random variables from the perspectives of the controlling agents. Equation (1) defines the general form for the environment state variable. We assume the true parameters of the SIR model are

Ψ_t = (β_tz, γ_z, ξ : z ∈ Z),

where β_tz = true transmission rate at time t in zone z, γ_z = true recovery rate in zone z, and ξ = vaccine efficacy. We assume the transmission rate is a dynamically changing stochastic process in each zone, the recovery rate is fixed in each zone, and the vaccine efficacy is the same for the entire region and fixed over time. We argue these are reasonable assumptions because the transmission rates reflect human behavior within a zone and can therefore be time dependent and random. The recovery rate reflects the latency between being infected and either naturally recovering or dying from the illness, so it is generally fixed within each zone.
It is heterogeneous between zones because healthcare and access to hospitals may differ. We assume the vaccine technology over the time horizon is fixed, so the vaccine efficacy does not change. The assumed transition functions for the subpopulations in the environment model follow a modified version of the classic SIR compartmental model in epidemiology. The equations describe how each subpopulation within each zone interacts and evolves through the time horizon. The equations are given by

S^x_tz = S_tz − min(S_tz, ξ x^vac_tz),
S_{t+1,z} = S^x_tz − β_tz S^x_tz I_tz / N_z,
I_{t+1,z} = I_tz + β_tz S^x_tz I_tz / N_z − γ_z I_tz,
R_{t+1,z} = R_tz + γ_z I_tz + min(S_tz, ξ x^vac_tz),

where S^x_tz is the post-decision state of the susceptible group. The post-decision state represents the state of the susceptible subpopulation after the vaccination decision has been made, but before the system transitions to t + 1. The post-decision state reflects the individuals effectively vaccinated between time t and t + 1 and removed from the susceptible population. The susceptible compartment is reduced by the number of post-vaccination susceptible individuals interacting with infected people at each step. The infected compartment gains those individuals, but individuals are removed at a rate γ_z into the removed compartment. The vaccinated individuals are moved into the removed compartment. The transmission rates are assumed to be random perturbations in the interval (0, 1) around an average β̄_z and have the following form,

β_tz = β̄_z + ε^β_tz,

where ε^β_tz ∼ Unif(−δ_β, +δ_β). Belief State. Since the learning agent cannot observe the environment perfectly at time t, it must maintain a probability distribution over the entire state space: the belief state. The SIR model assumes that the total population remains constant. Therefore, the belief about the true percentage of each subpopulation in a zone has the following property,

p^S_tz + p^I_tz + p^R_tz = 1.

Let p̂^S_tz, p̂^I_tz and p̂^R_tz denote estimates of the percentage of the population in each subpopulation of zone z at time t. The estimates have the same property as the true parameters in equation (9).
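A single zone's transition under the modified SIR dynamics with vaccination, as described in the prose above, can be sketched directly. This is a sketch consistent with the description (vaccination removes min(S, ξ·x^vac) from the susceptible pool before mixing); the function name and signature are assumptions.

```python
def sir_vaccine_step(S, I, R, beta, gamma, xi, x_vac, N):
    """One zone's transition under the modified SIR dynamics with
    vaccination, as described in the text (illustrative sketch)."""
    vaccinated = min(S, xi * x_vac)        # effectively vaccinated, moved out of S
    S_post = S - vaccinated                # post-decision susceptible state S^x_tz
    new_infections = beta * S_post * I / N # post-vaccination mixing with infected
    S_next = S_post - new_infections
    I_next = I + new_infections - gamma * I
    R_next = R + gamma * I + vaccinated
    return S_next, I_next, R_next
```

As in the plain SIR model, the total S + I + R is conserved; vaccination lowers both the next susceptible and the next infected counts relative to the unvaccinated trajectory.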
Hence, the most natural distribution to reflect this structure is a multinomial distribution for each zone with parameters (N_z, p̂^S_tz, p̂^I_tz, p̂^R_tz), specifically given by

(S_tz, I_tz, R_tz) ∼ Multinomial(N_z, (p̂^S_tz, p̂^I_tz, p̂^R_tz)).

Furthermore, this implies the dynamic belief state for the controlling agent models is given by

B_t = (p̂^S_tz, p̂^I_tz, p̂^R_tz : z ∈ Z).

Belief State Update. The belief state updating equations take the observations, Î_{t+1,z}, queried from the testing centers and use them to estimate the new belief state at t + 1. Each of the three estimates p̂^S_{t+1,z}, p̂^I_{t+1,z}, and p̂^R_{t+1,z} in each zone z needs an updating equation. We update the belief state through a Bayesian procedure outlined in Figure 3. The first step in updating the model is to formulate priors through the forecasting model. To forecast, the conditional expectation of each subpopulation at t + 1 is estimated with the dynamics we assumed in equations (4)-(7). The closed form expectation does not exist, so we approximate it with normal distributions. We denote variables in the forecast model with f, superscripted with the variable being forecasted and subscripted with the current time t and the future time t + 1 (e.g. f^I_{t,t+1,z} forecasts I). We state the equations in Lemma 3.1, but leave the details to the Appendix. Lemma 3.1. The predictions at t + 1 describe the conditional expectation of the subpopulations for each of the belief state variables passed through the transition function in equations (4)-(7). As the size of the population gets large, the multinomial distributions converge to normal distributions in the limit. This property allows us to approximate the conditional expectation of equations (4)-(7) with respect to the belief state. The equations are stated in terms of the standard normal cdf Φ and pdf φ. For explicit expressions and derivations of the moments of the random variables X and Y, see the Appendix. Proof. See Appendix.
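The multinomial belief over a zone's (S, I, R) split, and the large-population normal limit invoked in Lemma 3.1, are easy to check numerically. The function name and the chosen parameter values below are illustrative assumptions.

```python
import numpy as np

def sample_belief(N_z, p_S, p_I, p_R, n_samples=2000, seed=0):
    """Draw samples of a zone's (S, I, R) split under the multinomial belief."""
    rng = np.random.default_rng(seed)
    return rng.multinomial(N_z, [p_S, p_I, p_R], size=n_samples)

# Zone of 1000 people believed to be 70% susceptible, 10% infected, 20% removed.
samples = sample_belief(1000, 0.7, 0.1, 0.2)
```

Every draw sums to N_z (the constant-population property of equation (9)), and for a population of this size the marginal infected count is already close to its normal approximation with mean N_z·p^I.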
The samples of infected individuals from each zone are drawn from a binomial distribution, with x^lrn_tz samples from each zone determined by the learning policy and unknown probability parameters. The conjugate prior for the binomial distribution is a beta distribution, which effectively puts a prior distribution over the unknown parameters. The updating equation for the infected population is given in Lemma 3.2. Lemma 3.2. Let Î_{t+1,z} be a sample drawn from a binomial distribution with x^lrn_tz trials. Let (α_tz, κ_tz) be parameters of a beta distribution encoding the prior information known about I_{t+1,z}. The compound distribution produced by Bayes' Theorem is a beta-binomial distribution. The estimator for the probability of an infection, p̂^I_{t+1,z}, is given by equation (15), where λ ∈ (0, 1) is a tunable weighting factor based on how much we trust the observations versus the model. Hence, the beta distribution parameter is updated using p̂^I_{t+1}, which is computed using equation (13). Proof. See Appendix. After the tests have been administered to the population, it is possible to get an estimate of the number of infected individuals; however, there are two other groups in the population: susceptible and removed. Since we only have observations of the number of infected individuals at time t + 1, we estimate the susceptible and removed subpopulations using the predictions from Lemma 3.1 and the posterior from Lemma 3.2. Definition 3.1. Let Π_Δz be the projection operator for the set defined by

Δ_z = {(p^S, p^I, p^R) : p^S + p^I + p^R = 1, p^S, p^I, p^R ≥ 0}.

If the terms are not in the set Δ_z in Definition 3.1, then they must be projected back to the nearest point.
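The beta-binomial machinery in Lemma 3.2 can be sketched with the standard conjugate update. The exact form of the paper's equation (15) is not fully recoverable from the text, so the blend below (posterior mean of a Beta(α, κ) prior, mixed with the model forecast via λ) is an assumption about the structure, not the paper's precise estimator.

```python
def update_infection_estimate(i_hat, n_tests, alpha, kappa, forecast_p, lam):
    """Sketch of a beta-binomial infection-rate update (the paper's
    equation (15) may weight these terms differently).
    i_hat: observed positives; n_tests: tests allocated (x^lrn_tz);
    (alpha, kappa): beta prior parameters; forecast_p: forecast of p^I;
    lam: trust placed in the observation-driven posterior vs. the forecast."""
    posterior_mean = (alpha + i_hat) / (alpha + kappa + n_tests)  # conjugate update
    return lam * posterior_mean + (1.0 - lam) * forecast_p
```

With λ = 1 the estimate is purely observation-driven; with λ = 0 it falls back entirely on the forecast from Lemma 3.1, which is the behavior the text describes for zones that receive no tests.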
This projection operation for the susceptible and removed subpopulations is given by equation (18). In summary, the controlling agent updates the parameters in the belief state through the following process: 1) make observations Î_{t+1,z} for all z ∈ Z; 2) compute the belief state predictions using Lemma 3.1; 3) compute the Bayesian update using equation (15) in Lemma 3.2; 4) use equation (18) to update p̂^S_{t+1,z} and p̂^R_{t+1,z}. The learning agent is responsible for the allocation of testing kits. The testing kits are used to collect information about the state of the infection in each zone. The following subsection presents the five components of the unified framework for the learning agent model. The learning agent's state variable is very similar to the vaccination agent's; however, it also needs the vaccine stockpile because it must be able to compute the vaccination policy to evaluate the value of collecting information. The initial state variable for the learning agent therefore contains the belief state along with the test kit and vaccine stockpiles. Decision Variable. The decision to allocate testing kits to each zone follows the same structure as the vaccine allocation decision. The test kit decision is given by the vector x^test_t, and constrained by the total number of testing kits available, n^test_t. The testing kit allocation must also remain in the set of natural numbers because partial kits cannot be allocated. Exogenous Information. The exogenous information process contains all information which streams into the learning agent. The learning agent receives the random samples queried by the testing kit allocation at time t. The learning agent also receives the vaccination decision from the vaccination policy. The learning agent receives the vaccination decision before it makes the testing kit allocation at time t, and then receives all exogenous random information between t and t + 1. We omit the vaccination decision from the exogenous information process to reiterate that it arrives earlier than t + 1.
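The final step of the update, restoring the constraint that the three fractions sum to one after the infected estimate has been revised, can be sketched as a simple renormalization of the forecasted susceptible and removed shares. This is one possible realization of the projection; the paper's operator Π_Δ (equation (18)) may differ, and the function name is an assumption.

```python
def project_belief(p_I, f_S, f_R):
    """Renormalize forecasted susceptible/removed fractions so the zone's
    belief sums to one given the updated infection estimate p_I
    (a simple sketch; the paper's projection Π_Δ may differ)."""
    remaining = max(0.0, 1.0 - p_I)
    total = f_S + f_R
    if total <= 0.0:
        return remaining / 2.0, remaining / 2.0  # degenerate forecast: split evenly
    return remaining * f_S / total, remaining * f_R / total
```

The result preserves the forecasted ratio between susceptible and removed while forcing the triple back onto the probability simplex.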
Transition Function. The transition function for the learning agent describes the set of equations for updating each of the state variables. The test kit stockpile evolves exogenously. The updating procedure for the belief model is given in section 3.2.1, and the explicit procedure for the components of the belief state is given by equations (15) and (18). Objective Function. The joint goal of the agents is to minimize the cumulative number of new infections; the one-step cost is given by equation (20). The true one-step cost is not possible to evaluate online, so the expectation must be taken over the belief state. Therefore, the optimization problem becomes

min_{π^lrn ∈ Π^lrn} E { Σ_{t=0}^{T} C(S_t, X^{π,lrn}(S_t)) },

where Π^lrn is the set of all admissible testing kit allocation policies. The vaccination agent is responsible for making vaccine allocation decisions. The remainder of this section lays out the five components of the mathematical model for the vaccination agent: the state variable, decision variable, exogenous information, transition function, and objective function. State Variables. The state variables include the information needed to compute the transition functions, objective function, and policy at time t. Any information which is not changing dynamically remains a latent variable defined in the initial state. The state variable for the vaccination agent's base model is defined as

S^vac_t = (B_t, n^vac_t),

where B_t = the belief state communicated from the learning agent and n^vac_t = the number of vaccines available at time t. The initial state contains the initial dynamic variables and the static parameters of the model. Decision Variables. The decision to allocate vaccines to each zone is given by a vector, x^vac_t, which is constrained by the total number of vaccines available, n^vac_t. Hence, the vaccine decision set is given by

X^vac_t = { x^vac_t ∈ N^{|Z|} : Σ_{z∈Z} x^vac_tz ≤ n^vac_t }.

Note that this set must be constrained to the natural numbers because partial vaccines cannot be distributed. Exogenous Information.
The exogenous information, W^vac_{t+1}, represents all information that arrives between time t and t + 1. It is given by

W^vac_{t+1} = (B_{t+1}, n^vac_{t+1}).

The vaccination agent is completely dependent on the information arriving from the learning agent. Transition Function. The entire state variable arrives exogenously, hence S^vac_{t+1} = W^vac_{t+1}. Objective Function. The one-step contribution for the joint goal is given by equation (20). Hence, the optimization problem becomes

min_{π^vac ∈ Π^vac} E { Σ_{t=0}^{T} C(S_t, X^{π,vac}(S_t)) },

where Π^vac is the set of all admissible vaccination policies. The policy is a mapping from the state space to the decision space. At time t there are vaccination decisions (the number of vaccines to allocate to each zone) and learning decisions (the number of testing kits to allocate to each zone). In section 4.1, we illustrate two types of vaccination policies: one from the PFA class and one parameterized DLA policy. In section 4.2, we present a one-step lookahead learning policy for deciding which zones to allocate testing kits to. The policy, π, is a function used to map states into decisions, which we designate X^π(·). There are two general strategies for designing policies for stochastic optimization: policy search and lookahead approximations. Policy search looks within a class of functions for a policy that will work best with respect to some metric. The lookahead approximation strategy approximates the value a current decision will have on the future. The four classes of policies are policy function approximations (PFAs), cost function approximations (CFAs), value function approximations (VFAs), and direct lookahead approximations (DLAs). It is also possible to form hybrids between the four classes, such as parameterizing a DLA, which would be a hybrid between the CFA and DLA classes. The next sections present the best performing policies for each agent from our simulation studies in section 5. The vaccination decision in this problem chooses how many vaccines to send to each zone, x^vac_tz. The decision space for the next set of policies is given by equation (23).
The state space for the vaccination agent has 4|Z| + 1 dimensions; hence, as |Z| grows, finding the optimal policy quickly becomes intractable due to the curse of dimensionality. Therefore, an approximation to the optimal policy must be designed by searching through the four classes of policies to find which one works best. The following subsections present policies from the PFA class and the DLA class. The policy from the PFA class allocates vaccines using an analytic function of the population of each zone. The PFA policy is designed to resemble a myopic policy which would be used by decision-makers in the real world. The DLA policy solves a parameterized lookahead model which models the future but adds parameters to be tuned in order to adjust to the simulator (or the real world online). The proportional PFA we present was the policy used during the COVID-19 pandemic. It simply takes the proportion of the population of each zone with respect to the total population and creates a weighting. Then, the weight is used to allocate the corresponding proportion of the vaccines available. Hence, The parameterized DLA creates an approximate model of the future in order to make decisions at time t by looking at the impact of decisions in the future. The lookahead model consists of the five components of the unified framework; however, parts of the model have been simplified to make the problem more tractable. Several approximations can simplify a lookahead model, such as reducing the horizon length, discretizing states and/or decisions, sampling using Monte Carlo methods, or creating a simple policy within the lookahead model to simulate the future. We apply multiple approximation methods to solve the base model with the lookahead model. First, we truncate the horizon length to look two steps into the future. The model is still not solvable because the belief states are continuous and multidimensional.
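A minimal sketch of the proportional rule just described, assuming leftover doses from integer flooring go to the zones with the largest fractional remainders (the remainder handling is our assumption; the text only states the proportional weighting):

```python
def proportional_pfa(populations, n_vac):
    """Proportional allocation PFA: each zone's weight is its population
    share; vaccines are allocated in proportion and floored to whole doses.
    Leftover doses from flooring go to the largest remainders so the
    stockpile is fully used (our assumption)."""
    total = sum(populations)
    raw = [n_vac * p / total for p in populations]
    alloc = [int(r) for r in raw]
    leftover = n_vac - sum(alloc)
    # distribute leftover doses by largest fractional remainder
    order = sorted(range(len(raw)), key=lambda z: raw[z] - alloc[z], reverse=True)
    for z in order[:leftover]:
        alloc[z] += 1
    return alloc

print(proportional_pfa([1000, 3000, 6000], 100))  # -> [10, 30, 60]
```

The flooring step reflects the natural-number constraint on the decision set: partial vaccines cannot be distributed.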
We add a set of tunable parameters, θ^DLA ∈ (0, 1) × R^4_+, to the lookahead model to perform various functions. The set of parameters is given by θ^DLA = (θ_0, θ_1, θ_2, θ_3, θ_4). The first element, θ_0 ∈ (0, 1), is used to parameterize the state space to select a tunable percentile of the distribution over susceptible individuals in each zone. The second through fifth elements directly parameterize multiple elements of the nonlinear quadratic program in lemma 4.1. The parameterization allows the simulator to tune the policy to find a parameterization of the multi-stage deterministic program which performs best over multiple Monte Carlo evaluations of the simulation. The following paragraphs sketch the lookahead model. Any variables superscripted by θ are functions of the parameterization. The optimization problem for a two-step lookahead approximation is given by, which is the summation of multiple one-step costs in the future. This formulation is much more manageable than trying to optimize the multi-period objective in the base model. The policy derived from this optimization problem is given by, Let the two-stage lookahead vaccination decision vector stack the allocations for both stages. Then, the policy can be rewritten as a non-convex quadratic program given by, Explicit expressions for the objective function in equation (32) and the constraints (33) and (34) can be found in the Appendix. The matrix Q^θ is deconstructed into block matrices and each block is parameterized by θ_1 and θ_2. The vector q^θ is split into its first |Z| components and second |Z| components, parameterized by θ_3 and θ_4 respectively. Q^θ has both positive and negative eigenvalues in general; hence it is not always positive semidefinite. Proof. See Appendix. The optimization problem in equation (31) reduces to a problem with a nonconvex quadratic objective function and linear constraints.
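To make the two-step lookahead structure concrete, here is a toy brute-force version that enumerates coarse first-stage allocations under a deterministic SIR-style infection forecast and picks the one minimizing the summed one-step costs over both stages. This is only an illustration of the lookahead idea; the paper instead solves the parameterized quadratic program of lemma 4.1, and all names and dynamics here are our simplifications.

```python
import itertools

def new_infections(S, I, N, beta):
    # deterministic one-step infection forecast (toy SIR-style term)
    return min(S, beta * S * I / N)

def two_step_lookahead(S, I, N, beta, n_vac, step=10):
    """Enumerate coarse first-stage vaccine allocations on a grid, roll the
    toy dynamics forward two steps (no second-stage vaccines, for
    simplicity), and return the allocation minimizing forecast infections."""
    zones = len(S)
    best, best_x = float("inf"), None
    grid = range(0, n_vac + 1, step)
    for x in itertools.product(grid, repeat=zones):
        if sum(x) > n_vac:
            continue
        # step 1: vaccinate, then infections occur
        S1 = [max(0.0, S[z] - x[z]) for z in range(zones)]
        inf1 = [new_infections(S1[z], I[z], N[z], beta[z]) for z in range(zones)]
        # step 2: infections evolve from the updated compartments
        S2 = [S1[z] - inf1[z] for z in range(zones)]
        I2 = [I[z] + inf1[z] for z in range(zones)]
        inf2 = [new_infections(S2[z], I2[z], N[z], beta[z]) for z in range(zones)]
        cost = sum(inf1) + sum(inf2)
        if cost < best:
            best, best_x = cost, x
    return best_x

x = two_step_lookahead(S=[900, 900], I=[100, 10], N=[1000, 1000],
                       beta=[0.3, 0.3], n_vac=100, step=50)
print(x)  # -> (100, 0): all doses go to the zone with the larger outbreak
```

The exponential cost of this enumeration in |Z| is exactly why the paper reformulates the lookahead as a quadratic program rather than searching the decision grid.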
This approximation can be solved in practice with a bilinear quadratic solver when |Z| is not too large (roughly 100 zones or fewer). Additionally, θ^DLA = (θ_0, θ_1, θ_2, θ_3, θ_4) ∈ (0, 1) × R^4_+ requires offline parameter tuning to find the best values. The parameterization will affect performance and can change whether the program is convex or not. The second type of decision is to allocate tests to each zone to learn about the state of the pandemic. At time t, the learning agent must decide which zones to send the n^test_t kits to after the vaccination decision has already been made. The learning decision will impact the distribution of the random samples drawn from the environment and thereby the vaccine allocation decisions in the future. The large action space limits the feasible acquisition functions available from the literature; many of those policies are challenging to optimize in high dimensional spaces due to their computational complexity. These restrictions narrow down the search over learning policies. In this section, we present a learning policy designed to capture the value of information. One-step Variance Maximization. The surrogate objective is designed to optimize the estimator from equation (15) because the other random variables are functions of p̂^I_{t+1}. The surrogate function is given by, where we assume Î_{t+1,z} ∼ Bin(x^lrn_t, f^I_{t,t+1,z}). Lemma 4.2. Let the mean and variance of the surrogate function be given by, where α_{tz} and κ_{tz} are parameters of the beta distribution prior given by equation (16). Proof. See Appendix. The one-step variance maximization policy is designed to reduce the uncertainty in the estimator at t + 1 by allocating tests to the zones where they most reduce the sum of forecasted variances. Therefore, this policy creates a surrogate function designed to capture the amount of useful information gained through testing by minimizing the forecasted uncertainty in the next time step.
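A greedy sketch of a variance-guided test allocation under beta priors. The posterior-variance formula below is a stand-in we chose for illustration, not the exact expressions of lemmas 4.2 and 4.3:

```python
def allocate_tests(alpha, kappa, n_test):
    """Assign kits one at a time to the zone whose forecast estimator
    variance drops the most. We approximate the posterior variance of the
    prevalence estimate in zone z after n_z extra tests by
    p_z * (1 - p_z) / (alpha_z + kappa_z + n_z + 1), a beta-posterior-style
    shrinkage term (our assumption)."""
    Z = len(alpha)
    n = [0] * Z
    def var(z, extra):
        a, k = alpha[z], kappa[z]
        p = a / (a + k)
        return p * (1 - p) / (a + k + extra + 1)
    for _ in range(n_test):
        # marginal variance reduction of one more kit in each zone
        z_best = max(range(Z), key=lambda z: var(z, n[z]) - var(z, n[z] + 1))
        n[z_best] += 1
    return n

# zones with weak beliefs (small alpha+kappa) receive most of the kits
print(allocate_tests(alpha=[1, 5, 50], kappa=[1, 5, 50], n_test=12))
```

The greedy loop mirrors the intuition of the policy above: tests are worth most where the forecasted uncertainty is largest, so zones with strong priors receive few or no kits.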
Because this problem allocates resources across a set of zones with a heterogeneous population, it is important to address the problem of fairness in both testing kit allocation and vaccine allocation. In this paradigm, we modeled the problem to minimize the overall number of infected cases, but a spike in one area could then absorb all of the resources because doing so reduces overall cases through the horizon. While that allocation may achieve the best outcome with respect to the defined cost function, it could also create inequities in access to resources during the pandemic. The real world could see unintended consequences which were not considered in the original model. Therefore, we propose a fairness trade-off policy for each type of decision which guarantees each zone receives resources for a percentage of its population at each time step. The rest of the resources are then allocated according to the policy designed to optimize the model. Let X^opt represent a general allocation policy used to optimize the model for minimizing the overall number of cases using n_t resources. Let ρ ∈ [0, 1] be a tunable parameter representing the percentage of the population guaranteed access to the respective resources in each zone. The proportional population-based allocation is given by equation (26), which can be implemented for the vaccination allocations, testing kit allocations, or both. Then, the allocation policy designed to optimize the model is applied with (1−ρ)n_t resources. Hence, the general fairness policies are given by, We can tune ρ to trade off between fairness and optimizing the model. In this section, we study two scenarios to demonstrate the versatility and robustness of our multi-agent modeling framework. The first scenario models each zone as the 50 states plus Washington D.C.
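The fairness trade-off just described can be sketched directly; the integer flooring details are our assumption:

```python
def fairness_policy(populations, n_t, rho, optimize_alloc):
    """A rho fraction of the n_t resources is allocated proportionally to
    population; the remainder goes to the model-optimizing policy X^opt,
    passed in as a callable. Flooring details are our assumption."""
    guaranteed = int(round(rho * n_t))
    total = sum(populations)
    base = [int(guaranteed * p / total) for p in populations]
    rest = optimize_alloc(n_t - sum(base))
    return [b + r for b, r in zip(base, rest)]

# toy optimizing policy: send everything to zone 0 (e.g. a severe outbreak there)
alloc = fairness_policy([1000, 1000, 2000], n_t=100, rho=0.3,
                        optimize_alloc=lambda n: [n, 0, 0])
print(alloc)  # -> [78, 7, 15]: every zone keeps a guaranteed share
```

Setting rho = 0 recovers the pure optimizing policy, and rho = 1 recovers the proportional allocation, which is the trade-off tuned in the text.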
In our simulation studies, we simulated vaccination allocation policies across three of the four classes of policies. We test a proportional allocation PFA, a risk-adjusted CFA, an unparameterized two-step lookahead policy, and a parameterized lookahead policy. The proportional PFA can be found in section 4.1.1. The CFA is a tunable integer program which directly optimizes the one-step contribution function at each time step. The parameterized DLA performs a rolling-horizon optimization using lemma 4.1, with the parameters tuned via policy search to their optimal values. The standard two-step risk-neutral deterministic lookahead solves the lookahead model without a parameterization; this is equivalent to solving the parameterized lookahead model with θ^DLA = (0.5, 1.0, 1.0, 1.0, 1.0). We also tested each of the four vaccination policies in conjunction with three different learning policies: an even allocation PFA, a one-step variance maximization policy, and a fairness policy. The even allocation PFA allocates the testing kits evenly across the zones. The variance maximization policy solves the one-step lookahead optimization problem in lemma 4.3 to optimize the value of information. The fairness allocation policy guarantees that 30% of the kits will be allocated weighted by population size, with the rest allocated via the variance maximization policy. All static parameters for each scenario are given in the appendix for readers interested in implementation details. In the US simulation, the federal government has two agents corresponding with each other to administer a stockpile of vaccines and a stockpile of testing kits to each state (zones). The transmission rates are assumed to have a constant mean generated by population densities (e.g. Martins-Filho (2021)).
We assume they are constant because public policies are generally constant over the time horizon, but noise is added to each transmission rate process to account for dynamic and unpredictable human behavior. We assume the recovery rates have a similar structure to the transmission rates, but the constant mean values are determined by the hospital/care center density in each state (e.g. Bloom et al. (2020)). The vaccine efficacy is reported from CDC data based on the average over multiple clinical trials. Figure 4 shows the percent improvement for each combination of policies for each agent with respect to the performance of no allocation decisions. We display the performance of each vaccination policy for multiple different types of testing policies. The best policy combination for the two agents in the US scenario is to provide the vaccine administrator with the parameterized DLA policy and the test administrator with an even allocation policy. The parameterized DLA, which provides over a one percent improvement over the next best policy, is the main contributor to performance in the US scenario. The next section provides more insight into why there is little difference in performance across the test allocation policies. In fact, we show empirically that there is a critical threshold below which kits should be strategically allocated via value-of-information approximations and above which each zone should simply be tested evenly. The alternative to the US scenario is a case where there are extreme shortages. During the height of a pandemic, there are likely to be extreme resource shortages in local areas which may not be favored for allocations at the federal level. Even if a local area is given a supply of resources, the tests are usually prioritized for symptomatic individuals and hospitals. Hence, there are scenarios where difficult decisions must be made by administrators.
Consider a scenario where the state of Nevada has vaccines available for less than one percent of the nursing home residents and there is not enough testing capacity to test each of the 53 nursing homes in the state. We developed a simulation model where the infection levels in each of the nursing homes evolve independently, but there are stochastic spikes entering the nursing homes which could be introduced by staff or visitors. It is imperative to minimize the uncertainty in the outbreaks, but the testing capacity is under extreme shortages. We want to minimize the risk of severe outbreaks by monitoring the state of the pandemic in each nursing home, and we have to be strategic about how to allocate the testing kits. Figure 6 shows the infection curves for each of the different policies under severe shortages. Figure 7 demonstrates the risk of allocating evenly below a critical point of test capacity. There is a risk of severe outbreaks if every zone is not taking enough tests, whereas there is value in allocating only to certain zones with high variance. After the critical point it is no longer valuable to follow the maximum variance policy because there are a sufficient number of tests to collect enough valuable information from each zone. The most critical aspect of achieving good performance with a parameterized DLA is tuning the hyperparameters of the policy. We optimized the parameters using a stochastic gradient descent method, and we present the level sets of the hyperparameter space near the optimum to show the differences. The optimal parameters for the USA problem were θ^{DLA,*} = (0.25, 5.0, 0.2, 2.75, 0.75). Figure 8 shows the level sets for each of the hyperparameters in the parameterized DLA for the USA problem. The optimal parameters for the nursing home scenario were θ^{DLA,*} = (0.05, 6.0, 0.0, 0.5, 0.7).
Figure 9 shows the level sets for each of the hyperparameters in the parameterized DLA for the nursing home scenario. We vary the number of vaccines available (as a percent of the population) with a fixed testing capacity. Decision-makers must also consider runtime complexity when choosing a decision-making strategy. Solving a nonlinear optimization problem takes more time because the complexity of a nonlinear solver is much larger than that of an analytic function. The runtime statistics for each of the policy combinations are given in Figure 10 below for both scenarios. The unparameterized DLA takes significantly more time than the parameterized DLA, which presents another advantage of the parameterized DLA for these specific scenarios. However, the runtimes are on the order of seconds, which is negligible compared to the week-long time steps between allocation decisions. This paper contributes a multi-agent modeling extension to the unified framework for an epidemic application. We presented a formal multi-agent model for managing vaccines and tests during a pandemic. Our work extends the unified framework for sequential decisions to the multi-agent setting for the first time. The multi-agent modeling strategy allows each agent to work with its own knowledge and adapt its policies to the scenario. Additionally, the unknown environment agent can easily be changed without changing the models for the other agents. We demonstrate the robustness and scalability of the modeling strategy through two scenarios. The first scenario presents a model of COVID-19 in the USA. We collected vaccine and testing data from the CDC and used population data to construct a simulation for the agents to interact with. Then, we demonstrated the capabilities of our modeling framework to interact with the environment when there are millions of vaccines and tests to allocate to populations on the scale of hundreds of millions.
The second scenario presents the state-level resource allocation to the nursing homes in the state of Nevada. The nursing home scenario shows the robustness of the model under extreme resource shortages. The parameterized direct lookahead approximation can outperform policies from multiple other classes of policies, including the proportional PFA which was used to allocate vaccines during the COVID-19 pandemic.

References
Optimal control of vaccine distribution in a rabies metapopulation model.
Optimal vaccination strategies for a community of households.
Introduction to stochastic programming.
Modeling interaction between individuals, social networks and public policy to support public health epidemiology.
The impact of hospital bed density on the COVID-19 case fatality rate in the United States.
Resource allocation for control of infectious diseases in multiple independent populations: beyond cost-effectiveness analysis.
A survey of Monte Carlo tree search methods.
Optimal allocation of COVID-19 test kits among accredited testing centers in the Philippines.
A new epidemics-logistics model: insights into controlling the Ebola virus disease in West Africa.
Acting optimally in partially observable stochastic domains.
COVID data tracker.
Uncertainty and value of information when allocating resources within and between healthcare programmes.
Stochastic dynamic resource allocation for HIV prevention and treatment: an approximate dynamic programming approach.
The preemptive stochastic resource-constrained project scheduling problem.
Contracting for on-time delivery in the U.S. influenza vaccine supply chain. Manufacturing & Service Operations Management.
Emergency supply chain management for controlling a smallpox outbreak: the case for regional mass vaccination.
Optimizing tactics for use of the US antiviral strategic national stockpile for pandemic influenza.
Rabies in raccoons: optimal control for a discrete time model on a spatial grid.
A data-driven optimization approach for multi-period resource allocation in cholera outbreak control.
The benefits of combining early aspecific vaccination with later specific vaccination. Winter Simulation Conference, IEEE.
A tutorial on Bayesian optimization.
Stochastic epidemic modeling, in: Mathematical and Statistical Estimation Approaches in Epidemiology.
Heuristics for the stochastic dynamic task-resource allocation problem with retry opportunities.
Data-driven network resource allocation for controlling spreading processes.
Optimal online learning for nonlinear belief models using discrete priors.
Bayesian optimization objective-based experimental design.
A contribution to the mathematical theory of epidemics. Proceedings of the
Robust and optimal predictive control of the COVID-19 outbreak.
Solving stochastic resource-constrained project scheduling problems by closed-loop approximate dynamic programming.
Cold chain transportation decision in the vaccine supply chain.
A model for the optimal control of a measles epidemic.
Relationship between population density and COVID-19 incidence and mortality estimates: a county-level analysis.
A parametrized nonlinear predictive control strategy for relaxing COVID-19 social distancing measures in Brazil.
Optimal vaccine distribution in a spatiotemporal epidemic model with an application to rabies and raccoons.
Optimizing real-time vaccine allocation in a stochastic SIR model.
Whole blood or apheresis donations? A multiobjective stochastic optimization approach.
Bayesian Optimization for Materials Science.
Logistics of community smallpox control through contact tracing and ring vaccination: a stochastic network model.
A unified framework for stochastic optimization.
Approximate Dynamic Programming: Solving the Curses of Dimensionality.
Real-time decision-making during emergency disease outbreaks.
Optimal learning for sequential decisions in laboratory experimentation.
Dynamic control of modern, network-based epidemic models.
Taking the human out of the loop: a review of Bayesian optimization.
Adaptive management and the value of information: learning via intervention in epidemiology.
Pro-rata vaccine distribution is fair, equitable.
Reinforcement Learning: An Introduction.
A review of multi-compartment infectious disease models.
IIS branch-and-cut for joint chance-constrained stochastic programs and application to optimal vaccine allocation.
Finding optimal vaccination strategies under parameter uncertainty using stochastic programming.
2020 census data.
An optimal learning method for developing personalized treatment regimes.
Robust economic model predictive control of continuous-time epidemic processes.
Optimal two-phase vaccine allocation to geographically different regions under uncertainty.
On the analysis of a multi-regions discrete SIR epidemic model: an optimal control approach.
Scalable vaccine distribution in large graphs given uncertain data.

Lookahead State Variable. The lookahead state variable includes the approximate state variable which will be used to model the future. It chooses the θ_0 percentile of the susceptible population of each zone. The lookahead state variable, S̃^vac_{tt'}, is denoted with a tilde and two time subscripts. The first time subscript denotes the time t in the base model and the second denotes the approximated time t' in the future.
The lookahead state variable is not the same as the base model state variable at time t because the approximations to the belief state must be realized. The lookahead state variable at time t is given by, Lookahead Decisions. The state variable induces a tunable chance constraint on the decision set to reduce the risk of allocating more vaccines than there are susceptible individuals. Hence, the decision set is given by, There is no conditional expectation over the belief state because the approximations remove the uncertainty. The forecasting equations from Lemma 3.1 simplify to,