Stochastic Optimization for Vaccine and Testing Kit Allocation for the COVID-19 Pandemic
Lawrence Thul, Warren Powell
January 4, 2021

The pandemic caused by the SARS-CoV-2 virus has exposed many flaws in the decision-making strategies used to distribute resources to combat global health crises. In this paper, we leverage reinforcement learning and optimization to improve upon the allocation strategies for various resources. In particular, we consider a problem where a central controller must decide where to send testing kits to learn about the uncertain states of the world (active learning), and then use the new information to construct beliefs about the states and decide where to allocate resources. We propose a general model coupled with a tunable lookahead policy for making vaccine allocation decisions without perfect knowledge about the state of the world. The lookahead policy is compared to a population-based myopic policy, which is more likely to resemble the strategies currently used in practice. Each vaccine allocation policy works in conjunction with a testing kit allocation policy to perform active learning. Our simulation results demonstrate that an optimization-based lookahead decision-making strategy outperforms the presented myopic policy.

During the early months of 2020, it became evident that the SARS-CoV-2 virus was spreading through the global population at an alarming rate. As the pandemic devastated global economies and supply chains, it was obvious that the strategies in place were not sufficient to handle a crisis at this scale. Some of the obvious challenges presented by this crisis were the distribution of personal protective equipment, testing kits, and vaccines under extreme shortages, as well as testing accuracy. At the time of this writing, the world is waiting for a vaccine to suppress the virus in the population, and distributing it will present a new series of logistical issues. This problem is extremely rich with potential to improve distribution processes using quantitative strategies; this work will choose a subset of these problems and utilize reinforcement learning and approximate dynamic programming strategies to improve them. (This work is supported by the Air Force Office of Scientific Research, Award Number FA9550-19-1-0203.)

In this paper, it is assumed that a country/region is in the midst of a pandemic and a decision-maker is tasked with allocating a limited supply of vaccines and testing kits to various local zones in order to mitigate the spread of the disease. These zones could be states, counties, or some other partition of the region. A key challenge in this setting is the uncertain number of infected individuals, which can only be learned by making observations through testing. The data collected through testing would then be used to create a probabilistic belief about the true underlying state, which gives vaccine distributors useful information about where to send vaccines to slow the spread. The combination of making resource allocation decisions under uncertainty, while also performing information-collection decisions to improve those beliefs through active learning, represents a unique modeling and algorithmic challenge.
The heterogeneous characteristics of population densities, health characteristics and behavior will influence transmission rates and virus mobility within local regions; hence, intelligently allocated vaccines will impact the number of infections. The testing decisions are critical to making the best allocation decisions because the controller needs to have knowledge about the state of the pandemic. The observations made through testing will reflect biases in the samples caused by factors such as asymptomatic carriers, refusal to get tested, and false positive/negative errors. It is important to consider the risks caused by the biases because skewed testing data could eventually lead to poor allocation decisions. The best decisions can be made when each of these characteristics is properly modeled.

This paper makes the following contributions:

• It provides the first formal model for a high-dimensional pandemic resource allocation problem that captures passive information processes and active learning. This model is able to capture the uncertainty in the knowledge about the elements of the SIR model through the belief state, make vaccine allocation decisions under uncertainty, and perform active learning to improve the quality of the belief state.

• It proposes a vaccine allocation policy which combines a direct lookahead model with a parameterized approximation of the cost function in order to adjust for risk. The tunable lookahead policy in conjunction with an active learning policy is capable of outperforming a population-based myopic policy. We demonstrate this through multiple pandemic simulations on toy networks and simulation models of states in the USA.

• We propose an observation function model in order to capture the biases introduced from asymptomatic carriers, refusal to get tested, and false positives/negatives. The risk-adjusted stochastic lookahead implementation policy proposed in section 5.1.2 can be tuned to reduce the risks posed by biased sampling.

• We present a robust modeling framework which is capable of working with environment models of increasing complexity. As the epidemiological models used to create a simulator improve, the modeling framework has the flexibility to adapt to the new data.

The paper is organized as follows. Section 2 summarizes the literature about vaccine distribution, POMDP modeling, and the unified framework. Section 3 describes the mathematical model for the environment agent using the unified framework. Section 4 describes the mathematical model for the controlling agent using the five components of the unified framework. Section 5 describes implementation policies selected from the two general strategies for designing policies (policy search and lookahead approximations), as well as the active learning policy chosen for allocating testing kits. Section 6 describes the results of the model implemented on an epidemic toy problem and the simulated SARS-CoV-2 pandemic model in the United States. Section 7 summarizes conclusions from the research.

In section 2.1, we give a brief overview of the literature about vaccine distribution strategies during an epidemic. In section 2.2, we give a brief review of partially observable Markov decision processes (POMDPs) and their limitations. In section 2.3, we give an overview of modeling frameworks and their relationship to the unified framework in this paper.
Planning during an epidemic requires making decisions in many different areas such as enacting social distancing measures, contact tracing, vaccine distribution, personal protective equipment distribution, and the allocation of various other essential resources. Making these decisions can be modeled as sequential decision problems with the common goal of limiting the number of infected individuals. The literature is rich with policies and algorithms across various fields to make these decisions intelligently. The resource distribution problem is multi-faceted. The supply chain to deliver resources from the manufacturing facilities to the recipient has many different steps. For example, some of the decisions which need to be made are where to send resources, who receives the resources (i.e., who gets vaccinated at a clinic), and how to transport the resources to these locations. The implementation decisions in this paper are concerned with where to send vaccines to have the greatest reduction in cases.

Any sequential decision problem requires a dynamic model of the problem and a policy for making the decisions using the model. One of the most common ways to model a pandemic is to use compartmental models, and these are especially useful because they can be formulated as Markov chains. Kermack and McKendrick (1927) created the SIR model, which is the most basic compartmental model, consisting of three groups within a population: those susceptible (S) to the disease, those infected (I) with the disease, and those removed (R) from the population (from death, recovery, or immunity). Since then, the compartmental model literature has expanded past the SIR model to include many different extensions. Some extensions of the SIR model include: the SEIR (Susceptible-Exposed-Infected-Removed) model, the SIRS (Susceptible-Infected-Removed-Susceptible) model, the stochastic SIR model, and the spatial SIR model. The SEIR model is used for diseases with an incubation period before an individual becomes infected. The SIRS model is used for diseases that do not have permanent immunity. The stochastic SIR model is used for diseases which have stochastic transmission rates over time. The spatial SIR formulation is used to model a disease across spatially distributed subpopulations which may interact with one another. In this paper, we focus on the SIR model with stochastic and spatial extensions; however, there could always be additional extensions which could make it marginally more realistic.

Most models for planning during epidemics include compartmental models in their logic. Brandeau et al. (2003) utilizes an SI compartmental model in conjunction with optimization techniques to mitigate the spread of an infectious disease by allocating resources which suppress transmission rates. Martin et al. (1993) uses a spatial SEIR model in conjunction with multiple vaccine allocation strategies to mitigate a measles outbreak amongst dorms on a college campus. Ding et al. (2007), Asano et al. (2008), Neilan and Lenhart (2011), and Zakary et al. (2017) use deterministic spatial SIR models and extensions to derive policies from optimal control theory, using variations of the Hamilton-Jacobi-Bellman equations to decide where to allocate vaccines. Zhang and Prakash (2014) formulates a graphical model of a pandemic with uncertainty in the transmission rates reflected through the edges in the graph.
They propose and compare algorithms to allocate a limited number of vaccines across the network to minimize the resulting footprint. Another decision is who receives a set of allocated resources. Patel et al. (2005) uses a stochastic influenza model along with a genetic algorithm to decide which age groups should receive a limited supply of vaccines. Medlock and Galvani (2009) uses models of past flu pandemics to compare policies for administering a limited supply of vaccines among various age groups. Ekici et al. (2008) uses a complex spatial SEIR model in conjunction with an age-based component to decide whom to feed during a flu pandemic in Georgia by setting up food distribution centers. There are various other decisions studied in the literature to optimize a vaccine supply chain. For example, Lin et al. (2020) models a problem to decide whether a distributor will transport vaccines through a cold chain or a non-cold chain to ensure that they are still viable at administration. Dai et al. (2016) models an influenza supply chain in the U.S. consisting of healthcare providers, manufacturers, and distributors to ensure on-time delivery to each provider.

This paper will assume that the true pandemic follows a compartmental model; however, we assume that the controller does not have access to this model and must make decisions about where to test to learn about the true state of the pandemic (active learning). If the true state is only partially observable, then the controller must model the uncertainty to make decisions. There are various modeling strategies to handle decisions under uncertainty. The next section discusses the modeling framework for partially observable Markov decision processes (POMDPs) and its limitations in this problem.

A partially observable Markov decision process (POMDP) is a system which includes a controlling agent making decisions in an environment where the underlying state transitions are Markov; however, the true underlying state of the environment can only be observed partially by the controlling agent. The finite horizon POMDP problem is modeled using the tuple $(\mathcal{S}, \mathcal{X}, P, R, \mathcal{W}, P^{obs}, T)$:

• $\mathcal{S}$ - the state space of the environment (e.g. number of infected people in the population).
• $\mathcal{X}$ - the decision space (e.g. number of vaccines sent to each zone).
• $P(s'|s, x)$ - the probability transition matrix for the next environment state given a state-action pair.
• $R(s, x)$ - the reward function for a given state-action pair.
• $\mathcal{W}$ - the observation space (e.g. number of new infections measured through testing).
• $P^{obs}(w|s, x)$ - the probability of observing $w \in \mathcal{W}$ given a state-action pair.
• $T$ - the time horizon.

For the POMDP model, let $s_t$ be the state variable for the environment agent. The goal of a POMDP is the same as the goal of an MDP: to find a policy, $X^\pi$, which maximizes (or minimizes) the expected sum of cumulative rewards,
$$\max_\pi \mathbb{E}^\pi \left\{ \sum_{t=0}^{T} R(s_t, X_t^\pi(s_t)) \,\Big|\, s_0 \right\}.$$
The expectation over $\pi$ indicates that the stochastic processes in the problem are impacted by the decisions made by the policy. In a POMDP, the true state variable, $s_t$, is unknown to the controlling agent at time $t$, hence the value of the state cannot be computed. If the true state variable is unknown, then the state transitions are non-Markovian (from the controlling agent's perspective); hence, the entire history of observations and actions would be required to make decisions. If the entire history of observations had to be stored, it would grow extremely quickly.
The rapid growth of the history of actions and observations is known as the curse of history. To circumvent the curse of history, the controlling agent stores a belief state variable, $b_t(s)$ for all $s \in \mathcal{S}$, where $b_t(s)$ is a probability distribution over the (presumably discrete) state space, hence $\sum_{s \in \mathcal{S}} b_t(s) = 1$. Astrom (1965) shows that the belief state is a sufficient statistic for the history and describes a distribution over the possible states of the environment at time $t$ derived from the previous observations, actions, and states. It is defined by
$$b_t(s) = \mathbb{P}(s_t = s \mid W_t, s_{t-1}, x_{t-1}, W_{t-1}, \ldots, s_0, x_0) = \mathbb{P}(s_t = s \mid W_t, b_{t-1}).$$
When new observations arrive, the agent must perform a Bayesian update to compute this new belief state using only the new observations and the previous belief state. Using the terms in the tuple, we can compute
$$b_{t+1}(s') = \frac{P^{obs}(w_{t+1} \mid s', x_t) \sum_{s \in \mathcal{S}} P(s' \mid s, x_t)\, b_t(s)}{\sum_{s'' \in \mathcal{S}} P^{obs}(w_{t+1} \mid s'', x_t) \sum_{s \in \mathcal{S}} P(s'' \mid s, x_t)\, b_t(s)}. \quad (2)$$
In equation (2), the state space must be summed over; if the state is a vector, then $|\mathcal{S}|$ grows exponentially. Additionally, computing the transition matrix $P(s'|s, x)$ and observation matrix $P^{obs}(w|s', x)$ when the observations and decisions are vectors will also be intractable. Powell (2011) discusses the reasons for the intractability of equation (2), known as the three curses of dimensionality. In theory, the solution to the belief MDP can then be used to solve for the optimal policy using Bellman's equation (Ross et al. (2008)). Bellman's optimality equation for a POMDP is given by
$$V_t(b_t) = \max_{x \in \mathcal{X}} \Big[ \sum_{s \in \mathcal{S}} b_t(s)\, R(s, x) + \sum_{w \in \mathcal{W}} \mathbb{P}(w \mid b_t, x)\, V_{t+1}(b_{t+1}) \Big], \quad (3)$$
where it is important to note that $b_{t+1}(s')$ must be calculated using equation (2). For finite horizon problems the exact solution is only tractable when the state space, action space, and observation space are small. The belief state is an uncountable continuous set, therefore trying to find a mapping for each element of this set using backward dynamic programming is not possible. However, it is a well-known result from the POMDP literature that the value function for belief states is still piecewise linear and convex (e.g. Pineau et al. (2006)), which implies that an optimal solution exists. Finding the exact optimal solution is almost never possible for real world problems; in fact, the finite horizon POMDP is PSPACE-complete (e.g. Pineau et al. (2006)). Pineau et al. (2006) gives a set of solutions using a point-based value iteration approach and compares it to other POMDP solvers. Ross et al. (2008) derives online planning algorithms for the POMDP problem. Hoey and Poupart (2005) formulates strategies for solving POMDPs with continuous and multi-dimensional observation spaces. Roy et al. (2005) attempts to circumvent the curse of dimensionality by finding a low dimensional belief space embedded in the high dimensional belief state to project into. There have been many algorithms developed for solving POMDPs to make the solutions tractable, but their computational complexities suggest that they would only work for problems with small state, action, and observation spaces compared to the size of the problem discussed in this paper. The POMDP modeling approach is widely used for problems with unobservable parameters or quantities, but it suffers from severe computational limitations (it probably could not be applied to the problem in this paper with more than 3 or 4 zones). Often overlooked, however, are subtle modeling assumptions that would not apply in our COVID setting.
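To make the computations in equations (2) and (3) concrete, the following is a minimal sketch of the belief update for a small discrete toy problem; the states, decisions, observations, and all matrices here are hypothetical placeholders rather than anything from this paper. Note that evaluating the update requires the full transition matrix $P$ and observation matrix $P^{obs}$ as inputs.

```python
import numpy as np

def belief_update(b, x, w, P, P_obs):
    """Bayesian belief update over a discrete state space (equation (2)).

    b     : current belief over states, shape (|S|,)
    x     : index of the decision taken
    w     : index of the observation received
    P     : transition probabilities, P[x, s, s'] = P(s'|s, x)
    P_obs : observation probabilities, P_obs[x, s', w] = P(w|s', x)
    """
    predicted = b @ P[x]                      # sum_s P(s'|s, x) b(s)
    unnormalized = P_obs[x][:, w] * predicted
    return unnormalized / unnormalized.sum()

# Hypothetical toy problem: 3 states, 2 decisions, 2 observations.
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(3), size=(2, 3))       # P[x, s, :] sums to one
P_obs = rng.dirichlet(np.ones(2), size=(2, 3))   # P_obs[x, s', :] sums to one
R = rng.uniform(size=(3, 2))                     # rewards R[s, x]

b = np.full(3, 1.0 / 3.0)                        # uniform prior belief
b_next = belief_update(b, x=1, w=0, P=P, P_obs=P_obs)
print("updated belief:", b_next)

# One-step (myopic) value of each decision under the current belief, i.e. the
# bracketed term in equation (3) with V_{t+1} set to zero.  Both computations
# require the transition matrix P, i.e. fully known dynamics.
print("one-step values:", [float(b @ R[:, xi]) for xi in range(2)])
```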
In particular, the policy in equation (3) uses the one-step transition matrix $P(s'|s, x)$ which, aside from being computationally intractable, implicitly assumes that the transition function is known to the controller. This means that the controller actually knows the dynamics of how the disease is communicated, which is not the case with COVID-19. The traditional POMDP formulation is not only computationally intractable for all but toy problems, it does not properly model the fact that the people making decisions about testing and vaccine distribution do not even know the dynamics of the problem.

The POMDP model is a more generalized version of the Markov decision process (MDP) framework which forms the roots for much of the reinforcement learning literature. It is not the only way to model a sequential decision problem in the literature. Powell (2019b) compares and contrasts reinforcement learning in the MDP framework with the modeling framework in the stochastic optimal control literature. The discussion by Powell (2019b) leads to the formulation of the unified framework, which ties together the various styles of modeling a sequential decision problem into a consolidated framework that can easily be translated into software. This framework leads to an optimization problem to search over policies, which distinguishes it from other modeling frameworks. The unified framework consists of five core components: the state variable, the decision variable, the exogenous information process, the transition function, and the objective function. After the five components are modeled, a policy can be chosen from four classes of policies which are listed below:

• Policy function approximation (PFA) - analytical functions directly mapping states to actions, e.g. lookup tables, parametric functions,
• Cost function approximation (CFA) - an optimization problem involving a parameterized approximation of the cost function, e.g. upper confidence bounding,
• Value function approximation (VFA) - policies built around an approximation of the value of being in a state, e.g. approximate dynamic programming,
• Direct lookahead approximation (DLA) - policies that optimize over an approximate model of the future, e.g. deterministic or stochastic lookahead models.

The unified framework has been applied in application spaces such as energy storage, transportation, and dynamic pricing (e.g. Powell (2020)). This paper will view the controller and environment as two separate agents and model each of them with the unified framework. The interactions between the agents are then modeled as appendages to their respective exogenous information processes. From the perspective of the controlling agent, the environment agent is a black box which can be queried for observations and impacted by decisions.

The real world will almost always be more complex than any simulator, and a controlling agent would have to approximate the real world to the best of its ability. Therefore, the environment model was designed to be sufficiently complex to include factors the controlling agent cannot anticipate. The point is to create a robust model that allows the controlling agent to make effective decisions with an imperfect model. The model presented in section 4 would function with any increasingly complex environment model. The simulator designed in this section makes the environment model as realistic as possible by including more stochasticity and complexity than the controlling agent's model; hence, we recognize that the simulator constructed for the environment model can be more complex.

One of the major challenges with testing is the lack of symptoms in some portion of the positive cases, also known as asymptomatic carriers. The lack of symptoms causes issues with testing because it decreases the likelihood of a person going to get a test.
To make the testing model as realistic as possible, these factors are included in the sampling oracle. Testing errors will also be a major concern because tests are not always accurate. There are two types of errors that are possible: false positives and false negatives. Each type of error will have a probability of occurring given that a test was administered. Due to limited testing capacity, the number of people who try to get tested may exceed the number of tests available. The model will assume that as the number of tests available increases within a zone, the likelihood of going to get a test will also increase.

In summary, the environment agent model used to construct the simulator in this paper includes many factors such as spatial correlations between infected populations, biases in testing procedures, stochastically evolving transmission rates, and vaccine success rates. Some of these factors would be extremely difficult to evaluate online as the controlling agent makes decisions, but they impact the results of the simulator. Hence, we claim that the simulator presented in this section is sufficiently complex to measure the performance of the policies designed in section 5 and the quality of the modeling in section 4.

The state variables for the environment agent base model include the variables needed to model the system from time $t$ and onward. The importance of modeling the simulator and the controller as separate entities is to capture the fact that the controller is not always able to observe the state of the environment perfectly. The dynamic state variable for the simulator is given by $S^{env}_t$. Additionally, the initial environment state variable, $S^{env}_0$, consists of all variables which do not change dynamically. In this model of the environment agent, we reiterate that the simulator does not include dynamically changing mean transmission rates, recovery rates, or vaccine success rates. In the real world, these factors are likely changing dynamically as local laws change and better drugs are developed.

The exogenous information for the environment agent model refers to streams of information that arrive exogenously, which we model as stochastic processes. These variables include
$$W^{env}_{t+1} = (x^{vac}_t, \hat{M}_{t+1}, \varepsilon^{\beta}_{t+1}, \alpha_{t+1}, n^{vac}_{t+1}),$$
where
$x^{vac}_t$ = vector of the number of vaccines allocated to each zone,
$\hat{M}_{t+1}$ = random matrix of inter-zone mobility,
$\varepsilon^{\beta}_{t+1}$ = additive noise for the transmission rate evolution,
$\alpha_{t+1}$ = mobility rate at time $t+1$,
$n^{vac}_{t+1}$ = number of vaccines available at time $t+1$.

The simulator transition function describes how each simulator state variable evolves between time $t$ and $t+1$. The spatial SIR model can be modified to allow vaccines (which are managed by the controller) to impact the environment agent. At time $t$, the $x^{vac}_t$ vaccines pass through the environment transition functions and can impact the true state. The function of the vaccine is to directly move people from the susceptible group into the recovered/immune group. The dynamics are captured through a set of transition equations in which $\xi$ is the success rate of the vaccine, $r_t = \beta_{t+1,z} \frac{I_{tz}}{N_z}$, and $\hat{M}_{t+1}$ is the mobility probability matrix. The mobility rate and the number of vaccines available are assumed to be totally exogenous, driven by the current laws and the vaccine manufacturing process, respectively. The transmission rate at $t+1$ evolves stochastically through the additive noise $\varepsilon^{\beta}_{t+1}$. Since there are no decisions made within the environment, there is no objective function.
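The following sketch shows one plausible form of such an environment step using the quantities defined above ($\xi$, $r_t$, $\hat{M}_{t+1}$, $\varepsilon^{\beta}_{t+1}$); the specific functional forms, such as how mobility mixes infections across zones and how $\beta$ evolves, are illustrative assumptions rather than the paper's exact transition equations.

```python
import numpy as np

def environment_step(S, I, R, beta, x_vac, M_hat, xi, gamma, rng):
    """One assumed form of the environment transition for all zones.

    S, I, R : true susceptible / infected / removed counts per zone, shape (|Z|,)
    beta    : transmission rate per zone (beta_{t+1,z})
    x_vac   : vaccines allocated to each zone by the controller
    M_hat   : mobility probability matrix, rows sum to one
    xi      : vaccine success rate
    gamma   : recovery/removal rate (held constant across zones)
    """
    N = S + I + R
    vaccinated = xi * np.minimum(S, x_vac)            # vaccines move people S -> R
    # Infection pressure in each zone, mixing local and visiting infected
    # individuals through the mobility matrix (assumed form).
    r = beta * (M_hat @ (I / N))
    new_infections = np.minimum(r * (S - vaccinated), S - vaccinated)
    recoveries = gamma * I
    S_next = S - vaccinated - new_infections
    I_next = I + new_infections - recoveries
    R_next = R + vaccinated + recoveries
    # Stochastically evolving transmission rates via additive noise eps^beta.
    beta_next = np.clip(beta + rng.normal(0.0, 0.01, size=beta.shape), 0.0, None)
    return S_next, I_next, R_next, beta_next

# Small example with three zones.
rng = np.random.default_rng(1)
S = np.array([9000.0, 4500.0, 700.0])
I = np.array([800.0, 400.0, 250.0])
R = np.array([200.0, 100.0, 50.0])
beta = np.array([0.25, 0.30, 0.20])
M_hat = np.array([[0.80, 0.15, 0.05],
                  [0.15, 0.80, 0.05],
                  [0.10, 0.10, 0.80]])
x_vac = np.array([300.0, 150.0, 50.0])
print(environment_step(S, I, R, beta, x_vac, M_hat, xi=0.9, gamma=0.1, rng=rng))
```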
Another important component of the simulator is to model the impact of testing within each zone. To make the testing procedure as realistic as possible, the sampling oracle was designed to contain implicit biases. The following definitions are used to outline the important parts of the observation function.

Definition 3.1. For a person drawn from the population of zone $z$ between time $t$ and $t+1$, let $E_{pos}$ be the event that the person is positive with the disease, $E_{symp}$ the event that the person shows symptoms, $E_{test}$ the event that the person tries to get tested, and $E_{tpos}$ the event that the person tests positive. The complements of these outcomes are $E_{neg}$, $E_{asymp}$, $E_{not}$, and $E_{tneg}$, respectively.

Definition 3.2. The probability that a person drawn from the population is truly positive with the disease at time $t$ is given by $\mathbb{P}[E_{pos}] = \frac{I_{tz}}{N_z}$.

When the controlling agent attempts to query a sample from a zone, the sampled positive cases may not reflect the true percentage in the population, on average. The controlling agent may not be able to see this bias; however, the environment agent will have access to the information causing the skew in the data.

Definition 3.3. Let $a$ be the probability that a person shows symptoms given that they are positive with the disease, and let $b$ be the probability that a person shows symptoms given that they are not positive.

Definition 3.4. If a person is not showing symptoms while positive with the disease, then they have a lower chance of going to wait in the test queue between time $t$ and $t+1$. Assume that as the testing capacity of a zone is increased, the likelihood of getting tested increases through functions $c(x^{test}_{tz})$ and $d(x^{test}_{tz})$, where $c_0$ and $d_0$ are the initial probabilities of attempting to get a test when there are none allocated to zone $z$.

The simulator is able to access each probability, but the controller would not be able to observe this information directly. It is easier to think about the probabilities $a$, $b$, $c$ and $d$ as follows: $a$ is the probability of being symptomatic while carrying the disease, $b$ is the probability of being symptomatic without carrying the disease, $c$ is the probability of trying to get tested while showing symptoms, and $d$ is the probability of trying to get tested without showing symptoms. The probabilities $c$ and $d$ are functions of how many tests are available, while $c_0$ and $d_0$ are the base probabilities of trying to get tested given that there are no tests available. From the controller's perspective, these values cannot be seen; however, they are reflected in the sample created after administering the tests.

Lemma 3.1. Let $f_p$ and $f_n$ be the probabilities of a false positive and a false negative, respectively, and let $a$, $b$, $c(x^{test}_{tz})$, and $d(x^{test}_{tz})$ be as in definitions 3.3 and 3.4. Then $p^{test}_{tz}$, the probability of testing positive given that the person gets tested, is a function of these quantities, where the variables $c$ and $d$ are functions of the number of tests allocated to a zone. Proof. See Appendix.

The adjusted probability allows the factors affecting the observation distribution to be reflected in the probability of testing positive. Therefore, the number of positive test results after administering the limited number of tests, $x^{test}_{tz}$, is given by the random variable $\hat{I}_{t+1,z}$. The random variable $\hat{I}_{t+1,z}$ has a binomial distribution with parameters $(n, p) = (x^{test}_{tz}, p^{test}_{tz})$. Lemma 3.1 implies that the parameter of the binomial distribution for the observation function does not reflect the true percentage of infected individuals in the population on average, but a shifted distribution with a bias unknown to the controller.

This section proposes a model for the controlling agent using the five components of the unified framework. The controller does not have access to the environment agent's state variable, $S^{env}_t$, or the transition functions describing how it evolves, $S^{M,env}(S^{env}_t, W^{env}_{t+1})$. Due to the lack of this necessary information, the controller must make assumptions about the components of the environment agent's model.
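The sketch below illustrates the kind of biased sampling oracle described above, i.e. what actually generates the observations $\hat{I}_{t+1,z}$ that the controller must build beliefs from; the functional forms of $c(\cdot)$ and $d(\cdot)$ and the way $f_p$ and $f_n$ enter are illustrative assumptions rather than the exact expression in Lemma 3.1.

```python
import numpy as np

def p_test_positive(p_true, a, b, c, d, f_p, f_n):
    """Assumed form of the probability that a tested person tests positive.

    p_true   : true fraction of infected individuals in the zone (I_tz / N_z)
    a, b     : P(symptoms | positive), P(symptoms | negative)
    c, d     : P(seek test | symptoms), P(seek test | no symptoms)
    f_p, f_n : false positive and false negative rates
    """
    seek_pos = a * c + (1 - a) * d               # P(seek test | positive)
    seek_neg = b * c + (1 - b) * d               # P(seek test | negative)
    tested_pos = p_true * seek_pos               # fraction of zone: positive and tested
    tested_neg = (1 - p_true) * seek_neg         # fraction of zone: negative and tested
    frac_pos = tested_pos / (tested_pos + tested_neg)   # biased by symptom-driven testing
    return (1 - f_n) * frac_pos + f_p * (1 - frac_pos)  # test errors shift it further

def sample_observation(x_test, p_true, a, b, c0, d0, f_p, f_n, rng):
    """Draw the biased observation I_hat ~ Binomial(x_test, p_test) for one zone."""
    # Willingness to get tested assumed to rise with the number of kits allocated.
    c = min(1.0, c0 + 0.5 * (1.0 - np.exp(-x_test / 500.0)))
    d = min(1.0, d0 + 0.2 * (1.0 - np.exp(-x_test / 500.0)))
    p_test = p_test_positive(p_true, a, b, c, d, f_p, f_n)
    return rng.binomial(x_test, p_test), p_test

rng = np.random.default_rng(2)
I_hat, p_test = sample_observation(x_test=200, p_true=0.08, a=0.6, b=0.05,
                                   c0=0.30, d0=0.02, f_p=0.01, f_n=0.05, rng=rng)
print("observed positives:", I_hat, "| p_test:", round(p_test, 4),
      "vs true infection rate 0.08")
```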
We are going to start by outlining the belief model, before giving the five elements of the controller model. In a sequential decision problem with imperfectly known states and state transitions, the controller must maintain a belief about the unknown quantities.

Definition 4.1. The SIR model assumes that the total population remains constant. Therefore, the belief about the true percentage of each subpopulation in a zone has the property $\hat{p}^{susc}_{tz} + \hat{p}^{inf}_{tz} + \hat{p}^{rec}_{tz} = 1$.

The property implies that the belief about the subpopulations of each zone has a multinomial distribution with parameters $(S_{tz}, I_{tz}, R_{tz}) \sim \text{Mult}(N_z, \hat{p}^{susc}_{tz}, \hat{p}^{inf}_{tz}, \hat{p}^{rec}_{tz})$. Therefore, the belief state will contain the dynamically changing parameters of the multinomial distribution for each zone.

The controlling agent assumes a functional form for the transition functions in the environment model. The transition functions in the environment model are used to update the state variable known perfectly to the environment agent. Therefore, the functions in this section would describe the set of equations to update the state variables if there were no uncertainty in the state at time $t$. Let $(S_t, I_t, R_t)$ be vectors containing the number of susceptible, infected and removed individuals in each zone $z$, respectively. These vectors are the physical state variables for the environment agent. If the controlling agent had access to the true physical state variables and made the decision $x^{vac}_t$, then it believes the environment will transition to the next state according to the transition equations (14 - 16), which describe the belief about how each environment state variable evolves. Since the true state variable is not fully observable, each element of $(S_t, I_t, R_t)$ is a random variable; hence, the function cannot be evaluated by the controlling agent because it only has a belief state describing the random variables. However, the equations are important because they must be used to derive the updating equations for the belief state variables.

The controlling agent will decide to allocate $x^{test}_{tz}$ testing kits to zone $z$ at time $t$. Between time $t$ and $t+1$ the testing kits are administered and produce a random response, $\hat{I}_{t+1,z}$, which represents the total number of positive cases measured in zone $z$ between $t$ and $t+1$. Assume the random variable $\hat{I}_{t+1,z}$ has a binomial distribution with parameters $(n, p) = (x^{test}_{tz}, p^{test}_{tz})$. The response would be analogous to the output of the observation function defined in the POMDP model, and the specifics are outlined in section 3.4. The probability $p^{test}_{tz}$ is the probability of randomly selecting an infected individual out of the population of individuals who received a test at time $t$ in zone $z$. This probability is affected by factors such as the likelihood of going to get a test while showing symptoms, the likelihood of showing symptoms while positive, and the probability of false positives and false negatives. The exact impact those factors will have on $p^{test}_{tz}$ is not known to the controller, but the policies we discuss later will try to make decisions which can mitigate the risks imposed by the sampling biases. The belief state updating equations take the observations, $\hat{I}_{t+1,z}$, queried from the testing centers and use them to estimate the new belief state at $t+1$. Each of the three parameters $\hat{p}^{susc}_{t+1,z}$, $\hat{p}^{inf}_{t+1,z}$, and $\hat{p}^{rec}_{t+1,z}$ in each zone $z$ will need an updating equation.
Note that the infected population is being observed directly with $\hat{I}_{t+1}$, but the susceptible and removed populations are not. The observations are assumed to be drawn from a binomial distribution with parameters $(x^{test}_{tz}, p^{test}_{tz})$. $p^{test}_{tz}$ is unknown to the controller, but a sample, $\hat{I}_{t+1,z}$, was just observed from its distribution. The beta prior is a conjugate prior to the binomial distribution, and through Bayes' theorem the updated distribution will be beta-binomial. The resulting distribution can be thought of as the binomial parameter $p^{test}_{tz}$ being drawn from a beta distribution with parameters $\alpha^{prior}_{tz}$ and $\beta^{prior}_{tz}$.

Lemma 4.1. Let $\alpha^{prior}_{tz}$ and $\beta^{prior}_{tz}$ be parameters of a beta distribution and $\hat{I}_{t+1,z}$ be a sample drawn from a binomial distribution with $x^{test}_{tz}$ trials. The compound distribution produced by Bayes' theorem is a beta-binomial distribution, and the estimator for the probability of an infection, $\hat{p}^{inf}_{t+1,z}$, is given by
$$\hat{p}^{inf}_{t+1,z} = \frac{\hat{I}_{t+1,z} + \alpha^{prior}_{tz}}{x^{test}_{tz} + \alpha^{prior}_{tz} + \beta^{prior}_{tz}}. \quad (19)$$
Proof. See Appendix.

After $\hat{p}^{inf}_{t+1,z}$ is updated, the remaining belief state variables must be updated as well.

Definition 4.2. Let $\Pi_{\Delta_z}$ be the projection operator for the set defined by
$$\Delta_z = \left\{ (\hat{p}^{susc}_{t+1,z}, \hat{p}^{rec}_{t+1,z}) \in [0, 1]^2 : \hat{p}^{susc}_{t+1,z} + \hat{p}^{rec}_{t+1,z} = 1 - \hat{p}^{inf}_{t+1,z} \right\}.$$

The expected susceptible and removed subpopulations of each zone are computed by taking the expectation of equations (14) and (16), respectively. If the terms are not in the set $\Delta_z$ in definition 4.2, then they must be projected back to the nearest point; this set of equations is given by equations (20 - 22). In summary, the controlling agent updates the parameters in the belief state through the following process: 1) make observations $\hat{I}_{t+1,z}$ for all $z \in \mathcal{Z}$; 2) compute the estimator of $\hat{p}^{inf}_{t+1,z}$ in equation (19); 3) use equations (20 - 22) to update $\hat{p}^{susc}_{t+1,z}$ and $\hat{p}^{rec}_{t+1,z}$. It is important to emphasize that the beliefs $\hat{p}^{susc}_{t+1,z}$ and $\hat{p}^{rec}_{t+1,z}$ would not be possible to update through observations of $\hat{I}_{t+1}$ without the belief about the transition equations presented in section 4.1.2. The belief about the transition equations may not match the true model (as we demonstrate in this paper) because it is usually an approximation of the real world, which is usually not possible to know perfectly.

After the belief model is fully defined, the five components of the unified framework naturally follow. The state variables include the information which is needed to compute the transition functions, objective function, and policy for making decisions at time $t$. Any information which is not changing dynamically remains a latent variable defined in the initial state. The state variable for the base model is defined as
$$S^{cont}_t = (\hat{p}^{susc}_t, \hat{p}^{inf}_t, \hat{p}^{rec}_t, n^{vac}_t, n^{test}_t). \quad (23)$$
Let the state space be denoted by $\mathcal{S}^{cont}$. The initial state contains the variables which are not changing dynamically.

There are two sets of decisions at each time step: the information collection decision of how many tests to allocate and the implementation decision of how many vaccinations to allocate to each zone. The decision to allocate vaccines to each zone is given by $x^{vac}_{tz}$, which is constrained by the total number of vaccines available, $n^{vac}_t$. The decision to allocate testing kits to each zone is given by $x^{test}_{tz}$, which is constrained by the total number of tests available, $n^{test}_t$. Hence, the vaccine decision variable is constrained by
$$\sum_{z \in \mathcal{Z}} x^{vac}_{tz} \leq n^{vac}_t,$$
and the testing decision variable is constrained by
$$\sum_{z \in \mathcal{Z}} x^{test}_{tz} \leq n^{test}_t.$$
The set of decisions can be characterized by $x_t = (x^{vac}_t, x^{test}_t)$. The exogenous information, $W^{cont}_{t+1}$, represents all the information that arrives between time $t$ and $t+1$.
It is given by, where, I t+1,z = number of positive tests sampled, n vac t+1 = the number of vaccines available at t + 1, n test t+1 = the number of tests available at t + 1. The number of tests and vaccines available at t+1 is a process evolving completely exogenously from the model. These values could be determined by the bottlenecks in the manufacturing processes from the distributors. It is assumed that at the beginning of each time step this number is revealed to the system; however, it does not attempt to model the stochastics behind it. The transition function is the set of equations which describe how each of the state variables evolve between time t and t + 1. The state variable is given in Equation (23). A simple belief about the transition function would assume the basic SIR model where each zone is evolving independently. Equations (14 -16) from section 4.1.2 are defined as, where,S tz = N zp susc tz , σ susc tz = N zp susc tz (1 −p susc tz ), and Φ is the standard normal cdf and φ is the standard normal pdf. Proof. See Appendix. If α prior tz =Ī t+1,z and β prior tz = N z −Ī t+1,z , then Lemma 4.1 gives the updating procedure forp inf t+1,z . The other belief state variables,p rec t+1,z andp susc t+1,z , are computed using equations (20 -22) and the result of Lemma 4.2. Hence, The objective function measures the performance of the system. It will measure how well the vaccine and testing allocation decisions impact the state of the epidemic. The goal of the problem is to come up with a decision making strategy, or policy, to minimize the total number of infected people in the population throughout the entire time horizon. Therefore, the cumulative reward objective can be posed as the following optimization problem, The functionC(S cont t , X π (S cont t ), W t+1 ) is a one period cost function describing the metric which is evaluated by the controller at each time period. The goal of mitigating the spread of disease is to minimize the number of infected people, but it is not possible to perfectly observe the true number of infected people at t + 1. If the controller had access to the true state and transition equations, then the goal would be to minimize the true one-step cost given by, Equation (37) is not observable, hence the controller will minimize the expected number of individuals infected with the disease, given the current belief state. The belief state contains the parameters of the probability distribution over the environment state variables, hence the controllers cost function is given by, To evaluate this cost function, the belief about the transition function for the infected number of individuals must be used to predict what the impact of the decision will be, hence equation (33) must be used to evaluate the right hand side of equation (38). The policy is a mapping from the state space to the decision space. At time t there are implementation decisions (the number of vaccines to allocate to each zone) and learning decisions (the number of testing kits to allocate to each zone). In section 5.1, we illustrate two types of implementation policies: one from the PFA class and one from the DLA class. In section 5.2, we present a CFA learning policy for deciding the zones with the most valuable information to collect through testing. The implementation decision in this problem setting is to choose how many vaccines to send to each zone in the nation, x vac tz . 
The decision space for the next set of policies is given by, The optimal policy for making a decision is given by Bellman's equation defined in equation (3), where V * t+1 (·) is the optimal value function at t+1 if we knew the true state of the environment. The dimension of the state variable defined in section 4.2 is 3|Z| + 2, therefore as |Z| grows it becomes computationally intractable to solve for the optimal policy due to the dimensionality of the state variable. When the optimal policy is not possible to achieve, then the goal of the problem becomes finding the best approximation to it. When designing the approximation, there are four possible classes of policies to choose from, which are defined in section 2.3. The PFA class of policies is an analytical function which maps states to actions, and usually includes tunable parameters. These policies can be used to characterize a simple rule-based method, which is why these are most commonly used in practice. Some examples of very simple PFA policies could be to allocate the set of vaccines evenly among the zones, proportionally by population, or proportionally by the belief of the number of infected individuals. In this paper, the PFA we chose to analyze consists of a sigmoid function, which is a function ofp susc tz , with tunable parameters θ P F A = (θ P F A 0 , θ P F A 1 ). The sigmoid function for each zone is given by, Then, to proportionally allocate each set of responses and ensure that the solution remains in the decision set, the policy becomes: This policy operates quickly online, but it requires a time consuming offline grid search to find the best set of hyperparameters. The policy will proportionally allocate the vaccines based on the magnitude of the response caused by the belief about the percentage of the population that is susceptible to the disease. In a direct lookahead approximation, an approximate model of the future is created to make the best decisions at t by looking at the impact of decisions in the future. The lookahead model will approximate the base model by choosing the θ risk -percentile of the distribution represented by the distribution of the belief state, and using this value as the lookahead state variable. The lookahead model will simulate the future by using the belief about the transition functions in equations (27 -30) . This is a reasonable approximation of the base model; however, it omits performing active learning. To create a proper lookahead model the five components of a sequential decision model defined by the unified framework in section (2.3) must be defined. A lookahead variable will have a tilde and two time indices, where the first is the time within the base model and the second index is the time within the lookahead model. Lookahead State Variable. The distribution of each subpopulation of each county is approximated by a normal distribution defined in Lemma 4.2; hence the θ risk -percentile of the susceptible distribution is given byS θ tt = S tz − z θ σ susc tz . The number of infected individuals will be approximated by the mean of the distributioñ I ttz =Ī tz , and the number of removed individuals will beR θ ttz = N z −S θ ttz −Ĩ ttz . The model at time t does not have access to a forecast of the number of vaccines available in the future, hence the lookahead model will assume that the number of vaccines available at each t in the future, where t ≥ t, will be the same. Hence, the lookahead state for t in the future is given by, Lookahead Decisions. 
The lookahead model will not attempt to learn in the future, hence the testing decision will not be modeled. Therefore, the only lookahead decision is the decision about where to allocate vaccines, x tt . Each of these decision vectors is constrained by 1 Tx tt ≤ n vac t . Lookahead Objective Function. The lookahead objective function is given bỹ The optimization problem for a two-step lookahead approximation is given by, The optimization problem in equation (43) reduces to a problem with a nonconvex quadratic objective function with linear constraints. For problems where |Z| is not too large the policy can be solved with bilinear solvers. The details describing the solution to equation (43) can be found in the Appendix. The second type of decision is to select how many samples to observe from each zone through testing. At time t the controlling agent must decide which zones to send the n test t kits after the implementation policy has already been made. The learning decision will impact the distribution of the random variable for positive tests,Î t+1 , and consequently affect the future decisions about where to send vaccines. It is important to have estimates of each environment state variable, and this decision will change those estimates. The testing policy we propose consists of a tradeoff between proportionally allocating a percentage of the testing kits, and trying to minimize the variance in the belief state at t + 1. The logic behind this design is get close to an estimate with respect to mean-squared error globally in the region, but also ensure that each zone is being sampled. Equation (19) updates the belief about p inf t+1 by assuming the posterior has a beta-binomial distribution. If the estimator in equation (19) has a beta-binomial distribution, then its variance is given by, Let the sum of variances in equation (44) become the objective function. The testing policy is induced by the optimization problem when the sum of variances is the objective function, given by subject to 1 T x test ≤ n test t . Lemma 5.1. Let equation (45) be the optimization problem used to produce the learning decisions. The optimization problem reduces to a quadratic program with linear constraints given by, where, This is an integer quadratic program with linear constraints. The problem with the policy induced by equation (46) is it will greedily choose which zones to send kits to and ignore some zones with lower variance for too long. We can circumvent this problem by ensuring that a certain portion of the testing kits, ρ test , are distributed proportionally to the population of each zone, and the remaining 1 − ρ test testing kits are chosen greedily. This policy falls into the CFA class of policies because it has an embedded optimization problem, and the parameter ρ test must be tuned to find the best proportion on average. The greedy policy and the proportional policy are given by and, respectively. Hence, the resulting testing policy becomes, The best ρ test in this function class can be tuned through policy evaluation using the simulator defined in section 3. The environment agent model from section 3 was used to construct a simulator in python. There were two simulators constructed: one for a toy problem with 25 randomly generated zones in a region discussed in section 6.1, and one for a simulation of the states in the United States discussed in section 6.2. 
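To make the interaction between the environment agent, the belief updates, and the two decision types concrete, the following is a minimal sketch of a simulation loop of the kind used here; the dynamics and policies are simplified placeholders (simple proportional allocations stand in for the tuned PFA, DLA, and CFA policies of section 5), while the belief update follows the beta-binomial estimator of Lemma 4.1 with $\alpha^{prior}_{tz} = \bar{I}_{tz}$ and $\beta^{prior}_{tz} = N_z - \bar{I}_{tz}$.

```python
import numpy as np

def belief_update_inf(I_hat, x_test, alpha_prior, beta_prior):
    """Beta-binomial posterior mean for the infection probability (Lemma 4.1)."""
    return (I_hat + alpha_prior) / (x_test + alpha_prior + beta_prior)

def run_simulation(T=30, Z=5, seed=3):
    rng = np.random.default_rng(seed)
    N = rng.integers(50_000, 500_000, size=Z).astype(float)   # zone populations
    I = 0.10 * N                                               # hidden true state
    S, R = N - I, np.zeros(Z)
    beta = rng.uniform(0.2, 0.35, size=Z)
    gamma, xi = 0.1, 0.9
    p_inf_hat = np.full(Z, 0.10)                               # controller's belief
    total_infections = I.sum()
    for t in range(T):
        n_test, n_vac = 5_000 * (t + 1), 2_000 * (t + 1)       # growing supply
        # Learning decision: proportional-to-population placeholder for the CFA policy.
        x_test = np.floor(n_test * N / N.sum()).astype(int)
        # Biased observation: the tested sample over-represents symptomatic positives.
        p_test = np.clip(1.3 * I / N, 0.0, 1.0)                # assumed bias factor
        I_hat = rng.binomial(x_test, p_test)
        # Belief update for each zone (Lemma 4.1 with alpha = I_bar, beta = N - I_bar).
        alpha_prior, beta_prior = p_inf_hat * N, (1.0 - p_inf_hat) * N
        p_inf_hat = belief_update_inf(I_hat, x_test, alpha_prior, beta_prior)
        # Implementation decision: proportional to the believed susceptible population
        # (placeholder for the PFA / risk-adjusted DLA policies of section 5).
        S_hat = np.maximum(N - p_inf_hat * N - R, 0.0)
        x_vac = n_vac * S_hat / max(S_hat.sum(), 1.0)
        # Simplified environment transition (independent-zone SIR step with vaccination).
        vaccinated = xi * np.minimum(S, x_vac)
        new_inf = np.minimum(beta * I / N * (S - vaccinated), S - vaccinated)
        S = S - vaccinated - new_inf
        I, R = I + new_inf - gamma * I, R + vaccinated + gamma * I
        total_infections += new_inf.sum()
    return total_infections

print("cumulative infections:", int(run_simulation()))
```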
Since a vaccine would not be developed until a country is in the midst of a pandemic, the initial state of the infected population is randomly generated from a uniform distribution between 5% and 15% at $t = 0$. The testing and vaccine manufacturing capacity is assumed to be increasing (on average) to simulate the reality that there will not be a large number of doses at the onset of distribution into the population. The toy problem simulator was constructed by randomly generating 25 sample locations, which would be considered nodes of a graph, and then weighting the mobility transition matrix, $\hat{M}_{t+1,zz'}$, by the distance between the locations. Hence, individuals from nearby zones will have a higher probability of interacting with one another. Each zone has its own transmission rate, and the removal rate for the disease remains constant for each zone. The graph of the simulated region is shown in Figure 1. The weights of the edges of the graph represent the probability of individuals from each zone interacting with one another. Each of the policies in section 5 was implemented, and the cumulative total of infected individuals in each zone can be seen in Figure 1. The null policy shows the number of infected individuals when no vaccines are distributed. It is clear that each of the other policies offers a significant improvement over the roughly 140,000 people who would be infected if there were no vaccine being distributed in this region. The tuned PFA offers a 62.2% reduction in cases over the null policy, and the DLA offers a 69% reduction in cases over the null policy.

In the model of the United States, the zones in the country are partitioned at the state level. The population data for each state was provided by the US Census Bureau. Each state in the country has a known population, $N_z$. The probability matrix, $\hat{M}_{t+1,zz'}$, for each state was weighted by the distance between each state because neighboring states are more likely to infect each other. The beta values in each state are created by taking the logarithm of the population density and rescaling it within a fixed range $(\beta_{min}, \beta_{max})$. The policies discussed in section 5 were implemented with $|\mathcal{Z}| = 51$ zones (including Washington, DC). The initial infection rate was generated from the record day of daily new-infection data in each state to demonstrate that this strategy could work at the peak of a pandemic. In the USA simulator, we assume that the vaccines are being manufactured and the initial vaccine total is about 1% of the population. As the manufacturing capabilities increase, the total vaccines available for distribution will increase stochastically. Under the null policy with the assumed $\beta_z$ values, the simulator predicts about 18 million people would be infected with the disease. The PFA policy improves on the null policy by 45% and the DLA improves on the null policy by 54%. The DLA policy will also prevent about 1.8 million more infections than the PFA policy over the given time horizon. When the problem is scaled up to 51 zones and the populations grow to be on the order of millions of people, the number of infected individuals has the potential to grow significantly larger. The number of susceptible people determines the rate at which the infected group will grow, so it is important to allocate vaccines strategically in areas with large populations of susceptible individuals. In the plot in figure 4, we show a breakdown of the country as the pandemic spreads with three very different scenarios.
In the first row, there are no vaccines so the infection can spread uncontrollably. In the second row, the PFA strategy is more likely to spread the vaccine evenly among states if there is a high percentage of susceptible individuals; hence the states with lower populations and transmission rates are targeted with more vaccines. The third row shows the DLA policy which is much more aggressive towards states with higher population density. It is obvious that states like New York, California, and Illinois, which have large cities, will spread the virus within their own state at a higher rate than states like Wyoming or Montana. Furthermore, many individuals travel within those high population states, so when the DLA targets those states with vaccines it will have higher order effects. Hence, the DLA is capable of mitigating those effects. It is intuitive that early action will always produce the fewest number of infections in the population, which is also why an aggressive DLA has better results. If we assume that the death rate due to the disease is fixed throughout the time horizon, then it is just a linear function of the number of infected individuals. Therefore, every infection prevented will be proportional to the number of deaths prevented from the disease. The risk-adjusted DLA policy prevented 1.8 million more infections than the PFA policy. Hypothetically, if a 1% death rate is assumed, then the DLA policy would save about 180,000 more lives than the myopic PFA. This work presented a formal model for managing vaccines and testing kits during a pandemic using the unified framework. During a pandemic, the controller would not know the true state of the number of susceptible, infected and removed individuals; hence, the belief modeling in this paper captures the belief states and belief transitions. The multiagent modeling strategy in this paper is used to distinctly separate the modeling of the environment from the modeling of the controller. The strategy allows the environment to be considered a blackbox and the controller can only query observations and make decisions which impact it, which is why we focus on a tunable policy which can be optimized to mitigate the risks of an uncertain environment. Hence, if a more advanced environment model is applied to the problem, then the controller would still be able to robustly be applied. There were two types of decisions in this paper: the implementation decision and the learning decisions. This paper compared two implementation policies in the PFA class and the DLA class while using a CFA active learning policy for the testing decisions. The results demonstrate that the DLA class with a tuned risk adjustment parameter will perform the best in the toy problem and the US states problem. There are many more complex policies which could be applied to this problem, but we selected two that would illustrate the concepts most clearly. The PFA policy is intended to illustrate a very basic policy which is more likely to be used in practice than the DLA. The DLA policy is obviously more complex because it uses lookahead modeling and optimization, while the PFA is an analytical function. Therefore, we demonstrate that a well designed DLA policy can significantly outperform the basic policies most likely to be used in practice by myopic decision-makers. 8. Bibliography Lemma 3.1. Proof. This proof uses definitions 3.1-3.4. 
Since each of the probabilities are impacted by the number tests allocated there is an implicit assumption that x test tz is fixed. Hence, the probability of getting a test given that the person is positive to be written as, . These terms can be rearranged to arrive at, Let f p and f n be the probability of false positive and false negative, respectively. The following equation computes the probability of testing positive given that the person goes to get tested, Lemma 4.1. Let α and β be parameters of a beta distribution. Assume a sampleÎ from a binomial distribution after n trials and each trail has probability p. The pdf of the prior beta distribution is given by, π(x|α, β) = 1 B(α, β) x where, B(α, β) = Γ(α)Γ(β) Γ(α + β) . The likelihood of observingÎ t+1 is given by, Therefore, the Bayesian update is given by, P[p|α, β] = L(p = x|k)π(x|α, β) 1 0 L(p = x|k)π(x|α, β)dx The integral in the denominator is given by, 1 0 x k+α−1 (1 − x) n+β−k−1 dx = B(α + k, n + β − k). Hence, x k+α−1 (1 − x) n+β−k−1 B(α + k, n + β − k) , If we assume that the size of the population is fairly large, then through the central limit theorem we claim that each of the random variables S tz , I tz , and R tz can be approximated by independent normal distributions with parameters (S tz , σ susc tz ), (Ī tz , σ inf tz ) and (R tz , σ rec tz ), respectively. Assume S tz ∼ N S tz , (σ susc tz ) 2 , I tz ∼ N Ī tz , (σ inf tz ) 2 , and independence, then Through the identity in Zhan and Xing (2020) , the term in the previous equation is given by, The sum of the terms in equation (50) can be rewritten as, z∈ZĨ t,t+1,z = 1 T v + u Tx tt . The objective function can then be rewritten as, where we also assume the constrain z + w x tt >x t,t+1 . The terms that do not depend onx t,t+1 orx tt can be removed from the objective function; hence, = (2 − γ)(u Tx tt ) + (b (v + u x tt )) T (z + w x tt ) − (b (v + u x tt )) T (x t,t+1 ), Submitted. The information-collecting vehicle routing problem: Stochastic optimization for emergency storm response. Submitted to Transportation Science Optimal control of vaccine distribution in a rabies metapopulation model Optimal control of markov decision processes with incomplete state estimation Resource allocation for control of infectious diseases in multiple independent populations: beyond cost-effectiveness analysis Contracting for on-time delivery in the u.s. influenza vaccine supply chain. Manufacturing & Service Operations Management Rabies in raccoons: optimal control for a discrete time model on a spatial grid Backward approximate dynamic programming with hidden semi-markov stochastic models in energy storage optimization Winter Simulation Conference, IEEE Optimal online learning for nonlinear belief models using discrete priors Solving pomdps with continuous or large discrete observation spaces A contribution to the mathematical theory of epidemics. 
Proceedings of the Cold chain transportation decision in the vaccine supply chain A model for the optimal control of a measles epidemic Optimizing influenza vaccine distribution Optimal vaccine distribution in a spatiotemporal epidemic model with an application to rabies and raccoons Finding optimal vaccination strategies for pandemic influenza using genetic algorithms Anytime point-based approximations for large pomdps A unified framework for stochastic optimization Approximate Dynamic Programming: Solving the Curses of Dimensionality Clearing the jungle of stochastic optimization A unified framework for optimization under uncertainty From reinforcement learning to optimal control: A unified framework for sequential decisions Stochastic Optimization and Learning: A unified framework Online planning algorithms for pomdps Finding approximate pomdp solutions through belief compression On the analysis of a multi-regions discrete sir epidemic model: an optimal control approach Expected improvement for expensive optimization: a review Scalable vaccine distribution in large graphs given uncertain data which is a beta distribution with parameters α + k and n + β − k. If π(x|α, β) was the prior distribution for the value of p, then after the observationÎ = k the posterior will also have a beta distribution with parameters α + k and n + β − k. Therefore, the mean of p given by the posterior distribution iŝ p =Î + α n + α + β .If X is a random variable with a beta distribution, X ∼ π(x|α, β). Then its expected value is given by,Any samples drawn from this distribution can be thought of as being drawn from a binomial distribution with parameter p being drawn from the prior beta distribution with parameters α and β. The terms α and β can be thought of as "pseudo -observations" of the successes and failures. For Lemma 4.1Lemma 4.2. Assume that (S tz , I tz , R tz ) has a multinomial distribution with parameters (N z ,p susc tz ,p inf tz ,p rec tz ). The subpopulations of each zone are not truly independent; however, when the populations of the zones are large it is a reasonable assumption to make. Let Y tz = min(S tz , x vac tz ). The percent error introduced into the model by assuming there are no correlations would be given by, .Let be the percent error by assuming that S tz and I tz are independent random variables given by,If the errors, , are equal to 1 Nz−1 , then when the population is large the percent error will get very small. Consider the case S tz < x vac tz , this implies that Y tz = S tz ; hence, = . The other case is whenNz−1 . Therefore, the percent errors are less than or equal to 1 Nz−1 which become very small as N z gets large when independence is assumed; hence it is a reasonable assumption.The belief about the environment agent transition functions for time t + 1 is given by,Finally, through the weak law of large numbers, asx vac tz ] because it has finite variance. The other belief state variables have the same result. Therefore the estimates of the belief state variables when they are assumed to have independent normal distributions are given by, We estimate each expectation by assuming the random variables are normally distributed. Hence, where d = the two-step lookahead problem in Equation (43) becomes a quadratic program if 1 TSθ tt > n vac t , otherwise the solution is x DLA t =S θ tt . 
The quadratic program for the non-trivial solution is given by, We know from equation (29) that the transition equation forĨ t,t+1,z is given by,Since these the lookahead state variables are deterministic point estimates, then including the constraint S ttz >x ttz will not change the results of the optimization problem in equation (43). The constraint implies (S ttz −x ttz ) + =S ttz −x ttz . Assume the following vectors are defined by,