key: cord-0866362-n39ol2u6
authors: Zong, Kai; Luo, Cuicui
title: Reinforcement learning based framework for COVID-19 resource allocation
date: 2022-01-29
journal: Comput Ind Eng
DOI: 10.1016/j.cie.2022.107960
sha: 1dac93788be780edca1c564b58db8224b2c8fb46
doc_id: 866362
cord_uid: n39ol2u6

In this paper, a reinforcement learning based framework is developed for COVID-19 resource allocation. We first construct an agent-based epidemic environment to model the transmission dynamics in multiple states. Then, a multi-agent reinforcement-learning algorithm is proposed based on the time-varying properties of the environment, and the performance of the algorithm is compared with other algorithms. According to the age distribution of populations and their economic conditions, the optimal lockdown resource allocation strategies of Arizona, California, Nevada, and Utah in the United States are determined using the proposed reinforcement-learning algorithm. Experimental results show that the framework can adopt more flexible resource allocation strategies and help decision makers determine the optimal deployment of limited resources in infection prevention.

• An agent-based simulation environment for the COVID-19 epidemic is proposed.
• A new multi-agent reinforcement-learning algorithm is proposed.
• Resources are allocated effectively by applying the framework.

The COVID-19 outbreak is the global pandemic with the widest impact that humanity has encountered in the past century. The health care systems of many countries have been severely tested. In the early stage of the outbreak, many hospitals faced shortages of staff and medical supplies, and patients who could not be treated in time had higher mortality rates [1]. Even now, health systems in some areas are under pressure from shortages of medical supplies and beds. Although some countries have developed vaccines against the SARS-CoV-2 virus, vaccination coverage in some countries remains low. As the outbreak continues and the virus mutates, vaccines become less effective against newly mutated variants. There are many measures to control the COVID-19 pandemic, some of which can be deployed in a timely manner, such as lockdowns, wearing face masks, and maintaining personal hygiene. Compared with limited and expensive medical resources, more effective use of such non-pharmaceutical intervention measures would not only alleviate the pressure on health systems, but also significantly reduce the number of additional deaths caused by delays in receiving care.

In this paper, we construct an agent-based epidemic simulation environment for COVID-19. Based on the time-varying properties of the environment, a new multi-agent reinforcement-learning algorithm is also proposed. The proposed algorithm is applied to explore optimal lockdown resource allocation strategies.

The contributions of this paper are threefold. First, an agent-based multi-agent reinforcement-learning simulation environment for the COVID-19 epidemic is proposed. This environment can not only simulate fine-grained interactions among people at specific locations, but also simulate the population flow between U.S. states with different economic structures and age distributions. Second, a new multi-agent reinforcement-learning algorithm that captures the time-varying nature of the environment is developed, and the results show that the algorithm has better performance.
Third, real epidemic transmission data are used to calibrate the environment, so that the simulation results are more consistent with the real situation. The calibrated framework is then used to explore the optimal lockdown resource allocation strategies among four states in the United States.

Recent studies have identified artificial intelligence (AI) techniques as a promising technology for various healthcare providers. Reinforcement learning is a machine learning technique for goal-oriented learning and automated decision-making. In recent years, reinforcement learning has been applied in many fields and has achieved remarkable results [2, 3, 4].

There is a substantial literature on the application of reinforcement learning to supply chain management and resource allocation [5, 6]. For example, [7] developed a model for energy emergency supply chain coordination, which combined reinforcement learning with emergency supply chain collaboration optimization under group consensus. [8] studied the dynamic resource allocation of edge computing servers in the Internet of Things based on reinforcement learning. [9] explored the application of reinforcement learning to wireless resource allocation in vehicular networks. [10] developed a multi-agent reinforcement learning framework to study the dynamic resource allocation of multi-unmanned-aerial-vehicle communication networks in order to maximize long-term benefits; their simulation results show that the multi-agent reinforcement learning algorithm can achieve a good trade-off between performance gains and information exchange costs. [11] applied multi-agent reinforcement learning to channel allocation and power control in heterogeneous vehicular networks and showed that this algorithm has advantages over other reinforcement-learning-based resource allocation schemes.

Some efforts have also been made to apply reinforcement learning to COVID-19-related research. [12] employed reinforcement learning to explore COVID-19 lockdown strategies, and the experimental results showed that the learned strategy can strike a balance between controlling the spread of the epidemic and economic development. [13] applied reinforcement learning to the redistribution of medical supplies, and the effectiveness of the algorithm was demonstrated experimentally.

To study the evolution of COVID-19, predict the timing of the next outbreak, and test the effects of intervention measures, many researchers have constructed epidemic transmission models. There are three main types of epidemic transmission models: compartment models, network models, and agent-based models. Since an agent-based model can explicitly track the current disease state of an individual and the interactions between that individual and others, it has been widely used to study the transmission of, and response to, COVID-19. Using population mobility data and demographic data, [14] constructed an agent-based model of COVID-19 transmission in the Boston metropolitan area. [15] applied an agent-based model to simulate resident interactions in Belgium and found that it was critical to complete contact tracing within four days of symptom onset. [16] simulated the spread of COVID-19 by building an agent-based model that included household structure, age distribution, and comorbidity.
[17] used an agent-based model to investigate the effectiveness of repeated population screening.

The rest of this paper is organized as follows. Section 3 describes the multi-agent reinforcement learning algorithm proposed in this paper, the multi-agent recurrent attention actor-critic (MARAAC) algorithm. Section 4 introduces the COVID-19 epidemic simulation environment designed in this study. Section 5 describes the simulation experiment settings. Section 6 presents the experimental results, and Section 7 concludes the paper.

Markov Games. A Markov game is an extension of game theory to Markov-decision-process-like environments [18]. In general, a Markov game can be represented by the tuple

(N, S, A_1, ..., A_N, T, r_1, ..., r_N),

where N is the number of agents and S is the system state, which generally refers to the joint state of multiple agents. A_1, ..., A_N are the action sets of the agents. T is the state-transition function, T : S × A_1 × ⋯ × A_N → P(S), where P(S) is a probability distribution over states with values in [0, 1]; that is, the probability distribution of the next state is determined by the current system state and the current actions of all agents. r_i(s, a_1, ..., a_N) represents the reward obtained by agent i after the joint action is performed in state s. The goal of agent i is to maximize its discounted expected reward E[∑_{j=0}^{∞} γ^j r_{i,t+j}], where γ is the discount factor and r_{i,t+j} is the reward obtained by agent i at time t + j.

Attention Mechanism. In recent years, attention mechanisms have been applied in many areas of machine learning. In brief, an attention mechanism is a technique that allows a model to focus on and learn from important information. In the algorithm proposed herein, each agent can observe information about the observations and actions of other agents, and then incorporates this information into the estimation of its value function. An attention function can be described as mapping a query and a set of key-value pairs to an output. The output is a weighted sum of the values, and the weight of each value is calculated by a compatibility function of the query with the corresponding key [19]. Scaled dot-product attention is calculated as

Attention(Q, K, V) = softmax(QK^⊤ / √d_k) V,

where Q, K, and V represent the queries, keys, and values, respectively, and d_k is the dimension of the keys.

A standard feed-forward network retains no memory of past inputs, which makes it difficult to deal with time-series problems with long-term dependence. However, long short-term memory (LSTM) is universal and effective in capturing long-term time dependence [20]. As a variant of LSTM, the gated recurrent unit (GRU) simplifies the structure of LSTM while retaining its advantages. The GRU cell can be considered as a black box that takes the input of the current moment and the hidden state of the previous moment, and generates the hidden state of the current moment. The update formulas are

z_t = σ(W_z x_t + U_z h_{t−1} + b_z),
r_t = σ(W_r x_t + U_r h_{t−1} + b_r),
h̃_t = tanh(W_h x_t + U_h (r_t ⊙ h_{t−1}) + b_h),
h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t,

where r_t is the reset gate, z_t is the update gate, σ is the sigmoid function, tanh is the activation function, and ⊙ denotes element-wise multiplication.

Consider a task with N agents. Let π = {π_1, ..., π_N} represent the N random policies adopted by the agents and θ = {θ_1, ..., θ_N} represent their parameters, where θ_i parameterizes policy π_i. The action of agent i under observation o_i can be expressed as π_{θ_i}(a_i | o_i). Adding recurrence allows the network to better estimate the underlying system state [21]. In this paper, a GRU is therefore introduced into each agent's policy:

hx_i^t = GRU(o_i^t, hx_i^{t−1}),    π_{θ_i}(a_i | o_i^t) = softmax(W_π hx_i^t + b_π),

where hx_i^{t−1} and hx_i^t are the hidden states of the GRU cell at times t − 1 and t, and b_π is a bias.
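To make the recurrent policy concrete, the following is a minimal PyTorch sketch of a GRU-based stochastic policy as described above. It is our own illustration, not the authors' released code: the layer sizes, the ReLU encoder, and the categorical softmax head are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RecurrentPolicy(nn.Module):
    """Sketch of pi_{theta_i}(a_i | o_i) with a GRU hidden state hx that
    carries the observation history across time steps."""

    def __init__(self, obs_dim: int, action_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, hidden_dim)   # embed observation o_i^t
        self.gru = nn.GRUCell(hidden_dim, hidden_dim)   # hx_i^t = GRU(o_i^t, hx_i^{t-1})
        self.head = nn.Linear(hidden_dim, action_dim)   # logits for pi(a_i | o_i)

    def forward(self, obs: torch.Tensor, hx: torch.Tensor):
        x = F.relu(self.encoder(obs))
        hx = self.gru(x, hx)                            # update hidden state
        probs = F.softmax(self.head(hx), dim=-1)        # categorical distribution
        return probs, hx

# Usage: sample one action for one agent at time t.
policy = RecurrentPolicy(obs_dim=10, action_dim=5)
hx = torch.zeros(1, 64)                                 # initial hidden state hx_i^0
probs, hx = policy(torch.randn(1, 10), hx)
action = torch.distributions.Categorical(probs).sample()
```

The returned hidden state is fed back in at the next decision step, which is what lets the policy condition on the unobserved epidemic state rather than on the current observation alone.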
The policy of agent i is updated by ascending the gradient

∇_{θ_i} J(π_θ) = E_{o∼D, a∼π}[ ∇_{θ_i} log π_{θ_i}(a_i | o_i) ( −α log π_{θ_i}(a_i | o_i) + A_i(o, a) ) ],

where D is the replay buffer that stores past experience and α is the temperature parameter determining the balance between maximizing entropy and rewards. A_i(o, a) is called the advantage function, which is used to solve the multi-agent credit-assignment problem. It comes from COMA's counterfactual baseline method, developed by [22]. By comparing the value of the action taken by agent i with the expected value over all actions agent i could have taken (with the other agents' actions held fixed), one can determine whether an increase in reward is due to the current action of agent i or to the actions of the other agents. That is, one can calculate the baseline in a single forward pass by outputting the expected return for every possible action that agent i can take:

A_i(o, a) = Q_i(o, a) − b(o, a_{\i}),    b(o, a_{\i}) = E_{a_i∼π_i}[ Q_i(o, (a_i, a_{\i})) ] = ∑_{a_i' ∈ A_i} π_i(a_i' | o_i) Q_i(o, (a_i', a_{\i})),

where a_{\i} denotes the actions of all agents other than i. Q_i(o, a) is the Q-value function for agent i. Its inputs are the action a_i taken by agent i and the environment observation o_i received by agent i, together with the other agents' contributions:

Q_i(o, a) = f_i( g_i(o_i, a_i), x_i ),

where g_i is a one-layer multi-layer-perceptron (MLP) embedding function and f_i is a two-layer MLP. x_i represents the other agents' contributions:

x_i = ∑_{j≠i} α_j v_j,    v_j = h( V g_j(o_j, a_j) ),

where v_j is an embedding of agent j, which encodes agent j through the embedding function g_j and a shared matrix V, and h is a nonlinear activation function. α_j is the attention weight of agent j relative to agent i, obtained from a softmax over similarity scores:

α_j ∝ exp( (W_k e_j)^⊤ W_q e_i ),    e_j = g_j(o_j, a_j),

where W_q transforms e_i into a query and W_k transforms e_j into a key. Under this attention-mechanism setting, the attention parameters are shared among all agents. All critics are updated together by minimizing the joint loss

L_Q(µ) = ∑_{i=1}^{N} E_{(o,a,r,o')∼D}[ ( Q_i^µ(o, a) − y_i )² ],    y_i = r_i + γ E_{a'∼π_{θ̄}(o')}[ Q_i^{µ̄}(o', a') − α log π_{θ̄_i}(a_i' | o_i') ],

where µ, θ, µ̄, and θ̄ are the parameters of the critic, policy, target critic, and target policy, respectively, and α is the temperature parameter. The structure of the algorithm is shown in Fig. 1. The algorithm is trained in K parallel environments to improve sample efficiency and reduce the variance of updates. Algorithm 1 gives the pseudo code of the MARAAC algorithm: in each iteration, and for each agent i, the critic is updated by minimizing the loss above, the actor is updated with the policy gradient, and the target network parameters are updated.

In this section, we construct an agent-based epidemic-transmission simulation environment. The environment simulates the contact of different individuals at different locations, including offices, schools, grocery stores, retail stores, restaurants, bars, and parks. The environment simulates a day as 24 discrete hours. In each hour, individuals can randomly decide whether to stay at home, go to work, go to school, travel to another city, and so on. The design of this environment is a modification of [23]; it supports simultaneous simulation of multiple cities, and residents can visit other cities. In this environment, multi-agent reinforcement learning is applied to simulate the impact of different government lockdown strategies on the spread of the epidemic. By assigning an agent to each city, the multi-agent reinforcement-learning algorithm can allocate different lockdown resources based on the characteristics of different states, so as to control the spread of the epidemic more effectively.

First, the environment randomly generates a specified number of people based on the pre-set age distribution of the population. The population is then divided into youth (aged 0-18), worker (aged 19-64), and elder (aged 65 and above) groups, each with its own characteristics. At the beginning of the epidemic simulation, some people are randomly selected and set to be exposed. The exposed people then transfer to one of the infectious states and interact with susceptible people.
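To make the hourly simulation loop concrete before turning to the infection probability, here is a minimal Python sketch under stated assumptions: the uniform location choice, the fixed per-contact probability beta (a stand-in for the calibrated probability defined next), and the omitted exposed-to-infectious progression are all our simplifications, not the paper's implementation.

```python
import random
from dataclasses import dataclass
from enum import Enum

class State(Enum):
    SUSCEPTIBLE = 0
    EXPOSED = 1
    INFECTIOUS = 2
    RECOVERED = 3

@dataclass
class Person:
    age_group: str                    # "youth", "worker", or "elder"
    state: State = State.SUSCEPTIBLE

LOCATIONS = ["home", "office", "school", "grocery_store", "retail_store",
             "restaurant", "bar", "park"]

def hourly_step(people, beta=0.02):
    """One simulated hour: each person picks a location, then susceptible
    visitors may be exposed by infectious visitors at the same location."""
    visits = {loc: [] for loc in LOCATIONS}
    for p in people:
        visits[random.choice(LOCATIONS)].append(p)    # simplified choice model
    for group in visits.values():
        n_infectious = sum(p.state is State.INFECTIOUS for p in group)
        p_escape = (1.0 - beta) ** n_infectious       # escape every contact
        for p in group:
            if p.state is State.SUSCEPTIBLE and random.random() > p_escape:
                p.state = State.EXPOSED               # E -> I omitted for brevity

# Seed the epidemic (e.g., 1% initially exposed) and simulate one day.
people = [Person(random.choice(["youth", "worker", "elder"]))
          for _ in range(1000)]
for p in random.sample(people, 10):
    p.state = State.EXPOSED
for hour in range(24):
    hourly_step(people)
```

The per-location escape probability (1 − beta)^n mirrors the product-of-contacts structure used by the environment's infection probability, described below.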
The probability that a susceptible person will be infected on a given day after contact with infected individuals is calculated as

P_{S→E} = 1 − P̄_{S→E},

where P̄_{S→E} is the probability that a susceptible person is not infected after any of that day's contacts with infectious individuals, i.e., the product of the per-contact non-infection probabilities. The relevant transmission parameters are listed in Table 1.

The schematic diagram of the learning process of the MARAAC algorithm in the simulation environment is shown in Figure 4. As Figure 4 shows, for an agent, during one learning step the algorithm first passes the current lockdown resource allocation policy to the environment. The environment then executes the policy, simulates the spread of the epidemic, and returns the resulting observation and reward to the agent, which updates its policy accordingly.

To study the performance of the MARAAC algorithm proposed in this paper, we compare it with the MAAC algorithm over the whole training process. In all environments, MARAAC converges faster than MAAC, and the final reward value is also larger (Table 2).

The objective of the environment is to minimize the economic loss while keeping the number of individuals in critical condition, p_C, below the hospital capacity M. Therefore, the reward function is

r = −α max(p_C − M, 0) − β ∑_{i∈Loc} Loc_i w_i,

where α and β are weights, Loc_i is the location weight, and w_i represents the lockdown level at location i.

The epidemic simulation environment used in this paper contains many artificial parameters. Although these parameters are chosen with reference to the real world, it is still doubtful whether they can truly reproduce the actual transmission situation. Therefore, real data are used to calibrate the parameters. Specifically, we compared the average time to peak deaths in the epidemic simulation environment against real data from Sweden. Sweden is representative of the countries in which the fewest restrictions were applied during the first wave of the epidemic and in which the transmission dynamics were therefore the most "natural" [30]. Bayesian optimization was applied to adjust the infection spread rate and scaling factor in the simulation environment; that is, we ran a grid search on these transmission parameters so that the simulated time to peak deaths matched the observed data.

This section mainly discusses how different values of the infection spread rate and the scaling factor affect the spread of the epidemic. The results are shown in Table 3. For all states, the peak number of infections, the time to peak, and the eventual number of deaths all increased as the infection spread rate increased. As the scaling factor increased, the peak number of infections, the time to peak, and the eventual number of deaths decreased. The result is intuitive, as an increased infection spread rate and an increased contact rate both worsen the epidemic.

The purpose of the experiment is to explore the optimal lockdown resource allocation strategy and to investigate the performance of the multi-agent reinforcement-learning algorithm proposed in this paper, MARAAC, so as to provide more insight into the simulation of the COVID-19 epidemic and its resource allocation strategies. To this end, two experiments were designed. In the first experiment, one agent was assigned to each state. In each step, each agent selected one allocation strategy to execute according to the five-level lockdown resource allocation strategies formulated in this paper; the details of the five levels are provided in Table 4. In the second experiment, each type of location in each state was assigned an agent, which allows the algorithm to fine-tune the allocation of lockdown resources. At the beginning of the epidemic simulation, 1% of the population in each state was randomly selected as exposed individuals.
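As a concrete illustration of this objective, the following is a minimal Python sketch of the reward computation as reconstructed above. The weight values and location names are illustrative assumptions, not values reported in the paper.

```python
def reward(p_critical: int, capacity: int, lockdown: dict, loc_weight: dict,
           alpha: float = 10.0, beta: float = 1.0) -> float:
    """Sketch of r = -alpha * max(p_C - M, 0) - beta * sum_i Loc_i * w_i:
    penalize critical cases exceeding hospital capacity M, plus the
    weighted economic cost of the current lockdown levels w_i."""
    overflow = max(p_critical - capacity, 0)          # critical above capacity
    econ_cost = sum(loc_weight[i] * lockdown[i] for i in lockdown)
    return -alpha * overflow - beta * econ_cost

# Example: two location types, lockdown levels w_i in [0, 1].
r = reward(p_critical=120, capacity=100,
           lockdown={"office": 0.5, "school": 1.0},
           loc_weight={"office": 1.0, "school": 0.6})
```

The two terms pull in opposite directions: stricter lockdowns reduce the overflow penalty in later steps but raise the immediate economic cost, which is the trade-off the agents learn to balance.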
To make the lockdown resource allocation strategy more in line with reality, the agents updated their strategies once per week (7 days), and the spread of the epidemic was simulated for 20 weeks (140 days). In order to evaluate the transferability of the reinforcement-learning algorithm, the trained algorithm was applied to an environment with a total population of 100,000; the result is shown in Figure 9. As can be seen from Figure 9, the results are robust, so the algorithm can quickly learn a strategy even when it is transferred to a larger environment. The results of the second experiment are displayed in Figure 10.

In this paper, a reinforcement learning based framework is constructed for COVID-19 resource allocation. First, we develop an agent-based COVID-19 epidemic simulation environment, which can simulate not only the interaction among people within a city but also the population flow among different states (several U.S. states were chosen for simulation). Then, a multi-agent reinforcement-learning algorithm, MARAAC, is developed based on the time-varying properties of the environment. For Arizona, California, Nevada, and Utah in the United States, the optimal lockdown resource allocation strategies are determined using the proposed framework. The experimental results show that the algorithm can adopt a more flexible allocation strategy according to the age distribution of the population and economic conditions, which provides insights for decision makers in supply chain management.

Location settings used in the simulation (values per location type):
Office: 8, 150, 0; School: 2, 30; Hospital: 1, 30, 10; Grocery Store: 6, 5, 30; Retail Store: 6, 5, 30; Restaurant: 2, 6, 30; Bar: 3, 5, 30
Office: 37, 150, 0; School: 8, 30; Hospital: 7, 30, 10; Grocery Store: 30, 5, 30; Retail Store: 30, 5, 30; Restaurant: 14, 6, 30; Bar: 15, 5, 30
Office: 2, 150, 0; School: 1, 50; Hospital: 1, 18, 6; Grocery Store: 2, 5, 30; Retail Store: 2, 5, 30; Restaurant: 1, 6, 30; Bar: 5, 3, 30
References
[1] Report 46: Factors driving extensive spatial and temporal fluctuations in COVID-19 fatality rates in Brazilian hospitals.
[2] Intelligent trading of seasonal effects: A decision support algorithm based on reinforcement learning.
[3] Mastering the game of Go without human knowledge.
[4] Superhuman AI for multiplayer poker.
[5] A reinforcement learning approach to parameter estimation in dynamic job shop scheduling.
[6] A self-learning genetic algorithm based on reinforcement learning for flexible job-shop scheduling problem.
[7] Energy emergency supply chain collaboration optimization with group consensus through reinforcement learning considering non-cooperative behaviours.
[8] Dynamical resource allocation in edge for trustable Internet-of-Things systems: A reinforcement learning method.
[9] Deep-learning-based wireless resource allocation with application to vehicular networks.
[10] Multi-agent reinforcement learning-based resource allocation for UAV networks.
[11] Multi-agent deep reinforcement learning based resource allocation for heterogeneous QoS guarantees for vehicular networks.
[12] Exploring optimal control of epidemic spread using reinforcement learning.
[13] On collaborative reinforcement learning to optimize the redistribution of critical medical supplies throughout the COVID-19 pandemic.
[14] Modelling the impact of testing, contact tracing and household quarantine on second waves of COVID-19.
[15] The impact of contact tracing and household bubbles on deconfinement strategies for COVID-19.
[16] Modeling between-population variation in COVID-19 dynamics in Hubei, Lombardy, and New York City.
[17] Test sensitivity is secondary to frequency and turnaround time for COVID-19 screening.
[18] Markov games as a framework for multi-agent reinforcement learning.
[19] Attention is all you need. Advances in Neural Information Processing Systems.
[20] LSTM: A search space odyssey.
[21] Deep recurrent Q-learning for partially observable MDPs.
[22] Counterfactual multi-agent policy gradients.
[23] Agent-based Markov modeling for improved COVID-19 mitigation policies.
[24] Evolving epidemiology and transmission dynamics of coronavirus disease 2019 outside Hubei province, China: A descriptive and modelling study.
[25] Spread of SARS-CoV-2 in the Icelandic population.
[26] Temporal dynamics in viral shedding and transmissibility of COVID-19.
[27] Estimates of the severity of coronavirus disease 2019: A model-based analysis.
[28] COVID-19 laboratory-confirmed hospitalizations: Preliminary data as of Sep 12.
[29] Actor-attention-critic for multi-agent reinforcement learning.
[30] COVID-19 and the Swedish enigma.