authors: Hu, Jueming; Xu, Zhe; Wang, Weichang; Qu, Guannan; Pang, Yutian; Liu, Yongming
title: Decentralized Graph-Based Multi-Agent Reinforcement Learning Using Reward Machines
date: 2021-09-30

In multi-agent reinforcement learning (MARL), it is challenging for a collection of agents to learn complex temporally extended tasks. The difficulties lie in computational complexity and how to learn the high-level ideas behind reward functions. We study the graph-based Markov Decision Process (MDP), where the dynamics of neighboring agents are coupled. We use a reward machine (RM) to encode each agent's task and expose the internal structure of the reward function. RMs have the capacity to describe high-level knowledge and encode non-Markovian reward functions. We propose a decentralized learning algorithm to tackle computational complexity, called decentralized graph-based reinforcement learning using reward machines (DGRM), that equips each agent with a localized policy, allowing agents to make decisions independently based on the information available to them. DGRM uses the actor-critic structure, and we introduce the tabular Q-function for discrete state problems. We show that the dependency of the Q-function on other agents decreases exponentially as the distance between them increases. Furthermore, the complexity of DGRM is related to the local information size of the largest $\kappa$-hop neighborhood, and DGRM can find an $O(\rho^{\kappa+1})$-approximation of a stationary point of the objective function. To further improve efficiency, we also propose the deep DGRM algorithm, using deep neural networks to approximate the Q-function and policy function to solve large-scale or continuous state problems. The effectiveness of the proposed DGRM algorithm is evaluated by two case studies, UAV package delivery and COVID-19 pandemic mitigation. Experimental results show that local information is sufficient for DGRM and that agents can accomplish complex tasks with the help of RMs. DGRM improves the global accumulated reward by 119% compared to the baseline in the case of COVID-19 pandemic mitigation.

In multi-agent reinforcement learning (MARL), a collection of agents interact within a common environment and learn to jointly maximize a long-term reward. We study MARL in a graph-based Markov Decision Process (MDP) setting, where for each agent the transition model is represented by a dynamic Bayesian network [1, 2]. In graph-based MDPs, the transition function of an agent's states may depend on its current state and action and on the neighboring agents [3, 4]. In this work, the agents are allowed to have different reward functions from different tasks, and they can perceive their own rewards. The key challenge in MARL is its combinatorial nature, which results in high computational complexity. The combinatorial nature refers to the exponentially increasing size of the joint state and action space in the multi-agent system [5, 6]. Since all agents are learning simultaneously, the behavior of each individual agent depends on what has been learned by the respective other agents. Thus, the learning problem is non-stationary from each agent's perspective. To tackle non-stationarity, each individual agent has to consider the joint state and action space, whose dimension increases exponentially with the number of agents.
RL becomes more challenging when solving complex temporally extended tasks. In many complex tasks, agents only receive sparse rewards for complex behaviors over a long time, yet they do not have access to the high-level ideas behind the sparse rewards. Moreover, in a sparse-reward formulation, if it is improbable that the agent achieves the task by chance while exploring the environment, the learning agent will rarely obtain rewards. In this paper, we provide a decentralized framework that enables a collection of agents to solve complex temporally extended tasks under coupled dynamics with enhanced computational efficiency. Specifically, we use reward machines (RMs) to describe the environment, track the progress through the task, and encode a sparse or non-Markovian reward function for each agent. The proposed algorithm, called decentralized graph-based reinforcement learning using reward machines (DGRM), equips each agent with a localized policy and a truncated Q-function that depend only on local information, so that agents can make decisions independently.

A labeled MDP is a tuple $M = (S, s_I, A, p, R, \gamma, \mathcal{P}, L)$, where $S$ is a finite set of states; $s_I \in S$ is an initial state; $A$ is a finite set of actions; $R: S \times A \times S \rightarrow \mathbb{R}$ is the reward function, where $R(s, a, s')$ represents the reward obtained at state $s'$ after taking action $a$; $p: S \times A \times S \rightarrow [0, 1]$ is the transition function, where $p(s, a, s')$ represents the probability of going to $s'$ from $s$ after taking action $a$; $\gamma$ is the discount factor, $\gamma \in (0, 1]$; $\mathcal{P}$ is a finite set of events; and $L: S \times A \times S \rightarrow 2^{\mathcal{P}}$ is a labeling function, where $L(s, a, s')$ maps the state transition to a relevant high-level event.

Definition 1. A reward machine (RM) is a tuple $\mathcal{R} = (U, u_I, \mathcal{P}, \delta_u, \delta_r)$, where $U$ is a finite set of RM states, $u_I \in U$ is an initial state, $\mathcal{P}$ is a finite set of events, $\delta_u$ is the transition function $U \times 2^{\mathcal{P}} \rightarrow U$, and $\delta_r$ is the reward function $U \times 2^{\mathcal{P}} \rightarrow \mathbb{R}$.

Reward machines [33] are a way to encode a non-Markovian reward function. A run of a reward machine $\mathcal{R}$ on the sequence of labels $\ell_1 \ell_2 \ldots \ell_k \in (2^{\mathcal{P}})^*$ is a sequence $u_0 (\ell_1, r_1) u_1 (\ell_2, r_2) \ldots u_{k-1} (\ell_k, r_k)$ of states and label-reward pairs such that $u_0 = u_I$. For all $i \in \{0, \ldots, k-1\}$, we have $\delta_u(u_i, \ell_{i+1}) = u_{i+1}$ and $\delta_r(u_i, \ell_{i+1}) = r_{i+1}$. We write $\mathcal{R}(\ell_1 \ell_2 \ldots \ell_k) = r_1 r_2 \ldots r_k$ to connect the input label sequence to the sequence of rewards produced by the machine $\mathcal{R}$. We say that a reward machine $\mathcal{R}$ encodes the reward function of an MDP if, for every trajectory $s_0 a_1 s_1 \ldots a_k s_k$ and the corresponding label sequence $\ell_1 \ell_2 \ldots \ell_k$, the reward sequence equals $\mathcal{R}(\ell_1 \ell_2 \ldots \ell_k)$.

Given a labeled MDP $M = (S, s_I, A, p, R, \gamma, \mathcal{P}, L)$ with a non-Markovian reward function and a reward machine $\mathcal{R} = (U, u_I, \mathcal{P}, \delta_u, \delta_r)$ which encodes the reward function of $M$, we can obtain a product MDP $M_{\mathcal{R}}$ whose reward function is Markovian, such that every attainable label sequence of $M_{\mathcal{R}}$ gets the same reward as in $M$. Furthermore, any policy for $M_{\mathcal{R}}$ achieves the same expected reward in $M$ [24]. We define the product MDP $M_{\mathcal{R}} = (S_{\mathcal{R}}, s_{I,\mathcal{R}}, A_{\mathcal{R}}, p_{\mathcal{R}}, R_{\mathcal{R}}, \gamma_{\mathcal{R}}, \mathcal{P}_{\mathcal{R}}, L_{\mathcal{R}})$ by augmenting each MDP state with the current RM state.

We denote by $G = (V, E)$ an undirected graph, where $V = \{1, \ldots, n\}$ is a finite set of nodes that represents the agents and $E$ is a finite set of edges. For agents $i, j$, we call $j$ a neighboring agent of $i$ if there is an edge that connects $i$ and $j$. Let $N(i) \subseteq V$ be the set including agent $i$ and the neighboring agents of agent $i$. We consider a multi-agent system including $n$ agents and formulate the system as a labeled graph-based MDP $M = (S, s_I, A, p, R, \gamma, \mathcal{P}, L)$. We consider a factored form for $M$, i.e., $S$ is a Cartesian product of the states of each agent in $V$, $S = S_1 \times \cdots \times S_n$. Similarly, $s_I = (s_{I,1}, \ldots, s_{I,n})$; $A = A_1 \times \cdots \times A_n$; $p = p_1 \times \cdots \times p_n$; $R = R_1 \times \cdots \times R_n$; $\mathcal{P} = \mathcal{P}_1 \times \cdots \times \mathcal{P}_n$; and $L = L_1 \times \cdots \times L_n$. For each agent $i$, $p_i: S_{N(i)} \times A_{N(i)} \times S_i \rightarrow [0, 1]$, meaning that $s_i(t+1)$ depends on $s_{N(i)}(t) \in S_{N(i)}$ and $a_{N(i)}(t) \in A_{N(i)}$, where $S_{N(i)}$ and $A_{N(i)}$ denote the Cartesian products of the sets of states and actions of the agents in $N(i)$, respectively. $R_i$ and $L_i$ only depend on the information of agent $i$ itself, $R_i: S_i \times A_i \rightarrow \mathbb{R}$ and $L_i: S_i \times A_i \times S_i \rightarrow 2^{\mathcal{P}_i}$.
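To make Definition 1 and the notion of a run concrete, the following is a minimal Python sketch of a reward machine replayed on a label sequence. The class name, the dictionary-based encoding of $\delta_u$ and $\delta_r$, and the two-state example task are our own illustrative choices, not the authors' implementation.

```python
# Minimal reward-machine sketch (illustrative; not the authors' code).
# States are strings, labels are frozensets of events from the alphabet P.

class RewardMachine:
    def __init__(self, u_init, delta_u, delta_r):
        self.u_init = u_init          # initial RM state u_I
        self.delta_u = delta_u        # (u, label) -> next RM state
        self.delta_r = delta_r        # (u, label) -> reward

    def run(self, labels):
        """Replay a label sequence l_1 ... l_k and return the reward sequence."""
        u, rewards = self.u_init, []
        for label in labels:
            rewards.append(self.delta_r[(u, label)])
            u = self.delta_u[(u, label)]
        return rewards

# Hypothetical two-state task: reward 1 once the event "goal" is observed.
EMPTY, GOAL = frozenset(), frozenset({"goal"})
rm = RewardMachine(
    u_init="u0",
    delta_u={("u0", EMPTY): "u0", ("u0", GOAL): "u1",
             ("u1", EMPTY): "u1", ("u1", GOAL): "u1"},
    delta_r={("u0", EMPTY): 0.0, ("u0", GOAL): 1.0,
             ("u1", EMPTY): 0.0, ("u1", GOAL): 0.0},
)
print(rm.run([EMPTY, EMPTY, GOAL, EMPTY]))  # -> [0.0, 0.0, 1.0, 0.0]
```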
We design a reward machine $\mathcal{R}_i$ for each agent $i$. $\mathcal{R}_i = (U_i, u_{I,i}, \mathcal{P}_i, \delta_{u_i}, \delta_{r_i})$ encodes the reward function of $M_i$ individually. $\mathcal{R}_i$ defines the task of agent $i$ in terms of high-level events from $\mathcal{P}_i$. Given a labeled graph-based MDP $M = (S, s_I, A, p, R, \gamma, \mathcal{P}, L)$ with a non-Markovian reward function collectively defined by the reward machines $\mathcal{R}_1, \ldots, \mathcal{R}_n$, we can obtain a product MDP $M_{\mathcal{R}} = (S_{\mathcal{R}}, s_{I,\mathcal{R}}, A_{\mathcal{R}}, p_{\mathcal{R}}, R_{\mathcal{R}}, \gamma_{\mathcal{R}}, \mathcal{P}_{\mathcal{R}}, L_{\mathcal{R}})$ whose reward function is Markovian.

For simplicity of notation, we define the global state $s = (s_1, \ldots, s_n) \in S$, the global RM state $u = (u_1, \ldots, u_n) \in U := U_1 \times \cdots \times U_n$, the global action $a = (a_1, \ldots, a_n) \in A$, and the global reward $R(s, a) = \frac{1}{n} \sum_{i=1}^{n} R_i(s_i, a_i)$. $M_{\mathcal{R}}$ augments the RM state with the MDP state, which tracks the progress through the task for each agent. The goal in $M_{\mathcal{R}}$ is to find the optimal joint policy $\pi_\theta: S_{\mathcal{R}} \rightarrow \Delta(A_{\mathcal{R}})$, which is parameterized by $\theta$. $\pi_\theta(a \mid s, u)$ is expected to maximize the discounted global reward $J(\theta)$ starting from the initial state distribution $\pi_0$,
$$J(\theta) = \mathbb{E}_{(s(0), u(0)) \sim \pi_0} \Big[ \sum_{t=0}^{\infty} \gamma^t R\big(s(t), a(t)\big) \Big].$$
The Q-function under the policy $\pi_\theta$, $Q^\theta(s, u, a)$, is defined as the expected discounted future reward of taking action $a$ at a pair $(s, u)$ and then following the policy $\pi_\theta$, with $Q^\theta(s, u, a) = \frac{1}{n} \sum_{i=1}^{n} Q_i^\theta(s, u, a)$, where $Q_i^\theta(s, u, a)$ is the Q-function for the individual reward $R_i$. Compared to standard Q-learning, $Q_i^\theta(s, u, a)$ also keeps track of RM states, allowing agents to take the high-level stages into consideration when selecting actions.

A commonly used RL algorithm is policy gradient [34]. Let $d^\theta$ be a distribution on $S_{\mathcal{R}}$ given by $d^\theta(s, u) = (1 - \gamma) \sum_{t=0}^{\infty} \gamma^t d_t^\theta(s, u)$, where $d_t^\theta$ is the distribution of $(s(t), u(t))$ under the fixed policy $\pi_\theta$ when $(s(0), u(0))$ is drawn from $\pi_0$. Then,
$$\nabla_\theta J(\theta) = \frac{1}{1 - \gamma} \, \mathbb{E}_{(s, u) \sim d^\theta, \, a \sim \pi_\theta(\cdot \mid s, u)} \big[ Q^\theta(s, u, a) \, \nabla_\theta \log \pi_\theta(a \mid s, u) \big].$$

Since the sizes of the joint state and action spaces, $S_{\mathcal{R}}$ and $A_{\mathcal{R}}$, grow exponentially with the number of agents, the corresponding Q-functions $Q^\theta$ and $Q_i^\theta$ can be large tables and, therefore, are intractable to compute and store. We propose the truncated Q-function $\tilde{Q}_i$, which takes the local information of each agent as the input; thus, the dimension of $\tilde{Q}_i$ is much smaller than that of $Q_i^\theta$. We show that $\tilde{Q}_i$ is able to approximate $Q_i^\theta$ with high accuracy. Then, we propose the truncated policy gradient based on $\tilde{Q}_i$, and the error of the policy gradient approximation is bounded. Moreover, the DGRM algorithm can find an $O(\rho^{\kappa+1})$-approximation of a stationary point of $J(\theta)$. To further improve efficiency, we also develop the deep DGRM algorithm, using deep neural networks to approximate the Q-function and policy function to solve large-scale or continuous state problems.

To reduce complexity, we propose a decentralized learning algorithm that equips each agent $i$ with a localized policy $\pi_i^{\theta_i}$, parameterized by $\theta_i$. The localized policy for agent $i$, $\pi_i^{\theta_i}(a_i \mid s_i, u_i)$, is a distribution on the local action $a_i$, depending on the local state $s_i$ and the local RM state $u_i$. Thus, each agent acts independently, and information exchange is not required with learned policies. Moreover, the RM state is augmented with the low-level MDP state for the policy, which helps provide guidance to the agent at the task level. The joint policy is $\pi_\theta(a \mid s, u) = \prod_{i=1}^{n} \pi_i^{\theta_i}(a_i \mid s_i, u_i)$. $\pi_i^{\theta_i}(a_i \mid s_i, u_i)$ is similar to the policy defined in the Scalable Actor Critic (SAC) algorithm [19], which is conditioned only on the local state $s_i$, and it differs from the policy in graph-based MDPs [4] in that their policy is conditioned on the agent's neighborhood.

Further, the DGRM algorithm uses the local information of each agent to approximate $Q_i^\theta$, inspired by [19]. We assume that agent $i$ has access to the neighborhood of radius $\kappa$ (or $\kappa$-hop neighborhood) of agent $i$, denoted by $N_i^\kappa$, $\kappa \geq 0$. The $\kappa$-hop neighborhood of $i$ includes the agents whose shortest graph distance to agent $i$ is less than or equal to $\kappa$, including agent $i$ itself. We define $N_{-i}^\kappa = V / N_i^\kappa$, the set of agents that are outside of agent $i$'s $\kappa$-hop neighborhood.
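As a concrete illustration of the $\kappa$-hop neighborhood $N_i^\kappa$ that the truncated Q-function is conditioned on, the sketch below computes $N_i^\kappa$ for every agent by breadth-first search on the agent graph. The helper name and the example adjacency, derived from the UAV case study later in the paper where UAVs sharing a warehouse are neighbors, are our own choices.

```python
from collections import deque

def k_hop_neighborhoods(adjacency, kappa):
    """adjacency: dict agent -> set of neighboring agents (undirected graph).
    Returns dict agent -> frozenset of agents within graph distance <= kappa,
    which always includes the agent itself (kappa = 0 gives {i})."""
    neighborhoods = {}
    for i in adjacency:
        seen, frontier = {i}, deque([(i, 0)])
        while frontier:
            node, dist = frontier.popleft()
            if dist == kappa:
                continue
            for j in adjacency[node]:
                if j not in seen:
                    seen.add(j)
                    frontier.append((j, dist + 1))
        neighborhoods[i] = frozenset(seen)
    return neighborhoods

# Example: the 6-UAV graph from the first case study, where UAVs sharing a
# warehouse are neighbors (UAVs 1-4 share warehouse A; UAVs 3-6 share B).
adj = {1: {2, 3, 4}, 2: {1, 3, 4}, 3: {1, 2, 4, 5, 6},
       4: {1, 2, 3, 5, 6}, 5: {3, 4, 6}, 6: {3, 4, 5}}
print(k_hop_neighborhoods(adj, kappa=1)[1])  # frozenset({1, 2, 3, 4})
```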
The global state $s$ can also be written as $(s_{N_i^\kappa}, s_{N_{-i}^\kappa})$, i.e., the states of the agents that are inside and outside agent $i$'s $\kappa$-hop neighborhood, respectively; similarly, the global RM state $u = (u_{N_i^\kappa}, u_{N_{-i}^\kappa})$ and the global action $a = (a_{N_i^\kappa}, a_{N_{-i}^\kappa})$. We first define the exponential decay property of the Q-function.

Definition 2. The Q-function has the $(c, \rho)$-exponential decay property if, for any localized policy $\theta$, for any $i \in V$, any $(s_{N_i^\kappa}, u_{N_i^\kappa}, a_{N_i^\kappa})$, and any $(s_{N_{-i}^\kappa}, u_{N_{-i}^\kappa}, a_{N_{-i}^\kappa})$, $(s'_{N_{-i}^\kappa}, u'_{N_{-i}^\kappa}, a'_{N_{-i}^\kappa})$,
$$\big| Q_i^\theta(s_{N_i^\kappa}, s_{N_{-i}^\kappa}, u_{N_i^\kappa}, u_{N_{-i}^\kappa}, a_{N_i^\kappa}, a_{N_{-i}^\kappa}) - Q_i^\theta(s_{N_i^\kappa}, s'_{N_{-i}^\kappa}, u_{N_i^\kappa}, u'_{N_{-i}^\kappa}, a_{N_i^\kappa}, a'_{N_{-i}^\kappa}) \big| \leq c \rho^{\kappa+1}.$$

Theorem 1. Assume that for all $i$, $R_i$ is upper bounded by $\bar{r}$; then the $(\frac{\bar{r}}{1-\gamma}, \gamma)$-exponential decay property of $Q_i^\theta$ holds.

The proof of Theorem 1 can be found in the supplementary material. Theorem 1 indicates that the dependency of $Q_i^\theta$ on other agents decreases exponentially as the distance between them grows. Then, we propose the following truncated Q-functions for state-(RM state)-action triples,
$$\tilde{Q}_i(s_{N_i^\kappa}, u_{N_i^\kappa}, a_{N_i^\kappa}) = \sum_{s_{N_{-i}^\kappa}, u_{N_{-i}^\kappa}, a_{N_{-i}^\kappa}} w_i(s_{N_{-i}^\kappa}, u_{N_{-i}^\kappa}, a_{N_{-i}^\kappa}; s_{N_i^\kappa}, u_{N_i^\kappa}, a_{N_i^\kappa}) \, Q_i^\theta(s, u, a), \tag{7}$$
where $w_i(\cdot \, ; s_{N_i^\kappa}, u_{N_i^\kappa}, a_{N_i^\kappa})$ is any non-negative weight and satisfies
$$\sum_{s_{N_{-i}^\kappa}, u_{N_{-i}^\kappa}, a_{N_{-i}^\kappa}} w_i(s_{N_{-i}^\kappa}, u_{N_{-i}^\kappa}, a_{N_{-i}^\kappa}; s_{N_i^\kappa}, u_{N_i^\kappa}, a_{N_i^\kappa}) = 1.$$

Theorem 2. Under the $(c, \rho)$-exponential decay property, $\tilde{Q}_i$ satisfies
$$\big| \tilde{Q}_i(s_{N_i^\kappa}, u_{N_i^\kappa}, a_{N_i^\kappa}) - Q_i^\theta(s, u, a) \big| \leq c \rho^{\kappa+1}.$$

The proof of Theorem 2 can be found in the supplementary material. Theorem 2 shows that $\tilde{Q}_i$ approximates $Q_i^\theta$ with high accuracy, and next, we use $\tilde{Q}_i$ to approximate the policy gradient. The truncated policy gradient for agent $i$, $\hat{g}_i(\theta)$, is defined by replacing $Q^\theta$ in the policy gradient with the truncated Q-functions of agent $i$'s $\kappa$-hop neighborhood, where $\tilde{Q}_j$ is any truncated Q-function in the form of Eq. (7). The proof of Theorem 3 can be found in the supplementary material. Theorem 3 indicates that the error of the truncated policy gradient is bounded, and thus it is safe to use the truncated Q-function, which has a much smaller dimension, to approximate the policy gradient.

The DGRM algorithm adopts the actor-critic framework, specifically using temporal difference (TD) learning to train the truncated Q-function, and the policy gradient for the localized policy. The pseudocode of DGRM is given in Algorithm 1. The truncated Q-function is updated at every time step, and the policy parameters are updated in every episode. In outline, each episode of Algorithm 1 proceeds as follows: take the initial actions $a_i(0) \sim \pi_i^{\theta_i}(\cdot \mid s_i(0), u_i(0))$ and get the rewards $r_i(0)$ for all $i$; then, while there exists an agent that has not finished the task and $t \leq T$, get the state $s_i(t)$ and RM state $u_i(t)$, take the action $a_i(t) \sim \pi_i^{\theta_i}(\cdot \mid s_i(t), u_i(t))$, get the reward $r_i(t)$, calculate the TD error, and update the truncated Q-function for all $i$; at the end of the episode, calculate the approximated gradient $\hat{g}_i$ for all $i$ and update the policy parameters for all agents, $\theta_i(\tilde{m}+1) \leftarrow \theta_i(\tilde{m}) + \eta_{\tilde{m}} \hat{g}_i(\theta(\tilde{m}))$.

DGRM reduces the computational complexity, as the truncated Q-function is conditioned on the state-(RM state)-action triple of the $\kappa$-hop neighborhood instead of the global information, which results in a much smaller dimension. Thus, the complexity of DGRM is the local state-(RM state)-action space size of the largest $\kappa$-hop neighborhood. The approximated gradient of the policy parameters is related to the truncated Q-function of the $\kappa$-hop neighborhood. Further, the authors in [19] showed that, under certain assumptions, the actor-critic algorithm will eventually find an $O(\rho^{\kappa+1})$-approximation of a stationary point of $J(\theta)$. DGRM also follows these assumptions with the RM state; thus, the approximate convergence conclusion still holds, which is stated in Theorem 4.

To further improve efficiency, we approximate both the policy and the truncated Q-functions by deep neural networks. The pseudocode of deep DGRM is given in Algorithm 2. In every episode, from line 2 to line 7, all the agents first interact with the environment. Then, in line 8, each agent collects the necessary information during the episode for the later actor and critic updates. Both the Q-network and the policy network are updated in every episode: for all $i$, the Q-network parameters are updated by a gradient step, $\omega_i(\tilde{m}+1) \leftarrow \omega_i(\tilde{m}) + \eta_{\tilde{m}} \nabla_{\omega_i}$, followed by an update of the policy-network parameters. In line 11, the TD error is the averaged error over one episode. In line 15, the gradient of the policy is conditioned on the averaged TD error of the $\kappa$-hop neighborhood, which indicates that communication is required for policy improvement during training. When implementing the learned policy after training, communication is not required, since the policy is conditioned only on each agent's own state-(RM state) pair.
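The following sketch illustrates, in simplified form, the two updates that tabular DGRM alternates: a TD(0) step on agent $i$'s truncated Q-table and a gradient step on a softmax localized policy. The dictionary-based tables, the fixed learning rates, and the use of a single scalar q_hat in place of the averaged truncated Q-values of the $\kappa$-hop neighborhood are our own simplifying assumptions, not the authors' implementation.

```python
import numpy as np

def td_update(q_table, key, key_next, reward, gamma=0.9, alpha=0.1):
    """One TD(0) step on agent i's truncated Q-table.
    key = (s_Ni, u_Ni, a_Ni) at the current step, key_next at the next step."""
    target = reward + gamma * q_table.get(key_next, 0.0)
    delta = target - q_table.get(key, 0.0)                  # TD error
    q_table[key] = q_table.get(key, 0.0) + alpha * delta    # critic update
    return delta

def softmax_probs(theta, s_u, actions):
    """Localized softmax policy pi_i(a_i | s_i, u_i) with one preference
    parameter per (local state, RM state, action) triple stored in theta."""
    prefs = np.array([theta.get((s_u, a), 0.0) for a in actions])
    prefs -= prefs.max()                                    # numerical stability
    expd = np.exp(prefs)
    return expd / expd.sum()

def policy_update(theta, episode, q_table, actions, gamma=0.9, eta=0.01):
    """End-of-episode actor update: ascend the approximated policy gradient,
    using the truncated Q-value as the critic signal (q_hat below)."""
    for t, (s_u, a_taken, key) in enumerate(episode):
        probs = softmax_probs(theta, s_u, actions)
        q_hat = q_table.get(key, 0.0)
        for idx, a in enumerate(actions):
            grad_log = (1.0 if a == a_taken else 0.0) - probs[idx]
            theta[(s_u, a)] = theta.get((s_u, a), 0.0) \
                + eta * (gamma ** t) * q_hat * grad_log
    return theta
```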
In this section, we implement DGRM on two case studies: (1) a multi-UAV package delivery case and (2) COVID-19 pandemic mitigation adapted from [35].

We consider a total of six UAVs in a 4×5 grid-world with two warehouses. The task for each agent is to obtain one package from a warehouse and deliver it to the corresponding destination. Further details of the grid-world environment are illustrated in the supplementary material. The local state of UAV $i$ at time $t$, $s_i(t)$, includes UAV $i$'s current position, the remaining battery percentage, and a binary variable whose value of 1 indicates that UAV $i$ has obtained a package. UAVs start with a fully charged battery. UAVs have four base possible actions at all grids: move north, move south, move east, and move west. At warehouses, UAVs have two additional possible actions: wait and pick up. When waiting, the agent remains where it was. The agent obtains the package with a probability of 0.9 if it selects pick up and no other agent also picks up at the same warehouse. We define two UAVs to be neighbors if they have access to the same warehouse. Each UAV $i \in \{1, 2, 5, 6\}$ has access to one specific warehouse, and each UAV $i \in \{3, 4\}$ has access to two warehouses.

The reward machine for UAV $i$ is shown in Fig. 1. Each edge is labeled by a tuple $(\hat{\varphi}, \hat{r})$, where $\hat{\varphi}$ is a propositional logic formula over $\mathcal{P}_i$ and $\hat{r} \in \mathbb{R}$ is a real number defining the reward of the corresponding transition. For instance, to transition from $u_0$ to $u_1$ in Fig. 1(a), the corresponding proposition has to be true, and the transition outputs a reward of 5. Regarding UAV $i \in \{1, 2, 5, 6\}$, the labeled events in $\mathcal{P}_i$ are: UAV $i$ reaches the accessible warehouse, UAV $i$ obtains the package, UAV $i$ arrives at the destination, and UAV $i$ is running out of battery. When the agent has low battery, it transitions to the sink state $u_4$. Once the agent reaches a sink state, it cannot get out of the sink state and fails the task. $u_3$ is the goal state, expressing task accomplishment. Since each UAV $i \in \{3, 4\}$ has access to two warehouses, UAV 3 and UAV 4 have two possible routes to accomplish the task. Further details of the reward machines of UAV 3 and UAV 4 are illustrated in the supplementary material.

In the experiment, we set $\gamma = 0.9$. The softmax policy is utilized for the localized policy, which forms parameterized numerical probabilities for each state-action pair [36]. We run the DGRM algorithm with $\kappa \in \{0, 1, 2\}$ and plot the global discounted reward during the training process in Fig. 2. The optimal solution is in closed form for this experiment. DGRM with $\kappa = 0$ has the best performance and is close to the optimal solution. Also, DGRM with $\kappa = 0$ converges the fastest. Fig. 2 indicates that more information about neighboring agents does not help improve the performance in this example. A possible reason is that we use the tabular Q-function for this discrete state problem, and a larger $\kappa$-hop neighborhood introduces a large state, RM state, and action space for the tabular Q-function, which may cause poor performance. Additionally, the result in Fig. 2 indicates that DGRM has the potential to determine the necessary information for the truncated Q-function and help understand the connections among the agents.
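For concreteness, the dictionaries below encode one plausible reading of the reward machine in Fig. 1(a) for UAV $i \in \{1, 2, 5, 6\}$. Only the reward of 5 for the first transition is stated in the text, so the remaining reward values, the event names, and the self-loop behavior for unlisted pairs are our own placeholders.

```python
# Illustrative encoding of the reward machine in Fig. 1(a) (not the authors' code).
U0, U1, U2, U3, U4 = "u0", "u1", "u2", "u3", "u4"   # u3 = goal, u4 = sink

delta_u = {
    (U0, "at_warehouse"): U1,     # reached the accessible warehouse
    (U1, "has_package"): U2,      # obtained the package
    (U2, "at_destination"): U3,   # delivered: task accomplished (goal state)
}
# Running out of battery sends any non-terminal state to the sink state u4.
for u in (U0, U1, U2):
    delta_u[(u, "battery_out")] = U4

delta_r = {key: 0.0 for key in delta_u}
delta_r[(U0, "at_warehouse")] = 5.0   # the documented reward of 5 for u0 -> u1

def rm_step(u, event):
    """Advance the RM on one event; unlisted (u, event) pairs leave u unchanged."""
    return delta_u.get((u, event), u), delta_r.get((u, event), 0.0)

u, total = U0, 0.0
for e in ["at_warehouse", "has_package", "at_destination"]:
    u, r = rm_step(u, e)
    total += r
print(u, total)   # u3 5.0
```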
To the best of our knowledge, this case study provides the first attempt to use reward machines as specifications to mitigate epidemics and pandemics such as COVID-19. We consider a network model of Italy [35], where Italy is modeled as a network of 20 regions and the parameters of each region's model are calculated from real data. The total population of each region $i$ is divided into the following subgroups: susceptible ($S_i$), infected ($I_i$), quarantined ($Q_i$), hospitalized ($H_i$), recovered ($C_i$), and deceased ($D_i$). The detailed analysis of the real data and the estimation of the parameters of the pandemic model can be found in [35]. The authors in [35] propose a bang-bang control strategy for each region according to the relative saturation level of its health system. A regional lockdown is enforced when the relative saturation level is high. During the lockdown, the region implements strict social distancing rules and controls all fluxes in or out of the region. Further details of the bang-bang control and the mathematical derivations are illustrated in the supplementary material. We take the above bang-bang control as a baseline method for comparison with our proposed method.

In this case study, we aim to control the spread of COVID-19 among the 20 regions of Italy over a period of 4 weeks. A region is defined to be in a severe situation when its total hospital capacity is saturated, denoted by $0.1 H_i / T_i^H \geq 0.5$, which is the same saturation level at which a lockdown is adopted in the bang-bang control. Here, $0.1 H_i$ estimates the number of hospitalized patients requiring care in an ICU, and $T_i^H$ is the number of available ICU beds in region $i$. The control objective is that each region should not be in a severe situation for 2 consecutive weeks and should avoid a long lockdown of 2 consecutive weeks, since the costs of continuing severe restrictions could be great relative to the likely benefits in lives saved [37].

To conduct our experiment, we establish a labeled graph-based multi-agent MDP for the network model of Italy and a reward machine as the task specification for each region. We define that region $j$ is region $i$'s neighbor if there are fluxes of people traveling between the two regions. The simulation is conducted with a discretization step of 1 day and starts on a Monday. At the first step of the simulation, each region is set to be in a severe situation, $H_i(0) = 6 T_i^H$, and the values of $I_i(0)$, $Q_i(0)$, $C_i(0)$, and $D_i(0)$ are the same as in the experiment in [35]. For each region, we track two counters: the number of days the region has been in a severe situation and the number of days the region has adopted a lockdown since this Monday. The transitions of $S_i(t)$, $I_i(t)$, $C_i(t)$, $H_i(t)$, $Q_i(t)$, and $D_i(t)$ follow the pandemic model's dynamics in [35].

The set of events $\mathcal{P}_i$ contains an empty event, emitted when the current day is not a Monday; a goal event, which leads to the goal state and indicates that the current day is the last time step of the entire simulation; and nine weekly events indexed by a pair of values in $\{0, 0.5, 1\}$ that describe the level of the situation's severity and the frequency of lockdown implementation during the previous seven days (from Monday to Sunday), respectively. Further details of these events are illustrated in the supplementary material. We note that the reward machine for each region is the same for simplicity.
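Before turning to the reward machine itself, the short sketch below shows how the weekly events described above could be computed from the two per-region counters. The function names, the tuple encoding of the events, and the exact day-count thresholds behind the indices 0, 0.5, and 1 are our own reading of the setup, not the authors' code.

```python
def weekly_indices(severe_days, lockdown_days):
    """Map the per-week counters to the (severity, lockdown) indices in {0, 0.5, 1}
    used to name the weekly events (our reading: 0 = on no day of the previous
    Monday-to-Sunday week, 0.5 = at least one but fewer than seven days,
    1 = all seven days)."""
    def level(days):
        if days == 0:
            return 0.0
        return 1.0 if days >= 7 else 0.5
    return level(severe_days), level(lockdown_days)

def weekly_event(day_index, is_last_step, severe_days, lockdown_days):
    """Return the event emitted on day `day_index` (0 = the starting Monday).
    Non-Mondays emit the empty event; the last simulation step emits the goal event."""
    if is_last_step:
        return "goal"
    if day_index % 7 != 0:
        return "empty"
    return ("week", *weekly_indices(severe_days, lockdown_days))

# Example: a region that was severe on 3 of the last 7 days and locked down all 7
# emits ("week", 0.5, 1.0) on Monday, steering the RM toward a lockdown-related state.
print(weekly_event(7, False, severe_days=3, lockdown_days=7))
```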
The initial RM state assumes that during the previous week, the region was in a severe situation for at least 1 day but less than 7 days and did not implement a lockdown. We define sink states to indicate that the region has been in a severe situation or has implemented a lockdown for two consecutive weeks, which is not encouraged. Further details of the reward machine and the reward function are illustrated in the supplementary material.

In the experiment, we set $\gamma = 0.9$. We use two fully connected layers with [256, 128] units and ReLU activations for the actor and two fully connected layers with [256, 128] units and Tanh activations for the critic. We run the deep DGRM algorithm with $\kappa = 0$ up to $\kappa = 5$ and plot the global discounted rewards during the training process in Fig. 3. As $\kappa$ increases, the global discounted reward increases when $\kappa \leq 2$. This is also demonstrated in Fig. 4, which shows the performance of the baseline (the bang-bang control policy in [35]) and the results of deep DGRM obtained by 20 independent simulation runs using the converged policy for each $\kappa$. The deep DGRM algorithm with $\kappa > 0$ outperforms $\kappa = 0$, which corresponds to the independent learner method in [38]. The deep DGRM algorithm with $\kappa = 1$ improves the global discounted reward by 218% compared to $\kappa = 0$. Deep DGRM with $\kappa = 2$ yields a limited improvement (4%) over $\kappa = 1$. When $\kappa > 2$, the global discounted reward stops growing due to noise. This is also consistent with Theorem 4, which states that the proposed algorithm can reach an approximate stationary point. We note that, with the deep DGRM algorithm, information from the 1-hop neighborhood is sufficient for this experiment, and global information is unnecessary for good performance. Moreover, deep DGRM with $\kappa = 1$ improves the global accumulated reward by 119% compared to the baseline. Further details of the result with $\kappa = 1$, including the local discounted rewards and the actions selected by each region, are illustrated in the supplementary material.

In this work, reward machines are leveraged to express complex temporally extended tasks. A decentralized learning algorithm for graph-based MDPs is introduced. During training, we use the local information of the $\kappa$-hop neighborhood for a truncated Q-function. The truncated Q-function can be computed efficiently, and we prove that the approximation error is bounded. After training, agents can implement their policies independently. We also combine the decentralized architecture with function approximation by deep neural networks to enable the application to large-scale MARL or continuous state problems. Moreover, this is the first work to use reward machines to mitigate the COVID-19 pandemic through task specification, and we show the performance improvement over the baseline method. The current work assumes that each agent has an individual reward machine and that the reward machine is known, but it would be possible to learn reward machines from experience [24] and to decompose a team-level task into a collection of reward machines [21]. We use a standard actor-critic algorithm; one possible research direction is to use more advanced RL algorithms such as Proximal Policy Optimization (PPO) [39] to allow for solving continuous action problems. Leveraging strategies such as Hindsight Experience Replay [40] to learn policies faster is also worth investigating.
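For reference, the following is a minimal PyTorch sketch of the actor and critic architectures described in the COVID-19 experiment above (two fully connected layers with [256, 128] units, ReLU for the actor and Tanh for the critic). The input encodings, the categorical output head, and the class names are our own assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class LocalActor(nn.Module):
    """Localized policy pi_i(a_i | s_i, u_i): hidden layers [256, 128] with ReLU,
    softmax output over the discrete local actions (our reading of the setup)."""
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, n_actions),
        )

    def forward(self, obs):                    # obs = concat(s_i, one-hot u_i)
        return torch.distributions.Categorical(logits=self.net(obs))

class LocalCritic(nn.Module):
    """Truncated Q-network for agent i: hidden layers [256, 128] with Tanh."""
    def __init__(self, in_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.Tanh(),
            nn.Linear(256, 128), nn.Tanh(),
            nn.Linear(128, 1),
        )

    def forward(self, x):                      # x = concat of the kappa-hop information
        return self.net(x).squeeze(-1)
```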
The bang-bang control of the baseline method is based on the relative saturation level of a region's health system, mathematically the ratio between the number of hospitalized patients requiring care in an ICU (estimated as $0.1 H_i$) and the number of available ICU beds ($T_i^H$) in the region. The regional lockdown is enforced when this ratio is larger than 0.5 and relaxed when it is smaller than 0.2. During the lockdown, the region implements strict social distancing rules, and all fluxes in or out of the region are reduced to 70% of their original values. Mathematically, the social distancing parameter and the fluxes are set as follows. When the lockdown is relaxed, the social distancing parameter of region $i$ is set to the minimum of 1 and three times its minimum value estimated during the national lockdown, which simulates the relaxation of social distancing rules. During a lockdown, each flux from region $j$ to region $i$ is set to 0.7 times its original value before the pandemic, where the factor 0.7 simulates the feedback flux control.

Recall that, for each region, we track the number of days the region has been in a severe situation since this Monday and the number of days the region has adopted a lockdown since this Monday, and that the set of events $\mathcal{P}_i$ contains an empty event, a goal event, and nine weekly events indexed by the severity level and the lockdown frequency of the previous week. An index of 0 means the corresponding condition held on no day of the previous week, 0.5 means it held on at least one but fewer than seven days, and 1 means it held on all seven days.

The reward machine for the task specification of each region is shown in Fig. 7. The agent starts at $u_0$, assuming that during the previous week the region was in a severe situation for at least 1 day but less than 7 days and did not implement a lockdown. The RM states in purple ($u_4$, $u_8$, $u_9$, $u_{10}$, $u_{11}$, $u_{12}$, $u_{15}$) are sink states, indicating that the region has been in a severe situation or has implemented a lockdown for two consecutive weeks. $u_{16}$ is the goal state and can be reached from all the states except the sink states (arrows pointing to $u_{16}$ are omitted). Arrows with the same color point from the same RM state. Due to the limited space in Fig. 7, the events required for the RM state transitions, which should be marked on the edges in Fig. 7, are listed in Table 1. For instance, during the first week, if the region was in a severe situation and implemented a lockdown for seven consecutive days, denoted by the weekly event with both indices equal to 1, the region is directed to $u_5$ on the second week's Monday. From Tuesday to Sunday, the agent remains at the same RM state due to the empty event.

The reward function is determined by the reward machine: the reward of region $i$ at each step is $\delta_{r_i}(u_i, L_i(s_i, a_i, s_i'))$, i.e., the RM output reward for the current RM state and the labeled event. Fig. 8 shows the local discounted rewards of each region using the deep DGRM algorithm and the baseline method, with the reward function determined by the reward machine. From Fig. 8, it can be seen that only region 5 receives a large penalty and fails the task with the deep DGRM algorithm, while a total of 13 regions fail the task utilizing the baseline. We also plot the result of $0.1 H_i / T_i^H$ during the entire simulation for each region with $\kappa = 1$ in Fig. 9. The result indicates that our proposed deep DGRM algorithm can guide 19 of the 20 regions to avoid being in a severe situation.
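To summarize the baseline described at the start of this supplementary section, here is a minimal sketch of the regional bang-bang rule with its two thresholds. The hysteresis behavior between 0.2 and 0.5 (keep the previous decision) is our reading of "enforced when larger than 0.5 and relaxed when smaller than 0.2", and the function signature is our own.

```python
def bang_bang_lockdown(h_hospitalized, icu_beds, currently_locked):
    """Baseline regional bang-bang rule (sketch): enforce a lockdown when
    0.1*H_i / T_i^H rises above 0.5, relax it when the ratio falls below 0.2,
    and otherwise keep the previous decision (hysteresis)."""
    ratio = 0.1 * h_hospitalized / icu_beds
    if ratio > 0.5:
        return True          # enforce lockdown
    if ratio < 0.2:
        return False         # relax lockdown
    return currently_locked  # keep current regime between the two thresholds

# Example: a region with H = 6000 hospitalized and 1000 ICU beds enters (or stays
# in) lockdown because 0.1 * 6000 / 1000 = 0.6 > 0.5.
print(bang_bang_lockdown(6000, 1000, currently_locked=False))  # True
```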
References

[1] A model for reasoning about persistence and causation
[2] Efficient solution algorithms for factored MDPs
[3] Approximate linear-programming algorithms for graph-based Markov decision processes
[4] Variational planning for graph-based MDPs
[5] A survey of computational complexity results in systems and control
[6] A survey and critique of multiagent deep reinforcement learning
[7] Markov games as a framework for multi-agent reinforcement learning
[8] The dynamics of reinforcement learning in cooperative multiagent systems
[9] Value-function reinforcement learning in Markov games
[10] Nash Q-learning for general-sum stochastic games
[11] ILS: A framework for multi-paradigmatic learning
[12] UAS conflict resolution integrating a risk-based operational safety bound as airspace reservation with reinforcement learning
[13] 3M-RL: Multi-resolution, multi-agent, mean-field reinforcement learning for autonomous UAV routing
[14] Fully decentralized multi-agent reinforcement learning with networked agents
[15] Multiagent planning with factored MDPs
[16] Distributed policy synthesis of multi-agent systems with graph temporal logic specifications
[17] Counterfactual multi-agent policy gradients
[18] Multi-agent actor-critic for mixed cooperative-competitive environments
[19] Scalable reinforcement learning of localized policies for multi-agent networked systems
[20] Using reward machines for high-level task specification and decomposition in reinforcement learning
[21] Reward machines for cooperative multi-agent reinforcement learning
[22] Learning quadruped locomotion policies with reward machines
[23] Planning with uncertain specifications (PUnS)
[24] Joint inference of reward machines and policies for reinforcement learning
[25] Learning reward machines for partially observable reinforcement learning
[26] Induction of subgoal automata for reinforcement learning
[27] Induction and exploitation of subgoal automata for reinforcement learning
[28] DeepSynth: Automata synthesis for automatic task segmentation in deep reinforcement learning
[29] Learning non-Markovian reward models in MDPs
[30] Reinforcement learning with hierarchies of machines
[31] Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning
[32] Hierarchical reinforcement learning with the MAXQ value function decomposition
[33] Reward machines: Exploiting reward function structure in reinforcement learning
[34] Policy gradient methods for reinforcement learning with function approximation
[35] Intermittent yet coordinated regional strategies can alleviate the COVID-19 epidemic: a network model of the Italian case
[36] Reinforcement learning: An introduction
[37] "Stay at Home, Protect the National Health Service, Save Lives": a cost benefit analysis of the lockdown in the United Kingdom
[38] Multi-agent reinforcement learning: Independent vs. cooperative agents
[39] Proximal policy optimization algorithms
[40] Hindsight experience replay

The research reported in this paper was supported by funds from the NASA University Leadership Initiative program (Contract No. NNX17AJ86A, PI: Yongming Liu, Technical Officer: Anupa Bajwa). The support is gratefully acknowledged.

Here, $\pi_t^{i,(s,u,a)}$ denotes the distribution of $(s_i(t), u_i(t), a_i(t))$ conditioned on $(s(0), u(0), a(0)) = (s, u, a)$ under the policy $\pi_\theta$, and $\pi_t^{i,(s',u',a')}$ denotes the distribution of $(s_i(t), u_i(t), a_i(t))$ conditioned on $(s(0), u(0), a(0)) = (s', u', a')$ under the policy $\pi_\theta$. Since the system transition model $p_{\mathcal{R}}$ is locally dependent and the policy is localized, $\pi_t^{i,(s,u,a)}$ only depends on the initial information of agent $i$'s $\kappa$-hop neighborhood, $(s_{N_i^\kappa}, u_{N_i^\kappa}, a_{N_i^\kappa})$. Thus, $\pi_t^{i,(s,u,a)} = \pi_t^{i,(s',u',a')}$ for all $t \leq \kappa$ whenever the two initial triples agree on $N_i^\kappa$. Expanding $Q_i^\theta$ using the definition of the Q-function,
where the total variation distance between $\pi_t^{i,(s,u,a)}$ and $\pi_t^{i,(s',u',a')}$ appears, which is no larger than 1, for any $(s_{N_i^\kappa}, u_{N_i^\kappa}, a_{N_i^\kappa}) \in S_{N_i^\kappa} \times U_{N_i^\kappa} \times A_{N_i^\kappa}$.

For the truncated policy gradient, we first show that the second term is 0. Because $\tilde{Q}_i(s_{N_i^\kappa}, u_{N_i^\kappa}, a_{N_i^\kappa})$ does not rely on $a_j$ for $j \notin N_i^\kappa$, and $\sum_{a_j} \nabla_{\theta_j} \pi_j^{\theta_j}(a_j \mid s_j, u_j) = \nabla_{\theta_j} \sum_{a_j} \pi_j^{\theta_j}(a_j \mid s_j, u_j) = \nabla_{\theta_j} 1 = 0$, we can get that the third term is 0 and the second term is 0, and the desired bound follows.

The 4×5 grid-world for this experiment is shown in Fig. 5. Packages in warehouse A need to be transported to destination C, and packages in warehouse B need to be transported to destination D. UAVs 1 and 2 only have access to warehouse A, UAVs 5 and 6 only have access to warehouse B, and UAVs 3 and 4 can go to either warehouse A or B. UAVs start from the positions indicated in Fig. 5 with a fully charged battery. The battery consumption is 0.01% when waiting and 2% otherwise. We define that a UAV runs out of battery if its remaining battery percentage is less than 7.5%.
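As a final illustration, here is a hypothetical labeling function for a delivery UAV that emits the high-level events consumed by its reward machine. The 7.5% battery threshold follows the setup above, while the event names, arguments, and state encoding are our own.

```python
def uav_events(position, battery, has_package, warehouse_cells, destination_cell):
    """Illustrative labeling function L_i for a delivery UAV: maps the local state
    to the high-level events used by its reward machine. The 7.5% battery
    threshold follows the grid-world setup above; the rest is our own encoding."""
    events = set()
    if battery < 7.5:
        events.add("battery_out")          # leads to the sink state
    if position in warehouse_cells:
        events.add("at_warehouse")
    if has_package:
        events.add("has_package")
    if has_package and position == destination_cell:
        events.add("at_destination")       # leads to the goal state
    return frozenset(events)

# Example: a UAV at its destination cell with a package and 40% battery.
print(uav_events((3, 4), 40.0, True, {(0, 0)}, (3, 4)))
```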