title: A Microscopic Epidemic Model and Pandemic Prediction Using Multi-Agent Reinforcement Learning
author: Changliu Liu
date: 2020-04-27

Abstract: This paper introduces a microscopic approach to model epidemics, which can explicitly consider the consequences of individuals' decisions on the spread of the disease. We first formulate a microscopic multi-agent epidemic model where every agent can choose its activity level, which affects the spread of the disease. Then, by minimizing agents' cost functions, we solve for the optimal decisions of individual agents in the framework of game theory and multi-agent reinforcement learning. Given the optimal decisions of all agents, we can make predictions about the spread of the disease. We show that there are negative externalities, in the sense that infected agents do not have enough incentives to protect others, which then necessitates external interventions to regulate agents' behaviors. In the discussion section, future directions are pointed out to make the model more realistic.

With the COVID-19 pandemic surging across the world, a reliable model is needed to describe the observed spread of the disease, make predictions about the future, and guide public policy design to control the spread.

Existing epidemic models. There are many existing macroscopic epidemic models (Daryl J. Daley and Joe Gani. Epidemic modelling: an introduction, volume 15. Cambridge University Press, 2001). For example, the SI model describes the growth of the infection rate as the product of the current infection rate and the current susceptible rate. The SIR model further incorporates the effect of recovery, i.e., the infected population turns into an immune population after a certain period of time. The SIRS model considers the case where immunity is not lifelong, so the immune population can become susceptible again. In addition to these models, the SEIR model incorporates the incubation period into the analysis, the incubation period being the duration before symptoms show up (see, e.g., the SEIR epidemic model with delay). The most important factor in all these models is the infection rate, which is treated as a given parameter rather than as the outcome of individual agents' decisions.

Microscopic multi-agent epidemic model. Suppose there are M agents in the environment. Initially, m_0 agents are infected. Agents are indexed from 1 to M. Every agent has its own state and control input. The model is in discrete time, with the time interval set to one day. The evolution of the infection rate on consecutive days depends on agents' actions. The questions of interest are: How many agents will eventually be infected? How fast will they be infected? How can we slow down the growth of the infection rate?

We consider two state values for an agent: for agent i, x_i = 0 means healthy (susceptible) and x_i = 1 means infected. Every day, every agent i decides its level of activities u_i ∈ [0, 1]. The level of activities for agent i can be understood as the expected percentage of other agents in the system that agent i wants to meet. For example, u_i = 1/M means agent i expects to meet one other agent. The actual number of agents that agent i meets depends not only on agent i's activity level, but also on other agents' activity levels. For example, if all other agents choose an activity level of 0, then agent i will not be able to meet any other agent no matter what u_i it chooses.
Mathematically, the chance for agent i and agent j to meet each other depends on the minimum of the activity levels of the two agents, i.e., min{u_i, u_j}. In the extreme cases, if agent i decides to meet everyone in the system by choosing u_i = 1, then the chance for agent j to meet agent i is u_j. If agent i decides not to meet anyone in the system by choosing u_i = 0, then the chance for agent j to meet agent i is 0.

Before we derive the system dynamic model, the assumptions are listed below. (These assumptions can all be relaxed in future work; they are introduced mainly for simplicity of the discussion.)

1. In the agent model, we only consider two states: healthy (susceptible) and infected. All healthy agents are susceptible to the disease. There is no recovery and no death for infected agents. There is no incubation period for infected agents, i.e., once infected, an agent can start to infect other healthy agents. To relax this assumption, we may introduce more states for every agent.

2. The interactions among agents are assumed to be uniform, although this is not true in the real world. In the real world, given a fixed activity level, agents are more likely to meet close family, friends, and colleagues than strangers on the street. To incorporate this non-uniformity into the model, we would need to redefine the chance for agent i and agent j to meet each other as β_{i,j} min{u_i, u_j}, where β_{i,j} ∈ [0, 1] is a coefficient that encodes the proximity between agent i and agent j and affects the chance for them to meet. For simplicity, we assume the interaction patterns are uniform in this paper.

3. Meeting with infected agents results in immediate infection. To relax this assumption, we may introduce an infection probability to describe how likely it is for a healthy agent to be infected if it meets an infected agent.

On day k, denote agent i's state and control as x_{i,k} ∈ X and u_{i,k} ∈ U. By definition, the agent state space is X = {0, 1} and the agent control space is U = [0, 1]. The system state space is denoted X^M := X × · · · × X, and the system control space is denoted U^M := U × · · · × U. Define m_k = Σ_i x_{i,k} as the number of infected agents at time k. The set of infected agents is denoted I_k := {i : x_{i,k} = 1}. The state transition probability of the multi-agent system is P(x_{k+1} | x_k, u_k), where x_k ∈ X^M and u_k ∈ U^M; it can be characterized agent by agent as follows.

According to the assumptions, an infected agent always remains infected, so the state transition probability for an infected agent i does not depend on other agents' states or on any control. However, the state transition probability for a healthy agent i depends on others. The chance for a healthy agent i not to meet a given infected agent j ∈ I_k is 1 − min{u_i, u_j}. A healthy agent stays healthy if and only if it does not meet any infected agent, the probability of which is Π_{j∈I_k} (1 − min{u_i, u_j}). The probability for a healthy agent to be infected is therefore 1 − Π_{j∈I_k} (1 − min{u_i, u_j}). From the expression Π_{j∈I_k} (1 − min{u_i, u_j}), we can infer that the chance for a healthy agent i to stay healthy is higher if
• agent i limits its own activity by choosing a smaller u_i;
• the number of infected agents is smaller;
• the infected agents in I_k limit their activities.
The state transition probability for an agent i is summarized in table 1. A minimal simulation sketch of this one-day transition is given below.
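To make the transition rule concrete, here is a minimal Python sketch of one day of the dynamics in table 1 (my illustration, not part of the paper; the function name step and the sample numbers are arbitrary). A healthy agent stays healthy with probability Π_{j∈I_k}(1 − min{u_i, u_j}), and infected agents remain infected.

```python
import numpy as np

def step(x, u, rng):
    """One day of the microscopic dynamics (table 1): a healthy agent i stays
    healthy with probability prod_{j in I_k} (1 - min(u_i, u_j)); infected
    agents remain infected."""
    x = np.asarray(x, dtype=int)
    u = np.asarray(u, dtype=float)
    infected = np.where(x == 1)[0]
    x_next = x.copy()
    for i in np.where(x == 0)[0]:
        p_stay_healthy = np.prod(1.0 - np.minimum(u[i], u[infected]))
        if rng.random() > p_stay_healthy:
            x_next[i] = 1
    return x_next

rng = np.random.default_rng(0)
x = [1, 0, 0, 0, 0]              # agent 1 infected, the rest healthy
u = [0.3, 0.1, 0.2, 0.5, 0.0]    # activity levels chosen on this day
print(step(x, u, rng))
```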
Example. Consider the four-agent system shown in Fig. 1, where only agent 1 is infected and the other agents are healthy. The agents choose the following activity levels: u_1 = 0.1, u_2 = 0.2, u_3 = 0.3, u_4 = 0.4. Then the chance p_{i,j} for agents i and j to meet each other is p_{1,2} = p_{1,3} = p_{1,4} = 0.1, p_{2,3} = p_{2,4} = 0.2, and p_{3,4} = 0.3. Note that p_{i,j} = p_{j,i}. The chance for agents 2, 3, and 4 to stay healthy is 0.9, although they have different activity levels. (In Fig. 1, the numbers on the links denote the probabilities for agents to meet each other, which depend on the chosen activity levels of the different agents.)

Open-loop simulation. Before we derive the optimal strategies for individual agents and analyze the closed-loop multi-agent system, we first characterize the (open-loop) multi-agent system dynamics by Monte Carlo simulation according to the state transition probability in table 1. Suppose we have M = 1000 agents. At the beginning, only agent 1 is infected. We consider two levels of activity: a normal activity level u and a reduced activity level u*. The two activity levels are assigned to agents following different strategies, described below. In particular, we consider a "no intervention" case where all agents continue to follow the normal activity level, an "immediate isolation" case where the activity levels of infected agents immediately drop to the reduced level, a "delayed isolation" case where the activity levels of infected agents drop to the reduced level after several days, and a "lockdown" case where the activity levels of all agents drop to the reduced level immediately. For each case, we simulate 200 system trajectories and compute the average, maximum, and minimum m_k (number of infected agents) versus k over all trajectories. A system trajectory in the "no intervention" case is illustrated in Fig. 2, where u = 1/M for all agents; in this plot, the vertical axis corresponds to agent ID i and the color represents the value of x_{i,k}, blue for 0 (healthy) and yellow for 1 (infected). The m_k trajectories under the different cases are shown in Fig. 3, where the solid curves illustrate the average m_k and the shaded area corresponds to the range from the minimum to the maximum m_k. The results are explained below.

• Case 0: no intervention. All agents keep the normal activity level u. The scenarios for u = 1/M and u = 2/M are illustrated in Fig. 3. As expected, a higher activity level for all agents leads to faster infection. The trajectory of m_k has an S shape: its growth rate is relatively slow when either the infected population or the healthy population is small, and is maximized when 50% of agents are infected. It will be shown in the following discussion that (empirical) macroscopic models also generate S-curves.

• Case 1: immediate isolation of infected agents. The activity levels of infected agents immediately drop to u*, while others remain at u. The scenario for u = 1/M and u* = 0.1/M is illustrated in Fig. 3. Immediate isolation significantly slows down the growth of the infection rate. As expected, it has the best performance in terms of flattening the curve, the same as the lockdown case. The trajectory also has an S shape.

• Case 2: delayed isolation of infected agents. The activity levels of infected agents drop to u* after T days, while others remain at u. In the simulation, u = 1/M and u* = 0.1/M. The scenarios for T = 1 and T = 2 are illustrated in Fig. 3. As expected, the longer the delay, the faster the infection rate grows, though the growth of the infection rate is still slower than in the "no intervention" case. Moreover, the peak growth rate (when 50% of agents are infected) is higher when the delay is longer.

• Case 3: lockdown. The activity levels of all agents drop to u*. The scenario for u* = 0.1/M is illustrated in Fig. 3. As expected, it has the best performance in terms of flattening the curve, the same as the immediate isolation case. (In the case that the infected population can be asymptomatic or have a long incubation period before showing any symptoms, as we observe for COVID-19, immediate identification and isolation of infected persons is not achievable. Lockdown is then the best way to control the spread of the disease in our model.)

Since the epidemic model is monotone, every agent will eventually be infected as long as the probability of meeting infected agents does not drop to zero. Moreover, we have not discussed decision making by individual agents yet; the activity levels are simply predefined in the simulation. A Monte Carlo sketch of these four cases is given below.
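The following is a small Monte Carlo sketch of these four cases (my illustration, not the author's code; the population size, number of runs, and the handling of the delayed-isolation counter are simplified assumptions chosen to keep the sketch fast, so the exact numbers will differ from Fig. 3). Averaging m_k over the runs should reproduce the qualitative ordering described above.

```python
import numpy as np

def simulate(M=200, days=100, policy="none", delay=2, u_normal=None,
             u_reduced=None, n_runs=20, seed=0):
    """Monte Carlo roll-outs of the open-loop model under the four cases:
    'none' (no intervention), 'isolate' (immediate isolation of infected),
    'delayed' (isolation after `delay` days), 'lockdown' (everyone reduced).
    Returns the average m_k curve over all runs."""
    u_normal = 1.0 / M if u_normal is None else u_normal
    u_reduced = 0.1 / M if u_reduced is None else u_reduced
    rng = np.random.default_rng(seed)
    m_curves = np.zeros((n_runs, days))
    for r in range(n_runs):
        x = np.zeros(M, dtype=int)
        x[0] = 1                                   # only agent 1 infected
        days_infected = np.zeros(M, dtype=int)     # rough delay counter
        for k in range(days):
            u = np.full(M, u_normal)
            if policy == "lockdown":
                u[:] = u_reduced
            elif policy == "isolate":
                u[x == 1] = u_reduced
            elif policy == "delayed":
                u[(x == 1) & (days_infected >= delay)] = u_reduced
            infected = np.where(x == 1)[0]
            # healthy agent i stays healthy w.p. prod_j (1 - min(u_i, u_j))
            for i in np.where(x == 0)[0]:
                p_stay = np.prod(1.0 - np.minimum(u[i], u[infected]))
                if rng.random() > p_stay:
                    x[i] = 1
            days_infected[x == 1] += 1
            m_curves[r, k] = x.sum()
    return m_curves.mean(axis=0)

for policy in ["none", "delayed", "isolate", "lockdown"]:
    print(policy, simulate(policy=policy)[-1])     # average m_k on the last day
```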
Comparison with the macroscopic model. For comparison, consider the macroscopic SI model described earlier, in which the number of infected agents m_k evolves deterministically as m_{k+1} = m_k + β m_k (M − m_k)/M, where β is the infection coefficient. We simulate this system trajectory under different infection coefficients, as shown in Fig. 4. The trajectories also have S shapes, similar to the ones in the microscopic model. However, since this macroscopic SI model is deterministic, there is no "uncertainty" range as in the microscopic model. The infection coefficient β depends on the agents' choices of activity levels, but there is no explicit relationship between them yet. It is therefore better to use the microscopic model directly to analyze the consequences of individual agents' choices.

Distributed optimal control. This section tries to answer the following question: in the microscopic multi-agent epidemic model, what is the best control strategy for individual agents? To answer that, we need to first specify the knowledge and observation models as well as the cost (reward) functions for individual agents. Then we derive the optimal choices of agents in a distributed manner. The resulting system dynamics correspond to a Nash Equilibrium of the system.

A knowledge and observation model for agent i includes two aspects: what agent i knows about itself, and what agent i knows about others. The knowledge about any agent j includes the dynamic function of agent j and the cost function of agent j. The observation corresponds to run-time measurements, i.e., the observation of any agent j includes the run-time state x_{j,k} and the run-time control u_{j,k}. In the following discussion, regarding the knowledge and observation model, we make the following assumptions:

• An agent knows its own dynamics and cost function;
• All agents are homogeneous in the sense that they share the same dynamics and cost functions, and agents know that all agents are homogeneous, hence they know others' dynamics and cost functions;
• At time k, agents can measure x_{j,k} for all j, but they cannot measure u_{j,k} until time k + 1. Hence, the agents are playing a simultaneous game; they need to infer others' decisions when making their own decisions at any time k.

We consider two conflicting interests for every agent (the identification of these two conflicting interests is purely empirical; to build realistic cost functions, we need to either study real-world data or conduct human subject experiments):

• limit the activity level to minimize the chance of getting infected;
• maintain a certain activity level for living.
We define the run-time cost for agent i at time k as

l_{i,k} = x_{i,k+1} + α_i p(u_{i,k}),

where x_{i,k+1} corresponds to the first interest, p(u_{i,k}) corresponds to the second interest, and α_i > 0 adjusts the preference between the two interests. The function p(u) is assumed to be smooth. (It can be a decreasing function on [0, 1], meaning that the higher the activity level, the better; or a convex parabolic function on [0, 1] with its minimum attained at some u*, meaning that the activity level should be maintained around u*.) Due to our homogeneity assumption on agents, they should have identical preferences, i.e., α_i = α for all i.

Agent i chooses its action at time k by minimizing the expected cumulative cost in the future:

u_{i,k} = arg min_{u_{i,k}} E[ Σ_{t≥k} γ^{t−k} l_{i,t} ],    (5)

where γ ∈ [0, 1] is a discount factor. The objective function depends on all agents' current and future actions, and it is difficult to directly obtain an analytical solution of (5). Later we will use multi-agent reinforcement learning to obtain a numerical solution. In this section, to simplify the problem, we consider a single-stage game where the agents have zero discount of the future, i.e., γ = 0. (The formulation (5) corresponds to a repeated game as opposed to a single-stage game. Repeated games capture the idea that an agent has to take into account the impact of its current action on the future actions of others; this impact is called the agent's reputation. The interaction is more complex in a repeated game than in a single-stage game.) Hence the objective function reduces to the expected one-step cost E[l_{i,k}], which only depends on the current actions of the agents. According to the state transition probability in table 1, the expected cost is

E[l_{i,k}] = 1 + α_i p(u_{i,k})   if x_{i,k} = 1,
E[l_{i,k}] = 1 − Π_{j∈I_k} (1 − min{u_{i,k}, u_{j,k}}) + α_i p(u_{i,k})   if x_{i,k} = 0.    (7)

Nash Equilibrium. According to (7), the expected cost for an infected agent only depends on its own action. Hence the optimal choice for an infected agent is u_{i,k} = ū := arg min_u p(u). The optimal choice for a healthy agent then satisfies

u_{i,k} = arg min_u [ 1 − (1 − min{u, ū})^{m_k} + α_i p(u) ].    (9)

Note that the term 1 − (1 − min{u, ū})^{m_k} is positive, increasing for u ∈ [0, ū], and constant for u ∈ [ū, 1]. Hence the optimal solution of (9) is smaller than ū = arg min_u p(u). (If u ≥ ū, the objective in (9) becomes 1 − (1 − ū)^{m_k} + α_i p(u), whose smallest value over u ≥ ū is J(ū) := 1 − (1 − ū)^{m_k} + α_i p(ū). Since the derivative of J(u) := 1 − (1 − u)^{m_k} + α_i p(u) at ū is positive, there exists u < ū with cost J(u) < J(ū), while J(ū) equals the smallest cost achievable with u ≥ ū. Hence the optimal solution of (9) satisfies u < ū.) The objective in (9) can therefore be simplified to 1 − (1 − u)^{m_k} + α_i p(u). In summary, the optimal actions of both the infected and the healthy agents in the Nash Equilibrium can be compactly written as

u_{i,k} = x_{i,k} ū + (1 − x_{i,k}) arg min_u [ 1 − (1 − u)^{m_k} + α_i p(u) ].    (10)

Example. Consider the previous example with four agents shown in Fig. 1. Define

p(u) = exp(1/(u − 1)),    (11)

which is a monotonically decreasing function on [0, 1], as illustrated in Fig. 5. Then the optimal actions in the Nash Equilibrium for this specific problem satisfy

u_{i,k} = x_{i,k} · 1 + (1 − x_{i,k}) arg min_u [ u + α_i exp(1/(u − 1)) ].    (12)

Solving (12), for infected agents u_{i,k} = 1. For healthy agents, the choice also depends on α_i, as illustrated in Fig. 6. We have assumed that α_i = α is identical for all agents. We further assume that α < 2, such that the optimal solution for healthy agents is u_{i,k} = 0. The optimal actions and the corresponding costs for all agents are listed in table 2. In this Nash Equilibrium, no agent meets any other agent, since all agents except agent 1 reduce their activity levels to zero. The actual cost (received at the next time step) equals the expected cost (computed at the current time step). A numerical sketch of these best responses is given below.
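As a quick numerical check of (10) and (12), the sketch below grid-searches the single-stage best responses using the p(u) from this example (my code, not the paper's; the grid resolution is an arbitrary choice). For α < 2 and m_k = 1 it recovers ū = 1 for the infected agent and u = 0 for the healthy agents, with equilibrium costs 1 and α exp(−1).

```python
import numpy as np

alpha, m_k = 1.0, 1            # preference weight (alpha < 2) and number of infected
u_grid = np.linspace(0.0, 1.0, 1001)

def p(u):
    """Activity-preference cost of the example: monotonically decreasing,
    p(0) = exp(-1), p(u) -> 0 as u -> 1."""
    u = np.clip(u, 0.0, 1.0 - 1e-9)
    return np.exp(1.0 / (u - 1.0))

# Infected agent: expected cost 1 + alpha * p(u), minimized at u_bar = 1.
u_bar = u_grid[np.argmin(1.0 + alpha * p(u_grid))]

# Healthy agent: expected cost 1 - (1 - min(u, u_bar))**m_k + alpha * p(u).
cost_healthy = 1.0 - (1.0 - np.minimum(u_grid, u_bar)) ** m_k + alpha * p(u_grid)
u_healthy = u_grid[np.argmin(cost_healthy)]

print(u_bar, u_healthy)                       # ~1.0 and 0.0 for alpha < 2
print(1.0 + alpha * p(1.0), alpha * p(0.0))   # equilibrium costs: 1 and alpha*exp(-1)
```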
However, let us consider another situation where the infected agent chooses activity level 0 and all the healthy agents choose activity level 1. The resulting costs are summarized in table 3:

Agent    | x_{i,k} | u_{i,k} | Expected l_{i,k} | Actual l_{i,k}
1        | 1       | 0       | 1 + α exp(−1)    | 1 + α exp(−1)
2, 3, 4  | 0       | 1       | 0                | 0
Total    |         |         | 1 + α exp(−1)    | 1 + α exp(−1)

(Table 3: List of the agent decisions and associated costs in a situation better than the Nash Equilibrium in the four-agent example.) Obviously, the overall cost is reduced in the new situation. However, this better situation cannot be attained spontaneously by the agents, due to externality in the system, which is explained below.

For a multi-agent system, define the system cost as the summation of the individual costs:

L_k = Σ_{i=1}^{M} l_{i,k}.    (13)

The system cost in the Nash Equilibrium is denoted L*_k, which corresponds to the evaluation of L_k under the agent actions specified in (10). On the other hand, the optimal system cost is defined as

L^o_k = min_{u_{1,k}, ..., u_{M,k}} E[L_k].    (14)

The optimization problem (14) is solved in a centralized manner, which is different from how the Nash Equilibrium is obtained. To obtain the Nash Equilibrium, all agents solve their own optimization problems independently: although their objective functions depend on other agents' actions, they do not jointly make the decisions, but only "infer" what others will do. By definition, L^o_k ≤ L*_k. In the example above, L*_k = 1 + 3α exp(−1) and L^o_k = 1 + α exp(−1). The difference L*_k − L^o_k is called the loss of social welfare. In the epidemic model, the loss of social welfare is due to the fact that bad consequences (i.e., infecting others) are not penalized in the cost functions of the infected agents. Such unpenalized consequences are called externalities. There can be both positive and negative externalities. Under positive externality, agents lack motivation to do things that are good for society; under negative externality, agents lack motivation to prevent things that are bad for society. In the epidemic model, there is negative externality associated with infected agents. To improve social welfare, we need to "internalize" the externality, i.e., add a penalty for "spreading" the disease. Let us redefine agent i's run-time cost as

l̃_{i,k} = x_{i,k+1} + α_i p(u_{i,k}) + x_{i,k} q(u_{i,k}),    (15)

where q(·) is a monotonically increasing function. The last term x_{i,k} q(u_{i,k}) does not affect healthy agents since their x_{i,k} = 0, but adds a penalty for infected agents if they choose a large activity level. One candidate function for q(u) is 1 − (1 − u)^{m_k}. In the real world, such "cost shaping" using q can be achieved through social norms or government regulation. The expected cost becomes

E[l̃_{i,k}] = 1 + α_i p(u_{i,k}) + q(u_{i,k})   if x_{i,k} = 1,
E[l̃_{i,k}] = 1 − Π_{j∈I_k} (1 − min{u_{i,k}, u_{j,k}}) + α_i p(u_{i,k})   if x_{i,k} = 0.    (16)

Suppose the function q is well tuned such that arg min_u [α_i p(u) + q(u)] = 0. Then, although the expected costs of infected agents are still independent of others, their decisions become considerate toward healthy agents. When the infected agents choose u = 0, the expected cost for healthy agents becomes α_i p(u_{i,k}), meaning that they do not need to worry about getting infected. Let us now compute the resulting Nash Equilibrium under the shaped costs using the previous example.

Example. In the four-agent example, set q(u) = u. Then arg min_u [α p(u) + u] = 0, hence agent 1 chooses u_{1,k} = 0. Agents i = 2, 3, 4 choose u_{i,k} = 1, since they are then only minimizing p(u). The resulting costs are summarized in table 4. With the shaped costs, the system enters a better Nash Equilibrium, which indeed aligns with the system optimum in (14). A numerical sketch of this shaped equilibrium is given below.
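To illustrate the effect of the shaping term, the sketch below recomputes the single-stage best responses under the shaped cost (15) with q(u) = u for the four-agent example (my illustration; the grid search and the specific printout are arbitrary choices). It recovers the shaped equilibrium, with system cost roughly 1 + α exp(−1) instead of the unshaped 1 + 3α exp(−1).

```python
import numpy as np

alpha = 1.0
u_grid = np.linspace(0.0, 1.0, 1001)

def p(u):                                   # activity-preference cost of the example
    u = np.clip(u, 0.0, 1.0 - 1e-9)
    return np.exp(1.0 / (u - 1.0))

def q(u):                                   # shaping term, monotonically increasing
    return u

# Infected agent under the shaped cost: minimize 1 + alpha*p(u) + q(u)  ->  u = 0.
u_infected = u_grid[np.argmin(1.0 + alpha * p(u_grid) + q(u_grid))]

# Healthy agents: with the infected agent at u = 0 the infection term vanishes
# (min(u, 0) = 0), so they only minimize alpha * p(u)  ->  u = 1.
u_healthy = u_grid[np.argmin(alpha * p(u_grid))]

# System cost (sum of the original run-time costs) in the shaped equilibrium
# versus the unshaped Nash equilibrium (one infected agent, three healthy agents).
L_shaped = (1.0 + alpha * p(u_infected)) + 3 * alpha * p(u_healthy)
L_nash = (1.0 + alpha * p(1.0)) + 3 * alpha * p(0.0)
print(u_infected, u_healthy)   # 0.0 and ~1.0
print(L_shaped, L_nash)        # ~1 + alpha*exp(-1)  vs  1 + 3*alpha*exp(-1)
```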
A few remarks:
• Cost shaping did not increase the overall cost for the multi-agent system.
• The system optimum remains the same before and after cost shaping.
• Cost shaping helped agents arrive at the system optimum without centralized optimization.

Multi-agent reinforcement learning. We have shown how to compute the Nash Equilibrium of the multi-agent epidemic model in a single stage. However, it is analytically intractable to compute the Nash Equilibrium when we consider repeated games (5). The complexity further grows when the number of agents increases and when there is information asymmetry. Nonetheless, we can apply multi-agent reinforcement learning (Lucian Buşoniu, Robert Babuška, and Bart De Schutter. Multi-agent reinforcement learning: An overview. In Innovations in Multi-Agent Systems and Applications - 1, pages 183-221. Springer, 2010) to numerically compute the Nash Equilibrium. The evolution of the pandemic can then be predicted by simulating the system under the Nash Equilibrium.

As evident from (10), the optimal action for agent i at time k is a function of x_{i,k} and m_k. Hence we can define a Q function (action value function) for agent i as

Q_i(x_{i,k}, m_k, u_{i,k}),

the expected cumulative cost of taking action u_{i,k} in state (x_{i,k}, m_k). According to the assumptions made in the observation model, all agents can observe m_k at time k. For a single-stage game, the corresponding optimal actions have been derived in closed form in (10). For repeated games (5), we can learn the Q function using temporal difference learning. At every time k, agent i chooses its action as

u_{i,k} = arg min_u Q_i(x_{i,k}, m_k, u).

After taking the action u_{i,k}, agent i observes x_{i,k+1} and m_{k+1} and receives the cost l_{i,k} at time k + 1. Then agent i updates its Q function:

Q_i(x_{i,k}, m_k, u_{i,k}) ← Q_i(x_{i,k}, m_k, u_{i,k}) + η δ_{i,k},
δ_{i,k} = l_{i,k} + γ min_u Q_i(x_{i,k+1}, m_{k+1}, u) − Q_i(x_{i,k}, m_k, u_{i,k}),

where η is the learning gain and δ_{i,k} is the temporal difference error. All agents can run the above algorithm to learn their Q functions during the interaction with others. However, the algorithm introduced above has several problems:

• Exploration and limited rationality. There is no exploration in the greedy action choice above. Indeed, Q-learning is usually applied together with ε-greedy exploration, where with probability 1 − ε the action u_{i,k} is chosen to be the greedy optimal action, and with probability ε the action is randomly chosen with a uniform distribution over the action space. The ε-greedy approach is introduced mainly from an algorithmic perspective to improve convergence of the learning process. When applied to the epidemic model, it has a unique societal implication: when agents randomly choose their behaviors, it represents the fact that agents have only limited rationality. Hence, in the learning process, we apply ε-greedy both to incorporate exploration for faster convergence and to take into account the limited rationality of agents.

• Data efficiency and parameter sharing. Keeping separate Q functions for individual agents is not data efficient; an agent may not be able to collect enough samples to properly learn the desired Q function. Due to the homogeneity assumptions we made earlier about agents' cost functions, it is more data efficient to share the Q function across all agents. Its societal implication is that agents share information and knowledge with each other. Hence, we apply parameter sharing (Jayesh K. Gupta, Maxim Egorov, and Mykel Kochenderfer. Cooperative multi-agent control using deep reinforcement learning. In International Conference on Autonomous Agents and Multiagent Systems, pages 66-83. Springer, 2017) both to improve data efficiency and to account for information sharing among agents during the learning process. (In a more complex situation where agents are not homogeneous, it is desirable to share parameters within smaller groups of agents instead of across all agents.)

With the above modifications, the multi-agent Q-learning algorithm (cf. Junling Hu and Michael P. Wellman. Nash Q-learning for general-sum stochastic games. Journal of Machine Learning Research, 4(Nov), 2003) is summarized below:

• At every time step k, each agent i chooses its action with ε-greedy exploration on the shared Q function: with probability 1 − ε, u_{i,k} = arg min_u Q(x_{i,k}, m_k, u); with probability ε, u_{i,k} is sampled uniformly from the action space.
• At the next time step k + 1, agents observe the new states x_{i,k+1} and receive the costs l_{i,k} for all i. The shared Q function is then updated with every agent's transition:
Q(x_{i,k}, m_k, u_{i,k}) ← Q(x_{i,k}, m_k, u_{i,k}) + η [ l_{i,k} + γ min_u Q(x_{i,k+1}, m_{k+1}, u) − Q(x_{i,k}, m_k, u_{i,k}) ],  for i = 1, . . . , M.

A minimal implementation sketch of this learning loop is given below.
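The following is a minimal, self-contained Python sketch of this shared-table Q-learning loop, using the 50-agent setup of the example that follows (this is my illustration, not the author's implementation; the episode length of 60 days, the default discount γ = 0.5, and the function name train are assumptions).

```python
import numpy as np

def train(M=50, episodes=200, days=60, alpha=1.0, gamma=0.5, eta=1.0, seed=0):
    """Tabular multi-agent Q-learning with a shared table Q[x, m, a],
    epsilon-greedy exploration, and the run-time cost of the 50-agent example.
    Actions are the discretized activity levels {0, 1/M, 10/M}."""
    rng = np.random.default_rng(seed)
    actions = np.array([0.0, 1.0 / M, 10.0 / M])    # low, medium, high
    Q = np.full((2, M + 1, 3), 10.0)                 # shared action-value table

    def p(u):                                        # activity-preference cost
        return np.exp(1.0 / (u - 1.0)) if u < 1.0 else 0.0

    for ep in range(episodes):
        eps = 0.5 * (1.0 - ep / episodes)            # decaying exploration rate
        x = np.zeros(M, dtype=int)
        x[0] = 1                                     # one infected agent initially
        for k in range(days):
            m = int(x.sum())
            greedy = Q[x, m].argmin(axis=1)          # per-agent greedy action
            a = np.where(rng.random(M) < eps, rng.integers(0, 3, size=M), greedy)
            u = actions[a]
            # one day of the epidemic dynamics
            x_next = x.copy()
            infected = np.where(x == 1)[0]
            for i in np.where(x == 0)[0]:
                p_stay = np.prod(1.0 - np.minimum(u[i], u[infected]))
                if rng.random() > p_stay:
                    x_next[i] = 1
            m_next = int(x_next.sum())
            # shared TD update using every agent's transition
            for i in range(M):
                cost = x_next[i] + alpha * p(u[i])   # l_{i,k} = x_{i,k+1} + a*exp(1/(u-1))
                td = cost + gamma * Q[x_next[i], m_next].min() - Q[x[i], m, a[i]]
                Q[x[i], m, a[i]] += eta * td
            x = x_next
    return Q

Q = train()
print(Q[0].argmin(axis=1))   # learned healthy-agent action index for each m
print(Q[1].argmin(axis=1))   # learned infected-agent action index for each m
```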
Example. In this example, we consider M = 50 agents in the system; only one agent is infected at the beginning. The run-time cost is the same as in the example in the distributed optimal control section, i.e., l_{i,k} = x_{i,k+1} + α exp(1/(u_{i,k} − 1)), where α is chosen to be 1. For simplicity, the action space is discretized to {0, 1/M, 10/M}, referred to as low, medium, and high. Hence the Q function can be stored as a 2 × M × 3 matrix. In the learning algorithm, the learning rate is set to η = 1. The exploration rate decays over episodes, i.e., ε = 0.5(1 − E/max E), where E denotes the current episode and the maximum number of episodes is max E = 200. The Q function is initialized to 10 for all entries. Three different cases are considered. For each case, we illustrate the Q function learned after 200 episodes as well as the system trajectories for episodes 10, 20, . . . , 200, blue for earlier episodes and red for later episodes. The results are shown in Fig. 7.

• Case 1: discount γ = 0 with run-time cost l_{i,k}. With γ = 0, this case reduces to the single-stage game discussed in the distributed optimal control section, so the result should align with the analytical Nash Equilibrium in (10). As shown in the left plot of Fig. 7(a), the optimal action for a healthy agent is always low (solid green), while the optimal action for an infected agent is always high (dashed magenta). The Q values for infected agents do not depend on m_k. The Q values for healthy agents increase as m_k increases if the activity level is not zero, because, for a fixed activity level, the chance of getting infected is higher when there are more infected agents in the system. All these results align with our previous theoretical analysis. Moreover, as shown in the right plot of Fig. 7(a), the agents learn to flatten the curve across episodes.

• Case 2: discount γ = 0.5 with run-time cost l_{i,k}. Since the agents now consider cumulative costs as in (5), the corresponding Q values are higher than those in case 1. However, the optimal actions remain the same: low (solid green) for healthy agents and high (dashed magenta) for infected agents, as shown in the left plot of Fig. 7(b). The trends of the Q curves also remain the same: the Q values do not depend on m_k for infected agents and for healthy agents whose activity levels are zero. However, as shown in the right plot of Fig. 7(b), the agents learn to flatten the curve faster than in case 1, mainly because healthy agents are more cautious (they converge faster to low activity levels) once they start to consider cumulative costs.

• Case 3: discount γ = 0.5 with shaped run-time cost l̃_{i,k} in (15). The shaped cost changes the optimal actions for all agents as well as the resulting Q values. As shown in the left plot of Fig. 7(c),
the optimal action for an infected agent is low (dashed green), while that for a healthy agent is high (solid magenta) when m_k is small and low (solid green) when m_k is large. Note that when m_k is high, the healthy agents still prefer a low activity level even though the optimal actions for infected agents are low. This is because, due to the randomization introduced by ε-greedy, there is still a chance for infected agents to take medium or high activity levels. When m_k is high, the healthy agents would rather limit their own activity levels to avoid the risk of meeting infected agents that are taking random actions. This result captures the fact that agents understand others may have limited rationality, and they therefore prefer more conservative behaviors. We observe the same trends for the Q curves as in the previous two cases: the Q values do not depend on m_k for infected agents and for healthy agents whose activity levels are zero. In terms of absolute values, the Q values for infected agents are higher than those in case 2 due to the additional cost q(u) in l̃_{i,k}. The Q values for healthy agents are smaller than those in case 2 for medium and high activity levels, since the chance of getting infected is smaller now that infected agents prefer low activity levels. The Q values remain the same for healthy agents with zero activity levels. With shaped costs, the agents learn to flatten the curve even faster than in case 2, as shown in the right plot of Fig. 7(c), since the shaped cost encourages infected agents to lower their activity levels.

Agents vs humans. The epidemic model can be used to analyze real-world societal problems. Nonetheless, it is important to understand the differences between agents and humans. We can directly design and shape the cost function for agents, but not for humans. Agents' behavior is predictable once we fully specify the problem (i.e., cost, dynamics, measurement, etc.), hence we can optimize the design (i.e., the cost function) to obtain the desired system trajectory. Humans' behavior is not fully predictable due to limited rationality; we need to constantly modify the knowledge and observation model as well as the cost function to match true human behavior.

Future work. The proposed model is in its preliminary form. Many future directions can be pursued.

• Relaxation of assumptions. We may add more agent states to consider recovery, incubation period, and death. We may consider the fact that the interaction patterns among agents are not uniform. We may consider a wide variety of agents who are not homogeneous. For example, health providers and equipment suppliers are key parts of fighting the disease; they should receive lower cost (higher reward) for maintaining or even expanding their activity levels than ordinary people, and their services can then lead to a higher recovery rate. In addition, we may relax the assumptions on agents' knowledge and observation models, to consider information asymmetry as well as partial observation. For example, agents may not get an immediate measurement of whether they are infected, or of how many agents are infected in the system.

• Realistic cost functions for agents. The cost functions for agents are currently hand-tuned. We may learn those cost functions from data through inverse reinforcement learning. Those cost functions can vary for agents from different countries, different age groups, and different occupations.
Moreover, the cost functions carry important cultural, demographic, economic, and political information. A realistic cost function can help us understand why we observe significantly different outcomes of the pandemic around the world, as well as enable more realistic predictions of the future by fully considering those cultural, demographic, economic, and political factors.

• Incorporation of public policies. For now, the only external intervention we introduced is cost shaping. We may consider a wider range of public policies that can change the closed-loop system dynamics, for example, shut-down of transportation, isolation of infected agents, contact tracing, antibody testing, etc.

• Transient vs steady-state system behaviors. We have focused on the steady-state system behaviors in the Nash Equilibrium. However, as agents live in a highly dynamic world, it is not guaranteed that a Nash Equilibrium can always be attained. While agents are learning to deal with unforeseen situations, there are many interesting transient dynamics, some of which are captured in Fig. 7, e.g., agents may learn to flatten the curve at different rates. Methods to understand and predict transient dynamics may be developed in the future.

• Validation against real-world historical data. To use the proposed model for prediction in the real world, we need to validate its fidelity against historical data. The validation can be performed on the m_k trajectories, i.e., for the same initial condition, the predicted m_k trajectories should align with the ground-truth m_k trajectories.

Conclusion. This paper introduced a microscopic multi-agent epidemic model, which explicitly considered the consequences of individuals' decisions on the spread of the disease. In the model, every agent can choose its activity level to minimize its cost function, which consists of two conflicting components: staying healthy by limiting activities and maintaining high activity levels for living. We solved for the optimal decisions of individual agents in the framework of game theory and multi-agent reinforcement learning. Given the optimal decisions of all agents, we can make predictions about the spread of the disease. The system had negative externality in the sense that infected agents did not have enough incentives to protect others, which then required external interventions such as cost shaping. Future directions were pointed out to make the model more realistic.

References
Estimating the impact of public and private strategies for controlling an epidemic: A multi-agent approach.
Lucian Buşoniu, Robert Babuška, and Bart De Schutter. Multi-agent reinforcement learning: An overview. In Innovations in Multi-Agent Systems and Applications - 1, pages 183-221. Springer, 2010.
Daryl J. Daley and Joe Gani. Epidemic modelling: an introduction, volume 15. Cambridge University Press, 2001.
Jayesh K. Gupta, Maxim Egorov, and Mykel Kochenderfer. Cooperative multi-agent control using deep reinforcement learning. In International Conference on Autonomous Agents and Multiagent Systems, pages 66-83. Springer, 2017.
Junling Hu and Michael P. Wellman. Nash Q-learning for general-sum stochastic games. Journal of Machine Learning Research, 4(Nov), 2003.
SEIR epidemic model with delay.