key: cord-0046201-2ofdqmb2
authors: Khadilkar, Harshad; Ganu, Tanuja; Seetharam, Deva P.
title: Optimising Lockdown Policies for Epidemic Control using Reinforcement Learning: An AI-Driven Control Approach Compatible with Existing Disease and Network Models
date: 2020-06-24
journal: Trans Indian Natl
DOI: 10.1007/s41403-020-00129-3
sha: 81b30b7836ecc1436b09a095f0cbf5f260aab7a6
doc_id: 46201
cord_uid: 2ofdqmb2

There has been intense debate about lockdown policies in the context of Covid-19, aimed at limiting damage both to health and to the economy. We present an AI-driven approach for generating optimal lockdown policies that control the spread of the disease while balancing health and economic costs. Furthermore, the proposed reinforcement learning approach automatically learns those policies as a function of disease and population parameters. The approach accounts for imperfect lockdowns, can be used to explore a range of policies through tunable parameters, and can be easily extended to fine-grained lockdown strictness. The control approach can be used with any compatible disease and network simulation models.

In this article, we briefly describe an AI-driven approach for generating lockdown policies that are optimised based on disease characteristics and network parameters. The approach is intended for use by policy-makers who are knowledgeable in epidemiology but not necessarily well-versed in dynamic systems and control. It is designed to be modular and flexible: the underlying reinforcement learning algorithm can work with any compatible disease and network models, and the critical characteristics of those models are exposed as tunable parameters.

The models described in this paper are based on a commonly used epidemiological model from the literature (Perez and Dragicevic 2009), as shown in Fig. 1. The parameter values can be tuned to model infectious diseases including Covid-19; we use values for Covid-19 computed by Jung et al. (2020). We also account for network propagation characteristics through tunable parameters such as the strictness of lockdowns within network nodes and of travel between nodes, including the possibility of a leaky quarantine. The probability of disease transmission between people is a macro-level parameter, but it accounts for micro-level effects such as social distancing, mask usage, and weather. The network definition is based on node locations and the population of each node, with connectivity between each pair of nodes defined by a gravity model (Allamanis et al. 2012).

The results presented in this paper are based on a randomly generated network with 100 nodes and 10,000 people randomly distributed amongst those nodes. Fig. 2 shows the resulting evolution for a fixed strategy of locking down any node when its symptomatic population exceeds 5% of the total, and reopening it when it falls below this level. This is typical of the approach followed in several regions worldwide. While the peak is small, the epidemic lasts for nearly the full year (with high economic cost).

Reinforcement learning (RL) works by running a large number of simulations of the spread of the disease while attempting to find the optimal policy for lockdowns (Sutton and Barto 2012). The chief requirement is to quantify the cost of each outcome of the simulation. In this study, we impose a cost of 1.0 on each day of lockdown, 1.0 on each person infected, and 2.5 on each death. A reward is defined as the negative of these costs (the higher the reward, the lower the cost).
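To make the cost structure concrete, the minimal Python sketch below shows one way the per-day cost, the corresponding reward, and the gravity-model connectivity could be computed. The cost weights (1.0 per day of lockdown, 1.0 per infection, 2.5 per death) come from the text; charging the lockdown cost per node per day, the gravity-model functional form, and the distance exponent are assumptions made here for illustration, not the paper's exact parameterisation.

```python
# Minimal illustrative sketch, not the authors' implementation.
# Assumptions: lockdown cost accrues per node per day; gravity weight uses a
# squared-distance denominator (a common choice, not stated in the article).

def gravity_weight(pop_i: int, pop_j: int, distance_km: float,
                   exponent: float = 2.0) -> float:
    """Connectivity between two nodes: larger populations and shorter
    distances imply more travel (and hence more disease propagation)."""
    return (pop_i * pop_j) / (distance_km ** exponent)

def daily_cost(nodes_locked_down: int, new_infections: int, new_deaths: int) -> float:
    """Cost accumulated on one simulated day, using the weights from the text."""
    return 1.0 * nodes_locked_down + 1.0 * new_infections + 2.5 * new_deaths

def reward(nodes_locked_down: int, new_infections: int, new_deaths: int) -> float:
    """RL reward: the negative of the accumulated cost."""
    return -daily_cost(nodes_locked_down, new_infections, new_deaths)

# Example: 3 nodes locked down, 40 new infections and 1 death on a given day.
print(reward(nodes_locked_down=3, new_infections=40, new_deaths=1))  # -45.5
```

Summing this daily reward over a full simulated episode gives the quantity that the learning algorithm tries to maximise.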
The actions asked of the algorithm are binary: at the beginning of every week, and for every node, the algorithm must decide whether to keep the node open or lock it down. We use Deep Q-Learning (Mnih et al. 2015) to train the algorithm (a minimal sketch of this weekly decision interface is given after the references). For this specific instance, the RL algorithm improves and then saturates within 75 simulations, as shown in Fig. 3. The evolution of infection rates in Fig. 4 (computed over 10 independent runs) shows that the learnt policy has a higher peak than the 5% policy in Fig. 2, but requires significantly fewer lockdowns and results in a shorter epidemic duration. Note also that there are no kinks caused by new infections after nodes are reopened.

The key points of novelty in this approach are: (i) we focus neither on epidemiological modelling nor on prediction of the spread of the disease, but rather on controlling the spread while balancing long-term economic and health costs; (ii) our control approach can work with any disease parameters (not just those of Covid-19) and with any compatible network data and propagation model (not just for specific geographies); (iii) rather than taking decisions based on simple thresholds such as the fraction of people with symptoms, the learnt policies combine several context variables, such as rates of new infections, to take optimal decisions; (iv) end-users need only change input parameters to create policies with their desired characteristics; and (v) the algorithm is not a black box, and the sensitivity of the policy to its input features can be studied.

Fig. 5 demonstrates the last claim by considering the sensitivity of decisions to two input features at a time. The first plot shows that the policy recommends lockdowns when the infection rate in the overall population or within a node exceeds 0.2; however, lockdowns can be recommended at much smaller values if both infection rates reach 0.1. A similar trend is visible in the plot on the right, which shows that lockdowns are recommended at much lower infection rates if a node has a large population.

The reinforcement learning algorithm is ready for use in conjunction with real-world data sets, epidemiological models and network propagation models; any of these three aspects can be changed as per user requirements. The algorithm is computationally lightweight, and running it requires only Python. We have demonstrated its capability to handle nation-scale data (Khadilkar et al. 2020). We are open to collaborating with epidemiologists who could benefit from a computational approach to addressing the spread of communicable diseases.

Fig. 1 (caption): S is susceptible, E is exposed (virus in the body but not yet affecting the immune system), IS is infected (showing symptoms), IA is an asymptomatic carrier, D is dead, and R is recovered. Note that the numbers represent capture probabilities and not rates of change. All parameters can be tuned. We do not consider a transition from Recovered to Susceptible, but this can be added if found to be possible.

References:
Allamanis M, Scellato S, Mascolo C (2012) Evolution of a location-based online social network: analysis and models
Jung S et al (2020) Real-time estimation of the risk of death from novel coronavirus (COVID-19) infection: inference using exported cases
Khadilkar H, Ganu T, Seetharam DP (2020) Optimising lockdown policies for epidemic control using reinforcement learning
Mnih V et al (2015) Human-level control through deep reinforcement learning
Perez L, Dragicevic S (2009) An agent-based approach for modeling dynamics of contagious disease spread
Sutton RS, Barto AG (2012) Reinforcement learning, 2nd edn
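As a supplementary illustration of the weekly decision interface referenced in the main text, the sketch below shows how a Deep Q-Learning agent could map per-node state features to a binary open/lockdown action with epsilon-greedy exploration. Only the binary weekly action space and the use of Deep Q-Learning come from the article; the specific state features (node infection rate, overall infection rate, node population fraction, weeks elapsed), the network architecture, and the use of PyTorch are assumptions made here for illustration.

```python
# Illustrative sketch only: a per-node Q-network for the weekly open/lockdown
# decision. The feature set and architecture are assumptions, not the authors'.
import torch
import torch.nn as nn

class LockdownQNet(nn.Module):
    def __init__(self, n_features: int = 4, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),          # Q-values for [keep open, lock down]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

def weekly_decisions(qnet: LockdownQNet, node_states: torch.Tensor,
                     epsilon: float = 0.0) -> torch.Tensor:
    """Return a 0/1 lockdown decision for every node for the coming week.

    node_states: tensor of shape (num_nodes, n_features).
    epsilon: exploration rate; >0 during training, 0 during evaluation.
    """
    with torch.no_grad():
        q_values = qnet(node_states)               # (num_nodes, 2)
        greedy = q_values.argmax(dim=1)            # higher-Q action per node
    explore = torch.rand(len(node_states)) < epsilon
    random_actions = torch.randint(0, 2, (len(node_states),))
    return torch.where(explore, random_actions, greedy)

# Example: 100 nodes with 4 state features each, as in the randomly generated network.
states = torch.rand(100, 4)
actions = weekly_decisions(LockdownQNet(), states, epsilon=0.1)
```

During training, the chosen actions would be applied in the disease and network simulator and the resulting rewards used to update the Q-network; at deployment, epsilon would be set to zero so that decisions are purely greedy with respect to the learnt Q-values.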