key: cord-1047901-m2h53oq6
authors: Padmanabhan, Regina; Meskin, Nader; Khattab, Tamer; Shraim, Mujahed; Al-Hitmi, Mohammed
title: Reinforcement Learning-based Decision Support System for COVID-19
date: 2021-04-27
journal: Biomed Signal Process Control
DOI: 10.1016/j.bspc.2021.102676
sha: 5b696b6055a0f03301d6ca8a15bfc094613fe64c
doc_id: 1047901
cord_uid: m2h53oq6

Globally, informed decision-making on the most effective set of restrictions for the containment of COVID-19 has been the subject of intense debate. There is a significant need for a structured dynamic framework to model and evaluate different intervention scenarios and how they perform under different national characteristics and constraints. This work proposes a novel optimal decision support framework capable of incorporating different interventions to minimize the impact of widely spread respiratory infectious pandemics, including the recent COVID-19, by taking into account the pandemic's characteristics, the healthcare system parameters, and the socio-economic aspects of the community. The theoretical framework underpinning this work involves the use of a reinforcement learning-based agent to derive constrained optimal policies for tuning a closed-loop control model of the disease transmission dynamics.

Hence, it is quite imperative to consolidate the lessons learned from our experience with the current COVID-19 global pandemic towards building a resilient community, with people prepared to prevent, respond to, combat, and recover from the social, health, and economic impacts of pandemics. Preparedness is a key factor in mitigating pandemics. It encompasses inculcating awareness about outbreaks and fostering response strategies to avoid loss of life and socio-economic havoc. While the emergence of a harmful microorganism with pandemic potential may be unpreventable, pandemics can be prevented [4]. Preparedness includes technological readiness to identify pathogen identity, fostering drug discovery, and developing reliable theoretical models for the prediction, analysis, and control of pandemics. Lately, collaborative efforts among epidemiologists, microbiologists, geneticists, anthropologists, statisticians, and engineers have complemented research in epidemiology and have paved the way for improved epidemic detection and control [8], [9]. There exists an enormous body of studies concerning epidemiological models and the use of such theoretical models in deriving cost-effective decisions for the control of epidemics. Sliding mode control, tracking control, optimal control, and adaptive control methods have been applied to control the spread of malaria, influenza, zika virus, etc. [7], [10]-[12]. Optimal control methods are used to identify ideal intervention strategies for mitigating epidemics that account for the cost involved in implementing pharmaceutical or non-pharmaceutical interventions (PIs or NPIs). For instance, in [13], a globally-optimal vaccination strategy for a general epidemic model (susceptible-infected-recovered (SIR)) is derived using the Hamilton-Jacobi-Bellman (HJB) equation. It is pointed out that such solutions are not unique and a closer analysis is needed to derive cost-effective and physically realizable strategies. In [14], the hyperchaotic behavior of epidemic spread is analyzed using the SEIR (susceptible-exposed-infected-recovered) model by modeling nonlinear transmissibility.
Even though various optimization algorithms have been used to derive time-optimal and resource-optimal solutions for general epidemic models, only a few of these possibilities have been explored for COVID-19 in particular. The majority of the model-based studies for COVID-19 discuss various scenario analyses, such as the influence of isolation only, vaccination only, and combining isolation with vaccination on the overall disease transmission [15]-[19]. Even though several works focused on evaluating the influence of various control interventions on the mitigation of COVID-19, only very few works discuss the derivation of an active intervention strategy from a control-theoretic viewpoint. In [20], the authors discuss an SEIR model-based optimal control strategy to deploy strict public-health control measures until the availability of a vaccine for COVID-19. Simulation results show that the derived optimal solution is more effective compared to constant strict control measures and cyclic control measures. In [21], optimal and active closed-loop intervention policies are derived using a quadratic programming method to mitigate COVID-19 in the United States while accounting for death and hospitalization constraints.

In this paper, we propose the development and use of a reinforcement learning-based closed-loop control strategy as a decision support tool for mitigating COVID-19. Reinforcement learning (RL) is a category of machine learning that has proved promising in handling control problems that demand multi-stage decision support [22]. With the exponential advancement in computing methods, machine learning-based methods are becoming increasingly useful in many biomedical applications. For instance, RL-based controllers have been used to make intelligent decisions in the area of drug dosing for patients undergoing hemodialysis, sedation, and treatment for cancer or schizophrenia [22]-[27]. Similarly, machine-learning experts are contributing to the area of epidemic detection and control [9], [28], [29]. In [6], an RL-based method is used to make optimal decisions regarding the announcement of an anthrax outbreak. Data on the benefits of true alarms and the cost associated with false alarms are used to formulate and solve the problem of anthrax outbreak announcement in an RL framework. Decisions concerning the declaration of an outbreak are evaluated by defining six states: no outbreak, waiting day 1, waiting day 2, waiting day 3, waiting day 4, and outbreak detected. Using RL-based closed-loop control, at each stage, decisions can be revised according to the response of the system, which embodies a multitude of uncertainties. In the case of a mathematical model that represents COVID-19 disease transmission dynamics, uncertainties include system disturbances such as a sudden increase in exposure rate due to school reopening, reduced transmission due to increased compliance of people, or any other unmodeled system dynamics. The underlying strategy behind RL-based methods is the concept of learning an ideal policy from the agent's experience with the environment. Basically, the agent (actor) interacts with the system (environment) by applying a set of feasible control inputs and learns a favorable control policy based on the values attributed to each intervention-response pair.
The mathematical formulation of the optimal control problem in the RL framework allows it to be used as a tool for optimizing intervention policies. The focus of this paper is to present such a learning-based, model-free, closed-loop optimal and effective decision support tool for limiting the spread of COVID-19. We use a mathematical model that captures COVID-19 transmission dynamics in a population as a simulation model, instead of the real system, to collect the interaction data (intervention-response) required for training the RL-based controller. The main contributions of this work can be summarized as follows: (1) a novel disease spread model that accounts for the influence of NPIs on the overall disease transmission rate and on the specific infection rates during the asymptomatic and symptomatic periods, (2) the development of an RL-based closed-loop controller for mitigating COVID-19, and (3) the design of a reward function to account for cost and hospital saturation constraints.

The organization of this paper is as follows. In Section II, a mathematical model for COVID-19 and the development of an RL-based controller are presented. Simulation results for two case studies are given in Section III. The robustness of the controller with respect to various disturbances is also discussed in that section. Conclusions and the scope for future research are presented in Section IV.

The proposed approach incorporates the development of a decision support system that utilizes a Q-learning-based approach to derive optimal solutions with respect to certain predefined cost objectives. The main components of the RL framework include an environment (system or process) whose output signals need to be regulated and an RL-agent that explores the RL environment to gain knowledge about the system dynamics towards deriving an appropriate control strategy. A schematic of such a learning framework is shown in Figure 1, where the population dynamics pertaining to COVID-19 represent the RL environment, and control interventions represent the actions imposed by the RL-agent. In this paper, Watkins' Q-learning algorithm, which does not demand an accurate or complete system model, is used to train the RL-agent [27], [30]. The control objective is to derive an optimal control input that minimizes the infected population while minimizing the cost associated with interventions. The RL-based methodology provides a framework for an agent to interact with its environment and receive rewards based on observed states and actions taken. In the Q-table, the desirability of an action when in a particular system state is encoded in terms of a quantitative value calculated with respect to the reward incurred for an intervention-response pair. The goal of an RL-based agent is to learn the best sequence of actions that can maximize the expected sum of returns (rewards).

Fig. 1: Schematic representation of the reinforcement learning framework for COVID-19. This learning-based controller design is predicated on the observed data obtained as a response to an action imposed on the population. The response data y(k) include the number of infected, hospitalized, recovered, etc. The error is the difference between the observed number of severely infected and the desired number of severely infected (I_sd). Learning is facilitated based on the reward r_k incurred according to the state (s_k), action (a_k), and new state (s_{k+1}).
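To make the structure in Figure 1 concrete, the following is a minimal Python sketch (not the authors' implementation) of how the RL environment can wrap a COVID-19 simulator: the simulator itself (the model (1)-(10), not reproduced here) is passed in as a callable, and the environment exposes the sampled observation y(k) and the tracking error used later for state assignment. All names, default values, and the state-vector indexing are illustrative assumptions.

import numpy as np

class CovidControlEnv:
    """Illustrative RL-environment wrapper (Figure 1): population model in, observations out."""

    def __init__(self, simulate_fn, x0, I_sd, T=14):
        # simulate_fn(x, action, days) -> new state; a stand-in for integrating (1)-(10)
        self.simulate_fn = simulate_fn
        self.x = np.asarray(x0, dtype=float)    # 10-dimensional compartment state (assumed ordering)
        self.I_sd = I_sd                        # desired number of severely infected
        self.T = T                              # sampling period in days (14 in the paper)

    def step(self, action):
        # Apply one intervention combination for T days and return the sampled output y(k).
        self.x = self.simulate_fn(self.x, action, self.T)
        S, I_s = self.x[0], self.x[6]           # susceptibles and severely infected (assumed indices)
        e = I_s - self.I_sd                     # tracking error e(kT) = I_s(kT) - I_sd
        return S, I_s, e

The reward computation and state discretization sketched later in this section operate on the (S, I_s, e) triple returned by such a wrapper.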
Note that the RL-based controller design is model-free and does not rely on parameter knowledge of the system; instead, it utilizes the intervention-response observations from the environment. Specifically, the RL-based controller design discussed in this paper requires information on the number of susceptible and severely infected cases. As mentioned earlier, instead of the real system, we use a simulation model to obtain the intervention-response data needed to train the RL-agent. The model is given by the compartmental system (1)-(10) of [20], with I_a(t) = I_am(t) + I_as(t) and I_a(0) = I_a0, where S(t) denotes the number of susceptibles, E_m(t) and I_m(t) denote the number of exposed and mildly infected symptomatic patients, respectively, R_m(t) is the number of patients recovered from mild infection, E_s(t) and I_s(t) denote the number of exposed and severely infected symptomatic patients, I_am(t) and I_as(t) denote asymptomatic patients who later move to the mildly and severely infected compartments, respectively, and D(t) is the total number of direct and indirect deaths due to COVID-19 [20]. Out of the total number of exposed, a larger proportion (E_m(t) > 80% of E(t)) develops mild infection and the rest (E_s(t)) develop severe infection after a delay. The intervention-response data required for training the RL-agent are derived using the mathematical model (1)-(10). Figure 2 shows the corresponding compartmental representation of the model (1)-(10) of COVID-19, which accounts for differential disease severity and the import of exposed cases into the population [20]; the state vector and parameters are listed in Table I.

The transmission parameter β(t) in (1)-(10) is given by (16). The obvious increase in the disease exposure of the population in the susceptible compartment following an increase in the numbers I_am(t), I_as(t), I_m(t), and I_s(t) is modeled in (16), where γ_A and γ_I are the rates at which the population with asymptomatic and symptomatic disease manifestation infect the susceptible population, respectively, u_i(t), i = 1, 2, 3, account for the influence of various control interventions on the transmission rate of the virus, and m is a modification parameter used to model the reduced transmission rate of the severely sick population, as they will be moved to hospital and hence be under strict isolation. Specifically, u_1(t) accounts for the impact of travel restrictions on the overall mobility and interactions of the population in the various infected compartments, and u_2(t) accounts for the efforts to reduce the infection rate γ_A (during the asymptomatic period). Asymptomatic patients often remain undetected, and hence awareness campaigns to increase the compliance of people can reduce the chance of infection spread during the asymptomatic period. Specific efforts to reduce the infection rate γ_I (during the symptomatic period) are accounted for by u_3(t). This includes hospitalization of the severely infected (I_s(t)) and isolation/quarantine of the mildly infected (I_m(t)), which reduce the chance of infection spread during the symptomatic period. The viability of each of the control inputs u_i(t), i = 1, 2, 3, in controlling the overall transmission rate β(t) is different: an increase in u_1(t) results in an overall reduction in β(t) (e.g., a lockdown or travel ban influences the interaction rate among I_am(t), I_as(t), I_m(t), and I_s(t)), whereas an increase in u_2(t) (e.g., increased hygiene habits due to awareness) or u_3(t) (e.g.,
strict exposure control measures and biohazard handling protocols at healthcare facilities) reduces the disease transmission through I_a(t) or I(t), respectively. It should be noted that, apart from deaths due to COVID-19, there can be indirect fatalities due to the overwhelming of hospitals and the allocation of hospital resources to the management of the pandemic. The indirect fatalities account for the deaths of patients due to the unavailability of medical attention or the inaccessibility of hospitals. In (18), the death rate indirectly related to COVID-19 is denoted as (µ ); it is set to zero if the active number of the severely infected population is below the hospital capacity (H) and is set to µ_H whenever hospitals are saturated, where µ_H models the increase in the mortality rate due to inaccessibility of hospitals. Similarly, the direct death rate due to COVID-19 (µ) can also increase significantly when hospitals saturate; hence µ_max is set to double when I_s(t) ≥ H [20].

From a control-theoretic viewpoint, the model (1)-(21) can be written in the compact state-space form (22), where x(t) ∈ R^10 is the state vector that models the dynamics of the compartments shown in Figure 2, u(t) ∈ R^3 is the control input, and y(t) ∈ R^2 is the output (observations) of the system. Similarly, in the finite Markov decision process (MDP) framework, the system (environment) dynamics are modeled in terms of the finite sequences S, A, R, and P, where S is a finite set of states, A is a finite set of actions defined for the states s_k ∈ S, R represents the reward function that guides the agent in accordance with the desirability of an action a_k ∈ A, and P is a state transition probability matrix. The state transition probability matrix P_{a_k}(s_k, s_{k+1}) gives the probability that an action a_k ∈ A takes the state s_k ∈ S to the state s_{k+1} in a finite time step. Furthermore, the discrete states in the finite sequence S are represented as (S_i)_{i∈I+}, where I+ ≜ {1, 2, . . . , q} and q denotes the total number of states. Likewise, the discrete actions in the finite sequence A are represented as (A_j)_{j∈J+}, where J+ ≜ {1, 2, . . . , q'} and q' denotes the total number of actions. The transition probability matrix P can be formulated based on the system dynamics (22). Note that, since the Q-learning framework does not require P for deriving the optimal control policy, we assume P is unknown [24], [27].

In the case of epidemic control, the goal is to derive an optimal control sequence to take the system from a nonzero initial state to a desired low infectious state. This problem of deriving an action sequence for bringing down the number of infected people requires multi-stage decision making based on the response of the population to various kinds of control interventions. Note that changes in the overall population dynamics in response to interventions depend upon how far people comply with the restrictions imposed by the government. As shown in Figure 1, this can be achieved by using the RL algorithm defined/built on the MDP framework by iteratively evaluating the action-response sequences observed from the system [31], [32]. The RL-based learning phase starts with an initial arbitrary policy, for instance with a Q-table with zero entries. The Q-table is a mapping from states s_k ∈ S to a predefined set of interventions a_k ∈ A [32]. Each entry of the Q-table (Q_k(s_k, a_k)) associates an action in the finite sequence (A_j)_{j∈J+} with a state of the finite sequence (S_i)_{i∈I+}.
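As an illustration of the Q-table just described, the short Python sketch below stores the values Q_k(S_i, A_j) for q = 20 discrete states and q' = 20 discrete actions, the sizes used in the case studies later in this paper; the converged table is read out greedily, one recommended action per state. This is a sketch of the data structure only, not the authors' code.

import numpy as np

q_states, q_actions = 20, 20            # q and q' as used in the case studies
Q = np.zeros((q_states, q_actions))     # initial arbitrary policy: all-zero Q-table

def greedy_policy(Q):
    # For each discrete state S_i, return the index j of the action A_j with the
    # largest Q-value, i.e. the "if in state s_k, take the ideal action a_k" rule.
    return np.argmax(Q, axis=1)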
In the case of epidemic control, a policy represents a series of interventions that have to be imposed on the population to shift the initial status of the environment to a targeted status, which is equivalent to a desired set of system states. With respect to a learned Q-table, a policy is a sequence of decisions embedded as values in the Q-table, corresponding to decisions such as "if in state s_k, take the ideal action a_k ∈ A". As shown in Figure 1, during the training phase, the agent imposes control actions (a_k) on the RL environment, and as the agent gains more and more experience (observations) from the environment, the initial arbitrary intervention policy is iteratively updated towards an optimal intervention policy. One of the key factors that helps the agent to assess the desirability of an action and guides it towards the optimal intervention policy is the reward function. The reward function associates an action a_k with a numerical value r_{k+1} ∈ R (reward) with respect to the state transition s_k → s_{k+1} of the environment in response to that action. The reward incurred depends on the ability of the last action to move the system states towards the target state or goal state (G_s). The reward can be negative or positive for inappropriate or appropriate actions, respectively. An optimal intervention policy is derived by maximizing the expected value (E[·]) of the discounted reward (r_k) that the agent receives over an infinite horizon,

E[ ∑_{k=0}^{∞} θ^k r_{k+1} ],

where the discount rate parameter θ ∈ [0, 1] represents the relative importance of immediate and future rewards. With a value of θ = 0, the agent considers only the immediate reward, whereas for θ approaching 1 it considers both immediate and future rewards. Based on the experience gained by the agent at each time step k = 1, 2, . . ., the Q-table is updated iteratively as

Q_{k+1}(s_k, a_k) = Q_k(s_k, a_k) + η_k(s_k, a_k) [ r_{k+1} + θ max_{a∈A} Q_k(s_{k+1}, a) − Q_k(s_k, a_k) ],   (24)

where η_k(s_k, a_k) is the learning rate and δ is used to specify the minimum threshold of convergence [30], [32], [33]. As shown in Figure 1, learning is facilitated based on the reward (r_k) incurred according to the state (s_k), action (a_k), and new state (s_{k+1}).

The control interventions (actions) imposed on the population basically reduce the disease transmission rate, as depicted in (16). As a vaccine for COVID-19 is not approved yet, the control measures against this disease broadly rely on two major factors, namely, I) non-pharmaceutical interventions (NPIs) such as restrictions on social gatherings, closure of institutions, and isolation; and II) available pharmaceutical interventions (PIs) such as hospital care with supporting medicines and equipment such as ventilators. Constraints in the healthcare system, such as the number of medical personnel, intensive care beds, COVID-19 testing capacity, COVID-19 isolation and quarantine capacity, dedicated hospitals, and ventilators, as well as the compliance of the society with the interventions, are the major challenges for the healthcare system. The choice of the reward function is critical in guiding the RL-agent towards an optimal intervention policy that will drive the population dynamics to a desired low infectious state while minimizing the socio-economic cost involved. Hence, the reward r_{k+1} is designed to incorporate the influence of three factors: 1) r^1_{k+1} is used to penalize the agent if I_s(t) exceeds the hospital saturation capacity H; 2) r^2_{k+1} is used to assign a proportional reward to the RL-agent's actions that reduce I_s(t); and
3) r^3_{k+1} is used to reward/penalize the agent according to the cost associated with the implementation of the various control interventions. The reward r_{k+1} in (24) is calculated using (25)-(27), where e(kT) = I_s(kT) − I_sd, I_sd is the desired value of I_s(t), kT ≤ t < (k+1)T, and c_{a_k} is the cost associated with each action set. In (27), very low cost, low cost, medium cost, and high cost actions represent predefined combinations of actions that are associated with a range of cost, such as 0-30%, 20-50%, 30-70%, and 30-90%, respectively (see Table III). The total reward is given by (28), where β_w is used to weigh the cost of interventions relative to the infection spread.

The RL-based controller design is predicated on the intervention-response observations obtained during the interaction of the RL-agent with the RL-environment (real or simulated system). The states s_k of the population dynamics are defined in terms of the observable output [24]. In the case of COVID-19, it is widely agreed that the currently reported number of cases actually corresponds to the cases 10-14 days back. This delay is due to the virus incubation time and the delays involved in diagnosis and reporting [21]. The influence of such delays is reflected in the intervention-response curves as well. Hence, for training the RL-agent using the Q-learning algorithm, for each action a_k imposed on the system, the system states (s_k) are assessed using s_k = e(t) = I_s(t) − I_sd, kT ≤ t < (k+1)T, where T = 14 days. Specifically, as the sampling time T is set to 14 days, the reward r_{k+1} reflects the response of the system to an action a_k imposed on the system 14 days ago. As mentioned earlier, the Q-learning algorithm starts with an arbitrary Q-table, and based on the information on the current state (s_k), action (a_k), new state (s_{k+1}), and reward (r_{k+1}), the Q-table is updated using (24). See Tables II and III. In each episode, the system states are initialized at a random initial state s_k, and the RL-agent imparts control actions to the system to calculate the reward incurred and to update the Q-table until s_k = G_s is reached. Starting from the initial Q-table, the agent assesses the current state s_k of the system and imparts an action a_k by following the ε-greedy policy, where ε is a small positive number [24], [27], [32]. Specifically, at every time step, the RL-agent chooses random actions with probability ε and ideal actions otherwise (with probability 1 − ε) [32]. After convergence of the Q-table, the RL-agent chooses the action a_k as a_k = A_j, j = arg max_{j∈J+} Q_k(s_k, A_j), as given in (29). As the RL-based learning is predicated on the quantity and quality of the experience gained by the agent from the environment, the more it explores the environment, the more it learns. To learn an optimal policy, the RL-agent is expected to explore the entire RL-environment a sufficient number of times, ideally an infinite number of times. However, in most cases, convergence is achieved with an acceptable tolerance δ satisfying ∆Q_k ≤ δ for some finite number of episodes, provided the learning rate η_k(s_k, a_k) is reduced as the learning progresses [24], [27], [32].

The disease transmission dynamics in Qatar (Case 2) is simulated using the model parameter values given in [35] and [36]. Some of the parameter values for Case 2 are set based on the data available online [37]-[40]. Two different RL-agents are obtained, one for each of the two cases, using MATLAB®. Figure 3 shows the schematic diagram of the RL-based closed-loop control of COVID-19.
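The training procedure described above can be summarized in the following Python sketch; it is an illustration under stated assumptions rather than the authors' MATLAB implementation. The environment step and goal-state test are hypothetical callables, the reward is a simplified stand-in for (25)-(28) (saturation penalty, error-reduction reward, and a cost term weighted by β_w), and the learning-rate schedule, episode-length cap, and numeric constants (θ, ε) are assumed values.

import numpy as np

def simple_reward(I_s, e_prev, e_new, action_cost, H, beta_w=0.5):
    # Simplified stand-in for (25)-(28): penalize hospital saturation, reward a
    # reduction of the error e(kT), and penalize costly intervention sets.
    r1 = -1.0 if I_s >= H else 0.0
    r2 = 1.0 if e_new < e_prev else -1.0
    r3 = -action_cost                      # c_{a_k} as in Table III (assumed scale)
    return r1 + r2 + beta_w * r3

def train_q_table(env_step, is_goal, n_states=20, n_actions=20,
                  episodes=20000, theta=0.9, eps=0.1, seed=0):
    # env_step(s, a) -> (s_next, r): hypothetical wrapper that simulates T = 14 days
    # and returns the next discrete state and the reward; is_goal(s) tests s = G_s.
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for episode in range(episodes):                      # 20,000 scenarios, as in the paper
        eta = 1.0 / (1.0 + episode / 1000.0)             # decaying learning rate (assumed schedule)
        s = int(rng.integers(n_states))                  # random initial state
        for _ in range(500):                             # cap on episode length (assumed)
            if is_goal(s):
                break
            # epsilon-greedy exploration
            a = int(rng.integers(n_actions)) if rng.random() < eps else int(np.argmax(Q[s]))
            s_next, r = env_step(s, a)
            # Watkins Q-learning update, cf. (24)
            Q[s, a] += eta * (r + theta * np.max(Q[s_next]) - Q[s, a])
            s = s_next
    return Q

A convergence check on ∆Q_k ≤ δ could be added to terminate training early; it is omitted here for brevity.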
In the RL-based closed-loop set-up, the RL-agent is capable of deriving the optimal intervention policy to drive the system from any state s_k ∈ S, (S_i)_{i∈I+}, to the goal state (G_s) based on the converged optimal Q-table. Specifically, the agent assesses the current state s_k of the system and then imparts the action a_k ∈ A, (A_j)_{j∈J+}, J+ ≜ {1, 2, . . . , q'}, q' = 20, which corresponds to the maximum value in the Q-table as determined using (29). For training the RL-agent, the parameter β_w in the reward function (28) is set to β_w = 0.5. The choice between β_w = 0.5 and a higher value (e.g., β_w = 1) depends on the resource availability and cost affordability of the community. Compared to β_w = 0.5, the agent is penalized with a higher negative value when β_w = 1 is used. Hence, with β_w = 1, the agent tends to avoid actions in the high-cost set and opts only for low-cost inputs. For training the RL-agent, we iterated over 20,000 (arbitrarily high) scenarios, where a scenario represents the series of transitions from an arbitrary initial state to the required terminal state G_s. Furthermore, the initial assignments used in the learning set-up are indicated in Figure 3, and Table IV summarizes the parameters used in the Q-learning algorithm.

The response of the model with no intervention for Case 1 is shown in Figure 4. It can be seen from Figure 4 that the number of severely ill patients (I_s(t)) who need hospitalization peaks at 1.104 × 10^6 on the 210th day of the epidemic. Also note that from the 98th day to the 336th day, the number of severely infected is above the hospital capacity (H = 1.2 × 10^4), which has led to increased deaths due to COVID-19 (1056 on the 98th day, increasing to 1.55 × 10^6 on the 336th day). Similarly, indirect deaths due to COVID-19 increase (from 0 on the 98th day to 1.58 × 10^5 on the 336th day) due to hospital saturation. As given in (10), the state trajectory of D(t) in Figure 4 shows the total number of deaths due to the direct and indirect impact of COVID-19.

Table V: Initial conditions used for the simulations.
Parameter      | Case 1        | Case 2
N_0            | 67 × 10^6     | 2,881,053
I_0            | 0.01H         | 1
S_0            | N_0 − I_0     | N_0 − I_0
I_m0           | pI_0          | pI_0
I_s0           | (1 − p)I_0    | (1 − p)I_0
E_m0, E_s0     | 0, 0          | 3, 0
I_am0, I_as0   | 0, 0          | 0, 0
R_m0, R_s0     | 0, 0          | 0, 0
D_0            | 0             | 0

Note that the number of susceptibles (S(t)) reduces monotonically over time due to the increased movement of people to the exposed or infected compartments (Figure 4). Similarly, the number of people in the recovery compartments and the death compartment increases monotonically, as these are terminal compartments. However, in other compartments, including the severely infected (I_s(t)), the number initially increases and then decreases. Hence, the value of e(t), kT ≤ t < (k+1)T, can be in the same range during the initial and final phases of the trajectory (Figure 4). However, the status quo of the system in these two phases is different, as reflected in the trajectory of the susceptible population. Hence, different state assignments are necessary in these two phases for the RL-agent to differentiate between regions with similar e(t) values but different S(t) values. Accordingly, we assign states S_i, i = 1, . . . , 10, for S(t) > 3 × 10^7 and S_i, i = 11, . . . , 20, otherwise. See Table II for the state assignments based on the values of e(kT) and S(t) used for Case 1. The goal state for this case is set as G_s ∈ (S_i)_{i∈I+}, i = 1, which corresponds to the case where e(kT) ∈ [0, 100] and S(t) > 3 × 10^7.
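The state assignment just described can be written as a small lookup; the susceptible-population threshold (3 × 10^7) and the goal bin e(kT) ∈ [0, 100] are taken from the text, while the remaining error bin edges below are illustrative placeholders, since Table II is not reproduced here.

import numpy as np

# Assumed error-bin edges standing in for Table II; only the first bin, [0, 100], is from the text.
ERROR_BIN_EDGES = [100, 1e3, 1e4, 5e4, 1e5, 5e5, 1e6, 5e6, 1e7, np.inf]

def state_index(e, S, S_threshold=3e7):
    # Map the error e(kT) and susceptible count S(t) to a discrete state in 1..20:
    # states 1-10 when S(t) > S_threshold, states 11-20 otherwise (Case 1 assignment).
    b = int(np.searchsorted(ERROR_BIN_EDGES, max(e, 0.0)))   # bin index 0..9
    return b + 1 if S > S_threshold else b + 11

With this assignment, the goal state G_s corresponds to state_index returning 1, i.e. e(kT) ∈ [0, 100] with S(t) above the threshold.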
In the closed-loop scenario with I_s0 ≥ H, the peak value of I_s(t) is slightly higher because the initial condition itself was 5.55 × 10^5 and a fraction of the initially high number of people in the exposed (E_s0) and asymptomatic infected (I_as0) compartments also moves to the severely infected compartment. Note that the peak value of I_s(t) represents the number of active cases at a time point, not the total number of infected. The total number of infected is reduced to 4.74 × 10^7 compared to the value of 5.97 × 10^7 in the case of no intervention. For the case with I_s0 < H, the peak value of I_s(t) is reduced to 1.19 × 10^4 from a value of 1.1 × 10^6 for no intervention, and the total number of infected is reduced to 5 × 10^5 compared to the value of 5.97 × 10^7 in the case of no intervention. Comparing the control inputs for the cases I_s0 < H and I_s0 ≥ H, it can be seen that the control input for the latter case (Figure 7) is more cost-effective. However, in the case corresponding to Figure 9, the control input does not come down to zero, as the number of susceptibles in the compartment remains very high, since only 5 × 10^5 people are infected. In this case, as there are imported infected cases and many unreported cases in the community, the number of cases will increase once the restrictions are relaxed. These results are in line with the effective control suggestions for earlier pandemics. In the case of an earlier influenza pandemic, studies suggested that controlling the epidemic at the predicted peak is most effective [42]. Closing too early results in the reappearance of cases once restrictions are lifted and requires restrictions for a longer time period. Note that the reward function (25)-(27) is designed to train the controller (RL-agent) to choose control inputs that minimize the total number of severely infected and to penalize the use of high-cost control inputs (see Table III). Designing a reward function that penalizes the RL-agent for variations in the control input and that can account for various delays in the system is an interesting extension of the current framework. Considering the incubation time and the delay in reporting (10-14 days), the observable output y(t), s_k = g(y(t)), kT ≤ t < (k+1)T, k = 1, 2, . . ., is sampled every 14th day (T = 14). To investigate the closed-loop performance of the RL-agent, we tested the RL-based controller for various sampling periods. As shown in Table VIII, for different values of T, the RL-based controller is able to bring down the number of severely infected to 675 ± 22 cases by the 100th day. The corresponding closed-loop results are summarized in Tables VII and VIII and Figures 10 and 11.

Case 2: In this case, the COVID-19 disease transmission data of Qatar is used to conduct various scenario analyses. Comparatively, the population of Qatar (2.88 × 10^6) is far less than that of Case 1 (6.7 × 10^7). Figure 12 shows the number of infected cases reported per day in Qatar from 29th February to 22nd October. The first case (I_0 = 1) was that of a 36-year-old male who traveled to Qatar during the repatriation of Qatari nationals stranded in Iran. Table V shows the initial conditions used for our simulations, and the value of E_m0 is set to 3 [36]. The majority of the population in Qatar are young expatriates, and hence the value of R_0, the severity of the disease, and the mortality rate associated with COVID-19 in Qatar are estimated to be lower than in many other countries [36], [40], [41]. In [41], it is reported that the case fatality rate in Qatar is 1.4 out of 1000; hence µ_min = 0.0014 is used for Case 2.
Active disease mitigation policies of the government and the appropriate public health response of a well-resourced population have also played a key role in bringing down the total number of COVID-19 infections and associated deaths in Qatar [41]. The various restriction and relaxation phases implemented in Qatar are marked in Figure 12 as 1-8 and are summarized in Table IX. Note that the number of severely infected (active acute cases + active ICU cases) is above 100 cases as of October 22nd (see Table XI). The goal state for this case corresponds to e(kT) ∈ [0, 100] and S(t) > 1.2 × 10^6.

One of the important concerns pertaining to COVID-19 is the possibility of hospital saturation, which will lead to increased indirect deaths due to COVID-19. The Qatar government responded rapidly to the need for increased hospital capacity. Apart from arranging 37,000 isolation beds and 12,500 quarantine beds, the government has set up 3000 acute care beds and 700 intensive care beds [38], [43]. Hence, the hospital saturation capacity H, which is related to the severely sick, is set to 3500 in (25) while training the RL-agent. The action set a_k ∈ A, (A_j)_{j∈J+}, and the cost assignments c_{a_k} for assessing the reward (27) are given in Table III. Figure 14 shows the convergence of the Q-table for Case 2. Note that, with an appropriate public health response and a relatively young expat population with a lower risk of severe COVID-19 illness, Qatar never had severely infected cases above H. However, as shown in Figure 13, the scenario I_s(t) ≥ H does occur with no intervention.

Figures 17 and 18 show the closed-loop performance of the controller with the initial condition vector [. . . , 1000, 4991, 19965, 26750, 350, 1493, 240, 6687, 40]^T. This set of initial conditions is from the COVID-19 data of Qatar on June 1st, and it corresponds to the scenario I_s0 < H with I_s0 = 240. As shown in Figure 17, by 600 days from June 1st, the direct and indirect deaths are 202 and 0, respectively. As given in Table XI, on October 22nd, the total number of infected and deaths with the government intervention is 1.30 × 10^5 and 228, whereas with the RL-based control it is 1.01 × 10^5 and 121. Note that October 22nd corresponds to the 144th day in Figure 17. With the RL-based control, the number of susceptibles is more than 2.72 × 10^6 (> 94%) throughout. Since a very low percentage of the total population is infected, the likelihood of seeing secondary waves when control is lifted is very high.

Beyond the scenarios shown in Figures 17 and 18, sections of the population that are not in compliance with the COVID-19 mitigation protocols can considerably increase the transmission rate β(t). The import of infected cases through international airports can also increase the infection rate in society. Such changes can be modeled as a disturbance that contributes to a sudden change in the value of β(t). Qatar is a country with considerable international traffic, and on average the Doha airport was handling 100,000 passengers per day before the pandemic [44]. However, due to COVID-19 restrictions, only around 20% of the regular traffic is expected to arrive in Qatar. Out of these passengers, a small percentage can be infected despite the strict screening strategies, including the testing and quarantining protocols currently followed. Hence, a per-day import of 5 infected cases (ρ = 5) is used for the nominal model for Case 2. However, completely lifting travel restrictions can increase the number of imported infected cases.
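Such a change in the number of imported cases can be represented as a time-varying import rate ρ(t). The following is a minimal Python sketch of the profile used in the disturbance scenario discussed next (nominal 5 imported cases per day, raised to 500 per day for four weeks starting on the 150th simulation day); the function name and interface are illustrative, not the authors' code.

def import_rate(day, rho_nominal=5.0, rho_disturbed=500.0, start_day=150, duration_days=28):
    # Imported infected cases per day: nominal level, with a four-week surge that
    # models completely lifting travel restrictions.
    if start_day <= day < start_day + duration_days:
        return rho_disturbed
    return rho_nominal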
Figure 19 shows the performance of the RL-based closed-loop controller when a disturbance in the form of an increase in ρ is introduced to the system. For this scenario, the initial condition corresponds to the COVID-19 infection data in Qatar on October 22nd. Starting from October 22nd, a disturbance of ρ = 500 (days^-1) is applied on the 150th day and maintained for 4 weeks. This disturbance models a scenario wherein 500 infected cases are imported per day due to relaxing all restrictions on international travel. It can be seen from Figure 19 that the control input is increased during the time of the disturbance to limit the total number of infected and deaths to 211,053 and 352, respectively. Also, note that the import of a smaller number (< 100) of infected cases does not significantly influence the dynamics of COVID-19 in the society. The results of this simulation study imply that it is imperative to limit the number of imported cases to below 100 per day, by implementing testing and screening strategies as is done currently, until the number of cases is reduced worldwide or a protective vaccine is available.

In general, the simulation results for Case 1 and Case 2 show that, even though the relaxation of control measures can be started when the peak declines, complete relaxation is advised only if the number of active cases falls below 100 and a significant proportion of the total population has been infected (Figure 7). If the total number of active cases is above 100 and/or the number of susceptibles is significantly high, it is recommended to exercise 50% control on the overall interactions of the infected (detected and undetected), which includes maintaining social distancing, sanitizing contaminated surfaces, and isolating detected cases. International travel can be allowed by following COVID-19 protocols and continuing the screening and testing of passengers to keep the number of imported cases to a minimum.

In this paper, we have demonstrated the use of an RL-based learning framework for the closed-loop control of an epidemiological system, given a set of infectious disease characteristics in a society with certain socio-economic and healthcare characteristics and constraints. Simulation results show that the RL-based controller can achieve the desired goal state with acceptable performance in the presence of disturbances. Incorporating real-time regression models to update the parameters of the simulation model to match the real-time disease transmission dynamics can be a useful extension of this work.
Control of malaria outbreak using a non-linear robust strategy with adaptive gains
Nonlinear robust adaptive sliding mode control of influenza epidemic in the presence of uncertainty
Optimal control of intervention strategies and cost effectiveness analysis for a zika virus model
Anticipating emerging infectious disease epidemics
Evolving epidemiology of nipah virus infection in bangladesh: Evidence from outbreaks during
Optimizing anthrax outbreak detection using reinforcement learning
Mathematical and computational approaches to epidemic modeling: A comprehensive review
Infecting epidemiology with genetics: A new frontier in disease ecology
Big data and machine learning in critical care: Opportunities for collaborative research
A review of the use of optimal control in social models
Robust sliding control of SEIR epidemic models
Optimal control and cost-effectiveness analysis of a zika virus infection model with comprehensive interventions
Globally optimal vaccination policies in the SIR model: Smoothness of the value function and uniqueness of the optimal strategies
Analysis and control of an SEIR epidemic system with nonlinear transmission rate
Epidemiological impact of SARS-CoV-2 vaccination: Mathematical modeling analyses
Feasibility of controlling COVID-19 outbreaks by isolation of cases and contacts
Estimation of the transmission risk of the 2019-nCoV and its implication for public health interventions
An updated estimation of the risk of transmission of the novel coronavirus (2019-nCov)
Mathematical modeling and simulation of the COVID-19 pandemic
Optimal COVID-19 epidemic control until vaccine deployment
Safety-critical control of active interventions for COVID-19 mitigation
Informing sequential clinical decision-making through reinforcement learning: An empirical study
A reinforcement learning approach for individualizing erythropoietin dosages in hemodialysis patients
Reinforcement learning-based control of drug dosing for cancer chemotherapy treatment
Reinforcement learning strategies for clinical trials in non-small cell lung cancer
Reinforcement learning-based control of tumor growth under anti-angiogenic therapy
Closed-loop control of anesthesia and mean arterial pressure using reinforcement learning
Machine learning for healthcare: On the verge of a major shift in healthcare epidemiology
Machine learning in social epidemiology: learning from experience
Q-learning
Reinforcement Learning: An Introduction
Neuronlike adaptive elements that can solve difficult learning control problems
COVID-19: SEIRD model for Qatar COVID-19 outbreak
Epidemic analysis of COVID-19 in Egypt, Qatar and Saudi Arabia using the generalized SEIR model
Qatar open data portal
Coronavirus disease (COVID-19)
COVID-19 pandemic in Qatar
Births and deaths in state of Qatar
Characterizing the Qatar advanced-phase SARS-CoV-2 epidemic
A modeling study of school closure to reduce influenza transmission: A case study of an influenza A (H1N1) outbreak in a private Thai school
Qatar's response to COVID-19 pandemic
Qatar civil aviation authority, open data, air transport data

Author Contribution Statement: Conceptualization, Nader Meskin and Tamer Khattab; writing and original draft preparation, Regina Padmanabhan