key: cord-0621555-z4cagypm authors: Libin, Pieter; Moonens, Arno; Verstraeten, Timothy; Perez-Sanjines, Fabian; Hens, Niel; Lemey, Philippe; Now'e, Ann title: Deep reinforcement learning for large-scale epidemic control date: 2020-03-30 journal: nan DOI: nan sha: 91f09ff4b588dcd9e5f27cb5c8ce18481b8fcced doc_id: 621555 cord_uid: z4cagypm Epidemics of infectious diseases are an important threat to public health and global economies. Yet, the development of prevention strategies remains a challenging process, as epidemics are non-linear and complex processes. For this reason, we investigate a deep reinforcement learning approach to automatically learn prevention strategies in the context of pandemic influenza. Firstly, we construct a new epidemiological meta-population model, with 379 patches (one for each administrative district in Great Britain), that adequately captures the infection process of pandemic influenza. Our model balances complexity and computational efficiency such that the use of reinforcement learning techniques becomes attainable. Secondly, we set up a ground truth such that we can evaluate the performance of the 'Proximal Policy Optimization' algorithm to learn in a single district of this epidemiological model. Finally, we consider a large-scale problem, by conducting an experiment where we aim to learn a joint policy to control the districts in a community of 11 tightly coupled districts, for which no ground truth can be established. This experiment shows that deep reinforcement learning can be used to learn mitigation policies in complex epidemiological models with a large state space. Moreover, through this experiment, we demonstrate that there can be an advantage to consider collaboration between districts when designing prevention strategies. Epidemics of infectious diseases are an important threat to public health and global economies. The most efficient way to combat epidemics is through prevention. To develop prevention strategies and to implement them as efficiently as possible, a good understanding of the complex dynamics that underlie these epidemics is essential. To properly understand these dynamics, and to study emergency scenarios, epidemiological models are necessary. Such models enable us to make predictions and to study the effect of prevention strategies in simulation. The development of prevention strategies, which need to fulfil distinct criteria (i.a., prevalence, mortality, morbidity, cost), remains a challenging process. For this reason, we investigate a deep reinforcement learning (RL) approach to automatically learn prevention strategies in an epidemiological model. The use of model-free deep reinforcement learning is particularly interesting, as it allows us to set up a learning environment in a complex epidemiological setting (i.e., large state space and non-linear dependencies) while imposing few assumptions on the policies to be learned. In this work, we conduct our experiments in the context of pandemic influenza, where we aim to learn optimal school closure policies to mitigate the epidemic. Pandemic preparedness is important, as influenza pandemics have made many victims in the (recent) past [47] and the ongoing COVID-19 epidemic is yet another reminder of this fact [65] . Contrary to seasonal influenza epidemics, an influenza pandemic is caused by a newly emerging virus strain that can become pandemic by spreading rapidly among naive human hosts (i.e., human hosts with no prior immunity) worldwide [47] . This means that at the start of the pandemic no vaccine will be available and it will take several months before vaccine production can commence [59] . For this reason, learning optimal strategies of non-therapeutic intervention measures, such as school closure policies, is of great importance to mitigate pandemics [39] . To meet this objective, we consider a reinforcement learning approach. However, as the state-of-the-art of reinforcement learning techniques require many interactions with the environment in order to converge, our first contribution entails a realistic epidemiological model that still has a favourable computational performance. Specifically, we construct a meta-population model that consists of a set of 379 interconnected patches, where each patch corresponds to an administrative region in Great Britain and is internally represented by an age-structured stochastic compartmental model. To conduct our experiments, we establish a Markov decision process with a state space that directly corresponds to our epidemiological model, an action space that allows us to open and close schools on a weekly basis, a transition function that follows the epidemiological model's dynamics, and a reward function that is targeted to the objective of reducing the attack rate (i.e., the proportion of the population that was infected). In this work, we will use "Proximal Policy Optimization" (PPO) [51] to learn the school closure policies. First, we set up an experiment in an epidemiological model that covers a single administrative district. This setting enables us to specify a ground truth that allows us to empirically assess the performance of the policies learned by PPO. In this analysis, we consider different values for the basic reproductive number R 0 (Definition 1) and the population composition (i.e., proportion of adults, children, elderly, adolescents) of the district. Both parameters induce a significant change of the epidemic model's dynamics. Definition 1 (Basic reproductive number). The basic reproductive number, R 0 , is the number of infections that is, on average, generated by one single infected individual that is placed in an otherwise fully susceptible population. Through these experiments, we demonstrate the potential of deep reinforcement learning algorithms to learn policies in the context of complex epidemiological models, opening the prospect to learn in even more complex stochastic models with large action spaces. In this regard, we consider a large scale setting where we examine whether there is an advantage to consider the collaboration between districts when designing school closure policies. We conduct an experiment in our epidemiological model with 379 districts and attempt to learn a joint policy to control the districts in the Cornwall-Devon community, a set of 11 tightly coupled districts. To this end, we assign an agent to each of the 11 districts of the Cornwall-Devon community and use a reinforcement learning approach to learn a joint policy. We compare this joint policy to a non-collaborative policy (i.e., aggregated independent learners). The closing of schools is an effective way to limit the spread of an influenza pandemic [39] . For this reason, the use of school closures as a mitigation strategy has been explored in variety of modelling studies [26, 11, 43, 14, 12, 22, 27, 25, 17] , of which the work presented in [22] is the most recent and comprehensive study. The concept to learn dynamic policies by formulating the decision problem as a Markov decision process (MDP) was first introduced in [60] . The proposed technique was used to investigate dynamic tuberculosis case-finding policies in HIV/tuberculosis co-epidemics [61] . Later, the technique was extended towards a method to include cost-effectiveness in the analysis [62] , and applied to investigate mitigation policies (i.e., school closures and vaccines) in the context of pandemic influenza in a simplified epidemiological model. The work presented in [60, 62] uses a policy iteration algorithm to solve the MDP. To scale this approach to larger problem settings, we explore the use of on-line reinforcement learning techniques (e.g., TD-learning, policy gradient). Note that the "Deep Q-networks" algorithm was recently used to investigate culling and vaccination in farms in a simple individual-based model to delay the spread of viruses in a cattle population [48] . However, to our best knowledge, the work presented in this manuscript is the first attempt to use deep reinforcement learning algorithms directly on a complex meta-population model. Furthermore, we experimentally validate the performance of these algorithms using a ground truth, in a variety of model settings (i.e., different census compositions and different R 0 's). This is the first validation of this kind and it demonstrates the potential of on-line deep reinforcement learning techniques in the context of epidemic decision making. Finally, we present a novel approach to investigate how intervention policies can be improved by enabling collaboration between different geographic districts, by formulating the setting as a multi-agent problem, and by solving it using deep multi-agent reinforcement learning algorithms. We construct a meta-population model that consists of 379 patches, where each patch represents one administrative region in Great Britain. Great Britain consists of three countries with the following administrative regions: 325 districts in England, 22 unitary authorities in Wales and 32 council areas in Scotland. Each patch consists of a stochastic age-structured compartmental model, which we describe in sub-section 3.1, and the different patches are connected via a mobility model, as detailed in sub-section S3. In sub-section 3.3 we discuss how we validate and calibrate the model. We analyse the model's computational complexity and discuss the model's performance in the Supplementary Information. We consider a stochastic compartmental SEIR model from which we sample trajectories. We first describe the model in terms of ordinary differential equations (i.e., a deterministic representation) that we then transform to stochastic differential equations [5] to make a stochastic evaluation possible. An SEIR model divides the population in a susceptible, exposed, infected and recovered compartment, and is commonly used to model influenza epidemics [17] . More specifically, we consider an age-structured SEIR model (see Figure 1 for a visualization) with a set of n disjoint age groups [17, 21] . This model is formally described by this system of ordinary differential equations (ODEs), defined for each age group i: Every susceptible individual in age group i is subject to an age-specific and timedependent force of infection: which depends on: • The probability of transmission β when a contact occurs. Children (C) I A I C Figure 1 : We depict an age-structured SEIR model that considers two age groups (i.e., adults and children). This model consists of two SEIR models, one for each age group, that are connected to represent mixing between the age groups (yellow arrows). Note that it is also possible to mix within the age groups. Note that we use two age groups in this figure to provide a clear visualization of the model. In our actual model, we consider four different age groups. • The time-dependent contact matrix M , where M ij (t) is the average frequency of contacts that an individual in age group i has with an individual in age group j. • The frequency at which contacts with infected individuals (in age group j) occur: Once exposed, individuals move to the infected state according to the latency rate ζ. Individuals recover from infection (i.e., get better or die) at a recovery rate γ. We omit vital dynamics (i.e., births and deaths that are not caused by the epidemic) in this SEIR model, as the epidemic's time scale is short and we therefore assume that births and deaths will have a limited influence on the epidemic process [54] . Therefore, at any time: where the total population size N i corresponds to age-specific census data. Our model considers 4 age groups: children (0-4 years), adolescents (5-18 years), adults (19-64 years) and elderly (65 years and older). Note that the contact frequency M ij (t) is time-dependent, in order to model school closures, i.e., a different contact matrix is used for school term and school holiday. Following [17] , we consider conversational contacts, i.e., contacts for which physical touch is not required. As we aim to model the effectiveness of school closure interventions, we use the United Kingdom contact matrices presented in [17] , which encodes a contact matrix for both school term and school holiday. These contact matrices are the result of an internet-based social contact survey completed by a cohort of participants [17] . The contact matrices encode for the same age groups as mentioned before: children, adolescents, adults and elderly. We defined the SEIR model in terms of a system of ODEs which implies a deterministic evaluation of the system. However, for predictions, stochastic models are preferred, as they to account for stochastic variation and allow us to quantify uncertainty [29] . In order to sample trajectories from this set of differential equations, we transform the system of ODEs to a system of stochastic differential equations (SDEs), using the transformation procedure presented in [5] . This procedure introduces stochastic noise to the system by adding a Wiener process to each transition in the ODE. We evaluate the SDE at discrete time steps using the Euler-Maruyama approximation method [5] . Each compartmental model is representative of one of the administrative districts and as such the compartmental model is parametrised with the census data of the respective district, i.e., population counts stratified by age groups. We use the 2011 United Kingdom census data made available by NOMIS 1 . We present more details on the census data in the Supplementary Information. Our model, that is comprised of a set of connected SEIR patches, is inspired by the recent BBC pandemic model [32] . The BBC pandemic model was in its turn motivated by the model presented in [23] . At each time step, our model checks whether a patch p becomes infected. This is modulated by the patch's force of infection, which combines the potential of the infected patches in the system, weighted by a mobility model, that represents the commuting of adults between the different patches: where P is the set of patches in the model, M p p is the mobility flux between patch p and p, β is the probability of transmission on a contact, S A p (t) is the susceptible population of adults in patch p at time t and its contribution is modulated by parameter µ (range in [0, 1]), and I p (t) is the infectious potential of patch p at time t. We define this infectious potential as, where I A p (t) is the size at time t of the infectious adult population in patch p and M AA is the average number of contacts between adults, as specified in the contact matrix (see sub-section 3.1). M is a matrix based on the mobility dataset provided by NOMIS 2 . This dataset describes the amount of commuting between the districts in Great Britain. In general, this inter-patch model is constructed from first principles i.e., census data, a mobility model, the number of infected individuals and the transmission potential of the virus. However, for the parameter µ that modulates the contribution of the susceptibles in the receptive patch (while it is commonly used in literature [23, 18, 31] ) no such intuition is readily available. Therefore, this parameter is typically fitted to match the properties of the epidemic that is under investigation [23, 18, 31] . We will calibrate this parameter such that it can be used for a range of R 0 values, as detailed in the next sub-section. Our objective is to construct a model that is representative for contemporary Great Britain with respect to population census and mobility trends. This model will be used to study school closure intervention strategies for future influenza pandemics. While in many studies [31, 23, 18] a model is created specifically to fit one epidemic case, we aim for a model that is robust with respect to different epidemic parameters, most importantly R 0 , the basic reproduction number. To validate our model according to these goals, we conduct two experiments. In the first experiment, we compare our patch model to an SEIR compartmental model that uses the same contact matrix and age structure, but with homogeneous spatial mixing (i.e., no spatial structure). While we do not expect our model to behave exactly like the compartmental model, as the patches and the mobility network that connects them induces a different dynamic, we do observe similar trends with respect to the epidemic curve and peak day. This experiment also enables us to calibrate the µ parameter. We present a detailed description of this analysis and report the results in Supplementary Information. In the second experiment we show that our model is able to reproduce the trends that were observed during the 2009 influenza pandemic, commonly known as the swine-origin influenza pandemic, i.e., A(H1N1)v2009, that originated in Mexico. The 2009 influenza pandemic in Great Britain is an interesting case to validate our model for two reasons. Firstly, the pandemic occurred quite recently and thus our model's census and mobility scheme are a good fit, as both the datasets on which we base our census and mobility model were released in 2011. Secondly, due to the time when the virus entered Great Britain, the summer holiday started 11 weeks after the emergence of the epidemic. The timing of the holidays had a severe impact on the progress of the epidemic and resulted in a epidemic curve with two peaks. This characteristic epidemic curve enables us to demonstrate the predictive power of our age-structured contact model with support for school closures. In Figure S13 , we show a set of model realisations in conjunction with the symptomatic case data, which shows that we were able to closely match the epidemic trends observed during the British pandemic in 2009 (details on this case study in the Supplementary Information) . Note that our model reports the number of infections while the British Health Protection Agency only recorded symptomatic cases. Therefore we scale the epidemic curve with a factor of 1 4 . This large number of asymptomatic cases produced by our model is in line with earlier serological surveys [41] and with previous modelling studies [33] . In order to apply reinforcement learning, we construct an MDP based on the epidemiological model that we introduced in the previous section. This epidemiological model consists of patches that correspond to administrative regions. We have an agent for each patch that we attempt to control, and for each agent we have an action space A = {open, close} that allows us to open and close schools for one week. Each agent has a predefined budget b of school closure actions it can execute. Once this budget is depleted, executing a close action will default to executing an open action. We refer to the remaining budget at time t as b (t) . In the epidemiological model, when schools are closed we use a contact matrix that is representative for school holidays and when schools are open we use a contact matrix that is representative for school term (details in Section 3.1). For each patch, we consider a state space that combines the state of the SEIR model and the remaining budget of school closures b (t) p . For the SEIR model, we have 16 state variables (i.e., R 16 ), as we have an SEIR model (4 state variables) for each of the four age groups. The remaining school closure budget is encoded as an integer, resulting in a combined state space of 17 variables. We refer to the state space of one patch p, that thus combines the epidemiological states and the budget, as S p . The state space S of the MDP corresponds to the aggregation of the state space of each patch that we attempt to control: where P c ⊆ P is the set of patches that we control. The transition probability function T (s | s, a) stochastically determines the state of the epidemic in the next week, taking into account the school closure actions that were chosen, using the epidemiological dynamics as defined in the previous section. To reduce the attack rate, we consider an immediate reward function that quantifies the negative loss in susceptibles over one simulated week: where S(.) is the function that determines the total number of susceptible individuals given the state of the epidemiological model. For PPO, we have both a policy and value network. The policy network accepts the state of the epidemiological model as input (details in Section 4) and the output of the network contains 1 unit, which is passed through a sigmoid activation function. This output thus represents the probability of keeping the schools open in the district. Every hidden layer in the PPO network uses the hyperbolic tangent activation function. The value network has the same architecture as the policy network, with the exception that the output is not passed through an activation function. We will refer to this setting throughout this work as the single-district PPO agent. PPO's hyper-parameters are tuned (hyper-parameter values in Supplementary Information) on a single-district (i.e., the Greenwich district) learning environment with R 0 = 1.8. To this end, we performed a hyper-parameter sweep using Latin hypercube sampling (n = 1000) [52] . We conduct two kinds of experiments: in the context of a single district and in the context of the Great Britain model that combines all 379 districts. We consider two values for the reproductive number, i.e., R 0 = {1.8, 2.4}, to investigate the effect of distinct reproductive numbers. R 0 = 1.8 represents an epidemic with moderate transmission potential [19] and R 0 = 2.4 represents an epidemic with high transmission potential [38] . We investigate the effect of different school closure budgets, i.e., b = {2, 4, 6} weeks. The epidemic is simulated for a fixed number of weeks, chosen beforehand, to ensure that there is enough time for the epidemic to fade out after its peak. Following [6] , we use a latent period of one day (ζ = 1 1 ) and an infectious period of 1.8 days (γ = 1 1.8 ). Given the contact matrix M ij (t), the latency rate ζ, the recovery rate γ, we can compute β for an R 0 , as specified in the Supplementary Information. We now establish a ground truth for different population compositions, i.e., the proportion of the different age groups in a population. We will use this ground truth to empirically validate that PPO converges to the appropriate policy. To establish this ground truth 3 , first consider that when we deal with a single district, we can approach the 'average' behaviour of the model by removing the stochastic terms from the differential equations. Hence, for a particular parameter configuration (i.e., district, R 0 , γ, ζ), the model will always produce the same epidemic curve. This (1) or closed (0) during the i-th week. For short-lived epidemics, such as influenza epidemics, we can enumerate these policies and evaluate them in our model (i.e., using exhaustive policy search). Note that, in the epidemiological models that we consider, the epidemic spans no more than 25 weeks, and thus exhaustive search is possible. In this analysis, we consider different values for the basic reproductive number R 0 and the population composition of the district, both parameters that induce a significant change of the epidemic models dynamics. To this end, we select 10 districts that are representative of the population heterogeneity in Great Britain: one district that is representative for the average of this census distribution and a set of nine districts that is representative for the diversity in this census distribution. Details on this selection procedure can be found in the Supplementary Information. To evaluate PPO with respect to the ground truth, we repeat the experiment for which we established a ground truth (i.e., R 0 ∈ {1.8, 2.4}, 10 districts and b ∈ {2, 4, 6}) and learn a policy using PPO, in the stochastic epidemic model. For each experimental setting (i.e., the combination of a district, an R 0 value, and a school closure budget b), we run PPO 5 times (5 trials), to asses the variance of the learning performance. Each PPO trial is run for 10 4 episodes of 43 weeks. We show the learning curves, i.e., total reward at the end of the episode, for the district that is representative for the average of the census distribution (i.e., the Barnsley district in England), with R 0 = 2.4 in Figure 3 , for the other settings we report similar learning curves in the Supplementary Information. To compare each of the learned policies to its ground truth (one for each district), we take the learned policy and apply it 1000 times in the stochastic model, which results in a distribution over model outcomes (i.e., attack rate improvement: the difference between the attack rate produced by the model and the baseline when no schools are closed). We then compare this distribution to the attack rate improvement that was recorded for the ground truth. We show these results, for the setting with a school closure budget of 6 weeks and R 0 = 2.4, in Figure 3 , and for the other settings in Supplementary Information. These results show that PPO learns a policy that matches the ground truth for all districts and combinations of R 0 and b. Note that for these experiments, we use the same hyper-parameters for PPO that were introduced in Section 4. This demonstrates that, for different values of R 0 and for different census compositions (which induce a significant change in dynamics in the epidemic model) these hyper-parameters work well. This indicates that these hyperparameters are adequate to be used for different variations of the model. In this section, we compare to the ground truth (that has been found through an exhaustive policy search) to a policy learned by PPO, a deep reinforcement learning algorithm. This allows us to empirically validate that PPO converges to the optimal policy. This experimental validation is important, as it demonstrates the potential of deep reinforcement learning algorithms to learn policies in the context of complex epidemiological models. This indicates that it is possible to learn in even more complex stochastic models with large action spaces, for which it is impossible to compute the ground truth. In Section 6, we investigate such a setting, where we aim to learn a joint policy for a set of districts, using deep multi-agent reinforcement learning. To investigate the collaborative nature of school closure policies, we apply deep multiagent reinforcement learning algorithms. In our model, we have 379 agents, one for each district, as agents represent the district for which they can control school closure. As the current state-of-the-art of deep multi-agent reinforcement learning algorithms is limited to deal with about 10 agents [28] , we thus need to partition our model into smaller groups of agents, such that deep multi-agent reinforcement learning algorithms become feasible. To this end, we analyse the mobility matrix M to detect clusters of districts that represent closely connected communities (details in Supplementary Material). Through this analysis we identify a community with 11 districts, to which we will refer as the Cornwall-Devon community, as it is comprised of the Cornwall and Devon regions. We now examine whether there is an advantage to consider the collaboration between districts when designing school closure policies. We conduct an experiment in our epidemiological model with 379 districts, and attempt to learn a joint policy to control the districts in the Cornwall-Devon community. To this end, we assign an agent to each of the 11 districts of the Cornwall-Devon community, and use a reinforcement learning approach to learn a joint policy. We compare this joint policy to a non-collaborative policy (i.e., aggregated independent learners). We refer to the state space of one patch p as S p , as detailed in Section 4. The state space S of the MDP corresponds to the aggregation of the state space of the set of patches P c that we attempt to control. In this experiment, P c corresponds to the 11 In order to learn a joint policy, we need to consider an action space that combines the actions for each district p ∈ P c that we attempt to control. This results in a joint action space with a size that is exponential with respect to the number of agents. To approach this problem, we use a PPO super-agent that controls multiple districts simultaneously, to learn a joint policy. To this end, we use a custom policy network that gets as input the combined model state of each district p ∈ P c (Equation 6), and as a result, the input layer has 17 · |P c | input units. In contrast to the single-district PPO, that was introduced in Section 4, the output layer of the policy network of this agent has a unit for each district that we attempt to control. Again, each output unit is passed through a sigmoid activation function, and hence corresponds to the probability of closing the schools in the associated district. Similar to the single-district PPO, each hidden layer uses the hyperbolic tangent activation function. The value network has the same architecture for the input layers and hidden layers, but only has a single output unit that represents the value for the given state. We will refer to this agent as multi-district PPO. We conduct experiments for R 0 = 1.8 (i.e., moderate transmission potential) and R 0 = 2.4 (i.e., high transmission potential), and we consider a school closure budget of 6 weeks, i.e., b = 6. We run multi-district PPO 5 times, to assess the variance of the learning signal, for 10 5 episodes of 43 weeks, and we show the learning curves in Figure 4 . These learning curves demonstrate a stable and steady learning process. For R 0 = 1.8 the reward curve is still increasing, while for R 0 = 2.4 the reward curve indicates that the learning process has converged. To investigate whether these joint policies provide a collaborative advantage, we compare it to the aggregation of single district policies, to which we will refer as the aggregated policy. To construct this aggregated policy, we learn a distinct school closure policy for each of the 11 districts in the Cornwall-Devon community, using PPO, following the same procedure as in Section 5. To evaluate this aggregated policy, we execute the distinct policies simultaneously. For the districts that are not controlled For both R 0 = 1.8 and R 0 = 2.4, respectively, we simulate the joint and the aggregated policy 1000 times, and we show the attack rate improvement distribution in Figure 5 . These results corroborate that there is a collaborative advantage when devising school closures policies, for both R 0 = 1.8 and R 0 = 2.4. However, the improvement is most significant for R 0 = 1.8. We conjecture that this difference is due to the fact that there is less flexibility when the transmission potential of the epidemic is higher, since there is less time to act. Although, we observe an improvement when a joint policy is learned, it remains challenging to interpret deep multi-agent policies, and we discuss in Section 7 possible directions for future work with respect to multi-agent reinforcement learning. In this analysis, where we have a limited number of actions per agent, the use of multi-district PPO proved to be successful. However, the use of more advanced multiagent reinforcement learning methods is warranted when a more complex action space is considered or a larger number of districts needs to be controlled. For this reason, we also investigated the recently introduced QMIX [50] algorithm, but the resulting learning curve proved to be quite unstable (shown in Supplementary Information) . We conducted our experiments in the setting of school closures, and our findings are of direct relevance with respect to the mitigation of pandemic influenza. Furthermore, our novel approach to investigate the collaborative nature of prevention strategies as a multi-agent reinforcement learning problem, can be applied to other epidemiological settings, such as for example the ongoing COVID-19 pandemic. We demonstrate the potential of deep reinforcement learning in the context of epidemiological decision making by conducting experiments that show that PPO converges to the optimal policy. Next, we investigate and show that there is a collaborative advantage when devising school closures policies, by formulating this hypothesis as a multi-agent problem. The work conducted in this manuscript indicates that there is the potential to use reinforcement learning in the context of complex stochastic epidemiological models. For future work, it would be interesting to investigate how well these algorithms scale to even larger state and/or action spaces. In this regard, the use of attention-based multi-agent reinforcement learning algorithms could be explored [37] . Another important concern is to scale these reinforcement learning methods to individual-based epidemiological models, as such models can be easily configured to approach a variety of research scenarios, i.a., vaccine allocation, telecommuting, antiviral drug allocation. However, the computational burden that is associated with individual-based models complicates the use of reinforcement learning methods [64] . To this end, it would be interesting to devise methods to automatically learn a surrogate model from the individual-based model, such that the reinforcement learning agent can learn in this computationally leaner surrogate model. While we show that deep reinforcement learning algorithms can be used to learn optimal mitigation strategies, the interpretation of such policies remains challenging [24] . This is especially the case for the multi-district setting we considered, where state and time do not match, and the infection onset of the patches is highly stochastic. To this end, further research into explainable reinforcement learning, both in a singleagent and multi-agent setting, is warranted. Each compartment model is representative of one of the districts defined in the main manuscript, and as such the compartment model is parametrised with the census data of the respective district, i.e., population counts stratified by age groups. We use the 2011 United Kingdom census data made available by NOMIS 4 . This dataset contains census data for all of the considered districts for the following age groups: 0-4, 5-7, To be compatible with our model, we need to map this census data to the age structure imposed by the Eames contact matrix: i.e., 0-4 years (children), 5-18 years (adolescents), 19-64 years (adults), 65-90+ years (elderly). To define this mapping, we will refer to the number of individuals with the symbol N , subscripted with the dataset type (i.e., NOMIS or Eames) and the age group. For the age group 0-4 and 65-90+ we have a direct mapping: However, as for the contact matrix the adolescents and adults are split between age 18 and 19, and for the census data these 2 age groups are aggregated, we need to make a custom mapping. We will aggregate all shared age groups and divide the common age group in two: When restructuring the census data according to the Eames age groups, we observe clear trends over the districts with respect to the proportion of children, adolescents, adults and elderly, as shown in Figure S1 . However, the histograms in Figure S1 only show the marginalized distribution per age group. To reason about the distribution over all age groups, consider that we have a proportion of each of the age groups, and we thus can represent this data as a positive simplex [1] , as defined in Definition 2. Definition 2 (Unit simplex). A unit simplex [4] , with D components, corresponds to the set: This representation enables us to reason about this data in a statistical framework, and to visualize the four-dimensional data in a three-dimensional space by using the Barycentric coordinate system, as shown in Figure S2 . Figure S2 shows that the census distribution exhibits a dense region with a limited number of outliers. Note that we use the 2011 census dataset, rather than the more recent 2018 census dataset, to be fully compatible with the mobility dataset used to inform our betweenpatch transition model (see Section S3). For each district, the base contact matrix is corrected to make it reciprocal, using that district's census data. In influenza modelling literature, it is common to parametrise the model in terms of a specific R 0 value. We will now introduce an equation that enables us to compute the transmission probability β for a given R 0 value. To this end, we need to determine the next-generation matrix, which summarizes the number of secondary infections between age groups [57] , and determine the spectral radius of this matrix (see Definition 3). Definition 3 (Spectral radius). The spectral radius of a matrix L, Υ(L), is the dominant eigenvalue of that matrix L. As the contact matrix is a square matrix with positive real entries, according to the Perron-Frobenius theorem, this eigenvalue exists and is unique. Note that, in a graph, the spectral radius is a measure of the graph's connectivity [36] . As our context matrix M can be seen as a graph that represents how strongly the different age groups are connected, this notion of connectivity applies here as well. Following [16] and [17] , we construct the next-generation matrix for our SEIR model: Given K, we can compute R 0 as: Using equation S5, we can now compute the transmission risk β for a given R 0 , γ and contact matrix M . Note that for each district, we have a contact matrix that is corrected for reciprocity by using that district's census data. Therefore, we have a distribution over Υ(M d ), where M d is a contact matrix for district d. We would expect that this distribution is centred around Υ(M GB ), where M GB the contact matrix that is corrected for reciprocity using the census data representative for Great-Britain in its entirety (i.e., an aggregation of all the districts). This is confirmed in Figure S3 , which shows that the median of the distribution over Υ(M d ) coincides with Υ(M GB ). Furthermore, note that the contact matrix denotes the average frequency of contacts that an individual in age group i has with an individual in age group j. Figure S3 thus shows a limited variance (σ 2 = 0.001). Our model, that is comprised of a set of connected SEIR patches, is inspired by the recent BBC pandemic model [32] . The BBC pandemic model was in its turn motivated by the model presented in [23] . 30 Figure S3 : Both figures display the distribution of the contact matrix' spectral radius for the different districts. To demonstrate the shape of this distribution, the left panel shows a histogram, annotated with a dotted green line that represents the median. To demonstrate the distribution with respect to its quartiles and outliers, the right panel shows a box plot of the distribution, annotated with an orange line that represents the median and a yellow star that shows the spectral radius for Great Britain. At each time step, our model decides whether a patch p becomes infected. This is modulated by the patch's force of infection, which combines the potential of the infected patches in the system, weighted by a mobility model: where P is the set of patches in the model, M p p is the mobility flux between patch p and p, β is the probability of transmission on a contact, S A p (t) is the susceptible population of adults in patch p at time t and its contribution is modulated by parameter µ, and I p (t) is the infectious potential of patch p at time t. We define this infectious potential as, where I A p (t) is the size at time t of the infectious adult population and M AA is the average number of contacts between adults. M is a matrix based on the mobility dataset provided by NOMIS 5 . This dataset describes the amount of commuting between the districts in Great-Britain. In general, this between-patch model is constructed from first principles i.e., census data, a mobility model, the number of infected individuals and the transmission potential of the virus. However, for the parameter µ that modulates the contribution of the susceptibles in the receptive patch, while it is commonly used in literature [23, 18, 31] , no such intuition is readily available. Therefore, this parameter is fitted to match the properties of the epidemic that is under investigation [23, 18, 31] . To validate our model, we conduct two experiments. Firstly, we compare our model to the original compartment model and perform a sensitivity analysis with respect to parameter µ. Secondly, we show that our model fits the recent influenza pandemic of 2009, by choosing an appropriate value for all model parameters. Given this time-dependent force of infection, we model the event that a patch becomes infected with a non-homogeneous Poisson process [58, 53, 8] . A Poisson process can be used to model the occurrence of events with a given intensity (see Definition 5), and non-homogeneous Poisson processes generalize this concept to timedependent intensities (see Definition 6) . As the process' intensity depends on how the model (i.e., the set of all patches) evolves, we cannot sample the time at which a patch becomes infected a priori. Therefore, we determine this time of infection using the time scale transformation algorithm [13] . Firstly, we explain the generic time scale transformation algorithm (Section S3.1). Secondly, we adjust the algorithm to our setting (Section S3.2). The time scale transformation algorithm enables us to determine the time at which an event, modelled by a non-homogeneous Poisson process, will take place [13] . We will start by formally defining the homogeneous and non-homogeneous Poisson process. A Poisson process is a counting arrival process, defined on a sample space Ω with probability measure P . Definition 4 (Arrival process). An arrival process is a stochastic process [13] , such that for any ω ∈ Ω, the mapping t → N t (ω), has N 0 = 0, is non-decreasing, increases only by integer jumps and is right continuous. A homogeneous Poisson process is an arrival process [13, 10] , for which these axioms hold: 1. for almost all ω ∈ Ω, t → P t (ω) jumps in steps of size 1 2. the number of arrivals within any interval [t..t + s], is independent of the history of arrivals prior to t From this definition, we can show that for each homogeneous Poisson process P: for some constant λ ≥ 0, where λ signifies the intensity (i.e., rate) of the process. The concept of a Poisson process can be generalized to a non-homogeneous Poisson process by removing the time-homogeneity requirement: Definition 6 (Non-homogeneous Poisson process). A non-homogeneous Poisson process is an arrival process [13] , for which these axioms hold: 1. for almost all ω ∈ Ω, t → P t (ω) jumps in steps of size 1 2. the number of arrivals within any interval [t, t + s], is independent of the history of arrivals prior to t P λ(t) t has a time-dependent rate that is specified by the intensity function λ(t), where λ(t) ≥ 0. We define the process' cumulative intensity function: Definition 7 (Cumulative intensity function). A non-homogeneous Poisson process P λ(t) with intensity function λ(t) has a cumulative intensity function: Furthermore, we can show that [46] : and thus we have that Λ(t) is the expectation function of P λ(t) t : (S14) From Definition 7, it is clear that Λ(t) will be a non-decreasing function and at least right-continuous. The crucial theorem that underlies the time scale transformation algorithm denotes that the arrival times in a non-homogeneous Poisson process can be mapped to a homogeneous Poisson process with rate 1 (Theorem 1). We present an example that demonstrates this theorem in Figure S4 . The time scale transformation algorithm uses the relation in Theorem 1 to transform a homogeneous Poisson process with λ = 1 into a non-homogeneous Poisson process with expectation function Λ. The homogeneous process is formed by sampling from an exponential probability distribution with λ = 1. To make this transformation possible, a time inverse function of Λ is required: Definition 8 (Time inverse of Λ(t)). The time inverse τ of an expectation function Λ(t) for a non-homogeneous Poisson process P λ(t) t : τ (s) = inf{t : Λ(t) > s} (S17) In Algorithm 1, we formalize this procedure. At each step i, we obtain a sample X i from an exponential probability distribution with rate λ = 1, which is added to the set of samples X i . The sum of the elements in X i represents the i th arrival in the homogeneous Poisson process, and is transformed into the i th arrival in the nonhomogeneous Poisson process using the inverse time function τ (.). In order to use the time scale transformation algorithm (Algorithm 1) in our epidemiological model, note that the patches' internal state is updated in a discrete number of time steps. We determine a patch's intensity φ p (t) (Equation S6) at the end of each day. This results in a sequence of intensities between which we linearly interpolate to obtain a piecewise linear intensity function: Figure S5 : An example of a piecewise linear intensity function for a patch in our model (see Equation S18 ). The blue scatter points represent the evaluation of φ(t) (Equation S6) at discrete time steps (i.e., the end of each day). The orange connecting lines represent the linear interpolation between φ(i − 1) and φ(i). where line(x, interpolates linearly between (x 1 , f (x 1 )) and (x 2 , f (x 2 )). This piecewise linear intensity function λ(t) is continuous and thus its cumulative counterpart Λ(t) is continuous as well. Furthermore, as φ p (t) ≥ 0 for all t ≥ 0, Λ(t) is non-decreasing. As φ p (t) depends on the simulation state at time t, it is clear that we cannot evaluate this function beyond the current simulator time step. However, the definition of the time inverse τ (Definition 8) shows that we can use the arrival time in the homogeneous Poisson process as a threshold for the arrival time in the non-homogeneous Poisson process. We formalize this threshold-based time scale transformation algorithm in Algorithm 2. Note that this algorithm approximates the original algorithm as we check whether the threshold is surpassed at discrete time steps. Trigger event X (t) ∼ Exp(λ = 1) X = X + X (t) end end Algorithm 2: Time scale transformation algorithm using discrete time steps Our objective is to construct a model that is representative for contemporary Great Britain with respect to population census and mobility trends. This model is to be used to study school closure intervention strategies for future influenza pandemics. While in many studies [31, 23, 18] , a model is created specifically to fit one epidemic case, we aim for a model that is robust with respect to different epidemic parameters, most importantly R 0 , the basic reproduction number. To validate our model according to these goals, we conduct two experiments. In the first experiment, we compare our patch model to a SEIR compartment model that uses the same contact matrix and age structure. While we do not expect our model to behave exactly like the compartment model, as the patches and the mobility network that connects them induces a different dynamic, we do expect to see similar trends with respect to the epidemic curve and peak day. In the second experiment we show that our model is able to reproduce the trends that were observed during the 2009 influenza pandemic, commonly known as the swine-origin influenza pandemic, that originated in Mexico. The 2009 influenza pandemic in Great Britain is an interesting case to validate our model for three main reasons. Firstly, the pandemic occurred quite recently and thus our model's census and mobility scheme should be a good fit, as both the datasets on which we base our census and mobility model were released in 2011. Secondly, due to the time when the virus entered Great Britain, the summer holiday started 11 weeks after the emergence of the epidemic. The timing of the holidays had a severe impact on the progress of the epidemic and resulted in a epidemic curve with two peaks. This characteristic epidemic curve enables us to demonstrate the predictive power of our age-dependent contact model with support for school closures. Thirdly, the number of symptomatic cases that occurred in Great Britain during the 2009 pandemic was recorded meticulously and is publicly available [33] . In this experiment, we compare our patch model to a simple SEIR model that encompasses the same age structure and contact matrix [17] , to which we will refer as Eames-SEIR from this point onwards. We consider a stochastic implementation of the Eames-SEIR. This experimental setting will be central to the reinforcement learning experiments, related to finding optimal school closure policies, that we present in the main manuscript. Following [17] and [6] , we use a latent period of one day (ζ = 1 1 ) and an infectious period of 1.8 days (γ = 1 1.8 ). We perform our experiment for a set of R 0 values within the range of 1.4 to 2.4, in steps of 0.2. This range is considered representative for the epidemic potential of influenza pandemics [9, 40] . Furthermore, we need to choose a value for the parameter in the between-patch model, i.e., µ, that modulates the contribution of susceptible adults in the receiving patch (see Section S3). This parameter is typically fitted towards data, however, in this experiment and in the reinforcement learning experiments in the main manuscript, we consider a model to investigate future epidemics. Our goal is to calibrate our model such that it produces peak days that are similar to the peak days in Eames-SEIR [17] , which is a prominent model for pandemic influenza that moreover generates peak days that are in agreement with earlier work [19] . Therefore, we investigate the effect of µ in this setting, through a sensitivity analysis. We consider µ in the interval [0, 1], where the left end of the interval (i.e., µ = 0) signifies that the contribution of susceptible adults is ignored and the right end of the interval (i.e., µ = 1) signifies that the contribution of adults is not modulated. In Figure S6 However, no value of µ provides a good fit for all of the considered R 0 's, when comparing the peak days to the Eames-SEIR model. Rather, we can discern a logrelationship between µ and the best fit for the different R 0 's. Based on this observation, 4} (xaxis). A curve is shown for µ = log(R 0 ) · 0.6 (orange curve) and the peak days as produced by the SEIR-Eames model (blue curve with diamond scatter points). For each R 0 , 100 stochastic trajectories were sampled and the bound signifies the 95% confidence interval of the sample. we propose to define: where s is a scaling factor. For this experimental setting, we find that s = .6 provides a good fit for all of the considered R 0 's, which we show in Figure S7 . Provided this choice of µ, when we compare the epidemic trajectories of our spatial model with the SEIR-Eames model in Figure S8 , we observe similar trends with respect to the shape of the trajectory distributions. The main difference is that the epidemic curves grow slower in our spatial model than in the Eames-SEIR model and also achieve a lower peak incidence. This is expected, as we constrain mixing in our spatial model within the districts, and thus increase the resolution of our model, which has been shown to more accurately predict peak incidence [42] . Furthermore, in Figure S9 , we show the number of districts that get infected over time for different R 0 values. This shows that all districts get infected, and the time it takes for all districts to reach this point depends mainly on the transmission-ability of the virus strain. We expect the attack rate to be similar for the SEIR-Eames and spatial model. When all districts get infected, the attack rate in the spatial model is the sum of the attack rates of a set of SEIR-Eames models (i.e., one Eames-SEIR model per district). We verified that the attack rates are indeed nearly identical, as shown in Figure S10 , with little variance for either of the models. The virus responsible for the 2009 influenza pandemic arrived in Great Britain during the first week of May 2009 (week 19). Following this introduction, the epidemic grew for 11 weeks until the summer school holidays started, after which the epidemic showed its first peak. After the school holidays, the epidemic was rekindled and grew to a second peak. In Figure S11 we show the weekly case count, as recorded by the British Health Protection Agency (HPA) and the time at which the school holidays take place. To reproduce this distinctive epidemic curve, we use our original model as it was described in the main manuscript. We consider two free parameters: the basic reproduction number and the time of the infectious period. The general consensus is that the basic reproduction number was moderate during the 2009 influenza pandemic, with estimates ranging from 1.16 to 2 [33, 56, 15, 20, 63, 44, 7] . We present a detailed overview of the reported basic reproduction number estimates in Table 1 . For the period of infectiousness we found estimates of 1.8, 2.5 and 3.38 days [17, 7, 56] . We present a detailed overview of the infectious period estimates in Table 2 . Given these prior estimates, we parametrize our model with a basic reproduction number that is in the range of 1.2 to 2.0 and consider a duration of infectiousness of respectively 1.8, 2.5 and 3.38 days. For this experiment, we found, to be a good fit for the overall comparison. In Figure S12 we show the epidemic curve for our model with respect to these parameters. In general, the epidemic curves that result from using an infectious period of 1.8 days are insufficient to reproduce the trends of the 2009 pandemic. For the other infectious periods (i.e., 2.5 and 3.38), we show that for all but the highest reproductive numbers we observe an epidemic curve with 2 peaks. Furthermore, we observe a deeper trough in the epidemic curve when an infectious period of 2.5 days is chosen. In Figure S13 , we show a set of model realisations in conjunction with the symptomatic case data, which shows that we were able to closely match the epidemic trends observed during the British pandemic in 2009. This model was configured with a basic reproductive number of 1.4 and infectious period of 2.6. The reproductive number is in good concordance with the general consensus that the virus responsible for the 2009 pandemic exhibited a moderate infectiousness. While the infectious period slightly differs from the value reported by [7] (i.e., 2.5 days), it lies well within the confidence bounds reported in this study (confidence interval: 1.1-4.0 days). Note that our model reports the number of infections, while the HPA only recorded symptomatic cases. Therefore we scale the epidemic curve with a factor of 1 4 . While atypical, this large number of asymptomatic cases produced by our model is in line with earlier serological surveys [41] and with previous modelling studies [33] . An analysis of the computational complexity of our model needs to consider that the model incorporates two components. On the one hand, infection in the patches is triggered via the time scale transformation algorithm (see Section S3.2). On the other hand, once infected, each patch in the system evolves independently, and we use the Euler-Maruyama approximation method to obtain samples from the stochastic differential equation that is associated with the patch. The time scale transformation algorithm samples a random threshold for each patch, which is compared to the force of infection of the associated patch. This comparison occurs at each time step that the patch was not yet infected. Computing the force of infection in Equation S6 considers all model patches, and thus has a worst case complexity that is linear in the number of model patches, i.e., O(|P|). In worst case, at each time step t, if only one of the model patches is infected, and we need to compute the force of infection for each patch, which has a quadratic complexity in the number of model patches, i.e., However, when each patch only needs to be infected once, we observe that we have a complexity in terms of infected P i and uninfected patches P ¬i . After all, at each time step, we only need to consider the force of infection of the uninfected patches, and this force of infection only takes into account the infected patches, i.e., O(t · |P i · P ¬i |). (S24) Since we have, we can see that while this expression has the same worst case complexity as in Equation S23, in practice less operations will be required. For both the complexity in Equation S23 and Equation S24 , it is clear that as long as the number of patches is limited, as is the case in our model, this procedure will be computationally efficient, as this can be implemented as a vector product, on vectors that all fit in memory (RAM). When a patch is infected, at each time step it will be advanced by using a number of operations that is proportional to the number of compartments in the the age-dependent SEIR model. This model was implemented in Python, and the performance critical sections were either implemented using NumPy when a vector representation was possible (e.g., to compute the force of infection) [45] , or JIT-compiled using Numba (e.g., to evolve the age-dependent SEIR model in a patch) [34] . This implementation performs well, resulting in about 2 simulation runs per second on a MacBook Pro. To establish a ground truth, we select 10 districts that are representative of the population heterogeneity in Great Britain. To this end, we remind the reader that in Section S1, we analysed the population heterogeneity by representing the population structure as a positive simplex. We select 10 districts: one district that is representative for the average of this distribution and a set of nine districts that is representative for the diversity in this distribution. To determine the average district, we consider the population heterogeneity distribution over all districts, and determine the Aitchison's mean (Definition 9) of this distribution [3] . We then select the district that is closest to the Aitchison's mean (Definition 9) according to the Aitchison distance (Definition 10), as shown in Figure S14 . Definition 9 (Aitchison's mean). Given a set of points from a unit simplex (Definition 2), the Aitchison's mean [4] is: where, is the geometric mean of the d-th component over all simplex points in P . Definition 10 (Aitchison distance). Given two points from a unit simplex p, q ∈ S D (Definition 2), we define the Aitchison distance function [2] : where, denotes the geometric mean of p. This distance defines a metric on the simplex sample space. Next, we determine the outer extreme points, as these represent the most diverse census points. To do this, we compute the convex hull of the point cloud (i.e., the smallest convex set of points that contains the point cloud), as shown in Figure S15 . We proceed by taking the points that belong to the surface of the convex hull, of which we make a sub-selection of 9 census points. As the convex hull consists out of 21 points, we consider all k-combinations (with k = 9) and select the set of points that maximizes the minimum Aitchison distance between the selected points, as shown in Figure S16 . To investigate the collaborative nature of school closure policies, we apply deep multiagent reinforcement learning algorithms. In our model, we have 379 agents, one for each district, as agents represent the district for which they can control school closure. As the current state-of-the-art of deep multi-agent reinforcement learning algorithms is limited to deal with about 10 agents [28] , we thus need to partition our model into smaller groups of agents, such that deep multi-agent reinforcement learning algorithms become feasible. To this end, we consider the mobility matrix M and define a directed commute graph for M ij ≥ 0 (Definition 11). Definition 11 (Commute graph). For a commuting matrix M that describes the mobility flux between a set of districts D, we define a commute graph, where V c is the set of vertices, with a vertex for each of the districts in D, and A c is the adjacency matrix that specifies the vertices that are connected: Each pair of connected vertices i and j has a weight M ij . To detect communities in the commute graph, we used the Leiden algorithm [55] , an algorithm that searches for communities that maximize the network modularity [35] . We found a partition of which we demonstrated the robustness (p-value ≤ 0.001) using a bootstrapping approach presented in [49] . Furthermore, by rendering this partition on top of the map of Great Britain, as is shown in Figure S17 , we show that the districts belonging to the same community are close to each other geographically, as we would expect. Moreover, when we overlay the NUTS-2 administrative regions 6 on the partitioning ( Figure S17 ), we observe that our partitioning scheme mostly overlaps with the NUTS-2 regions, which indicates that the Leiden algorithm produces a sensible partitioning. Figure S17 : We show the communities, that resulted from applying the Leiden algorithm, on the map of Great Britain. We show all administrative districts colour-coded by the community they belong to and the add the borders of the NUTS-2 administrative regions on top of this map. We annotate the Cornwall-Devon community with a yellow rectangle. We conduct our multi-agent reinforcement learning experiments in the community S10 Comparing PPO to the ground truth (attack rate) -Gradient norm clipping threshold: 1.0 Principal component analysis of compositional data On criteria for measures of compositional difference Principles of compositional data analysis The one-hour course in compositional data analysis or compositional data analysis is simple Construction of equivalent stochastic differential equation models. Stochastic analysis and applications Vaccination against pandemic influenza a/h1n1v in england: a real-time economic evaluation Seasonal transmission potential and activity peaks of the new influenza a (h1n1): a monte carlo likelihood analysis based on human mobility Fluctuation effects in metapopulation models: percolation and pandemic threshold Strategies for pandemic and seasonal influenza vaccination of schoolchildren in the United States Introduction to probability Would school closure for the 2009 h1n1 influenza epidemic have been worth the cost?: a computational simulation of pennsylvania School closure policies at municipality level for mitigating influenza spread: a model-based evaluation Introduction to stochastic processes. Courier Corporation The impact of regular school closure on seasonal influenza epidemics: a data-driven spatial transmission model for belgium A preliminary analysis of the epidemiology of influenza a (h1n1) v virus infection in thailand from early outbreak data The construction of nextgeneration matrices for compartmental epidemic models Measured dynamic social contact patterns explain the spread of h1n1v influenza Spatial dynamics of the 1918 influenza pandemic in england, wales and the united states Strategies for mitigating an influenza pandemic Pandemic potential of a strain of influenza a (h1n1): early findings. science Inferring the structure of social contacts from demographic data in the analysis of infectious diseases spread School dismissal as a pandemic influenza response: When, where and for how long? Spatial transmission of 2009 pandemic influenza in the us Darpa's explainable artificial intelligence program Effectiveness of interventions to reduce contact rates during a simulated influenza pandemic Developing guidelines for school closure interventions to be used during a future influenza pandemic Modeling targeted layered containment of an influenza pandemic in the United States A survey and critique of multiagent deep reinforcement learning Avoidable errors in the modelling of outbreaks of emerging pathogens, with special reference to ebola Adam: A method for stochastic optimization Geographic transmission hubs of the 2009 influenza pandemic in the united states Contagion! the bbc four pandemic-the model behind the documentary Why was the 2009 influenza pandemic in england so small? Numba: A llvm-based python jit compiler Community structure in directed networks Network science: Theory and applications Multi-agent game abstraction via graph attention neural network Containing pandemic influenza at the source Nonpharmaceutical interventions implemented by us cities during the 1918-1919 influenza pandemic Optimizing influenza vaccine distribution Incidence of 2009 pandemic influenza a h1n1 infection in england: a cross-sectional serological study The spatial resolution of epidemic peaks The cost effectiveness of pandemic influenza interventions: a pandemic severity based analysis Pros and cons of estimating the reproduction number from early epidemic growth rate of influenza a (h1n1) 2009 Guide to NumPy Applied stochastic system modeling Context matters: using reinforcement learning to develop human-readable, state-dependent outbreak response policies Community structure of copper supply networks in the prehistoric balkans: An independent evaluation of the archaeological record from the 7th to the 4th millennium bc QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning Proximal policy optimization algorithms Large sample properties of simulations using latin hypercube sampling A simple explanation for the low impact of border control as a countermeasure to the spread of an infectious disease Social contact patterns and control strategies for influenza in the elderly From louvain to leiden: guaranteeing well-connected communities Estimated epidemiologic parameters and morbidity associated with pandemic h1n1 influenza An introduction to infectious disease modelling Characterizing the dynamics underlying global spread of epidemics Are we ready for pandemic influenza? Dynamic health policies for controlling the spread of emerging infections: influenza as an example Identifying dynamic tuberculosis case-finding policies for hiv/tb coepidemics Identifying cost-effective dynamic policies to control epidemics The transmissibility and control of pandemic influenza a (h1n1) virus Towards sample efficient reinforcement learning A novel coronavirus from patients with pneumonia in china Pieter Libin and Timothy Verstraeten were supported by a PhD grant of the FWO (Fonds Wetenschappelijk Onderzoek -Vlaanderen). We thank Jelena Grujic (Vrije Universteit Brussel) for her comments on the network analysis.