Interpretable AI for policy-making in pandemics
Leonardo Lucio Custode, Giovanni Iacca
2022-04-08. DOI: 10.1145/3520304.3533959

Since the first wave of the COVID-19 pandemic, governments have applied restrictions in order to slow down its spreading. However, creating such policies is hard, especially because the government needs to trade off the spreading of the pandemic against the economic losses. For this reason, several works have applied machine learning techniques, often with the help of special-purpose simulators, to generate policies that were more effective than the ones adopted by governments. While the performance of such approaches is promising, they suffer from a fundamental issue: since they are based on black-box machine learning, their real-world applicability is limited, because the resulting policies cannot be analyzed or tested, and thus they are not trustworthy. In this work, we employ a recently developed hybrid approach, which combines reinforcement learning with evolutionary computation, to generate interpretable policies for containing the pandemic. These policies, trained on an existing simulator, aim to reduce the spreading of the pandemic while minimizing the economic losses. Our results show that our approach is able to find solutions that are extremely simple, yet very powerful. In fact, our approach achieves significantly better performance (in simulated scenarios) than both previous work and government policies.

COVID-19 has changed the way of living of all the people on Earth. In fact, since its outbreak, several countries adopted safety measures such as social distancing, lockdowns, and the closure of certain types of economic activities. These safety measures were taken as a prevention mechanism to slow down the spreading of the pandemic in the world population. However, some safety measures may have a relevant impact on the economy of a country. For this reason, it is important to find a trade-off between the spreading of the pandemic and the economic losses. To this end, some previous works focused on the use of simulators to estimate the virus propagation and the economic losses given a policy [7, 16]. In this setting, a policy decides the safety measures to adopt in a given scenario. However, while these works have been proven able to find policies with better trade-offs between economic losses and pandemic spread than the ones used by governments, they do not employ interpretable AI methods. Thus, even though such models do perform very well, their applicability is limited due to the lack of understandability. More specifically, the main drawbacks of non-interpretable approaches for this task are: a) the trained models act as "oracles" and, thus, a policymaker cannot understand the rationale underlying a decision; and b) such models cannot be inspected and, thus, their behavior cannot be formally specified. In this work, we propose interpretable models for policy-making in pandemics. These models, which combine decision trees (DTs) trained by means of grammatical evolution with Q-learning, are very simple and objectively interpretable. Moreover, our solutions exhibit better performance w.r.t. non-interpretable state-of-the-art models.
Since our evolved policies are both more effective and more interpretable than existing black-box models, they are particularly suitable for devising policies that control the pandemic while avoiding unnecessary economic damage. This paper is structured as follows. The next section gives a short review of the state of the art. Section 3 describes the method used to evolve interpretable DTs. In Section 4 we show the experimental results and, in Section 5, we interpret the trees obtained. Finally, in Section 6 we draw the conclusions of this work and suggest future work.

In [7], the authors propose a simulator of pandemics that is able to simulate a population at a very fine granularity, modeling the activities that each agent can perform, as well as various types of economic activities. The goal of this simulator is to test policies with the objective of simultaneously minimizing the spreading of the pandemic and the economic damage. The authors test various hand-crafted policies, policies adopted by a few countries, and a deep reinforcement learning based policy. The results show that the deep reinforcement learning based policy is able to outperform both the hand-crafted policies and the governmental ones. In [16], the authors propose a simulator to estimate the effects of closing economic activities on both the spreading of the pandemic and the economic losses. The proposed simulator is tailored to the U.S. economy, simulating all the 51 states and a central entity that manages subsidies. The main difference between the two approaches above is that in [7] the simulator acts at a very small scale, but with a high level of detail, being able to simulate aspects of everyday life, while in [16] the simulator aims to work at a very large scale with a lower level of detail. In [9], the authors use a surrogate-assisted evolutionary process to optimize an agent that has to apply restrictions to avoid the spreading of the pandemic. They make use of two neural networks: one that acts as the policy, and another, trained on real data, that estimates the quality of a policy. While the performance of this approach is promising, the use of black-box policies such as neural networks makes the real-world application of such systems difficult [11].

In recent years, the need for interpretable AI has emerged. Interpretable AI [1] is defined as the branch of AI that focuses on models that are inherently interpretable by humans. This type of model is extremely important in high-stakes and safety-critical scenarios, as the ability to inspect the model and understand its behavior becomes crucial [11]. While the field of interpretable AI is making progress, some of its sub-fields are still seen as crucial to its development. One of these is interpretable reinforcement learning [12]. Lately, several works have approached the problem of interpretable reinforcement learning. Silva et al. [15] make use of differentiable DTs that are trained by means of the proximal policy optimization algorithm [14]. While this approach has shown very promising performance in the tested environments, its performance degrades significantly when transforming the differentiable DT into a regular DT. Dhebar et al. [5] propose an evolutionary approach for producing DTs with nonlinear splits (i.e., hyperplanes defined by conditions) for reinforcement learning tasks. This approach is able to obtain very good performance in the tested tasks.
However, the high non-linearity of the splits hinders the interpretability of the DTs obtained. In [2], the authors propose a two-level optimization approach that combines grammatical evolution [13] and Q-learning [18] for evolving DTs for reinforcement learning tasks. The approach is tested on classic reinforcement learning tasks, and the results show that the systems obtained are competitive w.r.t. non-interpretable state-of-the-art models while being very easy to interpret. However, this approach only works in environments with discrete action spaces. In [3], the authors extend the approach proposed in [2] to make it work in scenarios with continuous action spaces. To do so, they make use of a co-evolutionary approach [10] that allows them to simultaneously evolve decision trees and "pools" of continuous actions. In [4], the authors use genetic programming [8] and CMA-ES [6] to evolve interpretable DTs that are able to work in RL settings with images as input. However, the experimental results show that the proposed approach exhibits good performance only in scenarios that are not affected by noise.

Since our goal is to evolve general policies that minimize the spread of the pandemic and, at the same time, reduce the economic losses, we employ the simulator proposed in [7]. In fact, this simulator allows us to test rules that are applicable in every country (i.e., not tailored only to the U.S. as in [16]) and that do not make use of economic subsidies. All the observation variables are normalized by means of a min-max normalization before being fed to the policy-making agent. Note that the simulator returns a noisy version of the estimates of these variables, to simulate a real-world scenario in which not all the test results are known. (Two of these variables are always returned with the same value by the simulator; however, for consistency with the experiments reported in [7], we kept both as inputs to the policy-making agent.) A more detailed list of the available actions is described in [7]. At each step, the agent receives a reward that penalizes both the number of patients in critical condition, normalized by the capacity of the hospitals, and the stringency level of the restrictions in force. Thus, the reward function aims to trade off the number of patients in critical condition against the stringency of the restrictions. Moreover, this function indirectly induces the agent to also minimize the closing of stores, limiting the economic losses.

Since the action space of this environment is discrete and the interpretability of the policies is crucial for their application in real-world scenarios, we employ the method described in [2] for evolving policies in the form of DTs. Thus, we use grammatical evolution to evolve DTs with empty leaves, i.e., leaves that are not yet associated with any action. In the fitness evaluation phase, the leaves are then trained by means of Q-learning [18], exploiting the reward signals coming from the environment. A graphical illustration of the fitness evaluation phase is shown in Figure 1. The evolutionary process is iterated for 50 generations using a population of 45 individuals. All the parameters (including the ones shown in the following subsections, e.g., mutation rate and tournament size) were empirically tuned to balance performance and computational cost. We perform 10 i.i.d. experimental runs to assess the statistical repeatability of our results. Each individual is encoded as a list of integers ranging from 0 to a maximum value that is significantly higher than the maximum number of productions in the grammar; in our experiments, this maximum value is set to 4·10^3.
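To make the interplay between the evolved tree structure and the leaf-level learning more concrete, the following minimal Python sketch shows one possible implementation of a leaf whose action is learned online with tabular ε-greedy Q-learning, in the spirit of [2]. The class and method names (QLeaf, select_action, update) are illustrative and not taken from the original code, and the discount factor gamma is an assumption, since it is not reported here; the other hyperparameter values mirror those given in the following subsections.

```python
import random

class QLeaf:
    """Decision-tree leaf whose action is learned via tabular Q-learning.

    The evolved tree only decides *which* leaf an observation is routed to;
    the action returned by that leaf is learned during fitness evaluation
    from the reward signal of the environment.
    """

    def __init__(self, n_actions, epsilon=0.05, alpha=1e-3):
        self.epsilon = epsilon  # exploration probability (epsilon-greedy)
        self.alpha = alpha      # learning rate
        # Q-values randomly initialized in [-1, 1], one per discrete action
        self.q = [random.uniform(-1.0, 1.0) for _ in range(n_actions)]
        self.last_action = None

    def select_action(self):
        """Epsilon-greedy selection over this leaf's Q-values."""
        if random.random() < self.epsilon:
            self.last_action = random.randrange(len(self.q))
        else:
            self.last_action = max(range(len(self.q)), key=lambda a: self.q[a])
        return self.last_action

    def update(self, reward, next_leaf=None, gamma=0.9):
        """Q-learning update, treating the leaf visited at the next step as
        the next state (gamma is an assumed discount factor)."""
        bootstrap = gamma * max(next_leaf.q) if next_leaf is not None else 0.0
        a = self.last_action
        self.q[a] += self.alpha * (reward + bootstrap - self.q[a])
```

In this scheme, each (normalized) observation is routed by the evolved conditions to one such leaf, the leaf picks an action ε-greedily, and the reward returned by the simulator drives the update; the fitness of the individual is then the return accumulated over the evaluation episodes.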
Genotype-to-phenotype mapping. Each genotype (i.e., a list of integers) is transformed into its corresponding phenotype, i.e., a Python program that encodes the corresponding DT by means of "if-then-else" statements. To do so, we make use of the grammar shown in Table 1. Starting from the first gene and from a string containing only the rule "dt", each gene is used to replace the first rule currently found in the string with its (g mod k)-th production, where g is the value of the current gene and k is the number of productions (i.e., the number of choices) associated with the current rule. Note that the leaves of the resulting tree are not defined, i.e., they do not have a fixed action. This allows us to perform Q-learning in the fitness evaluation phase, as mentioned above.

Table 1: Grammar used to produce the decision trees.

To select the individuals for mutation/crossover, we use tournament selection with tournament size 2. The mutation operator is a uniform mutation, which replaces a gene with a new integer uniformly sampled in the allowed gene range [0, 4·10^3]. The mutation operator is applied with probability 1 (i.e., to all the individuals in the population), while the mutation rate (i.e., the probability of mutating each gene) is 10%. Since crossover proved to be too "destructive" in preliminary experiments, we do not employ it in the experiments reported below. By destructive, we mean that, by means of this operator, the offspring can be significantly different from both parents, hindering the exploitation of good structures that emerged during the evolutionary process. To increase the exploitation capabilities of our approach, we introduce a replacement operator, which replaces a parent with an offspring only if the offspring has strictly better fitness than the parent.

The fitness of each phenotype is determined by using the corresponding policy to make decisions in the PandemicSimulator environment. To learn actions for the leaves, we employ ε-greedy Q-learning, with learning rate α = 10^-3, ε = 0.05, and random initialization of the Q-values in [-1, 1]. Moreover, we evaluate each phenotype in 10 independent episodes, each of which simulates 100 days of pandemic.

Table 2 shows the scores obtained by the best agent of each independent run. Here, we differentiate between training and testing returns. Training returns are the returns obtained by the agent during the training process, while testing returns are the returns obtained by the agent after the evolutionary process, i.e., when learning is disabled. Moreover, we also measure the interpretability of our solutions. To do so, we employ the metric of interpretability proposed in [17], reworked as done in [2]. This metric is computed from the number of symbols appearing in the formula, the number of operations contained in the model, the number of non-arithmetical operations, and the maximum number of consecutive compositions of non-arithmetical operations; we refer the reader to [2] for the exact weighting of these terms.

Figure 2, instead, compares the results obtained with the best DT produced by our method (i.e., the one corresponding to seed 3 in Table 2, which obtained the highest mean test return) to the policies reported in [7]. In particular, the policies used for the comparison are the following: (1) S0-4-0: starts with stage 0, then, once 10 people have been infected, it switches to stage 4 and, after 30 days, it returns to stage 0.
(2) S0-4-0FI: similar to S0-4-0, but the return from stage 4 to stage 0 is performed gradually, by reducing the restrictions by one stage every 5 days.

From Figure 2, we can see that our best DT outperforms all the hand-crafted policies under comparison (note that the reward is negative and has to be maximized). The statistical difference between the best evolved decision tree and the hand-crafted policies has been confirmed by a two-sided Wilcoxon test with α = 0.05. As for the deep reinforcement learning based policy presented in [7], we could not test it directly, since neither the model nor its numerical results are publicly available; however, from the plots reported in the original paper, we estimate a cumulative reward of about -5, which is substantially lower than the cumulative reward obtained by our best DT.

Finally, in Figure 3 we compare the policy obtained by means of our best DT with the ones implemented by the Italian and Swedish governments (for which we use the implementations provided in [7]). These policies act as follows: (1) ITA (approximation of the restrictions adopted by the Italian government): increases the restrictions gradually from stage 0 to stage 4 and then gradually returns to stage 2. (2) SWE (approximation of the restrictions adopted by the Swedish government): applies stage 0 regulations for 3 days and then stage 1 restrictions for the rest of the simulation. Also in this case, we observe that the performance of our best DT is significantly better than that of the compared policies; again, a two-sided Wilcoxon test with α = 0.05 confirms the statistical significance of the results. Moreover, we observe that the number of people infected when using our proposed policy is significantly smaller than with the other approaches. This fact, combined with the very high rewards obtained by our policy, suggests that the policy minimizes the economic losses by trying to stop the pandemic at the beginning, and then making the restrictions less stringent. In the next section, we analyze the policy to understand the reasons underlying its significantly higher performance.

The best DT obtained is shown in Figure 4. While the size of the DT is quite limited, its performance is extremely satisfactory. We can see that the DT makes use of two conditions to compute its output. The first condition (the root) checks whether the fraction of never-infected people is greater than 90%. If so, the second condition is evaluated; otherwise, stage 2 restrictions are applied, regardless of the other variables. In essence, the combination of the two branches of the root aims to induce a "strong" response to the initial wave of the pandemic, which is then slightly relaxed once the number of total cases increases (an illustrative plain-code rendering of this structure is sketched below). Thus, the overall strategy learned by the agent is to stop the pandemic at the beginning, keeping slightly less stringent restrictions after the initial wave to avoid new waves.

In this paper, we leverage interpretable AI methodologies to train interpretable policies for managing a pandemic. Despite the commonly assumed trade-off between interpretability and performance, the obtained policies proved to be significantly better than hand-crafted policies, deep learning policies, and government policies. It is important to note that the results shown above have been obtained in a simulated scenario and, thus, their applicability to real-world scenarios must still be assessed.
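As an illustration of the degree of interpretability achieved, a policy with the structure of the best DT described above can be rendered as a handful of lines of plain Python. In the sketch below, only the root condition (fraction of never-infected people above 90%) and the stage-2 fallback follow the description of Figure 4 given in the text; the inner condition, the dictionary keys, and the stages returned in the upper branch are hypothetical placeholders, since the exact thresholds of Figure 4 are not reproduced here.

```python
def pandemic_policy(obs):
    """Illustrative rendering of an evolved DT policy as if-then-else code.

    `obs` is assumed to be a dict of min-max-normalized simulator variables
    (the key names are hypothetical); the returned integer is the regulation
    stage, from 0 (no restrictions) to 4 (strictest).
    """
    if obs["never_infected"] > 0.9:
        # Early phase: respond strongly to the initial wave. The actual
        # second condition of Figure 4 is not reproduced here; this test
        # and the returned stages are placeholders.
        if obs["critical"] > 0.5:
            return 4
        return 3
    # Once more than 10% of the population has been infected, keep the
    # slightly less stringent stage-2 restrictions (as in the evolved DT).
    return 2
```

Policies of this form can be read, audited, and, if needed, amended by a policymaker, which is precisely the property that black-box models lack.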
A limitation of the current work is the simulation scenario adopted: in our work, we simulate a population of 1000 people. The assessment of the scalability of our results is left as future work. Other future work includes testing the proposed methodology on different simulators with different objectives, and adopting more realistic scenarios w.r.t. economic aspects (e.g., subsidies) or epidemiological aspects (e.g., the possible emergence of new variants).

References
[1] Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI.
[2] Evolutionary learning of interpretable decision trees.
[3] A co-evolutionary approach to interpretable reinforcement learning in environments with continuous action spaces, 2021.
[4] Interpretable pipelines with evolutionarily optimized modules for RL tasks with visual inputs.
[5] Interpretable-AI policies using evolutionary nonlinear decision trees for discrete action systems.
[6] Adapting arbitrary normal mutation distributions in evolution strategies: The covariance matrix adaptation.
[7] Reinforcement learning for optimization of COVID-19 mitigation policies, 2020.
[8] Genetic programming: On the programming of computers by means of natural selection.
[9] From prediction to prescription: Evolutionary optimization of nonpharmaceutical interventions in the COVID-19 pandemic.
[10] A cooperative coevolutionary approach to function optimization. In Parallel Problem Solving from Nature (PPSN III).
[11] Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead.
[12] C. Rudin, C. Chen, Z. Chen, H. Huang, L. Semenova, and C. Zhong. Interpretable machine learning: Fundamental principles and 10 grand challenges, 2021.
[13] Grammatical evolution: Evolving programs for an arbitrary language.
[14] Proximal policy optimization algorithms.
[15] Optimization methods for interpretable differentiable decision trees applied to reinforcement learning.
[16] Building a foundation for data-driven, interpretable, and robust policy design using the AI Economist.
[17] Learning a formula of interpretability to learn interpretable formulas.
[18] Learning from delayed rewards.