key: cord-0637537-ktjbgcnw authors: Fan, Xiuyi; Liu, Siyuan; Chen, Jiarong; Henderson, Thomas C. title: An Investigation of COVID-19 Spreading Factors with Explainable AI Techniques date: 2020-05-05 journal: nan DOI: nan sha: e1060735f3b98f654b029a778cd75c1c83ee118d doc_id: 637537 cord_uid: ktjbgcnw Since COVID-19 was first identified in December 2019, various public health interventions have been implemented across the world. As different measures were implemented in different countries at different times, we assess the relative effectiveness of the measures implemented in 18 countries and regions using data from 22/01/2020 to 02/04/2020. We compute the top one and two measures that were most effective for the countries and regions studied during this period. Two Explainable AI techniques, SHAP and ECPI, are used in our study: we construct (machine learning) models for predicting the instantaneous reproduction number ($R_t$) and use these models as surrogates to the real world; the inputs with the greatest influence on our models are seen as the measures that are most effective. Across the board, city lockdown and contact tracing are the two most effective measures. For ensuring $R_t<1$, the public wearing face masks is also important. Mass testing alone is not the most effective measure, although it can be effective when paired with other measures. Warm temperature helps reduce transmission. Technologies that support such contact tracing while providing privacy protection should be seriously considered; promoting mask use and ensuring its supply should also be considered. COVID-19 has spread globally for more than three months since it was first identified in December 2019. Shortly after its initial identification, various control measures were implemented in different countries to contain and delay the pandemic; at the moment, some of the less affected countries are considering implementing these measures.
Thus, a major question to be answered is: among these control measures, which are the more effective ones at different stages of the disease's development? The overarching goal of this work is to identify factors that are most influential in controlling the spread of the disease. $R_t$, the average number of secondary cases generated by one primary case with symptom onset on day $t$, is one of the most important quantities used to measure epidemic spread. If $R_t > 1$, then the epidemic is expanding at time $t$, whereas if $R_t < 1$, then it is shrinking at time $t$. We want to identify factors that are most influential in controlling $R_t$. Explainable AI (XAI) is a rising field in AI. In addition to making accurate predictions, XAI systems "explain" their predictions (1-3, 9, 12, 13). The development of XAI is motivated by the need to build trustworthy systems and to reveal insights from data. Although the field is still in its infancy, several XAI techniques, such as Shapley additive explanations (SHAP) (8) and Explainable Classification with Probabilistic Inferences (ECPI) (4), have been developed in recent years for identifying decisive features in prediction tasks. These techniques are data-driven; they "explain" a prediction by pointing out the factors that are "most significant" for the prediction, purely based on the data provided. Much effort has been put into COVID-19 data collection by the global community. As of early April 2020, time series data containing daily confirmed cases in more than 100 countries are publicly available at online repositories. With the help of Internet search engines, one can identify the control measures implemented in different countries and their respective implementation times. Using such data, together with Machine Learning (ML) techniques, we construct classification models that predict $R_t$. We then apply XAI techniques to examine the developed ML models and identify the key factors that are most influential to their predictions.
In this way, the constructed ML models serve as surrogates to the real world, and identifying effective factors in controlling $R_t$ becomes explaining the classifications given by the developed ML models.

Data Preparation. Our analysis is based on the following information:
• Implementation dates of control measures as shown in Table 1.
• The daily reported number of confirmed cases from 22/01/2020 to 02/04/2020 in the countries and regions shown in Table 1.
• Temperature and humidity during our study period in these countries and regions.

From the daily reported numbers of confirmed cases, we apply a method of estimating $R_t$ reported in Flaxman, et al. (5). We start by introducing a serial interval distribution, which models the time between a person getting infected and him/her subsequently infecting another person, as a Gamma distribution $g$ with mean 7 and standard deviation 4.5 (these two parameters are reported in (14)); we also assume this serial interval distribution is the same for all countries and regions studied in this work. The number of new infections $c_t$ on a given day $t$ is given by the following discrete convolution function:

$$c_t = R_t \sum_{\tau=1}^{t-1} c_\tau \, g_{t-\tau} \qquad (1)$$

where $g_s = \int_{s-0.5}^{s+0.5} g(\tau)\,d\tau$ for $s = 2, 3, \ldots$, $g_1 = \int_{0}^{1.5} g(\tau)\,d\tau$, and $c_\tau$ is the number of new infections on day $\tau$. Thus, new infections identified on day $t$ depend on the numbers of new infections on the days prior to $t$, weighted by the discretized serial interval distribution, which is the aforementioned Gamma distribution. Solving Equation 1 for $R_t$, we have:

$$R_t = \frac{c_t}{\sum_{\tau=1}^{t-1} c_\tau \, g_{t-\tau}} \qquad (2)$$

Since $c_t$ and $c_\tau$ are available directly from our data (e.g., $c_t$ is the difference between the cumulative confirmed cases on day $t$ and on day $t-1$) and $g_{t-\tau}$ can be obtained by integrating the Gamma distribution, we now have a way to compute $R_t$ for all countries on all days between 22/01/2020 and 02/04/2020.
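As a concrete illustration, the $R_t$ estimation described above can be sketched in Python. This is a minimal sketch assuming only the stated Gamma serial interval (mean 7, standard deviation 4.5) and the discretization of $g$ given in the text; it is not the authors' actual implementation, and the function names are our own.

```python
import numpy as np
from scipy.stats import gamma

# Serial interval: Gamma with mean 7 and standard deviation 4.5,
# reparameterised to scipy's shape/scale convention.
MEAN, SD = 7.0, 4.5
shape = (MEAN / SD) ** 2      # shape k = (mean / sd)^2
scale = SD ** 2 / MEAN        # scale theta = sd^2 / mean
g = gamma(a=shape, scale=scale)

def discretised_serial_interval(t_max):
    """g_1 = integral of g over [0, 1.5]; g_s over [s-0.5, s+0.5] for s >= 2."""
    gs = np.zeros(t_max + 1)
    gs[1] = g.cdf(1.5)
    for s in range(2, t_max + 1):
        gs[s] = g.cdf(s + 0.5) - g.cdf(s - 0.5)
    return gs

def estimate_rt(new_cases):
    """new_cases[i] is the number of new infections on day i+1.
    Returns R_t per day (NaN where the estimate is undefined)."""
    T = len(new_cases)
    gs = discretised_serial_interval(T)
    rt = np.full(T, np.nan)
    for t in range(1, T):
        # Denominator of Equation 2: prior days' cases weighted by g.
        denom = sum(new_cases[tau] * gs[t - tau] for tau in range(t))
        if denom > 0:
            rt[t] = new_cases[t] / denom
    return rt
```

On a series of daily case counts that doubles every day, this estimator returns $R_t > 1$; on a halving series it returns $R_t < 1$, matching the interpretation of $R_t$ given above.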
We then compose a data set in tabular form where each row describes the information for one country/region on one day: the number of new confirmed cases on that day, the days elapsed since each of the implemented control measures, and the temperature and humidity of that day. $R_t$ is added to every row of the data set and later used as the prediction target. Since $R_t$ calculated this way is sensitive to noise in the number of new infection cases on a given day, and the data set we use contains imperfections (e.g., for the United Kingdom, both 14/03/2020 and 15/03/2020 have 1140 confirmed cases, so there is no increase on 15/03/2020), we run a sliding-window mean filter with radius 1 on the data for noise removal. Moreover, as $R_t$ calculated from Equation 2 assumes a reasonably large $t$ (otherwise both $c_\tau$ and $g_{t-\tau}$ would be too small, resulting in an artificially large $R_t$), we drop entries with fewer than 20 confirmed cases. In other words, we only use data from days on which a country/region has accumulated more than 20 confirmed cases; and, as the number of confirmed cases is monotonically increasing, there are no "skipped" dates. For instance, our Singapore cases start on 03/02/2020, Japan cases start on 02/02/2020, and Germany cases start on 25/02/2020. Figure 1 illustrates $R_t$ computed by the presented method for Japan, Singapore, Australia and Hubei. A fraction of this data set is shown in Table 2. Since we aim to obtain easily interpretable qualitative results, we further discretize our data into category intervals, with each category represented by an integer 0...4. Similarly, temperature and humidity are discretized into 4 and 3 categories, respectively. Table 3 shows the result of discretizing the data shown in Table 2.
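The smoothing and discretization steps can be sketched as follows. The bin edges below are illustrative placeholders only (the paper's actual category intervals are those of its Table 3), and the function names are our own.

```python
import numpy as np

def mean_filter_radius1(x):
    """Sliding-window mean with radius 1 (window of 3;
    the first and last entries use shorter windows)."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    for i in range(len(x)):
        lo, hi = max(0, i - 1), min(len(x), i + 2)
        out[i] = x[lo:hi].mean()
    return out

def discretise(value, bin_edges):
    """Map a continuous value to an integer category: values below the
    first edge map to 0, values at or above the last edge to len(bin_edges)."""
    return int(np.digitize(value, bin_edges))

# Hypothetical bin edges producing 5 categories (0..4) for R_t:
rt_edges = [0.5, 1.0, 2.0, 4.0]
```

For example, `discretise(0.7, rt_edges)` yields category 1 and `discretise(5.0, rt_edges)` yields category 4 under these placeholder edges.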
We formulate the factor importance problem as two binary Explainable Classification problems: given a data entry (row) as shown in Table 3, classify whether its $R_t$ is greater than some threshold $\theta$; if an entry is classified as negative ($R_t$ less than or equal to $\theta$), then identify the features that are most influential to that classification. SHAP is based on the Shapley value, a game-theoretic concept that assigns a unique distribution of the total surplus generated by the coalition of all players in a cooperative game. In our context, each factor together with its value, e.g., NC = 0, GA = 4, MU = 3, etc., is viewed as a "player" in a game whose outcome is one of the two classes ($R_t$ either greater than $\theta$ or not). The Shapley value of each feature value describes its "contribution" to the outcome classification. ECPI is a probabilistic-logic-based explainable classification algorithm. ECPI maps a dataset into a knowledge base in probabilistic logic and performs classification with probabilistic logic inference. ECPI computes an explanation by identifying the subset of feature values that gives the same inference result as the full set (or as close as possible to the full set when inferring the same result is not possible). In our context, roughly speaking, from the feature values NC = 0, GA = 4, MU = 3, etc., one deduces that $R_t$ is in some class (either $R_t < \theta$ or not); the subset of these feature values that also infers $R_t$ in the same class is then the explanation. Both SHAP and ECPI identify key features for prediction instances, with SHAP being a model-agnostic method that computes only feature importance and ECPI an interpretable model that makes the prediction as well.
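To make the Shapley-value idea concrete, the following is a minimal, brute-force computation of exact Shapley values for a toy two-factor value function. The factor names (CL for city lockdown, CT for contact tracing) and the value function are purely illustrative assumptions; practical SHAP libraries approximate this computation rather than enumerating all coalitions.

```python
from itertools import combinations
from math import factorial

def shapley_values(features, value_fn):
    """Exact Shapley values via the coalition formula: each feature's value
    is its marginal contribution averaged over all coalitions, weighted by
    |S|! * (n - |S| - 1)! / n!."""
    n = len(features)
    phi = {f: 0.0 for f in features}
    for f in features:
        others = [g for g in features if g != f]
        for k in range(n):
            for coalition in combinations(others, k):
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                s = frozenset(coalition)
                phi[f] += weight * (value_fn(s | {f}) - value_fn(s))
    return phi

# Toy value function: probability that a model predicts R_t < 1
# given the coalition of "active" measures (invented numbers).
def v(coalition):
    base = 0.1
    if "CL" in coalition:
        base += 0.5
    if "CT" in coalition:
        base += 0.3
    return base

phi = shapley_values(["CL", "CT"], v)
```

Because this toy value function is additive, each factor's Shapley value equals its individual contribution (0.5 for CL, 0.3 for CT); with interacting factors the coalition-weighted average is what distributes the "surplus" fairly.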
Technically, two major differences between SHAP and ECPI are that (1) SHAP considers features individually when evaluating their "contribution", whereas ECPI considers all coalitions among features; and (2) SHAP estimates the Shapley characteristic function, the function describing the total expected sum of payoffs the players can obtain by cooperation, from the data, whereas ECPI does not perform this estimation. Since our goal is to obtain explanations for cases where $R_t < \theta$, the entire dataset is used for training the ML models. Then, for each entry in the dataset with $R_t < \theta$, we compute explanations. With SHAP and ECPI, we study the two classification cases for $\theta$ = 1, 2. For each case, we compute the top $k$ ($k$ = 1, 2) features that are the most influential. There are 228 and 435 entries with $R_t < 1$ and $R_t < 2$, respectively. The results are shown in Figure 2. Since COVID-19 was identified in December 2019, a few works have been dedicated to understanding the effect of non-pharmaceutical countermeasures. Leung et al. (7) estimated $R_t$ in four Chinese cities and ten provinces from mid-January to mid-February. They found that although aggressive non-pharmaceutical interventions such as city lockdown caused the first wave of COVID-19 outside of Hubei to abate, control measures should be relaxed gradually. Our result that city lockdown is the most effective measure confirms their findings. In (10), the authors evaluated the effectiveness of surveillance and containment measures for the first 100 patients with COVID-19 in Singapore. The surveillance strategy in Singapore includes applying the case definition at medical consultations, tracing contacts of patients, enhancing surveillance among different patient groups, and allowing clinician discretion. It was found that rapid identification and isolation of cases, quarantine of close contacts, and active monitoring of other contacts have been effective in suppressing expansion of the outbreak.
Our finding that contact tracing is the second most effective measure overall (after city lockdown) confirms their results. In (6), the authors studied real-time mobility data from Wuhan and detailed case data, including travel history, to investigate the role of travel restrictions in limiting the spread of COVID-19. It was found that travel restrictions are particularly useful in the early stage of an outbreak, while their effect drops once the outbreak is more widespread. A combination of interventions may be necessary, though the individual role of each intervention is yet to be determined. Our results show that an international travel ban, although concerned, like the intercity travel restrictions in their work, with the effectiveness of restricting travel, is not nearly as effective as other measures such as city lockdown and contact tracing. In (11), the authors built an age-structured susceptible-exposed-infected-removed (SEIR) model to estimate the effect of physical distancing measures, such as extended school closures and workplace distancing, on the progression of the COVID-19 epidemic. It was found that sustaining these measures is effective in reducing the size of the epidemic. Our results, especially the ones from ECPI, show that school closure has a positive effect in lowering $R_t$. As a data-driven modeling approach, our work is limited by a number of factors. Firstly, all results are based on data collected from the selected 18 countries and regions during the period 22/01/2020 to 02/04/2020; although the results might be generalizable, they are about these regions during that period. Thus, when applying these results to other regions and other times, they should be viewed as indicative. Secondly, the data used is inherently ambiguous; e.g., "contact tracing" and "mass testing" have been implemented in different countries, but it is unlikely that the same standard has been applied.
Thus, although our methods are quantitative, due to the qualitative nature of the data, one should read our results qualitatively. Thirdly, we rely on the calculated $R_t$ to label our data, which is then used to construct our models. The calculation used is reported in (5), with parameters found in (14). We are aware that $R_t$ is an estimate that can be approximated with more than one method, and different results might be obtained if $R_t$ were estimated differently; some authors, such as (7), give a much smaller estimate of $R_t$ for Beijing in January (they estimate $R_t$ to be close to 0.5, whereas our calculation shows it is greater than 2, although ours also drops to below 0.5 after February 10, the same as theirs). In conclusion, we applied two machine learning and explainable AI methods to study the influence of factors affecting the spread of COVID-19. We find city lockdown and contact tracing to be the two most effective control measures, surpassing mass testing, school closure, international travel bans and mask use. As countries consider lifting city lockdowns, to prevent disease resurgence, effort should be put into developing privacy-preserving, practical and effective contact tracing techniques. Classification Performance. Since the reliability of our explanation results depends on the quality of our ML models, which in turn depends on the quality of the data and the discretization process we used, we first present some classification results. The two datasets are randomly divided into a 90%/10% split, with 90% used for training and 10% for testing. We present results using a random forest classifier with 100 trees, a neural network with two hidden layers of 12 and 10 nodes, and ECPI. We measure performance with precision and recall, defined as precision = TP/(TP+FP) and recall = TP/(TP+FN), where TP, FP, and FN denote the numbers of true positives, false positives, and false negatives, respectively. The precision and recall values are shown in Table 4. These results show that effective models can be built with the dataset and that our ECPI model gives indicative predictions.
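The evaluation protocol above (90%/10% split, 100-tree random forest, precision and recall) can be sketched as follows on synthetic stand-in data; the feature table and label rule here are invented for illustration and are not the paper's dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(0)

# Synthetic stand-in for the discretised table: 6 integer-category
# features in 0..4. The label rule (a stand-in for "R_t < 1") depends
# deterministically on the first two features, so it is learnable.
X = rng.integers(0, 5, size=(600, 6))
y = (X[:, 0] + X[:, 1] > 4).astype(int)

# 90% / 10% train/test split, as in the paper's evaluation.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.1, random_state=0)

# Random forest with 100 trees, as in the paper.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_tr, y_tr)
pred = clf.predict(X_te)

precision = precision_score(y_te, pred)   # TP / (TP + FP)
recall = recall_score(y_te, pred)         # TP / (TP + FN)
```

On this easily learnable synthetic rule the forest scores well on both metrics; on the paper's real, noisier data the corresponding numbers are those of its Table 4.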
Explanation and justification in machine learning: A survey
What does explainable AI really mean? A new conceptualization of perspectives
Streaming weak submodularity: Interpreting neural networks on the fly
Explainable AI for classification using probabilistic logic inference