key: cord-0577059-c55udaq8 authors: Smith, Jonathan; Ghotbi, Borna; Yi, Seungeun; Parsapoor, Mahboobeh title: Non-Pharmaceutical Intervention Discovery with Topic Modeling date: 2020-09-10 journal: nan DOI: nan sha: be442c34f4158da7e7f22624eb526c7f0868b1a0 doc_id: 577059 cord_uid: c55udaq8

We consider the task of discovering categories of non-pharmaceutical interventions during the evolving COVID-19 pandemic. We explore topic modeling on two corpora with national and international scope. These models discover existing categories when compared with human intervention labels while reducing the human effort needed.

The COVID-19 pandemic has seen varying government responses in countries around the world. More than 2.24 million confirmed cases and 152.5 thousand deaths worldwide (January 1, 2020 to April 19, 2020) (WHO, 2020) are attributed to the novel coronavirus (SARS-CoV-2). Due to the high transmission rate, non-pharmaceutical interventions (NPIs) play a critical role in stemming the spread of the virus (Ferguson et al., 2020). Studies assess the effectiveness of interventions in "flattening the curve" (Koo et al., 2020; Tuite et al., 2020), using individual or compartmental models (Biswas et al., 2014) to simulate scenarios for fixed parameters. Broad categories encompassing physical distancing, business closures, and travel restrictions are used and tailored to the specific country or region the model focuses on (Wang et al., 2020). More specific interventions are less likely to be tracked, leaving the effects of increased funding, stratified closure, and administrative flexibility unestimated.

Increasing the granularity of recorded NPIs is limited by the manual labor involved in intervention identification and evaluation. This is seen in the categorizations used by four projects currently compiling data (McCoy et al., 2020; HIT COVID, 2020; ACAPS, 2020; Hale et al., 2020), which record 63, 23, 35, and 13 NPI types respectively.
In this study we leverage the availability of global text data on intervention announcements to automate the task of intervention category discovery, which can speed up the identification of interventions with strong effects on transmission.

Presented at the ICML 2020 Workshop on Machine Learning for Global Health. Copyright 2020 by the author(s).

We propose a method for using topic modeling to discover intervention categorizations. This method uses two well-known topic modeling algorithms, Latent Dirichlet Allocation (LDA) and Hierarchical Dirichlet Processes (HDP), combined with the unique structure of the NPI category discovery problem. We introduce evaluation methods for comparing unsupervised discovery models with existing intervention labels and demonstrate that our approach discovers known categories of NPIs, as seen in the selected topics in Table 1. Our method shows useful performance on datasets with global scope, suggesting that better categorizations can be learned by pooling knowledge across countries.

This paper uses two datasets: 1) the Oxford COVID-19 Government Response Tracker v4.0 (Hale et al., 2020) ($D_{OXF}$), which contains 13 intervention categories from 156 countries, and 2) the CAN-NPI dataset (McCoy et al., 2020) ($D_{CAN}$), which contains 63 intervention categories found in Canada. The former provides notes for each change to the status of an intervention, citing public announcement texts, quotes, email interactions with public health authorities, and links. In contrast, the latter provides more standardized full-text articles for each intervention, tagged at the provincial, territorial, and municipal level in a single country. The $D_{CAN}$ dataset also links all eligible interventions to the Oxford intervention categories. For consistency of results, the period of January 1, 2020 to April 19, 2020 is used for both datasets. Table 2 compares dataset size and composition.
It is also essential that international data is evenly sampled from across regions. In $D_{OXF}$, the average number of notes per country is similar across Asia (31), Europe (28), Africa (29), and the Americas (29).

A corpus is constructed from each dataset. We preprocess the documents in each dataset to remove the default English stopwords from NLTK (Bird et al., 2009) and geographic words (e.g., province or country names) taken from the data labels. Bigrams are constructed from word pairs that frequently occur together (e.g., public health) and are added as well. We apply LDA (Blei, 2012) and HDP (Teh et al., 2004) using gensim (Řehůřek & Sojka, 2010) to generate topics for each corpus.

Topic coherence, $C_v$, is calculated as a measure of topic interpretability using the co-occurrence of words and the cosine similarity between each top word and the total of all top words (Röder et al., 2015; Syed & Spruit, 2017). LDA is run to generate a specific number of topics, $K$, and coherence is calculated for each $K$. For LDA, the maximum coherence is associated with the near-optimal number of topics for a corpus. HDP determines the number of topics itself, so a single coherence is calculated over all of them. The online nature of these algorithms allows topics to be easily extended when more data is gathered, helping researchers respond quickly in a fast-moving pandemic.

We evaluate the suitability of a topic as an intervention category by comparing with existing labels from subject matter experts (SMEs). One metric is the mean cosine similarity ($S$) between weighted topic occurrence per document ($T \in [0, 1]^{D \times T}$) and label occurrence ($L \in [0, 1]^{D \times C}$) for $D$ documents, $C$ labels, and $T$ topics. This is calculated as: [equation not preserved in this text]. A higher similarity shows that topics are similar in structure to the given labels. We evaluate individual label discovery by assuming that each topic can correspond to at least one intervention category.
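The comparison between topics and labels can be illustrated with a small sketch. It is not the authors' implementation: the exact formula for S is not available here, so the code assumes one plausible reading, in which each label column is compared with every topic column by cosine similarity and S is the mean, over labels, of the best-matching topic's similarity. The function names and the toy matrices are hypothetical. The same similarity matrix also yields the number of distinct best-matching topics, which is the quantity the coverage measure is built on.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def column(matrix, j):
    """Return column j of a row-major matrix (one row per document)."""
    return [row[j] for row in matrix]


def similarity_matrix(topics, labels):
    """sim[c][t] = cosine similarity between label column c and topic column t."""
    n_topics = len(topics[0])
    n_labels = len(labels[0])
    return [
        [cosine(column(labels, c), column(topics, t)) for t in range(n_topics)]
        for c in range(n_labels)
    ]


def best_topic_per_label(sim):
    """Index of the most similar topic for each label."""
    return [max(range(len(row)), key=row.__getitem__) for row in sim]


def mean_best_similarity(sim):
    """ASSUMPTION: S taken as the mean over labels of the best topic similarity."""
    return sum(max(row) for row in sim) / len(sim)


def coverage(sim):
    """Number of distinct topics needed to give every label its best match."""
    return len(set(best_topic_per_label(sim)))


# Hypothetical toy data: 4 documents, 3 topics, 2 labels.
topics = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.1, 0.9, 0.0],
    [0.0, 0.2, 0.8],
]
labels = [
    [1, 0],
    [1, 0],
    [0, 1],
    [0, 1],
]

sim = similarity_matrix(topics, labels)
cov = coverage(sim)
print("best topic per label:", best_topic_per_label(sim))
print("mean similarity S:", round(mean_best_similarity(sim), 3))
print("coverage:", cov, "coverage ratio:", cov / len(labels[0]))
```

In this toy case each label is matched by a different topic, so both labels are covered by distinct topics. If two labels shared the same best topic, the coverage count would drop below the number of labels even when the mean similarity stayed high, which is why the two quantities are read together.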
To determine the coverage (Cov), we select the topic with the highest similarity for each label and count the number of distinct topics used to cover all labels. The coverage ratio (Cov(%)) is Cov/C. A high coverage ratio combined with high similarity indicates the discovery of label categories. Combining these metrics enables model performance evaluation without additional work by SMEs.

We summarize the results of category discovery with topic modeling on the $D_{CAN}$ and $D_{OXF}$ datasets in Table 3 using $K \in \{10, 25, 50, 100\}$. HDP leads to topics with significantly higher coherence, while a clear tradeoff must be made between overall similarity and coverage. Similarity drops less precipitously as the number of topics increases when using LDA as opposed to HDP. $D_{OXF}$ has a much smaller number of categories, so coverage converges early. In practice, a higher number of topics will be harder for humans to check for interventions but should lead to coverage of existing categories and aid discovery.

In this paper, we briefly demonstrate LDA- and HDP-based methods for NPI discovery on $D_{CAN}$ and $D_{OXF}$. As our results show, combining coherence, topic similarity, and coverage can provide intervention category suggestions for experts to explore and find new NPIs, leveraging national and international data. For future work, we will explore prediction-focused supervised LDA and identification of NPIs in an online manner.

ACAPS (2020). COVID-19: Government measures dataset - Humanitarian Data Exchange (version 1.0).
Bird et al. (2009). Natural Language Processing with Python.
Biswas et al. (2014). A SEIR model for control of infectious diseases with constraints. MBE.
Blei (2012). Probabilistic topic models.
Ferguson et al. (2020). Impact of non-pharmaceutical interventions (NPIs) to reduce COVID19 mortality and healthcare demand.
HIT COVID (2020). Health interventions tracking for COVID-19 (HIT-COVID) (version v0.3).
Koo et al. (2020). Interventions to mitigate early spread of SARS-CoV-2 in Singapore: a modelling study. The Lancet Infectious Diseases.
McCoy et al. (2020). CAN-NPI: A curated open dataset of Canadian non-pharmaceutical interventions in response to the global COVID-19 pandemic.
Röder et al. (2015). Exploring the space of topic coherence measures.
Syed & Spruit (2017). Full-text or abstract? Examining topic coherence scores using latent Dirichlet allocation.
Teh et al. (2004). Hierarchical Dirichlet processes.
Tuite et al. (2020). Mathematical modelling of COVID-19 transmission and mitigation strategies in the population of Ontario, Canada.
Wang et al. (2020). Evolving epidemiology and impact of non-pharmaceutical interventions on the outbreak of coronavirus disease 2019 in Wuhan, China.