TextEssence: A Tool for Interactive Analysis of Semantic Shifts Between Corpora
Denis Newman-Griffis, Venkatesh Sivaraman, Adam Perer, Eric Fosler-Lussier, Harry Hochheiser
2021-03-19

Embeddings of words and concepts capture syntactic and semantic regularities of language; however, they have seen limited use as tools to study characteristics of different corpora and how they relate to one another. We introduce TextEssence, an interactive system designed to enable comparative analysis of corpora using embeddings. TextEssence includes visual, neighbor-based, and similarity-based modes of embedding analysis in a lightweight, web-based interface. We further propose a new measure of embedding confidence based on nearest neighborhood overlap, to assist in identifying high-quality embeddings for corpus analysis. A case study on COVID-19 scientific literature illustrates the utility of the system. TextEssence is available from https://github.com/drgriffis/text-essence.

Distributional representations of language, such as word and concept embeddings, provide powerful input features for NLP models in part because of their correlation with syntactic and semantic regularities in language use (Boleda, 2020). However, the use of embeddings as a lens to investigate those regularities, and what they reveal about different text corpora, has been fairly limited. Prior work using embeddings to study language shifts, such as the use of diachronic embeddings to measure semantic change in specific words over time (Hamilton et al., 2016; Schlechtweg et al., 2020), has focused primarily on quantitative measurement of change, rather than interactive exploration of its qualitative aspects.
On the other hand, prior work on interactive analysis of text collections has focused on analyzing individual corpora, rather than facilitating inter-corpus analysis (Liu et al., 2012; Weiss, 2014; Liu et al., 2019). We introduce TextEssence, a novel tool that combines the strengths of these prior lines of research by enabling interactive comparative analysis of different text corpora. TextEssence provides a multiview web interface for users to explore the properties of and differences between multiple text corpora, all leveraging the statistical correlations captured by distributional embeddings. TextEssence can be used both for categorical analysis (i.e., comparing text of different genres or provenance) and diachronic analysis (i.e., investigating the change in a particular type of text over time). Our paper makes the following contributions:
• We present TextEssence, a lightweight tool implemented in Python and the Svelte JavaScript framework, for interactive qualitative analysis of word and concept embeddings.
• We introduce a novel measure of embedding confidence to mitigate embedding instability and quantify the reliability of individual embedding results.
• We report on a case study using TextEssence to investigate diachronic shifts in the scientific literature related to COVID-19, and demonstrate that TextEssence captures meaningful month-to-month shifts in scientific discourse.
The remainder of the paper is organized as follows. §2 lays out the conceptual background behind TextEssence and its utility as a corpus analysis tool. In §3 and §4, we describe the nearest-neighbor analysis and user interface built into TextEssence. §5 describes our case study on scientific literature related to COVID-19, and §6 highlights key directions for future research.

Computational analysis of text corpora can act as a lens into the social and cultural context in which those corpora were produced (Nguyen et al., 2020).
Diachronic word embeddings have been shown to reflect important context behind the corpora they are trained on, such as cultural shifts (Kulkarni et al., 2015; Hamilton et al., 2016; Garg et al., 2018), world events (Kutuzov et al., 2018), and changes in scientific and professional practice (Vylomova et al., 2019). However, these analyses have proceeded independently of work on interactive tools for exploring embeddings, which are typically limited to visual projections (Zhordaniya et al.; Warmerdam et al., 2020). TextEssence combines these directions into a single general-purpose tool for interactively studying differences between any set of corpora, whether categorical or diachronic.

When corpora of interest are drawn from specialized domains, such as medicine, it is often necessary to shift analysis from individual words to domain concepts, which serve to reify the shared knowledge that underpins discourse within these communities. Reified domain concepts may be referred to by multi-word surface forms (e.g., "Lou Gehrig's disease") and multiple distinct surface forms (e.g., "Lou Gehrig's disease" and "amyotrophic lateral sclerosis"), making them more semantically powerful but also posing distinct challenges from traditional word-level representations. A variety of embedding algorithms have been developed for learning representations of domain concepts and real-world entities from text, including weakly-supervised methods requiring only a terminology (Newman-Griffis et al., 2018); methods using pre-trained NER models for noisy annotation (De Vine et al., 2014; Chen et al., 2020); and methods leveraging explicit annotations of concept mentions (as in Wikipedia) (Yamada et al., 2020).[1] These algorithms capture valuable patterns about concept types and relationships that can inform corpus analysis (Runge and Hovy, 2020).
TextEssence only requires pre-trained embeddings as input, so it can accommodate any embedding algorithm suiting the needs and characteristics of specific corpora (e.g., availability of annotations or knowledge graph resources). Furthermore, while the remainder of this paper primarily refers to concepts, TextEssence can just as easily be used for word-level embeddings.

[1] The significant literature on learning embeddings from knowledge graph structure is omitted here for brevity.

Contextualized, language model-based embeddings can provide more discriminative features for NLP than static (i.e., non-contextualized) embeddings. However, static embeddings have several advantages for this comparative use case. First, they are less resource-intensive than contextualized models, and can be efficiently trained several times without pre-training to focus entirely on the characteristics of a given corpus. Second, the scope of what static embedding methods are able to capture from a given corpus has been well-established in the literature, but is an area of current investigation for contextualized models (Jawahar et al., 2019; Zhao and Bethard, 2020). Finally, the nature of contextualized representations makes them best suited for context-sensitive tasks, while static embeddings capture aggregate patterns that lend themselves to corpus-level analysis. Nevertheless, as work on qualitative and visual analysis of contextualized models grows (Hoover et al., 2020), new opportunities for comparative analysis of local contexts will provide fascinating directions for future research.

While embeddings are a well-established means of capturing syntax and semantics from natural language text (Boleda, 2020), the problem of comparing multiple sets of embeddings remains an active area of research. The typical approach is to consider the nearest neighbors of specific points, consistent with the "similar items have similar representations" intuition of embeddings.
This method also avoids the conceptual difficulties and low replicability of comparing embedding spaces numerically (e.g., by cosine distances) (Gonen et al., 2020). However, even nearest neighborhoods are often unstable, and vary dramatically across runs of the same embedding algorithm on the same corpus (Wendlandt et al., 2018; Antoniak and Mimno, 2018). In a setting such as our case study, the relatively small sub-corpora we use (typically less than 100 million tokens each) exacerbate this instability. Therefore, to quantify variation across embedding replicates and identify informative concepts, we introduce a measure of embedding confidence.[2]

We define embedding confidence as the mean overlap between the top k nearest neighbors of a given item between multiple embedding replicates. Formally, let E_1, ..., E_m be m embedding replicates trained on a given corpus, and let kNN_i(c) be the set of k nearest neighbors by cosine similarity of concept c in replicate E_i. Then, the embedding confidence EC@k is defined as:

EC@k(c) = \frac{2}{m(m-1)} \sum_{i < j} \frac{|kNN_i(c) \cap kNN_j(c)|}{k}

This calculation is illustrated in Figure 1. We can then define the set of high-confidence concepts for the given corpus as the set of all concepts with an embedding confidence above a given threshold. A higher threshold will restrict to highly stable concepts only, but exclude the majority of embeddings. We recommend an initial threshold of 0.5, which can be configured based on the observed quality of the filtered embeddings. After filtering for high-confidence concepts, we summarize nearest neighbors across replicates by computing aggregate nearest neighbors.

[2] An embedding replicate here refers to the embedding matrix output by running a specific embedding training algorithm on a specific corpus. Ten runs of word2vec on a given Wikipedia dump produce ten replicates; using different Wikipedia dumps would produce one replicate each of ten different sets of embeddings.
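As a concrete sketch of this computation (illustrative code, not drawn from the TextEssence repository): assuming each replicate is a dict mapping concept IDs to unit-normalized numpy vectors, EC@k is the mean pairwise overlap of top-k neighborhoods.

```python
import numpy as np

def knn(replicate, concept, k):
    """Top-k nearest neighbors of `concept` by cosine similarity.

    `replicate` maps concept IDs to unit-normalized vectors, so the
    dot product equals cosine similarity.
    """
    q = replicate[concept]
    sims = {c: float(np.dot(q, v)) for c, v in replicate.items() if c != concept}
    return set(sorted(sims, key=sims.get, reverse=True)[:k])

def embedding_confidence(replicates, concept, k=5):
    """EC@k: mean top-k neighborhood overlap over all pairs of replicates."""
    hoods = [knn(r, concept, k) for r in replicates]
    m = len(hoods)
    overlaps = [len(hoods[i] & hoods[j]) / k
                for i in range(m) for j in range(i + 1, m)]
    return sum(overlaps) / len(overlaps)
```

A concept embedded identically in every replicate yields EC@k = 1.0, while fully disjoint neighborhoods yield 0.0; the recommended 0.5 threshold sits between these extremes.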
The aggregate neighbor set of a concept c is the set of high-confidence concepts with the highest average cosine similarity to c over the embedding replicates. This helps to provide a more reliable picture of the concept's nearest neighbors, while excluding concepts whose neighbor sets are uncertain.

The workflow for using TextEssence to compare different corpora is illustrated in Figure 2. Given the set of corpora to compare, the user (1) trains embedding replicates on each corpus; (2) identifies the high-confidence set of embeddings for each corpus; and (3) provides these as input to TextEssence. TextEssence then offers three modalities for interactively exploring their learned representations: (1) Browse, an interactive visualization of the embedding space; (2) Inspect, a detailed comparison of a given concept's neighbor sets across corpora; and (3) Compare, a tool for analyzing the pairwise relationships between two or more concepts.

The first interface presented to the user is an overview visualization of one of the embedding spaces, projected into 2-D using t-distributed Stochastic Neighbor Embedding (t-SNE) (van der Maaten and Hinton, 2008). High-confidence concepts are depicted as points in a scatter plot and color-coded by their high-level semantic grouping (e.g., "Chemicals & Drugs," "Disorders"), allowing the user to easily navigate to an area of interest. The user can select a point to highlight its aggregated nearest neighbors in the high-dimensional space, an interaction similar to TensorFlow's Embedding Projector (Smilkov et al., 2016) that helps distinguish true neighbors from artifacts of the dimensionality reduction process. The Browse interface also incorporates novel interactions to address the problem of visually comparing results from several corpora (e.g., embeddings from individual months).
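The aggregate nearest neighbor computation defined at the start of this section can be sketched as follows (illustrative code, not from the TextEssence implementation; replicates are again dicts of unit-normalized numpy vectors):

```python
import numpy as np

def aggregate_neighbors(replicates, concept, high_conf, k=10):
    """Aggregate neighbor set: the high-confidence concepts with the highest
    average cosine similarity to `concept` across embedding replicates."""
    mean_sim = {
        c: float(np.mean([np.dot(r[concept], r[c]) for r in replicates]))
        for c in high_conf if c != concept
    }
    return sorted(mean_sim, key=mean_sim.get, reverse=True)[:k]
```

Because candidates are restricted to the high-confidence set, a neighbor that appears here is both consistently close to the query concept and itself stably embedded.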
The global structures of the corpora can differ greatly in both the high-dimensional and the low-dimensional representations, making visual comparison difficult. While previous work on comparing projected data has focused on aligning projections (Liu et al., 2020; Chen et al., 2018) and adding new comparison-focused visualizations (Cutura et al., 2020), we chose to align the projections using a simple Procrustes transformation and enable the user to compare them using animation. When the user hovers over a corpus thumbnail, lines are shown between the positions of each concept in the current and destination corpora, drawing attention to the concepts that shift the most. Upon clicking the thumbnail, the points smoothly follow their trajectory lines to form the destination plot. In addition, when a concept is selected, the user can opt to center the visualization on that point and then transition between corpora, revealing how neighboring concepts move relative to the selected one.

Once a particular concept of interest has been identified, the Inspect view presents an interactive table depicting how that concept's aggregated nearest neighbors have changed over time. This view also displays other contextualizing information about the concept, including its definitions (derived from the UMLS (Bodenreider, 2004) for our case study), the terms used to refer to the concept (limited to SNOMED CT for our case study), and a visualization of the concept's embedding confidence over the sub-corpora analyzed. For completeness, we display nearest neighbors from every corpus analyzed, even in corpora where the concept was not designated high-confidence (note that a concept must be high-confidence in at least one corpus to be selectable in the interface). In these cases, a warning is shown that the concept itself is not high-confidence in that corpus; the neighbors themselves are still drawn exclusively from the high-confidence set.
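The Procrustes alignment applied before animating between projections can be sketched with plain numpy (a simplified illustration of the general technique; the details of TextEssence's actual implementation may differ). Given two 2-D layouts over the same concepts, we fit the orthogonal map plus scale that best superimposes one layout on the other:

```python
import numpy as np

def procrustes_align(ref, proj):
    """Align the (n x 2) layout `proj` onto the (n x 2) layout `ref`.

    Both layouts are centered, then the optimal rotation/reflection is
    recovered from the SVD of the cross-covariance matrix, and a scale
    factor is fitted so the aligned layouts are visually comparable.
    """
    ref_c = ref - ref.mean(axis=0)
    proj_c = proj - proj.mean(axis=0)
    u, s, vt = np.linalg.svd(proj_c.T @ ref_c)
    rot = u @ vt                               # optimal orthogonal map
    scale = s.sum() / (proj_c ** 2).sum()      # least-squares scale
    return scale * proj_c @ rot + ref.mean(axis=0)
```

With each month's t-SNE layout aligned to a common reference this way, the per-concept trajectory lines and animated transitions reflect genuine positional shifts rather than arbitrary rotations of the projection.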
The Compare view facilitates analysis of the changing relationship between two or more concepts across corpora (e.g., from month to month). This view displays paired nearest neighbor tables, one per corpus, showing the aggregate nearest neighbors of each of the concepts being compared. An adjacent line graph depicts the similarity between the concepts in each corpus, with one concept specified as the reference item and the others serving as comparison items (similar to Figure 3). Similarity between two concepts for a specific corpus is calculated by averaging the cosine similarity between the corresponding embeddings in each replicate.

The scale of global COVID-19-related research has led to an unprecedented rate of new scientific findings, including developing understanding of the complex relationships between drugs, symptoms, comorbidities, and health outcomes for COVID-19 patients. We used TextEssence to study how the contexts of medical concepts in COVID-19-related scientific literature have changed over time.

We created disjoint sub-corpora containing the new articles indexed in CORD-19 each month for our case study. While the monthly corpora vary in volume, all are sufficiently large for embedding training. (Table 1 caption, partially recovered: "...only for a small subset of concepts. Entities denotes the number of SNOMED CT codes for which embeddings were learned; Hi-Conf. is the subset of these that had confidence above the 0.5 threshold.") CORD-19 monthly corpora were tokenized using ScispaCy (Neumann et al., 2019), and concept embeddings were trained using JET (Newman-Griffis et al., 2018), a weakly-supervised concept embedding method that does not require explicit corpus annotations. We used SNOMED Clinical Terms (SNOMED CT), a widely-used reference representing concepts used in clinical reporting, as our terminology for concept embedding training, using the March 2020 interim release of SNOMED CT International Edition, which included COVID-19 concepts.
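The Compare view's similarity measure, averaging the cosine similarity between two concepts' embeddings over a corpus's replicates, reduces to a short helper (illustrative code, assuming unit-normalized numpy vectors as elsewhere in this sketch):

```python
import numpy as np

def corpus_similarity(replicates, c1, c2):
    """Mean cosine similarity between two concepts over a corpus's replicates.

    Vectors are assumed unit-normalized, so the dot product is the cosine.
    """
    return float(np.mean([np.dot(r[c1], r[c2]) for r in replicates]))
```

Computing this value once per corpus (e.g., once per monthly sub-corpus) yields the series plotted in the Compare view's line graph.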
We trained JET embeddings using a vector dimensionality d = 100 and 10 iterations, to reflect the relatively small size of each corpus. We used 10 replicates per monthly corpus, and a high-confidence threshold of 0.5 for EC@5.

TextEssence captures a number of shifts in CORD-19 that reflect how COVID-19 science has developed over the course of the pandemic. Table 2 highlights key findings from our preliminary investigation into concepts known a priori to be relevant. Full nearest neighbor tables are omitted due to space limitations; they can be reproduced by downloading our code and following the included guide to inspect the CORD-19 results.

44169009 Anosmia. While associations of anosmia (loss of the sense of smell) with COVID-19 were observed early in the pandemic (e.g., Hornuss et al. (2020), posted in May 2020), it took time for anosmia to be utilized as a diagnostic variable (Talavera et al., 2020; Wells et al., 2020). Anosmia's nearest neighbors reflect this, staying stably in the region of other otolaryngological concepts until October (when Talavera et al. (2020) and Wells et al. (2020), inter alia, were included in the corpus), at which point we observe a marked shift in usage, with anosmia playing a role similar to other common symptoms of COVID-19.

116568000 Dexamethasone. The corticosteroid dexamethasone was recognized early as valuable for treating severe COVID-19 symptoms (Lester et al. (2020), indexed July 2020), and its role has remained stable since (Ahmed and Hassan (2020), indexed October 2020). This is reflected in the shift of its nearest neighbors from prior contexts of traumatic brain injury (Moll et al., 2020) to a stable neighborhood of other drugs used for COVID-19 symptoms. However, in September 2020, 702806008 Ruxolitinib emerges as Dexamethasone's nearest neighbor. This reflects a spike in literature investigating the use of ruxolitinib for severe COVID-19 symptom management (Gozzetti et al., 2020; Spadea et al., 2020; Li and Liu, 2020).
As the similarity graph in Figure 3 shows, the contextual similarity between dexamethasone and ruxolitinib steadily increases over time, reflecting the growing recognition of ruxolitinib's new utility (Caocci and La Nasa (2020), indexed May 2020).

83490000 Hydroxychloroquine. Hydroxychloroquine, an anti-malarial drug, was misleadingly promoted as a potential treatment for COVID-19 by President Trump in March, May, and July 2020, leading to widespread misuse of the drug (Englund et al., 2020). As a result, a number of studies reinvestigated the efficacy of hydroxychloroquine as a treatment for COVID-19 in hospitalized patients (Ip et al. (2020); Albani et al. (2020); Rahmani et al. (2020), all indexed August 2020). This shift is reflected in the neighbors of Hydroxychloroquine, which add investigative outcomes such as nosocomial (hospital-acquired) infections and respiratory failure to the expected anti-malarial neighbors.

TextEssence is an interactive tool for comparative analysis of word and concept embeddings. Our case study on scientific literature related to COVID-19 demonstrates that TextEssence can be used to study diachronic shifts in usage of domain concepts, and a previous study on medical records (Newman-Griffis and Fosler-Lussier, 2019) showed that the technologies behind TextEssence can be used for categorical comparison as well. The utility of TextEssence is not limited to analysis of text corpora. In settings where multiple embedding strategies are available, such as learning representations of domain concepts from text sources (Beam et al., 2020; Chen et al., 2020), knowledge graphs (Grover and Leskovec, 2016), or both (Yamada et al., 2020; Wang et al., 2020b), TextEssence can be used to study the different regularities captured by competing algorithms, to gain insight into the utility of different approaches.
TextEssence also provides a tool for studying the properties of different terminologies for domain concepts, something not previously explored in the computational literature. While our primary focus in developing TextEssence was on its use as a qualitative tool for targeted inquiry, diachronic embeddings have significant potential for knowledge discovery through quantitative measurement of semantic differences. However, vector-based comparison of embedding spaces faces significant conceptual challenges, such as a lack of appropriate alignment objectives and empirical instability (Gonen et al., 2020). While nearest neighbor-based change measurement has been proposed (Newman-Griffis and Fosler-Lussier, 2019; Gonen et al., 2020), its efficacy for small corpora with limited vocabularies remains to be determined. Our novel embedding confidence measure offers a step in this direction, but further research is needed.

Our implementation and experimental code is available at https://github.com/drgriffis/text-essence, and the database derived from our CORD-19 analysis is available at https://doi.org/10.5281/zenodo.4432958. A screencast of TextEssence in action is available at https://youtu.be/1xEEfsMwL0k.

References
• Dexamethasone for the Treatment of Coronavirus Disease (COVID-19): a
• Impact of Azithromycin and/or Hydroxychloroquine on Hospital Mortality in COVID-19
• Evaluating the Stability of Embedding-based Word Similarities
• Clinical Concept Embeddings Learned from Massive Sources of Multimodal Medical Data. Pacific Symposium on Biocomputing.
• The Unified Medical Language System (UMLS): integrating biomedical terminology
• Distributional Semantics and Linguistic Theory
• Could ruxolitinib be effective in patients with COVID-19 infection at risk of acute respiratory distress syndrome
• Visual exploration and comparison of word embeddings
• BioConceptVec: Creating and evaluating literature-based biomedical concept embeddings on a large scale
• Comparing and exploring high-dimensional data with dimensionality reduction algorithms and matrix visualizations
• Medical semantic similarity with a neural language model
• Rise and Fall: Hydroxychloroquine and COVID-19 Global Trends: Interest, Political Influence, and Potential Implications
• Word embeddings quantify 100 years of gender and ethnic stereotypes. Proceedings of the National Academy of
• Simple, Interpretable and Stable Method for Detecting Words with Usage Change across Corpora
• The Janus kinase 1/2 inhibitor ruxolitinib in COVID-19
• Node2Vec: Scalable Feature Learning for Networks
• Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change
• exBERT: A Visual Analysis Tool to Explore Learned Representations in Transformer Models
• Hydroxychloroquine and tocilizumab therapy in COVID-19 patients - An observational study
• What Does BERT Learn about the Structure of Language?
• Statistically Significant Detection of Linguistic Change
• Diachronic word embeddings and semantic shifts: a survey
• The use of dexamethasone in the treatment of COVID-19
• Whether the timing of patient randomization interferes with the assessment of the efficacy of ruxolitinib for severe COVID-19
• Bridging Text Visualization and Mining: A Task-Driven Survey
• Tiara: Interactive, topic-based visual text summarization and analysis
• 2-map: Aligned visualizations for comparison of high-dimensional point sets
• Effects of dexamethasone in traumatic brain injury patients with pericontusional vasogenic edema: A prospective observational DTI-MRI study
• ScispaCy: Fast and Robust Models for Biomedical Natural Language Processing
• Writing habits and telltale neighbors: analyzing clinical concept usage patterns with sublanguage embeddings
• Jointly Embedding Entities and Text with Distant Supervision
• How We Do Things With Words: Analyzing Text as Social and Cultural Data
• Ohio Supercomputer Center
• Comparing outcomes of hospitalized patients with moderate and severe COVID-19 following treatment with hydroxychloroquine plus atazanavir/ritonavir. Daru: Journal of Faculty of Pharmacy
• Exploring Neural Entity Representations for Semantic Information
• SemEval-2020 Task 1: Unsupervised Lexical Semantic Change Detection
• Embedding projector: Interactive visualization and interpretation of embeddings
• Successfully treated severe COVID-19 and invasive aspergillosis in early hematopoietic cell transplantation setting
• Anosmia is associated with lower in-hospital mortality in COVID-19
• Visualizing data using t-SNE
• Evaluation of Semantic Change of Harm-Related Concepts in Psychology
• CORD-19: The COVID-19 Open Research Dataset
• COVID-19 literature knowledge graph construction and drug repurposing report generation
• Going Beyond T-SNE: Exposing whatlies in Text Embeddings
• MUCK: A toolkit for extracting and visualizing semantic dimensions of large text collections
• Estimates of the rate of infection and asymptomatic COVID-19 disease in a population sample from SE England
• Factors Influencing the Surprising Instability of Word Embeddings
• Wikipedia2Vec: An Efficient Toolkit for Learning and Visualizing the Embeddings of Words and Entities from Wikipedia
• How does BERT's attention change when you fine-tune? An analysis methodology and a case study in negation scope
• Vec2graph: A python library for visualizing word embeddings as graphs

This work made use of computational resources generously provided by the Ohio Supercomputer Center (Ohio Supercomputer Center, 1987) in support of COVID-19 research. The research reported in this publication was supported in part by the National Library of Medicine of the National Institutes of Health under award number T15 LM007059.