key: cord-0628925-7bznyhdf
authors: Trewartha, Amalie; Dagdelen, John; Huo, Haoyan; Cruse, Kevin; Wang, Zheren; He, Tanjin; Subramanian, Akshay; Fei, Yuxing; Justus, Benjamin; Persson, Kristin; Ceder, Gerbrand
title: COVIDScholar: An automated COVID-19 research aggregation and analysis platform
date: 2020-12-07
journal: nan
DOI: nan
sha: efba8933ddedc685fd608fc9a1b0386161cf3682
doc_id: 628925
cord_uid: 7bznyhdf

The ongoing COVID-19 pandemic has had far-reaching effects throughout society, and science is no exception. The scale, speed, and breadth of the scientific community's COVID-19 response have led to the emergence of new research literature on a remarkable scale: as of October 2020, over 81,000 COVID-19 related scientific papers have been released, at a rate of over 250 per day. This has created a challenge to traditional methods of engagement with the research literature; the volume of new research is far beyond the ability of any human to read, and the urgency of response has led to an increasingly prominent role for pre-print servers and a diffusion of relevant research across sources. These factors have created a need for new tools to change the way scientific literature is disseminated. COVIDScholar is a knowledge portal designed with the unique needs of the COVID-19 research community in mind, utilizing NLP to aid researchers in synthesizing the information spread across thousands of emergent research articles, patents, and clinical trials into actionable insights and new knowledge. The search interface for this corpus, https://covidscholar.org, now serves over 2000 unique users weekly. We also present an analysis of trends in COVID-19 research over the course of 2020.

The scientific community has responded to the COVID-19 pandemic with unprecedented speed, and as a result an enormous amount of research literature is rapidly emerging, at a rate of over 250 papers a day [1]. The urgency and volume of emerging research have caused pre-prints to take a prominent role in lieu of traditional journals, leading to widespread usage of pre-print servers for the first time in many fields, most prominently the biomedical sciences [2, 3]. While this allows new research to be disseminated to the community sooner, it also circumvents the role of journals in filtering poor or flawed papers and highlighting relevant research [4]. Additionally, the uniquely multi-disciplinary nature of the scientific community's response to the pandemic has led to pertinent research being dispersed across many open access and pre-print services, no single one of which captures the entirety of the COVID-19 literature. These challenges have created a need and an opportunity for new tools and methods to rethink the way in which researchers engage with the wealth of available COVID-19 scientific literature.

COVIDScholar is an effort to address these issues by using natural language processing (NLP) techniques to aggregate, analyze, and search the COVID-19 research literature. We have developed an automated, scalable infrastructure for scraping and integrating new research as it appears, and used it to construct a targeted corpus of over 81,000 scientific papers and documents pertinent to COVID-19 from a broad range of disciplines. The search interface for this corpus, https://covidscholar.org, now serves over 2000 unique users weekly. While a variety of other COVID-19 literature aggregation efforts exist [5, 6, 7], COVIDScholar differs in the breadth of literature collected.
In addition to the biological and medical research collected by other large-scale aggregation efforts such as CORD-19 [6] and LitCOVID [7], COVIDScholar's collection includes the full breadth of COVID-19 research, including public health, behavioural science, the physical sciences, economics, psychology, and the humanities. In this paper, we present a description of the COVIDScholar data intake pipeline and back-end infrastructure, and the NLP models used to power directed searches on the front-end search portal. We also present an analysis of the COVIDScholar corpus, and discuss trends in the dynamics of research output during the pandemic.

At the heart of COVIDScholar is the automated data intake and processing pipeline, depicted in Fig. 1.

Each paper is assigned a "COVID-19 relevance" score, calculated by a classification model trained to predict whether a paper discusses the SARS-CoV-2 virus or COVID-19. We observe that papers from before the COVID-19 pandemic that are related to certain viruses or diseases tend to receive high relevance scores, especially papers on the original SARS and other respiratory diseases. SARS-CoV-2 shares 79% of its genome sequence identity with the SARS-CoV virus [23], and there are many similarities between how the two viruses enter cells, replicate, and transmit between hosts [24]. Because the relevance classification model gives a higher score to studies on these similar diseases, search results are more likely to contain relevant information, even if it is not directly focused on COVID-19. For example, the transmembrane protease TMPRSS2 plays an important role in viral entry and spread for both SARS-CoV and SARS-CoV-2, and its inhibition is a promising avenue for treating COVID-19 [25].

COVIDScholar also provides tools that utilize unsupervised document embeddings, so that searches can be performed within "related documents" and research papers can be automatically linked together by topics, methods, drugs, and other key pieces of information. Documents are sorted by similarity via the cosine distances between unsupervised document embeddings [26], and this similarity is then combined with the overall result-ranking score mentioned above.

Classification of abstracts is performed using a fine-tuned SciBERT [27] model. While other BERT models pre-trained on scientific text exist (e.g. BioBERT [28], MedBERT [29], and ClinicalBERT [30]), we select SciBERT due to its broad, multidisciplinary training corpus, which we expect to more closely resemble the COVIDScholar corpus than those pre-trained on a single discipline. SciBERT has state-of-the-art performance on the task of paper domain classification [31], as well as on a number of benchmarks in the biomedical domain [32, 33, 34], which is the most common discipline in the COVIDScholar corpus. A single fully-connected layer with sigmoid activation is used as a classification head, and the model is fine-tuned for 4 epochs using 2600 human-annotated abstracts. ROC curves for the classifier's performance on each top-level discipline using 20-fold cross-validation are shown in Fig. 2.

Table 2: Scoring metrics of SciBERT [27] and baseline random forest discipline classification models. Models were evaluated using 10-fold cross-validation on 2600 labeled abstracts. Input features to the random forest model were generated using TF-IDF.
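As a rough illustration of the classifier described above, the following is a minimal sketch, not the authors' implementation: a single fully-connected layer with sigmoid activation placed on top of a SciBERT encoder, producing one independent score per discipline. The checkpoint name, the label count, and the use of the [CLS] representation are assumptions made for this example.

```python
# Minimal sketch of a SciBERT-based multi-label discipline classifier:
# a single fully-connected layer with sigmoid activation on top of the
# encoder's [CLS] representation. Checkpoint name and label count are
# assumptions for illustration, not the authors' configuration.
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer

CHECKPOINT = "allenai/scibert_scivocab_uncased"  # assumed SciBERT checkpoint

class DisciplineClassifier(nn.Module):
    def __init__(self, num_disciplines: int = 7):  # label count is assumed
        super().__init__()
        self.encoder = AutoModel.from_pretrained(CHECKPOINT)
        self.head = nn.Linear(self.encoder.config.hidden_size, num_disciplines)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]      # [CLS] token representation
        return torch.sigmoid(self.head(cls))   # independent per-discipline scores

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = DisciplineClassifier()
batch = tokenizer(
    ["Example abstract text about SARS-CoV-2 viral entry mechanisms."],
    padding=True, truncation=True, max_length=512, return_tensors="pt",
)
with torch.no_grad():
    scores = model(batch["input_ids"], batch["attention_mask"])  # shape (1, num_disciplines)
```

Fine-tuning such a head would minimize a binary cross-entropy loss over the annotated abstracts for the 4 epochs described above; keeping the per-discipline outputs independent reflects the fact that a paper may carry more than one discipline label.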
For the broadest disciplines, the SciBERT model achieves F1 scores which are between 0.1 and 0.14 higher than the baseline. It is also of note that in each case, while precision is broadly similar between the two models, the baseline model exhibits significantly lower recall. This may be due to unbalanced training data: no single discipline accounts for more than 33% of the total corpus. For search applications, often only a relatively small number of documents is relevant to each query; in this setting high recall is more desirable than high precision, so in practice the performance gap between the two models is larger than the relative F1 scores indicate.

On the task of binary classification of papers as related to COVID-19, our current models perform similarly well, achieving an F1 score of 0.98. While the binary classification task is significantly simpler from an NLP perspective, since the majority of related papers contain "COVID-19" or some synonym, this still represents a significant performance improvement over the baseline model, which achieves an F1 score of 0.90. Given the relative simplicity of this task, in cases where an abstract is absent we classify the paper as related to COVID-19 based on its title.

For the task of unsupervised keyword extraction, 63 abstracts were annotated by humans, and two statistical methods, TextRank [36] and TF-IDF [37], and two graph-based models, RaKUn [38] and Yake [39], were tested. Models were evaluated for overlap between human-annotated keywords and extracted keywords, and results are shown in Table 3. Note that, due to the inherent subjectivity of the keyword extraction task, scores are relatively low, even for the best-performing model.

Table 3: Precision, recall, and F1 scores for four unsupervised keyword extractors: RaKUn [38], Yake [39], TextRank [36], and TF-IDF [37]. Output from the keyword extractors was compared to 63 abstracts with human-annotated keywords.

To better visualize the embeddings of COVID-19-related phrases and to find latent relationships between biomedical terms, we designed a tool based on Embedding Projector [40]. A screenshot of the tool is shown in Fig. 3. We utilize FastText [41] embeddings for the embedding projector. Cosine distance is used to measure the similarity between phrases; if the cosine distance between two phrases is small, they are likely to have similar meanings:

$$\mathrm{CosineDistance}(p_1, p_2) = 1 - \frac{\mathrm{Emb}(p_1) \cdot \mathrm{Emb}(p_2)}{\lVert \mathrm{Emb}(p_1) \rVert \, \lVert \mathrm{Emb}(p_2) \rVert}$$

where $p_1$ and $p_2$ represent two phrases and $\mathrm{Emb}$ maps phrases to their embedded representations in the learned semantic space.

Table 4: The number of papers and fraction of total COVID-19 related papers in the COVIDScholar corpus for each discipline. Only papers with abstracts are classified and included in the final count. Note that a given paper may have any number of discipline labels.

Fig. 4: Cumulative count by primary discipline of COVID-19 papers in the COVIDScholar database, and total number of reported US COVID-19 cases, during the first 10 months of 2020. Papers are categorized by the classification model described in Sec. 3 and assigned to the discipline with the highest predicted likelihood. Case data are from The New York Times, based on reports from state and local health agencies. Note that only those papers with abstracts available are classified, and so the publication count is somewhat lower than the total from Sec. 4.1.

The cumulative count of COVID-19 papers in the COVIDScholar collection over the first 10 months of 2020 is shown in Fig. 4. A breakdown of research by discipline over the course of 2020 is shown in Fig. 5, which depicts the fraction of monthly COVID-19 publications primarily associated with each discipline.
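As a rough illustration of the aggregation behind such a breakdown, and not the authors' actual pipeline, the sketch below assigns each paper the discipline with the highest predicted likelihood and normalizes counts within each month; the column names and toy data are hypothetical.

```python
# Minimal sketch of the aggregation behind a breakdown like Fig. 5: each paper
# is assigned the discipline with the highest predicted likelihood, and counts
# are normalized within each month. Column names and toy data are hypothetical.
import pandas as pd

papers = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-15", "2020-02-03", "2020-02-20", "2020-03-11"]),
    "discipline_scores": [  # per-discipline likelihoods from the classifier
        {"Biological and Chemical Sciences": 0.91, "Medical Sciences": 0.40},
        {"Medical Sciences": 0.85, "Humanities": 0.10},
        {"Physical Sciences": 0.77, "Biological and Chemical Sciences": 0.30},
        {"Medical Sciences": 0.66, "Physical Sciences": 0.20},
    ],
})

# Primary discipline = label with the highest predicted likelihood.
papers["primary"] = papers["discipline_scores"].apply(lambda s: max(s, key=s.get))
papers["month"] = papers["date"].dt.to_period("M")

# Fraction of each month's papers assigned to each primary discipline.
counts = papers.groupby(["month", "primary"]).size()
monthly_fraction = counts / counts.groupby(level="month").transform("sum")
print(monthly_fraction)
```

In the actual corpus the same normalization would simply be applied to every month of 2020, with papers lacking abstracts excluded, as noted in the caption of Fig. 4.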
From January to April, the relative popularity of the disciplines shifted somewhat. While Biological and Chemical Sciences comprised 45% of the total corpus in January, by April that share had decreased to 28%. This is largely accounted for by an increase in papers from the Physical and Medical Sciences: over the same period the fraction of papers from Medical Sciences increased from 15% to 20% of the total, and Physical Sciences from 5% to 8%. By April, the fraction of the corpus from each discipline appears to have stabilized, with fluctuations in the relative fractions of under 1%. This further supports the evidence in Fig. 4 that research output had already reached its maximum rate by April/May, and this seems to hold true on a discipline-by-discipline basis as well. We investigate this increase in Fig. 6.

We have developed and implemented a scalable research aggregation, analysis, and dissemination infrastructure, and created a targeted corpus of over 81,000 scientific papers and documents pertinent to COVID-19. While to date the COVIDScholar research corpus has primarily been used for front-end user search, it provides a rich opportunity for NLP analysis. Recent work [47] has highlighted the ability of NLP to discover latent knowledge from unstructured scientific text, utilizing information from thousands of research papers. We are now moving to employ similar techniques here, applied to such problems as drug re-purposing and predicting protein-protein interactions.

We are thankful to the editorial team of Rapid Reviews: COVID-19 for their assistance in annotating text.

References
Preprints: An underutilized mechanism to accelerate outbreak science
Preprinting the COVID-19 pandemic
Coronavirus: The spread of misinformation
WHO COVID-19 Database
CORD-19: The COVID-19 Open Research Dataset
Keep up with the latest coronavirus research
OpenCitations, an infrastructure organization for open scholarship
The Multidisciplinary Preprint Platform
The Lens COVID-19 Data Initiative
Introducing PsyArXiv: a preprint service for psychological science
Dimensions COVID-19 Dataset
Keep up with the latest coronavirus research
New Preprint Server Aims to Be Biologists' Answer to Physicists' arXiv
New preprint server for medical research
Genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding
SARS-CoV-2, SARS-CoV, and MERS-CoV: A comparative overview
TMPRSS2 and COVID-19: Serendipity or Opportunity for Intervention?
Distributed Representations of Sentences and Documents
SciBERT: Pretrained Language Model for Scientific Text
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
Med-BERT: pre-trained contextualized embeddings on large-scale structured electronic health records for disease prediction
Publicly Available Clinical BERT Embeddings
An Overview of Microsoft Academic Service (MAS) and Applications. In: WWW - World Wide Web Consortium (W3C)
CollaboNet: collaboration of deep neural networks for biomedical named entity recognition
A Corpus with Multi-Level Annotations of Patients, Interventions and Outcomes to Support Language Processing for Medical Literature
Chemical-gene relation extraction using recursive neural network
Rapid Reviews: COVID-19, publishes reviews of COVID-19 preprints
TextRank: Bringing Order into Text
Term-weighting approaches in automatic text retrieval
RaKUn: Rank-based Keyword extraction via Unsupervised learning and Meta vertex aggregation
YAKE! Collection-Independent Automatic Keyword Extractor
Embedding projector: Interactive visualization and interpretation of embeddings
Enriching Word Vectors with Subword Information
Statement on the second meeting of the International Health Regulations (2005) Emergency Committee regarding the outbreak of novel coronavirus
Unsupervised word embeddings capture latent knowledge from materials science literature