key: cord-0766007-k3940lbm
authors: Chen, Chuming; Ross, Karen E; Gavali, Sachin; Cowart, Julie E; Wu, Cathy H
title: COVID-19 knowledge graph from semantic integration of biomedical literature and databases
date: 2021-10-06
journal: Bioinformatics
DOI: 10.1093/bioinformatics/btab694
sha: cca98ab728dfae94cfd8d581792c79e58994dba7
doc_id: 766007
cord_uid: k3940lbm

SUMMARY: The global response to the COVID-19 pandemic has led to a rapid increase in scientific literature on this deadly disease. Extracting knowledge from biomedical literature and integrating it with relevant information from curated biological databases is essential to gain insight into COVID-19 etiology, diagnosis, and treatment. We used the Semantic Web technology RDF to integrate COVID-19 knowledge mined from literature by iTextMine, PubTator, and SemRep with relevant biological databases and formalized the knowledge in a standardized and computable COVID-19 Knowledge Graph (KG). We published the COVID-19 KG via a SPARQL endpoint to support federated queries on the Semantic Web and developed a knowledge portal with browsing and searching interfaces. We also developed a RESTful API to support programmatic access and provided RDF dumps for download.

AVAILABILITY AND IMPLEMENTATION: The COVID-19 Knowledge Graph is publicly available under the CC-BY 4.0 license at https://research.bioinformatics.udel.edu/covid19kg/.

The worldwide research community's response to the COVID-19 pandemic has led to a burst of publications on this deadly disease (Brainard, 2020). Computational approaches and tools that can distill biomedical knowledge from literature and integrate it with relevant information from curated biological databases are essential to gain insight into COVID-19 etiology, diagnosis, and treatment. Chen et al. (2021a) surveyed more than 200 natural language processing studies and systems addressing the COVID-19 pandemic.
Knowledge Graphs (KGs) are a powerful method to represent and integrate such heterogeneous data and their relationships to generate novel insights. Several efforts are underway to investigate COVID-19 using KGs. A cause-and-effect KG on COVID-19 pathophysiology was constructed from literature (Domingo-Fernandez et al., 2020). A framework that can integrate heterogeneous biomedical data to produce KGs was developed for COVID-19 (Reese et al., 2021). Drug repurposing candidates were identified using a literature-derived KG and a graph completion method (Zhang et al., 2021). A detailed review comparing existing KGs and our work can be found in Supplementary File 1. In this paper, we used the Semantic Web technology RDF (Resource Description Framework) to integrate COVID-19 knowledge from literature annotated by text-mining pipelines with relevant biological databases. Information was extracted and formalized in a standardized and computable KG to enable researchers to explore, analyze and answer questions. To make this resource readily available to the research community in accordance with the FAIR principles (Wilkinson et al., 2016), we published the COVID-19 KG through multiple dissemination mechanisms: a SPARQL (RDF Query Language) endpoint, a knowledge portal, a RESTful API, and downloadable RDF dumps.

LitCovid is a curated resource of articles about COVID-19 and SARS-CoV-2 in PubMed (Chen et al., 2021b). The COVID-19 Open Research Dataset (CORD-19) (Wang et al., 2020) consists of publications and preprints on COVID-19 and other coronaviruses (SARS and MERS) from the WHO, PubMed Central, bioRxiv and medRxiv.
The abstracts and full texts from the LitCovid and CORD-19 datasets have been processed by several text-mining pipelines to discover entity and relationship annotations: (i) iTextMine (Ren et al., 2018), which provides text-mined relation extraction results for protein phosphorylation (kinase-substrate-site), phosphorylation-dependent protein-protein interactions, and miRNA-gene relations; (ii) PubTator (Wei et al., 2019), which provides annotations of biomedical concepts such as genes/proteins, genetic variants, diseases, chemicals, species and cell lines; and (iii) SemRep (Rosemblat et al., 2013), which uses the Unified Medical Language System (Humphreys et al., 1998) to extract semantic predications from biomedical text. We also include relevant data from curated biomedical databases such as Protein Ontology (Chen et al., 2020), DrugBank (Wishart et al., 2018), CoV-AbDab (Raybould et al., 2020), UniProtKB (UniProt consortium, 2020), STRING (Szklarczyk et al., 2019) and iPTMnet (Huang et al., 2018).

The annotations of the LitCovid and CORD-19 datasets by iTextMine and PubTator were downloaded in BioC JSON format and converted to RDF using AtomGraph's generic JSON to RDF converter. DrugBank data were downloaded in XML format, converted to JSON using the xml2json tool, and then converted to RDF. CoV-AbDab, SemRep, STRING, and iPTMnet data were downloaded in text format and converted to RDF using custom scripts. The source code and instructions for creating the RDF files used in the COVID-19 KG are publicly available at https://github.com/udel-cbcb/covid19kg_rdf. The COVID-19 KG is served by the community edition of the OpenLink Virtuoso server with SPARQL 1.1 query federation.
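As an illustration of the text-to-RDF conversion step described above, the following is a minimal Python sketch that turns tab-delimited protein-protein interaction rows into N-Triples lines. It is not the authors' script (those are in the repository cited above). The namespace `http://example.org/covid19kg/`, the predicate names, the column names `protein_a`, `protein_b` and `score`, and the helper names `iri`, `literal` and `rows_to_ntriples` are all assumptions made for this example.

```python
import csv
import io
from typing import List
from urllib.parse import quote

# Hypothetical namespace, for illustration only; not the KG's real URIs.
BASE = "http://example.org/covid19kg/"


def iri(kind: str, identifier: str) -> str:
    """Build an N-Triples IRI term, percent-encoding the identifier."""
    return f"<{BASE}{kind}/{quote(identifier, safe='')}>"


def literal(value: str) -> str:
    """Build a plain N-Triples string literal with the required escapes."""
    escaped = (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def rows_to_ntriples(tsv_text: str) -> List[str]:
    """Convert tab-delimited interaction rows into N-Triples lines.

    Assumes a header row naming the columns protein_a, protein_b and score
    (an assumed layout, not the format of any particular source database).
    """
    triples = []
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for index, row in enumerate(reader):
        # One node per interaction row, linked to both proteins and the score.
        subject = iri("interaction", str(index))
        triples.append(f"{subject} <{BASE}proteinA> {iri('protein', row['protein_a'])} .")
        triples.append(f"{subject} <{BASE}proteinB> {iri('protein', row['protein_b'])} .")
        triples.append(f"{subject} <{BASE}score> {literal(row['score'])} .")
    return triples
```

Each input row yields three triples. The returned lines can be joined with newlines and written to a file for loading into an RDF store that accepts N-Triples.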
To facilitate exploration and use of the COVID-19 KG, a knowledge portal (https://research.bioinformatics.udel.edu/covid19kg/) with browsing and searching interfaces was developed using the Django framework. The KG can be queried through YASGUI, accompanied by a comprehensive set of example SPARQL queries for new users. We also developed a RESTful API for programmatic access to the KG for data integration and analysis. In addition, we provide RDF dumps of the COVID-19 KG in text/turtle format with corresponding RDF-centric statistics.

The COVID-19 KG consists of 23 Named Graphs with a total of more than 1.2 billion RDF triples. Summary statistics of the literature sources and of the entities and relationships annotated by the different text-mining tools can be found at the knowledge portal under "Dashboard". As case studies, we used the COVID-19 KG to identify drug repurposing candidates for COVID-19 and potential therapeutic interventions to disrupt the function of the SARS coronavirus nucleocapsid protein (N protein). Detailed descriptions can be found in Supplementary File 2.

To construct a drug repurposing network, we used the COVID-19 KG SPARQL GUI (query CPPQ5) to retrieve the top 10 most frequently mentioned genes in the CORD-19 corpus as annotated by PubTator. We then browsed the DrugBank section of the KG web interface to identify drug and disease relations involving these genes. Finally, we performed a federated SPARQL query of DisGeNET (Piñero et al., 2020) and a web search of the Therapeutic Information Browser (TIB) (https://covidtib.c19hcc.org/app_direct/dashboard/) for additional variant, drug, and disease relations. In August 2020, our network predicted that the TNF-targeting drugs etanercept and certolizumab pegol (Fig. S1-A) were candidates for COVID-19 drug repurposing. As of March 2021, both drugs were mentioned in the COVID-19 literature and etanercept has been reported to be beneficial in individual cases (Zhu et al., 2021; Clark, 2020).
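The gene-ranking step that opens the drug repurposing case study can be sketched offline in Python. This is not query CPPQ5, whose text is not reproduced here. It is a hedged illustration of the ranking such a query performs, assuming annotation records are available as dictionaries with the hypothetical keys `doc_id`, `type` and `gene_id`, and assuming that "most frequently mentioned" means the number of distinct articles mentioning a gene.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_mentioned_genes(annotations: Iterable[dict], n: int = 10) -> List[Tuple[str, int]]:
    """Rank gene identifiers by the number of distinct articles mentioning them.

    Each record is assumed to carry the keys doc_id, type and gene_id
    (a hypothetical layout chosen for this sketch).
    """
    seen = set()
    counts = Counter()
    for ann in annotations:
        if ann.get("type") != "Gene":
            continue  # ignore diseases, chemicals and other concept types
        key = (ann["doc_id"], ann["gene_id"])
        if key in seen:
            continue  # count each gene at most once per article
        seen.add(key)
        counts[ann["gene_id"]] += 1
    # Sort by descending count, then by identifier, so ties are deterministic.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:n]
```

Counting distinct articles rather than raw mentions keeps a single long paper from dominating the ranking. Dropping the `seen` check would give a per-mention count instead.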
Another promising candidate identified using the KG, the IFN-targeting drug olsalazine (Fig. S1-B), is currently not mentioned in the COVID-19 literature. However, other IFN-targeting drugs are being investigated as COVID-19 treatments. Moreover, olsalazine is a recommended treatment for ulcerative colitis (UC), and several other UC therapeutics are mentioned in the literature in the context of COVID-19.

To identify strategies to disable the N protein (Fig. S2), we browsed the KG using the web interface to identify phosphorylation and protein-protein interaction relations involving the N protein. We then browsed the KG further to identify drugs and miRNAs that target the kinases that phosphorylate the N protein and the proteins that interact with it. The potential avenues of intervention we identified include small molecule inhibitors of the N protein kinases CDK1 and GSK3 (Fig. S2, V-shaped nodes) and miRNAs that inhibit expression of LARP1 and/or G3BP1 (Fig. S2, triangular nodes).

The COVID-19 KG will be regularly updated. We plan to develop a visualization application for the KG and to combine graph representation learning, ontology, automated reasoning, and neural networks to open up the KG for machine learning and further data analytics.

This work has been partially supported by the National Institutes of Health (Grant Nos. U24HG007822 and R35GM141873) and institutional resources at the University of Delaware.

Conflict of Interest: none declared.

References:
Brainard (2020) Scientists are drowning in COVID-19 papers. Can new tools keep them afloat? Science
Chen et al. (2020) Protein ontology on the semantic web for knowledge discovery
Chen et al. (2021a) Artificial Intelligence in Action: Addressing the COVID-19 Pandemic with Natural Language Processing
Chen et al. (2021b) LitCovid: an open database of COVID-19 literature
Clark (2020) Background to new treatments for COVID-19, including its chronicity, through altering elements of the cytokine storm
Domingo-Fernandez et al. (2020) COVID-19 Knowledge Graph: a computable, multi-modal, cause-and-effect knowledge model of COVID-19 pathophysiology
Huang et al. (2018) iPTMnet: an integrated resource for protein post-translational modification network discovery
Humphreys et al. (1998) The Unified Medical Language System: An informatics research collaboration
Raybould et al. (2020) CoV-AbDab: the Coronavirus Antibody Database
Reese et al. (2021) KG-COVID-19: A Framework to Produce Customized Knowledge Graphs for COVID-19 Response
Ren et al. (2018) iTextMine: integrated text-mining system for large-scale knowledge extraction from the literature. Database
Rosemblat et al. (2013) A methodology for extending domain coverage in SemRep
Szklarczyk et al. (2019) STRING v11: protein-protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets
UniProt consortium (2020) UniProt: the universal protein knowledgebase in 2021
Wang et al. (2020) The Covid-19 Open Research Dataset. ACL NLP-COVID Workshop
Wei et al. (2019) PubTator central: automated concept annotation for biomedical full text articles
Wilkinson et al. (2016) The FAIR guiding principles for scientific data management and stewardship
Wishart et al. (2018) DrugBank 5.0: a major update to the DrugBank database for 2018
Zhang et al. (2021) Drug repurposing for COVID-19 via knowledge graph completion
Zhu et al. (2021) Update on the Clinical Management and Diagnosis of Kawasaki Disease