key: cord-0824420-uce7xhem
authors: Tworowski, Dmitry; Gorohovski, Alessandro; Mukherjee, Sumit; Carmi, Gon; Levy, Eliad; Detroja, Rajesh; Mukherjee, Sunanda Biswas; Frenkel-Morgenstern, Milana
title: COVID19 Drug Repository: text-mining the literature in search of putative COVID19 therapeutics
date: 2020-11-09
journal: Nucleic Acids Res
DOI: 10.1093/nar/gkaa969
sha: 7590a68ef9d6a97da1f39d8ea7093785e4830c46
doc_id: 824420
cord_uid: uce7xhem
The recent outbreak of COVID-19 has generated an enormous amount of Big Data. To date, the COVID-19 Open Research Dataset (CORD-19) lists ∼130,000 articles from the WHO COVID-19 database, PubMed Central, medRxiv, and bioRxiv, as collected by Semantic Scholar. According to LitCovid (11 August 2020), ∼40,300 COVID19-related articles are currently listed in PubMed. It has been shown in clinical settings that the analysis of past research results and the mining of available data can provide novel opportunities for the successful application of currently approved therapeutics and their combinations for the treatment of conditions caused by a novel SARS-CoV-2 infection. As such, effective responses to the pandemic require the development of efficient applications, methods and algorithms for data navigation, text-mining, clustering, classification, analysis, and reasoning. Thus, our COVID19 Drug Repository represents a modular platform for drug data navigation and analysis, with an emphasis on COVID-19-related information currently being reported. The COVID19 Drug Repository enables users to focus on different levels of complexity, starting from general information about (FDA-)approved drugs, PubMed references, clinical trials and recipes, and extending to descriptions of the molecular mechanisms of drug action. Our COVID19 Drug Repository provides an up-to-date worldwide collection of drugs that have been repurposed for COVID19 treatment.
The COVID-19 pandemic outbreak has triggered immediate reactions from the medical and scientific communities, and has resulted in an explosive growth of novel data regarding possible therapies or therapeutic opportunities (1, 2). The COVID-19 data portal (https://www.covid19dataportal.org/), established by the European Commission in April 2020, has facilitated the exchange and sharing of COVID-19 research data. One of the first open initiatives realized with the creation of this portal was the development of the COVID-19 Open Research Dataset (CORD-19) (2). CORD-19 (https://www.semanticscholar.org/cord19) currently lists ∼130,000 articles from the WHO COVID-19 database, PubMed Central, medRxiv, and bioRxiv, as collected by Semantic Scholar. Another comprehensive list of COVID-19 databases and journals can be found on the Centers for Disease Control and Prevention (CDC) library webpage: https://www.cdc.gov/library/researchguides/2019novelcoronavirus/databasesjournals.html. According to recent records from the LitCovid resource (1), ∼40,300 COVID19-related articles are currently listed in PubMed.
The rapid accumulation of COVID-19 literature requires novel tools for data collection and organization with efficient navigation capabilities. Such navigation capabilities are based on the literature-based discovery (LBD) concept (3) and can be achieved by implementing text-mining, clustering, and classification methods (1, 4-8). Available text- and data-mining tools, such as those found at LitCovid (1), PubTator (4, 9, 10), the iSearch platform (https://icite.od.nih.gov/covid19/search/), Neural-Covidex (https://covidex.ai/) (7), the COVID-19 Data Portal (https://www.covid19dataportal.org/), Carrot/Lingo (https://search.carrot2.org/#/web) (11) and ProtFus (12), efficiently extract target information across articles and other text sources. Using these text-mining tools, we created the COVID-19 Drug Repository.
The goal of our COVID-19 Drug Repository was to automatically collect data on drugs used against COVID-19 around the world and to build a structured repository that includes drug descriptions, side effects and available publications. The repository also contains medicine- and pharmacology-oriented data, including annotated information on (FDA-)approved drugs, therapeutic agents (experimental drugs), and drug-like synthetic or natural chemical substances. The data were collected and integrated by methods developed for the 'omics' field (13), in particular, chemogenomics (i.e. chemical genomics) (14-17), pharmacogenomics (18-27), genomics and personomics (28-30). In addition, we made use of a number of chemogenomics (31-33) and pharmacogenomics (34-36) approaches that focus on the repositioning (i.e. repurposing) of FDA-approved drugs and on clinical trials in the treatment of COVID-19. All the data collected in the COVID-19 Drug Repository are designed for use by researchers and clinicians in the field. The information cannot be used for self-medication!
The COVID-19 Drug Repository: structure and technical description
The Repository is an open-source modular platform built on the MySQL server platform, comprising 15 curated tables. The structure of the database, presenting the logical relations between these tables, and the data collection process are shown in Figures 1 and 2, respectively. To ensure consistency between the drug syn, drug recipe, covid salt, drug link, drug pubmed, Clinicaltrial, text mining and covid drug tables, the insertion/update/deletion of rows is linked to the covid drug table. Each covid drug entry is linked to 15 data fields corresponding to drug data and a target. Most of the data fields (i.e. ACC id, UNII, CAS1, CAS2, CAS3, and PubChem cid) are hyperlinked to other databases (i.e.
DrugBank (37, 38), ClinicalTrials.gov, PubChem (39), IUPHAR/BPS (40) and Chemical Abstracts Service (41, 42)) (Figure 1, Figure 2, Table 1). Each covid recipe entry is associated with 12 data fields, including the drug formulation (recipe), the country of manufacture, FDA-approved drugs, guidelines, etc. The COVID-19 Drug Repository supports text query inputs using the search box on the homepage (Figure 3). The MyISAM storage engine (43) is used to support FULLTEXT search, with 'utf8' as the DEFAULT CHARSET. Detailed instructions on the browsing and search tools of the database are provided below and can also be found on the database homepage (under the 'Help' option). Finally, the database update process is semi-automatic, as follows: (a) potential COVID19 therapeutic substances found in research articles are selected manually; (b) updating and adding new records to the COVID19 Drug Repository database is fully automated using Perl scripts; (c) hyperlinks to PubMed and other sources, and maps, are generated automatically by Python scripts.
The database search is case-insensitive. Simple queries can include either the full or partial drug name ('Drug search' box). Advanced queries can be constructed by combining identifiers of the various databases (Table 1) and/or concepts (e.g. compound class, viral target, etc.). The 'compound class', for example, includes the following terms: Antibody | Metabolite | Natural product | Inorganic | Peptide | Synthetic organic, while the 'viral target' category comprises acronyms of virus names: BCV | BtCoV-
The search results page shows all relevant instances associated with a query drug. The HTML page is generated with hyperlinks to all databases (Table 1) associated with the drug of interest. Alternatively, the information can also be accessed by selecting a drug name from the menu in the selection field (see the Database 'Help' page: http://covid19.md.biu.ac.il/).
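The search behaviour described above (case-insensitive matching, full or partial drug names, optional 'compound class' and 'viral target' filters) can be illustrated with a minimal in-memory sketch. This is not the Repository's MySQL FULLTEXT implementation; the `DrugRecord` fields, the `search_drugs` helper and the sample rows are hypothetical and exist only to show the matching logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrugRecord:
    """Hypothetical, simplified stand-in for one row of the drug table."""
    name: str
    compound_class: str
    viral_target: str


def search_drugs(records, name_fragment="", compound_class=None, viral_target=None):
    """Return records whose name contains the fragment (case-insensitive)
    and that match the optional concept filters exactly (case-insensitive)."""
    fragment = name_fragment.casefold()
    hits = []
    for rec in records:
        if fragment not in rec.name.casefold():
            continue
        if compound_class is not None and rec.compound_class.casefold() != compound_class.casefold():
            continue
        if viral_target is not None and rec.viral_target.casefold() != viral_target.casefold():
            continue
        hits.append(rec)
    return hits


# Placeholder rows with invented names; they are not entries from the Repository.
SAMPLE = [
    DrugRecord("Examplevir", "Synthetic organic", "BCV"),
    DrugRecord("Examplemab", "Antibody", "BCV"),
    DrugRecord("Demotide", "Peptide", "BCV"),
]

simple_hits = search_drugs(SAMPLE, "EXAMPLE")
advanced_hits = search_drugs(SAMPLE, "example", compound_class="antibody")
```

Here `simple_hits` corresponds to a simple partial-name query, while `advanced_hits` combines the name fragment with a concept filter, mirroring the simple/advanced distinction in the text.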
The Treatment Options section enables access to a detailed description of the query drug.
The Repository is a COVID19-targeted collection (shortlist) of ∼460 items representing 184 approved drugs, 384 investigated therapeutic agents and 76 drug-like synthetic or natural chemical substances. The main focus of the repository is cross-referencing to PubMed articles linking these drugs with multiple research sources, mapping associations between the drugs and COVID19-related concepts, text/data-mining, clustering, and visualization. Furthermore, the Biopython collection of modules (44), in particular the Bio.Entrez module (44), was integrated into our database framework to enable fast data retrieval via efficient command-line interactions with all NCBI resources and databases (45), which are sub-divided into six categories: Literature, Genes, Proteins, Genomes, Genetics, and Chemicals. A toolset of Perl and Python scripts was created for specific tasks, such as (a) automatic generation of links between the Repository and other sources, (b) collection of data/references and creation of work tables and (c) mapping of associations/concepts, with visualization options being realized via external tools.
Recent information on approved drugs and therapeutic combinations thereof considered useful for the treatment
Using this search strategy, we found ∼100 substances with activities associated with COVID-19. The queries and patterns used, together with the information obtained daily, are the main sources for Repository updates. All 'active' chemical entities/substances (i.e. those with demonstrated or proposed therapeutic potential) were collected and linked via their identifiers in CAS, PubChem, IUPHAR, etc. Furthermore, we adopted the web-based text clustering engine Carrot2 (47) for visualization of pair-wise 'drug-COVID-19 concept' associations found in PubMed abstracts for each pair.
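To make the idea of pair-wise 'drug-COVID-19 concept' lookups concrete, the following sketch builds one PubMed query and one hyperlink per pair. The exact query patterns used for the Repository are not given in the text, so the `[Title/Abstract]` restriction, the helper names and the choice of example drugs and concepts are assumptions made purely for illustration; no publication counts are fetched here.

```python
from itertools import product
from urllib.parse import urlencode

# Public PubMed search page; the 'term' parameter carries the query string.
PUBMED_SEARCH_URL = "https://pubmed.ncbi.nlm.nih.gov/"


def pair_query(drug, concept):
    """Compose a PubMed query requiring both phrases in the title or abstract."""
    return f'"{drug}"[Title/Abstract] AND "{concept}"[Title/Abstract]'


def pair_link(drug, concept):
    """Return a URL that opens the PubMed results for one drug-concept pair."""
    return PUBMED_SEARCH_URL + "?" + urlencode({"term": pair_query(drug, concept)})


def build_pair_table(drugs, concepts):
    """Enumerate every drug-concept pair together with its query and hyperlink."""
    return [
        {"drug": d, "concept": c, "query": pair_query(d, c), "link": pair_link(d, c)}
        for d, c in product(drugs, concepts)
    ]


# Drug and concept names taken from examples mentioned in the text.
table = build_pair_table(
    ["sildenafil", "resveratrol"],
    ["viral infections", "respiratory diseases", "coronavirus pneumonia"],
)
```

Each row of `table` could then be extended with a retrieved publication count, which is the quantity reported in the downloadable tables.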
In this version of our database, COVID19 drugs were mapped to a dictionary of 21 terms related to concepts of 'COVID-19', e.g. 'viral infections', 'respiratory diseases', 'inflammatory cell', 'coronavirus pneumonia', etc. (Supplemental Table S1). These terms are the most frequent words/combinations clustered around central words such as 'virus', 'infection', 'inflammation', 'pneumonia' and 'lungs'. To create the concept dictionary, a variety of clusters were generated by experimenting with different hierarchical clustering algorithms applied to the collection of PubMed titles/abstracts. Links to all PubMed abstracts associated with these 'drug-concept' pairs were generated and enumerated using Python scripts. PubMed search queries were created according to PubMed query syntax (48) and MeSH terms (49). The automatically generated tables (available in the 'DOWNLOADS' section) list the number of retrieved PubMed publications corresponding to each 'drug-COVID-19 concept' pair. These numbers are hyperlinked to the corresponding PubMed publications. Links to references and the Carrot2 text clustering and visualization tool can be updated on a regular basis. Such updates are necessary because the web and the PubMed database are constantly expanding, with new references and sites appearing daily. All desired data can be downloaded from the COVID19 Drug Repository website as Excel tables, which include the list of keywords (i.e. the 'dictionary') used for text-mining and mapping. In subsequent versions of the database, users will be able to modify the list or introduce additional concepts. With this simple mapping tool, one can discover and visualize new concepts and associations that would not otherwise be found.
Currently, there are 384 mapped drug names mentioned in 960 COVID-19 clinical studies (data retrieved on August 15, 2020), with at least 1 drug intervention (Table 2). None of these drugs are novel.
Rather, they exemplify a 'drug repurposing/repositioning' approach (26, 34, 50-52). Recently, numerous COVID19-specific web pages and chemical libraries have been created by different research organizations (CAS, IUPHAR (53, 54), ChEMBL, OpenData Portal (https://opendata.ncats.nih.gov/covid19/index.html), etc.) and companies (MedChem Express), and used for high-throughput screening against SARS-CoV-2 infection (55-63). All these molecular libraries and collections (Figure 2, Table 2) are used in our data collection process (Figure 2) and are listed on the Repository web page ('Useful Links').
The Biopython/Entrez-based Python command-line script (as discussed in the Features and functionality section) was created to access the NCBI Gene database (45) and to automatically retrieve human or microbial (in particular, viral) genes associated with a given list of drugs or chemical substances. The output (Figure 4) provides a list of genes with a short description of the biological role of each gene product. The genes associated with a set of drugs can be analysed, clustered, or used as input for building 'drug-gene' networks, which can then be visualized using external programs. As a working example, protein-protein association networks were built for output gene sets using the STRING v11 database (64). Moreover, the application programming interface (API) implemented in the STRING v11 database enables efficient interaction of external databases with the STRING visualization and analysis tools (64). For example, visualization of the set of genes associated with the PDE5A inhibitor sildenafil, a vasodilator agent, revealed other interesting targets (Figure 4), such as the enzyme PDE6G. Both enzymes are active in the lungs (65, 66).
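A drug-to-gene lookup of the kind described above might be sketched as follows with Bio.Entrez. This is an illustrative outline rather than the authors' script: the helper names, the search-term pattern and the default organism are assumptions, and the sketch returns only NCBI Gene identifiers, not the gene descriptions shown in Figure 4.

```python
def build_gene_term(drug, organism="Homo sapiens"):
    """Compose an NCBI Gene search term for one drug name and one organism."""
    return f'{drug}[All Fields] AND "{organism}"[Organism]'


def fetch_gene_ids(drug, email, organism="Homo sapiens", retmax=20):
    """Return NCBI Gene IDs whose records mention the drug.

    Requires Biopython and network access; 'email' is the contact address
    that NCBI asks E-utilities users to supply.
    """
    # Imported lazily so that build_gene_term works without Biopython installed.
    from Bio import Entrez

    Entrez.email = email
    handle = Entrez.esearch(db="gene", term=build_gene_term(drug, organism), retmax=retmax)
    try:
        record = Entrez.read(handle)
    finally:
        handle.close()
    return list(record["IdList"])


# Search term for the working example used in the text.
sildenafil_term = build_gene_term("sildenafil")
```

Calling `fetch_gene_ids("sildenafil", email="you@example.org")` (a placeholder address) would query the Gene database over the network; the returned identifiers could then be passed to an external tool such as STRING for network building.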
Further network analysis of available data showed that PDE5A/PDE6G inhibition by sildenafil in lung blood vessels can trigger different anti-inflammatory pathways. To build a network (Figure 5), we extracted additional information from the literature and external databases. In the PDE5A and PDE6G protein expression summaries obtained from the Human Protein Atlas (67), the PDE6G gene is categorized as 'Group enriched' in natural killer (NK) cells, according to consensus transcriptomics data. NK cells, acting as cytotoxic lymphocytes, are involved in innate immune system regulation, including rapid cytokine production in the presence of virus-infected cells (67). The NK-mediated antiviral immune response is associated with the NCR1 gene, which encodes the natural cytotoxicity receptor 1 (68). In the next step, both the PDE6G and NCR1 genes were detected in the Chronic Obstructive Pulmonary Disease (COPD)-related Gene Set using Harmonizome on the collection of 'omics' Big Data sets (69). This gene set was deposited in the GEO Signatures of Differentially Expressed Genes for Diseases, under the name 'COPD-Chronic Obstructive Pulmonary Disease Muscle-Striated (Skeletal)-Diaphragm (MMHCC) GSE47'. The data show that the expression of the PDE6G gene is significantly increased, whereas decreased expression was reported for the NCR1 gene. Therefore, in the context of drug repurposing strategies, it is reasonable to expect that sildenafil will be useful for the treatment of COVID19 complications. Accordingly, two recent clinical studies were initiated to study the efficacy and safety of sildenafil in patients with COVID-19 (ClinicalTrials.gov identifier NCT04304313), and to assess the role of sildenafil in improving oxygenation among hospitalised patients (ClinicalTrials.gov identifier NCT04489446).
We extracted target gene information for each putative COVID-19 drug from the Therapeutic Target Database (70), and by text-mining of the literature at PubMed.
To understand the expression profile of the drug target genes identified in this manner during COVID-19 infection, we performed transcriptome analysis of infected bronchial epithelial cells. For this, we retrieved raw RNA-sequencing data for SARS-CoV-2-infected bronchial epithelial cells from the Sequence Read Archive (SRA) database under accession no. PRJNA615032 (71). The FASTQ files were mapped and aligned to the hg38 reference genome using STAR (72). Differentially expressed genes were identified using edgeR (73), with cut-offs of a 2.0-fold change and a P-value < 0.05. We thus found target genes for 41 drugs from our database (e.g. sildenafil, as discussed in the previous paragraph) that are significantly differentially expressed during COVID-19 infection. These drug-gene pairs are given in Supplemental Table S2.
Figure 5. The Drug-Gene local Network built using the output for sildenafil (A). Further functional enrichment reveals the group of genes involved in the innate immunity and inflammatory processes. Moreover, VDR (receptors activated by calcitriol), and the HIF1A-STAT3 path suppressed by resveratrol, are also involved in the local anti-inflammatory pharmacological network, thereby suggesting the "sildenafil-resveratrol-vitamin D3" drug combination to treat the COVID-19 complications.
The COVID19 Drug Repository server was built on an Apache web server and deployed on a Red Hat Enterprise Linux (RHEL) 7.4 server with an Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz and 32 GB of RAM. The COVID19 Drug Repository has a code base and infrastructure similar to that of the ChiTaRS database (74, 75). The COVID19 Drug Repository website is compatible with modern web browsers (such as Chrome, Firefox, Microsoft Edge, Opera and Safari), provided that JavaScript is enabled. We recommend using the latest release version of these web browsers for optimal rendering.
The COVID19 Drug Repository not only provides extended 'Search' options but also offers the possibility to download all database tables and data sets in a user-friendly manner. The repository is available at: http://covid19.md.biu.ac.il/.
The COVID19 Drug Repository maps data from chemogenomics and pharmacogenomics studies and provides viral and human genomics and proteomics information on approved drugs and other therapeutics. The database enables the user to focus on different levels of complexity, starting from general information, clinical trials and formulations, and increasing the resolution to the level of molecular mechanisms of drug action. Therefore, the database can serve as a navigation and recommendation tool both for research and for healthcare purposes. Future plans include the following additions to the database: (a) continuous updating with new data on approved drugs, experimental drugs, and drug-like synthetic or natural chemical substances; (b) automatic machine learning and text-mining-based annotation and visualization of 'Mode of Action' (MoA) data, as well as 'Drug-Gene' and 'Drug-Symptom' networks; and (c) incorporation of the Drugs/NGS analysis tools ('transcriptomics') to accelerate the translation of knowledge for use as personalized medicine for COVID19 patients.
http://covid19.md.biu.ac.il/.
Keep up with the latest coronavirus research
CORD-19: The Covid-19 Open Research Dataset
PubTator: a web-based text mining tool for assisting biocuration
TREC-COVID: rationale and Structure of an Information Retrieval Shared Task for COVID-19
Neural networks for open and closed Literature-based Discovery
Rapidly deploying a neural search engine for the covid-19 open research dataset: Preliminary thoughts and lessons learned
Data and text mining help identify key proteins involved in the molecular mechanisms shared by SARS-CoV-2 and HIV-1
Accelerating literature curation with text-mining tools: a case study of using PubTator to curate genes in PubMed abstracts
PubTator central: automated concept annotation for biomedical full text articles
Conceptual Clustering Using Lingo Algorithm: Evaluation on Open Directory Project Data
ProtFus: a comprehensive method characterizing protein-protein interactions of fusion proteins
Omics and drug response
Chemical genomics: a systematic approach in biological research and drug discovery
Iconix Pharmaceuticals, Inc.-removing barriers to efficient drug discovery through chemogenomics
TDR targets: a chemogenomics resource for neglected diseases
TargetHunter: an in silico target identification tool for predicting therapeutic potential of small organic molecules based on chemogenomic database
Pharmacogenomics: the promise of personalized medicine
Pharmacogenomics: bench to bedside
Pharmacogenomics. Going from genome to pill
Progress in pharmacogenomics and its promise for medicine
Pharmacogenomics: a systems approach
Pharmacogenomics in early-phase clinical development
Systems pharmacogenomic landscape of drug similarities from LINCS data: Drug Association Networks
How to find the right drug for each patient? Advances and challenges in pharmacogenomics
Implications of pharmacogenomics for drug development
MOLI: multi-omics late integration with deep neural networks for drug response prediction
Using what we already have: uncovering new drug repurposing strategies in existing omics data
Genomics, other "Omic" technologies, personalized medicine, and additional biotechnology-related techniques
TDR Targets 6: driving drug discovery for human pathogens through intensive chemogenomic data integration
Virus-CKB: an integrated bioinformatics platform and analysis resource for COVID-19 research
Extending the small-molecule similarity principle to all levels of biology with the chemical checker
Structure-based drug repositioning over the human TMPRSS2 protease domain: search for chemical probes able to repress SARS-CoV-2 Spike protein cleavages
A SARS-CoV-2 protein interaction map reveals targets for drug repurposing
Pharmacogenomics of COVID-19 therapies
DrugBank: a comprehensive resource for in silico drug discovery and exploration
DrugBank 5.0: a major update to the DrugBank database
PubChem 2019 update: improved access to chemical data
The IUPHAR/BPS Guide to PHARMACOLOGY in 2020: extending immunopharmacology content and introducing the IUPHAR/MMV Guide to MALARIA PHARMACOLOGY
Chemical Abstracts Service approach to management of large data bases
Chemical Abstracts Service Chemical Registry System: history, scope, and impacts
In: MySQL Reference Manual: Documentation From the Source
Biopython: freely available Python tools for computational molecular biology and bioinformatics
Database resources of the National Center for Biotechnology Information
Programming techniques: regular expression search algorithm
Carrot2 and Language Properties in Web Search Results Clustering
E-utilities Quick Start. Entrez Programming Utilities Help
Medical subject headings (MeSH)
Computational Drug Repositioning: a lateral approach to traditional drug discovery?
Web-based drug repurposing tools: a survey
Drug repurposing using deep embeddings of gene expression profiles
A rational roadmap for SARS-CoV-2/COVID-19 pharmacotherapeutic research and development: IUPHAR Review 29
Coronavirus Information. IUPHAR/BPS Guide to Pharmacology
Identification of antiviral drug candidates against SARS-CoV-2 from FDA-approved drugs
A large-scale drug repositioning survey for SARS-CoV-2 antivirals
An OpenData portal to share COVID-19 drug repurposing data in real time
Morphological cell profiling of SARS-CoV-2 infection identifies drug repurposing candidates for COVID-19
Human organs-on-chips for virology
Broad anti-coronaviral activity of FDA approved drugs against SARS-CoV-2
In vitro screening of a FDA approved chemical library reveals potential inhibitors of SARS-CoV-2 replication
Identification of inhibitors of SARS-CoV-2 in-vitro cellular toxicity in human (Caco-2) cells using a large scale drug repurposing collection
Identification of potential treatments for COVID-19 through artificial intelligence-enabled phenomic analysis of human cells infected with SARS-CoV-2
STRING v11: protein-protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets
Phosphodiesterase 6 subunits are expressed and altered in idiopathic pulmonary fibrosis
PDE5A inhibition attenuates bleomycin-induced pulmonary fibrosis and pulmonary hypertension through inhibition of ROS generation and RhoA/Rho kinase activation
Proteomics. Tissue-based map of the human proteome. Science
Molecular cloning of NKp46: a novel member of the immunoglobulin superfamily involved in triggering of natural cytotoxicity
The harmonizome: a collection of processed datasets gathered to serve and mine knowledge about genes and proteins
Therapeutic target database 2020: enriched resource for facilitating research and early development of targeted therapeutics
Imbalanced host response to SARS-CoV-2 drives development of COVID-19
STAR: ultrafast universal RNA-seq aligner
edgeR: a Bioconductor package for differential expression analysis of digital gene expression data
ChiTaRS 5.0: the comprehensive database of chimeric transcripts matched with druggable fusions and 3D chromatin maps
ChiTaRS: a database of human, mouse and fruit fly chimeric transcripts and RNA-sequencing data
ACKNOWLEDGEMENTS
E.L. has been working as a volunteer at the Frenkel-Morgenstern's lab in the COVID-19 related project.