key: cord-0824420-uce7xhem
authors: Tworowski, Dmitry; Gorohovski, Alessandro; Mukherjee, Sumit; Carmi, Gon; Levy, Eliad; Detroja, Rajesh; Mukherjee, Sunanda Biswas; Frenkel-Morgenstern, Milana
title: COVID19 Drug Repository: text-mining the literature in search of putative COVID19 therapeutics
date: 2020-11-09
journal: Nucleic Acids Res
DOI: 10.1093/nar/gkaa969
sha: 7590a68ef9d6a97da1f39d8ea7093785e4830c46
doc_id: 824420
cord_uid: uce7xhem

The recent outbreak of COVID-19 has generated an enormous amount of Big Data. To date, the COVID-19 Open Research Dataset (CORD-19) lists ∼130,000 articles from the WHO COVID-19 database, PubMed Central, medRxiv, and bioRxiv, as collected by Semantic Scholar. According to LitCovid (11 August 2020), ∼40,300 COVID19-related articles are currently listed in PubMed. It has been shown in clinical settings that the analysis of past research results and the mining of available data can provide novel opportunities for the successful application of currently approved therapeutics and their combinations for the treatment of conditions caused by a novel SARS-CoV-2 infection. As such, effective responses to the pandemic require the development of efficient applications, methods and algorithms for data navigation, text-mining, clustering, classification, analysis, and reasoning. Thus, our COVID19 Drug Repository represents a modular platform for drug data navigation and analysis, with an emphasis on COVID-19-related information currently being reported. The COVID19 Drug Repository enables users to focus on different levels of complexity, starting from general information about (FDA-)approved drugs, PubMed references, clinical trials and recipes, and extending to descriptions of the molecular mechanisms of drug action. Our COVID19 Drug Repository provides an up-to-date collection of drugs that have been repurposed for COVID19 treatment around the world.

The COVID-19 pandemic outbreak has triggered immediate reactions from the medical and scientific communities, and has resulted in an explosive growth of novel data regarding possible therapies or therapeutic opportunities (1, 2). The COVID-19 data portal (https://www.covid19dataportal.org/), established by the European Commission in April 2020, has facilitated the exchange and sharing of COVID-19 research data. One of the first open initiatives realized with the creation of this portal was the development of the COVID-19 Open Research Dataset (CORD-19) (2). The CORD-19 (https://www.semanticscholar.org/cord19) currently lists ∼130,000 articles from the WHO COVID-19 database, PubMed Central, medRxiv, and bioRxiv, as collected by Semantic Scholar. Another comprehensive list of COVID-19 databases and journals can be found on the Centers for Disease Control and Prevention (CDC) library webpage: https://www.cdc.gov/library/researchguides/2019novelcoronavirus/databasesjournals.html.

According to recent records from the LitCovid resource (1), ∼40,300 COVID19-related articles are currently listed in PubMed. The rapid accumulation of COVID-19 literature requires novel tools for data collection and organization with efficient navigation capabilities. Such navigation capabilities are based on the literature-based discovery (LBD) concept (3) and can be achieved by implementing text-mining, clustering, and classification methods (1, 4–8). Available text and data-mining tools, such as those found at LitCovid (1), PubTator (4, 9, 10), the iSearch platform (https://icite.od.nih.gov/covid19/search/), Neural-Covidex (https://covidex.ai/) (7), the COVID-19 Data Portal (https://www.covid19dataportal.org/), Carrot/Lingo (https://search.carrot2.org/#/web) (11) and ProtFus (12), efficiently extract target information across articles and other text sources. Using these text-mining tools, we have created the COVID-19 Drug Repository.

The goal of our COVID-19 Drug Repository was to automatically collect data on drugs used against COVID-19 around the world and build a structured repository that includes drug descriptions, side effects and available publications. The repository also contains medicine- and pharmacology-oriented data, including annotated information on (FDA-)approved drugs, therapeutic agents (experimental drugs), and drug-like synthetic or natural chemical substances. The data were collected and integrated by methods developed for the 'omics' field (13), in particular, chemogenomics (i.e. chemical genomics) (14–17), pharmacogenomics (18–27), genomics and personomics (28–30). In addition, we made use of a number of chemogenomics (31–33) and pharmacogenomics (34–36) approaches that focused on the repositioning (i.e. repurposing) of FDA-approved drugs and on clinical trials in the treatment of COVID-19. All the data collected in the COVID-19 Drug Repository are intended for use by researchers and clinicians in the field. The information cannot be used for self-medication!

The COVID-19 Drug Repository: structure and technical description

The Repository is an open-source modular platform built on the MySQL server platform, comprising 15 curated tables. The structure of the database, presenting the logical relations between these tables, and the data collection process are shown in Figures 1 and 2, respectively. To ensure consistency between the drug syn, drug recipe, covid salt, drug link, drug pubmed, Clinicaltrial, text mining and covid drug tables, the insertion/update/deletion of rows is linked to the covid drug table. Each covid drug entry is linked to 15 data fields corresponding to drug data and a target. Most of the data fields (i.e. ACC id, UNII, CAS1, CAS2, CAS3, and PubChem cid) are hyperlinked to other databases, i.e. DrugBank (37, 38), ClinicalTrials.gov, PubChem (39), IUPHAR/BPS (40) and Chemical Abstracts Service (41, 42) (Figures 1 and 2, Table 1). Each covid recipe entry is associated with 12 data fields, including the drug formulation (recipe), the country of manufacture, FDA-approved drugs, guidelines, etc. The COVID-19 Drug Repository supports text query inputs using the search box on the homepage (Figure 3). The MyISAM engine (43) is used to support the FULLTEXT search functionality, with the 'utf8' DEFAULT CHARSET. Detailed instructions on the browsing and search tools found in the database are provided below and can also be found on the database homepage (under the 'Help' option). Finally, the database update process is semi-automatic, as follows: (a) selection of potential COVID19 therapeutic substances found in research articles is manual; (b) updating and adding new records to the COVID19 Drug Repository database is fully automated using Perl scripts; (c) hyperlinks to PubMed and other sources, and maps, are generated automatically by Python scripts.
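The idea of tying row changes in the satellite tables to the central drug table can be illustrated with a minimal sketch. It uses an in-memory SQLite database from the Python standard library as a stand-in for MySQL, and the table and column names (covid_drug, drug_syn, drug_id, synonym) are assumed spellings chosen for illustration, not the Repository's actual schema. Note also that MyISAM does not enforce foreign keys, so the production database presumably achieves this consistency by other means (e.g. in the update scripts); the cascade shown here is only one possible way to express the same constraint.

```python
import sqlite3

# In-memory SQLite database, used purely for illustration (the Repository runs on MySQL).
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

# Hypothetical, heavily reduced versions of two of the tables named in the text.
conn.executescript(
    """
    CREATE TABLE covid_drug (
        drug_id   INTEGER PRIMARY KEY,
        drug_name TEXT NOT NULL UNIQUE
    );
    CREATE TABLE drug_syn (
        syn_id  INTEGER PRIMARY KEY,
        drug_id INTEGER NOT NULL REFERENCES covid_drug(drug_id) ON DELETE CASCADE,
        synonym TEXT NOT NULL
    );
    """
)

# Illustrative rows only.
conn.execute("INSERT INTO covid_drug (drug_id, drug_name) VALUES (1, 'sildenafil')")
conn.execute("INSERT INTO drug_syn (drug_id, synonym) VALUES (1, 'Viagra')")
conn.commit()


def synonym_count(connection):
    """Return the number of rows currently stored in drug_syn."""
    return connection.execute("SELECT COUNT(*) FROM drug_syn").fetchone()[0]


# A synonym that points at a non-existent drug is rejected.
try:
    conn.execute("INSERT INTO drug_syn (drug_id, synonym) VALUES (99, 'orphan')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

# Deleting the parent drug removes its dependent synonym rows as well.
before = synonym_count(conn)
conn.execute("DELETE FROM covid_drug WHERE drug_id = 1")
conn.commit()
after = synonym_count(conn)

print(orphan_rejected, before, after)
```

With this design, the central table is the single point of truth: a satellite row cannot exist without its parent drug entry, and removing a drug cleans up everything that refers to it.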

The database search is case-insensitive. Simple queries can include either the full or partial drug name ('Drug search' box). Advanced queries can be constructed by combining identifiers of the various databases (Table 1) and/or concepts (e.g. compound class, viral target, etc.). The 'compound class', for example, includes the following terms:

Antibody | Metabolite | Natural product | Inorganic | Peptide | Synthetic organic, while the 'viral target' category comprises acronyms of virus names: BCV | BtCoV-

The search results page shows all relevant instances associated with a query drug. The HTML page is generated with hyperlinks to all databases (Table 1) associated with the drug of interest. Alternatively, the information can also be accessed by selecting a drug name from the menu in the selection field (see the Database 'Help' page: http://covid19.md.biu.ac.il/). The Treatment Options section enables access to a detailed description of the query drug.
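The behaviour of a simple query (case-insensitive, full or partial drug name) combined with a concept filter such as compound class can be sketched as follows. This is a minimal in-memory illustration, not the Repository's search code, which relies on MySQL FULLTEXT search; the records, the field names and the function search_drugs are hypothetical.

```python
from typing import Dict, List, Optional

# Illustrative records only; field names and class assignments are assumptions.
DRUGS: List[Dict[str, str]] = [
    {"name": "Sildenafil", "compound_class": "Synthetic organic"},
    {"name": "Resveratrol", "compound_class": "Natural product"},
    {"name": "Calcitriol", "compound_class": "Metabolite"},
]


def search_drugs(query: str, compound_class: Optional[str] = None) -> List[str]:
    """Return drug names containing `query`, ignoring case.

    If `compound_class` is given, only drugs of that class are returned
    (the class name is also compared case-insensitively).
    """
    needle = query.strip().casefold()
    wanted = compound_class.casefold() if compound_class else None
    hits = []
    for record in DRUGS:
        if needle not in record["name"].casefold():
            continue
        if wanted is not None and record["compound_class"].casefold() != wanted:
            continue
        hits.append(record["name"])
    return hits


print(search_drugs("SILD"))                                 # partial, upper-case query
print(search_drugs("r", compound_class="natural product"))  # name fragment plus class filter
```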

The Repository is a COVID19-targeted collection (shortlist) of ∼460 items representing 184 approved drugs, 384 investigated therapeutic agents and 76 drug-like synthetic or natural chemical substances. The main focus of the repository is cross-referencing to PubMed articles linking these drugs with multiple research sources, mapping associations between the drugs and COVID19-related concepts, text/data-mining, clustering, and visualization. Furthermore, the Biopython collection of modules (44), in particular the Bio.Entrez module (44), was integrated into our database framework to enable fast data retrieval via efficient command-line interactions with all NCBI resources and databases (45), which are sub-divided into six categories: Literature, Genes, Proteins, Genomes, Genetics, and Chemicals. A toolset of Perl and Python scripts was created for specific tasks, such as (a) automatic generation of links between the Repository and other sources, (b) collection of data/references and creation of work tables and (c) mapping associations/concepts, with visualization options realized via external tools.
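To make the kind of NCBI retrieval described here concrete, the following sketch builds a request URL for the E-utilities ESearch endpoint, which Bio.Entrez wraps, and parses the hit count from a JSON response. It uses only the Python standard library and performs no network access; the search term, the helper names and the sample response body are illustrative assumptions, not queries or results from the Repository.

```python
import json
from urllib.parse import urlencode

# Public NCBI E-utilities ESearch endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term: str, db: str = "pubmed", retmax: int = 20) -> str:
    """Build an ESearch URL for `term` in the given NCBI database."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


def parse_esearch_count(payload: str) -> int:
    """Extract the total number of hits from an ESearch JSON response body."""
    return int(json.loads(payload)["esearchresult"]["count"])


url = build_esearch_url("sildenafil AND COVID-19")
print(url)

# Made-up response body in the shape ESearch returns with retmode=json.
sample = '{"esearchresult": {"count": "3", "idlist": ["1", "2", "3"]}}'
print(parse_esearch_count(sample))
```

Changing the db argument (e.g. to "gene") targets a different NCBI database with the same request pattern.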

Recent information on approved drugs and therapeutic combinations thereof considered useful for the treatment Using this search strategy, we found ∼100 substances with activities associated with COVID-19. The queries and the patterns used, and the information obtained daily are the main sources for Repository updates.

All 'active' chemical entities/substances (i.e. those with demonstrated or proposed therapeutic potential) were collected and linked via their identifiers in CAS, PubChem, IUPHAR, etc. Furthermore, we adopted the web-based text clustering engine Carrot 2 (47) for visualization of pair-wise 'drug-COVID-19 concept' associations found in PubMed abstracts for each pair.

In this version of our database, COVID19 drugs were mapped to a dictionary of 21 terms related to concepts of 'COVID-19', e.g. 'viral infections', 'respiratory diseases', 'inflammatory cell', 'coronavirus pneumonia', etc. (Supplemental Table S1). These terms are the most frequent words/combinations clustered around central words such as 'virus', 'infection', 'inflammation', 'pneumonia' and 'lungs'. To create the dictionary of concepts, a variety of clusters were generated by experimenting with different hierarchical clustering algorithms applied to the collection of PubMed titles/abstracts. Links to all PubMed abstracts associated with these 'drug-concept' pairs were generated and enumerated using Python scripts. PubMed search queries were created according to PubMed query syntax (48) and MeSH terms (49). The automatically generated tables (available in the 'DOWNLOADS' section) list the number of retrieved PubMed publications corresponding to each 'drug-COVID-19 concept' pair. These numbers are hyperlinked to the corresponding PubMed publications. Links to references and the Carrot 2 text clustering and visualization tool can be updated on a regular basis. Such updates are necessary as the web and the PubMed database are constantly expanding, with new references and sites appearing daily. All desired data can be downloaded from the COVID19 Drug Repository website as Excel tables containing the list of keywords (i.e. the 'dictionary') used for text-mining and mapping. In subsequent versions of the database, users will be able to modify the list or introduce additional concepts. With this simple mapping tool, one can discover and visualize new concepts and associations that would not otherwise be found.
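The generation of one PubMed query and hyperlink per 'drug-concept' pair can be sketched as below. This is a minimal illustration under stated assumptions: restricting both terms to the [Title/Abstract] field is our choice for the example, the helper names are hypothetical, and the short drug and concept lists stand in for the full tables; the actual query patterns used by the Repository scripts are not reproduced here.

```python
from itertools import product
from urllib.parse import quote_plus

# Base URL of the public PubMed search page.
PUBMED_SEARCH = "https://pubmed.ncbi.nlm.nih.gov/?term="


def pair_query(drug: str, concept: str) -> str:
    """Compose a PubMed query requiring both terms in the title or abstract."""
    return f'"{drug}"[Title/Abstract] AND "{concept}"[Title/Abstract]'


def pair_links(drugs, concepts):
    """Return one row (drug, concept, query, url) per drug-concept pair."""
    rows = []
    for drug, concept in product(drugs, concepts):
        query = pair_query(drug, concept)
        rows.append(
            {
                "drug": drug,
                "concept": concept,
                "query": query,
                "url": PUBMED_SEARCH + quote_plus(query),
            }
        )
    return rows


# Small illustrative subsets of the drug list and the concept dictionary.
rows = pair_links(
    ["sildenafil", "resveratrol"],
    ["viral infections", "coronavirus pneumonia"],
)
for row in rows:
    print(row["drug"], "|", row["concept"], "|", row["url"])
```

The same query strings could be submitted to the E-utilities to obtain the publication count for each pair, which is the number shown in the downloadable tables.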

Currently, there are 384 mapped drug names mentioned in 960 COVID-19 clinical studies (data retrieved on August 15, 2020), with at least 1 drug intervention (Table 2). None of these drugs are novel. Rather, they exemplify a 'drug repurposing/repositioning' approach (26, 34, 50–52). Recently, numerous COVID19-specific web pages and chemical libraries have been created by different research organizations (CAS, IUPHAR (53, 54), ChEMBL, OpenData Portal (https://opendata.ncats.nih.gov/covid19/index.html), etc.) and companies (MedChem Express), and used for high-throughput screening against SARS-CoV-2 infection (55–63). All these molecular libraries and collections (Figure 2, Table 2) are being used in our data collection process (Figure 2), and are listed on the Repository web page ('Useful Links').

The Biopython/Entrez-based Python command-line script (as discussed in the Features and functionality section) was created to access the NCBI Gene database (45), and to automatically retrieve human or microbial (and in particular, viral) genes associated with a given list of drugs or chemical substances. The output (Figure 4) provides a list of genes with a short description of the biological role associated with each gene product in the output list. Those genes associated with a set of drugs can be analysed, clustered, or serve as input for building 'drug-gene' networks, and then be visualized using external programs. As a working example, protein-protein association networks were built for output gene sets using the STRING v11 database (64). Moreover, the application programming interface (API) implemented in the STRING v11 database enables efficient interaction of external databases with the STRING visualization and analysis tools (64). For example, visualization of the set of genes associated with the PDE5A inhibitor sildenafil, a vasodilator agent, revealed other interesting targets (Figure 4), such as the enzyme PDE6G. Both enzymes are active in the lungs (65, 66). Further network analysis of available data showed that PDE5A/PDE6G inhibition by sildenafil in lung blood vessels can trigger different anti-inflammatory pathways.
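As an illustration of how a gene set can be handed to the STRING API, the sketch below composes a request URL for the 'network' method. It is a minimal example under stated assumptions: it targets the current string-db.org endpoint, which may serve a later release than v11; the function name and the caller_identity value are hypothetical; and no request is actually sent.

```python
from urllib.parse import urlencode

# Base address of the STRING REST API (current release, not necessarily v11).
STRING_API = "https://string-db.org/api"


def string_network_url(genes, species=9606, output_format="tsv"):
    """Build a STRING 'network' request for a list of gene names.

    `species` is an NCBI taxonomy identifier (9606 = Homo sapiens).
    """
    params = {
        # STRING expects multiple identifiers separated by carriage returns.
        "identifiers": "\r".join(genes),
        "species": species,
        # Free-text tag identifying the calling script (hypothetical value).
        "caller_identity": "example_script",
    }
    return f"{STRING_API}/{output_format}/network?{urlencode(params)}"


url = string_network_url(["PDE5A", "PDE6G"])
print(url)
```

The tab-separated response lists pairwise associations with their scores, which can then be loaded into any external network visualization tool.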

To build a network (Figure 5), we extracted additional information from the literature and external databases. In the PDE5A and PDE6G protein expression summaries obtained from the Human Protein Atlas (67), the PDE6G gene is categorized as 'Group enriched' in natural killer (NK) cells, according to consensus transcriptomics data. NK cells, acting as cytotoxic lymphocytes, are involved in innate immune system regulation, including rapid cytokine production in the presence of virus-infected cells (67). The NK-mediated antiviral immune response is associated with the NCR1 gene, which encodes natural cytotoxicity receptor 1 (68). In the next step, both the PDE6G and NCR1 genes were detected in the Chronic Obstructive Pulmonary Disease (COPD)-related Gene Set using Harmonizome on the collection of 'omics' Big Data sets (69). This gene set was deposited in the GEO Signatures of Differentially Expressed Genes for Diseases, under the name 'COPD-Chronic Obstructive Pulmonary Disease Muscle-Striated (Skeletal)-Diaphragm (MMHCC) GSE47'. The data show that the expression of the PDE6G gene is significantly increased, whereas decreased expression was reported for the NCR1 gene. Therefore, in the context of drug repurposing strategies, it is reasonable to expect that sildenafil will be useful for the treatment of COVID19 complications. Accordingly, two recent clinical studies were initiated: one to study the efficacy and safety of sildenafil in patients with COVID-19 (ClinicalTrials.gov identifier NCT04304313), and one to assess the role of sildenafil in improving oxygenation among hospitalised patients (NCT04489446).

Figure 5. The Drug-Gene local network built using the output for sildenafil (A). Further functional enrichment reveals the group of genes involved in innate immunity and inflammatory processes. Moreover, VDR (receptors activated by calcitriol) and the HIF1A-STAT3 path suppressed by resveratrol are also involved in the local anti-inflammatory pharmacological network, thereby suggesting the "sildenafil-resveratrol-vitamin D3" drug combination to treat COVID-19 complications.

We extracted target gene information for each putative COVID-19 drug from the Therapeutic Target Database (70), and by text-mining of the literature at PubMed. To understand the expression profile of the drug target genes identified in this manner during COVID-19 infection, we performed transcriptome analysis of infected bronchial epithelial cells. For this, we retrieved raw RNA-sequencing data for SARS-CoV-2-infected bronchial epithelial cells from the Sequence Read Archive (SRA) database under accession no. PRJNA615032 (71). The FASTQ files were mapped and aligned to the hg38 reference genome using STAR (72). Differentially expressed genes were identified using edgeR (73), with cut-offs set at a 2.0-fold change and a P-value <0.05. We thus found target genes for 41 drugs from our database (e.g. sildenafil, as discussed in the previous paragraph) that are significantly differentially expressed during COVID-19 infection. These drug-gene pairs are given in Supplemental Table S2.
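The final filtering step, i.e. applying the fold-change and P-value cut-offs and then intersecting the resulting genes with the drug target lists, can be sketched as follows. This is a minimal illustration, not the pipeline used in the analysis: the gene and drug names and all numerical values are made up, the function names are hypothetical, and treating the 2.0-fold cut-off as inclusive in both directions (|log2 fold change| ≥ 1) is an assumption.

```python
import math

FOLD_CHANGE_CUTOFF = 2.0  # minimum fold change, applied to up- and down-regulation
P_VALUE_CUTOFF = 0.05     # maximum P-value


def is_differentially_expressed(log2_fc: float, p_value: float) -> bool:
    """Apply both cut-offs to one gene (fold change given on the log2 scale)."""
    return abs(log2_fc) >= math.log2(FOLD_CHANGE_CUTOFF) and p_value < P_VALUE_CUTOFF


def significant_drug_gene_pairs(de_table, drug_targets):
    """Return sorted (drug, gene) pairs whose target gene passes both cut-offs."""
    significant = {
        gene
        for gene, (log2_fc, p_value) in de_table.items()
        if is_differentially_expressed(log2_fc, p_value)
    }
    return sorted(
        (drug, gene)
        for drug, genes in drug_targets.items()
        for gene in genes
        if gene in significant
    )


# Made-up differential expression results: gene -> (log2 fold change, P-value).
de_table = {
    "GENE_A": (1.6, 0.001),   # passes both cut-offs
    "GENE_B": (0.4, 0.001),   # fold change too small
    "GENE_C": (-2.1, 0.2),    # P-value too large
    "GENE_D": (-1.0, 0.01),   # exactly 2-fold down-regulated, passes
}
# Made-up drug -> target gene lists.
drug_targets = {"drug_x": ["GENE_A", "GENE_B"], "drug_y": ["GENE_C", "GENE_D"]}

pairs = significant_drug_gene_pairs(de_table, drug_targets)
print(pairs)
```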

The COVID19 Drug Repository server was built on an Apache web server and deployed on a Red Hat Enterprise Linux (RHEL) 7.4 machine with an Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz and 32 GB of RAM. The COVID19 Drug Repository has a code base and infrastructure similar to that of the ChiTaRS database (74, 75). The COVID19 Drug Repository website is compatible with modern web browsers (such as Chrome, Firefox, Microsoft Edge, Opera and Safari), provided that JavaScript is enabled. We recommend using the latest release version of these web browsers for optimal rendering.

The COVID19 Drug Repository not only provides extended 'Search' options but also offers the possibility to download all database tables and data sets in a user-friendly manner. The repository is available at: http://covid19.md.biu.ac.il/.

The COVID19 Drug Repository maps data from chemogenomics and pharmacogenomics studies and provides viral and human genomics and proteomics information on approved drugs and other therapeutics. The database enables the user to focus on different levels of complexity, starting from general information, clinical trials and formulations, and increasing the resolution to the level of molecular mechanisms of drug action. Therefore, the database can serve as a navigation and recommendation tool both for research and for healthcare purposes. Future plans include the following additions to the database: (a) continuous updating with new data on approved drugs, experimental drugs, and drug-like synthetic or natural chemical substances; (b) automatic machine learning and text-mining-based annotation and visualization of 'Mode of Action' (MoA) data, as well as 'Drug-Gene', and 'Drug-Symptom' networks; and (c) incorporation of the Drugs/NGS analysis tools ('transcriptomics') to accelerate the translation of knowledge for use as personalized medicine for COVID19 patients.


Keep up with the latest coronavirus research

CORD-19: The Covid-19 Open Research Dataset

PubTator: a web-based text mining tool for assisting biocuration

TREC-COVID: rationale and Structure of an Information Retrieval Shared Task for COVID-19

Neural networks for open and closed Literature-based Discovery

Rapidly deploying a neural search engine for the covid-19 open research dataset: Preliminary thoughts and lessons learned

Data and text mining help identify key proteins involved in the molecular mechanisms shared by SARS-CoV-2 and HIV-1

Accelerating literature curation with text-mining tools: a case study of using PubTator to curate genes in PubMed abstracts

PubTator central: automated concept annotation for biomedical full text articles

Conceptual Clustering Using Lingo Algorithm: Evaluation on Open Directory Project Data

ProtFus: a comprehensive method characterizing protein-protein interactions of fusion proteins

Omics and drug response

Chemical genomics: a systematic approach in biological research and drug discovery

Iconix Pharmaceuticals, Inc.-removing barriers to efficient drug discovery through chemogenomics

TDR targets: a chemogenomics resource for neglected diseases

TargetHunter: an in silico target identification tool for predicting therapeutic potential of small organic molecules based on chemogenomic database

Pharmacogenomics: the promise of personalized medicine

Pharmacogenomics: bench to bedside

Pharmacogenomics. Going from genome to pill

Progress in pharmacogenomics and its promise for medicine

Pharmacogenomics: a systems approach

Pharmacogenomics in early-phase clinical development

Systems pharmacogenomic landscape of drug similarities from LINCS data: Drug Association Networks

How to find the right drug for each patient? Advances and challenges in pharmacogenomics

Implications of pharmacogenomics for drug development

MOLI: multi-omics late integration with deep neural networks for drug response prediction

Using what we already have: uncovering new drug repurposing strategies in existing omics data

Genomics, other "Omic" technologies, personalized medicine, and additional biotechnology-related techniques

TDR Targets 6: driving drug discovery for human pathogens through intensive chemogenomic data integration

Virus-CKB: an integrated bioinformatics platform and analysis resource for COVID-19 research

Extending the small-molecule similarity principle to all levels of biology with the chemical checker

Structure-based drug repositioning over the human TMPRSS2 protease domain: search for chemical probes able to repress SARS-CoV-2 Spike protein cleavages

A SARS-CoV-2 protein interaction map reveals targets for drug repurposing

Pharmacogenomics of COVID-19 therapies

DrugBank: a comprehensive resource for in silico drug discovery and exploration

DrugBank 5.0: a major update to the DrugBank database

PubChem 2019 update: improved access to chemical data

The IUPHAR/BPS Guide to PHARMACOLOGY in 2020: extending immunopharmacology content and introducing the IUPHAR/MMV Guide to MALARIA PHARMACOLOGY

Chemical Abstracts Service approach to management of large data bases

Chemical Abstracts Service Chemical Registry System: history, scope, and impacts

MySQL Reference Manual: Documentation From the Source

Biopython: freely available Python tools for computational molecular biology and bioinformatics

Database resources of the National Center for Biotechnology Information

Programming techniques: regular expression search algorithm

Carrot2 and Language Properties in Web Search Results Clustering

E-utilities Quick Start. Entrez Programming Utilities Help

Medical subject headings (MeSH)

Computational Drug Repositioning: a lateral approach to traditional drug discovery?

Web-based drug repurposing tools: a survey

Drug repurposing using deep embeddings of gene expression profiles

A rational roadmap for SARS-CoV-2/COVID-19 pharmacotherapeutic research and development: IUPHAR Review 29

Coronavirus Information. IUPHAR/BPS Guide to Pharmacology

Identification of antiviral drug candidates against SARS-CoV-2 from FDA-approved drugs

A large-scale drug repositioning survey for SARS-CoV-2 antivirals

An OpenData portal to share COVID-19 drug repurposing data in real time

Morphological cell profiling of SARS-CoV-2 infection identifies drug repurposing candidates for COVID-19

Human organs-on-chips for virology

Broad anti-coronaviral activity of FDA approved drugs against SARS-CoV-2

In vitro screening of a FDA approved chemical library reveals potential inhibitors of SARS-CoV-2 replication

Identification of inhibitors of SARS-CoV-2 in-vitro cellular toxicity in human (Caco-2) cells using a large scale drug repurposing collection

Identification of potential treatments for COVID-19 through artificial intelligence-enabled phenomic analysis of human cells infected with SARS-CoV-2

STRING v11: protein-protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets

Phosphodiesterase 6 subunits are expressed and altered in idiopathic pulmonary fibrosis

PDE5A inhibition attenuates bleomycin-induced pulmonary fibrosis and pulmonary hypertension through inhibition of ROS generation and RhoA/Rho kinase activation

Proteomics. Tissue-based map of the human proteome

Molecular cloning of NKp46: a novel member of the immunoglobulin superfamily involved in triggering of natural cytotoxicity

The harmonizome: a collection of processed datasets gathered to serve and mine knowledge about genes and proteins

Therapeutic target database 2020: enriched resource for facilitating research and early development of targeted therapeutics

Imbalanced host response to SARS-CoV-2 drives development of COVID-19

STAR: ultrafast universal RNA-seq aligner

edgeR: a Bioconductor package for differential expression analysis of digital gene expression data

ChiTaRS 5.0: the comprehensive database of chimeric transcripts matched with druggable fusions and 3D chromatin maps

ChiTaRS: a database of human, mouse and fruit fly chimeric transcripts and RNA-sequencing data

ACKNOWLEDGEMENTS

E.L. has been working as a volunteer in the Frenkel-Morgenstern lab on the COVID-19-related project.