key: cord-1019582-t6sfy48y
authors: Kuleshov, Maxim V.; Stein, Daniel J.; Clarke, Daniel J.B.; Kropiwnicki, Eryk; Jagodnik, Kathleen M.; Bartal, Alon; Evangelista, John E.; Hom, Jason; Cheng, Minxuan; Bailey, Allison; Zhou, Abigail; Ferguson, Laura B.; Lachmann, Alexander; Ma’ayan, Avi
title: The COVID-19 Drug and Gene Set Library
date: 2020-07-25
journal: Patterns (N Y)
DOI: 10.1016/j.patter.2020.100090
sha: 1a6c262928339790eff3f28c19af979d642d59ca
doc_id: 1019582
cord_uid: t6sfy48y

[Figure: see text]

Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is a novel coronavirus that causes coronavirus disease 2019 (COVID-19). Globally, there are more than 11 million confirmed COVID-19 cases and ~530,000 reported deaths (as of July 3, 2020). Many biomedical researchers have shifted their efforts to battling the COVID-19 pandemic. One area of activity is computationally prioritizing and experimentally testing approved and experimental drugs for repurposing as candidate therapies for COVID-19. Drug repurposing studies present a promising avenue for quickly offering a treatment because these drugs have known safety profiles. So far, drug repurposing studies can be categorized into two groups: in-vitro screens 1-6 and computational predictions. Computational predictions are mostly based on structural biology methods 7-10, but some are based on network analysis and transcriptomics 11-13. Few studies have validated top computational predictions in cell-based assays 7, 11, 12. The lists of drugs mentioned in these studies can be analyzed for consensus, while identified drugs can be grouped by their type.

At the same time, many researchers are attempting to understand the molecular mechanisms of the SARS-CoV-2 life cycle. Much attention has been given to studies that applied mass-spectrometry proteomics and phosphoproteomics.
These methods identify host proteins that interact with each of the SARS-CoV-2 proteins 12, or proteins that are differentially phosphorylated before and after SARS-CoV-2 infection 14. Another important dataset produced RNA-seq gene expression signatures from various relevant human cell lines, ferret lungs, and human lung biopsies before and after SARS-CoV-2 infection 15. These are just a few examples of the many studies that produce gene sets that can be organized and compared.

In the past, we developed a crowdsourcing project where we asked the community to identify gene expression signatures from drug, gene, and disease perturbations 16. The collection of over 6,000 signatures, gathered with the help of >70 contributors from around the world, enabled us to produce a useful database called CREEDS (https://amp.pharm.mssm.edu/CREEDS/). Similarly, for this project, we developed a crowdsourcing project to integrate drug and gene sets related to COVID-19 research, collected with the assistance of the research community. The resource is delivered as a web-based platform that has already been accessed by >1,500 unique users.

So far, we have collected 125 drug sets composed of 1,474 unique drugs, and 424 gene sets consisting of 17,090 unique human genes. These are presented to users via the COVID-19 Drug and Gene Set Library website in several sortable and searchable tables (Fig. 1). The drug sets are subdivided into two categories: experimental (n=26) and computational (n=79). The top 20 most frequent drugs and genes across all sets are displayed in Fig. 2A-C. The experimental drugs with the most supporting evidence are remdesivir, chloroquine, hydroxychloroquine, and mefloquine (Fig. 2A). Although hydroxychloroquine, chloroquine, and remdesivir have received a lot of media attention and are being tested in many clinical trials, mefloquine has received far less attention. Mefloquine, just like hydroxychloroquine and chloroquine, is an anti-malaria drug 17.
However, it has a different chemical structure and it is known to act via different mechanisms. The top 20 most commonly computationally predicted drugs include several known antivirals such as ritonavir, darunavir, lopinavir, and ribavirin (Fig. 2B). This might be due to their pre-selection as candidates for computational docking. The top 20 most frequently submitted genes are all members of the innate immune response (Fig. 2C). These genes include the typical interferon and cytokine response genes observed to be involved in the response of human cells to most pathogens.

While most of the drug sets in the library are from studies that utilized computational methods, several key studies are from large-scale drug screens that include mostly FDA-approved drugs 1-6. Using a Venn diagram, we compared the results from these six in-vitro SARS-CoV-2 drug screen studies (Fig. 3). Overall, there is little overlap across these screens: only 11 drugs are shared across two or more studies (Table 1). Namely, the drugs that appear as hits in more than one screen, in addition to remdesivir and chloroquine, are: mefloquine, clofazimine, acitretin, gilteritinib, hexachlorophene, niclosamide, tetrandrine, tioguanine, and almitrine. Clofazimine is the only drug that appeared as a hit in 3 out of the 6 screens. Clofazimine is a drug used to treat leprosy, and its mechanisms of action suggest that it interferes with DNA synthesis 18. Acitretin is an anti-inflammatory second-generation retinoid that is used to treat severe psoriasis; it is a metabolite of etretinate 19. Almitrine is a drug that stimulates respiration by activating receptors of the carotid bodies 20. It is used in the treatment of chronic obstructive pulmonary disease 21, and as such it is relevant to COVID-19 symptoms. It should be noted that remdesivir appears as a hit in all six screens, but it was pre-selected as a positive control in half of the studies.
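The overlap count described above (drugs reported as hits in two or more screens) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `shared_hits`, the name normalization by lowercasing, and the toy screen contents are all assumptions introduced here, and the placeholder drug names are not data from the study.

```python
from collections import Counter


def shared_hits(screens, min_screens=2):
    """Return {drug: number_of_screens} for drugs reported in at least min_screens screens."""
    counts = Counter()
    for hits in screens.values():
        # Normalize names and de-duplicate within a screen so each screen counts once.
        counts.update({name.strip().lower() for name in hits})
    return {drug: n for drug, n in counts.items() if n >= min_screens}


# Toy input with placeholder names (hypothetical, not data from the paper).
toy_screens = {
    "screen_1": ["Drug A", "Drug B"],
    "screen_2": ["drug a", "Drug C"],
    "screen_3": ["Drug A ", "drug c", "Drug D"],
}
overlap = shared_hits(toy_screens)
```

In practice, matching drug names across screens usually needs more than lowercasing (synonyms, salts, identifiers), so a real comparison would likely map names to a common identifier first.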
The limited overlap among the screens can be due to various reasons, including different assay types, cellular contexts, inclusion criteria, original library content, and laboratory protocols. We carefully reviewed and compared the results from these screens, including the compounds screened, assay, drug concentrations used, incubation, infection MOI (multiplicity of infection), and hit criteria. These aspects are summarized (Table S1) and the final drug sets from each study are provided (Table S2). This analysis enabled us to compare the IC50 values reported for those drugs that appeared in multiple screens (Tables 2 and S3). Overall, we observe relative consistency of reported IC50 values across screens.

We also checked whether the hits from the six COVID-19 screens appeared as hits in previously published similar screens for other viruses and other diseases (Fig. 4, Tables S4-S5). We observe that the hits from the Jeon et al. study 2 overlap with several other screens that reported potential antivirals for Zika 22, Ebola 23, and MERS 24. This overlap may support the quality of the Jeon et al. screen.

Next, we examined whether any of the drugs considered as hits across the six COVID-19 screens contain pan-assay interference compound (PAINS) chemotypes 25. To achieve this, we compared the COVID-19 screen hits to a list of PAINS filters downloaded from ChEMBL 26 and checked whether any of the hits contain any one of the PAINS substructure chemotypes (Table S6). Six hits out of 195 total hits, namely eltrombopag, ketoconazole, phenazopyridine, posaconazole, SDZ-62-434, and Z-Leu-Val-Gly-diazomethylketone, contain such substructures; this level of overlap is not statistically significant (Fisher's exact test, p=0.57).
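To make the enrichment test above concrete, here is a minimal sketch of a two-sided Fisher's exact test written with the standard library only. The 2x2 layout (hits versus background compounds, with versus without a PAINS substructure) is an assumption about how such a test could be set up; the text reports only that 6 of 195 hits matched, so the background counts below are hypothetical placeholders and the resulting value is not expected to reproduce the reported p=0.57.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, col1, total = a + b, a + c, a + b + c + d
    denom = comb(total, col1)

    def pmf(k):
        # Hypergeometric probability of k in the top-left cell given fixed margins.
        return comb(row1, k) * comb(total - row1, col1 - k) / denom

    p_obs = pmf(a)
    lo, hi = max(0, row1 + col1 - total), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    tail = sum(p for p in map(pmf, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))
    return min(1.0, tail)


# Rows: screen hits, background compounds. Columns: PAINS match, no match.
# 6 and 189 follow from "6 of 195 hits"; 300 and 8500 are hypothetical background counts.
p_example = fisher_exact_two_sided(6, 189, 300, 8500)
```

The two-sided rule used here (summing all tables whose probability does not exceed that of the observed table) is one common convention; other definitions of the two-sided p-value exist and can give slightly different values.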
To further explore the molecular effects of the positive hits from the six in-vitro drug screens, and to demonstrate the utility of the collected library, we developed a case study that asks whether the hits from the six screens up- or down-regulate genes that are highly co-expressed with the ACE2 gene. ACE2 is the suspected cell surface receptor for SARS-CoV-2 27, and cells that do not express this gene have been shown to be less prone to SARS-CoV-2 infection. Since it is still undetermined whether it is desirable to up- or down-regulate the ACE2 expression module, we queried drugs from the published in-vitro drug screen hits against the LINCS L1000 data 28. We identified 61 drug hits from the six screens that have been profiled by the L1000 assay. There are two drugs that significantly up-regulate the ACE2 module (50 genes most correlated with ACE2 based on RNA-seq data from GEO) and one drug that significantly down-regulates these genes after p-value correction (FDR < 0.1) (up: homoharringtonine, 5.32e-09; alvocidib, 1.58e-05; down: tazarotene, 5.77e-02). Overall, 33 drugs on average up-regulate the ACE2 module and 28 down-regulate the module (Fig. 5), suggesting that up-regulating the ACE2 module might be more protective than harmful, which is counter-intuitive. However, the relatively balanced division of drugs that induce or suppress this module makes this assertion inconclusive.

The positive hits from the six COVID-19 drug screens can be used to train machine learning models that prioritize the hits and suggest additional compounds that strongly share features with these hits. Using gene expression (GE) and chemical structure (CS) features of the hits and of additional drugs and small molecules profiled via the L1000 assay, we implemented an Extra Trees (ET) classifier.
The ET classifier was able to predict hits from the six SARS-CoV-2 drug screens with an average AUROC of 0.76 across cross-validation splits, suggesting that GE and CS features are overall predictive of the types of compounds that could inhibit SARS-CoV-2 infection (Fig. 6A and 6B, Table S7). The lower value for the area under the precision-recall curve (AUPRC) can be explained by the class imbalance, which causes many non-hits to be ranked above known hits (Tables 3 and 4). Similar training and predictions were done using only GE signature features as input. In this case, the ET classifier achieved an average cross-validation AUROC of 0.66, which was lower than when CS features were also included, but still statistically significant (Fig. 6C and 6D, Table S8). It should be noted that the top-ranked predicted drugs are all from the same class of ATPase inhibitor cardiac drugs that have a similar structure and a similar gene expression signature effect in the L1000 assay. These drugs are over-represented in the Jeon et al. screen 2, so these initial results should be viewed with caution. The classifier also highly ranked lanatoside C, a drug identified as an active compound against MERS-CoV infection 24. This confirms that the machine learning method could prioritize compounds that are missed by these screens. In sum, this simple classification model is intended to demonstrate the potential for utilizing the drug sets collected for the library for machine learning applications.

Here we describe a platform created to collect drug and gene sets related to COVID-19 research using various methods of data accrual. All of the top-ranked most frequent genes that are associated with COVID-19 are part of the interferon pathway. This is consistent with our knowledge that the type I (IFN-α, IFN-β) and type III (IFN-λ) interferon systems are the primary defense against viral infections.
However, it was suggested that one of the evasion mechanisms of SARS-CoV-2 is to dampen the interferon response 15. It has been hypothesized that hyperinflammation in COVID-19 could drive disease severity and would be amenable to treatment with drugs that reduce inflammation 29, 30. However, this remains controversial, as the high level of antiviral response could reflect increased viral burden rather than an inappropriate host response 31.

The most striking result from the meta-analysis applied to the content of the library is the limited overlap across drug screen studies. One would expect experimental validation of drugs that inhibit SARS-CoV-2 in-vitro to be more consistent. The inconsistency across these studies could be due to a need to produce results quickly, given the urgency of discovering potential treatments. Regardless, there is some interesting overlap that cannot be explained by artifacts such as PAINS chemotypes. Hence, there is an expectation that more screens will be published, and that top leads will advance to animal models and human trials for further testing.

To prioritize compounds that may treat COVID-19, some researchers have used the strategy of finding drugs that modulate genes related to ACE2 gene expression 32. We found a few hits that also significantly up- or down-regulate the genes most correlated with ACE2. However, it is inconclusive whether up- or down-regulation of this module is beneficial. Finally, we demonstrated how the positive hits across the screens can be pooled to develop machine learning models that can further prioritize candidates based on accumulated direct experimental evidence.

It should be clear that the consensus analysis results should be viewed with caution. The most common drugs are not necessarily the most efficacious or promising treatments. At the same time, the most common genes may not be the most relevant to understanding COVID-19.
It should be noted that not all drug sets and gene sets are equal in quality and relevance. A list of computationally predicted drugs is not as useful for identifying a therapy for COVID-19 as a list of experimentally validated drugs. A list of genes up-regulated after SARS-CoV-2 infection of cells may provide more useful information about the virus life cycle than a list of genes returned from a PubMed search using the term SARS. Hence, the users of the data collected for the library should be aware of such limitations. With these limitations in mind, we hope that researchers will be able to develop or refine hypotheses.

In a period of rapid development of methods and data related to COVID-19 research, it is critical to provide means to organize the accumulated information in a way that allows it to be summarized and reused. The COVID-19 Drug and Gene Set Library provides such utility. The library of drug and gene sets can be used to identify community consensus and make researchers and clinicians aware of developments in new potential therapies as they become available, as well as allow the research community to work together towards a cure for COVID-19.

Geneshot 36 is a platform that can be used to convert PubMed searches into gene sets. Using Geneshot, gene sets associated with the search terms SARS, SARS-CoV, MERS-CoV, ACE2, and TMPRSS2 were created using both the AutoRIF and GeneRIF 37 options. Additionally, top COVID-19 drug repurposing candidates reported in recent literature were included as search terms. Predictions of additional genes potentially associated with the genes directly co-mentioned with these terms were also added to the database. These predictions were based on five strategies: co-occurrence via AutoRIF, GeneRIF 37, Enrichr 38, or Tagger 39, and co-expression using data from ARCHS4 40.
The COVID-19 Drug and Gene Set Library website has two sortable and searchable tables that list the drug and gene sets. Sorting can be based on the date of submission, alphabetical ordering, or list size. The tables are searchable via metadata terms such as title, authors, and descriptions, as well as via data search for specific drug or gene terms. Users can download each drug or gene set as well as the entire library. In addition, each gene set is provided with the option to perform gene set enrichment analysis with Enrichr 38, while genes are linked to Harmonizome 41 for further interrogation. Similarly, drug sets can be analyzed with DrugEnrichr, a drug set enrichment analysis tool. The individual drugs that map to known compounds are linked to their corresponding DrugBank landing pages 42.

The website enables users to submit drug and gene sets related to COVID-19 research by completing a simple form. The form includes a dataset title, a URL source, and a description that explains how the set is relevant to COVID-19 research. The submitter is also provided with mechanisms to add metadata terms that describe the cell type, tissue, organism, and other critical information about the submitted set. Users can specify the category of the additional metadata, allowing for a broad set of expanded annotations for each submitted set. Users can also submit their contact information; this information is kept private, but users can opt in to make it public. Once a user submits a contribution to the site, their dataset is directed to a review queue in which we manually examine the validity and relevance of the contribution. The reviewing process enables an administrator to approve or reject the submitted set. If approved, the set is added to the database. To make it easy for contributors to submit multiple sets, users can access the site via API.
The code behind the site is open source and available at: https://github.com/maayanlab/covid19_crowd_library

Drug sets extracted from the six in-vitro screens 1-6 were first identified. The drugs were matched to drugs profiled by the L1000 assay available from GSE92742 28. Average signatures for each drug were computed by taking the mean z-score for each gene. To quantify the average change in expression of genes co-expressed with ACE2, we obtained the top 50 genes most co-expressed with ACE2 from the ARCHS4 resource 40. We then calculated the mean z-score of these 50 genes and compared it against a distribution calculated by sampling 50 random genes 10,000 times. The p-values were calculated against the sampled distribution and corrected for multiple hypothesis testing by applying the Bonferroni correction method. The code behind this analysis is open source and available at: https://github.com/maayanlab/covid19l1000.

To identify publications that describe similar in-vitro drug screens from other contexts, we followed these steps:
1) We first queried PubMed for studies that contain the term ["drug screen" AND "in-vitro"].
2) The text from these studies was processed such that papers containing a table with drug names were saved for further manual inspection.
3) We then manually selected studies that performed drug screens that are comparable to the published screens for SARS-CoV-2.
The study selection criteria required the identification of in-vitro studies that included quantitative measures of many drugs' efficacy against a disease model.

A list of 195 drug hits from the six in-vitro screens 1-6 (Table S1) was used as positives for applying a machine learning method to prioritize these compounds and additional compounds.
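The ACE2 module test described above (the mean z-score of the 50 co-expressed genes compared with 10,000 random gene sets of the same size, followed by Bonferroni correction) can be sketched as follows. This is a minimal sketch under stated assumptions, not the code from the linked repository: the function names, the two-sided comparison on absolute means, the add-one smoothing, and the toy signature are all choices made here for illustration.

```python
import random
from statistics import mean


def module_p_value(signature, module_genes, n_samples=10_000, seed=0):
    """Empirical two-sided p-value for the mean z-score of a gene module."""
    rng = random.Random(seed)
    genes = list(signature)
    module = [g for g in module_genes if g in signature]
    observed = mean(signature[g] for g in module)
    # Null distribution: mean z-score of equally sized random gene sets.
    extreme = sum(
        abs(mean(signature[g] for g in rng.sample(genes, len(module)))) >= abs(observed)
        for _ in range(n_samples)
    )
    # Add-one smoothing keeps the estimate above zero (an assumption of this sketch).
    return (extreme + 1) / (n_samples + 1)


def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Toy drug signature (hypothetical): 200 genes, 10 of which form a shifted module.
toy_signature = {f"gene_{i}": 0.0 for i in range(200)}
toy_module = [f"gene_{i}" for i in range(10)]
toy_signature.update({g: 3.0 for g in toy_module})
toy_p = module_p_value(toy_signature, toy_module, n_samples=1_000)
```

With one signature per drug, `module_p_value` would be called once per drug and the resulting list passed to `bonferroni`, so that the number of tests equals the number of drugs compared.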
L1000 gene expression signatures for 19,777 drugs, measuring the response of 978 landmark genes, and their associated MACCS molecular fingerprints (166 keys) were obtained from the SEP-L1000 project 43. The binary MACCS key association matrix was TF-IDF normalized to account for the frequency of different chemical structures. The dataset included 19,777 different drugs, of which 96 matched the 195 hits from the drug screens. After removing compounds from the library that appeared to be structurally similar, 8,787 compounds remained, of which 72 were hits. Extra Trees (ET) classifiers 44 were trained to identify drug screen hits from the gene expression (GE) and chemical structure (CS) features and evaluated using 3-fold cross-validation. Class weights were set inversely proportional to the class frequencies to address class imbalance. Otherwise, all ET parameters were the default Scikit-learn values 45. Feature selection was performed by recursive feature elimination to retain 128 features when both GE and CS data were used, or 64 features when only GE data were used. Additionally, prediction probabilities were calibrated across cross-validation splits.

Table 4. Ranked predictions for top additional compounds based on L1000 + MACCS input.
Summary of the six in-vitro drug screens.
Drug lists from the six in-vitro drug screens.

The COVID-19 pandemic requires a rapid response by the research community to develop vaccines and therapeutics. While the development of vaccines may take years, drug repurposing can offer pandemic mitigation much more quickly. In-vitro drug screens are the first step toward identifying and prioritizing potentially safe therapeutics for COVID-19. However, these screens are done by different laboratories across the world using different methods. As a result, these screens produce different lists of hits. Here we attempted to consolidate the results from these drug screens to see if consensus emerges.
In addition, we utilized machine learning methods to further predict and prioritize the validity of the hits from these drug screens. Such analysis identified molecular mechanisms that may explain how some of these drugs interfere with viral replication inside human cells. As more SARS-CoV-2 drug screens are published, a clearer picture of the most promising drug candidates is expected to emerge.

• Collections of drug and gene sets relevant to COVID-19 research
• Detailed comparison of results from six in-vitro SARS-CoV-2 drug screens
• Analysis of hits that up- or down-regulate the ACE2 expression module
• Machine learning framework to further prioritize hits and other similar drugs

Kuleshov et al. developed a web-based platform that collects and presents drug and gene sets related to COVID-19 research. A comparison of the results from six in-vitro drug screens shows that there is some unexpected overlap among these screens. The authors also use the hits from these screens to develop a machine learning classifier that further prioritizes the hits and identifies a pharmacological theme that is shared among several hits.

Morphological Cell Profiling of SARS-CoV-2 Infection Identifies Drug Repurposing Candidates for COVID-19. bioRxiv
Identification of antiviral drug candidates against SARS-CoV-2 from FDA-approved drugs
In vitro screening of a FDA approved chemical library reveals potential inhibitors of SARS-CoV-2 replication. bioRxiv
Identification of inhibitors of SARS-CoV-2 in-vitro cellular toxicity in human (Caco-2) cells using a large scale drug repurposing collection
Identification of potential treatments for COVID-19 through artificial intelligence-enabled phenomic analysis of human cells infected with SARS-CoV-2. bioRxiv
A Large-scale Drug Repositioning Survey for SARS-CoV-2 Antivirals. bioRxiv
A SARS-CoV-2-Human Protein-Protein Interaction Map Reveals Drug Targets and Potential Drug-Repurposing. bioRxiv
Structure of Mpro from COVID-19 virus and discovery of its inhibitors
Homology Modeling of TMPRSS2 Yields Candidate Drugs That May Inhibit Entry of SARS-CoV-2 into Human Cells
Knowledge-based structural models of SARS-CoV-2 proteins and their complexes with potential drugs
Reversal of Infected Host Gene Expression Identifies Repurposed Drug Candidates for COVID-19. bioRxiv
A data-driven drug repositioning framework discovered a potential therapeutic agent targeting COVID-19. bioRxiv
Repurposing Didanosine as a Potential Treatment for COVID-19 Using Single-Cell RNA Sequencing Data
The Global Phosphorylation Landscape of SARS-CoV-2 Infection
SARS-CoV-2 launches a unique transcriptional signature from in vitro, ex vivo, and in vivo systems
Extraction and analysis of signatures from the Gene Expression Omnibus by the crowd
Lamprene (clofazimine) in leprosy
Acitretin Use in Dermatology
Effects of almitrine bismesylate on the ionic currents of chemoreceptor cells from the carotid body
Improvement in ventilation-perfusion matching by almitrine in COPD
A Screen of FDA-Approved Drugs for Inhibitors of Zika Virus Infection
Identification of 53 compounds that block Ebola virus-like particle entry via a repurposing screen of approved drugs
Screening of FDA-approved drugs using a MERS-CoV clinical isolate from South Korea identifies potential therapeutic options for COVID-19. bioRxiv
New substructure filters for removal of pan assay interference compounds (PAINS) from screening libraries and for their exclusion in bioassays
The ChEMBL database in 2017
SARS-CoV-2 cell entry depends on ACE2 and TMPRSS2 and is blocked by a clinically proven protease inhibitor
A next generation connectivity map: L1000 platform and the first 1,000,000 profiles
COVID-19: consider cytokine storm syndromes and immunosuppression
Effect of Dexamethasone in Hospitalized Patients with COVID-19: Preliminary
Immunosuppression for hyperinflammation in COVID-19: a double-edged sword? The Lancet
In Silico Discovery of Candidate Drugs against Covid-19
GEO2Enrichr: browser extension and server app to extract gene sets from GEO and analyze them for biological functions
Automated Generation of Interactive Notebooks for RNA-seq Data Analysis in the Cloud
GEN3VA: aggregation and analysis of gene expression signatures from related studies
Geneshot: search engine for ranking genes from arbitrary text queries
GeneRIF is a more comprehensive, current and computationally tractable source of gene-disease relationships than OMIM
Enrichr: interactive and collaborative HTML5 gene list enrichment analysis tool
BeCalm API for rapid named entity recognition. bioRxiv
Massive mining of publicly available RNA-seq data from human and mouse
The harmonizome: a collection of processed datasets gathered to serve and mine knowledge about genes and proteins
DrugBank 5.0: a major update to the DrugBank database for
Drug-induced adverse events prediction with the LINCS L1000 data
Extremely randomized trees
Scikit-learn: Machine learning in Python

We would like to thank Akira Mitsui, Russ Altman, Anne Carpenter, Pedro Bellester, and Tudor Oprea for contributing information about missing publications. This project is partially funded by U54HL127624 and U24CA224260.

MK, DK, and JE developed the website; DS, AL, JH, and AM performed the analyses; EK, AB, KJ, MC, AZ, and LF contributed drug and gene sets; AM initiated and managed the project; all authors contributed to writing the manuscript.

The authors declare no competing interests.