key: cord-0115629-0cmfmy16
authors: Singh, Kuldeep; Singla, Puneet; Sarode, Ketan; Chandrakar, Anurag; Nichkawde, Chetan
title: Uncovering the Corona Virus Map Using Deep Entities and Relationship Models
date: 2020-09-07
journal: nan
DOI: nan
sha: 6e5e9d024ad24cb2318af88972e9c4cbf1a267d9
doc_id: 115629
cord_uid: 0cmfmy16

We extract entities and relationships related to COVID-19 from a corpus of articles on the Corona virus by employing a novel entities and relationship model. The entity recognition and relationship discovery models are trained with a multi-task learning objective on a large annotated corpus. We employ a concept masking paradigm that prevents the neural network from functioning as an associative memory and induces the right inductive bias, guiding the network to make inferences using only the context. We uncover several important subnetworks, highlight important terms and concepts, and elucidate several treatment modalities employed for related ailments in the past.

The recent outbreak of SARS-CoV-2 has led to a global pandemic, with the total number of infections exceeding 6 million and more than 370,000 deaths already. The disease has been named COVID-19 and has had far-reaching repercussions the world over. This article aims to uncover the life science universe of the Corona virus and related ailments by employing state-of-the-art natural language processing technologies applied to the biomedical domain. We took the corpus of about 40,000 titles and abstracts released as part of the CORD-19 Open Research Challenge and applied our entity recognition and relationship discovery models to construct a knowledge graph related to COVID-19. In the process, we uncovered about 40,000 entities and 80,000 relationships.

This article presents our salient findings and is organized as follows. Section 2 briefly describes our masked entities model and masked relationship model.
Section 3 presents a network analysis of the knowledge network discovered by mining the CORD-19 dataset. The coverage of the CORD-19 dataset may not be exhaustive or up to date; we took a snapshot around April 15, 2020. Nevertheless, the primary aim of this work is to demonstrate the application of artificial intelligence to condensing unstructured information in the biomedical domain to a sufficiently low entropy state so that some important leads can be

We ran our entity recognition model, which was trained on about 1 billion data points built in-house. A corpus of about 33 million titles and abstracts was tagged for four kinds of entities: protein, drug, disease, and taxonomy. We employ a novel concept masking paradigm in which occurrences of entity terms are replaced by a dummy token. In effect, we remove the entire vocabulary associated with these entities from our training corpus. This inductive bias guides the model to make inferences using the surrounding context alone and helps us achieve state-of-the-art results on the biomedical entity recognition problem [1]. We use a transformer architecture [2] with the following three joint end-to-end multitask learning objectives: (1) masked token prediction, (2) next sentence prediction, and (3) entity type prediction. We train the word piece tokenization algorithm on our in-house corpus to generate a vocabulary of 30,000 word pieces. The network has 8 encoder layers, each composed of self-attention followed by a feedforward network. Positional encodings are applied at the input. A detailed publication on this model will follow at a later date [1].

We further uncover relationships between the entities by employing our relationship discovery model. The entities are once again masked to guide the network to make inferences using the context alone. The transformer architecture is once again used for encoding sentences.
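The concept masking step shared by both models can be illustrated with a minimal sketch. It assumes a simple dictionary lookup over known entity terms; the token name `[ENTITY]`, the function `mask_entities`, and the example sentence are hypothetical and are not taken from the authors' in-house pipeline, which is not described at this level of detail.

```python
import re

# Hypothetical dummy token; the paper does not name the token it uses.
MASK_TOKEN = "[ENTITY]"


def mask_entities(text, entity_terms, mask_token=MASK_TOKEN):
    """Replace every occurrence of a known entity term with a dummy token.

    The surrounding words are left untouched, so a downstream model can
    only rely on context, not on the entity vocabulary itself.
    """
    if not entity_terms:
        return text
    # Longest terms first, so multi-word terms win over their substrings.
    terms = sorted(entity_terms, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(t) for t in terms) + r")(?!\w)",
        flags=re.IGNORECASE,
    )
    return pattern.sub(mask_token, text)


# Illustrative sentence paraphrasing a relationship discussed in this article.
sentence = "Emodin blocks the interaction between ACE2 and the Spike protein."
masked = mask_entities(sentence, ["Emodin", "ACE2", "Spike protein"])
print(masked)
# [ENTITY] blocks the interaction between [ENTITY] and the [ENTITY].
```

After masking, the only signal left for typing an entity or relating two entities is the wording around the dummy tokens (here, "blocks the interaction between"), which is the inductive bias the text describes.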
We use a novel bilinear attention at the output to model the interaction between the contextualized embeddings of the two entities between which we are trying to establish a relationship [3].

We present here some of the important subnetworks and concepts uncovered by probing the CORD-19 dataset [4] with our neural network models. The findings in the paper are only suggestive, and the purpose of this work is to demonstrate the application of artificial intelligence to uncovering important concepts and relationships in the life sciences domain. We hope to bring a few important leads into sharp focus and help researchers narrow down the scope of their search for the most important concepts and relationships.

Figure 1 shows the full set of all entities and relationships. We computed the Katz centrality measure for each node in the network. The Katz centrality measure can be understood as follows: let $A$ be the $n \times n$ adjacency matrix of the network, with element $A_{ij}$ equal to 1 if there is a relationship between node $i$ and node $j$ and zero otherwise. The powers of $A$, such as the $k$-th power $A^k$, represent paths of length $k$ between two nodes through intermediaries. The Katz centrality measure for node $i$ is defined as:

$$C_{\mathrm{Katz}}(i) = \sum_{k=1}^{\infty} \sum_{j=1}^{n} \alpha^{k} \left(A^{k}\right)_{ji}$$

The attenuation factor $\alpha$ is chosen to be smaller than the reciprocal of the absolute value of the largest eigenvalue of $A$. The top 20 concepts in the literature ranked by the normalized Katz centrality measure [5] are shown in Table 1.

The ACE2 protein is also mentioned substantially in the published literature. We show the subnetwork of ACE2 in Fig. 2. The subnetwork contains all nodes connected to ACE2 and their interconnections. It is a large network with 367 nodes and 1141 edges. There are a total of 60 drugs, 173 proteins, 90 diseases, and 44 organisms in the ACE2 network. We did a path analysis to uncover more lead compounds in the ACE2 network. We found all the paths between ACE2 and Spike protein with a maximum of 3 hops between the two nodes.
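The two graph computations used in this section, Katz centrality and the bounded-hop path search between ACE2 and Spike protein, can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' implementation: the toy graph is hand-built from a few relationships mentioned in the text, the attenuation factor is set from the maximum degree (an upper bound on the largest eigenvalue of an adjacency matrix), the infinite sum is truncated, scores are normalized by their maximum, and the path search already applies the restriction to drug and protein nodes that the text imposes next. All function and variable names are hypothetical.

```python
from collections import defaultdict

# Toy graph for illustration only; it is not the network mined from CORD-19.
NODE_TYPE = {
    "ACE2": "protein", "Spike protein": "protein", "TMPRSS2": "protein",
    "Emodin": "drug", "Camostat": "drug", "SARS-CoV-2": "organism",
}
EDGES = [
    ("ACE2", "Spike protein"), ("Emodin", "ACE2"), ("Emodin", "Spike protein"),
    ("TMPRSS2", "ACE2"), ("TMPRSS2", "Spike protein"), ("Camostat", "TMPRSS2"),
    ("SARS-CoV-2", "ACE2"), ("SARS-CoV-2", "Spike protein"),
]


def build_adjacency(edges):
    """Undirected adjacency sets: node -> set of neighbours."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def katz_centrality(adj, alpha=None, max_power=100):
    """Truncated Katz sum: C(i) = sum_{k>=1} sum_j alpha^k (A^k)_{ji}."""
    nodes = sorted(adj)
    if alpha is None:
        # Assumption: the largest eigenvalue is at most the maximum degree,
        # so alpha = 0.5 / max_degree is safely below 1 / lambda_max.
        alpha = 0.5 / max(len(adj[n]) for n in nodes)
    walk = {n: 1.0 for n in nodes}    # k = 0 term (all ones)
    score = {n: 0.0 for n in nodes}
    for _ in range(max_power):
        # One more power of alpha * A applied to the previous term.
        walk = {n: alpha * sum(walk[m] for m in adj[n]) for n in nodes}
        for n in nodes:
            score[n] += walk[n]
    top = max(score.values())
    return {n: s / top for n, s in score.items()}  # normalized so the top is 1.0


def bounded_paths(adj, node_type, source, target, max_hops=3,
                  allowed_types=("drug", "protein")):
    """All simple paths source -> target with at most max_hops edges,
    visiting only nodes whose type is in allowed_types."""
    paths = []

    def extend(path):
        node = path[-1]
        if node == target:
            paths.append(list(path))
            return
        if len(path) - 1 == max_hops:
            return
        for nxt in sorted(adj[node]):
            if nxt in path or node_type.get(nxt) not in allowed_types:
                continue
            path.append(nxt)
            extend(path)
            path.pop()

    extend([source])
    return paths


adjacency = build_adjacency(EDGES)
scores = katz_centrality(adjacency)
for name, value in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name:15s} {value:.3f}")

paths = bounded_paths(adjacency, NODE_TYPE, "ACE2", "Spike protein")
drugs_on_paths = sorted({n for p in paths for n in p if NODE_TYPE[n] == "drug"})
print(paths)
print(drugs_on_paths)
```

On a network of the size reported here (tens of thousands of nodes), a sparse linear-algebra or graph library would be the more practical choice; the pure-Python version only makes the definitions concrete.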
We further impose the condition that all nodes in the path should be either a drug or a protein. The following drugs or drug-like compounds were found in the paths: A291P, Alanine, Arbidol, Chloroquine, Emodin, Glutathione, HR2P-M2, IL-4, K267N, MAB 1a9, Nitric Oxide, Rabbit Antisera, Sialic acid, SP-10, SP-8, Superoxide, and TAPI-2. K267E and A291P are actually polymorphisms of the DPP4 host protease. DPP4 helps the binding of Spike protein to the host. It was observed that these polymorphisms reduce viral replication and thus have a therapeutic effect [7]. Arbidol is an antiviral drug that has been reported to block viral entry and replication [8]. Emodin is another top drug in the ACE2 network; it works by blocking the interaction between ACE2 and Spike protein [9]. Glutathione works by downregulating ACE2 [10]. SP-8 and SP-10 are peptides that disrupt the binding of Spike protein to ACE2 [11].

While discussing ACE2, it is worth mentioning the role of TMPRSS2 in the pathogenesis. TMPRSS2 is a serine protease that cleaves the spike protein and helps the binding of S protein to ACE2. Figure 3 shows the TMPRSS2 subnetwork. It is a small yet important subnetwork and is easy to visualize. It has 37 nodes and 95 edges. There are 2 drugs, 20 proteins, 3 diseases, and 12 organisms in the network. Camostat, which is a TMPRSS2 inhibitor, shows up in the network and can be used as a drug to block viral entry. One of the important proteins that shows up in the network in Fig. 3 is LY6E. It is an interferon-stimulated protein and has been shown to be effective in curbing the entry of SARS-CoV-2 in a couple of studies [12, 13].

over the years to inhibit the function of RdRp. Remdesivir has emerged as one of the more promising drugs. Remdesivir treatment is prohibitively expensive. However, several other drugs have emerged out of our literature mining.
In fact, one of the drugs with a higher centrality measure is Adenosine, which is an adenosine triphosphate analog and can successfully block viral replication [14]. A few more drugs targeting RdRp have emerged, namely Sofosbuvir, AZT, Tenofovir Alafenamide, and Alovudine. We also analyzed the 3C-like protease network, and 3 experimental drugs named EPDTC, JMF1586, and JMF1600 emerged. [16, 17].

We further tried to uncover the drugs used for dealing with SARS-CoV-2-like viruses by interrogating our network for Ribavirin, which was recently reported to be successful in a phase 2 clinical trial for COVID-19 [16]. We discovered that the same combination was proposed earlier for MERS (see entry 23) [18]. Thus, the entries in Table 3 may serve as a ready reference point for exploring many other treatment modalities for COVID-19. The extensive drug-disease network corresponding to Table 3 is shown in Fig. 5. This subnetwork was formed by taking all the drugs listed in Table 3 and finding all the diseases related to these drugs.

We undertook a comprehensive concept identification and network analysis for COVID-19. We demonstrated the use of a novel concept recognition and relationship discovery engine that crafts some of the latest advances in natural language processing into a state-of-the-art solution for the biomedical entity recognition and relationship discovery problem. Several new drugs were uncovered through the studies, and many different treatment modalities were brought to the surface. We envision these solutions having a wide-ranging impact across the length and breadth of the drug discovery process, spanning all therapeutic areas.

We also discussed several putative mechanisms of the anti-SARS-CoV-2 effects for CVL218 or other PARP1 inhibitors to be involved in the treatment of COVID-19. In summary, the PARP1 inhibitor CVL218 discovered by our data-driven drug repositioning framework can serve as a potential therapeutic agent for the treatment of COVID-19.
[1] Biomedical entity recognition using a masked concept model
[2] Attention is all you need
[3] Biomedical relationship discovery using a masked concept model
[4] CORD-19: The COVID-19 open research dataset
[5] A new status index derived from sociometric analysis
[6] A SARS-CoV-2 protein interaction map reveals targets for drug repurposing
[7] Polymorphisms in dipeptidyl peptidase 4 reduce host cell entry of Middle East respiratory syndrome coronavirus
[8] The synthetic antiviral drug arbidol inhibits globally prevalent pathogenic viruses
[9] Emodin blocks the SARS coronavirus spike protein and angiotensin-converting enzyme 2 interaction
[10] Excessive glutamate stimulation impairs ACE2 activity through ADAM17-mediated shedding in cultured cortical neurons
[11] Design and biological activities of novel inhibitory peptides for SARS-CoV spike protein and angiotensin-converting enzyme 2 interaction
[12] LY6E impairs coronavirus fusion and confers immune control of viral disease
[13] LY6E restricts the entry of human coronaviruses, including the currently pandemic SARS-CoV-2. bioRxiv
[14] Adenosine triphosphate analogs can efficiently inhibit the Zika virus RNA-dependent RNA polymerase
[15] SARS-CoV-2 launches a unique transcriptional signature from in vitro, ex vivo, and in vivo systems
[16] Triple combination of interferon beta-1b, lopinavir-ritonavir, and ribavirin in the treatment of patients admitted to hospital with COVID-19: an open-label, randomised, phase 2 trial
[17] Interferon beta-1b for COVID-19
[18] Treatment of Middle East respiratory syndrome with a combination of lopinavir-ritonavir and interferon-β1b (MIRACLE trial): study protocol for a randomized controlled trial
[19] We assessed the effectiveness of ribavirin and corticosteroids as the initial treatment for severe acute respiratory syndrome using propensity score analysis.