key: cord-0431234-tnoybbfx authors: Wagner, Tyler; Awasthi, Samir; Wittenberg, Gayle; Venkatakrishnan, AJ; Tarjan, Dan; Anyanwu-Ofili, Anuli; Badley, Andrew; Halamka, John; Flores, Christopher; Khan, Najat; Barve, Rakesh; Soundararajan, Venky title: Real-time biomedical knowledge synthesis of the exponentially growing world wide web using unsupervised neural networks date: 2020-04-04 journal: bioRxiv DOI: 10.1101/2020.04.03.020602 sha: 8b4234b45ff6bcac91aeec02379a651f6a93ccfa doc_id: 431234 cord_uid: tnoybbfx Decoding disease mechanisms for addressing unmet clinical need demands the rapid assimilation of the exponentially growing biomedical knowledge. These are either inherently unstructured and non-conducive to current computing paradigms or siloed into structured databases requiring specialized bioinformatics. Despite the recent renaissance in unsupervised neural networks for deciphering unstructured natural languages and the availability of numerous bioinformatics resources, a holistic platform for real-time synthesis of the scientific literature and seamless triangulation with deep omic insights and real-world evidence has not been advanced. Here, we introduce the nferX platform that makes the highly unstructured biomedical knowledge computable and supports the seamless visual triangulation with statistical inference from diverse structured databases. The nferX platform will accelerate and amplify the research potential of subject-matter experts as well as non-experts across the life science ecosystem (https://academia.nferx.com/). The nference cloud-based software platform (nferX) enables dynamic inference from 45 quadrillion possible conceptual associations that synthesize over 100 million documents scraped from the published world wide web, and this is continuously updated as new material is published online. The platform supports visual triangulation of insights via statistical enrichments from nearly 50,000 curated collections of structured databases, with the diseases, biomolecules, drugs, and cells & tissues collections loaded by default. A hypergeometric test is used to capture the overlap between the knowledge synthesis results and each of these enrichment sets. The sources include all freely accessible literature integrated into a Core Corpus as well as distinct sub-corpora that provide contextual lenses into PubMed, preprints, clinical trials, SEC filings, patents, grants, media, company websites, etc. The collections include curated ontologies or statistical inference applied to molecular data (e.g. genomics, bulk and single cell RNA-seq, proteomics) and realworld data (e.g. FDA adverse event reports, clinical trial outcomes, epidemiology, clinical case reports). Here, we describe how the nferX platform 1 can enable data science driven decisionmaking via a pair of illustrative applications --(i) biopharmaceutical lifecycle management across conventionally siloed therapeutic areas, and (ii) connecting clinical pathophysiology to molecular profiling for a rapidly evolving pandemic. As described previously, the nferX platform adeptly identifies well-known and emerging associations embedded in the biomedical literature using two key metrics 2 : local context score and global context score ( Figure 1A) . The local context score is based on two significant improvements over the traditional pointwise mutual information (PMI) 3 . First, we extend the PMIbased strength of association to biomedical concepts that can be constructed by a logical combination of proximal phrases, e.g. "EGFR-positive" AND "non-small cell lung cancer". This effectively makes the number of biomedical concepts that can be queried unbounded. Moreover, we extend the traditional PMI notion, which is unable to capture the word-distance between cooccurring terms, using "exponential masking" to meaningfully account for the distance between co-occurring terms, captured by "score decay" in nferX. Our experimental studies show several measures for which our local score method outscores traditional PMI metrics (unpublished results). To compute the global context score we use an unsupervised neural network with dependency parsing to generate over 300 million biomedical multi-word phrase vectors, and leverage word2vec 4 to compute the cosine distance between these phrase vectors is projected in a 300-dimensional space. Previous studies of word embeddings provide a heuristic to extend their unigram technique to specify multi-word terms apriori 4 , and cui2vec relies on a curated set of phrases with associated vectors 5 . However, our examination indicates that relying on such occurrence frequency or curated data is not as exhaustive as our approach for generating a complete list of biomedical terms of interest, in particular when there are many low frequency or emerging concepts (unpublished results). We have also devised a novel method to compute vectors "on demand" for very low frequency phrases, for which pre-computing vectors is exorbitantly expensive. As an application of how researchers can use the combination of local and global context to add credence to existing hypotheses and identify novel associations, nferX was used to investigate potential indications for the label expansion of esketamine, an NMDA receptor antagonist recently approved for treatment-resistant depression (TRD). A hypothesis-free analysis was performed using the platform to quantify associations between esketamine, its targets (NR2A, NR2B), and all possible indications {nferX link}. nferX automatically populated the synonyms for the key biomedical entities such as genes, as an "expanded query". The platform correctly recapitulated relationships between esketamine and well-known indications, such as treatment resistant depression (local score = 7) and neuropathic pain (local score = 5.3), but also identified novel associations such as neuro-oncogenesis (e.g. global score of 2.4 between 'astrogliomas' and 'NMDA receptors'; nferX link) (Figure 1B) . This link to neuro-oncogenesis was subsequently confirmed 6 , approximately 3 months after our analysis was initially performed. Such a prospective validation adds credence to further interrogate esketamine for other emerging indications identified as having significant global score to NMDA receptors (e.g. neurogenic inflammation, global score = 2.4, local score = 0.9, nferX link). As another application, the nferX platform enabled the rapid, comprehensive literature-based and multi-omic profiling of ACE2 7 , the putative receptor of SARS-CoV2 (Figure 1C) . We found that tongue keratinocytes and olfactory epithelial cells of the nasal cavity are potentially novel ACE2expressing cell populations. Clinically this aligns with the known routes of transmission through droplet spread and viral attachment within oral/nasal mucosa 8, 9 . Also this may explain the altered sense of smell and taste in otherwise asymptomatic COVID-19+ individuals 8, 9 . Next the very high rates of viral pneumonitis in infected patients with ground glass infiltrates on chest imaging 10 are a logical sequelae of infection given the expression of ACE2 in type-2 pneumocytes, club cells and ciliated cells of the lung 7 . The copious ACE2 expression in various gastrointestinal (GI) cell types emphasizes the recent reports of diarrhea 11 and signs/symptoms of enteropathy that are seen clinically, and may also explain the occurrence of fecal shedding that persists postrecovery 12 . Applying nferX to identify incipient associations to ACE2 in the disease collection highlights diabetic renal disease, cardiorenal syndrome, and nephropathy hypertension (each with a global score of 3.3 to ACE2), as well as the enrichment sets of renal insufficiency, heart failure and kidney diseases {nferX link}. These emerging insights from nferX are consistent with diabetes mellitus and chronic kidney disease being identified as the leading mortality indicators for the elderly COVID-19+ patients 13 . This may also provide a pathophysiological rationale as to why some COVID patients experience complications like acute kidney injury 14 , proteinuria, hematuria, or myocarditis with associated rise in troponins 15 . Along these lines, it will be of interest to see if cases of SARS-CoV2 induced orchitis occurs in COVID-19 patients, given ACE2 expression in cells of the testes 7 , and the high nferX local score of 4.8 for the orchitis-infertility association {nferX link}. The nferX data science platform will help researchers generate insights via holistic triangulation of structured and unstructured data at an unprecedented scale. The full clinical potential of the unsupervised neural networks that power this platform will be realized when they are applied towards automated de-identification and synthesis of the unstructured physician notes that dominate the Electronic Health Records (EHRs). To enable such seamless real-world insight triangulation with the wealth of published biomedical knowledge, a privacy-preserving federated architecture that exports aggregate statistical inferences while retaining the primary de-identified data within the academic medical center's span of control is needed. Such a platform can truly propel clinical research and biopharmaceutical development into the digital era. Recapitulation and Retrospective Prediction of Biomedical Associations Using Temporally-enabled Word Embeddings Quality-Based Knowledge Discovery from Medical Text on the Web Distributed Representations of Words and Phrases and their Compositionality Clinical Concept Embeddings Learned from Massive Sources of Multimodal Medical Data Dangerous liaisons as tumour cells form synapses with neurons Knowledge synthesis from 100 million biomedical documents augments the deep expression profiling of coronavirus receptors Anosmia, Hyposmia, and Dysgeusia Symptoms of Coronavirus Disease. American Academy of Otolaryngology-Head and Neck Surgery Essentials for Radiologists on COVID-19: An Update-Radiology Scientific Expert Panel COVID-19: Gastrointestinal manifestations and potential fecal-oral transmission Characteristics of pediatric SARS-CoV-2 infection and potential evidence for persistent fecal viral shedding Covid-19 in Critically Ill Patients in the Seattle Region -Case Series Kidney disease is associated with in-hospital death of patients with COVID-19 Cardiac Involvement in a Patient With Coronavirus Disease 2019 (COVID-19)