Accelerating COVID-19 research with graph mining and transformer-based learning

Ilya Tyagin, Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE, tyagin@udel.edu
Ankit Kulshrestha, Computer and Information Sciences, University of Delaware, Newark, DE, akulshr@udel.edu
Justin Sybrandt∗, School of Computing, Clemson University, Clemson, SC, jsybran@clemson.edu
Krish Matta, Charter School of Wilmington, Wilmington, DE, matta.krish@charterschool.org
Michael Shtutman, Drug Discovery and Biomedical Sciences, University of S. Carolina, Columbia, SC, shtutmanm@sccp.sc.edu
Ilya Safro, Computer and Information Sciences, University of Delaware, Newark, DE, isafro@udel.edu

ABSTRACT
In 2020, the White House released the “Call to Action to the Tech Community on New Machine Readable COVID-19 Dataset,” in which artificial intelligence experts are asked to collect data and develop text mining techniques that can help the science community answer high-priority scientific questions related to COVID-19. The Allen Institute for AI and collaborators announced the availability of a rapidly growing open dataset of publications, the COVID-19 Open Research Dataset (CORD-19). As the pace of research accelerates, biomedical scientists struggle to stay current. To expedite their investigations, scientists leverage hypothesis generation systems, which can automatically inspect published papers to discover novel implicit connections. We present the automated general-purpose hypothesis generation systems AGATHA-C and AGATHA-GP for COVID-19 research. The systems are based on graph mining and the transformer model. They are validated at scale using retrospective information rediscovery and proactive analysis involving human-in-the-loop expert analysis.
Both systems achieve high-quality predictions across domains (in some domains up to 0.97 ROC AUC) in fast computational time and are released to the broad scientific community to accelerate biomedical research. In addition, by performing a study curated by domain experts, we show that the systems are able to discover ongoing research findings such as the relationship between COVID-19 and the hormone oxytocin.

Reproducibility: All code, details, and pre-trained models are available at https://github.com/IlyaTyagin/AGATHA-C-GP

CCS CONCEPTS
• Applied computing → Bioinformatics; Document management and text processing; • Computing methodologies → Learning latent representations; Neural networks; Information extraction; Semantic networks.

∗Now with Google Brain. Contact: jsybrandt@google.com.

KEYWORDS
Hypothesis Generation, Literature-Based Discovery, Transformer Models, Semantic Networks, Biomedical Recommendation

1 INTRODUCTION
The development of vaccines for COVID-19 is a major triumph of modern medicine and of humankind’s ability to accelerate scientific research. While we all hope to see large-scale positive changes from fast mass adoption of the existing vaccines, significant open research questions around COVID-19 remain. The scientific community has a responsibility to do everything possible to block the ongoing transmission of the dangerous virus and to accelerate research that mitigates its consequences. We present an automated knowledge discovery system as a new tool that could complement the existing arsenal of techniques for accelerating biomedical and drug discovery research for events like COVID-19.

The COVID-19 pandemic has been one of the most important events in the information space since the end of 2019. The pace at which scientific information is published is unprecedented and spans all resolutions, from news and pop-science articles to drug design at the molecular level.
The pace of scientific research has been a significant problem in science for years [29], and under the current circumstances this factor becomes even more pronounced. Several thousand papers are added weekly to CORD-19 [39] (the dataset of publications related to COVID-19), and even more to MEDLINE [1]. As a result, groups working on similar problems may not be immediately aware of each other’s findings, which can lead to inefficient investments and production delays.

bioRxiv preprint, doi: https://doi.org/10.1101/2021.02.11.430789; this version posted February 11, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

Figure 1: Number of new citations per week in CORD-19 dataset.

Under normal circumstances, the MEDLINE database of biomedical citations receives approximately 950,000 new papers per year. Currently this database indexes 31 million citations in total. This pace challenges traditional research methods, which often rely on human intuition when searching for relevant information. As a result, the demand for modern AI solutions that help with the automated analysis of scientific information is very high. For instance, the field of drug discovery has explored a range of AI analytical tools to expedite new treatments [12]. Designing lab experiments and finding candidate chemical compounds is a costly and long-lasting procedure, often taking years.
To accelerate scientific discovery, researchers have developed a family of strategies that utilize public knowledge from databases such as MEDLINE, available through the National Institutes of Health (NIH), to facilitate automated hypothesis generation (HG), also known as literature-based discovery. The target of our work is undiscovered public knowledge: information that is implicitly present within the available literature but is not yet explicitly known by an individual who can act on it.

Although there are quite a few automated HG systems [12], including those we have previously proposed [35, 37], none of them is currently customized and openly available to process COVID-19 related queries at scale. In addition to the traditional general requirements for HG systems, such as high-quality hypotheses, interpretability, and availability to the broad scientific community, COVID-19 data analysis specifically requires: (1) customization of the vocabulary and other logical units such as subject-verb-object predicates; (2) customization of the training data, which in the reality of urgent research contains a lot of controversial and incorrect information; (3) models for different information resolutions; and (4) validation on ongoing domain-specific discovery.

Our contribution: In this work we bridge this gap by releasing AGATHA-C and AGATHA-GP, reliable and easy-to-use HG systems that demonstrate state-of-the-art performance, and we validate their inference capabilities on both COVID-19 related and general biomedical data. To align them with different goals of COVID-19 research, they correspond to the microscopic (AGATHA-C, for COVID-19) and macroscopic (AGATHA-GP, for general purpose) scales of knowledge discovery.
Both systems are able to process any query that connects biomedical concepts, but AGATHA-C exhibits better results on molecular-scale queries, e.g., those relevant to drug design, and AGATHA-GP works better for general queries, e.g., establishing connections between a certain profession and COVID-19 transmission.

Both systems are the next generation of the AGATHA knowledge network mining transformer model [37]. They substantially improve the quality of the previous AGATHA by introducing a new information layer into the multi-layered semantic knowledge network pipeline and by expanding the information retrieval techniques that facilitate inference. We deploy the deep learning transfer model trained with up-to-date datasets and provide an easy-to-use interface that lets the broad scientific community conduct COVID-19 research. We validate the system via candidate ranking [36, 37] using very recent scientific publications containing findings absent from the training set. While the original AGATHA demonstrated state-of-the-art performance at the time of its release, AGATHA and other systems were found to perform with notably lower quality on extremely rapidly changing COVID-19 research. We demonstrate a remarkable improvement of approximately 20-30% (in ROC AUC) on average across different types of queries, with a very fast query process that allows validation at scale. In addition, we demonstrate that the proposed system can identify recently uncovered gene (BST2) and hormone (oxytocin and melatonin) relationships to COVID-19, using only papers published before these connections were discovered.

Reproducibility: All code, details, and pre-trained models are available at https://github.com/IlyaTyagin/AGATHA-C-GP

2 BACKGROUND
The CORD-19 dataset [39] was released in response to the COVID-19 pandemic to help data science experts and researchers tackle the challenge of answering high-priority scientific questions.
It is updated daily and was created by the Allen Institute for AI in collaboration with Microsoft Research, NLM, IBM and other organizations. At the time of this publication it contains over 400,000 scientific abstracts and over 150,000 full-text papers about coronaviruses, primarily COVID-19.

MEDLINE is an NIH database that includes almost 31 million citations (as of 2021) of scientific papers in biomedical and related fields. Some of the citations are provided with MeSH (Medical Subject Headings) terms and other metadata. MEDLINE is one of the largest and best-known resources for biomedical text mining.

Hypothesis Generation Systems. The HG field has been present in the information sciences for several decades. The first notable approach, called the A-B-C model, was proposed by Swanson et al. in 1986 [33]. The idea of the A-B-C model is to discover intermediate (B) terms that occur in titles of publications for both term A (source) and term C (target). In their experiments, Swanson et al. discovered an implicit connection between Raynaud’s syndrome (term A) and fish oil (term C) through blood viscosity (term B), which was mentioned in both sets. The hypothesis that fish oil can be used for patients with Raynaud’s disease was experimentally confirmed several years later [10]. The key idea of the method is that all fragmented bits of information are explicitly known, while their implicit relationships are what HG systems aim to uncover.

We note the difference between HG and traditional information retrieval. Information retrieval techniques, which represent the vast majority of biomedical literature-based discovery systems, are trained and (even more importantly) validated to retrieve existing information, whereas HG techniques predict undiscovered knowledge and thus must be validated on it at scale. HG validation requires training the system strictly on historical data rather than on data sampled over the entire time span.
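To make the A-B-C idea described above concrete, the following is a minimal Python sketch, not the AGATHA method and not Swanson's original procedure. It finds intermediate B terms that share a title with a source term A and, in some title, with a target term C. The toy titles, the small vocabulary, and the function name `abc_candidates` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

def abc_candidates(titles, vocabulary, a_term, c_term):
    """Return B terms that co-occur with A in some title and with C in some title."""
    cooccur = defaultdict(set)  # term -> set of terms that share a title with it
    for title in titles:
        text = title.lower()
        present = {t for t in vocabulary if t in text}
        for t in present:
            cooccur[t] |= present - {t}
    # B must be linked to both A and C; A and C themselves are excluded.
    return sorted((cooccur[a_term] & cooccur[c_term]) - {a_term, c_term})

# Toy titles (hypothetical, for illustration only).
titles = [
    "Blood viscosity in Raynaud's syndrome",
    "Dietary fish oil lowers blood viscosity",
    "Platelet aggregation and fish oil",
]
vocabulary = ["raynaud's syndrome", "fish oil", "blood viscosity", "platelet aggregation"]

print(abc_candidates(titles, vocabulary, "raynaud's syndrome", "fish oil"))
```

In this toy corpus A and C never appear in the same title, so the link can only be recovered through the shared intermediate term, which is the situation HG systems target.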
Advances in machine and deep learning have transformed the algorithmics of HG systems (see Sec. 9), which are now able to process much larger volumes of information while demonstrating much higher quality predictions. However, the lack of broader applicability of HG systems during the COVID-19 pandemic demonstrates that several major issues exist and require immediate attention:
(1) Most of the existing HG systems are domain-specific (e.g., gene-disease interactions), which is usually expressed in limiting the processed information (e.g., significantly filtering the vocabulary and papers to a specific domain in probabilistic topic modeling [38]);
(2) Proper validation of an HG system remains a technical problem because multiple large-scale models have to be trained with all heterogeneous data from the preceding several years carefully eliminated;
(3) Moreover, a large number of HG systems are not validated at scale at all, apart from rediscovering very old findings [28] or demonstrating just a few proactive examples in a human-curated investigation; and
(4) Interpretability and explainability of generated hypotheses remain a major issue.

The UMLS Metathesaurus [7] is the NIH database containing information about millions of concepts (both medical and general) and their synonyms. The Metathesaurus accumulates information about its entries from more than 200 different vocabularies, which makes it possible to map and connect concepts from different terminologies.
The Metathesaurus also keeps metadata about the concepts, such as semantic types and their hierarchy. The core unit of information in UMLS is the concept unique identifier, or CUI. A CUI is a codified representation of a specific term, which includes its different atoms (spelling variants or translations of the term into other languages), vocabulary entries, definitions and other metadata.

SemRep [4] is a software kit developed by NIH for extracting semantic predicates (subject-verb-object triples) from a provided corpus. It can also extract entities that are not involved in any semantic predicate, if the corresponding option is selected. The official example of possible SemRep output is: INPUT = “We used hemofiltration to treat a patient with digoxin overdose that was complicated by refractory hyperkalemia.”, OUTPUT = “Hemofiltration-TREATS-Patients; Digoxin overdose-PROCESS_OF-Patients; hyperkalemia-COMPLICATES-Digoxin overdose; Hemofiltration-TREATS(INFER)-Digoxin overdose”. SemRep handles word sense disambiguation and maps terms to the corresponding CUIs from the UMLS Metathesaurus.

ScispaCy [24] is a special version of spaCy maintained by AllenAI, containing spaCy models for processing scientific and biomedical texts. ScispaCy models are trained on different sources, such as PMC-pretrained word2vec representations, the MedMentions entity linking dataset and so on. ScispaCy can handle various NLP tasks, such as NER, dependency parsing and POS tagging, where it achieves state-of-the-art performance.

SciBERT [6] is a BERT-like pretrained transformer language model that uses full-text scientific papers as its training dataset. Embeddings are learned in a word-piece fashion, which lets them capture the relationships not only between words in a sentence, but also between word parts within each word.

FAISS [15] is a library for fast approximate clustering and similarity search between dense vectors.
It scales to huge datasets that do not fit in RAM and can be used in a distributed fashion. FAISS is used in our pipeline to perform 𝑘-means clustering of PQ-quantized sentence vectors in order to generate 𝑘-nearest neighbor edges between similar sentences (nodes) in the knowledge network.

Figure 2: AGATHA multi-layered graph schema.

PTBG [21] (PyTorch BigGraph) is a high-performance graph embedding system that supports distributed training. It was designed to handle large heterogeneous networks containing hundreds of millions of nodes of different types and billions of typed edges. Distributed training is achieved by computing embeddings on disjoint node sets.

AllenNLP Open Information Extraction. AllenNLP [11] is a library developed by AllenAI that uses a PyTorch backend to provide deep learning models for various natural language processing tasks. Specifically, AllenNLP Open Information Extraction provides a trained deep bi-LSTM model for extracting predicates from unstructured text. An API is provided for running inference in both single-sentence and batch modes.

3 PIPELINE SUMMARY
We briefly summarize the AGATHA semantic graph construction pipeline. It is described in greater detail in the original paper [37].

Text pre-processing. The input for our system is a corpus of scientific citations from the MEDLINE and CORD-19 datasets. These files contain titles and abstracts for millions of biomedical papers. We filter out non-English documents, using the FastText Language Identification model [16] if the language is not provided. After that we split all abstracts into sentences and process all sentences with the ScispaCy library. From each sentence we extract POS-annotated lemmas and entities, and we perform 𝑛-gram mining, where 𝑛 ∈ {2, 3, 4} and 𝑛-grams are composed of frequently co-occurring lemmas. Additionally, we associate all sentences with any relevant metadata, such as the MeSH/UMLS keywords provided along with the citation.

Semantic Graph Construction.
We construct a semantic graph containing different types of nodes, namely, sentences, entities, coded terms (from UMLS and MeSH), 𝑛-grams, lemmas, and predicates, following the schema depicted in Figure 2. Edges between sentences are induced from the nearest-neighbors network of sentence embeddings. We also include an edge between any two sentences that appear sequentially within the same abstract, counting the title as the first sentence. Other edges can be inferred directly from the recorded metadata. For instance, the node representing the entity “COVID-19” is connected to every sentence and predicate that discusses COVID-19.

NLM UMLS implementation. The prior AGATHA semantic network only includes UMLS terms that appear in SemMedDB predicates [18], which is a major limitation. In this work we enrich the “Coded Term” layer by introducing an additional preprocessing phase in which we run the SemRep tool ourselves, with the full-fielded output option, on the entire input corpus. This phase would be necessary in any case, because CORD-19 and the most recent MEDLINE citations are not represented in the slowly updated SemMedDB. Beyond that, we find that we can substantially increase the quality of recovered terms by applying these tools ourselves. By doing so we not only enrich the “Coded Term” layer of the semantic network, but also introduce a significant number of previously uncovered semantic predicates.
This happens because SemMedDB is a cumulative database: its citations were processed over many years with different versions of SemRep and different UMLS releases available at different time periods. To illustrate, consider the following example (PMID: 20109154): “The results showed that V. cholerae O395 and also other related enteric pathogens have the essential CASS components (CRISPR and cas genes) to mediate a RNAi-like pathway.” The current SemRep version extracts the predicate CRISPR-AFFECTS-RNAi, while SemMedDB does not contain any predicates for this sentence. The paper was published in 2009, but the CRISPR term (C3658200) did not exist in the UMLS Metathesaurus on or before 2012, which is why no CRISPR-involved relation could be identified when this citation was added to SemMedDB.

Graph Embedding. We embed our large semantic graph using a heterogeneous technique that captures node similarity through a biased transformed dot product. By explicitly including a bias term for each node, we capture a concept’s overall affinity within the network, which is critical for such general terms as “coronavirus.” By learning transformations between each pair of node types (e.g., between sentences and lemmas), we enable each type to occupy embedding spaces with differing characteristics. Specifically, we fit an embedding model that optimizes the following similarity measure:

\[
S(u, v) = \hat{u}_1 + \hat{v}_1 + T_{uv,1} + \sum_{i=2}^{d} \hat{u}_i \left( \hat{v}_i \, T_{uv,i} \right), \tag{1}
\]

where $u, v$ are nodes in the semantic graph with embeddings $\hat{u}, \hat{v}$, and $T_{uv}$ is the directional transformation vector from nodes of $u$'s type to nodes of $v$'s type. We use the PTBG heterogeneous graph embedding library to learn $d = 512$ dimensional embeddings for each node of our large semantic graph. While fitting the embeddings ($\hat{u}$) and transformation vectors ($T_{uv}$), we represent each edge of the semantic graph as two directed edges.
These learned values are optimized using softmax loss, where the similarity for one edge is compared against the similarities of 100 negative samples.

Ranking Semantic Predicates (Transformer model). After we obtain an embedding per node in the semantic graph, we train the AGATHA ranking model. This model is trained to rank published subject-object pairs above randomly composed pairs of UMLS concepts (negative samples). Two coded terms, along with a fixed-size random subsample of the predicates containing each term, are input to this model. Graph embeddings for each term and predicate are fed into stacked transformer encoder layers, which apply multi-headed self-attention across the embedding set. The last set of encodings is averaged and the result is projected to the unit interval, forming a scalar prediction for the input’s “plausibility.”

Figure 3: Predicate extraction pipeline with the deep learning based Open IE system. (Diagram boxes: MEDLINE, CORD-19, Process Abstracts, UMLS Concept Tagging, AllenNLP Predictor, Semnet Filter, Final Predicates.)

Formally, the model to evaluate term pairs is defined as:

\[
\begin{aligned}
f(x, y) &= g\left(\left[\hat{x}\;\; \hat{y}\;\; \hat{x}'_1 \ldots \hat{x}'_k\;\; \hat{y}'_1 \ldots \hat{y}'_k\right]\right) \\
g(X) &= \operatorname{sigmoid}(M\Theta) \\
M &= \frac{1}{|X|} \operatorname{ColSum}\left(E_N(\operatorname{FeedForward}(X))\right) \\
E_0(X) &= X \\
E_{i+1}(X) &= \operatorname{LayerNorm}\left(\operatorname{FeedForward}(A(X)) + A(X)\right) \\
A(X) &= \operatorname{LayerNorm}\left(\operatorname{MultiHeadAttention}(X) + X\right),
\end{aligned}
\tag{2}
\]

where each $x'$ and $y'$ is randomly sampled from the neighborhood of $x$ and $y$, respectively, and each $\hat{\cdot}$ denotes the graph embedding of the given node. Furthermore, $\Theta$ represents a free parameter, which is fit along with the parameters internal to each FeedForward and MultiHeadAttention layer, following the standard conventions for each.

The above model is fit using margin ranking loss, where predicates from the training set are compared against a large set of negative samples. Additional details on the specific optimization choices for this model are given in the work that originally proposed it [37].
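Returning to the graph embedding step, the similarity measure of Eq. (1) can be illustrated with a short, dependency-free Python sketch. This is not the PTBG implementation: the function names and toy vectors are hypothetical, and the loss shown is the standard softmax cross-entropy over one positive edge and its negatives, which is an assumption about the exact form of the "softmax loss" mentioned above.

```python
import math

def similarity(u_hat, v_hat, t_uv):
    """Biased, transformed dot product of Eq. (1), using 0-based lists.

    Index 0 holds the bias-related component (subscript 1 in Eq. (1));
    the remaining d - 1 components enter the transformed dot product.
    """
    if not (len(u_hat) == len(v_hat) == len(t_uv)):
        raise ValueError("all three vectors must have the same dimension d")
    bias = u_hat[0] + v_hat[0] + t_uv[0]
    dot = sum(u * (v * t) for u, v, t in zip(u_hat[1:], v_hat[1:], t_uv[1:]))
    return bias + dot

def softmax_loss(pos_score, neg_scores):
    """Negative log-probability of the positive edge among all candidates
    (assumed form; computed with a numerically stable log-sum-exp)."""
    scores = [pos_score] + list(neg_scores)
    m = max(scores)
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum_exp - pos_score

# Toy 4-dimensional vectors (the paper uses d = 512).
u_hat = [0.5, 1.0, 2.0, -1.0]
v_hat = [0.25, 3.0, 0.5, 2.0]
t_uv = [0.0, 1.0, 2.0, 0.5]

print(similarity(u_hat, v_hat, t_uv))   # bias 0.75 plus transformed dot product 4.0
print(softmax_loss(0.0, [0.0]))         # two equally scored candidates
```

In training, `neg_scores` would hold the similarities of the 100 negative samples for the edge; here a single negative is used only to keep the example small.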
4 AUGMENTING SEMANTIC PREDICATES WITH DEEP LEARNING
We used the SemRep predicate extraction system in the first system, AGATHA-C, to extract predicates from the abstracts. However, SemRep relies on expert-coded rules and heuristics to extract biomedical relations, which leads to significantly fewer predicates for training. Thus, in order to augment the predicates (for the second system, AGATHA-GP), we decided to use a deep learning based information extraction system by Stanovsky et al. [31]. Figure 3 shows our overall predicate extraction pipeline.

Abstract Pre-processing. The input for the proposed semantic predicate extraction system is the set of output files generated by the SemRep tool with the full-fielded output option enabled, obtained from the pre-processing stage described in Sec. 3. As mentioned previously, SemRep not only extracts semantic triples, but also maps entities found in the input corpus to their corresponding UMLS concept IDs; this is the data used by the following method. The initial set of records includes the raw sentence texts and the UMLS terms extracted from them. It is augmented throughout the pipeline, which makes it easier to extract the final predicates for downstream training.

Raw Predicate Extraction. We use a pre-trained instance of RnnOIE [31] provided as an API by AllenNLP. The model was trained on the OIE2016 corpus. At a high level, the model aims to learn a joint embedding of individual words and their corresponding Beginning-Inside-Outside (BIO) tags.
The output of the model is a probability distribution over the BIO tags. During inference the model selects specific phrases and groups them into ARG0, V, and ARG1 tags. By convention, we treat ARG0 as the subject and ARG1 as the object in a subject-verb-object tuple. To speed up processing and scale it to thousands of abstracts, we leverage model-parallelism across different machines and run batch-mode inference on chunks of abstracts. Once the model predictions have been obtained, we extract the phrases with relevant tags into raw predicates and add them to the record. A subsequent filtering step keeps the terms that match previously detected UMLS concepts in the sentence.

Semnet Filtering. Using a general-purpose RnnOIE model has its own challenges. During processing we noted that many raw predicates were either too general or contained too little meaning to be useful for training a prediction model. To overcome this challenge we designed a corrective filter that reduces noise and retains the most useful predicates. We call this filter the semnet filter.

Each UMLS concept has an associated semantic type (e.g., COVID-19 has the associated semantic type dsyn (disease)). This is useful for summarizing a large set of diverse text concepts into a smaller number of categories. We used the metadata from semantic types to construct two networks: a semantic network and a hierarchical network. The semantic network consists of semantic types as nodes, and its edges imply a corresponding direct relation between them. The hierarchical network connects each semantic type to its more general semantic types. For example, the semantic type dsyn (disease) is more generally associated with biof (biological function) or pathf (pathological function).

In order to filter a predicate, all edges emanating from the subject’s semantic types are computed on a per-predicate basis. These edges also include any specific-general concept relationships.
If the object’s semantic type is found in the candidate edge set, then we deem the predicate valid. In our experiments, we found that this filtering method eliminates a significant number of predicates that do not directly pertain to the biomedical domain.

Processing Abstracts at Scale. Building a pipeline that scales to thousands of abstracts is not a trivial task. In order to extract predicates with the RnnOIE model and extract quality terms of interest, we have to contend not only with the problem of running inference on a deep neural network but also with the task of aligning the extracted terms with the entities recognized by SemRep.

Deployment details: The RnnOIE model by Stanovsky et al. uses a deep Bi-LSTM [27] model to learn the joint word embedding and predict the resulting semantic position tags. Since LSTMs are inherently sequential models, the inference time per sentence is considerable. We first tried processing an entire collection of abstracts at once on a cluster of 10 machines, each with 24 CPUs, using the Dask [26] library. The entire process took more than 8 hours. Considering that we had about 100 such collections, this inference time was prohibitively high. In order to speed up inference we read each collection once and distributed chunks of abstracts over the machines. This change helped us cut the processing time from over a week to just over 4 days for the MEDLINE corpus. For the CORD-19 corpus the processing time was shorter still, at 2 days. The next step was to align the extracted predicates with the biomedical concepts recognized by SemRep. We achieved this alignment by first building an index of the files that contained a specific abstract ID and then processing the RnnOIE predicates with this index. We further optimized the indexing phase by updating the existing index each time we processed more than 𝜏 abstracts.
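The semnet filter described earlier in this section can be sketched in a few lines of Python. This is one possible reading of the description, not the authors' implementation: subject types are expanded with their more general types from the hierarchical network, the outgoing edges of the semantic network are collected, and a predicate is kept if any of the object's types is in that candidate set. The function names and the toy edges below are hypothetical; only the dsyn, biof and pathf generalization comes from the text.

```python
def candidate_object_types(subject_types, semantic_net, hierarchy):
    """Collect types reachable by one semantic-network edge from the subject's
    types or from any of their more general types (assumed interpretation)."""
    expanded, stack = set(), list(subject_types)
    while stack:
        t = stack.pop()
        if t not in expanded:
            expanded.add(t)
            stack.extend(hierarchy.get(t, ()))  # climb to more general types
    candidates = set()
    for t in expanded:
        candidates |= semantic_net.get(t, set())
    return candidates

def semnet_filter(subject_types, object_types, semantic_net, hierarchy):
    """Return True if the predicate passes the filter."""
    candidates = candidate_object_types(subject_types, semantic_net, hierarchy)
    return any(t in candidates for t in object_types)

# Hierarchical network: dsyn generalizes to pathf and biof (as in the text).
hierarchy = {"dsyn": {"pathf", "biof"}}
# Semantic network edges: hypothetical toy relations, for illustration only.
semantic_net = {"phsu": {"dsyn"}, "pathf": {"humn"}}

print(semnet_filter({"phsu"}, {"dsyn"}, semantic_net, hierarchy))  # direct edge
print(semnet_filter({"dsyn"}, {"humn"}, semantic_net, hierarchy))  # via general type pathf
print(semnet_filter({"humn"}, {"phsu"}, semantic_net, hierarchy))  # no edge, rejected
```

Because the check reduces to set lookups over a small, fixed set of semantic types, it adds little cost per predicate.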
The semnet filter does not introduce additional computational overhead and can process a thousand abstracts in under 1 second. Hence, to obtain the most relevant set of predicates, we were able to parallelize over “checkpoints” (each of which contained 30k abstracts) and complete the filtering in an hour.

5 VALIDATION
A fair validation of HG systems is extremely challenging, as these models are designed to predict novel connections that are unknown even to those who evaluate the system [34]. In addition, even when a system is validated by rediscovering findings from historical data, the process is computationally expensive: multiple models must be trained to understand how many months (or years) back the HG system can predict the findings, which requires careful filtering of the papers, vocabulary and other types of data used. To present our results in terms of their usefulness for urgent CORD-19-related HG, we use a historical benchmark, which is conceptually described in [37]. This technique is fully automated and does not require any intervention by domain experts.

Positive samples collection. We use SemRep and the approach proposed in Sec. 4 to process the most recent CORD-19 citations, which were published after a specific cut date, making sure that these citations are not included in the training set. After that we extract all subject-object pairs from the obtained results and explicitly check that none of these pairs is present in the training set. Pairs mentioned in CORD-19 fewer than two times are filtered out of the validation set. Almost all of them are either noisy or represent information that already appears in other pairs (e.g., because of a difference in grammar).

We also use the strategy of subdomain recommendation. This strategy works in the following way. For each UMLS term we collect its semantic type (which is part of the metadata provided in the UMLS Metathesaurus) and group all extracted SemRep pairs by the term-pair criterion (the combination of subject and object types).
Then we identify the top-20 most common term-pair subdomains and construct the validation set from pairs belonging to these 20 subdomains.

Negative samples generation. To generate negative samples per domain, we use random sampling: for each positive sample we keep its subject and randomly sample an object belonging to the same semantic type as the object of the source pair. We do this 10 times, thus obtaining 10 negative domain-specific samples for each positive sample. Once the validation set is generated, we apply our ranking criteria to it, obtaining a numerical score 𝑠 for each sample, where 𝑠 ∈ [0, 1].

Evaluation metrics. We propose our approach as a recommendation system, and to report our results we use a combination of the following classification and recommendation metrics.
• Classification metrics: (1) Area under the receiver operating characteristic curve (AUC ROC); (2) Area under the precision-recall curve (AUC PR).
• Recommendation metrics: (1) Top-k precision (P.@k); (2) Average precision (AP.@k); and (3) Overall reciprocal rank (RR).
We report these numbers per subdomain to better understand how the system performs with respect to a specific task (e.g., drug repurposing).
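The negative sampling scheme and some of the metrics above can be sketched in plain Python. This is an illustration under stated assumptions, not the evaluation code of the paper: the function names, the toy terms and the toy scores are hypothetical, skipping the positive pair's own object during sampling is an added assumption, and the metrics follow their standard textbook definitions, which may differ in detail from the ones used to produce the reported numbers.

```python
import random

def sample_negatives(positive, type_of, terms_by_type, n=10, rng=random):
    """Keep the subject and draw n random objects of the same semantic type
    as the positive pair's object (the positive object itself is skipped)."""
    subject, obj = positive
    pool = [t for t in terms_by_type[type_of[obj]] if t != obj]
    return [(subject, rng.choice(pool)) for _ in range(n)]

def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def ranked_labels(labels, scores):
    """Labels ordered by decreasing score."""
    return [l for _, l in sorted(zip(scores, labels), key=lambda p: -p[0])]

def precision_at_k(labels, scores, k):
    return sum(ranked_labels(labels, scores)[:k]) / k

def reciprocal_rank(labels, scores):
    for rank, label in enumerate(ranked_labels(labels, scores), start=1):
        if label == 1:
            return 1.0 / rank
    return 0.0

# Hypothetical terms: one subject and three objects of the same semantic type.
type_of = {"o1": "gngm", "o2": "gngm", "o3": "gngm"}
terms_by_type = {"gngm": ["o1", "o2", "o3"]}
negatives = sample_negatives(("s1", "o1"), type_of, terms_by_type, rng=random.Random(0))

# Toy ranking scores: label 1 marks a positive pair, 0 a negative pair.
labels = [1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.3, 0.7, 0.1]
print(len(negatives), roc_auc(labels, scores),
      precision_at_k(labels, scores, 2), reciprocal_rank(labels, scores))
```

With these toy scores, five of the six positive-negative pairs are ordered correctly, so the ROC AUC is 5/6; one of the top two ranked samples is positive, and the highest-ranked sample is positive.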
6 RESULTS
To report results, we provide the performance measures for three AGATHA models trained on the same input data (MEDLINE corpus and CORD-19 abstracts dataset):
(1) AGATHA-O: baseline AGATHA model [37];
(2) AGATHA-C: AGATHA-O with the new UMLS layer and SemRep enrichment;
(3) AGATHA-GP: AGATHA-C with additional predicates extracted by deep learning and further filtered.
We compare the models in this manner because learning the proposed ranking criteria depends heavily on the quality and number of the extracted semantic predicates, which form the training set for the AGATHA ranking module. At the time of writing, no other publicly available general-purpose HG system that satisfied the three validation criteria, namely, (a) the ability to run thousands of queries in a reasonable time, (b) the ability to process COVID-19-related vocabulary, and (c) the ability to operate in multiple domains, was available for comparison.

Both AGATHA-C and AGATHA-GP can run thousands of queries in a very short time (on the order of minutes), making validation on a large number of samples possible. Unfortunately, given the current circumstances, large-scale validation for the specific scientific subdomain (COVID-19-related hypotheses) is hard to implement, because a well-established and reliable factual base is still being actively developed and a long historical gap for the vocabulary simply does not exist (e.g., the term COVID-19 is only approximately one year old). We nevertheless provide a validation set of 2736 positive connections extracted from the CORD-19 citations added between October 28, 2020 and January 21, 2021, which number 77 thousand abstracts.

Table 1: Graph metrics (M = millions, B = billions).
Node Type     AGATHA-O   AGATHA-C   AGATHA-GP
Sentence      190.6 M    190.6 M    190.6 M
Predicate     24.2 M     36.3 M     38.7 M
Lemma         16.8 M     16.1 M     16.1 M
Entity        41.7 M     43.2 M     43.2 M
Coded Term    538,588    855,351    855,351
𝑛-Grams       212,922    326,864    333,575
Total Nodes   274.1 M    287.4 M    289.8 M
Total Edges   13.52 B    13.5 B     13.53 B

In Table 1, we share some basic graph metrics for the models AGATHA-O, AGATHA-C and AGATHA-GP. The most significant change is observed in the number of semantic predicates and coded terms, which clearly reflects the purpose of introducing the additional preprocessing steps.

In Table 2, we compare the aforementioned models using the metrics described in Sec. 5. Due to space restrictions, we present predicate types with NLM semantic type codes [23]. Both the AGATHA-C and AGATHA-GP models show significant gains over the AGATHA-O baseline model. The gains in the areas most problematic for the baseline model (e.g., (Gene) → (Gene), denoted by (gngm,gngm)) illustrate this best, showing up to almost 30 percent advantage in ROC AUC. All of the most popular biomedical subdomains are now covered by the proposed models and show ROC AUC of at least 0.87. The average ROC AUC value is increased by 0.09.

Our validation strategy involves a large number of many-to-many queries, making the area under the precision-recall curve another very illustrative metric. This is where the newly proposed models show even more drastic improvements over the baseline AGATHA-O. For some subdomains, like (Gene or Genome) → (Gene or Genome) (gngm,gngm) or (Amino Acid, Peptide, or Protein) → (Gene or Genome) (aapp,gngm), we observe that the new models take recommendation performance to a new level of quality. The average PR AUC value is increased by 0.16.

The approximate running time and the corresponding hardware are presented in Table 3. Each row corresponds to a stage in the AGATHA-C/AGATHA-GP pipelines. The columns “M” (machines) and “CPU” show the number of machines and required CPUs, respectively. In the column “GPU” we indicate whether a GPU was required or optional. For AGATHA training we used two NVIDIA V100 per machine.
The minimal requirements for RAM per machine are in column “RAM”. The running time of queries is negligible.

Table 2: Classification and recommendation quality metrics across recently popular COVID-19-related biomedical subdomains. Labels O, C and GP stand for AGATHA-O, AGATHA-C and AGATHA-GP models, respectively.
Subdomain  ROC AUC         PR AUC          RR              P.@10           P.@100          AP.@10          AP.@100
           O    C    GP    O    C    GP    O    C    GP    O    C    GP    O    C    GP    O    C    GP    O    C    GP
orch:dsyn  0.91 0.93 0.92  0.47 0.57 0.55  1.00 1.00 0.50  0.60 0.90 0.70  0.48 0.59 0.61  0.79 0.88 0.64  0.64 0.73 0.71
aapp:dsyn  0.90 0.95 0.95  0.45 0.58 0.63  1.00 0.50 1.00  0.60 0.70 0.90  0.52 0.56 0.65  0.79 0.73 0.98  0.57 0.66 0.74
phsu:dsyn  0.89 0.93 0.94  0.40 0.48 0.57  0.50 0.12 1.00  0.40 0.20 0.80  0.50 0.56 0.69  0.56 0.17 0.98  0.43 0.49 0.76
orch:orch  0.85 0.92 0.91  0.47 0.60 0.57  1.00 1.00 1.00  0.90 0.80 0.70  0.51 0.60 0.57  1.00 0.99 0.79  0.66 0.76 0.71
phsu:phsu  0.85 0.90 0.91  0.35 0.41 0.47  0.33 0.20 1.00  0.30 0.50 0.50  0.39 0.42 0.47  0.40 0.38 0.78  0.44 0.49 0.56
orch:phsu  0.87 0.93 0.93  0.51 0.60 0.57  1.00 1.00 1.00  0.90 0.80 0.80  0.49 0.56 0.52  0.91 0.91 0.86  0.68 0.72 0.67
fndg:dsyn  0.89 0.95 0.94  0.46 0.60 0.60  1.00 1.00 1.00  0.60 0.80 0.80  0.56 0.69 0.69  0.88 0.80 0.75  0.65 0.68 0.72
orch:aapp  0.87 0.93 0.93  0.57 0.66 0.73  1.00 1.00 1.00  0.90 0.90 0.90  0.48 0.55 0.60  0.88 0.98 1.00  0.77 0.79 0.84
geoa:spco  0.79 0.77 0.93  0.32 0.23 0.52  1.00 0.50 1.00  0.60 0.30 0.60  0.39 0.26 0.56  0.91 0.51 0.84  0.54 0.35 0.64
geoa:idcn  0.65 0.81 0.88  0.10 0.11 0.28  0.05 0.03 0.50  0.00 0.00 0.70  0.17 0.09 0.25  0.00 0.00 0.69  0.14 0.06 0.45
topp:dsyn  0.90 0.95 0.95  0.53 0.66 0.66  1.00 1.00 1.00  0.90 0.90 0.90  0.60 0.77 0.72  0.96 0.88 0.95  0.72 0.82 0.86
hlca:dsyn  0.89 0.96 0.96  0.58 0.72 0.72  1.00 1.00 1.00  0.90 1.00 0.80  0.46 0.54 0.56  0.88 1.00 0.79  0.75 0.79 0.78
gngm:dsyn  0.93 0.97 0.96  0.47 0.72 0.74  0.50 1.00 1.00  0.60 0.80 0.90  0.48 0.65 0.66  0.62 0.82 1.00  0.50 0.79 0.82
fndg:humn  0.83 0.92 0.91  0.38 0.53 0.54  1.00 0.50 0.50  0.60 0.70 0.80  0.45 0.64 0.63  0.65 0.69 0.73  0.62 0.69 0.77
gngm:gngm  0.66 0.88 0.89  0.14 0.40 0.41  0.10 0.50 1.00  0.10 0.60 0.30  0.15 0.45 0.44  0.10 0.51 0.61  0.17 0.49 0.52
dsyn:fndg  0.81 0.91 0.92  0.31 0.44 0.43  0.25 0.50 0.33  0.20 0.60 0.60  0.42 0.49 0.46  0.32 0.55 0.49  0.45 0.53 0.51
phsu:fndg  0.78 0.91 0.90  0.28 0.51 0.47  0.50 1.00 1.00  0.50 0.50 0.50  0.30 0.49 0.46  0.54 0.76 0.68  0.41 0.62 0.58
dsyn:humn  0.80 0.87 0.88  0.30 0.40 0.42  1.00 0.50 0.20  0.70 0.50 0.50  0.35 0.49 0.54  0.81 0.45 0.40  0.56 0.58 0.56
dsyn:dsyn  0.86 0.92 0.92  0.40 0.50 0.53  0.50 1.00 1.00  0.60 0.80 0.70  0.55 0.65 0.66  0.54 1.00 0.86  0.55 0.67 0.73
aapp:gngm  0.70 0.88 0.87  0.19 0.36 0.37  0.14 0.33 0.20  0.10 0.30 0.30  0.24 0.42 0.42  0.14 0.29 0.32  0.27 0.43 0.47
Mean       0.83 0.91 0.92  0.38 0.50 0.54  0.69 0.68 0.81  0.55 0.63 0.69  0.42 0.52 0.56  0.63 0.66 0.76  0.53 0.61 0.67

Table 3: Running time and hardware requirements.
                            Hardware
Stage               Time    M       CPU   GPU   RAM
SemRep Processing   2 d     10-28   20+   Opt   N/A
AllenNLP Predicates 3 d     28-40   20+   Opt   N/A
Graph Construction  10 d    30+     20+   Opt   120GB+
Graph Conversion    7 h     1       40+   Opt   1TB+
Graph Embedding     1 d     20      24+   Opt   120GB+
AGATHA Training     22 h    5+      2+    Yes   300GB+
Network Adjacency   1 d     1       40+   Opt   1.5TB+

7 CASE STUDY
The proactive discovery of ongoing research findings is an important component in the validation of hypothesis generation systems [36]. In particular, in the current uncertain situation, when many unintentionally incorrect discoveries are published, the validation must include a human-in-the-loop component, even in a limited capacity such as in [2, 30]. To demonstrate the predictive potential of AGATHA-C, we perform a case study on three novel COVID-19-related connections manually selected by the domain expert. These connections were published after the cut date; all data used in training was available for download from NIH before that date.

At a low level, all AGATHA models use entity subsampling to calculate pairwise ranking criteria, which means that the absolute numbers may fluctuate slightly. Thus, to present the numeric scores, each experiment was repeated 100 times to compute the average and standard deviation that we present in Table 4.

Table 4: Scores for valid recently published connections obtained by different AGATHA models. Reported values are the average and standard deviation over 100 runs.
Connection          AGATHA-O      AGATHA-C      AGATHA-GP
COVID-19:Melatonin  0.63 ± 0.03   0.91 ± 0.03   0.78 ± 0.03
COVID-19:Oxytocin   0.75 ± 0.03   0.98 ± 0.02   0.81 ± 0.02
COVID-19:BST2 gene  0.41 ± 0.01   0.88 ± 0.03   0.74 ± 0.03

We tested whether AGATHA-C is able to predict compounds potentially applicable to the treatment of COVID-19 and the genes involved in SARS-CoV-2 pathogenesis. Data confirming the cardiovascular protective effects of the hormone oxytocin were published recently [9, 40]. The protective effect is linked to the anti-inflammatory activity of the hormone. For this connection AGATHA-C generated a score of 0.98.

Similarly, we tested the prediction of the effects of another hormone, melatonin. Several publications, starting from November 2020 [3, 8, 13, 43], show the protective effects of melatonin, specifically for COVID-19 neurological complications. The activity was linked to the anti-oxidative effects of melatonin. For this connection AGATHA-C generated a score of 0.91.

Our system accurately predicted, with a score of 0.88, the involvement of tetherin (BST2).
The results published in 2021 [32] show that tetherin restricts the secretion of SARS-CoV-2 viral particles and is downregulated by SARS-CoV-2. Therefore, pharmacological activation of tetherin expression, or inhibition of its degradation, could be a promising direction for the development of SARS-CoV-2 treatments.

8 LESSONS LEARNED AND OPEN PROBLEMS
Quality of the information retrieval pipelines. Information retrieval is an important part of any HG pipeline. In order to uncover implicit connections, the system should be able to capture existing explicit connections with as much quality as possible. Given that human knowledge is usually stored in a non-structured manner (e.g., scientific texts), the quality of systems that process raw textual data, such as those that solve the named entity recognition or word sense disambiguation problems, is crucial. We observed that the SemRep system performs better concept and relation recognition when full abstracts are used as input data instead of single sentences. SemRep also supports optional sortal anaphora resolution to extract co-references to entities from neighbouring sentences, which was shown to be useful in [17] and is used in this work.

“Positive” research bias. The absence of published negative research results is a big problem for the HG field. With mostly positive results available, we often have to generate negative examples through some kind of random sampling. These negative samples likely do not adequately represent the real nature of negatively confirmed scientific findings. Likely, one of the most important future work directions in the area of HG is to accurately distinguish and leverage positive and negative proposed results.

Domain experts involvement. When any hypothesis generation system is built, one of the first questions a designer should address is the extent to which domain experts are expected to participate in the pipeline. Modern decision-making systems allow a fully automated discovery process (like the AGATHA system), but this may not be sufficient. A domain expert who interfaces with an HG system as a black box may not trust the generated results or know how best to interpret them. The challenge of interpretable hypothesis generation remains a significant barrier to widespread adoption of these kinds of research tools. For this purpose we advocate using our “structural” learning HG system MOLIERE [35], in which topic modeling and network-analytic measures are used to interpret and explain the results.

The nature of input corpora. The question of what should be used as input to a topic-modeling-based hypothesis generation system is raised in [34]. Using full-text papers shows an improvement, but the trade-off between run time and output quality was barely justifiable. However, deep learning models have a greater potential for extracting useful information from large input sources and, as demonstrated in our previous work [37], show significant performance advancements. Thus the question of using full-text papers in deep learning-based hypothesis generation systems should be addressed. Unfortunately, this is currently too computationally expensive for our resources, as the number of sentences, and thus of predicates and edges, would be significantly larger.

Knowledge resolution. Our newly proposed systems showed that the knowledge resolution plays a major role in subdomain recommendation.
To increase the scope of model expertise (and the scope of potential applications beyond the biomedical fields) we deliberately incorporate a general-purpose information retrieval system, RnnOIE, into AGATHA-GP. This additional information results in significant gains in broad subdomains like (Geographic Area) → (Idea or Concept) (geoa,idcn). At the same time, we observe that AGATHA-C performs better in “microscopic” biomedical areas, e.g., (Organic Chemical) → (Organic Chemical) (orch,orch), which raises the question of choosing the appropriate model for every specific use case. Although both systems process all types of queries, the general-purpose predicates used in training significantly improve “macroscopic” types of queries.

9 RELATED WORK
A number of works have been proposed to organize the CORD-19 literature into a structured knowledge graph for different purposes. For instance, Basu et al. [5] propose ERLKG, a knowledge graph built on CORD-19 with entities corresponding to gene/chemical/disease names and edges forming relations between the concepts. They use a fine-tuned SciBERT model for both entity and relation extraction. The main purpose of the knowledge graph is to predict a link between a given chemical-disease or chemical-protein pair using a trained GCN autoencoder [19] approach. In another similar work, Oniani et al. [25] build a co-occurrence network on a subset of CORD-19 with edges of the gene-disease, gene-mutation or chemical-disease type. The network is then embedded into a latent space using node2vec. Link prediction is performed on the nodes by training different classical machine learning algorithms. A major shortcoming of these approaches is that they limit themselves to specific kinds of entities, relations, or both; as a result, not only is the scope of possible new literature narrowed, but a lot of additional useful knowledge is also filtered out of the system.
In contrast, our system does not limit itself to a specific entity or relation type and is able to capture much more information from the same corpus.

A major motivation for constructing knowledge graphs is to allow medical researchers to repurpose existing drugs for treating COVID-19. Zhang et al. [42] develop a system that uses combined semantic predications from SemMedDB and CORD-19 (extracted using SemRep) to recommend drugs for COVID-19 treatment. To improve the predications from CORD-19, the authors fine-tune various transformer-based models on a manually annotated internal dataset. Their resulting knowledge graph consists of 131,555 nodes and 2,558,935 edges. Our work, on the other hand, utilizes similar technologies and produces a bigger graph with 287,356,836 nodes and 13,500,291,256 edges. Moreover, we do not post-process the relations extracted by SemRep and are still able to achieve a higher ROC metric. Another system, proposed by Martinc et al. [22], uses a fine-tuned SciBERT model to generate contextualized embeddings of CORD-19 articles and, using an initial seed set of targets, proposes possible therapy targets. However, this system is very different from ours, as it treats the entire article as a bag of words and directly trains a word embedding model on CORD-19.

It was earlier noted that KinderMiner [20] provides a web-based literature discovery tool and supports COVID-19 queries. The underlying algorithm is based on a simple keyword co-count between source and target words in a given corpus. While co-count is a fast and scalable approach, it suffers from a lack of “discrimination”, i.e., two keywords occurring together more frequently does not always imply a high degree of correlation.

The vastness of the COVID-19 literature also spurred the need for systems that allow researchers and general users alike to get their COVID-19 queries answered. Systems like CKG (Wise et al.) [41] and SciSight (Hope et al.) [14] currently provide this functionality.
While we do aim to provide an easy-to-use web framework for medical researchers, the scope of the aforementioned systems is beyond the scope of our work.

Unfortunately, none of the existing systems trained to accept terms related to COVID-19 or SARS-CoV-2 provided open access for massive validation to allow a fair comparison, or could be tested in multiple domains like AGATHA-C.

10 CONCLUSIONS
We present two graph-mining, transformer-based models, AGATHA-C and AGATHA-GP, for micro- and macroscopic scales of queries, respectively, which are designed to help domain experts solve high-priority research problems and accelerate scientific discovery. We perform per-subdomain validation of these new models on a rapidly changing COVID-19-focused dataset composed of recently published concept pairs and demonstrate that the proposed models achieve state-of-the-art prediction quality. Both models significantly outperform the existing baseline system AGATHA-O. We deploy the proposed models to the broad scientific community and believe that our contribution can raise more interest in prospective hypothesis generation applications.

REFERENCES
[1] [n.d.]. Citations Added to MEDLINE by Fiscal Year. https://www.nlm.nih.gov/bsd/stats/cit_added.html
[2] Marina Aksenova, Justin Sybrandt, Biyun Cui, Vitali Sikirzhytski, Hao Ji, Diana Odhiambo, Matthew D Lucius, Jill R Turner, Eugenia Broude, Edsel Peña, et al. 2019.
Inhibition of the Dead Box RNA Helicase 3 prevents HIV-1 Tat and cocaine-induced neurotoxicity by targeting microglia activation. Journal of Neuroimmune Pharmacology (2019), 1–15.
[3] Lise Alschuler, Ann Marie Chiasson, Randy Horwitz, Esther Sternberg, Robert Crocker, Andrew Weil, and Victoria Maizes. 2020. Integrative medicine considerations for convalescence from mild-to-moderate COVID-19 disease. Explore (2020).
[4] Patrick Arnold and Erhard Rahm. 2015. SemRep: A repository for semantic mapping. Datenbanksysteme für Business, Technologie und Web (BTW 2015) (2015).
[5] Sayantan Basu, Sinchani Chakraborty, Atif Hassan, Sana Siddique, and Ashish Anand. 2020. ERLKG: Entity Representation Learning and Knowledge Graph based association analysis of COVID-19 through mining of unstructured biomedical corpora. In Proceedings of the First Workshop on Scholarly Document Processing. Association for Computational Linguistics, Online, 127–137. https://doi.org/10.18653/v1/2020.sdp-1.15
[6] Iz Beltagy, Arman Cohan, and Kyle Lo. 2019. Scibert: Pretrained contextualized embeddings for scientific text. arXiv preprint arXiv:1903.10676 (2019).
[7] Olivier Bodenreider. 2004. The Unified Medical Language System (UMLS): Integrating Biomedical Terminology.
[8] Daniel P Cardinali, Gregory M Brown, and Seithikurippu R Pandi-Perumal. 2020. Can Melatonin Be a Potential “Silver Bullet” in Treating COVID-19 Patients? Diseases 8, 4 (2020), 44.
[9] Phuoc-Tan Diep. 2021. Is there an underlying link between COVID-19, ACE2, oxytocin and vitamin D? Medical Hypotheses 146 (2021), 110360.
[10] R. A. DiGiacomo, J. M. Kremer, and D. M. Shah. 1989. Fish-oil dietary supplementation in patients with Raynaud’s phenomenon: a double-blind, controlled, prospective study. Am J Med 86, 2 (Feb 1989), 158–164.
[11] Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew Peters, Michael Schmitz, and Luke S. Zettlemoyer. 2017. AllenNLP: A Deep Semantic Natural Language Processing Platform. arXiv:1803.07640
[12] Vishrawas Gopalakrishnan, Kishlay Jha, Wei Jin, and Aidong Zhang. 2019. A survey on literature based discovery approaches in biomedical domain. Journal of biomedical informatics 93 (2019), 103141.
[13] Ping Ho, Jing-Quan Zheng, Chia-Chao Wu, Yi-Chou Hou, Wen-Chih Liu, Chien-Lin Lu, Cai-Mei Zheng, Kuo-Cheng Lu, and You-Chen Chao. 2021. Perspective Adjunctive Therapies for COVID-19: Beyond Antiviral Therapy. International Journal of Medical Sciences 18, 2 (2021), 314.
[14] Tom Hope, Jason Portenoy, Kishore Vasan, Jonathan Borchardt, Eric Horvitz, Daniel S. Weld, Marti A. Hearst, and Jevin West. 2020. SciSight: Combining faceted navigation and research group detection for COVID-19 exploratory scientific search. arXiv:2005.12668 [cs.IR]
[15] Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2017. Billion-scale similarity search with GPUs. arXiv preprint arXiv:1702.08734 (2017).
[16] Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hervé Jégou, and Tomas Mikolov. 2016. FastText.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651 (2016).
[17] H. Kilicoglu, G. Rosemblat, M. Fiszman, and T. C. Rindflesch. 2016. Sortal anaphora resolution to enhance relation extraction from biomedical literature. BMC Bioinformatics 17 (Apr 2016), 163.
[18] Halil Kilicoglu, Dongwook Shin, Marcelo Fiszman, Graciela Rosemblat, and Thomas C. Rindflesch. 2012. SemMedDB: a PubMed-scale repository of biomedical semantic predications. Bioinform. 28, 23 (2012), 3158–3160. http://dblp.uni-trier.de/db/journals/bioinformatics/bioinformatics28.html#KilicogluSFRR12
[19] Thomas N. Kipf and Max Welling. 2017. Semi-Supervised Classification with Graph Convolutional Networks. In International Conference on Learning Representations (ICLR).
[20] F. Kuusisto, J. Steill, Z. Kuang, J. Thomson, D. Page, and R. Stewart. 2017. A Simple Text Mining Approach for Ranking Pairwise Associations in Biomedical Applications. AMIA Jt Summits Transl Sci Proc 2017 (2017), 166–174.
[21] Adam Lerer, Ledell Wu, Jiajun Shen, Timothee Lacroix, Luca Wehrstedt, Abhijit Bose, and Alex Peysakhovich. 2019. PyTorch-BigGraph: A Large-scale Graph Embedding System. In Proceedings of the 2nd SysML Conference. Palo Alto, CA, USA.
[22] Matej Martinc, Blaž Škrlj, Sergej Pirkmajer, Nada Lavrač, Bojan Cestnik, Martin Marzidovšek, and Senja Pollak. 2020. COVID-19 Therapy Target Discovery with Context-Aware Literature Mining. In Discovery Science, Annalisa Appice, Grigorios Tsoumakas, Yannis Manolopoulos, and Stan Matwin (Eds.). Springer International Publishing, Cham, 109–123.
[23] A. T. McCray, A. Burgun, and O. Bodenreider. 2001. Aggregating UMLS semantic types for reducing conceptual complexity. Stud Health Technol Inform 84, Pt 1 (2001), 216–220.
[24] Mark Neumann, Daniel King, Iz Beltagy, and Waleed Ammar. 2019. Scispacy: Fast and robust models for biomedical natural language processing. arXiv preprint arXiv:1902.07669 (2019).
[25] David Oniani, Guoqian Jiang, Hongfang Liu, and Feichen Shen. 2020. Constructing co-occurrence network embeddings to assist association extraction for COVID-19 and other coronavirus infectious diseases. Journal of the American Medical Informatics Association 27, 8 (05 2020), 1259–1267.
[26] Matthew Rocklin. 2015. Dask: Parallel Computation with Blocked algorithms and Task Scheduling. In Proceedings of the 14th Python in Science Conference, Kathryn Huff and James Bergstra (Eds.). 130–136.
[27] M. Schuster and K. K. Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing 45, 11 (1997), 2673–2681. https://doi.org/10.1109/78.650093
[28] Neil R Smalheiser. 2017. Rediscovering Don Swanson: The past, present and future of literature-based discovery. Journal of Data and Information Science 2, 4 (2017), 43–64.
[29] Scott Spangler. 2015. Accelerating Discovery: Mining Unstructured Information for Hypothesis Generation. Chapman and Hall/CRC.
[30] Scott Spangler, Angela D Wilkins, Benjamin J Bachman, Meena Nagarajan, Tajhal Dayaram, Peter Haas, Sam Regenbogen, Curtis R Pickering, Austin Comer, Jeffrey N Myers, et al. 2014. Automated hypothesis generation based on mining scientific literature. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. 1877–1886.
[31] Gabriel Stanovsky, Julian Michael, Luke Zettlemoyer, and Ido Dagan. 2018. Supervised Open Information Extraction. In Proceedings of The 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT). Association for Computational Linguistics, New Orleans, Louisiana, (to appear).
[32] Hazel Stewart, Kristoffer H Johansen, Naomi McGovern, Roberta Palmulli, George W Carnell, Jonathan Luke Heeney, Klaus Okkenhaug, Andrew Firth, Andrew A Peden, and James R Edgar. 2021. SARS-CoV-2 spike downregulates tetherin to enhance viral spread. bioRxiv (2021), 2021–01.
[33] Don R Swanson. 1986. Fish oil, Raynaud’s syndrome, and undiscovered public knowledge. Perspectives in biology and medicine 30, 1 (1986), 7–18.
[34] Justin Sybrandt, Angelo Carrabba, Alexander Herzog, and Ilya Safro. 2018. Are Abstracts Enough for Hypothesis Generation? In 2018 IEEE International Conference on Big Data (Big Data). 1504–1513. https://doi.org/10.1109/bigdata.2018.8621974
[35] Justin Sybrandt, Michael Shtutman, and Ilya Safro. 2017. MOLIERE: Automatic Biomedical Hypothesis Generation System. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Halifax, NS, Canada) (KDD ’17). ACM, New York, NY, USA, 1633–1642. https://doi.org/10.1145/3097983.3098057
[36] Justin Sybrandt, Michael Shtutman, and Ilya Safro. 2018. Large-Scale Validation of Hypothesis Generation Systems via Candidate Ranking. In 2018 IEEE International Conference on Big Data (Big Data). 1494–1503. https://doi.org/10.1109/bigdata.2018.8622637
[37] Justin Sybrandt, Ilya Tyagin, Michael Shtutman, and Ilya Safro. 2020. AGATHA: Automatic Graph Mining And Transformer Based Hypothesis Generation Approach. Association for Computing Machinery, New York, NY, USA, 2757–2764. https://doi.org/10.1145/3340531.3412684
[38] Huijun Wang, Ying Ding, Jie Tang, Xiao Dong, Bing He, Judy Qiu, and David J Wild. 2011. Finding complex biological relationships in recent PubMed articles using Bio-LDA. PloS one 6, 3 (2011), e17243.
[39] Lucy Lu Wang, Kyle Lo, Yoganand Chandrasekhar, Russell Reas, Jiangjiang Yang, Darrin Eide, K. Funk, Rodney Michael Kinney, Ziyang Liu, W. Merrill, P. Mooney, D. Murdick, Devvret Rishi, Jerry Sheehan, Zhihong Shen, Brandon Stilson, Alex D Wade, Kuansan Wang, Christopher Wilhelm, Boya Xie, Douglas M. Raymond, Daniel S. Weld, Oren Etzioni, and Sebastian Kohlmeier. 2020. CORD-19: The Covid-19 Open Research Dataset. ArXiv (2020).
[40] Stephani C Wang and Yu-Feng Wang. 2021. Cardiovascular protective properties of oxytocin against COVID-19. Life Sciences (2021), 119130.
[41] Colby Wise, Vassilis N. Ioannidis, Miguel Romero Calvo, Xiang Song, George Price, Ninad Kulkarni, Ryan Brand, Parminder Bhatia, and George Karypis. 2020. COVID-19 Knowledge Graph: Accelerating Information Retrieval and Discovery for Scientific Literature. arXiv:2007.12731 [cs.IR]
[42] Rui Zhang, Dimitar Hristovski, Dalton Schutte, Andrej Kastrin, Marcelo Fiszman, and Halil Kilicoglu. 2020. Drug Repurposing for COVID-19 via Knowledge Graph Completion. arXiv:2010.09600 [cs.CL]
[43] Petra Zimmermann and Nigel Curtis. 2020. Why is COVID-19 less severe in children? A review of the proposed mechanisms underlying the age-related difference in severity of SARS-CoV-2 infections. Archives of Disease in Childhood (2020).
bioRxiv preprint doi: https://doi.org/10.1101/2021.02.11.430789; this version posted February 11, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license (http://creativecommons.org/licenses/by-nc-nd/4.0/).