key: cord-0176065-hl1905y2 authors: Anteghini, Marco; D'Souza, Jennifer; Santos, Vitor A. P. Martins dos; Auer, Soren title: SciBERT-based Semantification of Bioassays in the Open Research Knowledge Graph date: 2020-09-16 journal: nan DOI: nan sha: bff1ef1bc0122f471333d7a0b10d6d6a8e37efb4 doc_id: 176065 cord_uid: hl1905y2

As a novel contribution to the problem of semantifying biological assays, in this paper, we propose a neural-network-based approach to automatically semantify, and thereby structure, unstructured bioassay text descriptions. Experimental evaluations show promise, as the neural-based semantification significantly outperforms a naive frequency-based baseline approach. Specifically, the neural method attains 72% F1 versus 47% F1 from the frequency-based method.

Biological assays are defined as standard biochemical test procedures used to determine the concentration or potency of a stimulus (physical, chemical, or biological) by its effect on living cells or tissues [3, 4]. In the context of the current Covid-19 pandemic, bioassays are critical, for example, for vaccine development. They reveal the functional and biologically relevant immunological responses that correlate with vaccine efficacy. However, massive volumes of bioassays are being produced and researchers are inundated with this information. Apart from their sheer quantity, bioassay diversity presents enormous challenges to organizing, standardizing, and integrating the data with the goal of maximizing their scientific and ultimately their public health impact as the screening results are carried forward into drug development programs. Against this broad societal application setting, we present a solution as a step toward easier knowledge acquisition of bioassays for researchers: the neural-based automated structuring of unstructured, non-standardized bioassays based on the standardized BioAssay Ontology (BAO) [7].
Bioassays, until their recent semantification in an expert-annotated dataset [2, 5, 6] based on the BAO, were published in the form of unstructured text. Integrating their semantified counterpart in a knowledge graph (KG) facilitates their advanced computational processing. For example, bioassays can be easily compared across their key properties, viz. Target, Perturbagen, Participants, and Detection Technology, captured as KG nodes and links. Nonetheless, the fine-grained manual semantification of bioassays is a costly and time-intensive endeavor. Automated semantification not only alleviates the costly manual task, but also potentially makes it possible to rapidly semantify this data in large volumes.

Herein, we present our novel SciBERT-based [1] neural BAO [7] bioassay semantification system. For automated bioassay semantification, we carry out the supervised machine learning of semantic statements (i.e., subject-predicate-object triples) based on the BioAssay Ontology (BAO) [7] for a given unstructured bioassay description. The code for our method is publicly available at: https://github.com/MarcoAnteghini/ SciBERT-bioassays ORKG.

Our dataset for learning comprises an expert, manually annotated collection of 983 semantified bioassays [5, 6]. In the data, each assay has between 5 and 92 semantic statements, with an average of 53. To better reflect the data, we show example annotations in Table 1 for a selected bioassay.

has assay format → biochemical format
has assay format → protein format
has assay format → single protein format
assay measurement type → endpoint assay

Table 1: Four example semantic statement annotations (from 50 total) for PubChem Assay ID 346. Note, these statements are triples with subject "bioassay."

The dataset can be formalized as follows. Let b be a bioassay from the assays dataset B. Each b_i is annotated with an annotation sequence as_i such that as_i ⊆ S, where S is the set of all possible semantic statements seen in the training dataset.
Specifically, as_i = {s_1, s_2, s_3, ..., s_k}, such that each s_x is a semantic statement in S; as_i has k different statements. In general, annotation sequences are of varying lengths. The dataset we use has |S| = 1756 unique statements (after filtering for non-informative ones).

In the supervised task, the input data instance corresponds to a pair (b, s; c), where c ∈ {true, false} is the classification label. Thus, our semantification problem is formulated as a binary classification task: (b, s) is true if s is in b's annotation sequence as, else false. False instances are formed by pairing b with any other statement that is not in the annotation sequence as of b. As an aggregate, the semantification of each bioassay is a multi-label, multi-class classification problem that we have broken up into binary classification decisions. Intuitively, our task formulation is meaningful because it emulates the way the human expert annotates the data. Basically, the expert, from their memory of all semantic statements S, simply assigns s to a given b if they deem it true; irrelevant statements are not considered, and are thus implicitly deemed false.

Our machine learning system is the state-of-the-art, bidirectional transformer-based SciBERT [1], pre-trained on millions of scientific articles. In each data instance (b, s; c), the classifier input representation for the pair (b, s) is the standard SciBERT format, treating them as a sentence pair separated by the special [SEP] token; the special classification token ([CLS]) remains the first token of every instance. Its final hidden state is used as the aggregate sequence representation for classification and is fed into a linear classification layer.

For robust evaluations, we perform 3-fold cross validation (2:1 train-test split). In each fold experiment, the training data contains roughly 655 bioassays and the remaining 328 bioassays are used for testing, where the test assays are unique across the folds.
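The construction of binary instances (b, s; c) described above can be sketched as follows. This is a minimal illustration rather than our released implementation: the function names, the toy assay text, and the use of uniform random sampling to pick the false statements are assumptions made for the example, and the pair formatter only mimics the [CLS]/[SEP] layout that a real SciBERT tokenizer would produce on its own.

```python
import random


def build_instances(annotations, all_statements, n_false, seed=0):
    """Build (bioassay_text, statement, label) triples.

    annotations:    dict mapping a bioassay text b to its set of true statements
    all_statements: the label set S
    n_false:        number of false instances sampled per bioassay
    Uniform random sampling of negatives is an assumption of this sketch.
    """
    rng = random.Random(seed)
    instances = []
    for text, true_stmts in annotations.items():
        # Every annotated statement yields a true instance.
        for s in sorted(true_stmts):
            instances.append((text, s, True))
        # False instances pair b with statements outside its annotation.
        candidates = sorted(set(all_statements) - set(true_stmts))
        k = min(n_false, len(candidates))
        for s in rng.sample(candidates, k):
            instances.append((text, s, False))
    return instances


def as_sentence_pair(text, statement):
    # Illustrative layout only; a tokenizer adds these special tokens itself.
    return f"[CLS] {text} [SEP] {statement} [SEP]"


# Toy data: the statements come from Table 1, the assay text and its
# two-statement annotation are hypothetical.
S = [
    "has assay format → biochemical format",
    "has assay format → protein format",
    "has assay format → single protein format",
    "assay measurement type → endpoint assay",
]
toy = {"hypothetical assay description": {S[0], S[3]}}

for text, statement, label in build_instances(toy, S, n_false=2):
    print(label, as_sentence_pair(text, statement))
```

In this sketch, n_false plays the role of the number of false instances per bioassay, so raising it changes only the ratio of false to true pairs seen in training.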
Standard precision (P), recall (R), and F-score (F1) metrics are used. We refer the reader to the SciBERT paper [1] for hyperparameter details. Finally, we have an additional parameter: the number of false instances per bioassay. It is varied from 100 to 300, in increments of 10, to obtain an optimal model.

Our results are depicted in Tables 2 and 3. We examine the research question (RQ): can advanced neural technologies be leveraged to automatically semantify bioassays? We find that the cumulative obtainable F1 by the SciBERT classifier out-of-the-box is 0.72 (bold in Table 3), significantly higher than the 0.47 from a naive frequency-based semantification approach. Furthermore, the difference between the neural approach and the frequency method is clearly evident in the hit-and-miss illustration in Fig. 1. The thin top neck of the curve in Fig. 1(a) indicates that the neural approach, for most bioassays, had faster true semantic statement hits among its top-scoring predictions. Thus, answering the RQ, neural technologies can indeed perform reliable semantification of bioassays. They are also practically efficient: given the 1756 unique statements considered as labels, each test assay is semantified at a rate of 4 seconds per assay.

The discovery of cures during pandemics such as Covid-19 can be greatly expedited if scientists are given intelligent information access tools, and our work toward automatically semantifying bioassays is a step in this direction. We refer the reader to the Appendix for an illustrated use case of semantified bioassays data in next-generation digital libraries.

Each bioassay has on average 53 labels. The distribution is visible in Figure 2.

Figure 3 is an instance of integrating one semantified bioassay in the ORKG DL. This bioassay was semantified on eight semantic statements based on the BAO.
Integrating machine-actionable graphs of bioassays is essential for the ORKG DL to automatically compute the tabulated comparison surveys of several bioassays, as shown in Figure 4 in the next section.

Next-generation DLs target semantified scholarly knowledge. The ORKG, with the semantified bioassays integrated, automatically computes their survey comparisons depending on how many of the machine-actionable assays were selected to be compared by the user. Such tools must be available to scientists to assist them in such massive knowledge ingestion scenarios, so that they can quickly grasp the scholarly knowledge highlights, fostering faster progress with discoveries.

[1] SciBERT: Pretrained language model for scientific text
[2] Fast and accurate semantic annotation of bioassays exploiting a hybrid of machine learning and user confirmation
[3] Uses of bioassay in entomology
[4] Statistical method in biological assay
[5] Bioassay ontology annotations facilitate cross-analysis of diverse high-throughput screening data sets
[6] Formalization, annotation and analysis of diverse drug and probe screening assay datasets using the bioassay ontology (BAO)
[7] Bioassay ontology (BAO): a semantic description of bioassays and high-throughput screening results