key: cord-0039636-w13bh9ia
authors: Brack, Arthur; D’Souza, Jennifer; Hoppe, Anett; Auer, Sören; Ewerth, Ralph
title: Domain-Independent Extraction of Scientific Concepts from Research Articles
date: 2020-03-17
journal: Advances in Information Retrieval
DOI: 10.1007/978-3-030-45439-5_17
sha: 65438011f67bdb17497db772ce1290544edc8f2d
doc_id: 39636
cord_uid: w13bh9ia

We examine the novel task of domain-independent scientific concept extraction from abstracts of scholarly articles and present two contributions. First, we suggest a set of generic scientific concepts that have been identified in a systematic annotation process. This set of concepts is utilised to annotate a corpus of scientific abstracts from 10 domains of Science, Technology and Medicine at the phrasal level in a joint effort with domain experts. The resulting dataset is used in a set of benchmark experiments to (a) provide baseline performance for this task and (b) examine the transferability of concepts between domains. Second, we present a state-of-the-art deep learning baseline. Further, we propose an active learning strategy for an optimal selection of instances from among the various domains in our data. The experimental results show that (1) a substantial agreement is achievable by non-experts after consultation with domain experts, (2) the baseline system achieves a fairly high F1 score, and (3) active learning enables us to nearly halve the amount of required training data.

Scholarly communication as of today is a document-centric process. Research results are usually conveyed in written articles, as a PDF file with text, tables and figures. Automatic indexing of these texts is limited and generally does not access their semantic content. There are thus severe limitations on how current research infrastructures can support scientists in their work: finding relevant research, comparing it, and compiling summaries is still tedious and error-prone manual work. The rapid increase in the number of published research papers aggravates this situation [7].

Knowledge graphs are recognised as an effective approach to facilitate semantic search [3]. For academic search engines, Xiong et al. [47] have shown that exploiting knowledge bases like Freebase can improve search results. However, the introduction of new scientific concepts occurs at a faster pace than knowledge base curation, resulting in a large gap in knowledge base coverage of scientific entities [1]; e.g. the Computer Vision task of geolocation estimation of photos is neither present in Wikipedia nor in more specialised knowledge bases like the Computer Science Ontology (CSO) [39] or "Papers with code" [36]. Information extraction from text helps to identify emerging entities and to populate knowledge graphs [3]. It is thus a vital first step towards a fine-grained research knowledge graph in which research articles are described and interconnected through entities like tasks, materials, and methods.

Our work is motivated by the idea of the automatic construction of a research knowledge graph. Information extraction from scientific texts differs from its general-domain counterpart: understanding a research paper and determining its most important statements demands a certain expertise in the article's domain. Every domain is characterised by its specific terminology and phrasing, which is hard to grasp for a non-expert reader.
In consequence, extraction of scientific concepts from text would entail the involvement of domain experts and a specific design of an extraction methodology for each scientific discipline; both requirements are rather time-consuming and costly. At present, a systematic study of these assumptions is missing. We thus present the task of domain-independent scientific concept extraction. We examine the intuition that most research papers share certain core concepts, such as mentions of research tasks or methods. If so, these concepts would allow a domain-independent information extraction system to support populating a research knowledge graph that does not reach the full semantic depth of the analysed articles but still provides some science-specific structure.

In this paper, we introduce a set of common scientific concepts that we find to be relevant across a set of 10 examined domains from Science, Technology, and Medicine (STM). These generic concepts have been identified in a systematic, joint effort of domain experts and non-domain experts. The inter-coder agreement is measured to ensure the adequacy and quality of the concepts. A set of research abstracts has been annotated using these concepts and the results are discussed with experts from the corresponding fields. The resulting dataset serves as a basis to train two baseline deep learning classifiers. In particular, we present an active learning approach to reduce the amount of required training data. The systems are evaluated in different experimental setups.

Our main contributions can be summarised as follows: (1) We introduce the novel task of domain-independent scientific concept extraction, which aims at automatically extracting scientific entities in a domain-independent manner. (2) We release a new corpus that comprises 110 abstracts from 10 STM domains annotated at the phrasal level. (3) We present and evaluate a state-of-the-art deep learning approach for this task. Additionally, we employ active learning for an optimal selection of instances, which, to our knowledge, is demonstrated for the first time on scholarly text. We find that strategic instance selection gives us the same performance with only about half of the training data. (4) We release a silver-labelled corpus of 62K automatically annotated Elsevier abstracts with CC BY license, covering 24 domains and containing 1.2 million extracted unique concepts. (5) We make our corpora and source code publicly available to facilitate further research.

This section gives a brief overview of existing annotated datasets for scientific information extraction, followed by related work on some exemplary applications of domain-independent information extraction from scientific papers.

Sentence Level Annotation. Early approaches for semantic structuring of research papers focused on sentences as the basic unit of analysis. This enables, for instance, automatic highlighting of relevant paper passages to support efficient assessment of quality and relevance. Several ontologies have been created that focus on the rhetorical [11, 19], argumentative [31, 46] or activity-based [37] structure of research papers.
Annotated datasets exist for several domains, e.g. PubMed200k [12] from biomedical randomized controlled trials, NICTA-PIBOSO [26] from evidence-based medicine, Dr. Inventor [15] from Computer Graphics, Core Scientific Concepts (CoreSC) [31] from Chemistry and Biochemistry, Argumentative Zoning (AZ) [46] from Chemistry and Computational Linguistics, and the Sentence Corpus [8] from Biology, Machine Learning and Psychology. Most datasets cover only a single domain, while a few others cover up to three domains. Several machine learning methods have been proposed for scientific sentence classification [12, 15, 24, 30].

Phrase Level Annotation. More recent corpora have been annotated at the phrasal level (e.g. noun phrases). SciCite [9] and ACL ARC [25] are datasets for citation intent classification from Computer Science, Medicine, and Computational Linguistics. ACL RD-TEC [20] from Computational Linguistics aims at extracting scientific technology and non-technology terms. ScienceIE-17 [2] from Computer Science, Material Sciences, and Physics contains the three concepts Process, Task and Material. SciERC [32] from the machine learning domain contains the six concepts Task, Method, Metric, Material, Other-Scientific-Term and Generic. Each corpus covers at most three domains.

Experts vs. Non-experts. The aforementioned datasets were usually annotated by domain experts [2, 12, 20, 26, 31, 32]. In contrast, Teufel et al. [46] explicitly use non-experts in their annotation tasks, arguing that text understanding systems can also rely on general, rhetorical and logical aspects when qualifying scientific text. Following this line of thought, other researchers used (presumably cheaper) non-expert annotation as an alternative [8, 15]. Snow et al. [43] provide a study on expert versus non-expert performance for general, non-scientific annotation tasks. They state that about four non-experts (Mechanical Turk workers, in their case) were needed to rival the experts' annotation quality. However, systems trained on data generated by non-experts were shown to benefit from annotation diversity and to suffer less from annotator bias. A recent study [38] examines the agreement between experts and non-experts for visual concept classification and person recognition in historical video data. For the task of face recognition, training with expert annotations led to an increase of only 1.5% in classification accuracy.

Active Learning in Natural Language Processing (NLP). To the best of our knowledge, active learning has not been utilised in classification approaches for scientific text yet. Recent publications demonstrate the effectiveness of active learning for NLP tasks such as Named Entity Recognition (NER) [41] and sentence classification [49]. Siddhant and Lipton [42] and Shen et al. [41] compare several sampling strategies on NLP tasks and show that Maximum Normalized Log-Probability (MNLP) based on uncertainty sampling performs well in NER.

Academic Search Engines. Academic search engines such as Google Scholar [18], Microsoft Academic [34] and Semantic Scholar [40] specialise in the search of scholarly literature. They exploit graph structures such as the Microsoft Academic Knowledge Graph [35], SciGraph [45], or the Semantic Scholar Corpus [1]. These graphs interlink the papers through metadata such as citations, authors, venues, and keywords, but not through a deep semantic representation of the articles' content. However, first attempts towards a more semantic representation of article content exist: Ammar et al. [1] interlink the Semantic Scholar Corpus with DBpedia [29] and the Unified Medical Language System (UMLS) [6] using entity linking techniques. Yaman et al. [48] connect SciGraph with DBpedia person entities.
Xiong et al. [47] demonstrate that academic search engines can greatly benefit from exploiting general-purpose knowledge bases. However, the coverage of science-specific concepts is rather low [1].

Research Paper Recommendation Systems. Beel et al. [4] provide a comprehensive survey of research paper recommendation systems. Such systems usually employ different strategies (e.g. content-based and collaborative filtering) and several data sources (e.g. text in the documents, ratings, feedback, stereotyping). Graph-based systems, in particular, exploit citation graphs and genes mentioned in the papers [27]. Beel et al. conclude that it is not possible to determine the most effective recommendation approach at the moment. However, we believe that a fine-grained research knowledge graph can improve such systems. Although "Papers with code" [36] is not a typical recommendation system, it allows researchers to easily browse for papers from the field of machine learning that address a certain task.

In this section, we introduce the novel task of domain-independent extraction of scientific concepts and present an annotated corpus. As the discussion of related work reveals, the annotation of scientific resources is not a novel task. However, most researchers focus on at most three scientific disciplines and on expert-level annotations. In this work, we explore the domain-independent annotation of lexical phrasal units indicating scientific knowledge, i.e. scientific concepts, in abstracts from ten different science domains. Since other studies have also shown that non-expert annotations are feasible for the scientific domain, we go for a cost-efficient middle course: annotations by non-experts with scientific proficiency, and consultation with domain experts. Finally, we explore how well a state-of-the-art deep learning model performs on this novel information extraction task and whether active learning can help to reduce the amount of required training data. Our novel corpus and the annotation process are described below.

The OA-STM corpus [14] is a collection of open access articles from the STM domains considered in this work. This first annotation cycle focuses on the articles' abstracts, as they contain a condensed summary of the article. The OA-STM corpus is used as a basis for (a) the identification of potential domain-independent concepts and (b) a first annotated corpus for baseline classification experiments. The annotation task was mainly performed by two postdoctoral researchers with a background in Computer Science (acting as non-expert annotators); their basic annotation assumptions were checked by domain experts.

Pre-annotation. A literature review of annotation schemes [2, 11, 30, 31] provided a seed set of potential candidate concepts. Both non-experts independently annotated a subset of the STM abstracts with these concepts (non-overlapping) and discussed the outcome. In a three-step process, the concept set was pruned to contain only those concepts which seemed suitably transferable between domains. Our set of generic scientific concepts consists of Process, Method, Material, and Data (see Table 1 for their definitions). We also identified Task [2], Object [30], and Results [11]; however, in this study we do not consider nested span concepts, hence we leave them out since they were almost always nested with the other scientific entities (e.g. a Result may be nested with Data).

Table 1 (excerpt). Definitions of the generic scientific concepts with example phrases:
- Method: A commonly used procedure that acts on entities, e.g. powder X-ray (Che), the PRAM analysis (CS), magnetoencephalography (Med).
- Material: A physical or abstract entity used in scientific experiments or proofs, e.g. soil (Agr), the moon (Ast), the carbonator (Che).
- Data: The data themselves, measurements, or quantitative or qualitative characteristics of entities, e.g. rotational energy (Eng), tensile strength (MS), 3D time-lapse seismic data (ES).
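To make the phrase-level annotation format concrete, the following minimal sketch shows how the annotated concept phrases of one abstract could be represented as labelled character spans. The sentence, the offsets, and all helper names are invented for illustration; this is not the authors' annotation tooling or data format.

```python
# Minimal sketch of a phrase-level annotation record (illustrative only; the
# sentence, spans, and field names are hypothetical, not taken from the corpus).
from dataclasses import dataclass

CONCEPTS = {"Process", "Method", "Material", "Data"}

@dataclass
class Span:
    start: int   # character offset of the first character of the phrase
    end: int     # character offset one past the last character
    label: str   # one of the generic scientific concepts

    def text(self, doc: str) -> str:
        return doc[self.start:self.end]

abstract = "The tensile strength of the carbon fibre composite was measured by tensile testing."
annotations = [
    Span(4, 20, "Data"),       # "tensile strength"
    Span(28, 50, "Material"),  # "carbon fibre composite"
    Span(67, 82, "Method"),    # "tensile testing"
]

for span in annotations:
    assert span.label in CONCEPTS
    print(f"{span.label:8s} -> {span.text(abstract)!r}")
```

Storing annotations as character offsets keeps them independent of any particular tokenisation, which matters once the spans are converted into token-level tags for the sequence tagger described below.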
Phase I. Five abstracts per domain (i.e. 50 abstracts) were annotated by both annotators and the inter-annotator agreement was computed using Cohen's κ [10] on exact annotated spans. The results showed a moderate inter-annotator agreement of 0.52 κ.

Phase II. The annotations were then presented to subject specialists who each reviewed (a) the choice of concepts and (b) the annotation decisions on the respective domain corpus. The interviews mostly confirmed the concept candidates as generally applicable. The experts' feedback on the annotations was even more valuable: the comments allowed for a more precise reformulation of the annotation guidelines, including illustrative examples from the corpus.

Consolidation. Finally, the 50 abstracts from Phase I were re-annotated by the non-experts. Based on the revised annotation guidelines, a substantial agreement of 0.76 κ could be reached (see Table 2). Similar annotation tasks for scientific entities, i.e. SciERC [32] covering one domain and ScienceIE-17 [2] covering three domains, achieved agreements of 0.76 κ and 0.6 κ, respectively. Subsequently, the remaining 60 abstracts (six per domain) were annotated by one annotator. This phase also involved reconciliation of the previously annotated 50 abstracts to obtain a gold standard corpus.

The current state of the art for scientific entity extraction is Beltagy et al.'s deep learning system with SciBERT word embeddings [5], which were pre-trained on scientific texts using the BERT [13] architecture. It consists of three components: (a) a token embedding layer comprising a per-sentence sequence of tokens, where each token is represented as a concatenation of a SciBERT word embedding and CNN-based character embeddings [33], (b) a token-level encoder with two stacked bidirectional LSTMs [21], and (c) a Conditional Random Field (CRF) based tag decoder [33] with the BILOU (beginning, inside, last, outside, unit) tagging scheme (a small encoding example is sketched below). This deep learning architecture is implemented in AllenNLP [17] and uses spaCy [44] for text preprocessing, i.e. for tokenisation and sentence splitting.

Using the above-mentioned architecture, we train one model with data from all domains combined; we refer to this model as the domain-independent classifier. Similarly, we train 10 models, one per domain in our corpus: the domain-specific classifiers. To obtain a robust evaluation of the models, we perform five-fold cross-validation experiments. In each fold, we train a model on 8 abstracts per domain (i.e. 80 abstracts), tune hyperparameters on 1 abstract per domain (i.e. 10 abstracts), and test on the remaining 2 abstracts per domain (i.e. 20 abstracts), ensuring that the data splits are not identical between the folds. All results reported in the paper are averaged over the five folds. We still obtain reliably trained domain-specific classifiers since on average they are trained on 400 concepts. In this setting, we employ an active learning strategy [42, 49] to train a new domain-independent classifier.
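The BILOU scheme mentioned above turns each annotated phrase into per-token tags that the CRF decoder predicts. Below is a simplified sketch of such an encoding over pre-tokenised sentences; it is not the authors' AllenNLP implementation, and the example span indices are chosen for illustration (the phrase is one of the Data examples from Table 1).

```python
# Simplified BILOU encoding of labelled token spans (a sketch, not the authors'
# AllenNLP code). Spans are given as (start_token, end_token_exclusive, label).
from typing import List, Tuple

def bilou_encode(tokens: List[str],
                 spans: List[Tuple[int, int, str]]) -> List[str]:
    tags = ["O"] * len(tokens)              # O = outside any concept phrase
    for start, end, label in spans:
        if end - start == 1:                # single-token phrase
            tags[start] = f"U-{label}"      # U = unit
        else:
            tags[start] = f"B-{label}"      # B = beginning
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{label}"      # I = inside
            tags[end - 1] = f"L-{label}"    # L = last
    return tags

tokens = "3D time-lapse seismic data were acquired".split()
spans = [(0, 4, "Data")]                    # "3D time-lapse seismic data"
for token, tag in zip(tokens, bilou_encode(tokens, spans)):
    print(f"{token:12s} {tag}")
# 3D           B-Data
# time-lapse   I-Data
# seismic      I-Data
# data         L-Data
# were         O
# acquired     O
```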
Active learning is usually applied to determine the optimal set of sufficiently distinct instances in order to minimise annotation costs. With our application of active learning, we determine which proportion of our annotations suffices for training a robust classifier. We decide to use the MNLP [41] sampling strategy. We prefer it over its contemporary, Bayesian Active Learning by Disagreement (BALD) [22], since it has lower computational requirements. The MNLP objective involves greedy sampling of sentences, preferring those with the least logarithmic likelihood of the predicted tag sequence output by the CRF tag decoder, normalised by the number of tokens to avoid preferring longer sentences (see the sketch below). In our experiments, we found adding 4% of the data per step to be the most informative granularity for tracking classifier performance. Therefore, we run 25 iterations of active learning, adding 4% of the training data in each iteration. We perform five-fold cross-validation as before and the per-fold models are retrained after data resampling.

In this section, we discuss the results obtained with our trained classifiers and the correlation analysis between inter-annotator agreement and the performance of the classifiers. Table 4 shows an overview of the domain-independent classifier results. The system achieves an overall F1 of 65.5% with a low standard deviation of 1.26 across the five folds. For this classifier, Material was the easiest concept with an F1 of 71% (±1.88), whereas Method was the hardest concept with an F1 of 43% (±6.30). Method is also the most underrepresented concept in our corpus, which partly explains the poor extraction performance. The best reported results for similar datasets, ScienceIE-17 [2] and SciERC [32] (both have 500 abstracts), are an F1 score of 65.6% [5] and 44.7% [32], respectively, indicating that the size of our dataset with only 110 abstracts is sufficient. Next, we compare and contrast the 10 domain-specific classifiers (see Fig. 1) by their capability to extract the concepts from their own domains and in other domains.

Most Robust Domain. Bio (third bar in each domain in Fig. 1) extracts scientific concepts from its own domain at the same performance as the domain-independent classifier, with an F1 score of 71% (±9.0), demonstrating a robust domain. It comprises only 11% of the overall data, yet the domain-independent classifier trained on all data does not outperform it.

Domain-Independent vs. Domain-Specific Classifier. Except for Bio, the domain-independent classifier clearly outperforms the domain-specific one in extracting concepts from the respective domains. We attribute this, in part, to the improved span-detection performance. Span detection merely relies on syntactic regularity, thus the domain-independent classifier can benefit from more training data from other domains. For example, the CS classifier shows a relative improvement of 49.5% over its domain-specific F1 score, reaching 65.9% in the domain-independent setting, which is supported by the enhanced span-detection performance from 73.4% to 82.0% F1. Accuracy on token level also improves from 67.7% to 77.5% F1 for CS, that is, correct labelling of the tokens also benefits from other domains. This is also supported by the results in the confusion matrix depicted in Fig. 2 for the CS and the domain-independent classifier on token level.

Scientific Concept Extraction (see Fig. 3).

The results of the active learning experiment over the full dataset, plotted over the 25 iterations, are depicted in Fig. 4, showing that MNLP clearly outperforms the random baseline.
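For reference, a minimal sketch of the MNLP criterion as described above: a sentence is scored by the log-probability of its most likely tag sequence, normalised by its length, and the lowest-scoring sentences are selected first. The per-sentence log-probabilities are assumed to be provided by the trained CRF tagger; the function and variable names are illustrative, not part of the authors' code.

```python
# Sketch of Maximum Normalized Log-Probability (MNLP) sampling for active learning.
# `candidates` pairs each unlabelled sentence with the log-probability of its most
# likely (Viterbi) tag sequence as scored by the current CRF tagger; the values
# below are made up for illustration.
from typing import List, Tuple

def mnlp_select(candidates: List[Tuple[List[str], float]],
                budget: int) -> List[List[str]]:
    """Return the `budget` sentences the model is least confident about."""
    def mnlp_score(item: Tuple[List[str], float]) -> float:
        tokens, viterbi_log_prob = item
        # Normalise by sentence length so long sentences are not preferred
        # merely because their sequence log-probability is lower.
        return viterbi_log_prob / max(len(tokens), 1)

    ranked = sorted(candidates, key=mnlp_score)   # lowest normalised score first
    return [tokens for tokens, _ in ranked[:budget]]

# Hypothetical pool of (tokens, log P of the predicted tag sequence)
pool = [
    ("the carbonator was heated".split(), -2.1),
    ("we apply magnetoencephalography to measure brain activity".split(), -9.6),
    ("soil samples were collected".split(), -1.2),
]
for sentence in mnlp_select(pool, budget=1):
    print(" ".join(sentence))   # prints the least confident sentence
```

In an iterative setup such as the one described above, this selection step would be rerun after each retraining, with the freshly scored remainder of the pool as `candidates`.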
While using only 52% of the training data, the best result of the domain-independent classifier trained with all training data is surpassed, with an F1 score of 65.5% (±1.0). The random baseline achieves an F1 score of only 62.5% (±2.6) with the same proportion of training data. When 76% of the data are sampled by MNLP, the best active learning performance across all steps is achieved, with an F1 score of 69.0% on the validation set and the best F1 of 66.4% (±2.0) on the test set. Thus, 76% of our annotated sentences suffice to train an optimally performing model. Analysing the distribution of sentences in the training data sampled by MNLP shows Math and CS as the most preferred domains and Eng and MS as the least preferred ones. Nonetheless, all domains are represented, that is, a non-uniform mix of sentences sampled by MNLP yields the most generic model with less training data. In contrast, the random sampling strategy uniformly samples sentences from all domains. Further, Table 5 shows for the related SciERC [32] and ScienceIE-17 [2] datasets the proportion of training data at which MNLP reaches the performance obtained with the entire training dataset. The results indicate that MNLP can also significantly reduce the amount of labelled training data for related datasets of scientific text.

Fig. 4. Progress of active learning with the MNLP and random sampling strategies; the areas represent the standard deviation (std) of the F1 score across the 5 folds for MNLP and random sampling, respectively.

In this section, we analyse the correlations (Pearson's R) of the inter-coder agreement κ and the number of annotated concepts per domain (#) with (1) the F1 performance and (2) the variance, respectively the standard deviation (std), of the classifiers across five-fold cross-validation. Table 6 summarises the results of our correlation analysis. The active learning classifier (AL-trained) has been trained with 52% of the training data sampled by MNLP, since this is the point at which the performance of the model trained on the full data is surpassed (see Table 5). For the domain-specific, domain-independent and AL-trained classifiers, we observe a strong correlation between F1 and the number of concepts per domain (R = 0.70, 0.76, 0.68) and a weak correlation between κ and F1 (R = 0.20, 0.28, 0.23). Thus, we surmise that the number of annotated concepts in a particular domain has more influence on the performance than the inter-annotator agreement.

Table 6. Inter-annotator agreement (κ) and the number of concept phrases (#) per domain; F1 and std of the domain-specific classifiers on their own domains; F1 and std of the domain-independent and AL-trained classifiers on each domain; the right side depicts the correlation coefficients (R) of each row with κ and the number of concept phrases.

The correlation values for the variance differ between the classifier types. For the domain-specific classifiers, the correlations of κ with std and of the number of concepts per domain with std are slightly positive (R = 0.29, 0.28), i.e. the higher the agreement and the larger the domain, the higher the variance of the domain-specific classifier. For the domain-independent classifier, there is no correlation (R = 0.11, −0.05), and for the AL-trained classifier the correlations become negative (R = −0.41, −0.72), i.e. higher agreement and more annotated concepts per domain lead to less variance for the AL-trained classifier.
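The correlation analysis itself reduces to computing Pearson's R between per-domain quantities. A minimal sketch is given below; the arrays hold placeholder values only, since the per-domain numbers from Table 6 are not reproduced in the text, and NumPy's corrcoef is used rather than whatever tooling the authors employed.

```python
# Sketch of the correlation analysis: Pearson's R between per-domain quantities
# (inter-annotator agreement kappa, number of annotated concepts) and per-domain
# classifier F1. The ten values per array are placeholders, not the Table 6 data.
import numpy as np

kappa      = np.array([0.76, 0.68, 0.81, 0.62, 0.70, 0.74, 0.66, 0.79, 0.72, 0.65])
n_concepts = np.array([310, 280, 450, 260, 350, 390, 300, 420, 370, 290])
f1         = np.array([0.63, 0.60, 0.71, 0.58, 0.64, 0.66, 0.61, 0.69, 0.65, 0.59])

r_kappa_f1 = np.corrcoef(kappa, f1)[0, 1]       # agreement vs. performance
r_size_f1  = np.corrcoef(n_concepts, f1)[0, 1]  # domain size vs. performance
print(f"R(kappa, F1)     = {r_kappa_f1:.2f}")
print(f"R(#concepts, F1) = {r_size_f1:.2f}")
```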
In summary, we hypothesise that more diverse training data from several domains lead to better performance and lower variance by introducing an inductive bias.

In this paper, we have introduced the novel task of domain-independent concept extraction from scientific texts. During a systematic annotation procedure involving domain experts, we have identified four general core concepts that are relevant across the domains of Science, Technology and Medicine. To enable and foster research on these topics, we have annotated a corpus covering these domains. We have verified the adequacy of the concepts by evaluating the human annotator agreement for our broad STM domain corpus. The results indicate that the identification of the generic concepts in a corpus covering 10 different scholarly domains is feasible for non-experts with moderate agreement and, after consultation with domain experts, with substantial agreement (0.76 κ). We evaluated a state-of-the-art system on our annotated corpus, which achieved a fairly high F1 score (65.5% overall). The domain-independent system noticeably outperforms the domain-specific systems, which indicates that the model can generalise well across domains. We also observed a strong correlation between the number of annotated concepts per domain and classifier performance, and only a weak correlation between inter-annotator agreement per domain and the performance. We assume that more annotated data positively influence the performance in the respective domain. Furthermore, we have suggested active learning for our novel task. We have shown that only approximately five annotated abstracts per domain are sufficient as training data to build a performant model. Our active learning results for the SciERC [32] and ScienceIE-17 [2] datasets were similar. These promising results suggest that we do not need a large annotated dataset for scientific information extraction. Active learning can significantly save annotation costs and enable fast adaptation to new domains.
We make our annotated corpus, a silver-labelled corpus with 62K abstracts comprising 24 domains, and our source code publicly available. Thereby, we hope to facilitate research on the task of scientific information extraction and its several applications, e.g. academic search engines or research paper recommendation systems. In the future, we plan to extend and refine the concepts for certain domains. We also intend to apply and evaluate our automatic scientific concept extraction system to expand an open research knowledge graph [23]. For this purpose, we plan to extend the corpus with additional relevant annotation layers such as coreference links [28] and relations [16, 32].

References
Construction of the literature graph in Semantic Scholar
SemEval 2017 Task 10: ScienceIE - extracting keyphrases and relations from scientific publications
Entity-oriented search
Research-paper recommender systems: a literature survey
SciBERT: pretrained language model for scientific text
The Unified Medical Language System (UMLS): integrating biomedical terminology
Growth rates of modern science: a bibliometric analysis based on the number of publications and cited references
Statistical models for text classification and clustering: applications and analysis
Structural scaffolds for citation intent classification in scientific publications
A coefficient of agreement for nominal scales
The document components ontology (DoCO)
PubMed 200k RCT: a dataset for sequential sentence classification in medical abstracts
BERT: pre-training of deep bidirectional transformers for language understanding
On the discoursive structure of computer graphics research papers
SemEval-2018 Task 7: semantic relation extraction and classification in scientific papers
AllenNLP: a deep semantic natural language processing platform
Google Scholar
SALT: semantically annotated LaTeX
The ACL RD-TEC: a dataset for benchmarking terminology extraction and classification in computational linguistics
Long short-term memory
Bayesian active learning for classification and preference learning
Open Research Knowledge Graph: next generation infrastructure for semantic scholarly knowledge
Hierarchical neural networks for sequential sentence classification in medical scientific abstracts
Measuring the evolution of a scientific field through citation frames
Automatic classification of sentences to support evidence based medicine
Relational retrieval using a combination of path-constrained random walks
End-to-end neural coreference resolution
DBpedia - a large-scale, multilingual knowledge base extracted from Wikipedia
Automatic recognition of conceptualization zones in scientific articles and two life science applications
Corpora for the conceptualisation and zoning of scientific papers
Multi-task identification of entities, relations, and coreference for scientific knowledge graph construction
End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF
Scholarly ontology: modelling scholarly practices
Investigating correlations of inter-coder agreement and machine annotation performance for historical video data
The Computer Science Ontology: a large-scale taxonomy of research areas
Deep active learning for named entity recognition
Deep Bayesian active learning for natural language processing: results of a large-scale empirical study
Cheap and fast - but is it good? Evaluating non-expert annotations for natural language tasks
Towards discipline-independent argumentative zoning: evidence from chemistry and computational linguistics
Explicit semantic ranking for academic search via knowledge graph embedding
Interlinking SciGraph and DBpedia datasets using link discovery and named entity recognition techniques
Active discriminative text representation learning