key: cord-0471583-02rexr1y authors: Liu, Shifeng; Sun, Yifang; Li, Bing; Wang, Wei; Bourgeois, Florence T.; Dunn, Adam G. title: Sent2Span: Span Detection for PICO Extraction in the Biomedical Text without Span Annotations date: 2021-09-06 journal: nan DOI: nan sha: a9b131fefb85936e939c8592d4481699fa72d94c doc_id: 471583 cord_uid: 02rexr1y

The rapid growth in published clinical trials makes it difficult to maintain up-to-date systematic reviews, which requires finding all relevant trials. This leads to policy and practice decisions based on out-of-date, incomplete, and biased subsets of available clinical evidence. Extracting and then normalising Population, Intervention, Comparator, and Outcome (PICO) information from clinical trial articles may be an effective way to automatically assign trials to systematic reviews and avoid searching and screening, the two most time-consuming systematic review processes. We propose and test a novel approach to PICO span detection. The major difference between our proposed method and previous approaches is that spans are detected without any annotated span data, using only crowdsourced sentence-level annotations. Experiments on two datasets show that our approach achieves substantially higher recall in PICO span detection than fully supervised methods, with PICO sentence detection at least as good as human annotations. By removing the reliance on expert annotations for span detection, this work could be used in a human-machine pipeline for turning low-quality, crowdsourced, sentence-level PICO annotations into structured information that can be used to quickly assign trials to relevant systematic reviews.

Systematic reviews are a critical part of regulatory and clinical decision-making because they are designed to robustly make sense of all available evidence from primary research, especially clinical trials, accounting for study design quality and heterogeneity. Searching and screening for reports of clinical trials are time-consuming tasks that require specialised expertise but are a necessary component of systematic reviews. The rapid rate at which papers were published about COVID-19 (Wang and Lo, 2021) highlights the need for tools to improve the efficiency of systematic reviews. A range of methods have been developed to help reduce the amount of human effort required to conduct systematic reviews (Tsafnat et al., 2014; Marshall and Wallace, 2019), but the methods developed to support the screening task are trained on data from a small number of systematic reviews and are not yet able to fully replace humans (O'Mara-Eves et al., 2015). An alternative is to find ways to map all clinical trials to standardised representations of the populations, interventions, comparators, and outcomes (PICO) and aggregate information across studies that answer equivalent clinical questions. PICO extraction is a well-studied problem and was used as one of the example tasks in the development of SciBERT (Beltagy et al., 2019). Many PICO extraction methods focus on annotating sentences that include PICO information. However, if the goal is to fully automate a process for augmenting a systematic review with new studies as their results become available, even a perfect annotation of just the sentence is not enough. An expert will still need to read the sentence and then extract and normalise the information representing the population, intervention, comparator, and primary outcomes (PICO) from those sentences.
Full automation of the task requires the ability to identify the text spans that represent the PICO information. Named entity recognition (NER) seeks to identify the types and boundaries of targeted spans in unstructured text data. Machine learning NER methods have been based on BiLSTM-CRF (Lample et al., 2016) and BERT (Devlin et al., 2019), and these require human-annotated data for training. For other application domains where span annotation targets entities such as people, locations, and organisations, crowdsourcing of labels is somewhat easier because a broader range of people can annotate data without specialised training. Even when annotated by domain experts, there is still substantial inconsistency across annotators (Lee and Sun, 2019), though sentence-level annotations tend to be more consistent (Zlabinger et al., 2020). There is a gap in both the volume of available training data and demonstrated performance between NER in general domains such as news and NER in biomedical applications such as PICO extraction. There is a clear need for new approaches that can handle the more complicated and challenging token structure of biomedical entities, including unusual synonyms and subordinate clauses that might include numerical information (e.g. a drug dosage or a test result threshold), and that can incorporate domain-specific knowledge in intelligent and useful ways.

In this paper, we address these challenges by proposing a novel PICO annotation task design. We propose a simplified requirement for annotation where we only need to know whether a sentence includes any PICO information. We use this coarser set of annotations to learn and infer PICO sentence types and span detections. A pre-trained neural language model (Devlin et al., 2019; Lee et al., 2020; Peng et al., 2019) is first fine-tuned as a feature representation learning model for sentence classification, identifying whether a given sentence contains PICO spans. This also makes the proposed approach capable of inferring PICO sentences even without crowdsourced annotations. To get the PICO spans, we apply a masked span prediction task to assist the inference process. The fine-tuned language model is then used as the task-specific knowledge provider for the PICO spans. Scored spans are then fed into an inference algorithm to produce the final detection results. The contributions of this paper are as follows:

• We propose a span detection approach for PICO extraction that uses only low-quality, crowdsourced, sentence-level annotations as inputs, which reduces the need for time-consuming annotations from experts.

• We evaluate a novel structure for identifying candidate PICO sentences and masked span inference together. The masked span inference task replaces input spans with predefined mask tokens, and the language model is used to infer which spans contribute most to the PICO sentence classification results.

• We demonstrate results on two benchmark datasets that substantially improve recall in span detection, aligning with the use case in the systematic review application domain.

2 Related Work

Pre-trained deep neural language models, such as ELMo (Peters et al., 2018), GPT (Radford et al., 2018), BERT (Devlin et al., 2019) and its variants (Liu et al., 2019; Lan et al., 2020), have brought significant performance improvements in a wide range of NLP tasks, such as relation extraction (Alt et al., 2019), entity resolution (Li et al., 2021), and question answering (Lee et al., 2020).
These language models generally benefit from large-scale text corpora. A major advantage of these methods comes from the way long-range token dependencies are captured to produce contextualised representations. To adapt language models for use in the biomedical domain, researchers took pre-trained language models and re-trained them with domain-specific corpora, including PubMed abstracts and full-text articles. These were then applied to a diverse range of NLP tasks, such as named entity recognition (Lee et al., 2020; Peng et al., 2019) and document classification (Peng et al., 2019). In this paper, we make use of pre-trained neural language models as the backbone model to learn task-oriented information.

There are multiple use cases associated with representing clinical trial reports (including article abstracts, registrations, and protocols) by the participant inclusion criteria (including condition), the interventions used in each of the study arms, and the set of primary outcomes measured during the trial. Like named entity recognition (NER), PICO detection aims to identify spans in the text corresponding to each of the categories: population, interventions, and outcomes. While it is possible to apply general NER methods to the PICO extraction task (Nguyen et al., 2017; Nye et al., 2018; Kang et al., 2021), there are several key differences that make general NER methods less effective. These differences include spans that often do not have distinguishing features such as capitalised tokens, and PICO elements that are not limited to noun phrases. Most PICO extraction methods are fully supervised and need annotated data, which requires expertise and can be time-consuming. To acquire enough PICO annotations for training, researchers have developed methods that use crowdsourcing as an alternative (Nguyen et al., 2017; Nye et al., 2018). This results in low-quality annotations, especially for the boundaries of the spans. To improve the annotation quality, Zlabinger et al. (2020) simplified the annotation task from the document level to the sentence level and additionally guided workers with similar sentences that had already been annotated, retrieved using an unsupervised semantic short-text similarity method. In this paper, we instead require only sentence-level crowdsourced annotations without any boundary information, and we develop a novel method to predict PICO spans without span-level training data.

Compared with token-based methods such as BiLSTM-CRF (Lample et al., 2016; Ma and Hovy, 2016), span-based methods treat spans (i.e. consecutive tokens) as the targets. In one stream of research on span-based methods, the aim is to extract the hidden representations of each token with the raw token sequence as input, then either use boundary token representations (Ouchi et al., 2018; Ebner et al., 2020) or aggregate all the token representations as the span representation. All possible spans are enumerated, classified, and decoded. An alternative stream of span-based research aims to mask a span in the token sequence and recover the masked tokens from hidden representations (Joshi et al., 2020). Both research streams use supervised or self-supervised methods and high-quality annotations as training data. In this paper, we instead extract the information stored in the span using only sentence-level annotations derived from crowdsourcing.
In this section, we define the task and then introduce the proposed approach, with BLUE (Peng et al., 2019), a BERT-structured (Devlin et al., 2019) neural language model for the biomedical domain, as our backbone model, together with the inference algorithm.

Generally, to construct a dataset for PICO span detection, the annotators are presented with the full text, i.e., the entire abstract of a clinical trial report (Nguyen et al., 2017; Nye et al., 2018). To improve the quality of the annotation, Zlabinger et al. (2020) proposed a novel annotation task, asking annotators to annotate sentences instead of abstracts, with expert-annotated sentences retrieved by sentence similarity methods shown as examples. Both annotation tasks require annotators to locate the boundaries of PICO spans. However, obtaining agreed boundaries for PICO span annotation is challenging, especially for crowdsourced annotators in the biomedical domain. To bridge the gap between PICO sentence prediction and PICO span detection, we formalise the annotated dataset and the task as follows. We represent the dataset as D with |D| sentences, and a sentence from the dataset as s with sentence annotations from |C| annotators. We use ⟨i, j⟩ to denote the boundaries of a PICO span starting from the i-th token (inclusive) and ending before the j-th token (exclusive). The task is then to train a model M that is able to yield a PICO span ⟨i, j⟩ with probability P(⟨i, j⟩ | M, s, C).

The pre-trained neural language model BLUE is trained on domain-specific text (e.g. PubMed abstracts). This injects the neural language model with biomedical knowledge. From the annotators, we have gathered, for each sentence, whether it contains PICO annotations or not. These annotations represent task-specific information and can be used to "teach" the model. Thus, we frame this process as a sentence classification task and fine-tune the pre-trained neural language model BLUE to predict whether a given sentence is a PICO sentence. We feed each sentence into the pre-trained BLUE model and collect the contextualised representation h ∈ R^h of the [CLS] token. With this representation h, we predict the probability p that the sentence is a PICO sentence through a learned transformation of h, where W ∈ R^(h×h) and b ∈ R^h are learnable parameters trained during the fine-tuning process, and y is a function of the C annotations, y = f(C_1, ..., C_|C|) (details of the function are presented in Section 4.1). [Figure 1 caption fragment: the fine-tuned BLUE encoder predicts the score with the span masked (the right part of the figure); the predicted scores of candidate spans, along with the raw token sequence, are collected for inference.] Compared with other state-of-the-art text classification methods (Huang et al., 2019; Zhang et al., 2020), our sentence classification method is relatively simple, with just one extended module. The corresponding loss is the cross-entropy between the predicted probability p and the label y, L = -[y log p + (1 - y) log(1 - p)].
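As a concrete illustration of the sentence-classification step, the following is a minimal sketch assuming a BERT-compatible BLUE checkpoint available through the Hugging Face transformers library; the checkpoint name, the PicoSentenceClassifier class, and the label convention are illustrative assumptions, not the authors' released code.

```python
# Sketch: fine-tune a biomedical BERT-style encoder to flag PICO sentences.
# The checkpoint name below is an assumption -- substitute whichever
# BLUE/PubMed checkpoint is actually available.
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bionlp/bluebert_pubmed_uncased_L-12_H-768_A-12"  # assumed name

class PicoSentenceClassifier(nn.Module):
    def __init__(self, model_name: str = MODEL_NAME):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        # Binary head over the [CLS] representation (PICO sentence vs. not).
        self.classifier = nn.Linear(hidden, 2)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls_repr = out.last_hidden_state[:, 0]   # [CLS] representation h
        return self.classifier(cls_repr)         # unnormalised class scores

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = PicoSentenceClassifier()
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)

def training_step(sentences, labels):
    """labels[i] = 1 if the sentence-level annotation marks sentence i as PICO."""
    batch = tokenizer(sentences, padding=True, truncation=True,
                      max_length=512, return_tensors="pt")
    logits = model(batch["input_ids"], batch["attention_mask"])
    loss = loss_fn(logits, torch.tensor(labels))
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

After fine-tuning, the same classifier supplies the sentence-level PICO scores that the masked span prediction step below compares against.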
Like humans locating PICO spans in known PICO sentences, we want the model to focus on the most indicative token spans in the sentences. A straightforward way is to directly use the last contextualised representations of the fine-tuned model to make predictions. However, such an approach is problematic. First, the last contextualised representations contain information from both the corresponding tokens and the surrounding tokens, so taking either the span boundary representations or the inner span representations would introduce interference and lead to error-prone results. Second, these representations carry information at the token level rather than the span level, and so do not treat the span as a whole. To address these problems, we apply a Masked Span Prediction (MSP) task, which is similar to the masked language model training task used in BERT (shown in Figure 1). In the MSP task, for a span ⟨i, j⟩, we first mask the corresponding tokens in the original token sequence with the [MASK] token. This masked token sequence is fed into the fine-tuned neural language model to get the prediction score score_{i,j}. The score score_{i,j} is then compared with the original score to infer the impact of the span ⟨i, j⟩, i.e., its contribution to classifying the sentence as a PICO sentence. We define the contribution of a span ⟨i, j⟩ as contribution_{i,j} = score - score_{i,j}. The contribution contribution_{i,j} can be either positive or negative.

Given a sentence with N tokens, there are O(N^2) candidate spans to be masked, which could generate an intractable number of spans. To reduce the number of candidate spans, we follow Ouchi et al. (2018) and Ebner et al. (2020) by limiting the number of tokens in a span to at most M. This reduces the number of candidate spans to M(2N - M + 1)/2, which is linear in the length of the sentence. We observe that some spans with a negative contribution to PICO sentence classification have at least one nested span split where the split spans also have negative contributions. To reduce the number of candidate spans for inference, we make use of this observation and apply a bottom-up nested span elimination algorithm (Algorithm 1). The eliminated span set RM is initialised with the single-token spans that have negative contributions to PICO sentence classification. For spans with two or more tokens, we search every split of the target span (Line 2). When both split spans are in the existing eliminated span set RM (Line 3), we update RM with the target span (Line 4), claim this span can be eliminated (Line 5), and return the result. As the removed spans depend on the model-specific initialisation of RM, we report the percentage of eliminated candidate spans in Section 4.6.

With all the reserved candidate spans and their contributions, we need to select spans that (1) have positive contributions to PICO sentence classification and (2) are not nested spans. Following previous work (Ouchi et al., 2018; Ebner et al., 2020), we establish a similar argmax inference method. However, unlike the tasks in Ouchi et al. (2018) and Ebner et al. (2020), where each role is generally satisfied by exactly one span, multiple spans can appear in one sentence in the PICO span detection task. We therefore apply a top-K argmax inference method with a pre-defined K for each PICO type. As shown in Algorithm 2, we first sort the candidate span set by span contribution (Line 2) and iteratively select spans (Lines 3-7). If a span does not overlap with already selected spans (Line 4), we add it to the result set. The algorithm ends when we have selected K spans (Line 6) or traversed all the reserved candidate spans.
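The sketch below puts these pieces together for a single sentence and a single PICO type. It assumes a function pico_score(tokens) that returns the fine-tuned classifier's PICO-sentence probability for a token list; the names are illustrative, and the nested-span elimination follows the description of Algorithm 1 under the reading that spans of two or more tokens are checked against their splits.

```python
def detect_pico_spans(tokens, pico_score, max_len, k, mask_token="[MASK]"):
    """Sketch of Sent2Span inference for one sentence and one PICO type.
    pico_score(tokens) is assumed to return the fine-tuned classifier's
    probability that the token sequence is a PICO sentence."""
    n = len(tokens)
    base = pico_score(tokens)

    def masked_score(i, j):
        # Replace span <i, j> (end-exclusive) with [MASK] tokens and rescore.
        masked = tokens[:i] + [mask_token] * (j - i) + tokens[j:]
        return pico_score(masked)

    contrib, removed = {}, set()
    # Bottom-up over span lengths: single-token spans are always scored; a
    # longer span is eliminated (never scored) if some split of it yields two
    # already-eliminated parts, following the nested-span observation.
    for length in range(1, min(max_len, n) + 1):
        for i in range(0, n - length + 1):
            j = i + length
            if length > 1 and any((i, m) in removed and (m, j) in removed
                                  for m in range(i + 1, j)):
                removed.add((i, j))
                continue
            contrib[(i, j)] = base - masked_score(i, j)
            if length == 1 and contrib[(i, j)] < 0:
                removed.add((i, j))

    # Top-K argmax: keep the highest positive contributions, no overlaps.
    selected = []
    for (i, j) in sorted(contrib, key=contrib.get, reverse=True):
        if contrib[(i, j)] <= 0 or len(selected) == k:
            break
        if all(j <= a or b <= i for a, b in selected):
            selected.append((i, j))
    return selected
```

The elimination step only skips model calls; it never adds spans back, so the selected spans are always a subset of the reserved candidates with positive contributions.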
In this section, we evaluate our method and compare it with supervised and semi-supervised methods and with crowdsourced annotations on two benchmark datasets. We also investigate the PICO sentence prediction results and the effect of candidate span reduction in the proposed method.

We use two benchmark datasets for the PICO span detection task: the EBM-NLP dataset (Nye et al., 2018) and the PICO-data dataset (Nguyen et al., 2017). The dataset statistics are shown in Table 1. All datasets are in English. The PICO-data dataset includes 3549, 500, and 191 abstracts for the training, development, and test sets, respectively, and is annotated only with the Population (P) type. The EBM-NLP dataset includes 4,993 PubMed abstracts annotated with Population (P), Intervention (I), and Outcome (O); Comparator (C) and Intervention (I) are combined together as Intervention (I). As each PICO type is annotated individually to avoid cognitive load, the dataset contains three sub-datasets, each with a single PICO type. As there is no standard train-development split, we hold out 500 abstracts from the training set as the development set for each PICO type. The training, development, and test sets in both datasets include the crowdsourced annotations and the aggregated annotations. The test sets also have span annotations from medical experts. We additionally supply PICO sentence annotations.

Table 2: Precision, Recall, and F1 score for PICO span detection on the EBM-NLP dataset. Supervised and semi-supervised methods are trained on the entire training set using the aggregated crowdsourced PICO span annotations. Weakly-supervised methods are trained on the entire training set using only PICO sentence annotations. Human annotations are the aggregated crowdsourced PICO span annotations on the test set. All methods are evaluated against expert span annotations on the test set. We highlight recall as it is the most important metric when PICO detection is applied for systematic review processes. * indicates results obtained using the crowdsourced sentence annotations.

We use the BLUE (Peng et al., 2019) model trained on PubMed abstracts as the backbone model, with the BERT-Base structure and uncased tokenisation. The maximum sequence length is set to 512 tokens. The training batch size is 32 and the evaluation batch size is 64. For PICO sentence classification, we use the Adam optimiser (Kingma and Ba, 2015) and set the peak learning rate to 2e-5. We train the models for 5 epochs and select the models with the best F1 score on the development set. For PICO masked span prediction, the maximum span lengths M are set to 20, 7, and 10 tokens for Population, Intervention, and Outcome, respectively. The number of selected candidate spans K is set to 2 for all PICO types. The maximum span lengths and the number of selected candidate spans are set based on statistics of the aggregated crowdsourced PICO spans on the development sets so as to cover at least 90% and 95% of PICO spans, respectively. All the experiments are run on one NVIDIA V100 GPU. Our source code is available online.

Table 4: Accuracy, Precision, Recall, and F1 score of PICO sentence classification for crowdsourced annotations and the proposed methods on the EBM-NLP dataset. Crowd_X refers to the aggregated crowdsourced annotations mentioned in Section 4.1. Sent2Span_X refers to the proposed methods trained with different PICO sentence annotations mentioned in Section 4.1. The bold-faced scores represent the best results among crowdsourced annotations and proposed methods.

Following previous work (Nguyen et al., 2017; Nye et al., 2018), we use the token-wise precision, recall, and F1 score of the output PICO spans against the expert annotations.
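A minimal sketch of this token-wise evaluation: predicted and expert spans are expanded into sets of covered token indices and compared. The function names are illustrative.

```python
def tokens_covered(spans):
    """Expand spans <i, j> (end-exclusive) into the set of covered token indices."""
    return {t for i, j in spans for t in range(i, j)}

def token_prf(predicted_spans, expert_spans):
    """Token-wise precision, recall, and F1 against expert annotations."""
    pred, gold = tokens_covered(predicted_spans), tokens_covered(expert_spans)
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Example: one predicted span vs. an expert span that is two tokens longer.
print(token_prf([(3, 8)], [(3, 10)]))  # (1.0, ~0.714, ~0.833)
```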
Following previous work (Thomas et al., 2021), we focus on recall, as it is the most important metric when PICO detection is applied for systematic review processes. We also report the accuracy, precision, recall, and F1 of the PICO sentence classification results against the expert annotations. We compare Sent2Span against a supervised method, a semi-supervised method, and aggregated human annotations. All methods are evaluated against the expert annotations on the test set.

• HMMCrowd (Nguyen et al., 2017): HMMCrowd extends the Dawid-Skene model (Dawid and Skene, 1979) with an HMM component and explicitly uses the sequential structure of spans. This model is applied directly to the crowdsourced annotations to obtain the aggregated annotations without training.

• Conditional Random Fields (CRF) (Lafferty et al., 2001): The CRF model is fully supervised with a feature template including the current, previous, and next words; part-of-speech tags; and character information such as whether a token contains digits, uppercase letters, symbols, etc. This model is trained with the aggregated crowdsourced span annotations on the training dataset.

• BiLSTM-CRF (Lample et al., 2016): The BiLSTM-CRF model is a semi-supervised method with pre-trained word2vec embeddings trained on PubMed abstracts. This model is trained with the aggregated crowdsourced span annotations on the training dataset.

• Sent2Span_X: This is our proposed method with different sentence annotation generation functions X, where X refers to the agg, major, and minor functions mentioned in Section 4.1. This method is trained only with sentence annotations, without any span annotations, and is evaluated on the test datasets using both the predicted PICO sentence classification results and the crowdsourced PICO sentence annotations.

Table 2 and Table 3 show the performance of the different methods on the EBM-NLP and PICO-data datasets. Sent2Span_minor always shows the best recall, and in most cases the best F1 score, among the Sent2Span models. Compared with the other methods, Sent2Span_minor achieves the best recall for the Population and Outcome types, and for Population it is even better than the aggregated human annotation results. It surpasses the aggregated human annotation by 10% recall on Population on both datasets, even though the crowdsourcing annotators were given annotation guidelines and examples. For Outcome, it achieves better recall than the supervised and semi-supervised methods by around 6%, even though those methods are trained with span-level annotations rather than the sentence-level annotations used by Sent2Span. Though Sent2Span does not achieve the best recall for Intervention, it beats the supervised CRF method by a large margin (0.51 vs 0.21). Sent2Span does not exhibit a performance drop when sentence-level human annotations are used for inference, and it shows better recall (e.g. Population on the PICO dataset) for all the proposed models. This indicates that: (1) Sent2Span generalises well across the datasets; and (2) Sent2Span can be directly applied for PICO span detection.

As the span detection results are inferred from the PICO sentence classification results, it is worth examining the corresponding impact. We show the results for the crowdsourced annotations and Sent2Span in Table 4 and Table 5 for both datasets. In both tables, the agg results have the best accuracy and F1 score, the major results have the best precision, and the minor results have the best recall.
Sent2Span_minor shows better recall than the crowdsourced annotations for Intervention and Outcome on the EBM-NLP dataset and for Population on the PICO dataset. This explains the equivalent and superior recall results in PICO span detection (see Table 2 and Table 3). Without losing potential PICO sentences, Sent2Span_minor selects the most PICO spans, while Sent2Span_major has the best precision with the worst recall.

To demonstrate the effectiveness of the nested span elimination algorithm, we report the percentage of eliminated candidate spans in Table 6 for both datasets. At least 8% of candidate spans are eliminated and are not passed to the fine-tuned neural language model for inference, which saves inference time. Comparing models trained with different sentence-level annotations, Sent2Span_major discards the most candidate spans, with more than 22% of candidate spans eliminated across both datasets, while Sent2Span_minor has the fewest eliminated spans. This indicates that Sent2Span_major tends to ignore more spans, whereas Sent2Span_minor builds more connections between the spans and the PICO sentence classification results, resulting in higher recall.

We perform an error analysis using the test datasets for which there are expert annotations. Error types include boundary errors (BE), overlap errors (OE), false-positive errors (FP), and false-negative errors (FN) (Table 7). The number of errors varies by type when comparing the proposed methods and the crowdsourced annotations (Table 8). Sent2Span_major gives the smallest number of BEs compared with the two other methods, at the cost of the largest number of FN errors and thus the lowest recall. All proposed methods have more OEs than the crowdsourced annotations, suggesting an area where future work would be valuable. In a post-hoc analysis of the OEs, we find that there are no nested predictions (for example, "training" nested in "progressive muscle relaxation training"). Sent2Span_minor produces the most FP errors and the fewest FN errors across all datasets, and the difference is most pronounced for the Population type. Overall, Sent2Span_minor is a high-recall, low-precision tool, and this is related to the way K (the number of selected candidate spans) is set, which introduces redundant selected spans in the final inference result. This approach is likely to be useful as part of certain pipelines where PICO extraction is used, but there is also room for further improvement in the span selection mechanism.

The results of the experiments show that our proposed method could be a useful component of a pipeline for PICO detection and extraction, in use cases where costly expert annotation is limited and where the aim is to identify all relevant examples. The results show that the approach compares favourably to existing supervised methods and achieves high recall for PICO span detection even without using any span annotations as training data. Our proposed approach is likely to be useful in a range of other application domains. Many NLP tasks across NER (Sohrab and Miwa, 2018; Xia et al., 2019) and semantic role labelling (Ouchi et al., 2018) can be formulated as span detection tasks. In cases where it is easier to acquire or estimate low-quality sentence-level annotations, and resource-intensive to acquire high-quality span-level annotations, our proposed approach may be appropriate. Several opportunities exist for future work.
For the sake of simplicity, we used the BLUE model (Peng et al., 2019) with the BERT-base-uncased structure, rather than exploring models with the BERT-large-cased structure. We also did not explore the use of other NLP tools such as part-of-speech tagging or dependency parsing. Integrating these methods into the proposed Sent2Span approach may improve performance. The Sent2Span method is designed to be part of a larger pipeline of techniques intended to support systematic review processes. Beyond PICO detection, extraction, and representation, other methods have been proposed for identifying which trials should be included in systematic review updates (Surian et al., 2018). Future work in this space could include head-to-head comparisons of approaches for rapid systematic review updating, testing for maximal completeness of evidence identification and minimal human effort.

The Sent2Span method for PICO span detection that we propose and test in this paper could be used to support new tasks in systematic review processes. The differences between Sent2Span and previous approaches to PICO detection include the use of only low-quality sentence-level annotations as training data and the results demonstrating high recall in span detection, which is an important requirement for systematic review processes.

References

Fine-tuning pre-trained transformer language models to distantly supervised relation extraction
SciBERT: A pretrained language model for scientific text
Maximum likelihood estimation of observer error-rates using the EM algorithm
BERT: Pre-training of deep bidirectional transformers for language understanding
Multi-sentence argument linking
Text level graph neural network for text classification
SpanBERT: Improving pre-training by representing and predicting spans
UMLS-based data augmentation for natural language processing of clinical research literature
Adam: A method for stochastic optimization
Conditional random fields: Probabilistic models for segmenting and labeling sequence data
Neural architectures for named entity recognition
ALBERT: A lite BERT for self-supervised learning of language representations
A study on agreement in PICO span annotations
BioBERT: A pre-trained biomedical language representation model for biomedical text mining
Improving the efficiency and effectiveness for BERT-based entity resolution
HAMNER: Headword amplified multi-span distantly supervised method for domain specific named entity recognition
RoBERTa: A robustly optimized BERT pretraining approach
End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF
Toward systematic review automation: A practical guide to using machine learning tools in research synthesis
Aggregating and predicting sequence labels from crowd annotations
A corpus with multi-level annotations of patients, interventions and outcomes to support language processing for medical literature
A span selection model for semantic role labeling
Using text mining for study identification in systematic reviews: A systematic review of current approaches
Transfer learning in biomedical natural language processing: An evaluation of BERT and ELMo on ten benchmarking datasets
Deep contextualized word representations
Improving language understanding by generative pre-training
Deep exhaustive model for nested named entity recognition
A shared latent space matrix factorisation method for recommending new trial evidence for systematic review updates
Machine learning reduced workload with minimal risk of missing studies: Development and evaluation of a randomized controlled trial classifier for Cochrane reviews
Systematic review automation technologies
Text mining approaches for dealing with the rapidly expanding literature on COVID-19
Multi-grained named entity recognition
Every document owns its structure: Inductive text classification via graph neural networks
Effective crowd-annotation of participants, interventions, and outcomes in the text of clinical trial reports

This work is supported by the National Library of Medicine, National Institutes of Health under grant No. R01LM012976.