key: cord-0569743-nzmeockz authors: Parmar, Mihir; Mishra, Swaroop; Purohit, Mirali; Luo, Man; Murad, M. Hassan; Baral, Chitta title: In-BoXBART: Get Instructions into Biomedical Multi-Task Learning date: 2022-04-15 journal: nan DOI: nan sha: fb30166c218bef3597b0d9789ad340defc3989ca doc_id: 569743 cord_uid: nzmeockz

Single-task models have proven pivotal in solving specific tasks; however, they have limitations in real-world applications where multi-tasking is necessary and domain shifts are exhibited. Recently, instructional prompts have shown significant improvement towards multi-task generalization; however, the effect of instructional prompts and Multi-Task Learning (MTL) has not been systematically studied in the biomedical domain. Motivated by this, this paper explores the impact of instructional prompts for biomedical MTL. We introduce the BoX, a collection of 32 instruction tasks for Biomedical NLP across (X) various categories. Using this meta-dataset, we propose a unified model termed In-BoXBART that can jointly learn all tasks of the BoX without any task-specific modules. To the best of our knowledge, this is the first attempt to propose a unified model in the biomedical domain and to use instructions to achieve generalization across several biomedical tasks. Experimental results indicate that the proposed model: 1) outperforms the single-task baseline by ~3% and the multi-task (without instruction) baseline by ~18% on average, and 2) shows ~23% improvement over the single-task baseline in few-shot learning (i.e., 32 instances per task) on average. Our analysis indicates that there is significant room for improvement across tasks in the BoX, implying scope for future research directions.

Task-specific models have long played a central role in achieving state-of-the-art performance in both general and biomedical NLP (Wang et al., 2021a; Banerjee et al., 2021). During 2017-2019, the pre-train and fine-tune paradigm (Liu et al., 2021) became the prevalent approach in NLP. Due to the success of Language Models (LMs) in the biomedical domain, such as BioBERT (Lee et al., 2020), ClinicalXLNET (Huang et al., 2019), and others (Alrowili and Vijay-Shanker, 2021; Kraljevic et al., 2021; Phan et al., 2021), this paradigm is widely used for creating many task-specific models (Wang et al., 2021a; Banerjee et al., 2021). However, task-specific models have limited applicability to real-world settings because this approach is computationally expensive (i.e., it requires large computational resources) and time-consuming (Strubell et al., 2019; Schwartz et al., 2020).

1 https://github.com/Mihir3009/In-BoXBART

Figure 1: Schematic representation of multi-tasking in the biomedical domain using instructional prompts. In this approach, a model is allowed to utilize tasks to get familiar with instructions and use them to map a given input to its corresponding output.
Hence, there is a need for generalization, where a single model can perform various tasks, leading to a computationally efficient approach. Past attempts have been made in general-domain NLP to achieve generalization across tasks, such as MQAN (McCann et al., 2018), UNICORN (Lourie et al., 2021), and UnifiedQA (Khashabi et al., 2020). However, approaches to achieve generalization across various biomedical NLP tasks have not been systematically studied. Hence, this paper studies a multi-tasking approach that can generalize over different biomedical NLP tasks. Figure 1 shows an overview of our proposed multi-tasking approach, in which a single model can perform various biomedical NLP tasks.

Recently, prompt-based models have been widely used because of their ability to achieve generalization, in contrast to task-specific models (Liu et al., 2021). Mishra et al. (2021b); Wei et al. (2021) and Sanh et al. (2021) show the effectiveness of instructional prompts in generalizing to seen as well as unseen general-domain NLP tasks. In this paper, we adapt this instructional prompt-based approach for the first time to achieve generalization across various biomedical NLP tasks. To this end, this paper introduces a collection of 32 instruction tasks for Biomedical NLP across (X) various categories (BoX) and proposes a unified model that can generalize over these 32 different biomedical NLP tasks. The proposed unified model (i.e., In-BoXBART) is trained on the instruction-based meta-dataset (i.e., the BoX) and evaluated individually on each task from the BoX. To evaluate the proposed approach, we compare our model (i.e., In-BoXBART) with two baselines: (1) single-task models (i.e., models trained on one task and evaluated on the same task), and (2) a multi-task model (i.e., a single model trained on a combination of all tasks) without instructions. Experimental results show that In-BoXBART outperforms the single-task baseline by ∼3% and the multi-task baseline by ∼18%. We also analyze a few-shot learning scenario using In-BoXBART, since obtaining annotated data in the biomedical domain is costly and time-consuming (Luo et al., 2022b). In the few-shot setting (i.e., 32 instances per task), In-BoXBART outperforms the single-task baseline by 23.33%. This indicates that Multi-Task Learning (MTL) and instruction-tuning have an advantage in low-resource settings. Although the performance of In-BoXBART is promising, our analysis reveals that there is still room for improvement on some tasks, implying scope for future research directions. Concisely, our contributions are three-fold:

1. This paper introduces the first benchmark meta-dataset in the biomedical domain, i.e., the BoX: a collection of 32 instruction tasks for Biomedical NLP across (X) various categories. Each task is processed in a unified format and equipped with instructions that can be used to train sequence-to-sequence models.

2. Using this meta-dataset, we propose an instruction-tuned Bidirectional and Auto-Regressive Transformer (BART) model, termed In-BoXBART. A comparison of In-BoXBART with two baselines shows that In-BoXBART outperforms the single-task baseline by ∼3% and the multi-task (without instruction) baseline by ∼18%.

3. In the few-shot setting, we show that In-BoXBART significantly outperforms the single-task baseline by ∼23%. This indicates the potential of instruction-tuning in the biomedical domain, where annotated data is difficult to obtain.
Multi-task Learning Owing to the problems associated with single-task learning in terms of space and time requirements, several multi-task learning approaches have been proposed over the years.

We use 29 existing, widely adopted biomedical NLP datasets collected from various challenges, platforms, and organizations to create the BoX. We define the BoX as a benchmark dataset for biomedical MTL across 9 different categories. Figure 2 shows the 9 categories in the BoX and the corresponding generated tasks. Each category is depicted as a colored box, and each box contains instruction tasks re-purposed from the original datasets. Table 1 shows the number of training samples used for each category. Detailed statistics for each instruction task are given in Appendix A. Each category and its corresponding tasks from the BoX are defined below:

Named Entity Recognition (NER) NER has been considered a necessary first step in processing literature for biomedical text mining, where the model helps identify named entities such as proteins, genes, chemicals, diseases, and treatments. We use fifteen publicly available biomedical NER datasets (Crichton et al., 2017) to create instruction tasks.

De-Identification (DI) In this task, the model takes the medical discharge records of a patient as input and identifies Private Health Information (PHI) such as organizations, persons, locations, and dates. We use the n2c2 2006 de-identification challenge dataset (Uzuner et al., 2007) for this task.

Part-Of-Speech (POS) Tagging The goal of this task is to identify POS tags in biomedical text. We use the GENIA corpus (Tateisi et al., 2005), built from MEDLINE abstracts, for the POS tagging task.

Question-Answering (QA) QA models receive a question and a corresponding context as input and output the relevant answer from the given context. For this task, we use the BioASQ-8b dataset (Nentidis et al., 2020) for different question types, i.e., yes/no, factoid, and list questions; we created three different tasks from this dataset. We also use the PubMedQA dataset (Jin et al., 2019) for this task.

Relation Extraction (RE) We used two datasets for this task.

Systematic Review (SR) We include data from the following five Systematic Reviews (SRs), which were conducted using the traditional (manual) process and published in relevant venues by Mayo Clinic physicians: (1) Hormone Replacement Therapy (HRT), (2) Cooking, (3) Accelerometer, (4) Acromegaly, and (5) COVID (Parmar, 2021). More details about the creation and statistics of these datasets are given in Appendix C.

Sentiment Analysis (SA) Analyzing the sentiment of people towards medical drugs is an essential task in the biomedical domain. To that end, we use the medical drug sentiment analysis dataset 2 to identify one of three sentiments: (1) positive, (2) negative, and (3) neutral.

Document Classification We use the Hallmarks of Cancer (HoC) dataset (Baker et al., 2016) for this task.

Risk Factor Identification (RFI) The goal of this task is to identify risk factors for Coronary Artery Disease (CAD) in diabetic patients over time. For this, we use the n2c2 2014 shared task track 2 dataset (Kumar et al., 2015) for two different purposes: (1) identifying whether a risk factor is present in the medical discharge summary, and (2) identifying the time at which the risk factor is present in the discharge records.

Each task in the BoX is associated with a Biomedical Instruction (BI). A BI consists of natural language instructions that describe a task and contains instances of that task.
Here, we introduce a unified schema to represent BIs and describe how we construct a BI for each task in the BoX. Figure 3 illustrates the schematic representation of the schema, and Figure 4 shows an example of a BI that describes a "Named Entity Recognition (NER)" task accompanied by a few positive examples. All BIs are mapped to the unified schema. As shown in Figure 3, the unified schema consists of a definition, a prompt, and positive examples. This schema helps organize each BI. Each element of the schema is explained below:

Definition contains the core explanation of the task and detailed instructions to the model about what needs to be done in the given task.

Prompt is a short explanation of the task that needs to be done.

Examples contain input/output pairs of task instances along with an explanation of how the output is generated. Generally, we provide 2-3 examples for each task.

Instances contain the input/output pairs of training samples from the task datasets.

We created a BI for each dataset in the BoX. To create a BI, we manually fill in the fields of the unified instruction schema (Figure 3). For each dataset, the BI was created by one author and verified by the other authors. In the instruction verification process, we edit BIs where needed in terms of grammar, typos, ambiguity, etc., to improve their quality. According to Beltagy et al. (2020), concise instructions are more beneficial than repetition; hence, we also remove repetition from BIs. As a result, our BIs consist of high-quality, short, and meaningful task definitions and prompts.

Positive examples and their explanations For each dataset, we provide 2-3 positive examples and corresponding explanations to give an idea of how to perform the given task. Since the selection of examples has an impact on model performance (Lu et al., 2021), we have been careful in selecting examples for text generation and classification tasks. For text generation, we provide 2-3 examples with a detailed explanation of how the output is generated. For text classification tasks, we include examples corresponding to each class with an explanation of why the particular class is assigned to a given input instance. All positive examples are drawn from training instances and are removed from training in order to avoid repetition. All example explanations pass through the verification process to maintain high quality.

Collection of input/output instances Since each biomedical NLP dataset included in the BoX has its own annotated input/output instances, we converted them into a text-to-text format (Lourie et al., 2021). An example of converted instances for each task is given in Appendix B. After this, we appended each instance tuple to the instruction schema (as shown in Figure 3).
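To make the schema and its linearization concrete, the sketch below shows one possible way to represent a BI and turn it, together with an input instance, into a single text-to-text source string, roughly following the definition/prompt/examples layout of Figure 3. The field names, separator strings, the encode_bi helper, and the NER-style example are illustrative assumptions, not the exact format used in the BoX.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class BiomedicalInstruction:
    """Illustrative container for one BI; field names are assumptions."""
    definition: str                      # core explanation of the task
    prompt: str                          # short statement of the task
    positive_examples: List[Tuple[str, str, str]] = field(default_factory=list)
    # each example: (input, output, explanation)


def encode_bi(bi: BiomedicalInstruction, instance_input: str) -> str:
    """Map a BI and one input instance to a single source string, i.e. enc(BI_t, x)."""
    parts = [f"Definition: {bi.definition}", f"Prompt: {bi.prompt}"]
    for i, (ex_in, ex_out, ex_expl) in enumerate(bi.positive_examples, 1):
        parts.append(
            f"Example {i}- Input: {ex_in} Output: {ex_out} Explanation: {ex_expl}"
        )
    parts.append(f"Input: {instance_input} Output:")
    return " ".join(parts)


# Hypothetical NER-style instruction, for illustration only.
ner_bi = BiomedicalInstruction(
    definition="From the given input, recognize all disease and chemical named entities.",
    prompt="Recognize disease and chemical entities.",
    positive_examples=[(
        "Aspirin reduces the risk of stroke.",
        "aspirin, stroke",
        "'aspirin' is a chemical mention and 'stroke' is a disease mention.",
    )],
)
source_text = encode_bi(ner_bi, "Metformin is used to treat type 2 diabetes.")
target_text = "metformin, type 2 diabetes"   # the instance's gold output
```

In this sketch, the target side of the text-to-text pair is simply the instance's gold output string, so the same sequence-to-sequence model can be trained on every task in the BoX.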
Let us assume we have input/output instance pairs (X_t, Y_t) for a given task t. In addition, each task is described in terms of its instruction BI_t.

Single-task models Traditional supervised models learn a mapping function f_M between an input x and an output y, where (x, y) ∈ (X_t^train, Y_t^train), and are evaluated on the same task (X_t^test, Y_t^test). We refer to this setup as single-task learning.

Multi-task models In this setup, we combine the training data and the corresponding biomedical instructions of all tasks. The goal of multi-task learning models is to learn a mapping function f_M between the input x, the output y, and the biomedical instruction BI_t. In contrast to single-task models, a single model is used here to solve various tasks, hence achieving generalization. We refer to this setup as MTL.

We propose an instruction-based model to achieve multi-tasking and compare it with two baselines: (1) single-task models, and (2) multi-task models without instructions. We fine-tune the BART (base) model (Lewis et al., 2019) to build the baselines as well as the proposed model.

Single-task models As formulated in the single-task problem setup, we train the BART model on each task from the BoX and evaluate it on the same task.

Multi-task without instruction To build this baseline, we combine the training data of all tasks from the BoX without appending BIs and train a single model on the combined data. We refer to this model as Vanilla-BoXBART. This model is evaluated on each task of the BoX.

In-BoXBART As formulated in the multi-task problem setup, we combine the training data and the corresponding BI of each task. To combine an instruction with input instances, we map a BI and an input x into textual format and obtain enc(BI_t, x). The BART model is then used to predict an output y using a mapping function f_M: enc(BI_t, x) → y. To perform the encoding, the standard NLP paradigm of mapping an input to text is used. Here, we map each element of the BI (i.e., the definition and positive examples, as shown in the schema) to a textual format and append it before the input instance. After appending the BI of each task to its instances, we combine the training data of all tasks. We then fine-tune the BART model on this combined instruction meta-dataset and refer to this instruction-tuned model as In-BoXBART.

We use the BART (base) model to build all baselines and the proposed model. All experiments are performed using a Quadro RTX 8000 GPU. All models are trained for 3 epochs. In particular, we use the Hugging Face implementation (Wolf et al., 2019) of BART and its pre-defined functions for training and evaluation with default parameters.

Instance Selection BART (base) can accept inputs of at most 1024 tokens. Since a few instances in some datasets exceed this limit (after including instructions), we discarded those instances while creating the instruction tasks. We also removed the same instances when training the two baselines, for a fair comparison, and discarded long samples (>1024 tokens) from the validation and test data as well.
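The sketch below illustrates this training recipe: filter out encoded instances longer than 1024 tokens, then fine-tune bart-base on the (instruction + input) → output pairs for 3 epochs. The paper only states that the Hugging Face implementation with default parameters was used; the plain PyTorch loop, batch size, learning rate, and the placeholder meta_dataset below are assumptions made for illustration.

```python
import torch
from torch.utils.data import DataLoader
from transformers import BartTokenizerFast, BartForConditionalGeneration

tokenizer = BartTokenizerFast.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

# Placeholder meta-dataset: (source, target) pairs, i.e. (enc(BI_t, x), y),
# pooled over all tasks of the BoX. In practice this is built from the BIs.
meta_dataset = [
    ("Definition: Recognize all disease and chemical entities. "
     "Input: Metformin is used to treat type 2 diabetes. Output:",
     "metformin, type 2 diabetes"),
]

# Instance selection: drop encoded instances that exceed BART's 1024-token limit.
kept = [(s, t) for s, t in meta_dataset
        if len(tokenizer(s)["input_ids"]) <= 1024]

def collate(batch):
    sources, targets = zip(*batch)
    enc = tokenizer(list(sources), padding=True, truncation=True,
                    max_length=1024, return_tensors="pt")
    labels = tokenizer(list(targets), padding=True, truncation=True,
                       max_length=256, return_tensors="pt")["input_ids"]
    labels[labels == tokenizer.pad_token_id] = -100  # ignore padding in the loss
    enc["labels"] = labels
    return enc

loader = DataLoader(kept, batch_size=8, shuffle=True, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)  # assumed default

model.train()
for epoch in range(3):                      # the paper trains for 3 epochs
    for batch in loader:
        loss = model(**batch).loss          # standard seq2seq cross-entropy
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```

The single-task and Vanilla-BoXBART baselines follow the same loop; only the training pool changes (one task's data, or all tasks' data without the prepended BI text).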
Example Selection As discussed in Lu et al. (2021), the selection and order of the examples included in instructions matter, mainly for classification tasks, and affect the performance of the model. We empirically find that the proposed model benefits from omitting examples from biomedical instructions for classification tasks during training and evaluation. Hence, we discard all examples from the BIs associated with classification instruction tasks.

Instance Sampling Some classification datasets used to create the BoX are imbalanced. To balance these datasets, we apply sampling techniques (Poolsawad et al., 2014) before using the datasets to create the BoX. In particular, we analyze three sampling techniques: (1) under-sampling, (2) average-sampling, and (3) over-sampling. In under-sampling, we reduce the number of instances in all classes to that of the class with the fewest instances. In over-sampling, we replicate randomly chosen instances so that all classes match the class with the highest number of instances. In average-sampling, we compute the mean number of instances across all classes and over- or under-sample each class accordingly.

Few-shot setting Similar to Schick and Schütze (2020), we start with 32 randomly selected instances per instruction task from the BoX to exhibit few-shot learning. We then increase the number of randomly selected instances per task to 100/1k/4k. If a task already has fewer instances than the threshold (i.e., 100/1k/4k), we keep all of its instances. While selecting instances, we made sure to select balanced data for the classification tasks. For reference, the BoX contains on average 6k instances per task.

Evaluation Metric We use Rouge-L (Lin, 2004) as our evaluation metric since we treat all tasks as text generation problems. We also use the F1-Score for evaluation.

Effect of Sampling As mentioned above, we conduct three experiments to analyze the effect of sampling on In-BoXBART, training our model on data obtained from (1) under-sampling, (2) average-sampling, and (3) over-sampling. We achieve, on average across all instruction tasks, 69.62, 70.23, and 73.49 Rouge-L for under-, average-, and over-sampling, respectively. Over-sampling gives better performance than under- and average-sampling, since the latter two lose training samples. Hence, we report the over-sampling results as the main results in Table 2.

Table 2 presents the results for the single-task model, Vanilla-BoXBART, and In-BoXBART.

Table 2: Results comparison between the single-task baseline, Vanilla-BoXBART, and In-BoXBART in terms of Rouge-L and F1-Score. All F1-Score results are presented in %. V-BB: Vanilla-BoXBART, I-BB: In-BoXBART, RFHD: Risk Factor for Heart Disease.

The F1-Score results exhibit the same performance behaviour as Rouge-L; hence, we use Rouge-L for further comparisons. From the results, we observe that Vanilla-BoXBART reduces complexity compared to the single-task models (i.e., 110 million parameters vs. 32x110 million parameters); however, on average its performance drops by 14.96% in terms of Rouge-L compared to the single-task models. This indicates that multi-task learning in the biomedical domain is more difficult than in general-domain NLP, where many previous works have shown that multi-task models outperform single-task models (Lourie et al., 2021; McCann et al., 2018). On the other hand, In-BoXBART, which has the same complexity as Vanilla-BoXBART, significantly outperforms Vanilla-BoXBART by 17.94% on average, and also outperforms the single-task model by a margin of 2.98%. This indicates the benefit of using instructions to achieve MTL in the biomedical domain.
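For reference, the Rouge-L numbers above are obtained by treating every task as text generation and comparing the generated string with the gold reference. The sketch below uses the rouge-score package; the paper does not name its Rouge implementation, and the example strings are hypothetical, so both are assumptions.

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

# Hypothetical reference/prediction pair from a QA-style task.
reference = "stem cell therapy"
prediction = "stem cell therapy is a promising approach"

score = scorer.score(reference, prediction)["rougeL"]
print(f"Rouge-L F-measure: {score.fmeasure:.4f}")

# Per-task scores are averaged over that task's test instances; the averages
# reported in the text are then taken across the 32 tasks of the BoX.
```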
We compared the average Rouge-L of In-BoXBART with the single-task baseline in the few-shot setting. Figure 5 shows the relative performance of In-BoXBART compared to the single-task baseline; results for all few-shot learning experiments are given in Appendix D. From the results, we see that In-BoXBART achieves on average 60.64% Rouge-L while the single-task model achieves 37.31% with 32 instances per task, i.e., In-BoXBART significantly outperforms the single-task baseline by 23.33%. From Figure 5, we can see that In-BoXBART consistently performs better than the baseline. Obtaining a large annotated dataset in the biomedical domain is difficult, time-consuming, and costly; the few-shot learning results show that instructions are beneficial in achieving high performance compared to task-specific models.

For which tasks are instructions helpful? From Table 2, we can see that In-BoXBART outperforms the baselines for 5 categories, i.e., NER, de-identification, RE, SR, and risk factor identification, indicating that instructions are more helpful in these five categories. However, In-BoXBART achieves performance lower than or on par with the single-task baseline for tasks from QA, POS tagging, sentiment analysis, and document classification, which indicates room for improvement in this direction.

Which tasks are harder to solve using instructions? Although instructions help achieve better performance on some tasks compared to the single-task model, the overall performance on them is still low. For example, instructions improve performance for de-identification, but the overall performance on this task is only 50.82%, which can be improved. We see a similar pattern for BioNLP12CG and CRAFT from NER; BioASQ-8b (factoid, list) and PubMedQA from QA; and Medical Drug from the sentiment analysis category. In general, we observe that tasks involving either a multi-class scenario or answer generation from context are most likely to be harder to solve using instructions. For example, CRAFT and BioNLP13CG have 6 entity types, more than any other NER task, and the performance on these two tasks is lower than on the other NER tasks.

For which tasks are instructions most beneficial in the few-shot setting? From the results in Appendix D, tasks from NER, de-identification, QA, sentiment analysis, and risk factor identification show on average larger improvements over the baselines in the few-shot settings (i.e., 32 and 100 instances per task). This indicates that instructions are beneficial for tasks from these categories.

Can we design better instructions? Since instructions teach the model how to solve a given task, domain-specific, information-rich instructions can improve model performance. One potential way is to use the knowledge of domain experts. Designing good biomedical instructions can be a research direction in itself.

How to handle long-context input? Training instances of many biomedical datasets consist of Electronic Health Records (EHRs) or patient discharge summaries. Because of this, these instances are long and exceed the maximum input length of LMs such as BERT and BART. In this scenario, encoding extra information in the form of prompts or instructions becomes difficult. One potential solution is to use Longformer (Beltagy et al., 2020); another is to use T5-style models, whose relative position embeddings allow longer inference lengths (Luo et al., 2022a).

How to handle multi-class classification tasks? Multiple classes cause an issue when creating biomedical instructions because we cannot present one example per class; doing so makes the encoding of the BI and input exceed the maximum input length of LMs. A naive solution is to select examples for only a few labels or to remove the examples altogether, but this causes label bias or performance degradation. Designing a methodology to handle multi-class classification tasks is a potential direction for future research.
How far are we from the SOTA? We present a preliminary comparison of our results against state-of-the-art (SOTA) single-task systems for 21 instruction tasks from the BoX, as shown in Appendix E. From the results, we can see that the performance of the proposed model remains far from the SOTA on some tasks, indicating significant room for further research in this domain.

This research shows the impact of instructions on MTL for the first time in the biomedical domain. To this end, we introduced the BoX, a first benchmark dataset consisting of 32 instruction tasks across various biomedical NLP domains. Using this meta-dataset, we proposed a unified model, In-BoXBART, which outperforms the single-task baseline and Vanilla-BoXBART by ∼3% and ∼18%, respectively. Our proposed approach also performs effectively in the few-shot setting, which is particularly beneficial in the biomedical domain, where obtaining large annotated datasets is difficult. We hope that the BoX benchmark, In-BoXBART, and our experimental results encourage future research into more unified models for biomedical NLP.

This section provides the statistics of the training, validation, and inference data used for the experiments in Table 3. All instance counts in Table 3 are calculated after discarding instances longer than 1024 tokens, as described in Section 5.1. We divided each dataset into standard 70/10/20 train/validation/test splits when no separate validation and test sets were provided with the dataset.

To build all the models (baselines, the proposed model, and the few-shot learning models), we adopt the unified format for all tasks of the BoX. We converted all tasks into the text-to-text format, including the classification tasks. Table 4 shows an example input and output from each category. Moreover, we also re-purposed some biomedical datasets to create more than one task, as described in Section 3.1.

This section briefly describes the data creation process for the Systematic Reviews (SRs) used in this study. The relentless growth in clinical research and published articles has created a need for automation to expedite the SR process and to enable Living Systematic Reviews (LSRs). A crucial step in both SRs and LSRs is the title- and abstract-based screening of articles. A new dataset was developed from six SRs in the clinical domain by Mayo Clinic physicians. In this study, we used data from the following five SRs, which were conducted using the traditional (manual) process and published in relevant venues: (1) Hormone Replacement Therapy (HRT), (2) Cooking, (3) Accelerometer, (4) Acromegaly, and (5) COVID. The initial bibliographic search was designed and conducted by an experienced librarian with guidance from the principal investigators of the respective studies. The search was conducted in different bibliographic databases such as PubMed, PubMed Central (PMC), Embase, EBM Reviews, and Ovid MEDLINE(R). Each article in the bibliographic search results was categorized as "Include" or "Exclude" by two physicians with domain expertise, based on the title and abstract of the article. When there was a disagreement between the two annotators, the positive class (i.e., "Include") was preferred.

Table 6: The state-of-the-art (SOTA) results for each task compared with Vanilla-BoXBART and In-BoXBART. All results are in %. F: F1-score, V-BB: Vanilla-BoXBART, I-BB: In-BoXBART, RFHD: Risk Factor for Heart Disease.
Muppet: Massive multi-task representations with pre-finetuning
BioM-Transformers: Building large biomedical language models with BERT, ALBERT and ELECTRA
Automatic semantic classification of scientific literature according to the hallmarks of cancer
Biomedical named entity recognition via knowledge guidance and question answering
Longformer: The long-document transformer
A neural network multi-task learning approach to biomedical named entity recognition
The Turking Test: Can language models understand instructions?

The authors acknowledge support from ONR award number N00014-20-1-2332 for this project. The authors thank the anonymous reviewers for their feedback.

This section presents the results of few-shot learning for all instruction tasks in Table 5. In Table 6, we present state-of-the-art (SOTA) results for 21 tasks. To compare the SOTA results with the proposed model, we compute, from our model predictions, the metric used in the corresponding work. For each task, we gather the best performance, and specifically, they are BioASQ-8b (Nentidis et al., 2020)