key: cord-0545095-5r23exvg
title: Synthetic Target Domain Supervision for Open Retrieval QA
authors: Reddy, Revanth Gangi; Iyer, Bhavani; Sultan, Md Arafat; Zhang, Rong; Sil, Avirup; Castelli, Vittorio; Florian, Radu; Roukos, Salim
date: 2022-04-20
journal: nan
DOI: nan
sha: 2749ab9e22d936ee55b58ee78eced106fcd138fc
doc_id: 545095
cord_uid: 5r23exvg

Neural passage retrieval is a new and promising approach in open retrieval question answering. In this work, we stress-test the Dense Passage Retriever (DPR), a state-of-the-art (SOTA) open domain neural retrieval model, on closed and specialized target domains such as COVID-19, and find that it lags behind standard BM25 in this important real-world setting. To make DPR more robust under domain shift, we explore fine-tuning it with synthetic training examples, which we generate from unlabeled target domain text using a text-to-text generator. In our experiments, this noisy but fully automated target domain supervision gives DPR a sizable advantage over BM25 in out-of-domain settings, making it a more viable model in practice. Finally, an ensemble of BM25 and our improved DPR model yields the best results, further pushing the SOTA for open retrieval QA on multiple out-of-domain test sets.

Open retrieval question answering (ORQA) finds a short answer to a natural language question in a large document collection [4, 9, 26]. Most ORQA systems employ (i) an information retrieval (IR) component that retrieves relevant passages from the given corpus [14, 26, 35] and (ii) a machine reading comprehension (MRC) component that extracts the final short answer from a retrieved passage [2, 29, 33]. Recent work on ORQA by Karpukhin et al. [21] shows that distant supervision for neural passage retrieval can be derived from annotated MRC data, yielding a superior approach [17, 28] to classical term matching methods like BM25 [9, 34]. Concurrent advances in tools like FAISS [19] that support efficient similarity search in dense vector spaces have also made this approach practical: when queried on an index with 21 million passages, FAISS processes 995 questions per second (qps), whereas BM25 processes 23.7 qps per CPU thread in a similar setting [21].

Crucially, all training and test instances for the Dense Passage Retrieval (DPR) model in [21] were derived from open domain Wikipedia articles. This is a rather limited experimental setting, as many real-world ORQA use cases involve distant target domains with highly specialized content and terminology, for which there is no labeled data. On COVID-19, for example, a large body of scientific text is available [38], but practically no annotated QA data for model supervision. In this paper, we closely examine neural IR (DPR, to be specific) in out-of-domain ORQA settings, where we find that its advantage over BM25 diminishes or disappears altogether in the absence of target domain supervision.
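The practicality of dense retrieval rests on maximum inner product search over a precomputed passage index, as in the FAISS throughput figures quoted above. The following is a minimal sketch of that pattern, not code from this paper; the 768-dimensional vectors and the random data are placeholder assumptions standing in for an actual passage encoder.

```python
import numpy as np
import faiss

dim = 768                                                      # assumed encoder output size
passage_vecs = np.random.rand(10_000, dim).astype("float32")   # stand-in for encoded passages
index = faiss.IndexFlatIP(dim)                                 # exact maximum inner product search
index.add(passage_vecs)                                        # built once, offline

query_vecs = np.random.rand(4, dim).astype("float32")          # stand-in for encoded questions
scores, ids = index.search(query_vecs, 100)                    # top-100 passage ids per question
```

Because every passage is reduced to a single vector, the index is built once and each incoming question costs only one encoding pass and one similarity search, which is what makes the qps figures above attainable.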
Domain adaptation is an active area of investigation in supervised learning; existing techniques for different target scenarios include instance weighting [18], training data selection using reinforcement learning [30], and transfer learning from open domain datasets [39]. For pre-trained language models, fine-tuning on unlabeled target domain text has also been found to be a useful intermediate step [13, 40], e.g., with scientific [7] and biomedical text [3, 25]. To address the performance degradation of DPR in low-resource out-of-domain settings, we explore another approach: fine-tuning with synthetic examples produced by a pre-trained generative model [1, 11, 12, 36]. For target domain supervision, we rely on only such synthetic examples.

The contributions of this paper are as follows:
• We stress-test DPR, a SOTA open domain neural retriever, on closed and specialized target domains such as COVID-19, and find that it lags behind BM25 in the absence of target domain supervision.
• We generate synthetic training examples from unlabeled target domain text with a text-to-text generator and use them to adapt both the retriever and the reader, giving DPR a sizable advantage over BM25 out of domain.
• An ensemble of BM25 and our adapted DPR model yields the best results, further pushing the SOTA for open retrieval QA on multiple out-of-domain test sets.

This section describes our methods for generating synthetic examples in the target domain and their application to both IR and MRC to construct the final ORQA pipeline.

Let (p, q, a) be an MRC example comprising a passage p, a question q, and its short answer a in p. Let s be the sentence in p that contains the answer a. In what follows, we train an example generator G to produce the triple (q, s, a) given p. The answer sentence s is subsequently used to locate a in p, as a short answer text (e.g., a named entity) can generally occur more than once in a passage. To train the generator, we fine-tune BART [27], a pre-trained denoising sequence-to-sequence generation model, with MRC examples from open domain datasets like SQuAD [33]. The generator G with parameters θ learns to maximize the conditional joint probability P(q, s, a | p; θ). In practice, we (i) only output the first (s_first) and the last (s_last) word of s instead of the entire sentence, for efficiency, and (ii) use special separator tokens to mark the three items in the generated triple. Given a target domain passage p at inference time, an ordered sequence containing q, a, and the sentence boundary words s_first and s_last, delimited by the separator tokens, is sampled from G using top-p (nucleus) sampling [15], which has been shown to yield better training examples than greedy or beam search decoding due to greater sample diversity [36]. From this generated sequence, we create positive synthetic training examples for both passage retrieval, (q, p), and MRC, (q, p, a), where s_first and s_last are used to locate a in p. Table 1 shows two examples generated by our generator from a passage in the CORD-19 collection [38].

As stated before, we use DPR [21] as our base retrieval model. While other competitive methods such as ColBERT [22] exist, DPR offers a number of advantages in real-time settings as well as in our target scenario, where retrieval is only one component in a larger ORQA pipeline. For example, by compressing each passage down to a single vector representation, DPR can operate with significantly less memory. It is also a faster model for several reasons, including not having a separate re-ranking module. For target domain supervision of DPR, we fine-tune its off-the-shelf open domain instance with synthetic examples. At each iteration, a set of questions is randomly sampled from the generated dataset. Following Karpukhin et al. [21], we also use in-batch negatives for training; we refer the reader to their article for details on DPR supervision. We call this final model the Adapted DPR model.
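To make the generation step above concrete, here is a minimal sketch of sampling synthetic examples from a fine-tuned BART generator with nucleus (top-p) sampling. The checkpoint name "synthetic-gen-bart", the "<sep>" delimiter, and the output order (question, answer, then the answer sentence boundary words) are illustrative assumptions; the paper's generator may format its output differently.

```python
import torch
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("synthetic-gen-bart")             # hypothetical checkpoint
model = BartForConditionalGeneration.from_pretrained("synthetic-gen-bart")  # fine-tuned on open domain MRC
model.eval()

def generate_examples(passage: str, num_samples: int = 5, top_p: float = 0.95):
    inputs = tokenizer(passage, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            do_sample=True,              # nucleus sampling for sample diversity
            top_p=top_p,
            max_length=64,
            num_return_sequences=num_samples,
        )
    examples = []
    for text in tokenizer.batch_decode(outputs, skip_special_tokens=True):
        parts = [p.strip() for p in text.split("<sep>")]   # assumed: question <sep> answer <sep> s_first s_last
        if len(parts) != 3:
            continue
        question, answer, boundary = parts
        words = boundary.split()
        if len(words) < 2:
            continue
        s_first, s_last = words[0], words[-1]
        # Locate the answer sentence via its first/last words, then require the answer inside it.
        start = passage.find(s_first)
        end = passage.find(s_last, start)
        if start == -1 or end == -1 or answer not in passage[start:end + len(s_last)]:
            continue
        examples.append({"question": question, "passage": passage, "answer": answer})
    return examples
```

Anchoring the answer through the boundary words of its sentence is what allows an answer string that occurs several times in a passage to be grounded in the intended occurrence.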
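The in-batch negative objective used to adapt DPR can be written compactly: within a training batch of synthetic (question, positive passage) pairs, every other passage in the batch serves as a negative for a given question. A minimal PyTorch sketch, assuming the question and passage encoders already produce fixed-size vectors:

```python
import torch
import torch.nn.functional as F

def in_batch_negative_loss(q_vecs: torch.Tensor, p_vecs: torch.Tensor) -> torch.Tensor:
    """q_vecs, p_vecs: [batch, dim]; row i of p_vecs is the positive passage for question i."""
    scores = q_vecs @ p_vecs.T                                        # [batch, batch] dot-product similarities
    targets = torch.arange(scores.size(0), device=scores.device)      # diagonal entries are the positives
    return F.cross_entropy(scores, targets)                           # softmax over all passages in the batch
```

With a batch of size B, each question is contrasted against B-1 negatives at no extra encoding cost, which is what makes this form of supervision efficient.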
For MRC, we adopt the now standard approach of Devlin et al. [10] that (i) starts from a pre-trained transformer language model (LM), (ii) adds two pointer networks atop the final transformer layer to predict the start and end positions of the answer phrase, and (iii) fine-tunes the entire network with annotated MRC examples. We choose RoBERTa [31] as our base LM. Given our out-of-domain target setting, we fine-tune it in two stages. First, the RoBERTa LM is fine-tuned on unlabeled target domain documents, which is known to be a useful intermediate fine-tuning step [13]. This target domain model is then further fine-tuned for MRC, using both human-annotated open domain MRC examples and target domain synthetic examples, as detailed in Section 3. Additionally, we denoise the synthetic training examples using a roundtrip consistency filter [1]: an example is filtered out if its candidate answer score, obtained using an MRC model trained on SQuAD 2.0 and NQ, is lower than a threshold τ (tuned on a validation set).

Using the described retrieval and MRC components, we construct our final ORQA system, which executes a four-step process at inference time. First, only the k highest scoring passages returned by IR for the input question are retained (k tuned on a validation set). Second, each passage is passed along with the question to the MRC component, which returns the respective top answer and its MRC score. At this point, each answer has two scores associated with it: its MRC score and the IR score of its passage. In the third step, these two scores are normalized using the Frobenius norm and combined using a convex combination, whose weight is tuned on a validation set. Finally, the answer with the highest combined score is returned.
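The roundtrip consistency filter described above amounts to a simple thresholding step over the synthetic examples. A minimal sketch, where mrc_score is a placeholder for the MRC model trained on SQuAD 2.0 and NQ, and tau is the validation-tuned threshold:

```python
def roundtrip_filter(examples, mrc_score, tau):
    """examples: iterable of dicts with 'question', 'passage', 'answer' keys.
    mrc_score(question, passage, answer) -> float, from the open domain MRC model."""
    kept = []
    for ex in examples:
        score = mrc_score(ex["question"], ex["passage"], ex["answer"])
        if score >= tau:                  # keep only examples the open domain reader can answer confidently
            kept.append(ex)
    return kept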
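The four-step inference procedure can likewise be sketched in a few lines: retrieve the top-k passages, read each with the MRC model, normalize the two score lists by their Frobenius (L2) norms, and return the answer with the best convex combination. The retriever and reader callables are placeholders, and the defaults for k and the IR weight reflect the dev-tuned values reported in the experiments section.

```python
import numpy as np

def answer_question(question, retriever, reader, k=40, ir_weight=0.7):
    passages, ir_scores = retriever(question, top_k=k)                        # step 1: keep top-k passages
    answers, mrc_scores = zip(*(reader(question, p) for p in passages))       # step 2: one answer per passage

    ir = np.array(ir_scores) / np.linalg.norm(ir_scores)                      # step 3: normalize both score lists
    mrc = np.array(mrc_scores) / np.linalg.norm(mrc_scores)
    combined = ir_weight * ir + (1.0 - ir_weight) * mrc                       # convex combination of the two

    return answers[int(np.argmax(combined))]                                  # step 4: highest combined score wins
```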
We evaluate the proposed systems on out-of-domain retrieval, MRC, and end-to-end ORQA against SOTA open domain baselines. We select COVID-19 as our primary target domain, an area of critical interest at the time of writing. As our retrieval corpus, we use 74,059 full-text PDFs from the June 22, 2020 version of the CORD-19 [38] document collection on SARS-CoV-2 and related coronaviruses. Each document is split into passages that (a) contain no more than 120 words and (b) align with sentence boundaries, yielding around 3.5 million passages.

We utilize three existing datasets for COVID-19 target domain evaluation. The first, used to evaluate retrieval and MRC results separately as well as end-to-end ORQA performance, is COVID-QA-2019 [32], a dataset of question-passage-answer triples created from COVID-19 scientific articles by volunteer biomedical experts. The second, Open-COVID-QA-2019, is an open retrieval version of COVID-QA-2019 whose questions must be answered from the full CORD-19 corpus; we use its Dev and Test splits. Finally, COVID-QA-111 [24] contains queries gathered from different sources, e.g., Kaggle and the FAQ sections of the CDC and the WHO. It has 111 question-answer pairs with 53 interrogative and 58 keyword-style queries. Since questions are not aligned to passages in this dataset, we use it only to evaluate IR and ORQA.

We use the DPR-Multi system from [21] as our primary neural IR baseline. DPR-Multi comes pre-trained on open retrieval versions of several MRC datasets: Natural Questions (NQ) [23], WebQuestions [8], CuratedTrec [6] and TriviaQA [20]. We fine-tune it for six epochs with COVID-19 synthetic examples to train our Adapted DPR model (lr=1e-5, batch size=128). We also evaluate the Inverse Cloze Task (ICT) method [26] as a second neural baseline, which masks out a sentence at random from a passage and uses it as a query to create a query-passage training pair. We use ICT to fine-tune DPR-Multi on the CORD-19 passages of Section 3.2, which also makes it a synthetic domain adaptation baseline. Finally, for each neural IR model, we also evaluate its ensemble with BM25, which computes a convex combination of normalized BM25 and neural scores. The weight for BM25 in this combination is 0.3 (tuned on the Open-COVID-QA-2019 Dev set).

Our baseline MRC model is based on a pre-trained RoBERTa-Large LM and is fine-tuned for three epochs on SQuAD 2.0 and then for one epoch on NQ. It achieves a short answer EM of 59.4 on the NQ dev set, which is competitive with the numbers reported in [29]. For target domain training, we first fine-tune a RoBERTa-Large LM on approximately 1.

We evaluate IR using Match@k, for k ∈ {20, 40, 100} [21]. For MRC, we use the standard Exact Match (EM) and F1 scores. Finally, end-to-end ORQA accuracy is measured using Top-1 and Top-5 F1. We first report results separately for IR and MRC, and then evaluate ORQA pipelines that must find a short answer to the input question in the CORD-19 collection. Reported numbers for all trained models are averages over three random seeds.

Table 2 shows the performance of different IR systems on Open-COVID-QA-2019 and COVID-QA-111. BM25 demonstrates robust results relative to the neural baselines. While DPR-Multi is competitive with BM25 on COVID-QA-111, it is considerably behind on the larger Open-COVID-QA-2019. ICT improves over DPR-Multi, indicating that even weak target domain supervision is useful. The proposed Adapted DPR system achieves the best single-system results on both datasets, with more than 100% improvement over DPR-Multi on the Open-COVID-QA-2019 Test set. Finally, ensembling BM25 with the neural approaches yields the best results. The BM25+Adapted DPR ensemble is the top system across the board, with a margin of at least 14 points over the best baseline on the Open-COVID-QA-2019 Test set (all metrics), and 8 points on COVID-QA-111.

Upon closer examination, we find that BM25 and Adapted DPR retrieve very different passages. For Open-COVID-QA-2019, for example, only 5 passages on average are shared between the top 100 retrieved by the two systems. This diversity in retrieval results explains why the two complement each other well in an ensemble, leading to improved IR performance.

Table 3: End-to-end F1 scores achieved by different open retrieval QA systems. The best system (last row) utilizes target domain synthetic training examples for both IR and MRC supervision.

Using different pairings of the above IR and MRC systems, we build several ORQA pipelines. Each computes a convex combination of its component IR and MRC scores after normalization, with the IR weight being 0.7 (tuned on Open-COVID-QA-2019 Dev). We observe that retrieving k=100 passages is optimal when IR is BM25 only, while k=40 works best for BM25+Neural IR. Table 3 shows end-to-end F1 scores of the different ORQA pipelines. Adapted MRC refers to the best system of Section 4.2 (Table 4, Row 3). Crucially, the best system in Table 3 (last row) uses synthetic target domain supervision for both IR and MRC. In a paired t-test [16] on the Top-5 F1 scores, we find the differences with the baseline (Row 1) to be statistically significant at p < 0.01.

To investigate whether our synthetically fine-tuned COVID-19 models can also improve performance in a related target domain, we evaluate them zero-shot on the BioASQ [5] task. BioASQ contains biomedical questions with answers in PubMed abstracts. For evaluation, we use the factoid questions from the Task 8b training and test sets, totaling 1,092 test questions. As our retrieval corpus, we use around 15M abstracts from Task 8a, which we pre-process as described in Section 3.1 to end up with around 37.4M passages.
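Both the CORD-19 and BioASQ retrieval evaluations report Match@k. A minimal sketch of how such a metric can be computed, assuming simple case-insensitive substring matching (the exact matching rules used in the evaluation may differ):

```python
def match_at_k(retrieved_passages, gold_answers, k):
    """retrieved_passages: ranked list of passages per question;
    gold_answers: list of acceptable answer strings per question."""
    hits = 0
    for passages, answers in zip(retrieved_passages, gold_answers):
        top_k = passages[:k]
        # A question counts as a hit if any top-k passage contains a gold answer.
        if any(ans.lower() in p.lower() for ans in answers for p in top_k):
            hits += 1
    return hits / len(retrieved_passages)
```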
Table 5: IR results on BioASQ Task 8b factoid questions.

Table 5 shows the BioASQ retrieval results, where the proposed Adapted DPR model outperforms both baselines. Table 6 summarizes the evaluation on the end-to-end ORQA task, where we see similar gains from synthetic training. These results show that synthetic training on the CORD-19 articles transfers well to the broader related domain of biomedical QA.

Low-resource target domains can present significant challenges for supervised language processing systems. In this paper, we show that synthetically generated target domain examples can support strong domain adaptation of neural open retrieval QA models, which can further generalize to related target domains. Crucially, we assume zero labeled data in the target domain and rely only on open domain MRC annotations to train our generator. Future work will explore semi-supervised and active learning approaches to examine whether further improvements are possible with a small amount of target domain annotations.

REFERENCES
[1] Synthetic QA Corpora Generation with Roundtrip Consistency
[2] A BERT Baseline for the Natural Questions
[3] Publicly Available Clinical BERT Embeddings
[4] XOR QA: Cross-lingual Open-Retrieval Question Answering
[5] BioASQ: A Challenge on Large-Scale Biomedical Semantic Indexing and Question Answering
[6] Modeling of the Question Answering Task in the YodaQA System
[7] SciBERT: A Pretrained Language Model for Scientific Text
[8] Semantic Parsing on Freebase from Question-Answer Pairs
[9] Reading Wikipedia to Answer Open-Domain Questions
[10] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
[11] Simple and Effective Semi-Supervised Question Answering
[12] Unified Language Model Pre-training for Natural Language Understanding and Generation
[13] Don't Stop Pretraining: Adapt Language Models to Domains and Tasks
[14] REALM: Retrieval-Augmented Language Model Pre-training
[15] The Curious Case of Neural Text Degeneration
[17] Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering
[18] Instance Weighting for Domain Adaptation in NLP
[19] Billion-Scale Similarity Search with GPUs
[20] TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension
[21] Dense Passage Retrieval for Open-Domain Question Answering
[22] ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT
[23] Natural Questions: A Benchmark for Question Answering Research
[24] Answering Questions on COVID-19 in Real-Time
[25] BioBERT: A Pre-trained Biomedical Language Representation Model for Biomedical Text Mining
[26] Latent Retrieval for Weakly Supervised Open Domain Question Answering
[27] BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
[28] Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
[29] RikiNet: Reading Wikipedia Pages for Natural Question Answering
[30] Reinforced Training Data Selection for Domain Adaptation
[31] RoBERTa: A Robustly Optimized BERT Pretraining Approach
[32] COVID-QA: A Question Answering Dataset for COVID-19
[33] SQuAD: 100,000+ Questions for Machine Comprehension of Text
[34] The Probabilistic Relevance Framework: BM25 and Beyond
[35] Real-Time Open-Domain Question Answering with Dense-Sparse Phrase Index
[36] On the Importance of Diversity in Question Generation for QA
[37] Rapidly Bootstrapping a Question Answering Dataset for COVID-19
[38] CORD-19: The COVID-19 Open Research Dataset
[39] Neural Domain Adaptation for Biomedical Question Answering
[40] Multi-Stage Pretraining for Low-Resource Domain Adaptation