key: cord-0774529-kdlfiozr
authors: Otegi, Arantxa; San Vicente, Iñaki; Saralegi, Xabier; Peñas, Anselmo; Lozano, Borja; Agirre, Eneko
title: Information retrieval and question answering: A case study on COVID-19 scientific literature
date: 2021-12-31
journal: Knowl Based Syst
DOI: 10.1016/j.knosys.2021.108072
sha: b553d5488267e2e16bc8890499d4da5bb8118a8d
doc_id: 774529
cord_uid: kdlfiozr

Biosanitary experts around the world are directing their efforts towards the study of COVID-19. This effort generates a large volume of scientific publications, at a speed that makes the effective acquisition of new knowledge difficult. Therefore, Information Systems are needed to assist biosanitary experts in accessing, consulting and analyzing these publications. In this work we study the variables involved in the development of a Question Answering system that receives a set of questions asked by experts about the disease COVID-19 and its causal virus SARS-CoV-2, and provides a ranked list of expert-level answers to each question. In particular, we address the interrelation of the Information Retrieval and Answer Extraction steps. We found that a recall-oriented document retrieval that leaves to a neural answer extraction module the scanning of whole documents to find the best answer is a better strategy than relying on a precise passage retrieval before extracting the answer span.

Many biosanitary researchers around the world are directing their efforts towards the study of COVID-19. This effort generates a large volume of scientific publications, at a speed that makes the effective acquisition of new knowledge difficult. Information Systems are needed to assist biosanitary experts in accessing, consulting and analyzing these publications.

The ultimate goal of this research is to develop a system that receives a set of questions asked by experts about the disease COVID-19 and its causal virus SARS-CoV-2, and returns a ranked list of expert-level answers to each question, extracted from the scientific literature as collected in the CORD-19 document collection about COVID-19 [1].

Given the size of the document collection (over 400,000 articles), it is customary, for each given question, to first apply Information Retrieval (IR) to retrieve the most relevant contexts (documents or passages), and then to extract the answer from those contexts using a neural Question Answering (QA) system. (This QA architecture has recently been named retriever-reader. In this work we refer to the reader interchangeably as the Answer Extraction step, since its task is to identify, inside a context, the span that answers the question.)

In the COVID-19 domain, answers are long and have multiple dimensions (nuggets) that must all be returned to provide a complete correct answer. Ours is a general scenario where the different nuggets of relevant information can come from different documents and, therefore, the system must avoid returning irrelevant or repeated information.

Solving this task requires a three-step architecture: first, an initial retrieval of contexts (documents or passages) where the candidate nuggets may appear; second, the selection of text spans out of these contexts containing the relevant nuggets; and third, the ranking of these text spans. The evaluation measures assess the quality of this ranking, promoting relevant information and avoiding irrelevant or repeated nuggets.

Table 1 shows an example of the task.
In this example, the system has returned four different contexts. These contexts can come from the same or from different documents. Then, for each context, the system has selected the text spans (marked in boldface) that will be evaluated against the list of expected relevant nuggets. In general, the system will consider not only four contexts, but hundreds or thousands, so after this process the system must provide the best possible ranking of all text spans coming from all the retrieved contexts. The evaluation of this ranking considers the coverage of the expected nuggets.

Table 1: Example of the task.
Question: What is the origin of COVID-19?
Context: It is improbable that SARS-CoV-2 emerged through laboratory manipulation of a related SARS-CoV-like coronavirus. As noted above, the RBD of SARS-CoV-2 is optimized for binding to human ACE2 with an efficient solution different from those previously predicted. Furthermore, if genetic manipulation had been performed, one of the several reverse-genetic systems available for betacoronaviruses would probably have been used. However, the genetic data irrefutably show that SARS-CoV-2 is not derived from any previously used virus backbone. Instead, we propose two scenarios that can plausibly explain the origin of SARS-CoV-2: (i) natural selection in an animal host before zoonotic transfer; and (ii) natural selection in humans following zoonotic transfer.

This three-step architecture raises several research questions. The first, related to the system architecture, is which is the best strategy: a system relying on a precise passage retrieval before extracting the answer span, or a document-based retrieval that leaves to the Answer Extraction module the scanning of full documents to find the best answer. Further research questions concern the adaptation of the retrieval and answer extraction modules to the COVID-19 domain: the choice of fine-tuning data, the number of contexts to retrieve, the normalization of answer scores, the question fields to use, and the combination of retrieval and answer scores.

To answer these questions we have conducted a series of experiments taking advantage of the Epidemic Question Answering (EPIC-QA) dataset. The main contribution of this work is the experimentation that gives answers to the above research questions, and the proposal of a system architecture, following those answers, that returns a ranked list of expert-level answers to questions related to COVID-19.

Open-domain QA aims to answer questions by finding the answers in a large collection of documents [2]. Early approaches to this problem consisted of elaborate systems with multiple components [3, 4]. Recent advances follow a two-step pipeline, the retriever-reader [5]. The retriever first extracts a small subset of contexts from a large collection. This component is most commonly approached using an ad-hoc IR engine, but over the past years alternative neural architectures have been proposed [6-12]. Among the proposed approaches, those based on pre-trained language models stand out, such as [11] and [12], as they offer a significant improvement over classic term-matching based IR systems. Those approaches use the neural model to rerank an initial ranking generated by a classical IR model based on term-matching techniques. [11] propose a neural reranker based on BERT Large to address the passage re-ranking task: each query-passage pair is fed to the model, which estimates the probability of the passage being relevant to the query. These probabilities are then used to rank the final relevant passages. [12] adopt a similar strategy to address the document retrieval task. Because documents exceed the maximum length of BERT's input, they divide the documents into sentences, score each sentence, and add their scores. A known issue of such neural architectures is that they require a large number of query relevance judgements (qrels) for training, but their manual generation is very expensive.
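To make the cross-encoder reranking strategy described above concrete, the following is a minimal sketch in which each (query, passage) pair is scored by a BERT sequence classifier and the passages are re-sorted by the resulting relevance probability. It is a sketch under assumptions: the checkpoint name is a placeholder (in practice a model fine-tuned for relevance classification is required), and taking the positive-class probability as the relevance score is an illustrative choice, not a detail taken from [11].

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "bert-base-uncased"  # placeholder; a relevance-fine-tuned checkpoint would be used
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)
model.eval()

def rerank(query, passages, top_k=10):
    """Score each (query, passage) pair and return the passages sorted by relevance."""
    scores = []
    with torch.no_grad():
        for passage in passages:
            inputs = tokenizer(query, passage, truncation=True,
                               max_length=512, return_tensors="pt")
            logits = model(**inputs).logits
            # probability assigned to the "relevant" class
            scores.append(torch.softmax(logits, dim=-1)[0, 1].item())
    order = sorted(range(len(passages)), key=lambda i: scores[i], reverse=True)
    return [(passages[i], scores[i]) for i in order[:top_k]]
```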
Some authors [11, 12] use qrel data oriented to passage retrieval, such as MS MARCO [13] and TREC-CAR [14]. Another alternative is to generate relevance judgements automatically. [15], for example, propose to train neural ranking models using pseudo-qrels generated by unsupervised models like BM25. The TREC-CAR dataset [14] itself is automatically generated from the structure (article, section and paragraph) of Wikipedia articles. [16] generate pseudo-qrels from a news collection, using the titles as pseudo-queries and their content as relevant text.

The second component of the pipeline, the reader (or Answer Extraction module), scans each context thoroughly in search of a span that contains the answer to the question. [5] encode the retrieved contexts and the questions using different recurrent neural networks. For each question-context pair, two distributions over the context tokens are computed using bi-linear terms, one for the start of the span and the other for the end. The final answer maximizes the probability of the start and end tokens. With the advent of transformers and pre-trained language models, many systems adopted them as their reader [17]. These systems, although effective at extracting correct answers from a context, process each question-context pair independently of the others. To improve on this issue, [18] normalize the probabilities of the span start and end over all tokens in all contexts, whereas [19] add another distribution over the [CLS] token representations of all contexts. Other approaches substitute the reader with an answer reranking module [11, 20], where the retrieved passages are divided into plausible sentences which are used as the answer spans. These sentences are then further reranked by a cross-encoder.

Recently, some authors have proposed generative models that generate the answer instead of extracting it [21]. Although competitive in some benchmarks, large generative models are expensive to train and to run inference on. To tackle this problem, [22] combine evidence from the retrieved passages to generate the answer. Note also that some systems use symbolic knowledge to complement the background knowledge in pre-trained transformers [23, 24]. Symbolic knowledge has been shown to be useful in tasks such as OK-VQA [25], where the answer is not contained in the target document and background knowledge is needed in order to be able to answer.

One crucial part of the pipeline is the granularity of the passages that the retriever extracts for the reader to scan. Early works studied the downstream effect of this parameter in the retriever, with [26] suggesting that full documents might lead to better QA performance, whereas [27] conclude that small passages with high coverage give the QA system a smaller search space in which to find the correct answer. More recent work is inconclusive about which type of textual unit (full documents [5], natural passages [28] or sentences [29]) works best.

Earlier nugget-based QA evaluations differentiated between "vital" nuggets and "non-vital" nuggets [34]. In order to check our research questions we take advantage of a recent evaluation proposal, EPIC-QA. The best performing systems in EPIC-QA are based on a two-stage pipeline which includes a retriever and an answer extraction module. [38] return full sentences as answers, and thus use two reranker language models for scoring the sentences, returning the top sentence as the answer.
[39] also return sentences, using the ROUGE score to filter sentences in their ranked set. [40] use BERT-based models to rerank and a generative transformer for filtering. All these systems retrieve paragraphs instead of documents, and do not explore one of our research questions: whether to retrieve paragraphs or full documents, the latter allowing the reader to scan larger contexts.

While convolutional and recurrent neural networks have been used in the past [41, 42], the current state of the art relies heavily on transformer neural networks [43], which are often pre-trained using different variants of language-model losses [17, 44]. Transformers have been applied to natural language processing as discriminative classifiers, but recent trends have also used generative models with success [21, 45]. Pre-trained models are based on large quantities of text, and some models have explored hybrid architectures which tap the semantic information in knowledge graphs [46]. Current neural models for QA demand large amounts of training data, and there are some attempts to generalize the learning from fewer data points. For example, [47] explore the extension of existing capsule networks into a new framework with advantages concerning scalability, reliability and generalizability, showing promising results in QA. In this work we have focused on the use of pre-trained discriminative transformer models [17].

The proposed system has an architecture with three steps: context retrieval (documents or passages); context scanning for answer extraction; and ranking of answers. Each of these steps requires some experimentation before we can conclude about the best way to adapt them to the COVID-19 domain, as set out in the introduction.

Our IR module follows two main steps, preliminary retrieval and reranking. Before indexing the collection, a keyword-based filter is applied to select only COVID-related documents, since CORD-19 also includes papers focused on other coronaviruses. The keywords are different variants of the term "COVID-19", and they filter out up to 37.5% of the documents. Previous experiments done for the TREC-COVID challenge [48] showed the effectiveness of this filtering for improving the retrieval (see Section 4.1).

Related to retrieval, there is a research question about which strategy is best: a fine-grained passage retrieval before extracting the answer span, or a document-based retrieval leaving to the Answer Extraction module the scanning of full documents to evaluate the passages and find the best answer. For this reason, we conduct our experiments on the whole architecture for both options, and see which is the most appropriate at the end.

Regarding preliminary retrieval, we obtain an initial ranking for the query from the collection of full texts of the scientific articles. We use a language-modeling based IR approach [49] including Pseudo Relevance Feedback (PRF). For that purpose, we used the Indri search engine [50], which combines Bayesian networks with language models. The query and documents are tokenized and stemmed using the Krovetz stemmer [51], and stopwords are removed.

The adaptation of this system to the COVID-19 domain requires some experimentation that will be addressed in Section 5.1. First, the EPIC-QA questions have three fields (keyword-based query, natural language question, and narrative or background). Thus, there is a question about how we should construct the query to best exploit the information contained in those fields (a possible construction is sketched below).
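As an illustration, the sketch below builds an Indri-style #weight query that mixes the keyword query field and the natural-language question field with a weight w. The weight echoes the w = 0.5 / w = 0.7 values reported in Section 5.1, but the choice of fields, the punctuation cleaning and the #combine structure are assumptions made for illustration only, not the exact query-building strategy evaluated later.

```python
import re

def build_indri_query(query_field, question_field, w=0.7):
    """Build an Indri #weight query mixing the keyword query and the question text."""
    def clean(text):
        # Indri query syntax dislikes punctuation; keep only word characters and spaces
        return re.sub(r"[^\w\s]", " ", text).strip()
    return "#weight( {w} #combine( {kw} ) {cw:.1f} #combine( {q} ) )".format(
        w=w, kw=clean(query_field), cw=1 - w, q=clean(question_field))

print(build_indri_query("coronavirus origin", "What is the origin of COVID-19?", w=0.7))
# -> #weight( 0.7 #combine( coronavirus origin ) 0.3 #combine( What is the origin of COVID 19 ) )
```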
Second, there is a question about the number of contexts (passages or documents) to retrieve before feeding the QA module; that is, finding the balance between the recall of the retrieval and the noise that the QA module can manage.

Regarding reranking, the preliminary ranking obtained in the previous step is reranked using a BERT-based relevance classifier, following a strategy similar to the one proposed by [11]. For each candidate document given by the preliminary ranking, its abstract and the corresponding query are processed through a BERT-based relevance classifier, which returns the probability of the abstract being relevant with respect to the given query. Section 5.1.2 gives further details on the experimentation done in this regard.

The answer extraction module is based on neural network techniques. More specifically, we have used the SciBERT language representation model, which is a pre-trained language model based on BERT but trained on a large corpus of scientific text, including text from the biomedical domain [52]. SciBERT was selected for this module over other language models adapted to the biomedical domain (e.g. BioBERT [53], Clinical BERT [54]) based on the results obtained in initial experiments for our EPIC-QA participation.

We fine-tuned SciBERT for QA using SQuAD 2.0 [33], a reading comprehension dataset widely used in the QA research community. Following the usual answer extraction method [33], we used this fine-tuned SciBERT model as a pointer network which selects an answer start and end index given a question and a context. According to the EPIC-QA guidelines, the answers returned by the QA system must be a sentence or several contiguous sentences. In our case, we select those sentences which contain the answer span delimited by the start and end indexes given by the neural network.

In case the input contexts exceed the maximum input sequence length (e.g. when working with full documents), we follow the sliding-window approach, where the documents are split into overlapping passages. For the maximum sequence length, stride and other parameters we used the default values of [55]. After scanning the whole context (passage or document, depending on the strategy), we keep the most probable answers to the question for each context.

At this step, there are several research questions we must address to adapt the system to the COVID-19 domain. First, which is the best dataset to fine-tune the SciBERT model for the target task? SQuAD 2.0 aims at relatively short factoid questions, while in this dataset questions are complex and answers are expected to be longer. Therefore, we need to assess our hypothesis that using QuAC [56] in addition to SQuAD 2.0 when fine-tuning SciBERT will improve system results, as QuAC is a conversational QA dataset containing a higher rate of non-factoid questions than SQuAD.

Second, we need to determine both the appropriate number of relevant contexts that will be scanned by the answer extraction module and the number of candidate answers that will be extracted from each context. The idea is to find a balance between answers that come from different documents and those that come from a single document, without introducing too much noise when producing the final ranking of the answers for each question.
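The answer extraction step can be sketched with the HuggingFace question-answering pipeline, which already implements the sliding-window splitting of long contexts (using the library defaults for maximum sequence length and stride) and returns the most probable spans. The checkpoint name below is a placeholder that assumes a SciBERT model already fine-tuned for extractive QA on SQuAD 2.0 (and optionally QuAC); the final expansion of each span to its containing sentence(s) is omitted.

```python
from transformers import pipeline

# Placeholder checkpoint: assumes SciBERT already fine-tuned for extractive QA.
QA_MODEL = "allenai/scibert_scivocab_uncased"

qa = pipeline("question-answering", model=QA_MODEL, tokenizer=QA_MODEL)

def extract_answers(question, context, n_answers=5):
    """Return the n most probable answer spans as (text, score, start, end) tuples."""
    results = qa(question=question, context=context,
                 top_k=n_answers, handle_impossible_answer=True)
    if isinstance(results, dict):  # the pipeline returns a single dict when only one answer is kept
        results = [results]
    return [(r["answer"], r["score"], r["start"], r["end"]) for r in results]
```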
Third, we have to find out whether we should consider the contexts retrieved for the same question as independent from each other when normalizing the scores of the answers extracted from them. Considering contexts independently could produce incomparable answer scores when these answers come from different contexts. Thus, we will explore whether normalizing the scores globally across all relevant contexts for each question is helpful or not.

Finally, the last question to address is which exact question text should be given as input to the answer extraction module. Each topic of the EPIC-QA dataset provides three different fields, as will be described in Section 4.1. We need to figure out whether using the text provided in the question field, which is how humans pose a question in natural language, is enough to get the correct answers, or whether some other piece of information provided in the other fields is needed (for example, the more elaborate information provided in the background field).

The first three questions will be addressed in Section 5.2.1 by an extensive hyperparameter exploration, whereas the last question will be answered once all other hyperparameters are fixed, in Section 5.2.3.

At this point, each answer comes with two relevance evidences: the context retrieval score, and the score given by the answer extraction from the context. Therefore, we need to study the best way to combine both evidences and produce the final ranking of answers. We focus on this issue, together with other questions formulated in the previous section, in the hyperparameter exploration carried out in Section 5.2.1.

In order to evaluate our IR systems, two well-known evaluation measures were selected, both also used in the TREC-COVID [57] shared task: NDCG and recall at different cutoffs of the ranking. NDCG is a measure of ranking quality widely used to evaluate search engines:

DCG_p = sum_{i=1}^{p} rel_i / log_2(i + 1)
NDCG_p = DCG_p / IDCG_p

where rel_i is the relevance value of the i-th element in the ranking, and IDCG_p is the DCG that an ideal ranking would have at position p.

The main evaluation metric for the end-to-end task is NDNS, which was provided with the EPIC-QA evaluation. It is based on the Novelty Score (NS) of an answer a:

NS(a) = n_a / f_a

where n_a is the number of novel nuggets of answer a and f_a is a sentence factor which weights the score based on the number of sentences in a. Three different variants of NDNS are considered, based on how this factor is computed:

• Exact: answers should express novel nuggets in as few sentences as possible. This scenario is more suited to evaluating systems where brevity is a priority, like a chatbot which can only give one answer. The sentence factor is the number of sentences (n_sentences) in the answer: f_a = n_sentences.

• Relaxed: here n_non-relevant is the number of sentences with no nuggets, n_redundant is the number of sentences that contain previously seen nuggets, and n_novel is the number of sentences containing novel nuggets. This variant suits scenarios where brevity is not a requirement but non-redundancy is: f_a = n_non-relevant + n_redundant + min(n_novel, 1).

• Partial: redundant information is not penalized, which makes this metric well suited for tasks like a state-of-the-art survey about a topic, where some overlap among the relevant answers is expected: f_a = n_non-relevant + min(n_novel, 1).

The final metric is computed from the cumulative NS of the answers up to rank k = 1000, normalized by NDNS_ideal, the score of the optimal ranking of answers that could have been found in the document collection for the given question, computed using a beam search with a width of 10 over the annotated sentences.
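As a reference for the retrieval evaluation, NDCG@k can be computed directly from the graded relevance judgements of a ranking, as in this small self-contained example (the relevance values are invented for illustration):

```python
import math

def dcg(relevances, k):
    """Discounted cumulative gain of the first k graded relevance values."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg(relevances, k):
    """DCG normalized by the DCG of an ideal (descending) ordering of the same judgements."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0

# graded relevance of the top retrieved documents for one question (illustrative values)
print(round(ndcg([2, 0, 1, 2, 0, 1], k=5), 3))
```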
As mentioned in Section 3.1, our IR module follows a two-step approach [58]: preliminary retrieval and reranking. With respect to the evaluation of the IR module, the gold standard associated with the EPIC-QA dataset contains nuggets annotated at sentence level. In order to evaluate the IR systems at passage and document level, we created qrels at these two levels, annotating as relevant those passages or documents that contain at least one relevant nugget.

Results show that using complex queries performs best, both at passage level (see Table 2a) and at document level (see Table 2b). Hence, we adopted the complex query building strategy for the rest of the IR experiments, with w = 0.5 for passage retrieval and w = 0.7 for document retrieval.

As we have already mentioned, the preliminary ranking obtained in the previous step is reranked using a BERT-based relevance classifier, following a strategy similar to the one proposed by [11]. In the case of document retrieval, for each candidate document given by the preliminary ranking, its abstract and the corresponding query are processed through the classifier, which returns the probability of the abstract being relevant with respect to the given query. In the case of passage retrieval, the candidate passage is processed with the corresponding query. We selected Clinical BERT as the relevance classifier for our context retrieval system, based on the results obtained on the TREC-COVID dataset [58].

In order to train the neural reranker, a set of queries with their respective relevant and non-relevant documents is needed. The objective is to learn to classify a pair of texts, i.e. the relevance of the second text with respect to the first. We use two different query relevance sets to fine-tune our reranker:

• We exploit the title-abstract relationship [58]. Titles of scientific articles are usually brief and at the same time descriptive of the content. Therefore, they are similar to the queries used in search systems and can be used as pseudo-queries. The corresponding abstract constitutes a good candidate to be a relevant text (pseudo-positive) for that pseudo-query. We take (title, abstract) pairs to generate (pseudo-query, pseudo-positive) pairs. Non-relevant (pseudo-negative) texts are generated by randomly selecting abstracts (n = 2) from the collection (see the sketch below). The CORD-19 version used in the final round of the TREC-COVID shared task was used to automatically generate this training dataset, which contains 369,930 title-abstract pair relevance judgements.

• The TREC-COVID shared task official query relevance set, comprising 69,316 query-abstract pair judgements.

Fine-tuning was done in two steps: first over the automatically generated pseudo-qrels, and then over the official TREC-COVID judgements.

Regarding passages (see Table 3a), ranking-quality oriented systems not only outperform recall-oriented systems on NDCG, but they also show very competitive recall, even outperforming recall-oriented systems at some cutoffs. The same trend is observed for document retrieval (see Table 3b), where ranking-quality oriented systems are again better in terms of NDCG and on par with recall-oriented systems in terms of recall. Reranking improves results for both passage and document retrieval, and the trend in favor of ranking-quality oriented systems is more accentuated for systems using reranking, which are superior to recall-oriented systems at all but the recall@5K cutoff.
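The title-abstract pseudo-qrel construction described above can be sketched as follows: each article title acts as a pseudo-query, its own abstract as a pseudo-positive, and n randomly sampled abstracts as pseudo-negatives. The CSV field names ("title", "abstract") and the sampling details are assumptions about the CORD-19 metadata format rather than an exact reproduction of our pipeline.

```python
import csv
import random

def build_pseudo_qrels(metadata_csv, n_negatives=2, seed=0):
    """Return (pseudo_query, text, label) triples: 1 for the article's own abstract, 0 for sampled ones."""
    random.seed(seed)
    with open(metadata_csv, newline="", encoding="utf-8") as f:
        records = [r for r in csv.DictReader(f) if r.get("title") and r.get("abstract")]
    abstracts = [r["abstract"] for r in records]
    pairs = []
    for r in records:
        pairs.append((r["title"], r["abstract"], 1))        # pseudo-positive
        for neg in random.sample(abstracts, n_negatives):   # pseudo-negatives
            if neg != r["abstract"]:
                pairs.append((r["title"], neg, 0))
    return pairs
```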
With those results in hand, ranking-quality oriented settings and the use of reranking are selected for the remaining experiments.

There is a final question about retrieval regarding the granularity of the textual fragments to be handed to the answer extraction module: passage retrieval, so the answer is extracted directly from the passage, or document retrieval, so the answer extraction module has to scan the full document in order to select both the passage and the answer. We carried out three sets of experiments to find out which IR engine (passage or document) would offer the best starting point to the answer extraction module, in terms of recall of documents, passages and nuggets, respectively.

In order to measure recall at document level, passage-based retrieval rankings must be converted to document rankings. To do so, passages are substituted by their corresponding document and duplicates are removed from the ranking, i.e., a document is given the rank of its top ranked passage. Table 4a shows that documents offer a better recall if we were to take the first 500 elements of the ranking, but otherwise passages would be preferable.

To be fair to both strategies, recall at passage level should also be measured; documents expand into several passages on average. In order to compare passage rankings of similar sizes, 5,000-document rankings are retrieved and expanded, and they are compared to 50,000-passage retrieved rankings. As expected, the document-to-passage conversion leads to a decay in recall at the top of the ranking (see Table 4b), which evens out only for very large rankings. As in the document-level evaluation, retrieving passages is the best performing strategy.

EPIC-QA has the concept of nuggets, which introduces the factor of finding not only relevant information, but also "new" information. The third experiment measures the recall of the IR systems in terms of the nuggets retrieved. Could the retrieved contexts be returned directly as answers to the EPIC-QA questions? In order to check that, we prepared three systems, to see whether the conclusions would be the same:

• pas: full retrieved passages are returned, from the first context to the last.

• doc2pas: documents are expanded to passages, as done for the previous recall experiments, and then full passages are returned as answer candidates.

• pas-to-sent: instead of returning full passages, the first sentence of each candidate in the passage ranking is returned.

In this section we first check the hyperparameters, then the linear combination of retrieval and answer scores, and finally explore the most appropriate field to use as the question.

In order to answer the research questions regarding context scanning for the answer extraction module discussed in Section 3.2, we carried out an exploration of all the hyperparameters of the module. Two independent explorations were carried out: the first for the system whose context retrieval module is based on passages, and the second for the system that uses documents as contexts. For each hyperparameter we explored a range of values (the best values are summarized in Table 5). Figure 1d shows that document-level normalization is a better choice for both systems. Finally, we can see in Figure 1e that applying z-score normalization to the linear combination of both scores is the best option (illustrated in the sketch below).
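The scoring just described can be sketched as follows: the context-retrieval scores and the answer-extraction scores collected for a question are z-score normalized and then combined linearly with weight k (k = 0.5 in our best configuration, see Section 5.2.2). Normalizing each score list before combining, as done here, is one possible reading of that setting; the score values in the example are invented.

```python
from statistics import mean, stdev

def zscore(values):
    """Standardize a list of scores (requires at least two values)."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s if s > 0 else 0.0 for v in values]

def combine_scores(cr_scores, ae_scores, k=0.5):
    """Linear combination of context-retrieval and answer-extraction scores."""
    cr, ae = zscore(cr_scores), zscore(ae_scores)
    return [k * c + (1 - k) * a for c, a in zip(cr, ae)]

# illustrative values: retrieval and answer-extraction scores for three candidate answers
combined = combine_scores([12.1, 10.4, 9.8], [0.91, 0.40, 0.77])
ranking = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)
```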
As explained above, we need to combine the score given by the context retrieval module and the score given by the answer extraction module to produce the final score of each answer. A straightforward option is to weight both scores equally, but we wanted to explore the best value for the weight (k) in the linear combination:

final_score = k · cr_score + (1 − k) · ae_score

where cr_score is the score given by the context retrieval module and ae_score is the score given by the answer extraction module. Figure 2 shows the NDNS-Relaxed results for different values of k; the optimum value is 0.5 for both systems. For this exploration we fixed all the hyperparameters of both systems at their best values, as shown in Table 5.

In this section we explore which field of the question (query, question, background, or a combination of these) we should use to get the best performance from the answer extraction module. Note that we have used the text in the "question" field as the query in all the explorations carried out in the previous sections. For this exploration we fixed the best hyperparameter values (see Table 5) and set k = 0.5 for the linear combination.

The results obtained in this exploration for both the passage- and document-based systems can be seen in Table 6. The passage-based system performs best when using the concatenation of the query and question fields, while using only the question field obtains the best results for the document-based system. Therefore, adding extra information from the background field to the input does not yield better performance in any case.

The study performed so far was run over the Preliminary dataset of EPIC-QA. The conclusion at this point is that a recall-oriented document retrieval that leaves to a neural answer extraction module the scanning of whole documents to find the best answer is a better strategy than relying on a precise passage retrieval before extracting the answer span. We wanted to check this result over the Primary dataset, which was unseen during the whole development process described above. Results are shown in Table 7 (results on the Primary dataset for the passage-based and document-based full QA systems) and they confirm the previous observation: in all scenarios, the performance of the document-based system is better than that of the passage-based one.

In this paper we have analyzed how to construct a system that extracts answers to questions about COVID-19 from the scientific literature. We have performed extensive experiments to check which is the most effective combination of the retrieval and answer extraction components.

If we pay attention to the IR results with IR metrics, the results suggest that passage retrieval offers a better starting point for the QA module that extracts the actual answer. However, when we take into account the QA metrics, the results show that document retrieval clearly outperforms passage retrieval. To obtain this result, the system must use smaller document rankings (around 500 candidates), and the neural QA module for extracting the answer must be fine-tuned properly. In this respect, using the QuAC dataset for additional fine-tuning after SQuAD 2.0 proved beneficial.

Our experiments also showed that adding the extra information in the task query description (background or narrative fields) when posing the questions is useful in the IR module, but is not effective in the QA module.
Finally, the ranking of answers for a given question is more effective if it combines both the relevance score from the retrieval engine and the score for the extracted answer span. In our case, we obtain the best results giving the same weight to both scores.

References

CORD-19: The COVID-19 Open Research Dataset
Proceedings of the 8th Text REtrieval Conference (TREC-8)
An Analysis of the AskMSR Question-Answering System
Methods in Natural Language Processing
Reading Wikipedia to Answer Open-Domain Questions
A Deep Relevance Matching Model for Ad-Hoc Retrieval
End-to-End Neural Ad-Hoc Ranking with Kernel Pooling
ACM SIGIR Conference on Research and Development in Information Retrieval
Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining
Co-PACRR: A Context-Aware Neural IR Model for Ad-Hoc Retrieval
Learning to Match Using Local and Distributed Representations of Text for Web Search
Passage Re-ranking with BERT
Simple Applications of BERT for Ad Hoc Document Retrieval
A Human Generated MAchine Reading COmprehension Dataset
Neural Ranking Models with Weak Supervision
An Approach for Weakly-Supervised Deep
SIGIR 2017 Workshop on Neural Information Retrieval
Pre-training of Deep Bidirectional Transformers for Language Understanding
Association for Computational Linguistics: Human Language Technologies
Multi-passage BERT: A Globally Normalized BERT Model for Open-domain Question Answering
Dense Passage Retrieval for Open-Domain Question Answering
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing
The Expando-Mono-Duo Design Pattern for Text Ranking with Pretrained Sequence-to-Sequence Models
How Much Knowledge Can You Pack Into the Parameters of a Language Model?
Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering
Conference of the European Chapter of the Association for Computational Linguistics: Main Volume
KRISP: Integrating Implicit and Symbolic Knowledge for Open-Domain Knowledge-Based VQA
Multi-Modal Answer Validation for Knowledge-Based VQA
A Visual Question Answering Benchmark Requiring External Knowledge
Passage Retrieval vs. Document Retrieval for Factoid Question Answering
International ACM SIGIR Conference on Research and Development in Information Retrieval
Simple is Best: Experiments with Different Document Segmentation Strategies for Passage Retrieval
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations)
Re-Ranking in Open-Domain Question Answering
AWS CORD-19 Search: A Neural Search Engine for COVID-19 Literature
RRF102: Meeting the TREC-COVID Challenge with a 100+ Runs Ensemble
Covidex: Neural ranking models and keyword search infrastructure for the COVID-19 open research dataset
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics
Overview of the TREC 2006 Question Answering Track
Overview of the 2020 Epidemic Question Answering Track
Rapidly Bootstrapping a Question Answering Dataset for COVID-19
A Question Answering Dataset for COVID-19
IBM Submissions to EPIC-QA
Open Retrieval Question Answering on COVID-19
The University of Texas at Dallas HLTRI's Participation in EPIC-QA: Searching for Entailed Questions Revealing Novel Answer Nuggets
Effective Use of Word Order for Text Categorization with Convolutional Neural Networks
Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Hierarchical Attention Networks for Document Classification
Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Proceedings of the 31st International Conference on Neural Information Processing Systems
Language Models are Unsupervised Multitask Learners
Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension
QA-GNN: Reasoning with language models and knowledge graphs for question answering
Towards scalable and reliable capsule networks for challenging NLP applications
Searching for scientific evidence in a pandemic: An overview of TREC-COVID
A Language Modeling Approach to Information Retrieval
Indri: A language model-based search engine for complex queries
Viewing Morphology as an Inference Process
SciBERT: A Pretrained Language Model for Scientific Text
Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
Proceedings of the 2nd Clinical Natural Language Processing Workshop
Transformers: State-of-the-Art Natural Language Processing
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Online, 2020
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
TREC-COVID: rationale and structure of an information retrieval shared task for COVID-19
Fine-Tuning BERT for COVID-19 Domain Ad-Hoc IR by Using Pseudo-qrels