key: cord-0117895-06tvzzzk
authors: Chen, Tao; Zhang, Mingyang; Lu, Jing; Bendersky, Michael; Najork, Marc
title: Out-of-Domain Semantics to the Rescue! Zero-Shot Hybrid Retrieval Models
date: 2022-01-25

Pre-trained language model (e.g., BERT) based deep retrieval models have achieved superior performance over lexical retrieval models (e.g., BM25) in many passage retrieval tasks. However, limited work has been done on generalizing a deep retrieval model to other tasks and domains. In this work, we carefully select five datasets, including two in-domain datasets and three out-of-domain datasets with different levels of domain shift, and study the generalization of a deep model in a zero-shot setting. Our findings show that the performance of a deep retrieval model deteriorates significantly when the target domain is very different from the source domain that the model was trained on. In contrast, lexical models are more robust across domains. We thus propose a simple yet effective framework to integrate lexical and deep retrieval models. Our experiments demonstrate that these two models are complementary, even when the deep model is weaker in the out-of-domain setting. The hybrid model obtains an average of 20.4% relative gain over the deep retrieval model, and an average of 9.54% over the lexical model, on the three out-of-domain datasets.

Traditionally, search engines have used lexical retrieval models (e.g., BM25) to perform query-document matching. Such models are efficient and simple, but are vulnerable to vocabulary mismatch when queries use different terms to describe the same concept [4]. Recently, deep pre-trained language models (e.g., BERT) have shown strong ability in modeling text semantics and have been widely adopted in retrieval tasks. Unlike lexical retrievers, deep/dense retrievers (while we recognize that in some cases deep retrievers are not necessarily dense, and vice versa, we loosely use the two terms interchangeably throughout the paper) capture the semantic relevance between queries and documents in a lower-dimensional space, bridging the vocabulary mismatch gap. Deep retrievers have been successful in many retrieval benchmarks. For instance, the five most recent winners on the MS-MARCO passage ranking leaderboard [2] adopt deep retrievers as their first-stage retrieval model.

However, training a deep retrieval model is computationally expensive, and a sizable labeled dataset to guide model training is not always available. A natural question then arises: can we train a deep retrieval model in one domain, and then directly apply it to new datasets/domains in a zero-shot setting with no in-domain training data? To answer this question, we carefully select five datasets, including two in-domain and three out-of-domain datasets with different levels of domain shift. Through comprehensive experiments, we find that a deep retrieval model performs well on related domains, but deteriorates when the target domain is distinct from the source domain. In contrast, lexical models are rather robust across datasets and domains. Our further analysis shows that lexical and deep models can be complementary to each other, retrieving different sets of relevant documents. Inspired by this, we propose a zero-shot hybrid retrieval model to combine lexical and deep retrieval models.
For simplicity and flexibility, we train a deep retrieval model and a lexical model separately and integrate the two (or more) models via Reciprocal Rank Fusion. This non-parametric fusion framework can be easily applied to any new dataset or domain, without any fine-tuning. Our experiments demonstrate the effectiveness of the hybrid model on both in-domain and out-of-domain datasets. In particular, even though the zero-shot deep model is weaker on out-of-domain datasets, the hybrid model brings an average relative recall gain of 20.4% over the deep retrieval model, and an average gain of 9.54% over the lexical model (BM25), on the three out-of-domain datasets. It also outperforms a variety of stronger baselines, including query and document expansion.

To summarize, in this paper we explore the following research questions:

- RQ 1: Can deep retrieval generalize to a new domain in a zero-shot setting?
- RQ 2: Is deep retrieval complementary to lexical matching and query and document expansion?
- RQ 3: Can lexical matching, expansion, and deep retrieval models be combined in a non-parametric hybrid retrieval model?

To the best of our knowledge, this paper is the first to propose a hybrid retrieval model that incorporates lexical matching, expansion and deep retrieval in a zero-shot setup. We demonstrate that the proposed hybrid model is simple yet effective across a variety of datasets and domains.

Information retrieval systems usually consist of two main stages: (a) candidate retrieval and (b) candidate re-ranking. The retrieval stage is aimed at optimizing the recall of relevant documents, while the re-ranking stage optimizes early-precision metrics such as NDCG@k or MRR. Prior research (e.g., [3]) found that the two stages are complementary: gains in retrieval recall often lead to better early precision. Therefore, in this paper, we focus on retrieval recall optimization, with the assumption that the findings can benefit the re-ranking stage as well.

Lexical retriever. Traditionally, the first-stage retrieval has been a lexical model such as BM25 [35], capturing the exact lexical match between queries and documents. Such simple and effective lexical models were the state-of-the-art for decades, and are still widely used in both academia and industry. One key issue with lexical models is their vulnerability to vocabulary mismatch, where queries and documents mention the same concept with different terms. One popular line of work to alleviate this is to expand terms in queries from pseudo-relevance feedback (e.g., [15] and [1]) or to expand terms in documents from related documents (e.g., [37]). As a result, queries and documents have a higher chance to match each other at the surface form.

Deep LM augmented lexical retriever. More recently, pre-trained deep language models (LMs) such as BERT [10] have been shown to be powerful in many natural language understanding tasks. The very first application of such models in IR was to augment lexical retrieval models. Dai et al. [7, 9] proposed to learn context-aware term weights with BERT to replace the term frequencies used by lexical models. To remedy the vocabulary gap between queries and documents, Nogueira and Lin [29, 28] employed the seq2seq Transformer model [39] and later T5 [33] to generate document expansions, which bring significant gains for BM25. In the same vein, Mao et al. [27] adopted the seq2seq model BART [20] to generate query expansions, which outperforms RM3 [15], a highly performant lexical query expansion method.
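As a rough illustration of the pseudo-relevance feedback idea behind such query expansion models, the following is a toy sketch only; it uses raw term counts rather than the actual RM3 or Bo1 weighting, and `first_pass_search` is a placeholder for any initial retriever:

```python
from collections import Counter

def prf_expand(query_terms, first_pass_search, k_docs=10, n_terms=10):
    """Toy pseudo-relevance feedback: treat the top-k first-pass results as
    relevant and add their most frequent terms to the query. Real RM3/Bo1
    use principled term weighting instead of raw counts."""
    top_docs = first_pass_search(query_terms)[:k_docs]   # ranked list of token lists
    counts = Counter(t for doc in top_docs for t in doc)
    for t in query_terms:                                 # do not re-add original terms
        counts.pop(t, None)
    expansion = [t for t, _ in counts.most_common(n_terms)]
    return list(query_terms) + expansion
```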
Deep retriever. In a separate line of research, deep neural retrieval models adopt LMs to build a new paradigm for first-stage retrieval: instead of performing exact lexical match, they aim at capturing the relevance of queries and documents in a lower-dimensional semantic space. This paradigm can largely bridge the vocabulary gap between queries and documents. Since cross-attention models are cost-prohibitive for first-stage retrieval, most works adopt a dual-encoder architecture that learns two single-vector representations for the query and the document separately, and then measures their relevance with a simple scoring function (e.g., dot product or cosine similarity). In this way, finding the most relevant documents can be formulated as a nearest neighbor search problem and can be accelerated with quantization techniques [16, 14]. For model training, it is often the case that positive (query, document) pairs are available, while negative pairs need to be sampled from the dataset. The negative sampling strategy plays a crucial role in model performance. Earlier works adopt simple in-batch negative sampling [45, 18], or mine negative pairs from top BM25 results [17]. Recent works propose more sophisticated sampling strategies to identify high-quality hard negatives, such as cross-batch negatives [32], denoised hard negatives [32] and semantic-similarity-based negatives [23].

Deep retrieval models have shown superior performance over lexical models in several passage retrieval tasks (e.g., MS-MARCO passage ranking [2]). However, training a deep model is expensive, both computationally and in terms of labeled data creation. A simple remedy is to directly apply a trained deep retrieval model to new domains in a zero-shot setting. However, little work has been conducted to uncover the generalization ability of deep retrievers. One exception is Thakur et al. [38], who introduce BEIR, an IR benchmark of 18 datasets with diverse domains and tasks, and evaluate several trained deep models in a zero-shot setup. They found that deep models exhibit poor generalization ability, and are significantly worse than BM25 on datasets with a large domain shift compared to what the models were trained on. In our work, we conduct similar studies, and observe the same performance deterioration for a deep model in the zero-shot setting. We additionally propose a hybrid model that utilizes a lexical model to alleviate the effect of domain shift.

Hybrid retriever. Deep retrievers are good at modeling semantic similarity, but can be weaker at capturing exact matches and can have capacity issues when modeling long documents [24, 42]. A few recent works attempt to build a hybrid model that leverages the strengths of both deep and lexical retrievers. Most works train a deep model separately and then interpolate its score with a lexical model score [42, 17, 24, 22, 25, 21], or use RM3 built on the top lexical results to select deep retriever results as the final list [18], or simply combine the results of the two models in an alternating way [45]. Gao et al. [13] is the only work that explicitly trains a deep model to encode semantics that the lexical model fails to capture. At inference time, they interpolate the scores of the deep and lexical models to generate the top retrieval results. While insightful, these prior works limit the model evaluation to a single task and a single domain. It is unclear how such a hybrid model performs in a cross-domain setting, without any fine-tuning.
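A minimal sketch of the dual-encoder scoring described above; `encoder` is any function mapping text to a fixed-size vector (e.g., a pooled BERT embedding), and the brute-force scoring stands in for the approximate nearest-neighbor search used in practice:

```python
import numpy as np

def embed(texts, encoder):
    """Encode texts into an (n, E) float32 matrix with a user-supplied encoder."""
    return np.asarray([encoder(t) for t in texts], dtype=np.float32)

def dense_retrieve(query, passages, encoder, top_k=1000):
    """Dual-encoder retrieval: rank passages by dot product with the query
    embedding (exhaustive search here; ScaNN/FAISS would replace this at scale)."""
    q = embed([query], encoder)[0]
    p = embed(passages, encoder)
    scores = p @ q
    order = np.argsort(-scores)[:top_k]
    return [(int(i), float(scores[i])) for i in order]
```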
Our work aims to fill this research gap, and demonstrates that a zero-shot hybrid retrieval model can be more effective than either of the two models alone.

In this section, we describe our zero-shot hybrid retrieval model. For simplicity and flexibility, we train deep and lexical retrieval models separately, and propose a simple yet effective non-parametric framework to integrate the two.

Both traditional lexical retrieval models [35, 30, 1] and deep neural retrieval models [13, 18, 41] represent queries and documents using vectors q, d ∈ R^N, and score candidates based on the dot product ⟨q, d⟩. Thus, the difference between deep and lexical models stems from how these vectors are constructed. Lexical models represent queries and documents using sparse weight vectors q_sparse, d_sparse ∈ R^V, respectively (where V denotes the vocabulary size). The vectors are sparse in that all entries for vocabulary terms that do not appear in the query or document are zeroed out. To combat term mismatch, lexical models often include additional terms in queries and documents through some form of expansion (e.g., based on pseudo-relevance feedback [19]). However, the resulting vectors are still highly sparse, due to the high dimensionality of the vocabulary size V. In contrast, deep neural retrieval models represent queries and documents using dense embedding vectors q_dense, d_dense ∈ R^E, where E ≪ V. While dense embeddings theoretically overcome the term mismatch problem, they have several shortcomings. First, they require large amounts of data and resources for training [32], and thus may not be directly trainable over collections with fewer queries and relevance judgments. Second, they do not capture exact query-document matches as well as the sparse lexical scores. Therefore, combining a lexical and a deep model is likely to yield better relevance estimates than either model alone.

Most prior works [42, 17, 24, 23] model this combination as a linear interpolation of the scores of the deep and lexical retrieval models. This fusion method is sensitive to the score scales and the weights assigned to the different models [42], and requires careful score normalization and weight tuning, especially when multiple models are combined. We expect the raw scores of the models to vary from one domain/dataset to another, and likewise the interpolation weights. Since our goal is to build a hybrid model that can be easily applied to a new domain in a zero-shot setting (with no in-domain training data), we would like to eliminate such domain-specific normalization and tuning. Therefore, we adopt Reciprocal Rank Fusion (RRF) [5] to generate the final ranking by considering the ranking position of each candidate under the different models, instead of fusing their scores. RRF has produced robust and effective ensembles in prior work [3, 5], and does so in our experiments as well.

Given a set of lexical and deep retrieval models M, we define π_m(q, d) as the rank of document d induced by the score assigned to it for query q by model m ∈ M. The RRF score is then defined as

RRF(q, d) = Σ_{m ∈ M} 1 / (k + π_m(q, d)),    (1)

where k = 60, following the definition in the original paper [5]. In the remainder of this paper we demonstrate that this simple non-parametric approach generalizes well across domains, and can make effective use of the out-of-domain semantics of retrieval models trained on a different collection. In the remainder of this section, we describe the lexical and deep retrieval models used to instantiate Equation 1.
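A minimal sketch of the fusion in Equation 1; each input run is simply the ranked list of document ids returned by one retrieval model for a query, and documents missing from a run contribute nothing for that model:

```python
from collections import defaultdict

def reciprocal_rank_fusion(runs, k=60, top_n=1000):
    """Fuse ranked lists with RRF (Equation 1): each document scores
    sum_m 1 / (k + rank_m), where rank_m is its 1-based rank in run m.
    `runs` is a list of ranked document-id lists, one per model."""
    scores = defaultdict(float)
    for run in runs:
        for rank, doc_id in enumerate(run, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    fused = sorted(scores.items(), key=lambda x: -x[1])
    return fused[:top_n]

# Example: fuse the BM25+Bo1, BM25+docT5query and NPR runs for one query.
# fused = reciprocal_rank_fusion([bm25_bo1_run, bm25_dt5q_run, npr_run])
```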
We adopt BM25 as the base lexical retrieval model, as it is widely used and has been shown to be robust [38]. To alleviate the vocabulary mismatch issue, we additionally apply popular query expansion and document expansion techniques to expand the query and the document, forming enhanced lexical models.

BM25 + query expansion. Most conventional query expansion approaches follow the pseudo-relevance feedback (PRF) paradigm. It assumes the top K ranked documents for the original query to be relevant, and generates query expansions from these documents. In our work, we experiment with RM3 [15] (a relevance-based language model) and Bo1 [1] (a variant of the Divergence From Randomness term weighting model) to obtain query expansions from PRF.

BM25 + document expansion. Recently, generative models like T5 were shown to generate high-quality document expansions, and bring large gains to the BM25 model on retrieval tasks [29, 28, 31]. Following the docT5query approach [28, 31], we fine-tune T5-base, with settings identical to the prior works, on (query, relevant passage) pairs from the MS-MARCO passage ranking training set, where the query is treated as a pseudo document expansion. We adopt the top-k sampling decoder [11] to generate N (a tunable parameter) queries per passage. For each document, we append the expansions generated for each of its passages and aggregate them as the document expansion.

We adopt NPR [23], a neural passage retrieval model with improved negative contrast, as the deep retrieval model in our framework. Note that our framework is flexible, and NPR can be replaced with any other deep model. In line with many popular deep retrievers [17, 32, 43], NPR adopts a dual-encoder architecture, learning dense embedding representations and computing relevance as the dot product ⟨q_dense, d_dense⟩. The training of this model is enhanced with several negative sampling strategies, aiming to obtain hard, high-quality negative (query, passage) pairs. This model is trained on the MS-MARCO passage dataset (detailed in Section 4.1), and achieves very competitive performance. To adapt NPR to the document retrieval setting, we split documents into passages using overlapping sliding sentence windows. Following [8], we use the maximum passage retrieval score as the document-level score.

As we are interested in exploring the performance of the deep retrieval model in a variety of out-of-domain settings, we choose to focus on five datasets in our evaluation (summarized in Table 1). For ORCAS, since it has a very large number of queries (10M), we evaluate our model using a stratified sample of 10k queries, based on query length. TREC-COVID is built over CORD-19, a collection of scientific articles and preprints about the COVID-19 pandemic. Each TREC-COVID query contains a few keywords, along with a more specific natural-language question, and a narrative which adds additional clarifications of user intent. As shown by Thakur et al. [38], it is quite distinct from the MS-MARCO dataset, and provides a good test case for whether an out-of-domain retrieval system can be useful in a bio-medical domain.

In the following, we detail our experimental setup to ensure the reproducibility of all the reported results.

Deep retrieval model. As described in Section 3, we train NPR on the training set of the MS-MARCO passage dataset, and apply this model to the other four datasets without any fine-tuning. The documents in the other four datasets are long and may exceed the 512-token length limit.
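A minimal sketch of this passage-based document scoring, assuming documents have already been segmented into sentences and embedded; the window and stride defaults below match the concrete settings given next:

```python
def split_into_passages(sentences, window=10, stride=5):
    """Overlapping sliding-window passage splitting over a sentence-segmented
    document (defaults mirror the ten-sentence window, stride-five setting)."""
    passages = []
    for start in range(0, len(sentences), stride):
        passages.append(" ".join(sentences[start:start + window]))
        if start + window >= len(sentences):
            break
    return passages

def max_passage_score(query_vec, passage_vecs):
    """MaxP aggregation: the document score is the best passage score,
    here a plain dot product over equal-length float vectors."""
    return max(sum(q * p for q, p in zip(query_vec, vec)) for vec in passage_vecs)
```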
Following prior work [31], we use a sliding window of ten sentences with a stride of five to split each document into passages. We run NPR on each passage, perform nearest neighbor search via ScaNN [14] at the passage level, and take the best passage score as the document score. The query used for each dataset is the same as for the BM25-based lexical model (detailed in Table 2).

Lexical retrieval models. For implementing our lexical models, we use the Terrier search engine [26], and apply the default options for stemming and stop word removal provided by Terrier. We employ three fully lexical baselines. We carefully tune the parameters, and detail the settings in Table 2.

- BM25 is a commonly used bag-of-words retrieval method. We use the default parameters provided by Terrier, and verify that our results (in terms of MAP) are comparable to other previously reported BM25 benchmarks [44]. We experiment with a few indexing options: (1) full text, (2) passage, and (3) abstract (for TREC-COVID only).
- BM25 + Bo1 applies the Bo1 query expansion model described in Section 3 on top of BM25. We also experiment with the RM3 query expansion package provided by Terrier and carefully tune its two parameters. However, it yields lower performance than Bo1 on all five datasets, so we only report the results of Bo1 in Section 5.
- BM25 + docT5query uses the T5-based document expansion model. As described in Section 3, we fine-tune the T5 model on the MS-MARCO passage training set, strictly following the setup of prior works [28, 31]. We feed each passage-length text (the passage in the MS-MARCO passage collection, the abstract in TREC-COVID, or the split passages of the other three datasets) to the T5 model and generate N expansions, with N a tunable parameter chosen from {10, 20, 40}. We append the expansions of all of a document's passages to that document.

As our work focuses on first-stage retrieval, in this section we adopt Recall@1K as the primary evaluation metric and additionally report MAP. In our evaluation, we aim to address the research questions posed in Section 1. We first focus on the results on the two in-domain datasets (Table 3).

Table 3. Experimental results on the two in-domain datasets. The improvements in R@1K of all hybrid models (rows 5-8) over the baselines (rows 1-4) are statistically significant under a paired two-tailed t-test (p < 0.05).

As expected, the deep retrieval model NPR performs very well on MS-MARCO passage, on which it is trained. It substantially beats BM25 by an absolute 10.77 (relative 12.35%) and 16.15 (relative 83.59%) in terms of Recall@1K and MAP, respectively. On MS-MARCO doc (the in-domain document retrieval task), NPR also performs well, and betters BM25 by 4.55 (5.0%) and 3.86 (14.57%) in Recall@1K and MAP, respectively. This indicates that a well-trained deep passage retrieval model generalizes well to an in-domain document retrieval task.

In Table 4, we discuss the results on the three out-of-domain document retrieval datasets. Compared to MS-MARCO doc, the ORCAS dataset has the least domain shift (its candidate documents stem from MS-MARCO doc, albeit with different queries), followed by Robust04 (news domain); TREC-COVID contains scientific articles about COVID-19 and exhibits the largest domain shift.

Lexical retrieval models are prone to vocabulary mismatch between queries and documents. We examine whether query and document expansion models can bridge this gap. From Table 3 and Table 4, we see that the Bo1 query expansion model consistently brings recall gains: roughly 1% relative on MS-MARCO passage/doc and ORCAS, 8.48% on Robust04, and 6.67% on TREC-COVID.
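To make the docT5query expansion step described above concrete, here is a hedged sketch using the HuggingFace Transformers API; the fine-tuned checkpoint path is a placeholder, not an artifact released with this paper:

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

def generate_expansions(passage, tokenizer, model, n_expansions=10):
    """Generate pseudo-queries for one passage with top-k sampling; the
    returned strings are appended to the parent document before indexing."""
    inputs = tokenizer(passage, return_tensors="pt", truncation=True, max_length=512)
    outputs = model.generate(
        **inputs,
        max_length=64,
        do_sample=True,   # top-k sampling decoder, as in docT5query
        top_k=10,
        num_return_sequences=n_expansions,
    )
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]

# tokenizer = T5Tokenizer.from_pretrained("t5-base")
# model = T5ForConditionalGeneration.from_pretrained("path/to/msmarco-finetuned-t5")  # placeholder path
```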
Recall that the docT5query document expansion model is trained on the training set of the MS-MARCO passage dataset. On this dataset, it brings very large gains to BM25. On the other four datasets, docT5query shows a consistent, albeit smaller, improvement over BM25 (above 2.5% recall gain), similar to the analysis by Thakur et al. [38].

Like query/document expansion, a deep retrieval model can narrow the vocabulary gap between queries and documents. One natural question is: are these models still complementary to each other? To answer this, we plot the unique relevant documents retrieved by BM25+Bo1, BM25+docT5query and NPR, and their overlaps, in Figure 1 for Robust04 and TREC-COVID (the other three datasets have only around one relevant document per query; see Table 1). We see that the methods are complementary to each other. In general, NPR retrieves the largest number of unique relevant results, even though it retrieves fewer relevant results overall than the other two methods.

Our proposed hybrid framework provides a flexible mechanism for fusing multiple lexical or deep retrieval models. In Table 3 and Table 4 (rows 5-8), we show that our hybrid model consistently outperforms either the lexical or the deep retrieval model alone. On the in-domain MS-MARCO passage dataset, the best-performing hybrid model, combining BM25+docT5query and NPR (#7), obtains a Recall@1K of 98.65, bettering BM25 and NPR by a relative 12.94% and 0.52%, respectively. This hybrid model outperforms coCondenser (Recall@1K = 98.4) [12], the MS-MARCO leaderboard winner for the passage retrieval task as of 2021/08/09. In the in-domain document retrieval task, the best-performing hybrid model is the one combining all three methods (Bo1, docT5query and NPR). On the three out-of-domain datasets, the advantage of the hybrid model is more evident, given that NPR is weakened on datasets with a large domain shift (i.e., TREC-COVID). The hybrid model consistently improves over BM25 by almost 10% relative on the three datasets, and substantially outperforms NPR by 6.11%, 14.16% and 44.25% on ORCAS, Robust04, and TREC-COVID, respectively. This demonstrates that our proposed zero-shot hybrid retrieval model is effective and robust across different tasks and domains.

Our zero-shot hybrid model has demonstrated its effectiveness in the experiments. For comparison, we implement the linear interpolation method adopted by most prior works [42, 17, 24, 23], although such a model is not zero-shot and requires weight tuning. As weight tuning complexity increases with the number of models, we only interpolate BM25 and NPR as a case study: we perform min-max score normalization and carefully tune the weight α ∈ {0.1, 0.2, ..., 0.9} via grid search on the out-of-domain datasets Robust04 and TREC-COVID. Figure 2 (bottom curve) shows that interpolation is weight-sensitive, and that even the best setting underperforms our simple non-parametric hybrid model RRF(BM25, NPR) by a relative 3% on both datasets. The differences are even larger when compared with the full RRF model (dashed line). We also explore the hybrid upper bound by fusing the retrieval results of BM25+Bo1, BM25+docT5query and NPR via an oracle, i.e., merging all relevant results from each method regardless of their ranking positions. Figure 2 (dotted top line) illustrates the large potential headroom for designing an even better fusion model. Similarly to us, Wang et al. [42] found that setting an oracle per-query weight yields better performance than optimizing a global weight.
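For reference, a minimal sketch of the interpolation baseline above: per-model min-max normalization followed by a weighted sum over the union of candidates, with α being the grid-searched weight:

```python
def minmax(scores):
    """Min-max normalize a {doc_id: score} mapping into [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    return {d: (s - lo) / (hi - lo) if hi > lo else 0.0 for d, s in scores.items()}

def interpolate(lexical_scores, deep_scores, alpha=0.5):
    """Linear interpolation baseline: alpha * deep + (1 - alpha) * lexical,
    with a score of 0 for documents missing from one of the runs."""
    lex, deep = minmax(lexical_scores), minmax(deep_scores)
    docs = set(lex) | set(deep)
    fused = {d: alpha * deep.get(d, 0.0) + (1 - alpha) * lex.get(d, 0.0) for d in docs}
    return sorted(fused.items(), key=lambda x: -x[1])
```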
Inspired by this per-query observation, we hypothesize that the performance of the retrieval models relates to query length. We bin the ORCAS queries into 10 groups, based on the number of non-stopword tokens, and show the breakdown in Table 5. When the queries are very short, NPR largely beats BM25, even with query and document expansion. However, its performance deteriorates for longer queries of 7 or more tokens. To gain more insight, we spot-check wins and losses. For single-token queries, BM25 performs badly when the query is misspelled (e.g., "ihpone6") or a compound word (e.g., "tvbythenumbers"). Such words are very likely to be out-of-vocabulary (OOV) for lexical retrieval models. In contrast, the deep retrieval model NPR uses a wordpiece tokenizer, which can still capture the semantics of an OOV term from its sub-units. For long queries, NPR performs poorly on those employing complex logic and seeking very specific information, e.g., "according to piaget, which of the following abilities do children gain during middle childhood?". For this example query, BM25 successfully retrieves relevant documents containing the identical query sentence, while NPR fails. This may indicate that NPR is worse at capturing exact matches, consistent with prior work [42, 24].

Compared to traditional lexical retrieval models, a deep retrieval model mitigates vocabulary mismatch by modeling the semantic relevance between queries and documents, and has had great success in many retrieval tasks. We show that a deep retrieval model generalizes poorly to a new domain with a large domain shift, while lexical matching and expansion models are robust across domains. To address this, we propose a simple non-parametric zero-shot hybrid model that integrates lexical matching, expansion, and deep retrieval models. Our proposed model demonstrates its effectiveness on both in-domain and out-of-domain datasets. Recent work [36] found that deep retrieval models underperform lexical models for rare entities in an entity-centric QA task. As future work, we plan to investigate the effectiveness of our hybrid model on this task. Additionally, we plan to parameterize the hybrid retrieval model using query structure, query length, the degree of domain shift, and other signals that may reflect the performance of each individual model. Finally, we plan to explore techniques that improve the utility of out-of-domain deep retrieval models via domain adaptation.

References

[1] Probabilistic models of information retrieval based on measuring the divergence from randomness.
[2] MS MARCO: A human generated machine reading comprehension dataset.
[3] RRF102: Meeting the TREC-COVID challenge with a 100+ runs ensemble.
[4] Bridging the lexical chasm: Statistical approaches to answer-finding.
[5] Reciprocal rank fusion outperforms Condorcet and individual rank learning methods.
[6] ORCAS: 20 million clicked query-document pairs for analyzing search.
[7] Context-aware sentence/passage term importance estimation for first stage retrieval. CoRR, abs.
[8] Deeper text understanding for IR with contextual neural language modeling.
[9] Context-aware term weighting for first stage passage retrieval.
[10] BERT: Pre-training of deep bidirectional transformers for language understanding.
[11] Hierarchical neural story generation.
[12] Unsupervised corpus aware language model pretraining for dense passage retrieval.
[13] Complement lexical retrieval model with semantic residual embeddings.
[14] Accelerating large-scale inference with anisotropic vector quantization.
[15] UMass at TREC 2004: Novelty and HARD.
[16] Billion-scale similarity search with GPUs.
[17] Dense passage retrieval for open-domain question answering.
[18] Leveraging semantic and lexical matching to improve the recall of document retrieval systems: A hybrid approach.
[19] Relevance based language models.
[20] BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension.
[21] A few brief notes on DeepImpact, COIL, and a conceptual framework for information retrieval techniques. CoRR, abs.
[22] In-batch negatives for knowledge distillation with tightly-coupled teachers for dense retrieval.
[23] Multistage training with improved negative contrast for neural passage retrieval.
[24] Sparse, dense, and attentional representations for text retrieval.
[25] A replication study of dense passage retriever.
[26] From puppy to maturity: Experiences in developing Terrier.
[27] Generation-augmented retrieval for open-domain question answering.
[28] From doc2query to docTTTTTquery. Online.
[29] Document expansion by query prediction.
[30] A language modeling approach to information retrieval.
[31] The expando-mono-duo design pattern for text ranking with pretrained sequence-to-sequence models.
[32] RocketQA: An optimized training approach to dense passage retrieval for open-domain question answering.
[33] Exploring the limits of transfer learning with a unified text-to-text transformer.
[34] TREC-COVID: Rationale and structure of an information retrieval shared task for COVID-19. Journal of the American Medical Informatics Association.
[35] Okapi at TREC-3.
[36] Simple entity-centric questions challenge dense retrievers.
[37] Language model information retrieval with document expansion.
[38] BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models.
[39] Attention is all you need.
[40] Overview of the TREC 2004 Robust Retrieval Track.
[41] CORD-19: The COVID-19 open research dataset.
[42] BERT-based dense retrievers require interpolation with BM25 for effective passage retrieval.
[43] Approximate nearest neighbor negative contrastive learning for dense text retrieval.
[44] Anserini: Reproducible ranking baselines using Lucene.
[45] RepBERT: Contextualized text embeddings for first-stage retrieval. CoRR, abs.