Match Your Words! A Study of Lexical Matching in Neural Information Retrieval
Thibault Formal, Benjamin Piwowarski, Stéphane Clinchant
2021-12-10

Neural Information Retrieval models hold the promise to replace lexical matching models, e.g. BM25, in modern search engines. While their capabilities have fully shone on in-domain datasets like MS MARCO, they have recently been challenged in out-of-domain zero-shot settings (the BEIR benchmark), questioning their actual generalization capabilities compared to bag-of-words approaches. In particular, we wonder whether these shortcomings could (partly) be the consequence of the inability of neural IR models to perform lexical matching off-the-shelf. In this work, we propose a measure of discrepancy between the lexical matching performed by any (neural) model and an 'ideal' one. Based on this measure, we study the behavior of different state-of-the-art neural IR models, focusing on whether they are able to perform lexical matching when it is actually useful, i.e. for important terms. Overall, we show that neural IR models fail to properly generalize term importance on out-of-domain collections or for terms almost unseen during training.

Over the last two years, the effectiveness of neural IR systems has risen substantially. Neural retrievers based on pre-trained Language Models like BERT [4], whether dense or sparse, hold the promise to replace lexical matching models (e.g. BM25) for first-stage ranking in modern search engines. Despite this success, little is known regarding their actual inner workings in the IR setting. Previous works scrutinizing BERT-based ranking models either relied on axiomatic approaches adapted to neural models [1, 17], controlled experiments [11], or direct investigation of the learned representations [9, 7] or attention [19]. This line of work has shown, among other findings, that these models, which rely on contextualized semantic matching, are actually still quite sensitive to lexical match and term statistics in documents/collections [9, 7]. However, these observations are based on specifically tailored approaches that cannot directly be applied to any given model. To generalize these findings, we instead introduce an intuitive black-box approach: we propose to "count" query terms appearing in the top documents retrieved by various state-of-the-art neural systems, in order to compare their ability to perform lexical matching.

Furthermore, previous studies have been conducted on the MS MARCO dataset, on which the models have been trained. The BEIR benchmark [18] has shown that the only systems improving the overall performance over BM25 in the zero-shot setting have (to some extent) a lexical bias, e.g. models like doc2query-T5 [13] or ColBERT [10]. Therefore, we also propose to study the extent to which neural IR models are able to generalize lexical matching, for query terms that either have not been seen in the training set or have different collection statistics (e.g. common in the training set but rare in an out-of-domain evaluation set).

In this work, we first develop indicators that help measure to what extent a lexical match is "important" for the user (user relevance) or for the model (system relevance). By comparing both values, i.e.
computing the difference between the user and the system weights, we can address the following research questions: (RQ1) To what extent do neural retrievers perform accurate lexical matching (Sect. 3.1)? (RQ2) Do they generalize term matching to unseen query terms (Sect. 3.1)? (RQ3) Do they generalize term matching to new collections (Sect. 3.2)?

Our analysis rationale is the following: the more a term is important for a query (w.r.t. relevant documents), the more frequently the term should appear in the top documents retrieved by the system. Therefore, we first need to define what it means for a term to be important for lexical matching, and how to accurately measure its frequency in top documents. Roughly speaking, we are interested in the models' ability to retrieve documents containing query terms, when those terms are deemed important. Note that we are not interested in expansion mechanisms in our analysis, since they are more related to semantic matching.

Intuitively, term importance w.r.t. relevance can be measured by the extent to which a term allows one to distinguish relevant from non-relevant documents in a collection. It is thus natural to use the Robertson-Sparck Jones (RSJ) weight [20, 14]. The RSJ weights have been shown, if estimated correctly, to order documents optimally w.r.t. the Probability Ranking Principle [15]. For a given user information need U, the user RSJ_U weight for term t is defined as follows (the conditioning on the query q is implicit):

RSJ_{t,U} = log [ P(t|R) · (1 − P(t|¬R)) ] / [ P(t|¬R) · (1 − P(t|R)) ]   (1)

where P(t|R) is the probability that term t occurs in a relevant document, and P(t|¬R) the probability that it occurs in a non-relevant one. RSJ_{t,U} is thus high when a term is both necessary (high P(t|R)) and sufficient (low P(t|¬R)) for a document to be relevant. Intuitively, it is low for e.g. stopwords, as they have roughly equal odds of appearing in relevant and irrelevant documents. The above weight can be estimated using the set of relevant documents and collection statistics.

We now want to compute the same weight when relevance is defined by the system (and not the user). In other words, we would like to measure how much a model "retrieves" term t. One way to proceed is to suppose that the top-K documents are relevant from the point of view of the system, for a suitable K. While a more accurate definition of system relevance could be used, we found in our preliminary analysis that results were not very sensitive to the choice of K. We hence define the system RSJ_S weight for term t as:

RSJ_{t,S} = log [ P(t|R_S) · (1 − P(t|¬R_S)) ] / [ P(t|¬R_S) · (1 − P(t|R_S)) ]   (2)

where R_S denotes system relevance, i.e. membership in the top-K retrieved documents. Intuitively, it gives us a means to properly count occurrences of query terms in retrieved documents, while taking collection statistics into account. It is estimated similarly to Eq. 1.

Once RSJ_U and RSJ_S have been computed, we can look at the difference between both, i.e. ∆RSJ_t = RSJ_{t,S} − RSJ_{t,U}. If ∆RSJ_t > 0 (resp. ∆RSJ_t < 0), the model overestimates (resp. underestimates) the importance of term t in its document ordering; in other words, the model retrieves this term "too much" (resp. "too little"). Please note that a high correlation between RSJ_S and RSJ_U is not indicative of the absolute performance of a model, as RSJ_U is neither a perfect model nor a performance measure. However, we argue that it can still partly reflect the performance of the model w.r.t. lexical matching, especially for terms whose RSJ_U is high.
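To make the measure concrete, the sketch below shows one straightforward way to estimate RSJ_U, RSJ_S and ∆RSJ from binary term occurrences. It is an illustration rather than the exact implementation used in the paper: the add-0.5 smoothing and the helper names (rsj_weight, delta_rsj) are our own assumptions.

```python
import math
from typing import Iterable, Set

def rsj_weight(term: str,
               relevant_docs: Iterable[Set[str]],
               n_docs: int,
               n_docs_with_term: int,
               smoothing: float = 0.5) -> float:
    """Robertson-Sparck Jones weight of `term`, where the "relevant" documents
    are either the user-judged relevant documents (RSJ_U) or the model's top-K
    documents (RSJ_S). Documents are represented as sets of (stemmed) terms;
    n_docs and n_docs_with_term are collection statistics."""
    relevant_docs = list(relevant_docs)
    R = len(relevant_docs)                          # number of relevant documents
    r = sum(term in doc for doc in relevant_docs)   # relevant documents containing the term
    N, n = n_docs, n_docs_with_term
    # P(t|R) and P(t|not R), with add-0.5 smoothing (an assumption) to avoid log(0)
    p_rel = (r + smoothing) / (R + 2 * smoothing)
    p_nrel = (n - r + smoothing) / (N - R + 2 * smoothing)
    return math.log((p_rel * (1.0 - p_nrel)) / (p_nrel * (1.0 - p_rel)))

def delta_rsj(term: str,
              user_relevant_docs: Iterable[Set[str]],
              top_k_docs: Iterable[Set[str]],
              n_docs: int,
              n_docs_with_term: int) -> float:
    """Delta RSJ = RSJ_S - RSJ_U: positive values mean the model retrieves the
    term "too much", negative values mean "too little"."""
    rsj_u = rsj_weight(term, user_relevant_docs, n_docs, n_docs_with_term)
    rsj_s = rsj_weight(term, top_k_docs, n_docs, n_docs_with_term)
    return rsj_s - rsj_u
```

In practice, the user weight would be estimated from the relevance judgments of each query and the system weight from the model's top-K ranking (K = 100 in our experiments), before aggregating ∆RSJ per RSJ_U bin.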
We conducted experiments by analyzing models trained on MS MARCO [12], using public model parameters when available (indicated by †). We evaluated models on the in-domain TREC Deep Learning 2019-2020 datasets [3, 2] (97 queries in total), and on two out-of-domain datasets from the BEIR benchmark [18]: TREC-COVID (bio-medical) and FiQA-2018 (financial), with respectively 50 and 648 test queries. For all our experiments, we measure system relevance using the top K = 100 documents. For the term-level analysis, we keep stopwords, and use standard tokenization and Porter stemming. We solely focus on first-stage retrievers (and not re-rankers), for which lexical matching might be more critical. We thus compare various state-of-the-art models (based on the BEIR benchmark), considering different types of approaches (sparse and dense). We include two lexical models, the standard BM25 [16] and doc2query-T5† [13]; SPLADE† [6, 5], an expansion-based sparse approach; ColBERT [10], an interaction-based architecture; and two dense retrievers, TAS-B† [8] and a standard bi-encoder trained with a contrastive loss and in-batch negatives.

In Fig. 1, we plot the relationship between the user weight RSJ_U and ∆RSJ, for each term in the test queries appearing at least 10 times in the training queries (left, IT for In-Training). We first note that lexical-based models tend to overestimate the importance of query terms (∆RSJ > 0). The second observation is that models are roughly similar in their estimations for low user RSJ_U weights (below 5).

We now shift our attention to the behavior of models for query words that are (almost) absent from the training set. In Fig. 1, we show the distribution of ∆RSJ for terms appearing in fewer than 10 training queries (out of > 500k) (right, OOT for Out-Of-Training). Comparing with ∆RSJ for terms in the training set, we can see that all neural models are affected to some extent, showing that lexical match does not fully generalize to "new" terms. For the (8, 17] bin, and for every model (except BM25), the difference in means between IT and OOT is significant, based on a t-test with p = 0.01. Finally, we also looked at the relationship between IT/OOT and model performance. More precisely, for terms in the (8, 17] bin, we computed the mean ndcg@10 for queries containing at least one term either in IT or in OOT (respectively 55 and 37 queries out of the 97, with 9 queries in both sets). We found that BM25 and doc2query-T5 performance increased by 0.1 and 0.02 respectively, while for all neural models the performance decreased (≈ 0 for TAS-B, -0.11 for SPLADE, -0.27 for the bi-encoder and -0.38 for ColBERT). The increase for BM25 is likely due to the increase in mean IDF (from 7.3 to 10.9), i.e. important terms are more discriminative in the OOT query set. With this in mind, the decrease observed for all neural models suggests that a worse estimate of high RSJ_U weights is a potential reason for their relative performance drop (w.r.t. BM25).

We now analyze whether term importance generalizes to the zero-shot setting. We distinguish two categories of words, namely those that occur 5 times more frequently in the target collection than in MS MARCO (IDF+), and those for which term statistics are better preserved (IDF-), allowing us to split query terms into sets of roughly equal size. Since term importance is related to collection frequency (albeit loosely), we can compare ∆RSJ in those two settings.
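As an illustration of how these two term partitions can be built, the sketch below derives the IT/OOT split from training-query term counts and the IDF+/IDF- split from the ratio of collection-size-normalized document frequencies between the target collection and MS MARCO. The thresholds (10 training queries, a 5x ratio) follow the text; normalizing by collection size, assigning terms unseen in the source collection to IDF+, and the function names are our own assumptions.

```python
from typing import Dict, Set, Tuple

def split_it_oot(term_train_query_count: Dict[str, int],
                 min_count: int = 10) -> Tuple[Set[str], Set[str]]:
    """IT = test-query terms seen in at least `min_count` training queries,
    OOT = terms seen in fewer than `min_count` training queries."""
    it = {t for t, c in term_train_query_count.items() if c >= min_count}
    oot = {t for t, c in term_train_query_count.items() if c < min_count}
    return it, oot

def split_idf_shift(df_target: Dict[str, int], n_target: int,
                    df_source: Dict[str, int], n_source: int,
                    ratio: float = 5.0) -> Tuple[Set[str], Set[str]]:
    """IDF+ = terms whose normalized document frequency is at least `ratio`
    times higher in the target collection than in the source (MS MARCO);
    IDF- = terms whose statistics are better preserved."""
    idf_plus, idf_minus = set(), set()
    for term, df_t in df_target.items():
        rel_target = df_t / n_target
        rel_source = df_source.get(term, 0) / n_source
        # Terms unseen in the source collection are treated as IDF+ (an assumption).
        if rel_source == 0 or rel_target / rel_source >= ratio:
            idf_plus.add(term)
        else:
            idf_minus.add(term)
    return idf_plus, idf_minus
```

Per-term ∆RSJ values can then be aggregated within each bucket, as in the comparisons that follow.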
Fig. 2 shows ∆RSJ with respect to RSJ_U for the TREC-COVID and FiQA-2018 collections from the BEIR benchmark [18]. We can first observe that neural models underestimate RSJ_U for terms that are more frequent in the target collection than in the training one (IDF+). This might indicate that models have learned a dataset-specific term importance, confirming the results obtained in the previous section on out-of-training terms. When comparing dense and sparse/interaction models overall, by considering the average ∆RSJ over terms, we observe that dense models underestimate RSJ_U even more than in the in-domain setting (∆RSJ = −0.17 for TAS-B and −0.38 for the bi-encoder), while sparse/interaction models tend to overestimate it (∆RSJ = 0.18 for ColBERT and 0.30 for SPLADE), though to a lesser extent than BM25 (∆RSJ = 0.83). Finally, we observed that, when transferring, all models have a higher ∆RSJ variance compared to their behavior on MS MARCO: in all cases, the standard deviation (normalized by that of BM25) is around 0.8 for MS MARCO, but around 1.1 for TREC-COVID and FiQA-2018. This further strengthens our point on the difficulty of generalizing lexical matching to out-of-domain collections.

In this work, we analyzed how different neural IR models predict the importance of lexical matching for query terms. We proposed to use the Robertson-Sparck Jones (RSJ) weight as an appropriate measure to compare term importance w.r.t. user and system relevance. We introduced a black-box approach that enables a systematic comparison of different models w.r.t. term matching. We also investigated the behavior of lexical matching in the zero-shot setting. Overall, we have shown that lexical matching properties are heavily influenced by the presence of the term in the training collection: the rarer the term, the harder it is for most neural models to find documents containing it. Furthermore, this phenomenon is amplified when term statistics change across collections.

References
[1] Diagnosing BERT with Retrieval Heuristics
[2] Overview of the TREC 2020 Deep Learning Track
[3] Overview of the TREC 2019 Deep Learning Track
[4] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
[5] SPLADE v2: Sparse Lexical and Expansion Model for Information Retrieval
[6] SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking
[7] A White Box Analysis of ColBERT
[8] Efficiently Teaching an Effective Dense Retriever with Balanced Topic Aware Sampling
[9] How Does BERT Rerank Passages? An Attribution Analysis with Information Bottlenecks
[10] ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT
[11] ABNIRML: Analyzing the Behavior of Neural IR Models
[12] MS MARCO: A Human Generated Machine Reading Comprehension Dataset
[13] From doc2query to docTTTTTquery
[14] Relevance Weighting of Search Terms
[15] The Probability Ranking Principle in IR
[16] The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval
[17] Simple Entity-Centric Questions Challenge Dense Retrievers
[18] BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models
[19] Pretrained Transformers for Text Ranking: BERT and Beyond
[20] Precision Weighting - An Effective Automatic Indexing Method