key: cord-0455545-mi1t66m3
authors: Santhanam, Keshav; Khattab, Omar; Saad-Falcon, Jon; Potts, Christopher; Zaharia, Matei
title: ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction
date: 2021-12-02
* Equal contribution.

Neural information retrieval (IR) has greatly advanced search and other knowledge-intensive language tasks. While many neural IR methods encode queries and documents into single-vector representations, late interaction models produce multi-vector representations at the granularity of each token and decompose relevance modeling into scalable token-level computations. This decomposition has been shown to make late interaction more effective, but it inflates the space footprint of these models by an order of magnitude. In this work, we introduce ColBERTv2, a retriever that couples an aggressive residual compression mechanism with a denoised supervision strategy to simultaneously improve the quality and space footprint of late interaction. We evaluate ColBERTv2 across a wide range of benchmarks, establishing state-of-the-art quality within and outside the training domain while reducing the space footprint of late interaction models by 5-8×.

Neural information retrieval (IR) has quickly dominated the search landscape over the past 2-3 years, dramatically advancing not only passage and document search (Nogueira and Cho, 2019) but also many knowledge-intensive NLP tasks like open-domain question answering (Guu et al., 2020), multi-hop claim verification (Khattab et al., 2021a), and open-ended generation (Paranjape et al., 2021).

Many neural IR methods follow a single-vector similarity paradigm: a pretrained language model is used to encode each query and each document into a single high-dimensional vector, and relevance is modeled as a simple dot product between the two vectors. An alternative is late interaction, introduced in ColBERT (Khattab and Zaharia, 2020), where queries and documents are encoded at a finer granularity into multi-vector representations, and relevance is estimated using rich yet scalable interactions between these two sets of vectors. ColBERT produces an embedding for every token in the query (and document) and models relevance as the sum of maximum similarities between each query vector and all vectors in the document.

By decomposing relevance modeling into token-level computations, late interaction aims to reduce the burden on the encoder: whereas single-vector models must capture complex query-document relationships within one dot product, late interaction encodes meaning at the level of tokens and delegates query-document matching to the interaction mechanism. This added expressivity comes at a cost: existing late interaction systems impose an order-of-magnitude larger space footprint than single-vector models, as they must store billions of small vectors for a corpus with billions of tokens like Wikipedia or larger Web-scale collections. Considering this challenge, it might seem more fruitful to focus instead on addressing the fragility of single-vector models by introducing new supervision paradigms for negative mining (Xiong et al., 2020), pretraining, and distillation (Qu et al., 2021).
Indeed, recent single-vector models with highly-tuned supervision strategies (Ren et al., 2021b; Formal et al., 2021a) sometimes perform on par with or even better than "vanilla" late interaction models, and it is not necessarily clear whether late interaction architectures, with their fixed token-level inductive biases, admit similarly large gains from improved supervision.

In this work, we show that late interaction retrievers naturally produce lightweight representations (§3), which can be encoded effectively with a minimal space footprint, and that they can benefit drastically from denoised supervision. We combine these insights in ColBERTv2, a new late-interaction retriever that employs a simple combination of distillation from a cross-encoder and hard-negative mining (§4.2) to boost quality beyond any existing method, and then uses a residual compression mechanism (§4.3) to reduce the space footprint of late interaction by 5-8× while preserving quality. As a result, ColBERTv2 establishes state-of-the-art retrieval quality both within and outside its training domain, with a space footprint competitive with typical single-vector models.

When trained on MS MARCO Passage Ranking, ColBERTv2 achieves the highest MRR@10 of any standalone retriever. In addition to in-domain quality, we seek a retriever that generalizes "zero-shot" to domain-specific corpora and long-tail topics, which are often under-represented in large public training sets. To this end, we also evaluate ColBERTv2 on a wide array of out-of-domain benchmarks. These include three Wikipedia Open-QA tests and 13 diverse retrieval and semantic-similarity tasks from BEIR (Thakur et al., 2021). In addition, we introduce a new benchmark, dubbed LoTTE, for Long-Tail Topic-stratified Evaluation for IR, which features 12 domain-specific search tests spanning StackExchange communities and using queries from GooAQ (Khashabi et al., 2021). LoTTE focuses on long-tail topics in its passages, unlike the Open-QA tests and many of the BEIR tasks, and evaluates models on their capacity to answer natural search queries with a practical intent, unlike many of BEIR's semantic-similarity tasks. On 22 of 28 out-of-domain tests, ColBERTv2 achieves the highest quality, outperforming the next best retriever by up to 8% relative gain, while using its compressed representations.

This work makes the following contributions:
1. We analyze the semantic space learned by ColBERT and find that token-level decomposition leads to "lightweight" representations. Leveraging this, we propose a simple yet effective residual compression mechanism for off-the-shelf late interaction models.
2. We propose ColBERTv2, a retriever that combines denoised supervision and residual compression to establish state-of-the-art quality with a competitive space footprint.
3. We introduce LoTTE, a new resource for out-of-domain evaluation of retrievers. LoTTE focuses on queries with practical intent over long-tail topics, an important yet understudied application space.

Many neural IR approaches encode passage and document embeddings as a single high-dimensional vector, thereby trading off the higher quality of cross-encoders for improved efficiency and scalability (Karpukhin et al., 2020; Xiong et al., 2020; Qu et al., 2021). ColBERT's (Khattab and Zaharia, 2020) late interaction paradigm addresses this tradeoff by computing multi-vector embeddings and using a scalable "MaxSim" operator for retrieval.
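Concretely, the MaxSim operator matches each query token embedding to its most similar passage token embedding and sums these maxima to score a query-passage pair. The following is a minimal sketch in PyTorch; it assumes L2-normalized embeddings and is illustrative rather than the library's implementation (§4.1 gives the formal definition).

```python
# Minimal sketch of MaxSim late-interaction scoring; assumes L2-normalized
# token embeddings so that dot products equal cosine similarities.
import torch

def maxsim_score(Q: torch.Tensor, D: torch.Tensor) -> torch.Tensor:
    """Q: (N, dim) query token embeddings; D: (M, dim) passage token embeddings."""
    sim = Q @ D.T                        # (N, M) token-level similarities
    return sim.max(dim=1).values.sum()   # best passage token per query token, summed
```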
Several other systems leverage multi-vector representations, including Poly-encoders (Humeau et al., 2020), PreTTR (MacAvaney et al., 2020), and MORES (Gao et al., 2020), but they target attention-based re-ranking as opposed to ColBERT's scalable MaxSim-based interaction. In this work, we build on ColBERT's late interaction architecture by reducing the space footprint and improving supervision.

COIL generates token-level document embeddings similar to ColBERT, but its token interactions are restricted to lexical matching between query and document terms. The uniCOIL (Lin and Ma, 2021) model simplifies COIL by limiting its token embedding vector to a single dimension, which in effect replaces the vector weights with scalar weights, extending models like DeepCT and DeepImpact (Mallia et al., 2021). Similarly, SPLADE (Formal et al., 2021b) and SPLADEv2 (Formal et al., 2021a) produce a sparse vocabulary-level vector that retains the term-level decomposition of late interaction while simplifying the storage into one dimension per token; these models also directly piggyback on the language modeling capacity acquired by BERT during pretraining. SPLADEv2 has been shown to be highly effective, within and across domains, and it is our central point of comparison in the experiments we report in this paper.

There has been a surge of recent interest in compressing representations for IR. We highlight the following compression approaches. One of the earliest related works is the study by Izacard et al. (2020), who explore dimension reduction, product quantization, and passage filtering for single-vector retrievers. The Binary Passage Retriever (BPR; Yamada et al. 2021) learns to directly hash embeddings to binary codes using a differentiable tanh function. JPQ (Zhan et al., 2021a) and its extension, RepCONC (Zhan et al., 2021b), use product quantization (PQ) to compress embeddings and jointly train the query encoder along with the centroids produced by PQ via a ranking-oriented loss function. SDR (Cohen et al., 2021) uses an autoencoder to reduce the dimensionality of the contextual embeddings used for attention-based re-ranking and then applies a quantization scheme for further compression. In contrast to these systems, ColBERTv2 focuses on scalable late interaction retrieval and applies compression using a residual compression approach. We show in §3 that ColBERT's representations naturally lend themselves to residual compression. Techniques for residual compression are well studied (Barnes et al., 1996) and have previously been applied across several domains, including approximate nearest neighbor search (Wei et al., 2014; Ai et al., 2017), neural network parameter and activation quantization (Li et al., 2021b,a), and distributed deep learning (Chen et al., 2018). To the best of our knowledge, ColBERTv2 is the first approach to use residual compression for scalable neural IR.

Instead of compressing multi-vector representations as we do, much recent work has focused on improving the quality of single-vector models, which are often very sensitive to the specifics of supervision. This line of work can be decomposed into three directions: (1) distillation of more expressive architectures (Hofstätter et al., 2020), including explicit denoising (Qu et al., 2021; Ren et al., 2021b); (2) hard negative sampling (Xiong et al., 2020; Zhan et al., 2020a); and (3) improved pretraining. We adopt techniques similar to (1) and (2) for ColBERTv2's multi-vector representations (see §4.2).
Recent progress in retrieval has mostly focused on large-data evaluation, where many tens of thousands of annotated training queries are associated with the test domain, as in MS MARCO or Natural Questions (Kwiatkowski et al., 2019). In these benchmarks, queries tend to reflect high-popularity topics like movies and athletes in Wikipedia. In practice, user-facing IR and QA applications often pertain to domain-specific corpora, for which little to no training data is available and whose topics are under-represented in large public collections.

This out-of-domain regime has received recent attention with the BEIR (Thakur et al., 2021) benchmark. BEIR combines several existing datasets into a heterogeneous suite for "zero-shot IR" tasks, spanning bio-medical, financial, and scientific domains. While the datasets in BEIR provide a useful testbed, most of them capture broad semantic relatedness tasks (citations, counterarguments, or duplicate questions) instead of natural search tasks, or else they focus on high-popularity entities like those in Wikipedia. In §5, we introduce LoTTE, a new dataset for out-of-domain evaluation of IR models, exhibiting natural search queries over long-tail topics.

ColBERT (Khattab and Zaharia, 2020) decomposes representations and similarity computation at the token level. Because of this compositional architecture, we hypothesize that ColBERT exhibits a "lightweight" semantic space: without any special re-training, vectors corresponding to each sense of a word would cluster very closely, with only minor variation due to context. If this hypothesis is true, we would expect the embeddings corresponding to each token in the vocabulary to localize in only a small number of regions in the embedding space, corresponding to the contextual "senses" of the token.

To validate this hypothesis, we analyze the ColBERT embeddings corresponding to the tokens in the MS MARCO Passage Ranking (Nguyen et al., 2016) collection: we perform k-means clustering of the nearly 600M embeddings (corresponding to 27,000 unique tokens) into k = 2^18 clusters. As a baseline, we repeat this clustering with random embeddings but keep the true distribution of tokens. Figure 1 presents empirical cumulative distribution function (eCDF) plots of the number of distinct non-stopword tokens appearing in each cluster (1a) and the number of distinct clusters in which each token appears (1b); here we rank tokens by the number of clusters they appear in and designate the top 1% as stopwords. Most tokens appear in only a small fraction of the centroids: in particular, roughly 90% of clusters have ≤ 16 distinct tokens with the ColBERT embeddings, whereas fewer than 50% of clusters have ≤ 16 distinct tokens with the random embeddings. This suggests that the centroids effectively map the ColBERT semantic space.

Table 1 presents examples to highlight the semantic space captured by the centroids. The most frequently appearing tokens in cluster #917 relate to photography; these include, for example, 'photos' and 'photographs'. If we then examine the additional clusters in which these tokens appear, we find substantial semantic overlap between these new clusters (e.g., Photos-Photo, Photo-Image-Picture) and cluster #917. We observe a similar effect with tokens appearing in cluster #216932, comprising tornado-related terms. This analysis indicates that cluster centroids can summarize the ColBERT representations with high precision. In §4.3, we propose a residual compression mechanism that uses these centroids, along with minor refinements at the dimension level, to efficiently encode late-interaction vectors.
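The clustering analysis above can be reproduced in rough outline with off-the-shelf k-means, for example via faiss. The sketch below is illustrative: variable names, the data layout, and the choice to train on the full set of embeddings (rather than a subsample) are our assumptions, not the paper's exact procedure.

```python
# Rough sketch of the Section 3 clustering analysis. Assumes the ColBERT token
# embeddings and their corresponding token IDs are available as numpy arrays.
import faiss
import numpy as np
from collections import defaultdict

def cluster_analysis(embeddings: np.ndarray, token_ids: np.ndarray, k: int = 2**18):
    """embeddings: (n, dim) float32; token_ids: (n,) token ID of each embedding."""
    d = embeddings.shape[1]                              # e.g., 128 for ColBERT
    kmeans = faiss.Kmeans(d, k, niter=20, verbose=True)  # in practice, train on a subsample
    kmeans.train(embeddings)
    _, assignments = kmeans.index.search(embeddings, 1)  # nearest centroid per embedding
    assignments = assignments.ravel()

    tokens_per_cluster = defaultdict(set)   # cluster ID -> distinct token IDs
    clusters_per_token = defaultdict(set)   # token ID  -> distinct cluster IDs
    for tok, c in zip(token_ids.tolist(), assignments.tolist()):
        tokens_per_cluster[c].add(tok)
        clusters_per_token[tok].add(c)

    counts = np.array([len(s) for s in tokens_per_cluster.values()])
    # Fraction of clusters containing at most 16 distinct tokens (cf. Figure 1a).
    print("clusters with <= 16 distinct tokens:", float(np.mean(counts <= 16)))
    return tokens_per_cluster, clusters_per_token
```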
We now introduce ColBERTv2, which improves the quality of multi-vector retrieval models (§4.2) while reducing their space footprint (§4.3).

ColBERTv2 adopts the late interaction architecture of ColBERT, depicted in Figure 2. Queries and passages are independently encoded with BERT, and the output embeddings encoding each token are projected to a lower dimension. During offline indexing, every passage d in the corpus is encoded into a set of vectors, and these vectors are stored. At search time, the query q is encoded into a multi-vector representation, and its similarity to a passage d is computed as the summation of query-side "MaxSim" operations, namely, the largest cosine similarity between each query token embedding and all passage token embeddings:

$S_{q,d} = \sum_{i=1}^{N} \max_{j=1}^{M} Q_i \cdot D_j^\top \qquad (1)$

where Q is a matrix encoding the query with N vectors and D encodes the passage with M vectors. The intuition of this architecture is to align each query token with the most contextually relevant passage token, quantify these matches, and combine the partial scores across the query. We refer to Khattab and Zaharia (2020) for a more detailed treatment of late interaction.

Training a neural retriever typically requires positive and negative passages for each query in the training set. Khattab and Zaharia (2020) train ColBERT using the official $\langle q, d^+, d^- \rangle$ triples of MS MARCO. For each query, a positive $d^+$ is human-annotated, and each negative $d^-$ is sampled from unannotated BM25-retrieved passages. Subsequent work has identified several weaknesses in this standard supervision approach (see §2.3). Our goal is to adopt a simple, uniform supervision scheme that selects challenging negatives, avoids mislabeled positives, and does not penalize the model for retrieving unlabeled positives (i.e., false negatives).

To this end, we start with a ColBERT model trained with triples as in Khattab and Zaharia (2020) and Khattab et al. (2021b), using it to index the training passages with ColBERTv2 compression. For each training query, we retrieve the top-k passages. We feed each of those query-passage pairs into a cross-encoder reranker. We then collect w-way tuples consisting of a query, a highly-ranked passage (or labeled positive), and one or more lower-ranked passages. Like RocketQAv2 (Ren et al., 2021b), we use a KL-divergence loss to distill the cross-encoder's scores into the ColBERT architecture. We also employ in-batch negatives per GPU, where a cross-entropy loss is applied between the query and its positive against all passages corresponding to other queries in the same batch. This strategy can be iterated one or more times to refresh the index and thus the negatives sampled by the model.

Denoised training with hard negatives has been positioned in recent work as a way to bridge the gap between single-vector and interaction-based models, including late interaction architectures like ColBERT. Our results in §6 reveal that such supervision can improve multi-vector models dramatically, resulting in state-of-the-art retrieval quality.
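The scoring of Equation 1 and the distillation objective described above can be sketched compactly in PyTorch. The following is a minimal illustration under simplifying assumptions (the positive occupies the first slot of each w-way tuple, embeddings are L2-normalized, and all names are ours), not the official ColBERTv2 training code.

```python
# Minimal PyTorch sketch of late-interaction scoring (Equation 1) and the
# KL-divergence distillation loss with in-batch negatives (Section 4.2).
import torch
import torch.nn.functional as F

def colbert_score(Q: torch.Tensor, D: torch.Tensor) -> torch.Tensor:
    """Q: (B, N, dim) query embeddings; D: (B, W, M, dim) embeddings of W candidate
    passages per query. Embeddings are assumed L2-normalized. Returns (B, W) scores."""
    sim = torch.einsum("bnd,bwmd->bwnm", Q, D)   # every query token vs. every passage token
    return sim.max(dim=-1).values.sum(dim=-1)    # MaxSim per query token, summed over tokens

def distillation_step(Q, D, teacher_scores, temperature: float = 1.0):
    """teacher_scores: (B, W) cross-encoder scores over the same W-way tuples."""
    student_scores = colbert_score(Q, D)                          # (B, W)
    loss_kl = F.kl_div(
        F.log_softmax(student_scores / temperature, dim=-1),      # student log-probs
        F.log_softmax(teacher_scores / temperature, dim=-1),      # teacher log-probs
        reduction="batchmean",
        log_target=True,
    )
    # In-batch negatives: score each query against the positives (assumed in slot 0)
    # of every query in the batch and apply a cross-entropy loss on the diagonal.
    positives = D[:, 0]                                           # (B, M, dim)
    ib_scores = torch.einsum("bnd,cmd->bcnm", Q, positives).max(-1).values.sum(-1)  # (B, B)
    labels = torch.arange(Q.size(0), device=Q.device)
    loss_ce = F.cross_entropy(ib_scores, labels)
    return loss_kl + loss_ce
```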
The analysis from §3 shows that ColBERT vectors cluster into regions that capture highly-specific token semantics. We exploit this regularity with a residual representation that dramatically reduces the space footprint of late interaction models. Given a set of centroids C, ColBERTv2 encodes each vector v as the index t of its closest centroid $C_t$ and a quantized vector $\tilde{r}$ that approximates the residual $r = v - C_t$. At search time, we use the centroid index t and the residual $\tilde{r}$ to recover an approximation $\tilde{v} = C_t + \tilde{r}$. To encode r, we quantize every dimension of r into one or two bits. In §6, we find that this simple encoding preserves model quality across a wide variety of downstream tasks, dataset domains, and model checkpoints, while considerably lowering storage costs relative to the 32- or 16-bit precision used by existing late interaction systems.

In principle, our b-bit encoding of n-dimensional vectors needs $\lceil \log_2 |C| \rceil + bn$ bits per vector. In practice, with n = 128, we use four bytes to capture up to $2^{32}$ centroids and 16 or 32 bytes (for b = 1 or b = 2) to encode the residual. This total of 20 or 36 bytes per vector contrasts with ColBERT's use of 256-byte vector encodings at 16-bit precision.

Given a corpus of passages, the indexing stage precomputes all passage embeddings and organizes their representations to support fast nearest-neighbor search. ColBERTv2 divides indexing into three stages, described below.

Centroid Selection. In the first stage, ColBERTv2 selects a set of cluster centroids C. These are embeddings that ColBERTv2 uses to support residual encoding (§4.3) and also for nearest-neighbor search (§4.5). We find that setting |C| proportional to the square root of the number of embeddings in the corpus works well empirically. To create the centroids, we apply k-means clustering to the embeddings produced by invoking our BERT encoder over a sample of all passages.

Passage Encoding. Having selected the centroids, we encode every passage in the corpus. This entails invoking the BERT encoder and compressing the output embeddings as described in §4.3, assigning each embedding to the nearest centroid and computing a quantized residual. Once a chunk of passages is encoded, the compressed representations are saved to disk.

Index Inversion. To support fast nearest-neighbor search, we group together the embedding IDs that correspond to the same centroid and save this inverted list to disk. At search time, this allows us to quickly find token-level embeddings similar to those in a query.

Given a query representation Q, retrieval starts with candidate generation. For every vector $Q_i$ in the query, the nearest $n_{\text{probe}} \ge 1$ centroids are found. Using the inverted list, ColBERTv2 identifies the passage embeddings close to these centroids, decompresses them, and computes the cosine similarity between every query vector and each of these passage embeddings. For each query vector, the scores are then grouped by passage ID, and scores corresponding to the same passage are max-reduced. This allows ColBERTv2 to conduct an approximate "MaxSim" operation per query vector, akin to the approximation explored for scoring by Macdonald and Tonellotto (2021) but applied here for candidate generation. This computes a lower bound on the true MaxSim (§4.1) using only the embeddings identified via the inverted list. These lower bounds are summed across the query-length dimension, and the top-scoring candidate passages based on these approximate scores are selected for the next stage, namely ranking. Ranking loads the complete set of embeddings of each candidate passage and applies the full scoring function of Equation 1 over all its embeddings. The results are then sorted by score and returned.
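To make the residual encoding and the decompression used at search time concrete, here is a rough numpy sketch. The per-dimension uniform bucketing below is an illustrative simplification (the actual ColBERTv2 quantizer and its bit-packing may differ), and all names are ours.

```python
# Rough sketch of residual compression (Section 4.3): each vector is stored as
# the ID of its nearest centroid plus a b-bit-per-dimension quantized residual.
import numpy as np

def compress(vectors: np.ndarray, centroids: np.ndarray, b: int = 2):
    """vectors: (n, dim) float32; centroids: (|C|, dim) float32; b: bits per dimension."""
    # Nearest-centroid assignment (brute force for clarity; an index is used in practice).
    dists = ((vectors[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    codes = dists.argmin(axis=1).astype(np.uint32)   # 4 bytes per vector (up to 2^32 centroids)
    residuals = vectors - centroids[codes]
    lo, hi = float(residuals.min()), float(residuals.max())
    buckets = np.clip(np.round((residuals - lo) / (hi - lo) * (2**b - 1)), 0, 2**b - 1)
    # A real implementation packs b bits per dimension (16 or 32 bytes for dim=128);
    # uint8 is used here only to keep the sketch simple.
    return codes, buckets.astype(np.uint8), (lo, hi)

def decompress(codes, buckets, value_range, centroids, b: int = 2):
    lo, hi = value_range
    residuals = buckets.astype(np.float32) / (2**b - 1) * (hi - lo) + lo
    return centroids[codes] + residuals              # approximates v as C_t + quantized residual
```

With 128-dimensional vectors this comes to 20 bytes (b = 1) or 36 bytes (b = 2) per vector once the residual bits are packed. As a rough sanity check, for the roughly 600M MS MARCO embeddings analyzed in §3 this amounts to about 11 GiB or 20 GiB of compressed embeddings, which together with the roughly 9 GiB inverted list is consistent with the 20 GiB and 29 GiB index sizes reported in §6.3. At search time, only the embeddings in the probed inverted lists are decompressed this way before the approximate MaxSim described above.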
We introduce LoTTE (pronounced latte), a new dataset for Long-Tail Topic-stratified Evaluation for IR. To complement the out-of-domain evaluations in BEIR (Thakur et al., 2021), LoTTE focuses on natural user queries that pertain to long-tail topics, ones that might not be present in an entity-centric knowledge base like Wikipedia. LoTTE consists of 12 test sets, each with 500-2000 queries and 100k-2M passages. The test sets are explicitly divided by topic, and each test set is accompanied by a validation set of related but disjoint content. The 12 test (and dev) sets include two "pooled" settings, which aggregate the passages and queries across topics to create a cross-domain challenge. Table 2 outlines the composition of LoTTE.

We derive the topics and passage corpora from the answer posts across various StackExchange forums, which are question-and-answer communities that target individual topics (e.g., "physics" or "bicycling"). We gather forums from five overarching domains: writing, recreation, science, technology, and lifestyle. To evaluate retrievers, we collect Search and Forum queries, each of which is linked to one or more answer posts in its corpus. Example queries, and snippets from posts that answer them in the corpora, are shown in Table 3.

Table 3: Examples of queries and shortened snippets of answer passages from LoTTE. The first two examples show "search" queries, whereas the last two are "forum" queries. Snippets are shortened for presentation.
Q: what is the difference between root and stem in linguistics? A: A root is the form to which derivational affixes are added to form a stem. A stem is the form to which inflectional affixes are added to form a word.
Q: are there any airbenders left? A: the Fire Nation had wiped out all Airbenders while Aang was frozen. Tenzin and his 3 children are the only Airbenders left in Korra's time.
Q: Why are there two Hydrogen atoms on some periodic tables? A: some periodic tables show hydrogen in both places to emphasize that hydrogen isn't really a member of the first group or the seventh group.
Q: How can cache be that fast? A: the cache memory sits right next to the CPU on the same die (chip), it is made using SRAM which is much, much faster than the DRAM.

Search Queries. We collect search queries from GooAQ (Khashabi et al., 2021), a recent dataset of Google search-autocomplete queries and their answer boxes, which we filter for queries whose answers link to a specific StackExchange post. To map these natural queries to their answers, Google Search likely relies, as Khashabi et al. (2021) hypothesize, on a wide variety of signals for relevance, including expert annotations, user clicks, and hyperlinks, as well as specialized QA components for various question types, with access to the post title and question. Using those annotations as ground truth, we evaluate the models on their capacity for retrieval using only the free text of the answer posts (i.e., no hyperlinks or user clicks, question title or body, etc.), which poses a significant challenge for IR and NLP systems trained only on public datasets.

Forum Queries. We collect the forum queries by extracting question-like post titles from the StackExchange communities and use their answer posts as targets.
These questions tend to exhibit a wider variety than the "search" queries, whereas the search queries may reflect more natural search patterns. For search as well as forum queries, the resulting evaluation set consists of a query and a target StackExchange page. Similar to evaluation in the Open-QA literature (Karpukhin et al., 2020; Khattab et al., 2021b), we evaluate retrieval quality by computing the Success@5 (S@5) metric, which awards a point to the system for each query where it finds an accepted or upvoted (score ≥ 1) answer from the target page in the top-5 hits.

We now evaluate ColBERTv2 on passage retrieval tasks, testing its quality within the training domain (§6.1) as well as outside the training domain in zero-shot settings (§6.2). We discuss the space savings achieved by ColBERTv2 in §6.3.

Next, we evaluate ColBERTv2 outside the training domain using BEIR (Thakur et al., 2021), Wikipedia Open QA as in Khattab et al. (2021b), and LoTTE. We compare against a wide range of recent retrieval results from the literature on the BEIR benchmark, selecting the best three (namely, ColBERTv2, SPLADEv2, and vanilla ColBERT) for the Open QA and LoTTE evaluations.

BEIR. We start with BEIR, reporting the quality of models that do not incorporate distillation from cross-encoders, namely ColBERT (Khattab and Zaharia, 2020), DPR-MARCO, ANCE (Xiong et al., 2020), and MoDIR, as well as models that do utilize distillation, namely TAS-B (Hofstätter et al., 2021) and SPLADEv2 (Formal et al., 2021a). We divide the table into "search" (i.e., natural queries and questions) and "semantic relatedness" (e.g., citation-relatedness and claim verification) tasks to reflect the nature of queries in each dataset.

Table 5a reports the official nDCG@10 results. Among the models without distillation, we see that the vanilla ColBERT model outperforms the single-vector systems DPR, ANCE, and MoDIR across all but three tasks. ColBERT often outpaces all three systems by large margins and, in fact, outperforms the TAS-B model, which utilizes distillation, on most datasets. Shifting our attention to models with distillation, we see a similar pattern: while distillation-based models are generally stronger than their vanilla counterparts, the models that decompose scoring into term-level interactions, ColBERTv2 and SPLADEv2, are almost always considerably stronger than the models that do not.

Looking more closely into the comparison between SPLADEv2 and ColBERTv2, we see that ColBERTv2 has an advantage on six benchmarks and ties SPLADEv2 on two, with the largest improvements attained on NQ, TREC-COVID, and FiQA-2018, all of which feature natural search queries. On the other hand, SPLADEv2 has the lead on five benchmarks, displaying the largest gains on Climate-FEVER (C-FEVER) and HotPotQA. In C-FEVER, the input queries are sentences making climate-related claims and, as a result, do not reflect the typical characteristics of search queries. In HotPotQA, queries are written by crowdworkers who have access to the target passages. This is known to lead to artificial lexical bias, as in the SQuAD benchmark, where crowdworkers copy terms from the passages into their questions.

Wikipedia Open QA. As a further test of out-of-domain generalization, we evaluate the strongest three models (ColBERTv2, SPLADEv2, and vanilla ColBERT) on retrieval for open-domain question answering, similar to the out-of-domain setting of Khattab et al. (2021b). We report Success@5 for the open-domain versions (Karpukhin et al., 2020) of the Natural Questions (NQ; Kwiatkowski et al. 2019), TriviaQA (TQ; Joshi et al. 2017), and SQuAD (Rajpurkar et al., 2016) datasets in Table 5b. As a baseline, we include BM25 results obtained with the Anserini (Yang et al., 2018a) toolkit. We observe that ColBERTv2 outperforms both vanilla ColBERT and SPLADEv2 across all datasets and metrics, with improvements of up to 4.6 points over SPLADEv2.
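Success@5, used for both the Wikipedia Open-QA and LoTTE evaluations, is simple to compute. The sketch below assumes a particular data layout (query ID to ranked passage IDs, and query ID to gold passage IDs) purely for illustration.

```python
# Minimal sketch of the Success@5 metric: a query scores 1 if any of its top-5
# retrieved passages is a gold answer passage (for LoTTE, an accepted or
# upvoted answer from the target page). Data layout is an assumption.
from typing import Dict, List, Set

def success_at_k(rankings: Dict[str, List[str]], gold: Dict[str, Set[str]], k: int = 5) -> float:
    """rankings: query ID -> ranked passage IDs; gold: query ID -> relevant passage IDs."""
    hits = sum(1 for qid, ranked in rankings.items()
               if any(pid in gold.get(qid, set()) for pid in ranked[:k]))
    return 100.0 * hits / max(len(rankings), 1)
```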
LoTTE. Next, we analyze performance on the LoTTE test benchmark, testing also BM25 and the recently released RocketQAv2, using the official checkpoint trained on MS MARCO (https://github.com/PaddlePaddle/RocketQA). We report Success@5 for each corpus on both search queries and forum queries. Considering the baselines, we observe that while RocketQAv2 tends to have a slight advantage over SPLADEv2 on the "search" queries, SPLADEv2 is considerably more effective on the "forum" tests. We hypothesize that the search queries, obtained from Google (through GooAQ), are more similar to MS MARCO than the forum queries and, as a result, the latter stress generalization more heavily, rewarding term-decomposed models like SPLADEv2 and ColBERTv2. Similar to the Wikipedia Open-QA results, we find that ColBERTv2 outperforms the baselines across all topics for both query types, improving upon SPLADEv2 and RocketQAv2 by up to 3.7 and 8.1 points, respectively.

ColBERTv2's residual compression approach significantly reduces index sizes compared to ColBERT: whereas ColBERT requires 154 GiB to store the index for MS MARCO, ColBERTv2 requires only 20 GiB or 29 GiB when compressing embeddings to 1 or 2 bit(s) per dimension, respectively, resulting in compression ratios of 5-8×. This storage includes the 9 GiB used by ColBERTv2 to store the inverted list.

In this paper, we introduced ColBERTv2, a retriever that advances the quality and space efficiency of multi-vector representations. Our analysis of the embedding space of ColBERT revealed that cluster centroids tend to capture context-aware semantics of the token-level representations, and we proposed a residual representation that leverages these patterns to dramatically reduce the footprint of multi-vector systems. We then explored improving the supervision of multi-vector retrieval and found that their quality improves considerably upon distillation from a cross-encoder system. This allows multi-vector retrievers to considerably outperform single-vector systems in within-domain and out-of-domain evaluations, which we conducted extensively across 28 datasets, establishing state-of-the-art quality while exhibiting a competitive space footprint.

References
Optimized Residual Vector Quantization for Efficient Approximate Nearest Neighbor Search
DBpedia: A Nucleus for a Web of Open Data
Advances in Residual Vector Quantization: A Review. IEEE Transactions on Image Processing
Overview of Touché 2020: Argument Retrieval
A Full-text Learning to Rank Dataset for Medical Information Retrieval
AdaComp: Adaptive Residual Gradient Compression for Data-Parallel Distributed Training
SPECTER: Document-level Representation Learning using Citation-informed Transformers
Besnik Fetahu, and Amir Ingber. 2021. SDR: Efficient Neural Reranking using Succinct Document Representation
Context-aware term weighting for first stage passage retrieval
BERT: Pre-training of deep bidirectional transformers for language understanding
Massimiliano Ciaramita, and Markus Leippold. 2020. CLIMATE-FEVER: A Dataset for Verification of Real-World Climate Claims
SPLADE v2: Sparse Lexical and Expansion Model for Information Retrieval
SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking
Unsupervised corpus aware language model pre-training for dense passage retrieval
Modularized Transfomer-based Ranking Framework
COIL: Revisit Exact Lexical Match in Information Retrieval with Contextualized Inverted List
REALM: Retrieval-augmented language model pre-training
Mete Sertkan, and Allan Hanbury. 2020. Improving Efficient Neural Ranking Models with Cross-Architecture Knowledge Distillation
Efficiently Teaching an Effective Dense Retriever with Balanced Topic Aware Sampling
Poly-encoders: Architectures and Pre-training Strategies for Fast and Accurate Multi-sentence Scoring
A memory efficient baseline for open domain question answering
Billion-scale similarity search with GPUs
TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension
Dense passage retrieval for open-domain question answering
Hannaneh Hajishirzi, and Chris Callison-Burch. 2021. GooAQ: Open Question Answering with Diverse Answer Types
Baleen: Robust Multi-Hop Reasoning at Scale via Condensed Retrieval
Relevance-guided supervision for OpenQA with ColBERT
ColBERT: Efficient and effective passage search via contextualized late interaction over BERT
Natural Questions: A benchmark for question answering research
Latent retrieval for weakly supervised open domain question answering
TRQ: Ternary Neural Networks With Residual Quantization
Residual Quantization for Low Bit-width Neural Networks
2021. A Few Brief Notes on DeepImpact, COIL, and a Conceptual Framework for Information Retrieval Techniques
Distilling Dense Representations for Ranking using Tightly-Coupled Teachers
A Double Residual Compression Algorithm for Efficient Distributed Learning
Efficient Document Re-Ranking for Transformers by Precomputing Term Representations
On approximate nearest neighbour selection for multi-stage dense retrieval
WWW'18 Open Challenge: Financial Opinion Mining and Question Answering
Learning passage impacts for inverted indexes
MS MARCO: A human-generated MAchine reading COmprehension dataset
Passage Re-ranking with BERT
Hindsight: Posterior-guided Training of Retrievers for Improved Open-Ended Generation
RocketQA: An optimized training approach to dense passage retrieval for open-domain question answering
SQuAD: 100,000+ questions for machine comprehension of text
2021a. PAIR: Leveraging Passage-centric Similarity Relation for Improving Dense Passage Retrieval
RocketQAv2: A Joint Training Method for Dense Passage Retrieval and Passage Reranking
Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. 2021. BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models
Christos Christodoulopoulos, and Arpit Mittal
TREC-COVID: Constructing a Pandemic Information Retrieval Test Collection
Retrieval of the Best Counterargument without Prior Topic Knowledge
Fact or Fiction: Verifying Scientific Claims
Projected Residual Vector Quantization for ANN Search
Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval
Efficient passage retrieval with hashing for open-domain question answering
Anserini: Reproducible ranking baselines using Lucene
HotpotQA: A dataset for diverse, explainable multi-hop question answering
Jointly Optimizing Query Encoder and Product Quantization to Improve Retrieval Performance
Learning Discrete Representations via Constrained Clustering for Effective and Efficient Dense Retrieval
Learning to retrieve: How to train a dense retrieval model effectively and efficiently
RepBERT: Contextualized text embeddings for first-stage retrieval

Acknowledgments. This research was supported in part by affiliate members and other supporters of the Stanford DAWN project (Ant Financial, Facebook, Google, and VMware) as well as Cisco, SAP, Virtusa, and the NSF under CAREER grant CNS-1651570. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.