key: cord-0273123-fzsnv6py authors: Guo, Jiafeng; Cai, Yinqiong; Fan, Yixing; Sun, Fei; Zhang, Ruqing; Cheng, Xueqi title: Semantic Models for the First-stage Retrieval: A Comprehensive Review date: 2021-03-08 journal: nan DOI: 10.1145/3486250 sha: 9f808d689639f3eb577614d2236ad7feaa3b6ecb doc_id: 273123 cord_uid: fzsnv6py Multi-stage ranking pipelines have been a practical solution in modern search systems, where the first-stage retrieval is to return a subset of candidate documents, and latter stages attempt to re-rank those candidates. Unlike re-ranking stages going through quick technique shifts during past decades, the first-stage retrieval has long been dominated by classical term-based models. Unfortunately, these models suffer from the vocabulary mismatch problem, which may block re-ranking stages from relevant documents at the very beginning. Therefore, it has been a long-term desire to build semantic models for the first-stage retrieval that can achieve high recall efficiently. Recently, we have witnessed an explosive growth of research interests on the first-stage semantic retrieval models. We believe it is the right time to survey current status, learn from existing methods, and gain some insights for future development. In this paper, we describe the current landscape of the first-stage retrieval models under a unified framework to clarify the connection between classical term-based retrieval methods, early semantic retrieval methods and neural semantic retrieval methods. Moreover, we identify some open challenges and envision some future directions, with the hope of inspiring more researches on these important yet less investigated topics. Large-scale query-document retrieval is a key problem in search systems, e.g., Web search engines, which aims to return a set of relevant documents from a large document repository given a user query. To balance the search efficiency and effectiveness, modern search systems typically employ a multi-stage ranking pipeline in practice, as shown in Figure 1 . The first-stage retrieval aims to return an initial set of candidate documents from a large repository by some cheaper ranking models assisted by some specially-designed indexing structures. Later, several re-ranking stages take more complex and effective ranking models to prune and improve the ranked document list output by the previous stage. Such a "retrieval and re-ranking" pipeline has been widely adopted in both academia [40, 147] and industry [133, 167] and achieved state-of-the-art results on multiple IR benchmarks [59, 160, 205] . Besides the pipeline architecture, to achieve a successful retrieval, it is generally recognized that the system needs to understand the query and the document well so that it can find relevant results to users' information needs. Therefore, semantic models are expected throughout the pipeline but with different requirements and goals at different stages. For the first-stage retrieval, the model aims to recall all potentially relevant documents from the whole collection. Thus, it is desired to build semantic models that can achieve high recall efficiently, i.e., to return a subset of documents that contain relevant documents as many as possible within a short time-span. For latter re-ranking stages, only a small number of documents are fed into the ranking model. 
As a result, semantic models used for re-ranking are allowed to employ more sophisticated architectures to achieve high precision, i.e., to put as many relevant documents as possible to top positions of the list. During past decades, we have witnessed re-ranking stages going through quick technique shifts towards more and more powerful semantic models, from early probabilistic models [177, 178, 202] , learning to rank models [126, 134] , to recent neural ranking models [82, 96, 161] . Specifically, with BERT-style pre-training tasks on cross-attention models, better contextualized representations and deeper interactions between query-document pairs have led to significant improvement on the re-ranking effectiveness [161, 163] . However, these models are often very computationally expensive, which makes them unable to handle high-throughput incoming queries each with a large collection of candidate documents in the first-stage retrieval. On the contrary, the first-stage retrieval has long been dominated by classical term-based models. Specifically, the discrete symbolic representation, i.e., bag-of-words (BoW) representation, is adopted for both queries and documents, and the inverted indexing technique is leveraged networks for re-ranking stages. For the first-stage retrieval, the booklet by Li and Xu [127] talked about early semantic retrieval models, but without recent booming neural models for the first-stage retrieval. Recently, Lin et al. [131] discussed several pre-training models for the first-stage retrieval and re-ranking stages. Different from them, we make an comprehensive overview of semantic models for the first-stage retrieval under a unified framework, including early semantic retrieval models, neural semantic retrieval models and the connection between them. To sum up, our contributions include: (1) We describe the current landscape of the first-stage retrieval models under a unified framework to clarify the connection between the classical term-based retrieval, early methods for semantic retrieval and neural methods for semantic retrieval. (2) We provide a comprehensive and up-to-date review of semantic retrieval models, with a brief review of early semantic retrieval models and a detailed description of recent neural semantic retrieval models. (3) We summarize neural semantic retrieval models into three paradigms from the perspective of model architecture, i.e., sparse retrieval methods, dense retrieval methods and hybrid retrieval methods. We also discuss key topics on model learning, including loss functions and negative sampling strategies. (4) We discuss some open challenges and suggest potentially promising directions for future works. We organize this survey as follows. We first introduce three typical applications of semantic retrieval models in Section 2. Then, we provide some background knowledge, including problem formalization, index methods and classical term-based retrieval methods in Section 3. We sketch early methods for semantic retrieval in Section 4. In Section 5, we review existing neural methods for semantic retrieval from the perspective of model architecture, and introduce key topics on model learning. Finally, we discuss challenges and future directions in Section 6, and conclude this survey in Section 7. The first-stage retrieval plays an essential role in almost all large-scale IR applications. 
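To make this pipeline structure concrete, the following minimal sketch outlines the retrieve-then-re-rank control flow discussed above. The function and parameter names are illustrative only; the retriever and re-ranker (e.g., a BM25 index lookup and a BERT cross-encoder) are caller-supplied components rather than implementations of any particular system.

```python
def multi_stage_search(query, retrieve, rerank, retrieve_k=1000, return_k=10):
    """Two-stage search: recall-oriented retrieval, then precision-oriented re-ranking.

    `retrieve(query, k)` and `rerank(query, doc)` are caller-supplied callables.
    """
    # Stage 1: cheap, recall-oriented retrieval over the full collection.
    candidates = retrieve(query, retrieve_k)
    # Stage 2: expensive, precision-oriented re-ranking of the small candidate set.
    scored = [(doc, rerank(query, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in scored[:return_k]]
```

The cut-off retrieve_k controls the recall/efficiency trade-off of the first stage, while return_k is the size of the final result list shown to users.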
In this section, we describe three major text retrieval applications, including ad-hoc retrieval [12] , open-domain question answering [191, 206] , and community-based question answering [31, 193] . Ad-hoc retrieval is a typical retrieval task, and there has been a long research history on ad-hoc retrieval models. In this task, users express their information needs as queries, then trigger searches in the retrieval system to obtain relevant documents. All retrieved documents are often returned as a ranked list according to the degree of relevance to the user query. A major characteristic of ad-hoc retrieval is the length heterogeneity between the query and the document. Queries are often short in length, consisting of only a few keywords [151] . While documents have longer texts, ranging from multiple sentences to several paragraphs. Such heterogeneity between queries and documents leads to the classical vocabulary mismatch problem, which has been a long-term challenge in both the retrieval stage as well as re-ranking stages in ad-hoc retrieval [127] . The earliest datasets to support reliable evaluation of the first-stage retrieval models are always based on TREC collections, such as Associated Press Newswire (AP), Wall Street Journal (WSJ) and Robust [117] . The number of documents in these collections is usually hundreds of thousands, and documents are usually news articles. Later, larger collections based on Web data, such as ClueWeb [44] , are built for the evaluation of retrieval technology. However, the number of queries in these datasets is only a few hundred, which is not enough for the training of neural-based retrieval models. In recent years, large-scale datasets, such as MS MARCO [160] , TREC CAR [59] and TREC Deep Learning Track [47] , are released, which label relevant documents for hundreds of thousands of queries. The availability of these large-scale datasets has greatly promoted the development of neural retrieval models. Besides, there are also some domain-specific retrieval datasets, e.g., GOV2 [43] , TREC Medical Records Track (MedTrack) and TREC-COVID [203] , which are also commonly used for the evaluation. Open-domain question answering (OpenQA) is a task to answer any sort of (factoid) questions that humans might ask, using a large collection of documents (e.g., Wikipedia, or Web page) as the information source [110] . Unlike the ad-hoc retrieval which aims to return a ranked list of documents, the OpenQA task is to extract a text span as the answer to the question. To achieve this, most existing works build the OpenQA system as a two-stage pipeline [36] : (1) A document retriever selects a small set of relevant documents that probably contain the answer from a large-scale collection; (2) A document reader extracts the answer from relevant documents returned by the document retriever. In our work, we only consider the document retriever component since the document reader is out of the scope of this paper. Typically, the question in OpenQA tasks is a natural language sentence, which has well-formatted linguistic structures. While the document is often a small snippet of text, ranging from several sentences to a passage [56, 63] . Moreover, relevant documents are required to be not only topically related to but also correctly address the question, which requires more semantics understanding except for exact term matching features. For the evaluation of the first-stage retrieval models on OpenQA tasks, several benchmark datasets are available. 
Most commonly used datasets, such as SQuAD-open [36] , SearchQA [63] , TriviaQAunfiltered [107] and Natural Questions Open [116] , have tens of thousands of queries for model training. Several smaller-scale datasets, e.g., WebQuestions [20] and CuratedTREC [16] , are also often used for model evaluation. The document collection in these datasets is usually based on Wikipedia pages (e.g., SQuAD-open and Natural Questions Open) or Web pages (e.g. SearchQA, and WebQuestions), and queries are written by crowd-workers (e.g., SQuAD-open) or crawled from existing websites (e.g., SearchQA and TriviaQA-unfiltered). Community-based question answering (CQA) aims to address user's questions using the archived question-answer (QA) pairs in the repository, since CQA systems have already accumulated a large amount of high-quality human-generated QA pairs, such as Yahoo! Answers 1 , Stack Overflow 2 and Quora 3 . There are two different ways to produce the answer to a user's question. One is to directly retrieve answers from the collection if the answer exists [208] . The other is to select the duplicate question from the collection and take the accompanied answer as the result [212] . Both of these two ways require the retrieval system to firstly recall a subset of candidates from the whole collection, and then re-rank candidates to generate the final result. However, targets (i.e., answers and questions) in these two ways often have very different expressions, leading to different challenges in terms of semantic modeling. Firstly, the duplicate question retrieval needs to capture semantic similarities between words (phrases) since there are often different ways to express the same question. Secondly, the answer retrieval needs to model logical relationships between questions and answers. Although many datasets are constructed based on CQA data, few of them are suitable for evaluating the first-stage retrieval models. Existing related works usually conduct experiments on QQP 4 and WikiAnswers [66] datasets. There are also some other retrieval scenarios, such as entity linking [80] , e-commerce search [125, 128, 234] and sponsored search [68] . For these applications, academic researchers and industrial developers have realized the importance of utilizing semantic information for the first-stage retrieval. Due to page limitations, we will not discuss these works in this survey, but it is possible and necessary to generalize techniques applied in text retrieval to other retrieval tasks. In this section, we first characterize the first-stage retrieval by giving a unified formulation of the first-stage retrieval models. Then, we introduce typical indexing methods cooperating retrieval models to support efficient retrieval. Finally, we summarize classical term-based retrieval methods. Given a query , the first-stage retrieval aims to recall all potentially relevant documents from a large corpus C = { 1 , 2 , · · · , }. Different from re-ranking stages with a small set of candidates, the corpus size for the first-stage retrieval can range from millions (e.g., Wikipedia) to billions (e.g., the Web). Thus, efficiency is a crucial concern for models used in the first-stage retrieval. Formally, given a dataset D = {( , , )} =1 , where denotes a user query, =[ 1 , 2 , · · · , ] denotes a list of documents to the query , and = [ 1 , 2 , · · · , ] ∈ {1, 2, · · · , } is the corresponding relevance label of each document in . 
There exists a total order between relevance labels > − 1 > · · · > 1, where > denotes the order relation. Note here the number of labeled documents to each query is often significantly smaller than the corpus size , since it is impossible to manually annotate all the huge amount of documents. The goal of the first-stage retrieval is to learn a model (·, ·) from D that gives high scores to relevant ( , ) pairs and low scores to irrelevant ones. For any query-document pair ( , ), ( , ) gives a score that reflects the relevance degree between and , and thus allows one to rank all the documents in the corpus C according to predicted scores. Without loss of generality, the scoring function can be abstracted by the following unified formulation: where ∈ and ∈ are the input query and document, and two representation functions : → R 1 and : → R 2 map a sequence of tokens in and to their associated embeddings ( ) and ( ), respectively. To build a responsive model for the first-stage retrieval, it leads to a number of requirements on these three components: • The document representation function should be independent of the query since queries are unknown before the search system is deployed. In this way, document representations can be pre-computed and indexed offline with methods in Section 3.2. Meanwhile, this means that the ( ) component can be sophisticated to some extent since it has no impact on the online serving. • The query representation function is required to be as efficient as possible since it needs to compute query embeddings online. Thanks to the nature of independence, two components and can be identical or different, which is flexible enough to design models for different retrieval tasks with homogeneous or heterogeneous inputs. • To satisfy the real-time retrieval requirement, on the one hand the scoring function should be as simple as possible to minimize the amount of online computation, and on the other hand, it must take the indexing method into account. As mentioned above, one major difference between the first-stage retrieval and re-ranking stages is that the former does ranking on large-scale documents in the repository. Thus, the efficiency of the first-stage retrieval models is one of the core considerations. In practice, to support storing and fast retrieval of documents in the whole repository, retrieval systems need to build an index, where the indexing technique is crucial to the rapid response during the online serving. There are many indexing techniques, such as signature, inverted index, and dense vector index. Rather than exploring all the existing approaches, we only describe the fundamental principle of two typical indexing schemes. The inverted index is currently the most popular indexing scheme and is used for many applications due to its simplicity and efficiency. Before building an inverted index, each document in the collection is parsed and segmented into a list of tokens. Then, the inverted index is created, which mainly consists of a dictionary and a collection of posting lists. The dictionary includes all the terms found in the collection and their document frequencies. Each posting list records document identifiers, term occurrence frequencies, and possibly other information of documents in which the corresponding term appears. During the online serving, for a user's query, top most similar documents are fetched in turn with the help of the inverted index. Concretely, the query is processed with one term at a time. 
Initially, each document has a similarity of zero to the query. Then, for each query term , the similarity score of each document in 's posting list increases by the contribution of to the similarity of the query-document pair. Once all query terms have been processed, the largest similarity scores are identified, and the corresponding document list is returned to the user. In fact, many acceleration strategies are applied during the process to improve retrieval efficiency, but they are omitted here. More details about the inverted index technique could be found in [214, 241] . Along with the development of neural representation learning methods, dense vector index based on approximate nearest neighbor search algorithms is used to support the new representation paradigm. One of reasons why the inverted index works well is that the documents-term matrix is very sparse. However, most semantic retrieval models produce dense and distributed document representations, thus the inverted index method is no longer feasible to retrieve documents efficiently from a large collection. From the equation (1), the retrieval problem could be viewed as the nearest neighbor search problem [188] , once the query embedding and all the document embeddings have been calculated. This fundamental problem has been well studied in the research community [1, 8] . The simplest approach to the nearest neighbor search is the brute-force search, which scans all the candidates and computes similarity scores one by one. However, the brute-force search becomes impractical when the size of collections exceeds a certain point. Thus, most researches resort to an approximate nearest neighbor (ANN) search [11, 64, 129] , which allows for a slight loss in precision while yielding multiple orders of magnitude improvement in speed. Generally, existing ANN search algorithms can be categorized into four major types, including tree-based [17, 19] , hashing-based [53, 99] , quantization-based [79, 102] and proximity graph approaches [113, 144] . The earliest solutions to ANN search are based on locality-sensitive hashing [99] , but currently proximity graph methods [113, 144] yield a better performance among all the approaches in most respects based on a popular benchmark 5 . Graph-based methods build the index by retaining the neighborhood information for each individual data point towards other data points or a set of pivot points. Then, various greedy heuristics are proposed to navigate the proximity graph for a given query point. So far, several open-source libraries for ANN search, such as Faiss [105] and SPTAG [39] , have been developed, and search engines 6, 7, 8 This subsection provides an overview of classical term-based methods for the first-stage retrieval, including the vector space model, probabilistic retrieval models and language models for IR. In general, these methods build representations of queries and documents based on the bag-of-words (BoW) assumption where each text is represented as a bag (multiset) of its words, disregarding grammar and even word order. Particularly, the representation functions and are set to be manually defined feature functions, such as word frequency, and dimensions of representations (i.e., 1 and 2 ) are generally equal to the vocabulary size. The representation functions and are usually different for queries and documents, but they all pledge the sparsity of representations so that the inverted index could be used to support efficient retrieval. 
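To illustrate the term-at-a-time procedure described earlier in this section, a minimal scoring loop over an inverted index might look as follows. The index layout and weighting are deliberately simplified, and the acceleration strategies mentioned above (skipping, pruning, caching) are omitted.

```python
from collections import defaultdict
import heapq

def term_at_a_time_search(query_terms, inverted_index, k=10):
    """Term-at-a-time scoring over an inverted index.

    `inverted_index` maps a term to its posting list: a list of
    (doc_id, term_weight) pairs, where the weight could come from
    TF-IDF, BM25, or a learned weighting scheme.
    """
    accumulators = defaultdict(float)            # doc_id -> running similarity, starts at zero
    for term in query_terms:
        for doc_id, weight in inverted_index.get(term, []):
            accumulators[doc_id] += weight       # add this term's contribution
    # Keep the k documents with the largest accumulated scores.
    return heapq.nlargest(k, accumulators.items(), key=lambda item: item[1])

# Toy usage: doc 1 matches both terms and is ranked first, then doc 3.
index = {"neural": [(1, 0.8), (2, 0.3)], "retrieval": [(1, 0.5), (3, 0.9)]}
print(term_at_a_time_search(["neural", "retrieval"], index, k=2))
```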
The early representative of term-based methods is the vector space model (VSM) [185] which represents queries and documents as high-dimensional sparse vectors in a common vector space. Under this framework, queries and documents are viewed as vectors with each dimension corresponding to a term in the vocabulary, where the weight of each dimension can be determined by different functions, e.g., term frequency (TF), inverse document frequency (IDF) or the composite of them [183, 184] . Then, one can use the similarity (usually cosine similarity) between a query vector and a document vector as the relevance measure of the query-document pair. The resulting scores can then be used to select the top relevant documents for the query. The VSM has become the fundamental of a series of IR solutions-the probabilistic retrieval model and language model for IR can be both viewed as the instantiation of VSM with different weighting schemes. Probabilistic methods are one of the oldest formal models in IR, which introduce the probability theory as a principled foundation to estimate the relevance probability ( = 1| , ) of a document to the query . The Binary Independence Model (BIM) [178] is the most original and influential probabilistic retrieval model. It represents documents and queries to binary term vectors, that an entry is 1 if the corresponding term occurs in the document, and otherwise the entry is 0. With these representations, "binary" and "term independency" assumptions are introduced by BIM. But these assumptions are contrary to facts, so a number of extensions are proposed to relax some assumptions of BIM, such as the Tree Dependence Model [202] and BM25 [177] . In particular, the BM25 model takes into account the document frequency, document length and term frequency, which has been widely used and quite successful across different academic researches as well as commercial systems [133, 167] . Instead of modeling the relevance probability explicitly, language models (LM) for IR [170] build a language model for each document , then documents are ranked based on the likelihood of generating the query , i.e., ( | ). The document language model is also built on the bag-of-words assumption, and could be instantiated as either multiple Bernoulli [170] or Multinomial [89, 150] . Experimental results in [170] prove the effectiveness of term weights coming from language models over the traditional TF-IDF weight. Moreover, language models provide another perspective for modeling retrieval tasks, and subsequently inspire many extended approaches [29, 229] . In summary, modeling relevance in a shallow lexical way, especially combined with the inverted index, endows classical term-based models a key advantage on efficiency, making it possible to retrieve from billions of documents quickly. However, such a paradigm is also accompanied by clear drawbacks, like the well-known vocabulary mismatch problem or not well capturing text semantics. Therefore, more sophisticated semantic models for improving the first-stage retrieval performance start to attract researchers' interests in the following. From the 1990s to the 2000s, extensive studies have been carried out to improve term-based retrieval methods. Most of them mine information from external resources or the collection itself to enrich query representations ( ), document representations ( ) or both of them for semantic retrieval. Here, we sketch a brief picture of some of them. 
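As a concrete reference point, one common instantiation of the BM25 scoring function mentioned above can be sketched as follows; the defaults k1 = 1.2 and b = 0.75 are conventional choices, and exact formulations vary slightly across systems.

```python
import math

def bm25_score(query_terms, doc_terms, doc_freq, num_docs, avg_doc_len, k1=1.2, b=0.75):
    """One common BM25 variant (a sketch, not the only parameterization in use).

    `doc_terms` is the tokenized document and `doc_freq[t]` is the number of
    documents in the collection that contain term t.
    """
    score = 0.0
    doc_len = len(doc_terms)
    for t in query_terms:
        tf = doc_terms.count(t)                  # term frequency in the document
        if tf == 0:
            continue
        df = doc_freq.get(t, 0)
        idf = math.log(1.0 + (num_docs - df + 0.5) / (df + 0.5))
        # Saturated term frequency with document-length normalization.
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return score
```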
To compensate for the mismatch between queries and documents, the query expansion technique is used to expand the original query with terms selected from external resources [217] . In this way, query representations ( ) are enriched, and more documents could be considered during the retrieval process through the extended query terms. Query expansion is the process of adding relevant terms to a query to improve retrieval effectiveness. There are a number of query expansion methods, and they can be classified into global methods [124, 171] and local methods [2, 230] . Global methods expand or reformulate query words by analyzing word co-occurrences from the corpus being searched or using an external hand-crafted thesaurus (e.g., WordNet) [204] . Although a number of data-driven query expansion methods, such as [13] , can improve the average retrieval performance, they are shown to be unstable across queries. On the other hand, local methods adjust a query based on top-ranked documents retrieved by the original query. This kind of query expansion is called pseudo-relevance feedback (PRF) [33] , which has been proven to be highly effective to improve the performance of many retrieval models [138, 179] . Relevance model [119] , mixture model, and divergence minimization model [230] are the first PRF methods proposed under the language modeling framework. Since then, several other local methods have been proposed, but the relevance model is shown to be still among state-of-the-art PRF methods and performs more robustly than many other methods [138] . In general, query expansion methods have been widely studied and adopted in IR applications, but they do not always yield a consistent improvement. Especially expansion methods based on pseudo-relevance feedback are prone to the query drift problem [46] . Subsequently, with the development of deep learning technique, neural word embeddings and deep language models are used to enhance query expansion methods [58, 146, 181] . An alternative to query expansion is to perform the expansion for all documents in the corpus, then those enriched documents are indexed and searched as before. Intuitively, document expansion methods supplement each posting list in the inverted index, which have shown to be particularly effective for information retrieval tasks [3, 65, 200] . Document expansion is first proposed in the speech retrieval community [192] . Singhal and Pereira [192] proposed to use the original document as a query into the collection, and the ten most relevant documents were selected. Then, they enhanced the representation of the original document by adding to the document vector a linearly weighted mixture of related documents. Similarly, Efron et al. [65] followed a similar approach on short text retrieval tasks. They submitted documents as pseudo-queries and performed document expansion based on the analysis of the result set. Different from the retrieval-based method to determine related documents for expansion, it is another way to use document clustering to determine similar documents, and document expansion is carried out with respect to these results [114, 135] . Both works report significant improvements over non-expanded baselines on the TREC ad-hoc document retrieval task. In addition to using the document collection itself, it is also helpful to use external information to augment document representations [3, 190] . For example, Agirre et al. 
[3] presented a novel document expansion method based on a WordNet-based system to find related concepts and words, which is the first to perform document expansion using lexical semantic resources. Document expansion technique has been less popular with IR research because they are less amenable to rapid experiments. The corpus needs to be re-indexed every time the expansion technique changes, which is a costly process. In contrast, manipulations to query representations can happen at retrieval time and hence are much faster. Besides, the success of document expansion has been mixed. Billerbeck and Zobel [23] explored both query expansion and document expansion in the same framework and concluded that the former is consistently more effective. Nevertheless, dramatic improvement for the first-stage retrieval has been achieved after equipping the document expansion technique with neural models, such as doc2query [164] and docTTTTTquery [162] (See Section 5.1). Typically, term-based methods consider terms in the document independently and ignore the term orders. As a result, concepts represented by multiple contiguous words cannot be depicted correctly, and the stronger relevance of consecutive or ordered terms matching between queries and documents cannot be reflected well. Term dependency models attempt to address the above problem by incorporating term dependencies into the representation functions and . A natural way is to extend the dictionary in the inverted index with frequent phrases. For example, Fagan [67] tried to incorporate phrases into the VSM, where phrases are viewed as additional dimensions in the representation space. Then, the scoring function can be formalized to the combination of term-level score and phrase-level score: where term and phr are weights to achieve weight normalization, and the score of a phrase can be defined as the average of TF-IDF weights of its component terms. Xu et al. [218] also investigated the approach that extends BM25 with n-grams. They defined the BM25 kernel as follows: where BM25-Kernel ( , ) denotes the BM25 kernel of type , and can be bigram, trigram, etc. where denotes a n-gram of type , ( , ) and ( , ) are frequency of unit in query and document respectively, ( ) is total number of units with type in document ,¯is average number of ( ) within the whole collection, 1, 3, and are parameters. Integrating term dependencies to term-based methods increases the complexity, but gains are not significant as expected [118] . The Markov Random Field (MRF) approach proposed by Metzler and Croft [148] reports the first clear improvement for term dependency models over term-based baselines. In MRF, the document and each term in the query are represented as a node respectively. The document node is connected to every query term node. Moreover, there are some edges between query term nodes, based on pre-defined dependency relations (e.g., bigram, named entity, or co-occurrence within a distance), to represent their dependencies. Then, the joint probability of query and document can be formally represented as where denotes a clique on the constructed graph G, is the interpolation coefficient, ( ) is the potential function defined on clique , and denotes the partition function. In practice, we can define different feature functions to capture different types of term dependencies, and the coefficient can be optimized towards designative retrieval metrics, as in [148] . 
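The joint probability referred to above is usually given in a log-linear form; the following is a reconstruction in standard notation, which may differ in minor details from the rendering in [148]:

P_\Lambda(Q, D) = \frac{1}{Z_\Lambda} \prod_{c \in C(G)} \psi(c; \Lambda), \qquad P_\Lambda(D \mid Q) \overset{rank}{=} \sum_{c \in C(G)} \lambda_c f(c),

where C(G) is the set of cliques on the constructed graph G, \lambda_c are the interpolation coefficients, \psi(c; \Lambda) are the potential functions, and Z_\Lambda is the partition function. The rank equivalence follows from taking logarithms, assuming log-linear potentials \psi(c; \Lambda) = \exp(\lambda_c f(c)), and dropping the document-independent Z_\Lambda.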
While those methods are capable of capturing certain syntactics and semantics, their "understanding" capability is much limited. How to go beyond these simple counting statistics and mine deeper signals to better query-document matching is still an open question. Nevertheless, there is no doubt that term dependency models demonstrate the importance of understanding document semantics with context, stimulating a series of neural retrieval models that emphasize the capturing of contextual information [112, 232] . Another line to improve and simultaneously focuses on semantic relationships between words-usually modeling words' co-occurrence relation to discover latent topics in texts and matching queries and documents by their topic representations. In this way, each dimension of the representation indicates a topic instead of a term. Besides, the inverted index becomes impractical since topic representations lose sparsity. Topic modeling methods have received much attention in natural language processing (NLP) tasks. Overall, they can be divided into two categories, including probabilistic and non-probabilistic approaches. The non-probabilistic topic model, such as latent semantic indexing (LSI) [54] , nonnegative matrix factorization (NMF) [121] , and regularized latent semantic indexing (RLSI) [211] , is usually obtained by matrix factorization. Taking LSI as an example, it uses a truncated singular value decomposition (SVD) to obtain a low-rank approximation to the document-term matrix, then each document can be represented as a mixture of topics. Other topic models choose different strategies to conduct the matrix factorization. For example, NMF introduces the non-negative constraint and RLSI assumes topics are sparse. For probabilistic approaches, probabilistic latent semantic indexing (pLSI) [90] and latent dirichlet allocation (LDA) [25] are most widely used. Probabilistic topic models are usually generative models, where each topic is defined as a probabilistic distribution over terms in the vocabulary and each document in the collection is defined as a probabilistic distribution over topics. Studies that apply topic models to improve retrieval results can be classified in two ways. The first is to obtain query and document representations in the topic space, and then calculate relevance scores based on topic representations. For example, the LSI learns a linear projection that casts the sparse bag-of-words text vector into a dense vector in latent topic space, then the relevance score between a query and a document is the cosine similarity of their corresponding dense vectors. In the LDA-based retrieval model [213] , queries and documents are represented by their latent topic distributions. The relevance score of each query-document pair is computed by the Kullback-Leibler divergence as follows: where and are topic representations of query and document respectively, and and are the -th element of and . Another way is to combine topic models with term-based methods. A simple and direct approach is to linearly combine relevance scores calculated by topic models and term-based models [90] : where is the coefficient, topic ( , ) and term ( , ) are the topic matching score and term matching score respectively. In addition, probabilistic topic models can be taken as the smoothing method to language models for IR [57, 213, 224] . where is the coefficient, LM ( | ) and TM ( | ) are generating probabilities of word given document estimated by a language model and a topic model. 
The TM ( | ) can be defined as: where denotes a latent topic. According to results in [10] , using latent topic representations obtained by topic models alone for IR tasks only has small gains or poor performance over term-based baselines, unless combining them with term-based methods. Possible reasons include: (1) These topic models are mostly unsupervised, learning with a reconstruction objective, either based on mean squared error [54] or likelihood [25, 90] . They may not learn a matching score that works well for specific retrieval tasks; (2) Word co-occurrence patterns learned by these topic models are from documents, ignoring the fact that language usages in searching texts (queries) can be different from those in writing texts (documents), especially when the heterogeneity between queries and documents is significant; (3) Topic models represent documents as compact vectors, losing detailed matching signals over term-level. Later, using more powerful neural models, e.g., doc2vec [120] , instead of topic models for information retrieval has achieved better results [5, 6] . A notable attempt to address the vocabulary mismatch problem is the statistical translation approach, which enriches the document representation function from term frequency to translation models. Statistical machine translation (SMT) is leveraged for IR by viewing queries as texts in one language and documents as texts in another language. Retrieval by translation models needs to learn translation probabilities from queries to associated relevant documents, which can be obtained from labeled data, and thus belongs to the supervised learning approach. Berger and Lafferty [21] firstly proposed to formulate retrieval tasks as SMT problem, in which query is translated into document with the conditional probability ( | ). The model can be written as: where ( | ) denotes a translation model which translates to , and ( ) denotes a language model giving rise to . Translation probabilities can be estimated with queries and their associated relevant documents, e.g., click-through datasets, and the language model can be learned with different schemes, such as BM25. As Karimzadehgan and Zhai have noted [108] , the translation probability ( | ) allows for the incorporation of semantic relations between terms with non-zero probabilities, which provides a sort of "semantic smoothing" for ( | ). One important difference between conventional machine translation and machine translation for retrieval is that both queries (target language) and documents (source language) are in the same language. The probability of translating a word to itself should be quite high, i.e., ( | ) > 0, which corresponds to exact term matching in retrieval tasks. How to accurately calculate selftranslation probabilities is an important issue. If self-translation probabilities are too large, it will make other translation probabilities small and decrease the effect of using translation. On the other hand, if self-translation probabilities are too small, then it will make exact matching less effective and hurt the performance of retrieval. A number of methods [74, 108, 109] have been proposed to estimate self-translation probabilities. 
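In its commonly cited form (reconstructed here; the notation may differ slightly from [21]), the translation language model scores a document as

P(q \mid d) = \prod_{w \in q} \sum_{u \in d} P_t(w \mid u)\, P(u \mid d),

where P_t(w \mid u) is the probability of translating the document term u into the query term w and P(u \mid d) is the document language model. The self-translation probability P_t(w \mid w) governs how much credit exact term matches receive, which is precisely the estimation issue discussed next.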
For example, Karimzadehgan and Zhai [108] proposed to address this estimation problem based on normalized mutual information between words, which is less computationally expensive and has better coverage of query words than the synthetic query method of estimation [21] : where is the weight which is empirically set on heldout data. Similarly, an alternative heuristic is to impose constant self-translation probabilities for all words in the vocabulary [109] , i.e., setting ( | ) to a constant value for every , where ( | ) is estimated according to: All these methods assume that self-translation probabilities estimated directly from data are not optimal for retrieval tasks, and the authors have demonstrated that significant improvement can be achieved by adjusting the probabilities [74] . Statistical translation models have also been applied to query expansion. For example, Riezler and Liu [176] suggested utilizing a word-based translation model for query expansion. The model is trained with click-through data consisting of queries and snippets of clicked web pages. Gao and Nie [75] generalized the word-based translation model to a concept-based model and employed the model in query expansion. Nevertheless, SMT models have not been used much because they are difficult to train due to data sparsity, and are not more effective than the term-based retrieval with pseudo-relevance feedback [119] in most situations. Subsequently, after the appearance of neural word embeddings, using distributed representations to calculate translation probabilities and improve translation models are proposed naturally [73, 242] . Takeaway. Early semantic retrieval models, such as query expansion, document expansion, term dependency models, topic models, and translation models, aim to improve classical BoW representations with semantic units extracted from external resources or the collection itself. Most of them still follow classical term-based methods by representing texts with high-dimensional sparse vectors in symbolic space, so as to be easily integrated with the inverted index to support efficient retrieval. However, these approaches always rely on hand-crafted features to build representation functions. As a result, only shallow syntactic and semantic information can be captured. Nevertheless, these early proposals are crucial because they have initially explored beneficial factors for the first-stage retrieval. Thereby, a series of new semantic retrieval models could be inspired when the deep learning technique breaks out, and exciting results could be obtained concomitantly. During the past decade, big data and fast computer processors have brought a new era for deep learning technique. A set of simple math units, called neurons, are organized into layers, and stacked into neural networks. Neural networks have the expressive power to represent complex functions and fit hidden correlations in complicated tasks [98] . For example, it converts discrete symbols (e.g., words, phrases and sentences) into low-dimensional dense vectors which are able to capture semantic and syntactic features for various NLP tasks [48, 223] . Naturally, it also attracts researchers from the IR field and leads to the research wave of neural approaches to IR (neural IR). However, most earlier researches focus on re-ranking stages [82, 96] . Until recently, much attention is paid to explore neural networks to improve the semantic matching for the first-stage retrieval. 
Different from early semantic retrieval models, neural semantic retrieval models employ neural networks to build the representation functions (i.e., and/or ) as well as the scoring function (i.e., ). In this way, these models can learn deep semantics and complex interactions from data in an end-to-end way. From the perspective of model architecture, neural methods for semantic retrieval can be categorized into three classes, including sparse retrieval methods, dense retrieval methods, and hybrid retrieval methods. In this section, we will review major works about them. Table 1 summaries surveyed neural semantic retrieval models in different categories. Sparse retrieval methods usually represent each document and each query with sparse vectors, where only a small number of dimensions are active. The sparse representation has attracted great attention as it connects to the nature of human memories and shows better interpretability [14] . Besides, sparse representations can be easily integrated into existing inverted indexing engines for efficient retrieval. Without loss of generality, sparse retrieval methods can be categorized into two classes. One is to encode queries and documents still in the symbolic space but employ neural models to improve term weighting schemes, namely neural weighting schemes. The other is to directly learn sparse representations, i.e., ( ) and ( ), in latent space for queries and documents with neural networks, which we call sparse representation learning. 5.1.1 Neural Weighting Schemes. One of basic methods to leverage the advantage of neural models while still employing sparse term-based retrieval is to re-weight the term importance before indexing. To this purpose, a direct way is to design neural models to predict term weights based on semantics rather than pre-defined heuristic functions. An alternative method is to augment each document with additional terms, then, expanded documents are stored and indexed with classical term-based methods. One of the earliest methods to learn term weights is the DeepTR model [240] , which leverages neural word embeddings to estimate the term importance. Specifically, it constructs a feature vector for each query term and learns a regression model to map feature vectors onto ground truth weights of terms. Estimated weights can be directly used to replace classical term weighting schemes in the inverted index, e.g., BM25 and LM, to generate bag-of-words query representations to improve the retrieval performance. More recently, Frej et al. [71] proposed a term discrimination values (TDVs) learning method, which replaces the IDF field in the original inverted index based on FastText [26] . In addition to the pairwise ranking objective, they also minimized the ℓ 1 -norm of bag-of-words document representations to reduce the memory footprint of the inverted index and speed up the retrieval process. Besides, Zuccon et al. [242] used word embeddings within the translation language model for information retrieval. They leveraged word embeddings to estimate translation probabilities between words. This language model captures implicit semantic relations between words in queries and those in relevant documents, thus bridging the vocabulary mismatch and producing more accurate estimations of document relevance. In recent years, contextual word embeddings, which are often learned with pre-trained language models, have achieved great success in many NLP tasks [55, 169, 221] . 
Compared with static word embeddings (e.g., Word2Vec [149] , GloVe [168] , and FastText [26] ), contextual word embeddings model the semantic information of words under the global context. There are also several works trying to utilize contextual word embeddings to estimate term weights. For example, Dai and Callan [49, 51] proposed a BERT-based framework (DeepCT) to evaluate the term importance of sentences/passages in a context-aware manner. It maps contextualized representations learned by BERT to term weights, then uses predicted term weights to replace the original TF field in the inverted index. Experimental results show that predicted weights could better estimate the term importance and improve term-based methods for the first-stage retrieval. Moreover, results in [143] verify that DeepCT can improve search efficiency via static index pruning technique. Furthermore, Dai et al. [50] introduced the HDCT model to learn term weights for long documents. It firstly estimates passage-level term weights using contextual term representations produced by BERT. Then, passage-level term weights are combined into document-level term weights through a weighted sum. It is worth noting that the learned term weights by the above models, including DeepCT and HDCT, are in the range of 0-1. Then, they scale the real-valued predictions into a tf -like integer. In this way, these term weights can be directly integrated into the existing inverted index and be implemented with existing retrieval models. Above mentioned approaches rely on neural embeddings, which are learned within local or global contexts, to predict term weights directly. Besides, there are also some works trying to estimate term weights by evaluating the matching score between each term and the whole document through a complex interaction network. For example, Mitra et al. [156] proposed to incorporate query term independence assumption into three state-of-the-art neural ranking models (BERT [55] , Duet [153] , and Conv-KNRM [52] ), and the final relevance score of the document can be decomposed with respect to each query term. In this way, these neural ranking models can be used to predict the matching score of each term to the document, which can be pre-computed and indexed offline. Experimental results on a passage retrieval task show that this method exhibits significant improvement over classical term-based methods, with only a small degradation compared with original neural ranking models. Similarity, Mitra et al. [154] extended the Transformer-Kernel [92] architecture to the full retrieval setting by incorporating the query term independence assumption. Firstly, they simplified the query encoder by getting rid of all Transformer layers and only considering non-contextualized embeddings for query terms. Secondly, instead of applying the aggregation function over the full interaction matrix, they applied it to each row of the matrix individually, which corresponds to an individual matching score between each query term and the whole document. In addition to explicitly predicting term weights, another kind of method is to augment the document with additional terms using neural sequence-to-sequence (seq2seq) models. In this way, term weights of those elite terms can be promoted in the inverted index. In fact, this kind of method follows the idea of document expansion described in Section 4.2, yet it does the expansion with neural networks. 
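Returning to the weighting schemes above, a minimal sketch of how real-valued predictions from a DeepCT-style model can be folded into a standard inverted index is given below; the scaling factor and rounding scheme are illustrative choices, not the exact recipe of any particular paper.

```python
def tf_like_postings(doc_id, predicted_weights, scale=100):
    """Turn real-valued term weights in [0, 1] (e.g., from a DeepCT-style model)
    into tf-like integers that can be stored in a standard inverted index."""
    postings = {}
    for term, weight in predicted_weights.items():
        pseudo_tf = round(weight * scale)
        if pseudo_tf > 0:                  # terms with negligible predicted weight are dropped
            postings[term] = (doc_id, pseudo_tf)
    return postings

# Toy usage: a passage where "retrieval" is predicted to be central.
print(tf_like_postings(42, {"retrieval": 0.93, "the": 0.004, "stage": 0.40}))
```

Expansion-based approaches, described next, take the complementary route of adding new terms rather than re-weighting existing ones.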
For example, the doc2query [164] model trains a seq2seq model based on relevant query-document pairs. Then, the seq2seq model generates several queries for each document, and those synthetic queries are appended to the original document, forming the "expanded document". This expansion procedure is performed on every document in the corpus, and the expanded document collection is indexed as usual. Finally, it relies on a BM25 algorithm to retrieve relevant candidates. When combined with a re-ranking component, it achieves the state-of-the-art performance on MS MARCO [160] and TREC CAR [59] retrieval benchmarks. Later, the docTTTTTquery model [162] employs a stronger pre-trained model T5 [173] to generate queries and achieves large gains compared with doc2query. Moreover, Yan et al. [219] proposed a Unified Encoder-Decoder networks (UED) to enhance document expansion with the document ranking task. Experimental results on two large-scale datasets show that UED achieves a new state-of-the-art performance on both MS MARCO passage retrieval task and TREC 2019 Deep Learning Track. There are also some works trying to learn term weights as well as document expansion simultaneously in a unified framework [14, 70, 145] . For example, Bai et al. [14] proposed a novel framework SparTerm to build term-based sparse representations in the full vocabulary space. It takes the pre-trained language model to map the frequency-based BoW representation to a sparse term importance distribution in the whole vocabulary. In this way, it can simultaneously learn the weights of existing terms and expand new terms for the document. Besides, SparTerm also constructs a gating controller to generate binary and sparse signals across the dimension of vocabulary size, ensuring the sparsity of final representations. Besides, DeepImpact [145] leverages docTTTTTquery to enrich the document collection, and then uses a contextualized language model to estimate the semantic importance of tokens in the document. In this way, it can produce a single-value representation for the original token and expanded token in each document. Learning. In contrast to weighting terms in the symbolic space, sparse representation learning methods focus on building sparse vectors for queries and documents, where representations are expected to capture semantic meanings of each input text. In this way, queries and documents are represented in the latent space. But different from topic models in Section 4.4, each dimension of the latent space learned by neural models has no clear concepts. Then, the learned sparse representations can be stored and searched with an inverted index efficiently, where each unit in the inverted index table corresponds to a "latent word" instead of a term. Learning sparse embeddings can be traced back to semantic hashing [182] , which employs deep auto-encoders for semantic modeling. It takes a multi-layer auto-encoder to learn distributed representations for documents. This model captures the document-term information, but it does not model the relevance relationship between queries and documents. Thus, it still cannot outperform classical term-based retrieval models, such as BM25 and QL. Zamani et al. [228] proposed a standalone neural ranking model to learn latent sparse representation for each query and document. Specifically, it firstly maps each n-gram in queries and documents to a low-dimensional dense vector to compress the information and learn the low dimensional manifold of the data. 
Then, it learns a function to transform n-gram representations to high-dimensional sparse vectors. Finally, the dot product is used as the matching function to calculate the similarity between each query and document. This architecture learns latent sparse representations to better capture semantic relationships between query-document pairs, showing better performance than traditional term-based retrieval models and several neural ranking models. However, it uses n-grams as the encoding units, which can only capture local dependencies and cannot adjust dynamically to the global context. Recently, Jang et al. [100] presented UHD-BERT, a novel sparse retrieval method empowered by extremely high dimensionality and controllable sparsity. They showed that the model outperforms previous sparse retrieval models significantly and delivers competitive performance compared to dense retrieval models. To make interaction-focused models applicable to the first-stage retrieval, Ji et al. [103] proposed to use sparse representations to improve the efficiency of three interaction-focused neural algorithms (DRMM [82], KNRM [215], and Conv-KNRM [52]). The work investigates a Locality Sensitive Hashing (LSH [53]) approximation of the three neural methods with fast histogram-based kernel calculation and term vector pre-computation for a runtime cache. Evaluation results show that the proposed method yields 4.12x, 80.54x, and 106.52x time speedups for DRMM, KNRM, and Conv-KNRM respectively on the ClueWeb dataset. One of the biggest benefits of neural retrieval methods is the ability to move away from sparse representations to dense representations, which are able to capture the semantic meaning of input texts for better relevance evaluation.
Fig. 2. Dual-encoder architecture of dense retrieval methods.
As shown in Figure 2, dense retrieval models usually have a dual-encoder architecture, also called the Siamese network [30], which consists of twin networks that accept distinct inputs (queries and documents) and learn standalone dense embeddings for them independently. Then, the learned dense representations φ(q) and ψ(d) are fed into a matching layer f, which is often implemented via a simple similarity function, to produce the final relevance score. To support online serving, the learned dense representations are often indexed and searched via approximate nearest neighbor (ANN) algorithms [38, 106]. Researchers have devoted a lot of effort to designing sophisticated architectures to learn dense representations for retrieval. Due to the heterogeneous nature of text retrieval, the document often has abundant content and complicated structures, so much attention has been paid to the design of the document-side representation function ψ.
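A minimal sketch of this dual-encoder scoring is shown below, assuming the document embeddings ψ(d) have been pre-computed offline; brute-force matrix multiplication stands in for the ANN search used in practice, and the random embeddings are placeholders for trained encoders.

```python
import numpy as np

def dense_scores(query_vec, doc_matrix):
    """Dual-encoder scoring: a query embedding phi(q) against pre-computed
    document embeddings psi(d), with the dot product as the matching layer f.

    In practice doc_matrix is built offline and served from an ANN index;
    brute-force multiplication is used here only to make the scoring explicit.
    """
    return doc_matrix @ query_vec            # one score per document

# Toy usage with random 4-dimensional embeddings for 3 documents.
rng = np.random.default_rng(0)
docs = rng.normal(size=(3, 4))
query = rng.normal(size=4)
scores = dense_scores(query, docs)
print(scores.argsort()[::-1])                # document indices ranked by score
```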
According to the form of learned document representations, we can divide dense retrieval models into two classes, as shown in Figure 3: term-level representation learning and document-level representation learning. 5.2.1 Term-level Representation Learning. Term-level representation learning methods learn fine-grained term-level representations for queries and documents, so that queries and documents are represented as a sequence/set of term embeddings. As shown in Figure 3(a), the similarity function then calculates term-level matching scores between the query and the document and aggregates them as the final relevance score. One of the easiest methods is to take word embeddings, which have been proven to be effective in building ranking models for later re-ranking stages [82, 215], to build term-level representations for queries and documents. For example, Kenter and de Rijke [111] investigated whether it is possible to rely only on semantic features, e.g., word embeddings, rather than syntactic representations to calculate similarities between short texts. They replaced the term frequency component in BM25 with the maximum cosine similarity between the embedding of each query word and the words in the document. Their results show that the model can outperform baseline methods that work under the same condition. Mitra et al. [155] trained a word2vec embedding model on a large unlabelled query corpus, but in contrast to only retaining the output lookup table, they retained both the input and output projections, allowing them to leverage both embedding spaces to derive richer distributional relationships. During ranking, they mapped query words into the input space and document words into the output space, and computed the relevance score by aggregating cosine similarities across all query-document word pairs. The experimental results show that the resulting DESM model can re-rank top documents returned by a commercial Web search engine, like Bing, better than a term-based signal like TF-IDF. However, when retrieving in a non-telescoping setting, DESM features are very susceptible to false positive matches and can only be used either in conjunction with other document ranking features, such as TF-IDF, or for re-ranking a smaller set of candidate documents. In recent years, the combination of contextual word embeddings and self-supervised pre-training has revolutionized the field of NLP and obtained state-of-the-art performance on many NLP tasks [55, 169, 221]. There are also a number of works that employ contextual word embeddings to learn query/document representations for IR. For example, Zhang et al. [237] proposed DC-BERT, which employs dual BERT encoders for the low layers, as shown in Figure 4(a), where an online BERT encodes the query only once and an offline BERT pre-encodes all documents and caches all term representations. Then, the obtained contextual term representations are fed into high-layer Transformer interaction, which is initialized by the last layers of the pre-trained BERT [55].
In recent years, the combination of contextual word embeddings and self-supervised pre-training has revolutionized the field of NLP and obtained state-of-the-art performance on many NLP tasks [55, 169, 221]. There are also a number of works that employ contextual word embeddings to learn query/document representations for IR. For example, Zhang et al. [237] proposed DC-BERT, which employs dual BERT encoders for the lower layers, as shown in Figure 4(a): an online BERT encodes the query only once, and an offline BERT pre-encodes all documents and caches all term representations. Then, the obtained contextual term representations are fed into high-layer Transformer interaction layers, which are initialized with the last layers of the pre-trained BERT [55]. The number of Transformer interaction layers is configurable to trade off model capacity against efficiency. On the SQuAD and Natural Questions datasets, DC-BERT achieves a 10x speedup over the original BERT model on document retrieval, while retaining most (about 98%) of the QA performance compared to state-of-the-art approaches for open-domain question answering. An alternative way to use BERT for term-level representation learning is the ColBERT [112] model, which employs a cheap yet powerful interaction function, i.e., a term-based MaxSim, to model fine-grained matching signals, as shown in Figure 4(b). Concretely, every query term embedding interacts with all document term embeddings via a MaxSim operator, which computes the maximum similarity (e.g., cosine similarity or L2 distance), and the scalar outputs of these operators are summed across query terms. Based on this, ColBERT can achieve cheap interaction and highly efficient pruning for top-relevant document retrieval. Results on MS MARCO and TREC CAR show that ColBERT's effectiveness is competitive with existing BERT-based models (and outperforms every non-BERT baseline), while executing two orders of magnitude faster and requiring four orders of magnitude fewer FLOPs per query. A similar model, COIL, is proposed by Gao et al. [77], but the query term embedding only interacts with exactly matched document term embeddings in the MaxSim operator. Experimental results show that COIL performs on par with more expensive and complex all-to-all matching retrievers (e.g., ColBERT).
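The late-interaction scheme behind ColBERT can be summarized in a few lines. The sketch below is a simplified numpy illustration of the MaxSim scoring step only (the encoders, dimensionality reduction, and the pruning-friendly index that make ColBERT practical are omitted), with random vectors standing in for contextualized term embeddings.

```python
import numpy as np

def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction scoring in the spirit of ColBERT's MaxSim: every query
    term embedding is matched against all document term embeddings, the maximum
    cosine similarity per query term is taken, and the maxima are summed.

    query_vecs: (num_query_terms, dim) array; doc_vecs: (num_doc_terms, dim) array.
    """
    # L2-normalize so the inner product equals cosine similarity
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sim = q @ d.T                         # (num_query_terms, num_doc_terms)
    return float(sim.max(axis=1).sum())   # MaxSim per query term, summed

rng = np.random.default_rng(1)
print(maxsim_score(rng.normal(size=(4, 128)), rng.normal(size=(60, 128))))
```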
Besides, Cao et al. [34] and MacAvaney et al. [141] proposed DeFormer and PreTTR to decompose the lower layers of BERT, substituting the full self-attention with question-wide and passage-wide self-attentions, as shown in Figure 5. The proposed approaches considerably reduce the query-time latency of deep Transformer networks. The difference between them is that the PreTTR model [141] inserts a compression layer to match attention scores, reducing the storage requirement by up to 95% without substantial degradation in retrieval performance.
Fig. 5. Decomposing BERT into question-wide and passage-wide self-attentions.
A natural extension of term-level representation learning is to learn phrase-level (e.g., n-gram or sentence) representations for documents, so that documents are finally represented as a sequence/set of phrase embeddings. Meanwhile, the query is usually viewed as one phrase and abstracted into a single vector, as it is often short. Then, the similarity function calculates matching scores between the query and all phrases in the document, and aggregates these local matching signals to produce the final relevance score. For example, Seo et al. [186] proposed to learn phrase representations based on BiLSTMs for the OpenQA task. This leads to a significant scalability advantage, since the encodings of candidate answer phrases in the documents can be pre-computed and indexed offline for efficient retrieval. Subsequently, Seo et al. [187] and Lee et al. [122] replaced the LSTM-based architecture with a BERT-based encoder, and augmented the dense representations learned by BERT with contextualized sparse representations, improving the quality of each phrase embedding. Different from the document encoder, the query encoder generates only one embedding that captures the whole contextual information of the query. Experimental results show that the OpenQA model that augments learned dense representations with learned contextual sparse representations outperforms previous OpenQA models, including recent BERT-based pipeline models, with two orders of magnitude faster inference time. For the multi-hop OpenQA task, Feldman and El-Yaniv [69] proposed the MUPPET model for efficient retrieval. Retrieval is performed by considering similarities between the question and contextualized sentence-level representations of the paragraphs in the knowledge source. Given the sentence representations (s_1, s_2, . . . , s_k) of a paragraph P and the encoding q of a question Q, the relevance score of P with respect to Q is computed by a learned scoring function that compares q with each sentence representation and aggregates the per-sentence similarities, with weight vectors, a weight matrix, and a bias term as learned parameters. The method achieves state-of-the-art performance on two well-known datasets, SQuAD-Open and HotpotQA.
Document-level Representation Learning. Document-level representation learning methods learn one or more coarse-grained global representations for each query and each document by abstracting their semantic meanings into dense vectors. A simple similarity function (e.g., dot product or cosine similarity) is then often employed to calculate the final relevance score based on the query embedding and the document embedding(s), as shown in Figure 3(b). Initial attempts to obtain query and document embeddings directly aggregate their corresponding word embeddings with pre-defined heuristic functions. Clinchant and Perronnin [45] were the first to propose a document representation model, the Fisher Vector (FV), based on continuous word embeddings. It first maps word embeddings into a higher-dimensional space, and then aggregates them into a document-level representation through the Fisher kernel framework. Although the FV model outperforms latent semantic indexing (LSI) for ad-hoc retrieval tasks, it does not perform better than classical IR models, such as TF-IDF and the divergence from randomness [7] retrieval model. Gillick et al. [81] proposed to utilize the average of word embeddings as the query or document representation. Their experimental results show that the proposed model outperforms term-based retrieval models (e.g., TF-IDF and BM25), which indicates that dense retrieval is a viable alternative to discrete retrieval models. Obtaining text representations by aggregating word embeddings loses contextual and word-order information, just as classical term-based retrieval models do.
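Before turning to learned document embeddings, the simple aggregation baseline just described can be written in a few lines. This is a minimal sketch with a hypothetical embedding dictionary and random vectors, meant only to show the average-then-cosine scoring rather than any particular published model.

```python
import numpy as np

def avg_embedding(terms, emb):
    """Represent a text as the mean of its word embeddings; `emb` is a
    hypothetical dict mapping a term to its vector, OOV terms are skipped."""
    vecs = [emb[t] for t in terms if t in emb]
    return np.mean(vecs, axis=0) if vecs else np.zeros_like(next(iter(emb.values())))

def cosine(u, v, eps=1e-8):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + eps))

rng = np.random.default_rng(2)
emb = {w: rng.normal(size=100) for w in ["dense", "retrieval", "semantic", "model", "index"]}
q_vec = avg_embedding(["semantic", "retrieval"], emb)
d_vec = avg_embedding(["dense", "retrieval", "model", "index"], emb)
print(cosine(q_vec, d_vec))
```

Such fixed averaging treats a text as a bag of embeddings, which is precisely the loss of context and word order noted above.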
To solve this problem, Le and Mikolov [120] proposed the Paragraph Vector (PV), an unsupervised algorithm that learns fixed-length representations from variable-length pieces of text, such as sentences, paragraphs and documents. Ai et al. [5, 6] evaluated the effectiveness of PV representations for ad-hoc retrieval, but observed unstable performance and only limited improvements. With many attempts to use word/document embeddings to obtain dense representations for queries and documents, only moderate and local improvements over traditional term-based retrieval models have been observed, suggesting the need for more IR-customized embeddings or more powerful representation learning models. As for embeddings customized for IR, Ai et al. [5] analyzed intrinsic problems of the original PV model that restrict its performance on retrieval tasks. They then proposed modifications to the PV model to make it more suitable for IR tasks. The evaluation results on Robust04 and GOV2 show the effectiveness of the enhanced PV model. Subsequently, Gysel et al. [85] proposed the Neural Vector Space Model (NVSM), an unsupervised method that learns latent representations of words and documents from scratch for news article retrieval. The query is represented by averaging its constituent word representations and projecting them into the document feature space. The matching score between a document and a query is given by the cosine similarity between their representations in the document feature space. The experiments show that NVSM outperforms lexical retrieval models on four article retrieval benchmarks. Similar to NVSM, another unsupervised embedding learning method tailored for IR is SAFIR [4]. SAFIR jointly learns word, concept and document representations from scratch. The similarity of a query to a document is calculated by averaging its word-concept representations and then projecting them into the document space. Finally, the matching score between the query and the document is given by the cosine similarity between their representations in the document space. The evaluation on shared test collections for medical literature retrieval shows the effectiveness of SAFIR in retrieving relevant documents. In addition to optimizing word/document embeddings for retrieval objectives directly, leveraging external knowledge resources, e.g., semantic graphs, ontologies and knowledge graphs, to enhance embedding learning for semantic retrieval is another effective solution [136, 159, 197]. For example, Liu et al. [136] leveraged existing knowledge (word relations) in the medical domain to constrain word embeddings, using the principle that related words should have similar embeddings. The resulting constrained word embeddings are used for IR tasks and show superior effectiveness to unsupervised word embeddings. For more powerful representation learning models for the first-stage retrieval, Henderson et al. [87] proposed a computationally efficient neural method for natural language response suggestion. A feed-forward neural network uses n-gram embedding features to encode messages and suggested replies into vectors, and is optimized to give message-response pairs higher dot-product values. The DPR [110] model learns dense embeddings for text blocks with a BERT-based dual encoder. The retriever based on the DPR model outperforms a strong Lucene BM25 system on a wide range of OpenQA datasets and is beneficial for the end-to-end QA performance.
Similar to DPR, the RepBERT [232] model employs a dual encoder based on BERT to obtain query and document representations, and the inner products of query and document representations are then regarded as relevance scores. Experimental results show that RepBERT outperforms BM25 on the MS MARCO passage ranking task. Another alternative approach is to distill a more complex model (e.g., a term-level representation learning method or an interaction-focused model) into a document-level representation learning architecture. For example, Lin et al. [132] distilled the knowledge from ColBERT's expressive MaxSim operator for computing relevance scores into a simple dot product, thus enabling a single-step ANN search. Their key insight is that, during distillation, tight coupling between the teacher model and the student model enables more flexible distillation strategies and yields better learned representations. The approach improves query latency and greatly reduces the onerous storage requirement of ColBERT, while only making modest sacrifices in terms of effectiveness. Tahami et al. [196] utilized knowledge distillation to compress a complex BERT cross-encoder network, as the teacher model, into a student BERT bi-encoder model. This increases the prediction quality of BERT-based bi-encoders without affecting their inference speed. They evaluated the approach on three domain-popular datasets, and the results show that the proposed method achieves statistically significant gains.
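At the level of training signals, this kind of score distillation can be pictured as regressing the student's single-vector dot product towards the teacher's late-interaction score. The following is a simplified numpy sketch of that idea; the mean pooling for the student and the plain MSE objective are illustrative assumptions, not the exact procedures used in [132] or [196].

```python
import numpy as np

def maxsim(q_vecs, d_vecs):
    """Teacher: ColBERT-style MaxSim score over term embeddings."""
    q = q_vecs / np.linalg.norm(q_vecs, axis=1, keepdims=True)
    d = d_vecs / np.linalg.norm(d_vecs, axis=1, keepdims=True)
    return float((q @ d.T).max(axis=1).sum())

def pooled(vecs):
    """Hypothetical student pooling: mean of term vectors as a single embedding."""
    return vecs.mean(axis=0)

def distillation_mse(q_vecs, d_vecs):
    """Score-level distillation target: the student's dot product is regressed
    towards the teacher's late-interaction score."""
    teacher_score = maxsim(q_vecs, d_vecs)
    student_score = float(np.dot(pooled(q_vecs), pooled(d_vecs)))
    return (teacher_score - student_score) ** 2

rng = np.random.default_rng(3)
print(distillation_mse(rng.normal(size=(4, 64)), rng.normal(size=(80, 64))))
```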
It should be noted that early neural models proposed for IR tasks, such as DSSM [96], ARC-I [93] and QA_LSTM [198], learn highly abstract document representations based on different network architectures, such as fully-connected, CNN, and RNN networks, and then use a simple matching function, such as cosine similarity or a bilinear function, to evaluate similarity scores. These models were originally proposed for re-ranking stages; however, because of their dual-encoder architecture, they are in theory also applicable to the first-stage retrieval. Nevertheless, a study by Guo et al. [82] shows that DSSM, C-DSSM [189], and ARC-I perform worse when trained on whole documents than when trained only on titles. Due to these limitations, most of these early neural models fail to beat unsupervised term-based retrieval baselines (e.g., BM25) on academic benchmarks. These drawbacks motivate the development of the models discussed in this survey that are designed specifically for the retrieval stage.
In addition to learning a single global representation for each query and each document, another more sophisticated approach is to employ different encoders for queries and documents, where the document encoder abstracts the content into multiple embeddings, each capturing some aspect of the document, while the query encoder obtains a single embedding for each query [97, 137, 199]. The motivation is that documents are often lengthy and have diverse aspects, while queries are usually short and have focused topics. For example, Luan et al. [137] proposed the Multi-Vector BERT (ME-BERT) to obtain a single-vector representation for the query and a multi-vector representation for the document. They represented the sequence of contextualized query/document embeddings at the top level of a deep Transformer, then defined the single-vector query representation as the contextualized embedding of the special token "[CLS]" and the multi-vector document representation as the first m contextualized vectors of the tokens in the document, where m is always smaller than the number of tokens in the document. Finally, the relevance score is calculated as the largest inner product yielded by each document vector with the query vector. Experimental results show that the ME-BERT model yields stronger performance than alternatives in open retrieval.
Fig. 6. Document-level multi-vector representation method in the Poly-encoder [97].
Similarly, Humeau et al. [97] proposed the Poly-encoder, an architecture with an additional learned attention mechanism to represent more global features. The Poly-encoder, as shown in Figure 6, uses two separate Transformer models to encode contexts and candidates. The candidate is encoded into a single vector y_cand, and the input context, which usually includes more information than a candidate, is represented with m vectors (y_ctxt^1, . . . , y_ctxt^m) instead of just one. Then, the m context vectors are attended over, using the candidate encoding y_cand, to obtain the final score. The value of m gives a trade-off between inference accuracy and speed. It should be noted that, different from general retrieval tasks where the retrieved texts (documents) are usually longer than the input texts (queries), the task in [97] has longer input texts than retrieved texts; thus, the multi-vector representation model is actually employed for the query encoder in [97].
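The document-side scoring used by such multi-vector models reduces to a maximum over inner products. The sketch below illustrates this ME-BERT-style scoring with random vectors standing in for the [CLS] query embedding and the m document embeddings.

```python
import numpy as np

def multi_vector_score(query_vec, doc_vecs):
    """ME-BERT-style scoring: a document is kept as several vectors and the
    relevance score is the largest inner product between the single query
    vector and any of the document vectors."""
    return float(np.max(doc_vecs @ query_vec))

rng = np.random.default_rng(4)
q = rng.normal(size=128)           # single query embedding
d = rng.normal(size=(8, 128))      # m = 8 document embeddings
print(multi_vector_score(q, d))
```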
Sparse retrieval methods take words or "latent words" as the unit of indexing, which preserves strong discriminative power, since the score is calculated by hard matching between units. As a result, they can identify exact matching signals, which are crucial for retrieval tasks. On the other hand, dense retrieval methods learn continuous embeddings to encode semantic information and soft matching signals, but detailed low-level features are often sacrificed. A natural approach to balancing the fidelity of sparse retrieval methods and the generalization of dense retrieval methods is to combine their merits and build a hybrid retrieval model [78, 85, 115, 207]. Hybrid retrieval methods define multiple representation functions and obtain both sparse and dense representations for queries and documents. Finally, these representations are used to calculate the final matching score with different merging strategies. The general architecture of hybrid retrieval methods is shown in Figure 7. With the development of word embedding techniques, there are a number of works on exploiting them together with term-based models for the first-stage retrieval. Vulić and Moens [207] obtained better results on monolingual and bilingual retrieval by combining a word-embedding-based method with a uni-gram language model. However, the embedding-based model alone does not outperform traditional language models in the monolingual retrieval task. That is to say, the effectiveness of neural semantic retrieval models is more often observed when they are combined with term-based retrieval methods rather than when they replace them. A consistent observation is also made in [155, 158]: directly using word embeddings yields extremely poor performance in the non-telescoping setting, unless they are combined with a term-based feature such as BM25. The GLM [73] is an embedding-based translation model linearly combined with a traditional language model.
The probability of observing a query term t in a document d is modeled by three parts, i.e., direct term sampling, generating a different term t′ either from the document itself or from the collection, and then transforming it into the observed query term t. The empirical results show that GLM performs better than the traditional language model. Roy et al. [180] also proposed to combine a word-vector-based query likelihood with the standard language-model-based query likelihood for document retrieval. Experiments on standard text collections show that the combined similarity measure almost always outperforms the language model similarity measure significantly. Besides, according to the experimental results in [85], although the NVSM model outperforms term-based retrieval models on some benchmarks, it is more useful as a supplementary signal to term-based models. Similar conclusions can also be found in [4, 136, 197]. Different from using word embeddings to construct dense representations and using term frequencies to obtain term-based matching scores directly, some works employ simple neural networks to learn sparse and dense representations and then calculate matching scores based on the learned representations. For example, dos Santos et al. [62] proposed the BOW-CNN architecture to retrieve similar questions in online QA community sites, as shown in Figure 8, which combines a bag-of-words (BOW) representation with a distributed vector representation created by a convolutional neural network (CNN).
The BOW-CNN model computes two partial similarity scores, s_bow(q_1, q_2) for the BOW representations and s_conv(q_1, q_2) for the CNN representations, and then combines the two partial scores to create the final score s(q_1, q_2). They performed experiments on two datasets collected from Stack Exchange communities. The experimental results show that BOW-CNN is more effective than BOW-based information retrieval methods such as TF-IDF, and that BOW-CNN is more robust than the pure CNN for long texts. Besides, MacAvaney et al. [142] proposed a new approach for passage retrieval, which trains a model to generate query and document representations in a fixed-length vector space and produces a ranking score by computing a similarity score between the two representations. Different from other representation learning methods, it represents each query as a sparse vector and each document as a dense vector. Finally, the dot product is used to compute the similarity between the query vector and the document vector. The experimental results show that the proposed EPIC model significantly outperforms prior approaches. It is also observed that its performance is additive with current leading first-stage retrieval methods. With the rise of more powerful pre-trained neural networks (e.g., BERT, GPT-3), it is natural to combine them with term-based models to improve the first-stage retrieval. Seo et al. [187] proposed DenSPI for the retrieval stage of OpenQA. The DenSPI model constructs a dense-sparse representation for each phrase unit. The dense vector is represented as pointers to the start and end BERT-based token representations of the phrase, which is responsible for encoding the syntactic and semantic information of the phrase with respect to its context. The sparse embedding uses 2-gram-based tf-idf for each phrase, which is good at encoding precise lexical information. Later, Lee et al. [122] proposed to learn a contextual sparse representation for each phrase based on BERT to replace the term-frequency-based sparse encodings in DenSPI [187]. This method leverages rectified self-attention to indirectly learn sparse vectors in the n-gram vocabulary space, improving the quality of each phrase embedding by augmenting it with a contextualized sparse representation. Experimental results show that the OpenQA model that augments DenSPI with learned contextual sparse representations outperforms previous OpenQA models, including recent BERT-based pipeline models, with two orders of magnitude faster inference time. Luan et al. [137] proposed to linearly combine the scores of a term-based system (BM25-uni) and a neural-based system (the dual-encoder or the multi-vector model) using a single trainable weight tuned on a development set, which yields strong performance while maintaining scalability. Gao et al. [78] proposed the CLEAR model, which uses a BERT-based embedding model to complement the term-based model (BM25). Experimental results show that retrieval with CLEAR alone, without re-ranking, is already almost as accurate as the BERT re-ranking pipeline. Similarly, Kuzi et al. [115] proposed a general hybrid approach for document retrieval that leverages both a semantic model (BERT) and a lexical retrieval model (BM25). An in-depth empirical analysis demonstrates the effectiveness of the hybrid approach and also sheds some light on the complementary nature of the lexical and semantic models.
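A common way to realize such hybrid scoring is a simple linear interpolation of the lexical and dense score lists. The sketch below shows this fusion with min-max normalization and a fixed interpolation weight; both choices are illustrative assumptions, since systems such as [137] or CLEAR [78] learn or tune the combination in their own ways.

```python
import numpy as np

def minmax_normalize(scores, eps=1e-8):
    lo, hi = min(scores.values()), max(scores.values())
    return {doc: (s - lo) / (hi - lo + eps) for doc, s in scores.items()}

def hybrid_scores(bm25_scores, dense_scores, lam=0.5):
    """Linearly interpolate (min-max normalized) lexical and dense scores:
    score(d) = lam * bm25(d) + (1 - lam) * dense(d).
    Documents missing from one list get a zero contribution from that list."""
    bm25 = minmax_normalize(bm25_scores)
    dense = minmax_normalize(dense_scores)
    docs = set(bm25) | set(dense)
    return {d: lam * bm25.get(d, 0.0) + (1 - lam) * dense.get(d, 0.0) for d in docs}

# hypothetical score lists from the two retrievers
bm25 = {"d1": 12.3, "d2": 9.1, "d3": 4.4}
dense = {"d2": 0.82, "d3": 0.75, "d4": 0.71}
fused = hybrid_scores(bm25, dense, lam=0.6)
print(sorted(fused.items(), key=lambda kv: -kv[1]))
```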
As described above, neural semantic retrieval models define the query and document representation functions as well as the scoring function within the network structure. These functions are usually learned from data using deep learning technologies. Here, we discuss key topics in the learning of neural semantic retrieval models, including loss functions and negative sampling strategies. We first review the major training objectives adopted by neural semantic retrieval models. Ideally, after the training loss is minimized, all preference relationships between documents should be satisfied and the model will produce the optimal result list for each query. This makes such training objectives effective in many tasks where performance is evaluated based on the ranking of relevant documents. In practice, the most commonly used loss function is the sampled cross-entropy loss, also called the negative log likelihood loss:

L_{NLL}(q, d^+, D^-) = -\log \frac{\exp\big(s(q, d^+)\big)}{\exp\big(s(q, d^+)\big) + \sum_{d^- \in D^-} \exp\big(s(q, d^-)\big)},    (14)

where q denotes a query, d^+ is a relevant document of q, D^- is the set of irrelevant documents for q, and s(·, ·) denotes the relevance score produced by the model. Another commonly used loss function is the hinge loss:

L_{hinge}(q, d^+, D^-) = \frac{1}{N} \sum_{d^- \in D^-} \max\big(0, m - s(q, d^+) + s(q, d^-)\big),    (15)

where q, d^+, and D^- are defined as above, N is the number of documents in D^-, and m is the margin, which is usually set to 1. In fact, the negative log likelihood loss (Eq. (14)) and the hinge loss (Eq. (15)) are also widely used in many other tasks under different names, e.g., the InfoNCE loss in contrastive representation learning [41, 201] and the Bayesian personalized ranking (BPR) loss in recommender systems [175]. These loss functions and their variations have been well studied in other fields, including extreme multi-class classification [15, 24, 174], representation learning [41, 201], deep metric learning [194, 209, 210], etc. The research progress in these fields might provide insights to inspire the loss design in neural semantic retrieval. First of all, Wang et al. [210] showed that the softmax log likelihood loss is actually a smooth version of the hinge loss. Moreover, several works have shown that the concept of a margin in the hinge loss can also be introduced into the softmax cross-entropy loss to improve performance in tasks like face recognition and person re-identification [194, 209]. In addition, works [41, 210, 220] in different domains all verify that applying ℓ2 normalization to the final representations (i.e., using cosine as the score function) together with a temperature can make the learning robust and improve performance. Another line of research focuses on the bias in the sampled softmax cross-entropy loss [42, 101]. For example, works in NLP [24, 101] usually focus on the unbiased estimation of the full softmax, while Chuang et al. [42] focused on correcting the bias introduced by false negative samples that have the same label as the ground truth. It is worth noting that these conclusions need to be re-examined under the first-stage retrieval task. In the loss functions (Eq. (14) and Eq. (15)), the negative example set D^- is an important part of the inputs.
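For concreteness, the two objectives can be computed from model scores as follows. This is a plain numpy sketch operating on pre-computed relevance scores (a trainable implementation would express the same formulas in an autodiff framework), and the toy score values are arbitrary.

```python
import numpy as np

def nll_loss(pos_score, neg_scores):
    """Sampled cross-entropy / negative log likelihood of Eq. (14):
    softmax over the positive score and the sampled negative scores."""
    scores = np.concatenate(([pos_score], neg_scores))
    scores = scores - scores.max()                 # numerical stability
    log_softmax = scores - np.log(np.exp(scores).sum())
    return float(-log_softmax[0])                  # index 0 is the positive document

def hinge_loss(pos_score, neg_scores, margin=1.0):
    """Pairwise hinge loss of Eq. (15), averaged over the sampled negatives."""
    neg_scores = np.asarray(neg_scores, dtype=float)
    return float(np.mean(np.maximum(0.0, margin - pos_score + neg_scores)))

# toy relevance scores s(q, d), e.g., dot products from a dual encoder
print(nll_loss(5.2, [3.1, 4.8, 0.7]))
print(hinge_loss(5.2, [3.1, 4.8, 0.7]))
```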
However, during the learning of first-stage retrieval models, it is often the case that only positive examples are available in the training dataset, while negative examples are not explicitly labeled. In fact, the sampling strategy for negative examples is a crucial topic for neural semantic retrieval models, because it directly determines the quality of the learned retrieval model. Negative sampling is a common and fundamental problem in the learning of many tasks where only positive signals explicitly exist, such as recommender systems [60, 220, 225, 226, 236], graph mining [9, 195, 222], and self-supervised representation learning [27, 37, 41, 86, 238]. Here, we mainly focus on the research progress in the field of the first-stage retrieval. Neural semantic retrieval models vary in their mechanisms for constructing negative examples, but in general, negative sampling strategies can be divided into three categories (a small sampling sketch follows this list).
(1) Random Negative Sampling: randomly sample from the entire corpus [110, 137] or within the batch [81, 87, 88, 232]. It should be noted that, when using the batch as a source of random negatives, the batch size becomes important [81]. Lee et al. [123] suggested using a large batch size, because it makes the training task more difficult and closer to what the retriever observes at test time. However, the batch size is usually restricted by computing resources and cannot be set very large. To address this problem, He et al. [86] proposed to decouple the mini-batch size from the number of sampled negative examples by maintaining a queue of data samples (encoded representations of the current mini-batch are enqueued, and the oldest are dequeued) to provide negative samples. In this way, they can use a very large number of negative samples (e.g., 65,536) in unsupervised visual representation learning. However, random negative sampling is usually sub-optimal for training neural semantic retrieval models. Models can hardly focus on improving top-ranking performance, since random negative samples are usually too easy to distinguish. This problem can lead to serious performance drops in practice. To make the model better at differentiating between similar results, one can use samples that are closer to positive examples in the embedding space as hard negatives for training. Thus, mining hard negative samples to optimize retrieval performance is a key problem that needs to be addressed.
(2) Static Hard Negative Sampling: randomly sample from the top documents pre-retrieved by a traditional retriever [78, 110, 137], such as BM25. Recent research finds that including BM25 negatives helps training convergence by providing stronger contrast for representation learning [110, 137]. Obtaining hard negative samples with pre-retrieval is computationally efficient. However, hard negative samples obtained by static methods are not real hard negatives. Intuitively, strong negatives close to relevant documents in the space of an effective neural retrieval model should differ from those of term-based retrieval models, since the goal of neural semantic retrieval models is to find documents beyond those retrieved by term-based models. If negative samples come only from BM25, there is a severe mismatch between the negatives used to train the retrieval model and those seen at test time.
(3) Dynamic Hard Negative Sampling: randomly sample from the top-ranked irrelevant documents predicted by the retrieval model itself. Intuitively, sampling negatives dynamically according to the current semantic retrieval model, e.g., using a distribution proportional to the relevance scores predicted by the current model, is a very promising choice for producing informative negative samples [166, 195]. In this way, neural semantic retrieval models can optimize themselves using the negative samples they got wrong (i.e., irrelevant documents for which they predicted a high relevance score). However, it is usually impractical to score all candidate documents in a very large corpus on the fly. Thus, in real-world settings, periodically refreshing the index and retrieving top-ranked documents as hard negatives is a more practical compromise [61, 78, 95, 216]. For example, the hard negative mining in [216] elevates the BERT-based Siamese architecture to robustly exceed term-based methods for document retrieval, and to convincingly surpass concurrent neural semantic retrieval models for passage retrieval on OpenQA benchmarks.
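The sketch referenced above combines two of these strategies, mixing in-batch random negatives with static hard negatives drawn from a hypothetical BM25 top-k list. All identifiers are illustrative, and the ratio of easy to hard negatives is an arbitrary choice.

```python
import random

def sample_negatives(batch_doc_ids, bm25_top_ids, positive_ids,
                     num_random=4, num_hard=4, seed=None):
    """Build a negative set that mixes easy in-batch negatives with static hard
    negatives taken from a BM25 top-k list; documents labeled relevant for the
    query are filtered out so they are not treated as negatives."""
    rng = random.Random(seed)
    pos = set(positive_ids)
    easy_pool = [d for d in batch_doc_ids if d not in pos]
    hard_pool = [d for d in bm25_top_ids if d not in pos]
    easy = rng.sample(easy_pool, min(num_random, len(easy_pool)))
    hard = rng.sample(hard_pool, min(num_hard, len(hard_pool)))
    return easy + hard

negatives = sample_negatives(
    batch_doc_ids=["d11", "d27", "d33", "d49", "d52", "d67"],
    bm25_top_ids=["d3", "d7", "d11", "d15", "d21"],
    positive_ids=["d11"],
    num_random=3, num_hard=3, seed=0,
)
print(negatives)
```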
It should be noted that the negative sampling strategies described above are not mutually exclusive. In practice, randomly sampled easy negatives and hard negatives are often used simultaneously. For example, a counterintuitive finding in [95] shows that models trained only with hard negatives cannot outperform models trained with random negatives. The hypothesis is that the presence of easy negatives in the training data is still necessary, since a retrieval model has to operate on an input space comprising data with mixed levels of hardness, and the majority of documents in the collection are easy cases that do not match the query at all. Making all negatives hard would change the representativeness of the training data with respect to the real retrieval task, which might impose a non-trivial bias on the learned embeddings.
Takeaway. Neural semantic retrieval methods learn the representation functions and the scoring function with deep learning technologies. To support fast retrieval, document representations are often learned with standalone networks, and are pre-computed and stored with carefully designed structures. According to how the representations are computed and stored, we summarize neural semantic retrieval methods into three paradigms, i.e., sparse retrieval methods, dense retrieval methods and hybrid retrieval methods.
• Sparse retrieval methods focus on improving classical term-based methods by either learning to re-weight terms with contextual semantics or mapping texts into a "latent word" space. Empirical results show that sparse retrieval methods can indeed improve the performance of the first-stage retrieval, and they are easily integrated with existing inverted indexes for efficient retrieval. Moreover, these methods often show good interpretability, as each dimension of the representation corresponds to a concrete token or a latent word.
• Dense retrieval methods employ the dual-encoder architecture to learn standalone low-dimensional dense vectors for queries and documents, aiming to capture the global semantics of the input texts. To support online services, the learned dense representations are often indexed and searched via approximate nearest neighbor (ANN) algorithms. These methods have shown promising results on several benchmarks (e.g., MS MARCO and TREC CAR) and have attracted increasing attention from researchers.
• Hybrid retrieval methods define multiple representation functions for queries and documents, and then obtain their sparse and dense representations simultaneously for matching. They are able to strike a balance between the fidelity of sparse retrieval methods and the generalization of dense retrieval methods. As a result, hybrid retrieval methods show better performance in practice, but require much larger storage and higher retrieval complexity.
For the learning of neural semantic retrieval models, the negative sampling strategy is decisive for obtaining a high-quality retrieval model. Currently, there have been several works exploring better negative sampling methods, but how to mine negative documents for efficient and effective model learning is still an open problem.
In this section, we discuss some open challenges and several future directions related to semantic models for the first-stage retrieval. Some of these topics are important but have not been well addressed in this field, while others are very promising directions for future research. Starting in 2018, there has been rapid progress on different NLP tasks with the development of large pre-training models, such as BERT [55] and GPT [172]. They are pre-trained on large-scale corpora with general-purpose modeling tasks so that the knowledge can be transferred to a variety of downstream tasks. With this intriguing property, one would expect to repeat these successes for IR tasks. Some researchers [35, 84, 123] have explored pre-training models for the retrieval stage with a dual-encoder architecture. For example, Lee et al. [123] proposed to pre-train a two-tower Transformer encoder model with the Inverse Cloze Task (ICT) to replace BM25 in the passage retrieval stage for the OpenQA task. The advantage is that the retriever can be trained jointly with the reader. Nevertheless, the pre-trained model does not outperform BM25 on the SQuAD dataset, potentially because the fine-tuning is only performed on the query tower. Besides the ICT pre-training task, Chang et al. [35] also proposed the Body First Selection (BFS) and Wiki Link Prediction (WLP) tasks, and studied how various pre-training tasks help the large-scale retrieval problem, e.g., passage retrieval for OpenQA. The experimental results show that, with properly designed paragraph-level pre-training tasks including ICT, BFS, and WLP, the two-tower Transformer encoder model can considerably improve over the widely used BM25 algorithm. Besides, Ma et al. [139, 140] proposed pre-training with the Representative Words Prediction (ROP) task for ad-hoc retrieval, which achieves significant improvements over baselines without pre-training or with other pre-training methods. However, whether the ROP task works for the retrieval stage needs to be re-examined, since their experiments are conducted on re-ranking stages. In summary, there has been little effort to design large pre-training models targeted at the first-stage retrieval task. It is well known that the first-stage retrieval mainly focuses on recalling as many potentially relevant documents as possible. Thus, taking this recall-oriented requirement into account and modeling task-dependent characteristics would be important elements in designing novel pre-training objectives for the retrieval stage. Besides, using cross-modal data (e.g., images) to enhance language understanding is also a promising direction in pre-training research.
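Since the Inverse Cloze Task mentioned above is easy to reproduce, the sketch below shows one way to generate ICT-style pseudo query-context pairs from raw passages. The regex-based sentence splitting and the example passage are simplifying assumptions rather than the preprocessing used in [123].

```python
import random
import re

def ict_pair(passage, seed=0):
    """Generate an Inverse Cloze Task (ICT) style training pair: one sentence is
    removed and used as a pseudo-query, and the remaining sentences form the
    pseudo-relevant context. Sentence splitting is a naive regex heuristic."""
    rng = random.Random(seed)
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", passage) if s.strip()]
    if len(sentences) < 2:
        return None
    i = rng.randrange(len(sentences))
    pseudo_query = sentences[i]
    context = " ".join(sentences[:i] + sentences[i + 1:])
    return pseudo_query, context

passage = ("Dense retrieval encodes queries and documents into vectors. "
           "Relevance is estimated with a simple similarity function. "
           "An ANN index makes the search over large collections efficient.")
print(ict_pair(passage))
```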
For information retrieval tasks, the construction of benchmark datasets often relies on a pooling process to recall a subset of documents for expert judging. Such a labeling process leads to the well-known bias problem, where the dataset contains only part of the positive documents and the rest of the unlabeled documents are oftentimes assumed to be equally irrelevant [110, 232]. To address the bias problem, it is necessary to devise smart learning strategies to achieve effective and efficient model training. For example, Chuang et al. [42] developed a debiased contrastive objective that corrects for the sampling of same-label data points, even without knowledge of the true labels. Next, as discussed in Section 5.4.2, hard negative samples can improve the model's ability to differentiate between similar examples. However, hard negative mining strategies have not been fully explored. One of the state-of-the-art methods is the asynchronous ANCE training proposed by Xiong et al. [216], which periodically refreshes the ANN index and samples top-ranked documents as negatives. Although ANCE is competitive in terms of effectiveness, refreshing the index periodically greatly increases the model training cost (e.g., 10 hours for each period). Besides, some works conclude that it is more effective to learn semantic retrieval models with hard negative samples and easy negative samples simultaneously [231]. Thus, in addition to mining hard negatives, it is also worthwhile to explore arranging the position and order of training samples, since negative documents often show varied levels of difficulty. We believe it would be interesting and valuable to study more complex training strategies, such as curriculum learning [18], to help model optimization for the first-stage retrieval. Moreover, supervised data for IR is always scarce, since it requires much manual labor to obtain. Besides, supervised datasets are prone to long-tail, sparsity and other issues. Thus, weakly supervised or unsupervised learning, e.g., contrastive learning [41, 86], are promising directions. For example, Dai and Callan [50] proposed a content-based weak supervision strategy that exploits the internal structure of documents to mine training labels. The multi-stage retrieval paradigm aims to balance the effectiveness and efficiency of retrieval tasks, where the first-stage retrieval focuses on efficiency and the re-ranking stages pay more attention to effectiveness. But efficiency metrics in isolation are meaningless unless contextualized with corresponding effectiveness measures. Ideally, efficiency metrics at different effectiveness cutoffs should be reported on leaderboards. Moreover, since customized hardware, e.g., GPUs or TPUs, has a significant impact on the computation time of deep models, and the response time of first-stage retrieval models is also infamously sensitive to constraints such as the locality of data on file systems for caching, it is expected that different models be compared under the same conditions. However, fair conditions for model efficiency comparison have not been as fully valued and studied in the IR field as in the computer vision (CV) community. For example, the medical computer vision community has already recognized the need for a focus on run-time considerations: the medical image analysis benchmark VISCERAL [104] includes run-time measurements of participant solutions on the same hardware.
Additionally, CV tasks, such as object detection and tracking, often require real-time results [94]. For IR tasks, Hofstätter and Hanbury [91] put forward a preliminary solution, which makes the comparison of run-time metrics feasible by introducing Docker-based submissions of complete retrieval systems, so that all systems can be compared under the same hardware conditions by a third party. As described in Section 3.2, for IR tasks, indexing schemes play an important role in determining how large-scale document collections are organized and retrieved. Specifically, most dense retrieval methods, which learn dense representations for queries and documents, rely on ANN algorithms to perform efficient vector search for online services [32, 112]. Existing dense retrieval methods always separate the two steps of representation learning and index building. This pattern suffers from a few drawbacks in practical scenarios. Firstly, the indexing process cannot benefit from supervised information, because it uses a task-independent function to build the index. Besides, the representations and the index are obtained separately and thus may not be optimally compatible. These problems can result in severely degraded retrieval performance. In fact, there have been studies [227, 233] exploring the joint training of encoders and indexes in the fields of image retrieval and recommendation. For information retrieval, the design of joint learning schemes for first-stage retrieval models and indexing methods is still in its infancy, and we believe it would be an interesting and promising direction. On the other hand, how to design better ANN algorithms that can manage large-scale document collections and support efficient and precise retrieval is another important direction. Compared with brute-force search, the essence of ANN search is to sacrifice part of the precision for higher retrieval efficiency. Generally, there are two kinds of ANN algorithms from the perspective of improving retrieval efficiency: one is non-exhaustive ANN search methods [22, 144], and the other is vector compression methods [79, 99, 102]. However, each method has its limitations or deficiencies: the non-exhaustive methods have a large index size, and the compression methods have suboptimal performance. Thus, with the booming development of dense retrieval methods, it is urgent to develop more advanced ANN search algorithms that achieve a better balance between efficiency and effectiveness. The purpose of this survey is to summarize the current research status of semantic retrieval models, analyze existing methodologies, and gain some insights for future development. It includes a brief review of early semantic retrieval methods, a detailed description of recent neural semantic retrieval methods, and the connection between them. Specifically, we pay attention to neural semantic retrieval methods and review them in terms of three major paradigms, including sparse retrieval methods, dense retrieval methods and hybrid retrieval methods. We also cover key topics in the learning of neural semantic retrieval models, such as loss functions and negative sampling strategies. In addition, we discuss several challenges and promising directions that are important for future research. We look forward to working with the community on these issues. We hope this survey can help researchers who are interested in this direction, and will motivate new ideas by looking at past successes and failures.
Semantic retrieval models are part of the broader research field of neural IR, which is a joint domain of deep learning and IR technologies with many opportunities for new research and applications. We are expecting that, through the effort of the community, significant breakthroughs will be achieved for the first-stage retrieval problem in the near future, similar to those that have happened in re-ranking stages.
REFERENCES
A survey on nearest neighbor search methods UMass at TREC 2004: Novelty and HARD Document expansion based on WordNet for robust IR Learning Unsupervised Knowledge-Enhanced Representations to Reduce the Semantic Gap in Information Retrieval Analysis of the paragraph vector model for information retrieval Improving language estimation with the paragraph vector model for ad-hoc retrieval Probabilistic Models of Information Retrieval Based on Measuring the Divergence from Randomness Nearest neighbor search: the old, the new, and the impossible Robust Negative Sampling for Network Embedding Latent semantic indexing (LSI) fails for TREC collections ANN-benchmarks: A benchmarking tool for approximate nearest neighbor algorithms Modern Information Retrieval: The Concepts and Technology behind Search Using Query Contexts in Information Retrieval SparTerm: Learning Term-based Sparse Representation for Fast Text Retrieval Extreme Classification via Adversarial Softmax Approximation Modeling of the Question Answering Task in the YodaQA System Shape indexing using approximate nearest-neighbour search in highdimensional spaces Curriculum learning Multidimensional binary search trees used for associative searching Semantic Parsing on Freebase from Question-Answer Pairs Information Retrieval as Statistical Translation Annoy: Approximate nearest neighbors in c++/python.
Python package version Document expansion versus query expansion for ad-hoc retrieval Adaptive Sampled Softmax with Kernel Based Sampling Latent Dirichlet Allocation Enriching Word Vectors with Subword Information Adversarial Contrastive Estimation Off the Beaten Path: Let's Replace Term-Based Retrieval with k-NN Search Hypergeometric Language Model and Zipf-like Scoring Function for Web Document Similarity Retrieval Signature Verification using a "Siamese" Time Delay Neural Network Question answering from frequently asked question files: Experiences with the faq finder system A Discriminative Semantic Ranker for Question Retrieval Selecting good expansion terms for pseudorelevance feedback DeFormer: Decomposing Pre-trained Transformers for Faster Question Answering Pre-training Tasks for Embedding-based Large-scale Retrieval Reading Wikipedia to Answer Open-Domain Questions Improving Negative Sampling for Word Representation Using Self-Embedded Features SPTAG: A library for fast approximate nearest neighbor search SPTAG: A library for fast approximate nearest neighbor search Efficient Cost-Aware Cascade Ranking in Multi-Stage Retrieval A simple framework for contrastive learning of visual representations Debiased Contrastive Learning Overview of the TREC Overview of the trec 2009 web track Aggregating continuous word embeddings for information retrieval Reducing the Risk of Query Expansion via Robust Constrained Optimization Overview of the trec 2019 deep learning track Attention-over-Attention Neural Networks for Reading Comprehension Context-aware sentence/passage term importance estimation for first stage retrieval Context-Aware Document Term Weighting for Ad-Hoc Search Context-Aware Term Weighting For First Stage Passage Retrieval Convolutional Neural Networks for Soft-Matching N-Grams in Ad-Hoc Search Locality-sensitive hashing scheme based on p-stable distributions Indexing by latent semantic analysis BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding Quasar: Datasets for question answering by search and reading Regularizing Ad Hoc Retrieval Scores Query Expansion with Locally-Trained Word Embeddings TREC Complex Answer Retrieval Overview Reinforced Negative Sampling for Recommendation with Exposure Data RocketQA: An Optimized Training Approach to Dense Passage Retrieval for Open-Domain Question Answering Learning Hybrid Representations to Retrieve Semantically Equivalent Questions SearchQA: A New Q&A Dataset Augmented with Context from a Search Engine Themis Palpanas, and Houda Benbrahim. 2020. 