Title: DuReader_retrieval: A Large-scale Chinese Benchmark for Passage Retrieval from Web Search Engine
Authors: Qiu, Yifu; Li, Hongyu; Qu, Yingqi; Chen, Ying; She, Qiaoqiao; Liu, Jing; Wu, Hua; Wang, Haifeng
Date: 2022-03-19

Abstract: In this paper, we present DuReader_retrieval, a large-scale Chinese dataset for passage retrieval. DuReader_retrieval contains more than 90K queries and over 8M unique passages from Baidu search. To ensure the quality of our benchmark and address the shortcomings of existing datasets, we (1) reduce the false negatives in the development and testing sets by pooling the results from multiple retrievers and adding human annotations, and (2) remove the training queries that are semantically similar to the development and testing queries. Additionally, we provide two out-of-domain testing sets for cross-domain evaluation, as well as a manually translated cross-lingual set for cross-lingual retrieval. The experiments demonstrate that DuReader_retrieval is challenging and that there is still plenty of room for improvement, e.g., on salient phrase and syntax mismatches between the query and the passage. The results also show that the dense retriever does not generalize well across domains, and that cross-lingual retrieval remains essentially challenging. DuReader_retrieval will be publicly available at https://github.com/baidu/DuReader/tree/master/DuReader-Retrieval.

Passage retrieval requires systems to select candidate passages from a large passage collection. In recent years, pre-trained language models have been applied to retrieval problems, an approach known as dense retrieval (Karpukhin et al., 2020; Qu et al., 2021; Zhan et al., 2021). The success of dense retrieval relies on corpora with high-quality annotation and considerable scale. For the English community, there are several popular datasets such as MS-MARCO (Nguyen et al., 2016), TriviaQA (Joshi et al., 2017), and Natural Questions (Kwiatkowski et al., 2019). For non-English communities (e.g., Chinese), the existing datasets are neither large-scale nor human-generated: TianGong-PDR has only 70 questions and 11K passages, the large-scale multilingual dataset mMARCO (Bonifacio et al., 2021) is constructed via machine translation from MS-MARCO, and the Chinese Sogou-QCL (Zheng et al., 2018) is developed solely from click logs of web data.

(† The work was done when the first author was an intern at Baidu.)

In this paper, we present DuReader_retrieval, a large-scale Chinese dataset for passage retrieval from a web search engine. The dataset contains more than 90K queries and over 8M unique passages. All queries are issued by real users of Baidu Search*, and the passages come from documents returned in the search results. Similar to Karpukhin et al. (2020), we create DuReader_retrieval from DuReader (He et al., 2018), a Chinese machine reading comprehension dataset, and obtain the passage labels via distant supervision (see Section 2.2). An example from DuReader_retrieval is shown in Table 1, and a comparison between different datasets is shown in Table 2.

Additionally, recent works point out two major shortcomings of the development and testing sets in existing datasets:
• Arabzadeh et al. (2021) and Qu et al. (2021) observe that false negatives (i.e., relevant passages that are labelled as negatives) are a common issue in passage retrieval datasets, due to their large scale but limited human annotation. As a result, the top items retrieved by models may be superior to the labelled positives, and the benchmark may fail to identify model improvements.
• Prior work finds that 30% of the test-set queries in common machine reading comprehension datasets (Kwiatkowski et al., 2019; Joshi et al., 2017) have a near-duplicate paraphrase in their corresponding training sets, thus leaking testing information into model training. A similar issue has been observed in MS-MARCO (Zhan et al., 2022).

In the construction of DuReader_retrieval, we improve the quality of the development and testing sets in the following two ways (see Section 2.3):
• To reduce the false negatives in the development and testing sets, we invite an internal data team to manually check and relabel the passages in the top retrieved results pooled from multiple retrievers.
• To alleviate the leakage of testing information into model training through semantically similar queries, we use the query matching model from (Zhu et al., 2021) to identify and remove the training queries that are semantically similar to the development and testing queries.

Moreover, inspired by prior work, we provide two out-of-domain testing sets (see Section 2.4), from the medical and COVID-19 news domains (cMedQA and cCOVID-News†), as separate testing sets for out-of-domain evaluation. We also provide a human-annotated cross-lingual set (Asai et al., 2021a; Sun and Duh, 2020) for assessing retrievers in the cross-lingual scenario (see Section 2.5).

(† Available at www.datafountain.cn/competitions/424)

Our in-domain experiments show that there is still plenty of room for the baselines to improve, e.g., on the challenges of salient phrase mismatch and syntax mismatch (see Section 3.5). It is also difficult for dense retrievers to generalize well across different domains, as we observe in the out-of-domain experiments (see Section 3.6). Finally, the cross-lingual experiments indicate that cross-lingual retrieval is essentially challenging (see Section 3.7).

We summarize the characteristics of our dataset and our contributions as follows:
• We present a large-scale Chinese dataset, built from Baidu Search, for benchmarking passage retrieval models. Our dataset comprises more than 90K queries and more than 8M unique passages.
• We leverage two strategies to improve the quality of our benchmark and address the shortcomings of existing datasets: reducing the false negatives with human annotations on pooled retrieval results, and removing the training queries semantically similar to the development and testing queries.
• We introduce two extra out-of-domain testing sets to evaluate domain generalization capability, and a cross-lingual set to assess cross-lingual retrievers.
• We conduct extensive experiments, and the results demonstrate that the dataset is challenging and that there is still plenty of room for improvement.

In this section, we introduce our DuReader_retrieval dataset (see dataset statistics in Table 3).

Table 2: Summary of data statistics for passage retrieval datasets. The passage annotations for TriviaQA and Natural Questions are presented in (Karpukhin et al., 2020). Compared with other works, the instances in DuReader_retrieval come from user logs in web search. Its annotation consists of distant supervision (Dist. Sup.)
for training data and manual annotations (Human) for the development and testing sets.

We first formally define the passage retrieval task in Section 2.1. We then introduce how we initially construct our dataset from DuReader via distant supervision in Section 2.2. Our strategies for further improving the data quality, by reducing the false negatives in the development and testing sets and by removing the training queries semantically similar to the development and testing queries, are discussed in Section 2.3. Finally, we introduce two out-of-domain testing sets in Section 2.4 and a cross-lingual set in Section 2.5.

DuReader_retrieval is created for the task of passage retrieval, that is, retrieving a list of relevant passages in response to a query. Formally, given a query q and a large passage collection P, a retrieval system F is required to return the top-K relevant passages P_K(q), where K is a manually defined number. Ideally, all passages in P that are relevant to q should be included and ranked as high as possible in the retrieved results P_K(q).

DuReader_retrieval is developed based on the Chinese machine reading comprehension dataset DuReader (He et al., 2018). All queries in DuReader are posed by users of Baidu Search, and document-level contexts are gathered from the search results. Each instance in DuReader is a tuple <q, t, D, A>, where q is a query, t is a query type, D is the set of top-5 documents returned by Baidu Search, each constituted by its paragraphs, and A is the set of answers written by human annotators, rather than spans extracted from the documents as in MS-MARCO (Nguyen et al., 2016).

In this section, we first describe our approach to labelling the positive passages. We then discuss how we deal with two challenges in constructing the retrieval dataset from the raw DuReader: 1) the paragraphs are too short to provide meaningful context; and 2) the high term overlap between the query and the document title may make passage retrieval too easy.

Following MS-MARCO Passage Ranking (Nguyen et al., 2016), we use the human-written answers from DuReader (He et al., 2018) to label the positive passages by distant supervision, i.e., a paragraph is positive if it contains any human-written answer. Specifically, we leverage a span-level F1 score to softly match the human-written answers with the paragraphs in the documents. If any span-answer pair has an F1 score higher than the threshold (0.5), we label the paragraph as positive. We show the details of our annotation procedure in Algorithm 1.

Algorithm 1: Span-level F1 Annotation for Positives. Input: (p, a), where p is a candidate paragraph, a is an answer, and τ is the threshold for positive labelling. Output: l_p ∈ {0, 1}, where 0 and 1 denote p being labelled as negative and positive, respectively.

Additionally, due to the variety of original paragraph lengths in DuReader, most paragraphs are too short to form meaningful contexts. We therefore concatenate the paragraphs of each document in DuReader with the following rules: 1) if the length of the document is less than 256 Chinese characters, we treat the entire document as a single passage; 2) for each paragraph in the document, if its length is less than 256, we concatenate it with the following paragraphs until the length exceeds 256 Chinese characters. The concatenated passage is labelled as positive if any paragraph before the concatenation is positive. After this processing, the median and mean passage lengths are 304 and 272 characters, respectively.
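To make the distant-supervision labelling (Algorithm 1) and the concatenation rules above concrete, the following Python sketch shows one possible implementation. The character-level F1, the sliding-window span enumeration, and the helper names (f1_score, label_paragraph, concatenate_paragraphs) are illustrative assumptions; the paper itself only fixes the 0.5 threshold and the 256-character rule.

```python
# Sketch of the distant-supervision labelling (Algorithm 1) and the paragraph
# concatenation rules. Character-level F1 and sliding-window span enumeration
# are assumptions; only the 0.5 threshold and 256-character rule come from the paper.
from collections import Counter
from typing import List, Tuple

def f1_score(span: str, answer: str) -> float:
    """Character-level F1 between a candidate span and a human-written answer."""
    common = Counter(span) & Counter(answer)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(span)
    recall = overlap / len(answer)
    return 2 * precision * recall / (precision + recall)

def label_paragraph(paragraph: str, answers: List[str], tau: float = 0.5) -> int:
    """Return 1 (positive) if any span of the paragraph matches any answer with F1 > tau."""
    for answer in answers:
        window = len(answer)
        for start in range(0, max(1, len(paragraph) - window + 1)):
            if f1_score(paragraph[start:start + window], answer) > tau:
                return 1
    return 0

def concatenate_paragraphs(paragraphs: List[str], labels: List[int],
                           min_len: int = 256) -> List[Tuple[str, int]]:
    """Merge consecutive paragraphs until each passage reaches min_len characters.
    A merged passage is positive if any of its source paragraphs is positive."""
    if sum(len(p) for p in paragraphs) < min_len:
        # Rule 1: short documents become a single passage.
        return [("".join(paragraphs), int(any(labels)))]
    passages, buffer, buffer_label = [], "", 0
    for p, l in zip(paragraphs, labels):
        buffer += p
        buffer_label = max(buffer_label, l)
        if len(buffer) >= min_len:
            passages.append((buffer, buffer_label))
            buffer, buffer_label = "", 0
    if buffer:  # attach any leftover tail to the last passage
        last_text, last_label = passages.pop()
        passages.append((last_text + buffer, max(last_label, buffer_label)))
    return passages
```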
Removing Document Titles Finally, we discard all document titles, since we observe a high term overlap between queries and document titles for many data instances. Otherwise, retrieval systems could easily match the query against the document title and achieve high performance. Instead, we expect retrievers to capture the full contextual information in passages in order to answer queries.

We further design two strategies to address the shortcomings of other existing datasets and improve the quality of the development and testing sets in DuReader_retrieval.

Reducing False Negatives A common issue in existing passage retrieval datasets (Qu et al., 2021; Arabzadeh et al., 2021) is that the development and testing sets contain false negatives, i.e., passages that are relevant to the query but not labelled as positives. In this section, we discuss our strategy for reducing the number of false negatives in the development and testing sets of DuReader_retrieval. As a complement to the distant supervision labelling discussed in Section 2.2, we invite an internal data team to manually check the annotations in the development and testing sets and relabel the passages where necessary. To avoid inductive bias in our annotation process, we apply a pooling technique to select candidate passages for annotation, following the TREC-style evaluation framework (Voorhees et al., 2005). For each query, we collect the top passages retrieved by a set of contributing retrievers and ask human annotators to label the pooled candidates. In particular, the annotator is presented with a query and the top-5 passages pooled from five retrieval systems. We use BM25 and four neural retrievers initialized from ERNIE (Sun et al., 2019), BERT (Cui et al., 2021), RoBERTa, and MacBERT (Cui et al., 2020) as our contributing retrievers, and combine their top-50 retrieved passages as candidates. An ensembled re-ranker is then used to select the top-5 passages for human annotation (see Appendix A.1 for implementation details). To ensure data quality, we perform all annotations on our internal annotation platform; please refer to Appendix A.3 for the annotation settings and quality control. After adopting this strategy, the average number of positive passages per query increases from 2.43 to 4.91, and 71.53% of the queries have at least one false negative relabelled by the internal data team. These results show that there are many false negatives in the raw corpus derived directly from DuReader.

Removing Similar Queries Retrieval models should avoid merely memorizing queries and their relevant items from the training set and directly applying such memorization during inference. Prior work finds that in some popular datasets, including Natural Questions (Kwiatkowski et al., 2019), WebQuestions (Berant et al., 2013) and TriviaQA (Joshi et al., 2017), 30% of the test-set queries have a near-duplicate paraphrase in their corresponding training sets, which leaks testing information into model training. In this paper, we use a model-based approach to remove the training queries that are semantically similar to the development and testing queries. We use the query matching model from (Zhu et al., 2021), which computes a similarity score in [0, 1] for a query pair. We set a threshold of 0.5: if the similarity between a training query and a development or testing query is higher than 0.5, we mark the query pair as semantically similar.
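A minimal sketch of this filtering step is given below. The similarity function is a stand-in for the query matching model of (Zhu et al., 2021), and the brute-force pairwise comparison is an assumption for illustration rather than the authors' exact pipeline.

```python
# Sketch of removing training queries that are semantically similar to
# development/testing queries. `similarity` stands in for the query matching
# model of Zhu et al. (2021); the pairwise loop is an illustrative assumption.
from typing import Callable, List

def filter_training_queries(
    train_queries: List[str],
    eval_queries: List[str],
    similarity: Callable[[str, str], float],
    threshold: float = 0.5,
) -> List[str]:
    kept = []
    for tq in train_queries:
        # Drop the training query if it is too similar to any dev/test query.
        if any(similarity(tq, eq) > threshold for eq in eval_queries):
            continue
        kept.append(tq)
    return kept
```

In practice the pairwise comparison would be batched or approximated for efficiency, but the threshold logic is the same.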
There are 566 training queries that are semantically similar to 387 queries in the development and testing sets, accounting for approximately 6.45% of the development and testing queries. All 566 of these training instances are removed from DuReader_retrieval.

Recent work reveals that dense retrievers are weak at generalization. We therefore provide two testing sets for assessing the cross-domain generalization ability of retrievers in DuReader_retrieval. We carefully select two publicly available datasets for Chinese text retrieval: cMedQA, from online medical consultation, and cCOVID-News, from COVID-19 news articles. We randomly select 949 and 3,999 samples from cCOVID-News and cMedQA, respectively, as our out-of-domain samples.

Cross-lingual passage retrieval has recently received much attention (Shi et al., 2021; Asai et al., 2021b); it aims to retrieve passages in a target language (e.g., Chinese) in response to a query in a source language (e.g., English). In DuReader_retrieval, we provide a cross-lingual retrieval set that contains English queries paired with Chinese positive passages. The numbers of training/development/testing English queries are 9.5K/4K/2K, respectively. All English queries in our cross-lingual set are human annotated, and the passage annotations are aligned with DuReader_retrieval. To translate the Chinese queries into English, we ask the internal data team to manually check the quality of the machine-translated queries and modify them if necessary. The quality controls for our cross-lingual annotations are the same as those for our previous human annotations of the development and testing sets (see Appendix A.3), except that annotators are asked to check the consistency between the queries and their machine-translated outputs.

Figure 1: Illustration of the training procedure for our one dual-encoder retriever and two cross-encoder re-rankers. We train our first retriever and re-ranker with negatives sampled from BM25's output, as in (Karpukhin et al., 2020). We further try the strategy of (Qu et al., 2021), sampling negatives from the dual-encoder retriever to enhance the cross-encoder re-ranker.

We use the recent two-stage framework (retrieval-then-re-rank) (Dang et al., 2013; Qu et al., 2021) for passage retrieval and evaluate two retrieval and two re-ranking models on our DuReader_retrieval dataset. In particular, we utilize the dual-encoder and cross-encoder architectures of RocketQA (Qu et al., 2021) to develop our neural retrievers and re-rankers. We introduce the baselines as follows.

BM25 BM25 is a sparse retrieval baseline (Robertson and Zaragoza, 2009).

DE w/ BM25 Neg Karpukhin et al. (2020) show that hard negatives from BM25 are more effective for training dense retrievers than in-batch random negatives. With BM25's hard negatives, we train a dual-encoder as our first neural retriever.

CE w/ BM25 Neg We use BM25's hard negatives to train a cross-encoder as our first neural re-ranker.

CE w/ DE Neg CE w/ DE Neg is our second, enhanced re-ranker. Following Qu et al. (2021), we initialize its parameters from CE w/ BM25 Neg and use DE w/ BM25 Neg to retrieve negatives from the entire passage collection.

The relationships among our neural retrievers and re-rankers are shown in Figure 1.

Table 5: Performance of re-ranking models on the testing set of DuReader_retrieval. We present re-ranking results based on two retrieval models, BM25 and DE w/ BM25 Neg.
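To illustrate the difference between the two architectures used by these baselines, here is a minimal Python sketch of their scoring functions. The encoders are stand-ins for the ERNIE-based models trained with PaddlePaddle/RocketQA in this paper; only the scoring structure is shown, not the training procedure.

```python
# Minimal sketch contrasting the dual-encoder (retrieval) and cross-encoder
# (re-ranking) scoring used by the baselines. `encode_query`, `encode_passage`,
# and `score_pair` stand in for the actual ERNIE-based models.
import numpy as np
from typing import Callable, List

def dual_encoder_scores(
    query: str,
    passages: List[str],
    encode_query: Callable[[str], np.ndarray],
    encode_passage: Callable[[str], np.ndarray],
) -> np.ndarray:
    """First stage: query and passages are encoded independently, so passage
    vectors can be pre-computed and indexed; relevance is a dot product."""
    q_vec = encode_query(query)                              # shape: (d,)
    p_mat = np.stack([encode_passage(p) for p in passages])  # shape: (n, d)
    return p_mat @ q_vec                                     # shape: (n,)

def cross_encoder_scores(
    query: str,
    passages: List[str],
    score_pair: Callable[[str, str], float],
) -> np.ndarray:
    """Second stage: each (query, passage) pair is fed jointly to the encoder,
    which is more accurate but too slow to run over the full collection."""
    return np.array([score_pair(query, p) for p in passages])
```

The key trade-off is that dual-encoder passage vectors can be pre-computed and indexed for the full 8M-passage collection, whereas the cross-encoder scores each query-passage pair jointly and is therefore only applied to the top candidates.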
We use the following evaluation metrics in our experiments: (1) Mean Reciprocal Rank over the top-10 retrieved items (MRR@10), (2) Recall of the top-1 retrieved items (Recall@1), and (3) Recall of the top-50 retrieved items (Recall@50). Recall@50 is more suitable for evaluating the first-stage retrievers, while MRR@10 and Recall@1 are more suitable for assessing the second-stage re-rankers.

We report the in-domain baseline performance of the first-stage retrievers in Table 4. As expected, DE w/ BM25 Neg outperforms the traditional BM25 system on all metrics, thanks to the expressive power of the neural encoder. We then report the in-domain baseline performance of the second-stage re-rankers in Table 5. We observe that training the re-ranker with hard negatives sampled from the neural retriever's top predictions outperforms training with negatives sampled from BM25's retrieved results in terms of MRR@10 and Recall@1.

In this section, we examine the effects of our strategies for improving the data quality of DuReader_retrieval, as described in Section 2.3.

Reducing False Negatives We test three models, including BM25, a dense retrieval model (DE w/ BM25 Neg), and a re-ranking model (CE w/ BM25 Neg) applied to BM25's top-50 retrieved results, to quantify the impact of our strategy for reducing false negatives. Specifically, we compare the performance of the same model on the development set with and without the reduction of false negatives. As shown in Figure 2, all metrics of the three models improve significantly after adopting our strategy. These results suggest that there are many false negatives in the raw retrieval dataset derived from DuReader, and that our strategy successfully captures and reduces false negatives in the development and testing sets.

Table 6: Comparison of models trained on two groups of training data: 1) CE w/ Sim. Q: training data that retains the queries semantically similar to the development and testing queries; 2) CE w/o Sim. Q: training data with those queries removed. We evaluate the two models on the duplicated queries (Duplicated) and on the rest of the development queries (Others), respectively. All top-50 retrieval results are based on BM25. We bold the best model in each column.

Removing Similar Queries We conduct an experiment to quantify the effect of removing the training queries that are semantically similar to the development and testing queries. We train our re-ranking model (CE w/ BM25 Neg) on the training data without (CE w/o Sim. Q) and with (CE w/ Sim. Q) the semantically duplicated queries, respectively. We then test both models on all 387 semantically duplicated queries (Duplicated) in the development and testing sets, as well as on the rest of the development set (Others). We use BM25's top-50 retrieved results as the candidates for the re-ranking models. As shown in Table 6, comparing the two models on Others, we find that the model trained without the queries semantically similar to the development and testing queries (CE w/o Sim. Q) performs better in terms of MRR@10 and Recall@1. However, the model trained with those semantically similar queries (CE w/ Sim. Q) performs better on Duplicated. This suggests that using semantically similar queries in training may allow the model to simply memorize the data during training and thereby achieve better performance during inference.
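For reference, the evaluation metrics used in this section can be computed as in the sketch below. The input format and the handling of queries without labelled positives are assumptions; note also that some papers define Recall@1 as a hit rate rather than normalizing by the number of positives, so the normalization shown here is an assumption as well.

```python
# Sketch of the evaluation metrics used above. `ranked` is the list of passage
# ids returned for one query, `positives` the set of relevant ids. Scores are
# averaged over queries; queries without positives simply contribute 0 here.
from typing import List, Set

def mrr_at_k(ranked: List[str], positives: Set[str], k: int = 10) -> float:
    for rank, pid in enumerate(ranked[:k], start=1):
        if pid in positives:
            return 1.0 / rank
    return 0.0

def recall_at_k(ranked: List[str], positives: Set[str], k: int) -> float:
    if not positives:
        return 0.0
    hits = sum(1 for pid in ranked[:k] if pid in positives)
    return hits / len(positives)
```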
In this section, we analyze the results of our best baseline system (i.e., retrieving the top-50 passages with DE w/ BM25 Neg and then re-ranking them with CE w/ DE Neg) to better understand the specific challenges and limitations of DuReader_retrieval. Specifically, we manually analyze 500 query-passage predictions of the baseline; the 500 query-passage pairs consist of 100 randomly selected development queries together with the top-5 passages retrieved and re-ranked by the baseline. We ensure that the top-5 passages of these 100 queries contain no positive passages.

Salient Phrase Mismatch We observe that the mismatch of salient phrases between the query and the retrieved passages is particularly challenging for the baseline system, as has also been found in prior work, accounting for 53.4% of the incorrect predictions. We further divide salient phrase mismatch into several subcategories, focusing on entities, numerical values, and modifiers: entity mismatch, numerical mismatch, and modifier mismatch. We show examples and explanations in Table 10 in Appendix A.4.

Syntax Mismatch We also observe that around 1% of the predictions have a subject-object relation mismatch between the query and the passages. The case in Table 10 in Appendix A.4 suggests that it is difficult for the baseline system to maintain consistency in the syntactic relationships between the query and the top-predicted passages.

Other Challenges We also highlight two other typical challenges, accounting for 22.6% of the incorrect predictions: 1) term over-sensitivity: the baseline system can be over-sensitive and retrieve passages simply because they share some low-frequency terms with the query; 2) robustness to typos: the baseline system is not always robust against typos in queries or passages. Note that our dataset is constructed from real query logs of Baidu Search; the noise introduced by low-frequency terms or user typos challenges the robustness of the baseline system.

We also notice that about 14.8% of the analyzed errors are still false negatives. This suggests that, despite the success of our strategy in Section 2.3 in reducing false negatives in the development and testing sets to some extent, the presence of false negatives remains a challenge in building a high-quality passage retrieval benchmark.

In this experiment, we test the out-of-domain (OOD) generalization ability of our dense retriever (DE w/ BM25 Neg) on the two OOD testing sets. We report results in two settings: 1) zero-shot, where we directly test DE w/ BM25 Neg without fine-tuning; and 2) fine-tuning, where we fine-tune DE w/ BM25 Neg with data from the target domain and evaluate it on the OOD testing sets. The performance of the fine-tuned models is an estimate of the upper bound that DE w/ BM25 Neg can achieve on the OOD testing sets.

Table 9: Monolingual (retrieving Chinese passages with Chinese queries) and cross-lingual (retrieving Chinese passages with English queries) performance of the dual-encoder retrievers in our cross-lingual evaluation. We report the Recall@50 score for each retrieval model.

In Tables 7 and 8, we summarize the results of the out-of-domain experiments. First, we notice that the performance of the dense retriever degrades substantially on the two OOD testing sets. According to the in-domain evaluation (see Tables 4 and 5), the dense retriever considerably outperforms BM25; however, it has no clear advantage over BM25 in the zero-shot setting, and is sometimes even worse. In addition, the dense retriever can be significantly improved by fine-tuning.
It can maintain a large advantage over BM25 after fine-tuning on the target domain. These results show that the dense retriever has limited domain adaptation capability, as observed in prior work.

We experiment with three dual-encoder baselines based on multilingual BERT (mBERT) on our cross-lingual set. Supervised Model is our first baseline, which directly fine-tunes mBERT on the parallel data of English queries paired with Chinese passages. We also report the performance of two models in a transfer learning setting. We fine-tune an mBERT retriever on the full monolingual Chinese training data (i.e., 86K Chinese queries with Chinese positive passages in DuReader_retrieval) and test it directly on the cross-lingual task; this is the Zero-shot Model. We further fine-tune the Zero-shot Model on the parallel data of English queries and Chinese passages and test it on the cross-lingual task; this is the Transferred Model.

As shown in Table 9, we first notice that the Zero-shot Model is less effective than the Supervised Model. Second, the Zero-shot Model performs significantly worse on cross-lingual data than on monolingual data. According to these findings, cross-lingual retrieval is more difficult than monolingual retrieval, since the retriever cannot find relevant passages by simply matching terms shared between queries and passages (Litschko et al., 2021). Instead, cross-lingual retrievers must capture the semantic relevance between the query and the passages to support their predictions. Furthermore, the Transferred Model outperforms the other baselines, demonstrating the validity of transferring knowledge from the monolingual Chinese annotated data.

Passage Retrieval Benchmarks. Passage retrieval and open-domain question answering are challenging tasks that have attracted much attention to benchmark development in the community. MS-MARCO (Nguyen et al., 2016) contains queries extracted from the search log of Microsoft Bing, and poses challenges in both retrieving relevant contextual items and reading comprehension based on the retrieved context. Natural Questions (Kwiatkowski et al., 2019) is an open-domain question answering benchmark consisting of real queries issued to the Google search engine. However, prior work finds that 30% of the test-set queries of these popular datasets semantically overlap with their training queries. For Chinese, DuReader (He et al., 2018) was the first large-scale dataset focusing on Chinese open-domain question answering. TianGong-PDR and Sogou-QCL (Zheng et al., 2018) are two retrieval datasets for news documents and web pages, respectively.

Dense Retrieval Models. Information retrieval is a long-standing problem in both the NLP and web science communities. In contrast to traditional retrieval methods (Salton and Buckley, 1988; Robertson and Zaragoza, 2009), recent neural retrievers encode the query and the retrieved items as contextualized representations with pre-trained language models (Sun et al., 2019), and then calculate relevance with a similarity function (Karpukhin et al., 2020; Luan et al., 2021; Qu et al., 2021). Based on their learning paradigms, neural retrieval systems can be divided into two categories: 1) unsupervised, pre-training the retriever without annotated data; and 2) supervised, training the query and item encoder(s) by contrasting positives against designed negatives (Karpukhin et al., 2020; Xiong et al., 2021; Zhan et al., 2021).
In terms of architecture, recent systems typically follow the two-stage framework (retrieval-then-re-ranking), in which a retriever (Nogueira et al., 2019; Dai and Callan, 2019) first retrieves a list of top candidates and a re-ranker (Gao et al., 2020; Khattab and Zaharia, 2020) finalizes the ranking of the retrieved items.

This paper presents a large-scale Chinese passage retrieval dataset to benchmark retrieval systems. We employ two strategies to control the quality of our dataset: 1) reducing the false negatives in the development and testing sets with our pooling strategy and human annotations, and 2) removing the training queries that are semantically similar to the development and testing queries. We further provide two out-of-domain testing sets for out-of-domain evaluation, and a cross-lingual set for cross-lingual evaluation. We examine several retrieval baselines, including a traditional sparse retrieval system and neural retrievers, and present the challenges and limitations of our dataset. We hope this dataset can help facilitate research on passage retrieval.

For the ensembled re-ranker (Appendix A.1), we first use four different pre-trained models, ERNIE (Sun et al., 2019), BERT (Cui et al., 2021), RoBERTa, and MacBERT (Cui et al., 2020), as initializations to train four cross-encoder re-rankers as in (Qu et al., 2021), with negatives sampled from the pooled passages discussed in Section 2.3. We then ensemble these four re-ranking models by averaging their prediction scores for each query-passage pair.

We conduct all experiments with the deep learning framework PaddlePaddle on up to eight NVIDIA Tesla A100 GPUs (with 40GB RAM). We use ERNIE 1.0 base (Sun et al., 2019) as the initialization for both our first dual-encoder retriever (DE w/ BM25 Neg) and the cross-encoder re-ranker (CE w/ BM25 Neg). ERNIE shares the same architecture as BERT but is trained with entity-level and phrase-level masking to obtain better knowledge-enhanced representations. To train our second, enhanced re-ranker (CE w/ DE Neg), we use the parameters of CE w/ BM25 Neg as the initialization. For training, we also use the cross-batch negatives setting as in (Qu et al., 2021). When sampling hard negatives from the top-50 retrieved items, we sample 4 negatives per positive passage. The dual-encoders are trained with a batch size of 256 and the cross-encoders with a batch size of 64, for 10 and 3 epochs, respectively. We use the Adam optimizer for all models; the learning rate of the dual-encoder is set to 3e-5 with a linear-schedule warm-up rate of 0.1, while the learning rate of the cross-encoder is set to 1e-5 with no warm-up. We set the maximum lengths of questions and passages to 32 and 384, respectively. At inference time, our dense retrieval model (DE w/ BM25 Neg) uses FAISS (Johnson et al., 2019) to index the dense representations of all passages.
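A minimal sketch of this indexing and retrieval step with FAISS is shown below. The flat inner-product index, the file names, and the embedding shapes are assumptions; the paper only states that FAISS is used to index the passage representations.

```python
# Sketch of dense retrieval inference with FAISS. The exact (flat) inner-product
# index and the embedding files are assumptions; embeddings would come from the
# trained dual-encoder.
import numpy as np
import faiss

def build_index(passage_embeddings: np.ndarray) -> faiss.IndexFlatIP:
    """passage_embeddings: array of shape (num_passages, dim)."""
    embeddings = passage_embeddings.astype("float32")
    index = faiss.IndexFlatIP(int(embeddings.shape[1]))  # exact inner-product search
    index.add(embeddings)
    return index

def retrieve(index: faiss.IndexFlatIP, query_embeddings: np.ndarray, k: int = 50):
    """Return (scores, passage_ids) of the top-k passages for each query."""
    scores, ids = index.search(query_embeddings.astype("float32"), k)
    return scores, ids

# Usage sketch (hypothetical file names):
# index = build_index(np.load("passage_embeddings.npy"))
# scores, ids = retrieve(index, np.load("query_embeddings.npy"), k=50)
```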
For annotation settings and quality control (Appendix A.3), we perform all annotations on our internal annotation platform, where all annotators and reviewers are full-time employees. The pairs of all queries and their pooled top-5 paragraphs retrieved by all models are divided into packages of 1K samples each. Annotators are asked to judge whether each query-paragraph pair in a package is relevant. At least two reviewers then check the accuracy of each package by independently reviewing 100 random query-paragraph pairs. If the average accuracy is below the threshold (i.e., 93%), the annotators are asked to revise the package until the accuracy exceeds the threshold.

We present selected cases in Table 10 and discuss them here (Appendix A.4) to support our error analysis in Section 3.5.

Salient Phrase Mismatch Taking entity mismatch as an example, we expect the main entity in the retrieved passage to be consistent with the query. However, the second example in Table 10 shows that the query asks for information about Taobao, while the retrieved paragraph is about Alipay instead. It is challenging for retrieval systems to filter out passages whose entities are inconsistent with the query.

Syntax Mismatch In the case shown in Table 10, the retrieval system fails to understand that the subject and object in the query are Taipei and Ruifang; instead, it simply ranks a candidate passage that merely contains Taipei and Ruifang among its top predictions.

Other Challenges In our analysis, we find that about 21% of the errors are due to the retrieval system basing its predictions on co-occurring low-frequency terms (e.g., "wow" in the example in Table 10) shared by the query and paragraph, even though their semantic meanings are not actually related. About 1.6% of the errors are due to noise in the query or paragraph, for example misspelling "iPhone" as "ipone".

Table 10: Summary of the manual analysis of the 500 query-passage pairs predicted by our strongest re-ranker (CE w/ DE Neg). We highlight the challenges of salient phrase mismatch in red, syntax mismatch in blue, and other challenges in green.

References:
Shallow pooling for sparse labels
XOR QA: Cross-lingual open-retrieval question answering
One question answering model for many languages with cross-lingual dense passage retrieval
Semantic parsing on Freebase from question-answer pairs
mMARCO: A multilingual version of the MS MARCO passage ranking dataset (2021)
Pre-training tasks for embedding-based large-scale retrieval
Salient phrase aware dense retrieval: Can a dense retriever imitate a sparse one? (CoRR)
Revisiting pre-trained models for Chinese natural language processing
…, Bing Qin, and Ziqing Yang (2021a). Pre-training with whole word masking for Chinese BERT
Deeper text understanding for IR with contextual neural language modeling
Two-stage learning to rank for information retrieval
BERT: Pre-training of deep bidirectional transformers for language understanding
Unsupervised corpus aware language model pre-training for dense passage retrieval
Modularized transformer-based ranking framework
COIL: Revisit exact lexical match in information retrieval with contextualized inverted list
DuReader: A Chinese machine reading comprehension dataset from real-world applications
Billion-scale similarity search with GPUs
TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension
Dense passage retrieval for open-domain question answering
ColBERT: Efficient and effective passage search via contextualized late interaction over BERT
Back-training excels self-training at unsupervised domain adaptation of question generation and passage retrieval
Natural Questions: A benchmark for question answering research
Question and answer test-train overlap in open-domain question answering datasets
On cross-lingual retrieval with multilingual text encoders
RoBERTa: A robustly optimized BERT pretraining approach
Sparse, dense, and attentional representations for text retrieval (Transactions of the Association for Computational Linguistics)
PaddlePaddle: An open-source deep learning platform from industrial practice
Generation-augmented retrieval for open-domain question answering
MS MARCO: A human generated machine reading comprehension dataset
Document expansion by query prediction
Models and datasets for cross-lingual summarisation
RocketQA: An optimized training approach to dense passage retrieval for open-domain question answering
The probabilistic relevance framework: BM25 and beyond
Term-weighting approaches in automatic text retrieval
Cross-lingual training of dense retrievers for document retrieval
CLIRMatrix: A massively large collection of bilingual and multilingual datasets for cross-lingual information retrieval
ERNIE: Enhanced representation through knowledge integration
BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models
TREC: Experiment and evaluation in information retrieval
GPL: Generative pseudo labeling for unsupervised domain adaptation of dense retrieval
Leveraging passage-level cumulative gain for document ranking
Investigating passage-level relevance and its role in document-level relevance judgment
Approximate nearest neighbor negative contrastive learning for dense text retrieval
Everything is all it takes: A multi-pronged strategy for zero-shot cross-lingual information extraction
Optimizing dense retrieval model training with hard negatives
Evaluating extrapolation performance of dense retrieval
Multi-scale attentive interaction networks for Chinese medical question answer selection
Towards best practices for training multilingual dense retrieval models
Sogou-QCL: A new dataset with click relevance label
Improving zero-shot cross-lingual transfer for multilingual question answering over knowledge graph
DuQM: A Chinese dataset of linguistically perturbed natural questions for evaluating the robustness of question matching models