PARADE: Passage Representation Aggregation for Document Reranking
Authors: Canjia Li, Andrew Yates, Sean MacAvaney, Ben He, Yingfei Sun
Date: 2020-08-20

Pretrained transformer models, such as BERT and T5, have been shown to be highly effective at ad-hoc passage and document ranking. Due to the inherent sequence length limits of these models, they need to be run over a document's passages rather than processing the entire document sequence at once. Although several approaches for aggregating passage-level signals have been proposed, there has yet to be an extensive comparison of these techniques. In this work, we explore strategies for aggregating relevance signals from a document's passages into a final ranking score. We find that passage representation aggregation techniques can significantly improve over techniques proposed in prior work, such as taking the maximum passage score. We call this new approach PARADE. In particular, PARADE can significantly improve results on collections with broad information needs where relevance signals can be spread throughout the document (such as TREC Robust04 and GOV2). Meanwhile, less complex aggregation techniques may work better on collections with an information need that can often be pinpointed to a single passage (such as TREC DL and TREC Genomics). We also conduct efficiency analyses, and highlight several strategies for improving transformer-based aggregation.

Pre-trained language models (PLMs), such as BERT [19], ELECTRA [12] and T5 [59], have achieved state-of-the-art results on standard ad-hoc retrieval benchmarks. The success of PLMs mainly relies on learning contextualized representations of input sequences using the transformer encoder architecture [68]. The transformer uses a self-attention mechanism whose computational complexity is quadratic with respect to the input sequence's length. Therefore, PLMs generally limit the sequence's length (e.g., to 512 tokens) to reduce computational costs. Consequently, when applied to the ad-hoc ranking task, PLMs are commonly used to predict the relevance of passages or individual sentences [17, 80]. The max or k-max passage scores (e.g., top 3) are then aggregated to produce a document relevance score. Such approaches have achieved state-of-the-art results on a variety of ad-hoc retrieval benchmarks.

* This work was conducted while the author was an intern at the Max Planck Institute for Informatics.

Documents are often much longer than a single passage, however, and intuitively there are many types of relevance signals that can only be observed in a full document. For example, the Verbosity Hypothesis [60] states that relevant excerpts can appear at different positions in a document. It is not necessarily possible to account for all such excerpts by considering only the top passages. Similarly, the ordering of passages itself may affect a document's relevance; a document with relevant information at the beginning is intuitively more useful than a document with the information at the end [8, 36]. Empirical studies support the importance of full-document signals. Wu et al. study how passage-level relevance labels correspond to document-level labels, finding that more relevant documents also contain a higher number of relevant passages [73].
Additionally, experiments suggest that aggregating passage-level relevance scores to predict the document's relevance score outperforms the common practice of using the maximum passage score (e.g., [1, 5, 20]). On the other hand, the amount of non-relevant information in a document can also be a signal, because relevant excerpts would make up a large fraction of an ideal document. IR axioms encode this idea in the first length normalization constraint (LNC1), which states that adding non-relevant information to a document should decrease its score [21]. Considering a full document as input has the potential to incorporate signals like these. Furthermore, from the perspective of training a supervised ranking model, the common practice of applying document-level relevance labels to individual passages is undesirable, because it introduces unnecessary noise into the training process.

In this work, we provide an extensive study on neural techniques for aggregating passage-level signals into document scores. We study how PLMs like BERT and ELECTRA can be applied to the ad-hoc document ranking task while preserving many document-level signals. We move beyond simple passage score aggregation strategies (such as Birch [80]) and study passage representation aggregation. We find that aggregation over passage representations using architectures like CNNs and transformers outperforms passage score aggregation. Since utilizing the full text increases memory requirements, we investigate using knowledge distillation to create smaller, more efficient passage representation aggregation models that remain effective. In summary, our contributions are:
• The formalization of passage score and representation aggregation strategies, showing how they can be trained end-to-end,
• A thorough comparison of passage aggregation strategies on a variety of benchmark datasets, demonstrating the value of passage representation aggregation,
• An analysis of how to reduce the computational cost of transformer-based representation aggregation by decreasing the model size,
• An analysis of how the effectiveness of transformer-based representation aggregation is influenced by the number of passages considered, and
• An analysis into dataset characteristics that can influence which aggregation strategies are most effective on certain benchmarks.

We review four lines of research related to our study.

Contextualized Language Models for IR. Several neural ranking models have been proposed, such as DSSM [34], DRMM [24], (Co-)PACRR [35, 36], (Conv-)KNRM [18, 74], and TK [31]. However, their contextual capacity is limited by relying on pre-trained unigram embeddings or using short n-gram windows. Benefiting from BERT's pre-trained contextual embeddings, BERT-based IR models have been shown to be superior to these prior neural IR models. We briefly summarize related approaches here and refer the reader to a survey on transformers for text ranking by Lin et al. [46] for further details. These approaches use BERT as a relevance classifier in a cross-encoder configuration (i.e., BERT takes both a query and a document as input). Nogueira et al. first adopted BERT to passage reranking tasks [56] using BERT's [CLS] vector. Birch [80] and BERT-MaxP [17] explore using sentence-level and passage-level relevance scores from BERT for document reranking, respectively.
CEDR proposed a joint approach that combines BERT's outputs with existing neural IR models and handles passage aggregation via a representation aggregation technique (averaging) [53]. In this work, we further explore techniques for passage aggregation and consider an improved CEDR variant as a baseline. We focus on the underexplored direction of representation aggregation by employing more sophisticated strategies, including using CNNs and transformers. Other researchers trade off PLM effectiveness for efficiency by utilizing the PLM to improve document indexing [16, 58], precomputing intermediate Transformer representations [23, 37, 42, 51], using the PLM to build sparse representations [52], or reducing the number of Transformer layers [29, 32, 54]. Several works have recently investigated approaches for improving the Transformer's efficiency by reducing the computational complexity of its attention module, e.g., Sparse Transformer [11] and Longformer [4]. QDS-Transformer tailors Longformer to the ranking task with query-directed sparse attention [38]. We note that representation-based passage aggregation is more effective than increasing the input text size using the aforementioned models, but representation aggregation could be used in conjunction with such models.

Passage-based Document Retrieval. Callan first experimented with paragraph-based and window-based methods of defining passages [7]. Several works drive passage-based document retrieval in the language modeling context [5, 48], indexing context [47], and learning to rank context [63]. In the realm of neural networks, HiNT demonstrated that aggregating representations of passage-level relevance can perform well in the context of pre-BERT models [20]. Others have investigated sophisticated evidence aggregation approaches [82, 83]. Wu et al. explicitly modeled the importance of passages based on position decay, passage length, length with position decay, exact match, etc. [73]. In a contemporaneous study, they proposed a model that considers passage-level representations of relevance in order to predict the passage-level cumulative gain of each passage [72]. In this approach the final passage's cumulative gain can be used as the document-level cumulative gain. Our approaches share some similarities, but theirs differs in that it uses passage-level labels to train the model and performs passage representation aggregation with an LSTM.

Representation Aggregation Approaches for NLP. Representation learning has been shown to be powerful in many NLP tasks [6, 50]. For pre-trained language models, a text representation is learned by feeding the PLM with a formatted input such as "[CLS] text_A [SEP] text_B [SEP]". The vector representation of the prepended [CLS] token in the last layer is then regarded as either an overall text representation or a text relationship representation. Such representations can also be aggregated for tasks that require reasoning over multiple scopes of evidence. GEAR aggregates claim-evidence representations with a max, mean, or attention aggregator for fact checking [83]. Transformer-XH uses eXtra Hop attention that enables not only in-sequence but also inter-sequence information sharing [82]. The learned representation is then adopted for either question answering or fact verification tasks. Several lines of work have explored hierarchical representations for document classification and summarization, including transformer-based approaches [49, 78, 81].
In the context of ranking, SMITH [76], a long-to-long text matching model, learns a document representation with hierarchical sentence representation aggregation, which shares some similarities with our work. Unlike our approach, however, SMITH is a bi-encoder that learns independent document (and query) representations. While such approaches have efficiency advantages, current bi-encoders do not match the effectiveness of cross-encoders, which are the focus of our work [46].

Knowledge Distillation. Knowledge distillation is the process of transferring knowledge from a large model to a smaller student model [2, 27]. Ideally, the student model performs well while consisting of fewer parameters. One line of research investigates the use of specific distilling objectives for intermediate layers in the BERT model [39, 64], which is shown to be effective in the IR context [9]. Turc et al. pre-train a family of compact BERT models and explore transferring task knowledge from large fine-tuned models [67]. Tang et al. distill knowledge from the BERT model into a Bi-LSTM [66]. Tahami et al. propose a new cross-encoder architecture and transfer knowledge from this model to a bi-encoder model for fast retrieval [65]. Hofstätter et al. also propose a cross-architecture knowledge distillation framework using a Margin Mean Squared Error loss in a pairwise training manner [28]. We demonstrate that the approach in [65, 66] can be applied to our proposed representation aggregation approach to improve efficiency without substantial reductions in effectiveness.

In this section, we formalize approaches for aggregating passage representations into document ranking scores. We make the distinction between the passage score aggregation techniques explored in prior work and the passage representation aggregation (PARADE) techniques, which have received less attention in the context of document ranking. Given a query Q and a document D, a ranking method aims to generate a relevance score rel(Q, D) that estimates to what degree document D satisfies the query Q. As described in the following sections, we perform this relevance estimation by aggregating passage-level relevance representations into a document-level representation, which is then used to produce a relevance score.

As introduced in Section 1, a long document cannot be considered directly by the BERT model due to its fixed sequence length limitation. As in prior work [7, 17], we split a document into passages that can be handled by BERT individually. To do so, a sliding window of 225 tokens is applied to the document with a stride of 200 tokens, formally expressed as D = {P_1, ..., P_n}, where n is the number of passages. Afterward, these passages are taken as input to the BERT model for relevance estimation. Following prior work [56], we concatenate a query and passage pair with a [SEP] token in between and another [SEP] token at the end. The special [CLS] token is also prepended, and its corresponding output in the last layer is taken as a passage relevance representation p_i^cls ∈ R^d:

p_i^cls = PLM(Q, P_i)

Previous approaches like BERT-MaxP [17] and Birch [80] use a feedforward network to predict a relevance score from each passage representation p_i^cls; these scores are then aggregated into a document relevance score with a score aggregation approach. Figure 1a illustrates common score aggregation approaches like max pooling ("MaxP"), sum pooling, average pooling, and k-max pooling.
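To make these score aggregation baselines concrete, the following is a minimal, illustrative sketch in Python (not the authors' implementation). It assumes per-passage relevance scores have already been produced by a feed-forward head over each passage's [CLS] representation; the choice to average the top-k scores in the k-max case is an assumption rather than a detail stated above.

```python
# Illustrative score aggregation (Figure 1a). The per-passage scores are
# assumed to come from a feed-forward head on each passage's [CLS] vector.
def aggregate_scores(passage_scores, strategy="max", k=3):
    ranked = sorted(passage_scores, reverse=True)
    if strategy == "max":      # MaxP: document score = best passage score
        return ranked[0]
    if strategy == "k-max":    # k-max: combine the top-k passage scores
        top_k = ranked[:k]
        return sum(top_k) / len(top_k)
    if strategy == "sum":      # sum pooling over all passage scores
        return sum(ranked)
    if strategy == "avg":      # average pooling over all passage scores
        return sum(ranked) / len(ranked)
    raise ValueError(f"unknown strategy: {strategy}")

# Example: a three-passage document scored with MaxP.
doc_score = aggregate_scores([0.2, 0.9, 0.4], strategy="max")
```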
Unlike score aggregation approaches, our proposed representation aggregation approaches generate an overall document relevance representation by aggregating passage representations directly (see Figure 1b). We describe the representation aggregators in the following sections. Given the passage relevance representations {p_1^cls, ..., p_n^cls}, PARADE summarizes them into a single dense representation d^cls ∈ R^d in one of several different ways, as illustrated in Figure 2.

PARADE-Max utilizes a robust max pooling operation on the passage relevance features in {p_i^cls}. As widely applied in Convolutional Neural Networks, max pooling has been shown to be effective in obtaining position-invariant features [62]. Here, each element of d^cls is obtained by an element-wise max pooling operation over the corresponding element of the passage relevance representations:

d^cls[j] = max(p_1^cls[j], ..., p_n^cls[j])

PARADE-Attn assumes that each passage contributes differently to the relevance of a document to the query. A simple yet effective way to learn the importance of a passage is to apply a feed-forward network to predict passage weights:

w_1, ..., w_n = softmax(W_a p_1^cls, ..., W_a p_n^cls)
d^cls = Σ_{i=1}^{n} w_i p_i^cls

where softmax is the normalization function and W_a ∈ R^d is a learnable weight. For completeness of study, we also introduce PARADE-Sum, which simply sums the passage relevance representations. This can be regarded as manually assigning equal weights to all passages (i.e., w_i = 1). We also introduce PARADE-Avg, which combines this with document length normalization (i.e., w_i = 1/n).

PARADE-CNN, which operates in a hierarchical manner, stacks Convolutional Neural Network (CNN) layers with a window size of d × 2 and a stride of 2. In other words, the CNN filters operate on every pair of passage representations without overlap. Specifically, we stack 4 layers of CNN, which halve the number of representations in each layer, as shown in Figure 2b.

PARADE-Transformer enables passage relevance representations to interact by adopting the transformer encoder [68] in a hierarchical way. Specifically, BERT's [CLS] token embedding and all p_i^cls are concatenated, resulting in an input h^0 = (emb_cls, p_1^cls, ..., p_n^cls) that is consumed by transformer layers to exploit the ordering of and dependencies among passages. That is,

ĥ^l = LayerNorm(h^{l-1} + MultiHead(h^{l-1}))
h^l = LayerNorm(ĥ^l + FFN(ĥ^l))

where LayerNorm is the layer-wise normalization introduced in [3], MultiHead is the multi-head self-attention [68], and FFN is a two-layer feed-forward network with a ReLU activation in between. As shown in Figure 2c, the [CLS] vector of the last Transformer output layer, regarded as a pooled representation of the relevance between the query and the whole document, is taken as d^cls.

For all PARADE variants except PARADE-CNN, after obtaining the final embedding d^cls, a single-layer feed-forward network (FFN) is adopted to generate a relevance score, as follows:

rel(Q, D) = W_d d^cls

where W_d ∈ R^d is a learnable weight. For PARADE-CNN, an FFN with one hidden layer is applied to every CNN representation, and the final score is determined by the sum of those FFN output scores. We note that the computational complexity of representation aggregation techniques is dominated by the passage processing itself.

We compare PARADE against the following traditional and neural baselines, including those that employ other passage aggregation techniques.

Table 2: Ranking effectiveness of PARADE on the Robust04 and GOV2 collections. Best performance is in bold. Significant differences between PARADE-Transformer and the corresponding method are marked with † (p < 0.05, two-tailed paired t-test). We also report the current best-performing model on Robust04 (T5-3B from [57]).
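Before turning to the baselines, the sketch below makes the non-hierarchical aggregators defined above (PARADE-Max, PARADE-Sum, PARADE-Avg, and PARADE-Attn) concrete. It is written in PyTorch as an assumption for readability; the authors' implementation uses TensorFlow in the Capreolus toolkit, and the class name here is hypothetical.

```python
import torch
import torch.nn as nn

class SimpleParadeAggregators(nn.Module):
    """Illustrative (not the official) sketch of the PARADE aggregators.

    `p` is a tensor of passage [CLS] representations with shape
    (num_passages, hidden_dim), as produced by the fine-tuned PLM.
    """
    def __init__(self, hidden_dim):
        super().__init__()
        self.attn = nn.Linear(hidden_dim, 1)   # passage-importance weights (PARADE-Attn)
        self.score = nn.Linear(hidden_dim, 1)  # single-layer FFN producing rel(Q, D)

    def forward(self, p, strategy="max"):
        if strategy == "max":        # element-wise max over passages
            d = p.max(dim=0).values
        elif strategy == "sum":      # equal weights without normalization
            d = p.sum(dim=0)
        elif strategy == "avg":      # equal weights with length normalization
            d = p.mean(dim=0)
        elif strategy == "attn":     # learned softmax weights over passages
            w = torch.softmax(self.attn(p).squeeze(-1), dim=0)
            d = (w.unsqueeze(-1) * p).sum(dim=0)
        else:
            raise ValueError(strategy)
        return self.score(d)
```

PARADE-CNN and PARADE-Transformer would replace the pooling step with stacked CNN layers or transformer encoder layers over the same sequence of passage representations, as described above.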
BM25 is an unsupervised ranking model based on IDF-weighted counting [61]. The documents retrieved by BM25 also serve as the candidate documents used with reranking methods. BM25+RM3 is a query expansion model based on RM3 [43]. We used Anserini's [77] implementations of BM25 and BM25+RM3. Documents are indexed and retrieved with the default settings for keyword queries. For description queries, we changed one retrieval parameter to 0.6 and the number of expansion terms to 20.

Birch aggregates sentence-level evidence provided by BERT to rank documents [80]. Rather than using the original Birch model provided by the authors, we train an improved "Birch-Passage" variant. Unlike the original model, Birch-Passage uses passages rather than sentences as input, it is trained end-to-end, it is fine-tuned on the target corpus rather than being applied zero-shot, and it does not interpolate retrieval scores with the first-stage retrieval method. These changes bring our Birch variant into line with the other models and baselines (e.g., using passage inputs and not interpolating), and they additionally improved effectiveness over the original Birch model in our pilot experiments.

ELECTRA-MaxP adopts the maximum score of passages within a document as the overall relevance score [17]. However, rather than fine-tuning BERT-base on a Bing search log, we improve performance by fine-tuning on the MS MARCO passage ranking dataset. We also use the more recent and efficient pre-trained ELECTRA model rather than BERT.

ELECTRA-KNRM is a kernel-pooling neural ranking model based on a query-document similarity matrix [74]. We set the kernel size to 11. Different from the original work, we use the embeddings from the pre-trained ELECTRA model for model initialization.

CEDR-KNRM (Max) combines the advantages of both KNRM and the pre-trained model [53]. It digests the kernel features learned from KNRM and the [CLS] representation as ranking features. We again replace the BERT model with the more effective ELECTRA. We also use a more effective variant that performs max-pooling on the passages' [CLS] representations, rather than averaging.

T5-3B defines text ranking in a sequence-to-sequence generation context using the pre-trained T5 model [57]. For the document reranking task, it utilizes the same score max-pooling technique as in BERT-MaxP [17]. Due to its large size and expensive training, we present the values reported by [57] in their zero-shot setting, rather than training it ourselves.

To prepare the ELECTRA model for the ranking task, we first fine-tune ELECTRA on the MS MARCO passage ranking dataset [55]. The fine-tuned ELECTRA model is then used to initialize PARADE's PLM component. For PARADE-Transformer we use two randomly initialized transformer encoder layers with the same hyperparameters (e.g., number of attention heads, hidden size, etc.) used by BERT-base. Training of PARADE and the baselines was performed on a single Google TPU v3-8 using a pairwise hinge loss. We use the TensorFlow implementation of PARADE available in the Capreolus toolkit [79], and a standalone implementation is also available. We train on the top 1,000 documents returned by a first-stage retrieval method; documents that are labeled relevant in the ground truth are taken as positive samples and all other documents serve as negative samples. We use BM25+RM3 for first-stage retrieval on Robust04 and BM25 on the other datasets, with parameters tuned on the dev sets via grid search.
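To illustrate this pairwise training setup, here is a minimal sketch of a pairwise hinge loss (PyTorch is used here as an assumption; the margin value is illustrative and not taken from the text above).

```python
import torch

def pairwise_hinge_loss(pos_score, neg_score, margin=1.0):
    # Encourage the relevant document's score to exceed the non-relevant
    # document's score by at least `margin`; zero loss once the margin is met.
    return torch.clamp(margin - pos_score + neg_score, min=0.0).mean()

# Example with a batch of (relevant, non-relevant) score pairs.
loss = pairwise_hinge_loss(torch.tensor([2.1, 0.3]), torch.tensor([1.5, 0.9]))
```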
We train for 36 "epochs" consisting of 4,096 pairs of training examples, with a learning rate of 3e-6, warm-up over the first ten epochs, and a linear decay rate of 0.1 after the warm-up. Due to its larger memory requirements, we use a batch size of 16 with CEDR and a batch size of 24 with all other methods. Each instance comprises a query and all of the passages split from a document. We use a learning rate of 3e-6 with warm-up over the first 10% of training steps. Documents are split into a maximum of 16 passages. As we split the documents using a sliding window of 225 tokens with a stride of 200 tokens, a maximum of 3,250 tokens in each document are retained. The maximum passage sequence length is set to 256. Documents with fewer than the maximum number of passages are padded and later masked out by passage-level masks.

Following prior work [17, 53], we use 5-fold cross-validation. We set the reranking threshold to 1000 on the test fold as a trade-off between latency and effectiveness. The reported results are based on the average of all test folds. Performance is measured in terms of the MAP, Precision, ERR, and nDCG ranking metrics using trec_eval with different cutoffs. For NTCIR WWW-3, the results are reported using NTCIREVAL.

The reranking effectiveness of PARADE on the two commonly-used Robust04 and GOV2 collections is shown in Table 2. Considering the three approaches that do not introduce any new weights, PARADE-Max is usually more effective than PARADE-Avg and PARADE-Sum, though the results are mixed on GOV2. PARADE-Max is consistently better than PARADE-Attn on Robust04, but PARADE-Attn sometimes outperforms PARADE-Max on GOV2. The two variants that consume passage representations in a hierarchical manner, PARADE-CNN and PARADE-Transformer, consistently outperform the four other variants. This confirms the effectiveness of our proposed passage representation aggregation approaches. Considering the baseline methods, PARADE-Transformer significantly outperforms the Birch and ELECTRA-MaxP score aggregation approaches for most metrics on both collections. PARADE-Transformer's ranking effectiveness is comparable with T5-3B on the Robust04 collection while using only 4% of the parameters, though it is worth noting that T5-3B is being used in a zero-shot setting. CEDR-KNRM and ELECTRA-KNRM, which both combine kernel features with contextualized embeddings, are also included in this comparison.

Results on the TREC Genomics collection are shown in Table 3. We first observe that this is a surprisingly challenging task for neural models. Unlike Robust04 and GOV2, where transformer-based models are clearly state-of-the-art, we observe that all of the methods we consider almost always underperform a simple BM25 baseline, and they perform well below the best-performing TREC submission. It is unclear whether this is due to the specialized domain, the smaller amount of training data, or some other factor. Nevertheless, we observe some interesting trends. First, we see that PARADE approaches can outperform score aggregation baselines. However, we note that statistical significance can be difficult to achieve on this dataset, given the small sample size (64 queries). Next, we notice that PARADE-Max performs the best among neural methods. This is in contrast with what we observed on Robust04 and GOV2, and suggests that hierarchically aggregating evidence from different passages is not required on the Genomics dataset. We additionally study the effectiveness of PARADE on the TREC DL Track and NTCIR WWW-3 Track.
We report results in this section and refer the readers to the TREC and NTCIR task papers for details on the specific hyperparameters used [44, 45]. Results from the TREC Deep Learning Track are shown in Table 4. In TREC DL'19, we include comparisons with competitive runs from TREC: ucas_runid1 [10] used BERT-MaxP [17] as the reranking method, TUW19-d3-re [30] is a Transformer-based non-BERT method, and idst_bert_r1 [75] utilizes StructBERT [71], which is intended to strengthen the modeling of sentence relationships. The relative performance of the PARADE variants on these collections differs from that in Table 2; we explore this further in Section 5.4.

Results from the NTCIR WWW-3 Track are shown in Table 5. KASYS-E-CO-NEW-1 is a Birch-based method [80] that uses BERT-Large, and Technion-E-CO-NEW-1 is a cluster-based method. As shown in Table 5, PARADE-Transformer's effectiveness is comparable with KASYS-E-CO-NEW-1 across metrics. On this benchmark, PARADE-Transformer outperforms PARADE-Max by a large margin.

In this section, we consider the following research questions:
• RQ1: How does PARADE compare with transformers that support long text?
• RQ2: How can BERT's efficiency be improved while maintaining its effectiveness?
• RQ3: How does the number of document passages preserved influence effectiveness?
• RQ4: When is the representation aggregation approach preferable to score aggregation?

Recently, a line of research has focused on reducing redundant computation in the transformer block, allowing models to support longer sequences. Most approaches design novel sparse attention mechanisms for efficiency, which makes it possible to input longer documents as a whole for ad-hoc ranking. We consider the results reported by Jiang et al. [38] to compare some of these approaches with passage representation aggregation. The results are shown in Table 6. In this comparison, the long-text transformer approaches achieve similar effectiveness to one another and underperform PARADE-Transformer by a large margin. However, it is worth noting that these approaches use the [CLS] representation as features for a downstream model rather than using it to predict a relevance score directly, which may contribute to the difference in effectiveness. A larger study using the various approaches in similar configurations is needed to draw conclusions. For example, it is possible that QDS-Transformer's effectiveness would increase when trained with maximum score aggregation; this approach could also be combined with PARADE to handle documents longer than Longformer's maximum input length of 2048 tokens. Our approach is less efficient than that taken by the Longformer family of models, so we consider the question of how to improve PARADE's efficiency in Section 5.2.

While BERT-based models are effective at producing high-quality ranked lists, they are computationally expensive. However, the reranking task is sensitive to efficiency concerns, because documents must be reranked in real time after the user issues a query. In this section we consider two strategies for improving PARADE's efficiency.

Using a Smaller BERT Variant. As smaller models require fewer computations, we study the reranking effectiveness of PARADE when using pre-trained BERT models of various sizes, providing guidance for deploying a retrieval system. To do so, we use the pre-trained BERT models provided by Turc et al. [67].
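As an illustration of how such compact encoders can be swapped in, the snippet below loads one of the compact BERT checkpoints released by Turc et al. through the Hugging Face hub. The library, model identifier, and setup here are assumptions made for illustration only; the experiments above were run with TensorFlow on TPUs.

```python
from transformers import AutoModel, AutoTokenizer

# Compact BERT checkpoints are published under names of the form
# google/bert_uncased_L-{layers}_H-{hidden}_A-{heads}; the 4-layer,
# 256-dimensional variant below is one example (assumed identifier).
name = "google/bert_uncased_L-4_H-256_A-4"
tokenizer = AutoTokenizer.from_pretrained(name)
encoder = AutoModel.from_pretrained(name)

# Encode one query-passage pair and take the [CLS] vector,
# analogous to the per-passage representations used by PARADE.
inputs = tokenizer("example query", "example passage text",
                   truncation=True, max_length=256, return_tensors="pt")
cls_vector = encoder(**inputs).last_hidden_state[:, 0]
```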
In this analysis we change several hyperparameters to reduce computational requirements: we rerank the top 100 documents from BM25, train with a cross-entropy loss using a single positive or negative document, reduce the passage length to 150 tokens, and reduce the stride to 100 tokens. We additionally use BERT models in place of ELECTRA so that we can consider models with LM distillation (i.e., distillation using self-supervised PLM objectives), which Gao et al. [22] found to be more effective than ranker distillation alone (i.e., distillation using a supervised ranking objective). From Table 7, it can be seen that as the size of the models is reduced, their effectiveness declines monotonically. The hidden layer size (#6 vs #7, #8 vs #9) plays a more critical role for performance than the number of layers (#3 vs #4, #5 vs #6). An example is the comparison between models #7 and #8: model #8 performs better; it has fewer layers but contains more parameters. The number of parameters and inference time are also given in Table 7 to facilitate the study of trade-offs between model complexity and effectiveness.

Distilling Knowledge from a Large Model. To further explore the limits of smaller PARADE models, we apply knowledge distillation to leverage knowledge from a large teacher model. We use PARADE-Transformer trained with BERT-Base on the target collection as the teacher model. Smaller student models then learn from the teacher at the output level. We use mean squared error as the distilling objective, which has been shown to work effectively [65, 66]. The learning objective penalizes the student model based on both the ground truth and the teacher model:

L = α · L_CE + (1 − α) · (z^t − z^s)^2

where L_CE is the cross-entropy loss between the student model's logit and the ground truth, α weights the importance of the two learning objectives, and z^t and z^s are the logits from the teacher model and student model, respectively. As shown in Table 7, the nDCG@20 of the distilled models always increases. The PARADE model using 8 layers (#4) can achieve comparable results with the teacher model. Moreover, the PARADE model using 10 layers (#3) can outperform the teacher model with 11% fewer parameters. The PARADE model trained with BERT-Small achieves an nDCG@20 above 0.5, which outperforms BERT-MaxP using BERT-Base, while requiring only 1.14 ms to perform inference on one document. Thus, when reranking 100 documents, the inference time for each query is approximately 0.114 seconds.

One hyperparameter in PARADE is the maximum number of passages used (i.e., the preserved data size), which we study in this section to answer RQ3. We consider title queries on the GOV2 dataset given that these documents are longer on average than those in Robust04. We use the same hyperparameters as in Section 5.2. Figure 3 depicts the nDCG@20 of PARADE-Transformer with the number of passages varying from 8 to 64. Generally, a larger preserved data size results in better performance for PARADE-Transformer, which suggests that a document can be better understood from document-level context with more of its content preserved. For PARADE-Max and PARADE-Attn, however, the performance degrades slightly when using 64 passages. Both max pooling (Max) and the simple attention mechanism (Attn) have limited capacity and are challenged when dealing with such long documents. The PARADE-Transformer model is able to improve nDCG@20 as the number of passages increases, demonstrating its superiority in detecting relevance when documents become much longer.
However, considering more passages also increases the number of computations performed. One advantage of the PARADE models is that the number of parameters remains constant as the number of passages in a document varies. Thus, we consider the impact of varying the number of passages considered between training and inference. As shown in Table 8, rows indicate the number of passages considered at training time while columns indicate the number used to perform inference. The diagonal indicates that preserving more of the passages in a document consistently improves nDCG. Similarly, increasing the number of passages considered at inference time (columns) or at training time (rows) usually improves nDCG. In conclusion, the number of passages considered plays a crucial role in PARADE's effectiveness. When trading off efficiency for effectiveness, PARADE models' effectiveness can be improved by training on more passages than will be used at inference time. This generally yields a small nDCG increase.

We hypothesize that the differences in the effectiveness of the aggregation strategies across collections are related to the number of relevant passages per document in each collection. We test this hypothesis by using passage-level relevance judgments to compare the number of highly relevant passages per document in various collections. To do so, we use mappings between relevant passages and documents for those collections with passage-level judgments available: TREC DL, TREC Genomics, and GOV2. We create a mapping between the MS MARCO document and passage collections by using the MS MARCO Question Answering (QnA) collection to map passages to document URLs. This mapping can then be used to map between passage and document judgments in DL'19 and DL'20. With DL'19, we additionally use the FIRA passage relevance judgments [33] to map between documents and passages. The FIRA judgments were created by asking annotators to identify relevant passages in every DL'19 document with a relevance label of 2 or 3 (i.e., the two highest labels). Our mapping covers nearly the entire MS MARCO collection, but it is limited by the fact that DL's passage-level relevance judgments may not be complete. The FIRA mapping covers only highly-relevant DL'19 documents, but the passage annotations are complete and it was created by human annotators with quality control. In the case of TREC Genomics, we use the mapping provided by TREC. For GOV2, we use the sentence-level relevance judgments available in WebAP [40, 41], which cover 82 queries.

We compare passage judgments across collections by using each collection's annotation guidelines to align their relevance labels with MS MARCO's definition of a relevant passage as one that is sufficient to answer the question query. With GOV2 we consider passages with a relevance label of 3 or 4 to be relevant. With the DL collections we consider documents with a label of 2 or 3 to be relevant and passages with a label of 3 to be relevant. With FIRA we consider label 3 to be relevant. With Genomics we consider labels 1 or 2 to be relevant. We align the maximum passage lengths in GOV2 to FIRA's maximum length so that they can be directly compared. To do so, we convert GOV2's sentence judgments to passage judgments by collapsing sentences following a relevant sentence into a single passage with a maximum passage length of 130 tokens, as used by FIRA. We note that this process can only decrease the number of relevant passages per document observed in GOV2, which we expect to have the highest number. With the DL collections using the MS MARCO mapping, the passages are much smaller than these lengths, so collapsing passages could only decrease the number of relevant passages per document.
We note that Genomics contains "natural" passages that can be longer; this should be considered when drawing conclusions. In all cases, the relevant passages comprise a small fraction of the document. In each collection, we calculate the number of relevant passages per document using the collection's associated document and passage judgments. The results are shown in Table 9. First, considering the GOV2 and MS MARCO collections that we expect to lie at opposite ends of the spectrum, we see that 38% of GOV2 documents contain a single relevant passage, whereas 98-99% of MS MARCO documents contain a single relevant passage. This confirms that MS MARCO documents contain only 1-2 highly relevant passages per document by nature of the collection's construction. The percentages are the lowest on GOV2, as expected. While we would prefer to put these percentages in the context of another collection like Robust04, the lack of passage-level judgments on such collections prevents us from doing so. Second, considering the Deep Learning collections, we see that DL'19 and DL'20 exhibit similar trends regardless of whether our mapping or the FIRA mapping is used. In these collections, the majority of documents contain a single relevant passage and the vast majority of documents contain one or two relevant passages. We call this a "maximum passage bias." The fact that the queries are shared with MS MARCO likely contributes to this observation, since we know the vast majority of MS MARCO question queries can be answered by a single passage. Third, considering Genomics 2006, we see that this collection is similar to the DL collections. The majority of documents contain only one relevant passage, and the vast majority contain one or two relevant passages. Thus, this analysis supports our hypothesis that the difference in PARADE-Transformer's effectiveness across collections is related to the number of relevant passages per document in these collections. PARADE-Max performs better when this number is low, which may reflect the reduced importance of aggregating relevance signals across passages on these collections.

We proposed the PARADE end-to-end document reranking model and demonstrated its effectiveness on ad-hoc benchmark collections. Our results indicate the importance of incorporating diverse relevance signals from the full text into ad-hoc ranking, rather than basing it on a single passage. We additionally investigated how the computational cost of representation aggregation can be reduced by decreasing the model size and distilling knowledge from a larger model, and how the number of passages considered influences effectiveness.

In response to the urgent demand for reliable and accurate retrieval of COVID-19 academic literature, TREC has been developing the TREC-COVID challenge to build a test collection during the pandemic [69]. The challenge uses the CORD-19 data set [70], which is a dynamic collection enlarged over time. The challenge is organized into 5 rounds in which researchers iterate on their systems. TREC develops a set of COVID-19 related topics, including queries (keyword-based), questions, and narratives. A retrieval system is expected to generate a ranked list of documents for these topics. We began submitting PARADE runs to TREC-COVID in Round 2. By using PARADE, we are able to utilize the full text of the COVID-19 academic papers. We used the question topics since they work much better than the other topic types. In all rounds, we employ the PARADE-Transformer model. In Round 3, we additionally tested PARADE-Attn and a combination of PARADE-Transformer and PARADE-Attn using reciprocal rank fusion [13].
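Reciprocal rank fusion combines multiple ranked lists by summing reciprocal ranks. The sketch below shows the standard formulation from [13]; the constant k=60 is the value from that paper and is an assumption here, since the value used for our fusion run is not stated above.

```python
from collections import defaultdict

def reciprocal_rank_fusion(runs, k=60):
    # `runs` is a list of ranked lists of document ids, one per system.
    # Each document receives sum over runs of 1 / (k + rank), rank starting at 1.
    fused = defaultdict(float)
    for run in runs:
        for rank, doc_id in enumerate(run, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)

# Example: fuse the rankings of two PARADE variants for one query.
fused_ranking = reciprocal_rank_fusion([["d1", "d2", "d3"], ["d2", "d1", "d4"]])
```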
Results from TREC-COVID Rounds 2-4 are shown in Table 10, Table 11, and Table 12, respectively. In Round 2, PARADE achieves the highest nDCG, further supporting its effectiveness. In Round 3, our runs are not as competitive as in the previous round. One possible reason is that the collection doubled in size from Round 2 to Round 3, which can introduce inconsistencies between training and testing data, as we trained PARADE on Round 2 data and tested on Round 3 data. In particular, our run mpiid5_run3 performed poorly. We found that it tends to retrieve more documents that are not likely to be included in the judgment pool. When considering the bpref metric, which takes only the judged documents into account, its performance is comparable to that of the other variants. As measured by nDCG, PARADE's performance improved in Round 4 (Table 12), but it is again outperformed by other approaches. It is worth noting that the PARADE runs were created by single models (excluding the fusion run from Round 3), whereas, e.g., the UPrrf38rrf3-r4 run in Round 4 is an ensemble of more than 20 runs.

REFERENCES
• A Neural Passage Model for Ad-hoc Document Retrieval
• Do Deep Nets Really Need to be Deep
• Layer Normalization
• Longformer: The Long-Document Transformer
• Utilizing Passage-Based Language Models for Document Retrieval
• Representation Learning: A Review and New Perspectives
• Passage-Level Evidence in Document Retrieval
• Enhanced News Retrieval: Passages Lead the Way! Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval
• Simplified Tiny-BERT: Knowledge Distillation for Document Retrieval
• UCAS at TREC-2019 Deep Learning Track
• Generating Long Sequences with Sparse Transformers
• ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators
• Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods. In SIGIR
• Overview of the TREC 2020 deep learning track
• Overview of the TREC 2019 deep learning track
• Context-Aware Sentence/Passage Term Importance Estimation For First Stage Retrieval
• Deeper Text Understanding for IR with Contextual Neural Language Modeling
• Convolutional Neural Networks for Soft-Matching N-Grams in Ad-hoc Search
• BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
• Modeling Diverse Relevance Patterns in Ad-hoc Retrieval
• Diagnostic Evaluation of Information Retrieval Models
• Understanding BERT Rankers Under Distillation
• EARL: Speedup Transformer-based Rankers with Pre-computed Representation
• A Deep Relevance Matching Model for Ad-hoc Retrieval
• Distilling the Knowledge in a Neural Network
• Improving Efficient Neural Ranking Models with Cross-Architecture Knowledge Distillation
• Local Self-Attention over Long Text for Efficient Document Retrieval
• TU Wien @ TREC Deep Learning '19 - Simple Contextualization for Re-ranking
• Interpretable & Time-Budget-Constrained Contextualization for Re-Ranking
• Interpretable & Time-Budget-Constrained Contextualization for Re-Ranking
• Fine-Grained Relevance Annotations for Multi-Task Document Ranking and Question Answering
• Learning deep structured semantic models for web search using clickthrough data
• PACRR: A Position-Aware Neural IR Model for Relevance Matching
• Co-PACRR: A Context-Aware Neural IR Model for Ad-hoc Retrieval
• Poly-encoders: Architectures and Pre-training Strategies for Fast and Accurate Multi-sentence Scoring
• Long Document Ranking with Query-Directed Sparse Transformer
• TinyBERT: Distilling BERT for Natural Language Understanding
• Evaluating answer passages using summarization measures
• Retrieving passages and finding answers
• ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT
• Relevance-Based Language Models
• MPII at the TREC 2020 Deep Learning Track
• MPII at the NTCIR-15 WWW-3 Task
• Pretrained transformers for text ranking: BERT and beyond
• Is searching full text more effective than searching abstracts?
• Passage retrieval based on language models
• Hierarchical Transformers for Multi-Document Summarization
• Representation Learning for Natural Language Processing
• Efficient Document Re-Ranking for Transformers by Precomputing Term Representations
• Expansion via Prediction of Importance with Contextualization
• CEDR: Contextualized Embeddings for Document Ranking
• Conformer-Kernel with Query Term Independence for Document Retrieval
• MS MARCO: A Human Generated MAchine Reading COmprehension Dataset
• Passage Re-ranking with BERT
• Document Ranking with a Pretrained Sequence-to-Sequence Model
• Document Expansion by Query Prediction
• Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
• Some Simple Effective Approximations to the 2-Poisson Model for Probabilistic Weighted Retrieval
• Okapi at TREC-4
• Evaluation of Pooling Operations in Convolutional Architectures for Object Recognition
• A passage-based approach to learning to rank documents
• Patient Knowledge Distillation for BERT Model Compression
• Distilling Knowledge for Fast Retrieval-based Chat-bots
• Distilling Task-Specific Knowledge from BERT into Simple Neural Networks
• Well-Read Students Learn Better: The Impact of Student Initialization on Knowledge Distillation
• TREC-COVID: Constructing a Pandemic Information Retrieval Test Collection
• CORD-19: The COVID-19 Open Research Dataset
• StructBERT: Incorporating Language Structures into Pre-training for Deep Language Understanding
• Leveraging Passage-level Cumulative Gain for Document Ranking
• Investigating Passage-level Relevance and Its Role in Document-level Relevance Judgment
• End-to-End Neural Ad-hoc Ranking with Kernel Pooling
• IDST at TREC 2019 Deep Learning Track: Deep Cascade Ranking with Generation-based Document Expansion and Pre-trained Language Modeling
• Beyond 512 Tokens: Siamese Multi-depth Transformer-based Hierarchical Encoder for Long-Form Document Matching
• Anserini: Reproducible Ranking Baselines Using Lucene
• Hierarchical Attention Networks for Document Classification
• Flexible IR pipelines with Capreolus
• Applying BERT to Document Retrieval with Birch
• HIBERT: Document Level Pretraining of Hierarchical Bidirectional Transformers for Document Summarization
• Transformer-XH: Multi-Evidence Reasoning with eXtra Hop Attention
• GEAR: Graph-based Evidence Aggregating and Reasoning for Fact Verification

ACKNOWLEDGMENTS
This work was supported in part by Google Cloud and the TensorFlow Research Cloud.