Pyserini: An Easy-to-Use Python Toolkit to Support Replicable IR Research with Sparse and Dense Representations
Jimmy Lin, Xueguang Ma, Sheng-Chieh Lin, Jheng-Hong Yang, Ronak Pradeep, Rodrigo Nogueira
2021-02-19

Pyserini is an easy-to-use Python toolkit that supports replicable IR research by providing effective first-stage retrieval in a multi-stage ranking architecture. Our toolkit is self-contained as a standard Python package and comes with queries, relevance judgments, pre-built indexes, and evaluation scripts for many commonly used IR test collections. We aim to support, out of the box, the entire research lifecycle of efforts aimed at improving ranking with modern neural approaches. In particular, Pyserini supports sparse retrieval (e.g., BM25 scoring using bag-of-words representations), dense retrieval (e.g., nearest-neighbor search on transformer-encoded representations), as well as hybrid retrieval that integrates both approaches. This paper provides an overview of toolkit features and presents empirical results that illustrate its effectiveness on two popular ranking tasks. We also describe how our group has built a culture of replicability through shared norms and tools that enable rigorous automated testing.

The advent of pretrained transformers has led to many exciting recent developments in information retrieval [15]. In our view, the two most important research directions are transformer-based reranking models and learned dense representations for ranking. Despite many exciting opportunities and rapid research progress, the need for easy-to-use, replicable baselines has remained a constant. In particular, stable first-stage retrieval within a multi-stage ranking architecture has become even more important, as it provides the foundation for increasingly complex modern approaches that leverage hybrid techniques.

We present Pyserini, our Python IR toolkit designed to serve this role: it aims to provide a solid foundation to help researchers pursue work on modern neural approaches to information retrieval. The toolkit is specifically designed to support the complete "research lifecycle" of systems-oriented inquiries aimed at building better ranking models, where "better" can mean more effective, more efficient, or some tradeoff thereof. This typically involves working with one or more standard test collections to design ranking models as part of an end-to-end architecture, iteratively improving components and evaluating the impact of those changes. In this context, our toolkit provides the following key features:

• Pyserini is completely self-contained as a Python package, available via pip install. The package comes with queries, collections, and qrels for standard IR test collections, as well as pre-built indexes and evaluation scripts. In short, batteries are included. Pyserini supports, out of the box, the entire research lifecycle of efforts aimed at improving ranking models.

• Pyserini can be used as a standalone module to generate batch retrieval runs or be integrated as a library into an application designed to support interactive retrieval.

• Pyserini supports sparse retrieval (e.g., BM25 scoring using bag-of-words representations), dense retrieval (e.g., nearest-neighbor search on transformer-encoded representations), as well as hybrid retrieval that integrates both approaches.
• Pyserini provides access to data structures and system internals to support advanced users. This includes access to postings, document vectors, and raw term statistics that allow our toolkit to support use cases that we had not anticipated.

Pyserini began as the Python interface to Anserini [27, 28], which our group has been developing for several years, with its roots in a community-wide replicability exercise dating back to 2015 [14]. Anserini builds on the open-source Lucene search library and was motivated by the desire to better align academic research with the practice of building real-world search applications; see, for example, Grand et al. [9]. More recently, we recognized that Anserini's reliance on the Java Virtual Machine (due to Lucene) greatly limited its reach [2, 3], as Python has emerged as the language of choice for both data scientists and researchers. This is particularly the case for work on deep learning today, since the major toolkits (PyTorch [22] and TensorFlow [1]) have both adopted Python as their front-end language. Thus, Pyserini aims to be a "feature-complete" Python interface to Anserini.

Sparse retrieval support in Pyserini comes entirely from Lucene (via Anserini). To support dense and hybrid retrieval, Pyserini integrates Facebook's FAISS library for efficient similarity search over dense vectors [11], which in turn integrates the HNSW library [17] to support low-latency querying. Thus, Pyserini provides a superset of features in Anserini; dense and hybrid retrieval are entirely missing from the latter.

This paper is organized in the following manner: After a preamble on our design philosophy, we begin with a tour of Pyserini, highlighting its main features and providing the reader with a sense of how it might be used in a number of common scenarios. This is followed by a presentation of empirical results illustrating the use of Pyserini to provide first-stage retrieval in two popular ranking tasks today. Before concluding with future plans, we discuss how our group has internalized replicability as a shared norm through social processes supported by technical infrastructure.

Our design philosophy can be summarized by the adage that "easy things should be easy and hard things should be possible." While aspects of the lifecycle for systems-oriented IR research are not difficult per se, there are many details that need to be managed: downloading the right version of a corpus, building indexes with the appropriate settings (tokenization, stopwords, etc.), downloading queries and relevance judgments (deciding between available "variants"), manipulating runs into the correct output format for the evaluation script, selecting the right metrics to obtain meaningful results, etc. The list goes on. These myriad details often trip up new researchers who are just learning systems-oriented IR evaluation methodology (motivating work such as Akkalyoncu Yilmaz et al. [2]), and occasionally subtle issues confuse experienced researchers as well. The explicit goal of Pyserini is to make these "easy things" easy, supporting common tasks and reducing the possibility of confusion as much as possible. At the other end of the spectrum, "hard things should be possible". In our context, this means that Pyserini provides access to data structures and system internals to support researchers who may use our toolkit in ways we had not anticipated.
For sparse retrieval, the Lucene search library that underlies Anserini provides interfaces to control various aspects of indexing and retrieval, and Pyserini exposes a subset of features that we anticipate will be useful for IR researchers. These include, for example, traversing postings lists to access raw term statistics, manipulating document vectors to reconstruct term weights, and fine-grained control over document processing (tokenization, stemming, stopword removal, etc.). Pyserini aims to sufficiently expose Lucene internals to make "hard things" possible.

Finally, the most common use case of Pyserini as first-stage retrieval in a multi-stage ranking architecture means that replicability is of utmost concern, since it is literally the foundation that complex reranking pipelines are built on. In our view, replicability can be divided into technical and social aspects: an example of the former is an internal end-to-end regression framework that automatically validates experimental results. The latter includes a commitment to "eat our own dog food" and the adoption of shared norms. We defer more detailed discussions of replicability to Section 5.

Pyserini is packaged as a Python module available on the Python Package Index. Thus, the toolkit can be installed via pip, as follows:

$ pip install pyserini==0.11.0.0

In this paper, we are explicitly using v0.11.0.0. The code for the toolkit itself is available on GitHub at pyserini.io; for users who may be interested in contributing to Pyserini, we recommend a "development" installation, i.e., cloning the source repository itself. However, for researchers interested only in using Pyserini, the module installed via pip suffices.

In this section, we will mostly use the MS MARCO passage ranking dataset [5] as our running example. The dataset has many features that make it ideal for highlighting various aspects of our toolkit: the corpus, queries, and relevance judgments are all freely downloadable; the corpus is manageable in size and thus experiments require only modest compute resources (and time); the task is popular and thus well-studied by many researchers.

Figure 1: Simple example of sparse retrieval (bag-of-words BM25 ranking).
1 from pyserini.search import SimpleSearcher
2
3 searcher = SimpleSearcher.from_prebuilt_index('msmarco-passage')
4 hits = searcher.search('what is a lobster roll?', 10)

In Figure 1, we begin with a simple example of using Pyserini to perform bag-of-words ranking with BM25 (the default ranking model) on the MS MARCO passage corpus (comprising 8.8M passages). To establish a parallel with "dense retrieval" techniques using learned transformer-based representations (see below), we refer to this as "sparse retrieval", although this is not common parlance in the IR community at present. The SimpleSearcher class provides a single point of entry for sparse retrieval functionality. In (L3), we initialize the searcher with a pre-built index. For many commonly used collections where there are no data distribution restrictions, we have built indexes that can be directly downloaded from our project servers. For researchers who simply want an "out-of-the-box" keyword retrieval baseline, this provides a simple starting point. Specifically, the researcher does not need to download the collection and build the index from scratch. In this case, the complete index, which includes a copy of all the texts, is a modest 2.6GB. Using an instance of SimpleSearcher, we issue a query to retrieve the top 10 hits (L4), the results of which are stored in the array hits.
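The tail of the Figure 1 listing (the loop referenced below as L6-7) did not survive in this copy. The following is a minimal sketch of what that iteration might look like, assuming that each hit exposes docid and score attributes and that SimpleSearcher.doc() returns a document object whose stored text is available via raw(); these details are worth verifying against the Pyserini documentation for the version in use.

# Continuing from Figure 1: iterate over the results and print
# out rank, docid, and score for each hit.
for i, hit in enumerate(hits):
    print(f'{i + 1:2} {hit.docid:15} {hit.score:.5f}')

# If desired, fetch the stored text of a hit from the index,
# e.g., to feed a downstream reranker.
doc = searcher.doc(hits[0].docid)
print(doc.raw())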
Naturally, there are methods to control ranking behavior, such as setting BM25 parameters and enabling the use of pseudo-relevance feedback, but for space considerations these options are not shown here. In (L6-7), we iterate through the results and print out rank, docid, and score. If desired, the actual text can be fetched from the index (e.g., to feed a downstream reranker).

Figure 2 shows an example of interactive retrieval using dense learned representations. Here, we are using TCT-ColBERT [16], a model our group has constructed from ColBERT [13] using knowledge distillation. As with sparse retrieval, we provide pre-built indexes that can be directly downloaded from our project servers.

Figure 2: Simple example of interactive dense retrieval (i.e., approximate nearest-neighbor search on dense learned representations).
1 from pyserini.dsearch import SimpleDenseSearcher, \
2     TCTColBERTQueryEncoder
3
4 encoder = TCTColBERTQueryEncoder('castorini/tct_colbert-msmarco')
5 searcher = SimpleDenseSearcher.from_prebuilt_index(
6     'msmarco-passage-tct_colbert-hnsw',
7     encoder
8 )
9 hits = searcher.search('what is a lobster roll')

The SimpleDenseSearcher class serves as the entry point to nearest-neighbor search functionality that provides top-k retrieval on dense vectors. Here, we are taking advantage of HNSW [17], which has been integrated into FAISS [11] to enable low-latency interactive querying (L6). The final component needed for dense retrieval is a query encoder that converts user queries into the same representational space as the documents. We initialize the query encoder in (L4), which is passed into the method that constructs the searcher. The encoder itself is a lightweight wrapper around the Transformers library by Huggingface [25]. Retrieval is performed in the same manner (L9), and we can manipulate the returned hits array in a manner similar to sparse retrieval (Figure 1). At present, we support the TCT-ColBERT model [16] as well as DPR [12]. Note that our goal here is to provide retrieval capabilities based on existing models; quite explicitly, representational learning lies outside the scope of our toolkit (see additional discussion in Section 6).

Of course, the next step is to combine sparse and dense retrieval, which is shown in Figure 3. Our HybridSearcher takes the sparse retriever and the dense retriever as constructor arguments and performs weighted interpolation on the individual results to arrive at a final ranking. This is a standard approach and Pyserini adopts the specific implementation in TCT-ColBERT [16], but similar techniques are used elsewhere as well [12].

Figure 3: Simple example of hybrid retrieval combining sparse and dense results.
1 from pyserini.search import SimpleSearcher
2 from pyserini.dsearch import SimpleDenseSearcher, \
3     TCTColBERTQueryEncoder
4 from pyserini.hsearch import HybridSearcher
5
6 ssearcher = SimpleSearcher.from_prebuilt_index('msmarco-passage')
7 encoder = TCTColBERTQueryEncoder('castorini/tct_colbert-msmarco')
8 dsearcher = SimpleDenseSearcher.from_prebuilt_index(
9     'msmarco-passage-tct_colbert-hnsw',
10     encoder
11 )
12 hsearcher = HybridSearcher(dsearcher, ssearcher)
13 hits = hsearcher.search('what is a lobster roll', 10)

Beyond the corpus, topics (queries) and relevance judgments (qrels) form indispensable components of IR test collections to support systems-oriented research aimed at producing better ranking models. Many topics and relevance judgments are freely available for download, but at disparate locations (in various formats), and often it may not be obvious to a newcomer where to obtain these resources and which exact files to use.
Pyserini tackles this challenge by packaging together these evaluation resources and providing a unified interface for accessing them. Figure 4 shows an example of loading topics via get_topics (L3) and loading qrels via get_qrels (L4) for the standard 6980-query subset of the development set of the MS MARCO passage ranking test collection. We have taken care to name the text descriptors consistently, so the associations between topics and relevance judgments are unambiguous.

Figure 4: Loading topics and qrels for a standard test collection.
1 from pyserini.search import get_topics, get_qrels
2
3 topics = get_topics('msmarco-passage-dev-subset')
4 qrels = get_qrels('msmarco-passage-dev-subset')

Using Pyserini's provided functions, the topics and qrels are loaded into simple Python data structures and are thus easy to manipulate. A standard TREC topic has different fields (e.g., title, description, narrative), which we model as a Python dictionary. Similarly, qrels are nested dictionaries: query ids mapping to a dictionary of docids to (possibly graded) relevance judgments. Our choice to use Python data structures means that they can be manipulated using standard constructs such as list comprehensions. For example, we can straightforwardly compute the average length of queries (L7) and the average number of relevance judgments per query (L10).

Putting everything discussed above together, it is easy in Pyserini to perform an end-to-end batch retrieval run with queries from a standard test collection. For example, a single command-line invocation generates a run on the development queries of the MS MARCO passage ranking task (with BM25); the option --msmarco specifies the MS MARCO output format, and an alternative is the TREC format. We can evaluate the effectiveness of the run with another simple command:

$ python -m pyserini.eval.msmarco_passage_eval \
    msmarco-passage-dev-subset run.msmarco-passage.txt
#####################
MRR @10: 0.18741227770955546
QueriesRanked: 6980
#####################

Pyserini includes a copy of the official evaluation script and provides a lightweight convenience wrapper around it. The toolkit manages qrels internally, so the user simply needs to provide the name of the test collection, without having to worry about downloading, storing, and specifying external files. Otherwise, the usage of the evaluation module is exactly the same as the official evaluation script; in fact, Pyserini simply dispatches to the underlying script after it translates the qrels mapping internally.

The above result corresponds to an Anserini baseline on the MS MARCO passage leaderboard. This is worth emphasizing and nicely illustrates our goal of making Pyserini easy to use: with one simple command, it is possible to replicate a run that serves as a common baseline on a popular leaderboard, providing a springboard to experimenting with different ranking models in a multi-stage architecture. Similar commands provide replication for batch retrieval with dense representations as well as hybrid retrieval.

Beyond existing corpora and test collections, a common use case for Pyserini is users who wish to search their own collections. For bag-of-words sparse retrieval, we have built in Anserini (written in Java) custom parsers and ingestion pipelines for common document formats used in IR research, for example, the TREC SGML format used in many newswire collections and the WARC format for web collections. However, exposing the right interfaces and hooks to support custom implementations in Python is awkward.
Instead, we have implemented support for a generic and flexible JSON-formatted collection in Anserini, and Pyserini's indexer directly accesses the underlying capabilities in Anserini. Thus, searching custom collections in Pyserini requires first writing a simple script to reformat existing documents into our JSON specification, and then invoking the indexer. For dense retrieval, support for custom collections is less mature at present, but we provide utility scripts that take an encoder model to convert documents into dense representations, and then build indexes that support querying.

The design of Pyserini makes it easy to use as a standalone module or to integrate as a library in another application. In the first use case, a researcher can replicate a baseline (first-stage retrieval) run with a simple invocation and take the output run file (which is just plain text) to serve as input for downstream reranking, or as part of ensembles [6, 8]. As an alternative, Pyserini can be used as a library that is tightly integrated into another package; see additional discussions in Section 6.

Beyond simplifying the research lifecycle of working with standard IR test collections, Pyserini provides access to system internals to support use cases that we might not have anticipated. A number of these features for sparse retrieval are illustrated in Figure 5 and available via the IndexReader object, which can be initialized with pre-built indexes in the same way as the searcher classes; a sketch of several of these calls appears at the end of this section.

Figure 5: Examples of using Pyserini to access system internals such as term statistics and postings lists.

In (L7-9), we illustrate how to iterate over all terms in a corpus (i.e., its dictionary) and access each term's document frequency and collection frequency. Here, we use standard Python tools to select and print out the first 10 terms alphabetically. In the next example, (L12-14), we show how to "analyze" a word (what Lucene calls tokenization, stemming, etc.). For example, the analyzed form of "atomic" is "atom". Since terms in the dictionary (and document vectors, see below) are stored in analyzed form, these methods are necessary to access system internals. Another way to access collection statistics is shown in (L17-18) by direct lookup.

Pyserini also provides raw access to index structures, both the inverted index as well as the forward index (i.e., to fetch document vectors). In (L21-23), we show an example of looking up a term's postings list and traversing its postings, printing out term frequency and term position occurrences. Access to the forward index is shown in (L26-27) based on a docid: in the first case, Pyserini returns a dictionary mapping from terms in the document to their term frequencies; in the second case, Pyserini returns a dictionary mapping from terms to their term positions in the document. From these methods, we can, for example, look up document frequencies for all terms in a document using a list comprehension in Python (L28-31). This might be further manipulated to compute tf-idf scores. Finally, the toolkit provides a convenience method for computing BM25 term weights, using which we can reconstruct the BM25-weighted document vector (L32-37).

At present, access to system internals focuses on manipulating sparse representations. Dense retrieval capabilities in Pyserini are less mature.
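Since the Figure 5 listing itself is not reproduced above, the following is a minimal sketch of the kinds of calls just described, assuming IndexReader methods named terms, analyze, get_term_counts, get_postings_list, get_document_vector, and compute_bm25_term_weight; the exact signatures, and the example docid used here, should be checked against the Pyserini documentation for the version in use.

import itertools
from pyserini.index import IndexReader

reader = IndexReader.from_prebuilt_index('msmarco-passage')

# Iterate over the dictionary: document frequency (df) and
# collection frequency (cf) for the first 10 terms alphabetically.
for term in itertools.islice(reader.terms(), 10):
    print(f'{term.term} df={term.df} cf={term.cf}')

# "Analyze" a word (tokenization, stemming, etc.): 'atomic' -> 'atom'.
analyzed = reader.analyze('atomic')

# Direct lookup of collection statistics for a term.
df, cf = reader.get_term_counts('atomic')

# Traverse a term's postings list, printing tf and term positions.
for posting in reader.get_postings_list('atomic'):
    print(posting.docid, posting.tf, posting.positions)

# Forward index: the document vector as a dict of term -> term frequency.
# '7157715' is an arbitrary example docid from the MS MARCO passage corpus.
tf = reader.get_document_vector('7157715')

# Document frequencies for all terms in the document (terms here are
# already analyzed), and a reconstructed BM25-weighted document vector.
df_per_term = {term: reader.get_term_counts(term, analyzer=None)[0]
               for term in tf.keys()}
bm25_vector = {term: reader.compute_bm25_term_weight('7157715', term, analyzer=None)
               for term in tf.keys()}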
For dense retrieval, it is not entirely clear what advanced features researchers will want, but we anticipate adding support as the needs and use cases become clearer.

Having provided a "tour" of Pyserini and some of the toolkit's features, in this section we present experimental results to quantify its effectiveness for first-stage retrieval. Currently, Pyserini provides support for approximately 30 test collections; here, we focus on two popular leaderboards.

Pyserini provides baselines for two MS MARCO datasets [5]: the passage ranking task (Table 1) and the document ranking task (Table 2). In both cases, we report the official metric (MRR@10 for passage, MRR@100 for document). For the development set, we additionally report recall at rank 1000, which is useful in establishing an upper bound on reranking effectiveness. Note that evaluation results on the test sets are only available via submissions to the leaderboard, and therefore we do not have access to recall figures. Furthermore, since the organizers discourage submissions that are "too similar" (e.g., minor differences in parameter settings) and actively limit the number of submissions to the leaderboard, we follow their guidance and hence do not have test results for all of our experimental conditions.

Table 1: Results on the MS MARCO passage ranking task.

For the passage ranking task, Pyserini supports sparse retrieval, dense retrieval, as well as hybrid dense-sparse retrieval; all results in rows (1) through (3) are replicable with our toolkit. Row (1a) reports the effectiveness of sparse bag-of-words ranking using BM25 with default parameter settings on the original text; row (1b) shows results after tuning the parameters on a subset of the dev queries via simple grid search to maximize recall at rank 1000. Parameter tuning makes a small difference in this case. Pyserini also provides document expansion baselines using our doc2query method [21]; the latest model uses T5 [24] as described in Nogueira and Lin [19]. Bag-of-words BM25 ranking over the corpus with document expansion is shown in rows (1c) and (1d) for default and tuned parameters. We see that doc2query yields a large jump in effectiveness, while still using bag-of-words retrieval, since neural inference is applied to generate expansions prior to the indexing phase. With doc2query, parameter tuning also makes a difference.

For dense retrieval, results using TCT-ColBERT [16] are shown in rows (2) using different indexes. Row (2a) refers to brute-force scans over the document vectors in FAISS [11], which provides exact nearest-neighbor search. Row (2b) refers to approximate nearest-neighbor search using HNSW [17]; the latter yields a small loss in effectiveness, but enables interactive querying. We see that retrieval using dense learned representations is much more effective than retrieval using sparse bag-of-words representations, even taking into account document expansion techniques.

Results of hybrid techniques that combine sparse and dense retrieval using weighted interpolation are shown next in Table 1. Row (3a) shows the results of combining TCT-ColBERT with BM25 bag-of-words search over the original texts, while row (3b) shows results that combine document expansion using doc2query with the T5 model. In both cases we used a brute-force approach.
Results show that combining sparse and dense signals is more effective than either alone, and that the hybrid technique continues to benefit from document expansion.

To put these results in context, rows (4) provide a few additional points of comparison. Row (4a) shows the BM25 baseline provided by the MS MARCO leaderboard organizers, which appears to be less effective than Pyserini's implementation. Rows (4b) and (4c) refer to two alternative dense-retrieval techniques; these results show that our TCT-ColBERT model performs on par with competing models. Finally, rows (4d) and (4e) show results from two of our own reranking pipelines built on Pyserini as first-stage retrieval: monoBERT, a standard BERT-based reranker [20], and our "Expando-Mono-Duo" design pattern with T5 [23]. These illustrate how Pyserini can serve as the foundation for further explorations in neural ranking techniques.

Results on the MS MARCO document ranking task are shown in Table 2. For this task, there are two common configurations, what we call "per-document" vs. "per-passage" indexing. In the former, each document in the corpus is indexed as a separate document; in the latter, each document is first segmented into multiple passages, and each passage is indexed as a separate "document". Typically, for the "per-passage" index, a document ranking is constructed by simply taking the maximum of per-passage scores; the motivation for this design is to reduce the amount of text that computationally expensive downstream rerankers need to process.

Rows (1a)-(1d) show the per-document and per-passage approaches on the original texts, using default parameters and after tuning for recall@100 using grid search. With default parameters, there appears to be a large effectiveness gap between the per-document and per-passage approaches, but with properly tuned parameters, (1b) vs. (1d), we see that they achieve comparable effectiveness. As with passage retrieval, we can include document expansion with either the per-document or per-passage approaches (the difference is whether we append the expansions to each document or each passage); these results are shown in (1e) and (1f). Similarly, the differences in effectiveness between the two approaches are quite small.

Dense retrieval using TCT-ColBERT is shown in row (2); this is a new experimental condition that was not reported in Lin et al. [16]. Here, we are simply using the encoder that has been trained on the MS MARCO passage data in a zero-shot manner. Since these encoders were not designed to process long segments of text, only the per-passage condition makes sense here. In row (3a), we combine row (2) with the per-passage sparse retrieval results on the original text, and in row (3b), with the per-passage sparse retrieval results using document expansion.

Overall, the findings are consistent with the passage ranking task: Dense retrieval is more effective than sparse retrieval (although the improvements for document ranking are smaller, most likely due to zero-shot application). Dense and sparse signals are complementary, as shown by the effectiveness of the dense-sparse hybrid, which further benefits from document expansion (although the gains from expansion appear to be smaller).

Similar to the passage ranking task, Table 2 provides a few points of comparison. Row (4a) shows the effectiveness of the BM25 baseline provided by the leaderboard organizers; once again, we see that Pyserini's results are better.
Row (4b) shows ANCE results [26], which are more effective than TCT-ColBERT, although the comparison is not quite fair since our models were not trained on MS MARCO document data. Finally, row (4c) shows the results of applying our "Expando-Mono-Duo" design pattern with T5 [23] in a zero-shot manner.

In summary, Pyserini "covers all the bases" in terms of providing first-stage retrieval for modern research on neural ranking approaches: sparse retrieval, dense retrieval, as well as hybrid techniques combining both approaches. Experimental results on two popular leaderboards show that our toolkit provides a good starting point for further research.

As replicability is a major consideration in the design and implementation of Pyserini, it is worthwhile to spend some time discussing practices that support this goal. At a high level, we can divide replicability into technical and social aspects. Of the two, we believe the latter are more important, because any technical tool to support replicability will either be ignored or circumvented unless there is a shared commitment to the goal and established social practices to promote it. Replicability is often in tension with other important desiderata, such as the ability to rapidly iterate, and thus we are constantly struggling to achieve the right balance.

Perhaps the most important principle that our group has internalized is "to eat our own dog food", which refers to the colloquialism of using one's own "product". Our group uses Pyserini as the foundation for our own research on transformer-based reranking models, dense learned representations for reranking, and beyond (see more details in Section 6). Thus, replicability comes at least partially from our self-interest: to ensure that group members can repeat their own experiments and replicate each other's results. If we can accomplish replicability internally, then external researchers should be able to replicate our results if we ensure that there is nothing peculiar about our computing environment.

Our shared commitment to replicability is operationalized into social processes and is supported by technical infrastructure. To start, Pyserini as well as the underlying Anserini toolkit adopt standard best practices in open-source software development. Our code base is available on GitHub, issues are used to describe proposed feature enhancements and bugs, and code changes are mediated via pull requests that are code reviewed by members of our group.

Over the years, our group has worked hard to internalize the culture of writing replication guides for new capabilities, typically paired with our publications; these are all publicly available and stored alongside our code. These guides include, at a minimum, the sequence of command-line invocations that are necessary to replicate a particular set of experimental results, with accompanying descriptions in prose. In theory, copying and pasting commands from the guide into a shell should succeed in replication. In practice, we regularly "try out" each other's replication guides to uncover what didn't work and to offer improvements to the documentation. Many of these guides are associated with a "replication log" at the bottom of the guide, which contains a record of individuals who have successfully replicated the results, and the commit id of the code version they used. With these replication logs, if some functionality breaks, it becomes much easier to debug by rewinding the code commits back to the previous point where it last "worked".
How do we motivate individuals to write these guides and replicate each other's results? We have two primary tools: appealing to reciprocity and providing learning experiences for new group members. For new students who wish to become involved in our research, conducting replications is an easy way to learn our code base, and hence provides a strong motivation. Replications are particularly fruitful exercises for undergraduates as their first step in learning about research. For students who eventually contribute to Pyserini, appeals to reciprocity are effective: they are the beneficiaries of previous group members who "paved the way" and thus it behooves them to write good documentation to support future students. Once established, such a culture becomes a self-reinforcing virtuous cycle.

Building on these social processes, replicability in Anserini is further supported by an end-to-end regression framework that, for each test collection, runs through the following steps: builds the index from scratch (i.e., from the raw corpus), performs multiple retrieval runs (using different ranking models), evaluates the output (e.g., with trec_eval), and verifies effectiveness figures against expected results. Furthermore, the regression framework automatically generates documentation pages from templates, populating results on each successful execution. All of this happens automatically without requiring any human intervention. There are currently around 30 such tests, which take approximately two days to run end to end. The largest of these tests, which occupies most of the time, builds a 12 TB index on all 733 million pages of the ClueWeb12 collection. Although it is not practical to run these regression tests for each code change, we do try to run them as often as possible, resources permitting. This has the effect of catching new commits that break existing regressions early so they are easier to debug. We keep a change log that tracks divergences from expected results (e.g., after a bug fix) or when new regressions are added.

On top of the regression framework in Anserini, further end-to-end regression tests in Pyserini compare its output against Anserini's output to verify that the Python interface does not introduce any bugs. These regression tests, for example, exercise different parameter settings from the command line, ensure that single-threaded and multi-threaded execution yield identical results, and verify that pre-built indexes can be successfully downloaded.

Written guides and automated regression testing lie along a spectrum of replication rigor. We currently do not have clear-cut criteria as to what features become "enshrined" in automated regressions. However, as features become more critical and foundational in Pyserini or Anserini, we become more motivated to include them in our automated testing framework.

In summary, replicability has become ingrained as a shared norm in our group, operationalized in social processes and facilitated by technical infrastructure. This has allowed us to balance the demands of replicability with the ability to iterate at a rapid pace.

Anserini has been in development for several years and our group has been working on Pyserini since late 2019. The most recent major feature added to Pyserini (in 2021) has been dense retrieval capabilities alongside bag-of-words sparse retrieval, and their integration in hybrid sparse-dense techniques.
Despite much activity and continued additions to our toolkit, the broad contours of what Pyserini "aims to be" are fairly well defined. We plan to stay true to our goal of providing replicable and easy-to-use techniques that support innovations in neural ranking methods. Because it is not possible for any single piece of software to do everything, an important part of maintaining focus on our goals is to be clear about what Pyserini is not going to do.

While we are planning to add support for more dense retrieval techniques based on learned representations, quite explicitly the training of these models is outside the scope of Pyserini. At a high level, the final "product" of any dense retrieval technique comprises an encoder for queries and an encoder for documents (and in some cases, these are the same). The process of training these encoders can be quite complex, involving, for example, knowledge distillation [10, 16] and complex sampling techniques [26]. This is an area of active exploration and it would be premature to try to build a general-purpose toolkit for learning such representations. For dense retrieval techniques, Pyserini assumes that query/document encoders have already been learned: in modern approaches based on pretrained transformers, Huggingface's Transformers library has become the de facto standard for working with such models, and our toolkit provides tight integration. From this starting point, Pyserini provides utilities for building indexes that support nearest-neighbor search on these dense representations. However, it is unlikely that Pyserini will, even in the future, become involved in the training of dense retrieval models.

Another conscious decision we have made in the design of Pyserini is to not prescribe an architecture for multi-stage ranking and to not include neural reranking models in the toolkit. Our primary goal is to provide replicable first-stage retrieval, and we did not want to express an opinion on how multi-stage ranking should be organized. Instead, our group is working on a separate toolkit, called PyGaggle, that provides implementations for much of our work on multi-stage ranking, including our "mono" and "duo" designs [23] as well as ranking with sequence-to-sequence models [18]. PyGaggle is designed specifically to work with Pyserini, but the latter was meant to be used independently, and we explicitly did not wish to "hard code" our own research agenda. This separation has made it easier for other neural IR toolkits to build on Pyserini, for example, the Capreolus toolkit [29, 30].

On top of PyGaggle, we have been working on faceted search interfaces to provide a complete end-to-end search application: this was initially demonstrated in our Covidex [31] search engine for COVID-19 scientific articles. We have since generalized the application into Cydex, which provides infrastructure for searching the scientific literature, demonstrated in different domains [7]. Our ultimate goal is to provide reusable libraries for crafting end-to-end information access applications, and we have organized the abstractions in a manner that allows users to pick and choose what they wish to adopt and build on: Pyserini to provide first-stage retrieval and basic support, PyGaggle to provide neural reranking models, and Cydex to provide a faceted search interface.

Our group's efforts to promote and support replicable IR research date back to 2015 [4, 14], and the landscape has changed quite a bit since then.
Today, there is much more awareness of the issues surrounding replicability; norms such as the sharing of source code have become more entrenched than before, and we have access to better tools now (e.g., Docker, package managers, etc.) than we did before. At the same time, however, today's software ecosystem has become more complex; ranking models have become more sophisticated and modern multi-stage ranking architectures involve more complex components than before. In this changing environment, the need for stable foundations on which to build remains. With Pyserini, it has been and will remain our goal to provide easy-to-use tools in support of replicable IR research.

This research was supported in part by the Canada First Research Excellence Fund, the Natural Sciences and Engineering Research Council (NSERC) of Canada, and the Waterloo-Huawei Joint Innovation Laboratory.

References
[1] TensorFlow: A system for large-scale machine learning.
[2] A Lightweight Environment for Learning Experimental IR Research Practices.
[3] Applying BERT to Document Retrieval with Birch.
[4] Report on the SIGIR 2015 Workshop on Reproducibility, Inexplicability, and Generalizability of Results (RIGOR). SIGIR Forum.
[5] MS MARCO: A Human Generated MAchine Reading COmprehension Dataset.
[6] RRF102: Meeting the TREC-COVID Challenge with a 100+ Runs Ensemble.
[7] Cydex: Neural Search Infrastructure for the Scholarly Literature.
[8] CO-Search: COVID-19 Information Retrieval with Semantic Search, Question Answering, and Abstractive Summarization.
[9] From MaxScore to Block-Max WAND: The Story of How Lucene Significantly Improved Query Evaluation Performance.
[10] Improving Efficient Neural Ranking Models with Cross-Architecture Knowledge Distillation.
[11] Billion-scale similarity search with GPUs.
[12] Dense Passage Retrieval for Open-Domain Question Answering.
[13] ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT.
[14] Toward Reproducible Baselines: The Open-Source IR Reproducibility Challenge.
[15] Pretrained Transformers for Text Ranking: BERT and Beyond.
[16] Distilling Dense Representations for Ranking using Tightly-Coupled Teachers.
[17] Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs.
[18] Document Ranking with a Pretrained Sequence-to-Sequence Model.
[19] From doc2query to docTTTTTquery.
[20] Multi-Stage Document Ranking with BERT.
[21] Document Expansion by Query Prediction.
[22] PyTorch: An Imperative Style, High-Performance Deep Learning Library.
[23] The Expando-Mono-Duo Design Pattern for Text Ranking with Pretrained Sequence-to-Sequence Models.
[24] Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer.
[25] Transformers: State-of-the-Art Natural Language Processing.
[26] Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval.
[27] Anserini: Enabling the Use of Lucene for Information Retrieval Research.
[28] Anserini: Reproducible Ranking Baselines Using Lucene.
[29] Capreolus: A Toolkit for End-to-End Neural Ad Hoc Retrieval.
[30] Flexible IR Pipelines with Capreolus.
[31] Covidex: Neural Ranking Models and Keyword Search Infrastructure for the COVID-19 Open Research Dataset.