title: Simplified Data Wrangling with ir_datasets
authors: MacAvaney, Sean; Yates, Andrew; Feldman, Sergey; Downey, Doug; Cohan, Arman; Goharian, Nazli
date: 2021-03-03
DOI: 10.1145/3404835.3463254

Managing the data for Information Retrieval (IR) experiments can be challenging. Dataset documentation is scattered across the Internet, and once one obtains a copy of the data, there are numerous different data formats to work with. Even basic formats can have subtle dataset-specific nuances that need to be considered for proper use. To help mitigate these challenges, we introduce a new robust and lightweight tool (ir_datasets) for acquiring, managing, and performing typical operations over datasets used in IR. We primarily focus on textual datasets used for ad-hoc search. This tool provides both a Python and command line interface to numerous IR datasets and benchmarks. To our knowledge, this is the most extensive tool of its kind. Integrations with popular IR indexing and experimentation toolkits demonstrate the tool's utility. We also provide documentation of these datasets through the ir_datasets catalog: https://ir-datasets.com/. The catalog acts as a hub for information on datasets used in IR, providing core information about what data each benchmark provides as well as links to more detailed information. We welcome community contributions and intend to continue to maintain and grow this tool.

The datasets and benchmarks we use are a cornerstone of Information Retrieval (IR) research. Unfortunately, many of these datasets remain frustrating to find and manage. Once obtained, the variety of data formats can be a challenge to work with. Even data formats that seem simple can hide subtle problems. For example, the TSV files used by MS-MARCO [66] have a double-encoding problem that affects special characters in roughly 20% of documents.

Recently, several tools have begun to incorporate automatic dataset acquisition, including Capreolus [93], PyTerrier [58], and OpenNIR [55]. These reduce the user burden of finding the dataset source files and figuring out how to parse them correctly. However, the dataset coverage of each individually is patchy, as shown in Table 1. Further, using the dataset interfaces outside of these tools can be difficult, as they are often tightly coupled with the tool's primary functionality. Finally, each of these tools keeps its own copy of the data, leading to wasted storage. Thus, it is advantageous to have a lightweight tool that focuses on data acquisition, management, and typical operations like lookups.

Many tools rely on manual instructions for downloading, extracting, and processing datasets. We believe that providing a tool to automatically perform as much of this work as possible is clearly preferable, since it ensures proper processing of the data. A common automatic tool has additional advantages, such as reducing redundant copies of datasets and easily allowing tools to be run on alternative or custom datasets with little effort. Anserini [91] and its Python interface Pyserini [53] use a hybrid approach: they distribute copies of queries and relevance judgments in the package itself and primarily rely on manual instructions for document processing. Sometimes Anserini provides document content via downloadable indices. Other dataset distribution tools are not well-suited for IR tasks.
For instance, packages like HuggingFace Datasets [90] and TensorFlow Datasets [3] take a record-centric approach that is not well-suited for relational data like documents, queries, and query-document relevance assessments. Furthermore, IR work involves additional important use cases when working with datasets, such as efficiently looking up a document by ID, for which the designs of prior libraries are not conducive. Dataset schemata, such as DCAT and schema.org, provide a common format for machine-readable dataset documentation, which could be supported in the future.

Table 1: Dataset support in Capreolus [93] (Cap.), PyTerrier [58] (PT), OpenNIR [55] (ONIR), Anserini [91] (Ans.), and ir_datasets (IRDS). A filled mark indicates built-in support that automatically provides documents, queries, and query relevance judgments (i.e., as an automatic download). ♢ indicates support for a dataset with some manual effort (e.g., specifying the document parser and settings to use). Datasets marked with * have licenses that require manual effort (e.g., requesting from NIST), and therefore can at most have ♢. The datasets covered include, among others: MS-MARCO Pass. [66], MS-MARCO QnA [66], Natural Questions [48, 50], TREC CAR [28, 29], TREC DL [25, 26], TREC DL-Hard [59], and TriviaQA [47, 48]; Scientific, Bio-medical, Health: Cranfield [1], CLEF eHealth* [64, 94], NFCorpus [9], TREC CDS [71, 72, 77], TREC COVID [84, 88], TREC Genomics [40-43], TREC Health Misinfo.* [4], TREC PM [68-70], and TripClick* [67]; Web: NTCIR WWW* [54, 62], ORCAS [21], TREC Million Query* [5, 6, 11], TREC Terabyte* [10, 12, 13], and TREC Web* [14-17, 19, 20, 22-24]; Other/Miscellaneous: BEIR [8, 9, 18, 30, 39, 44, 50, 60, 66, 81, 82, 84, 86-88, 92], CodeSearchNet [45], TREC Microblog [51, 52, 76], and WikIR [31, 32].

In this work, we present ir_datasets, a tool to aid IR researchers in the discovery, acquisition, and management of a variety of IR datasets. The tool provides a simple and lightweight Python and command line interface (see Figure 1), allowing users to iterate over the documents, queries, relevance assessments, and other relations provided by a dataset. This is useful for indexing, retrieval, and evaluation of ad-hoc retrieval systems. A document lookup API provides fast access to source documents, which is useful for recent text-based ranking models, such as those that use BERT [27]. PyTerrier [58], Capreolus [93], and OpenNIR [55] recently added support for ir_datasets, greatly expanding the number of datasets they support, and other tools like Anserini [91] can utilize our tool through the command line interface. Finally, the ir_datasets catalog (https://ir-datasets.com/) acts as a documentation hub, making it easy to find datasets and learn about their characteristics. We intend to continue to backfill prior datasets and add support for new datasets as they are released. The package is open source, and we welcome contributions.

ir_datasets is a lightweight tool focused on providing easy access to a variety of IR datasets and benchmarks. It provides both a Python and command line interface (see Figure 1), allowing it to be easily used by a variety of toolkits, or simply for ad-hoc data exploration. To achieve these goals, ir_datasets adheres to several design principles. First, to stay lightweight, the tool is focused on core dataset operations, such as downloading content, iterating through queries or documents, and performing document lookups by ID. This policy explicitly leaves functionality like full-text indexing or neural network processing to other tools.
Further, to be practical in a variety of environments, ir_datasets attempts to keep a low memory footprint by using inexpensive data structures and iterators. Finally, in order to leave maximum flexibility to the tool's users, we attempt to perform "just enough" processing of the data to account for various formats, while not removing information that is potentially useful. We hope that this commitment to being lightweight and flexible makes ir_datasets an attractive tool to jump-start or enhance other tools for doing IR research.

Since no standard identifiers (IDs) exist for datasets in IR, we propose hierarchical dataset IDs. These IDs allow datasets to be looked up in the Python API, command line interface, and online documentation. IDs are usually in the format corpus/benchmark. For instance, the TREC COVID [84] benchmark uses the CORD-19 [88] document corpus and is given the ID cord19/trec-covid. In this case, cord19 provides documents, while cord19/trec-covid provides queries and relevance judgments for those documents. A dataset object can be obtained simply by calling:

import ir_datasets
ds = ir_datasets.load("dataset-id")

Each dataset object provides access to a number of entity types (see Table 2). Dataset objects are stateless; they simply define the capabilities and the procedures for obtaining and processing the data.

Table 2: Entity types provided by dataset objects, including:
qrels (ds.qrels_iter()): A query relevance assessment. Maps a query_id and doc_id to a relevance score or other human assessments.
scoreddocs (ds.scoreddocs_iter(), uncommon): A scored document (akin to a line from a run file). Maps a query_id and doc_id to a ranking score from a system. Available for datasets that provide an initial ranking (for testing reranking systems).
docpairs (ds.docpairs_iter(), uncommon): A pair of documents (useful for training). Maps a query_id to two or more doc_ids. Available for datasets that provide suggested training pairs.

Most ad-hoc retrieval datasets consist of three main entity types: documents (docs), queries/topics (queries), and query relevance assessments (qrels). In the spirit of being simple, lightweight, and low-memory, entities are provided as namedtuple instances from iterators. For each entity type provided by a particular dataset, there is a corresponding ds.{entity}_iter() function that returns an iterator (e.g., ds.docs_iter()). Since the particular attributes returned for an entity differ between datasets (e.g., some provide only an ID and text for a document, while others also include a title field), type definitions can be accessed via ds.{entity}_cls(). The type definitions include type annotations for each field and try to adhere to conventions when possible (e.g., the ID of documents is the first field and named doc_id).

The iterator approach is versatile. In some cases, it is only necessary to operate over a single entity at a time, minimizing memory overhead. In other cases, particularly in neural networks, operations happen in batches, which can also be accomplished trivially through an iterator. And finally, in cases where all data needs to be loaded, all entities can be easily materialized, e.g., by passing the iterator into the Python list constructor or the dataframe constructor in Pandas [65]. Some datasets provide other entity types, such as sample document rankings or training sequences. For the former, we have a scoreddocs entity type, which by default is a tuple containing a query ID, a document ID, and a score. For the latter, we have a docpairs entity, which consists of a query and a pair of contrasting document IDs (e.g., one relevant and one non-relevant).
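For example, a minimal sketch of iterating entities with this API (the dataset ID is one used elsewhere in this paper; the field names shown in the comments are illustrative and vary by dataset):

import ir_datasets

ds = ir_datasets.load("cord19/trec-covid")

# Inspect the fields each entity type provides (namedtuple type definitions)
print(ds.docs_cls()._fields)     # e.g., includes 'doc_id', 'title', 'abstract'
print(ds.queries_cls()._fields)  # e.g., includes 'query_id', 'title', 'description'

# Iterate entities as namedtuples
for query in ds.queries_iter():
    print(query.query_id, query.title)

for qrel in ds.qrels_iter():
    print(qrel.query_id, qrel.doc_id, qrel.relevance)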
ir_datasets also provides a Command Line Interface (CLI) for performing basic operations over supported datasets. This is helpful for integration with tools not written in Python, or simply for ad-hoc data exploration. The primary operations of the CLI are export (corresponding to Python's dataset.*_iter() functions) and lookup (corresponding to Python's docstore.get_many_iter()). Examples of these operations are shown on the right-hand side of Figure 1. The command line interface supports multiple output formats, including TSV and JSON lines. The output fields can also be specified, if only certain data is desired.

When possible, ir_datasets downloads content automatically from the original public sources as needed. In cases where a data usage agreement exists, the user is notified before the file is downloaded. The download process is robust; it verifies the integrity of the downloaded content via a hash and is resilient to interrupted downloads by re-issuing the request if the connection is broken (using Range HTTP requests, if supported by the server). Further, the access to and integrity of downloadable content is automatically checked periodically using a continuous integration job, so that if access to some resources is lost (e.g., a file is moved) the problem can be quickly investigated and fixed. There are nearly 350 downloadable files supporting the current datasets in ir_datasets, each validated weekly.

Some data are not publicly available. For instance, due to their size, the ClueWeb 2009 and 2012 collections (used for tasks like the TREC Web Track and NTCIR WWW tasks) are obtained via hard drives. Other datasets, like the Arabic Newswire collection (used for the TREC Arabic tasks), contain copyrighted material and are only available with a usage agreement and a subscription to the Linguistic Data Consortium. In these cases, the user is presented with instructions on how to acquire the dataset and where to put it. Once acquired by the user, ir_datasets will take care of any remaining processing. There are currently 12 document collections that require a manual process to acquire.

ir_datasets supports a wide variety of datasets (see Table 1). These include some of the most popular evaluation benchmarks (e.g., TREC Robust [83]), large-scale shallow datasets (e.g., MS-MARCO [66]), biomedical datasets (e.g., TREC CDS [71, 72, 77]), multi- and cross-lingual datasets (e.g., TREC Arabic [33, 34]), a content-based weak supervision dataset (NYT [57]), a large-scale click dataset (ORCAS [21]), and a ranking benchmark suite (BEIR [81]). To our knowledge, this represents the largest collection and variety of IR datasets supported by any tool. To facilitate experiments with custom datasets, the Python API provides an easy mechanism to build a dataset object from files that use simple data formats:

ds = ir_datasets.create_dataset(
    docs_tsv="path/docs.tsv",
    queries_tsv="path/queries.tsv",
    qrels_trec="path/qrels")

It is a common task to look up documents by their ID. For instance, when training or running a neural IR model, it is often necessary to fetch the text of the current document to perform processing. As another example, a researcher looking into cases in which their model fails may want to see the text of the offending documents. One option is to load all documents into an in-memory hashmap. This may be appropriate in some cases, such as a long-running process where the large upfront cost is negligible and memory is plentiful (enough for the entire collection). Building an in-memory hashmap for a collection is trivial with the Python interface.
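For example (a sketch; the dataset ID and document ID below are purely illustrative):

import ir_datasets

ds = ir_datasets.load("cord19/trec-covid")  # example dataset ID

# Build an in-memory map from doc_id to the full document namedtuple
doc_map = {doc.doc_id: doc for doc in ds.docs_iter()}

# Subsequent lookups are plain dict accesses
doc = doc_map.get("some-doc-id")  # hypothetical document ID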
To support other cases, ir_datasets provides a docs_store API that simplifies the process of looking up documents from disk. This API supports fetching individual or multiple documents by their ID (see the sketch at the end of this discussion). The implementation of docs_store() varies based on the dataset. For many small datasets (those with up to a few million documents), we build a specialized lookup structure for the entire collection on disk as needed. This structure was built for this package to provide a good trade-off between lookup speed and storage costs. All documents are compressed using lz4 and stored in sequence. A separate sorted document ID and corresponding index offset structure is also built on disk. Although simple, we found that this structure enables lookups that exceed the performance of leading indexes and databases (see Table 3). In this experiment, we used the metadata lookup functionality of Anserini [91] and Terrier [63], and key-value storage with SQLite and MongoDB. The average duration was computed per query for the TREC DL 2019 passage task [26] (with the official set of reranking documents) and for TREC COVID complete [84] (using the judged documents). We also find that the storage cost is reasonable, with a total storage size comparable to MongoDB for the MS-MARCO passage collection and smaller than all others for the CORD19 collection.

For large collections, it is impractical and undesirable to make a copy of all documents. For instance, the ClueWeb09 and ClueWeb12 collections (for the TREC Web Track) are several TB in size, even when heavily compressed. Luckily, for these datasets, the directory structure mimics the structure of the document IDs, which allows the source file containing a given document ID to be easily identified. To speed up lookups within a given file, we use zlib-state to take periodic checkpoints of the zlib decoding state of the source files. This eliminates the need to read all the source file contents up to the desired document and greatly speeds up lookups of documents that appear late in the source files. The pre-built checkpoints are automatically downloaded and used when appropriate. Furthermore, we cache fetched documents on disk for even faster subsequent lookups. Different approaches are taken for other large collections, such as Tweets2013-ia [76] (for the TREC Microblog task [51, 52]).

See Table 4 for a comparison of document lookup times using ir_datasets and Pyserini (from the stored document source). Even though ir_datasets is slower than Pyserini on the first lookup, the cache greatly speeds up subsequent fetches (see "Warm"). Since experiments in neural IR frequently only work with a small subset of documents, this is very beneficial for these pipelines. We also observe that the checkpoint files for ClueWeb12 speed up lookups considerably, without adding much overhead in terms of storage; since Anserini keeps a copy of all documents, it accumulates around 6TB of storage overhead, compared to 4.5GB using ir_datasets. Note that the other approaches explored in Table 1 would accumulate similar storage overheads, as they also copy the data. Tweets2013-ia accumulates considerable storage costs, as the source hierarchy is not conducive to document lookups. In this case, ir_datasets builds an ID-based lookup file hierarchy.
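A minimal usage sketch of the docs_store API follows. The dataset ID is real but chosen for illustration, the document IDs are hypothetical, get_many_iter() is the lookup function referenced by the CLI above, and get() is assumed to behave analogously for a single ID:

import ir_datasets

ds = ir_datasets.load("msmarco-passage/trec-dl-2019")
docstore = ds.docs_store()

# Fetch a single document by ID (assumed single-document helper)
doc = docstore.get("1234567")  # hypothetical doc_id
print(doc.doc_id, doc.text[:80])

# Fetch several documents; they are yielded as they are found on disk
for doc in docstore.get_many_iter(["1234567", "7654321"]):
    print(doc.doc_id, doc.text[:80])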
In many cases, it is beneficial to select a segment of a document collection. For instance, some techniques involve pre-computing neural document representations to speed up reranking [56] or to perform first-stage retrieval [49]. In this case, dividing the operation over multiple GPUs or machines can yield substantial speed gains, as the process is embarrassingly parallel. To divide up the work, it is helpful to be able to select ranges of the document collection for processing. The Python standard library islice function is not ideal for this task because I/O and processing would still be performed for the skipped documents. Instead, all objects returned from docs_iter() can themselves be sliced directly. The implementation of the slicing depends on the particular dataset, but all implementations avoid unnecessary I/O and processing by seeking to the appropriate location in the source file. This fancy slicing implementation mostly follows typical Python slicing semantics, allowing different workers to be assigned specific ranges of documents (e.g., ds.docs_iter()[:1000] yields only the first 1,000 documents without reading the remainder of the collection).

The ir_datasets catalog provides an overview of the supported datasets and their capabilities (Figure 2). The documentation page for each individual dataset includes a brief description, relevant links (e.g., to the shared task website and paper), supported relations, citations, and code samples. An example is shown in Figure 3 for the TREC COVID dataset [84]. ir_datasets also includes several suites of automated tests to ensure that the package works as expected and that functionality does not regress as changes are made: unit tests, integration/regression tests, and tests that verify downloadable content remains available and unchanged.

The CLI makes ir_datasets easy to use with various tools (e.g., the PISA engine [61] can index using the document export). However, deeper integration can provide further functionality, as we demonstrate in this section with four tools. Note that ir_datasets does not depend on any of these tools; instead, they use ir_datasets.

Capreolus [93] is a toolkit for training and evaluating neural learning-to-rank models through Python and command line interfaces. In terms of data, it includes components for "collections" (sets of documents) and "benchmarks" (sets of queries and qrels). Though it has some built-in datasets, it also supports all datasets available from ir_datasets in its pipelines:

import capreolus as cap
collection, benchmark = cap.get_irds("pmc/v2/trec-cds-2016",
    fields=["abstract"], query_type="summary")
index = cap.AnseriniIndex({"stemmer": None}, collection)
index.create_index()
benchmark.qrels
benchmark.queries

PyTerrier [58] is a Python interface to the Terrier search engine [63] that enables the creation of flexible retrieval pipelines. It has a native dataset API, but it now also automatically adds all datasets from ir_datasets, expanding the number of available datasets. They can be accessed via the dataset ID with an irds: prefix and then used seamlessly with the rest of PyTerrier:

import pyterrier as pt
pt.init()
ds = pt.get_dataset('irds:cord19/trec-covid')
indexer = pt.index.IterDictIndexer('./cord19')
indexer.index(ds.get_corpus_iter(), fields=('abstract',))
topics = ds.get_topics(variant="description")
qrels = ds.get_qrels()
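To illustrate how the loaded topics and qrels fit into the rest of a PyTerrier pipeline, a possible continuation of the example above is sketched below. This continuation is our own illustration: it assumes the index reference returned by indexer.index() is captured, and it uses PyTerrier's standard BatchRetrieve and Experiment utilities.

# Sketch: capture the reference returned by indexer.index(...) above, then
# retrieve with BM25 and evaluate against the qrels loaded from ir_datasets.
index_ref = indexer.index(ds.get_corpus_iter(), fields=('abstract',))
bm25 = pt.BatchRetrieve(index_ref, wmodel="BM25")
print(pt.Experiment([bm25], topics, qrels, eval_metrics=["map", "ndcg_cut_10"]))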
OpenNIR [55] provides a command line neural reranking pipeline for several standard IR benchmarks. It supports ir_datasets for its training, validation, and testing dataset components. Queries and qrels are trivially fed into the training and validation processes. Documents are automatically indexed with Anserini for first-stage retrieval, and document lookups are used to fetch the text when training and scoring. Here is an example testing on the TREC COVID dataset:

$ scripts/pipeline.sh test_ds=irds test_ds.ds=cord19/trec-covid

Anserini [91] and its Python counterpart Pyserini [53] focus on reproducibility in IR. They provide a wrapper and a suite of tools around a Lucene index. As such, operations on datasets in these tools are tightly coupled with the Lucene and Anserini packages. Though they support a wide variety of queries and relevance assessments (distributed with the package), the support for document content is sparse, since only a few collections have automatically-downloadable indices; the remainder rely on manual instructions. Queries and qrels from ir_datasets can be used with Anserini via the export CLI (in TSV or TREC format). The CLI can also efficiently output documents in a format Anserini can index in parallel:

$ ir_datasets doc_fifos medline/2017
# To index with Anserini, run:
# IndexCollection -collection JsonCollection -input /tmp/tmp6sope5gr -threads 23 -index

DiffIR [46] is a tool that enables the visualization and qualitative comparison of search results. Using ir_datasets, it shows the textual content of the top results for queries and highlights model-specific impactful text spans.

We welcome (and encourage) community contributions. Extending ir_datasets as a separate package is straightforward, and we also welcome pull requests to the main package. To maintain quality in ir_datasets, we require consideration of ease-of-use, efficiency, data integrity, and documentation. We request that issues be opened before implementation to ensure proper consideration of these aspects. ir_datasets provides tools for handling typical data formats (e.g., TREC, TSV, CSV), making the process relatively straightforward; atypical formats likely require special processing. There are plenty of examples to help guide the contributor.

We envision ir_datasets enabling a variety of useful applications.

Training/evaluation in private settings. This tool could facilitate experiments and tasks that involve keeping data private. This is a realistic setting in several circumstances. For instance, a shared task involving searching through clinical notes would likely face challenges distributing the collection due to patient privacy concerns. Or a company may want to offer a shared task using a proprietary document collection or query log. In both cases, a version of ir_datasets could be built that provides this data and is only available in a secure environment (e.g., one where networking is disabled). Participants could feel confident that their code is processing the data correctly, given that it supports the ir_datasets API; their code can switch to this dataset simply by using the dataset ID of the private dataset.
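A brief sketch of what such a switch could look like (the private dataset ID, and the existence of a private build providing it, are hypothetical; only the ID passed to ir_datasets.load changes):

import ir_datasets

def run_experiment(dataset_id: str):
    # Experiment logic is written against the ir_datasets API only
    ds = ir_datasets.load(dataset_id)
    for query in ds.queries_iter():
        ...  # retrieval and evaluation logic unchanged

run_experiment("cord19/trec-covid")          # public benchmark
run_experiment("example-org/private-notes")  # hypothetical private dataset ID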
Dataset exploration GUI. Performing ad-hoc data analysis using ir_datasets is an improvement over prior approaches. The user experience could be further improved through a graphical user interface that facilitates common dataset exploration tasks. For instance, such a tool could graphically present the list of queries and link to the text of judged documents. Though this functionality is already easy through the Python and command line interfaces, a graphical interface would further reduce friction and ease exploration.

We presented ir_datasets, a tool that provides access to a variety of datasets and benchmarks for search engines. The tool automatically downloads and verifies content when possible, to aid reproducibility. Through its Python and command line interfaces, users can iterate over documents, queries, and relevance judgments, and perform lookups of documents by ID. The utility of these functionalities is demonstrated through integration with several tools for performing IR experiments. The ir_datasets catalog can help users discover datasets and acts as a hub of information with links and citations to relevant literature. We hope that ir_datasets reduces researcher burden, helps reduce redundant copies of datasets across toolkits, and enables the creation of new tools.

We thank those who contributed to issues or discussions about ir_datasets. This work was funded in part by the ARCS Foundation.

References
Cranfield collection
A collection of ready-to-use datasets
Overview of the TREC 2019 Decision Track
Million Query Track
Blagovest Dachev, and Evangelos Kanoulas. 2007. Million Query Track
Common Core Track Overview
Overview of Touché 2020: Argument Retrieval
A Full-Text Learning to Rank Dataset for Medical Information Retrieval
The TREC 2006 Terabyte Track
Million Query Track
Falk Scholer, and Ian Soboroff. 2005. The TREC 2005 Terabyte Track. In TREC
Overview of the TREC 2004 Terabyte Track
Overview of the TREC 2009 Web Track
Overview of the TREC 2010 Web Track
Nick Craswell, Ian Soboroff, and Ellen M. Voorhees. 2011. Overview of the TREC 2011 Web Track
Overview of the TREC 2012 Web Track
SPECTER: Document-level Representation Learning using Citation-informed Transformers
TREC 2013 Web Track Overview
TREC 2014 Web Track Overview
ORCAS: 18 Million Clicked Query-Document Pairs for Analyzing Search
Overview of the TREC-2002 Web Track
Overview of the TREC-2004 Web Track
Overview of the TREC 2003 Web Track
Overview of the TREC 2020 deep learning track
Overview of the TREC 2019 deep learning track
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TREC CAR: A Data Set for Complex Answer Retrieval
CLIMATE-FEVER: A Dataset for Verification of Real-World Climate Claims
MLWIKIR: A Python Toolkit for Building Large-scale Wikipedia-based Information Retrieval Datasets in Chinese
WIKIR: A Python toolkit for building a large-scale Wikipedia-based English Information Retrieval Dataset
The TREC-2001 Cross-Language Information Retrieval Track: Searching Arabic using English, French or Arabic Queries
The TREC-2002 Arabic/English CLIR Track
Arabic Newswire Part 1 LDC2001T55
Overview of the Third Text REtrieval Conference (TREC-3)
Overview of the Fourth Text REtrieval Conference (TREC-4)
ANTIQUE: A Non-Factoid Question Answering Benchmark
DBpedia-Entity v2: A Test Collection for Entity Search
CQADupStack: A Benchmark Data Set for Community Question-Answering Research
CodeSearchNet Challenge: Evaluating the State of Semantic Code Search
DiffIR: Exploring Differences in Ranking Models' Behavior
TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension
Dense Passage Retrieval for Open-Domain Question Answering
ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT
Natural Questions: a Benchmark for Question Answering Research. TACL
Overview of the TREC-2013 Microblog Track
Overview of the TREC-2014 Microblog Track
Ronak Pradeep, and Rodrigo Nogueira. 2021. Pyserini: An Easy-to-Use Python Toolkit to Support Replicable IR Research with Sparse and Dense Representations
Overview of the NTCIR-13 We Want Web Task
OpenNIR: A Complete Neural Ad-Hoc Ranking Pipeline
Efficient Document Re-Ranking for Transformers by Precomputing Term Representations
Content-Based Weak Supervision for Ad-Hoc Re-Ranking
Declarative Experimentation in Information Retrieval using PyTerrier
How Deep is your Learning: the DL-HARD Annotated Deep Learning Dataset
WWW'18 Open Challenge: Financial Opinion Mining and Question Answering
PISA: Performant Indexes and Search for Academia. In 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval
Overview of the NTCIR-14 We Want Web Task
Terrier information retrieval platform
Task Overview: The IR Task at the eHealth Evaluation Lab - Evaluating Retrieval Methods for Consumer Health Search
The pandas development team. 2020. pandas-dev/pandas: Pandas
MS MARCO: A Human Generated MAchine Reading COmprehension Dataset
TripClick: The Log Files of a Large Health Web Search Engine
Overview of the TREC 2018 Precision Medicine Track
Overview of the TREC 2017 Precision Medicine Track
Shubham Pant, and Funda Meric-Bernstam. 2019. Overview of the TREC 2019 Precision Medicine Track
Overview of the TREC 2016 Clinical Decision Support Track
Overview of the TREC 2015 Clinical Decision Support Track
TREC Mandarin LDC2000T52
TREC Spanish LDC2000T51
The New York Times Annotated Corpus. Linguistic Data Consortium
Finally, a Downloadable Test Collection of Tweets
Overview of the TREC 2014 Clinical Decision Support Track
Spanish and Chinese Document Retrieval in TREC-5
TREC 2018 News Track Overview. In TREC
TREC 2019 News Track Overview. In TREC
Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. 2021. BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models
FEVER: a Large-scale Dataset for Fact Extraction and VERification
Overview of the TREC 2004 Robust Retrieval Track
TREC-COVID: Constructing a Pandemic Information Retrieval Test Collection
Overview of the TREC 2005 Robust Retrieval Track
Retrieval of the Best Counterargument without Prior Topic Knowledge
Fact or Fiction: Verifying Scientific Claims
Chinese Document Retrieval at TREC-6
Anserini: Enabling the Use of Lucene for Information Retrieval Research
HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering
Flexible IR pipelines with Capreolus
The IR Task at the CLEF eHealth Evaluation Lab 2016: User-centred Health Information Retrieval