key: cord-0819256-jgajuivu
authors: Chen, J.; Hersh, W.
title: A Comparative Analysis of System Features Used in the TREC-COVID Information Retrieval Challenge
date: 2020-10-20
journal: nan
DOI: 10.1101/2020.10.15.20213645
sha: f76f36e9274de7a935cfcdd44b326ffcfe95b056
doc_id: 819256
cord_uid: jgajuivu

The COVID-19 pandemic has resulted in a rapidly growing quantity of scientific publications from journal articles, preprints, and other sources. The TREC-COVID Challenge was created to evaluate information retrieval (IR) methods and systems for this quickly expanding corpus. Based on the COVID-19 Open Research Dataset (CORD-19), several dozen research teams participated across the 5 rounds of the TREC-COVID Challenge. While previous work has compared IR techniques used on other test collections, no studies have analyzed the methods used by participants in the TREC-COVID Challenge. We manually reviewed team run reports from Rounds 2 and 5, extracted features from the documented methodologies, and used a univariate and multivariate regression-based analysis to identify features associated with higher retrieval performance. We observed that fine-tuning ranking systems on datasets with relevance judgments, MS-MARCO, and CORD-19 document vectors was associated with improved performance in Round 2 but not in Round 5. Though the reduced heterogeneity of runs in Round 5 may explain the lack of significance in that round, fine-tuning has been found to improve search performance in previous challenge evaluations by improving a system's ability to map relevant queries and phrases to documents. Furthermore, term expansion was associated with improved system performance, and use of the narrative field in the TREC-COVID topics was associated with decreased system performance in both rounds. These findings emphasize the need for clear queries in search. While our study has some limitations in its generalizability and in the scope of techniques analyzed, we identified IR techniques that may be useful in building search systems for COVID-19 using the TREC-COVID test collections.

Since the World Health Organization declared Coronavirus Disease 2019 (COVID-19) a public health emergency,1 there has been explosive growth in scientific knowledge about this novel virus. Consequently, the use of preprints and fast-track publication policies has resulted in a significant increase in the number of COVID-19-related publications over a short period of time.2,3 Information retrieval (IR, also known as search) systems are the tools usually employed to manage access to large corpora of literature.4 The efficacy of IR systems is often assessed in challenge evaluations that provide reusable test collections, such as those led by the National Institute of Standards and Technology (NIST) in the Text REtrieval Conference (TREC).5 To address the need for system evaluation in this rapidly changing information environment, NIST sponsored the TREC-COVID Challenge.6 As with most IR challenge evaluations, test collections of documents, topics for searching, and relevance judgments were developed.7 The document collections were derived from snapshots of the COVID-19 Open Research Dataset (CORD-19), a regularly updated dataset of coronavirus-related research manuscripts gathered from various sources including journal articles, PubMed references, arXiv, medRxiv, and bioRxiv.3
Based on the TREC framework, the TREC-COVID Challenge tasked researchers with evaluating IR systems over 5 rounds, retrieving manuscripts relevant to topics about COVID-19 from CORD-19, with the goal of building a reusable test collection for further research.8 Every 2-3 weeks, researchers implemented and evaluated retrieval systems with an updated CORD-19 dataset and the addition of new search topics.9 While previous work exists examining techniques used in other IR test collections,10-12 there have not yet been any studies comparing the methods and systems used by participants in the TREC-COVID Challenge. The purpose of this study was to compare the performance of different approaches used in the TREC-COVID Challenge by: 1) developing a taxonomy to characterize the IR techniques and system characteristics used, and 2) applying this taxonomy to identify features of IR systems associated with higher performance. Using run reports from Round 2 and Round 5, we designed a taxonomy and evaluated its features using a univariate and multivariate regression analysis. In this study, we assessed how certain methodologies were associated with higher retrieval performance and discussed the implications and limitations of our analysis.

The TREC-COVID Challenge6 occurred over 5 rounds in 2020 on the rapidly growing CORD-19 dataset,3 with 30 initial topics in Round 1 and 5 new topics added in each subsequent round. Each topic consisted of three fields: (1) a short query statement that a user might enter, (2) a longer question field more thoroughly expressing the information need of the topic, and (3) a narrative field providing additional context for the information need. Run reports were manually reviewed by J.C. We chose to review reports from Rounds 2 and 5 because we wanted to compare methodologies used in two different rounds in which feedback methods drawing on topics from previous rounds were available. Each run report was written as a textual description of the methodology used to produce the run, in whatever detail the submitting team provided. An example run report is shown in Figure 1. When submitting a run, participants were encouraged to provide a methodological description of each submitted run. This run description, along with the run ID, topic types, and performance metrics, was reported in a publicly available repository of archived results (https://ir.nist.gov/covidSubmit/archive.html). These run reports were manually reviewed for features.

The following features were extracted for each run in the reports from Round 2 and Round 5:
- Text used (e.g., title and abstract only, paragraph-based indices, or full text)
- Type of query (i.e., any combination of the query, question, and narrative from the TREC-COVID topic fields)
- Any query pre-processing (e.g., stemming, removing stop words)
- Query term expansion (addition of terms not originally provided in each topic)
- Manual review methods (i.e., human interventions, including the use of human assessors in Continuous Active Learning)13
- Any weighted ranking system used (i.e., non-neural scoring functions such as BM25 14 and term frequency-inverse document frequency, or TF-IDF 15)
- Any ranking model that used a neural architecture (including deep transformer models such as BERT,16 SciBERT,17 and T5,18 as well as DeepRank,19 a neural network that attempts to simulate how humans make relevance judgments)
- Other techniques (machine learning models such as SVM and logistic regression, custom scoring functions such as term proximity scores, and custom search methods such as ReQ-ReC,20 a double-loop retrieval system)
- Dataset used to fine-tune any system (i.e., MS-MARCO, a large dataset of annotated documents based on 100,000 Bing queries,21 the CORD-19 dataset transformed into document vectors, or relevance judgments from previous rounds)
- Fusion of multiple runs into a single run (including use of reciprocal rank fusion22 and COMB fusion methods23)
- Re-ranking, defined as whether a second system (most commonly a neural network) was used to refine an initial scoring system
- Pseudo-relevance feedback, or system-generated relevance feedback based on an initial query
- Whether and how human-generated relevance feedback (relevance judgments) from previous round(s) was used
- Filtering of runs by date; removing documents published before 2020 (when the pandemic began to gain widespread notice) had previously been suggested by MacAvaney et al. in the post-hoc analysis of their neural re-ranking system as a possible method to improve performance24

These features were selected to encompass a broad set of commonly used techniques in ad hoc retrieval. Of note, TREC challenges typically do not occur over multiple rounds; thus, the availability of relevance judgments between rounds was a novel aspect of the TREC-COVID Challenge. Since the length of reports varied at each researcher's discretion, many reports likely had some number of missing features. To minimize the impact of these null features, we assumed that runs that did not provide information about the type of text or query used in their system likely searched the full text using the query subfield from each topic. Features extracted from the reports were characterized either as binary features (used or not used in the system) or as categorical features (a description of the feature for that system, e.g., BM25 as the weighted system used). Categorical features were later one-hot encoded, or converted into binary features over multiple columns, prior to input into our regression analysis. The extracted features and their encoding are shown in Table 2.
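To make the encoding step concrete, the following is a minimal R sketch of one-hot encoding a categorical run feature into binary indicator columns. The data frame, column names, and values are hypothetical stand-ins for the features in Table 2, not the actual extracted data.

```r
# Hypothetical extracted run features: "weighted_system" is categorical,
# "term_expansion" is already binary, and "ndcg_at_k" stands in for one of
# the reported performance metrics.
runs <- data.frame(
  run_id          = c("run_a", "run_b", "run_c"),
  weighted_system = c("BM25", "TFIDF", "BM25"),
  term_expansion  = c(1, 0, 1),
  ndcg_at_k       = c(0.52, 0.38, 0.61),
  stringsAsFactors = FALSE
)

# model.matrix() expands the categorical column into one 0/1 indicator column
# per level (one-hot encoding); the "-1" drops the intercept column.
onehot   <- model.matrix(~ weighted_system - 1, data = runs)
features <- cbind(runs[, c("term_expansion", "ndcg_at_k")], onehot)
features
```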
We included all runs that contained more than one extracted feature, to ensure a reasonably large and useful dataset for analysis. We excluded unusually poorly performing runs that likely represented poor system or method implementations; the exclusion threshold was defined as an average performance of less than 0.2 across all 5 performance metrics used to evaluate run performance in the TREC-COVID Challenge.

Performance metrics reported by NIST and used in our analysis included precision at K documents (P@K), normalized discounted cumulative gain at K documents (NDCG@K), rank-biased precision with persistence p = 0.5 (RBP), binary preference (bpref), and mean average precision (MAP). In Round 2, the depths of documents for P@K and NDCG@K were 5 and 10, respectively, while in Round 5 the depth for both P@K and NDCG@K was 20. These changes were made out of concern for inflated performance when evaluating precision on a small number of documents.6 For each run, these performance metrics were computed as the mean performance across all topics in the round.

All data analysis and pre-processing was performed in R (version 4.0.2)25 using the glmnet package.26 For each round, 5 univariate linear regression analyses were performed, using the extracted features as independent variables and each of the 5 performance metrics (NDCG@K, P@K, RBP, bpref, and MAP) in turn as the dependent variable. Coefficients and standard errors were calculated for each feature, and p-values were extracted for each feature coefficient, with significance defined as p < 0.05. Features that met the threshold for significance in the univariate regression were subsequently input into a multivariate linear regression. Overall, positive coefficients were interpreted as being associated with higher performance. Therefore, features that remained significant after both univariate and multivariate regression were likely associated with high performance in the TREC-COVID Challenge.
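As an illustration of this two-stage screening, the following is a minimal R sketch using base lm(), assuming the univariate step fits one feature at a time and that significant features are then entered together into a single multivariate model. The simulated data frame, feature names, and effect sizes are hypothetical and are not taken from the study; the authors' actual analysis used the glmnet package.

```r
# Simulated stand-in for the encoded run features and one performance metric.
set.seed(1)
n <- 40  # roughly the scale of runs analyzed in a round
features <- data.frame(
  term_expansion   = rbinom(n, 1, 0.5),
  used_narrative   = rbinom(n, 1, 0.4),
  finetune_msmarco = rbinom(n, 1, 0.3)
)
features$ndcg_at_k <- 0.40 + 0.10 * features$term_expansion -
  0.08 * features$used_narrative + rnorm(n, sd = 0.05)

feature_cols <- setdiff(colnames(features), "ndcg_at_k")

# Stage 1: one univariate linear regression per feature; keep each coefficient's p-value.
univariate_p <- sapply(feature_cols, function(f) {
  fit <- lm(reformulate(f, response = "ndcg_at_k"), data = features)
  summary(fit)$coefficients[2, "Pr(>|t|)"]
})

# Stage 2: features significant at p < 0.05 are entered into one multivariate model;
# positive, significant coefficients are read as associated with higher performance.
sig <- names(univariate_p)[univariate_p < 0.05]
if (length(sig) > 0) {
  multivariate <- lm(reformulate(sig, response = "ndcg_at_k"), data = features)
  print(summary(multivariate))
}
```

In practice this procedure would be repeated for each of the 5 performance metrics and for each round.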
The proportions of manual, feedback (defined as using judgments from prior rounds), and automatic (defined as neither feedback nor manual) runs varied between rounds. In Round 2, the majority of runs were categorized as automatic runs; in Round 5, the majority of runs were categorized as feedback runs. These findings are summarized in Table 3.

Significant features from the 5 univariate regressions for each of Round 2 and Round 5 are shown in Figure 2 and varied depending on the performance metric used. In Round 2, query term expansion (n = 37 runs) and fine-tuning of ranking systems on MS-MARCO (n = 18 runs), Round 1 judgments (n = 9 runs), or document vectors formed from the CORD-19 dataset (n = 9 runs) were associated with higher performance across most, if not all, performance metrics. Use of ReQ-ReC (n = 3 runs, submitted by 1 team) and of narrative text in the query (n = 28 runs) were associated with decreased performance across the majority of performance metrics. In Round 5, use of the question text in the query (n = 32 runs) and TF-IDF vectors (n = 14 runs) were associated with increased performance, whereas use of neural networks, narrative text in the query (n = 67 runs), and a proximity score (n = 2 runs) were associated with decreased performance across all performance metrics.

Figure 2. Univariate analysis was performed on features extracted from reports from Rounds 2 and 5. Features that were significant after input into a univariate linear regression are shown for the following performance metrics from Rounds 2 and 5, respectively: binary preference (A and F), mean average precision (B and G), normalized discounted cumulative gain (C and H), precision @ K documents (D and I), and rank-biased precision (E and J). The count, or number of times the feature occurred in our extracted dataset, is displayed adjacent to the feature name. These significant features were subsequently input into a multivariate regression to determine which features were independently associated with performance.

Significant features from the multivariate regressions on the 5 performance metrics in Rounds 2 and 5 are shown in Figure 3. After features found to be significant on univariate regression were input into a multivariate regression, the following features remained significantly associated with increased performance across the majority of performance metrics in Round 2: term expansion (n = 37) and ranking system fine-tuning on CORD-19 vectors (n = 9), MS-MARCO (n = 18), or Round 1 judgments (n = 9). Using ReQ-ReC (n = 3) remained significantly associated with decreased performance. In Round 5, using the question text to formulate the query (n = 60) and TF-IDF vector weighting (n = 14) were associated with increased performance, while a custom proximity score (n = 2) as a scoring function was associated with decreased performance. As seen in Round 2, using feedback in Round 5 (n = 59) was associated with increased performance when runs were evaluated on RBP.

Figure 3. Features that were found to be significant in univariate regression were further input into a second, multivariate regression. Significant features are reported for the following performance metrics in Rounds 2 and 5, respectively: binary preference (A and F), mean average precision (B and G), normalized discounted cumulative gain (C and H), precision @ K documents (D and I), and rank-biased precision (E and J). Depending on the coefficients, these features were concluded to be significantly associated with increased or decreased performance in the TREC-COVID Challenge.
This study aimed to develop a taxonomy of features to evaluate techniques associated with higher performance in runs submitted to Rounds 2 and 5 of the TREC-COVID Challenge. The key findings were: 1) fine-tuning ranking systems using relevance judgments resulted in significant improvement in performance, particularly in Round 2, and 2) query formulation is an important component of successful search.

Our first key finding was that fine-tuning ranking systems using relevance judgments resulted in significant improvement in performance, particularly in Round 2. Unlike previous TREC challenges, the rapid turnaround of relevance judgments over multiple rounds created opportunities to fine-tune ranking systems for improved performance. Many of the runs labelled as feedback runs (n = 41 in Round 2 and n = 65 in Round 5) employed fine-tuning, though only a small portion specifically used the relevance judgments to fine-tune their ranking systems. Teams that fine-tuned on similar datasets, including the vectorized CORD-19 dataset (represented as Dataset.for-FineTuning_CORD19) and MS-MARCO, also achieved comparable improvements relative to systems that did not fine-tune on these specific datasets. The improvement from fine-tuning on an annotated dataset has been reported in other TREC challenges, most notably the use of MS-MARCO by Nogueira et al. to refine a neural network system that vastly outperformed other runs in the TREC CAR challenge.27 Interestingly, the benefits of fine-tuning did not persist into Round 5 (with the exception of evaluation on RBP), despite more prevalent use of neural systems and feedback runs. Given that the TREC-COVID Challenge brought together research teams with varying experience in IR challenge evaluations, along with the short time between rounds (2-3 weeks), the absence of significance for fine-tuning on previous-round judgments may be explained by implementation differences between teams, as many teams implemented variations of the popular sequence of an initial weighted system (most commonly BM25) followed by a neural re-ranker (e.g., BERT, with or without fine-tuning on MS-MARCO or previous relevance judgments).28,29 However, since we one-hot encoded other techniques, our linear regressions may have overrepresented individual techniques that few teams used (including ReQ-ReC in Round 2 and proximity scoring in Round 5). Future work may be needed to validate the performance of other techniques compared with the standard weighted and neural pipelines. Furthermore, since we chose to treat neural networks as a binary variable, there may be opportunities to explore how different architectures influence performance in the TREC-COVID Challenge.
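To make the first stage of that common weighted-then-rerank pipeline concrete, the sketch below shows a toy BM25 scorer in R over hypothetical documents and a hypothetical query. The parameter values (k1 = 0.9, b = 0.4) are common defaults rather than settings reported by any TREC-COVID team, and the neural re-ranking stage that teams typically applied to the top-ranked documents is omitted.

```r
# Toy corpus and query; both are hypothetical.
docs <- list(
  d1 = c("coronavirus", "transmission", "in", "schools"),
  d2 = c("remdesivir", "clinical", "trial", "results"),
  d3 = c("coronavirus", "vaccine", "antibody", "response")
)
query <- c("coronavirus", "vaccine")

N       <- length(docs)
doc_len <- sapply(docs, length)
avg_len <- mean(doc_len)
# Document frequency of each query term across the toy corpus.
df <- sapply(query, function(t) sum(sapply(docs, function(d) t %in% d)))

# BM25 score of one document for the query defined above (uses query, N, df,
# avg_len from the enclosing environment).
bm25 <- function(doc, k1 = 0.9, b = 0.4) {
  sum(sapply(query, function(t) {
    tf <- sum(doc == t)
    if (tf == 0) return(0)
    idf <- log((N - df[[t]] + 0.5) / (df[[t]] + 0.5) + 1)
    idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * length(doc) / avg_len))
  }))
}

scores <- sapply(docs, bm25)
sort(scores, decreasing = TRUE)  # initial ranking; a neural model would re-rank the top k
```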
Our second key finding was that query formulation was an important component of successful search. While most teams used the query and question fields in formulating an input query, several teams (n = 28 and 32 in Rounds 2 and 5, respectively) chose to use the narrative portion of the topic, which was associated with decreased performance in both rounds. Because the narrative contained freehand descriptions qualifying each topic, these descriptive fields were noisy. For example, topics 33 and 34 contained the phrase "excluding…," with subsequent wording describing what not to search. Furthermore, vocabulary used in the topics designed by the TREC-COVID organizers was not consistently used in the manuscripts included in the CORD-19 dataset (e.g., differences in how COVID-19 was named: SARS-CoV-2, coronavirus), which may have adversely affected search performance for teams that did not expand their queries to include such terms. In fact, many of the successful runs from teams that built on baseline runs from Anserini30 employed a query preprocessing tool produced by the University of Delaware that used SciSpacy31 to lemmatize query terms and remove stop words. Anserini-generated runs comparing standard combinations of topic fields with and without this "Udel" method consistently showed improvements in the number of relevant documents retrieved, regardless of which topic fields were used to construct the query and which indices were used.32 This approach was taken further either manually by certain teams (e.g., OHSU) or automatically, as seen in initial iterations of Covidex.33 Adapting queries to better match document representations, or to minimize query-document mismatch, has long been researched and includes work using relevance judgments34 and query expansion.35 Novel methods have focused on the reverse: adjusting document representations to better match queries. For example, Doc2Query36 was employed, most commonly in Round 5, by one team, though this technique was not significantly associated with high performance in our study. However, the team that incorporated this technique submitted runs that varied widely in performance and may have used other features not found to be significant in our taxonomy. These findings underscore the importance of defining relevant terms in the query.

This study had several limitations that future work could address. First, the methodological descriptions in the run reports varied in their level of detail. As such, the data used for this study were only as complete as what was provided in the reports. This not only presented a challenge to building our taxonomy, but also meant that important features may not have been (and likely were not) reported. In the future, teams should document their methodologies in ways that promote reproducibility, or publish their results in reports as is done in the regular TREC challenges. Second, it was difficult to capture run-specific differences between runs submitted by the same team, as team-specific features were often not provided. This had important implications for runs submitted in Round 5, where teams were allowed to submit up to 8 runs. While many runs submitted by the same team were largely similar (and often performed similarly), our methodology was not well suited to capturing nuances such as hyperparameter tuning that likely represented small adjustments to otherwise similar methods and pipelines. We sought to characterize runs broadly, rather than capture each individual technique and adjustment in each run, since features built around individual techniques were subject to bias.
However, to balance granularity against breadth of techniques, we attempted to take into account differences between runs (even runs from the same team) using a one-hot encoded column of other techniques that we judged unique enough to warrant specific inclusion. Future directions for this work may include identifying how best to capture adjustments between runs that use similar techniques yet result in different performance. Third, our study was retrospective and limited in scope. While we studied techniques associated with performance on the CORD-19 dataset, these techniques may not be generalizable to other test collections. This was reflected in our work, where our regressions tended to overfit the performance of teams that used unique methodologies (i.e., associating a feature with significantly low or high performance despite a low number of teams employing it, such as ReQ-ReC20 or the proximity score). As mentioned above, implementation may have had a role in this "significantly" poor performance. Future work will be needed to prospectively evaluate these unique techniques across different developers and users.

Using multivariate regression analysis, we developed and evaluated a taxonomy of features of IR systems associated with high performance in the TREC-COVID Challenge. While our multivariate analysis demonstrates the utility of relevance feedback and the need for well-defined queries, it remains unclear which broad methodologies are associated with high performance in the TREC-COVID test collection. While our study has limitations in generating specific, prospective generalizations about IR systems and techniques, our work broadly showcases general techniques that may be useful in building search systems for COVID-19 and serves as a springboard for future work on TREC-COVID and related test collections.
References
- Emergency Committee regarding the outbreak of novel coronavirus (2019-nCoV).
- Pandemic Publishing Poses a New COVID-19 Challenge.
- CORD-19: The COVID-19 Open Research Dataset. arXiv:2004.10706 [cs].
- Information Retrieval: A Biomedical and Health Perspective.
- TREC: Experiment and Evaluation in Information Retrieval.
- TREC-COVID: Rationale and Structure of an Information Retrieval Shared Task for COVID-19.
- TREC-COVID: Constructing a Pandemic Information Retrieval Test Collection. arXiv:2005.04474 [cs].
- Searching for Answers in a Pandemic: An Overview of TREC-COVID. Submitted to Journal of Biomedical Informatics COVID-19 special issue.
- Factors associated with success in searching MEDLINE and applying evidence to answer clinical questions.
- State-of-the-art in biomedical literature retrieval for clinical cases: a survey of the TREC 2014 CDS track.
- A comparative analysis of retrieval features used in the TREC 2006 Genomics Track passage retrieval task.
- Autonomy and Reliability of Continuous Active Learning for Technology-Assisted Review. arXiv:1504.06868 [cs].
- Okapi at TREC-5.
- Data Mining. In: Mining of Massive Datasets.
- Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805 [cs].
- SciBERT: A Pretrained Language Model for Scientific Text. arXiv:1903.10676 [cs].
- Rapidly Bootstrapping a Question Answering Dataset for COVID-19. arXiv:2004.11339 [cs].
- DeepRank: A New Deep Architecture for Relevance Ranking in Information Retrieval.
- ReQ-ReC: High Recall Retrieval with Query Pooling and Interactive Classification.
- A Human Generated MAchine Reading COmprehension Dataset. arXiv:1611.09268 [cs].
- Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods.
- Combination of Multiple Searches.
- SLEDGE: A Simple Yet Effective Baseline for Coronavirus Scientific Knowledge Search. arXiv:2005.02365 [cs].
- R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing.
- Regularization Paths for Generalized Linear Models via Coordinate Descent.
- Passage Re-ranking with BERT. arXiv:1901.04085 [cs].
- An Introduction to Neural Information Retrieval. Found Trends Inf Retr.
- Neural Ranking Models with Weak Supervision. arXiv:1704.08803 [cs].
- Enabling the Use of Lucene for Information Retrieval Research.
- Fast and Robust Models for Biomedical Natural Language Processing.
- A Lucene toolkit for replicable information retrieval research.
- Neural Ranking Models and Keyword Search Infrastructure for the COVID-19 Open Research Dataset. arXiv:2007.07846 [cs].
- The Smart Retrieval System - Experiments in Automatic Document Processing.
- Query Expansion Using Lexical-Semantic Relations.
- Document Expansion by Query Prediction. arXiv:1904.08375 [cs].