key: cord-0583475-rlc2f6zw authors: Zhang, Edwin; Gupta, Nikhil; Tang, Raphael; Han, Xiao; Pradeep, Ronak; Lu, Kuang; Zhang, Yue; Nogueira, Rodrigo; Cho, Kyunghyun; Fang, Hui; Lin, Jimmy title: Covidex: Neural Ranking Models and Keyword Search Infrastructure for the COVID-19 Open Research Dataset date: 2020-07-14 journal: nan DOI: nan sha: e3e36944102c9baee49dfec397fc0e4bb63a7c77 doc_id: 583475 cord_uid: rlc2f6zw We present Covidex, a search engine that exploits the latest neural ranking models to provide information access to the COVID-19 Open Research Dataset curated by the Allen Institute for AI. Our system has been online and serving users since late March 2020. The Covidex is the user application component of our three-pronged strategy to develop technologies for helping domain experts tackle the ongoing global pandemic. In addition, we provide robust and easy-to-use keyword search infrastructure that exploits mature fusion-based methods as well as standalone neural ranking models that can be incorporated into other applications. These techniques have been evaluated in the ongoing TREC-COVID challenge: Our infrastructure and baselines have been adopted by many participants, including some of the highest-scoring runs in rounds 1, 2, and 3. In round 3, we report the highest-scoring run that takes advantage of previous training data and the second-highest fully automatic run. As a response to the worldwide COVID-19 pandemic, on March 13, 2020, the Allen Institute for AI (AI2) released the COVID-19 Open Research Dataset . 1 With regular updates since the initial release (first weekly, then daily), the corpus contains around 188,000 scientific articles (as of July 12, 2020), including most with full text, about COVID-19 and coronavirus-related research more broadly (for example, SARS and MERS). These articles are gathered from a variety of sources, including PubMed, a curated list of articles from the WHO, as well as preprints from arXiv, bioRxiv, and medRxiv. The goal of the effort is "to mobilize researchers to apply recent advances 1 www.semanticscholar.org/cord19 in natural language processing to generate new insights in support of the fight against this infectious disease." We responded to this call to arms. As motivation, we believe that information access capabilities (search, question answering, etc.) can be applied to provide users with high-quality information from the scientific literature, to inform evidence-based decision making and to support insight generation. Examples include public health officials assessing the efficacy of wearing face masks, clinicians conducting meta-analyses to update care guidelines based on emerging studies, and virologist probing the genetic structure of COVID-19 in search of vaccines. We hope to contribute to these efforts via a three-pronged strategy: 1. Despite significant advances in the application of neural architectures to text ranking, keyword search (e.g., with "bag of words" queries) remains an important core technology. Building on top of our Anserini IR toolkit (Yang et al., 2018) , we have released robust and easy-to-use open-source keyword search infrastructure that the broader community can build on. 2. Leveraging our own infrastructure, we explored the use of sequence-to-sequence transformer models for text ranking, combined with a simple classification-based feedback approach to exploit existing relevance judgments. We have also open sourced all these models, which can be integrated into other systems. 3. 
3. Finally, we package the previous two components into Covidex, an end-to-end search engine and browsing interface deployed at covidex.ai, initially described in Zhang et al. (2020a).

All three efforts have been successful. In the ongoing TREC-COVID challenge, our infrastructure and baselines have been adopted by many teams, which in some cases have submitted runs that scored higher than our own submissions. This illustrates the success of our infrastructure-building efforts (1). In the latest round 3 results, we report the highest-scoring run that exploits relevance judgments in a user feedback setting and the second-highest fully automatic run, affirming the quality of our own ranking models (2). Finally, usage statistics offer some evidence for the success of our deployed Covidex search engine (3).

Multi-stage search architectures represent the most common design for modern search engines, with work in academia dating back over a decade (Matveeva et al., 2006; Wang et al., 2011; Asadi and Lin, 2013). Known production deployments of this design include the Bing web search engine (Pedersen, 2010) as well as Alibaba's e-commerce search engine (Liu et al., 2017). The idea behind multi-stage ranking is straightforward: instead of a monolithic ranker, ranking is decomposed into a series of stages. Typically, the pipeline begins with an initial retrieval stage, most often using bag-of-words queries against an inverted index. One or more subsequent stages rerank and refine the candidate set successively until the final results are presented to the user.

The multi-stage design provides a clean interface between keyword search, neural reranking models, and the user application. This section details the individual components in our architecture. We describe later how these building blocks are assembled in the deployed system (Section 3) and for TREC-COVID (Section 4.2).

In our design, initial retrieval is performed by the Anserini IR toolkit (Yang et al., 2017, 2018), available at anserini.io, which we have been developing for several years and which powers a number of our previous systems that incorporate various neural architectures (Yilmaz et al., 2019). Anserini represents an effort to better align real-world search applications with academic information retrieval research: under the covers, it builds on the popular and widely-deployed open-source Lucene search library, on top of which we provide a number of missing features for conducting research on modern IR test collections.

Anserini provides an abstraction for document collections, and comes with a variety of adaptors for different corpora and formats: web pages in WARC containers, XML documents in tarballs, JSON objects in text files, etc. Providing keyword search capabilities over CORD-19 required only writing an adaptor for the corpus that allows Anserini to ingest the documents.

An issue that immediately arose with CORD-19 concerns the granularity of indexing, i.e., what should we consider to be a "document", the "atomic unit" of indexing and retrieval? One complication is that the corpus contains a mix of articles that vary widely in length, not only in terms of natural variation (scientific articles of varying lengths, book chapters, etc.), but also because the full text is not available for some articles. It is well known in the IR literature, dating back several decades (e.g., Singhal et al., 1996), that length normalization plays an important role in retrieval effectiveness.
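To make the ingestion step above concrete, the sketch below flattens CORD-19 metadata records into the kind of simple JSON documents that Anserini (via Pyserini) can index as a JsonCollection. It is a minimal illustration rather than our actual adaptor: the field names follow the public metadata.csv layout (cord_uid, title, abstract), and the file paths are assumptions.

```python
import csv
import json

# Flatten CORD-19 metadata into one JSON document per article, keyed by cord_uid,
# with the title and abstract concatenated as the searchable "contents" field.
with open("cord19/metadata.csv", newline="", encoding="utf-8") as f_in, \
     open("collection/docs.jsonl", "w", encoding="utf-8") as f_out:
    for row in csv.DictReader(f_in):
        doc = {
            "id": row["cord_uid"],
            "contents": f'{row["title"]}\n{row["abstract"]}',
        }
        f_out.write(json.dumps(doc) + "\n")

# The resulting file can then be indexed with Anserini's (or Pyserini's)
# JsonCollection indexer to produce the title-and-abstract index described below.
```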
Guided by previous work on searching full-text articles (Lin, 2009), we explored three separate indexing schemes:

• An index comprised of only titles and abstracts.
• An index comprised of each full-text article as a single, individual document; articles without full text contained only titles and abstracts.
• A paragraph-level index structured as follows: each full-text article is segmented into paragraphs, and for each paragraph we created a "document" comprising the title, abstract, and that paragraph. The title and abstract alone comprised an additional "document". Thus, a full-text article with n paragraphs yields n + 1 separate retrieval units in the index.

To be consistent with standard IR parlance, we call each of these retrieval units a document, in a generic sense, despite their composite structure. Following best practice, documents are ranked using BM25 (Robertson et al., 1994). The relative effectiveness of each indexing scheme, however, is an empirical question. With the paragraph index, a query is likely to retrieve multiple paragraphs from the same underlying article; since the final task is to rank articles, we take the highest-scoring paragraph across all retrieved results to produce the final ranking. Furthermore, we can combine these multiple representations to capture different ranking signals using fusion techniques, which further improves effectiveness; see Section 4.2 for details.

Since Anserini is built on top of Lucene, which is implemented in Java, it is designed to run on the Java Virtual Machine (JVM). However, TensorFlow (Abadi et al., 2016) and PyTorch (Paszke et al., 2019), the two most popular neural network toolkits today, use Python as their main language. More broadly, with its diverse and mature ecosystem, Python has emerged as the language of choice for most data scientists today. Anticipating this gap, we have been working on Pyserini (pyserini.io), Python bindings for Anserini, since late 2019 (Yilmaz et al., 2020). Pyserini is released as a well-documented, easy-to-use Python module distributed via PyPI and easily installable via pip (pypi.org/project/pyserini/).

Putting everything together, we provide the community with keyword search infrastructure by sharing code, indexes, as well as baseline runs. First, all our code is available open source. Second, we share regularly updated pre-built versions of CORD-19 indexes, so that users can replicate our results with minimal effort. Finally, we provide baseline runs for TREC-COVID that can be directly incorporated into other participants' submissions.
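A minimal sketch of first-stage retrieval with this infrastructure follows. The index path and the "<article_id>.<paragraph>" docid convention are assumptions for illustration, not guaranteed details of the released indexes; the snippet retrieves BM25 candidates from a paragraph-level index and collapses them to article scores by taking each article's highest-scoring paragraph, as described above.

```python
from collections import defaultdict

from pyserini.search import SimpleSearcher  # named LuceneSearcher in newer Pyserini releases

# Index path and the "<article_id>.<paragraph>" docid convention are assumptions.
searcher = SimpleSearcher("indexes/cord19-paragraph")
hits = searcher.search("incubation period of COVID-19", k=1000)  # BM25 by default

# Collapse paragraph-level hits into article-level scores by keeping, for each
# article, the score of its highest-scoring paragraph.
article_scores = defaultdict(float)
for hit in hits:
    article_id = hit.docid.split(".")[0]
    article_scores[article_id] = max(article_scores[article_id], hit.score)

ranking = sorted(article_scores.items(), key=lambda kv: kv[1], reverse=True)
for docid, score in ranking[:10]:
    print(f"{docid}\t{score:.4f}")
```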
In our infrastructure, the output of Pyserini is fed to rerankers that aim to improve ranking quality. We describe three different approaches: two are based on neural architectures, and the third exploits relevance judgments in a feedback setting using a classification approach.

monoT5. Despite the success of BERT for document ranking (Dai and Callan, 2019; MacAvaney et al., 2019; Yilmaz et al., 2019), there is evidence that ranking with sequence-to-sequence models can achieve even better effectiveness, particularly in zero-shot and other settings with limited training data (Nogueira et al., 2020), such as for TREC-COVID. Our "base" reranker, called monoT5, is based on T5 (Raffel et al., 2019). Given a query q and a set of candidate documents D from Pyserini, for each d ∈ D we construct the following input sequence to feed into our model:

Query: q Document: d Relevant: (1)

The model is fine-tuned to produce either "true" or "false" depending on whether the document is relevant or not to the query. That is, "true" and "false" are the ground-truth predictions in the sequence-to-sequence task, what we call the "target words". At inference time, to compute probabilities for each query-document pair, we apply softmax only to the logits of the "true" and "false" tokens. We rerank the candidate documents according to the probabilities assigned to the "true" token. See Nogueira et al. (2020) for additional details about this logit normalization trick and the effects of different target words.

Since in the beginning we did not have training data specific to COVID-19, we fine-tuned our model on the MS MARCO passage dataset (Nguyen et al., 2016), which comprises 8.8M passages obtained from the top 10 results retrieved by the Bing search engine (based on around 1M queries). The training set contains approximately 500k pairs of queries and relevant documents, where each query has one relevant passage on average; non-relevant documents for training are also provided as part of the training data. Nogueira et al. (2020) and Yilmaz et al. (2019) have both previously demonstrated that models trained on MS MARCO can be directly applied to other document ranking tasks.

We fine-tuned our monoT5 model with a constant learning rate of 10^-3 for 10k iterations with class-balanced batches of size 128. We used a maximum of 512 input tokens and one output token (i.e., either "true" or "false", as described above). In the MS MARCO passage dataset, none of the inputs required truncation when using this length limit. Training variants based on T5-base and T5-3B took approximately 4 and 40 hours, respectively, on a single Google TPU v3-8.

At inference time, since the documents retrieved by Pyserini are usually longer than the length restrictions of the model, it is not possible to feed the entire text into the model at once. To address this issue, we first segment each document into spans by applying a sliding window of 10 sentences with a stride of 5. We obtain a probability of relevance for each span by performing inference on it independently, and then select the highest probability among the spans as the relevance score of the document.

duoT5. A pairwise reranker estimates the probability p_{i,j} that candidate d_i is more relevant than d_j for query q, where i ≠ j. Nogueira et al. (2019) demonstrated that a pairwise BERT reranker running on the output of a pointwise BERT reranker yields statistically significant improvements in ranking metrics. We applied the same intuition to T5 in a pairwise reranker called duoT5, which takes as input the sequence:

Query: q Document0: d_i Document1: d_j Relevant: (2)

where d_i and d_j are unique pairs of candidates from the set D. The model is fine-tuned to predict "true" if candidate d_i is more relevant than d_j to query q and "false" otherwise. We fine-tuned duoT5 using the same hyperparameters as monoT5. At inference time, we use the top 50 highest-scoring documents according to monoT5 as our candidates {d_i}. We then obtain probabilities p_{i,j} of d_i being more relevant than d_j for all unique candidate pairs {d_i, d_j}, for all i ≠ j. Finally, we compute a single score s_i for candidate d_i by aggregating its pairwise probabilities {p_{i,j} : j ∈ J_i}, where J_i = {0 ≤ j < 50, j ≠ i}. Based on exploratory studies on the MS MARCO passage dataset, this setting leads to the most stable and effective rankings.
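The "true"/"false" scoring trick described above can be sketched directly with the Hugging Face transformers library. The checkpoint name below is illustrative (a monoT5 model fine-tuned on MS MARCO available on the model hub), and the snippet scores a single query-span pair; a full reranker would apply it to every span of every candidate and take the per-document maximum.

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Illustrative checkpoint name: a monoT5 model fine-tuned on MS MARCO.
MODEL = "castorini/monot5-base-msmarco"

tokenizer = T5Tokenizer.from_pretrained(MODEL)
model = T5ForConditionalGeneration.from_pretrained(MODEL).eval()

# Vocabulary ids of the two target words (assumes each is a single T5 token).
TRUE_ID = tokenizer.encode("true", add_special_tokens=False)[0]
FALSE_ID = tokenizer.encode("false", add_special_tokens=False)[0]

def relevance_score(query: str, text: str) -> float:
    """P(relevant) from the 'true'/'false' logits at the first decoding step."""
    source = f"Query: {query} Document: {text} Relevant:"
    inputs = tokenizer(source, return_tensors="pt", truncation=True, max_length=512)
    decoder_input_ids = torch.full(
        (1, 1), model.config.decoder_start_token_id, dtype=torch.long
    )
    with torch.no_grad():
        logits = model(**inputs, decoder_input_ids=decoder_input_ids).logits
    # Softmax over only the "false" and "true" logits; return the "true" probability.
    pair = logits[0, 0, [FALSE_ID, TRUE_ID]]
    return torch.softmax(pair, dim=0)[1].item()

print(relevance_score("what is the incubation period of COVID-19?",
                      "The median incubation period was estimated to be 5.1 days."))
```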
Relevance Feedback. The setup of TREC-COVID (see Section 4.1) provides a feedback setting where systems can exploit a limited number of relevance judgments on a per-query basis. How do we take advantage of such training data? Despite work on fine-tuning transformers in a few-shot setting (Zhang et al., 2020b; Lee et al., 2020), we were wary of the dangers of overfitting on limited data, particularly since there is little guidance on relevance feedback using transformers in the literature. Instead, we implemented a robust approach that treats relevance feedback as a document classification problem using simple linear classifiers, as described in Yu et al. (2019) and Lin (2019). The approach is conceptually simple: for each query, we train a linear classifier (logistic regression) that attempts to distinguish relevant from non-relevant documents for that query. The classifier operates on sparse bag-of-words representations using tf-idf term weighting. At inference time, each candidate document is fed to the classifier, and the classifier score is then linearly interpolated with the original candidate document score to produce a final score. We describe the input source documents in Section 4.2.

All of the components above have also been open sourced. The two neural reranking modules are available in PyGaggle, our recently developed neural ranking library designed to work with Pyserini. Our classification-based approach to feedback is implemented in Pyserini directly. These components are available for integration into any system.

Beyond sharing our keyword search infrastructure and reranking models, we have built the Covidex as an operational search engine to demonstrate our capabilities to domain experts who are not interested in individual components. As deployed, we use the paragraph index and monoT5-base as the reranker. An additional highlighting module based on BioBERT is described in Zhang et al. (2020a). To decrease end-to-end latency, we rerank only the top 96 documents per query and truncate reranker input to a maximum of 256 tokens.

The Covidex is built using the FastAPI Python web framework, where all incoming API requests are handled by a service that performs searching, reranking, and text highlighting. Search is performed with Pyserini (Section 2.1), and the results are then reranked with PyGaggle (Section 2.2). The frontend (which is also open source) is built with React to support the use of modular, declarative JavaScript components, taking advantage of its vast ecosystem.

A screenshot of our system is shown in Figure 1. Covidex provides standard search capabilities, based either on keyword queries or natural-language input. Users can click "Show more" to reveal the abstract as well as excerpts from the full text, where potentially relevant passages are highlighted. Clicking on the title brings the user to the article's source on the publisher's site. In addition, we have implemented a faceted browsing feature. From CORD-19, we were able to easily expose facets corresponding to dates, authors, journals, and sources. Navigating by year, for example, allows a user to focus on older coronavirus research (e.g., on SARS) or the latest research on COVID-19, and a combination of the journal and source facets allows a user to differentiate between preprints and the peer-reviewed literature, and between venues with different reputations.
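A stripped-down sketch of how such a FastAPI service might wire the pieces together; the index path is an assumption, and the reranking and highlighting stages are only indicated by a comment, since the deployed service involves more machinery than shown here.

```python
from fastapi import FastAPI
from pyserini.search import SimpleSearcher  # named LuceneSearcher in newer Pyserini releases

app = FastAPI()
searcher = SimpleSearcher("indexes/cord19-paragraph")  # index path is an assumption

@app.get("/api/search")
def search(query: str, k: int = 96):
    # Stage 1: BM25 candidates from the paragraph index.
    hits = searcher.search(query, k=k)
    results = [{"docid": hit.docid, "score": hit.score} for hit in hits]
    # Stage 2 (omitted here): rerank `results` with monoT5 via PyGaggle and add
    # BioBERT-based highlighting before returning them to the frontend.
    return {"query": query, "results": results}
```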
The system is currently deployed across a small cluster of servers, each with two NVIDIA V100 GPUs, as our pipeline requires neural network inference at query time. Each server runs the complete software stack in a simple replicated setup (no partitioning). On top of this, we leverage Cloudflare as a simple load balancer, which uses a round-robin scheme to dispatch requests across the different servers. The end-to-end latency for a typical query is around two seconds.

The first implementation of our system was deployed in late March, and we have been incrementally adding features since. Based on Cloudflare statistics, our site receives around two hundred unique visitors per day and serves more than one thousand requests each day. Of course, usage statistics were (up to several times) higher when we first launched due to publicity on social media. However, the figures cited above represent a "steady state" that has held up over the past few months, in the absence of any deliberate promotion.

Reliable, large-scale evaluations of text retrieval methods are a costly endeavour, typically beyond the resources of individual research groups. Fortunately, the community-wide TREC-COVID challenge, sponsored by the U.S. National Institute of Standards and Technology (NIST), provides a forum for evaluating our techniques. The TREC-COVID challenge, which began in mid-April and is still ongoing, provides an opportunity for researchers to study methods for quickly standing up information access systems, both in response to the current pandemic and to prepare for similar future events.

Both out of logistic necessity in evaluation design and because the body of scientific literature is rapidly expanding, TREC-COVID is organized into a series of "rounds", each of which uses the CORD-19 collection at a snapshot in time. For a particular round, participating teams develop systems that return results for a number of information needs, called "topics"; one example is "serological tests that detect antibodies of COVID-19". These results comprise a run or a submission. NIST then gathers, organizes, and evaluates these runs using a standard pooling methodology (Voorhees, 2002).

The product of each round is a collection of relevance judgments, which are annotations by domain experts about the relevance of documents with respect to topics. On average, there are around 300 judgments (both positive and negative) per topic from each round. These relevance judgments are used to evaluate the effectiveness of systems (populating a leaderboard) and can also be used to train machine-learning models in future rounds. Runs that take advantage of these relevance judgments are known as "feedback runs", in contrast to "automatic" runs that do not. A third category, "manual" runs, can involve human input, but we did not submit any such runs.

Currently, TREC-COVID has completed round 3 and is in the middle of round 4. We present evaluation results from rounds 1, 2, and 3, since results from round 4 are not yet available. Each round contains a number of topics that are persistent (i.e., carried over from previous rounds) as well as new topics. To avoid retrieving duplicate documents, the evaluation adopts a residual collection methodology, where judged documents (either relevant or not) from previous rounds are automatically removed from consideration. Thus, for each topic, future rounds only evaluate documents that have not been examined before (either newly published articles or articles that have never been retrieved). Note that due to this evaluation methodology, scores across rounds are not comparable.
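In this feedback setting, the classification-based approach from Section 2.2 can be sketched with scikit-learn as follows; the per-topic training data would come from the relevance judgments of earlier rounds. The score normalization is a detail not specified above and is shown here as one reasonable choice; the 0.5 mixing weight matches the value we use in round 3 (Section 4.2).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def feedback_scores(judged_texts, judged_labels, candidate_texts, candidate_scores,
                    alpha=0.5):
    """Per-topic relevance feedback as text classification.

    judged_texts/judged_labels: documents judged in earlier rounds (1 = relevant,
    0 = not relevant). candidate_texts/candidate_scores: unjudged candidates and
    their original retrieval (or reranker) scores. alpha interpolates between the
    classifier score and the original score.
    """
    vectorizer = TfidfVectorizer()
    clf = LogisticRegression(max_iter=1000)
    clf.fit(vectorizer.fit_transform(judged_texts), judged_labels)

    clf_scores = clf.predict_proba(vectorizer.transform(candidate_texts))[:, 1]

    # Min-max normalize the original scores so the two components are comparable
    # (a detail left open above; shown here as one reasonable choice).
    orig = np.asarray(candidate_scores, dtype=float)
    orig = (orig - orig.min()) / (orig.max() - orig.min() + 1e-9)

    return alpha * clf_scores + (1.0 - alpha) * orig
```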
A selection of results from TREC-COVID is shown in Table 1, where we report standard metrics computed by NIST. We submitted runs under team "covidex" (for neural models) and team "anserini" (for our bag-of-words baselines).

Table 1: Selected TREC-COVID results. Our submissions are under teams "covidex" and "anserini". All runs marked with † incorporate our infrastructure components in some way.

In Round 1, there were 143 runs from 56 teams. Our best run, T5R1, used BM25 for first-stage retrieval with the paragraph index, followed by our monoT5-3B reranker trained on MS MARCO (as described in Section 2.2). The best automatic neural run was run2 from team GUIR S2 (MacAvaney et al., 2020), which was built on Anserini. This run placed second behind the best automatic run, sabir.meta.docs, which interestingly was based on the vector-space model. While we did make meaningful infrastructure contributions (e.g., Anserini provided the keyword search results that fed the neural ranking models of team GUIR S2), our own run T5R1 was substantially behind the top-scoring runs. A post-hoc experiment with round 1 relevance judgments showed that using the paragraph index did not turn out to be the best choice: simply replacing it with the abstract index (but retaining the monoT5-3B reranker) improved nDCG@10 from 0.5223 to 0.5702. (Despite this finding, we suspect that there may be evaluation artifacts at play here, because our impressions from the deployed system suggest that results from the paragraph index are better. Thus, the deployed Covidex still uses paragraph indexes.)

We learned two important lessons from the results of round 1:

1. The effectiveness of simple rank fusion techniques that can exploit diverse ranking signals by combining multiple ranked lists. Many teams adopted such techniques (including the top-scoring run), which proved both robust and effective. This is not a new observation in information retrieval, but it is once again affirmed by TREC-COVID.
2. The importance of building the "right" query representations for keyword search. Each TREC-COVID topic contains three fields: query, question, and narrative. The query field describes the information need using a few keywords, similar to what a user would type into a web search engine. The question field phrases the information need as a well-formed natural-language question, and the narrative field contains additional details in a short paragraph. The query field may be missing important keywords, but the other two fields often contain too many "noisy" terms unrelated to the information need. Thus, it makes sense to leverage information from multiple fields in constructing keyword queries, but to do so selectively. Based on results from round 1, the following query generation technique proved to be effective: when constructing the keyword query for a given topic, we take the non-stopwords from the query field and further expand them with terms belonging to named entities extracted from the question field using ScispaCy (Neumann et al., 2019).

We saw these two lessons as an opportunity to further contribute community infrastructure, and starting in round 2 we made two fusion runs from Anserini freely available: fusion1 and fusion2. In both runs, we combined rankings from the abstract, full-text, and paragraph indexes via reciprocal rank fusion (RRF) (Cormack et al., 2009).
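Reciprocal rank fusion itself is only a few lines; the sketch below fuses ranked docid lists from the different indexes, with the commonly used constant k = 60 assumed rather than taken from our runs.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of docids with RRF (Cormack et al., 2009).

    `rankings` is an iterable of ranked docid lists, e.g., the results from the
    abstract, full-text, and paragraph indexes for the same topic.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, docid in enumerate(ranking, start=1):
            scores[docid] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```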
The runs differed in their treatment of the query representation: fusion1 simply took the query field from the topics as the basis for keyword search, while fusion2 incorporated the query generator described above to augment the query representation with key phrases. These runs were made available before the deadline so that other teams could use them, and indeed many took advantage of them.

In Round 2, there were 136 runs from 51 teams. Our two Anserini baseline fusion runs are shown as r2.fusion1 and r2.fusion2 in Table 1. Comparing these two fusion baselines, we see that our query generation approach yields a large gain in effectiveness. Ablation studies further confirmed that ranking signals from the different indexes do contribute to the overall higher effectiveness of the rank fusion runs; that is, the effectiveness of the fused results is higher than that of the results from any of the individual indexes.

Our covidex.t5 run takes r2.fusion1 and r2.fusion2, reranks both with monoT5-3B, and then combines (with RRF) the outputs of both. The monoT5-3B model was fine-tuned on MS MARCO and then fine-tuned (again) on a medical subset of MS MARCO (MacAvaney et al., 2020). This run essentially tied with the best automatic run, GUIR S2 run1, which scored just 0.0001 higher. As additional context, Table 1 shows the best "manual" and "feedback" runs from round 2 (mpiid5 run3 and SparseDenseSciBert, respectively), which were also the top two runs overall. These results show that manual and feedback techniques can achieve quite a bit of gain over fully automatic techniques. Both of these runs and four out of the five top teams in round 2 took advantage of the fusion baselines we provided, which demonstrates not only our impact in developing effective ranking models, but also our service to the community in providing infrastructure.

In Round 3, there were 79 runs from 31 teams. Our Anserini fusion baselines, r3.fusion1 and r3.fusion2, remained the same as in the previous round and continued to provide strong baselines. Our run r3.duot5 represents the first deployment of our monoT5 and duoT5 multi-stage reranking pipeline (see Section 2.2): a fusion of the fusion runs serves as the first-stage candidates, which are reranked by monoT5 and then by duoT5. From Table 1, we see that duoT5 does indeed improve over using monoT5 alone (run r3.monot5), albeit the gains are small (though we found that the duoT5 run has more unjudged documents). The r3.duot5 run ranks second among all teams under the "automatic" condition, about two points behind team SFDC. However, according to Esteva et al. (2020), their general approach incorporates our Anserini fusion runs, which bolsters our case that we are providing valuable infrastructure for the community.

Our own feedback run r3.t5_lr implements the classification-based feedback technique (see Section 2.2), with monoT5 results as the input source documents (using a mixing weight of 0.5 to combine monoT5 scores with classifier scores). This was the highest-scoring run across all submissions (all categories), just ahead of BioInfo-run1.

Our project has three goals: build community infrastructure, advance the state of the art in neural ranking, and provide a useful application. We believe that our efforts can contribute to the fight against this global pandemic. Beyond COVID-19, the capabilities we have developed can be applied to analyzing the scientific literature more broadly.
Acknowledgments: This research was supported in part by the Canada First Research Excellence Fund, the Natural Sciences and Engineering Research Council (NSERC) of Canada, CIFAR AI & COVID-19 Catalyst Funding 2019-2020, and the Microsoft AI for Good COVID-19 Grant. We'd like to thank Kyle Lo from AI2 for helpful discussions and Colin Raffel from Google for his assistance with T5.

References
TensorFlow: A system for large-scale machine learning.
Effectiveness/efficiency tradeoffs for candidate generation in multi-stage retrieval architectures.
Reciprocal rank fusion outperforms Condorcet and individual rank learning methods.
Deeper text understanding for IR with contextual neural language modeling.
CO-Search: COVID-19 information retrieval with semantic search.
Mixout: Effective regularization to finetune large-scale pretrained language models.
Is searching full text more effective than searching abstracts?
The simplest thing that can possibly work: Pseudo-relevance feedback using text classification.
Cascade ranking for operational e-commerce search.
SLEDGE: A simple yet effective baseline for coronavirus scientific knowledge search.
CEDR: Contextualized embeddings for document ranking.
High accuracy retrieval with multiple nested ranker.
ScispaCy: Fast and robust models for biomedical natural language processing.
MS MARCO: A human-generated machine reading comprehension dataset.
Document ranking with a pretrained sequence-to-sequence model.
Multi-stage document ranking with BERT.
PyTorch: An imperative style, high-performance deep learning library.
Query understanding at Bing.
Exploring the limits of transfer learning with a unified text-to-text transformer.
Okapi at TREC-3.
Pivoted document length normalization.
The philosophy of information retrieval evaluation.
A cascade ranking model for efficient ranked retrieval.
Anserini: Enabling the use of Lucene for information retrieval research.
Anserini: Reproducible ranking baselines using Lucene. Journal of Data and Information Quality.
End-to-end open-domain question answering with BERTserini.
A lightweight environment for learning experimental IR research practices.
Cross-domain modeling of sentence-level evidence for document retrieval.
Simple techniques for cross-collection relevance feedback.
Rapidly deploying a neural search engine for the COVID-19 Open Research Dataset: Preliminary thoughts and lessons learned.
Revisiting few-sample BERT fine-tuning.