key: cord-0229439-1uwbdnk7
authors: Hofstatter, Sebastian; Althammer, Sophia; Sertkan, Mete; Hanbury, Allan
title: Establishing Strong Baselines for TripClick Health Retrieval
date: 2022-01-02
journal: nan
DOI: nan
sha: c4a622eb0f5ac4d7a58ebe2f7773bc40bbb761c1
doc_id: 229439
cord_uid: 1uwbdnk7

We present strong Transformer-based re-ranking and dense retrieval baselines for the recently released TripClick health ad-hoc retrieval collection. We improve the - originally too noisy - training data with a simple negative sampling policy. We achieve large gains over BM25 in the re-ranking task of TripClick, which were not achieved with the original baselines. Furthermore, we study the impact of different domain-specific pre-trained models on TripClick. Finally, we show that dense retrieval outperforms BM25 by considerable margins, even with simple training procedures.

The latest neural network advances in Information Retrieval (IR) - specifically the ad-hoc passage retrieval task - are driven by available training data, especially the large web-search-based MSMARCO collection [1]. Here, neural approaches lead to enormous effectiveness gains over traditional techniques [8, 13, 20, 24]. A valid concern is the generalizability and applicability of the developed techniques to other domains and settings [16, 14, 31, 34]. The newly released TripClick collection [27], with large-scale click log data from the Trip Database, a health search engine, provides us with the opportunity to re-test previously developed techniques on this new ad-hoc retrieval task: keyword search in the health domain with large training and evaluation sets. TripClick provides three different test sets (Head, Torso, Tail), grouped by their query frequency, so we can analyze model performance for different slices of the overall query distribution.

This study conducts a range of controlled ad-hoc retrieval experiments using pre-trained Transformer [32] models with various state-of-the-art retrieval architectures on the TripClick collection. We aim to reproduce effectiveness gains achieved on MSMARCO in the click-based health ad-hoc retrieval setting.

Typically, neural ranking models are trained with a triple of one query, a relevant and a non-relevant passage. As part of our evaluation study, we discovered a flaw in the provided neural training data of TripClick: the original negative sampling strategy included non-clicked results, which led to inadequate training. Therefore, we re-created the training data with an improved negative sampling strategy, based solely on BM25 negatives, with better results than the published baselines.
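To make the triple-based training setup concrete, the following sketch shows a generic pairwise training step on such triples. It is illustrative only: the model call, optimizer handling, and the margin ranking loss are assumptions made for the sketch, while our actual training configuration follows Hofstätter et al. [8].

    # A generic sketch of one pairwise training step on (query, relevant,
    # non-relevant) triples; the model call and the margin ranking loss are
    # illustrative assumptions - our training setup follows Hofstätter et al. [8].
    import torch

    def pairwise_training_step(model, optimizer, query, pos_passage, neg_passage,
                               loss_fn=torch.nn.MarginRankingLoss(margin=1.0)):
        pos_score = model(query, pos_passage)   # score of the relevant (clicked) passage
        neg_score = model(query, neg_passage)   # score of the sampled negative passage
        # the relevant passage should receive a higher score than the negative one
        loss = loss_fn(pos_score, neg_score, torch.ones_like(pos_score))
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        return loss.item()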
As the TripClick collection was released only recently, we are the first to study a wide-ranging number of BERT-style ranking architectures on it and to answer the fundamental question:

RQ1 How do established ranking models perform on re-ranking TripClick?

In the re-ranking setting, where the neural models score a set of 200 candidates produced by BM25, we observe large effectiveness gains for BERT_CAT, ColBERT, and TK on every one of the three frequency-based query splits. BERT_CAT improves over BM25 on Head by 100%, on Torso by 66%, and on Tail still by 50%.

We compare the general BERT-Base & DistilBERT with the domain-specific SciBERT & PubMedBERT models to answer:

RQ2 Which BERT-style pre-trained checkpoint performs best on TripClick?

Although the general-domain models show good effectiveness results, they are outperformed by the domain-specific pre-training approaches. Here, PubMedBERT slightly outperforms SciBERT on re-ranking with BERT_CAT & ColBERT. An ensemble of all domain-specific models with BERT_CAT again outperforms all previous approaches and sets new state-of-the-art results for TripClick.

Finally, we study the concept of retrieving passages directly from a nearest neighbor vector index, also referred to as dense retrieval, and answer:

RQ3 How well does dense retrieval work on TripClick?

Dense retrieval outperforms BM25 considerably for initial candidate retrieval, both in top-10 precision results and for all recall cutoffs, except top-1000. In contrast to re-ranking, SciBERT outperforms PubMedBERT on dense retrieval results.

We publish our source code as well as the improved training triples at: https://github.com/sebastian-hofstaetter/tripclick

We describe the collection, the BERT-style pre-training instances, ranking architectures, and training procedures we use below.

TripClick contains 1.5 million passages (with an average length of 259 words), 680 thousand click-based training queries (with an average of 4.4 words), and 3,525 test queries. The TripClick collection includes three test sets with 1,175 queries each, grouped by their frequency and called Head, Torso, and Tail queries. For the Head queries, a DCTR [3] click model was employed to create relevance signals; the other two sets use raw clicks. In comparison to the widely analyzed MSMARCO collection [10], TripClick is yet to be fully understood. This includes the quality of the click labels and the effect of various filtering mechanisms of the professional search production UI, which are not part of the released data.

We study multiple architectures at different points on the efficiency vs. effectiveness tradeoff scale. Here, we give a brief overview; for more detailed comparisons see Hofstätter et al. [8].

BERT_CAT - Concatenated Scoring. The base re-ranking model BERT_CAT [24, 20, 40] concatenates query and passage sequences with special tokens and computes a score by reducing the pooled CLS representation with a single linear layer. It represents one of the current state-of-the-art models in terms of effectiveness; however, it exhibits many drawbacks in terms of efficiency [9, 39].

ColBERT. The ColBERT model [13] delays the interactions between every query and document representation until after BERT. The interactions in the ColBERT model are aggregated with a max-pooling per query term and a sum of query-term scores. The aggregation only requires simple dot product computations; however, the storage cost of pre-computing passage representations is very high, as it depends on the total number of terms in the collection.

TK. The Transformer-Kernel model [12] is not based on BERT pre-training, but rather uses shallow and independently computed Transformers, followed by a set of RBF kernels that count match signals in a term-by-term match matrix, for very efficient re-ranking.

BERT_DOT - Dense Retrieval. The BERT_DOT model matches a single CLS vector of the query with a single, independently computed CLS vector of a passage [39, 18, 17]. This decomposition of the interactions into a single dot product allows us to pre-compute every contextualized passage representation and employ a nearest neighbor index for dense retrieval, without a traditional first stage.
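As an illustration of the differences between these scoring schemes, the following PyTorch sketch reduces already-encoded query and passage representations to a relevance score for each architecture. It is a simplified, assumed formulation with illustrative tensor names, not the exact implementation in our repository.

    # A simplified sketch (illustrative names, pre-encoded inputs) of how each
    # architecture aggregates a relevance score; encoding details are omitted.
    import torch

    def bert_cat_score(cls_of_concat: torch.Tensor, linear: torch.nn.Linear) -> torch.Tensor:
        # BERT_CAT: the pooled CLS vector of the concatenated (query, passage)
        # sequence is reduced to a scalar score by a single linear layer.
        return linear(cls_of_concat).squeeze(-1)                             # (batch,)

    def colbert_score(query_vecs: torch.Tensor, passage_vecs: torch.Tensor) -> torch.Tensor:
        # ColBERT: term-by-term dot products, max-pooled over passage terms
        # for every query term, then summed over query terms.
        interactions = torch.bmm(query_vecs, passage_vecs.transpose(1, 2))   # (batch, q_len, p_len)
        return interactions.max(dim=-1).values.sum(dim=-1)                   # (batch,)

    def bert_dot_score(query_cls: torch.Tensor, passage_cls: torch.Tensor) -> torch.Tensor:
        # BERT_DOT: a single dot product between the query CLS vector and the
        # independently pre-computable passage CLS vector.
        return (query_cls * passage_cls).sum(dim=-1)                         # (batch,)

Only BERT_DOT collapses the interaction into one dot product per passage, which is what allows pre-computing all passage vectors and serving them from a nearest neighbor index.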
The 12-layer BERT-Base model [5] (and the 6-layer distilled version, DistilBERT [29]) and its vocabulary are based on the Books Corpus and English Wikipedia articles. The SciBERT model [2] uses an architecture identical to BERT-Base, but its vocabulary and weights are pre-trained on Semantic Scholar articles (with 82% of the articles from the broad biomedical domain). Similarly, the PubMedBERT model [7] and its vocabulary are trained on PubMed articles, using the same architecture as the BERT model.

At the time of writing, this is the first paper evaluating on the novel TripClick collection. However, many other tasks have been set up in the biomedical retrieval domain before, such as BioASQ [23], the TREC Precision Medicine tracks [28, 6], or the timely created TREC-COVID [33, 22, 30] (which is based on CORD-19 [35], a collection of scientific articles concerned with the coronavirus pandemic). For TREC-COVID, MacAvaney et al. [19] train a neural re-ranking model on a subset of the MS MARCO dataset containing only medical terms (Med-MARCO) and demonstrate its domain-focused effectiveness in a transfer to TREC-COVID. Xiong et al. [38] and Lima et al. [15] explore medical domain-specific BERT representations for retrieval from the TREC-COVID corpus and show that using SciBERT for dense retrieval outperforms the BM25 baseline by a large margin. Wang et al. [36] explore continuous active learning for the retrieval task on the COVID-19 corpus; this method is also studied for retrieval in the Precision Medicine track [28, 4]. Reddy et al. [26] demonstrate synthetic training for question answering on COVID-19 related questions. Many of these related works are concerned with overcoming the lack of large training data in previous medical collections. Now, with TripClick, we have a large-scale medical retrieval dataset. In this paper, we jumpstart work on this collection by showcasing the effectiveness of neural ranking approaches on TripClick.

In our experiment setup, we largely follow Hofstätter et al. [8], except where noted otherwise. Mainly, we rely on the PyTorch [25] and HuggingFace Transformers [37] libraries as the foundation for our neural training and evaluation methods. For TK, we follow Rekabsaz et al. [27] and utilize a PubMed-trained 400-dimensional word embedding as a starting point [21]. For validation and testing, we utilize the data splits outlined in TripClick by Rekabsaz et al. [27].

The TripClick dataset conveniently comes with a set of pre-generated training triples for neural training. Nevertheless, we found this training set to produce less than optimal results, and the trained BERT models show no robustness against an increased re-ranking depth. This phenomenon of having to tune the re-ranking depth for effectiveness, rather than efficiency, has been studied for early non-BERT re-rankers [11]. With the advent of Transformer-based re-rankers, this tuning became obsolete [12].

In the TripClick dataset, the clicked results are considered as positive samples for training. However, we discovered a flaw in the published negative sampling procedure: non-clicked results - ranked above the clicked ones - are included as sampled negative passages. We hypothesize that this leads to many false negatives in the training set, confusing the models during training. We confirm this hypothesis with our training telemetry data, which shows low pairwise training accuracy as well as a lack of clear distinction in the scoring margins of the BERT_CAT models.

For all results presented in this study, we generate new training data with the following simple procedure (a code sketch follows the list):

1. We generate 500 BM25 candidates for every training query.
2. For every query - relevant (clicked) passage pair in the training set, we randomly sample, without replacement, up to 20 negative candidates from the candidates created in step 1.
   - We remove candidates present in the relevant pool, regardless of relevance grade.
   - We discard positional information (we expect position bias to be in the training data - a potential for future work).
3. After shuffling the training triples, we save 10 million triples for training.
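The following sketch illustrates this sampling policy; the data structures and the explicit relevant_pool argument are assumptions about the bookkeeping, not the exact format of our released triple files.

    # A sketch of the negative sampling policy above; data structures and the
    # explicit relevant_pool argument are assumptions, not our released file format.
    import random

    def generate_triples(clicked_positives, relevant_pool, bm25_candidates,
                         negatives_per_positive=20, max_triples=10_000_000):
        # clicked_positives: query_id -> list of clicked (positive) passage ids
        # relevant_pool:     query_id -> set of all judged passage ids (any grade)
        # bm25_candidates:   query_id -> top-500 BM25 passage ids for that query
        triples = []
        for query_id, positives in clicked_positives.items():
            # step 2 (filter): drop candidates that appear in the relevant pool
            pool = [p for p in bm25_candidates[query_id]
                    if p not in relevant_pool[query_id]]
            for positive_id in positives:
                # step 2 (sample): up to 20 negatives per positive, without replacement
                negatives = random.sample(pool, min(negatives_per_positive, len(pool)))
                triples.extend((query_id, positive_id, neg) for neg in negatives)
        # step 3: shuffle and keep 10 million triples for training
        random.shuffle(triples)
        return triples[:max_triples]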
Our new training set gave us a 45-50% improvement on MRR@10 (from .41 to .60) and nDCG@10 (from .21 to .30) for the Head validation queries, using the same PubMedBERT_CAT model and setup. The models are now also robust against increasing the re-ranking depth.

In this section, we present the results for our research questions, first for re-ranking and then for dense retrieval.

We present the original baselines, as well as our re-ranking results for all three frequency-based TripClick query sets, in Table 1. All neural models re-rank the top-200 results of BM25. While the original baselines do improve the frequent Head queries by up to 6 points nDCG@10 (TK-L3 vs. BM25-L1), they hardly improve the Tail queries, with only a 1-3 point difference in nDCG@10 (CK-L2 & TK-L3 vs. BM25-L1). This is a pressing issue, as those queries make up 83% of all Trip searches [27]. Turning to our results in Table 1 to answer RQ1 (How do established ranking models perform on re-ranking TripClick?), we can see that our training approach for TK (Line 4) strongly outperforms the original TK (L3), especially on the Tail queries.

To answer RQ3 (How well does dense retrieval work on TripClick?), we present our results in Table 2. Dense retrieval with BERT_DOT (L13 to L15) outperforms BM25 (L1) considerably for initial candidate retrieval, both in terms of top-10 precision results and for all recall cutoffs except top-1000. We also provide the judgement coverage for the top-10 results; surprisingly, the coverage for dense retrieval increases compared to BM25. Future annotation campaigns should explore the robustness of these click-based evaluation results.
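For the dense retrieval runs, the pre-computed passage representations are searched with a nearest neighbor index. The sketch below shows one way to set this up with Faiss over BERT_DOT CLS vectors; the library choice and the exact (flat, inner-product) index type are assumptions for illustration, not a description of our production setup.

    # A sketch of nearest neighbor retrieval over pre-computed BERT_DOT passage
    # vectors; Faiss and the exact flat inner-product index are assumptions.
    import faiss
    import numpy as np

    def build_index(passage_vectors: np.ndarray) -> faiss.Index:
        # passage_vectors: (num_passages, dim) float32 CLS encodings of all passages
        index = faiss.IndexFlatIP(passage_vectors.shape[1])  # exact inner-product search
        index.add(passage_vectors)
        return index

    def dense_retrieve(index: faiss.Index, query_vectors: np.ndarray, k: int = 1000):
        # returns (scores, passage_indices) of the top-k passages per query,
        # replacing the traditional BM25 first stage
        return index.search(np.ascontiguousarray(query_vectors, dtype=np.float32), k)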
Test collection diversity is a fundamental requirement of IR research. Ideally, we as a community develop methods that work on the largest possible set of problem settings. However, neural models require large training sets, which restricted most of the foundational research to the public MSMARCO and other web search collections. Now, with TripClick, we have another large-scale collection available. In this paper, we show that, in contrast to the original baselines, neural models perform very well on TripClick - both in the re-ranking task and in full collection retrieval with nearest neighbor search. We make our techniques openly available to the community to foster diverse neural information retrieval research.

References

A Human Generated MAchine Reading COmprehension Dataset
SciBERT: A Pretrained Language Model for Scientific Text
Click Models for Web Search
Technology-Assisted Review in Empirical Medicine: Waterloo Participation in CLEF eHealth 2018
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Citius at the TREC 2020 Health Misinformation Track
Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing
Improving Efficient Neural Ranking Models with Cross-Architecture Knowledge Distillation
Let's Measure Run Time! Extending the IR Replicability Infrastructure to Include Performance Aspects
Mitigating the Position Bias of Transformer Models in Passage Re-Ranking
On the Effect of Low-Frequency Terms on Neural-IR Models
Interpretable & Time-Budget-Constrained Contextualization for Re-Ranking
ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT
Multi-Task Dense Retrieval via Model Uncertainty Fusion for Open-Domain Question Answering
Denmark's Participation in the Search Engine TREC COVID-19 Challenge: Lessons Learned About Searching for Precise Biomedical Scientific Information on COVID-19
A Proposed Conceptual Framework for a Representational Approach to Information Retrieval
TwinBERT: Distilling Knowledge to Twin-Structured BERT Models for Efficient Retrieval
Sparse, Dense, and Attentional Representations for Text Retrieval
SLEDGE: A Simple Yet Effective Baseline for COVID-19 Scientific Knowledge Search
CEDR: Contextualized Embeddings for Document Ranking
Deep Relevance Ranking Using Enhanced Document-Query Interactions (arXiv preprint arXiv:1809.01682)
COVID-QA: A Question Answering Dataset for COVID-19
Overview of BioASQ 2020: The Eighth BioASQ Challenge on Large-Scale Biomedical Semantic Indexing and Question Answering
Passage Re-Ranking with BERT
Automatic Differentiation in PyTorch
End-to-End QA on COVID-19: Domain Adaptation with Synthetic Training
TripClick: The Log Files of a Large Health Web Search Engine
Overview of the TREC 2019 Precision Medicine Track
DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter
Rapidly Bootstrapping a Question Answering Dataset for COVID-19
BEIR: A Heterogenous Benchmark for Zero-Shot Evaluation of Information Retrieval Models
Attention Is All You Need
TREC-COVID: Constructing a Pandemic Information Retrieval Test Collection
TSDAE: Using Transformer-Based Sequential Denoising Auto-Encoder for Unsupervised Sentence Embedding Learning
HuggingFace's Transformers: State-of-the-Art Natural Language Processing
CMT in TREC-COVID Round 2: Mitigating the Generalization Gaps from Web to Special Domain Search
Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval
Cross-Domain Modeling of Sentence-Level Evidence for Document Retrieval