key: cord-020801-3sbicp3v
title: Teaching a New Dog Old Tricks: Resurrecting Multilingual Retrieval Using Zero-Shot Learning
authors: MacAvaney, Sean; Soldaini, Luca; Goharian, Nazli
date: 2020-03-24
journal: Advances in Information Retrieval
DOI: 10.1007/978-3-030-45442-5_31
doc_id: 20801
cord_uid: 3sbicp3v

While billions of non-English speaking users rely on search engines every day, the problem of ad-hoc information retrieval is rarely studied for non-English languages. This is primarily due to a lack of datasets suitable for training ranking algorithms. In this paper, we tackle the lack of data by leveraging pre-trained multilingual language models to transfer a retrieval system trained on English collections to non-English queries and documents. Our model is evaluated in a zero-shot setting, meaning that it predicts relevance scores for query-document pairs in languages never seen during training. Our results show that the proposed approach can significantly outperform unsupervised retrieval techniques for Arabic, Chinese Mandarin, and Spanish. We also show that augmenting the English training collection with some examples from the target language can sometimes improve performance.

Every day, billions of non-English speaking users [22] interact with search engines; however, commercial retrieval systems have traditionally been tailored to English queries, causing an information access divide between those who can and those who cannot speak this language [39]. Non-English search applications have been equally under-studied by most information retrieval researchers. Historically, ad-hoc retrieval systems have been primarily designed, trained, and evaluated on English corpora (e.g., [1, 5, 6, 23]). More recently, a new wave of supervised state-of-the-art ranking models has been proposed [11, 14, 21, 24, 26, 35, 37]; these models rely on neural architectures to re-rank the head of search results retrieved using a traditional unsupervised ranking algorithm, such as BM25. Like previous ad-hoc ranking algorithms, these methods are almost exclusively trained and evaluated on English queries and documents.

The absence of rankers designed to operate on languages other than English can largely be attributed to a lack of suitable publicly available datasets. This particularly limits supervised ranking methods, as they require samples for training and validation. For English, previous research has relied on collections such as TREC Robust 2004 [32], the 2009-2014 TREC Web Track [7], and MS MARCO [2]. No datasets of similar size exist for other languages.

While most recent approaches have focused on ad-hoc retrieval for English, some researchers have studied the problem of cross-lingual information retrieval. In this setting, document collections are typically in English while queries are translated into several languages; sometimes the opposite setup is used. Over the years, several cross-lingual tracks were included as part of TREC: TREC 6, 7, and 8 [4] offered queries in English, German, Dutch, Spanish, French, and Italian, with the document collection kept in English for all three years. CLEF also hosted multiple cross-lingual ad-hoc retrieval tasks from 2000 to 2009 [3]. Early systems for these tasks leveraged dictionary-based and statistical translation approaches, as well as other indexing optimizations [27].
More recently, approaches that rely on cross-lingual semantic representations (such as multilingual word embeddings) have been explored. For example, Vulić and Moens [34] proposed BWESG, an algorithm for learning word embeddings on aligned documents that can be used to calculate document-query similarity. Sasaki et al. [28] leveraged a dataset of Wikipedia pages in 25 languages to train a learning-to-rank algorithm for Japanese-English and Swahili-English cross-language retrieval. Litschko et al. [20] proposed an unsupervised framework that relies on aligned word embeddings. Ultimately, while related, these approaches benefit only users who can understand documents in two or more languages, rather than directly tackling non-English document retrieval.

A few monolingual ad-hoc datasets exist, but most are too small to train a supervised ranking method. For example, TREC produced several non-English test collections: Spanish [12], Chinese Mandarin [31], and Arabic [25]. Other languages were explored, but those document collections are no longer available. The CLEF initiative includes some non-English monolingual datasets, though these focus primarily on European languages [3]. Recently, Zheng et al. [40] introduced Sogou-QCL, a large query-log dataset in Mandarin; such datasets are only available for languages that already have large, established search engines.

Inspired by the success of neural retrieval methods, this work studies the problem of monolingual ad-hoc retrieval in non-English languages using supervised neural approaches. In particular, to circumvent the lack of training data, we leverage transfer learning to train Arabic, Mandarin, and Spanish retrieval models using English training data. In the past few years, transfer learning between languages has proven to be a remarkably effective approach for low-resource multilingual tasks (e.g., [16, 17, 29, 38]). Our model leverages a pre-trained multilingual transformer to obtain encodings for queries and documents in different languages; at training time, this encoding is used to predict the relevance of query-document pairs in English. We evaluate our models in a zero-shot setting; that is, we use them to predict relevance scores for query-document pairs in languages never seen during training. By leveraging a pre-trained multilingual language model, which can be easily trained from abundant aligned [19] or unaligned [8] web text, we achieve competitive retrieval performance without having to rely on language-specific relevance judgments. During the peer review of this article, a preprint [30] was published with observations similar to ours.

In summary, our contributions are:

- We study zero-shot transfer learning for IR in non-English languages.
- We propose a simple yet effective technique that leverages contextualized word embeddings as a multilingual encoder for query and document terms. Our approach outperforms several baselines on multiple non-English collections.
- We show that including additional in-language training samples may help further improve ranking performance.
- We release our code for pre-processing, initial retrieval, training, and evaluation of non-English datasets. We hope that this encourages others to consider cross-lingual modeling implications in future work.

Zero-Shot Multi-lingual Ranking. Because large-scale relevance judgments are largely absent in languages other than English, we propose a new setting for evaluating learning-to-rank approaches: zero-shot cross-lingual ranking. This setting uses relevance data from one language that has a considerable amount of training data (e.g., English) for model training and validation, and applies the trained model to a different language for testing.

More formally, let S be a collection of relevance tuples in the source language, and let T be a collection of relevance judgments from another language. Each relevance tuple ⟨q, d, r⟩ consists of a query, a document, and a relevance score, respectively. In typical evaluation environments, S is segmented into training (S_train) and testing (S_test) splits such that there is no overlap of queries between the two. A ranking algorithm is tuned on S_train to define a ranking function R_{S_train}(q, d) ∈ ℝ, which is subsequently tested on S_test. We propose instead tuning a model on all data from the source language (i.e., training R_S(·)) and testing it on a collection from the second language (T), as summarized in the display below.
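Compactly, the two settings differ only in which judged collection the tuned ranking function is applied to (this display simply restates the definitions above):

```latex
\underbrace{\;R_{S_{\mathrm{train}}}(q,d)\ \text{tuned on } S_{\mathrm{train}},\ \text{tested on } S_{\mathrm{test}}\;}_{\text{standard monolingual setting}}
\qquad
\underbrace{\;R_{S}(q,d)\ \text{tuned on all of } S,\ \text{tested on } T\;}_{\text{zero-shot cross-lingual setting}}
```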
We evaluate on monolingual newswire datasets in three languages: Arabic, Mandarin, and Spanish. The Arabic document collection contains 384k documents (LDC2001T55), and we use topics and relevance information from the 2001-02 TREC Multilingual track (25 and 50 topics, respectively). For Mandarin, we use 130k news articles from LDC2000T52, with topics and relevance judgments from TREC 5 and 6 (26 and 28 topics, respectively). Finally, the Spanish collection contains 58k articles from LDC2000T51, and we use topics from TREC 3 and 4 (25 topics each). We use the topics, rather than the query descriptions, in all cases except TREC Spanish 4, for which only descriptions are provided; topics more closely resemble real user queries than descriptions do. We test on these collections because they are the only such document collections available from TREC at this time.

We index the text content of each document using a modified version of Anserini [36] with support for the languages we investigate. Specifically, we add Anserini support for Lucene's Arabic and Spanish light stemming and stop-word lists (via ArabicAnalyzer and SpanishAnalyzer), and we treat each character in Mandarin text as a single token.

Modeling. We explore the following ranking models:

- Unsupervised baselines. We use the Anserini [36] implementations of BM25, RM3 query expansion, and the Sequential Dependence Model (SDM). In the spirit of the zero-shot setting, we use the default parameters from Anserini (i.e., we assume no data in the target language).
- PACRR [14] models n-gram relationships in the text using learned 2D convolutions and max pooling atop a query-document similarity matrix.
- KNRM [35] uses learned Gaussian kernel pooling functions over the query-document similarity matrix to rank documents.
- Vanilla BERT [21] uses the BERT [10] transformer model with a dense layer atop the classification token to compute a ranking score. To support multiple languages, we use the base-multilingual-cased pre-trained weights, which were trained on Wikipedia text in 104 languages.

We use the embedding-layer output of the base-multilingual-cased model for PACRR and KNRM. In pilot studies, we investigated using cross-lingual MUSE vectors [8] and the output representations from BERT, but found the BERT embeddings to be more effective. Minimal sketches of KNRM's kernel pooling and of a Vanilla BERT-style scorer follow.
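To make kernel pooling concrete, below is a minimal PyTorch sketch of the KNRM scoring features. The function name and tensor shapes are illustrative, and the 11-kernel configuration follows the original KNRM paper rather than necessarily matching our experimental setup.

```python
import torch

def knrm_features(sim, mus, sigmas):
    """KNRM-style Gaussian kernel pooling over a query-document similarity matrix.

    sim:    [batch, n_query_terms, n_doc_terms] cosine similarities
    mus:    [n_kernels] kernel means; sigmas: [n_kernels] kernel widths
    """
    # Kernel response for every (query term, document term) pair.
    k = torch.exp(-0.5 * (sim.unsqueeze(-1) - mus) ** 2 / sigmas ** 2)
    # Soft term frequency per query term: sum kernel responses over doc terms.
    soft_tf = k.sum(dim=2)  # [batch, n_query_terms, n_kernels]
    # Log-sum over query terms yields one feature per kernel: [batch, n_kernels].
    return torch.log(soft_tf.clamp(min=1e-10)).sum(dim=1)

# 11 kernels as in the original KNRM paper: 10 soft-match bins + 1 exact-match kernel.
mus = torch.tensor([-0.9, -0.7, -0.5, -0.3, -0.1, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0])
sigmas = torch.tensor([0.1] * 10 + [0.001])
features = knrm_features(torch.rand(2, 4, 50) * 2 - 1, mus, sigmas)
# A learned linear layer over these features produces the final ranking score.
```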
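Likewise, here is a minimal sketch of a Vanilla BERT-style ranker using the HuggingFace transformers library. The class name is ours, and details such as the truncation strategy are simplifying assumptions rather than the exact setup used in our experiments.

```python
import torch
from transformers import BertModel, BertTokenizer

class VanillaBERTRanker(torch.nn.Module):
    """Scores a query-document pair via a dense layer over BERT's [CLS] vector."""

    def __init__(self, model_name="bert-base-multilingual-cased"):
        super().__init__()
        self.tokenizer = BertTokenizer.from_pretrained(model_name)
        self.bert = BertModel.from_pretrained(model_name)
        self.score = torch.nn.Linear(self.bert.config.hidden_size, 1)

    def forward(self, queries, docs):
        # Encode "[CLS] query [SEP] document [SEP]", truncating the document
        # to respect BERT's 512-token limit.
        enc = self.tokenizer(queries, docs, padding=True,
                             truncation="only_second", max_length=512,
                             return_tensors="pt")
        cls_vec = self.bert(**enc).last_hidden_state[:, 0]  # [batch, hidden]
        return self.score(cls_vec).squeeze(-1)              # [batch] scores

# The same trained weights can score query-document pairs in any mBERT language:
ranker = VanillaBERTRanker()
scores = ranker(["efectos del cambio climático"],
                ["El cambio climático afecta a los ecosistemas costeros ..."])
```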
Experimental Setup. We train and validate our models on the TREC Robust 2004 collection [32], which contains 249 topics, 528k documents, and 311k relevance judgments in English (folds 1-4 from [15] for training, fold 5 for validation). Thus, the models are only exposed to English text during training and validation (though the embeddings and contextualized language models are pre-trained on large amounts of unlabeled text in many languages). The validation set is used for parameter tuning and for selecting the optimal training epoch (via nDCG@20). We train using a pairwise softmax loss with the Adam optimizer [18]; a minimal sketch of one training step follows. We evaluate the trained models by re-ranking the top 100 documents retrieved with BM25. We report MAP, Precision@20, and nDCG@20 to gauge the overall performance of our approach, and the percentage of judged documents in the top 20 ranked documents (judged@20) to evaluate how suitable the datasets are for approaches that did not contribute to the original judgments.
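The sketch below assumes a pairwise-scoring model such as the VanillaBERTRanker sketch above; the learning rate and the provenance of the positive/negative document pairs are placeholders, not our tuned configuration.

```python
import torch
import torch.nn.functional as F

optimizer = torch.optim.Adam(ranker.parameters(), lr=2e-5)  # lr is a placeholder

def train_step(queries, pos_docs, neg_docs):
    """One pairwise softmax update: the relevant document should outscore
    the non-relevant one for the same query."""
    pos_scores = ranker(queries, pos_docs)                 # [batch]
    neg_scores = ranker(queries, neg_docs)                 # [batch]
    pairs = torch.stack([pos_scores, neg_scores], dim=1)   # [batch, 2]
    # Cross-entropy with target class 0 == -log softmax(pos vs. neg scores).
    target = torch.zeros(pairs.size(0), dtype=torch.long)
    loss = F.cross_entropy(pairs, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```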
We present the ranking results in Table 1. We first point out that there is considerable variability in the performance of the unsupervised baselines: in some cases RM3 and SDM outperform BM25, whereas in others they underperform. Similarly, the PACRR and KNRM neural models vary in effectiveness, though they more frequently perform much worse than BM25. This makes sense, as these models capture matching characteristics that are specific to English. For instance, n-gram patterns captured by PACRR for English do not necessarily transfer well to languages with a different constituent order, such as Arabic (VSO instead of SVO).

An interesting observation is that the Vanilla BERT model (which, recall, is tuned only on English text) generally outperforms the other approaches across the three test languages. This is particularly remarkable because it is a single trained model that is effective across all three languages, without any difference in parameters. The exceptions are the Arabic 2001 dataset, on which it performs only comparably to BM25, and the MAP results for Spanish. For Spanish, RM3 substantially improves recall (as evidenced by MAP), and since Vanilla BERT acts as a re-ranker atop BM25, it is unable to take advantage of this improved recall, despite significantly improving the precision-focused metrics. In all cases, Vanilla BERT exhibits judged@20 above 85%, indicating that these test collections are still valuable for evaluation.

To test whether a small amount of in-language training data can further improve BERT's ranking performance, we conduct an experiment that uses the other collection for each language as additional training data, interleaving the in-language samples with the English training samples. Results for this few-shot setting are shown in Table 2. We find that the added topics for Arabic 2001 (+50) and Spanish 4 (+25) significantly improve performance. For Arabic 2001, this yields a model significantly better than BM25, which suggests that there may be substantial distributional differences between the English TREC 2004 training collection and the Arabic 2001 test collection. We further support this by training an "oracle" BERT model (trained on the test data) for Arabic 2001, which yields a substantially better model (P@20 = 0.7340, nDCG@20 = 0.8093, MAP = 0.4250).

Conclusion. We introduced a zero-shot multilingual setting for the evaluation of neural ranking methods, an important setting given the lack of training data available in many languages. We found that contextualized language models (namely, BERT) have a substantial advantage and are generally better suited to cross-lingual transfer than prior models (which may rely more heavily on phenomena exclusive to English). We also found that additional in-language training data may improve performance, though not necessarily. By releasing our code and models, we hope that cross-lingual evaluation will become more commonplace.

References

[1] Probabilistic models of information retrieval based on measuring the divergence from randomness
[2] MS MARCO: a human generated machine reading comprehension dataset
[3] CLEF 2003 – overview of results
[4] Cross-language information retrieval (CLIR) track overview
[5] Learning to rank: from pairwise approach to listwise approach
[6] A survey of automatic query expansion in information retrieval
[7] TREC 2014 web track overview
[8] Word translation without parallel data
[9] Deeper text understanding for IR with contextual neural language modeling
[10] BERT: pre-training of deep bidirectional transformers for language understanding
[12] Overview of the fourth text retrieval conference (TREC-4)
[13] Overview of the third text retrieval conference (TREC-3)
[14] PACRR: a position-aware neural IR model for relevance matching
[15] Parameters learned in the comparison of retrieval models using term dependencies
[16] Google's multilingual neural machine translation system: enabling zero-shot translation
[17] Cross-lingual transfer learning for POS tagging without cross-lingual resources
[18] Adam: a method for stochastic optimization
[19] Cross-lingual language model pretraining
[20] Unsupervised cross-lingual information retrieval using monolingual data only
[21] CEDR: contextualized embeddings for document ranking
[23] A Markov random field model for term dependencies
[24] An introduction to neural information retrieval
[25] The TREC 2002 Arabic/English CLIR track
[26] Neural information retrieval: at the end of the early years
[27] Multilingual Information Retrieval: From Research to Practice
[28] Cross-lingual learning-to-rank with shared representations
[29] Cross-lingual transfer learning for multilingual task-oriented dialog
[30] Cross-lingual relevance transfer for document retrieval
[31] The sixth text retrieval conference (TREC-6)
[32] Overview of the TREC 2005 robust retrieval track
[33] Overview of the fifth text retrieval conference (TREC-5)
[34] Monolingual and cross-lingual information retrieval models based on (bilingual) word embeddings
[35] End-to-end neural ad-hoc ranking with kernel pooling
[36] Anserini: reproducible ranking baselines using Lucene
[37] Simple applications of BERT for ad hoc document retrieval
[38] Transfer learning for sequence tagging with hierarchical recurrent networks
[39] The digital language divide
[40] Sogou-QCL: a new dataset with click relevance label. In: The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval

Acknowledgments. This work was supported in part by ARCS Foundation.