title: InPars: Data Augmentation for Information Retrieval using Large Language Models
authors: Bonifacio, Luiz; Abonizio, Hugo; Fadaee, Marzieh; Nogueira, Rodrigo
date: 2022-02-10

The information retrieval community has recently witnessed a revolution due to large pretrained transformer models. Another key ingredient for this revolution was the MS MARCO dataset, whose scale and diversity have enabled zero-shot transfer learning to various tasks. However, not all IR tasks and domains can benefit equally from a single dataset. Extensive research across NLP tasks has shown that using domain-specific training data, as opposed to general-purpose data, improves the performance of neural models. In this work, we harness the few-shot capabilities of large pretrained language models as synthetic data generators for IR tasks. We show that models finetuned solely on our unsupervised dataset outperform strong baselines such as BM25 as well as recently proposed self-supervised dense retrieval methods. Furthermore, retrievers finetuned on both supervised and our synthetic data achieve better zero-shot transfer than models finetuned only on supervised data. Code, models, and data are available at https://github.com/zetaalphavector/inpars .

Language models (LMs) such as GPT-3 (Brown et al., 2020), FLAN (Wei et al., 2022), Gopher (Rae et al., 2021), and T0++ (Sanh et al., 2021) have demonstrated impressive performance on many NLP tasks. Additionally, when sufficient supervised data is not available for a task, they have been shown to be effective and at times yield compelling results (Winata et al., 2021; Schick and Schütze, 2021b). Despite the appealing capabilities of large LMs, multi-billion parameter models are rarely used in information retrieval (IR), with some notable exceptions (Nogueira et al., 2020; Pradeep et al., 2021; Neelakantan et al., 2022).

[Figure 1: Illustration of our few-shot method that generates training data for search tasks. We use a language model G to generate a question q (and its probability p_q) from a document d; for example, from a document about the effects of caffeine during pregnancy, G generates the question "What are the effects of caffeine during pregnancy?". The top K pairs (q, d) with respect to p_q are used as positive examples to train a reranker whose task is to estimate the relevancy P(R=1|d, q) of d to q.]

One reason is the computationally intensive nature of information retrieval tasks. In a typical reranking task, for instance, we compute the relevancy of 1000 candidate documents for one query, which requires 1000 inference passes of a reranking model. This can be prohibitively expensive when using large models. For example, OpenAI offers a search API that allows one to compute query-document relevancy using their models with billions of parameters. As of February 2022, they charge 0.06 USD per 1000 tokens for their largest model. If each candidate document contains 250 tokens, naively using this API for a reranking task would cost approximately 15 USD per query.
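To make the cost estimate above concrete, the following back-of-the-envelope sketch reproduces the arithmetic. The candidate count, document length, and per-token price are the figures quoted in the text, not current pricing.

```python
# Back-of-the-envelope cost of reranking one query with a paid LM API,
# using the figures quoted above (assumed values, not an official pricing tool).
candidates_per_query = 1000   # documents reranked per query
tokens_per_document = 250     # assumed average document length in tokens
usd_per_1k_tokens = 0.06      # quoted price for the largest model (Feb 2022)

tokens_per_query = candidates_per_query * tokens_per_document
cost_per_query = tokens_per_query / 1000 * usd_per_1k_tokens
print(f"~{cost_per_query:.2f} USD per query")  # ~15.00 USD
```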
Dense retrievers (Karpukhin et al., 2020; Khattab and Zaharia, 2020) avoid this expensive reranking step by precomputing vector representations for each document in the collection prior to retrieval. When a query comes in, only its vector representation is computed, and a fast vector search framework can be used to retrieve the document vectors nearest to the query vector (Johnson et al., 2019). Despite being computationally cheaper at inference time, dense retrievers need one inference pass to compute the vector representation of each document in the collection, which also makes billion-parameter neural models impractical to use as dense retrievers.

Another challenge in developing neural models for IR is the lack of domain-specific training data. Manually constructing high-quality datasets is difficult, as it requires queries from real users. While a few general-purpose labeled datasets are available (Nguyen et al., 2016; Kwiatkowski et al., 2019), they do not always generalize well to out-of-domain datasets (Thakur et al., 2021). For this reason, zero-shot and few-shot learning models are particularly promising. However, a cost-effective way of using large LMs in IR tasks remains an open question.

In this work, we propose a simple yet effective approach to efficiently using large LMs in retrieval and obtain improvements across several IR datasets. Rather than using large LMs directly in the retrieval process, we harness them to generate labeled data in a few-shot manner. We then finetune retrieval models on this synthetic data and use them to rerank the search results of a first-stage retrieval system. We summarize our contributions as follows:

• We propose a method for adapting large LMs to IR tasks that would otherwise be infeasible to use due to their computational demands.

• In an unsupervised setting, our method largely outperforms recently proposed ones. When combined with supervised finetuning, our method achieves state-of-the-art results on two of the three transfer learning datasets evaluated in this work.

Data augmentation methods aim at increasing the amount of data available to assist the learning process of data-driven models. To improve the performance of neural models in low-resource settings, small-scale LMs have been used to generate synthetic data for various NLP tasks (Fadaee et al., 2017; Kobayashi, 2018). Recent work shows that large pretrained LMs are capable of generating data of reasonable quality (Anaby-Tavor et al., 2020; Papanikolaou and Pierleoni, 2020; Yang et al., 2020; Mohapatra et al., 2021; Kumar et al., 2020; Schick and Schütze, 2021a; Meng et al., 2022), sometimes leading to better transfer learning than human-generated datasets (Liu et al., 2022). In information retrieval, dense retrievers can achieve results comparable to BM25 on some datasets when pretrained solely on documents without annotations (Ram et al., 2021; Izacard et al., 2021; Neelakantan et al., 2022). These methods rely on extracting pairs of text segments that are likely relevant to each other, which are then used as positive pairs to train the retrieval models. Focusing on improving the transfer learning effectiveness of dense retrievers, Ma et al. (2021) and Wang et al. (2021) use supervised sequence-to-sequence models to augment the training data. They generate questions from texts in different collections and use these synthetic question-text pairs as positive training examples.
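As a concrete illustration of the dense-retrieval setup described at the beginning of this passage, the following minimal sketch precomputes document vectors once and runs a brute-force inner-product search at query time. Random vectors stand in for a real document/query encoder; in practice a vector search library such as FAISS (Johnson et al., 2019) replaces the brute-force scan.

```python
import numpy as np

# Document vectors are precomputed offline; at query time only the query
# is encoded, and a vector search returns the nearest documents.
rng = np.random.default_rng(0)
num_docs, dim = 10_000, 768

doc_vectors = rng.standard_normal((num_docs, dim)).astype("float32")  # precomputed offline
query_vector = rng.standard_normal(dim).astype("float32")             # computed per query

scores = doc_vectors @ query_vector   # inner-product relevance scores
top_k = np.argsort(-scores)[:10]      # indices of the 10 nearest documents
print(top_k)
```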
Our work differs from existing approaches in that we rely exclusively on simple prompts to generate questions with large language models using minimal supervision, i.e., only a few supervised examples. We were mostly inspired by Han et al. (2021), who use such models to generate synthetic translation pairs in a zero-shot manner, i.e., without using any parallel corpora.

In this section, we describe the proposed method, dubbed InPars (Inquisitive Parrots for Search), for generating synthetic training datasets for IR tasks. Given a document d and a prefix t consisting of N pairs of questions and their relevant documents, i.e., t = {(q*_1, d*_1), ..., (q*_N, d*_N)}, our method uses a language model G(t, d) to generate a question q that is likely to be relevant to d. The pair (q, d) forms a positive training example that is later used to finetune our retrieval models. We generate thousands of these positive training examples using documents randomly sampled from a collection D. The prefix t is always the same regardless of the input document d, i.e., we can potentially generate millions of synthetic training examples using only N manually annotated examples. This characterizes our method as a few-shot learning approach as long as N is small (in our experiments, we use three examples). As a last step in creating our training dataset, we select the top K pairs with respect to the following (log) probability:

p_q = (1 / |q|) Σ_{i=1}^{|q|} log p(q_i | t, d, q_{<i}),

where p(q_i | t, d, q_{<i}) is the probability assigned by G to the i-th token of the question q, conditioned on the prefix t, the document d, and the previously generated tokens q_{<i}.
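The sketch below illustrates the generation and scoring steps described above. It is not the paper's released implementation: GPT-2 stands in for the large LM G, the prompt contains a single in-context example (the caffeine document and question shown in Figure 1) rather than the three pairs used in our experiments, the prompt wording is only illustrative, and the second input document is a made-up placeholder. The mean log-probability of the generated tokens corresponds to p_q.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-2 stands in for the large LM G (an assumption for this sketch).
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

# Prefix t with a single (document, question) example; the paper uses three pairs.
prefix = (
    "Document: We don't know a lot about the effects of caffeine during "
    "pregnancy on you and your baby. So it's best to limit the amount you "
    "get each day.\n"
    "Question: What are the effects of caffeine during pregnancy?\n\n"
)
document = "Passive smoking increases the risk of lung disease in non-smokers."  # toy placeholder
prompt = prefix + f"Document: {document}\nQuestion:"

inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=32,          # in practice generation is stopped at the first newline
        do_sample=False,
        return_dict_in_generate=True,
        output_scores=True,
        pad_token_id=tokenizer.eos_token_id,
    )

# Tokens generated after the prompt form the synthetic question q.
gen_tokens = out.sequences[0, inputs["input_ids"].shape[1]:]
question = tokenizer.decode(gen_tokens, skip_special_tokens=True).strip()

# p_q: mean log-probability of the generated tokens, used to rank (q, d) pairs.
log_probs = [
    torch.log_softmax(score, dim=-1)[0, tok].item()
    for score, tok in zip(out.scores, gen_tokens)
]
p_q = sum(log_probs) / len(log_probs)
print(question, p_q)
```

In the full pipeline, this score is computed for thousands of generated (q, d) pairs, and only the top K pairs with respect to p_q are kept as positive examples for finetuning the retrieval models.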