title: InPars: Data Augmentation for Information Retrieval using Large Language Models
authors: Bonifacio, Luiz; Abonizio, Hugo; Fadaee, Marzieh; Nogueira, Rodrigo
date: 2022-02-10

The information retrieval community has recently witnessed a revolution due to large pretrained transformer models. Another key ingredient for this revolution was the MS MARCO dataset, whose scale and diversity have enabled zero-shot transfer learning to various tasks. However, not all IR tasks and domains can benefit equally from a single dataset. Extensive research across NLP tasks has shown that using domain-specific training data, as opposed to general-purpose data, improves the performance of neural models. In this work, we harness the few-shot capabilities of large pretrained language models as synthetic data generators for IR tasks. We show that models finetuned solely on our unsupervised dataset outperform strong baselines such as BM25 as well as recently proposed self-supervised dense retrieval methods. Furthermore, retrievers finetuned on both supervised and our synthetic data achieve better zero-shot transfer than models finetuned only on supervised data. Code, models, and data are available at https://github.com/zetaalphavector/inpars .

Language models (LMs) such as GPT-3 (Brown et al., 2020), FLAN (Wei et al., 2022), Gopher (Rae et al., 2021), and T0++ (Sanh et al., 2021) have demonstrated impressive performance on many NLP tasks. Additionally, when sufficient supervised data is not available for a task, they have been shown to be effective and at times yield compelling results (Winata et al., 2021; Schick and Schütze, 2021b). Despite the appealing capabilities of large LMs, multi-billion parameter models are rarely used in information retrieval (IR), with some notable exceptions (Nogueira et al., 2020; Pradeep et al., 2021; Neelakantan et al., 2022).

[Figure 1: Illustration of our few-shot method that generates training data for search tasks. We use a language model G to generate a question q (and its probability p_q) from a document d; for example, from a document about the effects of caffeine during pregnancy, G generates the question "What are the effects of caffeine during pregnancy?". The top K pairs (q, d) with respect to p_q are used as positive examples to train a reranker whose task is to estimate the relevancy P(R=1|d, q) of d to q.]

One reason is the computationally intensive nature of information retrieval tasks. In a typical reranking task, for instance, we compute the relevancy of 1000 candidate documents for one query, which requires 1000 inference passes of a reranking model. This can be prohibitively expensive when using large models. For example, OpenAI offers a search API that allows one to compute query-document relevancy using their models with billions of parameters. As of February 2022, they charge 0.06 USD per 1000 tokens for their largest model. If each candidate document contains 250 tokens, naively using this API for a reranking task would cost approximately 15 USD per query.
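To make the cost estimate above concrete, the following back-of-the-envelope sketch reproduces the arithmetic. The candidate count, document length, and per-token price are the figures quoted in the text, not current pricing.

```python
# Back-of-the-envelope cost of reranking one query with a paid LM API,
# using the figures quoted above (assumed values, not an official pricing tool).
candidates_per_query = 1000   # documents reranked per query
tokens_per_document = 250     # assumed average document length in tokens
usd_per_1k_tokens = 0.06      # quoted price for the largest model (Feb 2022)

tokens_per_query = candidates_per_query * tokens_per_document
cost_per_query = tokens_per_query / 1000 * usd_per_1k_tokens
print(f"~{cost_per_query:.2f} USD per query")  # ~15.00 USD
```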
Dense retrievers (Karpukhin et al., 2020; Khattab and Zaharia, 2020) avoid this expensive reranking step by precomputing vector representations for each document in the collection prior to retrieval. When a query comes in, only its vector representation is computed, and a fast vector search framework can be used to retrieve the document vectors nearest to the query vector (Johnson et al., 2019). Despite being computationally cheaper at inference time, dense retrievers need one inference pass to compute the vector representation of each document in the collection, which also makes billion-parameter neural models impractical to use as dense retrievers.

Another challenge in developing neural models for IR is the lack of domain-specific training data. Manually constructing high-quality datasets is difficult, as it requires queries from real users. While a few general-purpose labeled datasets are available (Nguyen et al., 2016; Kwiatkowski et al., 2019), they do not always generalize well to out-of-domain datasets (Thakur et al., 2021). For this reason, zero-shot and few-shot learning models are particularly promising. However, a cost-effective way of using large LMs in IR tasks remains an open question.

In this work, we propose a simple yet effective approach to efficiently using large LMs in retrieval and obtain improvements across several IR datasets. Rather than using large LMs directly in the retrieval process, we harness them to generate labeled data in a few-shot manner. We then finetune retrieval models on this synthetic data and use them to rerank the search results of a first-stage retrieval system. We summarize our contributions as follows:

• We propose a method for adapting large LMs to IR tasks that would otherwise be infeasible to use due to their computational demands.

• In an unsupervised setting, our method largely outperforms recently proposed ones. When combined with supervised finetuning, our method achieves state-of-the-art results on two of the three transfer learning datasets evaluated in this work.

Data augmentation methods aim at increasing the amount of data available to assist the learning process of data-driven models. To improve the performance of neural models in low-resource settings, small-scale LMs have been used to generate synthetic data for various NLP tasks (Fadaee et al., 2017; Kobayashi, 2018). Recent work shows that large pretrained LMs are capable of generating data of reasonable quality (Anaby-Tavor et al., 2020; Papanikolaou and Pierleoni, 2020; Yang et al., 2020; Mohapatra et al., 2021; Kumar et al., 2020; Schick and Schütze, 2021a; Meng et al., 2022), sometimes leading to better transfer learning than human-generated datasets (Liu et al., 2022). In information retrieval, dense retrievers can achieve results comparable to BM25 on some datasets when pretrained solely on documents without annotations (Ram et al., 2021; Izacard et al., 2021; Neelakantan et al., 2022). These methods rely on extracting pairs of text segments that are likely relevant to each other, which are then used as positive pairs to train the retrieval models. Focusing on improving the transfer learning effectiveness of dense retrievers, Ma et al. (2021) and Wang et al. (2021) use supervised sequence-to-sequence models to augment the training data. They generate questions from texts in different collections and use these synthetic question-text pairs as positive training examples.
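As a concrete illustration of the dense-retrieval setup described at the beginning of this passage, the following minimal sketch precomputes document vectors once and runs a brute-force inner-product search at query time. Random vectors stand in for a real document/query encoder; in practice a vector search library such as FAISS (Johnson et al., 2019) replaces the brute-force scan.

```python
import numpy as np

# Document vectors are precomputed offline; at query time only the query
# is encoded, and a vector search returns the nearest documents.
rng = np.random.default_rng(0)
num_docs, dim = 10_000, 768

doc_vectors = rng.standard_normal((num_docs, dim)).astype("float32")  # precomputed offline
query_vector = rng.standard_normal(dim).astype("float32")             # computed per query

scores = doc_vectors @ query_vector   # inner-product relevance scores
top_k = np.argsort(-scores)[:10]      # indices of the 10 nearest documents
print(top_k)
```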
Our work differs from existing approaches in that we rely exclusively on simple prompts to generate questions with large language models using minimal supervision, i.e., only a few supervised examples. We were mostly inspired by Han et al. (2021), who use such models to generate synthetic translation pairs in a zero-shot manner, i.e., without using any parallel corpora.

In this section, we describe the proposed method, dubbed InPars (Inquisitive Parrots for Search), for generating synthetic training datasets for IR tasks. Given a document d and a prefix t consisting of N pairs of questions and their relevant documents, i.e., t = {(q*_1, d*_1), ..., (q*_N, d*_N)}, our method uses a language model G(t, d) to generate a question q that is likely to be relevant to d. The pair (q, d) forms a positive training example that is later used to finetune our retrieval models. We generate thousands of these positive training examples using documents randomly sampled from a collection D. The prefix t is always the same regardless of the input document d, i.e., we can potentially generate millions of synthetic training examples using only N manually annotated examples. This characterizes our method as a few-shot learning approach as long as N is small (in our experiments, we use three examples). As a last step in creating our training dataset, we select the top K pairs with respect to the following (log) probability:

p_q = (1 / |q|) Σ_{i=1}^{|q|} log p(q_i | t, d, q_{<i}),

where p(q_i | t, d, q_{<i}) is the probability assigned by G to the i-th token of the question q, conditioned on the prefix t, the document d, and the previously generated tokens q_{<i}.
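The sketch below illustrates the generation and scoring steps described above. It is not the paper's released implementation: GPT-2 stands in for the large LM G, the prompt contains a single in-context example (the caffeine document and question shown in Figure 1) rather than the three pairs used in our experiments, the prompt wording is only illustrative, and the second input document is a made-up placeholder. The mean log-probability of the generated tokens corresponds to p_q.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-2 stands in for the large LM G (an assumption for this sketch).
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

# Prefix t with a single (document, question) example; the paper uses three pairs.
prefix = (
    "Document: We don't know a lot about the effects of caffeine during "
    "pregnancy on you and your baby. So it's best to limit the amount you "
    "get each day.\n"
    "Question: What are the effects of caffeine during pregnancy?\n\n"
)
document = "Passive smoking increases the risk of lung disease in non-smokers."  # toy placeholder
prompt = prefix + f"Document: {document}\nQuestion:"

inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=32,          # in practice generation is stopped at the first newline
        do_sample=False,
        return_dict_in_generate=True,
        output_scores=True,
        pad_token_id=tokenizer.eos_token_id,
    )

# Tokens generated after the prompt form the synthetic question q.
gen_tokens = out.sequences[0, inputs["input_ids"].shape[1]:]
question = tokenizer.decode(gen_tokens, skip_special_tokens=True).strip()

# p_q: mean log-probability of the generated tokens, used to rank (q, d) pairs.
log_probs = [
    torch.log_softmax(score, dim=-1)[0, tok].item()
    for score, tok in zip(out.scores, gen_tokens)
]
p_q = sum(log_probs) / len(log_probs)
print(question, p_q)
```

In the full pipeline, this score is computed for thousands of generated (q, d) pairs, and only the top K pairs with respect to p_q are kept as positive examples for finetuning the retrieval models.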