Title: To Tune or Not To Tune? Zero-shot Models for Legal Case Entailment
Authors: Guilherme Moraes Rosa, Ruan Chaves Rodrigues, Roberto de Alencar Lotufo, Rodrigo Nogueira
Date: 2022-02-07
DOI: 10.1145/3462757.3466103

There has been mounting evidence that pretrained language models fine-tuned on large and diverse supervised datasets can transfer well to a variety of out-of-domain tasks. In this work, we investigate this transfer ability to the legal domain. For that, we participated in the legal case entailment task of COLIEE 2021, in which we use such models with no adaptations to the target domain. Our submissions achieved the highest scores, surpassing the second-best team by more than six percentage points. Our experiments confirm a counter-intuitive result in the new paradigm of pretrained language models: given limited labeled data, models with little or no adaptation to the target task can be more robust to changes in the data distribution than models fine-tuned on it. Code is available at https://github.com/neuralmind-ai/coliee.

The best-performing models in most natural language processing benchmarks have a similar architecture to the original Transformer [40] and are pretrained on variations of the masked language modeling objective used by Devlin et al. [7]. Zero-shot and few-shot models are becoming more competitive with models fine-tuned on large datasets. For instance, the few-shot results of GPT-3 [4] sparked interest in prompt engineering methods, which are now an active area of research [20, 36, 38]. The goal of these methods is to find input templates such that the model is more likely to give the correct answer. In information retrieval, pretrained models fine-tuned only on a large dataset have also shown strong zero-shot capabilities [39]. For example, the same multi-stage pipeline based on T5 [33] was the best or second-best-performing system in 4 tracks of TREC 2020 [27], including specialized tasks such as Precision Medicine [34] and TREC-COVID [48]. A remarkable feature of this pipeline is that, for most tasks, the models are fine-tuned only on a general-domain ranking dataset, i.e., they do not use in-domain data.

However, to date, there has not been strong evidence that zero-shot models transfer well to the legal domain. Most state-of-the-art models need adaptations to the target task. For example, the top-performing system on the legal case entailment task of COLIEE 2020 [31] uses an interpolation of BM25 [35] scores and scores from a BERT model fine-tuned on the target task [24]. In this work, we show that, for the legal case entailment task of COLIEE, pretrained language models without any fine-tuning on the target task perform at least as well as, and sometimes better than, models fine-tuned on the task itself. Our approach is characterized as zero-shot since the model was only fine-tuned on annotated data from another domain. Our result confirms in the legal domain a counter-intuitive recent finding from other domains: given limited labeled data, zero-shot models tend to perform better on held-out datasets than models fine-tuned on the target task [27, 32].

It is a common assumption among NLP researchers that models developed using non-legal texts lead to unsatisfactory performance when directly applied to legal tasks [9, 50]. To overcome this issue, general-purpose techniques are adapted to the legal domain.
For example, Chalkidis and Kampas [5] pre-trained legal word embeddings using word2vec [22, 23], and the authors of [8] applied multi-task learning to mitigate the problems related to data scarcity in the legal domain; their models were trained on translation, summarization, and multi-label classification tasks, and achieved better results than single-task models. Pretrained transformer models have only recently begun to be adopted more broadly in legal NLP applications [2, 10, 16, 37, 44]. In some tasks, they marginally outperform classical methods, especially when training data is scarce. For example, Zhong et al. [51] showed that a BERT-based model performs better than a tf-idf similarity model on a judgment prediction task [43], but is slightly less effective than an attention-based convolutional neural network [47]. In other cases, they outperform classical methods, but at the expense of using hand-crafted features or of being fine-tuned on the target task. For example, the best submission to task 2 of COLIEE 2019 was a BERT model fed with hand-crafted inputs and fine-tuned on in-domain data [29]. Peters et al. [26] demonstrate that fine-tuning on the target task may not perform better than simple feature extraction from a pretrained model if the pretraining task and the target task belong to highly different domains. These findings lead us to consider zero-shot approaches when investigating how general-domain Transformer models can be applied to legal tasks.

Although zero-shot approaches are relatively novel in the legal domain, our work is not the first to apply zero-shot Transformer models to domain-specific entailment tasks where limited labeled data is available. Yin et al. [45] transformed multi-label classification tasks into textual entailment tasks and then evaluated the performance of a BERT model fine-tuned on mainstream entailment datasets. Yin et al. [46] performed similar experiments while transforming question answering and coreference resolution tasks into entailment tasks. We are not the first to use zero-shot techniques on the legal case entailment task. For instance, Rabelo et al. [28] used a BERT model fine-tuned for paraphrase detection, combined with two transformer-based models fine-tuned on a generic text entailment dataset and with features generated by a BERT model fine-tuned on the COLIEE training dataset. However, we are the first to show that zero-shot models can outperform fine-tuned ones on this task.

The Competition on Legal Information Extraction/Entailment (COLIEE) [13, 14, 30, 31] is an annual competition whose aim is to evaluate automatic systems on case and statute law tasks. Among the five tasks of the 2021 competition, we submitted systems to task 2, called legal case entailment, which consists of identifying paragraphs from existing cases that entail a given fragment of a base case. Training data consists of a set of decision fragments, their respective candidate paragraphs (which may or may not be relevant to the fragment), and a set of labels containing the numbers of the paragraphs by which each decision fragment is entailed. Test data includes only decision fragments and candidate paragraphs, but no labels. As shown in Figure 1, the input to the model is a decision fragment Q of an unseen case and the output should be a set of paragraphs [p1, p2, ..., pn] that are relevant to the given decision fragment Q.
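To make this format concrete, the following toy example shows the shape of a training instance. The field names and the text are invented for illustration only and do not follow the official COLIEE file layout.

```python
# Hypothetical illustration of a task 2 training instance (not the official format).
example = {
    # decision fragment Q from the base case
    "fragment": "The applicant bears the onus of establishing a serious issue to be tried.",
    # candidate paragraphs p1..pn extracted from previously decided cases
    "candidates": {
        "001": "The threshold for establishing a serious issue to be tried is a low one ...",
        "002": "Costs normally follow the event ...",
        "003": "It falls on the moving party to show that the issue raised is neither frivolous nor vexatious ...",
    },
    # numbers of the candidate paragraphs that entail the fragment (absent at test time)
    "labels": ["001", "003"],
}
```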
The micro F1-score is the official metric of this task:

F1 = 2PR / (P + R),

where precision P is the number of correctly retrieved paragraphs for all queries divided by the number of retrieved paragraphs for all queries, and recall R is the number of correctly retrieved paragraphs for all queries divided by the number of relevant paragraphs for all queries.

We experiment with the following models: BM25, monoT5-zero-shot, monoT5, and DeBERTa. We also evaluate an ensemble of our monoT5 and DeBERTa models.

BM25 is a bag-of-words retrieval function that scores a document based on the query terms appearing in it. We use the BM25 implementation in Pyserini [18], a Python toolkit that supports replicable information retrieval research, with its default parameters. We first index all paragraphs in the datasets of tasks 1 and 2. Having more paragraphs from task 1 improves the term statistics (e.g., document frequencies) used by BM25. The task 1 dataset is composed of long documents, while the task 2 dataset is composed of paragraphs. This difference in length may degrade BM25 scores for task 2 paragraphs because the average document length will be higher due to the task 1 documents. We address this problem by segmenting each task 1 document into several paragraphs using a context window of 10 sentences with overlapping strides of 5 sentences. The entailed fragment may comprise multiple sentences. We treat each of its sentences as a query and compute a BM25 score for each sentence and candidate paragraph pair independently. The final score for each paragraph is the maximum among its per-sentence scores (a sketch of this scoring is given below). We then use the method described in Section 3.5 to select the paragraphs that will comprise our final answer.

At a high level, monoT5-zero-shot is a sequence-to-sequence adaptation of the T5 model [33] proposed by Nogueira et al. [25] and further detailed in Lin et al. [19]. This ranking model is close to or at the state of the art in retrieval tasks such as Robust04 [42], TREC-COVID, and the TREC 2020 Precision Medicine and Deep Learning tracks. Details of the model are described in Nogueira et al. [25]; here, we only provide a short overview. In the T5 model, all target tasks are cast as sequence-to-sequence tasks. For our task, we use the following input sequence template:

Query: q Document: d Relevant:

where q and d are the query and candidate texts, respectively. In this work, q is a fragment and d is one of the candidate paragraphs. The model estimates a score s quantifying how relevant a candidate text d is to a query q, that is, s = P(Relevant = 1 | d, q). The model is fine-tuned to produce the tokens "true" or "false" depending on whether the candidate is relevant or not to the query. That is, "true" and "false" are the "target tokens" (i.e., ground-truth predictions in the sequence-to-sequence transformation). The suffix "Relevant:" in the input string serves as a hint to the model for the tokens it should produce.

We use a T5-large model fine-tuned on MS MARCO [1], a dataset of approximately 530k query and relevant passage pairs. We use a checkpoint available at Hugging Face's model hub (https://huggingface.co/castorini/monot5-large-msmarco-10k) that was trained with a learning rate of 10^-3 using batches of 128 examples for 10k steps, or approximately one epoch of the MS MARCO dataset. In each batch, a roughly equal number of positive and negative examples is sampled. We refer to this model as monoT5-zero-shot. Although fine-tuning for more epochs leads to better performance on the MS MARCO development set, Nogueira et al. [25] showed that further training degrades a model's zero-shot performance on other datasets. We observed similar behavior in our task and opted to use the model trained for one epoch on MS MARCO. At inference time, to compute probabilities for each query-candidate pair, a softmax is applied only on the logits of the tokens "true" and "false". The final score of each candidate is the probability assigned to the token "true".
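As an illustration of this scoring, the sketch below loads the cited Hugging Face checkpoint and computes the "true" probability for one fragment-paragraph pair. It is a minimal sketch, not the authors' exact pipeline (their code is in the linked repository); in particular, the single-step decoding setup shown here is one common way to read off the first-token logits.

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

MODEL_NAME = "castorini/monot5-large-msmarco-10k"
tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)
model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME).eval()

# First-token ids of the target words "true" and "false".
TRUE_ID = tokenizer.encode("true")[0]
FALSE_ID = tokenizer.encode("false")[0]

def relevance_score(query: str, document: str) -> float:
    """Probability that `document` is relevant to `query` under monoT5."""
    text = f"Query: {query} Document: {document} Relevant:"
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    # Decode a single step starting from the decoder start token.
    decoder_input_ids = torch.full(
        (1, 1), model.config.decoder_start_token_id, dtype=torch.long
    )
    with torch.no_grad():
        logits = model(**inputs, decoder_input_ids=decoder_input_ids).logits[0, 0]
    # Softmax restricted to the "true"/"false" logits; the "true" probability is the score.
    probs = torch.softmax(logits[[TRUE_ID, FALSE_ID]], dim=0)
    return probs[0].item()
```

Candidates are ranked by this score and then passed to the answer selection step described in Section 3.5.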
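Returning to the BM25 baseline described earlier in this section, the sketch below reproduces the max-over-sentences scoring. To keep the example self-contained it uses the rank_bm25 package with naive whitespace tokenization instead of Pyserini, and it omits the larger index over task 1 paragraphs that the paper uses to improve term statistics.

```python
from rank_bm25 import BM25Okapi

def bm25_rank(fragment_sentences, candidate_paragraphs):
    """Score each candidate paragraph by the maximum BM25 score over the
    fragment's sentences, each sentence acting as an independent query."""
    tokenized_corpus = [p.lower().split() for p in candidate_paragraphs]
    bm25 = BM25Okapi(tokenized_corpus)
    # One score vector (over all paragraphs) per fragment sentence.
    per_sentence = [bm25.get_scores(s.lower().split()) for s in fragment_sentences]
    # Final paragraph score: the maximum among its per-sentence scores.
    return [max(scores) for scores in zip(*per_sentence)]
```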
We further fine-tune monoT5-zero-shot on the 2020 task 2 training set, following a training procedure similar to the one described in the previous section. Fragments mostly consist of a single sentence, while candidate paragraphs are longer, sometimes exceeding 512 tokens in length. Thus, to avoid excessive memory usage due to the quadratic memory cost of Transformers with respect to the sequence length, we truncate inputs to 512 tokens during both training and inference. The model is fine-tuned with a learning rate of 10^-3 for 80 steps using batches of size 128, which corresponds to 20 epochs. Each batch has the same amount of positive and negative examples. We refer to this model as monoT5.

Decoding-enhanced BERT with disentangled attention (DeBERTa) improves on the original BERT and RoBERTa architectures by introducing two techniques: the disentangled attention mechanism and an enhanced mask decoder [12]. Both improvements seek to introduce positional information into the pretraining procedure, in terms of both the absolute position of a token and the relative positions between tokens.

The COLIEE 2021 task 2 dataset has very few positive examples of entailment. Therefore, for fine-tuning DeBERTa on this dataset, we found it appropriate to artificially expand the positive examples. As fragments take up only a small portion of a base case paragraph, we expand positive examples by generating artificial fragments from the same base case paragraph in which the original fragment occurs. This is done by moving a sliding window, with a stride that is half the size of the original fragment, over the base case paragraph (a sketch of this expansion is given below). Each step of this sliding window is taken to be an artificial fragment, and such artificial fragments are assigned the same labels as the original fragment. Although the resulting dataset is several times larger than the original task 2 dataset, we achieved better results by fine-tuning DeBERTa on a small sample taken from this artificial dataset. After experimenting with distinct sample sizes, we settled on a sample of twenty thousand fragment and candidate paragraph pairs, equally balanced between positive and negative entailment pairs.

To find the best hyperparameters for fine-tuning a DeBERTa Large model, we perform a grid search over the hyperparameters suggested by He et al. [12], always early stopping at the second epoch. The best combination of hyperparameters is then used to fine-tune the model for ten epochs. The checkpoint with the best performance on the 2020 test set is selected to generate our predictions for the 2021 test set.

The models described above estimate a score for each (fragment, candidate paragraph) pair. To select the final set of paragraphs for a given fragment, we apply three rules:

• Select paragraphs whose scores are above a threshold α;
• Select the top β paragraphs with respect to their scores;
• Select paragraphs whose scores are at least γ of the top score.

We use exhaustive grid search to find the best values for α, β, and γ on the development set of the 2020 task 2 dataset; the best values for each model can be found in Table 3. Note that our hyperparameter search includes the possibility of not using the first or third strategies if α = 0 or γ = 0 is chosen, respectively.
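A minimal sketch of these selection rules follows, assuming the three conditions are applied conjunctively; this is consistent with the "no rule" baseline reported later in the ablation, which keeps only the top-scoring paragraph when α = γ = 0 and β = 1.

```python
def select_paragraphs(scores: dict, alpha: float, beta: int, gamma: float) -> list:
    """Apply the three selection rules to a {paragraph_id: score} mapping:
    keep paragraphs that (1) score above the threshold alpha, (2) rank within
    the top beta by score, and (3) score at least gamma times the top score."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked:
        return []
    top_score = ranked[0][1]
    return [
        pid
        for rank, (pid, score) in enumerate(ranked)
        if score >= alpha and rank < beta and score >= gamma * top_score
    ]
```

The grid search then simply evaluates micro F1 on the development set for each (α, β, γ) combination and keeps the best one per model.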
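Returning to the positive-example expansion used for DeBERTa, the sketch below generates the artificial fragments. The paper does not specify the unit of the sliding window; counting whitespace tokens here is an assumption.

```python
def expand_positive_fragment(base_paragraph: str, fragment: str) -> list:
    """Generate artificial fragments by sliding a fragment-sized window over
    the base-case paragraph with a stride of half the fragment length.
    Each window inherits the labels of the original fragment.
    Window size and stride are counted in whitespace tokens (an assumption)."""
    para_tokens = base_paragraph.split()
    frag_tokens = fragment.split()
    size = max(1, len(frag_tokens))
    stride = max(1, size // 2)
    windows = []
    for start in range(0, max(1, len(para_tokens) - size + 1), stride):
        windows.append(" ".join(para_tokens[start:start + size]))
    return windows
```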
Ensemble methods seek to combine the strengths and compensate for the weaknesses of individual models so that the combined model has better generalization performance. We use the following method to combine the predictions of monoT5 and DeBERTa (both fine-tuned on COLIEE 2020): we concatenate the final sets of paragraphs selected by each model and remove duplicates, preserving the highest score. Then, we again apply the grid search method explained in the previous section to select the final set of paragraphs (a sketch of this combination is given below). It is important to note that our method does not combine scores between models. It ensures that only individual answers with a certain degree of confidence are kept in the final answer, which generally leads to an increase in precision. The final answer for each test example can be composed of individual answers from one model or from both.

We present our main results in Table 2. Our baseline BM25 method scores above the median of submissions in both COLIEE 2020 and 2021 (row 2 vs. 1a). This confirms that BM25 is a strong baseline, in agreement with results from other competitions such as the Health Misinformation and Precision Medicine tracks of TREC 2020 [27]. Our pretrained transformer models (rows 3, 4, and 5) score above BM25, the best submission of 2020 [24], and the second-best team of 2021. Likewise, our ensemble method effectively combines DeBERTa and monoT5 predictions, achieving the best score among all submissions (row 6). However, the performance of monoT5-zero-shot decreases when combined with DeBERTa (row 5 vs. 7), showing that monoT5-zero-shot is already a strong model on its own.

The most interesting comparison is between monoT5 and monoT5-zero-shot (rows 4 and 5). On the 2020 test data, monoT5 showed better results than monoT5-zero-shot. Hence, we decided to submit only the fine-tuned model to the 2021 competition. After the release of the ground-truth annotations of the 2021 test set, our evaluation of monoT5-zero-shot showed that it performs better than monoT5. A similar "inversion" pattern was found for DeBERTa vs. monoT5 (rows 3 and 4): DeBERTa was better than monoT5 on the 2020 test set, but the opposite happened on the 2021 test set. One explanation for these results is that we overfit to the 2020 test data, i.e., by (unintentionally) selecting techniques and hyperparameters that gave the best results on the 2020 test set as experiments progressed. However, this is unlikely to be the case for our fine-tuned monoT5 model, as our hyperparameter selection is fully automatic and maximized on the development set, whose data comes from COLIEE competitions prior to 2020. Another explanation is that there is a significant difference between the annotation methodologies of 2020 and 2021; consequently, models specialized to the 2020 data could suffer from this change. However, this is also unlikely since BM25 performed similarly in both years. Furthermore, we cannot confirm this hypothesis since it is difficult to quantify differences in the annotation process. Regardless of the reason for the inversion, our main finding is that our zero-shot model performed at least comparably to fine-tuned models on the 2020 test set and achieved the best single-model result on the 2021 test data.
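A minimal sketch of the ensemble combination described above, reusing the hypothetical select_paragraphs helper from the answer-selection sketch; the α, β, and γ values would come from the second grid search on the development set.

```python
def ensemble_answers(answers_a: dict, answers_b: dict,
                     alpha: float, beta: int, gamma: float) -> list:
    """Combine the final answer sets of two models (each a {paragraph_id: score}
    mapping restricted to that model's selected paragraphs): take the union,
    keep the highest score for duplicates (scores are never averaged across
    models), then re-apply the three selection rules."""
    merged = {}
    for pid, score in list(answers_a.items()) + list(answers_b.items()):
        merged[pid] = max(score, merged.get(pid, float("-inf")))
    return select_paragraphs(merged, alpha, beta, gamma)
```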
In Table 3, we show the ablation results for the answer selection method proposed in Section 3.5.

Table 3: Ablation on the 2020 data of the answer selection method presented in Section 3.5.

Our baseline answer selection method, which we refer to as "no rule" in the table, uses only the paragraph with the highest score as the final answer set, i.e., α = γ = 0 and β = 1. For all models, the proposed answer selection method gives improvements of 0.6 to 2 F1 points over this baseline.

We confirm a counter-intuitive result on a legal case entailment task: models with little or no adaptation to the target task can have better generalization abilities than models that have been carefully fine-tuned to the task at hand. Domain adversarial fine-tuning [41] and changes to the Adam optimizer [6, 49] have been proposed as valid approaches for fine-tuning Transformer models on small domain-specific datasets. However, whether these techniques could successfully be applied to the legal case entailment task, making models fine-tuned on target-task data perform better than zero-shot approaches, remains an open question. Therefore, although domain-specific language model pretraining and adjustments to the fine-tuning process are promising directions for future research, we believe that zero-shot approaches should not be ignored as strong baselines for such experiments. It should also be noted that our research has implications for future experiments beyond the scope of legal case entailment tasks. Based on previous work by Yin et al. [45, 46], it is possible that other legal tasks with limited labeled data, such as legal question answering, may benefit from our zero-shot approach.

REFERENCES
[1] MS MARCO: A Human Generated MAchine Reading COmprehension Dataset.
[2] LegalDB: Long DistilBERT for Legal Document Classification.
[3] UniLMv2: Pseudo-Masked Language Models for Unified Language Model Pre-Training.
[4] Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners.
[5] Deep learning in law: early adaptation and legal word embeddings trained on large corpora.
[6] Recall and Learn: Fine-tuning Deep Pretrained Language Models with Less Forgetting.
[7] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
[8] Multi-Task Deep Learning for Legal Document Translation, Summarization and Multi-Label Classification.
[9] Stop Illegal Comments: A Multi-Task Deep Learning Approach. AICCC '18: Proceedings of the 2018 Artificial Intelligence and Cloud Computing Conference, December 2018.
[10] BERT Goes to Law School: Quantifying the Competitive Advantage of Access to Large Legal Corpora in Contract Understanding.
[11] Rethink Training of BERT Rerankers in Multi-Stage Retrieval Pipeline.
[12] DeBERTa: Decoding-enhanced BERT with Disentangled Attention.
[13] Overview of COLIEE 2017.
[14] COLIEE-2018: Evaluation of the competition on legal information extraction and entailment.
[15] UnifiedQA: Crossing Format Boundaries With a Single QA System.
[16] Evangelos Kanoulas. 2020. A Benchmark for Lease Contract Review.
[17] BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension.
[18] Ronak Pradeep, and Rodrigo Nogueira. 2021. Pyserini: An Easy-to-Use Python Toolkit to Support Replicable IR Research with Sparse and Dense Representations.
[19] Pretrained Transformers for Text Ranking: BERT and Beyond.
[20] Pinghua Gong, Jieping Ye, and Changshui Zhang. 2020. Learning from Very Few Samples: A Survey.
[21] PROP: Pre-training with Representative Words Prediction for Ad-hoc Retrieval.
[22] Distributed representations of words and phrases and their compositionality.
[23] Efficient Estimation of Word Representations in Vector Space.
[24] JNLP Team: Deep Learning for Legal Processing in COLIEE 2020.
[25] Document Ranking with a Pretrained Sequence-to-Sequence Model.
[26] To Tune or Not to Tune? Adapting Pretrained Representations to Diverse Tasks.
[27] H2oloo at TREC 2020: When all you got is a hammer.
[28] Application of text entailment techniques.
[29] Combining similarity and transformer methods for case law entailment.
[30] A Summary of the COLIEE 2019 Competition.
[31] COLIEE 2020: Methods for Legal Document Retrieval and Entailment.
[32] Learning transferable visual models from natural language supervision.
[33] Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer.
[34] Overview of the TREC.
[35] Okapi at TREC-3.
[36] Exploiting cloze questions for few-shot text classification and natural language inference.
[37] Customizing Contextualized Language Models for Legal Document Reviews.
[38] Shashank Srivastava, and Colin Raffel. 2021. Improving and Simplifying Pattern Exploiting Training.
[39] Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. 2021. BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models.
[40] Attention is All you Need.
[41] Alexandra Chronopoulou, and Ion Androutsopoulos. 2020. Domain Adversarial Fine-Tuning as an Effective Regularizer.
[42] Overview of the TREC 2004 Robust Track.
[43] CAIL2018: A Large-Scale Legal Dataset for Judgment Prediction.
[44] Effects of inserting domain vocabulary and fine-tuning BERT for German legal language.
[45] Benchmarking Zero-shot Text Classification: Datasets, Evaluation and Entailment Approach.
[46] Universal Natural Language Processing with Limited Annotations: Try Few-shot Textual Entailment as a Start.
[47] ABCNN: Attention-based convolutional neural network for modeling sentence pairs.
[48] Rapidly Deploying a Neural Search Engine for the COVID-19 Open Research Dataset.
[49] 2021. Revisiting Few-sample BERT Fine-tuning.
[50] Tianyang Zhang, Zhiyuan Liu, and Maosong Sun. 2020. How Does NLP Benefit Legal System: A Summary of Legal Artificial Intelligence.
[51] Tianyang Zhang, Zhiyuan Liu, and Maosong Sun. 2020. How Does NLP Benefit Legal System: A Summary of Legal Artificial Intelligence.
[52] Tianyang Zhang, Zhiyuan Liu, and Maosong Sun. 2020. JEC-QA: A Legal-Domain Question Answering Dataset. Proceedings of the AAAI Conference on Artificial Intelligence.