title: Curriculum Learning Strategies for IR: An Empirical Study on Conversation Response Ranking
authors: Penha, Gustavo; Hauff, Claudia
date: 2020-03-17
journal: Advances in Information Retrieval
DOI: 10.1007/978-3-030-45439-5_46

Neural ranking models are traditionally trained on a series of random batches, sampled uniformly from the entire training set. Curriculum learning has recently been shown to improve neural models' effectiveness by sampling batches non-uniformly, going from easy to difficult instances during training. In the context of neural Information Retrieval (IR), curriculum learning has not been explored yet, and so it remains unclear (1) how to measure the difficulty of training instances and (2) how to transition from easy to difficult instances during training. To address both challenges and determine whether curriculum learning is beneficial for neural ranking models, we need large-scale datasets and a retrieval task that allows us to conduct a wide range of experiments. For this purpose, we resort to the task of conversation response ranking: ranking responses given the conversation history. In order to deal with challenge (1), we explore scoring functions to measure the difficulty of conversations based on different input spaces. To address challenge (2), we evaluate different pacing functions, which determine the velocity at which we go from easy to difficult instances. We find that, overall, by just intelligently sorting the training data (i.e., by performing curriculum learning) we can improve the retrieval effectiveness by up to 2% (the source code is available at https://github.com/Guzpenha/transformers_cl).

Curriculum Learning (CL) is motivated by the way humans teach complex concepts: teachers impose a certain order on the material during students' education. Following this guidance, students can exploit previously learned concepts to more easily learn new ones. This idea was initially applied to machine learning over two decades ago [8] as an attempt to use a similar strategy in the training of a recurrent network, starting small and gradually learning more difficult examples. More recently, Bengio et al. [1] provided additional evidence that curriculum strategies can benefit neural network training, with experimental results on tasks such as shape recognition and language modelling. Since then, empirical successes have been observed for several computer vision [14, 49] and natural language processing (NLP) tasks [36, 42, 60].

In supervised machine learning, a function is learnt by the learning algorithm (the student) based on inputs and labels provided by the teacher. The teacher typically samples randomly from the entire training set. In contrast, CL imposes a structure on the training set based on a notion of difficulty of instances, presenting easy instances to the student before difficult ones. When defining a CL strategy we face two challenges that are specific to the domain and task at hand [14]: (1) arranging the training instances by a sensible measure of difficulty, and (2) determining the pace at which to present instances, since going over easy instances too fast or too slow might lead to ineffective learning. We conduct here an empirical investigation into those two challenges in the context of IR.
Estimating relevance, a notion based on human cognitive processes, is a complex and difficult task at the core of IR, and it is still unknown to what extent CL strategies are beneficial for neural ranking models. This is the question we aim to answer in our work. Given a set of queries (for instance user utterances, search queries or questions in natural language) and a set of documents (for instance responses, web documents or passages), neural ranking models learn to distinguish relevant from non-relevant query-document pairs by training on a large number of labeled training pairs. Neural models have for some time struggled to display significant and additive gains in IR [53]. In a short time though, BERT [7] (released in late 2018) and its derivatives (e.g., XLNet [56], RoBERTa [25]) have proven to be remarkably effective for a range of NLP tasks. The recent breakthroughs of these large and heavily pre-trained language models have also benefited IR [54, 55, 57].

In our work we focus on the challenging IR task of conversation response ranking [50], where the query is the dialogue history and the documents are the candidate responses of the agent. The responses are not generated on the fly; they must be retrieved from a comprehensive dialogue corpus. A number of deep neural ranking models have recently been proposed for this task [43, 50, 52, 61, 62], which is more complex than retrieval for single-turn interactions, as the ranking model has to determine where the important information is in the previous user utterances (dialogue history) and how it is relevant to the current information need of the user. Due to the complexity of the relevance estimation problem in this task, we argue it is a good test case for curriculum learning in IR.

In order to tackle the first challenge of CL (determining what makes an instance difficult) we study different scoring functions that determine the difficulty of query-document pairs based on four different input spaces: the conversation history {U}, the candidate responses {R}, both {U, R}, and {U, R, Y}, where Y are the relevance labels for the responses. To address the second challenge (determining the pace at which to move from easy to difficult instances) we explore different pacing functions that serve easy instances to the learner for more or less time during the training procedure. We empirically explore how the curriculum strategies perform for two different response ranking datasets when compared against vanilla (no curriculum) fine-tuning of BERT for the task. Our main findings are that (i) CL improves retrieval effectiveness when we use a difficulty criterion based on a supervised model that uses all the available information {U, R, Y}, (ii) it is best to give the model more time to assimilate harder instances during training by introducing difficult instances in earlier iterations, and (iii) the CL gains over the no-curriculum baseline are spread over different conversation domains, lengths of conversations and measures of conversation difficulty.

Neural Ranking Models. Over the past few years, the IR community has seen a great uptake of the many flavours of deep learning for all kinds of IR tasks such as ad-hoc retrieval, question answering and conversation response ranking. Unlike traditional learning to rank (LTR) [24] approaches, in which we manually define features for queries, documents and their interaction, neural ranking models learn features directly from the raw textual data.
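To make the task input concrete, the sketch below shows a minimal, purely illustrative representation of a single training instance (dialogue context U, candidate responses R, relevance labels Y). The field names and example texts are assumptions for exposition and do not reflect the actual format of the datasets used later in the paper.

```python
# Illustrative only: one conversation response ranking training instance.
from dataclasses import dataclass
from typing import List

@dataclass
class RankingInstance:
    context: List[str]     # U: previous utterances u_1, ..., u_tau
    candidates: List[str]  # R: the true response plus sampled negatives
    labels: List[int]      # Y: 1 for the true response u_{tau+1}, 0 otherwise

instance = RankingInstance(
    context=["My laptop no longer boots after the latest update.",
             "Which Windows version are you running?",
             "Windows 10, build 1903."],
    candidates=["Try booting into safe mode and rolling back the update.",
                "You can change the wallpaper under Settings > Personalization."],
    labels=[1, 0],
)
```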
Neural ranking approaches can be roughly categorized into representation-focused [17, 38, 47] and interaction-focused [13, 48] models. The former learn query and document representations separately and then compute the similarity between the representations. In the latter approach, a query-document interaction matrix is first built, which is then fed to neural net layers. Estimating relevance directly based on interactions, i.e., with interaction-focused models, has been shown to outperform representation-based approaches on several tasks [16, 27]. Transfer learning via large pre-trained Transformers [46], the prominent case being BERT [7], has led to remarkable empirical successes on a range of NLP problems. The BERT approach to learning textual representations has also significantly improved the performance of neural models for several IR tasks [33, 37, 54, 55, 57], which for a long time struggled to outperform classic IR models [53]. In this work we use the no-CL BERT as a strong baseline for the conversation response ranking task.

Curriculum Learning. Following a curriculum that dictates the ordering and content of the educational material is prevalent in the context of human learning. With such guidance, students can exploit previously learned concepts to ease the learning of new and more complex ones. Inspired by cognitive science research [35], researchers posed the question of whether a machine learning algorithm could benefit, in terms of learning speed and effectiveness, from a similar curriculum strategy [1, 8]. Since then, positive evidence for the benefits of curriculum training, i.e., training the model using easy instances first and increasing the difficulty during the training procedure, has been empirically demonstrated for different machine learning problems, e.g., image classification [11, 14], machine translation [21, 30, 60] and answer generation [23].

Processing training instances in a meaningful order is not unique to CL. Another related branch of research focuses on dynamic sampling strategies [2, 4, 22, 39], which, unlike CL (which requires a definition of what is easy and difficult before training starts), estimate the importance of instances during the training procedure. Self-paced learning [22] simultaneously selects easy instances to focus on and updates the model parameters by solving a biconvex optimization problem. A seemingly contradictory set of approaches gives more focus to difficult or more uncertain instances. In active learning [4, 6, 44], the instances that are most uncertain with respect to the current classifier are employed for training. Similarly, hard example mining [39] focuses on difficult instances, measured for instance by the model loss or the magnitude of the gradients. Boosting [2, 59] techniques give more weight to difficult instances as training progresses. In this work we focus on CL, which has been more successful for neural models, and leave the study of dynamic sampling strategies in neural IR as future work.

The most critical part of using a CL strategy is defining the difficulty metric by which to sort instances. The estimation of instance difficulty is often based on our prior knowledge of what makes each instance difficult for a certain task and is thus domain dependent (cf. Table 1 for curriculum examples). CL strategies have not been studied yet for neural ranking models. To our knowledge, CL has only recently been employed in IR within the LTR framework, using LambdaMART [3], for ad-hoc retrieval by Ferro et al. [9].
However, no effectiveness improvements over randomly sampling training data were observed. The representation of the query, the document and their interactions in the traditional LTR framework is dictated by the manually engineered input features. We argue that neural ranking models, which learn how to represent the input, are better suited for applying CL in order to learn increasingly more complex concepts.

Table 1. Difficulty measures used in the curriculum learning literature.
• Sentence length: machine translation [30], language generation [42], reading comprehension [58]
• Word rarity: machine translation [30, 60], language modeling [1]
• External model confidence: machine translation [60], image classification [14, 49], ad-hoc retrieval [9]
• Supervision signal intensity: facial expression recognition [12], ad-hoc retrieval [9]
• Noise estimate: speaker identification [34], image classification [5]
• Human annotation: image classification [45] (through weak supervision)

Before introducing our experimental framework (i.e., the scoring functions and the pacing functions we investigate), let us first formally introduce the specific IR task we explore, a choice dictated by the complex nature of the task (compared to, e.g., ad-hoc retrieval) as well as the availability of large-scale training resources such as MSDialog [32] and UDC [26]. Given a historical dialogue corpus and a conversation (i.e., the user's current utterance and the conversation history), the task of conversation response ranking [43, 50, 52] is defined as the ranking of the most relevant response available in the corpus. This setup relies on the fact that a large corpus of historical conversation data exists and that adequate replies (that are coherent, well-formulated and informative) to user utterances can be found in it [51].

Let D = {(U_i, R_i, Y_i)}_{i=1}^{N} be an information-seeking conversations dataset consisting of N triplets: dialogue context, response candidates and response labels. The dialogue context U_i is composed of the previous utterances {u_1, u_2, ..., u_τ} at turn τ of the dialogue. The candidate responses R_i = {r_1, r_2, ..., r_k} are either the true response (u_{τ+1}) or negative sampled candidates. The relevance labels Y_i = {y_1, y_2, ..., y_k} indicate the responses' binary relevance scores: 1 if r = u_{τ+1} and 0 otherwise. The task is then to learn a ranking function f(.) that is able to generate a ranked list for the set of candidate responses R_i based on their predicted relevance scores f(U_i, r).

Curriculum Framework. When training neural networks, the common procedure is to divide the dataset D into D_train, D_dev and D_test, sample batches {B_1, ..., B_M} randomly (i.e., uniformly, so that every sample has the same likelihood of being sampled) from D_train, and perform the optimization procedure sequentially on them. The CL framework employed here is inspired by previous work [30, 49]. It is defined by two functions: the scoring function, which determines the difficulty of instances, and the pacing function, which controls the pace with which to transition from easy to hard instances during training. More specifically, the scoring function f_score(U_i, R_i, Y_i) is used to sort the training dataset from easy to difficult instances. The pacing function f_pace(s) determines the percentage of the sorted dataset available for sampling at the current training step s (one forward pass plus one backward pass of a batch is considered to be one step).
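The following is a minimal sketch (in Python, with illustrative names) of the curriculum batching mechanism just described: the training set is sorted once by f_score, and at each step a batch is sampled uniformly from the fraction of the sorted data that f_pace currently makes available. This is not the authors' implementation, which is available in the linked repository.

```python
import random

def curriculum_batches(train_set, f_score, f_pace, num_steps, batch_size):
    """Yield one batch per training step under a curriculum.

    train_set:  list of training instances
    f_score:    instance -> difficulty score (lower = easier)
    f_pace:     step s -> fraction of the sorted data available for sampling
    """
    # Sort once, from easy to difficult, according to the scoring function.
    sorted_data = sorted(train_set, key=f_score)
    for s in range(num_steps):
        # Only the first f_pace(s) * |D_train| instances (at least one
        # batch's worth) are eligible at this step.
        cutoff = max(batch_size, int(f_pace(s) * len(sorted_data)))
        available = sorted_data[:cutoff]
        # Sample uniformly from the currently available (easier) portion.
        yield random.sample(available, k=min(batch_size, len(available)))
```

Once f_pace reaches 1 (at step T), the sampler reduces to standard uniform sampling over the full training set.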
The neural ranking model samples uniformly from the initial f_pace(s) * |D_train| instances sorted by f_score, while the rest of the dataset is not available for sampling. During training, f_pace(s) goes from δ (the initial percentage of the training data) to 1 when s = T. Both δ and T are hyperparameters. We provide an illustration of the training process in Fig. 1.

Scoring Functions. The scoring functions we consider are formally defined in Table 2; we now provide intuitions for why we believe each function captures some notion of instance difficulty.
• # turns (U) and # U words (U): The important information in the context can be spread over different utterances and words. A bigger dialogue context means there are more places over which the important part of the user's information need can be spread.
• # R words (R): Longer responses can distract the model as to which set of words or sentences is more important for matching. Previous work shows that it is possible to fool machine reading models by creating longer documents with additional distracting sentences [18].
• σ_SM(U, R) and σ_BM25(U, R): Inspired by the query performance prediction literature [40], we use the variance of retrieval scores to estimate the amount of heterogeneity of information, i.e., diversity, in the candidate responses. Homogeneous ranked lists are considered to be easy. We deploy a semantic matching model (SM) and BM25 to capture both semantic correspondences and keyword matching [19]. SM is the average cosine similarity between the first k words from U (concatenated utterances) and the first k words from r, using pre-trained word embeddings.
• BERT_pred(U, R, Y) and BERT_loss(U, R, Y): Inspired by the CL literature [14], we use external model prediction confidence scores as a measure of difficulty. We fine-tune BERT [7] on D_train for the conversation response ranking task. For BERT_pred, easy dialogue contexts are the ones for which the BERT confidence score for the positive response candidate r+ is higher than the confidence for the negative response candidate r−; the higher the difference, the easier the instance. For BERT_loss, we consider the loss of the model on an instance to be an indicator of its difficulty.
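A hedged sketch of the context- and response-based scoring functions above (the BERT-based criteria are sketched later). Tokenization is simplified to whitespace splitting, which may differ from the paper's preprocessing; each function returns a larger value for presumably harder instances, so it can be plugged directly into the sorting step of the curriculum sampler sketched earlier.

```python
import statistics

def num_turns(context):                       # f_score over {U}
    return len(context)

def num_context_words(context):               # f_score over {U}
    return sum(len(u.split()) for u in context)

def num_response_words(candidates):           # f_score over {R}
    return sum(len(r.split()) for r in candidates)

def retrieval_score_variance(scores):         # sigma_BM25 / sigma_SM over {U, R}
    # 'scores' holds one retrieval score (BM25 or average-cosine SM) per
    # candidate response; a homogeneous (low-variance) ranked list is
    # considered easy, so the variance itself serves as the difficulty score.
    return statistics.pvariance(scores)
```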
Pacing Functions. Assuming that we know the difficulty of each instance in our training set, we still need to define how we are going to transition from easy to hard instances. We use the concept of pacing functions f_pace(s); they should each have the following properties [30, 49]: (i) start at an initial value of training instances f_pace(0) = δ with δ > 0, so that the model has a number of instances to train on in the first iteration, (ii) be non-decreasing, so that harder instances are added to the training set, and (iii) eventually make all instances available for sampling after T iterations, f_pace(T) = 1. As is intuitively visible in the example in Fig. 2, we opted for pacing functions that introduce more difficult instances at different paces: while root_10 introduces difficult instances very early (after 125 iterations, 80% of all training data is available), geom_progression introduces them very late (80% is available only after ∼800 iterations). We consider four different types of pacing functions, formally defined in Table 3. The step function [1, 14, 41] divides the data into S fixed-size groups, and after T/S iterations a new group of instances is added, where S is a hyperparameter. A more gradual transition was proposed by Platanios et al. [30]: a percentage of the training dataset is added linearly with respect to the total number of CL iterations T, and thus the slope of the function is (1 − δ)/T (linear function). They also proposed root_n functions, motivated by the fact that difficult instances will be sampled less often as the training data grows in size during training; by making the slope inversely proportional to the current training data size, the model has more time to assimilate difficult instances. Finally, we propose the use of a geometric progression that, instead of quickly adding difficult examples, gives easier instances more training time.
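A hedged sketch of pacing functions with the properties listed above. The parameterizations below are consistent with the descriptions in the text (linear slope (1 − δ)/T, root-shaped growth, geometric growth) but the exact formulas in the paper's Table 3 may differ in detail; any of them can be passed as f_pace to the curriculum sampler sketched earlier.

```python
import math

# Each function maps the training step s to the fraction of the
# difficulty-sorted training data available for sampling; all start at
# delta for s = 0 and reach 1.0 at s = T.

def linear(s, T, delta):
    return min(1.0, delta + (1.0 - delta) * s / T)

def root_n(s, T, delta, n=2):
    # Grows quickly at first and slowly later, so the harder instances that
    # are introduced later get more of the remaining training time.
    return min(1.0, (s * (1.0 - delta ** n) / T + delta ** n) ** (1.0 / n))

def step(s, T, delta, S=3):
    # Adds one of S equally sized groups of instances every T/S steps.
    return min(1.0, delta + (1.0 - delta) * math.floor(s * S / T) / S)

def geom_progression(s, T, delta):
    # Multiplicative growth: the available fraction stays small for a long
    # time, so difficult instances are only introduced late in training.
    return min(1.0, delta * 2.0 ** (s * math.log2(1.0 / delta) / T))
```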
Datasets. We consider two large-scale information-seeking conversation datasets (cf. Table 4) that allow the training of neural ranking models for conversation response ranking. MSDialog [32] contains 246K context-response pairs, built from 35.5K information-seeking conversations from the Microsoft Answer community, a question-answer forum for several Microsoft products. MANtIS [29] was created by us and contains 1.3 million context-response pairs built from conversations of 14 different Stack Exchange sites. Each MANtIS conversation fulfills the following conditions: (i) it takes place between exactly two users (the information seeker, who starts the conversation, and the information provider); (ii) it consists of at least two utterances per user; (iii) one of the provider's utterances contains a hyperlink, providing grounding; (iv) if the final utterance belongs to the seeker, it contains positive feedback. We created MANtIS to cover diverse conversations from different domains besides technical ones. We include MSDialog [31, 32, 52] here as a widely used benchmark.

As a strong neural ranking model for our experiments, we employ BERT [7] for the conversation response ranking task. We follow recent research in IR that employed fine-tuned BERT for retrieval tasks [28, 55] and obtain strong baseline (i.e., no CL) results for our task. The best model by Yang et al. [52], which relies on external knowledge sources for MSDialog, achieves a MAP of 0.68, whereas our BERT baseline reaches a MAP of 0.71 (cf. Table 5). We fine-tune BERT for sentence classification, using the CLS token (the BERT authors suggest CLS as a starting point for sentence classification tasks [7]); the input is the concatenation of the dialogue context and the candidate response, separated by SEP tokens. When training BERT we employ a balanced number of relevant and non-relevant context-response pairs (we observed similar results with a 1:10 ratio in initial experiments). We use the cross-entropy loss and the Adam optimizer [20] with a learning rate of 5e-5 and ε = 1e-8. For σ_SM, we use pre-trained fastText embeddings (https://fasttext.cc/docs/en/crawl-vectors.html) with 300 dimensions and a maximum length of k = 20 words for dialogue contexts and responses. For σ_BM25, we use the default values (https://radimrehurek.com/gensim/summarization/bm25.html) of k1 = 1.5, b = 0.75 and ε = 0.25. For CL, we fix T as 90% of the total training iterations (this means that we continue training for the final 10% of iterations after introducing all samples) and the initial number of instances δ as 33% of the data, to avoid sampling the same instances several times. To compare our strategies with the baseline where no CL is employed, for each approach we fine-tune BERT five times with different random seeds (to rule out that the results are observed only for certain random weight initialization values) and for each run we select the model with the best observed effectiveness on the development set. The best model of each run is then applied to the test set. We report effectiveness in terms of Mean Average Precision (MAP), like prior works [50, 52]. We perform paired Student's t-tests between each scoring/pacing-function variant and the baseline run without CL.

We first report the results for the pacing functions (Fig. 3), followed by the main results (Table 5) comparing different scoring functions. We finish with an error analysis to understand when CL outperforms our no-curriculum baseline.

Pacing Functions. In order to understand how CL results are impacted by the pace at which we go from easy to hard instances, we evaluate the different proposed pacing functions. We display the evolution of the development set MAP (average of 5 runs) during training in Fig. 3 (we use development MAP to track effectiveness during training). We fix the scoring function as BERT_pred, the best performing scoring function (more details in the next section). We see that the pacing functions with the maximum observed average MAP are root_2 and root_5 for MSDialog and MANtIS respectively. (If we increase the n of the root function to larger values, e.g., root_10, the results drop and get closer to not using CL; a higher n generates root functions with a shape similar to standard training, giving the same amount of time to easy and hard instances, cf. Fig. 2.) The other pacing functions, linear, geom_progression and step, also outperform the standard training baseline with statistical significance on the test set and yield results similar to the root_2 and root_5 functions. Our results are aligned with previous research on CL [30]: giving the model more time to assimilate harder instances (by using a root pacing function) is beneficial to the curriculum strategy, and it is better than no CL with statistical significance on both the development and test sets. For the rest of our experiments we fix the pacing function as root_2, the best pacing function for MSDialog. Let's now turn to the impact of the scoring functions.

Scoring Functions. The most critical challenge of CL is defining a measure of the difficulty of instances. In order to evaluate the effectiveness of our scoring functions we report the test set results across both datasets in Table 5. We observe that the scoring functions which do not use the relevance labels Y are not able to outperform the no-CL baseline (random scoring function). They are based on features of the dialogue context U and responses R that we hypothesized make instances difficult for a model to learn. In contrast, for BERT_loss and BERT_pred we observe statistically significant improvements on both datasets across different runs. They differ in two ways from the unsuccessful scoring functions: they have access to the training labels Y, and the difficulty of an instance is based on what a previously trained model determines to be hard, rather than on our intuition. Our results bear resemblance to Born-Again Networks [10], where a student model that is identical in parameters and architecture to the teacher model outperforms the teacher when trained with knowledge distillation [15], i.e., using the predictions of the teacher model as labels for the student model. The difference here is that instead of transferring the knowledge from the teacher to the student through the labels, we transfer the knowledge by imposing a structure/order on the training set, i.e., curriculum learning.
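To make the BERT_pred criterion more concrete, below is a hedged sketch of how a per-instance difficulty score could be derived from a BERT cross-encoder's confidence gap between a positive and a negative candidate. It assumes a recent version of the HuggingFace Transformers library; the checkpoint name is a placeholder (in the paper, BERT is first fine-tuned on D_train before being used for scoring), and the authors' actual implementation lives in the linked repository.

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Placeholder checkpoint; the paper scores instances with a model that was
# already fine-tuned on D_train for conversation response ranking.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model.eval()

def relevance_prob(context_utterances, response):
    # The concatenated dialogue context and the candidate response are the
    # two segments of the BERT input; the tokenizer adds [CLS]/[SEP] itself.
    context = " ".join(context_utterances)
    inputs = tokenizer(context, response, truncation=True,
                       max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()  # P(relevant)

def bert_pred_difficulty(context_utterances, positive, negative):
    # A large confidence gap between the positive and the negative candidate
    # indicates an easy instance, so the negated gap is used as difficulty.
    gap = relevance_prob(context_utterances, positive) - \
          relevance_prob(context_utterances, negative)
    return -gap
```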
Error Analysis. In order to understand when CL performs better than random training samples, we fix the scoring function (BERT_pred) and the pacing function (root_2) and explore the test set effectiveness along several dimensions (cf. Figs. 4 and 5). We report the results only for MSDialog, but the trends hold for MANtIS as well. We first consider the number of turns in the conversation in Fig. 4. CL outperforms the baseline approach for the types of conversations appearing most frequently (2-5 turns in MSDialog). Both the CL-based and the baseline effectiveness drop for conversations with a large number of turns. This can be attributed to two factors: (1) employing pre-trained BERT in practice allows only a certain maximum number of tokens as input, so longer conversations can lose important information due to truncation; (2) for longer conversations it is harder to identify the important information to match in the history, i.e., the information is more spread out. Next, we look at different conversation domains in Fig. 5 (left), such as physics and askubuntu: are the gains in effectiveness limited to particular domains? The error bars indicate 95% confidence intervals. We list only the most common domains in the test set. The gains of CL are spread over different domains as opposed to being concentrated in a single domain. Lastly, using our scoring functions we sort the test instances and divide them into three buckets: the first 33% of instances, 33%-66% and 66%-100%. In Fig. 5 (right), we see the effectiveness of CL against the baseline for each bucket using # U words (the same trend holds for the other scoring functions). As we expect, the bucket with the most difficult instances according to the scoring function is the one with the lowest MAP values. Finally, the improvements of CL over the baseline are again spread across the buckets, showing that CL is able to improve over the baseline for different levels of difficulty.

In this work we studied whether CL strategies are beneficial for neural ranking models. We find supporting evidence for curriculum learning in IR. Simply reordering the instances in the training set using a difficulty criterion leads to effectiveness improvements, requiring no changes to the model architecture; a similar relative improvement in MAP has justified novel neural architectures in the past [43, 50, 61, 62]. Our experimental results on two conversation response ranking datasets reveal (as one might expect) that it is best to use all available information (U, R, Y) as evidence for instance difficulty. Future work directions include considering other retrieval tasks, different neural architectures and an investigation of the underlying reasons for CL's workings.

References
[1] Curriculum learning. In: ICML
[2] Arcing classifier
[3] From RankNet to LambdaRank to LambdaMART: an overview
[4] Active bias: training more accurate neural networks by emphasizing high variance samples
[5] Webly supervised learning of convolutional networks
[6] Active learning with statistical models
[7] BERT: pre-training of deep bidirectional transformers for language understanding
[8] Learning and development in neural networks: the importance of starting small
[9] Continuation methods and curriculum learning for learning to rank
[10] Born-again neural networks. In: ICML
[11] Multi-modal curriculum learning for semi-supervised image classification
[12] Curriculum learning for facial expression recognition
[13] A deep relevance matching model for ad-hoc retrieval
[14] On the power of curriculum learning in training deep networks
[15] Distilling the knowledge in a neural network
[16] Convolutional neural network architectures for matching natural language sentences
[17] Learning deep structured semantic models for web search using clickthrough data
[18] Adversarial examples for evaluating reading comprehension systems
[19] Bridging the gap between relevance matching and semantic matching for short text similarity modeling
[20] Adam: a method for stochastic optimization
[21] Curriculum learning and minibatch bucketing in neural machine translation
[22] Self-paced learning for latent variable models
[23] Curriculum learning for natural answer generation
[24] Learning to rank for information retrieval
[25] RoBERTa: a robustly optimized BERT pretraining approach
[26] The Ubuntu dialogue corpus: a large dataset for research in unstructured multi-turn dialogue systems
[27] Empirical study of multi-level convolution models for IR based on representations and interactions
[28] Passage re-ranking with BERT
[29] Introducing MANtIS: a novel multi-domain information seeking dialogues dataset
[30] Competence-based curriculum learning for neural machine translation
[31] User intent prediction in information-seeking conversations
[32] Analyzing and characterizing user intent in information-seeking conversations
[33] BERT with history answer embedding for conversational question answering
[34] Curriculum learning based approaches for noise robust speaker recognition
[35] Language acquisition in the absence of explicit negative evidence: how important is starting small?
[36] Easy questions first? A case study on curriculum learning for question answering
[37] FAQ retrieval using query-question similarity and BERT-based query-answer relevance
[38] A latent semantic model with convolutional-pooling structure for information retrieval
[39] Training region-based object detectors with online hard example mining
[40] Predicting query performance by query-drift estimation
[41] Image difficulty curriculum for generative adversarial networks (CuGAN)
[42] Adversarial generation of natural language
[43] One time of interaction may not be enough: go deep with an interaction-over-interaction network for response selection in dialogues
[44] Support vector machine active learning with applications to text classification
[45] How hard can it be? Estimating the difficulty of visual search in an image
[46] Attention is all you need
[47] A deep architecture for semantic matching with multiple positional sentence representations
[48] Match-SRNN: modeling the recursive matching structure with spatial RNN
[49] Curriculum learning by transfer learning: theory and experiments with deep networks
[50] Sequential matching network: a new architecture for multi-turn response selection in retrieval-based chatbots
[51] A hybrid retrieval-generation neural conversation model
[52] Response ranking with deep matching networks and external knowledge in information-seeking conversation systems
[53] Critically examining the neural hype: weak baselines and the additivity of effectiveness gains from neural ranking models
[54] End-to-end open-domain question answering with BERTserini
[55] Simple applications of BERT for ad hoc document retrieval
[56] XLNet: generalized autoregressive pretraining for language understanding
[57] Cross-domain modeling of sentence-level evidence for document retrieval
[58] End-to-end answer chunk extraction and ranking for reading comprehension
[59] Boosting neural machine translation
[60] An empirical exploration of curriculum learning for neural machine translation
[61] Modeling multi-turn conversation with deep utterance aggregation
[62] Multi-turn response selection for chatbots with deep attention matching network

Acknowledgements. This research has been supported by NWO projects SearchX (639.022.722) and NWO Aspasia (015.013.027).