key: cord-0638503-rb4um889
authors: Bruyn, Maxime De; Lotfi, Ehsan; Buhmann, Jeska; Daelemans, Walter
title: ConveRT for FAQ Answering
date: 2021-08-02
journal: nan
DOI: nan
sha: 0fa8cc230d73e11f5dfd10c76f077e3056ac3931
doc_id: 638503
cord_uid: rb4um889

Knowledgeable FAQ chatbots are a valuable resource to any organization. While powerful and efficient retrieval-based models exist for English, this is rarely the case for other languages, for which the same amount of training data is not available. In this paper, we propose a novel pre-training procedure to adapt ConveRT, an English conversational retriever model, to other languages with less training data available. We apply it for the first time to the task of Dutch FAQ answering related to the COVID-19 vaccine. We show it performs better than an open-source alternative in both a low-data regime and a high-data regime.

In this paper, we present a Dutch-language FAQ retrieval system trained using a limited amount of training data. FAQ answering is the task of retrieving the right answer given a new user query. It is widely used in chatbots and has been studied for many years [6, 22, 9, 18, 10, 20], although attention has shifted towards extractive question answering more recently [19], probably because of a lack of dedicated datasets.

FAQ answering systems typically use retrieval systems [6, 22, 9, 18, 10, 20] rather than generative models grounded on external knowledge [13, 4, 14]. The generative approach is more flexible as it is able to generate new answers. However, these models suffer from knowledge hallucinations [21], limiting their usefulness in a corporate environment.

Most previous research on FAQ retrieval and non-factoid question answering has focused on English. ConveRT [7], a response selection module available within Rasa [1], caught our attention as it is effective and does not require a GPU at inference time. Unfortunately, it is only available in English.
Despite having significantly less conversational training data (400K pairs of utterances) than the original ConveRT model (727M pairs), we successfully trained the same model for Dutch. Our contributions are the following:

- We show it is possible to train a ConveRT model for a non-English language using a limited number of conversation pairs by adopting a two-phase pre-training approach (general and conversational).

An FAQ dataset consists of pairs of questions and answers. The FAQ retrieval task involves ranking the available answers for a given user query. There are three ways to solve this problem: matching a new user query against the available questions, against the answers, or against the concatenation of both. FAQ retrieval methods can be broadly divided into four categories: lexical, supervised, unsupervised, and conversational.

Lexical. To our knowledge, FAQ-Finder [6] was the first system to explicitly study the task of FAQ retrieval. It matches user queries to the FAQ questions of the Usenet dataset using TF-IDF. FAQ-Finder was later improved by including the similarity to the answer (on top of the similarity to the question) [23]. Another improvement comes from adding a rule-based layer on top of the TF-IDF module [22].

Unsupervised. Another approach is to use unsupervised techniques to retrieve the right FAQ pair given a new user query. One possible way is to use Latent Semantic Analysis (LSA) to overcome the lexical mismatch between related queries [11].

Supervised. The first supervised methods were developed using tree kernels and SVMs [15]. BERT methods were later developed specifically for the task of FAQ retrieval [20].

Conversational. In this paper, we propose a fourth type not yet explored in the literature: conversational. FAQ retrieval can be treated as a special case of conversational modeling: retrieving the answer is similar to retrieving the next utterance in a conversation.
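To make the task concrete, the lexical approach and the three matching targets listed above (questions, answers, or their concatenation) can be sketched in a few lines of Python. This is a minimal illustration and not the implementation of FAQ-Finder or of any other system cited here: the tokenizer, the smoothed IDF formula, the function names (`tokenize`, `build_index`, `rank_answers`) and the toy FAQ entries are all assumptions made for the example.

```python
import math
from collections import Counter

PUNCT = ".,?!:;()\"'"

def tokenize(text):
    # Naive lowercase whitespace tokenizer (an assumption, for illustration only).
    tokens = (t.strip(PUNCT) for t in text.lower().split())
    return [t for t in tokens if t]

def tfidf_vector(tokens, idf):
    # Sparse TF-IDF vector; tokens unseen when the index was built are ignored.
    counts = Counter(tokens)
    return {tok: c * idf[tok] for tok, c in counts.items() if tok in idf}

def build_index(faq, target="question"):
    # target selects what the query is matched against:
    # "question", "answer", or "both" (concatenation of question and answer).
    if target == "question":
        docs = [q for q, _ in faq]
    elif target == "answer":
        docs = [a for _, a in faq]
    elif target == "both":
        docs = [q + " " + a for q, a in faq]
    else:
        raise ValueError("target must be 'question', 'answer' or 'both'")
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency (one common variant among several).
    idf = {tok: math.log((1 + n) / (1 + c)) + 1.0 for tok, c in df.items()}
    return idf, [tfidf_vector(toks, idf) for toks in tokenized]

def cosine(u, v):
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank_answers(query, faq, target="question"):
    # Return (answer, score) pairs sorted from most to least similar.
    idf, vectors = build_index(faq, target)
    q_vec = tfidf_vector(tokenize(query), idf)
    scores = [cosine(q_vec, v) for v in vectors]
    order = sorted(range(len(faq)), key=lambda i: scores[i], reverse=True)
    return [(faq[i][1], scores[i]) for i in order]

# Toy FAQ invented for this example; it is not taken from the paper's dataset.
FAQ = [
    ("How do I reset my password?", "Use the reset link on the login page."),
    ("What are your opening hours?", "We are open on weekdays from nine to five."),
    ("How can I contact support?", "Send an email to the support desk."),
]

if __name__ == "__main__":
    for answer, score in rank_answers("when are you open", FAQ, target="both"):
        print(round(score, 3), answer)
```

Switching `target` between `"question"`, `"answer"` and `"both"` changes only which text is indexed, which keeps the comparison between the three matching strategies straightforward.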
Dual-encoder architectures, pre-trained on response selection, have become increasingly popular in the dialog community due to their simplicity and ease of control [8, 2]. There are two options when it comes to retrieving the next utterance. One can either encode the two sentences separately (dual-encoder) [7], or simultaneously (cross-encoder) [3]. Dual-encoders are faster than cross-encoders as they can cache the answer representations. ConveRT [7] is a dual-encoder pre-trained on a large-scale conversational dataset. Thanks to various design optimizations (such as using single-headed self-attention), ConveRT vastly reduces the size of the model. In this work, we choose to focus on ConveRT as it has a low computational cost and does not require a GPU for inference.

In this section, we give a brief overview of the ConveRT (Conversational Representations from Transformers) model [7]. The objective of the model is to generate vector representations of utterances such that the two representations of a given pair are as similar as possible (in terms of dot product). ConveRT takes as input the sequences of tokens of the two utterances. Both sequences are tokenized using the same byte pair encoding vocabulary. The ConveRT architecture (Fig. 1) is composed of three distinct parts: the embedding layer, the Transformer block and the feed-forward layers.

Embedding. The first element stores the embeddings for the subwords and position tokens. Embeddings are shared for the input and response representations. Unlike the original Transformer architecture [24], ConveRT uses two positional encoding matrices of different sizes to handle sequences longer than those seen during training. We refer the reader to the original paper for a detailed description [7].

Transformer Block. The next element is the Transformer block. It closely follows the original Transformer architecture [24] with some notable differences.
First, the model uses single-headed self-attention with a 64-dimensional projection for computing the attention weights. Second, the model applies a two-headed self-attention after the six Transformer layers. The parameters of the Transformer block are fully shared for the input and response sides. ConveRT uses the square-root-of-N reduction [2] to convert the embedding sequences to fixed-dimensional vectors.

Feed Forward. The last elements are a series of feed-forward hidden layers with skip connections. Their parameters are not shared between the input and response sides: each side has its own feed-forward network.

The training objective of ConveRT is to select the right response given a question from a question-answer pair. The relevance of each response to a given input is quantified with a dot product between the input and response representations. Training proceeds in batches of K pairs of utterances. The objective is to distinguish between the true relevant responses and irrelevant negative examples (we use other responses from the batch as negative examples). ConveRT uses cross-entropy as the loss function. The model is optimized with Adam [12] and L2 weight decay. The learning rate is warmed up over the first 10,000 steps to a peak value and then linearly decayed.

In this section, we explain our approach to training a ConveRT model for Dutch. To overcome the limited supply of conversational data available in Dutch, we use a two-stage pre-training: general pre-training on a large open-domain corpus, and conversational pre-training using a smaller conversational dataset from Reddit.

The original ConveRT model was developed for English using a large-scale conversational dataset from Reddit. We did not have access to such a dataset for Dutch. Instead, we chose to split the problem in two. First, we pre-train the model on a general Dutch corpus. Second, we pre-train it on a smaller Dutch conversational corpus from Reddit.
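Before turning to the datasets, the ConveRT training objective described above (dot-product relevance, other responses in the batch as negatives, cross-entropy loss) can be illustrated with a minimal sketch. It operates on plain Python lists that stand in for already-encoded input and response vectors; the function names `dot` and `in_batch_cross_entropy` are hypothetical, and details of the actual model such as any normalisation or scaling of the scores are deliberately left out.

```python
import math

def dot(u, v):
    # Dot product between two equal-length vectors.
    return sum(a * b for a, b in zip(u, v))

def in_batch_cross_entropy(inputs, responses):
    """Mean cross-entropy over a batch of K (input, response) vector pairs.

    For input i, response i is the positive example and the other K - 1
    responses in the batch serve as negatives.
    """
    k = len(inputs)
    if k == 0 or k != len(responses):
        raise ValueError("inputs and responses must be non-empty and aligned")
    total = 0.0
    for i, x in enumerate(inputs):
        scores = [dot(x, y) for y in responses]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        # Negative log-probability of the true response under a softmax.
        total += log_z - scores[i]
    return total / k

if __name__ == "__main__":
    # Toy vectors invented for the example: each input matches its own response.
    inputs = [[1.0, 0.0], [0.0, 1.0]]
    responses = [[5.0, 0.0], [0.0, 5.0]]
    print(in_batch_cross_entropy(inputs, responses))
```

With this formulation a larger batch supplies more negatives per input at no extra encoding cost, since every response vector in the batch is reused across all inputs.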
General Dataset. We consider the same Dutch-language corpora as BERTje [5], a successful Dutch BERT model:

- Books: a collection of contemporary and historical fiction novels
- TwNC [17]: a multifaceted Dutch news corpus
- SoNaR-500 [16]: a multi-genre reference corpus
- Web news
- Wikipedia

In total, this is about 12GB of uncompressed text. To match the setup expected by ConveRT (the tokens of a pair of utterances), we first split each paragraph into sentences. Next, we save pairs of sentences and treat them as pairs of input and response. To avoid small inputs, we filter out pairs with fewer than 64 characters. After transformation, the general corpus dataset for pre-training has 110M pairs.

We also consider a Dutch conversational dataset, for which we downloaded comments from around 200 Dutch subreddits. Non-Dutch comments were filtered out. After filtering for the language, we arrive at a size of 400K pairs of utterances.

We followed the training procedure of ConveRT, except for the number of epochs and the batch size. For the conversational pre-training, we trained for 10 epochs with a fixed batch size of 2048.

Table 1. Accuracy on the COVID-19 vaccination FAQ dataset per split of increasing size. Split one has one training example per answer, while split ten has ten training examples. Pre-training ConveRT on both a general dataset and a conversational dataset provides the best results on this task.

In this section, we fine-tune our model on a corpus of FAQs related to the COVID-19 vaccine. We then perform an ablation study to analyze which part of the pre-training has the most impact on the downstream performance. To have a better understanding of how our model would perform in the real world, we study its performance as the number of training examples increases.

We test the performance of our model on a proprietary dataset. The dataset was collected while running a COVID-19 vaccination FAQ bot with Rasa.
It consists of 1,200 questions for 76 distinct answers.

As our higher objective is to use this model in a Rasa chatbot, we compare our Dutch ConveRT model to a baseline response retrieval model developed by Rasa. All models are trained using the same number of epochs and dropout probability.

When starting out, FAQ bots usually have a one-to-one mapping between questions and answers (one question for one answer). As the number of users increases, the number of available questions per answer also increases. To evaluate the generalization capabilities of our model in a low-data scenario, we artificially create datasets of increasing sizes, which we call splits. The first split has one training example per answer (the same as when someone starts a new FAQ chatbot), the second split has two training examples per answer, and so on until split ten. We also generate a test set by randomly selecting (and removing from the training set) one training example per answer.

Table 1 confirms our intuition that the baseline accuracy of the Rasa model improves sharply with the number of training examples. In our analysis, the accuracy increases by a factor of 3 from split 1 to split 10. The results also show that a ConveRT model without any pre-training underperforms the baseline on every split. General pre-training modestly improves the model's performance, but the results are not significantly different from the baseline. Conversational pre-training alone (without any general pre-training) shows a consistent improvement over the baseline. The gain is more visible in the low-data regime than in the high-data regime. The Dutch ConveRT model reveals its true power when pre-trained on both a general corpus and a conversational corpus, as it outperforms the baseline by a wide margin on every split.

We have successfully pre-trained, fine-tuned, and evaluated a Dutch ConveRT model.
This model consistently outperforms a baseline response selector from Rasa on a COVID-19 vaccine FAQ dataset. Conversational datasets for non-English languages are scarce. Our two-phase pre-training procedure bypasses this problem by first pre-training on a general corpus, then pre-training on a smaller conversational corpus. In future work, we plan on extending the two-stage training to additional languages and additional domains.

[1] Rasa: Open source language understanding and dialogue management
[2] Universal sentence encoder for English
[3] Optimized transformer models for FAQ answering
[4] BART for knowledge grounded conversations
[5] BERTje: A Dutch BERT model
[6] FAQ Finder: a case-based approach to knowledge navigation
[7] ConveRT: Efficient and accurate conversational representations from transformers
[8] Training neural response selection for task-oriented dialogue systems
[9] Retrieving answers from frequently asked questions pages on the web
[10] FAQIR - a frequently asked questions retrieval test collection
[11] Cluster-based FAQ retrieval using latent term weights
[12] Adam: A method for stochastic optimization
[13] Internet-augmented dialogue generation
[14] Teach me what to say and I will learn what to pick: Unsupervised knowledge selection through response generation with pretrained generative models
[15] Exploiting syntactic and shallow semantic kernels for question answer classification
[16] The construction of a 500-million-word reference corpus of contemporary written Dutch
[17] TwNC: a multifaceted Dutch news corpus
[18] Statistical machine translation for query expansion in answer retrieval
[19] QA dataset explosion: A taxonomy of NLP resources for question answering and reading comprehension
[20] FAQ retrieval using query-question similarity and BERT-based query-answer relevance
[21] Retrieval augmentation reduces hallucination in conversation
[22] Automated FAQ answering: Continued experience with shallow language understanding
[23] Retrieval models and Q and A learning with FAQ files

This research received funding from the Flemish Government under the "Onderzoeksprogramma Artificiële Intelligentie (AI) Vlaanderen" programme. We also thank the reviewers for their helpful comments.