Continuous representations of intents for dialogue systems
Sindre André Jacobsen, Anton Ragni
2021-05-08

Intent modelling has become an important part of modern dialogue systems. With the rapid expansion of practical dialogue systems and virtual assistants, such as Amazon Alexa, Apple Siri, and Google Assistant, the interest has only increased. However, until recently the focus has been on detecting a fixed, discrete number of seen intents. Recent years have seen some work on unseen intent detection in the context of zero-shot learning. This paper continues the prior work by proposing a novel model where intents are continuous points placed in a specialist intent space, which yields several advantages. First, the continuous representation makes it possible to investigate relationships between the seen intents. Second, it allows any unseen intent to be reliably represented given limited quantities of data. Finally, this paper shows how the proposed model can be augmented with unseen intents without retraining any of the seen ones. Experiments show that the model can reliably add unseen intents with high accuracy while retaining high performance on the seen intents.

Dialogue systems and virtual assistants have started becoming successful in recent years. While the technology is evolving at great speed, these systems are still a long way from handling a truly natural conversation (Levesque, 2017). They are growing in features, in the possible ways to interact with the user, and are increasingly multilingual. The process from the user communicating with the machine to getting a response can be broken down into multiple steps. This paper looks into one of these steps, called intent prediction, which aims to determine the user's goal.

A common approach to intent prediction consists of using a machine learning model capable of mapping user queries to a discrete, fixed intent class. With the recent developments in deep learning, interest in intent prediction has increased, yet the common approach has not changed (Hu et al., 2009; Xu and Sarikaya, 2013). This common approach has serious limitations. One major limitation is that any intent is assumed to belong to a fixed number of intent classes known well in advance. Machine learning models that follow such an approach fail to take into account that the number of intents is dynamic (e.g. the rapid emergence of new intents in the wake of the COVID-19 pandemic). These models would have to be retrained or significantly adapted to add new intents. Furthermore, they are incapable of finding relationships between existing and emerging intents, which might be important for both users and developers. There has been increased interest in models that can handle unseen intents to tackle some of the above issues, such as zero-shot learning (Lampert et al., 2014; Changpinyo et al., 2016). In this paper we propose a model that can tackle all of those issues. The main idea is to represent intents as points in a continuous space called intent space, which among other benefits makes it possible to compute distances between intents and explore the relationships among seen and unseen intents.
This paper shows how these intent spaces can be embedded into standard forms of neural networks adapted to handle a variable number of classes, and how unseen intents can be added without retraining any part of the model learnt on seen intents.

Intent prediction is a machine learning task where the goal is to predict the correct intent class c for a given sentence X consisting of T words. One way to accomplish this is by using a neural network such as a Recurrent Neural Network (RNN) (Mikolov et al., 2010). RNNs take sentences encoded using vector representation schemes, such as one-hot encoding or word embeddings (Mikolov et al., 2013), and recursively compute history states h_t for each word x_t from the history state h_{t-1} of the previous word x_{t-1}:

    h_t = σ(U h_{t-1} + V x_t + b)   (1)

where U, V, b are RNN parameters and σ is an element-wise non-linearity, such as sigmoid, tanh or ReLU. More complex forms of RNNs, such as the LSTM (Hochreiter and Schmidhuber, 1997) and GRU (Cho et al., 2014), can also be used. We can define the probability of the sentence belonging to intent class c using a softmax:

    P(c|X) = softmax_c(A h_T + d)   (2)

where h_T is the final history vector associated with the final word x_T and A, d are additional RNN parameters. Note that the equation above generally implies that the number of classes C is known in advance. The RNN parameters θ = (U, V, b, A, d) can be optimised on a training set D to maximise the probability of assigning sentences X_r to reference intent classes y_r:

    F(θ; D) = Σ_{r=1}^{R} log P(y_r | X_r; θ)   (3)

As previously mentioned, this approach to intent modelling comes with several serious limitations. One approach to overcome these limitations is to introduce an intent space.

Intent space is the concept of representing each intent c as a point in a B-dimensional space, rather than as a discrete class. Figure 1 shows a simple illustration of a 3-dimensional intent space with one intent present. The intents are assumed to live in a low-dimensional space with B bases. The goal is for similar intents to be grouped together, while dissimilar ones are kept apart. As the intent space is not fixed to a particular number of intents, it also allows unseen intents to be added and their similarity to existing intents to be measured. The intent space opens up a range of other possibilities, and the rest of this section discusses some of them.

There are at least two options for modelling bases (arrows in Fig. 1) in this new coordinate system. One approach is to adopt a vector space, where bases are vectors w_1, ..., w_B and points (circle in Fig. 1) can be expressed using coordinates:

    w_c = Σ_{b=1}^{B} α_{c,b} w_b   (4)

Vector spaces allow for a simple representation of intents but offer too few model parameters to be powerful. To increase modelling power we can model bases using matrices:

    W_c = Σ_{b=1}^{B} α_{c,b} W_b   (5)

It is also possible to use more complex schemes that change the nature of bases and/or how points are expressed. If a more parameter-efficient model is required compared to equation (5), it is possible to examine reduced-rank matrix spaces (equation (6)), in which the basis matrices are restricted to rank K. This retains a matrix representation, albeit of reduced rank K (fewer parameters).

There are a number of important considerations to be made about bases. One consideration is what these bases should represent. If each intent can be represented as a combination of a small number of eigen- or proto-intents, then each basis should represent one of those eigen-intents. Another consideration is how many bases are needed for any given set of intents. If intents are unrelated, the optimal number of bases B would likely be equal to the number of (seen) intents C.
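As a concrete reference point, the baseline classifier of equations (1)-(3) can be written as a short sketch. The snippet below is a minimal illustration in PyTorch, not the authors' implementation; the class name, layer sizes and initialisation choices are assumptions.

```python
# Minimal sketch (not the authors' code) of the baseline RNN intent classifier:
# h_t = sigma(U h_{t-1} + V x_t + b), followed by a softmax over C fixed intent classes.
import torch
import torch.nn as nn

class BaselineRNNIntentClassifier(nn.Module):
    def __init__(self, emb_dim=300, hid_dim=300, num_intents=7):
        super().__init__()
        self.U = nn.Parameter(torch.randn(hid_dim, hid_dim) * 0.01)  # recurrent matrix
        self.V = nn.Parameter(torch.randn(hid_dim, emb_dim) * 0.01)  # embedding matrix
        self.b = nn.Parameter(torch.zeros(hid_dim))                  # bias vector
        self.A = nn.Parameter(torch.randn(num_intents, hid_dim) * 0.01)
        self.d = nn.Parameter(torch.zeros(num_intents))

    def forward(self, x):                            # x: (T, emb_dim) word embeddings
        h = torch.zeros(self.b.shape[0])
        for x_t in x:                                # equation (1)
            h = torch.sigmoid(self.U @ h + self.V @ x_t + self.b)
        return torch.log_softmax(self.A @ h + self.d, dim=-1)   # equation (2), log domain

# Training maximises the log-probability of the reference intents (equation (3)),
# i.e. minimises the negative log-likelihood: loss = -model(x)[y] summed over sentences.
```

Note that the number of classes is baked into the sizes of A and d, which is exactly the limitation the intent space is designed to remove.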
An unseen intent that is related to one of the seen intents can then be represented by means of the corresponding basis and possibly other bases as well. An unrelated unseen intent will be forced either to pick the closest basis or to make use of all or some of the available bases. However, if the unseen intent is related to more than one seen intent, it should benefit from the rich representation available to it.

The different choices of basis representation discussed in the previous section do not alter the low-dimensional nature of the intent space. Such a simplified approach may struggle to deal with complex intents that must co-exist within the same space. Thus, it is desirable to be able to expand the space to meet different levels of complexity. One approach to accomplish this is

    W_c = Σ_{b=1}^{B} α_{c,b} Ω_{c,b} W_b   (7)

where Ω_{c,b} is an expansion matrix for intent c and basis b. Such intent-dependent expansion enables different intents to use either equation (7) or equation (5) depending on the complexity needed for their specific cases. It is also possible to increase complexity incrementally by initially using equation (5) and then continuing with equation (7), where all Ω_{c,b} are initialised as identity matrices.

Coordinates α = [α_1 ... α_C] are an important feature of the intent space and can be set in a variety of ways. One way is to initialise them randomly. Interpretability can be improved by representing each α_c as the one-hot encoding of that intent, making α an identity matrix. Such a choice implies that the number of bases B is set to the number of seen intents C. If α is not constrained in any way, the intent space is Euclidean. One issue with Euclidean spaces is an overall lack of coordinate interpretability (e.g. negative values, scale), as the interaction between coordinates α_{c,b} and bases W_b makes interpreting coordinates complicated. However, each α_c can be forced to lie on a simplex, for instance by normalising unnormalised coordinates β:

    α_{c,b} = exp(β_{c,b}) / Σ_{b'=1}^{B} exp(β_{c,b'})

This makes it possible to see how much each coordinate α_{c,b} takes from another intent/basis b, and how much each unseen intent learns from the seen ones. Note that β can be viewed as unnormalised coordinates.

There are multiple ways in which intent spaces can be incorporated into modern forms of neural networks. Consider, for example, the RNN, which has three model parameters U, V and b linked with input sequence embedding, and two model parameters A and d linked with intent class prediction. Each model parameter is a potential candidate for intent space embedding. In this work we examine the model parameters linked with input sequence embedding, which yields three possible options. Regardless of the chosen option, the history state of the intent space RNN becomes intent-dependent, h_{c,t}, rather than intent-independent, h_t. The implications resulting from this change are discussed further in this section.

The first option is to embed intent spaces into the bias vector b using the vector space approach defined in equation (4). This would, however, only act as an intent-dependent offset, with a possibility of anchoring history states associated with different intents in different regions of the history space to aid intent separability. While this approach could theoretically lead to improved performance, the largest benefit is expected from embedding intent spaces either into the recurrent matrix U or into the (word) embedding matrix V using the matrix space approaches in equations (5), (6) or (7). The second option is to introduce intent spaces into the embedding matrix V, which opens up a number of interesting possibilities; all of them rely on composing intent-dependent matrices from the shared bases, as sketched below.
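The sketch below is a hedged illustration of that composition, combining equations (5) and (7) with the simplex parameterisation of the coordinates. Tensor shapes, function names and the softmax form of the simplex mapping are assumptions made for illustration only.

```python
# Hedged sketch: compose an intent-dependent matrix from shared bases (equation (5)),
# optional expansion matrices Omega (equation (7)) and simplex coordinates from beta.
import torch

def simplex_coordinates(beta):
    """Map unnormalised coordinates beta of shape (C, B) onto the simplex."""
    return torch.softmax(beta, dim=-1)

def intent_matrix(alpha_c, bases, omega_c=None):
    """Compose W_c = sum_b alpha_{c,b} * Omega_{c,b} @ W_b.

    alpha_c: (B,)      coordinates of intent c
    bases:   (B, H, H) shared basis matrices W_1..W_B
    omega_c: (B, H, H) expansion matrices for intent c, identity if omitted
    """
    B, H, _ = bases.shape
    if omega_c is None:
        omega_c = torch.eye(H).expand(B, H, H)
    expanded = torch.bmm(omega_c, bases)              # Omega_{c,b} @ W_b for every b
    return torch.einsum('b,bij->ij', alpha_c, expanded)

# Example: 6 seen intents, 6 bases, 300-dimensional history states.
alpha = torch.eye(6)                         # one-hot initialisation (one basis per intent)
bases = torch.randn(6, 300, 300) * 0.01
W_0 = intent_matrix(alpha[0], bases)         # intent-dependent matrix for intent 0
# A learnable simplex parameterisation would instead use alpha = simplex_coordinates(beta).
```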
One such possibility is to learn discriminative vocabularies. For instance, an intent linked with acquiring information about weather conditions would be expected to learn high-quality embeddings for some words, such as weather and forecast, but not for others, such as Adele and Madonna. Finally, it is also possible to introduce intent spaces into the model parameters U that control which information from the past is propagated along the sequence. Such intent spaces are expected to be very powerful in controlling which information is ultimately used for making accurate intent predictions. The update rule used to compute the history vector at time t in these intent spaces is

    h_{c,t} = σ(U_c h_{c,t-1} + V x_t + b)

where the model parameters U_c and the history states h_{c,t} are intent-dependent.

The intent-dependent nature of history states changes how we compute the probability of a sentence belonging to a particular intent class. There are two main options: introduce one set of intent-independent parameters, or introduce C sets of intent-dependent parameters. Given the C history states h_{1,T}, ..., h_{C,T} associated with each intent, it is possible to compute C scores that reflect how well the sentence x_{1:T} matches each intent using just one set of intent-independent parameters a and d:

    s_c = σ(a^T h_{c,T} + d)

where σ is a suitable non-linearity. An alternative approach is to introduce parameters a_c and d_c for each intent class:

    s_c = σ(a_c^T h_{c,T} + d_c)

To compute probabilities in either case, we can take the set of (positive) scores and normalise:

    P(c|X) = s_c / Σ_{c'=1}^{C} s_{c'}   (13)

These two approaches require different numbers of model parameters to be introduced; the former may prove useful in limited-resource conditions.

When a new, previously unseen, intent emerges, it is common to retrain the standard RNN model from scratch. However, the normalisation in equation (13), like the softmax used in standard RNNs, is not inherently limited to a fixed number of intent classes. Provided the scores associated with the unseen intent do not affect the rank ordering or alter the relative contributions of other intents on seen sentences, it is possible to retain all existing model parameters. The following section discusses one practical approach to accomplishing this, which opens an opportunity for a dynamic model that can add previously unseen intents without altering an already deployed model.

It is hard to envision and cater for all possible user intents in practice. Many intents have a seasonal nature (e.g. flu), some are unexpected (e.g. COVID-19). Thus, handling unseen intents, which includes both unseen intent detection and modelling, is an important practical consideration. Common approaches to unseen intent detection can be split into automatic, semi-automatic and manual. One common automatic approach to decide whether a given sentence is likely to belong to an unseen intent class uses the entropy of the predictive distribution:

    H(X) = - Σ_{c=1}^{C} P(c|X) log P(c|X)

The assumption is that the more uncertain the model is about intent classification, the more likely it is that the correct intent is in fact unseen. In an idealised setting there exists a threshold ρ such that

    H(X_r) > ρ  if and only if the correct intent of X_r is unseen   (15)

holds true for all sentences X_r of a dataset D. This might also be a good starting point for developing semi-automatic approaches that leverage human expertise to make decisions about a few pre-filtered candidates. It is also possible to devise intent-space-based approaches. One such approach could be to obtain an estimate of the coordinates α given a sentence X and compare it to the coordinates of the known intents α_1, ..., α_C.
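The computations described in this section can be made concrete with a short sketch: one history state per intent, a shared score function with intent-independent parameters a and d, normalisation of the positive scores as in equation (13), and the entropy-based detection rule of equation (15). This is a minimal illustration; all tensor shapes, names and the threshold value are assumptions.

```python
# Sketch of the intent-space RNN forward pass and entropy-based unseen-intent detection.
import torch

def intent_space_scores(x, U_per_intent, V, b, a, d):
    """x: (T, emb_dim) word embeddings; U_per_intent: (C, H, H); V: (H, emb_dim);
    b, a: (H,); d: scalar. Returns normalised intent probabilities of shape (C,)."""
    C, H, _ = U_per_intent.shape
    h = torch.zeros(C, H)
    for x_t in x:
        # h_{c,t} = sigma(U_c h_{c,t-1} + V x_t + b) for every intent c at once
        h = torch.sigmoid(torch.einsum('chk,ck->ch', U_per_intent, h) + V @ x_t + b)
    scores = torch.sigmoid(h @ a + d)       # one positive score per intent (shared a, d)
    return scores / scores.sum()            # normalisation as in equation (13)

def predictive_entropy(probs):
    """Entropy of the normalised intent distribution for one sentence."""
    return -(probs * torch.log(probs.clamp_min(1e-12))).sum()

def is_unseen(probs, rho=1.0):
    """Flag a sentence as belonging to an unseen intent if its entropy exceeds rho."""
    return predictive_entropy(probs) > rho

# Intent-dependent output parameters would instead use a_c, d_c per intent, e.g.
# scores = torch.sigmoid((h * A).sum(-1) + d_vec) with A: (C, H) and d_vec: (C,).
```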
Returning to the coordinate-based detection idea above: if the estimated α is close to one of the known intents α_c, the sentence is likely to belong to a seen intent, and to an unseen one otherwise.

When a new intent is identified, it needs to be incorporated into the model. It would be beneficial if such a change did not involve altering any of the existing model parameters. Furthermore, we would like to do it in a manner that keeps the accuracy on seen intents high, while also reliably representing any unseen intents. One possible solution to this problem consists of introducing the following regularisation function

    R(θ; D) = Σ_{r=1}^{K} Σ_{u=1}^{U} S(h_{r,c,T_r}; θ) / S(h_{r,C+u,T_r}; θ)   (17)

which is computed over K ≪ R training sentences available to the seen intents, where c denotes the reference seen intent of sentence r and S(·) the unnormalised score. The above regularisation function takes into consideration the ratio between seen and unseen intent scores on the seen training data. This allows us to make sure that none of the unseen intents C+1, ..., C+U will significantly impact the performance of classifying sentences into seen intent classes. This is achieved by appropriately weighing the contribution of the regularisation function in the overall objective function

    O(θ; D, D') = F(θ; D') + ε R(θ; D) + ζ R_α(θ)   (18)

where D and D' are the training sets for seen and unseen intents respectively, R_α(θ) is another regularisation function that either drives the coordinates towards the uniform distribution (simplex spaces) or minimises the L2 norm (Euclidean spaces), and ε and ζ control their respective contributions to the overall objective.

The diagram below summarises the overall training process, which includes training on seen intents (Fig. 2a) followed by optional training on unseen intents (Fig. 2b). As illustrated in the figure, the intent space parameters W_1, ..., W_B (bases) and α_1, ..., α_C (coordinates) are estimated in an interleaved fashion by first training the bases whilst keeping the coordinates frozen, and then training the coordinates whilst keeping the bases frozen. This process can be repeated for a fixed number of steps, until a certain milestone has been reached (e.g. accuracy, loss), until some other criterion has been satisfied (e.g. no significant improvement in loss/accuracy), or a combination of these. Note that the training in Fig. 2a is completed once, while the training in Fig. 2b can be carried out multiple times without any retraining of the model parameters introduced previously.

To extend the intent space with one or more previously unseen intents, only the new sets of coordinates need to be estimated. Given a sufficiently large regularisation constant ε, the regularisation function in equation (17) will ensure that the new coordinates do not alter the top-1 rank ordering of the unnormalised scores on a sample of seen intent sentences. If the results achieved by introducing only coordinates are satisfactory, the process of expanding the intent space with the unseen intents stops. Otherwise, the intent space can be further expanded by means of the basis-dependent expansion matrices in equation (7).

The intent space proposed in this work is closely related to Zero-Shot Learning (ZSL), which aims to detect and/or classify examples of classes that either have not been present during training or have no labelled data associated with them. ZSL has gained a lot of traction in the image domain (Socher et al., 2013; Changpinyo et al., 2016; Frome et al., 2013). In recent years there has been interest in the natural language processing domain, and in intent modelling in particular.
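Before surveying related approaches in more detail, it is worth returning briefly to the objective in equation (18). The sketch below shows one way it could be assembled: a likelihood term on the unseen-intent data D', the ratio-style regulariser of equation (17) computed on a small sample of seen sentences, and a coordinate regulariser R_α. The function names, the exact functional forms of the regularisers and the sign conventions are assumptions consistent with the description above, not the authors' code.

```python
# Hedged sketch of the objective in equation (18) for adding unseen intents (maximised).
import torch

def rank_regulariser(seen_scores, unseen_scores):
    """Ratio-style regulariser over K sampled seen sentences (cf. equation (17)).

    seen_scores:   (K,)   score of the reference seen intent for each sentence
    unseen_scores: (K, U) scores of the U newly added intents for the same sentences
    Larger values mean the seen intents still dominate the new ones.
    """
    return (seen_scores.unsqueeze(1) / unseen_scores).sum()

def coordinate_regulariser(alpha_new, simplex=True):
    """Drive new coordinates towards uniform (simplex) or towards small norm (Euclidean)."""
    if simplex:
        uniform = torch.full_like(alpha_new, 1.0 / alpha_new.shape[-1])
        return -((alpha_new - uniform) ** 2).sum()
    return -(alpha_new ** 2).sum()

def objective(log_prob_unseen, seen_scores, unseen_scores, alpha_new, eps=0.20, zeta=1.00):
    # O(theta; D, D') = F(theta; D') + eps * R(theta; D) + zeta * R_alpha(theta)
    return (log_prob_unseen.sum()
            + eps * rank_regulariser(seen_scores, unseen_scores)
            + zeta * coordinate_regulariser(alpha_new))
```

In practice only the new coordinates (and, optionally, the expansion matrices) would receive gradients from this objective; all previously learnt parameters stay frozen.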
Returning to related work: one simple ZSL approach examined for intent modelling consists of creating or learning representations for a given sentence and each intent, and using suitable distances to perform classification (Williams, 2019; Kumar et al., 2017). Both information-retrieval-style approaches and recurrent neural networks have been examined for learning such representations. A similar idea has been explored in the image domain (Lampert et al., 2014). Recently there has been interest in using capsule networks for extracting powerful sentence and intent representations (Si et al., 2020; Xia et al., 2018; Liu et al., 2019). One common theme in these papers is that they all require the availability of intent labels, unlike the proposed intent space. The overall idea of generating a continuous space representation has been examined for representing speakers (Gales, 1999) and languages (Zen et al., 2012; Ragni et al., 2015) in speech tasks. Unlike those representations, which were embedded into simpler hidden Markov models, the intent space is embedded into neural networks, which offers access to all the recent developments in deep learning.

We conducted experiments on two datasets: SNIPS (Coucke et al., 2018) (primary) and the Airline Travel Information System (ATIS) (Hemphill et al., 1990) (secondary). SNIPS is a balanced dataset consisting of 7 intents that represent different domains. There are approximately 2000 sentences available for each intent. We moved the first 100 sentences per intent to create a validation dataset. In experiments where we explored unseen intents, one intent was excluded from the training set. ATIS is an imbalanced dataset within the flight domain, made available through the Microsoft Cognitive Toolkit, where one of the intents, flight, comprises approximately seventy percent of the data. Any intent that did not appear in both the training and test sets was removed, yielding a total of 16 intents.

Implementation details. We use GloVe (Pennington et al., 2014), which was pre-trained on large quantities of data and features a 1.9 million word vocabulary, to provide word embeddings with a dimensionality of 300. The dimensionality of all history states was also 300. Words without an embedding were mapped to the average word embedding. The embedding matrix V was initialised to an identity matrix so that unmodified word embeddings are used on the very first iteration. The coordinates α were initialised as one-hot encodings (one basis per intent). The remaining parameters were initialised randomly. All models were trained on a CPU. No grid search or other elaborate methods were used for careful tuning of hyperparameters, such as the learning rate or the modifier term in equation (17); it is expected that better results may be obtained with more careful tuning. The baseline RNN was implemented using equation (1), with parameters learnt by optimising equation (3). The intent space was trained using equation (5) by following the process illustrated in Figure 2, where the basis parameters W were trained for 5 epochs, then α for 5 epochs, and so on. Both approaches made use of the validation set to implement early stopping. When training coordinates or expansion matrices Ω for unseen intents, ε was set to 0.20 to penalise the ratio between seen and unseen intents on training data, and ζ was set to 1.00. The stochastic gradient descent (SGD) method with weight decay was used for SNIPS, while for ATIS the Adam optimiser (Kingma and Ba, 2014) was used to speed up the process.
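The interleaved schedule just described could be implemented along the lines of the sketch below. The helper callables train_epoch and validate, the parameter groups and the patience rule are assumptions for illustration rather than the authors' exact procedure.

```python
# Sketch of interleaved training: optimise the bases W with the coordinates alpha frozen,
# then the coordinates with the bases frozen, switching every few epochs.
import torch

def interleaved_training(bases_params, coord_params, train_epoch, validate,
                         block_epochs=5, max_epochs=50, patience_limit=5):
    """bases_params / coord_params: iterables of torch.nn.Parameter for W and alpha.
    train_epoch() runs one epoch of SGD/Adam; validate() returns validation accuracy."""
    best_acc, patience, train_bases = 0.0, 0, True
    for epoch in range(max_epochs):
        for p in bases_params:                       # freeze one group, train the other
            p.requires_grad = train_bases
        for p in coord_params:
            p.requires_grad = not train_bases
        train_epoch()
        acc = validate()
        if acc > best_acc:
            best_acc, patience = acc, 0
        else:
            patience += 1
            if patience >= patience_limit:           # simple early stopping
                break
        if (epoch + 1) % block_epochs == 0:          # switch between W and alpha
            train_bases = not train_bases
    return best_acc
```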
Experiments were forced to stop after epoch 50, 150 or 500 when training seen intents, coordinates or expansion matrices respectively, on the rare occasions that early stopping was not triggered.

For the first experiment we picked 6 intents as seen and left GetWeather as unseen. The complete list of intents is shown in Table 2. The model was trained by following the approach described in Figure 2. Figure 3 provides an illustration of the overall training process, including the accuracies achieved while training on seen (steps 0-6) and unseen (steps 7 onwards) intents. Note that the second curve, which appears from step 7 onwards, shows accuracies for predicting the unseen intent. We start by training first the bases W and then the coordinates α. While the majority of the gain comes from training the bases, this is partially caused by them being trained first, which makes the improvement achieved from training the coordinates more limited. By training W first we ensure that each basis learns a good representation. It is, however, highly likely that by letting the bases learn very fast, the coordinates will not be able to deviate much from the initial one-hot encoded representation. Thus, the use of coordinates as a way to learn relationships between training intents may be compromised in such situations.

Table 1 summarises the overall performance of Euclidean and simplex intent spaces. The baseline RNN shows a similar level of performance. It is interesting to note that although coordinates in the simplex space are constrained, no loss in generalisation ability is observed compared to the Euclidean space. Given that the former offers more interpretability, all following experiments are based on the simplex space.

    Table 1: Accuracy (%)
                 Seen     Unseen
    Euclidean    97.00    95.00
    Simplex      97.33    95.00

The learnt intent space is illustrated in Figure 4. The strong diagonal shape was expected given the large gains in accuracy obtained by training the bases first. Note that even though the intent space is diagonal, when an unseen intent is introduced it is highly likely not to attach itself to just one of the seen training intents. Thus, even though the diagonal shape is somewhat concerning, it does not necessarily mean that the intent space cannot be used to successfully model unseen intents. The performance of intent spaces on seen intents is in line with current state-of-the-art models.

We would now like to expand the intent space by adding the unseen intent GetWeather. We start by adding new coordinates α_{C+1} initialised to a uniform distribution over the bases and optimise only these parameters. The bottom curve in Figure 3 indicates that optimising only the coordinates reaches a limit of 40% accuracy on the unseen intent. Such a level of performance is significantly below expectations, but above a random classifier. Given that the coordinates provide only 6 model parameters, this result is not surprising. To increase modelling power we expanded the intent space by means of the matrix parameters Ω_{C+1} and trained them whilst keeping all other parameters fixed, as illustrated by Figure 2(b). The accuracy improves from 95.67% on seen and 40.00% on unseen intents to 97.33% on seen and 95.00% on unseen intents. Thus, without any retraining of the previously learnt parameters, the intent space achieves high performance on both the seen and the unseen intents. To investigate whether the high level of performance on unseen intents is limited to the GetWeather intent only, we excluded each of the seen intents in turn.
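For reference, the expansion step used in this experiment could look like the sketch below: new coordinates α_{C+1} start from a uniform distribution over the B bases and, if coordinates alone prove insufficient, expansion matrices Ω_{C+1,b} start from the identity; only these new parameters are then optimised. All names are illustrative and none of this is the authors' code.

```python
# Illustrative sketch of expanding a trained intent space with one unseen intent.
import torch

def new_intent_parameters(num_bases, hid_dim=300):
    """Create the only parameters that need to be trained for an unseen intent."""
    alpha_new = torch.nn.Parameter(torch.full((num_bases,), 1.0 / num_bases))   # uniform start
    omega_new = torch.nn.Parameter(torch.eye(hid_dim).repeat(num_bases, 1, 1))  # identity start
    return alpha_new, omega_new

# Stage 1: optimise alpha_new only, with every previously learnt parameter frozen.
# Stage 2 (optional): additionally optimise omega_new if coordinate-only accuracy is too low,
# still without touching the seen bases, coordinates or output parameters.
alpha_new, omega_new = new_intent_parameters(num_bases=6)
```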
Figure 5 demonstrates that the intent space is able to find certain relationships between intents. AddToPlaylist (1) is modelled mainly through PlayMusic (3), which is exactly what a playlist consists of. PlayMusic, on the other hand, is modelled mainly by SearchCreativeWork (5), which can be explained by the fact that music is a creative art. There are also intents, like BookRestaurant (2), which draw on a variety of intents, indicating that they cannot be modelled by a single intent, as was expected.

The previous experiments showed that we are able to obtain high performance on unseen intents. We now look at how much data is required to accomplish that. Table 3 shows the effect the amount of data has on reaching high prediction accuracy. Even with one training sentence (or point) the intent space achieves better-than-random performance. At around 500 training sentences the performance reaches 85.00%. It appears, however, that sufficiently large quantities of data are needed to predict unseen intents nearly perfectly.

We have so far seen that it is possible to add one unseen intent. In the next experiment we add two unseen intents, RateBook and BookRestaurant, at the same time, whilst keeping the remaining 5 intents as seen. The increase in complexity caused by adding more than one intent did not lead to a drop in performance on either seen or unseen intents. The seen performance was 95.60% while the unseen performance was 97.00%, which is in line with what we were seeing in Table 2 when only one unseen intent was added. Figure 6 illustrates the coordinates learnt in this experiment and compares them to those trained one at a time. We can see that the main intent contributors are mostly the same, which indicates that the intent space is able to capture the key semantic features of those intents. It also appears that adding one or two intents at the same time makes no significant difference to how the intent space views them.

Unseen intent detection is an important first step for incorporating unseen intents into intent spaces. Although more advanced approaches are available (Lin and Xu, 2019b,a; Yan et al., 2020), the intent spaces themselves can also be examined for unseen intent detection. For simplicity, we focused on the entropy-based approach described in equation (15). The ROC curve that summarises the unseen intent detection capabilities is shown in Figure 7. It shows that the model is able to detect unseen intents better than a random classifier but is far from perfect. As discussed in Section 3.5, there are more advanced ways to utilise the intent space's capabilities.

We also explored the highly popular ATIS dataset to assess intent spaces. The maximum accuracy was 92.91% for the RNN and 93.24% for the intent space, which demonstrates that intent spaces can perform well in situations where the majority of intents have very limited training data. Unfortunately, this task is heavily imbalanced and many sentences of rare intent classes are mislabelled; therefore, no additional investigation was performed.

The table below contains the distribution of intents in the SNIPS dataset used in the experiments. The mapping from integer to intent can be found in Table 2.

    Intent   Train   Validation   Test
    1        1842    100          100
    2        1873    100          100
    3        1900    100          100
    4        1856    100          100
    5        1854    100          100
    6        1859    100          100
    7        1900    100          100
    Total    13084   700          700

In the experiments that examined adding two unseen intents, the validation set accuracy was 95.20% on seen intents and 97.00% on unseen intents.
The tables below replicate the tables given in the main body of the paper with test set accuracies replaced by validation set accuracies. In the ATIS task the highest training set accuracy was 99.82% for the intent space and 100.00% for the RNN. Due to the highly limited number of sentences available for many intents, no validation set was used for this task.

We are living in a world where new intents constantly appear. Some of them may be temporary, while others are here to stay. This provides a strong motivation for investigating approaches for detecting and modelling unseen intents. In this paper we propose a novel approach for representing intents in a continuous space, called the intent space. The intent space is highly flexible and allows intents unseen during training to be added without retraining any of its parameters. This continuous representation also makes it possible to investigate relationships between intents, which is not normally possible with standard approaches. We demonstrate high prediction performance on two separate datasets. Future work will examine the use of more powerful basis representations and alternatives to the weighted combination approach for obtaining coordinates.

References
Synthesized classifiers for zero-shot learning
Learning phrase representations using RNN encoder-decoder for statistical machine translation
Snips voice platform: an embedded spoken language understanding system for private-by-design voice interfaces
Devise: A deep visual-semantic embedding model
Cluster adaptive training of hidden Markov models
The ATIS spoken language systems pilot corpus
Long short-term memory
Understanding user's query intent with Wikipedia
Adam: A method for stochastic optimization
Zero-shot learning across heterogeneous overlapping domains
Attribute-based classification for zero-shot visual object categorization
Common Sense, the Turing Test, and the Quest for Real AI: Reflections on Natural and Artificial Intelligence
Deep unknown intent detection with margin loss
A post-processing method for detecting unknown intent of dialogue system via pre-trained deep neural network classifier. Knowledge-Based Systems
Reconstructing capsule networks for zero-shot intent classification
Efficient estimation of word representations in vector space
Recurrent neural network based language model
GloVe: Global vectors for word representation
A language space representation for speech recognition
Learning disentangled intent representations for zero-shot intent detection
Zero-shot learning through cross-modal transfer
Zero shot intent classification using long short-term memory networks
Zero-shot user intent detection via capsule neural networks
Convolutional neural network based triangular CRF for joint intent detection and slot filling
Unknown intent detection using Gaussian mixture model with an application to zero-shot intent classification
Statistical parametric speech synthesis based on speaker and language factorization
Joint slot filling and intent detection via capsule neural networks