title: New Intent Discovery with Pre-training and Contrastive Learning
authors: Zhang, Yuwei; Zhang, Haode; Zhan, Li-Ming; Wu, Xiao-Ming; Lam, Albert Y.S.
date: 2022-05-25
(* Work done while the author was with HK PolyU. † Corresponding author.)

New intent discovery aims to uncover novel intent categories from user utterances to expand the set of supported intent classes. It is a critical task for the development and service expansion of a practical dialogue system. Despite its importance, this problem remains under-explored in the literature. Existing approaches typically rely on a large number of labeled utterances and employ pseudo-labeling methods for representation learning and clustering, which are label-intensive, inefficient, and inaccurate. In this paper, we provide new solutions to two important research questions for new intent discovery: (1) how to learn semantic utterance representations and (2) how to better cluster utterances. Particularly, we first propose a multi-task pre-training strategy to leverage rich unlabeled data along with external labeled data for representation learning. Then, we design a new contrastive loss to exploit self-supervisory signals in unlabeled data for clustering. Extensive experiments on three intent recognition benchmarks demonstrate the high effectiveness of our proposed method, which outperforms state-of-the-art methods by a large margin in both unsupervised and semi-supervised scenarios. The source code will be available at https://github.com/zhang-yu-wei/MTP-CLNN.

Why Study New Intent Discovery (NID)? Recent years have witnessed the rapid growth of conversational AI applications. To design a natural language understanding system, a set of expected customer intentions is collected beforehand to train an intent recognition model. However, the predefined intents cannot fully meet customer needs. This implies the necessity of expanding the intent recognition model by repeatedly integrating new intents discovered from unlabeled user utterances (Fig. 1). To reduce the effort of manually identifying unknown intents from a mass of utterances, previous works commonly employ clustering algorithms to group utterances of similar intents (Cheung and Li, 2012; Hakkani-Tür et al., 2015; Padmasundari, 2018). The cluster assignments can thereafter either be used directly as new intent labels or serve as heuristics for faster annotation.

Research Questions (RQ) and Challenges. The current study of NID centers around two basic research questions: 1) How to learn semantic utterance representations that provide proper cues for clustering? 2) How to better cluster the utterances? The study of the two questions is often interwoven in existing research. Utterances can be represented according to different aspects such as the style of language, the related topics, or even the length of sentences, so it is important to learn semantic utterance representations that provide proper cues for clustering. Simply applying a vanilla pre-trained language model (PLM) to generate utterance representations is not a viable solution: it leads to poor performance on NID, as shown by the experimental results in Section 4.2.
Some recent works proposed to use labeled utterances of known intents for representation learning (Forman et al., 2015; Haponchyk et al., 2018; Lin et al., 2020; Zhang et al., 2021c; Haponchyk and Moschitti, 2021), but they require a substantial number of known intents and sufficient labeled utterances for each intent, which are not always available, especially at the early development stage of a dialogue system. Further, pseudo-labeling approaches are often exploited to generate supervision signals for representation learning and clustering. For example, Lin et al. (2020) fine-tune a PLM with an utterance similarity prediction task on labeled utterances to guide the training of unlabeled data with pseudo-labels. Zhang et al. (2021c) adopt a deep clustering method (Caron et al., 2018) that uses k-means clustering to produce pseudo-labels. However, pseudo-labels are often noisy and can lead to error propagation.

Our Solutions. In this work, we propose a simple yet effective solution for each research question.

Solution to RQ 1: multi-task pre-training. We propose a multi-task pre-training strategy that takes advantage of both external data and internal data for representation learning. Specifically, we leverage publicly available, high-quality intent detection datasets, following Zhang et al. (2021d), as well as the provided labeled and unlabeled utterances in the current domain, to fine-tune a PLM to learn task-specific utterance representations for NID. The multi-task learning strategy enables knowledge transfer from general intent detection tasks and adaptation to a specific application domain.

Solution to RQ 2: contrastive learning with nearest neighbors. We propose to use a contrastive loss to produce compact clusters, which is motivated by the recent success of contrastive learning in both computer vision (Bachman et al., 2019; He et al., 2019; Chen et al., 2020; Khosla et al., 2020) and natural language processing (Gunel et al., 2021; Gao et al., 2021; Yan et al., 2021). Contrastive learning usually maximizes the agreement between different views of the same example and minimizes that between different examples. However, the commonly used instance discrimination task may push away false negatives and hurt the clustering performance. Inspired by a recent work in computer vision (Van Gansbeke et al., 2020), we introduce the neighborhood relationship to customize the contrastive loss for clustering in both unsupervised (i.e., without any labeled utterances of known intents) and semi-supervised scenarios. Intuitively, in a semantic feature space, neighboring utterances should have a similar intent, and pulling together neighboring samples makes clusters more compact.

Our main contributions are three-fold.
• We show that our proposed multi-task pre-training method already leads to large performance gains over state-of-the-art models for both unsupervised and semi-supervised NID.
• We propose a self-supervised clustering method for NID that incorporates the neighborhood relationship into the contrastive learning objective, which further boosts performance.
• We conduct extensive experiments and ablation studies on three benchmark datasets to verify the effectiveness of our methods.

New Intent Discovery. The study of NID is still in an early stage. Pioneering works focus on unsupervised clustering methods. Shi et al. (2018) leveraged an auto-encoder to extract features. Perkins and Yang (2019) considered the context of an utterance in a conversation.
Chatterjee and Sengupta (2020) proposed to improve density-based models. Some recent works (Haponchyk et al., 2018; Haponchyk and Moschitti, 2021) studied supervised clustering algorithms for intent labeling, yet they cannot handle new intents. Another line of works (Forman et al., 2015; Lin et al., 2020; Zhang et al., 2021c) investigated a more practical case where some known intents are provided to support the discovery of unknown intents, which is often referred to as semi-supervised NID. To tackle semi-supervised NID, Lin et al. (2020) proposed to first perform supervised training on known intents with a sentence similarity task and then use pseudo-labeling on unlabeled utterances to learn a better embedding space. Zhang et al. (2021c) proposed to first pre-train on known intents and then perform k-means clustering to assign pseudo-labels to unlabeled data for representation learning, following Deep Clustering (Caron et al., 2018). They also proposed to align clusters to accelerate the learning of the top layers. Another approach is to first classify the utterances as known or unknown and then uncover new intents from the unknown utterances (Vedula et al., 2020; Zhang et al., 2021b); hence, it relies on accurate classification in the first stage. In this work, we address NID by proposing a multi-task pre-training method for representation learning and a contrastive learning method for clustering. In contrast to previous methods that rely on ample annotated data in the current domain for pre-training, our method can be used in an unsupervised setting and works well in data-scarce scenarios (Section 4.3).

Pre-training for Intent Recognition. Despite the effectiveness of large-scale pre-trained language models (Radford and Narasimhan, 2018; Devlin et al., 2019; Liu et al., 2019; Brown et al., 2020), the inherent mismatch in linguistic behavior between the pre-training datasets and dialogues encourages research on continual pre-training on dialogue corpora. Most previous works proposed to pre-train on open-domain dialogues in a self-supervised manner (Mehri et al., 2020; Henderson et al., 2020; Hosseini-Asl et al., 2020). Recently, several works pointed out that pre-training with relevant tasks can be effective for intent recognition. For example, intent recognition has been formulated as a sentence similarity task with pre-training on natural language inference (NLI) datasets. Vulić et al. (2021) and Zhang et al. (2021e) pre-trained with a contrastive loss on intent detection tasks. Our multi-task pre-training method is inspired by Zhang et al. (2021d), which leverages publicly available intent datasets and unlabeled data in the current domain for pre-training to improve the performance of few-shot intent detection. However, we argue that the method is even better suited to NID due to the natural availability of unlabeled utterances.

Contrastive Representation Learning. Contrastive learning has shown promising results in computer vision (Bachman et al., 2019; Chen et al., 2020; He et al., 2019; Khosla et al., 2020) and gained popularity in natural language processing. Some recent works used unsupervised contrastive learning to learn sentence embeddings (Gao et al., 2021; Yan et al., 2021; Kim et al., 2021; Giorgi et al., 2021). Specifically, Gao et al. (2021) and Yan et al. (2021) showed that a contrastive loss can avoid an anisotropic embedding space. Kim et al. (2021) proposed a self-guided contrastive training method to improve the quality of BERT representations. Giorgi et al.
(2021) proposed to pre-train a universal sentence encoder by contrasting randomly sampled text segments from nearby sentences. Zhang et al. (2021e) demonstrated that self-supervised contrastive pre-training and supervised contrastive fine-tuning can benefit few-shot intent recognition. Zhang et al. (2021a) showed that combining a contrastive loss with a clustering objective can improve short text clustering. Our proposed contrastive loss is tailored for clustering: it encourages utterances with similar semantics to group together and avoids pushing away false negatives as the conventional contrastive loss does.

3 Method

Problem Statement. To develop an intent recognition model, we usually prepare a set of expected intents $\mathcal{C}_k$ along with a few annotated utterances $\mathcal{D}^{\text{labeled}}_{\text{known}} = \{(x_i, y_i) \mid y_i \in \mathcal{C}_k\}$ for each intent. After deployment, the system will encounter utterances $\mathcal{D}^{\text{unlabeled}} = \{x_i \mid y_i \in \mathcal{C}_k \cup \mathcal{C}_u\}$ from both predefined (known) intents $\mathcal{C}_k$ and unknown intents $\mathcal{C}_u$. The aim of new intent discovery (NID) is to identify the emerging intents $\mathcal{C}_u$ in $\mathcal{D}^{\text{unlabeled}}$. NID can be viewed as a direct extension of out-of-distribution (OOD) detection, where we not only need to identify OOD examples but also discover the underlying clusters. NID is also different from zero-shot learning in that we do not presume access to any kind of class information during training. In this work, we consider both unsupervised and semi-supervised NID, which are distinguished by the existence of $\mathcal{D}^{\text{labeled}}_{\text{known}}$, following Zhang et al. (2021c).

Overview of Our Approach. As shown in Fig. 2, we propose a two-stage framework that addresses the research questions mentioned in Sec. 1. In the first stage, we perform multi-task pre-training (MTP) that jointly optimizes a cross-entropy loss on external labeled data and a self-supervised loss on target unlabeled data (Sec. 3.1). In the second stage, we first mine the top-K nearest neighbors of each training instance in the embedding space and then perform contrastive learning with nearest neighbors (CLNN) (Sec. 3.2). After training, we employ a simple non-parametric clustering algorithm to obtain the clustering results.

[Figure 2 caption (fragment): ... are plotted (hollow markers within large circles). Since $x_2$ falls within $\mathcal{N}_1$, $x_2$ along with its neighbors is taken as a positive instance for $x_1$ (but not vice versa, since $x_1$ is not in $\mathcal{N}_2$). An example adjacency matrix $A$ and augmented batch $B'$ are also shown; the pairwise relationships with the first instance in the batch are plotted with solid lines indicating positive pairs and dashed lines indicating negative pairs.]

We propose a multi-task pre-training objective that combines a classification task on external data from publicly available intent detection datasets and a self-supervised learning task on internal data from the current domain. Different from previous works (Lin et al., 2020; Zhang et al., 2021c), our pre-training method does not rely on annotated data ($\mathcal{D}^{\text{labeled}}_{\text{known}}$) from the current domain and hence can be applied in an unsupervised setting. Specifically, we first initialize the model with a pre-trained BERT encoder (Devlin et al., 2019). Then, we employ a joint pre-training loss as in Zhang et al. (2021d). The loss consists of a cross-entropy loss on external labeled data and a masked language modelling (MLM) loss on all available data from the current domain:

$\mathcal{L} = \mathcal{L}_{\text{ce}}(\mathcal{D}^{\text{labeled}}_{\text{external}}; \theta) + \mathcal{L}_{\text{mlm}}(\mathcal{D}^{\text{all}}_{\text{internal}}; \theta), \quad (1)$

where $\theta$ denotes the model parameters.
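To make the joint objective concrete, the following is a minimal sketch, not the authors' released code, of how Eq. 1 could be implemented with PyTorch and HuggingFace Transformers. The batch structure, the 15% masking rate, the placeholder label-set size, and the unweighted sum of the two losses are our assumptions.

```python
import torch
import torch.nn as nn
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling)

NUM_EXTERNAL_INTENTS = 150  # placeholder: size of the external label set (e.g., CLINC150)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
backbone = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")   # BERT encoder + MLM head
classifier = nn.Linear(backbone.config.hidden_size, NUM_EXTERNAL_INTENTS)  # intent head for external data
mlm_collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)  # masking rate assumed

optimizer = torch.optim.AdamW(
    list(backbone.parameters()) + list(classifier.parameters()), lr=5e-5)
ce = nn.CrossEntropyLoss()

def pretraining_step(external_batch, internal_input_ids):
    """external_batch: dict with tokenized labeled utterances ('input_ids', 'attention_mask')
    and intent 'labels' from the external dataset.
    internal_input_ids: list of token-id lists for (un)labeled utterances of the current domain."""
    # (1) cross-entropy on external labeled intents, using the [CLS] representation
    hidden = backbone.bert(input_ids=external_batch["input_ids"],
                           attention_mask=external_batch["attention_mask"]).last_hidden_state
    loss_ce = ce(classifier(hidden[:, 0]), external_batch["labels"])

    # (2) masked language modelling on internal (in-domain) utterances
    masked = mlm_collator([{"input_ids": ids} for ids in internal_input_ids])
    loss_mlm = backbone(input_ids=masked["input_ids"], labels=masked["labels"]).loss

    loss = loss_ce + loss_mlm  # joint objective of Eq. 1
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In each step, one batch from the external labeled set and one batch of in-domain utterances are consumed, so the two tasks share the same encoder throughout pre-training.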
For the supervised classification task, we leverage an external public intent dataset with diverse domains (e.g., CLINC150 (Larson et al., 2019)), denoted as $\mathcal{D}^{\text{labeled}}_{\text{external}}$, following Zhang et al. (2021d). For the self-supervised MLM task, we use all available data (labeled or unlabeled) from the current domain, denoted as $\mathcal{D}^{\text{all}}_{\text{internal}}$. Intuitively, the classification task aims to learn general knowledge of intent recognition from annotated utterances in external intent datasets, while the self-supervised task learns domain-specific semantics from utterances collected in the current domain. Together, they enable learning semantic utterance representations that provide proper cues for the subsequent clustering task. As will be shown in Sec. 4.3, both tasks are essential for NID.

For semi-supervised NID, we can further utilize the annotated data in the current domain to conduct continual pre-training, by replacing $\mathcal{D}^{\text{labeled}}_{\text{external}}$ in Eq. 1 with $\mathcal{D}^{\text{labeled}}_{\text{known}}$. This step is not included in unsupervised NID.

In the second stage, we propose a contrastive learning objective that pulls together neighboring instances and pushes away distant ones in the embedding space to learn compact representations for clustering. Concretely, we first encode the utterances with the pre-trained model from stage 1. Then, for each utterance $x_i$, we search for its top-K nearest neighbors in the embedding space, using the inner product as the distance metric, to form a neighborhood $\mathcal{N}_i$. The utterances in $\mathcal{N}_i$ are supposed to share a similar intent with $x_i$.

During training, we sample a mini-batch of utterances $B = \{x_i\}_{i=1}^{M}$. For each utterance $x_i \in B$, we uniformly sample one neighbor $x_i'$ from its neighborhood $\mathcal{N}_i$. We then use data augmentation to generate $\tilde{x}_i$ and $\tilde{x}_i'$ for $x_i$ and $x_i'$, respectively. Here, we treat $\tilde{x}_i$ and $\tilde{x}_i'$ as two views of $x_i$, which form a positive pair. We then obtain an augmented batch $B'$ with all the generated samples. To compute the contrastive loss, we construct an adjacency matrix $A$ for $B'$, which is a $2M \times 2M$ binary matrix where 1 indicates a positive relation (either being neighbors or having the same intent label in semi-supervised NID) and 0 indicates a negative relation. Hence, we can write the contrastive loss as:

$\mathcal{L}_i = -\frac{1}{|\mathcal{C}_i|} \sum_{j \in \mathcal{C}_i} \log \frac{\exp(\mathrm{sim}(\tilde{h}_i, \tilde{h}_j)/\tau)}{\sum_{k=1, k \neq i}^{2M} \exp(\mathrm{sim}(\tilde{h}_i, \tilde{h}_k)/\tau)}, \quad (2)$

where $\mathcal{C}_i = \{j \mid A_{ij} = 1, j \in \{1, \dots, 2M\}\}$ denotes the set of instances having a positive relation with $\tilde{x}_i$ and $|\mathcal{C}_i|$ is its cardinality, $\tilde{h}_i$ is the embedding of utterance $\tilde{x}_i$, $\tau$ is the temperature parameter, and $\mathrm{sim}(\cdot, \cdot)$ is a similarity function (e.g., dot product) on a pair of normalized feature vectors. During training, the neighborhood is updated every few epochs. We implement the contrastive loss following Khosla et al. (2020). Notice that the main difference between Eq. 2 and the conventional contrastive loss is how we construct the set of positive instances $\mathcal{C}_i$. The conventional contrastive loss can be regarded as a special case of Eq. 2 with neighborhood size K = 0, where the same instance is augmented twice to form a positive pair (Chen et al., 2020). After contrastive learning, a non-parametric clustering algorithm such as k-means can be applied to obtain cluster assignments.
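As a concrete reference, here is a minimal PyTorch sketch of the neighborhood-based contrastive loss in Eq. 2, following the supervised contrastive formulation of Khosla et al. (2020) that the paper builds on. The function name and the assumption that the embeddings are already L2-normalized are ours.

```python
import torch

def clnn_contrastive_loss(embeddings: torch.Tensor,
                          adjacency: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Eq. 2, averaged over an augmented batch B'.

    embeddings: (2M, d) L2-normalized features h~_i of the augmented batch.
    adjacency:  (2M, 2M) binary matrix A; A[i, j] = 1 if x_j is a positive for x_i
                (a mined neighbor, or an utterance with the same known-intent label
                in the semi-supervised case).
    """
    n = embeddings.size(0)
    logits = embeddings @ embeddings.t() / temperature            # sim(h~_i, h~_k) / tau
    self_mask = torch.eye(n, dtype=torch.bool, device=logits.device)

    # denominator of Eq. 2: all instances in the batch except i itself
    exp_logits = torch.exp(logits).masked_fill(self_mask, 0.0)
    log_prob = logits - torch.log(exp_logits.sum(dim=1, keepdim=True))

    # numerator: average log-probability over the positive set C_i
    pos_mask = adjacency.bool() & ~self_mask
    mean_log_prob_pos = (log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1)
                         / pos_mask.sum(dim=1).clamp(min=1))
    return -mean_log_prob_pos.mean()
```

In the unsupervised case, the adjacency matrix marks the two views of each utterance and the views of its sampled neighbor as positives; in the semi-supervised case, pairs sharing a known intent label are additionally set to 1, so labeled data are incorporated without any extra loss term.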
Data Augmentation. Strong data augmentation has been shown to be beneficial in contrastive learning (Chen et al., 2020). We find that it is inefficient to directly apply existing data augmentation methods such as EDA (Wei and Zou, 2019), which are designed for general sentence embedding. We observe that the intent of an utterance can be expressed by only a small subset of words such as "suggest restaurant" or "book a flight". While it is hard to identify the keywords of an unlabeled utterance, randomly replacing a small number of its tokens with random tokens from the vocabulary will not affect the intent semantics much. This approach works well in our experiments (see Table 5, RTR).

By introducing the notion of neighborhood relationship into contrastive learning, CLNN can 1) pull together similar instances and push away dissimilar ones to obtain more compact clusters; 2) utilize proximity in the embedding space rather than assigning noisy pseudo-labels (Van Gansbeke et al., 2020); 3) directly optimize in the feature space rather than on clustering logits as in Van Gansbeke et al. (2020), which has been shown to be more effective by Rebuffi et al. (2020); and 4) naturally incorporate known intents through the adjacency matrix.

Datasets. We conduct experiments on three benchmark intent recognition datasets, BANKING, StackOverflow, and M-CID, following Zhang et al. (2021b). Details about dataset splitting are provided in the Appendix.

Experimental Setup. We evaluate our proposed method on both unsupervised and semi-supervised NID. Notice that in unsupervised NID, no labeled utterances from the current domain are provided. For clarity, we define two variables: the proportion of known intents, $|\mathcal{C}_k|/(|\mathcal{C}_k|+|\mathcal{C}_u|)$, referred to as the known class ratio (KCR), and the proportion of labeled examples for each known intent, denoted as the labeled ratio (LAR). The labeled data are randomly sampled from the original training set. Notice that KCR = 0 corresponds to unsupervised NID and KCR > 0 to semi-supervised NID. In the following sections, we provide experimental results for both unsupervised NID and semi-supervised NID with KCR ∈ {25%, 50%, 75%} and LAR ∈ {10%, 50%}.

Evaluation Metrics. We adopt three popular evaluation metrics for clustering: normalized mutual information (NMI), adjusted Rand index (ARI), and accuracy (ACC).

Baselines and Model Variants. We summarize the baselines compared in our experiments for both unsupervised and semi-supervised NID.
• Unsupervised baselines. These include k-means on GloVe and vanilla BERT embeddings (GloVe-KM, BERT-KM), as well as clustering methods based on stacked auto-encoders such as SAE-KM and SAE-DCN (Yang et al., 2017).
• Semi-supervised baselines. These include BERT-KCL (Hsu et al., 2018) and DAC (Zhang et al., 2021c), which improves Deep Clustering (Caron et al., 2018) by aligning clusters between iterations.
• Our model variants include MTP and MTP-CLNN, which correspond to applying k-means on the utterance representations learned in stage 1 and stage 2, respectively. Further, we continue to train a DAC model on top of MTP to form a stronger baseline, MTP-DAC, for semi-supervised NID.

Implementation. We take the pre-trained bert-base-uncased model from Wolf et al. (2019) (https://github.com/huggingface/transformers) as our base model and use the [CLS] token as the BERT representation. For MTP, we first train until convergence on the external dataset, and then, when training on $\mathcal{D}^{\text{labeled}}_{\text{known}}$, we use a development set to validate early stopping with a patience of 20 epochs, following Zhang et al. (2021c). For contrastive learning, we project the 768-d BERT embedding to a 128-d vector with a two-layer MLP and set the temperature to 0.07. For mining nearest neighbors, we use the inner-product search method provided by Johnson et al. (2017). We set the neighborhood size K = 50 for BANKING and M-CID, and K = 500 for StackOverflow, since we empirically find that the optimal K should be roughly half of the average size of the training set for each class (see Section 4.4). The neighborhood is updated every 5 epochs. For data augmentation, the random token replacement probability is set to 0.25.
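For concreteness, the sketch below shows one plausible way to implement two of the ingredients described above: mining each utterance's top-K neighbors with an exact inner-product index (Faiss, Johnson et al., 2017) and the random token replacement (RTR) augmentation. The function names, the self-match handling, and the way special tokens are skipped are our assumptions rather than the authors' exact implementation.

```python
import random
import numpy as np
import faiss  # library of Johnson et al. (2017)

def mine_neighbors(embeddings: np.ndarray, k: int) -> np.ndarray:
    """Return, for each utterance, the indices of its top-k nearest neighbors
    under inner-product similarity.

    embeddings: (N, d) float32 array of stage-1 [CLS] features, L2-normalized
    so that the inner product behaves like cosine similarity.
    """
    index = faiss.IndexFlatIP(embeddings.shape[1])  # exact inner-product search
    index.add(embeddings)
    _, idx = index.search(embeddings, k + 1)        # each point also retrieves itself
    return idx[:, 1:]                               # drop the self column -> (N, k)

def random_token_replacement(token_ids, vocab_ids, p=0.25, special_ids=frozenset()):
    """RTR: replace a fraction p of (non-special) tokens with random vocabulary tokens."""
    return [random.choice(vocab_ids) if t not in special_ids and random.random() < p else t
            for t in token_ids]
```

In the paper's setup, K is 50 for BANKING and M-CID and 500 for StackOverflow, the neighborhoods are refreshed every 5 epochs, and the replacement probability is 0.25.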
For model optimization, we use the AdamW optimizer provided by Wolf et al. (2019). In stage 1, the learning rate is set to 5e-5. In stage 2, the learning rate is set to 1e-5 for BANKING and M-CID, and 1e-6 for StackOverflow. The batch sizes are chosen based on available GPU memory. All experiments are conducted on a single RTX-3090 and averaged over 10 different seeds. More details are provided in the Appendix.

Unsupervised NID. We show the results for unsupervised NID in Table 2. First, comparing the performance of BERT-KM with GloVe-KM and SAE-KM, we observe that BERT embeddings perform worse on NID even though they achieve better performance on NLP benchmarks such as GLUE, which shows that learning task-specific knowledge is important for NID. Second, our proposed pre-training method MTP improves upon the baselines by a large margin. Taking the NMI score on BANKING as an example, MTP outperforms the strongest baseline SAE-DCN by 14.38%, which demonstrates the effectiveness of exploiting both external public datasets and unlabeled internal utterances. Furthermore, MTP-CLNN improves upon MTP by around 5% in NMI, 10% in ARI, and 10% in ACC across different datasets.

Semi-supervised NID. The results for semi-supervised NID are shown in Table 3. First, MTP significantly outperforms the strongest baseline DAC in all settings. For instance, on M-CID, MTP achieves a 22.57% improvement over DAC in NMI. Moreover, MTP is less sensitive to the proportion of labeled classes. From KCR = 75% to KCR = 25% on M-CID, MTP only drops 8.55% in NMI, as opposed to about 21.58% for DAC. The smaller performance drop indicates that our pre-training method is much more label-efficient. Furthermore, with our proposed contrastive learning, MTP-CLNN consistently outperforms MTP and the combined baseline MTP-DAC. Taking BANKING with KCR = 25% as an example, MTP-CLNN improves upon MTP by 4.11% in NMI while surpassing MTP-DAC by 2.63%. A similar trend can be observed when LAR = 50%; we provide those results in the Appendix.

Visualization. In Fig. 3, we show the t-SNE visualization of clusters with embeddings learned by the two strongest baselines and our methods. It clearly shows the advantage of our methods, which produce more compact clusters. Results on the other datasets can be found in the Appendix.

To further illustrate the effectiveness of MTP, we conduct two ablation studies in this section. First, we compare MTP with the pre-training method employed in Zhang et al. (2021c), where only internal labeled data are utilized for supervised pre-training (denoted as SUP). In Fig. 4, we show the results of both pre-training methods combined with CLNN under different proportions of known classes. Notice that when KCR = 0, there is no pre-training at all for SUP-CLNN. It can be seen that MTP-CLNN consistently outperforms SUP-CLNN. Furthermore, the performance gap increases as KCR decreases, and the largest gap is achieved when KCR = 0. This shows the high effectiveness of our method in data-scarce scenarios. Second, we decompose MTP into two parts: supervised pre-training on external public data (PUB) and self-supervised pre-training on internal unlabeled data (MLM). We report the results of the two pre-training methods combined with CLNN, as well as MTP, in Table 4. We can easily conclude that both PUB and MLM are indispensable and that multi-task pre-training is beneficial.

Number of Nearest Neighbors. We conduct an ablation study on the neighborhood size K in Fig. 5. We can make two main observations.
First, although the performance of MTP-CLNN varies with K, it still significantly outperforms MTP (dashed horizontal line) for a wide range of K. For example, MTP-CLNN is still better than MTP when K = 50 on StackOverflow or K = 200 on BANKING. Second, despite the difficulty of searching for K with only unlabeled data, we empirically find an effective estimation method, i.e., choosing K as half of the average size of the training set for each class (we presume prior knowledge of the number of clusters; there are off-the-shelf methods that can be applied directly in the embedding space to determine the optimal number of clusters (Zhang et al., 2021c)). It can be seen that the estimated K ≈ 60 on BANKING and K ≈ 40 on M-CID (vertical dashed lines) lie in the optimal regions, which shows the effectiveness of our empirical estimation method.

Exploration of Data Augmentation. We compare the Random Token Replacement (RTR) used in our experiments with other methods. For instance, dropout is applied on embeddings to provide data augmentation in Gao et al. (2021), randomly shuffling the order of input tokens is proven to be effective in Yan et al. (2021), and EDA (Wei and Zou, 2019) is often applied in text classification. Furthermore, we compare with a Stop-words Replacement (SWR) variant that only replaces stop-words with other random stop-words, so it minimally affects the intents of utterances. The results in Table 5 (ablation study on data augmentation for unsupervised NID; * marks the method used in the main results) demonstrate that (1) RTR and SWR consistently outperform the others, which verifies our hypothesis in Section 3.2, and (2) surprisingly, RTR and SWR perform on par with each other. For simplicity, we only report the results with RTR in the main experiments.

We have provided simple and effective solutions for two fundamental research questions of new intent discovery (NID): (1) how to learn better utterance representations to provide proper cues for clustering and (2) how to better cluster utterances in the representation space. In the first stage, we use a multi-task pre-training strategy to exploit both external and internal data for representation learning. In the second stage, we perform contrastive learning with mined nearest neighbors to exploit self-supervisory signals in the representation space. Extensive experiments on three intent recognition benchmarks show that our approach can significantly improve the performance of NID in both unsupervised and semi-supervised scenarios.

There are two limitations of this work. (1) We have only evaluated on balanced data; in real-world applications, most datasets are highly imbalanced. (2) The discovered clusters lack interpretability: our clustering method can only assign a cluster label to each unlabeled utterance but cannot generate a valid intent name for each cluster.

We would like to thank the anonymous reviewers for their valuable comments. This research was supported by the grants of HK ITF UIM/377 and PolyU DaSAIL project P0030935 funded by RGC.

In this section, we provide more details about the datasets. The development sets are prepared to exclude unknown intents.
• BANKING (Casanueva et al., 2020) is a fine-grained intent detection dataset in which 77 intents are collected for a banking dialogue system. The dataset is split into 9,003, 1,000, and 3,080 examples for the training, validation, and test sets, respectively.
• StackOverflow (Xu et al., 2015) is a large-scale dataset of online questions which contains 20 intents with 1,000 examples in each class. We split the dataset into 18,000 examples for training, 1,000 for validation, and 1,000 for testing.
• M-CID (Arora et al., 2020) is a small-scale dataset of cross-lingual Covid-19 queries. We only use the English subset of this dataset, which has 16 intents. We split the dataset into 1,220 examples for training, 176 for validation, and 349 for testing.
• CLINC150 (Larson et al., 2019) consists of 10 domains across multiple unique services. We use 8 domains and remove the out-of-scope data. We only use this dataset during stage 1 of training.

The batch size is set to 64 for stage 1 and 128 for stage 2 in all experiments to fully utilize the GPU memory. In stage 1, we first train until convergence on the external data and then train with validation on the internal data. In stage 2, we train until convergence without early stopping.

The results on semi-supervised NID when LAR = 50% are shown in Table 6. It can be seen that our methods still achieve the best performance in this case. In Fig. 6 and Fig. 7, we show the t-SNE visualization of clusters on BANKING and M-CID with embeddings learned by the two strongest baselines and our methods. Again, it shows that our methods produce more compact clusters.

Table 6: Performance on semi-supervised NID with different known class ratios. The LAR is set to 50%. For each dataset, the best results are marked in bold. Comb denotes the baseline method combined with our proposed MTP.

References
• Mrinal Mohit, Lorena Sainz-Maza Lecanda, and Ahmed Aly. 2020. Cross-lingual transfer learning for intent detection of Covid-19 utterances.
• Learning representations by maximizing mutual information across views.
• Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners.
• Deep clustering for unsupervised learning of visual features.
• Momentum contrast for unsupervised visual representation learning.
• ConveRT: Efficient and accurate conversational representations from transformers.
• A simple language model for task-oriented dialogue.
• Learning to cluster in order to transfer across domains and tasks.
• Multi-class classification without multi-class labels.
• Billion-scale similarity search with GPUs.
• Supervised contrastive learning.
• Self-guided contrastive learning for BERT sentence representations.
• An evaluation dataset for intent classification and out-of-scope prediction.
• Discovering new intents via constrained deep adaptive clustering with cluster refinement.
• Some methods for classification and analysis of multivariate observations.
• DialoGLUE: A natural language understanding benchmark for task-oriented dialogue.
• Intent discovery through unsupervised semantic text clustering.
• GloVe: Global vectors for word representation.
• Dialog intent induction with deep multi-view clustering.
• Improving language understanding by generative pre-training.
• LSDC: Linearly separable deep clusters. arXiv.
• Autodialabel: Labeling dialogue data with unsupervised learning.
• SCAN: Learning to classify images without labels.
• Automatic discovery of novel intents & domains from text utterances.
• ConvFiT: Conversational fine-tuning of pretrained language models.
• EDA: Easy data augmentation techniques for boosting performance on text classification tasks.
• TOD-BERT: Pre-trained natural language understanding for task-oriented dialogue.
• Unsupervised deep embedding for clustering analysis.
• Short text clustering via convolutional neural networks.
• ConSERT: A contrastive framework for self-supervised sentence representation transfer.
• Towards k-means-friendly spaces: Simultaneous deep learning and clustering.
• Supporting clustering with contrastive learning.
• TEXTOIR: An integrated and visualized platform for text open intent recognition.
• Discovering new intents with deep aligned clustering.
• Effectiveness of pre-training for few-shot intent classification.
• Few-shot intent detection via contrastive pre-training and fine-tuning.
• Discriminative nearest neighbor few-shot intent detection by transferring natural language inference.