title: Enhancement of Short Text Clustering by Iterative Classification
authors: Rakib, Md Rashadul Hasan; Zeh, Norbert; Jankowska, Magdalena; Milios, Evangelos
date: 2020-05-26
journal: Natural Language Processing and Information Systems
DOI: 10.1007/978-3-030-51310-8_10

Abstract. Short text clustering is a challenging task due to the lack of signal contained in short texts. In this work, we propose iterative classification as a method to boost the clustering quality of short texts. The idea is to repeatedly reassign (classify) outliers to clusters until the cluster assignment stabilizes. The classifier used in each iteration is trained using the current set of cluster labels of the non-outliers; the input of the first iteration is the output of an arbitrary clustering algorithm. Thus, our method does not require any human-annotated labels for training. Our experimental results show that the proposed clustering enhancement method not only improves the clustering quality of different baseline clustering methods (e.g., k-means, k-means--, and hierarchical clustering) but also outperforms the state-of-the-art short text clustering methods on several short text datasets by a statistically significant margin.

Due to technological advances, short texts are generated in large volumes from different sources, such as micro-blogging, question answering, and social news aggregation websites. Organizing these texts is an important step towards discovering trends (e.g., political, economic) in conversations and in other data mining tasks, such as data summarization, frequent pattern analysis, and searching for and filtering information. Clustering the texts into groups of similar texts is the foundation for many of these organization strategies [1]. The lack of signal contained in short texts makes grouping them by shared topic difficult, leading to poor cohesion of the texts assigned to the same cluster.

The objective of our research is to improve the cohesion of clusters in a cluster partition produced by an arbitrary baseline clustering method. To achieve this, we remove outliers from each cluster and reassign them to the clusters with which they have greater similarity. We demonstrate that this approach produces more accurate cluster partitions than computationally more costly state-of-the-art short text clustering methods based on neural networks [2, 3].

The k-means algorithm can be viewed as an iterative classification algorithm. Starting with an initial (random) cluster partition, each iteration computes the centers of all clusters and uses these cluster centers as a classifier to reassign every input point to a new cluster. k-means-- [5] is a variation of k-means that achieves improved clustering performance by removing outliers before computing cluster centers, that is, before "training the classifier". The classification step then assigns all points to their closest cluster centers, including the outliers ignored when computing the cluster centers. Our method is inspired by this approach but uses a more sophisticated classifier than computing cluster centers and assigning every point to the closest center. Specifically, our method follows the approach of [6] and trains a classifier on the cluster labels of the non-outliers. Iterative classification then uses the trained classifier to reassign the outliers to clusters.
Just as with k-means, the resulting set of clusters is the input for the next iteration or, if this is the last iteration, the final set of clusters returned by the algorithm.

Iterative classification can be applied to any set of initial clusters and is thus independent of the method used to obtain these clusters. The quality of the final set of clusters, however, does depend on the method used to compute the initial clusters. We use k-means [7], k-means-- [5], and hierarchical clustering [7] with dense and sparse similarity matrices to compute the initial clusters. k-means and k-means-- clustering are applied to the vector representations of the texts. For hierarchical clustering, we use the text similarity matrix (dense or sparse). The dense similarity matrix stores the similarity value for every pair of texts, whereas the sparse similarity matrix keeps only a certain number of similarity values and discards the remaining ones (sets them to 0) [8]. Matrix sparsification can be performed using different criteria for choosing the values to discard. We consider two approaches here, one based on k-nearest neighbors [7] and the other based on the similarity distribution [9]. The k-nearest neighbor method keeps the k largest entries in each row. In the similarity distribution-based method, the number of similarities to keep in each row is not fixed. Instead, it is based on the distribution of the similarity values in each row, as characterized by the mean and standard deviation of these values. These sparsification methods are discussed in detail in Sect. 4.

The two main contributions of this work are as follows:
- We introduce iterative classification as a method that improves the clustering quality of different baseline clustering methods on various short text datasets and does not require human-annotated data to train the classification model. Our implementation of iterative classification and the datasets used in our experiments are publicly available.¹
- The combination of hierarchical clustering (using a sparse similarity matrix based on the similarity distribution [9]) and iterative classification performs better than other clustering methods combined with iterative classification. This combination outperforms the state-of-the-art short text clustering methods by a statistically significant margin.

A major challenge in short text clustering is the sparseness of the vector representations of these texts, resulting from the small number of words in each text. Several clustering methods have been proposed in the literature to address this challenge, including methods based on text augmentation [10, 11], neural networks [2, 3], topic modeling [12], and the Dirichlet mixture model [4].

A recent method based on text augmentation [11] uses topic diffusion to augment each short text by finding words not appearing in the text that are related to its content. To find related words, this method determines possible topics for each text using the existing words. Then new words are added to each text; these new words are closely related to the text's topics based on the posterior probabilities of the new words given the words in the text. An earlier text augmentation method [10] finds Wikipedia articles using the short text as a query string and uses the articles' titles as features.

A short text clustering method based on word embeddings and a convolutional neural network, called STC2-LE, was proposed in [2]. It uses a convolutional neural network to learn a text representation on which clustering is performed.
Another short text clustering method, based on weighted word embeddings and an autoencoder, was proposed in [3]. For each text, it calculates the average of the weighted embeddings [13] of its words. The weight of a word is calculated based on its inverse frequency in the corpus [3] and is multiplied with the word's embedding to obtain its weighted embedding. The embeddings of the texts are then fed into an autoencoder to obtain a low-dimensional representation of the texts on which clustering is performed.

Biterm topic modeling (BTM) [12] is a topic modeling approach for short texts that learns topics from word co-occurrence patterns (i.e., biterms). Given a topic distribution produced by BTM for each text, clustering is performed by assigning each text to its most probable topic.

A short text clustering method based on a Dirichlet process multinomial mixture model, called GSDPMM, was proposed in [4]. GSDPMM does not partition the input into a pre-specified number of clusters. It processes the texts one by one and assigns each text to a new cluster or to one of the existing clusters based on two factors: (a) a preference for clusters with more texts and (b) a preference for clusters whose texts share more words with the current text.

Sparsification of the text similarity matrix keeps the association between a text and its most similar (nearest) texts while breaking associations with less similar ones by setting the corresponding similarity scores to 0 [8]. Several similarity matrix sparsification methods have been discussed in the literature, including ones based on a global threshold [7], nearest neighbors [7], and center vectors [8].

Similarity matrix sparsification based on a global threshold is the simplest sparsification method. It removes all similarity values that are below a given threshold [7]. The problem with this method is that some real clusters may be destroyed or merged because different clusters may have different similarity levels between the texts they contain.

Nearest-neighbor-based methods for similarity matrix sparsification include k-nearest neighbor [7] and shared nearest neighbor [14]. k-nearest neighbor sparsification keeps only the k highest similarity scores for each text; the shared nearest neighbor approach adds the condition that texts retaining similarity values with a particular text should share a prescribed number of neighbors.

A similarity matrix sparsification method based on the center vector was proposed in [8]. Texts are represented by tf-idf (term frequency-inverse document frequency) vectors, and a center vector is computed by averaging these vectors. The similarity matrix is sparsified by removing the similarity between each pair of texts that are not more similar to each other than the maximum of the similarities of these two texts to the center vector.

Given a collection of short texts and a partition of these texts into clusters, iterative classification modifies the given cluster partition by detecting outliers in each cluster and changing the clusters to which they are assigned. This is repeated several times, hence the term iterative in the method's name. In each iteration, we generate a training set and a test set containing the non-outliers and the outliers, respectively. Then we train a classification algorithm on the training set and classify the test set using the trained model. This iterative process repeats until the stopping criterion discussed in Sect. 3.1 is satisfied.
The details are shown in Algorithm 1 and are described next. In each iteration, we choose a number P that roughly corresponds to the fraction of texts selected for the training set. P is chosen uniformly at random from an interval [P1, P2] determined in Sect. 6.2. To generate the training set, we remove outliers from each of the K clusters defined by the current cluster labels L. To remove outliers, we use an outlier detection algorithm called Isolation Forest, which is applied to the tf-idf vector representations of the texts. The algorithm isolates the texts that lie in low-density regions of the tf-idf feature space. If, after removing outliers, a cluster contains more than (n/K) × P texts, then we remove texts from that cluster uniformly at random to reduce the number of texts in the cluster to (n/K) × P. The reason for removing texts from each cluster is that we want each cluster to consist of roughly the same number of texts so as to reduce the bias of the classification algorithm. We add the removed texts to the test set and add the remaining texts to the training set. We train a classifier (multinomial logistic regression) using the non-outliers and their cluster labels. Then we classify the texts in the test set using the trained classifier. This defines a new set of cluster labels for the texts in the test set and thus produces an updated cluster partition.

Algorithm 1. Iterative classification
Require: D = set of n texts, L = initial cluster labels of the texts in D, K = number of clusters
Ensure: Enhanced cluster labels of the texts
1: maxIteration = 50
2: avgTextsPerCluster = n/K
3: for i = 1 to maxIteration do
4:   Choose a parameter P uniformly at random from the interval [P1, P2]. (P1 and P2 are parameters determined in Sect. 6.2. P bounds the fraction of texts kept per cluster.)
5:   Remove outliers from each of the K clusters defined by L using an outlier detection algorithm.
6:   If a cluster contains more than avgTextsPerCluster × P texts, remove texts from that cluster uniformly at random so that exactly avgTextsPerCluster × P texts remain in the cluster.
7:   testSet = texts removed in Steps 5 and 6; trainingSet = all the texts not in testSet
8:   Train a classifier using trainingSet and classify the texts in testSet. This assigns a new cluster label L(t) to each text t ∈ testSet.
9:   Stop iterative classification if the per-cluster text distribution becomes stable (as described in Sect. 3.1).
10: end for
11: return L

Iterative classification stops when it reaches the maximum number of iterations (i.e., 50) or the sizes of the clusters become stable. Let C_1, ..., C_K and C'_1, ..., C'_K be the clusters before and after an iteration, respectively. We consider the cluster sizes to be stable if the average absolute change of the cluster sizes is less than 5% of the average cluster size, that is, if

\[ \frac{1}{K} \sum_{i=1}^{K} \bigl|\, |C_i| - |C'_i| \,\bigr| < 0.05 \cdot \frac{n}{K}. \]

For example, consider the problem of partitioning 100 texts into two clusters. Then the average cluster size is 50. If one iteration assigns 48 texts to the first cluster and 52 texts to the second cluster, and the next iteration assigns 49 and 51 texts to these clusters, respectively, then the average absolute change of the cluster sizes is (1/2)(|48−49| + |52−51|) = 1. Since this is less than 5% of the average cluster size (50), we consider the cluster sizes to have stabilized.
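To make Algorithm 1 concrete, the following is a minimal sketch of the procedure in Python, assuming scikit-learn's IsolationForest as the outlier detector and its LogisticRegression (which handles the multinomial case) as the classifier; the function name and defaults are ours for illustration, not the authors' released code.

import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LogisticRegression

def iterative_classification(X, labels, K, p1=0.5, p2=0.95, max_iter=50, seed=None):
    """Refine cluster labels by repeatedly reclassifying outliers (sketch of Algorithm 1).

    X: (n, d) tf-idf matrix (dense or sparse); labels: initial cluster labels in {0, ..., K-1}.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    n = X.shape[0]
    avg_per_cluster = n / K
    prev_sizes = np.bincount(labels, minlength=K)

    for _ in range(max_iter):
        P = rng.uniform(p1, p2)
        train_mask = np.zeros(n, dtype=bool)
        for c in range(K):
            idx = np.flatnonzero(labels == c)
            if len(idx) == 0:
                continue
            Xc = X[idx]
            Xc = Xc.toarray() if hasattr(Xc, "toarray") else Xc
            # Keep the non-outliers of cluster c (IsolationForest labels outliers as -1).
            keep = idx[IsolationForest(random_state=0).fit_predict(Xc) == 1]
            # Cap the cluster's contribution to the training set at (n/K) * P texts.
            cap = int(avg_per_cluster * P)
            if len(keep) > cap:
                keep = rng.choice(keep, size=cap, replace=False)
            train_mask[keep] = True

        test_mask = ~train_mask
        if test_mask.sum() == 0 or len(np.unique(labels[train_mask])) < 2:
            break
        # Multinomial logistic regression trained on the non-outliers' current labels.
        clf = LogisticRegression(max_iter=1000)
        clf.fit(X[train_mask], labels[train_mask])
        labels = labels.copy()
        labels[test_mask] = clf.predict(X[test_mask])

        # Stop when the average absolute change of cluster sizes drops below 5% of n/K.
        sizes = np.bincount(labels, minlength=K)
        if np.abs(sizes - prev_sizes).mean() < 0.05 * avg_per_cluster:
            break
        prev_sizes = sizes
    return labels

A typical call would be labels = iterative_classification(X_tfidf, initial_labels, K=8), where initial_labels comes from any of the baseline clustering methods described in Sect. 5.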
The k-nearest neighbor (k-NN) sparsification method [7] uses the number of nearest neighbors k as a parameter. Its input is a square n × n symmetric similarity matrix S = (s_ij). The criterion is to retain, for each text, exactly the k highest similarities with this text outside of the diagonal. That is, for a text t_i, we retain the similarity s_ij between t_i and another text t_j if s_ij is among the k highest similarities of t_i. However, the similarity s_ji between t_j and t_i may not be retained, because s_ji may not be among the k highest similarities of t_j. Hence, after applying this criterion, the resulting sparsified matrix may not be symmetric. We therefore symmetrize the sparsified similarity matrix by retaining both s_ij and s_ji if either of them is retained by the criterion.

The similarity distribution (SD) based sparsification method was proposed in our previous work [9]. It sparsifies a similarity matrix based on the distribution of the similarity scores in the matrix. The input of this sparsification method is a symmetric similarity matrix for a set of n texts. The goal is to increase the signal-to-noise ratio in the matrix by keeping only the most significant similarity scores and setting the less significant ones to 0. Our criterion for setting entries to 0 may result in a non-symmetric matrix. Such a matrix requires symmetrization. We follow the sparsification-with-exclusion approach [7], which sets an entry s_ij to zero only if the sparsification criterion retains neither s_ij nor s_ji.

In contrast to the k-nearest neighbor method, the number of similarities to keep for each text is not fixed. Instead, it is based on the distribution of the similarity values between each text and all other texts. For each text t_i, we calculate the mean μ_i and standard deviation σ_i of the similarities between t_i and all other texts. Then we sparsify the similarities between t_i and the other texts based on these statistics. In particular, we define the retaining criterion as follows: a similarity s_ij is to be retained if and only if

\[ s_{ij} > \mu_i + \alpha \, \sigma_i \qquad (1) \]

for some global factor α. The factor α is chosen so that, after applying the criterion and symmetrizing the matrix, the average number of non-zero off-diagonal entries per row is equal to l = n/K − 1. Note that if each cluster has exactly n/K elements and we retain exactly the similarity scores between elements in the same cluster, then l is the number of non-zero off-diagonal entries in each row.

To choose the retained similarity values efficiently, we use an auxiliary value a_ij = (s_ij − μ_i) / σ_i for each similarity value s_ij. This is s_ij's deviation from the mean of row i, normalized by the standard deviation of row i. The criterion of Eq. 1 can then be restated as: a similarity s_ij is to be retained if and only if a_ij > α. Since we follow the sparsification-with-exclusion approach for symmetrization, we keep s_ij in the final symmetric matrix if the retaining criterion is fulfilled for s_ij or for s_ji. Thus, if the average number of non-zero off-diagonal entries per row is to be l, we need to retain N = (n · l) / 2 entries above the main diagonal, which is achieved by choosing α to be the N-th largest value in {max(a_ij, a_ji) | 1 ≤ i < j ≤ n}.
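The following NumPy sketch illustrates the SD-based sparsification just described, assuming a dense symmetric similarity matrix as input; the function name is ours, and ties at the threshold are handled by keeping the top-N pairs (the paper states the criterion with a strict inequality, which differs only on ties).

import numpy as np

def sd_sparsify(S, K):
    """Similarity-distribution-based sparsification (sketch).

    S: dense symmetric (n, n) similarity matrix; K: number of clusters.
    Returns a symmetric copy of S with less significant entries set to 0.
    """
    n = S.shape[0]
    off = ~np.eye(n, dtype=bool)                      # off-diagonal mask
    mu = S[off].reshape(n, n - 1).mean(axis=1)        # per-row mean, diagonal excluded
    sigma = S[off].reshape(n, n - 1).std(axis=1)      # per-row std, diagonal excluded
    A = (S - mu[:, None]) / sigma[:, None]            # a_ij = (s_ij - mu_i) / sigma_i

    # Choose alpha so that roughly l = n/K - 1 off-diagonal entries per row survive
    # after symmetrization, i.e. N = n * l / 2 entries above the main diagonal.
    l = n / K - 1
    N = int(round(n * l / 2))
    iu = np.triu_indices(n, k=1)
    scores = np.maximum(A[iu], A.T[iu])               # max(a_ij, a_ji) for i < j
    alpha = np.sort(scores)[::-1][max(N - 1, 0)]      # N-th largest such value

    # Sparsification with exclusion: keep s_ij if the criterion holds for s_ij or s_ji.
    keep = (A >= alpha) | (A.T >= alpha)
    np.fill_diagonal(keep, True)                      # keep self-similarities
    return np.where(keep, S, 0.0)

The k-NN variant can be sketched analogously by keeping, in each row, the k largest off-diagonal entries and then symmetrizing in the same way.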
k-means clustering [1] is used to cluster a collection of short texts into k clusters. First, k-means initializes k centers and assigns each text to its closest center. The algorithm then runs for a number of iterations. In each iteration, it recomputes the cluster centers from the texts assigned to each cluster and reassigns the texts to their closest centers. This iterative process continues until the algorithm reaches the maximum number of iterations or the cluster assignments become stable between two successive iterations.

k-means-- [5] is a variation of k-means clustering in which outliers are removed in each iteration before recomputing the cluster centers. To detect outliers, the short texts are ranked in decreasing order of their distances to their nearest cluster centers, and the d most distant texts (where d is a parameter specifying the total number of outliers) are considered outliers and removed from the clusters, so that the cluster centers become less sensitive to outliers. This has been confirmed to improve clustering performance.

Hierarchical agglomerative clustering uses a symmetric matrix storing pairwise similarities between documents. Such a matrix is dense if it stores a similarity for every pair of documents. The clustering method starts with each document in its own cluster and repeatedly merges the pair of most similar clusters until only k clusters (the desired number of clusters) remain.

A dense similarity matrix provides the most detailed information about pairwise text similarities, but the lowest similarity scores can be considered noise in the sense that they suggest (tenuous) connections between texts that are almost guaranteed to belong to different clusters. Setting these similarities to 0 increases the separation between clusters and produces better clustering results. We consider two sparsification methods in our experiments: k-nearest neighbor and similarity distribution based, which are discussed in Sects. 4.1 and 4.2, respectively. We form clusters from the two resulting sparse similarity matrices using the same hierarchical clustering method as discussed above.
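As an illustration only, the sketch below shows one way to run hierarchical agglomerative clustering on a precomputed (possibly sparsified) similarity matrix with scikit-learn by converting similarities to distances; it assumes scikit-learn 1.2 or later (where the parameter is named metric rather than affinity), and the linkage criterion is our assumption, since the paper does not specify it.

import numpy as np
from sklearn.cluster import AgglomerativeClustering

def hac_from_similarity(S, k, linkage="average"):
    """Hierarchical agglomerative clustering from a (possibly sparsified) similarity matrix.

    S: symmetric (n, n) similarity matrix with values in [0, 1]; entries zeroed by
    sparsification simply become maximal distances. Returns cluster labels.
    """
    D = 1.0 - S                       # convert similarities to distances
    np.fill_diagonal(D, 0.0)          # a text is at distance 0 from itself
    model = AgglomerativeClustering(
        n_clusters=k,
        metric="precomputed",         # distance matrix supplied directly
        linkage=linkage,              # linkage choice is an assumption on our part
    )
    return model.fit_predict(D)

# Example usage, with the dense matrix (HAC) or a sparsified one (HAC SD):
# labels_dense = hac_from_similarity(S_dense, k=8)
# labels_sd = hac_from_similarity(sd_sparsify(S_dense, K=8), k=8)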
We used five different datasets of short texts in our experiments. The basic properties of these datasets are shown in Table 1. SearchSnippet is a dataset of search results from Google's search engine, containing 12340 snippets distributed into 8 groups [15]. SearchSnippet-test is a subset of the SearchSnippet dataset consisting of 2280 search snippets distributed into 8 groups. AgNews is a subset of a dataset of news titles [16]. It consists of 8000 texts in 4 topic categories (for each category, we randomly selected 2000 texts). StackOverflow is a subset of the challenge data published on Kaggle², where 20000 question titles from 20 groups were randomly selected [2]. BioMedical is a subset of the challenge data published on the BioASQ website³, where 20000 paper titles from 20 groups were randomly selected [2].

Experimental Setup for Iterative Classification. We preprocessed the texts by removing stop words and converting them to lowercase. Then we transformed each text into its tf-idf vector representation for the given text collection. Each iteration of the iterative classification algorithm picks some percentage P of each cluster as the training set and reassigns the remaining texts to clusters based on a classifier trained on this training set; P is chosen uniformly at random from some interval [P1, P2]. To justify this approach and to determine optimal choices for P1 and P2, we ran preliminary experiments using a representative dataset (SearchSnippet-test). Specifically, we considered choosing P uniformly at random from the interval [P1, P2] or choosing a fixed percentage P in every iteration. For the former method, we determined the optimal combination of P1 and P2 (P1 = 0.5 and P2 = 0.95). For the latter, we determined the optimal choice of P (P = 0.6). Choosing P uniformly at random from the interval [0.5, 0.95] resulted in a clustering accuracy of 82.21 on the representative dataset, whereas choosing a fixed percentage P = 0.6 in every iteration resulted in a clustering accuracy of 80.25. Thus we chose P1 = 0.5 and P2 = 0.95 and chose P uniformly at random from this interval in all experiments.

Experimental Setup for Clustering. To perform clustering, we used the preprocessed texts described in Sect. 6.2. Texts were represented as vectors using pretrained word embeddings (i.e., Glove [17] and BioASQ [18]). The Glove embedding⁴ was trained using the Glove method [17] on Wikipedia dumps. The BioASQ embedding⁵ was trained using the Word2Vec method [13] on abstracts of biomedical publications. We used the Glove embedding for all datasets except the BioMedical dataset, since these datasets contain terms from general domains such as search snippets. For the BioMedical dataset, the BioASQ embedding was more appropriate due to its specific focus on biomedical terms. We represented each text by the average of the vectors of all words in the text. Then we applied the five different clustering methods described in Sect. 5 to the text vectors. For the k-means and k-means-- clustering algorithms, we used the text vectors as the points to be clustered. For hierarchical clustering, we constructed the dense similarity matrix by computing the cosine similarity between the vectors of all text pairs. After that, we sparsified the dense similarity matrix using the k-NN and similarity distribution based (SD) sparsification methods. We then applied hierarchical agglomerative clustering using the dense (HAC) and sparse similarity matrices (HAC k-NN and HAC SD).
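A minimal sketch of this text representation and of the dense similarity matrix construction is given below; it assumes the pretrained word vectors are already loaded into a Python dict mapping words to NumPy arrays (loading GloVe or BioASQ vectors is omitted), and the function names are ours.

import numpy as np

def text_vector(text, embeddings, dim):
    """Represent a preprocessed text by the average of its word vectors."""
    vecs = [embeddings[w] for w in text.split() if w in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def cosine_similarity_matrix(texts, embeddings, dim=300):
    """Dense pairwise cosine-similarity matrix over averaged text embeddings."""
    V = np.vstack([text_vector(t, embeddings, dim) for t in texts])
    norms = np.linalg.norm(V, axis=1, keepdims=True)
    V = V / np.where(norms == 0, 1.0, norms)   # avoid division by zero for empty texts
    return V @ V.T

The rows of V can be used directly as the points for k-means and k-means--, while the resulting matrix can be sparsified as in Sect. 4 and passed to hierarchical clustering.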
In our experiments, we use five datasets of short texts: SearchSnippet, SearchSnippet-test, AgNews, StackOverflow, and BioMedical. We used accuracy (ACC) and normalized mutual information (NMI) as the evaluation measures for the different clustering algorithms (as in [2]). The clustering results (ACC, NMI) for these datasets are shown in Table 2. The last two rows of Tables 2a and 2b show the ACC and NMI scores obtained using the state-of-the-art short text clustering methods STC2-LE [2] and SIF-Auto [3]. These two tables show the ACC and NMI scores of the five clustering algorithms, both before and after iterative classification, on the five datasets. The results with or without the IC suffix are the results with or without iterative classification. The best result (ACC, NMI) for each dataset is shown in bold.

To compensate for the dependence of k-means and k-means-- on the choice of cluster seeds, we ran the k-means and k-means-- clustering algorithms 20 times on each dataset and performed iterative classification on the clustering obtained in each run. We then calculated the mean and standard deviation of the 20 clustering results (ACC, NMI) obtained by k-means, k-means--, k-means IC, and k-means-- IC for each dataset. We ran hierarchical agglomerative clustering (HAC), HAC k-NN, and HAC SD only once, since HAC is deterministic. However, the enhancement obtained by iterative classification varies between runs, since the training and test sets are chosen randomly in each iteration. We therefore ran iterative classification 20 times on the clusterings obtained using HAC, HAC k-NN, and HAC SD, and again calculated the mean and standard deviation of the 20 clustering results obtained by HAC IC, HAC k-NN IC, and HAC SD IC for each dataset.

We evaluated whether iterative classification improves the initial clustering obtained using the different clustering algorithms. We consider iterative classification to improve the clustering for a given dataset if both ACC and NMI increase under iterative classification. Table 2 shows that iterative classification improves the initial clustering of short texts in terms of both ACC and NMI. For most of the datasets, the best ACC and NMI were obtained by applying iterative classification to the clustering obtained by HAC with SD sparsification (HAC SD). The reason is that HAC SD [9] produces better initial clusterings than the other clustering methods on these datasets, and the enhancement achieved by iterative classification depends on the initial clustering.

Comparison with State-of-the-Art Methods. Our second comparison assesses how the results of iterative classification in conjunction with the different clustering methods compare to state-of-the-art short text clustering methods, specifically STC2-LE [2] and SIF-Auto [3]. Tables 2a and 2b show that HAC SD IC and HAC k-NN IC outperform STC2-LE⁶ on the SearchSnippet, StackOverflow, and BioMedical datasets in terms of ACC and NMI. They also show that HAC SD IC, HAC k-NN IC, HAC IC, k-means IC, and k-means-- IC outperform SIF-Auto on the SearchSnippet and StackOverflow datasets in terms of ACC and NMI. However, on the BioMedical dataset, SIF-Auto performs better than any of the clustering methods and their corresponding enhancements by iterative classification.

Statistical Significance Testing of Clustering Performance. Our third comparison investigates whether the clustering improvements achieved by iterative classification are statistically significant. In particular, we perform two investigations: (a) whether the improved results achieved by iterative classification are statistically significantly better than the results of the corresponding base clustering methods, and (b) whether the results achieved by our best clustering method, HAC SD IC, are statistically significantly better than the results of the other clustering methods (with or without iterative classification) and of the state-of-the-art methods. For significance testing, we performed a two-tailed paired t-test (with significance level α = 0.05) on the pairwise differences of the clustering results (ACC, NMI) of the 20 runs obtained by different pairs of clustering methods. On all datasets except the BioMedical dataset, and for all clustering methods tested, the enhancement by iterative classification is statistically significantly better than the base clustering method, and these enhanced methods are in turn statistically significantly inferior to our method HAC SD IC. For the BioMedical dataset, the ACC and NMI scores achieved by HAC SD IC are statistically significantly better than those of STC2-LE. However, SIF-Auto outperforms HAC SD IC on the BioMedical dataset.
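For illustration, the significance test described above can be carried out with SciPy's paired t-test, assuming the per-run ACC (or NMI) scores of two methods over the same runs are available as equal-length sequences; the numbers below are hypothetical placeholders, not results from the paper.

from scipy import stats

# Per-run ACC scores of two methods on the same runs (hypothetical values; the paper uses 20 runs).
acc_base = [0.78, 0.80, 0.79, 0.81, 0.77]       # e.g., HAC SD (base clustering)
acc_enhanced = [0.82, 0.84, 0.83, 0.85, 0.81]   # e.g., HAC SD IC (after iterative classification)

# Two-tailed paired t-test on the per-run differences, significance level alpha = 0.05.
t_stat, p_value = stats.ttest_rel(acc_enhanced, acc_base)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}, significant: {p_value < 0.05}")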
We have demonstrated that iterative classification enhances the clustering of short texts on various short text datasets, based on initial clusters obtained using methods such as k-means, k-means--, hierarchical agglomerative clustering (HAC), and HAC with the k-NN and SD sparsification methods. The most promising results were obtained by applying iterative classification to the clustering obtained by HAC using the proposed SD sparsification (HAC SD IC). Experimental results show that HAC SD IC outperforms STC2-LE, a state-of-the-art short text clustering method based on a convolutional neural network, on all the datasets in terms of ACC and NMI. Moreover, HAC SD IC outperforms SIF-Auto, another state-of-the-art short text clustering method based on an autoencoder, in terms of ACC and NMI on several short text datasets.

The proposed clustering enhancement method advances the state of the art in short text clustering, which is important in practical contexts such as social media monitoring, product recommendation, and customer feedback analysis. The proposed method is a generic clustering enhancement approach for short texts into which various classification algorithms, initial clusterings, and numbers of clusters can easily be integrated. In the future, we will apply our clustering enhancement algorithm to long documents to investigate whether iterative classification leads to performance improvements there as well. We also plan to use phrase similarity as a basis for computing text similarity so as to obtain better text similarity scores, since the performance of clustering algorithms depends on the quality of the individual text similarity scores.

References
[1] A survey of text clustering algorithms
[2] Self-taught convolutional neural networks for short text clustering
[3] A self-training approach for short text clustering
[4] A model-based approach for text clustering with outlier detection
[5] A unified approach to detecting spatial outliers
[6] Unlabeled data classification via support vector machines and k-means clustering
[7] An introduction to cluster analysis for data mining
[8] Unsupervised sparsification of similarity graphs
[9] Improving short text clustering by similarity matrix sparsification
[10] Clustering short texts using Wikipedia
[11] Corpus-based topic diffusion for short text clustering
[12] BTM: topic modeling over short texts
[13] Efficient estimation of word representations in vector space
[14] Shared nearest neighbor clustering in a locality sensitive hashing framework
[15] Learning to classify short and sparse text & web with hidden topics from large-scale data collections
[16] Character-level convolutional networks for text classification
[17] Glove: global vectors for word representation
[18] Biomedical semantic indexing by deep neural network with multi-task learning