key: cord-0602739-ln2dbqn2 authors: Najafi, Ali; Gholipour-Shilabin, Araz; Dehkharghani, Rahim; Mohammadpur-Fard, Ali; Asgari-Chenaghlu, Meysam title: ComStreamClust: A communicative text clustering approach to topic detection in streaming data date: 2020-10-11 journal: nan DOI: nan sha: 47f0d679dc9e23c7f0783491d843d261ec3fb05c doc_id: 602739 cord_uid: ln2dbqn2

Topic detection is the task of determining and tracking hot topics in social media. Twitter is arguably the most popular platform for people to share their ideas about different issues. One such prevalent issue is the COVID-19 pandemic. Detecting and tracking topics on such issues would help governments and healthcare companies deal with this phenomenon. In this paper, we propose a novel communicative clustering approach, called ComStreamClust, for clustering sub-topics inside a broader topic, e.g., COVID-19. The proposed approach was evaluated on two datasets: COVID-19 and FA CUP. The results obtained from ComStreamClust confirm the effectiveness of the proposed approach when compared to existing methods such as LDA.

Social media, which has achieved growing popularity in recent decades, provides the opportunity for people to share their ideas with an enormous number of users around the world. As a micro-blogging platform, Twitter allows its users to write short text messages on various issues, ranging from politics, the economy, and healthcare to people's routine daily tasks. One such issue, the COVID-19 pandemic, has had a profound impact on people's social lives since the beginning of 2020. Determining and tracking health issues such as COVID-19 on Twitter would help governments and healthcare companies better handle the impact of such diseases on societies. Concretely, assembling tweets on this topic and analyzing them would yield valuable information for those companies.
From the healthcare perspective, crawling tweets related to COVID-19 as a pandemic issue might help in finding a remedy for it. As manual processing of such information is prohibitively expensive, automatic or semi-automatic methods are needed; however, assembling and distilling such data is a challenging task. Previous works have tackled this problem by streaming and grouping tweets into various categories using supervised [1] or unsupervised [2] methods. Unsupervised methods, which attempt to cluster tweets into previously unknown topics, have gained greater popularity. More specifically, they collect streaming tweets in a time interval and assign them to clusters based on their topics. Clustering has been broadly used for topic detection in the literature. Stream data clustering is a two-phase task: in the first phase, data is captured from the stream; in the second phase, clusters are created and possibly re-organised in order to constitute denser clusters. The ultimate goal is to increase intra-cluster similarity while decreasing inter-cluster similarity. To tackle the aforementioned topic detection problem, we propose a communicative text clustering approach for tweet clustering, which has been applied to the COVID-19 and FA CUP datasets and is described in greater detail in Section 3. The key aspect of this work is its communicative and multiprocessing structure. The difference between this work and existing ones lies in the second phase mentioned above: in the communication step, existing clusters may export data to, and/or import data from, other clusters. At the same time, the proposed approach is also capable of distinguishing and ignoring outlier data. The obtained results provide confirmatory evidence that the proposed approach is effective and superior to existing ones in topic detection on Twitter data. Ibrahim et al.
[3] divide topic detection techniques into five groups: clustering, frequent pattern mining, exemplar-based, matrix factorization, and probabilistic models. The current research falls into clustering-based methods. Stream clustering is a type of clustering in which data is continuously fed to a clustering system. Given this sequence of data, the goal is to group the data into clusters whose elements are similar to each other but different from the elements of other clusters. Unlike traditional clustering approaches that rely on a fixed set of input data, stream clustering assumes that input data arrives as a stream with an unknown number of usually unlabelled data points. Along with the fast growth of social media such as Twitter, stream text clustering has gained growing popularity in recent decades. Several approaches have tackled this problem. The method proposed in [4] incrementally builds micro-clusters, which are later re-clustered to assemble the final clusters. The idea of micro-clusters was used earlier in [5], in which the micro-cluster-based online clustering algorithm DBSTREAM was introduced; in that work, Hahsler and Bolanos took into account the density of the area between micro-clusters for the first time in the literature. From another perspective, Fang et al. [6] categorize topic detection methods on Twitter into two main groups: traditional and new topic detection methods. On the traditional side, some research works [7] use an extension of LDA [8] for solving the topic detection problem. Others tackle this problem by constructing a term co-occurrence network of keywords [9], or by single-pass clustering along with a new threshold method [10]. Although traditional methods work well for long texts, they do not achieve high performance on short texts such as tweets. Therefore, in new topic detection methods, traditional approaches are extended to deal with the new data types.
In [6], the authors propose a new topic detection framework based on multi-view clustering. Gradient-boosted decision trees are another method for topic detection, used in [11]. Computational cost is one of the major challenges in real-time topic or event detection on Twitter. Hasan et al. [12] deal with this issue by proposing an event detection system called TwitterNews+, which utilizes inverted indices and an incremental clustering approach with a low computational cost. Asgari et al. [13] propose a model based on the universal sentence encoder [14] and transformers [15] to detect the main topics of tweets regarding the COVID-19 pandemic. Early detection of bursty topics is one of the most challenging problems in this area; TopicSketch [16] has been proposed for dealing with this problem on Twitter. Similarly, PoliTwi [17] has been proposed for the early detection of emerging political events. There are several other approaches to the problem tackled in this paper, using methods such as formal concept analysis [18], clustering based on n-grams and named entity boosting [19], and a combination of singular value decomposition and k-means clustering [20]. Twitter streaming data is a sequence of data in which data points appear as time passes. The problem tackled in this paper can be formally defined as follows: each data point is assumed to be a quadruple (id, t, ts, s), such that id is a unique value serving as the identification number; t is the text, with at most 280 characters; ts is the timestamp of the tweet, including its arrival date and time; and s is the subject of the tweet, which is not known in advance. Once the subject s is determined, the tweet can be assigned to one of the existing clusters. Given a set of topic clusters, the task is to assign a newly arrived tweet to one of the clusters.
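The quadruple above maps naturally onto a small record type. The sketch below is purely illustrative: the field names follow the problem definition, but the `Tweet` class and its defaults are our own, not part of the paper's reference implementation.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Tweet:
    """One streaming data point (id, t, ts, s) as defined above."""
    id: int                   # unique identification number
    t: str                    # tweet text, at most 280 characters
    ts: str                   # arrival timestamp (date and time)
    s: Optional[str] = None   # subject/topic; unknown until cluster assignment

tw = Tweet(id=1, t="Stay home, stay safe #COVID19", ts="2020-03-29 10:00:00")
assert tw.s is None  # the subject is only set after the tweet is assigned
```

Keeping `s` optional mirrors the definition: the subject attribute stays unset until the assignment step fills it in.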
After this assignment, the attribute s of the tweet will be initialized; however, it may be updated later. ComStreamClust consists of three main steps: (1) data streaming, in which data points, i.e., tweets, are fetched from Twitter streaming data; (2) data assignment, where the newly arrived tweet is assigned to an existing or new cluster; and (3) data exchange, a communication-based step in which data is exchanged among clusters to build denser clusters. The initialization phase is not considered a main step because it is done only once at the beginning. The different steps of the proposed approach are described in greater detail in their respective subsections. The proposed approach is illustrated in Figure 3 as a flowchart. The reference implementation of the proposed approach is also released under the MIT license 1. The initialization phase receives and handles the first k tweets; we set k to 10. The first k tweets are randomly assigned to the initial agents; this phase is finished when all initial agents are filled. The initial number of agents, identified by the init-agents parameter, and the initial capacity of these agents, indicated by init-agent-cap, are set to 5 and 2 respectively for both datasets (5 * 2 = 10). In other words, the first ten tweets are randomly assigned to five agents. In the streaming step, data is caught from a data stream, e.g., Twitter. Such data sources have a unique characteristic: in contrast to other data sources, access to the dataset is limited by time; all the data simply does not exist yet. In other words, the system receives data one data point at a time. In this paper, two streaming data sources are used for topic detection: the COVID-19 dataset and the FA CUP dataset. These datasets are introduced in Section 4.1. The streaming step fetches tweets from the data stream one data point at a time and, after passing each tweet through a preprocessing channel, assigns it to one of the agents.
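The initialization phase described above (k = init-agents × init-agent-cap) can be sketched as follows; the helper name and the capacity check are our own reading of the description, not the released code:

```python
import random

def initialize(tweets, init_agents=5, init_agent_cap=2, seed=0):
    """Randomly assign the first init_agents * init_agent_cap tweets to agents."""
    rng = random.Random(seed)
    agents = [[] for _ in range(init_agents)]
    k = init_agents * init_agent_cap  # k = 10 in the paper
    for tweet in tweets[:k]:
        # choose uniformly among agents that still have free initial capacity
        open_agents = [a for a in agents if len(a) < init_agent_cap]
        rng.choice(open_agents).append(tweet)
    return agents

agents = initialize([f"tweet-{i}" for i in range(10)])
assert sum(len(a) for a in agents) == 10 and all(len(a) == 2 for a in agents)
```

Restricting the random choice to agents with free capacity guarantees the phase ends with all five agents holding exactly two tweets.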
Because of the short length of tweets, Twitter users usually prefer informal language. Due to this informality, tweets must first be preprocessed before further processing. In this phase, several subtasks are applied to the newly-arrived tweet before passing it to the next step (data assignment).
• URL removal: hyperlinks to webpages inside the tweets are removed, as they generally do not contribute to the topic.
• Hashtag tokenization: hashtags are words or phrases starting with a '#' character. Hashtags may contribute to the topic as they carry contextual information; a hashtag semantically links a tweet to all other tweets including it. As a hashtag usually consists of several words, it is separated into its constituent words.
• Mention removal: Twitter users may be mentioned in a tweet with an '@' character followed by their username. These names are removed from tweets, as they usually do not contain useful information for topic detection.
• Tweet cleaning: as explained above, informal language is often preferred on Twitter and tends to include digits or special characters such as "[]()/!?.". Such characters are also removed in this subtask.
• Tweet tokenization: finally, the cleaned tweets are tokenized on whitespace characters. Bigrams and trigrams with a hyphen between words are left unchanged, as they usually convey a non-compositional phrase, e.g., "brother-in-law" or "chronicle-independent". Moreover, all uppercase words are lowercased.
In the data assignment step, the coordinator receives a tweet and assigns it to the existing cluster that is semantically most similar to it. Each cluster has a weight, which is incremented by one after each tweet assignment. The similarity between a tweet and a cluster is measured by the similarity of the topics they cover.
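The preprocessing subtasks listed above can be chained into a single function. This is a simplified sketch: in particular, splitting a multi-word hashtag into its constituent words would need a word-segmentation step that is omitted here (the '#' is merely stripped), and the regular expressions are our own approximations rather than the paper's implementation.

```python
import re

def preprocess(text):
    """Apply the preprocessing subtasks to one tweet (simplified sketch)."""
    text = re.sub(r"https?://\S+", " ", text)   # URL removal
    text = re.sub(r"@\w+", " ", text)           # mention removal
    text = re.sub(r"#(\w+)", r"\1", text)       # hashtag: keep the body
    # tweet cleaning: drop digits, underscores, and special characters,
    # but keep intra-word hyphens ("brother-in-law" stays one token)
    text = re.sub(r"[^\w\s-]|[\d_]", " ", text)
    return text.lower().split()                 # whitespace tokenization + lowercasing

tokens = preprocess("My brother-in-law says: stay home! #StaySafe @user https://t.co/x")
# tokens == ['my', 'brother-in-law', 'says', 'stay', 'home', 'staysafe']
```

Note how the hyphenated trigram survives cleaning as one token, matching the tokenization rule above.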
To measure this similarity, we first compute the TF-IDF score of the words in a tweet as an n-dimensional vector (n being the number of words in the tweet) and then measure the cosine similarity of this vector with the TF-IDF vectors of the existing clusters. The tweet is then assigned to the most similar cluster. Note that if the semantic distance of the tweet from the most similar cluster is greater than a given threshold parameter, namely radius, a new cluster containing only the newly-arrived tweet is created. The TF-IDF score used for calculating the similarity between a tweet and a cluster is computed according to equation (1):

tf-idf_{t,d} = w_{t,d} \cdot idf_t    (1)

In this equation, w_{t,d} is the weighted term frequency of term t in document d, which may be a newly-arrived tweet or the set of tweets inside a cluster (also called an agent); idf_t is the inverse document frequency of term t in the dataset, which is the set of all clusters. In other words, each cluster, containing one or more tweets, acts as a document. The cosine similarity of two n-dimensional vectors is computed by equation (2):

cos(tw, cl) = (tw \cdot cl) / (||tw|| \, ||cl||)    (2)

where tw and cl respectively represent the tweet and cluster vectors. The i-th element of these vectors is the tf-idf value of term i. idf_t for both vectors is computed according to the number of clusters containing the term t. tf_t in tw counts the occurrences of t in tw, whereas in the cl vector, tf_t counts the number of times the term t appears across all tweets included in the cluster cl. Note that in the cosine similarity, only the intersection of the terms in tw and cl is used; in other words, only those words of the arrived tweet that have already appeared in at least one of the existing tweets in cluster cl are taken into account. To prevent overflow in agents, a sliding window method is used. This window, indicated by the parameter slid-win-int, is set to 24 hours and two minutes respectively in the COVID-19 and FA CUP evaluations.
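A minimal sketch of the assignment step described above, under stated assumptions: this is our own simplified reimplementation, not the released code; idf is computed over clusters acting as documents, the dot product runs over the shared terms only, and we treat 1 − similarity as the distance compared against the radius, which is one plausible reading of the threshold test.

```python
import math
from collections import Counter

def cosine_sim(tweet_tokens, cluster, clusters):
    """TF-IDF cosine similarity between a tweet and one cluster of token lists."""
    n = len(clusters)
    def idf(term):  # clusters act as documents for idf
        df = sum(any(term in doc for doc in cl) for cl in clusters)
        return math.log(n / df) if df else 0.0
    tf_tw = Counter(tweet_tokens)
    tf_cl = Counter(tok for doc in cluster for tok in doc)
    shared = set(tf_tw) & set(tf_cl)  # only shared terms contribute
    dot = sum(tf_tw[t] * tf_cl[t] * idf(t) ** 2 for t in shared)
    norm_tw = math.sqrt(sum((tf_tw[t] * idf(t)) ** 2 for t in tf_tw))
    norm_cl = math.sqrt(sum((tf_cl[t] * idf(t)) ** 2 for t in tf_cl))
    return dot / (norm_tw * norm_cl) if norm_tw and norm_cl else 0.0

def assign(tweet_tokens, clusters, radius=0.75):
    """Assign a tweet to its most similar cluster, or open a new one."""
    sims = [cosine_sim(tweet_tokens, cl, clusters) for cl in clusters]
    best = max(range(len(clusters)), key=sims.__getitem__)
    if 1.0 - sims[best] > radius:        # too distant from every cluster
        clusters.append([tweet_tokens])  # new cluster with only this tweet
    else:
        clusters[best].append(tweet_tokens)

clusters = [[["covid", "stay", "home"]], [["football", "goal", "chelsea"]]]
assign(["stay", "home", "safe"], clusters)
assert len(clusters) == 2 and len(clusters[0]) == 2  # joined the COVID cluster
```

A tweet sharing no idf-weighted terms with any cluster gets similarity 0 everywhere and therefore opens a new cluster, matching the radius rule above.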
The sliding-window parameter differs from the timeslot parameter, which indicates the time interval after which the output of the system is stored and evaluated. In other words, topics and their keywords are extracted separately for each day in the COVID-19 dataset and for each minute in the FA CUP dataset. After each timeslot, a constant number of topics and a constant number of keywords per topic are stored; these constants are represented by the parameters no-topics and no-keywords. After the tweets are assigned to existing clusters in the data assignment step, a communication-based data exchange occurs periodically among the agents under the supervision of the coordinator. This period is a parameter in our approach, named comm-int. In this step, the coordinator redistributes outlier tweets among clusters to achieve higher cluster density. Concretely, in each timeslot, the agents determine their outlier tweets and return them to the coordinator. A tweet is considered an outlier in a cluster if its cosine similarity to the cluster centroid is lower than a given threshold. This threshold, named outlier-threshold, is another parameter, set to 0.78 and 0.73 respectively for the COVID-19 and FA CUP datasets. The coordinator then redistributes these outliers among the existing clusters (agents), again based on cosine similarity. The intuition behind this communication is that, due to the automatic update of cluster centroids caused by newly-added tweets, some tweets inside clusters gradually become outliers. In other words, the topic carried by an outlier gradually drifts away from the overall topic of the cluster containing it. At the end of each communication phase, the weight of each agent is reduced according to a parameter named agent-fading-rate. After this change, any agent whose weight falls below the threshold del-agent-weight-threshold is deleted. This weight is incremented on each data point's arrival at an agent, but not decremented when an outlier is removed from that agent. As mentioned, the proposed methodology includes several parameters, which have been initialized by trial and error. The complete list of parameters, with their values for the two datasets, is given in Table 1.

Table 1: The parameters used in the proposed methodology. cc, dp, and kw respectively stand for cluster center, data point, and keyword.

Parameter name              COVID-19  FA CUP    Explanation
init-agents                 5         5         initial number of agents
init-agent-cap              2         2         initial number of data points per agent
timeslot                    24h       1m        time interval at which the output is stored
comm-int                    1.5h      1m        time interval at which the communication phase repeats
slid-win-int                24h       2m        time interval after which a dp is considered old
assign-radius               0.75      0.70      max distance for assigning a dp to an agent
outlier-threshold           0.78      0.73      min similarity to the cc below which a dp is an outlier
no-topics                   variable  variable  number of topics stored in each timeslot
no-keywords                 5         9         number of kws per topic stored in each timeslot
agent-fading-rate           0.5       0         rate by which agent weights fade in communication
del-agent-weight-threshold  0.4       0         weight threshold for deleting agents

In this section, we evaluate the proposed approach on the two datasets, COVID-19 and FA CUP. The evaluation metrics, the ground truth, the datasets, and the results are all explained in detail in the following subsections. Evaluation metrics: we use topic recall, keyword recall, and keyword precision for evaluation. The F-score is also used for keyword evaluation, as the harmonic mean of precision and recall.
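Returning to the communication phase of the method, it can be sketched as follows, under stated assumptions: for self-containment we use a simple token-containment overlap as a stand-in for the paper's TF-IDF cosine similarity, we interpret agent-fading-rate as a multiplicative weight decay, and all function and variable names are our own.

```python
def overlap(doc, pool):
    """Stand-in similarity (token containment); the paper uses TF-IDF cosine."""
    doc, pool = set(doc), set(pool)
    return len(doc & pool) / len(doc) if doc else 0.0

def communicate(agents, weights, outlier_threshold=0.5,
                fading_rate=0.5, del_weight_threshold=0.4):
    """One communication phase: flag outliers, redistribute them, fade weights."""
    # 1) each agent hands tweets too dissimilar from the rest back to the coordinator
    outliers = []
    for agent in agents:
        keep = []
        for doc in agent:
            rest = [tok for d in agent if d is not doc for tok in d]
            sim = overlap(doc, rest) if rest else 1.0
            (keep if sim >= outlier_threshold else outliers).append(doc)
        agent[:] = keep
    # 2) the coordinator reassigns each outlier to its most similar agent
    for doc in outliers:
        pools = [[tok for d in ag for tok in d] for ag in agents]
        best = max(range(len(agents)), key=lambda i: overlap(doc, pools[i]))
        agents[best].append(doc)
    # 3) agent weights fade; agents below the threshold would then be deleted
    weights[:] = [w * (1.0 - fading_rate) for w in weights]
    return [i for i, w in enumerate(weights) if w < del_weight_threshold]

agents = [[["stay", "home"], ["stay", "safe"], ["goal", "chelsea"]],
          [["goal", "drogba"]]]
weights = [3.0, 0.6]
faded = communicate(agents, weights)
assert agents[1] == [["goal", "drogba"], ["goal", "chelsea"]]  # outlier moved
assert faded == [1]  # the lightweight agent falls below the deletion threshold
```

The football tweet is flagged as an outlier of the first agent and lands in the second, denser-matching agent, which is exactly the re-clustering effect the communication phase aims for.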
These evaluation metrics are calculated according to equations (3) through (5). Topic precision is the ratio of correctly extracted topics to the total number of extracted topics. Topic recall is the ratio of correctly extracted topics to the total number of topics. Keyword recall is the ratio of correctly extracted keywords of a topic to the total number of keywords of that topic. The correct topics and their keywords have been collected in the ground truth. Note that we did not use topic precision, because there exist hot topics in the datasets which might not appear among the ground-truth topics. For example, daily events such as the death of someone's cat might be extracted as a hot topic, even though it is not relevant to any topic in the ground truth.

P_{kw} = |correctly extracted keywords for a hot topic| / |estimated keywords for a hot topic|    (3)

R_{tp} = |correctly extracted hot topics| / |hot topics in ground truth| ,  R_{kw} = |correctly extracted keywords for a hot topic| / |keywords for a hot topic in ground truth|    (4)

F_{kw} = 2 \cdot P_{kw} \cdot R_{kw} / (P_{kw} + R_{kw})    (5)

Ground truth: ground-truth data is available for the FA CUP dataset and is explained in detail in [21]. However, to the best of the authors' knowledge, there is no ground truth readily available for the COVID-19 dataset. Therefore, we manually extracted the hot topics and their associated keywords from online media and search engines, separately for each day from March 29 to April 30. We have made this ground-truth data publicly available 2. As already mentioned, two datasets have been used in this paper: COVID-19 and FA CUP. The Football Association Challenge Cup (FA CUP) dataset was compiled during a football match between Chelsea and Liverpool on May 5, 2012, from 16:00:00 to 18:30:00. The data was crawled using key hashtags such as the event name and the names of the teams and key players. The set of hot topics in this dataset comprises 13 special moments, including the start and end of the match, the goal times, the penalizing of players, and so on.
The FA CUP dataset includes about 113 thousand English tweets, already labelled in [21]. The COVID-19 dataset is a collection of tweets compiled from Twitter from March 29 to April 30, 2020, assembled by crawling tweets containing the hashtag #COVID19. The total number of tweets in this dataset is about 9 million, but we randomly chose one million tweets and manually labeled them based on their topics. We have made this subset publicly available for other researchers' use 3. The number of tweets in each timeslot for both datasets is illustrated as a histogram in Figures 1 and 2. Both datasets were collected automatically using hashtags; therefore, irrelevant (outlier) tweets may exist in both, which leads to challenges in topic detection. This issue is the reason for neglecting topic precision as an evaluation metric. The evaluation metrics include topic and keyword recall, keyword precision, and the F-score (for keyword evaluation). The overall values of these metrics in the case of two topics per timeslot are listed in Table 2. We used micro-averaging to compute the final values of precision and recall for both topic and keyword evaluation. These results were obtained with topic-number-per-timeslot set to 2 (@2) and keyword-number-per-topic set to 5 for COVID-19 and 9 for the FA CUP. We set the number of topics per timeslot to 2, as usually no more than two hot topics are discussed in the tweets of a certain timeslot; a higher number would label routine issues as hot topics. For example, the daily statistics in COVID-19 and every shot towards the goal in the FA CUP would be captured as hot topics. However, we experimented with the proposed approach using different values of this parameter in Section 4.3. To better understand the effect of the number of topics per timeslot on the final performance, we ran additional experiments, the results of which are portrayed in Figure 3.
It can be concluded from these diagrams that a higher number of topics results in higher topic recall, especially on the COVID-19 dataset. The intuition behind this steep slope is that the number of hot topics per timeslot ranges from 0 to 3; therefore, increasing the number of estimated hot topics increases topic recall up to 3, after which, due to the lower number of topics per timeslot in the ground truth, the curve rises with a gentler slope. However, increasing this number does not affect the other metrics (keyword precision and recall) much, because having more topics requires estimating more topic keywords, and our estimation is not always correct. Finally, we provide the t-distributed stochastic neighbor embedding (t-SNE) diagrams for both datasets in a specific timeslot in Figure 4. Each agent in this diagram represents a cluster of similar data points, i.e., tweets sharing a topic. We obtained different topic recall values for different values of topic-number-per-timeslot, which are provided in Table 3. This table also compares the obtained values with other approaches applied to the FA CUP dataset for topic detection. We could not provide such a comparison for the COVID-19 dataset, because it is new and, to the best of our knowledge, there is no published work on it. There were several erroneous cases causing lower topic recall in both datasets, such as the overshadowing phenomenon. In the FA CUP, for instance, Drogba scored a goal in the 24th minute, which is a hot topic in timeslot 24. There is another hot topic in the 25th minute, discussing the pass to Drogba before the goal. The latter topic was overshadowed by the first one, i.e., it was lost in minute 25 among tweets issued in minute 24. As can be seen in Table 3, the proposed approach is quite successful in extracting the hot topics on both datasets, as well as the keywords associated with those topics.
Keyword analysis: we tested the proposed approach with different values of the keywords-per-topic parameter and concluded that 5 keywords per topic achieve the best results. A lower number of keywords results in lower topic recall, due to detecting fewer topics, but higher topic precision. A greater number of keywords, on the other hand, detects more topics, causing higher topic recall and lower topic precision. At the end of all timeslots in each dataset, some keywords have been labeled as most frequent; these are illustrated as boxplots in Figures 5 and 6. As can be seen in the COVID-19 boxplot, the most frequent words are "stay", "pandemic", "home", and "people", which may imply that, due to the COVID-19 pandemic, people are telling each other to "stay at home". We observed that the majority of tweets include the hashtag #StayAtHome. Note that obvious frequent keywords such as "COVID19", "coronavirus", "corona", and "virus" have been treated as stopwords, as their inverse document frequency is 0. We chose a few sample tweets from each dataset, each including a hot topic in its timeslot; Table 4 lists these sample tweets. The strengths of the proposed approach include its dynamic nature and keeping the detected topics (clusters) as pure as possible. In other words, ComStreamClust attempts to keep clusters fresh by discarding older tweets, updating the clusters with newer ones, and detecting and deleting outliers. Note that a tweet may gradually turn into an outlier, due to the updates that happen to the cluster containing it. The main weakness of this approach is its need for parameter tuning, i.e., it requires adapting each parameter to the current dataset. This paper proposed a new topic detection approach using stream clustering on Twitter data.
The proposed approach, called ComStreamClust, is unique in that it benefits from a communication phase in which clusters communicate with each other under the supervision of a coordinator that decides which data should be assigned to which cluster. ComStreamClust has been applied to two datasets, COVID-19 and FA CUP. Its performance on the FA CUP dataset has been compared with other methods; we could not compare its performance on the COVID-19 dataset with other methods. The analysis of the COVID-19 dataset supports the assumption that social media can help governments and health centers handle this pandemic more efficiently and rapidly. For example, almost all Twitter users have used #StayAtHome in their tweets, which in turn reminds people that staying at home is the most effective measure against the pandemic. Our future work includes exploiting the Levenshtein distance for computing syntactic word similarity and using methods other than TF-IDF for measuring the semantic similarity between tweets. Furthermore, we will make the proposed approach parallel and asynchronous in its next version, such that agents can collaborate with each other without the coordinator's supervision, for the sake of lower time complexity.

References
[1] Sentimental causal rule discovery from Twitter.
[2] Emerging topic detection on Twitter based on temporal and social terms evaluation.
[3] Tools and approaches for topic detection from Twitter streams: survey.
[4] Stream clustering of chat messages with applications to Twitch streams.
[5] Clustering data streams based on shared density between micro-clusters.
[6] Detecting hot topics from Twitter: a multiview approach.
[7] LDA-based online topic detection using tensor factorization.
[8] Latent Dirichlet allocation.
[9] Hot topic detection in professional blogs.
[10] On-line new event detection using single pass clustering. University of Massachusetts.
[11] Detecting controversial events from Twitter.
[12] Real-time event detection from the Twitter data stream using the TwitterNews+ framework.
[13] Narjes Nikzad-Khasmakhi, and Shervin Minaee. COVID-Transformer: detecting trending topics on Twitter using universal sentence encoder.
[14] Chris Tar, et al. Universal sentence encoder.
[15] Attention is all you need.
[16] TopicSketch: real-time bursty topic detection from Twitter.
[17] PoliTwi: early detection of emerging political topics on Twitter and the impact on concept-level sentiment analysis. Knowledge-Based Systems.
[18] A step forward for topic detection in Twitter: an FCA-based approach.
[19] Topic detection using BNgram method and sentiment analysis on Twitter dataset.
[20] Combination of singular value decomposition and k-means clustering methods for topic detection on Twitter.
[21] Ayse Göker, Ioannis Kompatsiaris, and Alejandro Jaimes. Sensing trending topics in Twitter.