key: cord-0524185-ks9zw5kd authors: Jing, Xiaonan; Hu, Qingyuan; Zhang, Yi; Rayz, Julia Taylor title: Tracing Topic Transitions with Temporal Graph Clusters date: 2021-04-16 journal: nan DOI: nan sha: 8968a29af863b0ebc9c4388366d0d5d50b0b2fab doc_id: 524185 cord_uid: ks9zw5kd Twitter serves as a data source for many Natural Language Processing (NLP) tasks. It can be challenging to identify topics on Twitter due to continuous updating data stream. In this paper, we present an unsupervised graph based framework to identify the evolution of sub-topics within two weeks of real-world Twitter data. We first employ a Markov Clustering Algorithm (MCL) with a node removal method to identify optimal graph clusters from temporal Graph-of-Words (GoW). Subsequently, we model the clustering transitions between the temporal graphs to identify the topic evolution. Finally, the transition flows generated from both computational approach and human annotations are compared to ensure the validity of our framework. Continuously updating data streams make it challenging to identify real-time topics from platforms like Twitter. Previously, topic identification has mainly been studied on static dataset (Stoyanov and Cardie 2008; Lo, Chiong, and Cornforth 2017; Pappagari, Villalba, and Dehak 2018) . However, oftentimes, real-world events are dynamic. During a continuously evolving event, the center of topic can shift as new information being updated throughout the event duration. We are interested in learning the underlying structure of how an event unfolds in the online community, especially when the data are limited for studying user behaviors. Stream based event detection aims to identify a sequence of temporal states of the event(s). Given a continuous event across a set of timepoints {t 1 , t 2 , ..., t n }, we define a temporal (sub)-event as the state s i of the event at any timepoint t i (i < n), with t n being the final timepoint whereas the event has no further updates. One of the biggest challenges in stream based detection lays in locating the temporal state boundary across the timepoints. When a dynamic event evolves over time, the temporal state s 1 may remain unchanged across several timepoints, however, suddenly converting to state s 2 then subsequently emerging to s 3 . In other words, the temporal state at each timepoint is not independent of each other. There is a transition between the states Copyright © 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved. when a change is to occur. Thus, the detection of a significant change in state becomes crucial in stream based tasks. Previously, burst based detection (McMinn and Jose 2015; Kaneko and Yanai 2016) and anomaly based detection (Guille and Favre 2015; Fedoryszak et al. 2019 ) have been explored. Burst based detection exploits a frequency based approach, which does not emphasize the semantic content of the events. On the other hand, anomaly based detection focuses on the change of semantic topic in textual content. In this paper, we are interested in the latter type, which traces the semantic topic change in a continuous time space. We propose to employ graph structure to model temporal states due to its flexibility in relationship assignment and scalability in computational cost. Graphs have been adopted for similar tasks in event identification (Meladianos et al. 2015; Jing and Rayz 2020) . A graph G = (V, E) generally consists of a set of vertices V and a set of edges E which connects the vertices. The vertices can be words, sentences, or documents, and the edges can be used to model the statistical or semantic relationships between the textual elements. Our approach utilizes Graph-of-Words (GoW) to construct a temporal graph for tweets content at each timepoint, allowing the graph clustering to group temporal topics. Consequently, we develop topic transition flow by modeling cluster transitions at global level across all timepoints. Our main contributions are: 1) development of a clustering with nodes removal algorithm to find the optimal graph clusters over topical dataset; 2) improvement on cluster transition modeling to simulate the transition across all timepoints as well as taking re-emerging clusters into consideration; 3) visualization of topic transition flows in the time space. In this section, we review previous work on event identification on Twitter and graph based event modeling. We start with event clustering graphs, for which Jin and Bai (2016) proposed a long document clustering approach utilizing a directed GoW for representing the word features contained in each document. The document clusters were generated based on the maximum common subgraphs between each document graph. In a similar attempt, Edouard et al. (2017) proposed an event clustering model which Tweets 38 89 87 65 27 68 53 29 19 53 23 18 40 35 16 Nodes 139 218 167 177 69 189 204 101 65 137 72 56 122 128 113 leveraged named entities (NE) based directed GoW structure. The GoW was improved by using surrounding context of the graph nodes, NE, to enrich node level information. The approach is capable to automatically detect different events with the same keywords without any prior knowledge. Jinarat et al. (2018) employed a GoW combined with pretrained Word2Vec embedding (Mikolov et al. 2013 ) for tweet clustering. The Word2Vec similarity served as a metric for edge removal in generating tweet clusters. However, since abbreviations and hashtags occur regularly in tweets, pre-trained embeddings can be vulnerable to these irregularities which may not present in the training data. Extracting event streams of an ongoing event from Twitter has a goal of detecting how an event unfolds as people post updates. Meladianos et al. (2015) improved the GoW approach by integrating the tweet length with the global cooccurrence frequency to identify the sub-events of a World Cup match on Twitter. Tweets containing the top k degenerated subgraph were used to describe each sub-event. Fedoryszak et al. (2019) proposed an interpretation of Twitter event streams -a chain of clustered trending entities arranged in chronological order. Additionally, Fedoryszak et al. overcame the limitation of lack of coverage in event detection with the aid of Twitter's internal knowledge graph (KG). Jing and Rayz (2020) introduced Graph of Tweets (GoT) for modeling popular events with both word and document level structures. GoT treat a tweet as a collection of conceptualized token nodes, whereas tokens of contextual similarities were merged prior to the GoT construction. Popular sub-events were extracted by detecting cliques among the similar tweet nodes. Last but not least, we review previously proposed event representation on Twitter as there has not been a formal definition due to the various nature of the tasks. Hashtag based event identifications (Feng et al. 2015; Yang and Rayz 2018) treat an event as "a group of hashtags that focus on the same topic". Another approach utilizes NE to define event, whereas each NE is treated as an individual event and a set of NE as a merged event (McMinn and Jose 2015). Text triplet (subject, predicate, object) has also been adopted to describe a Twitter event (Tonon et al. 2017 ). In such approaches, OpenIE (Angeli, Johnson Premkumar, and Manning 2015) has been one of the top candidates for triplet extraction. Finally, embedding based approach which treats each tweet embedding as an event has also been explored (Dhingra et al. 2016 ). All of the above representations have their pros and cons -hashtags are excellent carries of topical information, but they may not be present in every tweet; NE can support details of the event, but they may also cause crucial information to be excluded (i.e. pronouns which are often subjects of an event); triplets can provide relational information, but entities that are not involved in a triplet cannot be captured; Embeddings allow efficient mathematical com-putations but also requires an adequate amount of data to train. We combine hashtags, NE and nouns in this paper to represent Twitter topics due to the limited amount of data we could collect for our experiment. We are interested in identifying topic transitions in specified events. We break the task into the following steps: 1) constructing a temporal graph for each timepoint; 2) applying clustering with node removal on each temporal graph; 3) modeling cluster transition flow across timepoints. Opportunistically, we chose to model local responses to the on-going event "COVID-19" for a short duration. Thus, tweets were collected from a local community from Aug 19th to Sep 2nd. "COVID-19" related tweets were identified by matching a set of manually selected hashtags for the corpus. The tweets of interests are pre-processed with Stanford CoreNLP 1 to annotate the part-of-speech and named entities. The statistics of the dataset is shown in Table 1 . We treat each day as a timepoint and split the collected dataset based on the timestamp of the tweets. Temporal graphs are constructed using GoW with the nodes being the unique nouns or named entities extracted from tweets of the day. Furthermore, we employ normalized Point-wise Mutual Information (PMI) value as the edge weights between two nodes (1). In Equation (1), the marginal probabilities p(x) and p(y) and the joint probability p(x, y) are computed as the proportions of the occurrence of tokens x and y in a total of N tweets, where n x , n y , and n xy denote the (joint) frequency of tokens x and y. For consistency and computational efficiency, we further normalize the PMI with self-information h(x, y) which sets the boundary of the PMI value to [−1, 1] (2). Clustering with Node Removal clustering (Ng et al. 2002) and highly connected subgraph clustering (Hartuv and Shamir 2000) , achieve the grouping from graph cutting. One drawback of applying cutting based algorithm on graphs with bridging nodes is that no obvious local structures can be observed due to the inter-connectivity introduced by these nodes. More precisely, the cluster assignments for the neighboring nodes of a bridging node tend to fail as the graph cannot be cut in an appropriate way. As a result, the graph cutting mechanism tends to cluster each node into individual cluster. Thus, we propose to exclude the bridging nodes from the GoW during the clustering process, and treat them as belonging to each resulting clusters which have at least one edge in between. We locate a bridging node by its clustering coefficient C i ∈ [0, 1] (Watts and Strogatz 1998). As defined in Equation 3, clustering coefficient measures the embeddedness of a single nodes among its neighbors. A larger C i indicates the neighbors of i tend to be more connected to form a community. In our case, the smaller the C i is, the more likely the node is to serves as bridging node. • e i : number of edges between the neighbors of node i • k i and k j : degree of node i and j • A ij : the edge weight between nodes i and j • 2m: sum of all edge weights • c i and c j : communities of nodes i and j • δ: an indicator function, δ = 1 if c i = c j , δ = 0 otherwise To achieve an optimal clustering, we determine the number of bridging nodes to exclude by incrementally removing the denser node from the graph based on a clustering quality metric. Modularity (Newman 2006) is a common metric used for measuring community quality in graph theory. Given a partitioning of graph G, modularity Q ∈ [−1, 1] computes the difference between actual and expected number of edges within groups (Equation 4). A larger modularity value Q indicates more significant community structure. Algorithm 1 outlines our method for finding optimal clustering. During each iteration, a node with the lowest C i is removed (with its edges) from the graph G and the rest of the subgraph is clustered. The modularity value Q is computed on the subgraph to determine the current clustering quality. The best clustering is achieved at the (first) max Q value. We adopt the random walk based Markov Clustering (MCL) (Van Dongen 2008) as our choice of graph clustering algorithms over other common graph cutting based clustering algorithms due to the drawbacks mentioned above. MCL simulates the stochastic flow in a graph which makes it more scalable. Furthermore, unlike other clustering algorithms, the number of clusters does not need to be pre-determined in MCL. In many stream based data analysis tasks, modeling cluster transitions is the key to identify the evolution of the tar-Algorithm 1 Finding Optimal Graph Clustering if Q > Q max then Q max = Q best clustering = clusters end if end for return best clustering get of interests. Given a timepoint t i , a cluster transition can be defined as "the change experienced by a cluster that has been discovered at an earlier timepoint" (Spiliopoulou et al. 2006) . Previously proposed frameworks such as MONIC (Spiliopoulou et al. 2006 ) and MEC (Oliveira and Gama 2012) define a set of transition types between clusters across consecutive timepoints and use a matching function with a threshold to identify these types. We adopt the basic transition types defined in MONIC and MEC, and further extend them to make the framework more robust for our task. From timepoint t i to t i+1 , we define pairwise transition types for consecutive timepoints in Table 2 (first 7 types) , where X and Y are clusters at timepoints t i and t i+1 respectively, α is the threshold for the match function to measure the overlaps between two clusters. In addition to transition types defined between pairs of consecutive timepoints, we introduce another transition type, namely, "a cluster has re-emerged", which measures the transition across non-consecutive timepoints. To visualize the transitions of a cluster in a global view, we model the set of transitions starting at a newly emerged cluster as a flow. In other words, we treat the pairwise transitions as the edges between the cluster nodes across different timepoints. When a transition exists between two cluster nodes, an edge with the transition type as value is assigned to connect them. It should be noted that to be computationally consistent with the basic transition types, the re-emergence transition is matched in respect of the last node in the sequence of the consecutive transitions. We validate the topic transition results from our computer generated approach by applying the same transition framework to human annotated clusters. Each tweet is annotated with at most three noun (-phrase) labels which summarize the tweet the most by a human expert. The majority of the labels are directly selected from the tweets. Out-of-content labels are only generated when the tokens in a tweet cannot meaningfully summarize it. A preliminary inspection over the manually and computationally generated clusters showed a mismatch in quantity Mathematical Definition the cluster stays unchanged X → X , where X = match α (X) the cluster is absorbed of labels. In addition, we observed several co-occurring topics in the annotated clusters due to retweets. To merge the co-occurring topics in human annotated clusters, we employ frequent itemsets (Hornik, Grün, and Hahsler 2005) to discover strongly associated topics. Support supp is calculated as an indication of how frequently an itemset appears in the data (Equation 5). If two itemsets share the same supp value and one itemset is the subset of another, we merge the subset into its parent. supp(X) = |t ∈ T ; X ⊆ t| |T | • T : a set of transactions of a given dataset. It should be noted that the methods of clustering between computer generated and human-annotated data are very different. However, the trends should be visible regardless of the methods, provided that the clusters are done well. The number of nodes distributed in each temporal GoW separated by timepoint is shown in Table 1 . As previously mentioned, we are interested in learning the responses and interests of a local community, and the real-world data we collected is limited in both quantity and quality. Thus, the number of nodes in each GoW varies largely depending on the community's activity of that day. Reporting global average based results would not guarantee an accurate evaluation to our framework. We instead will report results by timepoint and evaluate our framework by trend based comparison between computer and human generated clusters to ensure the validity of this work. To summarize, the average percentage of bridging nodes removed in each temporal GoW is 10.71% with an average C i of 0.44. Timepoint specific data of the clusterings is shown Figure 1 : An instance of event graph for Aug 19, 2020 with 12 clusters. Each color represents a cluster and the bold (anonymized) words are the bridging nodes. in Table 3 , whereC rm ,C all ,C best denote the average C i for removed nodes, the original GoW, and the best clustered subgraph respectively; and % rm denotes the percentage of nodes removed. The best subgraphs across all timepoints showed an increase in average C i after the removal of bridging nodes, which confirms that the global embeddedness of the graph have become stronger. It is noteworthy that many of the best subgraphs achieve a nearly 1.0 (maximum boundry) average C i , which suggests that the neighbors of each node in the subgraph are inter-connected. This can occur when the resulting subgraph is a complete graph or consists several fully connected components which are not interconnected. Our results are latter. Another metric should be considered for future tasks as the clustering coefficient C i lacks the ability to differentiate the structural characteristics since it only considers the internal variance of the clusters. In addition, we observed consistent converging trends among the modularity plots over all timepoints. As the nodes with lower C i get excluded from the subgraph, the resulting clustering quality starts to increase and gradually converges. Once a maximal modularity value is achieved, re- moving more bridging nodes will cause a decay in quality. Currently, our approach chose the result at the maximal modularity value as the optimal clustering. To increase generalizability and reduce computational cost, Algorithm 1 can be updated to stop when the modularity value starts to converge. Figure 1 illustrates an example clustering on Aug 19. We applied the metrics defined in Table 2 to construct transition flows for both the computer generated clusters and human annotated label sets. An opportunistic threshold of α = 2/3 is used for the match function to determine if two clusters are considered as the same, no other threshold have been attempted. The computational approach generated 34 independent transition flows with an average length of 2.41 timepoints per flow, while 151 clusters only exist across a single timepoint. On the other hand, the human annotation suggests that there are 17 independent transition flows with an average length of 4.53 timepoints per flow, and 68 sets only exist across a single timepoint. It should be noted that the computer generated clusters are based on nouns and NE from tweets with no order in mind, while human-annotated clusters are based on three summarized terms per tweet. Thus, computer generated clusters group all topic per day together irrespective of which tweet they come from, while human annotated ones have an additional layer of summarization: each tweet to keywords. The mismatch in the number of clusters can thus be explained by this additional layer of complexity, and should not necessarily be treated as a negative result. What is more interesting to see is the patterns that emerge from repeating topics, within both layers, throughout two weeks of data. Figure 2 demonstrates the cluster progression charts of the computational approach and human annotated topics side by side. It can be seen that many clusters, both computer and human generated, reappear overtime. One clear outlier is a sequence of human-annotated clusters that contains a series of various transitions for over a week. A shared label between these clusters denotes the name of a policy that the local community is enforcing to ensure safety during the pandemic. Coincidentally, the same policy name has been detected as the bridging node in computer generated clusters in nearly all timepoints during the clustering process. Further analysis suggests that the longest sequence of transitions in both computer and human generated clusters can be traced to the same topic not mentioned here due to anonymity requirement. Despite the differences between the cluster generating mechanisms, both transition flows progress similarly in trends of emergence, re-appearance, and disappearance. In this paper, we proposed a graph based framework 2 in modeling topic transitions of a local online community throughout two weeks of Tweets. Our proposed clustering with node removal approach attempted to resolve the drawback of the traditional hard clustering method on topical datasets. The cluster transition modeling provided insides on how topic flows can be traced in a continuous time space. Finally, the flow comparison between computer generated clusters and human annotated label sets demonstrated the applicability of our framework on real-world data. Several improvements can be made in the future: 1) using a convergent schema to locate the best modularity value for the best subgraph to increase the model's robustness and reduce computational costs; 2) the assignments of membership for the bridging nodes to each resulting clusters as currently they are considered to belong to all inter-connected cluster with equal weights; 3) incorporating conceptual level information (i.e. knowledge graph) into the temporal graph to refine the events/topics clusters. However, even without these improvements, our framework is capable of tracing the evolution of the topics and the duration of the progressions as evident by comparison with human annotation. Leveraging linguistic structure for open domain information extraction Real-time event detection on social data streams Streamcube: Hierarchical spatio-temporal hashtag clustering for event exploration over the twitter stream A clustering algorithm based on graph connectivity. Information processing letters arules-a computational environment for mining association rules and frequent item sets Text clustering algorithm based on the graph structures of semantic word co-occurrence Graphof-tweets: A graph merging approach to sub-events identification Event photo mining from twitter using keyword bursts and image clustering An unsupervised multilingual approach for online social media topic identification Distributed representations of words and phrases and their compositionality A framework to monitor clusters evolution applied to economy and finance problems Joint verificationidentification in end-to-end multi-scale cnn framework for topic identification Topic identification for fine-grained opinion analysis Graph clustering via a discrete uncoupling process Collective dynamics of 'small-world'networks. nature An event detection approach based on twitter hashtags