title: Evaluation of Thematic Coherence in Microblogs
authors: Bilal, Iman Munire; Wang, Bo; Liakata, Maria; Procter, Rob; Tsakalidis, Adam
date: 2021-06-30

Collecting together microblogs representing opinions about the same topics within the same timeframe is useful for a number of different tasks and practitioners. A major question is how to evaluate the quality of such thematic clusters. Here we create a corpus of microblog clusters from three different domains and time windows and define the task of evaluating thematic coherence. We provide annotation guidelines and human annotations of thematic coherence by journalist experts. We subsequently investigate the efficacy of different automated evaluation metrics for the task. We consider a range of metrics including surface level metrics, ones for topic model coherence and text generation metrics (TGMs). While surface level metrics perform well, outperforming topic coherence metrics, they are not as consistent as TGMs. TGMs are more reliable than all other metrics considered for capturing thematic coherence in microblog clusters, as they are less sensitive to the effect of time windows.

As social media gains popularity for news tracking, unfolding stories are accompanied by a vast spectrum of reactions from users of social media platforms. Topic modelling and clustering methods have emerged as potential solutions to the challenges of filtering and making sense of large volumes of microblog posts (Rosa et al., 2011; Aiello et al., 2013; Resnik et al., 2015; Surian et al., 2016). Providing a way to easily access a wide range of reactions around a topic or event has the potential to help those, such as journalists (Tolmie et al., 2017), police (Procter et al., 2013), health (Furini and Menegoni, 2018) and public safety professionals (Procter et al., 2020), who increasingly rely on social media to detect and monitor the progress of events, public opinion and the spread of misinformation.

Recent work on grouping together tweets expressing opinions about the same entities has obtained clusters of tweets by leveraging two topic models in a hierarchical approach (Wang et al., 2017b). The theme of such clusters can either be represented by their top-N highest-probability words or measured by the semantic similarity among the tweets. One of the questions regarding thematic clusters is how well the posts grouped together relate to each other (thematic coherence) and how useful such clusters can be. For example, the clusters can be used to discover topics that have low coverage in traditional news media (Zhao et al., 2011). Wang et al. (2017a) employ the centroids of Twitter clusters as the basis for topic-specific temporal summaries.

The aim of our work is to identify reliable metrics for measuring thematic coherence in clusters of microblog posts. We define thematic coherence in microblogs as follows: given clusters of posts that represent a subject or event within a broad topic, with enough diversity in the posts to showcase different stances and user opinions related to the subject matter, thematic coherence is the extent to which posts belong together, allowing domain experts to easily extract and summarise stories underpinning the posts. To measure thematic coherence of clusters we require robust, domain-independent evaluation metrics that correlate highly with human judgement of coherence.
A similar requirement is posed by the need to evaluate coherence in topic models. Röder et al. (2015) provide a framework for an extensive set of coherence measures all restricted to word-level analysis. Bianchi et al. (2020) show that adding contextual information to neural topic models improves topic coherence. However, the most commonly used word-level evaluation of topic coherence still ignores the local context of each word. Ultimately, the metrics need to achieve an optimal balance between coherence and diversity, such that resulting topics describe a logical exposition of views and beliefs with a low level of duplication. Here we evaluate thematic coherence in microblogs on the basis of topic coherence metrics, while also using research in text generation evaluation to assess semantic similarity and thematic relatedness. We consider a range of state-of-the-art text generation metrics (TGMs), such as BERTScore (Zhang et al., 2019) , MoverScore (Zhao et al., 2019) and BLEURT (Sellam et al., 2020) , which we re-purpose for evaluating thematic coherence in microblogs and correlate them with assessments of coherence by journalist experts. The main contributions of this paper are: • We define the task of assessing thematic coherence in microblogs and use it as the basis for creating microblog clusters (Sec. 3). • We provide guidelines for the annotation of thematic coherence in microblog clusters and construct a dataset of clusters annotated for thematic coherence spanning two different domains (political tweets and COVID-19 related tweets). The dataset is annotated by journalist experts and is available 1 to the research community (Sec. 3.5). • We compare and contrast state-of-the-art TGMs against standard topic coherence evaluation metrics for thematic coherence evaluation and show that the former are more reliable in distinguishing between thematically coherent and incoherent clusters (Secs 4, 5). Measures of topic model coherence: The most common approach to evaluating topic model coherence is to identify the latent connection between topic words representing the topic. Once a function between two words is established, topic coherence can be defined as the (average) sum of the function values over all word pairs in the set of most probable words. Newman et al. (2010) use Pointwise Mutual Information (PMI) as the function of choice, employing co-occurrence statistics derived from external corpora. Mimno et al. (2011) subsequently showed that a modified version of PMI correlates better with expert annotators. AlSumait et al. (2009) identified junk topics by measuring the distance between topic distribution and corpus-wide distribution of words. Fang et al. (2016a) model topic coherence by setting the distance between two topic words to be the cosine similarity of their respective embedded vectors. Due to its generalisability potential we follow this latter approach to topic coherence to measure thematic coherence in tweet clusters. We consider GloVe (Pennington et al., 2014) and BERTweet (Nguyen et al., 2020) embeddings, derived from language models pre-trained on large external Twitter corpora. To improve performance and reduce sensitivity to noise, we followed the work of Lau and Baldwin (2016), who consider the mean topic coherence over several topic cardinalities |W | ∈ {5, 10, 15, 20}. Another approach to topic coherence involves detecting intruder words given a set of topic words, an intruder and a document. 
If the intruder is identified correctly then the topic is considered coherent. Researchers have explored varying the number of 'intruders' (Morstatter and Liu, 2018) and automating the task of intruder detection (Lau et al., 2014). There is also work on topic diversity (Nan et al., 2019). However, there is a trade-off between diversity and coherence (Wu et al., 2020), meaning high diversity for topic modelling is likely to be in conflict with thematic coherence, the main focus of this paper. Moreover, we ensure semantic diversity of microblog clusters through our sampling strategy (see Sec. 3.4).

Text Generation Metrics: TGMs have been of great use in applications such as machine translation (Zhao et al., 2019; Zhang et al., 2019; Guo and Hu, 2019; Sellam et al., 2020), text summarisation (Zhao et al., 2019) and image captioning (Vedantam et al., 2015; Zhang et al., 2019; Zhao et al., 2019), where a machine-generated response is evaluated against ground truth data constructed by human experts. Recent advances in contextual language modelling outperform the traditionally used BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004) scores, which rely on surface-level n-gram overlap between the candidate and the reference. In our work, we hypothesise that metrics based on contextual embeddings can be used as a proxy for microblog cluster thematic coherence. Specifically, we consider the following TGMs:

(a) BERTScore is an automatic evaluation metric based on BERT embeddings (Zhang et al., 2019). The metric is tested for robustness on adversarial paraphrase classification. However, it is based on a greedy approach, where every reference token is linked to the most similar candidate token, leading to a time-performance trade-off. The harmonic mean F_BERT is chosen for our task as it shows the most consistent performance (Zhang et al., 2019).

(b) MoverScore (Zhao et al., 2019) builds on BERTScore and generalises Word Mover Distance (Kusner et al., 2015) by allowing soft (many-to-one) alignments. The task of measuring semantic similarity is tackled as an optimisation problem with the constraints given by n-gram weights computed in the corpus. In this paper, we adopt this metric for unigrams and bigrams as the preferred embedding granularity.

(c) BLEURT (Sellam et al., 2020) is a state-of-the-art evaluation metric also stemming from the success of BERT embeddings, carefully curated to compensate for problematic training data. Its authors devised a novel pre-training scheme leveraging vast amounts of synthetic data generated through BERT mask-filling, back-translation and word dropping. This allows BLEURT to perform robustly in cases of scarce and imbalanced data.

Notation: We use C = {C_1, ..., C_n} to denote a set of clusters C_i. Each cluster C_i is represented by the pair C_i = (T_i, W_i), where T_i and W_i represent the set of tweets and the top-20 topic words of the dominant latent topic in C_i, respectively. The task of identifying thematic coherence in microblog clusters is formalised as follows: given a set of clusters C, we seek to identify a metric function f : C → R such that high values of f(C_i) correlate with human judgements of thematic coherence. Here we present (a) the creation of a corpus of topic clusters of tweets C and (b) the annotation process for thematic coherence. (a) involves a clustering (Sec. 3.2), a filtering (Sec. 3.3) and a sampling step (Sec. 3.4); (b) is described in Sec. 3.5. Experiments to identify a suitable function f are in Sec. 4.
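To make the notation above concrete, the following is a minimal, illustrative sketch (not the authors' implementation) of how a cluster can be represented and how an embedding-based topic coherence baseline in the spirit of Fang et al. (2016a) and Lau and Baldwin (2016) could be computed; the Cluster class, the function names and the word_vectors lookup are our own assumptions.

```python
# Illustrative sketch only: cluster representation (Sec. 3) and an
# embedding-based topic coherence baseline (mean pairwise cosine similarity
# of the top-N topic words, averaged over cardinalities {5, 10, 15, 20}).
# `word_vectors` is assumed to map a word to a dense vector (e.g. GloVe).
from dataclasses import dataclass
from itertools import combinations
from typing import Callable, Dict, List

import numpy as np


@dataclass
class Cluster:
    tweets: List[str]         # T_i: the tweets in the cluster
    topic_words: List[str]    # W_i: top-20 words of the dominant latent topic


# A coherence metric is any function f: Cluster -> float (higher = more coherent).
CoherenceMetric = Callable[[Cluster], float]


def embedding_topic_coherence(cluster: Cluster,
                              word_vectors: Dict[str, np.ndarray],
                              cardinalities=(5, 10, 15, 20)) -> float:
    """Average pairwise cosine similarity of topic words over several cardinalities."""
    def cosine(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    scores = []
    for n in cardinalities:
        words = [w for w in cluster.topic_words[:n] if w in word_vectors]
        pairs = list(combinations(words, 2))
        if pairs:
            scores.append(np.mean([cosine(word_vectors[a], word_vectors[b])
                                   for a, b in pairs]))
    return float(np.mean(scores)) if scores else 0.0
```

Any function with this Cluster → float signature, including the TGM-based approaches of Sec. 4, can then be compared against the human coherence judgements described below.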
We used three datasets pertaining to distinct domains and collected over different time periods as the source of our tweet clusters. The COVID-19 dataset (Chen et al., 2020) was collected by tracking COVID-19 related keywords (e.g., coronavirus, pandemic, stayathome) and accounts (e.g., @CDCemergency, @HHSGov, @DrTedros) through the Twitter API from January to May 2020. This dataset covers specific recent events that have generated significant interest and its entries reflect on-going issues and strong public sentiment regarding the current pandemic. The Election dataset was collected via the Twitter Firehose and originally consisted of all geolocated UK tweets posted between May 2014 and May 2016. It was then filtered using a list of 438 election-related keywords relevant to 9 popular election issues and a list of 71 political party aliases curated by a team of journalists (Wang et al., 2017c). The PHEME dataset (Zubiaga et al., 2016) of rumours and non-rumours contains tweet conversation threads consisting of a source tweet and associated replies, covering breaking news pertaining to 9 events (e.g., the Charlie Hebdo shooting, the Germanwings airplane crash, the Ferguson unrest).

These datasets were selected because they cover a wide range of topics garnering diverse sentiments and opinions in the Twitter sphere, capturing newsworthy stories and emerging phenomena of interest to journalists and social scientists. Of particular interest was the availability of stories, comprising groups of tweets, in the PHEME dataset, which is why we consider PHEME tweet clusters separately.

The task of thematic coherence evaluation introduced in this paper is related to topic modelling evaluation, where it is common practice (Mimno et al., 2011; Newman et al., 2010) to gauge the coherence level of automatically created groups of topical words. In a similar vein, we evaluate thematic coherence in tweet clusters obtained automatically for the Election and COVID-19 datasets. The clusters were created in the following way: tweets mentioning the same keyword posted within the same time window (3 hours for Election, 1 hour for COVID-19) were clustered according to the two-stage clustering approach by Wang et al. (2017b), where two topic models (including Yin and Wang, 2014) are used together with a tweet pooling step. We chose this as it has shown competitive performance over several tweet clustering tasks, without requiring a pre-defined number of clusters.

The PHEME dataset is structured into conversation threads, where each source tweet is assigned a story label. We assume that each story and the corresponding source tweets form a coherent thematic cluster since they have been manually annotated by journalists. Thus the PHEME stories can be used as a gold standard for thematically coherent clusters. We also created artificial thematically incoherent clusters from PHEME. For this purpose we mixed several stories in different proportions. We designed artificial clusters to cover all types of thematic incoherence, namely: Random, Intruded, Chained (see Sec. 3.5 for definitions). For Intruded, we diluted stories by eliminating a small proportion of their original tweets and introducing a minority of foreign content from other events. For Chained, we randomly chose the number of subjects (varying from 2 to 5) to feature in a cluster, chose the number of tweets per subject and then constructed the 'chain of subjects' by sampling tweets from a set of randomly chosen stories.
Finally, Random clusters were generated by sampling tweets from all stories, ensuring no single story represented more than 20% of a cluster. These artificial clusters from PHEME serve as ground-truth data for thematic incoherence.

For automatically collected clusters (COVID-19 and Election) we followed a series of filtering steps: duplicate tweets, non-English tweets and ads were removed and only clusters containing 20-50 tweets were kept. As we sought to mine stories and associated user stances, opinionated clusters were prioritised. The sentiment analysis tool VADER (Gilbert and Hutto, 2014) was leveraged to gauge subjectivity in each cluster: a cluster is considered to be opinionated if the majority of its tweets express strong sentiment polarity. VADER was chosen for its reliability on social media text and for its capacity to assign fine-grained sentiment valences; this allowed us to readily label millions of tweets and impose our own restrictions to classify neutral/non-neutral instances by varying the thresholds for the VADER compound score.

Work on assessing topic coherence operates on either the entire dataset (Fang et al., 2016b) or a random sample of it (Newman et al., 2010; Mimno et al., 2011). Fully annotating our entire dataset of thematic clusters would be too time-consuming, as the labelling of each data point involves reading dozens of posts rather than a small set of topical words. On the other hand, purely random sampling from the dataset cannot guarantee cluster diversity in terms of different levels of coherence. Thus, we opt for a more complex sampling strategy inspired by stratified sampling (Singh and Mangat, 2013), allowing more control over how the data is partitioned in terms of keywords and scores. After filtering, Election and COVID-19 contained 46,715 and 5,310 clusters, respectively. We chose to sample 100 clusters from each dataset such that they:
• derive from a semantically diverse set of keywords (required for Election only);
• represent varying levels of coherence (both);
• represent a range of time periods (both).
We randomly subsampled 10 clusters from each keyword with more than 100 clusters and kept all clusters with under-represented keywords (associated with fewer than 100 clusters). This resulted in 2k semantically diverse clusters for Election.

TGM scores were leveraged to allow the selection of clusters with diverse levels of thematic coherence in the pre-annotation dataset. Potential score ranges for each coherence type were modelled on the PHEME dataset (see Secs. 3.2, 3.5), which is used as a gold standard for cluster coherence/incoherence. For each metric M and each coherence type CT, we defined the associated interval to be I(M)_CT = [µ − 2σ, µ + 2σ], where µ, σ are the mean and standard deviation of the set of metric scores M characterising clusters of coherence type CT. We thus account for 95% of the data. We did not consider metrics M for which the overlap between I(M)_Good, I(M)_Intruded-Chained and I(M)_Random was significant, as this implied the metric was unreliable. As we did not wish to introduce metric bias when sampling the final dataset, we subsampled clusters across the intersection of all suitable metrics for each coherence type CT. In essence, our final clusters were sampled from each of the sets C_CT = ⋂_M {C : M(C) ∈ I(M)_CT}, where the intersection runs over all suitable metrics M and CT ranges over {Good, Intruded-Chained, Random} (a code sketch of this filtering and sampling procedure is given below). For each of COVID-19 and Election we sampled 50 clusters ∈ C_Good, 25 clusters ∈ C_Intruded-Chained and 25 clusters ∈ C_Random.

Coherence annotation was carried out in four stages by three annotators.
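As referenced above, the following is a minimal, illustrative sketch of the filtering and interval-based sampling steps (Secs. 3.3-3.4). It is not the authors' code: the vaderSentiment package is assumed for the compound sentiment score, and the polarity threshold, the majority fraction and all helper names are our own choices.

```python
# Illustrative sketch of the opinionated-cluster filter and the
# interval-based sampling of candidate clusters (assumptions noted above).
import numpy as np
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()


def is_opinionated(tweets, polarity_threshold=0.5, majority=0.5):
    """A cluster counts as opinionated if most tweets carry strong polarity.
    The 0.5 thresholds are illustrative, not the values used in the paper."""
    strong = [abs(analyzer.polarity_scores(t)["compound"]) >= polarity_threshold
              for t in tweets]
    return np.mean(strong) > majority


def coherence_interval(pheme_scores):
    """Score interval [mu - 2*sigma, mu + 2*sigma] for one metric and one
    coherence type, covering ~95% of the PHEME reference clusters."""
    mu, sigma = np.mean(pheme_scores), np.std(pheme_scores)
    return mu - 2 * sigma, mu + 2 * sigma


def candidates_for(coherence_type, cluster_ids, metric_scores, intervals):
    """Clusters whose scores fall inside the interval of *every* suitable metric.

    metric_scores: {metric_name: {cluster_id: score}}
    intervals:     {metric_name: {coherence_type: (low, high)}}
    """
    keep = []
    for c in cluster_ids:
        if all(intervals[m][coherence_type][0]
               <= metric_scores[m][c]
               <= intervals[m][coherence_type][1]
               for m in metric_scores):
            keep.append(c)
    return keep
```

The final annotation set would then be drawn from candidates_for('Good', ...), candidates_for('Intruded-Chained', ...) and candidates_for('Random', ...) in the 50/25/25 proportions described above.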
We chose experienced journalists as they are trained to quickly and reliably identify salient content. An initial pilot study including the journalists and the research team was conducted; this involved two rounds of annotation and subsequent discussion to align the team's understanding of the guidelines (for the guidelines see Appendix B).

The first stage tackled tweet-level annotation within clusters and drew inspiration from the classic task of word intrusion (Chang et al., 2009): annotators were asked to group together tweets discussing a common subject; tweets considered to be 'intruders' were assigned to groups of their own. Several such groups can be identified in a cluster depending on the level of coherence. This grouping served as a building block for subsequent stages. This sub-clustering step offers a good trade-off between high annotation costs and manual evaluation, since manually creating clusters from thousands of tweets is impractical. We note that agreement between journalists is not evaluated at this first stage, as obtaining exact sub-clusters is not our objective. However, vast differences in sub-clustering are captured in the next stages in quality judgement and issue identification (see below).

The second stage concerned cluster quality assessment, which is our primary task. Similar to Newman et al. (2010) for topic words, annotators evaluated tweet cluster coherence on a 3-point scale (Good, Intermediate, Bad). Good coherence is assigned to a cluster where the majority of tweets belong to the same theme (sub-cluster), while clusters containing many unrelated themes (sub-clusters) are assigned bad coherence.

The third stage pertains to issue identification for low coherence, similar to Mimno et al. (2011). When either Intermediate or Bad is chosen in stage 2, annotators can select from a list of issues to justify their choice:
• Chained: several themes are identified in the cluster (with some additional potential random tweets), without clear connection between any two themes.
• Intruded: one clear theme can be identified, mixed with a number of unrelated tweets.
• Random: no common theme can be identified; the tweets are too diverse.

Analysis of pairwise disagreement in stage 2 shows only 2% is due to division in opinion over Good-Bad clusters. Good-Intermediate and Intermediate-Bad cases account for 37% and 61% of disagreements, respectively. This is encouraging, as annotators almost never have polarising views on cluster quality and primarily agree on the coherence of a good cluster, the main goal of this task. For issue identification, the majority of disagreements (49%) consists in distinguishing Intermediate-Chained cases. This can be explained by the expected differences in identifying sub-clusters in the first stage. For the adjudication process, we found that a majority always exists and thus the final score was assigned to be the majority label (2/3 annotators). Table 1 presents a summary of the corpus size, coherence quality and issues identified for COVID-19 and Election (see Appendix C for a discussion).

Our premise is that a pair of sentences scoring high in terms of TGMs means that the sentences are semantically similar. When this happens across many sentences in a cluster then this denotes good cluster coherence. Following Douven and Meijs (2007), we consider three approaches to implementing and adapting TGMs to the task of measuring thematic coherence. The differences between these methods consist of: (a) the choice of the set of tweet pairs S ⊂ T × T on which we apply the metrics and (b) the score aggregating function f(C) assigning coherence scores to clusters.
The TGMs employed in our study are BERTScore (Zhang et al., 2019), MoverScore (Zhao et al., 2019) for both unigrams and bigrams, and BLEURT (Sellam et al., 2020). We also employed a surface level metric based on cosine similarity distances between TF-IDF representations of tweets to judge the influence of word co-occurrences in coherence analysis. Each approach has its own advantages and disadvantages, which are outlined below.

Exhaustive Approach: In this case S = T × T, i.e., all possible tweet pairs within the cluster are considered. The cluster is assigned the mean sum over all scores. This approach is not biased towards any tweet pairs, so is able to penalise any tweet that is off-topic. However, it is computationally expensive as it requires O(|T|^2) operations. Formally, given a TGM M, we define this approach as:

f(C) = (1 / |S|) Σ_{(tweet_i, tweet_j) ∈ S} M(tweet_i, tweet_j), with S = T × T.

Representative Tweet Approach: We assume there exists a representative tweet able to summarise the content in the cluster, denoted as the representative tweet (i.e. tweet_rep). This is formally defined as:

tweet_rep = argmin_{tweet ∈ T} D_KL(θ || tweet),

where we compute the Kullback-Leibler divergence (D_KL) between the word distributions of the topic θ representing the cluster C and each tweet in C (Wan and Wang, 2016); we describe the computation of D_KL in Appendix A. We also considered other text summarisation methods (Basave et al., 2014; Wan and Wang, 2016) such as MEAD (Radev et al., 2000) and Lexrank (Erkan and Radev, 2004) to extract the best representative tweet, but our initial empirical study indicated D_KL consistently finds the most appropriate representative tweet. In this case cluster coherence is defined as below and has linear time complexity O(|T|):

f(C) = (1 / |T|) Σ_{tweet ∈ T} M(tweet, tweet_rep).

As S = {(tweet, tweet_rep) | tweet ∈ T} ⊂ T × T, the coherence of a cluster is heavily influenced by the correct identification of the representative tweet.

Graph Approach: Similar to the work of Erkan and Radev (2004), each cluster of tweets C can be viewed as a complete weighted graph with nodes represented by the tweets in the cluster and each edge between tweet_i, tweet_j assigned as weight: w_{i,j} = M(tweet_i, tweet_j)^{-1}. In the process of constructing a complete graph, all possible pairs of tweets within the cluster are considered. Hence S = T × T with time complexity of O(|T|^2) as in Section 4.1. In this case, the coherence of the cluster is computed as the average closeness centrality of the associated cluster graph. This is a measure derived from graph theory, indicating how 'close' a node is on average to all other nodes; as this definition intuitively corresponds to coherence within graphs, we included it in our study. The closeness centrality for the node representing tweet_i is given by:

c(tweet_i) = (|T| − 1) / Σ_{j ≠ i} d(tweet_j, tweet_i),

where d(tweet_j, tweet_i) is the shortest distance between nodes tweet_i and tweet_j computed via Dijkstra's algorithm. Note that as Dijkstra's algorithm only allows for non-negative graph weights and BLEURT's values are mostly negative, we did not include this TGM in the graph approach implementation. Here cluster coherence is defined as the average over all closeness centrality scores of the nodes in the graph:

f(C) = (1 / |T|) Σ_{tweet_i ∈ T} c(tweet_i).

Table 2 presents the four best and four worst performing metrics (for the full list of metric results refer to Appendix A). MoverScore variants are not included in the results discussion as they only achieve average performance. Graph TF-IDF consistently outperformed TGMs, implying that clusters with a large overlap of words are likely to have received higher coherence scores.
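To make the three aggregation approaches above concrete, here is a brief illustrative sketch (not the authors' released implementation). The pairwise metric M is represented as a precomputed similarity matrix; a TF-IDF cosine similarity stands in for it, but any TGM (e.g., BERTScore F1) could be substituted. The scikit-learn and networkx dependencies, the stop-word handling and all function names are our own choices.

```python
# Illustrative sketch of the Exhaustive, Representative Tweet and Graph
# approaches of Sec. 4, with TF-IDF cosine similarity standing in for the
# pairwise metric M.
import itertools
import math

import networkx as nx
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def tfidf_similarity(tweets):
    """|T| x |T| matrix of pairwise TF-IDF cosine similarities (a stand-in for M)."""
    X = TfidfVectorizer().fit_transform(tweets)
    return cosine_similarity(X)


def exhaustive(sim):
    """Mean pairwise score over all distinct tweet pairs (Exhaustive Approach)."""
    n = sim.shape[0]
    return sim[~np.eye(n, dtype=bool)].mean()


def representative(tweets, sim, topic_words, topic_probs):
    """Mean similarity to the representative tweet chosen by KL divergence
    (Representative Tweet Approach). Stop-word removal is omitted for brevity;
    the 1e-5 smoothing mirrors the value given in Appendix A."""
    def kl(tweet):
        toks = tweet.lower().split()
        length = max(len(toks), 1)
        total = 0.0
        for w, p in zip(topic_words, topic_probs):
            q = toks.count(w) / length or 1e-5  # smoothing for absent words
            total += p * math.log(p / q)
        return total

    rep = min(range(len(tweets)), key=lambda i: kl(tweets[i]))
    others = [i for i in range(len(tweets)) if i != rep]
    return sim[others, rep].mean()


def graph_coherence(sim):
    """Average closeness centrality of the complete graph with weights 1/M
    (Graph Approach); Dijkstra-based shortest paths via networkx."""
    n = sim.shape[0]
    G = nx.Graph()
    for i, j in itertools.combinations(range(n), 2):
        G.add_edge(i, j, weight=1.0 / max(sim[i, j], 1e-8))
    closeness = nx.closeness_centrality(G, distance="weight")
    return sum(closeness.values()) / n
```

As noted above, the exhaustive and graph variants require O(|T|^2) metric evaluations, whereas the representative-tweet variant needs only O(|T|), at the cost of depending on how well the representative tweet is identified.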
While TF-IDF metrics favour surface-level co-occurrence and disregard deeper semantic connections, we conclude that, by design, all posts in the thematic clusters (posted within a 1h or 3h window) are likely to use similar vocabulary. Nevertheless, TGMs correlate well with human judgement, implying that semantic similarity is a good indicator for thematic coherence: Exhaustive BERTScore performs the best of all TGMs in Election, while Exhaustive BLEURT is the strongest competitor to TF-IDF based metrics for COVID-19. On the low end of the performance scale, we have found topic coherence to be overwhelmingly worse compared to all the TGMs employed in our study. BERTweet improves over GloVe embeddings, but only slightly, as when applied at the word level (for topic coherence) it is not able to benefit from the context of individual words. We followed Lau and Baldwin (2016) and computed average topic coherence across the top 5, 10, 15, 20 topical words in order to obtain a more robust performance (see Avg Topic Coherence Glove, Avg Topic Coherence BERTweet). Results indicate that this smoothing technique correlates better with human judgement for Election, but lowers performance further in COVID-19 clusters. In terms of the three approaches, we have found that the Exhaustive and Graph approaches perform similarly to each other and both outperform the Representative Tweet approach. Trading time for quality, the results indicate that metrics considering all possible pairs of tweets achieve higher correlation with annotator rankings.

PHEME: The best performance on this dataset is seen with the TGM BLEURT, followed closely by BERTScore. While TF-IDF based metrics are still in the top four, surface-level evaluation proves to be less reliable: PHEME stories are no longer constrained by strict time windows, which allows the tweets within each story to be more lexically diverse while still maintaining coherence. In such instances, strategies depending exclusively on word frequencies perform inconsistently, which is why metrics employing semantic features (BLEURT, BERTScore) outperform TF-IDF ones. Note that PHEME data lack the topic coherence evaluation, as these clusters were not generated through topic modelling (see Sec. 3.2).

We analysed several thematic clusters to get a better insight into the results. Tables 3 and 4 show representative fragments from 2 clusters labelled as 'good' in the COVID-19 dataset. The first cluster contains posts discussing the false rumour that bleach is an effective cure for COVID-19, with the majority of users expressing skepticism. As most tweets in this cluster directly quote the rumour and thus share a significant overlap of words, not surprisingly, TF-IDF based scores are high (Exhaustive TF-IDF = 0.109). In the second cluster, however, users challenge the choices of the American President regarding the government's pandemic reaction: though the general feeling is unanimous in all posts of the second cluster, these tweets employ a more varied vocabulary. Consequently, surface-level metrics fail to detect the semantic similarity (Exhaustive TF-IDF = 0.040).
When co-occurrence statistics are unreliable, TGMs are more successful at detecting the 'common story' diversely expressed in the tweets: in fact, Exhaustive BLEURT assigns similar scores to both clusters (-0.808 for Cluster 1 and -0.811 for Cluster 2) in spite of the vast difference in their content intersection, which shows a more robust evaluation capability.

We analyse the correlation between topic coherence and annotator judgement in Tables 5 and 6. Both are illustrative fragments of clusters extracted from the Election dataset. Though all tweets in Table 5 share the keyword 'oil', they form a bad random cluster type, equivalent to the lowest level of coherence. On the other hand, Table 6 clearly presents a good cluster regarding an immigration tragedy at sea. Although this example pair contains clusters on opposite sides of the coherence spectrum, topic coherence metrics fail to distinguish the clear difference in quality between the two. Moreover, Table 6 receives lower scores (TC Glove = 0.307) than its incoherent counterpart (TC Glove = 0.330) for Glove Topic Coherence. However, the TGM BERTScore and the surface-level metric TF-IDF correctly evaluate the two clusters, penalising incoherence (Exhaustive BERTScore = 0.814 and Exhaustive TF-IDF = 0.024) and rewarding the good cluster (Exhaustive BERTScore = 0.854 and Exhaustive TF-IDF = 0.100).

Table 5: Cluster fragment from the Election dataset, TC Glove = 0.330, Exhaustive BERTScore = 0.814 and Exhaustive TF-IDF = 0.024. Common keyword: 'oil'
M'gonna have a nap, I feel like I've drank a gallon of like grease or oil or whatever bc I had fish&chips like 20 minutes ago
Check out our beautiful, nostalgic oil canvasses. These stunning images will take you back to a time when life...
Five years later, bottlenose dolphins are STILL suffering from BP oil disaster in the Gulf. Take action!
Once the gas and oil run out countries like Suadia Arabia and Russia won't be able to get away with half the sh*t they can now
Ohhh this tea tree oil is burning my face off

Table 6: Cluster fragment from the Election dataset, TC Glove = 0.307, Exhaustive BERTScore = 0.854 and Exhaustive TF-IDF = 0.100. Common keyword: 'migrants'
Up to 300 migrants missing in Mediterranean Sea are feared dead #migrants.
NEWS: More than 300 migrants feared drowned after their overcrowded dinghies sank in the Mediterranean
Imagine if a ferry sunk with 100s dead -holiday makers, kids etc. Top story everywhere. 300 migrants die at sea and it doesn't lead.
@bbc5live Hi FiveLive: you just reported 300 migrants feared dead. I wondered if you could confirm if the MIGRANTS were also PEOPLE? Cheers.
If the dinghies were painted pink would there be as much uproar about migrants drowning as the colour of a f**king bus?

We have defined the task of creating topic-sensitive clusters of microblogs and evaluating their thematic coherence. To this effect we have investigated the efficacy of different metrics, both from the topic modelling literature and text generation metrics (TGMs). We have found that TGMs correlate much better with human judgement of thematic coherence compared to metrics employed in topic model evaluation. TGMs maintain a robust performance across different time windows and are generalisable across several datasets. In future work we plan to use TGMs in this way to identify thematically coherent clusters on a large scale, to be used in downstream tasks such as multi-document opinion summarisation.

(grant no. EP/N510129/1). We would like to thank our 3 annotators for their invaluable expertise in constructing the datasets. We also thank the reviewers for their insightful feedback.
Finally, we would like to thank Yanchi Zhang for his help in the redundancy correction step of the pre-processing. Ethics approval to collect and to publish extracts from social media datasets was sought and received from Warwick University Humanities & Social Sciences Research Ethics Committee. During the annotation process, tweet handles, with the except of public figures, organisations and institutions, were anonymised to preserve author privacy rights. In the same manner, when the datasets will be released to the research community, only tweets IDs will be made available along with associated cluster membership and labels. Compensation rates were agreed with the annotators before the annotation process was launched. Remuneration was fairly paid on an hourly rate at the end of task. Ioannis Kompatsiaris, and Alejandro Jaimes. 2013. Sensing trending topics in twitter. IEEE Transactions on Multimedia Evaluating topic coherence using distributional semantics Topic significance ranking of lda generative models Automatic labelling of topic models learned from twitter by summarisation Pre-training is a hot topic: Contextualized document embeddings improve topic coherence Reading tea leaves: How humans interpret topic models Tracking social media discourse about the covid-19 pandemic: Development of a public coronavirus twitter data set Measuring coherence Lexrank: Graph-based lexical centrality as salience in text summarization Using word embedding to evaluate the coherence of topics from twitter data Using word embedding to evaluate the coherence of topics from twitter data Public health and social media: Language analysis of vaccine conversations Vader: A parsimonious rule-based model for sentiment analysis of social media text Meteor++ 2.0: Adopt syntactic level paraphrase knowledge into machine translation evaluation From word embeddings to document distances The sensitivity of topic coherence evaluation to topic cardinality Machine reading tea leaves: Automatically evaluating topic coherence and topic model quality ROUGE: A package for automatic evaluation of summaries Optimizing semantic coherence in topic models In search of coherence and consensus: Measuring the interpretability of statistical topics Topic modeling with Wasserstein autoencoders Automatic evaluation of topic coherence Improving topic models with latent feature word representations Bertweet: A pre-trained language model for english tweets Bleu: a method for automatic evaluation of machine translation Glove: Global vectors for word representation Roadmapping uses of advanced analytics in the uk food and drink sector Reading the riots: What were the police doing on twitter? Policing and society Centroid-based summarization of multiple documents: sentence extraction, utilitybased evaluation, and user studies Beyond lda: exploring supervised topic modeling for depression-related language in twitter Exploring the space of topic coherence measures Anatole Gershman, and Robert Frederking BLEURT: Learning robust metrics for text generation Elements of survey sampling Characterizing twitter discussions about hpv vaccines using topic modeling and community detection Supporting the use of user generated content in journalistic practice Cider: Consensus-based image description evaluation Automatic labeling of topic models using text summaries Lazaros Apostolidis, Arkaitz Zubiaga, Rob Procter, and Yiannis Kompatsiaris Arkaitz Zubiaga, and Rob Procter. 2017b. 
A hierarchical topic modelling approach for tweet clustering TDParse: Multi-target-specific sentiment recognition on Twitter Short text topic modeling with topic distribution quantization and negative sampling decoder A dirichlet multinomial mixture model-based approach for short text clustering Bertscore: Evaluating text generation with bert Comparing twitter and traditional media using topic models Moverscore: Text generation evaluating with contextualized embeddings and earth mover distance Analysing how people orient to and spread rumours in social media by looking at conversational threads This work was supported by a UKRI/EPSRC Turing AI Fellowship to Maria Liakata (grant no. EP/V030302/1) and The Alan Turing Institute Appendix A As described in Section 4.2, we select the tweet that has the lowest divergence score to the top topic words of the cluster. Following (Wan and Wang, 2016) , we compute the Kullback-Leibler divergence (D KL ) between the word distributions of the topic θ the cluster C represents and each tweet in C as follows:where p θ (w) is the probability of word w in topic θ. T W denotes top 20 words in cluster C according to the probability distribution while SW denotes the set of words in tweet i after removing stop words. tf (w, tweet i ) denotes the frequency of word w in tweet i , and len(tweet i ) is the length of tweet i after removing stop words. For words that do not appear in SW , we set tf (w, tweet i )/len(tweet i ) to 0.00001. The complete results of our experiments are in Table A2. The notation is as follows:• Exhaustive indicates that the Exhaustive Approach was employed for the metric.• Linear indicates that the Representative Tweet Approach was employed for the metric.• Graph indicates the the Graph Approach was employed for the metric.Shortcuts for the metrics are: MoverScore1 = MoverScore applied for unigrams; Mover-Score2 = MoverScore applied for bigrams PHEME data coherence evaluationAs original PHEME clusters were manually created by journalists to illustrate specific stories, they are by default coherent. Hence, according to the guidelines, these clusters would be classified as "Good". For the artificially created clusters, PHEME data is mixed such that different stories are combined in different proportions (See 3.2). Artificially intruded and chained clusters would be classed as 'Intermediate' as they have been generated on the basis that a clear theme (or themes) can be identified. Finally, an artificially random cluster was created such that there is no theme found in the tweets as they are too diverse; this type of cluster is evaluated as 'Bad'.Election COVID-19 PHEME Overview You will be shown a succession of clusters of posts from Twitter (tweets), where the posts originate from the same one hour time window. Each cluster has been generated by software that has decided its tweets are variants on the same 'subject'. You will be asked for your opinion on the quality ('coherence') of each cluster as explained below. As an indication of coherence quality consider how easy it would be to summarise a cluster. In the guidelines below, a subject is a group of at least three tweets referring to the same topic.Marking common subjects: In order to keep track of each subject found in the cluster, label it by entering a number into column Subject Label and then assign the same number for each tweet that you decide is about the same subject. 
Note, the order of the tweets will automatically change as you enter each number, so that those assigned the same subject number will be listed together.
(a) Carefully read each tweet in the cluster with a view to uncovering overlapping concepts, events and opinions (if any).
(b) Identify the common keyword(s) present in all tweets within the cluster. Note that common keywords across tweets in a cluster are present in all cases by design, so by themselves they are not a sufficient criterion to decide on the quality of a cluster.
(c) Mark tweets belonging to the same subject as described in the paragraph above.
2. Cluster Annotation: What was your opinion about the cluster?
(a) Choose 'Good' if you can identify one subject within the cluster to which most tweets refer (you can count these based on the numbers you have assigned in the column Subject Label). This should be a cluster that you would find easy to summarise. Proceed to Step 4.
(b) Choose 'Intermediate' if you are uncertain that the cluster is good, you would find it difficult to summarise its information, or you find that there are a small number (e.g., one, two or three) of unrelated subjects being discussed that are of similar size (chained, see issues in Step 3) or one clear subject with a mix of other unrelated tweets (intruded, see issues in Step 3). Additionally, if there is one significantly big subject and one or more other 'small' subjects (of 2-3 tweets), the cluster should be Intermediate (Intruded). Proceed to Step 3.
(c) Choose 'Bad' if you are certain that the cluster is not good and the issue of fragmented subjects within the cluster is such that many unrelated subjects are being discussed (heavily chained), or there is one subject with a mix of unrelated tweets but the tweets referring to that subject are a minority. Proceed to Step 3.

In terms of size, we observe that the average tweet in the Election data is significantly shorter (20 tokens) than its counterpart in the COVID-19 corpus, which is 34 tokens long. We note that the former's collection period finished before the Twitter platform doubled its tweet character limit, which is consistent with the figures in the table. Further work will tackle whether tweet length in a cluster has any impact on the coherence of its message. We believe differences in the application of the clustering algorithm influenced the score differences between the Election and COVID-19 datasets. The clustering algorithm we employed uses a predefined list of keywords that partitions the data into sets of tweets mentioning a common keyword as a first step. The keyword set used for the Election dataset contains 438 keywords, while the COVID-19 dataset contains 80 keywords used for Twitter API tracking (Chen et al., 2020). We also note that the different time window spans can impact the quality of clusters.