key: cord-0508019-8pbuk0w0 authors: Tondulkar, Rohan; Dubey, Manisha; Srijith, P. K.; Lukasik, Michal title: Hawkes Process Classification through Discriminative Modeling of Text date: 2020-10-22 journal: nan DOI: nan sha: 7926329b404aa7ba41553527b3b58c093a3039dd doc_id: 508019 cord_uid: 8pbuk0w0

Social media provides a platform for users to gather and share information, stay updated with the news, and engage in conversations. However, micro-blogging platforms like Twitter restrict the length of text. Because such posts contain few word occurrences, classifying them with standard tools of natural language processing (NLP) is a challenging task. Moreover, the high complexity and dynamics of social media posts make text classification harder still. Considering additional cues in the form of past labels and the times associated with posts can help to classify such text better. To address this problem, we propose models based on the Hawkes process (HP), which can naturally incorporate temporal features and past labels along with textual features to improve short text classification. In particular, we propose a discriminative approach to modeling text in HP, where the text features parameterize the base intensity and/or the triggering kernel. Another major contribution is to treat the kernel as a function of both time and text, and to use a neural network to model it. This enables the model to learn the text along with the historical influences for tweet classification. We demonstrate the advantages of the proposed techniques on standard benchmarks for rumour stance classification.

Social media platforms like Twitter, Facebook and WhatsApp provide a platform for common users to share information and content.
However, content shared on these websites is not verified, and misinformation or rumours can spread quickly through social media. Social media has become the starting point for many rumours and fake news. In India, a recent rumour about a child-lifting gang caused the death of 29 people. False rumours about the deaths of celebrities like Justin Bieber, Charlie Sheen and Taylor Swift have been common in the past. During the English riots in 2011, false rumours about various incidents were spread via Twitter and Facebook. There have also been incidents where rumours and fake news were used on social media to influence the outcome of elections. The rapid spread of rumours through social media can create chaos and cause considerable damage to life and property. Thus, it is very important to keep the spread of rumours in check. Analyzing various aspects of a rumour can help in reducing the damage. If correct information about a rumour is provided to the authorities, it can help them take corrective measures sooner.

Social media networks let common users share information, generally in the form of short snippets of text, with Twitter being a prominent example. Making effective use of such pieces of information can help to address various real-world problems. For instance, it can help in understanding the stance or opinion of people towards a product, or even prevent the spread of rumours through rumour stance classification [26]. However, tweets involve frequent use of informal grammar as well as irregular vocabulary, e.g. abbreviations, typographical errors and hashtags. The posts exchanged via Twitter are referred to as microblogs because Twitter imposes a 140 character limit on every tweet. Since these texts are short, they do not provide sufficient word occurrences. This often limits how well social media posts can be classified.
However, considering additional cues in the form of past labels and times associated with social media posts can make their classification more effective. Motivated by this, our work targets the problem of rumour stance classification. In the rumour stance classification task, we classify the posts following a would-be rumour post as supporting, denying, questioning or commenting on the rumour. During the spread of rumours via social networking platforms like Twitter, a previous tweet can influence a response in the form of another tweet. This process continues, and various tweet, reply and retweet events are formed in a short time until a cool-off period is reached. Characteristics such as clustering of events and self-excitation can be modelled using the Hawkes process (HP) [9]. The stance associated with a post depends on the labels associated with the past tweets and the times of those posts. An HP model accounts for this naturally, which makes it a suitable candidate for stance classification in social media [19]. We propose a discriminative modeling of text where textual features are part of the intensity function of the HP. This allows the HP to consider the impact of text as well as time. For text classification problems, discriminative methods were found to perform better than generative ones [11]. The kernel in an HP model represents the influence of historical events on the current event. Our work provides a new direction on how various kernels can be used to consider the historical impact of text along with time. We show ways in which neural networks can be used instead of standard kernels to learn complex non-linear relationships of influence from historical events. The proposed Neural Kernel Hawkes process provides a dual benefit: it uses the power of neural networks for learning non-linear relationships while maintaining the explainability provided by HP. Contributions.
Our contributions can be summarized as follows:
• We propose discriminative modeling of text using the Hawkes process.
• We propose the use of various text-based kernels to understand the historical influence of text in determining stance.
• We propose the use of neural networks as a kernel in the HP intensity function to learn complex non-linear relationships between events.
• We show improved performance of the proposed models on the rumour stance classification problem.
Although we have performed text classification for rumour stance classification, the method is general and can be used for other applications such as rating prediction.

Stance classification in social media has been approached using structured models such as the Linear-Chain conditional random field (CRF) and Tree CRF [25]. [1] used various machine learning classifiers with problem-specific features and provided insight into whether it was necessary to use more complex models or to extract better features. A long short-term memory (LSTM) based sequential approach that modelled the conversational structure of tweets was used in [12]. As new approaches were developed in deep learning, they were also applied to the problem of rumour stance classification. Attention models using convolutional neural networks and LSTMs were used for multi-class classification with sequences of threads of tweets as input in [23]. They also used the follower-followee relationship between users as an added feature. Four different sequential classifiers were applied to local and contextual features in [26], which showed the higher performance of LSTM. In [13], a multi-task learning model is proposed which uses a shared LSTM to solve rumour detection, stance classification and veracity prediction together, so as to also learn their common characteristics. [20] proposes a Siamese adaptation of LSTMs with an attention mechanism for the stance classification problem.
The usefulness of LSTMs and other sequence models for stance classification of short texts suggests that considering past labels helps to solve stance classification effectively. This is especially useful when classifying the short texts that arise in social media. A few works in the literature use the Hawkes process for natural language modeling, including topic modeling [10], clustering document streams [5], discovering topical interactions [2] and narrative reconstruction [21]. Although there have been works at the intersection of Hawkes processes and topic modeling for applications such as detecting fake retweets [6] and modeling COVID-19 Twitter narratives [22], incorporating text into the Hawkes process framework has not been extensively studied.

A closely related approach for stance classification is [16], where the authors use a multivariate Hawkes process (MHP) with a generative model of text. Their model considers both the labels and the times of past posts, along with text, to perform stance classification. The text features are considered through an additional likelihood term. A multinomial distribution is used to model text, where the likelihood of generating the text $W_n$ (a vector of word counts) of the $n$-th post given its label $y_n$ is

$$p(W_n \mid y_n) = \prod_{v=1}^{V} \beta_{y_n v}^{W_{nv}}.$$

Here, $V$ is the number of words in the vocabulary, $W_{nv}$ is the count of word $v$ in post $n$, and $\beta$ is the matrix of size $|Y| \times V$ providing the word distribution for each class. However, this generative model is restrictive and prevents text from being considered when determining the influence of past events. For instance, posts with similar textual content should have a higher influence in determining the stance of the current post. In view of this, we propose HP models which consider text in a discriminative manner. We include various ways to incorporate text within the intensity function of the Hawkes process.
Therefore, our proposed approach performs time-sensitive sequence classification of a tweet by considering the label, time and text associated with the previous tweets through the intensity function. This leads to more powerful HP models which can perform stance classification considering the influence not only of past labels and times, but also of text. Moreover, we provide a more generic way to model the influence using a neural kernel.

We consider tweets associated with topics (or statements or claims) of interest for stance classification. Each tweet is represented as a tuple $d_j = (t_j, x_j, m_j, y_j)$, where $t_j$ is the posting time of the tweet, $x_j$ is the text message, $m_j$ is the topic category and $y_j$ is the stance of the tweet towards a topic or statement. In particular, we consider rumour stance classification where $Y$ = {supporting, denying, questioning, commenting}. The stance classification task is to assign the tweet to one of the four classes $y_j \in Y$.

A point process is a random process which models the occurrence of a set of points on the real line. If the point process models the occurrence of events over a time period, it is called a temporal point process. For example, a point process can be used to model the occurrence of earthquakes, rains, etc. A point process can be characterized by its conditional intensity function, defined as

$$\lambda(t \mid \mathcal{H}_t) = \lim_{\Delta t \to 0} \frac{\mathbb{E}\left[N(t + \Delta t) - N(t) \mid \mathcal{H}_t\right]}{\Delta t},$$

where $N(t)$ is the number of events up to time $t$ and $\mathcal{H}_t$ is the history of the process up to time $t$, with the list of events $\{t_1, t_2, \ldots\}$. For ease of notation, we denote $\lambda(t \mid \mathcal{H}_t)$ as $\lambda(t)$. Commonly used point processes include the Poisson process, the Cox process and the Hawkes process. Point processes are useful for modelling the distribution of points over some space and are defined using an underlying intensity function. A Hawkes process [8] is a point process with a self-triggering property, i.e. the occurrence of previous events triggers the occurrence of future events.
The conditional intensity function of a univariate Hawkes process is defined as

$$\lambda(t) = \mu + \sum_{t_j < t} \kappa(t - t_j),$$

where $\mu$ is the base intensity and $\kappa(\cdot)$ is the triggering kernel function capturing the influence of previous events. The summation over $t_j < t$ collects the effect of all events prior to time $t$, which contribute to the intensity at time $t$. Figure 1 displays the intensity function of a Hawkes process, which exhibits self-exciting behavior. Hawkes processes have been used in earthquake modelling [7], crime forecasting [18] and epidemic forecasting [3].

Events are most often associated with features other than time, such as categories or users in LBSNs (location-based social networks). Such features are known as marks. The multivariate Hawkes process [14] is a multi-dimensional point process that can model time-stamped events with marks. It allows explicit representation of marks through the $i$-th dimension of the intensity function and can capture influences across these marks. The intensity function associated with the $i$-th mark is

$$\lambda_i(t) = \mu_i + \sum_{t_j < t} \alpha_{c_j i}\, \kappa(t - t_j),$$

where $\mu_i > 0$ is the base intensity of the $i$-th mark. Each previous event $j$ is associated with a mark $c_j$, which is treated as a dimension of the Hawkes process. The intensity at time $t$ for mark $i$ is assumed to be influenced by all events happening before $t$, each at time $t_j$ with mark $c_j$. The influence of mark $c_j$ on mark $i$ is given by $\alpha_{c_j i}$. This models the mutual excitation between events with different marks.

Motivated by [16], we also employ a multivariate Hawkes process (MHP) to solve the rumour stance classification problem. Each dimension of the MHP corresponds to a stance associated with a tweet. The idea is that stances can influence each other, and their influence can be captured through the influence matrix of the MHP.
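The multivariate intensity described above can be illustrated with a short numerical sketch. It is not the authors' implementation: the function name `mhp_intensity`, the argument names and the toy values are hypothetical, and the kernel is assumed to be the exponentially decaying form $\omega e^{-\omega (t - t_j)}$.

```python
import numpy as np

def mhp_intensity(t, times, marks, mu, alpha, omega):
    """Intensity of every mark at time t for a multivariate Hawkes process.

    Assumed form (exponential kernel is an assumption of this sketch):
        lambda_i(t) = mu[i] + sum_{t_j < t} alpha[c_j, i] * omega * exp(-omega * (t - t_j))
    times : event times t_j
    marks : integer mark c_j of each event
    mu    : base intensity per mark, shape (n_marks,)
    alpha : influence matrix, alpha[a, b] = influence of mark a on mark b
    """
    times = np.asarray(times, dtype=float)
    marks = np.asarray(marks, dtype=int)
    past = times < t                                   # only strictly earlier events excite
    decay = omega * np.exp(-omega * (t - times[past]))  # shape (n_past,)
    # alpha[marks[past]] has shape (n_past, n_marks): one row of influences per past event
    excitation = decay @ alpha[marks[past]]
    return mu + excitation

# Toy example with two marks (all values are illustrative only).
mu = np.array([0.1, 0.2])
alpha = np.array([[0.5, 0.1],
                  [0.2, 0.4]])
lam = mhp_intensity(2.0, times=[0.0, 1.0], marks=[0, 1], mu=mu, alpha=alpha, omega=1.0)
```

Each past event adds its row of the influence matrix, scaled by how recently it occurred, so recent events of a strongly influencing mark raise the intensity the most.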
The intensity function is given by

$$\lambda_y(t) = \mu_y + \sum_{t_\ell < t} \alpha_{y_\ell y}\, \kappa(t - t_\ell), \tag{2}$$

where the base intensity $\mu_y$ is a constant base value per stance label and the triggering kernel $\kappa(t - t_\ell) = \omega e^{-\omega (t - t_\ell)}$ (an exponentially decaying kernel, or excitation function) captures the extent of influence of past events as a function of the elapsed time $t - t_\ell$. The matrix $\alpha$ of size $|Y| \times |Y|$ captures the influence between classes, e.g. a tweet belonging to class 'Support' may have less influence on future tweets belonging to class 'Deny' but higher influence on future tweets belonging to class 'Support' or 'Comment'. However, the influence of a 'Support' tweet on a future 'Support' tweet can be low if the future tweet occurs much later in time. This is captured by multiplying the influence matrix with the exponentially decaying triggering kernel, which models the exponential decay of the influence of past events on future events. The likelihood function is given by

$$L = \prod_{n=1}^{N} \lambda_{y_n}(t_n) \times \exp\left(-\sum_{y \in Y} \int_0^T \lambda_y(s)\, ds\right). \tag{3}$$

The first part of the likelihood function is the joint likelihood of the tweets at times $t_1, \ldots, t_N$, and the second part provides the likelihood that no tweets happen in the interval $[0, T]$ except at times $t_1, \ldots, t_N$. Here, the intensity function can be defined in different ways, depending on the problem set-up.

To incorporate the textual information of tweets, we propose different models where the textual features are modeled in a discriminative way as part of the intensity function of the multivariate Hawkes process. We present different ways to model text through the base intensity as well as the triggering kernel. Along with time-based kernels, text-based kernels can better represent the impact of historical events. We also introduce a methodology where we use a neural network as a kernel to model text and time. We discuss the proposed models in detail in the following section.

The base intensity influences the arrival of events due to exogenous factors. In a standard Hawkes process model, the base intensity is constant and learnt from the data.
However, we propose a model (Textual HP) where the base intensity considers the textual features. Along with this, we capture the influence of previous tweets using the excitation kernel. In this way, we can model text within the framework of the Hawkes process. The base intensity is no longer a constant and depends on the textual content of the post at time $t$. The base intensity is normalized across all labels to prevent it from having a dominating influence on the intensity function:

$$\mu_{y,t} = \frac{\exp(\theta_y \times x_t)}{\sum_{y'=1}^{|Y|} \exp(\theta_{y'} \times x_t)}, \tag{4}$$

where
• $x_t$ is the $V$-dimensional text representation of the tweet at time $t$.
• $\theta$, of size $|Y| \times |V|$, are the weights associated with the classes.

Using this base intensity, we can write the intensity function as

$$\lambda_y(t) = \mu_{y,t} + \sum_{t_\ell < t} \alpha_{y_\ell y}\, \kappa(t - t_\ell). \tag{5}$$

As discussed in equation 2, $\alpha$, of size $|Y| \times |Y|$, captures the influence between classes of tweets, and $\kappa(t - t_\ell) = \omega \exp(-\omega (t - t_\ell))$ is the exponentially decaying kernel which captures the effect of previous tweets. The base intensity will be higher for posts whose textual content resembles that of a stance label. Consequently, the model favours this stance for a post if the influence of past labels, weighted by time, is also favourable. In this way, we augment the Hawkes process with both time and text based information. The likelihood is

$$L = \prod_{n=1}^{N} \lambda_{y_n}(t_n) \times \exp\left(-\sum_{y \in Y} \int_0^T \lambda_y(s)\, ds\right), \tag{6}$$

where the intensity function is defined as in (5). Note that this differs from equation 3 in that the base intensity used in the intensity function is capable of modeling text as well. The parameters of the model, i.e. $\theta$ and $\alpha$, are learnt by maximizing the full likelihood. The log-likelihood function can be given as follows:

$$\ell(\theta, \alpha) = \sum_{n=1}^{N} \log \lambda_{y_n}(t_n) - \sum_{y \in Y} \int_0^T \lambda_y(s)\, ds. \tag{7}$$

We add a regularization term over the text weights $\theta$, where $C$ is a hyper-parameter, for better generalization of the model. Expanding the individual components of equation (7) gives equation (8). We estimate the parameters by maximizing the log-likelihood function in equation 8. We find the parameters by joint gradient-based optimization over $\theta$ and $\alpha$, using the partial derivatives of the log-likelihood. For the optimization, we employ the L-BFGS approach to gradient search.
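As a rough illustration of the Textual HP intensity described above, the following sketch evaluates a softmax-normalized, text-dependent base intensity and adds the time-decayed influence of past labels. It is a minimal sketch under stated assumptions, not the authors' code: the function names, argument names and toy values are hypothetical, and the kernel is assumed to be $\omega \exp(-\omega (t - t_\ell))$.

```python
import numpy as np

def softmax_base_intensity(theta, x_t):
    """Text-dependent base intensity: softmax of theta @ x_t over the stance classes."""
    scores = theta @ x_t
    scores = scores - scores.max()      # stabilizes exp(); the softmax value is unchanged
    e = np.exp(scores)
    return e / e.sum()

def textual_hp_intensity(t, x_t, past_times, past_labels, theta, alpha, omega):
    """Per-class intensity at time t for a post with text features x_t.

    lambda_y(t) = mu_{y,t} + sum_{t_l < t} alpha[y_l, y] * omega * exp(-omega * (t - t_l))
    (exponential kernel assumed in this sketch).
    """
    mu = softmax_base_intensity(theta, x_t)
    past_times = np.asarray(past_times, dtype=float)
    past_labels = np.asarray(past_labels, dtype=int)
    mask = past_times < t
    decay = omega * np.exp(-omega * (t - past_times[mask]))
    return mu + decay @ alpha[past_labels[mask]]

# Toy setting: 4 stance classes, 3-dimensional text features (illustrative values only).
theta = np.zeros((4, 3))
x = np.array([1.0, 2.0, 3.0])
alpha = np.eye(4)
lam = textual_hp_intensity(1.0, x, past_times=[0.0], past_labels=[0],
                           theta=theta, alpha=alpha, omega=1.0)
predicted_class = int(np.argmax(lam))   # choose the class with the highest intensity
```

With all-zero weights the base intensity is uniform, so in this toy case the prediction is driven entirely by the label of the earlier post; with learnt weights the text would shift the base intensity towards the classes it resembles.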
The partial derivatives obtained after expanding equation 8 involve $K(T - t_\ell) = 1 - \exp(-\omega (T - t_\ell))$, which arises from the integration of $\kappa(t - t_\ell)$, and the normalizer of the base intensity, $Z = \sum_{y=1}^{|Y|} \exp(\theta_y \times x_t)$.

We propose a model (Fully Textual HP) where we use a text-based kernel in combination with the exponentially decaying kernel based on time, in addition to our base HP model discussed in Section 5.1. The text-based kernel can help in representing the influence of past events/tweets based on their textual similarity. In this case, our model comprises a text-based base intensity, an exponentially decaying kernel to model the times of tweets, and a text-based kernel as well. We can use different types of text kernels, such as the Gaussian kernel, linear kernel or polynomial kernel. Similar to equation 5, we can write the intensity function for the proposed model as

$$\lambda_y(t) = \mu_{y,t} + \sum_{t_\ell < t} \alpha_{y_\ell y}\, \kappa(t - t_\ell)\, k(x_t, x_\ell).$$

The base intensity $\mu_{y,t}$ is the same as the one used in equation (4). The Gaussian kernel can be given by

$$k(x_t, x_\ell) = \exp\left(-\gamma \lVert x_t - x_\ell \rVert^2\right),$$

where $\gamma$ is the hyper-parameter. When the textual content of the post at time $t$ ($x_t$) is similar to the text of a past post ($x_\ell$), $k(x_t, x_\ell)$ will be higher and consequently the influence of the corresponding label will be higher. So, we supplement our intensity function with text through both the text-based base intensity and the text-based kernel. Similar to Section 5.1.2, we can define the likelihood for this model as well.

A restriction of the previous approaches is that the past influence is specified through a predefined exponentially decaying kernel function. Often these influences can take a form other than exponential decay, and we intend to capture the functional form of the influence (kernel) through the proposed Neural Kernel Hawkes process. In the proposed model, we model kernels using a neural network, which is theoretically capable of modelling any function (universal approximator). This lets the model learn the complex non-linear relationships between historical events and the current event.
This is different from previous works [4, 17, 24] combining neural networks with the Hawkes process, where the full intensity function is modelled using a neural network, losing the interpretability advantage of HP models. Because we model only the kernels using neural networks, we retain the explainability of the Hawkes process in the form of the label-label influence provided by the matrix $\alpha$. Since the existing kernels can only learn predefined functions, neural networks give the extra advantage of being able to model any function. This model enables us to learn a more generalized version of the Hawkes process while keeping its causality intact. The intensity function can be defined as

$$\lambda_y(t) = \mu_{y,t} + \sum_{t_\ell < t} \alpha_{y_\ell y}\, \kappa_{\mathrm{NN}}(x_t, x_\ell, t - t_\ell; \phi),$$

where $\phi$ are the weights of the neural network (NN) kernel, the text and time are input together, and the base intensity $\mu_{y,t}$ is defined in (4). All the parameters, including the neural network kernel parameters, are learnt by maximizing the likelihood defined in (7). We approximate the intractable integral using a Monte Carlo approximation, which computes the average intensity over uniformly sampled times and multiplies it by the length of the time period to obtain the value of the integral. Backpropagation is applied to (7) to learn the parameters of the neural kernel.

Prediction is done by evaluating the intensity function for all the classes at the time of the post and choosing the class with the highest intensity.

We use the PHEME dataset [27] for rumour stance classification. It considers tweets belonging to nine noteworthy events that occurred around the world. Along with tweets, it also considers retweets and replies to form a tweet thread. The dataset contains a set of rumour threads. Each thread contains a source tweet as well as replies to that tweet. Every tweet is assigned one of the stance classes Supporting, Denying, Questioning or Commenting with respect to the source tweet. The detailed statistics of the dataset used are given in Figure 3.
One notable characteristic of the dataset is that the distribution of categories is skewed towards commenting tweets, and that this imbalance varies slightly across the eight events. This varying imbalance makes the task more realistic and challenging.

We have considered the following baselines:
• Hawkes Process [16]: The authors considered two approaches to optimization, one using gradient-based optimization and one approximating the parameters. These were shown to perform better than several machine learning models, including conditional random fields. We compare our results with both approaches.
• LSTM [26]: The authors exploited the sequential structure of conversational threads using LSTM.

The experiments are designed to depict real-world scenarios as closely as possible. In the real world, new rumours arise on a regular basis. We train models on old rumours and then use them for stance classification on new rumours. We follow the same idea in our experiments and call this the 'leave one out' setup.

Leave one out - Thread. Following prior work [15], we consider 4 events: Ottawa, Ferguson, Charlie Hebdo and Sydney Siege, the largest events from PHEME (each with approximately 1000 tweets). Every event in the data set has multiple tweet threads (50-70), where each thread is a new rumour generated when the event occurred. For an event with $R$ rumour threads, we train on $R - 1$ threads and test on the remaining one. We perform this $R$ times, testing on a different rumour each time. This helps in measuring the overall performance across all rumours.

Leave one out - Event. Here, the top 8 events are combined to form a bigger data set, with 4554 tweets in total. We train on 7 events at a time and test on the 8th one. This is repeated 8 times, and an average score is reported.

Evaluation Metrics. We use the popular metrics for multi-class classification, i.e. accuracy and F1 scores.
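The two metrics can be computed as in the following minimal sketch. It assumes that the macro-averaged F1 score is taken as the harmonic mean of macro-averaged precision and macro-averaged recall, and that a class with no predicted or no true instances contributes a precision or recall of 0; the function name `accuracy_and_macro_f1` and the zero-division convention are assumptions of this sketch, not details confirmed by the evaluation protocol.

```python
import numpy as np

def accuracy_and_macro_f1(y_true, y_pred, n_classes):
    """Accuracy and macro-F1 (harmonic mean of macro precision and macro recall)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    precisions, recalls = [], []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))   # correctly predicted as class c
        n_pred = np.sum(y_pred == c)                 # everything predicted as class c
        n_true = np.sum(y_true == c)                 # everything truly of class c
        # Assumed convention: empty denominators give 0 for that class.
        precisions.append(tp / n_pred if n_pred > 0 else 0.0)
        recalls.append(tp / n_true if n_true > 0 else 0.0)
    p = float(np.mean(precisions))
    r = float(np.mean(recalls))
    macro_f1 = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
    accuracy = float(np.mean(y_true == y_pred))
    return accuracy, macro_f1

# Toy example with two classes (illustrative labels only).
acc, f1 = accuracy_and_macro_f1([0, 0, 1, 1], [0, 1, 1, 1], n_classes=2)
```

Because every class contributes equally to the macro averages, a classifier that only predicts the dominant class scores poorly on macro-F1 even when its accuracy looks high, which matters for a dataset skewed towards commenting tweets.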
We consider micro-averaged accuracy and macro-averaged F1 score, as reported in previous work. The macro-averaged F1 score can be calculated as the harmonic mean of macro-averaged precision and recall. With $|Y|$ stances and $P_y$, $R_y$ denoting the precision and recall of stance $y$, the formulae for the macro F1 score can be written as follows:

$$P_{\text{macro}} = \frac{1}{|Y|} \sum_{y=1}^{|Y|} P_y, \qquad R_{\text{macro}} = \frac{1}{|Y|} \sum_{y=1}^{|Y|} R_y, \qquad F1_{\text{macro}} = \frac{2\, P_{\text{macro}}\, R_{\text{macro}}}{P_{\text{macro}} + R_{\text{macro}}}.$$

We performed our experiments on an Intel(R) Xeon(R) processor with a 2.70GHz CPU and 125 GB RAM. The Neural Kernel HP model was run on Tesla P-100 infrastructure. The tweets are subjected to various pre-processing steps: removal of stopwords, URLs and punctuation; replacement of emoticons, user mentions and URLs; followed by stemming. After preprocessing, we use a standard 100-dimensional word2vec representation (trained on Google News), obtained by averaging the word2vec representations of the tokens in a tweet. The best configuration for Base Textual HP sets the regularization parameter and the temporal kernel parameter to 0.05. For Fully Textual HP, the best results were achieved with the regularization parameter, temporal kernel parameter and text kernel parameter all set to 0.05. The best configuration for the Neural Kernel model has 2 hidden layers with 20 neurons each, a learning rate of 0.005 and a momentum of 0.9. The optimizer used is AdaGrad, and 50 samples are used for the Monte Carlo approximation.

The proposed approach is compared against the Hawkes process [16] and LSTM [26] based approaches for rumour stance classification. The Base Textual HP, which uses discriminative modeling of text with a normalized base intensity, outperforms the benchmarks for all events except Charlie Hebdo in terms of micro-accuracy, demonstrating that including textual features as part of the intensity function helps improve results. In comparison with the LSTM approach [26] in this setup, our Textual HP model gives better accuracy on all datasets except Charlie Hebdo. This shows the usefulness of HP-based models over modern neural networks, especially when the dataset is small.
The Fully Textual HP, which uses text in the base intensity as well as in the kernel, gives results comparable to the benchmark models but does not outperform Base Textual HP. This suggests that influence arising through text similarity is not very useful for predictions at the thread level, where the typical thread size is 10. Here, dissimilar texts belonging to different classes (e.g. deny and support tweets) can have a high influence on each other, which the text kernel restricts. Nevertheless, this shows another successful way of augmenting the Hawkes process with text. We also observe that Neural Kernel HP did not perform well. In Figure 5 we show an example function learned by the neural kernel against text similarity; in general, the kernel value decreases with cosine similarity. This supports the observations from Fully Textual HP. The weak overall performance of Neural Kernel HP is presumably due to the small size of the rumour stance data (1000 tweets per event).

The results for the Leave one out - Event approach explained in Section 6.3 are shown in Table 2. The Fully Textual HP gives the best results, beating the benchmarks and showing the importance of considering text similarities between posts, as opposed to only the similarities between categories used in [16]. The Neural Kernel HP model performs better in this setup but is still limited by the small data set size. On the other hand, the inductive bias of the Hawkes process assumption helps the other models perform better in this data-scarce scenario.

7.2.1 Analysis of Influence Matrix. We analyze the values learnt by the influence matrix $\alpha$. It is a 4 × 4 matrix which learns the influence of each class of tweets on the others. For example, it learns the impact of a previous tweet of class Support on the next tweet being of class Deny. In Table 3, we show sample values of the influence matrix of the Fully Textual HP model. The bold face values show the best result in a row while italics show the second best.
An interesting observation is that the Deny class has the highest influence on the Question and Deny classes. This means that a Deny tweet is often followed by a Question tweet or a Deny tweet, which makes sense in rumour stance classification. The diagonal values are relatively high, which means that each class influences the next tweet to belong to the same class, i.e. Support or Question is likely to attract more Support or Question tweets respectively than tweets belonging to other classes. The values in the last column are also usually quite high within each row. The last column belongs to the Comments class. This indicates that a comment tweet is very likely to follow tweets belonging to other classes. This is expected, as the data set has a class imbalance, with more than 60% of tweets belonging to the Comments class.

Figure 6: A snapshot of posts and intensities for 2 classes, Support and Deny, at post times using Base Textual HP on Sydney Siege data. The intensity for a class becomes higher as some tweet occurs from that class through textual features and temporal influences.

Figure 6 plots the intensity of Textual HP for Support and Deny class tweets. Tweets are associated with posting times, and in Figure 6 we plot the intensity values of tweets at their posting times for the Support and Deny classes. The intensity function values are obtained by considering temporal and textual information as discussed in equation 2. In Figure 6, we find that the intensity value is higher for the tweets of the respective classes.

Figure 5 shows the kernel function learnt against the cosine similarity of text for the Neural Kernel. The pairs of tweets are selected such that they belong to the same thread. We compute the cosine similarity (using text) and the neural kernel values (using text and time) between them. The difference in their times of occurrence ranges between 0 and 1 hours. Hence there can be multiple pairs with the same cosine similarity.
We can observe that, in general, the neural kernel value decreases with increasing cosine similarity. Figure 7b shows the function learnt by the neural network with a single time-based kernel in Neural Kernel HP. We can see that it learns a kernel similar to the exponentially decaying kernel. In Figure 7a we can see the function learnt by the text-based neural network when we use separate networks for time and text. It is an increasing function of cosine similarity. This conveys that we get a higher value when the text of a historical event is similar to that of the current event.

We carry out an ablation study on the Neural Kernel Hawkes process model by using text and time based kernels in different ways. Thereby, we learn the nature of the function learnt by the neural network.

Figure 7: 7a) displays the function learnt by the independent text neural network when we consider a neural network with two kernels, and 7b) displays the function learnt for time by the neural network kernel.

In one variation, we use only the difference of time as input to the neural network. We can observe that it learns a function similar to the exponentially decaying kernel, which explains the relevance of the exponential kernel for such settings. In another variation, we use two separate kernels for time and text. Figure 7a displays the function learnt for text under such settings. The value of the function learnt increases with increasing similarity of text. However, the results of both variants were not as good as those of the proposed model.

We proposed a novel method for text classification based on Hawkes processes, where we consider textual features, expanding on previous applications of this model. In particular, we propose a text-based kernel and a neural kernel over text and time, providing more flexible approaches to modeling influences among data points. We propose three methods to model text within the intensity function. We then perform discriminative modeling and use the intensity to obtain labels.
This enables us to capture the influence among tweets using not only time but also the textual content of tweets. We propose using kernels (exponential and neural network) over text and time for modelling influences. The neural network kernel can learn the functional form of the influence rather than predefining it as an exponential function. The approach also makes it easy to use pre-trained word embeddings in the model for effective text classification. The experiments on rumour stance classification showed the effectiveness of the proposed approaches. We infer from the results that the Textual HP model outperforms generative models of text with HP. We also show that HP-based approaches can perform better than neural networks on smaller datasets. This opens up a new direction for using text within the framework of the Hawkes process.

[1] Simple open stance classification for rumour analysis
[2] Jayesh Choudhari, and Anirban Dasgupta
[3] Point process methodology for on-line spatio-temporal disease surveillance
[4] Recurrent marked temporal point processes: Embedding event history to vector
[5] Dirichlet-hawkes processes with applications to clustering continuous-time document streams
[6] HawkesEye: Detecting Fake Retweeters Using Hawkes Process and Topic Modeling
[7] Seismicity models based on Coulomb stress calculations. Community Online Resource for Statistical Seismicity Analysis
[8] Spectra of some self-exciting and mutually exciting point processes
[9] A cluster process representation of a self-exciting process
[10] Hawkestopic: A joint model for network inference and topic modeling from text-based cascades
[11] Maximum conditional likelihood via bound maximization and the CEM algorithm
[12] Turing at semeval-2017 task 8: Sequential approach to rumour stance classification with branch-lstm
[13] All-in-one: Multi-task learning for rumour verification
[14] Multivariate hawkes processes
[15] Using Gaussian Processes for Rumour Stance Classification in Social Media
[16] Hawkes processes for continuous time sequence classification: an application to rumour stance classification in twitter
[17] The neural hawkes process: A neurally self-modulating multivariate point process
[18] Self-exciting point process modeling of crime
[19] Swapnil Mishra, and Lexing Xie. 2017. A tutorial on hawkes processes for events in social media
[20] Can Siamese Networks help in stance detection
[21] Hierarchical Dirichlet Gaussian Marked Hawkes Process for Narrative Reconstruction in Continuous Time Domain
[22] Dynamic topic modeling of the COVID-19 Twitter narrative among US governors and cabinet executives
[23] A temporal attentional model for rumor stance classification
[24] Modeling the intensity function of point process via recurrent neural networks
[25] Stance classification in rumours as a sequential task exploiting the tree structure of social media conversations
[26] Discourse-aware rumour stance classification in social media using sequential classifiers
[27] Analysing how people orient to and spread rumours in social media by looking at conversational threads