title: TA-BiLSTM: An Interpretable Topic-Aware Model for Misleading Information Detection in Mobile Social Networks
authors: Chang, Shuyu; Wang, Rui; Huang, Haiping; Luo, Jian
date: 2021-11-10
journal: Mobile Netw Appl
DOI: 10.1007/s11036-021-01847-w

As essential information acquisition tools in our lives, mobile social networks have brought us great convenience for communication. However, misleading information such as spam emails, clickbait links, and false health information appears everywhere in mobile social networks. Prior studies have adopted various approaches to detecting this information but ignored global semantic features of the corpus and lacked interpretability. In this paper, we propose a novel end-to-end model called Topic-Aware BiLSTM (TA-BiLSTM) to handle the problems above. We first design a neural topic model for mining global semantic patterns, which encodes word relatedness into topic embeddings. Simultaneously, a detection model extracts local hidden states from text content with LSTM layers. Then, the model fuses these global and local representations with the Topic-Aware attention mechanism and performs misleading information detection. Experiments on three real datasets show that TA-BiLSTM generates more coherent topics and improves detection performance jointly. Furthermore, a case study and visualization demonstrate that the proposed TA-BiLSTM can discover latent topics and help enhance interpretability.

Mobile social networks have brought us great convenience for acquiring information. Inevitably, a vast amount of useless misleading information, such as spam emails, clickbait links, and false health information, is created. This information deceives us into doing things with ill consequences. Table 1 gives two examples from the Webis-Clickbait-17 dataset showing how the meaning of content misleads people, together with the affected categories. In general, misleading information is deceptive, which makes it hard to distinguish between the two kinds of posts (positive and negative). Thus, detecting misleading information effectively is challenging, and developing an efficient, high-performing approach for misleading information detection is particularly essential.

Existing work on misleading information detection can be categorized into two types: machine learning-based approaches and deep learning-based approaches. Approaches based on machine learning often build document representations with different feature engineering techniques [10, 26, 35]. Various algorithms such as Labeled-LDA [35] and GBDT [2] also help enhance detection accuracy. Unfortunately, these approaches rely heavily on hand-designed, sophisticated features and perform poorly in complex contexts. Deep learning-based approaches extract semantic features from content through multiple non-linear units to solve the above problems. Convolutional neural networks [1, 17], recurrent neural networks [23], and combinations of the two [22] are commonly used frameworks. Still, these approaches are limited to local semantic information and severely lack interpretability due to their complex structures. To address the above limitations, we propose a novel model called Topic-Aware BiLSTM (TA-BiLSTM) to add corpus-level topic relatedness and enhance interpretability.
Specifically, TA-BiLSTM is decomposed into two parts: a neural topic model module and a text classification module. Assuming that a multi-layer neural network can approximate a document's topic distribution, we model topics with a Wasserstein autoencoder (WAE) [37]. The neural topic model module constructs the topic distribution in latent space and reconstructs the document representation. The topic distribution can concurrently be transformed into the topic embedding provided to the attention mechanism. Unlike previous variational autoencoder-based approaches [29, 36], our model minimizes the Maximum Mean Discrepancy regularizer [15] based on Optimal Transport theory [39] to reduce the Wasserstein distance between the topic distribution and the Dirichlet prior. Furthermore, the text classification module utilizes a two-layer bidirectional LSTM based on the Topic-Aware attention mechanism to extract semantic features. This attention mechanism incorporates topic relatedness information while calculating the representation. Finally, we feed the representations to the classifier for misleading information detection. To balance the learning of the two tasks, we leverage a dynamic strategy to control the importance of their objectives: we concentrate on the neural topic model first, then train the classification objective and the topic modeling objective simultaneously.

The main contributions of our work are as follows:

• We propose a novel end-to-end framework, Topic-Aware BiLSTM, for misleading information detection.
• We introduce a new Topic-Aware attention mechanism to encode a document's local semantic and global topical representation.
• Experiments are conducted on three public datasets to verify the effectiveness of our Topic-Aware BiLSTM model in terms of topic coherence measures and classification metrics.
• We select representative cases from different datasets for visualization, demonstrating that Topic-Aware BiLSTM offers better interpretability than traditional approaches.

The remainder of the paper is organized as follows: Section 2 reviews the relevant work, and Section 3 introduces preliminary techniques. Section 4 introduces the methodology of the Topic-Aware BiLSTM model. Experiments and result analysis are given in Section 5. Lastly, Section 6 concludes the paper.

Our work is related to three lines of research: misleading information detection, topic modeling, and attention mechanisms.

Misleading information detection models can be categorized into two streams based on implementation techniques: machine learning-based approaches and deep learning-based approaches. Generally, machine learning-based approaches need to design specific representations of texts. For example, Liu et al. [26] employs both local and global features via Latent Dirichlet Allocation and utilizes Adaboost to detect spammers. Likewise, Chakraborty et al. [7] uses multinomial Naive Bayes classifiers on pruned features of clickbait data. Different models in this branch can also yield different detection performance. Song et al. [35] proposes labeled latent Dirichlet allocation to mine latent topics from user-generated comments and filter social spam. Biyani et al. [2] uses Gradient Boosted Decision Trees [11] to detect clickbait in news streams. Similarly, Elhadad et al. [10] detects misleading information about COVID-19 by constructing a voting mechanism.
However, approaches in this branch often require sophisticated feature engineering and cannot capture deep semantic patterns. Thanks to the rapid development of deep representation learning, approaches such as convolutional neural networks and recurrent neural networks have been applied to extract semantic representations from text directly. Agrawal [1] and Hai-Tao et al. [17] utilize a convolutional neural network to detect misleading information in clickbait. Kumar et al. [23] adopts a bidirectional LSTM with an attention mechanism to learn how each word contributes differently to the clickbait score. Jain et al. [22] constructs a deep learning architecture based on convolutional layers and long short-term memory layers. Nevertheless, deep learning-based approaches often have complex structures and severely lack interpretability. Thus, we integrate a neural topic model to provide corpus-level semantic information and enhance interpretability.

Given a collection of documents, each document discusses different topics. Topic modeling is an efficient technique that can mine latent semantic patterns from a corpus. Latent Dirichlet Allocation (LDA) [3] is the most widely used traditional probabilistic generative model for topic mining. Unlike traditional graphical topic models, Miao et al. [29] proposes NVDM, a neural topic model based on variational autoencoders (VAE). Variational autoencoders use the KL divergence to measure the distance between the topic distribution and a Gaussian prior. ProdLDA [36] utilizes a Dirichlet prior approximated through the Laplace approximation and improves topic quality. In another direction, Wang et al. proposes ATM [43], BAT, and Gaussian-BAT [44] in an adversarial manner, and Wang et al. [42] also extends the ATM model to open event extraction. Inspired by the ATM model, Hu et al. [20] attempts to improve topic modeling with cycle-consistent adversarial training and names this approach ToMCAT. Zhou et al. [49] extends this line of work by treating documents and words as nodes in a graph. Furthermore, autoencoders can be trained stably and reduce the dimensionality of document representations [25] to extract the most effective information [48]. Accordingly, Nan et al. [31] incorporates adversarial training into the Wasserstein autoencoder framework and proposes the W-LDA model for unsupervised topic extraction.

The attention mechanism was originally inspired by the processing mechanism of human vision. When we see a picture, our brain prioritizes the main content in the image, ignoring the background and other irrelevant information. Inspired by this mechanism of the human brain, various attention mechanisms have achieved success in natural language processing tasks such as sentiment analysis [45] and machine translation [27]. The typical attention mechanism only attends to word-level dependencies and assigns weights so that the model can highlight key elements of sentences [18]. The hierarchical attention mechanism [47] goes further with two-layer attention, applied successively at the word level and the sentence level, to generate a document representation with rich semantics. Besides, Vaswani et al. [38] proposes the self-attention mechanism to deal with increasing text length; self-attention calculates associations between words in a sentence directly. Previous work [16, 41] has shown that topic information can improve the semantic representation of text with the help of attention mechanisms.
Nevertheless, to the best of our knowledge, no such work has been conducted on misleading information detection, so we explore this direction in this work.

Latent Dirichlet Allocation (LDA) is the most commonly used generative model for topic extraction. It assumes that a document can be represented by a probability distribution over topics, and each topic can be represented by a probability distribution over words. To learn topics better, LDA uses the Dirichlet distribution as the prior over the latent space. LDA uses θ_d to denote the topic distribution of a document d and z_n to represent the topic allocation of the word w_n. The generative process of documents is shown in Algorithm 1. Here, Dir(α) is the Dirichlet prior distribution, α signifies the hyper-parameter of the Dirichlet prior, and θ_d is the topic distribution of document d sampled from the Dirichlet prior. z_n denotes the topic allocation of each position n in the document, and w_n is a word randomly generated from the corresponding multinomial distribution. ϕ_i is the topic-word distribution of the i-th topic, and ϕ_{z_n} is one column of this matrix. LDA infers these parameters in an unsupervised manner. After model training, we can obtain the representative words with high probabilities in each topic, and these words convey the semantic meaning of each topic.

Text is sequential data, and small changes in word order can affect the meaning of an entire sentence. However, traditional feedforward neural networks cannot directly extract the word dependencies of context. Thus, researchers developed sequential models such as Recurrent Neural Networks (RNN) to extract sequential and contextual features from such data [21]. An RNN comprises an input layer, a hidden layer, and an output layer. However, as the length of sentences increases, the training process suffers from vanishing and exploding gradients. The Long Short-Term Memory (LSTM) [19] adds a cell state to store long-term memory [13], which can deal with this problem. Assume that x_j ∈ R^{D_w} represents the word embedding of the j-th word in the content and D_w is the dimension of word embeddings. The LSTM feeds in word embeddings as a sequence and calculates the hidden state h_j ∈ R^{D_h} for each word, where D_h is the dimension of hidden states. The calculation procedure follows the equations below:

f_j = σ(W_f [h_{j−1}; x_j] + b_f)
i_j = σ(W_i [h_{j−1}; x_j] + b_i)
C̃_j = tanh(W_C [h_{j−1}; x_j] + b_C)
C_j = f_j ⊙ C_{j−1} + i_j ⊙ C̃_j
o_j = σ(W_o [h_{j−1}; x_j] + b_o)
h_j = o_j ⊙ tanh(C_j)

where W_f, W_i, W_C, W_o and b_f, b_i, b_C, b_o are learnable parameters, and σ(·) is the sigmoid function. The forget gate f_j determines the information that needs to be retained from the cell state C_{j−1}. The input gate i_j controls the proportion of new information stored in the new candidate C̃_j. Lastly, the LSTM constrains the hidden state of the current node through the output gate o_j. This elaborate design enables the LSTM to learn longer dependencies and better semantic representations.

In this section, we introduce the Topic-Aware BiLSTM (TA-BiLSTM) model. As depicted in Fig. 1, it consists of two modules: the neural topic model module mines global topic information from the text corpus, and the text classification module utilizes a two-layer BiLSTM network based on the Topic-Aware attention mechanism to detect misleading information from text.

As shown in the left panel of Fig. 1, the neural topic model module is composed of an encoder and a decoder. (1) The encoder takes the V-dimensional x_bow of the document as input and transforms it into a topic distribution θ with K dimensions through two fully connected layers. (2) The decoder takes the encoded topic distribution θ as input and reconstructs the document representation x̂_bow through the reconstruction distribution x_re; after decoding by its first layer, the topic embedding v_t is collected.
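To make this layout concrete, the following is a minimal PyTorch sketch of the encoder-decoder skeleton under our reading of the description above. The class and variable names are ours, the default dimensions are placeholders, and the use of softmax to put the topic and reconstruction distributions on the simplex is an assumption; it is a sketch, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NeuralTopicModel(nn.Module):
    """Sketch of the WAE-style topic model module: x_bow -> theta -> (v_t, x_re)."""

    def __init__(self, vocab_size, n_topics=50, d_s=256, d_t=100, leak=0.2):
        super().__init__()
        # Encoder: two fully connected layers, BoW -> topic distribution theta
        self.enc_fc1 = nn.Linear(vocab_size, d_s)
        self.enc_bn1 = nn.BatchNorm1d(d_s)
        self.enc_fc2 = nn.Linear(d_s, n_topics)
        # Decoder: two fully connected layers, theta -> topic embedding v_t -> x_re
        self.dec_fc1 = nn.Linear(n_topics, d_t)
        self.dec_bn1 = nn.BatchNorm1d(d_t)
        self.dec_fc2 = nn.Linear(d_t, vocab_size)
        self.leak = leak

    def encode(self, x_bow):
        h_s = self.enc_bn1(self.enc_fc1(x_bow))
        o_s = F.leaky_relu(h_s, negative_slope=self.leak)
        theta_e = torch.softmax(self.enc_fc2(o_s), dim=-1)  # document-topic distribution
        return theta_e

    def decode(self, theta):
        h_t = self.dec_bn1(self.dec_fc1(theta))
        v_t = F.leaky_relu(h_t, negative_slope=self.leak)   # topic embedding for attention
        x_re = torch.softmax(self.dec_fc2(h_t), dim=-1)     # reconstruction distribution
        return v_t, x_re

    def forward(self, x_bow):
        theta = self.encode(x_bow)
        v_t, x_re = self.decode(theta)
        return theta, v_t, x_re
```

A forward pass returns the topic distribution θ, the topic embedding v_t that the Topic-Aware attention mechanism consumes later, and the reconstruction distribution x_re used by the topic modeling objective.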
Besides, to ensure the quality of the extracted topics, we use the Wasserstein distance to conduct prior matching in the latent topic space.

Given a corpus C_d = {d_1, d_2, ..., d_n}, the encoder utilizes each document's bag-of-words representation x_bow as input, where the weights are calculated by the TF-IDF formulation:

TF_{ij} = c_{ij} / Σ_k c_{kj}
IDF_i = log( |C_d| / |{j : w_i ∈ d_j}| )
x_bow^{(i)} = TF_{ij} · IDF_i

where c_{ij} indicates the number of times the word w_i appears in document d_j, and Σ_k c_{kj} is the total number of words in document d_j. |C_d| indicates the total number of documents in the corpus, and |{j : w_i ∈ d_j}| represents the number of documents containing word w_i. x_bow^{(i)} refers to the semantic relevance of the i-th word of the vocabulary in document d_j. According to the above equations, each document can be represented as x_bow ∈ R^V, where V indicates the vocabulary size.

The encoder first maps x_bow into the D_s-dimensional semantic space through the following transformation:

h_s = BN(W_s x_bow + b_s)
o_s = LeakyReLU(h_s, leak)

where W_s ∈ R^{D_s×V} and b_s ∈ R^{D_s} are the weight matrix and bias term of the fully connected layer, h_s is the hidden state normalized by batch normalization BN(·), leak denotes the hyper-parameter of the LeakyReLU activation, and o_s represents the output of the layer. Subsequently, the encoder projects the output vector o_s into a K-dimensional document-topic distribution θ_e:

θ_e = softmax(W_o o_s + b_o)

where W_o ∈ R^{K×D_s} and b_o ∈ R^K are the weight matrix and bias term of the fully connected layer, θ_e denotes the topic distribution corresponding to the input x_bow, and the k-th dimension θ_e^{(k)} is the proportion of the k-th topic in the document.

We add noise to the document-topic distribution to draw more consistent topics. We randomly sample a noise vector θ_n from the Dirichlet prior and merge it with θ_e. The calculation is defined as:

θ = (1 − η) · θ_e + η · θ_n

where η ∈ [0, 1] denotes the mixing proportion of the noise. The encoder thus transforms the bag-of-words representation into a topic distribution that captures the semantic information in latent space.

The decoder takes the topic distribution θ as input, and two fully connected layers reconstruct the document's word representation x̂_bow. After the transformation of the first layer, v_t serves as the topic embedding of the input document and is provided to the attention mechanism. The decoder first transforms the topic distribution θ into the D_t-dimensional topic embedding space:

h_t = BN(W_t θ + b_t)
v_t = LeakyReLU(h_t, leak)

where W_t ∈ R^{D_t×K} and b_t ∈ R^{D_t} are the weight matrix and bias of the fully connected layer, and h_t is the hidden vector normalized by batch normalization BN(·). The vector v_t is activated by the LeakyReLU and then used in the Topic-Aware attention mechanism. Subsequently, the decoder transforms the hidden vector h_t into the V-dimensional reconstruction distribution:

x_re = softmax(W_r h_t + b_r)

where W_r ∈ R^{V×D_t} and b_r ∈ R^V are the weight matrix and bias, and x_re is the reconstruction distribution.

The decoder is an essential part of the neural topic model. After model training, it can generate the words corresponding to each topic: we input one-hot vectors into the decoder to obtain the word distribution of each topic and use the 10 words with the highest probability in each topic to represent its semantic meaning. Based on the topic distribution and the semantics of the topics, interpretable word-level information can be provided for classifying documents in the detection process.

Since the Dirichlet distribution is commonly regarded as the prior of the multinomial distribution, choosing this prior has substantial advantages [40]. To match the encoded topic distribution to the Dirichlet prior, we add a regularizer in TA-BiLSTM.
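As an illustration of the input construction and the noise-mixing step above, here is a small NumPy sketch. The function names are ours, the small smoothing constants are guards we added against division by zero, and the convex combination used to merge θ_e with the Dirichlet noise reflects our reading of the mixing proportion η.

```python
import numpy as np

def tfidf_bow(doc_tokens, vocab, n_docs, doc_freq):
    """TF-IDF weighted bag-of-words vector x_bow for one document.

    doc_tokens: list of tokens in the document
    vocab:      dict mapping word -> index (vocabulary size V)
    n_docs:     total number of documents |C_d|
    doc_freq:   array of length V with document frequencies |{j : w_i in d_j}|
    """
    counts = np.zeros(len(vocab))
    for w in doc_tokens:
        if w in vocab:
            counts[vocab[w]] += 1
    tf = counts / max(counts.sum(), 1.0)         # term frequency within the document
    idf = np.log(n_docs / (doc_freq + 1e-12))    # inverse document frequency (guarded)
    return tf * idf

def mix_dirichlet_noise(theta_e, alpha=0.001, eta=0.1, rng=None):
    """Merge the encoded topic distribution with a Dirichlet noise sample
    (the convex combination is our assumption about the mixing step)."""
    rng = rng or np.random.default_rng()
    theta_n = rng.dirichlet(np.full(len(theta_e), alpha))
    return (1.0 - eta) * theta_e + eta * theta_n
```

With the inputs and noise mixing in place, we return to the prior-matching regularizer.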
To this end, the training process minimizes a regularization term based on the Maximum Mean Discrepancy (MMD) [15] to reduce the Wasserstein distance, which measures the divergence between the encoded topic distribution θ and samples θ̃ drawn from the prior. Given a kernel function k : Θ × Θ → R, the MMD-based regularizer can be defined as:

MMD_k(Q, P) = || E_{θ∼Q}[k(θ, ·)] − E_{θ̃∼P}[k(θ̃, ·)] ||_H

where H is the Reproducing Kernel Hilbert Space (RKHS) of real-valued functions mapping Θ to R, k(·, ·) is the kernel function of this space, and k(θ, ·) maps θ to features in the high-dimensional space. As the distributions in the latent space are matched with the Dirichlet prior on the simplex, we choose the information diffusion kernel [24] as the kernel function. This function is sensitive to points near the simplex boundary and works better on sparse data. The detailed calculation is:

k(θ, θ′) = exp( − arccos²( Σ_{i=1}^{K} sqrt(θ^{(i)} θ′^{(i)}) ) )

When performing distribution matching, we employ the Dirichlet distribution with hyper-parameter α; a sample θ̃ can then be drawn by the following equations:

γ_i ∼ Gamma(α^{(i)}, 1),  θ̃^{(i)} = γ_i / Σ_{k=1}^{K} γ_k

where θ̃^{(i)} denotes the value of the i-th dimension of θ̃, α^{(i)} denotes the hyper-parameter of the i-th dimension of the Dirichlet distribution, and θ̃ represents a sample drawn from the Dirichlet prior. Given M encoded samples and M samples drawn from the Dirichlet prior, the MMD can be calculated by the following unbiased estimate:

MMD̂ = 1/(M(M−1)) Σ_{i≠j} k(θ_i, θ_j) + 1/(M(M−1)) Σ_{i≠j} k(θ̃_i, θ̃_j) − 2/M² Σ_{i,j} k(θ_i, θ̃_j)

where {θ_1, θ_2, ..., θ_M} ∼ Q are the samples collected from the encoder, Q is the encoded distribution of samples, and {θ̃_1, θ̃_2, ..., θ̃_M} ∼ P are sampled from the prior distribution P.

In this subsection, we introduce the text classification module. As depicted in the right panel of Fig. 1, we utilize a two-layer BiLSTM based on the Topic-Aware attention mechanism. Because of the complex context of misleading information, we incorporate corpus-level topic features through this mechanism to obtain a richer semantic representation. Then, we use a classifier with two fully connected layers to detect misleading information.

The bag-of-words representation is sparse, and a typical solution to the sparsity problem is computational intelligence [46] such as word embedding technology. Word2vec [30] and GloVe [32] use words as the smallest unit for training, while fastText [4] splits words into n-gram subwords to construct vectors. Considering that there are many out-of-vocabulary words in misleading information, we use an embedding layer initialized with pre-trained fastText vectors. Suppose the word sequence of a document is d = {w_1, w_2, ..., w_m}, where w_i represents the i-th word in the content. After transforming each word into a one-hot vector, the embedding layer maps words to their corresponding vectors x_embed ∈ R^{D_w}, where D_w is the dimension of the embedding space.

Then, we utilize a two-layer BiLSTM to extract semantic features, and each layer contains bidirectional LSTM units. This bidirectional structure implements the semantic contextual representation of misleading information. The network takes x_embed in the order of the content as input and obtains each word's hidden state. If the LSTM unit is abbreviated as LSTM(·), the hidden state h of each word can be calculated by:

h_f1 = LSTM(x_embed),  h_b1 = LSTM(x_embed)
h_f2 = LSTM([h_f1; h_b1]),  h_b2 = LSTM([h_f1; h_b1])
h = [h_f2; h_b2; x_embed]

where h_f1, h_f2 ∈ R^{D_h} are the vectors calculated by the forward LSTMs, and h_b1, h_b2 ∈ R^{D_h} are the vectors calculated by the backward LSTMs. h ∈ R^{2×D_h+D_w} is the hidden state that combines the word embedding and the bidirectional LSTM.
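For concreteness, a minimal PyTorch sketch of the embedding layer and the two-layer BiLSTM described above is given below. The class name and default hidden size are ours, the dropout value follows the experimental settings reported later, and concatenating the word embeddings with the last layer's bidirectional states is our reading of the stated dimension 2×D_h + D_w.

```python
import torch
import torch.nn as nn

class BiLSTMEncoder(nn.Module):
    """Sketch of the two-layer BiLSTM that produces the hidden states H fed to attention."""

    def __init__(self, embedding_weights, d_h=100, dropout=0.3):
        super().__init__()
        # Embedding layer initialized from pre-trained fastText vectors (D_w = 300 in the paper)
        self.embedding = nn.Embedding.from_pretrained(embedding_weights, freeze=False)
        d_w = embedding_weights.size(1)
        self.bilstm = nn.LSTM(d_w, d_h, num_layers=2, bidirectional=True,
                              batch_first=True, dropout=dropout)

    def forward(self, token_ids):
        x_embed = self.embedding(token_ids)   # (batch, L, D_w)
        h_out, _ = self.bilstm(x_embed)       # (batch, L, 2 * D_h), last layer's states
        # Concatenate the bidirectional states with the word embeddings,
        # giving per-word vectors of size 2 * D_h + D_w as described in the text.
        return torch.cat([h_out, x_embed], dim=-1)
```

The returned tensor H of shape (batch, L, 2·D_h + D_w) is the sequence of per-word hidden states that the Topic-Aware attention mechanism weights and aggregates.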
Generally, the attention mechanism is similar to human behavior when reading a sentence: it evaluates how important each word is by assigning a weight to each part [50]; the higher the value, the more important the word. In a typical attention-based model, the alignment score of each word is calculated as:

f(h_i) = q^⊤ h_i

where q ∈ R^{D_h} is a learnable parameter vector. However, typical attention mechanisms cannot utilize external information, so we design the Topic-Aware attention mechanism to incorporate topic features while calculating the misleading information representation. In this way, we integrate the neural topic model module and the text classification module and train the entire model end-to-end.

The attention weights a for each word are calculated based on the similarity between the topic embedding v_t and the hidden states H = {h_1, h_2, ..., h_L} in the last layer of the BiLSTM, where L represents the maximum sentence length in the batch. Specifically, TA-BiLSTM computes the attention weight a^{(i)} based on the alignment score between the hidden state h_i and the topic embedding v_t, where i = {1, 2, ..., L}. We set D_t = D_h and use the following equation to calculate the alignment score:

f(h_i, v_t) = v_t^⊤ tanh(W_a h_i + b_a)

where W_a ∈ R^{D_h×D_h} and b_a ∈ R^{D_h} are learnable parameters. The larger the value of f(h_i, v_t), the greater the probability of misleading information implied by the corresponding word. Then, the document representation can be summarized based on the alignment scores above:

a^{(i)} = exp(f(h_i, v_t)) / Σ_{j=1}^{L} exp(f(h_j, v_t))
v_d = Σ_{i=1}^{L} a^{(i)} h_i

where a^{(i)} is the weight of the hidden state h_i of the i-th word, and v_d ∈ R^{D_h} contains both the semantics of the hidden states and the topic information embedded by the neural topic model.

In this paper, text that contains misleading information is taken as a positive example. We apply two fully connected layers and a sigmoid activation function to convert the document representation v_d into a probability for classification. Therefore, the higher the output value, the more likely the document contains misleading information. The prediction process can be defined as:

ŷ = σ(W_c2 (W_c1 v_d + b_c1) + b_c2)

where W_c1 ∈ R^{D_m×D_h}, b_c1 ∈ R^{D_m}, W_c2 ∈ R^{1×D_m}, and b_c2 ∈ R are learnable parameters, and ŷ is the predicted probability.

In a multi-task learning framework, models are optimized for multiple objectives jointly. Our proposed framework has two main training objectives: the neural topic modeling objective and the misleading information detection objective. For neural topic modeling, the objective includes the reconstruction term and the MMD-based regularization term. It is defined as follows:

L_T = c(x_bow, x_re) + MMD̂(Q, P),  with  c(x_bow, x_re) = − Σ_{i=1}^{V} x_bow^{(i)} log x_re^{(i)}

where c(x_bow, x_re) is the reconstruction loss, x_bow^{(i)} denotes the weight of the i-th word of the vocabulary, and x_re^{(i)} denotes the probability of the i-th word in the reconstruction distribution. In our implementation, we follow W-LDA and multiply a scaling factor μ = 1/(l log V) to balance the two terms, where l indicates the average sentence length in each batch and V indicates the vocabulary size.

For the classification objective, we measure the binary cross-entropy between the target label and the predicted output:

L_C = − (1/N) Σ_{i=1}^{N} [ y_i log ŷ_i + (1 − y_i) log(1 − ŷ_i) ]

where y_i is the ground truth, ŷ_i represents the predicted probability of the i-th document, and N is the total number of documents in the corpus.

To balance the two task-specific objectives, we adopt a dynamic strategy to control their weights. The neural topic model is the main concern in the early stage, and then we pay more attention to the classification objective. Thus, the total training objective is formed as:

L = L_T + λ · L_C

where λ is a hyper-parameter that dynamically balances the two objectives.
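The two objectives and their combination can be sketched as follows. This is a hypothetical PyTorch rendering: the function names are ours, the ε smoothing terms are guards we added, and applying the scaling factor μ to the reconstruction term is our assumption about how the two terms are balanced.

```python
import torch
import torch.nn.functional as F

def information_diffusion_kernel(a, b):
    """k(p, q) = exp(-arccos^2(sum_i sqrt(p_i * q_i))) for batches of simplex points."""
    inner = torch.sqrt(a.unsqueeze(1) * b.unsqueeze(0) + 1e-12).sum(-1)
    inner = inner.clamp(0.0, 1.0 - 1e-7)              # keep arccos numerically safe
    return torch.exp(-torch.acos(inner) ** 2)

def mmd_loss(theta_q, theta_p):
    """Unbiased MMD estimate between encoded samples theta_q and prior samples theta_p."""
    m = theta_q.size(0)
    k_qq = information_diffusion_kernel(theta_q, theta_q)
    k_pp = information_diffusion_kernel(theta_p, theta_p)
    k_qp = information_diffusion_kernel(theta_q, theta_p)
    off_diag = 1.0 - torch.eye(m, device=theta_q.device)
    return ((k_qq * off_diag).sum() + (k_pp * off_diag).sum()) / (m * (m - 1)) \
        - 2.0 * k_qp.mean()

def topic_model_loss(x_bow, x_re, theta, theta_prior, mu):
    """L_T: scaled reconstruction cross-entropy plus the MMD regularizer."""
    recon = -(x_bow * torch.log(x_re + 1e-12)).sum(-1).mean()
    return mu * recon + mmd_loss(theta, theta_prior)

def total_loss(x_bow, x_re, theta, theta_prior, y_true, y_pred, mu, lam):
    """L = L_T + lambda * L_C, where lam follows the dynamic schedule described next."""
    l_t = topic_model_loss(x_bow, x_re, theta, theta_prior, mu)
    l_c = F.binary_cross_entropy(y_pred, y_true)      # classification objective on sigmoid outputs
    return l_t + lam * l_c
```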
We set λ to a small value in the early stage, allowing the framework to train the neural topic model preferentially. Later, we change λ to 1, shifting the focus to multi-task learning and training the classifier and the neural topic model jointly.

We conduct experiments on three public datasets of misleading information to evaluate the effectiveness of the proposed TA-BiLSTM model.

Enron Spam [28] is an English public spam dataset compiled in 2006. Ham emails are collected from the mailboxes of six employees of the Enron Corporation. Spam messages are obtained from four sources: the SpamAssassin corpus, the Honeypot project, the spam collection of Bruce Guenter, and spam collected by third parties. These emails were sent and received between 2001 and 2005. The dataset consists of six sub-datasets, which are combined into a whole dataset for our experiments.

2007 TREC [9] was published by the Text Retrieval Conference (TREC), a series of workshops that mainly focuses on the problems and challenges in information retrieval research. The 2007 TREC conference held a spam filtering competition and published this dataset. The dataset includes complete mail information such as sending and receiving addresses, time, and HTML code. In the experiments, we retain the content of the main body and ignore other information.

Webis-Clickbait-17 [33] contains a total of 19,538 Twitter posts with links from 27 major news publishers in the United States. These posts were published between November 2016 and June 2017. Five annotators from Amazon Mechanical Turk marked whether the articles behind these links were misleading information. We use the content of the articles linked in each post for detection.

After removing noisy data such as blanks and duplicate documents from the three datasets, the statistics of the preprocessed datasets are listed in Table 2. We use 2/3 of the data as the training set and 1/3 as the test set. In the experiments, all datasets use the package enchant to check the spelling of words. Each word is reverted to its base form, with no inflectional suffixes, by the en_core_web_lg model of the package spacy. We utilize the package gensim to obtain the word embedding matrix and initialize the embedding layer.

For the neural topic model, we set the number of topics K to 50 and the dimension D_s of the fully connected layer in the encoder to 256. The dimension D_t of the topic embedding is equal to the dimension D_h of the hidden state h. To make the Dirichlet prior as sparse as possible, we set the Dirichlet hyper-parameter α to 0.001. The proportion of noise η added to the topic distribution is set to 0.1. For the text classification model, we apply 300-dimensional pre-trained fastText word embeddings [14], that is, D_w is set to 300. The dropout of the BiLSTM layers is 0.3, and the dimension D_m in the classifier is 64. The weight matrices in the BiLSTM are initialized by orthogonal initialization, and the parameters of the Topic-Aware attention mechanism are initialized by uniform initialization. During model training, the hyper-parameter λ is set to 1e-8 initially, and when training reaches the last 20 epochs, λ is set to 1. We use the Adam optimizer with a learning rate of 1e-4 to train the parameters of the neural topic model and a learning rate of 5e-5 to train the other parameters. The batch size is 16. The CPU is an Intel Xeon (Skylake) Platinum 8163, and the operating system is Ubuntu 20.04 64-bit. All models are implemented with PyTorch and run on an NVIDIA V100 32G graphics card.
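As a companion to the settings above, the following sketch shows one plausible form of the preprocessing pipeline using the packages named in the text (pyenchant, spaCy, and gensim). The paper does not specify whether misspelled tokens are dropped or corrected, so this sketch simply filters them, and the fastText file path is a placeholder.

```python
import enchant
import spacy
from gensim.models import KeyedVectors

# Assumed setup: spell checking with pyenchant, lemmatization with spaCy's
# en_core_web_lg model, and fastText vectors loaded via gensim.
spell = enchant.Dict("en_US")
nlp = spacy.load("en_core_web_lg", disable=["parser", "ner"])

def preprocess(text):
    """Lemmatize, lowercase, and keep only alphabetic tokens that pass the spell check."""
    doc = nlp(text)
    return [tok.lemma_.lower() for tok in doc
            if tok.is_alpha and spell.check(tok.text)]

# fastText vectors in word2vec text format (the path is a placeholder, not from the paper)
vectors = KeyedVectors.load_word2vec_format("cc.en.300.vec", binary=False)
```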
We choose four machine learning models for comparison: Naive Bayes, Support Vector Machine, Decision Tree, and Random Forest. Naive Bayes [28] is a probabilistic model: by learning the joint probability distribution of the inputs and outputs of the training data, the model predicts the label with the largest posterior probability. SVM [8] is a linear binary classification model defined in feature space; it uses a kernel function to find a hyperplane that separates the two categories and maximizes the margin between the data and the plane. Decision Tree [6] adopts a tree structure and uses layered inferences on the data to reach the final classification, so it has good interpretability. Random Forest [5] is an ensemble learning method containing multiple decision trees; the model trains each decision tree independently, and the result is determined by the category output by the most decision trees.

Besides, we also compare our model with the following deep learning-based baselines. BiLSTM uses a BiLSTM network without an attention mechanism; the hidden states of the words in the document are averaged as the classifier's input. Attention-BiLSTM uses a BiLSTM network with a traditional attention mechanism and feeds the classifier the weighted sum of each word's hidden state.

For topic modeling, we compare our model with the following topic models. LDA [3] extracts topics based on the co-occurrence information of words in documents; we use the package gensim to implement this model. NVDM [29] comprises an encoder network and a decoder network, inspired by the variational autoencoder with a Gaussian prior distribution. W-LDA [31] is the prototype of our model, which uses a Wasserstein autoencoder and a Dirichlet prior distribution to mine topic information. BAT [44] applies bidirectional adversarial training with a Dirichlet prior for neural topic modeling. The last three neural topic models adopt a neural network structure similar to our model.

In the experiments, we mainly evaluate the classification performance of the text classification model and the topic quality of the neural topic model. For classification, we compare three widely used performance metrics: accuracy, precision, and F1-score. Accuracy refers to the proportion of correctly classified samples among the total number:

Accuracy = (1/N) Σ_{i=1}^{N} I(ŷ_i = y_i)

where N is the total number of samples, and I(·) is the indicator function: when its argument is true, the function equals 1; otherwise, it equals 0. In binary classification, we generally divide the combinations of predicted labels and ground truths into four types, namely True Positive (TP), False Positive (FP), True Negative (TN), and False Negative (FN). True or False indicates whether the prediction is correct; Positive or Negative indicates whether the predicted result is a positive or negative sample. These four categories respectively correspond to the number of samples that meet each condition, so the sum of the four values equals N. Based on the above, the definitions of precision and recall are:

Precision = TP / (TP + FP),  Recall = TP / (TP + FN)

Precision is the number of correctly predicted positives divided by the number of all predicted positives, and recall is the fraction of true positive samples predicted to be positive, so precision and recall are a pair of contradictory measures. To consider the precision and recall metrics comprehensively, we also evaluate effectiveness with the F1-score.
The definition is:

F1 = 2 · Precision · Recall / (Precision + Recall)

Under the same experimental conditions, the higher the above metrics, the better the classification performance. For topic quality, we utilize two standard topic coherence metrics, C_V and C_A [34]. Here we choose 10 representative words for each topic as word sets and compute C_V to measure the semantic support for each word in each set. In contrast, C_A compares pairs of single words in each topic's set to evaluate the coherence between words. Together, the two metrics quantify the quality of topic modeling comprehensively.

In this section, we present the experimental results and the corresponding analysis of the proposed TA-BiLSTM model in terms of classification performance and topic quality.

Table 3 lists the classification results on the three public datasets compared with the different baselines (the first four rows are machine learning models, the last two are the deep learning models used for the ablation study, and significant results are shown in bold). We observe that the TA-BiLSTM model obtains better results in accuracy, precision, and F1-score. Specifically, the bag-of-words representation limits the traditional machine learning approaches. The precision of Random Forest on the Clickbait-17 dataset is higher because the model only selects confidently positive samples to minimize the number of FP; as a result, the accuracy of Random Forest is not high, and its F1-score is lower than that of other approaches. Moreover, we conduct an ablation study by comparing BiLSTM and Attention-BiLSTM to verify the advantage of the Topic-Aware attention mechanism. Their results are better than those of the machine learning-based approaches, indicating that richer semantic feature representations, especially context information, improve classification performance. Compared with BiLSTM, the results of Attention-BiLSTM show slight improvements, indicating that the attention mechanism assigns more weight to specific words to provide a more suitable document representation. Furthermore, comparing Attention-BiLSTM and TA-BiLSTM, accuracy increases by 0.64%, 1.12%, and 3.11%, and F1-score increases by 0.63%, 0.99%, and 4.95% for the latter on the three datasets, respectively. These significant improvements show that the Topic-Aware attention mechanism incorporates topic information into the classification module, and that this topic information indeed helps TA-BiLSTM provide more suitable representations for misleading information detection.

The calculation of the attention mechanism also incorporates the supervision signal from each document, which is helpful for mining latent semantic patterns in the topic modeling procedure. Thus, we also evaluate the quality of topics in this subsection. Table 4 presents the topic coherence metrics C_A and C_V compared with the other topic modeling baselines on the three datasets. The five selected topics on the Enron Spam dataset are "college", "conference", "politics", "prize-winning", and "loan"; on the 2007 TREC dataset they are "weather", "sports", "computer", "software", and "mathematics"; and on the Clickbait-17 dataset they are "politics", "sports", "medicine", "flight", and "crime". Compared with the topics extracted by W-LDA on the Enron Spam dataset, the C_A of TA-BiLSTM increases by 5.81%, and the C_V metric rises by 11.53%. On the 2007 TREC dataset, C_A is almost the same as that of W-LDA, but C_V increases by 13%. We also present the comparison with BAT.
BAT obtains slightly higher scores than W-LDA and LDA on Clickbait-17, but our model improves C_A and C_V by a further 2.31% and 3.06%. Leaving aside NVDM, which performs poorly, Table 5 lists the top-10 representative words with the highest probability for each topic on the three datasets, so the quality of the topics can be compared intuitively. Generally, compared with the other models, the topics generated by TA-BiLSTM have fewer irrelevant words and higher semantic coherence. The topic words of NVDM are not very consistent because it employs a Gaussian prior to mimic the Dirichlet in topic distribution space. As the proposed TA-BiLSTM uses a Dirichlet prior in topic space, it obtains more coherent topics than NVDM. Meanwhile, the supervision signal also helps TA-BiLSTM surpass LDA, W-LDA, and BAT in the topic modeling evaluation.

To further validate the robustness of TA-BiLSTM, we conduct a hyper-parameter analysis in this subsection. Concretely, we analyze three parameters: the number of topics K, the dimension of the hidden states h, and the proportion of noise η. Firstly, the number of topics K is set to 30, 50, 80, and 100, respectively. The quantitative results on the three datasets are reported in Table 6 and visualized in Fig. 2. For the Enron Spam and 2007 TREC datasets, TA-BiLSTM performs fairly stably on the three metrics. For the Clickbait-17 dataset, the classification performance is more sensitive to changes in K, which may be caused by the complexity of the dataset. It is worth mentioning that the optimal number of topics differs across datasets (50 on Enron Spam, 80 on 2007 TREC, and 50 on Clickbait-17). If this number is too large, the model is not interpretable, and if it is too small, model training is negatively affected [12]. Thus, we set the number of topics K to 50 in our experiments.

Similarly, we conduct a parameter analysis on the dimension of the hidden states h, which is set to 25, 50, 75, 100, and 150, respectively; the corresponding statistics are listed in Table 7. Comparing the results, we observe that simpler models perform better on the Enron Spam and 2007 TREC datasets, while on Clickbait-17 the classification performance improves with increasing model complexity. This may also be caused by the complexity of the Clickbait-17 dataset, which needs a more complicated model to fit the data.

We further investigate the impact of different proportions of noise η on the performance. In detail, we compute the classification and topic modeling metrics separately for five proportion settings [0, 0.1, 0.2, 0.3, 0.4]. The detailed comparison is shown in Table 8. It can be concluded that adding a proper proportion of noise to the topic distribution improves the quality of topic modeling on all datasets. However, the optimal setting for topic mining does not necessarily have the same effect on classification performance: topic coherence is better when the proportion is set to 0.1 or 0.2, while less noise helps the Topic-Aware attention mechanism preserve topic features for prediction. Hence, we set the proportion of noise to 0.1 for better overall results in the experiments.

To validate that the proposed TA-BiLSTM can indeed improve model interpretability, we conduct a case study and visualization in this subsection. Figure 3a shows an advertising email for an online pharmacy in the Enron Spam dataset.
As Topic 8 represents drugs, we can infer that this email may discuss related topics, and indeed various drug names appear in its text content. Likewise, Fig. 3b depicts web page content from Clickbait-17 that entices people to buy cosmetics; relevant words such as 'carpet', 'fashion', 'beauty', and 'makeup' can be found in Topic 15 and Topic 45. These two examples show that corpus-level topic relatedness can indeed improve model interpretability.

In this paper, we proposed the Topic-Aware BiLSTM (TA-BiLSTM) model, an end-to-end framework. TA-BiLSTM contains a neural topic model and a text classification model, and it explores corpus-level topic relatedness to enhance misleading information detection. Meanwhile, the supervision signal can be incorporated into the topic modeling process to further improve topic quality. Experiments on three English misleading information datasets demonstrate the superiority of TA-BiLSTM over baseline approaches. Additionally, we analyze multiple hyper-parameters in detail and select specific topic examples for visualization. Classification and topic modeling on short texts remain challenging tasks; our future work will pay more attention to detecting misleading information in short texts on social media platforms.

Publisher's Note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References
Clickbait Detection Using Deep Learning
Amazing Secrets for Getting More Clicks: Detecting Clickbaits in News Streams Using Article Informality
Latent dirichlet allocation
Enriching word vectors with subword information
Random Forests
Classification and Regression Trees
Stop clickbait: Detecting and preventing clickbaits in online news media
LIBSVM: A library for support vector machines
TREC 2007 Spam Track Overview. In: The Sixteenth Text REtrieval Conference
Detecting Misleading Information on COVID-19
Stochastic gradient boosting
Collaborative Learning-based Industrial IoT API Recommendation for Software-defined Devices: The Implicit Knowledge Discovery Perspective
The Cloud-edge-based Dynamic Reconfiguration to Service Workflow for Mobile Ecommerce Environments: A QoS Prediction Perspective
Learning word vectors for 157 languages
A kernel Two-Sample test
Multi-Task Learning with mutual learning for joint sentiment classification and topic detection
An attention-based neural framework for uncertainty identification on social media texts
Long short-term memory
Neural topic modeling with cycle-consistent adversarial training
SSUR: an approach to optimizing virtual machine allocation strategy based on user requirements for cloud data center
Spam detection in social media using convolutional and long short term memory neural network
Identifying Clickbait: A Multi-Strategy Approach Using Neural Networks
Information Diffusion Kernels
An energy-efficient data collection scheme using denoising autoencoder in wireless sensor networks
Detecting "Smart" Spammers on Social Network: A Topic Model Approach
Effective Approaches to Attention-based Neural Machine Translation
Spam filtering with naive Bayes - Which naive Bayes? In: CEAS
Neural variational inference for text processing
Distributed Representations of Words and Phrases and their Compositionality
Topic modeling with Wasserstein autoencoders
GloVe: Global Vectors for Word Representation
Crowdsourcing a large corpus of clickbait on Twitter
Exploring the Space of Topic Coherence Measures
Who are the spoilers in social media marketing? Incremental learning of latent semantics for social spam detection
Autoencoding variational inference for topic models
Wasserstein Auto-Encoders
Attention is All you Need
Rethinking LDA: Why Priors Matter
An End-to-end Topic-Enhanced Self-Attention Network for Social Emotion Classification
Open Event Extraction from Online Text using a Generative Adversarial Network
ATM: Adversarial-Neural Topic Model. Inf Process Manag
Neural topic modeling with bidirectional adversarial training
Attention-based LSTM for Aspect-level Sentiment Classification
An approach to alleviate the sparsity problem of hybrid collaborative filtering based recommendations: The product-attribute perspective from user reviews
Hierarchical attention networks for document classification
QoS Prediction for Service Recommendation With Features Learning in Mobile Edge Computing Environment
Neural topic modeling by incorporating document relationship graph
A novel approach to workload prediction using attention-based LSTM encoder-decoder network in cloud environment