title: Multimodal Fake News Detection
authors: Alonso-Bartolome, Santiago; Segura-Bedmar, Isabel
date: 2021-12-09

Over the last years, there has been an unprecedented proliferation of fake news. As a consequence, we are more susceptible to the pernicious impact that misinformation and disinformation spreading can have in different segments of our society. Thus, the development of tools for the automatic detection of fake news plays an important role in the prevention of its negative effects. Most attempts to detect and classify false content focus only on using textual information. Multimodal approaches are less frequent and they typically classify news either as true or fake. In this work, we perform a fine-grained classification of fake news on the Fakeddit dataset, using both unimodal and multimodal approaches. Our experiments show that the multimodal approach based on a Convolutional Neural Network (CNN) architecture combining text and image data achieves the best results, with an accuracy of 87%. Some fake news categories, such as Manipulated content, Satire or False connection, strongly benefit from the use of images. Using images also improves the results of the other categories, but with less impact. Regarding the unimodal approaches using only text, Bidirectional Encoder Representations from Transformers (BERT) is the best model, with an accuracy of 78%. Therefore, exploiting both text and image data significantly improves the performance of fake news detection.

Digital media has provided many benefits to our modern society, such as facilitating social interactions, boosting productivity and improving the sharing of information, among many others. However, it has also led to the proliferation of fake news (Finneman & Thomas, 2018), that is, news articles containing false information that has been deliberately created (Hunt & Gentzkow, 2017). The effects of this kind of misinformation and disinformation spreading can be seen in different segments of our society. The Pizzagate incident (Hauck, 2017) as well as the mob lynchings that occurred in India (Mishra, 2019) are some of the most tragic examples of the consequences of fake news dissemination. Changes in health behaviour intentions (Greene & Murphy, 2021), an increase in vaccine hesitancy (Islam et al., 2021), and significant economic losses (Brown, 2019) are also some of the negative effects that the spread of fake news may have.

Every day, a huge quantity of digital information is produced, making it impossible to detect fake news through manual fact-checking. It therefore becomes essential to have techniques that help us automate the identification of fake news so that more immediate actions can be taken. During the last years, several studies have been carried out to perform automatic detection of fake news (Thota et al., 2018; Choudhary et al., 2021; Singh et al., 2021; Giachanou et al., 2020; Singhal et al., 2019; Wang et al., 2018). Most works have focused only on using textual information (unimodal approaches). Much less effort has been devoted to exploring multimodal approaches (Singh et al., 2021; Giachanou et al., 2020; Kumari & Ekbal, 2021), which exploit both texts and images to detect fake news, obtaining better results than the unimodal approaches.
However, these studies typically address the problem of fake news detection as a binary classification task (that is, classifying news as either true or fake). Therefore, the main goal of this paper is to study both unimodal and multimodal approaches to deal with a finer-grained classification of fake news. To do this, we use the Fakeddit dataset (Nakamura et al., 2020), made up of posts from Reddit. The posts are classified into the following six classes: true, misleading content, manipulated content, false connection, imposter content and satire. We explore several deep learning architectures for text classification, such as the Convolutional Neural Network (CNN) (Goodfellow et al., 2016), the Bidirectional Long Short-Term Memory (BiLSTM) network (Hochreiter & Schmidhuber, 1997) and Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2018). As a multimodal approach, we propose a CNN architecture that combines both texts and images to classify the fake news.

Since the revival of neural networks in the second decade of the current century, many different applications of deep learning techniques have emerged. Part of the advances in Natural Language Processing (NLP) and Computer Vision are due to the incorporation of deep neural network approaches (O'Mahony et al., 2019; Deng & Liu, 2018). Fields such as object recognition (Zhao et al., 2019), image captioning (Hossain et al., 2019), sentiment analysis (Tang et al., 2015) or question answering (Sharma & Gupta, 2018), among others, have benefited from the development of deep learning in recent years. Text classification is also one of the tasks for which deep neural networks are being extensively used (Minaee et al., 2021a). Most of these works have been based on unimodal approaches that only exploit texts. More ambitious architectures that combine several modalities of data (such as text and image) have also been tried (Abavisani et al., 2020; Bae et al., 2020; Yu & Jiang, 2019; Viana et al., 2017; Gaspar & Alexandre, 2019). The main intuition behind these multimodal approaches is that many texts are often accompanied by images, and these images may provide useful information to improve the results of the classification task (Baheti, 2020). We now focus on the recent research on fake news detection, distinguishing between unimodal and multimodal approaches.

We first review the most recent studies for the detection of fake news using only the textual content of the news. Wani et al. (2021) use the Constraint@AAAI Covid-19 fake news dataset (Patwa et al., 2020), which contains tweets classified as true or fake. Several methods are evaluated: CNN, LSTM, Bi-LSTM + Attention, Hierarchical Attention Network (HAN) (Yang et al., 2016), BERT, and DistilBERT (Sanh et al., 2019), a smaller version of BERT. The best accuracy obtained is 98.41%, by the DistilBERT model when it is pre-trained on a corpus of Covid-19 tweets.

Goldani et al. (2021) use a capsule network model (Sabour et al., 2017) based on CNN and pre-trained word embeddings for fake news classification over the ISOT (Ahmed et al., 2017) and LIAR (Wang, 2017) datasets. The ISOT dataset is made up of fake and true news articles collected from Reuters and Kaggle, while the LIAR dataset contains short statements classified into the following six classes: pants-fire, false, barely-true, half-true, mostly-true and true. Thus, the authors perform both binary and multi-class fake news classification.
The best accuracy obtained with the proposed model is 99.8% for the ISOT dataset (binary classification) and 39.5% for the LIAR dataset (multi-class classification).

Girgis et al. (2018) perform fake news classification using the above-mentioned LIAR dataset. More concretely, they use three different models: a vanilla Recurrent Neural Network (RNN) (Aggarwal, 2018), a Gated Recurrent Unit (GRU) (Chung et al., 2014) and an LSTM. The GRU model obtains an accuracy of 21.7%, slightly outperforming the LSTM (21.66%) and the vanilla RNN (21.5%) models.

From this review of approaches using only texts, we can conclude that deep learning architectures provide very high accuracy for the binary classification of fake news; however, the performance is much lower when these methods address a fine-grained classification of fake news. Curiously enough, although BERT is reaching state-of-the-art results in many text classification tasks, it has hardly ever been used for the multi-class classification of fake news.

2.2. Multimodal approaches for fake news detection

Singh et al. (2021) study the improvement in performance on the binary classification of fake news when textual and visual features are combined, as opposed to using only text or image. They explore several traditional machine learning methods: logistic regression (LR) (Kleinbaum & Klein, 2010), classification and regression tree (CART) (Hastie et al., 2009a), linear discriminant analysis (LDA) (Murphy, 2012), quadratic discriminant analysis (QDA) (Murphy, 2012), k-nearest neighbors (KNN) (Murphy, 2012), naïve Bayes (NB) (Barber, 2012), support vector machine (SVM) (Hastie et al., 2009b) and random forest (RF) (Breiman, 2001). The authors use a Kaggle dataset of fake news (Kaggle, b). Random forest is the best model, with an accuracy of 95.18%.

Giachanou et al. (2020) propose a model to perform multimodal classification of news articles as either true or fake. In order to obtain textual representations, the BERT model (Devlin et al., 2018) is applied. For the visual features, the authors use the VGG network (Simonyan & Zisserman, 2014) with 16 layers, followed by an LSTM layer and a mean pooling layer. The dataset used by the authors is retrieved from the FakeNewsNet collection (Shu et al., 2020); more concretely, the authors use 2,745 fake news and 2,714 real news items from this collection.

Finally, another recent architecture proposed for multimodal fake news classification can be found in the work carried out by Kumari & Ekbal (2021). The authors propose a model that is made up of four modules, the first of which is an ABS-BiLSTM module for the textual content.

As we can see from this review, most multimodal approaches that were evaluated on the Fakeddit dataset only address the binary classification of fake news. Only one of them (Kang et al., 2021) has dealt with the multi-class classification of fake news, using a reduced version of this dataset. To the best of our knowledge, this is the first work that addresses a fine-grained classification of fake news using the whole Fakeddit dataset. Furthermore, contrary to the work proposed in (Kang et al., 2021), which exploits a deep convolutional network, we propose a multimodal approach that simply uses a CNN, obtaining very similar performance.

Three unimodal models using only the texts are proposed: CNN, BiLSTM and BERT. We start by pre-processing the documents in the corpus, removing stopwords, punctuation, numbers and multiple spaces. Then, we split each text into tokens and apply lemmatization. After lemmatization, we transform the texts into sequences of integers. This is done firstly by learning the vocabulary of the corpus and building a dictionary where each word is mapped to a different integer number. Then, this dictionary is used to transform each text into a sequence of integers. Every non-zero entry in such a sequence corresponds to a word in the original text, and the original order of the words in the text is respected. As we need to feed the deep learning models with vectors of the same length, we pad and truncate the sequences of integers so that they all have the same number of entries. This has the disadvantage that vectors that are too long will be truncated, and some information will be lost. In order to select the length of the padded/truncated vectors, we computed the percentage of texts that are shorter than 10, 15, 20 and 25 tokens. Figure 1 shows the results for the training, validation and test partitions in each case. We can see that 98% of the texts are shorter than 15 tokens. Since the number of texts that will have to be truncated is very small (less than 2%), very little information is lost. Therefore, we selected 15 as the length of the vectors after padding and truncating.
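To make this pipeline concrete, the following is a minimal sketch of the integer encoding and padding described above; the cleaning steps and the maximum length of 15 follow the text, while the use of NLTK and all function and variable names are our own illustrative assumptions rather than the authors' code.

```python
import re
from collections import Counter

from nltk.corpus import stopwords          # assumes the NLTK stopword list is available
from nltk.stem import WordNetLemmatizer    # assumes the WordNet data is available

MAX_LEN = 15                                # 98% of the titles are shorter than 15 tokens
STOPWORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def clean_and_tokenize(text):
    """Lowercase, drop punctuation/numbers, remove stopwords and lemmatize."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return [LEMMATIZER.lemmatize(t) for t in text.split() if t not in STOPWORDS]

def build_vocab(texts):
    """Map every word in the corpus to a distinct integer; 0 is reserved for padding."""
    counts = Counter(tok for t in texts for tok in clean_and_tokenize(t))
    return {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}

def encode(text, vocab, max_len=MAX_LEN):
    """Turn a text into a fixed-length sequence of integer IDs (truncate or pad to max_len)."""
    ids = [vocab[tok] for tok in clean_and_tokenize(text) if tok in vocab][:max_len]
    return ids + [0] * (max_len - len(ids))
```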
The deep learning architectures use the sequence of word embeddings corresponding to a given text as input. For this reason, an embedding layer transforms each integer value from the input sequence into a word embedding vector. In this way, every vectorized document is transformed into a matrix of 15 rows and 300 columns (300 being the dimension of the word embeddings). We use both random initialization and pre-trained GloVe word embeddings (Pennington et al., 2014). We also compare a dynamic approach (letting the model further train the word embeddings) and a static approach (not letting the model train the word embeddings).

We now explain the CNN architecture for the text classification of fake news. As was mentioned above, the first layer is an embedding layer. We initialize the embedding matrix using both random initialization and the pre-trained GloVe word embeddings of dimension 300. We chose this size for the word embeddings over the other options (50, 100 or 200) because word embeddings of a larger dimension have been shown to give better results (Patel & Bhattacharyya, 2017). After the embedding layer, we apply four different filters in a convolutional layer. Each of these filters slides across the (15 x 300) matrix with the embeddings of the input sequence and generates 50 output channels. The four filters have sizes (2 x 300), (3 x 300), (4 x 300) and (5 x 300), respectively, since these are the typical filter sizes of a CNN for text classification (Voita, 2021). As a consequence, the outputs of these filters have shapes (14 x 1), (13 x 1), (12 x 1) and (11 x 1), respectively. The next step is to pass the outputs obtained from the previous layer through the ReLU activation function. This function is applied element-wise and therefore does not alter the size of the outputs obtained in the previous step; its effect is to set all negative values to 0 and leave the positive values unchanged. After going through the ReLU activation, a max-pooling layer is applied that selects the largest element out of each of the 200 feature maps (50 feature maps per each of the 4 filters). Thus, 200 single numbers are generated. These 200 numbers are concatenated and the result is passed through two dense layers with one ReLU activation in between (Minaee et al., 2021b). The resulting output is a vector of six entries (each entry corresponding to a different class of the Fakeddit dataset) that, after passing through the log-softmax function, can be used to obtain the predicted class for the corresponding input text. Early stopping (Brownlee) with the train and validation partitions is used in order to select the appropriate number of epochs. We use the Adam optimization algorithm (Kingma & Ba, 2014) for training the model and the negative log likelihood as the loss function. Figure 2 shows the CNN architecture for text classification.
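A minimal PyTorch sketch of this text CNN is shown below; the layer sizes (15 x 300 inputs, filter heights 2 to 5, 50 channels per filter, two final dense layers and a log-softmax over six classes) follow the description above, while the hidden size of the first dense layer and all identifiers are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    """Sketch of the text CNN described above (illustrative, not the original code)."""

    def __init__(self, vocab_size, emb_dim=300, n_filters=50,
                 filter_heights=(2, 3, 4, 5), hidden_dim=128, n_classes=6):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        # One 2-D convolution per filter height; each slides over the (15 x 300) matrix.
        self.convs = nn.ModuleList(
            nn.Conv2d(1, n_filters, kernel_size=(h, emb_dim)) for h in filter_heights
        )
        # 4 filters x 50 channels = 200 max-pooled values feed the two dense layers.
        self.fc1 = nn.Linear(n_filters * len(filter_heights), hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, n_classes)

    def forward(self, x):                          # x: (batch, 15) integer IDs
        emb = self.embedding(x).unsqueeze(1)       # (batch, 1, 15, 300)
        pooled = []
        for conv in self.convs:
            fmap = F.relu(conv(emb)).squeeze(3)                             # (batch, 50, 15-h+1)
            pooled.append(F.max_pool1d(fmap, fmap.size(2)).squeeze(2))      # (batch, 50)
        features = torch.cat(pooled, dim=1)                                 # (batch, 200)
        return F.log_softmax(self.fc2(F.relu(self.fc1(features))), dim=1)

# Training would use nn.NLLLoss (negative log likelihood) with torch.optim.Adam
# and early stopping on the validation partition, as described in the text.
```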
Our second unimodal model is actually a hybrid model that uses a bidirectional LSTM followed by a CNN layer. Firstly, the texts are processed as described above, and these inputs are passed through the same embedding layer that was used for the CNN model. Therefore, each input vector of length 15 is transformed into a matrix of shape 15 x 300. This matrix is fed to the bidirectional LSTM layer, and its outputs are processed by a convolutional layer followed by max-pooling. Similarly to what was done for the CNN model, the output of the max-pooling layer is concatenated and passed through two dense layers with a ReLU activation in between. The resulting vector goes through the log-softmax function and the predicted class is obtained. Figure 3 shows the architecture of the BiLSTM for text classification. Early stopping is again used for selecting the optimal number of epochs. We use Adam as the optimization algorithm and the negative log likelihood as the loss function.

Our third unimodal model is based on BERT. In this case, instead of using random initialization or the pre-trained GloVe embeddings, we use the vectors provided by BERT to represent the input tokens. As opposed to the GloVe model (Pennington et al., 2014), BERT takes into account the context of each word (that is, the words that surround it). For the pre-processing of the texts, the steps are similar to those described above. The main difference is that we tokenize the texts using the BertTokenizer class from the transformers library (Hugging Face). This class has its own vocabulary with the mappings between words and IDs, so it is not necessary to train a tokenizer with the corpus of texts. We also add the [CLS] and [SEP] tokens at the beginning and at the end of each tokenized sequence. It is also necessary to create an attention mask in order to distinguish which entries in each sequence correspond to real words in the input text and which entries are just 0's resulting from padding the sequences. Thus, the attention mask is composed of 1's (indicating non-padding entries) and 0's (indicating padding entries). We use the BERT base model in its uncased version (12 layers, hidden size of 768, 12 attention heads and 110 million parameters). Then, we fine-tune it on our particular problem, that is, the multi-class classification of fake news. To do this, we add a linear layer on top of the output of BERT that receives a vector of length 768 and outputs a vector of length 6. For the training process, we use the Adam algorithm for optimization with a learning rate of 2·10⁻⁵. We train the model for two epochs, since the authors of BERT recommend using between two and four epochs when fine-tuning on a specific NLP task (Devlin et al., 2018).
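A hedged sketch of this fine-tuning setup with the Hugging Face transformers library is shown below; the tokenizer, the uncased BERT base model, the 768-to-6 linear head and the learning rate of 2·10⁻⁵ follow the description above, while the sequence length, batch contents and all identifiers are illustrative assumptions.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class BertFakeNewsClassifier(nn.Module):
    """BERT base (uncased) with a linear layer mapping the 768-dim [CLS] vector to 6 classes."""

    def __init__(self, n_classes=6):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.classifier = nn.Linear(self.bert.config.hidden_size, n_classes)  # 768 -> 6

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls_vector = outputs.last_hidden_state[:, 0]   # representation of the [CLS] token
        return self.classifier(cls_vector)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# The tokenizer adds [CLS]/[SEP] and builds the attention mask (1 = real token, 0 = padding).
batch = tokenizer(["example reddit title"], padding="max_length", truncation=True,
                  max_length=32, return_tensors="pt")

model = BertFakeNewsClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)   # fine-tuned for two epochs
logits = model(batch["input_ids"], batch["attention_mask"])
loss = nn.CrossEntropyLoss()(logits, torch.tensor([0]))     # illustrative label
```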
Our multimodal approach uses a CNN that takes as inputs both the text and the image corresponding to the same news item. The model outputs a vector of six numbers out of which the predicted class is obtained. In the following lines, we describe the preprocessing steps applied before feeding the data into the network, as well as the architecture of the network. For the input texts, we apply the same preprocessing steps described in 3.1.1. Regarding the preprocessing of the images, we only resize them so that they all have the same shape (560 x 560).

Once the pre-processed data is fed into the network, different operations are applied to the text and the image. The CNN architecture that we use for the texts is the same as the one described in 3.1.2, except for the fact that we eliminate the last two dense layers with the ReLU activation in between.

We now describe the CNN model used to classify the images. The data first goes through a convolutional layer. Since each image is made up of three channels, the number of input channels of this layer is also 3; it has 6 output channels. Filters of size (5 x 5) are used, with stride equal to 1 and no padding. The output for each input image is therefore a collection of 6 matrices of shape (556 x 556). The output of the convolutional layer passes through a non-linear activation function (ReLU), and then max-pooling is applied with a filter of size (2 x 2) and a stride equal to 2. The resulting output is a set of 6 matrices of shape (278 x 278). The output from the max-pooling layer passes through another convolutional layer that has 6 input channels and 3 output channels. The filter size, stride length and padding are the same as those used in the previous convolutional layer. Then the ReLU non-linear activation function and the max-pooling layer are applied again over the feature maps resulting from this convolutional layer. Thus, for a given input image we obtain a set of 3 feature maps of shape (137 x 137). Finally, these feature maps are flattened into a vector of length 56,307.

The outputs from the operations applied to each text and image are concatenated into a single vector. Then, this vector is passed through two dense layers with a ReLU non-linear activation in between. Finally, the log-softmax function is applied and the logarithm of the probabilities is used in order to compute the predicted class of the given input.
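The following PyTorch sketch illustrates the image branch and the early fusion step described above; the convolution and pooling sizes (3 to 6 to 3 channels, 5 x 5 filters, 2 x 2 max-pooling, a 56,307-dimensional flattened image vector concatenated with the 200-dimensional text features) follow the text, whereas the hidden size of the fusion layers and all identifiers are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultimodalCNN(nn.Module):
    """Sketch of the multimodal (text + image) CNN described above (illustrative)."""

    def __init__(self, text_branch, text_feat_dim=200, hidden_dim=256, n_classes=6):
        super().__init__()
        self.text_branch = text_branch                 # text CNN without its final dense layers
        self.conv1 = nn.Conv2d(3, 6, kernel_size=5)    # (3, 560, 560) -> (6, 556, 556)
        self.conv2 = nn.Conv2d(6, 3, kernel_size=5)    # (6, 278, 278) -> (3, 274, 274)
        image_feat_dim = 3 * 137 * 137                 # 56,307 after the second pooling
        self.fc1 = nn.Linear(text_feat_dim + image_feat_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, n_classes)

    def forward(self, text_ids, image):
        text_feats = self.text_branch(text_ids)                 # (batch, 200)
        x = F.max_pool2d(F.relu(self.conv1(image)), 2)          # (batch, 6, 278, 278)
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)              # (batch, 3, 137, 137)
        image_feats = x.flatten(start_dim=1)                    # (batch, 56307)
        fused = torch.cat([text_feats, image_feats], dim=1)     # early fusion by concatenation
        return F.log_softmax(self.fc2(F.relu(self.fc1(fused))), dim=1)
```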
In our experiments, we train and test our models using the Fakeddit dataset (Nakamura et al., 2020), which consists of a collection of posts from Reddit users. It includes texts, images, comments and metadata. The texts are the titles of the posts submitted by users, while the comments are made by other users as answers to a specific post. The dataset contains over 1 million instances. One of the main advantages of this dataset is that it can be used to implement systems capable of performing a finer-grained classification of fake news than the usual binary classification, which only distinguishes between true and fake news. In the Fakeddit dataset, each instance has a label which distinguishes five categories of fake news, besides the unique category of true news. We briefly describe each category:

• True: this category indicates that the news is true.
• Manipulated content: the content has been manipulated by different means (such as photo editing, for example).
• False connection: this category corresponds to those samples in which the text and the image are not in accordance.
• Satire/Parody: this category refers to news in which the meaning of the content is twisted or misinterpreted in a satirical or humorous way.
• Misleading content: this category corresponds to news in which the information has been deliberately manipulated or altered in order to mislead the public.
• Imposter content: in the context of this project, all the news items that belong to this category include content generated by bots.

The Fakeddit dataset is divided into training, validation and test partitions. Moreover, there are two different versions of the dataset: the unimodal dataset, whose instances only contain text, and the multimodal dataset, whose instances have both text and image. Actually, all texts of the multimodal dataset are also included in the unimodal dataset. Figure 4 shows the distribution of the classes in the unimodal dataset, and Figure 5 provides the same information for the multimodal dataset. As we can see, all classes follow a similar distribution in both versions of the dataset (unimodal and multimodal), as well as in the training, validation and test splits. Moreover, both datasets are clearly imbalanced: the classes True, Manipulated content and False connection have many more instances than Satire, Misleading content and Imposter content, which are much more underrepresented. This imbalance may make the classification task more difficult for the classes with fewer instances.

In this section, we present the results obtained for each model. We report the recall, precision and F1 scores obtained by all the models for each class. The accuracy is computed over all the classes; it helps us to compare models and find the best approach. Moreover, we are also interested in knowing which model is better at detecting only those news items containing false content. For this reason, we also compute the micro and macro averages of the recall, precision and F1 metrics over the five classes of fake news only, excluding the true news. We use the micro F1 score and the accuracy to compare the performance of the models.
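As an illustration of how these per-class and fake-only averaged metrics can be computed, the snippet below uses scikit-learn to obtain the accuracy over all six classes and the micro and macro precision, recall and F1 restricted to the five fake news classes; the label encoding (0 for True, 1 to 5 for the fake categories) and the toy predictions are our own assumptions.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Assumed label encoding: 0 = True, 1 = Manipulated content, 2 = False connection,
# 3 = Satire/Parody, 4 = Misleading content, 5 = Imposter content.
FAKE_LABELS = [1, 2, 3, 4, 5]

y_true = [0, 1, 2, 3, 4, 5, 1, 2]   # toy gold labels
y_pred = [0, 1, 2, 3, 5, 5, 1, 0]   # toy model predictions

accuracy = accuracy_score(y_true, y_pred)   # computed over all six classes

# Micro and macro averages restricted to the five fake news classes.
micro_p, micro_r, micro_f1, _ = precision_recall_fscore_support(
    y_true, y_pred, labels=FAKE_LABELS, average="micro", zero_division=0)
macro_p, macro_r, macro_f1, _ = precision_recall_fscore_support(
    y_true, y_pred, labels=FAKE_LABELS, average="macro", zero_division=0)

print(f"accuracy={accuracy:.2f}  micro F1={micro_f1:.2f}  macro F1={macro_f1:.2f}")
```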
Our first experiment with the CNN uses random initialization for the weights of the embedding layer, which are updated during the training process. This model obtains an accuracy of 72%, a micro F1 of 57% and a macro F1 of 49% (see Table 1). We can also see that True and Manipulated content are the classes with the highest F1 (79%). A possible reason for this could be that they are the majority classes. On the other hand, the model obtains the lowest F1 (13%) for Imposter content, which is the minority class in the dataset (see Fig. 4). Therefore, the results for the different classes appear to be related to the number of instances per class. However, the model achieves an F1 of 61% for the second minority class, Misleading content. As was explained before, the content of these news items has been deliberately manipulated, and identifying these manipulations appears to be easier than detecting humour or sarcasm in news (Satire) or fake news generated by bots (Imposter content). Interestingly, although the model only exploits the textual content of the news, it achieves an F1 of 57% for classifying the instances of False connection, in which the text and the image are not in accordance.

We also explore the CNN with static (see Table 2) and dynamic (see Table 3) GloVe embeddings (Pennington et al., 2014). In both models, the embedding layer is initialized with the pre-trained GloVe vectors; the difference is whether the model is allowed to update them during training. We also compare the effect of the pre-trained GloVe vectors with random initialization (see Table 1). In both the dynamic and static approaches, initializing the model with the pre-trained GloVe word embeddings gives better results than random initialization. The reason for this is that the GloVe vectors contain information about the relationships between different words that random vectors cannot capture.

As the dataset is highly imbalanced, we use the micro F1 to assess and compare the overall performance of the three models. Thus, the best model is the CNN with dynamic GloVe vectors. However, the dynamic training takes much more time than the static training (around 6,000 to 8,000 seconds more). This is due to the fact that, in the dynamic approach, the word embeddings are also learned, which significantly increases the training time.

BiLSTM results. As a second deep learning model, we explore a BiLSTM model. We replicate the same experiments as described for the CNN, that is, using random initialization and pre-trained GloVe vectors. The BiLSTM initialized with random vectors obtains very similar results to those achieved by the CNN with random initialization (see Table 1). In fact, both models provide the same accuracy of 0.72. However, in terms of micro F1, the BiLSTM model obtains up to 9 points more than the CNN model with random initialization. This improvement may be because the BiLSTM improved its scores for Imposter content. The use of static GloVe vectors appears to have a positive effect on the performance of the BiLSTM model (see Table 5). The model shows significant improvements for False connection, Satire, Misleading content and Imposter content, with increases of 6, 12, 3 and 10 points, respectively. Therefore, the pre-trained GloVe vectors give better results than random initialization.

Table 8 shows that the multimodal approach obtains an accuracy of 87% and a micro F1 of 72%, which are higher than the scores of all the unimodal models. As expected, the training set size for each class strongly affects the model scores. While True and Manipulated content, the majority classes, get the highest scores, Imposter content, the minority class, shows the lowest F1 (32%), even six points lower than that provided by BERT for the same class (F1 = 38%). Thus, we can say that the image content provides little information for identifying instances of Imposter content. Manipulated content shows an F1 of 100%. This is probably due to the fact that the images in this category have been manipulated, and these manipulations may be easily detected by the CNN. As expected, the use of images significantly improves the results for False connection. The multimodal model shows an F1 of 76%, 8 points higher than that obtained by BERT, the best unimodal approach, and 15 points higher than the unimodal CNN model using only texts. The improvement is even greater for detecting instances of Satire, with an increase of 16 points over the scores obtained by BERT and by the unimodal CNN model.

In addition to the deep learning algorithms, we also propose as a baseline a Support Vector Machine (SVM), one of the most successful algorithms for text classification. Table 9 shows a comparison of the best models (traditional algorithms, CNN, BiLSTM, BERT and multimodal CNN) according to their accuracy and micro average scores.

Table 9: Comparison of the best models (micro averages).

In conclusion, we can see that the multimodal CNN outperforms all the unimodal approaches. This proves the usefulness of combining texts and images for a fine-grained fake news classification. Focusing on the unimodal approaches, the BERT model is the best both in terms of accuracy and micro F1 score, which shows the advantage of using contextual word embeddings. The third best approach is the BiLSTM with dynamic GloVe vectors. Finally, all the deep learning approaches outperform our SVM baseline.
Fake news can have a significant negative effect on politics, health and the economy. Therefore, it becomes necessary to develop tools that allow for a rapid and reliable detection of misinformation. Apart from the work carried out by the creators of the Fakeddit dataset (Nakamura et al., 2020), this is, to the best of our knowledge, the only study that addresses a fine-grained classification of fake news by performing a comprehensive comparison of unimodal and multimodal approaches based on the most advanced deep learning techniques. The multimodal approach outperforms the approaches that only exploit texts. BERT is the best model for the task of text classification. Moreover, using dynamic GloVe word embeddings outperforms random initialization for the CNN and BiLSTM architectures.

As future work, we plan to use pre-trained networks to generate the visual representations. In particular, we will use the VGG network, which was pre-trained on a large dataset of images such as ImageNet. We also plan to explore different deep learning techniques such as LSTM, BiLSTM, GRU or BERT, as well as different methods to combine the visual and textual representations. In our current study, we have built our multimodal CNN using an early fusion approach, which consists of creating textual and visual representations, combining them, and then applying a classifier over the resulting combined representation to obtain the probabilities for each class. Instead of this, we plan to study a late fusion approach, which would require two separate classifiers (one for the textual inputs and the other for the image inputs); the predictions from both classifiers are then combined to obtain the final prediction.
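To make the contrast between the two fusion strategies concrete, the sketch below shows a minimal late fusion scheme, assuming two already-trained classifiers that each output log-probabilities over the six classes; averaging the two distributions is only one possible combination rule, and all names are illustrative rather than part of the proposed system.

```python
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    """Minimal late fusion sketch: combine the predictions of two separate classifiers."""

    def __init__(self, text_classifier, image_classifier):
        super().__init__()
        self.text_classifier = text_classifier     # maps text inputs to 6 log-probabilities
        self.image_classifier = image_classifier   # maps images to 6 log-probabilities

    def forward(self, text_ids, image):
        text_log_probs = self.text_classifier(text_ids)
        image_log_probs = self.image_classifier(image)
        # Average the two class distributions; other rules (e.g. a learned weighting or a
        # meta-classifier over the concatenated predictions) are also possible.
        probs = 0.5 * (text_log_probs.exp() + image_log_probs.exp())
        return probs.argmax(dim=1)                 # final predicted class per instance
```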
References

Multimodal Categorization of Crisis Events in Social Media
Recurrent neural networks
Detection of online fake news using n-gram analysis and machine learning techniques. In International Conference on Intelligent, Secure, and Dependable Systems in Distributed and Cloud Environments
Flower classification with modified multimodal convolutional neural networks
Introduction to Multimodal Deep Learning
Naive Bayes
Verifying Multimedia Use at MediaEval
Random Forests. Machine Learning
Online fake news is costing us $78 billion globally each year
A Gentle Introduction to Early Stopping to Avoid Overtraining Neural Networks
BerConvoNet: A deep learning framework for fake news classification
Empirical evaluation of gated recurrent neural networks on sequence modeling
Deep learning in natural language processing
BERT: Pre-training of deep bidirectional transformers for language understanding
A family of falsehoods: Deception, media hoaxes and fake news
A multimodal approach to image sentiment analysis
Multimodal multi-image fake news detection
Deep Learning Algorithms for Detecting Fake News in Online Text
Detecting fake news with capsule neural networks
Convolutional networks
Quantifying the effects of fake news on behavior: Evidence from a study of COVID-19 misinformation
Additive models, trees, and related methods
Support vector machines and flexible discriminants
'Pizzagate' shooter sentenced to 4 years in prison
Long Short-Term Memory
A comprehensive survey of deep learning for image captioning
Social media and fake news in the 2016 election
COVID-19 vaccine rumors and conspiracy theories: The need for cognitive inoculation against misinformation to improve vaccine adherence
Multimodal fusion with recurrent neural networks for rumor detection on microblogs
Getting Real about Fake News
DeepNet: An Efficient Neural Network for Fake News Detection using News-User Engagements
Fake news detection with heterogenous deep graph convolutional network
Adam: A method for stochastic optimization
Multimodal Detection of Information Disorder from Social Media
Logistic Regression
AMFB: Attention based multimodal Factorized Bilinear Pooling for multimodal Fake News Detection
Entity-Oriented Multi-Modal Alignment and Fusion Network for Fake News Detection
Deep Learning-based Text Classification: A Comprehensive Review. ACM Computing Surveys (CSUR)
India's fake news problem is killing real people
Kernels
Fakeddit: A new multimodal benchmark dataset for fine-grained fake news detection
Deep learning vs. traditional computer vision
Towards lower bounds on number of dimensions for word embeddings
Fighting an infodemic: Covid-19 fake news dataset
GloVe: Global vectors for word representation
Dynamic routing between capsules
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
Deep learning approaches for question answering system
FakeNewsNet: A Data Repository with News Content, Social Context, and Spatiotemporal Information for Studying Fake News on Social Media
Very deep convolutional networks for large-scale image recognition
Detecting fake news stories via multimodal analysis
SpotFake: A multi-modal framework for fake news detection
Deep learning for sentiment analysis: successful approaches and future challenges
Fake news detection: a deep learning approach
Multimodal Classification of Document Embedded Images
Convolutional Neural Networks for Text
"Liar, Liar Pants on Fire": A new benchmark dataset for fake news detection
EANN: Event adversarial neural networks for multi-modal fake news detection
Evaluating deep learning approaches for Covid-19 fake news detection
SERN: Stance Extraction and Reasoning Network for Fake News Detection. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Hierarchical attention networks for document classification
Adapting BERT for Target-Oriented Multimodal Sentiment Classification
Object detection with deep learning: A review
Exploiting context for rumour detection in social media