title: On the Role of Images for Analyzing Claims in Social Media
authors: Cheema, Gullal S.; Hakimov, Sherzod; Müller-Budack, Eric; Ewerth, Ralph
date: 2021-03-17

Abstract

Fake news is a severe problem in social media. In this paper, we present an empirical study on visual, textual, and multimodal models for the tasks of claim, claim check-worthiness, and conspiracy detection, all of which are related to fake news detection. Recent work suggests that images are more influential than text and often appear alongside fake text. To this end, several multimodal models have been proposed in recent years that use images along with text to detect fake news on social media sites like Twitter. However, the role of images is not well understood for claim detection, specifically using transformer-based textual and multimodal models. We investigate state-of-the-art models for images, text (transformer-based), and multimodal information on four different datasets across two languages to understand the role of images in the tasks of claim and conspiracy detection.

1 Introduction

Fact-checking initiatives manually fact-check news and publish their outcomes for public use. Although more such initiatives are coming up worldwide, they cannot keep up with the rate at which news and information are produced on online platforms. Therefore, fake news detection has gathered much interest in computer science, with the aim of developing automated methods that speed up and scale to the continuous stream of social media data. As social media is inherently multimodal, fact-checking initiatives and computational methods consider not only text but also image content [14, 21, 39, 43], as it can be easily fabricated and manipulated due to the availability of free image and video editing tools.

In this paper, we investigate the role of images in the context of claim and conspiracy detection. Claim detection is one of the first vital steps in identifying fake news: its purpose is to flag a statement if it contains check-worthy facts and information, while the claim itself may be true or false. In conspiracy detection, by contrast, a statement that includes a conspiracy theory is considered fake news and consists of manipulated facts. Although fake news on social media has recently been explored from a multimodal perspective, images have hardly been considered for claim detection, except in recent work by Zlatkova et al. [48]. There, meta-information of images is treated as features, and reverse image search is performed for comparison with the claim text. However, the semantic information of the image is not considered, and the authors highlight that images are more influential than text and appear alongside fake text or unverified news.

Since we are interested in the impact of using images in a multimodal framework, we keep our models simple and focus on extracting only semantic or contextual features from text, without considering its structure or syntactic information. To this end, we mainly use the deep transformer model Bidirectional Encoder Representations from Transformers (BERT) to extract contextual embeddings and use them along with image embeddings. Taking inspiration from recent work by Cao et al. [4], we extract image sentiment features, which are widely applied for image credibility and fake news detection, in addition to object and scene information for the semantic overlap with textual information.
To carry out this study, we experiment with four Twitter datasets on binary classification tasks, two of which are from the recent CLEF-CheckThat! 2020 [2], one in English [36] and the other in Arabic [17]. The third is an English dataset from MediaEval 2020 [33] on conspiracy detection, and the last one is a recent claim detection dataset (English) from Gupta et al. [16] on COVID-19 tweets. Four examples for claim and conspiracy detection are shown in Figure 1.

Fig. 1: Examples from the CLEF-English [36] check-worthy claims dataset (a) and the MediaEval [33] conspiracy detection dataset (b)

To train our unimodal and multimodal models, we use Support Vector Machines (SVM) [40] together with Principal Component Analysis (PCA) [45] for dimensionality reduction, due to the small datasets and the large size of the combined features. We also fine-tune BERT models on the text input to see how far a unimodal model can go on limited-sized datasets, and use different pre-trained BERT models to study the effect of the domain gap. Furthermore, we investigate the recently proposed transformer-based ViLBERT [25] (Vision-and-Language BERT) model that learns semantic features via co-attention over image and textual inputs. As with the BERT models, we perform fixed embedding and fine-tuning experiments with ViLBERT to see whether a large transformer-based multimodal model can learn meaningful representations and perform better on small-sized datasets.

The remainder of the paper is organized as follows. Section 2 briefly discusses related work on fake news detection and the sub-problems of claim and conspiracy detection. Section 3 presents details of the image, text, and multimodal features as well as the fine-tuned and applied models. Section 4 describes the experimental setup and results and summarizes our findings. Section 5 concludes the paper with future research directions.

There is a wide body of work on fake news detection that goes well beyond this paper's scope. Therefore, we restrict this section to multimodal fake news detection, claim detection, and conspiracy detection.

The earliest claim detection works go back a decade. In their pioneering work, Rosenthal et al. [34] extracted claims from Wikipedia discussion forums and classified them via logistic regression using sentiment, syntactic, and lexical features such as POS (part-of-speech) tags and n-grams, as well as other statistical features over text. Since then, researchers have proposed context-dependent [22], context-independent [23], as well as cross-domain [10] and in-domain approaches for claim detection. Recently, transformer-based models [6] have replaced structure-based claim detection approaches due to their success in several downstream natural language processing (NLP) tasks. For claim detection on social media in particular, CLEF-CheckThat! 2020 [2] recently hosted a challenge to detect check-worthy claims in COVID-19 related English tweets and in Arabic tweets on several other topics. The challenge attracted several models, with the top submissions [7, 32, 44] all using some version of transformer-based models like BERT [11] and RoBERTa [24] along with tweet meta-data and lexical features. Outside of the CLEF challenges, some works [12, 27] have also conducted detailed studies on detecting check-worthy tweets in U.S. politics and proposed real-time systems to monitor and filter them.
Taking inspiration from [10], Gupta et al. [16] address the limitations of current methods in cross-domain claim detection by proposing a generalized claim detection model called LESA (Linguistic Encapsulation and Semantic Amalgamation). Their model combines contextual transformer features with learnable POS and dependency relation embeddings via transformers to achieve impressive results on several datasets. For conspiracy detection, MediaEval 2020 [33] saw interesting methods to automatically detect 5G and coronavirus conspiracies in tweets. Top submissions used BERT models [8, 28] pre-trained on COVID Twitter data, tweet meta-data, and graph network data, as well as RoBERTa models [9] combined with Graph Convolutional Networks (GCNs).

For multimodal fake news in general, several benchmark datasets have been proposed in the last few years, generating interest in developing multimodal visual and textual models. In one of the relatively early works, Jin et al. [20] explored rumor detection on Twitter using text, social context (emoticons, URLs, hashtags), and the image by learning a joint representation with attention from LSTM outputs over image features. The authors observed that using the image and social context in addition to text improves fake news detection on Twitter and Weibo datasets. Later, Wang et al. [43] proposed an improved multi-task model that detects fake news as one task and discriminates events as another in order to learn event-invariant representations. Since then, improvements have been proposed via multimodal variational autoencoders [21] and transfer learning [15, 39] with transformer-based text models and deep visual CNN models. Recently, Nakamura et al. [30] proposed a fake news dataset, r/Fakeddit, mined from Reddit with over one million samples, which includes text, images, meta-data, and comments. The data is labeled through distant supervision into 2-way, 3-way, and 6-way classification categories. Besides addressing different tasks, another difference to the approaches mentioned above is that their datasets are of moderate (several thousand) to large (millions of samples) size, compared to a few hundred or a couple of thousand samples in our four datasets for claim and conspiracy detection.

Fig. 2: Overview of the classification models, including fine-tuning ViLBERT [25] or training an SVM based on its multimodal embeddings extracted from text and image

In this section, we provide details of the different image (Section 3.1), textual (Section 3.2), and multimodal (Section 3.3) models and their feature encoding processes, and describe how the classification models (Section 3.4) are built. An overview of the classification models is presented in Figure 2.

The purpose of the image models is to encode the presence of different objects, the scene, place, or background, and affective image content. When learning a multimodal model or a classifier, specific overlapping patterns between image and text can act as discriminatory features for claim detection.

Object Features (I_o): In order to encode objects and the overall image content, we extract features from a ResNet [19] model pre-trained on the ImageNet [35] dataset. Such pre-trained models have been shown to boost performance over low-level features in several computer vision tasks. We use the widely recognized ResNet-152 and extract features from its last convolutional layer instead of the object categories (final layer). The last convolutional layer outputs 2048 feature maps, each of size 7 × 7, which are pooled with a global average to obtain a 2048-dimensional vector.
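The following sketch illustrates this extraction with PyTorch and torchvision. It is a minimal example under our own assumptions (standard torchvision pre-processing, the older `pretrained=True` API) rather than the authors' released code.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# ResNet-152 pre-trained on ImageNet; newer torchvision versions use
# models.resnet152(weights="IMAGENET1K_V1") instead of pretrained=True.
resnet = models.resnet152(pretrained=True).eval()
# Drop the final classification layer; keep the conv stack + global average pooling.
feature_extractor = torch.nn.Sequential(*list(resnet.children())[:-1])

# Standard ImageNet pre-processing (an assumption, not stated in the paper).
preprocess = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def extract_object_features(image_path: str) -> torch.Tensor:
    """Return a 2048-dimensional image embedding (I_o) for one tweet image."""
    image = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        feats = feature_extractor(image)   # shape: (1, 2048, 1, 1)
    return feats.flatten(1).squeeze(0)     # shape: (2048,)
```

The place/scene features (I_p) and hybrid features (I_h) described next follow the same pattern with a ResNet-101 backbone and different pre-training data.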
Place and Scene Features (I_p): In order to encode the scene information in an image, we extract features from a ResNet [19] model pre-trained on the Places365 [47] dataset. In this case, we use ResNet-101 and follow the same encoding process as described for the object features.

Hybrid Object and Scene Features (I_h): We also experiment with a hybrid model trained on both the ImageNet and Places365 datasets, which encodes object and scene information in a single model. To extract these features, we again use a ResNet-101 model and follow the same encoding process.

Image Sentiment (I_s): To encode the image sentiment, we use a pre-trained model [41] that is trained on three million images using weak supervision from the sentiment labels of the corresponding tweet texts. Although the image labels are noisy, the model has shown superior performance on unseen Twitter test datasets. We use their best CNN model, which is based on VGG-19 [38]. The image sentiment embeddings (I_se) are 4096-dimensional vectors extracted from the last layer of the model. Additionally, we extract the image sentiment predictions (I_sp) from the classification layer, which outputs a three-dimensional vector corresponding to the probabilities of the three sentiment classes (negative, neutral, and positive).

Since the context and semantics of a sentence have been shown [2, 6] to be important for claim detection, we use the transformer-based BERT-base [11] (T_BB) to extract contextual word embeddings and employ different pooling strategies to obtain a single embedding for the tweet. As different layers of BERT capture different kinds of information, we experiment with four combinations: 1) the concatenation of the last four hidden layers, 2) the sum of the last four hidden layers, 3) the last hidden layer, and 4) the second-to-last hidden layer. We finally average over the word embeddings to obtain a single vector (see the code sketch at the end of this subsection).

To reduce the domain gap for our English Twitter datasets, we experiment with two further BERT models. The first, BERTweet [31] (T_BT), is a BERT-base model that is further pre-trained on 850 million English tweets; the second, COVID-Twitter-BERT [29] (T_CT), is a BERT-large model trained on 97 million English tweets on the topic of COVID-19. For Arabic tweets, we experiment with AraBERT [1] (T_AB), which is trained on the Arabic news corpus OSIAN [46] and the 1.5 billion words Arabic corpus [13]. We also perform two experiments, one with raw tweets and the other with tweets pre-processed by AraBERT's language-specific text processing method. For English text with the vanilla BERT-base model, we pre-process the text following the steps described by Cheema et al. [7], using the publicly available text processing tool Ekphrasis [3]. We also report the performance of vanilla BERT-base on raw tweets (T_BB^Raw) versus pre-processed tweets (T_BB^Clean) to reflect its sensitivity to text pre-processing. For both BERTweet and COVID-Twitter-BERT, we follow their own pre-processing steps, which normalize the text and additionally replace user mentions, e-mails, and URLs with special keywords.
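The pooling strategies above can be sketched as follows with the HuggingFace transformers library; the checkpoint name bert-base-uncased stands in for T_BB, and the function and variable names are ours, not taken from the paper.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True).eval()

def tweet_embedding(text: str, strategy: str = "sum_last_four") -> torch.Tensor:
    """Encode one tweet into a single vector using a hidden-layer pooling strategy."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        # Tuple of 13 tensors: the embedding layer plus 12 transformer layers.
        hidden_states = model(**inputs).hidden_states
    if strategy == "concat_last_four":
        token_vecs = torch.cat(hidden_states[-4:], dim=-1)    # (1, seq_len, 4*768)
    elif strategy == "sum_last_four":
        token_vecs = torch.stack(hidden_states[-4:]).sum(0)   # (1, seq_len, 768)
    elif strategy == "last":
        token_vecs = hidden_states[-1]
    elif strategy == "second_to_last":
        token_vecs = hidden_states[-2]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    # Average over the word (token) embeddings to obtain a single tweet vector.
    return token_vecs.mean(dim=1).squeeze(0)
```

Swapping the checkpoint for BERTweet, COVID-Twitter-BERT, or AraBERT only changes the from_pretrained arguments and the expected pre-processing.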
ViLBERT (Vision-and-Language BERT): We use ViLBERT [25], one of the recent multimodal transformer architectures, which processes image and text inputs through two separate transformer-based streams and combines them through transformer layers with co-attention. It eventually outputs co-attended image and text features that can be combined (added, multiplied, or concatenated) to learn a classifier for vision-and-language tasks. The authors proposed visual grounding as a self-supervised pre-training task on the large Conceptual Captions dataset [37] and used the model for various downstream tasks involving vision and language, such as visual question answering, visual commonsense reasoning, and caption-based image retrieval. For the image branch, ViLBERT uses the state-of-the-art object detection model Mask R-CNN [18] and extracts the top 100 region proposals (boxes) and their corresponding features. These features are fed as a sequence into a 5-layer image transformer, which outputs the image region embeddings. For the text branch, it uses a BERT-base model to obtain the contextual word embeddings. A 6-layer transformer block with co-attention follows the individual streams and outputs the co-attended image and text embeddings.

Feature Extraction: In our fixed embedding experiments with an SVM, we use the outputs of the pooling layers and the last layers of the image and text branches. With the pooling layers, we directly concatenate (M_pool^CAT) the image and text outputs. With the last-layer outputs, we average the image region embeddings and the word embeddings to obtain one embedding per modality and then concatenate them (M_avg^CAT). From the pooling layers, each modality's embedding is a 1024-dimensional vector, whereas averaging the last-layer embeddings gives 1024- and 768-dimensional vectors for image and text, respectively. For fine-tuning, we follow ViLBERT's downstream task approach, where the pooling layer outputs are either added (M_pool^ADD) or multiplied (M_pool^MUL) and passed to a classifier. For Arabic text, we use Google Translate to convert the text into English, because all ViLBERT models are trained on English text. ViLBERT has been fine-tuned on several downstream tasks that can be relevant for encapsulating the image-text relationship in our claim detection problem. Therefore, we experiment with four different pre-trained models, namely Conceptual Captions, image retrieval (Image-Ret), grounding referring expressions (localizing an image region given a natural language reference; RefCOCO), and a multi-task model [26] trained on 12 different tasks.

For our fixed embedding experiments, we train SVM models with each type of image and text embedding for binary classification of tweets, as shown in Figure 2 (a). For fine-tuning the textual models (Figure 2 (b)), given that we have relatively small datasets, we only experiment with fine-tuning the last two and the last four layers of the transformer models for each dataset. For the multimodal fixed embedding experiments, we concatenate the image and text features and train an SVM model over them for classification. In the case of ViLBERT (Figure 2 (c)), we again train an SVM over the extracted pooled image and text outputs for classification. For fine-tuning, we fix the individual transformer branches and experiment with fine-tuning the last two and four co-attention layers to activate the interaction between the modalities. This enables us to isolate the effect of the attention mechanism and thus the benefit of combining image and text for claim detection. On top of the ViLBERT outputs, we use a simple classifier, as recommended by its authors, consisting of a linear layer that down-projects the outputs to 128 dimensions, followed by a ReLU (Rectified Linear Unit) activation, a normalization layer, and finally a binary classification layer. Dropout is used to avoid overfitting, and fine-tuning is performed by minimizing the cross-entropy loss.
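A minimal PyTorch sketch of such a classification head is shown below. The fused input size, the choice of layer normalization, and the exact position of the dropout are our assumptions; the 128-dimensional projection, ReLU, normalization layer, binary output, and dropout ratio of 0.2 follow the description above and the fine-tuning details in Section 4.

```python
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    """Fuses pooled image and text vectors and classifies them into two classes."""

    def __init__(self, fused_dim: int = 1024, hidden_dim: int = 128,
                 num_classes: int = 2, dropout: float = 0.2):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(fused_dim, hidden_dim),  # down-project to 128 dimensions
            nn.ReLU(),
            nn.LayerNorm(hidden_dim),          # assumption: layer normalization
            nn.Dropout(dropout),               # dropout after the first linear layer
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, image_pooled: torch.Tensor, text_pooled: torch.Tensor,
                fusion: str = "mul") -> torch.Tensor:
        if fusion == "add":          # M_pool^ADD
            fused = image_pooled + text_pooled
        elif fusion == "mul":        # M_pool^MUL
            fused = image_pooled * text_pooled
        else:                        # concatenation; fused_dim must then be doubled
            fused = torch.cat([image_pooled, text_pooled], dim=-1)
        return self.head(fused)      # logits for the cross-entropy loss
```

During fine-tuning, these logits are passed to nn.CrossEntropyLoss together with the binary labels.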
In this section, we describe the datasets and their statistics, the training details and hyper-parameters, and the model details, and we report and discuss the experimental results obtained by the different models described in Section 3.

We selected the following four publicly available Twitter datasets with high-quality annotations (which excludes [30], besides its focus on fake news); three of them address claim detection and one conspiracy detection. The original datasets contain four to fifteen times more tweets, as they were mined for text-based fake news detection; we only selected tweets that include an image.

CLEF-En [36]: Released as part of the CLEF-CheckThat! 2020 challenge, the task is to identify COVID-19 related tweets that contain check-worthy claims versus those that do not. Only 281 English tweets in the dataset include images, whereas the original dataset contained 964 tweets.

CLEF-Ar [17]: Released in the same challenge, the dataset covers 15 topics related to the Middle East, including COVID-19, and the task is to identify check-worthy claims. It consists of 2571 Arabic tweets and corresponding images.

MediaEval [33]: Released in the MediaEval 2020 workshop challenge on identifying 5G and coronavirus conspiracy tweets. The original dataset has three classes: 5G and corona conspiracy, other conspiracies, and no conspiracy. To make the problem consistent with the other datasets in this paper, we combine the conspiracy classes (corona and others) and treat the task as binary classification. It consists of 1724 tweets and images.

LESA [16]: This is a recently proposed dataset of COVID-19 related tweets for the problem of claim detection. Here, the task is to identify whether a tweet is a claim or not, rather than claim check-worthiness as in CLEF-En. The original dataset consists of 10,000 English tweets, out of which only 1395 include images.

We applied 5-fold cross-validation to mitigate the low number of samples in each dataset, using a ratio of approximately 72:10:18 for training, validation, and testing in each split. Next, we report the experimental results for the different model configurations. The reported results are averaged across the five splits of each dataset. We report accuracy and the weighted F1 measure to account for label imbalance in all datasets.

SVM hyper-parameters: We perform a grid search over the conserved PCA energy (%), the regularization parameter C, and the RBF kernel's gamma. The PCA range varies from 100% (original features) down to 95% in decrements of 1%. The ranges for C and gamma vary between 10^-1 and 10^1 on a log scale with 15 steps. Only for the CLEF-En dataset, we use the range 10^-2 to 10^0 for C and gamma, as the number of samples is very low and requires aggressive regularization. We normalize the final embedding so that the l2 norm of the vector is 1.

Fine-tuning BERT and ViLBERT: We use a batch size of 4 for CLEF-En and 16 for the other datasets. We train all models for 6 epochs with an initial learning rate of 5e-5 and linear decay. A dropout ratio of 0.2 is applied after the first linear layer in the classifier for regularization during fine-tuning.
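The fixed-embedding SVM setup (l2-normalized embeddings, PCA, and a grid search over the PCA energy, C, and gamma) can be sketched with scikit-learn as follows. The random data is a stand-in for the extracted image and text embeddings, applying the l2 normalization before PCA is our assumption, and the grid only approximates the ranges stated above.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Dummy stand-ins for the extracted features and labels (e.g., I_o and T_BB).
rng = np.random.default_rng(0)
X_img = rng.normal(size=(200, 2048))   # image embeddings
X_txt = rng.normal(size=(200, 768))    # text embeddings
y = rng.integers(0, 2, size=200)       # binary claim labels

X = np.hstack([X_img, X_txt])          # simple concatenation of modalities
pipeline = Pipeline([
    ("norm", Normalizer(norm="l2")),   # unit l2 norm per sample
    ("pca", PCA()),                    # keep a fraction of the variance (energy)
    ("svm", SVC(kernel="rbf")),
])
param_grid = {
    "pca__n_components": [None, 0.99, 0.98, 0.97, 0.96, 0.95],  # 100% .. 95%
    "svm__C": np.logspace(-1, 1, 15),       # 10^-1 .. 10^1, 15 steps
    "svm__gamma": np.logspace(-1, 1, 15),
}
# The full grid is large; shrink it for a quick test run.
search = GridSearchCV(pipeline, param_grid, scoring="f1_weighted", cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```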
Table 1 and Table 2 show the performance of the unimodal and multimodal models on all four datasets, by feature type and feature combination, respectively.

Unimodal Results: In Table 1, it can be seen that all visual features perform poorly in comparison to the textual features. This is expected, as visual information on its own cannot indicate whether a social media post makes a claim unless the image contains text or is a video. Among the four types of visual models, object (I_o) and hybrid (I_h) features are slightly better, probably because place or scene information (lowest F1 for all datasets) on its own is not a useful indicator for claim detection. With textual features, BERT models that are further pre-trained on tweets (T_BT, T_BT^†) and COVID-related data (T_CT, T_CT^†) perform better than vanilla BERT (T_BB^Clean, T_BB^Clean†) on at least three datasets. This suggests that models pre-trained on Twitter corpora better capture the structure of tweets and reduce the domain gap. Further, normalizing the tweet text (T_BB^Clean) delivers better performance than using the raw text (T_BB^Raw). In SVM training, we observed that computing the embeddings as the sum of the last four BERT layers performs better than the other pooling combinations, indicating that downstream tasks can benefit from the diverse information in the different layers of BERT. Similarly, fine-tuning the last four layers instead of the last two (marked with a superscript 2) gives better performance across all datasets with BERT-base (T_BB^Clean†), COVID-Twitter-BERT (T_CT^†), and AraBERT (T_AB^†).

Multimodal Results: Table 2 shows the effect of combining visual with textual features, both via simple concatenation in the SVM and with the multimodal co-attention transformer ViLBERT. Although we do not see any benefit from the image sentiment embeddings (I_se) in the unimodal models, here we instead use the image sentiment predictions (I_sp), which perform better than or on par with the other visual features. For instance, in the case of CLEF-Ar, the sentiment predictions I_sp combined with AraBERT (T_AB^†) give the best fixed-embedding performance. Similarly, combining hybrid features (I_h) with BERT-base (T_BB^Clean†) for LESA and object features with COVID-Twitter-BERT (T_CT^†) for MediaEval improves the metrics by 1% over the textual SVM models. With ViLBERT, it is interesting to see that, even with fixed visual and textual branches, the co-attention can capture some information from image and text and boost performance on LESA and MediaEval. It is worth mentioning that the best unimodal textual models for English and Arabic are models further pre-trained on Twitter and language-specific corpora. In the case of ViLBERT, the domain gap is wider, and for Arabic, the translation process loses quite a bit of information, which results in a drop in performance. The different pooling operations applied to the pre-trained ViLBERT models show larger differences in the fixed-embedding SVM experiments, where average pooling (M_avg^CAT) yields considerable performance, which we also observed in the unimodal SVM experiments. We also observed that the pre-training task matters (the best two are reported in Table 2): image retrieval (Image-Ret) and language reference grounding (RefCOCO) features perform much better for all datasets. This is plausible, since both tasks require capturing complex relationships and linking text to specific regions in the image, which benefits our tasks.
We can summarize the findings of our experiments as follows: 1) Domain-specific language models should be preferred for downstream tasks such as claim detection or fake news detection, where the underlying meaning and context of certain words (like COVID) are essential. 2) Multimodality certainly helps, as seen with the multimodal transformer models, where activating the interaction between fixed unimodal embeddings through co-attention layers improves the performance on two datasets. 3) To further understand the underlying multimodal dynamics, it might be better to explicitly model multimodal relationships, for instance, the importance of the image or the correlation between image and text, in addition to claim detection. 4) Certain pre-training tasks in ViLBERT are better suited for downstream tasks and need further introspection on larger datasets. 5) Visual models need to be better adapted to social media images; for instance, the models used here are not sufficient for diagrams or images with large text, which constitute around 30-40% of the LESA and MediaEval datasets.

In this paper, we have investigated the role of images and tweet text for two problems related to fake news: claim detection and conspiracy detection. For this purpose, we combined several state-of-the-art CNN features for images with BERT features for text. We observed a performance improvement over unimodal models on two out of four Twitter datasets across two languages. We also experimented with the recently proposed multimodal co-attention transformer ViLBERT and observed promising performance using both image and text, even with relatively small datasets. In future work, we will look into other ways of including external knowledge in domain-independent claim detection models without relying on different domain-specific language models. Second, we plan to investigate multimodal transformers in more detail and analyze whether their performance scales with more data on similar tasks. Finally, to address the limitations of the visual models, we will consider models that can deal with text and graphs in images and extract suitable features.

References

[1] AraBERT: Transformer-based model for Arabic language understanding
[2] Overview of CheckThat! 2020: Automatic identification and verification of claims in social media
[3] DataStories at SemEval-2017 task 4: Deep LSTM with attention for message-level and topic-based sentiment analysis
[4] Exploring the role of visual content in fake news detection. Disinformation, Misinformation, and Fake News in Social Media
[5] Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum
[6] IMHO fine-tuning improves claim detection
[7] Check square at CheckThat! 2020: Claim detection in social media via fusion of transformer and syntactic features
[8] TIB's Visual Analytics Group at MediaEval '20: Detecting fake news on corona virus and 5G conspiracy
[9] Detecting fake news in tweets from text and propagation graph: IRISA's participation to the FakeNews task at MediaEval 2020
[10] What is the essence of a claim? Cross-domain claim identification
[11] BERT: Pre-training of deep bidirectional transformers for language understanding
[12] Detecting real-time check-worthy factual claims in tweets related to U.S. politics
[13] 1.5 billion words Arabic corpus
[14] Leveraging emotional signals for credibility detection
[15] Multimodal multi-image fake news detection
[16] LESA: Linguistic encapsulation and semantic amalgamation based generalised claim detection from online content
[17] Overview of CheckThat! 2020 Arabic: Automatic identification and verification of claims in social media
[18] Mask R-CNN. In: IEEE International Conference on Computer Vision
[19] Deep residual learning for image recognition
[20] Multimodal fusion with recurrent neural networks for rumor detection on microblogs
[21] MVAE: Multimodal variational autoencoder for fake news detection
[22] Context dependent claim detection
[23] Context-independent claim detection for argument mining
[24] RoBERTa: A robustly optimized BERT pretraining approach
[25] ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks
[26] 12-in-1: Multi-task vision and language representation learning
[27] ClaimPortal: Integrated monitoring, searching, checking, and analytics of factual claims on Twitter
[28] FakeNews detection using pre-trained language models and graph convolutional networks
[29] COVID-Twitter-BERT: A natural language processing model to analyse COVID-19 content on Twitter
[30] Fakeddit: A new multimodal benchmark dataset for fine-grained fake news detection
[31] BERTweet: A pre-trained language model for English tweets
[32] Team Alex at CheckThat! 2020: Identifying check-worthy tweets with transformer models
[33] FakeNews: Corona virus and 5G conspiracy task at MediaEval 2020
[34] Detecting opinionated claims in online discussions
[35] ImageNet large scale visual recognition challenge
[36] Overview of CheckThat! 2020 English: Automatic identification and verification of claims in social media
[37] Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning
[38] Very deep convolutional networks for large-scale image recognition
[39] SpotFake: A multi-modal framework for fake news detection
[40] Least squares support vector machine classifiers
[41] Cross-media learning for image sentiment analysis in the wild
[42] The spread of true and false news online
[43] EANN: Event adversarial neural networks for multi-modal fake news detection
[44] Accenture at CheckThat! 2020: If you say so: Post-hoc fact-checking of claims using transformer-based models
[45] Principal component analysis. Chemometrics and Intelligent Laboratory Systems
[46] OSIAN: Open source international Arabic news corpus - preparation and integration into the CLARIN infrastructure
[47] Places: A 10 million image database for scene recognition
[48] Fact-checking meets fauxtography: Verifying claims about images

Acknowledgment. This work was funded by the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement no. 812997.