MM-Claims: A Dataset for Multimodal Claim Detection in Social Media
Gullal S. Cheema, Sherzod Hakimov, Abdul Sittar, Eric Müller-Budack, Christian Otto, Ralph Ewerth
Date: 2022-05-04

In recent years, the problem of misinformation on the web has become widespread across languages, countries, and social media platforms. Although there has been much work on automated fake news detection, the role of images and their variety have not been well explored. In this paper, we investigate the roles of image and text at an earlier stage of the fake news detection pipeline, called claim detection. For this purpose, we introduce a novel dataset, MM-Claims, which consists of tweets and corresponding images over three topics: COVID-19, Climate Change, and (broadly) Technology. The dataset contains roughly 86 000 tweets, out of which 3400 are labeled manually by multiple annotators for the training and evaluation of multimodal models. We describe the dataset in detail, evaluate strong unimodal and multimodal baselines, and analyze the potential and drawbacks of current models.

The importance of combating misinformation was once again illustrated by the coronavirus pandemic, which was accompanied by a large amount of "potentially lethal" misinformation. At the beginning of the COVID-19 pandemic, the United Nations (UN) (DGC, 2020) even started using the term "infodemic" for this phenomenon and called for proper dissemination of reliable facts. However, tackling misinformation online, and specifically on social media platforms, is challenging due to the variety of information and the volume and speed of streaming data. As a consequence, several studies have explored different aspects of COVID-19 misinformation online, including sharing patterns (Pennycook et al., 2020), platform-dependent engagement patterns (Cinelli et al., 2020), web search behaviors (Rovetta and Bhagavathula, 2020), and fake images (Sánchez and Pascual, 2020).

We are primarily interested in claims on social media from a multimodal perspective (Figure 1).

Figure 1: Example tweets: a) not a claim (image and text together abstractly represent effects of climate change), b) claim but not check-worthy (claim in the text, but lacking details such as which experts are referred to, while the image is relevant), c) check-worthy but not visually relevant (the claim in the text targets the CDC and China, but the image is a stock photograph), and d) check-worthy and visually relevant (claim in text and image, with important details in both).

Claim detection can be seen as an initial step in fighting misinformation and as a precursor for prioritizing potentially false information for fact-checking. Traditionally, claim detection has been studied from a linguistic standpoint, where both the syntax (Rosenthal and McKeown, 2012) and semantics (Levy et al., 2014) of the language matter for detecting a claim accurately. However, claims or fake news on social media are not bound to a single modality, and the problem becomes more complex with additional modalities such as images and videos. While a claim in the text is clearly expressed in verbal form, a claim can also be part of the visual content or appear as overlaid text in the image.
Even though much effort has been spent on the curation of datasets (Boididou et al., 2016; Nakamura et al., 2020; Jindal et al., 2020) and the development of computational models for multimodal fake news detection on social media (Ajao et al., 2018; Wang et al., 2018; Khattar et al., 2019; Singhal et al., 2019) , hardly any research has focused on multimodal claims (Zlatkova et al., 2019; Cheema et al., 2020b) . In this paper, we extend the definitions of claims and check-worthiness from previous work (Barrón-Cedeno et al., 2020; Nakov et al., 2021) to multimodal claim detection and introduce a novel dataset called Multimodal Claims (MM-Claims) curated from Twitter to tackle this critical problem. While previous work has focused on factually-verifiable check-worthy (Barrón-Cedeno et al., 2020; Alam et al., 2020) or general claims (i.e., not necessarily factually-verifiable, e.g., (Gupta et al., 2021) ) on a single topic, we focus on three different topics, namely COVID-19, Climate Change and Technology. As shown in Figure 1 , MM-Claims aims to differentiate between tweets without claims ( Figure 1a ) as well as tweets with claims of different types: claim but not check-worthy (Figure 1b) , checkworthy claim (Figure 1c) , and check-worthy visually relevant claim (Figure 1d ). Our contributions can be summarized as follows: • a novel dataset for multimodal claim detection in social media with more than 3000 manually annotated and roughly 82 000 unlabeled image-text tweets is introduced; • we present details about the dataset and the annotation process, class definitions, dataset characteristics, and inter-coder agreement; • we provide a detailed experimental evaluation of strong unimodal and multimodal models highlighting the difficulty of the task as well as the role of image and text content. The remainder of the paper is structured as follows. Section 2 describes the related work on unimodal and multimodal approaches for claim detection. The proposed dataset and the annotation guidelines are presented in Section 3. We discuss the experimental results of the compared models in Section 4, while Section 5 concludes the paper and outlines areas of future work. Before research on claim detection targeted social media, pioneering work by Rosenthal and McKeown (2012) focused on claims in Wikipedia discussion forums. They used lexical and syntactic features in addition to sentiment and other statistical features over text. Since then, researchers have proposed context-dependent (Levy et al., 2014) , context-independent (Lippi and Torroni, 2015) , cross-domain (Daxenberger et al., 2017) , and in-domain approaches for claim detection. Recently, transformer-based models (Chakrabarty et al., 2019) have replaced structure-based claim detection approaches due to their success in several Natural Language Processing (NLP) downstream tasks. A series of workshops (Barrón-Cedeno et al., 2020; Nakov et al., 2021) focused on claim detection and verification on Twitter and organized challenges with several sub-tasks on text-based claim detection around the topic of COVID-19 in multiple languages. Gupta et al. (2021) addressed the limitations of current methods in cross-domain claim detection by proposing a new dataset of about ∼10 000 claims on COVID-19. They also proposed a model that combines transformer features with learnable syntactic feature embeddings. Another dataset introduced by Iskender et al. (2021) includes tweets in German about climate change for claim and evidence detection. 
Wührl and Klinger (2021) created a dataset of biomedical Twitter claims related to COVID-19, measles, cystic fibrosis, and depression. One common theme and challenge across all these datasets is the variety of claims: some types of claims (e.g., implicit ones) are harder to detect than explicit ones, where a typical claim structure is present. Table 1 compares existing social-media-based claim datasets in terms of the number of samples, modalities, data sources, language, topic, and type of tasks.

From the multimodal perspective, very few works have analyzed the role of images in the context of claims. Zlatkova et al. (2019) introduced a dataset of claims created with the idea of investigating questionable or outright false images that accompany fake news or claims. The authors used reverse image search and several image metadata features such as tags from the Google Vision API, URL domains and categories, and reliability-related features. Later work extended text-based claims from Barrón-Cedeno et al. (2020) and Gupta et al. (2021) with images to evaluate multimodal detection approaches. Although previous work has provided multimodal datasets on claims, they either address the veracity (true or false) of claims or are labeled only based on the text for a single topic.

In terms of multimodal models for image-text data, most previous work is in the related area of multimodal fake news, where several benchmark datasets and models exist for fake news detection (Nakamura et al., 2020; Boididou et al., 2016; Jindal et al., 2020). In an early work, Jin et al. (2017) explored rumor detection on Twitter using text, social context (emoticons, URLs, hashtags), and the image by learning a joint representation in a deep recurrent neural network. Since then, several improvements have been proposed, such as multi-task learning with an event discriminator (Wang et al., 2018), a multimodal variational autoencoder (Khattar et al., 2019), and multimodal transfer learning using transformers for text and image (Giachanou et al., 2020; Singhal et al., 2019).

This section describes the problem of multimodal claim detection (Section 3.1), the data collection (Section 3.2), the guidelines for annotating multimodal claims (Section 3.3), and the annotation process (Section 3.4) used to obtain the new dataset. Given a tweet with a corresponding image, the task is to identify important factually-verifiable or check-worthy claims. In contrast to related work, we introduce a novel dataset for claim detection that is labeled based on both the tweet and the corresponding image, making the task truly multimodal.

Our scope of claims is motivated by Alam et al. (2020) and Gupta et al. (2021), which provide detailed annotation guidelines. We restrict our dataset to factually-verifiable claims (as in Alam et al. (2020)), since these are often the claims that need to be prioritized for fact-checking or verification to limit the spread of misinformation. On the other hand, we also include claims that are personal opinions, comments, or claims existing at the sub-sentence or sub-clause level (as in Gupta et al. (2021)), with the condition that they are factually-verifiable. We further extend the definition of claims, including factually-verifiable and check-worthy claims, to images. In previous work on claim detection in tweets, most of the publicly available English-language datasets (Alam et al., 2020; Barrón-Cedeno et al., 2020; Gupta et al., 2021; Nakov et al., 2021) are text-based and cover a single topic such as COVID-19 or the 2016 U.S. elections.
To make the problem interesting and broader, we have collected tweets on three topics, COVID-19, Climate Change and broadly Technology, that might be of interest to a wider research community. Next, we describe the steps for crawling and preprocessing the data. We have used an existing collection of tweet IDs, where some are topic-specific Twitter dumps, and extracted tweet text and the corresponding image to create a novel multimodal dataset. COVID-19: We combined tweets from three Twitter resources (Banda et al., 2020; Dimitrov et al., 2020; Lamsal, 2020 ) that were posted between October 2019 and April 2020. In our dataset, we use tweets in the period from March -April 2020. To avoid the extraction of all the tweets from 2019 to 2020 irrespective of the topic, we followed a two-step process to find tweets remotely related to technology. The corpus is available in form of RDF (Resource Description Framework) triples with attributes like tweet ID, hashtags, entities and emotion labels, but without tweet text or media content details. First, we selected tweet IDs based on hashtags and entities, and only kept those that contain keywords like technology, cryptocurrency, cybersecurity, machine learning, nano technology, artificial intelligence, IOT, 5G, robotics, blockchain, etc. The second step of filtering tweets based on a selected set of hashtags for each topic is described in the next subsection. From the above resources, we collected 214 715, 28 374 and 417 403 tweets for the topics COVID-19, Climate Change and Technology, respectively. We perform a number of filtering steps to remove inconsistent samples: 1) tweets that are not in English or without any text, 2) duplicated tweets based on tweet IDs, processed text and retweets, 3) tweets with corrupted or no images, 4) tweets with images of less than 200 × 200 pixels resolution, 5) tweets that have more than six hashtags, and finally, 6) we make a list of the top 300 hashtags in each topic based on count and manually select those related to the selected topics. We only keep those tweets where all hashtags are in the list of top selected hashtags. The hashtags are manually marked because some top hashtags are not relevant to the main topic of interest. The statistics of tweets after each filtering step are provided in the Appendix (see Table 7 ). In summary, we end up with 17 771, 4874, and 62 887 tweets with images for COVID-19, Climate Change and Technology, respectively. In this section, we provide definitions for all investigated claim aspects, the questions asked to annotators, and the cues and explanations for the annotation questions. We define a claim as state or assert that something is the case, typically without providing evidence or proof using the definition in the Oxford dictionary (like Gupta et al. (2021)). The definition of a factually-verifiable claim is restricted to claims that can possibly be verified using external sources. These external sources can be reliable websites, books, scientific reports, scientific publications, credible fact-checked news reports, reports from credible organizations like World Health Organization or United Nations. Although we did not provide external links of reliable sources for the content in the tweet, we highlighted named entities that pop-up with the text and image description. External sources are not important at this stage because we are only interested in marking claims, which have possibly incorrect details and information. 
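To make the filtering steps described earlier in this section concrete, the following minimal sketch shows how steps 1)-6) could be implemented with pandas and Pillow. It is illustrative only, not the pipeline actually used by the authors; the column names (tweet_id, text, lang, image_path, hashtags) and the exact normalization are assumptions.

```python
from collections import Counter

import pandas as pd
from PIL import Image


def top_hashtags(df: pd.DataFrame, k: int = 300) -> list:
    """Top-k hashtags by count; the topic-relevant subset is then marked manually."""
    counts = Counter(tag.lower() for tags in df["hashtags"] for tag in tags)
    return [tag for tag, _ in counts.most_common(k)]


def image_ok(path: str, min_size: int = 200) -> bool:
    """Reject corrupted/missing images and images smaller than 200x200 pixels."""
    try:
        with Image.open(path) as im:
            return im.width >= min_size and im.height >= min_size
    except Exception:
        return False


def filter_tweets(df: pd.DataFrame, allowed_hashtags: set) -> pd.DataFrame:
    # 1) English tweets with non-empty text
    df = df[(df["lang"] == "en") & df["text"].str.strip().astype(bool)]
    # 2) de-duplicate by tweet ID and by normalized text (retweets collapse here)
    df = df.assign(norm_text=df["text"].str.lower().str.replace(r"\s+", " ", regex=True))
    df = df.drop_duplicates(subset="tweet_id").drop_duplicates(subset="norm_text")
    # 3) + 4) drop tweets whose image is corrupted, missing, or too small
    df = df[df["image_path"].apply(image_ok)]
    # 5) drop tweets with more than six hashtags
    df = df[df["hashtags"].apply(len) <= 6]
    # 6) keep tweets whose hashtags all appear in the manually curated top-300 list
    return df[df["hashtags"].apply(lambda tags: all(t.lower() in allowed_hashtags for t in tags))]
```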
A list of identifiable cues (extended from Barrón-Cedeno et al. (2020) ) for factually-verifiable claims is provided in the Appendix A.3.1. To define check-worthiness, we follow Barrón-Cedeno et al. (2020) and identify claims as checkworthy if the information in the tweet is, 1) harmful (attacks a person, organization, country, group, race, community, etc), or 2) urgent or breaking news (news-like statements about prominent people, organizations, countries and events), or 3) up-to-date (referring to recent official document with facts, definitions and figures). A detailed description of these cases is provided in the Appendix A.3.1. Given these key points, the answer to whether the claim is check-worthy is subjective since it depends on the person's (annotator's) background and knowledge. Annotation Questions: Based on the definitions above, we decided on the following annotation questions in order to identify factuallyverifiable claims in multimodal data. • Q1: Does the image-text pair contain a factually-verifiable claim? -Yes / No • Q2: If "Yes" to Q1, Does the claim contain harmful, up-to-date, urgent or breaking-news information? -Yes / No • Q3: If "Yes" to Q1, Does the image contain information about the claim or the claim itself (in the overlaid text)? -Yes / No Question 3 (Q3) intends to identify whether the visual content contributes to a tweet having factuallyverifiable claims. The question is answered "Yes" if one of the following cases hold true: 1) there exists a piece of evidence (e.g. an event, action, situation or a person's identity) or illustration of certain aspects in the claim text, or 2) the image contains overlay text that itself contains a claim in a text form. Please note that we asked the annotators to label tweets with respect to the time they were posted. During our annotation dry runs we observed that there were several false annotations for the tweets where the claims were false but already well known facts. This aspect intends to ignore the veracity of claims since some of the claims become facts over time. In addition, we ignore tweets that are questions and label them as not claims unless the corresponding image consists of a response to the question and is a factually-verifiable claim. Each annotator was asked to answer these questions by looking at both image and text in a given tweet. We distribute the data among nine external and four expert internal annotators for the annotation of training and evaluation splits, respectively. The nine annotators are graduate students with engineering or linguistics background. These annotators were paid 10 Euro per hour for their participation. The four expert annotators are doctoral and postdoctoral researchers of our group with a research focus on computer vision and multimodal analytics. Each annotator was shown a tweet text with its corresponding image and asked to answer the questions presented in Section 3.3. Exactly three annotators labeled each sample, and we used a majority vote to obtain the final label. We selected a total of 3400 tweets for manual annotation of training (annotated by external annotators) and evaluation (annotated by internal experts) splits. Each split contains an equal number of samples for the topics: COVID-19, Climate Change, and Technology. 
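A minimal sketch of the majority vote described above is given below; the label names follow the paper, while the function itself is illustrative rather than the actual annotation tooling.

```python
from collections import Counter
from typing import List, Optional


def majority_vote(labels: List[str]) -> Optional[str]:
    """Return the label chosen by at least two of the three annotators, else None."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= 2 else None


# With three annotators, binary labels always admit a majority:
print(majority_vote(["claim", "not_claim", "claim"]))  # -> "claim"
# Tertiary (and visual) labels can end in a three-way split, which is resolved
# separately as described in Section 3.5 and Appendix A.3.4:
print(majority_vote(["not_claim", "claim_not_checkworthy", "checkworthy_claim"]))  # -> None
```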
Labels for three types of claim annotations are derived (here, "claim" always refers to a factually-verifiable claim, not any claim):
• binary claim classes: not a claim, and claim
• tertiary claim classes: not a claim, claim but not check-worthy, and check-worthy claim
• visual claim classes: not a claim, visually-irrelevant claim, and visually-relevant claim

The annotators were trained with detailed annotation guidelines, which included the definitions given in Section 3.3 and multiple examples. To ensure quality, we performed two dry runs using a set of 30-40 samples to annotate. Afterwards, the annotations were discussed to check agreement among annotators, and the guidelines were refined based on the feedback.

We measured the agreement within the two groups of annotators using Krippendorff's alpha (Krippendorff, 2011). The agreement was computed for the three types of annotations described in the previous section. For the training dataset group, we observe agreement scores of 0.53, 0.39, and 0.42 for the binary, tertiary, and visual claims, respectively. For the test dataset group, we observe agreement scores of 0.57, 0.47, and 0.52 for the three classifications, respectively. The moderate agreement scores suggest that identifying check-worthy claims is a partially subjective task for both non-experts and experts.

While a majority is always possible for binary claim classification, which allows us to derive unambiguous labels, entirely different labels can be chosen in the tertiary and visual claim classification tasks since the annotators choose among three possible classes. Consequently, it is not possible to derive a label with majority voting when each annotator selects a different option. In such cases, we resolve the conflict by prioritizing the claim but not check-worthy class, since check-worthiness is a stricter constraint and was chosen by only one annotator, while two annotators agreed that it is a claim. For visual claims, we select visually-relevant claim, since it is possible that image and text are related even when one annotator answered "no" to the claim question. A table and a detailed explanation of the conflict cases are provided in Appendix A.3.4.

As a result of the annotation process, the Multimodal Claims (MM-Claims) dataset consists of 2815 training (T_C) and 585 evaluation (E_C) samples, where the subscript C stands for "with resolved conflicts". However, as discussed above, there are conflicting examples for the tertiary and visual claim labels. To train and evaluate our models on unambiguous examples, we derive a subset of the MM-Claims dataset that contains 2555 (T) and 525 (E) samples "without conflicts", for which a majority vote can be taken. In each case, we further divide the training set (T_C or T) into training and validation with a 90:10 split for hyper-parameter tuning.

We noticed that one-third of the images in the dataset contain a considerable amount of overlaid text (five or more words). As suggested by previous work (Cheema et al., 2021; Parcalabescu et al., 2021; Kirk et al., 2021), overlaid text in images should be considered in addition to the tweet text and other image content. Specifically, images with overlaid text not only provide information related to the tweet text but sometimes carry the central message of the tweet. We used Tesseract-OCR (Fayez, 2021) to select images that contain five or more words in their overlay text.
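This OCR-based selection step can be sketched as follows. The paper uses Tesseract-OCR for this first pass; pytesseract is assumed here as the Python wrapper, and the five-word threshold follows the text above.

```python
import pytesseract
from PIL import Image


def has_overlay_text(image_path: str, min_words: int = 5) -> bool:
    """True if Tesseract detects at least `min_words` word-like tokens in the image."""
    text = pytesseract.image_to_string(Image.open(image_path))
    words = [w for w in text.split() if any(c.isalnum() for c in w)]
    return len(words) >= min_words
```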
In an internal pre-test with 100 images, we observed that Tesseract-OCR produced more random (and incorrect) text from images than the Google Vision API. To reduce the amount of incorrect text, we ran the Google Vision API only on the selected images (avoiding unnecessary costs) in a second step, which resulted in better-quality OCR text. Besides the labeled dataset, we will also provide the images, tweet text, and overlay text (extracted using the OCR methods described above) for the unlabeled portion of the dataset.

In this section, we describe the features, baseline models, and the comprehensive experiments using our novel dataset. We test a variety of features and recent multimodal state-of-the-art models. Source code is available at https://github.com/TIBHannover/MM_Claims; the dataset (tweet IDs) and labels are available at https://data.uni-hannover.de/dataset/mm_claims. For access to the complete labeled data (images and tweets), please contact gullal.cheema@tib.eu or gullalcheema@gmail.com.

Pre-processing: For images, we use the standard pre-processing of resizing and normalization, whereas text is cleaned and normalized according to Cheema et al. (2020a). For the BERT text embedding used with a Support Vector Machine (SVM; Cortes and Vapnik, 1995), we employ a pooling strategy that adds the outputs of the last four layers and then averages them to obtain the final 768-dimensional vector.

Multimodal Features: We use the following two pre-trained image-text representation learning architectures to extract multimodal features. The ALBEF (ALign BEfore Fuse) embedding results from a recent multimodal state-of-the-art model for vision-language downstream tasks. It is trained on a combination of several image captioning datasets (∼14 million image-text pairs) and uses BERT and a vision transformer (Dosovitskiy et al., 2021) for text and image encoding, respectively. It produces a multimodal embedding of 768 dimensions. The CLIP (Contrastive Language-Image Pre-training) model (Radford et al., 2021) is trained on 400 million image-text pairs using only natural language supervision. We evaluate several image encoder backbones, including ResNet and the vision transformer (Dosovitskiy et al., 2021). The CLIP model outputs two embeddings of the same size, i.e., the image (CLIP_I) and the text (CLIP_T) embedding, while CLIP_I⊕T denotes the concatenation of the two embeddings.

In the following, we describe training details, hyper-parameters, input combinations, and the different baseline models. To classify the unimodal and multimodal embeddings in our experiments, we first use PCA (Principal Component Analysis) to reduce the feature size and then train an SVM with the RBF kernel. We perform a grid search over the conserved PCA energy (%), the regularization parameter C, and the RBF kernel's gamma. The PCA energy ranges from 100% (original features) to 95% in decrements of 1%. The ranges for C and gamma span −1 to 1 on a log scale (i.e., 10^−1 to 10^1) with 15 steps. For multimodal experiments, image and text embeddings are concatenated before being passed to PCA and the SVM. We normalize the final embedding so that its l2 norm is 1.

We experiment with fine-tuning the last few layers of unimodal and multimodal transformer models to obtain a strong multimodal baseline and to see whether introducing cross-modal interactions improves claim detection performance. We fine-tune the last layers of both models and report the best configurations in Table 2. Additional experimental results on the fine-tuned layers are provided in Appendix A.2.5.
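The SVM baseline pipeline described above (last-four-layer BERT pooling, l2 normalization, PCA energy selection, and an RBF-SVM grid search over C and gamma) can be sketched as follows. This is an illustrative reconstruction under assumptions (bert-base-uncased, scikit-learn, sum-over-layers then mean-over-tokens pooling), not the authors' exact implementation; multimodal variants would concatenate image embeddings (e.g., CLIP_I or ALBEF) before the same PCA/SVM stage.

```python
import numpy as np
import torch
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer
from sklearn.svm import SVC
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True).eval()


@torch.no_grad()
def bert_embedding(text: str) -> np.ndarray:
    """Sum the last four hidden layers, then average over tokens -> 768-d vector."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    hidden = bert(**inputs).hidden_states          # tuple of 13 layer outputs
    summed = torch.stack(hidden[-4:]).sum(dim=0)   # (1, seq_len, 768)
    return summed.mean(dim=1).squeeze(0).numpy()   # (768,)


def train_svm(features: np.ndarray, labels: np.ndarray) -> GridSearchCV:
    """l2-normalize, reduce with PCA (conserved energy in [95%, 100%]), classify with an RBF-SVM."""
    pipe = Pipeline([
        ("norm", Normalizer(norm="l2")),
        ("pca", PCA()),
        ("svm", SVC(kernel="rbf")),
    ])
    grid = {
        # None keeps all components ("100%"); floats select the conserved energy
        "pca__n_components": [None, 0.99, 0.98, 0.97, 0.96, 0.95],
        "svm__C": np.logspace(-1, 1, 15),
        "svm__gamma": np.logspace(-1, 1, 15),
    }
    search = GridSearchCV(pipe, grid, scoring="f1_macro", cv=3, n_jobs=-1)
    search.fit(features, labels)
    return search
```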
For fine-tuning, we limit the tweet text to the maximum number of tokens (91) observed in the training data and pad shorter tweets with zeros. Hyper-parameter details for fine-tuning are provided in Appendix A.1.

To incorporate OCR text into our models, we experiment with two strategies for embedding generation and one strategy for fine-tuning. To obtain an embedding for the SVM models, we experimented with concatenating the OCR embedding to the image and tweet text embeddings, as well as adding the OCR embedding directly to the tweet text embedding. For fine-tuning, we concatenate the OCR text to the tweet text and limit the OCR text to 128 tokens.

We compare our models with two state-of-the-art approaches for multimodal fake news detection. MVAE (Khattar et al., 2019) is a multimodal variational autoencoder that uses a multi-task loss to minimize the reconstruction error of the individual modalities and a task-specific cross-entropy loss for classification. We use the publicly available source code and hyper-parameters for our task. SpotFake (Singhal et al., 2019) is a shallow multimodal neural network built on top of VGG-19 image and BERT text embeddings and trained with a cross-entropy loss. We re-implement the model in PyTorch and use the hyper-parameter settings given in the paper.

Table 2: Accuracy (Acc) and Macro-F1 (F1) for binary (BCD) and tertiary claim detection (TCD) in percent [%]. As described in Section 3.5, we use the training split (T) with resolved (index C) and without (no index) conflicts, and the evaluation (test) split (E_C) with conflicts. This evaluation split reflects the real-world scenario for the subjective task of tertiary claim classification (TCD). Unless FT (fine-tuning) is indicated, all models (except MVAE and SpotFake) are SVM models trained on extracted features.

We report accuracy (Acc) and Macro-F1 (F1) for binary (BCD) and tertiary claim detection (TCD) in Table 2. We also present the fraction (in %) of visually-relevant and visually-irrelevant (text-only) claims retrieved by each model in Table 3. Please note that in Table 2 and Table 5, BCD results are shown for only one split (T_C → E_C), because there are no conflicts in the labels for binary claim classification. Although we do not train the models specifically to detect the visual claim labels, we analyze the fraction of retrieved samples in order to evaluate the bias of binary classification models towards a modality.

As mentioned in Section 3, we observed disagreements in the annotated data that reflect the real-world difficulty and subjectivity of the problem. Therefore, we analyze the effect of keeping (T_C, E_C) and removing (T, E) conflicting examples in the training and evaluation splits (Tables 2 and 5). The findings are as follows: 1) multimodal models are more sensitive to the conflict resolution strategy, as most have lower accuracy but a relatively better F1 score when trained on T_C, whereas visual and textual models perform better on both metrics when trained on T_C; 2) overall, training on T_C with conflict resolution is the better strategy, yielding a higher F1 score, i.e., better detection of claims and check-worthiness (the classes with fewer samples); and 3) comparing all cross-split experiments in Table 2 and Table 5, multimodal models perform best on the "without conflicts" T and E splits. The latter two observations also apply to the retrieval of visually-relevant and visually-irrelevant claims in Table 3 and Table 6.
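Returning to the fine-tuning input construction described above (tweet text concatenated with up to 128 words/tokens of OCR text), a minimal sketch could look as follows; the tokenizer, separator, and combined maximum length of 256 are assumptions rather than the authors' exact settings.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")


def build_text_input(tweet_text: str, ocr_text: str, max_ocr_words: int = 128):
    """Append (truncated) OCR text to the tweet text and tokenize for fine-tuning."""
    ocr_text = " ".join(ocr_text.split()[:max_ocr_words])
    combined = f"{tweet_text} {ocr_text}".strip()
    return tokenizer(combined, truncation=True, max_length=256,
                     padding="max_length", return_tensors="pt")
```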
For image-based models, CLIP_I (70.0, 69.8) performs considerably better than ResNet-152's ImageNet features (63.1, 62.6) in terms of both accuracy and F1 (Table 2), which suggests that CLIP's pre-training is better suited to a task where images carry a variety of information and text. This is even more pronounced in Table 3, where the fraction of visually-relevant claims retrieved using CLIP_I (70.3) is higher and comparable to fine-tuned ALBEF ⊕ OCR (71.2).

For text-based models, fine-tuning (FT) BERT gives the best performance, better than any other unimodal model. This result indicates that the problem is inherently a text-dominant task. The model also retrieves the most visually-irrelevant claims when trained on T_C. It should be noted that textual models can still identify visually-relevant claims, since the tweet text can contain a claim or certain cues that refer to the image. Finally, the CLIP_T features perform considerably worse than BERT features, possibly because CLIP is limited to short text (75 tokens) and is not trained like vanilla BERT on a large text corpus.

Table 3: Visually-relevant (V) and visually-irrelevant (text-only) (T) claim detection evaluation. The number of test samples is reported in brackets, and the fraction of them that were retrieved is given in percent [%]. The underlying models are trained for binary claim detection (BCD). The labels for visual relevance are only used for the retrieval evaluation.

For multimodal models, the combination of BERT and ResNet-152 features performs slightly better (0.5-1%) on both metrics in Table 2 for the binary task on the full dataset and, for the tertiary task, when trained on the T split. Although this gain is not impressive, the benefit of combining the two modalities is more obvious in identifying visually-relevant claims (> 10%) in Table 3, which comes at the cost of a lower fraction of visually-irrelevant claims. Similarly for CLIP, the combination of image and text features (CLIP_I⊕T) improves the overall accuracy over CLIP_I or CLIP_T alone. However, we do not see the same benefit for identifying visually-relevant claims (4-5% lower). We also experiment with the combination of BERT features and CLIP's image features, which improves the overall accuracy further but indicates that the model relies strongly on the text (65.8 vs. 57.7 visual retrieval %) rather than on the combination. This stronger reliance on text is possibly not a trait of the model alone, but could also be caused by an incompatibility between BERT and CLIP_I features.

Finally, we achieve the best performance (by 1-4%) on binary and on tertiary (when trained on T) claim detection by fine-tuning ALBEF with and without OCR, respectively (Table 2, block 3, last row). While the use of OCR text in the SVM models is not considerably helpful, adding OCR to ALBEF retrieves the maximum number of visually-relevant claims (71.2%) without losing much on visually-irrelevant claims (79.3%) when trained on T (Table 3, block 2, last row). These results point towards a major challenge: combining multiple modalities while retaining intra-modal information (and influence) for the task at hand. As noted in Section 4.3.1, an interesting result is that ALBEF in particular is less robust to resolved conflicts (split T_C) in the data compared to using BERT alone. On closer inspection, these conflicts are mostly caused by the relevance of the image to the text, and the gap widens further when such conflicts are present in both training and evaluation.
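The retrieval analysis in Table 3 (and Table 6) reduces to computing, for a binary claim detector, the fraction of gold visually-relevant and text-only claims that are predicted as claims. A minimal sketch, with illustrative variable names:

```python
import numpy as np


def retrieval_fractions(y_pred: np.ndarray, is_claim: np.ndarray,
                        visually_relevant: np.ndarray) -> dict:
    """Fraction of gold claims (per visual-relevance subset) predicted as claims.

    y_pred: 1 = predicted claim; is_claim / visually_relevant: boolean gold labels.
    """
    subsets = {
        "visually_relevant": is_claim & visually_relevant,
        "text_only": is_claim & ~visually_relevant,
    }
    return {name: float(y_pred[mask].mean()) if mask.any() else float("nan")
            for name, mask in subsets.items()}
```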
Figure 2 shows a few examples where our best multimodal model correctly classifies, whereas unimodal models based on either image or text do not. All the samples in the figure have images that have some connection to the tweet text. The image in Figure 2b has a connection to one of the words or phrases (e.g., washing your hands) in the tweet text but is not relevant for the claim itself. Figure 2a includes an image with the claim itself and a very generic scene in the background. Both image and text in Figure 2c and Figure 2d are relevant, and the image acts as evidence and additional information. In all these examples, a rich set of information extraction and complex cross-modal learning is required to identify claims in multimodal tweets. When comparing results of recent state-of-the-art architectures for fake news detection, SpotFake (Singhal et al., 2019) does considerably better than MVAE (Khattar et al., 2019) but worse than any of our baseline models. In this paper, we have presented a novel MM-Claims dataset to foster research on multimodal claim analysis. The dataset has been curated from Twitter data and contains more than 3000 manually annotated tweets for three tasks related to claim detection across three topics, COVID-19, Climate Change, and Technology. We have evaluated several baseline approaches and compared them against two state-of-the-art fake news detection approaches. Our experimental results suggest that the fine-tuning of pre-trained multimodal and unimodal architectures such as ALBEF and BERT yield the best performance. We also observed that the overlaid text in images is important in information dissemination, particularly for claim detection. To this end, we evaluated a couple of strategies to incorporate OCR text into our models, which yielded a much better trade-off between identifying visuallyrelevant and visually-irrelevant (text-only) claims. In the future, we will explore other and novel architectures for multimodal representation learning and other information extraction techniques to incorporate individual modalities better. We also plan to investigate fine-grained overlaps of concepts and meaning in image and text, and expand the dataset to COVID-19 related sub-topics and specific climate change events. In the following we include additional hyperparameter details (A.1) and experimental results (A.2), additional dataset and annotation process details (A.3), and some annotated tweets for multimodal claim detection (A.4). For fine-tuning BERT and ALBEF, we use a batchsize of 16 and 8 (size constraints) respectively. We train the models for five epochs and use the best (on validation set) performing (accuracy) model for evaluation. For BERT, a dropout with the ratio of 0.2 is applied before the classification head. Further, we use AdamW (Loshchilov and Hutter, 2019) as the optimizer with a learning rate of 3e − 5 and a linear warmup schedule. The learning rate is first linearly increased from 0 to 3e − 5 for iterations in the first epoch and then linearly decreased to 0 for the rest of the iterations in 4 epochs. For AL-BEF, we use the recommended fine-tuning hyperparameters and settings from the publicly available code. We experiment with CLIP's three variants that use different visual encoder backbones, ResNet-50 (RN50), ResNet-50x4 (RN504) and a vision transformer (ViT-B/16) (Dosovitskiy et al., 2021) with BERT as textual encoder backbone. 
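The fine-tuning schedule described above for BERT (AdamW, peak learning rate 3e-5, linear warm-up over the first epoch, linear decay over the remaining four epochs) can be sketched with the standard Hugging Face scheduler helper. Model and data loading are omitted; this is illustrative rather than the exact training script.

```python
import torch
from transformers import get_linear_schedule_with_warmup


def make_optimizer_and_scheduler(model, steps_per_epoch: int, epochs: int = 5,
                                 lr: float = 3e-5):
    """AdamW with linear warm-up over epoch 1 and linear decay over epochs 2-5."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=steps_per_epoch,             # ramp 0 -> lr during epoch 1
        num_training_steps=steps_per_epoch * epochs,  # then decay lr -> 0
    )
    return optimizer, scheduler
```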
Among these variants, we select the models for the textual and multimodal SVM experiments based on the performance (higher accuracy) of the features from the respective visual encoders. Table 4 shows the performance of the different visual encoders' features (with an SVM) on binary and tertiary claim detection. It should be noted that, just like ALBEF, the CLIP models can be fine-tuned on image-text tweet pairs for the binary and tertiary tasks. However, when we experimented with fine-tuning the last few layers of CLIP with a classification head on top, it always performed worse than classifying the extracted features with an SVM. This is probably due to our relatively small labeled dataset, which is not sufficient for fine-tuning CLIP for the task.

Table 4: Performance of features from CLIP's different visual encoder backbones. Accuracy (Acc) and Macro-F1 (F1) for binary (BCD) and tertiary claim detection (TCD) in percent [%]. As described in Section 3.5, we use the training split (T) and evaluation (test) split (E) with resolved (index C) and without (no index) conflicts.

In Section 4, we show results for tertiary claim detection (TCD) on the evaluation split "with resolved conflicts" (E_C) when training on T and T_C. Here, in Table 5, we show the evaluation on the "without conflicts" evaluation split (E). As with the evaluation on E_C, multimodal models are more sensitive to training on T_C, where the conflict resolution strategy causes the accuracy to drop for all models. However, in this case, the CLIP and ALBEF models achieve a higher F1 score (as well as accuracy) when trained on T. Even with less training data, these models perform best among all evaluated multimodal models. When training on T_C, BERT performs best, closely followed by ALBEF with OCR text.

As described in Section 4.3.1, the evaluation of retrieved visually-relevant and visually-irrelevant claims on E mirrors the evaluation on E_C. Even though CLIP_I and fine-tuned BERT retrieve the largest fractions of the two claim types, all models do better when trained on T_C than on T. Overall, for a realistic scenario, training on T_C gives multimodal models the best trade-off between Acc, F1, and retrieved claims.

Following the results on E_C in Section 4 for the binary and tertiary tasks, we show row-normalized confusion matrices based on predictions from the ALBEF ⊕ OCR ⊕ FT model. Figure 3a shows the confusion matrix on E_C for binary claim detection (BCD), whereas Figure 3b shows the matrices on E_C when training on T_C (b.1) and T (b.2). Although the true positives for the not-claim class remain the same, the confusion between the not-check-worthy and check-worthy classes is less severe when trained on T_C.

Table 6: Visually-relevant (V) and visually-irrelevant (text-only) (T) claim detection evaluation; additional results on the evaluation split without conflicts (E). The number of test samples is reported in brackets, and the fraction of them that were retrieved is given in percent [%]. The underlying models are trained for binary claim detection (BCD). The labels for visual relevance are only used for the retrieval evaluation.

The amount of text that can be detected in an image varies, as can be seen in Figure 8. As a consequence, we experimented with the length of the OCR text, in terms of the number of words, for both binary and tertiary claim detection with ALBEF. We observe (see Figure 5) that 128 words give comparable or better performance than any smaller number of OCR words, across both tasks and the number of fine-tuned layers.
We chose 128 words instead of 64 because the model with 128 words showed a balanced performance for the binary, tertiary, and retrieved-claim evaluations. Models with 64 or with more than 128 words had a lower performance for either visually-relevant or visually-irrelevant retrieved claims.

We ran ablation experiments to see the effect of training the last few layers of BERT and ALBEF ⊕ OCR. We experiment with fine-tuning the last six, four, and two layers as well as only the last layer of each model. The results are shown in Figure 4. Overall, fine-tuning the last two layers of BERT and the last four layers of ALBEF gives the best results. Therefore, all fine-tuning results for BERT, ALBEF, and ALBEF ⊕ OCR are based on this observation. For fine-tuning six or more layers, the unlabeled dataset could be incorporated in the future as a pre-training step followed by task-specific training.

The check-worthiness cues introduced in Section 3.3 are as follows:
• Harmful: if the statement attacks a person, organization, country, group, race, community, etc. The intention of such statements can be to spread rumours about an individual or a group, which should be checked by a professional or flagged and prioritized for further checking.
• Urgent or breaking news: such statements are news-like, where the claim is about prominent people (public personalities such as politicians or celebrities), organizations, countries, and events (such as disease outbreaks, forest fires, or a stock market crash).
• Up-to-date: such claims often refer to official documents and contain parts of clauses in climate agreements or articles in a constitution. This information is vital for checking, as many people consume social media as a means of news and information and believe it to be true.

Table 7 shows the number of samples after each filtering step. Duplicate removal is performed across all data irrespective of the topic, in order to avoid duplicates that might fall into more than one topic. Figure 6 shows the topic and class distributions in the labeled dataset.

Table 8: Labeled data characteristics in terms of label types and topics, shown as training/validation/test splits. The second and third blocks show claims that are check-worthy (and not) and visually relevant (and not), respectively. Red: "with resolved conflicts"; black: "without conflicts".

Since three different annotators labeled each sample, a majority is always possible for the binary claim classification, which yields unambiguous labels. However, a majority vote cannot be achieved for the tertiary and visual claim classification tasks when all three annotators choose different options out of the three possible classes. In Table 9, we provide the classes chosen by each annotator and the class derived after resolving the conflict. The first case is resolved by giving priority to the claim but not check-worthy label, as check-worthiness is a stricter constraint that should only be decided by a majority, while two annotators indicated that the given sample is a claim (A-2 → Q1-Yes, A-3 → Q1-Yes). For the second case, concerning visual claims, we select the visually-relevant claim label, as image and text may be related even if one annotator answered "no" to the claim question (A-1 → Q1-No), while at least one annotator indicated that the sample is a visually-relevant claim (A-3 → Q3-Yes).

Table 8 shows the split-wise distribution of topics and labels in the data. Numbers in red and black correspond to the "with resolved conflicts" and "without conflicts" splits, respectively.
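The conflict-resolution rules just described can be expressed compactly; the label strings are placeholders and the functions are illustrative, not the authors' tooling.

```python
from collections import Counter
from typing import Sequence


def resolve_tertiary(labels: Sequence[str]) -> str:
    """Majority if it exists; on a three-way split, two annotators said 'claim',
    so fall back to the weaker 'claim but not check-worthy' label."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= 2 else "claim_not_checkworthy"


def resolve_visual(labels: Sequence[str]) -> str:
    """Majority if it exists; on a three-way split, prefer 'visually-relevant claim'."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= 2 else "visually_relevant_claim"
```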
Although we crawl tweets from topic-based corpora, we further filter them by manually marking the top 300 hashtags (sorted by occurrence) that are relevant to each topic. Figure 7 shows the top-20 relevant hashtags for each topic.

A.3.7 Annotation Tool
Figure 7d shows the annotation screen with the image-text pair, the claim questions, and a text box for feedback on difficult tweets and tweets with missing images.

We include multiple annotated samples corresponding to the visually-relevant claim (see Figure 8) and not a claim (see Figure 9) classes.

Acknowledgements: This work was funded by the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement no. 812997 (CLEOPATRA project), and by the German Federal Ministry of Education and Research (BMBF, FakeNarratives project, no. 16KIS1517).

References (titles):
• Multimodal fusion with recurrent neural networks for rumor detection on microblogs
• Newsbag: A benchmark multimodal dataset for fake news detection
• MVAE: Multimodal variational autoencoder for fake news detection
• Memes in the wild: Assessing the generalizability of the hateful memes challenge dataset
• Computing Krippendorff's alpha-reliability
• Context dependent claim detection
• Align before fuse: Vision and language representation learning with momentum distillation
• Context-independent claim detection for argument mining
• Climate Change Tweets Ids
• Decoupled weight decay regularization
• Fakeddit: A new multimodal benchmark dataset for fine-grained fake news detection
• The CLEF-2021 CheckThat! lab on detecting check-worthy claims, previously fact-checked claims, and fake news
• What is multimodality?
• Fighting COVID-19 misinformation on social media: Experimental evidence for a scalable accuracy-nudge intervention
• Learning transferable visual models from natural language supervision