Multi-Modal Semantic Inconsistency Detection in Social Media News Posts
Scott McCrae, Kehan Wang, Avideh Zakhor
2021-05-26
This research was supported by the DARPA Semantic Forensics program. Funding for compute resources was provided by Google Cloud.
As computer-generated content and deepfakes make steady improvements, semantic approaches to multimedia forensics will become more important. In this paper, we introduce a novel classification architecture for identifying semantic inconsistencies between video appearance and text caption in social media news posts. We develop a multi-modal fusion framework to identify mismatches between videos and captions in social media posts by leveraging an ensemble method based on textual analysis of the caption, automatic audio transcription, semantic video analysis, object detection, named entity consistency, and facial verification. To train and test our approach, we curate a new video-based dataset of 4,000 real-world Facebook news posts for analysis. Our multi-modal approach achieves 60.5% classification accuracy on random mismatches between caption and appearance, compared to accuracy below 50% for uni-modal models. Further ablation studies confirm the necessity of fusion across modalities for correctly identifying semantic inconsistencies. There has been a great deal of attention on misinformation and deepfakes recently, especially with regard to the ongoing COVID-19 pandemic and the 2020 US Presidential election. There are a variety of methods for detecting both manipulated media, such as Photoshopped images, and machine-generated data, such as images from generative adversarial networks (GANs). However, these tools tend to focus on a single modality, such as imagery, and look for clues that the image has been manipulated using statistical methods or by leveraging metadata. While these tools are indisputably useful, we are interested in investigating multi-modal analysis, where we attempt to detect manipulations or misinformation using semantic clues from a variety of modalities. The use of multiple modalities allows us to reason about the semantic content of each source. For instance, a caption describing an out-of-control protest would be inconsistent with a video of a candle-light vigil, and a video of a reporter in the midst of a hurricane in Florida would be inconsistent with a news article on the effects of deforestation in the Amazon. On their own, neither modality is manipulated, but together they represent an inconsistency. This might model the threat of "cheapfakes," where an attacker lazily sources pairs of material to output misinformation at scale; an attacker attempting to misrepresent some original source; or an attacker with one or more undetectably altered modalities generated by a system unaware of high-level semantic consistency. While current methods are able to detect GAN-based images or deepfakes and text generated from language models, such generation techniques may continue to improve and begin to fool uni-modal detection approaches. In this paper, we introduce a novel classification architecture for identifying semantic inconsistencies between video appearance and text caption in social media news posts. To analyze the semantic alignment of videos and captions, we need three main ingredients. First, and most importantly, we need pristine data as ground truth.
Second, we need to extract semantic feature representations from each modality and its constituents, such as transcripts and named entities. Third, we need to jointly reason about semantic content. In the following sections, each of these components is addressed in turn. Section II describes related work in natural language processing, computer vision, and multi-modal analysis. Section III describes our data collection and pre-processing methods. Section IV describes experimental results and ablation studies. Section V provides a conclusion and a discussion of future work for this project. The field of natural language processing has seen a rapid shift in recent years towards transformer-based methods, introduced in [30], with large language models achieving state-of-the-art performance [5, 16, 23, 2]. Machine learning in computer vision has been dominated by convolutional methods, with 2D methods such as ResNet [10] becoming standard backbone networks. Several later works have extended 2D convolutional networks to process videos [35, 9, 33]. Approaches such as [35] extend convolution into three dimensions, while [33] introduces separable computations over the spatial and temporal domains to increase efficiency. [20] adapts [33] to include text embeddings which are jointly learned with video embeddings, and is trained on a very large corpus of instructional videos [21]. Recent research has shown promising results adapting transformer methods to process videos [1], opening the door to processing video clips which are longer than a few seconds. Research in multi-modal learning with text and imagery has demonstrated the efficacy of learning modality-specific embeddings [7]. New methods have been developed with the goal of leveraging transformers to jointly process text and imagery [17, 27, 14, 28]. [19] extends joint text and image transformer-based methods to process text and video clips. [15] employs cross-modal transformers with video frame and text embeddings for multi-modal learning. A variety of methods have been introduced recently for detecting computer-generated content and semantic inconsistencies. [34] detects neural fake news by modeling a joint distribution over a news article's domain, date, authors, headline, and body. [31] demonstrates the relative ease of detecting GAN-generated images from a variety of generators that were state-of-the-art at the time of publication. [29] checks for consistency between a news article and its images and captions. [26] attempts to identify and attribute inconsistencies between images and their captions. [18] introduces and evaluates detection methods on a new dataset for the task of identifying various semantic inconsistencies between images and captions. We construct our dataset using raw data accessed via CrowdTangle [3], a public insights tool owned and operated by Facebook. The platform can surface public Facebook posts, including sources such as posts by celebrities and news outlets. It does not include paid advertisements unless they began as organic, non-paid posts that were subsequently "boosted" using Facebook's advertising tools. It also does not include activity on private accounts, or posts made visible only to specific groups of followers. We used the platform's historical data feature to construct our dataset, downloading all public Facebook posts from the past decade in the US General Media group that contained videos, for a total of 647,009 posts.
This list of organizations was curated by CrowdTangle, and ranges from large, relatively non-partisan sources such as The Associated Press to smaller, more partisan sources such as Breitbart News. While CrowdTangle provides access to a large number of Facebook posts, it has two limitations that impact this project. First, it does not provide labels for whether or not a post contains misinformation. Second, since it does not provide video files, they must be scraped from Facebook using other tools. We therefore used CrowdTangle to source posts and the open-source youtube-dl tool [4] to scrape the corresponding video files. Because of this limitation, we were only able to scrape a sample of 4,651 videos. To construct a labelled dataset for multi-modal semantic alignment, we treat the original caption-video post pairs as pristine examples and randomly swap in new captions from other posts to generate inconsistent examples. Examples are shown in Figure 1. In this manner, a pristine example features a real-world video and a real-world caption which were intended to relate to each other by the organization which posted them. We assume that pristine examples are semantically consistent across modalities. An inconsistent example still features a real-world video and caption, except the video and caption are not taken from the same post. In an inconsistent example, the caption is the only modality which is manipulated, i.e., swapped. For the additional modalities described in the following subsections, such as a video's transcript and Facebook reactions, each example video is always paired with its matching transcript and reactions. This labelling approach assumes that randomly swapping in a new caption results in some amount of semantic mismatch between the new caption and the original video. In practice, half of the examples in our dataset are pristine and half are inconsistent. We opt to perform swaps on real-world captions rather than creating inconsistencies by generating captions using large language models. This avoids reducing the problem of identifying semantic inconsistencies across modalities to the problem of detecting whether or not a caption is synthetically generated. Although some real news posts may include synthetically generated text, such as short reports on financial news [22], we do not attempt to filter out posts which might contain synthetic text. If such synthetic posts are present, they would not be correlated with semantic inconsistency labels due to our random swapping approach. Our chosen task of detecting caption and video appearance inconsistency is challenging because of the abstract relationships between captions, videos, and other modalities. The captions in our dataset do not represent dense descriptions of video clips, nor are they necessarily literal descriptions of a video. Our transcripts are noisy due to automatic generation, and are not guaranteed to be faithful representations of what is said. We do not have audio descriptions of videos. Our videos cover a wide range of styles and subjects, and are not necessarily well-lit and well-produced, as one might expect in datasets with movie clips. However, the random swapping approach does make this task easier than some more adversarial swapping strategies. We hope to strike a balance between perceived human difficulty and the challenge of learning abstract associations between modalities from a small set of noisy data.
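To make the labelling procedure concrete, the following is a minimal sketch of the caption-swapping step. It is illustrative only: the post structure and field names are assumptions, not the actual data format used in the paper.

```python
import random

def build_swap_dataset(posts, swap_fraction=0.5, seed=0):
    """Pair each post's video with either its own caption (pristine, label 0)
    or a caption drawn from a different post (inconsistent, label 1).

    `posts` is assumed to be a list of dicts with at least the keys
    'video_id', 'caption', 'transcript', and 'reactions'.
    """
    rng = random.Random(seed)
    examples = []
    for post in posts:
        if rng.random() < swap_fraction:
            # Sample a caption from some *other* post to create a mismatch.
            other = rng.choice([p for p in posts if p["video_id"] != post["video_id"]])
            caption, label = other["caption"], 1
        else:
            caption, label = post["caption"], 0
        examples.append({
            "video_id": post["video_id"],
            "caption": caption,
            "transcript": post["transcript"],   # always the video's own transcript
            "reactions": post["reactions"],     # always the video's own reactions
            "label": label,
        })
    rng.shuffle(examples)
    return examples
```

Because only the caption field is ever replaced, transcripts and reactions stay tied to their original videos, mirroring the labelling assumption described above.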
After collecting video data, we take several steps to standardize formats and to prepare the files for input to our system. Figure 2 illustrates how data flows through our model. Each video is transcoded to a constant resolution of 256×256 pixels and a constant frame rate of 10 frames per second. All files are converted to mp4 videos, regardless of the original format. Audio is left unchanged. Video transcoding is handled using FFmpeg [6]. Because videos are scraped at random from Facebook, there is a very wide range of video lengths, styles, and subjects. In our dataset, the minimum video length is 1 second, the maximum length is 14 hours, and the mean is 8.5 minutes. To handle the long and variable video lengths, we adopt a keyframe-based approach. Each video is broken up into a sequence of 32-frame-long clips, with each clip beginning at a keyframe. A keyframe is intended to be a point in the video where there is a change in scene or camera angle. These keyframes should be well-aligned to the starts of semantically consistent clips. In practice, we identify keyframes as timestamps in a video where the FFmpeg [6] scene detection filter is triggered, with the scene detection threshold set at 0.4. If no keyframes are detected, which might be the case with very short videos or videos which consist of a single shot, we create placeholder keyframes every 3.2 seconds, corresponding to 32 frames. In this manner, a video with no detected keyframes is split into 32-frame-long clips every 32 frames. We choose to use 16 keyframes per video, taking into account that 73% of videos in our dataset have at most 16 keyframes. We did not observe a significant difference in performance between using 8 or 16 keyframes. Every video is transcribed using the DeepSpeech [8] transcription system. Before passing a video's audio stream into DeepSpeech, we transcode it using FFmpeg to the PCM signed 16-bit little-endian format with a sample rate of 16kHz, apply a highpass filter with cutoff 200Hz, and apply a lowpass filter with cutoff 3kHz. Using these filters, the generated transcripts are generally faithful and fluent, although they are imperfect and tend to misspell named entities. Below is an excerpt from an example audio transcript, with typos, generated using DeepSpeech:

"the fourth democratic presidential debate wrapped up in ohio on tuesday minnesota senator amicable no time getting back on the campaign trail she picked off with a tour of new hampshire traveling to all ten counties and just thirty hours overcasting a wave of support after snagging the spotlight on tuesday night going head to head against fortune elizabeth warehouses not even the billionaire to protect billionaire wreaking time locked in and more than thirteen minutes that prefer in sir behind warren and former vice president joined some people like what they heard on twitter cobhouse received one point one million dollars in campaign donations in the twenty four hours after the debate..."

While our transcripts are mostly correct, they tend to include misspelled names and other misidentified words. In this excerpt, misspelled names include "amicable," "warehouses," and "cobhouse." The correct names are "Amy Klobuchar," "Warren," and "Klobuchar." These errors make it difficult to compare named entities in captions and transcripts, as transcript typos might not correspond to common human mistakes which could be corrected by spell-check methods. While some videos provide closed captions, we use automatically generated transcripts uniformly across our dataset to avoid introducing any linguistic biases in the fluency or style of transcripts from different sources.
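The preprocessing above maps onto standard FFmpeg tooling. The sketch below shows plausible FFmpeg invocations for the transcoding, scene-based keyframe detection, and audio filtering steps; the exact commands are not given in the paper, and the mono downmix before DeepSpeech is an assumption.

```python
import re
import subprocess

def transcode_video(src, dst):
    """Normalize a scraped video: 256x256, 10 fps, mp4 container; audio stream copied unchanged."""
    subprocess.run([
        "ffmpeg", "-y", "-i", src,
        "-vf", "scale=256:256", "-r", "10",
        "-c:a", "copy", dst,
    ], check=True)

def detect_keyframes(src, threshold=0.4):
    """Return timestamps (seconds) where FFmpeg's scene-change score exceeds `threshold`."""
    result = subprocess.run([
        "ffmpeg", "-i", src,
        "-vf", f"select='gt(scene,{threshold})',showinfo",
        "-f", "null", "-",
    ], stderr=subprocess.PIPE, text=True)
    # The showinfo filter logs "... pts_time:12.345 ..." to stderr for each selected frame.
    return [float(t) for t in re.findall(r"pts_time:([0-9.]+)", result.stderr)]

def extract_audio_for_asr(src, dst_wav):
    """Audio pre-processing before DeepSpeech: 16 kHz PCM s16le,
    band-limited with a 200 Hz high-pass and 3 kHz low-pass filter."""
    subprocess.run([
        "ffmpeg", "-y", "-i", src, "-vn",
        "-af", "highpass=f=200,lowpass=f=3000",
        "-ar", "16000", "-ac", "1", "-c:a", "pcm_s16le", dst_wav,
    ], check=True)
```

Videos whose scene-detection output is empty would then fall back to placeholder keyframes every 3.2 seconds, as described above.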
In this section, we describe our approaches to verifying named entities using facial verification and text-based comparison of names in captions and audio transcripts. 1) Facial Verification: We implement facial verification for named entities in order to check semantic consistency between modalities; this subsection describes the implementation of our facial verification system. We define facial verification in our context as checking whether or not people named in the caption of a video actually appear in the video. To accomplish this, we need to identify people in captions and build a database of representations for them. People are identified using the named entity recognition (NER) feature available in the spaCy [12] natural language processing library. Using spaCy's en_core_web_trf language model, which is based on RoBERTa [16], we run NER on our dataset of captions and take all strings with the PERSON label as names of people. These strings are compiled into a set of people whose names appear in our dataset. Once all named people are identified, we compute a representation for each person. To this end, we query Google Images for the top 10 results for each name. These images are considered ground-truth references for how each named entity should appear, as shown in Figure 3. Having multiple images per name allows our dataset to contain examples of each person with potentially diverse lighting conditions, poses, ages, and camera angles. Once reference images are collected, we use FaceNet [25] to compute facial recognition features for each image. The features for each set of 10 reference images are averaged to create a general representation for each name.

Fig. 3. Representations of named entities in a caption, generated by querying Google Images, are compared against frames of a video. In this example, musicians Phoebe Bridgers and Paul McCartney could be verified by checking for their faces in the video, although there will also be apparent mismatches between each person's name and the other face present. Images courtesy [13, 24]. Best viewed in color.

Figure 2 shows how FaceNet features are used in our model. At inference time, FaceNet features are also computed for a video's keyframes. We then take the cosine similarity between the representations for names appearing in the caption and the features for each keyframe in the video. In practice, these keyframe features are pre-computed for efficiency. The similarity scores are passed on to our model's classification head to be used alongside features from other modalities. This approach to person identification has a few drawbacks. The reference images of named entities from Google Images are not manually curated, which introduces issues such as the appearance of multiple people in a reference image. Additionally, in some cases, an individual might be referenced first by their full name, e.g., "Alice Appleseed," and then only by their first name, "Alice." Our NER approach does not account for this, and "Alice" would not be associated with "Alice Appleseed." In this case, the system may try to verify the appearance of "Alice" in a video without knowing which "Alice" it should look for. This is less of a problem for individuals who are commonly referred to by a single name or by a variety of distinctive names. For instance, celebrities can often be uniquely identified by their first or last name, and many politicians are referred to by their last names. While there will be separate reference images for the named entities "Kanye West" and "Kanye," or the entities "Nancy Pelosi" and "Pelosi," they will be faithful representations of the same person.
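A minimal sketch of this facial verification pipeline is shown below, assuming spaCy for NER and the facenet-pytorch package as the FaceNet implementation; the paper does not specify a particular FaceNet implementation, and the helper names are illustrative.

```python
import spacy
import torch
import torch.nn.functional as F
from facenet_pytorch import MTCNN, InceptionResnetV1
from PIL import Image

nlp = spacy.load("en_core_web_trf")                       # RoBERTa-based spaCy pipeline used for NER
mtcnn = MTCNN(image_size=160)                             # face detector / cropper
facenet = InceptionResnetV1(pretrained="vggface2").eval() # FaceNet-style embedding network

def people_in_caption(caption):
    """Strings labeled PERSON by spaCy's NER."""
    return [ent.text for ent in nlp(caption).ents if ent.label_ == "PERSON"]

def face_embedding(image_path):
    """512-d embedding of the first face found in an image, or None if no face is detected."""
    face = mtcnn(Image.open(image_path).convert("RGB"))
    if face is None:
        return None
    with torch.no_grad():
        return facenet(face.unsqueeze(0)).squeeze(0)

def reference_embedding(image_paths):
    """Average the embeddings of the ~10 reference images retrieved for a name."""
    embs = [e for e in (face_embedding(p) for p in image_paths) if e is not None]
    return torch.stack(embs).mean(dim=0) if embs else None

def verification_scores(name_refs, keyframe_paths):
    """Cosine similarity between each named person's reference embedding and each
    keyframe's face embedding; these scores feed the classification head."""
    scores = {}
    for name, ref in name_refs.items():
        sims = []
        for kf in keyframe_paths:
            emb = face_embedding(kf)
            sims.append(F.cosine_similarity(ref, emb, dim=0).item() if emb is not None else 0.0)
        scores[name] = sims
    return scores
```

In practice, both the reference embeddings and the keyframe embeddings would be pre-computed and cached, as noted above.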
2) Name Verification: While it is possible to verify the appearance of named entities from captions in videos, we can also compare captions to audio transcripts. This can alleviate the problem where an individual is a topic of discussion rather than a subject appearing in a video. To accomplish this, we compute character-based embeddings for the names which appear in captions and/or transcripts. The intuition behind this choice is to focus on misspellings rather than on any semantic concepts associated with names. Given a string representing a named entity, we convert each character in the string to its lower-case ASCII numerical value and pad to a maximum length of 64 characters. In our dataset, 100% of strings identified as names have at most 64 characters. We then feed this vector into a 2-layer fully connected network, with hidden size 64 and output size 32. These name embeddings are then passed on to our classification head for use along with other modalities, as shown in Figure 2. By using learned embeddings, we are able to make comparisons between captions and audio transcripts even when there are transcription errors in named entities. Since our data is collected from Facebook posts, we also have access to the Facebook reactions for each post. On Facebook, users are able to respond to a post with the following reactions: Like, Love, Wow, Haha, Sad, Angry, and Care. We hypothesize that these reactions can provide a coarse measure of the perceived semantics of an entire post, taking into consideration all of its modalities. If so, the semantic inconsistency between an uplifting video paired with a sad or inflammatory caption might be reflected in an inconsistency between the post as a whole and its reactions. We take the normalized reactions to a post as an input feature to our model, shown in Figure 2. To normalize reactions, we divide the raw count of each reaction, such as Love, by the total number of reactions a post received. In this manner, viewers' reactions to content are separated from the popularity of a post, as all normalized reactions are bound between 0 and 1. One problem with this approach is that our data is collected from 2010 to 2020, but reactions were first introduced in 2016, and the Care reaction was added in 2020. So, for some posts in our dataset, users would not have been able to choose a reaction other than Like. We adopt a uni-modal ensemble approach to multi-modal fusion, as shown in Figure 2. To classify whether or not a post has a semantic inconsistency, we take as input the video split into clips starting at keyframes, the audio transcript, the normalized reactions to the video's pristine post, and a potentially inconsistent caption. In addition to the named entity verification features described in Section III-C, we compute features for the caption, transcript, and video clip inputs. Both the audio transcript and caption are processed using a pre-trained BERT [5] language model, implemented by HuggingFace [32]. When using the language model, inputs are truncated to their first 1024 characters and split into two chunks of 512 characters each. We create these splits because the maximum input length of the language model is 512 tokens. In our dataset, 60% of audio transcripts and 99.97% of captions have at most 1024 characters.
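A minimal sketch of this text-encoding step using the HuggingFace transformers library is shown below. The bert-base-uncased checkpoint and the use of the [CLS] vector as the chunk representation are assumptions; the paper specifies only a pre-trained BERT model and the 1024/512-character truncation and splitting.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()

def bert_features(text, max_chars=1024, chunk_chars=512):
    """Encode a caption or transcript with BERT.

    The text is truncated to its first `max_chars` characters and split into two
    `chunk_chars`-character chunks (the second may be empty for short inputs);
    each chunk is encoded separately and the two [CLS] vectors are concatenated.
    """
    text = text[:max_chars]
    chunks = [text[i:i + chunk_chars] for i in range(0, max_chars, chunk_chars)]
    feats = []
    with torch.no_grad():
        for chunk in chunks:
            enc = tokenizer(chunk, return_tensors="pt", truncation=True, max_length=512)
            out = bert(**enc)
            feats.append(out.last_hidden_state[:, 0, :])   # [CLS] embedding, 768-d per chunk
    return torch.cat(feats, dim=-1).squeeze(0)              # 1536-d feature per caption/transcript
```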
The video clips are processed using both a video-understanding network and an object detection network. For video understanding, we use S3D-MIL-NCE (S3D) [20], and for object detection, we use a ResNet50 model [10]. S3D is run on the full 32-frame sequence in each of the video clips, while ResNet is run on the first frame of each clip. We use the mixed_5c output of S3D. For each modality, we learn an embedding to a shared semantic latent space. Figure 2 shows our full model architecture. Each embedding function is implemented as a 2-layer fully connected network, mapping from the output feature space of a modality's feature extraction network to a common 256-dimensional latent space. The learned semantic embeddings for video clips, object detection, audio transcript, and caption are concatenated and passed through a Long Short-Term Memory (LSTM) [11] module to condense information from the clips into one summary feature vector. This fuses multi-modal content at the clip level, before the output of the LSTM is concatenated with the named entity verification features. The final combined feature vector is passed on to our classification network. Our classifier is implemented as a 3-layer fully connected network, with input size 1096, hidden layer sizes 512 and 128, and output size 2.
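Putting these pieces together, the following is a simplified sketch of the fusion model in PyTorch. Only the 256-dimensional latent space, the 1096-dimensional classifier input, and the 512/128 hidden sizes come from the description above; the remaining feature dimensions, the LSTM hidden size, the auxiliary-feature breakdown, and the sharing of post-level text features across clips are assumptions.

```python
import torch
import torch.nn as nn

class EmbeddingHead(nn.Module):
    """2-layer MLP mapping a modality's features into the shared 256-d latent space."""
    def __init__(self, in_dim, latent_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, latent_dim), nn.ReLU(),
                                 nn.Linear(latent_dim, latent_dim))
    def forward(self, x):
        return self.net(x)

class FusionClassifier(nn.Module):
    """Sketch of the ensemble fusion model: per-clip embeddings for video (S3D),
    object detection (ResNet50), transcript (BERT), and caption (BERT) are
    concatenated, summarized over the clip sequence by an LSTM, joined with
    auxiliary named-entity / reaction features, and classified as pristine vs.
    inconsistent. Dimensions not reported in the paper are assumed values."""
    def __init__(self, video_dim=1024, od_dim=2048, text_dim=1536,
                 latent_dim=256, lstm_dim=1024, aux_dim=72):
        super().__init__()
        self.video_head = EmbeddingHead(video_dim, latent_dim)      # S3D mixed_5c features
        self.od_head = EmbeddingHead(od_dim, latent_dim)            # ResNet50 features
        self.transcript_head = EmbeddingHead(text_dim, latent_dim)  # BERT transcript features
        self.caption_head = EmbeddingHead(text_dim, latent_dim)     # BERT caption features
        self.lstm = nn.LSTM(input_size=4 * latent_dim, hidden_size=lstm_dim, batch_first=True)
        self.classifier = nn.Sequential(                            # 1096 -> 512 -> 128 -> 2
            nn.Linear(lstm_dim + aux_dim, 512), nn.ReLU(),
            nn.Linear(512, 128), nn.ReLU(),
            nn.Linear(128, 2))

    def forward(self, video, od, transcript, caption, aux):
        # video/od: (batch, clips, dim); transcript/caption: (batch, dim); aux: (batch, aux_dim)
        n_clips = video.size(1)
        text = torch.cat([self.transcript_head(transcript), self.caption_head(caption)], dim=-1)
        text = text.unsqueeze(1).expand(-1, n_clips, -1)            # reuse post-level text per clip
        clip_feats = torch.cat([self.video_head(video), self.od_head(od), text], dim=-1)
        _, (h_n, _) = self.lstm(clip_feats)                         # summary over the clip sequence
        fused = torch.cat([h_n[-1], aux], dim=-1)                   # concat with NE / reaction features
        return self.classifier(fused)                               # two-class logits
```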
We train our model with the self-supervised dataset described in Section III. We optimize the binary cross-entropy loss function, with our model classifying caption, audio transcript, and video appearance tuples as either pristine or inconsistent. We report classification accuracy for our experiments, computed as the percentage of examples correctly identified as either pristine or inconsistent in our balanced test set. Our data is split such that 15% of the examples are reserved for the test set, and the other 85% for training and validation. We perform a variety of ablation experiments to characterize the impact of each modality on the accuracy of our model. Results are shown in Table I, with each modality removed one-by-one. Because removing object detection features improved model performance, we perform the one-by-one removal ablation studies again with object detection features always removed. These experiments are referred to as "No OD" models in Table I. Note that "removing" a modality refers to removing its features or embeddings from our classifier. For instance, removing the video appearance makes the semantic video embeddings inaccessible to our classifier, although the video is still available for checking named entity consistency with facial verification. As seen in Table I, the best performance is achieved by using all modalities except object detection features, reaching a classification accuracy of 60.5%. Table II shows the confusion matrix for this model. We observe that the model is more accurate when classifying inconsistent examples: it correctly detects inconsistency 71% of the time, and detects consistency 51% of the time. Table III shows results for models using one or two modalities. We observe that named entity verification is key to model accuracy, as seen in Table I. Without facial verification, classification accuracy decreases slightly to 59.6%. Without comparing names between captions and transcripts, classification accuracy falls to 54.8%. Without performing either consistency check, classification accuracy falls to 49.9%, essentially random. We find that named entities are not the only useful information provided by captions. As seen in Table I, when semantic embeddings for captions are removed, accuracy falls to 54.2% when object detection features are present and to 51.5% when they are not. When caption embeddings are removed, the names present in the caption are still made available to our named entity verification process. Combining semantic embeddings and named entity verification is the best use of information in the caption modality. We note that video embeddings from S3D are more important than object detection (OD) embeddings from ResNet. In fact, removing OD embeddings improves the performance of our model, while removing S3D embeddings lowers performance. When OD embeddings are present, removing S3D embeddings leads to 3.8% lower accuracy, and without OD embeddings, removing S3D embeddings leads to 4% lower accuracy. This could be due to the fact that features from S3D already contain representations of objects, so the contribution of object detection features is diluted. OD features are not temporally aware, and so they cannot contain all the information represented in S3D features. Furthermore, the ResNet50 model we take features from is trained for image classification, which may be too general a task to be useful for modelling abstract video semantics. For other datasets, such as those with instructional or cooking videos, we expect OD to play a more important role. We note that Facebook reactions do not seem to provide a useful signal, as removing them from our model did not decrease performance. Finally, we observe that multi-modal fusion is necessary for achieving the best possible accuracy. Removing any one of our modalities decreases performance, with the exception of reactions. More importantly, no uni-modal model can perform better than random. Accuracy for uni- and bi-modal models is shown in Table III. Caption-only and video-only models achieve 49.9% and 49.8% classification accuracy, respectively, confirming that our dataset does not have linguistic or visual bias. A model combining caption and video clip embeddings achieves 49.6% accuracy, highlighting the importance of incorporating additional modalities and features. A model which solely compares named entities in captions and audio transcripts achieves 53.5% accuracy, and a model which compares named entities in captions with video frame facial verification features achieves 51.7% accuracy. While attending to named entities is important, named entities alone are not sufficient for our model to achieve the highest possible accuracy. We have introduced a novel multi-modal semantic inconsistency detection system, along with a dataset of 4,000 real-world social media posts for self-supervised semantic alignment detection. We demonstrate the importance of making use of modalities beyond video appearance and captions, including transcription, facial verification, and comparison of possibly misspelled named entities. We observe that fusion across modalities is key to detecting semantic inconsistencies. We find that named entities provide strong signals for verifying consistency across modalities, and that verifying named entities using both language-based and visual methods is better than using only one. Semantic consistency checks cannot be fully explained by named entity verification, however, highlighting the need to consider semantic embeddings for language and video.
Future work could explore aspects of attributing and characterizing inconsistencies. Modules for explainable facial verification and author attribution could take steps towards addressing this. Our approach would likely benefit from more data, and we are interested in expanding data collection to other social networks such as Twitter and TikTok. Increasing the size of our dataset might also allow for more challenging inconsistencies at training time.

References
Is Space-Time Attention All You Need for Video Understanding?
Language Models are Few-Shot Learners
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Video2vec Embeddings Recognize Events When Examples Are Scarce
Deep Speech: Scaling up end-to-end speech recognition
Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet?
Deep Residual Learning for Image Recognition
Long Short-Term Memory
spaCy: Industrial-strength Natural Language Processing in Python
File: Phoebe Bridgers (41599189180) (cropped).jpg
Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training
HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training
RoBERTa: A Robustly Optimized BERT Pretraining Approach
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
NewsCLIPpings: Automatic Generation of Out-of-Context Multimodal Media
UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation
End-to-End Learning of Visual Representations from Uncurated Instructional Videos
HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips
The Rise of the Robot Reporter
Language Models are Unsupervised Multitask Learners
FaceNet: A unified embedding for face recognition and clustering
FOIL it! Find One mismatch between Image and Language caption
VL-BERT: Pre-training of Generic Visual-Linguistic Representations
LXMERT: Learning Cross-Modality Encoder Representations from Transformers
Detecting Cross-Modal Inconsistency to Defend Against Neural Fake News
Attention Is All You Need
CNN-generated images are surprisingly easy to spot... for now
Transformers: State-of-the-Art Natural Language Processing
Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification
Defending Against Neural Fake News
3D Inception Convolutional Neural Networks for Automatic Lung Nodule Detection