Title: Watch Those Words: Video Falsification Detection Using Word-Conditioned Facial Motion
Authors: Shruti Agarwal, Liwen Hu, Evonne Ng, Trevor Darrell, Hao Li, Anna Rohrbach
Date: 2021-12-21

Abstract: In today's era of digital misinformation, we are increasingly faced with new threats posed by video falsification techniques. Such falsifications range from cheapfakes (e.g., lookalikes or audio dubbing) to deepfakes (e.g., sophisticated AI media synthesis methods), which are becoming perceptually indistinguishable from real videos. To tackle this challenge, we propose a multi-modal semantic forensic approach to discover clues that go beyond detecting discrepancies in visual quality, thereby handling both simpler cheapfakes and visually persuasive deepfakes. In this work, our goal is to verify that the purported person seen in the video is indeed themselves by detecting anomalous correspondences between their facial movements and the words they are saying. We leverage the idea of attribution to learn person-specific biometric patterns that distinguish a given speaker from others. We use interpretable Action Units (AUs) to capture a person's face and head movement, as opposed to deep CNN visual features, and we are the first to use word-conditioned facial motion analysis. Unlike existing person-specific approaches, our method is also effective against attacks that focus on lip manipulation. We further demonstrate our method's effectiveness on a range of fakes not seen in training, including those without video manipulation, which were not addressed in prior work.

Humans tend to trust what they see, especially when it comes to video. Historically, video has been the best proof that an event has indeed occurred. However, in the rapidly evolving misinformation landscape of the present digital era, this may not be true for long. Video manipulation techniques are more accessible than ever, while the reach of the Internet and social media enables rapid spread of falsified content. Recent headlines, such as "XR Belgium posts deepfake of Belgian premier linking Covid-19 with climate crisis" [43], "Dutch MPs in video conference with deepfake imitation of Navalny's Chief of Staff" [44], and "When virtual turns fake: Danish politicians 'meet' Belarusian opposition figure" [39], are examples of how both deepfakes and cheapfakes (e.g., lookalikes) can pose real threats and have serious consequences, especially if targeted at people in power. To protect the public against potential disinformation campaigns, new deepfake detection methods are being introduced to combat new and more advanced deepfake techniques [19, 25, 32, 34-36, 41, 42].

Figure 1. Shown above are example frames from videos of Barack Obama saying "Hi Everybody" on three different occasions. In each case, Obama opens his mouth (left frame), moves his head up and to the right (middle frame), and finally ends by pressing his lips together (right frame). We aim to leverage such word-gesture patterns to detect video falsifications. Shown in the bottom panel is the magnitude of the "lip tightener" AU (AU23, red) and the head pitch rotation (black) over 30 frames. The head rotates up and down, and the "lip tightener" peaks in the second half of the utterance.
Not only is detection increasingly challenging, but most methods are also ineffective against cheapfakes falsified through conventional techniques (e.g., speeding up or slowing down a video) or with no video manipulation at all (a lookalike, audio dubbing). In this work, we aim to detect video falsifications related to a person's identity. Specifically, our goal is to recognize whether the purported person "seen" in a video is indeed themselves. This is distinct from the deepfake detection problem, where the goal is to distinguish between pristine (non-manipulated) and generated/altered videos. For such methods, a video of an impersonator would be wrongly identified as "real". Similarly, a non-manipulated video with dubbed or edited audio would also be considered "real" by many deepfake detectors, since they often do not take audio into account. In contrast, our problem statement is more general, as it includes both deepfakes and falsified pristine videos. Furthermore, as the quality of deepfakes improves, detecting visual flaws will also become increasingly difficult. Our key insight is to use semantic, person-specific cues as an alternative, generalizable solution to detecting video falsifications.

Biometric-based techniques [5] have recently been introduced to identify falsified videos that are either not manipulated or extremely realistic. The authors of [5] analyze the authenticity of a person's performance by correlating the head and facial movements with existing footage of the subject in question. Even though several hours of training video are required, such methods are well suited for public figures such as celebrities and world leaders, who are often a target. However, these person-specific methods are ineffective against cutting-edge audio-to-lip synthesis techniques such as [36, 41] or commercial video dialog replacement solutions developed by Synthesia or Canny AI, which only manipulate the mouth.

To this end, we propose a semantic, multi-modal detection approach that integrates speech transcripts into person-specific gesture analysis. We leverage interpretable Action Units (AUs) [6] to model a person's face and head movement. We also consider their speech (transcribed audio). Our approach is to analyze the word-conditioned facial movement captured with AUs and learn per-word models for classifying real and fake videos. Our intuition is that each individual may have identifying, unique patterns in how their speech, facial expressions, and gestures co-occur (see Figure 1). Thus, our models distinguish real videos with synchronized movement and speech from falsified ones that are not in sync. At test time, we compute classification scores for each word in a video clip and aggregate them into a final score. An AUC (area under the ROC curve) metric is used to evaluate the performance of our method.

Our experiments include several world leaders and TV talk-show hosts, and we consider the full spectrum of cutting-edge deepfake and video manipulation techniques [35, 36, 41], as well as videos found in the wild. We compare our approach to several prominent prior works and show that we achieve the best AUC in all scenarios but one, performing well across the entire range of fakes. No other method that we consider demonstrates such general capability, as they tend to suffer on audio dubbing or in-the-wild lip-sync fakes. We perform additional ablation studies to confirm that the key advantage of our method is indeed the word-conditioned analysis.
Lastly, an added benefit of our approach is interpretability: we are able to capture human-understandable, person-specific word-movement patterns predictive of a video being real or falsified (e.g., patterns common in real videos but absent in the fake ones).

Our key contributions are as follows. (a) We present a novel, general problem statement: given a video, predict whether the shown person is authentic, regardless of the falsification method. (b) We propose a new semantic, person-specific approach to address this problem that leverages word-conditioned facial movement analysis. (c) We perform a comparative study of several fake video detection methods across multiple fake types, ranging from deepfakes to impersonators and audio dubbing. Unlike the prior works, our approach shows strong generalization performance across all types of fakes. (d) Our approach also enables a degree of interpretability, allowing us to inspect the word/gesture patterns predictive of a real/fake video.

Media forensics is increasingly important due to the rapid progress of AI synthesis techniques and the spread of fake videos. We identify two types of detection techniques: (1) person-generic methods analyze whether manipulation occurred regardless of the subject in question; (2) person-specific ones verify that the characteristics of the seen individual match the real person. Person-generic approaches are often trained on large datasets with real and fake videos and rely on either low-level features or high-level semantics. Person-specific methods, on the other hand, typically require additional biometric-based data for identification.

Low-Level Feature-Based Forensics. These methods (often CNN classifiers) are typically person-generic and focus on visual artifacts or statistical anomalies learned implicitly from images or videos [1, 21, 30, 33, 38, 40, 46, 51-53]. While many techniques struggle to generalize to new video manipulation techniques or unseen deepfake videos [16], some focus on artifacts that also appear in unseen fakes: [29] detect the presence of warping effects, [27] identify blending traces during face swapping, and [47] leverage inconsistencies between images and meta-data. While promising detection capabilities have been shown, these methods are often susceptible to deteriorations like compression, resolution reduction, or adversarial perturbations and attacks [8, 24].

High-Level Semantic-Based Forensics. Person-generic, high-level semantic-based techniques focus on explicit anomalies in a person's characteristics or performance, such as the absence of eye blinking [28]. Other techniques use inconsistencies in head pose [50] and human physiological signals like ear movements [2], heartbeat [18, 37], and other biological signals [10]. These approaches often generalize better to unseen deepfakes and are more resilient to deterioration techniques. However, a reliable extraction of high-level characteristics is often difficult to achieve in unconstrained settings and short video clips. Several recent methods focus on temporal inconsistencies in facial performances [4, 9, 22, 31] but rely on robust 3D face tracking. Similar to our proposed work, multi-modal techniques [9, 31] exploit the mismatch between the audio and visual signals to detect deepfakes. Even though audio signals can provide cues like emotions and how a person is talking, our work focuses on spoken words, which provide more direct information about what is being said.
For example, different head nods associated with words convey agreement, disagreement, or greetings in many cultures [13]. In [4], the authors exploit the shape of the lips when the phonemes 'P', 'B', or 'M' are being pronounced, whereas in [22], the authors use only the visual signal to detect whether the lip movements are 'readable' in a video. Even though these techniques can detect deepfakes where the lips are modified, since they are not person-specific they will struggle to identify video falsifications that use impersonators.

Biometric-Based Forensics. Biometric-based detection methods [3, 5, 11, 12, 26, 48, 49] are person-specific, as they try to verify the authenticity of a person using known identity priors. These works are the most relevant to our technique, and many of them exploit person-specific facial movement over time to detect deepfakes. In [26, 49], the authors use the visual appearance and movement of the lips to perform speaker verification and detect person-specific deepfakes. In these previous works, the authors analyzed only the lips for a small set of words, which restricts their approach to fakes in which those words are spoken. In contrast, we include facial movements from the entire face and use a much larger vocabulary, enabling our approach to handle in-the-wild deepfakes. The method of [5] introduced a biometric approach for public figures, where person-specific facial movements in a video are compared with those of pristine videos. Despite requiring hours of training data for a known person, this approach is resistant to realistic deepfakes, and even to lookalikes, when no video manipulation is used. More advanced techniques that incorporate CNN-based behavior classification using optical flow [3] or 3DMM-based facial tracking [12] have shown improved performance for deepfake detection. Nevertheless, recent advancements in speech-to-lip synthesis [36, 41] show that it is possible to produce highly convincing speech manipulations without altering global facial characteristics. In this work, we introduce a multi-modal semantic approach that exploits the fact that spoken words may be associated with distinct person-specific facial movements. In particular, these movements involve the entire face and head, not only the lip region of a person, and are difficult to disguise even for skilled impersonators.

Given an input video of a purported individual, our goal is to classify it as real or fake. We leverage the key insight that individuals often use identifying gestures associated with specific interactions like greeting, disagreement, etc. In our approach, we represent these conversational units in terms of words and analyze the facial gestures associated with them. Considering conversational units at the granularity of words gives us a good trade-off between the number of occurrences of each unique conversational unit and speech semantics: using N-grams or unique sentences would result in fewer occurrences, while phonemes would carry less meaningful speech semantics. As shown in Figure 2, we first transcribe the audio and obtain a per-frame alignment of each word in the video using a DeepSpeech variant [23]. We then extract the speaker's corresponding per-frame facial expressions and head poses, represented by interpretable Action Units (AUs) [17]. To encode the speaker's motion, we compute the amount of change in the AUs that happens within the window of the word's occurrence.
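To make this feature construction concrete, the following is a minimal sketch (not the authors' released code) of word-conditioned motion features computed from per-frame AU outputs. It assumes au_frames is a (T x 25) array of per-frame AU intensities and head-pose values (e.g., as exported by an OpenFace-style tracker) and that word_times holds word-to-frame alignments from the transcription step; the names and array layout are illustrative assumptions.

```python
import numpy as np

def word_features(au_frames, word_times):
    """Compute one 25-D motion feature per word occurrence.

    au_frames  : (T, 25) array of per-frame AU intensities / head-pose values
                 (hypothetical layout; the paper uses 25-D AU-based features).
    word_times : list of (word, start_frame, end_frame) tuples from a
                 word-aligned transcript.
    Returns a list of (word, feature) pairs, where each feature is the
    element-wise max-min delta over the word's phonation window.
    """
    feats = []
    for word, s, n in word_times:
        window = au_frames[s:n + 1]                      # frames spanned by the word
        if len(window) == 0:
            continue
        delta = window.max(axis=0) - window.min(axis=0)  # range of movement per dimension
        feats.append((word, delta))
    return feats

# Toy usage: 90 frames of random "AU" values and two made-up word alignments.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    aus = rng.random((90, 25))
    words = [("hi", 0, 14), ("everybody", 15, 40)]
    for w, x in word_features(aus, words):
        print(w, x.shape)   # each feature is a 25-D vector
```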
We then train word-level classifier models to distinguish whether the visual movement aligns with the spoken words (real) or not (fake).

We denote F_{1:T} as the set of frames f of a video of length T. Given a video, we transcribe its audio to obtain the phonation time of each word w, expressed as a start frame f_s and an end frame f_n, where d = n - s is the duration of the phonation. To associate an individual's facial expression and head motion with the corresponding word, we extract AUs for the window F_{s:n}. In contrast to 3-D or 2-D facial landmarks, these AUs represent semantically meaningful micro-expressions such as the strength of a cheek or chin movement (e.g., "chin raiser"). For a given word spoken within the frame range F_{s:n}, we extract a 25-D facial feature g_i at each timestep to obtain G_{s:n} = {g_i}_{i=s}^{n}. Each 25-D facial feature can be decomposed into four components. Instead of using the variable-length feature G_{s:n} in R^{d x 25}, we use the deltas between the maximum and the minimum values extracted during the word phonation window. The facial feature of each word occurrence is then expressed as

x_w = max_{i in [s,n]} g_i - min_{i in [s,n]} g_i (element-wise),

where x_w in R^{25x1} is used for building a word-specific model. Intuitively, these features capture the maximum range of movement happening when a word is spoken (e.g., how much a person moves their head up when they say "Hi").

To train the word-specific models, we use linear classifiers to identify whether the given gesture features are aligned with the given words. Instead of using larger, more complex learning-based approaches that learn high-dimensional features, we use linear classifiers to highlight the efficacy of our interpretable features in a simple model. Given real videos, where the words are aligned with the video, and fake videos, where the words are deliberately misaligned with the gestures (e.g., a speech transcript matched to a different video), we extract the facial features x_w for each occurrence of word w in every video. In addition to creating fake samples from genuine videos by misaligning the audio, we also use synthetic fakes. Specifically, we create videos with misaligned audio inputs using the recent lip-sync deepfake generation method Wav2Lip [36]. By augmenting our misaligned fake videos with these synthesized fakes, we ensure that our classifier does not rely only on the lip synchronization errors of misaligned video sequences.

From this dataset of real and fake videos, we train a word-specific logistic regression classifier for each individual. Let x_w in R^{25x1} be the facial feature corresponding to word w, and let y_w in {0, 1} be the ground-truth label of x_w, where y_w = 1 if x_w is from a real video sequence. We learn the model parameters theta_w in R^{25x1} of a linear classifier that maximizes the following objective function:

L_{theta_w} = sum_{i=1}^{M} log P(y_i | x_i; theta_w),

where P(y_i | x_i; theta_w) is the probability of y_i given x_i, and M is the total number of occurrences of w in the training data.

During evaluation, given a test video of a purported individual, we extract the features as described above. Transcribed words not seen in the training set are discarded. For each remaining word w, the corresponding facial features x_w are examined using the target word classifier theta_w for the given individual. A score s_w in the range of 0 (fake) to 1 (real) for x_w is computed as

s_w = P(y_w = 1 | x_w; theta_w) = 1 / (1 + exp(-theta_w^T x_w)).

A final score for a given video is computed as the geometric mean of the scores across all trained words occurring in the video.
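Below is a minimal sketch of the per-word training and video-level scoring described above, using scikit-learn's logistic regression; the exact solver, regularization, and data handling the authors used are not specified in the text, so treat these choices as assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_word_models(train_feats, train_labels):
    """train_feats/train_labels: dicts mapping word -> (M, 25) feature matrix and
    (M,) binary labels (1 = real, 0 = fake). Returns one classifier per word."""
    models = {}
    for word, X in train_feats.items():
        y = train_labels[word]
        if len(set(y)) < 2:          # need both classes to fit a classifier
            continue
        models[word] = LogisticRegression(max_iter=1000).fit(X, y)
    return models

def score_video(models, test_feats):
    """test_feats: list of (word, 25-D feature) pairs for one test clip.
    Returns the geometric mean of the per-word 'real' probabilities;
    words without a trained model are skipped."""
    scores = []
    for word, x in test_feats:
        if word in models:
            p_real = models[word].predict_proba(x.reshape(1, -1))[0, 1]
            scores.append(max(p_real, 1e-6))   # avoid log(0) in the geometric mean
    if not scores:
        return None                            # no usable words in this clip
    return float(np.exp(np.mean(np.log(scores))))
```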
To validate our proposed approach on the general problem statement that includes both deepfakes and non-manipulated fakes, we compile the following dataset. We consider four US politicians (Barack Obama, Donald Trump, Joe Biden, Kamala Harris) and two TV talk-show hosts (John Oliver, Conan O'Brien). Below we provide details of the types of data that we use.

Real: Real videos were collected for the six individuals. The videos of the politicians were taken from the World Leaders Dataset (WLDR) [3], and the videos of the talk-show hosts were taken from [20]. The duration of the real videos (in hours) is shown in the first column of Table 1.

Audio Dubbing: Using the real videos of each individual, we simulate the dubbing scenario by mismatching the video and audio tracks. For every real video, a new dubbed video is created by matching it with a random audio track of the same length. We produce the same number of hours of dubbed videos as we have of real videos (Table 1).

Wav2Lip: Using the real videos of each individual, we create lip-sync deepfakes where the lip region in the video is modified to match a new random audio. We use the off-the-shelf implementation of Wav2Lip [36] to create these fakes. An example frame for each individual is shown in the second column of Figure 3. We produce the same number of hours of Wav2Lip lip-syncs as of real videos.

Impersonator: The person-specific impersonator videos are obtained from Saturday Night Live videos on YouTube. The comedic impersonator videos for Obama, Biden, and Trump are from WLDR, and we collect the impersonator videos for Harris, Oliver, and O'Brien from YouTube. Example frames for each impersonator are shown in the third column of Figure 3, and the duration of the videos (in hours) is given in Table 1.

FaceSwap: The FaceSwap videos are created from the person-specific impersonator videos by replacing the face of an impersonator with the target person's face. We obtain the FaceSwap videos for Obama, Biden, and Trump from WLDR and create the videos for Harris, Oliver, and O'Brien using the open-source FaceSwap library [35]. Example frames are shown in the fourth column of Figure 3, and the duration of the videos (in hours) is given in Table 1.

In-the-wild: The person-specific lip-sync videos are generated using different techniques. The videos for Obama are collected from WLDR; these videos were created using the lip-sync technique in [41]. The in-the-wild videos for Trump and Biden were created using a GAN-based synthesis technique in which the mouth region of the video is modified to match a new audio [2]. Example frames are shown in Figure 4, and the duration of the videos (in hours) is given in Table 1.

We use the Real, Audio Dubbing, and Wav2Lip videos for training, while all six types are used for testing.
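For illustration, audio-dubbed fakes like those described above (a real video paired with a random audio track of matching length) can be produced with standard ffmpeg stream mapping. The directory layout and pairing logic below are assumptions for the sketch, not the authors' actual pipeline.

```python
import random
import subprocess
from pathlib import Path

def make_dubbed_video(video_path, donor_audio_path, out_path):
    """Replace the audio track of `video_path` with `donor_audio_path`,
    keeping the original video stream untouched (standard ffmpeg stream mapping)."""
    subprocess.run([
        "ffmpeg", "-y",
        "-i", str(video_path),        # source of the video stream
        "-i", str(donor_audio_path),  # source of the mismatched audio
        "-map", "0:v", "-map", "1:a", # video from input 0, audio from input 1
        "-c:v", "copy", "-shortest",  # no re-encode; stop at the shorter stream
        str(out_path),
    ], check=True)

if __name__ == "__main__":
    # Hypothetical folder layout; pair each real video with a random donor audio.
    videos = sorted(Path("real_videos").glob("*.mp4"))
    audios = sorted(Path("donor_audio").glob("*.wav"))
    Path("dubbed").mkdir(exist_ok=True)
    for v in videos:
        make_dubbed_video(v, random.choice(audios), Path("dubbed") / v.name)
```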
To analyze our proposed method, we conduct a series of experiments. We compare our approach with state-of-the-art video falsification detection methods and show that our method outperforms them in terms of generalizability. We further provide an ablation study and evaluate the robustness of our system against unseen video perturbations. We end with a qualitative analysis of what our model learns.

Data Preprocessing: Each video is first preprocessed so that only the person of interest is retained. Given one frame of an input video, we first use a single-stage face detector [14] to localize all the faces. Then a face recognition network, ArcFace [15], is used to examine whether each face belongs to the target person, and all the outliers are masked out. For transcription, we use an open-source implementation of DeepSpeech [23]. For AU extraction, we use the facial behavior analysis toolkit OpenFace 2.0 [6, 7].

Training Details: In our experiments, we use logistic regression to solve the binary real/fake video classification problem. To train our person-specific word classifiers, we use 90% of the real videos for the "Real" class, and 90% of the audio dubbing and Wav2Lip lip-sync videos for the "Fake" class. The number of unique words present in the speech of each individual is given in Table 2 (left). The word-specific models are trained for the words that on average occur at least once in every hour of real video. Shown in the right column of Table 2 is the total number of word models trained for each individual. On average, 799 word models are trained, with the smallest/largest number of models trained for O'Brien/Obama.

Testing Details: We test our approach on the remaining 10% of real, audio dubbing, and Wav2Lip lip-sync videos. Additionally, we test on all the videos with impersonators, FaceSwap, and in-the-wild lip-sync deepfakes, which were not seen during training (as introduced in Section 4). Each test video is divided into overlapping 10-second video clips (30 fps) with a shift window of two seconds. Shown in Table 3 is the total number of unique words evaluated in each test dataset, i.e., the words occurring at test time that fall within the set of trained words.

Table 3. Top: number of unique word models evaluated in each sub-task (compare to Table 2).
            Real   Audio Dubbing   Wav2Lip   Impersonator   FaceSwap   In-the-wild
Obama        812             812       812            248        211           543
Trump        543             543       543            296        282            81
Biden        523             523       523            133        145           121
Harris       346             346       346            125        124             -
O'Brien      548             548       548            196        187             -
Oliver       670             670       670            118         98             -

Evaluation Metric: We report the Area Under the Curve (AUC) score for the 10-second test videos. For the previous methods that perform analysis on a temporal window of less than 10 seconds, we average predictions over 10 seconds.

Table 4. Accuracy in terms of AUC on 10-second video clips for the six individuals and five different video falsification scenarios. The average AUC across all individuals is given in the last row.

Shown in Table 4 is the performance of our method in terms of AUC for each individual test case. The average AUC across all the individuals is shown in the bottom row. Our approach works best for Obama, with an average AUC of 0.97 across all types of falsification scenarios, and worst for O'Brien, with an average AUC of 0.88. This is expected, as the Obama videos have higher quality and better consistency in facial movement during the formal weekly addresses. The videos of O'Brien are of lower visual quality and exhibit a wider range of facial movements during the informal interviews, monologues, and audience interactions of the talk show. This makes it more difficult for our word-conditioned model to learn consistent facial movement patterns from the O'Brien videos.
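As a concrete illustration of the evaluation protocol above (overlapping 10-second clips at 30 fps with a 2-second shift, each clip scored and the clip scores summarized with AUC), here is a small sketch using scikit-learn; score_clip stands in for the geometric-mean word scorer described in the method section, and the data structures are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

FPS, CLIP_SEC, SHIFT_SEC = 30, 10, 2

def clip_ranges(num_frames, fps=FPS, clip_sec=CLIP_SEC, shift_sec=SHIFT_SEC):
    """Yield (start, end) frame indices for overlapping 10 s clips, 2 s apart."""
    clip_len, shift = clip_sec * fps, shift_sec * fps
    for start in range(0, max(num_frames - clip_len + 1, 1), shift):
        yield start, start + clip_len

def evaluate(videos, score_clip):
    """videos: list of (num_frames, is_real, video_data) tuples; score_clip returns
    a realness score in [0, 1] for one clip (e.g., the geometric-mean word score)."""
    labels, scores = [], []
    for num_frames, is_real, data in videos:
        for s, e in clip_ranges(num_frames):
            score = score_clip(data, s, e)
            if score is not None:          # clips with no trained words are skipped
                labels.append(int(is_real))
                scores.append(score)
    return roc_auc_score(labels, scores)   # AUC over all 10-second clips
```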
We compare our approach with the low-level feature-based method XceptionNet [40], the high-level semantic-based approach LipForensics [22], and the biometric-based techniques Protecting World Leaders (PWL) [5] and ID-Reveal [12]. Shown in Table 5 are the average AUCs across all individuals for each method and video falsification scenario. Our approach performs best across all the video falsification scenarios except in the case of Wav2Lip, where LipForensics obtains the best performance of 0.98. All the previous methods fail to detect the audio dubbing falsification scenario, as no video manipulation is performed in this case. The non-biometric techniques fail to detect impersonator videos. Even though the related biometric-based methods are able to detect FaceSwaps and impersonators, they perform poorly on lip-sync videos. This is because these techniques only use the visual cues of a person's identity, most of which are preserved in lip-sync videos. This shows the advantage of our approach, i.e., using words in combination with the visual cues.

Effect of Using Word-specific Classifiers: We analyze the effect of training the word-specific classifiers by training two different versions of our approach. In the first version (Fixed Window), we do not use the word information and compute the 25-D visual gesture features over all non-overlapping fixed windows of 30 frames. This window size is chosen since 95% of the words have a duration of at most 30 frames. Using the corresponding gesture features, we train a single linear classifier to predict Real vs. Fake. In the second version (Word Window), the gesture features are extracted using word intervals, as in our approach, but we train a single linear classifier instead of word-specific classifiers. Shown in Table 6 are the average AUCs across all individuals for the two ablations and our approach. While the word intervals already improve over the fixed-window case, the word-specific training further improves the performance on each type of video falsification scenario. This clearly shows that the key advantage of our approach is indeed in leveraging the word-conditioned facial gesture analysis.

Table 6. The average AUC performance across all individuals when: 1) a fixed visual window and a single classifier are used; 2) a word-based visual window and a single classifier are used; and 3) a word-based visual window and word-based classifiers are used.

Effect of Training Size: We analyze the effect of the number of hours of real video used for training the person-specific word models. The effect of training size is evaluated on: 1) Wav2Lip lip-sync fakes, which on average have 72% vocabulary overlap with the training dataset, and 2) in-the-wild lip-sync fakes, which on average have only 28% vocabulary overlap with the training dataset. Shown in Figure 5 are the AUCs for individuals as a function of training size, ranging from 0.1 to 2.1/5.0 hours of real training video. For each real training size, we use the equivalent number of hours of fake videos from the audio dubbing and Wav2Lip training datasets. The evaluation in the left/right plot is performed on Wav2Lip/in-the-wild lip-sync fakes. Shown with the black curve is the average AUC across all individuals as a function of training size. In each of these evaluation scenarios, the performance improves with the number of training hours. In the case of Wav2Lip, the average performance improves from 0.62 to 0.88 (42%) as training grows from 0.1 to 1.3 hours, and then from 0.88 to 0.90 (2.0%) for training sizes greater than 1.3 hours. Similarly, for the in-the-wild lip-sync fakes, an average performance of 0.91 is achieved with 1.3 hours of training video, with only a slight improvement after that. This shows that even though we used several hours of video for each individual, a relatively small training dataset (≈1.5 hours) can provide similar performance.
Here, we evaluate the robustness of our approach against video compression. For this experiment, we re-saved the real test videos of each individual using an ffmpeg compression quality of 40. Shown in Table 7 are the results of our approach compared to the best-performing related techniques. The average performance of our method is reduced by 5%, from 0.92 to 0.87, whereas the average performance of LipForensics and PWL is reduced by 14% and 6%, respectively. Even though the performance of our method is reduced after compression, our approach still performs better in all falsification scenarios except Impersonators, where PWL achieves slightly better performance.

Here we present a few qualitative results showing the regularity of the facial movements associated with words. Shown in Figure 6 are the word-based facial movements of two individuals. For each person, we select one word from the top-5 performing words. (The performance of the word-based classifiers is evaluated on our training data in terms of word-level AUC.) For each selected word and individual, two occurrences are shown, from real (top row) and Wav2Lip fake (bottom row) videos. Shown in the last column is the distribution of one gesture feature (AU) in the real and fake training datasets. It can be observed that Trump, while saying the word "tremendous", rounds his lips and then presses the lips together before finally opening the lips apart. This rounding action of the lips is absent in the fake examples, even though the lips are closed once in the sequence. This difference between real and fake utterances can also be seen in the distributions of the change in chin-raise (AU17) and change in lip-hor facial features. For Oliver, the word "billion" is associated with the creation of dimples on the cheeks, which is violated in the fake frames shown here.

Figure 6 (panel titles: chin raise (AU17) and lip rounding (lip-hor) during the word "tremendous", real vs. fake; dimpler (AU14) and lip-corner pull (AU12) during the word "billion", real vs. fake). Shown here is a qualitative analysis of the facial movements for specific words that were used to detect real vs. fake. The results are shown for one word for three individuals. For each word and individual, shown are two examples of the facial movements in real and Wav2Lip fake videos. Shown in the last column is the distribution of a gesture feature in the real and fake training datasets of the individual. For example, in the case of Trump, the lip-rounding and chin-raise actions during the word "tremendous" are missing in the fakes. This is also supported by the distributions of AU17 and lip-hor, where the average strength of these movements is lower in the fake videos of Trump than in the real ones.

We proposed a novel multi-modal, semantic-based approach towards detecting falsified videos. We leverage the idea of learning person-specific associations between a speaker's facial gestures and spoken words to verify the purported person's identity in the video. Our experiments show that inconsistent head movements and facial expressions can be identified reliably when a different performer is used for falsification. In particular, we demonstrate the effectiveness and robustness of our approach on a wide range of deepfakes and cheapfakes, outperforming all other methods in most cases. Since we do not attempt to detect video manipulation artifacts, our method will still work for more advanced future deepfake methods. Our current approach relies heavily on the accuracy of 3-D facial tracking via AU extraction.
While this is achievable for our dataset, in which the speakers are often front-facing, deep-learning-based features may be more reliable for fully unconstrained videos. Although our method appears rather sample-efficient (Figure 5), it is person-specific and requires sufficient training data to be effective, similar to PWL [5]. Furthermore, while the AUs allow us to obtain interpretable results, denser 3-D facial features could allow for detecting more subtle anomalies. We also believe that our approach can be extended to investigate person-generic cases by detecting common inconsistencies between individuals. Finally, we have only validated our method on English speech. In the future, we would like to explore how well our word-conditioned technique works with other languages. Falsified media is an emerging threat to society, so we envision mostly positive impact from our work. At the same time, almost any method for fake detection may be adapted to create even more robust fakes.

Here we provide some of the quantitative and qualitative results to support the analysis made in the main paper. In the main paper we compared with the related works using the average AUCs across all individuals. In order to give better insight into the comparison, we first present per-individual results for each of the related works. We then present a more detailed version of the qualitative results using both the training and testing datasets. Shown in Table 8 are the per-individual results for all the related methods that were presented in the main paper.

Here we provide the videos used for the qualitative analysis of the words presented in Figure 1 and Figure 6 of the main paper. For Obama, Trump, and Oliver we provide occurrences of the words "hi", "tremendous", and "billion" in real and fake videos. There are therefore a total of six videos for this section: Obama hi real.mp4, Obama hi fake.mp4, Trump tremendous real.mp4, Trump tremendous fake.mp4, Oliver billion real.mp4, and Oliver billion fake.mp4. In each video, the output probability of the word-specific classifier is shown in red in the top-left corner (a value of 1 is real and 0 is fake). The occurrences of the words are selected from the training dataset. This is done to demonstrate the facial gestures associated with specific words during training. In each case, it can be observed that a specific facial gesture is present in the real videos which is missing in the fake videos. For example, the occurrences of the word "hi" are associated with an upward head movement, which is missing in the fake examples. Similarly, in the case of the word "tremendous", notice the presence of the lip-rounding and chin-raise actions in multiple occurrences of the word in real videos, whereas these actions are missing in the fake videos.

Here we show how the results of our method can be interpreted during the evaluation of a test video. For this we provide four example videos, a real and a fake video each for Obama and Trump. The real videos are from the test split of the real dataset, and the fake videos are from the in-the-wild dataset. The videos are named: Obama itw test.mp4, Obama real test.mp4, Trump itw test.mp4, and Trump real test.mp4. Given a test video of 10-second length, we show the output of the word-specific classifier for each word. Shown on the x-axis of the plot is time, and on the y-axis is the probability that the word occurrence is real.
Shown in orange is the probability of the word in the test video, and shown in blue is the average real probability of the word in the real dataset during training. The shaded blue region indicates the standard deviation of the training probability. Gaps in the plot indicate that the word-specific classifier is missing. The current time is indicated by a red dot on the plot, and the current word is displayed at the top of the video. These word-level probabilities can be used to isolate the words that obtain a low probability of being real. For example, in Obama itw test.mp4 many words have a low probability of being real, with a minimum probability of zero for the word "coverage". Similarly, in the Trump itw test.mp4 video, the word "protected" has a zero probability of being real. In the videos Obama real test.mp4 and Trump real test.mp4, on the other hand, the real probability for each of the words is close to that of the training real dataset (average of 0.8).

Shown in Figure 7 are the distributions of the 25 facial-gesture features for the word "coverage" for Obama. In each panel, shown in blue is the distribution of one facial-gesture feature in the real training videos of Obama. Shown with a red line is the value of the facial-gesture feature in the current test video of Obama, which in this case is the fake video shown in Obama itw test.mp4. The word "coverage" in this example fake video has an out-of-distribution value for AU26, i.e., jaw drop. An out-of-distribution value can also be observed for the lip-ver motion, where the value in the fake is lower than any value seen during training. Similarly, shown in Figure 8 are the distributions of the 25 facial-gesture features for the word "protect" for Trump. The red line in each panel is the value of the facial-gesture feature in the fake test video of Trump shown in Trump itw test.mp4. For the word "protect", the values for AU17 (chin raiser) and AU23 (lip tightener) in the fake are lower than any values seen during training.
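The per-feature inspection illustrated in Figures 7 and 8 can be approximated by comparing a word's test feature against its training-time distribution and flagging dimensions that fall outside the observed range; the sketch below uses placeholder feature names and a simple min/max criterion, which is an assumption rather than the authors' exact procedure.

```python
import numpy as np

def out_of_distribution_dims(train_feats, test_feat, feature_names):
    """train_feats: (M, 25) features of one word from real training videos;
    test_feat: 25-D feature of the same word in a test clip.
    Returns the feature names whose test value lies outside the training range,
    mirroring the qualitative checks in Figures 7 and 8."""
    lo, hi = train_feats.min(axis=0), train_feats.max(axis=0)
    flags = (test_feat < lo) | (test_feat > hi)
    return [name for name, f in zip(feature_names, flags) if f]

# Illustrative usage with made-up data and hypothetical feature names.
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    names = [f"feat{i:02d}" for i in range(25)]   # placeholder names only
    train = rng.normal(0.5, 0.1, size=(200, 25))
    test = train.mean(axis=0).copy()
    test[3] = 2.0                                  # simulate one outlier dimension
    print(out_of_distribution_dims(train, test, names))
```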
Acknowledgements. This work was supported in part by DoD, including DARPA's LwLL and/or SemaFor programs, and Berkeley Artificial Intelligence Research (BAIR) industrial alliance programs.

References.
[1] MesoNet: A compact facial video forgery detection network.
[2] Detecting deep fakes from aural and oral dynamics.
[3] Detecting deep-fake videos from appearance and behavior.
[4] Detecting deep-fake videos from phoneme-viseme mismatches.
[5] Protecting world leaders against deep fakes.
[6] Cross-dataset learning and person-specific normalisation for automatic action unit detection.
[7] OpenFace 2.0: Facial behavior analysis toolkit.
[8] Evading deepfake-image detectors with white- and black-box attacks.
[9] Not made for each other: Audio-visual dissonance-based deepfake detection and localization.
[10] FakeCatcher: Detection of synthetic portrait videos using biological signals.
[11] Individual differences in facial expression: stability over time, relation to self-reported emotion, and ability to inform person identification.
[12] ID-Reveal: Identity-aware deepfake video detection.
[13] The expression of the emotions in man and animals.
[14] RetinaFace: Single-shot multi-level face localisation in the wild.
[15] ArcFace: Additive angular margin loss for deep face recognition.
[16] The deepfake detection challenge (DFDC) preview dataset.
[17] Measuring facial movement. Environmental Psychology and Nonverbal Behavior.
[18] Predicting heart rate variations of deepfake videos using neural ODE.
[19] Text-based editing of talking-head video.
[20] Learning individual styles of conversational gesture.
[21] Deepfake video detection using recurrent neural networks.
[22] Lips don't lie: A generalisable and robust approach to face forgery detection.
[23] Deep Speech: Scaling up end-to-end speech recognition.
[24] Exposing vulnerabilities of deepfake detection systems with robust attacks. Digital Threats: Research and Practice.
[25] Deep video portraits.
[26] Speaker inconsistency detection in tampered video.
[27] Face X-ray for more general face forgery detection.
[28] In ictu oculi: Exposing AI created fake videos by detecting eye blinking.
[29] Exposing deepfake videos by detecting face warping artifacts.
[30] Exploiting visual artifacts to expose deepfakes and face manipulations.
[31] Emotions don't lie: An audio-visual deepfake detection method using affective cues.
[32] paGAN: Real-time avatars using dynamic textures.
[33] Capsule-forensics: Using capsule networks to detect forged images and videos.
[34] FSGAN: Subject agnostic face swapping and reenactment.
[35] A simple, flexible and extensible face swapping framework.
[36] A lip sync expert is all you need for speech to lip generation in the wild.
[37] DeepRhythm: Exposing deepfakes with attentional visual heartbeat rhythms.
[38] Thinking in frequency: Face forgery detection by mining frequency-aware clues.
[39] When virtual turns fake: Danish politicians 'meet' Belarusian opposition figure.
[40] FaceForensics++: Learning to detect manipulated facial images.
[41] Synthesizing Obama: Learning lip sync from audio.
[42] Neural voice puppetry: Audio-driven facial reenactment.
[43] Brussels Times. XR Belgium posts deepfake of Belgian premier linking Covid-19 with climate crisis.
[44] Dutch MPs in video conference with deepfake imitation of Navalny's chief of staff.
[45] The Verge. 'Deepfake' that supposedly fooled European politicians was just a look-alike, say pranksters.
[46] Multi-modal multi-scale transformers for deepfake detection.
[47] CNN-generated images are surprisingly easy to spot... for now.
[48] Body motion analysis for multi-modal identity verification.
[49] Preventing deepfake attacks on speaker authentication by dynamic lip movement analysis.
[50] Exposing deep fakes using inconsistent head poses.
[51] Attributing fake images to GANs: Learning and analyzing GAN fingerprints.
[52] Multi-attentional deepfake detection.
[53] Two-stream neural networks for tampered face detection.