title: Human Detection of Political Deepfakes across Transcripts, Audio, and Video
authors: Groh, Matthew; Sankaranarayanan, Aruna; Picard, Rosalind
date: 2022-02-25

Recent advances in technology for hyper-realistic visual effects provoke the concern that deepfake videos of political speeches will soon be visually indistinguishable from authentic video recordings. Yet there exists little empirical research on how audio-visual information influences people's susceptibility to political misinformation. The conventional wisdom in the field of communication research predicts that people will fall for fake news more often when the same story is presented as video rather than text. However, audio-visual manipulations often leave distortions that some, but not all, people may pick up on. Here, we evaluate how communication modalities influence people's ability to discern real political speeches from fabrications based on a randomized experiment with 5,727 participants who provided 61,792 truth discernment judgments. We show participants soundbites from political speeches that are randomly assigned to appear in permutations of text, audio, and video modalities. We find that communication modalities mediate discernment accuracy: participants are more accurate on video with audio than on silent video, and more accurate on silent video than on text transcripts. Likewise, we find participants rely more on how something is said (the audio-visual cues) than on what is said (the speech content itself). However, political speeches that do not match public perceptions of politicians' beliefs reduce participants' reliance on visual cues. In particular, we find that reflective reasoning moderates the degree to which participants consider visual information: low performance on the Cognitive Reflection Test is associated with an underreliance on visual cues and an overreliance on what is said.

Recent advances in technology for algorithmically applying hyper-realistic manipulations to video are simultaneously enabling new forms of interpersonal communication and posing a threat to traditional standards of evidence and trust in media 1-7. In the last few years, computer scientists have trained machine learning models to generate photorealistic images of people who do not exist 8, 9, inpaint people out of images 10, 11, clone voices based on a few samples 12, 13, and modulate the lip movements of people in videos to make them appear to say something they have not said 14, 15. The synthetic videos' false appearance of indexicality - the presence of a direct relationship between the photographed scene and reality 16, 17 - has the potential to lead people to believe video-based messages that they otherwise would not have believed if the messages were communicated via text. This potential influence is particularly concerning because research demonstrates that videos, especially videos of an injustice, elicit more engagement and emotional reactions (e.g., anger, sympathy) than text descriptions displaying the same information 18-20 (although, see ref. 21). Moreover, visual misinformation is common on social media 22, and the emotional and motivational influences of visual communication have been cited to explain why fake, viral videos have provoked mob violence 23, 24.
While people are more likely to believe a real event occurred after watching a video of the event than after reading a description of the event 25, an open question remains: Does visual communication, relative to text, increase the believability of fabricated events? The realism heuristic 24, 26 predicts "people are more likely to trust audiovisual modality [relative to text] because its content has a higher resemblance to the real world." This prediction is relevant for many deepfake videos 27 and suggests fabricated video would be more believable than fabricated text, conditional on the absence of obvious perceptual distortions. Yet there exists little direct empirical evidence for this heuristic applied to algorithmically manipulated video. In an experiment using three fake videos as stimuli, researchers found that stories presented as videos are perceived as more credible than stories presented as text or read aloud in audio form 24. In contrast, in an experiment showing 6 political deepfake videos (videos manipulated by artificial intelligence to make someone say something they did not say) and 9 non-manipulated videos, researchers did not find differences between truth discernment rates in video, audio, and text 28. Perhaps some of the experiments' participants did not take the videos' "indexicality" as evidence of authenticity because participants were aware of how easily such videos could be manipulated. Alternatively, some participants may have noticed perceptual distortions in the videos, which would naturally lead one to believe a video has been manipulated. The mixed evidence on how communication modalities mediate people's ability to discern fabricated content may be due to the small samples of stimuli in media effects research 29. In the related domain of fake images, visual information can be persuasive: research finds people rarely question the authenticity of images even when primed 30, images can increase the credibility of disinformation 31, and images of synthetic faces produced by StyleGAN2 9 are indistinguishable from the original photos on which the StyleGAN2 algorithm was trained 32. When it comes to videos of political speeches, the question of whether people are more likely to believe an event occurred because they saw it, as opposed to only reading about it, remains open. In fact, today's algorithmically generated deepfakes are not yet consistently indistinguishable from real videos. On a sample of 166 videos from the largest publicly available dataset of deepfake videos to date 33, people are significantly better than chance but far from perfect at discerning whether an unknown actor's face has been visually manipulated by a deepfake algorithm 34. This finding is significant because it demonstrates that people can distinguish deepfake videos from real videos based solely on visual cues. However, some videos are more difficult than others to distinguish due to blurry, dark, or grainy visual features. On a subset of 11 of the 166 videos, researchers do not find that people can detect deepfakes better than chance 35. In another experiment with 25 deepfake videos and 4 real videos but only 94 participants, researchers found an overall discernment accuracy of 51% and that a media literacy training increased discernment accuracy by 24 percentage points for participants assigned to the training relative to the control group 36.
In experiments examining how people react to deepfake videos of politicians, researchers find people are more likely to feel uncertain than misled after viewing a deepfake of Barack Obama 37, and people consider a deepfake of a Dutch politician significantly less credible than the real video from which it was adapted 38. In the experiment examining the fabricated video of a Dutch politician, some respondents explained their credibility judgments by indicating audio-visual cues of how the message was communicated (e.g., unnatural mouth movements); others indicated inconsistency in the content of the message itself (e.g., contextually unrealistic speeches) 38. People's capacity to identify multimedia manipulations raises questions: how do various kinds of fabricated evidence (e.g., audio and video of fake political speeches) alter the perceived credibility of misinformation, how do audience characteristics (e.g., reflective reasoning) moderate media effects, and how do the source and content of a message interact with the fabricated evidence and audience characteristics 39? A growing field of misinformation science is beginning to address these questions. Research on news source quality demonstrates that people in the United States are generally accurate at identifying high- and low-quality publishers 40, and the salience of source information does not appear to change how accurately people identify fabricated news stories 41, manipulated images 42, or fake news headlines 43, 44, although evidence on fake news headlines is mixed 45, 46. Research on political fake news content suggests an individual's tendency to rely on intuition instead of analytic thinking is a stronger factor than motivated reasoning in explaining why people fall for fake news 47; similarly, people with more analytic cognitive styles worldwide are more accurate at discerning true from false headlines related to COVID-19 48. Moreover, people tend to be better at discerning truth from falsehood when evaluating news headlines that are concordant with their political partisanship than when evaluating headlines that are discordant 49. While the science of fake news has generally focused on the messengers (the source credibility of publishers) 50 and the message of what is said (the media credibility of written articles and headlines) 49, the relevance of audio-visual communication channels to the psychology of misinformation has received less attention 51. In this paper, we evaluate discernment across 32 political speeches by two well-known politicians. We present these speeches to participants via the 7 possible permutations of 3 digital media communication modalities: text, audio, and video. Based on 61,792 responses from 5,727 individuals who participated in a pre-registered 1 cross-randomized experiment, we examine ordinary people's performance at discerning political speeches randomized to appear in each of the following seven conditions: a transcript, an audio clip, a silent video, audio with subtitles, silent video with subtitles, video with audio, and video with audio and subtitles. By randomly assigning political speeches to these permutations of text, audio, and video modalities and asking participants to discern truth from falsehood, this experiment is designed to disentangle the degree to which participants attend to the content of what is said and to the audio-visual cues of how it is said.
In addition, we evaluate these disentangled components across message types (speeches that are either concordant or discordant with the general public's perception of a speaker's political identity) and audience characteristics (reflective reasoning as measured by the Cognitive Reflection Test (CRT) 52). We hosted the multimedia stimuli - transcripts, audio, and video of fabricated and authentic political speeches - on a custom-designed website called Detect Fakes 2. In the experiment, we asked participants to identify fabricated and non-fabricated stimuli. After collecting informed consent and presenting participants with instructions, we show participants a short political speech and ask "Did [Joseph Biden/Donald Trump] say that?" followed by "Please [read/listen/watch] this [transcript/audio clip/video] from [Joseph Biden/Donald Trump] and share how confident you are that it is fabricated. Remember half the media snippets we show are real and half are fabricated." Figure 5 in the Supplementary Information presents a screenshot of the user interface, which shows that participants were instructed to move a slider to report their confidence from 50% to 100% that a stimulus is fabricated (or 50% to 100% that a stimulus is not fabricated). After each response, we informed participants whether the stimulus was actually fabricated and then presented another stimulus selected at random until participants viewed all 32 stimuli or decided to leave the experiment. Each participant began the experiment with an attention check stimulus. As specified in the pre-registered analysis, we removed from the analysis all participants who failed the attention check. The multimedia stimuli are drawn from the Presidential Deepfake Dataset (PDD) 53, which consists of 32 videos showing two United States presidents making political speeches. Half the videos are authentic videos that have not been altered by a deepfake algorithm. The other half have been fabricated to make the politicians appear to say something that they have not said. The fabricated videos were produced by writing a fabricated script, recording professional voice actors reading the script, and applying a deepfake lip-syncing algorithm 14. In order to validate the concordance and discordance of speeches, we conducted an independent survey in which 84 participants who passed an attention check rated each of the 32 transcripts for how well the political speeches match either politician's political views. Participants were instructed "For each statement, we want you to rank how closely the statement matches your understanding of President Joseph Biden or President Donald Trump's political views" and asked to provide a judgment on a 5-point Likert scale from "Strongly Disagree" (-2) to "Strongly Agree" (2) that "This statement matches President [Joseph Biden's/Donald Trump's] political viewpoint: [statement]." Participants' responses confirm that speeches designed to be concordant and discordant with the two politicians' views were indeed concordant and discordant with the average participant's perception of the politicians' views. The z-values of participants' responses to concordant and discordant speeches are -0.25 and 0.21, respectively, and this difference is statistically significant with p < 0.001 based on a t-test.
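To make this validation step concrete, the sketch below shows one way to run such an analysis. It is not the authors' code: the within-rater z-transformation, the random stand-in ratings, and all names are assumptions for illustration.

```python
# A minimal sketch of the concordance validation: z-score each rater's
# 5-point Likert responses, then compare z-values for concordant vs.
# discordant speeches with an independent-samples t-test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical stand-in data: ratings[rater, speech] on the -2..2 Likert scale.
n_raters, n_speeches = 84, 32
ratings = rng.integers(-2, 3, size=(n_raters, n_speeches)).astype(float)
is_concordant = np.array([True, False] * (n_speeches // 2))  # hypothetical labels

# Within-rater z-transformation removes differences in how raters use the scale
# (an assumption; the paper does not specify the exact normalization).
z = (ratings - ratings.mean(axis=1, keepdims=True)) / ratings.std(axis=1, keepdims=True)

concordant_z = z[:, is_concordant].ravel()
discordant_z = z[:, ~is_concordant].ravel()
t_stat, p_value = stats.ttest_ind(concordant_z, discordant_z)
print(f"mean z (concordant) = {concordant_z.mean():.2f}, "
      f"mean z (discordant) = {discordant_z.mean():.2f}, p = {p_value:.3g}")
```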
In this experiment, we transform each of the original videos from the PDD into 7 different forms of media: a transcript, an audio clip, a silent video, audio with subtitles, silent video with subtitles, video with audio, and video with audio and subtitles. As a result, there are 7 modality conditions, 32 unique speeches, and 224 unique stimuli. On the experiment website, the transcript appears as HTML text and the six other forms of media appear in a video player. The audio clip shows a black screen in the video player, and the audio clip with subtitles shows a black screen with subtitles at the bottom. We randomly assign the order in which the 32 unique political speeches are presented to participants, and each political speech is randomly assigned to one of the seven conditions. By randomly assigning the order of political speeches and the modality condition in which speeches are presented, we can identify the causal impact of media modality on participants' ability to discern misinformation. A total of 5,727 individuals participated in this experiment. We used the Prolific platform 54 to recruit 555 individuals from the United States, who completed 16,699 trials. In addition to the recruited participants, 5,172 individuals (76% of whom visited from outside the United States) found the experiment website organically and completed 45,093 trials. We focus our analysis on the 509 of the 555 recruited participants and 2,838 of the 5,172 non-recruited participants who passed the attention check. Many but not all participants responded to all 32 speeches: 482 recruited participants and 476 non-recruited participants viewed all 32 speeches. Before the experiment began, participants in the recruited cohort (but not the non-recruited cohort) responded to a baseline survey that included questions on political preferences, trust in media and politics, and questions from the Cognitive Reflection Test 52, a robust test for measuring an individual's tendency to reflect on questions before answering 55-57. The sample of recruited participants is balanced across political identities: 257 recruited participants self-report as Democrats, and the other 252 self-report as Republicans. Across all 224 stimuli, recruited and non-recruited participants correctly identified the stimuli in 75% and 69% of observations, respectively. We find the fabricated political speech transcripts and visual deepfake manipulations are generally difficult for participants to discern. Across the 32 text transcripts, accuracy ranges from 27% of trials on the least accurately identified transcript to 75% on the most accurately identified, with a median of 45%. Similarly, for silent videos without subtitles, the median speech is identified correctly in 63% of trials, and accuracy ranges from 38% to 87% of trials. In contrast, we find audio clips are easier to discern than text transcripts or silent videos: on audio clips without subtitles, the median speech is identified correctly in 78% of trials, and accuracy ranges from 60% to 88% of trials.
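As a companion to these summary statistics, the sketch below shows one way to compute per-speech accuracy within each modality condition from long-format trial data. The file and column names (responses.csv, modality, speech_id, correct) are hypothetical, not the authors' released schema.

```python
# A minimal sketch of the per-stimulus accuracy summaries reported above:
# accuracy per speech within a modality condition, then the min/median/max
# across the 32 speeches in each condition.
import pandas as pd

df = pd.read_csv("responses.csv")  # hypothetical long-format trial data

per_speech = (df.groupby(["modality", "speech_id"])["correct"]
                .mean()            # fraction of trials identified correctly
                .mul(100))

summary = per_speech.groupby(level="modality").agg(["min", "median", "max"])
print(summary.round(0))  # e.g., transcripts: min 27, median 45, max 75
```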
Figure 1 presents participants' weighted accuracy (Figure 1a), confidence (Figure 1b), perceived fabrications in fabricated speeches (Figure 1c), perceived fabrications in non-fabricated speeches (Figure 1d), and response duration (Figure 1e) across modality conditions. Weighted accuracy is participants' accuracy weighted by confidence (e.g., if a participant responded "82% confidence this is fabricated" and the participant is correct, then the participant is assigned a weighted accuracy score of 82; if the participant is incorrect, the participant is assigned a weighted accuracy score of 18). Confidence is participants' self-reported level of confidence, which ranges from 50 (just as likely as chance) to 100 (full confidence). A perceived fabrication is defined as a participant indicating 51% or higher confidence that a stimulus is fabricated. Response time is measured in seconds and winsorized at the 99th percentile to control for outliers, which are an artifact of participants who return to the experiment after an extended time.

We evaluate the marginal effect of each condition on participants' weighted accuracy via an ordinary least squares regression with robust standard errors clustered at the participant level following Abadie et al. (2017) 58. We find both recruited and non-recruited participants' accuracy increases as political speeches are presented with video and audio modalities. Recruited participants' accuracy is 57% (p < 0.001) on transcripts, 7 percentage points (p < 0.001) higher on silent videos, 9 percentage points (p < 0.001) higher on silent videos with subtitles, 19 percentage points (p < 0.001) higher on audio clips and audio clips with subtitles, and 25 percentage points (p < 0.001) higher on videos with audio and videos with audio and subtitles. 3 Similarly, non-recruited participants' accuracy is 53% on transcripts, 12-13 percentage points (p < 0.001) higher on silent videos and silent videos with subtitles, 20 percentage points (p < 0.001) higher on audio clips and audio clips with subtitles, and 27-28 percentage points (p < 0.001) higher on videos with audio and videos with audio and subtitles. Overall, participants are better at identifying whether an event actually happened when watching videos or listening to audio than when reading transcripts. In contrast to the high variability in accuracy across speeches and modality conditions, participants' confidence is less variable. On text transcripts, participants' mean confidence is 81%. Speeches presented via video and audio increase participants' confidence relative to text by 6 and 9 percentage points (p < 0.001), respectively, and by 12 percentage points (p < 0.001) together. As participants receive more information via video and audio, participants' weighted accuracy, confidence, discernment of fabricated speeches, and discernment of real speeches increase on average. However, we do not find any significant marginal effects of subtitles on any of the dependent variables for modality conditions that already include audio.
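The sketch below illustrates the outcome construction and regression described above: weighted accuracy, winsorized response times, and an OLS of weighted accuracy on modality condition with participant-clustered standard errors. It is a simplified stand-in for the authors' released code; the file and column names are assumptions.

```python
# A minimal sketch (hypothetical column names in a long-format trial table)
# of the outcome construction and the OLS regression with standard errors
# clustered at the participant level.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("responses.csv")  # hypothetical file name

# Weighted accuracy: the reported confidence when correct, else 100 minus it.
df["weighted_accuracy"] = df["confidence"].where(df["correct"] == 1,
                                                 100 - df["confidence"])

# Winsorize response times at the 99th percentile to tame the outliers from
# participants who leave and later return to the experiment.
df["response_time_w"] = df["response_time"].clip(
    upper=df["response_time"].quantile(0.99))

# Marginal effect of each modality condition relative to the held-out
# transcript condition, with participant-clustered robust standard errors.
model = smf.ols("weighted_accuracy ~ C(modality, Treatment('transcript'))",
                data=df)
result = model.fit(cov_type="cluster",
                   cov_kwds={"groups": df["participant_id"]})
print(result.summary())
```

Clustering at the participant level accounts for the fact that each participant contributes up to 32 correlated observations, which is why the design treats the participant, not the trial, as the unit of independent variation.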
The median response time across all stimuli was 24 seconds, which is 3 seconds longer than the average video length. The median response time for the silent, subtitled videos is 31 seconds, which is longer than the response time for all other modality conditions. Across all 7 modality conditions, the median response time for fabricated stimuli is shorter than the median response time for non-fabricated stimuli; fabricated text, video, and audio have 3.8 seconds (p < 0.001), 2.5 seconds (p < 0.001), and 3.7 seconds (p < 0.001) shorter response times than their non-fabricated counterparts.

Based on this experiment's large sample of 46,098 observations from participants who passed the attention check, the 224 stimuli have a mean of 206 observations each. This large sample size per stimulus provides high statistical power to evaluate, stimulus by stimulus, whether participants discern more accurately than chance. Specifically, 206 observations provide over 95% statistical power to detect a 10 percentage point increase beyond chance at the p < 0.05 threshold. We evaluate the degree to which participants' discernment surpasses random chance by running a binomial test on responses to each stimulus within a modality condition and applying a Bonferroni correction 59, which means multiplying each p-value by 32 (the number of speeches per modality condition) to correct for multiple hypothesis testing. After applying this correction, we find participants' discernment is statistically significantly better than chance (p < 0.05) on 7 of the 32 text transcripts and 18 of the 32 silent videos without subtitles. In particular, participants are better than chance (p < 0.05) on 8 of the 16 non-fabricated, silent videos without subtitles and 10 of the 16 fabricated, silent videos without subtitles. In other words, despite high statistical power, we do not find evidence that participants are better than chance on 6 of the 16 fabricated, silent videos without subtitles, 8 of the 16 non-fabricated, silent videos without subtitles, and 25 of the 32 text transcripts. When the information from the political speech transcript and video are combined in the silent, subtitled videos, we find participants discern better than chance (p < 0.05) on all 16 fabricated, silent videos with subtitles and 8 of the 16 non-fabricated, silent videos with subtitles. Likewise, the addition of audio significantly increases discernment rates; in all modality conditions with audio, participants discern better than chance (p < 0.05) on 29 to 32 of the 32 political speeches.
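A minimal sketch of this per-stimulus test follows, assuming per-stimulus counts of correct responses; the stimulus identifier and counts below are hypothetical.

```python
# A minimal sketch of the per-stimulus better-than-chance test: a two-sided
# binomial test against p = 0.5 for each stimulus within a modality condition,
# with a Bonferroni correction for the 32 speeches per condition.
from scipy.stats import binomtest

def better_than_chance(stimuli, n_tests=32, alpha=0.05):
    """stimuli: iterable of (stimulus_id, n_correct, n_trials) tuples."""
    results = {}
    for stim_id, n_correct, n_trials in stimuli:
        p = binomtest(n_correct, n_trials, p=0.5).pvalue
        p_adj = min(1.0, p * n_tests)  # Bonferroni: multiply by 32 speeches
        results[stim_id] = (p_adj, p_adj < alpha and n_correct / n_trials > 0.5)
    return results

# Hypothetical example: 140 correct out of 206 trials on one stimulus.
print(better_than_chance([("speech_01_silent", 140, 206)]))
```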
Figure 1c and Figure 1d show the distributions of discernment rates across modality conditions for fabricated and real videos. Like Figure 1a and Figure 1b, these plots show that regardless of whether the stimuli are fabricated, the addition of audio or video is associated with an increase in participants' discernment. However, we find slight differences in response bias: participants have a higher bias towards identifying text transcripts as real relative to all other modality conditions. Participants respond that text transcripts and silent videos without subtitles are fabricated in 37% and 48% of trials, respectively, while participants respond that the other 5 modality conditions are fabricated in 52% to 54% of trials.

In Figure 2, we present participants' marginal accuracy on transcripts, silent videos, and videos with audio relative to silent, subtitled videos for each of the 32 speeches. Figure 2a reveals that participants are mostly less accurate on text transcripts than on silent, subtitled videos. Likewise, Figure 2c shows participants are consistently more accurate on videos with audio than on silent, subtitled videos. In contrast, Figure 2b illustrates heterogeneity in participants' performance with and without subtitles. In the following section, we examine this heterogeneity along two dimensions: whether the video is fabricated and whether the speech content is discordant with the politician's identity.

Figure 2. Participants' accuracy on silent, subtitled videos is compared against accuracy on transcripts, silent videos, and videos with audio for each of the 32 speeches. Error bars represent 95% confidence intervals. The 32 speeches are ordered by the absolute value of the difference in accuracy between the silent, subtitled video and the modality condition to which it is being compared.

We evaluate how discordant messages influence participants' discernment by examining the interactions between discordance and modality conditions in the linear regressions on participants' weighted accuracy presented in Table 1 and Table 2 in the Appendix. We limit this analysis to recruited participants for two reasons: first, recruited participants are all from the United States, while the majority of non-recruited participants visited the website from outside the United States, and it is unclear how familiar non-recruited participants are with United States politicians' viewpoints; second, we only collected performance on the Cognitive Reflection Test (CRT) for recruited participants. When considering all 32 fabricated and real speeches together (see column 1 of Table 1 in the Appendix), we find participants are 4.7 percentage points (p = 0.002) more accurate on silent, subtitled videos than on the same videos without subtitles. However, we find participants are 5.0 percentage points (p = 0.018) less accurate on discordant silent, subtitled videos than on the same silent videos without subtitles. In other words, the addition of subtitles reduces discernment accuracy for political speeches that are discordant with the general public's perception of what politicians would say. In order to further evaluate this effect, we consider fabricated videos and non-fabricated videos separately in columns 2 and 3 of Table 1 in the Appendix. We find the negative effect of discordance on subtitled videos is driven by participants' discernment of non-fabricated videos. We find participants are 6.8 percentage points (p = 0.021) less accurate on discordant silent, subtitled videos that are not fabricated compared to the same silent videos without subtitles. In contrast, we do not find a statistically significant difference (p = 0.341) between participants' performance on discordant silent, subtitled videos that have been fabricated and the same silent videos without subtitles. The negative effect of subtitles on non-fabricated yet discordant silent videos indicates that the content of a message can change how participants weigh visual information. The heterogeneous effects of subtitles on the discernment of silent videos are robust to our specification of discordance. In Table 2 in the Appendix, we consider the same regressions as Table 1 except we replace the binary variable indicating discordance with a continuous variable for how discordant the speech is with the speaker, based on the independent survey in which 84 participants rated how well the political speeches match either politician's political views. The regressions in columns 2 and 3 of Table 2 in the Appendix present qualitatively similar results to Table 1.
When we consider discordance based on the public's perceived discordance, we find participants are 4.2 percentage points (p = 0.003) less accurate on discordant silent, subtitled videos that are not fabricated compared to the same silent videos without subtitles. Likewise, we do not find a statistically significant difference (p = 0.751) between participants' performance on discordant silent, subtitled videos that have been fabricated and the same silent videos without subtitles.

We find that participants' performance on the CRT moderates participants' discernment accuracy. In this analysis, the CRT score is a continuous variable ranging from 0 to 3, with 124 participants answering no questions correctly and 109, 122, and 154 participants answering 1, 2, and 3 questions correctly, respectively. For every question that participants answer correctly on the CRT, participants are 2.9 percentage points (p = 0.002) more accurate (see column 4 of Table 1 in the Appendix). Likewise, participants who respond correctly to all 3 items on the CRT are 8.7 percentage points (p = 0.002) more accurate than participants who respond incorrectly to all 3 items. In Figure 4 in the Appendix, we present the distribution of media truth discernment scores following Pennycook and Rand (2019) 47 for "intuitive" participants who incorrectly answered all 3 CRT items and "deliberative" participants who correctly answered all 3 CRT items. We also find that participants' performance on the CRT moderates the influence of subtitles on the discernment accuracy of discordant messages in silent videos. In columns 4-6 of Table 1 in the Appendix, we report regressions that include the same independent variables as columns 1-3 plus interactions of these independent variables with participants' scores on the CRT. As a visual aid, we present these results in Figure 3. In column 6, where we consider only non-fabricated videos, we find the coefficient on the interaction between "Discordant" and "Silent Subtitled Video" is -17.5 percentage points (p < 0.001): participants are 17.5 percentage points less accurate on non-fabricated, discordant silent, subtitled videos than on the same silent videos without subtitles, holding all else constant. The interaction between "CRT Score," "Discordant," and "Silent Subtitled Video" is 6.3 percentage points (p = 0.011), which means that for each correct response on the CRT, participants are 6.3 percentage points more accurate at identifying discordant silent, subtitled videos, holding all else constant. This means that participants who answered all 3 CRT items correctly would be 18.9 percentage points (p = 0.011) more accurate on discordant silent, subtitled videos than participants who failed to answer any CRT item correctly. This improvement of 18.9 percentage points for answering all CRT items correctly cancels out the 17.5 percentage point decrease associated with discordant silent, subtitled videos compared to the same silent videos without subtitles. In other words, perfect performance on the CRT moderates the negative effects of discordant content such that participants consider visual information and discern just as accurately on silent, subtitled videos as on the same silent videos without subtitles. These results are qualitatively similar when we replace the binary variable for discordance with the continuous variable for discordance in Table 2 in the Appendix. In column 6 of Table 1, the interaction between "CRT Score" and "Subtitled Silent Video" is negatively associated with discernment accuracy: for each CRT question answered incorrectly, participants are 5.3 percentage points (p = 0.004) more accurate at discerning concordant silent, subtitled videos, holding all else constant. This is consistent with a bias among participants who respond incorrectly to the CRT toward guessing that concordant speeches are real and discordant speeches are fake. However, we do not find this effect is robust to the alternative specification of discordance as a continuous variable in Table 2 in the Appendix.
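The sketch below illustrates the shape of this moderation analysis: the same participant-clustered OLS, now with CRT score fully interacted with the discordance and subtitle indicators on the subsample of non-fabricated videos. It is a simplified stand-in, not the authors' exact specification; the file and column names are assumptions.

```python
# A minimal sketch of the moderation analysis in columns 4-6: clustered OLS
# with CRT score fully interacted with the discordance and modality indicators.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("responses_recruited.csv")  # hypothetical file name
real = df[df["fabricated"] == 0]             # non-fabricated videos only

# The `*` operator expands to all main effects and interactions, including
# the three-way CRT x Discordant x Subtitled-Silent-Video term.
model = smf.ols("weighted_accuracy ~ crt_score * discordant * silent_subtitled",
                data=real)
result = model.fit(cov_type="cluster",
                   cov_kwds={"groups": real["participant_id"]})
print(result.params.filter(like=":"))        # show the interaction coefficients
```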
How do communication modalities mediate truth discernment? More specifically, does realistic visual information increase people's susceptibility to misinformation? Our results contrast with the conventional wisdom that video is more persuasive than text in convincing an audience that a fake event really happened 26. Instead, we find participants are significantly more accurate at assessing the authenticity of political speeches in videos than in transcripts. Our results cannot simply be explained by obvious, unrealistic visual manipulations in videos: participants are only 64% accurate at identifying manipulations in silent videos, and they do not perform better than chance on nearly half of the silent videos. These findings suggest that ordinary people can sometimes, but not always, recognize the visual inconsistencies created by lip-syncing deepfake manipulations. As such, the assessment of multimedia information involves both perceptual cues from video and audio and considerations about the content (e.g., the degree to which what is said matches participants' expectations of what the speaker would say, which is known as the expectancy violation heuristic 60). With the message content alone, participants are only slightly better than random guessing at 57% accuracy on average. With perceptual information from video and the message content via subtitles, participants are slightly more accurate (and more confident) at 66% accuracy on average, and with information from both video and audio, participants are even more accurate (and more confident) at 82% accuracy on average. This experiment examines a specific task: how well ordinary people can discern whether a short soundbite of a political speech by a well-known politician in text, audio, or video has been fabricated. In this experiment with real and fabricated stimuli produced by professional voice actors and a deepfake lip-syncing algorithm, we find silent videos with subtitles are easier to identify as real or fake than transcripts, audio without video is easier to identify than silent videos with subtitles, and video with audio is easier to identify than audio without video. These findings suggest that ordinary people are generally attentive to each communication modality when tasked to discern real from fake and have a relatively keen sense of what the two most recent US presidents sound like. As participants gain access to more information via audio and video, they make more accurate assessments as to whether a political speech has been fabricated. We find one notable exception to the finding that more information leads to higher discernment rates: political speeches that conflict with the public's perception of what a politician would say are harder to discern in silent, subtitled videos than in the same silent videos without subtitles.
This effect of discordant content is not driven by subtitles distracting participants; we do not find evidence of any effect of subtitles when audio is included. Instead, the heterogeneous effects of concordant and discordant content are a consequence of how participants handle cognitive dissonance and balance perceptual and content-based cues. We find that these effects are driven by responses to non-fabricated videos and are moderated by deliberative, reflective thinking as measured by the CRT. Fabricated videos differ from non-fabricated videos in how people can discern their authenticity. Fabricated videos involve visual manipulations, which can sometimes be explicitly identified (e.g., a glitch, a flicker, or mechanical and otherwise out-of-place lip movement). If someone finds a peculiar visual distortion, then that individual can be quite confident the video has been fabricated. In contrast, non-fabricated videos have not been visually manipulated, so there is no single piece of information to signify fabrication or authenticity. Furthermore, we find people take on average 2.5 to 3.8 seconds longer to respond to non-fabricated speeches than to fabricated speeches. If someone cannot find a visual distortion, then that individual cannot be perfectly certain that the video has or has not been fabricated; for example, the video may have been fabricated without any perceptible distortion, or perhaps the individual has yet to find the subtle visual distortion. This asymmetry between assessing fabricated and non-fabricated speeches exacerbates the "liar's dividend," whereby the general possibility that speeches can be fabricated calls into question whether any speech is fabricated and thus enables "liars to avoid accountability for things that are in fact true." 2 Clear articulation of the precise state-of-the-art algorithms and the contexts in which audio-visual content can be fabricated to be indistinguishable from the real thing can help inform how people assess the content they consume and reduce the effects of the "liar's dividend." We find that participants' performance on the CRT moderates the effects of subtitles on the discernment accuracy of silent videos. In particular, participants who correctly answered all three CRT items show no difference in discernment rates on discordant silent, subtitled videos relative to the same silent videos without subtitles. But for every CRT item that participants answer incorrectly, participants are 6.3 percentage points less accurate on real, discordant silent, subtitled videos than on the same silent videos without subtitles. In other words, reflective thinking moderates how participants balance what is said (the content of the speech) with how it is said (visual information). Our results show that the least reflective participants tend to rely on the expectancy violation heuristic and discount visual information more than the most reflective participants. Unlike for videos and transcripts, we cannot disentangle the content and perceptual information for audio modalities. Nevertheless, we find that the interaction between discordant speeches and any audio condition is negative after controlling for the level effects of discordance and any audio. This suggests that discordant media not only impair the incorporation of visual cues but may also impair attention to and incorporation of auditory cues.
We evaluate truth discernment on 224 stimuli made up of 32 different speeches across 7 modality conditions. This stimuli set is much larger than most stimuli sets in the psychology of media effects research 29, but it is still limited to only 32 different speeches and 16 fabricated speeches. We focused on one kind of deepfake manipulation: lip-syncing via the wav2lip algorithm. Future research may consider other deepfake manipulations like face swapping and head puppetry 61. Given the affordances of the wav2lip visual effects algorithm, videos where a single person faces forward and speaks are relatively easy to manipulate into a convincing fake; videos where a person is moving, turning their head, and interacting with other people require much more sophistication to fake convincingly. Future work may consider more heterogeneity in the speakers, the settings, and the synthesis algorithms 62. The danger of fabricated videos may come not from the average algorithmically produced deepfake but rather from a single, highly polished, and extremely convincing video. For example, hyper-realistic deepfakes like the Tom Cruise deepfakes on TikTok (see https://www.tiktok.com/@deeptomcruise) are produced by visual effects artists using both artificial intelligence algorithms and video editing software. While these hyper-realistic deepfakes may still contain manipulation artifacts (e.g., unattached earlobes that do not match Tom Cruise's attached earlobes 63), future work on the psychology of multimedia misinformation may consider hyper-realistic videos produced by visual effects studios in addition to algorithmically manipulated videos. Political deepfakes are most dangerous when people least expect information to be manipulated, and this experiment on multimedia truth discernment does not match the ecological realities that people typically face when confronted with fake news. In this experiment, 50% of the content is fake, and we explicitly inform participants of this base rate. In today's media ecosystem, fake news is relatively rare: a fraction of a percent 64, 65 of news is fake news. As such, this experiment is useful for studying how people discern multimedia information when attending to questions of accuracy, but it is less useful for understanding how people will share misinformation they read on social media. People are generally highly accurate in discerning the veracity of news headlines yet share fake news headlines because their attention is not focused on accuracy 66. On social media, video-based misinformation will often be designed to incorporate characteristics (e.g., fear, disgust, surprise, novelty) that divert people's focus from accuracy and make content go viral 67, 68. Given that multimedia misinformation may be both easier to discern and shared on social media more often than text-based media, more research is needed to understand how people allocate attention while browsing the Internet 69. Finally, discernment - how accurately people discern misinformation - is different from belief - how much people report they believe misinformation. It is possible (though quite peculiar) that someone could be highly accurate at discerning truth from falsehood while also tending to believe the fabricated content and disbelieve the true content.
For example, research on fake news headlines and articles finds that people are better at discerning news concordant with their political leanings than discordant news, while also believing concordant news more often than discordant news 49. While video and audio manipulations have been hypothesized to make speeches more believable and to make it harder for the general public to discern an event's or speech's authenticity, we find that lip-syncing deepfake manipulations based on audio recordings by professional voice actors do not reduce participants' ability to discern fake political speeches from real ones relative to text transcripts. The finding that fabricated videos of political speeches are easier to discern than fabricated text transcripts highlights the need to re-introduce and explain the oft-forgotten second half of the "seeing is believing" adage. In 1732, the old English adage appeared as: "Seeing is believing, but feeling is the truth." 70 Here, "feeling" refers not to emotion but to experience. Since the advent of photography, people have known that what they see is not always the truth and that further assessment is often necessary 71-73. In fact, we find that more information via communication modalities - text transcripts vs. silent, subtitled video vs. video with audio - enables people to more accurately discern fabricated and real political speeches. We suggest media literacy programs consider curricula that encourage reflective thinking and remind people to consider both how something is said and what is said. For content moderation systems that flag misinformation with AI-based decision support, we suggest that models incorporate explanations to direct human attention to both the content 74 and perceptual cues (e.g., low-level pixel features, high-level semantic features, and biometric-based features 75). Finally, the design of this experiment, where speeches are randomly assigned to permutations of communication modalities, can serve as a guide for future research on the media effects of misinformation. In particular, the comparison between discernment on transcripts, silent videos without subtitles, and silent videos with subtitles can help researchers disentangle how people weigh what is said and how it is said in truth discernment tasks.

This research complies with all relevant ethical regulations, and the Massachusetts Institute of Technology's Committee on the Use of Humans as Experimental Subjects determined this study to fall under Exempt Category 3 - Benign Behavioral Intervention. This study's exemption identification number is E-3105. All participants were informed that "Detect Fakes is an MIT research project. All guesses will be collected for research purposes. All data for research were collected anonymously. For questions, please contact detectfakes@mit.edu. If you are under 18 years old, you need consent from your parents to use Deep Fakes." Most participants arrived at the website via organic links on the Internet. For participants recruited from Prolific, we compensated participants at a rate of $9.78 an hour and provided bonus payments of $5 to the top 1% of participants. Before beginning the experiment, all participants from Prolific were also provided a research statement: "The findings of this study are being used to shape science. It is very important that you honestly follow the instructions requested of you on this task, which should take a total of 15 minutes.
Check the box below based on your promise:" with two options, "I promise to do the tasks with honesty and integrity, trying to do them uninterrupted with focus for the next 15 minutes." or "I cannot promise this at this time." Participants who responded that they could not promise this were redirected to the end of the experiment.

The datasets and code generated and analyzed during the current study are available in our public GitHub repository, https://github.com/mattgroh/fabricated-political-speeches (the repository will be set to public upon peer-reviewed publication). All PDD videos are available on YouTube, with links provided in the Presidential Deepfakes Dataset paper 53.

Table 2. Ordinary least squares regressions with robust standard errors clustered on participants. Weighted accuracy is the dependent variable. The "Silent Video" condition is held out and represented by the constant term. The "Discordance (Continuous)" variable is computed by calculating the z-transformation of participants' mean response on a 5-point Likert scale for how well a statement aligns with the public's perception of the politicians' viewpoints.

References

1. The social impact of deepfakes
2. Deep fakes: A looming challenge for privacy, democracy, and national security
3. Deepfakes and cheap fakes
4. The Deepfake Detection Dilemma: A Multistakeholder Exploration of Adversarial Dynamics in Synthetic Media
5. Protecting world leaders against deep fakes
6. AI-generated characters for supporting personalized learning and well-being
7. Misinformation, disinformation, and online propaganda. Social Media and Democracy: The State of the Field, Prospects for Reform
8. A style-based generator architecture for generative adversarial networks
9. Analyzing and improving the image quality of StyleGAN
10. Human detection of machine-manipulated media
11. Resolution-robust large mask inpainting with Fourier convolutions
12. Neural voice cloning with a few samples
13. NAUTILUS: A versatile voice cloning system
14. A lip sync expert is all you need for speech to lip generation in the wild
15. LipSync3D: Data-efficient learning of personalized 3D talking faces from video using pose and lighting normalization
16. Peirce on signs: Writings on semiotic
17. The role of images in framing news stories
18. Seeing is believing: Communication modality, anger, and support for action on behalf of out-groups
19. If a picture is worth a thousand words is video worth a million? Differences in affective and cognitive processing of video and text cases
20. Rich media, poor media: The impact of audio/video vs. text/picture testimonial ads on browsers' evaluations of commercial web sites and online products
21. Video killed the news article? Comparing multimodal framing effects in news videos and articles
22. Images and misinformation in political groups: Evidence from WhatsApp in India
23. How WhatsApp leads mobs to murder in India
24. Seeing is believing: Is video modality more powerful in spreading fake news via online messaging apps?
25. The (minimal) persuasive advantage of political video over text
26. The MAIN model: A heuristic approach to understanding technology effects on credibility
27. AI-mediated communication: Definition, research agenda, and ethical considerations
28. Political Deepfake Videos Misinform the Public, But No More than Other Fake Media (preprint, Open Science Framework)
29. The use of media in media psychology
30. Seeing is believing: How people fail to identify fake images on the web
31. A picture paints a thousand lies? The effects and mechanisms of multimodal disinformation and rebuttals disseminated via social media
32. AI-synthesized faces are indistinguishable from real faces and more trustworthy
33. The DeepFake Detection Challenge (DFDC) Dataset
34. Deepfake detection by human crowds, machines, and machine-informed crowds
35. Fooled twice: People cannot detect deepfakes but think they can
36. Seeing is believing: Exploring perceptual differences in deepfake videos
37. Deepfakes and disinformation: Exploring the impact of synthetic political video on deception, uncertainty, and trust in news
38. Do (microtargeted) deepfakes have real effects on political attitudes?
39. Mediated misinformation: Questions answered, more questions to ask
40. Fighting misinformation on social media using crowdsourced judgments of news source quality
41. Source v. content effects on judgments of news believability
42. Fake images: The effects of source, intermediary, and digital media literacy on contextual assessment of image credibility online
43. Emphasizing publishers does not effectively reduce susceptibility to misinformation on social media
44. The role of source, headline and expressive responding in political news evaluation
45. Perceived truth of statements and simulated social media postings: An experimental investigation of source credibility, repeated exposure, and presentation format
46. Combating fake news on social media with source ratings: The effects of user and expert reputation ratings
47. Lazy, not biased: Susceptibility to partisan fake news is better explained by lack of reasoning than by motivated reasoning
48. Understanding and reducing online misinformation across 16 countries on six continents
49. The psychology of fake news
50. The science of fake news
51. Visual mis- and disinformation, social media, and democracy
52. Cognitive reflection and decision making
53. The Presidential Deepfakes Dataset
54. Prolific.ac - A subject pool for online experiments
55. Is the cognitive reflection test a measure of both reflection and intuition?
56. Performance on the cognitive reflection test is stable across time
57. The cognitive reflection test is robust to multiple exposures
58. When should you adjust standard errors for clustering?
59. Teoria statistica delle classi e calcolo delle probabilità
60. Social and heuristic approaches to credibility evaluation online
61. DeepFake detection: Current challenges and next steps
62. Behavioural science is unlikely to change the world without a heterogeneity revolution
63. Detecting deep-fake videos from aural and oral dynamics
64. Evaluating the fake news problem at the scale of the information ecosystem
65. Measuring the news and its impact on democracy
66. Shifting attention to accuracy can reduce misinformation online
67. What makes online content viral?
68. The spread of true and false news online
69. Studying human attention on the Internet
70. Gnomologia: Adagies and Proverbs; Wise Sentences and Witty Sayings, Ancient and Modern
71. Visual persuasion: The role of images in advertising
72. Digital doctoring: How to tell the real from the fake
73. The Commissar Vanishes: The Falsification of Photographs and Art in Stalin's Russia
74. On Human Predictions with Explanations and Predictions of Machine Learning Models: A Case Study on Deception Detection
75. Watch those words: Video falsification detection using word-conditioned facial motion

Acknowledgments

The authors would like to acknowledge funding from MIT Media Lab member companies, thank JL Cauvin and Austin Nasso (@jlcauvin and @austinnasso on TikTok) for providing voice impressions, and thank the following individuals for helpful feedback: Andrew Lippman, David Rand, Gordon Pennycook, Rahul Bhui, Yunhao (Jerry) Zhang, Ziv Epstein, and members of the Affective Computing lab at the MIT Media Lab and the Human Cooperation lab at the MIT Sloan School of Management.