key: cord-0149096-bu5pn8l2
authors: Abilov, Anton; Hua, Yiqing; Matatov, Hana; Amir, Ofra; Naaman, Mor
title: VoterFraud2020: a Multi-modal Dataset of Election Fraud Claims on Twitter
date: 2021-01-20
journal: nan
DOI: nan
sha: 402b91a6765c056db7b76e14c3ecbdd8fdb72b8f
doc_id: 149096
cord_uid: bu5pn8l2

The wide spread of unfounded election fraud claims surrounding the U.S. 2020 election resulted in the undermining of trust in the election, culminating in violence inside the U.S. Capitol. Under these circumstances, it is critical to understand discussions surrounding these claims on Twitter, a major platform where the claims disseminate. To this end, we collected and release the VoterFraud2020 dataset, a multi-modal dataset with 7.6M tweets and 25.6M retweets from 2.6M users related to voter fraud claims. To make this data immediately useful for a wide range of researchers, we further enhance the data with cluster labels computed from the retweet graph, user suspension status, and perceptual hashes of tweeted images. We also include in the dataset aggregated information for all external links and YouTube videos that appear in the tweets. Preliminary analyses of the data show that Twitter's ban actions mostly affected a specific community of voter fraud claim promoters, and expose the most common URLs, images and YouTube videos shared in the data.

Free and fair elections are the foundation of every democracy. The 2020 presidential election in the United States was probably one of the most consequential and contentious such events. Two-thirds of the voting-eligible population voted, resulting in the highest turnout in the past 120 years (Schaul, Rabinowitz, and Mellnik 2020). The Democratic Party candidate Joe Biden was elected as the president.

Unfortunately, efforts to delegitimize the election process and its results were carried out before, throughout and after the election. Mostly unfounded claims of voter fraud (Frenkel 2020) were spread both through public statements by politicians and on social media platforms. As a result, 34% of Americans said that they did not trust the election results as of December 2020 (NPR 2020). Voter fraud claims without credible evidence have great ramifications for both the integrity of the election and the stability of U.S. democracy. On January 6th, 2021, believing that the election was 'stolen', mobs breached the U.S. Capitol while Congress voted to certify Biden as the winner of the election.

Social media platforms like Facebook, Twitter, YouTube and Reddit play a significant role in political events (Vitak; Allcott and Gentzkow 2017), and the 2020 election was no exception. In particular, Twitter has been the focus of public and media attention as a prominent public square where ideas are adopted and claims, false or true, are spread (Vosoughi, Roy, and Aral 2018; Grinberg et al. 2019). It is thus important to understand the participants, discussions, narratives, and allegations around voter fraud claims on this specific platform.

Copyright © 2021, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

In this work, we release VoterFraud2020, a multi-modal Twitter dataset of 7.6M tweets and 25.6M retweets that are related to voter fraud claims. Using a manually curated set of keywords (e.g., "voter fraud" and "#stopthesteal") that was further expanded using a data-driven approach, we streamed Twitter activities between October 23rd and December 16th, 2020. We performed various validations on the limits of our stream, given Twitter's API constraints (Morstatter et al. 2013), and estimate that we were able to retrieve around 60% of the data containing our crawled keywords.

We further enhanced the VoterFraud2020 dataset in order to make it accessible for a broader set of researchers and future research: (1) We cluster users according to their retweeting dynamics and release the cluster labels; (2) Given Twitter's widespread post-election suspension action, we crawl and include the user status as of January 10th, 2021; (3) We compute and share the perceptual hashes of 168K images that appeared in the data; (4) We aggregate and share metadata about 138K external links that appeared in the tweets, including 12K unique YouTube videos. Our dataset also allows researchers to calculate the amount of Twitter interactions with the collected tweets, users, and media items, including number of retweets and quotes from various clusters, or from suspended users.

A preliminary analysis finds a significant cluster of users who were promoting the election fraud related claims, with nearly 7.8% of them suspended in January. The suspensions focused on a specific community within the cluster. A simple analysis of the distribution of images, based on visual similarity, exposes that the most broadly shared (by number of tweets) and the most retweeted images are different. Although recent research has shown that voter fraud claims are pushed mainly by mass media (Benkler et al. 2020) , we also find that external links referenced by promoters of the claims point mostly to low-quality news websites, streaming services, and YouTube videos. Some of the widespread videos claiming 'evidence' of voter fraud were published by surprisingly small channels. Most strikingly, all of the top ten channels and videos spreading voter fraud claims were still available on YouTube as of January 11th, 2021.

We believe that the release of VoterFraud2020, the largest public dataset of Twitter discussions around the voter fraud claims, with the enhanced labels and data, will help the broad research community better understand this important topic at a critical time.

Our data collection process involved streaming Twitter data using a data-driven manually curated set of keywords and hashtags. We report on the span and volume of the collected data, as well as on analyses estimating its coverage.

We used a data-driven approach to generate a list of keywords and hashtags related to election fraud claims in an iterative manner. We started with a single phrase and two derived keywords: voter fraud and #voterfraud. We first used a convenience sample of 11M political tweets consisting of the tweets of 2,262 U.S. political candidates and the replies to those tweets, collected between July 21st and Oct 22nd, 2020 using the Twitter Streaming API (Twitter 2019). We then identified hashtags that co-occur with our meta-seed keywords, voter fraud and voterfraud. We selected all hashtags that appeared in at least 10 tweets and co-occurred with either of the meta-seed keywords at least 50% of the time. From the resulting set, we manually filtered out those that were not directly relevant to voter fraud. To this end, two members of the research team reviewed the hashtags, including, if needed, searching for them on Twitter to see whether they produce relevant results. Only the hashtags that were agreed on by both evaluators were added, resulting in an initial set of hashtags that was added to the two original keywords.
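
The selection rule described above can be sketched as follows. This is a minimal illustration rather than the authors' code: it assumes each tweet is reduced to a set of hashtags plus a flag saying whether it matched a meta-seed keyword, it reads "co-occurred at least 50% of the time" as the share of a hashtag's tweets that also match a seed, and the function name and toy data are hypothetical.

```python
from collections import Counter

def candidate_hashtags(tweets, min_tweets=10, min_ratio=0.5):
    """Return hashtags seen in >= min_tweets tweets whose share of
    seed-matching tweets is >= min_ratio.

    `tweets` is an iterable of (hashtags, matches_seed) pairs, where
    `hashtags` is a set of strings and `matches_seed` is a bool.
    """
    total = Counter()       # tweets per hashtag
    with_seed = Counter()   # tweets per hashtag that also match a seed keyword
    for hashtags, matches_seed in tweets:
        for tag in hashtags:
            total[tag] += 1
            if matches_seed:
                with_seed[tag] += 1
    return sorted(
        tag for tag, n in total.items()
        if n >= min_tweets and with_seed[tag] / n >= min_ratio
    )

# Hypothetical toy data: #ballotfraud co-occurs with a seed in 9 of 12 tweets,
# #election2020 in 2 of 20, and #rare appears in only 3 tweets.
toy = (
    [({"ballotfraud"}, True)] * 9 + [({"ballotfraud"}, False)] * 3
    + [({"election2020"}, True)] * 2 + [({"election2020"}, False)] * 18
    + [({"rare"}, True)] * 3
)
selected = candidate_hashtags(toy)
```

Only the automatic filter is shown; the manual relevance review by two team members has no code equivalent.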

We computed the Jaccard coefficient between each of our seed hashtags and all other hashtags that appeared in the new stream. We added to our set all hashtags that had a Jaccard coefficient greater than 0.001 with any of the seed hashtags. Three members of the team again reviewed this list by 1) excluding hashtags that were not related to voter fraud, 2) adding corresponding keywords of the hashtags (e.g., #discardedballots corresponds to discarded ballots), and 3) adding relevant hashtags or keywords that the researchers observed while searching for hashtags from the generated list. Both the seed list and the final list of keywords and hashtags we used for streaming are included in Appendix A (Table 3).
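
The Jaccard-based expansion can be illustrated with a short sketch. It assumes, as our own reading, that the coefficient is computed over the sets of tweet IDs in which each hashtag appears; the helper names and toy data are hypothetical.

```python
def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def expand_seeds(tag_to_tweets, seeds, threshold=0.001):
    """Return non-seed hashtags whose Jaccard coefficient with any seed
    hashtag exceeds `threshold`.

    `tag_to_tweets` maps a hashtag to the set of tweet IDs containing it.
    """
    expanded = set()
    for tag, tweet_ids in tag_to_tweets.items():
        if tag in seeds:
            continue
        for seed in seeds:
            if jaccard(tweet_ids, tag_to_tweets.get(seed, ())) > threshold:
                expanded.add(tag)
                break
    return expanded

# Hypothetical toy data: #stopthesteal shares two tweets with the seed.
tag_to_tweets = {
    "voterfraud": {1, 2, 3, 4},
    "stopthesteal": {3, 4, 5},
    "cats": {9},
}
new_tags = expand_seeds(tag_to_tweets, seeds={"voterfraud"})
```

The resulting candidates would still need the manual review described above before being added to the streaming list.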

We collected data using the Twitter streaming API (Twitter 2019). The VoterFraud2020 dataset includes tweets from 17:00, October 23rd, 2020 to 13:00, December 16th, 2020. We expanded the keywords list on Oct. 31st with additional keywords, and added #stopthesteal as it started trending on November 3rd. While streaming, we stored each tweet's metadata (e.g., user ID, text, timestamp). We also downloaded all image media items included in the tweets.

In total, we collected 3,781,524 original tweets, 25,566,698 retweets, and 3,821,579 quote tweets (i.e. tweets that include a reference to another tweet) discussing election fraud claims. Note that quote tweets are included in the Twitter stream when either the new tweet or the referenced (quoted) tweet include one of the keywords or hashtags on the list. In total, we collected tweets from 2,559,018 users who posted, shared or quoted one or more tweets with these keywords.

Since the Twitter streaming API provides only a sample of the tweets, especially for large-volume keywords (Morstatter et al. 2013), we performed multiple tests to evaluate and estimate the coverage of the VoterFraud2020 dataset. This analysis suggests that the dataset covers over 60% of the content shared on Twitter using the keywords we tracked.

Retweet and quote coverage. We evaluated the coverage of retweet and quote tweets by comparing the counts of these objects in the stream to Twitter's metadata. When a new retweet for an original tweet appears in the stream, the API returns the tweet's metadata including the current retweet count and quote count of the original tweet. In other words, if an original tweet t_i is retweeted, it will appear in the stream as a retweet rt_j, and the metadata for rt_j will include the total number of retweets of t_i so far. From this metadata, we define the retweet coverage as the ratio of the total number of retweets (rt objects) streamed and stored in our dataset to the sum of the retweet counts of the original tweets t, as returned by the API in the latest rt retweet of each original tweet. The quote coverage is defined analogously. According to this analysis, the VoterFraud2020 dataset captured 63.2% of the retweets and 62.6% of the quote tweets. These findings compare favorably with previous work that shows a single API client captures only 37.6% of the retweets through the Streaming API (Morstatter et al. 2013).
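
The retweet coverage definition above translates directly into a few lines of code. The sketch below assumes each streamed retweet is available as a tuple of (original tweet ID, timestamp, retweet count reported by the API at that time); this record layout and the function name are hypothetical.

```python
def retweet_coverage(streamed_retweets):
    """Ratio of streamed retweets to the retweet counts reported by the API.

    `streamed_retweets` is an iterable of
    (original_tweet_id, timestamp, reported_retweet_count) tuples.
    For each original tweet, only the count attached to its latest
    streamed retweet enters the denominator.
    """
    latest = {}   # original tweet id -> (timestamp, reported count)
    streamed = 0
    for original_id, timestamp, reported_count in streamed_retweets:
        streamed += 1
        if original_id not in latest or timestamp > latest[original_id][0]:
            latest[original_id] = (timestamp, reported_count)
    expected = sum(count for _, count in latest.values())
    return streamed / expected if expected else 0.0

# Hypothetical toy stream: two retweets of t1 (the API last reported 4)
# and one retweet of t2 (the API reported 1).
coverage = retweet_coverage([("t1", 1, 2), ("t1", 5, 4), ("t2", 3, 1)])
```

Here three retweets were streamed against five reported by the API, giving a coverage of 0.6. Quote coverage would be computed the same way from the reported quote counts.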

Comparison with #Election2020. To further evaluate the coverage on the voter fraud tweets, we compared our dataset with a previously published Twitter dataset of the U.S. 2020 election (Chen, Deb, and Ferrara 2020). The creators of the #Election2020 dataset used the streaming API to track 168 keywords that are broadly related to the election and 57 accounts that are tied to candidates running for president.

As in VoterFraud2020, the keyword 'voter fraud' was also used to collect data for #Election2020. We used this overlap to estimate our coverage. First, we can directly compare the relative volume and overlap between the 'voter fraud' tweets in both datasets. We expect VoterFraud2020 to have a higher volume of such tweets because of its more focused set of keywords. Second, if we assume sampling for both streams is independent and random, we could estimate the coverage of VoterFraud2020 by looking at the proportion of #Election2020 tweets that also appear in our data.

To this end, we extracted all tweets and retweets that contain this keyword from both datasets posted on two days following the November 3rd election date: November 6th and November 13th. The analysis, performed on December 17th, was limited to two days as we had to obtain the content of the tweets of the #Election2020 dataset by "hydrating" them (i.e., using the tweet IDs to get the full tweet text using the Twitter API). We were unable to hydrate the full data, presumably due to inactive accounts and deleted tweets. The hydration yielded 92.4% of the #Election2020 data on November 6th (a total of 1.4M tweets/3.5M retweets), and 91.1% of the data on November 13th (1.3M tweets/3M retweets).

Figure 1: Coverage comparison between our dataset and #Election2020 for tweets containing 'voter fraud'. (Rows: Nov 6th, Nov 13th. X-axis: 0% to 100%. Legend: VoterFraud2020, In both datasets, #Election2020.)

In total, our VoterFraud2020 data includes 45,322 'voter fraud' related tweets on November 6th, 2.3 times as many as recorded in #Election2020. The ratio is even higher on November 13th, when we obtained 47,313 tweets, 3.1 times as many as in #Election2020. Figure 1 breaks down the coverage by dates (separated by rows), in the two datasets (by different colors). From left to right, the bars show the percentages of tweets that are available only in our dataset (dark blue), that are available in both datasets (light blue), and that are available only in #Election2020 (yellow). On any given day, the VoterFraud2020 dataset contains substantially more tweets related to voter fraud, as compared to #Election2020, especially when the estimated total volume is lower. On November 13th (second row), VoterFraud2020 contained 95.7% of the combined data (left two bars) while #Election2020 only collected 30.7% (right two bars) of the tweets. These numbers also indicate that VoterFraud2020's sample includes 32.1% of the related samples in #Election2020 on November 6th and 85.9% on November 13th. We acknowledge that these two numbers are not consistent, presumably because of November 6th's much higher volume of activity. If these samples are indeed independent, though, it means that our lower bound of coverage is November 6th's 32.1%.
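
The November 13th estimate can be reproduced from the two percentages quoted above. The sketch below is a worked check under the stated independence assumption, not the authors' code; the function name is hypothetical. Because both shares are expressed relative to the combined data, their overlap is their sum minus 100%.

```python
def coverage_from_union_shares(share_ours, share_other):
    """Estimate the fraction of the other sample that also appears in ours.

    Both arguments are percentages of the combined (union) data, so the
    overlap between the two samples is share_ours + share_other - 100.
    """
    overlap = share_ours + share_other - 100.0
    if overlap < 0 or share_other <= 0:
        raise ValueError("shares are inconsistent with a union of two samples")
    return overlap / share_other

# November 13th figures quoted in the text (percent of the combined data).
nov13_estimate = coverage_from_union_shares(95.7, 30.7)  # roughly 0.86
volume_ratio = 95.7 / 30.7                               # roughly 3.1
```

The result agrees with the reported 85.9% up to rounding of the inputs, and the volume ratio agrees with the reported factor of 3.1.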

Based on these coverage analyses, we conclude that VoterFraud2020 is, at the time of submission, the largest known public Twitter dataset of voter fraud claims and discussions.

To ensure the reusability of our data, we took the following steps to enhance the dataset. In addition to raw streaming data, we clustered users according to their retweet dynamics and release the cluster labels. We also queried Twitter for the user status on January 10th, and share the user status as active/not-found/suspended. Furthermore, to enable research on visual misinformation, we encode all images shared in the tweets with perceptual hashes. Finally, we release the URLs and the metadata of the YouTube videos that appeared in our dataset.

Retweet Graph Communities. Retweet networks have been frequently analyzed in previous works in order to understand political conversations on Twitter (Arif, Stewart, and Starbird 2018; Cherepnalkoski and Mozetič 2016). Using community detection algorithms, researchers are able to study key players, sharing patterns and content on different sides of a discussion surrounding a heated political topic.

We first constructed a retweet graph of the VoterFraud2020 dataset, where nodes represent users and directed edges correspond to retweets between the users. The direction of an edge corresponds to the direction of the information spreading in the retweet relation. Edges are weighted according to the number of times the corresponding source user has been retweeted. The resulting network consists of 1,887,736 nodes and 16,718,884 edges.
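
A minimal sketch of this graph construction follows. It assumes retweets are available as (retweeter, original author) pairs and interprets the edge weight as the number of times the target user retweeted the source user; both the input layout and this reading of the weighting are assumptions.

```python
from collections import Counter

def build_retweet_graph(retweets):
    """Build a weighted, directed retweet graph.

    `retweets` is an iterable of (retweeter_id, original_author_id) pairs.
    Edges point from the original author to the retweeter, i.e. in the
    direction in which information spreads.
    """
    edges = Counter()
    for retweeter, author in retweets:
        edges[(author, retweeter)] += 1
    nodes = {user for edge in edges for user in edge}
    return nodes, edges

# Hypothetical toy data: b retweets a twice, c retweets a once, a retweets c once.
nodes, edges = build_retweet_graph(
    [("b", "a"), ("b", "a"), ("c", "a"), ("a", "c")]
)
```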

To detect communities within the graph, we used the Infomap community detection algorithm (Bohlin et al. 2014), which captures the flow of information in directed networks. Using the default parameters, the algorithm produces thousands of clusters. By excluding all clusters that contain fewer than 1% of the nodes, we are left with 90% of all nodes, which are clustered into five communities (see Figure 2a).
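
The size-based filtering step can be sketched independently of the community detection itself. The code below does not run Infomap; it assumes a node-to-cluster assignment has already been computed, and the function name and toy data are hypothetical.

```python
from collections import Counter

def keep_large_clusters(node_to_cluster, min_share=0.01):
    """Drop clusters holding fewer than `min_share` of all nodes.

    Returns the filtered node-to-cluster mapping and the fraction of
    nodes that remain assigned to a cluster.
    """
    n_nodes = len(node_to_cluster)
    sizes = Counter(node_to_cluster.values())
    kept = {c for c, size in sizes.items() if size / n_nodes >= min_share}
    filtered = {node: c for node, c in node_to_cluster.items() if c in kept}
    return filtered, len(filtered) / n_nodes

# Hypothetical toy assignment: two large clusters and two singleton clusters.
assignment = {}
assignment.update({f"a{i}": "A" for i in range(150)})
assignment.update({f"b{i}": "B" for i in range(48)})
assignment.update({"c0": "C", "d0": "D"})
filtered, retained_share = keep_large_clusters(assignment)
```

In this toy case the two singleton clusters fall below the 1% threshold and 99% of the nodes are retained.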

In Figure 2b, we visualize the retweet network using the Force Atlas 2 layout in Gephi (Bastian, Heymann, and Jacomy 2009), using a random sample of 44,474 nodes and 456,372 edges. The nodes are colored according to their computed community as described in Figure 2a. Edges are colored by their source node. The visualization indicates that the nodes are split between two distinct clusters: community 0 (blue) on the left and communities 1, 2, 3 and 4 on the right. By examining the top users in each community, we conclude that community 0 mostly consists of accounts that detract the voter fraud claims, while the communities on the right consist of accounts that promote the voter fraud claims. Most of the tweets from these users are written in English, except for users in Community 3 who mainly post tweets in Japanese and users in Community 4 who write in Spanish. Community 2 is more deeply embedded in the promoter cluster compared to Community 1, as we observe tweets from Community 1 being retweeted by Community 0 on the left, but not from Community 2. We include the list of top 5 Twitter accounts in each community by the number of community retweets in the Appendix.

For brevity, in the following analyses, we refer to the cluster on the left as the detractor cluster, and the cluster comprising communities 1, 2, 3 and 4 on the right as the promoter cluster. Note that due to the partisan nature of U.S. politics, most of the promoter users are likely aligned with right-leaning politics, and detractor users with left-leaning politics.

To identify users that are prominent within each of these two clusters, we calculate the closeness centrality of the user nodes in each cluster. In a retweet network this metric can be interpreted as a user's ability to spread information to other users in the network (Okamoto, Chen, and Li 2008). We compute the top-k closeness centrality to find the 10,000 most central nodes within the detractor and promoter clusters (Bisenius et al. 2017).
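
To make the metric concrete, the sketch below computes closeness centrality by breadth-first search over outgoing edges and then takes the k highest-scoring nodes. It is a naive illustration, not the pruned top-k algorithm cited above: it ignores edge weights, scans every node, and uses the Wasserman-Faust scaling for nodes that cannot reach the whole graph, all of which are our assumptions.

```python
from collections import deque
import heapq

def out_closeness(adj, source, n_nodes):
    """Closeness of `source` based on shortest paths along outgoing edges.

    With r reachable nodes and total distance d, the score is
    (r / (n_nodes - 1)) * (r / d), or 0.0 if nothing is reachable.
    """
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    reachable = len(dist) - 1
    total = sum(dist.values())
    if reachable == 0 or n_nodes < 2:
        return 0.0
    return (reachable / (n_nodes - 1)) * (reachable / total)

def top_k_closeness(adj, k):
    """Return the k (node, score) pairs with the highest closeness."""
    nodes = set(adj) | {v for targets in adj.values() for v in targets}
    scores = {u: out_closeness(adj, u, len(nodes)) for u in nodes}
    return heapq.nlargest(k, scores.items(), key=lambda kv: (kv[1], kv[0]))

# Hypothetical toy graph: a is retweeted by b and c, b is retweeted by d.
adj = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}
top_two = top_k_closeness(adj, 2)
```

In the toy graph, user a reaches all three other users within two hops and ranks first, ahead of b, who reaches only d.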

We release the author's community label of each tweet, the community label of each user, and a user's closeness centrality in the detractor and promoter clusters. We also include two additional metrics: retweet count by community X and quote count by community X. For a tweet t_i, the retweet count by community X is the total number of retweets rt_i it received from users u_X in community X. The metric is computed analogously for quotes.
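
The per-community retweet count can be sketched as a simple aggregation. The input layout, a list of (original tweet ID, retweeter ID) pairs plus a user-to-community mapping, is assumed for illustration.

```python
from collections import Counter, defaultdict

def retweets_by_community(retweets, user_community):
    """Count, for each original tweet, the streamed retweets per community.

    `retweets` is an iterable of (original_tweet_id, retweeter_id) pairs.
    `user_community` maps user IDs to a community label; retweets by users
    without a community label are skipped.
    """
    counts = defaultdict(Counter)
    for tweet_id, retweeter in retweets:
        community = user_community.get(retweeter)
        if community is not None:
            counts[tweet_id][community] += 1
    return counts

# Hypothetical toy data: u9 was not assigned to any of the main communities.
user_community = {"u1": 0, "u2": 2, "u3": 2}
counts = retweets_by_community(
    [("t1", "u1"), ("t1", "u2"), ("t1", "u3"), ("t1", "u9"), ("t2", "u2")],
    user_community,
)
```

The same function applies to quotes if the input pairs are (quoted tweet ID, quoting user ID).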

Labeling Suspended and Deleted Users. When the Electoral College was set to confirm the election results on January 6th, 2021, the allegations of voter fraud took a dramatic turn, which culminated in the storming of the U.S. Capitol. Subsequently, Twitter took a harder stance on moderating content on its platform and suspended at least 70,000 accounts that were engaged in propagating conspiracy theories and sharing QAnon content (Twitter 2021). This ban has substantial implications for researchers seeking to understand the spread of voter fraud allegations on Twitter, since the Twitter API does not allow the "hydration" of tweets from suspended users. In order to understand the distribution of suspensions within our dataset, we queried the updated user status of all users in our dataset on January 10th, a few days following the ban. The Twitter API returns a user status that indicates if the user is active, suspended or not found (presumably deleted). In total, 3.9% of the accounts (99,884 accounts) in our data were suspended.

We enhance the VoterFraud2020 dataset by labeling tweets and users that were suspended. This metadata both enables research on the suspensions and eases hydration by allowing hydrators to skip content that is no longer available. We also include two additional metrics for each tweet: retweet count by suspended users and quote count by suspended users.

Due to its immense public interest, we have retained the full data we retrieved from the 99,884 suspended users including 1,240,405 tweets and 6,246,245 retweets. This detailed data is not part of VoterFraud2020. However, we will distribute an anonymized version of this data to published academic researchers upon request.

Images. Because of their persuasive power and ease of spread, there is a growing interest in analyzing how visual misinformation spreads both within a platform and across platforms (Zannettou et al. 2018; Highfield and Leaver 2016; Paris and Donovan 2019; Moreira et al. 2018; Zannettou et al. 2020). However, visual information such as images or videos is difficult for many researchers to study due to computational and storage costs. Here, we make the information about image content shared in VoterFraud2020 easier to use by sharing perceptual hash values for these images. With these numeric hash values, researchers can easily find duplicates and near-duplicate images in tweets, without working directly with cumbersome image content. To this end, we download all image media items that were posted in the tweets in the streaming data, and encode them with three different types of perceptual hashes.

Common perceptual hashes are binary strings designed such that the Hamming distance (Zauner, Steinebach, and Hermann 2011) between two hashes is small if and only if the two corresponding images are perceptually similar. In other words, an image that is only slightly transformed, for example, by re-sizing, cropping, or rotation, will have a similar hash value to the original image. However, as the definition of perceptual similarity is often subjective and the underlying algorithms differ, hashing functions have different performance characteristics when dealing with various types of image transformations. Therefore, we encode the images in our dataset with three perceptual hash functions: the Perceptive Hash (pHash), the Average Hash (aHash), and the Wavelet Hash (wHash) (Petrov 2017; Zauner, Steinebach, and Hermann 2011).
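
Comparing two perceptual hashes reduces to counting differing bits. The sketch below assumes the hashes are stored as equal-length hexadecimal strings, which is an assumption about the storage format; the example hash values and the distance threshold are hypothetical.

```python
def hamming_distance(hash_a, hash_b):
    """Number of differing bits between two equal-length hex hash strings."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")

def is_near_duplicate(hash_a, hash_b, max_distance=4):
    """Treat two images as near-duplicates if their hashes differ in at
    most `max_distance` bits (the threshold is an illustrative choice)."""
    return hamming_distance(hash_a, hash_b) <= max_distance

# Hypothetical 64-bit hashes that differ only in the last bit.
original = "ffd8a0c4e0f0f8fc"
resized = "ffd8a0c4e0f0f8fd"
distance = hamming_distance(original, resized)
```

A threshold of zero corresponds to exact hash matching; larger thresholds trade precision for recall and would need to be tuned per hash function.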

In total, our streamed tweets included 201,259 image URLs, of which 167,696 were retrieved during streaming. We provide some more details about the distribution of these images in Section 5.

External links. Misinformation campaigns are known to use broad cross-platform information, often via links to other sites (Wilson and Starbird 2020; Golovchenko et al. 2020). Hence, we extracted and publish the set of external (non-Twitter) URLs that were referenced in the tweets. For ease of use, we resolved URLs that point to a redirected location (e.g., bit.ly URLs) to their final destination URL. Our streamed tweets included references to a total of 138,411 unique URLs, appearing in 609,513 tweets.

Since a large portion (over 12%) of all URLs in the data were YouTube links, we further enhanced the data with YouTube-specific metadata. A key motivation for this specific focus was the known role YouTube plays generally in spreading misinformation (Hussein, Juneja, and Mitra 2020; Papadamou et al. 2020) and specifically its role in the 2020 election and voter fraud claims (Kaplan 2020; Wakabayashi 2020). For each YouTube video that was shared in the collected tweets, we used YouTube's Data API (YouTube 2021) to retrieve the video's title and description, as well as the ID and title of the channel that posted it. We retrieved the YouTube metadata on Jan 1st, 2021. On that date, out of the 13,611 unique video IDs that we queried, 1,608 were no longer available, resulting in 12,003 YouTube URLs with full additional metadata.

Our VoterFraud2020 dataset is available for download under FAIR principles (Wilkinson et al. 2016) in CSV format. The data includes "item data" tables for tweets, retweets, and users keyed by Twitter-assigned IDs and augmented with additional metadata as described below. The data also includes the images that appear in the dataset, indexed by randomly generated unique IDs. Finally, the data includes aggregated tables for URLs and for YouTube videos including the information described in Section 3. The dataset tables and the fields they contain are summarized on GitHub.

The VoterFraud2020 dataset conforms with FAIR principles. The dataset is Findable as it is publicly available on Figshare, with a digital object identifier (DOI): 10.6084/m9.figshare.13571084. It is also Accessible since it can be accessed by anyone in the world through the link. The dataset is in CSV format, hence it is Interoperable. We release the full dataset with descriptions detailed in this paper, as well as an online tool to explore the dataset at http://voterfraud2020.io, making the dataset Reusable to the research community.

The tables for Tweets and Retweets contain the full set of items that were collected, including from suspended users. These tables do not include raw tweet data beyond the ID, according to Twitter's ToS. However, to support use of the data without being required to download ("hydrate") the full set of tweets, we augment the Tweets table with several key properties. For each tweet we provide the number of total retweets as computed by Twitter (retweet/quote count metadata), as well as the number of retweets and quotes we streamed for this tweet from users in each of the five main communities (retweet/quote count community X, X ranging from 0 to 4). Note that the latter do not add up to the Twitter metadata due to the coverage issues listed in Section 2.2. The Tweet table properties also include the user community (0-4) for the user who posted the tweet, computed using methods listed in Section 3. Some of the Twitter accounts were not clustered into one of the five main communities. In this case, the user community label is null. With this augmentation, researchers using this dataset could very quickly, for example, select and then hydrate a subset of the most retweeted tweets from non-suspended users in Community 2. As the tweet itself and the ID of the user who tweeted it are not available until hydration, Twitter users' privacy is preserved. The Users table is similarly augmented with aggregate information about the importance of the user in the dataset, including the community that they belong to, their centrality in the two meta-clusters, detractor and promoter (closeness centrality detractor cluster and closeness centrality promoter cluster), and the amount of attention (retweets and quotes) they received from other users in the different communities. We also note whether, according to the data we collected, the user had been suspended. With this data, interested researchers can quickly focus their attention and research on the main actors in each community.
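
The selection example mentioned above (the most retweeted tweets from non-suspended users in Community 2) could look roughly like the sketch below. The column names tweet_id, user_community, user_active_status and retweet_count_metadata are hypothetical stand-ins; the actual field names should be taken from the dataset documentation.

```python
import csv
import io

def top_tweet_ids(rows, community, n):
    """Return the IDs of the n most retweeted tweets posted by
    non-suspended users in the given community.

    `rows` are dicts as produced by csv.DictReader (all values are strings).
    """
    eligible = [
        row for row in rows
        if row["user_community"] == str(community)
        and row["user_active_status"] != "suspended"
    ]
    eligible.sort(key=lambda row: int(row["retweet_count_metadata"]), reverse=True)
    return [row["tweet_id"] for row in eligible[:n]]

# Hypothetical excerpt of a Tweets table; the empty community is a null label.
sample_csv = """tweet_id,user_community,user_active_status,retweet_count_metadata
101,2,active,500
102,2,suspended,9000
103,2,active,1200
104,0,active,7000
105,,active,40
"""
rows = list(csv.DictReader(io.StringIO(sample_csv)))
ids_to_hydrate = top_tweet_ids(rows, community=2, n=2)
```

The resulting ID list can then be passed to any hydration tool, so that only the tweets of interest are requested from the Twitter API.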

The Images table includes all the image media items retrieved in the stream, their unique media ID, and the ID of the tweet in which the image was shared. We augment this table with the image hash using three types of perceptual hash functions (aHash, pHash and wHash), as detailed in Section 3. This augmentation, together with the link to the Tweet ID, will allow researchers to quickly identify and hydrate popular images using the tweet metadata. They can also quickly identify and get information for images that are similar to any other arbitrary image, by computing and comparing the perceptual hash values.

The two aggregate tables, the URLs table and the YouTube Videos table again provide information about the popularity of the object in the dataset: aggregate retweet and quote counts, both using the Twitter metadata and the count of objects in our stream from the various communities. In addition, these tables are augmented with metadata about the item (URL or YouTube video) as noted in Section 3.

We performed a preliminary analysis of our dataset and its different modalities (tweets and users, images, external links) to demonstrate its potential interest and provide some initial guiding insights about the data.

Tweets and users. Figure 3 shows the number of retweets (green), original tweets (blue) and quote tweets (yellow) in the VoterFraud2020 dataset over the time (X-axis) of the data collection. Three shaded regions, from left to right, mark the expansion of our set of keywords on October 31st (light blue, region b) and November 3rd (light green, region c). The Y-axis specifies the daily count. In general, except for the large increase after the election date (November 3rd, dotted vertical line), the volume of the stream remains roughly the same. On average, there are 170,938 tweets, 576,136 retweets, and 85,488 quote tweets per day after the election. Our manual inspection shows that top tweets retweeted by the detractor cluster often condemn the alleged voter fraud claims, while top tweets in the promoter cluster indeed make voter fraud claims. Not surprisingly, among the top ten most retweeted tweets in the promoter cluster, nine were tweeted by President Trump. We refer readers to our project website for more details about popular tweets.

While the promoter cluster seems rather homogeneous (Figure 2b), users in Community 2 (yellow) stand out in both their level of activity and the rate at which they were suspended. Community 2 was highly active in our dataset. For example, Community 2 comprises 18.1% of the users, but contributed 68% of the VoterFraud2020 tweets, and 74% of the retweets. Moreover, 14% of Community 2's users were suspended by Twitter by the time we collected the account status data as described above, a much higher rate than the other communities, as shown in Figure 2a. In total, Community 2 was responsible for 46.1% of all suspensions amongst the users we associated with the top five communities. The suspension effect, and its focus on Community 2, can also be observed in Figure 2c.

A full analysis of the suspended accounts and their network communities, and the potential impact of the suspension is out of scope for this dataset paper, but can be easily performed using the data we share in VoterFraud2020. For example, the data shows that 35% of the promoter cluster users that were retweeted more than 1,000 times (1,596 in total) were suspended.

To conclude, our preliminary analysis shows that alleged election fraud claims mostly circulate in the promoter cluster, and in particular in Community 2 within the cluster. The most popular tweets (by retweet counts) supporting such claims often come from prominent accounts. The recent moderation efforts from Twitter seem to have affected the most active community that engaged in fraud related misinformation, and did not broadly target all accounts involved in promoting such claims.

Images. We conducted a preliminary examination of matching and repeated images in VoterFraud2020 to analyze the distribution of images related to voter fraud claims. Our data, using the perceptual hash functions described in Section 3, allows tracking of duplicate and near-duplicate images that were posted in multiple tweets. In this analysis, we experimented with three perceptual hash functions and refer to two images as matching if they have an identical perceptual hash value.

In Figure 5a, we show the cumulative distribution of the number of unique perceptual hashes in VoterFraud2020 (Y-axis), with hash values sorted based on the number of unique tweets in which they appear, from the highest to the lowest (X-axis). For example, according to pHash, the 1,000 images shared in the largest number of unique tweets appeared together in 25,019 different tweets (not including retweets). Although in general the results are similar when using different hash functions, pHash is the most "conservative" in terms of assigning matches.

Overall, our results are similar when using different hash functions. For example, there are 109,312 unique pHash values among the 167,696 images. Of these, 17,831 were shared in more than one tweet, an average of 4.27 times. In other words, 34% of the image instances in VoterFraud2020 tweets appear in more than one tweet. Figure 5b presents the image that appeared in the largest number of unique tweets: the same perceptual hash value appeared in over 1,000 tweets, according to all three hash functions.

We further investigate the popularity of images, defined by number of retweets, in particular within the promoter and detractor clusters. After grouping images by the same pHash value, we present in Figure 4 the top three images that have been retweeted in the promoter cluster. Also note that despite the high values of metadata retweets and cluster retweets, all these "popular" images appeared in only a few original tweets in our data. For example, image (a) appeared in 15 tweets, whose metadata retweet counts (as returned from the API) add up to 24,399 in total, and was retweeted (as recorded in our dataset) by users in the promoter cluster 18,020 times. We note that images (a) and (b) were also the top two images retweeted by users in the suspended users set, with 5,547 and 3,122 retweets in that set, respectively (recall that almost all suspended users belong to the promoter cluster). As expected, the most retweeted images in the two clusters are quite different. The three most retweeted images in the detractor cluster (not included for lack of space) have somewhat lower spread, appearing in tweets that were retweeted 10,743, 6,425 and 3,411 times (based on metadata). The top image is a screenshot of the NY Times front page of Nov 11th, 2020 reporting that top election officials across the country have not identified any fraud. The analysis presented above can be easily extended with less-strict image similarity matching by calculating the Hamming distance between a pair of perceptual hash values. In this initial analysis, we used a strict sense of similarity, treating images as similar only when they share the same perceptual hash values.
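
The two notions of popularity used above, number of unique tweets and number of retweets, can be contrasted with a small grouping sketch. It assumes image records of the form (pHash value, tweet ID, retweet count of that tweet); this layout, the function name and the toy values are hypothetical.

```python
from collections import defaultdict

def rank_images(image_rows):
    """Rank image hashes by unique tweets and by total retweets.

    `image_rows` is an iterable of (phash, tweet_id, retweet_count) tuples.
    Each tweet is counted once per hash, even if it is listed repeatedly.
    """
    tweets = defaultdict(set)
    retweets = defaultdict(int)
    for phash, tweet_id, retweet_count in image_rows:
        if tweet_id not in tweets[phash]:
            tweets[phash].add(tweet_id)
            retweets[phash] += retweet_count
    by_tweets = sorted(tweets, key=lambda h: (-len(tweets[h]), h))
    by_retweets = sorted(retweets, key=lambda h: (-retweets[h], h))
    return by_tweets, by_retweets

# Hypothetical toy data: one image posted in three little-retweeted tweets,
# another posted once in a heavily retweeted tweet.
by_tweets, by_retweets = rank_images(
    [("aaaa", 1, 10), ("aaaa", 2, 5), ("aaaa", 3, 0), ("bbbb", 4, 900)]
)
```

The two rankings differ in the toy data, mirroring the observation that the most broadly shared images and the most retweeted images need not coincide.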

URLs. We conduct preliminary analyses of the external links that have been included in the VoterFraud2020 tweets. Table 1 lists the top 10 domains that have been shared inside the detractor and promoter clusters, respectively. Most of the links shared by users in the detractor cluster are to mainstream news media, such as the Washington Post, CNN, and the New York Times. The rest are other news-related websites. The links shared by users in the promoter cluster mostly point to low-quality news-related websites.
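A per-cluster domain ranking of this kind can be approximated with the standard library. The sketch below assumes a hypothetical input of (cluster label, URL, retweet count) records and ranks domains by summed retweets; the record layout, the function names, and the choice to strip a leading "www." are illustrative assumptions, not a description of how Table 1 was produced.

```python
from collections import Counter
from urllib.parse import urlparse


def domain_of(url: str) -> str:
    """Extract a lower-cased host name, dropping a leading 'www.'."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def top_domains(shares, k: int = 10):
    """Rank domains per cluster by total retweets.

    shares: iterable of (cluster_label, url, retweet_count) records
    (hypothetical layout).
    """
    per_cluster = {}
    for cluster, url, retweets in shares:
        per_cluster.setdefault(cluster, Counter())[domain_of(url)] += retweets
    return {c: counts.most_common(k) for c, counts in per_cluster.items()}
```

Shortened links would need to be resolved to their final destination before counting; otherwise the link shortener's domain is ranked instead of the target site.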

The most shared domain in the promoter cluster is pscp.tv, a live video streaming app owned by Twitter. YouTube stands out as the second most retweeted domain among the promoter users. This trend is reflected in multiple news reports warning of the significant role that YouTube plays in spreading false information related to voter fraud claims (Frenkel 2020). The majority of the top 10 most retweeted videos by the promoter users falsely claim evidence of widespread election fraud. The users spreading these videos overlap significantly with the accounts suspended by Twitter in January (or earlier). For eight of these videos, around 30% of the retweets of tweets sharing those videos were by accounts later suspended by Twitter.
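The share of retweets attributable to later-suspended accounts can be computed per video as sketched below. The input layout, one (video ID, retweeting user ID) record per retweet plus a set of suspended user IDs, and the function name are assumptions made for illustration.

```python
from collections import defaultdict


def suspended_share_by_video(retweets, suspended_users):
    """Fraction of each video's retweets made by suspended accounts.

    retweets: iterable of (video_id, retweeting_user_id) records
    suspended_users: set of user IDs flagged as suspended
    (both layouts are hypothetical).
    """
    totals = defaultdict(int)
    by_suspended = defaultdict(int)
    for video_id, user_id in retweets:
        totals[video_id] += 1
        if user_id in suspended_users:
            by_suspended[video_id] += 1
    return {video: by_suspended[video] / totals[video] for video in totals}
```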

A scan of the top 10 YouTube channels retweeted in the promoter cluster shows that most are relatively large (millions of subscribers), though there are also several smaller channels. For example, the most retweeted channel, Precinct 13, has only 3.67K subscribers; one of its videos appeared in 88 tweets that were retweeted over 9K times.

Despite YouTube's announcement that it will take action against content creators who falsely claim the existence of widespread voter fraud 4, as of Jan 11th, the top 10 channels and videos listed in our tables are still available on YouTube.

We review prior work using Twitter data analysing politically related events, with an emphasis on those that have released a public dataset.

In particular, previous work has used and published Twitter data to study U.S. elections. Using tweets collected during the 2016 U.S. election, researchers have analysed information operations run by social bots (Rizoiu et al. 2018), characterized the dissemination of misinformation (Vosoughi, Roy, and Aral 2018), and measured its exposure to American voters (Grinberg et al. 2019). Hua, Naaman, and Ristenpart (2020) and Hua, Ristenpart, and Naaman (2020) characterized adversarial interaction against political candidates during the 2018 U.S. general election and shared 1.7M tweets interacting with political candidates.

Focusing on the U.S. 2020 election, research has studied false claims regarding mail-in ballots (Benkler et al. 2020) before the election, as the COVID-19 pandemic made it hard to vote in person. Closest to our work is the #Election2020 dataset (Chen, Deb, and Ferrara 2020), which streamed a broad set of Twitter data for both political candidates' tweets and related keywords. As discussed above, although some of the voter fraud related keywords were included in their data collection process, our VoterFraud2020 dataset contains more than 2.3 times as much related data as #Election2020 for the overlapping streaming keywords, presumably because of our more focused stream. Our stream also included a broader set of fraud-claim related keywords.

To help understand the dissemination of misinformation across platforms, Brena et al. (2019) and Hui et al. (2018) used news articles as queries and released the tweets pointing to these articles. In 2018, Twitter published a list of accounts that the platform suspected of being linked to the Russian government-controlled Internet Research Agency (Twitter 2018). This release enabled a number of studies that deepened our understanding of foreign information manipulation in the U.S. (Arif, Stewart, and Starbird 2018; Im et al. 2020; Badawy, Ferrara, and Lerman 2018).

Most previous work that released Twitter datasets included only the tweet IDs, in accordance with Twitter's Terms of Service. We keep to that practice, and augment the data without sharing tweet content, as detailed above, making our multi-modal dataset more accessible and useful to the research community.

The voter fraud allegations made to discredit the U.S. 2020 presidential election are likely to form one of the most consequential misinformation campaigns in modern history. It is critical to allow a diverse set of researchers to provide a deeper understanding of this effort, which will continue to have national and global impact for years to come. To enable that contribution, it is important to provide a public and accessible archive of this campaign on various social media platforms, including Twitter, as we do in VoterFraud2020.

The VoterFraud2020 dataset has the potential to benefit the research community, and to further inform the public regarding both Twitter activities around the voter fraud claims and Twitter's response. Yet, the data has some limitations. We could not possibly capture the full extent of the voter fraud claims on Twitter, as our dataset was constructed by matching keywords. Further, as discussed above, we do not have full coverage even for the keywords we tracked, though we estimate that we have a majority of the tweets with those keywords. Nevertheless, the breadth of the data enables various types of investigation using both the tweet data and the aggregated data of URLs, videos and images used in the campaign. We propose three major categories of such investigation.

First, researchers can use the dataset to study the spread, reach, and dynamics of the voter fraud campaign on Twitter. Researchers can describe and analyze the participants, including the activities of political candidates using information from orthogonal data sets of candidate accounts 5, and the interaction between public figures and other accounts spreading claims and promoting certain narratives. Further, the data can help expose how different public figures spread different claims, for example the claims regarding the Dominion voting machines, and what kind of engagement such narratives received. The data can also be used to understand the role of bots and other coordinated activities and campaigns in spreading this information. In general, the dataset supports analysis of the distribution of attention to these claims and how it spreads (via images, tweets, and URLs), including comparison among different pre-computed communities and clusters.

Second, we include auxiliary data (URLs, including YouTube links, and image hashes) that can help researchers examine other sources of information and their roles in spreading these claims. For example, using the image hash values that were encoded using publicly available algorithms, researchers can easily map images not just within the Twitter data, but also within the larger web ecosystem. Researchers may combine our dataset with datasets that are collected from other social media platforms to examine how visual misinformation spreads across platforms (e.g., Zannettou et al. 2018; Moreira et al. 2018).

5 https://github.com/vegetable68/Midterm-2020-candidates

A third potential area of investigation is Twitter's response to the voter fraud claims. A specific question is the characterization of the suspended users, who are primarily part of a specific community even within the group promoting voter fraud claims, as shown above. Researchers can use the data to understand Twitter's non-public response and its potential effectiveness, or even to simulate the effectiveness of hypothetical earlier bans of the same population. As noted above, Twitter's terms forbid us from publicly sharing full data for these suspended users (the VoterFraud2020 tweets for these users are no longer available on Twitter by their ID). However, we will make these tweets available privately to published academic researchers, as we believe these tweets are of immense and justified public interest.

The publicly released VoterFraud2020 data was collected and made available according to Twitter's Terms of Service for academic researchers, following established guidelines for ethical Twitter data use (Rivers and Lewis 2014). By limiting the main data element to Tweet IDs, the dataset does not expose information about users whose data had been removed from the service. The only content in our data that is directly tied to a Tweet ID is the hash of the images for tweets that included them. Even though that hash, theoretically, can be tied to an image from another source, in the absence of the original tweet the image will not be associated with any user account. We believe that this minor disclosure risk is justified given the potential benefits of this data.

Social media and fake news in the 2016 election

Acting the part: Examining information operations within #BlackLivesMatter discourse

Analyzing the digital traces of political manipulation: The 2016 Russian interference Twitter campaign

Gephi: an open source software for exploring and manipulating networks

Mail-In Voter Fraud: Anatomy of a Disinformation Campaign

Computing Top-k Closeness Centrality in Fully-dynamic Graphs

Community Detection and Visualization of Networks with the Map Equation Framework

News sharing user behaviour on twitter: A comprehensive data collection of news articles and social interactions

#Election2020: The first public Twitter dataset on the 2020 US presidential election

Retweet networks of the European Parliament: evaluation of the community structure

Characterizing social media manipulation in the 2020 US presidential election

How Misinformation 'Superspreaders' Seed False Election Theories

Cross-Platform State Propaganda: Russian Trolls on Twitter and YouTube During the 2016 US Presidential Election

Fake news on Twitter during the 2016 US presidential election

Instagrammatics and digital methods: Studying visual social media, from selfies and GIFs to memes and emoji

Characterizing Twitter users who engage in adversarial interactions against political candidates

Towards Measuring Adversarial Twitter Interactions against Candidates in the US Midterm Elections

The Hoaxy misinformation and fact-checking diffusion network

Measuring Misinformation in Video Search Platforms: An Audit Study on YouTube

Still out there: Modeling and identifying Russian troll accounts on Twitter

YouTube has allowed conspiracy theories about interference with voting machines to go viral

Image provenance analysis at scale

Is the sample good enough? Comparing data from Twitter's streaming API with Twitter's firehose

NPR. 2020. Poll: Just A Quarter Of Republicans Accept Election Outcome

Ranking of Closeness Centrality for Large-Scale Social Networks

It is just a flu

Deepfakes and Cheap Fakes

Wavelet image hash in Python

Ethical research standards in a world of big data

#DebateNight: The role and influence of socialbots on Twitter during the 1st 2016 US presidential debate

2020 turnout is the highest in over a century

Update on Twitter's review of the 2016 US election

Twitter Standard API

An update, following the riots in Washington

It's complicated: Facebook users' political participation in the 2008 election

The spread of true and false news online

Election misinformation continues staying up on YouTube

The FAIR Guiding Principles for scientific data management and stewardship

Cross-platform disinformation campaigns: lessons learned and next steps

YouTube Data API -Google Developers

On the origins of memes by means of fringe web communities

Characterizing the Use of Images in State-Sponsored Information Warfare Operations by Russian Trolls on Twitter

Rihamark: perceptual image hash benchmarking