key: cord-0913760-yjd21zid
authors: Chen, Emily; Deb, Ashok; Ferrara, Emilio
title: #Election2020: the first public Twitter dataset on the 2020 US Presidential election
date: 2021-04-02
journal: J Comput Soc Sci
DOI: 10.1007/s42001-021-00117-9
sha: efc330e35d317edc5b72069a57c1186cfcec2c21
doc_id: 913760
cord_uid: yjd21zid

Credible evidence-based political discourse is a critical pillar of democracy and is at the core of guaranteeing free and fair elections. The study of online chatter is paramount, especially in the wake of important voting events like the recent November 3, 2020 U.S. Presidential election and the inauguration on January 20, 2021. Limited access to social media data is often the primary obstacle to studying and understanding online political discourse. To mitigate this impediment and empower the Computational Social Science research community, we are publicly releasing a massive-scale, longitudinal dataset of U.S. politics- and election-related tweets. This multilingual dataset encompasses over 1.2 billion tweets and tracks all salient U.S. political trends, actors, and events from 2019 to the time of this writing. It predates and spans the entire period of the Republican and Democratic primaries, with real-time tracking of all presidential contenders on both sides of the aisle. The dataset also focuses on presidential and vice-presidential candidates, the presidential elections and the transition from the Trump administration to the Biden administration. Our dataset release is curated, documented, and will continue to track relevant events. We hope that the academic community, computational journalists, and research practitioners alike will take advantage of our dataset to study relevant scientific and social issues, including problems like misinformation, information manipulation, conspiracies, and the distortion of online political discourse that has been prevalent in the context of recent election events in the United States. Our dataset is available at: https://github.com/echen102/us-pres-elections-2020.

In 2020, Americans returned to cast their vote for the next president of the US: incumbent Republican Donald J. Trump or the Democratic challenger, and former Vice-President Joseph R. Biden. We began collecting tweets in May 2019 in an effort to capture online chatter surrounding this defining democratic process and to make this collection available to the research community.

Historically, the incumbent president is favored to win their party's nomination for president; 1 although Trump did face a few challengers from the Republican party, it became increasingly clear that he would gain the Republican party's nomination.

Joe Biden officially accepted the Democratic nomination during the Democratic National Convention. 2 Donald Trump officially accepted his nomination on August 27, 2020, during the Republican National Convention. 3 As the final sprint to election day on November 3, 2020 began, Americans took to online social platforms to voice their opinions and engage in conversation surrounding the elections. Twitter has historically been a platform used by politicians to reach their base [10], and has recently begun more aggressive efforts to tag posts as misleading and potentially incorrect in order to mitigate the spread of misinformation that had already been prevalent on the platform [4]. 4 On election day, many again used social media to express their thoughts on the unfolding elections. News outlets were unable to call the elections for several days after election day, as many key states were still counting ballots; social media was used as a means to spread information (both factual and misleading) and to both protest and advocate for controversies surrounding ballots and the influx of mail-in ballots caused by COVID-19. 5, 6 On November 7, the media was finally able to call the election and named Biden as the president-elect, and Kamala Harris as the vice-president-elect. 7 Yet, in the aftermath of this pronouncement and in the current polarized nature of the United States political landscape, social media has become an environment where misinformation and disinformation can flourish and spread. President Trump refused to concede the election, and continued to promote the claim that the election had been stolen. 8, 9 These claims from Trump bolstered the basis for the "stop the steal" campaign, and eventually culminated in a riot at the United States Capitol on January 6, 2021. 10, 11 This led Twitter and other social media platforms to either semi-permanently or permanently suspend President Trump's accounts from their services, citing the riot and the potential for further incitement of violence as grounds for the bans. 12 Many vendors began to cut ties with the right-wing social media platform Parler due to the role it played in coordinating the January 6 riot. 13 President Biden was inaugurated into office on January 20, 2021, along with Vice President Harris. 14

Inspired by the positive impact that our similar initiative to share a COVID-19 Twitter dataset has had on the research community [3], in this paper, we document the release of our 2020 US Presidential election-related dataset that we have been collecting for over one year, a period covering all the events described above and more. We hope that, in releasing this dataset, the research community can leverage its content to study and understand the dynamics in a highly contentious election held during a pandemic. This dataset enables researchers to directly study the impact that the pandemic has had not only on the political landscape, but also on misinformation, disinformation and coordinated actors, with reports of confirmed foreign interference attempts already surfacing [7]. 15

We began collecting election-related tweets on May 20, 2019, and have continued collection efforts without interruption since then. We use Twitter's streaming API through the Tweepy library and follow specific mentions and accounts related to candidates who were seeking their party's nomination for president of the United States, in addition to a manually compiled, general election-related list of keywords and hashtags. 16 As candidates officially announced the suspension of their campaigns, their respective accounts and mentions were removed from our real-time tracking list. In response to real-world events, we decided to restart tracking for a subset of these accounts, in addition to adding supplemental keywords and accounts to our tracking list. This is documented in Table 1.

We will continue to collect election-related tweets at least through the first six months of the Biden administration, so as to capture the nation's post-election and post-transition activity. In total, our dataset comprises well over 1 billion tweets. Release v1.12 contains 1,258,209,617 tweets, spanning from 12/01/2020 through 1/22/2021. In our latest (v1.16) and future releases, we will continue processing and adding data we collected prior to 12/01/2020 and after 1/22/2021.

10 https://www.npr.org/sections/live-updates-2020-election-results/2020/11/08/932543826/the-next-2020-election-fight-convincing-trumps-supporters-that-he-lost
11 https://www.politifact.com/article/2021/jan/11/timeline-what-trump-said-jan-6-capitol-riot/
12 https://www.axios.com/platforms-social-media-ban-restrict-trump-d9e44f3c-8366-4ba9-a8a1-7f3114f920f1.html
13 https://www.bloomberg.com/news/articles/2021-01-10/apple-removes-parler-from-app-store-afteruse-in-capital-riot
14 https://www.nytimes.com/2021/01/20/us/politics/biden-president.html
15 https://home.treasury.gov/news/press-releases/sm1118
16 https://www.tweepy.org/

Note: Twitter's Developer Agreement & Policy stipulates that we are unable to share any data specific to individual tweets except for a tweet's Tweet ID. As a result, we are releasing a collection of Tweet IDs that researchers can use in tandem with Twitter's API to retrieve the full tweet payload. We recommend using tools such as DocNow's Hydrator 17 or Twarc. 18 If tweets have been deleted from Twitter's platform, researchers will be unable to retrieve the payloads for those tweets. We provide ready-to-use Python scripts to perform all the operations described above in our repository.
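As an illustration of the first step of rehydration, the following minimal sketch reads Tweet IDs and groups them into fixed-size batches that could then be passed to a hydration tool or lookup endpoint. It is not the script shipped in our repository. The function names and the sample IDs are hypothetical, and the default batch size of 100 is an assumption based on the per-request limit of Twitter's v1.1 lookup endpoint as we understand it; the actual API call is deliberately left out.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def read_tweet_ids(lines: Iterable[str]) -> Iterator[str]:
    """Yield cleaned Tweet IDs, skipping blank or non-numeric lines."""
    for line in lines:
        tweet_id = line.strip()
        if tweet_id.isdigit():
            yield tweet_id


def batch_ids(ids: Iterable[str], batch_size: int = 100) -> Iterator[List[str]]:
    """Group IDs into lists of at most batch_size items (assumed request limit)."""
    iterator = iter(ids)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical sample input; real input would be lines from a Tweet ID file.
sample_lines = [
    "1212345678901234567\n",
    "\n",
    "not-an-id\n",
    "1212345678901234568\n",
    "1212345678901234569\n",
]
batches = list(batch_ids(read_tweet_ids(sample_lines), batch_size=2))
print(batches)
```

Each batch would then be handed to Hydrator, Twarc or a direct API request; IDs of deleted tweets simply return no payload.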

In order to capture the chatter surrounding the 2020 US presidential elections, we followed specific user mentions and accounts that were and are tied to the official and personal accounts of candidates who ran for president. Twitter's streaming API gives us access to an approximately 1% sample of all tweets in real time. It takes in a list of keywords and returns any tweet within that sample stream that contains any of the keywords in the metadata and text of the tweet payload. 19 Thus it is unnecessary to track every permutation of each keyword. We list a sample of the mentions and accounts that we tracked in release v1.12 in Table 1 and a sample of the keywords we tracked in Table 2. A full list can be found in the accounts.txt file and keywords.txt file in our data repository.

We upgraded our data collection pipeline on June 20, 2020 to improve collection reliability. Data collected prior to June 20, 2020 experienced higher rates of technical collection issues. While our most recent release is Release v1.16, we are still continuing our computational efforts to pre-process and clean the rest of our existing dataset, and will be uploading batches of past and future data as they become available. A sample of the mentions/accounts and keywords that we followed can be found in Tables 1 and 2, respectively, with full lists of both available on our GitHub repository. Furthermore, Table 3 shows the top 40 most popular hashtags, grouped by general categories. Most of the hashtags are directly related to party campaigns and conspiracy theories surrounding the elections. Others are related to political events, social movements and the COVID-19 pandemic. As this dataset was curated for the 2020 US Presidential election cycle, it is unsurprising that the majority of these tweets are in English (see Table 4 for a breakdown of the languages in release v1.12).

17 https://github.com/DocNow/hydrator
18 https://github.com/DocNow/twarc
19 https://developer.twitter.com/en/docs/twitter-api/tweets/filtered-stream/introduction
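Hashtag and language tallies of the kind summarized in Tables 3 and 4 can be reproduced from hydrated tweets with a simple count. The sketch below is illustrative only and is not the code used to produce those tables. It assumes each hydrated tweet is a dictionary in the v1.1 payload layout, with hashtags under `entities.hashtags[].text` and Twitter's language tag under `lang`; the function names and sample tweets are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def top_hashtags(tweets: Iterable[Dict], n: int = 40) -> List[Tuple[str, int]]:
    """Count hashtags case-insensitively and return the n most frequent."""
    counts: Counter = Counter()
    for tweet in tweets:
        for tag in tweet.get("entities", {}).get("hashtags", []):
            counts[tag["text"].lower()] += 1
    return counts.most_common(n)


def language_counts(tweets: Iterable[Dict]) -> Dict[str, int]:
    """Tally tweets by Twitter's automatic language tag."""
    return dict(Counter(tweet.get("lang", "und") for tweet in tweets))


# Hypothetical hydrated tweets, reduced to the fields used here.
sample_tweets = [
    {"lang": "en", "entities": {"hashtags": [{"text": "Election2020"}, {"text": "Vote"}]}},
    {"lang": "en", "entities": {"hashtags": [{"text": "election2020"}]}},
    {"lang": "es", "entities": {"hashtags": []}},
]
print(top_hashtags(sample_tweets, n=2))
print(language_counts(sample_tweets))
```

Lowercasing merges variants such as `#Election2020` and `#election2020`; whether that is desirable depends on the analysis.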

The dataset is publicly available and continuously maintained on GitHub at this address: https://github.com/echen102/us-pres-elections-2020.

The dataset is released in compliance with Twitter's Terms & Conditions and the Developer Agreement and Policy. 20 This dataset is still being collected and will be periodically updated on our GitHub repository. Researchers who wish to use this dataset must agree to abide by the stipulations stated in the associated license and conform to Twitter's policies and regulations.

Although we are continuing to collect tweets to add to our data collection as we follow the transition to the Biden-Harris administration, we first present an analysis of tweets from our dataset from January 2020 through the end of December 2020. This enables us to examine political discourse on Twitter through the Presidential primaries, debates and election. Highly political divisions have emerged in COVID-19 discourse [9], alongside conspiracy theories [6] and public health related trends that have emerged due to COVID-19 [3]. Our recent work on this dataset has also shown that partisan trends drive the discourse on Twitter, with conservative users posting at much higher volumes compared to their liberal counterparts. Conservative users also tended to share more known conspiracy-related narratives [7]. We have also observed that there are highly connected conservative users that are more prone to spread public health and voting misinformation [2].

During the 2020 Presidential election, the then-incumbent President Trump faced little difficulty in securing the Republican nomination. 21 Although Trump did face a few challengers from the Republican party, it became increasingly clear that he would gain the party's nomination.

The Democratic primaries were more competitive, with a historic 28 candidates vying for the nomination. 23 However, as national poll results began to roll in and initial primary results were tallied, candidates began to drop out of the race (see Table 5 for the dates on which candidates from both parties suspended their campaigns). The advent of COVID-19 in the United States in March 2020, and the ensuing regulations to encourage social distancing, forced the remaining campaigns to shift to virtual models. The race narrowed down to two candidates: Vermont senator Bernie Sanders and former Vice President Joe Biden. As more primaries took place and results were reported, it became clear that Biden would win the 1,991 delegates needed to become the presumptive Democratic nominee. 24 Sanders conceded to Biden on April 8, 2020 and endorsed him. 25, 26

Our dataset specifically tracked 2020 US Presidential elections-related keywords and accounts. As a result, we expect to see that the captured discourse reflects major events that took place throughout our collection period. We limit our analysis to tweets from our dataset that were collected from January 2020 through December 2020.

We first investigate the chatter surrounding the Democratic primaries, as the race to win the nomination was competitive and multiple candidates emerged as favorites. While Biden may have held an early lead, Sanders, Elizabeth Warren and Pete Buttigieg were also serious contenders. 27 In Fig. 1, we track mentions of the names and Twitter handles of each Democratic presidential candidate who was still campaigning in March 2020, and compute the 7-day rolling average of the daily percentage of all collected tweets that mention each candidate. This particular time series ends on May 8, 2020, which is one month after Sanders conceded to Biden and Biden became the presumptive Democratic presidential candidate.
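The mention measure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not our analysis code: it uses case-insensitive substring matching, takes a trailing window over consecutive days present in the data, and averages daily percentages (rather than dividing rolling sums). The keyword list and sample tweets are hypothetical; the keywords actually used are those listed in Table 6.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, Iterable, List, Sequence, Tuple


def daily_mention_percentage(
    tweets: Iterable[Tuple[date, str]], keywords: Sequence[str]
) -> Dict[date, float]:
    """Percentage of each day's tweets whose text contains any keyword."""
    lowered = [keyword.lower() for keyword in keywords]
    totals: Dict[date, int] = defaultdict(int)
    hits: Dict[date, int] = defaultdict(int)
    for day, text in tweets:
        totals[day] += 1
        if any(keyword in text.lower() for keyword in lowered):
            hits[day] += 1
    return {day: 100.0 * hits[day] / totals[day] for day in sorted(totals)}


def rolling_mean(values: List[float], window: int = 7) -> List[float]:
    """Trailing rolling mean; early positions average over the days available."""
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Hypothetical sample: (collection day, tweet text).
sample = [
    (date(2020, 3, 1), "Biden wins another state"),
    (date(2020, 3, 1), "Good morning"),
    (date(2020, 3, 2), "@JoeBiden speaks tonight"),
    (date(2020, 3, 2), "biden again"),
    (date(2020, 3, 3), "Unrelated tweet"),
]
percentages = daily_mention_percentage(sample, ["biden", "@joebiden"])
smoothed = rolling_mean(list(percentages.values()), window=2)
print(percentages, smoothed)
```

Because a tweet can match several candidates' keywords, per-candidate percentages computed this way need not sum to 100.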

Throughout the Democratic primary timeline in Fig. 1, we can see that the attention that specific candidates attract on Twitter fluctuates greatly. Sanders and Warren initially led most of the discourse on Twitter in January 2020, but Sanders would eventually dominate Twitter chatter throughout most of the primaries. This dominance continues until February 25, 2020, when James Clyburn, a prominent African American Representative from South Carolina, endorsed Biden. From there, we see a sharp increase in Biden mentions, and Biden quickly overtook Sanders not only in polls, but also in Twitter discourse. 28 Biden continued to hold a majority in Twitter mentions throughout the rest of the primaries, through Sanders' concession on April 8, 2020. All other candidates saw a brief increase in mention percentage when they announced the suspension of their presidential campaigns, followed by a general decrease. While most of the mention percentages generally followed the popularity of certain candidates, in particular Biden, Sanders, Warren and Buttigieg, we find an increase in mentions surrounding Michael Bloomberg during the 9th Democratic debate. 29 The 9th Democratic debate was the first debate that Bloomberg was able to qualify for, but his performance was widely criticized. 30 He also attracted social media attention after having heavily funded his campaign's ads with his personal money. 31

Fig. 1 The above figure shows a time series analysis of tweets that mention keywords related to a Democratic nominee's campaign from January 2020 through May 8, 2020. Sanders announced the suspension of his presidential campaign on April 8, 2020, so we capture all discourse through a month after Biden was declared the presumptive Democratic Presidential nominee. We measure the percentage of total tweets collected on a particular day that mention the candidate on a rolling 7-day average. The keywords we use for each candidate can be found in Table 6 and descriptions of the noted dates in the table below the time series. We also include the raw volume of all tweets collected on a particular day on a rolling 7-day average above the time series.

22 https://www.270towin.com/2020-republican-nomination/
23 https://www.politifact.com/article/2019/may/02/big-democratic-primary-field-what-need/
24 https://apnews.com/article/bb261be1a4ca285b9422b2f6b93d8d75
25 https://www.nytimes.com/interactive/2019/us/politics/2020-presidential-candidates.html
26 https://www.npr.org/2020/04/08/814291136/bernie-sanders-is-suspending-his-presidential-campaign
27 https://projects.fivethirtyeight.com/2020-primary-forecast/
29 https://www.pewresearch.org/fact-tank/2020/02/10/a-snapshot-of-the-top-2020-democratic-presidential-candidates-supporters/
30 https://www.npr.org/2020/02/20/807639778/6-takeaways-from-the-nevada-democratic-debate

We now turn to the final race in the 2020 U.S. Presidential election between Biden and Trump. As shown in Fig. 2, the percentage of all tweets that mention Trump is significantly greater than the percentage of tweets that mention Biden (see Table 6 for keywords associated with each candidate). This gap in mentions is not unexpected, as Trump was the incumbent President and thus already had a significant presence on Twitter. While our current analysis is based on the percentage of mentions in the tweets collected, our prior work in clustering users by political affiliation based on shared media found that conservative users have a more vocal presence on the political Twitter scene [7]. Despite Trump's general dominance in the chatter, we see that as major events occurred, such as when Democratic primaries began to be called for Biden and during the Presidential debates, Biden began to see an increase in mentions. While a tweet may be counted as mentioning both Trump and Biden, we still see a corresponding decrease in the percentage of Trump's mentions when Biden's mentions increase. This suggests that the discourse shifted away from Trump and towards Biden, particularly as election day neared, culminating in similar percentages of tweets mentioning Biden and Trump.

The tweets we collected in our dataset appear to track real-world events well. However, the sheer percentage of our collected tweets that mention a particular candidate does not necessarily represent the sentiment and popularity of those candidates at the time. As Twitter has evolved as a platform, its user base has also changed [11]. This disparity between Twitter attention and real-world popularity was highlighted during the Democratic primaries. Sanders held the majority of tweet mentions from early January through the end of February. It was not until the initial primary results began to be tallied and reported that it became clear that Biden had actually won the Democratic vote. 32 Sanders' dominance in Twitter discourse underscored how Biden's eventual momentum took much of the Democratic party by surprise. 33 This can give us insight into how news and public discourse on social media platforms can misrepresent or give a false impression of the nation's sentiment.

Fig. 2 The above figure shows a time series analysis of tweets that mention keywords related to either Trump or Biden from January 2020 through December 2020. We measure the percentage of total tweets collected on a particular day that mention the candidate on a rolling 7-day average. The keywords we use for each candidate can be found in Table 6 and descriptions of the noted dates in the table below the time series. We also include the raw volume of all tweets collected on a particular day on a rolling 7-day average above the time series.

Every tweet we collect is returned with metadata describing the tweet itself, including Twitter's automatic language tag and post date. Each tweet also includes information about the author, and if the tweet was a response (reply, retweet or quote) to another tweet, the tweet's metadata also contains information on the original poster. This metadata can sometimes include a user's location data; however, we found that less than 1% of our tweets actually contained this information [9]. Because of this, we leverage the included "location" field that a user manually populates as a part of their profile. We tag each tweet with its country of origin and, if the tweet originates from the United States, the detected state [9]. While some users may list locations that are not accurate, do not exist or cannot be identified by our algorithm, we leverage this field as a proxy for tweet location.
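To make the idea of tagging a tweet from the free-text profile "location" field concrete, here is a deliberately simplified sketch. It is not the algorithm from [9]; the lookup table covers only four states, and the matching rules (whole-word state names, then a trailing two-letter code) are assumptions chosen for illustration. All names in the code are hypothetical.

```python
import re
from typing import Optional

# Deliberately tiny, hypothetical lookup; a real table would cover every state.
STATE_NAMES = {"california": "CA", "texas": "TX", "new york": "NY", "florida": "FL"}
STATE_ABBREVIATIONS = set(STATE_NAMES.values())


def detect_state(location: Optional[str]) -> Optional[str]:
    """Return a state abbreviation for a profile location string, or None."""
    if not location:
        return None
    lowered = location.lower()
    # First try full state names as whole words.
    for name, abbreviation in STATE_NAMES.items():
        if re.search(r"\b" + re.escape(name) + r"\b", lowered):
            return abbreviation
    # Fall back to a trailing two-letter code such as "Austin, TX".
    match = re.search(r",\s*([A-Za-z]{2})\s*$", location)
    if match and match.group(1).upper() in STATE_ABBREVIATIONS:
        return match.group(1).upper()
    return None


print(detect_state("Los Angeles, California"))
print(detect_state("Austin, TX"))
print(detect_state("the moon"))
```

Strings that match neither rule return `None`, which mirrors the case where a listed location cannot be identified and the tweet is left untagged.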

We examine the domestic geographical flow of information within the United States. By isolating only retweets and quoted tweets (retweets with a comment), we find tweets that directly represent one user re-posting the tweet of another. Retweets and quoted tweets also return the user-specified location data for both the user who retweeted or quoted the tweet and the original poster. The user who retweeted or quoted the tweet will be referred to as the retweeter for clarity. We then retain all tweets within our dataset for which we are able to identify a state for both the retweeter and the original poster, which directly implies that both the retweeter and original poster are located in the United States. Figure 3 illustrates the flow of the top 200 most frequent state-to-state engagements, with the flow following retweets and quoted tweets from the original poster's state to the retweeter's state.
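The filtering and ranking of state-to-state engagements described above can be sketched as a count over (original poster's state, retweeter's state) pairs. This is an illustrative sketch rather than the code behind Fig. 3; the function name and the sample pairs are hypothetical, and it assumes each retweet or quoted tweet has already been tagged with a state (or `None` when no state was detected).

```python
from collections import Counter
from typing import Iterable, List, Optional, Tuple

Edge = Tuple[str, str]  # (original poster's state, retweeter's state)


def top_state_flows(
    pairs: Iterable[Tuple[Optional[str], Optional[str]]], n: int = 200
) -> List[Tuple[Edge, int]]:
    """Keep pairs where both states are known and return the n most frequent."""
    counts: Counter = Counter()
    for original_state, retweeter_state in pairs:
        if original_state and retweeter_state:
            counts[(original_state, retweeter_state)] += 1
    return counts.most_common(n)


# Hypothetical tagged retweets and quoted tweets.
sample_pairs = [
    ("DC", "CA"),
    ("DC", "CA"),
    ("NY", "NY"),
    (None, "TX"),
    ("CA", None),
    ("DC", "CA"),
]
flows = top_state_flows(sample_pairs, n=2)
print(flows)
```

Pairs with the same state on both sides, such as `("NY", "NY")`, correspond to intra-state engagement.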

The states from which the most tweets originate generally coincide with the most populous states in the United States. The US Census Bureau lists California, Texas, Florida and New York as the most populous states in its 2019 estimate. 34 However, most tweets actually originate from the District of Columbia area, which is both the political center and the capital of the United States. This is consistent with the nature of the political landscape, as many politicians are located in the D.C. area. In general, Fig. 3 suggests that while there exists a substantial amount of intra-state tweet engagement, states with larger populations account for larger proportions of the measured intra-state engagement activity.

While this dataset gives us a glimpse of the political chatter on Twitter, there are still limitations to this dataset that warrant discussion. Due to the nature of the keywords we were tracking, the tweets in our dataset are highly skewed towards English and tweets that originate from the United States. Another limitation of the dataset is that the users on Twitter do not necessarily represent the collective sentiment of the United States. The audience that uses Twitter, according to a 2019 study conducted by Pew Research Center, skews younger and more Democratic than the general population; the most vocal on Twitter also tend to engage in political discourse. 35 Twitter also significantly rate limits the number of tweets that one can rehydrate, and tweets that have either been removed by the user or removed because a user was banned or suspended can no longer be retrieved through Twitter's API. Our collection was also highly contingent upon the stability of our network and hardware, which means that there may be gaps in our data collection, particularly prior to our migration to AWS. Twitter has recently released an Academic Research track that enables researchers and academics to access the full-archive search; however, this still imposes rate limits that unfortunately make filling these gaps hard. 36

There are many potential areas that can be explored using our dataset.

Recent work using our dataset has already begun to explore the prevalence of bots and misinformation within the 2020 political landscape [6, 7]. Luceri et al. also scrutinized bot engagement in political discourse in 2018 and found that many of these bots remained active during the 2020 election cycle [12]. Our previous work has found that, out of all major conspiracy theories that had taken root during the election, QAnon supporters were the most vocal and active. We also found that, when grouping users by their political affiliation, tweets from accounts most likely to be bots outnumber tweets from accounts that are most likely human for both the Republican and Democratic parties. Conservative accounts that are the most likely to be bots also have higher bot scores, suggesting that these accounts are more likely to be automated compared to their left-leaning counterparts [7]. We used Indiana University's Botometer, a tool that assigns a bot score to a Twitter account based on the account's activity [14, 15]. Others have also leveraged the polarized nature of the 2020 elections to model and estimate echo chambers based on a user's political stance [13].

34 https://www.census.gov/data/tables/time-series/demo/popest/2010s-state-total.html
35 https://www.pewresearch.org/internet/2019/04/24/sizing-up-twitter-users/
36 https://developer.twitter.com/en/solutions/academic-research

While this is just a sampling of current literature, there are many areas that are also being explored, including the presence, effect and detection of trolls [8] and foreign influence during the elections [7]. Many nascent and promising questions are also emerging in the wake of the elections, particularly as the COVID-19 pandemic has forced individuals to physically social distance and, consequently, seek community online.

After aggressive action to mitigate misinformation and the incitement of violence on major social network platforms, many flocked to alternative social network platforms that have espoused their support for freedom of speech, such as Parler and Gab. 37 While there has been much prior work in leveraging these alternative right-wing platforms to understand fringe views in conjunction with more mainstream platforms [16] [17] [18], the recent high-profile suspensions of major political figures' accounts led to an increased public awareness of and exodus to these platforms. Before Parler went offline, researchers even scraped post data. 38 Data collected across multiple platforms have the potential to give insight into how fringe communities not only survive these rebuffs by the community but also thrive in the controversy.

Another interesting question that arises is how the pandemic and the resulting shift to online platforms changed the nature and effectiveness of political campaigns. As some politicians quickly cancelled in-person events as the severity of COVID-19 rose, others chose to continue in-person rallies [1] . 39,40 Social media became an integral part of the campaign process, more so than before, as events such as the Democratic National Convention were held virtually. 41 Cross-platform studies will be essential in beginning to understand the full scope of how and to what extent COVID-19 has fundamentally altered our elections system.

The 2020 US Presidential election cycle has been marred by both the COVID-19 pandemic and controversy. In this paper, we presented a Twitter dataset that we have collected from May 20, 2019 through the months after the transition to the Biden administration. Twitter is by no means the only platform that campaigns leveraged to reach their base or where the public discussed their opinions. However, there has already been evidence that misinformation still persists on Twitter and other platforms, even as social media companies are making efforts to address this problem [5] [6] [7]. Having access to this curated dataset will allow researchers to delve into how a contentious election and its surrounding chatter unfolded, as traditionally offline events transitioned online.

37 https://www.businessinsider.com/gab-reports-growth-in-the-midst-of-twitter-bans-2021-1
38 https://www.washingtonpost.com/technology/2021/01/12/parler-data-downloaded/
39 https://www.nytimes.com/2020/03/10/us/politics/sanders-biden-rally-coronavirus.html
40 https://www.cnn.com/2020/10/29/health/covid-trump-rallies-counties-cases/index.html
41 https://www.nytimes.com/2020/08/17/us/politics/democratic-national-convention-recap.html

If you have technical questions about the data collection, please contact Emily Chen at echen920@usc.edu.

If you have any further questions about this dataset, please contact Dr. Emilio Ferrara at emiliofe@usc.edu.

[1] The effects of large group meetings on the spread of COVID-19: The case of Trump rallies

[2] COVID-19 misinformation and the 2020 U.S. presidential election. Special Issue on US Elections and Disinformation

[3] Tracking social media discourse about the COVID-19 pandemic: Development of a public coronavirus Twitter data set

[4] Real solutions for fake news? Measuring the effectiveness of general warnings and fact-check tags in reducing belief in false stories on social media

[5] Manipulation and abuse on social media

[6] What types of COVID-19 conspiracies are populated by Twitter bots? First Monday

[7] Characterizing social media manipulation in the

[8] Trollhunter2020: Real-time detection of trolling narratives on Twitter during the 2020 US elections

[9] Political polarization drives online conversations about COVID-19 in the United States

[10] Twitter use in election campaigns: A systematic literature review

[11] The tweets they are a-changin: Evolution of Twitter users and behavior

[12] Down the bot hole: Actionable insights from a 1-year analysis of bots activity on Twitter

[13] Echo chambers and segregation in social networks: Markov bridge models and estimation

[14] Detection of novel social bots by ensembles of specialized classifiers

[15] Arming the public with artificial intelligence to counter social bots

[16] International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva

[17] Who let the trolls out? Towards understanding state-sponsored trolls

[18] Elites and foreign actors among the alt-right: The Gab social media platform

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations

The authors would like to thank Dr. Elizabeth Fife for her assistance in editing this manuscript.

Funding The authors gratefully acknowledge support from the Annenberg Foundation.

Ethical approval This data collection is based on public data and is registered as IRB exempt by the University of Southern California IRB (approved protocol UP-17-00610).