Nearly four years ago I decided to start collecting tweets to Donald Trump out of morbid curiosity. If I were a real archivist, I would have planned this out a little bit better and started collecting on election night in 2016, or on inauguration day in 2017. I didn’t. Using twarc, I started collecting with the Filter (Streaming) API on May 7, 2017. That process failed, and I pivoted to using the Search API. I dropped that process into a simple bash script, and pointed cron at it to run every 5 days. Here’s what the bash script looked like:
#!/bin/bash
DATE=`date +"%Y_%m_%d"`
cd /mnt/vol1/data_sets/to_trump/raw
/usr/local/bin/twarc search 'to:realdonaldtrump' --log donald_search_$DATE.log > donald_search_$DATE.json
It’s not beautiful. It’s not perfect. But it did the job, for the most part, for almost four years, save and except a couple of Twitter suspensions on accounts that I used for collection, and an absolutely embarrassing situation where I forgot to set up cron correctly on a machine I moved the collecting to for a couple of weeks while I was on family leave this past summer.
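For anyone setting up something similar, here is a slightly more defensive sketch of the same job. It is not the script I actually ran. The data directory, twarc path, and search query come from the script above; the `DATA_DIR` and `TWARC` variables, the function names, and the crontab path in the comment are made up for illustration.

```bash
#!/bin/bash
# Sketch only: a more defensive variant of the collection script above.
# Stop on errors, unset variables, and failed pipeline stages.
set -euo pipefail

# Defaults taken from the original script; override via the environment if needed.
DATA_DIR="${DATA_DIR:-/mnt/vol1/data_sets/to_trump/raw}"
TWARC="${TWARC:-/usr/local/bin/twarc}"

# Shared basename for the log and output files, e.g. donald_search_2021_01_20.
run_basename() {
    echo "donald_search_$(date +"%Y_%m_%d")"
}

# Run one search and write line-delimited JSON plus a log into DATA_DIR.
collect() {
    local base
    base="$(run_basename)"
    "$TWARC" search 'to:realdonaldtrump' \
        --log "${DATA_DIR}/${base}.log" > "${DATA_DIR}/${base}.json"
}

# Hypothetical crontab entry (the script path is an assumption):
#   0 0 */5 * * /home/user/bin/collect_to_trump.sh
# Note that */5 in the day-of-month field restarts at the 1st of each month,
# so the gap between the last run of one month and the first of the next
# is shorter than 5 days.
```

Calling `collect` at the bottom of the file turns this into a drop-in replacement; I left the call out so the functions can be sourced and tried on their own.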
In the end, the collection ran from May 7, 2017 to January 20, 2021, and collected 362,464,578 unique tweets; 1.5T of line-delimited JSON! The final created_at timestamp was Wed Jan 20 16:49:03 +0000 2021, and the text of that tweet very fittingly reads, “@realDonaldTrump YOU’RE FIRED!”
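A note on "unique": the standard Search API reaches back roughly seven days, and the job ran every five, so consecutive runs overlap and the raw files contain duplicates. The sketch below is one way to count distinct tweets across the raw files. It is not how I produced the number above; it assumes `jq` is installed and that each line is a tweet object with a top-level `id_str` field, and the function name is made up.

```bash
#!/bin/bash
# Sketch only: count distinct tweet IDs across one or more line-delimited JSON files.
# Assumes jq is available and every line is a tweet with a top-level id_str.
count_unique_tweets() {
    # Pull the top-level ID from each tweet, de-duplicate, and count.
    # tr strips the padding some wc implementations add around the number.
    jq -r '.id_str' "$@" | sort -u | wc -l | tr -d '[:space:]'
}
```

Usage would be something like `count_unique_tweets donald_search_*.json`. At this scale `sort` will spill to disk, so expect it to take a while.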
The “dehydrated” tweets can be found here. In that dataset I decided to include a number of derivatives created with twut, which I hope round out the dataset. This update is the final update on the dataset.
I also started working on some notebooks here where I’ve been trying to explore the dataset a bit more in my limited spare time. I’m hoping to have the time and energy to really dig into this dataset sometime in the future. I’m especially curious about what the lead-up to the 2021 storming of the United States Capitol looks like in the dataset, as well as the sockpuppet frequency. I’m also hopeful that others will explore the dataset and that it’ll be useful in their research. I have a suspicion folks can do a lot smarter, more innovative, and more creative things with the dataset than I did here, here, here, here, or here.
For those who are curious what the tweet volume for the last few months looked like (please note that the dates are UTC), check out these bar charts. January 2021 is especially fun.
-30-