key: cord-0444125-luwjac8u authors: Dashtian, Hassan; Murthy, Dhiraj title: CML-COVID: A Large-Scale COVID-19 Twitter Dataset with Latent Topics, Sentiment and Location Information date: 2021-01-28 journal: nan DOI: nan sha: cdd13bca69f881c22d557ef5116646b7f310a7c3 doc_id: 444125 cord_uid: luwjac8u As a platform, Twitter has been a significant public space for discussion related to the COVID-19 pandemic. Public social media platforms such as Twitter represent important sites of engagement regarding the pandemic, and these data can be used by research teams for social, health, and other research. Understanding public opinion about COVID-19 and how information diffuses in social media is important for governments and research institutions. Twitter is a ubiquitous public platform and, as such, has tremendous utility for understanding public perceptions, behavior, and attitudes related to COVID-19. In this research, we present CML-COVID, a COVID-19 Twitter data set of 19,298,967 tweets from 5,977,653 unique individuals, and summarize some of the attributes of these data. These tweets were collected between March 2020 and July 2020 using the COVID-19-related query terms 'coronavirus', 'covid', and 'mask'. We use topic modeling, sentiment analysis, and descriptive statistics to describe the tweets related to COVID-19 we collected and the geographical location of tweets, where available. We provide information on how to access our tweet dataset (archived using twarc). COVID-19, an unparalleled global health emergency, led to an exceptional social response on social media platforms, including posts related to social, political, and economic life. High volumes of COVID-19-related misinformation are also present on online social networks such as Twitter [1]. As 68% of Americans report that they use social media to access information and news [2, 3], it is critical that attitudes, perceptions, and responses to COVID-19 are studied using social media data.
Furthermore, one third of people report that Twitter is their most important source of scientific information and news [3]. Twitter, on the other hand, can be a source of misinformation about health issues such as vaccination [4]. While the Ebola outbreak in 2014 [5] and the spread of Zika in 2016 [6] highlight the importance of studying pandemics in the context of social networks [7, 8], there is a new urgency in monitoring social media content related to COVID-19 [3]. Specifically, social media data related to the COVID-19 pandemic can be used, for example, to study: (1) the impact of social networks on health information and misinformation, (2) how the diffusion and spread of misinformation can influence behavior and beliefs, and (3) the effectiveness of COVID-19-related actions and campaigns deployed by agencies and governments at global and local scales [9]. In this paper, we explore the frequency of tweet activity related to COVID-19, and we make our data and source code publicly available for others to use. We collected tweets in real time using the Twitter API from March to July 2020 with the COVID-19-related query terms 'coronavirus', 'covid', and 'mask'. Here, we describe our data collection methods, present basic statistics of the dataset, and provide information about how to obtain and use the data. Our curated data set, CML-COVID, includes 19,298,967 tweets from 5,977,653 unique individuals collected from March to June 2020; users posted an average of 3 tweets each. All data were collected from Twitter through Netlytic [11], which queried the Twitter REST API. The dataset is roughly 15 GB of raw data. To comply with Twitter's Terms & Conditions (T&C), we have not publicly released the full text or API-derived information from the collected tweets.
Rather, our released data set includes a list of the tweet IDs that others can use to retrieve the full tweet objects directly from Twitter using their own API calls. There are a variety of tools to accomplish this task, such as Hydrator. Twitter also provides documentation on its Developer site on how to hydrate up to 100 tweets per API request. First, we pre-processed each raw tweet by concatenating and converting csv files into Python DataFrames and lists to optimize our subsequent data processing. The pre-processing included removing characters such as "\", "/", and "*", filtering out stop words (as well as the rarest and most frequent words), and tokenizing the text. This step is essential to the subsequent steps of topic modeling and sentiment analysis. For topic modeling, we applied Latent Dirichlet Allocation (LDA), an unsupervised topic clustering technique. We used TextBlob to perform sentiment analysis. We found extraneous terms (e.g., 'amp', 'dan', and 'na') in our derived topic models; we therefore removed these terms and re-ran LDA to present clearer topic modeling results (see Table 3). A preliminary analysis of the data shows that English is the dominant language in the tweets we collected (65.4%). One reason for this is that the keywords we used for querying the Twitter API were all English-language; however, other languages are also notably present. For example, 12.2% of the tweets are Spanish-language. Table 1 summarizes the top 10 languages, the frequency of associated tweets, and the percentage of each language in our dataset. A total of 63 different languages were identified among the tweets, and 3.4% of tweets had an undefined language. We also summarize the top 10 locations (by country and city) of users in Table 2. We retrieved the location of users based on what is reported in their profiles. Therefore, a user's location is derived from the text in their profile and not from GPS coordinates.
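Merging variant spellings of these free-text profile locations can be done with a small alias table. The sketch below is purely illustrative: the mapping, function name, and behavior for unknown strings are our assumptions, not the authors' actual cleaning code.

```python
# Illustrative sketch of profile-location normalization (the alias table and
# function name are hypothetical, not the paper's actual cleaning code).
ALIASES = {
    "usa": "United States",
    "u.s.": "United States",
    "united states": "United States",
    "uk": "United Kingdom",
}

def normalize_location(profile_location):
    """Map a free-text profile location onto a canonical name, if known."""
    if not profile_location or not profile_location.strip():
        return None  # empty profile field: no location information
    key = profile_location.strip().lower()
    # Fall back to the cleaned original string when no alias matches.
    return ALIASES.get(key, profile_location.strip())
```

With such a table, "USA", "usa", and "United States" all resolve to a single canonical location and are counted together.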
Locations such as "USA" and "United States" are considered the same and merged into a single location (i.e., 'United States'). For each state in the United States with an identifiable state-level location, we counted the number of tweets and calculated the frequency of tweets per day. These are illustrated in Figure 1. The United States has the highest frequency of tweets during the period in which we collected these data. The number of tweets is low for most regions and countries. As Figure 1 illustrates, Canada, Saudi Arabia, and India also have a high volume of tweets. We then conducted a frequency analysis by time: we identified the date and time of each tweet and counted the frequency of tweets for each day, as illustrated in Figure 3. Tweet frequency is relatively consistent over our data collection period. We then calculated the sentiment of each tweet. Though sentiment analysis has its limitations with large tweet corpora, we believe, like others, that there is some utility in understanding the top-level sentiment of these data [10]. We used TextBlob to extract sentiment scores from our collected tweets and divided tweet sentiment into three main categories: 'Negative', 'Neutral', and 'Positive'. For each day, we counted the number of tweets in each of these three categories. Figure 4 depicts the evolution of sentiment by category over time. As Figure 4 indicates, neutral tweets were the most numerous, followed by positive tweets. The gap in frequency between the three sentiment categories is reasonably large in the first two weeks of April 2020, but closes after the second week of April 2020. To perform topic modeling, we sampled 20% of the tweets in our dataset and trained an LDA model, which we used to estimate the most representative words in each topic. Using the trained LDA-based topic model, we obtained 10 topic clusters.
Table 3 illustrates the ten most representative terms associated with each detected 'topic' (three topics are shown in Table 3; Topic 1 is Spanish-language). Since tweets can be in any of 64 different languages, the topics and their top words may contain words and symbols from different languages. As we found, cleaning the data based on stopwords in one language is not enough to resolve these issues.

References
[1] Types, sources, and claims of COVID-19 misinformation
[2] The rise of social media
[3] A first look at COVID-19 information and misinformation sharing on Twitter
[4] Weaponized health communication: Twitter bots and Russian trolls amplify the vaccine debate
[5] Assessing the international spreading risk associated with the 2014 West African Ebola outbreak
[6] Rapid spread of Zika virus in the Americas: implications for public health preparedness for mass gatherings at the 2016 Brazil Olympic Games
[7] Content analysis of a live CDC Twitter chat during the 2014 Ebola outbreak
[8] How people react to Zika virus outbreaks on Twitter? A computational content analysis
[9] The COVID-19 social media infodemic
[10] Sentiment analysis of short informal texts
[11] Netlytic: Software for Automated Text and Social Network Analysis