title: Exploratory Analysis of Covid-19 Tweets using Topic Modeling, UMAP, and DiGraphs
authors: Ordun, Catherine; Purushotham, Sanjay; Raff, Edward
date: 2020-05-06

This paper illustrates five different techniques to assess the distinctiveness of topics, key terms and features, speed of information dissemination, and network behaviors for Covid19 tweets. First, we use pattern matching and second, topic modeling through Latent Dirichlet Allocation (LDA) to generate twenty different topics that discuss case spread, healthcare workers, and personal protective equipment (PPE). One topic specific to U.S. cases would start to uptick immediately after live White House Coronavirus Task Force briefings, implying that many Twitter users are paying attention to government announcements. We contribute machine learning methods not previously reported in the Covid19 Twitter literature. This includes our third method, Uniform Manifold Approximation and Projection (UMAP), which identifies unique clustering behavior of distinct topics to improve our understanding of important themes in the corpus and help assess the quality of generated topics. Fourth, we calculated retweeting times to understand how fast information about Covid19 propagates on Twitter. Our analysis indicates that the median retweeting time of Covid19 for a sample corpus in March 2020 was 2.87 hours, approximately 50 minutes faster than repostings from Chinese social media about H7N9 in March 2013. Lastly, we sought to understand retweet cascades by visualizing the connections of users over time from fast to slow retweeting. As the time to retweet increases, the density of connections also increases, and in our sample we found distinct users dominating the attention of Covid19 retweeters. One of the simplest highlights of this analysis is that early-stage descriptive methods like regular expressions can successfully identify high-level themes which were consistently verified as important through every subsequent analysis.

Monitoring public conversations on Twitter about healthcare and policy issues provides one barometer of American and global sentiment about Covid19. This is particularly valuable as the situation with Covid19 changes every day and is unpredictable during these unprecedented times. Twitter has been used as an early warning notifier, emergency communication channel, public perception monitor, and proxy public health surveillance data source in a variety of disaster and disease outbreaks, from hurricanes [1], terrorist bombings [2], tsunamis [3], earthquakes [4], seasonal influenza [5], and Swine flu [6] to Ebola [7]. In this paper, we conduct an exploratory analysis of topics and network dynamics of Covid19 tweets. Since January 2020, there have been a growing number of papers that analyze Twitter activity during the Covid19 pandemic in the United States. We provide a sample of papers published since January 1, 2020 in Table I. Chen, et al. analyzed the frequency of 22 different keywords such as "Coronavirus", "Corona", "CDC", "Wuhan", "Sinophobia", and "Covid-19" across 50 million tweets from January 22, 2020 to March 16, 2020 [8]. Thelwall also published an analysis of topics for English-language tweets from March 10-29, 2020 [9]. Singh et al. [10] analyzed the distribution of languages and the propagation of myths, Sharma et al.
[11] implemented sentiment modeling to understand perception of public policy, and Cinelli et al. [12] compared Twitter against other social media platforms to model information spread. Our contributions are machine learning methods not previously applied to Covid19 Twitter data, mainly Uniform Manifold Approximation and Projection (UMAP) to visualize LDA-generated topics, and directed graph visualizations of Covid19 retweet cascades. Topics generated by LDA can be difficult to interpret, and while coherence values [22] are intended to score the interpretability of topics, these scores are themselves difficult to interpret and subjective. As a result, we apply UMAP, a dimensionality reduction algorithm and visualization tool that "clusters" documents by topic. Vectorizing the tweets using term-frequency inverse-document-frequency (TF-IDF) and plotting a UMAP visualization with the assigned topics from LDA allowed us to identify strongly localized and distinct topics. We then visualized "retweet cascades", which describe how a social media network propagates information [23], through the use of graph models to understand how dense networks become over time and which users dominate the Covid19 conversations. In our retweeting time analysis, we found that the median time for Covid19 messages to be retweeted is approximately 50 minutes faster than for H7N9 messages during a March 2013 outbreak in China, possibly indicating the global nature, volume, and intensity of the Covid19 pandemic. Our keyword analysis and topic modeling were also rigorously explored, where we found that specific topics were triggered to uptick by live White House briefings, implying that Covid19 Twitter users are highly attuned to government broadcasts. We think this is important because it highlights how other researchers have identified that government agencies play a critical role in sharing information via Twitter to improve situational awareness and disaster response [24].

TABLE I: Sample of papers analyzing Covid19 Twitter activity published since January 2020.
Study | Tweets | Collection period | Methods
[14] | 30,990,645 | Jan. 1 - Apr. 4, 2020 | x
Medford, et al. [15] | 126,049 | Jan. 14 - Jan. 28, 2020 | x x x x
Singh, et al. [10] | 2,792,513 | Jan. 16 - Mar. 15, 2020 | x x x x
Lopez, et al. [16] | 6,468,526 | Jan. 22 - Mar. 13, 2020 | x x x
Cinelli, et al. [12] | 1,187,482 | Jan. 27 - Feb. 14, 2020 | x x x
Kouzy, et al. [17] | 673 | Feb. 27, 2020 | x x
Alshaabi, et al. [18] | Unknown | Mar. 1 - Mar. 21, 2020 | x x
Sharma, et al. [11] | 30,800,000 | Mar. 1 - Mar. 30, 2020 | x x x x x x x
Chen, et al. [8] | 8,919,411 | Mar. 5 - Mar. 12, 2020 | x
Schild [19] | 222,212,841 | Nov. 1, 2019 - Mar. 22, 2020 | x x x x
Yang, et al. [20] | Unknown | Mar. 9 - Mar. 29, 2020 | x x
Ours | 23,830,322 | Mar. 24 - Apr. 9, 2020 | x x x x x
Yasin-Kabir, et al. [21] | 100,000,000 | Mar. 5 - Apr. 24, 2020 | x x x x

Our LDA models confirm that topics detected by Thelwall et al. [9] and Sharma et al. [11], who analyzed Twitter during a similar period of time, were also identified in our dataset, which emphasized healthcare providers, personal protective equipment such as masks and ventilators, and cases of death. This paper studies five research questions:
1) What high-level trends can be inferred from Covid19 tweets?
2) Are there any events that lead to spikes in Covid19 Twitter activity?
3) Which topics are distinct from each other?
4) How does the speed of retweeting in Covid19 compare to other emergencies, and especially similar infectious disease outbreaks?
5) How do Covid19 networks behave as information spreads?
The paper begins with Data Collection, followed by the five stages of our analysis: Keyword Trend Analysis, Topic Modeling, UMAP, Time-to-Retweet Analysis, and Network Analysis. Our methods and results are explained in each section. The paper concludes with limitations of our analysis. The Appendix provides additional graphs as supporting evidence.

II. DATA COLLECTION

Similar to the researchers in Table I, we collected Twitter data by leveraging the free Streaming API. From March 24, 2020 to April 9, 2020, we collected 23,830,322 tweets (173 GB). Note, in this paper we refer to the Twitter data interchangeably as both "dataset" and "corpora" and refer to the posts as "tweets". Our dataset is a collection of tweets from the different time periods shown in Table V. Using the Twitter API through tweepy, a Python Twitter mining and authentication library, we first filtered the Twitter track on twelve query terms to capture a healthcare-focused dataset: 'ICU beds', 'ppe', 'masks', 'long hours', 'deaths', 'hospitalized', 'cases', 'ventilators', 'respiratory', 'hospitals', '#covid', and '#coronavirus'. For the keyword analysis, topic modeling, and UMAP tasks, we analyzed only non-retweets, which brought the corpus down to 5,506,223 tweets. For the Time-to-Retweet and Network Analysis, we included retweets but selected a sample of 736,561 tweets out of the larger 23.8 million tweet corpus. Our preprocessing steps are described in the Data Analysis section that follows.

Prior to applying keyword analysis, we first had to preprocess the corpus on the "text" field. First, we removed retweets using regular expressions, in order to focus the text on original tweets and authorship, as opposed to retweets that can inflate the number of messages in the corpus. We use this retweet-free corpus for the keyword trend analysis as well as the topic modeling and UMAP analyses. Further, we converted datetimes to UTC format, removed digits and words shorter than 3 characters, extended the NLTK stopwords list to also exclude "coronavirus", "covid19", "19", and "covid", removed "https:" hyperlinks, removed "@" username mentions, removed non-Latin characters such as Arabic or Chinese characters, and applied lower-casing, stemming, and tokenization (a code sketch of these steps appears below). Finally, using regular expressions, we extracted tweets matching the keyword groups listed in Table VI; the frequencies of tweets per minute are shown here in Table II. The greatest rate of tweets occurred for tweets containing the term "mask" (mean 55.044 tweets per minute) in Table II, followed by "hospital" (mean 32.370) and "vent" (mean 24.811). Groups with less than 1.0 mean tweets per minute concerned testing positive, being in serious condition, exposure, cough, and fever. This may indicate that people are discussing the issues around Covid19 more frequently than symptoms and health conditions in this dataset. We will later find that several themes consistent with these keyword findings appear in the topic modeling, including personal protective equipment (PPE) like ventilators and masks, and healthcare workers like nurses and doctors.

LDA models are mixture models, meaning that documents can belong to multiple topics and membership is fractional [25]. Further, each topic is a mixture of words, where words can be shared among topics. This allows for a "fuzzy" form of unsupervised clustering where a single document can belong to multiple topics, each with an associated probability. LDA is a bag-of-words model where each vector is a count of terms. LDA requires the number of topics to be specified.
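Before moving to the LDA experiments, the preprocessing and model-fitting steps described above can be sketched with NLTK and gensim as follows. This is a minimal illustration rather than the authors' code: the regular expressions, the PorterStemmer choice, and the `raw_tweets` input are assumptions, while the LdaMulticore settings mirror those reported in the Appendix.

```python
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from gensim.corpora import Dictionary
from gensim.models import LdaMulticore, CoherenceModel

# Extended stopword list as described in the text (base list: NLTK English stopwords).
STOPWORDS = set(stopwords.words("english")) | {"coronavirus", "covid19", "19", "covid"}
stemmer = PorterStemmer()  # the paper says "stemming"; PorterStemmer is an assumed choice

def is_retweet(text):
    # Retweets are removed with regular expressions; the exact pattern is not published.
    return bool(re.match(r"^RT @", text))

def clean_tweet(text):
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # strip hyperlinks
    text = re.sub(r"@\w+", " ", text)           # strip @username mentions
    text = re.sub(r"[^a-z\s]", " ", text)       # drop digits and non-Latin characters
    tokens = [t for t in text.split()
              if len(t) >= 3 and t not in STOPWORDS]
    return [stemmer.stem(t) for t in tokens]

# `raw_tweets` (a list of tweet "text" fields) is an assumed input.
docs = [clean_tweet(t) for t in raw_tweets if not is_retweet(t)]

# Bag-of-words corpus for LDA: each tweet becomes a vector of term counts.
dictionary = Dictionary(docs)
bow = [dictionary.doc2bow(d) for d in docs]

# One LDA fit with the settings reported in the Appendix; the topic-count sweep
# described next repeats this for 2 to 30 topics and keeps the most coherent model.
lda = LdaMulticore(bow, id2word=dictionary, num_topics=20, workers=20,
                   chunksize=10000, iterations=100, passes=2, gamma_threshold=0.001)
coherence = CoherenceModel(model=lda, texts=docs, dictionary=dictionary,
                           coherence="c_v").get_coherence()
print(coherence)  # the paper reports 0.344 for its selected 20-topic model
```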
Similar to methods described by Syed et al. [26], we ran 15 different LDA experiments varying the number of topics from 2 to 30, and selected the model with the highest coherence value score. We selected the LDA model that generated 20 topics, with a moderate coherence score of 0.344. Roder et al. [22] developed the coherence value as a metric that combines the agreement of word pairs and word subsets, along with their associated word probabilities, into a single score. In general, topics are interpreted as being coherent if all or most of the terms are related. Our final model generated 20 topics using the default parameters listed in the Appendix; the topics are shown in Figure 2 and include the terms generated and each topic's coherence score measuring interpretability. Similar to the high-level trends inferred from extracting keywords, themes about PPE and healthcare workers dominate the topics. The terms generated also indicate emerging words in public conversation, including "hydroxychloroquine" and "asymptomatic". Our results also show four topics that are in non-English languages. In our preprocessing, we removed non-Latin characters in order to filter out a high volume of Arabic and Chinese characters. Twitter provides a Tweet object metadata field, "lang", that can be used to filter tweets by a specific language such as English ("en"). However, we decided not to filter against the "lang" element because, upon observation, approximately 2.5% of the dataset consisted of an "undefined" language tag, meaning that no language was indicated. Although it appears to be a small fraction, removing even the "undefined" tweets would have removed several thousand tweets. Some of these tweets that are tagged as "undefined" are in English but contain hashtags, emojis, and Arabic characters. As a result, we did not filter for English, leading our topics to be a mix of English, Spanish, Italian, French, and Portuguese. Although this introduced challenges in interpretation, we feel it demonstrates the global nature of worldwide conversations about Covid19 occurring on Twitter. This is consistent with the variety of languages Singh et al. [10] reported in Covid19 tweets upon analyzing over 2 million tweets. As a result, we labeled the four topics by the language of the terms in the respective topics: "Spanish" (Topic 1), "Portuguese" (Topic 14), "Italian" (Topic 16) and "French" (Topic 19). We used Google Translate to infer the language of the terms. When examining the distribution of the 20 topics across the corpora in Figure 2, Topics 18 ("potus"), 12 ("case.death.new"), 13 ("mask.ppe.ventil"), and 2 ("like.look.work") were among the top five in the entire corpora. For each plot, we labeled each topic with its first three terms for interpretability. In our trend analysis, we summed the number of tweets per minute and then applied a 60-minute moving weighted average, both for the March 24 - March 28 corpora and for the March 30 - April 8 corpora. We provided two different plots in order to visualize smaller time frames, such as the 44-minute March 24 collection, alongside the longer ones. Figure 3 and Figure 4 show similar trends on a per-minute time-series basis across the entire corpora of 5,506,223 tweets. These plots are in a "broken axes" style (https://github.com/bendichter/brokenaxes) to indicate that the corpora are not continuous periods of time, but discrete time frames, which we selected to plot on one axis for convenience and legibility.
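A sketch of how such per-minute topic trends can be computed and smoothed with pandas is shown below. The DataFrame `df`, with a UTC `created_at` timestamp and an LDA-assigned `topic` column per tweet, is an assumption for illustration; the 60-minute window follows the text, and the pandas ewm smoothing (spans of 5 and 20) follows the Appendix.

```python
import pandas as pd

# Assumed input: one row per tweet with its UTC timestamp and LDA-assigned topic id.
df["created_at"] = pd.to_datetime(df["created_at"], utc=True)

# Tweets per minute for each topic.
per_minute = (df.set_index("created_at")
                .groupby("topic")
                .resample("1min")
                .size()
                .unstack(level=0, fill_value=0))  # rows: minutes, columns: topics

# A simple 60-minute rolling mean stands in for the paper's 60-minute moving
# weighted average used in the trend plots.
rolling_60 = per_minute.rolling(window=60, min_periods=1).mean()

# Exponential weighted moving average (pandas ewm), as reported in the Appendix
# for the change point figures; Topic 18 is the "potus" topic.
smoothed_topic18 = per_minute[18].ewm(span=20).mean()
```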
We direct the reader to Table V for reference on the start and end datetimes, which are in UTC format, so please adjust accordingly for time zone. The x-axis denotes the number of minutes, where the entire corpora spans 8,463 total minutes of tweets. Figure 3 shows that for the corpora of March 24, 25, and 28, Topic 18 "potus" and Topic 13 "mask.ppe.ventil" (denoted in hash-marked lines) trended highest. For the later time periods of March 30, March 31, April 4, 5 and 8 in Figure 4, Topic 18 "potus" and Topic 13 "mask.ppe.ventil" (also in hash-marked lines) continued to trend high. It is also interesting that Topic 18 was never replaced as the top trending topic across a span of 17 days (the April 8, 2020 corpus also includes the early hours of April 9, 2020 EST), potentially because it served as a proxy for active listening to government announcements. The time series would decrease in frequency during overnight hours.

We applied change point detection to the time series of tweets per minute for Topic 18 in the datasets of March 24, 2020, April 3 - 4, 2020, April 5 - 6, 2020, and April 8, 2020, to identify whether the live press briefings coincided with inflections in time. Using the ruptures Python package [27], which contains a variety of change point detection methods, we applied binary segmentation [28], a standard method for change point detection. Given a sequence of data $y_{1:n} = (y_1, \ldots, y_n)$, the model will have $m$ changepoints with positions $\tau_{1:m} = (\tau_1, \ldots, \tau_m)$. Each changepoint position is an integer between 1 and $n-1$. The $m$ changepoints split the time series data into $m+1$ segments, with the $i$th segment containing $y_{(\tau_{i-1}+1):\tau_i}$. Changepoints are identified by minimizing a cost function $C$ over the segments,

$$\sum_{i=1}^{m+1} C\left(y_{(\tau_{i-1}+1):\tau_i}\right) + \beta f(m),$$

where $\beta f(m)$ is a penalty to prevent overfitting and twice the negative log-likelihood is a commonly used cost function. Binary segmentation detects multiple changepoints across the time series by repeatedly testing on different subsets of the sequence. It checks whether a $\tau$ exists that satisfies

$$C\left(y_{1:\tau}\right) + C\left(y_{(\tau+1):n}\right) + \beta < C\left(y_{1:n}\right).$$

If not, then no changepoint is detected and the method stops. But if a changepoint is detected, the data are split into two segments consisting of the time series before (Figure 7, blue) and after (Figure 7, pink) the changepoint. We can clearly see in Figure 7 that the timing of the White House briefing indicates a changepoint in time, giving us the intuition that this briefing influenced an uptick in the number of tweets. We provide additional examples in the Appendix.

Our topic findings are consistent with the published analyses on Covid19 and Twitter, such as Singh et al. [10], who found major themes of healthcare and illness and international dialogue, as we noticed in our four non-English topics. They are also similar to those of Thelwall et al. [9], who manually reviewed tweets from a corpus of 12 million tweets collected earlier than and overlapping our dataset (March 10 - 29). Topics from their findings that are similar to ours include "lockdown life", "politics", "safety messages", "people with COVID-19", "support for key workers", "work", and "COVID-19 facts/news". Further, our dataset of Covid19 tweets from March 24 to April 8, 2020 occurred during a month of exponential case growth. By the end of our data collection period, the number of cases had increased by 7 times, to 427,460 cases on April 8, 2020 [29].
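The change point step can be reproduced with ruptures roughly as follows. The `topic18_per_minute` array of Topic 18 tweet counts per minute is an assumed input; the "l2" cost model and the 10 breakpoints follow the parameters reported in the Appendix.

```python
import numpy as np
import ruptures as rpt

# Assumed input: a 1-D array of Topic 18 tweet counts per minute for one collection window.
signal = np.asarray(topic18_per_minute, dtype=float)

# Binary segmentation with a least-squares ("l2") cost, as reported in the Appendix.
algo = rpt.Binseg(model="l2").fit(signal)
breakpoints = algo.predict(n_bkps=10)  # indices of the minutes where detected segments end

# A breakpoint falling near the start of a live White House briefing supports the
# reading that the briefing coincided with an uptick in tweet volume (Figure 7).
print(breakpoints)
```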
The key topics we identified using our multiple methods were representative of the public conversations being had in news outlets during March and April.

Term-frequency inverse-document-frequency (TF-IDF) [34] is a weight that signifies how valuable a term is within a document in a corpus, and can be calculated at the n-gram level. TF-IDF has been widely applied for feature extraction on tweets used for text classification [35] [36], analyzing sentiment [37], and for text matching in political rumor detection [23]. With TF-IDF, unique words carry greater information and value than common, high-frequency words across the corpus. TF-IDF can be calculated as

$$w_{i,j} = tf_{i,j} \times \log\frac{N}{df_i},$$

where $i$ is the term, $j$ is the document, and $N$ is the total number of documents in the corpus. TF-IDF multiplies the term frequency $tf_{i,j}$ by the log of the inverse document frequency $\frac{N}{df_i}$. The term frequency $tf_{i,j}$ is calculated as the frequency of $i$ in $j$ divided by the count of all terms in $j$. The inverse document frequency $\frac{N}{df_i}$ is the total number of documents $j$ in the corpus divided by the number of documents $j$ containing the term $i$. Using the Scikit-Learn implementation of TfidfVectorizer and setting max_features to 10000, we transformed our corpus of 5,506,223 tweets into a sparse $\mathbb{R}^{n \times k}$ matrix of shape (5506223, 10000). Note, prior to fitting the vectorizer, our corpus of tweets was pre-processed during the keyword analysis stage.

We chose to visualize how the 20 topics grouped together using Uniform Manifold Approximation and Projection (UMAP) [38]. UMAP is a dimension reduction algorithm that finds a low dimensional representation of data with topological properties similar to those of the high dimensional space. It measures the local distance of points across a neighborhood graph of the high dimensional data, capturing what is called a fuzzy topological representation of the data. Optimization is then used to find the closest fuzzy topological structure by first approximating nearest neighbors using the Nearest-Neighbor-Descent algorithm and then minimizing local distances of the approximate topology using stochastic gradient descent [39]. When compared to t-Distributed Stochastic Neighbor Embedding (t-SNE), UMAP has been observed to be faster [40] with clearer separation of groups. Due to compute limitations in fitting the entire high dimensional vector of nearly 5.5M records, we randomly sampled one million records. We created a two-component embedding of the vectors by fitting the UMAP model with the Hellinger metric, which compares distances between probability distributions $P = (p_1, \ldots, p_k)$ and $Q = (q_1, \ldots, q_k)$:

$$H(P, Q) = \frac{1}{\sqrt{2}} \sqrt{\sum_{i=1}^{k} \left(\sqrt{p_i} - \sqrt{q_i}\right)^2}.$$

We visualized the word vectors with their respective labels, which were the assigned topics generated from the LDA model. We used the default parameters of n_neighbors = 15 and min_dist = 0.1. Figure 6 presents the visualization of the TF-IDF word vectors for each of the 1 million tweets with their labeled topics. UMAP is supposed to preserve local and global structure of data, unlike t-SNE, which separates groups but does not preserve global structure. As a result, UMAP visualizations are intended to allow the reader to interpret distances between groups as meaningful. In Figure 6, each tweet is color-coded by its assigned topic. The UMAP plots appear to provide further evidence of the quality and number of topics generated. Our observation is that many of these topic "clusters" appear to have a single dominant color, indicating distinct grouping.
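A minimal sketch of this vectorize-then-embed step follows, assuming `sampled_tweets` is the list of one million preprocessed tweet strings and `topic_labels` their LDA topic assignments; the max_features, metric, n_neighbors, and min_dist values mirror those reported above, while the plotting details are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
import umap
import matplotlib.pyplot as plt

# Assumed inputs: one million preprocessed tweet strings and their LDA-assigned topic ids.
vectorizer = TfidfVectorizer(max_features=10000)
X = vectorizer.fit_transform(sampled_tweets)  # sparse TF-IDF matrix, shape (1_000_000, 10000)

# Two-component UMAP embedding with the Hellinger metric and default n_neighbors / min_dist.
reducer = umap.UMAP(n_components=2, metric="hellinger", n_neighbors=15, min_dist=0.1)
embedding = reducer.fit_transform(X)

# Scatter plot color-coded by LDA topic, analogous to Figure 6.
plt.scatter(embedding[:, 0], embedding[:, 1], c=topic_labels, s=1, cmap="tab20")
plt.show()
```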
There is strong local clustering for topics that were also prominent in the keyword analysis and topic modeling time series plots. A very distinct and separated mass of purple tweets represents the "100: N/A" topic, which is an undefined topic. This means that the LDA model output equal scores across all 20 topics for any single tweet. As a result, we could not assign a topic to these tweets because they all had uniform scores. But this visualization informs us that the contents of these tweets were uniquely distinct from the others. Examples of tweets in this "100: N/A" category include "See, #Democrats are always guilty of whatever", "Why are people still getting in cruise ships?!?", "Thank you Mike you are always helping others and sponsoring Anchors media shows.", "We cannot let this woman's brave and courageous actions go to waste! #ChinaLiedPeopleDied #Chinaneedstopay", and "I wish people in this country would just stay the hell home instead of GOING TO THE BEACH". Other observations reveal that the mask-related topic 10 in purple, and potentially a combination of topics 8 and 9 in red, are distinct from the mass of noisy topics in the center of the plot. We can also see distinct separation of aqua-colored topic 18 "potus" and potentially topics 5 and 6 in yellow. We refer the reader to other examples where UMAP has been leveraged for Twitter analysis, including Darwish et al. [41] for identifying clusters of Twitter users with controversial topic similarity and for political polarization, Vargas [42] for event detection, and [43] for estimating the political leaning of users.

Retweeting is an activity specific to Twitter in which any user can "retweet" messages, allowing them to disseminate those messages rapidly to their followers. Further, a highly retweeted tweet might signal that an issue has attracted attention in the highly competitive Twitter environment, and may give insight about issues that resonate with the public [44]. Whereas in the first three analyses we used no retweets, in the time-series and network modeling that follow, we exclusively use retweets. We began by measuring time-to-retweet. Wang et al. [1] call this "response time" and used it to measure response efficiency and speed of information dissemination during Hurricane Sandy. Wang analyzed 986,579 tweets and found that 67% of retweets occur within 1 hour [1]. We researched how fast other users retweet in emergency situations, such as what Spiro [45] reported for natural disasters and the 19 seconds Earle [46] reported for retweeting about an earthquake. We extracted metadata from our corpora for the Tweet, User, and Entities objects. For reference, we direct the reader to the Twitter Developer guide, which provides a detailed overview of each object [47]. Due to compute limitations, we selected a sample of 736,561 tweets, including retweets, from the corpora of March 24 - 28, 2020. However, since we were only focused on retweets, we reduced this sample of 736,561 tweets to the 567,909 (77%) that were retweets. The metadata we used for both our Time-to-Retweet and Directed Graph analyses in the next section included:
1) created_at (string) - UTC time when this Tweet was created.
2) text (string) - The actual UTF-8 text of the status update. See twitter-text for details on what characters are currently considered valid.
3) From the User object, the id_str (string) - The string representation of the unique identifier for this User.
4) From the retweeted_status object (Tweet) - the created_at UTC time when the retweeted message was created.
5) From the retweeted_status object (Tweet) - the id_str, which is the unique identifier for the retweeting user.

We used the corpus of retweets and analyzed the time between the tweet created_at and the retweeted created_at, i.e., the difference between the tw_object and rt_object datetimes. Here, the rt_object is the datetime in UTC format for when the message that was retweeted was originally posted. The tw_object is the datetime in UTC format when the current tweet was posted. As a result, the datetime for the rt_object is older than the datetime for the current tweet. This measures the time it took for the author of the current tweet to retweet the originating message. This is similar to Kuang et al. [48], who defined the response time of the retweet to be the time difference between the time of the first retweet and that of the origin tweet. Further, Spiro et al. [45] call these "waiting times". The median time-to-retweet for our corpus was 2.87 hours, meaning that half of the retweets occurred within this time (slower than the Hurricane Sandy retweets, of which Wang reported 67% occurred within 1.0 hour), and the mean was 12.3 hours. Figure 9 shows the histogram of the number of tweets by their time-to-retweet in seconds and Figure 10 shows it in hours. Further, we found that, compared to the 2013 Avian Influenza (H7N9) outbreak in China described by Zhang et al. [49], Covid19 retweeters sent more messages earlier. Zhang analyzed the log distribution of 61,024 H7N9-related posts during April 2013 and plotted the reposting times of messages on Sina Weibo, a Chinese Twitter-like platform and one of the largest microblogging sites in China (Figure 12). Zhang found that H7N9 reposting occurred with a median time of 222 minutes (i.e. 3.7 hours) and a mean of 8520 minutes (i.e. 142 hours). Compared to Zhang's study, we found our median retweet time to be 2.87 hours, about 50 minutes faster than the H7N9 reposting time of 3.7 hours. When comparing Figure 11 and Figure 12, it appears that Covid19 retweeting does not completely slow down until 2.78 hours later (10^4 seconds). For H7N9, it appears to slow down much earlier, by 10 seconds. Unfortunately, few studies appear to document retweeting times during infectious disease outbreaks, which made it hard to compare Covid19 retweeting behavior against similar situations. Further, the H7N9 outbreak in China occurred seven years ago and may not be a comparable set of data for numerous reasons: Chinese social media may not exhibit behaviors similar to American Twitter, and this analysis does not take into account multiple factors that influence retweeting behavior, including the context, the user's position, and the time the tweet was posted [44]. We also analyzed what rapid retweeters, those retweeting messages even faster than the median (in less than 10,000 seconds), were saying. In Figure 21 we plotted the top 50 TF-IDF features by their scores for the text of the retweets. It is intuitive to see, from the presence of "https" in the body of the retweeted text, that URLs are being retweeted quickly. This is also consistent with studies by Suh et al. [50], who indicated that tweets with URLs were a significant factor impacting retweetability. We found terms that were frequently mentioned during the early-stage keyword analysis and topic modeling mentioned again: "cases", "ventilators", "hospitals", "deaths", "masks", "test", "american", "cuomo", "york", "president", "china", and "news".
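A minimal sketch of this calculation with pandas is shown below; the DataFrame `df`, with one row per retweet and string columns `created_at` (the retweet's posting time) and `retweet_created_at` (the original message's posting time, from retweeted_status), is an assumption for illustration.

```python
import pandas as pd

# Assumed input: one row per retweet with the Twitter created_at strings already extracted.
df["created_at"] = pd.to_datetime(df["created_at"], utc=True)
df["retweet_created_at"] = pd.to_datetime(df["retweet_created_at"], utc=True)

# Time-to-retweet: how long after the original message was posted the retweet appeared.
df["time_to_retweet_s"] = (df["created_at"] - df["retweet_created_at"]).dt.total_seconds()

median_hours = df["time_to_retweet_s"].median() / 3600  # reported as 2.87 hours in the text
mean_hours = df["time_to_retweet_s"].mean() / 3600       # reported as 12.3 hours
print(median_hours, mean_hours)
```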
When analyzing the descriptions of the users who were retweeted in Figure 21, we ran the TF-IDF vectorizer on bigrams in order to elicit more interpretable terms. User accounts whose tweets were rapidly retweeted appeared to describe themselves as political, news-related, or some form of social media account, all of which are difficult to verify as real or fake.

VII. NETWORK MODELING

We analyzed the network dynamics of nine different time periods within the March 24 - 28, 2020 Covid19 dataset, and visualized them based on their speed of retweeting. These types of graphs have been referred to as "retweet cascades", which describe how a social media network propagates information [23]. Similar methods have been applied by Jin et al. [23] for visualizing rumor propagation. We wanted to analyze how Covid19 retweeting behaves at different time points. We used published disaster retweeting times to serve as benchmarks for selecting time periods. As a result, the graphs in Figure 8 are plotted by the retweeting times of known benchmarks: from the median time to retweet after an earthquake, which implies rapid notification, and the median time to retweet after a funnel cloud has been seen, all the way to a one-day (24 hour) time period. We did this to visualize a retweet cascade from fast to slow information propagation. We used the median retweeting times published by Spiro et al. [45] for the time it took users to retweet messages based on hazardous keywords like "Funnel Cloud", "Aftershock", and "Mudslide". We also used the H7N9 reposting time of 3.7 hours published by Zhang et al. [49]. We generated a directed graph for each of the nine time periods (shown in Table IV), where the network consisted of a source, which was the author of the tweet (User object, the id_str), and a target, which was the original retweeter. The goal was to analyze how connections change as the retweeting speed increases. The nine networks are visualized in Figure 8; a sketch of the graph construction follows at the end of this section. Graphs were plotted using networkx and drawn using the Kamada-Kawai layout [51], a force-directed algorithm. We modeled 700 users for each graph; we found that graphs with more nodes became too difficult to interpret. The size of a node indicates its degree, or the number of users it is connected to. This can mean that the node has been retweeted by others several times, or that the node itself has retweeted others several times. The density of each network increases over time, as shown in Figure 8 and Figure 13. Very rapid retweeters, in the time it takes to retweet after an earthquake, start off with a sparse network with a few nodes in the center being the focus of retweets in Figure 8a. By the time we reach Figure 8d, the retweeted users are much more clustered in the center and there are more connections and activity. The top retweeted user in our median-time network, Figure 8g, was a news network and tweeted "The team took less than a week to take the ventilator from the drawing board to working prototype, so that it can". By 24 hours out in Figure 8h, we see a concentrated set of users being retweeted, and by Figure 8i, one account appears to dominate the space, being retweeted 92 times. This account was retweeting the following message several times: "She was doing #chemotherapy couldn't leave the house because of the threat of #coronavirus so her line sisters...". In addition, the number of nodes generally decreased from 1278 in "earthquake" time to 1067 in one week, and the density also generally increased, as shown in Table IV.
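The graph construction can be sketched as below. The `edges` list of (retweeted author id, retweeter id) pairs for one retweeting-time window is an assumed input, while the 700-node limit and the Kamada-Kawai layout follow the text.

```python
import networkx as nx
import matplotlib.pyplot as plt

# Assumed input: (source, target) user id pairs for one retweeting-time window,
# where the source is the author of the retweeted message and the target the retweeter.
G = nx.DiGraph()
G.add_edges_from(edges)

# Limit the drawing to roughly 700 users, as in the paper, to keep the plot legible.
H = G.subgraph(list(G.nodes())[:700])
print("nodes:", H.number_of_nodes(), "density:", nx.density(H))

# Force-directed Kamada-Kawai layout; node size scales with degree so heavily
# retweeted (or heavily retweeting) accounts stand out.
pos = nx.kamada_kawai_layout(H)
sizes = [20 + 10 * H.degree(n) for n in H.nodes()]
nx.draw(H, pos, node_size=sizes, arrows=True, width=0.3)
plt.show()
```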
These retweet cascade graphs provide only an exploratory analysis. Network structures like these have been used to predict the virality of messages, for example memes over time as the message is diffused across networks [52]. But analyzing them further could enable 1) an improved understanding of how Covid19 information diffusion differs from other outbreaks or global events, 2) insight into how information is transmitted differently from region to region across the world, and 3) identification of which users and messages are being concentrated on over time. This would support strategies to improve government communications, emergency messaging, dispelling medical rumors, and tailoring public health announcements.

There are several limitations to this study. First, our dataset is discontinuous, and trends seen in Figure 3 and Figure 4 where there is an interruption in time should be taken with caution. Although there appears to be a trend between one discrete time and another, without the missing data it is impossible to confirm this as a trend. As a result, it would be valuable to apply these techniques on a larger and continuous corpus without any time breaks. We aim to repeat the methods in this study on a longer continuous stream of Twitter data in the near future. Next, the corpus we analyzed was already pre-filtered with thirteen "track" terms from the Twitter Streaming API that focused the dataset towards healthcare-related concerns. This may be the reason why the high-level keywords extracted in the first round of analysis were consistently mentioned throughout the different stages of modeling. However, after review of the similar papers indicated in Table I, we found that despite having filtered the corpus on healthcare-related terms, topics still appear to be consistent with analyses where corpora were filtered on limited terms like "#coronavirus". Third, the users and conversations on Twitter are not a direct representation of the U.S. or global population. The Pew Research Center found that only 22% of American adults use Twitter [53] and that this group is different from the majority of U.S. adults, because they are on average younger, more likely to identify as Democrats, more highly educated, and possess higher incomes [54]. The users were also not verified and should be considered a possible mixture of human and bot accounts. Fourth, we reduced our corpus to remove retweets for the keyword and topic modeling analyses, since retweets can obscure the message by introducing virality and altering the perception of the information [55]. As a result, this reduced the size of our corpus by nearly 77%, from 23,830,322 tweets to 5,506,223 tweets. However, there appears to be considerable variability in corpora sizes in the Twitter analysis literature, as shown in Table I. Fifth, our compute limitations prohibited us from analyzing a larger corpus for the UMAP, time-series, and network modeling. For the LDA models we leveraged the gensim LdaMulticore model, which allowed us to use multiprocessing across 20 workers. But for UMAP and the network modeling, we were constrained to use a CPU. However, as stated above, visualizing more than 700 nodes for our graph models was uninterpretable. Applying our methods across the entire 23.8 million tweet corpora for UMAP and the network models may yield more meaningful results. Sixth, we were only able to iterate over 15 different LDA models based on changing the number of topics, whereas Syed et al. [26] iterated on 480 models to select coherent models.
We believe that applying a manual grid search of the LDA parameters such as iterations, alpha, gamma threshold, chunksize, and number of passes would lead to a more diverse representation of LDA models and possibly more coherent topics. Seventh, it was challenging to identify papers that analyzed Twitter networks according to their speed of retweets for public health emergencies and disease outbreaks. Zhang et al. [49] point out that there are not enough studies of temporal measurement of public response to health emergencies. We were lucky to find the papers by Zhang et al. [49] and Spiro et al. [45], who published on disaster waiting times. Chew et al. [62] and Szomszor et al. [6] have published Twitter analyses of H1N1 and the Swine Flu, respectively. Chew analyzed the volume of H1N1 tweets and categorized different types of messages such as humor and concern. Szomszor correlated tweets with UK national surveillance data, and Tang et al. [63] generated a semantic network of tweets on measles during the 2015 measles outbreak to understand keywords mentioned about news updates, public health, vaccines, and politics. However, it was difficult to compare our findings against other disease outbreaks due to the lack of similar modeling and published retweet cascade times and network models.

We answered five research questions about Covid19 tweets during March 24, 2020 - April 8, 2020. First, we found high-level trends that could be inferred from keyword analysis. Second, we found that live White House Coronavirus Briefings led to spikes in Topic 18 ("potus"). Third, using UMAP, we found strong local "clustering" of topics representing PPE, healthcare workers, and government concerns. UMAP allowed for an improved understanding of distinct topics generated by LDA. Fourth, we used retweets to calculate the speed of retweeting. We found that the median retweeting time was 2.87 hours. Fifth, using directed graphs we plotted the networks of Covid19 retweeting communities from rapid to longer retweeting times. The density of each network increased over time as the number of nodes generally decreased. Lastly, we recommend trying all techniques indicated in Table I to gain an overall understanding of Covid19 Twitter data. While applying multiple methods is an exploratory strategy, there is no technical guarantee that the same combination of five methods analyzed in this paper will yield insights on a different time period of data. As a result, researchers should attempt multiple techniques and draw on existing literature.

Change point models were calculated using the ruptures Python package. We also applied an exponential weighted moving average using the pandas ewm function, with a span of 5 for the March 24, 2020 dataset and a span of 20 for the April 3 - 4, April 5 - 6, and April 8 - 9 datasets. Our parameters for binary segmentation included selecting the "l2" model to fit the points for Topic 18 and using 10 n_bkps (breakpoints).

REFERENCES
Crisis information distribution on twitter: a content analysis of tweets during hurricane sandy
Evaluating public response to the boston marathon bombing and other acts of terrorism through twitter
Twitter tsunami early warning network: a social network analysis of twitter information flows
Twitter earthquake detection: earthquake monitoring in a social world
A case study of the new york city 2012-2013 influenza season with daily geocoded twitter data from temporal and spatiotemporal perspectives
What can we learn about the ebola outbreak from tweets?
Covid-19: The first public coronavirus twitter dataset
Retweeting for covid-19: Consensus building, information sharing, dissent, and lockdown life
A first look at covid-19 information and misinformation sharing on twitter
Coronavirus on social media: Analyzing misinformation in twitter conversations
The covid-19 social media infodemic
Using twitter and web news mining to predict covid-19 outbreak
A large-scale covid-19 twitter chatter dataset for open scientific research-an international collaboration
An "infodemic": Leveraging high-volume twitter data to understand public sentiment for the covid-19 outbreak
Understanding the perception of covid-19 policies by mining a multilanguage twitter dataset
Coronavirus goes viral: Quantifying the covid-19 misinformation epidemic on twitter
How the world's collective attention is being paid to a pandemic: Covid-19 related 1-gram time series for 24 languages on twitter
An early look on the emergence of sinophobic behavior on web communities in the face of covid-19
Prevalence of low-credibility information on twitter during the covid-19 outbreak
Coronavis: A real-time covid-19 tweets analyzer
Exploring the space of topic coherence measures
Detection and analysis of 2016 us presidential election related rumors on twitter
Analysis of twitter users' sharing of official new york storm response messages
Latent dirichlet allocation
Full-text or abstract? examining topic coherence scores using latent dirichlet allocation
Selective review of offline change point detection methods
Optimal detection of changepoints with a linear computational cost
Get your mass gatherings or large community events ready
Trump says fda will fast-track treatments for novel coronavirus, but there are still months of research ahead
The White House. Presidential Memoranda
Using tf-idf to determine word relevance in document queries
Twitter trending topic classification
Predicting popular messages in twitter
Opinion mining and sentiment polarity on twitter and correlation between events and sentiment
Umap: Uniform manifold approximation and projection for dimension reduction
How umap works
Understanding umap
Unsupervised user stance detection on twitter
Event detection in colombian security twitter news using fine-grained latent topic analysis
Predicting the topical stance of media and popular twitter users
Bad news travel fast: A content-based analysis of interestingness on twitter
Waiting for a retweet: modeling waiting times in information propagation
Omg earthquake! can twitter improve earthquake response?
Introduction to tweet json - twitter developers
Predicting the times of retweeting in microblogs
Social media as amplification station: factors that influence the speed of online public response to health emergencies
Want to be retweeted? large scale analytics on factors impacting retweet in twitter network
An algorithm for drawing general undirected graphs
Virality prediction and community structure in social networks
Share of u.s. adults using social media, including facebook, is mostly unchanged since 2018
How twitter users compare to the general public
Retweets are trash
Characterizing diabetes, diet, exercise, and obesity comments on twitter
Comparing twitter and traditional media using topic models
Empirical study of topic modeling in twitter
Characterizing twitter discussions about hpv vaccines using topic modeling and community detection
Topic modeling in twitter: Aggregating tweets by conversations
Twitter-network topic model: A full bayesian treatment for social network and text modeling
Pandemics in the age of twitter: content analysis of tweets during the 2009 h1n1 outbreak
Tweeting about measles during stages of an outbreak: A semantic network approach to the framing of an emerging infectious disease
Software Framework for Topic Modelling with Large Corpora

LDA model parameters
[Figure legend: LDA topic labels 1 - Spanish, 2 - like.look.work, 3 - hospit.realli.patient, 4 - china.thank.lockdown, 5 - case.spread.slow, 6 - day.case.week, 7 - test.case.hosp, 8 - die.world.peopl, 9 - mask.face.wear, 10 - make.home.stay, 11 - hospit.nurs.le, 12 - case.death.new, 13 - mask.ppe.ventil, 14 - Portuguese, 15 - case.death.number, 16 - Italian, 17 - great.god.news, 18 - potus]

The authors would like to acknowledge John Larson from Booz Allen Hamilton for his support and review of this article.

Topic models were built with the gensim library [64, 65], which provides four different coherence metrics. We used the "c_v" metric for coherence developed by Roder [22]. Coherence metrics are used to rate the quality and human interpretability of the topics generated. All models were run with the default parameters using a LdaMulticore model with parallel computing across 20 workers, a default gamma threshold of 0.001, a chunksize of 10,000, 100 iterations, and 2 passes. Note: sudden decreases in the Figure 14 signal may be due to temporary internet disconnection.