key: cord-034814-flp6s0wd
authors: Lamsal, Rabindra
title: Design and analysis of a large-scale COVID-19 tweets dataset
date: 2020-11-06
journal: Appl Intell
DOI: 10.1007/s10489-020-02029-z
sha:
doc_id: 34814
cord_uid: flp6s0wd

As of July 17, 2020, more than thirteen million people have been diagnosed with the Novel Coronavirus (COVID-19), and half a million people have already lost their lives to this infectious disease. The World Health Organization declared the COVID-19 outbreak a pandemic on March 11, 2020. Since then, social media platforms have experienced an exponential rise in content related to the pandemic. In the past, Twitter data have proved indispensable in the extraction of situational awareness information relating to any crisis. This paper presents the COV19Tweets Dataset (Lamsal 2020a), a large-scale Twitter dataset with more than 310 million COVID-19 specific English language tweets and their sentiment scores. The dataset's geo version, the GeoCOV19Tweets Dataset (Lamsal 2020b), is also presented. The paper discusses the datasets' design in detail, and the tweets in both datasets are analyzed. The datasets are released publicly, anticipating that they will contribute to a better understanding of the spatial and temporal dimensions of the public discourse related to the ongoing pandemic. As per the stats, the datasets (Lamsal 2020a, 2020b) have been accessed over 74.5k times, collectively.

During a crisis, whether natural or man-made, people tend to spend relatively more time on social media than normal. As a crisis unfolds, social media platforms such as Facebook and Twitter become an active source of information [20] because these platforms break the news faster than official news channels and emergency response agencies [23]. During such events, people usually make informal conversations by sharing their safety status, querying about their loved ones' safety status, and reporting ground-level scenarios of the event [11, 20]. This process of continuous creation of conversations on such public platforms leads to the accumulation of a large amount of socially generated data, which can range from hundreds of thousands to several million messages. Such data can be (i) trimmed [38] or summarized [36, 40, 41, 50] and sent to the relevant department for further analysis, or (ii) used for sketching alert-level heat maps based on the location information contained within the tweet metadata or the tweet body. Similarly, Twitter data can also be used for identifying the flow of fake news [7, 8, 24, 49]. If misinformation and unverified rumors are identified before they spread across everyone's news feed, they can be flagged as spam or taken down. Further, in-depth textual analyses of Twitter data can help (i) discover how positively or negatively a geographical region is expressing itself about a crisis, and (ii) understand the dissemination processes of information throughout a crisis.

As of July 17, 2020, the number of Novel Coronavirus (COVID-19) cases across the world had reached more than thirteen million, and the death toll had crossed half a million [52]. States and countries worldwide are trying their best to contain the spread of the virus by initiating lockdowns and even curfews in some regions. As people are bound to work from home, social distancing has become the new normal.
With the increase in the number of cases, the pandemic's seriousness has made people more active in expressing themselves on social media. Multiple terms specific to the pandemic have been trending on social media for months now. Therefore, Twitter data can prove to be a valuable resource for researchers working in the thematic areas of Social Computing, including but not limited to sentiment analysis, topic modeling, behavioral analysis, fact-checking, and analytical visualization. Large-scale datasets are required to train machine learning models or perform any kind of analysis. The knowledge extracted from small datasets and region-specific datasets cannot be generalized because of limitations in the number of tweets and geographical coverage. Therefore, this paper introduces a large-scale COVID-19 specific English language tweets dataset, hereinafter termed the COV19Tweets Dataset. As of July 17, 2020, the dataset has more than 310 million tweets and is available at IEEE DataPort [30]. The dataset gets a new release every day. The dataset's geo version, the GeoCOV19Tweets Dataset, is also made available [29]. As per the stats reported by the IEEE platform, the datasets [29, 30] have been accessed over 74.5k times, collectively, worldwide.

The paper is organized as follows: Section 2 reviews related research works. Section 3 discusses the design methodology of the COV19Tweets Dataset and its geo version. Section 4 focuses on the hydration of tweet IDs for obtaining full tweet objects. Section 5 presents the analysis and discussions, and Section 6 concludes the paper.

Multiple other studies have also been collecting and sharing large-scale datasets to enable research in understanding the public discourse regarding COVID-19. Some of those publicly available datasets are multi-lingual [4, 13, 26, 39], and some are language-specific [3, 18]. Among those datasets, [4, 13, 39] have significantly large numbers of tweets in their collections. [39] provides more than 524 million multi-lingual tweets and also an English version as a secondary dataset. However, with the last update released on May 01, 2020, the dataset [39] does not seem to be getting frequent releases. [4] shares around 490 million multi-lingual tweets alongside the most frequently used terms. [13] provides 302 million multi-lingual tweets, with around 200 million tweets in the English language. However, neither of them [4, 13] has an English version release.

First, the volume of English tweets in multi-lingual datasets can become an issue. Twitter sets limits on the number of requests that can be made to its API. Its filtered stream endpoint has a rate limit of 450 requests/15-minutes per app, which is why the maximum number of tweets that can be fetched in 24 hours is just above 4 million. The language breakdown of multi-lingual datasets shows a higher prevalence of English, Spanish, Portuguese, French, and Indonesian languages [4, 13]. Therefore, multi-lingual datasets contain relatively fewer English tweets, unless multiple language-dedicated collections are run and merged later. Second, the size and multi-lingual nature of large-scale datasets can become a concern for researchers who need only the English tweets. For that purpose, the entire dataset must be hydrated and then filtered, which can take multiple weeks.
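Hydrated tweets arrive as JSON objects that carry a machine-detected "lang" attribute, so a post-hydration language filter can be written in a few lines. The sketch below is only illustrative and is not a procedure described in the paper; the file names are placeholders.

```python
import json

def filter_english(in_path, out_path):
    """Keep only English tweets ("lang" == "en") from a hydrated JSONL file."""
    with open(in_path, encoding="utf-8") as src, open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            tweet = json.loads(line)
            if tweet.get("lang") == "en":
                dst.write(line)

# Hypothetical usage:
# filter_english("hydrated_multilingual.jsonl", "english_only.jsonl")
```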
Recent studies have performed sentiment analysis on different samples of COVID-19 specific Twitter data. A study [1] analyzed 2.8 million COVID-19 specific tweets collected between February 2, 2020, and March 15, 2020, using frequencies of unigrams and bigrams, and performed sentiment analysis and topic modeling to identify Twitter users' interaction rate per topic. Another study [34] examined tweets collected between January 28, 2020, and April 9, 2020, to understand the worldwide trends of emotions (fear, anger, sadness, and joy) and the narratives underlying those emotions during the pandemic. A regional study [33] in Spain performed sentiment analysis on 106,261 conversations collected from various digital platforms, including Twitter and Instagram, during March and April 2020, to examine the impact of risk communications on emotions in Spanish society during the pandemic. In a similar regional study [42] concerning China and Italy, the effect of the COVID-19 lockdown on individuals' psychological states was studied using the conversations available on Weibo (for China) and Twitter (for Italy) by analyzing the posts published two weeks before and after the lockdown.

Multiple studies have performed social network analysis on Twitter data related to the COVID-19 pandemic. A case study [17] examined the propagation of the #FilmYourHospital hashtag using social network analysis techniques to understand whether the hashtag's virality was aided by bots or coordination among Twitter users. Another study [2] collected tweets containing the #5GCoronavirus hashtag between March 27, 2020, and April 4, 2020, and performed network analysis to understand the drivers of the 5G COVID-19 conspiracy theory and strategies to deal with such misinformation. A regional study [37] concerning South Korea used network analysis to investigate the information transmission networks and news-sharing behaviors regarding COVID-19 on Twitter. A similar study [27] investigated the relationship between social network size and incivility using the tweets originating from South Korea between February 10, 2020, and February 14, 2020, when the Korean government planned to bring its citizens back from Wuhan.

Twitter provides two API types: the search API [47] and the streaming API [45]. The Standard version of the search API can be used to search against a sample of tweets created in the last seven days, while the Premium and Enterprise versions allow developers to access tweets posted in the previous 30 days (30-day endpoint) or from as early as 2006 (Full-archive endpoint) [47]. The streaming API is used for accessing tweets from the real-time Twitter feed [45]. For this study, the streaming API has been in use since March 20, 2020. The original collection of tweets was started on January 27, 2020. The study commenced as an optimization design project to investigate how much social media data volume can be analyzed using minimal computing resources. Twitter's content redistribution policy restricts researchers from sharing tweet data other than tweet IDs, Direct Message IDs, and/or User IDs. The original collection did not retain tweet IDs; therefore, the tweets collected between January 27, 2020, and March 20, 2020, could not be released to the public. Hence, a fresh collection was started on March 20, 2020. Figure 1 shows the daily distribution of the tweets in the COV19Tweets Dataset. Between March 20, 2020, and April 17, 2020, four keywords, "corona," "#corona," "coronavirus," and "#coronavirus," were used for filtering the Twitter stream; as a result, the number of tweets captured per day in that period, on average, is around 893k.
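The paper does not name the client library used to consume the streaming API. As a hedged illustration, the sketch below uses tweepy (v3.x-style API, with placeholder credentials) to track the four initial keywords and restrict the stream to English tweets, mirroring the keyword and language filtering described in this section.

```python
import json
import tweepy  # v3.x-style API; the credential strings are placeholders

class CovidStreamListener(tweepy.StreamListener):
    def on_status(self, status):
        # Append each incoming tweet's full JSON to a local file.
        with open("covid_stream.jsonl", "a", encoding="utf-8") as out:
            out.write(json.dumps(status._json) + "\n")

    def on_error(self, status_code):
        # Returning False on HTTP 420 disconnects instead of retrying aggressively.
        if status_code == 420:
            return False

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
stream = tweepy.Stream(auth=auth, listener=CovidStreamListener())
stream.filter(track=["corona", "#corona", "coronavirus", "#coronavirus"],
              languages=["en"])  # the "en" condition on the language parameter
```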
However, a dedicated collection was started on a Linux-based, high-performance, CPU-optimized virtual machine (VM), with additional filtering keywords, after April 18, 2020. As of July 17, 2020, 46 keywords are being tracked for streaming the tweets. The number of keywords has been evolving continuously since the inception of this study. Table 1 gives an overview of the filtering keywords currently in use. As the pandemic grew, a lot of new keywords emerged. In this study, n-grams are analyzed every 2 hours using the most recent 0.5 million tweets to keep track of emerging keywords. Twitter's "worldwide trends" section is also monitored for the same purpose. On May 13, 2020, Twitter also published a list of 564 multi-lingual filtering keywords used in its COVID-19 stream endpoint [44].

The streaming API allows developers to use up to 400 keywords, 5,000 user IDs, and 25 location boxes for filtering the Twitter stream. The keywords are matched against the tokenized text of the body of the tweet. 46 keywords have been identified as filtering rules for extracting COVID-19 specific tweets. User ID filtering was not used. Also, location box filtering was avoided, as the intention was to create a global dataset. Twitter adds a BCP 47 language identifier based on the machine-detected language of the tweet body. Since the aim was to pull only the English tweets, the "en" condition was assigned to the language request parameter.

The collection of tweets is only a small portion of the dataset design. The other tasks include filtration of geo-tagged tweets and computation of a sentiment score for each captured tweet, all in real time. A dashboard is also required to visualize the information extracted from the collected tweets. A stable internet connection is needed to download the continuously incoming JSON. The computation of a sentiment score for each captured tweet requires the VM to have CPUs powerful enough to avoid a bottleneck scenario. All the information gathered to this point needs to be stored in a database, which necessitates a disk with excellent performance. Summing up, a cloud-based VM is required to automate all these tasks.
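The sentiment analysis tool behind the per-tweet scores is not named in this section. Purely as an illustration, the sketch below scores a tweet body with TextBlob's polarity measure, which returns a value in [-1, 1]; the choice of TextBlob is an assumption, not a statement of the paper's method.

```python
from textblob import TextBlob  # assumption: any polarity scorer in [-1, 1] could be substituted

def sentiment_score(text: str) -> float:
    """Return a polarity score in [-1, 1] for a tweet body."""
    return TextBlob(text).sentiment.polarity

# Hypothetical examples of scoring tweets as they arrive from the stream:
print(sentiment_score("Stay safe everyone, we will get through this together"))
print(sentiment_score("The news about the outbreak today is devastating"))
```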
In this study, the VM has to process thousands of tweets every minute. Also, the information extracted from the captured data is to be visualized on an active front-end server that requires plotting hundreds of thousands of data points. Therefore, a Linux-based, compute-optimized, hyper-threading VM is used for this study. Table 2 gives an overview of the VM considered in the dataset design. Figure 2a-e shows the resource utilization graphs for various performance parameters of the VM. A new collection starts between 1000 and 1100 hrs GMT+5:45, every day. Therefore, the CPU usage and average load increase gradually as more and more tweets get captured. The CPU usage graph, in Fig. 2a, shows that the highest percentage of CPU usage at any given time does not exceed 35%. A few Python scripts and libraries, and a web server, are actively running in the back-end. The majority of the tasks are CPU intensive; therefore, memory usage does not seem to exceed 35%, as shown in Fig. 2b. Past data show that memory usage exceeds 35% only when the web traffic on the visualization dashboard increases; otherwise, it is usually constant. The Load average graph, in Fig. 2c, shows that the processors do not operate over capacity. The three colored lines, magenta, green, and purple, represent the 1-minute, 5-minute, and 15-minute load averages. The Disk I/O graph, in Fig. 2d, shows the read and write activity of the VM. Saving information on thousands of tweets every minute triggers continuous writing activity on the disk. The Disk I/O graph shows that the write speed is around 3.5 MB/s, and the read speed is insignificant. The Bandwidth usage graph, in Fig. 2e, reveals the public bandwidth usage pattern. On average, the VM is receiving a continuous data stream at 3 Mb/s. The VM connects with the backup server's database to download the most recent half a million tweets for extracting a list of unigrams and bigrams. A new list is created every 2 hours; hence the 12 peaks in the Bandwidth usage graph.

Geotagging is the process of placing location information in a tweet. When a user permits Twitter to access his/her location via an embedded Global Positioning System (GPS), the geo-coordinates are added to the tweet's location metadata. This metadata gives access to various Geo Objects [46] such as "place type": "city", "name": "Manhattan", "full name": "Manhattan, NY", "country code": "US", "country": "United States", and the bounding box (polygon) of coordinates that encloses the place. Previous studies have shown that only a small fraction of tweets are geo-tagged. A study [5], conducted in 2016-17 in the city of Southampton, used local and spatial data to show that around 36k tweets out of 5 million had "point" geolocation data. Similarly, a study [9] on online health information found that only 2.02% of tweets were geo-tagged. Further, a multilingual COVID-19 global tweets dataset from CrisisNLP [39] reported having around 0.072% geo-tagged tweets. In this study, the tweets received from the Twitter stream are filtered by applying a condition on the ["coordinates"] Twitter Object to design the GeoCOV19Tweets Dataset. Algorithm 1 shows the pseudo-code for filtering the geo-tagged tweets. Figure 3 shows the daily distribution of tweets present in the GeoCOV19Tweets Dataset. Out of 310 million tweets, 141k tweets (0.045%) were found to be geo-tagged. If only the collection after April 18, 2020, is considered, 118k (0.043%) tweets are geo-tagged.

Twitter's content redistribution policy restricts the sharing of tweet information other than tweet IDs, Direct Message IDs, and/or User IDs. Twitter wants researchers to pull fresh data from its platform, because users might delete their tweets or make their profiles private. Therefore, complying with Twitter's content redistribution policy, only the tweet IDs are released. The dataset is updated every day with the addition of newly collected tweet IDs.

First, Twitter allows developers to stream around 1% of all the new public tweets as they happen, via its streaming API. Therefore, the dataset is a sample of the comprehensive COVID-19 tweets collection Twitter has on its servers. Second, there is a known gap in the dataset: due to technical reasons, the tweets collected between March 29, 2020, 1605 hrs GMT+5:45, and March 30, 2020, 1400 hrs GMT+5:45 could not be retrieved. Third, analyzing tweets in a single language increases the risk of missing essential information available in tweets created in other languages [15]. Therefore, the dataset is primarily applicable for understanding the COVID-19 public discourse originating from native English-speaking nations.

Twitter does not allow the JSON of the tweets to be shared with third parties; the tweet IDs provided in the COV19Tweets Dataset must be hydrated to get the original JSON. This process of extracting the original JSON from the tweet IDs is known as the hydration of tweet IDs. Multiple libraries/applications, such as twarc (Python library) and Hydrator (desktop application), have been developed for this purpose. Using the Hydrator application is relatively straightforward; however, working with the twarc library requires basic programming knowledge. Algorithm 2 is the pseudo-code for using twarc to hydrate a list of tweet IDs.
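Algorithm 2 is given as pseudo-code in the paper; a minimal runnable equivalent with the twarc (v1) library might look as follows, assuming twarc has been configured with valid API credentials and that tweet_ids.txt (a placeholder name) holds one tweet ID per line.

```python
import json
from twarc import Twarc  # twarc v1; the four credential strings below are placeholders

t = Twarc("CONSUMER_KEY", "CONSUMER_SECRET", "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")

with open("tweet_ids.txt") as id_file, open("hydrated.jsonl", "w", encoding="utf-8") as out:
    # twarc batches the IDs against Twitter's statuses/lookup endpoint and yields
    # one full tweet object (dict) per ID that is still publicly available.
    for tweet in t.hydrate(line.strip() for line in id_file):
        out.write(json.dumps(tweet) + "\n")
```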
The tweet data dictionary provides access to a long list of root-level attributes. The root-level attributes, such as user, coordinates, place, entities, etc., further provide multiple child-level attributes. When hydrated, the tweet IDs produce JSON that contains all the root-level and child-level attributes with their values. Twitter's documentation [48] can be referred to for more information on the tweet data dictionary.

The COV19Tweets Dataset has global coverage, and it can also be used to extract tweets originating from a particular region. An implementable solution is to check whether a tweet is geo-tagged or has a place boundary defined in its data dictionary. If none of these fields are available, the address given on the user's profile can be used. However, Twitter does not validate the profile address field for authentic geo-information. Even addresses such as "Milky Way Galaxy," "Earth," "Land," "My Dream," etc. are accepted entries. A user can also create a tweet from a particular place while having an address of a different one. Therefore, considering the user's profile address might not be an effective solution when dealing with location information. Algorithm 3 is the pseudo-code for extracting tweets originating from a region of interest.

Tweets received from the Twitter stream can be analyzed to make multiple inferences regarding an event. The tweets collected between April 24, 2020, and July 17, 2020, were considered to generate an overall COVID-19 sentiment trend graph. The sampling time is 10 minutes, which means a combined sentiment score is computed for the tweets captured in every 10-minute window. Figure 4 shows the COVID-19 sentiment trend based on public discourse related to the pandemic. In Fig. 4, there are multiple drops in the average sentiment over the analysis period. In particular, there are fourteen drops where the scores are negative. Among those fourteen drops, seven of the significant drops were studied. The tweets collected on those dates were analyzed to see which particular terms (unigrams and bigrams) were trending. Table 3 lists the most commonly used terms during those seven drops. The tweets are pre-processed before extracting the unigrams and bigrams. The pre-processing steps include lower-casing the text and removing noisy data such as retweet information, URLs, special characters, and stop words [15]. It should be noted that the removal of stop words from the tweet body results in a different set of bigrams. Therefore, the bigrams listed in Table 3 should not be considered the sole representatives of the context in which the terms might have been used.
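A minimal sketch of this pre-processing and n-gram counting step is shown below. The exact tokenizer and stop-word list used in the study are not specified, so NLTK's English stop-word list is assumed here.

```python
import re
from collections import Counter

from nltk.corpus import stopwords  # requires: nltk.download("stopwords")
from nltk.util import bigrams

STOP_WORDS = set(stopwords.words("english"))

def preprocess(text):
    """Lower-case a tweet body and strip retweet markers, URLs, special characters, and stop words."""
    text = text.lower()
    text = re.sub(r"rt\s+@\w+:?", " ", text)   # retweet information
    text = re.sub(r"https?://\S+", " ", text)  # URLs
    text = re.sub(r"[^a-z#\s]", " ", text)     # special characters (the hashtag symbol is kept)
    return [token for token in text.split() if token not in STOP_WORDS]

def top_ngrams(tweet_texts, n=20):
    """Return the n most common unigrams and bigrams across a collection of tweet bodies."""
    unigram_counts, bigram_counts = Counter(), Counter()
    for text in tweet_texts:
        tokens = preprocess(text)
        unigram_counts.update(tokens)
        bigram_counts.update(bigrams(tokens))
    return unigram_counts.most_common(n), bigram_counts.most_common(n)
```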
Next, the GeoCOV19Tweets Dataset was used for performing network analysis to extract the underlying relationship between countries and hashtags. Only the hashtags that appear more than ten times in the entire dataset were considered. The dataset yielded 303,488 [country, hashtag] relations from 190 countries and territories, and 5,055 unique hashtags. There were 32,474 unique relations when weighted. Finally, the resulting relations were used for generating a network graph, as shown in Fig. 5. The graph reveals interesting facts about the dataset. The network has a dense block of nodes forming a sphere and multiple sparsely populated nodes connected to the nodes inside the sphere through some relations. The nodes that are outside the sphere are country-specific hashtags. For illustration, Fig. 6a-d shows the country-specific hashtags for New Zealand, Qatar, Venezuela, and Argentina. The nodes of these countries are outside the sphere because of outliers in their respective sets of hashtags. However, these countries do have connections with the popular hashtags present inside the sphere. The majority of the hashtags in Fig. 6a-d do not relate directly to the pandemic. Therefore, these hashtags can be considered outliers while designing a set of hashtags for the pandemic.

The network graph, shown in Fig. 5, is further expanded by a scale factor, as shown in Fig. 7a and b. The network graphs are colored based on the communities detected by a modularity algorithm [6, 28]. The algorithm detected 12 communities in the GeoCOV19Tweets Dataset. The weight='weight' and resolution=1.0 parameters were used for the experimentation. Table 4 gives an overview of the 12 communities identified in the GeoCOV19Tweets Dataset. Country names are represented by their ISO codes. Community 0 constitutes 55.56% of the nodes in the network. The number of members in Community 0 was relatively high; therefore, the ISO column for that community lists only the countries that have associations with at least 25 different hashtags. For the remaining communities, all the members are listed. Communities are formed based on the usage of similar hashtags. The United States has associations with the highest number of different hashtags; it is therefore justified to find most countries in the same group as the United States. However, other native English-speaking nations such as the United Kingdom and Canada seem to be forming their own communities. This formation of separate communities is because of the differences in their sets of hashtags. For example, the United Kingdom appears to be mostly using "lockdown," "lockdown2020," "isolation," "selfisolation," etc. as hashtags, but the presence of these hashtags in the hashtag set of the United States is limited. The ISO codes for each community in Table 4 are sorted in descending order; the country associated with the highest number of unique hashtags is mentioned first.

Next, a set of popular hashtags and their communities are identified. Table 5 lists the top 40 commonly used hashtags, their weighted in-degree, and their respective communities. The community for a hashtag in Table 5 means that the hashtag appeared the most in that particular community. The [country, hashtag] relations can also be used to trace back a hashtag's usage pattern. The hashtags "flattenthecurve," "itsbetteroutside," "quarantine," "socialdistancing," etc. seem to have been first used in the tweets originating from the United States. In the fourth week of March 2020, countries such as the United Kingdom, India, and South Africa experienced their first phase of lockdown. For the same reason, there is an unusual increase in the usage of "lockdown"-related hashtags during that period in those countries. It should be noted that a thorough tracing back of hashtag usage would require analysis of tweets collected since December 2019, when the "first case" of COVID-19 was identified [19].
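The weight='weight' and resolution=1.0 parameters reported above match the python-louvain implementation of the Louvain modularity method [6, 28]; the sketch below builds the weighted country-hashtag graph and detects communities with that package. The tooling choice is an assumption (the same analysis could have been run in, for example, Gephi), and the (country, hashtag, weight) triples are assumed to have been extracted beforehand.

```python
import networkx as nx
import community as community_louvain  # the python-louvain package

def detect_communities(relations):
    """relations: iterable of (country, hashtag, weight) triples."""
    graph = nx.Graph()
    for country, hashtag, weight in relations:
        graph.add_edge(country, "#" + hashtag, weight=weight)
    # Louvain modularity maximization with the parameters reported in the paper
    return community_louvain.best_partition(graph, weight="weight", resolution=1.0)

# Hypothetical usage: maps every country and hashtag node to a community id
# partition = detect_communities([("US", "covid19", 1200), ("GB", "lockdown", 640)])
```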
As of July 17, 2020, the number of tweets in the GeoCOV19Tweets Dataset is 141,260. The dataset was hydrated to create a country-level distribution of the geo-tagged tweets, as shown in Table 6. The United States dominates the distribution with the highest number of geo-tagged tweets; the remaining countries and territories are listed in Table 6 by their ISO codes. Creating such a distribution requires converting the geo-coordinates to a human-readable address, i.e., reverse geocoding.

Next, the geo-tagged tweets were visualized on a map based on their sentiment scores. Figures 8 and 9 are the sentiment maps generated based on the location information extracted from the tweets collected between March 20, 2020, and July 17, 2020. The world view of the COVID-19 sentiment map, in Fig. 8, shows that the majority of the tweets originate from North America, Europe, and the Indian subcontinent. Interestingly, some tweets are also seen to be originating from countries where the government has banned Twitter. Around 0.26% of the geo-tagged tweets have come from the People's Republic of China; North Korea does appear on the list, but the number is insignificant. When a region-specific sentiment map, as shown in Fig. 9, is generated, numerous clusters of geo-location points are observed. Such clusters can give the authorities a bird's-eye view for creating first-hand sketches of tentative locations from which to start responding to a crisis. For example, the location information extracted from the tweets classified into the "Infrastructure and utilities damage" category can help generate near real-time convex closures of the crisis-hit area. Such convex closures can prove to be beneficial for the first responders (army, police, rescue teams, first-aid volunteers, etc.) to come up with actionable plans.
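As an illustration of the convex closure idea, the sketch below computes the enclosing polygon of a set of geo-tagged points with SciPy. The upstream classification of tweets into a damage-related category is assumed to have already happened, and the coordinates in the usage example are hypothetical.

```python
import numpy as np
from scipy.spatial import ConvexHull

def crisis_convex_closure(points):
    """points: (longitude, latitude) pairs taken from geo-tagged tweets that were
    classified into a damage-related category (assumed upstream step).
    Returns the hull vertices in order, i.e., the polygon enclosing the area."""
    pts = np.asarray(points, dtype=float)
    hull = ConvexHull(pts)  # requires at least three non-collinear points
    return pts[hull.vertices]

# Hypothetical usage with made-up coordinates:
# polygon = crisis_convex_closure([(-74.00, 40.70), (-73.90, 40.80),
#                                  (-74.10, 40.75), (-73.95, 40.65)])
```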
In general, the inferences made from geo-specific data can help (i) understand knowledge gaps, (ii) perform surveillance for prioritizing regions, and (iii) recognize the urgent needs of a population [39]. Understanding the knowledge gaps involves identifying the crisis event-related queries posted by the public on social media. The queries can be anything: a rumor, or even some casual inquiry. Machine learning models can be trained on large-scale tweet corpora for classifying the tweets into multiple informational categories, including a separate class for "queries." Even after the automatic classification, each category still contains hundreds of thousands of tweet conversations, which require further in-depth analysis. Those classified tweets can be summarized to extract a concise and important set of conversations. Recent studies have used extractive summarization [41, 50], abstractive summarization [36], and a hybrid approach [40] for summarizing microblogging streams. If the queries are identified and duly answered, the public's tendency to panic can be eased to some extent. Further, geo-specific data can assist in surveillance. Social media messages can be monitored actively to identify those that report a disease's signs and symptoms. If such messages are detected early, an efficient response can be targeted to that particular region. The authorities and decision-makers can come up with effective and actionable plans to minimize possible future severity. Furthermore, social media messages can also be analyzed to understand the urgent needs of a population. The requirements might include anything related to everyday essentials (shelter, food, water) and health services (medicines, checkups).

The above-discussed research implications fall under the crisis response phase of the disaster management cycle. However, other sub-areas in the Social Computing domain require computational systems to also understand the psychology and sociology of the affected population/region as part of the crisis recovery phase. The design of such computational systems requires a humongous amount of data for modeling intelligence within them to track the public discourse relating to any event. Therefore, a large-scale Twitter dataset for the COVID-19 pandemic was presented in this paper, hoping that the dataset and its geo version would help researchers working in the Social Computing domain to better understand the COVID-19 discourse.

In this paper, a large-scale global Twitter dataset, the COV19Tweets Dataset, is presented. The dataset contains more than 310 million English language tweets, originating from 204 different countries and territories worldwide, collected between March 20, 2020, and July 17, 2020. Earlier studies have shown that geo-specific social media conversations aid in extracting situational information related to an ongoing crisis event. Therefore, the geo-tagged tweets in the COV19Tweets Dataset are filtered to create its geo version, the GeoCOV19Tweets Dataset. Out of 310 million tweets, it was observed that only 141k tweets (0.045%) had a "point" location in their metadata. The United States dominates the country-level distribution of the geo-tagged tweets and is followed by the United Kingdom, Canada, and India. Designing a large-scale Twitter dataset requires a reliable VM to fully automate the associated tasks. Five performance metrics (specific to CPU, memory, average load, disk I/O, and bandwidth) were analyzed to see how the VM performed over a 24-hour period. The paper then discussed techniques to hydrate tweet IDs and to filter tweets originating from a region of interest. Next, the COV19Tweets Dataset and its geo version were used for sentiment analysis and network analysis. The tweets collected between April 24, 2020, and July 17, 2020, were considered to generate an overall COVID-19 sentiment trend graph. Based on the trend graph, seven significant drops in the average sentiment over the analysis period were studied, and the trending unigrams and bigrams on those particular dates were identified. Further, a detailed social network analysis was done on the GeoCOV19Tweets Dataset using [country, hashtag] relations. The analysis confirmed the presence of 12 different communities within the dataset; the formation of communities was based on the usage of similar hashtags. Also, a set of popular hashtags and their communities were identified. Furthermore, the GeoCOV19Tweets Dataset was used for generating world and region-specific sentiment-based maps, and the research implications of using geo-specific data were briefly outlined.
References

Top concerns of tweeters during the covid-19 pandemic: infoveillance study
Covid-19 and the 5g conspiracy theory: social network analysis of twitter data
Large arabic twitter dataset on covid-19
A large-scale covid-19 twitter chatter dataset for open scientific research - an international collaboration
Assessing twitter geocoding resolution
Fast unfolding of communities in large networks
A survey on fake news and rumour detection techniques
Influence of fake news in twitter during the 2016 us presidential election
"Right time, right place" health communication on twitter: value and accuracy of location information
Crowd sourcing disaster management: the complex nature of twitter usage in padang indonesia
Big crisis data: social media in disasters and time-critical situations
Tsunami early warnings via twitter in government: net-savvy citizens' coproduction of time-critical public information services
Tracking social media discourse about the covid-19 pandemic: development of a public coronavirus twitter data set
A microblogging-based approach to terrorism informatics: exploration and chronicling civilian sentiment and response to terrorism events via twitter
Multilingual sentiment analysis: state of the art and independent comparison of techniques
Omg earthquake! can twitter improve earthquake response
Going viral: how a single tweet spawned a covid-19 conspiracy theory on twitter
Arcov-19: the first arabic covid-19 twitter dataset with propagation networks
Clinical features of patients infected with 2019 novel coronavirus in wuhan, china
Processing social media messages in mass emergency: a survey
Aidr: artificial intelligence for disaster response
Twitter as a lifeline: human-annotated twitter corpora for nlp of crisis-related messages
Using ai and social media multimodal content for disaster response and management: opportunities, challenges, and future directions
Detection of spam-posting accounts on twitter
Prediction and characterization of high-activity events in social media triggered by real-world news
Effects of social grooming on incivility in covid-19
Laplacian dynamics and multiscale modular structure in networks
Lamsal R (2020b) Coronavirus (covid-19) geo-tagged tweets dataset
Lamsal R (2020a) Coronavirus (covid-19) tweets dataset
Twitter based disaster response using recurrent nets
Using tweets to support disaster planning, warning and response
Sentiment analysis and emotion understanding during the covid-19 pandemic in spain and its impact on digital ecosystems
Global sentiments surrounding the covid-19 pandemic on twitter: analysis of twitter trends
Robust classification of crisis-related data on social networks using convolutional neural networks
Efficient online summarization of microblogging streams
Conversations and medical news frames on twitter: infodemiological study on covid-19 in south korea
What kind of #conversation is twitter? Mining #psycholinguistic cues for emergency coordination
Geocov19: a dataset of hundreds of millions of multilingual covid-19 tweets with location information
Summarizing situational tweets in crisis scenarios: an extractive-abstractive approach
Sumblr: continuous summarization of evolving tweet streams
Examining the impact of covid-19 lockdown in wuhan and lombardy: a psycholinguistic analysis on weibo and twitter
Communicating on twitter during a disaster: an analysis of tweets during typhoon haiyan in the philippines
Twitter: Covid-19 stream
Twitter: Filter realtime tweets
Twitter: Standard search api
Twitter: Twitter object
Rumor response, debunking response, and decision makings of misinformed twitter users during disasters
Spatial, temporal, and content analysis of twitter for wildfire hazards
On summarization and timeline generation for evolutionary tweet streams
Automatic identification of eyewitness messages on twitter during disasters
Mining twitter data for improved understanding of disaster resilience

The author is grateful to DigitalOcean and Google Cloud for funding the computing resources required for this study. The author declares that there is no conflict of interest.

Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.