key: cord-131667-zl5txjqx
authors: Liu, Junhua; Singhal, Trisha; Blessing, Lucienne T.M.; Wood, Kristin L.; Lim, Kwan Hui
title: EPIC30M: An Epidemics Corpus Of Over 30 Million Relevant Tweets
date: 2020-06-09
doc_id: 131667
cord_uid: zl5txjqx

Since the start of COVID-19, several relevant corpora from various sources have been presented in the literature, each containing millions of data points. While these corpora are valuable in supporting many analyses of this specific pandemic, researchers require additional benchmark corpora that cover other epidemics to facilitate cross-epidemic pattern recognition and trend analysis tasks. During our other efforts on COVID-19 related work, we discovered very few disease-related corpora in the literature that are sizable and rich enough to support such cross-epidemic analysis tasks. In this paper, we present EPIC30M, a large-scale epidemic corpus that contains 30 million micro-blog posts, i.e., tweets crawled from Twitter, from 2006 to 2020. EPIC30M contains a subset of 26.2 million tweets related to three general diseases, namely Ebola, Cholera and Swine Flu, and another subset of 4.7 million tweets on six global epidemic outbreaks, including 2009 H1N1 Swine Flu, 2010 Haiti Cholera, 2012 Middle-East Respiratory Syndrome (MERS), 2013 West African Ebola, 2016 Yemen Cholera and 2018 Kivu Ebola. Furthermore, we explore and discuss the properties of the corpus with statistics of key terms and hashtags and trend analysis for each subset. Finally, we demonstrate the value and impact that EPIC30M could create through a discussion of multiple use cases of cross-epidemic research topics that have attracted growing interest in recent years. These use cases span multiple research areas, such as epidemiological modeling, pattern recognition, natural language understanding and economic modeling.
The Coronavirus disease has spread around the globe since the beginning of 2020, affecting around 200 countries and everyone's life. To date, the highly contagious disease has caused over 6.6 million confirmed and suspected cases and 389 thousand deaths. In times of crisis caused by epidemics, we realize the necessity of rigorous arrangements, quick responses, and credible and updated information during the early phases of such epidemics [40]. Social media platforms, such as Twitter, play an important role in communicating the latest epidemic status, for example through timely announcements of public policies. Facilitating the posting of over half a billion tweets daily [31], Twitter has emerged as a hub for information exchange among individuals, companies, and governments, especially in times of epidemics, when economies are placed in hibernation mode and citizens are kept isolated at home. Such platforms help tremendously to raise situational awareness and provide actionable information [19]. Recently, numerous COVID-19 related corpora from various sources have been presented that contain millions of data points [11, 33]. While these corpora are valuable in supporting many analyses of this specific pandemic, researchers require additional benchmark corpora that cover other epidemics to facilitate cross-epidemic pattern recognition and trend analysis tasks. During our other efforts on COVID-19 related work, we discovered very few disease-related corpora in the literature that are sizable and rich enough to support such cross-epidemic analysis tasks.

We conduct several exploratory analyses to study the properties of the corpus, such as word cloud visualization and time-series trend analysis. These analyses yield several interesting findings. For instance, we find that a large number of topics are related to specific locations; cross-epidemic topics, i.e., those that involve more than one epidemic-related hashtag, appear frequently in several classes; and several hashtags related to non-epidemic events, such as warfare, rank relatively high in the list.
Furthermore, a time-series analysis suggests that some of the epidemics, namely 2010 Haiti Cholera and 2018 Kivu Ebola, show a surge in tweets before the respective start dates of the outbreaks, which signifies the importance of leveraging social media for early signal detection. We also observe that an epidemic outbreak not only leads to rapid discussion of the outbreak itself, but also triggers exchanges about other diseases. EPIC30M fills a gap in the literature, where epidemic-related corpora are either unavailable or not sizable enough to support cross-epidemic analysis tasks. Through discussing various potential use cases, we anticipate that EPIC30M will bring great value and impact to various fast-growing computer science communities, especially natural language processing, data science and computational social science. We also foresee that EPIC30M can contribute in part to cross-disciplinary research topics, such as economic modeling and humanity studies. As EPIC30M includes tweets posted throughout the course of each outbreak available in the corpus, we expect that it may serve as a timeless cross-epidemic benchmark.

Footnote 2: As of 20 Jun 2020.

In this section, we discuss the existing Twitter corpora for several domains, such as COVID-19, disasters, and others. These corpora have attracted considerable interest and enabled a large body of research in their respective domains; we believe EPIC30M can achieve a similar level of impact in the epidemic domain.

Corpora of COVID-19. Recently, the COVID-19 pandemic spread across the globe and generated enormous economic and social impact. Throughout the pandemic, numerous related corpora have been released. For instance, Chen et al. [11] released a multi-lingual corpus of 50 million tweets, including tweet IDs and their timestamps, across over 10 languages. Similarly, Banda et al.
[7] presented a large-scale COVID-19 chatter corpus that consists of over 152 million tweets with retweets and another version of 30 million tweets without retweets.

English corpora of disasters. Several disaster-related corpora presented in the literature have been used in multiple works. CrisisLex [38] consists of 60 thousand tweets related to six natural disaster events, queried based on relevant keywords and locations during the crisis periods. The tweets are labelled as relevant or not relevant through crowdsourcing. Olteanu et al. [39] conduct a comprehensive study of tweets to analyze 26 crisis events from 2012 to 2013. The paper analyzes about 25 thousand tweets based on crisis and content dimensions. The crisis dimensions include hazard type (natural or human-induced), temporal development (instantaneous or progressive), and geographic spread (focalized or diffused). The content dimensions are represented by several features such as informativeness, types and sources. Imran et al. [20] release a collection of over 52 million tweets related to 19 natural crisis events, of which 50 thousand are human-annotated. The work also presents pre-trained Word2Vec embeddings with a set of Out-Of-Vocabulary (OOV) words and their normalizations, contributing to spreading situational awareness and improving response time for humanitarian efforts during crises. Phillips [42] releases a set of 7 million tweets related to Hurricane Harvey. Littman [28] publishes a corpus containing tweet IDs of over 35 million tweets related to Hurricanes Irma and Harvey.

Non-English corpora of disasters. Numerous non-English crisis corpora are also found in the literature. For instance, Cresci et al. [13] released a corpus of 5.6 thousand Italian tweets from 2009 to 2014 during four different disasters. The features include informativeness (damage or no damage) and relevance (relevant or not relevant).
Similarly, Alharbi and Lee [2] compiled a set of 4 thousand Arabic tweets, manually labelled for relatedness and information type, for four high-risk flood events in 2018. Alam et al. [1] released a Twitter corpus composed of 16 thousand manually annotated tweets and 18 thousand images collected during seven natural disasters (earthquakes, hurricanes, wildfires, and floods) that occurred in 2017. The features of the datasets include informativeness, humanitarian categories, and damage severity categories.

Other Twitter Corpora. Apart from crisis-related corpora, several Twitter datasets are used for analysis related to politics, news, abusive behaviour and misinformation, trolls, movie ratings, weather forecasting, etc. For instance, Fraisier et al. [15] propose a large and complex dataset with over 22 thousand operative Twitter profiles during the 2017 French presidential campaign with their corresponding tweets, tweet IDs, retweets, and mentions. The profiles were annotated manually based on their political party affiliation, their nature, and gender. We also find Twitter corpora related to other domains, such as politics [9, 50], cyberbullying [14], and misinformation [17, 43, 52].

This section describes the data collection process for crawling EPIC30M.

Epidemic Outbreaks. EPIC30M includes six epidemic outbreaks in the 21st century, recorded by the World Health Organization (see Table 1). We intentionally exclude the recent COVID-19 pandemic to avoid producing redundant work, as numerous COVID-19 datasets with multi-million data points have already been released by different parties.

Search Queries. For each outbreak, we start with a large collection of keywords used as the search queries, with the aim of retrieving the most relevant tweets from Twitter. We use a combination of keywords for each outbreak, as listed in Table 1, to fetch the related tweets.
Two types of keywords are used, namely (a) general disease-related terms, such as ebola, cholera and swine flu; and (b) specific outbreak-related terms that combine a location and a disease, such as africa ebola and yemen cholera.

General Epidemics. Besides the outbreaks set, we extend EPIC30M by including a subset of three general diseases, namely Cholera, Ebola and Swine Flu. The tweets related to these diseases are crawled from their respective first occurrence until 15th May 2020. We expect the general epidemic subset to act as additional benchmarks and to contribute substantially to various research topics, such as pattern recognition and trend analysis.

To gain a general overview of EPIC30M, we first conduct hashtag analysis for each epidemic and plot the results as word clouds on a 3-by-3 grid, as shown in Figure 1. The first row (Fig. 1a) represents the three general diseases, whereas the second and third rows (Fig. 1b) represent the six outbreak classes in chronological order. Each word cloud contains the top 100 hashtags in its respective class, where the sizes represent their frequencies. Through observation, we identify several interesting phenomena: (1) key terms, such as pandemic, epidemic, healthcare, vaccine, disease and sanitation, provide semantic indication of the crises and may serve as cross-epidemic indicators; (2) location-related hashtags, such as #Yemen, #Haiti and #SierraLeone, appear in all classes and make up the majority of the key words, which suggests that location is the feature of greatest concern; (3) several classes include hashtags of other diseases, e.g., #COVID19 in the 2016_Yemen_Cholera class and #Malaria in the Cholera class, which implies that discussions on cross-epidemic matters are popular; and (4) some hashtags refer to non-epidemic events, such as #5YearsOfWarOnYemen and #earthquake appearing in the 2016_Yemen_Cholera and 2010_Haiti_Cholera sets respectively.
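The hashtag ranking behind each word cloud can be illustrated with a minimal sketch. The paper does not state how hashtags were extracted or normalized, so the regular expression, the case-folding, the function name `top_hashtags` and the sample tweets below are all assumptions made for illustration only.

```python
import re
from collections import Counter

# Assumption: a hashtag is '#' followed by word characters. The paper does
# not specify its extraction rule, so this pattern is illustrative only.
HASHTAG_RE = re.compile(r"#(\w+)")


def top_hashtags(tweets, k=100):
    """Return the k most frequent hashtags with their counts.

    Hashtags are lower-cased before counting (an assumption), so that
    '#Yemen' and '#yemen' are treated as the same tag.
    """
    counts = Counter()
    for text in tweets:
        counts.update(tag.lower() for tag in HASHTAG_RE.findall(text))
    return counts.most_common(k)


# Hypothetical sample tweets, not taken from the corpus.
sample = [
    "Cholera cases rising in #Yemen #cholera",
    "Aid convoys reach Sanaa #Yemen",
    "New vaccine trial announced #Ebola",
]

# Print the two most frequent hashtags in the sample.
print(top_hashtags(sample, k=2))
```

Running the same function once per class (for instance, once on the 2016_Yemen_Cholera tweets) would give the frequency table from which a word cloud of the top 100 hashtags could be drawn.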
Subsequently, we conduct trend analysis in an attempt to identify time-variant patterns in the corpus. For the three general classes (Fig. 2a), we plot each class as a line chart, where the x-axis represents time in years and the y-axis represents the corresponding number of tweets. For the six outbreak classes (Fig. 2b), the x-axis of each line chart uses the number of days offset from the start date of the outbreak, whereas the y-axis represents the number of tweets normalized to between 0 and 1. Through the time-series line plots, we observe that some of the epidemics, namely 2010 Haiti Cholera and 2018 Kivu Ebola, show a surge in tweets before the respective official start dates of the outbreaks, which signifies the importance of leveraging social media for early signal detection. We also observe that an epidemic outbreak not only leads to rapid discussion of the outbreak itself, but also triggers exchanges about other diseases. Finally, the time-series analyses also show clear dynamic properties or trends, with exponential increases (shocks or spikes) in tweets and a temporal persistence after an initial shock [16]. Other dynamic properties that may be of interest include local cycles and trends. Such dynamic effects, when paired with semantic content (such as healthcare-related terms), may provide potential indicators of the onset of a crisis.

Footnote 4: According to the World Health Organization, https://www.who.int

Twitter has an enormous volume and frequency of information exchange, i.e., over half a billion tweets posted daily, and such rich data can expose information on epidemic events when analyzed in depth. In this section, we demonstrate the value and impact that EPIC30M could create by discussing multiple use cases of cross-epidemic research topics that have attracted growing interest in recent years.
These use cases span multiple research areas, such as epidemiological modeling, pattern recognition, natural language processing and economic modeling. We claim that EPIC30M fills a gap in the literature, where very few disease-related corpora are sizable and rich enough to support such cross-epidemic analysis tasks. EPIC30M supplies benchmarks of multiple epidemics to facilitate a wide range of cross-epidemic research topics.

Epidemiological Modeling. Epidemiological modeling provides various potential applications for understanding Twitter dynamics during and after outbreaks, such as compartmental modeling [3] and misinformation detection [51]. For example, Jin et al. [21] use Twitter data to detect false rumors and a susceptible-exposed-infected-skeptic (SEIZ) model to group users into four compartments. Skaza and Blais [49] use susceptible-infectious-recovered (SIR) epidemic models on Twitter hashtags to compute the infectiousness of a trending topic. In the recent event of COVID-19, these models have been applied repeatedly to answer specific questions, such as the proposal of Chen et al. [11] to use a time-dependent SIR model to estimate the total number of infected persons and the outcomes, i.e., recovery or death.

Trend Analysis and Pattern Recognition. Extensive prior works leverage social media data to perform trend analysis and pattern recognition tasks. For instance, Kostkova et al. [23] study the 2009 swine flu outbreak and demonstrate the potential of Twitter to act as an early warning system, with a lead time of up to two or three weeks. Similarly, Joshi et al. [22] produce alerts of the West Africa Ebola epidemic three months earlier than the official announcement. Since early detection and warning systems for crisis events may reduce overall damage and negative impacts [31], EPIC30M provides high-volume and timely information that facilitates trend analysis and pattern recognition tasks for epidemic events.

Sentiment and Opinion Mining.
The observation of social sentiments and public opinions plays an important part in benchmarking the effect of public policy amendments or new initiatives. Several prior works leverage sentiment analysis and opinion mining to extract the contextual meaning of social media content. For instance, Beigi et al. [8] provide an overview of the relationship among social media, disaster relief and situational awareness in times of crisis, and Neppalli et al. [36] perform location-based sentiment analysis on tweets about Hurricane Sandy in 2012.

Topic Detection. Topic detection or modeling may enable authorities to anticipate a crisis and to take action while it unfolds. The technique helps in recognizing hidden patterns, understanding semantic and syntactic relations, and annotating, analyzing, organizing, and summarizing huge collections of textual information. Accordingly, several researchers have implemented these approaches on crisis datasets to detect and categorize potential topics. Chen et al. [12] suggest two topic modeling prototypes that improve trend estimation by capturing the underlying states of a user from a sequence of tweets and aggregating them over a geographical area. Researchers in [27] perform optimized topic modeling using community detection methods on three crisis datasets [38, 39, 52] to identify the discussion topics.

Natural Language Processing. Several works leverage Twitter datasets to conduct Natural Language Processing (NLP) tasks. As a challenging downstream task of NLP, Automatic Text Summarization techniques extract latent information from text documents, where the models generate a brief, precise, and coherent summary from lengthy documents. Text summarization is applicable to various real-world activities during crises, such as generating news headlines, delivering compact instructions for rescue operations and identifying affected locations. Prior works demonstrate such applications during crisis time.
For instance, Rudra et al. [44, 45] propose two relevant methods that classify and summarize tweet fragments to derive situational information. More recently, Sharma et al. [46] propose a system that produces highly accurate summaries from Twitter content during man-made disasters. Several other works focus on NLP subtasks for social media, such as information retrieval [18, 29] and text classification [30, 41].

Disease Classification. Applications of Machine Learning and Deep Learning in the healthcare sector have gathered growing interest in recent years. For instance, Krieck et al. [24] analyze the relevance of Twitter content for disease surveillance and activity tracking, which helps alert health officials to public health threats. Lee et al. [26] conduct text mining on Twitter data and deploy a real-time disease tracking system for flu and cancer using spatial and temporal information. Ashok et al. [4] develop a disease surveillance system to cluster and visualise disease-related tweets.

Crisis-time Economic Modeling. Estimating the economic impact of crises, such as epidemic outbreaks, is a crucial task for policy makers and business leaders, who need to adjust operational strategies [32] and make the right decisions for their organizations in times of crisis. Several research studies address this domain. For instance, Okuyama [37] provides an overview and a critical analysis of the methodologies used for estimating the economic impact of disasters; Avelino and Hewings [5] propose the Generalized Dynamic Input-Output framework (GDIO) to dynamically model higher-order economic impacts of disruptive events. Such studies correlate disaster events with economic impact, relying on disaster-related data and financial market data, respectively. We believe that EPIC30M can contribute to future economic modeling studies for epidemic events.

Health Informatics.
Compared to the cases above, a more general use case area is healthcare informatics, i.e., "the integration of healthcare sciences, computer science, information science, and cognitive science to assist in the management of healthcare information" [6, 34, 48]. While social media and online sources are used in health informatics to connect with patients and provide reliable educational content, there is growing interest in using Twitter and other feeds to study and understand indicators of health trends, particular behaviors or diseases. For example, Nambisan et al. [35] utilize Twitter content to study the behavior of depression. EPIC30M contains behavioral information across various diseases, including how the populace behaves with the onset and persistence of the diseases. Covering multiple diseases allows such research to correlate behavioral information across instances.

News and Fake News. With the proliferation of news content through the internet and virtual media, there is growing interest in developing an understanding of the science of news and fake news [25]. Data mining algorithms are advancing to study news content [47]. EPIC30M contains real news content that grows over time from lay-person terminology to technical and professionally based information and opinion. It likewise includes fact-based information as well as distorted or fake content. Through multiple cases over time, the field will have a rich source for studying news content, especially when correlated with reliable news sources for particular snapshots of time.

All in all, we believe that EPIC30M provides a set of rich benchmarks and can facilitate extensions of the above-mentioned works to a higher order, e.g., to cross-epidemic settings. As a result, the research findings can be more robust and closer to real-world scenarios.

Conclusion.
During our other efforts on COVID-19 related work, we discovered very few disease-related corpora in the literature that are sizable and rich enough to support cross-epidemic analysis tasks. In this paper, we present EPIC30M, a large-scale epidemic corpus that contains over 30 million tweets from 2006 to 2020. The corpus includes a subset of tweets related to three (3) general diseases and another subset related to six (6) epidemic outbreaks. We conduct exploratory analysis to study the properties of the corpus and identify several phenomena, such as a strong correlation between epidemics and locations, frequent cross-epidemic topics, and surges of discussion before the occurrence of the outbreaks. Finally, we discuss a wide range of use cases that EPIC30M can potentially facilitate. We anticipate that EPIC30M will bring substantial value and impact both to fast-growing computer science communities, such as natural language processing, data science and computational social science, and to multi-disciplinary areas, such as economic modeling, health informatics and the science of news and fake news.

Future work. For some epidemic outbreaks, such as 2009 H1N1 Swine Flu and 2014 West Africa Ebola, EPIC30M includes relevant tweets posted throughout the respective duration of the epidemics. We expect that the data of these classes could serve as strong and timeless cross-epidemic and cross-disease benchmarks. On the other hand, several epidemics, such as 2018 Kivu Ebola and 2016 Yemen Cholera, are still ongoing. We intend to extend the corpus by actively or periodically crawling tweets in addition to the current version. Furthermore, we plan to develop the corpus further with additional epidemic outbreak classes that happened more recently, such as the 2019 multi-national Measles outbreaks in the DR Congo, New Zealand, Philippines and Malaysia, the 2019 Dengue fever epidemic in Asia-Pacific and Latin America, and the 2018 Kerala Nipah virus outbreak.
Lastly, we also intend to develop an active crawling web service that automatically updates EPIC30M, and to migrate to cloud-based relational database services to ensure its availability and accessibility. The corpus is available at https://www.github.com/junhua/epic.

This research is funded in part by the Singapore University of Technology and Design under grant SRG-ISTD-2018-140.

Crisismmd: Multimodal twitter datasets from natural disasters
Crisis Detection from Arabic Tweets
Compartmental modeling and tracer kinetics
A Machine Learning Approach for Disease Surveillance and Visualization using Twitter Data
The Challenge of Estimating the Impact of Disasters: many approaches, many limitations and a compromise
System and method for integrated learning and understanding of healthcare informatics
Yuning Ding, and Gerardo Chowell. 2020. A large-scale COVID-19 Twitter chatter dataset for open scientific research - an international collaboration
An overview of sentiment analysis in social media and its applications in disaster relief
Analyzing discourse communities with distributional semantic models
Computer-aided mind map generation via crowdsourcing and machine learning
Covid-19: The first public coronavirus twitter dataset
Syndromic surveillance of Flu on Twitter using weakly supervised temporal topic models. Data mining and knowledge discovery
A linguistically-driven approach to cross-event damage assessment of natural disasters from social media messages
Large scale crowdsourcing and characterization of twitter abusive behavior
#Élysée2017fr: The 2017 French Presidential Campaign on Twitter
Time series analysis
The Hoaxy misinformation and fact-checking diffusion network
AIDR: Artificial intelligence for disaster response
Extracting information nuggets from disaster-related messages in social media
Twitter as a lifeline: Human-annotated twitter corpora for NLP of crisis-related messages
Epidemiological modeling of news and rumors on twitter
Automated monitoring of tweets for early detection of the 2014 Ebola epidemic
#swineflu: The use of twitter as an early warning and risk communication tool in the 2009 swine flu pandemic
A new age of public health: Identifying disease outbreaks by analyzing tweets
The science of fake news
Real-time disease surveillance using twitter data: demonstration on flu and cancer
Clustop: A clustering-based topic modelling algorithm for twitter using word networks
Hurricanes Harvey and Irma Tweet ids
Self-Evolving Adaptive Learning for Personalized Education
IPOD: An Industrial and Professional Occupations Dataset and its Applications to Occupational Data Mining and Analysis
CrisisBERT: a Robust Transformer for Crisis Classification and Contextual Crisis Embedding
Strategic and Crowd-Aware Itinerary Recommendation
Understanding the perception of COVID-19 policies by mining a multilanguage Twitter dataset
Essentials of nursing informatics
Social media, big data, and public health informatics: Ruminating behavior of depression revealed through twitter
Sentiment analysis during Hurricane Sandy in emergency response
Critical review of methodologies on disaster impact estimation
Crisislex: A lexicon for collecting and filtering microblogged communications in crises
What to expect when the unexpected happens: Social media communications across crises
Managing epidemics: key facts about major deadly diseases. World Health Organization
Automatic classification of disaster-related tweets
Why We're Sharing 3 Million Russian Troll Tweets. FiveThirtyEight
Summarizing situational tweets in crisis scenario
Extracting situational information from microblogs during disaster events: a classification-summarization approach
Going Beyond Content Richness: Verified Information Aware Summarization of Crisis-Related Microblogs
Fake news detection on social media: A data mining perspective
Mobile healthcare informatics. Medical informatics and the Internet in medicine
Modeling the infectiousness of Twitter hashtags
U.S. congressional election tweet ids
Mining misinformation in social media
Analysing how people orient to and spread rumours in social media by looking at conversational threads