key: cord-0443496-cvsch9t5 authors: Agarwal, Ankita; Salehundam, Preetham; Padhee, Swati; Romine, William L.; Banerjee, Tanvi title: Leveraging Natural Language Processing to Mine Issues on Twitter During the COVID-19 Pandemic date: 2020-10-31 journal: nan DOI: nan sha: 8c28cca49f05db171a644fd20f8880039586b6cf doc_id: 443496 cord_uid: cvsch9t5

The recent global outbreak of the coronavirus disease (COVID-19) has spread to all corners of the globe. The international travel ban, panic buying, and the need for self-quarantine are among the many social challenges brought about in this new era. Twitter has been used in various public health studies to identify public opinion about an event at the local and global scale. To understand the public concerns and responses to the pandemic, a system is needed that can leverage machine learning techniques to filter out irrelevant tweets and identify the important topics of discussion on social media platforms like Twitter. In this study, we constructed a system to identify the relevant tweets related to the COVID-19 pandemic from January 1st, 2020 to April 30th, 2020, and explored topic modeling to identify the most discussed topics and themes during this period in our data set. Additionally, we analyzed the temporal changes in the topics with respect to the events that occurred during this pandemic. We found that eight topics were sufficient to identify the themes in our corpus. These topics showed a temporal trend: the dominant topics varied over time and aligned with the events related to the COVID-19 pandemic.

Since December 2019, an outbreak of acute atypical respiratory disease that began in Wuhan, China, has spread rapidly across the globe. It was soon discovered that a novel coronavirus, "SARS-CoV-2", was responsible for the outbreak. The disease caused by this virus was called Coronavirus disease 2019 (COVID-19), and a pandemic was declared by the World Health Organization (WHO).
The disease was found to be highly contagious, with a basic reproduction number (R0) of 2.2 [1]. As of August 21st, 2020, a total of 22,864,873 people were infected globally, and 797,787 deaths were confirmed in 188 countries [2]. With time, COVID-19 has turned out to be an economic and social crisis, attacking societal structure at its core. The crisis has affected all segments of the public and has been particularly detrimental to members of the most vulnerable social groups, such as older adults and people with weak immune systems. According to the UN Department of Economic and Social Affairs (UN DESA), if not adequately addressed, the social crisis created by the COVID-19 pandemic may also increase inequality, exclusion, discrimination, and global unemployment in the medium and long term.

Social media are often seen as fast and effective platforms for searching, sharing, and distributing health information among the general population. They play a pivotal role in information dissemination and consumption during sudden outbreaks, allowing people to access timely and reliable information about the disease symptoms and its prevention. However, with increasing social distancing and growing reliance on online communication, social media has both positive and negative social impacts during the crisis. Recent studies [3]-[7] have shown that social media can play an essential role as a source of data for understanding public attitudes and behaviors during a crisis. Also, in the case of strong emotional reactions, media coverage of the pandemic may influence public sentiments [8]. Social media data can be used to quickly identify the main thoughts, attitudes, feelings, and topics that are occupying the minds of people about the COVID-19 pandemic. Such data can help policymakers, health care professionals, and the public identify the primary issues of concern and address them in a more appropriate manner [9].
Since January 2020, the number of papers analyzing Twitter activity during the COVID-19 pandemic has been increasing. Recent studies have identified the topics of discussions, concerns, and controversies surrounding COVID-19 in social media [10]-[13]. In this study, we seek to understand the extent to which social media conversations are a part of the discourse for different events unfolding during a pandemic. We investigated the effect of the events that occurred during the COVID-19 pandemic on major topics of discussion on Twitter over time. We addressed the following research questions in this study:
1) Research Q1: What is the feasibility of differentiating between relevant and non-relevant tweets with respect to the COVID-19 pandemic?
2) Research Q2: What are the trending topics in public discussions on Twitter related to the COVID-19 pandemic?
3) Research Q3: What is the relationship between events related to COVID-19 and the trending topics on Twitter over time?

II. RELATED WORKS

Understanding the prevalent topics during COVID-19 involved two major tasks: (1) filtering the relevant tweets providing situational and actionable information about the pandemic, and (2) analyzing the latent topics and themes with respect to the events which might have influenced public opinion. We provide a literature survey of each of these steps.

Preliminary research analyzing social media data related to the COVID-19 pandemic has been growing almost daily. Some works have focused on analyzing human behavior and reactions to the spread of COVID-19 in the online world [11], [14]-[16], or investigating conspiracy theories and social activism [17], [18]. Studies often ignore the need to preprocess the tweets to filter out those that do not pertain to the topic under investigation. For example, in a drug-related Twitter study [19], one of the keywords used to capture relevant tweets was the word "spice".
However, tweets related to pumpkin spice latte were collected, which were not relevant to the topic under investigation. Relevant tweets provide both situational and actionable information that supports extracting requirement-specific information, decision-making, and immediate assistance to the affected people [20]. Several studies have focused on identifying relevant information on Twitter. While [21] identified tweets relevant to influenza, [22] worked on filtering out noisy tweets by identifying relevant tweets for the Zika virus using text-based features. Wahbeh et al. [23] utilized a qualitative approach to identify relevant tweets during COVID-19. Although recent studies have used deep learning methods for social media mining in natural disasters [24], their use for pandemics is limited. In our work, we experiment with a range of machine learning and deep learning language models to filter out the relevant tweets for COVID-19.

Topic modeling has been applied in areas like health informatics [25], social media networks [26], telecommunication, and digital payments [27] in order to organize and summarize large textual information and to uncover hidden thematic structures in a collection of documents [28]. Similarly, in the case of crisis and pandemic scenarios, topic modeling has played an important role in understanding the most prevalent themes discussed on social media platforms. Recent works include performing topic modeling on documents related to disease outbreaks like the Ebola virus epidemic [29], the Zika virus epidemic [30], the dengue epidemic [31], and seasonal influenza [32]. Some studies have explored the feasibility of topic modeling on documents extracted from other social media platforms such as Facebook [33] and Weibo [34], as well as Google Trends [35]. Additionally, Medford et al. [36] studied the change in topics from January 14th to 28th, 2020 using Twitter data related to COVID-19. In addition, Ordun et al.
[37] explored a time series based analysis, and Yin et al. [38] explored Dynamic Topic Modeling to study the trending topics in tweets for the pandemic. However, to the best of our knowledge, no work has discussed the influence of the events related to COVID-19 on the trending topics discussed over time by the public on social media. We attempt to understand how public discussions on social media about COVID-19 resonate with the events related to this pandemic.

In this section, we describe our data collection, hand-annotated data, and the preprocessing steps involved in data cleaning.

We used COVID-19 pandemic-specific keywords like "coronavirus", "covid-19", and "sars-ncov" to extract tweets posted from January 1st, 2020 to April 30th, 2020. In early January (January 9), the World Health Organization (WHO) announced cases of coronavirus-related pneumonia in Mainland China. The WHO officially declared COVID-19 a global pandemic on March 11. By late April (April 29), the National Institutes of Health (NIH) announced that the medication remdesivir performed better than placebo in treating COVID-19. An open-source crawler library, getOldTweets, and the Twitter Streaming API were used to extract the tweets. The getOldTweets library was used to overcome the daily quota limitations of the Twitter API. This study was limited to English language tweets alone, and a total of 957,923 tweets were collected and preprocessed as described in Section III-C. This yielded 866,527 unique tweets, which serve as our dataset.

When tweets were collected using keywords, there were often tweets that contained the specified keywords but were not related to COVID-19. In Table I, we report sample tweets that contain the keyword "coronavirus" but are not relevant to the domain of the COVID-19 pandemic. We randomly sampled 1,500 unique tweets out of the 866,527 unique tweets, and three annotators labeled them as "Relevant" or "Irrelevant".
A tweet was labeled as "Relevant" if it contained information about the spread, cause, or effect of COVID-19, or an opinion, sentiment, or emotion regarding it; otherwise, it was labeled as "Irrelevant". Out of 1,500 tweets, 1,154 (77%) were labeled as "Relevant" and 346 (23%) as "Irrelevant", and these served as our dataset for building a relevancy classifier.

Table I
Tweet | Label
another foolproof way of not dying of coronavirus is to stick your head in an oven lit on gas mark 8. I expect there are loads more such as being hit by an asteroid. | Irrelevant
coronavirus live updates: cases up nearly 60% as airports expand screenings: ny times - |

The tweets fed to the relevancy classifier and the topic modeling were first preprocessed, as discussed in this section. The language used in social media posts is usually very informal. People use short words, self-curated abbreviations, URLs, and repeated punctuation marks such as periods or exclamation marks. To understand the context and semantics of the content, we first converted the tweets to lowercase and then removed the emoticons, stop words, punctuation, and URLs, including symbols like "#" and "@" as well as special characters. We also removed the keywords used to collect the tweets to avoid any bias introduced by their presence in most of the tweets. About ten percent of the collected tweets were duplicates or empty. These tweets were filtered out to obtain 866,527 unique tweets. We further tokenized the tweets using the Penn Treebank tokenizer, which uses regular expressions to tokenize text in the same way as the Penn Treebank. These tokens were then passed through a Porter stemmer, which reduces inflected word forms to their word stem and thereby reduces the size of the vocabulary.

In this section, we discuss the steps and models involved in our approach to filter the relevant tweets and understand the trending topics. Fig. 1 presents an overview of the sequence of steps involved in our analysis.
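The cleaning steps described in the preprocessing paragraph above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the stop-word list is a tiny made-up sample, non-ASCII stripping stands in for emoticon removal, whitespace splitting replaces the Penn Treebank tokenizer, stemming is omitted, and duplicates are detected on the cleaned text (the study does not say how duplicates were matched).

```python
import re
import string

# Assumptions for illustration: a tiny stop-word sample and the three
# collection keywords quoted in the text. The study's full lists are not given.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "and", "as", "for"}
COLLECTION_KEYWORDS = {"coronavirus", "covid-19", "sars-ncov"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
NON_ASCII_RE = re.compile(r"[^\x00-\x7f]")  # crude stand-in for emoticon removal
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess(tweet):
    """Return the cleaned tokens of one tweet (no stemming in this sketch)."""
    text = tweet.lower()
    text = URL_RE.sub(" ", text)
    text = NON_ASCII_RE.sub(" ", text)
    # Remove collection keywords before punctuation so "covid-19" still matches.
    for keyword in COLLECTION_KEYWORDS:
        text = text.replace(keyword, " ")
    text = text.translate(PUNCT_TABLE)  # also drops "#" and "@"
    return [tok for tok in text.split() if tok not in STOP_WORDS]


def deduplicate(tweets):
    """Drop tweets that are empty or duplicated after cleaning (first one wins)."""
    seen, unique = set(), []
    for tweet in tweets:
        key = " ".join(preprocess(tweet))
        if key and key not in seen:
            seen.add(key)
            unique.append(tweet)
    return unique
```

In a fuller pipeline, the token list returned by `preprocess` would then be passed through a tokenizer and stemmer such as the ones named in the text.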
After preparing a dataset as discussed in Section III, we used the 1,500 preprocessed and annotated tweets to build a relevancy classifier. Two baseline traditional machine learning algorithms were evaluated using Term Frequency-Inverse Document Frequency (TF-IDF) embeddings and contextual language model embeddings, as discussed in Sections IV-A and IV-B. Finally, the best performing model was selected to predict the labels of all the remaining 865,027 unique tweets. Furthermore, we performed topic modeling on the predicted relevant tweets to estimate the probability of a tweet belonging to a unique topic and to understand the hidden semantic similarities between the tweets. We discuss the details of the topic modeling algorithm further in Section IV-C.

In this study, two types of classifiers were trained using two types of feature representations, and the best performing model was used as a Relevancy Classifier to label the unlabelled tweets. The performance of our supervised models was evaluated using document embeddings based on TF-IDF and contextual language models. A document embedding is a learned representation for a text document where words that have a similar meaning have a similar representation.

Feature representation plays a crucial role in any text classification task. Traditional context-free representations such as Bag-of-Words (BoW) or TF-IDF are known for word-level representation, and in recent years language models have improved by including contextual semantics. We discuss the various representations as follows.
• Context-Free Embedding: In a BoW based feature vector, each element in the vector represents a unique word or n-gram from the entire data set. The most significant limitation of such a representation is the inability to encode word meaning or contextual information into the vectors.
TF-IDF attempts to evaluate the importance of a word by reducing the weight of words that occur more frequently in the data set and increasing the weight of words that are less frequent. The vector representation of a word based on TF-IDF thereby encodes a weight that depends on its frequency of occurrence rather than its raw count. However, even though TF-IDF representations assign different weights to different words, they fail to capture the different meanings (contexts) of the same word at different places (polysemy).
• Language Models (Contextual Embedding): Dense distributional similarity based word embeddings such as GloVe, FastText, or Word2Vec bridge the gap by trying to capture the neighboring words of a particular word as its context. However, neighboring words cannot necessarily capture the context of a word given the complexity and variety in a sentence. With recent developments in language modeling and the release of Transformer models, researchers have been able to capture the context of a given word by looking at an input sequence and deciding at each step which other parts of the sequence are essential. Bidirectional Encoder Representations from Transformers (BERT) introduced a novel technique of masked language models (MLM) and next sentence prediction [39] to learn the distributional contextual representation of words.

We used the 1,500 annotated tweets to build a relevancy classifier. Due to class imbalance (Relevant: 77%, Irrelevant: 23%), we used the synthetic minority over-sampling technique (SMOTE) [40] to generate synthetic samples of the minority class when training the relevancy classifier on the 1,154 relevant and 346 irrelevant tweets. After oversampling, we split the dataset into train, test, and validation subsets with a distribution of 75%, 15%, and 10%, respectively. We used a 5-fold cross-validation technique to confirm the generalizability of the model.
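To make the oversampling step concrete, the sketch below shows the core idea behind SMOTE: each synthetic sample is placed at a random position on the line segment between a minority-class point and one of its nearest minority-class neighbours. It is a simplified standard-library illustration, not the implementation used in the study; the function name, the default `k=5`, and the fixed seed are assumptions. If the classes were fully balanced, 1,154 − 346 = 808 synthetic "Irrelevant" samples would be needed, although the text does not state the target ratio.

```python
import random


def smote_like_oversample(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])  # one of the k nearest neighbours
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic
```

Because every synthetic vector lies between two real minority vectors, the technique adds variety without simply duplicating existing tweets.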
Both the feature representations discussed in Section IV-A were used to train two traditional classification algorithms, Support Vector Classification (Linear SVC) and Logistic Regression (LR), to filter the relevant tweets for COVID-19 from the collected tweets. We used the Scikit-Learn pipeline to generate TF-IDF vector representations. For generating contextual embeddings using pre-trained language models, we followed the default conversion process of tokenization and of converting all sentences to a given sequence length (i.e., truncating longer sequences and padding shorter sequences). Sentence-Transformers was used to generate BERT based sentence embeddings from the tweets. A pre-trained 'bert-large-uncased' [41] model with an embedding dimension of 768 was used to generate the sentence embeddings. We performed a Principal Component Analysis on the embeddings to reduce the dimensionality of the feature space to 300 and prevent the classifier from overfitting.

Latent Dirichlet Allocation (LDA) [28] is an unsupervised probabilistic model that automatically identifies the hidden themes (topics) in documents like tweets. It classifies the documents (i.e., tweets) into topics that are the best representation of the data set. After preprocessing the relevant tweets, as discussed in Section III-C, we used a coherence score to find the optimal number of topics. A coherence score measures the semantic similarity between the words in a given topic [42]. The Coherence Model from the Gensim package was used for this purpose. The coherence score graph shown in Fig. 2 indicated eight as the optimal number of topics for the LDA model. The topics thus obtained consisted of the top representative words in each topic, along with their probabilities. We publish the tweet IDs with relevance labels and topics to be explored by the research community.
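The idea of scoring topics by coherence and picking the topic count with the best score can be illustrated with a small standard-library sketch. It computes a UMass-style coherence, one of several measures in the literature, from document co-occurrence counts. The text does not say which measure was used with Gensim's Coherence Model, so the choice of UMass, the function names, and the example scores in the usage note are all assumptions made for illustration.

```python
import math
from itertools import combinations


def umass_coherence(top_words, documents):
    """Average UMass-style coherence of one topic's top words.

    top_words is assumed to be ordered from most to least probable, and
    documents is a list of token lists.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain every word in `words`.
        return sum(1 for doc in doc_sets if all(w in doc for w in words))

    scores = []
    for earlier, later in combinations(top_words, 2):
        denominator = doc_freq(earlier)
        if denominator == 0:
            continue  # word never occurs, skip to avoid division by zero
        scores.append(math.log((doc_freq(earlier, later) + 1) / denominator))
    return sum(scores) / len(scores) if scores else 0.0


def best_topic_count(coherence_by_k):
    """Return the number of topics with the highest coherence score."""
    return max(coherence_by_k, key=coherence_by_k.get)
```

For example, with hypothetical scores `{6: 0.41, 8: 0.47, 10: 0.44}`, `best_topic_count` returns 8, mirroring the way a peak in the coherence curve is read off a plot.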
The performance of Logistic Regression and the Support Vector Classifier with TF-IDF and BERT embeddings is reported in Table II. The SVC classifier with BERT embeddings performed better than the other settings. This best performing classifier showed precision and recall of 93% and a classification accuracy of 93% on the test set, and precision and recall of 95% with a classification accuracy of 95% on the validation set. This model was selected to generate predictions on the 865,027 remaining tweets (the entire data set excluding the 1,500 annotated tweets). A total of 688,825 (79.6%) tweets were classified as relevant, and the remaining 176,202 were classified as irrelevant. We used 689,979 relevant tweets (688,825 predicted by the relevancy classifier and 1,154 annotated as relevant earlier) for topic modeling, visualization, and trend analysis.

The coherence score for varying numbers of topics for all the predicted relevant tweets is shown in Fig. 2. As shown in the figure, the coherence score increased until eight topics and then gradually decreased; hence we chose eight as the optimal number of topics for our data set. The percentage of tweets discussing each topic in the corpus is shown in Table III. The results of running an LDA model with eight topics on all the relevant tweets, together with manually curated themes summarizing the top ten most relevant terms and some representative tweets for each topic, are presented in Table IV. We discuss the themes for each topic, as curated based on the top 10 relevant terms, below.
• Pandemic impact and reopenings: As shown in Table IV, the keywords and representative tweets for Topic 1 represent the major impact of the COVID-19 on the

An interactive visualization of the topics generated by the LDA model using the pyLDAvis package, as shown in Fig.
3, revealed that Topics 5 and 6 overlap with each other, which indicated that people's thoughts were circulating around the novel coronavirus (COVID-19) from China and theories and beliefs related to it. Topics 1, 4, and 8 also overlap with each other as these collectively address the consequences of the COVID-19 pandemic. Among all the topics, People's thoughts during COVID-19 (21.1%), Case statistics (14.5%), and Health workers and authorities (14.5%) were the most discussed, which is reflected in the size of the circles in Fig. 3. Based on the keywords in each topic, individual tweets were labeled with the prominent topic using our LDA model, and the percentage of tweets belonging to each topic for every week was calculated. As shown in Fig. 4, we plotted the percentage of tweets belonging to each theme against each week. This was done in order to visualize the trend in each topic being discussed over the weeks.

Table V (fragment): As many as 26.5 million Americans had filed for unemployment by this time, and due to cost concerns the poor and young people deferred care for health-related issues. The National Institutes of Health (NIH) showed early promise for the drug remdesivir for the treatment of COVID-19. | People's thoughts during COVID-19, Case statistics and Health workers and authorities

A. Principal Findings

1) Data Collection and Annotation Analysis: In the early weeks of January, we observed that the number of tweets with keywords like coronavirus and COVID-19 was very low compared to later weeks. The volume of tweets increased from the 4th week of January, remained considerably high throughout February, and decreased during April. It was also observed that the ratio of relevant to irrelevant tweets in the entire dataset (3.9) was similar to that in the manually labeled dataset (3.4), which shows that the distribution of the labeled data is similar to that of the data annotated by the classifier.
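The tweet counts reported for the relevancy classifier can be cross-checked with simple arithmetic. The short sketch below uses only figures quoted in the text; the variable names are illustrative.

```python
# Counts reported in the text.
unique_tweets = 866_527
annotated = 1_500
annotated_relevant = 1_154
predicted_relevant = 688_825
predicted_irrelevant = 176_202

# Tweets left for the classifier after removing the hand-annotated sample.
unlabelled = unique_tweets - annotated

# Share of the unlabelled tweets that the classifier marked as relevant.
relevant_share = 100 * predicted_relevant / unlabelled

# Corpus used for topic modeling: predicted plus hand-annotated relevant tweets.
topic_model_corpus = predicted_relevant + annotated_relevant
```

The values agree with the figures in the text: 865,027 unlabelled tweets, a relevant share of about 79.6%, and 689,979 tweets in the topic modeling corpus.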
2) Relevancy Classifier Analysis: It was observed that some of the tweets collected using COVID-19 specific keywords were irrelevant to the COVID-19 pandemic. We defined tweets to be relevant if they provided information about the spread, cause, or effect of the COVID-19 pandemic, or an opinion, sentiment, or emotion regarding it. The relevancy classifier we designed to filter out the irrelevant tweets was able to pick out non-English tweets written in the English alphabet, such as "olcokcevat malnla cmertsrrnla cimri olmal veren azizsr veren zelil olur" and "nojo demais dessa pessoa".

3) Topic Analysis: Overall, among all the eight topics, People's thoughts during COVID-19, Case statistics, and Health workers and authorities were predominantly discussed. This may be because, once COVID-19 had been declared a pandemic by the World Health Organization, rising cases and the role of health agencies were the main concerns on social media. People also put forward several conspiracy theories [18], and their reactions in economic, societal, and political areas came to the fore throughout the pandemic [43].

In this section, we discuss the trending topics over time by taking into consideration the COVID-19 related timeline events listed in Table V. The key observations are discussed below:
• Jan 01 to Jan 11: During the first two weeks of January, the World Health Organization announced the news about clusters of pneumonia-like cases of unknown origin in Wuhan, China, and travel restrictions were a concern. As a result, among all the topics discovered in our data set, "People's thoughts during COVID-19" was the most discussed one. Tweets from this period, like "Only very limited information is available, but this looks like an outbreak of what could be a new respiratory (corona?)virus", mainly talked about some symptoms and people's general perceptions of this virus.
Topics like "Origin of novel coronavirus" and "Case statistics" were also prominent, as a majority of the tweets talked about the origin of this virus and the number of people infected with it by that time.
• Jan 12 to Jan 25: During the next two weeks of January, according to the AJMC coronavirus timeline, cases of the novel coronavirus were reported outside China in Thailand, Japan, and the US, and the US national public health institute, the Centers for Disease Control and Prevention (CDC), began screening passengers at US airports. Wuhan in China was also subsequently placed under lockdown due to reports of human-to-human transmission of this virus. At this time, the topic "Health workers and authorities" emerged as the dominant topic, as the CDC, being a premier health institute, started taking action to stop the spread of coronavirus. As cases of this disease started emerging in other countries outside China, the topic "Case statistics" was also predominantly discussed at this time.
• Jan 26 to Feb 01: During this week, as the number of COVID-19 related cases and deaths was increasing, the World Health Organization (WHO) declared a global health emergency. So, the topic "Health workers and authorities" continued to be discussed more, as WHO is a major health organization. Tweets like "UN health agency urges China to continue search for source of new virus, as Thailand case emerges" depicted the role of health agencies at this time. As there was also an exponential rise in cases, "Case statistics" was also one of the dominant topics.
• Feb 02 to Feb 08: As the US declared a health emergency due to the coronavirus outbreak, the topic discussing the role of "Health workers and authorities" was still one of the most prominently discussed topics. The topic "Case statistics" also continued to trend on Twitter during this time.
• Feb 09 to Feb 22: During these two weeks, deaths related to COVID-19 were increasing and people started talking more about this novel coronavirus. Testing and diagnosis of people with COVID-19 related symptoms by health authorities became a concern. This was evident from tweets like "CDC Begins to Test Patients with Flulike Symptoms for Coronavirus". Thus, topics like "Case statistics" and "Health workers and authorities" were predominantly discussed at this point in time.
• Feb 23 to Feb 29: According to the CDC, as COVID-19 was heading towards a pandemic stage, the "Government response" topic emerged as one of the prominent topics. Tweets like "Trump bashes media criticism over his handling of coronavirus pandemic " and "You thought the coronavirus was bad? Just look forward to the next 17 . As a result, the "Health workers and authorities" topic emerged once again as a dominant topic along with "Government response".
• April 19 to April 30: By the end of April, 26.5 million people had filed for unemployment in the US and, according to the AJMC COVID-19 timeline, people started deferring their treatments due to "cost concerns". "Health workers and authorities" was still one of the predominantly discussed topics, as NIH trials had shown the drug remdesivir to be effective for the treatment of COVID-19. Promotion of telehealth, as mentioned in the tweet "NEW: Bipartisan lawmakers back efforts to expand telehealth services for seniors to help combat the coronavirus", might have also led to the increased discussion related to the "Health workers and authorities" topic during this time.

Such analysis could be useful for policymakers, government authorities, and health officials to understand the public concerns about the events occurring during emergencies like pandemics. It could help them to provide correct and timely information on the events happening during emergencies.
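The weekly trend analysis above rests on two steps: labeling each tweet with its most probable topic and computing, for every week, the percentage of tweets that fall under each topic. A minimal standard-library sketch of that computation follows. The input format (pairs of posting date and per-topic probabilities), the function name, and the choice of 7-day windows counted from a caller-supplied `start` date are assumptions for illustration, since the text does not specify how weeks were delimited.

```python
from collections import Counter, defaultdict
from datetime import date


def weekly_topic_share(tweets, start):
    """Map week index -> {topic index: percentage of that week's tweets}.

    tweets is an iterable of (posting_date, topic_probabilities) pairs, and
    weeks are consecutive 7-day windows counted from `start`.
    """
    counts = defaultdict(Counter)
    for posted, probs in tweets:
        week = (posted - start).days // 7
        # The dominant topic is the one with the highest probability.
        dominant = max(range(len(probs)), key=probs.__getitem__)
        counts[week][dominant] += 1
    return {
        week: {topic: 100 * n / sum(c.values()) for topic, n in c.items()}
        for week, c in sorted(counts.items())
    }
```

Plotting each topic's percentage against the week index gives a trend chart of the kind described for Fig. 4.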
News reports and magazine articles need to be careful in reporting the events to the public, as this might influence discussions on social media. Otherwise, there could be widespread fake news, misinformation, and several conspiracy theories on social media platforms.

In this section, we present some limitations we faced in our analysis of social media posts during the COVID-19 pandemic.
• Dataset: This study is limited to tweets in English. As the COVID-19 pandemic is a global pandemic, the English language constraint restricted our data collection to users who tweeted in the English language alone, and as a result, the findings of this study may not generalize worldwide. We could not evaluate our analysis on tweets in other languages such as Hindi, German, or French, and we encourage the research community to address this in future works. Future studies may also perform spatial and temporal analysis to identify the events and topics specific to a location.
• Geo-location: Due to the low volume of tweets that include geotags, we have excluded geo-location from the scope of this study.
• Fake news and misinformation: The use of social media platforms comes with a known risk of fake news and misinformation, which could affect the topics of discussion on social media. Research focused on identifying fake news would have interesting applications for understanding pandemics like COVID-19.
• Event detection: Another limitation is that we cannot make causal claims regarding the alignment between certain topics and events. These findings help describe the relationship between COVID-19-related timeline events and discussions on Twitter, but these data cannot be used to establish causation. Exploration of the usefulness of Twitter data for actual detection of events related to COVID-19 is an important next step.
• Relevancy Classifier: During crises like natural disasters or pandemics, online social media can create noise that clutters the useful information. In most prior works, the tweets collected using keywords were assumed to be relevant, and studies were carried out on that basis. However, we found that the tweets collected using keywords can be irrelevant and contain no useful information. Twenty-five percent (175,048) of the total collected tweets were irrelevant. These kinds of tweets can often bias the experiments. In order to mitigate this issue, we built a relevancy classifier that can filter out the irrelevant tweets. Most prior research has filtered out non-English tweets in the process of data preparation. However, foreign language tweets written in the English alphabet are often still included, which can create noise in the experiments. In this study, we observed that the majority of the foreign language tweets were identified as irrelevant by the relevancy classifier, and thus these were excluded from our experiments.
• Understanding of Events Over Time: Temporal trend analysis of online social media data can be important for identifying how discussions on social media are influenced over time by the news and information available to people during pandemics like COVID-19. During events like natural disasters and pandemics, it is important for the government and humanitarian organizations to identify key factors that influence the topics of public discussion over time. Prior works have not addressed this problem. In this study, we identified various events that may be linked to the topics of discussion during the COVID-19 pandemic. These types of studies could also be used to understand the influence of fake news and other media sources on public discussions on social media platforms.
Hence, proactive actions could be taken to stop the spread of fake news and misinformation, and at the same time, the impact of influential and accurate news could be identified.

While urgent actions are needed to mitigate the potentially devastating effects of COVID-19, they can be supported by understanding the behavioral and social impact on the people. Because the pandemic imposes significant psychological burdens on individuals, insights from analyzing trends in public conversations can be used to help align public emotions with the recommendations of epidemiologists and public health experts. To understand the possible effect of media coverage on public emotions, we present an analysis relating the trending topics of conversation on Twitter to events during the pandemic. Such analysis can provide insights so that a positive frame could be designed in the media coverage of the pandemic to educate the public and relieve negative emotions while increasing compliance with public health recommendations. Officials monitoring the spread of the pandemic can judge the influence of false news or rumors related to COVID-19 by monitoring the topics trending at a given time, and can thus provide appropriate information to the people. Particular sentiments like fear, as well as misconceptions, can be monitored similarly. Government officials can make better decisions accordingly, as their decisions will be reflected in the news, which would impact people's behavior and discussions. Empirical and qualitative evaluations indicated that our analysis is trustworthy and can inform the design of frameworks for more precise and definitive social data mining to assist humanitarian organizations during global pandemics.