key: cord-0170893-e2m8pldm authors: Saad, Muhammad; Hassan, Muhammad; Zaffar, Fareed title: Towards Characterizing COVID-19 Awareness on Twitter date: 2020-05-17 journal: nan DOI: nan sha: 2e2bf4eab02be3693f06e61da76fa924a97622e1 doc_id: 170893 cord_uid: e2m8pldm
The coronavirus (COVID-19) pandemic has significantly altered our lifestyles as we try to minimize its spread through preventive measures such as social distancing and quarantine. An increasingly worrying aspect is the gap between the exponential disease spread and the delay in adopting preventive measures. This gap is attributed to the lack of awareness about the disease and its preventive measures. Nowadays, social media platforms (e.g., Twitter) are frequently used to create awareness about major events, including COVID-19. In this paper, we use Twitter to characterize public awareness regarding COVID-19 by analyzing the information flow in the most affected countries. Towards that, we collect more than 46K trends and 622 Million tweets from the top twenty most affected countries to examine 1) the temporal evolution of COVID-19 related trends, 2) the volume of tweets and recurring topics in those trends, and 3) the user sentiment towards preventive measures. Our results show that countries with a lower pandemic spread generated a higher volume of trends and tweets to expedite the information flow and contribute to public awareness. We also observed that in those countries, the COVID-19 related trends were generated before the sharp increase in the number of cases, indicating a preemptive attempt to notify users about the potential threat. Finally, we noticed that in countries with a lower spread, users had a positive sentiment towards COVID-19 preventive measures. Our measurements and analysis show that effective social media usage can influence public behavior, which can be leveraged to better combat future pandemics.
Copyright © 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
The coronavirus pandemic has spread across the world with over four million reported cases to date. Currently, no vaccine is available for the SARS-CoV-2 strain, and therefore the optimal way to curtail its spread is to avoid physical contact with COVID-19 carriers. To minimize physical contact, people are advised to practice social distancing, stay at home, and in the worst case, undergo a lockdown (Broniec et al. 2020; Inoue and Todo 2020). Unfortunately, despite these guidelines, COVID-19 has spread faster than the adoption of preventive measures. The gap between the spread and the adoption of preventive measures is due to 1) limited awareness about the disease and its spread, 2) the nature of the disease and its latent symptoms (Robson 2020), and 3) delayed response in taking corrective measures by governments and the general public. In particular, public awareness largely depends on the information spread through the mainstream media and social media (Wells et al. 2020; Le, Shafiq, and Srinivasan 2017; Brena et al. 2019). Between these two axes of communication, social media platforms (e.g., Twitter and Facebook) are highly useful in propagating timely information regarding a major event (Bin Tareaf et al. 2018). Therefore, it is intuitive to assume that social media platforms contain information footprints that can be leveraged to characterize the response of various communities to the COVID-19 pandemic. To that end, this study uses Twitter data to analyze various attributes of information exchange in order to model the preparations of various countries for the COVID-19 pandemic. For this study, we draw inspiration from prior related works that have demonstrated the usefulness of Twitter in characterizing user behavior in major events.
For instance, prior work showed that during the Ebola pandemic, Twitter users actively discussed the risk potential and the spread rate. Similarly, (Fischer-Preßler, Schwemmer, and Fischbach 2019) and (Keymanesh et al. 2019) showed that Twitter is useful in monitoring the social efficacy and collective understanding of masses during critical global events. We follow a similar methodology and use Twitter to study the response of various countries to the COVID-19 pandemic. We collect trends and tweets from the top 20 most affected countries by COVID-19 (as of April 19, 2020) and contextualize the information to study their preparatory response. More precisely, using our dataset, we explore the following key questions.
1. Are there variations in the response of different countries to the COVID-19 pandemic that are reflected in trends and tweets from that country?
2. Are there indications to support that awareness through Twitter was useful in influencing the pandemic spread?
In pursuit of these questions, we develop a data collection system to collect more than 5,000 Twitter trends and over 622 Million tweets from the top 20 most affected countries by COVID-19 (as of April 19, 2020). For each country, we monitor the temporal patterns of COVID-19 trends and the volume of tweets in those trends to study the country's response to the pandemic.
Figure 1: Design and workflow of our data collection system. First, we deployed a crawler to collect trends from Trendogate, and store them in the "Data Storage and Scheduler" platform hosted on cloud. The scheduler fed search queries to eighty-five workers that collected tweets from trends. Finally, we applied data analytics and NLP to collect results.
We perform topic modeling and sentiment analysis on tweets to analyze the user response towards preventive measures such as social distancing, quarantine, and lockdown.
Our dataset reveals meaningful insights, including a correlation between frequent trending on COVID-19 and effective pandemic management. To illustrate this observation, we provide a comparative case study of six countries (USA, Italy, Spain, Sweden, Austria, and Belgium), which indicates that countries with a lower pandemic spread usually generated more tweets and trends about COVID-19 and its preventive measures. We believe that the key takeaways of our work highlight the importance of social media in influencing public interactions that can be useful in combating future pandemics.
Contributions and Roadmap. We take a systematic approach towards analyzing the temporal evolution of Twitter trends in the 20 countries most affected by COVID-19. Our data collection, methodology, and results are summarized below as the key contributions.
1. We developed a data collection system with which we collected more than 48K trends and over 622 Million tweets from December 15, 2019, to April 5, 2020, for the top 20 countries affected by COVID-19.
2. For each country, we identify the COVID-19 related trends among all trends and the volume of tweets in those trends. We pair that information with key indicators in the country's COVID-19 timeline to study their preparatory status.
3. We present a case study of six countries (United States, Italy, Spain, Sweden, Austria, and Belgium) to a) show the variation in response of each country to the pandemic, and b) showcase observations that suggest that frequent and timely information propagation about preventive measures correlated with a lower pandemic spread. Notably, our results show that on average, Sweden, Austria, and Belgium generated more trends and tweets about COVID-19 and its preventive measures than the United States, Italy, and Spain.
4. Additionally, we apply Natural Language Processing (NLP) to extract the most prevalent topics in the COVID-19 tweets and the user sentiment towards those topics.
We observed that countries with a lower pandemic spread frequently used terms related to the preventive measures such as "social distancing."
The rest of the paper includes data collection and methodology in §2, experiments and results in §3, discussion in §4, and appendices with supplementary findings in §5.
This section describes our data collection and methodology. We started data collection on April 19, 2020, by selecting the top 20 most affected countries on that date. For each country, we collected all their Twitter trends from December 15, 2019, to April 5, 2020. Figure 1 shows our data collection system, and below, we briefly discuss some key challenges that we encountered during the process.
Collecting Historical Trends. A trend on Twitter generally indicates a topic commonly discussed by users in a location (Tulasi et al. 2019). Logically, if all trends of a location are collected on a specific date, we can estimate the commonly discussed topics in that region. As such, the first challenge in our study was to obtain all the historical trends of the selected countries. The Twitter API does not provide historical trends for countries, and therefore, we relied on third-party services for trend collection. We used an online service called Trendogate that maintains historical Twitter trends for all countries (TrendoGateCommunity 2020). We developed a crawler that periodically scraped trends of each country and stored them in our "Data Storage and Scheduler" platform. For validation, we cross-examined those trends with an Internet archive service called "Wayback Machine" (ArchiveCommunity 2020). The "Wayback Machine" creates daily snapshots of a vast majority of Internet webpages. Currently, the archive contains historical data of more than 330 billion web pages. After cross-examining and validating data, our "Data Storage and Scheduler" platform created a list of trends for the eighty-five workers that we deployed for tweet collection (Figure 1).
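The hand-off from the scheduler to the workers can be sketched as follows. This is a minimal illustration only: the paper does not specify how trends were assigned to workers, so the round-robin policy, the `Query` record layout, and the function name `assign_queries` are all assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical record: one search query is a (country, date, trend) triple.
Query = Tuple[str, str, str]


def assign_queries(queries: List[Query], num_workers: int = 85) -> Dict[int, List[Query]]:
    """Distribute search queries across workers in round-robin order.

    Round-robin is an assumed policy; it keeps per-worker query counts
    within one of each other but ignores how many tweets each trend holds.
    """
    if num_workers <= 0:
        raise ValueError("num_workers must be positive")
    assignment: Dict[int, List[Query]] = defaultdict(list)
    for index, query in enumerate(queries):
        assignment[index % num_workers].append(query)
    return dict(assignment)
```

A scheduler that also accounts for tweet volume per trend would need a cost estimate per query, which this sketch deliberately leaves out.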
Collecting Tweets from Trends. We developed Twitter crawlers and deployed them on eighty-five workers for concurrent data collection. We could not use the Twitter API since the API only provides the recent tweets from a trend. To overcome this limitation, we developed web crawlers that utilized Twitter's scroll loader functionality to collect tweets.
Table 1: Results from data collection. Each country is ranked based on the total number of COVID-19 cases as of April 19, 2020. Note that 1) India generated the highest number of trends, 2) Switzerland generated the highest number of COVID-19 trends and the highest number of overall tweets, 3) Ireland generated the highest number of tweets before the first case, 4) Belgium generated the highest number of tweets before the first death, 5) Switzerland generated the highest number of COVID-19 tweets, and 6) Turkey generated the highest number of trends before the first case and the first death. Percentages are reported relative to the total number of trends and the total number of tweets.
Each web crawler generated a search query for a trend in a country and iterated over the scroll loader to scrape tweets. For this purpose, we sought help from prior works that have utilized Twitter's scroll loader functionality for data collection (Pratikakis 2018; Mottl 2019; Valkanas, Saravanou, and Gunopulos 2014). We also noticed that Twitter applies rate-limiting on IP addresses that generate iterative search queries. Given the desired data volume, we provisioned 85 workers and replicated our crawler on these workers. Each worker had a unique IP address on the Internet. Upon receiving a rate-limiting error, each worker applied a linear backoff time. Figure 1 provides the data collection system workflow. Table 1 and Figure 2 show preliminary results obtained after data collection. At the time of writing this paper, we were able to collect trends and tweets from December 15, 2019, to April 5, 2020.
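The linear backoff that each worker applied on a rate-limiting error can be sketched as below. This is an illustrative sketch under stated assumptions, not the crawler used in the study: `RateLimitError`, `fetch_with_linear_backoff`, the base delay, and the retry cap are hypothetical names and values chosen for the example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical error raised when the server rate-limits this worker."""


def fetch_with_linear_backoff(
    fetch: Callable[[], T],
    base_delay: float = 5.0,   # assumed value, not taken from the paper
    max_retries: int = 10,     # assumed value, not taken from the paper
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch(); after the k-th rate-limit error, wait k * base_delay seconds.

    The wait grows linearly (5s, 10s, 15s, ...), in contrast to exponential
    backoff, where the wait doubles after each failure.
    """
    for attempt in range(1, max_retries + 1):
        try:
            return fetch()
        except RateLimitError:
            if attempt == max_retries:
                raise  # give up after the final attempt
            sleep(base_delay * attempt)
    raise ValueError("max_retries must be at least 1")
```

Passing `sleep` as a parameter keeps the function testable without real waiting.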
The results reported in this study are therefore confined to the period from December 15, 2019, to April 5, 2020. Figure 2 reports the distribution of daily trends obtained from each country as violin plots. For each plot, the white dot in the middle is the median value of the total number of trends, the grey bar is the interquartile range, and the outer shape is the kernel density estimation showing the data distribution (details of the kernel density function are in subsection 5.1). Figure 2 shows that 1) the number of daily trends from each country varied from as low as one trend in Israel to 47 trends in Italy, and 2) the average number of daily trends was 21. Therefore, we expected variations in the duration of data collection for each country, and our "Data Storage and Scheduler" platform applied load-balancing to maximize the system utility.
Trend Collection. In Table 1, we report the preliminary results, for which we collect statistics about the first case and the first death from (COVID-19 2020). For each country, we record the total number of 1) all trends, 2) all tweets, 3) trends related to COVID-19, 4) tweets related to COVID-19, 5) trends and tweets before the first reported case, and 6) trends and tweets before the first reported death. All countries are sorted in descending order of COVID-19 cases as of April 19, 2020 (when we began our study), where the United States had the highest number of COVID-19 cases followed by Spain and Italy, respectively. To obtain the COVID-19 related trends and tweets, we curated a list of COVID-19 terms from the Yale Medicine Glossary (Katella 2020).
Limitations. During data collection, we could not collect trends from China due to a state-backed ban on Twitter. Another limitation of our work is that Trendogate reported limited visibility into Iran. As a result, we could not collect trends from Iran.
We excluded these countries from our study. Because they were among the top 20 countries, we replaced them with Ireland and India, which ranked 21st and 22nd at the time of this study. Despite these limitations, our dataset covers a wide range of countries that can sufficiently provide the results required for our analysis.
We conduct three experiments to analyze the response of each country to the COVID-19 pandemic. In the first experiment, we analyze the temporal behavior of COVID-19 related trends and tweets to study the patterns in the information spread. We present a case study of six countries to highlight the variation in response, characterized by the number of trends and the tweet volume in those trends. In the second and third experiments, we perform topic modeling and sentiment analysis to study the commonly discussed COVID-19 topics and the user sentiment in those discussions.
For temporal analysis, we specify a timeline for each country where we observe the total number of trends and tweets generated every day. Our timeline runs from December 21, 2019, to April 5, 2020. We exclude trends and tweets before December 21 since we did not observe any significant COVID-19 related data to report. We made the following key observations in the temporal analysis.
1. Overall, as the number of cases increased in a country, the number of tweets and trends increased accordingly. However, the relationship was not always linear. In most cases, the number of tweets decreased while the number of cases kept growing.
2. A few countries (e.g., Sweden and Austria) preemptively responded to the pandemic by actively discussing COVID-19 before the increase in the number of cases.
3. In some countries (e.g., Austria), we observed a constant recurrence in tweets and trends, indicating consistency of interest in the subject.
To take a deeper look at these observations, we present a case study below.
Case Study.
For the case study, we selected the top three countries from Table 1, namely the United States, Spain, and Italy, and three other countries at random, namely Sweden, Austria, and Belgium. In the United States, the first COVID-19 case was reported on January 21, 2020, and by April 5, 2020, the number of cases exceeded 300K. Similarly, for Spain and Italy, the first case was reported on January 31, and the total number of cases increased to 132K and 18K by April 5, respectively. For Sweden, Austria, and Belgium, the first case was reported on February 4, February 24, and February 4, while the total cases increased to 6K, 12K, and 100K by April 5, respectively. For simplicity of analysis, we divide these countries into two sets, where the set S1 consists of the United States, Spain, and Italy, and the set S2 consists of Sweden, Austria, and Belgium. Note that 1) all the first cases in S1 were reported earlier than the first cases reported in S2, and 2) by April 5, the total number of cases in S1 was much higher than the total number of cases in S2. We analyzed the temporal behavior of Twitter trends in S1 and S2. For both sets, we apply Algorithm 1 to obtain 1) the timeline of tweets in COVID-19 related trends, and 2) the total number of tweets about COVID-19. We separate tweets into two categories because the text in a tweet may or may not be associated with a COVID-19 trend. To understand this phenomenon, assume that an ongoing trend in a country is #COVID-19. A user in that country tweets, "Today, we have reported ten new cases of #COVID-19." This tweet will appear in the #COVID-19 trend, and the trend will match our list of terms related to COVID-19. In contrast, if an ongoing trend in a country is #FootballMatch and a user tweets "#FootballMatch has been canceled due to coronavirus," then the tweet will not appear in the COVID-19 related trends. However, the tweet will match our list of terms related to COVID-19.
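The two categories described above can be sketched in a few lines. This is a hedged illustration rather than the paper's Algorithm 1, whose exact steps are not reproduced in the text: the term list is a small illustrative subset (the study curated its list from a medical glossary), and the names `COVID_TERMS`, `mentions_covid`, and `count_covid_tweets` are hypothetical.

```python
from typing import Iterable, Sequence, Tuple

# Illustrative subset only; the study's full term list is not given here.
COVID_TERMS = ("covid", "coronavirus", "sars-cov-2", "quarantine", "lockdown")


def mentions_covid(text: str, terms: Sequence[str] = COVID_TERMS) -> bool:
    """Return True if any COVID-19 term occurs in the text (case-insensitive)."""
    lowered = text.lower()
    return any(term in lowered for term in terms)


def count_covid_tweets(
    tweets: Iterable[Tuple[str, str]], terms: Sequence[str] = COVID_TERMS
) -> Tuple[int, int]:
    """Count tweets in both categories from (trend, text) pairs.

    Returns (tweets posted under a COVID-19 related trend,
             tweets whose own text matches a COVID-19 term, under any trend).
    """
    in_covid_trends = 0
    covid_tweets = 0
    for trend, text in tweets:
        if mentions_covid(trend, terms):
            in_covid_trends += 1
        if mentions_covid(text, terms):
            covid_tweets += 1
    return in_covid_trends, covid_tweets
```

Applied to the two example tweets from the text, the first counter sees only the #COVID-19 tweet, while the second counter sees both.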
The first example shows that COVID-19 is an actively discussed topic in a country since it appears among trends. If we sample all tweets in COVID-19 related trends, we can estimate the significance of the topic in the country. However, this method may not capture complete information about tweets related to COVID-19, as demonstrated in the second example. Therefore, apart from acquiring a holistic view through COVID-19 trends, we also construct a complete picture by collecting all COVID-19 tweets from all trends, irrespective of the trend nature. The second method allows us to precisely determine the number of times COVID-19 was discussed by people in a country. We use Algorithm 1 to extract this information. We report our results in Figure 3, where Figure 3(a) shows the total number of tweets related to COVID-19 trends and Figure 3(b) shows the total number of COVID-19 tweets among all trends. The total number of tweets in both figures also includes the number of retweets.
Key Takeaways. Our results show that countries in S2 generated COVID-19 related tweets and trends before countries in S1, indicating a preemptive attempt to raise pandemic awareness. Over the entire evaluation timeline, all countries in S2 generated more COVID-19 tweets than countries in S1. We also observed spikes in Figure 3, showcasing a surge in the number of tweets and trends. In all noticeable spikes, S2 clearly dominated S1, indicating a higher user engagement towards COVID-19. Among all countries, Switzerland generated the highest number of COVID-19 tweets (≈7.3 Million) and the highest number of COVID-19 trends (272). The inner plot in Figure 3(a) shows that the number of daily trends in S2 was considerably higher than in S1. Notably, in Belgium, 15 COVID-19 trends were generated on March 19, 2020. These results show that countries in S2 effectively utilized Twitter to propagate information among users and prepare them for the pandemic.
In our second experiment, we take a closer look at the textual information in the tweet corpus to make useful inferences about prevalent topics in those tweets. To motivate a common case, we limit our analysis to S1 and S2, and retrieve their tweet corpus from Algorithm 1. To study prevalent topics among COVID-19 tweets, we combined those tweets in a single text corpus for each country. We then tokenized the text corpus, removed the stop words, and calculated the frequency count over the resulting text. Finally, we assigned weights to all the topics and sorted them in descending order.
Table 2: Top 10 most common words in the text corpus of each country. Note that the three most common words in Sweden are Social, Distancing, and Coronavirus.
Figure 5: The x-axis shows the sentiment score between the range of -1 and +1. The y-axis shows the kernel density estimation that captures the data distribution shape. Overall, the general sentiment in each class closely aligns across all countries. For social distancing, the sentiment is close to neutral. For quarantine and social distancing, the sentiment is more distributed towards the positive side.
In Figure 4, we show word clouds for each country, providing an intuitive overview of the most commonly used terms in COVID-19 tweets. Figure 4 shows that generally, "coronavirus" was the most common term across all countries. Noticeably, in Sweden and Austria, "social," "distancing," "quarantine," "lockdown," "stay," and "home" were more dominant than in other countries. Since the two terms "social" and "distancing" may appear in different contexts across tweets, we performed the same experiment with a bigram model. The bigram model approximates the probability of a word by conditioning on the preceding word. As such, if "social" and "distancing" are collocated, then they would naturally appear in the model. We report the results of the bigram model in Figure 6.
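The tokenize, filter, and count steps, together with the bigram variant, can be sketched as follows. This is a minimal sketch under stated assumptions: the stop-word list is a tiny illustrative one, the regular-expression tokenizer is a stand-in for whatever tokenizer the study used, and the names `tokenize` and `top_terms` are hypothetical.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# Tiny illustrative stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"a", "an", "the", "and", "or", "to", "of", "in", "is", "are", "we", "at", "for", "on"}


def tokenize(text: str) -> List[str]:
    """Lowercase the text, keep alphanumeric tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]


def top_terms(tweets: Iterable[str], n: int = 10) -> Tuple[list, list]:
    """Return the n most frequent unigrams and bigrams with their counts.

    Bigrams are formed within each tweet, after stop-word removal, so a
    pair never spans two tweets but may bridge a removed stop word.
    """
    unigrams: Counter = Counter()
    bigrams: Counter = Counter()
    for tweet in tweets:
        tokens = tokenize(tweet)
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams.most_common(n), bigrams.most_common(n)
```

Counting pairs is what separates a collocation such as ("social", "distancing") from tweets where the two words merely co-occur in unrelated positions.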
Our results confirmed that "social distancing" and "stay home" were indeed dominant terms in Sweden and Austria. Additionally, in Table 2, we report the ten most common terms that appeared in our topic modeling. Table 2 shows that the trending topics varied significantly across countries. The most common term among all countries was "Coronavirus," followed by "Covid19." Moreover, in all countries except Sweden, "Coronavirus" was the most common term. In Sweden, the top two terms were "Social" and "Distancing," indicating that Twitter users in Sweden placed significant emphasis on the preventive measures. Combined, the number of COVID-19 topics in S2 was greater than in S1, although, considering the total number of COVID-19 cases and deaths in S1, we expected the outcome to be the opposite.
In our third experiment, we analyze the user sentiment towards the COVID-19 related preventive measures. Towards that, we first isolated tweets containing the terms "social distancing," "quarantine," and "lockdown," and distributed those tweets into three separate classes. We also incorporated terms that closely resembled a specified class. For instance, "curfew" closely relates to the class "lockdown," while "self isolation" relates to the class "quarantine." We manually annotated such similar terms and incorporated them into the corresponding classes. For sentiment analysis, we used the "TextBlob" library in Python, which provides various useful language processing operations, including part-of-speech tagging, text tokenization, sentiment analysis, and text classification. The "TextBlob" library assigns a score in the range of -1 to 1 to each tweet in the class. We eliminated tweets with a neutral score of "0" to focus purely on tweets with a positive or negative sentiment. Additionally, we applied the kernel density function to aggregate tweets with the same sentiment score and observed the distribution shape of each class. We report our results in Figure 5.
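The class assignment, neutral-score filtering, and density estimation steps can be sketched with the standard library alone. This is an illustration under assumptions, not the study's pipeline: the term lists contain only the examples named in the text, the bandwidth is left to the caller, and the sentiment scores are assumed to have been computed beforehand (for instance with TextBlob's `TextBlob(text).sentiment.polarity`). The names `CLASS_TERMS`, `assign_class`, `drop_neutral`, and `gaussian_kde` are hypothetical.

```python
import math
from typing import Callable, Dict, List, Optional, Sequence

# Only the example terms from the text; the full annotated lists are not given.
CLASS_TERMS: Dict[str, List[str]] = {
    "social distancing": ["social distancing"],
    "quarantine": ["quarantine", "self isolation"],
    "lockdown": ["lockdown", "curfew"],
}


def assign_class(text: str) -> Optional[str]:
    """Return the first class whose terms occur in the tweet, else None."""
    lowered = text.lower()
    for label, terms in CLASS_TERMS.items():
        if any(term in lowered for term in terms):
            return label
    return None


def drop_neutral(scores: Sequence[float]) -> List[float]:
    """Remove scores equal to 0 so that only polar tweets remain."""
    return [s for s in scores if s != 0]


def gaussian_kde(scores: Sequence[float], bandwidth: float) -> Callable[[float], float]:
    """Build a Gaussian kernel density estimate over the sentiment scores."""
    if not scores or bandwidth <= 0:
        raise ValueError("need at least one score and a positive bandwidth")
    norm = 1.0 / (len(scores) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x: float) -> float:
        # Sum one Gaussian bump per data point, then normalize.
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in scores)

    return density
```

Evaluating the returned `density` on a grid over [-1, 1] gives a curve of the kind plotted per class and country; a tweet that matches several classes is assigned only to the first one here, which is a simplification.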
Our results show that for social distancing and quarantine, the sentiment across all countries was generally within the same margin. For social distancing, almost all countries had a close to neutral sentiment, as indicated by a spike around 0.1 (Figure 5(a)). However, we also observed a small spike towards the positive sentiment in Belgium and Sweden. Similarly, for the quarantine class, we noticed a spike around 0.3 for all countries, indicating a more positive response. For the lockdown class, we observed a relatively higher sentiment variation. In Italy, the sentiment was distributed towards the negative side, with a spike around -0.1 (Figure 5(c)). However, in Austria and Belgium, the sentiment was skewed towards the positive side, with peaks around 0.9. In summary, our data show that the general response to social distancing and quarantine was similar across all countries. However, for lockdown, we observed a variation in response, with Italy inclined towards a negative sentiment.
Overall, the general sentiment on social distancing and quarantine, across all countries, converged to a similar score in the density distribution. This observation reflects a sense of uniformity in expression for all countries. However, for lockdown, the variation in score indicates a divergence in expression towards it. This could be a result of societal pressures operating in those countries which we could not capture in our dataset. Perhaps a more precise coupling of sentiment with the increasing number of cases would explain the sentiment divergence. In the future, we plan to explore this direction and obtain more meaningful results to support the observation.
As discussed in §1, social media platforms can be useful in characterizing public opinion in a geographical locality. Additionally, these platforms can also be used to monitor the effects of information propagation by pairing the information flow with a desirable outcome.
In this paper, we contextualize this methodology to study the relationship between information dissemination and the COVID-19 pandemic spread. Our model puts "lower spread" as the desirable outcome and "high volumes of trends and tweets" as the indicators of effective information dissemination. To that end, we developed a large-scale data collection system to collect historical tweets from the top 20 most affected countries by COVID-19. We perform measurements and modelling on our data to study various data attributes, including the temporal evolution of trends, the most recurring COVID-19 related topics, and the user sentiment towards preventive measures. Our results show that countries with a lower pandemic spread mostly generated a higher volume of COVID-19 related trends and tweets (Table 1, Figure 3). A closer look at the nature of tweets further revealed that the countries with a lower pandemic spread placed more emphasis on the COVID-19 preventive measures (Figure 4, Figure 6). Moreover, we also noticed a variation in sentiment towards the lockdown policy that was implemented to control the spread. In addition to making standalone contributions through a novel dataset and useful observations, our study also provides meaningful answers to the questions raised in §1. First, we indeed noticed variations in the response of different countries to the COVID-19 pandemic, as shown by the 1) volume of trends and tweets and their timeline, 2) recurring topics discussed in those tweets, and 3) sentiments towards preventive measures. Second, we also observed indications to support that awareness through Twitter contributed to influencing the pandemic spread. For that purpose, we outlined a case study to showcase that users in the highly affected countries displayed lower Twitter engagement compared to the less affected countries. Please note that this is not a conclusive statement to suggest that Twitter usage was the dominant factor in influencing the pandemic spread.
However, our data and analysis indicate that Twitter can be useful for this purpose, and are therefore noteworthy.
Future Work. At the time of conducting this study, we did not find a study that precisely analyzed the relationship between Twitter and the spread of COVID-19. However, our methodology is inspired by some notable studies that examined the usefulness of Twitter in characterizing user behavior at scale. We have mentioned them in §1. Concurrent to our work, we have seen a study that analyzed the emergence of Sinophobic behavior due to COVID-19 (Schild et al. 2020). However, our work investigates an entirely different relationship between Twitter and COVID-19. Finally, we believe that our dataset has useful information beyond what is presented in this paper. Keeping that in view, as well as the urgency to extend research on this topic, we will soon open-source our dataset to foster future work.
Figure 6: Word clouds for S1 and S2 after bigram analysis. Notice that for Sweden and Austria, "Social Distancing" was the most dominant term. In contrast, for Spain, "self quarantined" was the most dominant term. Overall, the results are consistent with Figure 3(b), showing that Sweden, Austria, and Belgium generated more COVID-19 tweets compared to the other three countries.
References
[An et al
Information propagation speed and patterns in social networks: A case study analysis of german tweets
News sharing user behaviour on twitter: A comprehensive data collection of news articles and social interactions
Using VERA to explain the impact of social distancing on the spread of COVID-19. CoRR abs/2003.13762.
[COVID-19 2020] COVID-19. 2020. Covid-19 pandemic by country and territory
Twitter watch: Leveraging social media to monitor and predict collective-efficacy of neighborhoods
Scalable news slant measurement using twitter
twawler: A lightweight twitter crawler
Computers and viral diseases. preliminary bioinformatics studies on the design of a synthetic vaccine and a preventative peptidomimetic antagonist against the sars-cov-2 (2019-ncov, COVID-19) coronavirus
Catching up with trends: The changing landscape of political discussions on twitter in
Trump, twitter, and news media responsiveness: A media systems approach
Kernel Density Estimation (KDE) is a well-known non-parametric method for estimating the probability density function of a finite dataset, and it addresses the data smoothing problem. Typically, this is done by graphing the density of the dataset in its domain. For data points $x_1, \dots, x_n$, the formal definition of KDE is given by the following function.
$$\hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)$$
In the function above, K(x) is the smooth and symmetric kernel function (Gaussian in our case), and h (where h > 0) is the smoothing bandwidth. KDE smooths each data point into a small density bump and then sums these bumps to obtain the estimate.
In Figure 6, we have generated the bigram model for the countries discussed in our case study. A bigram model looks one word into the past and predicts the next word. Building on this, Figure 6 shows which two words are most likely to appear together for each country in S1 and S2. Referring to Figure 6, we observe word clouds for S1 and S2 after bigram analysis. The most common term in Sweden and Austria was "social distancing," and similarly, "corona virus" was dominant in Belgium. In the USA, "billion dollar" was dominant along with "homeschooling year." In Spain's word cloud, the most common term was "self quarantined," and in Italy, the most common term was "world news."