key: cord-0642694-f48v71eg
authors: Lin, Binbin; Zou, Lei; Duffield, Nick; Mostafavi, Ali; Cai, Heng; Zhou, Bing; Tao, Jian; Yang, Mingzheng; Mandal, Debayan; Abedin, Joynal
title: Revealing the Global Linguistic and Geographical Disparities of Public Awareness to Covid-19 Outbreak through Social Media
date: 2021-10-29
journal: nan
DOI: nan
sha: f0464cb9ea2399216519daac4255996eb1e95cba
doc_id: 642694
cord_uid: f48v71eg

Covid-19 has presented an unprecedented challenge to public health worldwide. However, residents in different countries showed diverse levels of Covid-19 awareness during the outbreak and suffered from uneven health impacts. This study analyzed global Twitter data from January 1st to June 30th, 2020, seeking to answer two research questions. What are the linguistic and geographical disparities of public awareness in the Covid-19 outbreak period reflected on social media? Can the changing pandemic awareness predict the Covid-19 outbreak? We established a Twitter data mining framework that calculates the Ratio index to quantify and track the awareness. The lag correlations between awareness and health impacts were examined at global and country levels. Results show that users presenting the highest Covid-19 awareness were mainly those tweeting in the official languages of India and Bangladesh. Asian countries showed more significant disparities in awareness than European countries, and awareness in the eastern part of Europe was higher than in central Europe. Finally, the Ratio index could accurately predict the global mortality rate, global case fatality ratio, and country-level mortality rate, with 21-30, 35-42, and 17 leading days, respectively. This study yields timely insights into social media use in understanding human behaviors for public health research.

The outbreak of the novel coronavirus, known as Covid-19, has profoundly impacted human society.
By the end of 2020, Covid-19 had infected more than 83.48 million people and caused nearly 1.82 million deaths in 191 countries and regions (Dong, Du, and Gardner 2020). Since the Covid-19 outbreak, governments worldwide have implemented several measures requiring or suggesting that residents wear masks, keep social distance, or stay at home to control the spread of the coronavirus. However, residents in different countries showed diverse levels of awareness of Covid-19 and relevant policies during the outbreak and suffered from uneven health impacts, including unequal infection, fatality, and recovery rates (Gollust et al. 2020; McCaffery et al. 2020; Hu et al. 2020; Saad, Hassan, and Zaffar 2020). Whether the changes and disparities in public awareness led to different responding behaviors and thus affected the pandemic's health impacts is unknown and needs to be investigated. However, continuous long-term and near-real-time data describing disparities in public awareness and responses cannot be obtained through traditional survey methods, especially during the pandemic. With the development of Web 2.0 and GNSS-enabled portable devices, social media platforms, e.g., Twitter, Facebook, and Instagram, have become increasingly popular worldwide for sharing feelings and discussing 'what's happening'. Data collected from such platforms provide an emerging channel to observe real-time human responses to different topics and events (Zou et al. 2018). During Covid-19, many people's social lives have shifted from in-person to online to stay connected while maintaining social distancing, and they spent more time sharing their experiences, concerns, and feelings toward topics relevant to Covid-19 on social media (Alqurashi, Alhindi, and Alanazi 2020; Chen, Lerman, and Ferrara 2020; Lopez, Vasu, and Gallemore 2020). As a result, the extensive social media data generated during the pandemic offer an innovative opportunity to observe public reactions to Covid-19 in near real-time.
Nevertheless, studying human behaviors at different locations during the pandemic and drawing scientific conclusions from social media data are challenging. First, social media data contain a large amount of noisy information irrelevant to Covid-19. It is difficult to accurately collect pandemic-related messages from the big social media database. Second, many social media platforms support multiple languages, and identifying and analyzing coronavirus-related messages in different languages necessitates advanced natural language processing (NLP) models (Lopez, Vasu, and Gallemore 2020). Third, social media users are unevenly distributed across space, and only a small proportion of the generated data (around 1-2%) contains precise locations (Graham, Hale, and Gaffney 2013). Social media data need to be associated with geographic contexts and normalized through preprocessing to enable spatial and temporal analytics.

This study analyzed global Twitter data, referred to as tweets, from January 1st to June 30th, 2020, when Covid-19 developed from a regional epidemic into a pandemic causing a global health crisis. The overarching research questions are: What are the linguistic and geographical disparities in public awareness of the Covid-19 outbreak on social media? Can the changing public awareness on social media predict the pandemic's outbreak? To address the research questions, three objectives are proposed and achieved: (1) to establish a social media data mining framework tracking the public awareness of Covid-19 by languages and regions; (2) to quantify disparities of awareness toward the pandemic at multiple spatial and temporal scales; and (3) to examine the lag correlations between Covid-19 awareness and health impacts globally and regionally. One hypothesis is tested: social media-derived public awareness can predict the pandemic's outbreak at both global and regional scales.
The results can inform governments to mitigate risks from current and future epidemics through social media data analysis.

Social media data contain abundant attributes, e.g., time, geographical information, contents, and relationships among users, providing a new perspective for understanding human behaviors in spatial, temporal, contextual, and network dimensions. Since the emergence of social media, researchers have applied data collected from such platforms to address multiple health-related issues, e.g., obesity, depression, insomnia, and air pollution (Choudhury et al. 2013; Gore, Diallo, and Padilla 2015; McIver et al. 2015; Chen et al. 2017; Sun et al. 2018; Gao et al. 2020). Sun et al. (2018) collected obesity data from the Gallup Healthways Wellbeing Survey and the U.S. Centers for Disease Control and Prevention (CDC) and 41 million tweets from 110 major cities in the USA during 2012-2013. They proposed an obesity estimation method that monitors users' dietary habits, physical activities, emotions, and self-consciousness on Twitter, demonstrating that user activities on online social networks could help evaluate the obesity rate in urban areas. Chen et al. (2017) developed a forecasting model for smog-related hazards by creating a Smog Severity Index and a social media diffusion factor based on microblog data. The model-predicted smog severity was consistent with the measured values from ground and satellite sensors, implying the potential to monitor health-related disaster threats and impacts through social media platforms. McIver et al. (2015) identified a list of insomniacs based on their Twitter messages and investigated their Twitter use behaviors. The results show that insomniacs had fewer followers, expressed lower sentiments, and were less active on social networks compared with other users. Social media data mining also has great potential in addressing mental health issues.
One study found that integrating social media users' activities, e.g., social engagement, linguistic styles, networks, and emotions, could characterize and forecast individual-level depression (Choudhury et al. 2013). In addition to the above case studies focusing on specific health conditions, Paul and Dredze (2014) built a topic model called the Ailment Topic Aspect Model (ATAM) to automatically discover health-relevant topics on Twitter without human supervision or a priori knowledge. Culotta (2014) created 160 indexes based on Twitter data and performed regression analysis to predict the county-level statistics of 27 types of health conditions (e.g., obesity, teen births, and diabetes) in 100 counties in the United States. The results show that the predictive models incorporating Twitter-derived variables have higher accuracy for surveying county-level health conditions compared to the models based on traditional questionnaires.

Since the beginning of 2020, researchers from different fields have made enormous efforts to collect and mine social media data during Covid-19 to understand human behaviors during the pandemic and combat it. For instance, several studies collected pandemic-related social media data and shared them through open-source archives (Alqurashi, Alhindi, and Alanazi 2020; Chen, Lerman, and Ferrara 2020; Lopez, Vasu, and Gallemore 2020). Rufai and Bunce (2020) pointed out that Twitter empowered world leaders to exchange Covid-19 information with citizens rapidly. La et al. (2020) analyzed the official Covid-19 news from online newspapers and concluded that immediate policies and supportive public responses are the key epidemic control measures. Sentiment analysis and surveillance of epidemic-related topics are essential for understanding social activities during the pandemic (Lwin et al. 2020; Nemes and Kiss 2020; Zhu et al. 2020).
Leveraging social media data, Lwin et al. (2020) examined worldwide trends of four emotions: fear, anger, sadness, and joy. Their results show that public feelings shifted strongly from fear to anger from January 28th to April 9th, 2020. Predicting the regional epidemic outbreak from social media activities has also proven feasible in several investigations (Jahanbin and Rahmanian 2020; Li et al. 2020; Qin et al. 2020). Specifically, Li et al. (2020) found that the Covid-19 discussion peak on social media occurred 10-14 days earlier than the peak of daily incidences in China. Online social networks are also effective platforms for disseminating rumors or conspiracy theories (Allington et al. 2020; Gruzd and Mai 2020; Tasnim, Hossain, and Mazumder 2020). Shahsavari et al. (2020) investigated the narrative frameworks of fabricating and broadcasting Covid-19 conspiracies to monitor those messages in near real-time. Although the existing studies generate innovative methods and valuable knowledge in using social media for Covid-19 research, some limitations exist. For example, Covid-19-related social media data analysis across regions is needed to address the inequalities of public awareness and health impacts. Examining long-term Covid-19 social media activities is also critical because Covid-19 is a long-standing epidemic and short-term analysis is unable to capture how the public perception of Covid-19 changes over time.

This research collected social media data from Twitter, one of the most popular social media and network platforms with about 340 million users in more than 150 countries (Ahlgren 2021). Approximately 500 million tweets were published per day in various languages in 2020 (Ahlgren 2021), enabling the tracking of long-term global awareness of Covid-19.
We collected Twitter data from January 1st to June 30th, 2020 from the Internet Archive (https://archive.org/), a free online data library that stores around 1 percent of the whole Twitter database. Tweets are encoded in JavaScript Object Notation (JSON) format and represented as a list of name-value pairs. Figure 1 shows the Twitter data collection and preprocessing workflow. Three attributes were used in the subsequent analysis: created_at, text, and lang, representing each tweet's timestamp, content, and language, respectively. The first step was identifying Covid-19-related tweets. We selected eight keywords, i.e., covid, virus, corona, ncov, n95, pandemic, pneumonia, and quarantine, based on an overview of previous literature to search for Covid-19-relevant tweets (Alqurashi, Alhindi, and Alanazi 2020; Chen, Lerman, and Ferrara 2020; Qin et al. 2020). Considering that Twitter supports multiple languages, we translated the eight keywords into different languages using the Google Translate Application Programming Interface (API) to collect the global Covid-19 discussion on Twitter. A total of 65 distinct languages (including English) were detected in the collected Twitter data, and the Google Translate API supports 61 of them. Any tweet containing at least one of the eight keywords in one of the 61 languages was identified as Covid-19-related. The second step was to determine the location from which each tweet was sent. There are three common metadata sources to geo-reference tweets: geotagged location, user profile address, and content-mentioned place. However, each source has its limitations for this study. First, less than 2% of the Twitter data contain geotagged locations (Zou et al. 2018), and the proportion of geotagged Covid-19 tweets could be even lower, which makes over 98% of the tweets unusable.
Due to the low use of precise geotagging and the increasing concern for users' privacy, Twitter has gradually removed the function of attaching point coordinates to tweets since June 2019. Second, a high percentage of tweets (40% to 60%) can be linked to county-level locations using user profile addresses. However, this requires the use of geocoding services (e.g., Google or OpenStreetMap Geocoding API) for toponym resolution, which is a time-consuming and computationally intensive task when processing a large amount of data and introduces geocoding uncertainty (Zou et al. 2019; Wang et al. 2021). Third, deriving locations from the places mentioned in tweet content is unreliable because places mentioned in tweet contents (tweet about) do not necessarily reflect the locations of Twitter users (tweet from). Therefore, this study leveraged an alternative metadata source to associate each tweet with a country or a region: tweet language, which Twitter automatically detected and provided in the collected tweets. Users in each country are more likely to post messages on Twitter using their country's official/primary language (Mocanu et al. 2013; Zola, Cortez, and Carpita 2019). Thus, matching the tweet language to the official/primary language of each country could indicate Twitter users' native locations and cultural backgrounds. For instance, a tweet written in Thai is considered to be sent from Thailand since it is the only country that uses Thai as its official language. Among the 61 selected languages, 29 can be paired with a single country (Figure 2) based on the language-country matching list in Table A1. This language-country matching method has been applied in a previous investigation with an overall accuracy of 53% (Zubiaga et al. 2017). We validated the language-country matching approach by calculating the accuracy of language-detected countries in geotagged tweets. The results are summarized in section 4.1.
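The two preprocessing steps described above (keyword filtering and language-country matching) can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that keyword matching is a case-insensitive substring test, that only the eight English keywords are used (the translated variants are omitted), and that the Ratio index is the percentage of Covid-19-related tweets among all tweets of a country, which is consistent with the percentages reported in this study. The names `KEYWORDS`, `LANG_TO_COUNTRY`, `is_covid_related`, and `ratio_index_by_country` are hypothetical, and the language table is a small illustrative excerpt rather than the contents of Table A1.

```python
import json
from collections import Counter

# The eight English keywords listed in the text (translations omitted in this sketch).
KEYWORDS = ("covid", "virus", "corona", "ncov", "n95",
            "pandemic", "pneumonia", "quarantine")

# Hypothetical excerpt of a language-to-country table, keyed by the tweet's
# `lang` code. Only Thai -> Thailand is stated explicitly in the text; the
# other entries are illustrative assumptions.
LANG_TO_COUNTRY = {"th": "Thailand", "is": "Iceland", "ne": "Nepal"}


def is_covid_related(text, keywords=KEYWORDS):
    """Return True if the text contains at least one keyword (case-insensitive)."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in keywords)


def ratio_index_by_country(json_lines):
    """Return {country: percentage of Covid-19-related tweets}.

    Each element of `json_lines` is one tweet encoded as a JSON string with
    at least the `text` and `lang` attributes. Tweets whose language is not
    paired with a single country are skipped.
    """
    total = Counter()
    related = Counter()
    for line in json_lines:
        tweet = json.loads(line)
        country = LANG_TO_COUNTRY.get(tweet.get("lang"))
        if country is None:
            continue  # language not matched to a single country
        total[country] += 1
        if is_covid_related(tweet.get("text", "")):
            related[country] += 1
    return {country: 100.0 * related[country] / total[country]
            for country in total}
```

Grouping the same counters by the date parsed from `created_at` instead of by country would give a daily version of the index. A real pipeline would also need the translated keyword lists and, for some languages, token-aware matching rather than plain substring tests.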
(1)

We chose three indicators (case fatality ratio, case rate, and mortality rate) to

The public awareness was aggregated and evaluated by different languages and at multiple spatial-temporal scales to examine its relationship with the

In 19 health impact indicators were tested at the country level to examine the predictability of public awareness changes on the regional Covid-19 outbreak.

This study identified a total of 10,339,830 (2.12%) out of over 488 million tweets as Covid-19-related (Table 1).

The 29 countries with unique official/primary languages were included in the country-level spatial-temporal analysis. Figure 7 shows the spatial patterns of the Ratio index at the country level from January 1st to June 30th, 2020. The values ranged from 0.24% to 4.62%, and the global average was 2.12%. In general, Nepal and Thailand from southeastern Asia had the highest Twitter-derived public awareness of Covid-19. The public awareness on Twitter in the eastern part of Europe was also higher than the awareness in other countries. Iceland showed the lowest Ratio index. The Ratio index in the selected Asian countries ranged from 0.48% to 4.62% with a standard deviation of 1.31%, showing a larger disparity than the country-level Ratio index values in Europe (range: 0.24% to 3.89%, standard deviation: 0.98%).

The daily Ratio index showed uneven abilities in predicting case rate, mortality rate, and case fatality ratio (

insufficient Twitter data to evaluate the public awareness levels across time.

The completion of this research yields several significant implications. First, this research demonstrated that the language-country matching method could overcome the limitation of insufficient social media data with geotagged locations and efficiently geo-reference big Twitter data for large-scale analysis with an accuracy of 77%.
The method works best for countries with unique official languages, offering a novel and rapid channel for observing and comparing social media activities, e.g., public awareness toward different topics or events, among those countries. The generated datasets provide baseline information on Covid-19 awareness globally and by language and country. Second, this study found that the daily changes of public awareness reflected on Twitter had strong correlations with the daily mortality rate and case fatality ratio, supporting the ability of near-real-time Twitter data to predict the Covid-19 outbreak. This finding is consistent with prior investigations that applied social media to detect and forecast the outbreak of other infectious diseases, such as influenza (Signorini, Segre, and Polgreen 2011; Zadeh et al. 2019) and cholera (Chunara, Andrews, and Brownstein 2012). Therefore, monitoring continuous social media activities could help forecast future outbreaks of Covid-19 and its variants, and inform governments and communities to plan accordingly. Finally, this study further showed that the prediction abilities of social media data for infectious disease outbreaks differed across countries. This is in line with the analysis in Allen et al. (2016), which measured the correlation between Twitter rates and the official reports of influenza-like illness (ILI) in 31 major cities in the United States during the 2013-2014 flu season. Our results confirm that the spatial variability of disease detection performance based on social media data needs to be considered and mitigated in future work to accurately predict the regional outbreak of infectious diseases and develop response strategies. A few limitations exist in this investigation and necessitate further research.
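The lag correlation analysis behind the second implication above can be illustrated with a short Python sketch. It is not the authors' code. It assumes two equally long daily series (the Ratio index and a health impact indicator such as the mortality rate), uses the Pearson coefficient (this text does not state which coefficient was used), and takes the lead time as the lag with the highest positive correlation. The function names `pearson`, `lag_correlations`, and `best_lead_days` are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long sequences (NaN if one is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / sqrt(var_x * var_y)


def lag_correlations(awareness, impact, max_lag):
    """Correlate awareness on day t with impact on day t + lag, for lag = 0..max_lag.

    Both series must have the same length and share the same starting day.
    """
    results = {}
    for lag in range(max_lag + 1):
        x = awareness[: len(awareness) - lag]  # drop the last `lag` days
        y = impact[lag:]                       # drop the first `lag` days
        results[lag] = pearson(x, y)
    return results


def best_lead_days(awareness, impact, max_lag):
    """Return the lag (in days) with the highest correlation, ignoring NaN values."""
    correlations = lag_correlations(awareness, impact, max_lag)
    valid = {lag: r for lag, r in correlations.items() if r == r}  # NaN != NaN
    return max(valid, key=valid.get)
```

With daily series covering the study period, `best_lead_days(ratio_index, mortality_rate, 45)` would return the number of days by which awareness leads the indicator under these assumptions. Because each additional lag shortens the overlapping window, correlations at large lags rest on fewer observations and should be read with more caution.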
First, the language-country matching method is unable to capture social media activities in countries speaking leading languages of international discourse, such as English and Spanish, or multilingual countries listed in Table A3. This limitation can be resolved by incorporating the locations obtained from the geotags, users' profiles, and tweet contents in future research.

Applying GIS and Machine Learning Methods to Twitter Data for Multiscale Surveillance of Influenza
Health-Protective Behavior, Social Media Usage and Conspiracy Belief during the COVID-19 Public Health Emergency
Large Arabic Twitter Dataset on COVID-19
Twitter Catches The Flu: Detecting Influenza Epidemics Using Twitter
Tracking Social Media Discourse About the COVID-19 Pandemic: Development of a Public Coronavirus Twitter Data Set
Forecasting Smog-Related Health Hazard Based on Social Media and Physical Sensor
In the Eyes of the Beholder: Analyzing Social Media Use of Neutral and Controversial Terms for COVID-19
Predicting Depression via Social Media
Social and News Media Enable Estimation of Epidemiological Patterns Early in the 2010 Haitian Cholera Outbreak
Detecting Influenza Outbreaks by Analyzing Twitter Messages
Estimating County Health Statistics with Twitter
An Interactive Web-Based Dashboard to Track COVID-19 in Real Time
Mental Health Problems and Social Media Exposure during COVID-19 Outbreak
Americans' Perceptions of Disparities in COVID-19 Mortality: Results from a Nationally-Representative Survey
You Are What You Tweet: Connecting the Geographic Variation in America's Obesity Rate to Twitter Content
Where in the World Are You? Geolocation and Language Identification in Twitter
Going Viral: How a Single Tweet Spawned a COVID-19 Conspiracy Theory on Twitter
Social Media for Nowcasting Flu Activity: Spatio-Temporal Big Data Analysis
Prediction of Infectious Disease Spread Using Twitter: A Case of Influenza
More Effective Strategies Are Required to Strengthen Public Awareness of COVID-19: Evidence from Google Trends
ILO Monitor: COVID-19 and the World of Work
Forecasting Word Model: Twitter-Based Influenza Surveillance and Prediction
Using Twitter and Web News Mining to Predict COVID-19 Outbreak
Policy Response, Social Media and Science Journalism for the Sustainability of the Public Health System Amid the COVID-19 Outbreak: The Vietnam Lessons
Retrospective Analysis of the Possibility of Predicting the COVID-19 Outbreak from Internet Searches and Social Media Data, China, 2020
Early Stage Influenza Detection from Twitter
Understanding the perception of covid-19 policies by mining a multilanguage twitter dataset
Global Sentiments Surrounding the COVID-19 Pandemic on Twitter: Analysis of Twitter Trends
Disparities in COVID-19 Related Knowledge, Attitudes, Beliefs and Behaviors by Health Literacy
Characterizing Sleep Issues Using Twitter
The Twitter of Babel: Mapping World Languages through Microblogging Platforms
Social Media Sentiment Analysis Based on COVID-19
Discovering Health Topics in Social Media Using Topic Models
Prediction of Number of Cases of 2019 Novel Coronavirus (COVID-19) Using Social Media Search Index
World Leaders' Usage of Twitter in Response to the COVID-19 Pandemic: A Content Analysis
Towards Characterizing COVID-19 Awareness on Twitter
Conspiracy in the Time of Corona: Automatic Detection of Covid-19 Conspiracy Theories in Social Media and the News
The Use of Twitter to Track Levels of Disease Activity and Public Concern in the U.S. during the Influenza A H1N1 Pandemic
Local Spatial Obesity Analysis and Estimation Using Online Social Network Sensors
Impact of Rumors and Misinformation on COVID-19 in Social Media
Twitter Use in Hurricane Isaac and Its Implications for Disaster Resilience
Regional Influenza Prediction with Sampling Twitter Data and PDE Model
Analysis of Spatiotemporal Characteristics of Big Data on Social Media Sentiment with COVID-19 Epidemic Topics
Twitter User Geolocation Using Web Country Noun Searches
Mining Twitter Data for Improved Understanding of Disaster Resilience
Social and Geographical Disparities in Twitter Use during Hurricane Harvey
Towards Real-Time, Country-Level Location Classification of Worldwide Tweets

Greek    Greece/Cyprus
Bengali  Bangladesh/India

This article is based on work supported by the Texas A&M Institute of Data Science (TAMIDS) under the Data Resource Development Program. The statements, findings, and conclusions are those of the authors and do not necessarily reflect the views of the funding agency.

The data used in this research were derived from the following resources available in the public domain: Internet Archive (https://archive.org/details/twitterstream?sort=publicdate), United Nations (http://data.un.org/), and COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University (https://github.com/CSSEGISandData/COVID-19).

No potential conflict of interest was reported by the authors.