key: cord-0556623-k51ydahs
authors: Inuwa-Dutse, Isa
title: Towards Combating Pandemic-related Misinformation in Social Media
date: 2020-11-28
journal: nan
DOI: nan
sha: bb1f2234c5bfae8f43c5da232ab9585fc0f43697
doc_id: 556623
cord_uid: k51ydahs

Conventional preventive measures during a pandemic include social distancing and lockdown. In the age of social media, such measures bring a new set of challenges: vulnerability to the toxic impact of online misinformation is high. A case in point is COVID-19. As the virus propagates, so do the associated misinformation and fake news, leading to an infodemic. Since the outbreak, there has been a surge of studies investigating various aspects of the pandemic. Of interest to this chapter are studies centring on datasets from online social media platforms, where the bulk of the public discourse happens. The main goal is to support the fight against the negative infodemic by (1) contributing a diverse set of curated relevant datasets, (2) offering relevant areas to study using the datasets, and (3) demonstrating how relevant datasets, strategies and state-of-the-art IT tools can be leveraged in managing the pandemic.

Human history is intertwined with various pandemics, that is, infectious diseases at a global scale, resulting in dramatically high mortality rates and economic hardship. Pandemics from diseases such as smallpox, tuberculosis, and the Spanish flu resulted in a large number of lost lives (Kaur, 2020). One of the defining moments of the year 2020 is the outbreak of the zoonotic Coronavirus Disease (COVID-19), which radically disrupted normal social interaction. The outbreak was first reported to the World Health Organisation (WHO) on December 31, 2019, from Wuhan, China. Recent figures from the WHO reported 45,428,731 confirmed cases and 1,185,721 confirmed deaths across 216 countries, areas or territories. It is easy to be oblivious of early warnings despite apparent reasons suggesting otherwise.
When the prevailing pandemic was first reported, many nations neglected to take proactive measures, to the point that the outbreak quickly overwhelmed healthcare facilities, making it difficult to attend to ailing people and causing fatigue among health workers and distress and grief among the families of the sick and the deceased. The scale of spread and impact of the pandemic has prompted many forms of preventive and curative responses. Various approaches have been used to flatten the infection peak, to avoid overwhelming the available healthcare facilities and to alleviate the associated financial challenges. Typical measures to slow down the rate of infection include disinfection, contact tracing, social distancing, isolation/quarantine and some curative measures. Following the traditional approach to mitigating spread, the infamous lockdown measure introduced to curtail the virus has altered many aspects of social routines, and demand for online-based services has skyrocketed. Figure 1 shows a summary of the total number of cases globally 1 . While modern social media networks, such as Facebook and Twitter, facilitate the spread of information to a wide audience, making them useful for instant information updates and socialisation, they also present new sets of challenges. With a substantial proportion of the populace confined to their homes for a long period, vulnerability to the toxic impact of online misinformation is high during the COVID-19 outbreak. There is a growing body of work tackling many problems associated with the outbreak. For instance, concerning the infodemic, researchers have been curating and documenting various datasets about COVID-19. Of interest to this chapter are studies centring on online datasets, especially from social media platforms where the bulk of the public discourse happens, regarding spurious content associated with the pandemic.
Therefore, the main goal of the chapter is to support the fight against online misinformation, with particular emphasis on pandemic-related data. Many aspects of the pandemic can be explored using the data, from leveraging benchmark datasets to assess the veracity of information related to the pandemic (infodemic) to the more advanced task of modelling and tracing the propagation of the virus. Consequently, the chapter will, among other benefits, help in understanding how to ensure that relevant content dominates and irrelevant content is suppressed, especially during critical times of the pandemic. Figure 2 shows a summary of the major events 2 since the outbreak. The chapter is structured as follows: Part I offers relevant background information regarding pandemics, and Part II presents a detailed account of relevant datasets, including the collection and processing steps. Part III proffers some research problems worthy of investigation using the datasets. Finally, Part IV concludes the study with closing remarks.

Traditionally, some of the swift measures taken to mitigate a pandemic include social distancing, lockdown and disinfection. Palliatives are also being provided to cushion the financial hardship brought about by the pandemic. With technological advancements, all the above-mentioned measures can be improved. The focus in this section is to highlight relevant topics that would help in achieving such goals, notably from an online social media perspective. A treatise on contemporary social engagements will be incomplete without reference to online social networks. It can be argued that modern-day social interaction cannot be understood without taking online social relationships into account, where various forms of interaction among diverse users happen. This capability makes it possible to empirically quantify and evaluate social relationships among users at an unprecedented scale.
Essentially, many social network theories and analytical solutions can now be tested using real social media data. Drawing from ethnography, a form of social research concerned with individual culture and group behaviour, netnography is a relatively new term coined to denote the use of online social media platforms to study people's interactions and behaviours (Kozinets, 2007; Pink, 2016). It encompasses aspects of data collection, analysis, research ethics, and representation, rooted in participant observation. Because observational data can be retrieved from online communities or groups, effective aggregation of such data would yield useful insight (Kozinets, 2007). While the architecture of social media networks simplifies the spread of information to a wide audience, it also provides a breeding ground for misleading information. This opens up another front in the fight against the virus. It can be argued that online-based services have never been in higher demand than in the COVID-19-induced lockdown era. One of the implications of the lockdown is the relegation of virtually all human engagements to the online realm, which also results in an increasing number of uncensored posts. Because misleading information can have catastrophic consequences and hamper the application of containment measures, it is pertinent to combat the pandemic on all possible fronts.

The transformative power of technological advancements across various facets of public life is enormous. For instance, communication and interaction among people have been tremendously transformed, especially with the advent of online social media platforms, such as Twitter, Facebook, WhatsApp, TikTok, LinkedIn, Snapchat, Twitch, Pinterest, YouTube and Viber, that facilitate information diffusion and socialisation at scale.
These platforms are quite popular with the public; thus, it is worthwhile understanding the social media ecosystem in great depth. The contemporary social media ecosystem consists of numerous platforms which support various aspects of human social engagement and enable users to simultaneously generate and consume information (Inuwa-Dutse, 2020). Many forms of social interaction are continually evolving to keep a myriad of objects connected through a communication model that enables a multi-flow of information (see the influence network model of Watts and Dodds (2007)), in contrast with the two-step flow model in which a few users mediate communication between the media and the general public (Katz et al., 2017). In addition to serving as a news source, online social media platforms make it possible for users to socialise and engage in all sorts of discussions. The platforms have been instrumental in socialisation, breaking news, globalisation and enabling socio-technological research (Sundaram et al., 2012). In terms of participants and data size, social media networks have profoundly transformed how research is conducted, especially within the social sciences. However, with the majority of the populace confined to their homes for a long period due to the pandemic, vulnerability to the toxic impact of online misinformation and uncensored posts via social media is high. This can be attributed to the increasing demand for online-based services, partly due to the infamous lockdown measure. Prior to the advent of online social media, large collections of data were exclusive to big research facilities such as weather forecast stations, astronomical stations, and scientific laboratories (van Dijk, 2006). Social media networks offer useful utility in understanding modern society and how it functions (Miller et al., 2015).
It was estimated that 2.46 billion users would be connected this year (2020), amounting to one-third of the global population 3 . Owing to the usefulness of the generated data, datafication 4 , the continuous quest to turn every aspect of human life into computerised data for competitive value (Cukier & Mayer-Schoenberger, 2013), is being fuelled by social media to supply commoditised data. Several domains have already recognised the crucial role of such data in improving productivity and gaining competitive advantage (Contractor et al., 2015). The success of social media platforms has led to an increased interest in empirically testing various theories, making the platforms ideal for studying many aspects of social events. Details about how researchers leverage theories, research constructs, and conceptual frameworks in relation to social media can be found in Ngai et al. (2015). Through netnography, researchers can systematically retrieve a huge amount of real-life observational data from different online social media platforms using a standard application programming interface (API) or a custom application.

Twitter and Tweets - A massive amount of data can be easily obtained from platforms such as Twitter. Tweets, usually short text snippets, are the stream of posts users share on Twitter, and they enable longitudinal studies (Würschinger et al., 2016). A tweet object is a complex data structure, expressed in JavaScript Object Notation (JSON) format, consisting of many extractable attributes that describe specific information about the tweet and the account holder (the user). As a marked-up piece of text, the different fields in the tweet object define important characteristics of the tweet. The complexity of a tweet and its unstructured nature make it difficult to use directly; a series of preprocessing steps is required before effective analysis can be conducted 5 .
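As a rough illustration of the tweet object described above, the sketch below parses a small JSON snippet and pulls out a few tweet-level and user-level attributes. The sample object is hypothetical and heavily trimmed; the field names (`id_str`, `created_at`, `text`, `entities`, `user`) are assumed to follow the classic v1.1 tweet layout, and `flatten_tweet` is an illustrative helper, not part of any library or of the chapter's own pipeline.

```python
import json

# Hypothetical, heavily trimmed tweet object for illustration only;
# real objects returned by the API carry many more fields.
RAW_TWEET = """
{
  "id_str": "1",
  "created_at": "Mon Mar 23 10:15:00 +0000 2020",
  "text": "Example post about #lockdown",
  "entities": {"hashtags": [{"text": "lockdown"}]},
  "user": {"screen_name": "example_user", "verified": false, "followers_count": 42}
}
"""


def flatten_tweet(tweet):
    """Pull a few tweet-level and user-level attributes into a flat record."""
    user = tweet.get("user", {})
    hashtags = tweet.get("entities", {}).get("hashtags", [])
    return {
        "tweet_id": tweet.get("id_str"),
        "created_at": tweet.get("created_at"),
        "text": tweet.get("text", ""),
        "hashtags": [h["text"].lower() for h in hashtags],
        "screen_name": user.get("screen_name"),
        "verified": user.get("verified", False),
        "followers": user.get("followers_count", 0),
    }


record = flatten_tweet(json.loads(RAW_TWEET))
print(record["screen_name"], record["hashtags"])
```

Flattening each object in this way yields tabular records that are convenient for descriptive analysis, while the raw JSON can be kept alongside for any fields needed later.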
The stream of tweets differs from conventional text streams in terms of posting rate, dynamism and flexibility; tweets are generated at a rapid rate and tend to be highly dynamic (Guille and Favre, 2015; Chakraborty et al., 2016). Social media networks have transformed the way sociological research is conducted by offering a useful means of understanding modern society and how it functions.

Within a short span of the outbreak, there has been a plethora of COVID-19-related studies covering various aspects of the pandemic. A collection of relevant datasets is central to tackling emerging challenges and is a driving force in the various research efforts aimed at combating the harmful infodemic. To this end, researchers have been curating and documenting various online datasets about COVID-19, especially from social media. Many datasets can be obtained from social media for various purposes related to the pandemic, such as crisis management during the outbreak. It is beyond the scope of this chapter to dwell on the research associated with COVID-19; the emphasis is on relevant datasets that help in combating pandemic-related misinformation. A comprehensive catalogue of COVID-19 datasets, spanning various topics about the pandemic, can be found in the work of Latif et al. (2020). An early report about the outbreak in China is summarised 6 in the work of Wu and McGoogan (2020). Also, Wikipedia projects have been maintaining comprehensive documentation of relevant articles on COVID-19. Some useful collections of social media datasets consisting of tweets can be found in the work of Chen and Ferrara (2020) and Alqurashi et al. (2020). A collection of image-based data (from Instagram) about COVID-19 can be found in Zarei et al. (2020). Using a large collection of diverse datasets from online social media, the infodemics observatory project keeps track of the digital response to the outbreak (Valle et al., 2020).
In conjunction with numerous online social media platforms, the World Health Organisation is working to prevent the spread of misleading information related to the pandemic 7 . Moreover, some social media platforms have put measures in place to prevent potentially inimical content from spreading. For instance, Twitter has introduced a feature for flagging posts, and its dedicated application programming interface can be used to retrieve tweets related to COVID-19 8 . A useful analysis of the impact of COVID-19 and how stakeholders can effectively act can be found in the blog post of Tomas (2020).

5 See https://github.com/ijdutse/covid19-datasets for some relevant information about COVID-19 data preprocessing.
6 The following blog post also offers useful insights about the pandemic: https://medium.com/@tomaspueyo/coronavirus-act-today-or-people-will-die-f4d3d9cd99ca
7 See https://www.who.int/publications/i/item/9789240010314%20/

Since the outbreak of the pandemic, various stakeholders have been actively battling the virus causing it, i.e. SARS-CoV-2. The growing volume of pandemic-related misleading information opens another front in the fight against the virus. Of interest to most researchers, especially within the computer science research community, is the need to neutralise the negative impact of the infodemic associated with the pandemic. To this end, researchers have been curating and documenting useful datasets about COVID-19. This endeavour is crucial for enriching existing ground-truth data that could be used to debunk myths and misinformation around the pandemic. The following section is aimed at improving datasets about COVID-19 according to the data source, its online availability and potential utility.
To support the fight against the spread of misinformation and rumours, the collection consists of three categories of Twitter data, information about standard practices from credible sources, and a chronicle of global situation reports from the WHO 9 . Regarding data from Twitter, a description of how to retrieve the hydrated version of the data and some research problems that could be addressed using the data are given. Figure 3 shows the areas of focus in the fight against the COVID-19 pandemic. The fight against the virus revolves around preventive measures (such as public enlightenment about standard practices to prevent spread) and curative measures. The best approach is to avoid exposing the public to the virus. An infodemic could lead people to consume misleading information that endangers them. Thus, it is crucial to combat the pandemic from all angles using the right set of datasets.

The advent of social media has opened a new window for obtaining a huge number of diverse research datasets across different disciplines: engineering, medicine, sociology, computer science, etc. This section describes how to obtain and curate relevant datasets, notably from Twitter. Table 1 below provides an overview of useful tools for retrieving data from the respective social media platforms 10 .

Tweet-based collection - Platforms such as Twitter offer a useful avenue for retrieving a huge amount of data on a variety of topics using keywords or search terms. Keywords play a central role in identifying the most useful data and relevant stakeholders as the basis for the data collection. The set of datasets presented in this chapter is a response to the growing scepticism, misinformation and myths surrounding the pandemic. Thus, terms that are associated with such myths have been used to collect the data, mostly from accounts that openly dismiss COVID-19-related information put forward by credible sources such as the WHO.
For more effective results, it is helpful to design the collection so that the data can be classified according to whether it comes from dedicated accounts or random accounts via Twitter's API. The collection can also be based on specific hashtags, because a hashtag offers a high-level filter on the collection and helps in data curation. The account-based collection could be from verified or unverified accounts on Twitter, and the random set from a generic collection of daily tweets on diverse topics. These are needed to provide a wider context on the prevailing topic. All the tweet-based collections have been collected using Twitter's Standard Search API. This collection consists of three categories of selected tweets collected from accounts that were monitored for 5 weeks (March 23 to May 13, 2020). Sometimes it is better to retrieve the whole tweet object, a complex data object composed of numerous descriptive fields, instead of selected fields, using tools such as Tweepy (see Table 1), because it enables the extraction of variables that could be used for further analysis.

Non-tweet collection - With the availability of many credible sources debunking misleading narratives about COVID-19, it will be beneficial to have a large collection of curated data. Consequently, the non-tweet group consists of information about standard practices from reliable sources and a chronicle of global situation reports on the pandemic.
Bodies such as the World Health Organisation and nationally recognised institutions will be instrumental in providing rich and informative material for a robust factual analysis. This will enable researchers to find responses to a wide range of questions related to the pandemic for a broader comparison.

Data cleaning - To support effective longitudinal and exploratory analyses of the data, some crucial preprocessing steps are required. Basic forms include tokenisation, stopword removal and text formatting (involving expanding contracted terms and lemmatisation). Once the data cleaning stage is completed, data analysis proceeds with descriptive analysis to understand the data better before delving into the detailed study. Time-series analysis is crucial in revealing interesting patterns and useful insights.

The quest to turn every aspect of human life into computerised data for competitive value is rapidly growing. Depending on the interest and goal of the study, data from social media platforms can be used to conduct studies along the following dimensions: (1) textual or multimedia data analysis, which is motivated by the prevalence of multimedia data (e.g., text, audio, video and graphics) that enables various studies such as content and discourse analyses for many purposes; and (2) graphical analysis, which relies on structural analysis to identify the underlying structure of relationships at various levels of granularity in the social network. Depending on which dimension is chosen, techniques based on machine learning or deep learning can be applied to solve the problem at hand. With data from online social media, there exist many useful theories, constructs, and conceptual frameworks to utilise. The work of Ngai et al. (2015) offers more insight on the subject.
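The basic data-cleaning steps mentioned earlier in this section (tokenisation, stopword removal and expansion of contracted terms) can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions, not the chapter's actual pipeline: the contraction map and stopword list are tiny illustrative placeholders, `clean_tweet` is a hypothetical helper name, and lemmatisation is deliberately left out because it would normally rely on an external NLP library.

```python
import re

# Tiny illustrative resources (assumptions); a real study would use
# fuller lists, for example from an NLP library.
CONTRACTIONS = {"can't": "cannot", "won't": "will not",
                "it's": "it is", "don't": "do not"}
STOPWORDS = {"a", "an", "the", "is", "it", "to", "of", "and",
             "in", "on", "do", "will"}

URL_OR_MENTION = re.compile(r"https?://\S+|@\w+")
CONTRACTION = re.compile(
    r"\b(" + "|".join(re.escape(c) for c in CONTRACTIONS) + r")\b")
TOKEN = re.compile(r"[a-z0-9]+")


def clean_tweet(text):
    """Lowercase, strip URLs and mentions, expand contractions,
    tokenise, and drop stopwords."""
    text = text.lower().replace("\u2019", "'")  # normalise curly apostrophes
    text = URL_OR_MENTION.sub(" ", text)
    text = CONTRACTION.sub(lambda m: CONTRACTIONS[m.group(1)], text)
    tokens = TOKEN.findall(text)  # also drops '#' and punctuation
    return [t for t in tokens if t not in STOPWORDS]


print(clean_tweet("It's said the #lockdown won't end, see https://example.org @who"))
```

Whether hashtags, mentions and negations such as "not" should be kept or dropped depends on the downstream task; for sentiment analysis, for instance, negations usually matter and are retained here.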
Of interest to this study is to highlight areas where data from online social media platforms can be used to manage pandemic-related challenges in this age of hyperconnectivity. Potential problems to be addressed include pandemic outbreak detection and management, pandemic assessment, contingency planning, early detection or alert systems for disease outbreaks based on online social media platforms, and modelling the spread of an outbreak at various levels. One of the reasons why online social media platforms are very popular with the public is the ability of users to simultaneously generate and consume content, leading to various forms of information fads, opinions and breaking news (Inuwa-Dutse et al., 2018). This also contributes to the increasing number of uncensored posts on various social phenomena, partly due to their short size and the speed of communication. Demand for online-based services is at its peak during the lockdown, thus exposing the populace to various vulnerabilities. Among the repercussions of the increasing volume of information (relevant and irrelevant) on the pandemic is the tendency to create a sense of bewilderment among the public concerning what preventive measures to take and which piece of information to believe. As such, it is crucial to understand how misleading online content propagates and to study how to optimise methods that favour the dominance of relevant content over irrelevant content. There exist various misinformation and conspiracy sources capable of misleading the public regarding the COVID-19 pandemic. Despite the measures taken by social media platforms to curtail irrelevant content, many sources of misleading information and rumours still exist. A comprehensive repository of both validated and spurious datasets on the pandemic will facilitate the verification of a given piece of information on the subject.
Because users can share information about virtually anything, social media platforms are ideal for conducting useful studies. For instance, the data can help in deciding what action to take to prevent an outbreak or ramp up containment measures in a given locale. For an area not yet hit by the pandemic, mitigation measures and scenarios can be systematically categorised as pre-outbreak, in-outbreak, and post-outbreak to analyse situations and answer some beneficial questions. As a result, the community will be proactive in handling any eventuality related to the pandemic because the level of preparedness will be improved significantly. Accordingly, with the right data, some of the following crucial analyses, which do not require complex modelling, can be carried out: (1) determine the number of cases (susceptible, infected and dead), (2) analyse the impact of the estimation, and (3) prioritise what course of action to take based on the prevailing situation and identify the most affected areas or groups. In terms of contact tracing, it will be interesting to ask how feasible it would be to trace susceptible cases using social media information. A basic strategy is to utilise self-reported information about a relevant incident, e.g., being in contact with an infected individual. A simple strategy that will go a long way in mitigating the harmful effect of the negative infodemic is for each recipient of a social media post to ascertain the veracity of seemingly problematic or controversial information before amplifying it. Figure 4 shows a visual illustration from the WHO in which a simple verification process prevents further spread. Abiding by this simple illustration will go a long way in curtailing the menace of misleading information, especially during the critical time of the pandemic.

Community detection - Social network analysis is useful in revealing the dynamism of many forms of social relationships at various levels.
In the social science domain, sociometry is a means to measure or study social relationships between people (Wasserman & Faust, 1994). Generally, networks are characterised by a certain degree of organisation in which groups of nodes form tightly connected units, or communities. Communities represent functional entities which reflect the topological relationships between elements of the underlying network (Newman, 2006). Given the levels of resistance and acceptance regarding COVID-19, a high-level clustering of users could potentially unveil the distribution of users for purposes related to management, logistics, and containment of the outbreak.

Sentiment Analysis - Another useful problem to tackle is understanding users' perceptions of measures taken in managing the pandemic. For instance, it will be possible to evaluate lockdown policy to understand users' willingness to comply and the attitudinal change over time.

It is easy to be oblivious of early warnings despite apparent reasons suggesting otherwise. When the prevailing pandemic was first reported, many nations neglected to take proactive measures. A case in point is the various myths associated with COVID-19, which often left the public bewildered concerning what preventive measures to take and which information to believe. The scale of the outbreak quickly overwhelmed healthcare facilities, making it difficult to attend to ailing people and causing fatigue among health workers and distress and grief among the families of the sick and the deceased. This sort of situation, concerning a phenomenon with global impact such as COVID-19, often evokes many questions in the observer's mind. However, using the right data and analysis tools, some of the concerns can be effectively addressed. Online social media is currently one of the most prominent sources of commoditised data, attracting huge attention.
Social media platforms have profoundly transformed social research in terms of participants and size of data. Because users can share information about virtually all aspects of their social life, social media platforms are ideal for studying various aspects of social events. However, the openness of social media also makes it a fertile breeding ground for all forms of narratives to flourish. Thus, it is equally important to fight the actual virus alongside the corresponding negative infodemic. This is needed because misleading information can have catastrophic consequences and hamper the application of containment measures, which makes it pertinent to combat the pandemic on all possible fronts. This chapter contributes to mitigating the impact of inimical online content related to the COVID-19 pandemic. The datasets and insights from the chapter will support studies interested in analysing the spread of fake and misleading content, evaluating lockdown policy and tracking sentiment over time. The data will further enrich existing databases for debunking misinformation and fact-checking avenues, such as the International Fact-Checking Network.

Large arabic twitter dataset on covid-19
Covid-19: The first public coronavirus twitter dataset
Global Health Risk Framework-The Neglected Dimension of Global Security: A Framework to Counter Infectious Disease Crises
Detection of spam-posting accounts on Twitter
A dictionary of epidemiology
A Review: Epidemics and Pandemics in Human History
Netnography. The Blackwell Encyclopedia of Sociology
Leveraging Data Science To Combat COVID-19: A Comprehensive Review
Social media research: Theories, constructs, and conceptual frameworks
Digital ethnography. Innovative methods in media and communication research
Coronavirus: Why You Must Act Now
Characteristics of and important lessons from the coronavirus disease 2019 (COVID-19) outbreak in China: summary of a report of 72 314 cases from the Chinese Center for Disease Control and Prevention

Resources for combating misinformation:
• Guidelines from WHO to help towards managing the COVID-19 infodemic
• Tracker for Infodemic Management Activities, WHO Framework for Managing Infodemics in Health Emergencies

Covid-19 stream
Updates on COVID-19 diagnosis and treatment
For an in-depth analysis of the impact of COVID-19, see the blog post of Tomas (2020), which presents useful illustrations, data and models covering various aspects of the pandemic.
Worldometer: Keeps track of COVID-19 global cases