key: cord-0614200-f7i0sbwt
authors: Pastor-Escuredo, David; Tarazona, Carlota
title: Characterizing information leaders in Twitter during COVID-19 crisis
date: 2020-05-14
journal: nan
DOI: nan
sha: 93b89d4fa288a9cfbaab794a8e2d42606c291d04
doc_id: 614200
cord_uid: f7i0sbwt

Information is key during a crisis such as the current COVID-19 pandemic, as it greatly shapes people's opinion, behaviour and even their psychological state. The Secretary-General of the United Nations has acknowledged that the infodemic of misinformation is an important secondary crisis produced by the pandemic. Infodemics can amplify the real negative consequences of the pandemic in different dimensions: social, economic and even sanitary. For instance, infodemics can lead to hatred between population groups, which fragments society and influences its response, or can result in negative habits that help the pandemic propagate. On the contrary, reliable and trustworthy information, along with messages of hope and solidarity, can be used to control the pandemic, build safety nets and help promote resilience and antifragility. We propose a framework to characterize leaders in Twitter based on the analysis of the social graph derived from the activity in this social network. Centrality metrics are used to identify relevant nodes, which are further characterized in terms of user parameters managed by Twitter. We then assess the resulting topology of clusters of leaders. Although this tool may be used for surveillance of individuals, we propose it as the basis for a constructive application to empower users with a positive influence on the collective behaviour of the network and the propagation of information.

Misinformation and fake news are a recurrent problem of our digital era [1] [2] [3]. The volume of misinformation and its impact grow during large events, crises and hazards [4]. When misinformation turns into a systemic pattern it becomes an infodemic [5, 6].
Infodemics are especially frequent in social networks, which are distributed systems of information generation and spreading. For this to happen, content is not the only variable: the structure of the social network and the behavior of relevant people also contribute greatly [6]. During a crisis such as the current COVID-19 pandemic, information is key as it greatly shapes people's opinion, behaviour and even their psychological state [7] [8] [9]. However, the greater the impact the greater the risk [10]. The Secretary-General of the United Nations has acknowledged that the infodemic of misinformation is an important secondary crisis produced by the pandemic. During a crisis, time is critical, so people need to be informed at the right time [11, 12]. Furthermore, information during a crisis leads to action, so the population needs to be properly informed to act right [13]. Thus, infodemics can amplify the real negative consequences of the pandemic in different dimensions: social, economic and even sanitary. For instance, infodemics can lead to hatred between population groups [14], which fragments society and influences its response, or can result in negative habits that help the pandemic propagate. On the contrary, reliable and trustworthy information, along with messages of hope and solidarity, can be used to control the pandemic, build safety nets and help promote resilience and antifragility.

Author affiliations: 1 Center of Innovation and Technology for Development, Technical University Madrid, Spain; 2 LifeD Lab, Madrid, Spain.

To fight misinformation and hate speech, content-based filtering is the most common approach taken [6, 15, 16, 17]. The availability of Deep Learning tools makes this task easier and scalable [18] [19] [20]. Also, positioning in search engines is key to ensure that misinformation does not dominate the most relevant results of the searches.
However, in social media, besides content, people's individual behavior and the network's properties, dynamics and topology are other relevant factors that determine the spread of information through the network [21] [22] [23]. We propose a framework to characterize leaders in Twitter based on the analysis of the social graph derived from the activity in this social network [24]. Centrality metrics are used to identify relevant nodes, which are further characterized in terms of user parameters managed by Twitter [25] [26] [27] [28] [29]. Although this tool may be used for surveillance of individuals, we propose it as the basis for a constructive application to empower users with a positive influence on the collective behaviour of the network and the propagation of information [27, 30].

Tweets were retrieved using the real-time streaming API of Twitter. Two concurrent filters were used for the streaming: location and keywords. Location was restricted to a bounding box enclosing the city of Madrid [-3.7475842804

Each tweet was analyzed to extract mentioned users, retweeted users, quoted users or replied users. For each of these events, the corresponding nodes were added to an undirected graph together with a corresponding edge, initializing the edge property "flow". If the edge already existed, its "flow" property was incremented. This procedure was repeated for each tweet registered. The network was completed by adding the property "inverse flow", that is 1/flow, to each edge. The resulting network featured 107544 nodes and 116855 edges.

To compute centrality metrics, the network described above was filtered. First, users with a node degree (number of edges connected to the node) less than a given threshold (experimentally set to 3) were removed from the network, together with the edges connected to those nodes.
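The graph construction and degree filtering described above can be sketched as follows. This is a minimal illustration using NetworkX, not the authors' code. The function names, the `(author, targets)` input layout, the initial flow value of 1 and the single-pass (non-iterative) degree filter are all assumptions made for the example.

```python
import networkx as nx


def build_interaction_graph(tweets):
    """Build an undirected graph with "flow" and "inverse_flow" on each edge.

    `tweets` is a hypothetical iterable of (author, targets) pairs, where
    `targets` lists the users mentioned, retweeted, quoted or replied to.
    """
    graph = nx.Graph()
    for author, targets in tweets:
        for target in targets:
            if graph.has_edge(author, target):
                # Edge already exists: increment its flow.
                graph[author][target]["flow"] += 1
            else:
                # Assumption: a new edge starts with flow = 1.
                graph.add_edge(author, target, flow=1)
    # Complete the network with inverse flow = 1 / flow.
    for _, _, data in graph.edges(data=True):
        data["inverse_flow"] = 1.0 / data["flow"]
    return graph


def filter_by_degree(graph, min_degree=3):
    """Keep nodes whose degree is at least `min_degree` (single pass)."""
    keep = [node for node, degree in graph.degree() if degree >= min_degree]
    # The induced subgraph also drops every edge touching a removed node.
    return graph.subgraph(keep).copy()


# Toy input: each tuple is (tweet author, users the tweet interacts with).
tweets = [
    ("a", ["b", "c", "d"]),
    ("b", ["a", "c"]),
    ("c", ["d"]),
    ("e", ["a"]),
]
graph = build_interaction_graph(tweets)
filtered = filter_by_degree(graph, min_degree=3)
```

In this toy input, users "a" and "b" interact in both directions, so their shared edge accumulates a flow of 2. Because the filter is applied once, nodes that survive may end up with fewer than three neighbours in the filtered graph; whether the original study re-applied the threshold is not stated.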
The reason for this filtering was twofold: to reduce computation cost, as algorithms for centrality metrics are computationally expensive, and to remove poorly connected nodes, as the network is built from sparse data (retweets, mentions and quotes). However, it is desirable to minimize the amount of filtering performed in order to study large-scale properties within the network. The resulting network featured 15845 nodes and 26837 edges. Additionally, the network was filtered to be connected, which is a requirement for the computation of several of the centrality metrics described below. For this purpose, the connected subnetworks were identified and the largest one was selected as the target network for analysis. The resulting network featured 12006 nodes and 25316 edges.

Several centrality metrics were computed: cfbetweenness, betweenness, closeness, cfcloseness, eigenvalue, degree and load. Each of these centrality metrics highlights a specific relevance property of a node with regard to the whole flow through the network. Explanations of the descriptors are summarized in Table 1. Besides the network-based metrics, Twitter user parameters were collected (followers, following and favorites) so that their relationships with the relevance metrics could be assessed.

We applied several statistical tools to characterize users in terms of the relevance metrics. We also implemented visualizations of different variables and of the network for a better understanding of the characterization and topology of leading nodes.

We compared the relevance in the network derived from the centrality metrics with the user profile variables of Twitter: number of followers, number of following and retweet count. Figure 1 shows a scatter plot matrix among all variables. The main diagonal of the figure shows the distribution of each variable; these distributions are typically characterized by a high concentration at low values and a very long tail.
These distributions imply that a few nodes concentrate most of the relevance within the network. More surprisingly, the same distributions are observed for Twitter user parameters such as the number of followers or friends (following).

The load centrality of a node is the fraction of all shortest paths that pass through that node. Load centrality is slightly different from betweenness.

The scatter plots show that there is no significant correlation between variables except for the pair of betweenness and load centralities, as expected, because they have similar definitions. This is remarkable: different centrality metrics provide different perspectives on the leading nodes within the network, and relevance does not necessarily correlate with the number of related users but also depends on the content dynamics.

Users were ranked using one variable as the reference. Figure 2 shows the ranking resulting from using the eigenvalue centrality as the reference. The values were saturated to the 95th percentile of the distribution to improve visualization and avoid the effect of single values far out of range. This visualization confirms the lack of correlation between variables and the highly asymmetric distribution of the descriptors. Figure 3 summarizes the values of each leader for each descriptor, showing that even within the top-ranked leaders there is very large variability. This means that some nodes are singular events within the network that require further analysis to be interpreted, as they could be leaders in society or just a product of the network dynamics. Figure 4 shows the ranking resulting from using current flow betweenness centrality as the reference. In this case, the distribution of the reference variable is smoother and shows a more gradual behavior of leaders.
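The selection of the largest connected network, the centrality metrics and the percentile-based ranking described above can be sketched with NetworkX as follows. This is an illustrative sketch, not the authors' implementation. It assumes that "eigenvalue" centrality corresponds to NetworkX's eigenvector centrality, that shortest-path metrics use "inverse_flow" as distance while current-flow and eigenvector metrics use "flow" as edge strength, and that SciPy and NumPy are installed (the current-flow functions need SciPy). The helper names and the toy graph are hypothetical.

```python
import networkx as nx
import numpy as np


def largest_component(graph):
    """Return the largest connected component as an independent graph."""
    nodes = max(nx.connected_components(graph), key=len)
    return graph.subgraph(nodes).copy()


def compute_centralities(graph):
    """Map metric name -> {node: value}. The weight choices are assumptions."""
    return {
        "cfbetweenness": nx.current_flow_betweenness_centrality(graph, weight="flow"),
        "betweenness": nx.betweenness_centrality(graph, weight="inverse_flow"),
        "closeness": nx.closeness_centrality(graph, distance="inverse_flow"),
        "cfcloseness": nx.current_flow_closeness_centrality(graph, weight="flow"),
        "eigenvalue": nx.eigenvector_centrality(graph, max_iter=1000, weight="flow"),
        "degree": nx.degree_centrality(graph),
        "load": nx.load_centrality(graph, weight="inverse_flow"),
    }


def saturate(values, percentile=95):
    """Clip values from above at the given percentile (for display only)."""
    cap = np.percentile(list(values.values()), percentile)
    return {node: min(value, cap) for node, value in values.items()}


def rank_nodes(metrics, reference):
    """Return nodes sorted by the reference metric, highest first."""
    scores = metrics[reference]
    return sorted(scores, key=scores.get, reverse=True)


# Toy graph: one four-node component around "hub" plus an isolated pair.
toy = nx.Graph()
toy.add_edge("hub", "a", flow=3)
toy.add_edge("hub", "b", flow=2)
toy.add_edge("hub", "c", flow=1)
toy.add_edge("a", "b", flow=1)
toy.add_edge("x", "y", flow=1)
for _, _, data in toy.edges(data=True):
    data["inverse_flow"] = 1.0 / data["flow"]

core = largest_component(toy)
metrics = compute_centralities(core)
ranking = rank_nodes(metrics, "eigenvalue")
clipped = saturate(metrics["degree"])
```

Changing the `reference` argument of `rank_nodes` (for example to "cfbetweenness") reproduces the idea of ranking users by a different descriptor. Saturation is applied only to the values shown, so it does not alter the ranking order of nodes below the cap.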
To assess how the nodes with high relevance are distributed, we projected the network into graphs by selecting the subgraph of nodes with a certain level of relevance (threshold on the network). The resulting graphs may therefore not be connected. The eigenvalue-ranked graph shows high connectivity and very large nodes (see Fig. 5). This is consistent with the definition of eigenvalue centrality, which highlights how a node is connected to nodes that are also highly connected. This structure has implications for the reinforcement of specific messages and information within highly connected clusters, which can act as promoters of solutions or may become lobbies of information. The current flow betweenness graph is unconnected, which is very interesting as decentralized nodes play a key role in transporting information through the network (see Fig. 6). The current flow closeness graph is also unconnected, which means that the social network is rather homogeneously distributed overall, with parallel communities of information that do not necessarily interact with each other (see Fig. 7).

By increasing the size of the graph, more clusters can be observed, especially in the eigenvalue-ranked network (Fig. 8). Some clusters also appear for the current flow betweenness and current flow closeness (see Figs. 9 and 10). These clusters may have a key role in establishing bridges between different communities of practice, knowledge or region-determined groups. As the edges of the network are characterized in terms of flows between users, these bridges can be understood in terms of the volume of information between communities.

The distributions of the centrality metrics indicate that there are some nodes with massive relevance. These nodes can be seen as events within the flow of communication through the network [23] that require further contextualization to be interpreted. These nodes can propagate misinformation or make news or messages viral.
Further research is required to understand the cause of these massive-relevance events, for instance, whether they are related to a relevant concept or message or whether they are an emerging event of the network dynamics and topology. Another way to assess these nodes is to check whether they behave this way consistently over time or are a temporary event. It may also be necessary to contextualize them with the type of content they normally spread in order to understand their exceptional relevance.

Besides the existence of massive-relevance nodes, quantifying and understanding the distribution of highly relevant nodes has many potential applications for spreading messages that reach a large number of users within the network. Current flow betweenness in particular seems a good indicator to identify nodes with which to create a safety net in terms of information and positive messages. The distribution of the nodes could be studied for the general network or for different layers or subnetworks, isolated depending on several factors: type of interaction, type of content or some other behavioral pattern.

Experimental work is needed to test how a message, either positive or negative, spreads when started at one of the relevant nodes or close to the relevant nodes. For this purpose, we are working towards integrating a network of concepts and the network of leaders. Understanding the dynamics of narratives and concept spreading is key for a responsible use of social media for building up resilience against crises. We also plan to build interactive graph visualizations to browse the relevance of the network and dynamically investigate how relevant nodes are connected and how specific parts of the graph are ranked. This is needed to really understand the distribution of the relevance variables, as statistical parameters are not suitable to characterize a common pattern.

It is necessary to make a dynamic ethical assessment of the potential applications of this study. An understanding of the network can be used for control purposes.
However, we consider it necessary that social media become the basis of a pro-active response in terms of conceptual content and information. Digital technologies must play a key role in building up resilience and tackling crises.

[1] Fake news detection on social media: A data mining perspective
[2] The science of fake news
[3] Fake news and the economy of emotions: Problems, causes, solutions. Digital journalism
[4] Social media and fake news in the 2016 election
[5] Viral modernity? Epidemics, infodemics, and the 'bioinformational' paradigm
[6] How to fight an infodemic. The Lancet
[7] The covid-19 social media infodemic
[8] Corona virus (Covid-19) "infodemic" and emerging issues through a data lens: The case of china
[9] "Infodemic": Leveraging High-Volume Twitter Data to Understand Public Sentiment for the COVID-19 Outbreak
[10] Infodemic and risk communication in the era of CoV-19
[11] Information flow during crisis management: challenges to coordination in the emergency operations center
[12] The signal code: A human rights approach to information during crisis
[13] Quantifying information flow during emergencies
[14] Measuring political polarization: Twitter shows the two sides of Venezuela
[15] False news on social media: A data-driven survey
[16] Hate speech detection: Challenges and solutions
[17] An emotional analysis of false information in social media and news articles
[18] DeClarE: Debunking fake news and false claims using evidence-aware deep learning
[19] Csi: A hybrid deep model for fake news detection
[20] A deep neural network for fake news detection
[21] Dynamical strength of social ties in information spreading
[22] Impact of human activity patterns on the dynamics of information diffusion
[23] Efficiency of human activity on information spreading on Twitter
[24] Multiple leaders on a multilayer social media
[25] The ties that lead: A social network approach to leadership. The leadership quarterly
[26] Detecting opinion leaders and trends in online social networks
[27] Exploring the potential for collective leadership in a newly established hospital network
[28] Who takes the lead? Social network analysis as a pioneering tool to investigate shared leadership within sports teams
[29] Discovering leaders from community actions
[30] Analyzing World Leaders Interactions on Social Media

We would like to thank the Center of Innovation and Technology for Development at Technical University Madrid for support and valuable input, especially Xose Ramil, Sara Romero and Mónica del Moral. Thanks also to Pedro J. Zufiria, Juan Garbajosa, Alejandro Jarabo and Carlos García-Mauriño for collaboration.