title: News-sharing on Twitter reveals emergent fragmentation of media agenda and persistent polarization
authors: Cicchini, Tomas; Pozo, Sofia Morena del; Tagliazucchi, Enzo; Balenzuela, Pablo
date: 2021-12-17

News sharing on social networks reveals how information disseminates among users. This process, constrained by user preferences and social ties, plays a key role in the formation of public opinion. In this work we analyze news sharing behavior on Twitter as bipartite news-user networks for two consecutive years of Argentinian major media outlets. Analysis of news networks revealed modular structure driven by semantic similarity of news content and homophilic media interactions, producing a fragmented media agenda. In particular, the two largest communities present a persistently polarized political leaning, reflected in the consumption of ideologically homogeneous groups of media outlets. On the other hand, the user networks show modules driven by similar profiles of news consumption, where core individuals have less diversified profiles, suggesting that the observed polarization could arise from a lack of diversity in media exposure.

The profound change in connectivity enabled by social networks [1] facilitates the formation of ties based on affinity, group membership or trust in influential individuals or organizations, among other causes [2, 3]. As these ties emerge, social networks become clustered, leading to constraints on information flow. In turn, these constraints also influence the process of opinion formation, which might act as a positive reinforcement for the clustered structure of social networks.

The emergence of highly connected groups of individuals is a topological feature that repeatedly arises in studies of social networks.
This characteristic has been observed in networks defined by preferential message propagation (retweet networks, in the case of Twitter) [4, 5], as well as in networks of followers [6]. These groups can reflect the clustering of individuals based on the similarity of their beliefs, creating ideologically homogeneous communities that are frequently known as echo chambers [7, 8, 9]. More precisely, echo chambers can be defined as groups of like-minded users framing and reinforcing a shared narrative. On Twitter, for example, echo chambers can be characterized by tweets containing links to news outlets of known political leaning, as shown by Cinelli et al. [6].

Since the seminal work of McCombs and Shaw [10], the role of mass media in the opinion formation process has been widely studied [11, 12, 13] in the context of the Theory of Agenda Setting. Over the past decade, social networks have played an increasingly prominent role in the formation of public opinion, and their contribution should not be underestimated [14, 15]. Weaver and colleagues [16] investigated news sharing on Twitter during the 2015 UK general elections, representing users and tweets with embedded URLs as a bipartite network. The authors analyzed the community structure of the news projection network in terms of news content, media outlet diversity and geographic localization. Other studies focused on Swiss Twitter activity to analyze differences between national and international news outlet behavior [17], and described political polarization during Indonesian elections using a bipartite network of users and media outlets [18].

In this work we analyze news sharing behavior on Twitter as a bipartite network for two consecutive years of major Argentinian media outlets. The news projection network presented a modular structure driven by shared semantic content between news and homophilic media interactions, producing a fragmented media agenda.
In particular, the two biggest communities in both data sets were characteristic of a persistently polarized political discussion, which was also reflected in the consumption of ideologically homogeneous media outlets within modules of the same ideological leaning. On the other hand, the user projection networks were clustered based on the similarity of user news consumption profiles, where core individuals were the least diversified in terms of media consumption, suggesting that the observed polarization could arise from a lack of diversity in exposure to media.

Twitter activity of users identified as politically engaged was downloaded during two different periods, from July 29th to August 10th 2019, and from June 4th to June 11th 2020. This population of Twitter users was identified around the 2019 primary elections in Argentina (August 11th 2019). All included users were active during the period between August 5th and August 12th 2019. Only tweets containing URLs of news domains were included in this study.

Twitter data was acquired using the official Twitter API, together with custom Python code. The acquisition process consisted of the following steps:

1. User selection: live download. Twitter activity was downloaded using the Stream Listener tool in the Twitter API, filtering by keywords associated with politicians, electoral alliances and political parties, and restricting the time frame as previously mentioned. Following [16], only users who created original content were identified as politically engaged.
2. Download of tweets with embedded URLs: the tweets of the identified users were downloaded using the Trend Line tool from the Twitter API, keeping only those containing embedded URLs.
3. URL filter: news domains were obtained from the URLs together with an Argentinian media database provided by ABYZ News Links. Finally, only tweets linking to the twenty major Argentinian media outlets were selected for further analysis.
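The URL filter in step 3 can be illustrated with a minimal sketch. It is not the code used in the study: the allow-list below holds a few illustrative domains rather than the ABYZ-derived list, and the names `NEWS_DOMAINS`, `news_domain` and `filter_news_tweets` are hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical allow-list: illustrative domains only, not the list used in the study.
NEWS_DOMAINS = {"clarin.com", "lanacion.com.ar", "pagina12.com.ar", "infobae.com"}


def news_domain(url):
    """Return the allow-listed domain a URL belongs to, or None if there is none."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    for domain in NEWS_DOMAINS:
        # Accept the domain itself and any of its subdomains.
        if host == domain or host.endswith("." + domain):
            return domain
    return None


def filter_news_tweets(tweets):
    """Keep (user, url, domain) triples for tweets whose URL points to a listed outlet.

    tweets: iterable of (user, url) pairs.
    """
    kept = []
    for user, url in tweets:
        domain = news_domain(url)
        if domain is not None:
            kept.append((user, url, domain))
    return kept
```

Matching on the host name rather than on the raw URL string avoids false positives when an outlet name merely appears in the path or query of an unrelated link.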
The 2019 data set consisted of 59982 tweets with 30739 news URLs from 6502 users; the 2020 data set consisted of 47210 tweets with 25288 news URLs from 5450 users. The two data sets shared 2073 users.

The complex pattern of news shared by multiple users can be mapped onto bipartite networks following the procedure sketched in [16]. Bipartite networks have two different classes of nodes, news and users, and can be represented by an adjacency matrix as:

$$A_{ij} = \begin{cases} 1 & \text{if user } j \text{ shared news article } i \\ 0 & \text{otherwise} \end{cases}, \qquad A \in \{0, 1\}^{m \times n} \tag{1}$$

where m is the number of news and n the number of users involved. These bipartite networks can be projected into news and user layers. Connections in the news projection indicate co-consumption across users, while the user projection describes users connected by news in common. The projections can be obtained by assigning to each pair of nodes of a given class a weight related to the number of connections between them through nodes of the other class [19]:

$$w_{ij} = \sum_{u} \frac{A_{iu} A_{ju}}{k_u - 1} \tag{2}$$

where the sum runs over u, the set of nodes of the complementary class, and $k_u$ represents the degree of the u-th node (for instance, in the case of the news article network, $k_u$ is the number of news articles shared by the u-th user). Note that this projection mitigates the potential bias induced by highly active users when linking news.

Finally, two additional filters were applied to these networks. First, given that these projections do not necessarily produce fully connected networks, we kept the largest connected components for further analysis. Second, the edges were filtered by their significance [20] (significance level α = 0.05) to work with the backbone structure of the networks. Table 1 shows the final number of nodes in both networks for each period.

Period                  Users   News
29/07 to 10/08 2019     2167    7081
04/06 to 11/06 2020     2232    5252

Table 1: Number of nodes in the users and news networks, respectively.

We mainly focused on the analysis of collective structures and the role of nodes within these structures.
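The projection onto the news layer described earlier in this section can be sketched as follows. This is a minimal illustration that assumes the weighting of [19], in which each user u who shared both articles contributes 1/(k_u − 1) to the link between them; the function name `project_news` and the edge-list input format are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def project_news(shares):
    """Project a user-news bipartite edge list onto the news layer.

    shares: iterable of (user, news) pairs.
    Returns {(news_a, news_b): weight} with news_a < news_b.
    """
    news_by_user = defaultdict(set)
    for user, news in shares:
        news_by_user[user].add(news)

    weights = defaultdict(float)
    for news_set in news_by_user.values():
        k_u = len(news_set)  # degree of the user in the bipartite network
        if k_u < 2:
            continue  # a user who shared a single article links no pair
        for a, b in combinations(sorted(news_set), 2):
            # Assumed weighting: highly active users contribute less per link.
            weights[(a, b)] += 1.0 / (k_u - 1)
    return dict(weights)
```

The user projection follows from the same function by swapping the two elements of each input pair. The largest-component and backbone filters are not included in this sketch.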
To detect communities in both projections of the bipartite networks we used a Python implementation of the Louvain algorithm [21, 22]. This algorithm is based on the optimization of the modularity Q, defined as:

$$Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(C_i, C_j) \tag{3}$$

where:
• $A_{ij}$ represents the edge weight between nodes i and j.
• $k_i$ and $k_j$ are the sums of the weights of the edges attached to nodes i and j, respectively.
• m is the sum of all the edge weights in the network.
• $C_i$ and $C_j$ are the communities of nodes i and j.
• The function $\delta(C_i, C_j)$ is one if nodes i and j are in the same community, i.e. $C_i = C_j$, and zero otherwise.

Due to the stochastic nature of the Louvain algorithm, the community partitions obtained in different runs may differ in a comparatively small number of nodes. To obtain a well-defined membership of nodes in communities, we constructed consensus networks [23], which allow the robust assignment of nodes to communities.

We analyzed the role of users in the network by means of the participation coefficient and the within-module degree [24]. The first metric quantifies how the edges of a user are distributed among the communities of the network:

$$P_i = 1 - \sum_{n=1}^{M} \left( \frac{k_i^n}{k_i} \right)^2 \tag{4}$$

where $k_i$ is the degree of the i-th user, $k_i^n$ is the degree of the i-th user restricted to the users belonging to the n-th community, and M is the total number of communities in the network. If all the edges of a given user are within the same community, $P_i$ equals zero; conversely, if the connections of the i-th user are evenly distributed among different communities, $P_i$ tends to one.

Nodes can also be characterized in terms of their within-module degree [24]. Here, we use this metric to label nodes in the visualized networks (see Fig. 2 for an example). This metric is defined as:

$$z_i = \frac{k_i - \bar{k}_{c_i}}{\sigma_{k_{c_i}}} \tag{5}$$

where $\bar{k}_{c_i}$ and $\sigma_{k_{c_i}}$ are the mean and the standard deviation of the degree distribution restricted to the community of node i (i.e.
only considering edges between nodes belonging to this community), and $k_i$ is the degree of the i-th node computed using connections to nodes within the same community.

To analyze the properties of emergent structures on both projections of the bipartite networks, we focus on metrics related to the semantic content and the media outlet membership (for the news projection) and to the user profile of media consumption (for the user projection).

Nodes corresponding to news could belong to the same community for different reasons, such as being published by the same media outlet or having similar semantic content.

• Media outlet. We first classify each news article according to the media outlet where it was published.
• Content analysis. The following steps were applied to the text of each news article:
1. Tokenization: each element of the corpus was separated into individual terms, and non-alphanumeric characters and punctuation were removed. All terms were converted to lowercase.
2. Stopword filtering: using the Spanish stopword list provided by nltk, the most common (and thus the least informative) words were filtered out.
3. Term basis construction: a term basis was generated from the set of used terms.
4. Frequency description: each news article was described as a term frequency vector with entries given by the basis computed in the previous step, i.e., the i-th entry corresponded to the number of times the i-th term of the basis appeared in the article.
5. tf-idf description: to mitigate the bias due to the excessive contribution of frequently used words and to increase the contribution of unusual (but informative) words, the term frequency-inverse document frequency (tf-idf) statistic was computed [25], resulting in the following value for the i-th element of the vector representation of a news article:

$$v_i = f_i \log\left(\frac{N}{N_i}\right) \tag{6}$$

where $f_i$ is the term frequency from the previous step, N is the number of documents in the corpus and $N_i$ is the number of documents where the i-th term appears.
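The five content-analysis steps above can be sketched with the Python standard library alone. This is an illustrative sketch, not the pipeline used in the study: the stopword set is a tiny placeholder (the study used the Spanish list shipped with nltk), and the names `STOPWORDS`, `tokenize` and `tfidf` are hypothetical.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword set; the study used the Spanish list provided by nltk.
STOPWORDS = {"el", "la", "los", "las", "de", "en", "y", "a", "que"}


def tokenize(text):
    """Lowercase the text, keep runs of word characters, drop stopwords."""
    return [t for t in re.findall(r"\w+", text.lower()) if t not in STOPWORDS]


def tfidf(corpus):
    """Return (basis, vectors): the sorted term basis and one tf-idf vector per document."""
    docs = [Counter(tokenize(text)) for text in corpus]  # term frequencies f_i
    basis = sorted(set().union(*docs))                   # term basis
    n_docs = len(docs)                                   # N
    # N_i: number of documents in which each basis term appears.
    doc_freq = {term: sum(1 for d in docs if term in d) for term in basis}
    vectors = [
        [d[term] * math.log(n_docs / doc_freq[term]) for term in basis]
        for d in docs
    ]
    return basis, vectors
```

With this definition a term that appears in every document receives weight zero, which is the intended damping of very frequent words.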
After these processing steps, the news corpus was described as a matrix $M \in \mathbb{R}^{n \times m}$, with n the number of documents in the corpus and m the number of basis terms.

• Unsupervised topic detection. Starting from this mathematical representation of the corpus it is possible to detect the main topics (i.e. groups of similar articles with roughly the same semantic content) by performing, for instance, non-negative matrix factorization (NMF) [26, 27] on the document-term matrix M. NMF results in the factorization of the news-terms matrix M as the product of two matrices:

$$M \approx N W, \qquad N \in \mathbb{R}^{n \times t}, \quad W \in \mathbb{R}^{t \times m} \tag{7}$$

where t is the chosen number of topics, and N and W are the news-in-topics and the topics-in-terms matrices, respectively. Both matrices are composed only of non-negative entries, which allows a simple interpretation.

To calculate the coverage of mass media, we estimated the number of articles and their relative importance in each topic following the procedure sketched in [12]. Then, we defined the weight of topic i ($T_i$) as the product of the number of news articles (weighted by degree of membership) and the length of the articles. If weights are normalized to one, then the media agenda can be represented as a probability distribution in the topic space.

Users are linked by the news they consume. We quantified the diversity in the media outlets consumed by each user in terms of the following vector:

$$\mathbf{m}^i = (m^i_1, m^i_2, \ldots, m^i_M) \tag{8}$$

where the component $m^i_j$ indicates the number of news from media outlet j shared by user i, and M is the number of media outlets. Given the heterogeneity in the distribution of news outlets in the corpus, we adapted the idea behind the tf-idf statistic and introduced a corrected version of this metric:

$$\tilde{m}^i_j = m^i_j \log\left(\frac{N}{N_j}\right) \tag{9}$$

where N is the total number of users and $N_j$ the total number of users sharing news belonging to media outlet j. Here, the factor $\log(N / N_j)$ corrects the potential bias caused by a given media outlet being shared by many users.
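The corrected media vector can be sketched as below, assuming that it takes the tf-idf-like form described above: the number of articles of outlet j shared by a user, multiplied by log(N / N_j). The function name `corrected_media_vectors` and the input format are hypothetical.

```python
import math
from collections import Counter, defaultdict


def corrected_media_vectors(shares, outlets):
    """Compute one corrected media vector per user.

    shares: iterable of (user, outlet) pairs, one pair per shared article.
    outlets: ordered list of outlet names defining the vector components.
    Returns {user: [m_j * log(N / N_j) for each outlet j]}.
    """
    counts = defaultdict(Counter)  # counts[user][outlet] = articles shared
    for user, outlet in shares:
        counts[user][outlet] += 1

    n_users = len(counts)          # N
    users_per_outlet = Counter()   # N_j
    for c in counts.values():
        users_per_outlet.update(c.keys())

    vectors = {}
    for user, c in counts.items():
        vectors[user] = [
            c[o] * math.log(n_users / users_per_outlet[o]) if users_per_outlet[o] else 0.0
            for o in outlets
        ]
    return vectors
```

Under this assumed form, an outlet shared by every user gets weight zero for everyone, so the vectors emphasize the outlets that distinguish one user from another.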
Moreover, users can share news from multiple outlets or, conversely, share news from only one of them. We estimated how diverse the user behavior was in terms of the shared media outlets using the maximum value of the consumed media vector. This measures the Lack of Diversity (LD) in the user behavior: where M is the total number of media outlets in our data set. After a normalization, this lack of diversity lies between 0 and 1.

In this projection, nodes represent news articles and links represent the number of users co-sharing the corresponding news. We performed a topological analysis of this network, together with the semantic analysis of news content, with the purpose of understanding emergent structures as well as their consistency in time.

We first address whether co-shared news are mainly linked because they were published in the same media outlet or because they have similar semantic content. To address the first possibility, we calculated the media homophily ($h_m$), defined as:

$$h_m = \frac{1}{2m} \sum_{ij} A_{ij} \, \delta_{m_i, m_j} \tag{11}$$

where $2m = \sum_{ij} A_{ij}$ is the total strength for weighted networks and $\delta_{m_i, m_j}$ is 1 if the media outlets of nodes i and j are the same, and 0 otherwise.

For the second case, we analyzed the similarity between the semantic content of co-shared news. As described above, we represented each news article as a vector in term space using the tf-idf representation. Then, the cosine similarity between each pair of linked news was computed in both networks. Finally, the median of the cosine similarity distribution was taken as a global measure of the semantic similarity between co-shared news.

To assess the significance of the homophily and of the median of the semantic similarity distributions between co-shared news, results were compared to a null model where news were randomly assigned to each node. In Fig. 1, panels [A] and [B] show the $h_m$ results for 2019 and 2020, respectively.
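The comparison of media homophily against a shuffled null model can be sketched as follows. The sketch assumes that h_m is the fraction of the total edge weight that joins articles from the same outlet, and that the null model permutes outlet labels among nodes while keeping the network fixed; both are readings of the description above rather than a confirmed implementation, and the function names are hypothetical.

```python
import random


def media_homophily(edges, outlet):
    """Fraction of total edge weight joining nodes of the same outlet (assumed form).

    edges: list of (i, j, weight) tuples, each undirected edge listed once.
    outlet: {node: outlet name}.
    """
    total = sum(w for _, _, w in edges)
    same = sum(w for i, j, w in edges if outlet[i] == outlet[j])
    return same / total if total else 0.0


def null_homophily(edges, outlet, n_samples=1000, seed=0):
    """Homophily values obtained after shuffling outlet labels among the nodes."""
    rng = random.Random(seed)
    nodes = list(outlet)
    labels = [outlet[n] for n in nodes]
    samples = []
    for _ in range(n_samples):
        rng.shuffle(labels)  # keeps the outlet frequencies, breaks their placement
        samples.append(media_homophily(edges, dict(zip(nodes, labels))))
    return samples
```

An empirical p-value can then be estimated as the fraction of shuffled samples whose homophily is at least as large as the observed one.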
We observe that media homophily ($h_m$) is far from the random distribution, with values higher than 0.5 in both cases. On the other hand, panels [C] and [D] show the median of the semantic content similarity distributions. Despite the low values (which are expected due to the sparse representation in the semantic space), the observed similarities are larger than expected from the null hypothesis. These results highlight that media outlet membership and semantic content are both key ingredients to understand the patterns of news consumption.

We performed community detection in both data sets, yielding the structures visualized in Fig. 2. The displayed labels correspond to the media outlets of the news with the greatest within-module degree. Colors represent the different communities. Fig. 2 shows the strong relationship between the news media outlets and the community structure and how it persists in time, at least for the two biggest communities: Página 12 in one of them, and La Nación, Clarín and Infobae in the other.

To better understand the main driving forces behind this organization, we analyzed the media outlet distributions and performed topic decomposition of the news content within each community, as detailed in the previous section. We provide information corresponding to these metrics in Table 2 for the five main communities of each data set and plot a more detailed version in Figs. 3 and 4 for the two biggest communities.

In Fig. 3 two main results for the 2019 data set can be seen. First, the media outlet distributions for the two main communities, representing 18% and 14% of the total nodes, are fundamentally different, as shown by the stacked bars. The first one is dominated by the outlets Página 12 and El Destape, while the news articles in the second one are published mainly in three outlets: Clarín, La Nación and Infobae.
This particular media distribution parallels the two most voted and widespread political parties in Argentina. While Página 12 and El Destape are known to support Kirchnerism, a center-left political party [28], Clarín and La Nación are known for being the current "opposition media" [29]. The differences between these two groups of media outlets also arose in other contexts [30], such as institutional violence cases and crime. The ideological profiles of these media outlets have been identified as follows [30]:

• Página 12: left-of-centre broadsheet newspaper.
• Clarín: centrist tabloid with the highest circulation among national newspapers.
• La Nación: elite, centre-right broadsheet.

Second, from the analysis of the topic distribution, we observe that the media agenda is distributed among six different topics for both data sets. In 2019 some topics are present in both communities, such as National Elections and Electronic Vote-Counting, as expected in the context of a national election. We noticed that the National Elections word clouds for each community (Fig. 3) include the names of the main political candidates: Cristina, Alberto, Fernandez and Kicillof (Center-Left community) and Vidal and Macri (Center-Right community). This reflects the political polarization between those communities that we described previously. Also, the topics Economy and Justice appeared in both communities, indicating their importance in the public discussion. However, some topics appeared in only one community, such as Union Corruption Case and Drug Traffic Affair, which can be interpreted in terms of the specific interests of the readers of this group of newspapers. Finally, it should be noted that the topic distribution discriminated by media outlet is also informative in terms of the media agenda.
For instance, in the case of Ambito Financiero in the main community, the topic Economy is the most important, as expected for a newspaper specializing in that area.

In Fig. 4 we can observe that the two main communities of the 2020 data set, including 19% and 15% of the total number of nodes, present media outlet distributions similar to those observed in the 2019 data set. This indicates that news sharing behavior, at least for the two largest communities, is strongly driven by user preferences and is thus stable over time. Regarding topic decomposition, we observe the presence of topics such as Covid, Illegal Espionage and Exporting Company Affair. As expected for 2020, the media agenda in both communities was driven by the topic Covid, although the topic Illegal Espionage was also important in the second community.

In summary, the mesoscale analysis of news networks showed that the media outlet distributions of the two main communities constitute relevant metrics to characterize polarization in terms of news sharing behavior. Also, the media agenda of these communities helps us understand the nature of this polarization (in particular in 2019), given that the National Elections topic reveals that the most frequent words in each community are related to the main candidates of the two opposing political parties.

The other communities are comparatively much smaller, as can be seen in Table 2. Here, more than 75% of the news are from a single media outlet. For instance, news of the third community from both years are mostly from Infobae. In addition, the topics of those communities concern mostly international politics (with words such as Hong Kong, China, Venezuela, Colombia, Maduro, George Floyd, Bolivia, Evo Morales, Texas and dollar). Analogously, the main media outlets of the fourth and fifth communities in both years are La Izquierda Diario (more than 99% of this outlet) and Infobae.
We investigated the year-to-year consistency of the community structure and its relationship with the media outlet distributions and the semantic content of news articles. For this analysis, each community was described using two metrics: its media outlet distribution and its semantic content. On the one hand, the media outlet distribution of each community c is described by an array $C^m$ where the i-th component accounts for the number of news of media outlet i in community c. On the other hand, for the semantic description ($C^s$) of each community, we took the average tf-idf representation of all news belonging to the community, preserving only the main twenty terms of each vector. In both cases, we calculated similarity by means of the weighted Jaccard index, defined as follows:

$$J^d(c, c') = \frac{\sum_k \min\left(C^d_k(c), C^d_k(c')\right)}{\sum_k \max\left(C^d_k(c), C^d_k(c')\right)} \tag{12}$$

where the superindex d accounts for the two possible descriptions, the media outlet distribution (m) and the mean semantic description (s), and the sum is over k, i.e., the k-th term or the k-th media outlet of the description. The corresponding similarities were calculated between the communities of the 2019 and 2020 networks and are shown in Fig. 5.

Fig. 5 shows that from one year to the next the media outlet distribution in the main five communities is almost the same, as seen in the high values of media similarities on the diagonal. On the other hand, panel [B] of the same figure displays the similarity measures between semantic contents. The low values observed here show that, as expected, the semantic content changes from the first to the second year. In summary, the media outlet distributions identify each community and thus persist in time, at least in the two data sets investigated here.

In this section we study the bipartite networks projected onto the set of 2019 and 2020 users. The projections and the filtering of links were done as previously described [19, 20].
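The year-to-year comparison of news communities described above relies on the weighted Jaccard index. A minimal sketch, assuming its standard form (sum of component-wise minima divided by sum of component-wise maxima) and a hypothetical dictionary representation of each community description, is:

```python
def weighted_jaccard(x, y):
    """Weighted Jaccard index between two non-negative descriptions.

    x, y: dicts mapping a key (media outlet or term) to a non-negative weight;
    keys missing from a dict are treated as weight zero.
    """
    keys = set(x) | set(y)
    num = sum(min(x.get(k, 0.0), y.get(k, 0.0)) for k in keys)
    den = sum(max(x.get(k, 0.0), y.get(k, 0.0)) for k in keys)
    return num / den if den else 0.0
```

The index equals 1 for identical descriptions and 0 when the two descriptions share no outlet or term, so it can be applied to both the media outlet counts and the truncated average tf-idf vectors.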
For the news projection we found that the identity of the emergent structures was mainly driven by the distribution of media outlets shared in each community. This allowed us to use the corrected media distribution, given by Eq. 9, to describe each user by their media consumption profile.

In order to capture the emergent structures at the mesoscopic scale, we performed community detection in both data sets. Fig. 6 shows the network visualizations for the 2019 and 2020 user networks. Nodes were colored according to their community membership and the labels correspond to the set of news shared by users of the same community. In particular, word clouds show the average media vector of each community, according to Eq. 9 ($\langle \tilde{m}^i \rangle_c$).

In Table 3 we describe the relative sizes and the main media outlets of the five biggest communities. We can observe that for the 2020 data set, the two main communities represent almost 40% of the network and are comparatively much larger than the others. These are dominated by Clarín, La Nación and Infobae for the first one, and by El Destape and Página 12 for the second one. The distributions of these outlets are similar to those observed in the two largest communities in the case of the news projections. On the other hand, in the 2019 data set, the biggest community was also dominated by El Destape and Página 12. However, the second one did not show the same distribution observed in the 2020 data set.

In order to assess the properties of these emergent structures, we address the following questions:

• How homogeneous are user communities in terms of media consumption?
• What is the behavior of the typical user of each community? Are they mono-consumers or poly-consumers?

To answer the question about media consumption within user communities, we propose the following approach. Users can be described by a corrected media vector, which indicates their media consumption profiles, as was defined in section User projection.
If each community is described by the average of the vectors of its users, then it is possible to calculate how similar users are to the mean vector of each community. To do this, we first transformed the corrected media vectors using Principal Component Analysis (PCA). This approach allows us to visualize all users and community average vectors in a two-dimensional space and thus to estimate their similarities by means of cosine similarity.

In Fig. 7, the medians of the cosine similarity distributions between users and community average vectors are reported for the 2019 and 2020 data sets. Here, row i and column j compare communities i and j, respectively: the comparison of users with their own communities is shown on the diagonal, while the comparison with other communities can be found off the diagonal. These results show the consistency of these communities in terms of media consumer profiles.

In Fig. 8, users and community average vectors are mapped onto the first two dimensions of the PCA representation for both data sets, with users colored according to their communities. These plots help to visualize the differences between the media consumer profiles and the averages of each community, as shown in Fig. 7. In particular, we can appreciate the difference between the two main communities of 2020.

The previous results show that users tend to cluster in communities according to the media outlets they read or share. We analyzed whether the lack of diversity in media consumption plays an important role in the formation of these communities. For this we calculated the lack of diversity in media consumption following the definition given in Eq. 10 (i.e., taking the maximum value of the corrected media outlet distribution of each user). In Fig. 9, we compare the lack of diversity in media consumption with the role of the users in the networks.
In particular, we chose the participation coefficient defined in Eq. 4, because its low values indicate a strong membership to a given module. Fig. 9 shows that there is a negative correlation between the participation coefficient and the lack of diversity of the users in both data sets. The values of the linear correlation for 2019 and 2020 are −0.40 and −0.42, respectively (p-value < 0.01 compared with a null model where the LD values of the nodes were randomly shuffled). These results show that users who play a central role in each community have a noticeable lack of diversity in the news they consume. On the other hand, users on the border between communities tend to share news from different outlets.

In this work we study news sharing in social networks in order to understand emergent collective properties of groups of politically active users. In particular, we focus on users sharing news from Argentinian media outlets on Twitter in two consecutive years, 2019 and 2020. The questions guiding our research were:

• Is news sharing constrained by any features of users or news, or does information diffuse freely on social media?
• Why are the news in a given group more co-shared among themselves than with others?
• Do users tend to form clusters according to given preferences in news consumption?

We selected a given group of Twitter users and focused on the tweets with links to media outlets that they shared. With this data, we built a bipartite network of users and news [16] and analyzed its projections in both layers using the analytical machinery of complex networks. We focused on emergent structures at a mesoscale level (communities) and on the role of nodes in these structures.

When we analyzed the news projection, we found that news tend to group into robust communities whose main fingerprint is the distribution of the media outlets in which they were published.
The two main communities comprise around 35% of the nodes and reflect the political polarization around the two main political parties in Argentina: Página 12 and El Destape on one side and Clarín, La Nación and Infobae on the other. While the first two outlets have a center-left leaning, Clarín and La Nación have centrist and center-right leanings [30]. When we looked into the topic National Elections, we noticed that the cluster dominated by news of Página 12 and El Destape highlighted the names Cristina, Alberto, Fernandez and Kicillof, who were the candidates of the Frente de Todos (Kirchnerism). On the other side, the same topic in the second largest community reflected the names of Vidal and Macri, who were the candidates of Juntos por el Cambio, the political party of the former president Mauricio Macri.

What happens with the rest of the communities? They are much smaller, are dominated basically by a single newspaper and focus on particular preferences: international political news in Infobae, or the newspaper of the socialist workers' party, La Izquierda Diario. If we recall that links in the news projection are given by users who co-share news articles, these results mean that their behavior is driven by strict preferences in news consumption: the vast majority are ruled by ideological preferences, reflecting political polarization in news sharing, and the rest of the users are grouped around specific preferences.

In the user projection, nodes were linked by the news they share. Taking into account the results in the news projection, we asked whether users tend to group around similar media consumption profiles. We detected that the two biggest communities in the 2020 data set were composed of users with similar preferences of media outlets: Página 12 and El Destape on one side and Clarín, La Nación and Infobae on the other. However, in the 2019 data set this result was observed only for the biggest one.
But the most interesting result here is that communities are composed mainly of core individuals with a marked lack of diversity in their news consumption, while users who tend to share news from different newspapers play a peripheral role in their groups. Here we have observed that homogeneous groups of users and news, in terms of certain cultural preferences or political leaning, can also be found in news-sharing networks.

The spreading of misinformation online
Research note - the allure of homophily in social media: Evidence from investor responses on virtual communities
Friendship prediction and homophily in social media
Political polarization on Twitter
Extracting significant signal of news consumption from social networks: the case of Twitter in Italian political elections
The echo chamber effect on social media
Echo Chamber: Rush Limbaugh and the Conservative Media Establishment
Blogs are echo chambers: Blogs are echo chambers
Public Opinion Quarterly
The Power of Information Networks: New Directions for Agenda Setting. Routledge studies in global information, politics and society. Routledge
Quantifying time-dependent media agenda and public opinion by topic modeling
Analyzing mass media influence using natural language processing and time series analysis
Agenda setting through social media: The importance of incidental news exposure and social filtering in the digital era
Social media and political agenda setting
Communities of online news exposure during the UK general election 2015
Transnational news sharing on social media: Measuring and analysing Twitter news media repertoires of domestic and foreign audience communities
Media polarization on Twitter during 2019 Indonesian election
Scientific collaboration networks. II. Shortest paths, weighted networks, and centrality
Extracting the multiscale backbone of complex weighted networks
Fast unfolding of communities in large networks
python-louvain: Louvain algorithm for community detection
Consensus clustering in complex networks
Functional cartography of complex metabolic networks
Chapter 4 - Text mining and network analysis of digital libraries in R
Learning the parts of objects by non-negative matrix factorization
Document clustering based on non-negative matrix factorization
Kirchnerism in Argentina: A populist dispute for hegemony. International Critical Thought
Information, interest, and ideology: Explaining the divergent effects of government-media relationships in Argentina
Media and punitive populism in Argentina and Chile