key: cord-0122037-51mg75u4 authors: Mattei, Mattia; Pratelli, Manuel; Caldarelli, Guido; Petrocchi, Marinella; Saracco, Fabio title: Bow-Tie Structures of Twitter Discursive Communities date: 2022-02-07 journal: nan DOI: nan sha: 19bb4273be728073fbddefd9c28e6eda4919c53a doc_id: 122037 cord_uid: 51mg75u4

Bow-tie structures were introduced to describe the World Wide Web: in the directed network in which the nodes are the websites and the edges are the hyperlinks connecting them, most nodes take part in a bow-tie, i.e. a Weakly Connected Component (WCC) composed of 3 main sectors: IN, OUT and SCC. SCC is the main Strongly Connected Component of the WCC: it is the largest subgraph in which each node is reachable from any other one in the subgraph. The IN and OUT sectors are the sets of nodes not included in SCC that, respectively, can access and are accessible from nodes in SCC. Most websites can be found in SCC, while search engines belong to IN and authorities, such as Wikipedia, are in OUT. In the analysis of Twitter debates, the recent literature has focused on discursive communities, i.e. clusters of accounts interacting among themselves via retweets. In the present work, we studied discursive communities in 8 different thematic Twitter data sets in various languages. Surprisingly, we observed that almost all discursive communities therein display a bow-tie structure during political or societal debates. Instead, bow-ties are absent when the topic of the discussion is different, such as sport events, as in the case of the Euro2020 Turkish and Italian data sets. We furthermore analysed the quality of the content created in the various sectors, using the domain annotation from the fact-checking website NewsGuard: it turns out that the content with the lowest quality is the one produced and shared in SCC.
In this sense, in discursive communities displaying large OUT blocks, most of the accounts have access to a great variety of contents, but their quality is, in general, quite low, creating the phenomenon known as infodemic. In the present paper, we correlate the presence of an infodemic with a peculiar network structure, i.e. an OUT-dominant bow-tie.

Since their first introduction, Online Social Networks (OSN) have been deeply investigated for possible implications of the online public debate on political processes [1]. In the last decade, the centrality of OSN for political communications and debates has steadily increased: OSN represent

Discourse and discursive communities

Whether circulating within an echo chamber or suggested by recommendation algorithms, the type of information users come across online is fundamental in reinforcing, or not, the division into 'closed' groups. Nevertheless, the study of the interactions between users is also of great interest for detecting polarization phenomena. The term discourse community was coined in 1982 and it indicates 'groups that have goals or purposes, and use communication to achieve these goals' [16]. A discourse community is itself immaterial, and this tends to project it onto the forum on which it operates [17]. Thus, with the advent of OSN, discourse communities were projected onto the platforms themselves [18]: 'A discourse community can be viewed as a social network, built from participants who share some set of communicative purposes'. According to Berkenkotter [19], 'just as the digital world is constantly evolving, discourse communities continually define and redefine themselves through communications among members'. In the definition of discourse community, it is implicit that we know the identities of the individuals forming the community. In the case of Twitter, this is only partially true, since we have trustworthy information only about a small minority of accounts.
For this reason, we prefer to use the term discursive communities, as it was introduced in Ref. [20] to identify groups of users that are connected by non-trivial patterns of discourse, but for which we have limited information about the identity of the group itself. Nevertheless, since we can infer the discourse community of the discursive community by looking at a set of non-trivial data characterising the group, such as the most frequent keywords used therein, the difference is more formal than substantial. Therefore, in the following we will use the two terms interchangeably.

To detect discursive communities in OSN, the first contributions applied mixed approaches to the political debate on Twitter [21] [22] [23]. The work considered the political debate on Twitter about the US presidential election campaign, i.e. a 'perfectly polarized' one in which two opposite fronts face each other. The authors manually annotated the most frequent keywords characterizing Republicans' and Democrats' narratives and used them to infer the political orientation of the accounts using them. The orientation of accounts not using hashtags was later inferred using a label propagation algorithm [24]. Remarkably, a clear partition in two distinct groups of users, supporters of the two political parties, was observed in the retweet network only (the network of users sharing

There are two relevant points in observing the presence of bow-tie structures in discursive communities: how big the bow-tie is with respect to the entire discursive community (a feature that is described as an uninformative, weak or strong bow-tie in the main text) and how random the presence of this structure is (i.e., its statistical significance). Regarding the first point, when the bow-tie is informative, even in the worst case, it represents more than 80% of all nodes in the discursive community, i.e., much more than what Broder et al. observed for the WWW.
Regarding the second point, in order to be sure that the observed bow-ties are not due to a random organization of links only, we compared the observed quantities with a maximum entropy null-model for directed networks, conserving the in- and out-degree sequences [36]. The results show that the dimensions of most of the bow-tie sectors are statistically significant, i.e., they carry a signal that cannot be due to the degree sequence only. In this sense, the presence of a bow-tie structure is an extremely non-trivial feature of the system.

We can add more detail to the analysis of this structure. When the bow-tie is informative, we observe two cases: the OUT-dominant and the INTEND-dominant ones, depending on which sector is the largest (respectively, OUT or INTENDRILS). The OUT sector has access to all information produced in the discursive community and, in particular, to the one produced by the most active block, SCC. Thus, in principle, the OUT-dominant bow-tie should be more informed regarding the content shared in the discursive community. Instead, in the INTEND-dominant bow-ties, the most crowded sector is the one of INTENDRILS, i.e., the retweeters of IN that are not retweeted by anyone else and that cannot access all content created by SCC. In principle, it should be desirable to have an OUT-dominant bow-tie: when the OUT sector is the most populated, there are many accounts that are exposed to information from all other sectors. This should give the accounts a multi-faceted, pluralistic knowledge. However, we carried out an analysis on the quality of the content produced in the various sectors of the bow-ties, and our outcome returns a different picture. Indeed, the most active block is SCC, which, in discursive communities affected by m/disinformation, is responsible for the greatest flux of contents from unreliable sources.
In this sense, since the largest block is directly hit by questionable content produced by SCC, the OUT-dominant bow-ties are exposed to m/disinformation campaigns. This creates local infodemics. According to WHO, "infodemics are an excessive amount of information about a problem, which makes it difficult to identify a solution" 1 . Summarising, our contribution is twofold:
• almost all the discursive communities in 8 datasets of Twitter debates, on different topics in different countries, display a bow-tie structure which is statistically significant;
• relating the presence of the bow-tie with the concept of infodemics and the production of controversial content, we find that in most cases the majority of users is exposed to untrustworthy information.

We would like to remark that the results in this manuscript do not represent the only contribution that connects the diffusion of m/disinformation to the network structure (see, for instance, the work in [37] [38] [39] [40], just to consider some of the most recent contributions). However, this is the first time that the diffusion of m/disinformation is related to the presence of the bow-tie structure in discursive communities.

In order to make our analysis as general as possible, we consider several Twitter datasets across different countries and about different topics. The data were collected using the Twitter Search API. In detail:
• Covid-19 datasets: we explore Twitter posts containing keywords related to the Covid-19 pandemic 2 , in different languages and therefore diffused in different countries. In particular, we consider the Italian, German and French debates about the pandemic, in the period between February and April 2020.
• Dutch elections dataset: we collect Twitter posts about the national elections in the Netherlands in 2021. The keywords used for downloading data were "tweedekamer", "verkiezingen", "kabinet", "coalitie", "stem", "stembus", "verkiezingen2021" 3 and only messages in Dutch were selected.
The dataset contains 1,002,696 tweets posted between February 2 and March 31, 2021.
• Italian debate on migrants: we select Twitter posts shared in Italy with keywords regarding the discussion about the migration flows from Northern Africa to the Italian coasts. The dataset consists of 1,082,029 posts, published between January 23, 2019 and February 22, 2019. The dataset is described in more detail in Ref. [28].
• Italian debate on the Astrazeneca vaccine: we examine 583,327 Twitter posts published in Italian, regarding the discussion about the safety of the Astrazeneca vaccine against Covid-19: the keywords used for the download were "astrazeneca", "aifa", "ema", "trombosi" 4 . The dataset contains posts shared between March 15, 2021 and May 15, 2021.
• Italian and Turkish EURO2020: we analyze 298,538 Italian tweets and 522,363 Turkish ones about the European Football Championship EURO2020; the keyword used for the download was simply "#euro2020". The tweets were published between, respectively, June 11-13 and June 11-23, 2021.

2 In particular, the keywords for tweets collection were "coronavirus", "ncov", "covid", "SARS-CoV2", "#coronavirus", "#coronaviruses", "#WuhanCoronavirus", "#CoronavirusOutbreak", "#coronaviruschina", "#coronaviruswuhan", "#ChinaCoronaVirus", "#nCoV", "#ChinaWuHan", "#nCoV2020", "#nCov2019", "#covid2019", "#covid-19", "#SARS CoV 2", "#SARSCoV2", "#COVID19". The subset of Italian messages has also been investigated in Ref. [32].
3 Respectively, "House of representatives", "elections", "cabinet", "coalition", "vote", "ballot box", "elections2021".
4 Respectively, "astrazeneca", "Italian Medicines Agency", "European Medicines Agency" and "thrombosis".

So as not to burden the presentation, in the following we will present the results about the Italian Covid-19, Italian EURO2020 and Turkish EURO2020 datasets.
We will show the results related to the other datasets wherever they differ substantially from those of the Italian dataset. However, all graphics and results about the other datasets can be found in the Supplementary Material.

Our analysis focuses on the structure of the networks of retweets, for each dataset. Retweeting a post is one of the possible ways in which people can interact on Twitter and it consists in sharing the content of a tweet written by another user. It usually means endorsing the post content and it also has the effect of raising its visibility [21] [22] [23]. We start by distinguishing between verified and non-verified accounts. The former denote Twitter users whose identity has been verified by the social platform. This procedure is usually adopted to certify the accounts of renowned people and organizations and figures of public interest in general, such as politicians, journalists, political parties, newspapers and TV-channels. We place the verified accounts on one layer of a bipartite network 5 and the non-verified ones on the other one, again considering links as retweets between them 6 . The main idea is to anchor the definition of discursive communities on verified users since they usually introduce new content and posts: as observed in many other studies [20, 25, 28, [31] [32] [33] 41], verified users are, on average, much more retweeted than common users. Such a procedure performs well, since it can be observed that the various discursive communities are coherent in terms of verified users belonging to the same political front; in a further analysis we are comparing this procedure with annotated datasets, to better quantify its performance [27]. Following the methodology introduced in Becatti et al. 
[25], we count the common neighbors of each pair of verified users or, in simpler words, the number of non-verified users that have interacted (by retweeting or being retweeted) with the same pair of verified ones. The aim is to project the bipartite network onto the layer of the verified accounts, establishing an edge between two of them if the number of their common neighbors is significantly higher than expected under a proper null-model. When this happens, we can assert that the two verified users refer to the same audience and, therefore, they probably share similar content and opinions. The statistical significance of the number of common neighbors can be established only by comparing it with the predictions of an accurate benchmark, which, in this case, is represented by the Bipartite Configuration Model (BiCM, [42]), an entropy-based model suited for bipartite networks. A complete description of the model and the projection procedure can be found in Section 4.

The result of the above procedure is a monopartite network of verified users. We further obtain a partition in communities by implementing the Louvain algorithm [43] for the optimization of the modularity, with a slight modification. The standard definition of the modularity [44] implements the Chung-Lu null-model [45]. The Chung-Lu null-model can be considered a sparse-matrix approximation of the entropy-based null-model defined in [46] and, indeed, returns wrong results in the presence of strong hubs [26]. We thus replaced the Chung-Lu null-model in the modularity with the unipartite configuration model (UCM) defined in [46]; more details can be found in Section 4. For all the datasets, looking at the members of each discursive community, we can a posteriori associate the latter to a political wing, using the available information for verified users.
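The common-neighbour count that underlies the projection described above can be sketched as follows. This is only a minimal illustration in plain Python: the input format (a list of `(verified_user, non_verified_user)` pairs) and the function name are hypothetical, repeated interactions are assumed to count once, and the BiCM validation step that decides which counts are statistically significant is not reproduced here.

```python
from collections import defaultdict
from itertools import combinations

def common_neighbour_counts(interactions):
    """Count, for each pair of verified users, the non-verified users linked to both.

    `interactions` is an iterable of (verified_user, non_verified_user) pairs,
    one per observed retweet in either direction (hypothetical input format).
    """
    # Assumption: the bipartite network is binary, so repeated
    # interactions between the same two accounts collapse to one link.
    verified_seen_by = defaultdict(set)
    for verified, non_verified in interactions:
        verified_seen_by[non_verified].add(verified)

    counts = defaultdict(int)
    for verified_set in verified_seen_by.values():
        # Each non-verified user adds one common neighbour to every
        # pair of verified users it interacted with.
        for u, v in combinations(sorted(verified_set), 2):
            counts[(u, v)] += 1
    return dict(counts)

# Toy example with made-up account names.
edges = [("news_a", "u1"), ("news_b", "u1"), ("news_a", "u2"),
         ("news_b", "u2"), ("news_c", "u2"), ("news_a", "u2")]
counts = common_neighbour_counts(edges)
```

In the actual procedure, each of these counts would then be compared with its distribution under the null-model before an edge is placed between the two verified users.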
We thus obtain clusters of users (even if we cannot characterise them on the basis of other topological quantities [47]) which represent the main wings of the political scenario of each of the examined countries. In addition, in almost all the datasets, we also identify a Media cluster, with official accounts of newspapers, TV-channels, radio and other media. In the Appendix, the interested reader can find a complete description of all the discursive communities for the Italian Covid-19 dataset. For the other datasets, a brief description of their discursive communities is in the Supplementary Material.

The next step in our procedure consists in extending the discursive communities to non-verified accounts. In more detail, following the approach in Ref. [28], we use the membership of verified users as (fixed) seeds for the label propagation algorithm proposed by Raghavan et al. [24] on the retweet network. This network is a monopartite and directed one in which nodes represent users and a link between them indicates that one user has retweeted the other one at least once: the single edge starts from the retweeted user and is directed towards the one who retweets. Let us recall that, in case the algorithm cannot find a dominant label for a specific vertex (i.e., in case of a tie), it randomly removes some of the edges attached to that vertex and repeats the procedure: for this reason, we run the label propagation 500 times and assign to each node the most frequent label (actually, the noise in the assignment of the labels is extremely limited). Fig. 2 shows the percentages of nodes placed in the various discursive communities for the Italian Covid-19 dataset (a detailed description of the various communities can be found in the caption of the figure). Considering also the other datasets, in almost all the cases, the label propagation procedure could assign a label to approximately 90% of the nodes.
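A seeded label propagation with repeated runs and a majority vote, as described above, might look like the sketch below. It is not the implementation used in the paper: the function names are hypothetical, edge direction is ignored when collecting neighbours (an assumption), and ties are broken by a random choice among the most frequent labels rather than by removing edges.

```python
import random
from collections import Counter, defaultdict

def propagate_once(neighbours, seeds, rng, max_sweeps=100):
    """One run of label propagation in which seed labels never change."""
    labels = dict(seeds)
    free = [n for n in neighbours if n not in seeds]
    for _ in range(max_sweeps):
        rng.shuffle(free)  # asynchronous updates in random order
        changed = False
        for node in free:
            votes = Counter(labels[m] for m in neighbours[node] if m in labels)
            if not votes:
                continue  # no labelled neighbour yet
            top = max(votes.values())
            best = sorted(lab for lab, c in votes.items() if c == top)
            if labels.get(node) in best:
                continue  # current label is already among the most frequent
            labels[node] = rng.choice(best)  # random tie-breaking (assumption)
            changed = True
        if not changed:
            break
    return labels

def consensus_labels(edges, seeds, runs=500, seed=0):
    """Repeat the propagation and keep, for each node, its most frequent label."""
    neighbours = defaultdict(set)
    for a, b in edges:  # direction ignored here (assumption)
        neighbours[a].add(b)
        neighbours[b].add(a)
    rng = random.Random(seed)
    tally = defaultdict(Counter)
    for _ in range(runs):
        for node, label in propagate_once(neighbours, seeds, rng).items():
            tally[node][label] += 1
    return {node: c.most_common(1)[0][0] for node, c in tally.items()}

# Toy retweet network: two seeded groups and one pair without any seed.
toy_edges = [("a0", "a1"), ("a0", "a2"), ("a1", "a2"),
             ("b0", "b1"), ("b0", "b2"), ("b1", "b2"), ("c", "d")]
toy_seeds = {"a0": "left", "b0": "right"}
result = consensus_labels(toy_edges, toy_seeds, runs=20)
```

Nodes that never receive a label in any run (here `c` and `d`) are simply absent from the result, mirroring the accounts to which no discursive community could be assigned.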
As we could expect, in the Covid-19 datasets, the Media community is always the most numerous one: updates on the spread of the pandemic, written by the official accounts of various media, received a great amount of retweets. As highlighted in other works [20, 25, 28, 29, [31] [32] [33], the presence of well-defined discursive communities is the signal that users on Online Social Networks (OSNs) are strongly polarized, i.e., they tend to split into groups, each one with the same opinions and political orientation.

The original concept of bow-tie by Broder et al. [34] sees the WWW divided into 3 main sectors: a Strongly Connected Component (SCC), in which each node can be reached from any other one in the same block, following the direction of the links; a group of nodes that can reach SCC, without being reached by it (called IN); a group of nodes that can be reached by SCC, but that cannot reach it (the OUT block). The description by Broder et al. was subsequently refined by Yang et al. [35], who split the network in seven distinct parts 7 :
• the greatest Strongly Connected Component (SCC);
• the IN block;
• the OUT block;
• the TUBES sector, including nodes reachable from IN and accessing OUT, but not being part of SCC;
• the INTENDRILS group, collecting all those nodes pointed by IN that cannot reach the OUT block;
• the OUTTENDRILS sector, containing all those nodes pointing to OUT that cannot be reached from nodes in IN;
• the OTHERS group, including all those nodes that cannot be placed in one of the previous six sectors.
In Fig. 1 there is a schematic representation of the bow-tie structure defined in Ref. [35]. The seven groups of nodes are mutually disjoint. We remark that every directed network can be divided in blocks using the bow-tie decomposition. Nevertheless, as a rule of thumb, the bow-tie representation is informative about the network structure if the number of nodes in blocks other than OTHERS is greater than or of the same order as those in OTHERS: the greater the impact of the non-OTHERS blocks, the more informative the bow-tie structure is.

Figure 2: Percentages of nodes in each discursive community, Italian Covid-19 dataset. Due to the presence of politicians and political parties from a specific political area, the various discursive communities are named after their political alignment. "PD" stands for the Italian Democratic Party (Partito Democratico); Italia Viva ("IV") is the political party of the former Prime Minister and former PD secretary Matteo Renzi, while M5S is the "Movimento 5 Stelle", a political movement born on the web and the most represented party in the Italian parliament at the time of the data collection. "FI" stands for Forza Italia, the political party of the former Prime Minister Silvio Berlusconi, while the "DX" (Destra) community includes right wing parties such as Lega and Fratelli d'Italia. The most crowded discursive community is the one of Media, in which there are most of the online news outlets and newspapers. The accounts for which it was not possible to assign a discursive community are in grey.

In the present manuscript, we investigate the presence of a bow-tie structure in the discursive communities of the retweet network, i.e., in the network composed by Twitter accounts (the nodes) and retweets (the links connecting the original author to the retweeter). Results show that, when considering political online debates, a bow-tie structure is informative in almost every discursive community of our datasets, while for non-political debates (as in the case of Euro2020), the bow-tie structure is less informative. Euro2020 itself records the extreme case in which more than one half of the nodes are in the OTHERS sector. We state that this bow-tie structure is uninformative; see, for example, the case of the Turkish debate during Euro2020 in Fig. 8.
We remark that the presence of informative bow-ties in many of the discursive communities here investigated is not a trivial result. Indeed, there are no evident reasons for expecting such a distribution of the nodes a priori. When a bow-tie structure is informative, we observe two recurrent situations in the investigated datasets and, according to them, we classify the bow-tie into two different categories:
• When the OTHERS block is smaller than SCC, we will refer to strong bow-tie structures;
• When the OTHERS block is greater than SCC, we will refer to weak bow-tie structures.
Furthermore, when the bow-tie is informative, be it weak or strong, we can categorize it in two different ways, which we call respectively OUT-dominant and INTEND-dominant. In OUT-dominant bow-ties, most of the nodes of the bow-tie are placed in the OUT sector. As a rule of thumb, OUT-dominant bow-ties are more frequent when the bow-tie is strong, but we can find some counter-examples. The INTEND-dominant bow-tie is instead a bow-tie structure in which most of the nodes are located in the INTENDRILS sector, i.e., in which most of the users retweet accounts from the IN zone and have little to no interaction with the users in the other sectors. INTEND-dominant bow-ties are in general more frequent among weak bow-ties. We highlight that it is not so strange that the most crowded blocks in the bow-ties are OUT and INTENDRILS: it was already observed in Ref. [48] that most users tend to mostly retweet content created by others and limit their production of new messages. The difference between OUT-dominant and INTEND-dominant bow-ties is the access to information: OUT-dominant bow-ties are those in which the majority of users can access almost all messages exchanged over the discursive community, while in the INTEND-dominant ones the majority of users limit their retweets to the content produced by accounts in the IN block.
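The seven-sector decomposition and the classification introduced above can be sketched in a few lines of plain Python. The function names are hypothetical and the thresholds in `classify` are assumptions made for illustration: a bow-tie is treated as uninformative when OTHERS holds more than half of the nodes (the extreme case mentioned for Euro2020), the case of OTHERS equal to SCC is treated as weak, and dominance is read off the single largest sector. The largest SCC is found by brute-force reachability, which is adequate for a small example but not for networks of realistic size.

```python
from collections import Counter

def reach(starts, adj):
    """Return all nodes reachable from `starts` (included) along `adj`."""
    seen, stack = set(starts), list(starts)
    while stack:
        for nxt in adj.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def bow_tie(edges):
    """Map each node to one of the seven sectors of the decomposition."""
    fwd, bwd, nodes = {}, {}, set()
    for src, dst in edges:  # src is the retweeted user, dst the retweeter
        fwd.setdefault(src, set()).add(dst)
        bwd.setdefault(dst, set()).add(src)
        nodes.update((src, dst))
    # Largest SCC: the nodes that both reach and are reached by a given node.
    scc, assigned = set(), set()
    for node in sorted(nodes):
        if node in assigned:
            continue
        component = reach([node], fwd) & reach([node], bwd)
        assigned |= component
        if len(component) > len(scc):
            scc = component
    in_block = reach(scc, bwd) - scc
    out_block = reach(scc, fwd) - scc
    from_in = reach(in_block, fwd)   # everything reachable from IN
    to_out = reach(out_block, bwd)   # everything that can reach OUT
    sector = {}
    for node in nodes:
        if node in scc:
            sector[node] = "SCC"
        elif node in in_block:
            sector[node] = "IN"
        elif node in out_block:
            sector[node] = "OUT"
        elif node in from_in and node in to_out:
            sector[node] = "TUBES"
        elif node in from_in:
            sector[node] = "INTENDRILS"
        elif node in to_out:
            sector[node] = "OUTTENDRILS"
        else:
            sector[node] = "OTHERS"
    return sector

def classify(sector):
    """Return (strength, dominance) under the assumptions stated above."""
    sizes = Counter(sector.values())
    if sizes["OTHERS"] > sum(sizes.values()) / 2:
        strength = "uninformative"
    elif sizes["OTHERS"] < sizes["SCC"]:
        strength = "strong"
    else:
        strength = "weak"
    largest = max(sizes, key=sizes.get)
    dominance = {"OUT": "OUT-dominant", "INTENDRILS": "INTEND-dominant"}.get(largest)
    return strength, dominance

# Toy network with made-up node names, one or more nodes per sector.
toy = [("a", "b"), ("b", "c"), ("c", "a"), ("i", "a"), ("c", "o1"),
       ("c", "o2"), ("b", "o3"), ("a", "o4"), ("i", "t"), ("x", "o1"),
       ("i", "u"), ("u", "o2"), ("x", "y")]
sector = bow_tie(toy)
kind = classify(sector)
```

On this toy network the OUT sector is the largest block and OTHERS is smaller than SCC, so the sketch labels it a strong, OUT-dominant bow-tie.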
Otherwise stated, the main difference between INTEND- and OUT-dominant bow-tie structures is that the former displays a more 'hierarchical' structure, i.e., few accounts (those in the IN sector) introduce new content and many others just share it (the INTENDRILS sector). Instead, in OUT-dominant bow-ties, most of the users (i.e., the OUT block) not only share posts by accounts in the IN block, but also retweet content by users in SCC, OUTTENDRILS and TUBES. We argue that this behaviour, while more 'democratic', is, at the same time, more risky. In fact, we will see in Subsection 3.3 that users with high visibility that introduce new content on Twitter can be found mostly in the IN sector: typically, they are verified accounts. As observed in other studies, see, e.g., Ref. [32], verified users tend to limit the spreading of low-quality content. We may argue, then, that users interacting mostly with verified users are safer from m/disinformation campaigns. In the following, we will see that the reputability of the information shared confirms our hypothesis and we will come back to the matter.

A single node represents one bow-tie sector and its dimension is proportional to the number of accounts in it.

First, according to the definitions given above, the bow-tie structure is informative in all the discursive communities. In the cases of DX and IV, the bow-tie is particularly informative: its blocks include respectively 96.5% and 98.3% of the entire discursive community. Second, different discursive communities display bow-ties with different strengths. For instance, the DX and IV discursive communities display strong bow-ties, while M5S, Media, PD and FI have weak ones, since their SCCs are relatively small (and smaller than OTHERS).
Third, the graph shows that the DX, IV, MEDIA and FI communities display OUT-dominant bow-ties, in which the OUT sector is the biggest one; considering all the investigated datasets, OUT-dominant bow-ties represent the most frequent configuration, accounting for 21 out of 31 communities. Instead, 6 out of 31 discursive communities are INTEND-dominant bow-ties (such as PD and M5S in Fig. 3). We remark that, in all our datasets, all the right wing discursive communities display bow-ties with an OUT-dominant structure; in most of the cases, these bow-ties are also strong. The colours of the nodes in Fig. 3 are going to be explained in the following section.

It may be argued that the bow-tie structures featured by the discursive communities in our datasets are just an accident, due to the different roles of the various users in the debate. In fact, those accounts that have high out-degrees and low in-degrees are naturally in the IN sector; those that, vice versa, have high in-degrees and low out-degrees are in the OUT sector, and so on. To test whether the presence of bow-ties is merely attributable to the behavioral characteristics of the accounts, we compare the dimensions of the different sectors, as observed in the real network, with those in a randomised system in which the in- and out-degree sequences are fixed. If the partition in the various bow-tie sectors were just a matter of the degree sequence, none of the dimensions of the various blocks should be statistically significant. Otherwise, we should observe a significant mismatch with respect to the expectation of the null-model. In order to have an unbiased benchmark, we build an entropy-based null-model that preserves the in- and out-degree sequences, being maximally random for all the rest (see Ref. [26] for a review on the subject). Summarising, starting from a real network, we consider the set of all possible graph realizations (the graph ensemble) having the same number of nodes as in the real system.
Then, we assign to each representative of the ensemble a different probability of realization by maximising the entropy of the ensemble, while constraining the average value of some topological property of the real network (in our case, the in- and out-degree sequences). In this way, even if the single realization of the ensemble does not display the network properties that we would like to preserve, the entire ensemble, on average, does. In recent years, such a procedure has been adopted to analyse financial and economic systems [36, 42, 46, [49] [50] [51] [52] [53] [54] [55] [56] [57] [58] [59] [60] [61] [62] [63] [64] [65], biological networks [66] [67] [68] [69] and online social networks [20, 25, [28] [29] [30] [31] [32] [33] 70] and it was shown to be effective in extracting the relevant structure from a real network [71, 72]. Here, we implement the Directed Configuration Model (DCM), first introduced in Ref. [36] and implemented in the Python module NEMtropy [65]. More details on the exact derivation of DCM can be found in Subsection 4.3.

Going back to Fig. 3, the colour of the circles indicates the agreement between the actual size of the bow-tie sectors and the size predicted by the DCM: we are interested in detecting both too "big" and too "small" blocks. In particular, the darker the colour of the sectors in Fig. 3, the larger the -log10(p-value) (so the lower the p-value) and the greater the disagreement of the real system with the randomization. For each sector, the two-tailed p-value has been calculated from a sample of 1000 graphs generated by the DCM. The p-value tells us about the existence of a disagreement, but not about the direction of the disagreement. For instance, looking at the DX bow-tie in Fig. 3, both the dimensions of OTHERS and SCC have a really small p-value, thus they do not agree with the randomization, but the OTHERS block is smaller than predicted by the DCM, while SCC is larger.
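To make the comparison with the sampled ensemble concrete, the sketch below computes an empirical two-tailed p-value for the size of one sector, together with the direction of the disagreement. The exact estimator used in the paper is not specified in this section, so the formula here (twice the smaller of the two tail frequencies, capped at 1) is an assumption, as are the function name and the input: a plain list of sector sizes measured on graphs sampled from the null-model.

```python
def two_tailed_p_value(observed, sampled):
    """Empirical two-tailed p-value of `observed` against `sampled` sizes.

    Assumption: p = 2 * min(P(sample <= observed), P(sample >= observed)),
    capped at 1. Also returns whether the observed block is "larger" or
    "smaller" than the ensemble mean, since the p-value alone does not
    say in which direction the real network deviates.
    """
    n = len(sampled)
    lower = sum(s <= observed for s in sampled) / n
    upper = sum(s >= observed for s in sampled) / n
    p_value = min(1.0, 2 * min(lower, upper))
    mean = sum(sampled) / n
    if observed > mean:
        direction = "larger"
    elif observed < mean:
        direction = "smaller"
    else:
        direction = "equal"
    return p_value, direction

# Made-up sample standing in for sector sizes measured on sampled graphs.
sampled_sizes = list(range(100))
p_small, dir_small = two_tailed_p_value(2, sampled_sizes)
p_large, dir_large = two_tailed_p_value(150, sampled_sizes)
```

With a finite sample the estimate can be exactly zero, as for the second call above; in that case -log10(p-value) is unbounded and can only be said to exceed the resolution set by the sample size.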
Table 1 reports the exact p-values of the different blocks for the various bow-ties of Fig. 3. The significance of the blocks for each bow-tie can be assessed by using the False Discovery Rate (FDR) correction [73], setting the statistical significance level to α = 0.01. In the present case the correction is limited, due to the small number of blocks in the bow-tie. It is interesting to observe that, in both strong and weak bow-ties, the OTHERS block is statistically significant in all the discursive communities but PD. In particular, the dimension of the OTHERS block is much smaller than predicted by the null-model and the presence of the bow-tie is not due to the degree sequence only.

[73] ). In the table, validated p-values are marked by an asterisk ' * '. The OTHERS block is statistically significant (in particular it is smaller than in the randomization) for all discursive communities but the PD one. It is remarkable that the dimension of SCC is significant in all strong bow-ties, while the one of OUT is significant only for the IV bow-tie.

SCC is statistically significant (and bigger than expected) for all bow-ties but FI and PD. The IN block is often statistically significant and smaller than expected. We may notice that in the strong bow-tie of the IV discursive community the dimensions of all sectors are statistically significant, while none are in the PD bow-tie, which is the smallest discursive community. It is worth noting that the dimension of the discursive community also plays a role: due to the limited possible variability, smaller bow-ties feature more agreement with the model.

Usually, verified accounts on Twitter belong to public characters and organizations, such as journalists, politicians, actors, political parties, media, and VIPs in general. Previous studies testify that verified users tend to introduce new content and have high visibility on the platform [20, 25, 28, 32, 48]. Thus, we expect to find them in the IN block.
The results in In those communities where the bow-tie structure is not informative (right panel, Fig. 4), a high percentage (42.9%) of verified users, on average, is in the OTHERS sector. In a few cases of non-informative bow-ties, verified users are mostly in the OUTTENDRILS sector. In this last case, their messages hardly reach a big audience and are simply retweeted by a group of strong retweeters (OUT sector), not catching the interest of the accounts in the SCC. Let us remark that in the case of non-informative bow-ties the dimension of the OUT and SCC blocks is nevertheless limited.

The bar-charts in Fig. 6 show the percentage of nodes, the percentage of edges and the number of edges per node in the Strongly Connected Component, for each discursive community of the Italian Covid-19 dataset. Not only is DX the community with the greatest number of nodes and the greatest number of links in SCC, but the link density of SCC in DX is also much greater than that of any other discursive community. Thus, the number of links in the SCC of DX is not proportional to the number of nodes, resulting in a greater average degree per node. We found very similar behaviours also for the right-oriented communities of the other datasets. In fact, in all our datasets, the discursive communities of conservative groups (i.e., DX in the Italian dataset, AfD in the German one, Conservatives in the Dutch one) are those with the highest percentage of nodes and, especially, of edges within SCC. This peculiar feature signals the presence of a common (self-)organization of accounts in line with conservative ideas on Twitter.

NewsGuard 8 is an independent software toolkit that monitors the quality and transparency of several news websites worldwide. Through the tags that NewsGuard has assigned to news sites whose links appear in the retweets of our communities, we are able to quantify the amount of retweets containing untrustworthy URLs.
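As an illustration of how such a quantification could be organised, the sketch below computes, for each ordered pair of bow-tie sectors, the percentage of retweets whose URL points to a domain tagged as untrustworthy. Everything about the input is hypothetical: the `(author, retweeter, url)` tuples, the `sector_of` mapping and the set of flagged domains are stand-ins, and no NewsGuard interface is used or implied.

```python
from collections import Counter
from urllib.parse import urlparse

def untrustworthy_share(retweets, sector_of, untrustworthy_domains):
    """Percentage of retweets with a flagged URL, per (author, retweeter) sector pair.

    `retweets` holds (author, retweeter, url) tuples, `sector_of` maps each
    account to its bow-tie sector, and `untrustworthy_domains` is a set of
    flagged domain names (all hypothetical inputs).
    """
    total, flagged = Counter(), Counter()
    for author, retweeter, url in retweets:
        pair = (sector_of[author], sector_of[retweeter])
        total[pair] += 1
        domain = urlparse(url).netloc.lower()
        if domain.startswith("www."):
            domain = domain[4:]  # compare bare domain names
        if domain in untrustworthy_domains:
            flagged[pair] += 1
    return {pair: 100.0 * flagged[pair] / total[pair] for pair in total}

# Toy data with made-up accounts and reserved example domains.
toy_retweets = [
    ("s1", "o1", "https://www.badnews.example/a"),
    ("s1", "s2", "https://goodnews.example/b"),
    ("s2", "o1", "http://badnews.example/c"),
    ("i1", "s1", "https://goodnews.example/d"),
]
toy_sectors = {"s1": "SCC", "s2": "SCC", "o1": "OUT", "i1": "IN"}
shares = untrustworthy_share(toy_retweets, toy_sectors, {"badnews.example"})
```

Keeping the sector pair as the key makes it straightforward to separate retweets circulating within one sector from those flowing between two sectors.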
The recurrent situation is that almost only the conservative discursive communities display retweets with such URLs. For the Italian Covid-19 dataset, the DX group has 26,318 retweets with links to untrustworthy webpages of news sites, many more than in other communities: 1,356 retweets for M5S, 78 retweets for IV, 20 retweets for MEDIA, 9 retweets for FI and 0 for the PD group. A very similar situation has been found for the other datasets, see Supplementary Materials. Another interesting aspect is that most of the retweets containing unreliable URLs originate in the strongly connected component. Fig. 7 shows in red the percentage of retweets containing URLs of untrustworthy news pages within and between the sectors of the bow-tie structure for the DX group. The highest percentage can be found in SCC and between SCC and OUT. Again, this is a recurrent situation also for the conservative communities of the other datasets under investigation.

Here, we devote a specific section to the case of the European football championship (EURO2020) dataset 9. This dataset features less divisive, less debated, and less discussed topics. The topics of all the other datasets either have a strong political nature or are debated from sharply different positions. We then analyze whether the fact that topics are less discussed/debated has anything to do with the presence, or absence, of a bow-tie structure in the EURO2020 dataset.

8 https://www.newsguardtech.com/it/
9 We do this for academic reasons, and not because Italy won the Euro2020 championship.

Figure 5: Percentage of verified accounts in the bow-tie sectors for each discursive community of the Covid-19 dataset. The bar-charts confirm that verified accounts are mainly located in the IN sector and, to a lesser extent, in the SCC one. Only for the PD group, which has an INTEND-dominant bow-tie structure, verified accounts are mostly placed in the INTENDRILS block.
We identified 5 discursive communities for the Italian dataset and 2 discursive communities for the Turkish one. Of these 7, 4 do not have an informative bow-tie structure (in fact, most of the nodes are in OTHERS), and the other three have a weak one (OTHERS is smaller than the weakly connected component of the bow-tie, but still greater than the strongly connected one). Fig. 8 reports the bow-tie structures of the two discursive communities in the Turkish dataset. The SPORTS group contains the official accounts of football players and clubs, and of sports newspapers. AK refers to the Justice and Development Party (Turkish: Adalet ve Kalkınma Partisi, AKP), which is a conservative political party in Turkey including President Erdogan and his ministers. While SPORTS does not display any informative bow-tie, AK has a weak one. Following our interpretation, the latter displays a more hierarchical conversation on Twitter, in which the SCC is not numerous. Moreover, the dimensions of the sectors are mostly not statistically significant. For the Italian case (Fig. 9) the main discursive community is formed by football players, sports newspapers and journalists. There is also a MEDIA community, containing accounts of Italian media.

Figure 6: In the Italian Covid-19 dataset, the conservative and right-oriented discursive community (DX) has a more numerous and denser SCC, as displayed in the top two panels. In the bottom panel, it can be seen that, also considering the number of links per node in SCC, DX is again the first discursive community. These results hold for all the conservative groups in all the datasets under investigation.

In order to create the various discursive communities we needed an appropriate null-model as benchmark for identifying those verified users that share the same audience.
In this sense, it is necessary to compare the observed quantities with accurate predictions in order to state their significance: in fact, the audiences of two verified users may appear similar just because both users are extremely active. We represent the interaction between verified accounts (the ones whose identity is certified by the Twitter platform) and unverified ones (i.e. all the others) via a bipartite undirected binary network in which a link connects a verified user to an unverified one if there is at least one retweet between the two, in either direction. Since the information about the number of different accounts interacting, via tweet or retweet, with a user is encoded, in this representation, in the degree sequence of the nodes of both layers, we need a benchmark discounting it. The natural choice is an entropy-based null-model, since it provides, by definition, an unbiased framework [26]: the null-model is maximally random, but for the constraints imposed on the system. The bipartite null-model discounting the degree sequence is the Bipartite Configuration Model (BiCM, [42]). In the present section we will briefly review the steps of its definition.

Figure 9: The bow-tie structure of the discursive communities for the Italian EURO2020 dataset. The dimension of the sectors is proportional to the number of nodes therein and the color quantifies the distance between the observed and the predicted dimension. The main discursive community is formed by football players, sports newspapers and journalists. Then, we identified a MEDIA community, containing accounts of Italian media, and three small political communities (DX, IV, M5S). MEDIA, DX and IV do not display an informative bow-tie structure (respectively 74%, 81.2% and 63.6% of the nodes in OTHERS), while FOOTBALLERS and M5S show a weak bow-tie (respectively 81.1% and 75.7% of nodes in INTENDRILS).
The two layers of the bipartite network, $\top$ and $\bot$, have dimension, respectively, $N_\top$ and $N_\bot$; in the following, Latin indices will be used to identify nodes on the $\top$ layer, while Greek ones will be used for the $\bot$ layer. Then, the bipartite network can be represented by its biadjacency matrix, i.e. an $N_\top \times N_\bot$ matrix $\mathbf{M}$ whose generic entry $m_{i\alpha}$ is 1 if the node $i \in \top$ is connected to the node $\alpha \in \bot$ and 0 otherwise. Let us start from a real bipartite network $G^*_{Bi}$ (in the following, all quantities denoted by a $*$ will indicate those measured on the real network). First, let us define an ensemble of graphs $\mathcal{G}_{Bi}$, i.e. the set of all the possible bipartite graphs having the same number of nodes as $G^*_{Bi}$, but with all different topologies, from the fully connected to the empty one. Then, we can define the Shannon entropy over the ensemble, by assigning a probability to each of its elements:

$$S = -\sum_{G_{Bi} \in \mathcal{G}_{Bi}} P(G_{Bi}) \ln P(G_{Bi}),$$

where $P(G_{Bi})$ is the probability of the generic element $G_{Bi}$ of the graph ensemble $\mathcal{G}_{Bi}$. Let us now maximise the entropy, while constraining the network degrees: in particular, we want the ensemble average of the degrees to match the values observed on the real network, in order to have a null-model tailored to the real system. In terms of the biadjacency matrix, the degree sequences of the $\top$ and $\bot$ layers respectively read $k_i = \sum_\alpha m_{i\alpha}$ and $h_\alpha = \sum_i m_{i\alpha}$. Using the method of Lagrange multipliers, the constrained maximisation can be expressed as the maximisation of $S'$, defined as

$$S' = S + \sum_{i} \eta_i \Big[ k_i^* - \sum_{G_{Bi} \in \mathcal{G}_{Bi}} P(G_{Bi})\, k_i(G_{Bi}) \Big] + \sum_{\alpha} \theta_\alpha \Big[ h_\alpha^* - \sum_{G_{Bi} \in \mathcal{G}_{Bi}} P(G_{Bi})\, h_\alpha(G_{Bi}) \Big] + \zeta \Big[ 1 - \sum_{G_{Bi} \in \mathcal{G}_{Bi}} P(G_{Bi}) \Big],$$

where $S$ is the Shannon entropy defined above, $\eta_i$ and $\theta_\alpha$ are the Lagrange multipliers relative to the degree sequences, respectively, on $\top$ and $\bot$, and $\zeta$ is the one relative to the probability normalization. Maximising $S'$ leads to a probability per graph $G_{Bi} \in \mathcal{G}_{Bi}$ that can be factorised in terms of the probabilities per link $p_{i\alpha}$ [74], i.e.

$$P(G_{Bi}) = \prod_{i, \alpha} p_{i\alpha}^{m_{i\alpha}} (1 - p_{i\alpha})^{1 - m_{i\alpha}},$$

where $p_{i\alpha} = \dfrac{e^{-\eta_i - \theta_\alpha}}{1 + e^{-\eta_i - \theta_\alpha}}$. Nevertheless, at this level the above equation is just formal, since we do not know the numerical values of $\eta_i$ and $\theta_\alpha$.
To this aim, we can then maximise the likelihood of the real network [46, 75]; it can be shown that the likelihood maximisation is equivalent to imposing

$$\langle k_i \rangle = \sum_\alpha p_{i\alpha} = k_i^* \quad \forall i \in \top, \qquad \langle h_\alpha \rangle = \sum_i p_{i\alpha} = h_\alpha^* \quad \forall \alpha \in \bot.$$

We want to infer similarities among nodes on the same layer. We can use as a measure of similarity the number of common neighbours: for each pair of verified users, the number of unverified users that have interacted, via tweet or retweet, with both. Let us assume, without loss of generality, that we want to project the information contained in the bipartite network onto the $\top$ layer and call $V_{ij}$ the number of common neighbours between nodes $i, j \in \top$ 10. In terms of the biadjacency matrix, $V_{ij}$ can be expressed as

$$V_{ij} = \sum_\alpha m_{i\alpha} m_{j\alpha} = \sum_\alpha V^\alpha_{ij},$$

where we have defined $V^\alpha_{ij} = m_{i\alpha} m_{j\alpha}$; $V^\alpha_{ij} = 1$ if both $i$ and $j$ are connected to node $\alpha \in \bot$ and 0 otherwise. Let us now compare the observed $V_{ij}$ for each possible pair of nodes in $\top$ with the prediction of the BiCM. Since link probabilities are independent, the presence of each V-motif $V^\alpha_{ij}$ can be regarded as the outcome of a Bernoulli trial:

$$f_{\mathrm{Ber}}(V^\alpha_{ij} = 1) = p_{i\alpha} p_{j\alpha}, \qquad f_{\mathrm{Ber}}(V^\alpha_{ij} = 0) = 1 - p_{i\alpha} p_{j\alpha}.$$

In general, the probability of observing $V_{ij} = n$ can be expressed as a sum of contributions, running over the n-tuples of nodes of the opposite layer (in this case, the ones belonging to the $\bot$ layer). Indicating with $A_n$ all possible n-tuples of nodes in the $\bot$ layer, this probability amounts to

$$f_{\mathrm{PB}}(V_{ij} = n) = \sum_{A_n} \Big[ \prod_{\alpha \in A_n} p_{i\alpha} p_{j\alpha} \Big] \Big[ \prod_{\alpha' \notin A_n} \big( 1 - p_{i\alpha'} p_{j\alpha'} \big) \Big], \tag{2}$$

where the second product runs over the complement set of $A_n$. Eq. (2) represents the generalization of the usual Binomial distribution to the case in which the single Bernoulli trials have different probabilities, also known as the Poisson Binomial distribution [76]. We can, then, verify the statistical significance of the observed co-occurrences by calculating their p-value according to the distribution in Eq. (2), i.e. the probability of observing a number of co-occurrences greater than, or equal to, the observed one:

$$\text{p-value}(V^*_{ij}) = \sum_{V_{ij} \geq V^*_{ij}} f_{\mathrm{PB}}(V_{ij}).$$

Repeating this calculation for every pair of nodes, we obtain $\binom{N_\top}{2}$ p-values.
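As an illustration of the p-value computation above, the following minimal Python sketch builds the Poisson Binomial distribution of Eq. (2) by adding one Bernoulli trial at a time, which avoids the explicit sum over n-tuples. The function names and the toy probabilities are assumptions made for this example; in a real analysis the link probabilities would come from the solved BiCM.

```python
def poisson_binomial_pmf(probs):
    """Return a list pmf where pmf[n] is the probability of n successes
    among independent Bernoulli trials with success probabilities `probs`."""
    pmf = [1.0]  # with zero trials, zero successes is certain
    for p in probs:
        updated = [0.0] * (len(pmf) + 1)
        for n, mass in enumerate(pmf):
            updated[n] += mass * (1.0 - p)  # the new trial fails
            updated[n + 1] += mass * p      # the new trial succeeds
        pmf = updated
    return pmf


def v_motif_pvalue(p_i, p_j, v_observed):
    """P(V_ij >= v_observed) for two nodes of the same layer.

    p_i and p_j are the (hypothetical) link probabilities of nodes i and j
    towards every node of the opposite layer, in the same order."""
    # Each common neighbour is a Bernoulli trial with probability p_i * p_j.
    trial_probs = [a * b for a, b in zip(p_i, p_j)]
    pmf = poisson_binomial_pmf(trial_probs)
    return sum(pmf[v_observed:])


# Toy example with two nodes in the opposite layer.
p_value = v_motif_pvalue([0.5, 0.5], [1.0, 1.0], v_observed=1)
```

When all trial probabilities are equal, the result coincides with the usual Binomial tail, which gives a quick sanity check of the routine.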
In order to state the statistical significance of the hypotheses belonging to this group, it is necessary to adopt a multiple hypothesis testing correction; in the present paper, we use the False Discovery Rate (FDR, [77]), since it controls the expected fraction of false positives among the validated hypotheses.

From the entire retweet network, a directed network in which the accounts are the nodes and an arrow points from the author of a post to its retweeter, we extracted the subgraphs of the various discursive communities. Then, in order to compare the observed dimensions of the bow-tie sectors of these subgraphs and state their statistical significance, we adopted the Directed Configuration Model (DCM), which is the entropy-based model suited for directed monopartite networks [36]. For directed networks, the adjacency matrix is (in general) not symmetric, and each node $i$ is characterized by two degrees: the out-degree $k^{out}_i = \sum_j a_{ij}$ and the in-degree $k^{in}_i = \sum_j a_{ji}$, where $a_{ij}$ is the generic entry of the (directed) adjacency matrix $\mathbf{A}$. The DCM is therefore defined as the ensemble of directed networks with given out-degree and in-degree sequences. Using the same machinery as in the previous subsection 4.1, it is possible to derive a probability per graph: if $G_D$ is the generic representative of the ensemble of directed graphs $\mathcal{G}_D$, then the probability per graph $P(G_D)$ reads:

$$P(G_D) = \prod_{i \neq j} q_{ij}^{a_{ij}} (1 - q_{ij})^{1 - a_{ij}}.$$

Thus, again the probability per graph factorises in terms of probabilities per link $q_{ij}$, which can be expressed in terms of Lagrange multipliers as

$$q_{ij} = \frac{e^{-\gamma_i - \delta_j}}{1 + e^{-\gamma_i - \delta_j}},$$

where $\gamma_i$ and $\delta_j$ are the Lagrange multipliers associated, respectively, to the out-degree of node $i$ and to the in-degree of node $j$.
In order to get the numerical values of $\gamma_i$ and $\delta_j$, we can use the maximum likelihood as in the above subsection 4.1, which is equivalent to imposing

$$\langle k^{out}_i \rangle = \sum_{j \neq i} q_{ij} = (k^{out}_i)^*, \qquad \langle k^{in}_i \rangle = \sum_{j \neq i} q_{ji} = (k^{in}_i)^* \quad \forall i.$$

Since the bow-tie decomposition is highly non-linear, in order to calculate the statistical significance of the dimension of the various blocks, we generated a sample of 1000 different graphs for each discursive community, using the probabilities provided by the DCM. Then, we obtained a distribution for the dimensions of the bow-tie sectors by computing the decomposition of each graph in our ensemble. At this point, we could calculate a two-tailed p-value, with a significance level α = 0.01, to quantify the distance between the observed dimensions and those reproduced by the ensemble.

In the present analysis, we inferred the discursive communities from the communities in the validated network of verified users. In particular, we used the modularity-based Louvain algorithm [43]. The modularity [78] compares the number of edges within the actual communities with its expectation under a certain null-model. Modularity can be written as

$$Q = \frac{1}{2m} \sum_{i,j} \big( a_{ij} - p_{ij} \big)\, \delta(C_i, C_j),$$

where $m$ is the total number of links of the network, $a_{ij}$ are the entries of the adjacency matrix, $p_{ij}$ is the probability to have a link between nodes $i$ and $j$ according to the chosen null-model, $C_i$ and $C_j$ are, respectively, the communities of nodes $i$ and $j$, and the Kronecker delta $\delta(C_i, C_j)$ selects all the pairs of nodes contained in the same community (it is equal to 1 if $C_i = C_j$ and to 0 otherwise). In the original definition in Ref. [79], the null-model chosen is the Chung-Lu one [45], which conserves the degree sequence, but is known to be inconsistent for dense networks that present strong hubs [26]. In the present paper we use instead the entropy-based Undirected Configuration Model (UCM) defined in [46, 75]: it can be shown that, in the case of sparse networks, the UCM can be approximated by the Chung-Lu null-model.
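To make the role of the null-model in the modularity concrete, here is a minimal Python sketch that evaluates the modularity of a given partition for an arbitrary matrix of null-model link probabilities. It assumes the common normalisation $Q = \frac{1}{2m}\sum_{i,j}(a_{ij} - p_{ij})\,\delta(C_i, C_j)$ with the sum running over all ordered pairs, and it uses the Chung-Lu probabilities $k_i k_j / 2m$ as a stand-in benchmark; it is not the UCM-based Louvain implementation used in the paper, and the function names are hypothetical.

```python
def chung_lu_probabilities(adjacency):
    """Chung-Lu expected links p_ij = k_i * k_j / (2m) for an undirected graph
    given as a symmetric 0/1 adjacency matrix (list of lists)."""
    degrees = [sum(row) for row in adjacency]
    two_m = sum(degrees)  # every undirected link is counted twice
    return [[ki * kj / two_m for kj in degrees] for ki in degrees]


def modularity(adjacency, null_probs, communities):
    """Modularity of the partition `communities` (one label per node)
    against the null-model probabilities `null_probs`."""
    n = len(adjacency)
    two_m = sum(sum(row) for row in adjacency)
    total = 0.0
    for i in range(n):
        for j in range(n):
            # The Kronecker delta keeps only pairs in the same community.
            if communities[i] == communities[j]:
                total += adjacency[i][j] - null_probs[i][j]
    return total / two_m


# Toy network: two disconnected links, 0-1 and 2-3.
toy_adjacency = [
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
]
toy_null = chung_lu_probabilities(toy_adjacency)
q_split = modularity(toy_adjacency, toy_null, [0, 0, 1, 1])
q_single = modularity(toy_adjacency, toy_null, [0, 0, 0, 0])
```

Swapping `chung_lu_probabilities` for the link probabilities of another null-model changes the benchmark without touching the modularity function, which is the design point the text makes about the UCM.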
In the present paper, we implemented the BiCM, the DCM and the Louvain algorithm using UCM null-models via the Python module NEMtropy, described in Ref. [65].

Bow-tie structures were initially introduced for the description of the World Wide Web (WWW) [34]: websites are represented by nodes in a directed network in which the edges represent the hyperlinks. Broder et al. [34] show that the greatest number of websites belongs to a Weakly Connected Component (WCC) with a peculiar structure, see Fig. 1. In particular, there are 3 main sectors that are crucial for the interpretation of the system: SCC, IN, and OUT. SCC is the main Strongly Connected Component of WCC and contains the greatest subgraph in which each node is reachable from any other node in the same group. In the WWW, the SCC block includes the greatest part of the websites. The IN block contains all nodes that can access the nodes in SCC without being part of it. In the WWW, these are the search engines: they link the greatest number of websites and direct the users to the websites closer to their requests. The OUT block contains all nodes that are reachable from nodes in SCC, without being part of it. In the WWW, these are the authorities, i.e., websites such as Wikipedia, considered as a reference by many websites. Nodes in the bow-tie structure of the WWW represent nearly 75% of the websites in the whole WWW (at least, at the time of publication of Ref. [34]).

In the present manuscript, we analysed eight thematic Twitter datasets in different languages, related to various debates in Europe. We extracted the discourse communities from the datasets and we investigated their network structure. Discourse (or discursive) communities are groups of users that interact among themselves by sharing the content created by others. It was shown that discursive communities tend to mirror the political orientation of users [20-23, 25, 28, 29, 31-33].
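The three main sectors defined above (SCC, IN and OUT) can be extracted from a list of directed edges with a few reachability searches. The following Python sketch is only an illustration under stated assumptions: it uses plain breadth-first searches from the standard library rather than the tools used for the paper, the function names are hypothetical, and all nodes outside the three main sectors are lumped into a single `REST` set instead of being split into the finer blocks (tendrils, OTHERS) discussed in the text.

```python
from collections import defaultdict, deque


def reachable(start, neighbours):
    """Set of nodes reachable from `start` (included) following `neighbours`."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbours[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def bow_tie_core(nodes, edges):
    """Split a directed graph into SCC, IN, OUT and the remaining nodes."""
    successors, predecessors = defaultdict(set), defaultdict(set)
    for source, target in edges:
        successors[source].add(target)
        predecessors[target].add(source)
    nodes = set(nodes) | set(successors) | set(predecessors)
    if not nodes:
        return {"SCC": set(), "IN": set(), "OUT": set(), "REST": set()}

    # Largest strongly connected component: nodes that both reach and are
    # reached by a given node. Quadratic in the worst case, fine for a sketch.
    scc, assigned = set(), set()
    for node in nodes:
        if node in assigned:
            continue
        component = reachable(node, successors) & reachable(node, predecessors)
        assigned |= component
        if len(component) > len(scc):
            scc = component

    seed = next(iter(scc))
    out_sector = reachable(seed, successors) - scc   # reachable from SCC
    in_sector = reachable(seed, predecessors) - scc  # can reach SCC
    rest = nodes - scc - in_sector - out_sector
    return {"SCC": scc, "IN": in_sector, "OUT": out_sector, "REST": rest}


# Toy graph: a -> b <-> c -> d, plus an isolated node e.
sectors = bow_tie_core("abcde", [("a", "b"), ("b", "c"), ("c", "b"), ("c", "d")])
```

For networks of realistic size, a linear-time strongly connected component algorithm from a graph library would be the practical choice; the logic of the decomposition stays the same.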
Discursive communities and bow-ties

A first result of the analysis carried out in this work is that, in almost all the discursive communities extracted from the eight datasets, the WCC of the bow-tie includes the great majority of the accounts. In particular, a bow-tie structure is present in those discursive communities debating politics, e.g. election campaigns (as in the Dutch elections dataset), or debating societal issues, e.g. 'how to handle a pandemic?' (as in the Italian, German and French datasets about Covid-19) or 'how to manage migration fluxes?' (as in the Italian online debate on migrants). Instead, a bow-tie structure is absent when the topic of the discussion is sport, as in the case of the Euro2020 Turkish and Italian datasets. In more detail, we say that the bow-tie is informative if the corresponding WCC includes more than one half of the nodes of the entire discursive community; otherwise it is not informative. In the present datasets, we found that bow-ties are informative in all the discursive communities debating politics. In the case of the Euro2020 dataset, bow-ties are not informative, or, if present, they are extremely weak. When the bow-tie is informative, we found essentially 2 cases: 1) the most crowded block is the OUT one; 2) the most crowded block is the INTENDRILS one. The former is typical of the discursive communities of right-wing parties in all European political/societal debates of our datasets, while the latter is more common in less active political discursive communities in many political/societal datasets.

A closer inspection of the nodes in the various blocks and of the quality of the shared content allows us to better characterise the users in the bow-tie. The first observation is that the greatest part of the verified users, i.e., those accounts for which the identity of the owner has been certified by Twitter, is in the IN sector, in each bow-tie.
This finding is not surprising: as already observed in previous studies, verified users create content and are less active in sharing messages written by others [20, 25, 28, 32, 41, 80]. Verified users are mostly politicians and official accounts of political parties, as well as journalists and official accounts of their newscasts and newspapers. In this sense, a discursive community displaying an INTEND-dominant bow-tie structure (where INTENDRILS is the most crowded block) may appear, at first sight, as a less democratic group: the content is created by a few accounts and shared by a group of followers that limit their interactions to sharing the messages coming from the IN block. Instead, in an OUT-dominant bow-tie, the greatest block is OUT and it can access the content created by all the other blocks in the bow-tie (with the only exception of INTENDRILS), thus having the possibility to intercept every voice in the discursive community.

Actually, the issue lies in the quality of the content created in the various blocks, see Fig. 7. Leveraging our ongoing collaboration with the NewsGuard organization 11, we annotated the URLs that appear in tweets in our datasets, based on the reliability and transparency ratings of the news sites to which those URLs belong (ratings given by NewsGuard). It turns out that the least reliable URLs, in a strong bow-tie, are the ones shared in SCC. The fact that verified accounts are not responsible for the vast majority of m/disinformation sharing was already observed in Ref. [32] and, in the present context, it reflects the fact that accounts in IN are minimally responsible for the spreading of low quality/untrustworthy content. Otherwise stated, when the source of information is not identifiable, the average quality of the content is lower. A largely populated OUT block implies that the greatest part of the accounts has access to a great variety of content, but its quality is lower than in the case of weak bow-ties.
It is also worth considering a peculiarity of right-wing discursive communities: for all of them, the bow-tie is strong (i.e., the dimension of the OTHERS block is smaller than the SCC one) and it is neatly OUT-dominant. While the OUT-dominant configuration is already structurally prone to the diffusion of m/disinformation, this propensity is further emphasized by the extreme activity of the SCC: for instance, in the Covid-19 Italian dataset the link density in the right-wing bow-tie is at least 3 times greater than in any other OUT-dominant strong bow-tie.

Infodemic is a recently introduced neologism that became particularly popular during the Covid-19 pandemic. According to the WHO, "infodemics are an excessive amount of information about a problem, which makes it difficult to identify a solution. Infodemics can spread misinformation, disinformation and rumors during a health emergency. Infodemics can hamper an effective public health response and create confusion and distrust among people" 12. The effects of the recent Covid-19 infodemic, even if debated [39, 81], may put at risk the countermeasures to the spread of an epidemic and are worrisome for policy makers 13. In the present work, we relate the infodemic phenomenon to the specific structure of the discursive communities. As highlighted above, for the investigated datasets:
• OUT-dominant bow-ties are the most affected by low quality content;
• This effect is particularly amplified in right-wing discursive communities, due to the extraordinary dimension and density of their SCC sectors (which strongly contribute to the creation and diffusion of low quality content).

Statistical significance of the analysis

Here, we remark an aspect of our analysis that is of utmost importance. In the analysis of a complex network, it is necessary to consider what is being measured, and what its baseline is. A typical example is the modularity, i.e. one of the most used target functions for community detection.
The problem resides in stating what number of links inside a group of nodes is enough to form a community. In this case, we build a null-model, i.e., a model that preserves part of the properties of the original system while being random for all the rest, to have a proper benchmark for our observations. We then compare the number of edges inside a group of nodes with the one expected by the null-model. Without the null-model, we could not know whether the number of links that bind a group of nodes is due to the degree sequence, or whether it is instead the genuine signal of the presence of a community.

In the present study, we used an entropy-based null-model as a benchmark for our analysis [26, 82]. An entropy-based null-model provides a benchmark that is tailored to the system under analysis. It fixes (on average) some topological quantities to the values observed in the real network and leaves all the rest completely random. Being based on the (Shannon) entropy maximisation, it guarantees that all the possible configurations are considered uniformly (it is 'ergodic', using Statistical Physics jargon), thus it does not introduce any bias in the analysis.

To strengthen the analysis, we study whether the bow-tie structures are due to the degree sequence of the nodes in the various discursive communities. In fact, the size of IN and OUTTENDRILS could simply be due to the presence of many nodes with zero in-degree (an analogous consideration could be made for the OUT and the INTENDRILS blocks, considering, instead, the out-degree). Thus, strong, weak and not informative bow-ties could be due to the degree sequence only, and carry no information of their own. We thus used the Directed Configuration Model defined in Ref. [49] and implemented by the Python module NEMtropy [65].
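The check described above, i.e. comparing an observed quantity with its distribution over graphs sampled from the null-model, can be sketched as follows. This is a hedged illustration, not the NEMtropy workflow: the matrix `q` of link probabilities is assumed to be already available from a solved DCM, the function names are hypothetical, and the two-tailed p-value is computed with one common convention (twice the smaller empirical tail, capped at 1), which is an assumption since the paper does not spell out its formula.

```python
import random


def sample_directed_edges(q, rng):
    """Draw one directed graph: edge i -> j is present with probability q[i][j]."""
    n = len(q)
    return [(i, j) for i in range(n) for j in range(n)
            if i != j and rng.random() < q[i][j]]


def two_tailed_pvalue(observed, sampled):
    """Empirical two-tailed p-value: twice the smaller tail, capped at 1."""
    lower = sum(1 for value in sampled if value <= observed) / len(sampled)
    upper = sum(1 for value in sampled if value >= observed) / len(sampled)
    return min(1.0, 2.0 * min(lower, upper))


def ensemble_pvalue(q, statistic, observed, n_samples=1000, seed=0):
    """Compare `observed` with `statistic` evaluated on sampled edge lists.

    `statistic` is any function mapping an edge list to a number, for
    instance the size of one bow-tie sector."""
    rng = random.Random(seed)
    sampled = [statistic(sample_directed_edges(q, rng)) for _ in range(n_samples)]
    return two_tailed_pvalue(observed, sampled)


# Toy usage: with all probabilities equal to 1 every sample has 6 edges,
# so an observed value of 6 is fully compatible with the ensemble.
q_full = [[0.0, 1.0, 1.0], [1.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
p_compatible = ensemble_pvalue(q_full, len, observed=6, n_samples=20)
p_extreme = ensemble_pvalue(q_full, len, observed=2, n_samples=20)
```

In the setting of the paper, `statistic` would return the dimension of a bow-tie sector of the sampled graph, and the resulting p-values would be compared with the chosen significance level.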
Our results show that the dimensions of the blocks in the bow-tie are very often statistically significant: the p-values of the observed dimensions of the various blocks against the distribution expected under the null-model are extremely small, i.e. the dimensions of the various blocks cannot be explained by the degree sequences only.

Limitations

Even if we have obtained strong results (see the null-model validation check on the dimension of the bow-tie sectors), we nevertheless have to remark a few aspects of our analysis that can limit its generalization. First, the analysis is related to eight different thematic datasets in different languages, all referring to European debates, some of them of political nature. Indeed, while the total amount of messages analysed is quite impressive, we are aware that, even if the spectrum of the topics covered is broad, our findings may be valid on our datasets only. In the near future, we are going to expand both the countries covered by our analyses and the list of topics under analysis. Following our jargon, OUT-dominant bow-ties display high levels of m/disinformation. It is not a causal relation: the presence of OUT-dominant bow-ties does not imply the presence of an infodemic. In fact, if the reputability of the sources shared by SCC were high, we would not have observed any infodemic signal. Nevertheless, it is true that OUT-dominant bow-ties help the diffusion of m/disinformation, when present, since accounts in OUT are exposed to the content created by nearly every block in the discursive community. Finally, it can be argued that the observed bow-ties appear just in discursive communities, a method that is recent and quite limited in its application [20, 25, 28, 29, 31-33].
Some of the authors of the present manuscript have in preparation a paper in which they compare the results obtained with different methods for extracting discursive communities [27]: the methodology used in the present paper proves sound and among the most effective ones, and it performs well when compared to manually annotated data.

Table 2: Performance results after 10-fold cross-validation on the cresci-stock-2018 dataset.

To train and validate the classifier we leverage the publicly available cresci-stock-2018 14 dataset. In particular, we use the account metadata of the 6842 bots and 5880 humans that were still active at the time of data collection; data were crawled in July 2020 through the Tweepy library 15. To select the best model, we consider five algorithms, each of them belonging to a different category: MLP (Multilayer Perceptron) [89], JRip, i.e., a Java-based implementation of the RIPPER algorithm [90], Naive Bayes [91], Random Forest [92], and the Weka [93] implementation of the Instance-based Learning Algorithms, i.e., IBk [94]. The performances of the five different algorithms are evaluated in terms of standard metrics, such as balanced accuracy, precision, and f-measure. The metrics are computed using a 10-fold cross-validation. For all the experiments, we rely on the open source (Java-based) Weka framework, which provides the implementations of (i) the five machine-learning algorithms (for which we use the default parameter settings 16), (ii) the evaluation metrics and (iii) the process of 10-fold cross-validation. In light of our experiments (see Table 2), we select the Random Forest-based model as the classifier, since it outperforms the other models. The resulting model for bot classification is then applied to tag all the accounts involved in our study, giving an average concentration of bots that is around 23.9% in total.
In particular, if we focus on specific datasets, we observe percentages of bots around 23 [...]

These are quite high values, especially if we take as a baseline the measure provided by Varol et al. [95] in a 2017 study, which estimated the percentage of active bots on the Twittersphere at between 9 and 15%. However, in our research, several aspects could explain both the high values and their variability among the datasets. Specifically, (i) we are looking at specific (hot) topics that might involve larger numbers of bots than the average, (ii) we are considering datasets on significantly different topics (thus, the percentage of automated accounts might vary), and (iii) we are analyzing data collected in different time intervals, but evaluated with a single classifier (this might further affect the classification performance, due to the possible evolution of bots).

With these premises in mind, we now describe how the potential bots are distributed in the discursive communities. They are roughly equally distributed among the discursive communities, with a slightly higher percentage of bots in the conservative groups: for instance, in the Italian and French Covid-19 datasets, the communities with the highest percentage of bots are DX and RIGHT-WING with, respectively, 25.5% and 29.7% of suspicious accounts. In our bow-tie structures, they are mostly placed in the OUT sector or in the INTENDRILS one. Fig. 10 shows the percentages of bots in each bow-tie sector, averaged over all the discursive communities in the usual three categories. Globally, the highest percentages can be found in the OUT sector and in the INTENDRILS one: social bots tend to retweet more than to be retweeted. In particular, in the case of OUT-dominant bow-ties, on average 60% of bots are placed in the OUT sector and, to a lesser extent, in INTENDRILS (around 25%).
In the case of INTEND-dominant bow-ties, we found even more than 60% of bots in the INTENDRILS sector. Instead, when the bow-tie structure is absent, OTHERS is the block that contains the [...]

It is worth mentioning that the highest percentages of bots in the strongly connected component can be found in the right-oriented discursive communities. For instance, in the Italian Covid-19 dataset the percentage of bots in the SCC of DX is 7%, while for all the others it does not exceed 2%. Such a situation is particularly dangerous: the fact that social bots are retweeted by human users (as is the case for accounts in SCC) means that they are able to pass themselves off as genuine accounts.

This supplementary material contains all the information, the graphics and the statistics of the datasets not directly analysed in the main paper. In fact, for convenience, in the paper we presented only the results about the Italian Covid-19 dataset.

The German Covid-19 dataset contains 1,552,582 tweets shared between February 2 and April 23, 2020. The discursive communities in this dataset are the following:
• AfD: this group contains accounts of politicians of the German nationalist and right-wing party "Alternative for Germany (AfD)";
• LEFT-WING: this community collects politicians of various German left-wing parties, such as the "Social Democratic Party (SPD)", "Alliance 90/The Greens" and "Die Linke" (literally, the left);
• GOVERNMENT: this community contains the official accounts of German ministries and institutions, such as the Foreign, Defense or Health Ministries. It also contains politicians of the "Christian Democratic Union of Germany (CDU)";
• MEDIA: this is the usual community which contains the official accounts of the main German newspapers, blogs, TV channels, journalists and other media in general.

The bar chart of Fig. 11 displays the percentage of nodes in each discursive community.
As for the other Covid-19 datasets, the MEDIA group is the most numerous one, with approximately 70% of the nodes of the entire network.

Figure 12: The bow-tie structure of the discursive communities of the German Covid-19 dataset. The dimension of the sectors is proportional to the number of nodes contained in them and the color quantifies the distance between the observed and the predicted dimensions. The AfD, GOVERNMENT and MEDIA groups display an informative bow-tie structure, i.e. the OTHERS sector represents less than 50% of the nodes. Considering the comparisons with the predictions of the Directed Configuration Model, the observed dimension of the OTHERS sector is significantly smaller (at a significance level of 1%) for all the communities, apart from the LEFT-WING one.

Figure 13: Percentage of nodes and edges in the SCC for the communities in the German Covid-19 dataset. As for the other datasets, in the German Covid-19 one the conservative and right-oriented discursive community (AfD) has a more numerous and denser SCC, as displayed in the two top panels. In the bottom panel, it can be seen that, also considering the number of links per node in the SCC, the AfD group is again the first one.

Fig. 12 shows the bow-tie structures of the four discursive communities. As in the main text, the dimension of the sectors is proportional to the number of nodes contained in them and the color indicates the mismatch with the predictions of the Directed Configuration Model (described in the Methods section of the main text). The AfD, GOVERNMENT and MEDIA groups display informative bow-tie structures; all of them are OUT-dominant, but only AfD has a strong bow-tie. In the LEFT-WING community the bow-tie is uninformative, with approximately 60% of the vertices in the OTHERS sector.
In agreement with the results of the Italian Covid-19 dataset, the OTHERS block is significantly smaller than predicted for the three communities with an informative bow-tie, but not for LEFT-WING. Also in this dataset, the AfD community, which contains right-oriented and conservative accounts, shows a more numerous and denser SCC. It is the only community with more than 10% of the nodes and 25% of the links within the SCC, in which each vertex has on average over 20 links attached to it (see Fig. 13). The accounts in the AfD discursive community are those that most often retweet URLs of web pages flagged by Newsguard as untrustworthy. Indeed, we found approximately 3,500 retweets of this type in AfD, about 200 in MEDIA, 20 in LEFT-WING and none in GOVERNMENT. For AfD, 30% of them originate in the SCC and end in the OUT sector, 25% lie between IN and OUT and 20% remain in the SCC. Therefore, in over 50% of the cases a user shares untrustworthy content from the SCC.

The French Covid-19 dataset consists of 3,060,197 posts about the epidemic published between March 23 and April 7.

Figure 14: The dimension of the discursive communities of the French Covid-19 dataset. This bar chart displays the percentage of nodes in each discursive community. The MEDIA group is the most numerous one, with approximately 60% of the nodes, as happens in the other Covid-19 datasets.

We identified 4 different discursive communities:
• RIGHT-WING: it collects conservative and right-oriented accounts from French parties like "Rassemblement National", "Les Républicains" and "Les Identitaires";
• LEFT-WING: this community contains the politicians and the supporters of center-left French parties like "La France Insoumise" or the Socialist Party ("Parti Socialiste");
• GOVERNMENT: it collects accounts of institutions and ministries, like the official account of the French government or that of the "Ministère des solidarités et de la santé" (Ministry of Solidarity and Health).
It also contains politicians from the party "La République En Marche", whose leader is President Macron;
• MEDIA: this is the usual community containing official accounts of various media and journalists.

As for the other Covid-19 datasets, the MEDIA group is the most numerous one, with approximately 60% of the nodes of the network (Fig. 14). For this dataset we could not make the comparison with the predictions of the null model because of the huge size of its discursive communities: for these groups, the computation time for generating the graphs of the ensemble and analysing their bow-tie structure became too long. Therefore, the following figures carry no information about the comparison with a null model. In Fig. 15 it is easy to see that each discursive community displays an informative bow-tie structure. Remarkably, all of them are OUT-dominant, with no less than 40% of the nodes in every OUT sector. The mismatch in the number of nodes and links in the SCC between the right-wing community and the others still holds, but to a much lesser extent (see Fig. 16). Newsguard's data provide results similar to those of the other datasets: we found 979 retweets with URLs to untrustworthy web pages in the right-wing community, 103 in the left-wing one and none in the others. For the former community, 25% are located between SCC and OUT, 22% between IN and OUT, 14% in the SCC, 12% between IN and SCC and far fewer between the other sectors. In the case of the left wing, 45% of these retweets are located between SCC and OUT and 31% in the SCC.

Figure 15: The bow-tie structure of the discursive communities of the French Covid-19 dataset. In this dataset all discursive communities have informative bow-tie structures; all of them are OUT-dominant. Note that the color of the sectors is always the same because we could not make the comparison with the predictions of the DCM, due to the large size of the data set.
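The percentages of untrustworthy retweets "between SCC and OUT", "in the SCC" and so on amount to a tally of flagged retweets by the sectors of their two endpoints. A minimal sketch of such a tally is given here; the account names, the `sector_of` mapping and the list of flagged retweets are hypothetical toy data, and reading each pair as (first account, second account) of a directed link is an assumption about how the links are oriented.

```python
from collections import Counter


def sector_pair_shares(retweets, sector_of):
    """Share of retweets falling on each (sector, sector) pair of endpoints."""
    counts = Counter((sector_of[u], sector_of[v]) for u, v in retweets)
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


# Hypothetical sector assignment, e.g. the output of a bow-tie decomposition.
sector_of = {"a": "SCC", "b": "SCC", "i": "IN", "o": "OUT", "o2": "OUT"}

# Hypothetical retweets whose URL is flagged as untrustworthy.
flagged = [("a", "o"), ("b", "o2"), ("i", "o"), ("a", "b")]

shares = sector_pair_shares(flagged, sector_of)
# Share of flagged retweets whose first endpoint lies in the SCC.
from_scc = sum(s for (src, _), s in shares.items() if src == "SCC")

for pair, share in sorted(shares.items()):
    print(pair, f"{share:.0%}")
print("first endpoint in SCC:", f"{from_scc:.0%}")
```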
Figure 16: Percentage of nodes and edges in the SCC for the communities in the French Covid-19 dataset. We found again that the right-wing community has more links and nodes in the SCC, even if this time to a much lesser extent than in the other datasets.

Figure 17: This bar chart displays the percentage of nodes in each discursive community. The CONSERVATIVES group is the most numerous one, with approximately 50% of the nodes of the entire network.

The Dutch elections dataset consists of 1,002,696 tweets posted between February 2 and March 31, 2021. In this case almost every discursive community bears the name of a specific Dutch political party: "GroenLinks" (center-left, green), "Christian Democratic Appeal (CDA)" (center, Christian-democratic), "Democrats 66 (D66)" (center/center-left, liberal), "People's Party for Freedom and Democracy (VVD)" (conservative-liberal) and "Labour Party (PvdA)" (center-left, social-democratic). Then we have the CONSERVATIVES community, which collects accounts from right-oriented parties like "Party for Freedom" or "Forum for Democracy", and MEDIA & S.P., which is the usual MEDIA community with a couple of accounts belonging to the Dutch "Socialist Party". Fig. 17 presents the dimensions of these seven discursive communities. As can be observed in Fig. 18, all the discursive communities in this case show an informative bow-tie structure, with the only exception of the VVD one. The CONSERVATIVES group is again the community with the highest percentages of nodes and links within the SCC (see Fig. 19): it contains more than 40% of the links of the entire network in its SCC alone. In this dataset there are few retweets containing URLs to untrustworthy web pages (according to Newsguard). However, they are all located in the CONSERVATIVES community: 153 in total, of which 55% in the SCC and 45% between SCC and OUT.

This Italian dataset contains Twitter posts about the migration flows from Northern Africa.
The dataset consists of 1,082,029 posts published between January 23, 2019 and February 22, 2019. The network has been divided simply into DX (right-oriented Italian parties such as "Lega Nord"), CSX (left-oriented Italian parties such as the Democratic Party and other minor center-left parties), M5S (the "Five Star Movement" party) and the usual MEDIA community. The first two are the most numerous ones (Fig. 20). Fig. 21 shows the bow-tie structures of the four discursive communities in this dataset. The most numerous communities, DX and CSX, display informative bow-ties, while in the two smaller ones the nodes are mostly located in the OTHERS sector, especially for the MEDIA community (above 95%). Looking at the colors in the graphs, the latter are in general in better agreement with the Directed Configuration Model. The DX community, which again contains politicians of right-oriented Italian parties, has the most numerous and densest SCC (Fig. 22), such that on average a node therein has over 25 links.

Figure 18: The bow-tie structure of the discursive communities of the Dutch elections dataset. All the discursive communities in this dataset display a strong bow-tie structure, except the VVD one.

Figure 19: Percentage of nodes and edges in the SCC for the communities in the Dutch elections dataset. Again, the conservative and right-oriented discursive community (CONS.) has the most numerous and densest SCC, as displayed in the two top panels. The bottom panel shows that, also considering the number of links per node in the SCC, the CONS. group outperforms the other communities.

Figure 20: This bar chart displays the percentage of nodes in each discursive community. The DX and CSX groups are the most numerous ones, with between 30% and 50% of the nodes. This is the only dataset in which the percentage of nodes not assigned to a discursive community by the label propagation procedure exceeds 15%.
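The three SCC quantities reported for each community (share of nodes in the SCC, share of links in the SCC, and links per node in the SCC) can be computed from an edge list once the SCC is known. The sketch below is only illustrative: the toy data are hypothetical, and it assumes that a link counts as "in the SCC" when both its endpoints belong to it and that "links per node" means internal links divided by SCC nodes, which may differ from the exact definition behind the figures.

```python
def scc_metrics(edges, scc):
    """Node share, link share and links per node of a given SCC.

    Assumptions: a link is internal when both endpoints are in `scc`;
    links per node = internal links / number of SCC nodes.
    """
    nodes = {n for edge in edges for n in edge}
    internal = [(u, v) for u, v in edges if u in scc and v in scc]
    return {
        "node_share": len(scc & nodes) / len(nodes),
        "link_share": len(internal) / len(edges),
        "links_per_node": len(internal) / len(scc),
    }


# Hypothetical toy community: SCC {a, b, c} plus a few peripheral accounts.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("a", "c"),
         ("c", "o"), ("i", "a"), ("p", "q")]
metrics = scc_metrics(edges, scc={"a", "b", "c"})
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```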
Newsguard data suggest that 15,160 retweets in the DX network contain URLs of untrustworthy web pages, against only 14 for CSX, 3 for MEDIA and none for M5S. In the case of the DX community, 59% of them are found inside the SCC and 36% between the SCC and OUT.

This dataset contains 583,327 Twitter posts published in Italy about the safety of the Astrazeneca vaccine against Covid-19. The posts were shared between March 15, 2021 and May 15, 2021. A brief description of the discursive communities identified follows:
• DX: this is the usual right-oriented and conservative community found in the other Italian datasets too, i.e. it contains accounts from the "Lega" and "Fratelli d'Italia" parties;
• PD: the Italian Democratic Party (center-left);
• IV: it collects the politicians of the "Italia Viva" party (center-left);
• LEFT-WING COMMENTATORS: this particular community is formed by several well-known personalities, often left-oriented, who are not politicians but journalists, bloggers, actors or entertainers. This community also contains the most famous Italian epidemiologist, Roberto Burioni;
• M5S: the Italian populist party "Movimento 5 Stelle";
• MEDIA: the usual community containing official accounts of newspapers, blogs, TV channels, radio stations and others.

Figure 21: The bow-tie structure of the discursive communities of the Italian debate on migrants dataset. The most numerous communities, DX and CSX, display informative bow-ties (only the CSX one is strong), while in the smaller ones, M5S and MEDIA, the nodes are mostly located in the OTHERS sector. Looking at the colors in the graphs, the latter are in general in better agreement with the DCM.

Figure 22: Percentage of nodes and edges in the SCC for the communities in the Italian debate on migrants dataset. Again, the conservative and right-oriented discursive community (DX) has the most numerous and densest SCC, as displayed in the two top panels.
The bottom panel shows that, also considering the number of links per node in the SCC, the DX group again ranks first.

Figure 23: This bar chart displays the percentage of nodes in each discursive community. All communities but MEDIA and PD display informative bow-ties and, among them, the DX and IV ones are strong. The LEFT-WING COMMENTATORS group is the most numerous one, with over 60% of the nodes.

The distribution of the nodes in these six communities is shown in Fig. 23. The two biggest communities, DX and LEFT-WING COMMENTATORS, show a strong and a weak bow-tie structure, respectively (Fig. 24), indicating once more that the strength of the structure does not depend on its size. Nevertheless, they are both OUT-dominant. In the M5S and IV communities, INTENDRILS and OUT are nearly balanced as the dominant sector. While the MEDIA bow-tie is poorly informative, the PD one is not informative, with over 50% of the nodes in OTHERS. The DX community is again the one with the most numerous and densest strongly connected component; on average, the nodes in its SCC have more than 27 links attached, while in the other communities there are always fewer than 10 links per node (Fig. 25). In this community we identified 728 retweets containing URLs to untrustworthy pages, according to Newsguard. They are distributed as follows: 43% between SCC and OUT, 26% in the SCC, 13% between IN and SCC, 12% between IN and OUT and far fewer between the other sectors. We found only two retweets of this type in the M5S community and none in the others.

Figure 24: The bow-tie structure of the discursive communities of the Italian debate on Astrazeneca vaccine dataset. The DX and LEFT-WING COMMENTATORS communities display informative bow-tie structures; nevertheless, they are both OUT-dominant. In the weak bow-ties of M5S and IV, INTENDRILS and OUT are balanced as the dominant sector.
The PD community does not display an informative bow-tie structure, with over 50% of the nodes in OTHERS.

Figure 25: Percentage of nodes and edges in the SCC for the communities in the Italian debate on Astrazeneca vaccine dataset. Again, the conservative and right-oriented discursive community (DX) has the most numerous and densest SCC, as displayed in the two top panels. The bottom panel shows that, also considering the number of links per node in the SCC, the DX group again ranks first.

The political blogosphere and the 2004 U.S. election: divided they blog for Communication, Media use in the European Union : report. European Commission The echo chamber is overstated: the moderating effect of political interest and diverse media Lack of evidence for correlation between covid-19 infodemic and vaccine acceptance Comment on "the covid-19 infodemic does not affect vaccine acceptance Context matters: political polarization on twitter from a comparative perspective Political polarization on the digital sphere: A cross-platform, over-time analysis of interactional, positional, and affective polarization on social media Tweeting from left to right: Is online political communication more than an echo chamber? The science of fake news Weapons of Mass Distraction: Foreign State-Sponsored Disinformation in the Digital Age.
Park Advisors The spreading of misinformation online Echo Chamber: Rush Limbaugh and the Conservative Media Establishment Echo chambers online?: Politically motivated selective exposure among internet news users Debunking in a world of tribes The Filter Bubble: What the Internet is hiding from you Discourse community Audience and Rhetoric: An Archaeological Composition of the Discourse Community Definition and genesis of an online discourse community A rhetoric for naturalistic inquiry and the question of genre Analysing twitter semantic networks: the case of 2018 italian elections Political polarization on twitter Predicting the political alignment of twitter users Partisan asymmetries in online political activity Near linear time algorithm to detect community structures in large-scale networks Extracting significant signal of news consumption from social networks: the case of Twitter in Italian political elections The Statistical Physics of Real-World Networks Discursive community detection on twitter The role of bot squads in the political propaganda on Twitter Brexit and bots: characterizing the behaviour of automated accounts on Twitter during the UK election Firms' challenges and social responsibilities during covid-19: A twitter analysis Networked partisanship and framing: A socio-semantic network analysis of the italian debate on migration Flow of online misinformation during the peak of the covid-19 pandemic in italy Italian twitter semantic network during the covid-19 epidemic Graph structure in the web Bow-tie decomposition in directed graphs Enhanced reconstruction of weighted networks from strengths and degrees Effectiveness of dismantling strategies on moderated vs. 
unmoderated online social platforms Information disorders during the covid-19 infodemic: The case of italian facebook Lack of evidence for correlation between covid-19 infodemic and vaccine acceptance The voice of few, the opinions of many: evidence of social biases in twitter covid-19 fake news sharing Bots are less central than verified accounts during contentious political events Randomizing bipartite networks: the case of the World Trade Web Fast unfolding of communities in large networks Finding and evaluating community structure in networks Connected Components in Random Graphs with Given Expected Degree Sequences Analytical maximum-likelihood method to detect patterns in real networks On the rich-club effect in dense and weighted networks Broadcasters and Hidden Influentials in Online Protest Diffusion Reciprocity of weighted networks Early-warning signals of topological collapse in interbank networks The role of distances in the world trade web Statistically validated network of portfolio overlaps and systemic risk Inferring monopartite projections of bipartite networks: An entropy-based approach Detecting early signs of the 2007-2008 crisis in the world trade Assessing systemic risk due to fire sales spillover through maximum entropy network reconstruction Reconstruction methods for networks: The case of economic and financial systems The physics of financial networks Grand canonical validation of the bipartite international trade network Grand canonical ensemble of weighted networks Maximum entropy approaches for the study of triadic motifs in the mergers & acquisitions network Colombian export capabilities: Building the firms-products network Elements in Structure and Dynamics of Complex Networks Gravity models of networks: integrating maximum-entropy and econometric approaches Lightning network: a second path towards centralisation of the bitcoin economy* Fast and scalable likelihood maximization for Exponential Random Graph Models From ecology to 
finance (and back?): A review on entropy-based null models for the analysis of bipartite networks Breaking the spell of nestedness: The entropic origin of nestedness in mutualistic systems The ambiguity of nestedness under soft and hard constraints Fluctuating ecological networks: a synthesis of maximum entropy approaches for pattern and perturbation detection Entropy-based randomization of rating networks A faster horse on a safer trail: generalized inference for the efficient reconstruction of weighted networks Comparing models for extracting the backbone of bipartite projections Controlling the false discovery rate: a practical and powerful approach to multiple testing Statistical mechanics of networks Maximum likelihood: Extracting unbiased information from complex networks On computing the distribution function for the Poisson binomial distribution Controlling the false discovery rate: a practical and powerful approach to multiple testing Networks: An Introduction Community structure in social and biological networks Analysing Twitter Semantic Networks: the case of 2018 Italian Elections Comment on "the covid-19 infodemic does not affect vaccine acceptance Maximum-entropy networks. 
Pattern detection, network reconstruction and graph combinatorics Fame for sale: efficient detection of fake twitter followers The rise of social bots The paradigm-shift of social spambots: Evidence, theories, and tools for the arms race On the efficacy of old features for the detection of new bots Invasion percolation and critical transient in the barabási model of human dynamics A decade of social bot detection Multilayer perceptron, fuzzy sets, and classification Fast effective rule induction Estimating continuous distributions in bayesian classifiers Random forests Data mining: practical machine learning tools and techniques Instance-based learning algorithms Online human-bot interactions: Detection, estimation, and characterization

FS acknowledges Pietro Galgani and Lizanne Dirkx for support in both the download and the analysis of the Dutch election dataset; Giulia Andrighetto, Stefano Guarino, Enrico Mastrostefano, Elena Pavan, Eugenia Polizzi and Tiziano Squartini for useful discussions. All authors acknowledge support from IMT PAI project Toffee.

Here, we give a brief description of the discursive communities identified in the Italian Covid-19 dataset (their dimensions are in Fig. 2):
• DX: this community collects the official accounts and the main leaders of two Italian right-oriented political parties, 'Lega' and 'Fratelli d'Italia';
• M5S: this community contains the main politicians and the official accounts of the Italian party 'Movimento 5 Stelle' (English: 5 Stars Movement), an anti-establishment political movement;
• IV: this community is associated with the liberal party 'Italia Viva' (English: Italy Alive), with centre/centre-left political positions;
• PD: this cluster contains the politicians of the Italian 'Partito Democratico' (English: Democratic Party), the traditional centre-left party;
• FI: this group collects the politicians and the official accounts of the Italian centre-right party 'Forza Italia' (English: Italy Forward);
• MEDIA: this type of community is present in almost all the datasets we analyzed. It contains the official accounts of newspapers, journalists, TV channels, radio channels and, in general, other Italian media.

Social bots are computer algorithms whose behaviour on social platforms is often far from benign: malicious bots are purposely created to distribute spam, sponsor public characters and, ultimately, induce a bias within public opinion [83] [84] [85]. Often these agents have the task of increasing the visibility of certain users [28, 32].

Here, we report the outcome of a study on the detection of social bots in the datasets under investigation. For bot detection, we exploit the general-purpose bot detection system based on supervised learning presented in [86]. Such a system has been shown to be highly accurate, both for unveiling automated accounts that work alone and those that participate in coordinated activities (we cannot determine phase transitions in this peculiar dynamics [87]).
The bot detector is 'traditional', i.e., only one user at a time is analyzed during the classification process [88]. The classifier exploits so-called Class A features, i.e., features that can be directly extracted from the user profile. These features were originally introduced in [83] and, despite their simplicity, have proved to be still effective for the detection of novel bots too. Features that are known to be the most expensive to compute (mainly in terms of the time needed for data gathering), namely those concerning the account's relationships (friends and followers), have been disregarded.

Hence, in order to decide the type of an account (either a bot or not), we (i) train and evaluate different machine-learning algorithms on a dataset where bots and genuine accounts are known a priori, (ii) select the model with the best classification performance and (iii) apply the resulting model to the accounts of the datasets investigated in the main text of the manuscript.
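Steps (i) to (iii) follow a generic model-selection pattern, which the sketch below illustrates on toy data. It is not the detector of [86]: the two classifiers are deliberately tiny stand-ins written for this example, the two profile features (follower count and number of posts) and all numeric values are hypothetical, and validation accuracy is assumed as the selection criterion.

```python
class MajorityClass:
    """Baseline: always predict the most frequent training label."""

    def fit(self, X, y):
        self.label = max(set(y), key=y.count)
        return self

    def predict(self, X):
        return [self.label for _ in X]


class ThresholdStump:
    """Predict bot (1) when one feature is below a learned threshold."""

    def __init__(self, feature):
        self.feature = feature

    def fit(self, X, y):
        best = (-1.0, None)
        for t in sorted({row[self.feature] for row in X}):
            preds = [int(row[self.feature] < t) for row in X]
            best = max(best, (accuracy(y, preds), t))
        self.threshold = best[1]
        return self

    def predict(self, X):
        return [int(row[self.feature] < self.threshold) for row in X]


def accuracy(y_true, y_pred):
    return sum(a == b for a, b in zip(y_true, y_pred)) / len(y_true)


def select_model(candidates, X_train, y_train, X_val, y_val):
    """(i) train and evaluate every candidate, (ii) keep the best one."""
    scores = {}
    for name, model in candidates.items():
        model.fit(X_train, y_train)
        scores[name] = accuracy(y_val, model.predict(X_val))
    best = max(scores, key=scores.get)
    return best, candidates[best], scores


# Hypothetical labelled accounts: [followers, posts]; label 1 = bot, 0 = genuine.
X_train = [[5, 900], [12, 1500], [8, 700], [300, 400],
           [450, 120], [800, 950], [1500, 300]]
y_train = [1, 1, 1, 0, 0, 0, 0]
X_val = [[10, 1000], [3, 200], [600, 500], [950, 100], [20, 50]]
y_val = [1, 1, 0, 0, 1]

candidates = {"majority": MajorityClass(), "stump": ThresholdStump(feature=0)}
best_name, best_model, scores = select_model(
    candidates, X_train, y_train, X_val, y_val)

# (iii) apply the selected model to accounts whose nature is unknown.
unlabelled = [[7, 2000], [1200, 340]]
predictions = best_model.predict(unlabelled)
print(best_name, scores, predictions)
```

In practice the candidates would be full learning algorithms evaluated with cross-validation on a much larger labelled set; the selection loop itself would keep the same shape.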