key: cord-0811354-7155cxta
authors: nan
title: An Infoveillance System for Detecting and Tracking Relevant Topics From Italian Tweets During the COVID-19 Event
date: 2020-07-17
journal: IEEE Access
DOI: 10.1109/access.2020.3010033
sha: e07e5a3216a56f31ac4e0f33bd447f68f50d09ba
doc_id: 811354
cord_uid: 7155cxta

The year 2020 opened with a dramatic epidemic caused by a new species of coronavirus, soon declared a pandemic by the WHO due to the high number of deaths and the critical mass of hospitalized patients worldwide, on the order of millions. The COVID-19 pandemic has forced the governments of hundreds of countries to impose heavy restrictions on citizens' socio-economic life. Italy was one of the most affected countries, with long-term restrictions impacting the socio-economic fabric. During this lockdown period, people got informed mostly through Online Social Media, where a heated debate followed all the main ongoing events. In this scenario, this study presents an in-depth analysis of the main emergent topics discussed during the lockdown phase within the Italian Twitter community. The analysis has been conducted through a general-purpose methodological framework, grounded on a biological metaphor and on a chain of NLP and graph analysis techniques, in charge of detecting and tracking emerging topics in Online Social Media, e.g. streams of Twitter data. A term-frequency analysis over subsequent time slots is pipelined with nutrition and energy metrics for computing hot terms, also exploiting tweet quality information such as the social influence of the users. Finally, a co-occurrence analysis is adopted for building a topic graph from which emerging topics are selected. We demonstrate, via a careful parameter setting, the effectiveness of the topic tracking system, tailored to the current Twitter standard API restrictions, in capturing the main sociopolitical events that occurred during this dramatic phase.

It is now well established that the Internet and, in particular, Online Social Media (OSM) are an invaluable source of fresh information. OSM have been widely adopted as a means of news dissemination, event reporting, opinion expression and discussion [1]. Since 2006, the American microblogging platform and social network service Twitter has rapidly gained worldwide popularity, reaching 321M active users in 2019. Twitter started as a very short text message service, with messages posted by users via SMS or the online platform. Currently, after a rapid and continuous evolution both from the technical point of view and in the diverse segments of the population reached worldwide, it is an established OSM conceived as a mixture of news media and social network features. Considering the mass of active users and how they interact with the platform (many of them can be considered as sensors or amplifiers of facts or ongoing events), the Twitter data stream possesses an invaluable strength in the task of discovering and tracking real-world events. In fact, a vast literature shows how the Twitter data stream can be used for discovering, tracking and analyzing these real-world events, such as earthquakes and natural disasters [2]-[4] in earth science, or national security events such as terrorist attacks [5]-[7]. Furthermore, Twitter data have been widely used for tracking and analyzing important sociopolitical events, such as the riots during the Arab Spring [8] and the process of opinion formation around major political themes [9]-[12], with particular attention to the spreading of disinformation [13]. Interestingly, Twitter has also been used for Public Health Monitoring tasks [14], specifically during pandemic crises such as the influenza A H1N1 (swine flu) pandemic in 2009 [15], [16]. Hence, OSM can nowadays be fruitfully used to study the dynamics of real-world events. Monitoring such phenomena helps in understanding and describing their evolution, with the aim of improving decision-making procedures for political decision makers and democratic institutions. In particular, a tracking system able to sense the Twitter stream and extract fresh information in terms of emerging topics can be useful for the early detection of anomalous activities, preventing possible misuses of the OSM.

The associate editor coordinating the review of this manuscript and approving it for publication was Derek Abbott. VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

This paper addresses the analysis of the Italian Twitter community, through a suitable topic tracking methodology, during the lockdown period in Italy that followed the outbreak of the COVID-19 pandemic. At the time of writing, the COVID-19 pandemic, also known as the coronavirus pandemic, is an ongoing pandemic of coronavirus disease 2019 (hence COVID-19). It is caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and the outbreak was first identified in Wuhan, mainland China, in December 2019 [17]. The World Health Organization (WHO) declared the outbreak a pandemic on 11 March 2020 and, as of June of the same year, more than 8.4 million cases of COVID-19 have been reported in more than 188 countries, resulting in more than 450,000 deaths, with more than 4.1 million people who have recovered worldwide. 1 In Italy, on 4 March 2020, after the first 100 deaths related to the pandemic were recorded, the government ordered the complete closure of schools and universities of all levels. On 11 March 2020, Italian Prime Minister Giuseppe Conte ordered a set of severe confinement measures and the so-called social distancing, together with the interruption of numerous productive, commercial and professional activities. Hence, the pandemic generated a dramatic worldwide situation never seen before, with repercussions on the economic scenario as well. During the period from March to June, the Italian population was confined at home for safety reasons, acquiring important information mostly on social network platforms. The uncontrolled information flow about the pandemic, enriched with fake news, has been declared by the WHO a serious infodemic problem [18]-[20].
Eysenbach stated in the early 2000s that infodemiology is a new research discipline and methodology related to the study of the determinants and distribution of health information and misinformation, which may be useful in guiding health professionals and patients to quality health information on the Internet [18]. The WHO Director-General Tedros Adhanom Ghebreyesus declared at the Munich Security Conference on 15 February 2020 [21]: ''We're not just fighting an epidemic; we're fighting an infodemic''. This means that the risk of false information [22] (i.e. forms of falsehood, including rumors, hoaxes, myths, conspiracy theories and other misleading or inaccurate content) is very high. COVID-19 is a phenomenon of enormous magnitude and relevance, with a great impact on the media system [23]. Since the start of the COVID-19 pandemic, we have been witnessing a growing number of infodemiology studies [24]-[27] where, interestingly, the spread of news or rumors is evaluated with the same epidemic models adopted for real-world epidemics [28], for example by measuring an $R_0$ parameter that signals an infodemic when it is greater than one. In an infoveillance study over the English-speaking Twitter community, the authors of [29] analyze 167073 tweets, collected from the beginning of February 2020 to mid-March 2020, through word frequencies and the Latent Dirichlet Allocation (LDA) approach, aiming to identify the most common topics in the tweets. The analysis identifies 12 topics, which were grouped into four main themes: origin of the virus; its sources; its impact on people, countries, and the economy; and ways of mitigating the risk of infection. As expected, the impact on people and the economy is not to be underestimated. However, the methodologies adopted in infoveillance and infodemiology studies differ in the specific goals of the analysis, in the data sources and in the approaches, which span from correlation assessments to advanced machine learning systems.
In this landscape, it is important to have a system able to promptly detect facts and events online. Moreover, in this study, we adopt an extended meaning of the term ''infoveillance'' compared to the traditional one [19], in that the COVID-19 pandemic impacts not only the public health debate but also every social and economic facet, turning safety issues into public security issues.

The following analysis focuses precisely on the early period of the COVID-19 pandemic, during which a large dataset of tweets (in the Italian language) has been collected through the Twitter Streaming APIs. The main aim of this work is to track the emergent topics within the general debate in Italy during the pandemic. For this purpose, a topic tracking system is built on the methodological framework presented in [30], adapting the main functions both to the deep changes in the Twitter APIs (for example, the restrictions on available data and the increased length of text messages) and to the current case study. The methodology tracks emerging topics by monitoring emerging terms through a series of Natural Language Processing and graph-based techniques. A topic is defined as a coherent set of semantically related terms that express a single argument. Hot terms are terms heavily used over a long time period, while a term is emergent if it is hot in the considered time interval but not in the previous ones. Interestingly, the methodology is mediated by a biological metaphor, where the life cycle of a keyword (word) is considered analogous to that of a living being. Specifically, within a Content Aging Theory framework [31], a keyword is like a biological system: if it is fed with a sufficient amount of nourishment, its life cycle is prolonged, while as soon as nourishment is no longer available the organism is likely to die. The nourishment for a keyword is provided by its occurrence statistics in a set of tweets in a time interval, measured through a Term Frequency (TF) term, and by the quality of the tweets containing the given keyword, measured by a social influence value related to the user who generated the content.
In this study, the nourishment term is further increased if the given keyword is also marked as a hashtag, with the aim of giving more semantic strength to a keyword that, in this way, acts as a bearer of meaning. The tracking and detection of emergent terms and topics are obtained by considering a sequence of time intervals in which the vitality of a keyword is measured through an energy quantity that takes into account both the difference between the nutrition values in different time intervals and the amount of time elapsed. The energy quantities and a co-occurrence analysis over different time windows allow building a graph containing emerging keywords and common words. Through a suitable algorithm, a partition of the co-occurrence graph is then obtained, where sub-graphs are conceived as emergent topics for the given time interval.

This paper is organized as follows: in Section II the related works are reviewed, while in Section III the methodological framework is summarized. In Section IV the results of the analysis are presented and discussed. Conclusions are drawn in Section V. Finally, the Appendix provides a glossary of the main Italian terms, people and abbreviations.

Topic Detection and Tracking aims at the extraction of topics from a collection (or stream) of texts in order to study and quantify their importance (''trend'') over time [32] . As aptly discussed in [33] , there are two main families of techniques in order to perform topic detection: document-pivot and feature-pivot. The main difference is that, in the former case, documents are clustered together, whereas in the latter case keywords or individual terms are clustered together.

That said, within the document-pivot family, research works such as [34]-[36] leveraged Term Frequency-Inverse Document Frequency (TF-IDF) in order to map documents to a suitable vector space [37]. On occasion, other features can be considered alongside TF-IDF, such as time proximity between tweets [38].

Feature-pivot methods, instead, heavily rely on statistical topic models, with the final goal of extracting 'hot terms' that describe a given topic. Within this family, LDA [39] plays a major role [40]-[43]. Other techniques include the study of the burstiness of given terms, with the rationale that 'hot topics' spread rapidly on social media as soon as they are first announced [44]-[47]. An alternative approach, pursued in this work, is the use of graphs in order to capture the co-occurrences of terms: in fact, a graph is able to encode the pairwise similarities between nodes, which can either be single terms [30], [48], [49] or short sentences [50]. This allows the topic detection problem to be cast as a community detection problem defined on a graph.

The vast majority of the aforementioned works deals with 'topic detection'. However, as discussed in [40] 'topic detection' is just one of the two building blocks in Topic Detection and Tracking, the other being indeed 'topic tracking'. Topic tracking can also be performed according to different strategies, including clustering [51] , online variants of LDA [40] , [52] or by exploiting and studying temporal dynamics over a pre-defined time window [30] , [53] .

The work in [30] serves as a starting point for this paper. It can be summarized as a five-step procedure: collecting tweets, computing the energy of the terms over a given time window, selecting emerging terms according to their energies, building a co-occurrence graph amongst emerging terms and, finally, extracting topics from the resulting graph. In this paper, we modify the original pipeline proposed in [30] in order to address updates and changes in the Twitter API and to better suit our case study, that is, topic detection and tracking on COVID-19-related tweets: this period, although dramatic, represents a nearly unique opportunity for this kind of work. Hence, we collected tweets every day for about three months during the lockdown phase in Italy.

For the current study, we built a dataset of 1044645 tweets through a suitable listener connected to the standard Twitter Streaming API, accessible with a Twitter developer account. The Twitter Streaming API works like a radio receiver tuned to a specific radio frequency that captures on-air programs in real time. In fact, the Streaming API allows capturing streaming Twitter content by selecting a set of keywords. The listener object has been set to collect a filtered stream for a time period that spans from 9 March to 5 June 2020, for the following Italian keywords: Salvini, Conte, PD, salvini, conte, pd, lega, Lega, coronavirus, Coronavirus, calcio, Calcio, sport, Sport, UE, ue, europa, Europa, USA, NBA, carceri, carcere, virus, meloni, Meloni, coni, CONI, renzi, Renzi, borsa, Borsa, Trump, NASA, ESA, scienza. The 35 keywords have been chosen with the aim of offering a wide coverage of the main buzzing topics, focusing not only on the COVID pandemic but also on the more general socio-political scenario. In fact, perhaps for the first time, a worldwide pandemic meets a globalized and interconnected world, and the issues go beyond public health safety, undermining the socio-economic fabric. For example, the cohesion of the European Union has been put under severe pressure by the pandemic. Hence, from both an infoveillance and a security viewpoint, the selected keywords (see the glossary in the Appendix for a deeper explanation) cover the COVID-19 pandemic along with the internal and external economic and political scenario, the general scientific debate and sports. Tweets are filtered for the Italian language ('it') exploiting the specific filtering function available in the Twitter Streaming API. All collected tweets have been separated on a daily basis, with an average of 20000 tweets per day.

A marked difference with the original methodology proposed in [30] is in the adoption, in the current study, of several preprocessing steps. The motivation is two-fold. With no preprocessing, the final outputs are noisy and the computational time of the entire algorithm pipeline is obviously higher due to such noise. The adopted preprocessing steps are the following:

• text tokenization with the aid of Part-of-Speech information;

• hashtag extraction;

• lower-case conversion;

• removal of links, symbols, emojis and retweets;

• stop-word removal (the most commonly used Italian words, stored as a list in an external file);

• text lemmatization (optional): similar to stemming, it associates every word with its lemma;

• number removal (optional).

The topic tracking system is designed in a versatile fashion, hence some preprocessing steps are optional and left as a choice to the end user. The lemmatization step, when selected, is performed with the TreeTagger wrapper [54], [55].
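The preprocessing steps listed above can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: Part-of-Speech-aware tokenization and TreeTagger lemmatization are omitted, the stop-word set is a tiny hypothetical stand-in for the external file, and treating any text starting with "RT @" as a retweet to be dropped is an assumption.

```python
import re

# Hypothetical, tiny stop-word set; the study loads Italian stop words from an external file.
STOP_WORDS = {"il", "la", "di", "e", "che", "per", "in", "un"}

URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#(\w+)")
LETTERS_RE = re.compile(r"[^\W\d_]+")  # runs of letters only (drops digits, symbols, emojis)
WORD_RE = re.compile(r"\w+")           # letters and digits (drops symbols, emojis)


def preprocess(text, remove_numbers=True):
    """Return (tokens, hashtags) for one tweet, or None if it looks like a retweet."""
    if text.startswith("RT @"):          # assumption: retweets are discarded entirely
        return None
    text = URL_RE.sub(" ", text).lower()  # link removal, then lower casing
    hashtags = set(HASHTAG_RE.findall(text))
    pattern = LETTERS_RE if remove_numbers else WORD_RE
    tokens = [tok for tok in pattern.findall(text) if tok not in STOP_WORDS]
    return tokens, hashtags
```

Hashtags are returned separately so that a later stage can check whether a keyword was also used as a hashtag.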

The main aim of the topic tracking system is to track emerging topics within the Italian Twitter community in a given time interval. Hence, given an interval length $r$ set by the user, the $t$-th time interval $I^t$ is defined as:

$$I^t = \langle i_t,\ i_t + r \rangle$$

where $i_t$ is the starting instant of the $t$-th considered time interval (the value 0 is the first instant). For each time interval $I^t$ a corpus of $n$ tweets $TW^t$ is collected, and each tweet $j$ is associated with a vector of weights, one for each of the $v$ vocabulary terms:

$$tw_j = \{ w_{j,1}, w_{j,2}, \ldots, w_{j,v} \}$$

The weight $w_{j,x}$ for the $x$-th vocabulary term and for the $j$-th tweet is given by the augmented term frequency [56]:

$$w_{j,x} = 0.5 + 0.5 \cdot \frac{tf_{j,x}}{tf_j^{max}}$$

where $tf_{j,x}$ is the term frequency value of the $x$-th vocabulary term for the $j$-th tweet and $tf_j^{max}$ is the highest term frequency value of the $j$-th tweet. Hence, for each time interval, each tweet is represented as a weight vector that summarizes the statistical information related to each of its terms.
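As a small illustration, the weighting step can be sketched in Python using the standard augmented term frequency, 0.5 + 0.5 · tf / tf_max. The function name and the dictionary-based sparse representation are illustrative choices, not part of the original system.

```python
from collections import Counter


def augmented_tf(tokens):
    """Map each term of one tweet to its augmented term frequency weight."""
    counts = Counter(tokens)
    if not counts:
        return {}
    tf_max = max(counts.values())  # highest raw term frequency in this tweet
    return {term: 0.5 + 0.5 * count / tf_max for term, count in counts.items()}
```

Only terms that occur in the tweet receive a weight, so the most frequent term of a tweet always gets weight 1 and every other occurring term gets a weight above 0.5.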

In order to compute the hot terms in a given time interval and the main topics, it is important to define two main concepts: content nutrition and content energy. Each tweet can be thought of as feeding its own keywords with a quantity called nutrition, whose quality is given by the authority of the user who produced the tweet. In this way, the same keyword can receive different nutrition values from different tweets, depending on the representativeness of the users who produced them. Unlike [30], in this study the quality of the nutrition also depends on whether the keyword is used as a hashtag.

Hence, considering a keyword $k \in K^t$ and the set of tweets $TW_k^t \subseteq TW^t$ containing the term $k$ at time interval $I^t$, the amount of nutrition for the keyword $k$ is defined as:

where $w_{j,k}$ is the weight of the keyword $k$ for the tweet $j$ (in the tweet vector $tw_j$), $h$ is a constant that boosts the nutrition if the keyword is also a hashtag, and $auth(user(tw_j))$ is a numerical value indicating the representativeness of the tweet author.

There are a number of methods for measuring the importance of a source in terms of several features related to the social influence of a user [57]. In their original work, the authors of [30] adopt an authority graph and the PageRank algorithm [58] to estimate the social influence. They observe that a Twitter user can follow the text stream of other users by making the follower relationship explicit. On the other hand, a user who is being followed by another user does not necessarily have to reciprocate the relationship by following back, which makes the graph of the network directed. However, the public Twitter Streaming APIs make available only a subset of information about the author of a tweet, and this subset does not include the follower-followee list needed to build the social graph. Moreover, the computation of such a graph can be quite expensive. Thus, in the current study, we adopt a simple formulation of the social influence of a user $u_i$, cheap to compute and based on the information currently available about tweet authors, through the number of followers and followees:

Finally, for each keyword $k$ adopted in the Twitter community in a time interval $I^t$, the nutrition amount evaluates the usage of this term by considering i) its frequency of appearance in tweets, ii) the social influence of the source that reports the keyword $k$, and iii) the possibility that the keyword has a strong semantic content (in the specific time interval), being a hashtag. Hence, the topic tracking system is in charge of evaluating the frequency of key terms and their relevance, qualified by the user authority and by the particular meaning in the specific context. The nutrition of a keyword helps to define another important quantity, the energy of a term. The energy is related to the effective contribution of a term in the corpus of tweets, that is, how much the term is emergent. The energy is the key value for computing the set of hot terms, where 'hotness' is related to the extent of usage within the considered time interval. The energy also helps to compute the emergence of a term, where a keyword is 'emergent' if it is hot in the considered time interval but not in the previous ones [30]. By these definitions, a hot term is different from an emergent term: a heavily used term can be hot without being emergent in a time interval if its usage is roughly constant over the preceding intervals.
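The nutrition computation described above can be sketched in Python under two loudly stated assumptions, since the exact formulas are not reproduced here: the authority is taken as the share of followers among all of a user's connections, and the hashtag constant h is applied as a multiplicative boost (1 when the keyword is not a hashtag). Both choices, as well as the dictionary layout of a tweet, are hypothetical.

```python
def authority(followers, followees):
    """Assumed influence score in [0, 1]: followers / (followers + followees)."""
    total = followers + followees
    return followers / total if total > 0 else 0.0


def nutrition(keyword, tweets, h=2.0):
    """Sum of weight * hashtag boost * author authority over tweets containing keyword.

    Each tweet is a dict with keys 'weights' (term -> weight), 'hashtags' (set of terms),
    'followers' and 'followees' (ints). This layout is illustrative only.
    """
    total = 0.0
    for tw in tweets:
        weight = tw["weights"].get(keyword)
        if weight is None:  # tweet does not contain the keyword
            continue
        boost = h if keyword in tw["hashtags"] else 1.0  # assumed multiplicative boost
        total += weight * boost * authority(tw["followers"], tw["followees"])
    return total
```

With this form, a keyword used as a hashtag by an influential account contributes more nutrition than the same keyword in plain text from an account with few followers.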

The energy is computed considering a parameter $s$ ($0 < s < t$) that limits the number of previous time slots considered when analyzing the keyword life cycles, hence defining how far back in history the resulting emerging keywords are evaluated. Given a keyword $k$, the energy value in a time interval $I^t$ is:

where $nutr_k^x$ represents the nutrition obtained by the keyword $k$ during the time interval $I^x$. It is worth noting that Eq. (5) quantifies the usage of a given term with respect to its previous usage in a limited number of time intervals. It takes into account i) the difference in usage of a given keyword, through the difference of the nutrition received in the time frames $I^x$ and $I^t$ ($x < t$), and ii) the temporal distance between the two considered intervals.
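A minimal Python sketch of an energy computation consistent with this description is given below. The specific form is an assumption: the difference of squared nutrition values between interval t and each of the s previous intervals, divided by the temporal distance t - x and summed.

```python
def energy(nutr_history, s):
    """Energy of one keyword at the latest interval.

    nutr_history: nutrition values of the keyword, oldest first; the last item is interval t.
    s: number of previous intervals to look back over.
    Assumed form: sum over x in [t - s, t - 1] of (nutr_t**2 - nutr_x**2) / (t - x).
    """
    t = len(nutr_history) - 1
    current = nutr_history[t]
    start = max(0, t - s)  # do not look back past the first interval
    return sum((current ** 2 - nutr_history[x] ** 2) / (t - x) for x in range(start, t))
```

Under this form, a keyword with constant nutrition has zero energy however heavily it is used, which matches the distinction between hot and emergent terms, and older intervals weigh less than recent ones.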

The hot and emergent keywords, within this framework, allow the emergent topics to be computed. It is first necessary to define a set of emerging terms through a critical drop value governed by a user-defined threshold $\delta \geq 1$:

By using Eq. (6) it is possible to define the set of emerging keywords $EK^t$ as:

Hence, the parameter $\delta$ controls the number of extracted hot terms. We remark that the authors of [30] suggest that it is possible to compute the set of emergent terms in an unsupervised fashion as well, without setting a threshold parameter. In this study, we follow the supervised way, that is, we adopt a user-defined threshold, since this method has been found to be more reliable, as reported by the authors themselves.
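The supervised selection can be sketched in Python as follows. The form of the critical drop is an assumption made for illustration: δ times the mean energy over all keywords of the interval, with a keyword marked as emerging when its energy exceeds that value.

```python
def emerging_keywords(energies, delta=1.0):
    """Return the keywords whose energy exceeds the assumed critical drop.

    energies: dict keyword -> energy for the current interval.
    delta: user-defined threshold (>= 1); larger values select fewer keywords.
    """
    if not energies:
        return set()
    drop = delta * sum(energies.values()) / len(energies)  # assumed: delta * mean energy
    return {k for k, e in energies.items() if e > drop}
```

For instance, with energies of 12, 2 and 1 the mean is 5, so δ = 1 keeps only the first keyword, while δ = 3 raises the drop to 15 and selects nothing.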

To finally reach the definition of emerging topics, related to the emerging keywords, the system needs to analyze the semantic relationships between keywords through the co-occurrence information over the whole considered time interval. Hence, a correlation vector $cv_k^t$ can be defined for each keyword $k \in K^t$. The correlation vector captures the relationships between the keyword $k$ and all other terms in the given time interval. This is done by computing the degree of correlation between keywords $k$ and $z$, using the set of tweets containing both terms as positive evidence of the relatedness of the two terms. Conversely, the set of tweets containing only one of them represents negative evidence. This idea is captured by the following formula, which represents a probabilistic feedback mechanism [59]:

where:

• r k,z is the number of tweets in the interval containing both keywords k and z;

• n z is the number of tweets containing the keyword z;

• R k is the number of tweets containing k;

• N is the total number of tweets.

Hence, a given term $k$ is associated with a correlation vector:

$$cv_k^t = \langle c_{k,1}, c_{k,2}, \ldots, c_{k,v} \rangle \tag{9}$$

where $v = |K^t|$. The elements $c_{k,i}$ represent the correlation between the term $k$ and the term $i \in K^t$ at the time interval $I^t$.
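The correlation between two keywords can be sketched in Python from the four counts defined above. The formula used here is an assumption in the spirit of probabilistic relevance feedback: the logarithm of an odds ratio multiplied by the absolute difference between the share of tweets with k that also contain z and the share of tweets without k that contain z. Returning 0 whenever a count makes the expression undefined is a simplification of this sketch.

```python
import math


def correlation(r_kz, n_z, R_k, N):
    """Assumed correlation c_{k,z} from co-occurrence counts; 0.0 when undefined."""
    if r_kz == 0:
        return 0.0
    denominators = (R_k - r_kz, n_z - r_kz, N - n_z - R_k + r_kz, N - R_k)
    if min(denominators) <= 0:  # avoid division by zero and log of non-positive values
        return 0.0
    odds_with_k = r_kz / (R_k - r_kz)
    odds_without_k = (n_z - r_kz) / (N - n_z - R_k + r_kz)
    gap = abs(r_kz / R_k - (n_z - r_kz) / (N - R_k))
    return math.log(odds_with_k / odds_without_k) * gap


def correlation_vector(k, tweets, vocabulary):
    """Correlation of k with every other vocabulary term; tweets are sets of terms."""
    N = len(tweets)
    R_k = sum(1 for tw in tweets if k in tw)
    vector = {}
    for z in vocabulary:
        if z == k:
            continue
        n_z = sum(1 for tw in tweets if z in tw)
        r_kz = sum(1 for tw in tweets if k in tw and z in tw)
        vector[z] = correlation(r_kz, n_z, R_k, N)
    return vector
```

When z appears in tweets containing k at the same rate as elsewhere, both factors vanish and the correlation is approximately zero, while frequent co-occurrence yields a large positive value.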

At this point, the correlation vector $cv_k^t$ can be used for identifying the main emerging topics related to the emerging terms retrieved during the given time interval. Specifically, a directed keyword-based topic graph $TG^t(K^t, E, \rho)$ can be constructed. $K^t$ is the set of vertices, whose elements are the keywords $k \in K^t$ retrieved during the time interval $I^t$. Given two keywords $k, z \in K^t$ such that $cv_k^t[z] \neq 0$, there exists an edge $\langle k, z \rangle \in E$, such that:

In the above Eq. (10), $\rho_{k,z}$ is the relative weight of the keyword $z$ in $cv_k^t$, that is, the role of the keyword $z$ in the context of keyword $k$. In the current study the graph $TG^t(K^t, E, \rho)$ is thinned by removing edges with weights lower than a cutoff threshold $\varphi$. This parameter is fundamental for the retrieval of emerging topics: too small a value results in a single huge component, while a large value leads to a disconnected graph, which makes the procedure described below for retrieving the topics useless.

The topological structure of the graph can be exploited for retrieving semantically related keywords that are intended as an emerging topic. In particular, for each keyword $z \in EK^t$, an emerging topic is defined as the subgraph $ET_z^t(K_z, E_z, \rho)$ connecting keywords that are semantically related to the keyword $z$ within $I^t$. The subgraph is obtained as the set of vertices $S$ reachable from $z$ through a path computed by means of the Depth First Search algorithm. In other words, topics are represented by strongly connected components. Given the entire set of $n$ emerging keywords $EK^t$, the corresponding set of emerging topics is computed, namely the set $ET^t = \{ET_1^t, ET_2^t, \ldots, ET_n^t\}$ of strongly connected components. At the end of the procedure an emerging topic is represented by an emerging term $z$ and other semantically related common terms not necessarily included in $EK^t$, which can be thought of as popular terms (e.g. 'Trump'). In a pictorial graph representation the connected components can be represented as colored vertices, while the vertex size can indicate whether a term is an emerging term or not (an example will be provided in Section IV).

It is worth noting that the topic graph exploits the information extracted from all tweets, even those that do not contain emerging terms. Hence the current approach is able to retrieve not only the terms that directly co-occur with the emerging terms, but also those that are indirectly related to them, namely terms co-occurring with keywords that themselves co-occur with the emerging terms.

Finally, to provide the user with insights of which topic is more important, topics can be ranked by considering the energies of the related emerging terms.
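The graph construction, thinning, topic extraction and ranking steps can be sketched together in Python. Two assumptions are made: the edge weight ρ is obtained by dividing each correlation by the Euclidean norm of its correlation vector, and a topic is taken as the set of vertices reachable from an emerging keyword through an iterative depth-first search (restricting it to a strongly connected component would additionally require mutual reachability). Function names are illustrative.

```python
import math


def build_topic_graph(corr_vectors, phi):
    """corr_vectors: {k: {z: c_kz}}. Returns {k: {z: rho_kz}} keeping edges with rho >= phi."""
    graph = {}
    for k, vec in corr_vectors.items():
        norm = math.sqrt(sum(c * c for c in vec.values()))  # assumed: Euclidean norm
        graph[k] = {z: c / norm for z, c in vec.items() if norm > 0 and c / norm >= phi}
    return graph


def reachable(graph, start):
    """Vertices reachable from start, found with an iterative depth-first search."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, {}):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def emerging_topics(graph, emerging, energies):
    """One topic per emerging keyword, ranked by the energy of that keyword (highest first)."""
    topics = {z: reachable(graph, z) for z in emerging}
    return sorted(topics.items(), key=lambda item: energies[item[0]], reverse=True)
```

Raising the cutoff removes weak edges and shrinks the topics, which reflects the sensitivity to the cutoff threshold discussed above.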

Two different studies are performed in order to test the proposed approach. The first study assesses the evolution of term energy as a function of time over a 30-day time horizon, whereas the second study focuses on specific days in order to analyze their topics. The selection of terms (first analysis) and days (second analysis) is mainly driven by the events themselves: in fact, as is clear from the previous section, the proposed system works in an unsupervised fashion. To this end, in order to check the effectiveness of the approach, days with interesting events have been selected and validated a posteriori. The same reasoning holds for the selection of terms for energy monitoring. As concerns the topics, several parameters are tested, namely the cutoff value $\varphi$ for thinning the co-occurrence graph, the drop threshold $\delta$ for retrieving emerging terms, and the number $s$ of previous time windows considered in the computation of hot terms. The 'threshold' parameter has been introduced to limit the number of words per topic. Finally, in the presented experiments, the lemmatization step of the preprocessing is not adopted.

In a first analysis, we show the energy evolution for some of the most relevant words in the considered time horizon. For example, Figure 1a shows the energy evolution for the word boris, which shows a spike on 5 April 2020, the day on which he was taken to hospital due to coronavirus. 2 Similarly, Figure 1b regards the word trump, whose relevance on Twitter starts increasing from April, when the coronavirus pandemic started spreading in the U.S.A. and he became a more common topic. 3 Figure 1c shows the trend for the word conte, with spikes on 24 March 2020, 28 March 2020, 1 April 2020, 6 April 2020 and 10 April 2020: on these days Giuseppe Conte held press releases and interviews in order to discuss and introduce new rules and regulations during the lockdown phase in Italy. 4 Finally, Figure 1d shows the energy evolution for the word mes, which became a hot topic in April because of the economic crisis caused by the lockdown in Italy. 5 We remark that the performance of the current version of the topic tracking system, specifically in detecting buzzing topics, is satisfactory: several buzzing keywords, for example those related to president Donald Trump or president Giuseppe Conte, are heavily and constantly used by Twitter users, but they are boosted only at certain times, depending on the underlying events, and the system is able to detect these events along with the related topics.

In this second study, instead of focusing on the relevance of individual words over time, we focus on analyzing topics on specific days within the considered time horizon. Topics are shown in Tables 1-7, with setup parameters reported in their respective captions, whereas Figure 2 shows an example of graph representation of the 27 May 2020 topics. We further provide English translations of the terms composing the topics. For capitalized words and abbreviations we provide additional information in the Appendix. Table 1 shows six topics as lists of relevant terms related to 19 April 2020. The topmost topic deals with coronavirus which, as one would expect, was a hot topic in mid-April due to the pandemic spread in Italy. The second topic deals with Walter Ricciardi, who re-tweeted an anti-Trump tweet from filmmaker Michael Moore. 6 The third topic deals with a press release by Gabriele Gravina in which he pushed against the suspension of Italian football league competitions due to coronavirus by claiming that he does not want to be ''the gravedigger of Italian football''. 7, 8 The fourth topic deals with the increasing number of victims due to coronavirus in Italy and the fifth one regards Lombardy, the Italian region that by far had the highest number of deaths and infections [60]. Finally, the last topic deals with Massimo Giletti, who interviewed Matteo Salvini on several COVID-19-related topics, including Walter Ricciardi's tweet (see second topic) and possible ideas for relaxing the lockdown in Italy. 9 Tables 2-4 regard 5 April 2020, and we use this day in order to address the sensitivity to the cutoff parameter $\varphi$ and to the number $s$ of previous time windows considered in the computation of hot terms. Specifically, Table 2 uses a cutoff value $\varphi$ equal to 0.4 and $s$ can be either 8 or 15, leading to four topics.
The first topic deals with the (rejected) motion of no confidence issued against Giulio Gallera by the Democratic Party due to the bad way (according to the Democratic Party) in which he managed the COVID-19 emergency in Lombardy. 10 The second topic regards the hope to suggested hydroxychloroquine against COVID-19. 12 The last topic, instead, regards the (rejected) request from Matteo Salvini to let churches be open (regardless of the lockdown) for celebrating Easter. 13 Topics in Table 3 have been obtained with cutoff value $\varphi$ = 0.25 and $s$ = 15. The third topic is the same as topic #1 in Table 2, although represented by a higher number of terms. Similarly, the last topic is the same as topic #4 in Table 2 and further includes Fiorello, who replied via Instagram to Matteo Salvini's proposal. 14 The first topic regards the administrative order by the President of the Tuscany region making safety masks mandatory and establishing that masks will be freely distributed door-to-door to avoid gatherings. 15 The fourth topic cheers the news that the number of hospitalized patients starts decreasing (data from the Italian National Institute of Health) and that a lockdown relaxation will be possible if the number of cases keeps decreasing. 16 The fifth one is quite a mixed bag, which may include the suggestion to stay at home or the tragic destiny of nursing homes in Italy. 17 Finally, topics in Table 4 have been obtained by using a cutoff value $\varphi$ equal to 0.3 and $s$ = 8. The first topic is the same as topic #1 in Table 3, topic #2 is likely the same as topic #5 in Table 3, topic #3 is likely the same as topic #2 in Table 3 (although this is quite hard to interpret due to the very few words) and the last topic is the same as topic #4 in Table 3.

12 https://www.nytimes.com/2020/04/05/us/politics/trumphydroxychloroquine-coronavirus.html
13 https://www.ansa.it/sito/notizie/politica/2020/04/05/coronavirussalvini-permettere-le-messe-a-pasqua-_81e512ac-9a26-4ffb-8de7-c0f0ab85d763.html (in Italian).
14 https://www.ilfattoquotidiano.it/2020/04/05/coronavirus-fiorellosalvini-propone-di-aprire-le-chiese-per-pasqua-un-errore-credo-che-dioaccetti-le-preghiere-anche-di-chi-sta-a-casa/5760474/ (in Italian).
15 https://www.ansa.it/sito/notizie/cronaca/2020/04/04/coronavirusin-lombardia-in-giro-con-le-mascherine.-anche-la-toscana-annuncia-lordinanza_6b9afb6d-1848-4366-8090-bb57ee9e1adf.html (in Italian).

Table 5 shows four topics related to 16 April 2020. The topmost one deals with coronavirus, as expected. The second one deals with the death due to COVID-19 of Chilean writer and journalist Luis Sepúlveda. 18 The third topic deals with a press release by Luca Zaia, who proposed to stop the lockdown starting from 4 May 2020. 19 The last topic (related to the previous one) regards several press releases by

Table 6 regards 10 April 2020. The first topic regards a (rejected) proposal from the Democratic Party to introduce an economic manoeuvre according to which wealthy citizens shall be waived a tax in order to support low-income people during the COVID-19 emergency. 21 The second topic regards the (false) accusation from Giorgia Meloni and Matteo Salvini towards Giuseppe Conte of approving the European Stability Mechanism. The last topic (see also Section IV-A) regards the press release by Giuseppe Conte: in said press release, other than introducing and discussing new COVID-19-related rules and regulations, Giuseppe Conte debunked the accusation from Giorgia Meloni and Matteo Salvini (see previous topic). 22 Finally, Table 7 regards 8 April 2020.
The first topic regards a discussion amongst members of the European Union regarding economic manoeuvres to help European countries heavily affected by the coronavirus pandemic, with Netherlands being the most hostile member against this manoeuvre. 23 The second topic regards Bernie Senders dropping out of the 2020 presidential race against republicans, leaving Joe Biden in charge of heading the democratic coalition. 24 The third topic deals with an administrative order according to which Italy, due to the coronavirus pandemic, self-proclaimed as non-safe place for NGOs to dock 25 and no migrants would be allowed on Italian soil. The last topic cheers the news that 8 April 2020 has been one of the days with few new cases and with a lot of recovered patients (more than 2000). 26
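The dependence on s discussed above for the 5 April 2020 topics can be illustrated with a minimal Python sketch. It assumes that the energy of a term is computed in the style of the aging-theory approach to emerging term detection listed in the references, that is, as a lag-weighted sum of the differences between the squared nutrition in the current slot and in each of the s previous slots. The exact formula, the nutrition values and the selection rule used by the system are not restated in this section, so the function names (`energy`, `hot_terms`), the fixed `threshold` parameter and the toy numbers below are all hypothetical.

```python
from typing import Dict, List, Set


def energy(nutrition: List[float], s: int) -> float:
    """Energy of a term in the latest time slot.

    `nutrition` lists the term's nutrition per slot, oldest first; the last
    entry is the current slot. Only the s previous slots are compared.
    Assumed formula: sum over lag x = 1..s of (current^2 - past_x^2) / x.
    """
    if s < 1:
        raise ValueError("s must be at least 1")
    current = nutrition[-1]
    previous = nutrition[:-1][-s:]  # at most s slots before the current one
    return sum(
        (current ** 2 - past ** 2) / lag
        for lag, past in enumerate(reversed(previous), start=1)
    )


def hot_terms(histories: Dict[str, List[float]], s: int, threshold: float) -> Set[str]:
    """Terms whose energy exceeds a (hypothetical) fixed threshold."""
    return {term for term, hist in histories.items() if energy(hist, s) > threshold}


# Toy nutrition values, invented for illustration only (not from the dataset).
histories = {
    "lockdown": [5.0, 5.0, 5.0, 5.0],  # constantly used: energy is zero
    "pasqua": [0.0, 0.0, 1.0, 4.0],    # steadily rising
    "chiese": [4.0, 0.0, 0.0, 2.0],    # rebounding after an earlier peak
}

print(sorted(hot_terms(histories, s=1, threshold=3.0)))  # ['chiese', 'pasqua']
print(sorted(hot_terms(histories, s=3, threshold=3.0)))  # ['pasqua']
```

In this toy setting, a short memory (s = 1) marks a term as hot as soon as it rebounds from a dip, whereas a longer memory (s = 3) also compares it with its earlier peak and discards it. This is one plausible mechanism by which the set of hot terms, and hence the resulting topics, may change with s.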

21 https://www.repubblica.it/politica/2020/04/10/news/il_pd_un_contributo_di_solidarieta_da_chi_ha_un_reddito_superiore_a_80mila_euro_-253640966/ (in Italian).
22 https://www.corriere.it/politica/20_aprile_11/coronavirus-showpremier-conte-diretta-tv-salvini-meloni-dicono-falsita-9396558e-7b67-11ea-afc6-fad772b88c99.shtml (in Italian).
23 https://www.ilfattoquotidiano.it/2020/04/08/coronavirus-fumata-neraeurogruppo-stallo-su-mes-e-eurobond-olanda-noi-contro-nuova-riunionegiovedi/5763524/ (in Italian).
24 https://edition.cnn.com/2020/04/08/politics/bernie-sanders-dropsout/index.html
25 https://www.repubblica.it/cronaca/2020/04/08/news/coronavirus_sbarchi_a_lampedusa_allarme_quarantena_per_i_migranti-253444180/ (in Italian).
26 https://www.repubblica.it/cronaca/2020/04/08/news/coronavirus_bilancio_contagiati_positivi_morti_guariti_picco-253489274/ (in Italian).

In this work we proposed an in-depth analysis of the general debate within the Italian Twitter community during the lockdown period established in Italy for security reasons due to the dramatic COVID-19 pandemic. For this purpose, we experimented with a methodological framework, grounded on a biological metaphor, able to track emerging terms and emerging topics in a given time span starting from a real-world dataset of tweets collected during the lockdown period. The methodology served as a driver to develop a topic tracking system tailored to modern Twitter standards and specifically to the aim of retrieving buzzing terms and topics in the Italian language. The system is found capable of discovering, in an unsupervised fashion, the main emerging terms related even to socio-political events, succeeding in strongly highlighting when they are spiking, even for terms heavily and constantly used, such as, for example, the major Prime Ministers' names. This also holds for the main related topics. The proposed system is general purpose, and can be used on streams of Twitter messages, written in any language, to detect and to track topics emerging from any socially relevant event. The topic tracking system is found to be sensitive to some system parameters, such as the threshold for obtaining the emerging terms and the parameter for thinning the co-occurrence graph. Future work foresees the automatic search for these thresholds and an in-depth analysis of the current dataset at different granulation levels in terms of time interval length, which in the current work is fixed at one day. Furthermore, the system will be equipped with a sentiment analysis module capable even of measuring the quantity of hate speech in social media contents.
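The sensitivity to the parameter that thins the co-occurrence graph can be made concrete with a small Python sketch. It is only an illustration under stated assumptions: edges between hot terms are weighted with the Jaccard similarity of the sets of tweets containing them, edges below a cutoff are dropped, and each connected component with at least two terms is reported as a topic. The weighting scheme, the component-based topic selection, the function name `extract_topics` and the toy tweets are hypothetical and are not claimed to match the system described in this paper.

```python
from itertools import combinations
from typing import Dict, List, Sequence, Set


def extract_topics(
    tweets: Sequence[Sequence[str]],
    hot: Set[str],
    cutoff: float,
    min_size: int = 2,
) -> List[List[str]]:
    """Group hot terms into topics via a thinned co-occurrence graph."""
    # Indices of the tweets in which each hot term occurs.
    postings: Dict[str, Set[int]] = {term: set() for term in hot}
    for idx, tokens in enumerate(tweets):
        for term in set(tokens) & hot:
            postings[term].add(idx)

    # Keep only edges whose Jaccard weight reaches the cutoff (graph thinning).
    adjacency: Dict[str, Set[str]] = {term: set() for term in hot}
    for a, b in combinations(sorted(hot), 2):
        union = postings[a] | postings[b]
        if union and len(postings[a] & postings[b]) / len(union) >= cutoff:
            adjacency[a].add(b)
            adjacency[b].add(a)

    # Each connected component of the thinned graph is a candidate topic.
    seen: Set[str] = set()
    topics: List[List[str]] = []
    for start in sorted(hot):
        if start in seen:
            continue
        component: Set[str] = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(adjacency[node] - component)
        seen |= component
        if len(component) >= min_size:
            topics.append(sorted(component))
    return topics


# Toy tokenized tweets, invented for illustration only (not from the dataset).
tweets = [
    ["salvini", "chiese", "pasqua"],
    ["salvini", "chiese", "pasqua", "fiorello"],
    ["salvini", "chiese"],
    ["gallera", "sfiducia"],
    ["gallera", "sfiducia", "lombardia"],
    ["fiorello", "instagram"],
]
hot = {"salvini", "chiese", "pasqua", "fiorello", "gallera", "sfiducia"}

print(extract_topics(tweets, hot, cutoff=0.4))
# [['chiese', 'pasqua', 'salvini'], ['gallera', 'sfiducia']]
print(extract_topics(tweets, hot, cutoff=0.25))
# [['chiese', 'fiorello', 'pasqua', 'salvini'], ['gallera', 'sfiducia']]
```

In this sketch a lower cutoff keeps weaker edges, so a topic can absorb additional terms, while a higher cutoff yields shorter and more sharply separated topics. An automatic search for the threshold, as foreseen above, could wrap such a function in a loop over candidate cutoff values.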

Exploring Twitter communication dynamics with evolving community analysis

An analysis of Twitter messages in the 2011 Tohoku earthquake

Earthquake shakes Twitter users: Real-time event detection by social sensors

Early tracking of people's reaction in Twitter for fast reporting of damages in the Mercalli scale

Information control and terrorism: Tracking the Mumbai terrorist attack through Twitter

A microblogging-based approach to terrorism informatics: Exploration and chronicling civilian sentiment and response to terrorism events via Twitter

Evaluating public response to the Boston marathon bombing and other acts of terrorism through Twitter

The Arab Spring | 'I'll be waiting for you guys': A YouTube call to action in the Egyptian revolution

Detection and analysis of 2016 US presidential election related rumors on Twitter

Bumps and bruises: Mining presidential campaign announcements on Twitter

Prediction of the 2017 French election based on Twitter data analysis

Follow the leader! Direct and indirect flows of political communication during the 2013 Italian general election campaign

Investigating Italian disinformation spreading on Twitter in the context of 2019 European elections

How to exploit Twitter for public health monitoring?

The use of Twitter to track levels of disease activity and public concern in the U.S. during the influenza A H1N1 pandemic

An effective approach to track levels of influenza-A (H1N1) pandemic in India using Twitter

Clinical features of patients infected with 2019 novel coronavirus in Wuhan, China

Infodemiology: The epidemiology of (mis)information

Infodemiology: Tracking flu-related searches on the Web for syndromic surveillance

Corona virus (COVID-19) 'Infodemic' and emerging issues through a data lens: The case of China

How to fight an infodemic

COVID-19 infodemic: More retweets for science-based information on coronavirus than for false information

Impact of Covid-19 on the media system. Communicative and democratic consequences of news consumption during the outbreak

Tracking Covid-19 in Europe: An infodemiology study

Using reports of symptoms and diagnoses on social media to predict COVID-19 case counts in Mainland China: Observational infoveillance study

Infoveillance to analyze Covid19 impact on central America population

Association of the COVID-19 pandemic with Internet search volumes: A Google Trends™ analysis

The COVID-19 social media infodemic

Top concerns of tweeters during the COVID-19 pandemic: Infoveillance study

Emerging topic detection on Twitter based on temporal and social terms evaluation

Life cycle modeling of news events using aging theory

Topic Detection and Tracking: Event-based Information Organization

Sensing trending topics in Twitter

Breaking news detection and tracking in Twitter

TweetMotif: Exploratory search and topic summarization for Twitter

Beyond trending topics: Real-world event identification on Twitter

Speech & Language Processing

TwitterStand: News in tweets

Latent Dirichlet allocation

Research on topic detection and tracking for online news texts

Online learning for latent Dirichlet allocation

Latent Dirichlet allocation (LDA) and topic modeling: Models, applications, a survey

A text mining research based on LDA topic modelling

Streaming first story detection with application to Twitter

Peaks and persistence: Modeling the shape of microblog conversations

Patterns of temporal variation in online media

Dynamical classes of collective attention in Twitter

Event detection and tracking in social streams

A graph-based clustering scheme for identifying related tags in folksonomies

Meme-tracking and the dynamics of the news cycle

Unsupervised and supervised clustering for topic tracking

On-line LDA: Adaptive topic models for mining text streams with applications to topic detection and tracking

Finding scientific topics

Probabilistic part-of-speech tagging using decision trees

Improvements in part-of-speech tagging with an application to German

Term-weighting approaches in automatic text retrieval

Finding and analyzing principal features for measuring user influence on Twitter

The PageRank citation ranking: Bringing order to the Web

A survey on the use of relevance feedback for information access systems

The outbreak of Covid-19 in Italy

ENRICO DE SANTIS (Member, IEEE) received the M.A.Sc. (Hons.) and Ph.D. degrees in information and communication engineering from the "Sapienza" University of Rome, Italy. During his Ph.D., he worked as an Assistant Researcher and a Postdoctoral Researcher with the Department of Computer Science, Ryerson University, Toronto. He currently holds a Postdoctoral position with the Department of Information Engineering, Electronics and Telecommunications (DIET), Sapienza. In 2017, he joined the innovative startup SisterPomos at "Sapienza" University as CTO, dealing with the management of artificial intelligence projects in production environments. His research interests include artificial intelligence, complex systems and data-driven modeling, natural language processing, computational intelligence, neural networks, and fuzzy systems with application to several technical areas, such as smart grids and predictive maintenance. With regard to the NLP field, his interests include theoretical advances of natural language modeling to applications in text and social data mining. He has (co)authored more than 170 international journal/conference articles and book chapters. He is currently working on smart grids and microgrids modeling and control, intelligent systems for sustainable mobility, battery management systems, granular computing, data mining and knowledge discovery, computational biology, machine learning in non-metric spaces, graph and sequence matching, agent-based clustering, and parallel and distributed computing. His major research interests include computational intelligence and pattern recognition, including supervised and unsupervised machine learning techniques, neural networks, fuzzy systems, and evolutionary algorithms. His research interests include design of automatic modeling systems, focusing on classification, clustering, function approximation, and prediction problems.