key: cord-0665636-tuu54z1y
authors: Pierri, Francesco; Pavanetto, Silvio; Brambilla, Marco; Ceri, Stefano
title: VaccinItaly: monitoring Italian conversations around vaccines on Twitter
date: 2021-01-11
journal: nan
DOI: nan
sha: 629b5200a0d235b934f81f08de09d7739c22a119
doc_id: 665636
cord_uid: tuu54z1y

We monitor online conversations of Italian users around vaccines on Twitter, and we provide public access to the on-going data collection, which will run continuously throughout the vaccination campaign taking place in Italy. We started collecting tweets matching vaccine-related keywords (in Italian) on December 20th 2020 using Twitter APIs, capturing the Italian vaccine rollout (December 27th 2020). At the time of this writing (January 13th 2021) we have collected over 1.8 M tweets, with an average of 30 k tweets shared on a daily basis. We report a substantial amount of low-credibility information already circulating on Twitter alongside vaccine-related conversations, whose prevalence is smaller than, yet comparable to, that of high-credibility information. We believe that our data will allow researchers to understand the interplay between public discourse on online social media and the evolution of the on-going vaccination campaign against SARS-CoV-2 taking place in Italy.

On January 30th 2020, the World Health Organization declared the outbreak of a novel coronavirus (SARS-CoV-2) a Public Health Emergency of International Concern. A year later, the spread of the virus has caused almost 90 M confirmed cases and around 2 M fatalities globally. Italy, in particular, was one of the first European countries to be hit by the virus, with over 2 M confirmed cases and 85 k fatalities at the beginning of 2021, and the first country outside China to implement a national lockdown to curb its spread, with severe social and economic consequences (Spelta et al. 2020).
Despite the global crisis, we witnessed the most rapid vaccine development for a pandemic in history when the Pfizer-BioNTech vaccine showed a 95% efficacy and was recently approved in several countries. In the next few months, dozens of other vaccines are expected to be approved and made available to the public. Italy, specifically, started its vaccination campaign on December 27th 2020 and reached over 800 k dispensed doses by January 13th 2021.

Figure 1: Daily number of tweets matching our list of vaccine-related keywords (y-axis: No. tweets; the Italian vaccine rollout is marked). The period shown goes from December 20th 2020 to January 13th 2021, but the collection is ongoing.

As COVID-19 was spreading around the world, online social networks experienced a so-called "infodemic", i.e. an over-abundance of information about the on-going pandemic, which yields severe repercussions on public health and safety (Zarocostas 2020; Yang et al. 2020; Gallotti et al. 2020). It is believed that low-credibility information might drive vaccine hesitancy and make it hard to reach herd immunity (Yang et al. 2020). The SOMA European observatory on online disinformation has recently identified four macro-categories of unreliable information about COVID-19 vaccines: (1) there haven't been enough tests on vaccines to guarantee their safety; (2) some individuals died after being vaccinated; (3) there are further medical complications due to vaccines; (4) vaccines can modify our DNA.

Since the 2016 US presidential elections, the research community has mostly focused its attention on political disinformation and election-related manipulation of online conversations (Lazer et al. 2018; Shao et al. 2018; Ferrara et al. 2016; Pierri and Ceri 2019).
However, much concern has grown around health-related misinformation, which became manifest during recent measles outbreaks (Filia et al. 2017) and other epidemics such as H1N1 and Ebola (Chew and Eysenbach 2010; Fung et al. 2014), eroding public trust in governments and institutions and undermining public countermeasures during such crises.

In this paper, we describe an on-going data collection which enables researchers to investigate conversations around vaccines on Twitter during the vaccination campaign which started in Italy at the end of December 2020. At the time of this writing (January 13th 2021), we have collected more than 1.8 M tweets matching a series of Italian keywords related to vaccines in general; the keywords are updated on a daily basis to capture trending hashtags and relevant events. The collection will run continuously throughout 2021 to keep track of the vaccination campaign. The repository associated with this project can be accessed here: https://github.com/frapierri/VaccinItaly.

In the following, we review related literature and briefly describe our data, with a specific focus on the prevalence of low- and high-credibility information shared alongside Twitter conversations on vaccines. We leave further analyses to future work. We finally draw some conclusions and suggest potential usages for this dataset. We believe that this data can contribute to a deeper understanding of the impact of online social networks in an unprecedented scenario where trust in science and governments will be critical to battle a global pandemic.

There is a huge corpus of literature around the diffusion of health-related (dis)information on online social networks. We describe a few contributions which are related to the Italian context, and refer the reader to (Wang et al. 2019) for a deeper review of the existing literature on the subject.

(Aquino et al.
2017) explored the relationship between measles, mumps and rubella (MMR) vaccination coverage in Italy and online search trends and social network activity from 2010 to 2015. Using a set of keywords related to the controversial link between MMR vaccines and autism, which originated from a discredited 1998 paper, the authors analyzed Google (search) Trends as well as the activity of Facebook pages and Twitter users on the same subject, and reported a significant negative correlation with the evolution of vaccination coverage in Italy (which decreased from 90% to 85% during the period of observation). They also identified real-world triggering events which most likely drove vaccine hesitancy, i.e. Court of Justice sentences which ruled in favor of a possible link between the MMR vaccine and autism.

(Donzelli et al. 2018) provide a quantitative analysis of the Italian videos published on YouTube, from 2007 to 2017, about the link between vaccines and autism or other serious side effects in children. They showed that videos with a negative tone were more prevalent and got more views than those with a positive attitude, although they did not inspect how the videos actually treated the link between vaccines and autism.

(Righetti 2020) analyzed the Italian vaccine-related environment on Twitter around the mandatory child vaccination law promulgated in 2017. Using a keyword-based data collection similar to ours, the author showed that the strong "politicization" of the debate was associated with an increase in the amount of problematic information, such as conspiracy theories, anti-vax narratives and false news, shared by online users.

Starting on December 20th 2020, we use a combination of the Twitter Filter and Search APIs to collect tweets matching the set of keywords in Table 1. In particular, we use the streaming endpoint in real time, and we routinely employ the historical endpoint whenever we update our list of keywords, in order to avoid missing relevant tweets.
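The combination of a real-time stream with a historical backfill can be illustrated with a minimal sketch. It does not call the Twitter APIs and does not reproduce their matching rules; it only shows, under stated assumptions, how tweets could be matched against a keyword list locally and how the results of the two endpoints could be merged without duplicates. The tweet representation (dictionaries with `id` and `text` fields), the function names, and the sample keywords are hypothetical; the actual keyword list is the one in Table 1.

```python
import re


def matches_keywords(text, keywords):
    """Return True if any keyword appears in the text as a whole token.

    Matching is case-insensitive and ignores a leading '#', so the keyword
    '#novax' matches both 'novax' and '#novax'. This is a local
    approximation, not the matching logic of the Twitter endpoints.
    """
    tokens = set(re.findall(r"\w+", text.lower()))
    return any(k.lower().lstrip("#") in tokens for k in keywords)


def merge_tweets(streamed, backfilled):
    """Merge two tweet collections, keeping one copy per tweet id.

    Tweets are assumed to be dicts with an 'id' key (hypothetical schema).
    The first occurrence of each id wins, so streamed tweets take priority.
    """
    merged = {}
    for tweet in list(streamed) + list(backfilled):
        merged.setdefault(tweet["id"], tweet)
    return list(merged.values())


# Usage: a backfill after adding a new keyword may return tweets that the
# stream already captured; merging by id avoids counting them twice.
streamed = [{"id": 1, "text": "Oggi primo #vaccino in Italia"}]
backfilled = [
    {"id": 1, "text": "Oggi primo #vaccino in Italia"},
    {"id": 2, "text": "Io #iononsonounacavia"},
]
keywords = ["vaccino", "#iononsonounacavia"]  # illustrative subset only
collected = [t for t in merge_tweets(streamed, backfilled)
             if matches_keywords(t["text"], keywords)]
```

Deduplicating by tweet identifier is one simple way to make a backfill safe to repeat after every keyword update.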
We check daily for trending hashtags and relevant events to add new keywords, e.g. "#novaccinoainovax" and "#iononsonounacavia" were trending on specific days only. We provide in Figure 1 a time series of the daily number of collected tweets matching vaccine-related keywords, and in Table 2 a breakdown of the dataset at the time of this writing (January 13th 2021). We highlight with a red vertical line the official start of the vaccination campaign in Italy (December 27th), which corresponds to a peak in Twitter volume.

We show the top-10 most tweeted hashtags in Figure 2. Most of them refer to the COVID-19 pandemic, with only one referring to the on-going debate between pro- and anti-vaccine followers in the early days of the vaccination campaign (cf. "novax"). In Figure 3 we show the top-10 most active users who are also verified by Twitter. We can see that they mostly correspond to news media.

Figure 4: Daily number of tweets containing links to low- and high-credibility information websites. We use red and blue, respectively, to color the y-axis and the curve corresponding to the number of tweets for low- and high-credibility information.

We extract URLs contained in tweets to understand the prevalence of low- and high-credibility information shared in vaccine-based conversations (Yang et al. 2020). We use a consolidated source-based approach (Lazer et al. 2018; Gallotti et al. 2020; Shao et al. 2018; Pierri, Piccardi, and Ceri 2020; Pierri, Artoni, and Ceri 2020; Grinberg et al. 2019; Bovet and Makse 2019) to label news articles depending on the reliability of the source, referring to the lists of Italian low- and high-credibility news websites provided by (Pierri 2020; Pierri, Artoni, and Ceri 2020).
The former corresponds to websites flagged by Italian fact-checkers for publishing false news, hoaxes and conspiracy theories; the latter corresponds to traditional and most popular Italian news websites. The lists are available both in the dataset associated with this paper and in our repository, and we plan to manually augment them during our analyses.

In Figure 4 we show a time series of the daily number of tweets containing URLs from both classes of news websites (we use two different y-axes and colors to refer to each of them). We can see that high-credibility information is generally more prevalent than low-credibility information, and that the two are positively correlated (Pearson correlation: R = 0.89, p-value ≈ 0).

Overall, low-credibility information amounts to over 20 k tweets, and the top-10 shared websites are shown in Figure 5. We notice that a handful of them account for most of the shared tweets. Besides, most of them are associated with the (far-)right wing community (cf. "liberoquotidiano.it", "laverita.info" and "ilprimatonazionale.it"), and we also notice "it.sputniknews.com", a news outlet belonging to a Russian network of news websites which is oftentimes shared by Italian right-wing politicians.

High-credibility information accounts for approx. 95 k tweets, and we show the top-10 shared websites in Figure 6. We can see that the most shared website ("ansa.it") is shared about as often as the most shared source of low-credibility information ("imolaoggi.it"). Other websites in this list exhibit a prevalence above 4 k tweets, which is much higher than most low-credibility websites.

We remark that a limitation of these estimates is that our lists might not fully capture the amount of low- and high-credibility information circulating on Twitter. Besides, we do not consider other types of content such as photos, videos, memes, etc.
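The source-based labeling and the correlation between the two daily time series described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used to produce the reported figures: the two domain sets are tiny excerpts built from websites named in the text (the full lists are the ones distributed with the dataset), the input format (pairs of a date string and a list of URLs) is hypothetical, and matching is done on the exact host name after stripping a leading "www.".

```python
from collections import Counter
from math import sqrt
from urllib.parse import urlparse

# Illustrative excerpts only; the complete lists ship with the dataset.
LOW_CREDIBILITY = {"imolaoggi.it", "ilprimatonazionale.it"}
HIGH_CREDIBILITY = {"ansa.it"}


def domain_of(url):
    """Extract the lower-cased host of a URL, without a leading 'www.'."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def label_url(url):
    """Label a URL 'low' or 'high' by its source, or None if unlisted."""
    domain = domain_of(url)
    if domain in LOW_CREDIBILITY:
        return "low"
    if domain in HIGH_CREDIBILITY:
        return "high"
    return None


def daily_counts(tweets):
    """Count, per day, the tweets linking to each class of websites.

    `tweets` is an iterable of (day, urls) pairs (hypothetical format).
    A tweet is counted at most once per class, even with several URLs.
    """
    counts = {"low": Counter(), "high": Counter()}
    for day, urls in tweets:
        for label in {label_url(u) for u in urls} - {None}:
            counts[label][day] += 1
    return counts


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Usage: build the two daily series over the same days, then correlate them.
sample = [
    ("2020-12-27", ["https://www.ansa.it/a"]),
    ("2020-12-27", ["https://www.ansa.it/b", "http://imolaoggi.it/c"]),
    ("2020-12-28", ["https://example.org/d"]),
]
counts = daily_counts(sample)
days = sorted({day for day, _ in sample})
low_series = [counts["low"][d] for d in days]
high_series = [counts["high"][d] for d in days]
```

Labeling by source rather than by article keeps the procedure simple, at the cost noted above: anything published by a website outside the two lists is left unlabeled.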
We describe an on-going data collection which focuses on online conversations of Italian users around vaccines on Twitter, and we provide full access to the data, which is continuously updated. We believe that this data allows researchers to understand the interplay between public discourse on online social media and the vaccine rollout which will take place worldwide in the following months. In particular, we believe that researchers can exploit our data to pursue several directions. To name a few: investigate the prevalence of reliable and unreliable online information and their impact on real-world initiatives (Yang et al. 2020); assess the presence of "coordinated inauthentic behaviour" and understand the role played by malicious online actors in such a critical moment (Pacheco et al. 2020; Broniatowski et al. 2018); analyze the polarization which takes place between pro and anti views on vaccines and understand the mechanisms which lead to the formation of hyper-partisan online communities (Schmidt et al. 2018; Johnson et al. 2020).

The web and public confidence in MMR vaccination in Italy
Economic and Social Consequences of Human Mobility Restrictions Under COVID-19
Influence of fake news in Twitter during the 2016 US presidential election
Weaponized health communication: Twitter bots and Russian trolls amplify the vaccine debate
Pandemics in the age of Twitter: content analysis of Tweets during the 2009 H1N1 outbreak
Misinformation on vaccination: a quantitative analysis of YouTube videos
The rise of social bots
Ongoing outbreak with well over 4,000 measles cases in Italy from January to end August 2017 - what is making elimination so difficult?
Ebola and the social media
Assessing the risks of 'infodemics' in response to COVID-19 epidemics
The online competition between pro- and anti-vaccination views
The science of fake news
The diffusion of mainstream and disinformation news on Twitter: the case of Italy and France
Investigating Italian disinformation spreading on Twitter in the context of 2019 European elections
False news on social media: a data-driven survey
A multi-layer approach to disinformation detection in US and Italian news spreading on Twitter
Health Politicization and Misinformation on Twitter. A Study of the Italian Twittersphere from Before, During and After the Law on Mandatory Vaccinations
Polarization of the vaccination debate on Facebook
The spread of low-credibility content by social bots
After the lockdown: simulating mobility, public health and economic recovery scenarios
Systematic literature review on the spread of health-related misinformation on social media
The COVID-19 Infodemic: Twitter versus Facebook
How to fight an infodemic

This work has been partially supported by the PRIN grant HOPE (FP6, Italian Ministry of Education), and the EU H2020 research and innovation programme, COVID-19 call, under grant agreement No. 101016233 "PERISCOPE" (https://periscopeproject.eu/).