key: cord-0447337-qmqvbo50 authors: Béres, Ferenc; Csoma, Rita; Michaletzky, Tamás Vilmos; Benczúr, András A. title: Vaccine skepticism detection by network embedding date: 2021-10-20 journal: nan DOI: nan sha: cfa56d06ca8e19e89db83f6bce941ade8be01df4 doc_id: 447337 cord_uid: qmqvbo50

We demonstrate the applicability of network embedding to vaccine skepticism, a controversial topic with a long history. With the outbreak of the Covid-19 pandemic at the end of 2019, the topic is more important than ever. Only a year after the first international cases were registered, multiple vaccines had been developed and had passed clinical testing. Besides the challenges of development, testing, and logistics, another factor that might play a significant role in the fight against the pandemic is the number of people who are hesitant to get vaccinated, or who even state that they will refuse any vaccine offered to them. Two groups are commonly distinguished: (a) pro-vaxxers, who support vaccinating people, and (b) vax-skeptics, who question vaccine efficacy or the need for general vaccination against Covid-19. It is very difficult to tell exactly how many people share each of these views. It is even more difficult to understand why vax-skeptic opinions are becoming more popular.

In this work, our intention was to develop techniques that can efficiently differentiate between pro-vaxxer and vax-skeptic content. After multiple data preprocessing steps, we analyzed the tweet text as well as the structure of user interactions on Twitter. We deployed several node embedding and community detection models that scale well to graphs with millions of edges. For classification, we used the following three modalities with logistic regression:
1. text: 1,000-dimensional TF-IDF vector of the tweet text;
2. history: four basic statistics calculated from past tweet labels of the same user;
3. embedding: 128-dimensional user representation in the reply network.
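As a hedged illustration of the history modality listed above, the sketch below computes four per-tweet statistics from the earlier labels of the same user. The paper does not name the four statistics, so the ones used here (number of past labeled tweets, number of past vax-skeptic tweets, fraction of vax-skeptic tweets, and the most recent label) are assumptions, as are the function name `history_features`, the `(timestamp, user_id, label)` tuple layout, and the neutral value 0.5 for users with no past label.

```python
from collections import defaultdict


def history_features(tweets):
    """Compute four history statistics per tweet from strictly earlier tweets.

    tweets: list of (timestamp, user_id, label) tuples, where label is
            1 for vax-skeptic and 0 for pro-vaxxer (assumed encoding).
    Returns a list of 4-tuples aligned with the input order:
        (n_past, n_skeptic, frac_skeptic, last_label)
    The choice of statistics is an assumption, not taken from the paper.
    """
    # Process tweets in chronological order, but remember original positions.
    order = sorted(range(len(tweets)), key=lambda i: tweets[i][0])
    past_labels = defaultdict(list)  # user_id -> labels seen so far
    features = [None] * len(tweets)

    for i in order:
        _, user, label = tweets[i]
        labels = past_labels[user]
        n_past = len(labels)
        n_skeptic = sum(labels)
        if n_past:
            frac_skeptic = n_skeptic / n_past
            last_label = float(labels[-1])
        else:
            # No history for this user: fall back to a neutral value.
            frac_skeptic = 0.5
            last_label = 0.5
        features[i] = (n_past, n_skeptic, frac_skeptic, last_label)
        # Only now does the current label become part of the user's past.
        labels.append(label)

    return features
```

Because the label of a tweet is appended only after its own features are computed, a tweet never sees its own label. The resulting 4-tuples could then be concatenated with the text and embedding vectors before fitting the classifier.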
We split the tweet data chronologically into 70% training and 30% testing. Our results are summarized in Table 1. Not surprisingly, user statistics contribute strongly, as users usually stick to their past opinion. User representations from the Twitter reply network improve performance, as seen in Figure 1. Indeed, tweets posted by users with no past label can be inferred better from their social relations. Walklets [8], the best performing node embedding model in Figure 2, even managed to find pro-vaxxer and vax-skeptic user clusters; see Figure 3. For future work, we will replace the logistic regression classifier with a unified back-propagation neural network.

Summary. In this work, we quantitatively showed that social interactions play a major role in detecting vaccine skepticism. By deploying multiple node embedding models on a large Twitter reply network, we managed to discover pro-vaxxer and vax-skeptic communities. For reproducibility and future research purposes, we share our data on GitHub. To comply with the data publication policy of Twitter, we share only the user ID and the original and reply tweet IDs, along with the encoded content vectors.

Fig. 3. Walklets clusters pro-vaxxer and vax-skeptic users well in the embedded space. On the left we show the kernel density estimation of the two groups (vaccine view: Skeptic, Pro) for the whole test period, while on the right only users active between 5 and 13 May are visualized.
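The chronological split described above can be sketched as follows. This is a minimal example under stated assumptions: the text does not say whether the 70%/30% ratio refers to the number of tweets or to the time span, and the sketch assumes the former. The function name `temporal_split` and the `timestamp` field are hypothetical.

```python
def temporal_split(tweets, train_percent=70):
    """Split tweets by time so that every training tweet precedes every test tweet.

    tweets: list of dicts with a sortable "timestamp" field (hypothetical name).
    train_percent: integer share of tweets (by count, an assumption) used for training.
    Returns (train, test), both sorted chronologically.
    """
    ordered = sorted(tweets, key=lambda t: t["timestamp"])
    # Integer arithmetic avoids floating-point rounding at the cut point.
    cut = len(ordered) * train_percent // 100
    return ordered[:cut], ordered[cut:]
```

Splitting by time rather than at random keeps later tweets out of the training set, so features derived from a user's past labels are never computed from tweets that follow the test period.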
[1] Karate Club: An API Oriented Open-source Python Framework for Unsupervised Learning on Graphs
[2] Learning sentiment-specific word embedding for twitter sentiment classification
[3] Enhanced network embedding with text information
[4] Context attention heterogeneous network embedding
[5] Semi-supervised network embedding with text information
[6] Flipping stance: Social influence on bot's and non bot's COVID vaccine stance
[7] COVID-19 vaccine hesitancy on social media: Building a public twitter dataset of anti-vaccine content, vaccine misinformation and conspiracies
[8] Walklets: Multiscale graph embeddings for interpretable network classification