key: cord-0529315-yolcw5ol authors: Catapang, Jasper Kyle; Cleofas, Jerome V. title: Topic Modeling, Clade-assisted Sentiment Analysis, and Vaccine Brand Reputation Analysis of COVID-19 Vaccine-related Facebook Comments in the Philippines date: 2021-10-11 journal: nan DOI: nan sha: 626882c56a1eff64eaaf86d2b2085e84c15ef979 doc_id: 529315 cord_uid: yolcw5ol

Vaccine hesitancy and other COVID-19-related concerns and complaints in the Philippines are evident on social media. It is important to identify these topics and sentiments in order to gauge public opinion, use the insights to develop policies, and make the adjustments needed to improve the public image and reputation of the administering agency and of the COVID-19 vaccines themselves. This paper proposes a semi-supervised machine learning pipeline that performs topic modeling, sentiment analysis, and vaccine brand reputation analysis to obtain an in-depth understanding of the national public opinion of Filipinos on Facebook. The methodology uses a multilingual version of Bidirectional Encoder Representations from Transformers (BERT) for topic modeling, hierarchical clustering, five different classifiers for sentiment analysis, and cosine similarity of BERT topic embeddings for vaccine brand reputation analysis. Results suggest that COVID-19 misinformation of any type is an emergent property of COVID-19 public opinion, and that the detection of COVID-19 misinformation can be an unsupervised task. Sentiment analysis aided by hierarchical clustering reveals that 21 of the 25 topics extracted by topic modeling are negative. Such negative comments spike in count whenever the Department of Health in the Philippines posts about the COVID-19 situation in other countries.
Additionally, the high number of laugh reactions on Facebook posts by the same agency, none of which contain humorous content, suggests that people react the way they do not because of what the posts are about but because of who posted them. It is recommended to employ interventions that correct misinformation, engage people, and use local narratives of success.

The COVID-19 pandemic has drastically affected the health and overall wellness of the entire world. On January 30, 2020, the first case of COVID-19 in the Philippines was reported by the country's Department of Health (DOH). COVID-19 is a respiratory disease caused by the SARS-CoV-2 virus, first identified in the city of Wuhan, China (Paules et al., 2020). Over a year later, on March 1, 2021, the Philippines started its COVID-19 vaccination program. Vaccine hesitancy among Filipinos is an ongoing phenomenon that the national vaccine campaign is struggling with (Alfonso et al., 2021). According to a recent study in the Philippines (Caple et al., 2021), only 62.5% of the 7,193 respondents were willing to be vaccinated against COVID-19. A majority of the same respondents were willing to be inoculated only after many others had received the vaccine or after political figures had done so (Caple et al., 2021). The participants' vaccine brand preferences were also studied; 59.7% of the participants were confident in a USA-made or European-made COVID-19 vaccine (Caple et al., 2021).

Sentiment analysis is a common natural language processing (NLP) task that has been applied in a number of studies of COVID-19 public opinion (Melton et al., 2021; Garcia and Berton, 2021). It computationally classifies the polarity of text data as neutral, positive, or negative.
This is primarily done because gauging the sentiment of the public, especially on critical topics such as the COVID-19 pandemic, helps determine possible policies and interventions that could shape the actions that society takes.

A repository dedicated to this study, containing the link to the raw datasets, the source code of the different analyses performed in the study, and other miscellaneous files, can be found on GitHub.

The data comprise around 100 top comments for each of 50 Facebook posts by the official page of the Department of Health Philippines. These comments are primarily in English and Filipino. The 50 posts are the search results for the query string "#RESBAKUNA #BIDASolusyon #BIDAangMayDisiplina" with the year set to 2021. The comments obtained range from April 20, 2021 to September 9, 2021. These query parameters cover the entire call-for-vaccination campaign by the Department of Health Philippines until September 9, 2021. Facebook's "top comments" sorting filter is based on the number of reactions a comment has. A total of 4,877 comments were extracted from the 50 Facebook posts. In addition to the comments, their timestamps and the contents of the posts they respond to were also extracted. The data were collected using Selenium and Python 3. The scraped comments were then preprocessed by removing stop words, punctuation, and emojis, and by converting all letters to lowercase. This preprocessed dataset is the one used in the experiment.

N-grams are subsequences of length n of virtually any sequence, such as text or speech (Dai et al., 2020). In natural language processing, n-grams are most commonly used for word sequences. For example, "vaccines work effectively" and "the viral strains" are trigrams (3-grams, n=3) of the text: "COVID-19 vaccines work effectively on any of the viral strains."
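The trigram example above can be reproduced with a short sketch. This is not the study's code: the tokenizer regex, the function names `tokenize`, `word_ngrams`, and `ngram_counts`, and the choice to keep hyphenated tokens such as "covid-19" whole are all assumptions made for illustration, and stop-word removal is left out so that the trigram "the viral strains" survives.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep word tokens; hyphenated tokens stay whole (assumed rule)."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def word_ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(comments, n):
    """Count word n-grams over a list of comments without crossing comment boundaries."""
    counts = Counter()
    for comment in comments:
        counts.update(word_ngrams(tokenize(comment), n))
    return counts


sentence = "COVID-19 vaccines work effectively on any of the viral strains."
trigrams = word_ngrams(tokenize(sentence), 3)

# Hypothetical comments, used only to show frequency counting of 4-grams.
toy_comments = ["want to be vaccinated", "I want to be vaccinated"]
four_gram_counts = ngram_counts(toy_comments, 4)
```

Calling `ngram_counts(comments, n).most_common(10)` on a list of preprocessed comments would give the ten most frequent n-grams, which is the kind of ranking the frequency tables report.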
Word n-grams are used to model co-occurring words within text data in order to find the frequent word combinations that may help in modeling the data for a natural language processing task (Dai et al., 2020). In this study, word n-grams are extracted from the Facebook comments to model the frequently occurring combinations of words in the data.

Topic modeling provides fast and effective unsupervised extraction of topics from text data that are subjective in nature (Melton et al., 2021). These texts include product and service reviews and social media posts. Topic modeling techniques such as Latent Dirichlet Allocation (LDA) and non-negative matrix factorization (NMF) for long pieces of text, and biterm topic modeling for short pieces of text such as tweets, have been the go-to algorithms of numerous studies utilizing topic models (Yan et al., 2013). However, recent advances in deep learning have enabled the integration of transformer architectures into topic modeling (Abuzayed and Al-Khalifa, 2021). A widely used transformer model, Bidirectional Encoder Representations from Transformers (BERT), provides context through its embeddings owing to its bidirectionality (Abuzayed and Al-Khalifa, 2021). This paper leverages that extra layer of context to extract high-quality topics through a multilingual BERT model that the Python library BERTopic provides (Grootendorst, 2020). A temporal variation of the topic model is also proposed to track the evolution of the topics over time.

Public opinion regarding the COVID-19 pandemic has been the subject of multiple NLP studies (Melton et al., 2021; Lyu et al., 2021; Garcia and Berton, 2021). However, these studies, although they mention vaccine-related topics and sentiments, have not discussed the public image of the different COVID-19 vaccine brands themselves as reflected in the extracted comments.
In this article, cosine similarity of the BERT topic embeddings obtained from the topic modeling experiment is proposed to associate the different COVID-19 vaccine brands with the different topics extracted from the Facebook comments. This technique is similar to the approach proposed by Thongtan and Phienthrakul (2019). The COVID-19 vaccine brands found in the data are Pfizer, Moderna, AstraZeneca, Johnson & Johnson, Sputnik V, and Sinovac.

Hierarchical clustering, one of the most used clustering techniques, is used to gain insights from the structure of a dataset. In this type of clustering, a pairwise measure of dissimilarity is used to assess the distance between two sets. This measure is called a linkage (Dogan and Birant, 2021). The linkage that BERTopic utilizes is Ward's linkage (Grootendorst, 2020). Ward's method is illustrated in Equation 1. The cluster distances initially used in Ward's method are thus defined to be the square of the Euclidean distance between the data points. The hierarchy produced by Ward's method consists of sub-hierarchies named clades (Dogan and Birant, 2021). These clades of topics extracted by BERTopic are used to assign sentiments to the different topics. Because the clades dictate the sentiment of the topics, the performance of the sentiment analysis relies on the quality of the output of the hierarchical clustering algorithm. In addition, this clade-assisted sentiment analysis eliminates the need to manually assign a sentiment label to each of the Facebook comments studied in the experiment, effectively making the proposed VERTEBRATE pipeline a semi-supervised learning approach.

Cohen's kappa statistic, shown in Equation 2, ranges from -1 to 1. In Equation 2, P_o refers to the relative observed agreement among raters, while P_e is the hypothetical probability of chance agreement.
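As a minimal sketch of the agreement statistic described above, assume Equation 2 takes the standard Cohen's kappa form, kappa = (P_o - P_e) / (1 - P_e). The function name `cohens_kappa` and the two label lists are hypothetical; the lists are made-up values for illustration and are not data from the study.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equally long label sequences (standard form, assumed)."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)

    # P_o: share of items on which the two raters agree.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # P_e: agreement expected by chance, from each rater's label frequencies.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_o - p_e) / (1 - p_e)


# Made-up labels: one list plays the clade-assigned sentiment, the other a classifier output.
clade_labels = ["neg", "neg", "neg", "pos", "pos", "neg", "pos", "neg"]
predicted = ["neg", "neg", "pos", "pos", "pos", "neg", "pos", "neg"]
kappa = cohens_kappa(clade_labels, predicted)
```

In this toy case the raters agree on 7 of 8 items (P_o = 0.875) and chance agreement is P_e = 0.5, so kappa is 0.75. Full agreement gives 1, and systematic disagreement on balanced labels gives -1, matching the stated range.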
In this study, the ground truth is provided by the assignments made by the clades found in the hierarchical clustering model, and the other rater is the classification output.

The n-gram frequencies, discussed in Section 2.2, are extracted to obtain the most commonly occurring sets of words in the data. The unigrams in Table 2 are the following: "vaccine", "bakuna", "wala", "tao", "DOH", "dose", "vaccinated", "sana", "ayaw", and "namatay". "Bakuna" is "vaccine" in Filipino. "Wala" is a Filipino word used to describe the absence of something; in this context it can pertain to the vaccine or other COVID-19 essentials. "Tao" is "person" in Filipino. "Sana" is a Filipino expression of hope or longing. "Ayaw" is a Filipino word that signals disapproval or rejection. "Namatay" is a Filipino word for the dead (as a noun) or died (as a past-tense verb).

Table 4 lists the ten most frequent and relevant trigrams. Several trigrams extracted from the data require further discussion. As explained earlier, the first two trigrams are fragments of "Department of Health Philippines". The trigram "want to be" expresses desire, but it cannot stand on its own without being part of a 4-gram. Specifically, "want to be" needs to be followed by another word. The trigram "be vaccinated but" expresses concern or reservation about being vaccinated. Lastly, "wag pilitin ayaw" roughly translates to "don't force someone who's unwilling".

For the last set of n-gram frequencies, Table 5 lists the 4-grams present in the data. "Department of health philippines", "of health philippines department", "health philippines department of", and "philippines department of health" are all variants of Department of Health Philippines. As discussed earlier, the trigram "want to be" requires another word after it. The 4-gram "want to be vaccinated" completes the thought. The rest of the 4-grams require no explanation.
After a multilingual BERT model is applied for topic modeling on the 4,877 comments, a health professional names each of the 25 resulting topics by assessing the terms associated with the topic cluster and its representative comment. A dendrogram constructed from the BERT topic embeddings, shown in Figure 1, demonstrates the hierarchy of the topics found within the data. The topics of indeterminable sentiment are removed as they skew the data. As shown in Figure 1, the data is divided into two superclades. The green superclade talks primarily about COVID-19 vaccine misinformation. The blue superclade is a combination of COVID-19 misinformation, frequently asked questions, and comments indicating desperation.

The results of the evaluation metrics on the XGBoost model are shown in ...

LightGBM performed the best among all the models in the experiments. Its performance is quantitatively illustrated by ...

The performance metrics of the support vector machine are shown in Ta...

The BERT topic model is able to extract 25 distinct topics found in Tables 6, 7 ...

The COVID-19 pandemic has radically changed the world. Additionally, clade-assisted sentiment classification effectively models public sentiment. The best-performing classifier in the study, LightGBM, achieved 92.4% accuracy. It also shows a strong level of agreement in terms of Cohen's kappa statistic, with a value of 0.847.

Our present study highlights the persisting prevalence of COVID-19 vaccine misinformation on social media. Conspiracy beliefs and other forms of misinformation have been noted as significant predictors of complete vaccine hesitancy (Al-Sanafi and Sallam, 2021). We noticed that DOH and its entities offered no responses to the comments posted by netizens under these infographics.
Meta-analytic evidence suggests the importance of identifying the misinformation most susceptible to correction and of engaging experts in responding to misinformation (Walter et al., 2020). We recommend that DOH form a social media team composed of health care professionals and interdisciplinary communication practitioners whose mandate is to respond to misinformation found in the comments on its posts. This engaging way of correcting false vaccine information may not only help quell vaccine doubts but also improve the image of DOH among citizens.

Our present study also suggests that, instead of drawing aspirational sentiments from people, posting about the vaccination success of wealthier countries only intensifies Filipinos' xenocentric tendencies to rationalize the poor COVID-19 outcomes of the country. We recommend that DOH and other stakeholders involved in vaccine promotion use narratives that are closer to home, such as the diseases curbed by the expanded program on immunization in the Philippines. Relatable narratives and storytelling have been indicated as effective means of combating anti-vaccine conspiracies (Lazić and Žeželj, 2021).

The VERTEBRATE pipeline effectively highlights the content to avoid when posting on social media about COVID-19. Future work could include an automatic labeling procedure for the topic model to further reduce manual effort.
Coronavirus infections - more than just the common cold
From Dengvaxia to Sinovac: Vaccine hesitancy in the Philippines, The Diplomat
Public sentiment analysis and topic modeling regarding COVID-19 vaccines on the Reddit social media platform: A call to action for strengthening vaccine confidence
Topic detection and sentiment analysis in Twitter content related to COVID-19 from Brazil and the USA
Logram: Efficient log parsing using n-gram dictionaries
A biterm topic model for short texts
BERT for Arabic topic modeling: An experimental study on BERTopic technique
BERTopic: Leveraging BERT and c-TF-IDF to create easily interpretable topics
COVID-19 vaccine-related discussion on Twitter: Topic modeling and sentiment analysis
Sentiment classification using document embeddings trained with cosine similarity
a novel hierarchical clustering linkage method
Interrater reliability: The kappa statistic
Psychological determinants of COVID-19 vaccine acceptance among healthcare workers in Kuwait: A cross-sectional study using the 5C and vaccine conspiracy beliefs scales
Evaluating the impact of attempts to correct health misinformation on social media: A meta-analysis
A systematic review of narrative interventions: Lessons for countering anti-vaccination conspiracy theories and misinformation

The authors would like to thank Nathaniel Oco for his invaluable insights, especially for suggesting the inclusion of Cohen's kappa statistic for evaluating the different classifiers and for proofreading the manuscript.