Automatic Detection of COVID-19 Vaccine Misinformation with Graph Link Prediction
Maxwell A. Weinzierl, Sanda M. Harabagiu
2021-08-04

Enormous hope in the efficacy of vaccines recently became a successful reality in the fight against the COVID-19 pandemic. However, vaccine hesitancy, fueled by exposure to social media misinformation about COVID-19 vaccines, became a major hurdle. Therefore, it is essential to automatically detect where misinformation about COVID-19 vaccines is spread on social media and what kind of misinformation is discussed, such that inoculation interventions can be delivered at the right time and in the right place, in addition to interventions designed to address vaccine hesitancy. This paper addresses the first step in tackling hesitancy against COVID-19 vaccines, namely the automatic detection of known misinformation about the vaccines on Twitter, the social media platform with the highest volume of conversations about COVID-19 and its vaccines. We present CoVaxLies, a new dataset of tweets judged relevant to several misinformation targets about COVID-19 vaccines, on which a novel method of detecting misinformation was developed. Our method organizes CoVaxLies in a Misinformation Knowledge Graph, as it casts misinformation detection as a graph link prediction problem. The misinformation detection method detailed in this paper takes advantage of the link scoring functions provided by several knowledge embedding methods. The experimental results demonstrate the superiority of this method when compared with the classification-based methods widely used currently.

Enormous hope in the vaccines that inoculate against the SARS-CoV-2 virus, the causative agent of COVID-19, has been building since 2020. When several vaccines became available, millions signed up and enthusiastically received them. However, too many remain hesitant. Much hesitancy is driven by misinformation about the vaccines that is spread on social media. In fact, recent research [1] has shown that exposure to online misinformation around COVID-19 vaccines affects the intent to vaccinate in order to protect oneself or others. Therefore, it is essential to automatically detect where misinformation about COVID-19 vaccines is spread on social media and what kind of misinformation is discussed, such that inoculation interventions can be delivered at the right time and in the right place, in addition to interventions designed to address vaccine hesitancy. In this paper we address the first step in tackling hesitancy against COVID-19 vaccines, namely the automatic detection of known misinformation about the vaccines on Twitter. As with misinformation about COVID-19 in general, there are several misconceptions that are targeted when spreading misinformation about vaccines. These Misinformation Targets (MisTs) address commonly known misconceptions about the vaccines. As illustrated in Figure 1, which depicts two MisTs, tweets containing misinformation may refer to one or multiple MisTs. For example, Tweet 1 refers only to MisT A, whereas Tweet 2 refers to both MisT A and MisT B.
In order to automatically discover which tweets contain misinformation and to which MisT they refer, we need to design a supervised misinformation discovery method that can be trained on a sufficiently large collection of tweets annotated with misinformation judgements. However, state-of-the-art methods use deep learning techniques, which require a very large training dataset that is expensive to build. Nevertheless, such a dataset can be bootstrapped from a high-quality seed dataset once a method of detecting misinformation can operate on it. In this paper we introduce a tweet dataset annotated with misinformation about COVID-19 vaccines, called CoVaxLies, which was inspired by the recently released COVIDLies dataset [3], as well as a method of discovering misinformation on it which can predict links between tweets and MisTs, similar to the links illustrated in Figure 1. Our framework for discovering misinformation has several novelties. First, it considers that misinformation about COVID-19 vaccines can be represented as a Misinformation Knowledge Graph (MKG), in which nodes are tweets that contain misinformation, while edges correspond to the MisTs shared by tweets. Secondly, we propose a representation of the MKG through knowledge embeddings that can be learned by several possible knowledge embedding models. Thirdly, we use the link scoring functions available from each such knowledge embedding model for predicting a link between any tweet that may contain misinformation and tweets that share a known MisT. Finally, we project the linguistic content of tweets into the embedding space of the MKG to account not only for the misinformation structure, but also for the language that expresses it. The neural architecture that accounts for all these novelties, a system for Twitter Misinformation Detection through Graph Link Prediction (TMD-GLP), has produced, in our experiments, very promising results on the CoVaxLies dataset, especially when compared with a neural method that casts misinformation detection as a classification problem, as most current methods do. The remainder of the paper is organized as follows. Section 2 describes the related work, while Section 3 details the approach used for retrieving tweets relevant to known MisTs regarding COVID-19 vaccines, as well as the expert judgements produced on the relevant data. Section 4 describes our graph-based bootstrapping for misinformation detection and details the neural architecture for Twitter Misinformation Detection through Graph Link Prediction (TMD-GLP). Section 5 presents the experimental results, while Section 6 provides a discussion of the results. Section 7 summarizes the conclusions. There are two schools of thought for detecting misinformation on social media, based on (1) identifying whether a social media posting contains misinformation or not, sometimes qualified as a rumour; or (2) taking into account known misconceptions and discovering those postings that propagate a certain misconception. Most of the work belongs to the first school of thought. Misinformation Detection as Rumour Identification on Social Media: Early work aiming at the identification of social media postings that contain misinformation (without being interested in its misinformation target) focused on finding useful features for detecting misinformation, e.g. special characters, specific keywords and expression types [4], [5], [6], or the characteristics of users involved in spreading the misinformation, e.g.
the number of followers or the users' ages and genders [4], [7], and the news propagation patterns [4], [8]. More recent work embraced several deep learning methods. These deep learning methods were informed by the textual content of the tweet containing the misinformation, capturing its semantics [9], or by encoding the content of the tweets responding to the misinformation [10]. Moreover, a joint recurrent and convolutional network model (CRNN) was reported in [11] to better represent the profile of retweeters. Other deep learning-based methods for the identification of misinformation leveraged the propagation structure in the social network: [12] created a kernel-based method that captures high-order interactions differentiating between forms of misinformation, while [13] designed a tree-structured recursive neural network to learn the embedding of the rumor propagation structure. Another interesting deep learning framework, reported in [14], considered the prevalence of deliberately promoted misinformation campaigns, which can be identified by relying on Generative Adversarial Networks. Most of these misinformation detection methods were influenced by the datasets on which they were developed. Several well-known benchmark datasets for misinformation detection on Twitter were used previously, for example the Twitter15 [15] and Twitter16 [10] datasets. These combined datasets allowed several researchers to develop promising methods trained on the tweets labeled as true or false rumors, while modeling not only the content of the tweets, but also the retweet/reply sequence of users, along with user profiles. For example, in [16] a graph-aware representation of user interactions was proposed for detecting the correlations between the source tweet content and the retweet propagation through a dual co-attention mechanism. The same idea was explored on the same dataset in the dEFEND system [17]. The PHEME dataset [18] consists of Twitter conversation threads associated with 9 different newsworthy events, such as the Ferguson unrest, the shooting at Charlie Hebdo, or Michael Essien contracting Ebola. A conversation thread consists of a tweet making a true or false claim, and a series of replies. There are 6,425 conversation threads in PHEME, while only 1,067 claims from tweets were annotated as true, 638 as false and 697 as unverified. A fraction of the PHEME dataset was used in the RumourEval task [19], having only 325 threads of conversations and 145 claims from tweets labeled as true, 74 as false and 106 as unverified. Misinformation detection methods operating on both PHEME and the RumourEval datasets used a sifted multi-task learning model with a shared structure for misinformation and stance detection [20], Bayesian Deep Learning models [21], or Deep Markov Random Fields [22]. However, none of these benchmark datasets contain any misinformation about COVID-19 or the vaccines used to protect against it. Very recently, a new dataset of tweets containing misinformation about COVID-19, called COVIDLies, was released [3]. Unlike previous datasets, which considered a large set of "popular" claims that were later judged as true, false or unverifiable, COVIDLies was generated by starting with 86 known misconceptions about COVID-19, available from a Wikipedia article dedicated to misinformation about COVID-19. The misconceptions informed the retrieval of 6,761 related tweets from the COVID-19-related tweets identified by Chen et al. [23].
The retrieved tweets were further annotated by researchers from the University of California, Irvine School of Medicine with stance information, reflecting their judgement of whether the author of the tweet agreed with a given misconception, rejected the misconception, or the tweet had no stance. Furthermore, COVIDLies enabled the design of a system that could identify misinformation and also infer its stance through a form of neural entailment, as reported in [3]. We also used the COVIDLies dataset in recent work to automatically infer when misinformation about COVID-19 is rejected or adopted, by automatically discovering the stance of each tweet towards the 86 available misconceptions, which were organized in a taxonomy of misconception themes and concerns. Using a neural architecture that benefits from stacked Graph Attention Networks (GATs) informed by lexico-syntactic, semantic and emotion information, we obtained state-of-the-art results for stance detection on this dataset, as we report in [24]. We were intrigued and inspired by the COVIDLies dataset, and believed that a similar dataset containing misinformation about COVID-19 vaccines would not only complement the COVIDLies data, but also enable the development of novel techniques for misinformation detection. Therefore, in this paper we present the CoVaxLies dataset as well as a novel methodology for automatically detecting misinformation using it. We deliberately decided to generate the CoVaxLies dataset using a methodology similar to the one employed in the creation of the COVIDLies dataset, namely by starting with misconceptions or myths about the vaccines used to immunize against COVID-19, available on a Wikipedia article dedicated to them. However, we cast the misinformation detection problem differently. We still considered the retrieval phase essential for finding relevant tweets for the known vaccine myths, but we explored two different retrieval methods: one using the classic BM25 [25] scoring function, and the other using the same neural scoring method that was used in the creation of COVIDLies. This allowed us to discover that classical scoring functions outperform BERT-informed scoring functions. We then focused on producing high-quality judgements for 7,246 tweets against 17 Misinformation Targets (MisTs) about COVID-19 vaccines of interest. Once the CoVaxLies dataset was generated, we were able to design a novel, simple and elegant method for discovering misinformation, cast as learning to predict links in a Misinformation Knowledge Graph. Although our method for automatically detecting misinformation in a collection of tweets uses deep learning techniques, as do most recent approaches, it is the first method that represents misinformation as a knowledge graph, which can be projected into an embedding space through one of several possible knowledge embedding models. Misinformation about the COVID-19 vaccines has propagated widely, and has been shown to decrease vaccination intent in the UK and USA [1]. The Wikipedia page available at en.wikipedia.org/wiki/COVID-19_misinformation#Vaccines also provides citations to scientific articles that debunk the misconceptions. For example, the BBC is cited on the Wikipedia page mentioned above for identifying and debunking the misconception that "RNA alters a person's DNA when taking the COVID-19 vaccine."
In the cited article [33], the authors note that "The fear that a vaccine will somehow change your DNA is one we've seen aired regularly on social media." They immediately debunk this misinformation: "The BBC asked three independent scientists about this. They said that the coronavirus vaccine would not alter human DNA." We took advantage of these existing efforts of pinpointing the misconceptions related to the COVID-19 vaccines and debunking them. We selected 17 misinformation claims, which we considered as Misinformation Targets, because the propagation of misinformation on social media targets one or several such misconceptions. Table 1 lists the MisTs along with the sources that debunk them; an excerpt is shown below.

Table 1 (excerpt): Misinformation Targets (MisTs) about COVID-19 vaccines and the sources debunking them
5. The immune system overreacts to COVID-19 after taking the COVID-19 vaccine through antibody-dependent enhancement. (Wikipedia / Health Feedback [26])
6. The COVID-19 vaccine contains tissue from aborted fetuses. (Wikipedia / Snopes [26])
7. The COVID-19 vaccine was developed to control the general population either through microchip tracking or nanotransducers in our brains. (Mayo Clinic [28])
8. More people will die as a result of a negative side effect to the COVID-19 vaccine than would actually die from the coronavirus. (Mayo Clinic [28])
9. There are severe side effects of the COVID-19 vaccines, worse than having the virus. (Mayo Clinic [28])
10. The COVID-19 vaccine is not safe because it was rapidly developed and tested. (Mayo Clinic [28])
11. The COVID-19 vaccine can cause COVID-19 because it contains the live virus. (University of Missouri Health Care [29])
...
The COVID-19 vaccine should not be taken by people who are allergic to eggs. (Mayo Clinic [31])
17. Vaccines contain unsafe toxins such as formaldehyde, mercury or aluminum. (PublicHealth.org [32])

An investigation of the Twitter platform revealed that Twitter's tokenization of tweets splits terms like "covid19" and "covid-19" into "covid" and "19". Therefore we selected "covid" as a search term that matches the tokenization not only of mentions of "covid" in tweets, but also of mentions of "covid-19" or "covid19", thus optimizing the recall of relevant tweets when combined with the keywords "coronavirus" and "vaccine". The retrieved tweets were authored in the time frame from December 18th, 2019, to January 4th, 2021. A large fraction of these tweets were duplicates, likely due to spam bots, which required filtering. Locality Sensitive Hashing (LSH) [34] is a well-known method used to remove near-duplicate documents in large collections. We performed LSH, with term trigrams, 100 permutations, and a Jaccard threshold of 50%, on our collection to produce C_T = 753,017 unique tweets. We found that approximately 35% of the unique tweets in C_T referred to an external news article, YouTube video, blog, or other website. Therefore, we also crawled these external links and parsed their contents with Newspaper3k [35] to include their titles with the original tweets. We found that these titles added significantly more context to many of these tweets, and allowed us to identify many more instances of misinformation during the human judgement phase. For example, the tweet "Who still want to be vaccinated? URL", whose URL points to the article "Vaccine causing bells palsy: Pfizer vaccine side effects, covid", would be impossible to identify as pertaining to MisT 4 without the article's context. In order to identify in C_T those tweets which potentially contain language relevant to the MisTs of interest, listed in Table 1, we relied on two information retrieval systems: (1) a retrieval system using the BM25 [25] scoring function; and (2) a retrieval system using BERTScore [36] with Domain Adaptation (DA), identical to the one used in [3].
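To make the first retrieval system concrete, the following is a minimal sketch of scoring tweets against a MisT used as a query, with the rank_bm25 package standing in for the Lucene index we actually used; the tokenizer and the sample tweets are illustrative assumptions.

```python
# Minimal sketch of MisT-to-tweet retrieval with BM25; rank_bm25 is a
# stand-in for the Lucene index used in the actual experiments.
from rank_bm25 import BM25Okapi

def tokenize(text: str) -> list[str]:
    # Crude tokenization that, like Twitter's, splits "covid-19"
    # into "covid" and "19".
    return text.lower().replace("-", " ").split()

# Toy deduplicated collection C_T (illustrative).
tweets = [
    "The covid vaccine will alter your DNA, do not take it",
    "Got my coronavirus vaccine today, feeling great",
    "Natural immunity beats any covid 19 vaccine",
]
bm25 = BM25Okapi([tokenize(t) for t in tweets])

def retrieve(mist_text: str, k: int = 200) -> list[str]:
    # Return the (at most) k top-scored tweets for a MisT used as a query.
    return bm25.get_top_n(tokenize(mist_text), tweets, n=k)

# Each MisT is issued twice, once verbatim and once with "COVID-19"
# replaced by "coronavirus"; the two top-k lists are then merged.
def retrieve_for_mist(mist_text: str, k: int = 200) -> set[str]:
    original = retrieve(mist_text, k)
    modified = retrieve(mist_text.replace("COVID-19", "coronavirus"), k)
    return set(original) | set(modified)
```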
Both these retrieval systems operated on an index of C_T, obtained by using Lucene [37]. Each retrieval system produced a ranked list of tweets when queried with the textual content of any of the MisTs of interest, listed in Table 1. At most the 200 top-scored tweets were selected for each of these queries. We selected only the 200 best-scored tweets because (1) the same number of tweets was considered in the most similar prior work [3]; and (2) it was a number of tweets that did not overwhelm our human judges. Additional tweets were also retrieved when replacing the word "COVID-19" with the word "coronavirus" in each query. From the top-ranked tweets deemed relevant to the modified query, at most 200 tweets were also considered. This approach produced a maximum of 400 tweets per MisT, but the lists of retrieved tweets returned by the retrieval systems often contained fewer than 400 tweets. For example, when MisT 10 was used as a query, we were able to retrieve only 213 distinct tweets, whereas when MisT 11 was used as a query, we retrieved 282 distinct relevant tweets. Some of the tweets retrieved when the query was one of the MisTs listed in Table 1 were also retrieved when the modified query was used. In the end, we retrieved a total of 4,153 tweets that were deemed relevant to at least one MisT of interest. When using the retrieval system which relies on BERTScore (DA), we were able to benefit from its domain-adaptive pre-training on 97 million COVID-19 tweets. In this way, semantic relevancy to the domain of COVID-19 was preferred over keyword matching when scoring the tweets against the MisTs of interest. As with the first retrieval system, at most the 200 top-scored tweets were selected for the original queries, and at most another 200 tweets when the modified queries, replacing the word "COVID-19" with the word "coronavirus", were used. In this way, the second retrieval system enabled us to collect 4,689 tweets deemed relevant to at least one MisT of interest. Obviously, the second retrieval system enabled us to collect a larger set of tweets deemed relevant to the MisTs of interest than the first retrieval system (4,689 tweets vs. 4,153 tweets), which we attribute to significant sensitivity in BERTScore when replacing the word "COVID-19" with "coronavirus" in the query. As in the case of the retrieval system using the BM25 scoring function, some of the tweets retrieved by the system using BERTScore (DA) when the query was one of the MisTs listed in Table 1 were also retrieved when the modified query was used. When judging the relevance of the retrieved tweets, we did not ask the experts to also annotate the stance of each tweet towards the MisT relevant to it. This decision was made because stance detection was regarded as a separate task, as in [3], [24], which requires each tweet to first be found relevant to the MisT, before inferring its stance. As illustrated in Figure 2, Tweet A is relevant to MisT 1, agreeing with it, while Tweet B is relevant to MisT 6, disagreeing with it. To evaluate the quality of judgements, we randomly selected a subset of 1,000 tweets (along with the MisT against which they had been judged relevant or non-relevant), which were judged by at least three different language experts. Percent agreement between annotators was 92%. Fleiss' Kappa score was 0.83, which indicates strong agreement between annotators (0.8-0.9) [38]. There were high levels of agreement between annotators across the MisTs. Disagreements in annotations were discussed, but largely came down to interpretation.
For example, the tweet "we dont need a covid vaccine" was interpreted by one expert as relevant to the MisT 3 : "Natural COVID-19 immunity is better than immunity derived from a COVID-19 vaccine", while another judge found the tweet non-relevant to the same MisT. For the first annotator, the statement that no vaccine is necessary for COVID-19 implied that the author of the tweet entailed that that natural immunity from catching COVID-19 is better than any vaccineinduced immunity. The second judge took a more strict inter- When considering option 1, the link scoring function is computed between a new, unconnected tweet t y and each of the n x tweets t x i of each FCG(MisT x ). Then, the Condition ALL must be satisfied, where Condition ALL stipulates that if more than N x number of times the value returned by the link prediction function is superior to a threshold T x , the link is predicted between t y and FCG(MisT x ). Clearly, the number of tweets n x , and N x and T x are dependent on each MisT x , varying across MisTs. We assign N x and T x automatically by maximizing misinformation detection performance of the system for MisT x on the develop-ment collection, which is further detailed in Section 5.2. However, the number of times the link scoring function f needs to be computed when considering this option is equal to the number of tweets that are already connected in the MKG at the time of attempting to link a new tweet. This number easily grows in the thousands -and thus it renders this options computationally inefficient. The option 2 presents the advantage that the link prediction function f is evaluated only once for each MisT encoded in the MKG, and a link is predicted when the value returned by f is superior to a pre-defined threshold for each MisT, T x . Moreover, in both options, different link scoring functions, available from different knowledge embedding models, may predict different graph links in the MKG, and thus discover differently the misinformation in a collection of tweets. Our misinformation detection framework using graph link prediction allows for rapid experimentation with multiple knowledge embedding models to explore their performance. Each of them enables the learning of misinformation knowledge embeddings, in the form of knowledge embeddings of the tweets that may contain misinformation, as well as knowledge embeddings for each MisT of interest. Several knowledge embedding models have been widely used in the past decade, e.g. TransE [39] , TransD [40] . In addition to TransE and TransD, several other knowledge graph embedding models have shown promise in recent years, e.g. TransMS [41] and TuckER [42] . We have explored how all these four different knowledge embedding models perform in our framework for misinformation detection as graph link prediction. We briefly describe them before discussing how they were used in a novel neural architecture for twitter misinforma- it is possible to measure the plausibility of any potential link labeled as MisT j between any pair of tweets t i and t k using the geometric structure of the embedding space: where || · || L1 is the L1 norm. The plausibility of a relation be- i.e. (te d ). TransE has the advantage that it is extremely simple to utilize, but interactions between node embeddings and edge embeddings are limited. 
TransD extends TransE by learning two knowledge embeddings for each node and each edge of a knowledge graph, such that the first embedding represents the "knowledge meaning" of the node or relation, while the second embedding is a projection vector (denoted with superscript p), used to construct a dynamic mapping matrix for each node/link pair. Thus, for each tweet t_i from the MKG, TransD learns the pair of embeddings (te_i, te_i^p) and for each link labeled as MisT j, it learns the pair of embeddings (me_j, me_j^p). The pairs of knowledge embeddings for the tweets and for the links are learned by using a scoring function that measures the plausibility of a link labeled MisT j between a tweet t_i and a tweet t_k, defined as:

f(t_i, MisT_j, t_k) = -|| (me_j^p (te_i^p)^T + I) te_i + me_j - (me_j^p (te_k^p)^T + I) te_k ||

where I is the identity matrix. TransD improves upon TransE by modeling the interactions between tweets and the links that span them through their respective knowledge embeddings, such that tweet embeddings change depending on which MisT is being considered for labeling a link. TransMS recognizes the importance of capturing non-linear interactions between nodes and edges in a knowledge graph, and therefore expands on the approach of TransD. TransMS introduces non-linear interactions on both the node and edge knowledge embeddings before the additive translation of TransE is performed, and also adds an edge-specific threshold parameter α_j. When considering the MKG, the knowledge embeddings for the tweets and for the links are learned by using a scoring function that measures the plausibility of a link labeled MisT j between a tweet t_i and a tweet t_k, defined as:

f(t_i, MisT_j, t_k) = -|| -tanh(te_k ⊗ me_j) ⊗ te_i + me_j + α_j (te_i ⊗ te_k) - tanh(te_i ⊗ me_j) ⊗ te_k ||

where tanh(x) is the non-linear hyperbolic tangent function and α_j is a real-valued parameter dependent on each MisT. The operator ⊗ represents the Hadamard product. TransMS improves upon TransD by allowing both the nodes to influence the edge embeddings and the edges to influence the node embeddings. TransMS also introduces non-linearities in these interactions, and allows edge type-specific thresholds α_j to be learned. TuckER learns, in addition to the knowledge embeddings te_i for tweets and me_j for MisT-labeled links, a shared core tensor W; the knowledge embeddings are learned by using a scoring function that measures the plausibility of a link labeled MisT j between a tweet t_i and a tweet t_k, defined as:

f(t_i, MisT_j, t_k) = W ×_1 te_i ×_2 me_j ×_3 te_k

where ×_n indicates the tensor product along the n-th mode. TuckER approaches the problem of learning knowledge embeddings from a multiplicative perspective, with an additional component in the W tensor which allows additional shared interactions to be learned between the nodes and edges of the knowledge graph through tensor products. In addition to the knowledge embedding models, we also considered a K-Nearest Neighbors (KNN) baseline approach. The KNN approach entirely ignores the edge information available in the FCG(MisT j) of each MisT j. Instead, this approach favors tweets that are closest in their representation in the embedding space. Thus, when scoring a link between an unconnected tweet t_i, represented by a knowledge embedding te_i, and any tweet t_k from the FCG(MisT j), represented as te_k, it computes:

f(t_i, MisT_j, t_k) = -|| te_i - te_k ||_{L2}

where ||·||_{L2} is the L2 norm. To predict the link to any MisT of interest, e.g. MisT x, Condition_ALL, defined in Section 4.1, must be met. This condition is applied because all nodes already assigned to the FCGs of the MKG need to be considered. To encode the language of the tweets and of the MisTs, we rely on COVID-Twitter-BERT-v2 [44]. COVID-Twitter-BERT-v2 is a pre-trained domain-specific language model, which means that it started with neural weights equal to those of BERT, but was additionally pre-trained with the masked language modeling task [43] on 97 million COVID-19 tweets.
This process of further pre-training has been shown to improve performance on downstream tasks in various scientific [45], biomedical [46], and social media [47] domains. COVID-Twitter-BERT-v2 therefore produces contextualized embeddings mc_j^1, mc_j^2, ..., mc_j^{l+2} for the word-piece tokens of the MisT m_j along with its [CLS]_j and [SEP]_j tokens. In this way, we encode the language describing the MisT using a contextualized embedding mr_j ∈ R^1024, where 1024 is the contextual embedding size of COVID-Twitter-BERT-v2; mr_j is the first contextualized embedding mc_j^1, representing the initial [CLS]_j token embedding. Similarly, the language used in the tweets t_i and t_k is processed through Word-Piece tokenization and then represented by the contextual embeddings tr_i and tr_k after being processed through COVID-Twitter-BERT-v2. It is important to note that the scoring function f of any of the knowledge embedding models that we considered, illustrated in Figure 3, operates on knowledge embeddings derived from these contextual representations. To generate negative examples for training, we corrupt each labeled link (t_i, m_j, t_k) by sampling a tweet t_s uniformly from V_T, where V_T represents all training tweets, to replace t_k. We ensure (t_i, m_j, t_s) ∉ E by re-sampling whenever we sample a link in E. This process guarantees that corrupted triplets are not real links. We utilize these negative links for learning, with the goal of having the TMD-GLP system score labeled links higher than corrupted links. Moreover, we optimized the following margin loss to train TMD-GLP when performing graph link prediction:

L = Σ_{(t_i, m_j, t_k) ∈ E} max(γ - f(t_i, m_j, t_k) + f(t_i, m_j, t_s), 0)

where γ is a training score margin representing the required difference between the scores of correct links and the scores of corrupted links. The loss L is minimized with the ADAM [48] optimizer, a variant of gradient descent. Since most existing systems that tackle misinformation detection are binary classifiers, we also generated a baseline binary classification system that can operate on the same data as the TMD-GLP system. Therefore we designed a simple neural architecture, following prior work [49], which directly classifies whether a tweet t_i evokes the misinformation of a MisT m_j. The text of m_j and of t_i are jointly processed by COVID-Twitter-BERT-v2, producing the representation rc_{i,j}, which is the first contextualized embedding rc_{i,j}^1, representing the initial [CLS] token embedding. This embedding is provided to a fully-connected layer with a softmax activation function which outputs the probability distribution P(Misinformation | t_i, m_j). As Figure 4 shows, misinformation is recognized when this probability is larger than a predefined threshold. In our experiments, the value of the threshold T was determined on the development data to be 0.9995. The TMD-BC-BERT system is trained to classify tweets that contain misinformation concerning a given MisT, using the same training data that was used for training the TMD-GLP system. In addition, the TMD-BC-BERT system was trained end-to-end using the cross-entropy loss function:

L_CE = -Σ_{(t_i, m_j)} [ y_{i,j} log P(Misinformation | t_i, m_j) + (1 - y_{i,j}) log(1 - P(Misinformation | t_i, m_j)) ]

where y_{i,j} = 1 when tweet t_i was judged to evoke MisT m_j and y_{i,j} = 0 otherwise. We also compare the TMD-GLP system against a system implemented to use Long Short-Term Memory (LSTM) [50] cells instead of COVID-Twitter-BERT-v2 in the architecture illustrated in Figure 4. Specifically, we use 2 layers of Bi-LSTMs [51] of size 1024. We call this baseline system TMD-BC-LSTM. Because we believe that it is critical to detect misinformation about COVID-19 vaccines only in tweets that are truly relevant to the MisTs of interest, we first conducted experiments to evaluate the quality of retrieval, and then we separately evaluated the quality of misinformation detection.
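The negative sampling and margin loss described above can be illustrated with a minimal PyTorch sketch; the scorer f and all names are illustrative, not our exact implementation.

```python
import random
import torch

def corrupt(link, train_tweets, E):
    # Replace t_k with a tweet t_s sampled uniformly from V_T,
    # re-sampling whenever the corrupted triplet is a real link in E.
    t_i, m_j, _ = link
    while True:
        t_s = random.choice(train_tweets)
        if (t_i, m_j, t_s) not in E:
            return (t_i, m_j, t_s)

def margin_loss(f, pos_batch, neg_batch, gamma: float = 1.0) -> torch.Tensor:
    # L = sum over links of max(0, gamma - f(positive) + f(corrupted)),
    # pushing true links to score at least gamma above corrupted ones.
    pos = torch.stack([f(*link) for link in pos_batch])
    neg = torch.stack([f(*link) for link in neg_batch])
    return torch.clamp(gamma - pos + neg, min=0).sum()

# One training step with ADAM (optimizer constructed elsewhere):
#   loss = margin_loss(f, batch, [corrupt(l, train_tweets, E) for l in batch])
#   loss.backward(); optimizer.step(); optimizer.zero_grad()
```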
The evaluation of the retrieval of relevant tweets for each MisT from our CoVaxLies dataset was performed by considering the two retrieval systems: (1) one using the BM25 scoring function, and (2) one using BERTScore (DA). The methods used by these systems were described in Section 3.3. In order to conduct the retrieval evaluations, each MisT was used to formulate a query, processed by both retrieval systems, operating on the 753,017 unique tweets from C_T that we collected from the Twitter platform. Human judgements of the relevance of the results of both retrieval systems were performed on the unique tweets from the T_R collection first mentioned in Section 3.2. T_R contains 7,246 tweets deemed relevant to our MisTs of interest by at least one of the retrieval systems. Each human judgement established the relevance or non-relevance of each tweet against the MisTs of interest. Three natural language experts participated in the judgements of relevance. Their inter-annotator agreement was discussed in Section 3.2. Table 2 lists the judgement results on the T_R collection. We see that the human judges found a similar number of MisT-Relevant tweets among the results returned by the retrieval system using the BM25 scoring function and among those returned by the retrieval system using BERTScore (DA) (1,979 vs. 1,475), while the number of retrieved tweets judged non-relevant for the system using BERTScore (DA) is significantly larger than the corresponding number for the system using the BM25 scoring function (3,214 vs. 2,174). This indicates, as shown in Table 2, that the percentage of tweets deemed relevant by the system using the BM25 scoring function and also judged relevant by human experts is much higher than the same percentage for the system using BERTScore. Therefore, retrieving tweets with the system that uses the BM25 scoring function is far better than using a retrieval system informed by BERTScore. Figure 5 provides additional per-MisT retrieval details. The evaluation of detecting misinformation about COVID-19 vaccines was performed by casting misinformation detection as multi-label binary classification. This means that multiple MisTs could be predicted for every tweet t_i from the test collection of CoVaxLies, some correct and others not. Because the number of paired MisTs varies across tweets in the testing collection, this learning task could not be cast as a multi-class classification problem, but rather as a multi-label binary classification problem. System performance was evaluated using micro-averaged Precision (P), Recall (R) and F_1 score. The evaluation results are provided in Table 3, along with the results of the TMD-BC-BERT and TMD-BC-LSTM systems. The bolded numbers represent the best results obtained across all systems. As shown in Table 3, the TransMS-based configurations of the TMD-GLP system obtained the best results, capturing non-linear interactions between tweets and MisTs while also best modeling symmetric relationships, which we often see in our Misinformation Knowledge Graph (MKG). This may also explain why the TransMS-All configuration of the TMD-GLP system generated the best overall Precision score. Additionally, two baselines for detecting misinformation about COVID-19 vaccines were considered, both assuming that relevance retrieval is sufficient. These systems follow prior work [3] and cast misinformation detection as misinformation retrieval, by considering each tweet as a query and returning the most relevant MisTs.
We acknowledge that this retrieval framework is atypical, because the collection of MisTs is several orders of magnitude smaller than the number of tweets from the test collection of CoVaxLies, whereas retrieval systems typically work on an index which is much larger than the number of queries. But since this is the framework for retrieval that was considered in [3], we adopted the same framework for these baselines. We evaluated on the test collection of CoVaxLies two such baselines: (1) the BM25 scoring function, used for Binary Classification (BM25-BC); and (2) the BERTScore (DA) scoring function, used for Binary Classification (BERTScore(DA)-BC). In our experiments, the BM25-BC baseline returned for each tweet the most relevant MisTs, and we considered that MisTs with a relevance score above a pre-defined threshold T are predicted as linked to the tweet. The BERTScore(DA)-BC baseline compares the text of each tweet against the text of each MisT using the BERTScore (DA) relevance model; BERTScore (DA) assigns a relevance score to each MisT for each tweet, and MisTs with a relevance score above a pre-defined threshold T are predicted as relevant. The threshold T for both systems is selected by maximizing the F_1 score on the development set. The BM25-BC baseline produced a Micro F_1 score of 51.2, setting a baseline expectation of performance. This performance can be attributed to the large amount of shared terminology between MisTs and Relevant tweets, which benefits the BM25 scoring function. The BERTScore(DA)-BC baseline produced a Micro F_1 score of 29.0, which was much lower than expected when compared with results published in prior work [3]. The results of the evaluations listed in Table 3 are interesting, as they generally indicate that casting misinformation detection as graph link prediction, informed by knowledge embedding models such as TransE, TransD, and TransMS, can generate promising results, superior to the results obtained when considering misinformation detection as a multi-label binary classification problem, as most current systems do. We can also notice that TransE, a much simpler knowledge embedding model, still produces competitive results. Collecting tweets for CoVaxLies and judging their relevance against each MisT revealed a large discrepancy between the performance of the retrieval systems using the BM25 scoring function and BERTScore. Prior work [3] on the COVIDLies dataset found that retrieval using BERTScore performed better than retrieval using the BM25 scoring function for misinformation detection, but the entire COVIDLies dataset was collected using only BERTScore. Their collection methodology resulted in a judged Relevant tweet percentage of 14.98%, meaning only 14.98% of their discovered tweets were judged Relevant by annotators. We found a comparably low Relevant tweet percentage of 31.5% on our collection when only retrieving tweets using BERTScore (DA), but simultaneously identified that the retrieval system using the BM25 scoring function generated a Relevant percentage of 47.7%, as presented in Table 2. This large difference indicates that there are many instances of tweets containing misinformation which are not discovered by the retrieval system using BERTScore, and that the retrieval system using the BM25 scoring function is actually better at discovering more Relevant tweets for each MisT.
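The development-set threshold selection used by both baselines can be illustrated with a minimal sketch, assuming scikit-learn; the score and label matrices are illustrative.

```python
import numpy as np
from sklearn.metrics import f1_score

def tune_threshold(dev_scores: np.ndarray, dev_labels: np.ndarray):
    """Pick the threshold T maximizing micro-averaged F1 on the development
    set; both arrays have shape (num_tweets, num_mists)."""
    best_T, best_f1 = None, -1.0
    for T in np.unique(dev_scores):
        preds = (dev_scores > T).astype(int)  # multi-label link decisions
        f1 = f1_score(dev_labels, preds, average="micro")
        if f1 > best_f1:
            best_T, best_f1 = T, f1
    return best_T, best_f1
```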
We also analyzed the differences in the total number of retrieved tweets: the retrieval system using the BM25 scoring function returned a total of 4,153 tweets, while the retrieval system using BERTScore (DA) returned 4,689 tweets. This difference arises from the sensitivity of each system to changes in the query: we found the retrieval system using BERTScore to be more sensitive to replacing "COVID-19" with "coronavirus" when querying for each MisT. The retrieval system using BERTScore produced more disjoint lists of retrieved tweets for the original and the modified query, while the lists produced by the retrieval system using the BM25 scoring function overlapped much more. However, most of the additional non-overlapping tweets discovered by the BERTScore-informed system were judged Not Relevant. Advantages to using BERTScore remain, as the Relevant tweets it finds are characterized by less term overlap with the language used to describe the MisTs, and thus it emphasizes semantic relevancy. However, ignoring classical scoring functions such as BM25 leads to a heavily biased dataset of tweets that potentially contain misinformation, which may very well lead to misinformation topic shift, such that tweets are deemed relevant when in fact they share more semantics with other topics than the one we try to collect, namely misinformation about COVID-19 vaccines. As shown in Table 3, the performance on misinformation detection of the systems using the BM25 scoring function and BERTScore (DA) on our CoVaxLies dataset was significantly different. The BM25-BC system scored much higher on all metrics than BERTScore(DA)-BC, the opposite of the conclusion drawn in prior work [3] on the COVIDLies dataset. We hypothesized that this difference was due to the data collection methodology used to create the COVIDLies dataset: the only retrieval system used to find tweets for each MisT was BERTScore, therefore the Relevant annotations would be biased towards a BERTScore-based model. This would naturally lead the misinformation detection system using the BM25 scoring function to perform worse than a BERTScore-based model in the evaluation of misinformation detection. To test this hypothesis, we modified our data collection to only include annotated tweets which were discovered by BERTScore (DA) during the retrieval of relevant tweets for each MisT. We re-ran both the BERTScore(DA)-BC and BM25-BC misinformation detection systems on this modified collection of CoVaxLies and report the results in Table 4.

6. Discussion

The experimental results, provided in Tables 3 and 4, can be further understood by examining tweets that proved difficult, such as those listed in Table 7.

Table 7 (excerpt): Tweets in which MisTs are not correctly detected by BM25-BC, TMD-BC-BERT, and TMD-GLP with the TransMS-Prototypical configuration
- MisT 17 (Vaccines contain unsafe toxins such as formaldehyde, mercury or aluminum.): "Covid vaccine: think 5x over currently. You cannot detox from this. You will be inserted with nano technology. Nano lipids. You'll become a human antenna with the aluminum encased in the nano lipids. They are basically impossible to remove once they are in."
- MisT 13 (The COVID-19 vaccine can increase risk for other illnesses.): "The covid vaccine is a one way ticket to cancer and dementia. Remember that. Don't vaccinate."

In order to assess the portability of our approach of detecting misinformation as graph link prediction to other datasets, we have considered the COVIDLies dataset [3].
The COVIDLies collection consists of 5,748 tweets annotated with stance towards COVID-19 pandemic misinformation targets, with stance values of "Agree", "Disagree", and "No Stance". However, the COVIDLies annotators made no distinction between tweets which expressed a neutral "No Stance" towards a MisT and tweets which were not relevant to a MisT, which were also annotated as "No Stance". To resolve this problem, we categorized tweets labeled as "Agree" or "Disagree" with a MisT as "Relevant" to the MisT, while all tweets with "No Stance" towards a MisT were considered "Not Relevant" to that MisT. This re-annotation of COVIDLies in terms of relevance of tweets towards various MisTs led to having only 17% of the unique tweets from COVIDLies relevant to one or more MisTs. In comparison, in the CoVaxLies collection, 57% of the unique tweets are relevant to one or more MisTs. Porting the system for Twitter Misinformation Detection through Graph Link Prediction (TMD-GLP) to the COVIDLies collection also requires training, development and test sets on this collection. However, COVIDLies has no official training or testing collections. Therefore, we split COVIDLies into 5 evenly distributed folds, assigning three folds for training, one fold for development and the fifth fold for testing. This process produced the evaluation results reported in Table 5. We also ported to COVIDLies the system for Twitter Misinformation Detection as Binary Classification with BERT (TMD-BC-BERT), evaluating it with the same 5-fold cross validation as the TMD-GLP+TransMS-Prototypical system. Table 5 shows that even on COVIDLies, the TMD-GLP+TransMS-Prototypical system performs best, obtaining an F_1 score of 40.7. However, a major performance drop is observed between the operation of this system on the CoVaxLies dataset and its operation on the COVIDLies dataset, namely a drop from an F_1 score of 84.3 to an F_1 score of 40.7. This drop can be explained when the dataset statistics are considered. Because the COVIDLies collection has a smaller percentage of relevant tweets than the CoVaxLies collection, it leads to the discovery of a Misinformation Knowledge Graph containing smaller FCGs for each MisT than those discovered when using the CoVaxLies dataset. This entails that it becomes harder to predict a correct graph link when the FCGs are smaller. We can notice that both the TMD-BC-BERT and TMD-GLP systems suffer from this major reduction in relevant tweets in the COVIDLies collection, as the results listed in Table 5 show that both systems generate much better Recall results than Precision results, which are quite low. This low Precision is likely due to the small number of relevant tweets for each MisT in COVIDLies, with many MisTs having only one or two relevant tweets from which to learn to predict true positive links, leading to many more false positive links. Nevertheless, the evaluation results listed in Table 5 showcase the portability of the TMD-GLP method on a second dataset, while also highlighting the limitations of the COVIDLies dataset. Detailed performance of the TMD-GLP system using the TransMS-Prototypical configuration is provided for each MisT in Table 6. We also include the size n_x of the FCG for each MisT.
The TMD-GLP+TransMS-Prototypical misinformation detection system performed best on MisT 2: "The COVID-19 vaccine causes infertility or miscarriages in women."; MisT 4: "The COVID-19 vaccine causes Bell's palsy."; and MisT 6: "The COVID-19 vaccine contains tissue from aborted fetuses.". These MisTs made easily identifiable claims, such as "causing infertility", "causing Bell's palsy", and "containing aborted fetus tissue". Statements supporting, refuting, or reporting on these claims were very easy to detect. For example, the following is a tweet referring to MisT 2: "@JoPatWar @allen40 allen @Telegraph The Pfizer CEO said that the coro…" Table 7 lists tweets which were judged to refer to some MisT of interest, but which the misinformation detection systems that we evaluated failed to identify. The first tweet listed in Table 7 was judged to refer to MisT 1 (also listed in the Table). The TMD-BC-BERT system as well as the TMD-GLP system with the TransMS-Prototypical configuration were able to detect the connection to this MisT, while the baseline using the BM25 scoring function failed to accomplish this task. The lack of exact term overlap between the tweet text and the MisT text explains why a term-based Lucene index along with a BM25 scoring function would be unlikely to discover this misinformation. The second tweet from Table 7 was judged to refer to MisT 17. The TMD-GLP system with the TransMS-Prototypical configuration was the only system able to identify the reference to this MisT. The TMD-GLP system compared the knowledge embedding of the tweet with the knowledge embeddings obtained for other tweets connected to the FCG informed by the same MisT, such as: "@FayCortez @annaedney @LauermanJohn @business Because the people who created the deadly, depopulating Covid vaccine are part of the same contingent of Planners who have been spraying you with aluminum, barium, and strontium via chemtrails to intentionally increase your chances of getting Alzheimer's." The TMD-GLP system was able to identify that the second tweet listed in Table 7 likely refers to the same MisT as this tweet, and therefore correctly detected the misinformation. The third tweet listed in Table 7 was judged to refer to MisT 13. None of the systems that we evaluated were able to identify that this tweet referred to MisT 13. There is little term overlap and there are few contextual clues indicating that this tweet is related to MisT 13, which explains why the TMD-BC-BERT system as well as the system using the BM25 scoring function failed. To determine why the TMD-GLP system failed, we can look at the FCG informed by MisT 13: Table 6 states that there were only two tweets in the FCG for MisT 13, and upon further inspection we see that these two tweets only mention the COVID-19 vaccine increasing the risk of "cancer", "heart disease", and "HIV". The third tweet discusses "cancer", but it also claims the COVID-19 vaccine is "a one way ticket to" the illness "dementia". The fact that "cancer" and "dementia" were not recognized as instances of "illness", because no clinical language processing was applied to the content of the tweets or to the description of MisT 13, explains why the third tweet was not linked to MisT 13 by any of the systems that we evaluated. There are several important limitations to our study. The first limitation originates in the fact that we aim to discover tweets that discuss or refer only to known misinformation targets.
When additional misinformation targets become known, new relevant tweets must be retrieved and judged as to whether they discuss information relevant to the newly targeted misinformation or not, enabling the creation of new training, development and testing data for identifying additional tweets that discuss the new targeted misinformation. The recognition of new, yet unknown misinformation is not within the scope of this study. Even if this may seem a major limitation, our approach is a significant departure from most previous methods, discussed in Section 2, which detect only whether there is some misinformation (or rumor) in a tweet, but fail to recognize what kind of misinformation is discussed or referred to. An additional limitation can be found in the use of specific search keywords, such as "covid" or "coronavirus". More search terms, such as "corona", "vax", or "jab", may be used more often by the general public, and should be considered in future studies. These terms might reveal new misinformation, or may be used more often by users who tweet about known misinformation. Another limitation of the study stems from the fact that we decided not to consider the stance of tweets towards the misinformation targets. We only recognize whether the information shared by a tweet is relevant to a MisT, but do not recognize whether it agrees or disagrees with the predication of the MisT, or whether it has no stance at all. We believe that stance detection is a separate task, which can be performed only on the tweets that are known to be relevant to a MisT. Previous work [3], [24] showed that identifying the tweet stance towards a MisT benefits from knowing that the tweet discusses information relevant to the MisT. In future work we plan to address the problem of recognizing tweets that are relevant to new misinformation targets, casting the problem as a zero-shot learning problem. It has been estimated that the COVID-19 vaccines will need to be accepted by at least 55% of the population to provide herd immunity, with estimates reaching as high as 85% depending on country and infection rate [52]. Reaching these required vaccination levels is hindered by vaccine hesitancy across the world [53], which is often fueled by misinformation spreading on social media. Therefore, it is important to know which misinformation targets are used, in which tweets, and by which authors, such that people can be inoculated against COVID-19 vaccine misinformation before they are exposed to it. Moreover, because Twitter is the social media platform where most of the conversations about COVID-19 vaccines take place, it is essential to automatically discover tweets that spread misinformation, such that vaccine hesitancy interventions can be delivered to those participating in misinformed conversations. In this paper we present CoVaxLies, a corpus of 7,246 tweets judged by language experts to refer to 17 different targets of misinformation about COVID-19 vaccines. The CoVaxLies dataset was created using a methodology similar to the one used in the generation of the COVIDLies [3] dataset of tweets, which annotated misinformation about COVID-19. Therefore, the two datasets can be used together for learning to identify misinformation about COVID-19 and COVID-19 vaccines. Both COVIDLies and CoVaxLies are evolving, as additional targets of misinformation are added and tweets relevant to them are retrieved and judged by experts.
This paper has also explored the need to retrieve tweets relevant to misinformation targets using a combination of retrieval systems, concluding that in this way a larger set of truly relevant tweets is discovered and can be included in these datasets. In this paper, CoVaxLies was used to train and evaluate a novel, simple and elegant method for discovering misinformation on Twitter, relying on graph link prediction. This method is enabled by (a) the organization of a Misinformation Knowledge Graph and (b) the availability of link scoring functions from several knowledge embedding models. Our experiments have shown that superior results can be obtained when discovering misinformation using graph link prediction, compared with neural classification-based methods, yielding an increase of up to 10% in F_1 score. The method presented in this paper does not consider conversation threads, as many recent misinformation detection methods do, e.g. [16], [14]. However, CoVaxLies will be extended to account for entire conversation threads on Twitter, allowing us to extend the methodology of misinformation detection presented in this paper, and to evaluate the impact conversations have on the quality of misinformation detection when misinformation targets are also considered, an important aspect that is currently ignored. Our future work will also consider the discovery of adoption or rejection of misinformation about COVID-19 vaccines. This will be achieved by relying on the automatic inference of the stance of tweets relative to the misinformation targets. It will allow us to expand our work reported in [24], where we developed a neural architecture that combined the role of semantic, lexical and affect characteristics of language with the taxonomy of concerns raised by misinformation targets. Along with automatically detecting misinformation about COVID-19 vaccines, the recognition of the adoption or rejection of that misinformation will be a stepping stone towards developing misinformation inoculation interventions on social media platforms in the era of COVID-19. There are no conflicts of interest.
References

Measuring the impact of COVID-19 vaccine misinformation on vaccination intent in the UK and USA
What social media is saying about the COVID-19 vaccine rollout
COVIDLies: Detecting COVID-19 misinformation on social media
Proceedings of the 20th International Conference on World Wide Web, WWW '11
Real-time rumor debunking on Twitter
Enquiring minds: Early detection of rumors in social media from enquiry posts, WWW '15, International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva
Automatic detection of rumor on Sina Weibo
Prominent features of rumor propagation in online social media
A multi-semantics classification method based on deep learning for incredible messages on social media
Detecting rumors from microblogs with recurrent neural networks, in: Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI'16
Early detection of fake news on social media through propagation path classification with recurrent and convolutional networks
Detect rumors in microblog posts using propagation structure via kernel learning
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics
Detect rumors on Twitter by promoting information campaigns with generative adversarial learning
Real-time rumor debunking on Twitter
GCAN: Graph-aware co-attention networks for explainable fake news detection on social media
Fake news detection on social media: A data mining perspective
Analysing how people orient to and spread rumours in social media by looking at conversational threads
SemEval-2017 Task 8: RumourEval: Determining rumour veracity and support for rumours
Different absorption from the same sharing: Sifted multi-task learning for fake news detection
Reply-aided detection of misinformation via Bayesian deep learning
Fake news detection using deep Markov random fields
#COVID-19: A public coronavirus Twitter dataset tracking social media discourse about the pandemic (preprint), JMIR Public Health and Surveillance 6
Misinformation adoption or rejection in the era of COVID-19
Okapi at TREC-5
COVID-19 misinformation
The real facts about common COVID-19 vaccine myths
COVID-19 vaccine myths debunked
The COVID-19 vaccine: Myths vs. facts
Debunking the myths about the COVID-19 vaccine
COVID-19 vaccine myths debunked
Vaccine rumours debunked: Microchips, 'altered DNA' and more
Google News personalization: Scalable online collaborative filtering
Newspaper3k: Article scraping & curation
BERTScore: Evaluating text generation with BERT
Interrater reliability: the kappa statistic
Translating embeddings for modeling multi-relational data
Knowledge graph embedding via dynamic mapping matrix
TransMS: Knowledge graph embedding for complex relations by multidirectional semantics
Tensor factorization for knowledge graph completion
BERT: Pre-training of deep bidirectional transformers for language understanding
COVID-Twitter-BERT: A natural language processing model to analyse COVID-19 content on Twitter
SciBERT: A pretrained language model for scientific text
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
BERTweet: A pre-trained language model for English tweets
Adam: A method for stochastic optimization
Passage re-ranking with BERT
Long short-term memory
Named entity recognition with bidirectional LSTM-CNNs
Herd immunity: estimating the level required to halt the COVID-19 epidemics in affected countries
Mapping global trends in vaccine confidence and investigating barriers to vaccine uptake: a large-scale retrospective temporal modelling study