key: cord-0316558-u46ovlma
title: Monant Medical Misinformation Dataset: Mapping Articles to Fact-Checked Claims
authors: Srba, Ivan; Pecher, Branislav; Tomlein, Matus; Moro, Robert; Stefancova, Elena; Simko, Jakub; Bielikova, Maria
date: 2022-04-26
DOI: 10.1145/3477495.3531726
sha: 29dfc9f02f34528be20ac3865961281b55ab7699
doc_id: 316558
cord_uid: u46ovlma

False information has a significant negative influence on individuals as well as on the whole society. Especially in the current COVID-19 era, we witness an unprecedented growth of medical misinformation. To help tackle this problem with machine learning approaches, we are publishing a feature-rich dataset of approx. 317k medical news articles/blogs and 3.5k fact-checked claims. It also contains 573 manually and more than 51k automatically labelled mappings between claims and articles. Mappings consist of claim presence, i.e., whether a claim is contained in a given article, and article stance towards the claim. We provide several baselines for these two tasks and evaluate them on the manually labelled part of the dataset. The dataset enables a number of additional tasks related to medical misinformation, such as misinformation characterisation studies or studies of misinformation diffusion between sources.

False information on the Web has been a widely researched phenomenon in computer science for the past few years, as evidenced by many recent surveys, e.g., [2, 10, 34, 44, 47, 48]. The main focus was initially on political fake news; however, it shifted towards the medical domain with the arrival of the COVID-19 pandemic and an infodemic (a surge of new misinformation related to the pandemic). Motivated by the significant negative consequences of online false information, a number of approaches based on information retrieval and machine learning have been proposed to detect it. The main branch of existing work relies on indirect features derived from content (textual as well as multimedia) and context, such as content style, propagation patterns, author/source credibility, or social engagements/consumption [47]. This approach has several advantages, e.g., it allows new cases of false information to be detected early (since new false information usually shares similar characteristics with prior cases). On the other hand, the existing methods usually provide only limited single-label classification (typically a binary one: a news article/blog/social media post is/is not a piece of false information), have insufficient explainability, and, in addition, may suffer from domain shifts (either natural changes in domain characteristics or targeted adversarial attacks).

Another branch of knowledge-based approaches evaluates the actual content veracity by performing fact-checking. Fact-checking stands for the detection and verification of a claim, such as "Drinking bleach or pure alcohol can cure the coronavirus infections", against a knowledge base (e.g., scientific articles [15], articles from sources deemed reliable, such as Wikipedia [37], or knowledge graphs of known facts [45, 46]). This approach may be preferable in many situations, including tackling false information in the medical domain, which (owing to its inherent characteristics) requires accurate, easily explainable, and robust approaches to misinformation detection. Fact-checking can be done either manually by professional fact-checkers or (semi-)automatically with the help of AI. Manual fact-checking is time-consuming and, as yet, insufficient in scale.
On the other hand, fully automatic end-to-end fact-checking (e.g., [16]) is a challenging task and existing solutions have not yet achieved sufficient accuracy, generality, and credibility [24]. The real promise of technology for now lies in tools that assist fact-checkers in identifying and investigating claims and in delivering their conclusions as effectively as possible [9, 24]. AI research may assist fact-checkers in the following steps of the fact-checking process [24, 42]: 1) identification of claims worth fact-checking, 2) detection of previously fact-checked claims relevant to the identified check-worthy claims, 3) retrieval of relevant evidence to fact-check a claim, and 4) verification of the claim based on the retrieved evidence. In addition, the set of already fact-checked claims can be mapped back to additional (already existing or new) online content. While similar to step 2 above, here the input is a fact-checked claim and the output is a list of articles containing the claim. Thus, it can be viewed as the fifth (dissemination) step of the fact-checking process, which is typically not done in manual fact-checking, as it is difficult or even impossible for fact-checkers to manually find/update such relevant content [40]. Especially in the medical domain, many misinformative articles reuse claims that have already been expertly fact-checked, which makes the use of existing databases of fact-checks feasible.

AI-based fact-checking support in the multiple steps above is fundamentally based on document-to-claim mapping (a document being a news article/blog, a social media post, etc.) and, more specifically, on two IR/NLP tasks: presence detection and stance classification [1, 3, 12, 13, 26, 39, 43]. The detection of previously fact-checked claims (step 2) became a target of research interest only recently [28] and is one of the least studied research problems related to fact-checking [24]. It is typically addressed at the claim-to-claim level, i.e., previously fact-checked claims are ranked based on their relevance to a single given claim [21, 28]. Nevertheless, very recently, Shaar et al. [27] formulated a more challenging version of this task as the identification of all previously fact-checked claims in an input document (which can potentially contain multiple check-worthy claims). The task includes the detection of a document's sentences containing any of the previously fact-checked claims and the stance of these sentences towards the present claims. Presence detection and stance classification are also crucial for the other steps of the fact-checking process. In step 4, the presence of an investigated claim is detected to retrieve evidence and then its stance towards the claim is used to verify the factuality of the claim. Finally, both presence and stance are used in step 5 to map already fact-checked claims to additional documents [40].

While the situation with datasets for the first branch of false information detection (based on content and contextual characteristics) continually improves (cf. Section 2), datasets for AI research on fact-checking, particularly datasets providing a mapping between documents and claims (claim presence and document stance), still present a major problem hampering further research. In this paper, we are introducing a novel medical misinformation dataset. It contains:
• full-texts, original source URLs, and other extracted metadata of approx. 317k news articles and blog posts on medical topics published between January 1, 1998 and February 1, 2022 from a total of 207 reliable and unreliable sources;
• annotations with a source credibility score from expertly-curated lists, such as Media Bias/Fact Check, when available;
• around 3.5k fact-checks and extracted verified medical claims with their unified veracity ratings published by fact-checking organisations such as Snopes or FullFact;
• 573 manually and more than 51k automatically labelled mappings between previously verified claims and the articles; mappings consist of two values: claim presence (i.e., whether a claim is contained in a given article) and article stance (i.e., whether a given article supports or rejects a claim or provides both sides of the argument).

The dataset is primarily intended to be used as a training and evaluation set for machine learning methods for claim presence detection and article stance classification, but it enables a range of other misinformation-related tasks, such as misinformation characterisation, analyses of misinformation spreading, or classification of source reliability. Its novelty and our main contributions lie in: (1) the focus on medical news articles and blog posts as opposed to social media posts or political discussions; (2) providing multiple modalities (besides full-texts of the articles, there are also images and videos), thus enabling research of multimodal approaches; (3) the mapping of the articles to the fact-checked claims (with manual as well as predicted labels); (4) providing source credibility labels for 95% of all articles and other potential sources of weak labels that can be mined from the articles' content and metadata.

The dataset has been collected with our universal and extensible platform Monant [35], which was designed to monitor, detect, and mitigate false information. We are publishing a static dump of the dataset. Moreover, the dataset in Monant is being continuously updated with the latest articles and fact-checked claims from medical and other domains (e.g., general news) and also in languages other than English (currently in Slovak and Czech). To access the live version of the dataset, the Monant platform provides easy-to-use access by means of a REST API.

The majority of existing datasets [32] are created for the purpose of single-label false information detection. They are commonly annotated only by some simple heuristics (e.g., the veracity of articles is determined by the credibility of their sources [14, 17, 18]). However, such heuristics do not necessarily capture the real veracity of the articles (e.g., articles published by reliable sources may sometimes contain misinformative content and vice versa) and, therefore, should be used only as weak labels. Contrary to that, datasets annotated manually remain small or not fully annotated (e.g., just by the article's title [41]). Another way to create single-label fake news datasets is to take advantage of fact-checks: by following a direct link from fact-checking articles to the debunked online content (news articles/blogs, social media posts, etc.).
Examples of such datasets are FakeNewsNet [33] (a rich dataset providing social context from Twitter), the FakeCovid dataset [31] (providing 5,182 fact-checking articles related to COVID-19 circulated in 105 countries from 92 fact-checkers, however, without the debunked content itself), the CoAID dataset [5] (providing the mapping of fact-checking articles to debunked content, although the number of news articles covered by the dataset is quite small), or the FakeHealth dataset [6] (providing expertly annotated news stories published at HealthNewsReview.org together with their social engagements on Twitter). These datasets do not work explicitly with the claims themselves and mostly use fact-checks just to transfer a veracity label to the original content. Thus, their suitability for training AI models to support steps in the fact-checking process is limited.

Specific fact-checking datasets [11, 42] are therefore created to support individual steps of the fact-checking process, both by researchers and in data challenges (most prominently at the CLEF CheckThat! Lab, e.g., [29, 30]). Most of them are focused on the political domain (political debates) and short social media posts (mostly from Twitter [22], Facebook, or Reddit [23]). However, fact-checking datasets focused on the medical domain and providing mappings between claims and larger documents (such as news articles and blogs) are generally lacking. This presents a problem, because even though social media play a significant role in the creation and dissemination of medical misinformation [36], many people are also exposed to it when they search online for health-related issues (which is done by 72% of adult internet users according to the Pew Research Center). In [40], the authors created a large manually-annotated dataset (covering different domains). They mapped fact-checking articles to relevant documents containing the fact-checked claims along with the stance of the documents. Unfortunately, this dataset is not public. Recently, Shaar et al. [27] created a dataset providing presence and stance mappings between larger documents and previously fact-checked claims; nevertheless, the dataset is not available yet and it is focused on political fact-checked claims only. We can conclude that a publicly available, feature-rich, and sufficiently large dataset containing medical news articles/blogs with labelled mappings between articles and fact-checked claims is still missing.

In contrast to the described datasets, our work specifically focuses on creating a dataset containing news articles/blogs only. Focusing on one content type allows us to extract a rich set of metadata (e.g., articles' authors, sources, categories). To achieve a large set of labelled data, we do not rely on links between fact-checking articles and news articles/blogs, which are often missing. Rather, we provide both manual (human-created) and automatically predicted labels of claim presence and article stance, which we aggregate into article-claim pair veracities.

To create a medical misinformation dataset of news articles/blogs and fact-checked claims (and to continuously obtain new data), we used our research platform Monant [35]. Scraping of the relevant web content and extraction of metadata is implemented by means of so-called monitors and data providers. Data providers implement the scraping functionality. General parsers (for RSS feeds, WordPress sites, the Google Fact Check Tool, etc.) as well as custom crawlers and parsers were implemented (e.g., for the fact-checking site Snopes). Monitors define which data providers should be used, their scheduling (i.e., the frequency of extractions), their parameter setup (e.g., a list of RSS feed URLs used as an input to the RSS feed parser), and data provider chaining (i.e., whether additional data providers should be chained, e.g., when a new article is found). All data is stored in a unified format in a central data storage.
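To illustrate the monitor/data-provider pattern described above, the following Python sketch shows one possible shape of such components; all class, field, and method names here are hypothetical and do not correspond to the actual Monant implementation.

```python
# Illustrative sketch only; names are hypothetical and not the actual Monant code base.
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class DataProvider:
    """Implements the scraping of one kind of source (RSS, WordPress, custom parser)."""
    name: str
    fetch: Callable[[Dict], List[dict]]  # returns parsed items in a unified format


@dataclass
class Monitor:
    """Defines which provider runs, how often, with what parameters, and what is chained."""
    provider: DataProvider
    schedule_hours: int                          # scheduling, e.g., run every N hours
    params: Dict = field(default_factory=dict)   # e.g., {"rss_urls": [...]}
    chained: List[DataProvider] = field(default_factory=list)  # run when a new article is found

    def run(self, storage: List[dict]) -> None:
        new_items = self.provider.fetch(self.params)
        storage.extend(new_items)                # unified format in a central data storage
        for item in new_items:
            for provider in self.chained:
                storage.extend(provider.fetch({"article": item}))
```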
To compile a list of medical English news sites/blogs and to determine their credibility, we used expertly-curated lists of reliable and unreliable sites (e.g., Media Bias/Fact Check or OpenSources) and previous related works (e.g., [7]). We added additional sources of unknown credibility that were often referenced (linked) by the sources in the initial list. Next, we checked for each source whether it still existed and how the data could be obtained from it (e.g., using a WordPress or RSS feed parser, or whether it required a custom parser). We ended up with a list of 207 medical sources in English; we have a credibility (reliability) score for 70 of them. Examples of reliable (credible) sources include healthline.com or who.int; examples of sources marked by the listings as unreliable are naturalnews.com or healthimpactnews.com. Most of these sources contain only medical content, and thus no additional content selection was needed. If a source contained articles falling under multiple topics (e.g., politics, home news), we restricted the scraping only to a category corresponding to medical/health news/blogs.

Next, we searched for fact-checking sources that also perform fact-checking of medical claims; we compiled a list of 7 of them (namely Snopes.com, MetaFact.io, FactCheck.org, Politifact.com, FullFact.org, HealthFeedback.org, and ScienceFeedback.co). Similarly to the case of news sites/blogs, we either collected all fact-checking articles (in the case of medical-only fact-checking sites) or relied on categories manually assigned by the expert fact-checkers. Since the fact-checked claims in the selected sources are explicitly stated by the fact-checkers, it was possible to automatically extract claims from the fact-checking articles. Additional claims were supplemented from the list of unproven cancer treatments published in [8]. As veracity ratings can differ between fact-checkers, we unified them into a scale of 6 values: false, mostly false, true, mostly true, mixture, and unknown (meaning that the veracity of the claim could not be evaluated by a fact-checker or that an experts' consensus has not been reached yet). The latter originates mostly from the MetaFact.io site, where the experts' evaluations are crowdsourced (in comparison with other fact-checking portals, where the fact-checking process is typically done by one expert only) and the claim veracity is determined only when the evaluations of a sufficient number of experts are available.

Our aim was to obtain manual ground-truth labels of claim presence, i.e., whether a given verified (fact-checked) claim is present in an article, and of article stance, i.e., what the stance of the article is towards the matched claim. Our proposed data labelling methodology was inspired by the work of Wang et al. [40]. The labelling is performed in four steps: First, we identify possible article-claim pairs to label. Second, the pairs are distributed to annotators in batches, guaranteeing that each pair is given to multiple annotators (to minimise possible mistakes in the labelling process) and that the same annotator never sees the same pair multiple times, even across batches.
Next, the pairs are annotated by the annotators. Lastly, the labels from all annotators are aggregated into a single claim presence and article stance label for each labelled article-claim pair.

A total of 28 annotators participated in the labelling process, including the authors of this paper, master students, and other researchers. To prevent potential subjectivity and low-quality labels, a match of at least two annotators had to be achieved for a label to be included in the dataset. When there was no match between the first two annotators, the article-claim pair was assigned to up to three additional annotators to collect more labels. Overall, inter-annotator agreement was high; an additional annotator was required only in 8.57% of cases for claim presence and in 6.94% of cases for article stance labels. In the rare cases when agreement was not reached (covering difficult-to-annotate or disputable cases), the article-claim pair was disregarded.

To annotate claim presence, the annotators could select one of four possible labels:
(1) Present - when the annotator can find a part of the article (a sentence or a paragraph) that literally or semantically contains the claim.
(2) Suggestive - when the article relates to the claim, but the annotator cannot identify any specific part of the article that contains it (e.g., an article discusses flu vaccine efficacy and suggests that the vaccines are ineffective or even harmful by providing anecdotal evidence, but never explicitly makes that claim).
(3) Not present - when the claim is not present in the article.
(4) Can't tell - when the annotator cannot, for some reason, choose any of the options above.

When the annotators selected either the "Present" or the "Suggestive" label, they were further asked to label the stance of the article towards the identified claim by selecting one of four possible labels:
(1) Supporting - when the article supports the claim (directly or indirectly from its context).
(2) Contradicting - when the article contradicts the claim (directly or indirectly from its context).
(3) Neutral - when the article does not take a stand on the claim or presents arguments both for and against the claim.
(4) Can't tell - when the annotator cannot, for some reason, choose any of the options above.

The individual article-claim pair labels are aggregated as follows: First, we filter out all "Can't tell" labels. Next, if any of the remaining claim presence or article stance labels was chosen by two or more annotators for a given article-claim pair, this label is assigned as the final aggregated one. In case of no match in the claim presence labels, we lower the requirement by joining the "Present" and "Suggestive" labels into one and check again for a match. If a match is found, we assign the "Suggestive" label as the final aggregated claim presence label. It is also worth noting that article stance labels can be evaluated only when a given claim is present in the article. As a result, there is a lower number of article stance labels compared to the number of claim presence ones.
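The aggregation rule above can be summarised in a short sketch; the label strings and function name are our own and the real implementation may differ. Stance labels are aggregated analogously, without the joining step.

```python
from collections import Counter
from typing import List, Optional

def aggregate_presence(labels: List[str]) -> Optional[str]:
    """Aggregate per-annotator claim presence labels for one article-claim pair.

    Expected labels: "present", "suggestive", "not_present", "cant_tell".
    Returns the final label, or None when no agreement is reached (pair is disregarded).
    """
    votes = [label for label in labels if label != "cant_tell"]   # 1) drop "Can't tell"
    counts = Counter(votes)
    top_label, top_count = counts.most_common(1)[0] if counts else (None, 0)
    if top_count >= 2:                                            # 2) exact match of >= 2 annotators
        return top_label
    # 3) no exact match: join "Present" and "Suggestive" and check again
    if counts["present"] + counts["suggestive"] >= 2:
        return "suggestive"
    return None
```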
3.2.2 Selection of article-claim pairs for labelling. The number of all possible article-claim pairs is equal to the number of claims times the number of articles, which is far too many to label. Moreover, most of them would be irrelevant, i.e., they would consist of claims completely unrelated to the articles. To deal with this problem, we select for labelling only a subset of pairs with a high possibility of being relevant. We used two selection methods during our labelling.

At first, we used ElasticSearch to select a subset of the article-claim pairs. More specifically, we used each claim in turn as a query to find matching articles. This returned a large set of articles along with the BM25 score for each article. We kept only articles with a score higher than 2/3 of the maximum score, i.e., of the score associated with the first matched article. We then shuffled the resulting set of article-claim pairs and sampled two batches, each with 100 random pairs, i.e., 200 pairs in total. We split them among six annotators so that each pair was assigned to three annotators. The annotations were collected using spreadsheets: each annotator was assigned one sheet per batch, with each row describing a single article-claim pair. For each article-claim pair, the annotators were presented with the title of the article, the claim, the article URL, and the claim URL for reference.
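A minimal sketch of this first selection method is shown below, using the ElasticSearch 8.x Python client; the index name, document field, and result size are assumptions, not the exact setup we used.

```python
# Sketch of the ElasticSearch-based pair selection; index/field names are assumptions.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def candidate_pairs_for_claim(claim_text: str, claim_id: str, index: str = "articles"):
    """Use the claim as a query and keep articles scoring above 2/3 of the top BM25 score."""
    response = es.search(index=index, query={"match": {"body": claim_text}}, size=1000)
    hits = response["hits"]["hits"]
    if not hits:
        return []
    max_score = hits[0]["_score"]          # score of the first (best) matched article
    threshold = (2 / 3) * max_score
    return [(hit["_id"], claim_id) for hit in hits if hit["_score"] > threshold]
```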
However, this selection method led to a significant class imbalance. Out of the 197 article-claim pairs where there was an agreement between the annotators, the claims were labelled as present in only ∼10% of cases, which also limited the number of stance annotations. We also observed a relatively large number of "Can't tell" labels, which were caused by several claims: these overly generic claims (e.g., "There are more doctors") were mistakenly matched with many articles. The former issue was addressed by using our proposed claim presence detection baseline (cf. Section 5.1) instead of the simple querying in ElasticSearch. To mitigate the latter, we manually filtered out these problematic claims from further labelling.

We also switched from spreadsheets to a custom-made web-based annotation application, suitable also for mobile devices, which enabled us to reach a wider range of annotators. The application streamlined the annotation process and the distribution of article-claim pairs to the annotators. The article-claim pairs were served to annotators until a match of at least two annotators was achieved in the values of claim presence as well as article stance. Pairs with at least one label, but where no consensus had been achieved yet, were served to the annotators with a higher priority to keep the number of "unfinished" pairs to a minimum. Each article-claim pair was presented in the application as shown in Figure 1. The claim was presented at the top, visually separated from the rest of the presented content. Underneath the claim, the title of the article, followed by its formatted body, was presented to the annotators. At the bottom, the annotators were presented with buttons for assigning the claim presence label and, if they chose that the claim was present in the article, also the article stance label. As the articles were long and often dealt with multiple claims at the same time, we used a supportive text highlighting feature: the application highlighted the sentences in the article that were most similar to the claim. The similarity was determined by the cosine similarity between a sentence embedding representation of the given claim and the sentences of the article. Using this approach, we collected an additional 376 article-claim pair labels from 28 annotators.

The collection of labels was also distributed in time. The first 439 article-claim pairs (denoted as Sample 1 in the sections below) were annotated in 2019 and early 2020; since this was before the onset of the COVID-19 pandemic, this sample does not contain any claims or articles pertinent to it. The remaining 134 pairs (denoted as Sample 2 in the sections below) were annotated in June 2021, thus also capturing narratives spread at that time.

The dataset consists of medical news articles/blogs and fact-checked claims in the English language. However, the Monant platform, which was used to collect the dataset and makes it accessible via an API endpoint, also collects articles from other domains (e.g., politics or general news) and in other languages (currently mostly in Slovak and Czech). Out of approx. 885k unique news articles/blogs from 256 sources, there are 317k English medical articles from 207 sources. Out of approx. 10k fact-checking articles extracted from 17 fact-checking sites, there are 3.5k fact-checked medical claims from 7 fact-checking sites. And out of approx. 780k discussion posts (related to 48k articles), there are 711k discussion posts attached to English medical articles. In the following analyses, we focus specifically on the English medical data contained in the provided dataset. (The content of this section and Section 5.3 is based on the dataset's analysis published at https://github.com/kinit-sk/medical-misinformation-dataset/. To make the analysis replicable, it uses a "freeze time" set to February 1, 2022; as a result, only the data present in the Monant platform up to this date are considered.)

The dataset provides a rich set of features for each article. Besides an article's URL, title, textual body, and attached multimedia, it also contains information about the article's authors, category, tags, and references. In addition, we collect (at regular intervals) the users' feedback on Facebook (i.e., the number of likes or shares) for each news article. In some cases, the posts from the attached discussions are available as well. For 70 sources, we have an explicit source reliability (credibility) label (cf. Section 3.1 for more details): 22 sources are considered reliable and 48 unreliable. Out of all medical articles, 39% were collected from reliable sources, 56% from unreliable sources, and only 5% from sources without any reliability label.

Wherever possible, we collected all articles published by a given source. Consequently, some of the articles in the dataset were published as early as 1998. Nevertheless, the majority of the collected news articles were published between 2010 and 2021, as shown in Figure 2. We can see an increasing trend in the number of medical news articles, with a significant increase in the last three years (the extreme rise in 2020 can be explained by the onset of the COVID-19 pandemic). Figure 3 shows the distribution of veracity ratings of the fact-checked medical claims contained in the dataset: 983 were evaluated as false, 60 as mostly false, 100 as mixture, 39 as mostly true, and 259 as true. The rating of a significant number of claims (originating mostly from MetaFact.io, cf. Section 3.1) is currently unknown.

The dataset contains 573 article-claim pairs labelled by human annotators. There are 323 pairs annotated with positive claim presence labels, out of which 309 also have article stance labels. The overall distribution of the claim presence and article stance labels is shown in Table 1, which also shows the distributions for Samples 1 and 2 individually. As we can see, while there is a balance between present and not present labels in Sample 1 as well as overall, Sample 2 is skewed towards present labels.
As to the article stance, most articles support the matched claims. There is a lack of "Neutral" stance labels in our dataset, i.e., of articles that would present both sides of the argument. This can make it difficult for models trained on this data to correctly classify this stance class. Besides the labels from human annotators, the dataset also contains approx. 51k article-claim pairs with labels predicted by our proposed baselines. Their analysis is provided in Section 5.3.

The collected dataset can support a range of fact-checking and misinformation-related tasks. Its main intended use is for training and evaluation of machine learning methods for the tasks of claim presence detection and article stance classification. The former can be considered a claim-oriented document retrieval problem, i.e., given a fact-checked claim, all documents where it is present are retrieved; alternatively, it can be framed as previously fact-checked claim detection, i.e., given an unverified piece of text or claim, all relevant previously fact-checked claims are retrieved [29]. The latter is a classification problem; the aim is to detect the stance (position) of the author of an input piece of text towards a specified target [20].

Since the dataset contains articles from a number of reliable and unreliable sources, it could be used for the misinformation characterisation task, i.e., for analyses of the characteristics of articles (how they are written) similar to [7]: what topics they cover and how these topics evolve over time. The mapping of articles to fact-checked claims provides a straightforward grouping of the articles based on the misinformation they are related to. Misinformation sources often create inter-connected networks which spread and amplify the false information [19]. Since the dataset contains full-texts of the articles, it supports the task of misinformation spreading/diffusion analysis. For example, it is possible to analyse linking patterns between the sources, search for content that is similar or even taken over from other sources, etc. Since the articles have publication dates, it is also possible to analyse where the misinformation first appeared and when (and how fast) it was taken up by other sources. This is especially relevant with respect to the spread of misinformation between countries and across languages. Since the data available in Monant via an API endpoint also contain non-English sources (at this moment Slovak and Czech), they can be used to develop and test multilingual methods and analyse spreading patterns from English-language sources to other languages and (possibly) vice versa. Besides text, the dataset contains other modalities, such as image URLs, article and source metadata, etc. These can all be utilised to develop multimodal detection methods. Lastly, the dataset can also be used for the task of source credibility identification by utilising the existing source credibility labels and extracting a range of credibility indicators from the articles and available metadata, such as the polarity of the articles, the use of references, the use of authors, etc.

The dataset was collected and is published for research purposes only. We collected only the publicly available content of news articles/blogs. The dataset contains the identities of the articles' authors if they were stated in the original source; we kept this information, since the presence of an author's name can be a strong credibility indicator.
However, we anonymised the identities of the authors of discussion posts included in the dataset. The main identified ethical issue related to the presented dataset lies in the risk of mislabelling an article as supporting a false fact-checked claim and, to a lesser extent, of mislabelling an article as not containing a false claim or not supporting it when it actually does. To minimise these risks, we developed our labelling methodology as described in Section 3.2 and require an agreement of at least two independent annotators to assign a claim presence or article stance label to an article. It is also worth noting that we do not label an article as a whole as false or true. Nevertheless, we provide partial article-claim pair veracities based on the combination of claim presence and article stance labels (cf. Section 5.3). As to the veracity labels of the fact-checked claims and the credibility (reliability) labels of the articles' sources, we take these from the fact-checking sites and external listings such as Media Bias/Fact Check as they are and refer to their methodologies for more details on how they were established. Lastly, the dataset also contains automatically predicted labels of claim presence and article stance produced by our baselines described in the next section. These methods have their limitations and work with a certain accuracy, as reported in this paper; this should be taken into account when interpreting the predicted labels. The means for reporting considerable mistakes in the raw data and manual labels are described in the accompanying repository.

We provide an evaluation of three claim presence detection baselines and compare their performance on the whole manually labelled dataset using only the Not present and Present (which also includes Suggestive) classes (see Table 1 for their distribution):
• Information retrieval (IR method) - For any given claim and any given article, the claim presence is determined by the IR method as follows: First, 1-, 2-, and 3-grams are extracted from the given claim. Each n-gram is assigned a TF-IDF score, where TF is calculated within the claim and IDF based on the whole corpus of articles. Next, we match the n-grams to the sentences of an article. If an n-gram contains medical terms, their synonyms are also allowed when matching sentences. Medical terms are identified using the Academic Vocabulary List; synonyms of these terms are retrieved by comparing the similarity of their word vectors using fastText pre-trained on Wikipedia articles. The scores of n-grams for which there is an article sentence containing all of their terms are summed up and normalised by the sum of all n-gram scores. We do this separately for 1-, 2-, and 3-grams and compute the final presence score as their average. The claim is classified as present in a given article if the final computed score is above a defined threshold.
• Sentence embedding similarity (SE method) - This method calculates a presence score based on sentence embeddings (using the Universal Sentence Encoder [4], model v4) extracted from the article sentences and the claim. The score is an average of two similarity comparisons: 1) the cosine similarity between the article title and the claim, and 2) the average cosine similarity between the claim and the 5 article sentences most similar to the claim. The claim is classified as present in a given article if the final computed score is above a defined threshold.
• Combined IRSE method - This method, first introduced in [25], works the same as the IR method with a few important distinctions: First, the score of each matched n-gram is computed as the product of its TF-IDF score (IR method) and the cosine similarity between the embedding of the article sentence that contains all terms of the given n-gram and the claim embedding (SE method). Second, to make the comparison more efficient, only sentences with a similarity above a certain pre-filtering threshold are considered. This threshold is computed as the average of the cosine similarity between the claim and the article title embeddings and the cosine similarity between the claim and the K most similar sentences.
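A minimal sketch of the SE scoring is given below, using the Universal Sentence Encoder v4 from TensorFlow Hub as in the method description; sentence splitting is assumed to be done beforehand and some details (e.g., batching) are simplified.

```python
import numpy as np
import tensorflow_hub as hub

# Universal Sentence Encoder v4, as used by the SE baseline.
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def se_presence_score(claim: str, title: str, body_sentences: list, top_k: int = 5) -> float:
    """Average of (claim vs. title similarity) and (mean similarity of the top-k most similar body sentences)."""
    vectors = embed([claim, title] + body_sentences).numpy()
    claim_vec, title_vec, sentence_vecs = vectors[0], vectors[1], vectors[2:]
    title_sim = cosine(claim_vec, title_vec)
    if len(sentence_vecs) == 0:
        return title_sim
    top_sims = sorted((cosine(claim_vec, v) for v in sentence_vecs), reverse=True)[:top_k]
    return 0.5 * (title_sim + float(np.mean(top_sims)))

# The claim is considered present when the score exceeds a tuned threshold (0.5 in our evaluation).
```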
All three baselines required a choice of a threshold to make the claim presence decisions based on the computed presence scores. We chose the threshold values so that the recall of the methods on the positive (i.e., Present) class would be roughly the same (around 0.4). This way, we can compare the methods working under the same requirement for the proportion of relevant items to be selected. The resulting thresholds for the IR, SE, and IRSE methods were 0.5, 0.5, and 0.45, respectively. The IRSE method also contains a pre-filtering threshold; our experiments showed that setting its value to 0.25 enabled it to discard a large number of potential mappings without affecting the overall performance of the method.

The results of the baselines on our labelled dataset are shown in Table 2. Although the IR and SE methods achieved similar results, we can see that the IRSE method outperformed both, suggesting the utility of their combination. This is also confirmed by Figure 4, which illustrates the relation between the true positive rate and the false positive rate of the baselines using ROC (receiver operating characteristic) curves. The IRSE method retains a lower false positive rate with increasing true positive rate than both the IR and SE methods; of the two, the IR method performs better, with a lower false positive rate than the SE method. We also evaluated the baselines individually on Samples 1 and 2 (see Table 2). Although the IRSE method retains the highest accuracy, the accuracy drops for all methods on Sample 2 compared to Sample 1. Manual inspection of the errors made by the IRSE method revealed that the decrease cannot be explained by a domain shift due to COVID-19. Most commonly, the errors were due to the claim presence method neglecting some information in claims and mapping them to articles that were related but did not discuss that specific case. For instance, for the claim "Omega-3 fatty acids decrease triglycerides", we observed results that discussed other effects of Omega-3 fatty acids that did not relate to triglycerides. To handle such cases, a stricter threshold could be used.

Table 2: Precision, recall, and F1-score of the claim presence detection baselines evaluated on the whole manually labelled dataset. Accuracy is computed individually on Sample 1 and Sample 2 (collected and annotated in 2019 and 2021, respectively), as well as on the dataset as a whole.

To evaluate the article stance classification baselines, we utilise Sample 1 with 210 pairs as the training set and Sample 2 with 99 pairs as the testing set (see Table 1 for the distribution of classes in both samples). In both cases, we do not consider the Not present class, as it is not relevant for stance classification.
In addition to the manually labelled Monant data, we also utilise the Fake News Challenge (FNC) dataset. Similarly, we drop the class denoting that an article is unrelated to the claim. This leaves us with ∼20,450 samples with the following distribution: 27.24% Supporting, 7.5% Contradicting, and 65.26% Neutral.

We compare the performance of several baselines. The first group of baselines comprises the best models from the FNC, including Talos, Athene, and the UCL baseline (referred to in the results below). Since the challenge took place already in 2017, these models can no longer be considered state of the art, but they nevertheless represent a relatively wide range of approaches utilising both hand-crafted and automatically extracted features, which makes them interesting for benchmarking newer models. The second group comprises our proposed baseline methods, which utilise a CNN and an LSTM combined with a similarity and an attention mechanism, respectively, to identify the parts of articles relevant for stance classification:
• All Sentences CNN - the input to the CNN model is a claim followed by the first 100 sentences of the article, without any detection of their relevance. Articles with a higher number of sentences are clipped and those with a lower number of sentences are padded with zero vectors. This network is meant for comparison purposes, to determine the effect of sentence relevance detection.
• Attention LSTM - this model uses an LSTM network to obtain high-level representations of both the claim and the article body. An attention mechanism is applied to the high-level representations to identify important parts of the articles. Another LSTM layer is applied to the output of the attention layer. Dropout with a rate of 0.4 is applied to prevent overfitting, followed by a dense layer and a softmax layer for classification.
• Similarity CNN - the input to the CNN model consists of the three sentences most similar to a given claim (based on the cosine similarity of their embedding representations), along with one preceding and one following sentence for each. We use three different convolutional layers, the outputs of which are concatenated together. A dropout with a rate of 0.25 is applied before the convolutional layers and one with a rate of 0.5 is applied to the concatenated output of the convolutions, to prevent overfitting. Finally, we apply a dense layer and a softmax layer for classification.
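The following Keras sketch shows one way the Similarity CNN described above could be assembled; the sentence-embedding dimensionality, number of filters, and kernel sizes are illustrative assumptions rather than the exact hyperparameters used.

```python
# Illustrative Keras sketch of the Similarity CNN; sequence length (claim + 9 context
# sentences), embedding size (512, as with USE), and filter settings are our assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models

SEQ_LEN = 10      # claim + 3 most similar sentences, each with 1 preceding and 1 following sentence
EMB_DIM = 512     # sentence-embedding dimensionality (e.g., Universal Sentence Encoder)
NUM_CLASSES = 3   # supporting / contradicting / neutral

inputs = layers.Input(shape=(SEQ_LEN, EMB_DIM))
x = layers.Dropout(0.25)(inputs)                      # dropout before the convolutions

# Three different convolutional layers over the sentence sequence, concatenated together.
branches = [
    layers.GlobalMaxPooling1D()(layers.Conv1D(64, k, activation="relu", padding="same")(x))
    for k in (2, 3, 4)
]
x = layers.Concatenate()(branches)
x = layers.Dropout(0.5)(x)                            # dropout on the concatenated output
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)

model = models.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# Transfer learning (described next in the text): pre-train with model.fit on the FNC data,
# then fine-tune with a further model.fit on the manually labelled Monant Sample 1.
```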
For the last two proposed baseline models, we also employ transfer learning: we first train a general model using the FNC data and fine-tune it on the manually labelled Monant training data. Table 3 presents the accuracy of the baseline methods on the two datasets. In the case of the FNC dataset, we evaluate the models using the test subset of the dataset, as it was originally released for the competition. In the case of the manually labelled Monant data, we perform two evaluations. First, we perform a 5-fold cross-validation on the training set (Sample 1) and report the mean performance of the model, determined by running the cross-validation 10 times. Second, we evaluate models trained on the whole Sample 1 on the testing set (Sample 2).

The results show that the models that utilise simple hand-crafted features struggle when dealing with a different dataset. This is evident for Athene and its extension. We can presume that the used hand-crafted features are too specific to the FNC data and do not generalise well to the Monant data. On the other hand, the models with automatic feature extraction, which include the UCL baseline model, Talos, and our proposed models, show better performance and better generalisation. In addition, we can see that these models retain their accuracy even on Sample 2, which was collected later than the training data and could theoretically include data or concept drift. The results also suggest that the identification of relevant parts of the articles is necessary when dealing with longer articles. In the case of the FNC data, where the average article length is ∼16 sentences, the performance increase is not as evident. This may be due to the specificity of the shorter articles, which mostly deal with a single claim and can therefore be considered relevant as a whole for the classification. However, when investigating the articles from Monant, where the average article length is ∼55 sentences, the increase in performance of Attention LSTM and Similarity CNN over All Sentences CNN is noticeable. In such articles, the extraction of features from the whole article results in a lot of noise, which causes problems for the classification. When comparing the attention mechanism with the similar sentences extracted using cosine similarity, we found that the former sometimes struggled to identify the relevant parts of articles: it tended to focus solely on the sentences similar to a given claim, while the arguments regarding the claim were often present in the surrounding sentences instead. Since Similarity CNN also took these surroundings into account, it achieved better performance. Lastly, the use of transfer learning contributed to a significant increase in performance on the Monant data, even though the discrepancy in the distribution of classes across the datasets was significant. When we were training the LSTM networks using transfer learning, they often broke down and started predicting the most dominant class in the data. Even though the use of the attention mechanism helped in this regard, CNNs proved to be more stable and reliable for generating good claim and article representations and therefore attained better performance.

We use the best-performing baselines, i.e., the IRSE method and the Similarity CNN with transfer learning, to predict claim presence and article stance labels for articles and fact-checked claims in the collected dataset; these are also part of the published data. In addition, we aggregate these labels into article-claim pair veracities as follows: If an article agrees with a claim, we assign the veracity of the claim to the article-claim pair. If an article contradicts a claim, we assign to the article-claim pair the veracity opposite to the claim veracity. Lastly, if an article has a neutral stance towards a claim, or the veracity of the claim is unknown, the article-claim pair is evaluated as unknown as well.
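A short sketch of this aggregation rule follows; the label strings are our own naming, and the handling of the mixture rating under a contradicting stance is an assumption (the text above does not specify it).

```python
# Sketch of the article-claim pair veracity aggregation; label strings are our own naming.
OPPOSITE = {
    "false": "true", "mostly false": "mostly true",
    "true": "false", "mostly true": "mostly false",
    "mixture": "mixture",      # assumption: a contradicted mixture claim stays a mixture
    "unknown": "unknown",
}

def pair_veracity(article_stance: str, claim_veracity: str) -> str:
    """Combine predicted article stance and fact-checked claim veracity into a pair veracity."""
    if article_stance == "supporting":
        return claim_veracity                          # article agrees -> inherit claim veracity
    if article_stance == "contradicting":
        return OPPOSITE.get(claim_veracity, "unknown")  # article contradicts -> opposite veracity
    return "unknown"                                   # neutral stance or unknown claim veracity
```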
The predicted labels are less precise compared to the manual ones, but at the same time, they are available for a much larger number of articles. They are also more accurate than many commonly used heuristics (e.g., the ones derived solely from the reliability of an article's source). This makes them well suited to be used as (weak) labels for other misinformation detection methods (based on articles' content style or context), while accepting some noise introduced by the methods' inaccuracy in some cases.

In total, there are approx. 51k article-claim mappings labelled with positive claim presence labels and, consequently, also with article stance annotations. Out of all 317k articles, 11% are mapped to at least one claim. Out of all 3.5k medical claims, 35% are mapped to at least one article. The majority of predicted claim presence labels are related to claims from MetaFact.io (66.6%), followed by FullFact.org (18.4%), HealthFeedback.org (9.6%), the list of cancer-related claims created in [8] (3.7%), and Snopes.com (1.6%). Out of all predicted article stance labels, 79% are supporting, 4% are neutral, and 17% contradicting. The resulting article-claim pair veracity labels (51k in total) have the following distribution: 20% are classified as false, 0.1% as mostly false, 0.1% as mixture, 0.1% as mostly true, 17% as true, and finally 62% of article-claim pairs are labelled as unknown. The high number of article-claim pairs labelled as unknown is caused by the fact that 58% of medical claims (mostly from MetaFact.io) have an unknown veracity. Out of the 35k articles mapped to at least one claim, 21% are mapped only to true article-claim pair veracity labels, 22% only to false article-claim pair veracity labels, and 3% of articles contain a mixture of true and false article-claim pair veracity labels. The remaining articles are associated only with one or several article-claim mappings with unknown veracity.

Regarding the source credibility labels, 69% of article-claim pair veracity labels relate to articles which come from unreliable sources; out of them, 22% label article-claim pairs as false and 17% as true. 25% of article-claim pair veracity labels relate to articles which come from reliable sources; out of them, 17% label article-claim pairs as false and 20% as true. Although further investigation is needed, we can see that more veracity annotations relate to articles from unreliable sources (even when we consider the distribution of articles from un/reliable sources in our dataset). However, it also suggests that the information on the sources' credibility (commonly used as a heuristic to label articles) is not sufficient and that the articles need to be assessed by the claims they make.

In this paper, we introduced a labelled dataset of medical articles with mappings to fact-checked claims for training and evaluation of machine learning methods supporting the fact-checking process. Besides providing a static dump of the dataset, we also provide programmatic access to the continuously updated data in our Monant platform. The platform has already been maintained for over 2.5 years, collecting, updating, and annotating new data. The main supported tasks are claim presence detection and article stance classification, for which we provide manual labels, and which are essential for searching and checking whether a new article contains claims that have already been fact-checked. In addition, the dataset enables a range of other tasks, such as misinformation characterisation studies, studies of misinformation diffusion, source credibility classification, etc. Thus, the dataset can be useful for researchers interested in misinformation and automated or ML-supported fact-checking, as well as for the NLP and IR community in general. We also present the results of claim presence and article stance baselines, which are used to generate predicted labels mapping articles to fact-checked claims.
While the former are based on a combination of classical IR approaches and sentence similarity, the latter use more advanced neural network approaches combined with transfer learning to compensate for the limited number of labelled samples (and the class imbalance, especially w.r.t. the neutral class). The baselines leave plenty of room for improvement, e.g., by applying state-of-the-art pre-trained language models based on transformers. Also, they currently work only for content in English and are strictly limited to textual content (i.e., they cannot detect the presence of a claim in an image, such as a screenshot of a social media post, a meme, etc.). As future work, we plan to: 1) extend the dataset with content in other languages; 2) develop multilingual methods for claim presence detection and article stance classification; and 3) apply them in a range of tasks, such as the detection of previously fact-checked claims, the mapping of these claims to additional online content, and automated audits of misinformation prevalence in social media recommender systems [38]. Since the scarcity of manually labelled data will likely remain a problem, we will continue focusing on machine learning approaches that can utilise unlabelled or limited labelled data, such as meta learning or weakly supervised learning. Furthermore, we will seek more efficient ways of navigating the selection of examples to label (active learning), and ways of gathering and exploiting previous experience from other tasks, as is the case with transfer and meta learning.
REFERENCES
Integrating Stance Detection and Fact Checking in a Unified Corpus
A survey on fake news and rumour detection techniques
Combining similarity features and deep representation learning for stance detection in the context of checking fake news
Universal sentence encoder
CoAID: COVID-19 Healthcare Misinformation Dataset
Ginger Cannot Cure Cancer: Battling Fake Health News with a Comprehensive Data Repository
Differences in Health News from Reliable and Unreliable Media
Fake Cures: User-Centric Modeling of Health Misinformation in Social Media
Understanding the promise and limits of automated fact-checking
The Future of False Information Detection on Social Media: New Perspectives and Trends
2022. A survey on automated fact-checking
A Retrospective Analysis of the Fake News Challenge Stance-Detection Task
Preslav Nakov, and Isabelle Augenstein. 2021. A survey on stance detection for mis- and disinformation identification
Artificial Intelligence: Methodology, Systems, and Applications
Understanding and Mitigating Bias in Online Health Search
ClaimBuster: The First-Ever End-to-End Fact-Checking System
This Just In: Fake News Packs A Lot In Title, Uses Simpler, Repetitive Content in Text Body, More Similar To Satire Than Real News
BanFakeNews: A Dataset for Detecting Fake News in Bangla
Quantitative and qualitative analysis of linking patterns of mainstream and partisan online news media in Central Europe
Stance Detection: A Survey
Did I See It Before? Detecting Previously-Checked Claims over Twitter
CREDBANK: A Large-Scale Social Media Corpus With Associated Credibility Annotations
Fakeddit: A New Multimodal Benchmark Dataset for Fine-grained Fake News Detection
Automated Fact-Checking for Assisting Human Fact-Checkers
FireAnt: Claim-Based Medical Misinformation Detection and Monitoring
A simple but tough-to-beat baseline for the Fake News Challenge stance detection task
Assisting the Human Fact-Checkers: Detecting All Previously Fact-Checked Claims in a Document
That is a Known Lie: Detecting Previously Fact-Checked Claims
CheckThat! Lab Task 2 on Detecting Previously Fact-Checked Claims in Tweets and Political Debates
Tamer Elsayed, and Preslav Nakov. 2021. Overview of the CLEF-2021 CheckThat! Lab Task 1 on Check-Worthiness Estimation in Tweets and Political Debates
FakeCovid - A Multilingual Cross-domain Fact Check News Dataset for COVID-19
Combating Fake News: A Survey on Identification and Mitigation Techniques
FakeNewsNet: A Data Repository with News Content, Social Context, and Spatiotemporal Information for Studying Fake News on Social Media
Monant: Universal and Extensible Platform for Monitoring, Detection and Mitigation of Antisocial Behaviour
Prevalence of
FEVER: a Large-scale Dataset for Fact Extraction and VERification
An Audit of Misinformation Filter Bubbles on YouTube: Bubble Bursting and Recent Behavior Changes
Fake News Stance Detection Using Deep Learning Architecture (CNN-LSTM)
Weak Supervision for Fake News Detection via Reinforcement Learning
Automated fact-checking: A survey
From Stances' Imbalance to Their Hierarchical Representation and Detection
An overview of online fake news: Characterization, detection, and discussion
Transformer-XH: Multi-Evidence Reasoning with eXtra Hop Attention
Reasoning Over Semantic-Level Graph for Fact Checking
A Survey of Fake News: Fundamental Theories, Detection Methods, and Opportunities
Detection and Resolution of Rumours in Social Media: A Survey