CsFEVER and CTKFacts: Czech Datasets for Fact Verification
Jan Drchal, Herbert Ullrich, Martin Rýpar, Hana Vincourová, Václav Moravec
2022-01-26

In this paper, we present two Czech datasets for automated fact-checking, which is a task commonly modeled as a classification of textual claim veracity w.r.t. a corpus of trusted ground truths. We consider three classes: SUPPORTS and REFUTES, complemented with evidence documents, and NEI (Not Enough Info) standing alone. Our first dataset, CsFEVER, has 127,328 claims. It is an automatically generated Czech version of the large-scale FEVER dataset built on top of the Wikipedia corpus. We take a hybrid approach of machine translation and document alignment; the approach, and the tools we provide, can be easily applied to other languages. The second dataset, CTKFacts of 3,097 claims, is annotated using the corpus of 2.2M articles of the Czech News Agency. We present its extended annotation methodology based on the FEVER approach. We analyze both datasets for spurious cues, i.e., annotation patterns leading to model overfitting. CTKFacts is further examined for inter-annotator agreement, thoroughly cleaned, and a typology of common annotator errors is extracted. Finally, we provide baseline models for all stages of the fact-checking pipeline.

In the current highly-connected online society, the ever-growing information influx eases the spread of false or misleading news. The omnipresence of fake news motivated the formation of fact-checking organizations such as AFP Fact Check 1, International Fact-Checking Network 2, PolitiFact 3, Poynter 4, Snopes 5, and many others. At the same time, many tools for fake news detection and fact-checking are being developed: ClaimBuster [1], ClaimReview 6, or CrowdTangle 7; see [2] for more examples. Many of these are based on machine learning technologies aimed at image recognition, speech to text, or Natural Language Processing (NLP). This paper deals with the latter, focusing on automated fact-checking (hereinafter also referred to as fact verification).

Automated fact verification is a complex NLP task [3] in which the veracity of a textual claim gets evaluated with respect to a ground truth corpus. The output of a fact-checking system is a classification of the claim, conventionally one of supported, refuted, and not enough information available in the corpus. For the supported and refuted outcomes, it further supplies the evidence, i.e., a list of documents that explain the verdict. Fact-checking systems typically work in two stages [4]. In the first stage, based on the input claim, the Document Retrieval (DR) module selects the evidence. In the second stage, the Natural Language Inference (NLI) module matches the evidence with the claim and provides the final verdict. Table 1 shows an example of data used to train fact-checking systems of this type.

Current state-of-the-art methods applied to the domain of automated fact-checking are typically based on large-scale neural language models [5], which are notoriously data-hungry. While there is a reasonable number of quality datasets available for high-profile world languages [2], the situation for low-resource languages is significantly less favorable. Also, most available large-scale datasets are built on top of Wikipedia [4, 6-8].
While encyclopedic corpora are convenient for dataset annotation, these are hardly the only eligible sources of the ground truth. We argue that corpora of verified news articles used as claim verification datasets are a relevant alternative to encyclopedic corpora. Advantages are clear: the amount and detail of information covered by news reports are typically higher. Furthermore, the news articles typically inform on recent events attracting public attention, which also inspire new fake or misleading claims spreading throughout the online space. On the other hand, news articles address a more varied range of issues and have a more complex structure from the NLP perspective. While encyclopedic texts are typically concise and focused on facts, the style of news articles can vary wildly between different documents or even within a single article. For example, it is common that a report-style article is intertwined with quotations and informative summaries. Also, claim validity might be obscured by complex temporal or personal relationships: a past quotation like "Janet Reno will become a member of the Cabinet." may or may not support the claim "Janet Reno was the member of the Cabinet." This depends on, firstly, which date we verify the claim validity to, and secondly, who was or what was the competence of the quotation's author. Note that similar problems are less likely in encyclopedia-based datasets like FEVER [4] . The contributions of this paper are as follows: 1. CsFEVER: We propose an experimental Czech localization of the largescale FEVER [4] fact-checking dataset, utilizing the public MediaWiki interlingual document alignment of Wikipedia articles and a MT-based claim transduction. We publish our procedure to be used for other languages, and analyze its pitfalls. We denote the original English FEVER as EnFEVER in the following sections to distinguish various language mutations. 2. CTKFacts: we introduce a new Czech fact-checking dataset manually annotated on top of approximately two million Czech News Agency 8 news reports from 2000-2020. Inspired by FEVER, we provide an updated and extended annotation methodology aimed at annotations of news corpora, and we also make available an open-source annotation platform. The claim generation as well as claim labeling is centered around limited knowledge context (denoted dictionary in [4]), which is trivial to construct for hyperlinked textual corpora such as Wikipedia. We present a novel approach based on document retrieval and clustering. The method automatically generates dictionaries, which are composed of both relevant and semantically diverse documents, and does not depend on any inter-document linking. 3. We provide a detailed analysis of the CTKFacts dataset, including the inter-annotator agreement, and spurious cue analysis, where the latter detects annotation patterns possibly leading to overfitting of the NLP models. For comparison, we analyze the spurious cues of CsFEVER as well. We construct an annotation cleaning scheme that involves both manual and semi-automated procedures, and we use it to refine the final version of the CTKFacts dataset. We also provide classification and discussion of common annotation errors for future improvements of the annotation methodology. 4. We present baseline models for both DR and NLI stages as well as for the full fact-checking pipeline. 5. We publicly release the CTKFacts dataset as well as the experimental CsFEVER data, used source code and the baseline models 9 . 
This article is structured as follows: in Section 2, we give an overview of the related work. Section 3 describes our experimental method to localize the EnFEVER dataset using the MediaWiki alignment. We generate the Czech language CsFEVER dataset with it and analyze its validity. In Section 4, we introduce the novel CTKFacts dataset. We describe its annotation methodology, data cleaning, and postprocessing, as well as analysis of the inter-annotator agreement. Section 5 analyzes spurious cues for both CsFEVER and CTKFacts. In Section 6, we present the baseline models. Section 7 concludes with an overall discussion of the results and with remarks for future research. Wang in [11] presents another dataset of 12k+ claims, working with 5 classes (pants-fire, false, barely-true, half-true, mostly-true, and true). Each verdict includes a justification. However, evidence sources are missing. The models presented in the paper are claim-only, i.e., they deal with surfacelevel linguistic cues only. The author further experiments with speaker-related meta-data. Fact Extraction and VERification (FEVER) [4] is a large dataset of 185k+ claims covering the overall fact-checking pipeline. It was based on abstracts of 50k most visited pages of English Wikipedia. Authors present complex annotation methodology that involves two stages: the claim generation in which annotators firstly create a true initial claim supported by a random Wikipedia source article with context extended by the dictionary constructed from pages linked from the source article. The initial claim is further mutated by rephrasing, negating and other operations. The task of the second claim labeling stage is to provide the evidence as well as give the final verdict: SUPPORTS, REFUTES or NEI, where the latter stands for the not enough information label. Fact Extraction and VERification Over Unstructured and Structured information (FEVEROUS) [6] adds 87k+ claims including evidence based on Wikipedia table cells. The size of FEVER data facilitates modern deep learning NLP methods. The FEVER authors host annual workshops involving competitions, with results described in [5] and [12] . MultiFC [13] is a 34k+ claim dataset sourcing its claims from 26 fact checking sites. The evidence documents are retrieved via Google Search API as the ten highest-ranking results. This approach significantly deviates from the FEVER-like datasets as the ground-truth is not limited by a closed-world corpus, which limits the trustworthiness of the retrieved evidence. Also, similar data cannot be utilized to train the DR models. WikiFactCheck-English [8] is another recent Wikipedia-based large dataset of 124k+ claims and further 34k+ ones including claims refuted by the same evidence. The claims are accompanied by context. The evidence is based on Wikipedia articles as well as on the linked documents. Considering other than English fact-checking datasets, the situation is less favorable. Recently, Gupta et al. [14] released a multilingual (25 languages) dataset of 31k+ claims annotated by seven veracity classes. Similarly to the MultiFC, evidence is retrieved via Google Search API. The experiments with the multilingual Bert [15] model show that the gain from including the evidence is rather limited when compared to claim-only models. FakeCovid [16] is a multilingual (40 languages) dataset of 5k+ news articles. The dataset focuses strictly on the COVID-19 topic. Also, it does not supply evidence in a raw form -human fact-checker argumentation is provided instead. 
Kazemi et al. [17] released two multilingual (5 languages) datasets, these are, however, aimed at claim detection (5k+ examples) and claim matching (2k+ claim pairs). In the Czech locale, the most significant machine-learnable dataset is the Demagog dataset [18] based on the fact-checks of the Demagog 10 organisation. The dataset contains 9k+ claims in Czech (and 15k+ in Slovak and Polish) labeled with a veracity verdict and speaker-related metadata, such as name and political affiliation. The verdict justification is given in natural language, often providing links from social networks, government-operated webpages, etc. While the metadata is appropriate for statistical analyses, the justification does not come from a closed knowledge base that could be used in an automated scheme. The work most related to ours was presented by the authors of [19, 20], who published a Danish version of EnFEVER called DanFEVER. Unlike our CsFEVER dataset, DanFEVER was annotated by humans. Given the limited number of annotators, it includes significantly fewer claims than EnFEVER (6k+ as opposed to 185k+). In this section, we introduce a developmental CsFEVER dataset intended as a Czech localization of the large-scale English EnFEVER dataset. It consists of claims and veracity labels justified with pointers to data within the Czech Wikipedia dump. A straightforward approach to automatically build such a dataset from the EnFEVER data would be to employ machine translation (MT) methods for both claims and Wikipedia articles. While MT methods are recently reaching maturity [21, 22] , the problem lies in the high computational complexity of such translation. While using the state-of-the-art MT methods to translate the claims (2.2M words) is a feasible way of acquiring data, the translation of all Wikipedia articles is a much costlier task, as only their abstracts have a total of 513M words (measuring the June 2017 dump used in [4]). However, in NLP research, Wikipedia localizations are often considered a comparable corpus [23] [24] [25] [26] [27] , that is, a corpus of texts that share a domain and properties. Furthermore, partial alignment is often revealed between Wikipedia locales, either on the level of article titles [26] , or specific sentences [23] -much like in parallel corpora. We hypothesize there may be a sufficient document-level alignment between Czech and English Wikipedia abstracts that were used to annotate the EnFEVER dataset, as in both languages the abstracts are used to summarize basic facts about the same real-world entity. In order to validate this hypothesis, and to obtain experimental large-scale data for our task, we proceed to localize the EnFEVER dataset using such an alignment derived from the Wikipedia interlanguage linking available on MediaWiki 11 . In the following sections, we discuss the output quality and information loss, and we outline possible uses of the resulting dataset. Our approach to generating CsFEVER from the openly available EnFEVER dataset can be summarized by the following steps: 1. Fix a version of Wikipedia dump in the target language to be the verified corpus. 2. Map each Wikipedia article referred in the evidence sets to a corresponding localized article using MediaWiki API 12 . If no localization is available for an article, remove all evidence sets in which it occurs. 3. Remove all SUPPORTS and REFUTES data points having empty evidence. 4. Apply MT method of choice to all claims. 5. 
Re-split the dataset to train, dev, and test so that the dev and test veracity labels are balanced. Before we explore the data, let us discuss the caveats of the scheme itself. Firstly, the evidence sets are not guaranteed to be exhaustive -no human annotations in the target language were made to detect whether there are new ways of verifying the claims using the target version of Wikipedia (in fact, this does not hold for EnFEVER either, as its evidence-recall was estimated to be 72.36% [4]). Secondly, even if our document-alignment hypothesis is valid on the level of abstracts, sentence-level alignment is not guaranteed. Its absence invalidates the EnFEVER evidence format, where evidence is an array of Wikipedia sentence identifiers. The problem could, however, be addressed by altering the evidence granularity of the dataset, i.e., using whole documents to prove or refute the claim, rather than sentences. Recent research on long-input processing language models [28] [29] [30] is likely to make this simplification less significant. Following our scheme from section 3.1, we used the June 2020 Czech Wikipedia dump parsed into a database of plain text articles using the wikiextractor 13 package and only kept their abstracts. In order to translate the claims, we have empirically tested three available state-of-the-art English-Czech machine translation engines (data not shown here). Namely, these were: Google Cloud Translation API 14 , CUBBITT [22] and DeepL 15 . As of March 17 th 2021 we observed DeepL to give the best results. Most importantly, it turned out to be robust w.r.t. homographs and faithful to the conventional translation of named entities (such as movie titles, which are very common amongst the 50k most popular Wikipedia articles used in [4]). Finally, during the localization process, we have been able to locate Czech versions of 6,578 out of 12,633 Wikipedia articles needed to infer the veracity of all EnFEVER claims. Omitting the evidence sets that are not fully contained by the Czech Wikipedia and omitting SUPPORTS/REFUTES claims with empty evidence, we arrive to 127,328 claims that can hypothetically be fully (dis-)proven in at least one way using the Czech Wikipedia abstracts corpus, which is 69% of the total 185,445 EnFEVER claims. We release the resulting dataset publicly in the HuggingFace datasets repository 16 . In Table 2 we show the dataset class distribution. It is roughly proportional to that of EnFEVER. Similarly to [4], we have opted for label-balanced dev and test splits, in order to ease evaluation of biased predictors. In order to validate our hypothesis that the Czech Wikipedia abstracts are semantically close to their English counterparts, we have sampled 1% (1257) verifiable claim-evidence pairs from the CsFEVER dataset and annotated their validity. Overall, we have measured a 66% transduction precision with a confusion distribution visualised in figure 1 -28% of our CsFEVER sample pairs were invalid due to NOT ENOUGH INFO in the proposed Czech Wikipedia abstracts, 5% sample claims were invalidated by an inadequate translation. We, therefore, claim that the localization method, while yielding mostly valid datapoints, needs a further refinement, and the CsFEVER as-is is noisy and mostly appropriate for experimental benchmarking of model recall in the document-level retrieval task. With caution, it may also be used for NLI experiments. 
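To make the document-alignment step (step 2 of the scheme above) concrete, the following sketch queries the public MediaWiki interlanguage links for the Czech counterpart of an English article. It is a minimal illustration rather than the released tooling; batching, error handling, and rate limiting are omitted, and the response layout should be verified against the current MediaWiki API documentation.

```python
# Minimal sketch of the Wikipedia document alignment via MediaWiki
# interlanguage links (step 2 of the localization scheme).
from typing import Optional
import requests

API_URL = "https://en.wikipedia.org/w/api.php"

def czech_title(english_title: str) -> Optional[str]:
    """Return the Czech article title aligned with `english_title`, if any."""
    params = {
        "action": "query",
        "format": "json",
        "formatversion": 2,
        "prop": "langlinks",
        "lllang": "cs",
        "titles": english_title,
    }
    page = requests.get(API_URL, params=params, timeout=10).json()["query"]["pages"][0]
    links = page.get("langlinks", [])
    # no Czech counterpart -> evidence sets using this article are dropped (step 2/3)
    return links[0]["title"] if links else None
```

For instance, czech_title("Prague") is expected to resolve to "Praha"; claims whose every evidence set contains at least one unmapped article are subsequently removed.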
As a workaround for the findings above for the NLI task, we also publish a dataset we call CsFEVER-NLI that was generated independently on the scheme from section 3.1 by directly translating FEVER-NLI pairs published in [31] . To avoid confusion, we omit CsFEVER-NLI from this paper and only experiment with the CsFEVER as generated in this chapter. We conclude that while the large scale of the obtained data may find its use, a collection of novel Czech-native dataset is desirable for finer tasks, and we proceed to annotate a CTKFacts dataset for our specific application case. In this section, we address collection and analysis of the CTKFacts dataset -our novel dataset for fact verification in Czech. The overall approach to the annotation is based on FEVER [4]. Unlike other FEVER-inspired datasets [6, 7, 20] which deal with corpora of encyclopedic language style, CTKFacts uses a ground truth corpus extracted from an archive of press agency articles. As the CTK archive is proprietary and kept as a trade secret, the full domain of all possible evidence may not be disclosed. Nevertheless, we provide public access to the derived NLI version of the CTKFacts dataset we call CTKFactsNLI. CTKFactsNLI is described in Section 4.8. For the ground truth corpus, we have obtained a proprietary archive of the Czech News Agency 17 , also referred to as CTK, which is a public service press agency providing news reports and data in Czech to subscribed news organizations. Due to the character of the service -that is, providing raw reports that are yet to be interpreted by the commercial media -we argue such corpus suffers from significantly less noise in form of sensational headlines, political bias, etc. This, however, has yet to be checked. The full extent of data provided to our research is 3.3M news reports published between 1 January 2000 and 6 March 2019. We reduce this number by neglecting redundancies and articles formed around tables (e.g., sport results or stock prices). Ultimately, we arrive to a corpus of 2M articles with a total of 11M paragraphs. Hereinafter, we refer to it as to the CTK corpus, and it is to be used as the verified text database for our annotation experiments. The FEVER shared task proposed a two-level retrieval model: first, a set of documents (i.e., Wikipedia abstracts) is retrieved. These are then fed to the sentence retrieval system which provides the evidence on the sentence level. This two-stage approach, however, does not match properties of the news corpora -in most cases, the news sentences are significantly less self-contained than those of encyclopedic abstract, which disqualifies the sentence-level granularity. On the other hand, the news articles tend to be too long for many of the state-of-the-art document retrieval methods. FEVER addresses a similar issue by trimming the articles to their short abstracts only. Such a trimming can not be easily applied to our data, as the news reports come without abstracts or summaries and scatter the information across all their length. In order to achieve a reasonable document length, as well as to make use of all the information available in our corpus, we opt to work with our full data on the paragraph level of granularity, using a single-stage retrieval. From this point onwards, we refer to the CTK paragraphs also as to the documents. We store meta-data for each paragraph, identifying the article it comes from, its order 18 and a timestamp of publication. 
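For reference, the paragraph-level documents and their meta-data can be captured by a record like the following; the field names are illustrative, not the actual schema of the corpus.

```python
# Illustrative record for one CTK paragraph-level "document"; field names are
# ours, not the actual schema of the proprietary corpus.
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class ParagraphDoc:
    doc_id: str          # unique paragraph identifier used in evidence sets
    article_id: str      # source news report the paragraph comes from
    order: int           # position of the paragraph within the article
    published: datetime  # timestamp, used to restrict retrieval to older texts
    text: str
```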
In FEVER, every claim is based on a random sentence of a Wikipedia article abstract sampled from the fifty thousand most popular articles [4]. With the news report archive in its place, this approach does not work well, as most paragraphs do not contain any check-worthy information. In our case, we were forced to include an extra manual preselection task (denoted T_0, see Section 4.5) to deal with this problem.

In the EnFEVER Claim Extraction as well as in the annotation of DanFEVER [19], the annotator is provided with a source Wikipedia abstract and a dictionary composed of the abstracts of pages hyperlinked from the source. The aim of such a dictionary is to 1) introduce more information on entities covered by the source, and 2) extend the context in which the new claim is extracted in order to establish more complex relations to other entities. With the exception of the claim mutation task (see below), annotators are instructed to disregard their own world knowledge. The dictionary is essential to ensure that the annotators limit themselves to the facts (dis-)provable using the corpus while still having access to higher-level, more interesting entity relations.

As the CTK corpus (and news corpora in general) does not follow any rules for internal linking, it becomes a significant challenge to gather reasonable dictionaries. The aim is to select a relatively limited set of documents to avoid overwhelming the annotators. These documents should be highly relevant to the given knowledge query 19 while at the same time covering topics as diverse as possible to allow complex relations between entities. Our approach to generating dictionaries combines an NER-augmented keyword-based document retrieval method with a semantic search followed by clustering to promote diversity.

The keyword-based search uses the TF-IDF DrQA [32] document retrieval method, which is the designated baseline for EnFEVER [4]. Our approach makes multiple calls to DrQA, successively representing the query q by all possible pairs of named entities extracted from q. As an example, consider the query q = "Both Obama and Biden visited Germany.": N = {"Obama", "Biden", "Germany"} is the extracted set of top-level named entities. DrQA is then called (|N| choose 2) = 3 times for the keyword queries q_1 = "Obama, Biden", q_2 = "Obama, Germany", and q_3 = "Biden, Germany". Czech Named Entity Recognition is handled by the model of [33]. In the end, we select at most n_KW (we use n_KW = 4) highest-scoring documents for the dictionary. This iterative approach aims to select documents describing mutual relations between pairs of named entities. It is also a way to promote diversity between the dictionary documents. Our initial experiments with a naïve method of simply retrieving documents based on the original query q (or simple queries constructed from all named entities in N) were unsuccessful, as journalists often rephrase, and the background knowledge can be found in multiple articles. Hence, the naïve approach often reduces to a search for these rephrased but redundant textual segments.

The second part of the dictionary is constructed by means of semantic document retrieval. We use the M-Bert [15] model finetuned on CsFEVER (see Section 6.1), which initially retrieves a rather large set of n_PRE = 1024 top-ranking documents for the query q. In the next step, we cluster the n_PRE documents based on their [CLS] embeddings using k-means. Each of the k (k = 2 in our case) clusters then represents a semantically diverse set of documents (paragraphs) P_i for i ∈ {1, . . . , k}. Finally, we cyclically iterate through the clusters, always extracting a single document p ∈ P_i closest to q by means of the cosine similarity, until the target number of n_SEM (we used n_SEM = 4) documents is reached. The final dictionary is then the union of the n_KW and n_SEM documents selected by the two methods.
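The semantic half of the dictionary construction can be sketched as follows; the retrieval of the n_PRE candidates and their embeddings is assumed to be done elsewhere, and the helper below only illustrates the k-means clustering and the cyclic nearest-to-query selection.

```python
# Minimal sketch of the semantic dictionary part: cluster pre-retrieved
# paragraph embeddings with k-means, then pick documents round-robin,
# always taking the one closest to the query. The embedding model and the
# initial retrieval of the candidates are assumed to exist elsewhere;
# values of k and n_sem follow the text above.
import numpy as np
from sklearn.cluster import KMeans

def semantic_dictionary(query_emb, cand_embs, cand_ids, k=2, n_sem=4):
    """Return ids of n_sem semantically diverse paragraphs."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(cand_embs)

    # cosine similarity of every candidate to the query
    sims = cand_embs @ query_emb / (
        np.linalg.norm(cand_embs, axis=1) * np.linalg.norm(query_emb) + 1e-9)

    # per-cluster candidate lists, sorted by decreasing similarity to the query
    clusters = [
        sorted(np.where(labels == c)[0], key=lambda i: -sims[i]) for c in range(k)
    ]

    selected, turn = [], 0
    while len(selected) < n_sem and any(clusters):
        cluster = clusters[turn % k]
        if cluster:                       # take the nearest unused document
            selected.append(cand_ids[cluster.pop(0)])
        turn += 1
    return selected
```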
During all steps of dictionary construction, we make sure that all the retrieved documents have an older timestamp than the source. Simply put, we assign to each query a date of its formulation, and only verify it using the news reports published up to that date. The combination of the keyword and semantic search, as well as the meta-parameters involved, is a result of empirical experiments. They are intended to provide a minimum necessary context on the key actors of the claim and its nearest semantic neighbourhood. In the following text, we denote a dictionary computed for a query q as d(q). In the annotation tasks, it is often desirable to combine dictionaries of two different queries (the claim and its source document) or to include the source paragraph itself. For clarity, we use the term knowledge scope to refer to such an entire body of information.

The overall workflow is depicted in Figure 2 and described in the following list:

1. Source Document Preselection (T_0) is the preliminary annotation step as described in Section 4.3, managed by the authors of this paper.

2. Claim Extraction (T_1a):
• The system samples a random paragraph p from the set of paragraphs preselected in the T_0 stage.
• The system generates a dictionary d(p), querying for the paragraph p and its publication timestamp (see Section 4.4).
• The annotator A extracts an initial claim based on the knowledge scope K composed of p and d(p).
• A is further allowed to augment K by other paragraphs published in the same article as some paragraph already in K, in case the provided knowledge needs reinforcement.

3. Claim Mutation (T_1b): A is asked to produce mutations m_1, . . . , m_n of the initial claim using the following operations: rephrase, negate, substitute similar, substitute dissimilar, generalize, and specify. We use the term final claim interchangeably with the claim mutation in the following paragraphs. This is the only stage where A can employ their own world knowledge, although annotators are advised to preferably introduce knowledge that is likely to be covered in the corpus. To catch up with the additional knowledge introduced by A, the system precomputes dictionaries d(m_1), . . . , d(m_n).

4. Claim Labeling (T_2):
• The annotation environment randomly samples a final claim m and presents it to A with a knowledge scope K containing the original source paragraph p, its T_1a dictionary d(p), as well as the additional dictionary d(m) retrieved for m in T_1b. The order of K is randomized (except for p, which is always first) not to bias the time-constrained A.
• A is further allowed to augment K by other paragraphs published in the same article as some paragraph already in K, in case the provided knowledge needs reinforcement.
• A is asked to spend ≤ 3 minutes looking for minimum evidence sets E_1^m, . . . , E_n^m sufficient to infer the veracity label, which is expected to be the same for each set.
• If none are found, A may also label m as NEI.

Note that FEVER defines two subtasks only: Claim Generation and Claim Labeling. The Claim Generation corresponds to our T_1, while the Claim Labeling is covered by T_2. Due to notable differences in experiment design, we have built our own annotation platform, rather than reusing that of [4]. The annotations were collected using a custom-built web interface.
Our implementation of the interface and backend for the annotation workflow described in section 4.5 is distributed under the MIT license and may be inspected online 20 . We provide further information on our annotation platform in appendix A. Apart from T 0 , the annotation tasks were assigned to groups of bachelor and master students of Journalism from the Faculty of Social Sciences at the Charles University in Prague. We have engaged a total of 163 participants who have signed themselves for courses in AI Journalism and AI Ethics during the academic year 2020/2021. We used the resulting data, trained models and the annotation experiment itself to introduce various NLP mechanisms, as well as to obtain valuable feedback on the task feasibility and pitfalls. The annotations were made in several waves -instances of the annotation experiment performed with different groups of students. This design allowed us to adjust the tasks, fullfilment quotas and the interface after each wave, iteratively removing the design flaws. In the annotation labeling task, we advised the annotators to spend no more than 2-3 minutes finding as many evidence sets as possible within Wikipedia, so that the dataset can later be considered almost exhaustive [4]. With our CTK corpus, the exhaustivity property is unrealistic, as the news corpora commonly contain many copies of single ground truth. For example, claim "Miloš Zeman is the Czech president" can be supported using any "'. . . ', said the Czech president Miloš Zeman." clause occurring in corpus. Therefore, we propose a different scheme: annotator is advised to spend 2-3 minutes finding as many distinct evidence sets as possible within the time needed for good reading comprehension. Furthermore, we have collected an average of 2 cross-annotations for each claim. This allowed us to merge the evidence sets across different T 2 annotations of the same claim, as well as it resulted in a high coverage of our cross-validation experiments in section 4.6.1. After completing the annotation runs, we have extracted a total of 3,116 multiannotated claims. 47% were SUPPORTed by the majority of their annotations, REFUTES and NEI labels were approximately even, the full distribution of labels is listed in Table 3 . Of all the annotated claims, 1,776, that is 57%, had at least two independent labels assigned by different annotators. This sample was given by the intrinsic randomness of T 2 claim sampling. In this section, we use it to asses the quality of our data and ambiguity of the task, as well as to propose annotation cleaning methods used to arrive to our final cleaned CTKFacts dataset. Due to our cross-annotation design (Section 4.5.3), we had generously sized sample of independently annotated labels in our hands. As the total number of annotators was greater than 2, and as we allowed missing observations, we have used the Krippendorff's alpha measure [34] which is the standard for this case [35] . We have calculated the resulting Krippendorff's alpha agreement to be 56.42%. We interpret this as an adequate result that testifies to the complexity of the task of news-based fact verification within a fixed knowledge scope. It also encourages a round of annotation cleaning experiments that would exploit the number of cross-annotated claims to remove common types of noise. We have dedicated a significant amount of time to manually traverse every conflicting pair of annotations to see if one or both violate the annotation guidelines. 
The idea was that this should be a common case for such annotations, as the CTK corpus does not commonly contain a conflicting pair of paragraphs except for the case of temporal reasoning explained in section 4.7. After separating out 14% (835) erroneously formed annotations, we have been able to resolve every conflict, ultimately achieving a full agreement between the annotations. We discuss the main noise patterns in section 4.7. Upon evaluating our NLI models (Section 6.2), we have observed that model misclassifications frequently occur at T 2 annotations that are counterintuitive for human, but easier to predict for a neural model. Therefore, we have performed a series of experiments in model-assisted human-in-the-loop data cleaning similar to [36] in order to catch and manually purge outliers, involving an expertly trained annotator working without a time constraint: 1. A fold of dataset is produced using the current up-to-date annotation database, sampling a stratified test split from all untraversed claims -the rest of data is then divided into dev and train stratified splits, so that the overall train-dev-test ratio is roughly 8:1:1. 2. Mark the test claims as traversed. 3. A round of NLI models (section 6.2) is trained to obtain the strongest veracity classifier for the current fold. The individual models are optimized w.r.t. the dev split, while the strongest one is finally selected using test. 4. test-misclassifications of this model are then presented to an expert annotator along with the model suggestion and an option to remove an annotation violating the rules and to propose a new one in its place. 5. New annotations propagate into the working database and while there are untraversed claims, we proceed to step 1. Despite allowing several inconsistencies with the scheme above during the first two folds (that were largely experimental), this led to a discovery of another 846 annotations conflicting the expert annotator's labeling and a proposal of 463 corrective annotations (step 4.). In this section, we give an overview of common misannotation archetypes as encountered in the cleaning stage (sections 4.6.2 and 4.6.3). These should be considered when designing annotation guidelines for similar tasks in the future. The following list is sorted by decreasing appearance in our data. 1. Exclusion misassumption is by far the most prevalent type of misannotation. The annotator wrongly assumes that an event connected to one entity implies that it cannot be connected to the other entity. E.g., evidence "Prague opened a new cinema." leads to "Prague opened a new museum." claim to be refuted. In reality, there is neither textual entailment between the claims, nor their negations. We attribute this error to confusing the T 2 with a reading comprehension 21 task common for the field of humanities. 2. General misannotation: we were unable to find exact explanations for large part of the mislabelled claims. We traced the cause of this noise to both unclearly formulated claims and UI-based user errors. 3. Reasoning errors cover failures in assessment of the claim logic, e.g., confusing "less" for "more", etc. Also, this often involved errors in temporal reasoning, where an annotator submits a dated evidence that contradicts the latest news w.r.t. the timestamp. 4. Extending minimal evidence: larger than minimal set of evidence paragraphs was selected. This type of error typically does not lead to misannotation, nevertheless, it was common in the sample of dataset we were analyzing. 5. 
Insufficient evidence: the given evidence misses vital details on entities. As an example, the evidence "A new opera house has been opened in Copenhagen." does not automatically support the claim "Denmark has a new opera house." if another piece of evidence connecting Copenhagen and Denmark is not available. This type of error indicates that the annotator of the claim extended the allowed knowledge scope with his/her own world knowledge.

Finally, we publish the resulting cleaned CTKFacts dataset consisting of 3,097 manually labeled claims, with the label distribution as displayed in Table 3. We opt for stratified splits due to the relatively small size of our data and make sure that no CTK source paragraph was used to generate claims in two different splits, so as to avoid any data leakage.

The full CTK corpus cannot, unfortunately, be released publicly. Nevertheless, we extract all of our 3,911 labeled claim-evidence pairs to form the CTKFactsNLI dataset. Claim-wise, it follows the same splitting as our DR dataset, and the NEI evidence is augmented by the paragraph that was used to derive the claim to enable inference experiments. We have acquired the authorization from CTK to publish all evidence plaintexts, which we include in CTKFactsNLI and open for public usage. The dataset is released publicly on the HuggingFace dataset hub 22 and provides its standard usage API to encourage further experiments.

In the claim generation phase T_1b, annotators are asked to create mutations of the initial claim. These mutations may have a different truth label than the initial claim or even be non-verifiable with the given knowledge database. During trials in [4], the authors found that a majority of annotators had difficulty creating non-trivial negation mutations beyond adding "not" to the original. Similar spurious cues may lead to models cheating instead of performing proper semantic analysis. In [19], the authors investigated the impact of the trivial negations on the quality of the EnFEVER and DanFEVER datasets. Here we present a similar analysis based on the cue productivity and coverage measures derived from the work of [37]. In our case, the cues extracted from the claims have the form of unigrams and bigrams.

The definition of productivity assumes a dataset balanced with respect to the labels. The productivity of a cue k is calculated as

π_k = max_{c ∈ C} |A_{cue=k} ∩ A_{class=c}| / |A_{cue=k}|,

where C denotes the set of possible labels C = {SUPPORTS, REFUTES, NEI}, A is the set of all claims, A_{cue=k} is the set of claims containing the cue k, and A_{class=c} is the set of claims annotated with label c. Based on this definition, the range of productivity is limited to π_k ∈ [1/|C|, 1] for a balanced dataset. The coverage of a cue is defined as the ratio ξ_k = |A_{cue=k}| / |A|. We take the same approach as [19] to deal with the dataset imbalance: the resulting metrics are obtained by averaging over ten versions of the data based on random subsampling. We compute the metrics for both CsFEVER and CTKFacts datasets. Similarly to [19], we also provide the harmonic mean of productivity and coverage, which reflects the overall effect of the cue on the dataset.
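A minimal sketch of the two measures for unigram cues is given below; it assumes the list of (claim, label) pairs has already been label-balanced by subsampling, as described above.

```python
# Minimal sketch of the cue productivity and coverage measures for unigram
# cues, following the definitions above; `claims` is assumed to be an already
# label-balanced list of (text, label) pairs.
from collections import defaultdict

def cue_statistics(claims):
    """Return {cue: (productivity, coverage, harmonic_mean)}."""
    by_cue = defaultdict(lambda: defaultdict(int))   # cue -> label -> count
    for text, label in claims:
        for cue in set(text.lower().split()):        # unigram cues
            by_cue[cue][label] += 1

    total = len(claims)
    stats = {}
    for cue, label_counts in by_cue.items():
        applicability = sum(label_counts.values())           # |A_cue=k|
        productivity = max(label_counts.values()) / applicability
        coverage = applicability / total                      # |A_cue=k| / |A|
        harmonic = 2 * productivity * coverage / (productivity + coverage)
        stats[cue] = (productivity, coverage, harmonic)
    return stats

# usage: top cues by the harmonic mean
# top = sorted(cue_statistics(claims).items(), key=lambda kv: -kv[1][2])[:20]
```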
The results in Table 4 show that the cue bias detected in EnFEVER claims [19] propagates to the translated CsFEVER, where the words "není" ("is not") and "pouze" ("only") showed high productivities of 0.57 and 0.55 and ended up among the first 20 cues sorted by the harmonic mean. However, their impact on the quality of the entire dataset is limited, as their coverage is not high, which is illustrated by their absence in the top-5 most influential cues. Similar results for CTKFacts are presented in Table 5.

In this section, we establish baseline models for both CsFEVER and CTKFacts. We consider the document retrieval (DR) and natural language inference (NLI) stages, as well as the full fact verification pipeline. We provide four baseline models for the document retrieval stage: DrQA and Anserini represent classical keyword-search approaches, while the multilingual BERT (M-Bert) and ColBert models are based on Transformer neural architectures.

In line with FEVER [4], we employ the document retrieval part of the DrQA [32] model. The model was originally used for answering questions based on the Wikipedia corpus, which is a task relatively close to fact-checking. The DR part itself is based on TF-IDF weighting of BoW vectors, optimized using hashing. We calculated the TF-IDF index using the DrQA implementation for all unigrams and bigrams with 2^24 buckets.

Inspired by the criticism of choosing weak baselines presented in [38], we decided to validate our TF-IDF baseline against the Anserini toolkit as implemented by Pyserini [39]. We computed the index and then finetuned the k_1 and b hyper-parameters using a grid search over k_1 ∈ [0.6, 1.2] and b ∈ [0.5, 0.9], both with step 0.1. On a sample of 10,000 training claims, we selected the best-performing parameter values: for CsFEVER these were k_1 = 0.9 and b = 0.9, while for CTKFacts we proceed with k_1 = 0.6 and b = 0.5.

Another model we tested is M-Bert [15], a representative of Transformer architecture models. We used the same setup as in [40] with an added linear layer consolidating the output into an embedding of the required dimension of 512. In the fine-tuning phase, we used the claims and their evidence as relevant (positive) passages. For multi-hop claims, based on combinations of documents, we split the combined evidence so that the queries always relate to a single evidence document only. We used this fine-tuned model to generate 512-dimensional embeddings of the whole document collection. In the retrieval phase, we used the FAISS library [41] and constructed a PCA384 Flat index for CTKFacts and a Flat index for CsFEVER.

The last tested model was the recent ColBert, which provides the benefits of both the cross-attention and two-tower paradigms [42]. We employed the implementation provided by the authors 23, changing the backbone model to M-Bert and adjusting for the special tokens. The model was trained using triplets (query, positive paragraph, negative paragraph) with the objective to correctly classify paragraphs using a cross-entropy loss function. We constructed the training triplets so that the claim created by a human annotator was taken as a query, a paragraph containing evidence as a positive, and a random paragraph from a randomly selected document as a negative sample. As already stated, the number of claims for CTKFacts is significantly lower than for CsFEVER. Therefore, we increased the number of CTK training triplets: instead of selecting negative paragraphs from a random document, we selected them from an evidence document with the condition that the paragraph must not be used directly in the evidence.
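The basic triplet sampling (human-written claim as query, an evidence paragraph as positive, a random paragraph as negative) can be sketched as follows; the data structures are illustrative, not the authors' actual code.

```python
# Minimal sketch of the (query, positive, negative) triplet construction for
# ColBert training; `corpus` maps paragraph id -> text, `claims` is a list of
# records holding a claim text and the ids of its evidence paragraphs.
import random

def build_triplets(claims, corpus, rng=random.Random(0)):
    paragraph_ids = list(corpus)
    triplets = []
    for record in claims:
        for pos_id in record["evidence_ids"]:
            # negative: a random paragraph, re-drawn if it hits the evidence
            neg_id = rng.choice(paragraph_ids)
            while neg_id in record["evidence_ids"]:
                neg_id = rng.choice(paragraph_ids)
            triplets.append((record["claim"], corpus[pos_id], corpus[neg_id]))
    return triplets
```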
The number of training triplets was still low, so we also generated synthetic triplets as follows. We generated a synthetic query by extracting a random sentence from a random paragraph. The set of the remaining sentences of this paragraph was designated as the positive paragraph. The negative paragraph was, once again, selected as a random paragraph of a random document. Additionally, the title was used as a query instead of a random sentence, and a random paragraph from the article was used as a positive; the negative paragraph was selected in the same way as above. As a result, we generated about 950,000 triplets (≈944,000 synthetic and ≈6,000 using human-created claims) for CTKFacts. We tried two setups, with 32- and 128-dimensional term representations (denoted ColBert32 and ColBert128), trimming documents to a maximum of 180 tokens on CsFEVER.

The results are shown in Table 6. The methods are compared by means of the Mean Reciprocal Rank (MRR) given k ∈ {1, 5, 10, 20} retrieved documents. For CsFEVER, the neural network models achieve significantly better results, with ColBert taking the lead. In the case of CTKFacts, both Anserini and ColBert are the best performers. Interestingly, M-Bert fails in this task. We found that this is mainly caused by M-Bert's preference for shorter documents (including headings).

The aim of the final stage of the fact-checking pipeline, the NLI task, is to classify the veracity of a claim based on the previously retrieved evidence. We have examined several different Transformer models pretrained on Czech data in order to provide a strong baseline for this task. From the multilingual models, we have experimented with the SlavicBERT and Sentence M-Bert [43] models in their cased defaults, provided by the DeepPavlov library [44], as well as with the original M-Bert from [15, 45]. We have further examined two pretrained XLM-RoBERTa-large models, one fine-tuned on the NLI-related SQuAD2 [46] down-stream task, the other on the cross-lingual XNLI [47] task. These were provided by deepset 24 and HuggingFace 25. Finally, we have performed a round of experiments with a pair of recently published Czech monolingual models: RobeCzech [48] was pretrained on a set of curated Czech corpora using the RoBERTa-base architecture, and FERNET-C5 [49] was pretrained on a large crawled dataset using the Bert-base architecture. The results are presented in Table 7 and show the dominance of the fine-tuned XLM-RoBERTa models. For reference, the [4] baseline scored 80.82% accuracy in a similar setting (NearestP, i.e., the nearest page for a NEI context) on EnFEVER, using the Decomposable Attention model.
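At inference time, such a pairwise NLI classifier can be applied with the HuggingFace transformers library as sketched below; the checkpoint name is a generic stand-in for the fine-tuned models of Table 7, and the label ordering is an assumption of this example.

```python
# Sketch of claim-evidence NLI scoring with a Transformer sequence classifier;
# the checkpoint is a generic stand-in (it still has to be fine-tuned on the
# three-class task), and the label order below is an assumption of the sketch.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

LABELS = ["SUPPORTS", "REFUTES", "NOT ENOUGH INFO"]   # assumed ordering

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-large", num_labels=len(LABELS))

def classify(claim: str, evidence: str) -> str:
    inputs = tokenizer(claim, evidence, truncation=True,
                       max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return LABELS[int(logits.argmax(dim=-1))]
```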
Similarly to [4], we give baseline results for the full fact verification pipeline. The pipeline is evaluated as follows: given a mutated claim m from the test set, k evidence paragraphs (documents) P = {p_1, . . . , p_k} are selected using the document retrieval models described in Section 6.1. Note that the documents in P are ordered by decreasing relevancy. The paragraphs are subsequently fed to an NLI model of choice (see details below), and the accuracy (for CsFEVER) or the F1 macro score (for the unbalanced CTKFacts) is evaluated. In the case of supported and refuted claims, we analyze two settings: 1) for Score Evidence (SE), P must fully cover at least one gold evidence set, 2) for No Score Evidence (NSE), no such condition applies. No such condition applies to NEI claims either 26.

While our paragraph-oriented pipeline eliminates the need for sentence selection, we have to deal with the maximum input size of the NLI models (512 tokens in all cases), which gets easily exceeded for larger k. Our approach is to iteratively partition P into consecutive splits S = {s_1, . . . , s_l}. Each split s_i is itself a concatenation of successive documents s_i = {p_s, . . . , p_e}, where 1 ≤ s ≤ e ≤ k. A new split is created for any new paragraph that would cause input overflow. If any single tokenized evidence document is longer than the maximum input length, it gets represented by a single split and truncated 27. Moreover, each split is limited to at most k_s successive evidence documents (k_s = 2 for CsFEVER, k_s = 3 for CTKFacts), so the overall average input length is more akin to the data used to train the NLI models. In the prediction phase, the documents p_s, . . . , p_e of each split are concatenated and, together with the claim m, fed to the NLI model, yielding one prediction per split. The per-split predictions are then aggregated using weights controlled by a decay parameter λ (we use λ = 1/2 in all cases), which assigns higher importance to the higher-ranked documents.

The results are presented in Table 8. We show Sentence M-Bert for CsFEVER and XLM-RoBERTa @ XNLI for CTKFacts (described in Section 6.2) only, as these pairings gave the best results. For CsFEVER, the M-Bert retrieval model gives the best results for k = 20: 55.45% accuracy for NSE and 35.30% for SE. In the case of CTKFacts, Anserini slightly outperforms ColBert with a 61.78% F1 macro score (for k = 10, NSE) and 26.91% (for k = 20, SE). ColBert gives overall balanced results on both datasets. Note that the SE-to-NSE difference is more pronounced for CTKFacts, which can be explained by the higher redundancy of CTKFacts paragraphs compared to CsFEVER. The results for CsFEVER are similar to the EnFEVER baselines [4].
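A minimal sketch of the split construction and the weighted aggregation is given below; the tokenizer length function and the per-split NLI scorer are assumed helpers, and the geometric weighting is an illustrative choice consistent with the stated decay parameter λ = 1/2 (the exact aggregation formula is not spelled out above).

```python
# Minimal sketch of the split-and-aggregate NLI input handling. `token_len`
# and `nli_predict` (returning class probabilities for a claim/evidence pair)
# are assumed helpers; the geometric weights are an illustrative choice.
import numpy as np

def build_splits(paragraphs, token_len, max_tokens=512, k_s=3):
    splits, current, used = [], [], 0
    for p in paragraphs:                      # ordered by decreasing relevancy
        length = token_len(p)
        if current and (used + length > max_tokens or len(current) == k_s):
            splits.append(current)
            current, used = [], 0
        current.append(p)                     # overlong single docs end up alone
        used += length                        # and are truncated by the tokenizer
    if current:
        splits.append(current)
    return splits

def aggregate_verdict(claim, paragraphs, nli_predict, token_len, lam=0.5):
    splits = build_splits(paragraphs, token_len)
    probs = np.stack([nli_predict(claim, " ".join(s)) for s in splits])
    weights = np.array([lam ** i for i in range(len(splits))])
    return int(np.argmax(weights @ probs))    # index of the winning class
```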
With this paper, we address the lack of a Czech dataset for automated fact-checking. We have explored two ways of acquiring such data. Firstly, we localize the EnFEVER dataset, using a document alignment between Czech and English Wikipedia abstracts extracted from the interlingual links. We obtain and publish the CsFEVER dataset of 127k machine-translated claims with evidence enclosed within the Czech Wikipedia dump. We then validate our alignment scheme and measure a 66% precision using hand annotations over a 1% sample of the obtained data. Therefore, we recommend the data for models less sensitive to noise and utilize it to train experimental DR models and for recall estimation. Secondly, we executed a series of human annotation runs with 163 students of journalism to acquire a novel dataset in Czech. As opposed to similar annotations that extracted claims and evidence from Wikipedia [4, 6, 20], we annotated our dataset on top of the CTK corpus extracted from a news agency archive to explore this different relevant language form. We collected a raw dataset of 3,116 labeled claims, 57% of which have at least two independent cross-annotations. From these, we calculate Krippendorff's alpha to be 56.42%. We proceed with manual and human-and-model-in-the-loop annotation cleaning to remove conflicting and malformed annotations, arriving at the thoroughly cleaned CTKFacts dataset of 3,097 claims and their veracity annotations complemented with evidence from the CTK corpus. We release its version for NLI called CTKFactsNLI to maintain the corpus trade secrecy. Finally, we use our datasets to train baseline models for the full fact-checking pipeline composed of the Document Retrieval and Natural Language Inference tasks. We conclude with several remarks for future research:

• The fact-checking pipeline is to be augmented by check-worthiness estimation [50], that is, a model that classifies which sentences of a given text in Czech are appropriate for fact verification. We are currently working on models that detect claims within Czech Twitter, and a strong predictor for this task would also strengthen our annotation scheme from Section 4.5, which currently relies on hand-picked check-worthy documents.

• While the SUPPORTS, REFUTES and NEI classes offer a finer classification w.r.t. evidence than a binary true/false, it is a good convention of fact-checking services to use additional labels such as MISINTERPRETED, which could be integrated into the common automated fact verification scheme if well formalised.

• Claim extraction schemes like that from [4] or Section 4.5 do not necessarily produce organic claims capturing the real-world complexity of fact-checking. For example, the EnFEVER train set alone contains hundreds of claims of the form "X is a person.". This problem does not have a trivial solution, but we suggest integrating real-world claim sources, such as Twitter, into the annotation scheme.

• While the FEVER localization scheme from Section 3.1 yielded a rather noisy dataset, its size and document precision encourage the deployment of a model-based cleaning scheme like that from [51] to further refine its results.

Appendix A: Annotation platform

A.1 Claim Extraction

Figure A3 shows the Claim Extraction (T_1a) interface. The layout is inspired by the work of [4] and, by default, hides as much of the clutter away from the user as possible. Except for the article heading, timestamp, and source paragraph, all supporting information is collapsed and only rendered on user demand. An annotator reads the source article and, if it lacks a piece of information he/she wants to extract, looks for it in the expanded article or corpus entry. The user is encouraged to Skip any source paragraph that is hard to extract a claim from; the purpose of Source Document Preselection (T_0) was to reduce the skips as much as possible. Throughout the platform, we have ultimately decided not to display any stopwatch-like interface so as not to stress the user. We have measured that, excluding the outliers (≤ 10s, typically the Skipped annotations, and ≥ 600s, typically a browser tab left unattended), the average time spent on this task is 2 minutes 16 seconds and the median is 1 minute 16 seconds.

Mutation types follow those of the FEVER Annotation Platform and are distinguished by loud colors to avoid mismatches. Excluding the outliers, the overall average time spent generating a batch of mutations was 3m 35s (median 3m 15s), with an average of 3.98 mutations generated per claim.

In Figure A3 we show the most complex interface of our platform, the T_2: Claim Annotation form. Full instructions took about 5 minutes to read and comprehend. The input of multiple evidence sets works as follows: each column of checkboxes in A3 stands for a single evidence set; every paragraph from the union of knowledge belongs to this set iff its checkbox in the corresponding column is checked. The offered articles and paragraphs are collapsible, and the empty evidence set is omitted. On average, the labelling task took 65 seconds, with a median of 40s. An average SUPPORTS/REFUTES annotation was submitted along with 1.29 different evidence sets, 95% of which were composed of a single paragraph.

(Figure: annotation interface screenshot; the Dictionary panel lists collapsible source articles and instructs the user to click an article title to display the section selected as relevant to the source sentence.)
References

Zeng, X., Abumansour, A.S., Zubiaga, A.: Automated fact-checking: A survey. Language and Linguistics Compass (2021)
Claimbuster: The first-ever end-to-end fact-checking system
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics
The FEVER2.0 shared task
MultiFC: A real-world multi-domain dataset for evidence-based fact checking of claims
X-Fact: A new benchmark dataset for multilingual fact checking
BERT: Pre-training of deep bidirectional transformers for language understanding
FakeCovid - A multilingual cross-domain fact check news dataset for COVID-19
Claim matching beyond English to scale global fact-checking
Constructing a poor man's wordnet in a resource-rich world
A simple yet robust algorithm for automatic extraction of parallel sentences: A case study on Arabic-English Wikipedia articles
Longformer: The long-document transformer
Reformer: The efficient transformer
Nyströmformer: A Nyström-based algorithm for approximating self-attention
Combining fact extraction and verification with neural semantic matching networks
Reading Wikipedia to answer open-domain questions
Neural architectures for nested NER through linearization
Estimating the reliability, systematic error and random error of interval data
Answering the call for a standard reliability measure for coding data
Discovering informative patterns and data cleaning
Probing neural network comprehension of natural language arguments
Critically examining the "neural hype": Weak baselines and the additivity of effectiveness gains from neural ranking models
Pyserini: A Python toolkit for reproducible information retrieval research with sparse and dense representations
Pre-training tasks for embedding-based large-scale retrieval
Billion-scale similarity search with GPUs
ColBERT: Efficient and effective passage search via contextualized late interaction over BERT
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)
DeepPavlov: Open-source library for dialogue systems
How multilingual is multilingual BERT?
SQuAD: 100,000+ questions for machine comprehension of text
XNLI: Evaluating cross-lingual sentence representations
RobeCzech: Czech RoBERTa, a monolingual contextualized language representation model
Comparison of Czech transformers on text classification tasks
Automated fact-checking for assisting human fact-checkers

Acknowledgments. This article was produced with the support of the Technology Agency of the Czech Republic under the ÉTA Programme, project TL02000288. The access to the computational infrastructure of the OP VVV-funded project CZ.02.1.01/0.0/0.0/16 019/0000765 "Research Center for Informatics" is also gratefully acknowledged. We would like to thank all annotators as well as other members of our research group, namely: Barbora Dědková, Alexandr Gažo, Jan Petrov, and Michal Pitr.