UPV at CheckThat! 2021: Mitigating Cultural Differences for Identifying Multilingual Check-worthy Claims

Ipek Baris Schlicht, Angel Felipe Magnossao de Paula, Paolo Rosso

2021-09-19

Abstract

Identifying check-worthy claims is often the first step of automated fact-checking systems. Tackling this task in a multilingual setting has been understudied. Encoding inputs with multilingual text representations could be one approach to multilingual check-worthiness detection. However, this approach could suffer if cultural bias exists within the communities on determining what is check-worthy. In this paper, we propose a language identification task as an auxiliary task to mitigate unintended bias. For this purpose, we experiment with joint training using the datasets from CLEF-2021 CheckThat!, which contain tweets in English, Arabic, Bulgarian, Spanish and Turkish. Our results show that joint training of the language identification and check-worthy claim detection tasks can provide performance gains for some of the selected languages.

Introduction

The number of fact-checking initiatives worldwide has increased to fight misinformation. Manual fact-checking is a labor-intensive and time-consuming task that cannot keep pace with the dissemination of misinformation [1]. Therefore, automating the steps of fact-checking is required to speed up the process. Check-worthy claim detection is a crucial step of an automated fact-checking pipeline [2, 1, 3], prioritizing what needs to be fact-checked by fact-checkers or journalists.

There has been an ongoing effort to address the claim-detection task by different research communities. Prior studies rely on machine learning methods that use statistical features with bag-of-words representations [4, 5, 6]. Additionally, the CLEF CheckThat! Lab (CTL) has organized shared tasks to tackle this problem in political debates [7, 8] and social media [9]. This year, CTL 2021 [10] organized the shared task in English, Turkish, Bulgarian, Spanish and Arabic, with the task datasets collected from social media [11]. The task's input is a tweet and the output is a score indicating the check-worthiness of the tweet.

Multilingual language models have been widely used in natural language understanding tasks for low-resourced languages (e.g. comment moderation [12], fake news detection [13]). However, cultural differences inevitably surface in tasks in which cultural context is required [14]. This issue could harm the transfer of knowledge across languages. Fact-checking is one such task, where disagreements on credibility assessments could exist even among domain experts [15, 16]. Furthermore, exposure to global claims and their credibility could vary by country [17].

With this motivation, in this paper we present a unified framework that processes the input in different languages and uses a multilingual sentence transformer, trained on the mixed-language training set, to learn representations for the low-resourced languages. To mitigate the bias in the sentence representations, we introduce a language identification task and train the model jointly for the check-worthiness detection (CWD) and language identification (LI) tasks. Our contributions can be summarized as follows:

1. We introduce a framework designed to be aware of cultural bias, and we conduct an extensive analysis of its performance.
2. We employ joint learning to reduce unintended bias. To the best of our knowledge, a similar method has not been applied to reduce bias in multilingual fact-checking tasks.

3. Our framework can be extended with various multilingual transformer models from Huggingface [18]. The source code and the trained models are publicly available.¹

Related Work

ClaimBuster is the first study to address the check-worthy claim detection task. The component of ClaimBuster [4, 5] that detects check-worthy claims is trained with a Support Vector Machine (SVM) classifier using a tf-idf bag of words, named entity types, POS tags, sentiment, and sentence length as the feature set. [6] proposed a fully connected neural network model trained on claims and their related political debate content. Last year, CTL 2020 [9] organized a shared CWD task in English and Arabic for claims on social media. In this shared task, multilingual transformer models performed well on the Arabic dataset [19]. However, for the English dataset, the participants did not utilize multilingual transformer models. In our approach, we fine-tune a multilingual sentence transformer [20], which is computationally less expensive than BERT models, on the mixed-language training dataset. We train a single model and employ it for all languages.

Multi-task learning has been a proven method to mitigate unintended bias. Das et al. [21] applied multi-task learning to a face recognition task using a Convolutional Neural Network. As a related example in the Natural Language Processing (NLP) domain, Vaidya et al. [22] mitigated identity bias in toxic comment detection; their model encodes the inputs with a Bidirectional Long Short-Term Memory network (BiLSTM). However, our approach and the tasks we deal with differ from those studies.

Methodology

In this section, we introduce our framework, which is depicted in Figure 1. The input of the framework is a Twitter post. The input is tokenized with the sentence transformer's tokenizer in order to be fed into the transformer layers. After obtaining the shared text representation from the sentence transformer, the framework fine-tunes the shared representation and the classification layers for the CWD and LI tasks by minimizing a joint loss. In the following, we give more details about the sentence transformer and the joint training.

The framework uses a Sentence-BERT (SBERT) transformer [23], a modified BERT that uses siamese and triplet network structures. SBERT provides semantically more meaningful sentence embeddings than plain BERT models. To support multilingualism in our framework and to enable fine-tuning on a small GPU, we use a pre-trained SBERT that was obtained by applying knowledge distillation [20] and was trained on a multilingual corpus from a community-driven Q&A website.² We refer to it as QDMSBERT. We apply mean pooling on the output of QDMSBERT to obtain sentence embeddings. We set the maximum token length to 128, padding shorter texts and truncating longer ones.

The framework contains two task layers: one for the CWD task and the other for the LI task. The input to both task layers is the shared QDMSBERT embedding. Both task layers use the same neural network structure, consisting of two fully connected layers followed by a softmax layer that outputs the probabilities of the task classes.
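The following is a minimal PyTorch sketch of this architecture, not the authors' exact code. The checkpoint name, hidden layer size, and label sets are assumptions: the paper only states that QDMSBERT is a knowledge-distilled multilingual SBERT trained on community Q&A data, which we map to the sentence-transformers checkpoint quora-distilbert-multilingual.

```python
# Minimal sketch of the shared-encoder, two-head model (assumptions noted above).
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

ENCODER = "sentence-transformers/quora-distilbert-multilingual"  # assumed checkpoint

class JointCheckWorthinessModel(nn.Module):
    def __init__(self, encoder_name=ENCODER, hidden_dim=256, num_languages=5):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        emb_dim = self.encoder.config.hidden_size
        # Each task head: two fully connected layers; the softmax is applied
        # implicitly by the cross-entropy loss during training.
        self.cwd_head = nn.Sequential(
            nn.Linear(emb_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 2))              # check-worthy vs. not
        self.li_head = nn.Sequential(
            nn.Linear(emb_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_languages))  # en, ar, bg, es, tr

    @staticmethod
    def mean_pool(token_embeddings, attention_mask):
        # Average the token embeddings, ignoring padding positions.
        mask = attention_mask.unsqueeze(-1).float()
        return (token_embeddings * mask).sum(1) / mask.sum(1).clamp(min=1e-9)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        sent_emb = self.mean_pool(out.last_hidden_state, attention_mask)
        return self.cwd_head(sent_emb), self.li_head(sent_emb)

tokenizer = AutoTokenizer.from_pretrained(ENCODER)
batch = tokenizer(["Example tweet text"], max_length=128,
                  padding="max_length", truncation=True, return_tensors="pt")
```

Returning logits rather than probabilities keeps the heads compatible with PyTorch's cross-entropy loss, which applies the softmax internally.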
During training, the weighted losses of the CWD and LI tasks are summed to compute the joint loss, as seen in Equation 1, where α ∈ [0, 1] indicates the relative importance of the tasks:

L = α · L_CWD + (1 − α) · L_LI    (1)

Lastly, the joint loss is minimized by optimizing the weights of the transformer network and the tasks' classification layers.

Experimental Setup

In this section, we give the details of the CLEF 2021 CheckThat! dataset, explain the baselines and systems that we compare against, and present the experimental settings.

The CLEF 2021 CheckThat! lab offers datasets in English, Spanish, Arabic, Turkish, and Bulgarian for the CWD task. The statistics of the datasets are given in Table 1. The class distribution of the datasets for each language is highly imbalanced, which reflects real-world conditions: check-worthy samples (Pos-Class) are the minority. The English and Bulgarian datasets contain only the COVID-19 topic. The Turkish dataset covers miscellaneous topics, and the Spanish dataset has only samples about politics. The topics of the Arabic dataset are mainly COVID-19 related.

We compare the proposed jointly trained model, denoted QDMSBERT-MTL, against the following models and systems:

• SVM: It encodes the texts with unigrams.
• Monolingual models and Mk-Bg-BERT: We use a distilled SBERT [23] model³ for the English samples. We could not find monolingual SBERT models for Arabic, Turkish and Spanish; therefore, we use popular BERT [24] variants trained on monolingual corpora: TrBERT⁴ for the Turkish samples, BETO [25] for Spanish, and AraBERT [26] for the tweets in Arabic. For Bulgarian tweets, we leverage a BERT model (Mk-Bg-BERT) trained on Macedonian and Bulgarian corpora.⁵
• CLEF-2021: Submissions to the CLEF-2021 CWD task [11] that support all languages, namely Accenture, BigIR and TOBB ETU.⁶
• QDMSBERT: The same model with weights optimized only for the CWD task.

We split the training dataset randomly into five chunks and thus train five different QDMSBERT-MTL models, each for 3 epochs with the weighted-decay Adam optimizer (AdamW) [27] and a batch size of 16. The mean of the five models' predictions represents the final score. We use the GPU of Google Colab⁷ for training the models.

Results

Table 2 presents the results of each model on the test set (our submission is QDMSBERT-MTL). We report the test results in the official metrics of the shared task: Mean Average Precision (MAP), precision at ranks 1 to 50 (P@1-P@50), R-Precision (R-Prec), and Reciprocal Rank (R-Rank).

We first compare QDMSBERT-MTL with the SVM and QDMSBERT. QDMSBERT-MTL outperforms QDMSBERT on many metrics across the languages, except for Arabic; QDMSBERT-MTL also underperforms the SVM in Spanish. We see performance gains on the English, Bulgarian and Turkish samples. The results indicate that QDMSBERT-MTL performs better on the COVID-19 examples, but generalizes less well to the other topics in Spanish and Arabic. Among the teams who submitted runs in all languages (group CLEF-2021), the performance of QDMSBERT-MTL is the best in English and second best in Bulgarian, which is promising for a low-resourced language. The monolingual models outperformed our model and the other teams' submissions in English and Spanish; TrBERT and AraBERT also show better results than our approach. Although we improve over QDMSBERT by mitigating differences across the languages, the performance of the monolingual embeddings remains unsurpassed in this task.

The presented results of QDMSBERT-MTL were obtained with a task loss contribution (α) of 0.6. This initial value was chosen heuristically.
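To make Equation 1 and the role of α concrete, here is a minimal sketch of one joint training step, reusing the hypothetical JointCheckWorthinessModel from the earlier sketch; α = 0.6 and the AdamW optimizer follow the paper, while the learning rate is an assumption.

```python
# One joint training step: L = alpha * L_CWD + (1 - alpha) * L_LI.
import torch.nn.functional as F
from torch.optim import AdamW

model = JointCheckWorthinessModel()
optimizer = AdamW(model.parameters(), lr=2e-5)  # lr is an assumption
alpha = 0.6  # task-importance weight used for the submitted runs

def train_step(input_ids, attention_mask, cwd_labels, li_labels):
    cwd_logits, li_logits = model(input_ids, attention_mask)
    # Weighted sum of the two task losses (Equation 1).
    loss = (alpha * F.cross_entropy(cwd_logits, cwd_labels)
            + (1 - alpha) * F.cross_entropy(li_logits, li_labels))
    optimizer.zero_grad()
    loss.backward()   # updates flow into both heads and the shared encoder
    optimizer.step()
    return loss.item()
```

The paper trains five such models on different chunks of the training data and averages their check-worthiness scores; that outer ensembling loop is omitted here.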
As an ablation study, we vary the task loss weight α and train QDMSBERT-MTL for each value to understand its influence on CWD learning. The optimal α value is 0.8. On the Bulgarian samples, lower α values can also yield good performance for the CWD task.

Lastly, we analyze the feature representations of QDMSBERT and QDMSBERT-MTL. We visualize the feature representations by applying t-distributed stochastic neighbor embedding (t-SNE) [28], a nonlinear dimensionality reduction technique. As depicted in Figure 2, the features that QDMSBERT-MTL produces are more clearly separated by language. For instance, in the t-SNE plot for QDMSBERT, the cluster of English samples (lower right region of Figure 2a) overlaps with both the Turkish and the Bulgarian clusters. In contrast, the t-SNE plot for QDMSBERT-MTL shows that only very few non-English samples fall close to the English cluster (upper region of Figure 2b).
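A sketch of how such a visualization can be produced with scikit-learn, assuming the sentence embeddings and language labels come from the encoder sketched earlier; the plotting details are illustrative.

```python
# Illustrative t-SNE projection of sentence embeddings, colored by language
# (cf. Figure 2). `embeddings` is assumed to be an (N, hidden_size) array.
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_tsne(embeddings: np.ndarray, languages: list) -> None:
    points = TSNE(n_components=2, random_state=0).fit_transform(embeddings)
    for lang in sorted(set(languages)):
        idx = [i for i, l in enumerate(languages) if l == lang]
        plt.scatter(points[idx, 0], points[idx, 1], s=5, label=lang)
    plt.legend(title="language")
    plt.show()
```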
Conclusion

In this paper, we proposed a method to tackle multilingual check-worthiness detection. To mitigate bias due to cultural differences, we leveraged a multilingual sentence BERT for the feature representations and trained it jointly with a language identification task. Our approach outperformed the SVM and QDMSBERT for almost all of the languages on the CLEF-2021 dataset. It also became one of the top-performing approaches on the Bulgarian and English datasets among the submissions made for all of these languages. In the future, we will investigate how taking into account the images [29] embedded in the tweets influences the results.

Acknowledgments

The work of P. Rosso was partially funded by the Spanish Ministry of Science and Innovation under the research project MISMIS-FAKEnHATE on MISinformation and MIScommunication in social media: FAKE news and HATE speech (PGC2018-096212-B-C31).

References

[1] Understanding the promise and limits of automated fact-checking.
[2] A content management perspective on fact-checking.
[3] Automated fact checking: Task formulations, methods and future directions.
[4] Detecting check-worthy factual claims in presidential debates.
[5] ClaimBuster: The first-ever end-to-end fact-checking system.
[6] A context-aware approach for detecting worth-checking claims in political debates.
[7] Overview of the CLEF-2018 CheckThat! lab on automatic identification and verification of political claims. Task 1: Check-worthiness.
[8] Overview of the CLEF-2019 CheckThat! lab: Automatic identification and verification of claims. Task 1: Check-worthiness.
[9] Overview of CheckThat! 2020: Automatic identification and verification of claims in social media.
[10] Overview of the CLEF-2021 CheckThat! lab on detecting check-worthy claims, previously fact-checked claims, and fake news.
[11] Overview of the CLEF-2021 CheckThat! lab task 1 on check-worthiness estimation in tweets and political debates.
[12] To block or not to block: Experiments with machine learning for news comment moderation.
[13] BanFakeNews: A dataset for detecting fake news in Bangla.
[14] Mining cross-cultural differences and similarities in social media.
[15] News source credibility in the eyes of different assessors.
[16] Annotating credibility: Identifying and mitigating bias in credibility datasets.
[17] Misinformation, believability, and vaccine acceptance over 40 countries: Takeaways from the initial phase of the COVID-19 infodemic.
[18] Transformers: State-of-the-art natural language processing.
[19] bigIR at CheckThat! 2020: Multilingual BERT for ranking Arabic tweets by check-worthiness.
[20] Making monolingual sentence embeddings multilingual using knowledge distillation.
[21] Mitigating bias in gender, age and ethnicity classification: A multi-task convolution neural network approach.
[22] Empirical analysis of multi-task learning for reducing identity bias in toxic comment detection.
[23] Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing.
[24] BERT: Pre-training of deep bidirectional transformers for language understanding.
[25] Spanish pre-trained BERT model and evaluation data.
[26] AraBERT: Transformer-based model for Arabic language understanding.
[27] Decoupled weight decay regularization.
[28] Visualizing data using t-SNE.
[29] On the role of images for analyzing claims in social media.