authors: Plaza-del-Arco, Flor Miriam; Halat, Sercan; Padó, Sebastian; Klinger, Roman
title: Multi-Task Learning with Sentiment, Emotion, and Target Detection to Recognize Hate Speech and Offensive Language
date: 2021-09-21

The recognition of hate speech and offensive language (HOF) is commonly formulated as a classification task to decide if a text contains HOF. We investigate whether HOF detection can profit by taking into account the relationships between HOF and similar concepts: (a) HOF is related to sentiment analysis because hate speech is typically a negative statement and expresses a negative opinion; (b) it is related to emotion analysis, as expressed hate points to the author experiencing (or pretending to experience) anger while the addressees experience (or are intended to experience) fear. (c) Finally, one constituting element of HOF is the mention of a targeted person or group. On this basis, we hypothesize that HOF detection shows improvements when being modeled jointly with these concepts, in a multi-task learning setup. We base our experiments on existing data sets for each of these concepts (sentiment, emotion, target of HOF) and evaluate our models as a participant (as team IMS-SINAI) in the HASOC FIRE 2021 English Subtask 1A. Based on model-selection experiments in which we consider multiple available resources and submissions to the shared task, we find that the combination of the CrowdFlower emotion corpus, the SemEval 2016 Sentiment Corpus, and the OffensEval 2019 target detection data leads to an F1 = .79 in a multi-head multi-task learning model based on BERT, in comparison to .7895 of plain BERT. On the HASOC 2019 test data, this result is more substantial with an increase by 2pp in F1 and a considerable increase in recall.
Across both data sets (2019, 2021), the recall is particularly increased for the class of HOF (6pp for the 2019 data and 3pp for the 2021 data), showing that MTL with emotion, sentiment, and target identification is an appropriate approach for early warning systems that might be deployed in social media platforms. The widespread adoption of social media platforms has made it possible for users to express their opinions easily in a manner that is visible to a huge audience. These platforms provide a large step forward for freedom of expression. At the same time, social media posts can also contain harmful content like hate speech and offensive language (HOF), often eased by the quasi-anonymity on social media platforms [1] . The European Commission's recommendation against racism and intolerance defines HOF as "the advocacy, promotion or incitement of the denigration, hatred or vilification of a person or group of persons, as well any harassment, insult, negative stereotyping, stigmatization or threat of such person or persons and any justification of all these forms of expression -that is based on a non-exhaustive list of personal characteristics or status that includes 'race', color, language, religion or belief, nationality or national or ethnic origin, as well as descent, age, disability, sex, gender, gender identity, and sexual orientation" [2] . With the number of social media posts rising sharply, purely manual detection of HOF does not scale. Therefore, there has been a growing interest in methods for automatic HOF detection. A straight-forward approach one might consider is to make use of basic word filters, which use lexicons that contain entries of words that are frequently used in hate speech [3] . This approach, however, has its limitations, given that HOF depends on discourse, the media, daily politics, and the identity of the target [4] . It also disregards the different use of potentially offending expressions across communities [5] . 
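As a minimal illustration of why such word filters fall short, consider the following sketch; the lexicon and example posts are hypothetical and not taken from any real filter list:

```python
import re

# Hypothetical toy lexicon: real filter lists contain thousands of
# entries and are still incomplete by construction.
HATE_LEXICON = {"idiot", "scum", "vermin"}

def naive_hof_filter(text: str) -> bool:
    """Flag a post if any lexicon entry appears as a token."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return any(tok in HATE_LEXICON for tok in tokens)

# A post containing a lexicon word is caught ...
print(naive_hof_filter("You absolute idiot"))              # True
# ... but implicitly hateful content without lexicon words is missed,
# illustrating the dependence on context that the filter cannot model.
print(naive_hof_filter("Go back to where you came from"))  # False
```

The second example is the crux: whether such a sentence is HOF depends on discourse and target, which no static word list can capture.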
These factors motivate an interest in more advanced approaches as developed in the field of natural language processing (NLP). Most recent, well-performing systems make use of machine learning methods to associate textual expressions in a contextualized manner with the concept of HOF. Mostly, existing models build on top of end-to-end learning, in which the model needs to figure out this association purely from the training data (and a general language representation which originates from self-supervised pretraining of a language model). In this paper, we build on the intuition that HOF is related to other concepts that might help to direct this learning process. Analyzing the definition above, HOF is potentially related to sentiment, emotion, and the target of hate speech. First, sentiment analysis is often defined as the task of classifying an opinion expression as positive or negative, given a particular target [6, 7]. HOF is related as it typically contains a negative expression or, at least, a negative intent. Second, emotion analysis is concerned with the categorization of text into a predefined reference system, for instance basic emotions as they have been proposed by Paul Ekman [8] (fear, joy, sadness, surprise, disgust, anger). HOF contains expressions of anger and might cause fear or other emotions in a target group. Finally, the target is, by definition, a crucial element of hate speech, whether mentioned explicitly or not. The concrete research question we test is whether a HOF detection system can be improved by exploiting existing resources that are annotated for emotion, sentiment, and HOF target, and carrying out joint training of a model for HOF and these aspects. In building such a model, the developer has to decide (a) which of these aspects to include, (b) which corpora to use for training for each aspect, and (c) how to combine these aspects.
We assume a simple multi-task learning architecture for (c) and perform model selection on the HASOC FIRE 2019 development data to address (b). Finally, we address question (a) through our submissions to the HASOC FIRE 2021 Shared Task [9] subtask 1A [10], which asks systems to carry out a binary distinction between "non hate/offensive" and "hate/offensive" English tweets. We find that a combination of all concepts leads to an improvement by about 2pp in F1 on the HASOC 2019 test data, with a notable increase in recall by 6pp, and an increase by 0.5pp in F1 on the HASOC 2021 test data (with an increase of 3pp in recall). As argued above, detecting hate and offensive language on Twitter is a task closely linked to sentiment analysis, emotion analysis, and target classification. In this section, we introduce these tasks alongside previous work and also mention some HOF detection shared tasks that took place in recent years in the NLP community. Emotion analysis from text (EA) consists of mapping textual units to a predefined set of emotions, for instance basic emotions, as they have been proposed by Ekman [8] (anger, fear, sadness, joy, disgust, surprise), the dimensional model of Plutchik [11] (adding trust and anticipation), or the discrete model proposed by Shaver et al. [12] (anger, fear, joy, love, sadness, surprise). The NLP community has made considerable efforts in recent years on a variety of emotion research tasks, including emotion intensity prediction [13, 14, 15], emotion stimulus or cause detection [16, 17, 18, 19], and emotion classification [20, 21]. Studying patterns of human emotions is essential in various applications such as the detection of mental disorders, social media mining, dialog systems, business intelligence, or e-learning. An important application is the detection of HOF, since it is inextricably linked to the emotional and psychological state of the speaker [22].
Negative emotions such as anger, disgust and fear can be conveyed in the form of HOF. For example, in the text "I am sick and tired of this stupid situation" the author feels angry and at the same time uses offensive language to express that emotion. Therefore, the detection of negative emotions can be a clue to detect this type of behavior on the web. An important aspect of EA is the creation of annotated corpora to train machine learning models. The availability of emotion corpora is highly fragmented, not only because of the different emotion theories, but also because emotion classification appears to be genre- and domain-specific [21]. We limit the discussion of corpora in the following to those we use in this paper. The Twitter Emotion Corpus (TEC) was annotated with labels corresponding to Ekman's model of basic emotions (anger, disgust, fear, joy, sadness, and surprise) and consists of 21,051 tweets. It was automatically labeled with the use of hashtags that the authors self-assigned to their posts. The grounded emotions corpus created by Liu et al. [23] is motivated by the assumption that emotions are grounded in contextual experiences. It consists of 2,557 instances, labeled by domain experts for the emotions of happiness and sadness. EmoEvent, in contrast, was labeled via crowdsourcing on Amazon Mechanical Turk. It contains a total of 8,409 tweets in Spanish and 7,303 in English, based on events related to different topics such as entertainment, events, politics, global commemoration, and global strikes. The labels that we use from this corpus correspond to Ekman's basic emotions, complemented by 'other'. DailyDialog, developed by Li et al. [24], is a corpus consisting of 13,118 sentences reflecting the daily communication style and covering various topics related to daily life. The dialogues in the dataset cover ten topics in total. It was annotated following Ekman's emotions by domain experts.
The ISEAR dataset was collected in the 90s by Klaus R. Scherer and Harald Wallbott by asking people to report on their experience of emotion-eliciting events [25]. The dataset contains a total of 7,665 sentences from 3,000 participant reports labeled with single emotions. The last dataset that we use, CrowdFlower, consists of 39,740 tweets labeled for 13 emotions. It is quite large, but noisier than some other corpora, given the annotation procedure via crowdsourcing. Sentiment analysis (SA) has emerged as one of the most well-known areas in NLP due to its significant implications in social media mining. Construed broadly, the task includes sentiment polarity classification, identifying the sentiment target or topic, opinion holder identification, and identifying the sentiment of one specific aspect (e.g., a product, topic, or organization) in its context sentence [7, 26]. Sentiment analysis in a stricter sense, i.e., polarity classification, is often modeled as a two-class (positive, negative) or three-class (positive, negative, neutral) categorization task. For instance, the opinionated expression "The movie was terrible, I wasted my time watching it" is clearly negative. A negative sentiment can be an indicator of the presence of offensive language, as previous studies have shown [27, 6]. Sentiment analysis and the identification of HOF share common discursive properties. The example shown in Section 2.1, "I am sick and tired of this stupid situation", in addition to expressing anger, conveys a negative sentiment along with the presence of expletive language targeted at a situation. Therefore, both sentiment and emotion features can serve as useful information in NLP systems and benefit the task of HOF detection in social media.
Note that sentiment analysis is not a "simplified" version of emotion analysis: sentiment analysis is about the expression of an opinion, while emotion analysis is about inferring an emotional private state of a user. These tasks are related, but at least to some degree complementary [28]. Unlike for EA, a larger number of corpora annotated with sentiment is available, particularly from Twitter, since SA classification is one of the most studied tasks due to its broad range of applications. For instance, one of the most well-known datasets is the Stanford Sentiment Treebank [29]. It contains movie reviews in English from Rotten Tomatoes. Another popular dataset, released for SemEval 2016 Task 4, is labeled with positive, negative, or neutral sentiments and includes a mixture of entities (e.g., Gaddafi, Steve Jobs), products (e.g., kindle, android phone), and events (e.g., Japan earthquake, NHL playoffs) [30]. The same year, another dataset was released for SemEval Task 6 [31], the Twitter stance and sentiment corpus, which is composed of 4,870 English tweets labeled with positive and negative sentiments. For a more detailed overview, we refer the reader to recent surveys on the topic [32, 33, 34]. Hate speech and offensive language (HOF) is a phenomenon that has been observed with increasing frequency in social media in recent years. As HOF became more widespread, attention from the NLP community also increased substantially [35]. Most research targets Twitter as a platform, due to its popularity across various user groups and its relatively liberal API terms and conditions for researchers. The methodological approaches vary between lexicon-based approaches (among others, [36]), which are preferable due to their transparency in decision-making, and machine learning methods, which typically show higher performance [37, 38, i.a.].
Definitions of hate speech for the operationalization in automatic detection systems vary and do not always strictly follow the ECRI definition that we introduced in Section 1. Sometimes, different categories such as hate speech, profanity, offensive language, and abuse are collapsed into one class because they are related [4] and might trigger similar responses (e.g., by authorities). An important role in the HOF field has been played by a series of shared tasks. One of the first of such events was GermEval [39], which was organized for the first time in 2014 and focuses on German. After initially focusing on information extraction tasks, the identification of offensive language was introduced in 2018 [40]. Two subtasks were offered: one on classification into the classes of a non-offensive 'other' as well as 'profanity', 'insult', and 'abuse', and one in which the three latter classes are collapsed to obtain a coarse-grained binary classification setup. This setup was retained in the 2019 edition of the shared task [41]. Another well-known shared task event is OffensEval, which was held as part of the International Workshop on Semantic Evaluation (SemEval) in 2019 and 2020 [42, 43]. As part of the 2019 OffensEval event, the OLID dataset was published, which contains a total of 14,000 tweets in English. It was annotated using a three-level hierarchical annotation scheme by crowdsourcing. A third shared task series, which took place for the third time in 2021, is HASOC (Hate Speech and Offensive Content Identification in English and Indo-Aryan Languages) [44, 45]. In the first edition of HASOC in 2019 [44], Hindi, German, and English datasets were created based on Twitter and Facebook posts and a shared definition of HOF. HASOC 2020 introduced two tasks, one on coarse-grained HOF vs. non-HOF language and one which distinguishes hate, offensive language, and profane language for all these languages. HASOC 2021 was extended by a subtask on code-mixed language.
This paper is a system description of our participation in the coarse-grained identification of HOF in English (Subtask 1A) in the 2021 edition of HASOC. Another direction of research relevant for this paper is formed by studies that aim at obtaining a better understanding of the challenges of HOF detection. Davidson et al. [46] focused on the separation between the classes of hate speech and offensive language. They collected 33,458 tweets based on a crowdsourced lexicon and found, based on bag-of-words maximum entropy classifiers, that racist and homophobic tweets are more likely to be classified as hate speech, but that sexist tweets are generally classified as offensive. With a similar goal in mind, to understand which cases are particularly challenging for HOF detection models, Röttger et al. [47] introduced HateCheck, a suite of functional tests, to enable more detailed insights into where models might fail. They particularly analyzed distinct expressions of hate, like derogatory hate speech, threatening language, slurs, and profanity. Finally, Waseem and Hovy [48] perform a corpus study to understand which properties hate speech exhibits in contrast to non-hateful language. This work is noteworthy because it focuses on properties that are grounded in theories from social sciences rather than being primarily data-driven. Other research focused on the development of well-performing models for HOF detection with adaptations of recent approaches to text classification via transfer learning. Mathur et al. [49] investigated the usage of mixed language. They presented the Multi-Input Multi-Channel Transfer Learning-based model (MIMCT) to detect hate speech, offensive, and abusive tweets from the proposed Hinglish Offensive Tweet (HOT) dataset using transfer learning coupled with multiple feature inputs. They stated that their proposed MIMCT model outperforms basic supervised classification models. Wiedemann et al.
[50], participants in the GermEval competition, used a different strategy for automatic offensive language classification on German Twitter data. For this task, they used a set of BiLSTM and CNN neural networks and included background knowledge in the form of topics in the models. We refer the interested reader to recent surveys on the topic of hate speech and offensive language detection for a more comprehensive overview [35, 51]. According to the definition of hate speech, it must be targeted at a particular individual or group, whether that target is mentioned explicitly or not. Typical examples include black people, women, LGBT individuals, or people of a particular religion [52]. The majority of current studies do not aim at detecting targets that are mentioned in the text (in the sense of information extraction), but at analyzing the properties of HOF towards a particular group by sampling only posts aimed at that group from social media. For example, Kwok and Wang [53] focused on the analysis of hate speech towards people of color, while Grimminger and Klinger [54] analyzed hate and offensive language by and towards supporters of particular political parties. Some studies aim at answering the research question of how various target groups are referred to. As an example, Lemmens et al. [55] analyzed the language of hateful Dutch comments regarding classes of metaphoric terms, including body parts, products, animals, or mental conditions. Such a closed-world approach, however, does not permit the identification of targets that were not known at the development time of the HOF detection system. This is to some degree addressed by Silva et al. [56], who developed a rule-based method to identify target mentions that are then, similarly to Lemmens et al. [55], compared regarding the expressions that are used. ElSherief et al.
[57] focused on the distinction between directed hate, towards a particular individual as a representative of a group, and generalized hate, which mentions the group itself. Their study is not focused on target classification, but on the analysis of which groups and individuals are particularly in the focus of hate speech, including religious groups, genders, and ethnicities. To be able to do that, however, they needed to automatically detect words in the context of HOF. They did that with the use of a mixed-effect topic model [58]. This label set is also used in the shared task OffensEval 2019 [43], which is the only competition we are aware of that included target classification as a subtask. The OLID dataset of OffensEval 2019 has, next to HOF annotations, labels which indicate whether the target is an individual, a group, some other target, or whether it is omitted. We use this annotation in our study. Maybe the most pertinent question arising from our intuition above, namely that HOF detection is related to the tasks of emotion, sentiment, and target classification, is how this intuition can be operationalized as a computational architecture. Generally speaking, this is a transfer learning problem, that is, a problem which involves generalization of models across tasks and/or domains. There are a number of strategies to address transfer learning problems; see Ruder [59] for a taxonomy. Structurally, our setup falls into the inductive transfer learning category, where we consider different tasks and have labeled data for each. Procedurally, we propose to learn the different tasks simultaneously, which amounts to multi-task learning (MTL). In the MTL scenario, multiple tasks are learned in parallel while using a shared representation [60]. In comparison to learning multiple tasks individually, this joint learning effectively increases the sample size while training a model, which leads to improved performance by increasing the generalization of the model [61].
The concrete MTL architecture that we use is shown in Figure 1. We build on a standard contextualized embedding setup where the input is represented by a transformer-based encoder, BERT, pre-trained on a very large English corpus [62]. We add four sequence classification heads to the encoder, one for each task, and fine-tune the model on the four tasks in question (binary/multiclass classification tasks). For the sentiment classification task, a tweet is categorized as positive or negative; emotion classification assigns a tweet to one of several emotion categories (anger, disgust, fear, joy, sadness, surprise, enthusiasm, fun, hate, neutral, love, boredom, relief, none), where different subsets of these categories are considered depending on the emotion corpus that is used to represent the concept of an emotion; target classification categorizes the target of the offense as an individual, a group, another target, or not mentioned; and HOF detection classifies a tweet as HOF or non-HOF. During training, the objective function weights each task equally. At prediction time, each tweet in the HASOC dataset receives four predictions, one for each task. Our main research question is whether HOF detection can be improved by joint training with sentiment, emotion, and target. Even the adoption of the architecture described in Section 3 leaves open a number of design choices, which makes a model selection procedure necessary. For the purpose of model selection, we decided to use the dataset provided by the 2019 edition of the HASOC shared task, under the assumption that the datasets are fundamentally similar (we also experimented with the HASOC 2020 dataset, but the results indicated that this dataset is sampled from a different distribution than the 2021 dataset).
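The multi-head setup can be sketched as follows. This is a minimal, illustrative stand-in: a toy random-projection encoder takes the place of the shared BERT encoder, all dimensions and names are arbitrary assumptions, and the joint objective is the equally weighted sum of the per-task cross-entropy losses, as described above.

```python
import math, random

random.seed(0)
DIM, HIDDEN = 20, 8  # toy input and hidden sizes (illustrative only)

# Stand-in for the shared BERT encoder (in the paper, a pre-trained
# transformer fine-tuned jointly): a fixed random projection plus tanh.
W_enc = [[random.gauss(0, 0.1) for _ in range(HIDDEN)] for _ in range(DIM)]

def encode(x):  # x: list of DIM floats -> shared representation
    return [math.tanh(sum(xi * W_enc[i][j] for i, xi in enumerate(x)))
            for j in range(HIDDEN)]

# One linear classification head per task; class counts follow the paper
# (binary HOF, binary sentiment, 14 emotion labels, 4 target labels).
TASKS = {"hof": 2, "sentiment": 2, "emotion": 14, "target": 4}
heads = {t: [[random.gauss(0, 0.1) for _ in range(k)] for _ in range(HIDDEN)]
         for t, k in TASKS.items()}

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def task_loss(task, x, y):
    """Cross-entropy of one task's head for a single example (x, y)."""
    h = encode(x)
    logits = [sum(h[i] * heads[task][i][c] for i in range(HIDDEN))
              for c in range(TASKS[task])]
    return -math.log(softmax(logits)[y])

# Equal task weighting, as described above: the joint objective is the
# plain sum of the per-task cross-entropy losses.
def joint_loss(examples):  # examples: {task: (x, y)}
    return sum(task_loss(t, x, y) for t, (x, y) in examples.items())
```

In the actual system, `encode` is the fine-tuned BERT encoder and information flows between tasks because gradients from all four heads update the shared encoder parameters.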
During the evaluation phase, we then used the best model configurations we identified on HASOC 2019 to train a model on the HASOC 2021 training data and produce predictions for the HASOC 2021 test set. The two main remaining model selection decisions are (a) which corpora to use to train the components, and (b) which components to include. In the following, we first provide details on the corpora we considered, addressing (a). We also describe the details of data preprocessing, training regimen, and hyperparameter handling. The results are reported in Section 5 to address point (b).

Figure 1: Proposed multi-task learning system to evaluate the impact of including emotion, sentiment, and target classification. The input representation is BERT-based tokenization and each task corresponds to one classification head. Information can flow from one task to another through the shared encoder that is updated during training via backpropagation.

We carry out MTL experiments to predict HOF jointly with the concepts of emotion, sentiment, and HOF target. The datasets are listed in Table 1. To represent sentiment in our MTL experiments, we use the SemEval 2016 Task 6 dataset [31] composed of 4,870 tweets in total. We include the task of target classification with the OLID dataset [63], which consists of 14,100 English tweets. The concept of HOF is modelled based on the HASOC 2021 dataset, which provides three sub-tasks. We participate as the team IMS-SINAI in sub-task 1A, which contains 5,214 English tweets split into 3,074 tweets in the training set, 769 in the development set, and 1,281 in the test set. For emotion detection, we consider a set of six corpora in the model selection experiment. These are the CrowdFlower data, the TEC corpus [64], the Grounded Emotions corpus [23], EmoEvent [65], DailyDialog [24], and ISEAR. Among the available emotion corpora, we chose these because they cover a range of general topics and/or the genre of tweets.
Tweets present numerous challenges for tokenization, such as user mentions, hashtags, emojis, and misspellings, among others. To address these challenges, we make use of the ekphrasis Python library [66]. In particular, we normalize all mentions of URLs, emails, user mentions, percentages, monetary amounts, time and date expressions, and phone numbers. For example, "@user" is replaced by the token "<user>". We further normalize hashtags and split them into their constituent words. As an example, "#CovidVaccine" is replaced by "Covid Vaccine". Further, we replace emojis by their aliases. For instance, the 'face with tears of joy' emoji is replaced by the token ":face_with_tears_joy:" using the emoji Python library. Finally, we replace multiple consecutive spaces by single spaces and replace line breaks by a space. In the MTL stage, during each epoch, a mini-batch is selected among all four tasks, and the model is updated according to the task-specific objective of the selected task. This approximately optimizes the sum of all multi-task objectives. As we are dealing with sequence classification tasks, a standard cross-entropy loss function is used as the objective. For hyper-parameter optimization, we split the HASOC 2021 training data into train (80%) and validation (20%) sets. Afterwards, in the evaluation phase, we use the complete training set of HASOC 2021 in order to take advantage of having more labeled data to train our models. For the baseline BERT, we fine-tuned the model for four epochs, the learning rate was set to 4 · 10^−4 and the batch size to 32. For HASOC_sentiment and HASOC_emotion, we fine-tuned the model for three epochs, the learning rate was set to 3 · 10^−5 and 4 · 10^−5, respectively, and the batch size to 32. For HASOC_target, the epochs were set to four, the learning rate to 4 · 10^−5, and the batch size to 16. For HASOC_all, we fine-tuned the model for two epochs, the learning rate was set to 3 · 10^−4 and the batch size to 16. All configurations used AdamW as optimizer.
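The normalization steps can be approximated with standard-library regular expressions; this is only a rough sketch of what ekphrasis provides, and the replacement tokens as well as the CamelCase-only hashtag splitting are simplifying assumptions:

```python
import re

def normalize_tweet(text: str) -> str:
    """Rough, illustrative approximation of the ekphrasis normalization."""
    text = re.sub(r"https?://\S+", "<url>", text)   # normalize URLs
    text = re.sub(r"@\w+", "<user>", text)          # normalize user mentions
    # Split hashtags on CamelCase boundaries: "#CovidVaccine" -> "Covid Vaccine"
    text = re.sub(r"#(\w+)",
                  lambda m: re.sub(r"(?<=[a-z])(?=[A-Z])", " ", m.group(1)),
                  text)
    text = re.sub(r"\n", " ", text)                 # line breaks -> space
    text = re.sub(r"\s{2,}", " ", text)             # collapse repeated spaces
    return text.strip()

print(normalize_tweet("@user check #CovidVaccine  news\nhttps://t.co/x"))
# -> "<user> check Covid Vaccine news <url>"
```

A production pipeline would additionally handle emails, dates, monetary amounts, elongated words, and all-caps hashtags, which ekphrasis covers out of the box.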
We run all experiments with the PyTorch high-performance deep learning library [67] on a compute node equipped with a single Tesla V100 GPU with 32 GB of memory. In this section, we present the results obtained by the systems we developed as part of our participation in the HASOC 2021 English Subtask 1A. We use the official competition metrics of macro-averaged precision, recall, and F1-score as evaluation measures and further report HOF-specific results, as we believe that, for real-world applications, the detection of the concept HOF is more important than non-HOF. The experiments are performed in two phases: the model selection phase and the evaluation phase, which are explained in the following two sections. As described above, we perform model selection by training our systems on the training set of HASOC 2019 and evaluating them on the corresponding test set. As our hypothesis is that an MTL system trained on tasks related to HOF detection increases the generalization of the model, we use the pre-trained language model BERT fine-tuned on the HASOC 2019 corpus as a baseline for comparison. In order to decide which emotion corpora to use for the task of emotion classification in the MTL setting, we test a number of emotion datasets, obtaining the results shown in Table 2. These results are on the main task of hate and offensive language detection, but vary the emotion dataset used for MTL. As can be seen, the best performance is obtained with the CrowdFlower dataset, with a substantial margin in terms of Macro-P score. This is despite our impression that this dataset is comparably noisy [21]. We believe that what makes the dataset suitable for HOF detection is that it contains a large number of tweets labeled with a wide range of emotion tags, including hate. Therefore, we decided to use this emotion dataset in the MTL setting for the final submission to HASOC 2021.
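For reference, the macro-averaged metrics treat both classes equally regardless of their frequency, which matters for the minority class HOF. A minimal sketch of the computation for the binary task, where the label names HOF/NOT follow the HASOC convention:

```python
def macro_prf(gold, pred, labels=("HOF", "NOT")):
    """Macro-averaged precision, recall, and F1 over the given labels."""
    ps, rs, fs = [], [], []
    for lab in labels:
        tp = sum(g == p == lab for g, p in zip(gold, pred))
        fp = sum(p == lab and g != lab for g, p in zip(gold, pred))
        fn = sum(g == lab and p != lab for g, p in zip(gold, pred))
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        ps.append(p); rs.append(r); fs.append(f)
    n = len(labels)
    return sum(ps) / n, sum(rs) / n, sum(fs) / n

# Tiny hypothetical example: one HOF tweet is missed by the classifier.
print(macro_prf(["HOF", "HOF", "NOT", "NOT"],
                ["HOF", "NOT", "NOT", "NOT"]))
# -> roughly (0.833, 0.75, 0.733)
```

Per-class values (as reported for the HOF class in the tables) are simply the un-averaged precision, recall, and F1 for that single label.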
Table 3 shows the results of the MTL models including the different auxiliary tasks on the HASOC 2019 test data. The setting HASOC_all refers to the MTL model trained on the combination of all tasks (HOF detection, emotion classification, polarity classification, and offensive target classification). As can be seen, the MTL models surpass the baseline BERT by at least 2 percentage points Macro-F1. In particular, the MTL model that obtains the best performance is HASOC_all, followed by HASOC_target, HASOC_emotion, and HASOC_sentiment. The performance of HASOC_all increases by 2 points Macro-F1 over the baseline, with Macro-Precision increasing roughly 1.5 points and Macro-Recall roughly 2.5 points. Table 3 further shows the results of the MTL models on the HOF class in the HASOC 2019 test set. In all MTL systems except HASOC_emotion, the recall improved over the BERT baseline. The highest improvement in terms of this measure is observed for the HASOC_target model, with an increase of 6.2 points. The precision increases by 5.2 points in the HASOC_emotion model. The best run (HASOC_all) outperforms the baseline BERT by a substantial margin (0.702 vs. 0.667). Table 4 shows a selection of examples: 4 false positives and 3 false negatives made by the baseline BERT model. Regarding the false positives, the first two tweets (IDs 107 and 952) are predicted as HOF by the BERT model, but the MTL model correctly classifies them as non-HOF, presumably because, although the predicted sentiment is negative, the model recognizes neither a negative emotion nor a target that would mark them as HOF. The tweet with ID 506 is also correctly predicted by the MTL model as non-HOF; in this case, although the emotion sadness is negative, we believe that it is not strongly linked to HOF, and, moreover, the model does not recognize a specific target of HOF.
The last false positive (tweet ID 4517) expresses a positive sentiment, and the model is able to recognize it; thus, we suppose that the MTL model benefits from this affective knowledge to classify the tweet as non-HOF. Regarding the false negatives, the tweet with ID 254 has been classified by the MTL system as expressing negative sentiment and a negative emotion (sadness), and as being directed at a person; as these aspects are closely linked to the presence of HOF, we assume that the MTL model takes advantage of them to correctly classify the tweet. The next sample, the tweet with ID 684, expresses a negative opinion and anger, both correctly predicted by the MTL model. Anger is one of the emotions most closely related to HOF and, together with the negative sentiment, could cue the system to correctly classify the tweet as HOF, although the target is not identified. Finally, instance 821 expresses a negative sentiment towards a person, correctly identified by the MTL model. The model predicts fear for this instance, which we would consider a wrong classification. However, even from this classification (fear instead of anger), the MTL model benefits and makes the correct prediction, which was not possible for the plain BERT model. These examples indicate that our MTL system predicts the class HOF more accurately than BERT and improves particularly in cases that have been missed by the plain model (which is also reflected by the increased recall on the HASOC 2019 data). For evaluation, we use the dataset provided by the organizers of the HASOC 2021 English Subtask 1A. First, we want to verify that the MTL models surpass the baseline BERT also in the evaluation setting. We train all models on the HASOC 2021 training set and test them on the dev set of HASOC 2021. The results obtained are shown in Table 5.
As can be seen, the MTL systems, with the exception of HASOC_sentiment, outperform the baseline, which validates our decision to select these models for the final evaluation of HASOC 2021. HASOC_sentiment does improve over the baseline in Macro-Precision, but shows a drop in Macro-Recall. One reason might be that the sentiment data we use is, in some relevant characteristic, more similar to the data from 2019 than to the data of the 2021 edition of the shared task. Finally, Table 6 shows the five models that we submitted to the HASOC 2021 shared task as team IMS-SINAI, both with the official macro-averaged evaluation and with the class-specific values (reported by the submission system during the submission period). We observe that BERT achieves a Macro-F1 score of 0.790. In contrast to the HASOC 2019 results, the multi-task learning models mostly improve in terms of precision, and less consistently in terms of recall. Considering target classification or emotion classification in the multi-task learning models does not yield any improvement; sentiment classification, however, does. These results for the separate concepts contradict the results on the 2019 data, indicating that the evaluation procedure, the annotation procedure, or the data itself has changed in some relevant property: on the 2019 data, sentiment+HOF is not better than HOF alone, but emotion+HOF and target+HOF are; on the 2021 data, it is the other way around. However, when combining all concepts of sentiment, emotion, target, and HOF in one model (HASOC_all), we see an improvement that goes beyond the contribution of the sentiment model alone. We therefore conclude that all of these concepts are indeed helpful for the identification of hate speech and offensive language. In addition, we report the results for the class HOF in the same table, without averaging them with the class non-HOF.
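The difference between macro-averaged scores and the class-specific view on HOF can be made concrete with a short sketch. The labels below are invented for illustration and do not come from the shared-task data; the computation mirrors the standard per-class precision/recall/F1 and their macro average:

```python
def prf(gold, pred, label):
    """Precision, recall, and F1 for a single class label."""
    tp = sum(g == p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy predictions: one missed HOF tweet (a false negative) lowers HOF recall directly.
gold = ["HOF", "HOF", "HOF", "NOT", "NOT", "NOT"]
pred = ["HOF", "HOF", "NOT", "NOT", "NOT", "HOF"]

per_class = {c: prf(gold, pred, c) for c in ("HOF", "NOT")}
macro_f1 = sum(f1 for _, _, f1 in per_class.values()) / len(per_class)

print("HOF P/R/F1:", per_class["HOF"])
print("Macro-F1:", round(macro_f1, 3))  # prints: Macro-F1: 0.667
```

In an imbalanced setting such as HASOC, reporting the HOF row separately shows whether a gain in Macro-F1 actually comes from the minority class of interest rather than from non-HOF.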
We find this result particularly important, as the practical task of detecting hate speech is more relevant than detecting non-hate speech. In contrast to the macro-averaged results, the precision values here are lower than the recall values. The recall is particularly increased for the best model configuration (HASOC_all), with 0.917 compared to 0.866 for the plain BERT approach. It is noteworthy that all multi-task models increase recall at the cost of precision for the class HOF. This is important both for practical applications that detect hate speech in the real world and from a dataset perspective, as most resources contain substantially fewer instances labeled HOF than other instances. Most of the research conducted on the detection of hate speech and offensive language (HOF) has focused on training automatic systems specifically for this task, without considering other phenomena that are arguably correlated with HOF and could therefore be beneficial for recognizing it. Our study builds on the assumption that the discourse of HOF involves other affective components (notably emotion and sentiment) and is, by definition, targeted at a person or group. Therefore, in this paper, as part of our participation as the IMS-SINAI team in the HASOC FIRE 2021 English Subtask 1A, we explored whether training a model concurrently on all of these tasks (sentiment, emotion, and target classification) via multi-task learning is useful for HOF detection. We used corpora labeled for each of the tasks, studied how to combine these aspects in our model, and explored which combination of these concepts is most successful. Our experiments show the utility of our enrichment method. In particular, we find that the model that achieves the best performance in the final evaluation considers the concepts of emotion, sentiment, and target together. This improvement is even clearer on the HASOC 2019 data.
In an analysis of the results, we found that the model is particularly good at correcting false positive errors made by BERT. A plausible mechanism is that positive sentiments and positive emotions are opposed to the general spirit of hate speech and offensive language, so that the presence of these indicators permits the model to predict the absence of HOF more accurately. This is in line with previous results on multi-task learning across related tasks in the field of affective language. For example, Akhtar et al. [68] have shown that the tasks of sentiment and emotion analysis benefit from each other. Similarly, Chauhan et al. [69] showed an improvement in sarcasm detection when emotion and sentiment are additionally considered. The latter study in particular is in line with our work, because the sharp and sometimes offensive character of sarcasm is shared with hate speech and offensive language. Further, Rajamanickam et al. [70] have already shown that abusive language detection and emotion prediction benefit from each other in a multi-task learning setup. This is also in line with our result, given that HOF is an umbrella concept that subsumes abusive language. A clear downside of our model is its high resource requirement: it needs annotated corpora for all the phenomena involved, and, as our model selection experiments showed, the quality of these resources is very important. While resources that meet these needs are available for English, no comparable resources exist for the vast majority of languages. At the same time, the availability of multilingually trained embeddings makes it possible to extend the transfer setup that we adopted to a multilingual dimension and to train a model jointly on resources from different languages. This perspective was beyond the scope of our study, but it represents a clear avenue for future research, and one that looks promising given the outcome of our experiments.
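The shared-encoder, multi-head setup used throughout this work can be sketched as follows. This is a minimal illustration, not our implementation: a randomly initialized bag-of-embeddings encoder stands in for BERT, the task names mirror our four tasks, but all dimensions, label-set sizes, and token ids are invented, and the real system trains encoder and heads jointly by backpropagation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the shared BERT encoder: embedding lookup + mean pooling.
VOCAB, DIM = 1000, 32
embeddings = rng.normal(size=(VOCAB, DIM))

def encode(token_ids):
    """Shared sentence representation consumed by every task head."""
    return embeddings[token_ids].mean(axis=0)

# One linear classification head per task (illustrative label-set sizes).
TASKS = {"hof": 2, "sentiment": 3, "emotion": 6, "target": 4}
heads = {task: rng.normal(size=(DIM, n)) for task, n in TASKS.items()}

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def forward(token_ids):
    """All heads read the same shared representation."""
    h = encode(token_ids)
    return {task: softmax(h @ W) for task, W in heads.items()}

def joint_loss(probs, gold):
    """Summed cross-entropy over the tasks labeled for this instance;
    an instance from an auxiliary corpus simply contributes fewer terms."""
    return sum(-np.log(probs[task][label]) for task, label in gold.items())

probs = forward([3, 17, 256])                       # a toy token-id sequence
loss = joint_loss(probs, {"hof": 1, "emotion": 2})  # partially labeled instance
```

Because each instance only contributes loss terms for the tasks it is labeled with, corpora annotated for different phenomena (HOF, sentiment, emotion, target) can be mixed in one training stream while the encoder is shared across all of them.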
Other plausible extensions include further affective phenomena that are arguably correlated with hate speech, including stylistic ones such as sarcasm/irony [71] or author-based ones such as the "big five" personality traits [72], as well as a more detailed modeling of the hate speech target beyond the coarse-grained classification we used here, tying in, for example, with emotion role labeling [73]. Another aspect to study in more detail arises from the observation of substantial differences between the results on the HASOC 2019 and HASOC 2021 data: the improvements of the MTL model are apparently clearer on the 2019 data. This variance in results is an opportunity to study the factors that influence the performance improvements obtained by considering related concepts.

Acknowledgments

This work has been partially supported by a grant from the European Regional Development Fund.

References

[1] A Survey on Automatic Detection of Hate Speech in Text
[2] ECRI General Policy Recommendation No. 15 on Combating Hate Speech, Online
[3] Detecting Misogyny and Xenophobia in Spanish Tweets Using Language Technologies
[4] Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media
[5] The Risk of Racial Bias in Hate Speech Detection
[6] Automatic Detection of Hate Speech on Facebook Using Sentiment and Emotion Analysis
[7] Sentiment Analysis - Mining Opinions, Sentiments, and Emotions
[8] An argument for basic emotions
[9] Overview of the HASOC Subtrack at FIRE 2021: Hate Speech and Offensive Content Identification in English and Indo-Aryan Languages and Conversational Hate Speech, in: FIRE 2021: Forum for Information Retrieval Evaluation, Virtual Event
[10] Overview of the HASOC Subtrack at FIRE 2021: Hate Speech and Offensive Content Identification in English and Indo-Aryan Languages
[11] The nature of emotions: Human emotions have deep evolutionary roots, a fact that may explain their complexity and provide tools for clinical practice
[12] Emotion knowledge: further exploration of a prototype approach
[13] Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007)
[14] Shared Task on Emotion Intensity
[15] Proceedings of the 6th Joint Conference on Lexical and Computational Semantics (*SEM 2017)
[16] Joint Learning for Emotion Classification and Emotion Cause Detection
[17] Clause Classification for English Emotion Stimulus Detection
[18] Emotion-Cause Pair Extraction: A New Task to Emotion Analysis in Texts
[19] Emotion Stimulus Detection in German News Headlines
[20] Proceedings of The 12th International Workshop on Semantic Evaluation
[21] An Analysis of Annotated Corpora for Emotion Classification in Text
[22] The psychology of profanity
[23] Grounded emotions
[24] DailyDialog: A Manually Labelled Multi-turn Dialogue Dataset
[25] The ISEAR Questionnaire and Codebook
[26] Sentiment Analysis in Multiple Languages: Feature Selection for Opinion Classification in Web Forums
[27] Leveraging Affective Bidirectional Transformers for Offensive Language Detection
[28] Annotation, Modelling and Analysis of Fine-Grained Emotions on a Stance and Sentiment Detection Corpus
[29] Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank
[30] Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)
[31] SemEval-2016 Task 6: Detecting Stance in Tweets
[32] Sentiment analysis algorithms and applications: A survey
[33] A Survey of Sentiment Analysis from Social Media Data
[34] Assessing State-of-the-Art Sentiment Models on State-of-the-Art Sentiment Datasets
[35] Thirty years of research into hate speech: topics of interest and their evolution
[36] A Lexicon-based Approach for Hate Speech Detection
[37] Hate Speech Detection with Comment Embeddings
[38] WWW '17 Companion, International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva
[39] GermEval 2014 Named Entity Recognition Shared Task for German
[40] Overview of the GermEval 2018 Shared Task on the Identification of Offensive Language
[41] Shared Task on the Identification of Offensive Language
[42] Proceedings of the Fourteenth Workshop on Semantic Evaluation, International Committee for Computational Linguistics
[43] SemEval-2019 Task 6: Identifying and Categorizing Offensive Language in Social Media (OffensEval)
[44] Overview of the HASOC Track at FIRE 2019: Hate Speech and Offensive Content Identification in Indo-European Languages
[45] Overview of the HASOC Track at FIRE 2020: Hate Speech and Offensive Content Identification in Indo-European Languages
[46] Automated Hate Speech Detection and the Problem of Offensive Language
[47] HateCheck: Functional Tests for Hate Speech Detection Models
[48] Hateful Symbols or Hateful People? Predictive Features for Hate Speech Detection on Twitter
[49] Did you offend me? Classification of Offensive Tweets in Hinglish Language
[50] Transfer Learning from LDA to BiLSTM-CNN for Offensive Language Detection in Twitter
[51] Hate speech detection: Challenges and solutions
[52] Mapping Twitter hate speech towards social and sexual minorities: a lexicon-based approach to semantic content analysis
[53] Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence, AAAI'13
[54] Hate Towards the Political Opponent: A Twitter Corpus Study of the 2020 US Elections on the Basis of Offensive Speech and Stance Detection
[55] Improving Hate Speech Type and Target Detection with Hateful Metaphor Features
[56] Analyzing the Targets of Hate in Online Social Media
[57] Hate lingo: A target-based linguistic analysis of hate speech in social media
[58] Proceedings of the 28th International Conference on International Conference on Machine Learning, ICML'11
[59] Neural Transfer Learning for Natural Language Processing
[60] Multitask learning
[61] A Survey on Multi-Task Learning
[62] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
[63] Predicting the Type and Target of Offensive Posts in Social Media
[64] *SEM 2012: The First Joint Conference on Lexical and Computational Semantics
[65] EmoEvent: A Multilingual Emotion Corpus based on different Events
[66] DataStories at SemEval-2017 task 4: Deep LSTM with attention for message-level and topic-based sentiment analysis
[67] PyTorch: An imperative style, high-performance deep learning library
[68] Multi-task Learning for Multi-modal Emotion Recognition and Sentiment Analysis
[69] Sentiment and Emotion help Sarcasm? A Multi-task Learning Framework for Multi-Modal Sarcasm, Sentiment and Emotion Analysis
[70] Joint Modelling of Emotion and Abusive Language Detection
[71] From humor recognition to irony detection: The figurative language of social media
[72] Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
[73] Semantic Role Labeling of Emotions in Tweets