title: Cyberbullying Classifiers are Sensitive to Model-Agnostic Perturbations
authors: Emmery, Chris; Kádár, Ákos; Chrupała, Grzegorz; Daelemans, Walter
date: 2022-01-17

A limited number of studies investigate the role of model-agnostic adversarial behavior in toxic content classification. As toxicity classifiers predominantly rely on lexical cues, (deliberately) creative and evolving language use can be detrimental to the utility of current corpora and state-of-the-art models when they are deployed for content moderation. The less training data is available, the more vulnerable models might become. This study is, to our knowledge, the first to investigate the effect of adversarial behavior and augmentation for cyberbullying detection. We demonstrate that model-agnostic lexical substitutions significantly hurt classifier performance. Moreover, when these perturbed samples are used for augmentation, we show that models become robust against word-level perturbations at a slight trade-off in overall task performance. Augmentations proposed in prior work on toxicity prove to be less effective. Our results underline the need for such evaluations in online harm areas with small corpora. The perturbed data, models, and code are available for reproduction at https://github.com/cmry/augtox

Our online presence has simplified contact with our (in)direct network, and thereby drastically changed how, and with whom, we interact. While online connections and self-disclosure are often socially beneficial (Valkenburg and Peter, 2007), the absence of physical interaction has numerous adverse effects: it greatly reduces social accountability in (anonymous) interactions, amplifies one's exposure to people with malicious intent, and, through our frequent use of mobile devices, the invasiveness thereof (Mason, 2008). These factors accumulate to persistent online toxic behavior, the scale of which online platforms continue to struggle with from a technical, legal, and ethical perspective.

Online harm (Banko et al., 2020, provide a comprehensive taxonomy of this field) and, particularly for Natural Language Processing (NLP), abusive language are highly complex phenomena. Their study spreads across several subfields (detection of hate speech, toxic comments, offensive and abusive language, aggression, and cyberbullying), all with their unique problem sets and (almost exclusively English) corpora (Vidgen and Derczynski, 2021). Moreover, there are numerous open issues with these tasks, as highlighted in a range of critical studies (Emmery et al., 2019; Rosa et al., 2019; Swamy et al., 2019; Madukwe et al., 2020; Nakov et al., 2021, for example). Those open issues primarily pertain to the contextual, historical, and multi-modal nature of toxicity, the specificity of the data, and poor generalization across domains.

The current work focuses on one of these subproblems: the continuously evolving nature of toxic content. Apart from the disparate channels and media through which (young) users communicate, this development particularly applies to the related vocabulary: slang, hate speech, or general insults (e.g., karen, simp, coofer, and covidiot). Given the strong focus on lexical cues exhibited by state-of-the-art toxic content classifiers (Gehman et al., 2020), the existing corpora would have to be continuously expanded for models to retain their performance.
This puts costly requirements on any system automatically moderating harmful content, while research in this domain still seems unconcerned with evaluating models in the wild. The adversarial nature of toxicity exacerbates these issues further; similar to any security application, it is safe to assume malicious actors will try to (actively) subvert any form of moderation they are subjected to. Yet, while ample work has investigated systems toward mimicking such behavior (Hosseini et al., 2017; Ebrahimi et al., 2018; Li et al., 2019, for example, feature attacks against Google's Perspective API), work on toxic content detection rarely incorporates tests for robustness against adversarial attacks. More importantly, these attacks are commonly tailored to an existing toxicity classifier, whereas a human adversary would not have direct access to the models performing moderation. A realistic implementation of subversive human behavior would therefore require model-agnostic attacks.

Accordingly, the current research combines multiple ideas from previous work: we apply lexical substitution to an online harm subtask with small corpora (cyberbullying detection) to investigate how lexical variation (either natural or adversarial) affects model performance and, by extension, evaluate the robustness of current state-of-the-art models. We do this in a model-agnostic fashion; an external classifier indicates which words might be relevant to substitute. Those words are perturbed through a variety of transformer-based models, after which we assess changes in the predictions of a target classifier (see Figure 1). The perturbations are not selected to be adversarial against the external or target classifier. We subsequently evaluate to what extent augmenting existing cyberbullying corpora improves classifier performance, robustness against word-level perturbations, and transferability across different substitution models. With this, we provide new resources to evaluate, and compare, the robustness of future cyberbullying classifiers against lexical variation in toxicity.

We employ word-level or token-level perturbations (i.e., substitutions, see Table 1 for examples), which implies that for a given target word w_t in document D = (w_0, w_1, ..., w_t, ..., w_n), we find a set of perturbation candidates C using substitute classifier f′ to exhaustively generate new samples D′, any of which potentially produces an incorrect label for a target classifier f. However, the samples are not selected based on such label changes, which therefore does not make this an adversarial attack. We follow and improve upon the adversarial substitution framework from Emmery et al. (2021), which in turn extends that of TextFooler (Jin et al., 2020) with transformer-based perturbations.

Target words T(D, f′) are selected and ranked based on their contribution to the classification of a document. This importance, or omission score (Samek et al., 2017; Kádár et al., 2017, among others), is calculated by deleting the word at a given position t in D, denoted as D∖t. The omission score is then o_y(D) − o_y(D∖t), where o_y is the logit score of substitute classifier f′ for class y. Intuitively, this would provide us with highly toxic words, or text parts related to bullying, which can be perturbed in some way.
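To make the target selection step concrete, below is a minimal sketch (not the authors' implementation) of omission-score ranking. It assumes a fitted scikit-learn text-classification pipeline exposing predict_proba standing in for the substitute classifier f′; the paper describes logit scores, whereas this sketch uses the positive-class probability, and the 0.005 cut-off mirrors the threshold reported later in the experimental setup. Function and variable names are our own.

```python
# Minimal sketch of omission-score-based target word selection (illustrative,
# not the authors' code). `substitute_clf` is any fitted scikit-learn-style
# text classifier exposing predict_proba; documents are whitespace-tokenised.
import numpy as np

def omission_scores(tokens, substitute_clf, positive_label=1):
    """Score each token by how much deleting it lowers the positive-class score."""
    base = substitute_clf.predict_proba([" ".join(tokens)])[0][positive_label]
    scores = []
    for t in range(len(tokens)):
        reduced = " ".join(tokens[:t] + tokens[t + 1:])   # D with token t removed
        score = substitute_clf.predict_proba([reduced])[0][positive_label]
        scores.append(base - score)                        # o_y(D) - o_y(D \ t)
    return np.array(scores)

def select_targets(tokens, substitute_clf, min_score=0.005):
    """Return token positions ranked by omission score, above a cut-off."""
    scores = omission_scores(tokens, substitute_clf)
    return [int(t) for t in np.argsort(-scores) if scores[t] >= min_score]
```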
As we intend to improve lexical variation, we focus on proposing synonyms as perturbation candidates. Zhou et al. (2019) condition BERT's masked language modeling on a given word by providing the original word's embedding to the masked position. They apply Dropout (Srivastava et al., 2014) as a surrogate mask, and show this to produce a top-k of potential synonyms. The predicted words at the Dropout-masked position by some separate transformer model f_cnd are then our candidates C(T, f_cnd). To rank the candidates, they use a contextual similarity score:

score(D, D′, t) = Σ_i α_{i,t} · Λ(h(D_i), h(D′_i)),

where h(D_i) is the concatenation of the last four layers of f_cnd for a given i-th token in document D; D′ = (w_0, ..., c_t, ..., w_n) is the perturbed document D where target word w_t has been replaced with candidate c at the index of t; Λ is their cosine similarity; and α_{i,t} is the average self-attention score, across all heads in all layers, from the i-th token to the t-th position in D. Finally, we sanitize the candidates: filtering single characters, plural and capitalized forms of the original words, sub-words, and sentence-level duplicates.

BERT's associated tokenizers break down unknown words into word-pieces (Wu et al., 2016), meaning there is no single embedding to apply Dropout to. Zhou et al. (2019) do not mention how they handle such cases; however, they are problematically common for our task (see Table 1). We therefore extend their method with a back-off: if w_t is out-of-vocabulary (OOV), we collapse the word-pieces into one, and zero the embedding at that position (which then acts as a mask). Words other than w_t that are OOV remain word-pieces. We employ and compare this lexical substitution method to produce new positive instances. We evaluate if the perturbed documents hold up as adversarial samples, and if they can be used for data augmentation.

For our corpora (all are English), we use two question-answering-style social networks that allow for anonymous posting: Formspring (Reynolds et al., 2011) and Ask.fm. The latter features multi-label annotation, but is binarized (any indication of bullying is labeled positive) to be compatible with the other corpora. These corpora are significantly larger than the rest, as their platforms are typically used by young adults, and notorious for their bullying content (Binns, 2013). Two long-form platforms can be found in YouTube (Dinakar et al., 2011) and MySpace (Bayzick et al., 2011), the latter of which has instances of ten posts. The smallest two are from Twitter, both collected using topical keywords (Xu et al., 2012; Bretschneider et al., 2014). The corpora's statistics can be found in Table 2.

The models were implemented using HuggingFace's transformers (Wolf et al., 2020) library. Dependency versions can be found in our repository. All experiments follow the same model-agnostic approach: target words are determined through substitute classifier f′ (i.e., a distinct model trained on a different corpus than used in any other experiments). Generally, this is Gaussian Naive Bayes over tf·idf-weighted vectors, trained on Formspring. We additionally investigate a pre-trained version of BERT fine-tuned on the Jigsaw dataset (Hanu et al., 2020, unitary/toxic-bert) as a transformer-based alternative for f′ (denoted by a '+'). While the task it has been fine-tuned on is slightly different, our assumption is that this model will have better representations and a larger vocabulary, which might make it more effective in choosing target words. Table 6 shows output examples of the various models.
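As a rough, non-authoritative sketch of the Dropout-based candidate generation described above, the snippet below applies dropout to the target word's input embedding as a surrogate mask and reads substitutes off the masked language modeling head. The attention-weighted ranking score and the OOV back-off (collapsing word-pieces and zeroing the embedding) are omitted for brevity, and the helper name, k, and the light sanitization are our own simplifications.

```python
# Rough sketch of Dropout-based substitute candidate generation (illustrative).
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("bert-large-cased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-large-cased")

def dropout_candidates(text, target_word, k=10, p=0.2):
    """Propose top-k in-context substitutes for `target_word` by applying
    dropout to its input embedding before the masked-LM head."""
    enc = tok(text, return_tensors="pt")
    ids = enc.input_ids[0].tolist()
    # locate the (first word-piece of the) target word in the encoded input
    target_ids = tok(target_word, add_special_tokens=False).input_ids
    pos = next(i for i in range(len(ids) - len(target_ids) + 1)
               if ids[i:i + len(target_ids)] == target_ids)
    with torch.no_grad():
        embeds = mlm.get_input_embeddings()(enc.input_ids)
        # dropout on the target embedding only acts as a surrogate mask
        embeds[0, pos] = F.dropout(embeds[0, pos], p=p, training=True)
        logits = mlm(inputs_embeds=embeds,
                     attention_mask=enc.attention_mask).logits[0, pos]
    cands = tok.convert_ids_to_tokens(logits.topk(k).indices.tolist())
    # light sanitization: drop word-pieces and trivial variants of the original
    skip = {target_word.lower(), target_word.lower() + "s"}
    return [c for c in cands if not c.startswith("##") and c.lower() not in skip]
```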
The probability of zeroing embedding dimensions in Dropout BERT is set to 0.2, as we empirically found that values around 0.3 often do not result in synonyms. We set the minimum required omission score to 0.005 for tokens to be considered for substitution, which yields 1-3 target words per document on average. The substitutions do not incorporate prior substitutions; they are done simultaneously (best candidates first) and exhaustively (i.e., while candidates for all slots are available) for a maximum of five samples. For all BERT models, we use the pre-trained bert-large-cased; for BART, we use bart-large. We also report experiments where we use a fine-tuned toxicity version of pre-trained BERT (Caselli et al., 2021, GroNLP/hateBERT) for f_cnd in Dropout BERT (here referred to as Hate BERT, and, when using a fine-tuned substitute classifier, Hate BERT+). The idea here is similar to that of using Dropout BERT+; domain-specific vocabularies will likely result in better and more varied substitutions. Hate BERT(+) uses a different BERT-based toxicity model than the '+' model for f′, in order to keep these selections model-agnostic.

Baselines. The substitution models are compared with two baselines adapted from related work. Both have been shown to improve toxicity detection (Gehman et al., 2020; Quteineh et al., 2020; Yoo et al., 2021), but have (to our knowledge) not been applied for augmenting cyberbullying content. Firstly, we employ the common data augmentation baseline: Easy Data Augmentation (Wei and Zou, 2019, or EDA). EDA applies n of the following operations to an input text: synonym replacement using WordNet (Miller, 1995), and random word insertions, swaps, and deletions. We set the number of augmentations made by EDA similar to that of our other models. Secondly, we employ fully unsupervised augmentation with GPT-2 (Radford et al., 2019, implemented in the pre-trained gpt2-large). We use the positive instances (i.e., documents containing cyberbullying) of each dataset as prompt, with a maximum input length of 30 tokens and a maximum generated output length of 70 tokens, as we found that toxicity is prevalent in the first part of the generation (Gehman et al., 2020, made similar observations). Table 3 shows examples of the output and the eventual divergence from toxicity; it pairs an original instance used as prompt (e.g., "okay and stop calling me jaky you cock sucker. you know it you fucking pussy. you know you are an evil fucking bitch that only cares about getting her name in newspapers. i bet if you saw my face you wouldnt even believe i said 'oh yeah i think i can fuck you.' ...") with the generated text (up to 70 tokens) that is subsequently used as an augmented instance. GPT-2 in particular tends to descend into literary content after too many tokens are generated; we also experimented with GPT-3's curie (Brown et al., 2020) from the OpenAI beta API (https://beta.openai.com/), but found systematically lower performance across all experiments compared to GPT-2, so these results are not included.

We follow recent state-of-the-art results (Elsafoury et al., 2021a) for our main classification model, and fine-tune all BERT-based models for 10 epochs with a batch size of 32 and a learning rate of 2e−5, as suggested by Devlin et al. (2019). Accordingly, we set the maximum sequence length to 128, and insert a single linear layer after the pooled output. For the transformer experiments, we fine-tune incrementally: first on the original set, then on the augmented training set (including the original instances), both using the same configuration (learning rate, batch size, etc.), except for running the second stage for 2 epochs. This should offer performance advantages (Yang et al., 2019), as well as increase model stability; we found that fine-tuning on the mixed set in a single pass instead renders augmentation ineffective for all models we tested.
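A hedged sketch of this incremental fine-tuning scheme is given below, using the HuggingFace Trainer with the hyperparameters stated above (a single linear classification head on the pooled output is what AutoModelForSequenceClassification provides). The dataset variables are hypothetical placeholders; inputs are assumed to be tokenized elsewhere with max_length=128.

```python
# Hedged sketch of the two-stage fine-tuning scheme (not the authors' code).
from transformers import (AutoModelForSequenceClassification, Trainer,
                          TrainingArguments)

def finetune(model, train_dataset, epochs, output_dir):
    """Fine-tune with the hyperparameters reported in the text."""
    args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=epochs,
        per_device_train_batch_size=32,
        learning_rate=2e-5,
    )
    Trainer(model=model, args=args, train_dataset=train_dataset).train()
    return model

clf = AutoModelForSequenceClassification.from_pretrained(
    "bert-large-cased", num_labels=2)
# stage 1: 10 epochs on the original training split (hypothetical variable)
# clf = finetune(clf, original_train_set, 10, "ckpt-original")
# stage 2: 2 further epochs on original + augmented instances
# clf = finetune(clf, augmented_train_set, 2, "ckpt-augmented")
```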
We compare BERT against a previously tried-and-tested (Emmery et al., 2019) 'simple' linear baseline: the Scikit-learn (Pedregosa et al., 2011) implementation of a Linear Support Vector Machine (SVM; Cortes and Vapnik, 1995; Fan et al., 2008) with binary Bag-of-Words (BoW) features, using hyperparameter ranges from Hee et al. (2018). Training of the SVM and BERT classifiers is done on a merged set of all the cyberbullying corpora in Table 2, except for Formspring (reserved for substitute classifier f′), always on the same 90% split, augmented data or not. The SVM is tuned via grid search and nested, stratified cross-validation (with ten inner and three outer folds, no shuffling, using 10% splits). The best settings (1-3-grams, class balancing, squared hinge loss, and C = 0.01) are used in all experiments. For both models, we also experiment with prepending a special token (Daumé III, 2007; Caswell et al., 2019, follow a similar approach) to the augmented instances. As per recommendations in Kumar et al. (2020), the token is not added to the vocabulary. These models are referred to as either f or f_aug in Figure 1, depending on whether they were trained on augmented data. If not, we skip the 2 fine-tuning epochs for BERT.

To evaluate our classifiers on the main classification task, we use F1-scores. The impact of the substitution models on classification performance is measured via a decrease in True Positive Rate (TPR) between regular and substituted samples (i.e., how many previously positively classified samples are classified as negative after perturbation). Note that in these experiments, f′ is the same (either Naive Bayes or BERT); therefore, the substitute classifier always chooses the same target words to perturb. The number of samples depends on the quality of the candidates the models propose. A TPR decrease by itself might also indicate an augmented instance is not toxic anymore; hence, to evaluate the semantic consistency of the samples produced by the various augmentation models, we calculate both METEOR (Banerjee and Lavie, 2005; Denkowski and Lavie, 2011), using the implementation from nltk, and BERTSCORE (Sellam et al., 2020) between the original sentences and their respective augmented samples. METEOR measures flexible uni-gram token overlap, and BERTSCORE transformer-based similarity with respect to the contextual sentence encoding.

We run our substitution pipeline (visualized in Figure 1) on the positive instances X_pos of some given corpus (or the entire collection), using the different models discussed in Section 3.2 for f′ and f_cnd. Per such configuration, this generates augmented samples X′_pos (up to five per original instance). These can either be classified as is, or mixed in with the original corpus, producing the augmented corpus X′, with X′_train and X′_test splits. (Figure 2: Decrease in True Positive Rate of the SVM and BERT classifiers after the respective substitution models have been applied; lower is more adversarial.)
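For reference, the linear baseline and the augmentation marker described above could be set up along the following lines; this is a minimal scikit-learn sketch using the reported best settings, and the marker string '<aug>' is a hypothetical choice, as the exact token is not given in the text.

```python
# Minimal sketch of the SVM baseline with the reported best settings.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

svm_baseline = make_pipeline(
    CountVectorizer(binary=True, ngram_range=(1, 3)),   # binary BoW, 1-3-grams
    LinearSVC(C=0.01, class_weight="balanced", loss="squared_hinge"),
)

def mark_augmented(texts, token="<aug>"):
    """Prepend a marker token to augmented instances (hypothetical token string)."""
    return [f"{token} {t}" for t in texts]

# usage with hypothetical variables:
# svm_baseline.fit(train_texts + mark_augmented(aug_texts), train_labels + aug_labels)
```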
Using this configuration, we run our three experiments:

Experiment 1. We gauge the lexical variation (and hence the 'adversarial' character) in our augmented samples via f(X′_pos). F1-scores and TPR changes close to f(X_pos) imply the substitutions are similar to the original words. We confirm this meaning preservation through semantic consistency metrics for X′_pos.

Experiment 2. Here, we train via the data augmentation scheme discussed in Section 3.3; i.e., we fine-tune for 2 epochs on X′_train. The resulting augmented classifier is referred to as f_aug, which we evaluate on the original X_test. An increase in F1-score with respect to f(X_test) indicates the augmentation is a success.

Experiment 3. We measure robustness against perturbations, and transferability, via f_aug(X′_test), by evaluating f_aug performance across different substitution models producing perturbed samples in X′_test; i.e., in a many-to-many evaluation. Any TPR increase implies augmentation improves robustness against perturbations. A total TPR higher than that of f(X_test) (Plain) does not necessarily increase the F1-score (from Experiment 2). If this increase holds for multiple perturbation models, this implies the augmentations are transferable.

Here, we discuss the results of our three Experiments (Sections 4.1-4.3) and close with suggestions for future work. The main results can be found in Tables 5 and 7.

The results for this experiment (Experiment 1) can be found under the 'Samples' row in Table 5.

Classifier Performance. It can be seen that unsupervised (prompt conditioning) samples (i.e., GPT-2) are the most difficult to classify. This is to be expected, as the generated output is not always toxic. However, it is arguably rather remarkable that a large number of the generated sentences are labeled positive by the cyberbullying classifier. This confirms that (contextually more) harmful content is generated (as illustrated in Table 3), as also shown for toxicity detection by Gehman et al. (2020). Regarding the substitution models, we see that the fine-tuned target selector models ('+') show the most 'adversarial' behavior, likely providing lexical diversity on words more important to the content classification. BART induces similar performance drops, but often inserts noise (see Table 6). The Dropout models seem to produce samples that are less diverse, but still show a solid .1 drop in F1-score (17.67% on average). To emphasize, this decrease is based on untargeted substitutions; i.e., without selecting the substituted words so as to change the predictions of either f′ or f.

Adversarial Samples. Additional analyses can be found in Figure 2. We observe the same patterns per substitution model as in Table 5 (to equalize the length, we matched the number of non-augmented test set instances for this experiment), with the BERT classifier suffering around 20% less in TPR compared to the SVM. This difference can partly be explained by the substitution models sampling from the same model as we fine-tuned for the classification task (bert-large-cased). As can be observed in this Figure, the difference is smaller when this is not the case ('+' models). This experiment not only underlines the strong focus on lexical cues from linear classifiers (which raises its own issues; see, e.g., Zhou et al. (2021) for work on bias and debiasing), but also that transformer models are not immune to lexical variation, even when candidates are sampled from their own language model. This provides further evidence in line with research from Elsafoury et al. (2021a) and Elsafoury et al. (2021b) (see Section 5).

Semantic Consistency of Samples. Here, we compared X′_pos with X_pos as a reference. The results for METEOR and BERTSCORE of these pairs can be found in Figure 3.
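To make the TPR-based measures used in Experiments 1 and 3 concrete, a small helper such as the following could be used (illustrative only; the classifier and the lists of known-positive texts are placeholders).

```python
# Illustrative helper for the TPR-based robustness measures.
import numpy as np

def true_positive_rate(classifier, positive_texts):
    """Fraction of known-positive texts classified as positive."""
    preds = np.asarray(classifier.predict(positive_texts))
    return float((preds == 1).mean())

def tpr_change(classifier, positives, perturbed_positives):
    """Negative values indicate the perturbations are adversarial (Experiment 1);
    positive values for an augmented classifier indicate improved robustness
    (Experiment 3)."""
    return (true_positive_rate(classifier, perturbed_positives)
            - true_positive_rate(classifier, positives))
```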
Generally, these confirm the trend from the previous two parts of the experiment: models that have higher semantic consistency have less effect on classification performance. A clear difference in METEOR can be observed between EDA and the other models. This is likely due to both the metric and the model using WordNet, resulting in a bias in favor of EDA. GPT-2 is a strong outlier, as it generates new data. The semantic consistency scores seem comparable to, and at times slightly better than, previous lexical substitution work (Shetty et al., 2018; Emmery et al., 2018; Mathai et al., 2020, although these are all explicitly adversarial). Regarding the samples themselves, we noticed that the transformer-based models often noticeably break down in terms of semantic preservation for the lower-ranked candidates (see Table 6). For the models that do not use soft semantic constraints (such as Dropout), we already find antonyms, and generally ungrammatical and incoherent sentences, within the top 5 candidates. Interestingly, at the same time, BERTSCORE assigns scores to antonyms that are comparable to those for (intuitively) better substitution candidates. Given these observations, if mimicking adversarial behavior is to be given more weight, one should consider limiting the number of augmented samples, and tuning the omission score cut-off might prove to be worthwhile.

(Table 5: BERT-based cyberbullying classification scores (F1) for Experiments 1 (under f(X′_pos)) and 2 (under f_aug(X_test)). Classifiers are trained and tested on the indicated corpus (from Section 3.1); 'Merged' is their combination. The other columns indicate, respectively: no substitutions (Plain), EDA, BERT-based models (where '+' indicates f′ uses BERT rather than an SVM, and 'Hate' that f_cnd is pre-trained), BART, and GPT-2. Highlighted cells indicate that non-augmented performance was highest, bold indicates the highest performance per augmentation model. Standard deviation (small script) is reported over five runs with different seeds.)

Given the strong performance effect the substitution models had on our classifiers, it seems plausible that they might prove to produce effective samples for augmentation purposes. Looking at the lower portion of Table 5, however, we can see that augmentation does not improve performance on the two biggest sets (Merged and Ask.fm). Dropout BERT seems to improve performance for one of the smaller sets (Twitter II), but overall, EDA is generally a close contender with, if not more effective than, all of the more 'advanced' models. Interestingly, the transformer-based models seem to yield sizable improvements on the MySpace set, with GPT-2 increasing it most. The latter might be attributed to the low Type-Token Ratio in this set (see Table 2).

(Table 6: Augmentations including the top 1 and 5 candidates from BERT (BE), Dropout BERT (Dr BE), and BART (BA), and the BERTSCORE (BSC) using the original text as reference, showing quality degradation (not well reflected in the metric) when the sample size increases. BERT suggests antonyms; BART fails semantically.)

Interpretation. Generally, none of these methods (baselines or substitution-based augmentation) seem to yield the same performance improvement as observed in toxicity work (Ibrahim et al., 2018; Jungiewicz and Smywinski-Pohl, 2019).
However, note that, as we were interested in simulating potentially adversarial behavior, we conducted model-agnostic augmentation (that is, given an unknown attacker, or noise). Hence, while we might employ these models in an explicit adversarial training scheme to directly improve model performance, this would require extensive transferability evaluations (typically requiring larger, higher-quality datasets) and would only satisfy one dimension (data). Given this, we argue an improvement in classifying the augmented sets, as in Experiment 1 (Section 4.1), is more significant.

The results of Experiment 3 can be found in Table 7. We report TPR changes for f(X′_test) (under initial TPR), and f_aug(X′_test) per setting used to create f_aug and X′, respectively.

Robustness. Generally, it can be observed that (unsurprisingly) the augmented classifiers increase TPR most when the substitutions come from the same type of model. Hence, the 'second-best' TPR increases are more interesting. It can be observed that BERT and BART each show the strongest TPR improvements on three sets, followed by the '+' models with one set each. This is a remarkable result, contrasting with Experiments 1 and 2, although it aligns with the observation from the semantic consistency scores that more conservative models are less effective augmenters. Hence, it seems that substitutions that are more diverse and distant from the original instances provide better robustness against perturbations. While their output might be less semantically consistent, this is generally not a relevant criterion when one is only interested in improving task robustness.

Transferability. Systematic performance gain across all substitution models (i.e., transferability) is the final indicator of augmentation utility. First, it must be noted that the TPR differences between 'same-model' and distinct model pairs are smaller for the transformer-based models (.026 on average) than for EDA (.156). Lexical substitution using out-of-the-box BERT, in addition to high robustness, also achieves the highest transferability (mean .158) across substituted sets.

Performance Trade-Off. The f_aug results from this experiment should be contextualized against the performance trade-off in F1-score from Experiment 2. Using the information in Table 7, it can be inferred that the best performing model, BERT, actually improves absolute TPR on average; adding its .158 mean TPR increase to the .399 average (= .557) exceeds the .537 non-augmented TPR. However, as we showed in Experiment 2 (Table 5), this does not improve overall task performance; rather, it decreases it. For BERT, the F1-score slightly drops (.025-.030) on all sets. Hence, this is not a silver bullet, and such trade-offs should be considered when deploying these augmentation models to improve robustness against lexical variation.

Limitations and Future Work. A substantial hurdle toward deploying the presented models for augmentation purposes is time. Upsampling the positive instances shown in Table 4 (5,350 total) with the transformer-based models takes 2-3 hours per model on a single NVIDIA Titan X (Pascal). This impacts the number of parameters that can be tweaked in reasonable time when using this architecture (such as the omission score cut-off, the cosine similarity used for ranking, and dropout values, all of which we set empirically).
Such computational demand is acceptable for smaller datasets like ours, and the augmentations can be run 'offline' (i.e., one time only), but these limitations should certainly be taken into account when scaling is among one's desiderata. Hence, recent work on decreasing the number of queries for related models (Chauhan et al., 2021) is particularly relevant for future work. Additionally, there are myriad components with which the base architecture we presented here could be improved. Most are discussed in Emmery et al. (2021); however, some new work is specifically of interest to data augmentation, such as improving the substitutions using beam search (as opposed to the simultaneous rollout we used in the current work). More broadly, adversarial training (Si et al., 2021; Pan et al., 2021), implementing more robust stylometric features (Markov et al., 2021), or model-based weightings of the augmentation models could be explored; e.g., by selecting instances with a generation model in the loop (Anaby-Tavor et al., 2020). This could be a particularly worthwhile option when focusing on conversation scopes, rather than message-level cyberbullying content (Emmery et al., 2019).

Our work combines multiple sizeable areas of research, to the extent that they have respectively produced several surveys (Fortuna and Nunes, 2018; Gunasekara and Nejadgholi, 2018; Mishra et al., 2019; Banko et al., 2020; Madukwe et al., 2020; Muneer and Fati, 2020; Salawu et al., 2020; Jahan and Oussalah, 2021; Mladenovic et al., 2021); hence, we provide a concise overview of the work directly related to our experimental setup. For all tasks, the issue of generalization seems a particularly popular subject of study: for cyberbullying, Emmery et al. (2019) and Larochelle and Khoury (2020) conclude there is little consensus in labeling practices and little overlap between datasets, and that a combination of all datasets seems to transfer performance best. For hate speech, Salminen et al. (2020) and Fortuna et al. (2021) draw similar conclusions, showing that general forms of harm (e.g., toxic, offensive) generalize better than specific ones, such as hate speech. Finally, Nejadgholi and Kiritchenko (2020) provide unsupervised suggestions to address topic bias in data curation, potentially improving generalization. We draw from these works through cross-domain experiments on individual and combined corpora for cyberbullying, as well as pre-training on more general subtasks such as toxicity.

Recent cyberbullying work (with, e.g., Reynolds et al., 2011; Xu et al., 2012; Nitta et al., 2013; Bretschneider et al., 2014; Dadvar et al., 2014; Van Hee et al., 2015, being seminal work) has primarily focused on deploying Transformer-based models (Vaswani et al., 2017); by and large fine-tuning (e.g., Swamy et al., 2019; Paul and Saha, 2020; Gencoglu, 2021) or re-training (Caselli et al., 2020) BERT. It is worth noting that Elsafoury et al. (2021a; 2021b) show that although fine-tuning BERT achieves state-of-the-art performance in classification, its attention scores do not correlate with cyberbullying features, and they expect generalization of such models to be subpar. In our experiments, we employ similar domain-specific fine-tuned BERT models, and gauge generalization, sensitivity to perturbations, and the effects of augmentation to potentially improve the former.

Adversarial attacks on text (see, e.g., Zhang et al., 2020; Roth et al., 2021, for broader surveys) can roughly be divided into character-level and word-level attacks.
The former relates to purposefully misspelling or otherwise symbolically replacing text (e.g., fvk you, @ssh*l3) to subvert algorithms (Eger et al., 2019; Kurita et al., 2019). Wu et al. (2018) show such attacks on toxic content can be effectively deciphered. Word-level attacks are arguably straightforward for humans, but significantly more challenging to automate, as they require preservation of toxicity; i.e., the semantics of the sentence. Previous work has investigated the effect of minimal edits on high-impact toxicity words, replacing them with harmless variants (Hosseini et al., 2017; Brassard-Gourdeau and Khoury, 2019). Our current work is similar to that of Tapia-Téllez and Escalante (2020), and closest to that of Guzman-Silverio et al. (2020), who apply simple synonym replacement using EDA, as well as adversarial token substitutions, the latter using TextFooler on misclassified instances. We extend BERT-based lexical substitution (Zhou et al., 2019) for model-agnostic perturbations and data augmentation.

Finally, regarding data augmentation for online harms (Bayer et al., 2021; Feng et al., 2021, among others, provide more general-purpose overviews for various natural language data), toxicity work partly overlaps with work on adversarial attacks on text; for example, the synonym replacement from Ibrahim et al. (2018) and Jungiewicz and Smywinski-Pohl (2019), which are distinctly either unsupervised or semi-supervised. Another such example can be found in Rosenthal et al. (2021), who employed democratic co-training to collect a large corpus of toxic tweets, and Gehman et al. (2020) find triggers that produce toxic content, querying GPT-like models (Radford et al., 2018). Fully unsupervised augmentation has also been employed in Quteineh et al. (2020) and Yoo et al. (2021). In our experiments, we use a pipeline of models for lexical substitution, and compare it to GPT generations (Radford et al., 2019; Brown et al., 2020).

In this work, we applied model-agnostic, transformer-based lexical substitutions to the task of cyberbullying classification. We show these perturbations significantly decrease classifier performance. Augmenting the training data with these perturbed instances as new samples slightly trades off task performance for improved robustness against lexical variation. Future work should further investigate the use of these models to simulate and mitigate the effect of adversarial behavior in content moderation.

Our research strongly relied on openly available resources. We thank all whose work we could use.

References

Anaby-Tavor, A., Carmeli, B., Goldbraich, E., Kantor, A., Kour, G., Shlomov, S., Tepper, N., and Zwerdling, N. (2020). Do not have enough data? Deep learning to the rescue! In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pages 7383-7390. AAAI Press.
METEOR: An automatic metric for MT evaluation with improved correlation with human judgments
A unified taxonomy of harmful content
A survey on data augmentation for text classification
Detecting the presence of cyberbullying using computer software
Facebook's ugly sisters: Anonymity and abuse on Formspring and Ask.fm
Subversive toxicity detection using sentiment information
Detecting online harassment in social networks
Language models are few-shot learners
HateBERT: Retraining BERT for abusive language detection in English
Tagged back-translation
Target model agnostic adversarial attacks with query budgets on language understanding models
Support-vector networks
Experts and machines against bullies: A hybrid approach to detect cyberbullies
Frustratingly easy domain adaptation
Meteor 1.3: Automatic metric for reliable optimization and evaluation of machine translation systems
BERT: Pre-training of deep bidirectional transformers for language understanding
Modeling the detection of textual cyberbullying
HotFlip: White-box adversarial examples for text classification
Text processing like humans do: Visually attacking and shielding NLP systems
When the timeline meets the pipeline: A survey on automated cyberbullying detection
Does BERT pay attention to cyberbullying?
Style obfuscation by invariance (Proceedings of the 27th International Conference on Computational Linguistics)
Current limitations in cyberbullying detection: On evaluation criteria, reproducibility, and data scarcity
Adversarial stylometry in the wild: Transferable lexical substitution attacks on author profiling
LIBLINEAR: A library for large linear classification
A survey of data augmentation approaches for NLP
A survey on automatic detection of hate speech in text
How well do hate speech, toxicity, abusive and offensive language classification models generalize across datasets?
RealToxicityPrompts: Evaluating neural toxic degeneration in language models
Cyberbullying detection with fairness constraints
A review of standard text classification practices for multi-label toxicity identification of online content
Transformers and data augmentation for aggressiveness detection in Mexican Spanish
Detection and fine-grained classification of cyberbullying events
Automatic detection of cyberbullying in social media text
Deceiving Google's Perspective API built for detecting toxic comments
Imbalanced toxic comments classification using data augmentation and deep learning
A systematic review of hate speech automatic detection using natural language processing
Is BERT really robust? A strong baseline for natural language attack on text classification and entailment
Towards textual data augmentation for neural networks: Synonyms and maximum loss
Representation of linguistic form and function in recurrent neural networks
Data augmentation using pre-trained transformer models
Towards robust toxic content classification
Generalisation of cyberbullying detection
BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension
TextBugger: Generating adversarial text against real-world applications
In data we trust: A critical analysis of hate speech detection datasets
Exploring stylometric and emotion-based features for multilingual cross-domain hate speech detection
Cyberbullying: A preliminary assessment for school personnel
Adversarial black-box attacks on text classifiers using multi-objective genetic optimization guided by deep networks
WordNet: A lexical database for English
Tackling online abuse: A survey of automated abuse detection methods
Cyber-aggression, cyberbullying, and cyber-grooming: A survey and research challenges
A comparative analysis of machine learning techniques for cyberbullying detection on Twitter
Detecting abusive language on online platforms: A critical analysis
On cross-dataset generalization in automatic detection of online abuse
Detecting cyberbullying entries on informal school websites based on category relevance maximization
Probing toxic content in large pre-trained language models
Improved text classification via contrastive adversarial training
CyberBERT: BERT for cyberbullying identification (Multimedia Systems)
Scikit-learn: Machine learning in Python
Textual data augmentation for efficient active learning on tiny datasets
Improving language understanding by generative pre-training
Language models are unsupervised multitask learners
Using machine learning to detect cyberbullying
Automatic cyberbullying detection: A systematic review
SOLID: A large-scale semi-supervised dataset for offensive language identification
Token-modification adversarial attacks for natural language processing: A survey
Approaches to automated detection of cyberbullying: A survey
Developing an online hate classifier for multiple social media platforms
Evaluating the visualization of what a deep neural network has learned
BLEURT: Learning robust metrics for text generation
A4NT: Author attribute anonymity by adversarial training of neural machine translation
Better robustness by more coverage: Adversarial and mixup data augmentation for robust finetuning
Dropout: A simple way to prevent neural networks from overfitting
Studying generalisability across abusive language detection datasets
Data augmentation with transformers for text classification
Preadolescents' and adolescents' online communication and their closeness to friends
Attention is all you need
Directions in abusive language training data, a systematic review: Garbage in, garbage out
EDA: Easy data augmentation techniques for boosting performance on text classification tasks
Transformers: State-of-the-art natural language processing
Google's neural machine translation system: Bridging the gap between human and machine translation
Decipherment for adversarial offensive language detection
Learning from bullying traces in social media
Data augmentation for BERT fine-tuning in open-domain question answering
GPT3Mix: Leveraging large-scale language models for text augmentation
Adversarial attacks on deep-learning models in natural language processing: A survey
Generating natural language adversarial examples through an improved beam search algorithm
BERT-based lexical substitution
Challenges in automated debiasing for toxic language detection