key: cord-0441998-hncni2l6 authors: Alsmadi, Izzat; Ahmad, Kashif; Nazzal, Mahmoud; Alam, Firoj; Al-Fuqaha, Ala; Khreishah, Abdallah; Algosaibi, Abdulelah title: Adversarial Attacks and Defenses for Social Network Text Processing Applications: Techniques, Challenges and Future Research Directions date: 2021-10-26 journal: nan DOI: nan sha: b60c7b63fdf9269734ffc8a55746c57b87c09fa6 doc_id: 441998 cord_uid: hncni2l6 The growing use of social media has led to the development of several Machine Learning (ML) and Natural Language Processing(NLP) tools to process the unprecedented amount of social media content to make actionable decisions. However, these MLand NLP algorithms have been widely shown to be vulnerable to adversarial attacks. These vulnerabilities allow adversaries to launch a diversified set of adversarial attacks on these algorithms in different applications of social media text processing. In this paper, we provide a comprehensive review of the main approaches for adversarial attacks and defenses in the context of social media applications with a particular focus on key challenges and future research directions. In detail, we cover literature on six key applications, namely (i) rumors detection, (ii) satires detection, (iii) clickbait&spams identification, (iv) hate speech detection, (v)misinformation detection, and (vi) sentiment analysis. We then highlight the concurrent and anticipated future research questions and provide recommendations and directions for future work. Social media has become a major communication channel in everyday life. According to the Datareportal report published on 27th January 2021 1 more than half of the world population (4.48 billion) use social media. Among the social media platforms, the most popular ones include Facebook, Youtube, WhatsApp, Instagram, WeChat, and Twitter. Such platforms are widely used for information dissemination and consumption [1] . Nowadays, social media platforms form an important gateway for connecting people, spreading thoughts, and linking business entities to customers through question advertisement, review collection, and feedback [2] . The key advantage of such platforms is the wide scope of the audience that allows individuals to directly connect and share content. For example, expressing an opinion, sentiment, and emotion towards products, entities, individuals, and/or society [3, 4, 5, 6] . The reach to a large audience and freedom of generating and sharing their content also poses several challenges. For instance, these platforms have also been used for political or financial gains (e.g., Brexit) [7] , which has been typically done by targeting focused communities. Malicious actors have also been using them to spread rumors and mis/disinformation [8, 9] . These platforms has also been used a ground for hate speech [10, 11] , racism [12, 13] , xenophobia [14] , and prejudice [15] . Two evolving major factors influence the growing attacks on machine learning algorithms and applications: (1) The rapidly growing dependence on automated decisions, and (2) The level and nature of influence such attacks can cause or trigger. In Natural Language Processing (NLP) in general and Online Social Networks (OSNs) in particular, perhaps the Russian Troll Tweets (RTT) dataset and the influence in the USA 2016 election is one major example. RTT showed a recent trend in cyber attacks where humans, rather than their computing devices are targets of such attacks. 
Social bots and trolls are examples of text generators in the sense of Generative Adversarial Network (GAN) models. The discriminators are, first, the social network websites and their efforts to eliminate social bots/trolls and counter the spread of mis/disinformation. The other category of discriminators is the other users of social networks, who are the targets of those bots/trolls and their media campaigns. Autocratic countries see social network websites such as Twitter and Facebook as threats or weapons [16]. This explains why many of those countries have state-sponsored agencies that are active in creating social media content for propaganda and anti-propaganda. They create social bots or trolls to target rival media, and also to target their own citizens. The widespread use of social bots can be observed in recent social and political movements such as the Arab Spring, the 2013-2014 protests in Ukraine, etc.

To address such issues, a major research field has emerged, namely computational social science [17, 18], a major part of which is the analysis of social media content. The focus is on the automatic analysis of social media content using Machine Learning (ML) and Natural Language Processing (NLP) techniques. A major interest came from the NLP research community, which led to the organization of workshops [19, 20, 21, 22] and evaluation campaigns [23, 24, 25, 26, 27, 28, 29]. These efforts have made significant progress in the field in terms of automatically identifying such content, debunking it, and supporting organizations, social media platforms, and society as a whole. While such progress has been made, the discovery of adversarial examples [30, 31] raised a major concern across research communities and has been studied for many different problems including image classification [32, 33], security [34], malware detection [35, 36], robot-vision systems [37], medical diagnosis systems [38, 39], mis/disinformation [8, 9], fake news [40], and rumors [41]. An adversarial attack refers to the process of crafting an adversarial example, which is intended to fool a model through adversarial perturbation. Adversarial attacks can be launched with different intentions, including influencing a classifier's decision as well as violating security. One recent example of such attacks is the exploitation of the vulnerabilities of Microsoft's Tay chatbot, which was shut down due to racist tweets [42]. In the context of social media content analysis, attackers (i.e., individuals who want to misuse social media and spread inappropriate information) can perturb their content to evade the AI-based content filters deployed by social media platforms. For instance, an adversarial example may fool a hate speech detection system, which can result in the misclassification of toxic and abusive content as acceptable and legitimate content. To address the problem of adversarial attacks on ML models, several defenses have been reported in the literature, including adversarial training with noise [43], gradient masking [44], defensive distillation [45], ensemble adversarial learning [46], and feature squeezing [47]. Adversarial ML has been well explored for images [32, 33], and there are also significant efforts on adversarial ML for text [50] and speech [52, 53, 54]. In this paper, we mainly target adversarial attacks and the corresponding defense methods for text-based social media applications.
We particularly focus on key applications of social networks, namely (i) rumors, (ii) satire, (iii) clickbait and spam, (iv) hate speech, (v) misinformation, and (vi) sentiment. Since text is a primary communication means in social media platforms [55], the paper also provides detailed taxonomies of text adversarial attacks and the corresponding defense techniques, which makes it self-contained in terms of relevant concepts. The paper also discusses key challenges, limitations, and future research directions for text-based social media applications.

The literature on adversarial text analysis is quite rich [49, 50, 51]. Wang et al. [49] provide an overview of the literature on adversarial attacks and corresponding defense strategies for DNN-based English and Chinese text analysis systems. Zhang et al. [50] provide a more detailed survey of adversarial attacks on deep learning-based models for NLP with a particular focus on methods for generating adversarial textual examples. Xu et al. [51], on the other hand, review the literature on adversarial attacks and defenses for image, graph, and text analysis models. In contrast to existing surveys, this paper focuses on adversarial attacks and defense solutions for text-based social network applications. In Table 1, we provide an overview of existing surveys on adversarial learning, attack, and defense methods. The key contributions of the survey can be summarized as follows.
• We first provide a detailed taxonomy of adversarial attacks and the corresponding defense techniques for text analysis frameworks.
• We then provide a comprehensive survey of adversarial learning, attacks, and defense systems for major social media applications including rumors, satire, clickbait & spam, hate speech, misinformation detection, and sentiment analysis.
• We also highlight key research challenges, open issues, and future research directions.

The rest of the paper is organized as follows. Section 2 provides a detailed taxonomy of adversarial attacks on text analysis models. Attack techniques at the character, word, sentence, and inter levels are reviewed in Section 3. Section 4 covers the corresponding defense techniques. Section 5 provides an overview of adversarial attacks and corresponding defense solutions for different social media applications. Section 6 describes the key challenges and future research directions in adversarial text analysis. Finally, Section 7 provides some concluding remarks.

The current literature and several surveys categorize adversarial techniques into attack and defense methods [50, 49]. These techniques can be further classified based on how attack and defense are applied to ML models. Such a classification has evolved based on the type of data (e.g., image, text), the knowledge level of the attacker (e.g., white box vs. black box), the goal of the attack (targeted vs. non-targeted), and the granularity of the input content (e.g., for NLP: character, word, or sentence level). Other classifications include the modality of the input content, the type of application or task, and the architecture of the ML model [50]. In Figure 1, we list several categorizations of the main adversarial attack and defense techniques. In the figure, we use dotted lines and boxes to represent the categories that are loosely defined in the overall adversarial attack and defense techniques. For example, both model- and target-based attacks apply to any model and application. Here is a brief description of each of these categorization criteria.
Attack: The purpose of an adversarial attack is to manipulate the inputs and/or the ML model so that it is fooled into producing faulty outcomes. This process takes on a variety of possible objectives and approaches.

Model Knowledge: This category classifies attack techniques based on the attacker's knowledge about the target model. Broadly speaking, this can be subcategorized into:
• White Box Attack: In a white-box attack, the adversary knows and exploits the input and the model information to optimize its attack. The model information includes input-output data, architecture, parameters, loss, and activation functions. In this attack type, adversarial data can be generated in a way that maximizes its impact on the classifier while remaining an imperceptible change. Typically, an adversary adjusts adversarial modifications to be in the direction of the model's gradient with respect to the current input. This maximizes the increase in the loss function and thus optimizes the attack. White-box attacks have received a lot of attention in the literature due to their adaptive nature.
• Black Box Attack: Black-box attack techniques do not require or have access to any knowledge of the model except its inputs and outputs. This type of attack uses heuristic approaches or repeated queries to build the attack.

Targeted Output: This feature categorizes attacks into targeted and untargeted. In a targeted attack, the adversary maps the original model's output to a required faulty output for a given input. In an untargeted attack, however, the adversary cares only about causing the model to produce incorrect outputs, regardless of what they might be.

Defense: The goal of defense techniques is to design a robust model that can withstand such attacks or deal with adversarial examples. As presented in Fig. 1, the major defense techniques include (i) adversarial example detection and (ii) model enhancement. The goal of the former approach is to detect an adversarial example that is distinguishable from legitimate input, whereas the goal of the latter approach is to train more robust models, which is commonly referred to as adversarial training.

Our focus in this study is social media applications, with a particular emphasis on NLP. Hence, in the following sections, we provide in-depth studies of adversarial attacks and defense techniques reported in social media applications. Today's social media platforms offer a versatile bundle of networking tools, such as messaging, chatting, voice, and video calling. Amongst these, text is a primary tool for communication in social media [55]. Therefore, we will especially focus on the research body on adversarial attacks and defenses on NLP techniques from a social media perspective. Adversarial attacks in an NLP context face more challenges compared to the case of computer vision models. Primarily, the discrete nature of text inputs imposes certain limitations on how one can efficiently modify them. In this regard, character-, word-, and sentence-level modifications are primarily text replacements at the respective levels. To this end, many challenges face the replacement process. Examples include how one characterizes token (i.e., character or word) or sentence similarity, how one establishes the embedding space, i.e., the space of acceptable replacements, and, more importantly, how to search for the best replacement candidate within such a space. For NLP tasks, adversarial attacks are mainly categorized based on the granularity of the input.
The main categorization of such granularity of the input includes the following.
1. Character-level attacks: achieved by adding, deleting, or swapping a character in a word.
2. Word-level attacks: achieved by replacing an original word with an adversarial equivalent. There is a variety of methods to characterize this word equivalency; examples include synonyms, antonyms, and semantic equivalence.
3. Sentence-level attacks: carried out by adding, deleting, or paraphrasing certain sentences.
4. Inter-level attacks: combinations of character-, word-, or sentence-level attacks.

Such categorizations are inspired by how these linguistic components are used in NLP applications to train machine learning models. For example, character- and word-level n-grams and their different representations (e.g., tf-idf, word2vec, contextual representations using transformers) have been used to train models. In Figure 2, we present the most common categories of granularity-based adversarial attacks used in NLP tasks.

Character-level information has been widely used in NLP for text classification tasks. [64] presents a pioneering work on applying convolutional neural networks (CNNs) for text classification at the character level, which has paved the way for character-level operations. In Table 2, we summarize several character-level perturbations. These primarily include inserting, deleting, substituting/replacing, swapping, and/or scrambling a character. In Table 3, we list sample works on character-level attacks with their model accessibility, attack type, targeted model, and application or task. As a pioneering work on character-level attacks, [56] investigates white-box attacks with character-level adversarial examples that maximize the model's loss with a limited number of modifications. This is referred to as the HotFlip algorithm. It is based on performing an atomic flip operation where the gradient of the model is used to select between several perturbations with respect to a one-hot vector representation. Such perturbations can be obtained by flipping, inserting, or deleting a character. In this regard, the authors argue that a simple beam search approach is superior to a greedy search over the flip operations. Along the line of black-box attacks is the work of [57], referred to as DeepWordBug, which operates on characters and words. Despite its black-box nature, this method is adaptive to the input. The key idea is to adaptively choose the most critical tokens. DeepWordBug operates in two main stages. First, it identifies critical tokens based on a ranking of the perturbed tokens measured in terms of the classifier's output. Second, it changes the identified tokens using simple transformations. Such transformations include four primary types (a minimal code sketch of these operations is given below): (i) swapping two consecutive characters, (ii) substituting a character with another one selected at random from the same word, (iii) randomly deleting a character, and (iv) inserting a randomly selected character. The TextBugger algorithm [58] selects between five modifications: (i) inserting a space between characters, (ii) randomly deleting a character, (iii) swapping two characters selected at random, (iv) replacing a character with another visually similar character, and (v) replacing a word with a semantically similar one. Compared to character-level attacks, a word-level attack is naturally more imperceptible for humans and more difficult for machine learning algorithms to defend against.
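Before moving to word-level attacks, the sketch referenced above makes the character-level transformations concrete. It is an illustrative re-implementation of the generic operations discussed in this subsection (swap, substitute, delete, insert), not the code released with DeepWordBug or TextBugger; the function names are ours.

```python
import random

def swap_chars(word: str) -> str:
    """Swap two adjacent characters at a random interior position."""
    if len(word) < 2:
        return word
    i = random.randrange(len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]

def substitute_char(word: str) -> str:
    """Replace one character with another drawn at random from the same word."""
    if len(word) < 2:
        return word
    i = random.randrange(len(word))
    return word[:i] + random.choice(word) + word[i + 1:]

def delete_char(word: str) -> str:
    """Delete one randomly chosen character."""
    if len(word) == 0:
        return word
    i = random.randrange(len(word))
    return word[:i] + word[i + 1:]

def insert_char(word: str) -> str:
    """Insert a random lowercase character at a random position."""
    i = random.randrange(len(word) + 1)
    return word[:i] + random.choice("abcdefghijklmnopqrstuvwxyz") + word[i:]

def perturb_word(word: str) -> str:
    """Apply one randomly selected character-level transformation."""
    return random.choice([swap_chars, substitute_char, delete_char, insert_char])(word)

if __name__ == "__main__":
    random.seed(0)
    print(perturb_word("misinformation"))  # prints one perturbed variant of the word
```

In a real attack these operations would only be applied to the critical tokens identified in the first stage, so that the classifier output changes while the text remains readable.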
In Table 4, we summarize several word-level perturbations proposed in the literature. In Table 5, we report several word-level attack contributions and categorize them based on model accessibility, attack type, targeted model, and application or task. In a broader sense, we summarize the main research contributions on word-level adversarial attacks in Fig. 3. These can be divided into four main categories: (i) optimizing adversarial attacks, (ii) identifying vulnerability in new applications, (iii) attack analysis and understanding, and (iv) investigating new attack aspects. Below we discuss each of them in detail.

A major part of the research on word-level attacks is focused on optimizing adversarial attacks. In this study, we categorize the endeavors made in optimizing word-level attacks into the following sub-categories.

Minimizing word replacements. The first research body under the umbrella of optimizing word-level attacks focuses on minimizing word replacements. Generally speaking, the most common perturbation type in word-level attacks is word replacement. Typically, a target word is replaced by an equivalent one chosen from a space of possible replacements. Such a space was identified by Ebrahimi et al. [83] and referred to as the embedding space. In [83], the authors consider neural machine translation (NMT) and propose elementary modifications to achieve word-level adversarial attacks. In this regard, the authors extend the HotFlip algorithm proposed in [56] by adding, removing, or replacing a word in the translation input. Another work on word-level adversarial attacks is proposed in [81]. In this work, the authors highlight the differences between adversarial attacks in the vision and NLP domains. In particular, they point out that one cannot arbitrarily modify text inputs as is done in a computer-vision context. Thus, the most important contribution of this work is highlighting the importance of exploiting the semantic and syntactic similarity between an original word and its adversarial replacements. Such similarities were overlooked in the literature before this work. The authors exploit these similarities in the word replacement search, using a population-based optimization algorithm to generate an embedding space of adversarial samples exhibiting semantic and syntactic similarities. Besides, they use this algorithm to select the best replacement of a given original word with respect to semantic and syntactic similarity while maximizing the model's loss. It is noted that they select the word to be modified at random. Technically, this process is conducted in the following order (a minimal code sketch illustrating this selection is given below).
1. Compute the nearest neighbors for a given word within its embedding space based on Euclidean distance.
2. Rank the similarity of those neighbors to the original word in terms of the difference between their label predictions and that of the original word, and keep the few neighbors having the maximum similarity.
3. Amongst the neighbors of maximum similarity, select the one that maximizes the loss function of the model.

Improving the search for word replacements. Another sub-category of works on optimizing adversarial word-level attacks is concerned with improving the search for word replacements in an embedding space. In the current literature, there are three main categories of text replacement techniques: (i) gradient-based [83, 8, 80], (ii) sampling-based [34, 75], and (iii) enumeration-based [59, 84, 74], as shown in Fig. 4.
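The following minimal sketch illustrates the basic replacement-selection loop referenced above: find nearest neighbours of a target word in an embedding space and keep the candidate that most increases the model's loss. It is a simplified illustration under our own assumptions (the intermediate similarity-ranking step is folded into the candidate cutoff), and `vocab`, `embeddings`, and `model_loss` are hypothetical placeholders rather than artifacts of any cited paper.

```python
import numpy as np

def nearest_neighbours(word, vocab, embeddings, k=8):
    """Return the k words closest to `word` in the embedding space (Euclidean distance)."""
    idx = vocab.index(word)
    dists = np.linalg.norm(embeddings - embeddings[idx], axis=1)
    order = np.argsort(dists)
    return [vocab[i] for i in order if i != idx][:k]

def best_replacement(sentence, position, vocab, embeddings, model_loss, k=8):
    """Replace the word at `position` with the neighbour that maximises the model loss."""
    original = sentence[position]
    candidates = nearest_neighbours(original, vocab, embeddings, k)
    scored = []
    for cand in candidates:
        perturbed = sentence[:position] + [cand] + sentence[position + 1:]
        scored.append((model_loss(perturbed), cand))  # higher loss = stronger adversarial effect
    return max(scored)[1] if scored else original

# Toy example with a dummy "loss" that simply prefers longer words.
vocab = ["good", "great", "fine", "nice", "bad"]
embeddings = np.random.RandomState(0).randn(len(vocab), 16)
dummy_loss = lambda sent: sum(len(w) for w in sent)
print(best_replacement(["the", "movie", "was", "good"], 3, vocab, embeddings, dummy_loss))
```

In practice the loss would come from querying the target classifier, and the candidate set would additionally be filtered for semantic and syntactic similarity as described above.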
These techniques vary in the extent of the assumed model accessibility, and thus in the information used to guide the search process. Herein, we provide two examples of improving the search process in the embedding space. First, the above-mentioned DeepWordBug algorithm [57] also applies at the word level. In this setting, the same selection process is adopted; in other words, the algorithm works in two stages: first, it selects a set of critical words, and then it chooses the best amongst them. Another example is the TextBugger algorithm of [58], which also applies at the word level.

Obtaining embedding spaces. As discussed earlier, selecting the best replacement is of crucial importance for the success of the attack. Still, it is also important to first provide a suitable embedding space. So, the next research body under the umbrella of optimizing word-level attacks concerns obtaining the embedding space. Along this line, [75] proposes a black-box, gradient-free genetic optimization algorithm for generating adversarial replacements based on natural selection. Specifically, it is an iterative process that produces a new generation of candidate replacements per iteration. A fitness function is used to quantify the quality of a candidate, and the candidates of each generation are obtained by crossover of their preceding ancestors. Several approaches aim at carrying over methods for defining image-based embedding spaces to the text domain while keeping in mind the differences between text and image data. Along this line, [80] proposes a framework for utilizing ideas from image attacks in text attacks. First, the authors highlight the main differences between the image and text domains in terms of generating adversarial attacks. In this regard, they identify two main limitations in this borrowing. The first is the discrete nature of text, which renders small perturbations inapplicable in a text context. The second is the difficulty of evaluating how good an adversarial attack is. As a remedy to the first limitation, the authors propose searching within the space of possible word embeddings and then choosing, among the nearest neighbors of the original word, the best adversarial alternative. To alleviate the second limitation, the authors use the Word Mover's Distance (WMD) as an adversarial replacement quality metric.

Improving word replacement similarities. It is well known that an adversarial replacement should be, at least visually, similar to the original word. Moreover, it is necessary to quantify this similarity. Therefore, another research body under the umbrella of optimizing word-level attacks focuses on improving the similarities used in identifying word equivalency. Along this line, [74] introduces two approaches for quantifying word similarity. The first is the concept of semantically equivalent adversaries (SEAs). In this context, SEAs are generated as semantically equivalent adversarial replacements based on paraphrase generation techniques. These techniques include back-translation, i.e., translating a given sentence to another language and then translating it back to the original language. Another example is the use of paraphrasing to obtain such replacements. In this setting, the semantics of the text are preserved since the meaning is not changed.
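The sketch below shows the general shape of back-translation-based SEA generation described above: produce round-trip paraphrases and keep one that preserves meaning but flips the classifier's prediction. It is a conceptual sketch, not the implementation of [74]; `translate`, `similarity`, and `classify` are hypothetical placeholders for an NMT system, a semantic-similarity scorer, and the target classifier.

```python
def back_translate(sentence: str, translate, pivots=("fr", "de", "es")) -> list:
    """Round-trip the sentence through several pivot languages to obtain paraphrases."""
    return [translate(translate(sentence, "en", p), p, "en") for p in pivots]

def generate_sea(sentence: str, translate, similarity, classify, threshold=0.8):
    """Return a paraphrase that preserves semantics but changes the prediction, if one exists."""
    original_label = classify(sentence)
    for candidate in back_translate(sentence, translate):
        if candidate == sentence:
            continue  # identical round trips cannot be adversarial
        semantically_equivalent = similarity(sentence, candidate) >= threshold
        if semantically_equivalent and classify(candidate) != original_label:
            return candidate
    return None  # no semantically equivalent adversary found among the candidates
```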
The second approach proposed by the authors is devising semantically equivalent adversarial rules (SEARs) that govern the process of obtaining semantically equivalent replacements. In line with identifying and exploiting word replacement similarity, Ren et al. [70] consider imposing further restrictions on adversarial replacements, namely lexical correctness, grammatical correctness, and semantic similarity. Specifically, the authors propose an algorithm for adversarial word selection ordering, where the saliency of a word and its classification probability are taken into account in determining its order in the embedding space. This algorithm is referred to as probability-weighted word saliency (PWWS) for text adversarial attacks. Ren et al. experimentally validate a high success rate of their algorithm with minimal word replacements. Moreover, through human evaluation of the attacks, the authors demonstrate that their modifications maintain saliency (i.e., remain natural) at the human level. Similar to other methods, this algorithm also possesses a high degree of transferability of its examples across different models and datasets. To this end, word-level adversarial replacement techniques have been criticized for tending to overlook the linguistic context while replacing targeted words, which renders them vulnerable to human discovery. Accordingly, [73] calls for a wiser selection of the words. In particular, the authors incorporate context consistency as a constraint for replacement in a black-box adversarial attack algorithm. Algorithmically, this is achieved by sorting replacement suitability values based on a BERT masked language model.

Developing attack triggers. The studies surveyed so far consider primarily input-dependent adversarial attacks. Therefore, the next research body under optimizing word-level attacks considers developing (input-agnostic) attack triggers. Along this line, [72] investigates input-agnostic adversarial examples, referred to as triggers, at the token level in NLP models. The authors propose a gradient-based search for the best token to change while using a minimal perturbation length. In [72], the authors validate a high success rate of their algorithm as tested over various tasks and networks. Besides, they demonstrate that it possesses a degree of transferability despite its white-box nature; such attacks transfer across both examples and models for all tasks. As a future outlook, the authors call for investigating the development of grammatical triggers that can work anywhere in the input. Future research may also consider dataset- or even task-agnostic triggers. Moreover, the authors note that adversarial attacks raise questions about whose responsibility it is that models are vulnerable to such attacks.

Investigating attack design trade-offs. The next research body under the umbrella of word-level attacks concerns identifying and balancing design trade-offs. An inherent trade-off in the design of adversarial attacks is the saliency-strength trade-off: a stronger perturbation is more likely to change the model output, at the cost of reduced imperceptibility, and vice versa. This forms the fundamental design trade-off. However, identifying and balancing other design trade-offs was not considered until the work of [71]. In this work, the authors devised and investigated a probabilistic framework for generating adversarial attacks on models with discrete inputs.
[71] propose two methods for word-level adversarial attacks and then investigate the underlying design trade-offs encountered with these methods.

The second aspect of research contributions in word-level attacks concerns developing the understanding of adversarial attacks. In this regard, [76] argues that the interest in the applicability of ML models came at the cost of not giving enough care to the interpretation and understanding of how ML models produce their outputs from their inputs. To bridge this gap, the authors investigate the sensitivity of a model's operation to input deletion in particular. Specifically, the authors analyze how the behavior of a given model changes with erasure operations of different types and natures applied to the input data. The authors also investigate the use of reinforcement learning for detecting output-changing phrases at the input side. The result of the work conducted by [76] is a set of important interpretations and explanations of certain ML operational phenomena.

Another research body in the context of word-level attacks focuses on identifying the vulnerability to adversarial attacks exhibited by new NLP tasks and application areas. In this regard, Liang et al. [82] consider the vulnerability of DNN-based text classifiers to adversarial input attacks. In particular, the authors propose three main methods for generating text adversarial examples: insertion, removal, and modification. Experiments validate that these methods, whether applied at the character level or the word level, can trick DNN models and cause them to generate wrong outputs, while the perturbations generated by such methods remain imperceptible at the human level. It is worth mentioning that Liang et al.'s work is based on identifying the best characters to change, i.e., the hot characters. Then, it applies adversarial changes to words containing more than 3 hot characters and refers to them as hot words. Similarly, hot phrases are identified as the ones having the maximum number of hot words. Therefore, this approach is conceptually similar to what HotFlip does. Another work by [65] considers the vulnerability of natural language classification models to adversarial attacks. This work establishes the existence of adversarial attacks in an NLP context. Specifically, it points out that, similar to other ML application areas, natural language classification models are vulnerable to attacks composed of small input perturbations that lead to changing the NLP model's output. As an attempt at investigating other ML application areas, [79] considers ML models used in object detection. The authors show that such models are, similar to the case with other ML applications, inherently vulnerable to adversarial attacks at the image patch level. This is shown by proposing an efficient adversarial attack algorithm that targets a specific class of the object detection model. Thus, the authors craft their attack by tricking the model into being blind with respect to a certain object in an image patch, called the invisibility patch. This algorithm has the ability to trick several state-of-the-art object detection approaches. Besides, it transfers across different models and datasets.

Lastly, another research body of word-level attack contributions focuses on the attack itself. The common practice in adversarial machine learning has remained restricted to altering the input/model while leaving the input labels intact.
In contrast to this widely-used assumption, [78] proposes a bilateral perturbation process, i.e., altering both the input and the label. For altering a given input's label, the authors propose a closed-form mathematical expression, whereas to alter the input they use a one-step perturbation process while adopting the class label that maximizes confusion. The idea of perturbing both the input and the label has been shown to be effective.

In sentence-level attacks, an adversary considers the inputs as sentences and applies adversarial modifications to them by inserting, replacing, or deleting sentences. Compared to character- and word-level attacks, sentence-level attacks demand a longer time for adversarial text generation. The key attributes of several sentence-level adversarial attacks are presented in Table 6. There are several challenges to conducting adversarial attacks on models with textual input (e.g., sentences). One is finding suitable candidate replacements so that the generated text preserves syntax and semantics. Another is developing an efficient approach to finding good transformations. The work of Lei et al. [77] proposes methods that address both challenges: (i) sentence and word paraphrasing that preserves syntax and semantics, and (ii) a gradient-guided greedy paraphrasing approach to find suitable transformations. Along this line, Iyyer et al. [84] propose syntactically controlled paraphrase networks to generate syntactically valid paraphrases of a sentence. This study shows that such generated adversarial examples (i) can fool pre-trained models and (ii) can improve the robustness of the model when used as augmented training data. The work of Jia and Liang [85] also explores sentence-level adversarial examples, where the authors investigate whether a distracting sentence added to a paragraph can lead to an incorrect answer.

By multi-level attack, we mean an adversarial attack that perturbs characters and words, or words and sentences, or all of these. Multi-level attacks are largely considered unexplored and open for future research. The few research studies mentioned in this section evaluate preliminary approaches to integrating attacks at different text levels. Another reason why multi-level attacks are largely unexplored in the literature is that they have no direct counterpart in image-based adversarial attacks; as adversarial attacks on text are more recent than adversarial attacks on images, many of the methods in text attacks are inspired by early image attacks. As multi-level attacks involve at least two of the three text-level attacks, they are typically more complex and computationally expensive. Similar to the earlier categories, multi-level attacks can be used in white-box attacks (e.g., [56], [85], [72]) as well as black-box attacks (e.g., [91]).

Similar to the case of adversarial attacks, adversarial defense research in NLP is still in its early stages compared to its counterparts in computer vision. In the broad sense, the existing efforts in this domain can be divided into four categories: (i) adversarial training, (ii) spell and grammatical checks, (iii) defensive distillation, and (iv) recovery of perturbations. In the following subsections, we provide a detailed overview of the literature on each type of adversarial defense method. Adversarial training is the process of training a model on a mixture of clean data and adversarial examples.
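To make this definition concrete, below is a minimal sketch of the basic adversarial training loop, assuming a PyTorch-style classifier over token-id tensors; `model`, `loader`, and `craft_adversarial` are placeholders for the target classifier, a data loader, and any of the attacks described earlier. It illustrates the general concept rather than any specific cited method.

```python
import torch
import torch.nn.functional as F

def adversarial_training(model, loader, craft_adversarial, epochs=3, lr=1e-4):
    """Train on batches that mix clean examples with adversarial examples
    crafted against the current state of the model."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for inputs, labels in loader:
            adv_inputs = craft_adversarial(model, inputs, labels)  # attack the current model
            optimizer.zero_grad()
            loss_clean = F.cross_entropy(model(inputs), labels)
            loss_adv = F.cross_entropy(model(adv_inputs), labels)
            loss = 0.5 * (loss_clean + loss_adv)  # equal weight on clean and adversarial data
            loss.backward()
            optimizer.step()
    return model
```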
The inclusion of adversarial examples in such a training set has been shown to improve the model's robustness against those examples [31]. This defense scheme aims at suppressing the impact of an adversarial attack, assuming it has already happened or will inevitably happen. Intuitively, the re-trained model is expected to be more robust to the adversarial attacks whose instances are included in the training, as compared to the original model. In the adversarial security research area, adversarial training has been used by almost all adversarial attack works [92]. It is worth mentioning that improving the model robustness by adversarial training has been widely used as an implicit indicator of the success of the underlying adversarial attack. Despite being the first line of defense, there are certain limitations and drawbacks of adversarial training, as summarized below.
1. Some recent works are skeptical about the extent to which adversarial training can truly improve a model's robustness, reporting only marginal/moderate robustness improvements for adversarial training in certain NLP applications [68, 62]; more robust approaches have been proposed to address such limitations [68, 62].
2. Adversarial training has recently been shown to be effective exclusively against the attack whose adversarial samples are used in its training [85]. This led to the inception of "blind-spot" attacks [93], where the modified input resides in a so-called blind spot, i.e., it is far away from regular and potential inputs while still belonging to the data distribution. Thus, any adversarial training attempt is very unlikely to include such an example in its training set.
3. Current adversarial training procedures cannot scale to datasets with a large (intrinsic) dimension [93]. Besides, adversarial training is shown to work well only with repetitive data patterns in a training set.

Recently, there has been an interest in extending the classical adversarial training concept. Along this line, [68] considers an inherent gap between adversarial training in the image processing and NLP application domains. While perturbations used in adversarial training for image processing can take any arbitrary direction, the authors argue that this approach should not directly apply to the NLP case, because it ignores the interpretability of the added perturbations. Therefore, the authors call for restricting the perturbation space to the directions of existing word embeddings. In other words, perturbations in an NLP context should only yield meaningful tokens from a linguistic perspective. Through a set of experiments, the authors demonstrate that their approach generates interpretable adversarial outcomes while maintaining the attack performance.

As character- or word-level attacks are based on modifying characters and words, several studies investigate the utility of spelling and grammatical checks of the input text as a means of defense [57, 58, 62, 94, 95]. There have been several contributions in this direction. For instance, Gao et al. [57] use an automatic spell corrector to improve the model's robustness against adversarial attacks. Similarly, Li et al. [58] employ spell checking as a tool for eliminating character-level modifications. In more recent work, [62] proposes augmenting text classification with word recognition, which aims at counteracting the effect of adversarial perturbations applied to characters.
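As a hedged illustration of this family of defenses, the sketch below normalizes each token back to its closest in-vocabulary word before classification, using Python's standard difflib for approximate matching. It is a minimal sketch in the spirit of the spell-correction defenses above, not the method of any cited work; `VOCAB` and `classify` are hypothetical placeholders.

```python
import difflib

# Hypothetical toy vocabulary; a real system would use the classifier's full vocabulary.
VOCAB = {"this", "movie", "was", "really", "terrible", "great", "hate", "speech"}

def correct_token(token: str) -> str:
    """Map an out-of-vocabulary token to its closest vocabulary entry, if one is close enough."""
    if token in VOCAB:
        return token
    matches = difflib.get_close_matches(token, VOCAB, n=1, cutoff=0.75)
    return matches[0] if matches else token

def defended_classify(text: str, classify):
    """Spell-correct every token, then hand the recovered text to the classifier."""
    recovered = " ".join(correct_token(t) for t in text.lower().split())
    return classify(recovered)

# Character substitutions such as "rea1ly" or "terrib1e" are undone before classification:
print(" ".join(correct_token(t) for t in "this movie was rea1ly terrib1e".split()))
```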
In this way, various perturbations and changes such as character swapping, replacement, and keyboard typos can be counteracted. Word-level input correction defenses consider context-guided spell and grammar checks to guard against adversarial attacks. For instance, in [94], a two-stage spell correction scheme that identifies and corrects misspelled words is proposed to guard against adversarial attacks. As another example, Bao et al. [95] propose a multi-task learning framework to identify adversarial sentences in which adversarial words are embedded.

The distillation concept originated as a means of reducing the size of a given DNN architecture. This is possible by virtue of training a smaller network on the logits and the training data of a given (larger) network; the smaller network thus inherits the functionality of the larger one. More recently, [34] adopted this idea as a defensive means of combating adversarial attacks. Specifically, the authors anticipate that the trained network will only inherit the benign functionality of the original one while being distilled of any adversarial functionality. It is noted that this approach has been shown to be successful in combating adversarial attacks in the image domain [96].

The literature also includes some efforts towards the identification and recovery of perturbations. For instance, a defense framework proposed by Zhou et al. [69] aims to determine whether a particular token is a perturbation or not. The process is composed of two phases. In the first phase, a set of possible perturbations that could have been applied to the text of interest is identified. The second phase calculates the value of a specific attack estimator for each of the attack possibilities. After that, the token that is identified as a perturbation is reconstructed/restored from the embedding space based on a certain similarity measure; the authors use k-nearest neighbors (kNN) to guide the search. The study also concludes that the technique is generally successful in identifying attacks in a variety of NLP models and applications without the need for any retraining. Another work in [63] develops a rule-based recovery of adversarial perturbations. The defense technique is based on replacing each non-standard character in the input stream with its nearest standard neighbor; in other words, this approach is based on reversing a text attack. The proposed solution has proven very effective in machine translation applications.

In this section, we study five major social media applications that are vulnerable to adversarial attacks and summarize the literature that explores attack and defense techniques for these applications.

Social media has become a major communication channel where users not only consume news but also propagate it by sharing and producing content. Such consumption and dissemination have led to widespread rumors, which can cause serious consequences for individuals, organizations, or society as a whole [97]. To address this issue, social media platforms, government entities, journalists, fact-checkers, research communities, and other stakeholders are fighting to reduce the negative impact of rumors. Fact-checking organizations such as FactCheck.org and Snopes.com manually check claims and make the results publicly available to facilitate other stakeholders and reduce the spread.
However, such manual efforts do not scale well; hence, the research community has been trying to develop automated systems to detect and alert users about rumors (see recent surveys [98, 97]). In the literature, a rumor has several definitions [98]: (i) a rumor is a story or statement whose truth value is yet to be verified at the time of posting [99]; (ii) a rumor is a story or statement whose truth value has been verified by authoritative sources and confirmed to be false or fabricated, which is also referred to as a false rumor [100]; and (iii) the third definition of a rumor is based on the users' subjective judgment of the veracity of the story or statement [101]. The first definition is the most widely used and is consistent with the definitions in different dictionaries (e.g., the Oxford English Dictionary). Rumors can be of different types: (i) long-standing rumors that have been circulating for a long time, and (ii) emerging rumors that arise during an event (e.g., rumors about COVID-19). Rumor classification systems have been designed accordingly, depending on the type of rumor.

A typical rumor classification system consists of several components, as can be seen in Figure 5 (adopted from [97]). News or social media posts are monitored to detect whether they are rumors or not. Once a rumor is identified, posts related to the rumor are detected so that they can be flagged. The stance classification component then determines how each post reflects a particular stance on the rumor's likely veracity. Finally, the last component determines the truth value of the rumor. Different types of methods have been used to design the machine learning models for these components. The work by Cao et al. [98] categorizes them into three types. Handcrafted feature-based approaches: features are extracted from different modalities (textual, visual, and social features) depending on their availability, and then classical machine learning methods (e.g., Decision Trees, Bayesian Networks, SVMs, Random Forests, and Logistic Regression) are used to train the model. Propagation-based approaches: use network-based information such as users, messages, and events to train the model. Deep learning approaches: train the model using a combination of different sets of features.

Adversarial work: While malicious actors try to spread rumors to achieve political and financial goals, the research community and social media platforms try to develop automated systems/models to debunk them. These models, too, are susceptible to adversarial attacks by malicious actors. Hence, it is important to develop models that are robust enough to deal with such adversarial attacks. Work in this direction for rumor detection is relatively scarce. Ma et al. [41] present the first study that attempts to create robust models using GANs, evaluated on two real-world datasets. Xiaoyu et al. [102] propose graph adversarial learning on the graph structure to reduce intentional perturbations. Song et al. [103] propose a method that includes a weighted-edge transformer graph network and a position-aware adversarial response generator module. In Table 7, we report notable work on rumor detection that proposes different attack and defense methods.

Satire is defined in the Cambridge Dictionary as "a way of criticizing people or ideas in a humorous way". It is a literary device that writers use to mock or ridicule a person, group, or ideology by judging them for various issues, particularly in the context of contemporary politics and other topical issues [109].
Such devices include humor, irony, sarcasm, exaggeration, parody, and caricature [110]. These are typically applied to news and social media posts, and the purpose is not to cause harm but to ridicule or expose behavior that is shameful, corrupt, or otherwise "bad" [111, 112, 113]. Even though the intention is not to mislead, the content can be mistaken by the reader as legitimate news, which can lead to the spread of misinformation [109, 114] and can be harmful [115]. Hence, it has become essential to automatically identify such content at a large scale, and the typical classification task is to differentiate between real, fake, and satirical news [112, 109]. For designing the classification model, different classical and deep learning-based algorithms have been used, including Multinomial Naive Bayes, LSTMs, and CNNs [112, 116]. Typical features used to train the models include entity mentions [116], coherence information [117], and distributions of parts-of-speech, sentiment, and exaggerations [118]. Other approaches include a neural network with an attention mechanism to incorporate paragraph-level linguistic features [115], a hierarchical deep neural network to incorporate information at the sentence and document levels [119], and multimodal learning [109] using the state-of-the-art vision-language model ViLBERT.

Adversarial work: The work on adversarial training for satire detection is relatively new. The study in [110] used adversarial training while training the model; however, the purpose was not to defend against an adversarial attack but rather to control the effect of publication source information in satire detection.

Social media platforms, such as Facebook and Twitter, have enabled people around the globe to connect and exchange ideas with each other. In doing so, they have also created opportunities for cyber-attackers and cyber-trolls to disseminate misinformation, offend and cyber-bully users, and cause disputes through social media and public forums. [120] describe online trolls as malicious users pretending to be sincere members of a group discussion but subtly attempting to disrupt the discussion and cause conflict. [121] claim that the lack of attention to social science by the designers and developers of social media is responsible for spreading misinformation. Cyber-attackers leverage digitalization for online discrimination, privacy breaches, misinformation, and cyberattacks [122]. [123] provides an example of misinformation in Italy: the Five Star Movement, founded by a national celebrity, was linked with numerous social media pages and websites promoting the Catholic faith and spreading nationalist, anti-immigrant, and anti-Islam rhetoric. Similarly, [124] addresses the political threats posed by misinformation towards elections, democracy, and citizens via OSNs. These threats include releasing fake documents and ridiculing candidates, spreading misinformation about the voting procedure, and harassing and intimidating minority groups. The authors also highlight the steps taken by the Canadian government to mitigate cyber-attacks, such as extending existing criminal and human rights laws to the online space.

ML models have proven effective in detecting malicious emails, fake news, and harmful data traffic in real time [125]. [126] propose a method for the automated detection of fake news and misinformation by drawing a connection between the surface features and the document-level characteristics of a discourse. The proposed model achieves a veracity accuracy of 74%.
Likewise, [127] combines text mining, behavior analysis, and connection graphs of a discourse involving users, and compares the results with a predefined knowledge base as an automated fact-checking and fake news detection model. Attackers can leverage adversarial machine learning (AML) to evade detection and amplify their attacks (e.g., an attack on a spam email filter [128]). [120] introduces TrollHunter, an ML model that detects the spread of misinformation and fake news on Twitter by leveraging linguistic analysis. TrollHunter achieves an accuracy of 98.5% when detecting malicious tweets. The authors also present TrollHunter-Evader, which utilizes adversarial machine learning techniques, such as Test Time Evasion (TTE) and Ambient Tactical Deception (ATD), to evade detection, with a success rate of 40%. The Malicious Comment Generation framework (MALCOM) [40] generates relevant and acceptable phrases and uses them to replace phrases in a malicious discourse, evading ML detection models with a success rate of 94%.

Cyber-attackers utilize various psychological and technology-based techniques to perform scams and hoaxes on their targets. Internet users face online threats in the forms of phishing, pharming, hacking, profiling, and psychological influence [129]. In response, [130] implements an anti-spamming email filter based on unsupervised Artificial Neural Networks (ANNs), which achieves an ROC AUC of 0.97 with an average sensitivity > 0.95. However, the performance declines due to concept drift as emails with new topics and structures are introduced to the model. The accuracy of ML models decreases over time due to changes in news topics, the way they are reported, and the evasion techniques used by adversaries; this is known as 'concept drift' [131]. Thus, the authors recommend two approaches to mitigate the effects of concept drift: periodically retraining the machine learning model using previous data samples and deploying Dynamic Weighted Majority into the system. In addition to adversaries causing misclassification by masking the input data with noise and poisoning training samples, the implementation of ML models can be challenging due to their complexity and lack of transparency, which makes them incomprehensible to humans and their decisions untrustworthy [132]. To mitigate the negative perception of AI, [133] recommends educating developers on technology-based ethics and the potential effects on society, having a regulatory authority to monitor AI-based processes, and making AI computations and databases transparent to consumers. In Table 8, we summarize the related work on clickbait and spam detection addressing different attack and defense methods.

Hate speech, which refers to abusive or threatening speech or writing expressing a preconceived opinion against a particular group based on protected attributes, such as color, race, religious beliefs, gender, or sexual orientation, is considered one of the main causes of increasing global violence [139]. Social networks, such as Facebook, Instagram, and Twitter, and greater accessibility to the Internet have further amplified it by allowing people to express their opinions to a global audience more freely and effectively [10, 11]. Considering the increasing concerns over global violence, several efforts have been made to identify the potential sources and reduce the spread of hate speech over social networks.
AI, ML, data science, and NLP communities are also playing their part by proposing hate speech detection techniques [140]. For instance, [141] proposes a hate speech detection and classification framework for Twitter text streams based on BERT (Bidirectional Encoder Representations from Transformers). However, similar to other NLP applications, hate speech detection methods are also subject to adversarial attacks, as demonstrated in [142], where hate speech recognition models were fooled by modifying the text using state-of-the-art NLP adversarial attacks. The goal of such adversarial attacks is to disturb the classification capabilities of the classifiers, resulting in the misclassification of abusive and toxic content as acceptable and legitimate. The literature provides several examples of adversarial attacks on hate speech detection techniques. For instance, Grondahl et al. [52] employed three different types of adversarial attacks to fool hate speech recognition models: (i) word changes, (ii) word-boundary changes, and (iii) appending unrelated innocuous words. Similarly, in [53], hate speech detection models are fooled through (i) typos, (ii) removing white spaces, (iii) inserting benign words, and (iv) appending character boundaries. The authors also launched attacks on the models by combining all the individual types of attacks, resulting in a more effective adversarial attack.

To guard against attacks on hate speech detection models, several defense strategies have been introduced in the literature. The majority of the proposed methods rely on adversarial training to cope with the perturbations [51, 143, 144, 52]. For instance, in [143], adversarial training is performed with a learnable and fine-grained noise magnitude, where noise (perturbation) is added to misleading samples. Besides adversarial training, some solutions also rely on pre-processing to defend against adversarial attacks on hate speech detection models [52, 54]. For instance, Moh et al. [53] propose four different pre-processing defense techniques, namely (i) word segmentation no redo (WSNR), (ii) word balance no redo (WBNR), (iii) good grammar no redo, and (iv) vowel search no redo (VSNR). These defense techniques deal with white-space removal, typos, benign word insertion, and character boundary appending attacks, respectively. Table 9 summarizes some key papers on adversarial attacks and defense methods for hate speech detection.

Misinformation is perhaps the most innocent of the terms discussed here: it is misleading information created or shared without the intent to manipulate people. An example would be sharing a rumor that a celebrity died before finding out that it is false. Disinformation, by contrast, refers to deliberate attempts to confuse or manipulate people with dishonest information. To fight against such false or misleading information, several initiatives for manual fact-checking have been launched. Notable fact-checking organizations include FactCheck.org, Snopes, PolitiFact, and FullFact. A large body of research has focused on developing automatic systems for detecting the factuality of such information [146, 147, 148, 149, 150]. Such detection systems are also vulnerable to adversarial attacks [133, 151]. The study by [133] demonstrates three adversarial examples: fact distortion, subject-object exchange, and cause confounding. As a defense mechanism against adversarial attacks, the authors propose a crowdsourced knowledge graph to collect timely facts about news events.
[8] describes adversarial attacks on social networks from two perspectives: attacks on or manipulation of the network versus manipulation of the content. Several papers (e.g., [151]) call for the need to build adaptive ML models that are more immune and robust to variations and perturbations of text inputs or features. Several defense approaches against adversarial attacks on NLP are proposed in the literature, including adversarial training [152, 68, 153], optimization-based methods [154, 155], defenses against neural fake news [156, 157, 40], and word/sentence embedding-based defenses [158, 159]. [160] discusses three defense methods against adversarial attacks on misinformation: data modification, model modification, and the use of auxiliary tools.

Sentiment analysis, also known as opinion mining, is another interesting application of NLP and generally involves the use of modern technologies to analyze text and extract quantitative results. Generally, the results of sentiment analysis are presented in the form of positive, negative, and neutral sentiments. In the literature, both textual and visual content has been analyzed to extract opinions about an entity [5, 163, 164]. However, text is exploited more, due to its diversified set of applications and the level of freedom in expressing sentiments. The recent increase in the popularity of social media outlets, such as Twitter and Facebook, has further increased the importance, opportunities, and challenges associated with sentiment analysis [165]. For instance, sentiment analysis tools allow businesses to monitor and analyze the popularity of, and users' feedback on, their own and their competitors' products/services by processing text reviews shared by users in online social networks. The objective quantitative results obtained through sentiment analysis are then utilized in making critical business decisions.

Similar to other NLP applications, due to the importance of the results obtained from sentiment analysis in the decision-making process, sentiment analysis algorithms are also subject to several adversarial attacks. In sentiment analysis, an attacker can launch adversarial attacks by adding small perturbations to text to generate perceptions different from the actual opinions. For instance, in [166], the vulnerabilities of a lexical natural language sentiment analysis algorithm are identified and analyzed under various types of adversarial attacks. Based on the experimental results, the authors conclude that the classifier's results can be significantly affected by exploiting the identified vulnerabilities. Alzantot et al. [81], on the other hand, utilize a black-box population-based optimization algorithm for the generation of semantically and syntactically similar adversarial examples to launch attacks on sentiment analysis models. In [167], three different types of character-level adversarial attacks are launched against a BERT-based sentiment analysis framework. The strategies used for generating the adversarial examples include mimicking human behavior and using leetspeak, misspellings, or misplaced commas. These strategies maximize the misclassification rate of the sentiment analysis classifier with minimal changes.

The literature also reports some interesting solutions to guard against adversarial attacks and extract correct perceptions of content shared in social networks. For instance, in [168], an adversarial training-based solution has been proposed for aspect-based sentiment analysis.
Wang et al. [169], on the other hand, propose an adversarial defense tool, namely TextFirewall, for sentiment analysis algorithms. TextFirewall mainly relies on the inconsistency between the sentiment analysis model's prediction and an impact value, which is calculated by quantifying the positive and negative impact of a word on the sentiment polarity. Hosseini et al. [60] used spell-checking to limit adversarial modifications at the character level; this scheme is also used in [166] to guard a sentiment analysis classifier against insertion and word substitution attacks. Du et al. [170], on the other hand, employ a network distillation technique along with adversarial training for robust sentiment analysis. Table 11 summarizes some key findings on adversarial attacks and defense methods for sentiment analysis.

In this section, we discuss current challenges and limitations in the area of adversarial attack generation and defense in text-based social media applications.

• Discrete data perturbation: This is a fundamental challenge in generating adversarial examples for text as compared to images. It is widely believed that computer vision and image processing approaches to adversarial example generation do not directly apply to discrete inputs. In essence, a direct application exposes the attack, since the modified characters and words no longer look natural; this is the case because perturbations have to be discrete. Moreover, text perturbations are mainly replacement operations, which poses an important research question: how to characterize good replacements? Hence, an active area of research considers identifying and quantifying new aspects of text similarity. Similarly, defense techniques designed for continuous inputs cannot be applied easily. As an example, the GAN approach is based on adding artificial noise to the inputs and is thus not directly applicable in an NLP context. Accordingly, an interesting research area is how to enable off-the-shelf adversarial generation and defense methods for continuous variables to be used in a textual context.

• Human-factor dependency: As seen in the papers surveyed in this section, there is almost always a human effort in the design and optimization of adversarial attack and defense methodologies. This is especially the case for measuring the imperceptibility of adversarial replacements from a linguistic point of view [74]. Other examples include setting the parameters of the attack model and optimizing such parameters. Therefore, it seems worthwhile to invest more research in automating these processes, in an attempt to eliminate, or at least regulate, the extent to which the human factor is necessary in this area [82]. In fact, this is a challenge that concerns ML in general, including the context of this section.

• Transferability of attack and defense: Similar to the case in computer vision, adversarial examples in an NLP context are known to possess different aspects and extensions of transferability. Typically, they transfer across different models as well as training and testing datasets. Furthermore, recent research has focused on developing attacks that transfer even across different ML tasks [172].
The transferability property exempts an attacker from the need to use the same model, architecture, data, or even task as the attacked ML model [72]. This poses a growing challenge for efficient defense techniques [40]. As intuitively expected, black-box and untargeted adversarial attacks have higher degrees of transferability compared to white-box and/or targeted attacks. Therefore, there will be an increased demand for advanced defense techniques that can cope with this challenge. It is noted that transferability can be treated at the character, word, and sentence levels.

• Universality of attack and defense: This property refers to the extent to which an adversary is independent of the input or the state of nature. Thus, in an NLP context, the universality of an adversary can be seen in how it can attack irrespective of the input; for example, an NLP attack can be independent of the language used. Another aspect is defense universality, i.e., robustness against character-, word-, and sentence-level attacks at the same time. The incorporation of this property in adversarial attack and defense is an interesting area of research. Recent research indicates that the success of a defense strategy against a specific adversarial attack does not necessarily mean the system is robust against other adversaries. Therefore, developing universal defense mechanisms is of growing importance, especially with state-of-the-art deep learning models that can learn the task irrespective of inputs or architecture. Besides, it is also worth investigating the relationship between efficient defenses against different attacks, and whether the relationships that hold between attackers also hold between their respective defenders.

• Lack of appropriate defenses: An area worth investigating is whether defense methods have the transferability property, and if so, in what sense, and, more importantly, how this transferability can be enhanced and boosted to obtain a generic defense approach. Improved detection of grammatical errors is also worth further investigation [72, 66], with the added benefit of decreasing the chances of adversarial attacks. The development of more robust NLP architectures is another promising outlook: recent research works hint at designing inherently robust/defensive ML architectures that can suppress the impact of any adversarial attack [59].

• Embedding space size vs. perturbation selection trade-offs: Several works on characterizing a suitable embedding space have been proposed, along with a variety of methods to search for the best replacement amongst the elements of the embedding space. Still, more work is needed on how to define the space and how to efficiently search among its elements for the best replacement. This also requires balancing the trade-off between the embedding space size and the computational burden of the search. Moreover, there is a need for novel ideas on how to regularize and guide the search process within the embedding space. A few works have considered the question of identifying the critical characters and words that are best to change amongst all others; however, there is still an evident need for more research on this idea. In this context, one may think of developing a unified framework for answering a more general question of what, where, when, and how to change, i.e., is it better to change a character, a (sub)word, or a (sub)sentence?
This opens up the horizon for developing new optimization settings that may answer such questions in a systematic and tractable way.

• Lack of standard benchmark datasets with a proper collection of adversarial examples: Given that a variety of datasets is used for training, testing, and performance evaluation, there is a need for standardized benchmark datasets. Although several large-scale datasets have been collected for common NLP tasks in recent years, there remains a need for new datasets for more challenging social network NLP applications. It is noted that this is a general concern in text and NLP adversarial attack research.

• Lack of standard quality metrics: With the diversity of attack and defense methods comes a corresponding diversity of attack and defense quality metrics. A classical attack quality metric is the accuracy of the targeted ML model. Furthermore, in an NLP context, the semantic similarity between the original input and its adversarial replacement has also been widely used as a quality metric. Along this line, researchers have used BLEU, Self-BLEU, EmbSim, Euclidean distance, cosine similarity, and semantic similarity to evaluate their attack methods or the quality of generated text. On the other hand, human evaluation seems to be an unavoidable measure of imperceptibility [74]. To this end, there is an immense need for research on developing descriptive, fair, and standardized performance evaluation metrics for the quality of both attacks and defenses. Moreover, quality metrics evaluating other aspects of an attack, such as transferability, universality, and imperceptibility, also need to be developed and evaluated. From an adversarial replacement generation perspective, there is a growing need for standardized measures and rules for semantic similarity, which is of great importance for ensuring the imperceptibility of attacks. There is also a need for developing and standardizing measures of an ML model's stability against adversarial attacks. Along the line of quality metrics, [88] recently showed the added benefit of incorporating an attack's meaning-preservation capability into quality metrics for attack performance evaluation. A core remaining question is how to define the distance between the original text and its adversarial modification, i.e., how to measure a change in text.

• Computational complexity concerns: On both the attack and defense sides, there is an inherent trade-off between performance and computational complexity, as in any other application area. Settings with more access to system parameters perform better, but at the cost of higher computational complexity. As attack and defense methods grow in complexity, there should be an accompanying research effort on how to implement such techniques at affordable levels of computational complexity, execution time, and memory. Similar to several of the aforementioned research challenges, computational complexity is a general concern in ML, including the context of this section.

• Balance between protection against attacks, censorship, and freedom of speech: Social media outlets and feeds from these outlets can be attacked by groups with radical views to promote their agenda. Thus, social network applications are more susceptible to attacks, and mitigation techniques are hard to integrate with social platforms as that might be seen as an attack on "freedom of speech."
It is important to maintain the balance between protection against attacks, censorship, and freedom of speech.

• Graph Neural Networks: Graph representations have been receiving a lot of attention as a key framework for representing entities and their relations. In the context of a social media network, users, user information, and relationships between users can be represented by graph nodes, node attributes, and graph edges, respectively, while graph-level operations correspond to operations on an entire social media network. As a typical example, predicting the link between two graph nodes corresponds to predicting a friendship or following relationship between two members of a social media platform [173]. As another example, estimating whether a node belongs to a subgroup corresponds to estimating whether a user belongs to a social media group. The primary added benefit of a graph representation is rendering the inherent links between nodes. Classical methods such as random walks can preserve these links; nevertheless, the success of deep neural networks in various fields has led to the development of graph neural networks (GNNs), a special type of neural network well suited to incorporating node links in its representations across the network layers. GNNs have been receiving increasing attention in recent years for representing data with relationships, and a key usage is the representation of social media networks. This elegant representation has boosted downstream tasks such as node and graph classification and link prediction. Despite their success in modeling graph data, GNNs have been shown to be vulnerable to adversarial attacks [174]; imperceptible perturbations applied at the node, edge, graph, and attribute level are shown to trick GCN models (a toy numerical illustration of an edge-level perturbation is given at the end of this section). Therefore, reaping the promising potential of GNNs in representing social media networks requires improving their robustness to such attacks and preserving the privacy of the data they represent. The following is a summary of key research challenges facing the security and privacy of GNNs in the context of social media platforms.

- Addressing the dynamic nature of social media network graphs: The majority of existing research considers static graphs with node attributes [175]. Still, social media networks are continuously changing [176]. For example, a person may make new friendships or terminate existing ones, and a new incident may suddenly attract people's interest, as in so-called trending events. Thus, this dynamic nature needs to be modeled well in GCNs and accounted for in the underlying design of adversarial attack and defense techniques.

- Imperceptibility: definition and quantification: There is a need to define the extent to which a given perturbation is imperceptible, which requires developing novel imperceptibility metrics.

- Identifying what makes a good perturbation: An interesting research direction is to study the commonality between good attacks attained by existing techniques. This can hint at how to systematically design good attacks, and the same knowledge can be exploited in designing efficient defense measures.

- Extension of knowledge from other domains: A possible research direction is to consider extending the existing techniques for attack and defense in the text and NLP domains to the graph domain.

- Scalability of attack and defense methods: There is a need for developing effective attack and defense methods at a small scale.
Such methods can then be easily scaled up to the level of the entire network.

- Mitigating privacy leakage in GCNs: Social media networks typically contain private data that should be kept out of public reach, as well as publicly published data that persons and platforms share at no harm. Privacy leakage happens when public information can be exploited to infer private data. Examples along this line include revealing private links between persons based on public information [177], and even re-identifying anonymized persons in a social media network based on public information [178]. In fact, a series of recent works accuse GCNs of leaking private data, especially in the context of social media [179, 180]. The main reason for this leakage is the correlation between the data of different users in a social GNN [181]. Thus, an important future research avenue is to mitigate this leakage, for example by leveraging differential privacy or developing new GCN architectures that minimize it.

Fig. 6 summarizes the outstanding challenges in the areas of attack and defense, along with the underlying current and future research trends.

• Adversarial ML for social media NLP applications is a very important and growing research area, as the stakes are very high. For example, it can allow attackers, who are often from state-sponsored agencies, to influence public opinion.

• Textual content in social media is very noisy. Due to such diverse characteristics, NLP in general, and adversarial NLP in particular, is more challenging, and developing defense techniques is even harder.

• Online social media applications are more prone to adversarial attacks than conventional sources of text.

• Social media applications are more susceptible to attacks, and mitigation techniques are hard to integrate with social platforms as that might be seen as an attack on "freedom of speech." So there is a delicate balance between protection against attacks, censorship, and freedom of speech.

• The widespread use of social media attracts attackers to launch different types of adversarial attacks on these networks to fulfill their objectives.

• Social media outlets are distributed in nature; therefore, distributed attacks are easier to manifest in social media applications.

• A successful attack against the most vulnerable social media outlets might be enough to have unintended consequences (e.g., promoting extreme views).

• The unstructured nature of text shared in social networks allows attackers to launch different types of adversarial attacks on the applications.

• The literature indicates that several interesting NLP applications of social networks are subject to adversarial attacks. Some of the notable applications include rumors, satires, parodies, clickbait, spam, hate speech, and misinformation detection.

• Recent literature emphasizes the need for developing efficient adversarial defense techniques.

• There is a need for reducing human dependency in both adversarial attack and defense methods.

• The existing literature on adversarial NLP in general, and on social media applications in particular, lacks focused benchmark datasets and quality metrics.

• Graph neural networks form a promising framework with great potential for improving the representation of social media information. However, this requires improving their security against adversarial attacks and mitigating their privacy leakage.
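As the toy numerical illustration promised above, the following self-contained sketch implements a randomly initialized two-layer GCN forward pass with NumPy and compares a node's class probabilities before and after flipping a single edge. The graph, features, and weights are invented purely for illustration and no training is performed; the only point is that structural perturbations propagate through the normalized adjacency matrix and change downstream node predictions.

```python
import numpy as np

def normalize_adjacency(adj: np.ndarray) -> np.ndarray:
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2} used in standard GCNs."""
    a_hat = adj + np.eye(adj.shape[0])
    d_inv_sqrt = np.diag(1.0 / np.sqrt(a_hat.sum(axis=1)))
    return d_inv_sqrt @ a_hat @ d_inv_sqrt

def gcn_forward(adj, features, w1, w2):
    """Two-layer GCN: softmax(A_norm · ReLU(A_norm · X · W1) · W2)."""
    a_norm = normalize_adjacency(adj)
    hidden = np.maximum(a_norm @ features @ w1, 0.0)          # ReLU
    logits = a_norm @ hidden @ w2
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))  # stable softmax
    return exp / exp.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n_nodes, n_feats, n_hidden, n_classes = 6, 4, 8, 2

# Toy undirected "friendship" graph: two separate triangles {0,1,2} and {3,4,5}.
adj = np.zeros((n_nodes, n_nodes))
for u, v in [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3)]:
    adj[u, v] = adj[v, u] = 1.0
features = rng.normal(size=(n_nodes, n_feats))
w1 = rng.normal(size=(n_feats, n_hidden))
w2 = rng.normal(size=(n_hidden, n_classes))

clean_pred = gcn_forward(adj, features, w1, w2)

# Edge-level perturbation: add a single edge connecting the two communities.
perturbed = adj.copy()
perturbed[2, 3] = perturbed[3, 2] = 1.0
adv_pred = gcn_forward(perturbed, features, w1, w2)

print("node 3 before:", clean_pred[3].round(3), "after:", adv_pred[3].round(3))
```

In a trained model, an attacker would of course select the edge to flip using gradient information or a surrogate model rather than arbitrarily, and a defense would attempt to detect or smooth out such structural changes; the sketch only shows why even a single flipped edge can alter a node's prediction.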
We surveyed the state of the art in adversarial attacks and defenses for six major text-based social media applications, namely rumor detection, satire detection, clickbait & spam identification, hate speech detection, misinformation detection, and sentiment analysis. These applications are susceptible to adversarial attacks. In this paper, we first provide a general overview of adversarial attack and defense techniques applicable to both text and image applications. As text is a primary communication means on social media platforms, we then review state-of-the-art attack and defense techniques specifically applicable to NLP. After that, we discuss the attack and defense studies in the aforementioned six major social media applications. Finally, we highlight current research challenges in the context of the security of ML for social media applications and state key lessons.

References

Social media and satellites
Use of social media by companies to reach their customers
Sentiment analysis and opinion mining
A model for sentiment and emotion analysis of unstructured social media text
Sentiment analysis from images of natural disasters
Implicit user trust modeling based on user attributes and behavior in online social networks
Social media, sentiment and public opinions: Evidence from #brexit and #us-election
Misinformation in social media: definition, manipulation, and detection
Fighting the COVID-19 infodemic: Modeling the perspective of journalists, fact-checkers, social media platforms, policy makers, and the society
Hate speech review in the context of online social networks
Racism, hate speech, and social media: A systematic review and critique
Demographic word embeddings for racism detection on Twitter
Proceedings of the First Workshop on Language Technology for Equality, Diversity and Inclusion
Sinai at semeval-2019 task 5: Ensemble learning to detect hate speech against immigrants and women in english and spanish tweets
Detecting east asian prejudice on social media
China and russia submit cyber proposal
Life in the network: the coming age of computational social science
Manifesto of computational social science
Proceedings of the Ninth International Workshop on Natural Language Processing for Social Media
Proceedings of the Eighth International Workshop on Natural Language Processing for Social Media
Proceedings of the Third Workshop on Abusive Language Online
Proceedings of the Fourth Workshop on Online Abuse and Harms
Proceedings of the Fourth Workshop on NLP for Internet Freedom: Censorship, Disinformation, and Propaganda
Overview of CheckThat! 2020: Automatic identification and verification of claims in social media
Overview of CheckThat! 2020 English: Automatic identification and verification of claims in social media
Overview of the CLEF-2021 CheckThat! lab on detecting check-worthy claims
Task 6 at SemEval-2021: Detection of persuasion techniques in texts and images
Proceedings of the 13th International Workshop on Semantic Evaluation, Association for Computational Linguistics
Proceedings of the Fourteenth Workshop on Semantic Evaluation, International Committee for Computational Linguistics
Intriguing properties of neural networks
Explaining and harnessing adversarial examples
Deepfool: a simple and accurate method to fool deep neural networks
Deep neural networks are easily fooled: High confidence predictions for unrecognizable images
Distillation as a defense to adversarial perturbations against deep neural networks
Adversarial examples for malware detection
Exploring adversarial examples in malware detection
Is deep learning safe for robot vision? adversarial examples against the icub humanoid
Adversarial attacks on medical machine learning
Adversarial attacks against medical deep learning systems
Generating malicious comments to attack neural fake news detection models
Detect rumors on twitter by promoting information campaigns with generative adversarial learning
Microsoft exec apologizes for tay chatbot's racist tweets, says users 'exploited a vulnerability'
On stabilizing generative adversarial training with noise
Regularizer to mitigate gradient masking effect during single-step adversarial training
Defending against adversarial attack towards deep neural networks via collaborative multi-task training
Ensemble adversarial training: Attacks and defenses
Feature squeezing: Detecting adversarial examples in deep neural networks
Analysis methods in neural language processing: A survey
Towards a robust deep neural network in texts: A survey
Adversarial attacks on deep-learning models in natural language processing: A survey
Adversarial attacks and defenses in images, graphs and text: A review
All you need is "love": evading hate speech detection
No "love" lost: Defending hate speech detection models against adversaries
Tsar: A system for defending hate speech detection models against adversaries
Natural language processing for social media
Hotflip: White-box adversarial examples for text classification
Black-box generation of adversarial text sequences to evade deep learning classifiers
Textbugger: Generating adversarial text against real-world applications
Synthetic and natural noise both break neural machine translation
Deceiving google's perspective api built for detecting toxic comments
Acoustic and visual approaches to adversarial text generation for google perspective
Combating adversarial misspellings with robust word recognition
Text processing like humans do: Visually attacking and shielding nlp systems
Character-level convolutional networks for text classification
Adversarial examples for natural language classification problems
Is bert really robust? a strong baseline for natural language attack on text classification and entailment
Word-level textual adversarial attacking as combinatorial optimization
Interpretable adversarial perturbation in input embedding space for text
Learning to discriminate perturbations for blocking adversarial attacks in text classification
Generating natural language adversarial examples through probability weighted word saliency
Greedy attack and gumbel attack: Generating adversarial examples for discrete data
Universal adversarial triggers for attacking and analyzing nlp
Bert-based adversarial examples for text classification
Semantically equivalent adversarial rules for debugging nlp models
Genattack: Practical black-box attacks with gradient-free optimization
Understanding neural networks through representation erasure
Discrete adversarial attacks and submodular optimization with applications to text classification
Bilateral adversarial training: Towards fast training of more robust models against adversarial attacks
Towards a physical-world adversarial patch for blinding object detection models
Adversarial texts with gradient methods
Generating natural language adversarial examples
Deep text classification can be fooled
On adversarial examples for character-level neural machine translation
Adversarial example generation with syntactically controlled paraphrase networks
Adversarial examples for evaluating reading comprehension systems
Robust neural machine translation with doubly adversarial inputs
Attention is all you need
On evaluation of adversarial perturbations for sequence-to-sequence models
Adversarial attacks against lipnet: End-to-end sentence level lipreading
Understanding and diagnosing vulnerability under adversarial attacks
Comparing attention-based convolutional and recurrent neural networks: Success and limitations in machine reading comprehension
Universal adversarial training with class-wise perturbations
The limitations of adversarial training and the blind-spot attack
A two-step spelling correction model for combating adversarial typos
Defending pre-trained language models from adversarial word substitutions without performance sacrifice
Adversarial co-distillation learning for image recognition
Detection and resolution of rumours in social media: A survey
Automatic rumor detection on microblogs: A survey
The psychology of rumor
False rumors detection on sina weibo by propagation structures
Information credibility on twitter
Rumor detection on social media with graph structured adversarial learning
Adversary-aware rumor detection
Detecting rumors from microblogs with recurrent neural networks
Exploiting context for rumour detection in social media
Ced: credible early detection of social media rumors
Learning reporting dynamics during breaking news for rumour detection in social media
Detect rumors in microblog posts using propagation structure via kernel learning
A multi-modal method for satire detection using textual and visual cues
Adversarial training for satire detection: Controlling for confounding variables
Defining "fake news": a typology of scholarly definitions
Fake news vs satire: A dataset and analysis
A survey of fake news: Fundamental theories, detection methods, and opportunities
Too many people think satirical news is real, The Conversation
Satirical news detection and analysis using attention mechanism and linguistic features
Automatic satire detection: Are you having a laugh?
Understanding satirical articles using common-sense
Fake news or truth? using satirical cues to detect potentially misleading news
Attending sentences to detect satirical fake news
New Security Paradigms Workshop
You break it, you buy it: The naiveté of social engineering in tech, and how to fix it
The second information revolution: digitalization brings opportunities and concerns for public health
Online disinformation and harmful speech: Dangers for democratic participation and possible policy responses
How artificial intelligence transforms cybersecurity
Fake news detection using discourse segment structure analysis
Manipulation and fake news detection on social media: a two domain survey, combining social network analysis and knowledge bases exploitation
Evasion attacks against machine learning at test time
Profiling hackers
E-mail spam filter based on unsupervised neural architectures and thematic categories: design and analysis
Robust fake news detection over time and attack
Event prediction in the big data era: A systematic survey
Fake news detection via nlp is vulnerable to adversarial attacks
How vulnerable are automatic fake news detection methods to adversarial attacks?
Untrue.news: A new search engine for fake stories
Textattack: A framework for adversarial attacks, data augmentation, and adversarial training in nlp
Exploring adversarial attacks and defences for fake twitter account detection
Troll factories: The internet research agency and state-sponsored agenda building
Hate speech on social media: Global comparisons, Council on Foreign Relations
A survey on hate speech detection using natural language processing
A bert-based transfer learning approach for hate speech detection in online social media
Poster: Adversarial examples for hate speech classifiers
Habertor: An efficient and effective deep hatespeech detector
Demoting racial bias in hate speech detection
Automated hate speech detection and the problem of offensive language
A survey on truth discovery
Fake news detection on social media: A data mining perspective
The science of fake news
The spread of true and false news online
The rise of guardians: Fact-checking url recommendation to combat fake news
The future of false information detection on social media: New perspectives and trends
Adversarial training methods for semi-supervised text classification
Freelb: Enhanced adversarial training for language understanding
Towards deep learning models resistant to adversarial attacks
Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples
Defending against neural fake news
The limitations of stylometry for detecting machine-generated fake news
Word embedding perturbation for sentence classification
Application of the bert-based architecture in fake news detection
Review of artificial intelligence adversarial attack and defense technologies
Fakenewsnet: A data repository with news content, social context and spatial-temporal information for studying fake news on social media
All-in-one: Multi-task learning for rumour verification
Visual sentiment analysis from disaster images in social media
Geo-spatial multimedia sentiment analysis in disasters
Sentiment analysis over social networks: an overview
Adversarial attacks on a lexical sentiment analysis classifier
Adversarial examples against a bert absa model-fooling bert with l33t, misspellign, and punctuation
Adversarial training for aspect-based sentiment analysis with bert
Textfirewall: Omni-defending against adversarial texts in sentiment classification
Adversarial and domain-aware bert for cross-domain sentiment analysis
Learning word vectors for sentiment analysis
Why do adversarial attacks transfer? explaining transferability of evasion and poisoning attacks
Joint link prediction and attribute inference using a social-attribute network
Gnnguard: Defending graph neural networks against adversarial attacks
Adversarial attacks and defenses on graphs
Representation learning for dynamic graphs: A survey
Link privacy in social networks
De-anonymizing social networks
Quantifying privacy leakage in graph embedding
Node-level membership inference attacks against graph neural networks
No free lunch in data privacy

The authors extend their appreciation to the Deputyship for Research & Innovation, Ministry of Education in Saudi Arabia for funding this research work through the project number 1120.