key: cord-0169882-kt989vcz
authors: Qayyum, Hina; Zhao, Benjamin Zi Hao; Wood, Ian D.; Ikram, Muhammad; Kaafar, Mohamed Ali; Kourtellis, Nicolas
title: A deep dive into the consistently toxic 1% of Twitter
date: 2022-02-16
journal: nan
DOI: nan
sha: bf01e2d8dfb99f936dc4738a445f11d770747cbd
doc_id: 169882
cord_uid: kt989vcz

Misbehavior in online social networks (OSN) is an ever-growing phenomenon. The research to date tends to focus on the deployment of machine learning to identify and classify types of misbehavior such as bullying, aggression, and racism, to name a few. The main goal of identification is to curb both human and machine-driven misconduct and make OSNs a safer place for social discourse. Going beyond past works, we perform a longitudinal study of a large selection of Twitter profiles, which enables us to characterize profiles in terms of how consistently they post highly toxic content. Our data spans 14 years of tweets from 122K Twitter profiles and more than 293M tweets. From this data, we selected the most extreme profiles in terms of consistency of toxic content and examined their tweet texts, and the domains, hashtags, and URLs they shared. We found that these selected profiles keep to a narrow theme with lower diversity in hashtags, URLs, and domains; that they are thematically similar to each other (in a coordinated manner, if not through intent); and that they have a high likelihood of bot-like behavior (i.e., they likely have progenitors with intentions to influence). Our work contributes a substantial and longitudinal online misbehavior dataset to the research community and establishes the consistency of a profile's toxic behavior as a useful factor when exploring misbehavior as a potential accessory to influence operations on OSNs.

Influence operations are organized attempts on Online Social Networks (OSN) to shape people's opinions. Among the strategic tools used by these malefactors are false news, misbehavior against communities based on religion, demographics, or sexual orientation, paid trolls, and automation (e.g., bots) (Alizadeh et al. 2020). The public consensus is that OSNs must take action against malign influence operations, as it is crucial to be able to identify, characterize, and predict the profiles instrumented to perform such operations. Works such as (Neethu and Rajasree 2013) and (Ikeda et al. 2013) have characterized, interpreted, and measured Twitter users' misbehavior in specific limited domains and in the user networks, while Ribeiro et al. (2018) report differences in content shared by normal and hateful users. It is also expected that groups of profiles may work together to create a deeper impact. Coordinated efforts of Twitter profiles to spread toxicity or controversies about topics like Bitcoin were studied by Pacheco et al. (2021), who examined networks of coordinated Twitter accounts by analyzing their profile activity and shared media over arbitrary lengths of time. However, they did not explore the prolonged involvement of a profile in spreading toxic content, nor its utility in identifying and characterizing coordination. In this paper, we seek to identify user profiles consistently pushing toxic content to promote or support influence operations. To this end, we curate the Twitter Toxic Tweets (3T) dataset, the largest Twitter dataset to date on this topic, with more than 293M tweets.
This dataset allows us to understand how misbehavior has evolved on Twitter, while its analysis is foundational in furthering the detection and understanding of Twitter profiles dedicated to spreading a toxic narrative, and their differentiation from other, benign profiles. 3T is seeded with seven smaller public datasets from past works studying online misbehavior on Twitter, covering multiple themes of online misbehavior: hostility, racism, abuse, hatefulness, homophobia, spam, and sexism. These datasets are balanced in their toxic and non-toxic users. A key limitation of the seed datasets is that users are often classified as toxic or not from the content of a single tweet or a handful of tweets, which does not allow deeper analysis of the users' ongoing behavior. To enable such analysis, we crawl the tweet timeline of each of the users present in the seed datasets. Our resulting 3T dataset contains 122,255 Twitter profiles and 293,401,161 individual tweets posted between 2007 and 2021.

Human annotation is untenable given the size of our dataset. Hence, we turn to Google's Perspective API models to assign toxicity scores to each tweet, covering the following types of misbehavior: Toxicity, Severe Toxicity, Identity Attack, Inflammatory, Threat, and Insult. To our knowledge, this is the largest dataset annotated with these six Perspective API models. We identify six sets of user profiles who are both posting highly toxic content (based on the median of their tweets' scores) and doing so consistently throughout their tweet timelines (based on the Gini index of their tweets' scores). The selected profiles become the focus of our study. We contrast these profiles with sets of random profiles to explore the likelihood that they are participants in influence operations furthering hatred and toxicity online. We explore the following questions: 1) What toxic content, URLs, domains, and hashtags do they share? What are their high-level topics of interest? How coherent and readable are their tweets? 2) How homogeneous are these clusters of toxic profiles based on the web resources they share? 3) Is there a measurable degree of automation within these profiles? Armed with this understanding of the behavior of potential agents of influence operations, we hope our work informs the creation of improved tools to mitigate their negative impacts on OSNs. We demonstrate that our methodology is a useful tool for identifying and understanding toxic influence operations on OSNs. Our work provides tools for social media moderators to help curtail persistently toxic profiles and to maintain a safe environment for discourse between users. This paper makes the following main contributions:
• Longitudinal online misbehavior dataset. We collect and automatically annotate a large longitudinal dataset consisting of 293 million tweets (§2), to our knowledge the largest published dataset on online misbehavior. Upon publication, we plan to release our enriched dataset for future research.
• Identification of consistently misbehaving profiles. Using the Gini index and toxicity scores, we propose a novel approach to identify profiles that consistently generate online toxic content, and demonstrate that this is an effective tool for identifying profiles likely involved in toxic influence operations (§4).
• Characterization of consistently misbehaving profiles. We characterize profiles for casual vs. consistent misbehavior (§5). We observe that consistently toxic profiles are specific and cohesive in the types of web content they share.
Hashtags shared by such profiles are coherent, but toxic and malignant in nature, which sets them apart from profiles occasionally involved in online misbehavior. Such profiles persistently discuss toxic and sensitive topics about war zones, ethnicity, and religion.
• Analyzing homogeneous temporal misbehavior of profiles across categories of misbehavior. Our focus profiles across all categories share a comparable number of similar domains. We identify consistently misbehaving profiles sharing interests by embedding similar hashtags (§6). Our study shows that consistently toxic users maintain a very distinct tweeting pattern, which extends to using specific hours of the day and specific days of the week. We reveal that such misbehaving profiles are highly likely to be automated accounts instrumented for spreading misbehavior (§7).

In this section, we detail our methodology for data collection and augmentation (overview shown in Figure 1). Section 2.1 describes our seed datasets, and Section 2.2 provides details on the crawling of user timelines. Section 2.3 details our efforts to augment the crawled data with toxicity scores by querying the Google Perspective API, and finally, Section 2.4 provides a characterization of the augmented dataset. We curated 7 publicly available datasets from past studies as our base dataset. All these studies used human annotation, existing ML models, or random sampling on the streaming API to label users as toxic or not (based on a single tweet). A summary of the selected datasets, with details of their size and labels, can be found in Table 1. As per Twitter's Terms and Conditions, Twitter User IDs (UIDs) and tweet content cannot be publicly shared. Consequently, our seed datasets contained Tweet IDs (TIDs) and their respective annotations. Therefore, the first step was to query Twitter's API (Twitter 2021) to recover the UID responsible for each TID. Next, we queried the Twitter API with these UIDs to retrieve each profile's historical tweets. The Twitter API allows the retrieval of the 3,200 most recent tweets from a profile, allowing us to study the historical record of each user and their evolving toxic behavior. We were unable to retrieve tweets from banned, deleted, or private profiles. From the retrieved tweets (JSON files), we extracted relevant details such as the text, date of creation, hashtags, URLs, and domains shared within tweets. For this study, we only consider English tweets. While the investigation of other languages could offer additional insights, we were constrained by Perspective API's range of supported languages at that time. Additionally, it was unclear whether scores across languages are calibrated and directly comparable. In the aforementioned seed datasets, one tweet was annotated or labeled per user. However, it is unrealistic to assume that this single-tweet label can be propagated to all the tweets of said user. Thus, to obtain a measure of misbehavior across all tweets of each user, we employed Google's Perspective API (Google 2021). Perspective API provides multiple Convolutional Neural Network (CNN)-based models, trained with GloVe word embeddings (Pennington, Socher, and Manning 2014), for the evaluation of misconduct in any submitted text. This API offers 16 ML models that provide a probabilistic score from 0 to 1, indicating the intensity of a specific dimension, such as Toxicity, Threat, or Inflammatory, in the given text.
We focused on receiving scores from this API on the following six dimensions, defined as:
• Toxicity: Rude, disrespectful, or unreasonable comments, likely to make people leave a discussion.
• Severe Toxicity: Comments very likely to make users leave a discussion or give up sharing their perspective.
• Identity Attack: Negative or hateful comments targeting someone because of their identity, such as their ethnicity or sexual orientation.
• Inflammatory: Intended to provoke or inflame.
• Insult: Insulting or negative comments towards a person or a group of people.
• Threat: Intentions to inflict pain, injury, or violence against an individual or group.

Table 1: Overview of the 7 datasets used as seeds, with a collection of User IDs (UIDs) or Tweet IDs (TIDs), whichever was made publicly available. TIDs were used to recover User IDs (RUIDs). Crawled UIDs (CUIDs) are the user profiles successfully crawled. CUIDs can be fewer than RUIDs if said RUID Twitter profiles were not found.

The Perspective API provides multiple other experimental dimensions, which we do not use here. We polled all 293M English tweets for a score from each of these models. Thus, each tweet in our dataset has these six scores. To better understand the composition of our curated dataset, we first inspect the Cumulative Distribution Function (CDF) of each Perspective score, across all tweets through time (Figure 2a). We observe that the median score of a tweet for any of the six dimensions varies in the range 0.1-0.2. Also, a steady rise in the curve in the low ranges of scores indicates that a majority of tweets do not strongly exhibit any specific form of misbehavior (80% of tweets have scores less than 0.4). The strongest signal for misbehavior is in the dimension of Inflammatory content. A tail is also observed of tweets acting as exceptions to the rule, propagating what is perceived as a large amount of misbehavior (score → 1.0). Following other studies (e.g., ElSherief et al. (2018)), we binarize tweets as misbehaving or not by applying a threshold of 0.4 (as an example) on each score (we discuss the use of 0.4 as a threshold later in Section 4). Then, we compute and plot in Figure 2b the proportion of misbehaving tweets each user has posted. We find that 80% of all users have at most 30% of their tweets (or fewer, depending on the misbehavior category) considered misbehaving. Still, there is a tail of strongly misbehaving users with a high percentage of tweets meeting this condition of misbehavior. Furthermore, to inspect how misbehavior has changed through time across Twitter as a platform (represented by our sample), we bin tweets per month and compute the median score across all user tweets created during each monthly bin.

Table 2: Linear regression (Ordinary Least Squares) of the median Perspective score of all collected tweets by month (as plotted in Figure 2c). P(F-stat) values across the six dimensions: 3×10^-45, 7×10^-68, 1×10^-48, 3×10^-63, 7×10^-6, 6×10^-12.

These median scores are presented in Figure 2c and demonstrate an increase in the level of misbehavior through time, especially in the years 2016-2020. In particular, when computing a linear regression model for each toxic behavior through time (Table 2), we find that all six categories fit such a model well (with the highest p-value of an F-test being 7×10^-6), with positive slopes ranging from 0.092 to 0.268. These trends may be potential indicators of expanded influence operations in recent years.
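As an illustration of this trend analysis, the monthly binning and OLS fit can be sketched as follows. This is a minimal sketch assuming the scored tweets sit in a pandas DataFrame with illustrative column names (created_at, toxicity), not the paper's actual pipeline:

```python
# Minimal sketch: OLS trend over monthly median scores (one dimension).
import numpy as np
import pandas as pd
import statsmodels.api as sm

def monthly_trend(df: pd.DataFrame):
    """df: one row per tweet, columns 'created_at' (datetime) and 'toxicity'."""
    monthly = (df.set_index("created_at")["toxicity"]
                 .resample("M")      # bin tweets per month
                 .median()           # median score per monthly bin
                 .dropna())
    # Regress the monthly median on elapsed months (0, 1, 2, ...).
    X = sm.add_constant(np.arange(len(monthly), dtype=float))
    model = sm.OLS(monthly.to_numpy(), X).fit()
    return model  # model.params[1]: slope; model.f_pvalue: P(F-stat)
```

Repeating this fit once per Perspective dimension would yield slopes and F-test p-values of the kind reported in Table 2.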
Takeaway 1: Toxic behavior on Twitter has been increasing over the last 15 years, across six dimensions of misbehavior in tweets' texts.

The research presented in this paper is non-commercial, in line with Twitter's Terms and Conditions for research purposes. We used the standard Twitter API to collect publicly available data on Twitter, from tweets of public user profiles. We acknowledge the security and privacy responsibilities that come with the collected data. During the storing and processing of the data, Twitter users were referred to only by their UIDs. In all of our experiments, any result produced and shown cannot be used to re-identify or track said users, as no user profiles are specifically named. During our experiments, we follow the ethical guidelines outlined in Rivers and Lewis (2014). Given our experimentation on human-produced data, we obtained formal ethics committee approval from our institution's IRB. Our data will not be shared with any third party for commercial purposes. Our work is the first to release scores from six Perspective API models for 122K profiles and 293M tweets, to help facilitate research on combating online misbehavior on Twitter along six dimensions. The released 3T dataset will be a collection of TIDs and six Perspective API scores per TID: Toxicity, Severe Toxicity, Identity Attack, Inflammatory, Insult, and Threat. Sharing of TIDs and scores is in line with Twitter's and Perspective API's Terms and Conditions. We expect this work to contribute to a broader understanding of online misbehavior, by identifying consistently misbehaving Twitter user profiles which contribute disproportionately to online toxic content, as well as profiles that may be unintentionally marked as such due to errors in API scores.

Our goal is to identify and study profiles serving influence operations that consistently post tweets scoring high in the six categories of misbehavior: Toxicity, Severe Toxicity, Identity Attack, Inflammatory, Insult, and Threat. For each type, we identify sets of consistently toxic profiles based on the median and Gini index of each profile's scores (§4.1). We refer to these profiles as 'focus profiles'. We then perform a first-pass filtering to remove obscene content (§4.2). To identify focus profiles, we cluster the available profiles with respect to two main axes: an overall high level of misbehavior (represented by median Perspective scores), and overall low variability in the toxicity of their posts. To measure a user's variability in toxicity, we use the Gini coefficient of their tweets' Perspective scores. The Gini coefficient was originally intended as a measure of the concentration of wealth (Gini 1912), but can equally be used to measure how evenly a set of values is distributed; in our case, toxicity scores. A consistent set of (low or high) scores produces a value closer to 0, whereas a wide range of (both low and high) scores produces a Gini coefficient closer to 1. We drop profiles with fewer than 10 tweets, to ensure enough activity per user to reliably compute the two metrics. We set a lower threshold on the median score (0.4) and an upper threshold on the Gini coefficient (0.25) to capture extreme and consistent toxic behavior. This approach is similar to that of ElSherief et al. (2018), who chose 0.8 for Toxicity and 0.5 for Attack on Commenter scores (one of the Perspective API dimensions) to identify very toxic tweets in their corpus.
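The joint median/Gini selection criterion can be sketched as below. This is a minimal sketch; the per-profile score arrays and the exact Gini formulation are our assumptions based on the description above:

```python
import numpy as np

def gini(scores):
    """Gini coefficient of a profile's Perspective scores (0 = perfectly even)."""
    s = np.sort(np.asarray(scores, dtype=float))  # ascending order
    n = len(s)
    total = np.sum(s)
    if total == 0:
        return 0.0  # all-zero scores are trivially "even"
    index = np.arange(1, n + 1)
    # Standard rank-weighted formulation of the Gini coefficient.
    return (2.0 * np.sum(index * s) / (n * total)) - (n + 1.0) / n

def is_focus_profile(scores, min_median=0.4, max_gini=0.25, min_tweets=10):
    """Consistently toxic: enough tweets, high median score, low Gini."""
    scores = np.asarray(scores, dtype=float)
    if len(scores) < min_tweets:
        return False
    return np.median(scores) >= min_median and gini(scores) <= max_gini
```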
We adopt a conservative Gini threshold in the range 0.0-0.25, aiming to capture the highly consistent toxic behavior expected from participants in influence operations. The process is illustrated in the scatter-plot of Figure 3 for Identity Attack scores. Each dot represents a Twitter profile in 3T, and the yellow box indicates the aforementioned thresholds. The profiles (dots) falling within the yellow box are our focus profiles for the Identity Attack dimension. This step is repeated for all six Perspective scores (for the plots for Inflammatory, Insult, and Threat scores, please refer to Appendix Sec. A.2). The resulting six clusters of focus profiles are summarized in Table 3. For the rest of the paper, focus profiles are referenced based on the type of misbehavior they represent. In addition, for every set of focus profiles, we also select a random set of profiles from the 3T data, equal in number to each set of focus profiles. We refer to these as random profiles, per type of misbehavior. As seen in Table 3, the selected thresholds for identifying focus profiles lead to a small number of such profiles in all clusters, compared to the size of the 3T dataset. This can be attributed to the following reasons: a) to start with, the number of toxic tweets on Twitter is expected to be generally small (∼8% was reported by Founta et al. (2018)); b) extremely toxic profiles do not last long on Twitter: they get reported for misconduct and are banned fairly quickly (Jhaver et al. 2021); c) our selected thresholds are conservative (min median = 0.4 and max Gini = 0.25) and applied in a combined fashion, as we aim to identify extreme cases: profiles that are both highly toxic and consistent in their toxicity (not just sporadically toxic).

In this work, we go beyond past studies that focus on typical Perspective scores such as (Severe) Toxicity (ElSherief et al. 2018; Hosseini et al. 2017; Jain et al. 2018), and study Identity Attack, Inflammatory, Insult, and Threat to identify profiles exhibiting diverse types of misbehavior. Thus, we first compute a pairwise Pearson correlation across the scores of tweets between the six dimensions and plot their potential signal similarities in Figure 2d. It becomes clear that Toxicity and Severe Toxicity are highly correlated with each other, and with the rest of the dimensions. Then, we investigate the type of content shared by the focus profiles based on these scores. In particular, we count all the hashtags shared by these profiles and investigate the top 50 most frequently shared hashtags in each group. In all profile selections, 35%-42% of hashtags were obscene in nature. For example, 31 out of the 50 most shared hashtags in the focus Identity Attack profiles were obscene, e.g., 'xxx', 'adult', 'asian', 'nsfw', 'beardedmen', and 'beards', and 7% of profiles had shared these hashtags in more than 80-85% of their tweets. Indeed, pornographic content is shared liberally on Twitter (Pew Research Center 2018). However, we are not interested in characterizing sexual obscenity on Twitter; instead, we aim to find potential participants in influence campaigns. We note that in the Toxicity and Severe Toxicity focus selections, 80-85% of profiles are involved in tweeting obscene hashtags, in contrast to the other four sets of focus profiles, where 6-9% of profiles are responsible for the majority of obscene hashtags. Thus, like Gomez et al. (2019), we drop such profiles, after manual inspection of their hashtags.
Overall, given the correlation results for Toxicity and Severe Toxicity, and the highly obscene profiles included in them, we decided to drop these two dimensions and focus on the other four: Identity Attack, Insult, Inflammatory, and Threat. We also drop the 6-9% of profiles in these dimensions whose share of obscene hashtags exceeds 80%.

The nature of the text in a profile's tweets, as well as auxiliary content included in the tweets and in the profile, such as URLs and hashtags, can represent the profile's focus and interests. If their interests align under an influence operation, the shared content of multiple profiles should follow suit. To extract these interests, we perform a longitudinal analysis of all tweets per focus profile, and observe the nature of shared content, over the four types of misbehavior: Identity Attack, Insult, Inflammatory, and Threat. In particular, we perform analysis on the URLs (§5.1) and hashtags (§5.2) shared, the topics of tweets (§5.3), and their degree of readability (§5.4). For URL analysis, we first detect all URLs from focus profiles per misbehavior dimension (Identity Attack, Inflammatory, Insult, and Threat) and the corresponding sets of random profiles. Then, we extract second-level domains (SLDs) from all detected URLs, resulting in 319,082 SLDs. An SLD is the part of the domain that is located right before a Top-Level Domain (TLD). For example, in www.example.com, the SLD is example.com and the TLD is com. Beyond this point, when we mention a domain, we refer to the SLD. Next, we classify these domains with the FortiGuard classification service (Inc. 2021). FortiGuard uses link crawlers, customer logs, and machine learning to categorize websites (Anonymous et al. 2020). Using FortiGuard, we successfully categorize 312,702 (98%) domains; the remaining 2% (6,380 domains) correspond to 32,419 (0.28%) of the web pages distributed by user tweets. Table 4 provides a breakdown of the total and unique numbers of URLs, domains, and domain categories for focus vs. random profiles (in the table, 2L-TLDs refer to second-level domains (SLDs), and "None" refers to unrated websites whose domain category is unknown to FortiGuard). We note that not every focus profile shared URLs in their tweets, so the average number of unique URLs per profile is computed using only the profiles that shared at least one URL. We observe that all focus profiles shared a larger total number of URLs and domains than random profiles. However, when we look at the unique URLs and domains, focus profiles present a different picture: they have shared a larger number of unique URLs, but a much smaller number of unique domains, than random profiles. In particular, the focus Insult profiles shared the highest number of total (71,897) and unique (67,320) URLs, almost double that of random profiles. Perhaps expectedly, random profiles shared URLs that are fairly unique (since their total and unique numbers of URLs are almost the same). When we look into unique domains, we observe a different picture: focus profiles share a much smaller (in some cases orders of magnitude smaller) number of unique domains than random profiles. For example, focus Identity Attack profiles referenced only 52 unique domains, compared to 1,100 from random profiles, even though the two sets have the same number of profiles and the same order of magnitude of total URLs. We observe similar results for focus Insult profiles, which shared only 136 unique domains from 71,897 URLs, compared to 1,027 unique domains from 36,720 URLs.
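For reproducibility, the SLD extraction step described above can be sketched with the tldextract library; the library choice is our assumption, as the paper does not name its URL-parsing tool:

```python
# Minimal sketch of the SLD extraction step, using tldextract.
import tldextract

def to_sld(url: str) -> str:
    """Map a full URL to its second-level domain, e.g. 'example.com'."""
    ext = tldextract.extract(url)
    return ext.registered_domain  # SLD + public suffix; '' if unparseable

# Illustrative input; real URLs come from the crawled tweet entities.
urls = ["https://www.example.com/some/page", "http://fb.me/abc"]
slds = {to_sld(u) for u in urls if to_sld(u)}
print(slds)  # {'example.com', 'fb.me'}
```

Using the public-suffix list (as tldextract does) avoids mislabeling domains under multi-part suffixes such as co.uk.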
The URL and domain counts above are also reflected in the average number of unique URLs and domains per profile: for all focus profile types, the averages are smaller than for random profiles, demonstrating that focus profiles are, on average, sharing a less diverse set of URLs and domains. Also, looking into the domain categories extracted from FortiGuard, all sets of focus profiles have a smaller number of categories than the random sets of profiles. We look into this further by plotting, in Figure 4, the top 20 categories of domains out of the 89 different categories found in focus Identity Attack profiles (similar plots for the other categories of misbehavior can be found in Appendix Sec. A.3). We observe that focus Identity Attack profiles share different types of domains, and with different intensity, than random profiles. Clear trends can be seen, with Information Technology (e.g., marinsoftware.com and nec.com), URL Shortening (e.g., fb.me, tinyurl.com, and myburbank.com), Social Networking (e.g., facebooklive.com), and News & Media (e.g., unfoxnews.com) dominating. Random profiles, in contrast, share a broader spectrum of web resources, from a more uniform distribution of categories (social networking, business, entertainment, shopping, streaming media, politics, etc.). The same trends are present in the other types of misbehaving profiles, with the top 3 categories for Inflammatory and Insult including Information Technology, URL Shortening, and Social Networking domains. The exception is Threat, where the top 3 contains Social Networking, News & Media, and Streaming Media & Download. Finally, pornographic content dominates the focus profiles of Identity Attack and Insult, and is present in all toxic clusters, with 0.2%, 0.02%, 0.17%, and 0.12% in Identity Attack, Inflammatory, Insult, and Threat, respectively. These results are in line with the findings of Pew Research Center (2018). Finally, Figure 5a shows the CDF of the number of unique URLs and domains per focus Identity Attack vs. random profile. For results based on the other categories of misbehavior, please refer to Appendix A.5. We observe that 95% of focus profiles post at most 100 unique domains, while 20% of the random group of users share at least 100 unique domains in their tweets. This suggests a tendency of such Identity Attack profiles to share a narrower set of external web content than random profiles.

Takeaway 2: Focus profiles fetch very specific and cohesive types of web content, originating from a larger set of URLs, compared to random profiles. Random profiles share a wider range of domains from a smaller set of URLs, pointing to more diverse web resources included in their tweets.

Adding hashtags to tweets is a popular and easy way for users to convey a message to an interested audience, and to have a voice in intended communities. To compare the tendency of focus and random profiles to share hashtags, we extracted and compared the total and unique numbers of hashtags in focus and random profiles. Figure 5b shows the CDF of these counts per profile, for focus Identity Attack and random profiles. More results, based on focus Inflammatory, Insult, and Threat and their random sets of profiles, can be found in Appendix A.6. Around 70% of the focus profiles do not use any hashtags in their tweets, whereas fewer than 1% of random profiles use no hashtags, suggesting that hashtags are generously used by all sets of random profiles.
Of the focus profiles that do use hashtags, 90% use at most 10 hashtags, in contrast to the 50% of random profiles that use at least 100 hashtags, demonstrating the diverse interests covered by random profiles. Table 5 shows that focus Identity Attack profiles share the smallest number of hashtags: 701 total and 612 unique. On the other hand, the focus Inflammatory profiles share the highest number of total (5,299) and unique (4,008) hashtags. Overall, the four types of focus profiles share a considerably smaller number of hashtags than random profiles, and choose to engage specific and very few communities through hashtags. Diving into the hashtags posted by these focus profiles (Table 6), they are strikingly different in nature to the ones from random profiles. Focus Identity Attack profiles share hashtags about warfare and conflict. The most shared hashtag, #TreCru, is about Treasure Cruise, an action role-play combat game, whereas other hashtags reference countries under attack or at war, like #Syria, and #BDS (Boycott, Divestment, Sanctions), a Palestinian-led movement for freedom and equality. Random profiles, on the other hand, share hashtags which reflect major happenings, such as #Covid, #coronavirus, #BlackLivesMatter, and #Christmas. Interestingly, focus Inflammatory and Insult profiles share hashtags about the American political situation in a cohesive set, such as #Trump, #DumpTrump, and #TrumpIsALoser, as well as #MAGA, which points to the campaign led by Trump and his supporters over the last 6 years. Hashtags like #FakeNews, #MeTooMovement, and #NBA are used in an aggravating manner as well.

Takeaways 3-4: First, a minority of focus profiles use hashtags; when used, they are sparse, compared to the volume of unique hashtags from an equal number of random profiles. Second, hashtags shared by focus profiles are coherent, but toxic and malignant in nature, compared to random profiles.

The text of a tweet is of great value for peeking into the nature of the one-way postings and even discussions a profile engages in. In order to characterize the cohesiveness and types of topics discussed in our focus profiles (which are representative of the four categories of misbehavior), we use topic modeling on the text of all tweets in each set of profiles. We specifically used Latent Dirichlet Allocation (LDA) (Blei, Ng, and Jordan 2003), a probabilistic topic modeling algorithm, to extract topics from a number of documents (user tweets). LDA assumes that each document is composed of a number of topics, with each topic being a probability distribution over all the words in the topic. Our implementation to extract topics from users' tweets in toxic and random profiles leverages the Natural Language Toolkit (Sphinx and Theme 2021). For each profile in the focus or random sets, we extract its tweets' text. We remove the retweets' text, and consider English-only tweets. We then remove all non-alphanumeric characters, URLs, and stop-words, and apply lemmatization and tokenization. Finally, we extract topics for each category of misbehavior and for random profiles. We also extract topics from the retweets of focus and random profiles, and discover similar results. For example, focus Inflammatory profiles retweet about Trump and women, and focus Insult profiles retweet about the Black Lives Matter movement and killing. Contrasting these topics, all random groups shared incohesive, benign topics.
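A minimal sketch of this preprocessing and topic-extraction pipeline is shown below; the use of gensim for LDA and the exact cleaning rules are our assumptions (the paper cites NLTK for preprocessing):

```python
# Minimal sketch: clean tweets, then fit LDA per profile cluster.
# Requires: nltk.download('stopwords'); nltk.download('wordnet')
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from gensim import corpora, models

STOP = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def preprocess(text):
    text = re.sub(r"http\S+", " ", text)          # drop URLs
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)   # drop non-alphanumerics
    return [LEMMATIZER.lemmatize(tok) for tok in text.lower().split()
            if tok not in STOP and len(tok) > 2]  # stop-word removal

def extract_topics(tweets, num_topics=10):
    docs = [preprocess(t) for t in tweets]        # one document per tweet
    dictionary = corpora.Dictionary(docs)
    corpus = [dictionary.doc2bow(d) for d in docs]
    lda = models.LdaModel(corpus, num_topics=num_topics, id2word=dictionary)
    return lda.print_topics()
```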
Takeaways 5-6: First, the topics of discussion of each cluster are related to the type of misbehavior the cluster represents. Second, there is cohesiveness in the theme of topics within the focus profiles of a cluster. Focus profiles, in general, discuss topics that project hatred, insults, and threats, and sensitive topics about war zones and politics, while random profiles discuss harmless topics, such as users' daily lives and feelings, history, or books. As a whole, focus Identity Attack, Inflammatory, Insult, and Threat profiles consistently share very specific, hateful, sensitive, and obscene-natured content.

We further analyze our focus and random profiles' tweets for grammatical and semantic correctness. We parse each tweet to extract the number of words, sentences, punctuation marks, and non-letters (e.g., emoticons), and measure the Lexical Richness, the Automated Readability Index (ARI) (Senter and Smith 1967), and the Flesch score (Flesch 1948). Lexical richness, defined as the ratio of the number of unique words to the total number of words, reveals noticeable repetition of distinct words. ARI estimates the comprehensibility of a text corpus and is computed as: 4.71 × (average word length) + 0.5 × (average sentence length) − 21.43. The Flesch score indicates how difficult a text is to read and is computed as: 206.835 − 1.015 × (total words / total sentences) − 84.6 × (total syllables / total words). A higher ARI indicates text pitched at a higher reading grade level (more sophisticated text), while a higher Flesch score indicates easier readability. Table 8 shows a summary of the results. Random profiles tweet with a higher ARI than focus Identity Attack profiles (9.08 vs. 7.73), higher Richness (0.22 vs. 0.10), and a higher Flesch score (58.66 vs. 53.42), suggesting that non-toxic profiles use a richer vocabulary and that their tweets have higher readability.

Takeaway 7: Focus profiles tweet with lower comprehensibility and readability, and a poorer vocabulary, than random profiles.

We have seen that focus profiles share less diverse content in terms of URLs, web domain categories, hashtags, and text-based topics. To discern whether the tweets of profiles within a given selection are homogeneous across profiles (potentially working for the same influence operation), we further investigate the Jaccard similarity between the domains (§6.1) and hashtags (§6.2) in tweets, within and between their respective clusters. We also study these similarities across all four types of misbehavior and compare how their distributions differ by computing their KL-Divergence distance. We assess the overlap among the domains referenced in tweets of focus and random profiles using the pairwise Jaccard similarity index, computed between domain lists A and B referenced by focus Identity Attack or random profiles, respectively. The Jaccard index for two sets A and B is computed as |A ∩ B| / |A ∪ B|, and ranges from 0 (no common elements between the two sets) to 1 (a perfect match or overlap). Similar results are obtained for the other dimensions and are omitted for brevity.
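The pairwise similarity computation can be sketched as follows (a minimal sketch; the per-profile domain or hashtag sets are assumed to be precomputed):

```python
# Minimal sketch: pairwise Jaccard similarity within one profile cluster.
from itertools import combinations

def jaccard(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B|; defined as 0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def pairwise_jaccard(profile_sets: dict) -> list:
    """profile_sets: UID -> set of domains (or hashtags) for that profile."""
    return [jaccard(profile_sets[u], profile_sets[v])
            for u, v in combinations(profile_sets, 2)]
```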
Figures 5c and 5d present the CDFs of the Jaccard similarity scores of focus and random profiles, as obtained for domains and hashtags in their tweets, respectively. As expected, the great majority of random profiles are highly dissimilar to each other, as well as to focus profiles (Figure 5c: only ∼1.3% of pairs of random profiles show any similarity, with a maximum of 0.2). However, within focus profiles, ∼50% of profiles have no similarity with each other, ∼25% have up to 0.3 similarity, and the rest have similarity higher than 0.3. Also, only 1-2% of focus profiles show any similarity with random profiles, indicating that the messages or objectives of the focus profiles largely differ from random ones. Table 4 showed the number of domains and their categories. To assess the distance between their distributions, we compute the Kullback-Leibler Divergence (KL) between the numbers of domains in all four sets of focus profiles. D_KL(P||Q) is a statistical measure of how much one probability distribution P differs from a second one, Q. In Figure 6a, we observe that the number of domains shared by focus Insult profiles is closest to focus Inflammatory profiles (D_KL: 0.12), whereas the numbers of domains in focus Identity Attack and Threat profiles are also closest to Inflammatory profiles (D_KL: 0.25, 0.3). We also measure D_KL between all pairs of random profile sets and find that they differ greatly (D_KL: 36.27-49.39; details omitted due to space). To analyze the homogeneity of domains found in tweets of profiles belonging to different categories of misbehavior, we also compute D_KL for the similarity of the domains (cf. Figure 5c). Following the analysis on domains, we computed the Jaccard similarity on the sets of hashtags, as well as D_KL, as measures to gauge similarity among sets of focus profiles. Figure 5d shows the CDF of the Jaccard similarity computed between the vectors of hashtags appearing in tweets of focus Identity Attack profiles, between random profiles, and across focus and random profiles. Similar results were retrieved for the other dimensions and are excluded due to space. We find that focus profiles within a specific cluster are more similar with respect to usage of hashtags than random profiles. In particular, ∼86% of the focus Identity Attack profile pairs have at least 0.5 Jaccard similarity, compared to the similarity within random profiles, where only 5% of pairs have a similarity ≥ 0.5. Also, focus profiles use distinctly different hashtags from random profiles, since their cross-group similarity is close to 0 in 97% of cases. We now turn our attention to the number of hashtags used by each profile. To calculate the D_KL of the distribution of hashtags used per cluster of focus profiles, we first computed the CDFs of total hashtags per profile in all four sets of focus profiles. Figure 6c shows that the lowest D_KL values are found for Inflammatory vs. Identity Attack profiles (0.34), and Inflammatory vs. Insult profiles (0.23), indicating that they share a similar number of hashtags. High D_KL scores for the random sets of profiles showed that these profiles, as found earlier, have quite different distributions from each other (the minimum D_KL in random set comparisons is 12.4). In Figure 6d, the lowest D_KL values show that Insult, Threat, and Inflammatory profiles share the most similar hashtags. Again, the high D_KL of hashtag similarity among random profiles shows that the hashtags they share do not match each other.
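A minimal sketch of the D_KL comparison between two clusters' per-profile count distributions is given below; the histogram binning and smoothing constant are our assumptions, as the paper does not state them:

```python
# Minimal sketch: D_KL between two clusters' per-profile counts
# (e.g., number of hashtags per profile).
import numpy as np
from scipy.stats import entropy

def kl_divergence(counts_p, counts_q, bins=50, eps=1e-12):
    """D_KL(P || Q) between histograms of two count samples."""
    lo = min(np.min(counts_p), np.min(counts_q))
    hi = max(np.max(counts_p), np.max(counts_q))
    p, _ = np.histogram(counts_p, bins=bins, range=(lo, hi), density=True)
    q, _ = np.histogram(counts_q, bins=bins, range=(lo, hi), density=True)
    p, q = p + eps, q + eps  # smooth empty bins to avoid division by zero
    # scipy's entropy(pk, qk) computes the KL divergence of pk from qk.
    return entropy(p / p.sum(), q / q.sum())
```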
Takeaway 9: Focus profiles across all toxic categories, in comparison to random profiles, are homogeneous in terms of both the number of hashtags and the actual hashtags they post in their tweets.

Twitter provides quick updates about significant events happening around the world. Unfortunately, many of these updates are in part due to automated accounts, with 66% of tweeted links to popular news and current-event websites being made by Twitter bots (Pew Research Center 2018). Beyond news, bots have been scrutinized for also spreading fake news and changing public political perception and discourse in coordinated influence operations. Our results on the extracted topics, shared hashtags, and URLs of the identified focus profiles hint at the possible automation of these accounts, in line with Pew Research Center (2018). Thus, we now attempt to characterize how many of the focus profiles, which consistently post very specific and toxic content, could be bots. Following past work by Fernquist et al. (2019), we analyze their tweeting time patterns to infer periodic and bot-like behaviors, and compare them with random profiles (§7.1). We also query Botometer by Sayyadiharikandeh et al. (2020) for the likelihood that a given profile is a bot (§7.2). We define the tweeting pattern as the frequency and timing of a profile's tweets. To investigate the tweeting pattern of focus profiles, we first isolate the timestamps of all their tweets. Figure 7a shows the Probability Density Function (PDF) of the time between sequential tweets by focus Identity Attack and random profiles, up to 60 minutes (similar results were retrieved for the other clusters and can be found in Appendix Sec. A.4). We observe that these focus profiles produce tweets at highly regular intervals, with clear peaks at <1, 5, 10, 15, and 20 minutes, and with 75.2% of all inter-tweet intervals shorter than an hour. On the other hand, random profiles have a smooth distribution of inter-tweet intervals, producing new tweets in all possible time slots, almost monotonically decreasing as the interval increases; 64.3% of all their inter-tweet intervals occur within an hour. Complementing Figure 7a, Figure 7b shows the CDF of inter-tweet intervals: 20% of focus (random) profile tweets have an inter-arrival time larger than 123 (463) minutes, with a maximum inter-arrival time of 22 (12) days. Then, we look into the time of day and day of week at which tweets are being posted by focus or random profiles. Figures 7c and 7d show the PDFs for these two tweeting pattern aspects. We observe that the focus profiles are more active from 7am to 2pm UTC than random profiles, which exhibit the typical diurnal behavior of regular users, with dual peaks during the Americas' and European working and evening hours. Further, focus profiles are quite different in their posting activity during the week than random profiles: they maintain similar levels of activity throughout the whole week, and are more consistently active on weekends, compared to random profiles, which demonstrate a notable decrease in their weekend activity. When we repeat this analysis on the remaining types of misbehavior (for each set of toxic and random profiles), we observe similar results; the rest of the plots can be found in Appendix Sec. A.4. We also express these results as D_KL scores relative to the Identity Attack analysis. The D_KL of the inter-tweet time distribution for Identity Attack (i.e., Figure 7a) to the Inflammatory, Insult, and Threat focus profiles is 0.40, 0.01, and 0.04, respectively. Similarly low scores are found when computing D_KL on the CDFs (i.e., Figure 7b for Identity Attack): D_KL of 0.01, 0.01, and 0.04, respectively.
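The inter-tweet interval and time-of-day statistics above can be sketched as follows (a minimal sketch; timestamp parsing from the crawled tweet objects is assumed):

```python
# Minimal sketch: per-profile tweeting pattern features.
import pandas as pd

def tweeting_pattern(timestamps):
    """timestamps: list of tweet creation-time strings for one profile."""
    ts = pd.to_datetime(pd.Series(timestamps), utc=True).sort_values()
    # Gaps between consecutive tweets, in minutes (for the PDF/CDF plots).
    gaps_min = ts.diff().dropna().dt.total_seconds() / 60.0
    return {
        "gaps_minutes": gaps_min.to_numpy(),
        "hour_of_day": ts.dt.hour.value_counts(normalize=True).sort_index(),
        "day_of_week": ts.dt.dayofweek.value_counts(normalize=True).sort_index(),
    }
```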
Takeaway 10: The tweeting behavior of focus profiles is very consistent and regular. They tweet frequently and at specific, small time intervals, and demonstrate longer activity hours during the day and week, without the typical weekend breaks that random profiles show.

To complement our findings, we also query Botometer by Sayyadiharikandeh et al. (2020) for a score for all focus and random profiles. Botometer is an AI-based algorithm that classifies a given Twitter account as a bot/automated account or a human. It takes into account a profile's followers, friends, account age, and the sentiment and language of its tweets, and outputs a bot score ranging from 0 to 5, with 0 being the most human-like and 5 the most bot-like. The received Botometer scores are shown in Table 9, for all four misbehavior dimensions and for focus vs. random profiles. We find that our focus profiles consistently score higher than random profiles on the bot scale, with comparably tight score variability. We also retrieve the Complete Automation Probability (CAP) by Yang et al. (2019), the conditional probability that a profile with a given Botometer score is a bot. For example, focus Identity Attack profiles have an average Botometer score of 4.53; at this score, 89.7% of accounts with this score or higher are likely to be bots, in contrast to the 57.8% for random profiles, whose average Botometer score is 1.58.
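A Botometer lookup of this kind can be sketched with the botometer-python package (v4 API); the result fields below follow its public documentation, and all credentials are placeholders:

```python
# Minimal sketch: query Botometer for a profile's bot score and CAP.
import botometer

twitter_app_auth = {
    "consumer_key": "XXX",      # placeholder credentials
    "consumer_secret": "XXX",
}
bom = botometer.Botometer(wait_on_ratelimit=True,
                          rapidapi_key="XXX",
                          **twitter_app_auth)

def bot_score(screen_name):
    """Return the 0-5 display score and the CAP for one profile."""
    result = bom.check_account(screen_name)
    return {
        "display_score": result["display_scores"]["english"]["overall"],
        "cap": result["cap"]["english"],  # Complete Automation Probability
    }
```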
Takeaway 11: Focus profiles demonstrate characteristics that rank them higher on Botometer's bot scale, hinting at a higher likelihood that they are automated accounts.

Online misbehavior detection on social networks has been extensively explored by several studies, such as Gomez et al. (2019); Ribeiro et al. (2018); Founta et al. (2018); Waseem and Hovy (2016); Dhungana Sainju et al. (2021), to name a few. The latter identifies different types of bullying and online misbehavior and derives the motivations of users involved in bullying on Twitter, along with an examination of temporal patterns in bullying-related tweets. These past studies have relied on human annotations to differentiate between toxic and non-toxic tweets. Our work, instead, relies on the ML models of the Perspective API to rate the collected tweets; it also explores misbehavior dimensions beyond the prior works, and at a larger scale of data, i.e., 293M tweets. Hosseini et al. (2017) and Jain et al. (2018) have studied Google's Perspective API (Google 2021) and its resilience against adversarial attacks. Those studies leveraged the Perspective API to score and analyze the toxicity of tweets. Our work goes beyond these studies in terms of the size of the dataset (293M tweets) and the number of misbehavior dimensions not studied in the past, namely Insult, Inflammatory, Threat, and Identity Attack. Influence operations on OSNs are a heightened phenomenon, spreading through automated accounts or bots (Pew Research Center 2018). Consistent creation and dissemination of toxic and false content by active operators or accounts is the basis of a consistent spread of toxicity on OSNs (Establishment 2019). Content-based features best predict OSN-based influence operations, but unsupervised ML for the detection of coordinated efforts of profiles carrying out these operations is infeasible at scale (Alizadeh et al. 2020). Our longitudinal study of 14 years gives a very clear picture of the consistent production of toxic content. The presented methodology effectively differentiates consistent vs. occasional misbehavior of Twitter profiles, and allows spotting the consistently malignant content and profiles.

Summary: In this paper, we performed a first-of-its-kind longitudinal study of 122K Twitter profiles and 293M tweets, over a period of 15 years (2007-2021). We were particularly interested in studying toxic profiles who may participate in influence operations on Twitter. Towards this goal, we analyzed the toxicity of tweets using six Perspective API ML models and found that toxic behavior has increased through this 15-year period, across all six dimensions of misbehavior. We took a deep dive into the most toxic profiles, who are also very consistent in this behavior. We focused on these profiles and studied their posted content, topics covered, and post timing patterns, and observed several characteristics that can help their identification and removal from OSNs. These focus profiles are noticeably different from random Twitter profiles in terms of shared content and posting patterns. Findings on consistent, and highly toxic, profiles:
• They fetch and share very specific and cohesive types of web resources (domains), originating from many URLs.
• They post from a homogeneous and small pool of domains shared within their cluster of misbehavior.
• Less than a third of them use hashtags; and their hashtags are mostly malignant and toxic in nature.
• They tweet on topics that are cohesive and related to their type of misbehavior: hatred, insult, threat, and sensitive topics about war zones and politics.
• Their text has lower comprehensibility and readability, and uses a poorer vocabulary, than random profiles.
• They tweet at small and regular time intervals, and often coincide with each other's posting activity.
• They demonstrate longer activity hours during the day and week, without the typical weekend breaks that random, or more normal, profiles show.
• They are likely (semi-)automated accounts, as they rank high on the bot scale and in regularity of posting.
Overall, the profiles we focused on are small in number compared to the total dataset collected. This means that consistently toxic misbehavior is still manageable within a popular OSN such as Twitter. OSN admins can deploy methods like ours to detect and remove such profiles, who are probable participants in influence operations on social discourse.

Future Work: We plan to scrutinize further the phenomenon of consistent and highly toxic misbehavior based on features of Twitter profiles such as their self-declared location, to attempt to automatically detect such profiles using ML classifiers, and to extend the analysis across different platforms.

As a sanity check of the scores obtained from the Perspective API, we focused on the largest seed dataset, i.e., the ICWSM 2018 one (Founta et al. 2018), and cross-correlated the scores from the API models with the annotations assigned to the tweet of a user based on the tweet content. This dataset includes single tweets from 98.3K users, of which 4,940, 13,690, 27,094, and 52,652 were labeled as 'hateful', 'spam', 'abusive', and 'normal', respectively. We were able to retrieve tweets from the timelines of a total of 39,344 users present in this dataset. To validate that the pre-trained Perspective API models produce stable outputs, we used the API models for "Toxicity", "Severe Toxicity", "Identity Attack", "Insult", "Inflammatory", and "Threat", because the definitions of these API scores are closest to the annotation effort from Founta et al. (2018). For all users in the seed dataset, we computed the median of the 6 scores mentioned above across all of their tweets. The results of this investigation, in Fig. 8, show the distribution of the API scores for all four available annotations. Toxicity scores for abusive tweets have a median of 0.2 and a highest value of 2.6. Hateful scores have a highest value of 2.5. Normal- and spam-labeled tweets got very low Toxicity scores of 1.8 and 1.5, respectively. In all cases, the distributions of Perspective score medians for users labeled 'abusive' or 'hateful' are significantly different (p<0.01) from those for users labeled 'normal' or 'spam', showing consistency between Perspective scores and annotated user labels.
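The group comparison in this sanity check can be sketched as below; the specific test (Mann-Whitney U) is our assumption, as the paper reports only the significance level:

```python
# Minimal sketch: compare per-user median scores of two annotation groups.
from scipy.stats import mannwhitneyu

def compare_groups(medians_a, medians_b):
    """medians_a/b: per-user median Perspective scores for two labels."""
    stat, p = mannwhitneyu(medians_a, medians_b, alternative="two-sided")
    return stat, p

# e.g., compare_groups(abusive_user_medians, normal_user_medians);
# p < 0.01 would indicate significantly different score distributions.
```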
Focus group identification plots for the Identity Attack, Inflammatory, Insult, and Threat scores are presented here. These groups were isolated by imposing thresholds on the respective median scores and Gini indices. The details of the process can be found in §4. We plotted the top 20 categories of domains, out of all the different categories found, in focus Identity Attack, focus Inflammatory, focus Insult, and focus Threat profiles and their respective random sets of profiles, as an extension of the analysis performed in §5.1. Figure 10: Top 20 domain categories in focus and random Identity Attack, Inflammatory, Insult, and Threat sets of profiles. Here, 2L-TLDs refer to second-level domains (SLDs). "None" refers to unrated websites whose domain category is unknown to FortiGuard. The subplots in Figures 11, 12, 13, and 14 are shared for focus Identity Attack, Inflammatory, Insult, and Threat profiles and the respective random groups, as an extension of the analysis performed in §7.1. Sub-plots (a) in these figures show the Probability Density Function (PDF) of the time between sequential tweets by focus and random profiles, up to 60 minutes. Sub-plots (b) show the CDF of inter-tweet intervals as an extension of the aforementioned PDF. We also looked into the time of day and day of week at which tweets are posted by focus or random profiles across three types of misbehavior, in sub-plots (c) and (d). The plots in Figure 18 are shared to extend the results of Section 6.1 for focus and random profiles of the Inflammatory, Insult, and Threat categories of misbehavior.

Table 6: Descriptions of notable hashtags shared by focus and random profiles.
#BDS: The Palestinian-led BDS movement promotes boycotts, divestments, and economic sanctions against Israel.
#BREAKING: A hashtag used to represent breaking news.
#BlackLivesMatter: A political and social movement that seeks to highlight racism, discrimination, and inequality experienced by black people.
#MeToo: Me Too is a movement against sexual abuse and harassment through public disclosure of allegations.
#MAGA: Make America Great Again was a campaign slogan leading up to and during the Trump presidency.
#trap: A subgenre of hip-hop music.
#pg3d: Pixel Gun 3D is an online multiplayer FPS heavily influenced by the pixel art style of Minecraft.

References
Alizadeh, M.; Shapiro, J. N.; Buntain, C.; and Tucker, J. A. 2020. Content-based features predict social media influence operations. Science Advances.
Anonymous; Niaki, A. A.; Hoang, N. P.; Gill, P.; and Houmansadr, A. 2020. Triplet Censors: Demystifying Great Firewall's DNS Censorship Behavior. In USENIX FOCI.
Blei, D. M.; Ng, A. Y.; and Jordan, M. I. 2003. Latent Dirichlet Allocation. Journal of Machine Learning Research.
Dhungana Sainju et al. 2021. Bullying discourse on Twitter: An examination of bully-related tweets using supervised machine learning.
ElSherief, M.; Nilizadeh, S.; Nguyen, D.; Vigna, G.; and Belding, E. 2018. Peer to peer hate: Hate speech instigators and their targets. In ICWSM.
Norwegian Defence Research Establishment. 2019. Social network centric warfare: understanding influence operations in social media.
Fernquist et al. 2019. Extracting Account Attributes for Analyzing Influence on Twitter.
Flesch, R. 1948. A new readability yardstick. Journal of Applied Psychology.
Founta, A.-M.; Djouvas, C.; Chatzakou, D.; Leontiadis, I.; Blackburn, J.; Stringhini, G.; Vakali, A.; Sirivianos, M.; and Kourtellis, N. 2018. Large Scale Crowdsourcing and Characterization of Twitter Abusive Behavior. In ICWSM.
Gini, C. 1912. Variabilità e mutabilità.
Gomez, R.; Gibert, J.; Gomez, L.; and Karatzas, D. 2019. Exploring Hate Speech Detection in Multimodal Publications.
Google. 2021. Perspective API: Using machine learning to reduce toxicity online.
Hosseini, H.; Kannan, S.; Zhang, B.; and Poovendran, R. 2017. Deceiving Google's Perspective API Built for Detecting Toxic Comments.
Ikeda, K.; Hattori, G.; Ono, C.; Asoh, H.; and Higashino, T. 2013. Twitter user profiling based on text and community mining for market analysis. Knowledge-Based Systems.
Fortinet Inc. 2021. Web Filter Categories.
Jain et al. 2018. Adversarial Text Generation for Google's Perspective API.
Jha, A., and Mamidi, R. 2017. When does a compliment become sexist? Analysis and classification of ambivalent sexism using twitter data.
Jhaver, S.; Boylston, C.; Yang, D.; and Bruckman, A. 2021. Evaluating the Effectiveness of Deplatforming as a Moderation Strategy on Twitter. PACM on Human-Computer Interaction (CSCW).
Kaggle. Hatred on Twitter During MeToo Movement (dataset).
Neethu, M. S., and Rajasree, R. 2013. Sentiment analysis in twitter using machine learning techniques. In ICCCNT.
Pacheco, D.; Hui, P.-M.; Torres-Lugo, C.; Truong, B. T.; Flammini, A.; and Menczer, F. 2021. Uncovering Coordinated Networks on Social Media: Methods and Case Studies. In ICWSM.
Pennington, J.; Socher, R.; and Manning, C. D. 2014. GloVe: Global Vectors for Word Representation. In EMNLP.
Ribeiro, M. H.; Calais, P. H.; Santos, Y. A.; Almeida, V. A. F.; and Meira Jr., W. 2018. "Like Sheep Among Wolves": Characterizing Hateful Users on Twitter.
Rivers, C. M., and Lewis, B. L. 2014. Ethical research standards in a world of big data. F1000Research.
Sayyadiharikandeh, M.; Varol, O.; Yang, K.-C.; Flammini, A.; and Menczer, F. 2020. Detection of novel social bots by ensembles of specialized classifiers. In CIKM.
Senter, R. J., and Smith, E. A. 1967. Automated readability index. Technical report.
Waseem, Z. 2016. Are You a Racist or Am I Seeing Things? Annotator Influence on Hate Speech Detection on Twitter.
Waseem, Z., and Hovy, D. 2016. Hateful Symbols or Hateful People? Predictive Features for Hate Speech Detection on Twitter. In NAACL Student Research Workshop.
Yang, K.-C.; Varol, O.; Davis, C. A.; Ferrara, E.; Flammini, A.; and Menczer, F. 2019. Arming the public with artificial intelligence to counter social bots. Human Behavior and Emerging Technologies.

This work was partially supported by the Macquarie University Cybersecurity Hub (MQCHUB) and the EU H2020 Research and Innovation programme. Hina Qayyum was supported by the Macquarie University Domestic High Degree Research Scholarship Program. Nicolas Kourtellis was partially supported during this project by the EU H2020 Research and Innovation programme under grant agreement No 830927 (Concordia). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors or originators and do not necessarily reflect the views of MQCHUB or of the EU H2020 Research and Innovation programme.