key: cord-0261054-nole9ql2
authors: Halvani, Oren; Graner, Lukas
title: POSNoise: An Effective Countermeasure Against Topic Biases in Authorship Analysis
date: 2020-05-02
journal: nan
DOI: 10.1145/3465481.3470050
sha: 1793fc681de207cb88dc4b088eb2cc51d4ed6106
doc_id: 261054
cord_uid: nole9ql2

Authorship verification (AV) is a fundamental research task in digital text forensics, which addresses the problem of whether two texts were written by the same person. In recent years, a variety of AV methods have been proposed that focus on this problem and can be divided into two categories: The first category refers to such methods that are based on explicitly defined features, where one has full control over which features are considered and what they actually represent. The second category, on the other hand, relates to such AV methods that are based on implicitly defined features, where no control mechanism is involved, so that any character sequence in a text can serve as a potential feature. However, AV methods belonging to the second category bear the risk that the topic of the texts may bias their classification predictions, which in turn may lead to misleading conclusions regarding their results. To tackle this problem, we propose a preprocessing technique called POSNoise, which effectively masks topic-related content in a given text. In this way, AV methods are forced to focus on such text units that are more related to the writing style. Our empirical evaluation based on six AV methods (falling into the second category) and seven corpora shows that POSNoise leads to better results compared to a well-known topic masking approach in 34 out of 42 cases, with an increase in accuracy of up to 10%.

Texts are written for a variety of purposes and appear in numerous digital and non-digital forms, including emails, websites, chat logs, office documents, magazines and books.
They can be categorized according to various aspects including language, genre, topic, sentiment, readability or writing style. The latter is particularly relevant when the question regarding the authorship of a certain document (e. g., ghostwritten paper, blackmail letter, suicide note, letter of confession or testament) arises. Stylometry is the quantitative study of writing style, especially with regard to questions of authorship, and can be dated back to the 19th century [21] . Stylometry uses statistical methods to analyze style on the basis of measurable features and has historical, literary and forensic applications. The underlying assumption in stylometry is that authors tend to write in a recognizable and unique way [13, 7] . Over the years, a number of authorship analysis disciplines have been established, of which authorship attribution (AA) is the most widely researched. The task AA is concerned with is to assign an anonymous text to the most likely author based on a set of sample documents from candidate authors. AA relies on the so-called "closed-set assumption" [20, 41] , which states that the true author of the anonymous text is indeed in this candidate set. However, if for some reason this assumption cannot be met, then an AA method will necessarily fail to select the true author of the anonymous text. A closely related discipline to AA is authorship verification (AV), which deals with the fundamental problem of whether two given documents D 1 and D 2 were written by the same person [27] . If this problem can be solved, almost any conceivable AA problem can be solved [27] , which is the reason why AV is particularly attractive for practical use cases. Based on the fact that any AA problem can be broken down into a series of AV problems [30] , we have decided to focus in this paper on the AV problem. 
From a machine learning point of view, AV represents a similarity detection problem, where the focus lies on the writing style rather than the topic of the documents. In spite of this, it can be observed in the literature that a large number of AV methods, including [1, 8, 9, 10, 29, 30, 31, 33, 39, 40], are based on implicit feature categories such as character/word or token n-grams. However, it often remains unclear which specific "linguistic patterns" they cover, in contrast to explicit feature categories such as punctuation marks, function words or part-of-speech (POS) tags, which can be interpreted directly. Since in general one has no control over implicitly defined features, it is important to verify (for example, through a post-hoc analysis) what they actually capture. Otherwise, predictions made with AV methods based on such features may be influenced by the topic rather than the writing style of the documents. This, in turn, can prevent AV methods from achieving their intended goal.

To counter this problem, we propose a simple but effective technique that deprives AV methods of the ability to consider topic-related features with respect to their predictions. The basic idea is to retain stylistically relevant words and phrases using a predefined list, while replacing topic-related text units with their corresponding POS tags. The latter represent word classes such as nouns, verbs or adjectives and thus provide grammatical information which refers to the content of the corresponding words. POS tags have been widely used in AA, AV and many other disciplines related to authorship analysis. They have been confirmed to be effective stylistic features, not only for documents written in English [9, 20, 34] but also in other languages such as Russian [32], Estonian [36] and German [11].
While many AV and AA approaches consider simple POS tags [36], other variants are also common in the literature, including POS tag n-grams [19], POS tag one-hot encodings [20], POS tags combined with function words [11] and probabilistic POS tag structures [12].

The remainder of the paper is organized as follows: Section 2 discusses previous work that served as a motivation for our approach, which is proposed in Section 3. Section 4 describes our experiments and Section 5 concludes the work and provides suggestions for future work.

A fundamental requirement of any AV method is the choice of a suitable data representation, which aims to model the writing style of the investigated documents. The two most common representations that can be used for this purpose are (1) vector space models and (2) language models. A large part of existing AV approaches fall into category (1). Kocher and Savoy [25], for example, as well as Koppel and Schler [26, 28], proposed AV methods that consider the most frequent words occurring in the documents. Other approaches that also make use of vector space models are those of Potha and Stamatatos [39], Koppel and Winter [30], Hürlimann et al. [23], Barbon et al. [3] and Neal et al. [33] which, among others, consider the most frequent character n-grams. On the other hand, AV methods based on neural networks such as the approaches of Hosseinia and Mukherjee [22], Boenninghoff et al. [6], Bagnall [2] and Jasper et al. [24] fall into category (2). These approaches employ continuous space language models that operate on the word and character level of the documents. In contrast to these, AV approaches such as those proposed by Veenman and Li [47] or Halvani et al. [16, 17] are based on compression-based language models, where internally a probability distribution for a given document is estimated based on all characters and their preceding contexts. Regarding the latter, Bevendorff et al.
[5] have shown that this type of AV method is effective compared to the current state of the art in AV. Regardless of their strengths and effectiveness, all the above-mentioned AV approaches suffer from the same problem: they lack a control mechanism which ensures that their decision with respect to the questioned authorship is not inadvertently distorted by the topic of the documents. In the absence of such a control mechanism, AV methods can (in the worst case) degenerate from style classifiers to simple topic classifiers.

To address this problem, Stamatatos [44] proposed a technique that we refer to in this paper as TextDistortion. The method aims to mask topic-specific information in documents before passing them on to AA or AV methods. The topic-specific information is not related to the author's personal writing style, which is why masking helps to maintain the correct objective (classifying documents by their writing style rather than by their content). To achieve this, occurrences of infrequent words are substituted entirely by uniform symbols. In addition, numbers are masked such that their structure is retained while hiding their specific value. Given these transformations, most of the syntactical structure of the text is retained (including capitalization and punctuation marks), which is more likely to be associated with the author's writing style [44]. Stamatatos introduced the following two variants of TextDistortion, which require as a prerequisite a word list W_k containing the k most frequent words in the English language:

• Distorted View - Single Asterisk (DV-SA): Every word w ∉ W_k in a given document D is masked by replacing each word occurrence with a single asterisk *. Every sequence of digits in D is replaced by a single hash symbol #.
• Distorted View - Multiple Asterisks (DV-MA): Every word w ∉ W_k in D is masked by replacing each of its characters with *. Every digit in D is replaced by #.
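The two variants can be illustrated with a minimal sketch. It is not the reference implementation of [44]: the tiny `FREQUENT_WORDS` set is a hypothetical stand-in for W_k, and the regular-expression tokenization (runs of ASCII letters) is an assumption made only for illustration.

```python
import re

# Hypothetical stand-in for W_k; the real list holds the k most frequent English words.
FREQUENT_WORDS = {"the", "of", "and", "to", "a", "in", "is", "this", "for"}

WORD_RE = re.compile(r"[A-Za-z]+")   # assumed notion of a "word"
DIGIT_RUN_RE = re.compile(r"\d+")    # a whole sequence of digits
DIGIT_RE = re.compile(r"\d")         # a single digit


def dv_sa(text, frequent_words=FREQUENT_WORDS):
    """DV-SA: each word outside W_k becomes one '*', each digit sequence one '#'."""
    def mask(match):
        word = match.group(0)
        return word if word.lower() in frequent_words else "*"
    return DIGIT_RUN_RE.sub("#", WORD_RE.sub(mask, text))


def dv_ma(text, frequent_words=FREQUENT_WORDS):
    """DV-MA: each character of a word outside W_k becomes '*', each digit '#'."""
    def mask(match):
        word = match.group(0)
        return word if word.lower() in frequent_words else "*" * len(word)
    return DIGIT_RE.sub("#", WORD_RE.sub(mask, text))


sentence = "The price of this book is 1250 dollars."
print(dv_sa(sentence))  # The * of this * is # *.
print(dv_ma(sentence))  # The ***** of this **** is #### *******.
```

Note how DV-MA preserves word and number lengths, whereas DV-SA collapses every masked unit to a single symbol.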
Both variants can be applied to any given document D written in English, without the need for specific NLP tools or linguistic resources (besides the word list W_k). However, in order to apply TextDistortion to D, the hyperparameter k, which regulates how much content is going to remain in D, must be carefully specified beforehand. Moreover, one must take into account that the replacement of each potentially topic-related word w in D is performed uniformly without any further distinction as to what w represents. Consequently, the masking procedure inevitably discards relevant information associated with w that could serve as a useful stylistic feature.

Inspired by the approach of Stamatatos [44], we propose an alternative topic masking technique called POSNoise ("POS-Tag-based Noise smoothing"), which addresses the two issues of TextDistortion mentioned in Section 2. The core idea of our approach is to keep stylistically relevant words and phrases in a given document D using a predefined list L, while replacing topic-related words with their corresponding POS tags represented by a set S. Text units not covered by L and S, such as punctuation marks and idiosyncratic words, are likewise retained. In what follows, we first describe the requirements of POSNoise and explain how it differs from the existing TextDistortion approach. Afterwards, we present the respective steps of our topic masking algorithm, which is listed in Algorithm 1 as Python-like pseudocode.

Similarly to TextDistortion, our approach also relies on a predefined list L of specific words that should not be masked. However, unlike TextDistortion, which uses a list of words ordered by frequency of occurrence in the British National Corpus (BNC), our list L is structured by grammatical factors. More precisely, L comprises different categories of function words, phrases, contractions, generic adverbs and empty verbs.
Regarding the function word categories, we consider conjunctions, determiners, prepositions, pronouns and quantifiers, which are widely known in the literature (e.g., [35, 45]) to be content and topic independent. With respect to the phrases, we use different categories of transitional phrases including causation, contrast, similarity, clarification, conclusion, purpose and summary. As generic adverbs, we consider conjunctive, focusing, grading and pronominal adverbs, while as empty verbs, we take auxiliary and delexicalised verbs into account, as these have no meaning on their own. The tenses of verbs are additionally considered so that AV methods operating at the character level of documents can benefit from morphological features occurring in the inflected form of such words. Table 1 lists all the categories of words and phrases considered by POSNoise, together with a number of examples. Note that the comparison of the text units with L is case-insensitive (analogous to TextDistortion), while the original case of the retained text units is preserved. We also wish to emphasize that in our list L (in contrast to the word list W_k used by TextDistortion) topic-related nouns and pronouns, verbs, adverbs and adjectives are not present. According to Sundararajan and Woodard [46], especially the former two are strongly influenced by the content of the documents.

Besides our statically defined list L, we also make use of certain dynamically generated POS tags to retain stylistic features. For this, we apply a POS tagger to D so that a sequence of pairs ⟨t_i, p_i⟩ is created, where t_i denotes a token and p_i its corresponding POS tag. Here, we decided to restrict ourselves to the Universal POS Tagset so that each p_i falls into a coarse-grained POS category (cf. Table 2) of the token t_i. There are two reasons why we decided to use this tagset.
First, universal POS tags allow a better adaptation of POSNoise to other languages, as it can be observed that the cardinality of fine-grained POS tagsets differs from language to language [37]. Second, the POS tagger might produce more misclassified POS tags if a fine-grained tagset is used instead. Note that we do not use the original form of the tags as they appear in the tagset, such as PROPN (proper noun) or ADJ (adjective). Instead, we use individual symbols as representatives that are more appropriate with respect to AV methods that operate at the character level of documents. However, for readability reasons, we refer to these symbols as "tags".

Once D has been tagged, POSNoise substitutes all adjacent pairs ⟨t_i, p_i⟩, ⟨t_{i+1}, p_{i+1}⟩, ..., ⟨t_{i+n}, p_{i+n}⟩ whose tokens t_i, ..., t_{i+n} do not form an element in L with their corresponding POS tags p_i, ..., p_{i+n}, respectively. However, the replacement is only performed if p_i ∈ S = {#, §, Ø, @, ©, µ, $, } holds (cf. Table 2). More precisely, every token t_i for which p_i ∉ S applies is retained, in addition to all words and phrases in the document D that occur in L. The retained tokens are, among others, punctuation marks and interjections (e.g., "yes", "no", "okay", "hmm", "hey", etc.), where the latter represent highly personal and thus idiosyncratic stylistic features according to Silva et al. [43]. Regarding numerals, we keep written numbers unmasked, as such words and their variations may reflect stylistic habits of certain authors (e.g., "one hundred" / "one-hundred"). Digits, numbers and roman numerals, on the other hand, are masked by the POS tag µ. In a subsequent step, we adjust punctuation marks that were separated from their adjacent words (e.g., "however ," → "however,") as a result of the tokenization process of the POS tagger.
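The masking steps described so far can be sketched as follows. This is a simplified illustration rather than the authors' Algorithm 1: it expects input that is already tokenized and tagged with Universal POS tag names, `PATTERNS` is a hypothetical excerpt of the list L, and the tag-to-symbol assignment is only confirmed by the text and Table 4 for #, Ø, @ and µ (the assignments of §, © and $ are assumptions, and the eighth symbol of S is omitted).

```python
import re

# Hypothetical excerpt of the pattern list L (lower-cased; phrases as token tuples).
PATTERNS = {("as", "an", "example"), ("let",), ("us",), ("the",), ("following",)}
MAX_PHRASE_LEN = max(len(p) for p in PATTERNS)

# Universal POS tag -> masking symbol. '#', 'Ø', '@' and 'µ' follow the text;
# the assignments of '§', '©' and '$' are assumptions for this sketch.
TAG_SYMBOLS = {"NOUN": "#", "PROPN": "§", "VERB": "Ø", "ADJ": "@",
               "ADV": "©", "NUM": "µ", "SYM": "$"}

ROMAN_RE = re.compile(r"[IVXLCDM]+")


def is_written_number(token):
    """True for written-out numbers such as 'twelve' or 'one-hundred'."""
    letters = token.replace("-", "")
    return letters.isalpha() and not ROMAN_RE.fullmatch(letters)


def pos_noise(tagged):
    """Mask a list of (token, universal_pos_tag) pairs."""
    out, i = [], 0
    while i < len(tagged):
        # Retain the longest word or phrase starting at i that occurs in L.
        matched = 0
        for n in range(min(MAX_PHRASE_LEN, len(tagged) - i), 0, -1):
            if tuple(tok.lower() for tok, _ in tagged[i:i + n]) in PATTERNS:
                matched = n
                break
        if matched:
            out.extend(tok for tok, _ in tagged[i:i + matched])
            i += matched
            continue
        token, tag = tagged[i]
        if tag == "NUM" and is_written_number(token):
            out.append(token)             # written-out numbers stay unmasked
        elif tag in TAG_SYMBOLS:
            out.append(TAG_SYMBOLS[tag])  # topic-related token -> POS symbol
        else:
            out.append(token)             # punctuation, interjections, ...
        i += 1
    # Re-attach punctuation that tokenization separated from the preceding word.
    return re.sub(r"\s+([,.;:!?])", r"\1", " ".join(out))


tagged = [("As", "ADP"), ("an", "DET"), ("example", "NOUN"), (",", "PUNCT"),
          ("let", "VERB"), ("us", "PRON"), ("analyze", "VERB"), ("the", "DET"),
          ("following", "VERB"), ("English", "ADJ"), ("sentence", "NOUN"),
          (".", "PUNCT")]
print(pos_noise(tagged))  # As an example, let us Ø the following @ #.
```

In practice the `(token, tag)` pairs would come from a POS tagger; the sketch takes them as input so that it does not depend on any particular tagging library.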
Our intention is to retain the positional information of the punctuation marks, as certain AV methods might use standard tokenizers that split by whitespace, so that "however ," would result in two different tokens. As a final step, all tokens are concatenated into the topic-masked representation D_Masked. In Section 4, we compare the resulting representations of POSNoise and TextDistortion. For the latter, we also explain how a suitable setting of the hyperparameter k (i.e., one that suppresses topic-related words but retains style-related words as best as possible) was determined.

In the following, we present our experimental evaluation. First, we present our seven compiled corpora and explain where the documents were obtained and how they were preprocessed, followed by a summary of their main statistics. Next, we mention which existing AV methods were chosen for the evaluation as well as how they were trained and optimized. Afterwards, we explain how an appropriate setting was chosen with regard to the topic-regularization hyperparameter of TextDistortion to allow a fair comparison between it and our POSNoise approach. Finally, we present the results and describe our analytical findings.

[Algorithm 1: POSNoise (Python-like pseudocode). Only fragments of the listing are preserved: the document is tokenized (Words ← tokenize( )); an element of L might be a phrase such as "apart from this"; the POS tags that replace topic-related words are defined as S = { "#", "§", "Ø", "@", "©", "µ", "$", " " }; a check tests whether a token represents a written-out number (e.g., "four" or "twelve").]

To compare our approach against TextDistortion, we compiled seven English corpora covering a variety of challenges, such as texts (of the same authors) written at different periods of time, texts with an excessive use of slang and texts with no cohesion. In total, the corpora comprise 4,742 verification cases, where each corpus C = {c_1, c_2, ...} is split into author-disjoint training and test partitions based on a 40/60 ratio.
Each c ∈ C denotes a verification case (D_U, D_A), where D_U represents an unknown document and D_A a set of sample documents of the known author A. To counteract population homogeneity, a form of AV bias described by Bevendorff et al. [4], we ensured that for each A there is one same-authorship (Y) and one different-authorship (N) verification case. Furthermore, we constructed all corpora so that the number of Y and N cases is equal (in other words, all training and test corpora are balanced). In Sections 4.1.1-4.1.7, we introduce the corpora; their key statistics are summarized in Table 3.

The corpus C_Gut comprises 260 texts taken from the Webis Authorship Verification Corpus 2019 dataset released by Bevendorff et al. [4]. The texts in C_Gut represent fragments extracted from fiction books that stem from the Project Gutenberg portal. C_Gut is the only corpus where the contained documents have not been preprocessed by us, as they have already gone through a clean and well-designed preprocessing routine, which is described in detail by Bevendorff et al. [4]. The only action we took was to re-arrange the training and test partitions and to balance the Y and N cases, since the original partitions were both imbalanced.

Table 3: Overview of our corpora and their key statistics. Notation: |C| denotes the number of verification cases in each corpus C, while |D_A| denotes the number of the known documents. The average character length of D_U and D_A is denoted by avg|D_U| and avg|D_A|, respectively. Note that in this context D_A denotes the concatenation of all known documents of the author A.

The corpus C_Wiki comprises 752 excerpts of 288 Wikipedia talk page editors taken from the Wikipedia Sockpuppets dataset released by Solorio et al. [42]. The original dataset contains two partitions comprising sockpuppets and non-sockpuppets cases, where for C_Wiki we considered only the latter subset.
In addition, we did not make use of the full range of authors within the non-sockpuppets cases, as appropriate texts of sufficient length were not available for each of the authors. From the considered texts, we removed Wiki markup, timestamps, URLs and other types of noise. Besides, we discarded sentences with many digits, proper nouns and near-duplicate string fragments as well as truncated sentences. This corpus comprises 466 paper excerpts from 233 researchers, which were collected from the computational linguistics archive ACL Anthology 10 . The corpus was constructed in such a way that for each author there are exactly two papers 11 , stemming from different periods of time. From the original papers we tried, as far as possible, to restrict the content of each text to only such sections that mostly comprise natural-language text (e. g., abstract, introduction, discussion, conclusion or future work ). To ensure that the extracted fragments met the important AV bias cues of Bevendorff et al. [4] , we preprocessed each paper extract in C ACL manually. Among others, we removed tables, formulas, citations, quotes, references and sentences that include non-language content such as mathematical constructs or specific names of researchers, systems or algorithms. The average time span between both documents of each author is ≈12 years, whereas the minimum and maximum time span are 8 and 31 years, respectively. Besides the temporal aspect of C ACL , another characteristic of this corpus is the formal (scientific) language, where the usage of stylistic devices (e. g., repetitions, metaphors or rhetorical questions) is more restricted, in contrast to other genres such as chat logs. This corpus comprises 738 chat logs of 260 sex offenders, which have been crawled from the Perverted-Justice portal 12 . The chat logs stem from older instant messaging clients (e. 
g., MSN, AOL or Yahoo), where we ensured for each conversation that only chat lines from the offender were extracted. To obtain as much language variability as possible regarding the content of the conversations, we selected chat lines from different messaging clients and different time spans (where possible) and considered such lines that differed mostly from each other using a similarity measure 13 . One characteristic of the chats in C Perv is an excessive use of slang, a variety of specific abbreviations and other forms of noise. As further preprocessing steps, we discarded chat lines with less than 5 tokens (or 20 characters) and such lines containing usernames, timestamps, URLs as well as words with repetitive characters (e. g.,"yoooo" or "jeeeez"). This corpus consists of 828 excerpts of news articles from 276 journalists, crawled from The Telegraph website. Due to their nature, the original articles contain many verbatim quotes, which can distort the writing style of the author of the article. To counter this problem, we sampled from each article such sentences that did not contain quotations and other types of noise including headlines and URLs. As a result, the underlying characters in each preprocessed article are solely restricted to (case-insensitive) letters, spaces and common punctuation marks. Finally, we concatenated the preprocessed sentences from each article into a single document. Note that due to this procedure the coherence of the resulting document is distorted. Consequently, AV methods that make use of character and/or word n-grams may capture "artificial features" that occur across sentence boundaries. This corpus comprises 1,395 posts from 284 users that were obtained from The Apricity -A European Cultural Community 14 portal. The postings are distributed across different subforums with related topics (e. g., anthropology, genetics, race and society or ethno cultural discussion). 
To construct C Apric , we ensured that all documents within each verification case stem from different subforums. The crawled postings have been cleaned from markup tags, URLs, signatures, quotes, usernames and other forms of noise. This corpus consists of 4,000 posts from 1,000 users, which were crawled from the Reddit community network. Each document in C Reddit has been aggregated from multiple posts from the same so-called subreddit to obtain a sufficient length. However, all documents within each verification case originate from different subreddits with unrelated topics. Hence, in contrast to the C Apric corpus, C Reddit represents a mixed-topic corpus. In total, C Reddit covers exactly 1,388 different topics including politics, science, books, news and movies. To compare the effectiveness of POSNoise and TextDistortion, we selected six well-known AV methods that have shown their potential in various studies (e. g., [17, 5, 44] ). All of the chosen methods are based on implicit feature categories and are thus susceptible to topic influences. Four of these (COAV [17] , OCCAV [16] , NNCD [47] and ProfileAV [38] ) rely on character n-grams, while the remaining two (SPATIUM [25] and Unmasking [26] ) are based on frequent words/tokens. In the following, we describe some design decisions we have made with regard to these methods and explain which and how their hyperparameters were set. NNCD and SPATIUM represent binary-extrinsic AV methods (cf. [18] ), meaning that they rely on so-called "impostor documents" (external documents outside the respective verification case) for their classification of whether or not there is a matching authorship. In the original paper, Veenman and Li [47] did not provide an automated solution to generate the impostor documents, but collected them in a manual way using a search engine. 
However, since this manual approach is not scalable, we opted for an alternative idea in which the impostor documents were taken directly from the test corpora. This strategy has also been considered by Kocher and Savoy [25] with respect to their SPATIUM approach. Although using static corpora is not as flexible as using search engines, it has the advantage that, due to the available metadata (for instance, user names of the authors), the true author of the unknown document is likely not among the impostors. Furthermore, the documents contained in the test corpora are already cleaned and therefore do not require additional preprocessing.

In their original form, SPATIUM and ProfileAV allow the three possible prediction outputs Y (same-author), N (different-author) and U (unanswered), whereas for the remaining approaches only binary predictions (Y/N) are considered. To enable a fair comparison, we therefore decided to unify the predictions of all involved AV methods to the binary case. In this context, verification cases for which the AV methods determined similarity values greater than 0.5 were classified as Y, otherwise as N. Here, all similarity values were normalized into the range [0, 1], so that 0.5 marks the decision threshold.

COAV, OCCAV and NNCD represent compression-based AV methods based on the PPMd compression algorithm, as specified in the original papers [17, 16, 47]. However, these papers do not mention how the model-order hyperparameter of PPMd was set. We therefore decided to set this hyperparameter to 7 for all three methods, based on our observation (see footnote 16) that this value led to the best accuracy across all training corpora. Moreover, we used the same dissimilarity functions that were specified in the original papers (CDM for NNCD as well as CBC for COAV and OCCAV; see footnotes 17 and 18 for implementations). Apart from these, there are no other hyperparameters for these approaches.

The two AV methods SPATIUM and Unmasking involve different sources of randomness (e.g., impostor selection and chunk generation) and, due to this, cause non-deterministic behavior regarding their predictions. In other words, applying these methods multiple times to the same verification case can result in different predictions which, in turn, can lead to a biased evaluation (cf. [18]). To address this issue, we performed 11 runs for each method and selected the run where the accuracy score represented the median. The reason we avoided averaging multiple runs (as was the case, for example, in [39]) was to obtain more precise numbers with respect to our analysis.

Apart from SPATIUM (see footnote 19) and OCCAV, the remaining four AV methods involve model parameters and adjustable hyperparameters. The former represent parameters that are estimated directly from the data, while in contrast hyperparameters must be set manually. Of the four respective AV methods considered in our experiments, model parameters represent the weights that form the SVM hyperplanes (used by Unmasking) or the thresholds required to accept or reject the questioned authorships (used by COAV and ProfileAV). To obtain the model parameters of ProfileAV and COAV, we trained both methods on the Original, POSNoise and TextDistortion training corpora, respectively. The hyperparameters involved in our selected AV approaches represent, among others, the number k of cross-validation folds (used by Unmasking) or the order n of the character n-grams (used by ProfileAV) and have been optimized as follows.

Regarding Unmasking, an adjustment was needed to fit our experimental setting. In the original definition of this method, Koppel and Schler [26, 28] considered entire books to train and evaluate Unmasking, which differ in length from the documents used in our experiments.
Therefore, instead of using the original fixed hyperparameter settings (which would make Unmasking inapplicable in our evaluation setting), we decided to consider individual hyperparameter ranges with values that are more appropriate for shorter documents such as those in our corpora. The customized hyperparameter ranges are listed in Table 8. For ProfileAV, on the other hand, we employed the same hyperparameter ranges described in the original paper [38]. Based on the original and adjusted hyperparameter ranges of ProfileAV and Unmasking, we optimized both methods on the training partitions of Original, POSNoise and TextDistortion using grid search guided by accuracy as a performance measure. The resulting hyperparameters are listed in Table 9.

16 Regarding the model-order hyperparameter, we experimented with values in [1, 10].
17 An implementation of COAV is available under https://paperswithcode.com/paper/authorship-verification-based-on-compression.
18 An implementation of OCCAV is available under https://paperswithcode.com/paper/authorship-verification-in-the-absence-of.
19 We used the original implementation of SPATIUM available under https://github.com/pan-webis-de.

To allow a reasonable comparison between POSNoise and TextDistortion, a suitable and fixed configuration for the latter is necessary. For this, we opted for the DV-SA variant on the basis of the following considerations:

• Stamatatos [44] conducted a number of AA and AV experiments with the two variants DV-SA and DV-MA, where no differences were observed in the context of AA. However, with regard to the AV experiments, Stamatatos found that the DV-SA variant was more competitive than DV-MA.
• Both POSNoise and the TextDistortion variant DV-SA substitute topic-related text units token-wise, which allows a better comparability.
• None of the selected AV methods considers word length as a separate feature, so the one advantage that only the DV-MA variant offers (preserving word length) does not come into play.

Besides the choice between the two variants DV-SA and DV-MA, an adjustable hyperparameter k must be set for TextDistortion, which regulates which tokens (not necessarily words) are retained, while all other tokens are masked. However, finding a suitable value for k can be seen as a problematic trade-off, since a low value suppresses too many style-related tokens, while in contrast a higher value leaves many topic-related tokens unmasked. To address this trade-off, we pursued the following systematic approach. Given W_k (the list on which TextDistortion is based), we divided all the words contained in it with respect to their index in W_k into two groups, which include style-related tokens (e.g., conjunctions, determiners, prepositions, pronouns, etc.) and topic-related tokens (e.g., nouns, verbs and adjectives), respectively. Then, we visualized the distributions of the tokens in both groups. As can be seen in Figure 1, the number of style-related tokens increases at a decreasing rate, while the number of topic-related tokens increases linearly as the value of k increases. In other words, as k increases, fewer additional style-related tokens enter W_k, while at the same time a higher number of topic-related tokens are present in W_k. To determine an appropriate k that (1) suppresses topic-related words and (2) retains stylistically relevant patterns as much as possible, we chose k = 170, which represents the value at which the style-related tokens outnumber the topic-related tokens the most in terms of absolute frequency. For k ∈ {1, 2, ..., |W_k|}, the setting k = 170 satisfies conditions (1) and (2).

We now turn to the extent to which the two topic-masked representations generated by POSNoise and TextDistortion differ from each other.
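The selection criterion for k described above can be read as maximizing the margin between the cumulative number of style-related and topic-related entries in the frequency-ranked word list. The following sketch illustrates this reading; the function name `best_k` and the boolean labelling of list entries are hypothetical and not taken from the paper.

```python
from itertools import accumulate


def best_k(is_style_related):
    """Return (k, margin) maximizing (#style-related - #topic-related) among the top-k words.

    is_style_related[i] is True if the (i + 1)-th most frequent word is
    style-related (e.g., a determiner) and False if it is topic-related.
    """
    # Running margin after each rank: +1 for a style word, -1 for a topic word.
    margins = list(accumulate(1 if flag else -1 for flag in is_style_related))
    # max() returns the first index with the largest margin, i.e. the smallest such k.
    best_index = max(range(len(margins)), key=margins.__getitem__)
    return best_index + 1, margins[best_index]


# Toy ranking: style-related words dominate the top ranks and thin out later.
labels = [True, True, False, True, True, False, False, True, False, False]
print(best_k(labels))  # (5, 3)
```

With the real word list and a manual or tagger-based labelling of its entries, the same computation would yield the k at which style-related tokens outnumber topic-related tokens the most.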
For this, we list in Table 4 several example sentences taken from the documents in our test corpora, which show the differences between the outputs of both approaches. It can be clearly seen that both approaches entirely mask topic-related words. However, in contrast to TextDistortion, our approach retains a greater number of syntactic structures, including multi-word expressions ("As an example" and "Like before"), contractions ("I'd") and sentence openers ("Regarding" and "Therefore"), which represent important stylistic features. Another difference that can be seen in Table 4 is that POSNoise not only retains stylistically relevant words and phrases occurring in the documents but also generates additional features, i.e., POS tags that increase the diversity of the documents' feature space. Depending on the considered AV method, a variety of feature compositions can be derived from POSNoise representations, which include, for instance, POS tags with preceding/succeeding punctuation marks or POS tags surrounded by function words. Such feature compositions can play a decisive role in the prediction of an AV method and are therefore desirable.

Repres.    Original / topic-masked sentences

Original   As an example, let us analyze the following English sentence.
POSNoise   As an example, let us Ø the following @ #.
TextDist.  As an *, * us * the * * *.

Original   Like before, further improvements to this section are welcome.
POSNoise   Like before, further # to this # are @.
TextDist.  Like *, * * to this * are *.

Original   I'd like to see some other editors' opinions on this question.
POSNoise   I'd like to see some other #' # on this #.
TextDist.  *'* like to see some other *' * on this *.

Original   Therefore we add another operator to erase this function.
POSNoise   Therefore we Ø another # to Ø this #.
TextDist.  * we * another * to * this *.

Original   Regarding the lexicon, the model allows for clusters.
POSNoise   Regarding the #, the # Ø for #.
TextDist.  * the *, the * * for *.

Original   Look, most people have been lied to, most are...
POSNoise   Look, most # have been Ø to, most are...
TextDist.  *, most people have been * to, most are...

Table 4: Comparison between the resulting topic-masked representations generated by POSNoise and TextDistortion.

After training and optimizing all AV methods, we applied these to the respective test partitions of the Original, POSNoise and TextDistortion corpora. The overall results are shown in Table 7, while a compact summary with respect to the average/median improvements of POSNoise over TextDistortion is provided in Table 6. In what follows, we focus on the results between POSNoise and TextDistortion (using k = 170) in Table 7, which are highlighted in yellow. As can be seen from this table, POSNoise leads to better results than TextDistortion in terms of accuracy and AUC in 34 and 32 of 42 cases (81% and 76%, respectively). In total, there are 4 ties in terms of accuracy, where in two cases POSNoise leads to better AUC results, which we selected as a secondary performance measure.

Besides the setting k = 170, we further evaluated the six AV methods on additional TextDistortion-preprocessed corpora using the settings k = 100, 300, 500 and 1000. A closer look at the results in Table 7 shows that TextDistortion leads to increasingly larger accuracy improvements the higher the setting of k is. What is not reflected in these results, however, is at what price the "improvements" occur. To gain a better understanding of how higher settings for k affect the performance of the methods, we first show how many topic-related text units remain in the documents preprocessed by TextDistortion. First, we concatenated all documents in each TextDistortion corpus into a single text D. Next, we tokenized D and removed from the resulting list all words that appear in our pattern list L. The remaining words and patterns can be inspected in detail in the word clouds illustrated in Table 5.
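The residual-word analysis described above can be approximated with a few lines of Python. The following is a minimal sketch under stated assumptions, not the implementation used in the evaluation: the function name `residual_tokens`, the regular-expression tokenizer, the toy documents and the toy pattern list are all hypothetical, and multi-word patterns in L are ignored for brevity.

```python
import re
from collections import Counter

def residual_tokens(masked_docs, pattern_list):
    """Count word tokens that survive masking and are not in the pattern list.

    masked_docs  -- iterable of already topic-masked documents (strings)
    pattern_list -- single-word entries standing in for the list L (assumption)
    """
    # Concatenate all documents into a single text D.
    text = " ".join(masked_docs).lower()
    # Crude tokenizer (assumption): alphabetic words with an optional
    # apostrophe part; mask symbols such as "*" are dropped implicitly.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text)
    keep = {p.lower() for p in pattern_list}
    # Remove every token that appears in the pattern list and count the rest.
    return Counter(t for t in tokens if t not in keep)

# Toy input (hypothetical): two masked documents and a tiny pattern list.
docs = ["* the system * data to this *.",
        "most people have been * to the data."]
toy_patterns = ["the", "to", "this", "most", "have", "been"]

residual = residual_tokens(docs, toy_patterns)
print(residual.most_common())
```

The resulting counts play the role of the input to a word cloud: the more frequent a residual word is, the more prominently it would be displayed.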
As can be seen from these word clouds, the larger k becomes, the more topic-related words (e.g., "system", "data", "information", etc.) are retained in the document representations masked by TextDistortion, whereas these words are not present in the POSNoise representations. The presence of these topic-related text units provides a first indication that the improvement in verification results can be attributed to them.

To investigate this further, we proceed on the basis of the following assumption: If higher values of k retain more topic-related content, this should be reflected in correspondingly high topic classification results. Low values of k, on the other hand, should lead to low topic classification results. As a first step, we selected three well-known benchmark corpora for topic classification, namely AG's News Topic Classification Dataset [15] (denoted by C_AG), BBC News Dataset [14] (denoted by C_BBC) and Yahoo! Answers Topic Classification Dataset [48] (denoted by C_Yahoo), where all corpora were left in their original form (i.e., no subsampling or subsets were considered) in order to allow reproducibility of our results. Next, we trained a standard logistic regression classifier (footnote 20) on the basis of tokens as features, using the Original, POSNoise and TextDistortion representations of C_AG, C_Yahoo and C_BBC, where for TextDistortion we selected k ∈ {100, 170, 300, 500, 1000}. The results of the topic classifier (based on 5-fold cross-validation) for each C ∈ {C_AG, C_Yahoo, C_BBC} along with its Original, POSNoise and TextDistortion representations are visualized against the median of the AV results of the six methods in the scatter plots shown in Figure 2. As can be seen from the three scatter plots, the TextDistortion representations lead to higher AV classification results, but at the same time also to higher topic classification results as k increases. Conversely, lower values of k lead to both lower topic and AV results.
In contrast, POSNoise approaches more closely the best possible compromise between topic and style (across C_AG, C_Yahoo and C_BBC), which is located at the very bottom right. When comparing the results of topic classification with respect to the Original and the POSNoise corpora, one can further observe that the degree of topic-related text units is significantly higher for the former (up to ≈35% higher). This shows that POSNoise leads to a substantial reduction of topic-related text units without strongly affecting the style-related classification performance. It should be noted that we consider the absolute accuracy values as not significant for this experiment, since in all three topic corpora the function words alone already represent a strong classification signal. This is reflected by the fact that restricting the topic classifier exclusively to function words still yields high accuracy scores of 56% (C_AG), 35% (C_Yahoo) and 70% (C_BBC), which are comparable to those obtained with POSNoise. Separately, one can see from the three scatter plots that, regarding TextDistortion, the setting k = 170 in fact represents a good compromise between topic and style. While the median AV classification results are almost identical (≈1% variance in terms of accuracy), the topic classification results vary from 3 to 6%. Overall, this observation justifies our decision to choose this setting for TextDistortion in order to compare both approaches.

Table 7: Comparison between the six AV methods applied to all Original, POSNoise and TextDistortion test corpora. Bold values indicate the best accuracy (Acc.) results with respect to the POSNoise and TextDistortion (k = 170) corpora. In case of ties, AUC serves as a secondary ranking option represented by underlined values.

We discussed a serious problem in authorship verification, which affects many AV methods that have no control over the features they capture.
As a result, the classification predictions of the respective AV methods may be biased by the topic of the investigated documents. To address this problem, we proposed a simple but effective approach called POSNoise, which aims to mask topic-related text units in documents. In this way, only those features in the documents that relate to the authors' writing style are retained, so that the actual goal of an AV method can be achieved. In contrast to the alternative topic masking technique TextDistortion, our approach follows a two-step strategy to mask topic-related content in a given document D. The idea behind POSNoise is first to substitute topic-related words with POS tags predefined in a set S. In a second step, a predefined list L is used to retain certain categories of stylistically relevant words and phrases that occur in D. In addition to this list, we also retain all remaining words in D whose corresponding POS tags are not contained in S. The result of this procedure is a topic-agnostic document representation that allows AV methods to better quantify stylistically relevant features. Besides a POS tagger and a predefined list L, no further linguistic resources are required. In particular, there is no hyperparameter that requires careful adjustment, as is the case with the TextDistortion approach.

To assess our approach, we performed a comprehensive evaluation with six existing AV approaches applied to seven test corpora with related and mixed topics. Our results have shown that AV methods based on POSNoise lead to better results than those based on TextDistortion in 34 out of 42 cases, with an accuracy improvement of up to 10%. Also, we have shown that regardless of which k is chosen for TextDistortion, our approach consistently leads to a better trade-off between style and topic. However, besides the benefits of our approach, there are also several issues that require further consideration.
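The two-step masking procedure summarized above can be illustrated with a small Python sketch. It is a simplified illustration under stated assumptions rather than the reference implementation: the function name `posnoise_sketch` is hypothetical, the input is assumed to be already POS-tagged (so no tagger is invoked), the set S is reduced to three tags whose placeholder symbols are inferred from the examples in Table 4, the list L is reduced to single words, and both steps are folded into a single pass for brevity.

```python
# Placeholder symbols inferred from the examples in Table 4 (assumption):
# nouns -> "#", verbs -> "Ø", adjectives -> "@". This stands in for the set S.
POS_SYMBOLS = {"NOUN": "#", "VERB": "Ø", "ADJ": "@"}

def posnoise_sketch(tagged_tokens, pattern_list):
    """Mask topic-related tokens in a sequence of (word, pos_tag) pairs.

    A token is kept if it occurs in the pattern list (stand-in for L) or if
    its POS tag is not in POS_SYMBOLS (stand-in for S); otherwise it is
    replaced by the placeholder symbol of its tag.
    """
    keep = {w.lower() for w in pattern_list}
    masked = []
    for word, tag in tagged_tokens:
        if word.lower() in keep or tag not in POS_SYMBOLS:
            masked.append(word)              # stylistically relevant: retain
        else:
            masked.append(POS_SYMBOLS[tag])  # topic-related: substitute
    return masked

# Hand-tagged example sentence from Table 4 (the tags are assumptions).
tagged = [("Therefore", "ADV"), ("we", "PRON"), ("add", "VERB"),
          ("another", "DET"), ("operator", "NOUN"), ("to", "PART"),
          ("erase", "VERB"), ("this", "DET"), ("function", "NOUN"),
          (".", "PUNCT")]

masked = posnoise_sketch(tagged, pattern_list=[])
print(" ".join(masked))
```

Passing a non-empty pattern list shows the retention step: with `pattern_list=["let"]`, the verb "let" in `[("let", "VERB"), ("us", "PRON"), ("analyze", "VERB")]` is kept while "analyze" is replaced by its placeholder.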
When considering languages other than English, POSNoise must be adjusted accordingly. First, a predefined list of topic-agnostic words and phrases must be compiled for the target language. Second, our approach relies on a POS tagger, so the availability of a trained model for the desired language must be ensured. Third, due to the imperfection of the tagger, incorrect POS tags may appear in the topic-masked representations. Although we have hardly noticed this issue with respect to documents written in English, it is likely to happen with documents written in other languages. Fourth, due to the underlying tagging process, POSNoise is slower than the existing approach TextDistortion, where the runtime depends on several factors such as the text length or the complexity of the trained model.

Besides these issues, POSNoise leaves further room for improvement. One idea for future work is to investigate automated possibilities to extend the compiled pattern list. This can be achieved, for example, by using alternative linguistic resources such as lexical databases that are also available in multiple languages (e.g., WordNet and GermaNet). Another direction for future work is to investigate in which further verification scenarios POSNoise is applicable. One idea, for example, is to perform experiments under cross-domain AV conditions, which often occur in real forensic cases (for example, how a model trained on forum posts or cooking recipes performs on suicide letters, for which no training data is available yet). Beyond the boundaries of AV, we also aim to investigate the suitability and effectiveness of our approach in related disciplines of authorship analysis, such as author clustering and author diarization.

Table 9: Hyperparameters of ProfileAV and Unmasking.
The hyperparameters of ProfileAV have the following notation: L_u = profile size of the unknown document, L_k = profile size of the known document, n = order of character n-grams and d = dissimilarity function. The definitions of the three dissimilarity functions d_0, d_1 and SPI are described in detail in [38, Section 3.1]. The five hyperparameters U_1 to U_5 of Unmasking are described in Table 8.

[1] Authorship Verification of Yorùbá Blog Posts using Character N-grams
[2] Author Identification Using Multi-headed Recurrent Neural Networks
[3] Authorship Verification Applied to Detection of Compromised Accounts on Online Social Networks
[4] Bias Analysis and Mitigation in the Evaluation of Authorship Verification
[5] Generalizing Unmasking for Short Texts
[6] Similarity Learning for Authorship Verification in Social Media
[7] Adversarial Stylometry: Circumventing Authorship Recognition to Preserve Privacy and Anonymity
[8] Authorship Verification for Short Messages Using Stylometry
[9] Authorship Verification using Deep Belief Network Systems
[10] Authorship Verification, Average Similarity Analysis
[11] Authorship Attribution with Support Vector Machines
[12] Learning Stylometric Representations for Authorship Analysis
[13] An Open Stylometric System Based on Multilevel Text Analysis
[14] Practical Solutions to the Problem of Diagonal Dominance in Kernel Document Clustering
[15] The Anatomy of a News Search Engine
[16] Authorship Verification in the Absence of Explicit Features and Thresholds
[17] On the Usefulness of Compression Models for Authorship Verification
[18] Assessing the Applicability of Authorship Verification Methods
[19] Style in Authors with Alzheimer's Disease
[20] Authorship attribution with convolutional neural networks and POS-eliding
[21] The Evolution of Stylometry in Humanities Scholarship
[22] Experiments with neural networks for small and large scale authorship verification
[23] GLAD: Groningen Lightweight Authorship Detection
[24] Authorship verification on short text samples using stylometric embeddings
[25] A Simple and Efficient Algorithm for Authorship Verification
[26] Authorship Verification as a One-Class Classification Problem
[27] The "fundamental problem" of authorship attribution
[28] Measuring Differentiability: Unmasking Pseudonymous Authors
[29] Automatically Identifying Pseudepigraphic Texts
[30] Determining if Two Documents are Written by the Same Author
[31] Deep dive into authorship verification of email messages with convolutional neural network
[32] Using Part-of-Speech Sequences Frequencies in a Text to Predict Author Personality: a Corpus Study
[33] Exploiting Linguistic Style as a Cognitive Biometric for Continuous Verification
[34] Authorship attribution by consensus among multiple features
[35] Using conjunctions and adverbs for author verification
[36] Authorship Verification of Opinion Pieces in Estonian
[37] A universal part-of-speech tagset
[38] A Profile-Based Method for Authorship Verification
[39] An Improved Impostors Method for Authorship Verification
[40] Improved Algorithms for Extrinsic Author Verification. Knowledge and Information Systems
[41] Introduction to Stylistic Models and Applications
[42] Sockpuppet Detection in Wikipedia: A Corpus of Real-World Deceptive Writing for Linking Identities
[43] twazn me!!! ;(' automatic authorship analysis of micro-blogging messages
[44] Authorship Attribution Using Text Distortion
[45] Authorship Verification
[46] What Represents "Style" in Authorship Attribution?
[47] Authorship Verification with Compression Features
[48] Character-Level Convolutional Networks for Text Classification

This research work has been funded by the German Federal Ministry of Education and Research and the Hessian Ministry of Higher Education, Research, Science and the Arts within their joint support of the National Research Center for Applied Cybersecurity ATHENE.