Identifying Translationese at the Word and Sub-word Level

Ehud Alexander Avner*, Noam Ordan†, Shuly Wintner‡

* Department für Linguistik, Universität Potsdam, Germany
† Angewandte Sprachwissenschaft sowie Übersetzen und Dolmetschen, Universität des Saarlandes, Germany
‡ Department of Computer Science, University of Haifa, Israel

Abstract

We use text classification to distinguish automatically between original and translated texts in Hebrew, a morphologically complex language. To this end, we design several linguistically informed feature sets that capture word-level and sub-word-level (in particular, morphological) properties of Hebrew. Such features are abstract enough to allow for the development of accurate, robust classifiers, and they also lend themselves to linguistic interpretation. Careful evaluation shows that some of the classifiers we define are, indeed, highly accurate, and scale up nicely to domains that they were not trained on. In addition, analysis of the best features provides insight into the morphological properties of translated texts.

1 Introduction

Much research in Translation Studies suggests that the language of translated texts, often called translationese, exhibits different linguistic properties from the language of original, non-translated texts. The differences are so marked that automatic (machine learning based) classification techniques can distinguish between original and translated texts with high accuracy, and indeed, several translationese classifiers have been defined for a few European languages. In this work, we employ text classification for the investigation of translationese in a morphologically complex language, namely Modern Hebrew. This work is, to the best of our knowledge, the first to address automatic identification of translationese in a Semitic language; we are also the first to train our classifiers on a corpus of twentieth-century literary texts.

Another novelty of the present work is that we focus on morphological (and, more generally, sub-word) features. An advantage of morphological features is that they lend themselves to interpretation, i.e., to qualitative analysis, as they can potentially capture structural and stylistic differences between translated and original texts. Such differences are realized in more analytic languages (like English) on the token level.

We thus set out to design several feature sets that capture word-level and sub-word-level phenomena – specifically morphological properties – of Hebrew translationese, and thus focus on the linguistic information encoded in tokens and sub-tokens. As will be shown, using the output of a morphological analyzer does not suffice; more sophisticated feature engineering is called for. We present a novel approach to approximating Hebrew word structure by means of alphabet abstraction. This approach, when enhanced with morphosyntactic information (that is, part-of-speech tags), turns out to be one of the most accurate and scalable among the classifiers we define.

The main contribution of this work is the construction of accurate classifiers that identify Hebrew translationese and can scale up to domains they were not trained on. This is important not only theoretically; numerous studies have shown that Statistical Machine Translation (SMT) systems can benefit a great deal when knowledge of the direction of translation is incorporated into the language and translation models (Kurokawa, Goutte, & Isabelle, 2009; Lembersky, Ordan, & Wintner, 2011, 2012a, 2012b, 2013).
Robust detection of translationese is thus highly relevant for SMT.

This is the first work to address the automatic identification of translationese in Hebrew (or any other Semitic language), and the first to focus on the morphological manifestation of translated texts' properties. We thus also contribute to a better understanding of the translation product. In addition, we show that literary corpora are suitable for the development of scalable identification systems, and introduce a novel approach to approximating Hebrew word structure that might be applicable to other Semitic languages.

In the next section we survey existing work on the automatic identification of translationese. In Section 3 we introduce some relevant characteristics of Hebrew orthography and morphology. The experimental setup is described in detail in Section 4. The features we define, and the rationale for using them, are discussed in Section 5. We then list the results of several computational experiments in Section 6 and analyze them in Section 7. We conclude with directions for future research.

2 Related Work

The term translationese was coined by Gellerstam (1986), who compared texts originally written in Swedish with texts translated from English into Swedish, and concluded that the striking differences between them do not indicate poor translation but rather a statistical phenomenon, a systematic influence of the source language on the target language. More recent works have suggested that all translations, regardless of source and target language, share certain features characteristic of translated texts (Baker, 1993; Toury, 1995).

Baroni and Bernardini (2006) were the first to employ text classification to investigate and identify translationese. Their comparable corpus is a collection of articles from an Italian geopolitics journal; each article is treated as a data point, i.e., as a training instance. The source languages from which articles are translated are assumed to be mainly English, Arabic, French, Spanish, Russian, and other languages. Prior to the classification, the corpus is tagged and lemmatized, and proper names are replaced with a dynamic ID-marker. The learning method they employ is Support Vector Machines (SVMs). They experiment with numerous feature sets: frequencies of unigrams, bigrams, and trigrams of words, lemmas, part-of-speech (POS) tags, and a mixed mode in which function words are left untouched in their surface form, while content words are substituted by their corresponding POS tag. They also experiment with combinations of the single SVM classifiers trained on the aforementioned feature sets. Experiments are run using sixteen-fold cross-validation. Single-feature classifiers yield accuracy of at most 77.1% (word unigrams and mixed-mode bigrams). Trigram models obtain 62.5%-71.5%. The worst classification model is POS unigrams, with an accuracy of 49.6%. The best classifier (86.7%) is a combination of five models: lemma unigrams and bigrams, unigrams and bigrams of the mixed representation, and POS trigrams. Good features include function words and morphosyntactic categories in general, and personal pronouns and adverbs in particular. The accuracy of some classifiers is said to outperform human judgment.

Kurokawa, Goutte, and Isabelle (2009) identify translationese in English and Canadian French, and show what impact their findings have on machine translation systems.
The corpus they use is a large portion of the Canadian Hansard, transcripts of the Canadian parliament proceedings. Following Baroni and Bernardini (2006), they produce four different representations of the corpus: surface forms, lemmas, POS tags, and a mixed representation. They, too, train SVM classifiers using n-gram frequencies: 1-to-5-grams of POS tags and of the mixed representation, and 1-to-3-grams of surface forms and lemmas. Classification is performed on blocks of text of varying lengths and on sentences. Using ten-fold cross-validation, the best classification results are just below 90% accuracy (for blocks) and 77% (for sentences). Both these results are achieved by SVMs trained on word bigram frequencies. Classifiers focusing on linguistic patterns, i.e., POS tags and the mixed representation, yield around 85%. The authors find that "[g]lobally, the relationship between the feature representations is clear: word > lemma > mixed > POS," and that "there seems to be an optimal n-gram length: bigram[s] for words and lemmas, trigrams for POS and mixed" (p. 84). Finally, Kurokawa et al. show that the direction of translation has an impact on SMT: translation systems whose direction matches the direction of the training data perform better than systems going in the opposite direction.

Ilisei, Inkpen, Pastor, and Mitkov (2010) train and test their system on a translated and non-translated Spanish technical and medical dataset. They set out to go beyond the practical purpose of developing a classifier for translationese and "explore the characteristic [universal] features which most influence the translated language" (p. 504). Specifically, they are interested in the contribution of features designed to capture the simplification hypothesis (Blum-Kulka & Levenston, 1983; cf. also Baker, 1993) to the identification of translationese. According to this hypothesis, outputs of translators are less complex in terms of grammar, vocabulary, etc., than the source texts they render. Ilisei et al. propose various such 'simplification features': average sentence length, sentence depth (i.e., parse-tree depth), and lexical richness (type-token ratio), among other features. They compare several classification algorithms; the classifiers are trained on POS frequencies, including and excluding the simplification features. They find that removing the simplification features leads to decreased accuracy, and lexical richness is found to be the most informative feature. The best accuracy, 97.62%, is obtained by an SVM classifier. Ilisei and Inkpen (2011) apply similar methods to Romanian newspaper articles and obtain similar results.

Popescu (2011) studies English translationese at the character level. His corpus consists of 214 book-length literary works, most of them from the nineteenth century. The subcorpus of original English contains 108 works written by British and American authors; the translation subcorpus contains 76 works translated from French and 30 works translated from German. In the present work, we, too, train our classifiers on a literary corpus. Unlike Popescu, we strictly use twentieth-century literature. The features Popescu extracts are simply character 5-grams, irrespective of word and sentence boundaries. Classification is performed on the book level (i.e., the training and testing instances are complete books), using SVMs and ten-fold cross-validation, and achieves virtually 100% accuracy.
However, when the SVM is trained on British English original texts and on translations from French, but tested on American English and on translations from German, the accuracy drops to 45.83%, implying that the classifiers are overfitting. Popescu repeats the previous experiment, this time eliminating from the feature space all 5-grams that the French original texts and their translated counterparts share. The accuracy obtained this time is 77.08%. The advantages of the character n-gram approach are obvious: it is language-independent, does not presuppose any language processing tools, and seems to promise relatively high classification accuracy. We hypothesize that character n-grams capture morphological features, and that such features could also be captured with n-grams shorter than 5.

Koppel and Ordan (2011) identify translationese, but also detect the source language of translated texts. They work on English translated from several languages (Finnish, French, German, Italian, and Spanish), using the EUROPARL corpus, which records the proceedings of the European Parliament (Koehn, 2005). The feature set in all their experiments is a list of 300 function words, and the learning method is Bayesian logistic regression. Training is done on 2,000 chunks of 2,000 words each, half of which are original English, the other half translated English (where each source language constitutes exactly one fifth of the translated data). Using ten-fold cross-validation, they identify translationese with 96.7% accuracy. The source language is correctly classified in 92.7% of the chunks. In addition, they train a classifier on EUROPARL and test it on a different corpus containing newspaper articles in original English and in English translated from Greek, Hebrew, and Korean. This classifier obtains 64.8% accuracy. In the opposite setting, i.e., when training on newspaper articles and testing on EUROPARL, the result is worse – namely 58.8%. We adopt their setup, working with 2,000-word chunks. We, too, test our classifiers on datasets very different from the ones they are trained on. Unlike Koppel and Ordan, one of our goals is to find meaningful feature sets that are able to scale up to out-of-domain corpora.

Most recently, Volansky, Ordan, and Wintner (forthcoming) distinguish between original English and English translated from ten source languages (Danish, Dutch, Finnish, French, German, Greek, Italian, Portuguese, Spanish, and Swedish). They, too, use parts of EUROPARL as their corpus: 4 million tokens of original English and 400,000 tokens from each of the source languages are partitioned into chunks of 2,000 tokens, which are then used as training instances. The classification algorithm is SVM; testing is done using ten-fold cross-validation. Similar to Ilisei et al. (2010), Volansky et al.'s objective goes beyond the development of a working identification model. They set out to test several hypotheses – e.g., the simplification hypothesis mentioned above – that have been proposed by translation scholars as translation universals. To this end, they define several feature sets that reflect these universals. Following Popescu (2011), they train classifiers using frequencies of character n-grams. In addition, they employ a precompiled list of prefixes and suffixes as a feature set approximating English morphological structure. The former classifier achieves 100% accuracy, the latter 80%.

Like Volansky et al. and Ilisei et al.
(2010), our objective is not only to design accurate identification models, but also to explore computationally the properties of translated texts that distinguish them from original ones. However, in contrast to Volansky et al. and Ilisei et al., we work on a morphologically rich language with nonconcatenative morphology, and focus exactly on the word-level and sub-word-level features that cannot be investigated in more analytic languages such as English. Like Koppel and Ordan (2011), and unlike all other works mentioned above, we test our classifiers on datasets from domains different from the one they are trained on.

3 Hebrew Orthography and Morphology

The Hebrew alphabet is a 22-letter abjad (Daniels, 1997) for which two main standards exist: the full script, in which vocalization diacritics decorate words, thereby explicating all vowels, and the lacking script, in which these diacritics are missing, and the two letters w and i are occasionally added to represent some, but not all, of the vowels which would otherwise be represented by diacritics.[1] The overwhelming majority of Modern Hebrew texts – and all the texts our classifiers are trained and tested on – are written in the lacking variant. In this script, most of the five Hebrew vowels are left underspecified: /e/ and /a/ are usually not explicated (when they are, they are typically realized by the characters a and h); /o/ and /u/, when specified, are realized by the same character, w, which is also used for the consonant /v/. Similarly, the single character i is used both for the vowel /i/ (when it is specified) and for the consonant /y/. The four characters which represent (some of) the vowels – a, h, w, i – are traditionally known as matres lectionis; they also play a significant role in Hebrew derivational morphology.

[1] For the sake of readability, a straightforward ASCII transliteration of Hebrew is used in this study. The characters, in Hebrew alphabetical order, are abgdhwzxTiklmnsypcqr$t.

Many particles are realized as prefixes attached to the words immediately following them. These include the definite article h, the coordinating conjunction w ("and"), four of the most frequent prepositions – b ("in"), k ("as"), l ("to"), and m ("from") – and subordinating conjunctions, such as k$ ("when") and $ ("that/which"). When one of the prepositions b, k, or l precedes the definite article h, the latter is assimilated with the prefixing preposition and the resulting surface form becomes ambiguous with respect to definiteness.

Hebrew has a rich, partly nonconcatenative morphology. Derivational processes are based on a root-and-pattern system; inflectional processes are mainly carried out by suffixation, but also involve prefixes, circumfixes, and pattern shifts.

An example of the root-and-pattern mechanism is the seven Hebrew binyanim, i.e., the verbal patterns. Each pattern (binyan) is traditionally associated with a certain (vague, and thus not always predictable) meaning. For example, the Hif'il pattern productively generates causative variants of verbs; similarly (and, to a lesser extent), Hitpa'el is used for reflexives; three patterns systematically express the passive voice of three counterpart patterns, namely, Nif'al (vs. Pa'al), Pu'al (vs. Pi'el), and Huf'al (vs. Hif'il). Consider the three-letter root k.t.b, broadly denoting the notion of writing. When combined with the Pa'al pattern CCC,[2] this root yields the form ktb ("write"); when combined with the Nif'al pattern, nCCC, traditionally the passive counterpart of Pa'al, it yields nktb ("being written"); when combined with the Hif'il pattern, hCCiC, traditionally denoting causativization, it yields hktib ("dictate"). These morphological patterns are mechanisms for expressing constructions that require syntactic or lexical solutions in other languages. Similarly, a root can be combined with nominal patterns; for instance, hktbh ("dictation") is the result of combining the k.t.b root with the nominal pattern hCCCh, which typically produces nominalized forms of the verbal pattern Hif'il.

[2] The C's in the pattern represent the slots for the three consonants of the root.
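To make the interdigitation mechanism concrete, here is a minimal Python sketch (ours, not part of the study) that fills the C slots of a pattern with the consonants of a root; the assertions use the paper's own transliterated examples:

```python
def apply_pattern(root, pattern):
    """Fill the 'C' slots of a (transliterated) pattern template
    with the consonants of a triliteral root, in order."""
    consonants = iter(root)
    return "".join(next(consonants) if ch == "C" else ch for ch in pattern)

assert apply_pattern("ktb", "CCC") == "ktb"      # Pa'al: "write"
assert apply_pattern("ktb", "nCCC") == "nktb"    # Nif'al: "being written"
assert apply_pattern("ktb", "hCCiC") == "hktib"  # Hif'il: "dictate"
assert apply_pattern("ktb", "hCCCh") == "hktbh"  # nominal pattern: "dictation"
```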
Verbs inflect for number, gender (masculine and feminine), person, and tense. kwtbt, for instance, is the present tense feminine singular realization (underspecified for person) of the Pa'al verbal pattern. The Hebrew tense system is relatively simple, with three tenses and no aspectual distinctions. Note that the present tense is actually a participle form that can also be used as an adjective or a noun (akin to -ing forms in English). Nouns inflect for number, adjectives for number and gender, numerals for gender.

Nouns, adjectives, participles, numerals, and quantifiers have two morphologically (and often phonologically) distinct forms: the unmarked absolute state and the construct state. The latter is used, in the case of nouns, adjectives, and participles, for the construction of compounds. It is also involved, in the case of nouns, in possessor-possessed constructions (i.e., noun compounds are, in point of fact, lexicalized possessive constructions). For example, $mlh ("dress," absolute state) vs. $mlt ("dress," construct state), as in $mlt klh ("dress," construct + "bride," absolute → "wedding dress," but also "a dress of a bride"). The fact that, in the lacking script, approximately half of the construct forms are orthographically identical to the absolute forms adds substantially to the ambiguity of Hebrew word forms.

There are several ways to express possessiveness in Hebrew, one of which is by attaching pronominal suffixes that inflect for number, gender, and person. The base form for these constructions is the construct state. For example, the first person singular possessive suffix i can be attached to $mlt ("dress", construct) to yield $mlti ("my dress"). The other possessive constructions go beyond the word level and involve the preposition $l ("of").

The morphological complexity, the deficient orthography, and the affixation of frequent particles bring about a system that produces highly ambiguous texts: "First, the first and last few characters of each token may be either part of the stem or bound morphemes (prefixes or suffixes). Second, the lack of explicitly marked vowels yields many homographs" (Fabri, Gasser, Habash, Kiraz, & Wintner, 2014). Hence, word segmentation is not straightforward, POS tagging is "a much messier task [...] than in other languages, such as English" (Koppel, Mughaz, & Akiva, 2006), and automatic morphological analysis is an immensely complex enterprise.

4 Experimental Setup

This is a corpus-based study; we use several corpora, which we automatically pre-process, to train and test machine learning based classifiers. We now explain the experimental setup and our methodology in more detail.
4.1 Corpus Design

The main corpus used in this study is a subset of a corpus compiled by Jason Perry[3] with the purpose of comparing translated and non-translated Hebrew texts. It is a monolingual comparable corpus that consists of first chapters of books published in Hebrew in the last decade (usually only the first chapter from each book). Perry downloaded the texts from a public Internet site aimed at exposing readers to newly published books.[4]

[3] http://hebrewcorpus.nmelrc.org/, accessed 25 June 2013.
[4] http://text.org.il/.

The corpus is annotated with the following metadata information: author name, book title, and source language. We add the following fields: translator name, genre (prose, play, verse, children's literature, etc.), and the author's year of birth. To allow for better comparability, we restrict our training data to texts written by authors born after 1900, to English as the source language (of the translated texts), and to prose as the genre.

The subcorpus of original Hebrew literature (henceforth O_heb) contains 176 book chapters by 156 authors. The translation subcorpus (T_en) contains 128 book chapters by 123 authors, translated from English by 81 translators. Each subcorpus contains exactly 600,000 tokens.[5] O_heb and T_en are the corpora we train our classifiers with; henceforth, we occasionally refer to them as the training data; together, they constitute the InC [in-corpus] experimental scenario introduced in Section 6.

[5] In this study, punctuation marks are counted as tokens.

Recall that most work on the identification of translationese has been carried out on corpora containing data from restricted domains, e.g., geopolitics (Baroni & Bernardini, 2006), parliament proceedings (Kurokawa et al., 2009; Koppel & Ordan, 2011; Volansky et al., forthcoming), or technical and medical data (Ilisei et al., 2010). We use a corpus containing twentieth-century literary texts in this study, first and foremost, because we are not aware of any other large-scale comparable Hebrew corpus containing texts from domains such as the above. In fact, we are not aware of any other large-scale comparable Hebrew corpus of any domain. There are, moreover, several benefits and interesting aspects to using a literary corpus:

1. The quality of translation is arguably very high. Not only can literary translators be assumed to be very competent translators, the common practice is that literary translations pass through an editorial cycle (copyediting, proofreading) before actually being published.

2. The multitude of authors and translators in our training data ensures that the classifiers do not learn to identify a specific author or translator but rather the phenomenon of Hebrew translationese.

3. A corpus of contemporary literature could be easily expanded for future research: in the age of the Internet, the majority of publishers make excerpts of newly published books available online.

4. Metadata, such as the source language of the text, the birth date of the author, or the name of the translator, can normally be extracted with relative ease.

5. Identifying translationese by training on a corpus containing twentieth-century literature affords us an opportunity to explore a domain on which very little work has been done (one exception is Popescu (2011), whose corpus, however, consists of nineteenth-century literature).
In fact, classifying literary translations is probably a harder task than classifying other genres, both because of the diversity of the texts and because much effort is invested in the translation of literary works, and more freedom is given to the translator to render the text as similar as possible to original writing. This is in contrast to more "technical" translations, which are often done under strict deadlines, resulting in more source-influenced, less fluent translations. Indeed, addressing a different but related task, namely, source language detection, Lynch and Vogel (2012), who train and test their models on nineteenth-century literary texts, state that they believe that literary translations "will pose a greater challenge [...] than the EUROPARL corpus, which is more homogeneous in style" (p. 778).

6. Finally, classifiers trained on a literary corpus might be able to scale up to scenarios which they are not trained on. Baroni and Bernardini (2006), referring to their corpus, state that "[a] drawback of having a very uniform, very comparable corpus is that the results of our experiment may be true only for the specific genre and domain under analysis" (p. 264). We conjecture that a corpus of contemporary literature, being more heterogeneous than other closed-domain corpora, is suitable for the development of robust identification models.

We use additional datasets in order to check to what extent our classifiers scale up to scenarios they are not trained on. First, we test whether our models can predict translationese within the same domain (literature), but on texts translated from a different source language (French). Second, we test how well the classifiers predict translationese in a different domain, but on texts translated from the same source language as the training data (i.e., English). This last task is notoriously difficult (Argamon, 2011). None of the works discussed in Section 2, with the exception of Koppel and Ordan's (2011), test their systems on a domain different from the one their systems are trained on.[6]

[6] Popescu (2011) tests his model on texts translated from a different source language but in the same domain.

In-Domain Corpus  We construct a small corpus containing book chapters translated from French, rather than from English, extracted from the preliminary full corpus compiled by Perry. We refer to this corpus as the in-domain dataset (InD_fr). It includes 18 book chapters by 17 authors, translated by 13 translators, totaling 60,000 tokens.

Out-of-Domain Corpora  We use two additional small datasets: one containing journal and newspaper articles dealing with social science topics, often in a popular science style, the other containing journal and newspaper articles from the economics domain. We refer to these corpora as the out-of-domain datasets (OoD-soc[ial], OoD-eco[nomics]). OoD-soc consists of 33 translated articles and 26 original articles, and OoD-eco of 32 translated and 39 original. The number of authors and translators in these datasets is unknown; however, since the texts come from several different newspapers and journals, it is safe to assume that no one author (or translator) is overrepresented. Each of the OoD datasets contains 26,000 tokens of texts translated from English and 26,000 tokens of texts originally written in Hebrew.
4.2 Morphological Analysis and Chunking

After applying a minimal cleaning script to the data, the corpora are first tokenized and then morphologically analyzed using the MILA tools (Yona & Wintner, 2008; Itai & Wintner, 2008). The morphological analyzer is a rule-based computational implementation of the inflectional morphology of Modern Hebrew, based on a lexicon of almost 30,000 lemmas. The morphological processor produces, for each token, its POS category.[7] Then, according to the POS, several other properties are specified. For verbs, e.g., these properties include binyan (verbal pattern, cf. Section 3), gender, number, person, and tense. In addition, the morphological analyzer segments tokens by specifying the sequence of affix particles attached to them, as well as the form and function of these affixes.

[7] The tagset includes 25 tags: adjective, adverb, conjunction, copula, existential, foreign, interjection, interrogative, modal, MWE (multi word expression), negation, noun, numberExpression, numeral, participle, preposition, pronoun, properName, punctuation, quantifier, title, unknown, url, verb, and wordPrefix.

As an example, Figure 1 depicts the output of the morphological processor on the word forms wk$htxlti וכשהתחלתי ("and when I began") and sprihm ספריהם ("their books"). Observe that in the first example, two prefixes are identified (w ו "and", followed by k$ כש "when"), followed by the lemma htxil "begin". Then, the main POS is listed (verb), followed by a sequence of morphological features. The second example also shows a suffix, ihm יהם, denoting a possessive pronoun in third person, masculine, plural ("their"). We come back to these examples in Section 5.2 below.

[Figure 1: Output of the morphological analyzer on the tokens wk$htxlti וכשהתחלתי ("and when I began") and sprihm ספריהם ("their books").]

The output of the analyzer is disambiguated using the tagger of Bar-Haim, Sima'an, and Winter (2008): this is a stochastic tagger, trained on newspaper articles, and it ranks the analyses produced by the analyzer by assigning a score to each analysis (typically, '1.0' for the correct analysis, '0.0' for the incorrect ones). Unfortunately, the tagger is not always able to produce a unique top-ranked candidate; in cases where the tagger returns more than one optimal candidate, we simply pick the optimal result appearing first in the output. The reported accuracy of the POS tagger is 88.5%, but this evaluation is based on cross-validation experiments. As is well known, out-of-domain evaluation of similar tasks usually reveals poorer performance. This is indeed our observation: on our corpus, the accuracy of tagging seems to be lower, although we do not have precise data. In particular, the tagger often fails to distinguish between verbal analyses that differ in the binyan only. We also do not have data on the accuracy of the tagger on identifying any specific feature, but a different Hebrew tagger (Lembersky, Shacham, & Wintner, 2014), reporting similar overall accuracy on the same test set, reports over 92% accuracy on main POS, around 95% for number, gender, and person, over 98% on tense, etc.

Once a corpus is tokenized, analyzed, and tagged, it is partitioned into chunks, each containing 2,000 tokens;[8] there is no correlation between the number of chunks we extract from a corpus and the number of texts (i.e., chapters or articles) this corpus contains. That is to say, each chunk contains exactly 2,000 tokens, regardless of chapter/article and sentence boundaries. Since the main objective of this work is to observe word-level and sub-word-level phenomena in general, and to learn from morphological features packaged in single words in particular, we do not alter the size of instances; we treat each corpus (translated and non-translated) as a single continuous stream of data. We believe that 2,000-token chunks strike a balance between having enough chunks per corpus, on the one hand, and having big enough chunks to avoid problems of sparsity for certain rare word-level and sub-word-level features, on the other hand. Since none of our classifiers goes beyond the word level, sentence boundaries are irrelevant.

[8] Punctuation marks count as tokens.
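The chunking step is simple enough to state in code. The following is a minimal sketch (ours, not the authors' implementation); dropping a trailing fragment shorter than 2,000 tokens is our assumption:

```python
def make_chunks(tokens, size=2000):
    """Cut a corpus, viewed as one continuous token stream, into
    fixed-size chunks; punctuation marks count as tokens, and
    chapter/article and sentence boundaries are ignored."""
    chunks = [tokens[i:i + size] for i in range(0, len(tokens), size)]
    # Keep only full-size chunks so that all training instances have
    # the same length (our assumption about the trailing remainder).
    return [chunk for chunk in chunks if len(chunk) == size]
```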
Table 1 summarizes the properties of the corpora used in the study.

            Chunks   Tokens    Texts   Authors   Translators   Split
  O_heb     300      600,000   176     156       –             All original
  T_en      300      600,000   128     123       81            All translated from English
  InD_fr    30       60,000    18      17        13            All translated from French
  OoD-soc   26       52,000    59      –         –             50% orig., 50% trans. from Eng.
  OoD-eco   26       52,000    71      –         –             50% orig., 50% trans. from Eng.

Table 1: The literary corpora: training (O_heb and T_en) and test (InD_fr, translated from French); and the out-of-domain test corpora.

4.3 Methodology

The core of our experimental methodology is the development of classifiers that can automatically distinguish between instances belonging to different classes (in our case, there are only two classes: translated and non-translated texts). The classifiers are trained on a corpus containing training data, that is, instances of the classes to be distinguished, each labeled a priori as belonging to one of the classes. Each of these instances is represented as a feature vector, a set of numeric attributes designed by the developers of the classifier to capture certain characteristics of the classes. The values of these features are extracted from the training instances (e.g., frequencies of certain words in an instance; see next section). During training, the classifiers learn to distinguish between the labeled instances, thereby assigning different weights to the features. A trained classifier can then be applied to unseen test instances and determine their class. If the features selected to represent the instances are meaningful, the classifier should be accurate when applied to test data.

Such methodologies have been extensively and successfully used for the automatic classification of texts according to, e.g., topic or genre (Sebastiani, 2002). They have been similarly used for automatic author attribution, i.e., "inferring characteristics of the author from the characteristics of documents written by that author" (Juola, 2006, p. 233), for example, for identifying authors of newspaper articles (Diederich, Kindermann, Leopold, & Paass, 2003), or for determining the gender of a document's author (Koppel, Argamon, & Shimoni, 2002).

Support Vector Machine (SVM) is the classification algorithm employed in all our experiments. SVMs "probably represent the most successful technology for text categorization today" (Witten & Frank, 2005, p. 340), and indeed, SVMs have been widely and successfully used for identifying translationese (e.g., Baroni & Bernardini, 2006; Kurokawa et al., 2009; Popescu, 2011; Volansky et al., forthcoming). Specifically, we apply the Sequential Minimal Optimization algorithm (SMO) for training SVMs (Platt, 1999), using the default linear kernel, as implemented in the Weka machine learning toolkit (Hall et al., 2009).
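For a concrete picture of this training setup, the sketch below reproduces it schematically. The study uses Weka's SMO; we substitute scikit-learn's LinearSVC, a comparable linear-kernel SVM, purely for illustration, and the feature matrix here is random placeholder data standing in for the real feature vectors:

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

# Placeholder data: 600 chunks (300 original, 300 translated),
# 25 POS-frequency features each (random, only to make this runnable).
rng = np.random.default_rng(0)
X = rng.random((600, 25))
y = np.array([0] * 300 + [1] * 300)   # 0 = original Hebrew, 1 = translated

# Ten-fold cross-validation, reporting accuracy, as in the InC scenario.
scores = cross_val_score(LinearSVC(), X, y, cv=10, scoring="accuracy")
print(f"mean ten-fold CV accuracy: {scores.mean():.3f}")
```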
All the identification models are trained and tested on the corpus containing O_heb and T_en in a ten-fold cross-validation procedure (we later refer to this experimental scenario as the InC [in-corpus] scenario). The obtained SVM classifiers are then also tested on the three additional datasets discussed above (InD_fr, OoD-soc, and OoD-eco). For all the experiments we report accuracy, namely the percentage of text chunks the classifier correctly classifies. In Section 7 we analyze the resulting classifiers, exploiting the values of the weights assigned by SVMs to the features used for classification.

5 Feature Design

The essence of text classification is the design of the feature vectors by which the text data are represented. As we do not go beyond the word level in this study, we design several feature sets aimed at capturing linguistic – specifically morphological – characteristics of surface tokens and sub-tokens. In this section we describe and motivate these feature sets.

5.1 Token-based Features

We use two different kinds of token-based features: word unigrams extracted from the training data, and a precompiled list of function words extracted from external corpora. In both settings, a list of tokens constitutes the feature vector representing a chunk, and feature values are the frequencies of these tokens in the chunk (see the sketch at the end of this section).

Word unigrams  We compile a list of all the words in the training data, i.e., in the union of T_en and O_heb (excluding punctuation), and use each word as a feature. Like Volansky et al. (forthcoming), we treat this experiment as a sanity check, since, being highly content-dependent, this feature set is expected to yield very good classification results when tested on the training corpus in a ten-fold cross-validation scenario, but not to scale up to external domains.

Function Words  Since Mosteller and Wallace's seminal work on the Federalist Papers (1964), function words have been extensively and successfully used in text classification. This approach to feature design has also been proven to be instrumental in identifying translationese, albeit not very scalable (Koppel & Ordan, 2011). These words "carry little meaning by themselves [...] but [...] define relationships of syntactic or semantic functions between other ('content') words in the sentence [... they] are therefore largely topic-independent and may serve as useful indicators of an author's preferred way to express broad concepts" (Juola, 2006, p. 242). Being highly frequent, these words exist in every chunk of text, regardless of its size, and since they are so frequent, it is safe to assume that more often than not text producers do not control the use of these words, i.e., do not select them consciously. Unlike English, however, Hebrew text tokens often contain more than one lexical item, and many typical function words, such as prepositions, are concatenated to other words belonging to other parts of speech (cf. Section 3 above). Hence, closed sets containing several hundred function words, like the ones used for English, cannot be compiled for Hebrew. The list we use in this study contains Hebrew words belonging to the following categories: quantifiers, pronouns, prepositions, negation words, interrogative markers, existentials, copulas, and conjunctions. It contains all possible inflections for each word – and only those surface forms that appear at least once in a collection of six large external Hebrew corpora. The list (which is available from MILA)[9] contains 7,450 items. Due to the morphological and orthographic challenges Hebrew poses (e.g., the fact that many function words are realized as affixes), classifiers based on function words are not expected to perform on our data as well as they do on English texts.

[9] http://mila.cs.technion.ac.il
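Both token-based feature sets follow the same scheme; here is a minimal sketch (ours) of how a chunk is mapped to a vector of token frequencies. Normalizing counts by chunk length is our assumption (with fixed 2,000-token chunks, raw counts differ only by a constant factor):

```python
from collections import Counter

def token_feature_vector(chunk, feature_tokens):
    """Map a 2,000-token chunk to one frequency value per feature token.

    `feature_tokens` is either the word-unigram vocabulary or the
    7,450-item function word list; tokens absent from the chunk get 0.
    """
    counts = Counter(chunk)
    return [counts[t] / len(chunk) for t in feature_tokens]
```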
5.2 Features that Reflect Morphological Aspects

Since we are interested in investigating the morphological aspects of translationese, we define a set of features that reflect such information. To this end, we use the output of the morphological processor mentioned above (Section 4.2). Based on the processor's output (cf. Figure 1), we define the following feature sets:

POS  While POS tags may be considered syntactic rather than morphological features, we mainly employ them, as will be described below (Section 5.5), in order to enhance the performance and sophistication of other feature sets. We also use them, like Ilisei et al. (2010) do, as a baseline for testing the contribution of other features, in our case the 'pure' morphological features; i.e., we first train a classifier based solely on the 25 POS tags in the tagset, and then test this classifier with each of the morphological features added to it, and also with combinations thereof. This should give us a good indication of the contribution made by each morphological feature. For example, in Figure 1, the value of POS is verb in the first example, and noun in the second.

BINYAN  The features in this category are the seven Hebrew verbal patterns, the binyanim (cf. Section 3 above). Since the verbal patterns have no counterpart in English, the source language of our study, we expect the frequencies of at least some of them to differ between original and translated texts. In Figure 1, the value of BINYAN in the first example is Hif'il.

STATUS  The two features in this category are applicable to nouns, adjectives, participles, numerals, and quantifiers: the features construct and absolute reflect the construct and the absolute states, respectively (cf. Section 3 above). Since English does not have a form which is equivalent to the construct state, we expect the distribution of constructions involving the construct state (e.g., possessive noun-noun constructions) to differ between O_heb and T_en.

POSSESSIVE  This feature set contains only one feature, indicating whether a possessive suffix is attached to the token. Since Hebrew has several ways of expressing possessiveness, one of which is by means of attaching possessive suffixes (cf. Section 3), we expect the distribution of these suffixes to be different across O_heb and T_en. In Figure 1, a possessive suffix is attached to the second example.

PREFIX_1, PREFIX_2  Even though Hebrew words can take several prefixes, it is rarely the case that more than two prefixes are attached to one token. We therefore consider only the first two prefix positions as feature categories. We expect them to convey significant classification cues, since they correspond to function words: recall that the definite article, the conjunction and, and numerous prepositions are realized as prefixes in Hebrew. In Figure 1, the first example exhibits two prefixes: the value of PREFIX_1 is conjunction and the value of PREFIX_2 is temporalSubConj.
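As a concrete illustration, the sketch below (ours; the record format is a hypothetical rendering of the analyzer output shown in Figure 1, not MILA's actual interface) accumulates the counts from which these feature values are derived:

```python
from collections import Counter

def morphological_features(analyses):
    """Count morphological feature occurrences over one chunk.

    Each element of `analyses` is assumed to look like the first
    example of Figure 1, e.g.:
      {"pos": "verb", "binyan": "Hif'il", "status": None,
       "possessive": False,
       "prefixes": [("w", "conjunction"), ("k$", "temporalSubConj")]}
    """
    counts = Counter()
    for a in analyses:
        counts["POS:" + a["pos"]] += 1
        if a.get("binyan"):
            counts["BINYAN:" + a["binyan"]] += 1
        if a.get("status"):                  # 'construct' or 'absolute'
            counts["STATUS:" + a["status"]] += 1
        if a.get("possessive"):
            counts["POSSESSIVE"] += 1
        # Only the first two prefix positions are used as features.
        for slot, (_form, func) in enumerate(a.get("prefixes", [])[:2], 1):
            counts[f"PREFIX_{slot}:{func}"] += 1
    return counts
```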
The values of the morphological and POS feature sets are the frequencies of those features within a chunk. We also experimented with the logarithm of the frequencies as the actual feature values, but this turned out to be beneficial only for two classifiers, namely the POS and the BINYAN classifiers. We therefore use log frequencies for these two feature sets.

Note that we do not define a feature for every coordinate of the morphological analysis provided by the analyzer. For example, we find gender, tense, and number to be less relevant for our task. First, we consider them more lexical than morphological, not least due to Hebrew's grammatical gender. Second, these features typically do not reflect translators' decisions, as they are imposed by the source text (unlike, e.g., possessive or passive constructions, where translators have several alternatives to choose from).

5.3 Features Based on Character n-grams

Following Popescu (2011), we experiment with character n-grams. The feature set he designs contains character 5-grams, irrespective of word boundaries. Unlike him, we experiment with 1-grams through 5-grams, as well as with the union of all of them. Presumably, longer n-grams would capture many lexical phenomena, and would thus yield accurate in-domain but inaccurate out-of-domain classifiers (Hebrew words tend to be rather short due to lack of vowels). We also do not go beyond the word level; that is, we calculate n-grams occurring only within one token,[10] since n-grams spanning several tokens are expected to capture syntactic properties of the language, whereas the focus of this study is on morphological features.

[10] Word boundaries are counted as characters. So, for example, the bigrams corresponding to a word like ab are {_a, ab, b_}, where '_' is a reserved character marking a word boundary.

Inspired by Koppel et al. (2006), who use Hebrew and Aramaic prefixes and suffixes as features for the classification of rabbinic manuscripts, we design another feature set: we collect bigrams occurring at word boundaries, i.e., at the beginning and the end of tokens.[11] Unlike them, we do not employ a predefined list of suffixes and prefixes. Note that since each bigram in this feature set is preceded or followed by a reserved character marking a word boundary (see Footnote 10), the bigrams in this experiment are, in point of fact, trigrams (in other words, this feature set is a proper subset of the character trigram feature set).

[11] Volansky et al. (forthcoming) apply a similar feature set to the identification of English translationese.
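The within-token n-gram extraction, including the boundary convention of Footnote 10, can be sketched as follows (our code, not the authors'):

```python
def char_ngrams(token, n):
    """All character n-grams inside one token, with '_' as the
    reserved word-boundary character."""
    padded = "_" + token + "_"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

assert char_ngrams("ab", 2) == ["_a", "ab", "b_"]   # Footnote 10's example

def boundary_bigrams(token):
    """Bigrams at word boundaries: exactly the trigrams that start or
    end with the boundary character (hence a subset of the trigrams)."""
    return [g for g in char_ngrams(token, 3) if "_" in (g[0], g[-1])]
```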
5.4 Features that Approximate Word Structure

We also design a set of features that reflect, on the one hand, the formal representation of morphological information (i.e., the way morphological features are expressed in the orthography), but, on the other hand, are as content- and domain-independent as possible – that is, features that do not directly reflect lexical information. To this end, we define an abstraction mechanism which is expected to approximate Hebrew word structures, e.g., verbal and nominal patterns. The idea is to reduce the Hebrew alphabet to a smaller alphabet, allowing symbols in the reduced alphabet to capture sets of characters. We run experiments with three such abstract alphabets (AbA), listed here in decreasing order of abstraction:

AbA1  Consists of only two symbols: C, representing all consonants, and V, replacing the characters traditionally known as matres lectionis (cf. Section 3). These characters play a significant role in Hebrew derivational morphology, among other things representing some of the vowels. Formally: AbA1 := {C, V}, where C represents the consonants b, g, d, z, x, T, k, l, m, n, s, y, p, c, q, r, $, t and V represents a, h, w, i.

AbA2  In this alphabet, C is as above, but V is spelled out. AbA2 thus contains five symbols: {C, a, h, w, i}. Not only do the matres lectionis play a significant role in nonconcatenative morphology (e.g., in verbal and nominal patterns), but the prefixes h and w also reflect the definite article and the coordinating conjunction and, respectively.

AbA3  Includes ten symbols: {C, a, h, w, i, b, k, l, m, t}, where C stands for all remaining letters. The spelled-out consonants b, k, l, and m are prepositions which are realized as prefixes. The characters k and m can also reflect other grammatical properties, such as gender, number, possessiveness, and tense. The consonant t participates in the construction of many verbal and nominal patterns; e.g., it is part of the unmarked feminine plural suffix wt.

Figure 2 illustrates how the surface token wk$hmcxiqwt ("and when the funny ones [feminine]...") is represented in each of the three abstract alphabets. The feature values in the AbA experiments are (the frequencies of) complete abstracted tokens.

  Surface   w k $ h m c x i q w t
  AbA1      V C C V C C C V C V C
  AbA2      w C C h C C C i C w C
  AbA3      w k C h m C C i C w t

Figure 2: The three different abstract representations of the surface form wk$hmcxiqwt.

Since no language-specific processing tools are necessary in order to create these abstract representations, applying them to other languages with nonconcatenative morphology (specifically Arabic) is straightforward.

5.5 Feature Combinations

We define two ways of combining features: disjunction and conjunction. The disjunction F1 ∪ F2 results in the union of the feature sets F1 and F2. Although the feature vector grows as a result of the disjunction, the features and their values remain the same. Combining by means of disjunction allows for a better understanding of the contribution each feature subset makes.

The conjunction F1 × F2, on the other hand, results in a new feature set altogether, namely, the Cartesian product of F1 and F2. In this study, we employ conjunction in order to enrich the different AbA and character n-gram feature sets with POS information. For example, consider the conjunction of character bigrams and POS; given the input word hlk ("walked"), which is tagged as a verb, the features extracted from it are the pairs 〈_h, verb〉, 〈hl, verb〉, 〈lk, verb〉, and 〈k_, verb〉; the feature space includes the Cartesian product of all possible bigrams with all 25 POS tags. Similarly, when combining AbA1 × POS, each AbA1 feature, e.g., CVC, results in a feature set of 25 features, one for each POS tag: 〈CVC, noun〉, 〈CVC, verb〉, etc. By applying POS conjunction to the AbA alphabets and character n-grams, we obtain more nuanced and better interpretable feature sets, which remain, nevertheless, abstract and content-independent.
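A minimal sketch (ours) of the three abstraction alphabets over the paper's ASCII transliteration, together with the POS conjunction of Section 5.5; the assertions reproduce the three rows of Figure 2:

```python
MATRES = set("ahwi")             # matres lectionis
ABA3_SPELLED = set("ahwibklmt")  # AbA3 additionally spells out b, k, l, m, t

def abstract(token, alphabet):
    """Map a transliterated Hebrew token into AbA1, AbA2, or AbA3."""
    if alphabet == 1:
        return "".join("V" if c in MATRES else "C" for c in token)
    if alphabet == 2:
        return "".join(c if c in MATRES else "C" for c in token)
    if alphabet == 3:
        return "".join(c if c in ABA3_SPELLED else "C" for c in token)
    raise ValueError("alphabet must be 1, 2, or 3")

assert abstract("wk$hmcxiqwt", 1) == "VCCVCCCVCVC"   # Figure 2, AbA1 row
assert abstract("wk$hmcxiqwt", 2) == "wCChCCCiCwC"   # Figure 2, AbA2 row
assert abstract("wk$hmcxiqwt", 3) == "wkChmCCiCwt"   # Figure 2, AbA3 row

def aba_pos_feature(token, pos_tag, alphabet):
    """Conjunction with POS: each feature is a (template, tag) pair."""
    return (abstract(token, alphabet), pos_tag)      # e.g., ('CVC', 'noun')
```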
6 Results

We implemented the features discussed in the previous section and constructed SVM classifiers based on each set of features. We then tested the accuracy of each of the classifiers in four experimental setups corresponding to the corpora introduced in Section 4:

InC (in-corpus)  The training data, i.e., O_heb and T_en. It includes 600 chunks: 300 original Hebrew instances and 300 translated from English; evaluation is done using ten-fold cross-validation.

InD_fr  Testing on the in-domain dataset containing 30 chunks of twentieth-century literature translated into Hebrew from French. Note that this dataset contains only chunks translated from French, that is, no texts originally written in Hebrew.

OoD-soc  Testing on the out-of-domain dataset dealing with social science topics (26 chunks, evenly balanced: 13 original Hebrew, 13 translated from English).

OoD-eco  Testing on the out-of-domain dataset dealing with economics (26 chunks, evenly balanced: 13 original Hebrew, 13 translated from English).

For all the experiments we report accuracy; the baseline (choosing at random) is always 50%, as the test set is balanced. Since the test corpora are relatively small (30 chunks in the case of InD_fr and 26 in the out-of-domain experiments), most differences in accuracy on these test sets are not statistically significant. However, differences of a few percentage points on the training set, for which we conduct cross-validation evaluation, are typically significant. To emphasize that, we graphically depict 95% confidence intervals (Clopper & Pearson, 1934) for the results of some of the InC experiments. Complete ranked confidence interval plots for all the experiments (InC, InD_fr, OoD-soc, and OoD-eco) are listed in the appendix.

6.1 Classifiers Based on Tokens

The accuracies of the classifiers trained on token-based features are given in Table 2. As we conjectured, the classifier trained on word unigrams is highly accurate in the in-corpus scenario, but does not scale up to the in-domain and out-of-domain datasets. Similarly, and like Koppel and Ordan (2011), we find that a classifier trained solely on function words, while achieving convincing in-corpus results, does not perform very well when applied in other experimental scenarios.

  Classifier       InC    InD_fr   OoD-soc   OoD-eco
  Word unigrams    98.3   66.6     65.3      69.2
  Function words   93.5   66.6     69.2      61.5

Table 2: Results of classifiers based on tokens.

6.2 Classifiers Based on Morphological Analysis

The results of the classifiers that reflect morphological aspects, namely, the ones trained solely on the output of the morphological analyzer, are given in Table 3. The confidence intervals of the cross-validation experiments are plotted in Figure 3.

  Classifier                      InC    InD_fr   OoD-soc   OoD-eco
  POS                             76.8   80.0     50.0      50.0
  BINYAN (BI)                     57.0   66.6     57.6      53.8
  STATUS (ST)                     60.2   70.0     53.8      38.4
  POSSESSIVE (PS)                 54.2   50.0     42.3      46.1
  PREFIX_1 (P1)                   66.5   60.0     61.5      57.6
  PREFIX_2 (P2)                   56.3   50.0     46.1      65.3
  POS ∪ BI                        77.0   83.3     46.1      53.8
  POS ∪ ST                        77.5   83.3     53.8      50.0
  POS ∪ PS                        76.5   83.3     53.8      53.8
  POS ∪ P1                        80.8   76.6     57.6      57.6
  POS ∪ P2                        78.0   66.6     50.0      61.5
  POS ∪ BI ∪ ST ∪ PS ∪ P1 ∪ P2    82.7   66.6     53.8      65.3
  BI ∪ ST ∪ PS ∪ P1 ∪ P2          71.5   56.6     69.2      65.3

Table 3: Results of classifiers based on morphological analysis.

[Figure 3: Confidence intervals, classifiers based on morphological analysis (InC scenario).]

The POS classifier, here used mainly as a baseline to test the contribution of the 'pure' morphological features, yields 76.8% accuracy in the in-corpus scenario. While performing quite well on the separate dataset of literary texts translated from French (80%), it fails to
identify translationese in the OoD datasets. Interestingly, Baroni and Bernardini (2006) report that a similar classifier obtains 49.6% accuracy on their Italian data. They note that "the strikingly low performance of the unigram [POS] model is not surprising, since this model is using the relative frequency of 50 [POS] tags as its only cue" (p. 267). Volansky et al. (forthcoming), on the other hand, obtain 90% when applying a similar feature set to identifying English translationese.

Apart from PREFIX_1, and the somewhat less impressive BINYAN, no other classifier based on a single morphological feature (including POS) manages to perform better than the baseline in all four experimental scenarios. When the features are combined by means of disjunction, the results improve somewhat. Not surprisingly, the combined feature sets that yield the best results in the in-corpus scenario are the ones that use both POS and PREFIX_1 (POS ∪ P1; POS ∪ BI ∪ ST ∪ PS ∪ P1 ∪ P2). Once the POS baseline is removed and a classifier is trained using only the combination of the single, pure morphological features (BI ∪ ST ∪ PS ∪ P1 ∪ P2), accuracy drops in all scenarios except OoD-soc. Interestingly, this OoD-soc result is the best that any of the pure morphological classifiers yields.

In sum, classifiers based on features produced by morphological analysis fail to produce accurate classification results, especially out of domain. The reason may be the low quality of the morphological processing tools we use, or the low dimensionality of these classifiers (sometimes containing only one or two features), coupled with the relatively small size of the training set.

6.3 Classifiers Based on Character n-grams

Capturing much lexical information, classifiers based on character n-grams unsurprisingly yield good results (cf. Popescu, 2011; Volansky et al., forthcoming). These results are given in Table 4, and the confidence intervals of the InC experiments are plotted in Figure 4.

  Classifier                          InC    InD_fr   OoD-soc   OoD-eco
  1-grams                             69.2   26.7     61.5      61.5
  2-grams                             88.5   66.7     69.2      61.5
  2-grams at word boundaries          92.5   70.0     57.6      53.8
  3-grams                             98.5   66.7     69.2      76.9
  4-grams                             98.8   73.3     73.1      80.7
  4-grams, top-60 features            93.5   66.7     80.7      80.7
  5-grams                             98.3   66.7     73.1      76.9
  1- ∪ 2- ∪ 3- ∪ 4- ∪ 5-grams         98.8   73.3     76.9      69.2
  2-grams × POS                       98.3   73.3     65.3      57.6
  2-grams at word boundaries × POS    97.3   66.7     73.1      80.7
  3-grams × POS                       98.7   73.3     80.7      61.5
  4-grams × POS                       98.5   73.3     80.7      69.2

Table 4: Results of classifiers based on character n-grams.

[Figure 4: Confidence intervals, classifiers based on character n-grams (InC scenario).]

The optimal n for n-gram classifiers seems to be 4; not only is the 4-gram classifier highly accurate in cross-validation, it scales up nicely to out-of-domain tasks. Extending the n-gram length to 5, or taking all n-grams of lengths 1 to 5, does not seem to improve much. Enhancing the n-grams with POS information brings about a small accuracy gain in most cases.

As a further indication of the robustness of the n-gram classifiers, we experimented with a 4-gram classifier that only uses the 30 most indicative features of O_heb and the 30 most
indicative of T_en: after training a classifier with the entire set of 4-grams, we selected the 30 features whose weights were greatest (most indicative of O_heb) and the 30 features whose weights were lowest (most indicative of T_en). We then trained a new classifier using only these 60 features.[12] The results of this top-60 4-gram classifier are listed in Table 4, and demonstrate the power of simple, low-dimensional classifiers. Some of the most indicative features are discussed in Section 7.

[12] The specific features are listed in the appendix.

6.4 Classifiers Based on Abstract Alphabets

The results of the classifiers that approximate Hebrew word structure by means of alphabet abstractions are given in Table 5, and the confidence intervals of the InC experiments are plotted in Figure 5.

  Classifier                     InC    InD_fr   OoD-soc   OoD-eco
  AbA1                           78.2   46.6     35.6      76.9
  AbA2                           92.2   43.3     73.1      96.1
  AbA3                           97.0   70.0     80.7      65.3
  AbA1 × POS                     92.0   63.3     65.3      73.1
  AbA2 × POS                     95.8   76.6     73.1      80.7
  AbA3 × POS                     98.0   86.6     84.6      65.3
  AbA3 × POS, top-40 features    91.7   80.0     80.7      76.6

Table 5: Results of classifiers based on abstract alphabets.

[Figure 5: Confidence intervals, classifiers based on abstract alphabets (InC scenario).]

AbA1 and AbA2 reveal mixed results. While performing extremely well in certain scenarios (for example, the result AbA2 obtains on the OoD-eco dataset, 96.1%, is by far the best one any of our classifiers yields in that scenario, cf. Figure 9 in the appendix), they fail miserably in other scenarios. In contrast, AbA3 yields competitive results in all four scenarios. Considering the fact that these results are obtained without applying any feature selection methods, they are very promising. It stands to reason that by cautiously reducing the feature spaces of these simple AbA classifiers, their performance will increase significantly. We intend to do that in future work.

Upon enriching the abstract templates with POS information by means of conjunction, we observe accuracy improvement in most scenarios. While all three classifiers yield results well above the baseline, AbA3 × POS is the best overall classifier we train in this study. Here, too, we experimented with a classifier using only the top-n features most indicative of O and the top-n most indicative of T, this time with n being 20. This very low-dimensional classifier is the only one in this study obtaining more than 75% in all experimental scenarios.

7 Analysis

We now look more closely at the results discussed in the previous section. Specifically, in order to understand which features are more relevant than others for a given classifier, we examine the weights an SVM assigns to the features it uses: the higher the weight, the more important the feature is considered to be. Note that low weights are not always an indication that a feature is not important, as potential dependencies among features can discount important features. The inverse, however, does not hold: a feature assigned a high weight indicates a significant property of one of the classes to be distinguished. In the following, we highlight some of the more successful discriminating features.
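The top-k selection used for the "top-60" and "top-40" classifiers above, and the weight inspection used throughout this section, can be sketched as follows (ours; it assumes a trained linear SVM that exposes one weight per feature, as, e.g., scikit-learn's LinearSVC does via `coef_`):

```python
def top_features(weights, feature_names, k=30):
    """Return the k features with the lowest weights (most indicative
    of one class) and the k with the highest weights (most indicative
    of the other class)."""
    order = sorted(range(len(weights)), key=lambda i: weights[i])
    lowest = [feature_names[i] for i in order[:k]]
    highest = [feature_names[i] for i in order[-k:]]
    return lowest, highest
```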
POS Three major differences in the POS distribution between T_en and O_heb are given in Table 6.

POS tag      O_heb    T_en     Ratio (T_en/O_heb)
properName   22,808   28,470   1.25
copula        9,463   11,403   1.21
modal         3,243    3,677   1.13

Table 6: Three major differences in the POS distribution across T_en and O_heb.

The difference in the distribution of proper names goes hand in hand with the explicitation hypothesis (Blum-Kulka, 1986). According to this hypothesis, translators tend to render implicit elements of the source text (e.g., pronouns) more explicitly in the target text they produce, specifically by means of proper names. We note, however, that we (unlike, e.g., Baroni & Bernardini, 2006) do not observe remarkable differences in the distribution of pronouns between O_heb and T_en.

We also find that modal constructions are more frequent in translated than in non-translated Hebrew texts. This is a case of positive interference (Toury, 1995), i.e., the overrepresentation in translations of features characteristic of the source language (here, English). Interestingly, the different distribution of modal verbs across T_en and O_heb provides a partial explanation for the excessive use of copulas in translated texts: in Hebrew, most modal verbs do not inflect for tense, and in order to express the past tense, modals are combined with past-tense copula forms (which function in these constructions as auxiliary verbs). And indeed, we find in our data that past-tense copulas are highly collocated with modal verbs. This partially explains why copulas are more frequent in T_en. Another contributing factor stems from the optionality of Hebrew present-tense copulas in non-verbal sentences (Haugereid, Melnik, & Wintner, 2013); since their English counterparts are mandatory, they tend to be explicated in translations into Hebrew.

Morphological features The classifiers trained on the pure, single morphological feature sets do not perform very well, neither in the ten-fold cross-validation in-corpus scenario nor when tested on the additional in-domain and out-of-domain datasets. This might be due to the low dimensionality of these classifiers, to the relatively small amount of training data, or to the performance of the morphological analyzer, which is often inaccurate.

Perhaps not surprisingly, the classifier based on PREFIX_1 is the best-performing one, as it reflects linguistic information that is realized in other languages (like English) as function words, e.g., conjunctions and prepositions. We find that the most significant marker of O_heb is the prefix corresponding to the coordinating conjunction and; indeed, this prefix is 1.24 times more frequent in O_heb than in T_en. This finding calls for further research by translation scholars.

Even though the BINYAN classifier, based on the seven Hebrew verbal patterns, manages to perform slightly better than the baseline in all experimental scenarios, we cannot interpret the results it yields.
The reason is that the accuracy of the morphological processor is particularly poor with respect to the verbal patterns.

Character n-grams Many of the most discriminative word unigrams are also reflected in the results of other feature classes, such as character n-grams and abstract alphabets. For instance, the character trigrams _dw, dww, wwq, wqa, and qa_ (corresponding to the token dwwqa mentioned above) are amongst the most prominent markers of O_heb. In other words, even though we set out to capture morphological properties by looking at sub-tokens and alphabet abstractions, we sometimes end up capturing lexical cues.

A detailed analysis of the most significant features of the 4-gram classifier reveals the following pattern: among the ten strongest indicators of original Hebrew are three substrings of the lexical item dwwqa, followed by ywd "more/still/yet", kbr "already", gm "also" and mwl "against", with word boundaries indicated on either side of these short words. Note that these are all function words, which likely do not have direct, one-to-one counterparts in the source languages, and which are hence distributed very differently between O_heb and T_en. Other indicators of O_heb include bi$r (the prefix of bi$ral "in Israel") and _tl_ (with word boundaries at both ends), clearly referring to Tel Aviv. Strong indicators of translations include the prepositions kdi "in-order-to" and bzmn "while/during", the adjective nwsp "additional", and the modal yewi "may", but also n-grams that are more abstract and less transparent.

AbA1 We find that one of the most prominent features of T_en is a triplet of matres lectionis, namely VVV. This template mostly encompasses four tokens which play a crucial role in Hebrew grammar: (1) hwa, which can be either a pronoun ("he") or a third person singular copula in the present tense ("is"); (2) hia ("she"), which, like hwa, is both a pronoun and a copula, only feminine; and (3) hih ("was") and (4) hiw ("were"), which are copula forms in the past tense. Table 7 shows how these four most characteristic instantiations of the template are realized across O_heb and T_en; together they constitute 96% of this template's occurrences in the training data. A reason for the overrepresentation of copulas in T_en, namely positive interference, was discussed above.

Token   O_heb   T_en    Ratio (T_en/O_heb)
hwa     4,227   5,617   1.33
hia     3,431   3,913   1.14
hih     2,835   3,607   1.27
hiw       908   1,394   1.54

Table 7: The four most characteristic instantiations of the AbA1 template VVV and their distribution in T_en and O_heb.

By looking at an abstract feature as simple as the VVV sequence, which potentially leaves room for 4^3 = 64 surface forms (in practice, only 30 of them are realized in the training data), we already have at our disposal numerous highly frequent distinguishing markers.
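For illustration, the following sketch implements an AbA1-style abstraction. The mapping is reconstructed from the examples in this section (VVV for hwa, and CVVCV for dwwqa, discussed below); it is an approximation of the alphabet defined in Section 5, not the authors' exact implementation.

```python
# The four matres lectionis in the paper's transliteration (a, h, w, i)
# map to V; all other letters map to C. This mapping is inferred from
# the examples discussed in the text.
MATRES = set("ahwi")

def aba1(token):
    return "".join("V" if ch in MATRES else "C" for ch in token)

assert aba1("hwa") == "VVV"      # pronoun/copula forms, cf. Table 7
assert aba1("dwwqa") == "CVVCV"  # the word-unigram marker of O_heb
```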
AbA2 Naturally, some of the results found with AbA1 are reproduced in AbA2. For example, the spelled-out instance hwa of the AbA1 VVV template is one of the top markers of T_en. Another discriminating marker of T_en captured by AbA2 is the template hCCia, which captures certain instances of the verb pattern Hif'il (in the past tense, third person singular masculine), namely those instances with a as the third (and final) letter of the root. The Hif'il pattern is predominantly used as a causative in Hebrew and might indicate a structural difference between English and Hebrew; this calls for further investigation.

The same AbA2 template, hCCia, also captures the token hn$ia ("the president"), perhaps reflecting a cultural marker (the Israeli head of government is the prime minister, while the presidency is largely ceremonial). Although lexical, this feature captures a significant difference between T_en and O_heb that scales to other domains, since it is rather frequent in genres such as newspaper articles and parliamentary proceedings.

AbA3 While less abstract than AbA1 and AbA2, this third alphabet captures morphological templates that the more abstract alphabets cannot. Consider the template mCwCl, which, theoretically speaking, exclusively captures the masculine singular passive participle of roots whose third (and final) letter is l. This template is three times more frequent in T_en. Its most frequent instance, the modal mswgl ("capable of"), reflects constructions which are, as discussed above, more frequent in translated texts due to positive interference. The other instances of this AbA3 template suggest that morphological items are distributed differently between T_en and O_heb. This, too, calls for further study.

AbA, general Importantly, with the AbA templates we manage to capture many discriminative markers, whether lexical, morphological, or (morpho)syntactic, without relying on a ready-made, morphologically or syntactically informed mechanism.

Enriching the templates with POS information improves the results: once the AbA templates are restricted to smaller token spaces, they become more precise. In AbA1 × POS, a prominent AbA1 feature like VVV is spelled out into 25 features (one per POS tag), thereby making 〈VVV, pronoun〉 and 〈VVV, copula〉 much more dominant than, say, 〈VVV, noun〉. The POS enhancement thus helps the classifiers separate the wheat from the chaff.
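The conjunction itself amounts to pairing each token's abstract template with its POS tag. A minimal sketch, under the assumption of a pre-tagged input (the tag names here are hypothetical placeholders, not the tagger's actual tagset):

```python
MATRES = set("ahwi")  # matres lectionis in the paper's transliteration

def aba1(token):
    return "".join("V" if ch in MATRES else "C" for ch in token)

def conjoin(tokens, tags):
    # Each token yields a single <template, POS> feature, so an abstract
    # template such as VVV splits into one feature per POS tag.
    return ["<%s,%s>" % (aba1(tok), tag) for tok, tag in zip(tokens, tags)]

print(conjoin(["hwa", "dwwqa"], ["pronoun", "adverb"]))
# ['<VVV,pronoun>', '<CVVCV,adverb>']
```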
8 Conclusion

We have employed text classification for the investigation of translationese in a morphologically complex language, namely Modern Hebrew. This is the first work addressing the automatic identification of translationese in a Semitic language, and the first focusing on the morphological manifestation of translated texts' properties. Specifically, we have trained several SVM classifiers that distinguish with high accuracy between twentieth-century literary texts translated from English and similar texts originally written in Hebrew. Some of these classifiers have proven robust, yielding good results when tested on different datasets, i.e., on texts from the same domain (twentieth-century literature) but translated from a different source language (French), and on texts from other domains, namely newspaper and journal articles dealing with the social sciences and economics. The fact that some of the classifiers scale up to other experimental scenarios supports our hypothesis that training on a corpus of contemporary literature, a very heterogeneous dataset, is suitable and beneficial for the development of scalable classifiers.

Numerous feature design strategies have been explored: function words, word unigrams, pure morphological features, POS tags, character n-grams, and three different instances of a novel alphabet abstraction mechanism aimed at approximating Hebrew word structure. We have also experimented with several hybrid feature sets, i.e., combinations of the aforementioned feature sets by means of disjunction and conjunction.

Classifiers trained solely on morphological information do not obtain very good results; this might be due to the performance of the often inaccurate morphological analyzer, the low dimensionality of these classifiers, or the relatively small amount of training data. The classifiers obtaining the best overall results use combined models: conjunctions of POS information with either an alphabet abstraction or character n-grams. This indicates that, currently, the best way to represent word-level and sub-word-level phenomena in Hebrew, for the purpose of identifying translationese, is to approximate morphological analysis with shallow abstractions and to restrict those abstractions to specific POS spaces.

As we saw in the previous section, even though we set out to capture morphological properties of Hebrew, we sometimes end up capturing non-morphological features. Even when applying POS annotation and alphabet abstractions, lexical markers manage to "sneak in," e.g., in the form of proper nouns. Indeed, some of the most significant classification cues do not reflect morphological traits of the Hebrew language.

Let us revisit the example of the lemma dwwqa (roughly meaning "contrary to expectations"). It appears 6.62 times more often in O_heb than in T_en (331 vs. 50 occurrences), thus averaging slightly more than one occurrence per original Hebrew chunk; the probability of its appearing in a T_en chunk, by contrast, is only 1/6 (assuming a uniform distribution). Although not a morphological feature, dwwqa does reflect a structural difference between Hebrew and English (and French), and due to its relatively high frequency it contributes immensely to classification. For example, CVVCV, the AbA1 abstraction of dwwqa, is one of the most significant features in the AbA1 experiment. Similarly, as described in Section 7, n-grams corresponding to substrings of dwwqa are amongst the most prominent markers of O_heb. In this sense we conclude that although our abstractions are not purely morphology-based, the non-morphological features that do manage to sneak in are of both theoretical and practical value.

In future work, our first concern will be dimensionality reduction. As we show with the top-60 4-gram and top-40 AbA3 × POS classifiers, a very low-dimensional space of only 40 or 60 features suffices for producing highly accurate results. We intend to employ state-of-the-art algorithms to rank feature sets and select the most discriminative feature subsets, thereby reducing the size of the feature vectors and limiting overfitting to the training data. This should bring about accuracy gains, but also facilitate a better understanding of the morphological properties of Hebrew translationese.

We also intend to explore other ways of designing alphabet abstractions, more sophisticated than the ones developed and discussed in the present work. Substitutions could, for example, be made dependent on positions within the surface token, thereby simulating Hebrew prefixes and suffixes. In this study, we have combined abstract alphabets only with POS tags; this has turned out to be a promising approach. Combinations with other feature sets might also prove fruitful, in particular with PREFIX_1, the pure morphological feature set yielding the best results. Finally, we plan to apply similar alphabet abstractions to other Semitic languages, such as Arabic, building on the similar root-and-pattern structure of words in these languages.
Acknowledgments

This research was supported by a grant from the Israeli Ministry of Science and Technology. We are grateful to Ted Briscoe for suggesting some of the n-gram experiments and for useful discussions. We wish to thank Titus von der Malsburg for suggesting the use of confidence intervals. We are also grateful to Bracha Lang for providing us with the out-of-domain corpora, and to Kayla Jacobs for providing us with the list of Hebrew function words. Thanks are also due to Irit Noy for annotating the literary texts with additional metadata.

References

Argamon, S. (2011). Book review of Scalability Issues in Authorship Attribution, by Kim Luyckx. Literary and Linguistic Computing, 27(1), 95–97.
Baker, M. (1993). Corpus linguistics and translation studies: Implications and applications. In M. Baker, G. Francis, & E. Tognini-Bonelli (Eds.), Text and technology: In honour of John Sinclair (pp. 233–250). Amsterdam: John Benjamins.
Bar-Haim, R., Sima'an, K., & Winter, Y. (2008). Part-of-speech tagging of Modern Hebrew text. Natural Language Engineering, 14(2), 223–251.
Baroni, M., & Bernardini, S. (2006). A new approach to the study of translationese: Machine-learning the difference between original and translated text. Literary and Linguistic Computing, 21(3), 259–274.
Blum-Kulka, S., & Levenston, E. A. (1983). Universals of lexical simplification. In C. Færch & G. Kasper (Eds.), Strategies in interlanguage communication (pp. 119–139). Longman.
Blum-Kulka, S. (1986). Shifts of cohesion and coherence in translation. In J. House & S. Blum-Kulka (Eds.), Interlingual and intercultural communication: Discourse and cognition in translation and second language acquisition studies (Vol. 35, pp. 17–35). Tübingen: Gunter Narr.
Clopper, C. J., & Pearson, E. S. (1934). The use of confidence or fiducial limits illustrated in the case of the binomial. Biometrika, 26(4), 404–413.
Daniels, P. T. (1997). Scripts of Semitic languages. In R. Hetzron (Ed.), The Semitic languages (pp. 16–45). Routledge.
Diederich, J., Kindermann, J., Leopold, E., & Paass, G. (2003). Authorship attribution with support vector machines. Applied Intelligence, 19(1–2), 109–123.
Fabri, R., Gasser, M., Habash, N., Kiraz, G., & Wintner, S. (2014). Linguistic introduction: The orthography, morphology and syntax of Semitic languages. In I. Zitouni (Ed.), Semitic language processing (pp. 3–41). Berlin and Heidelberg: Springer.
Gellerstam, M. (1986). Translationese in Swedish novels translated from English. In L. Wollin & H. Lindquist (Eds.), (pp. 88–95). Lund: CWK Gleerup.
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. (2009). The WEKA data mining software: An update. ACM SIGKDD Explorations Newsletter, 11(1), 10–18.
Haugereid, P., Melnik, N., & Wintner, S. (2013). Nonverbal predicates in Modern Hebrew. In S. Müller (Ed.), Proceedings of the 20th international conference on head-driven phrase structure grammar (pp. 6–26). CSLI Publications. Retrieved from http://cslipublications.stanford.edu/HPSG/2013/hmw.pdf
Ilisei, I., & Inkpen, D. (2011). Translationese traits in Romanian newspapers: A machine learning approach. International Journal of Computational Linguistics and Applications, 2.
Ilisei, I., Inkpen, D., Pastor, G. C., & Mitkov, R. (2010). Identification of translationese: A machine learning approach. In A. F. Gelbukh (Ed.), Proceedings of CICLing-2010: 11th international conference on computational linguistics and intelligent text processing (Vol. 6008, pp. 503–511). Springer.
Itai, A., & Wintner, S. (2008). Language resources for Hebrew. Language Resources and Evaluation, 42(1), 75–98.
Juola, P. (2006). Authorship attribution. Foundations and Trends in Information Retrieval, 1(3), 233–334.
Koehn, P. (2005). Europarl: A parallel corpus for statistical machine translation. In MT Summit (Vol. 5).
Koppel, M., Argamon, S., & Shimoni, A. R. (2002). Automatically categorizing written texts by author gender. Literary and Linguistic Computing, 17(4), 401–412.
Koppel, M., Mughaz, D., & Akiva, N. (2006). New methods for attribution of rabbinic literature. Hebrew Linguistics: A Journal for Hebrew Descriptive, Computational and Applied Linguistics, 57, 5–18.
Koppel, M., & Ordan, N. (2011). Translationese and its dialects. In Proceedings of the 49th annual meeting of the Association for Computational Linguistics: Human language technologies (pp. 1318–1326). Portland, Oregon, USA: Association for Computational Linguistics.
Kurokawa, D., Goutte, C., & Isabelle, P. (2009). Automatic detection of translated text and its impact on machine translation. Proceedings of MT Summit XII, 81–88.
Lembersky, G., Ordan, N., & Wintner, S. (2011, July). Language models for machine translation: Original vs. translated texts. In Proceedings of EMNLP.
Lembersky, G., Ordan, N., & Wintner, S. (2012a, April). Adapting translation models to translationese improves SMT. In Proceedings of the 13th conference of the European chapter of the Association for Computational Linguistics (pp. 255–265). Avignon, France: Association for Computational Linguistics. Retrieved from http://www.aclweb.org/anthology/E12-1026
Lembersky, G., Ordan, N., & Wintner, S. (2012b, December). Language models for machine translation: Original vs. translated texts. Computational Linguistics, 38(4), 799–825. Retrieved from http://dx.doi.org/10.1162/COLI_a_00111
Lembersky, G., Ordan, N., & Wintner, S. (2013, January). Improving statistical machine translation by adapting translation models to translationese. Computational Linguistics, 39. Retrieved from http://dx.doi.org/10.1162/COLI_a_00159
Lembersky, G., Shacham, D., & Wintner, S. (2014, January). Morphological disambiguation of Hebrew: A case study in classifier combination. Natural Language Engineering, 20, 69–97. Retrieved from http://journals.cambridge.org/article_S1351324912000216
Lynch, G., & Vogel, C. (2012). Towards the automatic detection of the source language of a literary translation. In Proceedings of the 24th international conference on computational linguistics (COLING): Posters (pp. 775–784).
Mosteller, F., & Wallace, D. L. (1964). Inference and disputed authorship: The Federalist. Addison-Wesley.
Platt, J. C. (1999). Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. Burges, & A. Smola (Eds.), Advances in kernel methods: Support vector learning. Cambridge, MA: MIT Press.
Popescu, M. (2011). Studying translationese at the character level. In G. Angelova, K. Bontcheva, R. Mitkov, & N. Nicolov (Eds.), Proceedings of Recent Advances in Natural Language Processing (pp. 634–639).
Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34(1), 1–47.
Toury, G. (1995). Descriptive translation studies and beyond. Amsterdam and Philadelphia: John Benjamins.
Volansky, V., Ordan, N., & Wintner, S. (forthcoming). On the features of translationese. Literary and Linguistic Computing.
Witten, I. H., & Frank, E. (2005). Data mining: Practical machine learning tools and techniques (2nd ed.). Morgan Kaufmann.
Yona, S., & Wintner, S. (2008). A finite-state morphological grammar of Hebrew. Natural Language Engineering, 14(2), 173–190.

Most Distinctive N-gram Features

The 60 most distinctive n-gram features (Section 6.3) are: wqa_, wwqa, dwwq, ywd_, _kbr, _gm_, mwl_, kbr_, _ywd, _mwl, _dww, bier, bkll, _wrq, _egm, _acl, kll_, ela_, egm_, klwm, _lw_, _klw, wla_, _wla, _tl_, _ph_, _npe, ain_, blnw, sbln, arwx, _yew, _lmy, bmek, ylwl, hwa_, mswg, laxw, _ah_, dmii, _mlb, _zmn, hbiy, _bzm, briq, mlbd, _msw, _ydi, amwr, _keh, lehi, mek_, ydii, mbri, _hwa, yewi, nwsp, bzmn, _kdi, kdi_

Confidence Interval Plots

We graphically depict below 95% confidence intervals corresponding to the experiments reported on in Section 6: the InC experiments (Figure 6), the InD-fr experiments (Figure 7), the OoD-soc experiments (Figure 8), and the OoD-eco experiments (Figure 9).

[Figure 6: Confidence intervals, InC experiments.]

[Figure 7: Confidence intervals, InD-fr experiments.]

[Figure 8: Confidence intervals, OoD-soc experiments.]

[Figure 9: Confidence intervals, OoD-eco experiments.]
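For reference, such exact (Clopper-Pearson) intervals for a classifier's accuracy can be computed from beta quantiles. The following is a hedged sketch using scipy; the chunk count n = 300 in the example is a hypothetical placeholder, not a figure from the paper.

```python
from scipy.stats import beta

def clopper_pearson(correct, n, alpha=0.05):
    # Exact binomial confidence interval for an observed accuracy of
    # correct/n; the edge cases correct = 0 and correct = n are clamped.
    lo = beta.ppf(alpha / 2, correct, n - correct + 1) if correct > 0 else 0.0
    hi = beta.ppf(1 - alpha / 2, correct + 1, n - correct) if correct < n else 1.0
    return lo, hi

# E.g., 98.8% accuracy on a hypothetical 300 chunks:
print(clopper_pearson(round(0.988 * 300), 300))
```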