On the Features of Translationese

Vered Volansky, Noam Ordan, and Shuly Wintner*
Department of Computer Science, University of Haifa
Mount Carmel, 31905 Haifa, Israel

* This is the authors' pre-print copy, which differs from the final publication.

Abstract

Much research in translation studies indicates that translated texts are ontologically different from original, non-translated ones. Translated texts, in any language, can be considered a dialect of that language, known as 'translationese'. Several characteristics of translationese have been proposed as universal in a series of hypotheses. In this work we test these hypotheses using a computational methodology that is based on supervised machine learning. We define several classifiers that implement various linguistically-informed features, and assess the degree to which different sets of features can distinguish between translated and original texts. We demonstrate that some feature sets are indeed good indicators of translationese, thereby corroborating some hypotheses, whereas others perform much worse (sometimes at chance level), indicating that some 'universal' assumptions have to be reconsidered.

1 Introduction

This work addresses the differences between translated (T) and original (O), non-translated texts. These differences, to which we refer as 'features', have been discussed and studied extensively by translation scholars in the last three decades. In this work we employ computational means to investigate them quantitatively. Focusing only on English, our main methodology is based on machine learning, more specifically an application of machine learning to text classification.

The special status of translated texts is a compromise between two forces: fidelity to the source text, on the one hand, and fluency in the target language, on the other. Both forces operate simultaneously: some 'fingerprints' of the source text are left on the target text, and at the same time the translated text includes shifts from the source so as to be more fluent and to fit the target language model better. The differences between O and T have been studied empirically on computerized corpora by translation scholars since the 1990s (Laviosa, 2002), but only recently, since Baroni and Bernardini (2006), has it been shown that distinguishing between O and T can be done automatically with a high level of accuracy.

Toury (1980) paved the way for studying translated texts in comparison to target language texts, ignoring the source text altogether. The idea behind this move was that translations as such, regardless of the source language, have something in common: certain stylistic features governed by so-called translation norms. Therefore, in order to learn about these special marks of translation, the right point of reference is non-translated texts.

Baker (1993) calls for compiling and digitizing 'comparable corpora' and using them to study 'translation universals', such as simplification, the tendency to make the source text simpler lexically, syntactically, etc., or explicitation, the tendency to render implicit utterances in the original more explicit in the translation. This call sparked a long-lasting quest for translation universals, and several works test such hypotheses in many target languages, including English, Finnish, Hungarian, Italian and Swedish (Mauranen and Kujamäki, 2004; Mauranen, 2008). In this study we refrain from the term 'universal' and focus instead on 'features'.
This terminological choice has several reasons. First, the focus is mostly on data and empirical findings, and less on translation theory as such. Whereas the features are motivated by and organized according to theoretical categories, we admit that certain features can belong to more than one theoretical category (see Section 6). Second, we show that certain features (such as mean sentence length) are highly dependent on the source language, and in general many of the features have a skewed distribution (again, see Section 6); we therefore cast doubt on the universality of 'universals'. Third, the term 'feature' (or sometimes 'attribute') is common in machine learning parlance, which is the main methodology used in this study.

This paper uses machine learning algorithms to distinguish between O and T. In particular, we apply text classification, a methodology that has been used for classifying texts according to topic, genre, etc. (Sebastiani, 2002), but also for authorship attribution and for classifying texts according to their authors' gender, age, provenance, and more (Koppel et al., 2009). This methodology has been successfully applied to studying O vs. T in various datasets and in different source and target languages (see Section 2). In most of these works the focus is on computational challenges: classifying with high accuracy, extending to more scenarios (for example, cross-domain classification), and minimizing the samples on which the computer is trained, so that the attribution can be done on smaller portions of text. Our study, in contrast, employs this methodology to examine a list of 32 features of 'translationese' suggested by translation scholars, with an eye to the question whether some of these features can be utilized to tell O from T.

The main contribution of this work is thus theoretical, rather than practical: we use computational means to investigate several translation studies hypotheses, corroborating some of them but refuting others. More generally, we advocate the use of automatic text classification as a methodology for investigating translation studies hypotheses in general, and translation universals in particular.

After reviewing related work in the next section, we detail our methodology in Section 3. We describe several translation studies hypotheses in Section 4, explaining how we model them computationally in terms of features used for classification. The results of the classifiers are reported in Section 5, and are analyzed and discussed in Section 6. We conclude with suggestions for future research.

2 Related Work

Numerous studies suggest that translated texts differ from original ones. Gellerstam (1986) compares texts written originally in Swedish with texts translated from English into Swedish. He notes that the differences between them do not indicate poor translation but rather a statistical phenomenon, which he terms translationese. The features of translationese were theoretically organized under the terms laws of translation or translation universals.

Toury (1980, 1995) distinguishes between two laws: the law of interference and the law of growing standardization. The former pertains to the fingerprints of the source text that are left in the translation product. The latter pertains to the effort to standardize the translation product according to existing norms in the target language and culture.
The combined effect of these laws creates a hybrid text that partly corresponds to the source text and partly to texts written originally in the target language, but is in fact neither of them (Frawley, 1984).

Baker (1993) suggests several candidates for translation universals, which are claimed to appear in any translated text, regardless of the source language: "features which typically occur in translated text rather than original utterances and which are not the result of interference from specific linguistic systems" (Baker, 1993, p. 243). Consequently, there is no need to study translations vis-à-vis their source. The corpus needed for such a study is termed a comparable corpus,[1] where translations from various source languages are studied in comparison to non-translated texts in the same language, holding genre, domain, time frame, etc., constant. Among the better known universals are simplification and explicitation, defined and discussed thoroughly by Blum-Kulka and Levenston (1978, 1983) and Blum-Kulka (1986), respectively.

[1] This term should be distinguished from comparable corpus in computational linguistics, where it refers to texts written in different languages that contain similar information.

Following Baker (1993), a quest for the holy grail of translation universals began, culminating in Mauranen and Kujamäki (2004). Chesterman (2004) distinguishes between S-universals and T-universals. S-universals are features that can be traced back to the source text, and include, among others, lengthening, interference, dialect normalization and reduction of repetitions. T-universals, on the other hand, are features that should be studied vis-à-vis non-translated texts in the target language, i.e., by using a comparable corpus. These include such features as simplification, untypical patterning and under-representation of target-language-specific items. This distinction classifies putative translation universals into two categories, each of which calls for a different kind of corpus: parallel for S-universals and comparable for T-universals. We cast all our features in a comparable corpus setting (T-universals). Our assumption is that if a feature is reflected in translations from several languages into English, it is very likely present in the source texts from which these translations were generated. Future study could very well verify this assumption.

In the last decade, corpora have been used extensively to study translationese. For example, Al-Shabab (1996) shows that translated texts exhibit lower lexical variety (type-to-token ratio) than originals; Laviosa (1998) shows that their mean sentence length is lower, as is their lexical density (the ratio of content to non-content words). Both these studies provide evidence for the simplification hypothesis. Corpus-based translation studies became a very prolific area of research (Laviosa, 2002).

Text classification methods have only recently been applied to the task of identifying translationese. Baroni and Bernardini (2006) use a two-million-token Italian newspaper corpus, in which 30% of the texts are translated from various source languages whose proportions are not reported. They train a support vector machine (SVM) classifier using unigrams, bigrams and trigrams of surface forms, lemmas and part-of-speech (POS) tags. They also experiment with a mixed mode, in which function words are left intact but content words are replaced by their POS tags. The best accuracy, 86.7%, is obtained using a combination of lemma and mixed unigrams, bigrams and POS trigrams.
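To make the 'mixed' representation concrete, here is a minimal sketch (not the authors' code); the POS-tagged input and the tiny function-word list are illustrative assumptions:

```python
# A minimal sketch of the 'mixed' representation: function words are kept
# as-is, while content words are replaced by their POS tags. The tagged
# input and the (tiny, illustrative) function-word list are assumptions.

FUNCTION_WORDS = {"the", "of", "and", "to", "in", "that", "it", "is", "was"}

def mixed_representation(tagged_sentence):
    """tagged_sentence: list of (token, POS) pairs, e.g. from any POS tagger."""
    return [tok.lower() if tok.lower() in FUNCTION_WORDS else pos
            for tok, pos in tagged_sentence]

tagged = [("The", "DT"), ("committee", "NN"), ("approved", "VBD"),
          ("the", "DT"), ("proposal", "NN")]
print(mixed_representation(tagged))
# ['the', 'NN', 'VBD', 'the', 'NN']
```

N-grams are then extracted over the resulting sequence, so that the classifier picks up on grammatical patterning rather than on topical vocabulary.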
Extracting theoretically interesting features, they show that Italian T includes more 'strong' pronouns, implying that translating from non-pro-drop languages into a pro-drop one, like Italian, leaves a mark on T. In other words, if a certain linguistic feature is mandatory in the source language and optional in the target language, more often than not it will be carried over to the target text. This is a clear case of positive interference, where features that do exist in O have a greater likelihood of being selected in T. In contrast, there are cases of negative interference, where features common in O are under-represented in T (Toury, 1995), and more generally, "[t]ranslations tend to under-represent target-language-specific, unique linguistic features and over-represent features that have straightforward translation equivalents which are frequently used in the source language" (Eskola, 2004, p. 96). Note that Baroni and Bernardini (2006) use lemmas as features; this can artificially inflate the accuracy of the classifier, since lemmas reflect topic and domain information rather than structural differences between the two classes of texts.

Inspired by Baroni and Bernardini (2006), Kurokawa et al. (2009) use a mixed text representation in which content words are replaced by their corresponding POS tags, while function words are retained. The corpus here is the Canadian Hansard, which consists of texts in English and Canadian French and translations in both directions, drawn from official records of the proceedings of the Canadian Parliament. Classification is performed at both the document and the sentence level. Interestingly, they demonstrate that learning the translation direction is relevant for statistical machine translation: they train systems to translate between French and English (and vice versa) using a French-translated-to-English parallel corpus, and then an English-translated-to-French one. They find that in translating into French it is better to use the latter parallel corpus, and when translating into English it is better to use the former. The contribution of knowledge of the translation direction to machine translation is further corroborated in a series of works (Lembersky et al., 2011, 2012a,b).

van Halteren (2008) shows that there are significant differences between texts translated from different source languages to the same target language in EUROPARL (Koehn, 2005). The features are 1–3-grams of tokens that appear in at least 10% of the texts of each class. There are 6×6 classes: an original and five translations, from and into each of the following: Danish, English, French, German, Italian and Spanish. Tokens appearing in fewer than 10% of the texts in each class are replaced with 〈x〉. Thus, for example, are right 〈x〉 is a marker of translations from German, while conditions of 〈x〉 is a marker of translations from French. The 10% threshold does not totally exclude content words, and therefore many markers reflect cultural differences, most notably the form of address ladies and gentlemen, which is highly frequent in the translations but rare in original English.

Ilisei et al. (2010) test the simplification hypothesis using machine learning algorithms. As noted earlier, certain features, such as average sentence length, do not provide a rich model, and cannot, by themselves, discriminate between O and T with high accuracy.
Therefore, in addition to the 'simplification features', the classifier is trained on POS unigrams; each simplification feature is then included and excluded in turn, and the success rates in the two scenarios are compared. A t-test is then conducted to check whether the difference is statistically significant.

Ilisei et al. (2010) define several 'simplification features', including average sentence length; sentence depth (the depth of the parse tree); ambiguity (the average number of senses per word); word length (the proportion of syllables per word); lexical richness (type/token ratio); and information load (the proportion of content words to tokens). Working on Spanish, the most informative feature for the task is lexical richness, followed by sentence length and the proportion of function words to content words. Both lexical richness and sentence length are among the simplification features and are therefore considered indicative of the simplification hypothesis. All in all, they succeed in differentiating between translated and non-translated texts with 97.62% accuracy and conclude that simplification features exist and heavily influence the results. These results pertain to Spanish translated from English; Ilisei and Inkpen (2011) extend the results to Romanian, using by and large the same methodology, albeit with somewhat more refined features. Furthermore, Ilisei (2013) also experiments with the explicitation hypothesis (in Spanish and Romanian), defining mainly features whose values are the proportions of some part-of-speech categories in the text.

Our work is similar in methodology, but much broader in scope. While Ilisei et al. (2010) and Ilisei and Inkpen (2011) use their simplification features to boost the accuracy of the classifier, our goal is different: we are interested not in the actual accuracy of any feature by itself, but in its contribution, if any, to the classification and translation process. We test some of the simplification features on English translated from ten source languages vs. original English. We also add more simplification features to those introduced by Ilisei et al. (2010) and Ilisei and Inkpen (2011) to test the simplification hypothesis. Most importantly, we add many more features that test a large array of other hypotheses.

Koppel and Ordan (2011) aim to identify the source language of texts translated to English from several languages, and reason about the similarities or differences of the source languages with respect to the accuracy obtained in this experiment. The data are taken from the EUROPARL corpus, and include original English texts as well as English texts translated from Finnish, French, German, Italian and Spanish. In order to abstract away from content, the only features used for classification are frequencies of function words. Koppel and Ordan (2011) can distinguish between original and translated texts with 96.7% accuracy; they can identify the original language with 92.7% accuracy; and they can train a classifier to distinguish between original English and English translated from language L1, and then use the same classifier to differentiate between original English and English translated from L2, with accuracies ranging from 56% to 88.3%. Interestingly, the success rate improves when L1 and L2 are typologically closer. Thus, training on one Romance language and testing on another yields excellent results, between 84.5% and 91.5% (there are six such pairs for French, Italian and Spanish).
The poor results (56%) of training on T from Finnish and testing on T from Italian or Spanish, for example, cast doubt on the concept of 'translation universals': they show that translationese is highly dependent on the pair of languages under study. Although Koppel and Ordan (2011) manage to train on all the T components vs. O and achieve a good result distinguishing between O and T (92.7%), it is exactly their main finding of pair-specific dependence that may tie this success to their corpus design: three fifths of their corpus belong to the same language family (Romance), another fifth, translations from German, is also related (Germanic), and only the last fifth, Finnish, is far removed (Finno-Ugric). In our experiments we use a wider range of source languages in an effort to neutralize this limitation: Romance (Italian, Portuguese, Spanish, and French); Germanic (German, Danish, Dutch, and Swedish); Hellenic (Greek); and Finno-Ugric (Finnish).

Popescu (2011), too, identifies translationese with machine-learning methods. He uses a corpus from the literary domain, mainly books from the nineteenth century. The corpus contains 214 books, half of which (108) were originally written in British and American English. The other half is translated English, 76 books from French and 30 from German. The book domains are varied, and translations are ensured to be of at least minimal quality. Popescu (2011) uses character sequences of length 5, ignoring sentence boundaries, for classification. He achieves 99.53% to 100% accuracy using different cross-validation methods. When training on British English and translations from French, and testing on American English and translations from German, the accuracy is 45.83%. He then uses the original French corpus to eliminate proper names, still at the character level, and achieves 77.08% accuracy. By mixing American and British texts, 76.88% accuracy is achieved.

This work has many advantages from the engineering point of view: extracting characters is a trivial text-processing task; the methodology is language-independent, and with some modifications it can be applied, for example, to Chinese script, where segmenting words is a non-trivial task; it does not impose any theoretical notions on the classification; and last, the model for O and T is very rich, since there are many possible character n-gram values (like the, of, -ion, -ly, etc.), so the model can fit the different textual scenarios on which it is tested.

Similarly to Popescu (2011), we use simple character n-grams, n = 1, 2, 3, among many other features. The higher n is, the more we can learn about translationese, as we show in Section 4. Still, it should be noted that character n-grams can capture lexical information, which, like lemmas, may reflect topic and domain information rather than structure.

In contrast to some previous works, we use the machine-learning methodology with great care. First, we compile a corpus with multiple source languages from diverse language families; we balance the proportion of each language within the corpus, and provide detailed information that can be used for replicating our results. Second, we totally abstract away from content so as to be unbiased by the topics of the corpora.
A classifier that uses as features the words in the text, for example, is likely to do a good job telling O from T simply because certain words are culturally related to the source language from which the texts are translated (e.g., the word Paris in texts translated from French). We provide data on such classifiers, but only as a "sanity check". Furthermore, while previous works used this methodology to investigate the simplification hypothesis, we use it to investigate a wide array of translation studies hypotheses, including simplification, explicitation, normalization and interference. Finally, and most importantly, we use a plethora of linguistically informed features to learn more about the nature of translationese.

3 Methodology

Our main goal in this work is to study the features of translated texts.[2] Our methodology is corpus-based, but instead of computing quantitative measures of O and T directly, we opt for a more sophisticated, yet more revealing, methodology: we train classifiers on various features and investigate the ability of different feature sets to accurately distinguish between O and T. We now detail the methodology and motivate it.

[2] A terminological note is in place: throughout this paper, O and T refer to texts written in the same language, specifically in English. The languages from which T was translated are therefore referred to as the source languages. When we say French, for example, we mean texts translated to English from French.

3.1 Text Classification with Machine Learning

In supervised machine learning, a classifier is trained on labeled examples whose classification is known a priori. The current task is a binary one, namely there are only two classes: O and T. Each instance in the two classes has to be represented: a set of numeric features is extracted from the instances, and a generic machine-learning algorithm is then trained to distinguish between feature vectors representative of one class and those representative of the other. For example, one set of features for natural texts could be the words (or tokens) in the text; the values of these features are the numbers of occurrences of each word in the instance. This set of features is extracted from the text instances in both classes, and each class is then modeled separately, so that there is a model of what O should look like and a model of what T should look like. Given enough data for training, and given that the features are indeed relevant, the trained classifier can then be given an 'unseen' text, namely a text that is not included in the training set. Such a text is again represented by a feature vector in the same manner, and the classifier can predict whether it belongs to the O class or to the T class. Such unseen texts are known as the "test set".

One important property of such classifiers is that they assign "weights" to the features used for classification, such that significant features are assigned higher weights. Due to potential dependencies among features, some features may be assigned weights that understate their individual importance, because they do not add any important information to the classifier. This means that low weights are not always very reliable; but if a feature is assigned a high weight, it is certainly a good indication of a significant difference between the two classes (the inverse does not necessarily hold).
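As a rough illustration of this setup (the paper itself uses Weka's SMO; this sketch substitutes scikit-learn's LinearSVC, and the chunks and labels are placeholders):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Placeholder chunks and their a-priori labels (O = original, T = translated).
chunks = ["text of an original English chunk ...",
          "text of a chunk translated into English ..."] * 50
labels = ["O", "T"] * 50

# Each chunk is represented as a vector of token counts; the linear SVM then
# assigns a weight to every feature, separating O-like from T-like vectors.
clf = make_pipeline(CountVectorizer(), LinearSVC())
scores = cross_val_score(clf, chunks, labels, cv=10)  # ten-fold cross-validation
print("mean accuracy: %.1f%%" % (100 * scores.mean()))
```

After fitting on the full training set, the learned per-feature weights (clf.named_steps["linearsvc"].coef_) play the role of the feature weights discussed above.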
3.2 Motivation

Applying machine learning algorithms to identify the class of a text (O or T) is thus a sound methodology for assessing the predictive power of a feature set. This is by no means a call to abandon traditional significance tests as a tool for learning about differences between texts, and in fact we use both in this study. But text classification is more robust, in the sense that it reflects not just average differences between classes, but also the different distributions of features across the classes, in a way that facilitates generalization: prediction of the class of new, unseen examples.

To illustrate this point, consider the case of punctuation marks. We compare the frequencies of several marks in O and T. Table 1 summarizes the data: for each punctuation mark, it lists the relative frequency (per 1,000 tokens) in O and T; the ratio between O and T ('Ratio'); whether the feature in question typifies O or T according to a log-likelihood (LL) test (p < 0.05); and the strength of the weight assigned to the feature by a particular classifier (Section 3.3), where T1 is the most prominent feature of translation (the one with the highest weight, as determined by the classifier), T2 the second most prominent, and so on; the same notation, from O1 to O7, is applied to O.

          Frequency
Mark      O       T      Ratio   LL   Weight
,         37.83   49.79  0.76    T    T1
(          0.42    0.72  0.58    T    T2
'          1.94    2.53  0.77    T    T3
)          0.40    0.72  0.56    T    T4
/          0.30    0.30  1.00    —    —
[          0.01    0.02  0.45    T    —
]          0.01    0.02  0.44    T    —
"          0.33    0.22  1.46    O    O7
!          0.22    0.17  1.25    O    O6
.         38.20   34.60  1.10    O    O5
:          1.20    1.17  1.03    —    O4
;          0.84    0.83  1.01    —    O3
?          1.57    1.11  1.41    O    O2
-          2.68    2.25  1.19    O    O1

Table 1: Summary data for punctuation marks across O and T

The most prominent marker of T according to the classifier is the comma, which is indeed about 1.3 times more frequent in T than in O. There are punctuation marks for which the ratio is much higher; for example, square brackets are about 2.2 times more frequent in T. But their frequency in the corpus is very low, and therefore this difference is not robust enough to make a prediction. Theoretically it may be interesting to note that in translations from Swedish into English, for example, there are four times more square brackets than in original English, but this plays no significant role in the classification task.

Conversely, there are cases where the critical value of LL is not significant by common standards, but the feature does play a role in classification. Such is the case of the colon. The O/T ratio is 1.03 and the critical value is 2.06, namely p < 0.15. This value still accounts for 85% of the cases, and although common statistical wisdom would rule the colon out as an insignificant feature, it does play a significant role in telling O from T using text classification techniques.

The parentheses almost always appear together. We notice rare cases of '(' appearing by itself, usually as a result of a tokenization problem, and some cases of the right parenthesis ')' appearing by itself, usually when enumerating items within a paragraph, a common notation in non-English languages and therefore three times more frequent in T than in O (30 vs. 10 cases, respectively). Although the raw frequencies of '(' and ')' are about the same, and although the ratios between their frequencies in O and in T are nearly identical, '(' appears to be a better marker of T according to the classifier. When a classifier encounters two highly dependent features, it may ignore one of them altogether. This does not mean the ignored feature is not important; it only means it does not add much new information.
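For reference, a minimal sketch of the statistic behind the LL column of Table 1, in its common corpus-linguistics form (a two-cell G² test); the counts below are invented for illustration:

```python
from math import log

def log_likelihood(a, b, n1, n2):
    """a, b: occurrences of a feature in two corpora of n1 and n2 tokens."""
    e1 = n1 * (a + b) / (n1 + n2)  # expected counts if both corpora shared
    e2 = n2 * (a + b) / (n1 + n2)  # a single underlying distribution
    ll = 0.0
    for observed, expected in ((a, e1), (b, e2)):
        if observed > 0:
            ll += observed * log(observed / expected)
    return 2 * ll

# Invented comma counts in two equally sized corpora; values above 3.84
# correspond to p < 0.05 (chi-square, one degree of freedom).
print(log_likelihood(37830, 49790, 1_000_000, 1_000_000))
```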
In summary, we use text classification algorithms to measure the robustness of each feature set. We are interested in differences between O and T, but we are also interested in finding out how revealing these features are, and how prominently they mark translated text, to the extent that they have predictive power. We use the information produced by the classifiers to provide a preliminary analysis. Then, for a finer analysis, we check some of the features manually and conduct significance tests. We believe that text classification techniques provide a good tool for studying the makeup of translated texts in a general way, on the one hand, and that occasional statistical significance tests enable the researcher to look at less frequent events, which are no doubt part of the story of translationese, on the other hand.

3.3 Experimental Setup

The main corpus we use is the proceedings of the European Parliament, EUROPARL (Koehn, 2005), with approximately 4 million tokens in English (O) and the same number of tokens translated from ten source languages (T): Danish, Dutch, Finnish, French, German, Greek, Italian, Portuguese, Spanish, and Swedish. Although the speeches are delivered orally (often read out from written texts), they can be considered translation rather than interpretation, since the proceedings are produced in the following way:[3]

1. The original speech is transcribed and minimally edited;
2. The text is sent to the speaker, who may edit it further;
3. The resulting text is translated into the other official languages.

[3] We are grateful to Emma Wagner, Vicki Brett, and Philip Cole (EU, Head of the Irish and English Translation Unit) for this information.

The corpus is first tokenized and then partitioned into chunks of approximately 2000 tokens (each ending on a sentence boundary). The purpose of this is to make sure that the length of an article does not interfere with the classification. We thus obtain 2000 chunks of original English, and 200 chunks of translations from each of the ten source languages. We then generate POS tags for the tokenized texts. We use the UIUC CCG sentence segmentation tool[4] to detect sentence boundaries, and OpenNLP,[5] with the default MaxEnt tagger and the Penn Treebank tagset, to tokenize the texts and induce POS tags.

[4] http://cogcomp.cs.illinois.edu/page/tools_view/2, accessed 24 August 2012.
[5] http://incubator.apache.org/opennlp/, accessed 24 August 2012.

We use the Weka toolkit (Hall et al., 2009) for classification; in all experiments, we use SVM (SMO) as the classification algorithm, with the default linear kernel. We employ ten-fold cross-validation[6] and report accuracy (the percentage of chunks correctly classified). Since the classification task is binary and the training corpus is balanced, the baseline is 50%.

[6] In ten-fold cross-validation, 90% of the annotated data are used for training, and the remaining 10% are used for testing. This process is repeated ten times, with different splits of the data, and the ten results are averaged. This guarantees the robustness of the evaluation, and minimizes the risk of over-fitting to the training data.
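A minimal sketch of the chunking step, assuming the corpus has already been sentence-split and tokenized (the helper below is ours, not part of the paper's pipeline):

```python
def chunk(sentences, size=2000):
    """Greedily group tokenized sentences (lists of tokens) into chunks of
    approximately `size` tokens, always ending on a sentence boundary."""
    chunks, current, n = [], [], 0
    for sent in sentences:
        current.append(sent)
        n += len(sent)
        if n >= size:          # close the chunk at the current sentence
            chunks.append(current)
            current, n = [], 0
    return chunks              # a short trailing remainder is dropped here
```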
4 Hypotheses

We test several translation studies hypotheses. In this section we list each hypothesis and describe how we model it in terms of the features used for classification. Feature design is a sophisticated process. In determining the feature set, the most important features must:

1. reflect frequent linguistic characteristics we would expect to be present in the two types of text;
2. be content-independent, indicating formal and stylistic differences between the texts that are not derived from differences in contents, domain, genre, etc.; and
3. be easy to interpret, yielding insights regarding the differences between original and translated texts.

We focus on features that reflect structural properties of the texts, some of which have been used in previous works. We now define the features we explore in this work; for each feature, we provide a precise definition that facilitates replication of our results, as well as a hypothesis on its ability to distinguish between O and T, based on the translation studies literature.

When generating many of the features, we normalize the feature's value, v, by the number of tokens in the chunk, n: v′ = v × 2000/n. This balances the values over chunks that have slightly more or fewer than 2000 tokens each (recall that chunks respect sentence boundaries). Henceforth, when describing a normalized feature, we report v′ rather than v. We also multiply the values of some features by some power of 10, rounding the result to the nearest integer, in order to have a set of values that is easier to compare. This does not affect the classification results.

4.1 Simplification

Simplification refers to the process of rendering complex linguistic features in the source text as simpler features in the target text. Strictly speaking, this phenomenon can be studied only vis-à-vis the source text, since 'simpler' is defined here in reference to the source text, where, for example, the practice of splitting sentences or refraining from complex subordination can be observed. And indeed, this is how simplification was first defined and studied in translation studies (Blum-Kulka and Levenston, 1983; Vanderauwera, 1985). Baker (1993) suggests that simplification can be studied by comparing translated texts with non-translated ones, as long as both texts share the same domain, genre, time frame, etc. In a series of corpus-based studies, Laviosa (1998, 2002) confirms this hypothesis. Ilisei et al. (2010) and Ilisei and Inkpen (2011) train a classifier enriched with simplification features and bring further evidence for this universal in Romanian and Spanish. We model the simplification hypothesis through the following features:[7]

[7] Of the features we define in this section, the first five were also implemented by Ilisei et al. (2010), who, in addition, added sentence depth (as the depth of the parse tree) and ambiguity (as the average number of senses per word).

Lexical variety  The assumption is that original texts are richer in terms of vocabulary than translated ones, as hypothesized by Baker (1993) and studied by Laviosa (1998). Lexical variety is known to be an unstable phenomenon, highly dependent on corpus size (Tweedie and Baayen, 1998). We therefore use three different type-token ratio (TTR) measures, following Grieve (2007), where V is the number of types and N is the number of tokens per chunk. All three versions consider punctuation marks as tokens.

1. V/N, magnified by an order of 6.
2. log(V)/log(N), magnified by an order of 6.
3. 100 × log(N)/(1 − V1/V), where V1 is the number of types occurring only once in the chunk.
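A sketch of the three lexical-variety measures under the definitions above; the order-of-magnitude scaling constants are omitted:

```python
from collections import Counter
from math import log

def lexical_variety(tokens):
    """tokens: one chunk, with punctuation marks included as tokens."""
    counts = Counter(tokens)
    N = len(tokens)                                   # number of tokens
    V = len(counts)                                   # number of types
    V1 = sum(1 for c in counts.values() if c == 1)    # hapax legomena
    ttr1 = V / N
    ttr2 = log(V) / log(N)
    ttr3 = 100 * log(N) / (1 - V1 / V)  # undefined if every type is a hapax
    return ttr1, ttr2, ttr3
```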
Mean word length (in characters)  We assume that translated texts use simpler words, in particular shorter ones. Punctuation marks are excluded from the tokens in this feature.

Syllable ratio  We assume that simpler words are used in translated texts, resulting in fewer syllables per word. We approximate this feature by counting the number of vowel sequences that are delimited by consonants or spaces in a word, normalized by the number of tokens in the chunk (see the sketch at the end of this subsection).

Lexical density  This measure is also used by Laviosa (1998): the frequency of tokens that are not nouns, adjectives, adverbs or verbs. It is computed by dividing the number of tokens tagged with POS tags that do not begin with J, N, R or V by the number of tokens in the chunk.

Mean sentence length  Splitting sentences is a common strategy in translation, and is considered a form of simplification; Baker (1993) renders it one of the universal features of simplification. Long and complicated sentences may be simplified and split into short, simple sentences. Hence we assume that translations contain shorter sentences than original texts. We consider punctuation marks as tokens in the computation of this feature.

Mean word rank  We assume that less frequent words are used more often in original texts than in translated ones. This is based on the observation of Blum-Kulka and Levenston (1983) that translated texts "make do with less words" and the application of this feature by Laviosa (1998). A theoretical explanation is provided by Halverson (2003): translators use more prototypical language, i.e., they "regress to the mean" (Shlesinger, 1989). To compute this feature, we use a list of the 6000 most frequent English words,[8] and consider the rank of each word (its position in the frequency-ordered list). The maximum rank is 5000 (since some words have equal ranks). We handle words that do not appear in the list in two different ways:

1. Words not in the list are given a unique highest rank of 6000.
2. Words not in the list are ignored altogether.

Values (in both versions) are rounded to the nearest integer. All punctuation marks are ignored.

[8] http://www.insightin.com/esl/, accessed 24 August 2012.

Most frequent words  The normalized frequencies of the N most frequent words in the corpus. We define three features, with three different thresholds: N = 5, 10, 50. Punctuation marks are excluded.
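The sketch below illustrates the syllable-ratio approximation described above; treating y as a vowel is our assumption:

```python
import re

VOWEL_RUN = re.compile(r"[aeiouy]+")

def syllable_ratio(tokens):
    """Approximate syllables as maximal vowel sequences within a word."""
    words = [t.lower() for t in tokens if t.isalpha()]  # punctuation excluded
    syllables = sum(len(VOWEL_RUN.findall(w)) for w in words)
    return syllables / len(tokens)   # normalized by the chunk's token count

print(syllable_ratio("The proposal was adopted unanimously .".split()))
```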
4.2 Explicitation

Explicitation is the tendency to spell out in the target text utterances that are more implicit in the source. Like simplification, this 'universal' can be directly observed in T only in reference to O; if there is an implicit causal relation between two phrases in the source text and a cohesive marker such as because is introduced in the target text, then it can be said with confidence that explicitation took place. But explicitation can also be studied by constructing a comparable corpus (Baker, 1993), and it is fair to assume that if there are many more cohesive markers in T than in O (in a well-balanced, large corpus like EUROPARL), this can serve as indirect evidence of explicitation.

Blum-Kulka (1986) develops and exemplifies this phenomenon in translations from Hebrew to English, and Øverås (1998) compiles a parallel bidirectional Norwegian-English and English-Norwegian corpus to provide further evidence for explicitation. Koppel and Ordan (2011) find that some of the prominent features in their list of function words are cohesive markers, such as therefore, thus and consequently.

The first three classifiers below are inspired by an example provided by Baker (1993, pp. 243–4), where the clause The example of Truman was always present in my mind is rendered into Arabic with a fairly long paragraph, which includes the following: In my mind there was always the example of the American President Harry Truman, who succeeded Franklin Roosevelt....

Explicit naming  We hypothesize that one form of explicitation in translation is the use of a proper noun as a spelling out of a personal pronoun. We calculate the ratio of personal pronouns to proper nouns, both singular and plural, magnified by an order of 3. See also 'Pronouns', Section 4.5.

Single naming  The frequency of proper nouns consisting of a single token, not having an additional proper noun as a neighbor. This can be seen in an exaggerated form in the example above, taken from Baker (1993, pp. 243–4). As a contemporary example, it is common to find in German news (as of 2012) the single proper name Westerwelle, but in translating German news into another language, the translator is likely to add the first name of this person (Guido) and probably his role, too (minister of foreign affairs).

Mean multiple naming  The average length (in tokens) of proper nouns (consecutive tokens tagged as proper nouns), magnified by an order of 3. The motivation for this feature is the same as above.

Cohesive markers  Translations are known to use certain cohesive markers excessively (Blum-Kulka, 1986; Øverås, 1998; Koppel and Ordan, 2011). We use a list of 40 such markers, based on Koppel and Ordan (2011); see Appendix A.1. Each marker in the list is a feature, whose value is the frequency of its occurrences in the chunk.

4.3 Normalization

Translators take great efforts to standardize texts (Toury, 1995), or, in the words of Baker (1993, p. 244), they have "a strong preference for conventional 'grammaticality'". We include in this the tendency to avoid repetitions (Ben-Ari, 1998), the tendency to use a more formal style, manifested in refraining from the use of contractions (Olohan, 2003), and the tendency to overuse fixed expressions even when the source text refrains, sometimes deliberately, from doing so (Toury, 1980; Kenny, 2001). We model normalization through the following features:

Repetitions  We count the number of content words (words tagged as nouns, verbs, adjectives or adverbs) that occur more than once in a chunk, and normalize by the number of tokens in the chunk. Inflections of the verbs be and have are excluded from the count, since these verbs are commonly used as auxiliaries. This feature's values are magnified by an order of 3.

Contractions  The ratio of contracted forms to their counterpart full form(s). If the full form has zero occurrences, its count is changed to 1. The list of contracted forms used for this feature is given in Appendix A.2.

Average PMI  We expect original texts to use more collocations, and in any case to use them differently from translated texts. This hypothesis is based on Toury (1980) and Kenny (2001), who show that translations overuse highly associated words. We therefore use as a feature the average PMI (Church and Hanks, 1990) of all bigrams in the chunk. Given a bigram w1w2, its PMI is:

PMI(w1, w2) = log ( freq(w1w2) / (freq(w1) × freq(w2)) )

Threshold PMI  We compute the PMI of each bigram in a chunk, and count the (normalized) number of bigrams with PMI above 0.
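A sketch of the two PMI features; we read freq as relative frequency (so that positive PMI values are possible) and omit the normalization constants:

```python
from collections import Counter
from math import log

def pmi_features(tokens):
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))

    def pmi(w1, w2):
        # PMI(w1, w2) = log( p(w1 w2) / (p(w1) * p(w2)) )
        p12 = bigrams[(w1, w2)] / (n - 1)
        return log(p12 / ((unigrams[w1] / n) * (unigrams[w2] / n)))

    scores = [pmi(w1, w2) for w1, w2 in bigrams]
    avg_pmi = sum(scores) / len(scores)          # 'Average PMI' feature
    threshold_pmi = sum(s > 0 for s in scores)   # 'Threshold PMI' feature
    return avg_pmi, threshold_pmi
```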
4.4 Interference

Toury (1979) takes up the concept of interlanguage (Selinker, 1972) to define interference as a universal. Selinker (1972) coins the term in order to describe the hybrid nature of the output of non-native speakers producing utterances in their second language. This output is heavily influenced by the language system of their first language. Translation is very similar in this sense: one language comes into close contact with another through transfer. In translation, however, translators habitually produce texts in their native tongue. Therefore, Toury (1979) advocates a descriptive study of interference not tainted, as in second language acquisition, by the view that the output reveals "ill performances" (production of grammatically incorrect structures). Interference operates on different levels, from transcribing source language words, through using loan translations, to exerting structural (for example, morphological and syntactic) influence. This may bring about, as noted by Gellerstam (1986), a different distribution of elements in translated texts, which he calls 'translationese', keeping it as a purely descriptive term (cf. Santos (1995)). We model interference as follows:

POS n-grams  We hypothesize that different grammatical structures used in the different source languages interfere with the translations, and that translations have a unique grammatical structure. Following Baroni and Bernardini (2006) and Kurokawa et al. (2009), we model this assumption by defining as features unigrams, bigrams and trigrams of POS tags. We add special tokens to indicate the beginning and end of each sentence, with the purpose of capturing specific POS bigrams and POS trigrams representing the beginnings and endings of sentences. The value of these features is the actual number of occurrences of each POS n-gram in the chunk.

Character n-grams  This feature is motivated by Popescu (2011). Other than yielding very good results, it is also language-type dependent. We hypothesize that grammatical structure manifests itself in this feature, and, as with POS n-grams, that the different grammatical structures used in the different source languages interfere with the translations. We also hypothesize that this feature captures morphological features of the language. These are actually three different features (each tested separately): unigrams, bigrams and trigrams of characters. They are computed similarly to the way POS n-grams are computed: as the frequencies of n-letter occurrences in a chunk, normalized by the chunk's size. Two special tokens are added to indicate the beginning and end of each word, in order to properly handle specific word prefixes and suffixes. We do not capture cross-token character n-grams, and we exclude punctuation marks (see the sketch below).

Prefixes and suffixes  Character n-grams are an approximation of morphological structure. In the case of English, the little morphology the language expresses is typically manifested in prefixes and suffixes. We therefore define a more refined variant of the character n-gram feature, focusing only on prefixes and suffixes. We use a list of such morphemes (see Appendix A.3) as features, simply counting the number of words in a chunk that begin or end with each of the prefixes/suffixes, respectively.

Contextual function words  This feature is a variant of POS n-grams, where the n-grams can be anchored by specific (function) words. Koppel and Ordan (2011) use only function words for classification; we use the same list of words[9] in this feature (see Appendix A.4). This feature is defined as the (normalized) frequency of trigrams of function words in the chunk. In addition, we count trigrams consisting of two function words (from the same list) and one other word; in such cases, we replace the other word by its POS tag. In sum, we compute the frequencies in the chunk of triplets 〈w1, w2, w3〉, where at least two of the elements are function words, and at most one is a POS tag.

[9] We thank Moshe Koppel for providing us with the list of function words used in Koppel and Ordan (2011).
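A sketch of the character n-gram extraction described above (referenced in the Character n-grams paragraph); the boundary symbols are arbitrary placeholders:

```python
from collections import Counter

def char_ngrams(tokens, n=3, start="^", end="$"):
    """Per-word character n-grams with word-boundary markers; no cross-token
    n-grams are captured, and punctuation marks are excluded."""
    counts = Counter()
    for tok in tokens:
        if not tok.isalpha():
            continue
        padded = start + tok.lower() + end
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    total = len(tokens)
    return {ng: c / total for ng, c in counts.items()}  # normalized by chunk size

feats = char_ngrams("The decision must be taken .".split())
# yields e.g. '^th', 'ion', 'on$' -- affix-like and function-word-like sequences
```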
Positional token frequency  Writers have a relatively limited vocabulary from which to choose words to open or close a sentence. We hypothesize that these choices are subject to interference. Munday (1998) and Gries and Wulff (2012) study this on a smaller scale, in translations from Spanish to English and in translations from English to German, respectively. The value of this feature is the normalized frequency of tokens appearing in the first, second, antepenultimate, penultimate and last positions in a sentence. We exclude sentences shorter than five tokens. Punctuation marks are considered tokens in this feature, which is why the last three positions of a sentence are considered, even though only the first two of them are interesting for our purposes.
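A sketch of the positional token frequency feature, assuming sentence-split, tokenized input:

```python
from collections import Counter

POSITIONS = ("first", "second", "antepenultimate", "penultimate", "last")

def positional_tokens(sentences):
    """sentences: lists of tokens, punctuation included; sentences shorter
    than five tokens are excluded."""
    counts = Counter()
    for sent in sentences:
        if len(sent) < 5:
            continue
        slots = (sent[0], sent[1], sent[-3], sent[-2], sent[-1])
        for name, tok in zip(POSITIONS, slots):
            counts[(name, tok.lower())] += 1
    return counts

print(positional_tokens([["But", "we", "must", "act", "now", "."]]))
# e.g. ('first', 'but'): 1, ('second', 'we'): 1, ('last', '.'): 1, ...
```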
Passives are defined as the verb be fol- lowed by the POS tag VBN (past participle). We calculate the ratio of passive verbs to all verbs, and magnified it by an order of 6. As a “sanity check”, we use two other features: token unigrams and token bigrams. Each unigram and bigram in the corpus constitutes a spe- cific feature, as in Baroni and Bernardini (2006). The feature’s value is its frequency in the chunk (again, normalized). For bigrams we add spe- cial markers of the edges of the sentences as described for POS-n-grams. We assume that different languages use different content words in varying frequencies in translated and non-translated texts. We expect these two 18 features to yield conclusive results (well above 90% accuracy), while token bigrams are expected to yield somewhat better results than token unigrams. These features are highly content-dependent, and are therefore of no em- pirical significance; they are only used as an upper bound for our other features, and to emphasize the validity of our methodology: we expect very high accuracy of classification with these features. 5 Results We implemented all the features discussed in the previous section as classi- fiers and used them for classifying held-out texts in a ten-fold cross-validation scenario, as described in Section 3. The results of the classifiers are reported in Table 2 in terms of the accuracy of classifying the test set. As a sanity check, we also report the accuracy of the content-dependent classifiers. As mentioned above, these are expected to produce highly- accurate classifiers, but teach us very little about the features of transla- tionese. As is evident from Table 3, this is indeed the case. For the sake of completeness, we note that it is possible to achieve very high classification accuracy even with a much narrower feature space. Some of the more complex feature sets have hundreds, or even thousands of fea- tures. In such cases, most features contribute very little to the task. To emphasize this, we take only the top 300 most frequent features. For exam- ple, rather than use all possible POS trigrams for classification, we only use the 300 most frequent sequences as features. Table 4 lists the classification results in this case. Evidently, the results are almost as high as when using all features. Our main objective, however, is not to produce the best-performing clas- sifiers. Rather, it is to understand what the classifiers can reveal about the nature of the differences between O and T. The following section thus anal- yses the results. 6 Analysis 6.1 Simplification Laviosa (1998, 2002) studied the simplification hypothesis extensively. Some features pertaining to simplification are also mentioned by Baker (1993). The four main features and partial findings pertain to mean sentence length, type-token ratio, lexical density and overrepresentation of highly frequent items. Lexical density fails altogether to predict the status of a text, being nearly on chance level (53% accuracy). Interestingly, while mean sentence length is much above chance level (65%), the results are contrary to com- mon assumptions in Translation Studies. 
As a "sanity check", we use two other features: token unigrams and token bigrams. Each unigram and bigram in the corpus constitutes a specific feature, as in Baroni and Bernardini (2006). The feature's value is its frequency in the chunk (again, normalized). For bigrams, we add special markers at the edges of sentences, as described for POS n-grams. We assume that different languages use different content words in varying frequencies in translated and non-translated texts. We expect these two features to yield conclusive results (well above 90% accuracy), with token bigrams expected to yield somewhat better results than token unigrams. These features are highly content-dependent, and are therefore of no empirical significance; they are only used as an upper bound for our other features, and to emphasize the validity of our methodology: we expect very high classification accuracy with these features.

5 Results

We implemented all the features discussed in the previous section as classifiers and used them for classifying held-out texts in a ten-fold cross-validation scenario, as described in Section 3. The results of the classifiers are reported in Table 2 in terms of the accuracy of classifying the test set.

Category        Feature                               Accuracy (%)
Simplification  TTR (1)                               72
                TTR (2)                               72
                TTR (3)                               76
                Mean word length                      66
                Syllable ratio                        61
                Lexical density                       53
                Mean sentence length                  65
                Mean word rank (1)                    69
                Mean word rank (2)                    77
                N most frequent words                 64
Explicitation   Explicit naming                       58
                Single naming                         56
                Mean multiple naming                  54
                Cohesive markers                      81
Normalization   Repetitions                           55
                Contractions                          50
                Average PMI                           52
                Threshold PMI                         66
Interference    POS unigrams                          90
                POS bigrams                           97
                POS trigrams                          98
                Character unigrams                    85
                Character bigrams                     98
                Character trigrams                    100
                Prefixes and suffixes                 80
                Contextual function words             100
                Positional token frequency            97
Miscellaneous   Function words                        96
                Pronouns                              77
                Punctuation (1)                       81
                Punctuation (2)                       85
                Punctuation (3)                       80
                Ratio of passive forms to all verbs   65

Table 2: Classification results

As a sanity check, we also report the accuracy of the content-dependent classifiers. As mentioned above, these are expected to produce highly accurate classifiers, but teach us very little about the features of translationese. As is evident from Table 3, this is indeed the case.

Category  Feature         Accuracy (%)
Sanity    Token unigrams  100
          Token bigrams   100

Table 3: Classification results, "sanity check" classifiers

For the sake of completeness, we note that it is possible to achieve very high classification accuracy even with a much narrower feature space. Some of the more complex feature sets have hundreds, or even thousands, of features. In such cases, most features contribute very little to the task. To emphasize this, we take only the 300 most frequent features. For example, rather than use all possible POS trigrams for classification, we only use the 300 most frequent sequences as features. Table 4 lists the classification results in this case. Evidently, the results are almost as high as when using all features.

Category      Feature                     Accuracy (%)
Interference  POS bigrams                 96
              POS trigrams                96
              Character bigrams           95
              Character trigrams          96
              Positional token frequency  93

Table 4: Classification results, top-300 features only

Our main objective, however, is not to produce the best-performing classifiers. Rather, it is to understand what the classifiers can reveal about the nature of the differences between O and T. The following section thus analyzes the results.

6 Analysis

6.1 Simplification

Laviosa (1998, 2002) studied the simplification hypothesis extensively. Some features pertaining to simplification are also mentioned by Baker (1993). The four main features and partial findings pertain to mean sentence length, type-token ratio, lexical density and over-representation of highly frequent items. Lexical density fails altogether to predict the status of a text, being nearly at chance level (53% accuracy). Interestingly, while mean sentence length is much above chance level (65%), the results are contrary to common assumptions in translation studies. According to the simplification hypothesis, T sentences are simpler (i.e., shorter), but as Figure 1 shows, the contrary is the case. We computed the mean sentence length in eleven 400,000-word texts, one of them original English and the others translated from ten source languages (this is the same corpus on which we run the classification). Only three translations (from Swedish, Finnish and Dutch) have a lower mean sentence length than original English, and on average O sentences are 2.5 tokens shorter. Whereas this result may pertain only to certain language pairs or certain genres, this alleged "translation universal" is definitely not universal. Moreover, it may actually be an instance of the interference hypothesis, whereby sentence length in the target language reflects sentence length in the source language. This, however, should be studied in a parallel corpus setting, and is beyond the scope of this work.

[Figure 1: Mean sentence length according to 'language']

The first two TTR measures perform relatively well (72% accuracy), and the indirect measures of lexical variety (mean word length and syllable ratio) are above chance level (66% and 61% accuracy, respectively). Following Holmes (1992), we experiment with more sophisticated measures of lexical variety. The best performing one is the one that takes into account hapax legomena, words that occur only once in a text. This variant, TTR (3), yields 76% accuracy. One important trait of hapax legomena is that, as opposed to the type-token ratio, they are not so dependent on corpus size (Baayen, 2001).

Another interesting classifier with relatively good results, in fact the best performing of all simplification features (77% accuracy), is mean word rank. This feature is closely related to the feature studied by Laviosa (1998) (n top words), with two differences: (1) our list of frequent items is much larger, and (2) we generate the frequency list not from the corpora under study but from an external, much larger reference corpus. In contrast, the design that follows Laviosa (1998) more strictly (N most frequent words) has lower predictive power (64%).

6.2 Explicitation

The three classifiers we design to check this hypothesis (explicit naming, single naming and mean multiple naming) do not exceed 58% classification accuracy.
On the other hand, following Blum-Kulka (1986) and Koppel and Ordan (2011), we build a classifier that uses 40 cohesive markers and achieve 81% accuracy in telling O from T; such cohesive markers are far more frequent in T than in O. For example, moreover, thus and besides are used 17.5, 4, and 3.8 times more frequently (respectively) in T than in O.

6.3 Normalization

None of these features performs very well. Repetitions and contractions are rare in EUROPARL, and in this sense the corpus may not be suited for studying these phenomena. The repetition-based classifier yields 55% accuracy, and the contraction-based classifier performs at chance level (50%).

One of the classifiers that check PMI, designed to pick up on highly associated words and thereby attest to many fixed expressions, performs considerably better, at 66% accuracy. This measure counts the number of associated bigrams whose PMI is above 0. As Figure 2 shows, original English has far more highly associated bigrams than translations. If we take the word form stand, for example, then at the top of the list we normally get highly associated words, some of which are fixed expressions, such as stand idly, stand firm, stand trial, etc. There are considerably more highly associated pairs like these in O; conversely, this also means that there are more poorly associated pairs in T, such as the bigram stand unamended. This finding contradicts the case studies elaborated on in Toury (1980) and Kenny (2001). It should be noted, however, that they discuss cases operating under particular scenarios, whereas we check this phenomenon more globally, completely unbiased towards any scenario whatsoever. The finding is robust, but it is oblivious to the particulars of subtle cases.

[Figure 2: Number of bigrams whose PMI is above threshold according to 'language']

6.4 Interference

The interference-based classifiers are the best performing ones. Most of them perform above 90%. In this sense we can say that interference is the most robust phenomenon typifying translations. However, we note that some of the features are somewhat coarse and may reflect corpus-dependent characteristics. For example, among the character n-grams we notice that some of the top features in O include sequences that are 'illegal' in English and obviously stem from foreign names such as Haarder, Maat or Gazprom. To offset this problem we use only the top 300 features in several of the classifiers, with a minor effect on the results.

The n-gram findings are consistent with Popescu (2011) in that we also find that they pick up on both affixes and function words: for example, typical trigrams in O are -ion and all, whereas typical of T are -ble and the. As opposed to Popescu (2011), we reduced the feature space without the need to look at the original texts; Popescu (2011) looked for sequences of n-grams in the target language that also appear in the source texts, thereby eliminating mostly proper nouns. However, this method can only be applied to language pairs that use similar alphabets and orthography conventions. Using only the 300 most frequent features results in a drop in accuracy of up to 4%. Furthermore, restricting the space of features to only 38 prefixes and 34 suffixes, a much narrower domain than the set of all character bigrams, for example, still yields 80% accuracy. Evidently, original and translated texts differ greatly in the way they use these affixes.
6.4 Interference

The interference-based classifiers are the best performing ones; most of them achieve over 90% accuracy. In this sense we can say that interference is the most robust phenomenon typifying translations. However, we note that some of the features are somewhat coarse and may reflect corpus-dependent characteristics. For example, among the top character n-gram features in O we notice sequences that are ‘illegal’ in English and obviously stem from foreign names, such as letter bigrams occurring in Haarder, Maat or Gazprom. To offset this problem we use only the top 300 features in several of the classifiers, with a minor effect on the results (Table 4).

The n-gram findings are consistent with Popescu (2011) in that we also find that they capture both affixes and function words: for example, trigrams typical of O are -ion and all, whereas trigrams typical of T are -ble and the. As opposed to Popescu (2011), we reduce the feature space without the need to look at the original texts; Popescu (2011) looked for n-gram sequences in the target language that also appear in the source texts, thereby eliminating mostly proper nouns. That method, however, can only be applied to language pairs that share a similar alphabet and similar orthographic conventions. Using only the 300 most frequent features results in a drop in accuracy of up to 4%.

Furthermore, restricting the feature space to only 38 prefixes and 34 suffixes, a much narrower domain than, say, the set of all character bigrams, still yields 80% accuracy. Evidently, original and translated texts differ greatly in the way they use these affixes. Different English affixes were imported from different languages, and this is reflected in our findings. The prefix mono-, a marker of translated language, is much more frequent in translations from Greek than from any other language. The suffix -ible, originating in Latin, is much more common in translations from the Romance languages, which are “clustered together” around this feature, than in original English. Last, the suffix -ize, originating in Greek, is highly frequent in original English, less frequent in translations from the Romance languages, and even less frequent in translations from the other languages. Further study, backed by a sound historical linguistics perspective, may determine how such parameters affect translational choices across languages, taking into account their distance from each other.

Part-of-speech trigrams yield an extremely cheap and efficient classifier: the feature space is not too big, and the results are robust. A good discriminatory feature typifying T is, for example, the part-of-speech trigram modal + verb base form + verb past participle, as in the highly frequent phrases must be taken, should be given and can be used. As can be seen in Figure 3, it typifies most prominently translations from phylogenetically distant languages, such as Finnish, but original English is at the bottom of the list, regardless of T’s source language.

Figure 3: The average number of the POS trigram modal + verb base form + past participle in O and ten Ts

Moving now to positional token frequency, we report on three variants of this classifier with different degrees of accuracy (in parentheses): taking into account all the tokens that appear in these positions (97%), using only the 300 most frequent tokens (93%), and using only the 50 most frequent tokens (82%). The last is the most abstract, picking up almost exclusively on function words; a sketch of the feature extraction appears at the end of this subsection. The second most prominent feature typifying O is sentences opening with the word ‘But’: in fact, there are 2.25 times more such sentences in O. In English there is a long prescriptive tradition forbidding writers to open a sentence with ‘But’, and although this ‘decree’ has been questioned and even mocked (Garner, 2003), whether such openings constitute good style remains a common question on Internet forums dealing with English usage. Translators are known to be conservative in their lexical choices (Kenny, 2001), and the underuse of ‘But’-opening sentences is further evidence of this tendency. As opposed to other features in positional token frequency, this is not a case of interference but rather of a tendency to (over-)abide by translational norms, i.e., standardization (Toury, 1995).
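A minimal sketch of the positional token frequency features follows. The details are assumptions of the sketch, not a description of our feature set: which positions are tracked (here, the first two and the last two tokens of each sentence), the lowercasing, and the minimum sentence length are all illustrative choices.

```python
from collections import Counter

def positional_token_features(sentences):
    """Count how often each token occupies a salient sentence position.
    `sentences` is a list of token lists; returns a Counter over
    (position, token) pairs such as ('first', 'but')."""
    feats = Counter()
    for sent in sentences:
        if len(sent) < 4:  # skip very short sentences (arbitrary cutoff)
            continue
        feats[('first', sent[0].lower())] += 1
        feats[('second', sent[1].lower())] += 1
        feats[('penultimate', sent[-2].lower())] += 1
        feats[('last', sent[-1].lower())] += 1
    return feats
```

Restricting the classifier to the 300 or 50 most frequent tokens then amounts to discarding every (position, token) pair whose token falls outside the chosen list.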
6.5 Miscellaneous

In this category we include several classifiers whose features do not fall under a clear-cut theoretical category discussed by translation theorists. The function words classifier replicates Koppel and Ordan (2011), and despite its good performance (96% accuracy) it is not very meaningful theoretically. One of its subsets, a list of 25 pronouns, reveals an interesting phenomenon: subject pronouns like I, he and she are prominent indicators of O, whereas virtually all reflexive pronouns (such as itself, himself, yourself) typify T. The first phenomenon is probably due to the fact that pronouns as a whole are much more frequent in T (about 1.25 times more frequent); a fine-tuned analysis of the distribution of pronouns in each sub-corpus, normalized by the total number of pronouns, is beyond the scope of this study. The high representation of reflexive pronouns is probably due to interference from the source languages. The accuracy of classifying by pronouns alone is 77%.

The accuracy of a classifier based on the ratio of passive verbs is well above chance level, yet it is not a very good predictor by itself (65%). T has about 1.15 times more passive verbs, and the ratio is highly dependent on the source language: original English is near the bottom of the list, right after translations from the Romance languages and Greek; from the top down, the order is Danish, Swedish, Finnish, Dutch and German.

We experiment with three different classifiers based on punctuation marks as a feature set. The mark ‘.’ (in effect a proxy for sentence length) is a strong feature of O, and the mark ‘,’ is a strong marker of T; in fact, using only these two features we achieve 79% accuracy (see the sketch at the end of this subsection). Parentheses are very typical of T, indicating explicitation. A typical example is the following: The Vlaams Blok (‘Flemish Block’) opposes the patentability of computer-implemented inventions... Last, we find that exclamation marks are on average much more common in original English (1.25 times more frequent). Translations from three source languages, however, have more exclamation marks than original English: German, Italian and French. Translations from German use many more exclamation marks, 2.76 (!) times more than original English.
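To illustrate how little machinery the two-feature punctuation result requires, here is a sketch of a classifier that uses only the relative frequencies of ‘.’ and ‘,’. The chunking into tokenized text portions and the choice of a linear SVM (via scikit-learn) are assumptions of the sketch, not a description of our actual experimental pipeline.

```python
from sklearn.svm import LinearSVC

def punctuation_features(tokens):
    """Two features per chunk: the relative frequencies of '.' and ','."""
    n = len(tokens)
    return [tokens.count('.') / n, tokens.count(',') / n]

def train_punct_classifier(chunks_o, chunks_t):
    """chunks_o and chunks_t are hypothetical lists of tokenized chunks
    of original and translated text, respectively; labels: 0 = O, 1 = T."""
    X = [punctuation_features(c) for c in chunks_o + chunks_t]
    y = [0] * len(chunks_o) + [1] * len(chunks_t)
    return LinearSVC().fit(X, y)
```

That a classifier this impoverished reaches 79% accuracy underlines how pervasive the punctuation differences between O and T are.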
7 Conclusion

Machines can easily identify translated texts. Identification has been performed successfully on very different data sets and genres, including parliamentary proceedings, literature, and news and magazine writing, and it works well across many source and target languages (with the exception of literary Polish; see Rybicki (2012)). But text classification is a double-edged sword. Consider how easily the classifier teases apart O from T based on letter bigrams: 98% accuracy, with a slight drop to 95% when only the 300 most frequent letter bigrams are used. This is considerably better than the performance achieved by professional humans (Tirkkonen-Condit, 2002; Baroni and Bernardini, 2006). We then find that the letter sequence di is among the best features discriminating between O and T, as it is about 16% more frequent in T than in O; but this does not teach us much about T, and we cannot interpret the finding. Furthermore, text classification is highly dependent on genres and domains, and cross-corpus classification (‘scalability’) is notoriously hard (Argamon, 2011).

We addressed the first problem by designing linguistically informed features. For example, extending letter n-grams from bigrams to trigrams already revealed some insights about morphological traits of T. The second problem calls for future research. Recall that we were unable to replicate the results reported by Olohan (2003), simply because contractions are a rarity in EUROPARL, and therefore ‘normalizing’ them is an even rarer event. That translationese is dependent on genre is suggested and studied in various works (Steiner, 1998; Reiss, 1989; Teich, 2003).

This point is closely related to one of our main conclusions: the universal claims for translation should be reconsidered. Not only are they dependent on genre and register, they also vary greatly across different pairs of languages. The best performing features in our study are those that attest to the ‘fingerprints’ of the source on the target, what has been called “source language shining through” (Teich, 2003). This is not to say that there are no features which operate “irrespective of source language” (like cohesive markers in EUROPARL), but the best evidence for translationese, the evidence with the best predictive power, is related to interference, and interference is by its nature a pair-specific phenomenon. Note that mean sentence length, which we included under ‘simplification’, has been purported to be a trait of translationese regardless of source language, but it turned out to be very much dependent on the source language; in particular, contrary to previous assumptions, sentences turned out to be shorter in O. Such dependencies can only be exposed in a well-balanced comparable corpus, drawing on as many source languages as possible and, when possible, typologically distant ones.

Another caveat relates to comparable corpora in general. Olohan and Baker (2000) report that there are fewer omissions of the optional reporting that in T, as in I know (that) he’d never get here in time. This is, according to the authors, a case of explicitation, i.e., replacing a zero-connective with a that-connective to avoid ambiguity. Pym (2008) raises the following question: how do we know that this finding is not due to interference? What if the T component of this corpus consists of source languages in which the that-connective is obligatory, so that it is simply “shining through” into the target text? We cast the same doubt on some of our findings. The under-representation of sentences opening with But in T is probably due to normalization, but without reference to the source texts we can never be sure. With this kind of corpus, a comparable corpus, we can settle the ontological question (T is different from O across many dimensions suggested by translation scholars), but we are left with an epistemological unease: given our tools and methodology, we do not know for sure what part of the findings is merely the result of source influence on the target text (interference), and what part is inherent to the work of translators (simplification, normalization, and explicitation). We leave this question for future studies.

References

Omar S. Al-Shabab. Interpretation and the Language of Translation: Creativity and Conventions in Translation. Janus, Edinburgh, 1996.

Shlomo Argamon. Book review of Scalability Issues in Authorship Attribution, by Kim Luyckx. Literary and Linguistic Computing, 27(1):95–97, 2011.

R. Harald Baayen. Word Frequency Distributions. Text, Speech, and Language Technology. Kluwer Academic, 2001.

Mona Baker. Corpus linguistics and translation studies: Implications and applications. In Mona Baker, Gill Francis, and Elena Tognini-Bonelli, editors, Text and Technology: In Honour of John Sinclair, pages 233–252. John Benjamins, Amsterdam, 1993.

Marco Baroni and Silvia Bernardini. A new approach to the study of translationese: Machine-learning the difference between original and translated text. Literary and Linguistic Computing, 21(3):259–274, September 2006. URL http://llc.oxfordjournals.org/cgi/content/short/21/3/259?rss=1.

Nitza Ben-Ari. The ambivalent case of repetitions in literary translation. Avoiding repetitions: A “universal” of translation? Meta, 43(1):68–78, 1998.

Shoshana Blum-Kulka. Shifts of cohesion and coherence in translation.
In Juliane House and Shoshana Blum-Kulka, editors, Interlingual and Intercultural Communication: Discourse and Cognition in Translation and Second Language Acquisition Studies, volume 35, pages 17–35. Gunter Narr Verlag, 1986.

Shoshana Blum-Kulka and Eddie A. Levenston. Universals of lexical simplification. Language Learning, 28(2):399–416, December 1978.

Shoshana Blum-Kulka and Eddie A. Levenston. Universals of lexical simplification. In Claus Faerch and Gabriele Kasper, editors, Strategies in Interlanguage Communication, pages 119–139. Longman, 1983.

Andrew Chesterman. Beyond the particular. In A. Mauranen and P. Kujamäki, editors, Translation Universals: Do They Exist?, pages 33–50. John Benjamins, 2004.

Kenneth Ward Church and Patrick Hanks. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22–29, 1990.

Sari Eskola. Untypical frequencies in translated language. In A. Mauranen and P. Kujamäki, editors, Translation Universals: Do They Exist?, pages 83–99. John Benjamins, 2004.

William Frawley. Prolegomenon to a theory of translation. In William Frawley, editor, Translation: Literary, Linguistic and Philosophical Perspectives, pages 159–175. University of Delaware Press, Newark, 1984.

Bryan A. Garner. On beginning sentences with But. Michigan Bar Journal, 48:43–44, 2003.

Martin Gellerstam. Translationese in Swedish novels translated from English. In Lars Wollin and Hans Lindquist, editors, Translation Studies in Scandinavia, pages 88–95. CWK Gleerup, Lund, 1986.

Stefan Th. Gries and Stefanie Wulff. Regression analysis in translation studies. In Michael P. Oakes and Meng Ji, editors, Quantitative Methods in Corpus-Based Translation Studies, Studies in Corpus Linguistics 51, pages 35–52. John Benjamins, Philadelphia, 2012.

Jack Grieve. Quantitative authorship attribution: An evaluation of techniques. Literary and Linguistic Computing, 22(3):251–270, 2007.

Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The WEKA data mining software: An update. SIGKDD Explorations, 11(1):10–18, 2009. URL http://dx.doi.org/10.1145/1656274.1656278.

Sandra Halverson. The cognitive basis of translation universals. Target, 15(2):197–241, 2003.

David I. Holmes. A stylometric analysis of Mormon scripture and related texts. Journal of the Royal Statistical Society, 155(1):91–120, 1992.

Iustina Ilisei. A Machine Learning Approach to the Identification of Translational Language: An Inquiry into Translationese Learning Models. PhD thesis, University of Wolverhampton, Wolverhampton, UK, February 2013. URL http://clg.wlv.ac.uk/papers/ilisei-thesis.pdf.

Iustina Ilisei and Diana Inkpen. Translationese traits in Romanian newspapers: A machine learning approach. International Journal of Computational Linguistics and Applications, 2(1–2), 2011.

Iustina Ilisei, Diana Inkpen, Gloria Corpas Pastor, and Ruslan Mitkov. Identification of translationese: A machine learning approach. In Alexander F. Gelbukh, editor, Proceedings of CICLing-2010: 11th International Conference on Computational Linguistics and Intelligent Text Processing, volume 6008 of Lecture Notes in Computer Science, pages 503–511. Springer, 2010. URL http://dx.doi.org/10.1007/978-3-642-12116-6.
Dorothy Kenny. Lexis and Creativity in Translation: A Corpus-Based Study. St. Jerome, 2001.

Philipp Koehn. Europarl: A parallel corpus for statistical machine translation. In Proceedings of the Tenth Machine Translation Summit, pages 79–86. AAMT, 2005. URL http://mt-archive.info/MTS-2005-Koehn.pdf.

Moshe Koppel and Noam Ordan. Translationese and its dialects. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 1318–1326, Portland, Oregon, USA, June 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P11-1132.

Moshe Koppel, Jonathan Schler, and Shlomo Argamon. Computational methods in authorship attribution. Journal of the American Society for Information Science and Technology, 60(1):9–26, January 2009. URL http://dx.doi.org/10.1002/asi.v60:1.

David Kurokawa, Cyril Goutte, and Pierre Isabelle. Automatic detection of translated text and its impact on machine translation. In Proceedings of MT-Summit XII, pages 81–88, 2009.

Sara Laviosa. Core patterns of lexical use in a comparable corpus of English narrative prose. Meta, 43(4):557–570, December 1998.

Sara Laviosa. Corpus-Based Translation Studies: Theory, Findings, Applications. Approaches to Translation Studies. Rodopi, 2002.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language models for machine translation: Original vs. translated texts. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 363–374, Edinburgh, Scotland, UK, July 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/D11-1034.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Adapting translation models to translationese improves SMT. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 255–265, Avignon, France, April 2012a. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/E12-1026.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language models for machine translation: Original vs. translated texts. Computational Linguistics, 38(4):799–825, December 2012b. URL http://dx.doi.org/10.1162/COLI_a_00111.

A. Mauranen and P. Kujamäki, editors. Translation Universals: Do They Exist? John Benjamins, 2004.

Anna Mauranen. Universal tendencies in translation. In Gunilla Anderman and Margaret Rogers, editors, Incorporating Corpora: The Linguist and the Translator, pages 32–48. Multilingual Matters, Clevedon, Buffalo and Toronto, 2008.

Jeremy Munday. A computer-assisted approach to the analysis of translation shifts. Meta, 43(4):542–556, 1998.

Maeve Olohan. How frequent are the contractions? A study of contracted forms in the Translational English Corpus. Target, 15(1):59–89, 2003.

Maeve Olohan and Mona Baker. Reporting that in translated English: Evidence for subconscious processes of explicitation? Across Languages and Cultures, 1(2):141–158, 2000.

Linn Øverås.
In search of the third code: An investigation of norms in literary translation. Meta, 43(4):557–570, 1998.

Marius Popescu. Studying translationese at the character level. In Galia Angelova, Kalina Bontcheva, Ruslan Mitkov, and Nicolas Nicolov, editors, Proceedings of RANLP-2011, pages 634–639, 2011.

Anthony Pym. On Toury’s laws of how translators translate. In Anthony Pym, Miriam Shlesinger, and Daniel Simeoni, editors, Beyond Descriptive Translation Studies: Investigations in Homage to Gideon Toury, Benjamins Translation Library: EST Subseries, pages 311–328. John Benjamins, 2008.

Katharina Reiss. Text types, translation types and translation assessment. In Andrew Chesterman, editor, Readings in Translation Theory, pages 105–115. Oy Finn Lectura Ab, Helsinki, 1989.

Jan Rybicki. The great mystery of the (almost) invisible translator: Stylometry in translation. In Michael P. Oakes and Meng Ji, editors, Quantitative Methods in Corpus-Based Translation Studies, Studies in Corpus Linguistics 51, pages 231–248. John Benjamins, Philadelphia, 2012.

Diana Santos. On grammatical translationese. In Kimmo Koskenniemi, editor, Short Papers Presented at the Tenth Scandinavian Conference on Computational Linguistics, pages 29–30, 1995.

Fabrizio Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1–47, March 2002. URL http://doi.acm.org/10.1145/505282.505283.

Larry Selinker. Interlanguage. International Review of Applied Linguistics in Language Teaching, 10(1–4):209–232, 1972.

Miriam Shlesinger. Simultaneous interpretation as a factor in effecting shifts in the position of texts on the oral-literate continuum. Master’s thesis, Tel Aviv University, Faculty of the Humanities, Department of Poetics and Comparative Literature, 1989.

Erich Steiner. A register-based translation evaluation: An advertisement as a case in point. Target, 10(2):291–318, 1998.

Elke Teich. Cross-Linguistic Variation in System and Text: A Methodology for the Investigation of Translations and Comparable Texts. Mouton de Gruyter, 2003.

Sonja Tirkkonen-Condit. Translationese: A myth or an empirical fact? Target, 14(2):207–220, 2002.

Gideon Toury. Interlanguage and its manifestations in translation. Meta, 24(2):223–231, 1979.

Gideon Toury. In Search of a Theory of Translation. The Porter Institute for Poetics and Semiotics, Tel Aviv University, Tel Aviv, 1980.

Gideon Toury. Descriptive Translation Studies and Beyond. John Benjamins, Amsterdam / Philadelphia, 1995.

Fiona J. Tweedie and R. Harald Baayen. How variable may a constant be? Measures of lexical richness in perspective. Computers and the Humanities, 32(5):323–352, 1998.

Hans van Halteren. Source language markers in EUROPARL translations. In Donia Scott and Hans Uszkoreit, editors, COLING 2008: 22nd International Conference on Computational Linguistics, Proceedings of the Conference, 18–22 August 2008, Manchester, UK, pages 937–944, 2008. URL http://www.aclweb.org/anthology/C08-1118.

Ria Vanderauwera. Dutch Novels Translated into English: The Transformation of a ‘Minority’ Literature. Rodopi, Amsterdam, 1985.
A Lists of words

A.1 Cohesive markers

We use the following list of cohesive markers: as for, as to, because, besides, but, consequently, despite, even if, even though, except, further, furthermore, hence, however, in addition, in conclusion, in other words, in spite, instead, is to say, maybe, moreover, nevertheless, on account of, on the contrary, on the other hand, otherwise, referring to, since, so, the former, the latter, therefore, this implies, though, thus, with reference to, with regard to, yet, concerning.

A.2 Contracted forms

We use the following list of contracted forms and their expansions (entries are separated by semicolons; a contraction with several possible expansions lists all of them): i’m: i am; it’s: it is, it has; there’s: there is, there has; he’s: he is, he has; she’s: she is, she has; what’s: what is, what has; let’s: let us; who’s: who is, who has; where’s: where is, where has; how’s: how is, how has; here’s: here is; i’ll: i will; you’ll: you will; she’ll: she will; he’ll: he will; we’ll: we will; they’ll: they will; i’d: i would, i had; you’d: you would, you had; she’d: she would, she had; he’d: he would, he had; we’d: we would, we had; they’d: they would, they had; i’ve: i have; you’ve: you have; we’ve: we have; they’ve: they have; who’ve: who have; would’ve: would have; should’ve: should have; must’ve: must have; you’re: you are; they’re: they are; we’re: we are; who’re: who are; couldn’t: could not; can’t: cannot; wouldn’t: would not; don’t: do not; doesn’t: does not; didn’t: did not.

A.3 Prefixes and suffixes

We use the following list of prefixes: a, an, ante, anti, auto, circum, co, com, con, contra, de, dis, en, ex, extra, hetero, homo, hyper, il, im, in, inter, intra, ir, macro, micro, mono, non, omni, post, pre, pro, sub, syn, trans, tri, un, uni; and the following list of suffixes: able, acy, al, ance, ate, dom, en, ence, er, esque, ful, fy, ible, ic, ical, ify, ious, ise, ish, ism, ist, ity, ive, ize, less, ment, ness, or, ous, ship, sion, tion, ty, y.
A.4 Function words

We use the following list of function words: a, about, above, according, accordingly, actual, actually, after, afterward, afterwards, again, against, ago, ah, ain’t, all, almost, along, already, also, although, always, am, among, an, and, another, any, anybody, anyone, anything, anywhere, are, aren’t, around, art, as, aside, at, away, ay, back, be, bear, because, been, before, being, below, beneath, beside, besides, better, between, beyond, bid, billion, billionth, both, bring, but, by, came, can, can’t, cannot, canst, certain, certainly, come, comes, consequently, could, couldn’t, couldst, dear, definite, definitely, despite, did, didn’t, do, does, doesn’t, doing, don’t, done, dost, doth, doubtful, doubtfully, down, due, during, e.g., each, earlier, early, eight, eighteen, eighteenth, eighth, eighthly, eightieth, eighty, either, eleven, eleventh, else, enough, enter, ere, erst, even, eventually, ever, every, everybody, everyone, everything, everywhere, example, except, exeunt, exit, fact, fair, far, farewell, few, fewer, fifteen, fifteenth, fifth, fifthly, fiftieth, fifty, finally, first, firstly, five, for, forever, forgo, forth, fortieth, forty, four, fourteen, fourteenth, fourth, fourthly, from, furthermore, generally, get, gets, getting, give, go, good, got, had, has, hasn’t, hast, hath, have, haven’t, having, he, he’d, he’ll, he’s, hence, her, here, hers, herself, him, himself, his, hither, ho, how, how’s, however, hundred, hundredth, i, i’d, i’m, i’ve, if, in, indeed, instance, instead, into, is, isn’t, it, it’d, it’ll, it’s, its, itself, last, lastly, later, less, let, let’s, like, likely, many, matter, may, maybe, me, might, million, millionth, mine, more, moreover, most, much, must, mustn’t, my, myself, nay, near, nearby, nearly, neither, never, nevertheless, next, nine, nineteen, nineteenth, ninetieth, ninety, ninth, ninthly, no, nobody, none, noone, nor, not, nothing, now, nowhere, o, occasionally, of, off, oft, often, oh, on, once, one, only, or, order, other, others, ought, our, ours, ourselves, out, over, perhaps, possible, possibly, presumable, presumably, previous, previously, prior, probably, quite, rare, rarely, rather, result, resulting, round, said, same, say, second, secondly, seldom, seven, seventeen, seventeenth, seventh, seventhly, seventieth, seventy, shall, shalt, shan’t, she, she’d, she’ll, she’s, should, shouldn’t, shouldst, similarly, since, six, sixteen, sixteenth, sixth, sixthly, sixtieth, sixty, so, soever, some, somebody, someone, something, sometimes, somewhere, soon, still, subsequently, such, sure, tell, ten, tenth, tenthly, than, that, that’s, the, thee, their, theirs, them, themselves, then, thence, there, there’s, therefore, these, they, they’d, they’ll, they’re, they’ve, thine, third, thirdly, thirteen, thirteenth, thirtieth, thirty, this, thither, those, thou, though, thousand, thousandth, three, thrice, through, thus, thy, till, tis, to, today, tomorrow, too, towards, twas, twelfth, twelve, twentieth, twenty, twice, twill, two, under, undergo, underneath, undoubtedly, unless, unlikely, until, unto, unusual, unusually, up, upon, us, very, was, wasn’t, wast, way, we, we’d, we’ll, we’re, we’ve, welcome, well, were, weren’t, what, what’s, whatever, when, whence, where, where’s, whereas, wherefore, whether, which, while, whiles, whither, who, who’s, whoever, whom, whose, why, wil, will, wilst, wilt, with, within, without, won’t, would, wouldn’t, wouldst, ye, yes, yesterday, yet, you, you’d, you’ll, you’re,
you’ve, your, yours, yourself, yourselves.

A.5 Pronouns

We use the following list of pronouns: he, her, hers, herself, him, himself, i, it, itself, me, mine, myself, one, oneself, ours, ourselves, she, theirs, them, themselves, they, us, we, you, yourself.