Heeringa, W., Swarte, F., Schüppert, A., & Gooskens, C. (2018). Measuring Syntactical Variation in Germanic Texts. Digital Scholarship in the Humanities, 33(2), 279-296. https://doi.org/10.1093/llc/fqx029

Document version: publisher's PDF (version of record).

Measuring syntactical variation in Germanic texts
Wilbert Heeringa
Fryske Akademy, The Netherlands

Femke Swarte
Faculty of Arts, Applied Linguistics, University of Groningen, The Netherlands

Anja Schüppert
Faculty of Arts, European Languages and Cultures, University of Groningen, The Netherlands

Charlotte Gooskens
Faculty of Arts, Applied Linguistics, University of Groningen, The Netherlands and School of Behavioural, Cognitive and Social Sciences, University of New England, Australia

Abstract

We present two new measures of syntactic distance between languages. First, we present the ‘movement measure’, which measures the average number of words that have moved in sentences of one language compared to the corresponding sentences in another language. Secondly, we introduce the ‘indel measure’, which measures the average number of words inserted or deleted in sentences of one language compared to the corresponding sentences in another language. The two measures were compared to the ‘trigram measure’, which was introduced by Nerbonne & Wiersma (2006, A Measure of Aggregate Syntactic Distance. In Nerbonne, J. and Hinrichs, E. (eds.) Linguistic Distances Workshop at the joint conference of International Committee on Computational Linguistics and the Association for Computational Linguistics, Sydney, July, 2006, pp. 82–90.). We correlated the results of the three measures and found a low correlation between the results of the movement and indel measures, indicating that the two measures represent different kinds of linguistic variation. We found a high correlation between the results of the movement measure and the trigram measure. The results of all three measures suggest that English is syntactically a Scandinavian language. Because of our unique database design we were able to detect asymmetric relationships between the languages.
All three measures suggest that asymmetric syntactical distances could be part of the explanation why native speakers of Dutch understand German texts more easily than native speakers of German understand Dutch texts (Swarte 2016).

Correspondence: Wilbert Heeringa, Fryske Akademy, P.O. Box 54, 8900 AB Ljouwert, The Netherlands. E-mail: wheeringa@fryske-akademy.nl

Digital Scholarship in the Humanities, Vol. 33, No. 2, 2018. © The Author 2017. Published by Oxford University Press on behalf of EADH. All rights reserved. For Permissions, please email: journals.permissions@oup.com. doi:10.1093/llc/fqx029. Advance Access published on 19 June 2017.

1 Introduction

Textometry is a discipline in which knowledge is derived from corpora without predefined information models. MacMurray and Leenhardt (2011) describe textometry as an approach in which ‘a text possesses its own internal structure that would be difficult to analyze by manual means alone. By applying statistical and probabilistic calculations directly to the textual units of comparable texts in a corpus it becomes possible to analyze patterns and trends that would otherwise be obscured by the quantity of the textual units’ (p. 606). They add: ‘Textometry consists of seeing the document through a prism of numbers and figures, producing information on the frequency counts of words, otherwise known as occurrences, whereas forms are a single graphical unit corresponding to several instances in the text’ (p. 606, see also Lebart and Salem (1994) and Tufféry (2007)).

In this article we utilize written texts for revealing language variation.
Language variation at different linguistic levels becomes apparent to a large extent when written texts in different languages are compared, especially lexical, orthographic, and syntactical differences.

Lexical differences are differences in vocabulary or lexicon. In the following example English and German do not have any cognates, apart from the articles:

English: The boy teased the dog.
German: Der Junge neckte den Hund.

On the other hand, pairs of sentences can be found in which every English word has a German cognate. Cognates are words which have a common etymological origin and normally a similar shape. Example:

English: The man saw a house.
German: Der Mann sah ein Haus.

In this example the differences are orthographic. Orthographic differences may reflect historical developments of the pronunciation, for example, English saw versus German sah. However, orthographic differences do not always reflect linguistic differences; they may also be the result of differences in spelling conventions, for example, English house versus German Haus.

Syntax is ‘the study of the principles and processes by which sentences are constructed in particular languages’ (Chomsky 1957, p. 11). Between Germanic languages like English and German relatively large syntactical differences can be found, for example:

English: Then she said that she will come tomorrow
German: Dann sagte sie dass sie morgen kommen wird

Several studies have proposed ways to measure lexical, orthographic, and syntactic distances using parallel corpora. For example, Van Bezooijen and Gooskens (2005) measured lexical distances between Dutch, Afrikaans, and Frisian on the basis of written texts. They also measured orthographic distances using the same material. Zulu, Botha, and Barnard (2008) measured orthographic distance between eleven South African languages.
A procedure for measuring syntactical distances between language varieties was introduced by Nerbonne and Wiersma (2006), who provided a foundation for measuring syntactic differences between corpora. Their method uses part-of-speech (POS) trigrams as an approximation to syntactic structure. The frequencies of the trigrams of two corpora are compared for statistically significant differences.

In this article we focus on the measurement of syntactical distances between a small set of five Germanic languages. We will apply the method of Nerbonne and Wiersma (2006) and refer to this as the ‘trigram measure’ throughout this article. In addition, we introduce two new methods for measuring syntactical variation. Using the first method, we measure the average number of word positions that a word in a sentence in language A has moved compared to the corresponding sentence in language B. We call this the ‘movement measure’. The second method measures the average number of words found in a sentence in language A that are missing in the corresponding sentence in language B, and the number of words in a sentence in language B that are missing in the sentence in language A. In other words, it measures the number of words that are inserted or deleted in a sentence in language A compared to the corresponding sentence in language B. We call this the ‘indel measure’.

We will compare the results of the two methods to the results of the trigram method to answer the following questions:

(1) Do the movement measure and the indel measure yield different results?
(2) Does the trigram method resemble one of the other methods in particular?

We focus on the Germanic language group, more specifically on Danish, Dutch, English, German, and Swedish. In Section 2 we give a brief overview of related research concerning syntactical measurements. Section 3 describes the data source and the way in which syntactical distances are measured. The results of the distance measurements are presented in Section 4. In Section 5 the research questions are addressed. Finally, general conclusions will be drawn in Section 6. In that section we will also discuss how the methods can be validated.

2 Previous Research

To measure syntactical distances between languages we explored the literature to find a suitable distance measure. Two kinds of approaches dominate. One is based on categorical syntactical features. This approach is typically used when material from dialect atlases is used. The other is based on counting and comparing frequencies of trigrams of POS tags. This approach works well when large corpora with tagged words are available. The two approaches are discussed in Sections 2.1 and 2.2, respectively. In Section 2.3 we will motivate our choice.

2.1 Categorical syntactic variables

Spruit (2008) measured syntactic distances between 267 local Dutch varieties, using data from two volumes of the Syntactic Atlas of the Dutch Dialects. The atlas volumes contain a large number of maps showing the geographic distribution of syntactic phenomena. The maps in the first volume represent 510 binary syntactic features, and those in the second volume represent 672 syntactic features, in all 1,182 features. An example concerns the complementizer of the comparative if-clause in the Dutch sentence Het lijkt wel alsof er iemand in de tuin staat, ‘It looks as if there is someone in the garden’. Four examples of binary features are complementizer=of, complementizer=of dat, complementizer=dat, complementizer=alsof.
Each feature is either true or false, and therefore binary. The distance between two dialects was equal to the number of features on which the two dialects differ, and therefore the distance varies between 0 and 1,182.

Szmrecsanyi (2008) investigated variation in British English dialects by using the Freiburg English Dialect Corpus (FRED), a naturalistic speech corpus sampling interview material from 162 different locations in thirty-eight different counties all over the British Isles, excluding Ireland. FRED consists of 370 texts, which total about 2.5 million words of text.¹ The corpus was analysed to obtain text frequencies of sixty-two morphosyntactic features, yielding a structured database that provided a sixty-two-dimensional frequency vector per locality. The feature frequencies were subsequently normalized to frequency per 10,000 words (because textual coverage in FRED varies across localities) and log-transformed to deemphasize large frequency differentials and to alleviate the effect of frequency outliers. The resulting 38×62 table (on the county level, that is, thirty-eight counties characterized by sixty-two feature frequencies each for the full data set) was converted into a 38×38 distance matrix using Euclidean distance (the square root of the sum of all squared frequency differentials) as an interval measure. This distance matrix was subjected to cluster analysis to find dialect groups.

Grieve (2016) analysed a word corpus representing the letter-to-the-editor register as written between 2000 and 2013 in 240 cities from across the USA. The letters were downloaded from the online archives of one or more newspapers published in each of the 240 cities. A total of 135 grammatical alternation variables were measured and mapped across the 240 city sub-corpora. An alternation variable is ‘a set of distinct linguistic forms that have the same referential meaning’ (p. 36).
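As a rough illustration of the frequency-based procedure that Szmrecsanyi (2008) applied to FRED, the following minimal Python sketch normalises raw feature counts to frequencies per 10,000 words, log-transforms them, and computes pairwise Euclidean distances. The county names, counts, and function names are invented for illustration, and the use of log10(1 + f) to keep zero frequencies defined is an assumption made here, not a detail taken from the original study.

```python
import math


def normalise_per_10k(counts, n_words):
    """Convert raw feature counts to frequencies per 10,000 words."""
    return [c * 10_000 / n_words for c in counts]


def log_transform(freqs):
    """Dampen large frequency differences.

    Assumption: log10(1 + f) is used so that a frequency of zero
    stays defined; the original study may treat zeros differently.
    """
    return [math.log10(1.0 + f) for f in freqs]


def euclidean(u, v):
    """Square root of the sum of squared differences between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def distance_matrix(vectors):
    """Pairwise Euclidean distances, keyed by (name_a, name_b)."""
    names = list(vectors)
    return {(a, b): euclidean(vectors[a], vectors[b])
            for a in names for b in names}


# Hypothetical toy data: (raw counts for three features, corpus size in words).
raw = {
    "county_A": ([12, 0, 40], 20_000),
    "county_B": ([30, 5, 10], 50_000),
    "county_C": ([12, 0, 40], 20_000),
}

vectors = {name: log_transform(normalise_per_10k(counts, n_words))
           for name, (counts, n_words) in raw.items()}
dist = distance_matrix(vectors)

for (a, b), d in sorted(dist.items()):
    if a < b:
        print(f"{a} - {b}: {d:.3f}")
```

In this toy setting the dictionary `dist` plays the role of the 38×38 distance matrix; it could then be handed to any clustering routine to look for groups.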
For each alternation variable, the percentage of each variant is calculated as the number of tokens of that variant in the corpus divided by the total number of tokens of all variants of that variable in the corpus, multiplied by 100 (see also Grieve 2009).

Spruit (2008), Szmrecsanyi (2008), and Grieve (2009, 2016) used syntactic alternation variables (or linguistic variables) which were found in a dialect atlas (Spruit, 2008) or derived from written text corpora (Szmrecsanyi, 2008; Grieve, 2009, 2016).

2.2 Frequencies of POS categories

Hirst and Feiguina (2007) presented a method for authorship discrimination that is based on the frequency of bigrams of syntactic labels that arise from partial parsing of the text. With this method the authors obtained a high accuracy in discriminating the work of Anne and Charlotte Brontë (Brontë, 1847, 1848, 1853), both when the method was used alone and when it was combined with other classification features. High accuracies were achieved even on text fragments little more than 200 words long.

While Hirst and Feiguina (2007) focussed on determining the authorship of texts, Nerbonne and Wiersma (2006), Lauttamus et al. (2007), Wiersma et al. (2010), and Nerbonne et al.
(2010) measured the impact of L1 on L2 syntax in second language acquisition on the basis of corpora of English of Finnish Australians. They presented an application of a technique from language technology to tag a corpus automatically and to detect syntactic differences between two varieties of Finnish Australian English, one spoken by the first generation and the other by the second generation. The technique compares frequencies of trigrams of POS categories as indicators of syntactic distance between the varieties and then examines potential effects of language contact. The frequency vectors were compared and analysed by using a permutation test, which resulted in both a general measure of difference and a list of the n-grams that are most responsible for the difference. The findings showed syntactic ‘contamination’ from Finnish in the English of the adult first-generation speakers of Finnish ethnic origin. The results showed that some interlanguage features in the first generation can be attributed to Finnish substratum transfer.

Sanders (2007) extended both the method and its application. He extended the method by using the leaf-ancestor paths of Sampson (2000) instead of trigrams, which captures internal syntactic structure: every leaf in a parse tree records the path back to the root. The corpus used for testing is the International Corpus of English, Great Britain (ICE-GB; Nelson et al., 2002), which contains syntactically annotated speech from Great Britain. The speakers were grouped into geographical regions based on place of birth. Sanders showed that dialectal variation in eleven British regions from ICE-GB is detectable by the algorithm, using both leaf-ancestor paths and trigrams.

2.3 Our approach

Spruit (2008), Szmrecsanyi (2008), and Grieve (2016) quantified syntactical language variation by using alternation variables.
When using corpora, as in the case of Szmrecsanyi (2008) and Grieve (2016), a set of features needs to be chosen. The choice of features may partly depend on the data, but will easily be subjective.

Given the fact that we use corpora (see Section 3.1) we prefer not to choose a set of features, but simply measure syntactical distances in terms of differences of sentence structure, regardless of what features are represented by those differences. We will introduce two new measures. The first one measures the average number of word positions that a word in a sentence in language A has moved compared to the corresponding sentence in language B. The second measures the number of words that are inserted or deleted in a sentence in language A compared to the corresponding sentence in language B.

The methodology of Hirst and Feiguina (2007) and Nerbonne and Wiersma (2006) likewise does not require the choice of a feature set and excels in simplicity. We will also consider their methodology and compare the results of our measures with their trigram measure.

Nerbonne and Wiersma’s (2006) method is sensitive only to sequential order, not to insertions, deletions, or phrase structure. Sanders (2007) greatly increased the sensitivity of the measure he developed with respect to phrase structure. It might be argued that the movement and indel measures are potentially sensitive to higher levels of syntactic structure, perhaps even transformational structure (Chomsky, 1957).

3 Data Source and Measurement Techniques

The data used in this article were collected in the context of a research programme which aims at finding linguistic and non-linguistic determinants of mutual intelligibility within the Germanic, Romance, and Slavic language families. Within this research programme, web-based intelligibility tests were performed and linguistic distances between the languages were measured (Golubović, 2016; Swarte, 2016).

3.1 Data source

The basis of our analyses is a set of four English texts at the B1/B2 level according to the Common European Framework of Reference for Languages.² The texts were used as preparation exercises for the Preliminary English Test, a diploma offered by University of Cambridge ESOL Examinations in England. The texts we use were obtained from englishaula.com.

The texts were translated into each of the other four languages (Dutch, Danish, German, Swedish) by native speakers of those languages. The translations were subsequently corrected by two other native speakers. All of the native speakers had completed a university education or were still studying at a university. They were aged between 20 and 40 years. Like the English originals, the four translated texts consist of sixty-six sentences (approximately 800 words) in total.

Given five languages, we will analyse (5×4)/2 = 10 language pairs. Our initial thought was to calculate the syntactic distance of a language pair by directly comparing the texts of the two languages to each other. However, by doing this, we would introduce a lot of noise in our data. We will illustrate this with an example. In the text Child Athletes we find the following sentences in English and German:

English: Some doctors agree that young muscles may be damaged by training before they are properly developed.
German: Einige Ärzte behaupten, dass junge Muskeln die noch nicht ausreichend entwickelt sind während des Trainings beschädigt werden können.
The two sentences have about the same meaning, but syntactically they differ strongly. However, given the English sentence, it is possible to get a more literal German translation:

English: Some doctors agree that young muscles may be damaged by training before they are properly developed.
German: Einige Ärzte denken, dass junge Muskeln durch Training geschädigt werden können bevor sie ausreichend entwickelt sind.

On the other hand, given the German sentence, a more literal translation in English is possible:

German: Einige Ärzte behaupten, dass junge Muskeln die noch nicht ausreichend entwickelt sind während des Trainings beschädigt werden können.
English: Some doctors claim that young muscles which are still not properly developed can be damaged during the training.

Since we want to model intelligibility (see Section 6), we should not calculate syntactic distances which are unnecessarily large. A reader who reads a sentence in a closely related language will likely try to match the sentence with the most literal translation in his/her own language.

Therefore, to obtain the data set that our analysis will be based on, each of the available texts in Danish, Dutch, English, German, and Swedish is ‘translated back’ into each of the other languages as literally as possible. Importantly, the texts are translated as literally as possible with respect to syntax, but not necessarily with respect to lexicon, as this is not within the scope of this article. However, the translations are made so that the sentences are still grammatically correct. These translations are language specific, i.e. the Danish text is translated in a different way from Swedish than from German, for example. Note that we modified only the targets