Collaborative authorship in the twelfth century: A stylometric study of Hildegard of Bingen and Guibert of Gembloux

Mike Kestemont
Institute for the Study of Literature in the Low Countries & CLiPS Computational Linguistics Group, University of Antwerp, Belgium

Sara Moens and Jeroen Deploige
History Department, Ghent University, Belgium

Abstract

Hildegard of Bingen (1098–1179) is one of the most influential female authors of the Middle Ages. From the point of view of computational stylistics, the oeuvre attributed to Hildegard is fascinating. Hildegard dictated her texts to secretaries in Latin, a language whose grammatical subtleties she had not fully mastered. She therefore allowed her scribes to correct her spelling and grammar. Hildegard’s last collaborator in particular, Guibert of Gembloux, seems to have considerably reworked her works during his secretaryship. Whereas her other scribes were only allowed to make superficial linguistic changes, Hildegard would have permitted Guibert to render her language stylistically more elegant. In this article, we focus on two shorter texts: the Visio ad Guibertum missa and the Visio de Sancto Martino, both of which Hildegard allegedly authored during Guibert’s secretaryship. We analyze a corpus containing the letter collections of Hildegard, Guibert, and Bernard of Clairvaux using a number of common stylometric techniques. We discuss our results in the light of the Synergy Hypothesis, which suggests that texts resulting from collaboration can display a style markedly different from that of the collaborating authors.
Finally, we demonstrate that Guibert must have reworked the disputed visionary texts allegedly authored by Hildegard to such an extent that style-oriented computational procedures attribute the texts to Guibert.

Correspondence: Mike Kestemont, Institute for the Study of Literature in the Low Countries & CLiPS Computational Linguistics Group, University of Antwerp, Belgium. Email: mike.kestemont@gmail.com

Digital Scholarship in the Humanities, Vol. 30, No. 2, 2015. © The Author 2013. Published by Oxford University Press on behalf of ALLC. All rights reserved. For Permissions, please email: journals.permissions@oup.com. doi:10.1093/llc/fqt063. Advance Access published on 26 October 2013.

1 Introduction

Since the end of the 1960s, literary studies have seen a clear shift of focus from the analysis of authorial intentions to reader-oriented criticism. The repudiation of the modern idea of autonomous authorship has perhaps gone furthest in medieval studies, with the rise, since the late 1980s, of Material Philology (Nichols, 1997). Medievalists have become increasingly aware of the importance of manuscript culture in their understanding of texts: medieval texts should not primarily be studied, it is argued, as abstract entities resulting from authorial ambitions, but rather as tangible objects, materialized in specific manuscript contexts. Every material manifestation of a text is unique, because the acts of copying and compiling nearly always resulted in textual changes, from minor changes in orthography to complete rewritings. Our modern post-romantic conception of authorship therefore seems profoundly anachronistic with respect to the Middle Ages (Cerquiglini, 1999, pp. 8–10).
Yet, even if medieval culture did not share our present-day view on the significance of original authorship, the Middle Ages knew many respected and authoritative individuals who were recognized by their contemporaries and by later readers as producers of very specific literary works. Some kind of correlation even existed between the degree to which texts were susceptible to alterations and the religious and intellectual authority of their authors (Deploige, 2005). This did not mean, however, that such recognized authors necessarily acted individually in conceiving their treatises or narratives; quite the contrary. Writing in the Middle Ages meant entering into a dialogue with a long line of predecessors, whether through citation, paraphrase, or allusion. In the actual process of literary composition too, medieval authors seldom worked alone. A ‘new’ text could be the result of drafts on wax tablets copied by professional scribes, of processes of dictation and subsequent correction, etc. A twelfth-century authority like the Cistercian abbot Bernard of Clairvaux (1090–1153), one of the most prolific and influential medieval authors, is known to have been surrounded by a team of secretaries. For his sermons and letters in particular, he was assisted by a number of collaborators to whom he could dictate his messages or who were asked to produce texts in accordance with his own views. Some of his collaborators were even trained in imitating his writing style, thus facilitating Bernard’s work of final editing or correcting (Leclercq, 1962; 1987, pp. 147–52).

In the case of the remarkably few medieval female authors known to us, the role of secretaries and collaborators is even more intricate.
Women writers like the German nuns Hildegard of Bingen (1098–1179) or Elizabeth of Schönau (1129–1165) were considered unlearned and incapable of independently writing down their visionary experiences, even if these were ‘divinely inspired’. These women therefore had to be assisted by male collaborators, who often also served as their spiritual directors. The precise nature and implications of such cross-gender collaborations remain a topic of scholarly debate.

The immediate incentive for the present article is the preparation of a new critical edition of two lesser-known texts attributed to Hildegard of Bingen, supposedly dating from the last years of her life: the Visio de Sancto Martino, which is conceived as a letter addressed to the worshippers of Saint Martin, and the Visio ad Guibertum missa, containing spiritual advice to an anonymous monk-priest, generally identified as her last secretary, Guibert of Gembloux (1124–1213) (Deploige and Moens, forthcoming). Among the few scholars who have paid attention to these texts, there is still no consensus as to the extent to which they should be attributed to Hildegard herself or to her collaborator Guibert. As neither traditional stylistic analysis nor contextual historical research has so far been able to resolve the problem, we will approach this issue through a stylometric analysis.

We will focus on three research questions. First, does stylometry allow for an authorial differentiation between the writings of twelfth-century Latin authors belonging to highly similar intellectual circles? To answer this question, we will investigate the letter collections or epistolaria of Hildegard of Bingen, her secretary Guibert of Gembloux, and their famous contemporary, Bernard of Clairvaux.
Our aim is to assess to what extent we can distinguish stylistic profiles for these authors, despite the marked variance within medieval manuscript culture (Cerquiglini, 1999) and the fact that these authors, like many of their contemporaries, were often assisted by secretaries. Next, we wish to analyze in more detail to what extent we can discern the influence of her last secretary, Guibert of Gembloux, in Hildegard’s epistolary work. Did her style undergo detectable changes under the editorial assistance of Guibert, or does the same homogeneous authorial voice appear throughout her epistolary work? Finally, we will address the complex question of the author to whom, at least on stylistic grounds, the visiones at stake in this article should be attributed.

In answering these research questions, we do not aim to develop novel stylometric techniques. The originality of this research lies in our application of a number of well-established techniques to assess their feasibility when dealing with medieval Latin texts, a textual tradition that has so far only rarely received attention in computational authorship attribution. Before addressing these issues, we will first briefly introduce the state of research with respect to the so-called Mitarbeiter problem in Hildegard scholarship.

2 ‘Uneducated in the Art of Grammar’

The Benedictine nun Hildegard of Bingen was one of the most productive female authors of the Middle Ages (Newman, 1998). After a youth as an anchoress at the abbey of the monks of Disibodenberg in the Rhineland near Mainz, she ended up as abbess of her own convent at the nearby Rupertsberg. Her extensive oeuvre includes genres as diverse as visionary books, letters, hagiographical texts, treatises on monastic life, musical compositions, and some works on physics and medical healing.
Considered a true prophetess, receiving revelations and admonitions from God, she enjoyed a special status, even in the highest ecclesiastical milieux. Her extensive circle of correspondents, comprising, among others, popes and the emperor, testifies to her prophetic reputation. She was therefore able to gain an authority unprecedented for a woman, enabling her even to criticize the male clergy of her time. Among the first to approve her visionary gift was Bernard of Clairvaux, in a letter answering her request for support. Her female authorship was built on her recognition as a mouthpiece of God, which caused her to present herself throughout her life as a poor and uneducated woman, uneducated precisely because she was a woman (Deploige, 1998). In one of her vitae, her biographer Guibert of Gembloux specifies that she was ‘uneducated as to her schooling in the art of grammar’ (Derolez, 1988–1989, p. 377).

Her status, both as a woman and as an allegedly unlearned prophetess who may not have had the same type of schooling as young monks, meant that throughout her life Hildegard had to be assisted by secretaries (Ferrante, 1998). Her first and principal secretary was Volmar of Disibodenberg, who remained her close associate until his death in 1173. He assisted in the redaction of the majority of her works. As we can learn from a famous miniature in the now lost manuscript (henceforth MS) Wiesbaden, Landesbibliothek, 1, dating from the end of her life, Hildegard dictated and wrote drafts on wax tablets, which were subsequently copied on parchment and linguistically ‘polished’ in accordance with the rules of grammar (Fig. 1). In addition, several Rupertsberg nuns must have aided their abbess as scribes during this period, given the number of known manuscripts produced in Rupertsberg under Hildegard’s supervision (Embach, 2003, pp. 76, 128–9, 160, 184–5; Herwegen, 1904, pp. 302–8).
After Volmar’s death, Hildegard had to complete her last major visionary cycle, the Liber divinorum operum (‘Book of the Divine Works’), with more occasional assistance from a number of different collaborators from her immediate circle of spiritual acquaintances (Herwegen, 1904, pp. 308–15). At the very end of her life, however, she was unexpectedly joined by Guibert, a monk from the abbey of Gembloux in Brabant (in present-day Belgium). Himself a fervent letter writer and hagiographer (Moens, 2010), he served as her secretary from 1177 until her death in 1179 (Delehaye, 1889; Ferrante, 1998, pp. 122–30).

Even the authenticity of her female authorship was not always uncontested until the seminal work by Schrader and Führkötter (1956), and a great deal of scholarly effort has been devoted to the precise role of Hildegard’s secretaries. Just as for other female writers working under the direction of father confessors (Coakley, 2006), the question has been raised to what extent Hildegard’s secretaries interfered with the final versions of her works, possibly generating male, clerical interpretations rather than original female viewpoints. Following the pioneering research by Herwegen (1904), most specialists now agree that the role of Hildegard’s collaborators was restricted to minor grammatical and stylistic alterations. Generally speaking, they had to copy her words verbatim unless they received Hildegard’s explicit authorization for corrections (Schrader and Führkötter, 1956, pp. 182–3; Ferrante, 1998, p. 104). It is generally assumed, however, that Hildegard must have granted somewhat greater liberty to Guibert, who only entered her life when she was already at the very advanced age of 79.
Although their collaboration was short, Guibert nevertheless had a significant impact on Hildegard’s literary legacy. For example, he may have assisted her as one of the correctors in the final redaction of the Liber divinorum operum, of which MS Ghent, University Library, 241 (Fig. 9), can be considered the autograph copy most true to Hildegard’s own words (Derolez and Dronke, 1996, pp. XCI–XCIV). He also aided her in both the writing and the compilation of portions of her epistolarium. On the basis of manuscript evidence, content, and dating, we can distinguish in Hildegard’s letter collection a part that must have been written and compiled with the help of Volmar and another group of letters that must have been written or transmitted under Guibert’s supervision.¹ Last but not least, Guibert is also thought to have directed the compilation of the so-called Riesenkodex (MS Wiesbaden, Landesbibliothek, 2), the manuscript in which, by the end of her life, Hildegard had collected all the authorized versions of her works (Van Acker, 1989, pp. 129–34).

Fig. 1 MS Wiesbaden, Landesbibliothek, 1, fol. 1r (lost since 1945). Photo: Rheinisches Bildarchiv Köln 13321

3 Two Suspect Visions

The Visio de sancto Martino (‘Vision of Saint Martin’) and the Visio ad Guibertum missa (‘Vision sent to Guibert’), which are at stake in this article, cannot be found in the Riesenkodex. They are preserved only in three manuscripts that can be linked to the abbey of Gembloux and Guibert’s own oeuvre.² Therefore, both texts are traditionally not included in the core of Hildegard’s canon (Schrader and Führkötter, 1956, p. 182; Embach, 2003, p. 469). Whereas the titles in the manuscripts (Fig. 2), as well as Guibert’s accompanying letters, firmly attribute these visiones to Hildegard, there are good reasons to suspect that Guibert must have been extensively involved in their final redaction.
The figure of Saint Martin (the main topic of the Visio de sancto Martino), for instance, is entirely absent from Hildegard’s oeuvre. Guibert, on the other hand, developed a lifelong fascination for this saint and devoted nearly half of his life to spreading his cult. The Visio ad Guibertum missa discusses the role of the priest as well as the topic of literary collaboration, both issues of direct relevance to Guibert. Moreover, the end of the latter text contains a passage of particular interest in which Hildegard grants Guibert the exceptional right to revise her texts more fundamentally than simply at the level of style and grammar:

When you correct [the Visio de sancto Martino] and the other works, in the emending of which your love kindly supports my deficiency, you should keep to this rule: that adding, subtracting, and changing nothing, you apply your skill only to make corrections where the order or the rules of correct Latin are violated. Or if you prefer – and this is something I have conceded in this letter beyond my normal practice – you need not hesitate to clothe the whole sequence of the vision in a more becoming garment of speech, preserving the true sense in every part. For even as foods nourishing in themselves do not appeal to the appetite unless they are seasoned somehow, so writings, although full of salutary advice, displease ears accustomed to an urbane style if they are not recommended by some color of eloquence. (translated by Newman, 1987, p. 23)

With this statement, Hildegard allegedly granted Guibert editorial privileges that she had not allowed any previous collaborator. The passage also prompted scholars to take a closer look at the authorship, style, and content of these visionary texts. Already in his 1882 edition, Pitra voiced doubts with respect to Hildegard’s alleged authorship. He stated that Guibert, if not their original author altogether, must at least have reworked the texts profoundly.
Pitra based his verdict on a number of syntactical features, on metaphors that he considered typical of Guibert, and on the extensive insertion of Biblical quotations (Pitra, 1882, pp. 370–1, 375). Herwegen remained more cautious: although he accepted that Guibert had refined the texts stylistically, he still discerned Hildegard’s authorial voice shimmering through Guibert’s multiple corrections. He recognized Hildegard’s genius in the overall structure of the visions and in some typically Hildegardian vocabulary. He also rejected Pitra’s assertion that the numerous Biblical quotations could only have been inserted by Guibert (Herwegen, 1904, pp. 394–6). More recently, Newman stated that the Visio ad Guibertum missa was ‘written by Guibert in Hildegard’s persona’ (Newman, 1987, p. 24), although Van Acker (1989, p. 130) and Coakley (2006, p. 61) continued to consider Hildegard the text’s author and Guibert a mere stylistic reviser.

These assertions concerning the authorship of the visiones seem to have been predominantly based on subjective appreciations of style and content, and the arguments used in this debate remain, at best, intuitive. The appearance of a new critical edition of the visiones once more puts the question of their authorship at the forefront: should the texts be regarded as Hildegardian or pseudo-Hildegardian? Stylometric methods may provide a more objective basis for disentangling the issue and for re-assessing the nature of Guibert’s secretaryship.

4 Corpus Preparation

For the present study, Brepols Publishers generously provided a digital corpus containing the nearly complete works of Hildegard, Guibert, and Bernard of Clairvaux.
We obtained these texts in raw format, corresponding to the way they are included in the Brepols electronic Library of Latin Texts, on the basis of modern critical editions.³ Fortunately, these editions are all based on manuscripts that were compiled under the supervision of the original authors, or at least in their close vicinity, so that we do not have to worry about major scribal interventions. The fact that all three authors in our corpus were productive letter writers made their epistolaria an attractive point of departure. Moreover, the two short visionary texts of dubious origin that are at issue in this article are most comparable with Hildegard’s letters with respect to length, topics, and manuscript tradition. We restricted our authors’ letter collections to the letters they wrote themselves, leaving aside the letters that were merely addressed to them and that were usually contained in the same manuscripts (Constable, 1976). For Bernard, this resulted in a sub-corpus of 166,063 words and for Guibert in one of 124,580 words.⁴ Hildegard’s letter collection contained 109,633 words, 82,154 of which are contained in the part compiled with the help of her first secretary Volmar, while the remaining 27,479 words constitute the letters that, as discussed earlier, have most probably been edited in some way by Guibert.⁵

Fig. 2 MS Brussels, Royal Library, 5527–5534, fol. 141v. Epistula domine Hildegardis magistre cenobii sancti Roberti Pinguensis de excellentia beati Martini episcopi – ‘Letter of lady Hildegard, magistra of the monastery of saint Rupert in Bingen, on the excellence of the blessed bishop Martin’

Medieval Latin is characterized by unstable orthography. As even a single scribe often used different spellings for the same word, modern editors already tend to silently normalize minor orthographic variants.
We have normalized the orthography in our corpus even further via lemmatization, a useful procedure in stylometry for medieval texts (Kestemont et al., 2010). The texts were first tokenized using the Natural Language Toolkit (Bird et al., 2009). The coordinating conjunction -que (‘and’) was not realized as a separate word in medieval Latin but was appended to the preceding word (e.g. terra aquaque, ‘land and water’). To isolate the clitic automatically, we have stripped the suffix (‘xque’) from every word that did not occur in a list of words proposed by Schinke et al. (1996, pp. 180–1).⁶ We have also split up the medieval contraction of the reflexive pronoun se and the idiomatic reinforcement ipsum in seipsum (or teipsum, teipsam, etc.).

A number of specific character combinations were freely interchangeable in medieval Latin, such as ph for f, v for u, and oe or ae for e (or for ę, the so-called ‘e caudata’) (Rigg, 1996). We have therefore removed the difference between v and u, as well as between ae, oe, and e, by replacing every v with u and every ae and oe with e. For the substitution of ae and oe by e, this meant that we were sometimes forced to erase the distinction between grammatically important morphemes (e.g. between the masculine vocative singular domine and the feminine nominative plural dominae). Yet this was unavoidable, as a good deal of the aes and oes in our corpus had already been contracted to es, making it nearly impossible to normalize them automatically the other way round.

Subsequently, we checked whether the surface tokens in our corpus were present in a large and representative word list from the Perseus Project (Tufts University). When a token was not, we used a permutation algorithm to generate plausible spelling variants for it. If one of these newly generated forms was contained in the word list, the original form was replaced by its newly generated counterpart.
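The normalization and look-up steps described above can be illustrated with a minimal Python sketch. It is not the implementation used for this study: the short exception list for -que, the restriction of the pronoun split to se and te, the use of ‘xque’ as a separate output token, and the small alternation table (three interchangeable character groups only) are all assumptions made for illustration, and every function name is hypothetical.

```python
import re
from itertools import product

# Hypothetical, deliberately short exception list: words ending in -que that
# are not "word + clitic". The study relied on the list of Schinke et al. (1996).
QUE_EXCEPTIONS = {"atque", "quoque", "usque", "neque", "itaque", "denique"}

# Illustrative subset of interchangeable character groups ("" stands for Ø).
ALTERNATIONS = [("ph", "f"), ("h", ""), ("i", "y")]


def normalize_token(token):
    """Return the normalized token(s) for one surface form."""
    t = token.lower().replace("v", "u")          # v -> u
    t = t.replace("ae", "e").replace("oe", "e")  # ae, oe -> e
    split = re.fullmatch(r"(se|te)(ips[a-z]+)", t)
    if split:                                    # seipsum -> se + ipsum
        return [split.group(1), split.group(2)]
    if len(t) > 3 and t.endswith("que") and t not in QUE_EXCEPTIONS:
        return [t[:-3], "xque"]                  # aquaque -> aqua + xque
    return [t]


def spelling_variants(word):
    """All spellings obtained by swapping interchangeable character groups.

    Only an h that is present can be dropped; inserting a missing h
    (Ø -> h) is not attempted in this sketch.
    """
    slots, pos = [], 0
    while pos < len(word):
        for group in ALTERNATIONS:
            hits = [alt for alt in group if alt and word.startswith(alt, pos)]
            if hits:                             # variable position
                slots.append(group)
                pos += max(len(h) for h in hits)
                break
        else:                                    # ordinary, fixed character
            slots.append((word[pos],))
            pos += 1
    # The Cartesian product over the slots yields every combination.
    return {"".join(choice) for choice in product(*slots)}


def repair(token, word_list):
    """Replace an unknown token by a variant found in the word list, if any."""
    if token in word_list:
        return token
    for variant in sorted(spelling_variants(token)):
        if variant in word_list:
            return variant
    return token
```

With a word list that contains only chirographum, a call such as `repair("chyrografum", {"chirographum"})` recovers the listed spelling, whereas a token with no listed variant is returned unchanged. A full alternation table would need a rule for overlapping groups (for example c, ch, and h), which this sketch sidesteps.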
To generate these variants, we constructed an array with all possible variations for the consecutive character groups. Next, we combined these options through the Cartesian product in the matrix by means of a permutation algorithm (Kestemont et al., 2010). Table 1 lists the series of common alternative character combinations we have considered, loosely based on Rigg (1996).⁷ An example matrix for a word like chirographum would be: {[c], [h | Ø], [i | y], [r], [o], [g], [r], [a], [ph | f], [u], [m]}. All unique alternative spellings that can be generated on the basis of this matrix are: chirographum, cirographum, chyrographum, cyrographum, chirografum, cirografum, chyrografum, and cyrografum.

Table 1 Interchangeable medieval Latin character combinations allowed in our permutation algorithm

ci vs. ti
ch vs. h
ph vs. f
h vs. Ø
w vs. uu vs. vv vs. uv vs. vu
i vs. j vs. y
k vs. c vs. ch
g vs. gu

Finally, we automatically annotated the tokens with lemmas using the medieval Index Thomisticus Treebank (IT-TB: Passarotti and Dell’Orletta, 2010) as training material (ca. 170,000 tokens; ca. 9,000 sentences).⁸ For the lemmatization of our corpus we have used Morfette (Chrupala et al., 2008). Unlike other popular lemmatization tools, such as TreeTagger (Schmid, 1994), Morfette also lemmatizes input tokens that it has not encountered verbatim in the training data. Morfette considers pairs of input tokens and lemmas in the training material. From these pairs it learns ‘shortest edit scripts’, or ways to transform tokens into their lemmas using character insertions, deletions, and replacements. An annotated sample from the Visio ad Guibertum missa is listed as an example (Table 2), illustrating that this procedure did not manage to identify all lemmas correctly. Content words that are not typical of Thomas Aquinas’s scholastic vocabulary in particular were not always recognized. For the function words used in our analyses (see below), this problem was fortunately hardly an issue.

5 Feature Selection

Today’s stylometry has become an umbrella term for a still growing number of techniques for authorship analysis. Each of these has been the subject of both criticism and praise, making it hard to discern a consensus on best practice in this field. For this research too, we had to balance the pros and cons of a number of tried and tested methodologies. Recent studies still tend to agree on the methodological advantages of using function words in authorship attribution (Binongo, 2003, p. 11). An author’s use of function words is said, for instance, to be relatively unaffected by a text’s topic or genre. (Dis)similarities between texts regarding function words are therefore to a certain extent content-independent and can be more easily associated with authorship than, for example, content words or other topic-specific stylistic features (Juola, 2006, pp. 264–5). Numerous empirical studies have demonstrated that analyses of the high-frequency strata of function words yield reliable indications about a text’s authorship (Koppel et al., 2009, pp. 11–12; Stamatatos, 2009, pp. 540–1). In this research, we have therefore restricted our analyses to function words, using a number of established methods, many of them implemented in the publicly available script suite ‘Stylometry with R’ (Eder et al., 2013).

Preliminary analyses showed that the upper tail of the frequency spectrum in our corpus still contained a good deal of content-rich lemmas. Among the ca. 200 most frequent lemmas in our entire corpus, listed in Table 3, we came across multiple topic-specific nouns like deus, dominus, and sanctus, and verbs like facio, uideo, and uiuo.
The inclusion of such lemmas obviously reflects the corpus’s fairly specific, religious semantics. It is also related, however, to the simple fact that a highly inflected language like Latin, with its many declensions, makes less use of function words than weakly inflected languages like English. A third explanatory factor might be the fact that we worked with the frequencies of lemmas instead of surface forms. It thus seemed advisable to remove these content words from our data tables. The content-rich words we chose to remove are marked by a hash sign (#) in Table 3.⁹ The words followed by an asterisk (*) in the same table are non-reflexive personal pronouns, which are also often culled in stylometry to avoid the intrusion of genre-related or topic-specific features. Naturally, a collection of letters will contain more instances of the second-person pronouns tu/vos (‘you’) or tuus/vester (‘your’) than a saint’s life. In our analyses, we have deleted this kind of pronoun. Just as in Table 2, one can still distinguish a certain number of wrongly lemmatized tokens in Table 3. The surface form sui, for example, often seems to have remained unchanged, whereas it should have been transformed into suus. This particular error, however, is neutralized by our elimination of non-reflexive personal pronouns.¹⁰ In sum, our culling of the lemmas in Table 3 resulted in a total of 65 function words that form the basis for the actual analyses.

It should be noted, however, that character n-grams might have been an attractive additional feature type for our research, as these have often been shown to be excellent features in authorship attribution (Koppel et al., 2009, pp. 12–13; Stamatatos, 2009, pp. 541–2). This method, which does not require any kind of normalization or lemmatization, segments texts into consecutive, partially overlapping groups of n characters: the word ‘bigram’, for instance, contains the bigrams ‘_b’, ‘bi’, ‘ig’, ‘gr’, ‘ra’, ‘am’, ‘m_’.
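The segmentation into character n-grams can be made concrete with a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, and the use of an underscore to mark word boundaries simply follows the ‘bigram’ example above.

```python
def char_ngrams(word, n=2, pad="_"):
    """Consecutive, overlapping character n-grams of a word, with boundary marks."""
    padded = f"{pad}{word}{pad}"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def shared_ngram_share(a, b, n=2):
    """Share of distinct n-grams two spellings have in common (Jaccard index)."""
    grams_a, grams_b = set(char_ngrams(a, n)), set(char_ngrams(b, n))
    return len(grams_a & grams_b) / len(grams_a | grams_b)
```

Calling `char_ngrams("bigram")` yields the seven bigrams listed above. The second helper is an added illustration of why such features depend on spelling: two orthographic variants of the same word, such as visio and uisio, share only part of their bigrams.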
Contrary to a word-level approach, character n-grams are also sensitive to stylistic information below the word level, like case endings or other grammatical morphemes that are not realized as separate words (Rybicki and Eder, 2011, p. 320). Latin, for instance, is a heavily inflected language that makes use of affixes to mark the grammatical functions of words; ‘by iron, not by sword’, for example, is ‘ferro non gladio’ (Sapir, 1921, ch. VI). Therefore, it would have made sense to additionally study the character n-grams in the corpus. However, one runs into the aforementioned problem that historical languages are characterized by unstable orthography (Piotrowski, 2012). Although Latin spelling variation seems to have been less pronounced than in vernacular medieval languages, it does constitute a serious issue. When comparing two texts written by the same author, surviving in manuscripts with a strongly divergent orthography, stylometric methods may detect artificially large differences. Conversely, and likewise due to scribal interference, texts of non-identical authorial provenance may show artificial similarities when they survive in manuscripts with a similar orthographical profile. In medieval manuscripts, we might even find inconsistent spellings for the same word throughout a single text (Rigg, 1996). This ultimately implies that an approach based on character n-grams is inadvisable for medieval Latin (cf. Kestemont and Van Dalen Oskam, 2008).

Table 2 Example of lemmatization based on Morfette

Original       Lemma          Translation
in             in             ‘in’
uisionem       uisio          ‘vision’
anime          anima          ‘soul’
mee            meus           ‘my’
,              /              /
uidi           uideo          ‘I see’
ingentem       ingentem       not recognized [ingens = ‘gigantic’]
rutilantis     rutilo         ‘glow’
ignis          ignis          ‘fire’
nubem          nubem          not recognized [nubes = ‘cloud’]

Translation: ‘In a vision of my soul, I saw a gigantic cloud of glowing fire.’

Table 3 Most frequent lemmas in the corpus (# = content words; * = non-reflexive pronouns)

et             e              quoniam        #caritas       #consilium     contra
qui            uel            #uerbum        #uenio         #rex           #pono
in             #possum        aut            quasi          dum            #amicus
#sum           pro            idem           scilicet       #talis         #honor
non            quam           super          #causa         #ceterus       #nomen
#tu*           #uester*       #terra         #manus         #caro          uelut
#is*           autem          #uolo          #iustitia      #fides         ante
#ego*          #multus        nunc           #modus         #res           #ta
#deus          #habeo         iam            #primus        #paruus        #iudicium
ad             ne             #uita          semper         apud           usque
hic            #sanctus       ac             #audio         #pax           quantum
sed            enim           #cor           #mundus        #salus         #lex
ut             etiam          #nam           #debeo         siue           #fidelis
de             #noster*       #do            #uiuo          #eternus       #sol
#suus*         #uerus         #solus         #cado          #inuenio       #celestis
#ille*         #uideo         unde           inter          #frater        #potior
a              sicut          quidem         #o             #uir           uidelicet
cum            #alius         tam            #diligo        magis          tunc
quod           ita            propter        #uoluntas      #fors          #angelus
ipse           tamen          #quidam        #gloria        #us            #diuinus
#tuus*         #filius        #bonus         quoque         #certus        #summus
#omnis         #spiritus      ergo           atque          #loquor        #ideo
si             #christus      #tempus        #aliqui        #uox           #prior
#sui*          #bonum         sine           #malum         #iustus        #populus
per            #ecclesia      nisi           #mens          post           #episcopus
#facio         #opus          #unus          #oculus        #misericordia  #similis
#homo          xque           #dies          #nihil         #celum         #os
#dico          sic            #nullus        #secundum      adhuc          #nouus
quia           #magnus        ubi            #pars          #domus         #tantum
#dominus       #iste*         #corpus        #mors          #uis           #uia
#meus*         #anima         #locus         #peccatum      #beatus        licet
nec            #pater         #uirtus        #scio          #quomodo       #predico
#quis          #gratia        #totus         #hildegars     #ueritas       #fratres
#duo           #quero

Unfortunately, this means that our approach based on lemmatization cannot take into account stylistic subtleties below the word level (e.g. indicative versus subjunctive mood, as expressed in verbal endings). However, we will demonstrate that our method is still able to harvest sufficient stylistic information from the texts.
Indirectly, our results will therefore even serve to emphasize how much grammatical information is in fact still expressed by isolated function words in medieval Latin.

6 Testing Principal Components Analysis

The first stylometric technique we adopt is principal components analysis (PCA), a procedure derived from multivariate statistics and commonly used to reduce the dimensionality of a data set (Binongo, 2003). By combining the original variables of a data table into new, uncorrelated compound variables or ‘principal components’, PCA is able to summarize large and complex data sets into insightful lower-dimensional scatterplots. When applied to the frequencies of high-frequency items in texts, this technique often successfully reveals the authorial structure in a data set. PCA’s good performance in authorship attribution is due to the fact that it explicitly tries to model correlations between word frequencies. Especially the frequencies of function words show complex correlations that are related to stylistic, arguably authorial choices between small sets of alternative options. A mere visual inspection of the samples’ positions in PCA scatterplots often shows that samples written by the same author will cluster, whereas groups of samples written by distinct authors lie further apart. Because of the considerable size of the epistolaria in the corpus, we could start with a large sample size of 10,000 lemmatized words per sample. Recent research has demonstrated that the accuracy of most authorship attribution techniques is likely to increase when larger samples are taken (Eder, 2010; Luyckx and Daelemans, 2011).
Our selection of the epistolaria of exactly three authors (Hildegard of Bingen, Guibert of Gembloux, and Bernard of Clairvaux) respects the fact that it is theoretically unadvisable to include more than three authors in a PCA, especially when the discussion of the results is restricted to the first two principal components (PCs) (Binongo and Smith, 1999, p. 464). As is customary since Burrows (1987), our PCA is based on the correlation matrix, appropriately scaling the original word frequencies.

Fig. 3 shows the scatterplot that results from our first experiment. Each author’s samples are visualized as black letter combinations: the first letter of the author’s name is followed by a digit, indicating the sample’s indexed position in the respective epistolaria. G_EP-4, for instance, is the fourth sample of 10,000 lemmatized words taken from Guibert’s epistolarium.11 At this stage, we are restricting Hildegard’s epistolarium to the letters that are not associated in any way with Guibert’s secretaryship.

Fig. 3 displays a remarkably clear authorial separation of the samples. Guibert’s samples (G_EP) are concentrated in the upper-right quadrant, whereas the samples from Hildegard’s epistolarium (H_EPNG) are invariably positioned to the left. Finally, Bernard’s samples (B_EP) form a tight cluster in the lower-right half of the plot. The density of this last cluster thus points to a clear stylistic unity, despite the fact that, as noted earlier, Bernard must have been assisted in his epistolary work by a true personal chancellery consisting of at least five different collaborators (Leclercq, 1987, p. 147–52). Additionally, the plot in Fig. 3 contains a series of high-frequency items in light grey, the ‘component loadings’, visualizing how strongly the 65 lemmas have contributed to the creation of the PCs.
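The PCA set-up described above (frequencies of high-frequency lemmas, scaled through the correlation matrix and reduced to two components) can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the implementation behind the figures: the relative frequencies are invented toy values for three lemmas, all function names are hypothetical, and a simple power iteration stands in for the eigen-solver of a statistics package.

```python
import math
import random

def zscore_columns(rows):
    """Scale every column (lemma) to mean 0 and population std 1.
    Assumes every lemma varies across samples (no zero std)."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / n)
            for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

def correlation_matrix(z):
    """Correlation matrix of already z-scored data."""
    n, p = len(z), len(z[0])
    return [[sum(z[k][i] * z[k][j] for k in range(n)) / n for j in range(p)]
            for i in range(p)]

def leading_eigenpair(matrix, iterations=500, seed=0):
    """Power iteration: largest eigenvalue and unit eigenvector
    of a symmetric positive semi-definite matrix."""
    rng = random.Random(seed)
    p = len(matrix)
    v = [rng.random() + 0.1 for _ in range(p)]
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    value = sum(v[i] * sum(matrix[i][j] * v[j] for j in range(p))
                for i in range(p))
    return value, v

def pca(rows, n_components=2):
    """PCA on the correlation matrix.
    Returns sample scores, component loadings, explained-variance shares."""
    z = zscore_columns(rows)
    work = correlation_matrix(z)
    p = len(work)
    components, explained = [], []
    for _ in range(n_components):
        value, vec = leading_eigenpair(work)
        components.append(vec)
        explained.append(value / p)  # the trace of a correlation matrix is p
        # Deflation: remove the component just found before the next pass.
        work = [[work[i][j] - value * vec[i] * vec[j] for j in range(p)]
                for i in range(p)]
    scores = [[sum(x * c for x, c in zip(row, comp)) for comp in components]
              for row in z]
    return scores, components, explained

# Invented relative frequencies of (et, in, non); two toy samples per author.
labels = ["G", "G", "H", "H", "B", "B"]
samples = [
    [0.060, 0.020, 0.010], [0.058, 0.022, 0.011],
    [0.035, 0.045, 0.012], [0.037, 0.043, 0.010],
    [0.040, 0.022, 0.025], [0.041, 0.020, 0.027],
]
scores, components, explained = pca(samples)
for label, (pc1, pc2) in zip(labels, scores):
    print(label, round(pc1, 2), round(pc2, 2))
```

In this sketch, `scores` gives the coordinates of each sample in a two-dimensional scatterplot, `components` corresponds to the component loadings of the individual lemmas, and `explained` to the share of variance reported on the axes of such a plot.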
If a word can, for instance, be found to the far left of the scatterplot, this demonstrates that it is relatively more frequent in samples with a similar position in the plot. Our first scatterplot thus shows that the use of et (‘and’) and a (‘from’) is surprisingly typical of Guibert’s writings, whereas the use of the preposition in (‘in’) is very characteristic of the Hildegard samples. In comparison, the use of the lemmas non or si seems to be relatively more typical of Bernard’s writing. The scatterplot does not reveal any anomalies and it is safe to assume that the high-frequency grammatical lemmas argue in favor of a clear stylistic differentiation between our authors.

The remarkable stylistic differences with respect to a number of specific lemmas used by our authors can be highlighted in another way. The boxplots in Fig. 4 visualize information about the absolute frequencies (medians, quartiles, etc.) for three interesting function words (in, et, and non) in samples of 2,000 words. In boxplot (a), concerning the use of in, the primary column refers to the counts in Hildegard; in boxplot (b), dealing with et, the primary column concerns Guibert; and in boxplot (c), with the results for non, Bernard’s results are displayed in the primary column. The secondary column in all three boxplots refers to the material by the two other authors, e.g. Guibert and Bernard in boxplot (a).

These boxplots indeed reveal unmistakable differences between the respective epistolaria with respect to the frequency of these important function words. Interestingly, these differences coincide with stylistic observations that have been made in traditional philological research. Given the visionary discourse developed in much of her writings, even in her letters, it is not surprising to come across an intensive use of the preposition in throughout Hildegard’s letters.
She repeatedly sees things in divine visions; she continuously searches the allegorical meanings buried in the multitude of details that she discovers in her visions (Dronke, 1998). Guibert’s writings are especially notorious for their all too inflated and artificial style, and Guibert’s wearisome tendency to compose extremely long sentences, full of coordinating conjunctions (see also Derolez, 1988, p. V and IX). Bernard’s frequent use of non can be related to the didactic nature of his epistolary expositions in which he very often relies on an antithetical style to illustrate his thoughts (Mohrmann, 1958; Pranger, 2011, p. 222).

Fig. 3 PCA of the epistolaria by Hildegard, Guibert, and Bernard (10,000 lemmas/sample) [scatterplot: PC1 (37.8%) against PC2 (16.9%); 65 MFW, culled @ 0%, pronouns deleted, correlation matrix; inset: proportion of variance explained per principal component]

7 Testing Delta

For our PCA displayed in Fig. 3, we have been working with extremely generous sample sizes of 10,000 lemmas each.
Because the ultimate goal of this article remains the attribution of the Visio ad Guibertum missa and the Visio de Sancto Martino, whose authorship seems very questionable, the problem of sample size needs to be put forward (Eder, 2010; Luyckx and Daelemans, 2011): while the first disputed visio at stake in this article still contains 7,489 lemmas, the latter only counts 3,301 lemmas. The scatterplots in Fig. 5a and b show the results of the same procedure as in Fig. 3 but using sample sizes of 5,000 and 1,000 lemmas, respectively. This clearly illustrates the decrease in discriminatory performance of our PCA when we reduce the sample size in our experiments. Fig. 5b demonstrates that the authorial discrimination becomes less powerful, in particular between Guibert and Bernard in the vertical component.

Fig. 4 (a–c) Boxplots of the absolute frequencies of in, et, and non in epistolary samples of 2,000 lemmas [panel (a): boxplot for "in", Primary (41/41), Secondary (145/145); panel (b): boxplot for "et", Primary (62/62), Secondary (124/124); y-axis: absolute frequency per slice (2,000 words); Wilcoxon rank sum: p < 0.05 in both panels]

To what extent will we be able to rely on PCA for a fairly solid attribution of a text, like the Visio de Sancto Martino, of only ca. 3,000 words? Although the scatterplots in the previous section demonstrate the general validity of the stylometric approach for our corpus, it makes sense to apply a second attribution technique to our corpus to validate the outcome of the PCA more precisely. Because it is unfeasible to generate new scatterplots for every small change in parameter settings, such as the sample size in our experiments, we additionally apply Burrows’s Delta (2002) to the epistolaria.
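Burrows’s Delta and the leave-one-out validation applied in this section can be sketched as follows. This is a minimal illustration under stated assumptions rather than the procedure actually run on the epistolaria: the frequency values are invented toy numbers for three lemmas, the function names are hypothetical, and the standard deviations are recomputed on the remaining samples in each round, which is one possible design choice among several.

```python
import math

def column_stds(rows):
    """Population standard deviation of every column (lemma).
    Assumes every lemma varies across the given samples."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    return [math.sqrt(sum((x - m) ** 2 for x in c) / n)
            for c, m in zip(cols, means)]

def delta(a, b, stds):
    """Burrows's Delta: mean absolute difference between z-scores.
    The corpus means cancel out, so only the stds are needed."""
    return sum(abs(x - y) / s for x, y, s in zip(a, b, stds)) / len(stds)

def attribute(unknown, references, stds):
    """Nearest-neighbor attribution: label of the closest reference sample."""
    label, _ = min(references, key=lambda ref: delta(unknown, ref[1], stds))
    return label

def leave_one_out_accuracy(samples):
    """Treat each sample in turn as anonymous and attribute it
    on the basis of all remaining samples."""
    correct = 0
    for i, (label, freqs) in enumerate(samples):
        rest = samples[:i] + samples[i + 1:]
        stds = column_stds([f for _, f in rest])
        if attribute(freqs, rest, stds) == label:
            correct += 1
    return correct / len(samples)

# Invented relative frequencies of (et, in, non); two toy samples per author.
toy_samples = [
    ("G", [0.060, 0.020, 0.010]), ("G", [0.058, 0.022, 0.011]),
    ("H", [0.035, 0.045, 0.012]), ("H", [0.037, 0.043, 0.010]),
    ("B", [0.040, 0.022, 0.025]), ("B", [0.041, 0.020, 0.027]),
]
accuracy = leave_one_out_accuracy(toy_samples)
print("leave-one-out accuracy:", accuracy)
```

Repeating such a run for a range of sample sizes yields one accuracy value per size, which is the kind of curve a cross-validation plot summarizes.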
In its traditional implementation, Delta offers a similarity metric to determine the authorship of anonymous works. Based on the frequencies of a small set of high-frequency items, Delta computes the stylistic distance between an unknown sample and a set of samples written by a series of candidate authors. It will attribute the anonymous sample to the author of the (single) sample in the data set to which it is closest in style according to the metric. As such, Delta uses a ‘nearest neighbor’ reasoning (Argamon, 2008). We can apply a ‘leave-one-out validation’ with Delta as follows. We can temporarily treat each sample in our collection as anonymous. Next, we can have Delta attribute the anonymized sample to one of the candidate authors and check whether the suggested attribution is correct or not. If, at the end of this procedure, we divide the number of correct attributions by the total number of samples in the data set, we get a percentage that offers a useful approximation of the general effectiveness of our technique, should it, for instance, be applied to real-world samples of unknown provenance.

Fig. 6 shows the result of this leave-one-out validation for various sample sizes (multiples of 100 lemmas, ranging from 500 to 4,000). It is obvious that larger sample sizes invariably lead to higher accuracies in cross-validation. Yet, whereas the initial accuracies are fairly low (even < 85%), the attribution success quickly rises above the psychological barrier of 95% (sample sizes > 1,500 lemmas) and becomes entirely flawless when dealing with sample sizes of ca. 3,000 lemmas or more. For a text counting 3,301 lemmas, like the Visio de Sancto Martino, we might well reach an attribution accuracy of about 99%. Moreover, because these numbers are in line with earlier reports concerning modern languages (Eder, 2010; Luyckx and Daelemans, 2011), Fig.
6 again demonstrates that even a highly inflected language like Latin contains a satisfying amount of useful stylistic information in its grammatical lemmas alone.

Fig. 4 (continued) [panel (c): boxplot for "non", Primary (83/83), Secondary (103/103); y-axis: absolute frequency per slice (2,000 words); Wilcoxon rank sum: p < 0.05]

By now, we can assume that, when applied cautiously, PCA should offer enough solid ground to make conjectures about the authorship of the visions in the corpus traditionally attributed to Hildegard. Following a nearest neighbor reasoning (Argamon, 2008), we can plot unseen, anonymous texts together with the works of established authorial origin and investigate to which of the authorial clusters the unseen work is most similar in style. However, before moving on to the analysis of the visions, we have first tested this attribution procedure. In the PCA scatterplot in Fig. 7, we have added a new, ‘anonymous’ sample (amounting to 3,706 lemmas) by author ‘X’ to equal-sized samples from the aforementioned epistolaria. The new sample turns out to be stylistically much more similar to Bernard’s samples than to those by Hildegard or Guibert. Should this sample have been truly anonymous, the analysis would have offered firm grounds for conjecture that the text from which the sample is derived is actually authored by Bernard of Clairvaux. In this specific case, this reasoning would have led to a historically sound attribution, as the anonymous text we have questioned is in reality the Sermo in festo sancti Martini, written by Bernard around 1150. An interesting fact about this example is that even though the topic and genre of this text are perhaps quite different from the epistolary material of our candidate authors (viz.
a sermon about the aforementioned Saint Martin), it is clear that our PCA procedure allows for solid conclusions. Although one should perhaps not always expect such clear-cut stylistic, authorial differentiation in historical corpora, this promising example clearly illustrates the benefits of the present methodology for (future) research.

Fig. 5 (a and b) PCAs with reduced sample sizes (5,000 and 1,000 lemmas/sample) [panel (a): PC1 (29.7%) against PC2 (12.9%); panel (b): PC1 (14.3%) against PC2 (6.7%); 65 MFW, culled @ 0%, pronouns deleted, correlation matrix; samples labeled B, G, and H]

8 Guibert’s Secretaryship: Synergy and Beyond?

As discussed earlier, we have discerned two groups of letters in Hildegard’s epistolarium: one that must have originated at the time when Volmar was still Hildegard’s secretary and that bears no potential traces of Guibert’s interference, and another containing the letters that are likely to have been revised by Guibert. If we confront samples of 5,000 lemmas from both portions, labeled here H_EPNG and H_EPG, respectively, in a PCA, we get the result in Fig. 8. We notice that the first, horizontal PC captures an impressive 37% of the original variation in our data and primarily relates to the stylistic differentiation between Guibert’s own letter collections (G_EP) and the anterior portion of Hildegard’s epistolarium (H_EPNG). Interestingly, we see that the second PC in the right half of the plot (still capturing 9.4% of the original variation) discriminates between Hildegard’s non-Guibertian letters and her letters that can be associated with Guibert’s secretaryship.

These results thus suggest that there do indeed exist stylistic differences between the oldest portion of Hildegard’s epistolarium and the letters in which we expected to discern Guibert’s editorial fingerprints. They also confirm what can be deduced from the surviving manuscript evidence. The so-called autograph copy of the Liber divinorum operum mentioned earlier offers unique insight into the way in which Hildegard’s collaborators must have edited her texts under her supervision (Derolez, 1972). Fig. 9, showing a number of lines from the randomly selected page 370 of MS Ghent, University Library, 241, makes it clear that it was the function words in particular that were often altered by Hildegard’s correctors: tam being erased, quod being replaced by ut or quia, ad being added, etcetera. A collaborator (especially Guibert, who is known to have had a great deal of freedom in his editorial work) may thus have had a notable impact on Hildegard’s stylistic profile.

However, in Fig. 8, we see that the samples from Hildegard’s epistolarium that bear the influence of Guibert’s interference do not seek the company of Guibert’s own writings in the scatterplot. After all, they continue to be somewhat more similar to Hildegard’s style. This result is reminiscent of the Synergy Hypothesis, recently discussed by Pennebaker (2011).12 Pennebaker puts forward three hypotheses concerning the stylistic effect of collaborations between different authors. Such projects can produce a language that is (1) similar to the one produced by a single person writing alone, (2) the average of the two writers, or (3) unlike either of the styles that the collaborating authors would produce on their own.
Based on exploratory research on the Federalist papers and Beatles songs, Pennebaker ultimately argues in favor of the latter, so-called ‘Synergy view’ on collaborative authorship, without however refuting the possibility that one of the collaborating authors might have remained more influential with respect to the end product (cf. Petrie et al., 2008). This Synergy Hypothesis thus might be applicable to a certain extent to the Hildegard–Guibert ‘collaboration’, where the result of the creative process does not fit in with the other letter samples written by Hildegard or Guibert individually, although the result is somewhat more similar to Hildegard.

Fig. 6 Cross-validation using Delta (dotted lowess line fitted) [x-axis: sample size, 500 to 4,000 lemmas; y-axis: cross-validation accuracy, 0.85 to 1.00]

More can be learned about the stylistic dichotomy in Hildegard’s epistolarium by applying a Mann–Whitney test to the lemmas occurring at least twice in 4,000 lemma samples. Here, we temporarily leave the realm of high-frequency lemmas and venture into the lower-frequency strata of the lexical spectrum. Hence, this test will not particularly emphasize the discriminatory power of high-frequency lemmas, as was the case with our other tests (Kilgarriff, 2001). Fig. 10 contrasts the words that were predominantly used in Hildegard’s letters written under Volmar’s secretaryship with those that become typical when Guibert took over the editorial work in the preservation of her letters. The lemmas have been ranked and plotted according to the U test statistic obtained for each lemma. Fig.
10 shows how the use of the relative pronoun qui (‘who’), for instance, only becomes prominent in letters edited by Guibert, who is indeed notorious for constructing eloquent but complex sentences with a lot of embedded relative clauses. Moreover, this latter group of letters is also characterized by a more dry and stereotypical ecclesiastical vocabulary (omnipotens, sanctus, spiritus, verus, ...), whereas the letters not influenced by Guibert betray a more direct and lively narrative style (sed, tunc, nunc, dico, ergo, deinde, ...), possibly more true to Hildegard’s own preferred way of expressing herself. We might thus be inclined to agree with Newman (1987, p. 24) when she stated: ‘Purists can at least rejoice that the collaboration [between Guibert and Hildegard] began only after the seer’s major works were completed’. From the methodological point of view, these results also show that the discriminatory effects in lower-frequency strata correspond with the stylistic dichotomy present in the high-frequency vocabulary, thus corroborating the performance of the latter methodology.

Fig. 7 Attribution of an anonymized sermo X to the Bernardian corpus [PC1 (26.9%) against PC2 (11.6%); 65 MFW, culled @ 0%, pronouns deleted, correlation matrix; samples labeled B, G, H, and X]

Fig. 8 PCA of the epistolarium of Guibert, of the letters of Hildegard transmitted without Guibert’s editorial assistance, and of the Guibertian letters in Hildegard’s epistolarium (5,000 lemmas/sample) [PC1 (37%) against PC2 (9.4%); 65 MFW, culled @ 0%, pronouns deleted, correlation matrix; inset: proportion of variance explained per principal component]

Fig. 9 MS Ghent, University Library, 241, p. 370 (detail). Reproduced with permission

Let us finally turn to the original incentive for the present article, namely, the authorship discussion concerning two texts of dubious provenance: the relatively short Visio de Sancto Martino about Saint Martin (3,301 lemmas) and the somewhat longer Visio ad Guibertum missa (7,492 lemmas). Fig. 11 offers the result of three PCAs in which we have confronted both ‘dubia’ (hence D_MART and D_MISSA) with the previously discussed epistolary collections, again using the same 65 lemmas and a sample size of 3,301 lemmas. Fig. 11a considers all texts by all authors; Fig. 11b excludes Bernard’s texts; Fig.
11c only considers Guibert’s epistolarium and the ‘anonymous’ visionary texts.

All subplots in Fig. 11 clearly show that both visions tightly cluster with Guibert’s epistolarium, instead of with Hildegard’s. This effect is perhaps least prominent in Fig. 11a, where D_MART and D_MISSA display modest similarities to some of the epistolary samples from the portion of Hildegard’s epistolarium that was revised by Guibert. In all three plots, however, the visions are generally speaking far more similar to Guibert’s writings than to Hildegard’s. Significantly, most samples resulting from the combined authorial voices of Hildegard and Guibert again do not display any marked rapprochement to the epistolaria of the individual authors. These observations seem to reinforce the Synergy Hypothesis. Moreover, the visions’ quasi-random position in the final subplot (Fig. 11c) reveals no pronounced stylistic differences with Guibert’s letters, regarding the high-frequency lemmas analyzed. They invariably cluster with Guibert’s epistolary oeuvre, making him a much more plausible author than Hildegard, at the very least from a stylistic point of view.

Fig. 10 Results of Mann–Whitney test (U statistic) comparing the vocabulary in Hildegard’s epistolarium before and during Guibert’s secretaryship [panel ‘Before Guibert’: semetipse, interdum, ualde, quod, surgus, mens, quomodo, quare, amo, mysticus, uenio, rectus, populus, ubi, non, ergo, deinde, hic, in, dico, nunc, sed; panel ‘With Guibert’: solus, possum, semper, cesso, iesus, numquam, indumentum, sanctus, imito, uel, per, a, designo, precipio, efficio, summus, licet, uerus, qui, e, omnipotens; x-axis: Mann–Whitney U]

An important, yet inconspicuous, last feature of Fig.
11a is that it includes the Sermo in festo sancti Martini, even though it can hardly be spotted among Bernard’s other samples. This sermon deals, just like the Visio de Sancto Martino, with Saint Martin. Both texts were even clearly influenced by the same late antique hagiographical narratives concerning this saint, namely, the works of his first hagiographers Sulpicius Severus (c. 363–425) and Gregory of Tours (538–594). It is interesting to note that despite their interwovenness within the same intertextual tradition, they are still clearly distinguished and therefore demonstrate that topic-related stylistics hardly interferes with the author-related differences. The visionary texts under investigation thus betray Guibert’s stylistic influence to such an advanced extent that we could wonder whether we should not entirely attribute these texts to Guibert, instead of arguing for any form of ‘synergetical collaboration’, as was still possible for the portion of the epistolarium over which both Hildegard and Guibert labored.
Fig. 11 PCAs including the Visio de Sancto Martino and the Visio ad Guibertum missa [panel (a), all texts by all authors: PC1 (24.6%) against PC2 (11.6%); 65 MFW, culled @ 0%, correlation matrix]

9 Conclusions

It is obvious that the experiments reported in this article only touch the tip of the iceberg of the research on Hildegard’s complicated authorship, to say nothing of the exciting, broader topic of twelfth-century Latin writing. As stated in our Introduction, individuality and authorship remain complex issues when it comes to medieval literature. Even an authoritative and highly idiosyncratic author like Bernard of Clairvaux is known to have been assisted by a team of collaborators.
It is moreover clear that medieval scribes often gradually introduced errors and deviations when successively copying exemplars, thus possibly altering the original authors’ style in the surviving copies of texts. Nevertheless, we hope to have demonstrated that these issues need not imply that stylometry, when applied cautiously, cannot yield valid research results in the field of medieval philology.

First we showed that authorial discrimination was possible in the corpus studied. Although samples had to be big enough to yield correct attributions, stylometric methods were generally able to model the overall differences in writing style. This suggests that superficial interference from scribes (or even later editors) can be by-passed to a certain extent, for instance through lemmatization. Interestingly, we obtained satisfying results with a word-level approach, notwithstanding the fact that Latin is a highly inflected language. Although other strategies might increase attribution accuracies in the future, this shows that even in highly inflected languages, plenty of stylistic information can already be harvested at the word level.

Fig. 11 (continued) [panel (b), Bernard’s texts excluded: PC1 (31.1%) against PC2 (8.1%); 65 MFW, culled @ 0%, correlation matrix]

In the course of our research, we have also touched on collaborative authorship, an issue that recently has raised considerable interest in stylometry (Reynolds et al., 2012). Our methodology enabled us to discover clear stylistic differences in Hildegard of Bingen’s epistolary work between those letters for which she had relied on the modest assistance of her first collaborator Volmar and the letters that have been compiled and copy-edited by Guibert of Gembloux. Interestingly, the letter samples influenced by the collaboration between Hildegard and Guibert formed an isolated cluster that did not display advanced stylistic similarities to Hildegard’s former epistolary oeuvre, nor to that of Guibert. These results argue in favor of what Pennebaker (2011) has called the Synergy Hypothesis: when two authors are involved in the same texts, the end result need not resemble the writing style of one of the two individually; the result might rather resemble that of a ‘new’, third author. The evidence offered in this particular case study is valuable in this light, but at the same time still too scant to come to a final verdict on this fascinating topic.

Finally, with respect to our initial research question, we hope to have convincingly disputed the authorship of two texts attributed to Hildegard: the Visio de Sancto Martino and the Visio ad Guibertum missa. We argued that these visions are stylistically speaking completely in line with the writing style of Guibert of Gembloux, Hildegard’s last secretary.
These results offer quantitative support to suspicions voiced in earlier, traditional philological research: if Guibert is not to be considered their original author altogether, it is clear that he reworked these texts so profoundly that hardly anything of Hildegard’s writing style is still discernible in them. In fact, it is noteworthy that our analyses could not offer any stylistic evidence at all that Hildegard once authored (even a preliminary or simply oral version of) these texts, although this remains of course an interesting historical possibility.

Fig. 11 (continued) [panel (c), Guibert’s epistolarium and the disputed visions only: PC1 (9.7%) against PC2 (8.3%); 65 MFW, culled @ 0%, correlation matrix]

Acknowledgements

We thank the Corpus Christianorum Library & Knowledge Centre of Brepols (Turnhout) and in particular Luc Joqué for generously putting at our disposal the corpora analyzed in this article. Marco Passarotti (Università Cattolica del Sacro Cuore, Milan) generously provided us with the IT-TB, while Helma Dik (University of Chicago) provided the word list from the Perseus Project (Tufts University). We are moreover very grateful for the valuable feedback from Albert Derolez, Wim Verbaal, Antoon Bronselaer, and Guy De Tré. In addition, we thank the anonymous reviewers of the Digital Humanities Conference 2013 for their helpful comments on this research project, as well as the anonymous reviewers of this journal, in particular, for their extensive feedback on the normalization procedures described.
Mike Kestemont developed the stylometric methodology for this article. Sara Moens brought in her domain expertise concerning Guibert of Gembloux and medieval epistolography. Jeroen Deploige, who took the initiative for this collaborative research, contributed from his involvement with Hildegard scholarship. All three authors contributed equally to the end result.

Funding

This work was supported by the Research Foundation – Flanders, of which both Sara Moens and Mike Kestemont are fellows, and by the Flemish Hercules Foundation, which finances the project ‘Sources from the Medieval Low Countries (SMLC)’, directed by Jeroen Deploige.

References

Argamon, S. (2008). Interpreting Burrows’s Delta: geometric and probabilistic foundations. Literary and Linguistic Computing, 23(2): 131–47.
Binongo, J. (2003). Who wrote the 15th book of Oz? An application of multivariate analysis to authorship attribution. Chance, 16(2): 9–17.
Binongo, J. and Smith, W. (1999). The application of principal components analysis to stylometry. Literary and Linguistic Computing, 14(4): 446–66.
Bird, S., Klein, E., and Loper, E. (2009). Natural Language Processing with Python. Analyzing Text with the Natural Language Toolkit. Sebastopol: O’Reilly.
Burrows, J. (1987). Computation into Criticism. A Study of Jane Austen’s Novels and an Experiment in Method. Oxford: Clarendon Press.
Burrows, J. (2002). ‘Delta’: a measure of stylistic difference and a guide to likely authorship. Literary and Linguistic Computing, 17(3): 267–87.
Cerquiglini, B. (1999). In Praise of the Variant: A Critical History of Philology. Baltimore: JHU Press.
Chrupala, G., Dinu, G., and van Genabith, J. (2008). Learning morphology with Morfette. Proceedings of the International Conference on Language Resources and Evaluation, LREC 2008. Marrakech, Morocco: European Language Resources Association, pp. 2362–7.
Coakley, J. (2006). Women, Men and Spiritual Power: Female Saints and Their Male Collaborators.
New York: Columbia University Press.
Constable, G. (1976). Letters and Letter-collections. Turnhout: Brepols.
Delehaye, H. (1889). Guibert, abbé de Florennes et de Gembloux, XIIe et XIIIe siècles. Revue des Questions Historiques, 46: 5–90.
Deploige, J. (1998). In Nomine Femineo Indocta. Kennisprofiel en Ideologie van Hildegard van Bingen (1098–1179). Hilversum: Verloren.
Deploige, J. (2005). Anonymat et paternité littéraire dans l’hagiographie des Pays-Bas Méridionaux (ca. 920 – ca. 1320). Autour du discours sur l’‘original’ et la ‘copie’ hagiographique au Moyen Âge. In Renard, E., Trigalet, M., Hermand, S., and Bertrand, P. (eds), Scribere Sanctorum Gesta. Recueil d’études d’hagiographie médiévale offert à Guy Philippart. Turnhout: Brepols, pp. 77–107.
Deploige, J. and Moens, S. (eds), Visio de Sancto Martino et Visio ad Guibertum missa. In Deploige, J., Embach, M., Evans, C., Gärtner, K., and Moens, S., Hildegardis Bingensis opera minora. Pars secunda. Turnhout: Brepols, forthcoming.
Derolez, A. (1972). The genesis of Hildegard of Bingen’s Liber divinorum operum. The codicological evidence. In Gumbert, J.P. and De Haan, J.M. (eds), Litterae Textuales. Essays Presented to Gerard I. Lieftinck. II: Texts & Manuscripts. Amsterdam: Van Gendt, pp. 23–33.
Derolez, A. (ed.) (1988–1989). Guiberti Gemblacensis epistolae: quae in codice B.R. BRUX. 5527-5534 inveniuntur. Turnhout: Brepols.
Derolez, A. and Dronke, P. (eds) (1996). Hildegardis Bingensis Liber Divinorum Operum. Turnhout: Brepols.
Dronke, P. (1998). The allegorical world-picture of Hildegard of Bingen: revaluations and new problems. In Burnett, C. and Dronke, P. (eds), Hildegard of Bingen: The Context of Her Thought and Art. London: The Warburg Institute.
Eder, M. (2010). Does size matter? Authorship attribution, small samples, big problem. Digital Humanities 2010. Conference Abstracts.
King’s College London, pp. 132–5.
Eder, M., Kestemont, M., and Rybicki, J. (2013). Stylometry with R: a suite of tools. Digital Humanities 2013. Conference Abstracts. University of Nebraska-Lincoln, pp. 487–89.
Embach, M. (2003). Die Schriften Hildegards von Bingen. Berlin: Akademie Verlag.
Ferrante, J. (1998). Scribe quae vides et audis. Hildegard, Her Language, and Her Secretaries. In Townsend, D. and Taylor, A. (eds), The Tongue of the Fathers. Gender and Ideology in Twelfth-Century Latin. Philadelphia: University of Pennsylvania Press, pp. 102–35.
Herwegen, I. (1904). Les collaborateurs de Ste. Hildegarde. Revue Bénédictine, 21: 192–204; 302–15; 381–403.
Juola, P. (2006). Authorship attribution. Foundations and Trends in Information Retrieval, 1(3): 233–334.
Kestemont, M. and Van Dalen-Oskam, K. (2009). Predicting the past: memory-based copyist and author discrimination in medieval epics. In Calders, T., Tuyls, K., and Pechenizkiy, M. (eds), Proceedings of BNAIC 2009. Eindhoven: Benelux Association for Artificial Intelligence, pp. 121–8.
Kestemont, M., Daelemans, W., and De Pauw, G. (2010). Weigh your words—memory-based lemmatization for Middle Dutch. Literary and Linguistic Computing, 25(3): 287–301.
Kilgarriff, A. (2001). Comparing corpora. International Journal of Corpus Linguistics, 6(1): 49–66.
Klaes, M. (ed.) (2001). Hildegardis Bingensis Epistolarium. Pars III. Turnhout: Brepols.
Köhler, R. (2005). Synergetic linguistics. In Köhler, R., Altmann, G., and Piotrowski, R. G. (eds), Quantitative Linguistik/Quantitative Linguistics. Ein Internationales Handbuch/An International Handbook. Berlin, New York: Walter de Gruyter, pp. 760–75.
Koppel, M., Schler, J., and Argamon, S. (2009). Computational methods in authorship attribution. Journal of the American Society for Information Science and Technology, 60(1): 9–26.
Leclercq, J. (1962). Saint Bernard et ses secrétaires. In Recueil d’études sur Saint Bernard et ses écrits, Vol. 1.
Rome: Edizioni di storia e letteratura, pp. 3–25.
Leclercq, J. (1987). Lettres de S. Bernard: histoire ou littérature? In Recueil d’études sur Saint Bernard et ses écrits, Vol. 4. Rome: Edizioni di storia e letteratura, pp. 125–225.
Leclercq, J. and Rochais, H. (eds) (1974–1977). Epistolae. In Sancti Bernardi opera, Vols 7–8. Rome: Editiones cistercienses.
Leclercq, J., Talbot, C. H., and Rochais, H. (eds) (1957–1977). In Sancti Bernardi opera. Rome: Editiones cistercienses.
Luyckx, K. and Daelemans, W. (2011). The effect of author set size and data size in authorship attribution. Literary and Linguistic Computing, 26(1): 35–55.
Moens, S. (2010). Twelfth-century epistolary language of friendship reconsidered. The case of Guibert of Gembloux. Revue belge de Philologie et d’Histoire, 88(4): 983–1017.
Mohrmann, C. (1958). Observations sur la langue et le style de saint Bernard. In S. Bernardi opera, Vol. 2. Rome: Editiones cistercienses, pp. IX–XXXIII.
Newman, B. (1987). Sister of Wisdom. St. Hildegard’s Theology of the Feminine. LA: University of California Press.
Newman, B. (ed.) (1998). Voice of the Living Light: Hildegard of Bingen and Her World. LA: University of California Press.
Nichols, S. (1997). Why Material Philology? Some Thoughts. Zeitschrift für deutsche Philologie, 116: 10–30.
Passarotti, M. and Dell’Orletta, F. (2010). Improvements in Parsing the Index Thomisticus Treebank. Revision, Combination and a Feature Model for Medieval Latin. In Calzolari, N., Choukri, K., Maegaard, B., Mariani, J., Odijk, J., Piperidis, S., Rosner, M., and Tapias, D. (eds), Proceedings of the International Conference on Language Resources and Evaluation, LREC 2010, 17–23 May 2010. Valletta: European Language Resources Association, pp. 1694–71.
Pennebaker, J. (2011). The Secret Life of Pronouns. What our Words Say About Us. NY: Bloomsbury.
Petrie, K., Pennebaker, J., and Sivertsen, B. (2008).
Things we said today: a linguistic analysis of the Beatles. Psychology of Aesthetics, Creativity, and the Arts, 2(4): 197–202.
Pitra, J. B. (1882). Analecta Sacra et Classica Spicilegio Solesmensi Parata, Vol. 8. Paris: A. Jouby et Roge.
Piotrowski, M. (2012). Natural Language Processing for Historical Texts. California: Morgan & Claypool Publishers.
Pranger, B. (2011). Bernard the Writer. In McGuire, B.P. (ed.), A Companion to Bernard of Clairvaux. Leiden: Brill, pp. 220–48.
Reynolds, N., Schaalje, G., and Hilton, J. (2012). Who wrote Bacon? Assessing the respective roles of Francis Bacon and his secretaries in the production of his English Works. Literary and Linguistic Computing, 27(4): 409–25.
Rigg, A. (1996). Orthography and pronunciation. In Mantello, F. and Rigg, A. (eds), Medieval Latin: An Introduction and Bibliographical Guide. Washington: The Catholic University of America Press, pp. 79–82.
Rybicki, J. and Eder, M. (2011). Deeper delta across genres and languages: Do we really need the most frequent words? Literary and Linguistic Computing, 26(3): 315–21.
Sapir, E. (1921). Language: An Introduction to the Study of Speech. New York: Harcourt, Brace & Co.
Schinke, R., Greengrass, M., Robertson, A. M., and Willett, P. (1996). A stemming algorithm for Latin text databases. Journal of Documentation, 52(2): 172–87.
Schmid, H. (1994). Probabilistic part-of-speech tagging using decision trees. Proceedings of the International Conference on New Methods in Language Processing. Manchester, UK.
Schrader, M. and Führkötter, A. (1956). Die Echtheit des Schrifttums der heiligen Hildegard von Bingen. Quellenkritische Untersuchungen. Cologne–Graz: Böhlau Verlag.
Stamatatos, E. (2009). A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology, 60(3): 538–56.
Van Acker, L. (1989). Der Briefwechsel der heiligen Hildegard von Bingen. Vorbemerkungen zu einer kritischen Edition. Revue Bénédictine, 99: 118–54.
Van Acker, L. (ed.) (1991–1993). Hildegardis Bingensis Epistolarium. Turnhout: Brepols.

Notes

1. Among the letters written with the help of Volmar, we count those in MS Wien, Österreichische Nationalbibliothek, 963 (theol. 348), which offers a copy of a collection compiled by Volmar before 1173 (Van Acker, 1991, p. XXVI), and the limited number of letters that can be found distributed over MS Stuttgart, Württembergische Landesbibliothek, Cod. theol. phil. 4° 253; MS Wien, Österreichische Nationalbibliothek, 881; MS Berlin, Staatsbibliothek Preussischer Kulturbesitz, Cod. theol. lat. fol. 699; MS London, British Library, Cod. Add. 17292; MS Paris, Bibliothèque Nationale, Nouv. Acquis. Lat. 760; MS Trier, Stadtbibliothek, Cod. 771/1350 and MS Kynžvart, Cod. 40. Among the letters compiled and edited under Guibert’s supervision, we count those in the Riesenkodex Wiesbaden, Landesbibliothek, 2 (dating from 1177–1179/1180), that are not also found in MS Wien, Österreichische Nationalbibliothek, 963 (theol. 348) (Van Acker, 1991, p. XXVII), as well as those copied in MS Berlin, Staatsbibliothek Preussischer Kulturbesitz, Cod. lat. 4° 674, which bear traces of Guibert’s editorial assistance (Klaes, 2001, p. XVII). The latter group, compiled under Guibert’s supervision, obviously includes all of Hildegard’s letters addressed to Guibert and those written during the years in which he stayed at Rupertsberg.
2. MSS Brussels, Royal Library, 5397–5407 and 5527–5534 (both originating from Gembloux, early thirteenth century) and MS Brussels, Royal Library, 1510–1519 (originating from Sint-Maartensdal near Louvain, fifteenth century).
3. See www.brepolis.net. The critical editions of the works of both Hildegard of Bingen and Guibert of Gembloux are published in several volumes in Brepols’s own Corpus Christianorum series. For the works of Bernard, the Brepols Library of Latin Texts relies on Leclercq et al.
(1957–1977).
4. Bernard’s letters, edited by Leclercq and Rochais (1974–1977), contain the ‘official’ epistolarium, compiled shortly after Bernard’s death, as well as letters transmitted elsewhere. Guibert’s letters were edited by Derolez (1988–1989) on the basis of MS Brussels, Royal Library, 5527–5534.
5. See note 1. Hildegard’s letters are edited by Van Acker (1991–1993) and by Klaes (2001).
6. We supplemented this list with three words (plerumque, utrumque, and quicumque), yet did not allow any of these items into the restrictive set of function words we list below. We did not consider other, much less frequent clitics (e.g. –ne (‘if’) or –ve/ue (‘or’)), because it is difficult to detect these automatically using a simple rule-based approach and to distinguish them from e.g. the –ne in deuotione or the –ue in serue.
7. We have described our approach in a generic way for future reference. It should be noted, however, that there still remain a small number of possible spelling variants in medieval Latin that are hard to deal with but that were not relevant for the present research because we worked with critical editions that have already normalized orthography to a large extent. One can think here of the interchangeability of –mqu– and –nqu– in some words and the problem of single/double consonants (as e.g. in litera and littera). A less frequent, yet still important, orthographical variant that we leave unaddressed is (–)exs– versus (–)ex–, because it is difficult to detect it automatically using a rule-based approach. Nevertheless, this variant hardly affects any of the function words to which we have restricted our analyses.
8. In these training data too, we have replaced every v with u and every ae/oe with e.
9.
Note that licet, which strictly speaking derives from the impersonal verb licere, is considered a function word because it is primarily used as a subordinating concessive conjunction.
10. Other errors in the lemmatization displayed in Table 3 are ‘hildegars’, ‘us’, and ‘ta’.
11. Note that from this point onwards, we will express the size of textual samples in terms of the number of consecutive lemmatized words they contain (a number which, after tokenization, need not be identical to the original number of surface forms in the original texts).
12. For the sake of conceptual clarity we shall keep Pennebaker’s original terminology, although it should be stressed that our present use of the term ‘Synergy Hypothesis’ is completely unrelated to the concept of ‘Synergetic Linguistics’ in the field of quantitative linguistics (Köhler, 2005).
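
As an illustration only, the rule-based steps mentioned in notes 6 and 8 can be sketched in Python. This is not the code used for the analyses in this article. The function names, the tokenization rule, the order of the two steps, and the reading of the list in note 6 as a set of words whose final -que is left attached are all assumptions made for the example; the exception set below contains only the three words named in note 6, not the full list.

```python
import re
from typing import List

# Hypothetical exception set: only the three words named in note 6.
# The full list used in the article is not reproduced here.
QUE_EXCEPTIONS = {"plerumque", "utrumque", "quicumque"}


def normalize(token: str) -> str:
    """Lowercase a token, then map v -> u and ae/oe -> e (cf. note 8)."""
    token = token.lower().replace("v", "u")
    return re.sub(r"ae|oe", "e", token)


def split_que(token: str) -> List[str]:
    """Split off a final -que unless the token is a listed exception.

    Assumption: every other token ending in -que carries the enclitic.
    """
    if len(token) > 3 and token.endswith("que") and token not in QUE_EXCEPTIONS:
        return [token[:-3], "que"]
    return [token]


def preprocess(text: str) -> List[str]:
    """Tokenize on runs of letters, normalize spelling, then split -que."""
    tokens = re.findall(r"[^\W\d_]+", text)
    result: List[str] = []
    for token in tokens:
        result.extend(split_que(normalize(token)))
    return result
```

Under these assumptions, `preprocess("Vidi caelum terramque")` yields `['uidi', 'celum', 'terram', 'que']`, whereas `utrumque` is left intact. The clitics –ne and –ve/ue are deliberately not handled, for the reason given in note 6: a rule this simple cannot distinguish them from word-final –ne and –ue in forms such as deuotione or serue.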