intertextuality_by_meaning_preprint


	   1	  

NOTE: THIS IS A PRE-PRINT DRAFT VERSION. The published version contains 
several editorial changes. Interested readers are advised to consult the forthcoming 
version of this paper in LLC ©: 2014. Published by Oxford University Press. All rights 
reserved. 

 
The Sense of a Connection: Automatic Tracing of Intertextuality by Meaning 

Walter J. Scheirer 

Harvard University 

Chris Forstall 

University of Geneva 

Neil Coffee 

State University of New York at Buffalo


	   2	  

The Sense of a Connection: Automatic Tracing of Intertextuality by Meaning 

 
1 Introduction 

The recognition that poetic texts are often significantly linked to their predecessors 

through shared or similar language has been an important part of the reading and study of 

literature since antiquity. More recently, however, scholarly interest has broadened 

beyond the verbatim reuse of specific phrases1 to take in the great scope and subtlety of 

intertextual connections2. The term intertext, coined by Julia Kristeva (1986, p. 37), has 

come to be used widely in literary studies to indicate linguistic similarity that, in 

presenting to the reader a marked connection between two texts, generates new meaning 

or novel stylistic effects. Emerging digital methods now make it possible to trace this sort 

of intertextuality with some success. Most typically, computational approaches search for 

the type of lexical correspondences that can be loosely described as paraphrase3. The 

process closely resembles the manual identification of intertextuality still commonly 

practiced by literary scholars, including the writers of commentaries (Coffee et al., 2012). 

   This same work has demonstrated, however, that meaningful connections between texts 

occur not only via lexical similarity but also through a broader similarity of meaning in 

the absence of words that have the same form or stem4. The classicist Stephen Hinds 

describes this phenomenon as a “poetic of corresponding inexactitude, which draws on 

but also distances itself from the rigidities of philological and intertextualist 

fundamentalisms alike” (Hinds, 1998, p. 50). One study indicated that, among a certain 

set of meaningful parallels between two texts, some 33% were made up by similarity of 

meaning in the absence of more than one shared word (Coffee et al., 2012, p. 415). 


	   3	  

   Quite remarkably, human readers are rather adept at identifying text reuse when faced 

with such “inexactitude,” where a predefined formula for lexical matching drawn from 

textual criticism would simply fail. For instance, consider the following lines from the 

Roman poet Lucan, which, in an epic simile, characterize the once-great general Pompey 

on the eve of the Roman Civil War as a tottering, but still venerated, oak5: 

 
qualis frugifero quercus sublimis in agro, 

exuvias veteres populi sacrataque gestans 

dona ducum, nec iam validis radicibus haerens, 

pondere fixa suo est; nudosque per aera ramos 

effundens, trunco, non frondibus, efficit umbram; 

et quamvis primo nutet casura sub Euro, 

tot circum silvae firmo se robore tollant, 

sola tamen colitur.  

(Lucan, Civil War 1.136–143) 

 
Just as a lofty oak in a productive field, bearing the ancient spoils and consecrated 

gifts of leaders, but no longer clinging with healthy roots, is fixed in place by its 

own weight; and spreading out bare branches through the air, it casts a shadow 

from its trunk rather than its leaves; and, although it sways, ready to fall at the 

first easterly wind, while so many of the surrounding trees bear themselves up on 

sturdy hardwood, it alone is honored. 

Commentators6 on this poem have noted intertextual connections to several passages in 

Vergil’s Aeneid. Among them is another simile, this one comparing the doomed city of 


	   4	  

Troy, finally penetrated by the besieging Greeks, to a moribund ash-tree toppled by 

industrious peasants: 

 
ac veluti summis antiquam in montibus ornum 

cum ferro accisam crebrisque bipennibus instant 

eruere agricolae certatim,—illa usque minatur 

et tremefacta comam concusso vertice nutat, 

volneribus donec paulatim evicta, supremum 

congemuit, traxitque iugis avolsa ruinam. 

(Vergil, Aeneid 2.626–531) 

 
Just as when farmers vie to uproot an ancient ash-tree high in the mountains, 

hacked at with a rain of blows from their iron axes—it keeps threatening to fall, 

and, with its foliage trembling, its crown shaken, it sways, until, overcome little 

by little with its wounds, uprooted from the ridge, it at last gives a groan and 

heaves forward its own collapse. 

 
As readers, how do we recognize that these two texts are related, when they share just 

one distinctive word, “sway” (nuto)7? We see a resemblance of theme: both texts describe 

a tottering old tree. Both passages also share a narrative function. In each case the 

tottering tree foreshadows the downfall of a hitherto stalwart bastion: the Trojan citadel 

in the Aeneid, the republican general Pompey in the Civil War. Indeed, the two events 

appear intertextually connected: the capture and destruction of mythological Troy 

anticipates the historical defeat and death of the Roman leader8. 


	   5	  

  Theme, narrative structure, historical and mythical events: the ability of poetic language 

to forge connections simultaneously among such different sign-systems is precisely what 

Kristeva’s original broad notion of intertextuality as “an intersection of textual surfaces” 

(Kristeva, 1986, p. 37) was meant to encompass. This view of intertextuality leads us to 

think of words, even different but related ones, as part of a continuum of reuse and 

repurposing, and so to see in our examples of epic collapse, and countless others, the 

potential for thematic material from one context to be redeployed in another to new 

effect. 

   Given the complexity of literary meaning that arises when readers encounter such 

instances of intertextuality, how can we capture it adequately with a computer model? 

What we need, in the words of Hinds, is a “fuzzy logic” that is flexible enough to identify 

highly inexact matches often based in thematic similarity (Hinds, 1998, p. 50). The 

technique we employ for identifying such semantic intertextuality is the popular natural 

language processing strategy of semantic analysis. Algorithms for semantic analysis are 

typically designed around the notion of word co-occurrence. That is, they start from the 

assumption, possibly counterintuitive but well-demonstrated, that words that occur in the 

same contexts have related meanings. This assumption, coupled with the cognitive 

matching process described above, motivated the design of Latent Semantic Indexing 

(LSI), an early and still powerful approach (Deerwester et al., 1990). The use of 

algorithms for semantic analysis, including topic modeling (Blei, 2011), has spread from 

the practical applications of natural language processing to become a popular tool for 

literary studies among digital humanists. Recent work has used semantic analysis to 


	   6	  

distinguish between genres, produce an algorithmic historiography of classical 

scholarship, and characterize sentiment in political writing9.         

   These types of tasks fall into what Jockers terms macroanalysis (Jockers, 2013), which 

applies the tools of machine learning to collect quantifiable evidence of literary 

phenomena over large corpora, which might consist of the collected works of an author, 

whole genres, and entire literatures. When instead used for close reading, semantic 

analysis has the potential to reveal the characteristics and behavior of the language 

elements that participate in intertextual connections. In this work, we are concerned with 

texts from antiquity where intertextuality takes the form of similar small phrases or 

passages, as opposed to corpora of large documents where semantic analysis is more 

commonly applied. 

   Let us begin with an example of how semantic analysis can be applied to this sort of 

small collection. Consider the following lines of Latin from Lucan’s Civil War as a 

simple corpus: 

 
1. bella per Emathios plus quam civilia campos iusque datum sceleri canimus 

2. post Cilicasne vagos, et lassi Pontica regis proelia, barbarico vix consummata veneno, 

ultima Pompeio dabitur provincia Caesar 

3. sed non in Caesare tantum nomen erat, nec fama ducis: sed nescia virtus stare loco: 

solusque pudor, non vincere bello 

4. turba minor ritu sequitur succincta Gabino, Vestalemque chorum ducit vittata sacerdos, 

Troianam soli cui fas vidisse Minervam 


	   7	  

5. certe populi, quos despicit Arctos, felices errore suo, quos ille timorum maximus haud 

urget, leti metus 

6. quodque (nefas) nullis inpune apparuit extis, ecce, videt capiti fibrarum increscere 

molem alterius capitis 

7. iam gelidas Caesar cursu superaverat Alpes, ingentisque animo motus, bellumque 

futurum ceperat. ut ventum est parvi Rubiconis ad undas 

8. rupta quies populi, stratisque excita iuventus deripuit sacris adfixa penatibus arma, 

quae pax longa dabat 

9. non, si tumido me gurgite Ganges summoveat, stabit iam flumine Caesar in ullo, post 

Rubiconis aquas 

 
Now suppose that we would like to find which of the above lines, if any, have some 

thematic similarity to the phrase Rubiconis aquas (“the waters of the Rubicon”). Given 

that Caesar is famously associated with crossing the Rubicon, if a semantic analysis 

approach were effective, we would expect a search for thematic material related to the 

Rubicon to turn up phrases in which Caesar appears. To test this hypothesis, we applied 

LSI to search for content similar to Rubiconis aquas. An in-depth look at the LSI 

algorithm, including a description of the relevant parameters, follows in the next section. 

For now, let us simply consider the top three results returned when we perform this test 

with an LSI approach, using two topics and cosine distance: 

 
1. post Cilicasne vagos, et lassi Pontica regis 

proelia, barbarico vix consummata veneno, ultima 

After defeating roving Cilician pirates and after 

battles on the Black sea with the fading king, 


	   8	  

Pompeio dabitur provincia Caesar . . . ?  

(Civil War 1.336-8)
 

scarcely ended by barbaric poison, 

will Caesar now be handed over to Pompey as his 

last charge? 

 
2. iam gelidas Caesar cursu superaverat Alpes, 

ingentisque animo motus bellumque futurum 

ceperat. ut ventum est parvi Rubiconis ad undas. 

(Civil War 1.183-5)
 

Already Caesar had overcome the frozen Alps with 

speed, and in his heart he had anticipated the great 

upheavals and war to come, when he arrived at the 

waters of the slender Rubicon. 

 
3. sed non in Caesare tantum nomen erat, nec fama 

ducis: sed nescia virtus stare loco: solusque pudor, 

non vincere bello. 

(Civil War 1.143-5)
 

But Caesar had not only a name and renown as a 

general, but also a courage incapable of standing 

still, and shame only at conquering without war.
 

As the results indicate, the test was successful: the search for thematic content similar to 

“waters of the Rubicon” turned up passages referring to Caesar as the top three results. In 

one of these phrases, the search for meanings similar to those of Rubiconis aquas 

detected mention of the Rubicon itself, along with Caesar, but the two others did not. The 

results also show substantial precision. The algorithm did not recall everything related to 

Caesar, but only hits rich in the martial language that also co-occurs with the word 

Rubicon, an emblem of the civil war. 

   This simple test suggests that material likely to be thematically associated in the mind 

of the reader (Caesar and Rubicon) can also be identified through semantic analysis. The 

remainder of this article will address in greater detail a more complicated task. Whereas 

we have just demonstrated a search that finds passages matching a phrase, we turn now to 


	   9	  

detecting semantic similarity between two whole passages. The goal of this sort of search 

is that the reader interested in finding instances of textual similarity absent verbal 

repetition will ultimately not need to input a search term, as we have just done, but will 

be able to simply search all passages of one given work against all of those in another. 

   With this basic understanding of our goals and approach in place, we can summarize 

the contributions we describe in the remainder of this article: 

1. A methodology for applying semantic analysis to the problem of detecting 

instances of intertextuality without strict lexical correspondence (Sec. 2). 

2. An extensive experimental analysis that compares the results of semantic analysis 

to human analysis, i.e. scholarly commentaries that compare two texts (Sec. 3).  

3. A publically accessible web tool that allows non-experts to apply our semantic 

analysis methodology to a large corpus of Latin writers (Sec. 4). 

4. The discovery of thematic matches between Lucan’s Civil War and Vergil’s 

Aeneid not previously recorded by commentators that were detected by our tool 

(Sec. 4). 

2 Methods 

2.1 LSI Approach 

To find the passages that best match a particular query phrase by context, we need to not 

only generate a semantic model, but also assess similarity within that model space. For 

this purpose, we chose to use the LSI module of the Gensim10 framework (Rehurek and 

Sojka, 2010) in a custom Python program. The underlying algorithm performs a 

transformation on a set of document vectors to draw out latent structure in the corpus, and 

to reduce dimensionality for computational efficiency. This is accomplished via Singular 


	   10	  

Value Decomposition (SVD), a matrix factorization technique in linear algebra. A 

similarity search is then performed in the resulting low rank transformation space. 

   In order to provide enough contextual information for the models and still keep the 

input highly localized to specific phrases, a window of approximately 500 characters 

around and including a target line of text was always selected to form a passage 

considered a “query.” Similarly, a window of approximately 1,000 characters around and 

including a line from the text we wanted to match against formed a passage considered a 

“document.” Note that each line from the text was used as a basis to create a document, 

resulting in a large measure of overlap between documents as the window moved across 

the text. A collection of all such documents from a text represents a training corpus. 

During pre-processing, the most common 250 words from the Tesserae corpus11 were 

removed from consideration. This list contains function words, as well as the most 

common nouns, verbs, adjectives, and adverbs. Each passage was then processed into a 

bag-of-words representation, with the inflected form of each word replaced with the set 

of all possible stems. This was done in lieu of typical lemmatization to increase the 

amount of text available for training (see the discussion of small sample sizes below). 

   Each LSI model for a corpus was trained using a user-specified number of topics (i.e. 

the dimensions retained after SVD is applied by the algorithm). Similarity queries 

proceeded by projecting a query passage and a corpus into the transformed model space, 

and assessing cosine similarity between the query passage and each document in the 

corpus to produce a set of match scores (in the range -1 to 1, where a higher score 

indicates a better match). These scores were then sorted to provide a ranked list of 


	   11	  

potential matches. The source code for this algorithm is available publicly on Github as 

part of the Tesserae web tool12. 

   A mathematically inclined reader might ask why we opted for LSI instead of a more 

flexible topic modeling approach such as Latent Dirichlet Allocation (LDA) (Blei, 2003). 

During the course of this work, we evaluated several LDA implementations including the 

online learning technique provided by Gensim, and the efficient sampling-based 

implementation provided by MALLET (McCallum, 2002). For text samples as small as 

our passages, these algorithms were not numerically stable, i.e. they produced radically 

different match scores for the exact same input across multiple trials. This is a significant 

problem for the scholar attempting to search for instances of textual reuse with some 

degree of confidence. The cause is an artifact of random bootstrapping (i.e. initializing 

the algorithm with different random data each time it is run) with limited sampling. The 

minimum sampling of text required for the statistical estimators to converge is something 

greater than what we are providing – LDA is most typically applied to long-form 

documents and any implementation must make certain assumptions on its input. This is a 

key open issue in machine learning for the digital humanities: textual analysis for forms 

such as poetry, song, or epigraphy will nearly always involve small samples13. 

   Our testing revealed drift in only the least significant digit of the scores produced by 

Gensim’s LSI implementation14, giving us enough stability to reliably replicate our 

results over any number of trials. The sizes of the query and document passages 

described above were determined experimentally with numerical stability in mind. 500 

characters for the query and 1000 characters for the document represent the smallest 

passage sizes that form a highly localized window around their respective target lines 


	   12	  

(ensuring that matches are not too broad), while providing enough numerical stability for 

the LSI algorithm. 

   For comparison, we also considered a simpler semantic analysis approach without the 

rank lowering of the LSI algorithm on the same texts. Again using the Gensim module, 

we computed the cosine distance between just the bag-of-words representations for a 

query passage and each passage in the corpus to produce a second set of match scores. 

The goal of this comparison was to see what LSI adds beyond the basic language model. 

According to Deerwester et al. (1990), rank lowering helps us find all words that are 

related to each document. This is typically a much larger set than the plain bag-of-words 

representation because it accounts for synonymy across the corpus. If LSI is indeed 

exploiting the “semantic structure” of our corpus via low-rank approximation, we should 

observe better match scores for relevant parallels compared to this simple approach. 

 
2.2 Experiment Design 

   Our baseline for experimentation is the n-gram matching capability that forms the core 

of the Tesserae search engine, which is freely available on the web15. Briefly 

summarized, the matching algorithm operates in two distinct stages16. In the first stage, 

all instances where a given unit (e.g. verse line or phrase) in one text shares at least two 

words with another unit in a different text are identified. The words may be exact forms 

or lemmata. In the second stage, the candidate matches are ranked by the relative rarity 

and proximity of their shared words. The final result is a score that reflects the overall 

strength of the match, if some word reuse is present. 


	   13	  

   We validated our approach with reference to two Latin epic poems, Lucan’s Civil War, 

book 1, and Vergil’s Aeneid. Civil War 1 consists of 695 hexameter lines, while all of the 

Aeneid consists of 9,896 hexameter lines. These epics are generally considered to have a 

deep and remarkable intertextual relationship17. This relationship is attested in the work 

of scholarly commentators, who, as expert readers, document a variety of forms of 

intertextual relationship, among them instances of shared meaning. We therefore tested 

our results against a benchmark data set assembled by the Tesserae Project comprising all 

intertexts between the two texts recorded by four major commentaries. The ability of the 

algorithm to replicate commentator decisions is used as the measure of performance. 

   From previous experiments with the Tesserae search engine, we know that it is possible 

to identify the majority of known intertexts by searching for sentences that share two or 

more lemmata. In a test on a set of given samples, the word-based algorithm missed 35 of 

the commentator parallels, however, which accounted for 1/3 of the benchmark set. 

Analysis of such missed samples suggests that they consist wholly or partially of 

instances of similar meaning, without shared words (Coffee et al., 2012b, 2014). This 

subset of the overall benchmark represents a union of parallels described in the four 

commentaries. Of these, individual commentators identified 30 distinct parallels, while 

two commentators independently identified each parallel in the remaining five. With due 

allowance for the subjectivity of the commentators, the objective of this work was to see 

how many of the 35 missed intertexts could be recovered by automatic matching by 

semantic context rather than words. 

 
	   14	  

3 Lucan-Vergil Benchmark Results 

Our first test case was the following excerpt from book 1 of the Civil War, where Lucan 

uses metaphorical language to describe the abandonment of Rome by its military age men 

(left panel below).  

 
qualis, cum turbidus Auster  

reppulit a Libycis inmensum Syrtibus aequor  

fractaque veliferi sonuerunt pondera mali,  

desilit in fluctus deserta puppe magister 

nauitaque et nondum sparsa conpage carinae  

naufragium sibi quisque facit, sic urbe relicta 

in bellum fugitur. nullum iam languidus aevo 

evaluit revocare parens coniunxve maritum 

fletibus, aut patrii, dubiae dum vota salutis 

conciperent, tenuere lares; nec limine quisquam 

haesit et extremo tunc forsitan urbis amatae 

plenus abit visu: ruit inrevocabile volgus. 

o faciles dare summa deos eademque tueri 

difficiles!  

(Civil War 1.498 – 511) 

postquam res Asiae Priamique evertere gentem 

immeritam visum superis, ceciditque superbum 

Ilium et omnis humo fumat Neptunia Troia, 

diversa exsilia et desertas quaerere terras  

auguriis agimur divum, classemque sub ipsa 

Antandro et Phrygiae molimur montibus Idae,  

incerti quo fata ferant, ubi sistere detur, 

contrahimusque viros. vix prima inceperat aestas 

et pater Anchises dare fatis vela iubebat, 

litora cum patriae lacrimans portusque relinquo 

et campos ubi Troia fuit. feror exsul in altum 

cum sociis natoque penatibus et magnis dis. 

 
(Aeneid 3.1-12) 

 
Just as when the swirling south wind drives the vast 

sea back from the Libyan Syrtes, and the shattered 

mass of the mast, with its sail, groans, the helmsman 

abandons the stern and leaps into the waves; and 

though the fittings of the hull are not yet strewn 

After the gods saw fit to overturn the affairs of Asia 

and visit undeserved punishment on the race of 

Priam, after proud Ilium had fallen and all of Troy, 

built by Neptune, was a smoking ruin, we were 

driven by signs from the gods to seek exile far away 


	   15	  

apart, each sailor fashions his own personal 

shipwreck; so too they desert the city and flee into 

war. Parents, frail with age, cannot call back their 

sons, nor wives, by their tears, their husbands; nor 

the ancestral homes, so long as they place their 

hopes on an unlikely salvation. No one hesitated on 

his threshold, to depart, perhaps, with a final look, 

filled with the love of his city. The crowd rushed on, 

heedless. How easily the gods give everything, how 

little they care to preserve it. 

 
(Civil War 1.498 – 511) 

and find vacant lands. Near Antander and the 

mountains of Phrygian Ida we constructed a fleet, 

though we were unsure where the fates were taking 

us, where we were to settle, and we gathered our 

men together. Summer had only just begun and my 

father Anchises ordered us to give sail for our 

destiny. I wept as I left the shores and harbors of my 

fatherland, and the plains where once was Troy. I 

was cast, an exile, onto the high seas, together with 

my companions, my son, the spirits of my 

household and the great gods above. 

(Aeneid 3.1-12) 

 
The major theme of these lines is abandonment, in this case of the city of Rome, (sic urbe 

relicta in bellum fugitur), articulated in part through a simile of shipwreck (desilit in 

fluctus, puppe magister, naufragium).  

   These lines are thought to be richly intertextual with the Aeneid. The commentator Paul 

Roche, author of the most recent and extensive commentary on this part of Lucan’s epic, 

notes numerous parallels18 in lines 504-7 alone (italicized in the left panel above), 

particularly with book 2 of the Aeneid 2. But Roche also remarks on a similarity with part 

of Aeneid 3 that has no shared words, making it a good test for detecting resemblance of 

meaning alone. The relevant passage from Aeneid 3 comes at the opening of the book, 

where Aeneas begins the story of his wanderings. It is marked in italics in the right panel 

above. 


	   16	  

   This entire passage was in fact included in a top match returned by our algorithm for 

the comparison above between Aeneid 3 and the Lucan passage. Roche observes the 

contrast between Aeneas’s concern for his family in flight and the disregard for their 

families shown by Romans fleeing their city in Lucan’s epic (Roche 2009, pp. 504-7). 

Our LSI method responds to related themes over a longer stretch of text. As in Lucan’s 

description of citizens’ flight from Rome, in the opening of Aeneid 3 we find pronounced 

themes of abandonment (diversa exsilia et desertas quaerere terras, litora cum patriae 

lacrimans portusque relinquo) intermingled with naval imagery (classem, vela, portus). 

This thematic similarity creates a connection between the texts despite the absence of any 

significant lexical overlap of the kind targeted by Tesserae lexical search and other text 

reuse search engines. The infrequent words common to both texts are underlined above, 

illustrating that the passages share none of the compact, word-level n-grams typically 

picked out in scholarly commentaries19. The passages could, in theory, be identified as 

similar based upon this sparse collection of shared words, but only by a search so 

minimally restrictive as to produce a flood of results. Matching via semantic analysis thus 

brings us much more directly to the thematic resemblance identified by Roche. 

   Taking this approach further, we experimented with the LSI modeling to see how many 

of the 35 missing commentator parallels between Civil War book 1 and the Aeneid we 

could return in the top 50 results, on the assumption that this was a highly manageable 

number for scholars to check. Passages (queries) from book 1 of Civil War were matched 

against all passages (documents) found in individual books of the Aeneid, and the results 

were ranked in descending order by LSI score. This search involved setting one arbitrary 

parameter, the number of topics (or dimensions) into which the passages would be 


	   17	  

categorized by content. For our experiment, we evaluated each query at 10, 15 and 20 

topics, and reported the parameter at which a valid parallel was found. 

   To provide the reader with a more thorough analysis of the proposed approach, we also 

computed precision, the fraction of retrieved instances relevant to a valid parallel, for 

each result. This was done by counting the number of matches that contained text from a 

valid parallel and dividing by the 50 total matches we always considered to be 

candidates. Recall from Sec. 2.1 that our approach generates a large sampling of 

overlapping windows, meaning that it is possible to have multiple valid matches per 

search instance. This is a useful feature for a scholar, in that we have good coverage of 

the context surrounding a target line of interest from a set of windows that overlap, but 

not completely. We exploit this behavior in our user interface (described below in Sec. 4) 

to highlight relevant passages of text. 

   Of the 35 missing parallels, the LSI approach returned 12, listed in Table 1. Several of 

these results were ranked in the top five returned by the algorithm for a given number of 

topics, indicating very strong thematic links. One additional parallel also found by the n-

gram matching algorithm of the Tesserae search engine was returned as a rank-3 result. 

Comparing the methods of analysis, we found that lower ranks tend to be correlated with 

higher precision. 

   These results also provide a basis for comparing our LSI method with the alternative 

approach of cosine distance between bag-of-words representations. When testing the 

latter, we observed much higher ranks (indicating worse performance) and lower 

precision values for most of the parallels in Table 1. In many cases, the ranks fall outside 

of the top 50 results, and are not considered valid matches by the matching criteria of this 


	   18	  

paper. Scores produced by this simple model were also significantly lower than those 

generated by the LSI approach. In every instance LSI outperformed the simpler bag-of-

words approach. Thus, for this corpus, we can conclude that by making use of low-rank 

approximation to capture the broader synonymy of the corpus, the LSI approach yields 

stronger matches that appear higher up in the rank order. 

   This is not to say, however, that the simpler model has been rendered useless. Table 2 

lists an additional set of missing Civil War 1 – Aeneid commentator parallels found in the 

top-50 results returned by the cosine distance between bag-of-words representations. 

These parallels are not found by the LSI approach. Similar to the results in Table 1, we 

again observe higher ranks and lower precision values for each parallel – not a single one 

of these matches falls within the top-10 results. This indicates that even as a weak 

approach, the simple bag-of-words model could be useful in combination with other, 

more powerful approaches via fusion (using a reasonably intelligent score analysis 

algorithm) to improve the rank position of a match. We are investigating this possibility 

in our ongoing work.  

 
4 A New Tool for the Study of Intertextuality 
 
Based on the satisfactory benchmark results, we designed an accessible front-end to the 

proposed algorithm for more traditional scholars of the classics. Those interested in 

trying out the algorithm have free access to an easy-to-use web-based tool via the 

Tesserae Project website20. Figs. 1 and 2 show the interface, which provides simple drop-

down menus for all parameters (author, work, book and number of topics), and a point-

and-click mechanism to allow the user to explore the texts while reading. Scholars 


	   19	  

without significant training in machine learning will find this tool to be a convenient 

starting point for conducting studies related to intertextuality and semantics at a large 

scale. At the time of this writing, 61 different Latin poets and prose writers are available 

for comparison. 

   An important question is whether this tool (and the underlying LSI algorithm) can be 

useful in revealing new instances of text reuse. Ideally, the approach should produce 

results beyond those in our benchmark set that were noted by commentators but missed 

by lemma matching. To this end, we used our web interface to visualize other strongly 

matching passages between Civil War 1 and the Aeneid, using the lines from Civil War 1 

in Tables 1 & 2 as “targets” (i.e. queries) against the passages from all of the “source” 

books of the Aeneid. This search turned up significant thematic correspondences not 

recorded by commentators, listed in Table 3. These included another passage in the 

Aeneid that shares the themes of abandonment and the sea with the lines around Civil 

War 1.504 quoted above. Here sailors flee from the shores of Polyphemus, and the 

related words are concentrated in the first two lines: 

 
sed fugite, o miseri, fugite atque ab litore funem 

rumpite. 

nam qualis quantusque cavo Polyphemus in antro  

lanigeras claudit pecudes atque ubera pressat, 

centum alii curva haec habitant ad litora vulgo 

infandi Cyclopes et altis montibus errant.  

(Aeneid 3.639-44) 

But flee, you wretches, flee and slash the cables 

from the shore. For as great and tall as Polyphemus 

is who lives in his hollow cave, keeps wooly flocks, 

and milks their udders, a hundred such other 

monstrous Cyclopes live together along the curved 

shore, and wander the steep mountains. 

 
	   20	  

Other passages were related by different common themes. The LSI algorithm identified 

the following two passages as highly related, and both in fact describe the god Bacchus, 

though in almost entirely different terms (they share just one word, vertice):  

 
                         nam, qualis vertice Pindi  

Edonis Ogygio decurrit plena Lyaeo . . .  

(Civil War 1.674-5) 

nec qui pampineis victor iuga flectit habenis 

Liber, agens celso Nysae de vertice tigris 

(Aeneid 6.804-5) 

 
                    For just as a Thracian bacchant, filled  

with Theban Bacchus, rushes down from 

the summit of Mount Pindus . . . 

 
Nor did Bacchus, who in victory guides his chariot 

with reins of vine, leading his tigers from the 

summit of lofty Nysa, [traverse as much land as 

Augustus will rule]. 

 
We also found a similar correspondence between text surrounding Civil War 1.676 and 

Aeneid 4.300-3. This instance contained both identical words (qualis, per urbem) and 

(near) synonyms (attonitam ~ excita, urguentem ~ stimulant). 

   In sum, then, our employment of LSI proved successful for the needs of users, in that it 

can bring them swiftly to significant instances of semantic similarity not previously 

recorded. And as a computational method, in every case the LSI algorithm again 

outperformed the simpler bag-of-words approach. 

 
5 Discussion 
 
Our experiment demonstrates that LSI can be used to detect intertextual relationships of 

meaning where few or no words are shared by the two texts. The same approach can in 

principle be extended to discover common themes and generic material, though 


	   21	  

computational constraints currently make it impossible to conduct a rapid search for such 

material over very large-scale corpora. The distinction between intertext and non-

intertext has always been fundamentally a heuristic one21 that can shift and change. If this 

sort of searching can be brought to larger scales, it will likely begin to dissolve the border 

between the instances of intertextuality most frequently noted by scholars – tight verbal 

correspondences – and the traditional understanding of similarities of mood and theme. 

 
6 Acknowledgements 

This work was supported by the National Endowment for the Humanities [Start-Up Grant 

Award #HD-51570-12]. We thank Prof. Neil Bernstein of Ohio University, who provided 

valuable feedback on an early draft of this work. 

 
	   22	  

 
Table 1. List of missing Civil War 1 (BC) – Aeneid (AEN) commentator parallels found 
by the LSI approach in the top 50 results returned by the algorithm for each query. Both 
rank and precision are reported. An asterisk denotes a parallel outside the missing parallel 
set also found by the lexical matching algorithm of the Tesserae search engine. For 
comparison, the corresponding ranks and precision values are also provided for a cosine 
distance between bag-of-words representations for each parallel. In nearly every instance, 
LSI outperformed the simpler bag-of-words approach. Comparing rank to precision, it 
can be seen that lower ranks tend to be correlated with higher precision. 

BC 
Line 

AEN 
Line Shared Context Topics 

LSI 
Rank 

LSI 
Prec. 

BoW 
Rank 

BoW 
Prec. 

1.60 1.291 Destiny of Caesar; peace 10 3 0.18 86 0.00 
1.139 4.441 The blowing wind; tree 20 1 0.22 1 0.28 
1.141 2.626 The blowing wind; tree 15 1 0.32 45 0.00 
1.193 2.774 An apparition 20 33 0.08 47 0.00 
1.193 3.47 An apparition 15 42 0.06 145 0.00 
1.291 11.492 Horses 20 26 0.02 212 0.00 
1.490 11.142 Flight 15 17 0.22 23 0.08 
1.504 2.634 Abandonment 15 3* 0.52 70 0.00 
1.504 3.11 Abandonment; Navy 15 4 0.16 215 0.00 
1.673 2.199 Omens; terror 15 31 0.02 162 0.00 
1.676 4.68 Dido as Bacchant 15 1 0.40 2 0.08 
1.676 6.48 Prophecy 15 39 0.44 148 0.00 
1.695 6.102 Frenzied discussion 20 22 0.20 31 0.20 

 
Table 2. List of missing Civil War 1 (BC) – Aeneid (AEN) commentator parallels found 
in the top 50 results returned by the cosine distance between bag-of-words 
representations. These results are not found by the LSI approach. Compared to the LSI 
approach in a general sense, we find that the ranks tend to be much higher and precision 
much lower for this baseline, with no result in this table placing in the top 10 of those 
returned. Low precision scores are also observed for this experiment. 

BC Line AEN Line Shared Context BoW Rank 
BoW 
Prec. 

1.1 4.628 War 48 0.02 
1.8 12.313 Hostility 23 0.06 
1.226 4.624 Broken treaty 13 0.10 
1.226 12.435 Fortune 38 0.04 
1.674 4.300 Dido as Bacchant 28 0.02 
1.678 10.670 Questioning destination 17 0.16 
1.685 2.554 Decapitation; shore 45 0.08 

 
	   23	  

	  
Table 3. Additional thematic matches found between Civil War 1 (BC) – Aeneid (AEN) 
by the LSI approach. These include highly specific parallels (a Bacchant in the passage 
around Civil War 1.676 and Bacchus around Aeneid 6.809 and Aeneid 4.304), weaker 
parallels with some lexical correspondence (Civil War 1.1 and Aeneid 4.98, Civil War 
1.212 and Aeneid 1.647), and interesting contextual parallels (a metaphorical description 
of nautical abandonment around Civil War 1.291 and sailors fleeing the shores of 
Polyphemus around Aeneid 3.639). An asterisk denotes a parallel also found by the 
lexical matching algorithm of the Tesserae search engine. Ranks are also provided for a 
cosine distance between bag-of-words representations. As with the experiment shown in 
Table 1, our LSI method consistently outperformed the simpler bag-of-words approach. 

BC Line AEN Line Shared Context Topics LSI Rank 
BoW 
Rank 

1.1 4.98 War 15 1* 1 
1.141 2.252 The blowing wind 15 6 128 
1.291 11.291 Conquest 20 1 4 
1.353 1.647 City; nation 15 1* 26 
1.504 3.639 Abandonment; nautical imagery 20 2 129 
1.676 6.809 Bacchus 15 1 81 
1.676 4.304 Bacchus 10 36 295 

 
	   24	  

 
Fig. 1. The public web interface to the algorithm described in this article. Parameters are 
presented to the user as a series of drop-down menus. The user can click on any line in 
the “Target” frame, which will initiate the LSI matching process between the passage 
centered on the target line and all passages in the “Source” frame. The Tesserae Project’s 
entire Latin corpus is available for search. The simple interface allows scholars with 
minimal training in machine learning to conduct sophisticated studies of semantic 
intertextuality at a large scale. 

 
	   25	  

 
Fig. 2. An example of a match between Civil War 1.141 and Aeneid 2.262. The entire 
passage highlighted on the left represents the query centered on Civil War 1.141. To 
reduce visual clutter, we only highlight the lines matching passages are centered upon on 
the right. The matches provide the scholar with an indication of the general neighborhood 
where semantically similar text can be found. Color intensity in the right-hand frame 
indicates the strength of the match (brighter colors mean a stronger match). 

 
	   26	  

Notes 

1. In classical scholarship sometimes called loci similes, or “similar passages,” and 

typically consisting of (near) verbatim reuse of a two-word phrase. 

 
2. Two useful surveys of practices within classical philology can be found in Pucci (1998, 

ch. 1) and Schmitz (2002, ch. 5); see also Coffee (2012). 

 
3. Examples include: global linear models for assessing verse similarity in the New 

Testament Gospels (Lee, 2007); unsupervised detection of Greek quotation (Büchler et 

al., 2010), and hash coding to detect reuse and citations in Lautréamont and Balzac 

(Ganascia et al., 2013). More flexible sequence alignment approaches (Horton et al., 

2010; Roe, 2012; Wolff, 2012; Smith et al., 2013), inspired by related analysis techniques 

in genetics, are prevalent as well. Most closely aligned with the goals of this present work 

are the eTRACES (Bamman and Crane, 2008; Büchler et al., 2011; Büchler et al., 2013; 

Geßner et al., 2013) and Tesserae (Forstall et al., 2011; Forstall and Scheirer, 2012; 

Coffee et al., 2012a; Coffee et al., 2012b; Coffee et al., 2014) projects. 

 
4. Wills (1996, ch. 1) gives an extensive set of textual features that can serve as the basis 

for intertextual connections, with examples of each. 

 
5. Latin texts cited here are from the Perseus Digital Library (see also Note 11 below); 

translations are our own. 

 
	   27	  

6. For example, Paul Roche (2009, ad loc.). 

 
7. Excluding extremely common function words such as et (“and”) and in (“in/on”). 

 
8. The process of recognition does not necessarily proceed in such an orderly fashion, 

however, from the concrete to the abstract; rather, the alert reader is often sensitized to 

the possibility of intertext in advance. This potential for an intertextual relationship to 

prime the reader to see further connections is described in detail by Wills (1996, pp. 26–

27). In general terms, “a poetic sign signals first to the other signs within the poetic 

system . . . before signaling its specific sense in a precise context” (Conte 1986, p. 44). 

For example, Vergil himself seems already to have foreshadowed Pompey’s death in his 

description of the death of Priam, patriarch of the Trojans (Hinds 1998, p. 8). Lucan’s 

readers might well have recognized the intertext first on this basis and only subsequently 

(or never) noticed the reuse of nuto.  

 
9. Allison et al. (2012), Mimno (2012), and Nelson (2013), respectively. 

 
10. http://radimrehurek.com/gensim/intro.html 

 
11. The Tesserae Latin corpus currently comprises just under 250 texts, evenly divided 

between prose and verse, principally from the first century BCE to the third century CE. 

Most of these texts are sourced from the Perseus Digital Library 


	   28	  

(http://www.perseus.tufts.edu; G. Crane, Editor). For further information, see 

http://tesserae.caset.buffalo.edu/sources.php. 

 
12. https://github.com/tesserae 

 
13. On difficulties associated with small samples in literary applications of text analysis, 

see, e.g., Eder (2014). 

 
14. The Gensim module implements scalable truncated Singular Value Decomposition in 

Python to calculate the low-rank approximation of a matrix. While there is no specific 

property of LSI that makes it more suited to small corpora, this particular SVD solver is 

stable for small sample sizes, making it useful for the kinds of searches demonstrated 

here. With a lack of good alternatives, we recommend that other researchers consider this 

implementation for semantic analysis problems that are constrained to small sample sizes. 

 
15. http://tesserae.caset.buffalo.edu 

 
16. Additional detail can be found in the “Methodology” section of Coffee et al. (2014). 

 
17. The Lucan commentaries we consulted for this information were Heitland and 

Haskins (1887), Thompson and Bruére (1968), Viansino (1995), and Roche (2009). 

 
	   29	  

18. (Roche, 2009), 312-313 records parallels between Civil War 1.504-7 and Aeneid 

2.635f., 651-3, 657-70, 673-8, 707-25, 747-51; 3.11f.; 7.757f; 11.160f. Most are 

contrastive, evoking the difference between Aeneas’s concern for keeping his loved ones 

together while fleeing Troy, and the disregard for family ties shown by those fleeing 

Rome in Civil War. 

 
19. Not underlined are the function words cum, et, and in, which are extremely common 

and typically excluded by even the shortest stop lists. 

 
20. http://tesserae.caset.buffalo.edu/cgi-bin/lsa.pl 

 
21. So, for example, novelist and semiotician Umberto Eco, in reflecting on the various 

intertextual relationships between his own fiction and that of Jorge Luis Borges, notes 

links of several distinct types, including “cases where I was not aware of it, but 

subsequently readers . . . forced me to recognize that Borges had influenced me 

unconsciously,” as well as others in which a reminiscence of Borges in Eco’s writing is 

due rather to a mutual debt to “preceding sources and the universe of intertextuality.” 

(Eco 2002, p. 121).  

 
	   30	  

References 

Allison, S., Heuser, R., Jockers, M. L., Moretti, F., and Witmore, M. (2012). Quantitative 

Formalism: An Experiment. n + 1, 13: 81-108.  

Bamman, D. and Crane, G. (2008). The Logic and Discovery of Textual Allusion. In 

Proceedings of the Second Workshop on Language Technology for Cultural 

Heritage Data (LaTeCH 2008), Marrakesh, Morrocco. 

Blei, D. (2011). Probabilistic Topic Models, Comminications of the ACM, 55(4): 77-84. 

Blei, D., Ng, A., and Jordan, M. (2003). Latent Dirichlet Allocation. Journal of Machine 

Learning Research, 3: 993-1022.  

Büchler, M., Geßner, A., Berti, M., and Eckart, T. (2013). Measuring the Influence of a 

Work by Text Re-use. Bulletin of the Institute of Classical Studies Supplement, 

122: 63-79. 

Büchler, M., Crane, G., Mueller, M., Burns, P., and Heyer, G. (2011). One Step Closer 

To Paraphrase Detection On Historical Texts. Journal of the Chicago Colloquium 

on Digital Humanities and Computer Science, 1(3). 

Büchler, M., Geßner, A., Eckart, T., and Heyer, G. (2010). Unsupervised Detection and 

Visualization of Textual Reuse on Ancient Greek Texts. Journal of the Chicago 

Colloquium on Digital Humanities and Computer Science, 1(2). 

Coffee, N., Koenig, J.-P., Poornima, S., Forstall, C. W., Ossewaarde, R., and Jacobson, S. 

(2012a). The Tesserae Project: Intertextual Analysis of Latin Poetry. Literary and 

Linguistic Computing, 28(2): 221:228. 


	   31	  

Coffee, N., Koenig, J.-P., Poornima, S., Ossewarde, R., Forstall, C., and Jacobson, S. 

(2012b). Intertextuality in the Digital Age. Transactions of the American 

Philological Association, 142(2): 381-419. 

Coffee, N. (2012). “Intertextuality in Latin Poetry.” In Oxford Bibliographies in Classics. 

Ed. D. Clayman. New York, Oxford University Press. 

Coffee, N., Forstall, C., Buck, T., Roache, K., and Jacobson, S. (2014). Modeling the 

Scholars: Detecting Intertextuality through Enhanced Word-Level N-Gram 

Matching. To appear in Literary and Linguistic Computing, Pre-print available at: 

http://tesserae.caset.buffalo.edu/blog/wp-content/uploads/2012/10/Modeling-the-

Scholars-2013-11-6LLC-preprint1.pdf. 

Conte, G. B. (1986). The Rhetoric of Imitation: Genre and Poetic Memory in Virgil and 

Other Latin Poets. Translated by Charles Segal. Cornell University Press, Ithaca, 

New York. 

Deerwester, S., Dumais, S., Furnas, G., Landauer, T., and Harshman, R. (1990). Indexing 

by Latent Semantic Analysis. Journal of the American Society for Information 

Science 41(6): 391-407. 

Eco, U. (2002). Borges and My Anxiety of Influence. In, On Literature, pp. 118-135. 

Translated by Martin McLaughlin. Harcourt, Inc., Orlando FL. 

Eder, M. (2014). Does size matter? Authorship attribution, small samples, big problem. 

Literary and Linguistic Computing, forthcoming. Published online November 

2013, at http://llc.oxfordjournals.org/content/early/2013/11/14/llc.fqt066.full. 


	   32	  

Forstall, C. W. and Scheirer, W. J. (2012). Revealing hidden patterns in the meter of 

Homer’s Iliad. In Proceedings of the Chicago Colloquium on Digital Humanities 

and Computer Science, Chicago, Illinois. 

Forstall, C. W., Jacobson, S., and Scheirer, W. J. (2011). Evidence of Intertextuality: 

Investigating Paul the Deacon’s Angustae Vitae. Literary and Linguistic 

Computing 26(3): 285-296. 

Ganascia, J.-G., Glaudes, P., and DeLungo, A. (2013). Automatic Detection of Reuses 

and Citations in Literary Texts, In Proceedings of Digital Humanities, Lincoln, 

Nebraska. 

Geßner, A., Kötteritzsch, C., and Lauer, G. (2013). Biblical Intertextuality in the Digital 

World: The Tool GERTRUDE. In Proceedings of the 1st International Workshop 

on Collaborative Annotations in Shared Environment: Metadata, Vocabularies 

and Techniques in the Digital Humanities, Bologna, Italy. 

Heitland, W. E. and Haskins, C. E. (1887). M. Annaei Lucani Pharsalia. London: G. 

Bell. 

Hinds, S. (1998). Allusion and Intertext: The Dynamics of Appropriation in Roman 

Poetry. New York: Cambridge University Press. 

Horton, R., Olsen, M., and Roe, G. (2010). Something Borrowed: Sequence Alignment 

and the Identification of Similar Passages in Large Text Collections. Digital 

Studies / Le champ numérique, 2(1). 

Jockers, M. (2013). Macroanalysis: Digital Methods and Literary History. Champaign: 

University of Illinois Press. 


	   33	  

Kristeva, J. (1986). Word, Dialogue and Novel. In Moi, T. (ed.), The Kristeva Reader, 

New York: Columbia University Press, pp. 34-61. 

Lee, J. (2007). A Computational Model of Text Reuse in Ancient Literary Texts. In 

Proceedings of the 45th Annual Meeting of the Association of Computational 

Linguistics, Prague, Czech Republic, pp. 472-479. 

McCallum, A. K. (2002).  MALLET: A Machine Learning for Language Toolkit.     

http://mallet.cs.umass.edu (accessed 25 January 2014). 

Mimno, D. (2012). Computational Historiography: Data Mining in a Century of Classics 

Journals. Journal on Computing and Cultural Heritage 5(1). 

Nelson, R. K. (2011). Of Monsters, Men – And Topic Modeling. The New York Times. 

http://opinionator.blogs.nytimes.com/2011/05/29/of-monsters-men-and-topic-

modeling (accessed 25 January 2014). 

Rehurek, R. and Sojka, P. (2010). Software Framework for Topic Modeling with Large 

Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for 

NLP Frameworks, Valletta, Malta, pp. 46-50. 

Pucci, Joseph (1998). The Full-Knowing Reader: Allusion and the Power of the Reader 

in the Western Literary Tradition. Yale University Press, New Haven, CT. 

Roe, G. H. (2012). Intertextuality and Influence in the Age of Enlightenment: Sequence 

Alignment Applications for Humanities Research. In Proceedings of Digital 

Humanities, Hamburg, Germany. 

Roche, P., Ed. (2009). Lucan: De bello civili. Book 1. Oxford: Oxford University Press. 

Schmitz, Thomas A. (2002). Modern Literary Theory and Ancient Texts: An 

Introduction. Blackwell Publishing, Malden MA. 


	   34	  

Smith, D. A., Cordelly, R., and Dillony, E. M. (2013). Infectious Texts: Modeling Text 

Reuse in Nineteenth-Century Newspapers. In Proceedings of the IEEE Workshop 

on Big Data and the Humanities, Santa Clara, California. 

Thompson, L. and Bruére, R. T. (1968). Lucan’s Use of Vergilian Reminiscence. 

Classical Philology, 63: 1-21. 

Viansino, G., Ed. (1995). Marco Annaeo Lucano: La Guerra Civile (Farsaglia) libri I-V. 

Milan: Arnoldo Mondadori. 

Wills, Jeffrey (1996). Repetition in Latin Poetry: Figures of Allusion. Clarendon Press, 

Oxford. 

Wolff, M. (2012). Surveying a Corpus with Alignment Visualization and Topic 

Modeling. In Proceedings of the Chicago Colloquium on Digital Humanities and 

Computer Science, Chicago, Illinois.