Computer-supported collation of modern manuscripts: CollateX and the Beckett Digital Manuscript Project

Ronald Haentjens Dekker, Department of IT R&D, Huygens Institute for the History of the Netherlands, Royal Netherlands Academy of Arts and Sciences, The Netherlands
Dirk van Hulle, Department of Literary Studies, University of Antwerp, Antwerp, Belgium
Gregor Middell, Institut für Deutsche Philologie, Universität Würzburg, Würzburg, Germany
Vincent Neyt, Department of Literary Studies, University of Antwerp, Antwerp, Belgium
Joris van Zundert, Methodology Research Program, Huygens Institute for the History of the Netherlands, Royal Netherlands Academy of Arts and Sciences, The Netherlands

Correspondence: Joris van Zundert.

Abstract

Interoperability is the key term within the framework of the European-funded research project Interedition,1 whose aim is 'to encourage the creators of tools for textual scholarship to make their functionality available to others, and to promote communication between scholars so that we can raise awareness of innovative working methods'. The tools developed by Interedition's 'Prototyping' working group were tested by other research teams, which formulated strategic recommendations. To this purpose, the Centre for Manuscript Genetics (University of Antwerp), the Huygens Institute for the History of the Netherlands (The Hague), and the University of Würzburg have been working together within the framework of Interedition. One of the concrete results of this collaboration is the development and fine-tuning of the text collation tool CollateX.2 In this article, we would like to investigate how the architecture of a
digital archive containing modern manuscripts can be designed in such a way that users can autonomously collate textual units of their choice with the help of the collation tool CollateX and thus decide for themselves how efficiently this digital architecture functions—as an archive, as a genetic dossier, or as an edition. The first part introduces CollateX and its internal concepts and heuristics as a tool for digitally supported collation. How this tool can be integrated into the infrastructure of an electronic edition is discussed in part two. The third and final part examines the possibility of deploying CollateX for the collation of modern manuscripts by means of a test case: the Beckett Digital Manuscript Project (www.beckettarchive.org).

Correspondence: Joris van Zundert, Huygens Institute for the History of the Netherlands, Royal Netherlands Academy of Arts and Sciences, PO Box 90754, 2509 LT, The Hague, The Netherlands. Email: joris.van.zundert@huygens.knaw.nl

Digital Scholarship in the Humanities © The Author 2014. Published by Oxford University Press on behalf of EADH. All rights reserved. For Permissions, please email: journals.permissions@oup.com. doi:10.1093/llc/fqu007. Advance Access published December 2, 2014.

1 Computer-supported collation with CollateX

Following John Unsworth's textual scholarship workflow typology of scholarly primitives (Unsworth 2000), it is clear that text comparison is pivotal to any kind of textual scholarship. The role of text comparison becomes paramount in any scholarly editing project that involves critical enquiries about the edited text witnessed in multiple versions.
Conducting such a collation is tedious and error-prone work,3 especially because the required attention to detail is highly exacting compared with the repetitive and mechanical nature of the task. From that perspective, this type of work seems an ideal candidate for automation, not only because computers can support users in tedious, error-prone duties rather efficiently, but specifically because the number of versions is often so large that it is simply not feasible anymore to compare each witness against another manually.4 The application of computers or other apparatuses to support the collation of texts already has a long-standing tradition in and of itself (Smith 2000), reaching back at least to the usage of optomechanical devices like those pioneered by Charlton Hinman. Since then, the semiautomatic collation of texts has been a well-established area for the application of software, offering support in managing large text traditions, in comparing predetermined passages of different versions, as well as in storing and rendering the results. However, it continued to be the user's duty to orchestrate the whole process and guide the computer in comparing relevant passages by manually calibrating the complex input to make it fit rather basic comparison algorithms.5 Recent advancements in the field of computational biology, a field closely related when viewed from the computational perspective, resulted in renewed attempts to further the degree of automation achieved thus far in the comparison of natural language texts.6 Protein sequences—not unlike texts in natural language—can be modeled as sequences of symbols, whose differences can be understood as a set of well-defined editing operations (Levenshtein 1966), which transform one sequence into another and can be computed.
The analogy goes even further, as the consecutive evaluation of assumed editing operations between protein sequences on the one hand and texts on the other bears striking similarities, as they often provide the basis for further stemmatic analysis and genetic reasoning (Spencer and Howe, 2004). The only subtle but crucial problem with this analogy is that, while biologists can afford to leave aside methodological questions about the intentionality of assumed 'editing operations' on protein sequences, philologists cannot always base their reasoning on computed differences between texts. The prime example of this kind of difference is a transposition of two passages, which has been explicitly marked by the author in the manuscript (for instance, via arrows or a numbering scheme) and interpreted as such by the editor, but the intentionality of which cannot be computed deterministically by collation software, even if the transposition itself can be detected. In addition to this dilemma, there are numerous workflow-related challenges to surmount when proper integration of collation software with a digital editing environment becomes a concern. This makes computer-supported collation not only a computationally complex but also an architecturally challenging problem for software developers. As in the traditional trilogy of 'recensio', 'examinatio' (including 'collatio'), and 'emendatio' (see for instance Grafton et al., 2010, p. 506), the collation of digital texts is again central to the editorial workflow, with consecutive architectural dependencies on many adjacent building blocks of the editing environment.
The workflow may apply specific modeling and encoding of text versions, as well as possible automatic linguistic annotation like part-of-speech tagging, to be able to compare texts in a more expressive manner than on the plain character string level. This might be the case, for instance, where lemmatization is applied. Also, specific workflows might require the ability for a human to interact with the collation result or process. Certain idiosyncrasies of witnesses and their texts, for instance, might have to be modeled and/or encoded by human intervention where optical character recognition does not serve well, e.g. in the case of visual poetry; or researchers may need to intervene in the process of automatic linguistic annotation of textual versions to make them comparable in a more sensible way, for instance, in supervised learning methods or in cases of languages that are poorly supported by the current state of the art in natural language processing, such as medieval Dutch or neo-Latin; or it may be necessary to manually annotate the output of densely marked-up and interconnected text versions as a result of their comparison yielding differences previously unnoticed. Many newer approaches to the problem of collation offer interesting solutions to the computational challenge, but most of them do not fully address the architectural challenges, nor do they approach the problem as one which can only be solved semiautomatically given the methodological framework of its domain. Consequently, existing solutions either remain within the realm of decision-support systems, which mainly help scholars keep the overview while producing essentially handcrafted collation results and transforming them into a commented critical apparatus,7 or they automate the collation in a way that is tailored to a specific use case and/or runtime environment.
The general applicability of the latter approach can then only be approximated via quantitative properties of its specific input and the accuracy of the achieved results.8 In contrast, we would like to offer a third, rather pragmatic approach, in which we first dissect the problem of collation into smaller, more manageable subproblems and then show by example how each of these subproblems can be addressed in a way that is more fitting to its application domain and with a higher chance of applicability to the variety of requirements stipulated by the many different scholarly environments in which the collation of texts and its adjacent scholarly tasks have to be performed.

1.1 Comparing existing solutions

Our example for this approach is CollateX, a prototypical collation tool developed in the context of Interedition. Shortly after the project started, it became clear that a proper requirements analysis for a versatile collation tool would need input from as wide a range of stakeholders as possible, including users and interested developers as well as implementers of existing solutions. A collation summit and a collation workshop were held in Gothenburg and Brussels in 2009, co-organized by the European COST Action 32 'Open Scholarly Communities on the Web',9 which invited implementers of three collation tools—literary scholars, digital humanists, and developers of XML database software—to discuss conceptual commonalities between their fields of expertise as they relate to the collation of texts. The immediate result was agreement on a modularization of the digital collation process into a number of well-defined steps, which—if applied in order and/or iteratively—allows the collation of texts to be supported more flexibly by implementations adhering to this separation of concerns.10 Four basic steps were defined.
The first is the tokenization of the digital texts to be compared—in effect, the segmentation of the texts into the sequence of tokens that will be compared. The second step is the alignment of tokens from different texts, which essentially identifies which segments of tokens match between the texts—effectively also identifying where the compared texts differ and thus implying or assuming edit operations in those places. The third step is the analysis of the computed alignment, which introduces an interpretative aspect into the process, as edit operations are now qualified (e.g. as deletion, addition, or transposition). The fourth and final step is the output/visualization of collation results. The workflow of these four steps has since become informally known as the 'Gothenburg model'. We will explain the various steps in more detail below. Although any collation software can compare texts on a character-by-character basis, in the more common use case, before collation each text (or comparand) is normally split up into segments or tokens and compared on the level of the token rather than on the character level. This familiar step in text (pre)processing, called 'tokenization', is performed by a tokenizer and can happen at any level of granularity, for instance, at the level of syllables, words, lines, phrases, verses, paragraphs, text nodes in a normalized XML DOM instance, or any other unit suitable to the texts at hand. Another service provided by tokenizers as defined in our model relates to marked-up texts. As most collators compare texts primarily based on their textual content, embedded markup would usually get in the way and therefore needs to be filtered out—but must be kept as stand-off annotations during tokenization—so the collator can henceforward operate on tokens of textual content.
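The tokenization step just described can be sketched in a few lines. The following minimal illustration (all names are ours, and it assumes simple, well-nested XML; CollateX's actual tokenizer is more elaborate) splits a marked-up comparand into word-level tokens, filters the embedded markup out into stand-off context references, and normalizes each token by case-folding:

```python
import re

def tokenize(comparand):
    """Split a marked-up comparand into word tokens, keeping each
    token's markup context as stand-off information. Illustrative
    sketch only; assumes simple, well-nested XML."""
    tokens = []
    context = []  # stack of currently open XML element names
    # re.split with a capturing group alternates text runs and tags
    for part in re.split(r'(<[^>]+>)', comparand):
        if part.startswith('</'):
            context.pop()
        elif part.startswith('<'):
            if not part.endswith('/>'):  # ignore self-closing tags
                context.append(re.match(r'<\s*(\w+)', part).group(1))
        else:
            for word in re.findall(r'\w+', part):
                tokens.append({
                    'content': word,
                    'normalized': word.lower(),  # normalization: case-folding
                    'context': tuple(context),   # stand-off markup reference
                })
    return tokens
```

The collator can then operate on the `normalized` values while the `context` field preserves each token's tagging environment for later use.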
Annotations must be kept because it might be valuable to have the markup context of every token available, for example, if one wants to make use of it in the comparison of tokens during the alignment step (see below). The schematic diagram on the left in Fig. 1 depicts this process: the upper line represents a comparand, each character a, b, c, and d an arbitrary token, and the XML tags e1 and e2 are examples of embedded markup. A tokenizer transforms this marked-up text into a sequence of individual tokens, each referring to its respective markup/tagging context. From then on, a collator can compare a tokenized comparand to others based on its tokenized content and does not have to deal with its specific notational conventions anymore, which are often rather specific to a particular markup language, dialect, or project. During the tokenization step, it is also possible to normalize each token, so the subsequent comparison can abstract from certain specifics, such as case-sensitivity or even morphological variants. In most use cases, we have found that abstracting away from such specifics yields useful collation results. However, it should be noted that there is no principal methodological or technical reason to enforce such abstraction. In cases where specifics would turn out to be useful as information for the alignment of comparands, the model allows us to take such specifics into account. When the comparands have been tokenized, a collator will align the tokens of all comparands involved. Aligning comparands implies the matching of equal tokens and the insertion of 'empty' tokens (so-called 'gap tokens') in such a way that the token sequences of all comparands line up properly. As mentioned before, this specific task of a collator is computationally similar to the problem of sequence alignment, as it is also encountered, for example, in computational biology. Looking again at an example (Fig.
1, center diagram), we assume that three texts (each depicted in its own column) are being compared: the first consists of the token sequence 'abcd', the second reads 'acdb', and the third 'bcd'. A collator might align these three comparands as depicted in a tabular manner. Each comparand occupies a column, matching tokens are aligned in a row, and necessary gaps as inserted during the alignment process are marked by means of a hyphen. Depending on the perspective from which one translates this alignment into a set of editing operations, one can conclude, for example, that the token 'b' in the second row was omitted in the second comparand or that it was added in the first and the third. A similar statement can be made about 'b' in the last row by just inverting the relationship of being added/omitted.

Fig. 1 Schematic representation of the tokenization (left), alignment (middle), and analysis (right) phases of a collation workflow

In addition to the atomic editing operations computed in the alignment step, a further analysis of the alignment, conducted by the user and supported by the machine, can introduce additional interpretative preconditions into the process. Repeating the previous example in Fig. 1 (right diagram), one might interpret the token 'b' in columns 2 and 5 as being transposed instead of as being simply added and omitted. Whether these two edit operations actually can be interpreted as a transposition ultimately depends on the judgment of the editor and can at best be suggested, though not conclusively determined, via unambiguous heuristics.
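The gap-insertion behaviour described above can be reproduced for two comparands with a short sketch. Note that it relies on the longest-matching-block heuristic of Python's standard difflib rather than on CollateX's own aligner, so it illustrates the shape of the result, not the actual algorithm:

```python
from difflib import SequenceMatcher

def align_pairwise(a, b, gap='-'):
    """Align two token sequences (lists), inserting gap tokens so that
    matching tokens end up in the same row. Uses difflib's heuristic,
    not CollateX's aligner."""
    rows = []
    for op, a0, a1, b0, b1 in SequenceMatcher(None, a, b).get_opcodes():
        if op == 'equal':
            # matching segments line up row by row
            rows += list(zip(a[a0:a1], b[b0:b1]))
        else:
            # insertions/deletions/replacements: pad the shorter side with gaps
            left, right = a[a0:a1], b[b0:b1]
            width = max(len(left), len(right))
            left += [gap] * (width - len(left))
            right += [gap] * (width - len(right))
            rows += list(zip(left, right))
    return rows
```

For the first two comparands above, `align_pairwise(list('abcd'), list('acdb'))` yields the rows ('a','a'), ('b','-'), ('c','c'), ('d','d'), ('-','b'): exactly the table from which one can read off that 'b' was either omitted in the second comparand or added in the first.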
That is why an additional analytical step, in which the alignment results are augmented (and optionally fed back as preconditions into the collator), appears essential to us in order to bridge the methodological 'impedance' between a plain computational approach to the theoretical problem and the established hermeneutical approach taken in practice. In some cases, even human interpretation may of course not determine decisively whether an actual transposition took place. We may have to conclude that some cases of potential transposition cannot be determined with absolute certainty. The obvious remaining step is the output of the collation results, which is again a complex task. The requirements here range from the encoding of the results according to the various conventions, markup dialects, and formats required by other tools, to the visualization of results in multiple facets, be it in a synoptic form, as a rendering focusing on one particular text and its variants, or as a graph-oriented networked view offering an overview of the collation result as a whole. After establishing this separation of concerns, implementers of collation-related software can henceforth focus on specific problems. For instance, the collation tool Juxta11 has had a feature-rich tokenizer for XML-encoded texts since version 1.4, which has been extended constantly in consecutive versions. Juxta also has support for larger comparands as well as stand-off annotations and is available as a self-contained software library12 for reuse in other tools. Comparable work is ongoing to generalize Juxta's visualization components.13

1.2 Comparing alignment algorithms

The main emphasis of CollateX's development is on improving the alignment step. As mentioned in the introduction, aligning sequences of symbols is a well-known problem in computer science with many applications, notably in the field of computational biology.
It has also been noted that the adoption of existing sequence alignment algorithms for use in the context of philology poses several problems, some of a conceptual, methodological nature and some of a practical, technical nature. Rather than providing a complete account of the pros and cons of particular algorithms, a task better undertaken elsewhere, we would like to draw attention to three recurring criteria on which the quality of recent alignment algorithms is evaluated:

1. Transposition detection
Detecting arbitrarily transposed passages in versions of a text is a much harder problem when done in the context of sequence alignment than computing insertions, deletions, and substitutions. Schmidt concludes his analysis of this problem (Schmidt 2009) with a pragmatic solution by stating that, given an NP-complete computational problem and no guaranteed correspondence between an optimal computational result and the outcome desired by the user, a heuristic algorithm might be the best solution. Accordingly, algorithms that try to detect transpositions do so heuristically and refer to benchmarks measuring computationally detected transpositions against manually predetermined ones.

2. Support for flexible token matching
The well-known distinction between substantial versus accidental variants, as well as other factors like orthographic variation, requires alignment algorithms to match tokens more flexibly than just via exact character matching. Some algorithms use edit distance-based thresholds for this purpose (e.g. Spencer/Howe's or Juxta's), whereas others rely on lookup tables predefined by the user, which list possible mappings of tokens to match them despite their differing character content.

3.
Base-Text-/Order-Independence
Alignment algorithms like Juxta's compare versions one-on-one, so that as soon as more than two versions are to be compared, the task has to be reduced to pairwise comparison of two versions at a time and consecutive merging of the pairwise results. Spencer and Howe have shown the potential functional dependence of such a unified result on the order in which pairwise comparisons are merged. This poses a problem for genetic research based on such results, since a suitable order in which the pairwise comparisons should be merged depends on a hypothesis about which texts are more closely related to each other and whose comparison results should consequently be merged first (Spencer and Howe, 2004).

CollateX's aligner tries to tackle all of these problems by following the modularization outlined in the section above and by finding ways to align tokens that do not inherit the trade-offs of existing sequence alignment algorithms. As such, it has to be characterized as experimental, but at the same time it already yields promising results.

1.3 Comparing texts with CollateX

This section gives an overview of the major concepts by which CollateX aligns the tokens of comparands. We begin by explaining the basic challenge of aligning two comparands, including the detection of transpositions, and extend the challenge stepwise up to the alignment of multiple versions. Most alignment algorithms work on the basis of the following editing operations: insertion, deletion, and substitution. These operations are well defined, e.g. via Levenshtein's concept of the edit distance. A frequently recurring problem when comparing two versions of a text is the phenomenon where a passage of the text has been moved between them (i.e. transposed). Moreover, transposed passages of a text usually are not transposed literally, but contain small changes of their own, which makes the challenge of detecting them even harder.
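Levenshtein's edit distance, mentioned above, can be stated compactly as the standard dynamic program over the three editing operations; it works on character strings and on token sequences alike:

```python
def levenshtein(a, b):
    """Minimal number of insertions, deletions, and substitutions
    turning sequence a into sequence b (Levenshtein 1966), computed
    with the standard dynamic-programming recurrence."""
    # prev[j] holds the distance between a[:i-1] and b[:j]
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute x by y (free if equal)
            ))
        prev = curr
    return prev[len(b)]
```

Applied to token lists, `levenshtein('the black cat'.split(), 'the white cat'.split())` counts one substitution; precisely the kind of operation set that, as noted, cannot by itself express a transposition.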
Alignment algorithms that are constrained to the editing operations just mentioned will regard a transposition either as a deletion and an insertion or, in case two passages have been swapped, as two substitutions. CollateX releases this constraint by handling transpositions as an additional kind of editing operation and by trying to detect those operations. To start with a trivial case, the detection of transpositions is easy when all tokens in the compared versions are unique (Fig. 2). When we look at the different tokens from the two versions in each position, it is easy to see that 'a' is transposed with 'c' and 'c' with 'a'.

Fig. 2 Trivial case of transposition

Apart from all tokens being unique, the previous example also assumes that moved passages of text are exactly one token long. In the next example (cf. Fig. 3), we drop this constraint as well. The desired result would be that the sequence 'a b c d' is transposed with 'z'. The trivial approach described above for the detection of transpositions would not work in this case.

Fig. 3 Less trivial case of transposition

Real-world cases of transposition involve sequences of tokens of arbitrary length moving over seemingly arbitrary distances in the text, in the process, more often than not, also changing the internal order of the sequence to various extents. To solve this problem, we need a more elaborate form of token matching. To this purpose, we use a match table, which is a document-to-document matrix, allowing us to compare two variant witnesses, each word to each word. Let us first consider a case where there is no variation (cf. Fig. 4). We put the tokens of witness 1 as the column headers of the matrix and the tokens of the identical witness 2 as the row headers of the matrix. We then simply mark cells that have identical row and column headers.
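Such a match table is straightforward to construct. The sketch below (an illustration of the data structure, not of CollateX's code) marks a cell wherever a column token and a row token are identical:

```python
def match_table(witness_a, witness_b):
    """Build a document-to-document match matrix: cell [i][j] is True
    when token j of witness A equals token i of witness B.
    Witness A supplies the column headers, witness B the rows."""
    return [[a == b for a in witness_a] for b in witness_b]
```

For two identical witnesses the marked cells form a perfect diagonal, with additional off-diagonal marks wherever a token recurs (such as a repeated 'the'); it is exactly this ambiguity that the following discussion of 'paths' through the matrix addresses.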
This simple case reveals an essential aspect of document-to-document matrices for variant detection: in general, the 'path' from the upper left corner to the bottom right corner that deviates the least from the exact diagonal corresponds to the similarity a reader would assume between two texts. A reader would not assume, for example, that the first 'the' in the horizontally depicted witness is actually to be identified with the second 'the' in the vertically represented witness, or vice versa. Thus, we assume the 'conclusion' depicted for the generalized case in Fig. 5a to be invalid, and the solution in Fig. 5b is preferred. This case also gives us a hard constraint for any algorithm design: we cannot select more than one token for each row and/or column, i.e. we cannot have two tokens simultaneously in one position. Now consider a case of transposition. Text A is a six-sentence (or ninety-seven-word) quote from Samuel Beckett's Stirrings Still. When we compare two identical copies of this text in a matrix as explained above, we arrive at the result depicted in Fig. 6a. Now we copy text A, but we deliberately move the last sentence to the start of the text. In this way, we effectively create an artificial transposition. We now compare the original text to the copy containing the artificial transposition. The result is depicted in Fig. 6b. The displaced sentence is clearly indicated by a diagonal in the top right corner, attesting that the last sentence of the original (horizontal direction) coincides with the start of the altered copy. If we created multiple transpositions, we would get a result as depicted in Fig. 6c. Instead of a clear diagonal all the way through, we find a rather broken-up path of smaller diagonals:

One night as he sat at his table head on hands he saw himself rise and go. One night or day. For when his own light went out he was not left in the dark. Light of a kind came from the one high window.
Under it still the stool on which till he could or would no more he used to mount to see the sky. Why he did not crane out to see what lay beneath was perhaps because the window was not made to open or because he could or would not open it.

Text A: excerpt from Samuel Beckett's Stirrings Still

Fig. 4 Document-to-document matrix applied as a match table; cells representing tokens coinciding between two witnesses are marked ('scored'), in this case with dots

Fig. 5 a (left) and b (right): Unrealistic alignment conclusion (a) versus the natural, elegant, or reader's common-sense solution (b)

Fig. 6 a (left), b (center), and c (right)

It will be clear from this example that in fact any changes (or 'edits' as they are called in computer science, be they intentional or not) in an initially identical copy of a text will result in a deviation from a perfect diagonal in the document-to-document matrix—even the substitution of a single character. Many such edits cause the visible diagonal of a perfect alignment to be broken up into a large collection of longer and shorter diagonals, similar to what is shown in Fig. 6c (but many times larger for real-world texts). We call these dispersed diagonals match phrases. As in the much simpler case represented in Fig. 5, in a real-world case it remains CollateX's task to determine what sequence of match phrases corresponds to the 'natural' alignment of two texts or witnesses. Of course, being a computer program, CollateX has no conception of the kind of alignment a human reader would infer, and to further complicate matters, human readers can actually have different opinions on what the 'best' alignment is. Therefore, CollateX has to rely on a mathematical approximation of the inferences human readers might make.
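One much-simplified way to read match phrases off the matrix is to collect its maximal diagonal runs of matching cells. The sketch below does only that; CollateX's actual detection, and in particular its subsequent selection among competing phrases, involves further heuristics:

```python
def match_phrases(witness_a, witness_b):
    """Collect the maximal diagonal runs of matching tokens in the
    document-to-document matrix. Each phrase is reported as a triple
    (start index in A, start index in B, length). Illustrative only."""
    phrases = []
    for i, tok_a in enumerate(witness_a):
        for j, tok_b in enumerate(witness_b):
            if tok_a != tok_b:
                continue
            # only start a phrase where the diagonal is not already running
            if i > 0 and j > 0 and witness_a[i - 1] == witness_b[j - 1]:
                continue
            length = 0
            while (i + length < len(witness_a)
                   and j + length < len(witness_b)
                   and witness_a[i + length] == witness_b[j + length]):
                length += 1
            phrases.append((i, j, length))
    return phrases
```

For a witness 'a b c d z' against its transposed copy 'z a b c d', this yields two phrases: the four-token run 'a b c d' (shifted by one position) and the displaced 'z', mirroring the displaced diagonal of Fig. 6b in miniature.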
Congruent to the argument of Bourdaillet and Ganascia (2007), CollateX's approach to this is to determine the set of match phrases that corresponds to the smallest number of edits between two witnesses. In other words, the algorithm determines the smallest set of longest match phrases that accounts for all variants between two texts, as conceptualized in Fig. 7. Theoretically, this process can be applied to an n-dimensional matrix. This would facilitate comparing an arbitrary number of witnesses, or, in other words, provide support for multiple-witness alignment. However, the time needed to compute the n-dimensional case is not linear, but probably on the order of an n-exponential function of the text length, making it computationally unattainable in a reasonable amount of time. Therefore, multiple-witness alignment must be supported in another way, as explained in the remainder of this section. To register the alignment and variation between witnesses traced by the algorithm, an efficient way to store the algorithm's results is needed. To this end, CollateX adopts Schmidt and Colomb's concept of a variant graph (Schmidt and Colomb, 2009). We will demonstrate this process using the case of a variant token. In Fig. 8a, we see the algorithm determining the first alignment. The algorithm traverses all cells that represent aligned tokens and adds a vertex for each, the edges of which are indexed for both witnesses. However, in the case of the token 'i', we traverse an empty column. This means that we hit a token that is represented in one witness but that does not have a corresponding token at that location in the other witness. Instead, the other witness has 'j'. In this case, we add two vertices with indexed edges, one for each witness (cf. Fig. 8b). This process ultimately results in the situation depicted in Fig. 8c. CollateX's variant graph is a directed acyclic graph in which each vertex represents a token in one or more versions.
Each edge in a variant graph is annotated with the identifiers of one or more versions. Additionally, a variant graph has a start and an end vertex (the #-vertices in Fig. 8) that do not represent any tokens. By proceeding in this way, variation at the start or the end of a version can be recorded. When one reads the variant graph from left to right, following only the edges annotated with the identifier of a certain version, the complete token sequence of this version can be reconstructed. When transpositions are detected, a duplicate of the transposed vertex is created and the two copies are linked together (cf. Fig. 9).

Fig. 7 Vectors describing alignment of two witnesses in a document-by-document matrix

By applying a variant graph, we can also extend the applicability of the described algorithm from pairwise comparisons to the alignment of more than two versions. To this end, we apply the same approach of matrix-wise comparison, but instead of a two-dimensional matrix a three-dimensional matrix is used. This allows us to compare a new witness with the entire variant graph constructed so far. This process is conceptualized in Fig. 10, where a new witness 't' (with a reading identical to witness 'u') is added to the comparison. Darker 'cubes' visualize alignment between the existing graph and the new witness. (Lighter 'cubes' are for the readers' orientation within the 3D matrix only.) Eventually, in this case, the process leads to the addition of an index 't' to all edges in the graph that had an index 'u'. Of course, when a new reading is found that has not been recorded in the graph, a new vertex would be inserted. In this way, the graph represents a serialization of the variation between the documents of a steadily growing set.
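The variant graph idea can be illustrated with a toy model: matched tokens share a vertex, variant readings get vertices of their own, and every edge carries the sigla of the witnesses that read it. For brevity this sketch aligns a new witness against a single earlier witness using Python's difflib; it is a simplified model of Schmidt and Colomb's structure, not CollateX's implementation, which aligns each new witness against the whole graph:

```python
from difflib import SequenceMatcher

class VariantGraph:
    """Toy variant graph: a directed acyclic graph whose vertices hold
    tokens and whose edges carry witness sigla."""

    def __init__(self):
        self.tokens = {0: '#', 1: '#'}  # 0 = start vertex, 1 = end vertex
        self.edges = {}                 # (from, to) -> set of witness sigla
        self.paths = {}                 # siglum -> list of vertex ids

    def _link(self, u, v, siglum):
        self.edges.setdefault((u, v), set()).add(siglum)

    def _new_vertex(self, token):
        v = max(self.tokens) + 1
        self.tokens[v] = token
        return v

    def add_witness(self, siglum, tokens, against=None):
        if against is None:  # first witness: a simple chain of vertices
            path = [self._new_vertex(t) for t in tokens]
        else:                # align against an earlier witness's path
            base = self.paths[against]
            base_tokens = [self.tokens[v] for v in base]
            path = []
            for op, a0, a1, b0, b1 in SequenceMatcher(
                    None, base_tokens, tokens).get_opcodes():
                if op == 'equal':  # matched tokens share a vertex
                    path += base[a0:a1]
                else:              # variant readings get their own vertices
                    path += [self._new_vertex(t) for t in tokens[b0:b1]]
        for u, v in zip([0] + path, path + [1]):
            self._link(u, v, siglum)
        self.paths[siglum] = path

    def reconstruct(self, siglum):
        """Recover a witness's token sequence from its path."""
        return [self.tokens[v] for v in self.paths[siglum]]
```

For witnesses 'the black cat' and 'the white cat', the tokens 'the' and 'cat' end up as shared vertices while 'black' and 'white' branch between them, and the edge leaving the start vertex is annotated with both sigla.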
The serialization contains all the variation that is recorded during the alignment of previous comparisons and is derived from the variant graph by arranging the tokens of all vertices in topological order. Note that a variant graph ultimately describes variation between witnesses; it does not, nor does CollateX, interpret or infer the type or cause of variation. Thus, in a graph such as depicted in Fig. 11, there is no given interpretation whether the 'j' results from an addition in one witness or has been deleted in another. Of course, whenever there is additional knowledge on the provenance and the date of the witnesses, this inference becomes a trivial task in most cases.

[Fig. 8a (left), 8b (center), and 8c (right): Storing alignment results using a variant graph]
[Fig. 9: Capturing a transposition in a variant graph]

As we noted above, CollateX's development as a whole is still in an experimental stage, but the current version (as of the time of writing, version 1.3 is available on http://www.collatex.net) yields convincing results. Inspection of a 10% sample of a real-world test collating the first chapter of Darwin's Origin of Species (Bordalejo 2009) yielded a 100% correct detection rate for textual variation and an 87.5% correct rate of transposition identification (i.e. one false positive and seven correct identifications of transpositions, judged by human control) within the sample. CollateX's performance as to speed may leave something to wish for. Currently, sentence- to paragraph-sized collations are executed in less than a second, virtually independent of the number of witnesses. Collation of chapter-sized texts is feasible (seconds), but at larger sizes ('book length'), the speed degrades rapidly to the point of unfeasibility.
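Arranging the tokens of all vertices in topological order, as the serialization step requires, is a standard graph operation. The following sketch uses Kahn's algorithm on a hand-built edge list corresponding to the 'i'/'j' variant of Fig. 11; the (rank, token) vertex labels are our device to keep the example deterministic, not CollateX's representation.

```python
from collections import defaultdict

def topological_order(edges):
    """Kahn's algorithm: one linear order compatible with all edges."""
    succ, indegree, nodes = defaultdict(list), defaultdict(int), set()
    for s, t in edges:
        succ[s].append(t)
        indegree[t] += 1
        nodes.update((s, t))
    queue = [n for n in sorted(nodes) if indegree[n] == 0]
    order = []
    while queue:
        n = queue.pop(0)
        order.append(n)
        for t in succ[n]:
            indegree[t] -= 1
            if indegree[t] == 0:
                queue.append(t)
    return order

# Variant graph with shared 'a' and 'b' and a variant 'i'/'j' between
# them; '#' marks the token-less start and end vertices.
edges = [((0, "#"), (1, "a")), ((1, "a"), (2, "i")), ((1, "a"), (2, "j")),
         ((2, "i"), (3, "b")), ((2, "j"), (3, "b")), ((3, "b"), (4, "#"))]
serialized = [tok for _, tok in topological_order(edges) if tok != "#"]
```

The resulting order interleaves the variant readings ('i' and 'j') between the tokens shared by all witnesses, which is exactly the property the serialization relies on.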
To this end, chunking, or breaking larger bodies of text into smaller parts, is an effective solution. A number of identified problems remain to be addressed in future work. Most importantly, the independence of the alignment results from the order in which the versions are aligned needs more testing. Although no dependence on the order could be witnessed in test cases found in other publications addressing the issue (Spencer and Howe, 2004), it is possible that, for example, a combination of repeated tokens in versions and a change in the order of their comparison might cause different results. Another issue is testing and benchmarking. CollateX's algorithm is tested against an ever-growing set of real-world use cases, varying from simple and constructed cases such as 'the black cat and the white dog' versus 'the white cat and the black dog' to elaborate fragments of Armenian, medieval Dutch, and Hebrew prose and verse. This yields good use-case-based evidence that CollateX is indeed capable of tracing complicated examples of real-world textual variation. The development methodology used ('agile'14) implies an ever-growing set of such real-world cases, as new users request test runs of previously unseen material. However, there is also a need for a mathematically/computationally constructed larger test corpus of variant texts of which the variation is exactly known and modeled on real-world textual variation, so that future releases of CollateX can be benchmarked for accuracy and performance to a certain standard. Creating such a test corpus is an important step in future research and development.

2 Integrating CollateX within the infrastructure of a digital edition

Now that the conceptual framework and algorithm have been mapped out, we would like to address the issue of possible integrations of CollateX into electronic editions.
The Beckett Digital Manuscript Project proved to be a suitable test case, since its software infrastructure is rather typical of the way in which many digital editions are currently composed.

[Fig. 10: Visualization of multiple witness comparison using a 3D matching matrix]
[Fig. 11: Variant graph recording a variant that is either a deletion in witness 'u' or an addition in witness 'v']

It uses Apache Cocoon,15 a publishing framework which is based entirely on XML-oriented technologies. As XML and its adjacent standards are almost ubiquitous in today's Digital Humanities landscape, ranging in their application from the encoding of source material to the publication in multiple XML-based formats like XHTML or PDF via XSL-FO, Apache Cocoon is widely deployed among the projects in this field. The framework is built around the idea of configurable transformation scenarios. Such scenarios are mapped flexibly to the URI namespace of a project's Web site and are triggered as soon as a web client sends a request to any of the mapped URIs. Cocoon then

(1) takes input parameters from the request;
(2) pulls additional relevant data from a variety of data sources (e.g. XML databases, relational databases, web services, or a server's file system);
(3) converts all data into an XML document;
(4) pushes the data through a chosen transformation scenario, configurable by the site's developer as well as the user; and
(5) returns the transformation result in the response to the web client (Fig. 12).
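The five steps above are typically declared in a Cocoon sitemap. The following hypothetical fragment illustrates the pattern; the matcher pattern, file paths, and stylesheet name are invented for this example and do not describe the Beckett project's actual configuration.

```xml
<!-- Hypothetical Cocoon sitemap fragment illustrating steps (1)-(5);
     the pattern and file names are invented for illustration. -->
<map:sitemap xmlns:map="http://apache.org/cocoon/sitemap/1.0">
  <map:pipelines>
    <map:pipeline>
      <!-- (1) a request for witness/<id> triggers this scenario -->
      <map:match pattern="witness/*">
        <!-- (2)+(3) pull the XML resource from the file system -->
        <map:generate src="xml/{1}.xml"/>
        <!-- (4) push it through a chosen transformation scenario -->
        <map:transform src="xslt/to-xhtml.xsl"/>
        <!-- (5) serialize the result for the web client -->
        <map:serialize type="xhtml"/>
      </map:match>
    </map:pipeline>
  </map:pipelines>
</map:sitemap>
```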
The Beckett Digital Manuscript Project follows this pattern in as much as it

(1) receives requests for specific textual resources in the edition, selected via an appropriate URI;
(2) pulls those resources, encoded in TEI-P5 compliant XML, from the server's file system;
(3) applies an XSLT-based transformation to the XML-encoded text resource that is fitted to the desired output format (often (X)HTML) and the current site context in which the user requests the resource; and
(4) delivers the transformation result along with any static resources (images, stylesheets, and client-side script code) to the requesting client (often a web browser).

To seamlessly integrate CollateX's functionality in this site architecture, it was deemed the most elegant approach to think of the collator module as another transformation scenario, which transforms data from a selected set of text versions into an intermediary XML-based format encoding the collation result. As described above in the section on CollateX's design, the modularity of CollateX allows us to uncouple the preprocessing of input data and the postprocessing of collation results from the core of its functionality, the alignment. Because of this flexibility, it was comparatively easy to embed CollateX into Apache Cocoon as another transformer module. All that was needed was

(1) a preprocessing step, which transformed an XML-encoded set of versions into tokenized input to the alignment module of CollateX, and
(2) a postprocessing step, which renders the results of the alignment step in an XML-based format, so it can be further processed by Apache Cocoon's components and ultimately delivered to the user in the form desired (cf. Fig. 13).

More specifically, the transformer module looks for so-called 'data islands' in provided XML input documents: elements in the namespace http://interedition.eu/collatex/ns/1.0 that wrap the versions to be collated.

[Fig. 12: Usual blueprint for Apache Cocoon architecture-based web services]

Whenever the transformer encounters such an island in an input document, it substitutes it with the resulting alignment of the given versions/witnesses in the output, while just copying the surrounding data. The encoding of the output can be controlled via the attribute 'cx:outputType'. Currently, a proprietary tabular data format and TEI P5 parallel segmentation markup are supported. By assembling these data islands dynamically, based on the user's request (parameterized for instance via the input of an HTML form; see e.g. Fig. 14), and by adding consecutive transformation modules after the collation has been performed, the Beckett Digital Manuscript Project can provide for the described personalization of its critical apparatus. Because of the modular design of CollateX, a seamless integration of its functionality with the site infrastructure of the Beckett Digital Manuscript Project has been achieved. The amount of work was comparatively limited because the CollateX team was able to reuse most of the already developed components and the Beckett project's team was able to build the integration entirely on their existing code base and platform. However, the integration did not come without trade-offs. Particularly, the synchronous online execution of the collation considerably limits the amount of textual data that can be collated. This constraint, and the requirement of more complex request/response choreographies than the ones Cocoon provides out of the box as soon as larger texts are collated, makes this a solution which cannot simply be adopted without prior adjustments by any edition with an arbitrarily large text tradition.
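The data islands themselves are described only in prose here. A purely hypothetical reconstruction might look like the following; only the namespace and the cx:outputType attribute are attested in the text, while the element and attribute names inside are our invention.

```xml
<!-- Hypothetical 'data island': a collation request embedded in an
     XML input document. Only the namespace and cx:outputType are
     documented; the element names are assumptions for illustration. -->
<cx:collation xmlns:cx="http://interedition.eu/collatex/ns/1.0"
              cx:outputType="tei">
  <cx:witness cx:id="w1">and then again faint from far within</cx:witness>
  <cx:witness cx:id="w2">and then again nothing from far within</cx:witness>
</cx:collation>
```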
But projects such as NINES, with its collation software Juxta,16 and the Interedition project itself develop web-service-based solutions, which aim to overcome scalability issues related to the synchronous on-the-fly collation in use today. Therefore, the adoption of these solutions, and again their integration with existing infrastructures like Apache Cocoon, offers a promising perspective.

3 CollateX, modern manuscripts, and the digital scholarly editorial process

Among all the interoperable tools developed within the Interedition framework, CollateX takes a special place, as it realizes interoperability not just in a technical sense but also in a scholarly sense. Following textual scholars such as Bryant (2002), Buzzetti (2002), and McGann (2001) and genetic critics such as Lebrave (1993), Grésillon (1994), Hay (2002), de Biasi (2004), and Ferrer (2011), we have come to appreciate text in its essential fluidity and its forms as a process rather than a static object.

[Fig. 13: Blueprint of the integrated Apache Cocoon and CollateX architecture-based web service for the Beckett Digital Manuscript Project]

The dynamic aspect of text has caused especially McGann to argue that any electronic edition, or for that matter any attempt in that direction, should be based on dynamic processes, ideally implementing 'flexible and multi-dimensional views of the materials' (McGann 2001). Originally, McGann envisaged such editions in the form of hypermedia editions backed by relational databases, on top of which adaptable transformational logic would cause archived digital texts to be represented according to specific editorial practices and views of different editorial or literary communities.

[Fig. 14: The Beckett Digital Manuscript Project]
Of course, CollateX does not fully answer to such an ambitious perspective. However, it is interesting to note that CollateX represents at least one aspect of the transformational logic McGann pointed to. CollateX effectively and dynamically 'reverse engineers' the variation present in a textual tradition or genesis. The process is dynamic because the basic process is independent of tedious scholarly manual labor, which makes CollateX's transformations repeatable. The process is also dynamic in that it is adaptable: by adding witnesses present in the database-backed electronic archive to the analysis (or by removing some of them from it), the effect and perspective of the collation may change. In this sense, the application of CollateX in editorial processes takes us one step further in our abilities to express the dynamics of text production, editing, and reception. At this point, the software can be serviceable in the preparation of a scholarly edition, since it can also output TEI parallel segmentation XML, which an editor can then transform and visualize the way he or she wants within the edition. Projects such as 'manuscript archives' that do not envisage the development of a full 'historical-critical' edition could still offer their users an alternative to a traditional 'critical apparatus'. Comparable with aspects of what is currently often labeled social editing (Siemens 2012), embedding CollateX in the Beckett Digital Manuscript Project enables users to make their own selection of textual versions that need to be collated and to leave out the ones they are not immediately interested in. From the vantage point of editorial theory, this development has interesting consequences regarding the scholarly editor's role, whose focus may shift from the collation to a more interpretive function. In this way, the integration of a collation tool may be consequential in terms of bridging the gap between scholarly editing and genetic criticism.
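To illustrate the TEI parallel segmentation format mentioned above, a generic fragment for two witnesses differing in a single word could look like the following. This is an illustration of the standard TEI apparatus encoding, not CollateX's literal output; the witness sigla are invented.

```xml
<!-- TEI parallel segmentation: the shared text is given once, and an
     <app> element holds one <rdg> per diverging witness reading. -->
<p xmlns="http://www.tei-c.org/ns/1.0">the
  <app>
    <rdg wit="#w1">black</rdg>
    <rdg wit="#w2">white</rdg>
  </app>
  cat</p>
```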
From the perspective of editorial practice, the application of CollateX is still at an experimental stage, but it already shows that the modular and service-oriented approach used by Interedition has the potential to be useful, both to the specialized field of digital scholarly editing and to a more general audience.

3.1 Pushing the collation envelope: modeling genetic stages

Apart from facsimiles and topographic and linear transcriptions (encoded in XML), the Beckett Digital Manuscript Project provides the option to compare the different preparatory versions of the text, from the earliest draft stages to the page proofs. To avoid getting lost, the user is able to compare a particular segment in one version with the same segment in another version, or in all the other versions. The size of such a segment is determined by the user, the smallest unit being the sentence. The user is offered a synoptic survey of all the extant versions of the segment of his or her choice, showing each version in its entirety with the variants highlighted (cf. Fig. 15). The syntactical context of each segment remains intact, but in order for the variants to be highlighted, they had to be encoded first. In view of the large amount of manuscript material still to be transcribed, the project would not have been able to include the option of encoding an apparatus in all of the transcriptions. As an alternative to that manual encoding task, we tested the possibilities of digitally supported collation by means of the CollateX algorithm.17 One of the complicating elements of this test case is the rather large number of versions in combination with the presence of deletions and additions in almost all of them. To find solutions for the latter complicating element, there are several ways of looking at the challenge of collating modern manuscripts. One way would be to regard it as a form of collation that does not only collate versions of a text, but also stages within versions.
For one manuscript, a version can often be subdivided into several writing stages. A writing stage is defined, according to the suggestions of the TEI Special Interest Group (SIG) on 'Genetic Editions', as 'a reconstructable stage in the evolution of a text, represented by a document or by a revision campaign within one or more documents, possibly assigned to a specific point in time'.18 Ideally, this would require that the editor can identify not only different stages in the writing process, but also the writing sequence within each writing stage. If these sequences and stages can be discerned unequivocally, it would be theoretically possible to treat each stage as a version (or 'witness') to be collated. The TEI SIG on Genetic Editions suggested working with 'stageNotes' to describe the composition stages that have been identified in the genesis of a text. These 'stages' relate to the relatively large unit of the textual version as a whole ('Textfassung'). Within a stage (say, an author writing a block of text in black ink, deleting and adding words with the same writing tool) it is often difficult, if not impossible, to further discern different 'substages'. Still, a genetic critic might be interested in a collation tool that brings to the fore precisely this kind of moment in the writing process, when the writer did not immediately find the right words. In the case of a simple example, 'The ^black^ [cat] ^dog^ is [alive] ^dead^' (assuming all deletions, marked here by square brackets, and additions, marked by carets, are made in the same handwriting and writing tool) all of the following combinations are theoretically possible:

W1a: the cat is alive
W1b: the black cat is alive
W1c: the cat is dead
W1d: the black cat is dead
W1e: the dog is alive
W1f: the black dog is alive
W1g: the dog is dead
W1h: the black dog is dead
[Fig. 15: Synoptic survey of the various versions of one textual segment in the Beckett Digital Manuscript Project]

The TEI SIG 'Genetic Editions' developed the 'stageNote' element for the documentary level. According to the TEI SIG's suggestions, 'A genetic editor needs to be able to assign a set of alterations (deletions, additions, substitutions, transpositions, etc.) and/or an act of writing to a particular stage'.19 However, some authors always use the same writing tool, not only for the 'first stage' of their draft but for all subsequent revision campaigns. Moreover, in the TEI suggestions for genetic editions, the 'stageNote' element was developed for the 'documentary' level (related to what Hans Zeller called 'Befund', the record, as opposed to 'Deutung', its interpretation), not for the 'textual' level. However, collation is a text-related operation. To collate modern manuscripts, it may therefore be beneficial, for the purpose of designing digitally supported collation tools for modern manuscripts, to conceive of the manuscript as 'a protocol for making a text', according to Daniel Ferrer's definition (Ferrer 1998). A relatively straightforward application of this protocol model is to work with the 'uncancelled text' of each manuscript (i.e. a 'clean' transcription or reading text of a draft, without the deleted passages, i.e. ignoring the passages marked by del tags). This 'uncancelled text' is usually an author's last 'protocol' or instruction to himself when he is on the verge of fair-copying or typing out the text on another document. We tried to apply this 'uncancelled text' system to test the first research results of CollateX. All versions of a segment are computed and their data are handed over to CollateX for comparison.
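Deriving the 'uncancelled text' directly from an XML-encoded transcription amounts to dropping every del element and keeping everything else. The following minimal sketch shows the idea on a tiny encoding of our own, loosely modelled on TEI del/add/subst; it is not the Beckett project's actual preprocessing code.

```python
# Sketch: derive the 'uncancelled text' of a transcription by
# skipping all <del> elements. The toy encoding is our own,
# loosely modelled on TEI del/add/subst markup.
import xml.etree.ElementTree as ET

def uncancelled(xml_string):
    root = ET.fromstring(xml_string)
    parts = []

    def walk(elem):
        if elem.tag == "del":   # cancelled passage: skip it entirely
            return
        if elem.text:
            parts.append(elem.text)
        for child in elem:
            walk(child)
            if child.tail:      # text after a child is never cancelled
                parts.append(child.tail)

    walk(root)
    return " ".join(" ".join(parts).split())  # normalize whitespace

witness = ("<s>and then again <subst><del>faint</del> "
           "<add><subst><del>hoarse from long silence</del> "
           "<add>faint</add></subst></add></subst> from far within</s>")
clean = uncancelled(witness)
```

Applied to a transcription with nested substitutions, only the final, uncancelled reading survives, which is exactly the author's last 'protocol'.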
The segmentation of the textual material (see above) can now serve an extra purpose: apart from reducing the danger of the user getting lost in the jungle of manuscripts, it also determines the speed of the collation. Since the most frequently chosen textual unit in the project is the smallest segment (usually the unit of a sentence), the number of versions can be relatively high (in the test case: about twenty versions) without slowing down the instant collation. As an intermediary step, the 'uncancelled text' system is useful, but it does reduce the complexity of the manuscript to a textual format. In a way, this pragmatic solution 'de-manuscripts' the manuscript. In order to try and refine the computer-assisted collation of modern manuscripts, it would be helpful if the collation software were XML-aware, in order for the input to be derived directly from the XML-encoded transcription, and to record changes not only between the stages but also between substages within one stage. The test case provided us with the following example: a passage in one of Beckett's manuscripts (UoR MS 2934, 9v-10r, written in Beckett's hand in black ink), with two consecutive substitutions within the same writing stage:

and then again [faint] ^[hoarse from long silence] ^faint^^ from far within

In XML, this could be transcribed as follows (deletions in del elements, additions in add elements, each substitution wrapped in a subst element):

and then again <subst><del>faint</del> <add><subst><del>hoarse from long silence</del> <add>faint</add></subst></add></subst> from far within

The subst, del, and add tags suffice to cover all stage information, which could be expressed in the 'augmented' variant graph of Fig. 16. Each path in the graph represents a witness. For the purposes of the collation of modern manuscripts, a new type of node has been introduced (in the example, S1 and S2, corresponding to subst1 and subst2, respectively). This writing stage (a) then needs to be compared with other stages or other versions, i.e. with multiple witnesses, say, (b) and (c):

[Fig. 16: Conceptual CollateX variant graph capturing genetic stages of authoring]
(a) and then again [faint] ^[hoarse from long silence] ^faint^^ from far within
(b) and then again nothing from far within
(c) and then again faint from far within

In CollateX's internal model, this would be represented as depicted in Fig. 17. An advantage of this model is that it can support several collation options. By default, CollateX would use only the 'uncancelled text' of each witness, but whereas the 'uncancelled text' model (described above) only took the final protocol into account, this model takes in all the extra information about the cancelled words, saves it, and enables us to 'port out' these data again at the visualization stage. For instance, if for whatever reason one would prefer to compare (b) and (c) with substage 2 of (a), rather than with its 'uncancelled text', the algorithm can optionally be instructed to collate [(b) nothing] and [(c) faint] against [(a') hoarse from long silence], rather than against [(a'') faint]. All the information stored in the XML transcription passes through the collation process untouched, so that it can be retrieved for visualization purposes. In terms of visualization, an option 'hide cancellations' (as one of the 'Tools' in the menu) could simplify the alignment table, reducing it to a visualization of the different versions' 'uncancelled text' only:

w1 and then again faint from far within
w2 and then again nothing from far within
w3 and then again faint from far within

But we can imagine that genetic critics and other researchers interested in modern manuscripts might want to have an overview of all the cancellations and substitutions in the manuscripts.
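The stage-aware model sketched above can be illustrated with a small data structure of our own devising (not CollateX's internal classes): a substitution node groups its deleted and added readings, so that each writing stage of witness (a) can be read off as its own token sequence.

```python
# Sketch of the 'augmented' model: a substitution node (cf. S1/S2 in
# Fig. 16) keeps both its deleted and its added reading, so every
# writing stage remains recoverable. The data model is ours.
class Subst:
    def __init__(self, deleted, added):
        self.deleted, self.added = deleted, added

def reading(tokens, stage):
    """Flatten a token sequence, resolving nested substitutions.

    stage=0 follows every deletion (the earliest text); higher
    values descend that many levels into the additions.
    """
    out = []
    for tok in tokens:
        if isinstance(tok, Subst):
            branch = tok.deleted if stage == 0 else tok.added
            out.extend(reading(branch, stage - 1 if stage else 0))
        else:
            out.append(tok)
    return out

# UoR MS 2934: 'faint' replaced by 'hoarse from long silence',
# itself replaced by 'faint' (two consecutive substitutions, S1, S2).
s2 = Subst(deleted="hoarse from long silence".split(), added=["faint"])
s1 = Subst(deleted=["faint"], added=[s2])
witness_a = ["and", "then", "again", s1, "from", "far", "within"]
```

Because 'hoarse from long silence' stays attached to its substitution node, it can be treated as one unit in collation and displayed as such, exactly the advantage claimed for the new node type.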
Deselecting the same 'hide cancellations' option in the menu could offer these users a more complete picture, with the cancelled readings (here in square brackets) restored to the alignment table:

w1 and then again [faint] [hoarse from long silence] faint from far within
w2 and then again nothing from far within
w3 and then again faint from far within

The advantage of having introduced the new type of node (S1 and S2 in the variant graph above) is that the 'hoarse from long silence' variant can be treated as one unit during the computer-supported collation and also be presented as such at the visualization stage.

4 In Closing

The collation of modern manuscripts involves the treatment of cancelled text. A classical problem in this area of study is the division of a modern manuscript into 'stages' or even 'substages', especially because, if an author uses the same writing tool for all the text on the document (including cancellations and additions), it is often almost impossible to discern separate stages. It is possible to work with the 'uncancelled text' of the documents in order to compare the different versions, but researchers working on modern manuscripts are usually especially interested in the cancellations and substitutions. Therefore, we tried to find a solution for the computer-supported collation of modern manuscripts, including cancellations. We have explored how this complex research problem in the application of computers in the humanities could be approached by breaking it down into a community-supported and well-defined set of subproblems, each of which can on its own be solved in a more flexible and efficient way.

[Fig. 17: Conceptual CollateX variant graph capturing genetic stages of authoring as well as witness variation]
Looking beyond the specific problem of computer-supported collation, such an approach appears suitable to us not only because it is a well-established practice in the construction of complex software systems in general, but also because it allows for effective collaboration among researchers and developers from many different backgrounds and projects. From this perspective, it is not by accident that the development of a modularized collation solution took shape within the context of the research project 'Interedition', whose aim it is to foster such collaboration and to address the organizational and architectural issues associated with such an approach as well, issues which point beyond the development of singular software tools for singular use cases.

References

Bordalejo, B. (2009). Introduction to the Online Variorum of Darwin's Origin of Species. In van Wyhe, J. (ed.) (2002), The Complete Work of Charles Darwin Online. http://darwin-online.org.uk/Variorum/Introduction.html (accessed 7 February 2013).

Bourdaillet, J. and Ganascia, J.-G. (2007). Practical block sequence alignment with moves. LATA 2007 - International Conference on Language and Automata Theory and Applications, 3/2007. http://www-poleia.lip6.fr/~ganascia/Medite_Project?action=AttachFile&do=view&target=LATA+2007 (accessed 13 May 2013).

Bryant, J. (2002). The Fluid Text: A Theory of Revision and Editing for Book and Screen. University of Michigan Press. http://books.google.nl/books?id=1w4wpOdPbu4C (accessed 7 February 2013).

Buzzetti, D. (2002). Digital representation and the text model. New Literary History, 33: 61–88.

de Biasi, P.-M. (2004). Toward a science of literature: manuscript analysis and the genesis of the work. In Deppman, J., Ferrer, D., and Groden, M. (eds), Genetic Criticism: Texts and Avant-textes. Philadelphia: University of Pennsylvania Press, pp. 36–68.

Ferrer, D. (1998).
The open space of the draft page: James Joyce and modern manuscripts. In Bornstein, G. and Tinkle, T. (eds), The Iconic Page in Manuscripts, Print, and Digital Culture. Ann Arbor: University of Michigan Press, pp. 249–67.

Ferrer, D. (2011). Logiques du brouillon: Modèles pour une critique génétique. Paris: Éditions du Seuil.

Grafton, A., Most, G. W., and Settis, S. (2010). The Classical Tradition. Harvard University Press, p. 506. http://books.google.nl/books?id=LbqF8z2bq3sC (accessed 7 November 2012).

Grésillon, A. (1994). Éléments de critique génétique: Lire les manuscrits modernes. Paris: Presses universitaires de France.

Hay, L. (2002). La Littérature des écrivains. Paris: José Corti.

Lebrave, J. L. (1993). L'édition génétique. In Cadiot, A. and Haffner, C. (eds), Les Manuscrits des écrivains. Paris: CNRS/Hachette, pp. 206–23.

Levenshtein, V. (1966). Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10: 707–10.

McGann, J. (2001). Radiant Textuality: Literature Since the World Wide Web. New York: Palgrave/St Martin's.

Oakman, R. L. (1984). Computer Methods for Literary Research. Athens: University of Georgia Press, pp. 118–37.

Schmidt, D. and Colomb, R. (2009). A data structure for representing multi-version texts online. International Journal of Human-Computer Studies, 67(6): 497–514.

Schmidt, D. (2009). Merging multi-version texts: a generic solution to the overlap problem. Proceedings of Balisage: The Markup Conference 2009. Montreal: Balisage Series on Markup Technologies, vol. 3.

Shillingsburg, P. L. (2006). From Gutenberg to Google: Electronic Representations of Literary Texts. Cambridge: Cambridge University Press, p. 110.

Siemens, R. (2012). Toward modeling the social edition: an approach to understanding the electronic scholarly edition in the context of new and emerging social media. Literary and Linguistic Computing, 27: 445–61. http://llc.oxfordjournals.org/content/27/4/445.full (accessed 7 November 2012).
Smith, S. E. (2000). 'The eternal verities verified': Charlton Hinman and the roots of mechanical collation. Studies in Bibliography, 53: 130–62.

Spencer, M. and Howe, C. J. (2004). Collating texts using progressive multiple alignment. Computers and the Humanities, 38: 253–70.

Unsworth, J. (2000). Scholarly primitives: what methods do humanities researchers have in common, and how might our tools reflect this? Symposium on 'Humanities Computing: Formal Methods, Experimental Practice', King's College, London.

The research leading to these results has received funding from the European Research Council under the European Union's Seventh Framework Programme (FP7/2007-2013) / ERC grant agreement n° 31360911.