The Character in the Letter: Epistolary Attribution in Samuel Richardson's Clarissa

Lisa Pearl, Kristine Lu, and Anousheh Haghighi
Department of Cognitive Sciences
3151 Social Science Plaza
University of California, Irvine
Irvine, CA 92697
Corresponding author email: lpearl@uci.edu

Abstract

Deliberate differences in how authors represent characters have been a core area of literary investigation since the dawn of literary theory. Here, we focus on epistolary literature, where authors consciously attempt to create different character styles through series of documents like letters. Previous studies suggest that the linguistic gestalt of an author's style – the author's writeprint – can be extracted from the various characters of an epistolary novel, but it is unclear whether individual characters themselves also have distinct writeprints. We examine Samuel Richardson's Clarissa, lauded as a watershed example of the epistolary novel, using a recently developed and highly successful authorship attribution technique to determine (i) whether Richardson can construct distinct character writeprints, and (ii) if so, which linguistic features he manipulated to do so. We find that while there are not as many distinct character writeprints as characters, Richardson does appear to have signature features he alters to create distinct character styles – and few of these features are the function word or abstract syntactic features typically comprising author writeprints. We discuss implications for other questions about character identity in Clarissa and character writeprint analysis more generally.

1 Introduction

Since the dawn of literary theory, deliberate differences in how authors represent characters have been a core area of investigation (Aristotle, 350 BCE). More recently, technological advances and interdisciplinary collaborations have expanded the methodologies used for literary scholarship, enabling investigations into more complex topics of authorship (e.g. The Federalist Papers: Adair, 1944; Mosteller and Wallace, 1963; Rokeach et al., 1970; Holmes and Forsyth, 1995; Tweedie et al., 1996; Bosch and Smith, 1998; Fung, 2003; Collins et al., 2004; Oakes, 2004; Rudman, 2005) and character psychology (Zunshine, 2010; Vermeule, 2011).

Computational stylistics (Milic, 1966; Stamatatos et al., 2001) has been a favored tool in literary scholarship for investigating authorial differences for at least two reasons: (i) its emphasis on character, and (ii) its ability to provide quantitative analysis without displacing the critic's powers of interpretation (Smith, 1989). The related field of authorship attribution often uses similar techniques based on author differences to determine an author's identity. This has been useful for contentiously attributed documents like the Federalist Papers, as well as in cases of authorship deception, where one writer attempts to consciously imitate another author's style (sometimes called imitation attacks: Brennan and Greenstadt, 2009; Pearl and Steyvers, 2012).

Here, we focus on epistolary literature, a genre defined by heightened mimetic qualities such as the lack of an omniscient narrator and the collection of "found documents" by each of the novel's characters. These qualities make it an intriguing domain for questions of character identity, style, and authorship.
In general, authors are assumed to have a writeprint where the gestalt of their linguistic feature usage is distinctive (Li et al., 2006; Abbasi and Chen, 2008; Iqbal et al., 2008, 2010; Pearl and Steyvers, 2012). Even in epistolary novels where authors consciously attempt to create different character styles, the author's writeprint can be extracted. For example, Burrows (2005) successfully applied his prominent Delta technique to identify differences in the writing of Samuel Richardson's epistolary Pamela and Henry Fielding's subsequent parody Shamela. However, a related question concerns the characters within the epistolary novel itself: since they are all written by the same author, are they in fact distinct? That is, do different characters created by the same author have distinct character writeprints? While an author's writeprint may leave its imprint on each of the created characters, those characters could still be quite different stylistically; on the other hand, it could be that the author was unable to deviate significantly from his author writeprint for any of the characters. Notably, the linguistic features that comprise an author's writeprint have often been drawn from function words and abstract syntactic structures, because these are believed not to be consciously manipulable (Mosteller and Wallace, 1964; Burrows, 1987; Holmes et al., 2001; Binongo, 2003; Burrows, 2003; Juola and Baayen, 2005; Zhao and Zobel, 2005; García and Martin, 2007; Stamatatos, 2009; Lučić and Blake, 2015; Kestemont et al., 2015). This is why they become part of the linguistic signature of that author. So, we might expect that an author is unable to alter them even when writing as different characters. Instead, perhaps other linguistic features are altered, or perhaps the author is not truly able to distinguish characters stylistically at all.

Recently, van Dalen-Oskam (2014) examined precisely this question of distinctive character writeprints within the epistolary novels of famous Dutch women writers. While successful in identifying the author writeprints using existing techniques, van Dalen-Oskam was less satisfied with the results of the character writeprint analysis using existing tools like Bootstrap Consensus Trees (Eder, 2010; Eder and Rybicki, 2011) and Burrows's Zeta (Burrows, 2007). She concluded that convincing and objective computational methods do not yet exist for the task of identifying exact differences between character writers, and that "a lot of work will have to be done to find a (combination of) method(s) that will lead to verifiable and repeatable results," particularly beyond "words and their frequencies" (van Dalen-Oskam, 2014). One contribution of the current study is the application of a different writeprint technique used by Pearl and Steyvers (2012) for authorship deception, with what we feel are more satisfying results for issues surrounding character writeprints.

We focus our investigation on Samuel Richardson's Clarissa, one of the longest novels in the English language, at over 1500 pages and nearly a million words. It is a novel rich with critical history and has been lauded as a watershed example of the epistolary novel, receiving a pedagogical renaissance for its expanded set of character authors (over 30) and insight into psychological realism (Zunshine, 2010). With Clarissa, Richardson sealed his reputation not only as a masterful writer cited for his editorial prowess (Price, 2000, p.27) but also as an ambitious publisher of literary insight.
The novel centers on the beautiful and virtuous Clarissa Harlowe, a young lady caught between her ambitious, greedy family and the wiles of the dashing libertine Robert Lovelace (referenced mainly by his last name). Lovelace's consuming desire to possess Clarissa leads to increasingly dastardly and involved ploys of abduction and seduction. To illustrate the moral, familial, and psychological torment facing Clarissa as she stalwartly clings to her virtue, Richardson creates a diversity of meticulous epistles ranging from letters to legal documents to musical compositions to torn remnants of Clarissa's improvisational poetry. As noted in Richardson's postscript to Clarissa, the goal of this heterogeneous collection was to capture the "interesting personalities" of the characters represented while also allowing them to be "various", "natural", and "well distinguished". That is, a primary goal for Richardson was to make the characters distinct enough to realistically be separate people conversing with each other.

Given this rich dataset of character styles, we explore two basic questions about character writeprints in Clarissa. First, can Richardson construct distinct character writeprints at all? If so, it is useful to determine how distinct they are and whether there are as many character writeprints as there are characters. Second, if any distinct character writeprints were in fact created, which linguistic features did Richardson manipulate in order to do so?

We begin by describing the Clarissa epistolary corpus in more detail. We then briefly review the writeprint analysis method from Pearl and Steyvers (2012) (henceforth PS) that we apply, comparing it to other commonly used authorship techniques and highlighting its utility for automatically determining linguistic features indicative of particular characters. We next discuss the set of potential linguistic features that can comprise a character's writeprint, which the PS technique draws from to automatically construct a given character's writeprint. Our results suggest that Richardson is somewhat successful at creating distinctive character styles. However, there are not as many character writeprints as there are characters, and even the character writeprints discovered are not as distinct as typical author writeprints identified by the PS method. Nonetheless, there do appear to be signature features that Richardson tends to alter to create distinct character styles. Interestingly, few of these signature features are the function word features that have traditionally comprised author writeprints (Mosteller and Wallace, 1964; Burrows, 1987; Holmes et al., 2001; Binongo, 2003; Burrows, 2003; Juola and Baayen, 2005; Zhao and Zobel, 2005; García and Martin, 2007; Kestemont et al., 2015) or the deeper syntactic features more recently gaining prominence (Stamatatos, 2009; Lučić and Blake, 2015). We discuss implications for other questions about character identity in Clarissa and character writeprint analysis more generally, concluding with suggestions for fruitful future work in this area.

2 Corpus: Richardson's Clarissa

The version of Richardson's Clarissa used in our research was downloaded from The Oxford Text Archive at http://ota.ahds.ac.uk, and contains 789 letters, comprising 918,624 words total.1 There are samples from 35 distinct characters in total, including an epilogue by Richardson himself.
Though nearly all samples are letters by a single author, several are not so neatly classified: (i) letters where it is undecided who wrote them, (ii) letters jointly written by two or more characters, (iii) letters written by one character pretending to be another, and (iv) a conclusion "supposedly written by" one of the characters, John Belford. Table 1 summarizes the distribution of letters across characters. It is clear that the majority of the letters (619 of 789, which is 78%) come from just a few of the 35 characters: the two central characters Clarissa Harlowe and Lovelace, and their respective confidantes, Anna Howe and John Belford, as highlighted in Fig. 1. Fig. 2 shows the distribution of words per letter for these four characters, which can range from tens of words to thousands of words.

Figure 1: The number of letters each character wrote is shown, with the points representing the four main characters indicated. The remaining points correspond to the other characters who authored letters in Clarissa.

Table 1: Distribution of letters by character in Clarissa, indicating character name and the number of letters written by that character.

Single Characters                                                      #     Others                          #
Clarissa Harlowe                                                       244   Undecided                       59
Lovelace                                                               193   Two authors                     4
John Belford                                                           104   Lovelace as Clarissa            2
Anna Howe                                                              78    Lovelace as Anna                1
Judith Norton, William Morden                                          12    Supposedly written by Belford   1
Arabella Harlowe, James Harlowe, Jr.                                   10
Elizabeth Lawrance, John Harlowe                                       6
Charlotte Harlowe, Charlotte Montague, Lord M                          5
Antony Harlowe                                                         4
Anabella Howe                                                          3
Antony Tomlinson, C. H. Hickman, Clarissa's Father, Dorothy Hervey,    2
  Elias Brand, Joseph Leman, R.D. Mowbray
Alexander Wyverly, Arthur Lewen, Clarissa's Grandfather, Dolly Hervey, 1
  F. J. de la Tour, Hannah Burton, Patrick McDonald, Roger Solmes,
  Samuel Richardson, Tho Doleman, William Summers

Figure 2: The distribution of words per letter for the four main characters.

3 Methods

3.1 Overview of the PS Authorship Method

We adopt the method used by Pearl and Steyvers (2012), a highly successful authorship approach that incorporates several aspects useful for character writeprint analysis, which are listed below in (1). The PS method is notable for using all these components together, though other authorship methods typically use some subset of them.

(1) PS authorship components
a. Preprocessing the raw linguistic feature values to increase the perceived importance of those that are distinct for a given author
b. Using a variety of linguistic feature types
c. Utilizing a subset of the available features to form the writeprint
d. Allowing some writeprint features to matter more than others when making authorship decisions

We briefly review each PS method component in turn before describing the method in more detail.

3.1.1 Preprocessing Feature Values

The preprocessing step is somewhat similar to using the Kullback-Leibler Divergence (KLD), often called relative entropy (Zhao et al., 2006; Savoy, 2013, 2015), as well as to Burrows's Delta (Burrows, 2002; Hoover, 2004; Burrows, 2005, 2007; Argamon, 2008; Stamatatos, 2009; Savoy, 2013; Kestemont et al., 2015; Savoy, 2015) and the chi-squared method (Grieve, 2007; Savoy, 2015). These approaches compare the probability of a given feature value for the author in question against the probability of that feature value in a specified comparison set.
For the PS method, the comparison set is the entire set of authors collectively; in contrast, for the KLD, Burrows's Delta, and chi-squared methods, the comparison set is a single author at a time. To our knowledge, no other authorship method currently uses this component implemented this way, though the PS implementation of it could be incorporated into any method that involves running an algorithm over feature values.

3.1.2 Linguistic Feature Types

Unlike methods that rely on word frequency alone (e.g. Burrows's Delta), the PS method allows linguistic features to range across a variety of character-level, word-level, syntactic, semantic, and formatting features (see Tables 7-17), similar to some previous approaches (see Stamatatos (2009) for a review; see also Luyckx and Daelemans (2011) and Eder (2013), among others). Like the preprocessing component, this could be used with any method that involves running an algorithm over feature values, though several recent implementations have not done so (e.g. the KLD implementation of Zhao et al. (2006), the Nearest Shrunken Centroid (NSC) implementation of Jockers (2013), the Bootstrap Consensus Tree as applied by van Dalen-Oskam (2014), the Principal Component Analysis (PCA) implementation of Kestemont et al. (2015), and the KLD and chi-squared implementations of Savoy (2015)).

3.1.3 Features in a Writeprint

The Sparse Multinomial Logistic Regression (SMLR) algorithm of Krishnapuram et al. (2005) used by the PS method has the ability to automatically determine which subset of the available features is most useful for making authorship decisions, like several other approaches (e.g. Burrows's Delta, NSC, PCA, and Support Vector Machines (SVMs)). This contrasts with methods like K-nearest neighbors (KNN) that obligatorily use the entire feature set. Importantly, the feature subset is what comprises the writeprint of any particular author. One notable ability of the SMLR algorithm is that it can potentially identify a different subset of features for each author writeprint, distinguishing it from methods that require all writeprints to use the same subset of linguistic features (e.g. PCA, NSC).

3.1.4 Using Writeprint Features to Make Authorship Decisions

Like some other approaches (e.g. SVM, PCA), the SMLR algorithm used by the PS method allows some writeprint features to matter more than others when making authorship decisions. A feature's importance to the algorithm is typically indicated by its weight. For example, a positive weight corresponds to a writeprint feature that is indicative of the author, with higher weights indicating features that are more useful for determining authorship.

3.2 Applying the PS Authorship Method for Character Writeprints

The basic representation of the problem the PS method will solve for character writeprints is whether a target letter (e.g. a letter by a character from Clarissa) is by the same character as the reference set of letters (e.g. letters by a single character from Clarissa). So, every data point will involve information derived from the sets of letters in (2), and the core decision is which character is the author.

(2) Sets of documents used to create data points
a. same data point S = letter written by a character C1, Reference set R = all letters written by C1
b. different data point D = letter written by a character C2, Reference set R = all letters written by C1

For a "same author" data point (2a), the label should be the same as the character whose letters comprise the reference set
(e.g. Clarissa Harlowe for the Clarissa Harlowe reference set). For a "different author" data point (2b), the label should be some other character – ideally the correct character, but at the very least not the character whose letters comprise the reference set (e.g. some non-Clarissa Harlowe character for the Clarissa Harlowe reference set).2

The SMLR classifier (Krishnapuram et al., 2005) is a supervised machine learning method that first trains on a collection of these data points that are labeled with the character author, learning about the character writeprint for the character (C1) in the reference set. The classifier then attempts to apply this acquired writeprint knowledge to a new unlabeled collection of data points (the test set), which contains both same and different data points. If a character's writeprint is distinct, the classifier should perform well at identifying letters written by the character in the reference set (C1); in contrast, if a character's writeprint is not distinct, the classifier will perform poorly.

A dataset is created for each character being investigated, with that character's letters used as the reference set R for all the data points in the set. To create a data point, we follow the PS preprocessing procedure that increases the prominence of potentially distinctive feature values by determining if a particular feature value is unusual compared to the values characters typically have for that feature. We note that this procedure is applied to each feature separately. Specifically, we first calculate the probability of the feature value fv occurring, given the distribution of feature values in the reference set of letters from the character in question: pchar = p(fv | distribution for character C1). Then, we compare pchar against the probability of that feature value occurring, given the distribution of feature values in the set of letters from all characters in Clarissa: pall = p(fv | distribution for all characters). This gives us a quantitative measure of how distinctive feature value fv is for the character. If the feature value is unusual (e.g., f1 in Fig. 3), the probability of fv coming from the character in question will be higher than the probability of fv coming from the entire population of characters (i.e. pchar > pall). If the feature value is instead common to other characters as well, the probability of fv coming from the character in question will not be higher (i.e. pchar ≈ pall). If the feature value is an aberration for this character (e.g., f2 in Fig. 3), the probability of fv coming from the character in question will be lower (i.e. pchar < pall).

Each transformed feature value is calculated as follows, with the major aspects highlighted in Fig. 3. First, we log transform the feature values from the reference set of letters for that character (i.e. new value = log(raw value)), which creates a distribution of values that is roughly normally distributed, provided the sample size is large enough. We then estimate the best-fitting normal distribution for this observed distribution that represents the feature distribution for this character's letters (the CHAR normal distribution). We note that this is where the size of the letter set authored by a particular character ceases to matter (e.g., one character writing 244 letters while another character only writes 78 letters) – the only information extracted is the parameter values of the best-fitting normal distribution of the feature values in that character's letters.
As long as a normal distribution can be reasonably estimated, the exact size of the character letter data set is irrelevant. We then apply the same process to the set of letters from all characters in Clarissa, generating the normal distribution ALL that represents that feature's distribution across all characters. We then calculate the probability that the observed feature value fv in the target letter would be drawn from the CHAR distribution (p(fv|CHAR) = pchar) and compare that against the probability that fv would be drawn from the ALL distribution (p(fv|ALL) = pall) using the log-odds ratio in (3):

(3) log(pchar / pall) = log( p(fv|CHAR) / p(fv|ALL) )

A positive value means that pchar is larger than pall, and so this feature value is unusual for the population of characters as a whole, but more typical for the character whose letters comprise the reference set. That is, it is more likely to be a distinctive feature for this character. A negative value means that pchar is smaller than pall, and so this feature value is more typical for the population of characters as a whole and not for the reference set character. That is, it is likely not to be a distinctive feature for this character. Fig. 3 demonstrates this for values f1 and f2, where f1 would have a positive log-odds ratio while f2 would have a negative one.

Figure 3: Sample normal distributions derived from the log-transformed values for a single feature in the letters of a single character (CHAR) and the letters of all characters (ALL). Sample feature value f1 is typical of CHAR but atypical of ALL, and so would have a positive log-odds ratio. Sample value f2 is atypical of CHAR but typical of ALL, and so would have a negative log-odds ratio.

Because this preprocessing procedure is applied to each feature separately, the effective diagnosticity of each feature is assessed separately. Importantly, what matters is not how large or small the raw feature value is for a given feature, but how unusual it is compared to other feature values that occur. So, a data point in a character's data set, derived from a target letter and a reference set of letters from that character, has the form in (4): a label indicating the character author of the target letter followed by a set of log-odds transformed feature values from that target letter, given the reference set.

(4) Sample data points derived from target letters and a reference set of letters by Anna Howe
a. Anna Howe, 0.668627018303141, 3.12335486642762, 0.141156015976692, 0.336583176410039, ... 1.82762529986441
b. Clarissa Harlowe, 0.666158689800687, 2.58610541486525, 0.141156015976692, 0.336898797155784, ... 1.21406973468136

A dataset consists of 789 data points, with every letter in Clarissa used as a target letter. So, some portion of a dataset will be labeled with the reference author (e.g. Anna Howe in (4)), while the rest will be labeled with other characters. Because the preprocessing procedure requires the reference set to be of some size in order to more accurately estimate a normal distribution for an individual character's reference set of letters (CHAR), we restricted our analysis to characters with fifty or more letters in Clarissa. This confined our analysis to the four characters with the most letters and so yielded four distinct datasets: Clarissa Harlowe, Lovelace, John Belford, and Anna Howe. The SMLR classifier was then run on each constructed dataset.
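To make this preprocessing step concrete, the sketch below implements the log-odds transformation in Python. It is an illustration of the computation described above rather than the PS implementation itself: the function name and the small constant guarding against log(0) are our own assumptions.

    import numpy as np
    from scipy.stats import norm

    def log_odds_value(fv, char_values, all_values, eps=1e-10):
        """Log-odds that feature value fv comes from the character's
        distribution (CHAR) rather than the whole-cast distribution (ALL).

        char_values: this feature's raw values in the reference character's letters
        all_values: this feature's raw values across all characters' letters
        eps guards against log(0) for count-based features (our assumption).
        """
        # Log-transform so each set of values is roughly normally distributed.
        log_char = np.log(np.asarray(char_values) + eps)
        log_all = np.log(np.asarray(all_values) + eps)
        # Estimate the best-fitting normal distribution for each set.
        char_dist = norm(loc=log_char.mean(), scale=log_char.std())
        all_dist = norm(loc=log_all.mean(), scale=log_all.std())
        # Compare the two densities at the target letter's value, as in (3):
        # positive = distinctive for this character; negative = typical of the cast.
        log_fv = np.log(fv + eps)
        return np.log(char_dist.pdf(log_fv) / all_dist.pdf(log_fv))

Note that only the fitted parameters (mean and standard deviation) of each distribution are used, which is why the size of a character's letter set ceases to matter once the fit is reasonable.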
We note that the SMLR classifier requires a parameter λ that controls how strongly the classifier is biased to select a subset of the available linguistic features for a character's writeprint. Larger values of λ lead to writeprints consisting of fewer features, while smaller values lead to writeprints consisting of more features. A value of 0.1, which is a fairly strong bias to prefer fewer features, led to the best performance in a pilot analysis, and so we used this for our analyses below.

We used the SMLR Java implementation available from http://www.cs.duke.edu/~amink/software/smlr/, implementing ten-fold cross-validation to evaluate the classifier's performance. In ten-fold cross-validation, the dataset is divided into ten "folds" of equal size, with each fold containing approximately the same distribution of labeled data as the entire data set. So, for example, since 244 of 789 letters are Clarissa Harlowe letters, each fold contained 78 or 79 target letters, approximately 24 of which were Clarissa Harlowe letters. The classifier then makes ten passes through the dataset, training on the data in nine of the ten folds and testing on the data in the remaining fold, with each fold taking a turn as the test fold. So, for every pass, the classifier learns what it can from 9/10 of the dataset about how the character labels are determined based on the preprocessed feature values, and then attempts to label the data points in the remaining 1/10 with the appropriate character. One benefit of cross-validation is that results are obtained for every single data point in the data set, as each data point will be in one of the test folds. This guards against results being impacted by a particularly easy or difficult test set, since ten test sets are used, and the average of the ten test sets is the final score.

3.3 Linguistic Features Available for Character Writeprints

The PS method uses the SMLR classifier to automatically construct writeprints from a set of available linguistic features and bases its authorship decisions on those writeprints. We extracted a set of 122 linguistic features of seven different kinds, shown in Tables 7-17 in the Appendix: 22 character-level features, 5 word-level features, 28 syntactic category features that could be viewed as contentful (e.g. nouns), 24 syntactic category features that could be viewed as functional (e.g. prepositions), 32 syntactic structure features (e.g. passives), 4 formatting features (e.g. italicized words), and 7 semantic features (e.g. endearments). Similar to Pearl and Steyvers (2012), these are all stylometric features that can be extracted automatically using freely available natural language processing software3 and scripts written in a text manipulation programming language like PERL. Notably, many of these features can be easily and automatically extracted from languages besides English, provided the natural language processing software is available in the desired language (i.e., the character-level features, the word-level features, several content syntactic category features, several syntactic structure features, and the formatting features). However, we do note that many of the remaining features were manually identified using knowledge of English grammar (some content syntactic category features, several functional syntactic category features, and some syntactic structure features).
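To give a concrete sense of how the automatically extractable features work, here is a minimal sketch computing a few of the character-level and word-level feature values for a single letter. The function name, tokenization, and normalizations are our own illustrative choices, not the authors' PERL scripts; the feature names echo those in Table 3 below.

    import re

    def surface_features(letter_text):
        """A few character-level and word-level feature values for one letter.
        The full PS-style feature set (122 features) additionally requires a
        POS tagger and a parser for the syntactic features."""
        words = re.findall(r"[A-Za-z']+", letter_text)
        n_chars = len(letter_text)
        n_words = len(words)
        return {
            "total_characters": n_chars,
            "total_words": n_words,
            "word_length": sum(len(w) for w in words) / max(n_words, 1),
            "punctuation_frequency": sum(not c.isalnum() and not c.isspace()
                                         for c in letter_text) / max(n_chars, 1),
            "alphabetic_character_frequency": sum(c.isalpha()
                                                  for c in letter_text) / max(n_chars, 1),
        }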
Finally, the semantic features were manually identified using domain-specific knowledge of Clarissa, rather than using the topic-modeling approach of Pearl and Steyvers (2012), due to the relatively small size of the corpus. In general, when applying the PS method to character writeprints, we believe it is likely more expedient for literary scholars to draw on their own knowledge of the particular literature under investigation when identifying potentially useful semantic features. For scholars studying works in other languages, this would require manual identification of the relevant semantic features in those languages.

4 Results

We present two kinds of results: (i) quantitative results relating to character writeprints, based on applying the PS method, and (ii) qualitative interpretations of those results that relate to character distinctiveness and signature writeprint features manipulated by Richardson.

4.1 Character Distinctiveness

For each of the four characters who wrote over fifty letters in Clarissa, we can examine how distinctive that character's style is by assessing the ability of the PS method to correctly label a target letter with the correct character, given a reference set of letters by that character (e.g. label a letter by Clarissa Harlowe as being by her, when referenced against a set of letters by her). Notably, the PS method had an average success rate of 100% on one data set and 89% on another in the Pearl and Steyvers (2012) study that attempted to identify different authors, so it is very accurate when there are in fact different authors writing. If the character writeprints in Clarissa are as distinct as author writeprints typically are, we would expect similar performance.

From the results of the SMLR classifier for each character, we can derive a confusion matrix as in Table 2, where the rows represent the true character writer of the letter and the columns represent the character writer labeled by the SMLR classifier. The four quantities correspond to the distinctions in signal detection theory, representing (i) true positives (a): the number of letters where the SMLR correctly labeled a letter by the target character as being by the target character, (ii) false negatives (b): the number of letters where the SMLR incorrectly labeled a letter by the target character as being by a different character, (iii) false positives (c): the number of letters where the SMLR incorrectly labeled a letter not by the target character as being by the target character, and (iv) true negatives (d): the number of letters where the SMLR correctly labeled a letter not by the target character as not being by the target character.

Standard metrics used in computational linguistics to gauge the success of a classifier are precision and recall, defined as in (5a) and (5b), and combined into a single summary statistic known as the F-score using the harmonic mean definition in (5c). Precision, recall, and F-score all range between 0 and 1, with 1 being perfect performance. The intuitive interpretation of precision is how accurate identification is, while the intuitive interpretation of recall is how complete identification is. Ideally, a classifier will be both very accurate and very complete in its identification, yielding a high F-score.
Table 2: The performance of the SMLR classifier can be summarized using a confusion matrix, where the rows represent the true identity of the target letter's author and the columns represent the SMLR-labeled identity of the target letter's author. Correct labels are a (true positives) and d (true negatives), while incorrect labels are b (false negatives) and c (false positives).

                       Labeled
                       character    non-character
True   character       a            b
       non-character   c            d

(5) Evaluation metrics, with quantities from Table 2
a. Precision = # correctly labeled as character (a) / # labeled as character (a + c)
b. Recall = # correctly labeled as character (a) / # should be labeled as character (a + b)
c. F-score = 2 * (precision * recall) / (precision + recall)

Table 3 shows the F-scores for each of the four characters investigated, while the detailed information from the confusion matrices as well as the precision and recall scores for each character appear in Appendix B. Most notably, none of the characters seem to have an easily identifiable style – the highest performance by F-score is 0.558. This suggests that it is not easy for a single author, such as Richardson, to differentiate the writing styles of different characters. Still, there seem to be natural classes of characters: (i) those that are more distinctive (Lovelace: 0.558, Clarissa Harlowe: 0.535), (ii) those that are somewhat distinctive (John Belford: 0.396), and (iii) those that are not very distinctive at all (Anna Howe: 0.217).

There are at least two reasons why a character's writeprint may not be very distinctive. First, a character may simply be a hodgepodge of multiple characters' styles, and so would overlap with all other character styles. Alternatively, a character may be a derivative of other character styles and have a writeprint that borrows from the main writeprint of those other character styles. We can distinguish between these options by looking at the common confusions that emerge from the SMLR confusion matrix, shown in Table 3. Any character with which the SMLR confused the target character more than 10% of the time (either for the precision calculation, the recall calculation, or both) is shown. We observe that both John Belford and Anna Howe are confused with Clarissa Harlowe and Lovelace, but not with each other; we interpret this to mean that John Belford's and Anna Howe's writeprints are distinct from each other. Moreover, while Lovelace is confused with John Belford, he is not confused with Anna Howe; we interpret this to mean John Belford's writeprint comes primarily from Lovelace's.4 Similarly, while Clarissa Harlowe is confused with Anna Howe, she is not confused with John Belford; we interpret this to mean Anna Howe's writeprint comes primarily from Clarissa Harlowe's.

Given the quantitative information from the F-scores and qualitative information from the systematic confusions, we suggest that Richardson was able to generate two main styles, one for each of the two central characters (Lovelace, Clarissa Harlowe). He then derived additional character styles from those two main styles (John Belford from Lovelace, Anna Howe from Clarissa Harlowe). Narratively, this makes intuitive sense because Belford and Anna serve primarily as "sounding walls" for Lovelace and Clarissa, respectively, as the moral conflict in the novel escalates.
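For concreteness, the metrics in (5) can be computed directly from the four confusion-matrix quantities of Table 2. The function below is a minimal sketch (the name is ours), and the example counts are purely hypothetical rather than the study's actual matrices.

    def confusion_metrics(a, b, c, d):
        """Precision, recall, and F-score from the Table 2 quantities:
        a = true positives, b = false negatives,
        c = false positives, d = true negatives (unused by these metrics)."""
        precision = a / (a + c)  # how accurate identification is
        recall = a / (a + b)     # how complete identification is
        f_score = 2 * precision * recall / (precision + recall)
        return precision, recall, f_score

    # Hypothetical counts for illustration only:
    print(confusion_metrics(a=100, b=80, c=78, d=531))  # F-score of about 0.56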
Given this relationship between main and derived styles, an interesting subsequent analysis is to calculate the F-scores for the two main characters while allowing the letters derived from their styles to count as instances of the main character's letters – that is, John Belford's letters count as instances of Lovelace's letters and Anna Howe's letters count as instances of Clarissa Harlowe's letters. This increases the F-scores of both main characters: Lovelace's F-score increases to 0.586 while Clarissa Harlowe's F-score increases to 0.575. Though this improvement is non-trivial, these character F-scores are still far below what we see when we compare the writings of two different authors (where F-scores are at 0.890 and above, based on Pearl and Steyvers (2012)). A possible reason for this involves the linguistic features that Richardson is able to manipulate to create different character styles. We examine these next.

4.2 Signature Features

For each character (whether distinctive or not), we can examine the most distinctive features according to the SMLR analysis, since the classifier learned not only which features comprised a character's writeprint but also how important each of those features was to the writeprint. The importance of a given feature is indicated by the weight learned for it; the features with the highest positive weight are the ones that influenced the SMLR's decision the most when deciding to label a target letter based on a particular character's reference set. Though the SMLR bases its decision on all non-zero weighted features, these top features can serve qualitatively as the "signature" writeprint features for that character's style, as they matter the most to the classifier's decision. From this, we can assess how similar signature features are across character styles.

Table 3 lists the signature features for each character style, indicating which ones are shared across character styles and which specific characters a given signature feature is associated with when more than one character utilizes that signature feature. There do seem to be common features Richardson prefers to manipulate, and every character (distinctive or not) has several. The most commonly manipulated linguistic features are shown in Table 4. Interestingly, few of these commonly manipulated features involve what would be considered function words, which have often been a core distinguishing feature of author writeprints (Mosteller and Wallace, 1964; Burrows, 1987; Holmes et al., 2001; Binongo, 2003; Burrows, 2003; Juola and Baayen, 2005; Zhao and Zobel, 2005; García and Martin, 2007; Stamatatos, 2009; Lučić and Blake, 2015; Kestemont et al., 2015). Only one feature (frequency of all function words together) is of this kind. Similarly, few are the abstract syntactic features that have also been central to more recent authorship studies (Lučić and Blake, 2015). For example, while the average lengths of different syntactic phrases were potential writeprint features that would capture nuances of verbosity specific to those syntactic structures, none of these features were identified as common writeprint features that Richardson manipulated. The three features he manipulated that come closest to this kind are the frequency of clauses, the frequency of fragments, and the frequency of wh-adverb phrases.
Table 3: Distinctiveness of main characters, in descending order by F-score, which is a summary statistic of classifier performance. Signature features for each character (SMLR weight ≥ 1.4) are also listed in descending order of strength, according to the writeprints identified by the SMLR classifier. Signature features appearing for more than one character are marked with the initials of the character(s) that share(s) that signature feature in parentheses. Common confusions (accounting for ≥10% of errors for precision and/or recall) are also shown.

Lovelace (L): F-score 0.558
  Signature features: verb frequency (CH, AH), alphabetic character frequency, verb phrase frequency, function word frequency (AH), noun phrase frequency, fragment frequency (JB), total characters, punctuation frequency (CH), total words, wh-adverb phrase frequency (JB), title frequency (JB), word length (JB), familial term frequency, gerund or present participle frequency, infinitive to frequency
  Common confusions: CH, JB

Clarissa Harlowe (CH): F-score 0.535
  Signature features: verb frequency (L, AH), clause frequency (JB), punctuation frequency (L), first person pronoun frequency, universal determiner frequency, contraction frequency
  Common confusions: L, AH

John Belford (JB): F-score 0.396
  Signature features: clause frequency (CH), parenthesis frequency, noun frequency, colon frequency, fragment frequency (L), title frequency (L), endearment frequency (AH), adverb phrase length, word length (L), wh-adverb phrase frequency (L)
  Common confusions: CH, L

Anna Howe (AH): F-score 0.217
  Signature features: function word frequency (L), endearment frequency (JB), verb frequency (L, CH), wh-noun phrase frequency, second person pronoun frequency, emdash frequency, noun phrase frequency
  Common confusions: CH, L

Interestingly, two of these three syntactic structure features have cues that are somewhat contentful. Sentence fragments represent an incomplete structure, and an incomplete syntactic structure leads to an incomplete sentence meaning. Likewise, wh-adverb phrases are headed by when, where, why, and how, and these words are fairly easy to define (e.g., the time something happened = when). So, both sentence fragments and wh-adverb phrases may be structural features that are easier to consciously recognize and manipulate. Clause frequency may operate this way as well, since clause frequency within a sentence can be increased by either conjoining main clauses together (e.g. I like this and I want to read more) or embedding clauses in main clauses (e.g. I think that I like this). In either case, a complete thought (represented by the additional clause) is added. Thus, clause frequency may also be straightforward to consciously manipulate.

The remaining commonly manipulated features range over syntactic categories based on content words (frequency of verbs), semantic categories (frequency of endearments and titles), and character-level features (frequency of punctuation, average word length). These may also be easier to consciously recognize and manipulate. This lends support to the idea that both functional category and abstract syntactic structure usage are more unconscious and therefore more indicative of an author's genuine identity. If functional category and abstract syntactic structure usage are harder to consciously manipulate, and these aspects are often at the heart of an author's writeprint, this could be one reason why character writeprints aren't as distinctive as author writeprints typically are – even for an expert like Richardson. Instead, Richardson focused on the features he could consciously manipulate, which belong to other feature types.
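For readers who want to reproduce this kind of weight-based signature analysis, the sketch below uses scikit-learn's L1-regularized logistic regression as a stand-in for the SMLR classifier, in a one-vs-rest simplification of the n-way task. The function, its threshold argument, and the regularization settings are our own illustrative assumptions (the study itself used the SMLR Java implementation with λ = 0.1, and a signature cutoff of weight ≥ 1.4 as in Table 3).

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def signature_features(X, y, feature_names, character, threshold=1.4):
        """Rank one character's features by learned weight.

        X: letters-by-features matrix of log-odds-transformed values
        y: character label for each letter
        The L1 penalty plays the role of the SMLR's bias toward writeprints
        built from a subset of the available features."""
        target = (np.asarray(y) == character).astype(int)
        clf = LogisticRegression(penalty="l1", solver="liblinear")
        clf.fit(X, target)
        weights = clf.coef_[0]
        ranked = sorted(zip(feature_names, weights), key=lambda fw: -fw[1])
        # Features at or above the weight threshold act as "signature" features.
        return [(name, w) for name, w in ranked if w >= threshold]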
Table 4: Linguistic features most commonly manipulated by Richardson, including the type of feature, the number of characters for which the linguistic feature is among the set of signature features, and the specific characters whose writeprints contain the signature feature. Characters with distinctive writeprints are indicated with an asterisk (*).

Feature                       Type                       # char   Characters
verb frequency                syntactic (content)        3        *Lovelace, *Clarissa Harlowe, Anna Howe
clause frequency              syntactic (structure)      2        *Clarissa Harlowe, John Belford
fragment frequency            syntactic (structure)      2        *Lovelace, John Belford
wh-adverb phrase frequency    syntactic (structure)      2        *Lovelace, John Belford
function word frequency       syntactic (function)       2        *Lovelace, Anna Howe
endearment frequency          semantic                   2        John Belford, Anna Howe
title frequency               semantic                   2        *Lovelace, John Belford
punctuation frequency         character (punctuation)    2        *Lovelace, *Clarissa Harlowe
word length                   character (word)           2        *Lovelace, John Belford

Next, how did Richardson manipulate the signature features for each character? For each signature feature in a character's writeprint, we can examine how the distribution of that feature's values in a character's letters compares to the distribution of that feature's values in the entire set of character letters. In particular, a simple analysis is to ask whether that character's feature value was generally higher or lower than that feature's value in the character population, as measured by the average or median feature value. If this occurs, it indicates one simple way that Richardson manipulated that feature to create an aspect of the character's writeprint. We note that not all signature features will have a distribution that shows up under this analysis because there are many ways for a feature distribution to be distinctive, and having an average or median value that is higher or lower than the population average or median value is merely one of them.5 Nonetheless, this is a convenient analysis to try, as it lends itself well to verbally summarizing a character's style (e.g. one character uses more present participles and fewer titles) and to generating content in that character's style (e.g. to emulate this character, use more present participles and fewer titles).

For each signature feature, we compared the character's median value against the population's median value, and the character's average value against the population's average value. If the character's median and/or average value was at least 10% higher or lower than the population's median and/or average value for that feature, we included it in the set of signature features in Table 5 that have an easily describable pattern of usage. For example, Lovelace's title frequency had a median value of approximately 0.004 and an average value of approximately 0.0000085. The population median value is approximately 0.006, while the population average value is approximately 0.000074. So, Lovelace's median value is less than the population median value by over 33% (0.004/0.006 - 1 = -0.333) and Lovelace's average value is less than the population's average value by over 88% (0.0000085/0.000074 - 1 = -0.885). This indicates that Lovelace's style uses titles less frequently than other characters, based on both the median and average values for this feature. Table 5 summarizes the results of this analysis for the four characters investigated.
Table 5: Signature features for each character that have an average or median characteristic usage that is either significantly higher (+10%) or lower (-10%) than the character population average or median. If average and median usage differ, [avg] or [median] indicate which one behaves which way. Distinct usage shared across character signature features is marked with the initials of the character(s) that share(s) that distinctive average/median usage in parentheses. Asterisks (*) indicate signature feature usage that is at least 100% higher/lower than the character population average and/or median.

Lovelace (L)
  +: *gerund or present participle frequency, *infinitive to frequency, total characters, total words
  -: familial term frequency, title frequency (JB [avg], JB [median])

Clarissa Harlowe (CH)
  +: *contraction frequency, *first person pronoun frequency
  -: universal determiner frequency

John Belford (JB)
  +: adverb phrase length, *colon frequency (CH), *parenthesis frequency, title frequency [median] (L)
  -: *endearment frequency (AH), title frequency [avg] (L)

Anna Howe (AH)
  +: *emdash frequency, *endearment frequency (JB), *second person pronoun frequency, *wh-noun phrase frequency
  -: None

From this analysis, we can observe two notable stylistic choices made by Richardson, both of which appear to distinguish character styles. The first distinguishing feature is the use of endearments, as endearment frequency is a signature feature for both John Belford and Anna Howe, but in the opposite direction: Belford tends to use relatively fewer endearments while Anna tends to use relatively more. Interestingly, this is not a signature feature for either of the main writeprints (Lovelace and Clarissa), so it is unlikely to be something accidentally transferred from the main writeprints to the derived writeprints. Instead, it is more likely to be a conscious choice by Richardson to distinguish these two characters, who write more letters than any others except for Lovelace and Clarissa.

The second distinguishing feature is the use of titles (title frequency), which is manipulated for both Lovelace and John Belford. Lovelace always uses titles less frequently, while Belford's letters show a more nuanced pattern: his average usage is less frequent, but his median usage is more frequent. This suggests that Belford's letters typically use titles more frequently (i.e. many of Belford's letters have more titles than a typical character's letter), but there are a few outlier letters that use titles far less frequently than a typical character's letter. These outlier letters would lower the average title frequency while leaving the median title frequency relatively unaffected (see endnote 5 about the relationship between average and median usage). A potential interpretation of this pattern is that the usage of titles in letters is something Richardson felt was masculine, and so he consciously manipulated it when writing letters by male characters.

More generally, this analysis provides simple rubrics for how to write in the style of a specific character. For example, a message by Lovelace should contain few titles or familial terms, be verbose, and use both present participles and infinitival to.
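A minimal sketch of the ±10% screening analysis just described follows, using the Lovelace title-frequency numbers above as a worked check; the function name and return format are our own.

    import numpy as np

    def distinctive_usage(char_values, population_values, threshold=0.10):
        """Flag a signature feature whose character median and/or average
        differs from the population's by at least 10%."""
        flags = {}
        for name, stat in (("median", np.median), ("average", np.mean)):
            ratio = stat(char_values) / stat(population_values) - 1
            if abs(ratio) >= threshold:  # e.g. 0.004/0.006 - 1 = -0.333 for titles
                flags[name] = "+" if ratio > 0 else "-"
        return flags  # e.g. {"median": "-", "average": "-"} for Lovelace's titles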
Table 6 provides sample messages that obey these rubrics, thereby representing "prototypical" examples of these characters' styles.

Table 6: Example messages from each character that follow the rubrics derived from each character's signature features.

Lovelace: And then, what a comely sight, all kneeling down together in one pew, according to eldership, as we have seen in effigie, a whole family upon some old monument, where the honest chevalier, in armour, is presented kneeling, with uplift hands, and half a dozen jolter-headed crop-eared boys behind him, ranged gradatim, or step-fashion, according to age and size, all in the same posture–Facing his pious dame, with a ruff about her neck, and as many whey-faced girls, all kneeling behind her: An altar between them, and an opened book upon it: Over their heads semilunary rays darting from gilded clouds, surrounding an atchievement-motto, IN COELO SALUS– or QUIES–perhaps, if they have happened to live the usual married life of brawl and contradiction. (http://ota.ox.ac.uk/text/4363.html)

Clarissa Harlowe: You'll observe, that altho' I have not demanded my estate in form, and of my trustees, yet that I have hinted at leave to retire to it. How joyfully would I keep my word, if they would accept of the offer I renew!–It was not proper, I believe you'll think, on many accounts, to own that I was carry'd off, against my inclination. (http://ota.ox.ac.uk/text/4360.html)

John Belford: He succeeds, takes private lodgings for her at Hackney; visits her by stealth, both of them tender of reputations, that were extremely tender, but which neither had quite given over; for rakes of either sex are always the last to condemn or cry down themselves: Visited by nobody, nor visiting: The life of a thief, or of a man beset by creditors, afraid to look out of his own house, or to be seen abroad with her. And thus went he on for twelve years, and, tho' he had a good estate, hardly making both ends meet; for, tho' no glare, there was no oeconomy; and besides, he had every year a child, and very fond of them was he. But none of them lived above three years: And being now, on the death of the dozenth, grown as dully sober, as if he had been a real husband, his good Mrs. Thomas (for he had not permitted her to take his own name) prevailed upon him, to think the loss of their children a judgment upon the parents for their wicked way of life... (http://ota.ox.ac.uk/text/4361.html)

Anna Howe: I HAVE both your letters at once. It is very unhappy, my dear, since your friends will have you marry, that such a merit as yours should be addressed by a succession of worthless creatures, who have nothing but their presumption for their excuse. That these presumers appear not in this very unworthy light to some of your friends, is, because their defects are not so striking to them, as to others.–And why? Shall I venture to tell you?–Because they are nearer their own standard.–Modesty, after all, perhaps has a concern in it; for how should they think, that a niece or a sister of theirs (I will not go higher, for fear of incurring your displeasure) should be an angel? (http://ota.ox.ac.uk/text/4358.html)

5 General Discussion

Using a state-of-the-art authorship classification approach (Pearl and Steyvers, 2012), we discovered that Richardson was able to create two distinct character writeprints for the four characters examined. This suggests that while it is possible for an author to make some distinctive character writeprints, it is non-trivial to do so for each character.

Notably, the character writeprints Richardson is able to create are not as distinctive as author writeprints typically are. A related observation is that the features Richardson most often manipulates to create these character writeprints are not the functional or abstract syntactic features that have been prominent for authorship studies. We suggest that this may be due to the accessibility of these features. That is, the reason they are so often used for author writeprints is precisely because they are not easy to consciously manipulate. In contrast, when a single author is creating several character writeprints, the manipulated features may naturally be the ones that are consciously accessible.

5.1 Applying the PS Approach for Related Authorship Questions in Clarissa

In addition to these discoveries about the character writeprints in Clarissa, we can also answer interesting questions about literary deception in this particular epistolary novel. Notably, there are three letters in which one character is pretending to be another: Lovelace writes one letter as Anna Howe and two letters as Clarissa Harlowe. This presents an intriguing layering effect for character writeprints: Richardson is attempting to create the character writeprint for a character (Lovelace) who is attempting to imitate another character's writeprint (Anna or Clarissa). A very basic question is whether Richardson successfully shifted Lovelace's writeprint so that it appeared not to be Lovelace. We applied the PS method to each of these three letters, using Lovelace's letters as the reference set. If Richardson was successful at altering Lovelace's writeprint, the SMLR classifier should not identify any of those letters as having been written by Lovelace. This was indeed the case for all three letters, meaning that Richardson effectively masked Lovelace's style for those letters.

Given this, was Lovelace successful in his deception as Anna and Clarissa? To answer this, we applied the PS method to these three letters, using Anna's letters as the reference set for the letter impersonating Anna and Clarissa's letters as the reference set for the letters impersonating Clarissa. Here, the deception seemed to fail. The letter impersonating Anna was not labeled as being by Anna when compared against the Anna reference set. Similarly, the letters impersonating Clarissa were not labeled as being by Clarissa when compared against the Clarissa reference set. So, Lovelace's deception was incomplete in this sense – though perhaps that was Richardson's intention. In particular, Richardson could have intended for the reader to recognize that the style wasn't quite the purported one in each case (Anna or Clarissa, respectively).

Additionally, there is a single letter that is "supposedly written" by John Belford. Yet, when compared against the Belford letters as a reference set, this letter was not labeled as being by Belford. This suggests that Richardson was unsuccessful in his portrayal of Belford as the writer for this letter, whether intentionally or unintentionally.
Richardson may have intended for it to be written by one of the other characters; if so, it would have a writeprint matching one of these other characters. However, when compared against other character reference sets with ten or more letters (Anna Howe, Arabella Harlowe, Clarissa Harlowe, James Harlowe Jr, Judith Norton, Lovelace, William Morden), this letter was also not identified as matching the writeprint of any of those characters. So, in general, this letter does not seem to be an effective portrayal of any easily identifiable character from Clarissa.

5.2 Applying the PS Approach More Generally for Character Writeprints

We believe the PS approach for discerning character writeprints in epistolary novels can be used to answer several questions of interest to literary scholars, including epistolary novel technique (both Richardson's and that of other epistolary novel authors), comparative evaluations of epistolary author skill, approaches to constructing character writeprints, and cues to author identity.

With respect to Samuel Richardson in particular, how skilled is he in his other epistolary novels (Pamela, The History of Sir Charles Grandison) at creating the appropriate number of character writeprints? Using the same PS approach, we can examine how distinct the character writeprints are and whether there are both main and derived writeprints, as in Clarissa. We can also investigate whether Richardson manipulates the same signature linguistic features as he did in Clarissa, and whether these features tend to exclude the functional and abstract syntactic features common in author writeprints.

Additionally, we can apply the PS approach to investigate whether character writeprints are sensitive to major plot changes. This nuanced question would be particularly worthwhile to examine in Clarissa, as there is an abrupt letter marking the novel's astonishing climax when relationships and alliances shift, particularly among the four central characters. Do the character writeprints reflect these changing alliances, e.g. character writeprint similarities aligning with current character alliances?

With respect to other writers of epistolary novels, such as Jane Austen, Aphra Behn, Fanny Burney, James Howell, Frances Brooke, and Mary Shelley, how skilled are these writers at creating character writeprints? Using the PS approach, we can determine how distinct these writeprints are, whether there is an appropriate number of them, whether there are main and derived writeprints, which signature features are manipulated, and the nature of these signature features. We can also compare the signature features used by other epistolary authors to those we discovered for Richardson in Clarissa. If the same features or same types of features are commonly manipulated, this suggests that those are core features (or feature types) for character writeprints within epistolary literature.

The PS approach also allows us to explore the impact of author identity in a unique way, based on the types of features that distinguish character writeprints and author writeprints. In epistolary novels, authors must subsume their own identity in order to generate a distinct set of character identities. Yet, our analysis of Richardson's Clarissa suggests that the features manipulated to create character writeprints differ from the features typically comprising author writeprints.
6 Conclusion

We have demonstrated how to apply current machine learning techniques to answer questions in literary scholarship related to authorship. This case study focuses on issues surrounding character authorship in the landmark epistolary novel Clarissa, yielding both quantitative and qualitative results about the ability of the innovative Samuel Richardson to develop distinct character styles within a novel. The particular machine learning approach we use can serve as a reliable tool for investigating other literary questions surrounding both the style of individual characters and the style of the author who creates those characters.

Notes

1. This dataset is available at http://www.socsci.uci.edu/~lpearl/CoLaLab/corpora/Richardsons_Clarissa.zip, with letters organized by character.

2. We note that we present results from an "n-way" classification task (where one of n characters is selected for each data point), but this task could also be set up as a simpler binary classification task, where the goal is to label a letter as simply by the "same" author or by a "different" author than the reference set. Interestingly, we achieve better results with the harder n-way classification than with the easier binary (2-way) classification, suggesting that it is useful to know which specific other character a letter was written by, rather than simply knowing that it was written by a character different from the author of the reference set letters. This may be a specific instance of the more general situation where a seemingly harder problem is actually easier than a seemingly simpler problem because of the subtle information available in the data for the harder problem (e.g., joint inference in cognitive development: Dillon et al., 2013; Doyle and Levy, 2013; Feldman et al., 2013; Börschinger and Johnson, 2014). In addition, the n-way classification task allows us to see which specific characters a given character is confused with, which is an important aspect of our character writeprint analysis; an illustrative sketch follows.
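A minimal sketch of the n-way setup, again our own illustration rather than the authors' code: SMLR is approximated by one-vs-rest L1-regularized logistic regression, extract_features is the toy stand-in defined earlier, and letters/labels are hypothetical inputs (one text and one character name per letter).

# Hedged sketch of the n-way character classification described in note 2.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict

def character_confusions(letters, labels):
    # Predict a character for every letter via cross-validation, then
    # tabulate which characters are confused with which.
    X = np.array([extract_features(l) for l in letters])
    y = np.array(labels)
    clf = LogisticRegression(penalty="l1", solver="liblinear")
    predicted = cross_val_predict(clf, X, y, cv=5)
    characters = sorted(set(labels))
    return characters, confusion_matrix(y, predicted, labels=characters)

# Cell (i, j) counts letters truly by characters[i] that were labeled as
# characters[j]; off-diagonal mass reveals overlapping character writeprints.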
3. We used the Stanford Part-of-Speech Tagger (available at http://nlp.stanford.edu/software/tagger.shtml) to identify syntactic categories and the Stanford Parser (available at http://nlp.stanford.edu/software/lex-parser.shtml) to identify syntactic structures. The list of syntactic category tags from the Stanford Part-of-Speech Tagger used is as follows, with an example of each tag in parentheses: CC = coordinating conjunction (and), CD = cardinal number (one penguin), DT = determiner (the), EOS = end of sentence marker (there's a penguin here!), EX = existential there (there's a penguin here), FW = foreign word (hola), IN = preposition or subordinating conjunction (after), JJ = adjective (good), JJR = comparative adjective (better), JJS = superlative adjective (best), LS = list item marker (one, two, three, ...), MD = modal (could), NN = singular or mass noun (penguin, ice), NNS = plural noun (penguins), NNP = proper noun (Jack), NNPS = plural proper noun (There are two Jacks?), PDT = predeterminer (all the penguins), POS = possessive ending (penguin's), PRP = personal pronoun (me), PRP$ = possessive pronoun (my), RB = adverb (easily), RBR = comparative adverb (later), RBS = superlative adverb (most easily), RP = particle (look it up), SYM = symbol (this = that), TO = to (I want to go), UH = interjection (oh), VB = base form of verb (we should go), VBD = past tense verb (we went), VBG = gerund or present participle (we are going), VBN = past participle (we should have gone), VBP = non-3rd person singular present tense verb (you go), VBZ = 3rd person singular present tense verb (he goes), WDT = wh-determiner (which one), WP = wh-pronoun (who), WP$ = possessive wh-pronoun (whose), WRB = wh-adverb (how).

The list of phrase-structure tags the Stanford Parser used is as follows, with an example of each tag in parentheses: S = declarative sentence (I like penguins), SINV = sentence with subject-auxiliary inversion (Never have I seen such penguins!), SBAR = embedded clause (I like penguins that are cute.), INTJ = interjection (um), FRAG = fragment (See penguins in the), RRC = reduced relative clause (penguins not presently swimming), SBARQ = wh-question (What did you see?), SQ = yes/no question (Did you see that?), ADJP = adjective phrase (outrageously cute), ADVP = adverb phrase (rather sweetly), CONJP = multi-word conjunction (...as well as...), LST = list marker (one, two, three), NAC = not a constituent (in the back of my mind it), NP = noun phrase (those penguins), NX = sub-noun phrase (other people and), PP = preposition phrase (with the penguins), PRN = parenthetical (Those penguins (I really like them)), PRT = particle (look it up), QP = quantifier phrase (a little bit more), UCP = unlike coordinate phrase (from that, but that's why), VP = verb phrase (We like penguins), WHADJP = wh-adjective phrase (How hot is it?), WHADVP = wh-adverb phrase (How are you?), WHNP = wh-noun phrase (Who are you?), WHPP = wh-preposition phrase (With whom did you see it?), X = unknown phrase.

4. This inference is also supported by noting that four of the six signature features of John Belford's writeprint that are shared by other characters are shared only by Lovelace's writeprint (fragment frequency, title frequency, word length, and wh-adverb phrase frequency). See the discussion in the next section for how signature features were derived for each character.

5. We note that while average and median population values generally align, sometimes they do not. This is because average values do not factor out the effect of outliers that shift the average significantly up or down, while median values do. For example, suppose a set of 100 values has ten very low values around 0, while the remaining 90 values have an average of 50. The average of this set is about 45, while the median is likely to be around 50. A value of 46 is then about 2% higher than the average (46/45 − 1 ≈ 0.02) but 8% lower than the median (46/50 − 1 = −0.08). The sketch below verifies this with a toy population.
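A quick numerical check of the example in note 5, using a toy population of our own choosing:

# Toy verification of note 5: ten values near 0 pull the mean down,
# while the median stays near the bulk of the data.
import numpy as np

values = np.concatenate([np.zeros(10), np.full(90, 50.0)])
print(values.mean())                          # 45.0
print(np.median(values))                      # 50.0
print(round(46 / values.mean() - 1, 3))       # 0.022, i.e. ~2% above the mean
print(round(46 / np.median(values) - 1, 3))   # -0.08, i.e. 8% below the median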
References

Abbasi, A. and Chen, H. (2008). Writeprints: A stylometric approach to identity-level identification and similarity detection in cyberspace. ACM Transactions on Information Systems (TOIS), 26(2): 7.

Adair, D. (1944). The authorship of the disputed Federalist Papers: Part II. The William and Mary Quarterly: Magazine of Early American History, Institutions, and Culture: 235–264.

Argamon, S. (2008). Interpreting Burrows's Delta: Geometric and probabilistic foundations. Literary and Linguistic Computing, 23(2): 131–147.

Aristotle (350 BCE). Poetics. URL http://classics.mit.edu/Aristotle/poetics.1.1.html.

Binongo, J. N. G. (2003). Who wrote the 15th book of Oz? An application of multivariate analysis to authorship attribution. Chance, 16(2): 9–17.

Börschinger, B. and Johnson, M. (2014). Exploring the role of stress in Bayesian word segmentation using Adaptor Grammars. Association for Computational Linguistics.

Bosch, R. A. and Smith, J. A. (1998). Separating hyperplanes and the authorship of the disputed Federalist Papers. American Mathematical Monthly: 601–608.

Brennan, M. R. and Greenstadt, R. (2009). Practical attacks against authorship recognition techniques. In IAAI.

Burrows, J. (2002). Delta: A measure of stylistic difference and a guide to likely authorship. Literary and Linguistic Computing, 17(3): 267–287.

Burrows, J. (2003). Questions of authorship: Attribution and beyond. A lecture delivered on the occasion of the Roberto Busa Award, ACH-ALLC 2001, New York. Computers and the Humanities, 37(1): 5–32.

Burrows, J. (2005). Who wrote Shamela? Verifying the authorship of a parodic text. Literary and Linguistic Computing, 20(4): 437–450.

Burrows, J. (2007). All the way through: Testing for authorship in different frequency strata. Literary and Linguistic Computing, 22(1): 27–47.

Burrows, J. F. (1987). Word-patterns and story-shapes: The statistical analysis of narrative style. Literary and Linguistic Computing, 2(2): 61–70.

Collins, J., Kaufer, D., Vlachos, P., Butler, B., and Ishizaki, S. (2004). Detecting collaborations in text: Comparing the authors' rhetorical language choices in the Federalist Papers. Computers and the Humanities, 38(1): 15–36.

Dillon, B., Dunbar, E., and Idsardi, W. (2013). A single-stage approach to learning phonological categories: Insights from Inuktitut. Cognitive Science, 37: 344–377.

Doyle, G. and Levy, R. (2013). Combining multiple information types in Bayesian word segmentation. In HLT-NAACL. Citeseer, pp. 117–126.

Eder, M. (2010). Does size matter? Authorship attribution, small samples, big problem. Proceedings of Digital Humanities: 132–135.

Eder, M. (2013). Mind your corpus: Systematic errors in authorship attribution. Literary and Linguistic Computing: fqt039.

Eder, M. and Rybicki, J. (2011). Stylometry with R. In Digital Humanities 2011: Conference Abstracts. Citeseer, pp. 308–311.

Feldman, N., Griffiths, T., Goldwater, S., and Morgan, J. (2013). A role for the developing lexicon in phonetic category acquisition. Psychological Review, 120(4): 751–778.
Fung, G. (2003). The disputed Federalist Papers: SVM feature selection via concave minimization. In Proceedings of the 2003 Conference on Diversity in Computing. ACM, pp. 42–46.

García, A. M. and Martin, J. C. (2007). Function words in authorship attribution studies. Literary and Linguistic Computing, 22(1): 49–66.

Grieve, J. (2007). Quantitative authorship attribution: An evaluation of techniques. Literary and Linguistic Computing, 22(3): 251–270.

Holmes, D. I. and Forsyth, R. S. (1995). The Federalist revisited: New directions in authorship attribution. Literary and Linguistic Computing, 10(2): 111–127.

Holmes, D. I., Robertson, M., and Paez, R. (2001). Stephen Crane and the New-York Tribune: A case study in traditional and non-traditional authorship attribution. Computers and the Humanities, 35(3): 315–331.

Hoover, D. L. (2004). Testing Burrows's Delta. Literary and Linguistic Computing, 19(4): 453–475.

Iqbal, F., Binsalleeh, H., Fung, B. C., and Debbabi, M. (2010). Mining writeprints from anonymous e-mails for forensic investigation. Digital Investigation, 7(1): 56–64.

Iqbal, F., Hadjidj, R., Fung, B. C., and Debbabi, M. (2008). A novel approach of mining write-prints for authorship attribution in e-mail forensics. Digital Investigation, 5: S42–S51.

Jockers, M. L. (2013). Testing authorship in the personal writings of Joseph Smith using NSC classification. Literary and Linguistic Computing, 28(3): 371–381.

Juola, P. and Baayen, R. H. (2005). A controlled-corpus experiment in authorship identification by cross-entropy. Literary and Linguistic Computing, 20(Suppl): 59–67.

Kestemont, M., Moens, S., and Deploige, J. (2015). Collaborative authorship in the twelfth century: A stylometric study of Hildegard of Bingen and Guibert of Gembloux. Digital Scholarship in the Humanities, 30(2): 199–224.

Krishnapuram, B., Figueiredo, M., Carin, L., and Hartemink, A. (2005). Sparse multinomial logistic regression: Fast algorithms and generalization bounds. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27: 957–968.

Li, J., Zheng, R., and Chen, H. (2006). From fingerprint to writeprint. Communications of the ACM, 49(4): 76–82.

Lučić, A. and Blake, C. L. (2015). A syntactic characterization of authorship style surrounding proper names. Digital Scholarship in the Humanities, 30(1): 53–70.

Luyckx, K. and Daelemans, W. (2011). The effect of author set size and data size in authorship attribution. Literary and Linguistic Computing, 26(1): 35–55.

Milic, L. T. (1966). The next step. Computers and the Humanities, 1(1): 3–6.

Mosteller, F. and Wallace, D. L. (1963). Inference in an authorship problem: A comparative study of discrimination methods applied to the authorship of the disputed Federalist Papers. Journal of the American Statistical Association, 58(302): 275–309.

Mosteller, F. and Wallace, D. L. (1964). Applied Bayesian and Classical Inference: The Case of the Federalist Papers. Springer Science & Business Media.

Oakes, M. P. (2004). Ant colony optimisation for stylometry: The Federalist Papers. In Proceedings of the 5th International Conference on Recent Advances in Soft Computing. pp. 86–91.

Pearl, L. and Steyvers, M. (2012). Detecting authorship deception: A supervised machine learning approach using author writeprints. Literary and Linguistic Computing, 27(2): 183–196.

Price, L. (2000). The Anthology and the Rise of the Novel: From Richardson to George Eliot. Cambridge University Press.

Rokeach, M., Homant, R., and Penner, L. (1970). A value analysis of the disputed Federalist Papers. Journal of Personality and Social Psychology, 16(2): 245.
Rudman, J. (2005). The non-traditional case for the authorship of the twelve disputed Federalist Papers: A monument built on sand. Proceedings of ACH/ALLC 2005.

Savoy, J. (2013). Authorship attribution based on a probabilistic topic model. Information Processing & Management, 49(1): 341–354.

Savoy, J. (2015). Estimating the probability of an authorship attribution. Journal of the Association for Information Science and Technology.

Smith, J. B. (1989). Computer criticism. Literary Computing and Literary Criticism: Theoretical and Practical Essays on Theme and Rhetoric: 13–44.

Stamatatos, E. (2009). A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology, 60(3): 538–556.

Stamatatos, E., Fakotakis, N., and Kokkinakis, G. (2001). Computer-based authorship attribution without lexical measures. Computers and the Humanities, 35(2): 193–214.

Tweedie, F. J., Singh, S., and Holmes, D. I. (1996). Neural network applications in stylometry: The Federalist Papers. Computers and the Humanities, 30(1): 1–10.

van Dalen-Oskam, K. (2014). Epistolary voices: The case of Elisabeth Wolff and Agatha Deken. Literary and Linguistic Computing: fqu023.

Vermeule, B. (2011). Why Do We Care about Literary Characters? JHU Press.

Zhao, Y. and Zobel, J. (2005). Effective and scalable authorship attribution using function words. In Information Retrieval Technology. Springer, pp. 174–189.

Zhao, Y., Zobel, J., and Vines, P. (2006). Using relative entropy for authorship attribution. In Information Retrieval Technology. Springer, pp. 92–105.

Zunshine, L. (2010). Introduction to Cognitive Cultural Studies. JHU Press.

A Potential Features in Character Writeprints

Table 7: Potential character-level features available to the SMLR classifier for character writeprints. The feature name, description, number of individual features of this kind, and implementation are provided. One or more of the following is provided in parentheses for each feature: an example of that feature or a description of that feature.

Feature | Description | # | Implementation
alphabetic characters; digits; punctuation | all letters; all digits 0-9; all punctuation marks | 3 | # char tokens / total # char tokens
word length | average length of words | 1 | # char tokens / # word tokens
punctuation | apostrophes, colons, commas, double quotation marks, ellipses, em dashes, en dashes, exclamation marks, forward slashes, interrobangs, multiple punctuation (!!), parentheses, periods, question marks, semicolons, single quotation marks, square brackets | 17 | # punct tokens / total # punct tokens
total characters | total # of characters | 1 | # character tokens

Table 8: Potential word-level features available to the SMLR classifier for character writeprints. The feature name, description, number of individual features of this kind, and implementation are provided. One or more of the following is provided in parentheses for each feature: an example of that feature and/or part-of-speech tags used to identify that feature.

Feature | Description | # | Implementation
contractions | won't, can't, etc. | 1 | # contracted words / # word tokens
foreign words | foreign words (FW) | 1 | # foreign words / # word tokens
hyphenated words | ever-to-be-revered, etc. | 1 | # hyphenated word tokens / # word tokens
lexical diversity | # word types / # word tokens | 1 | # word types / # word tokens
total words | total # of words | 1 | total # word tokens
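To illustrate how features of this kind reduce to simple ratios, here is a hedged sketch (our own illustration; the tokenization choices and fixed punctuation set are simplifying assumptions, not the authors' preprocessing) computing a few of the character- and word-level features in Tables 7 and 8:

# Hedged sketch computing a few Table 7/8 features; splitting on whitespace
# and using a fixed punctuation set are simplifications.
import re

PUNCT = set(".,;:!?()[]-/") | {"'", '"'}

def character_word_features(text):
    chars = [c for c in text if not c.isspace()]
    words = re.findall(r"[A-Za-z'-]+", text)
    n_chars = max(len(chars), 1)
    n_words = max(len(words), 1)
    return {
        "alphabetic_ratio": sum(c.isalpha() for c in chars) / n_chars,
        "digit_ratio": sum(c.isdigit() for c in chars) / n_chars,
        "punct_ratio": sum(c in PUNCT for c in chars) / n_chars,
        "word_length": sum(len(w) for w in words) / n_words,
        "contractions": sum("'" in w for w in words) / n_words,
        "lexical_diversity": len({w.lower() for w in words}) / n_words,
    }

print(character_word_features("I HAVE both your letters at once."))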
Table 9: Potential content syntactic category features available to the SMLR classifier for character writeprints. The feature name, description, number of individual features of this kind, and implementation are provided. One or more of the following is provided in parentheses for each feature: an example of that feature and/or part-of-speech tags used to identify that feature.

Feature | Description | # | Implementation
adjectives | all adjectives (good), comparative adjectives (JJR), superlative adjectives (JJS) | 3 | # adj tokens / # word tokens
adverbs | all adverbs (really), basic adverbs (RB), comparative adverbs (RBR), superlative adverbs (RBS) | 4 | # adv tokens / # word tokens
cardinal numbers | one, two, three, etc. | 1 | # cardinal tokens / # word tokens
interjections | all interjections (UH) | 1 | # interjection tokens / # word tokens
nouns | all nouns, plural nouns (NNS), plural proper nouns (NNPS), singular or mass nouns (NN), singular proper nouns (NNP) | 5 | # noun tokens / # word tokens
ordinal numbers | first, second, third, etc. | 1 | # ordinal tokens / # word tokens
possessives | Clarissa's (POS) | 1 | # poss tokens / # word tokens
pronouns | 1st person pronouns (I), 2nd person pronouns (you), 3rd person pronouns (he), demonstrative pronouns (this), personal pronouns (PRP), possessive pronouns (PRP$), relative pronouns (which) | 7 | # pronoun tokens / # word tokens
verbs | all verbs, gerund or present participle (VBG), non-finite verbs (VB), past participle (VBN), past tense verbs (VBD) | 5 | # verb tokens / # word tokens

Table 10: Potential functional syntactic category features available to the SMLR classifier for character writeprints. The feature name, description, number of individual features of this kind, and implementation are provided. One or more of the following is provided in parentheses for each feature: an example of that feature, part-of-speech tags used to identify that feature, or a collection of tokens comprising that feature.

Feature | Description | # | Implementation
coordinating conjunctions | although, and, because, but, for, nor, or, since, so, though, unless, while, yet | 1 | # coordinating conj / # word tokens
determiners | additive (more), alternative (another, other, somebody else, different), articles (a, an, the), disjunctive (either, neither), distributive (each, every), elective (any, either, whichever), equative (same), evaluative (such, so, that), exclamative (what cheek), existential (some, any), interrogative & relative (which, what, whichever, whatever), maximal & minimal (most, least), negative (no, neither), paucal (a few, a little, some), personal (we friends, you scoundrels), quantifiers (all, few, many, several, some, each, every, any, no, a lot of, much), subtractive (less, fewer), sufficiency (enough, sufficient, plenty), universal (all, both, every) | 19 | # det tokens / # word tokens
function words | all function words: articles (a, an, the), copula be, determiners (DT), expletives (EX), infinitival to, prepositions, personal pronouns (PRP), possessive pronouns (PRP$), possessives (POS), relative pronouns (which) | 1 | # function tokens / # word tokens
infinitival to | to go | 1 | # infin to tokens / # word tokens
prepositions | with, etc. | 1 | # prep tokens / # word tokens
sentence connectors | occurring at the beginning or end of sentences: also, anyway, as, besides, finally, first, furthermore, hence, however, in addition, last but not least, lastly, moreover, nevertheless, on the other hand, otherwise, second, so, still, then, thus, too, well, yet | 1 | # sent connectors / # word tokens
Table 11: Potential syntactic structure features available to the SMLR classifier for character writeprints. The feature name, description, number of individual features of this kind, and implementation are provided. One or more of the following is provided in parentheses for each feature: an example of that feature, part-of-speech tags used to identify that feature, or phrase-structure tags used to identify that feature.

Feature | Description | # | Implementation
avg phrase length | adjective phrases (ADJP), adverb phrases (ADVP), conjunction phrases (CONJP), noun phrases (NP), parenthetical phrases ((...)), preposition phrases (PP), quantifier phrases (QP), verb phrases (VP) | 8 | # word tokens in phrase type / total # phrase type
avg sentence length | sentences, exclamations, questions | 3 | # word tokens in sent type / total # sent type
clauses | She laughed, and then she cried | 1 | # clause tokens / # sentences
embedded clauses | the ones that I like | 1 | # emb cl tokens / # clauses
exclamations | What a surprise! | 1 | # excl tokens / total # sentences
fragments | fragments (FRAG) | 1 | # fragment tokens / # sentences
imperatives | Do this now! | 1 | # imperative tokens / # sentences
main clauses | I think she's right. | 1 | # main cl tokens / # clauses
passives | was kissed, etc. | 1 | # passive tokens / # sentences
phrases (rel freq) | adjective phrases (ADJP), adverb phrases (ADVP), conjunction phrases (CONJP), noun phrases (NP), preposition phrases (PP), quantifier phrases (QP), verb phrases (VP) | 7 | # phrase type tokens / # phrases
phrases | all phrases (ADJP, ADVP, CONJP, NP, PP, QP, VP, WHADJP, WHADVP, WHNP, WHPP) | 1 | # phrase tokens
questions | Was that a surprise? | 1 | # question tokens / # sentences
sentences | total sentences | 1 | # sentence tokens
wh-phrases | wh-adjective phrases (how hot, WHADJP), wh-adverb phrases (how, WHADVP), wh-noun phrases (what, WHNP), wh-preposition phrases (to what, WHPP) | 4 | # wh-phrase tokens / # phrases

Table 12: Potential formatting features available to the SMLR classifier for character writeprints. The feature name, description, number of individual features of this kind, and implementation are provided. An example or description of each feature is provided.

Feature | Description | # | Implementation
all capitals | LIKE, etc. | 1 | # caps word tokens / # word tokens
italicized phrases | average length of italicized phrases | 1 | # ital word tokens / # italicized phrases
parenthetical words | This seems (decidedly) interesting, etc. | 1 | # paren word tokens / # word tokens
italicized words | like, etc. | 1 | # ital word tokens / # word tokens
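As a small illustration of how the sentence-level counts in Table 11 could be approximated on raw text, the sketch below uses a rough heuristic of our own; the paper itself derives these counts from Stanford Parser trees, not from punctuation.

# Rough heuristic for a few Table 11 features; the regex sentence split is
# only illustrative, not the parser-based method the paper uses.
import re

def sentence_type_rates(text):
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    n = max(len(sentences), 1)
    return {
        "exclamations": sum(s.endswith("!") for s in sentences) / n,
        "questions": sum(s.endswith("?") for s in sentences) / n,
        "avg_sentence_length": sum(len(s.split()) for s in sentences) / n,
    }

print(sentence_type_rates("What a surprise! Was that a surprise? It was."))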
Table 13: Potential domain-specific semantic features available to the SMLR classifier for character writeprints. The feature name, description, number of individual features of this kind, and implementation are provided. One or more of the following is provided in parentheses for each feature: an example of that feature or a collection of tokens comprising that feature.

Feature | Description | # | Implementation
endearments | affectionate(ly), beloved, (my) bird, charmer, darling, (my) dear, dearest, dearly, (ever-)affectionate, (my) good, goodness, honey, lamb, (my) love, (my) pet, sincere, sincerity, (my) sweet | 1 | # endearment tokens / # word tokens
epistolary | correspondence, letter(s), paper(s), parcel(s), pen(s), receipt(s), resume, write, written, writing, wrote | 1 | # epistolary tokens / # word tokens
familial | aunt, brother, child, daughter, family, father, grandfather, mamma, maternal, mother, papa, paternal, sister, son, uncle | 1 | # familial tokens / # word tokens
propriety | compliment, congratulate, excuse me, grateful, manners, obliged, politeness, propriety, thank you | 1 | # propriety tokens / # word tokens
"remarkable" adverbs | best, better, quite, so, soon, sooner, soonest, too, well | 1 | # remarkable tokens / # word tokens
titles | Dr, Esq, Jr, Lord, Madam, Miss, Ms, Mr, Mrs, Reverend, Sir, Sr | 1 | # title tokens / # word tokens
writing | authorship, compose, composition, correspondence, drop a line, indite, ink, letter(s), paper(s), parcel, pen(s), penning, receipt, resume, spell, words, write, (piece of) writing, written material | 1 | # writing tokens / # word tokens

B Detailed Character Precision and Recall Scores

Below we show the details of the confusion matrices as well as the precision and recall scores used to calculate the F-scores reported in the main text for each character. Precision and recall scores range between 0.0 and 1.0.

Table 14: Specific confusion matrix values and accompanying precision and recall scores for Clarissa Harlowe's letters.

True \ Labeled | Clarissa | non-Clarissa | Recall
Clarissa | 136 | 108 | 0.557
non-Clarissa | 128 | 416 |
Precision | 0.515 | |

Table 15: Specific confusion matrix values and accompanying precision and recall scores for Lovelace's letters.

True \ Labeled | Lovelace | non-Lovelace | Recall
Lovelace | 109 | 89 | 0.551
non-Lovelace | 84 | 506 |
Precision | 0.565 | |

Table 16: Specific confusion matrix values and accompanying precision and recall scores for John Belford's letters.

True \ Labeled | Belford | non-Belford | Recall
Belford | 42 | 62 | 0.404
non-Belford | 66 | 618 |
Precision | 0.389 | |

Table 17: Specific confusion matrix values and accompanying precision and recall scores for Anna Howe's letters.

True \ Labeled | Anna | non-Anna | Recall
Anna | 18 | 60 | 0.231
non-Anna | 70 | 640 |
Precision | 0.205 | |
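As a sanity check on how the scores above relate to the confusion matrix cells, the short sketch below recomputes precision and recall from Table 14's values, along with the standard F1 harmonic mean (we assume the standard F-score formula; the paper's reported F-scores appear in the main text):

# Recomputing Table 14 (Clarissa) from its confusion matrix cells.
tp, fn = 136, 108   # true Clarissa letters labeled Clarissa / non-Clarissa
fp, tn = 128, 416   # non-Clarissa letters labeled Clarissa / non-Clarissa

precision = tp / (tp + fp)                           # 0.515
recall = tp / (tp + fn)                              # 0.557
f1 = 2 * precision * recall / (precision + recall)   # ~0.535 (standard F1)

print(round(precision, 3), round(recall, 3), round(f1, 3))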