Comparative evaluation of term selection functions for authorship attribution

Jacques Savoy
Computer Science Department, University of Neuchâtel, Neuchâtel, Switzerland

Correspondence: Jacques Savoy, Computer Science Department, University of Neuchâtel, Neuchâtel, Switzerland. Email: Jacques.Savoy@unine.ch

Abstract

Different computational models have been proposed to automatically determine the most probable author of a disputed text (authorship attribution). These models can be viewed as special approaches in the text categorization domain. In this perspective, in a first step we need to determine the most effective features (words, punctuation symbols, parts of speech, bigrams of words, etc.) to discriminate between different authors. To achieve this, we can consider different independent feature-scoring selection functions (information gain, gain ratio, pointwise mutual information, odds ratio, chi-square, bi-normal separation, GSS, Darmstadt Indexing Approach (DIA), and correlation coefficient). Other term selection strategies have also been suggested in specific authorship attribution studies. To compare these two families of selection procedures, we have extracted articles from two newspapers, belonging to two categories (sports and politics). To enlarge the basis of our evaluations, we have chosen one newspaper written in the English language ('Glasgow Herald') and a second one in Italian ('La Stampa'). The resulting collections contain from 987 to 2,036 articles written by four to ten columnists. Using the Kullback-Leibler divergence, the chi-square measure, and the Delta rule as attribution schemes, this study found that some simple selection strategies (based on occurrence frequency or document frequency) may produce similar, and sometimes better, results compared with more complex ones.

1 Introduction

In automatic authorship attribution, computer systems can be designed and implemented to determine the most probable author behind a disputed document or a text excerpt (Mosteller and Wallace, 1964; Juola, 2006; Stamatatos, 2009). To achieve this, a set of texts written by each of the possible writers must be made available to the classifier. In this study, we focus on the closed-class attribution problem, in which the real author is one of several possible candidates. Other pertinent concerns related to this issue include the mining of demographic or psychological information on an author (profiling) (Pennebaker, 2011), or simply determining whether or not a given author did write a given Internet message or document (verification) (Koppel et al., 2007). Instead of being limited to text, we can also consider other media (e.g. music, song, picture, drawing).
To solve this categorization task, we need to extract and select features that are useful in identifying the differences between the authors' writing styles. In this study, we will consider words and punctuation symbols as possible features or terms. In a second step, we determine the discrimination power of each term and then apply a selection procedure to derive a reduced set of terms that can effectively discriminate between the different possible authors. Finally, through applying classification rules or schemes, the system can determine the most probable author of a text excerpt.

The rest of this article is organized as follows. Section 2 presents related work, and Section 3 outlines the main characteristics of the corpora used in our experiments. Section 4 briefly describes the term selection functions applied in our experiments. Section 5 presents the selected attribution methods, and Section 6 evaluates them according to various term selection strategies. Section 7 exposes some practical considerations, and Section 8 draws the main conclusions of this study.

2 Related Work

As far as automatic authorship attribution approaches are concerned, the early solutions were based on a unitary stylometric value that must remain constant for a given author but should vary from one writer to another (Holmes, 1998). As measures, previous studies have suggested vocabulary richness measures, average word length, mean sentence length, and Yule's K measure or other statistics related to type-token ratios (Baayen, 2008). None of these measures has proven satisfactory in all cases (Hoover, 2003), due in part to word distributions (including word bigrams or trigrams) being governed by a large number of low-probability elements (Large Number of Rare Events, LNRE) (Baayen, 2001). Moreover, these measures are based on a single measurement, and therefore no term selection procedure was needed.

To account for the vocabulary used, Mosteller and Wallace (1964) propose a semiautomatic selection procedure to determine the most useful terms, and particularly the most frequent ones, composed mainly of various function words (determiners, prepositions, conjunctions, pronouns, and some adverbs and verbal forms). In this case, the assumption is that the occurrence frequency of some word types is not fully controlled by the author and varies from one person to the other. For example, Mosteller and Wallace (1964) notice that the term 'while' was used 36 times by Hamilton but never by Madison, the second possible writer. In their final study, these researchers worked with a list reduced to 35 word types.

Following this perspective, Burrows (2002) suggests automatically selecting words that can discriminate between authors. The selection criterion is simply the occurrence frequency, and Burrows (2002) proposes to consider the first 40-150 most frequent word types. In such a sample, we usually find a large proportion of function words. This threshold was first increased to 800, and then to 4,000 (Hoover, 2007). As a variant, Jockers and Witten (2010) derive 2,907 terms (single words and bigrams of words) appearing at least once in texts written by all three possible authors of the 85 'Federalist Papers'. From this list, the researchers extract a reduced set composed of 298 terms, after imposing the condition that the relative frequency of each item must be greater than 0.05%.
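As an illustration of this family of frequency-based selection procedures, the following minimal Python sketch selects the k most frequent word types of a corpus, in the spirit of Burrows (2002); the code, the naive whitespace tokenizer, and all names are illustrative assumptions, not the implementation of any of the cited studies.

```python
from collections import Counter

def most_frequent_types(texts, k=150):
    """Return the k most frequent word types in a corpus; the
    tokenization is deliberately naive (lowercased whitespace split)."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return [term for term, _ in counts.most_common(k)]

# Toy example: function words dominate the top of the ranking.
corpus = ["the cat sat on the mat", "the dog chased the cat"]
print(most_frequent_types(corpus, k=3))  # ['the', 'cat', ...]
```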
Various studies have followed this vein, suggesting the use of frequent word types containing many function words (Damerau, 1975; Holmes and Forsyth, 1995; Baayen and Halteren, 2002; Miranda García and Calle Martín, 2007). In a similar way, Grieve (2007) considers selecting all word types in a k-limit profile, where k indicates that each selected term must occur in at least k articles written by every author (e.g. a value k = 5 imposes the presence of the target word in at least five articles written by every possible author). Thus, this scheme imposes that all selected terms must be used by all authors and not only by a fraction of them.

Instead of selecting the features based on the available corpus, Zhao and Zobel (2007) propose to define a priori the most useful word types. Their suggested list contains 363 English word types, composed mainly of function words but with some lexical terms (independent of the thematic content of the underlying texts).

Finally, other studies suggest applying approaches used in automatic text categorization (or text classification), defined as the task of automatically assigning one (or more) predefined label(s) to each input text (Manning and Schütze, 1999; Sebastiani, 2002; Stamatatos, 2009). To build such a system, we need to apply a feature selection procedure to reduce the number of terms needed to discriminate between the different categories. Having fewer terms, the underlying computation can be simplified, and the lexical space that must be explored is also reduced (Liu and Motoda, 2008). In a comparative study, Yang and Pedersen (1997) evaluated six selection measures for topical text classification, using two corpora and two classifiers (k-nearest neighbors and linear least squares fit). Their experiments indicate that the information gain (also called expected mutual information) or the chi-square statistic tends to achieve the best results. For Sebastiani (2002), the odds ratio (OR) and the chi-square are usually the selection functions displaying the best performance.

Topical text categorization and authorship attribution do not, however, strictly follow the same implementation. In the former, we usually remove the most frequent words (stopword list) having no precise and useful meaning. In authorship attribution, these terms are viewed as important style markers because they are used in a less conscious way than other words (Pennebaker, 2011). Thus, their use and occurrence frequency may differ from one author to the other. Moreover, topical text categorization must deal with sparse data because many terms appear only in a few documents. Therefore, numerous synonyms must be taken into account to achieve a high effectiveness. In authorship attribution, we tend to ground the classification decision on frequent terms, thus reducing the problems related to synonymy and data sparseness.

The main objective of this article is to determine whether, from the set of all possible terms and punctuation symbols, we can automatically select a smaller pertinent set of terms that can discriminate among authors. This objective can be achieved by automatically ignoring noisy terms. Such terms are irrelevant in discriminating between the possible authors because their occurrence distributions are similar across authors. These noisy terms must be ignored, and their removal might improve the classification accuracy.
Moreover, working with a reduced set of terms will speed up the underlying computation and decrease the risk of overfitting the classifier to the available data (Hastie et al., 2009).

3 Evaluation Corpora

To obtain a replicable test collection with authors sharing a common culture and having similar language registers, we opted for a stable and publicly available corpus by extracting a subset of the CLEF-2003 test suite. The first two collections are written in the English language and correspond to articles appearing in 1995 in the 'Glasgow Herald' newspaper. From this news source, we have chosen articles covering two distinct topics, namely 'Sports' and 'Politics'. For each set, we have selected five journalists, having written 1,948 articles on Sports and 987 on Politics. The distribution over authors is depicted in Table 1.

Table 1 Distribution of 1,948 articles about Sports and 987 articles about Politics by author name in the 'Glasgow Herald'

   Name               Topics     Number
   1  Douglas Derek   Sports     411
   2  Gallacher Ken   Sports     409
   3  Gillon Doug     Sports     369
   4  Paul Ian        Sports     419
   5  Traynor James   Sports     340
   Total                         1,948
   1  Johnstone Anne  Politics   73
   2  Shields Tom     Politics   174
   3  Smith Graeme    Politics   330
   4  Trotter Stuart  Politics   337
   5  Wishart Ruth    Politics   73
   Total                         987

To complement these first two collections, we extracted two additional corpora based on articles appearing in 1994 in the 'La Stampa' newspaper (written in the Italian language). As with the first newspaper, we also selected articles covering the categories 'Sports' and 'Politics'. For the Sports subset, we chose four journalists having written 1,321 articles, whereas for the Politics corpus, we had ten columnists having written 2,036 articles. Table 2 shows the distribution over the authors.

Table 2 Distribution of 1,321 articles about Sports and 2,036 articles about Politics by author in 'La Stampa'

   Name                      Topics     Number
   1  Ansaldo Marco          Sports     288
   2  Beccantini Roberto     Sports     365
   3  Del Buono Oreste       Sports     435
   4  Ormezzano Gian Paolo   Sports     232
   Total                                1,321
   1  Battista Pierluigi     Politics   232
   2  Benedetto Enrico       Politics   253
   3  Galvano Fabio          Politics   348
   4  Gramellini Massimo     Politics   119
   5  Meli Maria Teresa      Politics   216
   6  Nirenstein Fiamma      Politics   53
   7  Novazio Emanuele       Politics   250
   8  Pantarelli Franco      Politics   203
   9  Passarini Paolo        Politics   304
   10 Spinelli Barbara       Politics   58
   Total                                2,036

These corpora are pertinent for authorship attribution because each collection is formed of texts having the same general topic and genre. Moreover, they originate from the same period. We know that the style may differ from one person to another, but the period (Juola, 2003; Hughes et al., 2012), the topic, the genre (Labbé, 2007), and the text intent also have an obvious impact on the style. Finally, their spelling was controlled and normalized (e.g. to denote the capital of the People's Republic of China, we can name either Beijing or Peking).

To speed up the computation and derive an effective feature set, we ignored all terms appearing less than ten times in a given corpus and terms appearing only in a single article. Moreover, we also removed terms used by only a single author. Such words may be good indicators of the real author, but they are also easily adopted by another person aiming to masquerade as that author. For example, when analyzing the 'Federalist Papers', this filter would remove the term 'while', used frequently by Hamilton but never by Madison, as well as the term 'whilst', used only by Madison.

With the 'Glasgow Herald', after applying these constraints, the remaining vocabulary contains 6,616 word types for the Sports corpus and 5,128 terms for the Politics domain. In the 'La Stampa' corpora, the Sports subset comprises 6,780 word types, and the Politics subset is composed of 10,644 terms. The question that then arises is the following: can we extract from these sets of terms subsets having better discrimination power, so as to enhance the classification performance?

4 Selection Functions

In the machine learning domain, we can find different independent feature-scoring functions to rank the features according to their discriminative power. To measure this capability for a term t_k according to a given category (or author) c_j, with j = 1, 2, ..., |C|, we usually use a contingency table for each pair (t_k, c_j), as depicted in Table 3.

Table 3 Example of a contingency table for a term t_k and a category c_j

                 Category c_j   Category ¬c_j
   Term t_k           a               b          a+b
   Other ¬t_k         c               d          c+d
                     a+c             b+d         n = a+b+c+d
In this table, the value a indicates the number of texts belonging to the category c_j in which the term t_k occurs. When considering all other classes (denoted by ¬c_j), the term t_k appears in b other texts. Thus, in the whole corpus, this term occurs in a+b texts, while we can count a+c texts labeled with the category c_j.

To measure the association between a term t_k and a category (or author) c_j, we can compute the 'pointwise mutual information' (PMI) given in Equation (1) (Church and Hanks, 1989).

$$\mathrm{PMI}(t_k, c_j) = \log_2\left[\frac{\mathrm{Prob}[t_k, c_j]}{\mathrm{Prob}[t_k] \cdot \mathrm{Prob}[c_j]}\right] = \log_2\left[\frac{a/n}{\frac{a+b}{n} \cdot \frac{a+c}{n}}\right] \qquad (1)$$

This function compares two models to estimate the probability of selecting the term t_k within the category c_j. The first model is based on a direct estimation of the joint probability (denoted Prob[t_k, c_j] = a / n). This estimation is the numerator of Equation (1). The second model (the denominator of Equation (1)) estimates this probability by considering independently the probability of the occurrence of the term t_k (Prob[t_k] = (a+b) / n) and the probability of selecting a text belonging to the category c_j (Prob[c_j] = (a+c) / n). This second model assumes that there is no relationship between the occurrence of the term t_k and the category c_j.

When this assumption is true (no real relationship between the category c_j and the term t_k), Prob[t_k, c_j] can be estimated by Prob[t_k] · Prob[c_j]. In such cases, the two probability estimates will be close, and the ratio in Equation (1) will return a value close to 1. Computing the logarithm of such a value, we will find a value close to 0, indicating independence between the term occurrence and the corresponding category.

On the other hand, when a strong association does exist between the term t_k and the category c_j, the value of a will be large. The direct estimation of Prob[t_k, c_j] will be larger than the product Prob[t_k] · Prob[c_j]. The ratio will then be larger than 1, and the logarithm function will return a positive value. With a negative association between the term t_k and the category c_j, the numerator will be smaller than the denominator, returning a ratio smaller than 1. Taking the logarithm, a negative value is returned, indicating that the term t_k is less frequently used in category c_j than in the rest of the corpus.

To illustrate this idea, we have taken the hypothetical numerical example given in Table 4.
Table 4 Contingency table for the term β and the author A_j

                 Author A_j   Other authors ¬A_j
   Term β            10              11            21
   Other ¬β         190           1,789         1,979
                    200           1,800         2,000

In this case, the term β appears in 21 texts in the corpus, among which we can find ten texts written by author A_j. The other authors have used the word β in eleven other texts. The last row in Table 4 indicates that the author A_j has written 200 texts, while we can count 1,800 texts written by all other authors. In other words, A_j has written 10% of all the texts belonging to the corpus.

A quick analysis reveals that the author A_j represents around 50% of the occurrences of the term β, but only 10% of all texts. When computing the PMI value (formulation given below), we can see that the joint estimation is larger than the denominator. The resulting ratio between the two models is larger than 1 (4.76 in our example), giving a final value of 2.25. The term β is more closely associated with A_j than pure chance would suggest. Because the presence of this term in a text is an indication that the document might have been written by A_j, we can select this term to help discriminate between A_j and all other possible writers.

$$\mathrm{PMI}(\beta, A_j) = \log_2\left[\frac{10/2000}{\frac{10+11}{2000} \cdot \frac{10+190}{2000}}\right] = \log_2\left[\frac{10 \cdot 2000}{21 \cdot 200}\right] = \log_2[4.76] = 2.25$$

On the other hand, if the value for a in Table 4 had been 2 instead of 10, keeping the same values for the last row and the last column, we would find no relationship between the term β and the author A_j. Observing two texts written by A_j with the term β corresponds closely to the independent model (pure chance). Because A_j wrote 10% of the whole corpus, it is not surprising to see close to 10% of the texts containing the term β written by this author. In this case, the PMI function will return the value -0.07, a value close to 0.
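The following minimal Python sketch reproduces this PMI computation from the contingency counts of Table 3; the function name and signature are illustrative assumptions, not code from the study.

```python
import math

def pmi(a, b, c, d):
    """Pointwise mutual information (Equation (1)) computed from the
    contingency counts a, b, c, d of Table 3."""
    n = a + b + c + d
    joint = a / n                                # direct estimate of Prob[tk, cj]
    independent = ((a + b) / n) * ((a + c) / n)  # independence model
    return math.log2(joint / independent)

# Worked example of Table 4: a strong positive association...
print(round(pmi(10, 11, 190, 1789), 2))   # 2.25
# ...and the variant with a = 2 (same margins): near independence.
print(round(pmi(2, 19, 198, 1781), 2))    # -0.07
```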
Using such a term selection function seems a pertinent choice. Such a procedure may reveal the terms useful in discriminating between the possible authors. Moreover, this approach may rank all terms according to their discriminative power, and we can limit the selection to the top m most discriminative terms.

As a second function, we can estimate the probability Prob[c_j | t_k], a measure denoted the Darmstadt Indexing Approach (DIA) (Fuhr et al., 1991). Based on Table 3, this probability is estimated by a / (a+b) (to simplify the presentation, all formulae are regrouped in the Appendix). The DIA function is based on different arguments than those justifying the PMI function.

As a third function, we can use the odds ratio (OR) (Manning and Schütze, 1999), which always returns a positive value. A positive association between the term t_k and the category c_j is indicated by a value larger than 1, whereas a value close to 0 signifies an opposition. A value close to 1 denotes independence between the term and the underlying class.

As a fourth selection function, we can use the information gain (IG), or expected mutual information. The value returned by this function is large if a positive association exists. A small positive value signifies the absence of discriminative power for the term t_k and the category c_j. Following the same interpretation, we can compute the chi-square statistic, χ²(t_k, c_j) (Manning and Schütze, 1999). As a sixth function, we can apply the gain ratio (GR), returning a positive value to signal either a positive or a negative association between the term t_k and the category c_j. Independence is indicated by a value close to 0.

As a seventh term selection function, derived from the χ², we can consider the correlation coefficient, CC(t_k, c_j) (Ng et al., 1997), which indicates a positive association by a positive value (whereas a negative value signifies an opposition). Independence between the term and the category is denoted by a value close to 0. Following the same interpretation, we can compute the GSS coefficient (Galavotti et al., 2000). Another interesting selection function is the bi-normal separation (BNS) suggested by Forman (2003). In the presence of independence, this function returns a small positive value. A larger value indicates either a positive or a negative association between the term t_k and the underlying category c_j.

In addition to these nine selection functions, we can simply consider the document frequency (df), indicating the number of texts indexed by the term t_k. The larger this value, the better the corresponding term. Moreover, we can also assume that the style of a given author may be revealed by the frequent use of certain forms. In this perspective, we can follow Burrows (2002) and use the absolute term frequency (tf). As for the df value, the higher the absolute term frequency, the more useful the term.

When applying one of the aforementioned selection functions, we can compute a local utility value, denoted f(t_k, c_j), for each term t_k and each category c_j. When faced with a binary classification problem (two authors), such a function suffices to define the overall selective value for each term. In authorship attribution in general, however, the number of authors (categories) is larger than two. In such cases, we need to aggregate the local utility values over the |C| categories. To define such a global utility measure for a term t_k (denoted U_op(t_k)), we can take the maximum over the |C| categories, or compute the sum or a weighted mean, as shown in Equation (2).

$$U_{\max}(t_k) = \max_j f(t_k, c_j), \qquad U_{\mathrm{sum}}(t_k) = \sum_{j=1}^{|C|} f(t_k, c_j), \qquad U_{\mathrm{wmean}}(t_k) = \sum_{j=1}^{|C|} \mathrm{Prob}[c_j] \cdot f(t_k, c_j) \qquad (2)$$

Finally, to select the m most adequate terms, we extract the m terms having the highest utility values U_op(t_k) according to one of the aggregation operators given above (max, sum, or weighted mean).
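A sketch of how Equation (2) might be implemented follows; the data layout (a mapping from each term to its per-category scores) and all names are illustrative assumptions.

```python
def global_utility(scores, priors, op="max"):
    """Aggregate the local utilities f(tk, cj) over the |C| categories
    (Equation (2)). `scores` lists one local value per category and
    `priors` the probabilities Prob[cj] used by the weighted mean."""
    if op == "max":
        return max(scores)
    if op == "sum":
        return sum(scores)
    if op == "wmean":
        return sum(p * s for p, s in zip(priors, scores))
    raise ValueError(f"unknown aggregation operator: {op}")

def select_top_terms(local_scores, priors, m, op="max"):
    """Keep the m terms with the highest global utility U_op(tk);
    `local_scores` maps each term to its per-category scores."""
    ranked = sorted(local_scores,
                    key=lambda t: global_utility(local_scores[t], priors, op),
                    reverse=True)
    return ranked[:m]
```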
5 Authorship Attribution Methods

To evaluate the different term selection functions, we have selected three authorship attribution approaches, so as to ground our findings on a relatively broad basis. As a first method, we can apply the approach suggested by Zhao and Zobel (2007), who propose to compute the distance between the author profile A_j (the concatenation of all his/her writings) and the query text Q by using the Kullback-Leibler divergence (KLD), also called relative entropy (Manning and Schütze, 1999). This measure is given in Equation (3), where Prob_q[t_i] (or Prob_Aj[t_i]) indicates the occurrence probability of a term t_i in the query Q (or in the author profile A_j), for i = 1, 2, ..., m.

$$\mathrm{KLD}(Q \,\|\, A_j) = \sum_{i=1}^{m} \mathrm{Prob}_q[t_i] \cdot \log_2\left[\frac{\mathrm{Prob}_q[t_i]}{\mathrm{Prob}_{A_j}[t_i]}\right] \qquad (3)$$

When two distributions are identical, the KLD measure is 0. Otherwise, the formula returns a positive value, which grows as the distance (disagreement) between the two underlying distributions increases. To estimate the needed probabilities, we can apply the maximum likelihood principle and estimate Prob[t_i] = tf_i / n, with tf_i indicating the occurrence frequency of a term t_i, and n the size of the document. However, it is usually better to smooth such estimates to avoid null probabilities (Manning and Schütze, 1999). In our evaluations, we have applied Lidstone's rule, where the probabilities are estimated as (tf_i + λ) / (n + λ·|V|), with |V| indicating the vocabulary size and λ a parameter fixed to 0.01 (the value showing the best performance).

As a second authorship attribution method, we can compare the representation of a given text Q with an author profile A_j using the chi-square statistic (Grieve, 2007) defined by Equation (4) (the same general statistic can serve both as a term selection function and as an attribution scheme). In this formulation, rtf_q(t_i) represents the relative frequency of the ith term in the query text, and rtf_Aj(t_i) the same information in the jth author profile. When comparing a query text Q with the different author profiles A_j, we simply select the lowest chi-square value to determine the most probable author of a disputed text.

$$\chi^2(Q, A_j) = \sum_{i=1}^{m} \frac{\left(\mathrm{rtf}_q(t_i) - \mathrm{rtf}_{A_j}(t_i)\right)^2}{\mathrm{rtf}_{A_j}(t_i)} \qquad (4)$$

As a third authorship attribution method, we used the Delta model (Burrows, 2002), measuring the distance between two texts according to the standardized frequencies (Z scores) of their terms. This value is obtained from the relative occurrence frequency (denoted rtf_ij for term t_i in the document d_j) by subtracting the mean (mean_i) and dividing by the standard deviation (sd_i), both estimated over the underlying corpus (see Equation (5)).

$$Z\,\mathrm{score}(t_{ij}) = \frac{\mathrm{rtf}_{ij} - \mathrm{mean}_i}{\mathrm{sd}_i} \qquad (5)$$

Once these dimensionless quantities are obtained for each selected word, we can then compute the distance to those obtained from the author profiles. Given a query text Q, an author profile A_j, and a set of terms t_i, for i = 1, 2, ..., m, we compute the Delta value (or the intertextual distance) by applying Equation (6).

$$\Delta(Q, A_j) = \frac{1}{m} \cdot \sum_{i=1}^{m} \left| Z\,\mathrm{score}(t_{iq}) - Z\,\mathrm{score}(t_{ij}) \right| \qquad (6)$$

In this formulation, proposed by Burrows (2002), we assign the same weight to each term t_i. A large difference between Q and A_j will appear when, for a given term, both Z scores are large but with opposite signs. On the other hand, when a term appears with similar relative occurrence frequencies in both texts, the difference in Z scores will be small. Finally, when the differences in Z scores are small for all m terms, the resulting Δ distance will be small, indicating that the same person probably wrote both texts.
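The three attribution schemes can be sketched as follows; this is an illustration under the descriptions above, and the dictionary-based representations and function names are assumptions rather than the study's implementation. In all three cases, the candidate profile minimizing the returned value is chosen as the most probable author.

```python
import math

def lidstone(tf, n, vocab_size, lam=0.01):
    """Smoothed probability estimate (tf + lam) / (n + lam * |V|)."""
    return (tf + lam) / (n + lam * vocab_size)

def kld(query_prob, author_prob):
    """Kullback-Leibler divergence (Equation (3)); both arguments map
    each of the m selected terms to a smoothed (nonzero) probability."""
    return sum(q * math.log2(q / author_prob[t])
               for t, q in query_prob.items())

def chi_square_distance(query_rtf, author_rtf):
    """Chi-square distance between a query text and an author profile
    (Equation (4)), using relative term frequencies; it assumes every
    selected term occurs in the author profile."""
    return sum((query_rtf[t] - author_rtf[t]) ** 2 / author_rtf[t]
               for t in query_rtf)

def delta(query_z, author_z):
    """Burrows' Delta (Equation (6)): the mean absolute difference of
    the Z scores (Equation (5)) over the m selected terms."""
    return sum(abs(query_z[t] - author_z[t]) for t in query_z) / len(query_z)
```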
6 Evaluation

To achieve unbiased performance estimations, we cannot use the same instances for both training the classifier and testing it. The set of available examples must, therefore, be divided into a training set and a distinct test set. In the current study, we opted for the leave-one-out approach (Hastie et al., 2009). When applying this methodology to the Sports corpus extracted from the 'Glasgow Herald', each of the 1,948 articles, in turn, will form the query text, whereas the remaining 1,947 texts will generate the training set used to determine the most useful terms. The accuracy rate reported in this study corresponds to the micro-average value, the mean over all documents.

Some authors suggest not using the same training set both to let the classifier learn (i.e. build the author profiles) and to select the features. Thus, it is recommended to use disjoint sets of instances for feature selection and for learning. Although a bias exists from a theoretical point of view, its impact from a practical viewpoint is rather limited (Singhi and Liu, 2006).
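A sketch of this leave-one-out protocol follows; `build_profiles` and `attribute` are hypothetical helpers standing for the profile construction and for one of the attribution schemes of Section 5.

```python
def leave_one_out_accuracy(documents, build_profiles, attribute):
    """Each document in turn serves as the query text; the remaining
    documents form the training set. `documents` is a list of
    (text, author) pairs; the micro-averaged accuracy is returned."""
    correct = 0
    for i, (query, true_author) in enumerate(documents):
        training = documents[:i] + documents[i + 1:]
        profiles = build_profiles(training)      # author -> representation
        correct += (attribute(query, profiles) == true_author)
    return correct / len(documents)
```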
As a first authorship attribution model, we have evaluated the KLD model (Zhao and Zobel, 2007). With the 'Glasgow Herald' corpus, the feature selection is based on the 363 predefined word types. This list mainly contains function words (the, in, but, not, am, of, can, ...) and frequent items (became, nothing, ...). Some entries are less frequent (howbeit, whereafter, whereupon), whereas others indicate the expected behavior of the tokenizer (doesn, weren) or correspond to an arbitrary decision (indicate, missing, seemed). With the Italian corpora, we first need to define a list of frequent terms usually appearing in all documents. To achieve this, we have chosen a stopword list used by search technology for this language (Savoy, 2001). This list includes 399 terms, containing mainly function words (il, la, del, in, con, nostro, essi, fare, ...) and frequent items (anno, casa, ...).

Using the KLD method with the predefined set of 363 English words, we achieved an accuracy rate of 83.5% for the Sports corpus and 81.0% for Politics (these values are reported in the line 'a priori' in Table 5). To obtain a more complete picture, we also considered all available terms (words and punctuation symbols, no selection). In this case, the KLD scheme produces an accuracy rate of 74.6% (Sports, based on 6,616 terms) and 93.0% (Politics, based on 5,128 terms) (line denoted 'All' in Table 5). With the Italian language and the predefined set of 399 words, we achieved an accuracy rate of 94.9% for the Sports subset and 88.7% for Politics (these performances appear in the line 'a priori' in Table 6). When considering all terms without any selection, an accuracy rate of 97.0% was obtained with the Sports subset (based on 6,780 terms) and 88.4% with the Politics subset (based on 10,644 terms) (line denoted 'All' in Table 6).

We then compare these baselines (a priori selection) with the eleven other term selection methods and three aggregation operators (max, sum, or weighted mean, denoted wmean). As the number of selected terms (the parameter m in the formulae depicted in Section 5), we have tested the following values: 150, 300, 500, 800, 1,000, 1,500, 2,000, 3,000, 4,000, and 5,000. The last value (5,000) corresponds to around 75.6% of all 6,616 terms available for the Sports subset in the 'Glasgow Herald' (or 97.5% of all 5,128 available terms for the Politics subset). Similar percentages are obtained when analyzing the two Italian text collections. When considering larger numbers, the term space is not really reduced; therefore, we did not attempt this variation. Instead of reporting all possible combinations of the number of features with the three aggregation functions, we report only the best parameter setting for each selection function (number of terms / aggregation operator).

The question that then arises is: can we obtain a better performance using fewer terms? If yes, can the set of terms defined by Zhao and Zobel (2007) produce a better performance than those produced from the sets of terms defined by the various feature selection functions?

Table 5 Feature selection methods with KLD and the 'Glasgow Herald' (Sports and Politics corpora)

   Function         Sports                           Politics
                    Parameter       Accuracy (%)     Parameter       Accuracy (%)
   a priori         363 terms       83.5             363 terms       81.0
   All              6,616 terms     74.6†            5,128 terms     93.0†
   tf(t_k, c_j)     3,000 / max     92.8†            800 / max       95.1†
   df(t_k, c_j)     3,000 / max     93.5*†           2,000 / max     95.1†
   IG(t_k, c_j)     1,500 / wmean   93.3†            800 / sum       95.0†
   GR(t_k, c_j)     3,000 / max     93.1†            500 / max       96.1*†
   GSS(t_k, c_j)    3,000 / max     92.7†            1,000 / max     95.1†
   χ²(t_k, c_j)     1,500 / sum     93.3†            300 / sum       95.0†
   CC(t_k, c_j)     2,000 / max     92.8†            300 / max       95.4†
   BNS(t_k, c_j)    5,000 / wmean   90.7†            1,500 / sum     94.4†
   PMI(t_k, c_j)    5,000 / wmean   90.3†            5,000 / wmean   92.2†
   OR(t_k, c_j)     3,000 / wmean   91.5†            500 / wmean     94.1†
   DIA(t_k, c_j)    3,000 / max     86.4†            5,000 / wmean   93.1†

   Note: † indicates a statistically significant difference with the first row (Sign test, bilateral, α = 1%); * marks the best performance per column.

Table 6 Feature selection methods with KLD and 'La Stampa' (Sports and Politics collections)

   Function         Sports                           Politics
                    Parameter       Accuracy (%)     Parameter       Accuracy (%)
   a priori         399 terms       94.9             399 terms       88.7
   All              6,780 terms     97.0†            10,644 terms    88.4
   tf(t_k, c_j)     5,000 / max     97.4†            500 / sum       94.4†
   df(t_k, c_j)     5,000 / max     97.7*†           500 / max       95.6*†
   IG(t_k, c_j)     5,000 / max     97.7*†           4,000 / wmean   95.3†
   GR(t_k, c_j)     4,000 / sum     97.3†            3,000 / wmean   95.2†
   GSS(t_k, c_j)    5,000 / max     97.0†            500 / max       95.1†
   χ²(t_k, c_j)     4,000 / max     97.3†            3,000 / wmean   94.9†
   CC(t_k, c_j)     5,000 / max     97.0†            1,500 / max     94.0†
   BNS(t_k, c_j)    2,000 / max     97.3†            3,000 / wmean   93.8†
   PMI(t_k, c_j)    4,000 / max     96.2             5,000 / max     87.3
   OR(t_k, c_j)     5,000 / max     96.4             1,000 / wmean   93.0†
   DIA(t_k, c_j)    4,000 / max     95.5             500 / max       90.1

   Note: † indicates a statistically significant difference with the first row (Sign test, bilateral, α = 1%); * marks the best performance per column.

The accuracy rates depicted in Tables 5 and 6 indicate that different selection functions may produce better performance levels than either the manual selection or ignoring the selection procedure altogether (lines labeled 'All'). The manual selection (lines labeled 'a priori') produces relatively low accuracy rates for the English language compared with the other approaches (see Table 5). For the Politics subset of the 'Glasgow Herald', considering all possible terms clearly improves the performance (from 81.0 to 93.0%). Within the same category, but with the Italian language (Table 6), the manual selection and the use of all terms tend to produce similar performance levels (88.7 versus 88.4%). However, those accuracy rates are lower than those achieved by other selection functions.

Overall, Tables 5 and 6 tend to show that we can achieve high performance levels when considering relatively simple selection methods, such as the document frequency (df) or the absolute term frequency (tf). In these tables, the best performance per column is marked with an asterisk (*). The performance differences with the IG, GR, GSS, chi-square (χ²), or CC functions are usually small and not significant. The PMI, OR, and DIA functions, however, seem to offer lower performance levels than those produced by the other selection schemes. When inspecting the different aggregation operators (max, sum, or weighted mean), we can see that the maximum function occurs frequently in Tables 5 and 6, indicating that this aggregation function tends to achieve the best results.
To verify whether a performance difference between two term selection procedures is statistically significant, we opted for the Sign test (Conover, 1980; Yang and Liu, 1999) (bilateral test, significance level α = 1%). In this case, the null hypothesis H0 assumes that both selection methods perform at a similar level. In Tables 5-10, we use the first line as the baseline, and any statistically significant performance difference is indicated by the symbol '†'. As we can see in Tables 5 and 6, the performance differences are usually significant compared with the first row, the selection strategy based on a predefined set of words.

When applying the chi-square metric (Grieve, 2007), we tried different k-limits and found that, for the 'Glasgow Herald', the 5-limit produces the best performance for the Sports corpus (83.9%, as depicted in Table 7 in the row 'k-limit') by selecting 1,633 terms (for the Politics subset, the best setting is k = 10, selecting 434 terms and producing an accuracy of 76.0%). When specifying k-limit = 10, all selected terms must appear in at least ten articles written by each journalist. Thus, such a selection strategy imposes that every possible author must have used all selected terms.
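A sketch of this k-limit selection follows; the input format (one list of tokenized articles per author) and the names are illustrative assumptions.

```python
from collections import Counter

def k_limit_profile(articles_by_author, k):
    """Keep the terms occurring in at least k articles written by
    every possible author (Grieve's k-limit selection)."""
    selected = None
    for articles in articles_by_author.values():
        doc_freq = Counter()
        for tokens in articles:
            doc_freq.update(set(tokens))   # each type counted once per article
        kept = {t for t, df in doc_freq.items() if df >= k}
        selected = kept if selected is None else selected & kept
    return selected
```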
With the two Italian corpora, the best accuracy rates were achieved when considering the 200-limit for the Sports subset (selecting a small set of 31 terms, accuracy rate = 83.7%) or with k = 10 for the Politics subset (selecting 252 terms and producing an accuracy of 71.9%).

When ignoring the selection procedure, we take into account all possible terms. In this case, we can achieve an accuracy rate of 72.5% (Sports, 6,616 available terms) and 79.5% (Politics, based on 5,128 terms) with the English corpora (see Table 7, line labeled 'All'). In Table 8, for the Italian collections, the accuracy rate is 85.5% (Sports, 6,780 available terms) or 72.6% (Politics, based on 10,644 terms) when we ignore the selection procedure.

Instead of strictly following the selection scheme proposed by Grieve (2007), we can apply the different feature-scoring selection functions. The best parameter settings and accuracy rates are reported in Table 7 for the 'Glasgow Herald' newspaper and in Table 8 for the Italian corpora.

Table 7 Feature selection methods with the chi-square method and the 'Glasgow Herald' (Sports and Politics corpora)

   Function         Sports                           Politics
                    Parameter       Accuracy (%)     Parameter       Accuracy (%)
   k-limit          1,633 terms     83.9             434 terms       76.0
   All              6,616 terms     72.5†            5,128 terms     79.5
   tf(t_k, c_j)     500 / max       79.6†            300 / sum       90.4†
   df(t_k, c_j)     500 / sum       83.6             800 / sum       90.3†
   IG(t_k, c_j)     4,000 / sum     75.3†            150 / wmean     85.1†
   GR(t_k, c_j)     4,000 / sum     74.9†            150 / sum       84.0†
   GSS(t_k, c_j)    5,000 / max     73.8†            150 / max       87.9†
   χ²(t_k, c_j)     4,000 / sum     74.1†            150 / wmean     82.7†
   CC(t_k, c_j)     5,000 / max     73.5†            5,000 / max     79.5
   BNS(t_k, c_j)    3,000 / sum     77.6†            2,000 / max     86.5†
   PMI(t_k, c_j)    4,000 / max     74.1†            3,000 / sum     24.5†
   OR(t_k, c_j)     2,000 / max     74.8†            4,000 / wmean   79.7†
   DIA(t_k, c_j)    5,000 / max     73.2†            500 / max       48.0†

   Note: † indicates a statistically significant difference with the first row (Sign test, bilateral, α = 1%).

Table 8 Feature selection methods with the chi-square method and 'La Stampa' (Sports and Politics collections)

   Function         Sports                           Politics
                    Parameter       Accuracy (%)     Parameter       Accuracy (%)
   k-limit          31 terms        83.7             252 terms       71.9
   All              6,780 terms     85.5             10,644 terms    72.6
   tf(t_k, c_j)     150 / sum       90.1†            300 / wmean     89.5†
   df(t_k, c_j)     500 / wmean     91.9†            500 / wmean     92.8†
   IG(t_k, c_j)     5,000 / max     86.3             150 / wmean     78.9†
   GR(t_k, c_j)     5,000 / max     89.3†            150 / wmean     73.4
   GSS(t_k, c_j)    5,000 / max     87.9†            300 / max       78.2†
   χ²(t_k, c_j)     5,000 / sum     85.9             5,000 / wmean   73.0
   CC(t_k, c_j)     5,000 / max     86.5             5,000 / max     67.9†
   BNS(t_k, c_j)    3,000 / max     93.1†            2,000 / max     81.8†
   PMI(t_k, c_j)    5,000 / max     87.0             5,000 / max     69.7
   OR(t_k, c_j)     5,000 / max     87.9†            4,000 / max     62.2†
   DIA(t_k, c_j)    5,000 / max     79.7†            5,000 / max     50.3†

   Note: † indicates a statistically significant difference with the first row (Sign test, bilateral, α = 1%).

As for the KLD method, the results depicted in Tables 7 and 8 indicate that an appropriate feature-scoring function (with its parameter values) might produce higher performance levels than considering all terms or selecting terms according to the best k-limit principle. Overall, when comparing the different selection strategies, the df or tf selection schemes tend to produce high performance levels. With this chi-square-based attribution scheme only, the BNS selection function also offers a high effectiveness. After applying the Sign test, we can see that the performance differences with the best k-limit approaches are usually statistically significant (indicated with the symbol '†'). Finally, for both languages, the PMI, the OR, and the DIA functions usually tend to return less pertinent term sets, achieving lower performance levels.

To obtain a broader view, we have also depicted the best performances achieved with the Delta rule (Burrows, 2002). As depicted in Table 9 for the 'Glasgow Herald', the best performance using the Delta rule is achieved when considering the 300 most frequent terms for the Sports subset (accuracy rate 74.2%, under the label 'Most freq.') or 200 with the Politics corpus (83.5%). With the Italian text collections, the most effective number of terms is 150 for the Sports subset (85.7%) or 500 word types with the Politics corpus (84.1%) (see Table 10). The selection procedure proposed with the Delta rule is equivalent to the tf scoring function with the sum aggregation.

Table 9 Feature selection methods with the Delta method and the 'Glasgow Herald' (Sports and Politics corpora)

   Function         Sports                           Politics
                    Parameter       Accuracy (%)     Parameter       Accuracy (%)
   Most freq.       300 terms       74.2             200 terms       83.5
   All              6,616 terms     21.1†            5,128 terms     43.0†
   tf(t_k, c_j)     300 / max       81.1†            150 / sum       83.0
   df(t_k, c_j)     300 / max       78.7†            150 / wmean     81.8
   IG(t_k, c_j)     500 / sum       83.4†            150 / sum       69.1†
   GR(t_k, c_j)     500 / wmean     85.6†            150 / sum       72.0†
   GSS(t_k, c_j)    300 / max       79.8†            2,000 / max     72.1†
   χ²(t_k, c_j)     1,000 / sum     74.2             4,000 / wmean   66.1†
   CC(t_k, c_j)     300 / sum       42.2†            2,000 / max     63.8†
   BNS(t_k, c_j)    1,500 / max     51.0†            1,500 / sum     62.7†
   PMI(t_k, c_j)    5,000 / max     56.4†            1,500 / max     52.2†
   OR(t_k, c_j)     150 / max       21.0†            4,000 / wmean   74.4†
   DIA(t_k, c_j)    800 / max       54.5†            1,000 / max     64.8†

   Note: † indicates a statistically significant difference with the first row (Sign test, bilateral, α = 1%).

Table 10 Feature selection methods with the Delta method and 'La Stampa' (Sports and Politics collections)

   Function         Sports                           Politics
                    Parameter       Accuracy (%)     Parameter       Accuracy (%)
   Most freq.       150 terms       85.7             500 terms       84.1
   All              6,780 terms     29.8†            10,644 terms    15.5†
   tf(t_k, c_j)     300 / max       93.3†            150 / max       85.1
   df(t_k, c_j)     300 / max       93.3†            150 / max       84.9
   IG(t_k, c_j)     3,000 / max     80.7†            150 / max       87.9†
   GR(t_k, c_j)     4,000 / max     93.0†            150 / sum       88.7†
   GSS(t_k, c_j)    500 / max       76.4†            150 / max       82.6
   χ²(t_k, c_j)     4,000 / max     81.2†            150 / sum       78.6†
   CC(t_k, c_j)     500 / max       70.0†            150 / max       68.5†
   BNS(t_k, c_j)    2,000 / wmean   62.5†            300 / wmean     28.0†
   PMI(t_k, c_j)    3,000 / max     62.6†            800 / sum       32.9†
   OR(t_k, c_j)     5,000 / sum     67.5†            300 / sum       32.8†
   DIA(t_k, c_j)    4,000 / max     60.0†            150 / max       28.2†

   Note: † indicates a statistically significant difference with the first row (Sign test, bilateral, α = 1%).
Without any feature selection, the Delta rule produces, with the English corpora, an accuracy of 21.1% (Sports, 6,616 available terms) and 43.0% (Politics, 5,128 terms), as depicted in the line 'All' in Table 9. With the 'La Stampa' newspaper's corpora and considering all terms, the Delta rule achieves an accuracy of 29.8% (Sports, 6,780 terms) and 15.5% (Politics, with 10,644 terms). The performance differences with the first row are rather large, indicating that the Delta rule must be applied with a reduced number of word types. As with the other authorship attribution methods, we have also reported the best success rate according to the eleven selection approaches, together with the best parameter setting.

Overall, the evaluations depicted for the Delta rule (see Tables 9 and 10) confirm that high performance levels are achieved when using the df or the absolute tf as feature selection functions. The performance differences over the selection based on the most frequent words are usually statistically significant, but only for the two Sports corpora. As a second choice, we can use the GSS, IG, and chi-square functions. However, the OR, the PMI, and the DIA functions tend to produce less pertinent feature sets, at least in an authorship attribution context and especially when using the Delta rule (see Tables 9 and 10).

A few cases are worth a comment. With the English language (Table 9), the high result of the GR function in the Sports subset is not confirmed by the Politics corpus. We also notice that the correlation coefficient (CC) function with the English Sports corpus (see Table 9) achieves a rather low accuracy level, as does the BNS selection function with the Italian Politics subset.

Finally, to give a view of the selection effect of the different functions, we have counted the percentage of selected terms in common between two functions with the English corpora. Using only the sum as the aggregation operator, and varying the number of selected terms between 150 and 3,000, we can see that the functions df and tf return, on average, similar sets of terms (overlap degree between 92 and 99%). A similar effect can be detected with the CC function and the chi-square metric (this can be explained by the fact that the CC function is derived from the chi-square), or between the IG and GR. We can also detect a relationship between the sets of features defined by the GSS and IG functions. In these cases, the overlap is, on average, 77%. Finally, it is difficult to find a clear relationship between the BNS, OR, or PMI functions and all the others. These three selection functions tend to propose different sets of features.

7 Practical Considerations

When faced with a new authorship attribution problem, which term selection function must we apply, and how many terms must we select?
Based on four test collections and three attribution schemes, the experimental results do not show a strong systematic pattern. However, some trends can be detected. First, the absolute occurrence frequency (tf) and the document frequency (df) tend to produce pertinent and well-performing term sets for the three attribution schemes and the four collections. This is an indication that using frequent words as features to discriminate between different authors is an effective strategy. Such selection approaches also have the advantage of being easy to implement and of having a clear interpretation for the end user. Moreover, we cannot detect significant differences between the evaluations done with the English language and those performed over the Italian collections.

It is worth mentioning that the term selection is not based on the whole vocabulary. As specified in Section 3, we can take domain knowledge into account. Thus, it is good practice to ignore word types having a low occurrence frequency or appearing in a single (or a few) document(s). Moreover, we have also removed words used by only a single author. It is known that such terms can be effective in distinguishing between authors, and most of the selection functions will detect them as effective features for authorship discrimination. However, these terms are vulnerable because they can easily be used to spoof a given identity.

The second important question is to define the number of terms to be selected. Determining such an optimum value a priori is difficult. The various experiments depicted in the previous section indicate that we need a small number of terms (between 150 and 300) to obtain one of the best performance levels with the Delta rule (Tables 9 and 10). The chi-square authorship attribution scheme also requires a relatively small number of terms to produce one of the best accuracy rates (300-500 terms (df function) with the English corpora (Table 7), and 500 terms (df function) with the Italian corpora (Table 8)). With the KLD method, a general conclusion is harder to draw. For the English corpora, we need around 500 terms for the Politics subset and 1,500-3,000 terms for the Sports subset (Table 5). For the Italian language, around 5,000 terms for the Sports subset and 500 for the Politics part (Table 6) are required to achieve the highest performance levels.

In addition to these findings, we must recall that the morphology of the Italian language is more complex than that of English. Thus, we can expect more functional word types in this language than in English. For example, the translation of the determiner 'the' (definite article) could be 'il, lo, l', i, gli, la, or le' because the variations in gender and number must be specified in the Italian language. This difference in size can be reduced when we consider representing texts using lemmas (headword or dictionary entry) instead of word types. In this case, we can conflate all inflected forms under the same entry (e.g. 'was, were, is', etc. are regrouped under the lemma 'be', whereas the pronouns 'I, me' fall under 'I'). This processing can be done manually, but it is a costly operation. On the other hand, we can apply an automatic part-of-speech (POS) tagger. However, such an approach is not error-free, and some recent studies have compared the relative merits of these two text representation schemes for authorship attribution (Savoy, 2012; Miranda García and Calle Martín, 2012). Finally, we must mention that some natural languages may have linguistic constructions other than those used in the English language. For example, the definite article appears as a suffix in the Bulgarian or Swedish language and not as a distinct and separate lemma. Moreover, the indefinite article ('a/an') does not exist in the Bulgarian language.

As a possible default parameter setting, we can suggest selecting the first 300 most frequent word types according to the document frequency (or df_sum) for the English collections, and the top 500 most frequent terms (df_sum) with the Italian corpora. Using the occurrence frequency (tf) will produce similar results, and according to our experiments, there is no real reason to prefer one function over the other. We have a slight preference for the document frequency information because this function ignores the variations in document lengths. Finally, to determine the number of terms, it is more efficient to work with a small term set. According to our experiments, a size of 300 terms seems reasonable for the English language. Considering that the morphology of the Italian language is more complex, we suggest adding 200 more terms when working with languages having a more complex morphology (i.e. gender, grammatical cases).
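This default recipe reduces to ranking the word types by document frequency and keeping the top m; a minimal sketch follows (an illustration, with names and the pre-tokenized input format as assumptions).

```python
from collections import Counter

def df_selection(tokenized_docs, m=300):
    """Rank word types by document frequency and keep the top m
    (m = 300 suggested for English, 500 for a language with a
    richer morphology, such as Italian)."""
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))   # each type counted once per document
    return [term for term, _ in doc_freq.most_common(m)]
```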
Table 11 depicts the accuracy rates obtained when adopting these suggested default parameter settings. For the 'Glasgow Herald' corpora, we have used the first 300 most frequent terms according to the document frequency (or df_sum), and the top 500 most frequent terms (df_sum) for 'La Stampa'. These performance levels are then compared with the optimal parameter setting (considering all selection functions and numbers of terms).

Table 11 Evaluation of the proposed parameter setting versus the optimal one, with the English and the Italian corpora

   Attribution    'Glasgow Herald'                          'La Stampa'
   method         Corpus / Parameter        Accuracy (%)    Corpus / Parameter       Accuracy (%)
   KLD            Sports: 3,000 df max      93.5            Sports: 5,000 df max     97.7
                  300 terms                 90.0 (-4%)      500 terms                96.4 (-1%)
   Chi-square     1,633 terms (no select.)  83.9            3,000 BNS max            93.1
                  300 terms                 64.5 (-23%)     500 terms                90.2 (-3%)
   Delta          500 GR wmean              85.6            300 tf max               93.3
                  300 terms                 78.7 (-8%)      500 terms                91.8 (-2%)
   KLD            Politics: 500 GR max      96.1            Politics: 500 df max     95.6
                  300 terms                 94.5 (-2%)      500 terms                95.6 (0%)
   Chi-square     300 tf sum                90.4            500 df wmean             92.8
                  300 terms                 87.4 (-3%)      500 terms                90.0 (-3%)
   Delta          150 tf sum                83.0            150 GR sum               88.7
                  300 terms                 71.4 (-14%)     500 terms                73.7 (-17%)†

   Note: † indicates a statistically significant difference (Sign test, bilateral, α = 1%).

As shown in Table 11, the performance differences between the proposed default parameter settings and the optimal ones are rather small, particularly with the KLD attribution scheme. On the other hand, the Delta rule is more sensitive to deviations from the optimal number of terms. These data also show that the df selection strategy tends to produce term sets similar to those of the tf function. This relationship can be shown by considering the performances obtained with the Politics corpus of the 'Glasgow Herald' in Table 11. Using the chi-square attribution scheme and the 300 most frequently occurring terms (tf), we achieve an accuracy rate of 90.4%. Using the df function, the selection of the 300 most frequent terms provides a mean performance of 87.4%, a relative decrease of around 3.3%.

8 Conclusion

To design an effective authorship attribution scheme, we need to select the most appropriate features (word types and punctuation symbols in the current study) that can discriminate between the different categories or authors.
To evaluate the different selection strategies, we compared nine feature-scoring functions and two well-known selection approaches used in authorship attribution studies (based on the absolute term frequency (tf) or the document frequency (df)). To combine the scores computed for the various terms, we have evaluated three aggregation operators. As classifiers, we used the KLD measure (Zhao and Zobel, 2007), the chi-square metric (Grieve, 2007), and the Delta rule (Burrows, 2002).

Using four corpora extracted from the newspapers 'Glasgow Herald' and 'La Stampa' about Sports (1,948 and 1,321 articles, respectively) and Politics (987 and 2,036 articles, respectively), our evaluations show that using the df or the absolute term frequency (tf) tends to provide good overall performance. In a second class of performance levels, we can place the GR, the chi-square, the GSS function, and the IG. The use of the PMI, the OR, or the DIA function does not provide comparable results, at least in the authorship attribution context. Finally, the BNS function presents an erratic behavior, working well in some cases and only moderately in others.

The overall good results achieved by the absolute term frequency (tf) are clearly an indication that the selection procedure proposed by Burrows (2002) for the Delta rule is an effective one. Similarly, the df selection function tends to propose discriminative term sets, and this study confirms, in part, the k-limit selection strategy suggested by Grieve (2007). However, this latter procedure imposes that the selected terms be used by all possible authors, a constraint not imposed by the df selection function.

Unlike Yang and Pedersen's study (1997), based on topical text classification, the IG or the chi-square is not always the best-performing method. According to Sebastiani (2002), good selections can be achieved by applying the OR_sum or the GSS_max. The current study, which is based on authorship attribution, indicates that this choice, adequate for topical text classification, is not the best when dealing with authorship attribution.

Finally, as an aggregation function, this study tends to indicate that applying the maximum operator is a good default choice. On the other hand, our evaluations based on four distinct corpora are unable to clearly define a specific number of terms to be used for a new collection. As a default range of values, the evaluations reported in this study indicate that considering between 300 and 500 terms seems a good starting point before further investigation, when possible.
References

Baayen, H. R. (2001). Word Frequency Distributions. Dordrecht: Kluwer Academic Press.

Baayen, H. R. (2008). Analyzing Linguistic Data: A Practical Introduction to Statistics Using R. Cambridge: Cambridge University Press.

Baayen, H. R. and Halteren, H. V. (2002). An experiment in authorship attribution. In Proceedings of the 6th Journées d'Analyse des Données Textuelles (JADT 2002). St-Malo, pp. 69-75.

Burrows, J. F. (2002). Delta: a measure of stylistic difference and a guide to likely authorship. Literary and Linguistic Computing, 17: 267-87.

Church, K. W. and Hanks, P. (1989). Word association norms, mutual information and lexicography. In Proceedings of the Association for Computational Linguistics (ACL), pp. 76-83.

Conover, W. J. (1980). Practical Nonparametric Statistics, 2nd edn. New York: John Wiley & Sons.

Damerau, F. J. (1975). The use of function word frequencies as indicators of style. Computers and the Humanities, 9: 271-80.

Forman, G. (2003). An extensive empirical study of feature selection metrics for text classification. Journal of Machine Learning Research, 3: 1289-1305.

Fuhr, N., Hartmann, S., Knorz, G., Lustig, G., Schwantner, M., and Tzeras, K. (1991). AIR/X: a rule-based multi-stage indexing system for large subject fields. In Proceedings of Recherche d'Information Assistée par Ordinateur (RIAO), pp. 606-23.

Galavotti, L., Sebastiani, F., and Simi, M. (2000). Experiments on the use of feature selection and negative evidence in automated text categorization. In Proceedings of the European Conference on Digital Libraries (ECDL). Berlin: Springer, pp. 59-68.

Grieve, J. (2007). Quantitative authorship attribution: an evaluation of techniques. Literary and Linguistic Computing, 22: 251-70.

Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer-Verlag.

Holmes, D. I. (1998). The evolution of stylometry in humanities scholarship. Literary and Linguistic Computing, 13: 111-17.

Holmes, D. I. and Forsyth, R. S. (1995). The Federalist revisited: new directions in authorship attribution. Literary and Linguistic Computing, 10: 111-27.

Hoover, D. L. (2003). Another perspective on vocabulary richness. Computers and the Humanities, 37: 151-78.

Hoover, D. L. (2007). Corpus stylistics, stylometry, and the styles of Henry James. Style, 41: 160-89.

Hughes, J. M., Foti, N. J., Krakauer, D. C., and Rockmore, D. N. (2012). Quantitative patterns of stylistic influence in the evolution of literature. Proceedings of the National Academy of Sciences of the United States of America, 109(20): 7682-6.
Jockers, M. L. and Witten, D. M. (2010). A comparative study of machine learning methods for authorship attribution. Literary and Linguistic Computing, 25: 215-23.

Juola, P. (2003). The time course of language change. Computers and the Humanities, 37: 77-96.

Juola, P. (2006). Authorship attribution. Foundations and Trends in Information Retrieval, 1: 233-334.

Koppel, M., Schler, J., and Bonchek-Dokow, E. (2007). Measuring differentiability. Journal of Machine Learning Research, 8: 1261-76.

Labbé, D. (2007). Experiments on authorship attribution by intertextual distance in English. Journal of Quantitative Linguistics, 14: 33-80.

Liu, H. and Motoda, H. (2008). Computational Methods of Feature Selection. Boca Raton: Chapman & Hall/CRC.

Manning, C. D. and Schütze, H. (1999). Foundations of Statistical Natural Language Processing. Cambridge: The MIT Press.

Miranda García, M. and Calle Martín, J. (2007). Function words in authorship attribution studies. Literary and Linguistic Computing, 22: 27-47.

Miranda García, M. and Calle Martín, J. (2012). The authorship of the disputed Federalist Papers with an annotated corpus. English Studies, 93: 371-90.

Mosteller, F. and Wallace, D. L. (1964). Applied Bayesian and Classical Inference: The Case of the Federalist Papers. Reading: Addison-Wesley.

Ng, H. T., Goh, W. B., and Low, K. L. (1997). Feature selection, perceptron learning, and a usability case study for text categorization. In Proceedings of the ACM Special Interest Group on Information Retrieval (SIGIR). New York: The ACM Press, pp. 67-73.

Pennebaker, J. W. (2011). The Secret Life of Pronouns: What Our Words Say About Us. New York: Bloomsbury Press.

Savoy, J. (2001). Report on CLEF-2001 experiments. In Peters, C., Braschler, M., Gonzalo, J., and Kluck, M. (eds), Cross-Language Information Retrieval and Evaluation, LNCS #2069. Berlin: Springer, pp. 27-43.

Savoy, J. (2012). Authorship attribution: a comparative study of three text corpora and three languages. Journal of Quantitative Linguistics, 19: 132-61.

Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34: 1-47.

Singhi, S. and Liu, H. (2006). Feature subset selection bias for classification learning. In Proceedings of the International Conference on Machine Learning. New York: The ACM Press, pp. 849-56.

Stamatatos, E. (2009). A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology, 60: 538-56.

Yang, Y. and Liu, X. (1999). A re-examination of text categorization methods. In Proceedings of ACM-SIGIR. New York: The ACM Press, pp. 42-9.

Yang, Y. and Pedersen, J. O. (1997). A comparative study of feature selection in text categorization. In Proceedings of the International Conference on Machine Learning, pp. 412-20.

Zhao, Y. and Zobel, J. (2007). Entropy-based authorship search in large document collections. In Proceedings of the European Conference on Information Retrieval (ECIR). Berlin: Springer, pp. 381-92.

Appendix

Table A1 List of the functions used for feature selection

   DIA(t_k, c_j) = Prob[c_j | t_k]

   PMI(t_k, c_j) = log2( Prob[t_k, c_j] / (Prob[t_k] · Prob[c_j]) ) = log2(Prob[t_k | c_j]) - log2(Prob[t_k])

   OR(t_k, c_j) = ( Prob[t_k | c_j] · (1 - Prob[t_k | ¬c_j]) ) / ( (1 - Prob[t_k | c_j]) · Prob[t_k | ¬c_j] )

   IG(t_k, c_j) = Σ_{c ∈ {c_j, ¬c_j}} Σ_{t ∈ {t_k, ¬t_k}} Prob[t, c] · log2( Prob[t, c] / (Prob[t] · Prob[c]) )

   GR(t_k, c_j) = Prob[t_k, c_j] · log2( Prob[t_k, c_j] / (Prob[t_k] · Prob[c_j]) ) + Prob[¬t_k, c_j] · log2( Prob[¬t_k, c_j] / (Prob[¬t_k] · Prob[c_j]) )

   χ²(t_k, c_j) = n · ( Prob[t_k, c_j] · Prob[¬t_k, ¬c_j] - Prob[t_k, ¬c_j] · Prob[¬t_k, c_j] )² / ( Prob[t_k] · Prob[¬t_k] · Prob[c_j] · Prob[¬c_j] )

   CC(t_k, c_j) = √n · ( Prob[t_k, c_j] · Prob[¬t_k, ¬c_j] - Prob[t_k, ¬c_j] · Prob[¬t_k, c_j] ) / √( Prob[t_k] · Prob[¬t_k] · Prob[c_j] · Prob[¬c_j] )

   GSS(t_k, c_j) = Prob[t_k, c_j] · Prob[¬t_k, ¬c_j] - Prob[t_k, ¬c_j] · Prob[¬t_k, c_j]

   BNS(t_k, c_j) = | F⁻¹(Prob[t_k | c_j]) - F⁻¹(Prob[t_k | ¬c_j]) |

   Note: F⁻¹ represents the inverse of the normal cumulative distribution function.
Table A2 Estimation for the selection functions, and possible values for a positive association or independence

   Function        Estimation                                                          Positive   Independence
   DIA(t_k, c_j)   a / (a+b)
   PMI(t_k, c_j)   log2[ a·n / ((a+b)·(a+c)) ]                                         > 0        ≈ 0
   OR(t_k, c_j)    (a·d) / (c·b)                                                       > 1        ≈ 1
   IG(t_k, c_j)    a/n · log2[ a·n / ((a+b)(a+c)) ] + b/n · log2[ b·n / ((a+b)(b+d)) ]
                   + c/n · log2[ c·n / ((a+c)(c+d)) ] + d/n · log2[ d·n / ((b+d)(c+d)) ]   > 0        ≈ 0
   GR(t_k, c_j)    a/n · log2[ a·n / ((a+b)(a+c)) ] + c/n · log2[ c·n / ((a+c)(c+d)) ]  > 0        ≈ 0
   χ²(t_k, c_j)    n · (a·d - c·b)² / [ (a+c)·(b+d)·(a+b)·(c+d) ]                       > 1        ≈ 0
   CC(t_k, c_j)    √n · (a·d - c·b) / √[ (a+c)·(b+d)·(a+b)·(c+d) ]                      > 0        ≈ 0
   GSS(t_k, c_j)   (a·d - c·b) / n²                                                    > 0        ≈ 0
   BNS(t_k, c_j)   | F⁻¹(a/(a+c)) - F⁻¹(b/(b+d)) |                                     > 0        ≈ 0

   Note: F⁻¹ represents the inverse of the normal cumulative distribution function.
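For reference, the estimations of Table A2 can be computed directly from the contingency counts, as in the following sketch (an illustration, assuming all four counts are strictly positive, since no smoothing is applied). The inverse normal cumulative distribution function F⁻¹ needed by BNS is available in the Python standard library.

```python
import math
from statistics import NormalDist

def selection_scores(a, b, c, d):
    """Estimations of Table A2 from the contingency counts of Table 3;
    all four counts are assumed strictly positive."""
    n = a + b + c + d
    inv = NormalDist().inv_cdf   # F^-1, the inverse normal CDF
    ig = sum(x / n * math.log2(x * n / (row * col))
             for x, row, col in [(a, a + b, a + c), (b, a + b, b + d),
                                 (c, c + d, a + c), (d, c + d, b + d)])
    return {
        "DIA": a / (a + b),
        "PMI": math.log2(a * n / ((a + b) * (a + c))),
        "OR": (a * d) / (c * b),
        "IG": ig,
        "GR": (a / n * math.log2(a * n / ((a + b) * (a + c)))
               + c / n * math.log2(c * n / ((a + c) * (c + d)))),
        "chi2": n * (a * d - c * b) ** 2
                / ((a + c) * (b + d) * (a + b) * (c + d)),
        "CC": math.sqrt(n) * (a * d - c * b)
              / math.sqrt((a + c) * (b + d) * (a + b) * (c + d)),
        "GSS": (a * d - c * b) / n ** 2,
        "BNS": abs(inv(a / (a + c)) - inv(b / (b + d))),
    }

# Contingency counts of the worked example (Table 4):
print(selection_scores(10, 11, 190, 1789))
```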