Expert Systems with Applications 34 (2008) 1599–1608
doi:10.1016/j.eswa.2007.01.018

An incremental cluster-based approach to spam filtering

Wen-Feng Hsiao a,*, Te-Min Chang b
a Department of Information Management, National Pingtung Institute of Commerce, Taiwan
b Department of Information Management, National Sun Yat-sen University, Taiwan
* Corresponding author. Tel.: +886 8 7238700x6201. E-mail addresses: wfhsiao@mail.npic.edu.tw (W.-F. Hsiao), temin@mail.nsysu.edu.tw (T.-M. Chang).

Abstract

As email becomes a popular means of communication over the Internet, the problem of receiving unsolicited and undesired emails, called spam or junk mail, becomes severe. To filter spam from legitimate emails, automatic classification approaches using text mining techniques have been proposed. Such approaches, however, often suffer from low recall rates due to two characteristics of spam: skewed class distributions and concept drift. This research therefore proposes a classification approach that alleviates the problems of skewed class distributions and drifting concepts. A cluster-based classification method, called ICBC, is developed accordingly. ICBC contains two phases. In the first phase, it clusters the emails in each given class into several groups, and an equal number of features (keywords) is extracted from each group to manifest the features of the minority class. In the second phase, ICBC is equipped with an incremental learning mechanism so that it can adapt itself to changes in the environment in a fast and low-cost manner. Three experiments are conducted to evaluate the performance of ICBC. The results show that ICBC can effectively deal with skewed and changing class distributions, and that its incremental learning can also reduce the cost of re-training. The feasibility of the proposed approach is thus justified.
© 2007 Elsevier Ltd. All rights reserved.

Keywords: Email classification; Skewed class distribution; Concept drift; Incremental learning

1. Introduction

With the rapid growth of the Internet, email has become one of the most common media for distributing information. More and more people depend on it to communicate because of its convenience, freedom from restrictions of time and location, prompt delivery, and low cost. Some people, however, abuse it to spread large amounts of information ranging from ads for drugs, easy money, and porn to political promotion. As a result, our mailboxes are usually filled with unsolicited emails (spam or junk mail). Spam filtering has become an essential issue for most companies, governments, and even individual users.

To resolve the spam problem, traditional filters employed simple and straightforward methods, called filtering by rules, which classify spam by matching particular email fields (e.g. sender, recipient, date, subject, and attachment) against certain keywords. Although a certain amount of spam can be filtered out this way, more complex spam (for example, messages that hide their own identities, including sending machines and/or senders) survives easily. More recent and advanced methods rely on content analysis of email bodies to deal with complex spam.
Such methods extract features (keywords) from both legitimate and spam emails, and then build a classification model using text mining techniques. Their main problem, however, is poor recall, which results from the nature of spam itself (Fawcett, 2003), namely skewed class distributions and concept drift. In the case of skewed class distributions, the number of spam emails received far exceeds the number of legitimate ones, so the class distributions of legitimate and spam emails are largely uneven. This often leads to a poor recall rate for legitimate emails (the minority class). In the case of concept drift, users' preferences may change with time, or the topics of spam (or legitimate emails) may vary with current trends. The recall rates for both spam and legitimate emails can therefore be low due to these changes. The learned structure of a classifier should adapt itself to accommodate such changes in order to classify new emails correctly.

The purposes of this research are thus to propose an appropriate classification method that alleviates the above-mentioned problems, and thereby to improve the effectiveness of spam filtering. We develop an adaptive cluster-based classification method, called incremental clustering-based classification (ICBC), for these purposes. Basically, we deal with the class-skewed problem by first clustering the emails in each class (spam and legitimate) into coherent subsets. An equal number of keywords is then extracted from each subset. By doing so, rare but important keywords of the minority classes can still be extracted. To deal with the problem of concept drift, we equip ICBC with an inherent incremental learning mechanism. With this capability, ICBC not only can personalize users' email filtering preferences, but can also adjust itself over time to changing environments, and thus improve classification accuracy under concept drift in a fast and low-cost fashion.

The rest of this paper is organized as follows. Related work on the special natures of spam, adaptive learning, cluster analysis, and spam filtering is reviewed in Section 2. The details of the ICBC method are elaborated in Section 3, followed by the presentation of the experiments and results for evaluating the effectiveness of ICBC in Section 4. Finally, conclusions and future work are discussed in Section 5.

2. Literature review

2.1. Special spam natures

Spam emails often exhibit the phenomenon of skewed class distributions; namely, we receive far more spam than legitimate emails in our daily life. In such cases, where the class distributions of the data are highly skewed, typical classifiers have difficulty correctly predicting the classes with few data (the minority classes) (Monard & Batista, 2002). The minority classes, however, may deserve even more attention, as in customer churn, credit card fraud, insurance fraud, and rare disorders. The cost of not identifying these minority classes correctly is often prohibitive.
Typical approaches to the problem of skewed class distributions fall into three categories: under-sampling, which reduces the majority data (Hart, 1968); over-sampling, which expands the minority data (Honda, Motizuki, Ho, & Okumura, 1997); and multi-classifier committees (Chan, Fan, Prodromidis, & Stolfo, 1999). The under-sampling approach decreases the data of the majority class, whereas the over-sampling approach increases the data of the minority class. The former may be criticized for not fully using all the data, and the latter may introduce noise. In contrast, the multi-classifier committee approach proportionately partitions the data into several subsets and generates multiple classifiers by training on the individual subsets. The final prediction is an aggregate of the outputs of all classifiers.

Concept drift, on the other hand, refers to concepts that vary over time. A classifier predicts the class of an unknown example using the rules (boundaries) induced from training examples to discriminate the classes. If the learned concepts (the discrimination boundaries) do not change, the classifier's performance remains about the same. When the concepts drift over time, however, the original classifier cannot perform well unless it keeps learning the emerging concepts and modifying its learned structure accordingly. Spam is a good example of concept drift, as users' preferences change frequently and the topics of spam also follow contemporary trends. Based on the extent of the drift, three kinds of concept drift are usually discussed in the literature: sudden drift, moderate drift, and slow drift (Stanley, 2003). Based on the class distributions, on the other hand, Forman (2006) classified concept drift into three types: shifting class distribution, shifting subclass distribution, and fickle concept drift. To deal with concept drift, a classifier should possess an incremental learning capability that adapts it to environmental changes.
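To make the two resampling strategies above concrete, the following minimal Python sketch balances an artificially skewed email collection by random under- and over-sampling. The class sizes (200 spam, 20 legitimate) are illustrative assumptions, not data from the cited studies.

    import random

    random.seed(0)
    spam = [f"spam_{i}" for i in range(200)]    # majority class (illustrative)
    legit = [f"legit_{i}" for i in range(20)]   # minority class (illustrative)

    # Under-sampling: shrink the majority class down to the minority size.
    under_spam = random.sample(spam, len(legit))

    # Over-sampling: draw (duplicate) minority examples up to the majority size.
    over_legit = [random.choice(legit) for _ in range(len(spam))]

    print(len(under_spam), len(over_legit))     # 20 200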
2.2. Adaptive learning in text categorization

To improve the accuracy of a classifier, Wu, Pang, and Liu (2002) proposed a refinement approach to solve the problem of misfit between data and models. Their method did not modify the classification algorithm itself; instead, it employed an incremental refinement procedure to deal with the model misfit problem. The concept is illustrated in Fig. 1.

Fig. 1. Concept illustration of the refinement approach (Source: Wu et al., p. 210).

It is an iterative process with two major steps. The first step is to train the classifier on a given data set. Since the classification result based on the trained classifier is usually not perfect, the second step is to split the given data set based on the classifier's current prediction results. The process then iterates back to the first step to train a corresponding classifier for each split, and repeats until a stopping criterion is met (the F-measure in this case). Note that in this way the data set in each split becomes purer and purer, which implies that the concept represented at a lower node is more consistent.

ICBC is motivated by Wu et al.'s work, which employed an adaptive refinement procedure to partition the data into sub-clusters until the concept represented within a cluster is consistent. Unlike their computationally expensive repeated training of classifiers, however, ICBC partitions the data simply by using clustering techniques.

2.3. Cluster analysis

Clustering techniques can usually be divided into two categories: partitional and hierarchical (Jain, Murty, & Flynn, 1999). Partitional clustering usually employs a measure (e.g. squared error) and attempts to optimize it (e.g. minimize the squared error). The procedure is iterated until a stopping criterion is met. Hierarchical clustering can be further divided into agglomerative and divisive techniques. In agglomerative clustering, each data point is initially viewed as a cluster; clusters are then gradually merged according to their similarities (the minimal distance criterion). Divisive clustering, on the other hand, starts from one big cluster containing all the data points and iteratively separates the cluster(s). The measures employed to calculate the distance between clusters include the single, complete, and average links.

(1) Single link: the distance between two clusters is the minimal distance between them, i.e., the distance between the two closest data points in the respective clusters.
(2) Complete link: the distance between two clusters is the maximal distance between them, i.e., the distance between the two farthest data points in the respective clusters.
(3) Average link: the distance between two clusters is the mean distance over all possible pairs of data points in the respective clusters.
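The following minimal sketch illustrates agglomerative clustering under the three linkage criteria with SciPy. The small document-term matrix and the cut-off distance are illustrative assumptions, not the data or implementation used in this paper.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import pdist

    # Four documents described by three term weights (illustrative values).
    X = np.array([[1.0, 0.0, 0.2],
                  [0.9, 0.1, 0.0],
                  [0.0, 1.0, 0.8],
                  [0.1, 0.9, 1.0]])

    dists = pdist(X, metric="euclidean")        # pairwise document distances

    for method in ("single", "complete", "average"):
        Z = linkage(dists, method=method)       # agglomerative merge tree
        labels = fcluster(Z, t=1.0, criterion="distance")  # cut the tree at distance 1.0
        print(method, labels)                   # cluster label of each document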
2.4. Spam filtering research

With text-mining techniques adopted, a spam filter is usually established through the following steps. First, users manually classify their emails into legitimate and spam. A feature vector (of keywords) is then formed based on TFIDF (term frequency × inverse document frequency) or one of its variants. Finally, a classifier is constructed using a classification algorithm that makes use of the feature vector to best discriminate the two categories (legitimate and spam).

Payne and Edwards (1997) employed a rule-based induction method to learn classification rules automatically. Their system, called "Magi", was a kind of agent program. It learned how to deal with new incoming emails by observing its user's behaviors (e.g. forwarding, deleting, and storing) toward received emails. Magi used two thresholds, a predictive threshold and a confidence threshold, as the criteria for firing different levels of delegated actions. The predictive threshold determined whether Magi should offer its suggested action to the user for reference. The confidence threshold, set at a higher level, allowed the system to execute the action automatically according to its prediction (e.g. forwarding, deleting, or storing).

The Bayes probabilistic model is another method commonly used in document classification (Lian, 2002). Sahami, Dumais, Heckerman, and Horvitz (1998) used Naïve Bayes to analyze spam. Androutsopoulos, Koutsias, Chandrinos, and Spyropoulos (2000) employed Naïve Bayes to construct email classifiers for spam filtering. Owing to their simple formulation and satisfactory performance, Naïve Bayes classifiers are often used as a benchmark for comparison (Zhang & Oles, 2001). Naïve Bayes, however, assumes that each document can belong to only one category (a one-to-one correspondence) and that the occurrences of individual terms are mutually independent (Nigam, McCallum, Thrun, & Mitchell, 2000), which is not realistic in real-world contexts. Recent research on Bayesian networks therefore tends to relax these assumptions.

In addition, Drucker, Wu, and Vapnik (1999) employed the support vector machine (SVM) to filter spam. Their results were compared with those of Ripper, Rocchio, and a boosting decision tree. They found that with binary features for emails, SVM performed best. Overall, SVM was comparable to the boosting decision tree but required shorter computational time. Luo and Zincir-Heywood (2005) designed an SOM-based sequence analysis (SBSA) system for spam filtering. The first part of SBSA represented documents with a two-level hierarchical SOM architecture, and the second part employed a sequence-based k nearest neighbor (KNN) classifier for classification. SBSA was compared with a Naïve Bayes classifier and showed superior performance.

Delany and Cunningham (2006) proposed a spam filter system, ECUE, which employed an instance-based learning technique to track concept drift. By learning from misclassified emails (either spam or legitimate), ECUE could personalize to the specifics of users' preferences and adapt to the changing nature of spam. However, with its adoption of the k nearest neighbor algorithm, ECUE had to re-perform feature selection and rebuild the instance base every time the learning mechanism was triggered by a misclassified email. Our proposed approach, ICBC, is comparable to ECUE with respect to concept drift, with one major exception: ICBC changes its cluster structure incrementally rather than rebuilding the instance base every time learning is triggered.
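For reference, the Naïve Bayes benchmark discussed above can be sketched in a few lines with scikit-learn. The toy corpus and the use of scikit-learn (rather than the implementations of the cited studies) are assumptions made purely for illustration.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB

    train_mails = ["cheap meds buy now", "meeting agenda attached",
                   "win money fast", "project schedule update"]
    train_labels = ["spam", "legitimate", "spam", "legitimate"]

    vec = TfidfVectorizer()                      # TFIDF feature vectors
    clf = MultinomialNB()                        # Naive Bayes classifier
    clf.fit(vec.fit_transform(train_mails), train_labels)

    print(clf.predict(vec.transform(["buy cheap meds"])))   # expected: ['spam']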
3. Proposed approach

The main purpose of ICBC is to handle the problems of skewed class distributions and of concept drift. It includes two phases, each of which deals with one of these two problems. The two phases, the classifier training phase and the incremental learning phase, are described as follows.

3.1. Classifier training phase

The class-skewed problem is troublesome because classes with more data (majority classes) usually dominate those with less data (minority classes). The consequence is that classifiers give more weight to the majority classes to maintain higher prediction accuracy. To relieve this problem, one should find a way to give comparable weight to different classes while maintaining reasonable prediction accuracy. By inspecting the procedure of the document classification task, we observe that the particular step that lets the minority classes be dominated is feature selection. Feature selection, commonly based on a variant of TFIDF (see, for example, Lee & Lee, 2006), extracts a keyword vector that can discriminate the different classes. Since it is performed over the whole training document set, keywords are more likely to be selected from the majority classes than from the minority classes.

To overcome this problem, we propose ICBC, which is based on cluster analysis (as illustrated in Fig. 2). For each given class, ICBC groups the emails into several clusters (sub-concepts) and extracts an equal number of keywords from each cluster. When a new email arrives, ICBC computes the distance between the new email and the keyword vector extracted from each cluster, and assigns it to the cluster with the minimal distance. By doing so, ICBC avoids the problem that keywords in the minority classes are usually too insignificant to be selected, which causes documents from minority classes to be misclassified into majority classes.

Fig. 2. The conceptual comparison of feature selection between TFIDF and ICBC.

The purpose of the classifier training phase is to resolve the problem of skewed class distributions. The details of this phase are shown in Fig. 3. Like instance-based learning, the cluster-based ICBC simply performs the feature selection task in this phase without training a classification model. ICBC first employs a stop word list (McCallum, 2002) and the Porter stemming algorithm (Porter, 1980) to remove possible noise and reduce the dimensionality of the feature space. For each given class, ICBC applies TFIDF to extract keywords for clustering purposes. The clustering method adopted here is hierarchical clustering with the complete-linkage distance measure. Hierarchical, rather than partitional, clustering is used because the latter suffers from instability caused by different initial cluster centers and an unknown number of clusters. Hierarchical clustering also contributes to incremental learning, as explained in the second phase. After the clustering is done, we re-extract an equal number of keywords from each cluster. The criterion for extracting keywords at this point is TF without IDF, since the aim is to find features that represent the cluster rather than features with high discrimination ability. Finally, a feature vector is determined for each cluster.

Fig. 3. The procedure of training (stop word removal, Porter stemming, keyword extraction, clustering, and extraction of keywords from each formed cluster).

To classify a new email, ICBC adopts the similarity comparison procedure shown in Fig. 4. The key issue is to select the cluster that best matches the new email and to assign the email the class to which the matching cluster belongs. This can be viewed as using unsupervised learning to facilitate the supervised learning task. The similarity is calculated by comparing the keyword vector of each cluster with the new email: the more keywords that match, the more similar the email is to the cluster. Finally, the cluster with the largest number of matched keywords is selected as the most similar cluster, and the class of the new email is determined accordingly.

Fig. 4. The procedure to classify an unlabeled email (compute the similarity of the new email to each cluster in both categories, find the most similar cluster, and label the email with that cluster's class).
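The following sketch outlines the training and classification steps of this phase under simplifying assumptions: plain token counts stand in for the TF weighting, stop word removal and Porter stemming are omitted, and the number of clusters per class (K_CLUSTERS) and of keywords per cluster (K_KEYWORDS) are illustrative parameters rather than values prescribed by ICBC.

    from collections import Counter
    from scipy.cluster.hierarchy import linkage, fcluster
    from sklearn.feature_extraction.text import TfidfVectorizer

    K_CLUSTERS, K_KEYWORDS = 2, 3                # illustrative parameters

    def train_icbc(emails_by_class):
        """Return {class label: [keyword set of each cluster]}."""
        model = {}
        for label, docs in emails_by_class.items():
            tfidf = TfidfVectorizer().fit_transform(docs).toarray()
            Z = linkage(tfidf, method="complete")              # hierarchical clustering
            groups = fcluster(Z, t=K_CLUSTERS, criterion="maxclust")
            clusters = []
            for g in set(groups):
                members = [docs[i] for i in range(len(docs)) if groups[i] == g]
                tf = Counter(w for d in members for w in d.split())   # TF only, no IDF
                clusters.append({w for w, _ in tf.most_common(K_KEYWORDS)})
            model[label] = clusters
        return model

    def classify(model, email):
        """Assign the class of the cluster with the most matched keywords."""
        words = set(email.split())
        label, _ = max(((lab, kw) for lab, cls in model.items() for kw in cls),
                       key=lambda pair: len(words & pair[1]))
        return label

    # Usage: model = train_icbc({"spam": spam_docs, "legitimate": legit_docs})
    #        classify(model, "win easy money now")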
3.2. Incremental learning phase

The purpose of the incremental learning phase is to give ICBC the ability to learn incrementally. ICBC automatically triggers the incremental learning mechanism when it detects topic changes in spam (or legitimate emails), or when users correct misclassified emails. In the former situation, ICBC adapts itself by modifying the existing classification knowledge accordingly; in the latter situation, ICBC quickly learns to avoid the same mistakes and personalizes the user's email filtering preferences.

The incremental learning ability of ICBC lies in its cluster structure of keywords. As new emails arrive, the structure can be adjusted to accommodate the new information. The procedure of this phase is illustrated in Fig. 5. Using the same classification scheme for a new email as in the previous phase, incremental learning occurs when ICBC detects changes in its concepts (sub-clusters), or when users correct misclassified emails.

Fig. 5. The procedure of incremental learning in ICBC.

When we classify a new email and store it in the corresponding cluster, we first consider whether the topic change in spam or legitimate emails is significant by checking the split condition of the cluster. If the coherence of the cluster no longer holds, the cluster is automatically split into two or more sub-clusters, which denotes the generation of new topics. Fig. 6 illustrates the splitting process. Whether it is the original cluster or the split clusters, we re-extract the keyword vector of the cluster(s) to adapt to the changes. In addition, at most one cluster is split at a time, since we address the slow concept drift situation (Stanley, 2003) in incremental learning.

Fig. 6. The illustration of the cluster splitting process.

Furthermore, users may trigger incremental learning by correcting the classification results of new emails. When such a correction occurs, the misclassified email is taken out of the cluster where it was stored and placed into the best-matching cluster in the correct class. Again, for the matching cluster, we need to decide whether it should be split. Keywords of the misplaced cluster and of the matching cluster (or the split clusters) are re-extracted to adapt to the changes. Note that ICBC changes the cluster structure before users correct the classification results. The main reason is that users usually process their emails in batch mode rather than one by one. For efficiency, ICBC therefore automatically activates possible structure changes before the users' correction actions.
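A minimal sketch of the incremental step is given below. The coherence test (mean pairwise cosine similarity against a fixed threshold) and the two-way split are our own illustrative assumptions; the paper states only that a cluster is split when its coherence no longer holds and that keywords are then re-extracted from the affected cluster(s).

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from sklearn.metrics.pairwise import cosine_similarity

    COHERENCE_THRESHOLD = 0.3                    # illustrative value

    def add_and_maybe_split(cluster_vectors, new_vector):
        """Add a new email vector to its cluster; split the cluster in two
        if it becomes incoherent. Returns a list of one or two clusters
        whose keywords would then be re-extracted."""
        vectors = np.vstack([cluster_vectors, new_vector])
        sims = cosine_similarity(vectors)
        mean_sim = sims[np.triu_indices_from(sims, k=1)].mean()
        if mean_sim >= COHERENCE_THRESHOLD:
            return [vectors]                     # still coherent: no split
        # Otherwise split the cluster into two sub-clusters (complete linkage).
        Z = linkage(vectors, method="complete")
        labels = fcluster(Z, t=2, criterion="maxclust")
        return [vectors[labels == g] for g in (1, 2)]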
4. Experiments and results

In this section we conduct three experiments to illustrate the feasibility of ICBC. We first explain our experimental design and then detail the three experiments.

4.1. Experimental design

We design three experiments to evaluate ICBC. Their main objectives are listed in Table 1.

Table 1
Objectives of the experiments
Experiment I:   To examine the performance of ICBC under the condition of no significant class-skewed problem and no concept drift
Experiment II:  To demonstrate the performance of ICBC under skewed class distributions
Experiment III: To justify the incremental learning capability of ICBC when a user's preferences change

4.1.1. Data collection

To conduct the experiments, two data sets are collected. The first data set, used for Experiments I and II, comes from Spamassassin's non-spam email dataset (http://spamassassin.apache.org/publiccorpus) and E.M. Canada's spam email dataset (http://www.em.ca/~bruceg/spam). The non-spam emails from Spamassassin, a total of 8624 records, serve as legitimate emails. The emails from E.M. Canada, covering January 2002 to December 2003 and totaling 15,213 records, serve as spam. The second data set, used for Experiment III, is from Yahoo newsgroups. Six subjects (eight sub-subjects), dated from March 15, 2005 to March 21, 2005, are collected; their statistics are shown in Table 2.

Table 2
Statistics of collected data in Experiment III
Subject         Sub-subject     Number
Sport           Basketball      50
                Football        20
Business        Economy         50
                Stock market    60
Entertainment   Movie           60
Health          Weight loss     20
Technology      Macintosh       20
Weather         Weather         20

4.1.2. Performance measure

To evaluate the performance of the proposed classifier, we employ measures widely used in email classification. A confusion matrix is formed according to the match between the classes predicted by the classifier and the actual classes (Table 3). In this matrix, A is the number of legitimate emails correctly classified by the classifier; D is the number of spam emails correctly rejected; B is the number of spam emails incorrectly classified as legitimate; and C is the number of legitimate emails incorrectly rejected as spam.

Table 3
A confusion matrix in email classification
                        Actual legitimate    Actual spam
Predicted legitimate    A                    B
Predicted spam          C                    D

The common evaluation measures include precision, recall, F-measure, and accuracy. Their definitions are as follows:

Positive Precision = A / (A + B)                (1)
Negative Precision = D / (C + D)                (2)
Positive Recall = A / (A + C)                   (3)
Negative Recall = D / (B + D)                   (4)
F-measure = 2pr / (p + r)                       (5)
Accuracy = (A + D) / (A + B + C + D)            (6)

Positive precision is the percentage of truly legitimate emails among the predicted legitimate emails. Negative precision is the percentage of true spam among the predicted spam. Positive recall is the percentage of the truly legitimate emails that are correctly predicted by the classifier. Negative recall is the percentage of the true spam emails that are correctly predicted by the classifier. The F-measure is the harmonic mean of (positive or negative) precision p and recall r. Accuracy is the percentage of all emails that are correctly classified.

4.1.3. Benchmark comparison

In the experiments, we employ the performance of the k nearest neighbors (KNN) approach (adopted in ECUE; Delany & Cunningham, 2006) as the benchmark for comparison. KNN is a typical instance-based learning method: it simply performs feature selection in the training phase without actually training a classification model. At test time, it employs distance (dissimilarity) or similarity measures to find the k instances in the training set most similar to the test data, and then aggregates the results of these k instances into the prediction for the test data.

With no explicit knowledge structure built during training, KNN is only marginally influenced by the class-skewed problem. On the other hand, KNN has no incremental learning mechanism. Whenever new instances arrive, it simply stores them as training data and re-performs the feature selection task. At test time, it performs similarity matching of the test data against the entire training set. The computational cost of re-training, i.e., the repeated feature selection, is thus expensive.
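As a worked example of Eqs. (1)-(6), the following snippet computes the measures from a hypothetical confusion matrix; the counts A, B, C, and D are invented for illustration and are not results from our experiments.

    A, B, C, D = 90, 5, 10, 95     # see Table 3 for the meaning of each cell

    pos_precision = A / (A + B)                        # Eq. (1)
    neg_precision = D / (C + D)                        # Eq. (2)
    pos_recall = A / (A + C)                           # Eq. (3)
    neg_recall = D / (B + D)                           # Eq. (4)
    pos_f = 2 * pos_precision * pos_recall / (pos_precision + pos_recall)   # Eq. (5)
    accuracy = (A + D) / (A + B + C + D)               # Eq. (6)

    print(round(pos_precision, 3), round(pos_recall, 3),
          round(pos_f, 3), round(accuracy, 3))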
4.2. Experiment I

In Experiment I, under the condition of no significant skewed class distribution and no concept drift, we examine ICBC's performance using the whole data collected from the Spamassassin and E.M. Canada datasets. We randomly select 20% of the Spamassassin dataset and 20% of the E.M. Canada dataset as the training set. The remaining 80% of each dataset serves as test data. The result is shown in Table 4.

Table 4
The classification result in Experiment I
                    Precision   Recall    F-measure   Accuracy
ICBC  Legitimate    0.93177     0.92302   0.92738     0.94730
      Spam          0.95762     0.96100   0.95931
KNN   Legitimate    0.97538     0.92461   0.94931     0.96416
      Spam          0.95927     0.98657   0.97273

From Table 4, we first observe that both ICBC and KNN perform well with the large number of data employed. Furthermore, ICBC performs slightly worse than KNN. The reason is that ICBC calculates distances by comparing the test data with the feature vector of each cluster, while KNN compares the test data with all of the training data. KNN therefore achieves more accurate performance by enumerative comparison.

Nevertheless, KNN suffers from the slowness of this enumerative comparison: the time to classify a new instance grows with the number of instances KNN stores. To illustrate, we perform a simple experiment that increases the number of training instances from 10 to 100 in increments of 10, fixes the number of test instances at 200, and compares the computational time of ICBC with that of KNN. The result is shown in Fig. 7.

Fig. 7. The comparison of computational time (in seconds) between ICBC and KNN as the number of training instances grows.

From Fig. 7 it is obvious that the time for KNN grows as the number of training instances increases. In contrast, the time for ICBC remains essentially constant. ICBC thus largely removes the influence of the number of training instances on prediction speed.

4.3. Experiment II

Experiment II examines how ICBC deals with skewed class distributions. Traditional classifiers tend to favor the majority classes, resulting in poor recall rates for the minority classes. This problem may not even be apparent if one uses the overall prediction accuracy as the performance measure, since errors in the minority classes have little impact on the overall prediction accuracy.
To better account for the class-skewed problem, we employ the positive recall as the evaluation criterion in this experiment. To deliberately create significantly skewed class distributions, we construct five training data sets, each consisting of 200 spam emails and 10, 20, 30, 40, or 50 legitimate emails. A test set consisting of another 50 legitimate emails is used to evaluate the performance in each situation. The result is shown in Fig. 8.

Fig. 8. The result of classification under skewed class distributions (positive recall of ICBC and KNN versus the number of legitimate training emails).

Overall, the positive recall rates of both ICBC and KNN deteriorate as the class distributions become more skewed. More importantly, ICBC outperforms KNN in all five cases. In principle, KNN should not be influenced by skewed class distributions; however, due to the global feature selection mechanism in text categorization, KNN still cannot avoid the feature selection problem for the minority classes. In contrast, ICBC selects an equal number of features within each cluster. This result illustrates the effectiveness of ICBC's clustering strategy in dealing with skewed class distributions.

4.4. Experiment III

The objective of Experiment III is to evaluate the incremental learning capability of ICBC. We therefore deliberately create a scenario with significant changes of a user's preferences, a typical example of concept drift. The scenario is as follows. Suppose that an online user subscribes to Yahoo newsgroups and receives electronic news accordingly. At first, four subjects of news (Weather, Football, Weight Loss, and Macintosh) are sent to the user, who considers emails of the former two subjects legitimate and of the latter two spam. An initial classifier is trained according to the user's original preferences. Gradually, two further subjects of news (Movie and Stock Market) are also sent to the user, who begins to read the emails of the Movie subject (legitimate) but still ignores the emails of the Stock Market subject (spam). In this case, the classifier should be able to adapt itself to the new situation and make correct predictions. Following this description, the data in this experiment are arranged as shown in Table 5.

Table 5
Statistics of data manipulated in Experiment III
Class         Subject         Number of emails
Legitimate    Weather         20
              Football        20
Spam          Weight loss     20
              Macintosh       20
Incremental learning test data
Legitimate    Movie           30
Spam          Stock market    30
Incremental learning validation data
Legitimate    Movie           30
Spam          Stock market    30

The procedure of this experiment is further explained as follows:

(1) The initial classifier is built from a training data set that consists of 20 emails of the Weather subject and 20 emails of the Football subject as legitimate emails, and 20 emails of the Weight Loss subject and 20 emails of the Macintosh subject as spam.
(2) Movie and Stock are selected as the two subjects that emerge gradually, with emails of the Movie subject as the new legitimate emails and emails of the Stock subject as the new spam.

(3) To examine the effects of incremental learning, we feed 30 emails of the Movie subject and 30 emails of the Stock subject into the classifier as incremental learning test data, in increments of 5 emails at a time. We hold out the remaining 30 emails of the Movie subject and 30 emails of the Stock subject as validation data to evaluate the adapted classifier.

The incremental learning results (recall rates) of ICBC and KNN are shown in Table 6.

Table 6
Recall rates from ICBC and KNN under incremental learning
Size of test set   ICBC: Movie   ICBC: Stock   ICBC: F-measure   KNN: Movie   KNN: Stock   KNN: F-measure
0 (a)              0.64000       0.74286       0.69143           0.46154      0.74074      0.60114
5                  0.92857       0.93750       0.93304           0.70833      0.93548      0.82191
10                 0.96774       0.96552       0.96663           0.93750      0.92857      0.93304
15                 0.96774       0.96552       0.96663           0.92308      0.90909      0.91608
20                 0.96774       0.96552       0.96663           0.85714      0.80000      0.82857
25                 0.96774       0.96552       0.96663           0.88235      0.84615      0.86425
30                 0.95082       0.94915       0.94999           0.85714      0.80000      0.82857
(a) Represents the initial framework without incremental learning.

We observe that ICBC does not detect the concept drift at first, because few test data have arrived. Nonetheless, it quickly adapts to the concept drift as more test data arrive, and improves its recall rates accordingly. In contrast, KNN does learn to adapt to the new situation, but in a much slower and less stable manner.

We further plot the F-measure results of ICBC and KNN in Fig. 9 for visual comparison. It is clear that ICBC outperforms KNN in the incremental learning process. In addition, KNN uses all the available data to re-perform the feature selection task each time, while ICBC performs incremental learning and selects features only within the clusters that have changed, thus reducing the cost of re-training.

Fig. 9. The F-measure comparison between ICBC and KNN over the incremental learning test sets.

5. Concluding remarks

With the prevalence of the Internet, email has become an indispensable part of our daily communication. Because of the prevalence of spam, however, how to detect and filter spam has become an essential issue. In this study, we propose an incremental cluster-based classification approach, ICBC, to improve the effectiveness of spam filtering.

ICBC is a classification method based on cluster analysis. It first clusters each class into several sub-concepts and extracts an equal number of representative features (keywords) from each sub-concept. With this procedure, concepts in the minority classes have as many features as those in the majority classes, which lessens the class-skewed problem. At the same time, the cluster structure of ICBC can be changed adaptively, which gives ICBC the capability of incremental learning. ICBC can accommodate changes in a quick and low-cost manner and relieve the problem of concept drift.

Three experiments are conducted to investigate the performance of ICBC. The result of Experiment I shows that, when the problem of skewed class distributions is not critical, the performance of ICBC is comparable to that of KNN while its computational time is far less. In Experiment II, ICBC outperforms KNN when the skewness of the classes is critical, showing that ICBC can deal with skewed class distributions effectively. In addition, the result of Experiment III shows that the incremental learning of ICBC can accommodate concept drift and thus reduce the prediction error.
In contrast, KNN, without incremental learning, adapts slowly, performs unstably, and incurs a high re-training cost.

To continue this research, we consider several potential directions for future work. First, the initial cluster structure of ICBC has a critical impact on the classification accuracy of its later modifications. A future research direction is to consider approaches that can improve the initial cluster structure; WordNet or an ontology, for example, may be used to reinforce the links between concepts and yield a more accurate initial clustering. Second, ICBC relies solely on the content analysis of emails without considering other useful information. A hybrid approach that combines content analysis with other useful email information (e.g. headers) may improve the effectiveness of spam filtering. Finally, it is worth implementing ICBC as a spam filtering system applicable in the real world.

References

Androutsopoulos, I., Koutsias, J., Chandrinos, K. V., & Spyropoulos, D. (2000). Learning to filter spam e-mail: A comparison of a naive Bayesian and a memory-based approach. In The 4th PKDD workshop on machine learning and textual information access.
Chan, P. K., Fan, W., Prodromidis, A. L., & Stolfo, S. J. (1999). Distributed data mining in credit card fraud detection. IEEE Intelligent Systems, 14(6), 67–74.
Delany, S. J., & Cunningham, P. (2006). ECUE: A spam filter that uses machine learning to track concept drift. Technical Report, Computer Science Department, The University of Dublin. Available from https://www.cs.tcd.ie/publications/tech-reports/reports.06/TCD-CS-2006-05.pdf.
Drucker, H., Wu, D., & Vapnik, V. N. (1999). Support vector machines for spam categorization. IEEE Transactions on Neural Networks, 10(5), 1048–1054.
Fawcett, T. (2003). In vivo spam filtering: A challenge problem for KDD. ACM SIGKDD Explorations Newsletter, 5(2), 140–148.
Forman, G. (2006). Tackling concept drift by temporal inductive transfer. In Proceedings of SIGIR '06. ACM.
Hart, P. E. (1968). The condensed nearest neighbor rule. IEEE Transactions on Information Theory, IT-14, 515–516.
Honda, T., Motizuki, H., Ho, T. B., & Okumura, M. (1997). Generating decision trees from an unbalanced data set. In M. van Someren & G. Widmer (Eds.), Poster papers presented at the 9th European conference on machine learning (ECML) (pp. 68–77).
Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data clustering: A review. ACM Computing Surveys, 31(3), 264–323.
Lee, C., & Lee, G. G. (2006). Information gain and divergence-based feature selection for machine learning-based text categorization. Information Processing and Management, 42(1), 155–165.
Lian, Y. (2002). E-mail filtering. Master thesis, University of Sheffield, Department of Advanced Software Engineering.
Luo, X., & Zincir-Heywood, N. (2005). Comparison of a SOM based sequence analysis system and naive Bayesian classifier for spam filtering. In Proceedings of the IEEE international joint conference on neural networks (IJCNN '05) (Vol. 4, pp. 2571–2576).
McCallum, A. (2002). Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. Available from http://www.cs.cmu.edu/~mccallum/bow/.
Monard, M. C., & Batista, G. E. A. P. A. (2002). Learning with skewed class distributions. In Advances in logic, artificial intelligence and robotics (LAPTEC '02).
Nigam, K., McCallum, A. K., Thrun, S., & Mitchell, T. (2000). Text classification from labeled and unlabeled documents using EM. Machine Learning, 39, 103–134.
Payne, T. R., & Edwards, P. (1997). Interface agents that learn: An investigation of learning issues in a mail agent interface. Applied Artificial Intelligence, 1–32.
Porter, M. (1980). An algorithm for suffix stripping. Program: Automated Library and Information Systems, 14(3), 130–137.
Sahami, M., Dumais, S., Heckerman, D., & Horvitz, E. (1998). A Bayesian approach to filtering junk e-mail. In AAAI-98 workshop on learning for text categorization.
Stanley, K. O. (2003). Learning concept drift with a committee of decision trees. Department of Computer Sciences Technical Report AI-03-302.
Wu, H., Pang, T. H., Liu, B., & Li, X. (2002). A refinement approach to handling model misfit in text categorization. In Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining (pp. 23–26).
Zhang, T., & Oles, F. J. (2001). Text categorization based on regularized linear classification methods. Information Retrieval, 4(1), 5–31.