()


Appl. Math. Inf. Sci. 9, No. 4, 1709-1718 (2015) 1709

Applied Mathematics & Information Sciences
An International Journal

http://dx.doi.org/10.12785/amis/090407

Clustering of Telecommunications User Profiles for
Fraud Detection and Security Enhancement in Large
Corporate Networks: A case Study

Constantinos S. Hilas1,∗, Paris A. Mastorocostas1 and Ioannis T. Rekanos2

1 Dept. of Informatics Engineering, Technological Educational Institute of Central Macedonia, GR-62124, Serres, Greece
2 Dept of Electrical and Electronics Engineering, School of Engineering, Aristotle University of Thessaloniki, GR-54124, Thessaloniki,

Greece

Received: 8 Oct. 2014, Revised: 8 Jan. 2015, Accepted: 9 Jan. 2015
Published online: 1 Jul. 2015

Abstract: A user’s transactions with modern networks and services produce a vast amount of user related data. The byproduct of
every phone call a person makes or every web page one visits is translated into a log record with usage data. By studying these log
records, the user’s behavior is revealed and one may come up with clues about user preferences, identify security issues, or discover
fraudulent use of the network or the service one provides. Thus, the modeling of network users’ behavior may serve as an invaluable
tool for the IT manager. In this paper, many of these issues are discussed and emphasis is given on the construction of appropriate
user behavior representation in telecommunications. As an example, the application of two clustering techniques is presented, with the
task to identify appropriate user behavior representations (profiles) inside a large organization’s telecommunications network, in order
to spot fraudulent usage. Through this study a researcher and/or the organization’s network manager may gain more insight into the
problems of user profiling and fraud detection.

Keywords: User profiling, clustering applications, data mining, fraud detection, telecommunications

1 Introduction

A user profile is a collection of personal data associated
to a specific user. Ideally a user’s profile would be an
explicit digital representation of that person. However,
one may have several different profiles depending on the
system with which he interacts. For example, from a
personal computer’s perspective a user profile is a
collection of settings that make the computer look and
work the way you want it to. For an e-commerce system,
e.g. an electronic book store, a user’s profile may include
identity information, like login credentials; billing
information, like credit card data; and more importantly, a
user’s preferences as regards literature genres he prefers,
recent purchases, etc. Some of this knowledge was
explicitly entered to the system by the user, while other
was implied by the system after analyzing past
transactions. In this sense a profile emerges on the basis
of monitoring ones usage patterns, so it is only relevant to
a user’s attitude against a specific service. Of course,

these data may be correlated with data from other
databases and yield a more specific inference about one’s
preferences. User profiles may be found in operating
systems, e-commerce applications, social networking
sites, recommender systems, intelligent tutoring systems,
etc.

The main idea behind user profiling is that the past
behavior of a user can be accumulated in order to
construct a profile, or a “user dictionary”, or a “user
signature” of what might be the expected values of a
user’s behavior. In its simplest form this profile is a vector
that contains single numerical summaries of some aspect
of behavior or some kind of multivariate behavioral
pattern. A profile may also contain categorical, censored,
or other non-numeric data.

The profile follows the logic of the recordable data,
with the constraints inherent in computer technology.
This issue, the inherently reductive character of a profile,
is important because profiles may impact privacy and
identity in the strong sense (concerning our sense of self).

∗ Corresponding author e-mail: chilas@teicm.gr

c© 2015 NSP
Natural Sciences Publishing Cor.

http://dx.doi.org/10.12785/amis/090407


1710 C. Hilas et. al. : Clustering of Telecommunications User Profiles for Fraud...

Since profiles will often affect our lives (providing or
prohibiting access, enabling selection, inclusion and
exclusion) it is of utmost importance to clarify in what
ways and on what basis they affect our lives, without
conflating profile and profiled person [1]. This remark
should always be kept in mind when constructing user
profiles whether this is done in order to help or protect the
user from other dangers.

After a profile is constructed, it can be used
appropriately. An operating system stores user profiles
that contain one’s settings for desktop backgrounds,
screen savers, pointer preferences, sound settings, and
other features. A recommender system uses past user
preferences in order to predict items that the user has not
yet considered, e.g. a book, a song, or even a friend [2].
Traditionally, in computer security user profiles are
constructed based on any basic usage characteristic such
as resources consumed, login location, typing rate and
counts of particular commands. Future behavior of the
user can then be compared with his profile in order to
examine the consistency with it (normal behavior) or any
deviation from his profile, which may imply a breach in
security or some fraudulent activity.

In particular, fraud detection is important to the
telecommunications industry because companies and
suppliers of telecommunications services lose a
significant proportion of their revenue as a result.
Moreover, the modeling and characterization of users’
behavior in telecommunications can be used to improve
network security, improve services, provide personalized
applications, and optimize the operation of electronic
equipment and/or communication protocols.

Several categories of telecommunications fraud have
been reported in the literature. The most prominant are
the technical fraud, the contractual fraud, the procedural
fraud, and the hacking fraud [3]. The first three usually
burden the economics of the service provider, while
hacking fraud also harms the subscriber. Hacking fraud is
usually met in the form of the superimposed fraud where
the fraudster (hacker) uses a service concurrently with the
subscriber and burdens his account. The present paper
focuses on superimposed fraud identification.

The Communications Fraud Control Association
(CFCA) recently announced the results of a global survey
carried out in 2013. Experts estimate 2013 fraud losses at
$46.3 billion (USD), up 15% from 2011. As a percent of
global telecom revenues, fraud losses are approximately
2.09%, a 0.21% increase from 2011. The main reason for
the relative increase in fraud is due to more fraudulent
activity targeting the wireless industry [4]. It should also
be noted that the relative decrease in fraud, that was
apparent in previous surveys, was not an actual decline in
absolute values but it had been attributed to the fact that
the growth in global telecom revenues had outpaced the
growth in fraud losses in the past (e.g. the CFCA 2011
survey).

Exchange of ideas in fraud detection is limited by the
fact that it makes no sense to describe the methods in

detail, as it gives fraudsters the information they require
to evade detection. Moreover, companies and
organizations that have been defrauded refrain from
revealing the situation due to reputation concerns. Adding
to this, fraud detection problems involve huge data sets,
which are constantly evolving. Data sets can be as large
as tenths of thousands of calls per weekday for a large
organization with 3 or 4 thousand employees, to hundreds
of millions of calls for national carriers. One should also
consider the size of the related metadata.

Another difficulty with fraud detection is the fact that,
nowadays, the term telecommunications is wider than
ever. It includes both wired and wireless systems, mobile
and cellular systems, legacy systems (PSTN, ISDN),
terrestrial and satellite networks, and a plethora of
Internet–based communication applications. The diversity
of network types and applications, along with the
deregulation of the market and the relocation of services
to the cloud, makes fraud detection a complex task.

Research in telecommunications fraud detection is
mainly motivated by fraudulent activities in mobile
technologies [3, 5]. Recent research also focuses on VoIP
technologies [6]. Fraud detection methods can be based
on statistical or machine learning techniques and may be
supervised or unsupervised [7, 8, 9]. Fawcett and Phua
[10] have ensembled the bibliography on the use of data
mining and machine learning methods for automatic
fraud detection up to 2005.

This paper proceeds as follows: In the next Section
the user modeling procedure and the proposed user
profiles are presented. A brief presentation of the
clustering techniques and the clustering quality statistics
that are used is given in the third Section. The user data
and the outcome of the analysis are described in the
fourth Section. In the last Section conclusions are
discussed.

2 Behavior Modeling and Profiling in
Telecommunications

The data that can be used to construct the basic profile
vector for a telecommunications user are contained in the
Call Detail Record (CDR) of any Private Branch
Exchange (PBX) or any VoIP switch. The format of the
CDR varies among providers or programs. Some
programs allow CDRs to be configured by the user. In
most cases a CDR contains at least data such as: the caller
ID, the chargeable duration of the call, the called party
ID, the date and the time of the call, etc [11]. In mobile
telephone systems, such as GSM, the data records that
contain details of every mobile phone attempt are the Toll
Tickets. Location data may also be useful. In computer
security, user profiles may be constructed based on any
basic usage characteristic such as resources consumed,
login location, typing rate and counts of particular
commands.

c© 2015 NSP
Natural Sciences Publishing Cor.


Appl. Math. Inf. Sci. 9, No. 4, 1709-1718 (2015) / www.naturalspublishing.com/Journals.asp 1711

Classifier

Y
-A

xi
s

Suspicious 
Cases

Fraud

Normal

Events

������
������
������

Call Database

Billing 
Services

Fig. 1: Both the billing and the security systems draw usage related data from a common database.

CDR data are primarily collected in order to charge
the user for the service he gets, but may additionally be
used to manage the security of the system. The study of
CDRs may reveal cases of unauthorized use of the system
which may burden both the provider and the subscriber
(Fig.1). The system administrator may investigate unusual
call patterns, multiple calls to the same number, calls
outside normal working hours, calls with long duration,
phone calls to suspicious destinations such as
international or premium-rate services (e.g. 090, 1-900,
900). Also, in the case of private PBXs, where access to
outgoing destinations is done by means of personal
authorization codes (PAC), one may check for frequent or
simultaneous use of a PAC, attempts to import
nonexistent system codes (efforts to find codes), or using
PACs that are not associated with a user [12].

In order to develop models that can be applied to
distinguish legitimate from fraudulent usage, one needs
examples of both cases. Finding data of normal usage is
easy. In fact, most of the data generated during the
operation of a system are of this type. The data relating to
fraud are relatively rare and any effort to characterize
them in detail is difficult, time consuming, and requires
specialized knowledge. Moreover, the processing and
storage of user data is subject to restrictions by the
legislation that protects privacy [13, 14].

Several methods appear in the literature for collecting
and classifying data usage. An approach that involves the
user in the process, and thus may overcome privacy
concerns, is block crediting, proposed by Fawcett and
Provost [15]. According to this method both the provider
and the subscriber are involved in characterizing the data.
If the subscriber has any objection on the amount of his
bill, he may contact the provider and collaborate in order
to characterize the calls in his detailed bill and then pay
for the corresponding price. Typically, the account is
divided into two periods, one with the calls of the
legitimate user and one that also contains calls from the
usurper. However, this kind of separation does not
provide high accuracy in the characterization of cases. On

the other hand, a detailed per phone call characterization
of the account is by nature expensive and requires too
much human intervention.

In general, fraud detection focuses on the analysis of
users’ activity and the related approaches are divided into
two main subcategories. The absolute one that searches
for limits between legal and fraudulent behavior, and the
differential approach that tries to detect extreme changes
in a user’s behavior. All cases of telecom fraud can
actually be viewed as fraud scenarios which are related to
the way the access to the network was acquired.

One of the most interesting aspects of the problem is
the evaluation of different user representations (profiles)
and their effect towards the proper discrimination between
legitimate and fraudulent activity.

In the present analysis, for each user, three different
profile types are constructed and tested. The first profile
(Profile1) is build up from the accumulated weekly
behavior of the user. The profile consists of seven fields
which are the mean and the standard deviation of the
number of calls per week (calls), the mean and the
standard deviation of the duration (dur) of calls per week,
the maximum number of calls, the maximum duration of
one call and the maximum cost of one call (Fig.2). All
maxima are computed within a week’s period.

The second profile (Profile2) is a detailed daily
behavior of a user which is constructed by separating the
number of calls per day and their corresponding duration
per day according to the called destination, i.e., national
(nat), international (int), and mobile (mob) calls, and the
time of the day, i.e., working hours (w), afternoon hours
(a), and night (n) (Fig.3).

Last, the third profile (Profile3) is an accumulated per
day behavior (Fig.4). It consists of the number of calls and
their corresponding duration separated only according to
the called destination, that is, national, international and
mobile calls.

The last two profiles were also accumulated per week
to give Profile2w and Profile3w. So, overall five different
user profile representations are evaluated.

c© 2015 NSP
Natural Sciences Publishing Cor.

www.naturalspublishing.com/Journals.asp


1712 C. Hilas et. al. : Clustering of Telecommunications User Profiles for Fraud...

mean(calls) mean(dur)std(calls) max(cost)max(dur)max(calls)std(dur)

Fig. 2: Profile1 of telephone calls

int_calls_w int_dur_w int_dur_nint_calls_nint_dur_aint_calls_a

nat_calls_w nat_dur_w nat_dur_nnat_calls_nnat_dur_anat_calls_a

mob_calls_
w

mob_dur_w mob_dur_n
mob_calls_

n
mob_dur_a

mob_calls_
a

Fig. 3: Profile2 of telephone calls

nat_calls int_durint_callsmob_durmob_callsnat_dur

Fig. 4: Profile3 of telephone calls

In the past, feed-forward neural networks have
applied to classify cases of user behavior [16]. No matter
how well a neural network classifier may have performed,
there is no clue about the features it actually used in order
to achieve its performance. So, it is difficult to identify
important characteristics that led to a successful
classification.

In order to further investigate the problem of
appropriate user modeling towards fraud detection, we
also apply clustering techniques on the data. The aim is to
test whether cases from the same class tend to form
clusters and under which condition.

An important characteristic of most clustering
methods is that they are actually unsupervised learning
approaches. Thus, one does not have to provide the
corresponding algorithms with class examples. On the
contrary, the algorithm is left to decide on case
similarities by means of a pre-selected distance
(similarity) measure [17].

It is hoped that clustering will unveil important
information on the nature of user data and on the key
features that may be used to distinguish between
legitimate and fraudulent usage.

3 Clustering Techniques and Clustering
Quality Measures

Clustering is one of the most important sets of
unsupervised learning tools among the machine learning
techniques and algorithms. It deals with finding a
structure or an intrinsic grouping in a collection of
unlabeled data. During the clustering procedure some
objects are labeled “similar” to each other and
“dissimilar” to objects belonging to other clusters. So,
clustering aims on organizing objects into groups whose
members are similar in some way. The researcher needs
to decide on the similarity criterion that will be used

during the process as well as on the clustering quality
criterion that will be applied to decide what constitutes a
good clustering. Clustering has been applied to a wide
range of topics, such as pattern recognition, compression,
and classification, as well as in diverse disciplines like
biology, marketing, psychology and business. For a
detailed introduction of clustering techniques the reader is
referred to Kaufmann and Rousseeuw [17].

Two of the most common clustering techniques are
used in the present work, namely the partitioning and the
hierarchical clustering. As a main representative of the
partitioning techniques we will apply the k-means
algorithm. The hierarchical clustering technique that will
be used is the agglomerative clustering. Their main
difference is that the first needs the user’s input on the
expected number of clusters while the latter needs no user
intervention.

3.1 K-means clustering

The main idea behind the k-means algorithm is that in
order to obtain k clusters, the algorithm selects k objects
from the data sets. The remaining objects are then
assigned to the nearest representative object, as the
algorithm attempts to minimize the average squared
distance between objects. Other distance measures can be
used as well.

A graphical representation of the clustering is
provided by displaying the silhouettes introduced by
Kaufmann and Rousseeuw [17]. Silhouettes are
constructed in the following way. Given the number of
clusters in the problem, the value s(i) is defined for each
object i and these numbers are presented in a plot. The
value s(i) is defined by:

s(i) =
b(i)− a(i)

max{a(i),b(i)}
, (1)

where a(i) is the average dissimilarity of i to all other
objects in A, where A is the cluster in which the object i is
first assigned, with d(i,C) being the average dissimilarity
of i to all objects of any other cluster C different from A.
Object i is assigned to the cluster B that satisfies b(i) =
d(i,B).

3.2 Agglomerative Clustering

During hierarchical agglomerative clustering the user
does not specify the expected number of clusters k.
Instead, the algorithm constructs a tree-like hierarchy, a
dendrogram, which implicitly contains all values of k.
The root of the tree structure defines a cluster that
contains all data, while its leafs represent n clusters, each
one containing one of the n objects. The agglomerative
clustering algorithm starts with each object representing a
cluster, called a singleton, and proceeds by fusing the

c© 2015 NSP
Natural Sciences Publishing Cor.


Appl. Math. Inf. Sci. 9, No. 4, 1709-1718 (2015) / www.naturalspublishing.com/Journals.asp 1713

closest ones until a single cluster is obtained. Therefore, a
measure of dissimilarity between two clusters must be
defined [17].

Two different distance measures, namely the Euclidean
distance and the correlation between objects, are used for
both clustering algorithms.

3.3 Clustering Quality Measures

In order to judge on the quality of the clustering structure,
we apply three quality statistics. These are the silhouette
coefficient (SC), the agglomerative coefficient (AC) and
the cophenetic coefficient (CC). Visual inspection of the
dendrograms and the consistency of their structure may
also reveal certain characteristics.

The silhouette coefficient (SC) is given by:

SC = max
k

s̄(k), (2)

where s(k) is derived from (1) and the maximum is
taken over all k for which the silhouettes can be
constructed. The SC is a useful measure of the amount of
clustering structure that has been discovered by the
classification algorithm.

Another measure, that allows us to decide whether the
clustering reveals a clear separation between groups, is
the comparison of the within–cluster dissimilarity with
the between–cluster dissimilarity. This measure in a tree
structure is equivalent to the tendency that the tree
branches get taller. If for each object i one measures its
distance, l(i), from the tree root, normalize the value in
[0-1] and compute the average value, then he has a
measure of the average tree height. This average, whose
larger values imply better clustering, is named
agglomerative coefficient (AC), and it is given by:

AC =
1
n

n

∑
i=1

l(i), (3)

The cophenetic coefficient, CC, is a measure of how
faithfully a dendrogram preserves the pair-wise distances
between the original un-modeled data points. Given the
matrix Z with the distances between clusters and the object
dissimilarity matrix Y , then CC can be used as a measure of
correlation between the object distances and their distance
from the tree root. The value of CC is given by:

CC =

∑
i, j
i< j

(

Yi j −Ŷ
)(

Zi j − Ẑ
)

√

∑
i, j
i< j

(

Yi j −Ŷ
)2

∑
i, j
i< j

(

Zi j − Ẑ
)2
, (4)

where Yi j is the distance of object i from j. CC values
closer to 1 reveal better clustering structures [18].

One last criterion that may help choose in favor of one
clustering structure over another is the consistency of the

structure. If a branch in a dendrogram is at the same
height with its neighbors, then this reveals similarity
between the corresponding objects at that level of the
structure. This behavior can be considered consistent.
Whenever a branch height significantly differs from that
of its neighbors, this behavior reveals high values of
dissimilarity between the corresponding objects and can
be named inconsistent. Inconsistent links in a dendrogram
may reveal cluster separation points. For a more detailed
analysis on cluster selection criteria the reader is referred
to [19].

4 An Application of User Profile Clustering

Our experiments are based on real data extracted from a
database that holds the CDR for a period of eight years
from an organization’s PBX. According to the
organization’s charging policy, only calls to national,
international and mobile destinations are charged. Calls to
local destinations are not charged so they are not included
in the examples. In order to properly charge users, for the
calls they place, a system of Personal Identification
Numbers (PIN) is used. Each user owns a unique PIN
which “unlocks” the organization’s telephone sets in
order to place costly outgoing calls. If anyone (e.g., a
fraudster) finds a PIN he can use it to place his own calls
from any telephone set within the organization.

Several user accounts, which have been defrauded,
have been identified. The detailed daily accounts were
examined by a field expert and each phone call was
marked as either normal or defrauded. If during a day no
fraudulent activity was present, then the whole day was
marked as normal. If at least one call from the fraudster
was present, then the whole day was marked as fraud.
This separation of the classes is named detailed
characterization. Adding to this, each day was also
marked, according to the first time that fraudulent activity
appeared. Thus, each user’s account is split into two sets,
one pre- and one post-fraud. This separation of the classes
is named coarse characterization.

In this work we use the examples of 6 users (it has
been stressed earlier that it is difficult to isolate fraud
cases), 5 profile representations, and 2 different ways to
characterize the user accounts as normal of fraudulent.
The above give 60 different data sets.

The user daily profiles will be used as an input for the
two algorithms. We will ask the k-means algorithm to
partition the input space in two distinct groups. If the
legitimate and the fraudulent behavior cases are
sufficiently different from each other, then the k-means
algorithm will provide us with two distinct clusters of
data. If this is not the case, then the division will not be
good. Then the same input data will be fed into the
agglomerative clustering algorithm and we will also
check whether there is an output with distinct cases
separation.

c© 2015 NSP
Natural Sciences Publishing Cor.

www.naturalspublishing.com/Journals.asp


1714 C. Hilas et. al. : Clustering of Telecommunications User Profiles for Fraud...

Table 1: Clustering outcome for a defrauded user account
using the five profiles and two similarity measures,
Euclidean distance (Eucl) and Correlation (Correl)

Profile Distance Correct SC AC CC
Measure Clustering

(%)
Profile1 Eucl 77.9310 0.7734 0.9632 0.9143
Profile1 Correl 84.8276 0.8729 0.9881 0.9299
Profile2 Eucl 78.0374 0.5303 0.9738 0.9514
Profile2 Correl 67.2897 0.2624 0.9952 0.9698
Profile3 Eucl 67.7570 0.8104 0.9824 0.9160
Profile3 Correl 76.1682 0.7668 0.9984 0.9725
Profile2w Eucl 78.0374 0.5264 0.9738 0.9514
Profile2w Correl 77.1028 0.3429 0.9952 0.9698
Profile3w Eucl 71.2329 0.7533 0.9748 0.9338
Profile3w Correl 77.3973 0.7296 0.9956 0.9489

Table 2: Clustering statistics for six defrauded user
accounts and two similarity measures – Profile1 has been
used in all cases

Profile Distance Correct SC AC CC
Measure Clustering

(%)
User1 euclidean 77.9310 0.7734 0.9632 0.9143

correlation 84.8276 0.8729 0.9881 0.9299
User2 euclidean 65.2582 0.9466 0.9770 0.9118

correlation 78.8732 0.7152 0.9975 0.9232
User3 euclidean 73.3154 0.9989 0.9944 0.9910

correlation 67.1159 0.5847 0.9945 0.6782
User4 euclidean 79.4979 0.8570 0.9706 0.8857

correlation 92.4686 0.6865 0.9949 0.8302
User5 euclidean 62.1367 0.5784 0.9544 0.6278

correlation 79.4149 0.8750 0.9706 0.8785
User6 euclidean 76.9031 0.7632 0.9612 0.8949

correlation 86.4668 0.7445 0.9903 0.8302

In Table 1 the outcome of the clustering procedure for
a defrauded user’s account is listed for the five profiles and
two similarity measures. These are the results for User1.
In general, the best results, i.e. the higher percentage of
correct clustering, were achieved with the use of Profile1
combined with correlation as the similarity measure.

In Table 2 the values of the statistics of Section 3.3
are presented for the cases of four defrauded user
accounts. For each user the similarity measure and the
correct clustering percentage are also given. Bold
numbers are used to stress best statistics values. In
general, correlation gave better results. The case of User3
(Table 2) is an exception that will be dealt with shortly.
Due to space limitations only the findings for Profile1 will
be depicted in the figures that follow.

First, the clustering ability of each profile is tested by
means of the k-means algorithm using the Euclidean
distance as a similarity measure. Comparison of the
clustering outcome with the original class separation
reveals that the clustering is correct by 77.9%. However,
visual observation of the silhouette diagrams (Fig.5(a))
shows the existence of negative silhouette values at the

bottom of the figure which imply wrong clustering. The
SC for the clustering in this figure is 0.7734. If the
procedure is repeated but with the correlation of the input
vectors as the similarity measure, then the outcome is
depicted in (Fig.5(b)). Now, the percentage of correct
clustering is 84.8% and the SC value is 0.8729.

In general, the k-means clustering showed that better
separation between classes is achieved when user
behavior is modeled with Profile1, where the detailed
characterization of the cases is used and correlation is
employed as the similarity measure.

Examples of the agglomerative clustering of the same
user’s behavior and the same profile (Profile1) are shown
in Fig.6. The former plot (Fig.6(a)) is a case where the
Euclidean distance is used as measure of the similarity
between objects, while in the latter (Fig.6(b)) the objects’
correlation is used instead. Fig.6 is reprinted here from
[8] with permission from Elsevier.

In all cases the analysis reveals that whenever the
correlation is used as the distance measure then one gets
the highest percentage of correct clustering. Visual
inspection of the dendrograms is also important. The
difference of correct clustering between the dendrograms
in Fig.6(a) and Fig.6(b) is only 7%. However, each one
reveals two completely different views of the data. In the
first one, one distinct cluster is formed that contains all
the outliers regardless of their class membership. In the
second, clusters are formed which include only pure fraud
cases. In this case there are three clusters that contain the
fraud cases [8]. This implies a type of mixed behavior of
the fraudster. Actually, a dendrogram structure implicitly
contains all possible cluster separations and is up to the
analyst to decide the correct number of clusters and the
point of separation.

It is interesting to study the silhouette diagrams in
conjunction with their agglomerative clustering
counterparts. Especially in the case where the Euclidean
distance is used, one may observe that the silhouette
diagram shows some kind of “bad” clustering that may
mislead the researcher, but its corresponding dendrogram
conveys a behavior that needs to be further explored. In
fact, it was an expert’s intervention, who examined all the
cases in the separate cluster and revealed that these
belong to outliers, i.e. rare cases of extreme network
usage by the legal user or the fraudster.

Fig. 7, which is reprinted here from [8] with
permission from Elsevier, shows the dendrogram for
User3 of Table 2 when the correlation is used as the
similarity measure. From Table 2 one may observe that
for the case of User3 the clustering statistics for
correlation are worst than the ones for the Euclidean
distance similarity measure. However, a distinct cluster
with pure legal behavior is formed ((Fig. 7)). On the
contrary, the silhouette diagram for the same user with the
Euclidean distance, gives two clusters where the second
one has only one member. Then the first one includes all
legitimate usage cases and gives the high (73.31) correct
clustering percentage.

c© 2015 NSP
Natural Sciences Publishing Cor.


Appl. Math. Inf. Sci. 9, No. 4, 1709-1718 (2015) / www.naturalspublishing.com/Journals.asp 1715

−0.5 0 0.5 1

1

2

Silhouette Value

C
lu

st
e

r

User1 − Profile1 

(a)

0 0.2 0.4 0.6 0.8 1

1

2

Silhouette Value

C
lu

st
e

r

User1 − Profile1 

(b)

Fig. 5: Silhouette diagram for the clustering of a user’s
profile into two clusters using (a) the Euclidean distance
as a similarity measure – the negative silhouette values at
the bottom of the figure imply wrong clustering of the user
behaviour, and (b) correlation as a similarity measure –
here two well separated clusters are formed

5 Conclusions and Discussion

As large companies, organizations or institutions seek to
grow their customer base and revenue, they must
increasingly combat sophisticated security threats while
navigating the growing challenge of compliance risk
related to adhering with national or international laws.
Business needs drive an unprecedented demand for
complex IT networks and applications that call for the

0

200

400

600

800

1000

1200

User1 − Profile1 

Two clusters cutting point

Outliers

(a)

0

0.05

0.1

0.15

0.2

0.25

User1 − Profile1 

four clusters

two clusters

Different clustering limits

70% of fraud cases

(b)

Fig. 6: Hierarchical agglomerative clustering of a user’s
profile using as similarity measure (a) the Euclidean
distance between objects – revealing outliers, and (b)
the correlation between objects – revealing distinct case
groups

establishment of authentication and fraud detection
solutions throughout the enterprise. To effectively
improve security and compliance controls, total cost of
ownership and the end-customer experience, companies
are adopting identity-based security and fraud detection
platforms that span enterprise-wide.

In particular, for the telecommunications industry,
fraud detection is important because companies and
suppliers of telecommunications services lose a
significant proportion of their revenue as a result. As
regards large organizations and companies, problems with
internal fraudulent activities, i.e. misuse of the corporate

c© 2015 NSP
Natural Sciences Publishing Cor.

www.naturalspublishing.com/Journals.asp


1716 C. Hilas et. al. : Clustering of Telecommunications User Profiles for Fraud...

0

0.05

0.1

0.15

0.2

0.25

0.3

three clusters

almost all legal behaviour

Fig. 7: The dendrogram for User3 of Table 2 when the
correlation is used as the similarity measure. A distinct
cluster with all legal behavior is formed (reprinted from
[8] with permission from Elsevier)

network or usurpation of colleagues’ identities burden the
company’s budget or degrade the management reputation
among employees. Adding to these, the modeling and
characterization of users’ behavior in telecommunications
can be used to improve network security, improve
services, provide personalized applications, and optimize
the operation of electronic equipment and/or
communication protocols.

In this context, we present how some well established
unsupervised learning techniques may be applied on
telecommunications data in order to provide us with more
insight on both a legitimate user’s and a fraudster’s
behavior. Prior to this, raw usage data must be
transformed into appropriate user profiles. The profile
construction and selection is also a challenge.

From the analysis it is concluded that, as regards user
profile building, accumulated characteristics of a user yield
better discrimination results. However, aggregating user’s
behavior for larger periods than a week was avoided in
order to preserve some level of on-line detection ability.

Clustering reveals that misclassification occurs due to
mixed types of behavior. That is, there are cases for which
the legitimate user acts like a fraudster, e.g. making long
or expensive calls, while at the same time a fraudster may
act like a normal user, e.g. he is more refrained and tries
to mimic legitimate user behavior. So, if one forces the
clustering procedure to produce two distinct groups of data
(one with the legitimate and one with the fraudulent usage)
it will fail to do it correctly.

It was evident from the analysis that there was also
some kind of “noise” in the data. This is due to the fact
that even the field expert, that manually characterized the
cases in the first place, had a hard time distinguishing

between the two classes. This observation reveals the
complexity in a fraud detection problem. This “grey” area
in the classification problem is depicted as the suspicious
cases in Fig.1; a space between pure legal and fraudulent
activity.

Taking into account the aforementioned observations,
i.e. the mixed types of behavior and the noisy data, it is
concluded that clustering alone can not be used as the
only decision technique on whether a user account is
defrauded or not. Input from other methods is also
needed, and a more complex framework should be used.
Nonetheless, clustering may provide the researcher with
invaluable insight on the nature of data and on the nature
of user behavior.

The user profiles that were presented, in this work,
have the benefit of respecting users’ privacy. That is,
except from some coarse user behavior characteristics, all
private data (e.g. called number, calling location, etc) are
hidden from the analyst. Private data would definitely
enhance the accuracy of fraud detection. In fact, the
expert who characterized the data sets, in the first place,
used rules based on private data and his domain specific
knowledge. However, our aim is to test the ability to
detect fraud given the minimum possible information
about the user.

A fundamental aspect of any predictive problem in
data analysis is the choice of an appropriate criterion for
estimation and performance assessment. In the case of
fraud, one needs, in particular, to combine both
classification accuracy and timeliness of classification.
Moreover, the granularity of the timeline of events plays
an important role in the predictive ability of models. This,
also, means that standard measures of classification
performance, such as the error rate, AUC, KS statistic,
information value, etc, may not be sufficient. This is
evident in the present work by the fact that it is difficult to
judge on the most appropriate clustering outcome just by
evaluating similarity statistics. As can be seen in Table 1
the percentage of correct clustering is a vital criterion,
while visual observation of the corresponding graphs is
also needed. One of the main tasks in future research
should be, apart from the selection of appropriate user
models (profiles), the effort to design suitable measures
and performance curves which combine these aspects.

The development of generalized frameworks that
would be capable of adapting to different environments is
highly desired. The presented methodologies may, also,
be applied to study similar problems in mobile or data
networks. The first step should be the appropriate user
modeling, i.e. the selection of the appropriate attributes
for profile building. Towards this step, an expert’s
intervention is highly needed. Adding to this, due to
privacy concerns and restrictions, the user’s collaboration
may also be asked for. The user may give his consent for
the analysis of his account which helps overcome some
legislation problems regarding privacy. His collaboration
may also help the analyst to identify fraudulent activity
more precisely.

c© 2015 NSP
Natural Sciences Publishing Cor.


Appl. Math. Inf. Sci. 9, No. 4, 1709-1718 (2015) / www.naturalspublishing.com/Journals.asp 1717

Nonetheless, one should keep in mind that security
problems often depend on the specific nature of each
problem and there is a risk when directly generalizing the
findings even to similar environments. Among other risks,
there underlies the danger of information leakage
between interconnected IT systems which may burden
users’ privacy.

Future research includes the application of other
clustering methods like the subtractive clustering, or other
unsupervised techniques like the SOM, on the problem.
Social network analysis is also of great interest and seems
like an appealing alternative to statistical approaches or
computational intelligence ones. What is intriguing about
its application on the problem is that we expect to
approach user behavior not by means of some usage
statistic but by detecting transactions between the persons
behind the statistics. Recent evidence suggests that
technical controls only detect one third of fraud cases
with zero time exposure and loss. More complex fraud is
detected with a range of technical and sociotechnical
controls from inside and outside the firm [20].

It is hoped that further research will give important
findings on how to distinguish between legitimate and
fraudulent network usage. Moreover, the departure from a
strict technical approach and the use of social and
behavioral controls used in the organizational
environment may provide more clues about the problem.

Acknowledgement

The authors would like to thank the members of the
Telecommunications Centre of the Aristotle University of
Thessaloniki for their contribution of anonymised user
data. The authors wish to acknowledge financial support
provided by the Research Committee of the
Technological Education Institute of Central Macedonia
(Serres) under grant SAT/IC/181212-157/7

References

[1] Hildebrandt, Mireille and James Backhouse (eds.),
D7.2: Descriptive analysis and inventory of profiling
practices. FIDIS Deliverable, available online at:
http://www.fidis.net/resources/deliverables/profiling/ (2005).

[2] Adomavicius, G., Huang, Z., and Tuzhilin, A.,
Personalization and Recommender Systems. In Z.-L. C.
a. S. Raghavan (Ed.), State-of-the-Art Decision Making
Tools in the Information-Intensive Age (pp. 55-100):
Tutorials in Operations Research (2008).

[3] P. Gosset and M. Hyland, Classification, detection and
prosecution of fraud in mobile networks, Proceedings of
ACTS Mobile Summit, Sorrento, Italy, June (1999).

[4] Communications Fraud Control Association. 2013
Global Fraud Loss Survey, Press Release, Roseland,
NJ (CFCA) October 10, 2013, available online:
http://www.cfca.org/press.php (2013).

[5] Moreau Y., Preneel B., Burge P., Shawe-Taylor J., Stoermann
C., Cooke C., Novel Techniques for Fraud Detection
in Mobile Telecommunication Networks, ACTS Mobile
Summit, Granada Spain (1997).

[6] Kapourniotis, T. Dagiuklas, T., Polyzos, G., Alefragkis, P.,
Scam and fraud detection in VoIP Networks: Analysis and
countermeasures using user profiling, 50th FITCE Congress
(FITCE), Palermo Italy, doi: 10.1109/FITCE.2011.6133427,
(2011).

[7] R. J. Bolton and D. J Hand, Statistical fraud detection: a
review, Statistical Science, 17.3, 235255, (2002).

[8] Constantinos S. Hilas and Paris As. Mastorocostas,
An Application of Supervised and Unsupervised
Learning Approaches to Telecommunications Fraud
Detection, Knowledge-Based Systems, 21, 721 726,
doi:10.1016/j.knosys.2008.03.026, (2008).

[9] Olszewski, Dominik, A probabilistic approach to fraud
detection in telecommunications, Knowledge-Based
Systems, 26, 246-258, (2012).

[10] Fawcett, T. and Phua, C., Fraud
Detection Bibliography, Accessed From:
http://liinwww.ira.uka.de/bibliography/Ai/fraud.detection.html
(2005).

[11] S. F. Hinde, Call Record Analysis, Making Life Easier
- Network Design and Management Tools (Digest No:
1996/217), IEE Colloquium on, 8/1 8/4, (1996).

[12] DEFINITY Enterprise Communication Server (ECS):
Technical Articles and Technical Tips, available online:
http://esearch.avaya.com/

[13] Directive 95/46/EC of the European Parliament and of
the Council of 24 October 1995 on the protection of
individuals with regard to the processing of personal data
and on the free movement of such data, Official Journal L
281/31 of 23.11.95, pp 31-50, available online at: http://eur-
lex.europa.eu/LexUriServ/LexUriServ.do?uri=CELEX:
31995L0046:EN:NOT (1995).

[14] Directive 2002/58/EC. Directive on privacy and
electronic communications. Official Journal L 201,
31.7.2002, 3747, available online at: http://eur-
lex.europa.eu/LexUriServ/LexUriServ.do? uri=CELEX:
32002L0058:EN:NOT (2002)

[15] Fawcett, T. and F. Provost, Adaptive fraud detection, Journal
of Data Mining and Knowledge Discovery, 1.3, 291316
(1997).

[16] Constantinos S. Hilas, and John N. Sahalos, Testing the
fraud detection ability of different user profiles by means of
FFNN classifiers, in Collias St. et al ed.. Lecture Notes in
Computer Science, 4132, Part II, Berlin-Heidelberg: Springer
Verlag, 872-883 (2006).

[17] Kaufman L., and Rousseeuw P.J., Finding groups in data:
an introduction to cluster analysis, New York: John Wiley &
Sons, Inc. (1990).

[18] Rohlf, F. J. and David L. Fisher, Test for hierarchical
structure in random data sets, Systematic Zool., 17, 407-412
(1968).

[19] Witten I. H., and Frank E., Data mining: practical machine
learning tools and techniques with Java implementations,
London: Academic Press, (2000).

[20] Goode, S., and Lacey, Detecting complex account fraud
in the enterprise: The role of technical and non-technical
controls, Decision Support Systems, 50.4, 702-714 (2011).

c© 2015 NSP
Natural Sciences Publishing Cor.

www.naturalspublishing.com/Journals.asp
http://www.fidis.net/resources/deliverables/profiling/
http://www.cfca.org/press.php
http://liinwww.ira.uka.de/bibliography/Ai/fraud.detection.html
http://esearch.avaya.com/


1718 C. Hilas et. al. : Clustering of Telecommunications User Profiles for Fraud...

Constantinos S. Hilas
received the B.S. degree
in Physics, in 1989, the
M.S. degree in Electronic
Physics - Radioelectrology,
in 1996, and the Ph.D.
degree in Physics, in 2007,
from the Aristotle University
of Thessaloniki (AUTH),
Greece. He also received the

M.S. degree in Information Systems from the University
of Macedonia (UM), Thessaloniki, Greece, in 2000.
From 1993 to 2002, he was Computer Administrator
and the Administrator of the Telecommunications
Network of AUTH. Since 2002, he has been with the
Technological Educational Institute of Central Macedonia
(Serres, Greece), where he is currently an Associate
Professor in the Dept. of Informatics Engineering. His
basic research interests include user characterization,
service development, and data mining techniques in
telecommunications and networking.

Paris A. Mastorocostas
received the Diploma
and Ph.D. degrees in
Electrical & Computer
Engineering from Aristotle
University of Thessaloniki,
Greece. Presently he serves
as Professor at the Dept.
of Informatics Engineering,
Technological Education

Institute of Central Macedonia, Serres, Greece. His
research interests include fuzzy and neural systems with
applications to identification and classification processes,
data mining and scientific programming.

Ioannis T. Rekanos
received the M.Sc. and
the Ph.D. degree in electrical
and computer engineering,
in 1993 and in 1998,
respectively, both from
the Aristotle University
of Thessaloniki (AUTH),
Greece. From 2000 to 2002,
he was a Postdoctoral Senior

Researcher in the Radio Laboratory at the Helsinki
University of Technology, Finland. From 2002 to 2006,
he served as a faculty member of the Department of
Informatics and Communications, TEI of Serres,
Greece. Since 2006, he has been with the AUTH,
where he is currently an Associate Professor in the
Department of Electrical and Computer Engineering. His
research interests include electromagnetic and acoustic
wave propagation, inverse scattering, computational
electromagnetics, and digital signal processing.

c© 2015 NSP
Natural Sciences Publishing Cor.


	Introduction
	Behavior Modeling and Profiling in Telecommunications
	Clustering Techniques and Clustering Quality Measures
	An Application of User Profile Clustering
	Conclusions and Discussion