Identifying web sessions with simulated annealing


Expert Systems with Applications 41 (2014) 1593–1600
Contents lists available at ScienceDirect

Expert Systems with Applications

j o u r n a l h o m e p a g e : w w w . e l s e v i e r . c o m / l o c a t e / e s w a
Identifying web sessions with simulated annealing
0957-4174/$ - see front matter � 2013 Elsevier Ltd. All rights reserved.
http://dx.doi.org/10.1016/j.eswa.2013.08.056

⇑ Corresponding author. Tel.: +56 2 7180900.
E-mail addresses: tomas.arcec@usach.cl (T. Arce), proman@dii.uchile.cl (P.E.

Román), jvelasqu@dii.uchile.cl (J. Velásquez), victor.parada@usach.cl (V. Parada).
1 Tel.: +56 2 7180900.
Tomás Arce a,1, Pablo E. Román b, Juan Velásquez c, Víctor Parada d,⇑
a Departamento de Ingeniería Informática, Universidad de Santiago de Chile, Av. Ecuador 3659, Estación Central, Santiago, Chile
b Center of Mathematical Modelling (CMM) UMI CNRS 2807, Universidad de Chile, Av. Blanco Encalada 2120, Piso 7, Santiago, Chile
c Departamento de Ingeniería Industrial, Universidad de Chile, República 701, Santiago, Chile
d Departamento de Ingeniería Informática, Universidad de Santiago de Chile, Av. Ecuador 3659, Estación Central, Santiago, Chile
a r t i c l e i n f o

Keywords:
Web usage mining
Web session
Simulated annealing
a b s t r a c t

Delivery of efficient service through a web site makes it compulsory in the redesigning stage to take into
account the behavior of the users, which can be studied by means of a web log file that partially records
information about user visits. The reconstruction of all of the sequences of pages that are visited by users
who browse a web site is known as the web sessionization problem, and it has been formulated by means
of an integer programming model; however, because a web log can accumulate a large amount of infor-
mation, it is necessary to reconstruct the sessions over a period of weeks or months, thus the solution to
this problem requires a long computational processing time. This paper presents a heuristic approach
based on simulated annealing for the sessionization problem. Using this approach, it has been possible
to reduce the processing time up to 166 times compared to the time that is required for the integer pro-
gramming model. Furthermore, the metaheuristic solution finds new optimum values, which achieve
increases on the order of 17% in the best cases.

� 2013 Elsevier Ltd. All rights reserved.
1. Introduction

The Internet has become a flourishing source of data for differ-
ent research fields related to social networks, such as sociology
(Lin, Jheng, & Yu, 2012; Nohuddin et al., 2012), marketing (Fong,
Zhou, Hui, Tang, & Hong, 2012; Wang, Ting, & Wu, 2013) and com-
puter science (Devi, Devi, Rani, & Rao, 2012; Yin & Guo, 2013). Also,
the massive usage of the Internet drives many lucrative businesses
such as e-commerce. Thus, it is increasingly necessary for web sites
to be designed with a structure that makes it easy for the user to
obtain the service (Wang & Ren, 2009). To that end, it is necessary
to study the behavior of the users which is partially kept in a pri-
vacy-compliant data file saved on the web server known as a web
log. Legislation in several countries forbids the storage of personal
information in order to safeguard personal privacy (Mayer &
Mitchell, 2012). In order to follow national laws, internet compa-
nies must rely on the anonymity of web logs for extracting infor-
mation about their user preferences and browsing behavior
(Velásquez, 2013; Velásquez & Palade, 2008). The generated
browsing sequences represent an input source for discovering
the behavior patterns of users who visit a web site (Cooley,
Mobasher, & Srivastava, 1999; Kosala & Blockeel, 2000; Tao, Hong,
Lin, & Chiu, 2009).
Every time a user requests a web page or some resource con-
tained on it such as images, videos, and sounds, a new record is
created in the web log. The generated information allows the
detection of the most visited pages, the common page access
sequences, the users’ preferred contents and even the generation
of user profiles (Choi & Lee, 2009; Nasraoui, Soliman, Saka, Badia,
& Germain, 2008). Such detection is called web usage mining
oriented toward extracting knowledge from web logs (Román,
L’Huillier, & Velásquez, 2010), which requires the extraction of
individual sequences of a user’s web interaction while maintaining
anonymity (Mayer & Mitchell, 2012). The following data are typi-
cally recorded in the file: the IP address from which the inquiry
is made, the time and date of the inquiry, the requested resource,
a code indicating the result of the operation, the number of bytes
transferred in the request, a string that contains the address of
the site from which the present request was originated and the
browser and operating system that was used. The generated
browsing sequences represent an input source for discovering
the behavior patterns of users who visit a web site (Cooley et al.,
1999; Kosala & Blockeel, 2000; Tao et al., 2009).

With expert assistance the information stored in the web log
can be used to support decision-making that can help in restruc-
turing a web site to improve access to the preferred contents;
the information can also be used to detect market segments based
on the buying behavior of the users and to implement resource
suggestion systems for the users. The sequence of individual inter-
actions is called a session and the reconstruction of all of the page

http://crossmark.crossref.org/dialog/?doi=10.1016/j.eswa.2013.08.056&domain=pdf
http://dx.doi.org/10.1016/j.eswa.2013.08.056
mailto:tomas.arcec@usach.cl
mailto:proman@dii.uchile.cl
mailto:jvelasqu@dii.uchile.cl
mailto:victor.parada@usach.cl
http://dx.doi.org/10.1016/j.eswa.2013.08.056
http://www.sciencedirect.com/science/journal/09574174
http://www.elsevier.com/locate/eswa


1594 T. Arce et al. / Expert Systems with Applications 41 (2014) 1593–1600
sequences visited by the users during their browsing through a
web site is known as the web sessionization problem (WSP)
(Berendt, Mobasher, Spiliopoulou, & Wiltshire, 2001; Chitraa &
Selvdoos, 2010; Huynh & Miller, 2009; Poženel, Mahnic, & Kukar,
2010; Velásquez & Palade, 2008). A correct reconstruction of the
sessions must take into account that the validity of the generated
patterns depends largely on the credibility of the sessions obtained
(Berendt et al., 2001). In spite of that, the reconstruction of sessions
is not a trivial process because of factors that hinder reconstruc-
tion, such as proxy servers or the use of cache memory in the
client’s web browser.

The techniques that solve the WSP from a web log must deal
mainly with the nonexistence of a clear identification of the users.
Matching the IP address as a single criterion is not sufficient,
because when multiple users have access to a site through a proxy
server, they are registered in the web log with the proxy server’s
address. To partially remove the identification error, the tech-
niques use the IP address and the web browser that was used by
the web site visitor as identification criteria (Pirolli, Pitkow, &
Rao, 1996). Other techniques directly track user operations by
using cookies with client-side scripts obtaining accurate sessions,
but they are not recommended since they violate user privacy laws
(Mayer & Mitchell, 2012). Therefore, the web usage mining on web
logs is a safe way to analyze user preferences, as long as we take
into account that in general a high quality sessionization is not
possible by means of machine learning techniques that are known
to be subject to data error (Román et al., 2010).

Current improvement of the web usage mining processing relies
on accurately solving the WSP, for which several heuristics have
been proposed. The most widely used technique to tackle the
WSP is the time heuristic, which considers that the sessions have
a time limit (Catledge & Pitkow, 1995; Cooley, Mobasher, &
Srivastava, 1997; Huynh & Miller, 2009). This technique groups
the records according to the IP address and the web browser that
was employed by the user; it arranges the records of each resultant
group in a temporal order and then obtains the sessions, ensuring
that the first and last records of a given session do not exceed a
time limit. The major drawback of this algorithm consists of the
null consideration of the hyperlink topology of the web site. A
revised version of this heuristic considers the site’s link structure
together with a temporal criterion (Pirolli et al., 1996), in addition
to the criteria that the consecutive records of the same session
must refer to pages between which there is a link. Nowadays, with
modern web browser navigation it becomes hard to track the infor-
mation due to the multiple ways that pages are cached, loaded, and
navigated. Multitab navigation, back button browsing and history
jumps are commonly not reflected on web logs (Román et al.,
2010), so hyperlink topology restriction is relaxed from being a
strong restriction on session properties.

A recent approach addresses the WSP as an optimization prob-
lem by defining an integer programming model (Román, Dell, &
Velásquez, 2010). All of the possible reconstructions of sessions
from a given web log constitute the feasible solution space of the
problem. The choice of a specific reconstruction implies a search
for the reconstruction that has a maximum value in terms of a spe-
cific objective function. Web hyperlink topology and time sequenc-
ing are incorporated as restrictions in the model. Relaxed browsing
behavior could even be easily included (Dell, Román, & Velásquez,
2009). However, because a web log in a single day can accumulate
an enormous number of records and the supporting decision mak-
ing requires the reconstruction of sessions over a period of weeks
or months, the number of variables of the integer programming
models is huge for the currently available capacity for solving inte-
ger programming problems. Solving even small instances of the
problem requires several hours of computing time. The perfor-
mance is even worse when considering more common browsing
behaviors such as parallel tabs and back buttons. The ideal situa-
tion would consider the existence of an algorithm that can process
the information in real time by requiring a short computing time.

A novel algorithm for solving the WSP was presented in Bayir,
Toroloslu, Demirbas, and Cosar (2012). The algorithm is based on
graph modeling of the sessions that are constructed considering
maximal path length, hyperlink topology and back button browsing.
They found improved accuracy in recovering sessions relative to
previous models. However, their major drawback comes from the
theoretical justification of the model affecting its reproducibility.

Instead of generating the optimum solution by means of an
algorithm that solves the integer programming problem, we obtain
a good quality solution using a small amount of computing time,
by means of Simulated Annealing (SA) which is a metaheuristic
that emulates the physicochemical process that takes place in
the cooling of pure substances, systematically generating a new
solution from the current solution and allowing, at some instant,
the choice of a poor solution, with a certain probability that de-
creases with time (Kirkpatrick, Gelatt, & Vecchi, 1983; Talbi,
2009). This method has been shown to be very efficient for solving
problems belonging to the NP-Hard class, such as the design of
electronic circuits, the reconstruction of images, the generation
of roads, set partition problems and planning problems (Suman &
Kumar, 2005). This paper presents an approach that is based on
simulated annealing for the WSP for the purpose of reconstructing
each session of the users that visit a web site.

The second section presents the WSP representation under an
SA, identifying the elements needed for the experimental phase.
The third section presents the main results, and in the last section,
the main conclusions are given.

2. Methods and materials

2.1. The web sessionization problem

The model founded on integer programming (Dell, Roman, &
Velasquez, 2008) is based on ensuring that a session is constituted
by a set of records that share the same IP address and the same
web browser. For consecutive records within a session, the con-
straints ensure the maintenance of the link structure of the site
and that the session does not go beyond a certain time limit. Thus,
if a record is found directly after another record within the same
session, then the following are true:

j Both share the same IP address and web browser.
j There is a link between the referenced page in both records.
j The time difference between the recorded application for

both records does not exceed a certain mtp value or time
window.
Let

j r and r’ be records of a web log;
j obe the order of a record in a given session, o = 1, 2, . . ., O,

where is the maximum size that a session can have;
j s be the identifier of a session;
j Co be the value of the coefficient of the objective function

when there is a record assigned to position o;
j xros be the binary decision variable, which has a value of 1

when record r is assigned to position o in a given session s
and a value of 0 in any other case;

j Bpage(r) be the set of records that can be directly before
record r in the same session, according to the criteria of hav-
ing the same IP address, corresponding to the same web
browser, having a link between consecutive records and
maintaining the time window between consecutive records;


SA algorithm 
Input: Min f(x) s.t. x ∈ Ω;
Output: x*;
Generate an initial solution x0 ∈ Ω; x* = x0;
Define T > 0; t = 0; 
Repeat

T. Arce et al. / Expert Systems with Applications 41 (2014) 1593–1600 1595
j first be the set of records that are always found in the same
position in a session. It is possible to calculate this set by
first getting bpager for all of the web log records. Therefore,
a record r 2 first if bpage(r) = £.

The integer programming model is the following:

ISP : Maximize z ¼
X

ros

Co xros ð1aÞ

s.t.
X

os

xros ¼ 1; 8r ð1bÞ

X

r

xros � 1; 8o; s ð1cÞ

xr;oþ1;s �
X

r02bpageðrÞ
xr0os; 8r; o; s ð1dÞ

xros ¼ 0; 1 8r; o; s ð1eÞ

xros ¼ 0 8r;2 first; o > 1; s ð1fÞ

The first constraint establishes that a record can only be in a single
session and in a given order within it. The second constraint estab-
lishes that there can only be a unique record in a given session and a
position within it. The third constraint establishes the correct
arrangement of the records of the same session, complying with
the time, site structure and the same IP address and web browser
combination conditions. The value of Co is used to give a bonus to
sessions of a given size. For example, by setting a value of Co = 1
when o = 2 and Co = 0 " o – 2, we are favoring a large number of
size 2 sessions in the final reconstruction. The value of the objective
function for a sessionization is calculated taking into account the
length of each of the generated sessions (l) and then adding the va-
lue of Co that corresponds to each position of the session. For exam-
ple, for a single session of length l = 5, the value of the objective
function is the following:
X

Co ¼ C1 þ C2 þ C3 þ C4 þ C5 ð2Þ

Several measures can be used to reproduce the sessions. Functions
C1o through C

4
o are used in this paper and are presented in Table 1.

All of the functions are monotonically increasing in o with the pur-
pose of maximizing small-size sessions or else maximizing large-
size sessions.

2.2. Quality of the reconstruction

The quality of the reconstruction is measured with respect to
the power law for the size of the sessions on a web site (Huberman,
Pirolli, Pitkow, & Lukose, 1998; Oliveira et al., 2006). This law states
that most visits to a web site are concentrated on a small number
of pages and the rest of the pages receive a smaller number of vis-
its. Thus, the quality of the reconstruction can be verified by revis-
ing the coefficient of correlation r2 and the standard error S,
obtained from a linear regression performed on the logarithm of
Table 1
Different coefficients Co for the objective function.

Co

C1o = log(o)

C2o = 1.5log(o) + (o � 3)
2/12o

C3o = o

C4o = o
2

the size of the session and the number of sessions generated by
the reconstruction.
2.3. Simulated annealing (SA)

The SA algorithm is a metaheuristic optimization procedure
that is inspired by the cooling process of molten metals, and it
has been used to find good quality solutions to various combinato-
rial optimization problems (Kirkpatrick et al., 1983; Talbi, 2009;
Černý, 1985). In the metallurgical process, the initial state is the
molten metal, and the temperature is decreased in a controlled
manner, allowing the formation of intermediate equilibrium states
until a solid crystalline structure is achieved. During the process,
an internal form of the metal’s energy is decreased. Following this
analogy, to solve an optimization problem, the solution space is
visited, varying a parameter T that corresponds to the temperature
so that, at a high value of T, the probability of accepting solutions
worse than the current solution is high. As T decreases, the algo-
rithm is gradually converted into a local search algorithm. Regula-
tion of the acceptance probability takes place by means of a
Boltzmann distribution. The adequate representation of an optimi-
zation problem that is solved by SA requires defining a rule to gen-
erate neighboring solutions of the current solution, the function
that must be optimized and an initial solution. The proper algo-
rithm parameters must also be defined: the initial value of T, a
way of decreasing it gradually and the number of iterations that
are performed over the internal cycle N(t). The procedure for solv-
ing a maximization problem is presented in the pseudocode de-
picted in Fig. 1.
2.4. Representation of a solution with SA

A solution of a WSP is represented using a matrix X(mxn), where
m is equal to the maximum size of a session and n is equal to the
number of records contained in the web log partition. Then, each
element xij of the matrix corresponds to the identification number
of the record that is in position i of session j. Thus, for a given solu-
tion, the column index is the identifier of the reconstructed
session.
2.5. Generation of an initial solution

The initial solution generated uses a variation of the Links Heu-
ristic (Pirolli et al., 1996), ensuring that all of the consecutive re-
cords within a session do not exceed a maximum time window.
For two records, r1 and r2, to be located consecutively in the same
session, the following must be satisfied:
 Repeat 
  Generate solution xj neighboring xi;

δ = ƒ(xj) - ƒ(xi);
  If δ > 0 then xi = xj;
  Otherwise If random (0, 1) < exp (-δ / T) then xi = xj;
  If xi < x* then x* = x0;
 Until N(t) iterations are completed 
 t = t +1; T = T(t);
Until the stop criterion is reached. 

Fig. 1. SA for a maximization problem.


Function GeneratesInitialSolution ()
MTP : Maximum time difference between consecutive records. 
SMAX : Maximum length of a session. 
newSession=1;
session=1;
For each record r in the web log 

If (newSession = 1)then
 Create a new session with record r in the first position. 
Or else 

If
(The size of the session is Not equal to SMAX) 
AND
(The last record of the session, r - 1, has 
the same address as record r)
AND
(There is a link between r - 1 and r)
AND
(r.timeStamp – (r – 1).timeStamp ≤ MTP) then

 Add r to the current session. 
  Or else 
   newSession =1; 

r --; 
End If 

End If 
End For 

End Function 

Fig. 2. Algorithm for generating the initial solution.

1596 T. Arce et al. / Expert Systems with Applications 41 (2014) 1593–1600
j Both records r1 and r2 have the same IP address.
j Record r1 has a link with r2.
j The records have visiting times within a predefined time

window mtp.
j A predefined maximum session time is not exceeded.

Fig. 2 presents the algorithm for generating the initial solution.
In the algorithm, the records are assigned to a session following the
viability conditions that are described above.

2.6. Generation of the neighboring solution

The generation of a neighboring solution consists of three steps:
choosing a session randomly, deleting the chosen session and reas-
signing each of the records of the deleted session. In the last step,
there are two possibilities; the record is either assigned to other
existing sessions that comply with the maximum number of re-
cords per session, the maximum time difference and the existence
of a link between consecutive records, or it is assigned to a new
session where the record occupies the first position. A record can
be relocated as follows:

j At the beginning of the session. The chosen record re must
have a link to the first record r1, and the time difference
between re and r1 must not exceed the maximum time win-
dow mtp.

j In the middle of the session. If ri and ri+1 are two consecutive
records within the session, there must be a link that goes
from ri to re and another one that goes from re to ri+1 for
the chosen record re to be located between ri and ri+1; the
time difference between ri, re and between re, ri+1 must not
be greater than the maximum time window mtp.

j At the end of the session. The existence of a link between
the chosen record re and the last record rf must be verified,
and the time difference between rf and re must not exceed
the time window mtp.

2.7. Evaluation function

The evaluation function considered is the proper objective func-
tion (1a) of the integer programming problem, keeping in mind the
four possibilities for Co that are presented in Table 1.
2.8. Selection of SA parameters

With the aim of searching for the adequate adaptation of SA for
the WSP, we set the parameters according to the following criteria
for each case:

j T0. A scheme has been adopted with a variable T0 that allows
changing the execution time of the algorithm depending on
the effort needed for the search, measured according to the
size of the instance to be solved. In this way, T0 is chosen
based on the maximum approximate difference in the
objective function (Aarts & Lenstra, 1997), as follows:
DFmax ¼ fmax � fmin ð3Þ

In the problem’s context, the greatest decrease of Co occurs when
the largest session is fragmented into smaller unit sessions. The dif-
ference gives Eq. (4) as a result:

DFMax ¼
XL

o¼1
Co � LC1 ð4Þ

where L is equal to the largest session of the initial solution
generated.

Considering the natural logarithm of the acceptance probability
p, ln(p) = �DFmax/T0, with p = 0.99, we get T0 such that when a solu-
tion gets worse by a magnitude of DFmax, it will be accepted with a
probability of 0.99. Then, T0 is calculated from T0 = �DFmax/0.01.

j Cooling function. A geometric temperature drop is used,
given by Ti = �a Ti � 1, with a 2 (0, 1).

j The number of iterations of the internal cycle. The number
of iterations is dynamic as the search continues. If a neigh-
boring solution that improves the current solution is gener-
ated, the cooling function is applied to decrease T (Cardoso,
Salcedo, & Azevedo, 1994; Ravi & Shukla, 2009). To avoid
stagnation of the SA at a single T level, the maximum num-
ber of iterations equal to the number of possible neighbor-
ing solutions of a given solution is considered. An
approximation of this number is given by the number of
sessions at each level of T; when a neighboring solution is
generated, the number of possible movements is equivalent
to the number of sessions that can be chosen for their
records to be reassigned.

j T _f A fixed value is used that is found during the calibration
process.

2.9. Algorithm for WSP

The internal cycle operates with a constant value of T visits;
each time, a neighbor solution of the current solution is generated
by means of a reassignment of a record from one session to
another. It accepts this solution automatically when a better
sessionization is generated and accepts it with a specific probabil-
ity when the sessionization is worse than the current one.
2.10. Equipment used

The implementation of SA was written in the C language and
was compiled with GCC version 4.3.2. The computer used for both
the implementation and the execution of the experiment had an
Intel Core 2 Duo 1.8-Ghz processor, with 1 GB of RAM, an 80-GB
hard disc, and the Linux Operating System, kernel version 2.6.26,
Debian version 5.0 distribution.


T. Arce et al. / Expert Systems with Applications 41 (2014) 1593–1600 1597
2.11. Test instances

To verify the operation of the implementation of the proposed
solution, a web log generated by a web site that considered a
storage period of one month was used. The site has 172 pages
and 1228 links between them. The web log was previously
preprocessed to delete the requirements of multimedia objects,
access to web mail and error records. After the preprocessing, the
total number of records is 102,303. Of the 16,985 IP addresses con-
tained in the file, 16,785 have less than 50 records that represent
98% of the total. In turn, 14,265 IP addresses visit 3 or fewer differ-
ent pages, which represent 84% of the total IP addresses.

Reconstructing sessions using the mathematical model and
starting from a web log leads to an oversized requirement of the
hardware resources. However, because a session is composed of
records that share the same IP address, it is possible to partition
the web log into smaller chunks. In this way, the possible solutions
space is reduced, and the problem’s natural restrictions are applied
to divide it into multiple smaller size problems. By applying the
algorithm to each chunk and then integrating the partial results,
the resource requirement is reduced considerably. Not all of the
partitions have the same interest; there are sets of records that
are more interesting for the sessionization, given the possibility
that the recorded requests correspond to multiple user accesses
that are masked under a single address (which corresponds to
the proxy server or the firewall that is being used). To identify
these sets, two measures of difficulty are used that are established
in relation to a given IP address, the diversity of the record and the
number of records. The record diversity of a given IP address is
measured in terms of the entropy of the IP address (Dell et al.,
2008), defined as the following:
X

i2PðIPÞ
ðfi=nRÞlogbðnR=fiÞ ð5Þ

where fi corresponds to the number of times in which page i was
accessed by a given IP address, nR is the number of records contain-
ing that IP address, b is the number of different pages that have
been visited from a given IP address and P(IP) is the number of
pages visited from the IP address. The entropy takes values between
0 and 1. When the entropy is close to 0, most of the records for a
given address correspond to accesses to a single page; when it is
close to 1, all of the pages accessed by the address are visited with
the same frequency.

A large diversity for an IP address is a good indicator of the dif-
ficulty of the records associated with that address, but it is not a
sufficient condition, because there can be a large diversity with
very few records. That is why, together with entropy, a criterion
of a minimum number of records per IP address is used; an IP
address with high entropy and a large number of records ensures
that the records associated with that address are referred to differ-
ent pages and that they are numerous, capturing what happens in a
web log when multiple users gain access to the site by means of
proxy servers.

To construct the test instances, the records with IP addresses
that have entropy greater than or equal to 0.5 and a number of
records in the web log greater than or equal to 50 are chosen.
For the web log used, the number of records chosen under this
criterion was 17,709.
2.12. Metrics used

To evaluate and analyze the results obtained, the following
metrics are used:

j Percent difference of the objective function, Gap:
Gap ¼ jfðxÞ� f �ðxÞj=f �ðxÞ ð6Þ

where f⁄(x) is the reference value for the comparison.
j Power Law fit. The quality of the reconstruction is verified

by means of the coefficients r2 and S, obtained from a linear
regression over the logarithm of the size of a session and the
number of sessions. The values 0 6 r2 6 1 and S allow the
evaluation of the degree of fit of the number of sessions ver-
sus the size of the sessions with the Power Law. When r2

takes values close to 1, the model fits perfectly with the
Power Law. On the other hand, S measures the standard
error of the model in terms of the difference between the
real value and the estimated value. The closer it is to 0,
the better the model’s fit is with real data (Montgomery &
Runger, 2006).

2.13. Execution of the experiment

To reconstruct the sessions with SA, we used the values of
a = 0.99 and Tf = 0.08 for the parameters. For each set of instances,
an execution of the algorithm considers 10 attempts for each way
of assigning coefficient Co of the objective function. An attempt
must be understood as the execution of the algorithm over all of
the chunks of the web log.

The assignment functions of Co presented in Table 1 are consid-
ered for the experiment. Subsequently there are 40 attempts for
the set of ISP instances and 40 attempts for the set of instances
of the metaheuristic, with a total of 80 attempts. Counting the
application of the solution to each chunk, a total of 21,200 execu-
tions of the SA are performed.

3. Results

The execution of SA obtains the following for each assignment
function of Co and each set of test instances:

j Average attempt: The reconstruction closest to the average
value of the objective function calculated from the 10
attempts executed.

j Best attempt: A reconstruction based on the best attempt of
the 10 executed, in terms of the value of the objective
function.

j Best reconstruction: A reconstruction that is obtained by
choosing, for each chunk, the best sessionization in terms
of the value of the objective function. The solutions of each
of the best chunks are joined to form the best possible solu-
tion from the attempts that are executed up to some point.
This reconstruction is obtained from the processing of the
results, after the execution of the implementation.

Furthermore, SA provides information on the value of the objec-
tive function that obtained the solution, the computer time and the
quality of the reconstruction of the sessions in terms of the Power
Law in relation to the size of the sessions.

3.1. General results

The general results that are obtained during the execution of
the algorithm are presented in Table 2. The smallest average num-
ber of sessions is obtained with C1o . The largest session average is
obtained with C4o . The function C

1
o yields the shortest processing

times, with an average of 2.15 min per attempt and a total test
time for the execution of the 10 attempts of 21.52 min. Function
C4o takes an average of 3.86 min per attempt, with a total test time
of 38.58 min.

Table 3 presents the best attempts for each of the objective
functions. The lowest number of sessions was 11,312, obtained


Table 2
Results of SA for WSP.

Co Maximum # of
sessions

Minimum # of
sessions

Average # of
sessions

Standard
deviation

zmax zmin zpromedio Average computer time
[min]

Total test time
[min]

Co1 11321 11312 11317.7 2.83 6481.3 6460.4 6469.9 2.15 21.52
Co2 11334 11321 11326.3 4.42 13990.0 13918.5 13957.2 2.41 24.14
Co3 11349 11320 11332.5 7.76 32561.0 32391.0 32489.8 2.47 24.74
Co4 11352 11335 11340.5 4.81 134926.0 131012.0 132746.8 3.86 38.58

Table 3
Results of the best attempts for each of the objective functions.

Co Id attempt z r
2 S No. of sessions Computer time [min]

C1o 10 6481.3 0.943 0.577 11312 2.15

C2o 1 13990.0 0.931 0.666 11327 2.41

C3o 7 32561.0 0.919 0.707 11326 2.48

C4o 6 134926.0 0.941 0.581 11338 3.86

Table 4
Evolution of the value of the objective function for different executions of SA.

Co1 Co2 Co3 Co4

Attempt z Dz Gap [%] z Dz Gap [%] z Dz Gap [%] z Dz Gap [%]

1 6460.4 – – 13990.0 – – 32541.0 – – 134012.0 – –
2 6490.7 30.29 0.467 14032.4 42.44 0.302 32688.0 147.0 0.450 136611.0 2599.0 1.90
3 6505.0 14.23 0.219 14041.4 9.03 0.064 32744.0 56.0 0.171 137883.0 1272.0 0.93
4 6511.2 6.25 0.096 14052.4 10.93 0.078 32770.0 26.0 0.079 138246.0 363.0 0.26
5 6514.3 3.12 0.048 14063.7 11.33 0.081 32794.0 24.0 0.073 138811.0 565.0 0.41
6 6519.6 5.30 0.081 14068.6 4.92 0.035 32839.0 45.0 0.137 139885.0 1074.0 0.77
7 6521.4 1.77 0.027 14072.1 3.50 0.025 32847.0 8.0 0.024 140056.0 171.0 0.12
8 6524.0 2.57 0.039 14080.2 8.02 0.057 32893.0 46.0 0.140 140056.0 0.0 0.00
9 6525.1 1.10 0.017 14080.2 0.00 0.000 32898.0 5.0 0.015 140126.0 70.0 0.05
10 6528.0 2.97 0.046 14080.3 0.18 0.001 32898.0 0.0 0.000 140126.0 0.0 0.00

Table 5
Results obtained with SA and CPLEX for WSP.

Co z (ISP) z (SA) av. attempt z (SA) best value Gap [%]

C1o 5953.3 6470.4 6528.0 9.66

C2o 13229.4 13956.5 14080.3 6.43

C3o 30783.0 32482.0 32898.0 6.87

C4o 119450.0 132485.0 140126.0 17.31

1598 T. Arce et al. / Expert Systems with Applications 41 (2014) 1593–1600
with C1o , and the largest number of sessions, 11,338, was obtained
with C4o . The shortest and longest times occur with C

1
o and C

4
o , with

2.15 and 3.86 min, respectively. The best fit with the Power Law
occurs with C1o .
3.2. Evolution of the best reconstruction

Because the solution is based on a random search method, it is
possible that a new execution over a specific chunk will find a bet-
ter solution than one found in a previous execution. Executing SA
several times over the same web log, it is possible to choose the
best solution for each of the chunks. Because the sessionization
of each chunk is independent of the others, the best reconstruction
can be generated based on a certain number of attempts. This way
of operating allows refining the quality of a sessionization by
means of successive executions. Table 4 shows the value of the
objective function for the best possible reconstruction up to a given
attempt. The value of z is presented for every function C1o through
C4o , as is the difference in z that is produced by the new execution of
SA with respect to the previous execution.
It can be seen that the values of the objective function tend to
converge at a given point, causing the largest increase in the first
four attempts, with more than 70%. Specifically, the increases
remain until attempt number 10, except for C3o and C

4
o , which con-

verge after the eighth attempt. The best refining is produced with
C4o , which shows an increase of 4.56%.

If a unique reconstruction is sought that is stable in terms of the
value of the objective function, it is seen that the road to follow is
to execute SA more than once with C4o . Alternatively, if efficiency is
sought while sacrificing the stability of the sessionization obtained,
it is recommended to execute SA only once.
3.3. Comparison with the exact method

The integer programming model was solved with CPLEX («IBM
– IBM – Mathematical Programming», 2011) version 10.1.0,
together with GAMS («GAMS», s.f.), a software system used for
defining optimization models. The experiment was executed on a
computer with 1.6 Ghz and 2 GB RAM. CPLEX was executed with
a time limit of 300 s for each chunk. When this limit was reached,
the software returned the best solution found up to that moment
for this chunk.

Table 5 shows the value of the objective function for each of the
assignment functions. The results show that SA improves the opti-
mum found by CPLEX for all of the assignment functions, with an
increase with respect to the optimum value that goes from 9.66%
to 17.31%, respectively.

Table 6 shows the processing time of CPLEX, tISP, and the aver-
age execution time for each SA attempt, tSA. SA has a significantly


Table 6
Computer time for ISP and SA.

Co tISP [min] tSA [min] Speedup

C1o 359.1 2.15 166.78

C2o 372.5 2.40 154.90

C3o 379.9 2.47 153.51

C4o 96.3 3.86 24.96

Table 7
Quality of the reconstruction of sessions based on Power Law fit.

Method r2 S

Time heuristic 0.915 0.663

ISP-CPLEX model (C2o ) 0.978 0.383

SA (Average attempt, C4o ) 0.943 0.587

SA (Best attempt, C1o ) 0.943 0.577

SA (Best reconstruction, C4o ) 0.943 0.573

T. Arce et al. / Expert Systems with Applications 41 (2014) 1593–1600 1599
shorter execution time than the integer programming model in all
of the assignment functions of Co. The speedup reaches 166.78 for
C2o , which is the maximum value of this indicator. The smallest
speedup is obtained with C4o , with a 24.96-fold reduction in pro-
cessing time.

To study the quality of the reconstruction based on the fit with
the Power Law, we obtain the values of r2 and S for the time heuris-
tic. In this case, the records are arranged temporarily in increasing
order for each group. Each record is chosen and added to a session
in such a way that the consecutive records of a given session do not
have a time difference greater than 300 s. Table 7 shows the best-
fitting values obtained by each of the methods studied.

Based on the above values, the best fit is obtained with the ISP
model solved with CPLEX using C2o . Both of the reconstructions that
are obtained with ISP and those obtained by SA exceed the time
heuristic in terms of that metric, which presents the worst values
in both the standard error and in r2.

4. Discussion and conclusions

The implementation of Simulated Annealing to solve the WSP
improves the value of the objective function for each of the assign-
ment functions, with a gap that varies between 9.66% and 17.31%.
The speed for obtaining the results is one of the main strengths of
SA over the exact method. In the best case, the reduction is of the
order of 166 times the execution time of the exact method.

Executing SA several times on the same set of instances allows
new solutions to be obtained that are better than those that can be
obtained with a single execution. The study of the magnitude of
the increments in relation to the number of executions of the solu-
tion yielded an interesting convergence of the solutions to a given
value of the objective function. That allows the achievement of a
potentially stable and unique reconstruction depending on the
number of executions, compared with only a single execution of
the solution. Clearly, this implies an increase in processing time
that must be evaluated according to the accuracy required for
the reconstruction of the sessions.

From the standpoint of the sessionization problem, the fit of the
reconstructions obtained by SA in terms of the Power Law for the
size of the sessions is better using CPLEX but comes at a high com-
putational cost.

Acknowledgements

This research was partially supported by the Millennium
Institute Complex Engineering Systems ICM: P-05-004-F, FBO16.
References

Aarts, E., & Lenstra, J. K. (Eds.). (1997). Local search in combinatorial optimization.
John Wiley & Sons, Inc.

Bayir, M. A., Toroloslu, I. H., Demirbas, M., & Cosar, A. (2012). Discovering better
navigation sequences for the session construction problem. Data & Knowledge
Engineering, 73, 58–72.

Berendt, B., Mobasher, B., Spiliopoulou, M., & Wiltshire, J. (2001). Measuring the
accuracy of sessionizers for web usage analysis. In Proceedings of the workshop
on web mining at the first SIAM international conference on data mining, 7–14,
Chicago, IL.

Cardoso, M., Salcedo, R., & Azevedo, S. (1994). Nonequilibrium simulated annealing:
A faster approach to combinatorial minimization. Industrial Engineering
Chemical Research, 33, 1908–1918.

Catledge, L. D., & Pitkow, J. E. (1995). Characterizing browsing strategies in the
world-wide web. Computer Networks and ISDN Systems, 27(6), 1065–1073.

Černý, V. (1985). Thermodynamical approach to the traveling salesman problem:
An efficient simulation algorithm. Journal of Optimization Theory and
Applications, 45(1), 41–51.

Chitraa, V., & Selvdoos, A. (2010). A survey on preprocessing methods for web usage
data. IJCSIS International Journal of Computer Science and Information Security,
7(3), 78–83.

Choi, J., & Lee, G. (2009). New techniques for data preprocessing based on usage logs
for efficient web user profiling at client side. In Proceedings of the 2009 IEEE/WIC/
ACM international joint conference on web intelligence and intelligent agent
technology (Vol. 3, pp. 54–57). Washington, D.C.

Cooley, R., Mobasher, B., & Srivastava, J. (1997). Web mining: information and
pattern discovery on the world wide web. In Proceedings of the 9th international
conference on tools with artificial intelligence (pp. 558–567). Newport Beach,
California.

Cooley, R., Mobasher, B., & Srivastava, J. (1999). Data preparation for mining world
wide web browsing patterns. Knowledge and Information Systems, 1(1), 5–32.

Dell, R., Roman, P., & Velasquez, J. (2008). Web user session reconstruction using
integer programming. In IEEE/WIC/ACM international conference on web
intelligence and intelligent agent technology (Vol. 1, pp. 385–388). Sydney.
doi:http://dx.doi.org/10.1109/WIIAT.2008.181.

Dell, R., Román, P., & Velásquez, J. (2009). Web User Session Reconstruction with
back button browsing. In Knowledge-based and intelligent information and
engineering systems, lecture notes in computer science (Vol. 5711, pp. 326–332).
doi:http://dx.doi.org/10.1007/978-3-642-04595-0_40.

Devi, B. N., Devi, Y. R., Rani, B. P., & Rao, R. R. (2012). Design and implementation of
web usage mining intelligent system in the field of e-commerce. Procedia
Engineering, 30, 20–27.

Fong, A. C. M., Zhou, B., Hui, S., Tang, T., & Hong, G. (2012). Generation of
personalized ontology based on consumer emotion and behavior analysis. IEEE
Transactions on Affective Computing, 3(2), 152–164.

GAMS. GAMS Home Page. Retrieved from http://www.gams.com/.
Huberman Pirolli Pitkow & Lukose (1998). Strong regularities in world wide web

surfing. Science, 280(5360), 95–97.
Huynh, T., & Miller, J. (2009). Empirical observations on the session timeout

threshold. Information Processing & Management, 45(5), 513–528.
IBM – IBM – Mathematical programming: linear programming, mixed-integer

programming and quadratic programming – IBM ILOG CPLEX optimizer –
software. Retrieved from: http://www-01.ibm.com/software/integration/
optimization/cplex-optimizer/.

Kirkpatrick, S., Gelatt, D., & Vecchi, M. P. (1983). Optimization by simulated
annealing. Science, 220(4598), 671–680.

Kosala, R., & Blockeel, H. (2000). Web mining research: A survey. SIGKDD
Explorations, 2, 1–15.

Lin, S., Jheng, Y., & Yu, C. (2012). Combining ranking concept and social network
analysis to detect collusive groups in online auctions. Expert Systems with
Applications, 39(10), 9079–9086.

Mayer, J. R. & Mitchell, J. C. (2012). Third-party web tracking: Policy and technology,
security and privacy. In IEEE symposium on security and privacy (pp. 413–427).
San Francisco.

Montgomery, D. C., & Runger, G. C. (2006). Applied statistics and probability for
engineers (4th ed.). Wiley.

Nasraoui, O., Soliman, M., Saka, E., Badia, A., & Germain, R. (2008). A web usage
mining framework for mining evolving user profiles in dynamic web sites. IEEE
Transactions on Knowledge and Data Engineering, 20(2), 202–215.

Nohuddin, P. N. E., Coenen, F., Christley, R., Setzkorn, C., Patel, Y., & Williams, S.
(2012). Finding ‘‘interesting’’ trends in social networks using frequent pattern
mining and self organizing maps. Knowledge-Based Systems, 29, 104–113.

Oliveira, J. G., Dezso, Z., Goh, K.-I., Kondor, I., Barabasi, A.-L., & Vazquez (2006).
Modeling bursts and heavy tails in human dynamics. Physical Review E –
Statistical, Nonlinear and Soft Matter Physics, 73(3 Pt 2), 036127.

Pirolli, P., Pitkow, J., & Rao, R. (1996). Silk from a sow’s ear: Extracting usable
structures from the web. In Proceedings of the SIGCHI conference on human
factors in computing systems: common ground (pp. 118–125). Vancouver.

Poženel, M., Mahnic, V., & Kukar, M. (2010). Separation of interleaved web sessions
with heuristic search. In Proceedings of the ICDM ’10, IEEE international
conference on data mining (pp. 411–420). Sidney.

Ravi, S., & Shukla, S. (Eds.). (2009). Fundamental problems in computing: An analysis of
several heuristics for the traveling salesman problem. Dordrecht, Netherlands:
Springer Netherlands.

http://refhub.elsevier.com/S0957-4174(13)00677-5/h0005
http://refhub.elsevier.com/S0957-4174(13)00677-5/h0005
http://refhub.elsevier.com/S0957-4174(13)00677-5/h0010
http://refhub.elsevier.com/S0957-4174(13)00677-5/h0010
http://refhub.elsevier.com/S0957-4174(13)00677-5/h0010
http://refhub.elsevier.com/S0957-4174(13)00677-5/h0015
http://refhub.elsevier.com/S0957-4174(13)00677-5/h0015
http://refhub.elsevier.com/S0957-4174(13)00677-5/h0015
http://refhub.elsevier.com/S0957-4174(13)00677-5/h0020
http://refhub.elsevier.com/S0957-4174(13)00677-5/h0020
http://refhub.elsevier.com/S0957-4174(13)00677-5/h0025
http://refhub.elsevier.com/S0957-4174(13)00677-5/h0025
http://refhub.elsevier.com/S0957-4174(13)00677-5/h0025
http://refhub.elsevier.com/S0957-4174(13)00677-5/h0030
http://refhub.elsevier.com/S0957-4174(13)00677-5/h0030
http://refhub.elsevier.com/S0957-4174(13)00677-5/h0030
http://refhub.elsevier.com/S0957-4174(13)00677-5/h0035
http://refhub.elsevier.com/S0957-4174(13)00677-5/h0035
http://dx.doi.org/10.1109/WIIAT.2008.181
http://dx.doi.org/10.1007/978-3-642-04595-0_40
http://refhub.elsevier.com/S0957-4174(13)00677-5/h0040
http://refhub.elsevier.com/S0957-4174(13)00677-5/h0040
http://refhub.elsevier.com/S0957-4174(13)00677-5/h0040
http://refhub.elsevier.com/S0957-4174(13)00677-5/h0045
http://refhub.elsevier.com/S0957-4174(13)00677-5/h0045
http://refhub.elsevier.com/S0957-4174(13)00677-5/h0045
http://www.gams.com/
http://refhub.elsevier.com/S0957-4174(13)00677-5/h0050
http://refhub.elsevier.com/S0957-4174(13)00677-5/h0050
http://refhub.elsevier.com/S0957-4174(13)00677-5/h0055
http://refhub.elsevier.com/S0957-4174(13)00677-5/h0055
http://www-01.ibm.com/software/integration/optimization/cplex-optimizer/
http://www-01.ibm.com/software/integration/optimization/cplex-optimizer/
http://refhub.elsevier.com/S0957-4174(13)00677-5/h0060
http://refhub.elsevier.com/S0957-4174(13)00677-5/h0060
http://refhub.elsevier.com/S0957-4174(13)00677-5/h0065
http://refhub.elsevier.com/S0957-4174(13)00677-5/h0065
http://refhub.elsevier.com/S0957-4174(13)00677-5/h0070
http://refhub.elsevier.com/S0957-4174(13)00677-5/h0070
http://refhub.elsevier.com/S0957-4174(13)00677-5/h0070
http://refhub.elsevier.com/S0957-4174(13)00677-5/h0145
http://refhub.elsevier.com/S0957-4174(13)00677-5/h0145
http://refhub.elsevier.com/S0957-4174(13)00677-5/h0075
http://refhub.elsevier.com/S0957-4174(13)00677-5/h0075
http://refhub.elsevier.com/S0957-4174(13)00677-5/h0075
http://refhub.elsevier.com/S0957-4174(13)00677-5/h0080
http://refhub.elsevier.com/S0957-4174(13)00677-5/h0080
http://refhub.elsevier.com/S0957-4174(13)00677-5/h0080
http://refhub.elsevier.com/S0957-4174(13)00677-5/h0085
http://refhub.elsevier.com/S0957-4174(13)00677-5/h0085
http://refhub.elsevier.com/S0957-4174(13)00677-5/h0085
http://refhub.elsevier.com/S0957-4174(13)00677-5/h0090
http://refhub.elsevier.com/S0957-4174(13)00677-5/h0090
http://refhub.elsevier.com/S0957-4174(13)00677-5/h0090


1600 T. Arce et al. / Expert Systems with Applications 41 (2014) 1593–1600
Román, P. E., Dell, R. F., & Velásquez, J. D. (2010). Advanced technique in web data
pre-processing and cleaning. In J. D. Velásquez & L. Jain (Eds.). Advances in
techniques in web intelligence-1 (Vol. 311, pp. 19–48). Heidelberg: Springer
Verlag.

Román, P. E., L’Huillier, G., & Velásquez, J. D. (2010). Web usage mining. In J. D.
Velásquez & L. Jain (Eds.). Advances in techniques in web intelligence-1 (Vol. 311,
pp. 143–165). Heidelberg: Springer Verlag.

Suman, B., & Kumar, P. (2005). A survey of simulated annealing as a tool for single
and multiobjective optimization. Journal of the Operational Research Society,
57(10), 1143–1160.

Talbi, E. G. (2009). Metaheuristics: From design to implementation. John Wiley and
Sons.

Tao, Y. H., Hong, T. P., Lin, W. Y., & Chiu, W. Y. (2009). A practical extension of web
usage mining with intentional browsing data toward usage. Expert Systems with
Applications, 36(2), 3937–3945.
Velásquez, J. D. (2013). Web mining and privacy concerns: some important legal
issues to be consider before applying any data and information extraction
technique in web-based environments. Expert Systems with Applications, 40(13),
5228–5239.

Velásquez, J. D., & Palade, V. (2008). Adaptive web sites: A knowledge extraction from
web data approach. Amsterdam: IOS Press.

Wang, T., & Ren, Y. (2009). Research on personalized recommendation based on
web usage mining using collaborative filtering technique. WSEAS Transactions
on Information Science and Applications, 6(1), 62–72.

Wang, K., Ting, I., & Wu, H. (2013). Discovering interest groups for marketing in
virtual communities: An integrated approach. Journal of Business Research, 66(9),
1360–1366.

Yin, P., & Guo, Y. (2013). Optimization of multi-criteria website structure based on
enhanced tabu search and web usage mining. Applied Mathematics and
Computation, 219(24), 11082–11095.

http://refhub.elsevier.com/S0957-4174(13)00677-5/h0100
http://refhub.elsevier.com/S0957-4174(13)00677-5/h0100
http://refhub.elsevier.com/S0957-4174(13)00677-5/h0100
http://refhub.elsevier.com/S0957-4174(13)00677-5/h0100
http://refhub.elsevier.com/S0957-4174(13)00677-5/h0095
http://refhub.elsevier.com/S0957-4174(13)00677-5/h0095
http://refhub.elsevier.com/S0957-4174(13)00677-5/h0095
http://refhub.elsevier.com/S0957-4174(13)00677-5/h0105
http://refhub.elsevier.com/S0957-4174(13)00677-5/h0105
http://refhub.elsevier.com/S0957-4174(13)00677-5/h0105
http://refhub.elsevier.com/S0957-4174(13)00677-5/h0110
http://refhub.elsevier.com/S0957-4174(13)00677-5/h0110
http://refhub.elsevier.com/S0957-4174(13)00677-5/h0115
http://refhub.elsevier.com/S0957-4174(13)00677-5/h0115
http://refhub.elsevier.com/S0957-4174(13)00677-5/h0115
http://refhub.elsevier.com/S0957-4174(13)00677-5/h0120
http://refhub.elsevier.com/S0957-4174(13)00677-5/h0120
http://refhub.elsevier.com/S0957-4174(13)00677-5/h0120
http://refhub.elsevier.com/S0957-4174(13)00677-5/h0120
http://refhub.elsevier.com/S0957-4174(13)00677-5/h0125
http://refhub.elsevier.com/S0957-4174(13)00677-5/h0125
http://refhub.elsevier.com/S0957-4174(13)00677-5/h0140
http://refhub.elsevier.com/S0957-4174(13)00677-5/h0140
http://refhub.elsevier.com/S0957-4174(13)00677-5/h0140
http://refhub.elsevier.com/S0957-4174(13)00677-5/h0135
http://refhub.elsevier.com/S0957-4174(13)00677-5/h0135
http://refhub.elsevier.com/S0957-4174(13)00677-5/h0135
http://refhub.elsevier.com/S0957-4174(13)00677-5/h0130
http://refhub.elsevier.com/S0957-4174(13)00677-5/h0130
http://refhub.elsevier.com/S0957-4174(13)00677-5/h0130

	Identifying web sessions with simulated annealing
	1 Introduction
	2 Methods and materials
	2.1 The web sessionization problem
	2.2 Quality of the reconstruction
	2.3 Simulated annealing (SA)
	2.4 Representation of a solution with SA
	2.5 Generation of an initial solution
	2.6 Generation of the neighboring solution
	2.7 Evaluation function
	2.8 Selection of SA parameters
	2.9 Algorithm for WSP
	2.10 Equipment used
	2.11 Test instances
	2.12 Metrics used
	2.13 Execution of the experiment

	3 Results
	3.1 General results
	3.2 Evolution of the best reconstruction
	3.3 Comparison with the exact method

	4 Discussion and conclusions
	Acknowledgements
	References