A hybrid model for business process event and outcome prediction

Article DOI: 10.1111/exsy.12079

Mai Le,(1,2) Bogdan Gabrys(1) and Detlef Nauck(2)
(1) School of Design, Engineering and Computing, Bournemouth University, Bournemouth, UK
E-mail: mai.phuong@bt.com; bgabrys@bournemouth.ac.uk
(2) BT Research and Technology, Ipswich, UK
E-mail: detlef.nauck@bt.com

Abstract: Large service companies run complex customer service processes to provide communication services to their customers. The flawless execution of these processes is essential because customer service is an important differentiator. Companies must also be able to predict whether processes will complete successfully or run into exceptions in order to intervene at the right time, preempt problems and maintain customer service. Business process data are sequential in nature and can be very diverse. Thus, there is a need for an efficient sequential forecasting methodology that can cope with this diversity. This paper proposes two approaches, a sequential k nearest neighbour and an extension of Markov models, both with an added component based on sequence alignment. The proposed approaches exploit temporal categorical features of the data to predict the next process steps using higher order Markov models and the process outcomes using a sequence alignment technique. The diversity of the data is addressed by considering subsets of similar process sequences based on k nearest neighbours. We have shown, via a set of experiments, that our sequential k nearest neighbour offers better results when compared with the original k nearest neighbour, and that our extended Markov model outperforms random guessing, plain Markov models and hidden Markov models.

Keywords: artificial intelligence, method, system

1. Introduction

Very often, processes are poorly documented because they have evolved over time, or the legacy information technology systems that were used to implement them may have changed. This adds the additional challenge of identifying the actual process model. Process mining (van der Aalst, 2011) can reconstruct process models from workflow data to some extent by searching for a model that explains the observed workflow. This process model contains the list of all unique process prototypes (workflow sequences). Data from this mining process can help to address the problem of proactive process monitoring and process event prediction.

Firstly, in order to predict process events, we consider approaches from data mining (Berry & Linoff, 2004). Because of the nature of the data, approaches that can deal with temporal sequence data are the most relevant. One example from this class of approaches is Markov models (MMs), which have been used to study stochastic processes. In the work of Pitkow and Pirolli (1999), for example, it has been shown that Markov models are well suited to the study of Web users' browsing behaviour. Event sequences are used to train a Markov model that encodes the transition probabilities between subsequent events. Event sequences can represent customer behaviour, customer transactions, workflow in business processes and so on. The prediction of the continuation of a given sequence is based on the transition probabilities encoded in the Markov model. The order of a Markov model represents the number of past events that are taken into account for predicting the subsequent event.
Intuitively, higher order Markov models are more accurate than lower order Markov models because they use more of the information present in the data. This has been shown to be true in the prediction of Web browsing behaviour. Lower order models may not be able to discriminate between two subsequences, especially when the data are very diverse, that is, if there are a large number of unique sequences. In the work of Ruta and Majeed (2011), it has been shown that higher order Markov models can be more accurate than lower order models in the context of customer service data. However, higher order models can suffer from weak coverage, in particular when the data are diverse and not clustered. For sequences that are not covered by the model, a default prediction is required. Default predictions typically reduce the accuracy of the model. It is also obvious that with increasing order of the model, the computational complexity grows. In the case of plain Markov models, there are already approaches to deal with the trade-off between coverage and accuracy. These approaches have been introduced in the studies of Deshpande and Karypis (2004), Eirinaki et al. (2005) and Pitkow and Pirolli (1999). The general idea is to merge the transition states of different order Markov models, which is followed by pruning 'redundant' states. A particular method is the selective Markov model, an extension of All Kth order Markov models (Deshpande & Karypis, 2004).

Secondly, to predict the process outcomes, bearing in mind that the data can be very diverse, our idea is to find a number of sequences from the historical data that are most similar to the given sequence, expecting that they will have a similar outcome. K nearest neighbour (KNN) is chosen as the predictive model here. In our area of application, we consider a number of sequences in which we want to find the K sequences that are most similar to a given sequence $S_j = (s_1^{(j)}, \ldots, s_{n_j}^{(j)})$, where $j$ is the index of the sequence. The similarity is determined using a distance function, and the prediction is the majority class among the classes of the K nearest sequences. This method was used by Ruta et al. (2006) to predict churn using data from a telecommunication company. The authors combined KNNs with survival theory and used Euclidean distances to calculate the distance between the given sequence and all sequences in the data sample. In this study, as our data consist of event sequences, we use the edit distance when comparing two sequences. In order to exploit the temporal characteristics of process data, we combine KNN with sequence alignment from biology to form a sequential KNN.

In this paper, we first present a hybrid approach to address the lack of coverage in higher order Markov models. This approach is a combination of pure Markov models and a sequence alignment technique (Waterman, 1994). The sequence alignment technique is applied when a default prediction would otherwise be required, that is, when the given sequence cannot be found in the transition matrix. The matching procedure is applied in order to extract those sequences (patterns) from the transition matrix that are most similar to the given sequence. We determine the prediction for the given sequence based on the predictions for the sequences we obtained. Secondly, we present a sequential KNN approach that is used to predict process outcomes.
Our sequential KNN matches sequences using local alignment. The output of the matching is used to rank the similarities of sequences. We test our approaches on workflow data from a telecommunication business process, but they are likely to be applicable in a variety of other sequential prediction problems, especially, in the case of our extended Markov model, where Markov models have already been used and shown to be relevant.

The rest of this paper is organised as follows. Sequence alignment and distance measurement are presented in Section 2. Section 3 introduces the predictive models that are used in this study. Section 4 then presents the experimental results and evaluation. Finally, in Section 5, our conclusions and directions for future work are discussed.

2. Sequence alignment

To determine sequence similarity, both distance measures and similarity measures can be used because distance and similarity are dual concepts. For numeric variables, well-known distance measures exist and can be easily applied. However, in sequence analysis, we sometimes have to work with sequences that are constructed from symbols, for example, categories (churn prediction), phonemes (speech recognition) or characters (handwriting recognition). In this case, we need specialised functions that can measure the similarity of symbolic sequences.

Sequence alignment is very common in bio-informatics and has a relatively long history in this domain. The target entities of sequence alignment in bio-informatics are amino acid sequences of proteins, DNA sequences and so on. Sequence alignment is used for a number of purposes (Needleman & Wunsch, 1970); for example, to compare new DNA sequences with DNA databases, or to compare translations of DNA sequences into amino acid sequences with protein databases in order to verify whether the relationships found between them could have occurred by chance. Algorithms used in sequence alignment are mainly divided into global alignment and local alignment. Global alignment provides a globally optimal solution, which spans the entire length of all query sequences. In contrast, local alignment aims to find the most similar segments from two query sequences. We use the idea of sequence alignment in Markov models but do not use existing sequence alignment algorithms. Instead, we build a simple alignment procedure for short subsequences. We also use the local alignment algorithm and combine it with KNNs. The local alignment algorithm is presented in the following.

Given two sequences, we have to consider the order of the events and compute the score of matching the ith event in one sequence with the jth event in the other sequence, i = {1, …, len1}, j = {1, …, len2}, where len1 and len2 are the lengths of the two given sequences. The aim of the local alignment algorithm (Smith & Waterman, 1981; Waterman, 1994) is to find a pair of most similar segments in the given sequences. There are two key matrices used in local alignment: the substitution matrix and the score matrix.

1. Substitution matrix: In biology, a substitution matrix describes the rate at which one amino acid in a sequence transforms to another amino acid over time. The entries of this matrix represent the probabilities of one amino acid transforming (mutating) into another. There are different ways of generating the substitution matrix.
The simplest way is to not take amino acid mutations into account and simply give a score of 1 to a pair of identical amino acids and a score of 0 to a pair of different amino acids:

$$s(i, j) = \begin{cases} 0 & \text{if event } i \neq \text{event } j \\ 1 & \text{otherwise} \end{cases}$$

In this case, the substitution matrix is an identity matrix: the elements of the main diagonal are 1 and all the others are 0. To represent mutations, a more complicated form of substitution matrix is used, in which the elements off the main diagonal can have a value different from 0, depending on the likelihood and frequency of transformation between the two amino acids.

2. Score matrix: A score matrix is built based on the following formulas:

$$h_{i0} = h_{0j} = h_{00} = 0 \qquad (1)$$

where $h_{i0}$, $h_{0j}$ and $h_{00}$ are the initial values for the recursive formula that is used to compute $h_{ij}$:

$$h_{ij} = \max \{\, h_{i-1,j} - \delta,\; h_{i-1,j-1} + s(x_i, y_j),\; h_{i,j-1} - \delta,\; 0 \,\} \qquad (2)$$

where $x_i$ and $y_j$ are the events at positions $i$ and $j$ in the given sequences, and $s(x_i, y_j)$ is the score from the substitution matrix corresponding to events $x_i$ and $y_j$.

Example: Given two sequences ABCDE and EBCAD, the corresponding score matrix is as follows:

      A  B  C  D  E
  E   0  0  0  0  1
  B   0  1  0  0  0
  C   0  0  2  0  0
  A   1  0  0  1  0
  D   0  0  0  1  0

The ith event in a sequence can be aligned to the jth event in another sequence, or it can be aligned to nothing (deletion). This leads to a number of possible matchings of the two sequences. The optimal pair of aligned segments is identified by first finding the highest score in the matrix; this element is the end of the optimal aligned path. Then, the path is traced back from that highest score diagonally up towards the top left corner until 0 is reached. In the example, the best match is BC.
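As a concrete illustration of equations (1) and (2), the following is a minimal Python sketch of this local alignment scheme (the function and variable names are ours; the authors' own implementation was in MATLAB). It fills the score matrix using the identity substitution scores and traces back diagonally from the highest score. For ABCDE and EBCAD it recovers BC, as in the example above, although individual matrix entries can differ from the printed table depending on how mismatches and gaps are penalised.

```python
# Minimal local alignment (Smith-Waterman style) with identity substitution
# scores and gap penalty delta, as described in Section 2.

def local_alignment(seq1, seq2, delta=1.0):
    n, m = len(seq1), len(seq2)
    # Score matrix h with an extra zero row/column for initialisation (eq. 1).
    h = [[0.0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0.0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = 1.0 if seq1[i - 1] == seq2[j - 1] else 0.0  # substitution score
            h[i][j] = max(h[i - 1][j] - delta,          # deletion
                          h[i - 1][j - 1] + s,          # match/mismatch
                          h[i][j - 1] - delta,          # insertion
                          0.0)                          # eq. 2
            if h[i][j] > best:
                best, best_pos = h[i][j], (i, j)
    # Trace back diagonally from the highest score until 0 is reached.
    segment = []
    i, j = best_pos
    while i > 0 and j > 0 and h[i][j] > 0:
        segment.append(seq1[i - 1])
        i, j = i - 1, j - 1
    return best, segment[::-1]

score, segment = local_alignment("ABCDE", "EBCAD")
print(score, segment)  # 2.0 ['B', 'C'] -- the best match is BC
```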
3. Predictive models

This section presents Markov models, KNNs and their extensions, which can help enhance the current approaches to process management in a telecommunication business context by predicting future events and process outcomes based on current and past data. Here, a business process instance $S_j$ is a composition of discrete events (or tasks) in a time-ordered sequence, $S_j = \{s_1^{(j)}, s_2^{(j)}, \ldots, s_{n_j}^{(j)}\}$, where $s^{(j)}$ takes values from a finite set of event types $E = \{e_1, \ldots, e_L\}$. Apart from its starting time $t_i^{(j)}$ and duration $T_i^{(j)}$, each of these events has its own attributes. For simplicity, we assume that a process does not contain any overlapping events, which means there are no parallel structures. The goals of our predictive models are to predict the next event $s_{i+1}^{(N+1)}$ following event $s_i^{(N+1)}$ in a given process instance $S_{N+1} = \{s_1^{(N+1)}, s_2^{(N+1)}, \ldots, s_i^{(N+1)}\}$ based on the data from completed process instances $S = \{S_1, S_2, \ldots, S_N\}$, or to predict the process outcome (success/failure). A simple predictive model is illustrated in Figure 1.

Figure 1: A simple predictive model.

In the following subsections, we discuss some examples of process prediction models, namely Markov models, hybrid Markov models (MSAs), hidden Markov models (HMMs), KNNs and sequential KNNs (KnsSAs).

3.1. Markov models

Markov models are a kind of stochastic model. They are based on the mathematical theory of Markov chains. Markov models (Papoulis & Pillai, 1991) have been used to study stochastic processes in terms of modelling and predicting customer behaviour. The idea of the model is to use k > 0 steps to predict the following (k + 1)th step. Given a sequence of random variables $\{X_n\}$, a first order Markov model uses the current step $x_{i-1}$ to predict the next step $x_i$ and, to generalise, a kth order Markov model uses the last k steps to predict the next one:

$$p(x_i \mid x_{i-1}, \ldots, x_{i-n}) = p(x_i \mid x_{i-1}, \ldots, x_{i-k}) \qquad (3)$$

To construct a Markov model, a matrix M, M = $(m_{ij})$, is created containing the probabilities for moving from the previous k steps to the next. The columns of the matrix correspond to the unique tasks, i = (1, …, I), and the rows of the matrix correspond to the unique patterns of length k that are present in the training data set used to construct the model. These unique patterns are called states, j = (1, …, J). The elements $m_{ij}$ of matrix M are the probabilities $p_{ij}$:

$$p_{ij} = \frac{n_{ij}}{n_i} \qquad (4)$$

where $n_i$ is the number of times pattern i occurs in the data set, and $n_{ij}$ is the number of times pattern i is followed by step j in the data set.

Higher order Markov models can be expected to be more accurate. However, we have to take into account their weak coverage property. For example, consider using a Markov model to predict Web page access and consider a sequence of Web pages {P1, P2, P1, P3, P2, P1} that is used to create the model. In a second order Markov model, there would be no sample {P2, P3}. Consequently, the prediction for this pattern would have to be the default prediction of the model, that is, the most frequent step (P1 in this example). One approach to overcome the problem is to merge the transition states of different order Markov models after pruning the 'redundant' states. A particular method is the selective Markov model, an extension of All Kth order Markov models (Deshpande & Karypis, 2004).

Markov models are often not dynamic: once the model is trained, the obtained matrix is fixed and used for predictions. It is important that the training data set is large enough to be representative of the encountered patterns. In order to accommodate changes in data, the matrix can be rebuilt from scratch after a certain period. It is also straightforward to create a dynamic Markov model that adapts to new data by storing the counts $n_{ij}$ and $n_i$ and updating M with additional rows and/or columns if new patterns and/or steps are encountered in the new data. We can also apply a discounting factor to the current counts to give more weight to the pattern frequencies in new data.
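Equation (4) is straightforward to implement. The following minimal Python sketch (names are ours; the authors' implementation was in MATLAB) builds the transition structure of a kth order Markov model from training sequences and predicts the next task, falling back to the most frequent task as the default prediction:

```python
from collections import Counter, defaultdict

def train_markov(sequences, k=2):
    """Build the transition 'matrix' of a kth order Markov model as
    nested counts: pattern (tuple of k tasks) -> next task -> n_ij."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for t in range(len(seq) - k):
            counts[tuple(seq[t:t + k])][seq[t + k]] += 1   # n_ij
    # Normalise: p_ij = n_ij / n_i (equation 4).
    model = {p: {task: n / sum(c.values()) for task, n in c.items()}
             for p, c in counts.items()}
    # Default prediction: the most frequent task overall.
    default = Counter(s for seq in sequences for s in seq).most_common(1)[0][0]
    return model, default

def predict_next(model, default, recent):
    """Predict the task following the last k observed tasks; an unseen
    pattern falls back to the default prediction."""
    row = model.get(tuple(recent))
    return default if row is None else max(row, key=row.get)

model, default = train_markov([["P1", "P2", "P1", "P3", "P2", "P1"]], k=2)
print(predict_next(model, default, ["P2", "P3"]))  # unseen pattern -> 'P1'
```

For the Web page example above, the unseen pattern {P2, P3} falls back to the default prediction P1, exactly the behaviour the MSA module of Section 3.3 is designed to improve on.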
3.2. Hidden Markov models

In the field of sequential data, HMMs are a very powerful tool. Similar to Markov models, HMMs are a type of stochastic model. They are also based on the mathematical theory of Markov processes, which Baum further developed into the theory of HMMs in the 1960s (Baum et al., 1970). HMMs received much attention because of their application in speech recognition, and there is a wide range of further applications, from telecommunication and recognition (gesture, speech and so on) to finance and biology. Many approaches to temporal data series only deal with observed variables. We can consider how the performance of the model changes after adding an unobservable or hidden variable that we assume to have an influence on the observed variables. HMMs assume that there is a latent variable that influences the values of the observed variable. The benefits of this approach are shown, for example, in the work of Cox and Popken (2002). As an example of an application domain, consider modelling customer behaviour based on customer event sequences (e.g. historic purchases and contacts with customer service). A latent variable that influences customer behaviour could be the level of customer satisfaction.

Let $O = \{o_1, \ldots, o_N\}$ be the sample of observed variables and $Q = \{q_1, \ldots, q_N\}$ a sequence of hidden states. The observed variable can be continuous and follow a (usually Gaussian) distribution, or it can be discrete and take values from a finite set $V = \{v_1, \ldots, v_K\}$. The hidden states are discrete, and their value space is $\Omega = \{s_1, \ldots, s_M\}$. The joint probability of an observed sequence and a hidden state sequence factorises as follows:

$$p(o_1, o_2, \ldots, o_n, q_1, \ldots, q_n) = p(q_1) \prod_{i=2}^{n} p(q_i \mid q_{i-1}) \prod_{i=1}^{n} p(o_i \mid q_i) \qquad (5)$$

Because of the complexity of the computation, lower order hidden Markov models are more popular. The most popular one is the first order hidden Markov model, which assumes that the current state depends only on the preceding state and is independent of the earlier states:

$$p(q_n \mid q_{n-1}, o_{n-1}, \ldots, q_1, o_1) = p(q_n \mid q_{n-1}) \qquad (6)$$

Intuitively, higher order HMMs can be expected to be more accurate (Thede & Harper, 1999). However, the lack of coverage problem needs to be considered, as mentioned in Section 3.1, and it is obvious that with increasing order of the model we face more complex computation. An HMM is defined by a set of three parameters $\lambda = (A, B, \pi)$:

$$A = (a_{ij}) = (p(q_i \mid q_j)) \qquad (7)$$

where $A$ is the transition matrix and an element $a_{ij}$ is the probability of moving from state $j$ to state $i$. It is necessary to point out that only homogeneous HMMs are considered here, implying the time independence of the transition matrix; the non-homogeneous type is discussed by Netzer et al. (2007). Furthermore,

$$B = (b_{ij}) = (p(o_i \mid q_j)) \qquad (8)$$

where $B$ is the matrix containing the probabilities of observing $o_t = v_i$, denoted as $o_i$, if the hidden state is $q_j$ at time $t$. Finally,

$$\pi_j = p(q_1 = s_j) \qquad (9)$$

is the probability that $s_j$ is the initial state at time $t = 1$.

There are three fundamental problems in HMMs: estimation, evaluation and decoding. The estimation problem is about finding the optimal set of parameters $\lambda = (A, B, \pi)$ that maximises $p(O \mid \lambda)$, given a sequence $O$ of observed states and some initial model $\lambda'$; the Baum–Welch algorithm is used for this. The evaluation problem means finding $p(O \mid \lambda)$ given the sequence of observations $O$ and the model $\lambda$, using the forward (or backward) algorithm. To solve the decoding problem, the Viterbi algorithm is applied to find the most suitable sequence of hidden states $q_1, \ldots, q_k$ for the associated sequence $O$ and model $\lambda$.

Researchers have been working on improving the performance of HMMs. Similar to Markov models, there are approaches to deal with the trade-off between coverage and accuracy. However, merging the transition states of higher order hidden Markov models is much more complicated than for Markov models.
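As an illustration of the evaluation problem, here is a minimal numpy sketch (our names, toy numbers) of the forward algorithm, computing $p(O \mid \lambda)$ for a discrete first order HMM parameterised as in equations (7)–(9):

```python
import numpy as np

def forward(obs, A, B, pi):
    """Evaluation problem: compute p(O | lambda) for a discrete HMM.

    obs : sequence of observation indices o_1..o_n (values in 0..K-1)
    A   : M x M transition matrix, A[i, j] = p(q_t = i | q_{t-1} = j) (eq. 7)
    B   : K x M emission matrix,   B[v, j] = p(o_t = v | q_t = j)     (eq. 8)
    pi  : length-M initial distribution, pi[j] = p(q_1 = s_j)         (eq. 9)
    """
    alpha = pi * B[obs[0], :]              # joint p(o_1, q_1)
    for o in obs[1:]:
        alpha = B[o, :] * (A @ alpha)      # propagate one step and emit
    return alpha.sum()                     # marginalise the last hidden state

# Toy model: 2 hidden states, 3 observation symbols (columns sum to 1).
A = np.array([[0.7, 0.4],
              [0.3, 0.6]])
B = np.array([[0.5, 0.1],
              [0.4, 0.3],
              [0.1, 0.6]])
pi = np.array([0.6, 0.4])
print(forward([0, 1, 2], A, B, pi))        # p(O | lambda)
```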
3.3. Hybrid Markov models

The work presented in this paper investigates the idea of combining higher order Markov models with a sequence alignment technique (Waterman, 1994) in order to maintain the high accuracy of the higher order Markov models and, at the same time, compensate for their lack of coverage. When a higher order Markov model is employed to predict process events and the given sequence has no match in the transition matrix M, we would be forced to produce a default prediction, which would be detrimental to the model accuracy. To this end, we present our approach, MSA (Markov sequence alignment), which is based on the assumption that similar sequences are likely to produce the same outcome. Basically, the steps that MSA takes to identify the next step for a previously unseen sequence $S_{new}$ that has no match in the transition matrix are as follows:

• Compare $S_{new}$ with all sequences (patterns and states) in the transition matrix, and extract the l states that are most similar.
• Generate the prediction based on these l selected states.

For each state i in the transition matrix M = $(m_{ij})$, there is a corresponding predictive row vector $(m_{i1}, \ldots, m_{iJ})$. The predicted next step of state i is the task j with $m_{ij} = \max(m_{i1}, \ldots, m_{iJ})$. Thus, the predictions for the l selected states contain $l^*$ unique events, $l^* \le l$, $E^* = \{e_1, \ldots, e_{l^*}\}$. The final prediction for $S_{new}$ is then the most frequent event in $E^*$. If, however, there is no single most frequent event in $E^*$, the one with the highest transition probability is selected. Alternatively, we go back to the l most similar states, select the sequence that is most similar to $S_{new}$ and use its predicted step as the prediction for $S_{new}$. If there is no single most similar sequence, we select the one with the highest frequency from this set.

In order to compare $S_{new}$ with the sequences in the transition matrix, we need to be able to define and measure their similarities or distances. Here, the sequence alignment technique is applied (Waterman, 1994). In particular, we adopt the edit (deletion and insertion) weighted distance function for its flexibility in determining the degree of similarity. Given two sequences, a weight is assigned to each pair of events at the same position in the two sequences. The weight is 0 if the two events are identical, and 1 or δ (a specified value) if they are different. If both events of the pair must be deleted in order to keep the longest possible similar subsequences, the corresponding weight is 1. If only one of the two events must be deleted, the weight is δ. The sum of the weights of the comparison is then used to determine the similarity of the sequences: the lower the weight, the more similar the two sequences.

This algorithm starts by constructing a matrix whose elements represent the similarity between any pair of events from the two sequences to be matched. This matrix is then used to find the most suitable editing of the given sequences, where deletions and insertions are penalised. The optimal editing is chosen based on the total score computed. For example, given two sequences ABCDE and EBCAD, the matrix looks as follows:

      A  B  C  D  E
  E   0  0  0  0  1
  B   0  1  0  0  0
  C   0  0  1  0  0
  A   1  0  0  0  0
  D   0  0  0  1  0

First, the initial events A and E are deleted from both sequences, which results in a weight of 1. The second event in both sequences is B and has weight 0, as does the matching C. The fourth event A is deleted from the sequence EBCAD in order to match the two Ds in the fourth and fifth positions of the two sequences, which adds δ to the sum of weights. Finally, we delete the last event E from the sequence ABCDE, adding another δ. The total weight for matching the two sequences is w = 1 + 0 + 0 + δ + δ = 1 + 2δ.
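Algorithms 1 and 2, presented later in this section, implement this matching by exhaustively enumerating common subsequences of the two sequences. As a simplified illustration only (names are ours), the following Python sketch keeps one longest common subsequence found by standard dynamic programming instead of scanning all of them, then applies the same gap weighting as the matching procedure: weight 1 per pair deletion and δ per single deletion. It reproduces the worked example above, but, unlike Algorithm 1, it does not break ties between equally long common subsequences by score.

```python
def lcs_positions(s1, s2):
    """One longest common subsequence, as two lists of 1-based positions."""
    n, m = len(s1), len(s2)
    L = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if s1[i - 1] == s2[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    idx1, idx2, i, j = [], [], n, m
    while i > 0 and j > 0:                 # trace one optimal path back
        if s1[i - 1] == s2[j - 1]:
            idx1.append(i); idx2.append(j)
            i -= 1; j -= 1
        elif L[i - 1][j] >= L[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return idx1[::-1], idx2[::-1]

def match_weight(s1, s2, delta=0.4):
    """Total weight of editing s1 into s2 around their common events
    (cf. calculateScore in Algorithm 2)."""
    idx1, idx2 = lcs_positions(s1, s2)
    idx1, idx2 = idx1 + [len(s1) + 1], idx2 + [len(s2) + 1]  # sentinels
    w, last1, last2 = 0.0, 0, 0
    for p1, p2 in zip(idx1, idx2):
        gap1, gap2 = p1 - last1 - 1, p2 - last2 - 1
        w += min(gap1, gap2) * 1.0         # pair deletions: weight 1
        w += abs(gap1 - gap2) * delta      # single deletions: weight delta
        last1, last2 = p1, p2
    return w

print(match_weight("ABCDE", "EBCAD"))      # 1 + 2*delta = 1.8 for delta = 0.4
```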
The following example illustrates our method for the case of the transition matrix of a third order Markov model

          A    B    C
  ABC   0.1  0.2  0.7
  CBC   0.1  0.5  0.4
  BAA   0.2  0.5  0.3

and a given sequence BCC. This sequence has not occurred before and is not stored in the matrix. The aim is to find the most similar sequences in the matrix and use their predictions to generate the prediction for the given sequence BCC. We first match BCC to ABC:

      A  B  C
  B   0  1  0
  C   0  0  1
  C   0  0  1

The weight of this comparison is w = δ + 0 + δ = 2δ. Similarly, the resulting weights of matching CBC and BAA against BCC are w = 2δ and w = 2, respectively. With δ = 0.4 and l = 2, the two chosen sequences are then ABC and CBC, and their predictive vectors are (0.1, 0.2, 0.7) and (0.1, 0.5, 0.4), respectively. Based on these vectors, the predicted next steps are {B, C}. The weights and frequencies of occurrence of these two events are equal; hence, the next step of the given sequence BCC is predicted to be C because of the higher transition probability, $m_{13} = 0.7$, which is greater than $m_{22} = 0.5$. Algorithms 1 and 2 illustrate the procedure of matching two sequences by deletion and insertion of symbols with penalties.

Algorithm 1 Function scan(row, col, length): identifies the optimal editing process to match two sequences via deletion/insertion of symbols
 1: commonIndexes1(length) = row
 2: commonIndexes2(length) = col
 3: if (length > maxCommonLength) then            ▹ store the longest path
 4:   maxCommonLength = length
 5:   for i = 1 : maxCommonLength do
 6:     maxCommonIndexes1(i) = commonIndexes1(i)
 7:     maxCommonIndexes2(i) = commonIndexes2(i)
 8:   end for
 9:   maxCommonScore = calculateScore(n, maxCommonIndexes1, maxCommonIndexes2, maxCommonLength)
10: else
11:   if (length == maxCommonLength) then
12:     score = calculateScore(n, commonIndexes1, commonIndexes2, length)
13:     if (score < maxCommonScore) then          ▹ if the paths are equally long, store the smallest score
14:       for i = 1 : maxCommonLength do
15:         maxCommonIndexes1(i) = commonIndexes1(i)
16:         maxCommonIndexes2(i) = commonIndexes2(i)
17:       end for
18:       maxCommonScore = score
19:     end if
20:   end if
21: end if

Algorithm 2 Function calculateScore(n, maxCommonIndexes1, maxCommonIndexes2, maxCommonLength): computes the score of matching two sequences
 1: totalPoints = 0
 2: totalDeltas = 0
 3: maxCommonIndexes1(maxCommonLength + 1) = n + 1
 4: maxCommonIndexes2(maxCommonLength + 1) = n + 1
 5: last1 = 0
 6: last2 = 0
 7: for i = 1 : maxCommonLength + 1 do
 8:   dist1 = maxCommonIndexes1(i) - last1 - 1
 9:   dist2 = maxCommonIndexes2(i) - last2 - 1
10:   if (dist1 > dist2) then
11:     noPoints = dist2
12:     noDeltas = dist1 - dist2
13:   else
14:     noPoints = dist1
15:     noDeltas = dist2 - dist1
16:   end if
17:   totalPoints = totalPoints + noPoints
18:   totalDeltas = totalDeltas + noDeltas
19:   last1 = maxCommonIndexes1(i)
20:   last2 = maxCommonIndexes2(i)
21: end for
22: return totalPoints + totalDeltas * deltaValue

3.4. KNNs combined with sequence alignment

KNN is one of the classical approaches in data mining (Berry & Linoff, 2004). It is essentially a non-parametric approach; hence, one of its advantages is that no training procedure is required. The model automatically selects sequences from the continuously updated data. The only configuration possible is the choice of K ≥ 1, that is, how many nearest neighbours we want to use. KNN can be used in its original non-sequential form (Eastwood & Gabrys, 2009), or it can be extended into a sequential approach (Ruta et al., 2006). The core idea is to find similar sequences, expecting that these sequences have a common behaviour and outcome. This is a reasonable expectation, for example, in biology, where similar DNA or protein sequences are expected to have the same shape and function.

Because of the sequential nature of business processes, we want to extract the K similar sequences in terms of their temporal characteristics and not in terms of numerical quantities. This can be seen as a string comparison problem. Our idea is to adopt the sequence alignment approach from biology and combine it with KNN. The resulting approach is named K nearest sequence with sequence alignment (KnsSA). By combining KNN with sequence alignment, we can compare symbolic sequences sequentially. Given N sequences, our approach builds a distance matrix $(d_{ij})_{N \times N}$ storing the alignment scores for matching any two sequences. The elements of the distance matrix are obtained by matching every pair of sequences using local alignment and are then used to rank the sequences.
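The following is a minimal, self-contained Python sketch of KnsSA (names and toy data are ours). The similarity measure is the local alignment score from Section 2, and, in line with the non-parametric character of KNN, there is no training phase: the query is scored against the labelled historical sequences and the majority outcome of the K most similar ones is returned.

```python
from collections import Counter

def local_score(a, b, delta=1.0):
    """Best local alignment score of a and b (identity substitution scores),
    as in the Section 2 sketch; higher score = more similar."""
    h = [[0.0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            s = 1.0 if a[i - 1] == b[j - 1] else 0.0
            h[i][j] = max(h[i - 1][j] - delta, h[i - 1][j - 1] + s,
                          h[i][j - 1] - delta, 0.0)
            best = max(best, h[i][j])
    return best

def knssa_predict(query, history, k=3):
    """KnsSA: majority outcome of the k sequences most similar to the query.

    history is a list of (sequence, outcome) pairs with outcome in {0, 1}."""
    ranked = sorted(history, key=lambda it: local_score(query, it[0]),
                    reverse=True)
    return Counter(outcome for _, outcome in ranked[:k]).most_common(1)[0][0]

history = [("ABCD", 1), ("ABCE", 1), ("EEDC", 0), ("EDDC", 0), ("ABDC", 1)]
print(knssa_predict("ABCC", history, k=3))   # majority vote -> 1
```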
4. Evaluation

Having defined MSA and KnsSA, we now present a discussion of our empirical evaluation. We assess the efficiency of our approaches and explore potential future research directions for employing our extended Markov model in business process event forecasting and our sequential symbolic KNN in business process outcome forecasting. Our models have been implemented in MATLAB and run against four data sets of process instances from the records of a major telecommunication company. The first data set (DS1) consists of telecommunication line fault repair records for a 9-month period. The second (DS2) covers a 1-month period and represents a process for fixing broadband faults. The third data set (DS3) is from a different fault repair process. Data sets DS1 and DS2 are used in the experiments with our extended Markov model. Data sets DS3 and DS4 are used in the KnsSA experiments.

4.1. Process analysis and data preprocessing

In our data sets, the population of process instances turned out to be very diverse and not straightforward to work with. For instance, DS1 is very diverse and contains 28963 entries, 2763 unique process instances and 285 unique tasks; the process instance length varies from 1 to 78. On the other hand, DS2 is a significantly smaller set with only 5194 entries, 794 process instances and 10 unique tasks; the process instance length varies from 2 to 32. DS3 is a small set with only 10000 entries, 633 process instances and hundreds of unique tasks. DS4 is also from a real process, with available outcome labels, 11839 entries, 571 process instances of different lengths and hundreds of unique tasks. The length of process instances in DS3 and DS4 varies considerably.

Figure 2 shows a visualisation of 10% of data set DS1. The diagram was created by BT's process mining tool Aperture (Taylor et al., 2012), which uses workflow data to create a process diagram. For this process, the diagram illustrates the complexity of the workflow data: it is basically impossible to visually analyse or understand the process from this figure. This is a typical scenario for processes that have evolved over time and are poorly documented.

Figure 2: Process model obtained by using Aperture for DS1, visualising a highly complex process.
4.2. Extended Markov model experiments

To evaluate MSA, we benchmarked our model against three other approaches:

• RM – random model: In order to find the next task following the current one, we randomly select from the set of potential next tasks. For example, if from the historical data we know that the tasks following A belong to the set {C, D, E}, we randomly select a value from that set as the predicted next step.

• All Kth Markov models: A number of Markov models of different orders are generated, from first order to kth order. Given a sequence, we start with the highest order Markov model. If the given sequence cannot be found in the transition matrix, we create a new, shorter sequence by removing the first event in the sequence. We then continue the procedure with the next lower order Markov model until we either find a match in a transition matrix or, after trying the first order Markov model, a default prediction is required.

• HMM – hidden Markov models: We tested several first order HMMs with different lengths for the input sequences and different numbers of hidden states and picked the best one.

For each comparison, we used 90% of the entries of the data sets as training data, and the remaining 10% were used as test data. Basically, for each sequence in the test data, we checked whether our model could correctly predict the next step of a particular task (i.e. we measured the number of successful predictions). The results were evaluated using 10-fold cross-validation and are displayed as percentages.

After the first order to seventh order MSAs were built, we investigated different values for l, the number of sequences (patterns) in the transition matrix that are most similar to the given sequence. l was varied from 1 to 7, and the results show that l = 5 is optimal for DS1 and l = 3 is optimal for DS2.

We now look at the specific results. Figures 3 and 4 illustrate the accuracy of MSAs on the two data sets, DS1 and DS2, respectively. The results show that, by incorporating the default prediction improvement module, MSA outperforms the other comparable models, especially when the order of the Markov model increases. This is because the relevance of the comparison between sequences is directly proportional to their lengths. As can be seen, the second order MSA performs best among the different orders of MSA in the case of DS1, where it is about 27% correct. The third order MSA works best in the case of DS2, where it provides correct predictions in about 70% of the cases. Nonetheless, the best-performing order depends on the data set. For a data set that gives good performance with a relatively high Markov order (i.e. fifth and above), the role of our default prediction improvement module becomes highly significant.

Figure 3: Percentage of correct predictions before and after applying the default prediction module in Markov models using data set DS1.

Figure 4: Percentage of correct predictions before and after applying the default prediction module in Markov models using data set DS2.

Figure 5 shows the performance of all models on the two data sets. As can be seen, RM performs worst, with only around 10% success for DS2. When applied to the bigger data set DS1, the result goes down to nearly 2% (14 times worse than the MSAs' average performance). This can be explained by the fact that in DS2 each task has an average of seven potential next tasks, whereas this figure is nearly 30 for DS1; the probability of picking the right next task therefore falls as the set size increases. The results highlight the difficulty of handling complex data sets.

Figure 5: Percentage of correct predictions of different models on the DS1 and DS2 data sets.
When the data are not too diverse (i.e. DS2), MSA (fifth order) obtains the highest number of correct predictions with a result of 63%. Compared with the other benchmarks, the MSA result is better than a fifth order Markov model, which achieves 57%, an All Kth (K = 5) Markov model, which achieves 60%, and an HMM, which achieves 43%. Please note that we did not pick the most accurate MSA and All Kth Markov model to compare with the HMM and the RM. The reason is that All Kth Markov models and MSAs only have an advantage at higher orders; in order to see how well they cope with default predictions, we chose a fifth order MSA and an All fifth order Markov model. When the data are more fragmented, as in the case of DS1, the effectiveness of our model is reduced. However, even though the performance on DS1 is not as good as the performance on DS2, our model still plays an important role considering the requirement of selecting a single correct task from 30 potential candidates on average. Even in this case, we still manage to outperform the HMM result by 20% and the All fifth order Markov model result by 3%.

4.3. Sequential KNN experiments

4.3.1. Data preparation

We carried out a number of experiments based on records from two real processes (DS3 and DS4). As we aim to predict the process outcome, it is necessary to label the outcome as either success or failure, and this needs to be done before the proposed method can be applied. There is, however, an issue with our available process data: records that are labelled with outcomes are not available for many processes, and for some processes it is not trivial to define process success or failure. We therefore have to look for suitable criteria to classify process instances into success/failure for the different data sets. One such criterion is the duration of a process instance, that is, a process instance is labelled a success if its duration is under a chosen threshold; otherwise, it is labelled a failure. In the case of DS4, however, the difference between the actual delivery date and the delivery date promised to the customer is used as the criterion to determine success and failure. In particular, if the actual delivery date is before the agreed date, the process instance is classified as a success; otherwise, it is classified as a failure.

4.3.2. Results

To evaluate KnsSA, we benchmarked our models against two other approaches:

• RM – random model: In order to find the outcome of the process, we randomly generate a number between 0 and 1; if the generated number is greater than 0.5, the outcome is success (1); if the generated number is smaller than 0.5, the outcome is failure (0).

• Original KNN: We choose the K nearest sequences in terms of common unique tasks (a sketch of this distance follows the list). For example, given two sequences A, B, D and A, A, C, there are one A, one B and one D in the first sequence, and two As and one C in the second one. Each unique task can be considered a category, and the distance for each category is computed as the difference in the number of occurrences of the corresponding task. Then, the sum over all categories is taken in order to obtain the total distance between any two given sequences. For instance, the two sequences given previously cover four categories A, B, C and D. The distance for category A is dA = 1, and those for categories B, C and D are dB = 1, dC = 1 and dD = 1, respectively. The resulting total distance between these two sequences is d = dA + dB + dC + dD = 4.
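For illustration, a minimal Python sketch of this benchmark distance (names are ours), which reproduces the worked example:

```python
from collections import Counter

def count_distance(seq1, seq2):
    """Non-sequential KNN benchmark distance: per unique task (category),
    take the absolute difference of occurrence counts, then sum."""
    c1, c2 = Counter(seq1), Counter(seq2)
    return sum(abs(c1[t] - c2[t]) for t in set(c1) | set(c2))

print(count_distance(["A", "B", "D"], ["A", "A", "C"]))  # 1 + 1 + 1 + 1 = 4
```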
For the proposed models, we investigate the effect of K, as it is important to obtain a reasonable number of similar sequences. As the labels are 0 and 1, we select odd values for K so that we can always extract the outcome/label of the given sequence from the K obtained sequences by majority vote. The data sets DS3 and DS4 have a large number of unique tasks, and the difference between the lengths of the sequences is substantial. Intuitively, the value of K should be small, taking into account the diversity of the data. The results of the local KnsSA applied to data sets DS3 and DS4 are presented in Figure 6. The performances of the original KNN, random guess and the proposed local KnsSA applied to the two data sets are presented in Figure 7. The results show that the proposed model outperforms both benchmarking models, the original KNN and random guess. This also implies that the temporal characteristics of the data are important for predicting the process outcome.

Figure 6: Percentage of correct predictions of local KnsSA using data sets DS3 and DS4.

Figure 7: Percentage of correct predictions of different models on data sets DS3 and DS4.

5. Conclusions

In this paper, we have introduced several models for predicting the next step in business processes in a telecommunication services context where no formal description of the processes exists or where the real process significantly deviates from the designed prototype when applied in real environments and under realistic constraints. Specifically, we first applied Markov and hidden Markov models to generate probability matrices for moving from one event to the next and then used them to provide predictions. Next, we proposed a novel predictive model (MSA) that extends Markov models with sequence alignment. In order to analyse its effectiveness, we empirically evaluated MSA against two data sets from a major telecommunication company with varying degrees of diversity. The results have clearly demonstrated that MSAs outperform both Markov models and HMMs by at least 10%. In addition, we have also shown that higher order Markov models outperform a first order Markov model. The default prediction module we introduced significantly improves on the corresponding original Markov model result when the order of the model is high (from the fifth order Markov model upwards). Nonetheless, although the improvement provided by the new default prediction module is quite significant, higher order MSAs do not outperform the second order MSA in the case of data set DS1 and the third order MSA in the case of data set DS2.

Another contribution of the paper is a new sequential KNN for symbolic sequences. This proposed method outperforms the benchmarking models and shows that the sequential characteristics of the data play an important role in predicting the process outcome.
These experimental results motivate us to use the sequence alignment technique as a similarity measure to group sequences that have similar characteristics and then tackle each group separately. Our future research will look into using sequence alignment and K-means clustering to cluster data into K groups and then treat each group with suitable approaches.

References

VAN DER AALST, W. (2011) Process Mining: Discovery, Conformance and Enhancement of Business Processes, New York: Springer-Verlag.
BAUM, L.E., T. PETRIE, G. SOULES and N. WEISS (1970) A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains, The Annals of Mathematical Statistics, 41, 164–171.
BERRY, M.J.A. and G.S. LINOFF (2004) Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management, New York: Wiley.
COX, L.A. JR. and D.A. POPKEN (2002) A hybrid system-identification method for forecasting telecommunications product demands, International Journal of Forecasting, 18, 647–671.
DESHPANDE, M. and G. KARYPIS (2004) Selective Markov models for predicting web page accesses, ACM Transactions on Internet Technology (TOIT), 4, 163–184.
EASTWOOD, M. and B. GABRYS (2009) A non-sequential representation of sequential data for churn prediction, in Proceedings of the KES2009 Conference, Santiago, Chile.
EIRINAKI, M., M. VAZIRGIANNIS and D. KAPOGIANNIS (2005) Web path recommendations based on page ranking and Markov models, WIDM'05, 2–9.
NEEDLEMAN, S.B. and C.D. WUNSCH (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins, Journal of Molecular Biology, 48, 443–453.
NETZER, O., J. LATTIN and V.S. SRINIVASAN (2007) A hidden Markov model of customer relationship dynamics, Marketing Science, 27, 185–204.
PAPOULIS, A. and S.U. PILLAI (1991) Probability, Random Variables and Stochastic Processes, New York: McGraw-Hill.
PITKOW, J. and P. PIROLLI (1999) Mining longest repeating subsequences to predict world wide web surfing, in The 2nd USENIX Symposium on Internet Technologies & Systems, Colorado, USA, 139–150.
RUTA, D. and B. MAJEED (2011) Business process forecasting in the telecommunication industry, in GCC Conference and Exhibition (GCC), 2011 IEEE, Dubai, United Arab Emirates, 389–392.
RUTA, D., D. NAUCK and B. AZVINE (2006) K nearest sequence method and its application to churn prediction, in Intelligent Data Engineering and Automated Learning – IDEAL 2006, volume 4224 of Lecture Notes in Computer Science, Corchado, E., Yin, H., Botti, V. and Fyfe, C. (eds). Berlin: Springer, 207–215.
SMITH, T.F. and M.S. WATERMAN (1981) Identification of common molecular subsequences, Journal of Molecular Biology, 147, 195–197.
TAYLOR, P., M. LEIDA and B.A. MAJEED (2012) Case study in process mining in a multinational enterprise, in Lecture Notes in Business Information Processing, Berlin: Springer.
THEDE, S. and M. HARPER (1999) A second-order hidden Markov model for part-of-speech tagging, in The 37th Annual Meeting of the ACL, Maryland, USA, 175–182.
WATERMAN, M. (1994) Estimating statistical significance of sequence alignments, Philosophical Transactions of the Royal Society of London, Series B: Biological Sciences, 344, 383–390.

The authors

Mai Le

Mai Le is currently a research assistant at Bristol University. She is working on a collaborative project between Bristol University and BT's Research and Innovation Division.
Her research focuses on data analytics and visual analytics that enable autonomous systems to make decisions in unusual situations based on human judgment and on data from previous similar situations. Mai obtained her MSc in optimisation and modelling from the University of Paris 6 in 2007. She is completing her PhD at Bournemouth University in mathematical modelling and predictive analysis of temporal sequential data.

Bogdan Gabrys

Bogdan Gabrys received an MSc in electronics and telecommunication from Silesian Technical University, Poland, in 1994, and a PhD in computer science from Nottingham Trent University, UK, in 1998. After many years of working at different universities, he moved to Bournemouth University in the UK in January 2003, where he holds a Chair in Computational Intelligence and acts as the founding director of an interdisciplinary Data Science Institute. His current research interests lie in the broad area of intelligent, complex adaptive systems and include a wide range of machine learning, biologically/nature-inspired learning and hybrid intelligent techniques encompassing data and information fusion, learning and adaptation methods, multiple classifier and prediction systems, processing and modelling of uncertainty in predictive modelling, pattern recognition, diagnostic analysis and decision support systems.

Detlef Nauck

Detlef Nauck is chief research scientist for Data Science with BT's Research and Innovation Division, located at Adastral Park, Ipswich, UK. He is leading a group of international scientists working on intelligent data analysis and autonomic systems. Detlef has been working on data analytics projects across a number of areas such as customer service analytics, real-time business intelligence, business process analytics and network analytics. His current work focuses on introducing autonomic principles like self-learning, self-healing and self-optimisation into business processes and network management. Detlef is a visiting professor at Bournemouth University and a Privatdozent at the Otto-von-Guericke University of Magdeburg, Germany. Detlef holds an MSc (1990) and a PhD (1994) in computer science, both from the University of Braunschweig, Germany. He also holds a Habilitation (postdoctoral degree) in computer science from the Otto-von-Guericke University of Magdeburg, Germany (2000).