title: Towards Multi-perspective Conformance Checking with Aggregation Operations
authors: Zhang, Sicui; Genga, Laura; Dekker, Lukas; Nie, Hongchao; Lu, Xudong; Duan, Huilong; Kaymak, Uzay
date: 2020-05-18
journal: Information Processing and Management of Uncertainty in Knowledge-Based Systems
DOI: 10.1007/978-3-030-50146-4_17

Conformance checking techniques are widely adopted to validate process executions against a set of constraints describing the expected behavior. However, most approaches adopt a crisp evaluation of deviations, with the result that small violations are considered at the same level as significant ones. Furthermore, in the presence of multiple data constraints the overall deviation severity is assessed by summing up each single deviation. This approach easily leads to misleading diagnostics; furthermore, it does not take into account the user's needs, which are likely to differ depending on the context of the analysis. We propose a novel methodology based on the use of aggregation functions, to assess the level of deviation severity for a set of constraints and to customize the tolerance to deviations of multiple constraints.

Nowadays organizations often define procedures describing how their processes should be performed in order to satisfy a set of constraints, e.g., to minimize the throughput time or to comply with rules and regulations. A widely used formalism to represent these procedures is that of so-called process models, i.e., graphical or logical formalisms representing constraints defined on the organization's processes, e.g., on the order of execution of the activities. However, it is well documented in the literature that real process behavior often deviates from the expected process, which often leads to performance issues or opens the way to costly frauds [12]. In recent years, the increasing use by organizations of information systems (e.g., ERP, SAP, MRP and so on) to support and track the execution of their processes has enabled the development of automatic, data-driven techniques to assess the compliance level of the real process behavior. Among them, conformance checking techniques have been gaining increasing attention both from practitioners and academics [1, 2, 5, 6, 23]. Given an event log, i.e., a log file tracking data related to the activities performed during process executions, conformance checking techniques are able to pinpoint discrepancies (aka deviations) between the log and the corresponding model.

While classic conformance checking techniques only deal with the control-flow of the process, i.e., the activity execution order, in recent years multi-perspective conformance checking approaches, aimed at dealing also with data constraints, have become more and more relevant [23, 25]. Nevertheless, there are still several open challenges in implementing multi-perspective conformance checking. Among them, here we focus on the lack of appropriate modeling mechanisms for dealing with the uncertainty and graduality that often characterize human decisions in real-world processes. State-of-the-art techniques implement a crisp approach: every execution of an activity is considered either completely wrong or completely correct [13, 23, 25]. While this assumption is well grounded for the control-flow (indeed, each activity is either executed at the right moment, or it is not), when addressing data constraints it can easily lead to misleading results.
A well-known example of this issue can be found in the healthcare domain. Let us assume that a surgery department implements a guideline stating that the systolic blood pressure (SBP) of a patient has to be lower than 140 to proceed with a surgery. It is reasonable to expect that sometimes clinicians will not refuse to operate on patients whose SBP is 141, since this is quite a small deviation and delaying the surgery could be more dangerous for the patient. Clearly, surgeries performed with this value of SBP are likely to be much less problematic than surgeries performed with an SBP equal to, e.g., 160. However, conformance checking techniques would simply mark both these cases as 'not compliant', without allowing for any distinction. This behavior is undesirable, since it is likely to return in output a plethora of uninteresting deviations, at the same time hiding those which could deserve further investigation.

We investigated this issue in our previous work [29], where we proposed to use fuzzy sets, which are used to represent the flexibility in constraints and goals in fuzzy optimization [20], to determine the severity of violations of a single soft constraint per activity. However, that work used the basic strategy of standard conformance checking techniques for dealing with deviations of multiple constraints; namely, the total degree of data deviation of an activity is computed by summing up the costs of all the violated constraints. This strategy poses some important limitations when investigating data compliance. First, it introduces an asymmetry in the assessment of control-flow and data deviations: while the control-flow deviation of each activity expresses the level of compliance of the activity with the control-flow constraints (either fully compliant or wrong), in the presence of multiple data constraints the obtained value does not give an indication of the overall level of compliance with the constraint set. Furthermore, no customization to the user's needs is provided: on the one hand, in this setting data violations tend to be considered more severe than control-flow ones, even if this might not fit the user's intention; on the other hand, different contexts might require tailored functions to assess the severity of multiple data deviations.

In this paper, we address this issue by proposing a novel fuzzy conformance checking methodology based on the use of aggregation functions, which have proven suitable for modeling the simultaneous satisfaction of aggregated criteria [20]. With respect to previous work, the approach brings two main contributions: a) it applies fuzzy aggregation operators to assess the level of deviation severity for a set of constraints, and b) it allows the user to customize the tolerance to deviations of multiple constraints. As a proof of concept, we tested the approach over synthetic data.

The remainder of this paper is organized as follows. Section 2 introduces a running example to discuss the motivation of this work. Section 3 introduces basic formal notions. Section 4 illustrates our approach, and Sect. 5 presents the results obtained by a set of synthetic experiments. Section 6 discusses related work. Finally, Sect. 7 draws some conclusions and presents future work.

Consider, as a running example, a loan management process derived from previous work on the event log of a financial institute made available for the BPI2012 challenge [3, 15]. Figure 1 shows the process in BPMN notation. The process starts with the submission of a loan application.
Then, the application passes through a first assessment of the applicant's requirements and, if the requested amount is greater than 10000 euros, also through a more thorough fraud detection analysis. If the application is not eligible, the process ends. Otherwise, the application is accepted, an offer to be sent to the customer is selected and the details of the application are finalized. After the offer has been created and sent to the customer, the latter is contacted to discuss the offer with her. At the end of the negotiation, the agreed application is registered on the system. At this point, further checks can be performed on the application, if the overall duration is still below 30 days and the Amount is larger than 10000, before approving it.

Let us assume that this process is supported by some system able to track the execution of its activities in a so-called event log. In practice, this is a collection of traces, i.e., sequences of activities performed within the same process execution, each storing information such as the execution timestamp or other data elements [1]. As an example, let us consider two traces, σ1 and σ2, corresponding to two executions of the process in Fig. 1 (note that we use acronyms rather than complete activity names). Both executions violate the constraints defined on the duration and the amount of the loan, according to which the activity W F A should have been skipped anyway.

Conformance checking techniques also attempt to support the user in investigating the possible interpretations of a deviation. In our case, the occurrence of the activity W F A could be considered either as 1) a control-flow deviation (i.e., data are correct but the activity should not have been executed) or as 2) a data-flow deviation (i.e., the execution of the activity is correct but data have not been properly recorded on the system). In the absence of domain knowledge to determine the real explanation, conformance checking techniques assess the severity (aka cost) of the possible interpretations and select the least severe one, assuming that this is the one closest to reality. In our example, conformance checking would consider σ1 as a control-flow deviation, since the cost would be equal to 1, while the data-flow interpretation would correspond to a cost of 2, having two violated constraints; for σ2, instead, the two interpretations would be equivalent, since only one data constraint is violated.

In previous work [29] we investigated how to use fuzzy membership functions to assess the severity of data deviations, taking into account their magnitude. However, the approach still comes with some limitations when considering multiple constraints. Indeed, with this approach the overall severity of the data deviation for an activity is assessed by a simple sum operation. For example, let us suppose that with the method in [29] we obtain costs of 0.3 and 0.8 for the violations of Amount and Duration of W F A in σ1, for a total cost of 1.1, and costs of 0.8 and 0 in σ2, for a total cost of 0.8. In this setting, activities involving multiple constraints will tend to have an interpretation biased towards control-flow deviations, since the higher the number of constraints, the higher the data-deviation cost.
Furthermore, it is worth noting that the comparison between the two traces can be misleading: in one case both constraints are violated, even though one deviation is only slight, while in the second case only one constraint is violated, although with quite a strong deviation. Nevertheless, the final numerical results are quite similar, thus hiding these differences. This example shows how the use of the simple sum function can impact the results significantly, without the user realizing it and, above all, without providing the user with any customization mechanism. For example, the user might want to assess the data-compliance level in terms of the percentage of satisfied constraints, or by considering only the maximum cost, and so on. However, current techniques do not allow for this kind of customization.

This section introduces a set of concepts that will be used throughout the paper. Conformance checking techniques detect discrepancies between a process model and the real process execution. Here we define the notion of process model using the notation from [2], enriched with the data-related notions explained in [13]. A process model is a tuple M = (P, P_I, P_F, A_M, V, W, U, G, Values), where P is the set of process states, P_I and P_F are the initial and final states, A_M is the set of activities, V is the set of process variables and U is the set of their possible values; G is a guard function, i.e., it associates each activity with a boolean formula expressing a condition on the values of the data variables; W : A_M → 2^V is a write function, that associates an activity with the set of variables which are written by the activity; finally, Values : P → (V → U) is a function that associates each state with the corresponding pairs variable=value. The firing of an activity s = (a, w) ∈ A_M × (V → U) in a state p is valid if: 1) a is enabled in p; 2) a writes all and only the variables in W(a); 3) G(a) is true when evaluated over Values(p). To access the components of s we introduce the following notation: vars(s) = w, act(s) = a. Function vars is also overloaded to sequences of firings in the natural way. The set of valid process traces of a model M is denoted by ρ(M) and consists of all the valid firing sequences σ ∈ (A_M × (V → U))^* that, from an initial state P_I, lead to a final state P_F. Figure 1 provides an example of a process model in BPMN notation.

Process executions are often recorded by means of an information system in event logs. Formally, let S_N be the set of (valid and invalid) firings of activities of a process model M; an event log is a multiset of traces L ∈ B(S_N^*). Given an event log L, conformance checking builds an alignment between L and M, mapping "moves" occurring in the event log to possible "moves" in the model. A "no move" symbol ⊥ is used to represent moves which cannot be mimicked; for convenience, we introduce the set S_N^⊥ = S_N ∪ {⊥}. Formally, we set s_L to be a transition of the events in the log and s_M to be a transition of the activities in the model. A move is represented by a pair (s_L, s_M) ∈ S_N^⊥ × S_N^⊥ such that it is a move in log if s_L ∈ S_N and s_M = ⊥, a move in model if s_L = ⊥ and s_M ∈ S_N, and a move in both if s_L, s_M ∈ S_N and act(s_L) = act(s_M). An alignment of a log trace σ_L and the model M is then a sequence of moves γ ∈ (S_N^⊥ × S_N^⊥)^* such that the projection of the first element (ignoring ⊥) yields σ_L, and the projection on the second element (ignoring ⊥) yields a valid process trace σ_M ∈ ρ(M).

Example 1. Let us consider the model in Fig. 1 and the trace σ1 in Sect. 2. Table 1 shows two possible alignments γ1 and γ2 for the activity W F A. In alignment γ1, the pair (W F A, W F A) is a move in both with incorrect data, while in γ2 the pair (W F A, ⊥) is a move in log, i.e., the occurrence of W F A in the log is matched with a "no move" on the model side. (In the remainder, Amount and Duration are abbreviated to A and D.)

As shown in Example 1, there can be multiple possible alignments for a given log trace and process model. Our goal is to find the optimal alignment, i.e., the alignment with minimum cost.
To this end, the severity of deviations is assessed by means of a cost function.

Aggregation operations (AOs) are mathematical functions that satisfy minimal boundary and monotonicity conditions, and are often used for modeling decision-making processes, since they allow one to specify how to combine the different criteria that are relevant when making a decision [17, 27]. In the literature, many AOs have been defined (see [18, 19, 22] for an overview), with different levels of complexity and different interpretations. A commonly used class of aggregation operators is that of t-norms, which are used to model the conjunction of fuzzy sets. In compliance analysis one often tries to satisfy all constraints on the data, so t-norms are suitable operators for modeling soft constraints in this setting. Widely used t-norms are the minimum, the product and the Yager operators [21]. In addition to t-norms, other aggregation operators could also be used, depending on the goals of the compliance analysis. We do not consider other types of aggregation operators in this paper but, in general, one could use the full flexibility of the different classes of fuzzy set aggregation operators that have been used in decision making (see, e.g., [11]).

We introduce a compliance checking approach tailored to dealing with decision tasks under multiple guards, to enhance the flexibility of the compliance assessment procedure. To this end, we investigate the use of AOs. Compliance checking in process analysis is based on the concept of alignment between a process model and a process trace that minimizes the cost of misalignment. The computation of an optimal alignment relies on the definition of a proper cost function for the possible kinds of moves (see Sect. 3). Most state-of-the-art approaches adopt (variants of) the standard distance function defined in [2], which sets a cost of 1 for every move in log/model (excluding invisible transitions) and a cost of 0 for synchronous moves. Multi-perspective approaches extend the standard cost function to include data costs. Elaborating upon these approaches, in previous work [29] we defined a fuzzy cost function (Eq. (1)) which assigns a cost equal to 1 for a move in log; 1 plus the number of variables that should have been written by the activity for a move in model; and, finally, the sum of the deviation costs (1 − μ_i) over the data variables for a move in both. Note that the latter covers both moves with correct and with incorrect data. As discussed in Sect. 2, summing up all the data costs presents important limitations in assessing the conformance to multiple constraints. Therefore, in the present work we propose a new version of our fuzzy cost function, with the goal of standardizing the cost of every move within the range [0, 1] and allowing the user to customize the cost function to her needs. Let π(μ_1, μ_2, ..., μ_n) be a user-defined aggregated membership function of multiple variables. Then (1 − π) is the overall deviation cost of a set of variables, and the cost k(s_L, s_M) is defined accordingly in Eq. (2).
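Informally, Eq. (2) can be read as follows. This is a schematic sketch only: here we assume that moves in log and in model are both assigned a unit cost, consistently with the goal of keeping every move cost within [0, 1], and we gloss over the treatment of invisible transitions.

```latex
k(s_L, s_M) \;=\;
\begin{cases}
  1 & \text{if } (s_L, s_M) \text{ is a move in log},\\[2pt]
  1 & \text{if } (s_L, s_M) \text{ is a move in model},\\[2pt]
  1 - \pi(\mu_1, \mu_2, \dots, \mu_n) & \text{if } (s_L, s_M) \text{ is a move in both},
\end{cases}
```

where π is the chosen aggregation operator (e.g., a t-norm such as minimum, product or Yager) applied to the membership degrees of the variables written by the activity. A move in both with fully compliant data then costs 1 − π(1, ..., 1) = 0, as in the crisp case.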
The problem of finding an optimal alignment is usually formulated as a search problem in a directed graph [14]. Let Z = (Z_V, Z_E) be a directed graph with edges weighted according to some cost structure. The A* algorithm finds the path with the lowest cost from a given source node v_0 ∈ Z_V to a node of a given goal set Z_G ⊆ Z_V. The cost of each node is determined by an evaluation function f(v) = g(v) + h(v), where:
- g : Z_V → R^+ gives the smallest path cost from v_0 to v;
- h : Z_V → R_0^+ gives an estimate of the smallest path cost from v to any of the target nodes.
If h is admissible, i.e., it underestimates the real distance of a path to any target node v_g, then A* finds a path that is guaranteed to have the overall lowest cost. The algorithm works iteratively: at each step, the node v with the lowest cost is taken from a priority queue. If v belongs to the target set, the algorithm ends, returning node v. Otherwise, v is expanded: every successor v′ is added to the priority queue with cost f(v′).

Given a log trace and a process model, to employ A* to determine an optimal alignment we associate every node of the search space with a prefix of some complete alignment. The source node is the empty alignment γ_0 = ⟨⟩, while the set of target nodes includes every complete alignment of σ_L and M. For every pair of nodes (γ_1, γ_2), γ_2 is obtained by adding one move to γ_1. The cost associated with a path leading to a graph node γ is then defined as g(γ) = K(γ) + ε|γ|, where K(γ) = Σ_{(s_L, s_M) ∈ γ} k(s_L, s_M), with k(s_L, s_M) defined as in (1), |γ| is the number of moves in the alignment, and ε is a negligible cost, added to guarantee termination. Note that the cost g has to be strictly increasing. While we do not give a formal proof for the sake of space, it is straightforward to see that in our approach g is obtained as the sum of non-negative elements; therefore, while moving from an alignment prefix to a longer one, the cost can never decrease. For the definition of the heuristic cost function h(v) different strategies can be adopted. Informally, the idea is to compute, from a given alignment, the minimum number of moves (i.e., the minimum cost) that would lead to a complete alignment. Different strategies have been defined in the literature, e.g., the one in [2], which exploits Petri-net marking equations, or the one in [28], which generates the possible state space of a BPMN model.

Example 2. Let us analyze the possible moves to assign to the activity W F A in σ1. Let us assume that the memberships of the variables are μ_A = 0.4 and μ_D = 0.2. According to (2) and the Product t-norm, we obtain the instantiated fuzzy cost function k(s_L, s_M) reported in (3); in particular, the data cost for a move in both is 1 − (0.4 · 0.2) = 0.92. Figure 2 shows the portion of the state space explored while building the alignment of σ1. At node #11, f = 0, since no deviations have occurred so far. From here, there are three possible moves that could be selected: a move in log (on the left), a move in model (on the right), and a move in both (in the middle). Since, using the Product aggregation, the data cost is equal to 0.92, the algorithm selects the move in both, being the one with the lowest cost.
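To make the numbers in Example 2 concrete, the following sketch (an illustrative script, not the implementation used in the paper; names and structure are ours) computes the aggregated data-deviation cost 1 − π(μ_A, μ_D) for the three t-norms considered in this work, next to the sum-based cost of [29], for μ_A = 0.4 and μ_D = 0.2:

```python
# Illustrative sketch (not the authors' implementation): aggregated data-deviation
# costs for the memberships of Example 2, mu_A = 0.4 (Amount) and mu_D = 0.2 (Duration).

def t_min(mus):
    # Minimum t-norm: as satisfied as the least satisfied constraint.
    return min(mus)

def t_product(mus):
    # Product t-norm.
    p = 1.0
    for m in mus:
        p *= m
    return p

def t_yager(mus, omega=2.0):
    # Yager t-norm: max(0, 1 - (sum_i (1 - mu_i)^omega)^(1/omega)),
    # with omega = 2 as used in the experiments.
    s = sum((1.0 - m) ** omega for m in mus)
    return max(0.0, 1.0 - s ** (1.0 / omega))

def data_cost_sum(mus):
    # Cost of previous work [29]: sum of the single deviation costs (1 - mu_i).
    return sum(1.0 - m for m in mus)

def data_cost_aggregated(mus, t_norm):
    # Cost proposed here: 1 - pi(mu_1, ..., mu_n), with pi an aggregation operator.
    return 1.0 - t_norm(mus)

if __name__ == "__main__":
    mus = [0.4, 0.2]  # mu_A, mu_D for W F A in sigma_1
    print("Sum     :", data_cost_sum(mus))                    # ~1.4  (> 1, the cost of a move in log)
    print("Product :", data_cost_aggregated(mus, t_product))  # 0.92 -> move in both is the cheapest
    print("Minimum :", data_cost_aggregated(mus, t_min))      # 0.8
    print("Yager   :", data_cost_aggregated(mus, t_yager))    # 1.0  (ties with a move in log)
```

With the Product aggregation the data cost (0.92) is lower than the unit cost of a move in log, so the move in both is selected, as discussed above; with Yager (ω = 2) the aggregated cost reaches 1 and ties with a move in log, in which case the control-flow interpretation is preferred (cf. Sect. 5).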
This section describes a set of experiments we performed to obtain a proof of concept of the approach. We compared the diagnostics returned by the existing approach [29] and by our new cost function with three t-norm aggregations. More precisely, we aimed to answer the question: what is the impact of different aggregation operations on the obtained alignments? In particular, we assess the impact of the aggregation function in terms of a) differences in the overall deviation cost, and b) differences in the interpretation, i.e., the moves selected by the alignment algorithm as the best explanation for the deviation.

In order to get meaningful insights on the behavior we can reasonably expect when applying the approach in the real world, we employ a realistic synthetic event log of 50,000 traces, introduced in a former paper [16] and obtained starting from a real-life log, i.e., the event log of the BPI2012 challenge. We evaluated the compliance of this log against a simplified version of the process model in [16], to which we added a few data constraints (see Fig. 1). The approach has been implemented as an extension of the tool developed in [28], designed to deal with BPMN models. Our process model involves two constraints for the activity W F A, i.e., Amount ≥ 10000 and Duration ≤ 30. Here we assume that Amount ∈ (3050, 10000) and Duration ∈ (30, 70) represent the tolerable violation ranges for the variables. Since we cannot refer to experts' knowledge, we derived these values from simple descriptive statistics; in particular, we considered values falling within the third quartile as acceptable. The underlying logic is that values which tend to occur repeatedly are likely to indicate acceptable situations. Regarding the shape of the membership functions for the variables, we apply linear functions μ, equal to 1 when the constraint is satisfied, decreasing linearly over the tolerable violation range, and equal to 0 beyond it. For the classic sum function, we use the cost function provided by (1); for the new approach with AOs, we apply the cost function in (2). We tested the t-norms Minimum, Product, and Yager. When data deviations and control-flow deviations show the same cost, we pick the control-flow move. This assumption simulates what we would do in a real-world context: without a priori knowledge of the right explanation, it is reasonable to assume that it is more likely that the error was executing the activity, rather than accepting out-of-range data values. Note that here we focus on the activity W F A since, in our log, it is the only one involving multiple data constraints.

Table 2 shows the differences in terms of number and type of moves, as well as in terms of costs. The columns #move in log and #move in data show the number of traces in which the alignment selected for the activity W F A a move in log or a move in data, respectively. The column "Average costs" shows the average alignment cost. The conformance checking algorithm selects for each activity the move corresponding to the minimum cost; therefore, the differences among the chosen moves depend on the different costs obtained on W F A when applying different operators. To provide a practical example of the impact of the aggregated cost on the obtained diagnostics, below we discuss the results obtained for one trace. Table 3 shows the cost of the possible moves for W F A according to the aggregation functions, and Table 4 shows the move picked by each function to build the alignment. Using the Sum function, the data cost is 1.003, so that a move in log is chosen in the optimal alignment. In the other cases, instead, the move in data is the one with the lowest cost. Since both deviations fall in the acceptable range, this interpretation is likely to be more in line with the user's expectations. The observations made for this example can be generalized to the overall results of Table 2, which shows a set of traces whose interpretation is heavily affected by the chosen cost function. As expected, the Sum function is the most biased towards the move in log interpretation: it selects 40 moves in log more than Product and Min, and 29 more than Yager.
One can argue that this choice is likely not the one the human analyst would have expected. Indeed, we are using Yager with ω = 2 [11], which means that when both variables show severe deviations we expect the data cost to be 1 and the move in log to be picked. This implies that in at least 29 of the aligned traces a move in log was chosen even though the constraints did not both show severe deviations. We argue that this behavior can be misleading for the analyst or, anyway, not in line with her needs. The Product function marks 18 further traces as move in data, in addition to the ones marked by Yager. This was expected, since the Product function relaxes the requirement of full satisfaction of the set of constraints. At the same time, it implies that in all these 18 traces the deviations always fell within the tolerance range; therefore, also these situations might be better represented as data deviations, depending on the analyst's needs. As regards the Min function, it returns a full data-deviation cost in the presence of at least one deviation outside the tolerance range, which explains why it returned the same alignments as the Product function. The overall alignment costs are in line with the expectations: the Sum function returns the highest average cost, as expected, and the Min the lowest, while Yager and Product behave similarly, with the difference likely explained by the 18 traces discussed above. While the absolute difference among the costs is not very large, these results show that both the alignments and the assessment of the deviations are impacted by the choice of the cost function, thus highlighting once again the need for a more flexible approach to compliance assessment, allowing the user to tailor the cost function to her context.

During the last decades, several conformance checking techniques have been proposed. Some approaches [9, 10, 26] propose to check whether event traces satisfy a set of compliance rules, typically represented using declarative modeling. Rozinat and van der Aalst [24] propose a token-based technique to replay event traces over a process model to detect deviations, which, however, has been shown to provide misleading diagnostics in some contexts [4]. Recently, alignments have been proposed as a robust approach to conformance checking based on the use of a cost function [2]. While most alignment-based approaches use the standard distance cost function defined in [2], some variants have been proposed to enhance the provided diagnostics, e.g., the work of Alizadeh et al. [8], which computes the cost function by analyzing historical logging data. Besides the control flow, other perspectives, such as data or resources, are often crucial for compliance checking analysis. Few approaches have investigated how to include these perspectives in the analysis: [7] extends the approach in [8] by taking into account data describing the contexts in which the activities occurred; other approaches propose to check the control flow first and then assess compliance with respect to the data perspective, e.g., [13]. These methods give priority to the control flow, with the result that some important deviations can be missed. [23] introduces a cost function balancing different perspectives, thus obtaining more precise diagnostics. The approaches mentioned so far assume a crisp evaluation of deviations.
To the best of our knowledge, the only work which explored the use of a fuzzy cost function is our previous work [29] which, however, did not consider the violation of multiple constraints.

In this work, we investigated the use of fuzzy aggregation operations in conformance checking of process executions to deal with multiple data constraints for an activity. The proposed approach significantly enhances the flexibility of compliance checking, allowing the human analyst to customize the compliance diagnostics according to her needs. We elaborated upon the relevance of this aspect both theoretically and with some examples. As a proof of concept, we implemented the approach and tested it over a synthetic dataset, comparing the results obtained by the classic sum function and by three different aggregations. The experiments confirmed that the approach generates more "balanced" diagnostics and introduces the capability of personalizing the acceptance of deviations for multiple guards. Nevertheless, there are several research directions still to be explored. In future work, we first plan to test our approach with real-world data. Furthermore, we intend to investigate the usage of different aggregation functions, as well as the possibility of extending the notion of aggregation to take into account also other kinds of deviations. Finally, we intend to investigate potential applications, for example in terms of on-line process monitoring and support, with the aim of enhancing the system's resilience to exceptions and unforeseen events.

Acknowledgment. The research leading to these results has received funding from the Brain Bridge Project sponsored by Philips Research.

References
[1] Process mining manifesto
[2] Replaying history on process models for conformance checking and performance analysis
[3] Mining process performance from event logs
[4] Towards robust conformance checking
[5] Memory-efficient alignment of observed and modeled behavior
[6] Alignment based precision checking
[7] Constructing probable explanations of nonconformity: a data-aware and history-based approach
[8] History-based construction of alignments for conformance checking: formalization and implementation
[9] Conformance checking and diagnosis for declarative business process models in data-aware scenarios
[10] Comprehensive rule-based compliance checking and risk management with process mining
[11] Model predictive control using fuzzy decision functions
[12] Data- and resource-aware conformance checking of business processes
[13] Aligning event logs and process models for multi-perspective conformance checking: an approach based on integer linear programming
[14] Generalized best-first search strategies and the optimality of A*
[15] Discovering anomalous frequent patterns from partially ordered event logs
[16] Predicting critical behaviors in business process executions: when evidence counts
[17] Fuzzy measures and integrals in MCDA
[18] Aggregation functions: construction methods, conjunctive, disjunctive and mixed classes
[19] Aggregation functions: means
[20] Weighted constraints in fuzzy optimization
[21] Operations on fuzzy numbers extended by Yager's family of t-norms
[22] Fuzzy Sets and Fuzzy Logic: Theory and Applications
[23] Balanced multi-perspective checking of process conformance
[24] Conformance checking of processes based on monitoring real behavior
[25] Dependence-based data-aware process conformance checking
[26] Compliance checking of data-aware and resource-aware compliance requirements
[27] Modeling Decisions: Information Fusion and Aggregation Operators
[28] Aligning event logs to task-time matrix clinical pathways in BPMN for variance analysis
[29] Towards multi-perspective conformance checking with fuzzy sets