key: cord-0044432-xlnhsh15 authors: Fani Sani, Mohammadreza; van Zelst, Sebastiaan J.; van der Aalst, Wil M. P. title: Conformance Checking Approximation Using Subset Selection and Edit Distance date: 2020-05-09 journal: Advanced Information Systems Engineering DOI: 10.1007/978-3-030-49435-3_15 sha: 05f0cb1e55b3546608a417c80a33cc28f080c569 doc_id: 44432 cord_uid: xlnhsh15

Conformance checking techniques let us determine to what degree a process model and real execution data correspond to each other. In recent years, alignments have proven extremely useful for calculating conformance statistics. Most techniques that compute alignments provide an exact solution. However, in many applications, an approximation of the conformance value is sufficient. Specifically, for large event data, the computation time for alignments is considerably long using current techniques, which makes them inapplicable in practice. Likewise, for complex process models, alignment computation is often infeasible on standard hardware. This paper proposes new approximation techniques that compute conformance checking values close to the exact solution in less time. These methods also provide upper and lower bounds for the approximated alignment value. Our experiments on real event data show that the proposed methods improve the performance of conformance checking compared to the state-of-the-art alignment approximation technique. Results show that in most cases we provide tight bounds, accurate approximated alignment values, and similar deviation statistics.

One of the main branches of process mining is conformance checking, which investigates the conformity of a discovered/designed process model w.r.t. real process executions [1]. This branch of techniques is beneficial for detecting deviations and for measuring how accurate a discovered model is. In particular, these techniques are able to check conformance based on process modeling formalisms that allow for describing concurrency, i.e., the possibility to specify order-independent execution of activities. Early conformance checking techniques, e.g., token-based replay [2], often lead to ambiguous and/or unpredictable results. Hence, alignments [3] were developed with the specific goal of explaining and quantifying deviations in a non-ambiguous manner. Alignments have rapidly turned into the de facto standard conformance checking technique [4]. Moreover, alignments serve as a basis for techniques that link event data to process models, e.g., they support performance analysis, decision mining [5], business process model repair [6], and prediction techniques.

However, computing alignments is time-consuming for large real event data, which makes it impractical in many situations. In many applications, we need to compute alignment values several times, e.g., to find a suitable process model for an event log, we need to discover many process models using various process discovery algorithms with different settings, and measure how well each process model fits the event log using alignment techniques. As standard alignment methods require considerable time for large event data, analyzing many candidate process models is impractical. Consequently, by decreasing the alignment computation time, we can consider more candidate models in a limited time. Moreover, in several cases, we do not need accurate alignment values, i.e., it is sufficient to have a quick approximated value or a close lower/upper bound for it.
In this paper, we propose several conformance checking approximation methods that provide approximated alignment values plus lower and upper bounds for the actual alignment value. The underlying core idea is to consider just a subset of the process model behavior, instead of all its behavior. The methods additionally return problematic activities, based on their deviation rates. Using these methods, users are able to adjust the amount of process model behavior considered in the approximation, which affects the computation time and the accuracy of the alignment values and their bounds. We implemented the methods in two open-source process mining tools, applied them to several large real event data sets, and compared them with the state-of-the-art alignment approximation method. The results show that using some of the proposed methods, we are able to approximate alignment values faster, while the approximated values are very close to the actual alignment values.

The remainder of this paper is structured as follows. In Sect. 2, we discuss related work. Section 3 defines preliminary notation. We explain the main method in Sect. 4 and evaluate it in Sect. 5. Section 6 concludes the paper.

Several process mining techniques exist, ranging from process discovery to prediction. We limit related work to the fields of conformance checking and sampling techniques in the process mining domain, and refer to [1] for an overview of process mining. In [7], the authors review the conformance checking techniques in the process mining domain. In [8], different methods for conformance checking and its applications are covered. Early work in conformance checking uses token-based replay [2]. These techniques replay a trace of executed events in a Petri net and add missing tokens if transitions are not able to fire; after replay, a conformance statistic is computed based on missing and remaining tokens. Alignments were introduced in [9] and have rapidly developed into the standard conformance checking technique. In [10, 11], decomposition techniques are proposed for alignment computation. Moreover, [12] proposes a decomposition method to find an approximation of the alignment faster. Applying decomposition techniques improves computation time, i.e., the techniques successfully use the divide-and-conquer paradigm; however, they are primarily beneficial when there are many unique activities in the process [13]. Recently, general approximation schemes for alignments, i.e., computation of near-optimal alignments, have been proposed [14]. Finally, the authors in [4] propose to incrementally compute prefix-alignments, enabling real-time conformance checking for event data streams.

A relatively limited amount of work uses sampling approaches in the process mining domain. In [15], the authors propose a sampling approach based on Parikh vectors of traces to detect the behavior in the event log. In [16], the authors recommend a trace-based sampling method to decrease discovery time and memory footprint, assuming that process instances exhibit different behavior if they have different sets of directly-follows relations. Furthermore, [17] recommends a trace-based sampling method specifically for the Heuristics miner [18]. In both of these sampling methods, we have no control over the size of the final sampled event data. Also, depending on the defined behavioral abstraction, the methods may select almost all the process instances.
Finally, all these sampling methods are unbiased and consequently lead to non-deterministic results. In [19], we analyze random and biased sampling methods with which we are able to adjust the size of the sampled data for process discovery. Moreover, [36] shows that a clustering-based instance selection method provides more precise and simpler process models.

Some research focuses on alignment value approximation. [20] proposes sampling the event log and applying conformance checking on the sampled data. The method increases the sample size until the approximated value is accurate enough. However, the method does not guarantee the accuracy of the approximation, e.g., by providing bounds for it. In Sect. 5, we show that if there is a lot of unique behavior in the event log, the approximation time of this method exceeds the computation time for finding the exact alignment value. The authors of [21] propose a conformance approximation method that applies relaxation labeling methods on a partial order representation of a process model. Similar to the previous method, it does not provide any guarantee for the approximated value; furthermore, it needs to preprocess the process model each time. In this paper, we propose multiple alignment approximation methods that increase the conformance checking performance. The methods also provide bounds for alignment values and a deviation ratio of problematic activities.

In this section, we briefly introduce basic process mining and, specifically, conformance checking terminology and notations that ease the readability of this paper. Given a set X, a multiset B over X is a function B: X → N_≥0 that allows certain elements of X to appear multiple times; {e ∈ X | B(e) > 0} is the set of elements present in the multiset. The set of all multisets over a set X is written as B(X). Given a system net SN, φ_f(SN) is the set of all complete firing sequences of SN and φ_v(SN) is the set of all possible visible traces, i.e., complete firing sequences starting in its initial marking and ending in its final marking, projected onto the set of observable activities (not silent transitions, e.g., t_3 in Fig. 1).

To measure how a trace aligns to a process model, we need to define the notion of moves. A move is a pair (x, t) where the first element refers to the log and the second element refers to the corresponding transition in the model. For example, (a, t_1) means that both log and model make an "a move" and the move in the model is caused by the occurrence of transition t_1 (as the label of t_1 is a). Note that ≫ indicates "no move" in the log/model trace. Now, we define alignments as follows [10]. Let σ_L ∈ L be a log trace and σ_M ∈ φ_f(SN) a complete firing sequence of a system net SN. An alignment of σ_L and σ_M is a sequence of pairs γ ∈ A*_LM such that the projection on the first element (ignoring ≫) yields σ_L and the projection on the second element (ignoring ≫ and transition labels) yields σ_M. An alignment is a sequence of legal moves such that, after removing all ≫ symbols, the top row corresponds to a trace in the event log and the bottom row corresponds to a complete firing sequence in φ_f(SN). The middle row corresponds to a visible path when ignoring the τ steps, i.e., those corresponding to silent transitions (e.g., t_3 in Fig. 1); for silent transitions, there is no corresponding recorded event in the log. The following alignments relate to σ_L = ⟨a, c, b, d, e⟩ and the Petri net in Fig. 1 (top row: moves in the log; bottom row: moves in the model, projected onto visible labels):

γ_1:  a | c | b | d | e
      a | ≫ | b | ≫ | e

γ_2:  a | c | b | d | e
      a | c | b | ≫ | e
By considering the labels of the visible transitions of an alignment, we find the corresponding model trace, e.g., the model trace of γ_1 is ⟨a, b, e⟩. To quantify the cost of misalignments, we introduce a move cost function δ. Synchronous moves, i.e., moves that agree between the trace and the model, have no cost; asynchronous moves, i.e., moves only in the log or only in the model, are penalized. In this paper, we use a standard cost function δ_S that assigns unit cost to each asynchronous move. In the above example alignments, δ_S(γ_1) = 2 and δ_S(γ_2) = 1. Given a log trace and a system net, we may have many alignments. To select the most appropriate one, we select an alignment with the lowest total cost:

- γ_SN: A* → A*_LM is a mapping that assigns any log trace σ_L to an optimal alignment, i.e., γ_SN(σ_L) ∈ Γ_{σ_L, SN} and γ_SN(σ_L) has minimal cost.
- λ_SN: A* → A* is a mapping that assigns any log trace σ_L to the visible activities of the model trace of the optimal alignment.

In the running example, γ_SN(⟨a, c, b, d, e⟩) = γ_2 (γ_2 is optimal), and λ_SN(⟨a, c, b, d, e⟩) = ⟨a, c, b, e⟩ is the corresponding model trace of the optimal alignment.

We can compute the distance of two traces (or two sequences) faster using an adapted version of the Levenshtein distance [22]. Suppose that σ, σ′ ∈ A*; the edit distance function △(σ, σ′) → N returns the minimum number of edits needed to transform σ into σ′. As edit operations, we allow deletion/insertion of an activity (or a transition label) in a trace, e.g., △(⟨a, c, f, e⟩, ⟨a, f, c, a⟩) = 4, corresponding to two deletions and two insertions. This measure is symmetric, i.e., △(σ, σ′) = △(σ′, σ). It is possible to use the △ function instead of the standard cost function; △ and δ_S then return the same distance values. The △ function is extendable from unit costs (i.e., δ_S) to other cost schemes by giving different weights to the insertion and deletion of different activities. In [23], it is explained that the Levenshtein metric, before normalization, satisfies the triangle inequality. We further use Φ(σ_L, S), the minimum of △(σ_L, σ) over all σ ∈ S; in other words, Φ returns the distance of the most similar sequence in S to σ_L. Let φ_v(SN) be the set of all visible firing sequences in SN, and γ_SN(σ) an optimal alignment for sequence σ. It is possible to use Φ(σ, φ_v(SN)) instead of δ_S(γ_SN(σ)). Using the edit distance function, we are able to find which activities need to be deleted or inserted; so, not only the cost of the alignment, but also the deviating parts of the process model (except invisible transitions) are detectable using this function.

It is possible to convert misalignment costs into a fitness value using Eq. 1, which normalizes the cost of the optimal alignment by one deletion for each activity in the trace plus one insertion for each visible transition in the shortest path of the model (SPM):

Fitness(σ_L, SN) = 1 − δ_S(γ_SN(σ_L)) / (|σ_L| + SPM)    (1)

The fitness between an event log L and a system net SN, i.e., Fitness(L, SN), is the weighted average of the traces' fitness. As the computational complexity of computing alignments is exponential in the number of states and the number of transitions, it is impractical for larger Petri nets and event logs [24]. Considering that the most time-consuming part of the conformance checking procedure is finding an optimal alignment for each σ_L ∈ L and the system net SN leads us to propose an approximation approach that requires fewer alignment computations. An overview of the proposed approach is presented in Fig. 2. We suggest using M_B ⊆ φ_v(SN) instead of the whole φ_v(SN) and applying the edit distance function Φ instead of δ_S.
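To make the △ and Φ functions concrete, the following is a minimal Python sketch; the paper's actual implementation is the ProM plug-in described in Sect. 5, so the recurrence layout and the function names here are ours. The insertion/deletion-only variant amounts to a Levenshtein distance without substitutions.

```python
def edit_distance(sigma, sigma_prime):
    """Minimum number of insertions/deletions turning sigma into sigma_prime
    (the adapted Levenshtein distance: no substitutions allowed)."""
    n, m = len(sigma), len(sigma_prime)
    # dp[i][j] = distance between sigma[:i] and sigma_prime[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i                      # delete all remaining activities
    for j in range(m + 1):
        dp[0][j] = j                      # insert all remaining activities
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if sigma[i - 1] == sigma_prime[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]            # synchronous move
            else:
                dp[i][j] = 1 + min(dp[i - 1][j],       # deletion
                                   dp[i][j - 1])       # insertion
    return dp[n][m]

def phi(sigma, model_behavior):
    """Phi(sigma, M_B): distance to the most similar model trace in M_B."""
    return min(edit_distance(sigma, sigma_m) for sigma_m in model_behavior)

# The example from the text: two deletions plus two insertions.
assert edit_distance(("a", "c", "f", "e"), ("a", "f", "c", "a")) == 4
# Running example with the M_B used later in Table 1.
assert phi(("a", "c", "b", "d", "e"),
           {("a", "b", "e"), ("a", "b", "c", "e")}) == 2
```

Because only insertions and deletions are allowed, the distance equals |σ| + |σ′| − 2·LCS(σ, σ′), so an LCS-based implementation would work equally well.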
In the following lemma, we show that using this approach, we obtain an upper bound for the cost of the alignment (i.e., a lower bound for the fitness value): since M_B ⊆ φ_v(SN), the minimum edit distance over M_B cannot be smaller than the minimum over all of φ_v(SN), so Φ(σ_L, M_B) returns an upper bound for the cost of the optimal alignment.

Here, we explain the main components of our proposed approach, i.e., constructing a subset of model behavior (M_B) and computing the approximation. As explained, we propose to use M_B, i.e., a subset of visible model traces, to obtain an approximated alignment. An important question is how to construct M_B. In this regard, we propose two approaches, i.e., simulation and candidate selection.

1) Simulation: The subset of model traces can be constructed by simulating the process model. Having a system net and the initial and final markings, we simulate some complete firing sequences; note that we keep only the visible firing sequences in M_B. It is possible to replay the Petri net randomly or by using more advanced methods, e.g., stochastic Petri net simulation techniques. This approach is fast, but we cannot guarantee that by increasing the size of M_B we will obtain the perfect alignment (or fitness) value, because the set of model traces may be infinite. Another potential problem is that the generated subset may be far from the traces in the event log, which leads to an inaccurate approximation.

2) Candidate Selection: The second method to construct M_B is to compute the optimal alignments of selected traces in the event log and take the corresponding model traces of these alignments. In this regard, we first select some traces (i.e., candidates) from the event log L and put them in L_C. Then, for each σ_L ∈ L_C, we find the optimal alignment and insert λ_SN(σ_L) into M_B. Thereafter, for the other traces σ_L ∈ L′_C (i.e., L′_C = L − L_C), we use M_B and compute Φ(σ_L, M_B). As the triangle inequality property holds for the edit distance function, it is better to insert λ_SN(σ_L) into M_B instead of σ_L itself. To make this clear, let σ_L be a log trace, SN a system net, and σ_M = λ_SN(σ_L) the corresponding visible model trace of an optimal alignment of σ_L and SN. According to the triangle inequality property, for any trace σ ∈ L, we have △(σ, σ_M) ≤ △(σ, σ_L) + △(σ_L, σ_M), i.e., the cost of transforming σ directly into σ_M is at most the cost of transforming it into σ_L and then into σ_M. As Φ(σ_L, M_B) returns the cost of the most similar sequence in M_B to σ_L, inserting the model traces of the alignments directly into M_B yields a smaller upper bound for the alignment cost. Moreover, it is possible to have λ_SN(γ_SN(σ_1)) = λ_SN(γ_SN(σ_2)) for σ_1 ≠ σ_2; therefore, by inserting λ_SN(σ_1) instead of σ_1 into M_B, we obtain an M_B with fewer members, which increases the performance of the approximation.

To select the candidate traces in L_C, we propose three different methods (sketched in the code below). We can select these traces randomly, or based on their frequency in the event log (i.e., L(σ_L)). The third possible method is to apply a clustering algorithm on the event log and put the traces into K different clusters based on their control-flow information; we then select one trace, i.e., the medoid, for each cluster as the representative of all the cluster's members. It is expected that by using this approach, the detected bounds will be more accurate.

Table 1. Result of using the proposed approximation method for the event log given in Fig. 1, with M_B = {⟨a, b, e⟩, ⟨a, b, c, e⟩}.

After constructing M_B, we use it for all traces in L′_C.
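The sketch below illustrates the three candidate selection strategies under simplifying assumptions: align_model_trace is a hypothetical stand-in for an oracle computing λ_SN(σ) (in practice, an alignment implementation such as the one in ProM), and the K-Medoids variant reuses edit_distance from the previous sketch. It is an illustration, not the plug-in's code.

```python
import random
from collections import Counter
# edit_distance(...) is taken from the previous sketch.

def build_model_behavior(log, k, align_model_trace, method="frequency", seed=0):
    """Construct M_B from k candidate variants selected from the event log.

    `log` is a list of traces (tuples of activity labels); `align_model_trace`
    is a placeholder for lambda_SN, returning the visible model trace of an
    optimal alignment of a given trace."""
    variants = list(Counter(log).items())            # (variant, frequency)
    if method == "frequency":                        # most frequent first
        candidates = [v for v, _ in sorted(variants, key=lambda x: -x[1])[:k]]
    elif method == "random":
        candidates = random.Random(seed).sample([v for v, _ in variants], k)
    elif method == "clustering":                     # one medoid per cluster
        candidates = k_medoids([v for v, _ in variants], k)
    else:
        raise ValueError(method)
    # Inserting lambda_SN(sigma) rather than sigma itself keeps the upper
    # bound tighter (triangle inequality) and deduplicates M_B.
    model_behavior = {align_model_trace(sigma) for sigma in candidates}
    return model_behavior, candidates

def k_medoids(variants, k, n_iter=10, seed=0):
    """Plain K-Medoids over trace variants using the edit distance."""
    medoids = random.Random(seed).sample(variants, k)
    for _ in range(n_iter):
        clusters = {m: [] for m in medoids}
        for v in variants:                           # assign to closest medoid
            clusters[min(medoids, key=lambda m: edit_distance(v, m))].append(v)
        # Move each medoid to the member minimizing total intra-cluster distance.
        new_medoids = [min(ms, key=lambda c: sum(edit_distance(c, v) for v in ms))
                       for ms in clusters.values() if ms]
        if set(new_medoids) == set(medoids):         # converged
            break
        medoids = new_medoids
    return medoids
```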
Note that for the simulation method, L_C = ∅ and L′_C = L. Moreover, for the candidate selection method, we reuse the alignment values that were already computed while constructing M_B. To compute the lower bound for the fitness value, we compute the fitness value of every σ ∈ L′_C using Φ(σ, M_B); afterwards, the lower bound for the fitness value of the log is the weighted average of these fitness values and those of the alignments computed in the previous part. For the upper bound of the fitness value, we compare the length of each trace in L′_C with the shortest path in the model (i.e., SPM). To find SPM, we compute the cost of the optimal alignment of an empty trace (i.e., ⟨⟩) and the system net; in the example given in Fig. 1, the shortest path of the model is ⟨a, b, e⟩.

To compute the approximated values, for each trace σ ∈ L′_C, we compute Φ(σ, M_B) and compare the resulting fitness to the average fitness value of L_C. If the fitness value of the new trace is higher than Fitness(L_C, SN), we consider the fitness derived from Φ(σ, M_B) as the approximated fitness value; otherwise, Fitness(L_C, SN) is used for the approximation. Similar to the bounds, we use the weighted averages of the fitness values of L_C and L′_C to compute the approximated fitness value of the whole event log. Note that for the simulation method, where L_C = ∅, the approximated fitness value for each trace (and for the whole event log) equals the lower bound. Finally, the proposed method returns the number of asynchronous (i.e., deletions and insertions) and synchronous moves for each activity in the event log. This information helps the data analyst to find the sources of deviations.

The computed bounds and the approximated fitness values for each trace and for the overall event log of Fig. 1, based on M_B = {⟨a, b, e⟩, ⟨a, b, c, e⟩}, are given in Table 1. This M_B can be obtained by computing the alignments of the two most frequent traces in the event log or by simulation. The approximated fitness is 0.929, and its accuracy equals 0.008; the proposed bounds are 0.95 and 0.898. Moreover, the method returns the number of insertions and deletions: 1 insertion for a, 5 insertions for b, 3 deletions for c, 3 deletions for d, and nothing for e. By increasing |M_B|, we expect more accurate approximations and bounds; however, increasing |M_B| for the candidate selection approach increases the number of required alignment computations and consequently the computation time.
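Putting the pieces together, here is a minimal sketch of the per-trace bounds and approximation just described. It assumes the phi function from the earlier sketch. The cost lower bound max(0, SPM − |σ|) is our reading of "comparing the length of each trace with SPM" (with insertions and deletions only, a trace shorter than the shortest model path requires at least SPM − |σ| insertions); the exact formula is not spelled out in the text.

```python
def fitness_from_cost(cost, trace_len, spm):
    # Eq. 1: the worst case deletes every activity of the trace and
    # inserts the shortest model path (SPM).
    return 1.0 - cost / (trace_len + spm)

def approximate_log_fitness(log, model_behavior, candidate_fitness, spm):
    """Approximated fitness and lower/upper fitness bounds for an event log.

    `log` is a list of traces (frequency-weighted by repetition);
    `candidate_fitness` maps each candidate variant in L_C to its exact
    alignment fitness, computed while constructing M_B (empty for the
    simulation method)."""
    avg_cand = (sum(candidate_fitness.values()) / len(candidate_fitness)
                if candidate_fitness else 0.0)
    lower = upper = appx = 0.0
    for sigma in log:
        if sigma in candidate_fitness:            # exact value already known
            lo = up = ap = candidate_fitness[sigma]
        else:
            cost_ub = phi(sigma, model_behavior)  # upper bound on alignment cost
            lo = fitness_from_cost(cost_ub, len(sigma), spm)
            # Our reading of the length comparison with SPM (see lead-in).
            up = fitness_from_cost(max(0, spm - len(sigma)), len(sigma), spm)
            ap = max(lo, avg_cand)                # approximation rule from the text
        lower, appx, upper = lower + lo, appx + ap, upper + up
    n = len(log)
    return lower / n, appx / n, upper / n
```

For the simulation method, candidate_fitness is empty and the approximated value falls back to the lower bound, matching the behavior described above.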
In this section, we explore the accuracy and the performance of our methods. We first explain the implementation and, subsequently, the experimental setting; finally, the experimental results and some discussion are provided. To apply the proposed conformance approximation method, we implemented the Conformance Approximation plug-in in the ProM [25] framework. It takes an event log and a Petri net as inputs and returns the conformance approximation, its bounds, and the deviation rates of different activities. In this implementation, we let the user adjust the size of M_B and the method used to select and insert model traces into it (i.e., simulation or alignment of selected candidates). If the user decides to use alignments for creating the model behavior, she can select candidates based on their frequency, randomly, or using the clustering algorithm. For finding the distance between a log trace and a model trace, we used the edit distance function, an adapted version of the Levenshtein distance [22]. To cluster traces, we implemented the K-Medoids algorithm, which returns one trace as a candidate (medoid) per cluster [26], based on their edit distance. To apply the methods to various event logs with different parameters, we ported the developed plug-in to RapidProM, an extension of RapidMiner that combines scientific workflows with several process mining algorithms [27].

We applied the proposed methods to eight different real event logs; some information about these event logs is given in Table 2. Here, uniqueness refers to the ratio #Variants / #Traces. For process discovery, we used the Inductive Miner [34] with infrequent thresholds equal to 0.3, 0.5, and 0.7. We applied the conformance approximation methods with different settings; the approximation parameter controls, for the candidate selection methods, the percentage of trace variants for which optimal alignments are computed and, for the simulation method, the relative number of simulated model traces. We also compared our proposed method with the statistical sampling method [20]; the approximation parameter for this method determines the size and the accuracy of sampling, and we consider ε = δ = approximation parameter × 0.001. We did not consider [12] in the experiments, as it does not improve the performance of normal alignment computation [35] for event logs with few unique activities using the default setting; even for some event logs with many unique activities in [12], the performance improvement of our methods is higher. Because of the page limit, we do not show the results of this experiment here. In all experiments and for all methods, we used eight CPU threads. Moreover, each experiment was repeated four times, since the conformance checking time is not deterministic, and the average values are shown.

To evaluate how conformance approximation improves the performance of the conformance checking process, we use PI = (Normal Conformance Time) / (Approximated Conformance Time); a higher PI value means conformance is computed in less time. As all our proposed methods need a preprocessing phase (e.g., for clustering the traces), we compute the PI both with and without the preprocessing phase. The accuracy of the approximation, i.e., how close the approximated fitness is to the actual fitness value, is computed by Accuracy = |AccFitness − AppxFitness|. Also, we measure the distance between the provided upper and lower bounds; the bound width of an approximation is computed by BoundWidth = UBFitness − LBFitness. Tighter bound widths mean more accurate bounds.

In Fig. 3, we show how the different approximation methods improve the performance of conformance checking. For most of the cases, the improvement is highest for the simulation method. This is because the most time-consuming part of conformance checking is computing the optimal alignment, and the simulation method needs no alignment computation at all, making it faster than any other method. For some event logs, the statistical sampling method [20] is not able to provide the approximation faster than normal conformance checking (i.e., PI < 1). This happens because this method cannot benefit from parallel computation of alignments, and after each alignment computation it needs to check whether more alignments are required. For the statistical method, decreasing the approximation parameter leads to more precise approximations; however, it also results in lower PI values. Among the candidate selection methods, using the frequency method usually leads to a higher PI value.
For some event logs, e.g., Road, none of the methods has a high PI value. This happens because in Fig. 3 we consider the preprocessing time. The preprocessing time corresponds to choosing the candidate traces and simulating the process model behaviors, which needs to be done once per event log or process model. For the candidate selection methods, this phase is independent of the process model, so no process model needs to be considered for it; for the simulation method, this phase is independent of the given event log. Thus, we are able to do the preprocessing step before the conformance approximation. If we use event log standards such as MXML or Parquet, we do not even need to preprocess the event log for the frequency and random methods, because we know the number of variants and their frequencies beforehand. In Fig. 4, we show the performance improvement without considering the preprocessing time. As the statistical sampling method does not have a preprocessing phase, it is not shown in this figure. The figure shows a linear decrease in the improvement of the candidate selection methods as the approximation parameter increases. This is expected, as increasing this parameter for the candidate selection methods means more optimal alignment computations, which require more time. For example, setting this parameter to 5 means that we need to compute 5% of all optimal alignments of the normal conformance checking; therefore, it is expected that the approximated conformance value is computed about 20 times faster than using normal alignments.

After analyzing the performance improvement capabilities of the proposed methods, we compare the accuracy of their approximations in Table 3, which shows the average accuracy values of the approximated conformance values. A lower value means higher accuracy, in other words, the approximated fitness value is closer to the actual fitness value. In this table, Fitness shows the actual fitness value when the normal conformance checking method is used. We used different values for the approximation parameter, as explained in Sect. 5.2. The results show that for most of the event logs, the accuracy of the simulation method is not good enough. However, for BPIC-2018-Reference and BPIC-2018-Department, which have simpler process models, this method generated almost all the model behavior (i.e., M_B = φ_v(SN)) and obtained perfect accuracy. The results show that using the statistical and frequency methods, we usually obtain accuracy values below 0.01, which is acceptable for many applications. Among these methods, the results of the statistical sampling method are more stable and accurate; however, the accuracy of the candidate selection methods usually improves when a higher approximation parameter is used.

In the next experiment, we evaluate the provided bounds for the approximation. Figure 5 shows how increasing the value of the approximation parameter increases the accuracy of the provided lower and upper bounds. As the statistical method does not provide any bounds, we do not consider it in this experiment. From Fig. 4 and Fig. 5, we observe that there is a trade-off between the performance and the accuracy of the approximation methods: by increasing the number of visible traces in M_B, we need more time to approximate the fitness value, but we provide more accurate bounds.
In the case that we set the approximation parameter to 100, the bound width will be zero; however, there will not be any improvement in the performance of conformance checking. By adjusting the approximation parameter, the end user is able to specify the desired performance improvement. Figure 5 shows that for some event logs, like Sepsis and BPIC-2018-Inspection, none of the approximation methods is able to provide tight bounds. That happens because in these event logs, not only do we have many unique traces, but these traces are also not close to each other. In Table 4, we show, for each event log, the average edit distance of each trace to the most similar other trace, i.e., the average over all σ ∈ L of Φ(σ, L − {σ}). If the traces in an event log are similar to each other, the approximation methods are able to provide tight bounds. This characteristic of the event log can be analyzed before the approximation, without any process model; it is therefore advisable to use more traces in M_B when the traces are not similar. Using this preprocessing step, the user is able to adjust the approximation parameter more easily.

Finally, we analyze the accuracy of the provided information about deviations. We first compute the normal alignments of the event logs and process models. Thereafter, for each alignment, we determine the six most problematic activities based on their deviation ratio, i.e., the number of asynchronous moves (deletions and insertions) of an activity divided by its total number of moves. Afterwards, we compare the deviation ratios of these problematic activities with those obtained when the approximation method is used. The result of this experiment is given in Table 5, which compares the deviation ratios of the six most problematic activities using normal alignments (Real) and the frequency-based approximation method (Appx). Here, we used the frequency selection method with an approximation parameter equal to 10. We did not compare the results with the statistical method, as that method targets either the fitness value or the number of asynchronous moves, but cannot return both at the same time. The results show that using the frequency method, we find the problematic activities that have high deviation rates.
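A minimal sketch of how the per-activity deviation ratios can be counted from alignments: it assumes alignments are given as (log move, model move) label pairs with ">>" for "no move" and silent-transition moves already filtered out. The representation is illustrative, not the plug-in's data structure, and the ratio formula matches the reading given above.

```python
from collections import Counter

def deviation_ratios(alignments):
    """Per-activity deviation ratio: asynchronous moves / all moves.

    Each alignment is a list of (log_move, model_move) label pairs;
    '>>' denotes "no move"; moves on silent transitions are assumed
    to have been filtered out beforehand."""
    sync, async_ = Counter(), Counter()
    for alignment in alignments:
        for log_move, model_move in alignment:
            if log_move == ">>":          # insertion: move on model only
                async_[model_move] += 1
            elif model_move == ">>":      # deletion: move on log only
                async_[log_move] += 1
            else:                         # synchronous move
                sync[log_move] += 1
    return {a: async_[a] / (async_[a] + sync[a])
            for a in set(async_) | set(sync)}

# Running example (gamma_2): d is a move on log only, so its ratio is 1.0.
ratios = deviation_ratios([[("a", "a"), ("c", "c"), ("b", "b"),
                            ("d", ">>"), ("e", "e")]])
assert ratios["d"] == 1.0 and ratios["a"] == 0.0
```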
Considering all the experiments, we conclude that using the frequency of traces for selecting candidates is the most practical choice. Moreover, the candidate selection methods give users more flexibility to trade off between the performance and the accuracy of the approximation, compared to the statistical method, which sometimes cannot improve the performance and has non-deterministic results. In addition, the proposed methods provide bounds for the approximated alignment value and deviation rates for activities, which is useful for many diagnostic applications. Finally, the proposed methods are able to use parallel computation and benefit from the available computational resources.

In this paper, we proposed approximation methods for the conformance value that also provide upper and lower bounds. Instead of computing the exact alignments between the process model and all the traces in the event log, we propose to consider just a subset of the possible behavior of the process model and to use it for approximating the conformance value via the edit distance function. We can find this subset by computing the optimal alignments of some candidate traces in the event log or by simulating the process model. To evaluate our proposed methods, we implemented them in the ProM framework, ported them to RapidProM, and applied them to several real event logs. The results show that these methods decrease the conformance checking time while finding approximated values close to the actual alignment values. We found that the simulation method is suitable when the given process model is simple. We also showed that the frequency method is the most applicable way to select candidate traces and obtain accurate results. The results also indicate that although the statistical method [20] is able to approximate accurately, it takes more time and, for some event logs, is slower than normal conformance checking. As future work, we aim to find out which subset selection method is best given the available time and event data. Also, it is possible to provide an incremental approximation tool that extends M_B over time and lets the end user decide when the accuracy is sufficient. Here, we did not use probabilities for the simulation method; we expect that using the distribution of behavior in the event log would enhance the simulation method.

References

[1] Process Mining - Data Science in Action
[2] Conformance checking of processes based on monitoring real behavior
[3] Alignment based precision checking
[4] Online conformance checking: relating event streams to process models using prefix-alignments
[5] Data-aware process mining: discovering decisions in processes using alignments
[6] Model repair - aligning process models to reality
[7] Evolution of compliance checking in process mining discipline
[8] Epilogue. In: Conformance Checking
[9] Replaying history on process models for conformance checking and performance analysis
[10] Decomposing Petri nets for process mining: a generic approach
[11] Single-entry single-exit decomposed conformance checking
[12] Recomposing conformance: closing the circle on decomposed alignment-based conformance checking in process mining
[13] Divide and conquer: a tool framework for supporting decomposed discovery in process mining
[14] A recursive paradigm for aligning observed behavior of large structured process models
[15] Process mining meets abstract interpretation
[16] How much event data is enough? A statistical framework for process discovery
[17] Statistical sampling in process mining discovery
[18] Flexible heuristics miner (FHM)
[19] The impact of event log subset selection on the performance of process discovery algorithms
[20] Estimating process conformance by trace sampling and result approximation
[21] Approximate computation of alignments of business processes through relaxation labelling
[22] On the theory and computation of evolutionary distances
[23] Computation of normalized edit distance and applications
[24] Measuring precision of modeled behavior
[25] ProM: the process mining toolkit. BPM (Demos)
[26] Effective spell checking methods using clustering algorithms
[27] RapidProM: mine your processes and not just your data
[28] van Dongen, B.: BPI challenge (event data)
[29] BPI challenge 2018 (event data)
[30] van Dongen, B.: BPI challenge (event data)
[31] Hospital billing - event log
[32] Road traffic fine management process (event log)
[33] Sepsis cases - event log
[34] Discovering block-structured process models from event logs containing infrequent behaviour
[35] Efficiently computing alignments
[36] Prototype selection based on clustering and conformance metrics for model discovery