Currently, alignment-based approaches are considered the de-facto standard for measuring conformance. However, such approaches are expensive to compute as they require generating the state space of the combined model and trace behavior, a.k.a synchronous product net. The generated state space, as a graph, is further enriched with cost estimates and is fed to the A*-algorithm to find the shortest path from the start state, where the trace and a first activity of the model start, to an end target state where we have executed the whole trace and also a path in the model. In between, there could be situations where we observe only an event from the trace without a counterpart activity from the model, log move, or vice versa, model move.
The shortest path found is called an optimal alignment as it will always have the minimum number of non-synchronous moves, either log moves or model moves, that makes the trace matches the model behavior. The deviation, also called cost of an alignment is the number of non-synchronous moves. Of course, an alignment of cost 0 indicates a perfect match between the trace and the model.
A family of these approaches depend on the explicit representation of the model behavior as a set of traces [3]. Then, Levenshtein distance, based on insertion or deletions only is used to find the minimum distance and thus the cheapest alignment between a log trace and a trace in the model behavior [2]. This approach is considered approximate because the model behavior chosen is usually a subset of the whole behavior of the model. So, the cheapest alignment found is not guaranteed to be optimal, because there might be another model trace that was not picked that better aligns with the log trace.
As we pick a finite subset of the model behavior and reduce alignment computation to a kind of string matching problem, we can encode the model behavior, represented by a set of (traces) strings in compact way that can speed up the match.
Currently, finding the best alignment would require one-to-one comparison between the log trace and each trace in the model behavior.
In string matching, tries, a.k.a as prefix trees, can speed up the search by logarithmically reducing the search space with each matching character between the input string, the log trace in our case, and the set of predefined strings, the model behavior. Moreover, the encoding of the model behavior requires a logarithmic space to store the model behavior as all common prefixes between model traces will be stored once.
Consider the Petri net below to be our process model.
Out of this model, we can obtain the following set of model traces. Note that the model
- <a,b,c,e>
- <a,b,c,d,b,e>
- <a,b,c,d,b,d,b,e>
- <a,c,b,e>
- <a,b,e>
Note that, due to the loop in the model, the sequence <b,d> can repeat arbitrarily, causing traces of possibly infinite length.
Consider also the following log traces to compute alignment for:
- <a,b,c,e>
- <a,e>
- <c,e>
- <a,c,b,d,e>
- <a,b,e>
- <a, c, b, d, b, d, b, d, b, e>
The trie below is constructed from the five model traces above.
We can see that, trace <a,b,c,e>
matches a path in the trie and total cost is 0. For trace <a,e>
, event a
matches the label of node a.1
, thus, a synchronous move is made. The corresponding state is added to the queue with cost 0. Next, we try to match the first event in the suffix <e>
, to a child of node a.1
. As there is no match, we have to consider making log and model moves. The cost of making the log move is the length of the suffix in addition to the minimum path length from a.1
to an end of a trace, i.e. 0+2=2. For node a.1
, there are two model moves, i.e., executing b
via node b.2
or c
via node c.2
, the cost of making model move b
is 1+1=2, length of suffix <e>
plus the minimum path length from b.2
to the end. This cost is further reduced by 1 as b
has a child with the label e
. Thus, the final cost for taking b.2
is 1. The cost of making move c
is 1+2=3. The three states s1=[a.1, <(a,a),(>>,e)>,<>>,2]
, s2=[b.2, <(a,a),(b,>>)> ,<e> ,1]
, and s3=[c.2, <(a,a),(c,>>)> ,<e> ,3]
are enqueued. s2
is the cheapest. From s2
, a synchronous move (e,e)
is found, and we reach the end of both the path in the trie and the trace with alignment as <(a,a), (b,>>), (e,e)>
with alignment cost of 1. s2
is now a candidate state. Next, s1
will be dequeued. Here, we have traversed the whole trace, and we need to follow the shortest path in the trie. So, the alignment will be <(a,a), (>>,e), (b,>>), (e,>>)
with cost 2. However, since this cost is greater than the cost of the candidate state alignment, it is rejected. s3
is handled similarly.
For trace <c,e>
, we will examine log and model moves, as no child node of the root has the label "c". A log move is s4=[\bot, <(>>,c)> , <e> , 4]
and a model move is s5=[a.1, <(a,>>)> , <c,e> , 4]
. Both have the same cost. Assuming that s5
is dequeued first, we reach a new state s6=[a.1, <(a,>>), (c,c)> , <e> , 0]
. s6
will be dequeued next, because it was a synchronous move. From s6
, no synchronous move can be made. So, we check model and log moves. At some stage, candidate state [e.4/2, <(a,>>), (c,c), (b,>>), (e,e)> , <>, 0]
will be reached. While the queue still has more states to explore, they will not result in cheaper alignments.
This repository contains our implementation for using tries to find alignments. The most relevant code that has been used in the experimental evaluation can be found in this class.
Example code to run the checker as well as the baseline approach based on edit distance can be found in the Runner.java class.
You can download the data that we used for the experimental evaluation from this link
[1] Josep Carmona, Boudewijn F. van Dongen, Andreas Solti, Matthias Weidlich: Conformance Checking - Relating Processes and Models. Springer 2018, ISBN 978-3-319-99413-0, pp. 1-263
[2] Mohammadreza Fani Sani, Sebastiaan J. van Zelst, Wil M. P. van der Aalst: Conformance Checking Approximation Using Subset Selection and Edit Distance. CAiSE 2020: 234-251
[3] Mohammadreza Fani Sani, Juan J. Garza Gonzalez, Sebastiaan J. van Zelst, Wil M. P. van der Aalst: Conformance Checking Approximation Using Simulation. ICPM 2020: 105-112