On the Role of Fitness, Precision, Generalization and Simplicity in Process Discovery
1. On the Role of Fitness, Precision, Generalization and Simplicity in Process Discovery
Joos Buijs
Boudewijn van Dongen
Wil van der Aalst
http://www.win.tue.nl/coselog/
2. Advances in Process Mining
• Many process discovery and conformance checking algorithms
and tools are available (cf. the various ProM packages).
• Also commercial software based on these ideas:
Disco (Fluxicon), Reflect (Futura/Perceptive), BPMOne (Pallas Athena/Perceptive),
ARIS Process Performance Manager (Software AG), Interstage Automated Process
Discovery (Fujitsu), QPR ProcessAnalyzer/Analysis (QPR Software), flow
(fourspark), Discovery Analyst (StereoLOGIC), etc.
• We applied process mining in over 100 organizations.
• More than 75 people from more than 50 organizations created the Process Mining Manifesto in the context of the IEEE Task Force on Process Mining. It is available in 13 languages.
5. Challenge: Four Competing Quality Criteria
• replay fitness: “able to replay event log”
• simplicity: “Occam’s razor”
• generalization: “not overfitting the log”
• precision: “not underfitting the log”
[Figure: process discovery at the center, surrounded by the four criteria.]
6. Example: one log, four models
Event log (1391 cases, 21 trace variants):
#     trace
455   acdeh
191   abdeg
177   adceh
144   abdeh
111   acdeg
82    adceg
56    adbeh
47    acdefdbeh
38    adbeg
33    acdefbdeh
14    acdefbdeg
11    acdefdbeg
9     adcefcdeh
8     adcefdbeh
5     adcefbdeg
3     acdefbdefdbeg
2     adcefdbeg
2     adcefbdefbdeg
1     adcefdbefbdeh
1     adbefbdefdbeg
1     adcefdbefcdefdbeg
1391  (total)
Activities: a = register request, b = examine thoroughly, c = examine casually, d = check ticket, e = decide, f = reinitiate request, g = pay compensation, h = reject request.
Four models for this log:
• N1 (uses all activities a to h between start and end): fitness = +, precision = +, generalization = +, simplicity = +
• N2 (only the sequence start, a, c, d, e, h, end): fitness = -, precision = +, generalization = -, simplicity = +
• N3 (uses all activities a to h between start and end): fitness = +, precision = -, generalization = +, simplicity = +
• N4 (one separate path from start to end for each of the 21 variants seen in the log): fitness = +, precision = +, generalization = -, simplicity = -
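As a rough illustration of why N2 is rated low on fitness while N4 is rated high, the sketch below counts the share of cases whose complete trace a model accepts. It assumes N2 accepts only the sequence acdeh and N4 accepts exactly the 21 variants in the log, as drawn on the slide. The all-or-nothing share computed by the hypothetical helper `fitting_share` is a simplification chosen for this example, not the replay fitness metric used by the authors.

```python
# Trace variants and frequencies copied from the event log on this slide.
LOG = {
    "acdeh": 455, "abdeg": 191, "adceh": 177, "abdeh": 144,
    "acdeg": 111, "adceg": 82, "adbeh": 56, "acdefdbeh": 47,
    "adbeg": 38, "acdefbdeh": 33, "acdefbdeg": 14, "acdefdbeg": 11,
    "adcefcdeh": 9, "adcefdbeh": 8, "adcefbdeg": 5, "acdefbdefdbeg": 3,
    "adcefdbeg": 2, "adcefbdefbdeg": 2, "adcefdbefbdeh": 1,
    "adbefbdefdbeg": 1, "adcefdbefcdefdbeg": 1,
}


def fitting_share(log, accepted):
    """Share of cases whose complete trace is in the accepted set.

    This is an illustrative, all-or-nothing measure (an assumption of this
    sketch); it gives no partial credit for traces that almost fit.
    """
    total = sum(log.values())
    fitting = sum(count for trace, count in log.items() if trace in accepted)
    return fitting / total


n2_language = {"acdeh"}   # assumed: N2 allows only this one sequence
n4_language = set(LOG)    # assumed: N4 enumerates exactly the seen variants

print(f"N2: {fitting_share(LOG, n2_language):.3f}")  # N2: 0.327
print(f"N4: {fitting_share(LOG, n4_language):.3f}")  # N4: 1.000
```

Under these assumptions N2 replays only the 455 cases of its single variant, whereas N4 replays every case but does so by enumerating the log, which matches its low generalization and simplicity ratings.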
7. Model N1
[Figure: model N1, running from start to end over the activities a (register request), b (examine thoroughly), c (examine casually), d (check ticket), e (decide), f (reinitiate request), g (pay compensation) and h (reject request), shown next to the event log of slide 6 (21 variants, 1391 cases).]
N1 : fitness = +, precision = +, generalization = +, simplicity = +
9. Model N3
[Figure: model N3, running from start to end over the activities a (register request), b (examine thoroughly), c (examine casually), d (check ticket), e (decide), f (reinitiate request), g (pay compensation) and h (reject request), shown next to the event log of slide 6 (21 variants, 1391 cases).]
N3 : fitness = +, precision = -, generalization = +, simplicity = +
10. Model N4
[Figure: model N4, with one separate path from start to end for each of the 21 variants seen in the log (for example a d c e g, a c d e g, a d c e h, a c d e h, …, a b d e g, a d b e h, a b d e h), shown next to the event log of slide 6 (21 variants, 1391 cases).]
N4 : fitness = +, precision = +, generalization = -, simplicity = -
11. Another challenge: Huge search space
[Figure: the search space drawn as a large cloud of candidate models, each marked M.]
12. … with just a few interesting candidates
[Figure: the same cloud of candidate models, each marked M; only a few of them are interesting candidates.]
13. Two requirements
1. It should be possible to seamlessly balance the different quality criteria based on user-defined preferences.
2. The algorithm should always return a "correct" process model and not waste time on models having deadlocks and other anomalies.
14. Proposal: Evolutionary Tree Miner (ETM)
• Process trees as representation (= limit search space
to "good" models).
• Genetic approach (= very flexible)
• Fitness function uses all four criteria (= seamlessly
balance the different "forces")
15. Representational Bias: Process Trees
• Always sound because of the block structure
• Also Loop and OR operator
[Figure: example process trees built from the operator nodes →, ∧ and X over the activities A to E, together with the expression A (BA)* E.]
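A minimal sketch of how a block-structured process tree might be represented, assuming leaves are single-letter activities and inner nodes carry one of three operators (sequence, exclusive choice, parallel). The class names `Leaf` and `Node`, the operator strings and the `language` helper are illustrative assumptions, not the authors' implementation. Loop and OR are left out so that the set of traces stays finite.

```python
from dataclasses import dataclass
from itertools import product
from typing import Tuple, Union


@dataclass(frozen=True)
class Leaf:
    activity: str  # a single-letter activity label


@dataclass(frozen=True)
class Node:
    operator: str  # "seq", "xor" or "and" (names assumed for this sketch)
    children: Tuple["Tree", ...]


Tree = Union[Leaf, Node]


def interleavings(a, b):
    """All ways to interleave two traces while keeping each one's own order."""
    if not a or not b:
        return {a + b}
    return ({a[0] + rest for rest in interleavings(a[1:], b)} |
            {b[0] + rest for rest in interleavings(a, b[1:])})


def language(tree):
    """Set of traces (strings of activity letters) the tree can produce."""
    if isinstance(tree, Leaf):
        return {tree.activity}
    child_langs = [language(child) for child in tree.children]
    if tree.operator == "xor":   # exactly one child is executed
        return set().union(*child_langs)
    if tree.operator == "seq":   # children are executed left to right
        return {"".join(parts) for parts in product(*child_langs)}
    if tree.operator == "and":   # children are executed in any interleaving
        result = {""}
        for lang in child_langs:
            result = {t for done in result for new in lang
                      for t in interleavings(done, new)}
        return result
    raise ValueError(f"unknown operator: {tree.operator}")


# Hypothetical tree over the activities of the earlier event log:
# a, then (b or c) in parallel with d, then e, then g or h.
example = Node("seq", (
    Leaf("a"),
    Node("and", (Node("xor", (Leaf("b"), Leaf("c"))), Leaf("d"))),
    Leaf("e"),
    Node("xor", (Leaf("g"), Leaf("h"))),
))
print(sorted(language(example)))
```

Because every operator node fully encloses its children, any tree built this way has a well-defined start and end, which is the intuition behind the slide's claim that block-structured trees are always sound.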
16. Petri Net Semantics
(used for comparison and conformance checking only)
[Figure: Petri net fragments over the activities A and B, one for each operator: Sequence, Parallelism, Exclusive Choice, Or Choice and Loop.]
17. Steps of the Genetic ETM Algorithm
1. Create Initial Population
2. Measure Quality
3. Stop? If yes: Return Best Individual
4. If no: Change Population, then continue with step 2
18. Population Change
[Figure: Population i is turned into Population i+1 using the steps Elite, Selection, Crossover, Mutation and Replace.]
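The loop of slides 17 and 18 can be sketched generically as follows. This is a plain evolutionary loop under stated assumptions, not the ETM implementation: individuals here are bit lists on a toy problem rather than process trees, tournament selection of size 3 is an assumed selection scheme, and all function and parameter names are hypothetical.

```python
import random


def evolve(quality, random_individual, crossover, mutate,
           pop_size=20, elite_size=2, generations=50, target=1.0, seed=0):
    """Generic evolutionary loop: create, measure, stop or change population."""
    rng = random.Random(seed)
    # Create the initial population.
    population = [random_individual(rng) for _ in range(pop_size)]
    for _ in range(generations):
        # Measure quality and order the population best first.
        population.sort(key=quality, reverse=True)
        # Stop as soon as the best individual reaches the target quality.
        if quality(population[0]) >= target:
            break
        # Change the population: the elite is carried over unchanged.
        elite = population[:elite_size]
        offspring = []
        while len(offspring) < pop_size - elite_size:
            # Selection (assumed: tournament of 3), then crossover and mutation.
            parent1 = max(rng.sample(population, 3), key=quality)
            parent2 = max(rng.sample(population, 3), key=quality)
            offspring.append(mutate(crossover(parent1, parent2, rng), rng))
        # Replace the old population with the elite plus the offspring.
        population = elite + offspring
    return max(population, key=quality)


# Toy problem standing in for process trees: maximize the number of 1-bits.
N = 12


def quality(bits):
    return sum(bits) / N  # scaled to the range 0..1


def random_individual(rng):
    return [rng.randint(0, 1) for _ in range(N)]


def crossover(a, b, rng):
    cut = rng.randrange(1, N)
    return a[:cut] + b[cut:]


def mutate(bits, rng):
    i = rng.randrange(N)
    return bits[:i] + [1 - bits[i]] + bits[i + 1:]


best = evolve(quality, random_individual, crossover, mutate)
print(best, quality(best))
```

Keeping the elite unchanged means the best quality in the population never decreases from one generation to the next, at the cost of some diversity.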
19. Four Metrics (see paper)
• replay fitness: “able to replay event log”
• simplicity: “Occam’s razor”
• generalization: “not overfitting the log”
• precision: “not underfitting the log”
Each metric ranges from 0 (very bad) to 1 (optimal).
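The slides state that the ETM fitness function uses all four criteria, each scored between 0 and 1, and later slides refer to per-criterion weights. One simple way to combine them, assumed here purely for illustration, is a weighted average. The function name `overall_quality` and the scores below are made up for the example; the exact formula used by ETM is in the paper and may differ.

```python
def overall_quality(scores, weights):
    """Weighted average of per-criterion scores, each expected in 0..1.

    The weighted-average form is an assumption of this sketch.
    """
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("at least one criterion needs a positive weight")
    return sum(weights[c] * scores[c] for c in weights) / total_weight


# Made-up scores for one candidate model (not results from the slides).
scores = {"fitness": 0.9, "precision": 0.8,
          "generalization": 0.6, "simplicity": 1.0}

equal = {criterion: 1 for criterion in scores}   # equal weights
fitness_heavy = dict(equal, fitness=10)          # emphasis on replay fitness

print(f"{overall_quality(scores, equal):.3f}")          # 0.825
print(f"{overall_quality(scores, fitness_heavy):.3f}")  # 0.877
```

Setting a weight to zero removes that criterion from the balance, which corresponds to the experiments on the later slides that consider only one, two or three of the four criteria.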
20. Example
[Figure: example process model over the activities A to G.]
A = send e-mail, B = check credit,
C = calculate capacity, D = check system,
E = accept, F = reject, G = send e-mail
21. Conventional Algorithms (1/3)
("best effort" mapping to process trees to allow for comparison)
• alpha miner: low fitness
• ILP miner: low precision
• language-based region miner: low fitness
24. Often unsound result and no mechanism to seamlessly balance the four criteria
25. Genetic Mining (ETM) While Considering Only One Criterion
ETM with weight zero for three of the four perspectives.
[Figure annotation: best value possible for this log.]
26. Considering Replay Fitness and One Other Criterion
ETM with weight zero for two of the four perspectives.
27. Considering 3 of 4 Criteria
Replay fitness needs to have a larger weight.
28. Considering All Four Criteria with Emphasis on Fitness
Fitness has weight 10.
29. Initial Model Versus Discovered Model
[Figure: the simulated model compared with the model discovered by ETM, over the activities A to G.]
30. Real-Life Event Logs
• Event log L0 is the event log used before. L0 contains 100 traces, 590 events and 7 activities.
• Event log L1 contains 105 traces and 743 events in total, with 6 different activities.
• Event log L2 contains 444 traces and 3,269 events in total, with 6 different activities.
• Event log L3 contains 274 traces and 1,582 events in total, with 6 different activities.
Event logs L1, L2 and L3 are extracted from the information systems of municipalities participating in the CoSeLoG project (http://www.win.tue.nl/coselog/).
31. Results
• Equal weights for all criteria.
• If a result is unsound, the sound behavior is approximated when creating the process tree.
32. Conclusion
• First algorithm that allows for balancing all four perspectives.
• The genetic algorithm is very flexible, but also very slow.
• Process trees are only used internally (choose your favorite representation).
• Future work:
− Improve speed
− Distribute PM tasks
− Discover configurable process trees