JAVA 2013 IEEE DATAMINING PROJECT Region based foldings in process discovery

Region-Based Foldings in Process Discovery
ABSTRACT
A central problem in the area of Process Mining is to obtain a formal model that represents the processes
that are conducted in a system. If realized, this simple motivation allows for powerful techniques that can be
used to formally analyze and optimize a system, without the need to resort to its semiformal and sometimes
inaccurate specification. The problem addressed in this paper is known as Process Discovery: to obtain a formal
model from a set of system executions. The theory of regions is a valuable tool in process discovery: it aims at
learning a formal model (a Petri net) from a set of traces. In its original form, the theory is applied to an
automaton, and therefore the traces must first be converted into an acyclic automaton in order to apply these
techniques. Given that the complexity of the region-based techniques depends on the size of the input automaton,
revealing the underlying cycles and folding the initial automaton can significantly reduce the complexity
of the region-based techniques. In this paper, we follow this idea by incorporating region
information in the cycle detection algorithm, enabling the identification of complex cycles that cannot be
obtained efficiently with state-of-the-art techniques. The experimental results obtained by the devised tool
suggest that the techniques presented in this paper are a big step toward widening the application of the theory of
regions in Process Mining for industrial scenarios.
Existing System
The discovery of global patterns that can be used to make predictions about the future has been one of the key
elements that have made Data Mining one of the most relevant research areas in recent decades. Data
mining techniques can be applied naturally to large amounts of data such as databases or even the Internet, and with
the help of other disciplines like statistics or machine learning, can effectively reveal important patterns in many
scenarios such as health care, business or transportation. As in data mining, Process Discovery tries to reveal
patterns. However, the patterns sought by Process Discovery techniques are process models, i.e., formal
representations of the processes of a system. Due to their different focus, Process Discovery techniques apply
disciplines different from the ones used in data mining, to allow for the derivation of both the statics and the
dynamics of a system process. Depending on the emphasis, different dimensions can be considered, ranging
from social (the identification of communities) to control-flow (the identification of the complex interplay
between the system’s tasks). In this work we consider the latter: discovering a Petri net from a log, that is, from a
set of traces corresponding to executions of a system. The first method to obtain a Petri net from a log was presented.
Disadvantages
To overcome this limitation, several extensions have been presented in the literature to widen the class
of Petri nets that the algorithm can discover.
The theory of regions was initially proposed to solve the synthesis problem: obtain a Petri net that has a
behavior equivalent to a given transition system.
Proposed System
The theory of regions was initially proposed to solve the synthesis problem: obtain a Petri net that has a
behavior equivalent to a given transition system (TS). Three conversions from a language to a TS have been
proposed, namely sequence, multiset, and set. The main difference between them is how they decide whether the
occurrence of an event in a trace produces a new state in the TS or just introduces an arc to an existing state.
Besides these conversions, a number of additional conversions have been proposed in the literature that yield
smaller TSs by means of abstractions, at the cost of sacrificing regions. We use the term
abstraction techniques to refer to them. The fundamental difference between all these methods and our proposal
is that, in our case, the set of sacrificed regions is controlled by bounds that are already used by
process discovery tools, so the compression of the TS does not involve a quality reduction.

Advantages
An advantage of region theory for process discovery is that it allows label splitting.
Despite the advantages offered by the theory of regions, there are two main reasons that hamper a wider
adoption of region-based Process Discovery methodologies in an industrial setting. One is their
sensitivity to noise; the other is their complexity, which depends on the size of the input automaton.
For rbminer, the benefits of folding are twofold, since a smaller region basis also reduces the number of
regions to explore. In this case, both advantages (state and basis reduction) combine to achieve orders of
magnitude speedups.
Module
1. Get Input Text File
2. Discovery Sentence Word
3. Decided Sentence
4. Tandem Repeats
5. Sequence And Multiset Conversions
6. Counting Data
Module Description
Get Input Text File
Process Discovery differs from synthesis in the knowledge assumption: while in synthesis one
assumes a complete description of the system, only a partial description of the system is assumed in Process
Discovery. Therefore, equivalence or bisimulation is no longer a goal to achieve. Instead, obtaining
approximations that succinctly represent the log under consideration is more valuable.
Discovery Sentence Word
The fact that a discovery algorithm returns a Petri net (PN) with a smaller language than desired is referred to as
overfitting. A classical strategy to avoid overfitting is to allow the algorithms to restrict their output to
k-bounded PNs (k-bounded discovery), usually for small values of k, as nets with high numbers of tokens are
considered harder to understand for humans than nets with fewer tokens. The particular k used in each case can
be determined either from the desired level of complexity of the resulting PN or from the number of available
resources in the system (since places can represent resources).
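
The k-bounded restriction can be made concrete with a small check. The sketch below is only an illustration and
is not taken from the paper: it assumes the standard multiset-region condition from region theory (every event
changes the token count by the same amount on all of its arcs), which this summary does not spell out. The class
RegionCheck, the method isKBoundedRegion, and the encoding of arcs as {source, event, target} arrays are
hypothetical names chosen for the example.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class RegionCheck {
    /**
     * Checks whether a token assignment is a k-bounded region of a TS.
     * arcs:   each arc is {source, event, target} (hypothetical encoding)
     * tokens: state -> number of tokens the candidate place would hold there
     */
    static boolean isKBoundedRegion(List<String[]> arcs, Map<String, Integer> tokens, int k) {
        // k-boundedness: no state may hold a negative count or more than k tokens.
        for (int t : tokens.values()) {
            if (t < 0 || t > k) return false;
        }
        // Region condition (assumed standard definition): each event has one gradient.
        Map<String, Integer> gradient = new HashMap<>();
        for (String[] arc : arcs) {
            int delta = tokens.getOrDefault(arc[2], 0) - tokens.getOrDefault(arc[0], 0);
            Integer seen = gradient.putIfAbsent(arc[1], delta);
            if (seen != null && seen != delta) return false;
        }
        return true;
    }
}
```

For the chain s0 -a-> s1 -b-> s2 -a-> s3, the assignment 0, 1, 1, 2 satisfies the region condition (a adds one
token, b adds none); it is accepted for k = 2 and rejected for k = 1.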
Decided Sentence
Three conversions from a language to a TS have been proposed, namely sequence, multiset, and set. The main
difference between them is how they decide whether the occurrence of an event in a trace produces a new state
in the TS or just introduces an arc to an existing state.
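
The three conversions can be sketched as follows. This is a minimal illustration under stated assumptions, not
the tool's implementation: a state is identified by a key computed from the prefix read so far (the full sequence,
the multiset of events, or the set of events), and a new state appears only when that key has not been seen
before. The class LogToTs, its methods, and the encoding of arcs as three-element lists are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

final class LogToTs {
    enum Conversion { SEQUENCE, MULTISET, SET }

    /** Maps a trace prefix to the key of the state it reaches. */
    static String stateKey(List<String> prefix, Conversion conversion) {
        switch (conversion) {
            case SEQUENCE:
                return prefix.toString();               // order and counts matter
            case MULTISET: {
                TreeMap<String, Integer> counts = new TreeMap<>();
                for (String e : prefix) counts.merge(e, 1, Integer::sum);
                return counts.toString();               // only counts matter
            }
            default:
                return new TreeSet<>(prefix).toString(); // only presence matters
        }
    }

    /** Builds the arcs of the TS; each arc is the list [source, event, target]. */
    static Set<List<String>> build(List<List<String>> log, Conversion conversion) {
        Set<List<String>> arcs = new LinkedHashSet<>();
        for (List<String> trace : log) {
            List<String> prefix = new ArrayList<>();
            String source = stateKey(prefix, conversion);
            for (String event : trace) {
                prefix.add(event);
                String target = stateKey(prefix, conversion);
                arcs.add(List.of(source, event, target)); // duplicates collapse in the set
                source = target;
            }
        }
        return arcs;
    }

    /** Collects the states that occur as source or target of some arc. */
    static Set<String> states(Set<List<String>> arcs) {
        Set<String> states = new TreeSet<>();
        for (List<String> arc : arcs) {
            states.add(arc.get(0));
            states.add(arc.get(2));
        }
        return states;
    }
}
```

For the log {a b a b, b a}, the sequence conversion yields 7 states, the multiset conversion 6, and the set
conversion 4, which shows how coarser keys merge states and, in the set case, already introduce cycles.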
Tandem Repeats
The detection of unfolded cycles in an acyclic TS is a problem related to finding consecutively repeated patterns
in a string. The latter problem has been studied in several fields with many variations and under different
names, although it is often referred to as the problem of finding tandem repeats.
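
To illustrate the string problem only, the following naive sketch searches a single trace for consecutively
repeated blocks. It is a brute-force search and not the region-aware cycle detection proposed in the paper; the
class TandemRepeats and the {start, period, copies} result encoding are hypothetical names for the example.

```java
import java.util.ArrayList;
import java.util.List;

final class TandemRepeats {
    /** Returns {start, period, copies} for every block repeated at least twice in a row. */
    static List<int[]> find(List<String> trace) {
        List<int[]> result = new ArrayList<>();
        int n = trace.size();
        for (int period = 1; period <= n / 2; period++) {
            for (int start = 0; start + 2 * period <= n; start++) {
                // Skip positions that only continue a repeat starting one period earlier.
                if (start >= period && blockEquals(trace, start - period, start, period)) continue;
                int copies = 1;
                while (start + (copies + 1) * period <= n
                        && blockEquals(trace, start, start + copies * period, period)) {
                    copies++;
                }
                if (copies >= 2) result.add(new int[] {start, period, copies});
            }
        }
        return result;
    }

    /** Compares the blocks of the given length that start at positions i and j. */
    private static boolean blockEquals(List<String> t, int i, int j, int length) {
        for (int k = 0; k < length; k++) {
            if (!t.get(i + k).equals(t.get(j + k))) return false;
        }
        return true;
    }
}
```

On the trace x a b a b a b y, the search reports the block a b repeated three times from position 1, and also
the rotated block b a repeated twice from position 2. A folding step would treat such repeats as candidate
unfoldings of a cycle.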
Sequence And Multiset Conversions
Besides the sequence and multiset conversions, other conversions have been proposed that can yield smaller
TSs at the cost of sacrificing regions. We use the term abstraction techniques to refer to them. The fundamental
difference between all these methods and our proposal is that, in our case, the set of sacrificed regions is
controlled considering bounds that are already used by process discovery tools, thus the compression of the TS
does not involve a quality reduction.
Counting Data
Because region-based approaches yield PNs that never reject a trace of the log, they are extremely sensitive
to noise. Hence, to be applicable, the approach presented in this paper must be preceded by a noise filtering
phase. The filtering can be done by clustering techniques or by outlier detection. Considering the
frequencies of the states is another way, in our approach, to distinguish between real and noisy states, because
the latter often have low frequency. For instance, only Parikh vector differences between frequent states could be
taken into account to differentiate real folding opportunities from spurious cycle unfoldings caused by noise. An
advantage of region theory for process discovery is that it allows label splitting (i.e., changing the
label of some arcs in the TS so that an event is actually represented by a set of different events). Label splitting
is a technique that can help improve the visualization of the PN and also avoid generalizing too
much. This technique can also be used with the TSs produced by our approach. However, the splitting options
might be reduced, because arcs with the same label in the original TS may have been merged
into one arc in the folded TS.

FLOW CHART
Region-Based Process Discovery
Get The Input Text File
Discovery Sentence Word
Sequence and Multiset, Tandem Repeats, Counting Data

CONCLUSION
This paper presents a novel technique for compacting a TS, one of the objects typically used in process discovery
algorithms. Two main characteristics make this technique very attractive in the context of
region-based k-bounded process discovery: first, it is one of the most aggressive folding techniques in the
literature, and second, it preserves the important regions that are crucial for PN derivation. The use of
region-aware folding techniques like the one presented in this paper may be a crucial step toward using
region-based algorithms for process discovery in industrial scenarios.
