SlideShare une entreprise Scribd logo
1  sur  11
Télécharger pour lire hors ligne
International Journal of Database Management Systems ( IJDMS ) Vol.7, No.1, February 2015
DOI : 10.5121/ijdms.2015.7103 29
MINING CLOSED SEQUENTIAL PATTERNS IN
LARGE SEQUENCE DATABASES
V. Purushothama Raju1
and G.P. Saradhi Varma2
1
Research Scholar, Dept. of CSE, Acharya Nagarjuna University
Guntur, A.P., India
2
Department of Information Technology
S.R.K.R. Engineering College, Bhimavaram, A.P., India
ABSTRACT
Sequential pattern mining is studied widely in the data mining community. Finding sequential patterns is a
basic data mining method with broad applications. Closed sequential pattern mining is an important
technique among the different types of sequential pattern mining, since it preserves the details of the full
pattern set and it is more compact than sequential pattern mining. In this paper, we propose an efficient
algorithm CSpan for mining closed sequential patterns. CSpan uses a new pruning method called
occurrence checking that allows the early detection of closed sequential patterns during the mining
process. Our extensive performance study on various real and synthetic datasets shows that the proposed
algorithm CSpan outperforms the CloSpan and a recently proposed algorithm ClaSP by an order of
magnitude.
KEYWORDS
Data mining, sequential pattern mining, closed sequential pattern mining, sequence database
1. INTRODUCTION
Sequential pattern mining is an important and active research topic in data mining. Sequential
pattern mining was first introduced by Agrawal and Srikant in [1]. Since then several efficient
sequential pattern mining algorithms have been proposed to find out sequential patterns.
Sequential pattern mining can be applied to several business and scientific applications including
biological sequence analysis, market analysis, discovering web access patterns, mining customer
shopping sequences, web click streams, XML query access patterns for caching, block
correlations in storage systems, sequences of file block references in operating systems, target
marketing, customer retention, feature selection for sequence classification, user behavior
analysis, web recommender systems, network intrusion detection, personalization systems and the
study of scientific or medical processes.
There is an increasing trend of utilizing sequential pattern mining in source code mining and
software specification mining. These tools convert the source code into a sequence database
representation, and find the sequential patterns in order to retrieve different information such as
copy-pasted code segments, API usage, programming rules and etc.
Consider a book store such as Amazon can find that the customers who have first bought a book
‘‘Introduction to Data Mining” will often purchase another book ‘‘Data Science for Business” in
International Journal of Database Management Systems ( IJDMS ) Vol.7, No.1, February 2015
30
a later transaction. Store managers can use this information to do recommendation when a user
browses the data mining book. Another useful application is to identify block access patterns of
disk systems. The block access patterns are useful to predict the blocks that are accessed next, and
these blocks can be prefetched into cache to reduce the disk I/O latency.
Sequential pattern mining produces an exponential number of patterns when the database contains
long sequences which are expensive in both time and space. The same problem also occurs in
itemset and graph mining when the patterns are long. For example, assume the database contains
a long sequence {(x1)(x2)……(x100)} it will generate 2100
- 1 frequent subsequences. Some
subsequences have the support which is equivalent to the support of the long sequence, which
are basically redundant patterns. Therefore, instead of mining the complete set of sequential
patterns, it is better to mine closed sequential patterns only. A closed sequential pattern is a
sequential pattern which has no supersequence with the same support.
Closed sequential pattern mining extensively reduces the number of patterns produced and it can
be utilized to obtain the complete set of sequential patterns. In addition, closed sequential pattern
mining algorithms make use of search space pruning techniques and outperform sequential
pattern mining algorithms. Closed sequential pattern mining helps users to find more interesting
patterns and reduces the burden of users to explore too many patterns.
As an example, consider a sequence database which has 31,000 sequences, when min_sup is 40%,
there are 16,204 sequential patterns, where as only 609 patterns are closed sequential patterns. To
find the complete set of sequential patterns, an efficient sequential pattern mining algorithm will
take about 400 seconds on a 2 GHz Intel dual core PC. Where as a fastest closed sequential
pattern mining algorithm takes only 53 seconds to mine all closed sequential patterns on the same
machine. Therefore, it is better to mine only closed sequential patterns.
There are two popular algorithms for closed sequential pattern mining: CloSpan[2] and BIDE[3].
CloSpan performs the mining in two stages. In the first stage it produces a closed sequential
pattern candidate set and stores it in a prefix sequence lattice. In the second stage it performs post
pruning to eliminate non closed sequential patterns. CloSpan works under candidate maintenance-
and-test paradigm, hence it is not scalable because a large number of closed sequential pattern
candidates will occupy more memory space and lead to huge search space for checking the
closure of new patterns, particularly for low support threshold values or long patterns. BIDE
mines closed sequential patterns without candidate maintenance. It prunes the search space more
deeply and performs closure checking of patterns in an efficient manner. BIDE follows a strict
depth-first search order to produce the closed sequential patterns.
In this paper, we propose an efficient algorithm, called CSpan, which makes use of the occurrence
checking method that allows early detection of closed sequential patterns during the mining
process. Our extensive performance study on various real and synthetic datasets shows that the
proposed algorithm CSpan outperforms the CloSpan and a recently proposed algorithm ClaSP[4]
by an order of magnitude.
The remaining of this paper is organized as follows. Section 2 discusses the related work. In
Section 3, we present the problem definition. In Section 4, we discuss the proposed method for
mining closed sequential patterns. Section 5 reports the performance evaluation. Finally, we
conclude the work in Section 6.
International Journal of Database Management Systems ( IJDMS ) Vol.7, No.1, February 2015
31
2. RELATED WORK
Agrawal and Srikant [1] proposed the sequential pattern mining problem. Later a number of
efficient algorithms have been developed for sequential pattern mining. These algorithms are
categorized into three groups such as Apriori-based methods, pattern-growth methods and vertical
format based methods.
Apriori-based methods implement a candidate-generation-and-test strategy. The representative
algorithm of these methods is GSP [5], which is an enhancement of AprioriAll algorithm
developed by the same authors. Algorithms implementing a candidate-generation-and-test
strategy produce a large number of candidate sequences. Counting and pruning such a large set is
more expensive. In addition, Apriori-based methods require more number of database scans
when there are long patterns.
Pattern-growth methods generally start with a frequent pattern, and grow the frequent pattern
when traversing the pattern search space using depth-first search. FreeSpan[6], PrefixSpan [7],
CloSpan, BIDE and etc belonging to this type. PrefixSpan finds the frequent-1 sequences in the
database and builds projected databases for each sequence. It then searches the projected database
to uncover locally frequent items and recursively locates the frequent sequences that contain the
item as a prefix. This process is repeated until all frequent sequences are discovered. It uses a
pseudo-projection technique for constructing projected databases to improve the performance.
PrefixSpan is faster than GSP, because in each iteration it scans only the sequences in a projected
database and a projected database is much smaller than the original database.
Vertical format based methods represent a sequence database using vertical format which allows
faster computation of support counting. SPADE [8] and SPAM [9] are based on vertical format.
SPADE makes use of a vertical id-list format and a join operation between two id-lists to discover
frequent sequential patterns. It consumes more memory, since id-lists are several times bigger
than the initial database. SPAM constructs a vertical bitmap of the database for efficient candidate
generation and support counting. SPAM traverses the lexicographic sequence tree in a depth-first
manner. SPAM is faster than SPADE due to the fast bitmap computation and consumes more
memory space than SPADE.
Closed itemset mining is used to find closed itemsets in a transaction database. Several algorithms
were developed for closed itemset mining. A-Close[10] is the first algorithm developed for
mining closed itemsets. It first finds level-wise frequent itemsets using Apriori strategy, and
mines all minimal generators. In the second step, it computes the closure of all minimal
generators. The performance of A-Close degrades due to the huge cost of the off-line closure
calculation.
Closet [11] and Closet+ [12] algorithms use compact FP-Tree data structure. In the first scan of
the dataset, frequent single items are identified and in the second scan the pruned transactions are
added to the FP-Tree. Closet finds closed itemsets by closure climbing and grows frequent closed
itemsets with a depth first browsing of the FP-Tree and recursive conditional FP-Tree projections.
It uses subset checking to detect the duplicates. Closet+ is an improvement of Closet and it has an
adaptive behavior in order to fit both sparse and dense datasets. Closet+ uses a new upward
checking technique to detect duplicates for sparse datasets.
Charm[13] constructs a prefix tree using frequent itemsets and searches the prefix tree using
bottom-up depth-first approach. Once a frequent itemset is produced, its tid-list is checked with
other itemsets having the same parent. If one tid-list contains another tid-list, the related nodes are
International Journal of Database Management Systems ( IJDMS ) Vol.7, No.1, February 2015
32
combined because both the itemsets belong to the same equivalence class. Charm uses diff-set
technique for sorting itemset tid-lists in each node of the tree.
It is not feasible to adopt the techniques used in closed itemset mining for closed sequential
pattern mining, because subsequence testing needs order matching and the search space of
sequences is bigger than that of itemsets. Also, sequential pattern mining is generally less
efficient than closed sequential pattern mining, particularly in mining long patterns and with low
support threshold.
X.Yan et al. developed the CloSpan algorithm for mining closed sequential patterns. CloSpan
produces less number of discovered sequences than the traditional methods while preserving the
same expressive power. CloSpan can mine long sequences and runs faster than PrefixSpan.
CloSpan divides the mining process into two phases. In first phase it generates a candidate set for
closed sequential patterns and stores it in a prefix sequence lattice. In second phase it conducts
post pruning to eliminate non closed sequences.
CloSpan prunes the nonclosed sequential patterns by using an efficient search space pruning
method, known as equivalence of projected databases. Unfortunately, a closed sequential pattern
mining algorithm under candidate maintenance-and-test paradigm has rather poor scalability
because a large number of closed sequential patterns candidates require more memory and lead to
huge search space for the closure checking of new patterns, which is usually the case when the
support threshold is low or the patterns are long.
Wang et al. developed the BIDE algorithm. It mines closed sequential patterns without candidate
maintenance. It prunes the search space more deeply and performs closure checking of patterns in
an efficient manner. BIDE follows a strict depth-first search order to produce the closed
sequential patterns. It uses a novel closure checking scheme called BI-Directional Extension. The
forward directional extension is used to grow the prefix patterns and also for closure checking of
prefix patterns, whereas the backward directional extension is used for both closure checking of a
prefix pattern and pruning the search space. It prunes the search space by using the BackScan
pruning method. BIDE first applies the BackScan pruning method to check if a prefix sequence
can be pruned, if not, computes the number of backward-extension items. Later it finds the
number of forward extension items, if there is no backward-extension item or forward-extension
item then it outputs closed sequential patterns. Although BIDE does not keep track of any
historical closed sequential patterns candidates for a new pattern's closure checking, it is a
computational consuming approach since it needs multiple database scans for the bi-direction
closure checking and the BackScan pruning.
The COBRA algorithm was developed by Huang et al. [14]. It adopts a bi-phase reduction
method for closed sequential pattern mining. It finds closed itemsets for first phase reduction and
then finds closed sequential patterns for second phase reduction. COBRA first conducts item
extension and then do sequence extension, which eliminates some drawbacks in pattern growth
methods. It performs the mining in three stages: (I) Mining Closed Frequent Itemset (II) Database
Encoding and (III) Mining Closed Sequential Pattern. COBRA uses closed itemsets to eliminate
duplicate item enumeration and to decrease the matching cost for mining locally frequent items.
It uses both vertical and horizontal database formats to reduce the search time in the mining
process. It uses two pruning methods LayerPruning and ExtPruning. LayerPruning eliminates
non closed branches and reduces the costs in pattern checking. ExtPruning checks closure of
patterns to eliminate non closed sequential patterns. COBRA requires large memory space than
BIDE.
ClaSP mines closed sequential patterns in temporal transaction data. It is inspired on the Spade
algorithm that uses vertical database format strategy for generating sequential patterns. ClaSP has
International Journal of Database Management Systems ( IJDMS ) Vol.7, No.1, February 2015
33
two main phases: The first phase produces Frequent Closed Candidates (FCC) that is kept in main
memory; and the second phase performs a post-pruning to remove all non-closed sequences from
FCC to finally get exactly frequent closed sequences. It first discovers all frequent 1-sequences in
the database and then the method DFS-Pruning is called recursively for all frequent 1-sequences
to search the corresponding subtree. FCC is acquired when this process is done for all of the
frequent 1-sequences and finally, the algorithm terminates by removing the non-closed sequences
in FCC. ClaSP uses the method CheckAvoidable for pruning the search space and outperforms
CloSpan.
3. PROBLEM DEFINITION
In this section, we first introduce some preliminary concepts, and then formalize the closed
sequential pattern mining problem.
Let I ={i1,i2,….,im} be a set of all items. A subset of I is called an itemset. A sequence S=(k1,k2,
…, kn ) (ki ⊆ I ) is an ordered list of itemsets. The items in each itemset are sorted in alphabetic
order. The length of the sequence is the total number of items in the sequence. A sequence S1
=(a1,a2,…..,am) is a subsequence of another sequence S2 =(b1,b2,….,bn), denoted as S1 ⊑ S2, if
there exists integers 1 ≤ i1 < i2 < . . . < im ≤ n and a1 ⊆ bi1 , a2 ⊆ bi2 , . . . , and am ⊆ bim. We call S2
as a super-sequence of S1 and S2 contains S1.
A sequence database, SD={S1,S2,…,Sn}, is a set of sequences and each sequence has an id. The
size, |SD|, of the sequence database SD is the total number of sequences in the SD. The support of
a sequence X in a sequence database SD is the no of sequences in SD which contain X.
Definition 1 (Sequential pattern): Given a minimum support threshold m_sup, a sequence α is a
sequential pattern on SD if support (α) is greater than m_sup.
Definition 2 (Closed sequential pattern): A sequential pattern α is a closed sequential pattern if
there does not exist a sequential pattern β , such that support (α) = support (β) and α β.
The problem of closed sequential pattern mining is to find the complete set of closed sequential
patterns above a minimum support threshold m_sup for an input sequence database SD. Table 1
shows a sample sequence database. The items in each itemset are sorted in alphabetic order. If
m_sup=2 , the closed sequential pattern set contains 3 sequences {(bc) : 3, (c)(d):2, (a)(bc):2}
and the corresponding sequential pattern set contains 9 sequences. It indicates that closed
sequential pattern set contains less no of sequences than sequential pattern set.
Table 1. A sample sequence database
Sid Sequence
1 (ab)(bc)
2 (bc)(d)(e)
3 (ac)(bcd)
Given a sequence X=(p1,….,pn) and an item k, X can be concatenated with k using either itemset
extention X∆ik=(p1,….,pn (k)) or sequence extension X∆sk=(p1,….,pn,(k)). For example (ab) is
an itemset extension of (a) and (a)(b) is a sequence extension of (a).
International Journal of Database Management Systems ( IJDMS ) Vol.7, No.1, February 2015
34
4. PROPOSED METHOD
In this section, we describe our proposed algorithm CSpan. It uses depth-first search for
generating the closed sequential patterns. Since many previously developed algorithms proved
that depth-first search based algorithm is more efficient than the breadth-first search based
algorithm in mining long patterns.
Figure 1. An example sequential pattern tree
The CSpan algorithm uses a sequential pattern tree to generate the closed sequential patterns. The
sequential pattern tree is constructed as follows. The root of the tree is labeled with φ. Next, all
frequent 1-sequences in the database are identified and added to level 1 of the sequential pattern
tree. Then, a sequence k at level 1 is extended by appending a frequent item in k’s projected
database to get its frequent super-pattern at level 2. A sequence can be extended in two ways:
sequence extension and itemset extension. In case of sequence extension, the item is added as a
new itemset to the sequence. In case of itemset extension, the item is appended to the last itemset
in the sequence. This process is repeated for the sequences at the level 2 and above until there are
no super-sequences to generate.
Fig. 1 represents the sequential pattern tree for the sample sequence database shown in Table. 1.
First the root of the sequential pattern tree is set to φ and then all frequent 1- sequences are added
as the children of the root at level 1. The number after the sequence indicates the support of the
sequence and the minimum support is chosen as 2. The children of a node are arranged
lexicographically. Next, the frequent 2-sequences at level 2 are produced by concatenating the
frequent 1-sequence with one of the frequent item in its projected database, and similarly the
frequent 3-sequences at level 3 are produced by concatenating the frequent 2-sequence with one
of the frequent item in its projected database.
A sequence in the sequential pattern tree is considered as a sub-sequence of its child. To locally
mine the frequent super-sequences of a certain sequence, we construct the projected database for
the sequence and grow the sequence to obtain its frequent super-sequences. We use the pseudo
based projection approach of PrefixSpan to create the projected database.
φ
(a):2
(a)(bc):2
(b):3 (c):3 (d):2
(a)(b):2 (a)(c):2 (bc):3 (c)(d):2
Level 0
Level 1
Level 2
Level 3
International Journal of Database Management Systems ( IJDMS ) Vol.7, No.1, February 2015
35
4.1 Occurrence checking
We propose a new pruning method called occurrence checking that allows the early detection of
closed sequential patterns.
Lemma 1. Occurrence checking: A sequential pattern X is not closed if a frequent item y exists
such that (1) y appears in every sequence of X’s projected database and (2) the distance between
X and y is identical in every sequence of X’s projected database.
Proof. If a frequent item y appears in every sequence of X’s projected database and the distance
between X and y is identical in every sequence of X’s projected database, then we can always
discover another frequent sequence containing X and y whose support is equivalent to X’s
support. Therefore, X cannot be closed.
We demonstrate the occurrence checking scheme as follows. Assume a is a frequent sequence. If
we locate an item b after a in every sequence of a’s projected database, then we can declare that a
is not closed since ab is its super-sequence with the same support.
4.2 CSpan Algorithm
The CSpan algorithm has two phases. First, it scans the database to determine all frequent 1-
sequences. Second, for each frequent k-sequence, it constructs the projected database and locates
all frequent items in the projected database to produce its frequent super-sequences at the next
level in the sequential pattern tree, where k≥1. During the mining process, it employs occurrence
checking for early detection of closed sequential patterns.
Algorithm 1: CSpan
Input: A sequence database SD and minimum support m_sup.
Output: The complete set of closed sequential patterns.
1. S1 = frequent 1-sequences in SD
2. CSP = φ
3. for each s in S1 do
4. PD = Projected database of s
5. CS = Generate_patterns (PD, s, m_sup)
6. CSP = CSP CS
7. end for
Algorithm 2: Generate_patterns(PD, s, m_sup)
Input: A projected database PD, sequence s and minimum support m_sup
Output: Closed sequence patterns with prefix s.
1. CS = φ
2. P1 = frequent items in PD
3. if support(s) ≥ m_sup and P1≠ φ then
4. if (s does not pass the occurrence checking) then
5. if (s is closed w.r.t CS) then
6. CS= CS s
7. end if
8. end if
9. for each p in P1 do
10. PD=Projected database of s ∆s p
11. CS=CS Generate_patterns(PD, s ∆s p, m_sup)
12. PD=Projected database of s ∆i p
International Journal of Database Management Systems ( IJDMS ) Vol.7, No.1, February 2015
36
13. CS=CS Generate_patterns(PD, s ∆i p, m_sup)
14. end for
15. end if
16. return CS
Example: Consider the sample sequence database shown in Table 1 and assume that m_sup = 2.
First, we determine all frequent 1-sequences (a), (b), (c) and (d), and perform the occurrence
checking on these sequences. All frequent 1- sequences pass the occurrence checking step and
therefore they are not closed. Next, we determine the frequent super-sequences of frequent 1-
sequence (a) by using the frequent items in its projected database. There are two frequent items in
(a)’s projected database, namely, (b) and (c).
We first grow the frequent 1-sequence (a) by appending (b) to it, to get the frequent 2- sequence
(a)(b). Then we check whether (a)(b) passes the occurrence checking or not. But, (a)(b) passes
the occurrence checking step. Therefore, it is not a closed sequential pattern. Next we grow the
sequence (a)(b) by appending a frequent item in its projected database to get the frequent 3-
sequence (a)(bc), which does not pass the occurrence checking. Therefore, it is a closed
sequential pattern. Since there are no frequent items in (a)(bc)’s projected database, we can’t
generate its super-sequences and we stop growing it further. By growing the sequential pattern
tree in the manner as explained above, we can generate all closed sequential patterns. Finally, we
obtain three closed sequential patterns (bc), (c)(d) and (a)(bc).
5. PERFORMANCE EVALUATION
In this section, we perform a thorough performance evaluation of our proposed CSpan algorithm
on both real and synthetic data sets with various kinds of sizes and data distributions, and we
compare CSpan with CloSpan and a recently proposed algorithm ClaSP.
All experiments were conducted on a 2GHz Intel Core2 Duo processor PC with 1GB main
memory running Microsoft Windows XP. The algorithms were implemented in Java and were
executed using different support values.
In our experiments we used a real world dataset BMS WebView of KDD CUP 2000[15] and two
synthetic datasets. BMS WebView is a click stream data from an e-commerce web store named
Gazelle and it has been used widely to assess the performance of frequent pattern mining. This
dataset contains sequences of 59601 customers with a total of 146000 purchases in 497 distinct
product categories. The average length of sequences is 2.42 items with a standard deviation of
3.22. In this dataset, there are some long sequences. For example, 318 sequences contain more
than 20 items. The characteristics of the BMS WebView dataset are given in Table 2. We
generated the synthetic datasets using SPMF[16] framework. The characteristics of the two
synthetic datasets are given in Table 3.
Table 2. Characteristics of the BMS WebView dataset
S. No. Characteristic Value
1 No of sequences 59601
2 No of distinct items 497
3 Average length of
sequences
2.42
International Journal of Database Management Systems ( IJDMS ) Vol.7, No.1, February 2015
37
Table 3. Characteristics of the Synthetic datasets
Three sets of experiments were conducted to evaluate the performance of the CSpan algorithm.
The first set compares the runtime performance of CSpan with CloSpan and ClaSP using real
world dataset BMS WebView for different support values. The second and third sets compare the
runtime performance of CSpan with CloSpan and ClaSP using synthetic datasets for different
support values.
Fig. 2 shows the results of runtime performance using the real world dataset BMS WebView. The
X-axis is the minimum support, while the Y-axis is the algorithms runtime. The support values
are set from 0.01 to 0.06. Our proposed algorithm CSpan runs faster than CloSpan and ClaSP,
because of the use occurrence checking technique.
Figure 2. Performance comparison using BMS WebView dataset.
Fig. 3 and Fig. 4 show the results of runtime performance using the synthetic datasets. The X-axis
is the minimum support, while the Y-axis is the algorithms runtime. The support values are set
from 0.1 to 0.6. Our proposed algorithm CSpan outperforms CloSpan and ClaSP on both the
synthetic datasets.
S. No. Characteristic Dataset1
Value
Dataset2
Value
1 No of sequences 20000 15000
2 No of distinct items 15 20
3 No of items per itemset 3 3
4 No of itemsets per sequence 5 6
International Journal of Database Management Systems ( IJDMS ) Vol.7, No.1, February 2015
38
Figure 3. Performance comparison using synthetic dataset1
Figure 4. Performance comparison using synthetic dataset2
All the above experiments confirm that the proposed algorithm CSpan is efficient and takes less
run time comparing to other algorithms. Because CSpan uses occurrence checking for early
detection of closed sequential patterns and stores a result set that contains only closed sequential
patterns, where as CloSpan and ClaSP keep a larger number of closed sequential pattern
candidates and remove the non-closed ones at the end of the mining process.
6. CONCLUSION
Several researchers focused on the sequential pattern mining problem and many algorithms were
developed to mine sequential patterns. Closed sequential pattern mining is a variant of sequential
pattern mining and attains broad attention in the recent years because it has the same expression
ability of the sequential pattern mining and more compact than the sequential pattern mining.
In this paper, we propose an efficient algorithm CSpan which makes use of a new pruning method
called occurrence checking that allows the early detection of closed sequential patterns during the
mining process. Our extensive performance study on various real and synthetic datasets shows
that the proposed algorithm CSpan outperforms the CloSpan and a recently proposed algorithm
ClaSP by an order of magnitude.
In future we will extend CSpan to incorporate user specified constraints. Other interesting
research problems that can be pursued include parallel mining of closed sequential patterns [17]
and mining of structured patterns.
International Journal of Database Management Systems ( IJDMS ) Vol.7, No.1, February 2015
39
REFERENCES
[1] R. Agrawal and R. Srikant, “Mining sequential patterns,” Proc. Int’l Conf. Data Engineering (ICDE
’95), pp. 3-14, Mar. 1995.
[2] X. Yan, J. Han, and R. Afshar, “CloSpan: Mining closed sequential patterns in large databases,” Proc.
SIAM Int’l Conf. Data Mining (SDM ’03), pp. 166-177, May 2003.
[3] J. Wang, J. Han, and Chun Li, “Frequent closed sequence mining without candidate maintenance,”
IEEE Trans. Knowledge and Data Eng., vol. 19, no. 8, pp. 1042-1056, Aug. 2007.
[4] Antonio Gomariz, Manuel Campos, Roque Marin, and Bart Goethals, “ClaSP: An efficient algorithm
for mining frequent closed sequences,” PAKDD 2013, LNAI 7818, Part I, pp. 50–61, 2013.
[5] R. Srikant and R. Agrawal, “Mining sequential patterns: Generalizations and performance
improvements,” Proc. Int’l Conf. Extending Database Technology (EDBT ’96), pp. 3-17, Mar. 1996.
[6] J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, and M.C. Hsu, “FreeSpan: Frequent pattern-
projected sequential pattern mining,” Proc. ACM SIGKDD Int’l Conf. Knowledge Discovery and
Data Mining (SIGKDD’00), pp. 355-359, Aug. 2000.
[7] J . Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M. Hsu, “PrefixSpan : Mining
sequential patterns efficiently by prefix-projected pattern growth,” Proc. Int’l Conf. Data
Engineering (ICDE ’01), pp. 215-224, Apr. 2001.
[8] M. Zaki, “SPADE: An efficient algorithm for mining frequent sequences,” Machine Learning, vol.
42, pp. 31-60, 2001.
[9] J. Ayres, J. Gehrke, T. Yiu, and J. Flannick, “Sequential pattern mining using a bitmap
representation,” Proc. ACM SIGKDD Int’l Conf. Knowledge Discovery and Data Mining (SIGKDD’
02), pp. 429-435, July 2002.
[10] Nicolas Pasquier, Yves Bastide, Rafik Taouil, and Lot Lakhal, “Discovering frequent closed itemsets
for association rules,” Proceedings of the 7th International Conference on Database Theory (ICDT
'99), pp. 398-416,1999.
[11] J. Pei, J. Han, and R. Mao, “CLOSET: An efficient algorithm for mining frequent closed itemsets,”
Proc. ACM SIGMOD Workshop Research Issues in Data Mining and Knowledge Discovery (DMKD
’00), pp. 21-30, May 2000.
[12] J. Wang, J. Han, and J. Pei, “CLOSET+: Searching for the best strategies for mining frequent closed
itemsets,” Proc. ACM SIGKDD Int’l Conf. Knowledge Discovery and Data Mining (SIGKDD ’03),
pp. 236-245, Aug. 2003.
[13] M. Zaki and C. Hsiao, “CHARM: An efficient algorithm for closed itemset mining,” Proc. SIAM Int’l
Conf. Data Mining (SDM ’02), pp. 457-473, Apr. 2002.
[14] Kuo-Yu Huang, Chia-Hui Chang, Jiun-Hung Tung, and Cheng-Tao Ho, “COBRA: Closed sequential
pattern mining using bi-phase reduction approach,” Proceedings of 8th International Conference,
DaWaK, Springer LNCS 4081, pp. 280-291, 2006.
[15] Ron Kohavi, Carla E. Brodley, Brian Frasca, Llew Mason, and Zijian Zheng, “KDD-Cup 2000
organizers' report: Peeling the onion,” SIGKDD Explorations, vol. 2, no. 2, pp. 86-93, Dec. 2000.
[16] Fournier-Viger P., An Open-Source Data Mining Library, http://www.philippe-fournier-
viger.com/spmf/index.php?link=datasets.php, 2008, Accessed 20 July 2014.
[17] S. Cong, J. Han, and D.A. Padua, “Parallel mining of closed sequential patterns,” Proc. ACM
SIGKDD Int’l Conf. Knowledge Discovery and Data Mining (SIGKDD ’05), pp. 562-567, Aug.
2005.

Contenu connexe

Tendances

Usage and Research Challenges in the Area of Frequent Pattern in Data Mining
Usage and Research Challenges in the Area of Frequent Pattern in Data MiningUsage and Research Challenges in the Area of Frequent Pattern in Data Mining
Usage and Research Challenges in the Area of Frequent Pattern in Data MiningIOSR Journals
 
A comprehensive study of major techniques of multi level frequent pattern min...
A comprehensive study of major techniques of multi level frequent pattern min...A comprehensive study of major techniques of multi level frequent pattern min...
A comprehensive study of major techniques of multi level frequent pattern min...eSAT Publishing House
 
An improvised tree algorithm for association rule mining using transaction re...
An improvised tree algorithm for association rule mining using transaction re...An improvised tree algorithm for association rule mining using transaction re...
An improvised tree algorithm for association rule mining using transaction re...Editor IJCATR
 
Data Mining: Concepts and Techniques_ Chapter 6: Mining Frequent Patterns, ...
Data Mining:  Concepts and Techniques_ Chapter 6: Mining Frequent Patterns, ...Data Mining:  Concepts and Techniques_ Chapter 6: Mining Frequent Patterns, ...
Data Mining: Concepts and Techniques_ Chapter 6: Mining Frequent Patterns, ...Salah Amean
 
Sequential Pattern Mining Methods: A Snap Shot
Sequential Pattern Mining Methods: A Snap ShotSequential Pattern Mining Methods: A Snap Shot
Sequential Pattern Mining Methods: A Snap ShotIOSR Journals
 
REVIEW: Frequent Pattern Mining Techniques
REVIEW: Frequent Pattern Mining TechniquesREVIEW: Frequent Pattern Mining Techniques
REVIEW: Frequent Pattern Mining TechniquesEditor IJMTER
 
Mining frequent patterns association
Mining frequent patterns associationMining frequent patterns association
Mining frequent patterns associationDeepaR42
 
Data Mining: Mining ,associations, and correlations
Data Mining: Mining ,associations, and correlationsData Mining: Mining ,associations, and correlations
Data Mining: Mining ,associations, and correlationsDataminingTools Inc
 
A NOVEL APPROACH TO MINE FREQUENT PATTERNS FROM LARGE VOLUME OF DATASET USING...
A NOVEL APPROACH TO MINE FREQUENT PATTERNS FROM LARGE VOLUME OF DATASET USING...A NOVEL APPROACH TO MINE FREQUENT PATTERNS FROM LARGE VOLUME OF DATASET USING...
A NOVEL APPROACH TO MINE FREQUENT PATTERNS FROM LARGE VOLUME OF DATASET USING...IAEME Publication
 
An Efficient Algorithm for Mining Frequent Itemsets within Large Windows over...
An Efficient Algorithm for Mining Frequent Itemsets within Large Windows over...An Efficient Algorithm for Mining Frequent Itemsets within Large Windows over...
An Efficient Algorithm for Mining Frequent Itemsets within Large Windows over...Waqas Tariq
 
Discovering Frequent Patterns with New Mining Procedure
Discovering Frequent Patterns with New Mining ProcedureDiscovering Frequent Patterns with New Mining Procedure
Discovering Frequent Patterns with New Mining ProcedureIOSR Journals
 
Chapter - 8.3 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 8.3 Data Mining Concepts and Techniques 2nd Ed slides Han & KamberChapter - 8.3 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 8.3 Data Mining Concepts and Techniques 2nd Ed slides Han & Kambererror007
 
A Survey on Frequent Patterns To Optimize Association Rules
A Survey on Frequent Patterns To Optimize Association RulesA Survey on Frequent Patterns To Optimize Association Rules
A Survey on Frequent Patterns To Optimize Association RulesIRJET Journal
 

Tendances (19)

Ijariie1129
Ijariie1129Ijariie1129
Ijariie1129
 
Usage and Research Challenges in the Area of Frequent Pattern in Data Mining
Usage and Research Challenges in the Area of Frequent Pattern in Data MiningUsage and Research Challenges in the Area of Frequent Pattern in Data Mining
Usage and Research Challenges in the Area of Frequent Pattern in Data Mining
 
Ijcatr04051004
Ijcatr04051004Ijcatr04051004
Ijcatr04051004
 
A comprehensive study of major techniques of multi level frequent pattern min...
A comprehensive study of major techniques of multi level frequent pattern min...A comprehensive study of major techniques of multi level frequent pattern min...
A comprehensive study of major techniques of multi level frequent pattern min...
 
20120140502006
2012014050200620120140502006
20120140502006
 
An improvised tree algorithm for association rule mining using transaction re...
An improvised tree algorithm for association rule mining using transaction re...An improvised tree algorithm for association rule mining using transaction re...
An improvised tree algorithm for association rule mining using transaction re...
 
Ej36829834
Ej36829834Ej36829834
Ej36829834
 
Data Mining: Concepts and Techniques_ Chapter 6: Mining Frequent Patterns, ...
Data Mining:  Concepts and Techniques_ Chapter 6: Mining Frequent Patterns, ...Data Mining:  Concepts and Techniques_ Chapter 6: Mining Frequent Patterns, ...
Data Mining: Concepts and Techniques_ Chapter 6: Mining Frequent Patterns, ...
 
Sequential Pattern Mining Methods: A Snap Shot
Sequential Pattern Mining Methods: A Snap ShotSequential Pattern Mining Methods: A Snap Shot
Sequential Pattern Mining Methods: A Snap Shot
 
REVIEW: Frequent Pattern Mining Techniques
REVIEW: Frequent Pattern Mining TechniquesREVIEW: Frequent Pattern Mining Techniques
REVIEW: Frequent Pattern Mining Techniques
 
Mining frequent patterns association
Mining frequent patterns associationMining frequent patterns association
Mining frequent patterns association
 
Data Mining: Mining ,associations, and correlations
Data Mining: Mining ,associations, and correlationsData Mining: Mining ,associations, and correlations
Data Mining: Mining ,associations, and correlations
 
A NOVEL APPROACH TO MINE FREQUENT PATTERNS FROM LARGE VOLUME OF DATASET USING...
A NOVEL APPROACH TO MINE FREQUENT PATTERNS FROM LARGE VOLUME OF DATASET USING...A NOVEL APPROACH TO MINE FREQUENT PATTERNS FROM LARGE VOLUME OF DATASET USING...
A NOVEL APPROACH TO MINE FREQUENT PATTERNS FROM LARGE VOLUME OF DATASET USING...
 
An Efficient Algorithm for Mining Frequent Itemsets within Large Windows over...
An Efficient Algorithm for Mining Frequent Itemsets within Large Windows over...An Efficient Algorithm for Mining Frequent Itemsets within Large Windows over...
An Efficient Algorithm for Mining Frequent Itemsets within Large Windows over...
 
Discovering Frequent Patterns with New Mining Procedure
Discovering Frequent Patterns with New Mining ProcedureDiscovering Frequent Patterns with New Mining Procedure
Discovering Frequent Patterns with New Mining Procedure
 
Chapter - 8.3 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 8.3 Data Mining Concepts and Techniques 2nd Ed slides Han & KamberChapter - 8.3 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 8.3 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
 
Ijtra130516
Ijtra130516Ijtra130516
Ijtra130516
 
A Survey on Frequent Patterns To Optimize Association Rules
A Survey on Frequent Patterns To Optimize Association RulesA Survey on Frequent Patterns To Optimize Association Rules
A Survey on Frequent Patterns To Optimize Association Rules
 
98 34
98 3498 34
98 34
 

En vedette

A comparative study on different trust based routing schemes in manet
A comparative study on different trust based routing schemes in manetA comparative study on different trust based routing schemes in manet
A comparative study on different trust based routing schemes in manetijwmn
 
Impact of different mobility scenarios on fqm framework for supporting multim...
Impact of different mobility scenarios on fqm framework for supporting multim...Impact of different mobility scenarios on fqm framework for supporting multim...
Impact of different mobility scenarios on fqm framework for supporting multim...ijwmn
 
ORGANIC USER INTERFACES: FRAMEWORK, INTERACTION MODEL AND DESIGN GUIDELINES
ORGANIC USER INTERFACES: FRAMEWORK, INTERACTION MODEL AND DESIGN GUIDELINESORGANIC USER INTERFACES: FRAMEWORK, INTERACTION MODEL AND DESIGN GUIDELINES
ORGANIC USER INTERFACES: FRAMEWORK, INTERACTION MODEL AND DESIGN GUIDELINESijasuc
 
Rsaodv a route stability based ad hoc on demand distance vector routing prot...
Rsaodv  a route stability based ad hoc on demand distance vector routing prot...Rsaodv  a route stability based ad hoc on demand distance vector routing prot...
Rsaodv a route stability based ad hoc on demand distance vector routing prot...ijwmn
 
Performance analysis of fls, exp, log and
Performance analysis of fls, exp, log andPerformance analysis of fls, exp, log and
Performance analysis of fls, exp, log andijwmn
 
An educational bluetooth quizzing application in android
An educational bluetooth quizzing application in androidAn educational bluetooth quizzing application in android
An educational bluetooth quizzing application in androidijwmn
 
P ERFORMANCE A NALYSIS O F A DAPTIVE N OISE C ANCELLER E MPLOYING N LMS A LG...
P ERFORMANCE A NALYSIS  O F A DAPTIVE N OISE C ANCELLER E MPLOYING N LMS A LG...P ERFORMANCE A NALYSIS  O F A DAPTIVE N OISE C ANCELLER E MPLOYING N LMS A LG...
P ERFORMANCE A NALYSIS O F A DAPTIVE N OISE C ANCELLER E MPLOYING N LMS A LG...ijwmn
 
WSN, FND, LEACH, modified algorithm, data transmission
WSN, FND, LEACH, modified algorithm, data transmission WSN, FND, LEACH, modified algorithm, data transmission
WSN, FND, LEACH, modified algorithm, data transmission ijwmn
 
Performance Analysis of Bfsk Multi-Hop Communication Systems Over K-μ Fading ...
Performance Analysis of Bfsk Multi-Hop Communication Systems Over K-μ Fading ...Performance Analysis of Bfsk Multi-Hop Communication Systems Over K-μ Fading ...
Performance Analysis of Bfsk Multi-Hop Communication Systems Over K-μ Fading ...ijwmn
 
An efficient model for reducing soft blocking probability in wireless cellula...
An efficient model for reducing soft blocking probability in wireless cellula...An efficient model for reducing soft blocking probability in wireless cellula...
An efficient model for reducing soft blocking probability in wireless cellula...ijwmn
 
K dag based lifetime aware data collection in wireless sensor networks
K dag based lifetime aware data collection in wireless sensor networksK dag based lifetime aware data collection in wireless sensor networks
K dag based lifetime aware data collection in wireless sensor networksijwmn
 
Multiple optimal path identification using ant colony optimisation in wireles...
Multiple optimal path identification using ant colony optimisation in wireles...Multiple optimal path identification using ant colony optimisation in wireles...
Multiple optimal path identification using ant colony optimisation in wireles...ijwmn
 
Enforcing end to-end proportional fairness with bounded buffer overflow proba...
Enforcing end to-end proportional fairness with bounded buffer overflow proba...Enforcing end to-end proportional fairness with bounded buffer overflow proba...
Enforcing end to-end proportional fairness with bounded buffer overflow proba...ijwmn
 
Analysing an analytical solution model for simultaneous mobility
Analysing an analytical solution model for simultaneous mobilityAnalysing an analytical solution model for simultaneous mobility
Analysing an analytical solution model for simultaneous mobilityijwmn
 
International Journal of Wireless & Mobile Networks (IJWMN)
International Journal of Wireless & Mobile Networks (IJWMN)International Journal of Wireless & Mobile Networks (IJWMN)
International Journal of Wireless & Mobile Networks (IJWMN)ijwmn
 
Impact of random mobility models on olsr
Impact of random mobility models on olsrImpact of random mobility models on olsr
Impact of random mobility models on olsrijwmn
 
The Seven Deadly Sins of Social Media
The Seven Deadly Sins of Social MediaThe Seven Deadly Sins of Social Media
The Seven Deadly Sins of Social MediaReally B2B
 
Tollan xicocotitlan a reconstructed city by augmented reality ( extended )
Tollan xicocotitlan  a reconstructed city by augmented reality ( extended )Tollan xicocotitlan  a reconstructed city by augmented reality ( extended )
Tollan xicocotitlan a reconstructed city by augmented reality ( extended )ijdms
 
Performance analysis of voip traffic over integrating wireless lan and wan us...
Performance analysis of voip traffic over integrating wireless lan and wan us...Performance analysis of voip traffic over integrating wireless lan and wan us...
Performance analysis of voip traffic over integrating wireless lan and wan us...ijwmn
 
BER Performance of Antenna Array-Based Receiver using Multi-user Detection in...
BER Performance of Antenna Array-Based Receiver using Multi-user Detection in...BER Performance of Antenna Array-Based Receiver using Multi-user Detection in...
BER Performance of Antenna Array-Based Receiver using Multi-user Detection in...ijwmn
 

En vedette (20)

A comparative study on different trust based routing schemes in manet
A comparative study on different trust based routing schemes in manetA comparative study on different trust based routing schemes in manet
A comparative study on different trust based routing schemes in manet
 
Impact of different mobility scenarios on fqm framework for supporting multim...
Impact of different mobility scenarios on fqm framework for supporting multim...Impact of different mobility scenarios on fqm framework for supporting multim...
Impact of different mobility scenarios on fqm framework for supporting multim...
 
ORGANIC USER INTERFACES: FRAMEWORK, INTERACTION MODEL AND DESIGN GUIDELINES
ORGANIC USER INTERFACES: FRAMEWORK, INTERACTION MODEL AND DESIGN GUIDELINESORGANIC USER INTERFACES: FRAMEWORK, INTERACTION MODEL AND DESIGN GUIDELINES
ORGANIC USER INTERFACES: FRAMEWORK, INTERACTION MODEL AND DESIGN GUIDELINES
 
Rsaodv a route stability based ad hoc on demand distance vector routing prot...
Rsaodv  a route stability based ad hoc on demand distance vector routing prot...Rsaodv  a route stability based ad hoc on demand distance vector routing prot...
Rsaodv a route stability based ad hoc on demand distance vector routing prot...
 
Performance analysis of fls, exp, log and
Performance analysis of fls, exp, log andPerformance analysis of fls, exp, log and
Performance analysis of fls, exp, log and
 
An educational bluetooth quizzing application in android
An educational bluetooth quizzing application in androidAn educational bluetooth quizzing application in android
An educational bluetooth quizzing application in android
 
P ERFORMANCE A NALYSIS O F A DAPTIVE N OISE C ANCELLER E MPLOYING N LMS A LG...
P ERFORMANCE A NALYSIS  O F A DAPTIVE N OISE C ANCELLER E MPLOYING N LMS A LG...P ERFORMANCE A NALYSIS  O F A DAPTIVE N OISE C ANCELLER E MPLOYING N LMS A LG...
P ERFORMANCE A NALYSIS O F A DAPTIVE N OISE C ANCELLER E MPLOYING N LMS A LG...
 
WSN, FND, LEACH, modified algorithm, data transmission
WSN, FND, LEACH, modified algorithm, data transmission WSN, FND, LEACH, modified algorithm, data transmission
WSN, FND, LEACH, modified algorithm, data transmission
 
Performance Analysis of Bfsk Multi-Hop Communication Systems Over K-μ Fading ...
Performance Analysis of Bfsk Multi-Hop Communication Systems Over K-μ Fading ...Performance Analysis of Bfsk Multi-Hop Communication Systems Over K-μ Fading ...
Performance Analysis of Bfsk Multi-Hop Communication Systems Over K-μ Fading ...
 
An efficient model for reducing soft blocking probability in wireless cellula...
An efficient model for reducing soft blocking probability in wireless cellula...An efficient model for reducing soft blocking probability in wireless cellula...
An efficient model for reducing soft blocking probability in wireless cellula...
 
K dag based lifetime aware data collection in wireless sensor networks
K dag based lifetime aware data collection in wireless sensor networksK dag based lifetime aware data collection in wireless sensor networks
K dag based lifetime aware data collection in wireless sensor networks
 
Multiple optimal path identification using ant colony optimisation in wireles...
Multiple optimal path identification using ant colony optimisation in wireles...Multiple optimal path identification using ant colony optimisation in wireles...
Multiple optimal path identification using ant colony optimisation in wireles...
 
Enforcing end to-end proportional fairness with bounded buffer overflow proba...
Enforcing end to-end proportional fairness with bounded buffer overflow proba...Enforcing end to-end proportional fairness with bounded buffer overflow proba...
Enforcing end to-end proportional fairness with bounded buffer overflow proba...
 
Analysing an analytical solution model for simultaneous mobility
Analysing an analytical solution model for simultaneous mobilityAnalysing an analytical solution model for simultaneous mobility
Analysing an analytical solution model for simultaneous mobility
 
International Journal of Wireless & Mobile Networks (IJWMN)
International Journal of Wireless & Mobile Networks (IJWMN)International Journal of Wireless & Mobile Networks (IJWMN)
International Journal of Wireless & Mobile Networks (IJWMN)
 
Impact of random mobility models on olsr
Impact of random mobility models on olsrImpact of random mobility models on olsr
Impact of random mobility models on olsr
 
The Seven Deadly Sins of Social Media
The Seven Deadly Sins of Social MediaThe Seven Deadly Sins of Social Media
The Seven Deadly Sins of Social Media
 
Tollan xicocotitlan a reconstructed city by augmented reality ( extended )
Tollan xicocotitlan  a reconstructed city by augmented reality ( extended )Tollan xicocotitlan  a reconstructed city by augmented reality ( extended )
Tollan xicocotitlan a reconstructed city by augmented reality ( extended )
 
Performance analysis of voip traffic over integrating wireless lan and wan us...
Performance analysis of voip traffic over integrating wireless lan and wan us...Performance analysis of voip traffic over integrating wireless lan and wan us...
Performance analysis of voip traffic over integrating wireless lan and wan us...
 
BER Performance of Antenna Array-Based Receiver using Multi-user Detection in...
BER Performance of Antenna Array-Based Receiver using Multi-user Detection in...BER Performance of Antenna Array-Based Receiver using Multi-user Detection in...
BER Performance of Antenna Array-Based Receiver using Multi-user Detection in...
 

Similaire à Mining closed sequential patterns in large sequence databases

A novel algorithm for mining closed sequential patterns
A novel algorithm for mining closed sequential patternsA novel algorithm for mining closed sequential patterns
A novel algorithm for mining closed sequential patternsIJDKP
 
An efficient algorithm for sequence generation in data mining
An efficient algorithm for sequence generation in data miningAn efficient algorithm for sequence generation in data mining
An efficient algorithm for sequence generation in data miningijcisjournal
 
A Survey of Sequential Rule Mining Techniques
A Survey of Sequential Rule Mining TechniquesA Survey of Sequential Rule Mining Techniques
A Survey of Sequential Rule Mining Techniquesijsrd.com
 
Hadoop Map-Reduce To Generate Frequent Item Set on Large Datasets Using Impro...
Hadoop Map-Reduce To Generate Frequent Item Set on Large Datasets Using Impro...Hadoop Map-Reduce To Generate Frequent Item Set on Large Datasets Using Impro...
Hadoop Map-Reduce To Generate Frequent Item Set on Large Datasets Using Impro...BRNSSPublicationHubI
 
A Brief Overview On Frequent Pattern Mining Algorithms
A Brief Overview On Frequent Pattern Mining AlgorithmsA Brief Overview On Frequent Pattern Mining Algorithms
A Brief Overview On Frequent Pattern Mining AlgorithmsSara Alvarez
 
Data Mining: Mining stream time series and sequence data
Data Mining: Mining stream time series and sequence dataData Mining: Mining stream time series and sequence data
Data Mining: Mining stream time series and sequence dataDatamining Tools
 
Mining Frequent Item set Using Genetic Algorithm
Mining Frequent Item set Using Genetic AlgorithmMining Frequent Item set Using Genetic Algorithm
Mining Frequent Item set Using Genetic Algorithmijsrd.com
 
PATTERN GENERATION FOR COMPLEX DATA USING HYBRID MINING
PATTERN GENERATION FOR COMPLEX DATA USING HYBRID MININGPATTERN GENERATION FOR COMPLEX DATA USING HYBRID MINING
PATTERN GENERATION FOR COMPLEX DATA USING HYBRID MININGIJDKP
 
Mining sequential patterns for interval based
Mining sequential patterns for interval basedMining sequential patterns for interval based
Mining sequential patterns for interval basedijcsa
 
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...acijjournal
 
Ijricit 01-002 enhanced replica detection in short time for large data sets
Ijricit 01-002 enhanced replica detection in  short time for large data setsIjricit 01-002 enhanced replica detection in  short time for large data sets
Ijricit 01-002 enhanced replica detection in short time for large data setsIjripublishers Ijri
 
Subgraph relative frequency approach for extracting interesting substructur
Subgraph relative frequency approach for extracting interesting substructurSubgraph relative frequency approach for extracting interesting substructur
Subgraph relative frequency approach for extracting interesting substructurIAEME Publication
 
Ijsrdv1 i2039
Ijsrdv1 i2039Ijsrdv1 i2039
Ijsrdv1 i2039ijsrd.com
 
A comprehensive study of major techniques of multi level frequent pattern min...
A comprehensive study of major techniques of multi level frequent pattern min...A comprehensive study of major techniques of multi level frequent pattern min...
A comprehensive study of major techniques of multi level frequent pattern min...eSAT Journals
 
TEXT CLUSTERING USING INCREMENTAL FREQUENT PATTERN MINING APPROACH
TEXT CLUSTERING USING INCREMENTAL FREQUENT PATTERN MINING APPROACHTEXT CLUSTERING USING INCREMENTAL FREQUENT PATTERN MINING APPROACH
TEXT CLUSTERING USING INCREMENTAL FREQUENT PATTERN MINING APPROACHIJDKP
 
Fast Sequential Rule Mining
Fast Sequential Rule MiningFast Sequential Rule Mining
Fast Sequential Rule Miningijsrd.com
 
Novel Ensemble Tree for Fast Prediction on Data Streams
Novel Ensemble Tree for Fast Prediction on Data StreamsNovel Ensemble Tree for Fast Prediction on Data Streams
Novel Ensemble Tree for Fast Prediction on Data StreamsIJERA Editor
 
An incremental mining algorithm for maintaining sequential patterns using pre...
An incremental mining algorithm for maintaining sequential patterns using pre...An incremental mining algorithm for maintaining sequential patterns using pre...
An incremental mining algorithm for maintaining sequential patterns using pre...Editor IJMTER
 

Similaire à Mining closed sequential patterns in large sequence databases (20)

A novel algorithm for mining closed sequential patterns
A novel algorithm for mining closed sequential patternsA novel algorithm for mining closed sequential patterns
A novel algorithm for mining closed sequential patterns
 
An efficient algorithm for sequence generation in data mining
An efficient algorithm for sequence generation in data miningAn efficient algorithm for sequence generation in data mining
An efficient algorithm for sequence generation in data mining
 
A Survey of Sequential Rule Mining Techniques
A Survey of Sequential Rule Mining TechniquesA Survey of Sequential Rule Mining Techniques
A Survey of Sequential Rule Mining Techniques
 
H0964752
H0964752H0964752
H0964752
 
Hadoop Map-Reduce To Generate Frequent Item Set on Large Datasets Using Impro...
Hadoop Map-Reduce To Generate Frequent Item Set on Large Datasets Using Impro...Hadoop Map-Reduce To Generate Frequent Item Set on Large Datasets Using Impro...
Hadoop Map-Reduce To Generate Frequent Item Set on Large Datasets Using Impro...
 
A Brief Overview On Frequent Pattern Mining Algorithms
A Brief Overview On Frequent Pattern Mining AlgorithmsA Brief Overview On Frequent Pattern Mining Algorithms
A Brief Overview On Frequent Pattern Mining Algorithms
 
Data Mining: Mining stream time series and sequence data
Data Mining: Mining stream time series and sequence dataData Mining: Mining stream time series and sequence data
Data Mining: Mining stream time series and sequence data
 
Mining Frequent Item set Using Genetic Algorithm
Mining Frequent Item set Using Genetic AlgorithmMining Frequent Item set Using Genetic Algorithm
Mining Frequent Item set Using Genetic Algorithm
 
PATTERN GENERATION FOR COMPLEX DATA USING HYBRID MINING
PATTERN GENERATION FOR COMPLEX DATA USING HYBRID MININGPATTERN GENERATION FOR COMPLEX DATA USING HYBRID MINING
PATTERN GENERATION FOR COMPLEX DATA USING HYBRID MINING
 
AVL TREE AN EFFICIENT RETRIEVAL ENGINE IN CLASSIFIED FINGERPRINT DATABASE
AVL TREE AN EFFICIENT RETRIEVAL ENGINE IN CLASSIFIED FINGERPRINT DATABASEAVL TREE AN EFFICIENT RETRIEVAL ENGINE IN CLASSIFIED FINGERPRINT DATABASE
AVL TREE AN EFFICIENT RETRIEVAL ENGINE IN CLASSIFIED FINGERPRINT DATABASE
 
Mining sequential patterns for interval based
Mining sequential patterns for interval basedMining sequential patterns for interval based
Mining sequential patterns for interval based
 
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...
 
Ijricit 01-002 enhanced replica detection in short time for large data sets
Ijricit 01-002 enhanced replica detection in  short time for large data setsIjricit 01-002 enhanced replica detection in  short time for large data sets
Ijricit 01-002 enhanced replica detection in short time for large data sets
 
Subgraph relative frequency approach for extracting interesting substructur
Subgraph relative frequency approach for extracting interesting substructurSubgraph relative frequency approach for extracting interesting substructur
Subgraph relative frequency approach for extracting interesting substructur
 
Ijsrdv1 i2039
Ijsrdv1 i2039Ijsrdv1 i2039
Ijsrdv1 i2039
 
A comprehensive study of major techniques of multi level frequent pattern min...
A comprehensive study of major techniques of multi level frequent pattern min...A comprehensive study of major techniques of multi level frequent pattern min...
A comprehensive study of major techniques of multi level frequent pattern min...
 
TEXT CLUSTERING USING INCREMENTAL FREQUENT PATTERN MINING APPROACH
TEXT CLUSTERING USING INCREMENTAL FREQUENT PATTERN MINING APPROACHTEXT CLUSTERING USING INCREMENTAL FREQUENT PATTERN MINING APPROACH
TEXT CLUSTERING USING INCREMENTAL FREQUENT PATTERN MINING APPROACH
 
Fast Sequential Rule Mining
Fast Sequential Rule MiningFast Sequential Rule Mining
Fast Sequential Rule Mining
 
Novel Ensemble Tree for Fast Prediction on Data Streams
Novel Ensemble Tree for Fast Prediction on Data StreamsNovel Ensemble Tree for Fast Prediction on Data Streams
Novel Ensemble Tree for Fast Prediction on Data Streams
 
An incremental mining algorithm for maintaining sequential patterns using pre...
An incremental mining algorithm for maintaining sequential patterns using pre...An incremental mining algorithm for maintaining sequential patterns using pre...
An incremental mining algorithm for maintaining sequential patterns using pre...
 

Dernier

Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Kaya Weers
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integrationmarketing932765
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
React Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkReact Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkPixlogix Infotech
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterMydbops
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesBernd Ruecker
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...itnewsafrica
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationKnoldus Inc.
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 

Dernier (20)

Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
React Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkReact Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App Framework
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architectures
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 

Mining closed sequential patterns in large sequence databases

  • 1. International Journal of Database Management Systems ( IJDMS ) Vol.7, No.1, February 2015 DOI : 10.5121/ijdms.2015.7103 29 MINING CLOSED SEQUENTIAL PATTERNS IN LARGE SEQUENCE DATABASES V. Purushothama Raju1 and G.P. Saradhi Varma2 1 Research Scholar, Dept. of CSE, Acharya Nagarjuna University Guntur, A.P., India 2 Department of Information Technology S.R.K.R. Engineering College, Bhimavaram, A.P., India ABSTRACT Sequential pattern mining is studied widely in the data mining community. Finding sequential patterns is a basic data mining method with broad applications. Closed sequential pattern mining is an important technique among the different types of sequential pattern mining, since it preserves the details of the full pattern set and it is more compact than sequential pattern mining. In this paper, we propose an efficient algorithm CSpan for mining closed sequential patterns. CSpan uses a new pruning method called occurrence checking that allows the early detection of closed sequential patterns during the mining process. Our extensive performance study on various real and synthetic datasets shows that the proposed algorithm CSpan outperforms the CloSpan and a recently proposed algorithm ClaSP by an order of magnitude. KEYWORDS Data mining, sequential pattern mining, closed sequential pattern mining, sequence database 1. INTRODUCTION Sequential pattern mining is an important and active research topic in data mining. Sequential pattern mining was first introduced by Agrawal and Srikant in [1]. Since then several efficient sequential pattern mining algorithms have been proposed to find out sequential patterns. Sequential pattern mining can be applied to several business and scientific applications including biological sequence analysis, market analysis, discovering web access patterns, mining customer shopping sequences, web click streams, XML query access patterns for caching, block correlations in storage systems, sequences of file block references in operating systems, target marketing, customer retention, feature selection for sequence classification, user behavior analysis, web recommender systems, network intrusion detection, personalization systems and the study of scientific or medical processes. There is an increasing trend of utilizing sequential pattern mining in source code mining and software specification mining. These tools convert the source code into a sequence database representation, and find the sequential patterns in order to retrieve different information such as copy-pasted code segments, API usage, programming rules and etc. Consider a book store such as Amazon can find that the customers who have first bought a book ‘‘Introduction to Data Mining” will often purchase another book ‘‘Data Science for Business” in
  • 2. International Journal of Database Management Systems ( IJDMS ) Vol.7, No.1, February 2015 30 a later transaction. Store managers can use this information to do recommendation when a user browses the data mining book. Another useful application is to identify block access patterns of disk systems. The block access patterns are useful to predict the blocks that are accessed next, and these blocks can be prefetched into cache to reduce the disk I/O latency. Sequential pattern mining produces an exponential number of patterns when the database contains long sequences which are expensive in both time and space. The same problem also occurs in itemset and graph mining when the patterns are long. For example, assume the database contains a long sequence {(x1)(x2)……(x100)} it will generate 2100 - 1 frequent subsequences. Some subsequences have the support which is equivalent to the support of the long sequence, which are basically redundant patterns. Therefore, instead of mining the complete set of sequential patterns, it is better to mine closed sequential patterns only. A closed sequential pattern is a sequential pattern which has no supersequence with the same support. Closed sequential pattern mining extensively reduces the number of patterns produced and it can be utilized to obtain the complete set of sequential patterns. In addition, closed sequential pattern mining algorithms make use of search space pruning techniques and outperform sequential pattern mining algorithms. Closed sequential pattern mining helps users to find more interesting patterns and reduces the burden of users to explore too many patterns. As an example, consider a sequence database which has 31,000 sequences, when min_sup is 40%, there are 16,204 sequential patterns, where as only 609 patterns are closed sequential patterns. To find the complete set of sequential patterns, an efficient sequential pattern mining algorithm will take about 400 seconds on a 2 GHz Intel dual core PC. Where as a fastest closed sequential pattern mining algorithm takes only 53 seconds to mine all closed sequential patterns on the same machine. Therefore, it is better to mine only closed sequential patterns. There are two popular algorithms for closed sequential pattern mining: CloSpan[2] and BIDE[3]. CloSpan performs the mining in two stages. In the first stage it produces a closed sequential pattern candidate set and stores it in a prefix sequence lattice. In the second stage it performs post pruning to eliminate non closed sequential patterns. CloSpan works under candidate maintenance- and-test paradigm, hence it is not scalable because a large number of closed sequential pattern candidates will occupy more memory space and lead to huge search space for checking the closure of new patterns, particularly for low support threshold values or long patterns. BIDE mines closed sequential patterns without candidate maintenance. It prunes the search space more deeply and performs closure checking of patterns in an efficient manner. BIDE follows a strict depth-first search order to produce the closed sequential patterns. In this paper, we propose an efficient algorithm, called CSpan, which makes use of the occurrence checking method that allows early detection of closed sequential patterns during the mining process. Our extensive performance study on various real and synthetic datasets shows that the proposed algorithm CSpan outperforms the CloSpan and a recently proposed algorithm ClaSP[4] by an order of magnitude. The remaining of this paper is organized as follows. Section 2 discusses the related work. In Section 3, we present the problem definition. In Section 4, we discuss the proposed method for mining closed sequential patterns. Section 5 reports the performance evaluation. Finally, we conclude the work in Section 6.
  • 3. International Journal of Database Management Systems ( IJDMS ) Vol.7, No.1, February 2015 31 2. RELATED WORK Agrawal and Srikant [1] proposed the sequential pattern mining problem. Later a number of efficient algorithms have been developed for sequential pattern mining. These algorithms are categorized into three groups such as Apriori-based methods, pattern-growth methods and vertical format based methods. Apriori-based methods implement a candidate-generation-and-test strategy. The representative algorithm of these methods is GSP [5], which is an enhancement of AprioriAll algorithm developed by the same authors. Algorithms implementing a candidate-generation-and-test strategy produce a large number of candidate sequences. Counting and pruning such a large set is more expensive. In addition, Apriori-based methods require more number of database scans when there are long patterns. Pattern-growth methods generally start with a frequent pattern, and grow the frequent pattern when traversing the pattern search space using depth-first search. FreeSpan[6], PrefixSpan [7], CloSpan, BIDE and etc belonging to this type. PrefixSpan finds the frequent-1 sequences in the database and builds projected databases for each sequence. It then searches the projected database to uncover locally frequent items and recursively locates the frequent sequences that contain the item as a prefix. This process is repeated until all frequent sequences are discovered. It uses a pseudo-projection technique for constructing projected databases to improve the performance. PrefixSpan is faster than GSP, because in each iteration it scans only the sequences in a projected database and a projected database is much smaller than the original database. Vertical format based methods represent a sequence database using vertical format which allows faster computation of support counting. SPADE [8] and SPAM [9] are based on vertical format. SPADE makes use of a vertical id-list format and a join operation between two id-lists to discover frequent sequential patterns. It consumes more memory, since id-lists are several times bigger than the initial database. SPAM constructs a vertical bitmap of the database for efficient candidate generation and support counting. SPAM traverses the lexicographic sequence tree in a depth-first manner. SPAM is faster than SPADE due to the fast bitmap computation and consumes more memory space than SPADE. Closed itemset mining is used to find closed itemsets in a transaction database. Several algorithms were developed for closed itemset mining. A-Close[10] is the first algorithm developed for mining closed itemsets. It first finds level-wise frequent itemsets using Apriori strategy, and mines all minimal generators. In the second step, it computes the closure of all minimal generators. The performance of A-Close degrades due to the huge cost of the off-line closure calculation. Closet [11] and Closet+ [12] algorithms use compact FP-Tree data structure. In the first scan of the dataset, frequent single items are identified and in the second scan the pruned transactions are added to the FP-Tree. Closet finds closed itemsets by closure climbing and grows frequent closed itemsets with a depth first browsing of the FP-Tree and recursive conditional FP-Tree projections. It uses subset checking to detect the duplicates. Closet+ is an improvement of Closet and it has an adaptive behavior in order to fit both sparse and dense datasets. Closet+ uses a new upward checking technique to detect duplicates for sparse datasets. Charm[13] constructs a prefix tree using frequent itemsets and searches the prefix tree using bottom-up depth-first approach. Once a frequent itemset is produced, its tid-list is checked with other itemsets having the same parent. If one tid-list contains another tid-list, the related nodes are
  • 4. International Journal of Database Management Systems ( IJDMS ) Vol.7, No.1, February 2015 32 combined because both the itemsets belong to the same equivalence class. Charm uses diff-set technique for sorting itemset tid-lists in each node of the tree. It is not feasible to adopt the techniques used in closed itemset mining for closed sequential pattern mining, because subsequence testing needs order matching and the search space of sequences is bigger than that of itemsets. Also, sequential pattern mining is generally less efficient than closed sequential pattern mining, particularly in mining long patterns and with low support threshold. X.Yan et al. developed the CloSpan algorithm for mining closed sequential patterns. CloSpan produces less number of discovered sequences than the traditional methods while preserving the same expressive power. CloSpan can mine long sequences and runs faster than PrefixSpan. CloSpan divides the mining process into two phases. In first phase it generates a candidate set for closed sequential patterns and stores it in a prefix sequence lattice. In second phase it conducts post pruning to eliminate non closed sequences. CloSpan prunes the nonclosed sequential patterns by using an efficient search space pruning method, known as equivalence of projected databases. Unfortunately, a closed sequential pattern mining algorithm under candidate maintenance-and-test paradigm has rather poor scalability because a large number of closed sequential patterns candidates require more memory and lead to huge search space for the closure checking of new patterns, which is usually the case when the support threshold is low or the patterns are long. Wang et al. developed the BIDE algorithm. It mines closed sequential patterns without candidate maintenance. It prunes the search space more deeply and performs closure checking of patterns in an efficient manner. BIDE follows a strict depth-first search order to produce the closed sequential patterns. It uses a novel closure checking scheme called BI-Directional Extension. The forward directional extension is used to grow the prefix patterns and also for closure checking of prefix patterns, whereas the backward directional extension is used for both closure checking of a prefix pattern and pruning the search space. It prunes the search space by using the BackScan pruning method. BIDE first applies the BackScan pruning method to check if a prefix sequence can be pruned, if not, computes the number of backward-extension items. Later it finds the number of forward extension items, if there is no backward-extension item or forward-extension item then it outputs closed sequential patterns. Although BIDE does not keep track of any historical closed sequential patterns candidates for a new pattern's closure checking, it is a computational consuming approach since it needs multiple database scans for the bi-direction closure checking and the BackScan pruning. The COBRA algorithm was developed by Huang et al. [14]. It adopts a bi-phase reduction method for closed sequential pattern mining. It finds closed itemsets for first phase reduction and then finds closed sequential patterns for second phase reduction. COBRA first conducts item extension and then do sequence extension, which eliminates some drawbacks in pattern growth methods. It performs the mining in three stages: (I) Mining Closed Frequent Itemset (II) Database Encoding and (III) Mining Closed Sequential Pattern. COBRA uses closed itemsets to eliminate duplicate item enumeration and to decrease the matching cost for mining locally frequent items. It uses both vertical and horizontal database formats to reduce the search time in the mining process. It uses two pruning methods LayerPruning and ExtPruning. LayerPruning eliminates non closed branches and reduces the costs in pattern checking. ExtPruning checks closure of patterns to eliminate non closed sequential patterns. COBRA requires large memory space than BIDE. ClaSP mines closed sequential patterns in temporal transaction data. It is inspired on the Spade algorithm that uses vertical database format strategy for generating sequential patterns. ClaSP has
  • 5. International Journal of Database Management Systems ( IJDMS ) Vol.7, No.1, February 2015 33 two main phases: The first phase produces Frequent Closed Candidates (FCC) that is kept in main memory; and the second phase performs a post-pruning to remove all non-closed sequences from FCC to finally get exactly frequent closed sequences. It first discovers all frequent 1-sequences in the database and then the method DFS-Pruning is called recursively for all frequent 1-sequences to search the corresponding subtree. FCC is acquired when this process is done for all of the frequent 1-sequences and finally, the algorithm terminates by removing the non-closed sequences in FCC. ClaSP uses the method CheckAvoidable for pruning the search space and outperforms CloSpan. 3. PROBLEM DEFINITION In this section, we first introduce some preliminary concepts, and then formalize the closed sequential pattern mining problem. Let I ={i1,i2,….,im} be a set of all items. A subset of I is called an itemset. A sequence S=(k1,k2, …, kn ) (ki ⊆ I ) is an ordered list of itemsets. The items in each itemset are sorted in alphabetic order. The length of the sequence is the total number of items in the sequence. A sequence S1 =(a1,a2,…..,am) is a subsequence of another sequence S2 =(b1,b2,….,bn), denoted as S1 ⊑ S2, if there exists integers 1 ≤ i1 < i2 < . . . < im ≤ n and a1 ⊆ bi1 , a2 ⊆ bi2 , . . . , and am ⊆ bim. We call S2 as a super-sequence of S1 and S2 contains S1. A sequence database, SD={S1,S2,…,Sn}, is a set of sequences and each sequence has an id. The size, |SD|, of the sequence database SD is the total number of sequences in the SD. The support of a sequence X in a sequence database SD is the no of sequences in SD which contain X. Definition 1 (Sequential pattern): Given a minimum support threshold m_sup, a sequence α is a sequential pattern on SD if support (α) is greater than m_sup. Definition 2 (Closed sequential pattern): A sequential pattern α is a closed sequential pattern if there does not exist a sequential pattern β , such that support (α) = support (β) and α β. The problem of closed sequential pattern mining is to find the complete set of closed sequential patterns above a minimum support threshold m_sup for an input sequence database SD. Table 1 shows a sample sequence database. The items in each itemset are sorted in alphabetic order. If m_sup=2 , the closed sequential pattern set contains 3 sequences {(bc) : 3, (c)(d):2, (a)(bc):2} and the corresponding sequential pattern set contains 9 sequences. It indicates that closed sequential pattern set contains less no of sequences than sequential pattern set. Table 1. A sample sequence database Sid Sequence 1 (ab)(bc) 2 (bc)(d)(e) 3 (ac)(bcd) Given a sequence X=(p1,….,pn) and an item k, X can be concatenated with k using either itemset extention X∆ik=(p1,….,pn (k)) or sequence extension X∆sk=(p1,….,pn,(k)). For example (ab) is an itemset extension of (a) and (a)(b) is a sequence extension of (a).
  • 6. International Journal of Database Management Systems ( IJDMS ) Vol.7, No.1, February 2015 34 4. PROPOSED METHOD In this section, we describe our proposed algorithm CSpan. It uses depth-first search for generating the closed sequential patterns. Since many previously developed algorithms proved that depth-first search based algorithm is more efficient than the breadth-first search based algorithm in mining long patterns. Figure 1. An example sequential pattern tree The CSpan algorithm uses a sequential pattern tree to generate the closed sequential patterns. The sequential pattern tree is constructed as follows. The root of the tree is labeled with φ. Next, all frequent 1-sequences in the database are identified and added to level 1 of the sequential pattern tree. Then, a sequence k at level 1 is extended by appending a frequent item in k’s projected database to get its frequent super-pattern at level 2. A sequence can be extended in two ways: sequence extension and itemset extension. In case of sequence extension, the item is added as a new itemset to the sequence. In case of itemset extension, the item is appended to the last itemset in the sequence. This process is repeated for the sequences at the level 2 and above until there are no super-sequences to generate. Fig. 1 represents the sequential pattern tree for the sample sequence database shown in Table. 1. First the root of the sequential pattern tree is set to φ and then all frequent 1- sequences are added as the children of the root at level 1. The number after the sequence indicates the support of the sequence and the minimum support is chosen as 2. The children of a node are arranged lexicographically. Next, the frequent 2-sequences at level 2 are produced by concatenating the frequent 1-sequence with one of the frequent item in its projected database, and similarly the frequent 3-sequences at level 3 are produced by concatenating the frequent 2-sequence with one of the frequent item in its projected database. A sequence in the sequential pattern tree is considered as a sub-sequence of its child. To locally mine the frequent super-sequences of a certain sequence, we construct the projected database for the sequence and grow the sequence to obtain its frequent super-sequences. We use the pseudo based projection approach of PrefixSpan to create the projected database. φ (a):2 (a)(bc):2 (b):3 (c):3 (d):2 (a)(b):2 (a)(c):2 (bc):3 (c)(d):2 Level 0 Level 1 Level 2 Level 3
  • 7. International Journal of Database Management Systems ( IJDMS ) Vol.7, No.1, February 2015 35 4.1 Occurrence checking We propose a new pruning method called occurrence checking that allows the early detection of closed sequential patterns. Lemma 1. Occurrence checking: A sequential pattern X is not closed if a frequent item y exists such that (1) y appears in every sequence of X’s projected database and (2) the distance between X and y is identical in every sequence of X’s projected database. Proof. If a frequent item y appears in every sequence of X’s projected database and the distance between X and y is identical in every sequence of X’s projected database, then we can always discover another frequent sequence containing X and y whose support is equivalent to X’s support. Therefore, X cannot be closed. We demonstrate the occurrence checking scheme as follows. Assume a is a frequent sequence. If we locate an item b after a in every sequence of a’s projected database, then we can declare that a is not closed since ab is its super-sequence with the same support. 4.2 CSpan Algorithm The CSpan algorithm has two phases. First, it scans the database to determine all frequent 1- sequences. Second, for each frequent k-sequence, it constructs the projected database and locates all frequent items in the projected database to produce its frequent super-sequences at the next level in the sequential pattern tree, where k≥1. During the mining process, it employs occurrence checking for early detection of closed sequential patterns. Algorithm 1: CSpan Input: A sequence database SD and minimum support m_sup. Output: The complete set of closed sequential patterns. 1. S1 = frequent 1-sequences in SD 2. CSP = φ 3. for each s in S1 do 4. PD = Projected database of s 5. CS = Generate_patterns (PD, s, m_sup) 6. CSP = CSP CS 7. end for Algorithm 2: Generate_patterns(PD, s, m_sup) Input: A projected database PD, sequence s and minimum support m_sup Output: Closed sequence patterns with prefix s. 1. CS = φ 2. P1 = frequent items in PD 3. if support(s) ≥ m_sup and P1≠ φ then 4. if (s does not pass the occurrence checking) then 5. if (s is closed w.r.t CS) then 6. CS= CS s 7. end if 8. end if 9. for each p in P1 do 10. PD=Projected database of s ∆s p 11. CS=CS Generate_patterns(PD, s ∆s p, m_sup) 12. PD=Projected database of s ∆i p
  • 8. International Journal of Database Management Systems ( IJDMS ) Vol.7, No.1, February 2015 36 13. CS=CS Generate_patterns(PD, s ∆i p, m_sup) 14. end for 15. end if 16. return CS Example: Consider the sample sequence database shown in Table 1 and assume that m_sup = 2. First, we determine all frequent 1-sequences (a), (b), (c) and (d), and perform the occurrence checking on these sequences. All frequent 1- sequences pass the occurrence checking step and therefore they are not closed. Next, we determine the frequent super-sequences of frequent 1- sequence (a) by using the frequent items in its projected database. There are two frequent items in (a)’s projected database, namely, (b) and (c). We first grow the frequent 1-sequence (a) by appending (b) to it, to get the frequent 2- sequence (a)(b). Then we check whether (a)(b) passes the occurrence checking or not. But, (a)(b) passes the occurrence checking step. Therefore, it is not a closed sequential pattern. Next we grow the sequence (a)(b) by appending a frequent item in its projected database to get the frequent 3- sequence (a)(bc), which does not pass the occurrence checking. Therefore, it is a closed sequential pattern. Since there are no frequent items in (a)(bc)’s projected database, we can’t generate its super-sequences and we stop growing it further. By growing the sequential pattern tree in the manner as explained above, we can generate all closed sequential patterns. Finally, we obtain three closed sequential patterns (bc), (c)(d) and (a)(bc). 5. PERFORMANCE EVALUATION In this section, we perform a thorough performance evaluation of our proposed CSpan algorithm on both real and synthetic data sets with various kinds of sizes and data distributions, and we compare CSpan with CloSpan and a recently proposed algorithm ClaSP. All experiments were conducted on a 2GHz Intel Core2 Duo processor PC with 1GB main memory running Microsoft Windows XP. The algorithms were implemented in Java and were executed using different support values. In our experiments we used a real world dataset BMS WebView of KDD CUP 2000[15] and two synthetic datasets. BMS WebView is a click stream data from an e-commerce web store named Gazelle and it has been used widely to assess the performance of frequent pattern mining. This dataset contains sequences of 59601 customers with a total of 146000 purchases in 497 distinct product categories. The average length of sequences is 2.42 items with a standard deviation of 3.22. In this dataset, there are some long sequences. For example, 318 sequences contain more than 20 items. The characteristics of the BMS WebView dataset are given in Table 2. We generated the synthetic datasets using SPMF[16] framework. The characteristics of the two synthetic datasets are given in Table 3. Table 2. Characteristics of the BMS WebView dataset S. No. Characteristic Value 1 No of sequences 59601 2 No of distinct items 497 3 Average length of sequences 2.42
  • 9. International Journal of Database Management Systems ( IJDMS ) Vol.7, No.1, February 2015 37 Table 3. Characteristics of the Synthetic datasets Three sets of experiments were conducted to evaluate the performance of the CSpan algorithm. The first set compares the runtime performance of CSpan with CloSpan and ClaSP using real world dataset BMS WebView for different support values. The second and third sets compare the runtime performance of CSpan with CloSpan and ClaSP using synthetic datasets for different support values. Fig. 2 shows the results of runtime performance using the real world dataset BMS WebView. The X-axis is the minimum support, while the Y-axis is the algorithms runtime. The support values are set from 0.01 to 0.06. Our proposed algorithm CSpan runs faster than CloSpan and ClaSP, because of the use occurrence checking technique. Figure 2. Performance comparison using BMS WebView dataset. Fig. 3 and Fig. 4 show the results of runtime performance using the synthetic datasets. The X-axis is the minimum support, while the Y-axis is the algorithms runtime. The support values are set from 0.1 to 0.6. Our proposed algorithm CSpan outperforms CloSpan and ClaSP on both the synthetic datasets. S. No. Characteristic Dataset1 Value Dataset2 Value 1 No of sequences 20000 15000 2 No of distinct items 15 20 3 No of items per itemset 3 3 4 No of itemsets per sequence 5 6
  • 10. International Journal of Database Management Systems ( IJDMS ) Vol.7, No.1, February 2015 38 Figure 3. Performance comparison using synthetic dataset1 Figure 4. Performance comparison using synthetic dataset2 All the above experiments confirm that the proposed algorithm CSpan is efficient and takes less run time comparing to other algorithms. Because CSpan uses occurrence checking for early detection of closed sequential patterns and stores a result set that contains only closed sequential patterns, where as CloSpan and ClaSP keep a larger number of closed sequential pattern candidates and remove the non-closed ones at the end of the mining process. 6. CONCLUSION Several researchers focused on the sequential pattern mining problem and many algorithms were developed to mine sequential patterns. Closed sequential pattern mining is a variant of sequential pattern mining and attains broad attention in the recent years because it has the same expression ability of the sequential pattern mining and more compact than the sequential pattern mining. In this paper, we propose an efficient algorithm CSpan which makes use of a new pruning method called occurrence checking that allows the early detection of closed sequential patterns during the mining process. Our extensive performance study on various real and synthetic datasets shows that the proposed algorithm CSpan outperforms the CloSpan and a recently proposed algorithm ClaSP by an order of magnitude. In future we will extend CSpan to incorporate user specified constraints. Other interesting research problems that can be pursued include parallel mining of closed sequential patterns [17] and mining of structured patterns.
  • 11. International Journal of Database Management Systems ( IJDMS ) Vol.7, No.1, February 2015 39 REFERENCES [1] R. Agrawal and R. Srikant, “Mining sequential patterns,” Proc. Int’l Conf. Data Engineering (ICDE ’95), pp. 3-14, Mar. 1995. [2] X. Yan, J. Han, and R. Afshar, “CloSpan: Mining closed sequential patterns in large databases,” Proc. SIAM Int’l Conf. Data Mining (SDM ’03), pp. 166-177, May 2003. [3] J. Wang, J. Han, and Chun Li, “Frequent closed sequence mining without candidate maintenance,” IEEE Trans. Knowledge and Data Eng., vol. 19, no. 8, pp. 1042-1056, Aug. 2007. [4] Antonio Gomariz, Manuel Campos, Roque Marin, and Bart Goethals, “ClaSP: An efficient algorithm for mining frequent closed sequences,” PAKDD 2013, LNAI 7818, Part I, pp. 50–61, 2013. [5] R. Srikant and R. Agrawal, “Mining sequential patterns: Generalizations and performance improvements,” Proc. Int’l Conf. Extending Database Technology (EDBT ’96), pp. 3-17, Mar. 1996. [6] J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, and M.C. Hsu, “FreeSpan: Frequent pattern- projected sequential pattern mining,” Proc. ACM SIGKDD Int’l Conf. Knowledge Discovery and Data Mining (SIGKDD’00), pp. 355-359, Aug. 2000. [7] J . Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M. Hsu, “PrefixSpan : Mining sequential patterns efficiently by prefix-projected pattern growth,” Proc. Int’l Conf. Data Engineering (ICDE ’01), pp. 215-224, Apr. 2001. [8] M. Zaki, “SPADE: An efficient algorithm for mining frequent sequences,” Machine Learning, vol. 42, pp. 31-60, 2001. [9] J. Ayres, J. Gehrke, T. Yiu, and J. Flannick, “Sequential pattern mining using a bitmap representation,” Proc. ACM SIGKDD Int’l Conf. Knowledge Discovery and Data Mining (SIGKDD’ 02), pp. 429-435, July 2002. [10] Nicolas Pasquier, Yves Bastide, Rafik Taouil, and Lot Lakhal, “Discovering frequent closed itemsets for association rules,” Proceedings of the 7th International Conference on Database Theory (ICDT '99), pp. 398-416,1999. [11] J. Pei, J. Han, and R. Mao, “CLOSET: An efficient algorithm for mining frequent closed itemsets,” Proc. ACM SIGMOD Workshop Research Issues in Data Mining and Knowledge Discovery (DMKD ’00), pp. 21-30, May 2000. [12] J. Wang, J. Han, and J. Pei, “CLOSET+: Searching for the best strategies for mining frequent closed itemsets,” Proc. ACM SIGKDD Int’l Conf. Knowledge Discovery and Data Mining (SIGKDD ’03), pp. 236-245, Aug. 2003. [13] M. Zaki and C. Hsiao, “CHARM: An efficient algorithm for closed itemset mining,” Proc. SIAM Int’l Conf. Data Mining (SDM ’02), pp. 457-473, Apr. 2002. [14] Kuo-Yu Huang, Chia-Hui Chang, Jiun-Hung Tung, and Cheng-Tao Ho, “COBRA: Closed sequential pattern mining using bi-phase reduction approach,” Proceedings of 8th International Conference, DaWaK, Springer LNCS 4081, pp. 280-291, 2006. [15] Ron Kohavi, Carla E. Brodley, Brian Frasca, Llew Mason, and Zijian Zheng, “KDD-Cup 2000 organizers' report: Peeling the onion,” SIGKDD Explorations, vol. 2, no. 2, pp. 86-93, Dec. 2000. [16] Fournier-Viger P., An Open-Source Data Mining Library, http://www.philippe-fournier- viger.com/spmf/index.php?link=datasets.php, 2008, Accessed 20 July 2014. [17] S. Cong, J. Han, and D.A. Padua, “Parallel mining of closed sequential patterns,” Proc. ACM SIGKDD Int’l Conf. Knowledge Discovery and Data Mining (SIGKDD ’05), pp. 562-567, Aug. 2005.