CloSpan

Mining Closed Sequential Patterns in
Large Datasets
Presenter: Ildar Nurgaliev
Lab: Dainfos
Innopolis University

Main idea
Instead of mining the complete set of frequent subsequences
we mine frequent closed subsequences

Benefits
• can mine really long sequences
• produces significantly fewer discovered frequent
sequences

Preliminary Concepts
Sequence
• items: I = {i1, i2, ..., im}
• itemset (ti ): ti ⊆ I
• sequence (ordered list): s = ⟨t1, t2, ..., tm⟩
• size |s|: number of itemsets in s
• length l(s): $l(s) = \sum_{i=1}^{m} |t_i|$ (total number of items)
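The definitions above can be sketched in a few lines of Python. The representation (a tuple of frozensets) and the helper names `seq`, `size` and `length` are assumptions made for illustration; they do not come from the slides.

```python
from typing import FrozenSet, Tuple

Itemset = FrozenSet[str]
Sequence = Tuple[Itemset, ...]


def seq(*itemsets: str) -> Sequence:
    """Build a sequence; each string lists the single-character items of one itemset."""
    return tuple(frozenset(t) for t in itemsets)


def size(s: Sequence) -> int:
    """|s|: number of itemsets in s."""
    return len(s)


def length(s: Sequence) -> int:
    """l(s): total number of items over all itemsets of s."""
    return sum(len(t) for t in s)


s = seq("e", "abf", "bde")   # the sequence (e)(abf)(bde)
print(size(s), length(s))    # 3 7
```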

α is a sub-sequence of β, or β is a super-sequence of α (β contains α)
• α = ⟨α1, α2, ..., αm⟩
• β = ⟨β1, β2, ..., βn⟩
• α ⊑ β (if α ≠ β, written as α ⊏ β)
• iff ∃ i1, i2, ..., im such that
1 ≤ i1 < i2 < ... < im ≤ n and
α1 ⊆ βi1, α2 ⊆ βi2, ..., αm ⊆ βim
• β absorbs α: if β contains α and their supports are the
same
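A minimal sketch of the containment test, assuming sequences are stored as tuples of frozensets (a representation chosen here, not given in the slides). It matches each itemset of α greedily against the earliest possible itemset of β, which is enough to decide whether strictly increasing indices exist.

```python
from typing import FrozenSet, Tuple

Sequence = Tuple[FrozenSet[str], ...]


def seq(*itemsets: str) -> Sequence:
    """Hypothetical helper: one string of single-character items per itemset."""
    return tuple(frozenset(t) for t in itemsets)


def is_subsequence(alpha: Sequence, beta: Sequence) -> bool:
    """True if alpha is contained in beta (alpha ⊑ beta)."""
    j = 0
    for a in alpha:
        # Advance to the first itemset of beta that is a superset of a.
        while j < len(beta) and not a <= beta[j]:
            j += 1
        if j == len(beta):
            return False
        j += 1  # the next itemset of alpha must match strictly later
    return True


print(is_subsequence(seq("e", "a"), seq("e", "abf", "bde")))  # True
print(is_subsequence(seq("af"), seq("e", "a", "b")))          # False
```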

Support
• D = {s1, s2, ..., sn}: sequence database
• each s associated with id (id of si is i)
• |D|: number of s in D
• support(α): number of s in D which contain α
support(α) = |{s | s ∈ D and α ⊑ s}|
• min_sup: minimum support threshold
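As a small illustration, support can be counted directly from the definition. The three-sequence database `D` and the helper names below are assumptions chosen for this sketch; they are not listed on this slide.

```python
from typing import FrozenSet, List, Tuple

Sequence = Tuple[FrozenSet[str], ...]


def seq(*itemsets: str) -> Sequence:
    """Hypothetical helper: one string of single-character items per itemset."""
    return tuple(frozenset(t) for t in itemsets)


def is_subsequence(alpha: Sequence, beta: Sequence) -> bool:
    """Greedy containment test: alpha ⊑ beta."""
    j = 0
    for a in alpha:
        while j < len(beta) and not a <= beta[j]:
            j += 1
        if j == len(beta):
            return False
        j += 1
    return True


def support(alpha: Sequence, db: List[Sequence]) -> int:
    """Number of sequences in db that contain alpha."""
    return sum(1 for s in db if is_subsequence(alpha, s))


# Illustrative database (an assumption of this sketch).
D = [seq("af", "d", "e", "a"), seq("e", "a", "b"), seq("e", "abf", "bde")]
min_sup = 2
for pattern in (seq("e", "a"), seq("af"), seq("d", "b")):
    sup = support(pattern, D)
    print(sup, sup >= min_sup)
```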

Frequent sequential pattern (FS) and closed FS (CS)
• FS: includes all s with support(s) ≥ min_sup
• CS = {α | α ∈ FS and ∄β ∈ FS
such that α ⊏ β and support(α) = support(β)}
• closed sequence mining: find CS above min_sup
• database containment relation D ⊑ D':
if ∃ an injective function f : D → D', s.t.
∀s ∈ D, s ⊑ f(s)

Item extension
• Given: s = ⟨t1, ..., tm⟩ and item α
• s ⋄ α: concatenation (I-Step or S-Step)
• s ⋄_i α = ⟨t1, ..., tm ∪ {α}⟩ if ∀k ∈ tm, k < α
Example: ⟨(ae)⟩ is an I-Step extension of ⟨(a)⟩
• s ⋄_s α = ⟨t1, ..., tm, {α}⟩
Example: ⟨(a)(c)⟩ is an S-Step extension of ⟨(a)⟩
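The two item extensions can be sketched as follows. The function names `i_step` and `s_step` and the tuple-of-frozensets representation are assumptions for illustration; the I-Step guard implements the condition that every item of the last itemset must be smaller than the new item.

```python
from typing import FrozenSet, Tuple

Sequence = Tuple[FrozenSet[str], ...]


def seq(*itemsets: str) -> Sequence:
    """Hypothetical helper: one string of single-character items per itemset."""
    return tuple(frozenset(t) for t in itemsets)


def i_step(s: Sequence, item: str) -> Sequence:
    """I-Step: add item to the last itemset of s."""
    if not s or not all(k < item for k in s[-1]):
        raise ValueError("I-Step needs a last itemset whose items are all smaller than the new item")
    return s[:-1] + (s[-1] | {item},)


def s_step(s: Sequence, item: str) -> Sequence:
    """S-Step: append a new single-item itemset to s."""
    return s + (frozenset({item}),)


print(i_step(seq("a"), "e") == seq("ae"))      # True
print(s_step(seq("a"), "c") == seq("a", "c"))  # True
```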

Sequence extension
• Given: s = ⟨t1, ..., tm⟩ and p = ⟨t'1, ..., t'n⟩
• s ⋄ p: concatenation (itemset-extension or
sequence-extension)
• s ⋄_i p = ⟨t1, ..., tm ∪ t'1, ..., t'n⟩ if ∀k ∈ tm, j ∈ t'1, k < j
• s ⋄_s p = ⟨t1, ..., tm, t'1, ..., t'n⟩
• s' = p ⋄ s: p is a prefix and s is a suffix of s'
Example: ⟨(e)(a)⟩ is a prefix of ⟨(e)(abf)(bde)⟩ and
⟨(bf)(bde)⟩ is its suffix

s-projected database (physical projection and pseudo projection)
• D_s = {p | s' ∈ D, s' = r ⋄ p s.t. r is the minimum prefix of s'
containing s (s ⊑ r and ∄r', s ⊑ r' ⊏ r)}
p can be empty
Example
• D_⟨(af)⟩ = {⟨(d)(e)(a)⟩, ⟨(bde)⟩}
• D_⟨(e)(a)⟩ = {$, ⟨(b)⟩, ⟨(_bf)(bde)⟩}
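A simplified sketch of physical projection, under stated assumptions: sequences are tuples of frozensets, a projection is a pair (leftover items of the matched itemset, remaining itemsets), and the three-sequence database `D` is chosen here so that the output matches the two examples above. The names `project`, `projected_db` and `show` are hypothetical.

```python
from typing import FrozenSet, List, Optional, Tuple

Itemset = FrozenSet[str]
Sequence = Tuple[Itemset, ...]
Projection = Tuple[Itemset, Sequence]  # (leftover of matched itemset, rest)


def seq(*itemsets: str) -> Sequence:
    return tuple(frozenset(t) for t in itemsets)


def project(s: Sequence, sp: Sequence) -> Optional[Projection]:
    """Suffix of sp after the minimum prefix containing s, or None if s is not in sp."""
    if not s:
        return (frozenset(), sp)
    j = 0
    for a in s:
        # Greedy leftmost match gives the shortest prefix that contains s.
        while j < len(sp) and not a <= sp[j]:
            j += 1
        if j == len(sp):
            return None
        j += 1
    last = j - 1
    pivot = max(s[-1])  # items after the pivot stay in the "(_...)" itemset
    leftover = frozenset(x for x in sp[last] if x > pivot)
    return (leftover, sp[last + 1:])


def projected_db(s: Sequence, db: List[Sequence]) -> List[Projection]:
    return [p for p in (project(s, sp) for sp in db) if p is not None]


def show(p: Projection) -> str:
    leftover, rest = p
    parts = ["(_" + "".join(sorted(leftover)) + ")"] if leftover else []
    parts.extend("(" + "".join(sorted(t)) + ")" for t in rest)
    return "".join(parts) or "$"  # "$" marks an empty suffix


# Assumed database, not listed on this slide.
D = [seq("af", "d", "e", "a"), seq("e", "a", "b"), seq("e", "abf", "bde")]
print([show(p) for p in projected_db(seq("af"), D)])
print([show(p) for p in projected_db(seq("e", "a"), D)])
```

Pseudo projection would store only a sequence id and an offset instead of copying the suffix; this sketch copies for clarity.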

Lexicographic Sequence Tree
Set Lexicographic Order
• Let t = {i1, i2, ..., ik}, t' = {j1, j2, ..., jl}, where
i1 ≤ ... ≤ ik and j1 ≤ ... ≤ jl
• t < t' iff either of the following is true:
1. ∃h, 0 ≤ h ≤ min{k, l}, such that ir = jr for r < h, and ih < jh
2. k < l, and i1 = j1, i2 = j2, ..., ik = jk
Example: (a, f) < (b, f), (a, b) < (a, b, c) and
(a, b, c) < (b, c)
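A brief sketch of the set order, assuming the reading in which each example pair above is listed as "smaller, then larger" and a proper prefix precedes its extension. Under that assumption the order coincides with Python's built-in comparison of sorted lists; the name `set_less` is hypothetical.

```python
from typing import Iterable


def set_less(t: Iterable[str], t2: Iterable[str]) -> bool:
    """True if itemset t precedes itemset t2 in set lexicographic order."""
    # Lists compare element by element; if one is a proper prefix of the
    # other, the shorter list is the smaller one.
    return sorted(t) < sorted(t2)


print(set_less("af", "bf"))   # True
print(set_less("ab", "abc"))  # True
print(set_less("abc", "bc"))  # True
```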

Sequence Lexicographic Order
i. if s' = s ⋄ p, then s < s'
ii. if s = α ⋄_i p and s' = α ⋄_s p', then s < s', no matter what the
order relation between p and p' is
iii. if s = α ⋄_i p and s' = α ⋄_i p', p < p' indicates s < s'
iv. if s = α ⋄_s p and s' = α ⋄_s p', p < p' indicates s < s'
Example: ⟨(a, b)⟩ < ⟨(a, b)(a)⟩; ⟨(a, b)⟩ < ⟨(a)(a)⟩

Lexicographic Sequence Tree construction
1. each node in the tree corresponds to a sequence, and the
root is a null sequence;
2. if a parent node corresponds to a sequence s, its child is
either an itemset-extension of s, or a sequence-extension
of s;
3. the left sibling is less than the right sibling in sequence
lexicographic order.

Lexicographic Sequence Tree and Prefix Search Tree

Example
Lexicographic Sequence Tree with min_sup = 2

Search Space Pruning and Prefix Sequence Lattice
LEMMA 1 (Common Prefix)
LEMMA 1. Given a subsequence s, and its projected database
D_s, if ∃α, α is a common prefix for all the sequences with the
same extension type (either itemset-extension or sequence-extension) in
D_s, then ∀β, if s ⋄ β is closed, α must be a prefix of β. That
means ∀β ⊏ α, we need not search s ⋄ β and its descendants
except the branch of s ⋄ α.
Example: D_s = {⟨(d)(e)(af)⟩, ⟨(d)(e)(fg)⟩}, all the
sequences in D_s share a common prefix α = ⟨(d)(e)⟩, so any
sequence with prefix s but not s ⋄ ⟨(d)(e)⟩ cannot be closed.
So we can jump to the branch s ⋄ α.

LEMMA 2 (Partial Order)
LEMMA 2. Given a sequence s, and its projected database D_s,
if among all the sequences in D_s, an item α always
occurs before an item β (either in the same itemset for all
sequences in D_s or in a different itemset, but not both), then
D_{s⋄α⋄β} = D_{s⋄β}. Therefore, ∀γ, s ⋄ β ⋄ γ is not closed. We need
not search any sequence in the branch of s ⋄ β.

Theorem 1 (Equivalence of Projected Databases)
• $I(D) = \sum_{i=1}^{n} l(s_i)$: total number of items in D
Theorem 1: Given two sequences s and s' with s ⊑ s',
D_s = D_{s'} ⇔ I(D_s) = I(D_{s'})
Example: Consider the sample database D on slide 15.
• D_⟨(af)⟩ = D_⟨(f)⟩ = {⟨(d)(e)⟩, ⟨(de)⟩}, and
• I(D_⟨(af)⟩) = I(D_⟨(f)⟩) = 4.
Based on Theorem 1, the following search space pruning can
be achieved.

Proof of Theorem 1
• D_s = D_{s'} → I(D_s) = I(D_{s'}) (obvious);
• Since s ⊑ s', D_{s'} ⊑ D_s and I(D_{s'}) ≤ I(D_s);
• The equality between I(D_{s'}) and I(D_s) holds only if
∀γ ∈ D_{s'}, γ ∈ D_s, and vice versa. Therefore, D_s = D_{s'}.

LEMMA 3 (Early Termination by Equivalence)
LEMMA 3. Given two sequences s ⊑ s' with
I(D_s) = I(D_{s'}), then ∀γ, support(s ⋄ γ) = support(s' ⋄ γ).
Example: Consider the sample database D on slide 15.
• I(D_⟨(af)⟩) = I(D_⟨(f)⟩);
• both ⟨(af)(d)⟩ and ⟨(af)(e)⟩ are frequent;
We can conclude that the supports of ⟨(af)(d)⟩ and ⟨(f)(d)⟩, and of
⟨(af)(e)⟩ and ⟨(f)(e)⟩, are the same without knowing the
support of ⟨(f)(e)⟩ and ⟨(f)(d)⟩.

Projected database closed set (LS)
• LS = {s | support(s) ≥ min_sup and ∄s', s.t. s ⊏ s' and
I(D_s) = I(D_{s'})};
• CS ⊆ LS ⊆ FS: instead of mining CS directly, the CloSpan
algorithm first produces the complete set LS;
• then non-closed sequence elimination is applied to LS to
generate CS, based on Lemma 3.

Corollary 1 (Backward Sub-Pattern)
Corollary 1. If a sequence s < s' and s ⊒ s', the condition
I(D_s) = I(D_{s'}) is sufficient to stop searching any descendant
of s' in the prefix search tree.
s' is a backward sub-pattern of s if s < s' and s ⊒ s' (s' is discovered
after s)
Example: I(D_⟨(f)⟩) = I(D_⟨(af)⟩) → D_⟨(f)⟩ = D_⟨(af)⟩

Corollary 2 (Backward Super-Pattern)
Corollary 2. If a sequence s < s' and s ⊑ s', and the condition
I(D_s) = I(D_{s'}) holds, it is sufficient to transplant the
descendants of s to s' instead of searching any descendant of
s' in the prefix search tree.
Example: the same logic as in the previous example.

CloSpan: Design and Implementation
2 main steps
CloSpan divides the mining process into 2 stages.
1. It generates the LS set, a superset of the closed frequent
sequences, and stores it in a prefix sequence lattice;
2. it does post-pruning to eliminate non-closed sequences.

Algorithm 1: ClosedMining(D, min_sup, L)

Algorithm 2: CloSpan(s, D_s, min_sup, L)

Algorithm: CloSpan
• A hash index on the size of the projected database speeds
up the check of Theorem 1 (lines 1-4 of CloSpan);
• if I(D_{s'}) = I(D_s) for an indexed sequence s', then:
• if s ⊑ s', we do not add ⟨I(D_s), s⟩;
• if s' ⊑ s, we replace ⟨I(D_{s'}), s'⟩ with ⟨I(D_s), s⟩.
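The two rules above can be sketched with a dictionary keyed by projected-database size. This is a simplified illustration, not the slide's Algorithm 3: the class name `SizeIndex`, its return values and the data representation are assumptions, and the lattice updates (pruning or transplanting descendants) that CloSpan performs alongside are not modelled.

```python
from collections import defaultdict
from typing import DefaultDict, FrozenSet, List, Tuple

Sequence = Tuple[FrozenSet[str], ...]


def seq(*itemsets: str) -> Sequence:
    return tuple(frozenset(t) for t in itemsets)


def is_subsequence(alpha: Sequence, beta: Sequence) -> bool:
    """Greedy containment test: alpha ⊑ beta."""
    j = 0
    for a in alpha:
        while j < len(beta) and not a <= beta[j]:
            j += 1
        if j == len(beta):
            return False
        j += 1
    return True


class SizeIndex:
    """Hash index from I(D_s) to the sequences discovered with that size."""

    def __init__(self) -> None:
        self.buckets: DefaultDict[int, List[Sequence]] = defaultdict(list)

    def check(self, s: Sequence, db_size: int) -> str:
        bucket = self.buckets[db_size]
        for idx, other in enumerate(bucket):
            if is_subsequence(s, other):
                return "skip"        # s ⊑ s': do not add <I(D_s), s>
            if is_subsequence(other, s):
                bucket[idx] = s      # s' ⊑ s: replace the stored entry
                return "replaced"
        bucket.append(s)
        return "added"


index = SizeIndex()
print(index.check(seq("af"), 4))  # added
print(index.check(seq("f"), 4))   # skip
```

Only sequences that fall into the same bucket are compared, so the containment test runs on a small candidate set.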

Algorithm 3: checkProjectedDBSize(s, k, H)
Corresponds to lines 1-4 in Algorithm 2.

Algorithm 3: hash function algorithm
• The projected database size ranges from 0 to I(D), so if the values of
I(D_s) are dense in a small range, performance degrades;
• by Theorem 1 we could use necessary conditions for
D_s = D_{s'} as part of the hash key;
• $L(D_s) = I(D_s) + \sum_{j=1}^{m} \sum_{k=i_j+1}^{n} l(s_k)$;
• if s ⊑ s', L(D_s) = L(D_{s'}) ⇔ I(D_s) = I(D_{s'}).

Non-Closed Sequence Elimination
Checking for super-sequences
• use support(s) as the hash function
• find all the sequences with the same support as s
• check whether there is a super-sequence containing s.
• if s ⊑ s' and
support(s) = support(s') → T(D_s) = T(D_{s'})
(T: sum of the ids of the corresponding sequences)
• that is why T(D_s) could be used as the hash
function instead of support (sparser)
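A minimal sketch of the elimination step using support as the hash key, as in the first bullet. The function name, the representation and the sample supports are assumptions for illustration; the sparser T(D_s) key is not implemented here.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Tuple

Sequence = Tuple[FrozenSet[str], ...]


def seq(*itemsets: str) -> Sequence:
    return tuple(frozenset(t) for t in itemsets)


def is_subsequence(alpha: Sequence, beta: Sequence) -> bool:
    """Greedy containment test: alpha ⊑ beta."""
    j = 0
    for a in alpha:
        while j < len(beta) and not a <= beta[j]:
            j += 1
        if j == len(beta):
            return False
        j += 1
    return True


def eliminate_non_closed(patterns: Dict[Sequence, int]) -> Dict[Sequence, int]:
    """Keep only patterns with no proper super-sequence of equal support."""
    by_support: Dict[int, List[Sequence]] = defaultdict(list)
    for s, sup in patterns.items():
        by_support[sup].append(s)  # bucket by support (the hash key)
    closed: Dict[Sequence, int] = {}
    for sup, group in by_support.items():
        for s in group:
            absorbed = any(s != o and is_subsequence(s, o) for o in group)
            if not absorbed:
                closed[s] = sup
    return closed


# Illustrative candidate set with assumed supports.
candidates = {
    seq("a"): 3, seq("e"): 3, seq("e", "a"): 3,
    seq("f"): 2, seq("af"): 2,
}
closed = eliminate_non_closed(candidates)
print(sorted("".join("(" + "".join(sorted(t)) + ")" for t in s) for s in closed))
```

Bucketing first means each pattern is only compared with patterns of the same support, which is the point of the hash function.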

Conclusion
CloSpan
• Solves the closed sequential pattern mining problem;
• CloSpan outperforms PrefixSpan by more than one order
of magnitude;
• capable of mining longer frequent sequences in a large
data set with low min_sup;
• it does not modify the frequent pattern mining algorithm:
it only defines the early termination condition of a search
branch;
• this method can be extended to other existing sequential
pattern mining algorithms (SPADE, SPAM).

Possible improvements
CloSpan
• The performance of CloSpan comes from its smart
pruning method, which could be made smarter still;
• Avoid the need to keep track of every historical
frequent closed sequence (or candidate) for a new
pattern's closure checking.

Possible improvements
BIDE algorithm
1. BIDE consumes much less memory and can be an order of
magnitude faster than CloSpan when the support is low;
2. BIDE has linear scalability against database size in terms of
runtime efficiency and space usage;
3. the BackScan pruning method is very effective in
enhancing the performance of BIDE.

CloSpan in Trajectory Mining
Sequential Pattern Mining from Trajectory Data
• needs more study: IPCA and DBSCAN on trajectory data.
• CloSpan could be used as an unsupervised algorithm for
detecting the most crowded paths in a city.
• ...
