CloSpan
1. Mining Closed Sequential Patterns in
Large Datasets
Presenter: Ildar Nurgaliev
Lab: Dainfos
Innopolis University CloSpan page 1 of 34
2. Main idea
Instead of mining the complete set of frequent subsequences
we mine frequent closed subsequences
3. Benefits
• can mine really long sequences
• produces significantly fewer discovered frequent
sequences
4. Preliminary Concepts
Sequence
• items: I = {i1, i2, ..., im}
• itemset (ti ): ti ⊆ I
• sequence (ordered list): s = ⟨t1, t2, ..., tm⟩
• size |s|: number of itemsets in s
• length l(s): l(s) = Σ_{i=1}^{m} |ti| (total number of items)
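A minimal sketch of these definitions, assuming a sequence is stored as a list of itemsets and each itemset as a string of single-character items (this representation and the function names are illustrative, not taken from the slides):

```python
# Assumed representation: a sequence is a list of itemsets,
# and an itemset is a string of single-character items, e.g. "af" = {a, f}.
s = ["af", "d", "e", "a"]  # the sequence <(af)(d)(e)(a)>

def size(seq):
    """Number of itemsets in the sequence, written |s| on the slide."""
    return len(seq)

def length(seq):
    """Total number of items, l(s) = sum of |t_i| over all itemsets."""
    return sum(len(itemset) for itemset in seq)

print(size(s), length(s))  # 4 itemsets, 5 items in total
```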
5. Preliminary Concepts
α is a sub-sequence of β, or β is a super-sequence of α (β contains α)
• α = ⟨α1, α2, ..., αm⟩
• β = ⟨β1, β2, ..., βn⟩
• α ⊑ β (if α ≠ β, written as α ⊏ β)
• iff ∃ i1, i2, ..., im such that
1 ≤ i1 < i2 < ... < im ≤ n and
α1 ⊆ βi1, α2 ⊆ βi2, ..., αm ⊆ βim
• β absorbs α: if β contains α and their supports are the
same
6. Preliminary Concepts
Support
• D = {s1, s2, ..., sn}: sequence database
• each s associated with id (id of si is i)
• |D|: number of s in D
• support(α): number of s in D which contain α
support(α) = |{s | s ∈ D and α ⊑ s}|
• min_sup: minimum support threshold
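The containment test and the support count can be sketched as follows. The three-sequence database is an assumption chosen to be consistent with the projected-database examples used later in this deck; the function names and the string-per-itemset representation are illustrative only.

```python
def is_subsequence(alpha, beta):
    """Return True if alpha is contained in beta.

    Greedy left-to-right matching: each itemset of alpha is matched to the
    earliest remaining itemset of beta that is a superset of it.
    """
    pos = 0
    for itemset in alpha:
        while pos < len(beta) and not set(itemset) <= set(beta[pos]):
            pos += 1
        if pos == len(beta):
            return False
        pos += 1  # the next itemset of alpha must match strictly later
    return True

def support(alpha, database):
    """Number of sequences in the database that contain alpha."""
    return sum(1 for s in database if is_subsequence(alpha, s))

# Assumed sample database (itemsets written as strings of items).
D = [
    ["af", "d", "e", "a"],   # <(af)(d)(e)(a)>
    ["e", "a", "b"],         # <(e)(a)(b)>
    ["e", "abf", "bde"],     # <(e)(abf)(bde)>
]

print(support(["e", "a"], D))  # every sequence contains <(e)(a)>
print(support(["af"], D))      # only the first and third contain <(af)>
```

With min_sup = 2, both patterns in this usage example would count as frequent.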
7. Preliminary Concepts
Frequent sequential pattern (FS) and closed FS (CS)
• FS: includes all s of support(s) ≥ min_sup
• CS = {α | α ∈ FS and ∄β ∈ FS
such that α ⊏ β and support(α) = support(β)}
• closed sequence mining: find CS above min_sup
• database containment relation D ⊑ D':
if ∃ an injective function f : D → D', s.t.
∀s ∈ D, s ⊑ f(s)
8. Preliminary Concepts
Item extension
• Given: s = ⟨t1, ..., tm⟩ and item α
• s ◇ α: concatenation (I-Step or S-Step)
• s ◇_i α = ⟨t1, ..., tm ∪ {α}⟩ if ∀k ∈ tm, k < α
Example: ⟨(ae)⟩ is an I-Step extension of ⟨(a)⟩
• s ◇_s α = ⟨t1, ..., tm, {α}⟩
Example: ⟨(a)(c)⟩ is an S-Step extension of ⟨(a)⟩
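The two extension steps can be sketched as below, again assuming itemsets are strings of single-character items kept in sorted order. The function names `i_step` and `s_step` are hypothetical.

```python
def i_step(seq, item):
    """I-Step: add `item` to the last itemset of `seq`.

    Only allowed when every item of the last itemset is smaller than `item`,
    which keeps the itemset in sorted order.
    """
    if not seq or any(k >= item for k in seq[-1]):
        raise ValueError("I-Step needs a last itemset whose items are all smaller")
    return seq[:-1] + [seq[-1] + item]

def s_step(seq, item):
    """S-Step: append a new itemset containing only `item`."""
    return seq + [item]

print(i_step(["a"], "e"))  # ['ae'], i.e. <(ae)>
print(s_step(["a"], "c"))  # ['a', 'c'], i.e. <(a)(c)>
```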
9. Preliminary Concepts
Sequence extension
• Given: s = ⟨t1, ..., tm⟩ and p = ⟨t'1, ..., t'n⟩
• s ◇ p: concatenation (itemset-extension or
sequence-extension)
• s ◇_i p = ⟨t1, ..., tm ∪ t'1, ..., t'n⟩ if ∀k ∈ tm, j ∈ t'1, k < j
• s ◇_s p = ⟨t1, ..., tm, t'1, ..., t'n⟩
• if s' = p ◇ s, then p is a prefix and s is a suffix of s'
Example: ⟨(e)(a)⟩ is a prefix of ⟨(e)(abf)(bde)⟩ and
⟨(bf)(bde)⟩ is its suffix
10. Preliminary Concepts
s-projected database (physical projection and pseudo projection)
• Ds = {p | s' ∈ D, s' = r ◇ p s.t. r is the minimum prefix of s'
containing s (s ⊑ r and ∄r', s ⊑ r' ⊏ r)}
p can be empty
Example
• D_⟨(af)⟩ = {⟨(d)(e)(a)⟩, ⟨(bde)⟩}
• D_⟨(e)(a)⟩ = {$, ⟨(b)⟩, ⟨(_bf)(bde)⟩}
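A physical projection can be sketched as follows. It is a simplified illustration, not the deck's implementation: the three-sequence database is an assumption chosen so that the output reproduces the two examples above, itemsets are strings of single-character items, and each suffix is returned as a pair (leftover items of the last matched itemset, remaining itemsets). The leftover part corresponds to the "_" notation.

```python
def project(prefix, seq):
    """Suffix of `seq` after the minimum prefix containing `prefix`.

    Returns (leftover, rest) or None if `seq` does not contain `prefix`.
    `prefix` must be non-empty. Greedy matching finds the earliest match.
    """
    pos = 0
    for itemset in prefix:
        while pos < len(seq) and not set(itemset) <= set(seq[pos]):
            pos += 1
        if pos == len(seq):
            return None
        pos += 1
    last_matched = seq[pos - 1]
    largest = max(prefix[-1])
    # Items of the last matched itemset that sort after the prefix's last item.
    leftover = "".join(sorted(i for i in last_matched if i > largest))
    return leftover, seq[pos:]

def projected_db(prefix, database):
    """All non-None projections; an empty suffix is ('', [])."""
    return [p for p in (project(prefix, s) for s in database) if p is not None]

# Assumed sample database, consistent with the two examples on the slide.
D = [["af", "d", "e", "a"], ["e", "a", "b"], ["e", "abf", "bde"]]

print(projected_db(["af"], D))      # suffixes (d)(e)(a) and (bde)
print(projected_db(["e", "a"], D))  # empty suffix, (b), and (_bf)(bde)
```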
11. Lexicographic Sequence Tree
Set Lexicographic Order
• Let t = {i1, i2, ..., ik}, t' = {j1, j2, ..., jl}, where
i1 ≤ ... ≤ ik and j1 ≤ ... ≤ jl
• t < t' iff either of the following is true:
1. ∃h, 0 ≤ h ≤ min{k, l}, such that ir = jr for r < h, and ih < jh
2. k < l, and i1 = j1, i2 = j2, ..., ik = jk
Example: (a, f) < (b, f), (a, b) < (a, b, c) and
(a, b, c) < (b, c)
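The two conditions translate directly into code. The sketch below assumes items are comparable single characters; the name `set_less` is hypothetical.

```python
def set_less(t, u):
    """Set lexicographic order t < u on itemsets.

    Condition 1: at the first position where the sorted items differ,
    the item of t is smaller. Condition 2: t is a proper prefix of u.
    """
    a, b = sorted(t), sorted(u)
    for x, y in zip(a, b):
        if x != y:
            return x < y
    return len(a) < len(b)

print(set_less("af", "bf"))   # True
print(set_less("ab", "abc"))  # True
print(set_less("abc", "bc"))  # True
```

This is the same order Python uses when comparing sorted tuples, so `tuple(sorted(t)) < tuple(sorted(u))` would give the same result.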
12. Lexicographic Sequence Tree
Sequence Lexicographic Order
i. if s' = s ◇ p, then s < s'
ii. if s = α ◇_i p and s' = α ◇_s p', then s < s', whatever the order
relation between p and p'
iii. if s = α ◇_i p and s' = α ◇_i p', then p < p' implies s < s'
iv. if s = α ◇_s p and s' = α ◇_s p', then p < p' implies s < s'
Example: ⟨(a, b)⟩ < ⟨(a, b)(a)⟩; ⟨(a, b)⟩ < ⟨(a)(a)⟩
13. Lexicographic Sequence Tree
Lexicographic Sequence Tree construction
1. each node in the tree corresponds to a sequence, and the
root is a null sequence;
2. if a parent node corresponds to a sequence s, its child is
either an itemset-extension of s, or a sequence-extension
of s;
3. the left sibling is less than the right sibling in sequence
lexicographic order.
16. Search Space Pruning and Prefix Sequence Lattice
LEMMA 1 (Common Prefix)
LEMMA 1. Given a subsequence s and its projected database
Ds, if ∃α such that α is a common prefix for all the sequences with the
same extension type (either itemset-extension or sequence-extension) in
Ds, then ∀β, if s ◇ β is closed, α must be a prefix of β. That
means that for any β that does not have α as a prefix, we need not
search s ◇ β and its descendants; only the branch of s ◇ α remains.
Example: Ds = {⟨(d)(e)(af)⟩, ⟨(d)(e)(fg)⟩}. All the
sequences in Ds share a common prefix α = ⟨(d)(e)⟩, so any
sequence with prefix s but not s ◇ ⟨(d)(e)⟩ cannot be closed.
So we can jump to the branch s ◇ α.
17. Search Space Pruning and Prefix Sequence Lattice
LEMMA 2 (Partial Order)
LEMMA 2. Given a sequence s and its projected database Ds,
if among all the sequences in Ds an item α always
occurs before an item β (either in the same itemset for all
sequences in Ds or in a different itemset, but not both), then
D_{s◇α◇β} = D_{s◇β}. Therefore, ∀γ, s ◇ β ◇ γ is not closed. We need
not search any sequence in the branch of s ◇ β.
18. Search Space Pruning and Prefix Sequence Lattice
Theorem 1 (Equivalence of Projected Databases)
• I(D) = Σ_{i=1}^{n} l(si): total number of items in D
Theorem 1: Given two sequences s and s' with s ⊑ s',
Ds = Ds' ⇔ I(Ds) = I(Ds')
Example: Consider the sample database D on slide 15.
• D_⟨(af)⟩ = D_⟨(f)⟩ = {⟨(d)(e)⟩, ⟨(de)⟩}, and
• I(D_⟨(af)⟩) = I(D_⟨(f)⟩) = 4.
Based on Theorem 1, the following search space pruning can
be achieved.
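The size check in Theorem 1 is cheap to compute. A minimal sketch, assuming each projected database is a list of sequences and each itemset is a string of single-character items (the projected database below is copied from the example above; the function name is hypothetical):

```python
def total_items(database):
    """I(D): total number of items over all sequences in the database."""
    return sum(len(itemset) for seq in database for itemset in seq)

# Projected database from the example: {<(d)(e)>, <(de)>}.
D_af = [["d", "e"], ["de"]]
D_f = [["d", "e"], ["de"]]

# When one pattern contains the other, comparing the two integers is
# enough (by Theorem 1); the sequences need not be compared one by one.
print(total_items(D_af) == total_items(D_f))  # True, both sizes are 4
```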
19. Search Space Pruning and Prefix Sequence Lattice
Proof of Theorem 1
• Ds = Ds' → I(Ds) = I(Ds') (obvious);
• Since s ⊑ s', we have Ds' ⊑ Ds and I(Ds') ≤ I(Ds);
• The equality between I(Ds') and I(Ds) holds only if
∀γ ∈ Ds', γ ∈ Ds, and vice versa. Therefore, Ds = Ds'.
20. Search Space Pruning and Prefix Sequence Lattice
LEMMA 3 (Early Termination by Equivalence)
LEMMA 3. Given two sequences s ⊑ s' with
I(Ds) = I(Ds'), then ∀γ, support(s ◇ γ) = support(s' ◇ γ).
Example: Consider the sample database D on slide 15.
• I(D_⟨(af)⟩) = I(D_⟨(f)⟩);
• both ⟨(af)(d)⟩ and ⟨(af)(e)⟩ are frequent;
We can conclude that the supports of ⟨(af)(d)⟩ and ⟨(f)(d)⟩, and of
⟨(af)(e)⟩ and ⟨(f)(e)⟩, are the same without computing the
supports of ⟨(f)(e)⟩ and ⟨(f)(d)⟩.
21. Search Space Pruning and Prefix Sequence Lattice
Projected database closed set (LS)
• LS = {s | support(s) ≥ min_sup and ∄s' s.t. s ⊏ s' and
I(Ds) = I(Ds')};
• CS ⊆ LS ⊆ FS: instead of mining CS directly, the CloSpan
algorithm first produces the complete set LS
• then non-closed sequence elimination is applied to LS to
generate CS, based on Lemma 3.
22. Search Space Pruning and Prefix Sequence Lattice
Corollary 1 (Backward Sub-Pattern)
Corollary 1. If a sequence s < s' and s ⊐ s', the condition
I(Ds) = I(Ds') is sufficient to stop searching any descendant
of s' in the prefix search tree.
s' is a backward sub-pattern of s if s < s' and s ⊐ s' (s' is discovered
after s)
Example: I(D_⟨(f)⟩) = I(D_⟨(af)⟩) → D_⟨(f)⟩ = D_⟨(af)⟩
23. Search Space Pruning and Prefix Sequence Lattice
Corollary 2 (Backward Super-Pattern)
Corollary 2. If a sequence s < s' and s ⊏ s', and the condition
I(Ds) = I(Ds') holds, it is sufficient to transplant the
descendants of s to s' instead of searching any descendant of
s' in the prefix search tree.
Example: the same logic as in the previous example.
24. CloSpan: Design and Implementation
2 main steps
CloSpan divides mining process into 2 stages.
1. It generates the LS set, a superset of the closed frequent
sequences, and stores it in a prefix sequence lattice;
2. It does post-pruning to eliminate non-closed sequences.
25. CloSpan: Design and Implementation
Algorithm 1: ClosedMining(D, min_sup, L)
26. CloSpan: Design and Implementation
Algorithm 2: CloSpan(s, Ds , min_sup, L)
27. CloSpan: Design and Implementation
Algorithm: CloSpan
• A hash index on the size of the projected database is used to
speed up the check of Theorem 1 (lines 1-4 of CloSpan);
the index stores entries of the form ⟨I(Ds), s⟩;
• if I(Ds') = I(Ds) then:
• if s ⊑ s', then we do not add ⟨I(Ds), s⟩;
• if s' ⊑ s, then we replace ⟨I(Ds'), s'⟩ with ⟨I(Ds), s⟩.
28. CloSpan: Design and Implementation
Algorithm 3: checkProjectedDBSize(s, k, H)
Corresponds to lines 1-4 in Algorithm 2.
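The body of Algorithm 3 is not reproduced in this transcript. The following is only a rough sketch of the hash-index check described on the previous slide, under stated assumptions: the index is a plain dictionary keyed by I(Ds), itemsets are strings of single-character items, and the names `check_projected_db_size` and `is_subsequence` are hypothetical.

```python
def is_subsequence(alpha, beta):
    """True if alpha is contained in beta (greedy left-to-right matching)."""
    pos = 0
    for itemset in alpha:
        while pos < len(beta) and not set(itemset) <= set(beta[pos]):
            pos += 1
        if pos == len(beta):
            return False
        pos += 1
    return True

def check_projected_db_size(s, db_size, index):
    """Look up s in an index mapping I(D_s) to sequences with that size.

    Returns the matching stored sequence if s is a sub-pattern or
    super-pattern of one with the same projected database size,
    otherwise stores s and returns None.
    """
    bucket = index.setdefault(db_size, [])
    for pos, other in enumerate(bucket):
        if is_subsequence(s, other):
            return other        # s is a sub-pattern: do not add it
        if is_subsequence(other, s):
            bucket[pos] = s     # s is a super-pattern: replace the entry
            return other
    bucket.append(s)
    return None

index = {}
print(check_projected_db_size(["f"], 4, index))   # None: first entry of size 4
print(check_projected_db_size(["af"], 4, index))  # ['f']: replaced by <(af)>
print(index)                                      # {4: [['af']]}
```

Only sequences in the same bucket are compared, so the containment test runs on far fewer pairs than a scan of all discovered sequences.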
29. CloSpan: Design and Implementation
Algorithm 3: hash function algorithm
• The projected database size ranges from 0 to I(D), so if the values of
I(Ds) are dense in a small range, performance degrades;
• by Theorem 1, necessary conditions for
Ds = Ds' can be used as part of the hash key;
• L(Ds) = I(Ds) + Σ_{j=1}^{m} Σ_{k=ij+1}^{n} l(sk);
• if s ⊑ s', L(Ds) = L(Ds') ⇔ I(Ds) = I(Ds').
30. Non-Closed Sequence Elimination
Checking for a super-sequence
• use support(s) as the hash key
• find all the sequences with the same support as s
• check whether any of them is a super-sequence containing s.
• if s ⊑ s' and
support(s) = support(s') → T(Ds) = T(Ds')
(T: sum of the ids of the corresponding sequences)
• that is why T(Ds) can be used as the hash key
instead of support (its values are sparser)
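The post-pruning step can be sketched as follows. This is a simplified illustration rather than the deck's implementation: it buckets patterns by support (not by the id sum T), assumes itemsets are strings of single-character items, and uses illustrative pattern/support pairs. All names are hypothetical.

```python
from collections import defaultdict

def is_subsequence(alpha, beta):
    """True if alpha is contained in beta (greedy left-to-right matching)."""
    pos = 0
    for itemset in alpha:
        while pos < len(beta) and not set(itemset) <= set(beta[pos]):
            pos += 1
        if pos == len(beta):
            return False
        pos += 1
    return True

def eliminate_non_closed(patterns):
    """Keep only patterns with no proper super-sequence of equal support.

    `patterns` is a list of (sequence, support) pairs.
    """
    by_support = defaultdict(list)  # support value acts as the hash key
    for seq, sup in patterns:
        by_support[sup].append(seq)
    closed = []
    for seq, sup in patterns:
        absorbed = any(
            other != seq and is_subsequence(seq, other)
            for other in by_support[sup]
        )
        if not absorbed:
            closed.append((seq, sup))
    return closed

# Illustrative input: (pattern, support) pairs.
candidates = [(["f"], 2), (["af"], 2), (["e"], 3), (["e", "a"], 3), (["a"], 3)]
print(eliminate_non_closed(candidates))
# keeps <(af)> with support 2 and <(e)(a)> with support 3
```

Swapping the dictionary key from support to T(Ds) would spread the candidates over more buckets, which is the motivation given on the slide.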
31. Conclusion
CloSpan
• Solves the closed sequential pattern mining problem;
• CloSpan outperforms PrefixSpan by more than one order
of magnitude;
• capable of mining longer frequent sequences in a large
data set with low min_sup;
• it does not modify the frequent pattern mining algorithm:
it only defines the early termination condition of a search
branch;
• this method can be extended to other existing sequential
pattern mining algorithms (SPADE, SPAM).
32. Possible improvements
CloSpan
• The performance of CloSpan comes from its smart
pruning method; the pruning could be made smarter;
• Avoid keeping track of any historical
frequent closed sequence (or candidate) for a new
pattern's closure checking.
33. Possible improvements
BIDE algorithm
1. BIDE consumes much less memory and can be an order of
magnitude faster than CloSpan when the support is low;
2. BIDE has linear scalability with respect to database size in terms of
runtime efficiency and space usage;
3. the BackScan pruning method is very effective in
enhancing the performance of BIDE.
34. CloSpan in Trajectory Mining
Sequential Pattern Mining from Trajectory Data
• needs more study: IPCA and DBScan on trajectory data.
• CloSpan could be used as an unsupervised algorithm for
detecting the most crowded paths in a city.
• ...