Sequence mining algorithm



           Monica Dăgădiţă
                        ISI
 Introduction to sequence mining
 Why sequence mining?
 Sequence mining algorithms
 SPADE
    Motivation
    Definitions and examples
    Algorithm
    Implementation




                     Data Mining   11/8/2011   2
 Aim - finding statistically relevant patterns
 between data examples where the values are
 delivered in a sequence

 Originally introduced for market basket
 analysis - customer behaviour predictions

 Two types of sequence mining:
     string mining – biology (gene/protein sequences)
     itemset mining - marketing and CRM applications

 Discovering patterns:
    Bookstore: 70% of the people who buy Jane
     Austen’s “Pride and Prejudice” also buy “Emma”
     within a month
    Website: finding sequences of most frequently
     accessed pages

 Usage:
    Promotions
    Shelf placement
    Restructure the website
    Recommender systems

 Apriori
 GSP  (Generalized Sequential Pattern)
 FreeSpan (Frequent pattern-projected
  Sequential pattern mining)
 PrefixSpan (Prefix-projected Sequential
  pattern mining)
 SPADE (Sequential PAttern Discovery using
  Equivalence classes)




 Problems of existing solutions
    Repeated database scans
    Complex internal data structures


 Key features of SPADE:
    Fixed number of database scans
    Vertical id-list database format
    Decomposition of search space into smaller
     pieces – processed independently




 Items: set of m distinct items
   I = {i1, i2, …, im}
 Event (itemset): non-empty unordered collection of items
   (i1 i2 … ik)
 Sequence: ordered list of events
   <e1 -> e2 -> … -> en>
 k-sequence: sequence with k items in total
   (B->AC) is a 3-sequence
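
A minimal sketch of how these definitions could be represented in Python. The helper names `parse_sequence` and `item_count` are hypothetical and only serve the illustration: a sequence is a tuple of events, and each event is a frozenset of single-letter items.

```python
def parse_sequence(text):
    """Turn 'B->AC' into a tuple of events, each a frozenset of items."""
    return tuple(frozenset(event) for event in text.split("->"))


def item_count(sequence):
    """k of a k-sequence: total number of items over all events."""
    return sum(len(event) for event in sequence)


seq = parse_sequence("B->AC")
print(len(seq))         # number of events: 2
print(item_count(seq))  # number of items: 3, so B->AC is a 3-sequence
```

Using frozensets keeps events unordered while the enclosing tuple preserves the order of the events.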



 Subsequence: given two sequences α = <a1 -> a2 -> … -> an>
 and β = <b1 -> b2 -> … -> bm>, α is called a subsequence of
 β, denoted α ⊆ β, if there exist integers 1 ≤ j1 < j2
 < … < jn ≤ m such that a1 ⊆ bj1, a2 ⊆ bj2, …, an ⊆ bjn

  Examples:
  1. (B->AC) is a subsequence of (AB->E->ACD):
     B ⊆ AB and AC ⊆ ACD, in that order
  2. (AB->E) is not a subsequence of (ABE):
     E would have to occur in a later event than AB
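
The subsequence test can be sketched as a greedy left-to-right scan. This is an illustrative sketch, not code from the presentation; `is_subsequence` and `seq` are hypothetical helper names, and items are assumed to be single letters.

```python
def is_subsequence(alpha, beta):
    """True if each event of alpha is a subset of a distinct event of beta,
    with the matched events of beta appearing in the same order."""
    j = 0
    for a in alpha:
        # Advance to the first remaining event of beta that contains a.
        while j < len(beta) and not a <= beta[j]:
            j += 1
        if j == len(beta):
            return False
        j += 1  # the next event of alpha must match a strictly later event
    return True


def seq(text):
    """'AB->E' becomes [{'A', 'B'}, {'E'}]."""
    return [set(event) for event in text.split("->")]


print(is_subsequence(seq("B->AC"), seq("AB->E->ACD")))  # True
print(is_subsequence(seq("AB->E"), seq("ABE")))         # False
```

Matching each event as early as possible is safe here, because an earlier match never removes options for the events that follow.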




Id-lists of the most frequent items (1-sequences)
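
The input database of the slide is not reproduced in this text, so the sketch below uses an assumed example database, chosen so that it yields the same id-lists for D->B, D->BF and D->BF->A that appear later in the deck. The names `DATABASE`, `build_idlists` and `support` are hypothetical. It shows how a horizontal database of (sid, eid, items) rows might be turned into vertical id-lists, one per item.

```python
from collections import defaultdict

# Assumed example rows: (sequence id, event id, items of the event).
DATABASE = [
    (1, 10, "CD"), (1, 15, "ABC"), (1, 20, "ABF"), (1, 25, "ACDF"),
    (2, 15, "ABF"), (2, 20, "E"),
    (3, 10, "ABF"),
    (4, 10, "DGH"), (4, 20, "BF"), (4, 25, "AGH"),
]


def build_idlists(rows):
    """Map each item to the sorted list of (sid, eid) pairs where it occurs."""
    idlists = defaultdict(list)
    for sid, eid, items in sorted(rows):
        for item in items:
            idlists[item].append((sid, eid))
    return dict(idlists)


def support(idlist):
    """Number of distinct sequences (sids) in which the pattern occurs."""
    return len({sid for sid, _ in idlist})


idlists = build_idlists(DATABASE)
frequent = {item: ids for item, ids in idlists.items() if support(ids) >= 2}
print(sorted(frequent))  # ['A', 'B', 'D', 'F']
print(idlists["D"])      # [(1, 10), (1, 25), (4, 10)]
```

With a minimum support of 2, only A, B, D and F remain, because every other item occurs in a single sequence.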




 D->BF->A
    Step 1: D->B




    Step 2: D->BF




 D->BF->A
    Step 3 : D->BF->A




 Not space-efficient
    Solution: keep only 2 columns, (sid, eid), in each sequence's id-list
    eid – event id of the sequence's last item


 D->BF->A (space-efficient id-list joins)

       D->B               D->BF             D->BF->A
   SID    EID         SID    EID         SID    EID
   1      15          1      20          1      25
   1      20          4      20          4      25
   4      20
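
The joins behind these id-lists can be sketched as follows. The id-lists for the single items D, B, F and A are assumed example data (they are not given in this text) chosen so that the results match the tables above, and the function names are hypothetical. A temporal join keeps the pairs of the right list that come after some pair of the left list in the same sequence; an equality join keeps the pairs present in both lists. Joining D->BF directly with the id-list of A mirrors the step-by-step illustration of the slides.

```python
def temporal_join(left, right):
    """Pairs of `right` that follow some pair of `left` in the same sid."""
    return sorted({(sid, eid) for sid, eid in right
                   for sid2, eid2 in left if sid2 == sid and eid2 < eid})


def equality_join(left, right):
    """Pairs present in both id-lists (same sid and same eid)."""
    return sorted(set(left) & set(right))


# Assumed id-lists of the single items.
D = [(1, 10), (1, 25), (4, 10)]
B = [(1, 15), (1, 20), (2, 15), (3, 10), (4, 20)]
F = [(1, 20), (1, 25), (2, 15), (3, 10), (4, 20)]
A = [(1, 15), (1, 20), (1, 25), (2, 15), (3, 10), (4, 25)]

d_b = temporal_join(D, B)        # D->B
d_f = temporal_join(D, F)        # D->F
d_bf = equality_join(d_b, d_f)   # D->BF: B and F in the same event after D
d_bf_a = temporal_join(d_bf, A)  # D->BF->A

print(d_b)     # [(1, 15), (1, 20), (4, 20)]
print(d_bf)    # [(1, 20), (4, 20)]
print(d_bf_a)  # [(1, 25), (4, 25)]
```

Only the (sid, eid) of the last event is kept after each join, which is what makes the two-column format sufficient.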


 Complete lattice representation




 Decomposing the lattice => smaller pieces
 that can be solved independently

 Equivalence classes
 Two sequences are in the same class (under Θk) if they
  share a common prefix of length k
 Example
   k=1 : Θ1 -> {[A],[B],[D],[F]}
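
Grouping by a common prefix can be sketched as below. This is an illustration under assumptions: sequences are written as tuples of events with single-letter items, the helper names are hypothetical, and the sample sequences are the ones used as examples elsewhere in this deck. The prefix keeps track of which event each item belongs to, so that AB and A->B do not share a prefix of length 2.

```python
from collections import defaultdict


def items_in_order(sequence):
    """Flatten a sequence into (item, event index) pairs, in order."""
    return [(item, index)
            for index, event in enumerate(sequence)
            for item in sorted(event)]


def prefix_classes(sequences, k):
    """Group sequences by their first k items (with event positions)."""
    classes = defaultdict(list)
    for sequence in sequences:
        key = tuple(items_in_order(sequence)[:k])
        classes[key].append(sequence)
    return dict(classes)


sequences = [("D", "B"), ("D", "BF"), ("D", "BF", "A"),
             ("B", "AC"), ("AB", "E")]

by_first_item = prefix_classes(sequences, 1)
for key, members in by_first_item.items():
    print(key[0][0], ["->".join(m) for m in members])

by_two_items = prefix_classes(sequences, 2)
print(len(by_two_items[(("D", 0), ("B", 1))]))  # 3 sequences share D->B
```

Each class can then be mined on its own, since all extensions of a sequence stay inside the class of its prefix.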




 SPADE(min_sup, D)
  // min_sup – minimum support
  // D – initial dataset
  F1 <- {frequent items, i.e. 1-sequences}
  F2 <- {frequent 2-sequences}
  E <- {equivalence classes [X] of Θ1}
  for all [X] in E
    Enumerate_frequent_seq([X], min_sup)




   Enumerate_frequent_seq(S, min_sup)
      for all Ai in S
          Ti <- {}
          for all Aj in S, with j ≥ i
              R <- Ai v Aj   // join of the two id-lists
              if R satisfies min_sup
                   Ti <- Ti U {R}
          end
          if depth-first search
              Enumerate_frequent_seq(Ti, min_sup)   // DFS
      end
      if breadth-first search
          for all non-empty Ti
              Enumerate_frequent_seq(Ti, min_sup)   // BFS
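
The pseudocode can be turned into a small runnable sketch. It is a simplified depth-first variant under stated assumptions, not the reference implementation: it starts from the frequent single items instead of precomputing F2, it scans all Aj rather than only j ≥ i so that every result lands in the class of Ai, and the example database and all names (`DATABASE`, `spade`, `enumerate_frequent`, the 'S'/'E' markers) are hypothetical. A pattern is a tuple of events, each event a tuple of items.

```python
from collections import defaultdict

# Assumed example rows: (sequence id, event id, items of the event).
DATABASE = [
    (1, 10, "CD"), (1, 15, "ABC"), (1, 20, "ABF"), (1, 25, "ACDF"),
    (2, 15, "ABF"), (2, 20, "E"),
    (3, 10, "ABF"),
    (4, 10, "DGH"), (4, 20, "BF"), (4, 25, "AGH"),
]


def support(idlist):
    """Number of distinct sids in an id-list of (sid, eid) pairs."""
    return len({sid for sid, _ in idlist})


def temporal_join(left, right):
    """Pairs of `right` that follow some pair of `left` in the same sid."""
    first = {}
    for sid, eid in left:
        first[sid] = min(eid, first.get(sid, eid))
    return {(s, e) for s, e in right if s in first and first[s] < e}


def enumerate_frequent(atoms, min_sup, found):
    """atoms: (pattern, kind, idlist) triples of one prefix class.
    kind 'S': the last item opened a new event; 'E': it joined the last event."""
    for pat_i, kind_i, ids_i in atoms:
        x = pat_i[-1][-1]
        new_class = []
        for pat_j, kind_j, ids_j in atoms:
            y = pat_j[-1][-1]
            candidates = []
            if kind_i == kind_j and y > x:
                # y joins the last event of Ai: equality join
                grown = pat_i[:-1] + (pat_i[-1] + (y,),)
                candidates.append((grown, "E", ids_i & ids_j))
            if kind_j == "S":
                # y follows Ai in a later event: temporal join
                grown = pat_i + ((y,),)
                candidates.append((grown, "S", temporal_join(ids_i, ids_j)))
            for candidate in candidates:
                if support(candidate[2]) >= min_sup:
                    new_class.append(candidate)
                    found[candidate[0]] = support(candidate[2])
        enumerate_frequent(new_class, min_sup, found)  # depth-first


def spade(rows, min_sup):
    """Return {pattern: support} for all frequent sequences."""
    idlists = defaultdict(set)
    for sid, eid, items in rows:
        for item in items:
            idlists[item].add((sid, eid))
    found, atoms = {}, []
    for item in sorted(idlists):
        if support(idlists[item]) >= min_sup:
            pattern = ((item,),)
            atoms.append((pattern, "S", idlists[item]))
            found[pattern] = support(idlists[item])
    enumerate_frequent(atoms, min_sup, found)
    return found


result = spade(DATABASE, 2)
print(result[(("D",), ("B", "F"), ("A",))])  # support of D->BF->A: 2
print(len(result))                           # 18 frequent sequences
```

After the initial pass that builds the id-lists, the search works on id-lists only, so the database itself is not scanned again.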


 The R Project for Statistical Computing
    R is similar to the S language, which was
     developed at Bell Laboratories (formerly
     AT&T, now Lucent Technologies) by John
     Chambers and colleagues

    R can be considered a different implementation of S

    arulesSequences package (provides the cspade function)




