Graph Mining
• Graphs
  • Model sophisticated structures and their interactions
  • Chemical informatics
  • Bioinformatics
  • Computer vision
  • Video indexing
  • Text retrieval
  • Web analysis
  • Social networks
• Mining frequent subgraph patterns
  • Characterization, discrimination, classification and cluster analysis, building graph indices, and similarity search
Mining Frequent Subgraphs
• Graph g
  • Vertex set: V(g)
  • Edge set: E(g)
  • A label function maps each vertex or edge to a label
• Graph g is a subgraph of another graph g' if there exists a subgraph isomorphism from g to g'
• Support(g), or frequency(g): the number of graphs in D = {G1, G2, ..., Gn} that contain g as a subgraph
• Frequent graph: a graph whose support satisfies min_sup
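The support definition above can be sketched in a few lines of Python. This is a brute-force illustration under stated assumptions, not an efficient miner: the helper names `make_graph`, `is_subgraph`, and `support`, the dictionary-based graph representation, and the toy database `D` are all invented for this example. The test checks that every pattern edge is present with the right label (extra edges in the data graph are allowed).

```python
from itertools import permutations

def make_graph(vertex_labels, edges):
    """vertex_labels: {vertex: label}; edges: [(u, v, edge_label), ...]."""
    return vertex_labels, {frozenset((u, v)): lab for u, v, lab in edges}

def is_subgraph(pattern, graph):
    """True if pattern can be embedded in graph with matching labels."""
    p_nodes, p_edges = pattern
    g_nodes, g_edges = graph
    p_ids = list(p_nodes)
    # Try every injective assignment of pattern vertices to graph vertices.
    for image in permutations(g_nodes, len(p_ids)):
        m = dict(zip(p_ids, image))
        if any(p_nodes[v] != g_nodes[m[v]] for v in p_ids):
            continue
        if all(g_edges.get(frozenset(m[v] for v in e)) == lab
               for e, lab in p_edges.items()):
            return True
    return False

def support(pattern, database):
    """Number of graphs in the database that contain the pattern."""
    return sum(is_subgraph(pattern, g) for g in database)

# Hypothetical toy database of three labeled graphs.
D = [
    make_graph({0: "C", 1: "C", 2: "O"},
               [(0, 1, "single"), (1, 2, "double"), (0, 2, "single")]),
    make_graph({0: "C", 1: "C", 2: "N"},
               [(0, 1, "single"), (1, 2, "single")]),
    make_graph({0: "C", 1: "O"}, [(0, 1, "double")]),
]
c_double_o = make_graph({"a": "C", "b": "O"}, [("a", "b", "double")])
print(support(c_double_o, D))  # 2, so the pattern is frequent for min_sup = 2
```

The factorial cost of trying all vertex assignments is exactly why the frequency-checking step is considered expensive.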
Discovery of Frequent Substructures
• Step 1: Generate frequent substructure candidates
• Step 2: Check the frequency of each candidate
  • Involves a subgraph isomorphism test, which is computationally expensive
• Approaches
  • Apriori-based approach
  • Pattern-growth approach
Apriori-based Approach
Start with graphs of small size, then generate candidates with an extra vertex, edge, or path.
AprioriGraph
• Level-wise mining method
• The size of new substructures is increased by 1 at each level
• Candidates are generated by joining two similar but slightly different frequent subgraphs
• The frequency of each candidate is then checked
Candidate generation in graphs is complex.
Apriori Approach
• AGM (Apriori-based Graph Mining)
  • Vertex-based candidate generation: increases the substructure size by one vertex at each step
  • Two frequent size-k graphs are joined only if they share the same size-(k-1) subgraph (size = number of vertices)
  • The new candidate contains the shared size-(k-1) component and the two additional vertices, one from each graph
  • Two different substructures can be formed, because it is undetermined whether an edge connects the two additional vertices
Apriori Approach
• FSG (Frequent Subgraph mining)
  • Edge-based candidate generation: increases the substructure size by one edge at a time
  • Two size-k patterns are merged if and only if they share the same subgraph with k-1 edges (the core)
  • The new candidate contains the core and the two additional edges, one from each pattern
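The edge-based, level-wise join can be sketched as follows. This is a deliberately simplified illustration, not FSG itself: it assumes every vertex label is unique, so a pattern is just a set of edges and the subgraph test reduces to a subset test. It also skips the connectivity requirement and the check that all size-k subpatterns of a candidate are frequent. The function name `levelwise_mine` and the toy database are hypothetical.

```python
from itertools import combinations

def levelwise_mine(database, min_sup):
    """database: list of edge sets. Returns {frozenset(edges): support}."""
    def support(pattern):
        return sum(pattern <= g for g in database)

    # Level 1: frequent single edges.
    level = {frozenset([e]) for g in database for e in g}
    level = {p for p in level if support(p) >= min_sup}
    frequent = {}
    while level:
        frequent.update((p, support(p)) for p in level)
        # Join step: two size-k patterns that share a core of k-1 edges
        # produce one candidate with k+1 edges.
        candidates = {a | b for a, b in combinations(level, 2)
                      if len(a & b) == len(a) - 1}
        # Frequency check for the next level.
        level = {c for c in candidates if support(c) >= min_sup}
    return frequent

ab, bc, cd = ("a", "b"), ("b", "c"), ("c", "d")
database = [{ab, bc, cd}, {ab, bc}, {ab, bc, cd}]
for pattern, sup in levelwise_mine(database, min_sup=3).items():
    print(sorted(pattern), sup)
```

With `min_sup=3` only `{ab}`, `{bc}`, and `{ab, bc}` survive; lowering the threshold to 2 lets the three-edge pattern appear at level 3.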
Apriori Approach
• Edge-disjoint path method
  • Graphs are classified by the number of disjoint paths they have
  • Two paths are edge-disjoint if they do not share any common edge
  • A substructure pattern with k+1 disjoint paths is generated by joining substructures with k disjoint paths
• Disadvantages of Apriori approaches
  • Overhead when joining two substructures
  • Uses a BFS strategy: level-wise candidate generation
  • To check whether a size-(k+1) graph is frequent, all of its size-k subgraphs must be checked
  • May consume more memory
Pattern-Growth Approach
• Can use BFS as well as DFS
• A graph g can be extended by adding a new edge e. The newly formed graph is denoted by g ♦_x e.
• Edge e may or may not introduce a new vertex to g.
• If e introduces a new vertex, the new graph is denoted by g ♦_xf e; otherwise, g ♦_xb e, where f or b indicates that the extension is in a forward or backward direction.
• Simple pattern-growth approach
  • For each discovered graph g, performs extensions recursively until all frequent graphs that contain g are found
  • Simple but inefficient
  • The same graph is discovered multiple times (duplicate graphs)
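A small sketch can make the duplicate problem concrete. It assumes a single data graph with uniquely named vertices, so a pattern is a set of edges, and it omits the frequency check entirely; `extension_kind` and `naive_growth` are hypothetical helper names. The counter records how often each connected subgraph is reached by recursive one-edge extension.

```python
from collections import Counter

def extension_kind(pattern, edge):
    """'forward' if the edge introduces a new vertex, else 'backward'."""
    vertices = {v for e in pattern for v in e}
    return "forward" if set(edge) - vertices else "backward"

def naive_growth(graph_edges):
    """Count how many times each connected edge set is discovered."""
    visits = Counter()

    def grow(pattern):
        visits[pattern] += 1
        vertices = {v for e in pattern for v in e}
        for e in graph_edges:
            # Extend by any unused edge that touches the current pattern.
            if e not in pattern and vertices & set(e):
                grow(pattern | {e})

    for e in graph_edges:
        grow(frozenset([e]))
    return visits

ab, bc, ac = ("a", "b"), ("b", "c"), ("a", "c")
visits = naive_growth([ab, bc, ac])
print(visits[frozenset([ab, bc])])      # 2: reached via ab then bc, and bc then ab
print(visits[frozenset([ab, bc, ac])])  # 6: the triangle is rediscovered six times
```

Even on a triangle, the full graph is generated six times, which is the redundancy that restricted extension schemes aim to remove.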
gSpan Algorithm
• Reduces the generation of duplicate graphs
• Does not extend duplicate graphs
• Uses depth-first order
  • A graph may have several DFS trees
  • The visiting order of the vertices forms a linear order, recorded as subscripts
  • In a DFS tree, the starting vertex is the root and the last visited vertex is the right-most vertex
  • The path from v0 to vn is the right-most path
Right-most path: (b), (c): (v0, v1, v3); (d): (v0, v1, v2, v3)
gSpan Algorithm
• gSpan restricts the extension method
• A new edge e can be added
  • between the right-most vertex and another vertex on the right-most path (backward extension);
  • or it can introduce a new vertex and connect it to a vertex on the right-most path (forward extension)
• Right-most extension is denoted by G ♦_r e
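The right-most extension rule can be sketched on a subscripted graph given as a list of (i, j) edges, where i < j marks a forward edge and i > j a backward edge. Labels are ignored, and the function names `rightmost_path` and `rightmost_extensions` are assumptions made for this illustration. The example input is the four-edge sequence (0,1) (1,2) (2,0) (1,3) that the slides use for DFS codes.

```python
def rightmost_path(code):
    """Vertices from the root v0 down to the right-most vertex."""
    parent = {j: i for i, j in code if i < j}  # forward edges define the DFS tree
    v = max(parent, default=0)                 # last discovered vertex
    path = [v]
    while v in parent:
        v = parent[v]
        path.append(v)
    return path[::-1]

def rightmost_extensions(code):
    """Return (backward, forward) candidate edges, ignoring labels."""
    path = rightmost_path(code)
    rm = path[-1]
    existing = {frozenset(e) for e in code}
    # Backward: right-most vertex to an earlier vertex on the right-most path.
    backward = [(rm, v) for v in path[:-1] if frozenset((rm, v)) not in existing]
    # Forward: a new vertex attached to any vertex on the right-most path.
    new_vertex = rm + 1
    forward = [(v, new_vertex) for v in reversed(path)]
    return backward, forward

code = [(0, 1), (1, 2), (2, 0), (1, 3)]
print(rightmost_path(code))        # [0, 1, 3]
print(rightmost_extensions(code))  # ([(3, 0)], [(3, 4), (1, 4), (0, 4)])
```

Vertex v2 is not on the right-most path, so no new edge may touch it; that restriction is what shrinks the number of extensions.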
gSpan Algorithm
• Chooses one DFS tree as the base subscripting and extends it
• Each subscripted graph is transformed into an edge sequence, called a DFS code
• Selects the subscripting that generates the minimum sequence
  • Edge order: maps the edges of a subscripted graph into a sequence
  • Sequence order: builds an order among edge sequences
Introducing backward edges: given a vertex v, all of its backward edges should appear before its forward edges (if any). If a vertex has two backward edges (i, j1) and (i, j2) with j1 < j2, then (i, j1) appears before (i, j2).
Order of forward edges: (0,1) (1,2) (1,3)
Complete sequence: (0,1) (1,2) (2,0) (1,3)
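The edge-order rule above can be sketched as a small function that interleaves backward edges into the forward-edge sequence. It works on vertex subscripts only (no labels), and the name `dfs_code_order` is an assumption made for this example.

```python
def dfs_code_order(forward_edges, backward_edges):
    """forward_edges: (i, j) pairs with i < j, in DFS discovery order.
    backward_edges: (i, j) pairs with i > j, in any order."""
    sequence = []
    for u, v in forward_edges:
        sequence.append((u, v))
        # Backward edges leaving the newly discovered vertex v come next,
        # before any later forward edge, ordered by their target subscript.
        sequence.extend(sorted(e for e in backward_edges if e[0] == v))
    return sequence

print(dfs_code_order([(0, 1), (1, 2), (1, 3)], [(2, 0)]))
# [(0, 1), (1, 2), (2, 0), (1, 3)]
```

The output reproduces the complete sequence listed on the slide.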
gSpan Algorithm
DFS lexicographic ordering: edge order, first vertex label, edge label, second vertex label
Here γ0 < γ1 < γ2
γ0: minimum DFS code
Corresponding subscripting: base subscripting
gSpan carries out right-most extension on the minimum DFS code
gSpan Algorithm
• Root: empty code
• Each node is a DFS code encoding a graph
• Each edge is a right-most extension from a DFS code of length k-1 to a DFS code of length k
• If codes s and s' encode the same graph, the search space under s' can be safely pruned
Mining Closed Frequent Substructures
• Helps to overcome the problem of pattern explosion
• A frequent graph G is closed if and only if there is no proper supergraph G' that has the same support as G
  • CloseGraph algorithm
• A frequent pattern G is maximal if and only if there is no frequent super-pattern of G
  • The maximal pattern set is a subset of the closed pattern set
  • However, it cannot be used to reconstruct the entire set of frequent patterns, because the support of non-maximal patterns is lost
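The two definitions can be compared side by side in a short sketch. It assumes patterns are edge sets over uniquely labeled vertices, so "supergraph" becomes "proper superset", and that `supports` lists every frequent pattern. A supergraph with the same support as a frequent pattern is itself frequent, so scanning only frequent patterns is enough. The function name and the support values are hypothetical.

```python
def closed_and_maximal(supports):
    """supports: {frozenset(edges): support} for all frequent patterns."""
    closed, maximal = set(), set()
    for g, sup in supports.items():
        supers = [h for h in supports if g < h]  # frequent proper supergraphs
        if not supers:
            maximal.add(g)
        if all(supports[h] < sup for h in supers):
            closed.add(g)
    return closed, maximal

ab, bc, cd = ("a", "b"), ("b", "c"), ("c", "d")
supports = {
    frozenset([ab]): 3, frozenset([bc]): 3, frozenset([cd]): 2,
    frozenset([ab, bc]): 3, frozenset([bc, cd]): 2,
    frozenset([ab, bc, cd]): 2,
}
closed, maximal = closed_and_maximal(supports)
print(len(closed), len(maximal))  # 2 1
print(maximal <= closed)          # True
```

Here the closed set keeps `{ab, bc}` with support 3 and the full pattern with support 2, from which every other support can be derived. The maximal set keeps only the full pattern, so the support 3 of `{ab, bc}` can no longer be recovered.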
Mining Alternative Substructure Patterns
• Mining unlabeled or partially labeled graphs
  • A new empty label φ is assigned to vertices and edges that do not have labels
• Mining non-simple graphs
  • A non-simple graph may have self-loops and multiple edges
  • Growing order: backward edges, self-loops, and forward edges
  • To handle multiple edges, allow two neighboring edges in a DFS code to share the same vertices
• Mining directed graphs
  • Each edge is a 6-tuple (i, j, d, l_i, l_(i,j), l_j), where d = +1 or -1 gives the edge direction
• Mining disconnected graphs
  • A graph or a pattern may be disconnected
  • Disconnected graph: add a virtual vertex
  • Disconnected graph pattern: a set of connected graphs
• Mining frequent subtrees
  • A tree is a degenerate graph
Constraint-based Mining of Substructure Patterns
• Element, set, or subgraph containment constraint
  • The user requires that the mined patterns contain a particular set of subgraphs: a succinct constraint
• Geometric constraint
  • For example, the angle between each pair of connected edges must be within a range: an anti-monotonic constraint
• Value-sum constraint
  • The sum of the (positive) weights on the edges must be within a range from low to high: sum > low is a monotonic constraint, and sum < high is an anti-monotonic constraint
• Multiple categories of constraints may also be enforced
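The value-sum constraint shows how the two properties are used during pattern growth. The sketch below assumes strictly positive edge weights, so the sum can only increase when a pattern is extended; the function name `value_sum_status` and its result keys are invented for this example.

```python
def value_sum_status(edge_weights, low, high):
    """Classify a pattern against the constraint low < sum(weights) < high."""
    total = sum(edge_weights)
    return {
        # The pattern itself satisfies the constraint.
        "satisfies": low < total < high,
        # Anti-monotonic part violated: every super-pattern also violates
        # sum < high, so this branch of the search can be pruned.
        "can_prune": total >= high,
        # Monotonic part satisfied: every super-pattern also satisfies
        # sum > low, so it need not be re-checked.
        "lower_bound_met": total > low,
    }

print(value_sum_status([2], low=4, high=10))
print(value_sum_status([2, 3], low=4, high=10))
print(value_sum_status([6, 5], low=4, high=10))
```

The first pattern is too light but may still grow into a valid one, the second satisfies the constraint, and the third can be discarded together with all of its extensions.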
Mining Approximate Frequent Substructures
• Approximate frequent substructures allow slight structural variations
  • Several slightly different frequent substructures can be represented by one approximate substructure
• SUBDUE: substructure discovery system
  • Based on the Minimum Description Length (MDL) principle
  • Adopts a constrained beam search
  • Performs approximate matching
Mining Coherent and Dense Substructures
• A frequent substructure G is a coherent subgraph if the mutual information between G and each of its own subgraphs is above some threshold
  • Reduces the number of patterns mined
  • Application: coherent substructure mining selects a small subset of features that have high distinguishing power between protein classes
• Relational graph: each label is used only once per graph
• Frequent highly connected (dense) subgraph mining
  • People with strong associations in online social networks (OSNs)
  • Sets of genes within the same functional module
  • Density cannot be judged from the average degree or the minimum degree alone
  • Connectedness must also be ensured
  • Example: average degree 3.25, minimum degree 3
Mining Dense Substructures
• Dense graphs are defined in terms of edge connectivity
  • Given a graph G, an edge cut is a set of edges Ec such that E(G) - Ec is disconnected
  • A minimum cut is the smallest set among all edge cuts
  • The edge connectivity of G is the size of a minimum cut
  • A graph is dense if its edge connectivity is no less than a specified minimum cut threshold
• Mining dense substructures
  • Pattern-growth approach called CloseCut (scalable)
    • Starts with a small frequent candidate graph and extends it until it finds the largest supergraph with the same support
  • Pattern-reduction approach called Splat (high performance)
    • Directly intersects relational graphs to obtain highly connected graphs
    • A pattern g discovered in a set is progressively intersected with subsequent components to give g'
    • Some edges in g may be removed
    • The size of candidate graphs is reduced by intersection and decomposition operations
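The edge-connectivity definition can be checked directly on small graphs. The sketch below is a brute-force illustration that tries edge cuts of increasing size. It is exponential and only meant to mirror the definition, not the method used by CloseCut or Splat. The function names and the two example graphs are assumptions.

```python
from itertools import combinations

def is_connected(vertices, edges):
    """Depth-first reachability check on an undirected graph."""
    vertices = set(vertices)
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    start = next(iter(vertices))
    seen, stack = {start}, [start]
    while stack:
        for w in adj[stack.pop()]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen == vertices

def edge_connectivity(vertices, edges):
    """Size of a minimum edge cut (graph assumed to have >= 2 vertices)."""
    edges = list(edges)
    for k in range(len(edges) + 1):
        for cut in combinations(range(len(edges)), k):
            kept = [e for i, e in enumerate(edges) if i not in cut]
            if not is_connected(vertices, kept):
                return k
    return None  # only reached for a single-vertex graph

k4 = list(combinations(range(4), 2))                 # complete graph on 4 vertices
two_triangles = [(0, 1), (1, 2), (0, 2),             # triangle 0-1-2
                 (3, 4), (4, 5), (3, 5), (2, 3)]     # triangle 3-4-5 plus a bridge
print(edge_connectivity(range(4), k4))               # 3
print(edge_connectivity(range(6), two_triangles))    # 1
```

The second graph has minimum degree 2 and average degree above 2, yet a single bridge disconnects it, which is why degree statistics alone are not a reliable density measure.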
Applications: Graph Indexing
• Indexing is essential for efficient search and query processing
• Traditional approaches are not feasible for graphs
• Indexing can be based on nodes, edges, or subgraphs
• Path-based indexing approach
  • Enumerate all the paths in a database up to length maxL and index them
  • The index is used to identify all graphs that contain the paths in the query
  • Not suitable for complex graph queries
    • Structural information is lost when a query graph is broken apart
    • Many false positives may be returned
• gIndex: considers frequent and discriminative substructures as index features
  • A frequent substructure is discriminative if its support cannot be well approximated by the intersection of the graph sets that contain one of its subgraphs
  • Achieves good performance at less cost
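A minimal sketch of path-based indexing, and of the false positives it produces, is given below. It assumes vertex-labeled, unlabeled-edge graphs stored as `(labels, adjacency)` pairs; the function names, the `max_len` parameter standing in for maxL, and the three toy graphs are all hypothetical.

```python
from collections import defaultdict

def label_paths(labels, adj, max_len):
    """Label sequences of all simple paths with at most max_len edges."""
    found = set()

    def walk(path):
        found.add(tuple(labels[v] for v in path))
        if len(path) - 1 < max_len:
            for w in adj[path[-1]]:
                if w not in path:
                    walk(path + [w])

    for v in labels:
        walk([v])
    return found

def build_index(database, max_len):
    """Inverted index: label path -> ids of graphs containing it."""
    index = defaultdict(set)
    for gid, (labels, adj) in database.items():
        for p in label_paths(labels, adj, max_len):
            index[p].add(gid)
    return index

def candidates(index, query, max_len, all_ids):
    """Graphs that contain every indexed path of the query."""
    labels, adj = query
    result = set(all_ids)
    for p in label_paths(labels, adj, max_len):
        result &= index.get(p, set())
    return result

triangle = ({0: "A", 1: "B", 2: "C"}, {0: [1, 2], 1: [0, 2], 2: [0, 1]})
path = ({0: "A", 1: "B", 2: "C"}, {0: [1], 1: [0, 2], 2: [1]})
hexagon = ({i: "ABC"[i % 3] for i in range(6)},
           {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)})
database = {"g1": triangle, "g2": path, "g3": hexagon}

index = build_index(database, max_len=2)
print(sorted(candidates(index, triangle, 2, database)))  # ['g1', 'g3']
```

The hexagon `g3` contains every label path of the triangle query but no triangle, so it survives the filter as a false positive and must be rejected by a later exact subgraph test.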
Graph Indexing
Only (c) is an exact match, but the others are also reported due to the presence of substructures.
Substructure Similarity Search
• Bioinformatics and cheminformatics applications involve query-based search in massive, complex structural data
Form a set of subgraph queries with one or more edge deletions, and then use exact substructure search.
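The edge-deletion strategy above can be sketched by enumerating the relaxed queries explicitly. The function name `relaxed_queries` and the triangle query are assumptions made for this example; each relaxed query would then be passed to an exact substructure search, which is not shown.

```python
from itertools import combinations

def relaxed_queries(query_edges, max_deletions):
    """All edge lists obtained by deleting 1..max_deletions edges."""
    edges = sorted(query_edges)
    relaxed = []
    for k in range(1, max_deletions + 1):
        for removed in combinations(edges, k):
            relaxed.append([e for e in edges if e not in removed])
    return relaxed

query = [("a", "b"), ("b", "c"), ("a", "c")]
for q in relaxed_queries(query, max_deletions=1):
    print(q)
print(len(relaxed_queries(query, max_deletions=2)))  # 6
```

The number of relaxed queries grows combinatorially with the query size and the number of allowed deletions, which limits how far this direct approach scales.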
Substructure Similarity Search
• Grafil (Graph Similarity Filtering)
  • Feature-based structural filtering
    • Models each query graph as a set of features
    • Edge deletions are treated as feature misses
    • Too many features reduce performance
  • Multi-filter composition strategy
    • Feature set: a group of similar features
Classification and Cluster Analysis using Graph Patterns
• Graph classification
  • Mine frequent graph patterns
  • Features that are frequent in one class but less frequent in another are discriminative features, which are used for model construction
  • Frequency and connectivity thresholds can be adjusted
  • SVM, NBM, etc. are used
• Cluster analysis
  • Cluster similar graphs based on graph connectivity (minimal cuts)
  • Hierarchical clusters based on the support threshold
  • Outliers can also be detected
• Inter-related processes
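The feature-selection step for graph classification can be sketched as follows. The sketch assumes that frequent patterns have already been mined per class and are identified by name; the relative frequencies, the `min_gap` threshold, and both function names are hypothetical. The resulting binary vectors are what a general classifier would be trained on.

```python
def discriminative(freq_a, freq_b, min_gap):
    """Patterns whose relative frequency differs by at least min_gap."""
    patterns = freq_a.keys() | freq_b.keys()
    return sorted(p for p in patterns
                  if abs(freq_a.get(p, 0.0) - freq_b.get(p, 0.0)) >= min_gap)

def to_vector(graph_patterns, features):
    """Binary feature vector: 1 if the graph contains the pattern."""
    return [1 if f in graph_patterns else 0 for f in features]

# Hypothetical per-class relative frequencies of mined patterns.
freq_pos = {"p1": 0.9, "p2": 0.8, "p3": 0.1}
freq_neg = {"p1": 0.2, "p2": 0.7, "p4": 0.75}

features = discriminative(freq_pos, freq_neg, min_gap=0.5)
print(features)                           # ['p1', 'p4']
print(to_vector({"p1", "p2"}, features))  # [1, 0]
```

Pattern `p2` is frequent in both classes and is dropped, while `p1` and `p4` separate the classes and are kept as features.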

Contenu connexe

Tendances

Tendances (20)

Performance Metrics for Machine Learning Algorithms
Performance Metrics for Machine Learning AlgorithmsPerformance Metrics for Machine Learning Algorithms
Performance Metrics for Machine Learning Algorithms
 
Supervised learning and Unsupervised learning
Supervised learning and Unsupervised learning Supervised learning and Unsupervised learning
Supervised learning and Unsupervised learning
 
Uncertainty in AI
Uncertainty in AIUncertainty in AI
Uncertainty in AI
 
Feature selection
Feature selectionFeature selection
Feature selection
 
Supervised and unsupervised learning
Supervised and unsupervised learningSupervised and unsupervised learning
Supervised and unsupervised learning
 
Machine learning clustering
Machine learning clusteringMachine learning clustering
Machine learning clustering
 
5.3 mining sequential patterns
5.3 mining sequential patterns5.3 mining sequential patterns
5.3 mining sequential patterns
 
Types of Machine Learning
Types of Machine LearningTypes of Machine Learning
Types of Machine Learning
 
Data preprocessing using Machine Learning
Data  preprocessing using Machine Learning Data  preprocessing using Machine Learning
Data preprocessing using Machine Learning
 
Lecture 4 Decision Trees (2): Entropy, Information Gain, Gain Ratio
Lecture 4 Decision Trees (2): Entropy, Information Gain, Gain RatioLecture 4 Decision Trees (2): Entropy, Information Gain, Gain Ratio
Lecture 4 Decision Trees (2): Entropy, Information Gain, Gain Ratio
 
Bias and variance trade off
Bias and variance trade offBias and variance trade off
Bias and variance trade off
 
Chapter 9 morphological image processing
Chapter 9   morphological image processingChapter 9   morphological image processing
Chapter 9 morphological image processing
 
Machine learning with ADA Boost
Machine learning with ADA BoostMachine learning with ADA Boost
Machine learning with ADA Boost
 
Clustering paradigms and Partitioning Algorithms
Clustering paradigms and Partitioning AlgorithmsClustering paradigms and Partitioning Algorithms
Clustering paradigms and Partitioning Algorithms
 
Edge Detection and Segmentation
Edge Detection and SegmentationEdge Detection and Segmentation
Edge Detection and Segmentation
 
Classification and Regression
Classification and RegressionClassification and Regression
Classification and Regression
 
Apriori algorithm
Apriori algorithmApriori algorithm
Apriori algorithm
 
Chapter10 image segmentation
Chapter10 image segmentationChapter10 image segmentation
Chapter10 image segmentation
 
Mining Frequent Patterns, Association and Correlations
Mining Frequent Patterns, Association and CorrelationsMining Frequent Patterns, Association and Correlations
Mining Frequent Patterns, Association and Correlations
 
Artificial Neural Networks for Data Mining
Artificial Neural Networks for Data MiningArtificial Neural Networks for Data Mining
Artificial Neural Networks for Data Mining
 

En vedette

www.pharmagroup.it
www.pharmagroup.itwww.pharmagroup.it
www.pharmagroup.it
streamky
 
Institucional Empresas Fm Estacion 21 San Juan.
Institucional Empresas Fm Estacion 21 San Juan. Institucional Empresas Fm Estacion 21 San Juan.
Institucional Empresas Fm Estacion 21 San Juan.
Ricardo Fernández
 
Ovario Poliquistico 2005
Ovario Poliquistico 2005Ovario Poliquistico 2005
Ovario Poliquistico 2005
rahterrazas
 
Sant mer, heroi del drac de banyoles
Sant mer, heroi del drac de banyolesSant mer, heroi del drac de banyoles
Sant mer, heroi del drac de banyoles
Berta
 
Norte Parque Residencial Email Chl
Norte Parque Residencial   Email ChlNorte Parque Residencial   Email Chl
Norte Parque Residencial Email Chl
imoveisdorio
 
Cv ernst mayer 2016
Cv ernst mayer 2016Cv ernst mayer 2016
Cv ernst mayer 2016
Ernst Mayer
 
What is this DevOps thing and why do I need it?
What is this DevOps thing and why do I need it?What is this DevOps thing and why do I need it?
What is this DevOps thing and why do I need it?
Safe Swiss Cloud
 

En vedette (20)

Data Mining Seminar - Graph Mining and Social Network Analysis
Data Mining Seminar - Graph Mining and Social Network AnalysisData Mining Seminar - Graph Mining and Social Network Analysis
Data Mining Seminar - Graph Mining and Social Network Analysis
 
Data Mining: Graph mining and social network analysis
Data Mining: Graph mining and social network analysisData Mining: Graph mining and social network analysis
Data Mining: Graph mining and social network analysis
 
120808
120808120808
120808
 
www.pharmagroup.it
www.pharmagroup.itwww.pharmagroup.it
www.pharmagroup.it
 
Flyer Master
Flyer MasterFlyer Master
Flyer Master
 
Reunião Programa de Ressignificação
Reunião Programa de RessignificaçãoReunião Programa de Ressignificação
Reunião Programa de Ressignificação
 
Institucional Empresas Fm Estacion 21 San Juan.
Institucional Empresas Fm Estacion 21 San Juan. Institucional Empresas Fm Estacion 21 San Juan.
Institucional Empresas Fm Estacion 21 San Juan.
 
Presentación FxBot
Presentación FxBotPresentación FxBot
Presentación FxBot
 
Ovario Poliquistico 2005
Ovario Poliquistico 2005Ovario Poliquistico 2005
Ovario Poliquistico 2005
 
Designing and developing a Windows Phone 7 Silverlight Application End-to-End...
Designing and developing a Windows Phone 7 Silverlight Application End-to-End...Designing and developing a Windows Phone 7 Silverlight Application End-to-End...
Designing and developing a Windows Phone 7 Silverlight Application End-to-End...
 
How To Build A Business Online: Start With Why
How To Build A Business Online: Start With WhyHow To Build A Business Online: Start With Why
How To Build A Business Online: Start With Why
 
Sant mer, heroi del drac de banyoles
Sant mer, heroi del drac de banyolesSant mer, heroi del drac de banyoles
Sant mer, heroi del drac de banyoles
 
Norte Parque Residencial Email Chl
Norte Parque Residencial   Email ChlNorte Parque Residencial   Email Chl
Norte Parque Residencial Email Chl
 
Xavier Giné - Educación financiera y participación financiera en países en de...
Xavier Giné - Educación financiera y participación financiera en países en de...Xavier Giné - Educación financiera y participación financiera en países en de...
Xavier Giné - Educación financiera y participación financiera en países en de...
 
Trends In Graph Data Management And Mining
Trends In Graph Data Management And MiningTrends In Graph Data Management And Mining
Trends In Graph Data Management And Mining
 
Fighting Food Loss and Food Waste in Japan
Fighting Food Loss and Food Waste in JapanFighting Food Loss and Food Waste in Japan
Fighting Food Loss and Food Waste in Japan
 
[Webinar Slides] Gmail’s Responsive Email Updates
[Webinar Slides] Gmail’s Responsive Email Updates[Webinar Slides] Gmail’s Responsive Email Updates
[Webinar Slides] Gmail’s Responsive Email Updates
 
A development of a coin slot prepayment system
A development of a coin slot prepayment systemA development of a coin slot prepayment system
A development of a coin slot prepayment system
 
Cv ernst mayer 2016
Cv ernst mayer 2016Cv ernst mayer 2016
Cv ernst mayer 2016
 
What is this DevOps thing and why do I need it?
What is this DevOps thing and why do I need it?What is this DevOps thing and why do I need it?
What is this DevOps thing and why do I need it?
 

Similaire à 5.5 graph mining

Unit II_Graph.pptxkgjrekjgiojtoiejhgnltegjte
Unit II_Graph.pptxkgjrekjgiojtoiejhgnltegjteUnit II_Graph.pptxkgjrekjgiojtoiejhgnltegjte
Unit II_Graph.pptxkgjrekjgiojtoiejhgnltegjte
pournima055
 

Similaire à 5.5 graph mining (20)

Lecture 2.3.1 Graph.pptx
Lecture 2.3.1 Graph.pptxLecture 2.3.1 Graph.pptx
Lecture 2.3.1 Graph.pptx
 
Graph Data Structure
Graph Data StructureGraph Data Structure
Graph Data Structure
 
Unit II_Graph.pptxkgjrekjgiojtoiejhgnltegjte
Unit II_Graph.pptxkgjrekjgiojtoiejhgnltegjteUnit II_Graph.pptxkgjrekjgiojtoiejhgnltegjte
Unit II_Graph.pptxkgjrekjgiojtoiejhgnltegjte
 
breadth first search
breadth first searchbreadth first search
breadth first search
 
Lgm saarbrucken
Lgm saarbruckenLgm saarbrucken
Lgm saarbrucken
 
Lecture 4&5 computer vision edge-detection code chains hough transform snakes
Lecture 4&5 computer vision edge-detection code chains hough transform snakesLecture 4&5 computer vision edge-detection code chains hough transform snakes
Lecture 4&5 computer vision edge-detection code chains hough transform snakes
 
141222 graphulo ingraphblas
141222 graphulo ingraphblas141222 graphulo ingraphblas
141222 graphulo ingraphblas
 
141205 graphulo ingraphblas
141205 graphulo ingraphblas141205 graphulo ingraphblas
141205 graphulo ingraphblas
 
Graph mining seminar_2009
Graph mining seminar_2009Graph mining seminar_2009
Graph mining seminar_2009
 
graph_mining_seminar_2009.ppt
graph_mining_seminar_2009.pptgraph_mining_seminar_2009.ppt
graph_mining_seminar_2009.ppt
 
DDGK: Learning Graph Representations for Deep Divergence Graph Kernels
DDGK: Learning Graph Representations for Deep Divergence Graph KernelsDDGK: Learning Graph Representations for Deep Divergence Graph Kernels
DDGK: Learning Graph Representations for Deep Divergence Graph Kernels
 
Unit 3 graph chapter6
Unit 3  graph chapter6Unit 3  graph chapter6
Unit 3 graph chapter6
 
Graphs
GraphsGraphs
Graphs
 
Spanningtreesppt
SpanningtreespptSpanningtreesppt
Spanningtreesppt
 
Ivd soda-2019
Ivd soda-2019Ivd soda-2019
Ivd soda-2019
 
Analysis &amp; design of algorithm
Analysis &amp; design of algorithmAnalysis &amp; design of algorithm
Analysis &amp; design of algorithm
 
Improvement of shortest path algorithms using subgraphs heuristics
Improvement of shortest path algorithms using subgraphs heuristicsImprovement of shortest path algorithms using subgraphs heuristics
Improvement of shortest path algorithms using subgraphs heuristics
 
NON-LINEAR DATA STRUCTURE-Graphs.pptx
NON-LINEAR DATA STRUCTURE-Graphs.pptxNON-LINEAR DATA STRUCTURE-Graphs.pptx
NON-LINEAR DATA STRUCTURE-Graphs.pptx
 
A Subgraph Pattern Search over Graph Databases
A Subgraph Pattern Search over Graph DatabasesA Subgraph Pattern Search over Graph Databases
A Subgraph Pattern Search over Graph Databases
 
Unit-6 Graph.ppsx ppt
Unit-6 Graph.ppsx                                       pptUnit-6 Graph.ppsx                                       ppt
Unit-6 Graph.ppsx ppt
 

Plus de Krish_ver2

Plus de Krish_ver2 (20)

5.5 back tracking
5.5 back tracking5.5 back tracking
5.5 back tracking
 
5.5 back track
5.5 back track5.5 back track
5.5 back track
 
5.5 back tracking 02
5.5 back tracking 025.5 back tracking 02
5.5 back tracking 02
 
5.4 randomized datastructures
5.4 randomized datastructures5.4 randomized datastructures
5.4 randomized datastructures
 
5.4 randomized datastructures
5.4 randomized datastructures5.4 randomized datastructures
5.4 randomized datastructures
 
5.4 randamized algorithm
5.4 randamized algorithm5.4 randamized algorithm
5.4 randamized algorithm
 
5.3 dynamic programming 03
5.3 dynamic programming 035.3 dynamic programming 03
5.3 dynamic programming 03
 
5.3 dynamic programming
5.3 dynamic programming5.3 dynamic programming
5.3 dynamic programming
 
5.3 dyn algo-i
5.3 dyn algo-i5.3 dyn algo-i
5.3 dyn algo-i
 
5.2 divede and conquer 03
5.2 divede and conquer 035.2 divede and conquer 03
5.2 divede and conquer 03
 
5.2 divide and conquer
5.2 divide and conquer5.2 divide and conquer
5.2 divide and conquer
 
5.2 divede and conquer 03
5.2 divede and conquer 035.2 divede and conquer 03
5.2 divede and conquer 03
 
5.1 greedyyy 02
5.1 greedyyy 025.1 greedyyy 02
5.1 greedyyy 02
 
5.1 greedy
5.1 greedy5.1 greedy
5.1 greedy
 
5.1 greedy 03
5.1 greedy 035.1 greedy 03
5.1 greedy 03
 
4.4 hashing02
4.4 hashing024.4 hashing02
4.4 hashing02
 
4.4 hashing
4.4 hashing4.4 hashing
4.4 hashing
 
4.4 hashing ext
4.4 hashing  ext4.4 hashing  ext
4.4 hashing ext
 
4.4 external hashing
4.4 external hashing4.4 external hashing
4.4 external hashing
 
4.2 bst
4.2 bst4.2 bst
4.2 bst
 

Dernier

1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
QucHHunhnh
 

Dernier (20)

SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptxSKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
 
How to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POSHow to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POS
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
 
Unit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxUnit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptx
 
ComPTIA Overview | Comptia Security+ Book SY0-701
ComPTIA Overview | Comptia Security+ Book SY0-701ComPTIA Overview | Comptia Security+ Book SY0-701
ComPTIA Overview | Comptia Security+ Book SY0-701
 
Google Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptxGoogle Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptx
 
Graduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - EnglishGraduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - English
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 
Fostering Friendships - Enhancing Social Bonds in the Classroom
Fostering Friendships - Enhancing Social Bonds  in the ClassroomFostering Friendships - Enhancing Social Bonds  in the Classroom
Fostering Friendships - Enhancing Social Bonds in the Classroom
 
ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptx
 
Python Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxPython Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docx
 
ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.
 
Micro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfMicro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdf
 
Sociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning ExhibitSociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning Exhibit
 
Towards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptxTowards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptx
 
Spatium Project Simulation student brief
Spatium Project Simulation student briefSpatium Project Simulation student brief
Spatium Project Simulation student brief
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
Food safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdfFood safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdf
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.ppt
 
How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17
 

5.5 graph mining

  • 2. Graph Mining  Graphs  Model sophisticated structures and their interactions  Chemical Informatics  Bioinformatics  Computer Vision  Video Indexing  Text Retrieval  Web Analysis  Social Networks  Mining frequent sub-graph patterns  Characterization, Discrimination, Classification and Cluster Analysis, building graph indices and similarity search 2
  • 3. Mining Frequent Subgraphs  Graph g  Vertex Set – V(g)  Edge set – E(g)  Label function maps a vertex / edge to a label  Graph g is a sub-graph of another graph g’ if there exists a graph iso- morphism from g to g’  Support(g) or frequency(g) – number of graphs in D = {G1, G2,..Gn} where g is a sub-graph  Frequent graph – satisfies min_sup 3
  • 4. Discovery of Frequent Substructures  Step 1: Generate frequent sub-structure candidates  Step 2: Check for frequency of each candidate  Involves sub-graph isomorphism test which is computationally expensive  Approaches  Apriori –based approach  Pattern Growth approach 4
  • 5. Apriori based Approach 5 Start with graph of small size – generate candidates with extra vertex/edge or path AprioriGraph • Level wise mining method • Size of new substructures is increased by 1 • Generated by joining two similar but slightly different frequent sub- graphs • Frequency is then checked Candidate generation in graphs is complex
  • 6. Apriori Approach  AGM (Apriori-based Graph Mining)  Vertex based candidate generation – increases sub structure size by one vertex at each step  Two frequent k size graphs are joined only if they have the same (k-1) subgraph (Size – number of vertices)  New candidate has (k-1) sized component and the additional two vertices  Two different sub-structures can be formed 6
  • 7. Apriori Approach  FSG (Frequent Sub-graph mining)  Edge-based Candidate generation – increases by one-edge at a time  Two size k patterns are merged iff they share the same subgraph having k-1 edges (core)  New candidate – has core and the two additional edges 7
  • 8. Apriori Approach  Edge disjoint path method  Classify graphs by number of disjoint paths they have  Two paths are edge-disjoint if they do not share any common edge  A substructure pattern with k+1 disjoint paths is generated by joining sub-structures with k disjoint paths  Disadvantage of Apriori Approaches  Overhead when joining two sub-structures  Uses BFS strategy : level-wise candidate generation  To check whether a k+1 graph is frequent – it must check all of its size-k sub graphs  May consume more memory 8
  • 9. Pattern-Growth Approach  Uses BFS as well as DFS  A graph g can be extended by adding a new edge e. The newly formed graph is denoted by g ♦x e.  Edge e may or may not introduce a new vertex to g.  If e introduces a new vertex, the new graph is denoted by g ♦xf e, otherwise, g ♦xb e, where f or b indicates that the extension is in a forward or backward direction.  Pattern Growth Approach  For each discovered graph g performs extensions recursively until all frequent graphs with g are found  Simple but inefficient  Same graph is discovered multiple times – duplicate graph 9
  • 11. gSpan Algorithm  Reduces generation of duplicate graphs  Does not extend duplicate graphs  Uses Depth First Order  A graph may have several DFS-trees  Visiting order of vertices forms a linear order - Subscript  In a DFS tree – starting vertex – root; last visited vertex – right-most vertex  Path from v0 to vn – right most path 11 Right most path: (b), (c) – (v0, v1, v3); (d) – (v0, v1, v2, v3)
  • 12. gSpan Algorithm  gSpan restricts the extension method  A new edge e can be added  between the right-most vertex and another vertex on the right-most path (backward extension);  or it can introduce a new vertex and connect to a vertex on the right-most path (forward extension)  Right-most extension, denoted by G ♦r e 12
  • 13. gSpan Algorithm  Chooses any one DFS tree – base subscripting and extends it  Each subscripted graph is transformed into an edge sequence – DFS code  Select the subscript that generates minimum sequence  Edge Order – maps edges in a subscripted graph into a sequence  Sequence Order – builds an order among edge sequences 13 Introduce backward edges: Given a vertex v all of its backward edges should appear before its forward edges (if any); If there are two backward edges (i,j1) appears before (i,j2) Order of forward edges: (0,1) (1,2) (1,3) Complete sequence: (0,1) (1,2) (2,0) (1,3)
  • 14. gSpan Algorithm 14 DFS Lexicographic Ordering: Edge order, First Vertex label, Edge label, Second Vertex label Here γ0 < γ1 < γ2 γ0 – Minimum DFS Code Corresponding subscript – Base Subscripting gSpan – carries out right most extension on the minimum DFS code gSpan – carries out right most extension on the minimum DFS code
  • 15. gSpan Algorithm  Root – Empty code  Each node is a DFS code encoding a graph  Each edge – rightmost extension from a (k-1) length DFS code to a k-length DFS code  If codes s and s’ encode the same graph – search space s’ can be safely pruned 15
  • 17. Mining Closed Frequent Substructures: Helps to overcome the problem of pattern explosion. A frequent graph G is closed if and only if there is no proper supergraph G’ that has the same support as G. The CloseGraph algorithm mines closed frequent graphs. A frequent pattern G is maximal if and only if there is no frequent super-pattern of G. The maximal pattern set is a subset of the closed pattern set, but it cannot be used to reconstruct the entire set of frequent patterns with their supports, because the support of a proper sub-pattern of a maximal pattern is lost.
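The two definitions can be checked on a toy example. This is a sketch under a strong simplifying assumption: every pattern is a set of uniquely named edges, so "supergraph" reduces to "proper superset" and no isomorphism test is needed. The function name and the support values are made up for illustration.

```python
def closed_and_maximal(support):
    """support: dict mapping a frequent pattern (frozenset of edges) to its support."""
    closed, maximal = set(), set()
    for g, sup in support.items():
        supers = [h for h in support if g < h]   # frequent proper super-patterns
        if not supers:
            maximal.add(g)                       # no frequent super-pattern at all
        if all(support[h] < sup for h in supers):
            closed.add(g)                        # no super-pattern with equal support
    return closed, maximal


# Frequent patterns of the database {ab,bc,cd}, {ab,bc,cd}, {ab,bc} with min_sup = 2.
support = {
    frozenset({"ab"}): 3,
    frozenset({"bc"}): 3,
    frozenset({"cd"}): 2,
    frozenset({"ab", "bc"}): 3,
    frozenset({"bc", "cd"}): 2,
    frozenset({"ab", "bc", "cd"}): 2,
}
closed, maximal = closed_and_maximal(support)
print(sorted(sorted(p) for p in closed))   # [['ab', 'bc'], ['ab', 'bc', 'cd']]
print(sorted(sorted(p) for p in maximal))  # [['ab', 'bc', 'cd']]
```

The single maximal pattern tells us that {ab, bc} is frequent, but not that its support is 3 rather than 2. The closed set keeps that information.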
  • 18. Mining Alternative Substructure Patterns: Unlabeled or partially labeled graphs: a new empty label φ is assigned to vertices and edges that do not have labels. Non-simple graphs: a non-simple graph may have self-loops and multiple edges; the growing order is backward edges, then self-loops, then forward edges; to handle multiple edges, two neighboring edges in a DFS code are allowed to share the same vertices. Directed graphs: each edge is represented by a 6-tuple (i, j, d, l_i, l_(i,j), l_j), where d = +1 or -1 records the edge direction. Disconnected graphs: either the graph or the pattern may be disconnected; for a disconnected graph, add a virtual vertex that connects its components; a disconnected graph pattern is treated as a set of connected graphs. Frequent subtrees: a tree is a degenerate graph.
  • 19. Constraint-based Mining of Substructure Patterns: Element, set, or subgraph containment constraint: the user requires that the mined patterns contain a particular set of subgraphs (a succinct constraint). Geometric constraint: for example, the angle between each pair of connected edges must be within a given range (an anti-monotonic constraint). Value-sum constraint: the sum of the (positive) weights on the edges must be within a range from low to high; the lower bound (sum ≥ low) is a monotonic constraint and the upper bound (sum ≤ high) is an anti-monotonic constraint. Multiple categories of constraints may also be enforced together.
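The value-sum constraint shows how the two bounds behave as a pattern grows. The sketch below assumes positive edge weights and a hypothetical helper `check_value_sum`; the weights and bounds are invented for illustration. Because every added edge increases the sum, a pattern that exceeds `high` can be pruned together with all of its supergraphs, while a pattern that has reached `low` never needs that bound checked again.

```python
def check_value_sum(edge_weights, low, high):
    """Classify a pattern by the sum of its (positive) edge weights."""
    total = sum(edge_weights)
    return {
        "total": total,
        "satisfies": low <= total <= high,
        # Anti-monotonic upper bound: once violated, every supergraph violates it too.
        "prune": total > high,
        # Monotonic lower bound: once met, every supergraph meets it too.
        "low_met_for_all_supergraphs": total >= low,
    }


# Grow one pattern edge by edge with weights 2, 3, 4 and the range [4, 8].
pattern = []
for w in (2, 3, 4):
    pattern.append(w)
    print(check_value_sum(pattern, low=4, high=8))
```

After the third edge the total exceeds `high`, so the search can stop extending this pattern.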
  • 20. Mining Approximate Frequent Substructures: Approximate frequent substructures allow slight structural variations, so several slightly different frequent substructures can be represented by one approximate substructure. SUBDUE is a substructure discovery system based on the Minimum Description Length (MDL) principle. It adopts a constrained beam search and performs approximate matching.
  • 21. Mining Coherent and Dense Substructures: A frequent substructure G is a coherent subgraph if the mutual information between G and each of its own subgraphs is above some threshold. This reduces the number of patterns mined. Application: coherent substructure mining selects a small subset of features that have high distinguishing power between protein classes. Relational graph: a graph in which each label is used only once. Frequent highly connected (dense) subgraph mining finds, for example, people with strong associations in online social networks (OSNs) or sets of genes within the same functional module. Density cannot be judged from average degree or minimum degree alone; connectedness must also be ensured. [Figure example: average degree 3.25, minimum degree 3.]
  • 22. Mining Dense Substructures: Dense graphs are defined in terms of edge connectivity. Given a graph G, an edge cut is a set of edges Ec such that E(G) - Ec is disconnected. A minimum cut is the smallest of all edge cuts, and the edge connectivity of G is the size of a minimum cut. A graph is dense if its edge connectivity is no less than a specified minimum cut threshold. Two approaches mine dense substructures. CloseCut, a pattern-growth approach (scalable), starts with a small frequent candidate graph and extends it until it finds the largest supergraph with the same support. Splat, a pattern-reduction approach (high performance), directly intersects relational graphs to obtain highly connected graphs: a pattern g discovered in a set is progressively intersected with subsequent components to give g’, some edges in g may be removed, and the size of candidate graphs is reduced by intersection and decomposition operations.
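To see why degree statistics are not enough, the sketch below computes edge connectivity by brute force (trying ever larger edge sets until removal disconnects the graph), which is only practical for tiny graphs. The test graph is an assumption chosen to reproduce the statistics quoted earlier (average degree 3.25, minimum degree 3): two 4-cliques joined by a single edge. It is not necessarily the graph in the original figure.

```python
from itertools import combinations


def is_connected(vertices, edges):
    """Depth-first reachability check on an undirected graph."""
    vertices = set(vertices)
    if not vertices:
        return True
    adj = {v: set() for v in vertices}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    start = next(iter(vertices))
    seen, stack = {start}, [start]
    while stack:
        v = stack.pop()
        for w in adj[v] - seen:
            seen.add(w)
            stack.append(w)
    return seen == vertices


def edge_connectivity(vertices, edges):
    """Size of a minimum edge cut, found by exhaustive search (small graphs only)."""
    if not is_connected(vertices, edges):
        return 0
    for k in range(1, len(edges) + 1):
        for cut in combinations(edges, k):
            remaining = [e for e in edges if e not in cut]
            if not is_connected(vertices, remaining):
                return k
    return len(edges)


k4_left = list(combinations(range(4), 2))
k4_right = list(combinations(range(4, 8), 2))
edges = k4_left + k4_right + [(3, 4)]        # single bridge between the cliques
vertices = range(8)

degree = {v: sum(v in e for e in edges) for v in vertices}
print(sum(degree.values()) / len(degree))    # 3.25
print(min(degree.values()))                  # 3
print(edge_connectivity(vertices, edges))    # 1
print(edge_connectivity(range(4), k4_left))  # 3
```

Both degree measures look healthy, yet removing one edge splits the graph, so it fails any minimum cut threshold above 1. Each 4-clique on its own has edge connectivity 3.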
  • 23. Applications – Graph Indexing: Indexing is essential for efficient search and query processing. Traditional indexing approaches are not feasible for graphs. Graph indices can be based on nodes, edges, or subgraphs. Path-based indexing approach: enumerate all the paths in the database up to length maxL and index them; the index is then used to identify all graphs that contain the paths in the query. It is not suitable for complex graph queries: structural information is lost when a query graph is broken apart into paths, and many false positives may be returned. gIndex considers frequent and discriminative substructures as index features. A frequent substructure is discriminative if its support cannot be well approximated by the intersection of the graph sets that contain one of its subgraphs. gIndex achieves good performance at less cost.
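A minimal sketch of path-based indexing follows, to show where the false positives come from. It assumes each graph is a pair (vertex-label dict, edge list), indexes label sequences of simple paths with up to `max_len` edges, and uses hypothetical function names; real systems store and prune the index far more carefully.

```python
def label_paths(labels, edges, max_len):
    """Label sequences of all simple paths with at most max_len edges."""
    adj = {v: set() for v in labels}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    found = set()

    def walk(path):
        seq = tuple(labels[v] for v in path)
        found.add(min(seq, seq[::-1]))      # undirected: one canonical direction
        if len(path) - 1 < max_len:
            for w in adj[path[-1]]:
                if w not in path:
                    walk(path + [w])

    for v in labels:
        walk([v])
    return found


def build_index(db, max_len):
    """Map each label path to the ids of the graphs that contain it."""
    index = {}
    for gid, (labels, edges) in db.items():
        for p in label_paths(labels, edges, max_len):
            index.setdefault(p, set()).add(gid)
    return index


def candidates(index, query, max_len, all_ids):
    """Graphs that contain every path of the query (may include false positives)."""
    labels, edges = query
    result = set(all_ids)
    for p in label_paths(labels, edges, max_len):
        result &= index.get(p, set())
    return result


triangle = ({0: "C", 1: "C", 2: "C"}, [(0, 1), (1, 2), (2, 0)])
chain = ({0: "C", 1: "C", 2: "C", 3: "C"}, [(0, 1), (1, 2), (2, 3)])
single_edge = ({0: "C", 1: "C"}, [(0, 1)])
db = {"g1": triangle, "g2": chain, "g3": single_edge}

index = build_index(db, max_len=2)
print(sorted(candidates(index, triangle, 2, db)))  # ['g1', 'g2']
```

The chain g2 contains every path of the triangle query but not the triangle itself, so it is a false positive that an exact subgraph isomorphism test would still have to reject.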
  • 24. Graph Indexing: [Figure: query graph and database graphs.] Only (c) is an exact match, but the others are also reported due to the presence of sub-structures of the query.
  • 25. Substructure Similarity Search: Bioinformatics and chem-informatics applications involve query-based search in massive, complex structural data. A straightforward approach is to form a set of subgraph queries from the original query with one or more edge deletions and then use exact substructure search.
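The edge-deletion approach can be sketched as follows. The helper name `relaxed_queries` is hypothetical, the query is just an edge list, and the sketch does not check whether a relaxed query stays connected.

```python
from itertools import combinations


def relaxed_queries(edges, max_deletions):
    """All sub-queries obtained by deleting 1..max_deletions edges from the query."""
    out = []
    for k in range(1, max_deletions + 1):
        for removed in combinations(edges, k):
            out.append([e for e in edges if e not in removed])
    return out


query = [(0, 1), (1, 2), (2, 3), (3, 0)]   # a 4-cycle
subqueries = relaxed_queries(query, max_deletions=2)
print(len(subqueries))                      # 10
```

Each relaxed query then needs its own exact substructure search, and the number of relaxed queries grows combinatorially with the query size and the number of allowed deletions.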
  • 26. Substructure Similarity Search: Grafil (Graph Similarity Filtering) performs feature-based structural filtering. It models each query graph as a set of features, and edge deletions in the query are translated into feature misses. Using too many features at once reduces filtering performance, so a multi-filter composition strategy is used, where each filter works on a feature set (a group of similar features).
  • 27. Classification and Cluster Analysis using Graph Patterns: Graph classification: mine frequent graph patterns; features that are frequent in one class but less frequent in another are discriminative features and are used for model construction; frequency and connectivity thresholds can be adjusted; classifiers such as SVM, NBM, etc. are used. Cluster analysis: cluster similar graphs based on graph connectivity (minimal cuts); hierarchical clusters can be formed by varying the support threshold; outliers can also be detected. Pattern mining and classification or clustering form an inter-related process.