6. Creating a model pipeline
[Figure: Data Science Workflow. Pipeline stages: Ingest, Transform, Model, Deploy. Other labels: Unstructured Data, exploration, data, modeling.]
26. ML is not a black box. (Transparency)
Learning is also about understanding. (Interpretability)
Whatever can go wrong, will go wrong. (Diagnosis)
Moving on
30. Formulating Pattern Mining
Find the top K most frequent sets of length at least L that occur at least M times.
- K: max_patterns
- L: min_length
- M: min_support
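The formulation above can be sketched as a brute-force search. This is only an illustration of what the three parameters mean, not an efficient miner: the function name `top_k_patterns` and the tie-breaking order are assumptions, while `max_patterns`, `min_length`, and `min_support` are the names from the slide.

```python
from collections import Counter
from itertools import combinations


def top_k_patterns(transactions, max_patterns, min_length, min_support):
    """Brute force: count every sub-itemset of every transaction.

    Hypothetical helper for illustration; exponential in transaction size.
    """
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))
        # Enumerate all sub-itemsets with at least min_length items (L).
        for size in range(min_length, len(items) + 1):
            for itemset in combinations(items, size):
                counts[itemset] += 1

    # Keep itemsets that occur at least min_support times (M).
    frequent = [(p, c) for p, c in counts.items() if c >= min_support]
    # Most frequent first; ties broken by longer pattern, then alphabetically
    # (the tie-break is an assumption, the slide does not specify one).
    frequent.sort(key=lambda pc: (-pc[1], -len(pc[0]), pc[0]))
    # Return at most max_patterns results (K).
    return frequent[:max_patterns]


if __name__ == "__main__":
    data = [
        {"B", "C", "D"}, {"A", "C", "D"}, {"A", "B", "C", "D"},
        {"A", "D"}, {"B", "C", "D"}, {"B", "C", "D"},
    ]
    for pattern, count in top_k_patterns(data, max_patterns=3, min_length=2, min_support=4):
        print(pattern, count)
```

Because it enumerates every subset, this version is only practical for tiny inputs; it is the baseline that candidate generation and pattern growth improve on.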
33. Principle 1: What is frequent?
A pattern is frequent if it occurs at least M times.
{B, C, D}
{A, C, D}
{A, B, C, D}
{A, D}
{B, C, D}
{B, C, D}
{C, D}: 5 is frequent
M = 4
{A, D}: 3 is not frequent
34. Principle 1: What is frequent?
The threshold M is the min_support parameter.
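As a minimal sketch of this definition, the support of a pattern is the number of transactions that contain all of its items. The helper names `support` and `is_frequent` are hypothetical; the six transactions are the ones listed on the slide.

```python
# The six transactions from the slide.
transactions = [
    {"B", "C", "D"},
    {"A", "C", "D"},
    {"A", "B", "C", "D"},
    {"A", "D"},
    {"B", "C", "D"},
    {"B", "C", "D"},
]


def support(pattern, transactions):
    """Number of transactions that contain every item of the pattern."""
    pattern = set(pattern)
    return sum(1 for t in transactions if pattern <= set(t))


def is_frequent(pattern, transactions, min_support):
    """A pattern is frequent if it occurs at least min_support (M) times."""
    return support(pattern, transactions) >= min_support


if __name__ == "__main__":
    print(support({"C", "D"}, transactions))         # 5
    print(is_frequent({"C", "D"}, transactions, 4))  # True
    print(support({"A", "D"}, transactions))         # 3
    print(is_frequent({"A", "D"}, transactions, 4))  # False
```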
35. Principle 2: Apriori principle
A pattern is frequent only if all of its subsets are frequent.
{B, C, D}
{A, C, D}
{A, B, C, D}
{A, D}
{B, C, D}
{B, C, D}
{B, C, D} : 4 is frequent therefore
{C, D} : 5 is frequent
{A} : 3 is not frequent therefore
{A, D} : 3 is not frequent
M = 4
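The principle can be put to work as a level-wise search: build candidates of size k only from frequent sets of size k - 1, and discard any candidate with an infrequent subset before counting it. The following is a simplified sketch of that candidate-generation idea, not a tuned implementation; the function name `apriori` and its return format are assumptions.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Level-wise frequent itemset mining (simplified sketch).

    Returns a dict mapping each frequent itemset (frozenset) to its support.
    """
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        # One pass over the data per candidate: simple, not efficient.
        return sum(1 for t in transactions if itemset <= t)

    # Level 1: frequent single items.
    level = {}
    for item in sorted({i for t in transactions for i in t}):
        count = support(frozenset([item]))
        if count >= min_support:
            level[frozenset([item])] = count

    frequent = dict(level)
    size = 1
    while level:
        size += 1
        # Join step: merge two frequent (size - 1)-sets into one size-set.
        candidates = {a | b for a in level for b in level if len(a | b) == size}
        # Prune step (Apriori principle): every (size - 1)-subset must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, size - 1))
        }
        # Count step: keep only candidates that meet min_support.
        level = {}
        for candidate in candidates:
            count = support(candidate)
            if count >= min_support:
                level[candidate] = count
        frequent.update(level)
    return frequent


if __name__ == "__main__":
    data = [
        {"B", "C", "D"}, {"A", "C", "D"}, {"A", "B", "C", "D"},
        {"A", "D"}, {"B", "C", "D"}, {"B", "C", "D"},
    ]
    for itemset, count in apriori(data, min_support=4).items():
        print(sorted(itemset), count)
```

With M = 4 on the slide's transactions, {A} is dropped at the first level, so no candidate containing A (such as {A, D}) is ever generated or counted.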
59. Compare & Contrast
• Candidate Generation
+ Better than brute force
+ Filters candidate sets
- Multiple passes over the data
• Pattern Growth
+ Fewer passes over the data
+ Space efficient.
60. Compare & Contrast
Pattern Growth is the better choice: fewer passes over the data and space efficient.
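To make the pattern-growth idea concrete, here is a deliberately simplified sketch. It grows a prefix one item at a time and recurses on a projected (conditional) database, so no candidate sets are generated. It is an assumption-laden illustration: a real FP-Growth implementation compresses the database into a prefix tree (the FP-tree), whereas this version keeps plain lists, and the name `pattern_growth` is hypothetical.

```python
from collections import Counter


def pattern_growth(transactions, min_support):
    """Mine frequent itemsets by growing prefixes (simplified sketch).

    Uses projected lists of transactions instead of a compressed FP-tree.
    Returns a dict mapping each frequent itemset (frozenset) to its support.
    """
    # Pass 1: count single items.
    counts = Counter(item for t in transactions for item in set(t))
    ranked = sorted(
        (item for item, c in counts.items() if c >= min_support),
        key=lambda item: (-counts[item], item),
    )
    order = {item: rank for rank, item in enumerate(ranked)}

    # Pass 2: drop infrequent items and sort each transaction in a fixed order.
    database = [
        sorted((i for i in set(t) if i in order), key=order.get)
        for t in transactions
    ]

    results = {}

    def grow(prefix, db):
        local = Counter(item for t in db for item in t)
        for item, count in local.items():
            if count < min_support:
                continue
            pattern = prefix + (item,)
            results[frozenset(pattern)] = count
            # Conditional database: what follows `item` in transactions containing it.
            projected = [t[t.index(item) + 1:] for t in db if item in t]
            grow(pattern, [t for t in projected if t])

    grow((), database)
    return results


if __name__ == "__main__":
    data = [
        {"B", "C", "D"}, {"A", "C", "D"}, {"A", "B", "C", "D"},
        {"A", "D"}, {"B", "C", "D"}, {"B", "C", "D"},
    ]
    for itemset, count in pattern_growth(data, min_support=4).items():
        print(sorted(itemset), count)
```

After the two initial passes, all further work happens on the shrinking projected databases rather than on the full data, which is the intuition behind "fewer passes over the data".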
69. Creating a model pipeline
[Figure: Data Science Workflow. Pipeline stages: Ingest, Transform, Model, Deploy. Other labels: Unstructured Data, exploration, data, modeling.]
71. Summary
Log Data Mining ≠ Rocket Science
• FP-Growth for finding frequent patterns.
• Find rules from patterns to make predictions.
• Extract features in pattern space that are useful for ML.
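The last two points can be sketched in a few lines. This is a minimal illustration under stated assumptions: the rule score used here is the standard confidence measure, support(X ∪ Y) / support(X), and the features are simple 0/1 indicators of whether a transaction contains each pattern. The helper names `rule_confidence` and `pattern_features` are hypothetical.

```python
def rule_confidence(antecedent, consequent, transactions):
    """Confidence of the rule antecedent -> consequent.

    Fraction of transactions containing the antecedent that also
    contain the consequent. Returns 0.0 if the antecedent never occurs.
    """
    antecedent, consequent = set(antecedent), set(consequent)
    n_antecedent = sum(1 for t in transactions if antecedent <= set(t))
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= set(t))
    return n_both / n_antecedent if n_antecedent else 0.0


def pattern_features(transaction, patterns):
    """Map a transaction into pattern space: one 0/1 indicator per pattern."""
    items = set(transaction)
    return [1 if set(p) <= items else 0 for p in patterns]


if __name__ == "__main__":
    data = [
        {"B", "C", "D"}, {"A", "C", "D"}, {"A", "B", "C", "D"},
        {"A", "D"}, {"B", "C", "D"}, {"B", "C", "D"},
    ]
    # Every transaction with B also has C and D.
    print(rule_confidence({"B"}, {"C", "D"}, data))  # 1.0
    # Indicator features for one transaction against three patterns.
    print(pattern_features({"A", "D"}, [{"C", "D"}, {"B", "C", "D"}, {"D"}]))  # [0, 0, 1]
```

The resulting indicator vectors can be fed to any downstream model as ordinary numeric features.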