2. Association Rule Mining
• An AllElectronics customer buys a PC and a digital camera. What should you recommend next?
• Frequent patterns and association rules are the knowledge to be mined.
• Frequent patterns: patterns that appear frequently in a data set.
• Frequent itemsets: sets of items, such as milk and bread, that appear together frequently in a transaction data set.
• Frequent subsequences: patterns whose items occur frequently in a particular order across transactions, such as buying a PC followed by a digital camera.
• Frequent substructures: subgraphs, subtrees, or sublattices, possibly combined with itemsets or subsequences; a substructure that occurs frequently is called a frequent structured pattern.
3. Basic Concepts
• Mining frequent patterns plays an essential role in mining associations, correlations, data classification, clustering, etc.
• Market Basket Analysis:
  customer 1: milk, bread, cereal
  customer 2: milk, bread, sugar, eggs
  customer 3: milk, bread, butter
  customer 4: sugar, eggs
• Which groups or sets of items are customers likely to purchase on a
given trip to a store?
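The four baskets above can be scanned for co-occurring items. A minimal sketch in Python (the basket encoding is a hypothetical transcription of the slide's list):

```python
from itertools import combinations
from collections import Counter

# The four customer baskets from the slide, encoded as sets of item names.
transactions = [
    {"milk", "bread", "cereal"},
    {"milk", "bread", "sugar", "eggs"},
    {"milk", "bread", "butter"},
    {"sugar", "eggs"},
]

# Count every 2-item combination across the baskets.
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# {milk, bread} appears in 3 of the 4 baskets, the most frequent pair.
print(pair_counts[("bread", "milk")])  # 3
```

Pairs such as {milk, bread} that recur across many baskets are exactly the itemsets market basket analysis is after.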
4. Association Rules
• Support and confidence are two measures of rule interestingness.
  Support: reflects the usefulness of discovered rules.
  Confidence: reflects the certainty of discovered rules.
  Example: computer => antivirus_software [support = 2%, confidence = 60%]
  Support: 2% of all the transactions under analysis show that computer and antivirus software are purchased together.
  Confidence: 60% of the customers who purchased a computer also bought the software.
5. Association Rules
• Association rules are interesting if they satisfy both a minimum
support threshold and a minimum confidence threshold
• Frequent itemsets, closed itemsets, and association rules:
  I = {I1, I2, ..., In}: the set of all items
  D: the task-relevant data, a set of database transactions
  T: a transaction, a set of items such that T ⊆ I
  Rule: A => B, where A ⊂ I, B ⊂ I, and A ∩ B = ∅
  Support(A => B) = P(A ∪ B)   (relative support)
  Confidence(A => B) = P(B|A)
6. Association Rules
• Itemsets
• K-itemsets: itemsets containing k items
• Occurrence frequency of an itemset: the number of transactions that contain the itemset
• Minimum support threshold: if the relative support of an itemset I satisfies a prespecified minimum support threshold, then I is a frequent itemset.
• Confidence(A => B) = P(B|A)
                     = support(A ∪ B) / support(A)
                     = support_count(A ∪ B) / support_count(A)
• Thus the problem of mining association rules can be reduced to that of mining frequent itemsets.
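The support and confidence formulas above can be computed directly from support counts. A sketch on a hypothetical ten-transaction database (item names and counts are illustrative, chosen so that confidence comes out at 60% as in the earlier example):

```python
def support_count(itemset, transactions):
    """Number of transactions containing every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

# Hypothetical database: 5 of 10 transactions contain a computer,
# and 3 of those also contain antivirus software.
D = [
    {"computer", "antivirus"}, {"computer", "antivirus"},
    {"computer", "antivirus"}, {"computer", "printer"}, {"computer"},
    {"printer"}, {"scanner"}, {"printer", "scanner"}, {"scanner"}, {"printer"},
]
A, B = {"computer"}, {"antivirus"}

support = support_count(A | B, D) / len(D)                  # P(A ∪ B) = 3/10
confidence = support_count(A | B, D) / support_count(A, D)  # 3/5
print(support, confidence)  # 0.3 0.6
```

Both measures reduce to counting transactions, which is why mining association rules reduces to mining frequent itemsets.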
7. Frequent Itemsets in a Data Set (Association Rule
Mining)
• Association mining searches for frequent itemsets in a data set. Frequent itemset mining finds the interesting associations and correlations between itemsets in transactional and relational databases. In short, it shows which items appear together in a transaction or relation.
• Need for association mining:
  Frequent itemset mining generates association rules from a transactional data set. If two items X and Y are purchased together frequently, it is good to place them together in stores or to offer a discount on one item with the purchase of the other; this can really increase sales. For example, it is likely that a customer who buys milk and bread also buys butter.
  So the association rule is [milk]^[bread] => [butter], and the seller can suggest butter to a customer who buys milk and bread.
8. Important Definitions:
• Support: one of the measures of interestingness; it tells how useful a rule is. A support of 5% means that 5% of all transactions in the database follow the rule.
• Support(A => B) = Support_count(A ∪ B) / |D|, where |D| is the total number of transactions
• Confidence: tells how certain a rule is. A confidence of 60% means that 60% of the customers who purchased milk and bread also bought butter.
• Confidence(A => B) = Support_count(A ∪ B) / Support_count(A)
• If a rule satisfies both minimum support and minimum confidence, it is a strong rule.
9. Important Definitions:
• Support_count(X): the number of transactions in which X appears. If X = A ∪ B, it is the number of transactions in which A and B are both present.
1. Maximal itemset: an itemset is maximal frequent if none of its supersets is frequent.
2. Closed itemset: an itemset is closed if none of its immediate supersets has the same support count as the itemset.
3. K-itemset: an itemset that contains K items. An itemset is frequent if its support count is at least the minimum support count.
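The three definitions above can be checked mechanically. A brute-force sketch (the tiny data set is hypothetical, just to exercise the checks):

```python
from itertools import combinations

def support_count(itemset, transactions):
    return sum(1 for t in transactions if itemset <= t)

def frequent_itemsets(transactions, min_sup):
    """Brute-force enumeration of all frequent itemsets with their counts."""
    items = sorted(set().union(*transactions))
    return {frozenset(c): support_count(set(c), transactions)
            for k in range(1, len(items) + 1)
            for c in combinations(items, k)
            if support_count(set(c), transactions) >= min_sup}

def is_closed(x, freq):
    # Closed: no proper superset has the same support count.  Checking only
    # frequent supersets suffices, since an equal-count superset of a
    # frequent itemset is itself frequent.
    return all(freq[s] < freq[x] for s in freq if x < s)

def is_maximal(x, freq):
    # Maximal frequent: no proper superset is frequent.
    return not any(x < s for s in freq)

T = [{"A", "C"}, {"A", "C"}, {"A", "C"}, {"B", "C"}]
freq = frequent_itemsets(T, min_sup=2)
print(is_closed(frozenset("A"), freq))          # False: {A, C} also has count 3
print(is_maximal(frozenset({"A", "C"}), freq))  # True
```

Brute force is exponential in the number of items; it is shown only to make the definitions concrete.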
10. Example On finding Frequent Itemsets
• Consider the given data set of transactions.
• Let the minimum support count be 3.
• The relation that holds is: maximal frequent => closed => frequent (every maximal frequent itemset is closed, and every closed frequent itemset is frequent).
11. • 1-frequent:
• {A} = 3 // not closed due to {A, C}; not maximal
• {B} = 4 // not closed due to {B, D}; not maximal
• {C} = 4 // not closed due to {C, D}; not maximal
• {D} = 5 // closed, since no immediate superset has the same count; not maximal
• 2-frequent:
• {A, B} = 2 // support count < minimum support count, so not frequent; ignore
• {A, C} = 3 // not closed due to {A, C, D}
• {A, D} = 3 // not closed due to {A, C, D}
• {B, C} = 3 // not closed due to {B, C, D}
• {B, D} = 4 // closed but not maximal due to {B, C, D}
• {C, D} = 4 // closed but not maximal due to {B, C, D}
• 3-frequent:
• {A, B, C} = 2 // support count < minimum support count, so not frequent; ignore
• {A, B, D} = 2 // support count < minimum support count, so not frequent; ignore
• {A, C, D} = 3 // maximal frequent
• {B, C, D} = 3 // maximal frequent
• 4-frequent:
• {A, B, C, D} = 2 // not frequent; ignore
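The slide's transaction table is not reproduced here; the data set below is a hypothetical reconstruction chosen so that every support count listed above comes out the same, which lets the maximal frequent itemsets be verified by enumeration:

```python
from itertools import combinations

# Hypothetical 5-transaction data set consistent with the counts above.
T = [set("ABCD"), set("ABCD"), set("ACD"), set("BCD"), set("BD")]
MIN_SUP = 3

def count(itemset):
    return sum(1 for t in T if set(itemset) <= t)

# Spot-check a few of the listed support counts.
print(count("D"), count("BD"), count("ACD"), count("ABCD"))  # 5 4 3 2

# Frequent itemsets, then the maximal frequent ones among them.
freq = {frozenset(c): count(c)
        for k in range(1, 5) for c in combinations("ABCD", k)
        if count(c) >= MIN_SUP}
maximal = sorted(sorted(f) for f in freq if not any(f < g for g in freq))
print(maximal)  # [['A', 'C', 'D'], ['B', 'C', 'D']]
```

The enumeration confirms {A, C, D} and {B, C, D} as the only maximal frequent itemsets at a minimum support count of 3.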
12. AR as Two step Process
• Find all frequent item sets
• Generate strong association rules from the frequent item sets
• Challenge in mining frequent itemsets: a long frequent itemset implies a huge number of shorter frequent sub-itemsets; mining closed and maximal frequent itemsets mitigates this.
• Closed frequent itemset: an itemset X is closed in a data set D if there exists no proper super-itemset Y such that Y has the same support count as X in D.
• Maximal frequent itemset: an itemset X is a maximal frequent itemset in a data set D if X is frequent and there exists no super-itemset Y such that X ⊂ Y and Y is frequent in D.
13. Example: closed and maximal frequent
item sets
• A transaction database has only two transactions:
  {<a1, a2, ..., a100>; <a1, a2, ..., a50>}; min_sup = 1
• We find two closed frequent itemsets and their support counts:
  C = {{a1, a2, ..., a100}: 1; {a1, a2, ..., a50}: 2}
• There is only one maximal frequent itemset:
  M = {{a1, a2, ..., a100}: 1}
• We cannot include {a1, a2, ..., a50} as a maximal frequent itemset because it has a frequent superset, {a1, a2, ..., a100}.
• C-closed frequent item set, M-Maximal frequent item sets
14. Example: closed and maximal frequent
item sets
• The set of closed frequent itemsets contains complete information regarding the frequent itemsets.
• From C, we can derive:
  (i) {a2, a45: 2}, since {a2, a45} is a sub-itemset of the itemset {a1, a2, ..., a50: 2};
  (ii) {a8, a55: 1}, since {a8, a55} is not a sub-itemset of the previous itemset but of the itemset {a1, a2, ..., a100: 1}.
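The derivation above generalizes: the support of any frequent itemset X is the maximum support count among the closed frequent itemsets that contain X. A sketch using the two closed sets from slide 13:

```python
# C: the closed frequent itemsets of slide 13 with their support counts.
C = {
    frozenset(f"a{i}" for i in range(1, 101)): 1,  # {a1, ..., a100}: 1
    frozenset(f"a{i}" for i in range(1, 51)): 2,   # {a1, ..., a50}: 2
}

def support_from_closed(X):
    """Support of X = max count over closed itemsets containing X."""
    counts = [c for closed, c in C.items() if X <= closed]
    return max(counts) if counts else 0

print(support_from_closed({"a2", "a45"}))  # 2
print(support_from_closed({"a8", "a55"}))  # 1
```

This is why storing only the closed frequent itemsets loses no support information.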
15. Frequent Itemset Mining Methods: Apriori
and FP Growth
• Apriori algorithm:
  Finds frequent itemsets by confined candidate generation.
  A seminal algorithm proposed by R. Agrawal and R. Srikant in 1994 for mining frequent itemsets.
  The algorithm is so named because it uses prior knowledge of frequent itemset properties.
  Apriori property: all non-empty subsets of a frequent itemset must also be frequent.
  Each iteration consists of a join step and a prune step.
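The level-wise join and prune steps can be sketched as follows; this is a compact illustration of the Apriori scheme, not an optimized implementation (the sample basket data is hypothetical):

```python
from itertools import combinations

def support_count(itemset, transactions):
    return sum(1 for t in transactions if itemset <= t)

def apriori(transactions, min_sup):
    """Level-wise Apriori: join frequent (k-1)-itemsets, prune candidates
    with an infrequent subset, then count the survivors."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({i for t in transactions for i in t})
    # L1: frequent 1-itemsets.
    Lk = [frozenset([i]) for i in items
          if support_count(frozenset([i]), transactions) >= min_sup]
    all_frequent = {s: support_count(s, transactions) for s in Lk}
    k = 2
    while Lk:
        # Join step: unions of frequent (k-1)-itemsets that form k-itemsets.
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == k}
        # Prune step (Apriori property): every (k-1)-subset must be frequent.
        prev = set(Lk)
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev
                             for s in combinations(c, k - 1))}
        Lk = [c for c in candidates
              if support_count(c, transactions) >= min_sup]
        for c in Lk:
            all_frequent[c] = support_count(c, transactions)
        k += 1
    return all_frequent

freq = apriori([{"milk", "bread", "cereal"}, {"milk", "bread", "eggs"},
                {"milk", "bread", "butter"}, {"eggs", "sugar"}], min_sup=2)
print(freq[frozenset({"milk", "bread"})])  # 3
```

The prune step is where the Apriori property pays off: a candidate is discarded without a database scan as soon as one of its subsets is known to be infrequent.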
19. Generating Association Rules from
frequent item sets
• Once the frequent item sets from transactions have been found, it is
straightforward to generate strong association rules from them
• Strong association rules satisfy both minimum support and minimum
confidence
• Confidence(A => B) = P(B|A) = support_count(A ∪ B) / support_count(A)
20. Generating Association Rules from
frequent item sets
• Association rules are generated as follows:
  For each frequent itemset l, generate all non-empty proper subsets of l.
  For every non-empty proper subset s of l, output the rule
  "s => l - s" if support_count(l) / support_count(s) >= min_conf.
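The two steps above can be sketched directly; the frequent-itemset table below is hypothetical, standing in for the output of a frequent itemset mining pass:

```python
from itertools import combinations

def generate_rules(frequent, min_conf):
    """For each frequent itemset l, emit s => l - s whenever
    support_count(l) / support_count(s) >= min_conf."""
    rules = []
    for l, count_l in frequent.items():
        if len(l) < 2:
            continue
        for k in range(1, len(l)):
            for s in combinations(sorted(l), k):
                s = frozenset(s)
                conf = count_l / frequent[s]  # subsets of a frequent set are frequent
                if conf >= min_conf:
                    rules.append((set(s), set(l - s), conf))
    return rules

# Hypothetical frequent itemsets with support counts.
frequent = {
    frozenset({"milk"}): 4,
    frozenset({"bread"}): 4,
    frozenset({"milk", "bread"}): 3,
}
rules = generate_rules(frequent, min_conf=0.7)
for s, rest, conf in rules:
    print(s, "=>", rest, conf)  # both rules have confidence 0.75
```

No extra database scan is needed: every support count the confidence test requires is already in the frequent-itemset table, by the Apriori property.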
22. Improving the efficiency of apriori
• Hash-based technique: a hash-based technique can be used to reduce the size of the candidate k-itemsets, Ck, for k > 1.
• Example:
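One way to sketch the idea for k = 2: while scanning transactions to count 1-itemsets, hash every 2-itemset into a small bucket table, then prune any candidate pair whose bucket total falls below the minimum support count. The data and the bucket count here are illustrative:

```python
from itertools import combinations
from collections import Counter

# Hypothetical transactions and parameters.
transactions = [{"milk", "bread"}, {"milk", "bread"}, {"milk", "eggs"},
                {"bread", "eggs"}, {"milk", "bread"}]
MIN_SUP = 2
NUM_BUCKETS = 7

# While scanning for 1-itemset counts, also hash each 2-itemset.
buckets = Counter()
for t in transactions:
    for pair in combinations(sorted(t), 2):
        buckets[hash(pair) % NUM_BUCKETS] += 1

def may_be_frequent(pair):
    # A pair can be frequent only if its bucket total reaches MIN_SUP.
    # Collisions can only inflate a bucket, so this never falsely prunes
    # a frequent pair; it may fail to prune some infrequent ones.
    return buckets[hash(pair) % NUM_BUCKETS] >= MIN_SUP

print(may_be_frequent(("bread", "milk")))  # True
```

Any bucket whose total is below the threshold lets every candidate pair hashed to it be discarded before the counting scan, shrinking C2.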
23. Improving the efficiency of apriori
• Transaction reduction: reducing the number of transactions scanned in future iterations.
• A transaction that does not contain any frequent k-item sets cannot
contain any frequent (k+1) item sets.
• Such a transaction can be marked or removed from further
consideration.
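A minimal sketch of this reduction (function name and sample data are illustrative):

```python
# A transaction containing no frequent k-itemset cannot contain any
# frequent (k+1)-itemset, so it can be dropped from later scans.
def reduce_transactions(transactions, frequent_k):
    return [t for t in transactions if any(f <= t for f in frequent_k)]

frequent_2 = [frozenset({"milk", "bread"})]
D = [{"milk", "bread", "eggs"}, {"eggs", "sugar"}, {"milk", "bread"}]
print(len(reduce_transactions(D, frequent_2)))  # 2: {eggs, sugar} is removed
```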
24. Improving the efficiency of apriori
• Partitioning (two database scans):
  Partitioning the data to find candidate itemsets requires only two scans of D to mine the frequent itemsets.
• Phase I:
  Divide the transactions of D into n non-overlapping partitions.
  Find the local frequent itemsets for each partition.
  Any itemset that is frequent in D must occur as a frequent itemset in at least one of the partitions.
  Therefore, all local frequent itemsets are candidate itemsets with respect to D.
25. Improving the efficiency of apriori
• Phase II:
  A second scan of D is conducted to determine the global frequent itemsets. D is scanned only once in each phase.
• Sampling
• Sampling
• Dynamic itemset counting
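The two-phase partitioning scheme can be sketched as follows; the partition-local miner is deliberately brute force, and the data set is hypothetical:

```python
from itertools import combinations

def local_frequent(partition, min_sup_frac):
    """Phase I: all itemsets frequent within one partition (brute force)."""
    local_min = min_sup_frac * len(partition)
    items = sorted({i for t in partition for i in t})
    out = set()
    for k in range(1, len(items) + 1):
        for c in combinations(items, k):
            if sum(1 for t in partition if set(c) <= t) >= local_min:
                out.add(frozenset(c))
    return out

def partitioned_mining(D, n, min_sup_frac):
    size = (len(D) + n - 1) // n
    parts = [D[i:i + size] for i in range(0, len(D), size)]
    # Phase I: the union of local frequent itemsets is the candidate set,
    # since a globally frequent itemset is frequent in at least one partition.
    candidates = set().union(*(local_frequent(p, min_sup_frac) for p in parts))
    # Phase II: one more full scan counts each candidate globally.
    return {c for c in candidates
            if sum(1 for t in D if c <= t) >= min_sup_frac * len(D)}

D = [{"milk", "bread"}, {"milk", "bread"}, {"milk"}, {"bread"},
     {"milk", "bread"}, {"eggs"}]
result = partitioned_mining(D, n=2, min_sup_frac=0.5)
print(frozenset({"milk", "bread"}) in result)  # True
```

Phase I touches each transaction once per partition pass and Phase II once more, so the whole database is read exactly twice.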
26. A database has five transactions. Let min_sup = 60% and min_conf = 80%.
28. A pattern-growth approach for mining
frequent item sets
• Apriori algorithm disadvantages:
• It is a generate-and-test method; reducing the size of the candidate sets leads to a good performance gain.
• It still suffers from nontrivial costs: it may generate a huge number of candidate sets and may need to scan the database repeatedly.
29. Frequent pattern growth or FP growth
(Divide and Conquer)
• Mines the complete set of frequent itemsets without such costly candidate generation.
• First, it compresses the database representing frequent items into an FP-tree, which retains the itemset association information.
• Create the root of the tree, labelled "null".
• Scan D a second time.
• The items in each transaction are processed in L order (descending support count), and a branch is created for each transaction.
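The construction steps above can be sketched as follows; the Node class, function name, and sample transactions are illustrative:

```python
from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_sup):
    """Scan 1: count items and build L, the frequent items in descending
    support order.  Scan 2: insert each transaction, reordered by L,
    into a tree rooted at a null node, sharing common prefixes."""
    counts = Counter(i for t in transactions for i in t)
    L = [i for i, c in counts.most_common() if c >= min_sup]
    order = {item: rank for rank, item in enumerate(L)}
    root = Node(None, None)
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in order), key=order.get):
            child = node.children.get(item)
            if child is None:
                child = node.children[item] = Node(item, node)
            child.count += 1
            node = child
    return root, L

root, L = build_fp_tree(
    [{"milk", "bread"}, {"milk", "bread", "eggs"}, {"milk"}], min_sup=2)
print(L)                            # ['milk', 'bread']
print(root.children["milk"].count)  # 3
```

Because transactions sharing a prefix in L order share a path, the tree is usually much smaller than the database while retaining all support information for the frequent items.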
30. Mining the FP-tree
• Start from each frequent length-1 pattern (as an initial suffix pattern) and construct its conditional pattern base.
• Then construct its conditional FP-tree and perform mining recursively on that tree.
• Pattern growth is achieved by concatenating the suffix pattern with the frequent patterns generated from its conditional FP-tree.
• This method reduces the search cost.
• Algorithm: FP-growth
35. Mining closed and maximal patterns
• How can we mine closed frequent itemsets?
• Strategies include:
  Item merging
  Sub-itemset pruning
  Item skipping
• When a new frequent itemset is derived, it is necessary to perform two kinds of closure checking:
  Superset checking
  Subset checking
41. Advanced pattern mining
• What is pattern mining?
• Pattern mining: A Road map
Basic patterns: frequent patterns, closed patterns, max-patterns, infrequent or rare patterns, negative patterns
Based on the abstraction levels involved in a pattern: single-level
association rule, multilevel association rules
42. Pattern mining: A Road map
Based on the number of dimensions involved in the rule or pattern :
Single-dimensional association rule/pattern , Multidimensional
association rule/pattern
43. Pattern mining: A Road map
• Based on the types of values handled in the rule or pattern: Boolean
association rule, quantitative association rule
44. Pattern mining: A Road map
• Based on the constraints or criteria used to mine selective patterns: constraint-based, approximate, compressed, near-match, top-k, redundancy-aware top-k patterns
• Based on kinds of data and features to be mined: sequential patterns,
structural patterns
• Based on application domain-specific semantics
• Based on data analysis usages: pattern based classification, pattern
based clustering
46. Pattern mining in multilevel,
multidimensional space
• Mining multilevel associations
47. Pattern mining in multilevel,
multidimensional space
• Using uniform minimum support for all levels
• Using reduced minimum support at lower levels
48. Pattern mining in multilevel,
multidimensional space
• Using item or group-based minimum support
49. Pattern mining in multilevel,
multidimensional space
• Mining Multidimensional Associations
Single dimensional or intradimensional association rules
Multi dimensional or interdimensional association rules
50. Pattern mining in multilevel,
multidimensional space
• Mining quantitative association rules
A data cube method
A clustering-based method
A statistical analysis method to uncover exceptional behaviours
52. Pattern mining in multilevel,
multidimensional space
• Mining rare patterns and negative patterns
53. Constraint-based frequent pattern mining
• It includes the following: Knowledge type constraints, data
constraints, dimension/level constraints, Interestingness constraints,
Rule constraints
• Meta-rule-guided mining of association rules
• Constraint based pattern generation
• An efficient frequent pattern mining processor can prune its search
space during mining in two ways:
Pruning pattern search space
Pruning data search space
54. Constraint-based frequent pattern mining
• There are five categories of pattern mining constraints:
  Antimonotonic
  Monotonic
  Succinct
  Convertible
  Inconvertible