2. Association Rule Mining
• An AllElectronics customer buys a PC and a digital camera. What should you recommend next?
• Frequent patterns and association rules are the knowledge to be mined.
• Frequent patterns: patterns that appear frequently in a data set.
• Frequent itemsets: sets of items, such as milk and bread, that appear together frequently in a transaction data set.
• Frequent subsequences: patterns whose items occur frequently in a particular order across transactions, such as buying a PC followed by a digital camera.
• Frequent substructures: subgraphs, subtrees, or sublattices, possibly combined with itemsets or subsequences; a substructure that occurs frequently is called a frequent structured pattern.
3. Basic Concepts
• Mining frequent patterns plays an essential role in mining associations, correlations, data classification, clustering, etc.
• Market Basket Analysis:
  customer 1: milk, bread, cereal
  customer 2: milk, bread, sugar, eggs
  customer 3: milk, bread, butter
  customer 4: sugar, eggs
• Which groups or sets of items are customers likely to purchase on a
given trip to a store?
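The four baskets above can be scanned for co-occurring items. A minimal sketch in Python (the basket encoding is a hypothetical transcription of the slide's list):

```python
from itertools import combinations
from collections import Counter

# The four customer baskets from the slide, encoded as sets of item names.
transactions = [
    {"milk", "bread", "cereal"},
    {"milk", "bread", "sugar", "eggs"},
    {"milk", "bread", "butter"},
    {"sugar", "eggs"},
]

# Count every 2-item combination across the baskets.
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# {milk, bread} appears in 3 of the 4 baskets, the most frequent pair.
print(pair_counts[("bread", "milk")])  # 3
```

Pairs such as {milk, bread} that recur across many baskets are exactly the itemsets market basket analysis is after.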
4. Association Rules
• Support and confidence are two measures of rule interestingness.
  Support: reflects the usefulness of discovered rules.
  Confidence: reflects the certainty of discovered rules.
  Example: computer => antivirus_software [support = 2%, confidence = 60%]
  Support: 2% of all the transactions under analysis show that computer and antivirus software are purchased together.
  Confidence: 60% of the customers who purchased a computer also bought the software.
5. Association Rules
• Association rules are interesting if they satisfy both a minimum
support threshold and a minimum confidence threshold
• Frequent itemsets, closed itemsets, and association rules:
  I = {I1, I2, ..., In}: the set of all items
  D: the task-relevant data, a set of database transactions
  T: a transaction, a set of items such that T ⊆ I
  Rule: A => B, where A ⊂ I, B ⊂ I, and A ∩ B = ∅
  Support(A => B) = P(A ∪ B)   (relative support)
  Confidence(A => B) = P(B|A)
6. Association Rules
• Itemsets
• K-itemsets: itemsets containing k items
• Occurrence frequency of an itemset: the number of transactions that contain the itemset
• Minimum support threshold: if the relative support of an itemset I satisfies a prespecified minimum support threshold, then I is a frequent itemset.
• Confidence(A => B) = P(B|A)
                     = support(A ∪ B) / support(A)
                     = support_count(A ∪ B) / support_count(A)
• Thus the problem of mining association rules can be reduced to that of mining frequent itemsets.
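The support and confidence formulas above can be computed directly from support counts. A sketch on a hypothetical ten-transaction database (item names and counts are illustrative, chosen so that confidence comes out at 60% as in the earlier example):

```python
def support_count(itemset, transactions):
    """Number of transactions containing every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

# Hypothetical database: 5 of 10 transactions contain a computer,
# and 3 of those also contain antivirus software.
D = [
    {"computer", "antivirus"}, {"computer", "antivirus"},
    {"computer", "antivirus"}, {"computer", "printer"}, {"computer"},
    {"printer"}, {"scanner"}, {"printer", "scanner"}, {"scanner"}, {"printer"},
]
A, B = {"computer"}, {"antivirus"}

support = support_count(A | B, D) / len(D)                  # P(A ∪ B) = 3/10
confidence = support_count(A | B, D) / support_count(A, D)  # 3/5
print(support, confidence)  # 0.3 0.6
```

Both measures reduce to counting transactions, which is why mining association rules reduces to mining frequent itemsets.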
7. Frequent Itemsets in a Data Set (Association Rule
Mining)
• Association mining searches for frequent itemsets in a data set. Frequent itemset mining finds the interesting associations and correlations between itemsets in transactional and relational databases. In short, it shows which items appear together in a transaction or relation.
• Need for association mining:
  Frequent itemset mining generates association rules from a transactional data set. If two items X and Y are purchased together frequently, it is good to place them together in stores or to offer a discount on one item with the purchase of the other; this can really increase sales. For example, it is likely that a customer who buys milk and bread also buys butter.
  So the association rule is [milk]^[bread] => [butter], and the seller can suggest butter to a customer who buys milk and bread.
8. Important Definitions:
• Support: one of the measures of interestingness; it tells how useful a rule is. A support of 5% means that 5% of all transactions in the database follow the rule.
• Support(A => B) = Support_count(A ∪ B) / |D|, where |D| is the total number of transactions
• Confidence: tells how certain a rule is. A confidence of 60% means that 60% of the customers who purchased milk and bread also bought butter.
• Confidence(A => B) = Support_count(A ∪ B) / Support_count(A)
• If a rule satisfies both minimum support and minimum confidence, it is a strong rule.
9. Important Definitions:
• Support_count(X): the number of transactions in which X appears. If X = A ∪ B, it is the number of transactions in which A and B are both present.
1. Maximal itemset: an itemset is maximal frequent if none of its supersets is frequent.
2. Closed itemset: an itemset is closed if none of its immediate supersets has the same support count as the itemset.
3. K-itemset: an itemset that contains K items. An itemset is frequent if its support count is at least the minimum support count.
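The three definitions above can be checked mechanically. A brute-force sketch (the tiny data set is hypothetical, just to exercise the checks):

```python
from itertools import combinations

def support_count(itemset, transactions):
    return sum(1 for t in transactions if itemset <= t)

def frequent_itemsets(transactions, min_sup):
    """Brute-force enumeration of all frequent itemsets with their counts."""
    items = sorted(set().union(*transactions))
    return {frozenset(c): support_count(set(c), transactions)
            for k in range(1, len(items) + 1)
            for c in combinations(items, k)
            if support_count(set(c), transactions) >= min_sup}

def is_closed(x, freq):
    # Closed: no proper superset has the same support count.  Checking only
    # frequent supersets suffices, since an equal-count superset of a
    # frequent itemset is itself frequent.
    return all(freq[s] < freq[x] for s in freq if x < s)

def is_maximal(x, freq):
    # Maximal frequent: no proper superset is frequent.
    return not any(x < s for s in freq)

T = [{"A", "C"}, {"A", "C"}, {"A", "C"}, {"B", "C"}]
freq = frequent_itemsets(T, min_sup=2)
print(is_closed(frozenset("A"), freq))          # False: {A, C} also has count 3
print(is_maximal(frozenset({"A", "C"}), freq))  # True
```

Brute force is exponential in the number of items; it is shown only to make the definitions concrete.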
10. Example On finding Frequent Itemsets
• Consider the given data set of transactions.
• Let the minimum support count be 3.
• The relation that holds is: maximal frequent => closed => frequent (every maximal frequent itemset is closed, and every closed frequent itemset is frequent).
11. • 1-frequent:
• {A} = 3 // not closed due to {A, C}; not maximal
• {B} = 4 // not closed due to {B, D}; not maximal
• {C} = 4 // not closed due to {C, D}; not maximal
• {D} = 5 // closed, since no immediate superset has the same count; not maximal
• 2-frequent:
• {A, B} = 2 // support count < minimum support count, so not frequent; ignore
• {A, C} = 3 // not closed due to {A, C, D}
• {A, D} = 3 // not closed due to {A, C, D}
• {B, C} = 3 // not closed due to {B, C, D}
• {B, D} = 4 // closed but not maximal due to {B, C, D}
• {C, D} = 4 // closed but not maximal due to {B, C, D}
• 3-frequent:
• {A, B, C} = 2 // support count < minimum support count, so not frequent; ignore
• {A, B, D} = 2 // support count < minimum support count, so not frequent; ignore
• {A, C, D} = 3 // maximal frequent
• {B, C, D} = 3 // maximal frequent
• 4-frequent:
• {A, B, C, D} = 2 // not frequent; ignore
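The slide's transaction table is not reproduced here; the data set below is a hypothetical reconstruction chosen so that every support count listed above comes out the same, which lets the maximal frequent itemsets be verified by enumeration:

```python
from itertools import combinations

# Hypothetical 5-transaction data set consistent with the counts above.
T = [set("ABCD"), set("ABCD"), set("ACD"), set("BCD"), set("BD")]
MIN_SUP = 3

def count(itemset):
    return sum(1 for t in T if set(itemset) <= t)

# Spot-check a few of the listed support counts.
print(count("D"), count("BD"), count("ACD"), count("ABCD"))  # 5 4 3 2

# Frequent itemsets, then the maximal frequent ones among them.
freq = {frozenset(c): count(c)
        for k in range(1, 5) for c in combinations("ABCD", k)
        if count(c) >= MIN_SUP}
maximal = sorted(sorted(f) for f in freq if not any(f < g for g in freq))
print(maximal)  # [['A', 'C', 'D'], ['B', 'C', 'D']]
```

The enumeration confirms {A, C, D} and {B, C, D} as the only maximal frequent itemsets at a minimum support count of 3.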
12. AR as Two step Process
• Find all frequent item sets
• Generate strong association rules from the frequent item sets
• Challenge in mining frequent itemsets: a long frequent itemset implies a huge number of shorter frequent sub-itemsets; mining closed and maximal frequent itemsets mitigates this.
• Closed frequent itemset: an itemset X is closed in a data set D if there exists no proper super-itemset Y such that Y has the same support count as X in D.
• Maximal frequent itemset: an itemset X is a maximal frequent itemset in a data set D if X is frequent and there exists no super-itemset Y such that X ⊂ Y and Y is frequent in D.
13. Example: closed and maximal frequent
item sets
• A transaction database has only two transactions:
  {<a1, a2, ..., a100>; <a1, a2, ..., a50>}; min_sup = 1
• We find two closed frequent itemsets and their support counts:
  C = {{a1, a2, ..., a100}: 1; {a1, a2, ..., a50}: 2}
• There is only one maximal frequent itemset:
  M = {{a1, a2, ..., a100}: 1}
• We cannot include {a1, a2, ..., a50} as a maximal frequent itemset because it has a frequent superset, {a1, a2, ..., a100}.
• C-closed frequent item set, M-Maximal frequent item sets
14. Example: closed and maximal frequent
item sets
• The set of closed frequent itemsets contains complete information regarding the frequent itemsets.
• From C, we can derive:
  (i) {a2, a45: 2}, since {a2, a45} is a sub-itemset of the itemset {a1, a2, ..., a50: 2};
  (ii) {a8, a55: 1}, since {a8, a55} is not a sub-itemset of the previous itemset but of the itemset {a1, a2, ..., a100: 1}.
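The derivation above generalizes: the support of any frequent itemset X is the maximum support count among the closed frequent itemsets that contain X. A sketch using the two closed sets from slide 13:

```python
# C: the closed frequent itemsets of slide 13 with their support counts.
C = {
    frozenset(f"a{i}" for i in range(1, 101)): 1,  # {a1, ..., a100}: 1
    frozenset(f"a{i}" for i in range(1, 51)): 2,   # {a1, ..., a50}: 2
}

def support_from_closed(X):
    """Support of X = max count over closed itemsets containing X."""
    counts = [c for closed, c in C.items() if X <= closed]
    return max(counts) if counts else 0

print(support_from_closed({"a2", "a45"}))  # 2
print(support_from_closed({"a8", "a55"}))  # 1
```

This is why storing only the closed frequent itemsets loses no support information.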
15. Frequent Itemset Mining Methods: Apriori
and FP Growth
• Apriori algorithm:
  Finds frequent itemsets by confined candidate generation.
  A seminal algorithm proposed by R. Agrawal and R. Srikant in 1994 for mining frequent itemsets.
  The algorithm is so named because it uses prior knowledge of frequent itemset properties.
  Apriori property: all non-empty subsets of a frequent itemset must also be frequent.
  Each iteration consists of a join step and a prune step.
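The level-wise join and prune steps can be sketched as follows; this is a compact illustration of the Apriori scheme, not an optimized implementation (the sample basket data is hypothetical):

```python
from itertools import combinations

def support_count(itemset, transactions):
    return sum(1 for t in transactions if itemset <= t)

def apriori(transactions, min_sup):
    """Level-wise Apriori: join frequent (k-1)-itemsets, prune candidates
    with an infrequent subset, then count the survivors."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({i for t in transactions for i in t})
    # L1: frequent 1-itemsets.
    Lk = [frozenset([i]) for i in items
          if support_count(frozenset([i]), transactions) >= min_sup]
    all_frequent = {s: support_count(s, transactions) for s in Lk}
    k = 2
    while Lk:
        # Join step: unions of frequent (k-1)-itemsets that form k-itemsets.
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == k}
        # Prune step (Apriori property): every (k-1)-subset must be frequent.
        prev = set(Lk)
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev
                             for s in combinations(c, k - 1))}
        Lk = [c for c in candidates
              if support_count(c, transactions) >= min_sup]
        for c in Lk:
            all_frequent[c] = support_count(c, transactions)
        k += 1
    return all_frequent

freq = apriori([{"milk", "bread", "cereal"}, {"milk", "bread", "eggs"},
                {"milk", "bread", "butter"}, {"eggs", "sugar"}], min_sup=2)
print(freq[frozenset({"milk", "bread"})])  # 3
```

The prune step is where the Apriori property pays off: a candidate is discarded without a database scan as soon as one of its subsets is known to be infrequent.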
19. Generating Association Rules from
frequent item sets
• Once the frequent item sets from transactions have been found, it is
straightforward to generate strong association rules from them
• Strong association rules satisfy both minimum support and minimum
confidence
• Confidence(A => B) = P(B|A) = support_count(A ∪ B) / support_count(A)
20. Generating Association Rules from
frequent item sets
• Association rules are generated as follows:
  For each frequent itemset l, generate all non-empty proper subsets of l.
  For every non-empty proper subset s of l, output the rule
  "s => l - s" if support_count(l) / support_count(s) >= min_conf.
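The two steps above can be sketched directly; the frequent-itemset table below is hypothetical, standing in for the output of a frequent itemset mining pass:

```python
from itertools import combinations

def generate_rules(frequent, min_conf):
    """For each frequent itemset l, emit s => l - s whenever
    support_count(l) / support_count(s) >= min_conf."""
    rules = []
    for l, count_l in frequent.items():
        if len(l) < 2:
            continue
        for k in range(1, len(l)):
            for s in combinations(sorted(l), k):
                s = frozenset(s)
                conf = count_l / frequent[s]  # subsets of a frequent set are frequent
                if conf >= min_conf:
                    rules.append((set(s), set(l - s), conf))
    return rules

# Hypothetical frequent itemsets with support counts.
frequent = {
    frozenset({"milk"}): 4,
    frozenset({"bread"}): 4,
    frozenset({"milk", "bread"}): 3,
}
rules = generate_rules(frequent, min_conf=0.7)
for s, rest, conf in rules:
    print(s, "=>", rest, conf)  # both rules have confidence 0.75
```

No extra database scan is needed: every support count the confidence test requires is already in the frequent-itemset table, by the Apriori property.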
22. Improving the efficiency of apriori
• Hash-based technique: a hash-based technique can be used to reduce the size of the candidate k-itemsets, Ck, for k > 1.
• Example:
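One way to sketch the idea for k = 2: while scanning transactions to count 1-itemsets, hash every 2-itemset into a small bucket table, then prune any candidate pair whose bucket total falls below the minimum support count. The data and the bucket count here are illustrative:

```python
from itertools import combinations
from collections import Counter

# Hypothetical transactions and parameters.
transactions = [{"milk", "bread"}, {"milk", "bread"}, {"milk", "eggs"},
                {"bread", "eggs"}, {"milk", "bread"}]
MIN_SUP = 2
NUM_BUCKETS = 7

# While scanning for 1-itemset counts, also hash each 2-itemset.
buckets = Counter()
for t in transactions:
    for pair in combinations(sorted(t), 2):
        buckets[hash(pair) % NUM_BUCKETS] += 1

def may_be_frequent(pair):
    # A pair can be frequent only if its bucket total reaches MIN_SUP.
    # Collisions can only inflate a bucket, so this never falsely prunes
    # a frequent pair; it may fail to prune some infrequent ones.
    return buckets[hash(pair) % NUM_BUCKETS] >= MIN_SUP

print(may_be_frequent(("bread", "milk")))  # True
```

Any bucket whose total is below the threshold lets every candidate pair hashed to it be discarded before the counting scan, shrinking C2.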
23. Improving the efficiency of apriori
• Transaction reduction: reducing the number of transactions scanned in future iterations.
• A transaction that does not contain any frequent k-item sets cannot
contain any frequent (k+1) item sets.
• Such a transaction can be marked or removed from further
consideration.
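A minimal sketch of this reduction (function name and sample data are illustrative):

```python
# A transaction containing no frequent k-itemset cannot contain any
# frequent (k+1)-itemset, so it can be dropped from later scans.
def reduce_transactions(transactions, frequent_k):
    return [t for t in transactions if any(f <= t for f in frequent_k)]

frequent_2 = [frozenset({"milk", "bread"})]
D = [{"milk", "bread", "eggs"}, {"eggs", "sugar"}, {"milk", "bread"}]
print(len(reduce_transactions(D, frequent_2)))  # 2: {eggs, sugar} is removed
```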
24. Improving the efficiency of apriori
• Partitioning (two database scans):
  Partitioning the data to find candidate itemsets requires only two scans of D to mine the frequent itemsets.
• Phase I:
  Divide the transactions of D into n non-overlapping partitions.
  Find the local frequent itemsets for each partition.
  Any itemset that is frequent in D must occur as a frequent itemset in at least one of the partitions.
  Therefore, all local frequent itemsets are candidate itemsets with respect to D.
25. Improving the efficiency of apriori
• Phase II:
  A second scan of D is conducted to determine the global frequent itemsets. D is scanned only once in each phase.
• Sampling
• Sampling
• Dynamic itemset counting
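The two-phase partitioning scheme can be sketched as follows; the partition-local miner is deliberately brute force, and the data set is hypothetical:

```python
from itertools import combinations

def local_frequent(partition, min_sup_frac):
    """Phase I: all itemsets frequent within one partition (brute force)."""
    local_min = min_sup_frac * len(partition)
    items = sorted({i for t in partition for i in t})
    out = set()
    for k in range(1, len(items) + 1):
        for c in combinations(items, k):
            if sum(1 for t in partition if set(c) <= t) >= local_min:
                out.add(frozenset(c))
    return out

def partitioned_mining(D, n, min_sup_frac):
    size = (len(D) + n - 1) // n
    parts = [D[i:i + size] for i in range(0, len(D), size)]
    # Phase I: the union of local frequent itemsets is the candidate set,
    # since a globally frequent itemset is frequent in at least one partition.
    candidates = set().union(*(local_frequent(p, min_sup_frac) for p in parts))
    # Phase II: one more full scan counts each candidate globally.
    return {c for c in candidates
            if sum(1 for t in D if c <= t) >= min_sup_frac * len(D)}

D = [{"milk", "bread"}, {"milk", "bread"}, {"milk"}, {"bread"},
     {"milk", "bread"}, {"eggs"}]
result = partitioned_mining(D, n=2, min_sup_frac=0.5)
print(frozenset({"milk", "bread"}) in result)  # True
```

Phase I touches each transaction once per partition pass and Phase II once more, so the whole database is read exactly twice.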
26. A database has five transactions. Let min_sup = 60% and min_conf = 80%.
28. A pattern-growth approach for mining
frequent item sets
• Apriori algorithm disadvantages:
• It is a generate-and-test method; reducing the size of the candidate sets leads to a good performance gain.
• It still suffers from nontrivial costs: it may generate a huge number of candidate sets and may need to scan the database repeatedly.
29. Frequent pattern growth or FP growth
(Divide and Conquer)
• Mines the complete set of frequent itemsets without such costly candidate generation.
• First, it compresses the database representing frequent items into an FP-tree, which retains the itemset association information.
• Create the root of the tree, labelled "null".
• Scan D a second time.
• The items in each transaction are processed in L order (descending support count), and a branch is created for each transaction.
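The construction steps above can be sketched as follows; the Node class, function name, and sample transactions are illustrative:

```python
from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_sup):
    """Scan 1: count items and build L, the frequent items in descending
    support order.  Scan 2: insert each transaction, reordered by L,
    into a tree rooted at a null node, sharing common prefixes."""
    counts = Counter(i for t in transactions for i in t)
    L = [i for i, c in counts.most_common() if c >= min_sup]
    order = {item: rank for rank, item in enumerate(L)}
    root = Node(None, None)
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in order), key=order.get):
            child = node.children.get(item)
            if child is None:
                child = node.children[item] = Node(item, node)
            child.count += 1
            node = child
    return root, L

root, L = build_fp_tree(
    [{"milk", "bread"}, {"milk", "bread", "eggs"}, {"milk"}], min_sup=2)
print(L)                            # ['milk', 'bread']
print(root.children["milk"].count)  # 3
```

Because transactions sharing a prefix in L order share a path, the tree is usually much smaller than the database while retaining all support information for the frequent items.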
30. Mining the FP-tree
• Start from each frequent length-1 pattern (as an initial suffix pattern) and construct its conditional pattern base.
• Then construct its conditional FP-tree and perform mining recursively on that tree.
• Pattern growth is achieved by concatenating the suffix pattern with the frequent patterns generated from its conditional FP-tree.
• This method reduces the search cost.
• Algorithm: FP-growth
35. Mining closed and maximal patterns
• How can we mine closed frequent itemsets?
• Strategies include:
  Item merging
  Sub-itemset pruning
  Item skipping
• When a new frequent itemset is derived, it is necessary to perform two kinds of closure checking:
  Superset checking
  Subset checking
41. Advanced pattern mining
• What is pattern mining?
• Pattern mining: A Road map
Basic patterns: frequent patterns, closed patterns, max-patterns, infrequent or rare patterns, negative patterns
Based on the abstraction levels involved in a pattern: single-level
association rule, multilevel association rules
42. Pattern mining: A Road map
Based on the number of dimensions involved in the rule or pattern :
Single-dimensional association rule/pattern , Multidimensional
association rule/pattern
43. Pattern mining: A Road map
• Based on the types of values handled in the rule or pattern: Boolean
association rule, quantitative association rule
44. Pattern mining: A Road map
• Based on the constraints or criteria used to mine selective patterns: constraint-based, approximate, compressed, near-match, top-k, redundancy-aware top-k patterns
• Based on kinds of data and features to be mined: sequential patterns,
structural patterns
• Based on application domain-specific semantics
• Based on data analysis usages: pattern based classification, pattern
based clustering
46. Pattern mining in multilevel,
multidimensional space
• Mining multilevel associations
47. Pattern mining in multilevel,
multidimensional space
• Using uniform minimum support for all levels
• Using reduced minimum support at lower levels
48. Pattern mining in multilevel,
multidimensional space
• Using item or group-based minimum support
49. Pattern mining in multilevel,
multidimensional space
• Mining Multidimensional Associations
Single dimensional or intradimensional association rules
Multi dimensional or interdimensional association rules
50. Pattern mining in multilevel,
multidimensional space
• Mining quantitative association rules
A data cube method
A clustering-based method
A statistical analysis method to uncover exceptional behaviours
52. Pattern mining in multilevel,
multidimensional space
• Mining rare patterns and negative patterns
53. Constraint-based frequent pattern mining
• It includes the following: Knowledge type constraints, data
constraints, dimension/level constraints, Interestingness constraints,
Rule constraints
• Meta-rule-guided mining of association rules
• Constraint based pattern generation
• An efficient frequent pattern mining processor can prune its search
space during mining in two ways:
Pruning pattern search space
Pruning data search space
54. Constraint-based frequent pattern mining
• There are five categories of pattern mining constraints:
  Antimonotonic
  Monotonic
  Succinct
  Convertible
  Inconvertible