The document proposes leveraging the efficiencies of bitmap operations and vertical data formats to perform association rule mining in a distributed environment. It describes how a horizontal transactional dataset can be converted to a vertical bitmap structure in a single pass, representing itemsets and their associations. This structure allows computation of frequent itemsets to be performed efficiently through bitmap operations and lends itself well to distributed processing by mapping each item index to separate processes. The approach could potentially outperform the traditional Apriori algorithm by avoiding multiple scans of data and enabling parallel mining.
Time Series Foundation Models - current state and future directions
Distributed Association Rule Mining Using Vertical Bitmap Formats
1.
2. Question?
• Is it possible to leverage benefits of
vertical data formats in combination
with efficiencies of bitmap operations
to mine association rules in a
distributed environment.
3. Association Rule Mining??
• Finding Interesting Relationships
between the variables.
• Finding the subset that is common to a
chosen minimum number of the
itemsets from the set of itemsets.
• Pattern Recognition.
• Explained By Market Basket Analysis.
4. Sample (Toy ) Data
Set
TID Item ID’s
T100 I1, I2, I5
T200 I2, I4
T300 I1, I2
T400 I2, I5
5. Apriori
• Fundamental Algorithm for Association
Rule Mining.
• Mines frequent patterns from a horizontal
data format which represents the items
categorized into particular transactions.
• i-th stage identifies all frequent i-element
sets.
• Two steps:
• > Candidate generation.
• > Candidate counting.
6. Vertical Form
• Transactions categorized into particular items.
• Vertical format data mining only has to parse
the dataset once to get the itemsets.
• For the itemset generation from the 2nd
itemset it only needs to refer the previous
itemset.
• Eliminates parsing through the dataset each
time to count the frequency of itemsets, for
each round.
• More efficient than its horizontal form.
7. BitMaps
• Compactly store individual bits.
• Exploit bit-level parallelism effectively.
• 0’s and 1’s.
• 1 indicates existence.
8. Combined?
• Algorithm takes a horizontal data set.
• With a one pass of the data set
construct a bit map based data
structure.
• This structure is in vertical format.
• The structure facilitates efficient mining
of association rules.
9. Sample (Toy ) Data
Set
TID Item ID’s
T100 I1, I2, I5
T200 I2, I4
T300 I1, I2
T400 I2, I5
10. Sample (Toy ) Data
Set
TID Item ID’s
T100 I1, I2, I5
Horizontal
Format T200 I2, I4
T300 I1, I2
T400 I2, I5
26. Insights
• The algorithm performs better than
Apriori in most scenarios.
• Data structure generation dominates
the total time in most cases.
• As an aside…
• Can this be made to a distributed
mining algorithm?
27. Turns out this can be done rather easily.
Algorithm lends to map reduce like
distributed processing..
Each master array index is self
contained.. I1 I2 I5
2
1 1
1 0
So can be mined in parallel.
Data structure generation Map phase
Result accumulation -> Reduce phase
28. What Does Future Hold?
• Make this distributed.
• Java not the best of options. Use C so
we can control memory allocations the
way we want.
• Experiment with bitmap compression
techniques.