Apriori data mining in the cloud

Case study: d60 Raptor
smartAdvisor

Jan Neerbek
Alexandra Institute

Agenda

· d60: A cloud/data mining case
· Cloud
· Data Mining
· Market Basket Analysis
· Large data sets
· Our solution

2

Alexandra Institute

The Alexandra Institute is a non-profit
company that works with application-
oriented IT research.

Focus is pervasive computing, and we
activate the business potential of our
members and customers through research-
based userdriven innovation.

3

The case: d60

· Danish company
· A similar products recommendation engine
· d60 was outgrowing their servers (late 2010)
· They saw a potential in moving to Azure

4

The setup

Product
Internet Recommendations
Webshops

Log shopping
patterns

Do data mining

5

The cloud potential

· Elasticity
· No upfront server cost
· Cheaper licenses
· Faster calculations

6

Challenges

· No SQL Server Analysis Services (SSAS)
· Small compute nodes
· Partioned database (50GB)
· SQL server ingress/outgress access is
slow

7

The cloud

Node Node
Node

Node

Node
Node
Node

8

The cloud and services

Node Node
Node

Node
Data layer
service

Node Messaging
Node Service
Node

9

Data layer service

Data layer
· Application specific (schema/layout) service

· SQL, table or other
· Easy a bottleneck
· Can be difficult to scale

10

Messaging service
Task Queues

· Standard data structure Messaging
Service
· Build-in ordering (FIFO)
· Can be scaled
· Good for asynchronous messages

11

Data mining

Data mining is the use of automated data analysis
techniques to uncover relationships among data
items

Market basket analysis is a data mining
technique that discovers co-occurrence
relationships among activities performed by
specific individuals

[about.com/wikipedia.org]
13

Market basket analysis

Customer1 Customer2 Customer3 Customer4
Avocado Milk Beef Cereal
Milk Diapers Lemons Beer
Butter Avocado Beer Beef
Potatoes Beer Chips Diapers

14


Customer1 Customer2 Customer3 Customer4
Avocado Milk Beef Cereal
Milk Diapers Lemons Beer
Butter Avocado Beer Beef
Potatoes Beer Chips Diapers

Itemset (Diapers, Beer) occur 50%

Frequency threshold parameter
Find as many frequent itemsets as possible
15


Popular effective algorithm: FP-growth 
Based on data structure FP-tree
Requires all data in near-memory 
Most research in distributed models has been for
cluster setups 

16

Building the FP-tree
(extends the prefix-tree structure)

Customer1
Avocado
Avocado
Milk
Butter
Butter
Potatoes

Milk

Potatoes

17


Customer2
Avocado
Milk
Diapers
Avocado
Butter
Beer

Milk

Potatoes

18


Customer2
Avocado
Milk
Diapers
Avocado
Butter Beer
Beer

Milk Diapers

Potatoes Milk

19


Customer2
Avocado
Milk
Diapers
Avocado
Butter Beer
Beer

Milk Diapers

Potatoes Milk

20


Avocado Beef

Butter Beer Beer

Milk Diapers Chips Cereal

Potatoes Milk Lemon Diapers

21

FP-growth

Grows the frequent itemsets, recusively

FP-growth(FP-tree tree)
{
…
for-each (item in tree)
count =CountOccur(tree,item);
if (IsFrequent(count))
{
OutputSet(item);
sub = tree.GetTree(tree, item);
FP-growth(sub);
}

22

FP-growth algorithm
Divide and Conquer

Traverse tree
Avocado Beef

Butter Beer Beer



23

FP-growth algorithm
Divide and Conquer

Generate sub-trees
Avocado Beef

Butter Beer Beer



24

FP-growth algorithm
Divide and Conquer

Call recursively
Avocado Beef

Butter Beer Beer Avocado

Milk Diapers Chips Cereal Butter Beer

Diapers

25

FP-growth algorithm
Memory usage

The FP-tree does not fit in local memory; what to
do?
· Emulate Distributed Shared Memory

26

Distributed Shared Memory?

CPU CPU CPU CPU CPU

Memory Memory Memory Memory Memory

Network

Shared Memory

· To add nodes is to add memory
· Works best in tightly coubled setups, with low-lantency,
high-speed networks
27

FP-growth algorithm
Memory usage

The FP-tree does not fit in local memory; what to
do?
· Emulate Distributed Shared Memory
· Optimize your data structures
· Buy more RAM
· Get a good idea

28

Get a good idea

· Database scans are serial and can be
distributed
· The list of items used in the recursive calls
uniquely determines what part of data we are
looking at

29

Get a good idea

Avocado Beef

Butter Beer Beer Avocado

Milk Diapers Chips Cereal Butter Beer

Diapers

30

Get a good idea

Avocado

Avocado Butter, Milk

Butter Beer

Diapers

Milk
Avocado

Beer

Diapers,Milk
These are postfix paths
31

Buckets

· Use postfix paths for messaging
· Working with buckets

Transactions

Items

33

FP-growth revisited
Replaced with
FP-growth(FP-tree tree) postfix

{
…
Done in parallel
for-each (item in tree)
Done in parallel
count =CountOccur(tree,item);
if (IsFrequent(count))
{
OutputSet(item);
Done in parallel sub = tree.GetTree(tree, item);
FP-growth(sub);
}

34

Communication

Node Node

Data layer

Node Node

35

Revised Communication

Node Node

MQ
Data layer

Node Node

36

Running FP-growth

Distribute buckets

Count items
(with postfix size=n)

Collect counts
(per postfix)
Call recursive

Standard FP-growth

37

Running FP-growth

Distribute buckets

Count items
(with postfix size=n)

Collect counts
(per postfix)
Call recursive

Standard FP-growth

38

Collecting what we have learned

· Message-driven work, using message-queue
· Peer-to-peer for intermediate results
· Distribute data for scalability (buckets)
· Small messages (list of items)
· Allow us to distribute FP-growth

39

Advantages

· Configurable work sizes
· Good distribution of work
· Robust against computer failure
· Fast!

40

So what about performance?
04:30:00

04:00:00

03:30:00

03:00:00

02:30:00 Message-driven FP-growth

FP-growth
02:00:00

Total node time
01:30:00

01:00:00

00:30:00

00:00:00
1 2 4 8

41

Apriori data mining in the cloud

Recommended

Recommended

More Related Content

Viewers also liked

Viewers also liked (16)

Similar to Apriori data mining in the cloud

Similar to Apriori data mining in the cloud (13)

More from Alexandra Instituttet

More from Alexandra Instituttet (11)

Recently uploaded

Recently uploaded (20)

Apriori data mining in the cloud

Editor's Notes