1. Case study: d60 Raptor
smartAdvisor
Jan Neerbek
Alexandra Institute
2. Agenda
· d60: A cloud/data mining case
· Cloud
· Data Mining
· Market Basket Analysis
· Large data sets
· Our solution
3. Alexandra Institute
The Alexandra Institute is a non-profit
company that works with application-oriented
IT research.
Our focus is pervasive computing, and we
activate the business potential of our
members and customers through research-based,
user-driven innovation.
4. The case: d60
· Danish company
· A similar-products recommendation engine
· d60 was outgrowing their servers (late 2010)
· They saw potential in moving to Azure
5. The setup
[Diagram: webshops on the internet log shopping patterns to the product, which does data mining and sends recommendations back]
6. The cloud potential
· Elasticity
· No upfront server cost
· Cheaper licenses
· Faster calculations
7. Challenges
· No SQL Server Analysis Services (SSAS)
· Small compute nodes
· Partitioned database (50 GB)
· SQL Server ingress/egress access is slow
9. The cloud and services
[Diagram: a set of compute nodes connected to a data layer service and a messaging service]
10. Data layer service
· Application-specific (schema/layout)
· SQL, table storage, or other
· Easily becomes a bottleneck
· Can be difficult to scale
11. Messaging service
Task Queues
· Standard data structure
· Built-in ordering (FIFO)
· Can be scaled
· Good for asynchronous messages
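The FIFO contract these bullets rely on can be sketched with an in-process queue (Python's standard library here; in the real system a cloud messaging service, or the team's own queue, plays this role across machines):

```python
from queue import Queue

# In-process stand-in for a cloud message queue;
# only the FIFO ordering guarantee is illustrated.
tasks = Queue()

# Producer: enqueue small task messages (here, lists of items).
for message in (["Diapers"], ["Diapers", "Beer"], ["Milk"]):
    tasks.put(message)

# Consumer: messages come out in the order they went in (FIFO).
received = []
while not tasks.empty():
    received.append(tasks.get())

print(received)
```

Because producers and consumers only share the queue, either side can be scaled independently, which is what makes this a good fit for asynchronous work distribution.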
13. Data mining
Data mining is the use of automated data analysis
techniques to uncover relationships among data
items
Market basket analysis is a data mining
technique that discovers co-occurrence
relationships among activities performed by
specific individuals
[about.com/wikipedia.org]
15. Market basket analysis
Customer1: Avocado, Milk, Butter, Potatoes
Customer2: Milk, Diapers, Avocado, Beer
Customer3: Beef, Lemons, Beer, Chips
Customer4: Cereal, Beer, Beef, Diapers

The itemset (Diapers, Beer) occurs in 50% of the baskets
The frequency threshold is a parameter
Goal: find as many frequent itemsets as possible
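The 50% support figure can be reproduced with a short sketch (Python for illustration; the baskets are the four from the slide):

```python
from collections import Counter
from itertools import combinations

# The four baskets from the slide; each basket is one transaction.
baskets = [
    {"Avocado", "Milk", "Butter", "Potatoes"},   # customer 1
    {"Milk", "Diapers", "Avocado", "Beer"},      # customer 2
    {"Beef", "Lemons", "Beer", "Chips"},         # customer 3
    {"Cereal", "Beer", "Beef", "Diapers"},       # customer 4
]

# Count every 2-itemset's co-occurrences across the baskets.
counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        counts[pair] += 1

support = counts[("Beer", "Diapers")] / len(baskets)
print(support)  # (Diapers, Beer) appears in 2 of 4 baskets -> 0.5
```

Enumerating all pairs like this is exponential in the itemset size, which is exactly why smarter algorithms such as FP-growth are needed at scale.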
16. Market basket analysis
A popular, effective algorithm: FP-growth
Based on the FP-tree data structure
Requires all data in near (fast) memory
Most research on distributed versions has targeted cluster setups
17. Building the FP-tree
(extends the prefix-tree structure)
[Diagram: customer 1's basket (Avocado, Butter, Milk, Potatoes) inserted as a single path from the root]
18. Building the FP-tree
[Diagram: customer 2's basket (Avocado, Milk, Diapers, Beer) shown next to the tree built so far]
19. Building the FP-tree
[Diagram: after inserting customer 2, the shared Avocado prefix splits into a Butter branch and a Beer branch]
20. Building the FP-tree
[Diagram: the same tree, highlighting that Milk now appears in two branches]
21. Building the FP-tree
[Diagram: the complete FP-tree after all four customers have been inserted]
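The insertion step in the slides above can be sketched as follows (alphabetical item order, as in the slides; real FP-trees sort items by descending frequency and also keep the header-link pointers that are omitted here):

```python
class Node:
    """One FP-tree node: an item, a count, and child nodes."""
    def __init__(self, item):
        self.item = item
        self.count = 0
        self.children = {}

def insert(root, transaction):
    """Insert one sorted transaction, sharing any existing prefix."""
    node = root
    for item in transaction:
        node = node.children.setdefault(item, Node(item))
        node.count += 1

root = Node(None)  # the null root, not shown on the slides
insert(root, ["Avocado", "Butter", "Milk", "Potatoes"])  # customer 1
insert(root, ["Avocado", "Beer", "Diapers", "Milk"])     # customer 2

# The shared Avocado prefix is stored once, with count 2, and then
# splits into the Butter and Beer branches.
print(root.children["Avocado"].count)
```

Sharing prefixes is what gives the compression: common shopping patterns collapse onto the same path, with per-node counts recording how many transactions passed through.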
22. FP-growth
Grows the frequent itemsets recursively

FP-growth(FP-tree tree)
{
    …
    for-each (item in tree)
    {
        count = CountOccur(tree, item);
        if (IsFrequent(count))
        {
            OutputSet(item);
            sub = tree.GetTree(item);
            FP-growth(sub);
        }
    }
}
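The recursion can also be illustrated with a simplified, runnable sketch that works directly on plain transaction lists instead of the compressed FP-tree (so it shows the logic, not the performance; all names are illustrative):

```python
from collections import Counter

def fp_growth(transactions, min_count, suffix=()):
    """Simplified FP-growth recursion on plain transaction lists.
    For each frequent item, output it with the current suffix, then
    recurse on its conditional database: the transactions containing
    the item, restricted to items that sort before it (this ordering
    rule ensures each itemset is produced exactly once)."""
    counts = Counter(i for t in transactions for i in set(t))
    for item in sorted(counts):
        if counts[item] >= min_count:
            yield (item,) + suffix, counts[item]
            conditional = [
                sorted(i for i in t if i < item)
                for t in transactions if item in t
            ]
            yield from fp_growth(conditional, min_count, (item,) + suffix)

baskets = [
    ["Avocado", "Milk", "Butter", "Potatoes"],
    ["Milk", "Diapers", "Avocado", "Beer"],
    ["Beef", "Lemons", "Beer", "Chips"],
    ["Cereal", "Beer", "Beef", "Diapers"],
]
frequent = dict(fp_growth(baskets, min_count=2))
print(frequent)  # includes ("Beer", "Diapers") with count 2
```

The real algorithm does the same conditional restriction on the FP-tree itself, which is why each recursive call receives a conditional FP-tree.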
27. Distributed Shared Memory?
[Diagram: several CPUs, each with local memory, joined over a network into one shared memory]
· Adding nodes adds memory
· Works best in tightly coupled setups with low-latency, high-speed networks
28. FP-growth algorithm
Memory usage
The FP-tree does not fit in local memory; what to
do?
· Emulate Distributed Shared Memory
· Optimize your data structures
· Buy more RAM
· Get a good idea
29. Get a good idea
· Database scans are sequential and can be distributed
· The list of items used in the recursive calls uniquely determines which part of the data we are looking at
30. Get a good idea
[Diagram: the full FP-tree from before, alongside the small conditional subtree for one postfix]
31. Get a good idea
[Diagram: example postfix paths of increasing length, each determining a smaller and smaller prefix tree]
These are postfix paths
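One reading of this observation as code: the postfix alone identifies the conditional data, so any node holding (a bucket of) the raw transactions can rebuild the conditional prefix tree locally, and only a small item list needs to travel over the network. A sketch, with hypothetical names and an illustrative ordering rule:

```python
def conditional_db(transactions, postfix):
    """Transactions containing every postfix item, reduced to the
    items that sort before the first postfix item."""
    first = min(postfix)
    return [
        sorted(i for i in t if i < first)
        for t in transactions
        if all(p in t for p in postfix)
    ]

baskets = [
    ["Avocado", "Milk", "Butter", "Potatoes"],
    ["Milk", "Diapers", "Avocado", "Beer"],
    ["Beef", "Lemons", "Beer", "Chips"],
    ["Cereal", "Beer", "Beef", "Diapers"],
]
# The longer the postfix, the smaller the remaining prefix data:
# postfix (Diapers, Milk) matches only customer 2, leaving just
# the prefix (Avocado, Beer).
print(conditional_db(baskets, ("Diapers", "Milk")))
```

This is what makes the recursion cheap to distribute: shipping a short postfix replaces shipping a subtree.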
39. Collecting what we have learned
· Message-driven work, using message-queue
· Peer-to-peer for intermediate results
· Distribute data for scalability (buckets)
· Small messages (list of items)
· Allows us to distribute FP-growth
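The lessons above combine into a pattern that can be sketched in miniature (Python threads and in-process queues stand in for cloud nodes and the messaging service; all names are illustrative):

```python
import threading
from queue import Queue

# Toy version of the message-driven design: a task queue carries
# small postfix messages; a worker counts matches in its local
# transaction bucket and reports the result back.
tasks = Queue()
results = Queue()

local_bucket = [  # this worker's share of the distributed data
    ["Milk", "Diapers", "Avocado", "Beer"],
    ["Cereal", "Beer", "Beef", "Diapers"],
]

def worker():
    while True:
        postfix = tasks.get()
        if postfix is None:  # shutdown sentinel
            break
        count = sum(1 for t in local_bucket
                    if all(item in t for item in postfix))
        results.put((postfix, count))

t = threading.Thread(target=worker)
t.start()
tasks.put(("Beer", "Diapers"))  # the whole message is just an item list
tasks.put(None)
t.join()

result = results.get()
print(result)
```

If a worker dies, its messages stay on the queue for another node to pick up, which is where the robustness against computer failure comes from.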
40. Advantages
· Configurable work sizes
· Good distribution of work
· Robust against computer failure
· Fast!
41. So what about performance?
[Chart: total node time (00:00:00 to 04:30:00) for 1, 2, 4, and 8 nodes, comparing message-driven FP-growth against standard FP-growth; the message-driven version is the lower curve]
We are a tech-transfer company. We build bridges between universities (and other research institutes) and companies. Our focus is pervasive computing, for example mobile, cloud, and data treatment.
Product: Raptor smart advisor. A heavy data mining solution for product recommendation (bought, browsing, other users), using association data mining. Multi-pass and resource intensive. Last year (2010) the current server was becoming too small. An upgrade would mean a big investment (hardware, licenses). They looked to the cloud, Azure (utility model, usage pricing). What potential did they see?
Their old setup: we log patterns and build a model. We use the current user's pattern to query against the model (historic data).
The reasons d60 looked to the cloud. Cheaper licenses, e.g. continuous payment, though typically you get upgrades for free. Faster calculation: currently (last year) batch processing, and the batch processing should be done within 12 hours. Now they can do something they couldn't before: near-real-time responses. Still working on real-time events, trends etc. Huge potential.
d60 wanted to continue to use SQL Server. 50 GB in a corporate setting is not much. During the project we realised: contacting the SQL server from outside is slowish, by 10-20% (sometimes more!) compared to an on-premise networked setup.
This is loose talk; we need to establish a basis. The cloud is a bunch of loosely connected nodes. For it to be cloud you have to have the ability to scale up and down on demand: elasticity. Nodes are typically virtual images of a small(ish) computer, interconnected via LAN (if we are lucky), but they might be positioned in geographically different locations. We expect better response times than if this was over the internet, but worse response times than a dedicated on-premise setup. But it's cheap. It is a distributed setup where hardware and OS are typically vendor controlled. As in other distributed setups, plan on nodes being unavailable.
Queue: Azure has a message queue, Google has the map-reduce framework or App Engine task queue, Amazon has Simple Queue Service (SQS). Scalable implementations exist. Global DB: similarly, all providers have a global DB. There are a number of other cloud services, but these are the ones that matter to us.
Many orderings, also unordered; we just need FIFO. We ended up building our own, because of some initial bad experiences with the Azure one (slow responses etc.). Maybe not so negative today.
This is pretty vague. Now we consider a more concrete example.
Market basket is the historical example of association mining. Diapers and beer. Imagine 4 customers and their shopping baskets (carts). A basket is called a transaction, an item in a basket is called an "item", and we will talk about sets of items, "itemsets".
School book example. Give the story of why Diapers and Beer are connected (no time to go out). Discuss values for the frequency threshold. The goal of shopping basket analysis is to find as many frequent itemsets as possible, then generate conditional probabilistic rules. A hard problem: the internet and every click means a lot of data, and pairing each item with each item is exponential.
FP-growth (from 2000). Cluster setups (allowing for fast information exchange between nodes). FP-growth is a recursive algorithm: each step takes an FP-tree with conditions, a conditional FP-tree. Generally performs quite well. Tree size is comparable to data set size (huge). Near memory: fast memory (RAM vs hard drive vs network storage); it is known that page faults destroy the efficiency of the algorithm. Options for caching, but the cloud is a big problem. A lot of current research. Cluster (parallel) setup vs distributed setup: shared RAM allows for really fast info transfer. Research: transfer of subtrees. Wider application; for me a typical type of problem: a centralized algorithm we want to distribute.
Example of the tree from the example before. Here customer 1. Note the alphanumerical sorting; not kosher (we should use frequency count). A prefix tree or trie. Will not talk about the lookup pointer structures. The FP-tree is about two things: compression and fast data lookup.
Customer 2 from before.
Compression, but not 100% (Milk). (Continued on the next slide.)
Note the Milk node is in there twice. Low-complexity compression, not intelligent compression, but we need to consider the ordering. For live data sets you are typically able to shrink an order of magnitude, due to the ordering. Each node has a weight.
All four customers added. Actually I am not showing the root (null node). Compression 13/16, approx 18% compression. Let's look at the FP-growth algorithm.
This is what we want to distribute. Let's look at the tree.
Find frequent items (in the tree).
We count the occurrences of "Milk"; if the support count is high enough we generate the sub-tree, otherwise we just forget it. Milk occurs in two branches.
Note: all edges have a weight (as mentioned before). We use the parts of the FP-tree not shown here to loop over/build the tree efficiently.
You can of course have a mix of memory setup types.
We of course got a good idea.
That database scans are serial is crucial; we could have needed random lookup. This means that distributing the database is inexpensive. The second bullet implies that we do not need the tree to do the recursive (distributed) calls, allowing for cheap/fast network usage. This was hard work to come up with. We show the last bullet with an example next.
Example from before. I am going to move the tree on the right to the left (next).
As you can see, as the postfix path becomes longer and longer, the prefix tree becomes smaller and smaller. We can build the subtree from the postfix path!
Each transaction is made of items. The reason that we can use buckets is because of observation 1.
What is wrong with this picture? More or less the original algorithm. What we "inherited" turned out to be bad. Next generation:
Peer-to-peer based + message queue. Read-only DB. We are working on a new version with even less DB lookup.
For each postfix: distribute item-buckets and transaction-buckets; count the number of postfix paths in the buckets; collect the counts across transactions; for each frequent postfix path, call recursively. If the expected size of the prefix tree is small, do standard FP-growth. In order to distribute buckets we use the message queue. In order to collect counts we use message passing (node to node). In order to do standard FP-growth we use local computation.
Message-driven: good for the cloud; the MQ scales well. Distributed data: this was a lesson learned, also from the bad experience with SQL Server. A new distributed FP-growth. Small messages, in contrast to the cluster solutions; good for slow networks.
What about performance? (next)
Ours is the low graph. With only one worker, time grows as the DB grows.