Case study: d60 Raptor
    smartAdvisor


                           Jan Neerbek
                     Alexandra Institute
Agenda

·   d60: A cloud/data mining case
·   Cloud
·   Data Mining
·   Market Basket Analysis
·   Large data sets
·   Our solution




2
Alexandra Institute


    The Alexandra Institute is a non-profit
    company that works with application-oriented
    IT research.

    Focus is pervasive computing, and we
    activate the business potential of our
    members and customers through
    research-based, user-driven innovation.


3
The case: d60


·   Danish company
·   A similar-products recommendation engine
·   d60 was outgrowing their servers (late 2010)
·   They saw potential in moving to Azure




4
The setup



                           Product
                Internet   Recommendations
Webshops




           Log shopping
           patterns

                                 Do data mining




 5
The cloud potential


· Elasticity
· No upfront server cost
· Cheaper licenses
· Faster calculations



6
Challenges


· No SQL Server Analysis Services (SSAS)
· Small compute nodes
· Partitioned database (50GB)
· SQL Server ingress/egress access is
    slow




7
The cloud



                          Node           Node
                                  Node

            Node




                   Node
     Node
                                 Node




8
The cloud and services



                          Node            Node
                                  Node

            Node
                                         Data layer
                                          service



                   Node                   Messaging
     Node                                  Service
                                 Node




9
Data layer service

·    Application specific (schema/layout)

·    SQL, table or other
·    Easily becomes a bottleneck
·    Can be difficult to scale




10
Messaging service
Task Queues

·    Standard data structure
·    Built-in ordering (FIFO)
·    Can be scaled
·    Good for asynchronous messages
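
The task-queue pattern above can be sketched as follows. This is a minimal illustration only: Python's in-process `queue.Queue` stands in for a cloud messaging service, and the names `run_workers` and `square`, as well as the worker count, are assumptions made for the example and not part of the d60 system.

```python
import queue
import threading


def run_workers(tasks, handler, worker_count=3):
    """Put tasks on a FIFO queue and let worker threads consume them."""
    task_queue = queue.Queue()
    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            task = task_queue.get()
            if task is None:  # sentinel: no more work for this worker
                task_queue.task_done()
                return
            outcome = handler(task)
            with results_lock:
                results.append((task, outcome))
            task_queue.task_done()

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for task in tasks:
        task_queue.put(task)
    for _ in threads:
        task_queue.put(None)  # one sentinel per worker
    task_queue.join()
    for thread in threads:
        thread.join()
    return results


def square(n):
    """Hypothetical task handler used only for the demonstration."""
    return n * n


if __name__ == "__main__":
    print(sorted(run_workers(range(5), square)))
```

Messages leave the queue in FIFO order, but the workers finish asynchronously, so results are sorted before printing.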




11
12
Data mining


Data mining is the use of automated data analysis
 techniques to uncover relationships among data
 items


Market basket analysis is a data mining
 technique that discovers co-occurrence
 relationships among activities performed by
 specific individuals


     [about.com/wikipedia.org]
13
Market basket analysis


     Customer1    Customer2    Customer3    Customer4
Avocado          Milk         Beef         Cereal
Milk             Diapers      Lemons       Beer
Butter           Avocado      Beer         Beef
Potatoes         Beer         Chips        Diapers





Itemset (Diapers, Beer) occurs in 50% of the baskets

Frequency threshold parameter
Find as many frequent itemsets as possible
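
As a rough illustration of the frequency threshold, the sketch below computes how often an itemset occurs in the four example baskets and lists the pairs that reach a given threshold. The function names `support` and `frequent_pairs` are assumptions made for this example; brute-force pair counting is used here only because the data set is tiny.

```python
from collections import Counter
from itertools import combinations

# The four baskets from the slide.
BASKETS = {
    "Customer1": ["Avocado", "Milk", "Butter", "Potatoes"],
    "Customer2": ["Milk", "Diapers", "Avocado", "Beer"],
    "Customer3": ["Beef", "Lemons", "Beer", "Chips"],
    "Customer4": ["Cereal", "Beer", "Beef", "Diapers"],
}


def support(itemset, baskets):
    """Fraction of baskets that contain every item of the itemset."""
    wanted = set(itemset)
    hits = sum(1 for items in baskets.values() if wanted <= set(items))
    return hits / len(baskets)


def frequent_pairs(baskets, threshold):
    """All item pairs whose support is at least the threshold."""
    counts = Counter()
    for items in baskets.values():
        counts.update(combinations(sorted(set(items)), 2))
    total = len(baskets)
    return {pair: n / total for pair, n in counts.items() if n / total >= threshold}


if __name__ == "__main__":
    print(support(("Diapers", "Beer"), BASKETS))
    print(frequent_pairs(BASKETS, 0.5))
```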
 15
Market basket analysis


Popular effective algorithm: FP-growth 
Based on data structure FP-tree
Requires all data in near-memory 
Most research in distributed models has been for
  cluster setups 




16
Building the FP-tree
(extends the prefix-tree structure)

                                       Customer1
           Avocado
                                      Avocado
                                      Milk
                                      Butter
          Butter
                                      Potatoes



         Milk




     Potatoes



17
Building the FP-tree

                        Customer2
           Avocado
                       Milk
                       Diapers
                       Avocado
          Butter
                       Beer



         Milk




     Potatoes



18
Building the FP-tree

                                Customer2
           Avocado
                               Milk
                               Diapers
                               Avocado
          Butter     Beer
                               Beer



         Milk        Diapers




     Potatoes           Milk



19
Building the FP-tree


           Avocado             Beef




          Butter     Beer        Beer




         Milk        Diapers          Chips   Cereal




     Potatoes           Milk          Lemon   Diapers
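
The tree-building steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: items are inserted in alphabetical order, which matches the paths drawn on the slides, whereas textbook FP-growth usually orders items by descending frequency. The per-node counts are standard for an FP-tree but are not drawn on the slides, and the names `Node`, `build_tree` and `dump` are chosen for this example.

```python
class Node:
    """One FP-tree node: an item, a transaction count, and child nodes."""

    def __init__(self, item=None):
        self.item = item
        self.count = 0
        self.children = {}


def build_tree(transactions):
    """Insert each transaction as a path below a shared root."""
    root = Node()
    for items in transactions:
        node = root
        # Sorting makes transactions with common items share a path prefix.
        for item in sorted(set(items)):
            if item not in node.children:
                node.children[item] = Node(item)
            node = node.children[item]
            node.count += 1
    return root


def dump(node, depth=0):
    """Return the tree as indented 'item:count' lines."""
    lines = []
    for item in sorted(node.children):
        child = node.children[item]
        lines.append("  " * depth + f"{item}:{child.count}")
        lines.extend(dump(child, depth + 1))
    return lines


TRANSACTIONS = [
    ["Avocado", "Milk", "Butter", "Potatoes"],
    ["Milk", "Diapers", "Avocado", "Beer"],
    ["Beef", "Lemons", "Beer", "Chips"],
    ["Cereal", "Beer", "Beef", "Diapers"],
]

if __name__ == "__main__":
    print("\n".join(dump(build_tree(TRANSACTIONS))))
```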



21
FP-growth

Grows the frequent itemsets recursively

FP-growth(FP-tree tree)
{
     …
     for-each (item in tree)
     {
          count = CountOccur(tree, item);
          if (IsFrequent(count))
          {
               OutputSet(item);
               sub = tree.GetTree(tree, item);
               FP-growth(sub);
          }
     }
}
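
The recursion can be illustrated with a small runnable sketch. Note the simplification: instead of an FP-tree, it recurses on projected transaction lists, which gives the same frequent itemsets but without the compression the tree provides. The function name `fp_growth`, the `min_count` parameter and the alphabetical item order are assumptions made for this example.

```python
from collections import Counter


def fp_growth(transactions, min_count, suffix=()):
    """Return {itemset: count} for all frequent itemsets ending in `suffix`.

    Pattern growth on projected transaction lists (not a real FP-tree).
    """
    counts = Counter(item for t in transactions for item in set(t))
    found = {}
    for item, count in counts.items():
        if count < min_count:
            continue
        found[tuple(sorted((item,) + suffix))] = count
        # Project: keep transactions containing `item`, restricted to items
        # ordered before it, so that every itemset is generated exactly once.
        projected = [
            [other for other in t if other < item]
            for t in transactions
            if item in t
        ]
        found.update(fp_growth(projected, min_count, (item,) + suffix))
    return found


TRANSACTIONS = [
    ["Avocado", "Milk", "Butter", "Potatoes"],
    ["Milk", "Diapers", "Avocado", "Beer"],
    ["Beef", "Lemons", "Beer", "Chips"],
    ["Cereal", "Beer", "Beef", "Diapers"],
]

if __name__ == "__main__":
    for itemset, count in sorted(fp_growth(TRANSACTIONS, 2).items()):
        print(itemset, count)
```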

22
FP-growth algorithm
Divide and Conquer

Traverse tree
      Avocado             Beef



     Butter     Beer         Beer



     Milk       Diapers     Chips    Cereal



Potatoes           Milk      Lemon    Diapers




23
FP-growth algorithm
Divide and Conquer

Generate sub-trees
      Avocado             Beef



     Butter     Beer         Beer



     Milk       Diapers     Chips    Cereal



Potatoes           Milk      Lemon    Diapers




24
FP-growth algorithm
Divide and Conquer

Call recursively
      Avocado             Beef



     Butter     Beer         Beer                Avocado



     Milk       Diapers     Chips    Cereal     Butter     Beer



                                                           Diapers
Potatoes           Milk      Lemon    Diapers




25
FP-growth algorithm
Memory usage

The FP-tree does not fit in local memory; what to
  do?
· Emulate Distributed Shared Memory




26
Distributed Shared Memory?


          CPU       CPU           CPU      CPU      CPU


         Memory    Memory       Memory    Memory   Memory


                            Network


                          Shared Memory



·    To add nodes is to add memory
·    Works best in tightly coupled setups with low-latency,
     high-speed networks
27
FP-growth algorithm
Memory usage

The FP-tree does not fit in local memory; what to
  do?
· Emulate Distributed Shared Memory
· Optimize your data structures
· Buy more RAM
· Get a good idea



28
Get a good idea


·    Database scans are serial and can be
     distributed
·    The list of items used in the recursive calls
     uniquely determines what part of data we are
     looking at
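
To make the second point concrete: a list of items is enough to recompute the relevant part of the data with a single scan, so only that list needs to be sent between nodes. The sketch below is an illustration under that reading; the function name `project` is an assumption for this example.

```python
TRANSACTIONS = [
    ["Avocado", "Milk", "Butter", "Potatoes"],
    ["Milk", "Diapers", "Avocado", "Beer"],
    ["Beef", "Lemons", "Beer", "Chips"],
    ["Cereal", "Beer", "Beef", "Diapers"],
]


def project(transactions, items):
    """Select, by one serial scan, the transactions containing all given items."""
    required = set(items)
    return [t for t in transactions if required <= set(t)]


if __name__ == "__main__":
    # The item list alone identifies the slice of data for a recursive call.
    print(project(TRANSACTIONS, ["Beer", "Diapers"]))
```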




29
Get a good idea



      Avocado             Beef



     Butter     Beer         Beer                Avocado



     Milk       Diapers     Chips    Cereal     Butter     Beer



                                                           Diapers
Potatoes           Milk      Lemon    Diapers




30
Get a good idea


                                    Avocado



           Avocado                   Butter, Milk

          Butter          Beer



                          Diapers



                   Milk
                                      Avocado



                                                Beer




                                    Diapers,Milk
     These are postfix paths
31
32
Buckets


·    Use postfix paths for messaging
·    Working with buckets


                         Transactions




                 Items




33
FP-growth revisited
 FP-growth(FP-tree tree)                            <- replaced with postfix
 {
      …
      for-each (item in tree)                       <- done in parallel
           count = CountOccur(tree, item);          <- done in parallel
           if (IsFrequent(count))
           {
                OutputSet(item);
                sub = tree.GetTree(tree, item);     <- done in parallel
                FP-growth(sub);
           }



 34
Communication




         Node                Node




                Data layer




         Node                Node




35
Revised Communication




         Node           Node




                MQ
                               Data layer



         Node           Node




36
Running FP-growth


                    Distribute buckets




                    Count items
                    (with postfix size=n)



                    Collect counts
                    (per postfix)
                    Call recursive


                    Standard FP-growth
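
One round of the first three steps (distribute buckets, count items per postfix, collect counts) might look roughly like the sketch below. Everything runs in one process here; in the setup described in the slides each bucket would be handled by a separate node, with the postfix sent as a message. The function names and the round-robin bucket split are assumptions for this example, and the recursion and the hand-off to standard FP-growth are omitted.

```python
from collections import Counter


def make_buckets(transactions, bucket_count):
    """Split the transactions round-robin into buckets."""
    return [transactions[i::bucket_count] for i in range(bucket_count)]


def count_in_bucket(bucket, postfix):
    """Count items co-occurring with the postfix inside one bucket."""
    required = set(postfix)
    counts = Counter()
    for transaction in bucket:
        items = set(transaction)
        if required <= items:
            counts.update(items - required)
    return counts


def collect(partial_counts):
    """Merge the per-bucket counts for one postfix."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total


def frequent_extensions(transactions, postfix, bucket_count, min_count):
    """Items that are frequent together with the postfix."""
    buckets = make_buckets(transactions, bucket_count)
    partials = [count_in_bucket(bucket, postfix) for bucket in buckets]
    totals = collect(partials)
    return sorted(item for item, n in totals.items() if n >= min_count)


TRANSACTIONS = [
    ["Avocado", "Milk", "Butter", "Potatoes"],
    ["Milk", "Diapers", "Avocado", "Beer"],
    ["Beef", "Lemons", "Beer", "Chips"],
    ["Cereal", "Beer", "Beef", "Diapers"],
]

if __name__ == "__main__":
    print(frequent_extensions(TRANSACTIONS, ("Beer",), 2, 2))
```

Because the merged counts do not depend on how the transactions are split, the bucket count can be tuned to the number and size of the available nodes.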

37
Collecting what we have learned


·    Message-driven work, using message-queue
·    Peer-to-peer for intermediate results
·    Distribute data for scalability (buckets)
·    Small messages (list of items)
·    Allows us to distribute FP-growth




39
Advantages


·    Configurable work sizes
·    Good distribution of work
·    Robust against computer failure
·    Fast!




40
So what about performance?
[Chart: time axis in hh:mm:ss from 00:00:00 to 04:30:00; x-axis ticks 1, 2, 4, 8; legend: Message-driven FP-growth, FP-growth, Total node time]




41
Thank you!




42

Apriori data mining in the cloud

  • 1. Case study: d60 Raptor smartAdvisor Jan Neerbek Alexandra Institute
  • 2. Agenda · d60: A cloud/data mining case · Cloud · Data Mining · Market Basket Analysis · Large data sets · Our solution
  • 3. Alexandra Institute · The Alexandra Institute is a non-profit company that works with application-oriented IT research. · Focus is pervasive computing, and we activate the business potential of our members and customers through research-based user-driven innovation.
  • 4. The case: d60 · Danish company · A similar products recommendation engine · d60 was outgrowing their servers (late 2010) · They saw a potential in moving to Azure
  • 5. The setup [diagram: Webshops and Internet; labels: Log shopping patterns, Do data mining, Product recommendations]
  • 6. The cloud potential · Elasticity · No upfront server cost · Cheaper licenses · Faster calculations
  • 7. Challenges · No SQL Server Analysis Services (SSAS) · Small compute nodes · Partitioned database (50GB) · SQL server ingress/egress access is slow
  • 8. The cloud [diagram: seven nodes]
  • 9. The cloud and services [diagram: seven nodes, a data layer service and a messaging service]
  • 10. Data layer service · Application specific (schema/layout) · SQL, table or other · Easily a bottleneck · Can be difficult to scale
  • 11. Messaging service · Task Queues · Standard data structure · Built-in ordering (FIFO) · Can be scaled · Good for asynchronous messages
  • 12. 12
  • 13. Data mining · Data mining is the use of automated data analysis techniques to uncover relationships among data items · Market basket analysis is a data mining technique that discovers co-occurrence relationships among activities performed by specific individuals [about.com/wikipedia.org]
  • 14. Market basket analysis · Customer1: Avocado, Milk, Butter, Potatoes · Customer2: Milk, Diapers, Avocado, Beer · Customer3: Beef, Lemons, Beer, Chips · Customer4: Cereal, Beer, Beef, Diapers
  • 15. Market basket analysis · Customer1: Avocado, Milk, Butter, Potatoes · Customer2: Milk, Diapers, Avocado, Beer · Customer3: Beef, Lemons, Beer, Chips · Customer4: Cereal, Beer, Beef, Diapers · Itemset (Diapers, Beer) occurs in 50% of the baskets · Frequency threshold parameter · Find as many frequent itemsets as possible
  • 16. Market basket analysis · Popular effective algorithm: FP-growth · Based on data structure FP-tree · Requires all data in near-memory · Most research in distributed models has been for cluster setups
  • 17. Building the FP-tree (extends the prefix-tree structure) · Customer1 basket: Avocado, Milk, Butter, Potatoes · Tree: Avocado → Butter → Milk → Potatoes
  • 18. Building the FP-tree · Customer2 basket: Milk, Diapers, Avocado, Beer · Tree so far: Avocado → Butter → Milk → Potatoes
  • 19. Building the FP-tree · Customer2 basket: Milk, Diapers, Avocado, Beer · Tree: Avocado → Butter → Milk → Potatoes; Avocado → Beer → Diapers → Milk
  • 20. Building the FP-tree · Customer2 basket: Milk, Diapers, Avocado, Beer · Tree: Avocado → Butter → Milk → Potatoes; Avocado → Beer → Diapers → Milk
  • 21. Building the FP-tree · Tree: Avocado → Butter → Milk → Potatoes; Avocado → Beer → Diapers → Milk; Beef → Beer → Chips → Lemon; Beef → Beer → Cereal → Diapers
  • 22. FP-growth · Grows the frequent itemsets, recursively · FP-growth(FP-tree tree) { … for-each (item in tree) { count = CountOccur(tree, item); if (IsFrequent(count)) { OutputSet(item); sub = tree.GetTree(tree, item); FP-growth(sub); } } }
  • 23. FP-growth algorithm · Divide and Conquer · Traverse tree [diagram: the full FP-tree from slide 21]
  • 24. FP-growth algorithm · Divide and Conquer · Generate sub-trees [diagram: the full FP-tree from slide 21]
  • 25. FP-growth algorithm · Divide and Conquer · Call recursively [diagram: the full FP-tree from slide 21, next to the sub-tree Avocado → Butter; Avocado → Beer → Diapers]
  • 26. FP-growth algorithm · Memory usage · The FP-tree does not fit in local memory; what to do? · Emulate Distributed Shared Memory
  • 27. Distributed Shared Memory? [diagram: five CPUs, each with its own memory, joined over the network into one shared memory] · To add nodes is to add memory · Works best in tightly coupled setups, with low-latency, high-speed networks
  • 28. FP-growth algorithm · Memory usage · The FP-tree does not fit in local memory; what to do? · Emulate Distributed Shared Memory · Optimize your data structures · Buy more RAM · Get a good idea
  • 29. Get a good idea · Database scans are serial and can be distributed · The list of items used in the recursive calls uniquely determines what part of data we are looking at
  • 30. Get a good idea [diagram: the full FP-tree from slide 21, next to the sub-tree Avocado → Butter; Avocado → Beer → Diapers]
  • 31. Get a good idea · Postfix path "Milk": sub-tree Avocado → Butter; Avocado → Beer → Diapers · Postfix path "Butter, Milk": sub-tree Avocado · Postfix path "Diapers, Milk": sub-tree Avocado → Beer · These are postfix paths
  • 32. 32
  • 33. Buckets · Use postfix paths for messaging · Working with buckets: Transactions, Items
  • 34. FP-growth revisited · FP-growth(FP-tree tree) { … for-each (item in tree) { count = CountOccur(tree, item); if (IsFrequent(count)) { OutputSet(item); sub = tree.GetTree(tree, item); FP-growth(sub); } } } · Annotations on the slide: the FP-tree argument is replaced with a postfix; the for-each loop, the counting and the sub-tree generation are done in parallel
  • 35. Communication [diagram: four nodes and the data layer]
  • 36. Revised Communication [diagram: four nodes, an MQ and the data layer]
  • 37. Running FP-growth · Distribute buckets · Count items (with postfix size=n) · Collect counts (per postfix) · Call recursive · Standard FP-growth
  • 38. Running FP-growth · Distribute buckets · Count items (with postfix size=n) · Collect counts (per postfix) · Call recursive · Standard FP-growth
  • 39. Collecting what we have learned · Message-driven work, using message-queue · Peer-to-peer for intermediate results · Distribute data for scalability (buckets) · Small messages (list of items) · Allows us to distribute FP-growth
  • 40. Advantages · Configurable work sizes · Good distribution of work · Robust against computer failure · Fast!
  • 41. So what about performance? [chart: total node time, axis from 00:00:00 to 04:30:00, at x-axis ticks 1, 2, 4, 8; series: Message-driven FP-growth, FP-growth]
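
The four-basket example and the tree-building slides can be replayed in a few lines. What follows is a minimal Python sketch, not the d60 implementation: the names `BASKETS`, `support`, `build_prefix_tree` and `count_nodes` are made up for illustration, and items are sorted alphabetically as on the slides (the notes say the real system orders by frequency count instead).

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# The four baskets from the market basket slides.
BASKETS = {
    "Customer1": {"Avocado", "Milk", "Butter", "Potatoes"},
    "Customer2": {"Milk", "Diapers", "Avocado", "Beer"},
    "Customer3": {"Beef", "Lemons", "Beer", "Chips"},
    "Customer4": {"Cereal", "Beer", "Beef", "Diapers"},
}


def support(itemset):
    """Fraction of baskets that contain every item of `itemset`."""
    hits = sum(1 for basket in BASKETS.values() if set(itemset) <= basket)
    return hits / len(BASKETS)


@dataclass
class Node:
    """One prefix-tree node; `count` is the node weight."""
    item: Optional[str]
    count: int = 0
    children: Dict[str, "Node"] = field(default_factory=dict)


def build_prefix_tree(baskets):
    """Insert each basket as a sorted path, sharing common prefixes."""
    root = Node(item=None)  # the null root that the slides do not draw
    for basket in baskets:
        node = root
        for item in sorted(basket):  # assumption: alphabetical order
            node = node.children.setdefault(item, Node(item))
            node.count += 1
    return root


def count_nodes(node):
    """Number of nodes below `node`, not counting `node` itself."""
    return sum(1 + count_nodes(child) for child in node.children.values())


TREE = build_prefix_tree(BASKETS.values())
print(support({"Diapers", "Beer"}))
print(count_nodes(TREE), "nodes for", sum(len(b) for b in BASKETS.values()), "items")
```

On this data the itemset (Diapers, Beer) has support 0.5, and the tree holds 13 nodes for 16 basket items, which is the 13/16 figure mentioned in the Editor's Notes.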

Editor's Notes

  1. We are a tech transfer company. We build bridges between universities (and other research institutes) and companies. Focus: pervasive computing. For example mobile, cloud, data treatment.
  2. Product: Raptor smartAdvisor. Heavy data mining solution: product recommendation (bought, browsing, other users), association data mining. Multi-pass. Resource intensive. Last year (2010) the current server was becoming too small. Need to upgrade -> big investment (hw, licenses). Looked to the cloud, Azure (utility model, usage pricing). What potentials did they see?
  3. Their old setup. We log patterns and build a model. We use the current user's pattern to query against the model (historic data).
  4. The reasons d60 looked to the cloud. Cheaper licenses, e.g. continuous payment, but typically you get upgrades for free. Faster calculation -> currently (last year) batch processing, and the batch processing should be done within 12 hours. Now they can do something they couldn't before: near-real-time responses. Still working on real-time events, trends etc. Huge potential.
  5. d60 wanted to continue to use SQL Server. 50 GB in a corporate setting is not much. During the project we realised: contacting the SQL server from outside is slowish, by 10-20% (sometimes more!) compared to an on-premise networked setup.
  6. This is loose talk. We need to establish a basis. The cloud is a bunch of loosely connected nodes. For it to be cloud you have to have the ability to scale up and down on demand: elasticity. Nodes are typically virtual images of a small(ish) computer. Nodes are interconnected via LAN (if we are lucky), but might be positioned in geographically different locations. We expect better response times than if this was over the internet. However we will experience slower response times than if we had a dedicated setup on premises. But it's cheap. It is a distributed setup, where hardware and OS are typically vendor controlled. As in other distributed setups: plan on nodes being unavailable.
  7. Queue: Azure has a message queue. Google has the map-reduce framework or the App Engine Task Queue. Amazon has Simple Queue Service (SQS). Scalable implementations exist. Global db: similarly, all providers have a global db. There are a number of other cloud services, but these are the ones that matter to us.
  8. Many orders, also unordered; we just need FIFO. We ended up building our own, because of some initial bad experiences with the Azure one (slow responses etc). Maybe not so negative.
  9. This is pretty vague. Now we consider a more concrete example.
  10. Market basket is the historical example of association mining. Diapers and beer. Imagine 4 customers and their shopping baskets (carts). A basket is called a transaction. An item in a basket is called an "item". We will talk about sets of items: "itemsets" or sets of items.
  11. School book examples. Give the story of why Diapers and Beer are connected: don't have time to go out. Discuss values for the frequency threshold. F-itemsets. Goal of shopping basket analysis is to find as many frequent itemsets as possible. Generate conditional probabilistic rules. Hard problem. Internet and every click: lots of data. Each item with each item: exponential.
  12. FP-growth (from 2000). Cluster setups (allowing for fast information exchange between nodes). FP-growth: recursive algorithm, each step takes an FP-tree with conditions, a conditional FP-tree. Generally performs quite well. Tree size is comparable to data set size (huge). Near memory: fast memory (RAM vs hard drive vs network storage); it is known that page faults destroy the efficiency of the algorithm. Options for caching, but the cloud is a big problem. Lots of current research. Cluster (parallel) setup vs distributed setup. Shared RAM allows for really fast info transfer. Research: transfer of subtrees. Wider application. For me a typical type of problem: centralized algorithm, want to distribute.
  13. Example of tree from the example before. Here customer1. Note alphanumerical sorting, not kosher (we use frequency count). A prefix tree or trie. Will not talk about the lookup pointer structures. The FP-tree is about two things: compressing and fast data lookup.
  14. Customer 2 from before.
  15. Compression, but not 100% (Milk) (continue on next slide).
  16. Note the Milk node is in there twice. Low-complexity compression. Not intelligent compression. But we need to consider the ordering. For live data sets you are typically able to shrink by an order of magnitude, due to the ordering. Each node has a weight.
  17. All four customers added. Actually I am not showing the root (null-node). Compression: 13 nodes for 16 items, approx 18% compression. Let's look at the FP-growth algorithm.
  18. This we want to distribute. Let's look at the tree.
  19. Find frequent items (in tree).
  20. We count the occurrences of "Milk"; if the support count is high enough we generate the sub-tree. Otherwise just forget it. Milk occurs in two branches.
  21. Note: all edges have a weight (as mentioned before). We used the parts of the FP-tree not shown here to loop/build the tree efficiently.
  22. You can of course have a mix of memory setup types.
  23. We of course got a good idea.
  24. That database scans are serial is crucial: we could have needed random lookup. This means that distributing the database is inexpensive. The second bullet implies that we do not need the tree to do the recursive (distributed) calls. Allows for cheap/fast network usage. This was hard work to come up with. We show the last bullet with an example next.
  25. Example from before. I am going to move the tree on the right to the left (next).
  26. As you can see, as the postfix path becomes longer and longer the prefix-tree becomes smaller and smaller. We can build the subtree from the postfix path!
  27. Each transaction is made of items. The reason that we can use buckets is because of Observation 1.
  28. What is wrong with this picture? Sort of the original algorithm. We "inherited" it; turned out to be bad. Next gen:
  29. Peer-to-peer based + message queue. Read-only db. We are working on a new version with even less db lookup.
  30. For each postfix: Distribute item-buckets, transaction-buckets. Count number of postfix paths in buckets. Collect count across transactions. For each frequent postfix path call recursively. If the expected size of the prefix tree is small, do standard FP-growth. In order to distribute buckets we use the message queue. In order to collect counts we use message passing (node to node). In order to do standard FPG we use local computation.
  31. Message-driven: good for cloud. MQ scales well. Distributed data; this was a lesson learned, also bad experience with SQL server. New distributed FPG. Small messages in contrast to the cluster solutions, good for slow networks.
  32. What about perf? (next)
  33. Ours is the low graph. With only one worker. Growth as db grows.
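
The message-driven scheme from slides 33 to 38 can be tried out on one machine. This is a hedged Python illustration under stated assumptions, not the Raptor code: a `deque` stands in for the message queue, a plain loop stands in for the worker nodes, items are ordered alphabetically, and the names `BUCKETS`, `count_in_bucket` and `mine` are hypothetical. Each message is only a postfix (a short tuple of items); no tree is ever sent. The fallback to standard FP-growth for small sub-problems is left out.

```python
from collections import Counter, deque

# Transaction buckets: the slide data split over two (imaginary) nodes.
BUCKETS = [
    [{"Avocado", "Milk", "Butter", "Potatoes"},
     {"Milk", "Diapers", "Avocado", "Beer"}],
    [{"Beef", "Lemons", "Beer", "Chips"},
     {"Cereal", "Beer", "Beef", "Diapers"}],
]


def count_in_bucket(bucket, postfix):
    """Worker step: scan one bucket serially for a given postfix.

    Only transactions containing the whole postfix are considered, and
    only items sorting before the first postfix item are counted, so
    every itemset is reached by exactly one chain of messages.
    """
    required = set(postfix)
    limit = postfix[0] if postfix else None
    counts = Counter()
    for transaction in bucket:
        if required <= transaction:
            for item in transaction:
                if limit is None or item < limit:
                    counts[item] += 1
    return counts


def mine(buckets, min_count):
    """Drive the search with a queue of postfix messages."""
    queue = deque([()])  # stand-in for the message queue; () = empty postfix
    frequent = {}
    while queue:
        postfix = queue.popleft()
        total = Counter()
        for bucket in buckets:  # on a real setup: one worker per bucket
            total.update(count_in_bucket(bucket, postfix))  # collect counts
        for item, count in total.items():
            if count >= min_count:
                itemset = (item,) + postfix
                frequent[itemset] = count
                queue.append(itemset)  # the "recursive call" is a new message
    return frequent


FREQUENT = mine(BUCKETS, min_count=2)
for itemset, count in sorted(FREQUENT.items()):
    print(itemset, count)
```

With a minimum count of 2 this finds eight frequent itemsets on the slide data, among them (Beer, Diapers) with count 2. How the transactions are split into buckets does not change the result, only how the counting work is divided.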