Beating up on Bayesian Bandits
Mahout
• Scalable Data Mining for Everybody
What is Mahout
• Recommendations (people who x this also x
  that)
• Clustering (segment data into groups of)
• Classification (learn decision making from
  examples)
• Stuff (LDA, SVD, frequent item-set, math)
Classification in Detail
• Naive Bayes Family
  – Hadoop based training
• Decision Forests
  – Hadoop based training
• Logistic Regression (aka SGD)
  – fast on-line (sequential) training
  – Now with MORE topping!
An Example
And Another

Date: Thu, May 20, 2010 at 10:51 AM
From: George <george@fumble-tech.com>
Hi Ted, was a pleasure talking to you last night
at the Hadoop User Group. I liked the idea of
going for lunch together. Are you available
tomorrow (Friday) at noon?

From: Dr. Paul Acquah
Dear Sir,
Re: Proposal for over-invoice Contract Benevolence
Based on information gathered from the India
hospital directory, I am pleased to propose a
confidential business deal for our mutual
benefit. I have in my possession, instruments
(documentation) to transfer the sum of
33,100,000.00 eur (thirty-three million one hundred
thousand euros, only) into a foreign company's
bank account for our favor.
...
Feature Encoding
Hashed Encoding
Feature Collisions
How it Works
• We are given “features”
  – Often binary values in a vector
• Algorithm learns weights
  – Weighted sum of feature * weight is the key
• Each weight is a single real value
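
The points above can be sketched in a few lines. This is a minimal illustration, not Mahout's API: the function names, the vector size, the learning rate, and the logistic squashing function are all assumptions chosen for the example. It hashes tokens into a binary vector (two tokens can land in the same slot, which is a feature collision), scores by the weighted sum, and applies one on-line update per example.

```python
import math
import zlib

NUM_FEATURES = 1000  # illustrative size; a smaller vector means more collisions


def encode(tokens, size=NUM_FEATURES):
    """Hash each token to a slot and set it to 1 (binary features)."""
    x = [0.0] * size
    for tok in tokens:
        # crc32 is used only because it is deterministic across runs
        x[zlib.crc32(tok.encode("utf-8")) % size] = 1.0
    return x


def score(weights, x):
    """Weighted sum of feature * weight, squashed into a probability."""
    z = sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def sgd_step(weights, x, label, rate=0.5):
    """One on-line update: move each active weight along the prediction error."""
    err = label - score(weights, x)
    return [w + rate * err * xi for w, xi in zip(weights, x)]


weights = [0.0] * NUM_FEATURES  # each weight is a single real value
spam = encode("confidential business deal transfer the sum".split())
ham = encode("pleasure talking lunch together tomorrow".split())
for _ in range(200):
    weights = sgd_step(weights, spam, 1.0)
    weights = sgd_step(weights, ham, 0.0)

print(score(weights, spam), score(weights, ham))
```

Note that each weight here is a single number, which is the point the following slides question.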
A Quick Diversion
• You see a coin
  – What is the probability of heads?
  – Could it be larger or smaller than that?
• I flip the coin and while it is in the air ask again
• I catch the coin and ask again
• I look at the coin (and you don’t) and ask again
• Why does the answer change?
  – And did it ever have a single value?
A First Conclusion
• Probability as expressed by humans is
  subjective and depends on information and
  experience
A Second Conclusion
• A single number is a bad way to express
  uncertain knowledge
• A distribution of values might be better
I Dunno
5 and 5
2 and 10
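
The figures for the three slides above are not in the extracted text. As a hedged sketch, assume the titles are counts of heads and tails and that we start from a uniform Beta(1, 1) prior (both are assumptions, not stated in the slides). The belief about the probability of heads is then Beta(1 + heads, 1 + tails), and its spread shows how much a single number would hide.

```python
import math


def beta_summary(heads, tails):
    """Mean and standard deviation of Beta(1 + heads, 1 + tails)."""
    a, b = 1 + heads, 1 + tails  # assumes a uniform Beta(1, 1) prior
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, math.sqrt(var)


# (0, 0) stands in for "I Dunno": no observations at all
for heads, tails in [(0, 0), (5, 5), (2, 10)]:
    mean, sd = beta_summary(heads, tails)
    print(f"{heads} and {tails}: mean={mean:.3f} sd={sd:.3f}")
```

With no data and with 5 and 5 the mean is the same, but the distribution is much narrower after ten observations.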
The Cynic Among Us
A Second Diversion
Two-armed Bandit
Which One to Play?
• One may be better than the other
• The better machine pays off at some rate
• Playing the other will pay off at a lesser rate
  – Playing the lesser machine has “opportunity cost”
• But how do we know which is which?
  – Explore versus Exploit!
Algorithmic Costs
• Option 1
  – Explicitly code the explore/exploit trade-off
• Option 2
  – Bayesian Bandit
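
The slides do not say which explicit scheme Option 1 refers to. Epsilon-greedy is one common way to hand-code the trade-off and is used here only as an illustration; the function names, the value of epsilon, and the payoff rates are assumptions made for the example.

```python
import random


def epsilon_greedy(wins, plays, epsilon, rng):
    """Explore a random arm with probability epsilon, else exploit the best so far."""
    # Also pick at random until every arm has been played at least once
    if rng.random() < epsilon or 0 in plays:
        return rng.randrange(len(plays))
    rates = [w / n for w, n in zip(wins, plays)]
    return rates.index(max(rates))


def simulate(true_rates, trials=5000, epsilon=0.1, seed=1):
    """Play the bandits and return how often each arm was chosen."""
    rng = random.Random(seed)
    wins = [0] * len(true_rates)
    plays = [0] * len(true_rates)
    for _ in range(trials):
        arm = epsilon_greedy(wins, plays, epsilon, rng)
        plays[arm] += 1
        wins[arm] += rng.random() < true_rates[arm]  # True counts as 1
    return plays


plays = simulate([0.3, 0.6])  # hypothetical payoff rates
print(plays)
```

The cost of this option is visible in the code: epsilon has to be chosen and tuned by hand, and the exploration rate never shrinks unless more logic is added.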
Bayesian Bandit
• Compute distributions based on data
• Sample p1 and p2 from these distributions
• Put a coin in bandit 1 if p1 > p2
• Else, put the coin in bandit 2
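
The four steps above can be sketched as follows. This approach is also known as Thompson sampling. The sketch assumes win/lose payoffs and a uniform Beta(1, 1) prior for each bandit, so the distribution for a bandit is Beta(1 + wins, 1 + losses); the function names and payoff rates are hypothetical.

```python
import random


def choose_arm(wins, losses, rng):
    """Sample a payoff rate per bandit from its Beta posterior; play the largest."""
    samples = [rng.betavariate(1 + w, 1 + l) for w, l in zip(wins, losses)]
    return samples.index(max(samples))


def run(true_rates, trials=5000, seed=1):
    """Play the bandits and return the win and loss counts per arm."""
    rng = random.Random(seed)
    wins = [0] * len(true_rates)
    losses = [0] * len(true_rates)
    for _ in range(trials):
        arm = choose_arm(wins, losses, rng)
        if rng.random() < true_rates[arm]:
            wins[arm] += 1
        else:
            losses[arm] += 1
    return wins, losses


wins, losses = run([0.3, 0.6])  # hypothetical payoff rates
plays = [w + l for w, l in zip(wins, losses)]
print(plays)
```

There is no exploration parameter: while the distributions are wide, the samples vary and both bandits get played, and as data accumulates the samples concentrate and the better bandit wins most comparisons.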
The Basic Idea
• We can encode a distribution by sampling
• Sampling allows unification of exploration and
  exploitation
• Can be extended to more general response
  models
Deployment with Storm/MapR
[Diagram. Components: Targeting Engine, Model Selector, Online Model (several), Impression Logs, Click Logs, Conversion Detector, Conversion Dashboard. Links are labeled RPC and Training.]
• All state managed transactionally in MapR file system
Service Architecture
[Diagram. The same components (Targeting Engine, Model Selector, Online Models, Impression Logs, Click Logs, Conversion Detector, Conversion Dashboard, linked by RPC and Training) shown with the platform labels MapR Pluggable Service Management, Storm, Hadoop, and MapR Lockless Storage Services.]
Find Out More
• Me: tdunning@mapr.com
      ted.dunning@gmail.com
      tdunning@apache.com
• MapR: http://www.mapr.com
• Mahout: http://mahout.apache.org
• Code: https://github.com/tdunning

Lahug 2012-02-07


Editor's Notes

  1. No information would give a relative expected payoff of -0.25. This graph shows the 25th, 50th, and 75th percentile results for sampled experiments with uniform random probabilities. Convergence to the optimum is nearly equal to the optimal sqrt(n) rate. Note the log scale on the number of trials.
  2. Here is how the system converges in terms of how likely it is to pick the better bandit with probabilities that are only slightly different. After 1000 trials, the system is already giving 75% of the bandwidth to the better option. This graph was produced by averaging several thousand runs with the same probabilities.