Lahug 2012-02-07

Ted Dunning

A corrected set of slides from the LA HUG talk that I gave in February 2012

Technology

Mahout
• Scalable Data Mining for Everybody

What is Mahout
• Recommendations (people who x this also x that)
• Clustering (segment data into groups of)
• Classification (learn decision making from examples)
• Stuff (LDA, SVD, frequent item-set, math)

Classification in Detail
• Naive Bayes Family
– Hadoop based training
• Decision Forests
– Hadoop based training
• Logistic Regression (aka SGD)
– fast on-line (sequential) training

And Another

From: Dr. Paul Acquah
Date: Thu, May 20, 2010 at 10:51 AM
Dear Sir,
Re: Proposal for over-invoice Contract Benevolence
Based on information gathered from the India
hospital directory, I am pleased to propose a
confidential business deal for our mutual
benefit. I have in my possession, instruments
(documentation) to transfer the sum of
33,100,000.00 eur thirty-three million one hundred
thousand euros, only) into a foreign company's
bank account for our favor.
...

From: George <george@fumble-tech.com>
Hi Ted, was a pleasure talking to you last night
at the Hadoop User Group. I liked the idea of
going for lunch together. Are you available
tomorrow (Friday) at noon?

How it Works
• We are given “features”
– Often binary values in a vector
• Algorithm learns weights
– Weighted sum of feature * weight is the key
• Each weight is a single real value
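
The weighted sum described on this slide can be sketched in a few lines. This is a minimal illustration, not Mahout's implementation: the feature names, the learning rate and the function names `score`, `predict` and `sgd_step` are all hypothetical, and the logistic squashing function is assumed because the deck mentions logistic regression trained by SGD.

```python
import math

def score(features, weights):
    """Weighted sum of feature values times their weights."""
    return sum(value * weights.get(name, 0.0) for name, value in features.items())

def predict(features, weights):
    """Squash the weighted sum into a probability with the logistic function."""
    return 1.0 / (1.0 + math.exp(-score(features, weights)))

def sgd_step(features, label, weights, rate=0.1):
    """One on-line update: move each active weight toward the observed label."""
    error = label - predict(features, weights)
    for name, value in features.items():
        weights[name] = weights.get(name, 0.0) + rate * error * value
    return weights

# Hypothetical binary features for two messages (1.0 means the word is present).
weights = {}
spam = {"transfer": 1.0, "confidential": 1.0}
ham = {"lunch": 1.0, "hadoop": 1.0}
for _ in range(200):
    sgd_step(spam, 1.0, weights)   # label 1.0 = spam
    sgd_step(ham, 0.0, weights)    # label 0.0 = not spam

print(predict(spam, weights), predict(ham, weights))
```

Each weight here is a single real number, which is exactly the limitation the following slides question: the model stores a point estimate and says nothing about how certain it is.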

A Quick Diversion
• You see a coin
– What is the probability of heads?
– Could it be larger or smaller than that?
• I flip the coin and while it is in the air ask again
• I catch the coin and ask again
• I look at the coin (and you don’t) and ask again
• Why does the answer change?
– And did it ever have a single value?

A First Conclusion
• Probability as expressed by humans is subjective and depends on information and experience

A Second Conclusion
• A single number is a bad way to express uncertain knowledge

• A distribution of values might be better
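
One way to make "a distribution of values" concrete is a Beta distribution over the probability of heads. The sketch below is an assumption layered on the slide, not something the deck spells out: it starts from a uniform prior, and the heads and tails counts are purely illustrative. The helper name `beta_summary` is hypothetical.

```python
import random

def beta_summary(heads, tails, draws=10000, seed=0):
    """Summarize belief about P(heads) as a Beta(heads + 1, tails + 1) distribution."""
    rng = random.Random(seed)
    a, b = heads + 1, tails + 1          # uniform Beta(1, 1) prior is an assumption
    samples = sorted(rng.betavariate(a, b) for _ in range(draws))
    mean = a / (a + b)                   # exact mean of Beta(a, b)
    low = samples[int(0.05 * draws)]     # approximate 5th percentile from samples
    high = samples[int(0.95 * draws)]    # approximate 95th percentile from samples
    return mean, low, high

# Illustrative counts: no data, balanced data, and lopsided data.
for heads, tails in [(0, 0), (5, 5), (2, 10)]:
    mean, low, high = beta_summary(heads, tails)
    print(f"{heads} heads, {tails} tails: mean={mean:.2f}, 90% range=({low:.2f}, {high:.2f})")
```

With no observations the range covers almost the whole interval from 0 to 1, and it narrows as evidence accumulates, which a single number cannot show.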

Which One to Play?
• One may be better than the other
• The better machine pays off at some rate
• Playing the other will pay off at a lesser rate
– Playing the lesser machine has “opportunity cost”

• But how do we know which is which?
– Explore versus Exploit!

Algorithmic Costs
• Option 1
– Explicitly code the explore/exploit trade-off

• Option 2
– Bayesian Bandit
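
For comparison, Option 1 might look like the sketch below. The slide does not name a specific scheme, so epsilon-greedy is a swapped-in example of explicitly coding the trade-off; the payoff rates, the value of `epsilon` and the function names are hypothetical.

```python
import random

def epsilon_greedy(wins, plays, epsilon, rng):
    """Pick an arm: explore at random with probability epsilon, else exploit."""
    if rng.random() < epsilon:
        return rng.randrange(len(plays))
    # Unplayed arms get an optimistic rate of 1.0 so each is tried at least once.
    rates = [w / n if n else 1.0 for w, n in zip(wins, plays)]
    return max(range(len(rates)), key=rates.__getitem__)

def simulate(payoffs, trials=5000, epsilon=0.1, seed=1):
    """Play the machines and return how often each one was chosen."""
    rng = random.Random(seed)
    wins = [0] * len(payoffs)
    plays = [0] * len(payoffs)
    for _ in range(trials):
        arm = epsilon_greedy(wins, plays, epsilon, rng)
        plays[arm] += 1
        wins[arm] += rng.random() < payoffs[arm]   # True counts as 1
    return plays

plays = simulate([0.3, 0.6])   # hypothetical payoff rates for two machines
print(plays)
```

The cost of this approach is the hand-tuned `epsilon`: the amount of exploration is fixed by the programmer rather than driven by how uncertain the estimates still are.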

Bayesian Bandit
• Compute distributions based on data
• Sample p1 and p2 from these distributions
• Put a coin in bandit 1 if p1 > p2
• Else, put the coin in bandit 2
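
The four steps above can be sketched as follows. This is a minimal illustration under stated assumptions: each bandit's payoff rate is modeled with a Beta distribution starting from a uniform prior, and the payoff rates, trial count and function names are hypothetical.

```python
import random

def choose_bandit(stats, rng):
    """Sample a payoff rate for each bandit from its Beta distribution; play the largest."""
    samples = [rng.betavariate(wins + 1, losses + 1) for wins, losses in stats]
    return max(range(len(samples)), key=samples.__getitem__)

def run(payoffs, trials=5000, seed=2):
    """Simulate repeated play and return [wins, losses] for each bandit."""
    rng = random.Random(seed)
    stats = [[0, 0] for _ in payoffs]   # [wins, losses] per bandit
    for _ in range(trials):
        arm = choose_bandit(stats, rng)
        won = rng.random() < payoffs[arm]
        stats[arm][0 if won else 1] += 1
    return stats

stats = run([0.3, 0.6])   # hypothetical payoff rates for bandit 1 and bandit 2
plays = [wins + losses for wins, losses in stats]
print(plays)
```

There is no explicit exploration parameter. While the distributions are wide the samples vary a lot and both bandits get played; as data accumulates the distributions narrow and the better bandit wins the comparison almost every time.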

The Basic Idea
• We can encode a distribution by sampling
• Sampling allows unification of exploration and exploitation

• Can be extended to more general response models

Deployment with Storm/MapR
[Diagram: Targeting Engine, Model Selector, Online Model, Impression Logs, Click Logs, Conversion Detector, Online Model Training and Conversion Dashboard, linked by RPC and training flows]
All state managed transactionally in MapR file system

Service Architecture

[Diagram: Targeting Engine, Model Selector, Online Model, Impression Logs, Click Logs, Conversion Detector, Online Model Training and Conversion Dashboard, shown with Storm, Hadoop, MapR Pluggable Service Management and MapR Lockless Storage Services]

Find Out More
• Me: tdunning@mapr.com
ted.dunning@gmail.com
tdunning@apache.com
• MapR: http://www.mapr.com
• Mahout: http://mahout.apache.org
• Code: https://github.com/tdunning


Lahug 2012-02-07

1. Beating up on Bayesian Bandits

2. Mahout

3. What is Mahout

4. What is Mahout? (same bullets, with "SVM" in place of "SVD")

5. Classification in Detail

6. Classification in Detail

7. Classification in Detail (adds "Now with MORE topping!" under Logistic Regression)

8. An Example

9. And Another

10. Feature Encoding

11. Hashed Encoding

12. Feature Collisions

13. How it Works

14. A Quick Diversion

15. A First Conclusion

16. A Second Conclusion

17. I Dunno

18. 5 and 5

19. 2 and 10

20. The Cynic Among Us

21. A Second Diversion

22. Two-armed Bandit

23. Which One to Play?

24. Algorithmic Costs

25. Bayesian Bandit

26.

27.

28. The Basic Idea

29. Deployment with Storm/MapR

30. Service Architecture

31. Find Out More

Editor's notes

No information would give a relative expected payoff of -0.25. This graph shows the 25th, 50th and 75th percentile results for sampled experiments with uniform random probabilities. Convergence toward the optimum nearly matches the optimal sqrt(n) rate. Note the log scale on the number of trials.
Here is how the system converges in terms of how likely it is to pick the better bandit with probabilities that are only slightly different. After 1000 trials, the system is already giving 75% of the bandwidth to the better option. This graph was produced by averaging several thousand runs with the same probabilities.
