CTG Data Science Lab
August 17, 2016
Multi-armed Bandit Problem
Potential Improvement for DARTS
Aniruddha Bhargava, Yika Yujia Luo
Agenda
1. Problem Overview
2. Algorithms
Non-contextual cases
Contextual cases
3. Industry Review
4. Advanced Topics
Problem Overview
When do we run into the Multi-armed Bandit (MAB) problem?
Examples: gambling, research funding, clinical trials, content management.
What is the Multi-armed Bandit (MAB) problem?
Goal: pick the best restaurant efficiently.
Logistics: select a restaurant for each person, who leaves you a tip afterwards.
[Figure: three restaurants; sample tips of $1, $8, $10 on one round of visits and $3, $6, $6 on another, with running averages of $2, $7, and $6. How do you choose?]
MAB Terminology
Arm: a restaurant.
User: a person sent to a restaurant.
Reward: the tip.
Expected reward: the average tip in the long run.
Policy: the strategy you use to select a restaurant.
Exploration: learning people's preferences; it always involves a certain degree of randomness.
Exploitation: using current, reliable knowledge to select a restaurant.
Regret: the expected tip lost by sending a person to a restaurant that is not the best.
Total cumulative regret: the total tips you lose; the standard performance measure for bandit algorithms.
[Figure: the three restaurants have expected tips of $1, $8, and $10. Sending two people to the $1 restaurant costs $9 of regret each, one person to the $8 restaurant costs $2, and one person to the $10 restaurant costs $0, for a total regret of $20.]
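In symbols (a standard textbook definition consistent with the worked example, not spelled out on the slide): if $\mu^*$ is the expected tip at the best restaurant and $\mu_{a_t}$ is the expected tip at the restaurant chosen for person $t$, the total cumulative regret after $T$ people is

    $R_T = \sum_{t=1}^{T} \left( \mu^* - \mu_{a_t} \right)$

For the example above, $R_4 = \$9 + \$9 + \$2 + \$0 = \$20$.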
MAB Big Picture
Decision making: choose the best product, i.e., find the best restaurant to go to.
Optimization: minimize total regret by avoiding, as much as possible, sending people to bad restaurants.
Algorithms
(Non-contextual Cases)
“Anytime you are faced with the problem of both exploring
and exploiting a search space, you have a bandit problem.
Any method of solving that problem is a bandit algorithm”
-- Chris Stucchio
Non-contextual vs. Contextual: the non-contextual case
User → Product
The important thing here: although everyone has different tastes, we pick one best restaurant for everyone.
MAB Policies
A/B testing
Adaptive: ε-greedy, Upper Confidence Bound (UCB), Thompson Sampling
There are more bandit algorithms…
A/B Testing
During the test, each person i is assigned a restaurant uniformly at random (100% exploration); once the test ends, each person j is sent to the measured winner (100% exploitation).
ε-greedy
Select (here ε = 0.2): with probability 1 − ε, send person i to the restaurant with the highest average tips; with probability ε, pick a restaurant at random.
Update: record person i's feedback and update that restaurant's average tip value.
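A minimal sketch of this loop in Python (the pull(arm) reward helper and the ε value are illustrative assumptions):

    import random

    def epsilon_greedy(n_arms, pull, n_rounds, eps=0.2):
        # pull(arm) is an assumed helper that returns one visit's tip
        counts = [0] * n_arms
        means = [0.0] * n_arms
        for _ in range(n_rounds):
            if random.random() < eps:
                arm = random.randrange(n_arms)                     # explore: random restaurant
            else:
                arm = max(range(n_arms), key=lambda a: means[a])   # exploit: best average so far
            tip = pull(arm)
            counts[arm] += 1
            means[arm] += (tip - means[arm]) / counts[arm]         # update running average
        return means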
Upper Confidence Bound (UCB)
Select: send person i to the restaurant with the highest upper confidence bound on its average tips, using the standard UCB1 bound

    $UCB_j = \bar{x}_j + \sqrt{2 \ln N / N_j}$

where $\bar{x}_j$ is the average tip from restaurant j, $N_j$ is the number of people sent to restaurant j, and $N$ is the total number of people so far.
Update: record person i's feedback and update that restaurant's upper confidence bound.
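A sketch of UCB1 under the same assumed pull(arm) helper:

    import math

    def ucb1(n_arms, pull, n_rounds):
        counts = [0] * n_arms
        means = [0.0] * n_arms
        for t in range(1, n_rounds + 1):
            if t <= n_arms:
                arm = t - 1   # play each restaurant once to initialize
            else:
                arm = max(range(n_arms),
                          key=lambda a: means[a] + math.sqrt(2 * math.log(t) / counts[a]))
            tip = pull(arm)
            counts[arm] += 1
            means[arm] += (tip - means[arm]) / counts[arm]
        return means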
Thompson Sampling (Bayesian)
Sample: maintain a distribution over each restaurant's average tip (here McDonald's, Subway, and Chili's); randomly draw one value from each distribution.
Select: send person i to the restaurant with the highest sampled tip.
Update: record person i's feedback and update that restaurant's average-tip distribution.
Thompson Sampling (Bayesian)
[Figure: two pairs of overlapping tip distributions, one with Pr(r < b) = 10% and one with Pr(r < b) = 0.01%.]
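A sketch of the sampling loop, assuming a Gaussian tip model with a simple approximate posterior (the model choice and the pull(arm) helper are assumptions, not from the deck):

    import random

    def thompson_gaussian(n_arms, pull, n_rounds, noise_var=1.0):
        # approximate Gaussian posterior per restaurant: N(mean, noise_var / (n + 1))
        counts = [0] * n_arms
        means = [0.0] * n_arms
        for _ in range(n_rounds):
            # draw one plausible average tip per restaurant from its posterior
            draws = [random.gauss(means[a], (noise_var / (counts[a] + 1)) ** 0.5)
                     for a in range(n_arms)]
            arm = max(range(n_arms), key=lambda a: draws[a])
            tip = pull(arm)
            counts[arm] += 1
            means[arm] += (tip - means[arm]) / counts[arm]
        return means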
Algorithm Comparison
1. Exploration vs. Exploitation
2. Total Regret
3. Batch Update
Algorithm Comparison: Exploration vs. Exploitation
The important thing here: exploration costs money!
[Figure: exploration percentage over time. A/B testing explores 100% of the time during the test and 0% afterwards; ε-greedy explores at a constant rate ε throughout.]
Algorithm Comparison: Total Regret
[Figure: traffic split over time across the three restaurants M, S, and C. A/B testing: M 44%, S 28%, C 28%. Adaptive: M 70%, S 18%, C 12%.]
Algorithm Comparison: Batch Update
Robustness to batch updates. A/B Testing: very robust. ε-greedy: depends. UCB: not robust. Thompson: robust.
[Diagram: the system asks users questions and stores many answers before updating.]
Algorithm Comparison: Summary
A/B Testing. Pros: easy to implement; good for a small number of arms; robust to batch update. Cons: high total regret.
ε-greedy. Pros: easy to implement; if a good ε is found, lower total regret and faster to find the best arm than ε-first. Cons: high total regret; need to figure out a good ε.
UCB. Pros: good for a large number of arms; finds the best arm fast; low total regret. Cons: not robust to batch update.
Thompson. Pros: finds the best arm fast; low total regret; robust to batch update. Cons: sensitive to statistical assumptions.
Non-contextual vs. Contextual: the contextual case
User features: female, vegetarian, married, Latino. Product features: burger, non-vegetarian, cheap, good service.
The important thing here: everyone has different tastes, so we pick the best restaurant for each person.
Agenda
1. Problem Overview
2. Algorithms
Non-contextual cases
Contextual cases
3. Industry Review
4. Advanced Topics
Algorithms
(Contextual Bandits)
What do we mean by context?
User side:
• Likes spicy food, refined tastes, plays violin, male, …
• From Wisconsin, likes German food, likes football, male, …
• Student, doesn't like seafood, allergic to cats, female, …
• Chief of AFC, watches shows on competitive eating, female, …
Arm side:
• Tex-Mex style, sit-down dining, founded in 1975, …
• Serves sandwiches, has veggie options, founded in 1965, …
• Breakfast, lunch, and dinner; cheap; founded in 1940, …
User Context
[Figure: average reward over time. The non-contextual policy converges to the best possible reward without context; the user-context policy converges to a higher best possible reward with context.]
Arm Context
[Figure: average reward over time for non-contextual, arm-context-only, and arm-plus-user-context policies. Arm context alone converges faster to the best possible reward without user context; adding user context converges to the higher best possible reward with user context.]
Takeaway Message
User context can increase the optimal reward; arm context can get you there faster!
Exploiting Context
User side:
• Population segmentation (e.g., DARTS)
• Clustering users
• Learning an embedding
Arm side:
• Linear models: LinUCB, Linear TS, OFUL
• Maintain an estimate of the best arm; more data → shrinking uncertainty
Exploiting User Context
Assumptions (a sketch of one way to act on them follows this list):
• Users can be represented as points in space.
• Users cluster together, so points that are close are similar.
• Stationarity.
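A minimal sketch of acting on these assumptions: cluster users on their features, then run an independent bandit per cluster. The clustering step and the ε value are illustrative assumptions, not the deck's DARTS method:

    import random

    class ClusteredBandit:
        # one epsilon-greedy bandit per user cluster; the cluster id is assumed
        # to come from an upstream step such as k-means on user features
        def __init__(self, n_clusters, n_arms, eps=0.1):
            self.eps = eps
            self.counts = [[0] * n_arms for _ in range(n_clusters)]
            self.means = [[0.0] * n_arms for _ in range(n_clusters)]

        def select(self, cluster):
            n_arms = len(self.means[cluster])
            if random.random() < self.eps:
                return random.randrange(n_arms)   # explore within the cluster
            return max(range(n_arms), key=lambda a: self.means[cluster][a])

        def update(self, cluster, arm, reward):
            self.counts[cluster][arm] += 1
            n = self.counts[cluster][arm]
            self.means[cluster][arm] += (reward - self.means[cluster][arm]) / n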
Exploiting User Context
[Figures: users (Joe, Yao, Nichola, Peter, Aniruddha, Rachel, Sophie, Yika, Vineeta, Jason, Andre, Chris, Madeline, John) plotted on two taste axes, meat ↔ vegetarian and mild ↔ spicy. Successive slides partition them with a linear boundary, then a quadratic boundary, then a hierarchical scheme: the whole population starts with one restaurant mix (40% / 35% / 25%), and each split refines the regions into their own mixes, e.g., 80% / 15% / 5% on one side and 5% / 15% / 80% on the other, down to leaf cells with mixes like 5% / 5% / 90%.]
Exploiting Arm Context: Linear Models
Here we look only at arm context, with no user context.
Assumptions:
• We can represent arms as vectors.
• Rewards are a noisy version of the inner product.
• Stationarity.
Methods include:
• Linear UCB
• Linear Thompson Sampling
• OFUL (Optimism in the Face of Uncertainty – Linear)
• … and many more.
The Math Slide
Standard noisy linear model: $r_t = x_t^T \theta^* + \eta_t$, where
• $\theta^*$ : the optimal arm
• $x_t$ : the arm pulled at time $t$
• $r_t$ : the reward at time $t$
• $\eta_t$ : the noise at time $t$
• $C_t$ : the confidence set
• $\lambda$ : the ridge term
• $X_t$ : the matrix of all arms pulled up to time $t$
Collect all the data and write $r = X \theta^* + \eta$.
Least squares solution: $\theta_{LS} = (X^T X)^{-1} X^T r$
Ridge regression: $\theta_{LSR} = (X^T X + \lambda I)^{-1} X^T r$
Typical linear bandit algorithm: start with $\theta_0 = 0$; for $t = 0, 1, 2, \dots$ pull $x_t = \arg\max_{x \in C_t} x^T \theta_t$ and update $\theta_{t+1} = (X_t^T X_t + \lambda I)^{-1} X_t^T r_{1:t}$, where $r_{1:t}$ collects the rewards so far.
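A short Python sketch of this loop over a finite arm set, replacing the abstract confidence-set maximization with the standard LinUCB-style uncertainty bonus (the bonus weight alpha and the pull(i) reward helper are assumptions, not from the slide):

    import numpy as np

    def linear_bandit(arms, pull, n_rounds, lam=1.0, alpha=1.0):
        # arms: (n_arms, d) array of arm vectors; pull(i) returns a noisy reward
        d = arms.shape[1]
        A = lam * np.eye(d)            # X^T X + lam * I
        b = np.zeros(d)                # X^T r
        theta = np.zeros(d)
        for _ in range(n_rounds):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b          # ridge estimate of theta*
            # optimism: estimated reward plus a per-arm uncertainty bonus
            bonus = np.sqrt(np.einsum('ij,jk,ik->i', arms, A_inv, arms))
            i = int(np.argmax(arms @ theta + alpha * bonus))
            r = pull(i)
            A += np.outer(arms[i], arms[i])
            b += r * arms[i]
        return theta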
Exploiting Arm Context
[Figure: the set of arms x1, x2, … as dishes (mince pie, Buffalo wings, tofu scramble, grilled vegetables, ratatouille, tandoori chicken, jalapeño scramble, Pad Thai, penne arrabiata) plotted on the meat ↔ vegetarian and mild ↔ spicy axes, with $\theta^*$ marking the optimal arm.]
Exploiting Arm Context
[Figure: Buffalo wings is pulled next, at angle θ to the optimal arm.] The reward (= cos θ) is small, but we can still infer information about other arms!
Exploiting Arm Context
[Figure: the estimate θ1 of the optimal arm, surrounded by the region of uncertainty (confidence set C1).]
We have already homed in on a pretty good choice. [Figure: the next arm chosen, x2, lies inside the shrinking region of uncertainty.]
And the process continues… [Figure: the updated estimate θ2 with a smaller confidence set C2.]
Some Caveats
• Big assumption that we know good features; finding features takes a lot of work.
• Few arms, many people → learn an embedding of the arms.
• Few people, many arms → featurize and use linear bandits.
• Linear models are a naive assumption; see kernel methods.
Agenda
1. Problem Overview
2. Algorithms
Non-contextual cases
Contextual cases
3. Industry Review
4. Advanced Topics
Industry Review
Companies using MAB
Headlines, Photos, and Ads: Washington Post and Google
Washington Post
Used Upper Confidence Bound (UCB) to pick headlines and photos.
Google Experiments
• Used Thompson Sampling (TS)
• Updated models twice a day
• Two metrics used to gauge the end of an experiment (a sketch of the second follows this list):
• 95% confidence that an alternative is better, or …
• the "potential value remaining in the experiment"
Takeaway Message
The more arms, the higher the gain over A/B testing.
Advanced Topics
Topics
• Biasing
• Data joining and latency
• Non-stationarity
Bias
Website 1: two variants shown with probability 50% / 50%; number sold: 100 / 20.
Website 2: the same two variants shown with probability 90% / 10%; number sold: 100 / 20.
Who did better?
Bias
• Be careful when using past data!
• Inverse propensity score matching: re-weight each observation by the ratio of the target probability to the logging probability.
• New sales estimates:
Website 1: 100·0.5 + 20·0.5 = 60
Website 2: 100·0.5·(0.5/0.9) + 20·0.5·(0.5/0.1) ≈ 78
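The same computation as a small Python sketch, mirroring the slide's formula (the function and variable names are illustrative):

    def ips_sales_estimate(sold, logged_p, target_p):
        # scale each variant's sales to the target policy via the propensity ratio
        return sum(s * pt * (pt / pl)
                   for s, pl, pt in zip(sold, logged_p, target_p))

    print(ips_sales_estimate([100, 20], [0.5, 0.5], [0.5, 0.5]))  # 60.0
    print(ips_sales_estimate([100, 20], [0.9, 0.1], [0.5, 0.5]))  # ~77.8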
Data Joining and Latency
[Diagram, courtesy of the Microsoft MWT white paper: the system emits context and a decision, the reward arrives back after some latency, and the two streams must be joined.]
Non-Stationarity – Beer example
My yearly beer taste: January: stouts and porters; April: pale ales and IPAs; July: wits and lagers; October: Oktoberfests and reds; December: Christmas ales.
Non-Stationarity
Preferences change over time, and there may be periodicity in the data; tax season is a great example.
Some solutions (a finite-memory sketch follows this list):
• Slow changes → a system with finite memory
• Abrupt changes → subspace tracking / anomaly detection
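One minimal finite-memory sketch for the slow-change case: replace each arm's running average with an exponentially discounted average, so old tips fade out (the discount gamma is a tuning assumption):

    def discounted_update(mean, reward, gamma=0.99):
        # exponentially weighted average: recent rewards count more,
        # so the estimate tracks slow drift in preferences
        return gamma * mean + (1 - gamma) * reward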
Takeaway Message
Preferences change over time, biases are added, and data needs to be joined from different sources.
Thank You.
Questions?
Editor's Notes
1. Gambling: which machines to play, how many times to play each machine, and in which order to play them. Research funding: given a fixed budget, the problem is to allocate resources among the competing projects. Clinical trials: investigating the effects of different experimental treatments while minimizing patient losses.
  2. Add a slide
4. ε-greedy is the most basic.
  6. https://www.cs.bham.ac.uk/internal/courses/robotics/lectures/ucb1.pdf
12. Compared to recommendation problems (Netflix), only one pairing is known, e.g., Peter has watched these movies vs. Peter went to McDonald's. Remove the "without" box, so we just have the "with".
13. On average, what a population might pay for the best average experience is lower than what each individual might pay for their optimal experience. Think of vegetarians and meat eaters: suppose a population with 2/3 meat eaters and 1/3 vegetarians. On average, you cater to non-vegetarians. So if people are willing to spend, on average, $15, the best possible population-wide reward is $10. If we can identify the two populations and select appropriate restaurants, we can get $15 on average: a $5 increase!
14. We can learn faster: knowing something about one arm tells us information about other arms. Think of vegetarians and meat eaters again: if we know that the population says no to meat-only restaurants, then we know that they might say the same about other meat-only restaurants. Learning that reduces the number of contenders for the optimal restaurant.
  15. change from more to less text
  16. Exploiting user context
  17. quadratic/linear/... should be in the subtitle
  18. exploiting user context - hierarchical
19. Convention: bold for vectors, capitals for matrices.
  20. remove dashed arm, make the title bold
21. Center the circle. It is a continuous estimate of the best answer, but we select from a discrete space; put dots.
22. Get rid of the bullet points; reduce verbosity further.
23. Instead of one bullet per company, add as many logos as possible, grouped by type of usage. Title: "Companies using Adaptive MAB algorithms".
24. Add subtitles saying what each one is.
25. Say that they aren't adding content; talk over the fact that they aren't using either kind of context. Make the workflow bigger; the last and second-to-last steps can be combined; remove bullet points.
26. Make a meta-point that people do a range of things between completely deploy-and-forget and continuous monitoring. The "value remaining" in an experiment is the amount of increased conversion rate you could get by switching away from the champion. The whole point of experimenting is to search for this value. If you're 100% sure that the champion is the best arm, then there is no value remaining in the experiment, and thus no point in experimenting. But if you're only 70% sure that an arm is optimal, then there is a 30% chance that another arm is better, and we can use Bayes' rule to work out the distribution of how much better it is. Figure 2: how many days earlier TS ends compared to A/B/n testing, as a frequency out of 500 experiments.
  27. Remove first line
28. Swap biasing and non-stationarity; make latency and joining one bullet point.
29. Both sold 120 pies, and the same number of each; the probability of showing them is different. Voice over why.
30. Data joining: we need to bring together the context (user and arm), which variation was shown (the decision), and the reward. Latency: the delay between the user's response and when the system sees it; batch updates.
31. In reality, for a short enough period, preferences remain mostly the same. Only infer insights from a fixed time window (good for slowly changing signals). Watch for whether something weird happened (good for sudden changes).
  32. Q&A or Thank You Slide