1. Multi-armed Bandit Problem: Potential Improvement for DARTS
CTG Data Science Lab
August 17, 2016
Aniruddha Bhargava, Yika Yujia Luo
2. Agenda
1. Problem Overview
2. Algorithms
Non-contextual cases
Contextual cases
3. Industry Review
4. Advanced Topics
4. When do we run into the Multi-armed Bandit Problem (MAB)?
Applications: gambling, research funding, clinical trials, content management
5. What is the Multi-armed Bandit Problem (MAB)?
Goal: Pick the best restaurant efficiently
Logistics: Select a restaurant for each person, who leaves you a tip afterwards
[Diagram: first round of tips: $1, $8, $10. How do we choose next? After more visits (tips of $3, $6, $6), the running averages are $2, $7, and $6.]
6. MAB Terminology
Exploration: learning people's preferences; always involves some degree of randomness
Exploitation: using the current, reliable knowledge to select the restaurant that looks best
Arm: a restaurant
Expected reward: the average tip a restaurant yields in the long run
Regret: the expected tip you lose by sending a person to a restaurant that is not the best
Policy: the strategy you use to select restaurants
Total cumulative regret: the total tips you lose; the standard performance measure for bandit algorithms
User: the people sent to restaurants
Reward: tips
[Diagram: sending people to restaurants with expected tips of $1, $8, and $10 (the best) incurs regrets of $9, $2, and $0 respectively; over the example's visits the regrets ($9, $9, $2, $0) add up to a total regret of $20.]
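To make the bookkeeping concrete, here is a minimal sketch of how total cumulative regret is tallied; the expected-tip numbers are the illustrative ones from the diagram.

```python
# Minimal sketch: tallying total cumulative regret for the restaurant example.
# The expected tips per restaurant are illustrative numbers from the diagram.
expected_tip = {"A": 1.0, "B": 8.0, "C": 10.0}    # restaurant C is the best arm
best = max(expected_tip.values())

visits = ["A", "A", "B", "C"]                      # restaurants we actually chose
regrets = [best - expected_tip[r] for r in visits]

print(regrets)       # [9.0, 9.0, 2.0, 0.0]
print(sum(regrets))  # total cumulative regret: 20.0
```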
7. Big Picture
MAB combines decision making and optimization:
Decision making: choose the best product by finding the best restaurant to go to.
Optimization: minimize total regret by sending as few people as possible to bad restaurants.
8. Algorithms (Non-contextual Cases)
“Anytime you are faced with the problem of both exploring
and exploiting a search space, you have a bandit problem.
Any method of solving that problem is a bandit algorithm”
-- Chris Stucchio
9. Non-contextual vs. Contextual
Non-contextual: user → product, with no features on either side.
IMPORTANT THING HERE: Although everyone has different tastes, we pick one best restaurant for everyone.
10. MAB Policies
A/B Testing
Adaptive: ε-greedy, Upper Confidence Bound (UCB), Thompson Sampling
There are more bandit algorithms…
11. A/B Testing
Exploration phase: each person i is assigned a restaurant at random (100% exploration).
Exploitation phase: once the test ends, each person j is sent to the winning restaurant (100% exploitation).
12. ε-greedy
Select (ε = 0.2): with probability 1 − ε, send person i to the restaurant with the highest average tips; with probability ε, pick a restaurant at random.
Update: record person i's feedback and update that restaurant's average tips.
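A minimal ε-greedy sketch in Python (not the presenters' code; the restaurant names and ε = 0.2 are the slide's illustrative values):

```python
import random

# Minimal ε-greedy sketch (illustrative; not the presenters' implementation).
epsilon = 0.2
arms = ["McDonald's", "Subway", "Chili's"]
counts = {a: 0 for a in arms}        # how many people were sent to each arm
avg_tip = {a: 0.0 for a in arms}     # running average tip per arm

def select_arm():
    if random.random() < epsilon:                   # explore
        return random.choice(arms)
    return max(arms, key=lambda a: avg_tip[a])      # exploit: highest average tips

def update(arm, tip):
    counts[arm] += 1
    avg_tip[arm] += (tip - avg_tip[arm]) / counts[arm]   # incremental mean

# usage: arm = select_arm(); ...observe the tip...; update(arm, tip)
```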
13. Upper Confidence Bound (UCB)
Select: send person i to the restaurant with the highest upper confidence bound.
Update: record person i's feedback and update that restaurant's upper confidence bound.
The bound combines the average tips from restaurant j, the number of people sent to restaurant j, and the total number of people, e.g. UCB_j = (average tips from restaurant j) + sqrt(2 · ln(#people) / #people sent to restaurant j).
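A minimal sketch of a UCB1-style index built from the same ingredients the slide lists (average tips, visits to the restaurant, total visitors); the exact constant in the bonus term is an assumption:

```python
import math

# Minimal UCB1-style sketch (illustrative constants; ingredients as on the slide).
arms = ["McDonald's", "Subway", "Chili's"]
counts = {a: 0 for a in arms}
avg_tip = {a: 0.0 for a in arms}

def ucb(arm, total):
    if counts[arm] == 0:
        return float("inf")                      # try every arm at least once
    bonus = math.sqrt(2 * math.log(total) / counts[arm])
    return avg_tip[arm] + bonus                  # average tips + uncertainty bonus

def select_arm():
    total = sum(counts.values()) + 1
    return max(arms, key=lambda a: ucb(a, total))

def update(arm, tip):
    counts[arm] += 1
    avg_tip[arm] += (tip - avg_tip[arm]) / counts[arm]
```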
14. Thompson Sampling (Bayesian)
Sampling: simulate the three restaurants' average-tip distributions (McDonald's, Subway, Chili's) and randomly draw one value from each distribution.
Select: send person i to the restaurant with the highest sampled tip.
Update: record person i's feedback and update that restaurant's average-tip distribution.
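A minimal Thompson Sampling sketch, assuming Gaussian posteriors over each restaurant's average tip with a known noise variance (the deck does not specify its distributional model):

```python
import random

# Minimal Thompson Sampling sketch. Assumes each restaurant's average tip has a
# Gaussian posterior with a known noise variance; the deck does not specify the
# actual distributional assumptions used.
arms = ["McDonald's", "Subway", "Chili's"]
post = {a: {"mean": 0.0, "var": 100.0} for a in arms}   # wide prior on the mean tip
noise_var = 1.0

def select_arm():
    # draw one sample of the mean tip from each posterior, pick the largest
    samples = {a: random.gauss(p["mean"], p["var"] ** 0.5) for a, p in post.items()}
    return max(samples, key=samples.get)

def update(arm, tip):
    p = post[arm]
    precision = 1.0 / p["var"] + 1.0 / noise_var        # conjugate Gaussian update
    p["mean"] = (p["mean"] / p["var"] + tip / noise_var) / precision
    p["var"] = 1.0 / precision
```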
15. Thompson Sampling (Bayesian)
[Charts: two posterior comparisons of restaurant average tips, with Pr(r < b) = 10% in one and Pr(r < b) = 0.01% in the other.]
16. Algorithm Comparison
1. Exploration vs. Exploitation
2. Total Regret
3. Batch Update
17. Algorithm Comparison: Exploration vs. Exploitation
IMPORTANT THING HERE: Exploration costs money!
[Chart: exploration (%) vs. time (%) for A/B testing and ε-greedy; A/B testing explores 100% of the time during the test period and 0% afterwards, while ε-greedy keeps exploring at a constant rate ε.]
18. Algorithm Comparison: Total Regret
[Pie charts over time: share of people sent to each restaurant (M, S, C). A/B testing keeps spreading traffic (44% / 28% / 28%), while the adaptive policy concentrates traffic on one restaurant (70% / 18% / 12%).]
19. Algorithm Comparison: Batch Update
Robustness to batch updates: A/B testing is very robust; ε-greedy depends; UCB is not robust; Thompson Sampling is robust.
[Diagram: the system asks users a question, stores many answers, and updates in a batch.]
20. Algorithm Comparison: Summary
A/B Testing. Pros: easy to implement; good for a small number of arms; robust to batch update. Cons: high total regret.
ε-greedy. Pros: easy to implement; if a good ε is found, lower total regret and finds the best arm faster than ε-first. Cons: need to figure out a good ε; high total regret.
UCB. Pros: good for a large number of arms; finds the best arm fast; low total regret. Cons: not robust to batch update.
Thompson Sampling. Pros: good for a large number of arms; finds the best arm fast; low total regret; robust to batch update. Cons: sensitive to statistical assumptions.
21. Non-contextual vs. Contextual
Contextual: user features (female, vegetarian, married, Latino, …) and product features (burger, non-vegetarian, cheap, good service, …).
IMPORTANT THING HERE: Everyone has different tastes, so we pick one best restaurant for each person.
22. Agenda
1. Problem Overview
2. Algorithms
Non-contextual cases
Contextual cases
3. Industry Review
4. Advanced Topics
24. What do we mean by context?
User side:
• Likes spicy food, refined tastes, plays violin, male, …
• From Wisconsin, likes German food, likes football, male, …
• Student, doesn't like seafood, allergic to cats, female, …
• Chief of AFC, watches shows on competitive eating, female, …
Arm side:
• Tex-Mex style, sit-down dining, founded in 1975, …
• Serves sandwiches, has veggie options, founded in 1965, …
• Breakfast, lunch, and dinner, cheap, founded in 1940, …
25. User Context
[Chart: average reward over time. A non-contextual bandit converges to the best possible reward without context; a user-contextual bandit converges to the higher best possible reward with context.]
26. Arm Context
[Chart: average reward over time for non-contextual, arm-context-only, user-context, and arm-and-user-context bandits, against the best possible rewards with and without user context. Arm context does not raise the ceiling, but reaches it faster.]
27. User context can increase the
optimal rewards;
Arm context can get you there
faster!
Takeaway Message
28. Exploiting Context
User side:
• Population segmentation (e.g., DARTS)
• Clustering users
• Learning an embedding
Arm side:
• Linear models: LinUCB, Linear TS, OFUL
• Maintain an estimate of the best arm
• More data → shrink uncertainty
29. Exploiting User Context
Assumptions:
• Users can be represented as points in space
• Users cluster together: points that are close are similar
• Stationarity
30. Exploiting User Context
[Scatter plot: users (Joe, Yao, Nichola, Peter, Aniruddha, Rachel, Sophie, Yika, Vineeta, Jason, Andre, Chris, Madeline, John) plotted along two taste dimensions: meat vs. vegetarian and spicy vs. mild.]
31. Exploiting User Context: Linear
[Scatter plot: the same users separated by a linear decision boundary in the meat-vegetarian / spicy-mild space.]
32. Exploiting User Context: Quadratic
[Scatter plot: the same users separated by a quadratic decision boundary.]
33. Exploiting User Context: Hierarchical
[Scatter plot: the whole population treated as one group, with a single restaurant preference distribution of 40% / 35% / 25%.]
34. Exploiting User Context: Hierarchical
[Scatter plot: the population split into two clusters, with restaurant preference distributions of 80% / 15% / 5% and 5% / 15% / 80%.]
35. Exploiting User Context: Hierarchical
[Scatter plot: four clusters with restaurant preference distributions of 5% / 50% / 45%, 80% / 15% / 5%, 5% / 10% / 85%, and 15% / 80% / 5%.]
36. Exploiting User Context: Hierarchical
[Scatter plot: five clusters with restaurant preference distributions of 80% / 15% / 5%, 5% / 5% / 90%, 10% / 45% / 45%, 5% / 50% / 45%, and 15% / 80% / 5%.]
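One way to act on the segmentation idea above is to cluster users on their taste features and run an independent bandit per cluster. A rough sketch, assuming scikit-learn's KMeans and an ε-greedy policy per cluster (the features, cluster count, and ε are illustrative, not the deck's setup):

```python
import random
import numpy as np
from sklearn.cluster import KMeans

# Sketch: cluster users on taste features, then run one simple bandit per cluster.
arms = ["McDonald's", "Subway", "Chili's"]
user_features = np.random.rand(200, 2)             # e.g., (meat vs. vegetarian, spicy vs. mild)
kmeans = KMeans(n_clusters=3, n_init=10).fit(user_features)

stats = {c: {a: [0, 0.0] for a in arms} for c in range(3)}   # per cluster: [count, avg tip]

def select_arm(user_x, epsilon=0.1):
    c = int(kmeans.predict(user_x.reshape(1, -1))[0])
    if random.random() < epsilon:
        return c, random.choice(arms)
    return c, max(arms, key=lambda a: stats[c][a][1])

def update(cluster, arm, tip):
    n, avg = stats[cluster][arm]
    stats[cluster][arm] = [n + 1, avg + (tip - avg) / (n + 1)]
```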
37. Exploiting Arm Context: Linear models
Look at only the arm context, no user context.
Assumptions:
• Arms can be represented as vectors.
• Rewards are a noisy version of the inner product between the arm and an unknown parameter vector.
• Stationarity.
Methods include:
• Linear UCB (LinUCB)
• Linear Thompson Sampling
• OFUL (Optimism in the Face of Uncertainty – Linear)
• … and many more.
38. The Math Slide
Convention: bold lowercase letters are vectors; capital letters are matrices.
Standard noisy linear model: r_t = x_tᵀ θ* + η_t
  θ* : the optimal arm
  x_t : arm pulled at time t
  r_t : reward at time t
  η_t : noise at time t
  C_t : confidence set
  λ : ridge term
  X_t : matrix of all arms pulled up to time t
Collect all the data and write: r = X θ* + η
Least-squares solution: θ_LS = (XᵀX)⁻¹ Xᵀ r
Ridge regression: θ_ridge = (XᵀX + λI)⁻¹ Xᵀ r
Typical linear bandit algorithm:
  θ_0 = 0
  for t = 0, 1, 2, …:
    x_t = argmax_{x ∈ C_t} xᵀ θ_t
    θ_t = (X_tᵀ X_t + λI)⁻¹ X_tᵀ r_t
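A minimal sketch of the loop above: a ridge-regression estimate of θ plus the argmax selection. It is purely greedy on the current estimate; LinUCB/OFUL would add an optimism bonus derived from the confidence set C_t:

```python
import numpy as np

# Sketch of the slide's typical linear bandit loop: ridge estimate of θ plus an
# argmax selection. Purely greedy here; LinUCB/OFUL would add an uncertainty bonus.
d = 2                                   # arm-feature dimension (illustrative)
lam = 1.0                               # ridge term λ
A = lam * np.eye(d)                     # running XᵀX + λI
b = np.zeros(d)                         # running Xᵀr
theta = np.zeros(d)                     # current estimate of θ*

def select_arm(candidate_arms):
    # pick the arm x maximizing xᵀθ over the candidate set (stand-in for C_t)
    scores = candidate_arms @ theta
    return candidate_arms[int(np.argmax(scores))]

def update(x, reward):
    global A, b, theta
    A += np.outer(x, x)
    b += reward * x
    theta = np.linalg.solve(A, b)       # θ = (XᵀX + λI)⁻¹ Xᵀ r
```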
39. Exploiting Arm Context: Arms
Set of arms: x1, x2, …
θ* : the optimal arm
[Scatter plot: candidate arms (mince pie, buffalo wings, tofu scramble, grilled vegetables, ratatouille, tandoori chicken, jalapeño scramble, pad Thai, penne arrabiata) plotted along the meat-vegetarian and spicy-mild axes, with the optimal arm marked.]
40. Exploiting Arm Context
[Diagram: the next arm chosen (Buffalo wings) sits at an angle θ to the optimal arm.]
The reward (= cos(θ)) is small, but we can still infer information about other arms!
41. Exploiting Arm Context
[Diagram: the arms, the optimal arm, the next arm chosen, the estimate θ1 of the optimal arm, and its region of uncertainty (confidence set C1).]
42. Exploiting Arm Context
We have already homed in on a pretty good choice.
[Diagram: the next arm chosen, x2, together with the shrinking estimate of the optimal arm and its region of uncertainty.]
43. Exploiting Arm Context
And the process continues …
[Diagram: the updated estimate θ2 and its region of uncertainty (confidence set C2) shrink further around the optimal arm.]
44. Some Caveats
• Big assumption that we know good features.
• Finding features takes a lot of work.
• Few arms, many people → learn an embedding of arms.
• Few people, many arms → featurize and use linear bandits.
• Linear models are a naive assumption; see kernel methods.
45. Agenda
1. Problem Overview
2. Algorithms
Non-contextual cases
Contextual cases
3. Industry Review
4. Advanced Topics
49. Washington Post
Used Upper Confidence Bound (UCB) to pick headlines and photos.
50. Google Experiments
Used Thompson Sampling (TS)
Updated models twice a day
Two metrics used to gauge the end of an experiment:
• 95% confidence that an alternative arm is better, or …
• the "potential value remaining in the experiment"
51. The more arms, the higher the gain over A/B testing.
Takeaway Message
53. Topics
• Bias
• Data joining and latency
• Non-stationarity
54. Bias
Case 1: Website 1 shown with probability 50%, Website 2 with 50%; number sold: 100 vs. 20.
Case 2: Website 1 shown with probability 90%, Website 2 with 10%; number sold: 100 vs. 20.
Who did better?
55. Bias
• Be careful when using past data!
• Inverse Propensity Score Matching
• New sales estimates (reweighted to a 50/50 policy):
Case 1: 100*0.5 + 20*0.5 = 60
Case 2: 100*0.5*(0.5/0.9) + 20*0.5*(0.5/0.1) ≈ 78
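A small sketch of the inverse-propensity correction, mirroring the slide's formula; the helper name ips_estimate is ours:

```python
# Sketch of the slide's inverse-propensity-score correction (helper name is ours).
def ips_estimate(sales, logged_probs, target_probs=(0.5, 0.5)):
    """Reweight each website's sales by (target prob / logging prob),
    then combine under the target 50/50 policy, mirroring the slide."""
    return sum(s * t * (t / l) for s, t, l in zip(sales, target_probs, logged_probs))

print(ips_estimate((100, 20), (0.5, 0.5)))   # case 1: 60.0
print(ips_estimate((100, 20), (0.9, 0.1)))   # case 2: ~77.8
```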
56. Data Joining and Latency
[Diagram (courtesy of the Microsoft MWT white paper): the context and decision are logged when an arm is shown; rewards arrive later and must be joined back, introducing latency.]
57. Non-Stationarity – Beer example
My yearly beer taste:
• January: stouts and porters
• April: pale ales and IPAs
• July: wits and lagers
• October: Oktoberfests and reds
• December: Christmas ales
58. Non-Stationarity
Preferences change over time.
There may be periodicity in the data; tax season is a great example.
Some solutions:
• Slow changes → a system with finite memory
• Abrupt changes → subspace tracking / anomaly detection
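A sketch of the finite-memory idea: keep only a sliding window of recent rewards per arm so that stale preferences age out (the window length is an illustrative choice, not a recommendation from the deck):

```python
from collections import deque

# Sketch of a finite-memory (sliding-window) average for non-stationary rewards.
WINDOW = 200
recent = {}            # arm -> deque of the most recent rewards

def update(arm, reward):
    recent.setdefault(arm, deque(maxlen=WINDOW)).append(reward)

def windowed_average(arm):
    r = recent.get(arm)
    return sum(r) / len(r) if r else 0.0   # use this in place of the all-time average
```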
59. Preferences change over time, biases are added, and data needs to be joined from different sources.
Takeaway Message
Speaker notes
Gambling: which machines to play, how many times to play each machine, and in which order to play them.
Research funding: given a fixed budget, allocate resources among competing projects.
Clinical trials: investigate the effects of different experimental treatments while minimizing patient losses.
Compared to recommendation problems (Netflix), only one (user, arm) pair is known at a time, e.g., Peter has watched many movies vs. Peter went to McDonald's.
On average, what a population might pay for the best single experience is lower than what each individual would pay for their own optimal experience.
Think of vegetarians and meat eaters: suppose the population is 2/3 meat eaters and 1/3 vegetarians.
On average we would cater to the non-vegetarians, so even if people are willing to spend $15 on average for their favorite food, the best possible population-wide reward is only $10.
If we can identify the two populations and select appropriate restaurants for each, we can get $15 on average: a $5 increase!
Arm context lets us learn faster: knowing something about one arm tells us information about other arms.
Think of vegetarians and meat eaters again: if we learn that the population says no to one meat-only restaurant, they will likely say the same about other meat-only restaurants.
Learning that reduces the number of contenders for the optimal restaurant.
A linear bandit keeps a continuous estimate of the best answer while selecting from a discrete set of arms.
The companies in the industry review are not adding new content, and they are not using either user or arm context; in practice, deployments range from completely deploy-and-forget to actively monitored.
The "value remaining" in an experiment is the amount of increased conversion rate you could get by switching away from the champion. The whole point of experimenting is to search for this value. If you’re 100% sure that the champion is the best arm, then there is no value remaining in the experiment, and thus no point in experimenting. But if you’re only 70% sure that an arm is optimal, then there is a 30% chance that another arm is better, and we can use Bayes’ rule to work out the distribution of how much better it is.
Figure 2: how many days earlier Thompson Sampling ends compared with A/B/n testing (frequency out of 500 experiments).
In the bias example, both cases sold 120 pies, with the same number from each website; only the probability of showing them differed.
Data joining: we need to bring together the context (user and arm), which variation was shown (the decision), and the reward.
Latency: the delay between the user's response and when the system sees it; batch updates.
In reality, over a short enough period, preferences remain mostly the same.
Finite memory: only infer insights from a fixed time window (good for slowly changing signals).
Anomaly detection: watch for something unusual happening (good for sudden changes).