Solving the Contextual Multi-Armed Bandit
Problem at Nordstrom
John Maxwell
Nordstrom
2017/05/19
John Maxwell, Data Scientist, Nordstrom at MLconf Seattle 2017

John Maxwell, a data scientist at Nordstrom, did his graduate work in international development economics, focusing on field experiments. He has since led research projects in Indonesia and Ethiopia related to microenterprise, developed large mathematical simulation models used for investment decisions by WSDOT, built dynamic pricing algorithms at Thriftbooks.com, and led the development of Nordstrom’s open-source A/B testing service, Elwin. He currently focuses on contextual multi-armed bandit problems and machine learning infrastructure at Nordstrom.

Abstract

Solving the Contextual Multi-Armed Bandit Problem at Nordstrom:
The contextual multi-armed bandit problem, also known as associative reinforcement learning or bandits with side information, is a useful formulation of the multi-armed bandit problem that takes into account information about arms and users when deciding which arm to pull. The barrier to entry for both understanding and implementing contextual multi-armed bandits in production is high. The literature in this field pulls from disparate sources including (but not limited to) classical statistics, reinforcement learning, and information theory. Because of this, finding material that fills the gap between very basic explanations and academic journal articles is challenging. The goal of this talk is to provide those lacking intermediate materials as well as an example implementation. Specifically, I will explain key findings from some of the more cited papers in the contextual bandit literature, discuss the minimum requirements for implementation, and give an overview of a production system for solving contextual multi-armed bandit problems.



  1. Solving the Contextual Multi-Armed Bandit Problem at Nordstrom. John Maxwell, Nordstrom, 2017/05/19
  2. Motivating the Problem: Limitations of A/B testing for product recommendations
  3. Motivating the Problem: Limitations of A/B testing for product recommendations. Need to balance exploration and exploitation intelligently
  4. Motivating the Problem: Limitations of A/B testing for product recommendations. Need to balance exploration and exploitation intelligently. People aren't all the same, though maybe similar
  5. Exploration vs Exploitation: Explore first: explore, then learn (like A/B testing)
  6. Exploration vs Exploitation: ε-greedy: exploit, but also explore a little bit
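The ε-greedy rule from slide 6 fits in a few lines. A minimal sketch; the function name and inputs are illustrative, not from the talk:

```python
import random

def epsilon_greedy(avg_reward, epsilon=0.1):
    """With probability epsilon pick a uniformly random arm (explore);
    otherwise pick the arm with the best average reward so far (exploit)."""
    if random.random() < epsilon:
        return random.randrange(len(avg_reward))
    return max(range(len(avg_reward)), key=avg_reward.__getitem__)
```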
  7. Exploration vs Exploitation: Upper Confidence Bound (UCB): optimistic when uncertain
  8.-13. UCB Illustrated, steps 1-6: [bar plots of the average reward and upper confidence bound for Arm1 and Arm2 at each step; the arm whose upper bound is higher is chosen each round]
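The choices in the UCB illustration follow the standard UCB1 rule: score each arm by its average reward plus a confidence bonus that shrinks as the arm is pulled more. A minimal sketch (the bonus constant and names are illustrative):

```python
import math

def ucb_choice(counts, sums, t):
    """UCB1: average reward plus a sqrt(2 ln t / n) confidence bonus.
    Arms that have never been pulled get an infinite bonus,
    so every arm is tried at least once."""
    def ucb(n, s):
        if n == 0:
            return float("inf")
        return s / n + math.sqrt(2 * math.log(t) / n)
    scores = [ucb(n, s) for n, s in zip(counts, sums)]
    return max(range(len(scores)), key=scores.__getitem__)
```

With equal pull counts the bonus cancels and the higher average wins; an unpulled arm always wins.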
  14. Including Context: How can we use things we know about people and products (context) along with UCB?
  15. Including Context: How can we use things we know about people and products (context) along with UCB? Train a ridge regression for each arm (regress rewards on contexts)
  16. Including Context: How can we use things we know about people and products (context) along with UCB? Train a ridge regression for each arm (regress rewards on contexts). Choose the arm using the UCB idea!
  17. Including Context: How can we use things we know about people and products (context) along with UCB? Train a ridge regression for each arm (regress rewards on contexts). Choose the arm using the UCB idea!
          a_t = argmax_{a ∈ A_t} [ x_{t,a}^T θ̂_a (predicted payoff) + α √(x_{t,a}^T A_a^{-1} x_{t,a}) (standard deviation of payoff) ]
      Li et al. (2010)
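Slide 17's rule is LinUCB from Li et al. (2010): a per-arm ridge regression (A_a = D_a^T D_a + I, b_a = D_a^T r_a) scored by predicted payoff plus α times the predictive standard deviation. A minimal NumPy sketch, with illustrative variable names:

```python
import numpy as np

def linucb_choice(x, A, b, alpha=1.0):
    """LinUCB arm selection.
    x: context vector (d,); A: list of per-arm (d, d) matrices A_a;
    b: list of per-arm (d,) vectors b_a.
    For each arm: theta_hat = A_a^-1 b_a, then
    score = x . theta_hat  (predicted payoff)
          + alpha * sqrt(x^T A_a^-1 x)  (payoff standard deviation)."""
    scores = []
    for A_a, b_a in zip(A, b):
        A_inv = np.linalg.inv(A_a)
        theta_hat = A_inv @ b_a
        scores.append(x @ theta_hat + alpha * np.sqrt(x @ A_inv @ x))
    return int(np.argmax(scores))
```

Note the slide's next point: in production the A_a^-1 inverse makes this costly, since a potentially large matrix must be inverted on every call.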
  18. Including Context: This seems hard to implement
  19. Including Context: This seems hard to implement. Have to invert a potentially large matrix on every call
  20. Including Context: This seems hard to implement. Have to invert a potentially large matrix on every call. How do you deal with delayed rewards?
  21. Including Context: Notice how similar this is to classification
          arm 1   arm 2   arm 3
           1       .       .
           .       .5      .
           .       .       2
           .8      .       .
  22. Including Context: Notice how similar this is to classification
          arm 1   arm 2   arm 3
           1       .       .
           .       .5      .
           .       .       2
           .8      .       .
      We have partial feedback... how can we get full feedback?
  23. Including Context: Inverse propensity scoring:
          c_{i,t} = − r_{i,t}(a_i) · I{π(x_{i,t}) = a_i} / p_{i,t}(a_i)
          arm 1     arm 2     arm 3
          c_{1,1}   0         0
          0         c_{2,2}   0
          0         0         c_{3,3}
          c_{1,4}   0         0
      Agarwal et al. (2014)
  24. Including Context: If you think about IPS-transformed rewards as costs, you can reduce this to cost-sensitive classification
  25. Including Context: If you think about IPS-transformed rewards as costs, you can reduce this to cost-sensitive classification. Can use any cost-sensitive multi-class classification algorithm
  26. Including Context: If you think about IPS-transformed rewards as costs, you can reduce this to cost-sensitive classification. Can use any cost-sensitive multi-class classification algorithm. Simplest is probably least squares regression for each arm, with argmin to choose the cost-minimizing arm
  27. Including Context: If you think about IPS-transformed rewards as costs, you can reduce this to cost-sensitive classification. Can use any cost-sensitive multi-class classification algorithm. Simplest is probably least squares regression for each arm, with argmin to choose the cost-minimizing arm. Can do this part offline
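Slides 26-27 can be sketched as one batch least-squares fit per arm over logged (context, IPS cost) pairs, done offline, with a cheap argmin at serving time. The talk uses TensorFlow for this in production; here is an illustrative NumPy version:

```python
import numpy as np

def fit_cost_models(X, C):
    """Least-squares fit, one weight column per arm: C ≈ X @ W.
    X: (n, d) logged contexts; C: (n, k) IPS-transformed costs.
    This is the offline step."""
    W, *_ = np.linalg.lstsq(X, C, rcond=None)
    return W

def choose_arm(W, x):
    """Online step: pick the arm with the lowest predicted cost."""
    return int(np.argmin(x @ W))
```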
  28. Implementation
  29. Implementation: Dora: a node app that explores using ε-greedy
  30. Implementation: Dora: a node app that explores using ε-greedy. Logging, delayed joins
  31. Implementation: Dora: a node app that explores using ε-greedy. Logging, delayed joins. TensorFlow + TensorFlow Serving: consistent way to train and serve the cost-sensitive classifier
  32. Questions? twitter: @jhnmxwll, github: jmmaxwell, site: john-maxwell.com, email: john [at] john-maxwell.com
  33. References:
      Agarwal, Alekh, Daniel J. Hsu, Satyen Kale, John Langford, Lihong Li, and Robert E. Schapire. 2014. “Taming the Monster: A Fast and Simple Algorithm for Contextual Bandits.” CoRR abs/1402.0555. http://arxiv.org/abs/1402.0555.
      Li, Lihong, Wei Chu, John Langford, and Robert E. Schapire. 2010. “A Contextual-Bandit Approach to Personalized News Article Recommendation.” In Proceedings of the 19th International Conference on World Wide Web, 661-70. ACM.
