Reward Innovation for long-term member satisfaction

Gary Tang, Jiangwei Pan, Henry
Wang and Justin Basilico
Goal
Create a personalized homepage to help
members find content to watch and enjoy
that
maximizes long-term member satisfaction.
Member enjoys watching Netflix
So continues the subscription
and
tells their friends about it
Long-term satisfaction for Netflix
Batch learning from bandit feedback
Production policy
● show recommendations to the member
Member gives feedback on recommendations
● immediate: skip/play a show
● long-term: cancel/renew subscription
Goal
● train a policy to maximize the long-term reward
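The setup above can be sketched with logged impressions and an inverse-propensity-scored (IPS) estimate of how a candidate policy would have performed. The slides do not name an estimator, so IPS is an assumption here, and `LoggedImpression`, `ips_value`, and the toy data are hypothetical names and values used only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class LoggedImpression:
    context: str       # hypothetical member/context identifier
    action: str        # item the production policy showed
    propensity: float  # probability the production policy showed this item
    reward: float      # reward attached to this impression


def ips_value(logs: Sequence[LoggedImpression],
              target_prob: Callable[[str, str], float]) -> float:
    """IPS estimate of the average reward a target policy would obtain."""
    total = 0.0
    for row in logs:
        # Reweight each logged reward by how much more (or less) likely
        # the target policy is to show the logged item.
        weight = target_prob(row.context, row.action) / row.propensity
        total += weight * row.reward
    return total / len(logs)


# Toy logs from a production policy that shows each item with probability 0.5.
logs = [
    LoggedImpression("m1", "show_a", 0.5, 1.0),
    LoggedImpression("m1", "show_b", 0.5, 0.0),
    LoggedImpression("m2", "show_a", 0.5, 0.0),
    LoggedImpression("m2", "show_b", 0.5, 1.0),
]


def always_a(context: str, action: str) -> float:
    """Candidate policy that always recommends show_a."""
    return 1.0 if action == "show_a" else 0.0


estimate = ips_value(logs, always_a)
print(estimate)
```

Training a policy in this setting amounts to searching for the `target_prob` that maximizes such an estimate; which reward goes into the `reward` field is the question the rest of the talk addresses.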
Challenges with long-term retention
Noisy signal
Influenced by external
factors
Insensitive signal
Only sensitive for
“borderline” members
Delayed signal
Need to wait a long time &
hard to attribute
Proxy reward
Train policy to optimize proxy reward
● highly correlated with long-term
satisfaction
● sensitive to individual
recommendations
Immediate feedback as proxy
● e.g. start to play a show
Delayed feedback could be more aligned
● e.g. completing a show
Delayed proxy reward
Need to wait a long time to observe the
delayed reward
● cannot be used in training
immediately
● can hurt cold-starting of the model
Don’t want to wait? Predict the delayed reward
● use all user actions up to policy training
time
Train policy to maximize predicted reward
Note: the predicted reward cannot be used
online, as it uses post-recommendation user
actions
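A minimal sketch of the idea, assuming "completing a show" is the delayed reward and "fraction watched so far" is the post-recommendation signal available at policy training time. The bucketed completion-rate predictor, the function names, and the toy data are all hypothetical; the slides do not describe the actual prediction model.

```python
from collections import defaultdict


def progress_bucket(fraction_watched: float) -> int:
    """Bucket partial progress into 0..4 (a hypothetical feature)."""
    return min(int(fraction_watched * 5), 4)


def fit_completion_rates(matured):
    """Fit per-bucket completion rates on impressions old enough that the
    delayed outcome is known.

    matured: iterable of (fraction_watched_at_snapshot, completed) pairs.
    """
    counts = defaultdict(lambda: [0, 0])  # bucket -> [completions, total]
    for fraction, completed in matured:
        bucket = progress_bucket(fraction)
        counts[bucket][0] += int(completed)
        counts[bucket][1] += 1
    return {bucket: done / total for bucket, (done, total) in counts.items()}


def delayed_reward(fraction, completed, rates, default=0.0):
    """Use the observed outcome when it exists, otherwise predict it."""
    if completed is not None:
        return 1.0 if completed else 0.0
    return rates.get(progress_bucket(fraction), default)


# Older impressions whose completion outcome has already been observed.
matured = [
    (0.10, False), (0.15, False),
    (0.50, True), (0.55, False),
    (0.90, True), (0.95, True),
]
rates = fit_completion_rates(matured)

# A recent impression: outcome not yet observed, so the reward is predicted.
print(delayed_reward(0.52, None, rates))
# An impression whose outcome is already known: use it directly.
print(delayed_reward(0.30, True, rates))
```

Because `fraction` is measured after the recommendation was shown, this predictor is only usable for labeling training data offline, which matches the note above.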
The ideas are not new
Related work:
● Long-term optimization in recommenders
● Reward shaping in reinforcement learning
● Online reward optimization
We focus on reward innovation as an important
product development workstream at Netflix
Integrating reward in bandit
Reward component
provides the objective
of bandit policy
training data
Reward innovation
● Ideation: what aspect of long-term satisfaction hasn’t been
captured as a reward?
○ Requires balancing perspectives of ML, business, and
psychology. Not easy!
● Development: how do we compute this new reward for every
recommended item?
○ Collecting immediate feedback, predicting delayed feedback
● Evaluation: is this a good reward for the bandit
policy to maximize?
Reward infrastructure
Common development patterns
● Computing immediate proxy
rewards
● Predicting delayed proxy rewards
● Combining multiple rewards
● Sharing rewards across multiple
recommender components
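One simple way to combine multiple rewards is a weighted sum, sketched below. The slides do not say how rewards are combined, so the linear form, the reward names, and the weights are assumptions for illustration only.

```python
def combine_rewards(rewards: dict, weights: dict) -> float:
    """Weighted sum of named proxy rewards for one recommended item."""
    missing = set(weights) - set(rewards)
    if missing:
        # Fail loudly rather than silently treating a missing reward as zero.
        raise KeyError(f"missing reward components: {sorted(missing)}")
    return sum(weights[name] * rewards[name] for name in weights)


# Hypothetical proxy rewards for a single impression.
rewards = {"play": 1.0, "completion": 0.5}
# Hypothetical weights expressing how much each proxy should count.
weights = {"play": 0.25, "completion": 0.75}

combined = combine_rewards(rewards, weights)
print(combined)
```

Keeping the weights in one shared definition is one way several recommender components could consume the same combined reward.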
Reward evaluation
● Online testing is expensive (time, resources)
● Use offline evaluation to determine promising
reward candidates for online testing
● Challenge: hard to compare policies trained with
different rewards
Offline reward evaluation
● compare policies along multiple
reward axes using OPE
● choose a small number of candidate
policies on the Pareto front
● compare candidate policies online
using long-term user satisfaction
metrics
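The selection step can be sketched as follows, assuming each policy already has an off-policy estimate on every reward axis. The policy names and scores are made up for illustration, and higher is assumed to be better on each axis.

```python
def dominates(a, b) -> bool:
    """True if score vector a is at least as good as b on every axis
    and strictly better on at least one."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def pareto_front(scores: dict) -> list:
    """Names of policies that no other policy dominates."""
    return sorted(
        name
        for name, vec in scores.items()
        if not any(dominates(other_vec, vec)
                   for other, other_vec in scores.items()
                   if other != name)
    )


# Hypothetical OPE estimates per policy: (play reward, completion reward).
scores = {
    "policy_play": (0.60, 0.20),
    "policy_completion": (0.45, 0.35),
    "policy_blend": (0.55, 0.30),
    "policy_weak": (0.40, 0.25),
}

candidates = pareto_front(scores)
print(candidates)
```

Here `policy_weak` is dropped because `policy_blend` beats it on both axes; the remaining policies represent different trade-offs and would be the ones compared online.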
Practical learnings
● Reward normalization: dynamic range of reward
can affect SGD training dynamics
● Reward features: pairing a reward with a correlated
feature tends to improve the model’s ability to
optimize that reward
● Reward alignment: make different parts of the
overall recommender system “point in the same
direction”
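As an illustration of the reward normalization point, the sketch below standardizes a raw reward to zero mean and unit variance before training. The slides do not say which normalization is used, so z-scoring and the "minutes watched" example values are assumptions.

```python
from statistics import mean, pstdev


def normalize(rewards, eps: float = 1e-8):
    """Z-score a batch of rewards so the scale seen by SGD is comparable
    across different reward definitions."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps guards against division by zero when all rewards are equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical raw reward: minutes watched, with a much wider range than a
# 0/1 play signal.
raw = [0.0, 30.0, 60.0, 90.0]
normalized = normalize(raw)
print(normalized)
```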
Summary
● Proxy rewards: we can train bandit policies using
proxy rewards to optimize long-term member
satisfaction
● Art and science: to come up with good reward
hypotheses
● Supporting infrastructure: we develop
infrastructure to help iterate on new hypotheses
quickly
Open challenges
● Proxy rewards: how can we identify proxy rewards that are
aligned with long-term satisfaction in a more principled
way?
● Reinforcement learning: can we use reinforcement
learning to optimize long-term reward directly for
recommendations?
Thank you!
Questions?
Jiangwei Pan
jpan@netflix.com