Economic Hierarchical Q-Learning
Erik G. Schultink, Ruggiero Cavallo and David C. Parkes
Harvard University
AAAI-08, July 17, 2008
Introduction
- Economic paradigms applied to hierarchical reinforcement learning
- Building on the work of:
  - Holland's classifier system (Holland 1986)
  - Eric Baum's Hayek system, with competitive, evolutionary agents that buy and sell control of the world to collectively solve the problem (Baum et al. 1998)
- Our thesis is that price systems can help resolve the tension between recursive optimality and hierarchical optimality
- We introduce the EHQ algorithm
Hierarchical Reinforcement Learning
- Decompose the problem into a set of sub-problems
- Each sub-problem is solved by a different agent
- Leaf nodes are primitive actions; non-leaf nodes are macroactions
- State abstraction: addresses the curse of dimensionality, leaving a smaller state space to explore
- Rewards accrue only for primitive actions
- Credit assignment problem: how to distribute reward in the system?
[Example hierarchy: Root → Drive to work, Eat Breakfast; Eat Breakfast → eat donut, drink coffee, eat cereal; Drive to work → stop, drive forward, turn right, turn left]
Hierarchical Reinforcement Learning
Decompose an MDP $M$ into a set of subtasks $\{M_0, M_1, \ldots, M_n\}$, where each $M_i$ consists of:
- $T_i$: termination predicate partitioning $M_i$ into active states $S_i$ and exit-states $E_i$
- $A_i$: set of actions that can be performed in $M_i$
- $R_i$: local-reward function
Hierarchical Reinforcement Learning
A hierarchical policy $\pi$ is a set $\{\pi_1, \pi_2, \ldots, \pi_n\}$, where $\pi_i$ is a mapping from a state $s$ to either a primitive action $a$ or a subtask policy $\pi_j$.
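To make this structure concrete, here is a minimal Python sketch of a subtask node and an execution loop for a hierarchical policy. All names (Subtask, execute, step) are illustrative assumptions, not code from the paper.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Union

State = tuple  # e.g., (x, y, fuel) in a grid world

@dataclass
class Subtask:
    """One node M_i in the task hierarchy."""
    name: str
    actions: List[Union[str, "Subtask"]]          # A_i: primitives or child subtasks
    is_exit: Callable[[State], bool]              # T_i: partitions states into S_i / E_i
    local_reward: Callable[[State, str], float]   # R_i
    policy: Dict[State, Union[str, "Subtask"]] = field(default_factory=dict)

def execute(task: Subtask, state: State,
            step: Callable[[State, str], State]) -> State:
    """Run subtask `task` from `state` until it reaches an exit-state."""
    while not task.is_exit(state):
        choice = task.policy[state]                # pi_i(s): primitive or child
        if isinstance(choice, Subtask):
            state = execute(choice, state, step)   # recurse into the macroaction
        else:
            state = step(state, choice)            # apply a primitive action
    return state
```

A macroaction is simply a Subtask appearing in its parent's action set, so invoking it recurses one level down the hierarchy.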
HOFuel Domain
- Grid-world navigation task
- A = {north, south, east, west, fill-up}
- The fill-up action is available only in the left-hand room
- Begin with 5 units of fuel
- Based on concepts described by Dietterich (2000)
Hierarchy for HOFuel
[Diagram: Root → Leave left room, Reach goal; the macroactions use the primitives north, east, south, west; fill-up is available only in the "Leave left room" macroaction]
Optimality Concepts
- Global Optimality: the traditional notion of optimality in reinforcement learning.
- Hierarchical Optimality: a hierarchically optimal (HO) policy selects the same primitive actions as the optimal policy in every state, subject to the constraints of the hierarchy (Dietterich 2000a).
- Recursive Optimality: a policy is recursively optimal (RO) if, for each subtask in the hierarchy, the policy $\pi_i$ is optimal given the policies for all descendants of the subtask $M_i$ in the hierarchy.
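As the speaker notes point out, these concepts are ordered in solution quality: the globally optimal policy is always at least as good as the hierarchically optimal one, which is always at least as good as the recursively optimal one. In symbols (notation assumed for this summary):

```latex
% Ordering of the three optimality concepts (per the speaker notes):
% globally optimal >= hierarchically optimal >= recursively optimal.
V^{\pi^*_{\text{GO}}}(s) \;\ge\; V^{\pi^*_{\text{HO}}}(s) \;\ge\; V^{\pi^*_{\text{RO}}}(s)
\qquad \text{for every state } s .
```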
Optimality in HOFuel
[Diagram: the grid world marked with the hierarchically optimal and the recursively optimal trajectories; hierarchy: Root → Leave left room, Reach goal]
Intuitive Motivation for EHQ
Transfer between agents to incentivize "Leave left room" to choose the upper door over the lower door.
[Diagram: Root → Leave left room, Reach goal]
Safe State Abstraction
To obtain hierarchical optimality, we must use state abstractions that are safe: the optimal policy in the original space is also optimal in the abstract space. Principles for safe state abstraction are shown in (Dietterich 2000).
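As a toy illustration of what a state abstraction does (not the formal safety conditions, which are in Dietterich's papers), here is a minimal Python sketch; the FullState fields and navigate_abstraction are invented for the example.

```python
# Within a navigation subtask, state variables irrelevant to that
# subtask can be dropped, shrinking the space the agent must explore.
from typing import NamedTuple

class FullState(NamedTuple):
    x: int
    y: int
    fuel: int
    passenger_on_board: bool   # relevant at the Root, not to pure navigation

def navigate_abstraction(s: FullState) -> tuple:
    """Project the full state onto the variables a navigation subtask needs."""
    return (s.x, s.y, s.fuel)  # drop passenger_on_board

# Two full states differing only in an irrelevant variable share one
# abstract state, so values learned for it are re-used across both.
a = navigate_abstraction(FullState(1, 2, 5, True))
b = navigate_abstraction(FullState(1, 2, 5, False))
assert a == b
```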
Value Decomposition
Different HRL algorithms use different additive decompositions for $Q(s,a)$. In the most general form, $Q(s,a)$ can be decomposed into:
- $Q_V(i,s,a)$: expected discounted reward to $i$ upon completion of $a$ (local reward to subtask $i$)
- $Q_C(i,s,a)$: expected discounted reward to $i$ after $a$ completes, until $i$ exits (local reward to subtask $i$)
- $Q_E(i,s,a)$: expected total discounted reward after subtask $i$ exits (reward not seen directly by subtask $i$)
(Dietterich 2000a; Andre and Russell 2002)
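Written as a single equation, with the grouping the slide indicates (the explicit sum is implied by the slide's layout rather than stated on it):

```latex
% General additive decomposition of the Q-function
% (Dietterich 2000a; Andre and Russell 2002):
Q(i,s,a) \;=\; \underbrace{Q_V(i,s,a) + Q_C(i,s,a)}_{\text{local reward to subtask } i}
\;+\; \underbrace{Q_E(i,s,a)}_{\text{reward not seen directly by subtask } i}
```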
Decentralization
An HRL algorithm is decentralized if every agent in the hierarchy needs only locally stored information to select an action.
Summary of Related HRL Algorithms (* shown only empirically)
- MAXQQ [Dietterich 2000]: converges to a recursively optimal policy
- ALispQ [Andre and Russell 2002]: converges to a hierarchically optimal policy
- HOCQ [Marthi and Russell 2006]: converges to a hierarchically optimal policy
EHQ Transfer System
1. Children submit bids to the parent (bid = $V^*(s)$ = the expected reward the child will obtain during execution, including the expected exit-state subsidy).
2. The parent passes control to the "winning" child (chosen under its exploration policy).
3. The child executes until it reaches an exit-state; reward accrues to the child (in the diagrams, +5 +2 -6 +3 = +4).
4. The child returns control and pays its bid to the parent (child -4, parent +4).
5. The parent pays the child a subsidy for the exit-state obtained (parent -1, child +1).
EHQ Subsidy Policy
Rather than explicitly model $Q_E$, EHQ provides subsidies to the child subtask for the quality, from the perspective of the parent, of the exit-state the child achieves.
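To make the accounting concrete, here is a minimal Python sketch of one parent/child exchange under this transfer system, using the numbers from the walkthrough above. The function and parameter names (run_macroaction, subsidy_for) are illustrative assumptions, not the paper's implementation.

```python
from typing import Callable, Iterable, Tuple

State = int  # placeholder state type for the sketch

def run_macroaction(
    bid: float,                                # child's bid = V*(s)
    rewards: Iterable[float],                  # primitive rewards the child accrues
    subsidy_for: Callable[[State], float],     # parent's exit-state subsidy policy
    exit_state: State,
) -> Tuple[float, float]:
    """One parent/child transfer under EHQ (illustrative accounting only)."""
    parent, child = 0.0, 0.0
    child += sum(rewards)          # 3. reward accrues to the child: +5 +2 -6 +3 = +4
    child -= bid                   # 4. child pays its bid back to the parent ...
    parent += bid                  #    ... parent receives +4
    subsidy = subsidy_for(exit_state)
    child += subsidy               # 5. parent subsidizes the exit-state achieved: +1
    parent -= subsidy
    return parent, child

# Reproduces the walkthrough: the parent nets +3 and the child nets +1.
print(run_macroaction(bid=4.0, rewards=[5, 2, -6, 3],
                      subsidy_for=lambda s: 1.0, exit_state=0))
```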

EHQ Transfer System
During execution, both parent and child update their local Q-values based on their stream of rewards.
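In spirit, each agent runs ordinary Q-learning on its own stream of rewards and transfers. A generic form of such a local update (my notation; EHQ actually maintains the $Q_V$ and $Q_C$ components rather than a single $Q_i$) is:

```latex
% Generic local Q-learning update for agent i, driven only by the
% rewards and transfers that agent i itself observes:
Q_i(s,a) \;\leftarrow\; (1-\alpha)\, Q_i(s,a)
\;+\; \alpha \left( r_i + \gamma \max_{a'} Q_i(s',a') \right)
```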
HOFuel Subsidy Convergence
[Plot: convergence of the exit-state subsidies during learning in HOFuel; hierarchy: Root → Leave left room, Reach goal]
Taxi Domain
RO = HO in this domain, which is taken from [Dietterich 2000].
EHQ appears to converge, but does not clearly surpass MAXQQ.
References
Andre, D., and Russell, S. 2002. State abstraction for programmable reinforcement learning agents. In AAAI-02. Edmonton, Alberta: AAAI Press.
Baum, E. B., and Durdanovic, I. 1998. Evolution of cooperative problem-solving in an artificial economy. Journal of Artificial Intelligence Research.
Dean, T., and Lin, S.-H. 1995. Decomposition techniques for planning in stochastic domains. In IJCAI-95, 1121–1127. San Francisco, CA: Morgan Kaufmann.
Dietterich, T. G. 2000a. Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research 13:227–303.
Dietterich, T. G. 2000b. State abstraction in MAXQ hierarchical reinforcement learning. Advances in Neural Information Processing Systems 12:994–1000.
Holland, J. 1986. Escaping brittleness: The possibilities of general-purpose learning algorithms applied to parallel rule-based systems. In Machine Learning, volume 2. San Mateo, CA: Morgan Kaufmann.
Marthi, B.; Russell, S.; and Andre, D. 2006. A compact, hierarchically optimal Q-function decomposition. In UAI-06.
Parr, R., and Russell, S. 1998. Reinforcement learning with hierarchies of machines. Advances in Neural Information Processing Systems 10.

Editor's notes

1. HRL is a variation on RL where the problem is decomposed into a set of sub-problems. These sub-problems can then be solved more-or-less independently and their solutions combined to build a solution to the original problem. There are several potential advantages to this approach: first, state abstraction – in many cases, certain aspects of the original state space can be ignored in the context of a particular sub-problem, allowing that sub-problem to be solved in a much smaller "abstract" state space. Second, the hierarchical structure of the decomposition lends itself to value decomposition – traditional RL Q-values can instead be expressed as a sum of several components; the components of Q-values can often be re-used, reducing the number of values that must be learned. Additionally, the solution policy to a given sub-problem may be re-usable in other parts of the hierarchy.
  2. Convert to non-technical slide on HRL. Why HRL – allows state abstraction, decompose into sub-problems
3. To help illustrate these concepts, we introduce the HOFuel domain, constructed to emphasize the distinction between the RO and HO solution policies. It is a grid-world navigation task with a fuel constraint. Running into walls is a no-op with a penalty; add opti
4. But HRL can introduce a tension for some domains; solving sub-problems without enough regard for how the solutions to individual sub-problems impact the overall solution quality can lead to solutions that are sub-optimal from the perspective of the original problem. Additionally, the structure of the hierarchy itself may artificially limit the solution quality. We thus differentiate between three concepts of optimality. The first, global optimality, is equivalent to the traditional notion of optimality in reinforcement learning.
5. The second, hierarchical optimality, is equivalent to global optimality except where constrained by the hierarchy.
6. The third, recursive optimality, is defined as each subtask being solved optimally with respect to the solutions to the sub-problems below it in the hierarchy. The globally optimal solution policy is always equivalent to or better than the HO solution. Similarly, the HO solution policy is always equivalent to or better than the RO solution policy. RO is easier, because the agent only has to reason about local rewards. Resolving this tension will be the focus of my work.
7. We conceptualize the hierarchy as though each sub-problem is being solved by a different agent. Dietterich (2000) noted that exit-reward payments could alter incentives in the problem to make the RO and HO solutions equivalent. We took further inspiration from the Hayek system developed by Eric Baum, a market-like system in which agents buy and sell control of the world, in an evolutionary context, to solve the problem. Hayek was itself based on Holland classifiers; both systems are applied to traditional RL, not HRL.
  8. HRL decompositions can improve learning speed by allowing extraneous state variables within a given subtask to be ignored within that subtask.
9. EHQ follows this decomposition framework, as do several other HRL algorithms in the literature. Notably, not all model QE explicitly (or at all).
10. ALispQ and HOCQ provide impressive HO convergence results; however, EHQ can achieve HO using a simple and decentralized pricing mechanism.
  11. Add rewards in timesteps ….
12. Modeling QE allows for HO convergence, but it often depends on many state variables, lessening the potential for state abstraction and slowing learning speed. In practice, we found it beneficial to limit Ej to the set of reachable exit-states, as discovered empirically during learning. (Briefly mention the other possible normalizations if time permits.)
13. Replace this with a high-level overview of the algorithm? (i.e., the agent at each node in the hierarchy does a form of Q-learning to update its local QV and QC values. The parent models the expected reward of invoking a macroaction, implemented by a child agent, by receiving a "bid" from that agent of its expected reward for the given state. When the parent chooses a macroaction to invoke, control is passed to the child agent along with information about what subsidies that child will be paid for its possible exit-states. When the child reaches an exit-state, it receives the subsidy for the state it achieved. Control is returned to the parent, which receives reward equal to the child's bid less the subsidy it paid the child.)
  14. Normalizing to min reachable (briefly mention the other possible normalizations if time permits)