Economic Hierarchical Q-Learning
Erik G. Schultink, Ruggiero Cavallo and David C. Parkes
Harvard University
AAAI-08, July 17, 2008
Introduction
- Economic paradigms applied to hierarchical reinforcement learning
- Building on the work of:
  - Holland's classifier system (Holland 1986)
  - Eric Baum's Hayek system, with competitive, evolutionary agents that buy and sell control of the world to collectively solve the problem (Baum et al. 1998)
- Our thesis is that price systems can help resolve the tension between recursive optimality and hierarchical optimality
- We introduce the EHQ algorithm
Hierarchical Reinforcement Learning
- Decompose the problem into a set of sub-problems
- Each sub-problem is solved by a different agent
- Leaf nodes are primitive actions; non-leaf nodes are macroactions
- State abstraction: addresses the curse of dimensionality, leaving a smaller state space to explore
- Rewards accrue only for primitive actions
- Credit assignment problem: how to distribute reward in the system?
[Example hierarchy: Root → Drive to work, Eat Breakfast; Eat Breakfast → eat donut, drink coffee, eat cereal; Drive to work → stop, drive forward, turn right, turn left]
Hierarchical Reinforcement Learning
Decompose an MDP $M$ into a set of subtasks $\{M_0, M_1, \ldots, M_n\}$, where each $M_i$ consists of:
- $T_i$: termination predicate partitioning $M_i$ into active states $S_i$ and exit-states $E_i$
- $A_i$: set of actions that can be performed in $M_i$
- $R_i$: local-reward function
Hierarchical Reinforcement Learning
A hierarchical policy $\pi$ is a set $\{\pi_1, \pi_2, \ldots, \pi_n\}$, where $\pi_i$ is a mapping from a state $s$ to either a primitive action $a$ or a subtask policy $\pi_j$.
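To make this structure concrete, here is a minimal Python sketch of a subtask node and an execution loop for a hierarchical policy. All names (Subtask, execute, step) are illustrative assumptions, not code from the paper.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Union

State = tuple  # e.g., (x, y, fuel) in a grid world

@dataclass
class Subtask:
    """One node M_i in the task hierarchy."""
    name: str
    actions: List[Union[str, "Subtask"]]          # A_i: primitives or child subtasks
    is_exit: Callable[[State], bool]              # T_i: partitions states into S_i / E_i
    local_reward: Callable[[State, str], float]   # R_i
    policy: Dict[State, Union[str, "Subtask"]] = field(default_factory=dict)

def execute(task: Subtask, state: State,
            step: Callable[[State, str], State]) -> State:
    """Run subtask `task` from `state` until it reaches an exit-state."""
    while not task.is_exit(state):
        choice = task.policy[state]                # pi_i(s): primitive or child
        if isinstance(choice, Subtask):
            state = execute(choice, state, step)   # recurse into the macroaction
        else:
            state = step(state, choice)            # apply a primitive action
    return state
```

A macroaction is simply a Subtask appearing in its parent's action set, so invoking it recurses one level down the hierarchy.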
HOFuel Domain
- Grid-world navigation task
- A = {north, south, east, west, fill-up}
- The fill-up action is available only in the left-hand room
- Begin with 5 units of fuel
- Based on concepts described by Dietterich (2000)
Hierarchy for HOFuel
[Diagram: Root → Leave left room, Reach goal; the macroactions use the primitives north, east, south, west; fill-up is available only in the "Leave left room" macroaction]
Optimality Concepts
- Global Optimality: the traditional notion of optimality in reinforcement learning.
- Hierarchical Optimality: a hierarchically optimal (HO) policy selects the same primitive actions as the optimal policy in every state, subject to the constraints of the hierarchy (Dietterich 2000a).
- Recursive Optimality: a policy is recursively optimal (RO) if, for each subtask in the hierarchy, the policy $\pi_i$ is optimal given the policies for all descendants of the subtask $M_i$ in the hierarchy.
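As the speaker notes point out, these concepts are ordered in solution quality: the globally optimal policy is always at least as good as the hierarchically optimal one, which is always at least as good as the recursively optimal one. In symbols (notation assumed for this summary):

```latex
% Ordering of the three optimality concepts (per the speaker notes):
% globally optimal >= hierarchically optimal >= recursively optimal.
V^{\pi^*_{\text{GO}}}(s) \;\ge\; V^{\pi^*_{\text{HO}}}(s) \;\ge\; V^{\pi^*_{\text{RO}}}(s)
\qquad \text{for every state } s .
```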
Optimality in HOFuel
[Diagram: the grid world marked with the hierarchically optimal and the recursively optimal trajectories; hierarchy: Root → Leave left room, Reach goal]
Intuitive Motivation for EHQ
Transfer between agents to incentivize "Leave left room" to choose the upper door over the lower door.
[Diagram: Root → Leave left room, Reach goal]
Safe State Abstraction
To obtain hierarchical optimality, we must use state abstractions that are safe: the optimal policy in the original space is also optimal in the abstract space. Principles for safe state abstraction are shown in (Dietterich 2000).
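As a toy illustration of what a state abstraction does (not the formal safety conditions, which are in Dietterich's papers), here is a minimal Python sketch; the FullState fields and navigate_abstraction are invented for the example.

```python
# Within a navigation subtask, state variables irrelevant to that
# subtask can be dropped, shrinking the space the agent must explore.
from typing import NamedTuple

class FullState(NamedTuple):
    x: int
    y: int
    fuel: int
    passenger_on_board: bool   # relevant at the Root, not to pure navigation

def navigate_abstraction(s: FullState) -> tuple:
    """Project the full state onto the variables a navigation subtask needs."""
    return (s.x, s.y, s.fuel)  # drop passenger_on_board

# Two full states differing only in an irrelevant variable share one
# abstract state, so values learned for it are re-used across both.
a = navigate_abstraction(FullState(1, 2, 5, True))
b = navigate_abstraction(FullState(1, 2, 5, False))
assert a == b
```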
Value Decomposition
Different HRL algorithms use different additive decompositions for $Q(s,a)$. In the most general form, $Q(s,a)$ can be decomposed into:
- $Q_V(i,s,a)$: expected discounted reward to $i$ upon completion of $a$ (local reward to subtask $i$)
- $Q_C(i,s,a)$: expected discounted reward to $i$ after $a$ completes, until $i$ exits (local reward to subtask $i$)
- $Q_E(i,s,a)$: expected total discounted reward after subtask $i$ exits (reward not seen directly by subtask $i$)
(Dietterich 2000a; Andre and Russell 2002)
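Written as a single equation, with the grouping the slide indicates (the explicit sum is implied by the slide's layout rather than stated on it):

```latex
% General additive decomposition of the Q-function
% (Dietterich 2000a; Andre and Russell 2002):
Q(i,s,a) \;=\; \underbrace{Q_V(i,s,a) + Q_C(i,s,a)}_{\text{local reward to subtask } i}
\;+\; \underbrace{Q_E(i,s,a)}_{\text{reward not seen directly by subtask } i}
```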
Decentralization
An HRL algorithm is decentralized if every agent in the hierarchy needs only locally stored information to select an action.
Summary of Related HRL Algorithms (* shown only empirically)
- MAXQQ [Dietterich 2000]: converges to a recursively optimal policy
- ALispQ [Andre and Russell 2002]: converges to a hierarchically optimal policy
- HOCQ [Marthi and Russell 2006]: converges to a hierarchically optimal policy
EHQ Transfer System
1. Children submit bids to the parent (bid = $V^*(s)$ = the expected reward the child will obtain during execution, including the expected exit-state subsidy).
2. The parent passes control to the "winning" child (chosen under its exploration policy).
3. The child executes until it reaches an exit-state; reward accrues to the child (in the diagrams, +5 +2 -6 +3 = +4).
4. The child returns control and pays its bid to the parent (child -4, parent +4).
5. The parent pays the child a subsidy for the exit-state obtained (parent -1, child +1).
EHQ Subsidy Policy
Rather than explicitly model $Q_E$, EHQ provides subsidies to the child subtask for the quality, from the perspective of the parent, of the exit-state the child achieves.
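To make the accounting concrete, here is a minimal Python sketch of one parent/child exchange under this transfer system, using the numbers from the walkthrough above. The function and parameter names (run_macroaction, subsidy_for) are illustrative assumptions, not the paper's implementation.

```python
from typing import Callable, Iterable, Tuple

State = int  # placeholder state type for the sketch

def run_macroaction(
    bid: float,                                # child's bid = V*(s)
    rewards: Iterable[float],                  # primitive rewards the child accrues
    subsidy_for: Callable[[State], float],     # parent's exit-state subsidy policy
    exit_state: State,
) -> Tuple[float, float]:
    """One parent/child transfer under EHQ (illustrative accounting only)."""
    parent, child = 0.0, 0.0
    child += sum(rewards)          # 3. reward accrues to the child: +5 +2 -6 +3 = +4
    child -= bid                   # 4. child pays its bid back to the parent ...
    parent += bid                  #    ... parent receives +4
    subsidy = subsidy_for(exit_state)
    child += subsidy               # 5. parent subsidizes the exit-state achieved: +1
    parent -= subsidy
    return parent, child

# Reproduces the walkthrough: the parent nets +3 and the child nets +1.
print(run_macroaction(bid=4.0, rewards=[5, 2, -6, 3],
                      subsidy_for=lambda s: 1.0, exit_state=0))
```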

EHQ Transfer System
During execution, both parent and child update their local Q-values based on their stream of rewards.
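In spirit, each agent runs ordinary Q-learning on its own stream of rewards and transfers. A generic form of such a local update (my notation; EHQ actually maintains the $Q_V$ and $Q_C$ components rather than a single $Q_i$) is:

```latex
% Generic local Q-learning update for agent i, driven only by the
% rewards and transfers that agent i itself observes:
Q_i(s,a) \;\leftarrow\; (1-\alpha)\, Q_i(s,a)
\;+\; \alpha \left( r_i + \gamma \max_{a'} Q_i(s',a') \right)
```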
HOFuel Subsidy Convergence
[Plot: convergence of the exit-state subsidies during learning in HOFuel; hierarchy: Root → Leave left room, Reach goal]
Taxi Domain
RO = HO in this domain, which is taken from [Dietterich 2000].
EHQ appears to converge, but does not clearly surpass MAXQQ.
References
Andre, D., and Russell, S. 2002. State abstraction for programmable reinforcement learning agents. In AAAI-02. Edmonton, Alberta: AAAI Press.
Baum, E. B., and Durdanovic, I. 1998. Evolution of cooperative problem-solving in an artificial economy. Journal of Artificial Intelligence Research.
Dean, T., and Lin, S.-H. 1995. Decomposition techniques for planning in stochastic domains. In IJCAI-95, 1121–1127. San Francisco, CA: Morgan Kaufmann.
Dietterich, T. G. 2000a. Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research 13:227–303.
Dietterich, T. G. 2000b. State abstraction in MAXQ hierarchical reinforcement learning. Advances in Neural Information Processing Systems 12:994–1000.
Holland, J. 1986. Escaping brittleness: The possibilities of general-purpose learning algorithms applied to parallel rule-based systems. In Machine Learning, volume 2. San Mateo, CA: Morgan Kaufmann.
Marthi, B.; Russell, S.; and Andre, D. 2006. A compact, hierarchically optimal Q-function decomposition. In UAI-06.
Parr, R., and Russell, S. 1998. Reinforcement learning with hierarchies of machines. Advances in Neural Information Processing Systems 10.

Editor's notes

1. HRL is a variation on RL where the problem is decomposed into a set of sub-problems. These sub-problems can then be solved more-or-less independently and their solutions combined to build a solution to the original problem. There are several potential advantages to this approach: first, state abstraction – in many cases, certain aspects of the original state space can be ignored in the context of a particular sub-problem, allowing that sub-problem to be solved in a much smaller "abstract" state space. Second, the hierarchical structure of the decomposition lends itself to value decomposition – traditional RL Q-values can instead be expressed as a sum of several components; the components of Q-values can often be re-used, reducing the number of values that must be learned. Additionally, the solution policy to a given sub-problem may be re-usable in other parts of the hierarchy.
  2. Convert to non-technical slide on HRL. Why HRL – allows state abstraction, decompose into sub-problems
3. To help illustrate these concepts, we introduce the HOFuel domain, constructed to emphasize the distinction between the RO and HO solution policies. It is a grid-world navigation task with a fuel constraint. Running into walls is a no-op with a penalty; add opti
4. But HRL can introduce a tension for some domains; solving sub-problems without enough regard for how the solutions to individual sub-problems impact the overall solution quality can lead to solutions that are sub-optimal from the perspective of the original problem. Additionally, the structure of the hierarchy itself may artificially limit the solution quality. We thus differentiate between three concepts of optimality. The first, global optimality, is equivalent to the traditional notion of optimality in reinforcement learning.
5. The second, hierarchical optimality, is equivalent to global optimality except where constrained by the hierarchy.
6. The third, recursive optimality, is defined as each subtask being solved optimally with respect to the solutions to the sub-problems below it in the hierarchy. The globally optimal solution policy is always equivalent to or better than the HO solution. Similarly, the HO solution policy is always equivalent to or better than the RO solution policy. RO is easier, because the agent only has to reason about local rewards. Resolving this tension will be the focus of my work.
7. We conceptualize the hierarchy as though each sub-problem is being solved by a different agent. Dietterich (2000) noted that exit-reward payments could alter incentives in the problem to make the RO and HO solutions equivalent. We took further inspiration from the Hayek system developed by Eric Baum, a market-like system in which agents buy and sell control of the world, in an evolutionary context, to solve the problem. Hayek was itself based on Holland classifiers; both systems are applied to traditional RL, not HRL.
  8. HRL decompositions can improve learning speed by allowing extraneous state variables within a given subtask to be ignored within that subtask.
9. EHQ follows this decomposition framework, as do several other HRL algorithms in the literature. Notably, not all model QE explicitly (or at all).
10. ALispQ and HOCQ provide impressive HO convergence results; however, EHQ can achieve HO using a simple and decentralized pricing mechanism.
  11. Add rewards in timesteps ….
12. Modeling QE allows for HO convergence, but it often depends on many state variables, lessening the potential for state abstraction and slowing learning speed. In practice, we found it beneficial to limit Ej to the set of reachable exit-states, as discovered empirically during learning. (Briefly mention the other possible normalizations if time permits.)
13. Replace this with a high-level overview of the algorithm? (i.e., the agent at each node in the hierarchy does a form of Q-learning to update its local QV and QC values. The parent models the expected reward of invoking a macroaction, implemented by a child agent, by receiving a "bid" from that agent of its expected reward for the given state. When the parent chooses a macroaction to invoke, control is passed to the child agent along with information about what subsidies that child will be paid for its possible exit-states. When the child reaches an exit-state, it receives the subsidy for the state it achieved. Control is returned to the parent, which receives reward equal to the child's bid less the subsidy it paid the child.)
  14. Normalizing to min reachable (briefly mention the other possible normalizations if time permits)