4. Want to see relevant and diverse items in recommendation
[Figure: image-search results for "trunk" — Bing (little diversity) vs. Google (some diversity); results could have more diversity]
5. Want to take relevant and diverse actions in team sports
[Figure: zone defense vs. man-to-man defense (images: basketballhq.com, upload.wikimedia.org)]
6. Diversity also matters in soccer
[Figure: RoboCup Soccer Simulation; J League (based on data from DataStadium)]
9. Reinforcement learning seeks to find an optimal policy for sequential decision making
[Figure: agent-environment loop — the agent sends an action to the environment and receives an observation and a reward]
Goal: Maximize cumulative reward
10. One approach to reinforcement learning is to learn the action-value function
• Action-value function: Q^\pi
  • Q^\pi(s, a): expected cumulative reward from state s when taking action a and then following the (optimal) policy \pi
• Assume (relaxed later): the Markovian state s is fully observable
[Figure: a trajectory unrolling from state s after action a; the rewards collected along the way sum to Q^\pi(s, a)]
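For reference, the discounted form of this definition can be written as follows (my notation; the discount factor \gamma is left implicit on the slide):

Q^\pi(s, a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t\, r_t \,\middle|\, s_0 = s,\ a_0 = a,\ a_t \sim \pi \ \text{for}\ t \ge 1\right]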
11. For most practical tasks, the action-value function needs to be approximated
Exponentially large state space
• Combination of multiple factors
• e.g. Factor: “state” of each position
• History of observations
Exponentially large action space
• Combination of multiple “levers”
• Combination of multiple agents
⇒ Cannot deal with Q(s, a) in tabular form
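To make the blow-up concrete (the numbers are illustrative, not from the slides): with N agents each choosing among k primitive actions, the number of team-actions is k^N, e.g.

4^{10} = 1{,}048{,}576 \quad (k = 4,\ N = 10),

so a table with one entry per (s, a) pair is out of reach even before the state space is taken into account.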
12. Action-value functions approximated with (deep) neural networks
[Figure: a neural network maps the state s through a distributed representation (e.g. one unit per pixel) to outputs Q^\pi(s, a_1), Q^\pi(s, a_2), …; this requires exponentially many output units for an exponentially large action space]
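A minimal sketch of the architecture in the figure (my code; the state dimension, layer sizes, and the use of PyTorch are assumptions, not from the slides). Note how the output layer must enumerate every team-action:

    # Naive DQN-style architecture: one output unit per team-action.
    import torch
    import torch.nn as nn

    n_agents, k = 3, 7                  # hypothetical team: 3 agents, 7 moves each
    n_team_actions = k ** n_agents      # 343 outputs here; grows as k^N

    q_net = nn.Sequential(
        nn.Linear(84, 256),             # distributed state representation (illustrative size)
        nn.ReLU(),
        nn.Linear(256, n_team_actions), # Q(s, a) for every team-action a
    )

    s = torch.randn(1, 84)              # dummy state
    q_values = q_net(s)                 # shape (1, 343); infeasible once k**N explodes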
13. Challenges in collaborative multi-agent reinforcement learning
• Exponentially many combinations of actions (team-actions)
1. How to efficiently evaluate the value of team-actions
2. How to efficiently sample good team-actions
14. Consider the diversity of actions, in addition to the value (relevance) of each action
[Figure: feature vectors of the selected actions span a parallelepiped; greater diversity means greater volume]
Value of the team-action of 3 agents: \mathrm{Volume}^2 = Q(s, \{a_1, a_3, a_5\})
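A small numerical illustration of the volume idea (my example, not from the slides): the determinant of the Gram matrix of the selected actions' feature vectors equals the squared volume they span, so near-duplicate actions drive the value toward zero:

    # Squared volume spanned by action feature vectors = det of their Gram matrix.
    import numpy as np

    diverse = np.array([[1.0, 0.0], [0.0, 1.0]])    # two dissimilar actions
    similar = np.array([[1.0, 0.0], [0.99, 0.14]])  # two nearly identical actions

    for V in (diverse, similar):
        gram = V @ V.T                              # L = V V^T (positive semi-definite)
        print(np.linalg.det(gram))                  # squared volume: ~1.0 vs ~0.02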
16. Our definition of diversity (similarity) in multi-agent reinforcement learning
• The value of a team-action is represented by a determinant (volume)
We will learn the similarity between actions accordingly:
• two actions are similar ⇔ the value is low when the two actions are taken together
• two actions are dissimilar ⇔ the value is high when the two actions are taken together
17. We represent the action-value function with a determinant
• \mathbf{z}_{\le t} \equiv (\mathbf{z}_0, \mathbf{z}_1, \ldots, \mathbf{z}_t): time series of observations
  • \mathbf{z}_{\le t} = s_t if the Markovian state s_t is observable
• \mathbf{x}_t \equiv \psi(a_t) \in \{0, 1\}^N: binary features of the team-action a_t
  • e.g. \mathbf{x}_t indicates which actions are selected by the team
• \mathbf{L}_t: positive semi-definite matrix (kernel) that can depend on \mathbf{z}_{\le t}
  • \mathbf{L}_t(\mathbf{x}_t): principal submatrix of \mathbf{L}_t restricted to the rows and columns i with x_{t,i} = 1

Q_\theta(\mathbf{z}_{\le t}, \mathbf{x}_t) = \alpha + \log\det \mathbf{L}_t(\mathbf{x}_t)
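A direct transcription of this value function as code (my sketch; the toy kernel L, the value of α, and the helper name q_value are illustrative):

    # Q_theta(z_{<=t}, x_t) = alpha + log det L_t(x_t), where L_t(x_t) is the
    # principal submatrix of L_t indexed by the selected actions.
    import numpy as np

    def q_value(L, x, alpha=0.0):
        idx = np.flatnonzero(x)              # actions with x_i = 1
        sub = L[np.ix_(idx, idx)]            # principal submatrix L(x)
        sign, logdet = np.linalg.slogdet(sub)
        return alpha + logdet                # sign is +1 for a PSD kernel

    L = np.array([[2.0, 0.5, 0.1],           # toy PSD kernel over N = 3 actions
                  [0.5, 1.5, 0.3],
                  [0.1, 0.3, 1.0]])
    x = np.array([1, 0, 1])                  # team selects actions 0 and 2
    print(q_value(L, x))                     # alpha + log det of the 2x2 submatrix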
25. Example: Blocker Task
• Reward
• +1 when an attacker reaches the end zone
• −1 each step
• We use the target positions of the agents as the feature of the state-action pair:
  • \mathbf{x}_t \equiv \psi(a_t) = s_{t+1} \in \{0, 1\}^{21}
26. Determinantal SARSA finds a nearly optimal policy 10 times faster than baseline methods
[Figure: learning curves on the Blocker task; Determinantal SARSA vs. baselines from [Heess+ 2013] — SARSA: free-energy SARSA; NN: neural network; EQNAC: energy-based Q-value natural actor-critic; ENATDAC: energy-based natural TD actor-critic]
27. Quality-similarity decomposition of the kernel
• Value, relevance, quality: q_i = \sqrt{L_{i,i}}
• Similarity: S_{i,j} = \dfrac{L_{i,j}}{\sqrt{L_{i,i}\, L_{j,j}}}

\mathbf{L} = \mathrm{diag}(\mathbf{q})\, \mathbf{S}\, \mathrm{diag}(\mathbf{q})
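A minimal sketch of the decomposition (my code; the toy kernel is illustrative), recovering q and S from L and checking that L = diag(q) S diag(q):

    # Recover quality q and similarity S from a kernel L.
    import numpy as np

    L = np.array([[2.0, 0.5, 0.1],
                  [0.5, 1.5, 0.3],
                  [0.1, 0.3, 1.0]])

    q = np.sqrt(np.diag(L))                 # quality of each action
    S = L / np.outer(q, q)                  # similarity: S_ij = L_ij / (q_i q_j)

    assert np.allclose(L, np.diag(q) @ S @ np.diag(q))  # L = diag(q) S diag(q)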
28. Value of individual actions (next positions) learned by Determinantal SARSA
[Figure: learned values of individual next positions]
29. Similarity between actions (next positions) learned by Determinantal SARSA
[Figure: learned similarities between next positions]
Recall our definition of action similarity:
• similar ⇔ low value when the actions are taken together
• dissimilar ⇔ high value when the actions are taken together
30. Stochastic Policy Task
• 2^N actions
• A +10 reward is obtained when the action matches the state
• Hidden state transitions (the state is only partially observable)
[Figure: example length-5 bit strings 00101 and 01011]
31. Determinantal SARSA finds nearly optimal policies, while baselines suffer from partial observability
[Figure: learning curves for N = 5, N = 6, and N = 8; Determinantal SARSA vs. baselines from [Heess+ 2013] — SARSA: free-energy SARSA; NN: neural network; EQNAC: energy-based Q-value natural actor-critic; ENATDAC: energy-based natural TD actor-critic]
33. Want to sample actions having high value with high probability
• ε-greedy
  • uniformly at random with probability ε
  • the best action (a^\star = \mathrm{argmax}_a Q(s, a)) with probability 1 − ε
• Boltzmann exploration
  • take action a with probability ∝ exp(β Q(s, a))
  • β: inverse temperature
Both are intractable for a large action space: the argmax and the normalizing sum range over exponentially many team-actions.
34. Choosing a diverse team-action with Boltzmann exploration
• Our value function:

Q_\theta(\mathbf{z}_{\le t}, \mathbf{x}_t) = \alpha + \log\det \mathbf{L}_t(\mathbf{x}_t)

• Team-action selected according to Boltzmann exploration:

\pi(\mathbf{x}_t \mid \mathbf{z}_{\le t}) = \frac{\exp\big(\beta Q_\theta(\mathbf{z}_{\le t}, \mathbf{x}_t)\big)}{\sum_{\mathbf{x}} \exp\big(\beta Q_\theta(\mathbf{z}_{\le t}, \mathbf{x})\big)} = \frac{\det\big(\mathbf{L}_t(\mathbf{x}_t)\big)^{\beta}}{\sum_{\mathbf{x}} \det\big(\mathbf{L}_t(\mathbf{x})\big)^{\beta}}
Recall: \mathbf{x}_t \equiv \psi(a_t) \in \{0, 1\}^N are the binary features of the team-action a_t.
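The second equality deserves one spelled-out step (my expansion of the slide's claim): substituting the value function,

\exp\big(\beta Q_\theta(\mathbf{z}_{\le t}, \mathbf{x}_t)\big) = \exp\big(\beta\alpha + \beta \log\det \mathbf{L}_t(\mathbf{x}_t)\big) = e^{\beta\alpha}\, \det\big(\mathbf{L}_t(\mathbf{x}_t)\big)^{\beta},

and the constant factor e^{\beta\alpha} cancels between the numerator and the denominator.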
35. Boltzmann exploration can be performed efficiently when β = 1
\pi(\mathbf{x}_t \mid \mathbf{z}_{\le t}) = \frac{\det \mathbf{L}_t(\mathbf{x}_t)}{\sum_{\mathbf{x}} \det \mathbf{L}_t(\mathbf{x})} = \frac{\det \mathbf{L}_t(\mathbf{x}_t)}{\det(\mathbf{L}_t + \mathbf{I})}

The sum over 2^N terms reduces to the determinant of a single N × N matrix.
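A quick numerical check of this identity for a small N (my code; the toy kernel is illustrative). The empty subset contributes det of the 0 × 0 matrix, i.e. 1:

    # Verify sum_x det(L(x)) = det(L + I) by enumerating all 2^N subsets.
    import numpy as np
    from itertools import product

    N = 3
    L = np.array([[2.0, 0.5, 0.1],
                  [0.5, 1.5, 0.3],
                  [0.1, 0.3, 1.0]])

    total = 0.0
    for x in product([0, 1], repeat=N):     # all 2^N binary vectors
        idx = np.flatnonzero(x)
        total += np.linalg.det(L[np.ix_(idx, idx)]) if idx.size else 1.0

    assert np.isclose(total, np.linalg.det(L + np.eye(N)))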
36. Determinantal point process (DPP) has been studied for modeling and learning diversity
• Probability of selecting a subset \mathcal{X} [Macchi 1975]:

\mathbb{P}(\mathcal{X}) = \frac{\det \mathbf{L}_{\mathcal{X}}}{\sum_{\mathcal{X}'} \det \mathbf{L}_{\mathcal{X}'}} = \frac{\det \mathbf{L}_{\mathcal{X}}}{\det(\mathbf{L} + \mathbf{I})}

• Sampling in O(N k^3) time [Hough+ 2006]
  • N: size of the ground set
  • k: number of items in the sampled subset
[Kulesza & Taskar (2012)]
37. Prior work has applied DPPs in machine learning and artificial intelligence
• Recommendation [Gillenwater+ 2014, Gartrell+ 2017]
• Summarization of documents, videos, … [Gong+ 2014]
• Hyper-parameter optimization [Kathuria+ 2016]
• Mini-batch sampling [Zhang+ 2017]
• Modeling neural spiking [Snoek+ 2013]
38. Nearly exact Boltzmann exploration with MCMC [Kang 2013 (for β = 1)]
1. Initialize \mathbf{x}
2. Repeat:
   • propose \mathbf{x}' (e.g. flip one bit of \mathbf{x}) and accept \mathbf{x} \leftarrow \mathbf{x}' with probability \min\!\left(1, \left(\frac{\det \mathbf{L}_t(\mathbf{x}')}{\det \mathbf{L}_t(\mathbf{x})}\right)^{\beta}\right)
• When \mathbf{x}' and \mathbf{x} differ by one bit, \det \mathbf{L}_t(\mathbf{x}') can be computed from \det \mathbf{L}_t(\mathbf{x}) via rank-one update techniques
• Assume we can recover the team-action as a \leftarrow \psi^{-1}(\mathbf{x})
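A minimal sketch of this sampler (my implementation; for clarity it recomputes each determinant rather than applying the rank-one updates, and the toy kernel is illustrative):

    # Single-bit-flip Metropolis sampler targeting pi(x) ∝ det(L(x))^beta.
    import numpy as np

    rng = np.random.default_rng(0)

    def subset_det(L, x):
        idx = np.flatnonzero(x)
        return np.linalg.det(L[np.ix_(idx, idx)]) if idx.size else 1.0

    def mcmc_boltzmann_sample(L, n_steps=1000, beta=1.0):
        N = L.shape[0]
        x = rng.integers(0, 2, size=N)          # 1. initialize x
        for _ in range(n_steps):                # 2. repeat
            x_new = x.copy()
            x_new[rng.integers(N)] ^= 1         # propose: flip one bit
            ratio = (subset_det(L, x_new) / max(subset_det(L, x), 1e-12)) ** beta
            if rng.random() < min(1.0, ratio):  # Metropolis acceptance
                x = x_new
        return x

    L = np.array([[2.0, 0.5, 0.1],
                  [0.5, 1.5, 0.3],
                  [0.1, 0.3, 1.0]])
    print(mcmc_boltzmann_sample(L))             # an (approximate) Boltzmann sample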
39. Simpler approaches: heuristics for more exploration and exploitation
• For more exploration:
  • uniform with probability ε
  • DPP with probability 1 − ε
• For more exploitation (a sketch follows this list):
  • sample M actions according to the DPP
  • choose the best among the M
• Hard-core point processes [Matérn 1986]
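A sketch of the best-of-M heuristic (my code; it reuses the hypothetical q_value and mcmc_boltzmann_sample helpers from the earlier sketches):

    # Best-of-M: draw M candidates from the (approximate) DPP sampler and
    # keep the one with the highest Q-value.
    import numpy as np  # assumes q_value and mcmc_boltzmann_sample are defined above

    def best_of_m(L, m=8, alpha=0.0):
        candidates = [mcmc_boltzmann_sample(L, n_steps=200) for _ in range(m)]
        return max(candidates,
                   key=lambda x: q_value(L, x, alpha) if x.any() else -np.inf)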
41. Summary

Challenge: exponentially many team-actions; need diversity in team-actions.

Our definition of action similarity:
• similar ⇔ low value when taken together
• dissimilar ⇔ high value when taken together

Our solution:

Q_\theta(\mathbf{z}_{\le t}, \mathbf{x}_t) = \alpha + \log\det \mathbf{L}_t(\mathbf{x}_t)

Efficient learning:

\nabla_{\mathbf{V}(\mathbf{x}_t)} Q_\theta(\mathbf{z}_{\le t}, \mathbf{x}_t) = 2\,\big(\mathbf{V}(\mathbf{x}_t)^{+}\big)^{\top}
\nabla_{\phi} Q_\theta(\mathbf{z}_{\le t}, \mathbf{x}_t) = \mathrm{diag}\big(\mathbf{V}(\mathbf{x}_t)^{+}\,\mathbf{V}(\mathbf{x}_t)\big)\,\nabla_{\phi}\,\mathbf{d}_t(\phi)

Efficient sampling (cf. DPP):

\pi(\mathbf{x}_t \mid \mathbf{z}_{\le t}) = \frac{\exp\big(\beta Q_\theta(\mathbf{z}_{\le t}, \mathbf{x}_t)\big)}{\sum_{\mathbf{x}} \exp\big(\beta Q_\theta(\mathbf{z}_{\le t}, \mathbf{x})\big)} = \frac{\det\big(\mathbf{L}_t(\mathbf{x}_t)\big)^{\beta}}{\sum_{\mathbf{x}} \det\big(\mathbf{L}_t(\mathbf{x})\big)^{\beta}}
42. References
• T. Osogami and R. Raymond, Determinantal reinforcement learning,
AAAI-19
• T. Osogami and R. Raymond, Determinantal SARSA: Toward deep
reinforcement learning for collaborative agents, NIPS 2017 Deep
Reinforcement Learning Symposium
• T. Osogami, R. Raymond, A. Goel, T. Shirai, and T. Maehara, Dynamic
determinantal point processes, AAAI-18
• K. Zhao, T. Morimura, and T. Osogami, Event-based visual analytics for
team-based invasion sports, PacificVis 2018