Multi-Agent Reinforcement Learning (MARL) extends single-agent RL to problems with multiple interacting agents. It is challenging because the environment is non-stationary from each agent's perspective and credit must be assigned across agents. Baseline methods such as Independent Q-Learning treat the other agents as part of the environment. Cooperative methods use centralized critics with decentralized actors. Zero-sum methods have been applied to Go and StarCraft II. General-sum methods such as LOLA account for opponent learning when optimizing strategies. Experience replay and communication protocols help agents learn cooperative behaviors.
2. Contents
● Introduction
○ What is Multi-agent RL?
● Background
○ (Single-agent) Reinforcement Learning
○ Game Theory
● Multi-Agent Reinforcement Learning
○ Why is multi-agent RL hard to train?
○ Baseline
○ Cooperation
○ Zero-Sum
○ General-Sum
● References
3. Introduction
What is Multi-agent RL?
- Reinforcement Learning is a promising way to solve sequential decision-making problems.
source: https://now.sen.go.kr/2016/12/03.php
source: https://deepmind.com/blog/article/Agent57-Outperforming-the-human-Atari-benchmark
4. Introduction
What is Multi-agent RL?
- We can expand it by adding multiple agents to solve more complex problems.
source: https://deepmind.com/blog/article/alphastar-mastering-real-time-strategy-game-starcraft-ii
source: https://www.youtube.com/watch?v=kopoLzvh5jY
5. Introduction
What is Multi-agent RL?
[Figure: a 2x2 map of solution methods by problem size and number of agents: tabular solution methods (e.g. Dynamic Programming; Game Theory for multiple agents) versus approximate solution methods (e.g. Monte Carlo, TD learning).]
6. Reinforcement Learning
- Reinforcement learning is a problem, a class of solution methods that work well on the problem, and
the field that studies this problem and its solution methods.
- Reinforcement learning is learning what to do—how to map situations to actions—so as to maximize
a numerical reward signal. The learner is not told which actions to take, but instead must discover
which actions yield the most reward by trying them. In the most interesting and challenging cases,
actions may affect not only the immediate reward but also the next situation and, through that, all
subsequent rewards.
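The trial-and-error loop described above can be sketched as tabular Q-learning on a toy problem. The 5-state chain environment, constants, and reward placement below are illustrative assumptions, not something from the slides.

```python
import random

# Toy 5-state chain: move left/right, reward only on entering the rightmost
# state. The agent discovers the rewarding actions purely by trying them.
N_STATES = 5
ACTIONS = [0, 1]                               # 0 = left, 1 = right
alpha, gamma, eps = 0.5, 0.9, 0.1
Q = [[0.0, 0.0] for _ in range(N_STATES)]      # Q[state][action]

def step(s, a):
    s2 = max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)
    return s2, (1.0 if s2 == N_STATES - 1 else 0.0)

random.seed(0)
for _ in range(2000):                          # episodes from random starts
    s = random.randrange(N_STATES)
    for _ in range(10):
        if random.random() < eps:              # epsilon-greedy exploration
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda u: Q[s][u])
        s2, r = step(s, a)
        # Q-learning update: move Q[s][a] toward the bootstrapped target,
        # so reward affects not just this step but all earlier ones too.
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2

# The learned greedy policy moves right in every non-terminal state.
policy = [max(ACTIONS, key=lambda u: Q[s][u]) for s in range(N_STATES - 1)]
print(policy)
```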
Background
11. Background
Game Theory
- The study of mathematical models of strategic interaction, where several self-interested
players must make choices that potentially affect the interests of the other players.
- This seminar covers only non-cooperative games with complete information.
12. Background
Game Theory
- Normal-form representation
- A set of players
- The set of all possible strategies for each player i
- A utility function for each player
- Goal
- Each player maximizes their own expected utility (payoff),
- given their beliefs about the other players.
- Assume “All players are rational.”
13. Background
Game Theory
- Strategies (analogous to policies)
- Pure strategies
- Select only a single strategy
- Mixed strategies
- Randomize over the set of available actions according to some
probability distribution
- Beliefs
- A probability distribution over the strategies the other players will choose
14. Game Theory
- Suppose two non-cooperative rational players
Prisoner's Dilemma (payoffs listed as (row, column)):

                    column player
                    A          B
row player   a   (-1,-1)    (-3,0)
             b   (0,-3)     (-2,-2)
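The dilemma can be checked mechanically. A minimal sketch using the payoff matrix above (strategy names a/b and A/B as in the table): defecting strictly dominates, so mutual defection is the unique pure-strategy Nash equilibrium even though (a, A) pays both players more.

```python
# Payoffs (row payoff, column payoff) for the Prisoner's Dilemma above.
payoff = {
    ("a", "A"): (-1, -1), ("a", "B"): (-3, 0),
    ("b", "A"): (0, -3),  ("b", "B"): (-2, -2),
}

# 'b' strictly dominates 'a' for the row player: better against every
# column strategy. By symmetry the same holds for 'B' vs 'A'.
b_dominates = all(payoff[("b", c)][0] > payoff[("a", c)][0] for c in "AB")
B_dominates = all(payoff[(r, "B")][1] > payoff[(r, "A")][1] for r in "ab")
print(b_dominates and B_dominates)   # True
```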
16. Background
Game Theory
- Nash Equilibrium
- If each player has chosen a strategy — an action plan choosing their own actions based on what
has happened so far in the game — and no player can increase their own expected payoff by
changing their strategy while the other players keep theirs unchanged, then the current set of
strategy choices constitutes a Nash equilibrium.
- A strategy profile is a Nash equilibrium if every player's strategy is a best response to the others'
- Mutual best responses
- Rationality + correct beliefs
- Every finite game has at least one Nash equilibrium (possibly in mixed strategies).
17. Game Theory
- Find the Nash equilibria (payoffs listed as (row, column)):

                    column player
                    A         B
row player   a    (5,3)     (1,0)
             b    (0,1)     (2,4)
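"Mutual best responses" can be sketched as a brute-force check over the matrix above; this enumerates the pure-strategy Nash equilibria of the game.

```python
from itertools import product

# Payoffs (row payoff, column payoff) for the game above.
payoff = {
    ("a", "A"): (5, 3), ("a", "B"): (1, 0),
    ("b", "A"): (0, 1), ("b", "B"): (2, 4),
}
rows, cols = ["a", "b"], ["A", "B"]

def is_nash(r, c):
    # Neither player can gain by deviating unilaterally.
    row_best = all(payoff[(r, c)][0] >= payoff[(r2, c)][0] for r2 in rows)
    col_best = all(payoff[(r, c)][1] >= payoff[(r, c2)][1] for c2 in cols)
    return row_best and col_best

equilibria = [(r, c) for r, c in product(rows, cols) if is_nash(r, c)]
print(equilibria)   # this game has two pure-strategy equilibria
```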
18. Background
Game Theory
- Extensive form
- The players of a game
- What each player can do at each of their moves
- The payoffs received by every player for every possible combination of moves
- + What each player knows at every move
- + Every opportunity each player has to move
- Subgame Perfect Equilibrium
- Found by backward induction (analogous to the Bellman equation)
source: https://en.wikipedia.org/wiki/Extensive-form_game
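Backward induction can be sketched as a recursive bottom-up solve: at each node the mover picks the child that maximizes their own payoff, exactly the bottom-up logic of the Bellman equation. The tiny game tree below is a made-up example, not one from the slides.

```python
# Terminal nodes carry a payoff tuple (p1, p2); interior nodes carry the
# index of the player to move and a dict of action -> child node.
def solve(node):
    if "payoff" in node:                       # terminal node
        return node["payoff"], []
    best_payoff, best_path = None, None
    for action, child in node["children"].items():
        payoff, path = solve(child)            # solve subgames first
        if best_payoff is None or payoff[node["player"]] > best_payoff[node["player"]]:
            best_payoff, best_path = payoff, [action] + path
    return best_payoff, best_path

game = {
    "player": 0,  # player 1 moves first
    "children": {
        "L": {"player": 1, "children": {
            "l": {"payoff": (3, 1)}, "r": {"payoff": (0, 2)}}},
        "R": {"payoff": (2, 0)},
    },
}
payoff, path = solve(game)
print(path, payoff)   # player 1 avoids L because player 2 would answer r
```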
19. Background
Game Theory
- A game in normal form and a game in extensive form can carry the same
information.
source: https://en.wikipedia.org/wiki/Extensive-form_game
21. Background
Game Theory
- Common Knowledge
- There is common knowledge of p in a group of agents G when all the agents in G know p, they all know
that they know p, they all know that they all know that they know p, and so on ad infinitum.
- For an event E and two players P1, P2:
- P1 knows E.
- P2 knows E.
- P1 knows P2 knows E.
- P2 knows P1 knows E.
- P1 knows P2 knows (P1 knows E)
- P2 knows P1 knows (P2 knows E)
- ...
22. Background
Game Theory
- Common Knowledge Example
- Three girls are sitting in a circle, each wearing a red or white hat. Each can see the color of all
hats except their own. Now suppose they are all wearing red hats. It is said that if the teacher
announces that at least one of the hats is red, and then sequentially asks each girl if she
knows the color of her hat, the third girl questioned can know her hat color.
Red hat puzzle
23. Background
Game Theory
- Common Knowledge Example
- Each girl A, B, C has an information set.
- The teacher made the announcement and girl A couldn't answer, so RWW can't be the answer.
24. Background
Game Theory
- Common Knowledge Example
- Girl B couldn't answer either, so RRW and WRW can't be the answer.
- Girl C can now answer that her hat color is red.
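The elimination argument above can be run mechanically: a world is a hat assignment (hat_A, hat_B, hat_C), each girl sees the other two hats, and each "I don't know" answer eliminates worlds.

```python
from itertools import product

worlds = set(product("RW", repeat=3))
worlds.discard(("W", "W", "W"))           # teacher: at least one hat is red

def knows(i, worlds, true_world):
    # Girl i knows her color iff every world consistent with what she sees
    # (the other two hats) agrees on her own hat.
    seen = [h for j, h in enumerate(true_world) if j != i]
    own = {w[i] for w in worlds
           if [h for j, h in enumerate(w) if j != i] == seen}
    return len(own) == 1

def after_no_answer(i, worlds):
    # "I don't know" is informative: drop every world where girl i would know.
    return {w for w in worlds if not knows(i, worlds, w)}

worlds = after_no_answer(0, worlds)       # A doesn't know: RWW eliminated
worlds = after_no_answer(1, worlds)       # B doesn't know: RRW, WRW eliminated
print(knows(2, worlds, ("R", "R", "R")))  # C can now name her color: True
```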
26. Background
Game Theory
- Finitely Repeated Iterations
- A non-equilibrium strategy of the stage game can become an equilibrium when the stage
game has more than one Nash equilibrium, because punishment reduces the incentive to deviate.
- Infinitely Repeated Iterations
- With a discount factor, player i's payoff diminishes over time.
- The preferred strategy can then be not to play a Nash strategy of the stage game, but
to cooperate and play a socially optimal strategy.
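The discounting logic can be checked numerically for the Prisoner's Dilemma payoffs from the earlier slide (both cooperate: -1 per round; both defect: -2; one-shot defection temptation: 0). The grim-trigger opponent (cooperates until the first defection, then defects forever) is an illustrative assumption.

```python
# Compare cooperating forever with defecting once and being punished
# forever after, as a function of the discount factor d.
def cooperate_forever(d):
    return -1.0 / (1.0 - d)               # -1 in every round

def defect_once(d):
    return 0.0 + d * -2.0 / (1.0 - d)     # 0 now, -2 in every later round

# Cooperation becomes sustainable once players are patient enough (d >= 1/2).
results = {d: cooperate_forever(d) >= defect_once(d) for d in (0.3, 0.5, 0.9)}
print(results)   # {0.3: False, 0.5: True, 0.9: True}
```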
27. Why is multi-agent RL hard to train?
- Credit Assignment Problem
- One of MARL's biggest challenges. In cooperative settings the environment emits a single
global scalar reward, so more work is needed to infer which agent contributed to it than in
the single-agent case.
- Environment
- Non-stationary
- Because every agent is learning at once, the environment appears non-stationary from
each individual agent's perspective.
- Interaction limitation
- How the agents communicate with each other.
Multi-Agent Reinforcement Learning
28. Why is multi-agent RL hard to train?
- Goal setting
- Cooperation
- zero-sum
- General sum
- Agents need to learn to reciprocate
30. Setting
- Centralized Training, Decentralized Execution
- During centralized training the agents receive additional (global) information on top of
their local information; at execution time each agent uses only local information.
- Recurrent Network to deal with POMDPs
- In a POMDP the agent needs to infer the underlying state, so it encodes the history of
previous observations and actions.
- Deep Recurrent Q-Learning for Partially Observable MDPs
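Why encoding history helps can be shown on a tiny aliasing example: a cue ('L' or 'R', equally likely) is shown at t=0, and at t=1 the agent stands at an aliased 'junction' observation and is rewarded for choosing the side matching the cue. This toy setup is illustrative, not from the paper.

```python
cues = ["L", "R"]

def reward(cue, action):
    return 1.0 if action == cue else 0.0

# A history-conditioned policy (what a recurrent network can represent)
# remembers the cue and matches it.
history_return = sum(reward(cue, cue) for cue in cues) / len(cues)

# A memoryless policy sees only 'junction' and must commit to one action.
memoryless_return = max(
    sum(reward(cue, a) for cue in cues) / len(cues) for a in cues
)
print(history_return, memoryless_return)   # 1.0 0.5
```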
31. Baseline
- Independent Q-Learning (IQL)
- Multiagent Cooperation and Competition with Deep Reinforcement Learning (2015)
- Each agent independently learns its own Q-network (demonstrated on Pong).
- The other agent is treated as part of the environment.
- Independent Actor-Critic (IAC) follows the same idea.
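IQL in miniature: two agents, each with its own Q-values, each treating the other as part of the environment. The stateless coordination game below (shared reward 1 iff the actions match) is a toy stand-in for Pong.

```python
import random

random.seed(1)
alpha, eps = 0.1, 0.2
Q = [[0.0, 0.0], [0.0, 0.0]]     # Q[agent][action], single (dummy) state

def act(q):
    if random.random() < eps:    # epsilon-greedy exploration
        return random.randrange(2)
    return max((0, 1), key=lambda a: q[a])

for _ in range(3000):
    a0, a1 = act(Q[0]), act(Q[1])
    r = 1.0 if a0 == a1 else 0.0
    # Each agent updates only its own table from its own action; the other
    # agent's learning just looks like a changing environment.
    Q[0][a0] += alpha * (r - Q[0][a0])
    Q[1][a1] += alpha * (r - Q[1][a1])

coordinated = Q[0].index(max(Q[0])) == Q[1].index(max(Q[1]))
print(coordinated)               # the two greedy policies settle on one convention
```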
source: Multiagent Cooperation and Competition with Deep Reinforcement Learning (2015)
32. Baseline
- Independent Q-Learning (IQL)
- Multiagent Cooperation and Competition with Deep Reinforcement Learning(2015)
source: Multiagent Cooperation and Competition with Deep Reinforcement Learning (2015) [figures: the cooperation and competition settings]
37. Cooperation
- QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning (2018)
- Value Decomposition Networks (VDN): Q_tot is the sum of the per-agent Qs
- QMIX: Q_tot is a monotonic mixing of the per-agent Qs
source: QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning (2018)
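The monotonic mixing can be sketched directly: per-agent utilities Q_i are combined by a small mixing network whose weights are forced non-negative via abs() (in the paper they come from a hypernetwork conditioned on the state; random stand-ins here and arbitrary dimensions), so dQ_tot/dQ_i >= 0 and each agent maximising its own Q also maximises Q_tot.

```python
import random

random.seed(0)
N_AGENTS, HIDDEN = 3, 4
W1 = [[abs(random.gauss(0, 1)) for _ in range(HIDDEN)] for _ in range(N_AGENTS)]
b1 = [random.gauss(0, 1) for _ in range(HIDDEN)]
w2 = [abs(random.gauss(0, 1)) for _ in range(HIDDEN)]
b2 = random.gauss(0, 1)

def q_tot(q):
    # one ReLU hidden layer; non-negative weights keep the mix monotone
    h = [max(sum(q[i] * W1[i][j] for i in range(N_AGENTS)) + b1[j], 0.0)
         for j in range(HIDDEN)]
    return sum(h[j] * w2[j] for j in range(HIDDEN)) + b2

q = [random.gauss(0, 1) for _ in range(N_AGENTS)]
monotone = all(
    q_tot([qi + (1.0 if i == k else 0.0) for i, qi in enumerate(q)]) >= q_tot(q)
    for k in range(N_AGENTS)
)
print(monotone)   # True: raising any single agent's Q never lowers Q_tot
```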
38. Cooperation
- QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning (2018)
source: QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning (2018)
39. Cooperation
- Multi-Agent Common Knowledge Reinforcement Learning(2018)
- Use Common Knowledge and hierarchically control agents.
- Dec-POMDP
- Decentralized Partially Observable Markov Decision Processes
- State is composed of a number of entities.
- In state s, a binary mask determines the set of entities that agent a can see.
- Every group member (agent) computes the common knowledge independently, using prior
knowledge and the commonly known trajectory (a shared random seed is also common knowledge).
41. Cooperation
- Multi-Agent Common Knowledge Reinforcement Learning(2018)
source: Multi-Agent Common Knowledge Reinforcement Learning (2018)
42. Cooperation
- Multi-Agent Common Knowledge Reinforcement Learning(2018)
source: Multi-Agent Common Knowledge Reinforcement Learning (2018)
43. Cooperation
- Multi-Agent Common Knowledge Reinforcement Learning(2018)
- Central-V
source: Multi-Agent Common Knowledge Reinforcement Learning (2018)
44. Cooperation
- Stabilising Experience Replay for Deep Multi-Agent Reinforcement Learning (2018)
- A replay buffer is introduced to improve data efficiency, but it implicitly assumes the
environment behaves the same as it did at the time step when the data was recorded.
- If we could use the true state information, the Bellman equation could be formulated:
- Record data together with the time of collection
- Calculate an importance-weighted loss, weighting each sample by the ratio of the other
agents' current policy probability to their policy probability at collection time
source: Stabilising Experience Replay for Deep Multi-Agent Reinforcement Learning (2018)
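The importance-weighting idea in miniature: replayed transitions were generated under the other agents' old policies, so each sample's TD loss is reweighted by (probability of the others' joint action under their current policies) / (probability at collection time). The numbers below are illustrative, not from the paper.

```python
samples = [
    # (td_error, others' action prob now, prob at collection time)
    (0.5, 0.20, 0.40),   # others' behaviour became less likely -> downweight
    (1.0, 0.30, 0.10),   # others' behaviour became more likely -> upweight
]
loss = sum((p_now / p_then) * td ** 2 for td, p_now, p_then in samples) / len(samples)
print(loss)   # ~1.5625: stale samples contribute less, fresh-looking ones more
```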
45. Cooperation
- Stabilising Experience Replay for Deep Multi-Agent Reinforcement Learning (2018)
- But it can't be! (all agents are in a partially observable environment)
- So a new game is defined, specified by
- an augmented state (the action-observation history is added) and a reward function.
source: Stabilising Experience Replay for Deep Multi-Agent Reinforcement Learning (2018)
46. Cooperation
- Stabilising Experience Replay for Deep Multi-Agent Reinforcement Learning (2018)
- In the partially observable setting the Q-function can only be updated approximately
(intractable!)
source: Stabilising Experience Replay for Deep Multi-Agent Reinforcement Learning (2018)
47. Cooperation
- Stabilising Experience Replay for Deep Multi-Agent Reinforcement Learning (2018)
- Importance sampling is only an approximation, and its variance is hard to control.
- Instead, use the idea of Hyper Q-learning!
- Feed the other agents' policies into the observation.
- Hard to scale -> use a fingerprint! (e.g. training iteration number,
exploration rate)
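The fingerprint trick in miniature: instead of conditioning on the other agents' full policy parameters (huge and impractical), append a low-dimensional fingerprint that tracks where the others are in training. The function name `augment` and the values below are hypothetical.

```python
def augment(obs, train_iteration, epsilon):
    # obs: the agent's local observation (a plain list in this sketch);
    # the fingerprint disambiguates *when* in training a sample was collected.
    return list(obs) + [float(train_iteration), float(epsilon)]

fingerprinted = augment([0.1, -0.3, 0.7], train_iteration=5000, epsilon=0.05)
print(fingerprinted)   # [0.1, -0.3, 0.7, 5000.0, 0.05]
```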
48. Cooperation
- Stabilising Experience Replay for Deep Multi-Agent Reinforcement
Learning(2018)
source: Stabilising Experience Replay for Deep Multi-Agent Reinforcement Learning (2018)
49. Cooperation
- Learning to Communicate with Deep Multi-Agent Reinforcement Learning (2016)
- RIAL
- Action: U + M
- environment action U
- message M
- Action selection: ε-greedy
- No experience replay
- Parameter sharing
source: Learning to Communicate with Deep Multi-Agent Reinforcement Learning (2016)
50. Cooperation
- Learning to Communicate with Deep Multi-Agent Reinforcement Learning (2016)
- DIAL
- Action: U + M
- environment action U
- message M
- C-Net
- Q network
- message network
- DRU (discretise/regularise unit)
- During training, noise is added to the message before it passes through a
sigmoid; at execution the message is discretised.
- Action selection: ε-greedy
- No experience replay
- Parameter sharing
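A sketch of the DRU: during centralised training the real-valued message gets Gaussian noise and a sigmoid, so gradients can flow through the channel; at execution time the message is discretised to a hard bit. The noise level and inputs below are illustrative.

```python
import math
import random

def dru(m, training, sigma=2.0):
    if training:
        # noisy, differentiable channel for centralised learning
        return 1.0 / (1.0 + math.exp(-(m + random.gauss(0.0, sigma))))
    return 1.0 if m > 0 else 0.0   # hard 1-bit message at execution time

random.seed(0)
soft = dru(1.5, training=True)     # noisy value strictly inside (0, 1)
hard = dru(1.5, training=False)    # 1.0
print(soft, hard)
```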
source: Learning to Communicate with Deep Multi-Agent Reinforcement Learning (2016)
51. Cooperation
- Learning to Communicate with Deep Multi-Agent Reinforcement Learning (2016)
- DIAL
source: Learning to Communicate with Deep Multi-Agent Reinforcement Learning (2016)
52. Zero-Sum
- Mastering the game of Go with deep neural networks and tree search(2016)
vs
- Grandmaster level in StarCraft II using multi-agent reinforcement learning (2019)
- League
- Main Agents
- Main exploiter agents
- League exploiter agents
- Prioritized fictitious self-play
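PFSP can be sketched as opponent sampling weighted by difficulty: past opponents are sampled with probability proportional to a weighting of the learner's win rate against them, so hard opponents are faced more often. The league members, win rates, and the particular weighting f(p) = (1 - p)**2 below are illustrative stand-ins.

```python
import random

win_rate = {"main_0": 0.9, "exploiter_1": 0.5, "league_2": 0.2}

def pfsp_probs(win_rate, f=lambda p: (1.0 - p) ** 2):
    # weight each opponent by how hard it is, then normalise
    w = {k: f(p) for k, p in win_rate.items()}
    total = sum(w.values())
    return {k: v / total for k, v in w.items()}

probs = pfsp_probs(win_rate)
print(probs)   # league_2 (lowest win rate) gets the largest share
random.seed(0)
opponent = random.choices(list(probs), weights=list(probs.values()), k=1)[0]
```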
53. General-Sum
- Learning with Opponent-Learning Awareness(2018)
- Suppose there are 2 players, with policy parameters θ1 and θ2.
- If we could access all parameter values, we could iteratively calculate the updates exactly.
- Instead, with step size δ, naive learner 1's parameter update rule is: θ1 <- θ1 + δ · ∇θ1 V1(θ1, θ2)
54. General-Sum
- Learning with Opponent-Learning Awareness(2018)
- Unlike the naive learner (NL), a LOLA learner learns to optimize (with respect to player 1):
V1(θ1, θ2 + Δθ2), accounting for the opponent's learning step Δθ2
- Assuming Δθ2 is small, a first-order Taylor expansion results in:
V1(θ1, θ2 + Δθ2) ≈ V1(θ1, θ2) + (Δθ2)ᵀ ∇θ2 V1(θ1, θ2)
- Substituting the opponent's naive learning step Δθ2 = δ · ∇θ2 V2(θ1, θ2) gives the LOLA update rule
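The effect of the correction term can be shown on a toy bilinear game V1 = θ1·θ2, V2 = -θ1·θ2, where the gradients are analytic so no autodiff is needed: naive simultaneous gradient ascent spirals outward, while the LOLA first-order correction makes the dynamics contract. The game and step size are illustrative choices, not from the paper.

```python
import math

eta = 0.1   # step size (the slides' delta)

def run(lola, steps=200):
    t1, t2 = 1.0, 1.0
    for _ in range(steps):
        g1, g2 = t2, -t1                       # naive gradients of V1, V2
        if lola:
            # d/dt1 [ eta * (dV2/dt2)(dV1/dt2) ] = d/dt1 [ -eta*t1^2 ]
            g1 += -2.0 * eta * t1
            # d/dt2 [ eta * (dV1/dt1)(dV2/dt1) ] = d/dt2 [ -eta*t2^2 ]
            g2 += -2.0 * eta * t2
        t1, t2 = t1 + eta * g1, t2 + eta * g2  # simultaneous updates
    return math.hypot(t1, t2)

print(run(lola=False))   # naive learners drift away from the origin
print(run(lola=True))    # LOLA learners converge toward the origin
```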
58. General-Sum
- Learning with Opponent-Learning Awareness(2018)
- LOLA learning via policy gradient :
- Tit-for-tat strategy
source: Learning with Opponent-Learning Awareness (2018)
59. General-Sum
- Learning with Opponent-Learning Awareness(2018)
- LOLA learning via policy gradient :
Naive Learner VS LOLA
source: Learning with Opponent-Learning Awareness (2018)
60. General-Sum
- Learning with Opponent-Learning Awareness(2018)
- LOLA learning via policy gradient :
source: Learning with Opponent-Learning Awareness (2018)
61. Reference
1. Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction. MIT press.
2. Wikipedia contributors. (2021, July 17). Markov decision process. In Wikipedia, The Free
Encyclopedia. Retrieved 05:59, August 9, 2021, from
https://en.wikipedia.org/w/index.php?title=Markov_decision_process&oldid=1034067020
3. Zhu, H., Nel, A., & Ferreira, H. (2015). Competitive Spectrum Pricing under Centralized Dynamic
Spectrum Allocation. Advances in Wireless Technologies and Telecommunication, 884–908.
https://doi.org/10.4018/978-1-4666-6571-2.ch034
4. Bonanno, G. (2018). Game Theory: Volume 1: Basic Concepts (2nd ed.). CreateSpace
Independent Publishing Platform.
5. Wikipedia contributors. (2021, March 2). Extensive-form game. In Wikipedia, The Free
Encyclopedia. Retrieved 06:09, August 9, 2021, from
https://en.wikipedia.org/w/index.php?title=Extensive-form_game&oldid=1009744715
62. 6. Wikipedia contributors. (2021, March 2). Extensive-form game. In Wikipedia, The Free
Encyclopedia. Retrieved 06:10, August 9, 2021, from
https://en.wikipedia.org/w/index.php?title=Extensive-form_game&oldid=1009744715
7. Wikipedia contributors. (2021, July 8). Common knowledge (logic). In Wikipedia, The Free
Encyclopedia. Retrieved 06:11, August 9, 2021, from
https://en.wikipedia.org/w/index.php?title=Common_knowledge_(logic)&oldid=1032661454
8. Wikipedia contributors. (2021, March 2). Repeated game. In Wikipedia, The Free Encyclopedia.
Retrieved 06:11, August 9, 2021, from
https://en.wikipedia.org/w/index.php?title=Repeated_game&oldid=1009754520
9. Foerster, J. N. (2018). Deep multi-agent reinforcement learning [PhD thesis]. University of Oxford
63. 10. Tampuu, A., Matiisen, T., Kodelja, D., Kuzovkin, I., Korjus, K., Aru, J., Aru, J., & Vicente, R. (2017).
Multiagent cooperation and competition with deep reinforcement learning. PLOS ONE, 12(4),
e0172395. https://doi.org/10.1371/journal.pone.0172395
11. Jakob Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, Shimon Whiteson,
(2018). Counterfactual Multi-Agent Policy Gradients, AAAI Conference on Artificial Intelligence
12. Rashid, T., Samvelyan, M., de Witt, C. S., Farquhar, G., Foerster, J. N., & Whiteson, S. (2018).
QMIX - monotonic value function factorisation for deep multi-agent reinforcement learning. In
International conference on machine learning.
13. Christian A. Schroeder de Witt, Jakob N. Foerster, Gregory Farquhar, Philip H. S. Torr, Wendelin
Boehmer, and Shimon Whiteson(2018). Multi-Agent Common Knowledge Reinforcement Learning.
arXiv:1810.11702 [cs].
64. 14. Foerster, J., Nardelli, N., Farquhar, G., Afouras, T., Torr, P.H.S., Kohli, P. & Whiteson, S.. (2017).
Stabilising Experience Replay for Deep Multi-Agent Reinforcement Learning. Proceedings of the 34th
International Conference on Machine Learning, in Proceedings of Machine Learning Research
70:1146-1155 Available from http://proceedings.mlr.press/v70/foerster17b.html
15. J. N. Foerster, Y. M. Assael, N. de Freitas, and S. Whiteson(2016). Learning to communicate with
deep multi-agent reinforcement learning. CoRR, abs/1605.06676,
16.Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J.,
Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N.,
Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., & Hassabis, D. (2016).
Mastering the game of go with deep neural networks and tree search. Nature, 529(7587), 484–489.
https://doi.org/10.1038/nature16961
17. J. N. Foerster et al.(2017), Learning with opponent-learning awareness. arXiv:1709.04326 [cs.AI]
65. 18. Chung-san, R. (2016, December 3). How should our education move forward in the 'AlphaGo era'?
Seoul Metropolitan Office of Education. https://now.sen.go.kr/2016/12/03.php
19. DeepMind. (2020, May 31). Agent57: Outperforming the human Atari benchmark.
https://deepmind.com/blog/article/Agent57-Outperforming-the-human-Atari-benchmark
20. AlphaStar: Mastering the Real-Time Strategy Game StarCraft II. (2019, January 24). DeepMind.
https://deepmind.com/blog/article/alphastar-mastering-real-time-strategy-game-starcraft-ii
21. Multi-Agent Hide and Seek. (2019, September 17). [Video]. YouTube.
https://www.youtube.com/watch?v=kopoLzvh5jY
22. Tayagkrischelle, T. (2014, September 13). game theorA6 [Slides]. Slideshare.
https://www.slideshare.net/tayagkrischelle/game-theora6
23. Lanctot, M. [Laber Labs]. (2020, May 16). Multi-agent Reinforcement Learning - Laber Labs
Workshop [Video]. YouTube. https://www.youtube.com/watch?v=rbZBBTLH32o