Multi-Agent Reinforcement Learning
Seolho Kim
Contents
● Introduction
○ What is Multi-agent RL?
● Background
○ (Single agent)Reinforcement Learning
○ Game Theory
● Multi-Agent Reinforcement Learning
○ Why is multi-agent RL hard to train?
○ Baseline
○ Cooperation
○ Zero-Sum
○ General-Sum
● References
Introduction
What is Multi-agent RL?
- Reinforcement Learning is a promising way to solve sequential decision-making
problems.
source : https://now.sen.go.kr/2016/12/03.php
source : https://deepmind.com/blog/article/Agent57-Outperforming-the-human-Atari-benchmark
Introduction
What is Multi-agent RL?
- We can expand it by adding multiple agents to solve more complex problems.
source : https://deepmind.com/blog/article/alphastar-mastering-real-time-strategy-game-starcraft-ii
source : https://www.youtube.com/watch?v=kopoLzvh5jY
Introduction
What is Multi-agent RL?
[Figure: a 2x2 chart over the axes "Problem size" and "Number of Agents", contrasting tabular solution methods (e.g. Dynamic Programming, Game Theory) with approximate solution methods (e.g. Monte Carlo, TD learning, and their multi-agent counterparts) as the problem size and the number of agents grow.]
Reinforcement Learning
- Reinforcement learning is a problem, a class of solution methods that work well on the problem, and
the field that studies this problem and its solution methods.
- Reinforcement learning is learning what to do—how to map situations to actions—so as to maximize
a numerical reward signal. The learner is not told which actions to take, but instead must discover
which actions yield the most reward by trying them. In the most interesting and challenging cases,
actions may affect not only the immediate reward but also the next situation and, through that, all
subsequent rewards.
Background
Reinforcement Learning
source : Sutton, Reinforcement learning: An introduction
Background
Background
Reinforcement Learning
- RL framed as an infinite-horizon discounted Markov Decision Process (MDP)
- (infinite horizon) MDP
- Find a policy that maximizes the expected discounted return
source :
https://en.wikipedia.org/wiki/Markov_decision_process
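Spelled out, the objects referenced above are the standard ones; a reconstruction in Sutton & Barto's notation (the slide's own equations are images and not reproduced in the text):

```latex
% Infinite-horizon discounted MDP: states S, actions A, transitions P, reward r, discount gamma
\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, r, \gamma), \qquad 0 \le \gamma < 1

% Goal: find a policy maximizing the expected discounted return
\pi^{*} = \arg\max_{\pi} \; \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t) \right]
```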
Background
Reinforcement Learning
- Value function
- Action value function
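The two quantities named above have the standard definitions below (reconstructed here, since the slide shows them only as images):

```latex
V^{\pi}(s)    = \mathbb{E}_{\pi}\!\left[ \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1} \;\middle|\; s_t = s \right]

Q^{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[ \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1} \;\middle|\; s_t = s,\; a_t = a \right]
```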
Background
Reinforcement Learning
- value based
- policy based
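As a reminder of what the two families optimize, textbook forms (not taken from the slide) are the Q-learning update for value-based methods and the policy gradient for policy-based methods:

```latex
% Value-based: Q-learning update
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]

% Policy-based: policy gradient
\nabla_{\theta} J(\theta) = \mathbb{E}_{\pi_{\theta}}\!\left[ \nabla_{\theta} \log \pi_{\theta}(a \mid s)\, Q^{\pi_{\theta}}(s, a) \right]
```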
Background
Game Theory
- The study of mathematical models of strategic interaction, in which several
self-interested players must make choices that potentially affect the interests
of the other players.
- This seminar only covers non-cooperative games with complete information.
Background
Game Theory
- Normal form representation
- A set of players
- All possible strategies for player i
- Utility function for each player
- Goal
- maximizing their own expected utility (payoff)
- given their beliefs about the other players.
- Assume “All players are rational.”
Background
Game Theory
- strategies (like policies)
- pure strategies
- Select a single strategy deterministically
- mixed strategies
- Randomize over the set of available actions according to some
probability distribution
- beliefs
Game Theory
- suppose two non-cooperative, rational players
Background
Prisoner's dilemma (payoffs: row player, column player)

                     column player
                     A           B
  row player   a    (-1, -1)    (-3,  0)
               b    ( 0, -3)    (-2, -2)
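A minimal sketch of how the matrix above can be checked programmatically (NumPy only; the variable names and the best-response check are illustrative, not part of the slides):

```python
import numpy as np

# Rows index the row player's actions (a, b); columns index the column player's (A, B).
row_payoff = np.array([[-1, -3],
                       [ 0, -2]])
col_payoff = np.array([[-1,  0],
                       [-3, -2]])

# Best response of the row player to each column action, and vice versa.
for col_action in range(2):
    br = int(np.argmax(row_payoff[:, col_action]))
    print(f"row player's best response to column action {col_action}: {br}")
for row_action in range(2):
    br = int(np.argmax(col_payoff[row_action, :]))
    print(f"column player's best response to row action {row_action}: {br}")

# Every best response points to (b, B): mutual defection, the dilemma's equilibrium,
# even though (a, A) would give both players a higher payoff.
```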
Game Theory
- Best Response
Background
Background
Game Theory
- Nash Equilibrium
- If each player has chosen a strategy — an action plan choosing their own actions based on what
has happened so far in the game — and no player can increase their own expected payoff by
changing their strategy while the other players keep theirs unchanged, then the current set of
strategy choices constitutes a Nash equilibrium.
- A strategy profile is a Nash equilibrium if
- Mutual best responses
- Rationality + Correct beliefs
- Every finite game has at least one Nash equilibrium (possibly in mixed strategies).
Game Theory
- Find the Nash equilibria
Background
                     column player
                     A          B
  row player   a    (5, 3)     (1, 0)
               b    (0, 1)     (2, 4)
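A small sketch for the exercise above (assuming the task is to find the pure-strategy Nash equilibria of this 2x2 game): enumerate all action profiles and keep the ones that are mutual best responses.

```python
import numpy as np

row_payoff = np.array([[5, 1],
                       [0, 2]])
col_payoff = np.array([[3, 0],
                       [1, 4]])

equilibria = []
for r in range(2):
    for c in range(2):
        row_ok = row_payoff[r, c] >= row_payoff[:, c].max()  # row player cannot improve
        col_ok = col_payoff[r, c] >= col_payoff[r, :].max()  # column player cannot improve
        if row_ok and col_ok:
            equilibria.append((r, c))

print(equilibria)  # [(0, 0), (1, 1)] -> both (a, A) and (b, B) are pure Nash equilibria
```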
Background
Game Theory
- Extensive form
- The players of a game
- What each player can do at each of their moves
- The payoffs received by every player for every possible combination of moves
- + What each player knows for every move
- + For every player every opportunity they have to move
- Subgame Perfect Equilibrium
- Backward Induction(Bellman Equation)
source :
https://en.wikipedia.org/wiki/Extensive-form_game
Background
Game Theory
- A game in normal form and a game in extensive form can carry the same
information.
source :
https://en.wikipedia.org/wiki/Extensive-form_game
Background
Game Theory
- We can use a value function at each node
Background
Game Theory
- Common Knowledge
- There is common knowledge of p in a group of agents G when all the agents in G know p, they all know
that they know p, they all know that they all know that they know p, and so on ad infinitum.
- Event E, each player P1,P2.
- P1 knows E.
- P2 knows E.
- P1 knows P2 knows E.
- P2 knows P1 knows E.
- P1 knows P2 knows (P1 knows E)
- P2 knows P1 knows (P2 knows E)
- ...
Background
Game Theory
- Common Knowledge Example
- Three girls are sitting in a circle, each wearing a red or white hat. Each can see the color of all
hats except their own. Now suppose they are all wearing red hats. It is said that if the teacher
announces that at least one of the hats is red, and then sequentially asks each girl if she
knows the color of her hat, the third girl questioned can know her hat color.
Red hat puzzle
Background
Game Theory
- Common Knowledge Example
- Each girl A,B,C has an information set.
- The teacher announced and girl A didn't answer, so RWW can't be the answer.
Background
Game Theory
- Common Knowledge Example
- Girl B didn't answer either, so RRW and WRW can't be the answer.
- Girl C can therefore answer that her hat color is red.
Game Theory
- Repeated Iterations
Background
[Figure: extensive-form game tree of the stage game repeated for 2 iterations]
Background
Game Theory
- Finitely Repeated Iterations
- A non-equilibrium strategy of the stage game can become an equilibrium of the repeated game,
if the stage game has more than one Nash equilibrium, because punishment strategies reduce the incentive to deviate.
- Infinitely Repeated Iterations
- With a discount factor, player i's future payoffs are weighted down over time.
- The preferred strategy may then not be to play a Nash strategy of the stage game, but
to cooperate and play a socially optimal strategy.
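The discounted payoff referred to above has the usual form (a standard expression, not copied from the slide), with discount factor δ and stage-game payoff u_i at iteration t:

```latex
U_i = \sum_{t=0}^{\infty} \delta^{t}\, u_i(s^{t}), \qquad 0 \le \delta < 1
```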
Why is multi-agent RL hard to train?
- Credit Assignment Problem
- One of MARL's biggest challenges is the credit assignment problem. In cooperative settings the
environment gives a single global scalar reward, so inferring which agent contributed to it
requires more care than in the single-agent case.
- Environment
- non-stationarity
- Because each agent keeps learning, the environment appears non-stationary from every
other agent's perspective.
- Interaction limitation
- How the agents communicate with each other.
Multi-Agent Reinforcement Learning
Why is multi-agent RL hard to train?
- Goal setting
- Cooperation
- Zero-sum
- General-sum
- Need to learn to reciprocate
Multi-Agent Reinforcement Learning
Setting
Multi-Agent Reinforcement Learning
source :
Foerster. Multi Agent Reinforcement learning(2019)
Setting
- Centralized Training Decentralized Execution
- During centralized training, each agent receives additional global information on top of its
local information; at execution time it uses only local information.
- Recurrent Network to deal with POMDP
- In a POMDP, the agent needs to infer the state, so it encodes its previous history
information.
- Deep Recurrent Q-Learning for Partially Observable MDPs
Multi-Agent Reinforcement Learning
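A minimal recurrent Q-network sketch in PyTorch (an illustration of the "recurrent network for POMDPs" point, not the DRQN paper's exact architecture; all layer sizes are placeholders): a GRU carries the action-observation history in its hidden state, and a linear head outputs Q-values.

```python
import torch
import torch.nn as nn

class RecurrentQNet(nn.Module):
    def __init__(self, obs_dim: int, n_actions: int, hidden_dim: int = 64):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, hidden_dim)
        self.gru = nn.GRUCell(hidden_dim, hidden_dim)
        self.q_head = nn.Linear(hidden_dim, n_actions)

    def forward(self, obs, hidden):
        x = torch.relu(self.encoder(obs))
        hidden = self.gru(x, hidden)          # history is carried through the hidden state
        return self.q_head(hidden), hidden

net = RecurrentQNet(obs_dim=10, n_actions=4)
h = torch.zeros(1, 64)                        # initial hidden state
q_values, h = net(torch.randn(1, 10), h)      # one step; reuse h at the next step
```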
Baseline
- Independent Q Learning(IQL)
- Multiagent Cooperation and Competition with
Deep Reinforcement Learning(2015)
- Each agent independently learns its own Q-network on Pong.
- The other agent is treated as part of the environment.
- Independent Actor-Critic (IAC) follows the same idea.
source :
Multiagent Cooperation and Competition with
Deep Reinforcement Learning(2015)
Multi-Agent Reinforcement Learning
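A toy independent Q-learning sketch (an illustration of the idea, not the paper's DQN-on-Pong setup; the tabular setting and hyperparameters are assumptions): each agent keeps its own Q-table and runs an ordinary Q-learning update on its own experience, treating the other agents simply as part of the environment.

```python
import numpy as np

n_agents, n_states, n_actions = 2, 5, 3
alpha, gamma, eps = 0.1, 0.99, 0.1
Q = [np.zeros((n_states, n_actions)) for _ in range(n_agents)]  # one table per agent

def select_action(agent, state, rng):
    if rng.random() < eps:
        return int(rng.integers(n_actions))     # explore
    return int(np.argmax(Q[agent][state]))      # exploit

def iql_update(agent, s, a, r, s_next):
    # Standard single-agent TD target; the other agents are invisible to this update.
    td_target = r + gamma * Q[agent][s_next].max()
    Q[agent][s, a] += alpha * (td_target - Q[agent][s, a])

rng = np.random.default_rng(0)
a = select_action(agent=0, state=2, rng=rng)
iql_update(agent=0, s=2, a=a, r=1.0, s_next=3)
```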
Baseline
- Independent Q Learning(IQL)
- Multiagent Cooperation and Competition with Deep Reinforcement Learning(2015)
source :
Multiagent Cooperation and Competition with
Deep Reinforcement Learning(2015)
Cooperation
Competition
Multi-Agent Reinforcement Learning
Cooperation
- Counterfactual Multi-Agent Policy
Gradients(2017)(COMA)
- Centralized critic, parameter-sharing actors.
- each actor gradient
source :
Counterfactual multi-agent policy gradients(2017)
Multi-Agent Reinforcement Learning
Cooperation
- Counterfactual Multi-Agent(COMA)
- Credit Assignment Problem
- Shaped reward
- Using a default action c? No
- Advantage function (the counterfactual advantage is written out below)
- Iterating to obtain every action value? No
source :
Counterfactual multi-agent policy gradients(2017)
Multi-Agent Reinforcement Learning
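The counterfactual advantage behind the "No" answers above, as I read the COMA paper (the slide equations are images): the centralized critic marginalizes agent a's own action out under its policy, so no default action and no extra rollouts are needed.

```latex
A^{a}(s, \mathbf{u}) = Q(s, \mathbf{u})
  - \sum_{u'^{a}} \pi^{a}\!\left(u'^{a} \mid \tau^{a}\right)
    Q\!\left(s, \left(\mathbf{u}^{-a}, u'^{a}\right)\right)

% Each actor then follows the gradient
g = \mathbb{E}_{\pi}\!\left[ \sum_{a} \nabla_{\theta} \log \pi^{a}\!\left(u^{a} \mid \tau^{a}\right) A^{a}(s, \mathbf{u}) \right]
```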
Cooperation
- Counterfactual Multi-Agent(COMA)
- Algorithm
source :
Counterfactual multi-agent policy gradients(2017)
Multi-Agent Reinforcement Learning
Cooperation
- Counterfactual Multi-Agent(COMA)
source :
Counterfactual multi-agent policy gradients(2017)
Multi-Agent Reinforcement Learning
Cooperation
- QMIX: Monotonic Value Function
Factorisation for Deep Multi-Agent
Reinforcement Learning(2018)
- Value Decomposition Networks (VDN)
- Sum of per-agent Q values
- QMIX
- QMIX source :
QMIX: Monotonic Value Function Factorisation for
Deep Multi-Agent Reinforcement Learning(2018)
Multi-Agent Reinforcement Learning
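The two factorisations contrasted above can be summarized as follows (reconstructed from the VDN and QMIX papers; the slide shows them as images): VDN sums per-agent utilities, while QMIX mixes them with a state-conditioned monotonic network.

```latex
% VDN: additive decomposition
Q_{tot}(\boldsymbol{\tau}, \mathbf{u}) = \sum_{a=1}^{n} Q_a(\tau^{a}, u^{a})

% QMIX: monotonic mixing of per-agent utilities
Q_{tot}(\boldsymbol{\tau}, \mathbf{u}, s) = f_{mix}\!\left(Q_1(\tau^{1}, u^{1}), \ldots, Q_n(\tau^{n}, u^{n});\, s\right),
\qquad \frac{\partial Q_{tot}}{\partial Q_a} \ge 0 \;\; \forall a
```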
Cooperation
- QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent
Reinforcement Learning(2018)
source :
QMIX: Monotonic Value Function Factorisation for
Deep Multi-Agent Reinforcement Learning(2018)
Multi-Agent Reinforcement Learning
Cooperation
- Multi-Agent Common Knowledge Reinforcement Learning(2018)
- Use Common Knowledge and hierarchically control agents.
- Dec-POMDP
- Decentralized Partially Observable Markov Decision Processes
- State is composed of a number of entities.
- In state s, a binary mask determines all entities that agent a can see :
- Every group member (agent) computes the common knowledge independently, using prior
knowledge and the commonly known trajectory (the random seed is also common knowledge).
Multi-Agent Reinforcement Learning
Cooperation
- Multi-Agent Common Knowledge Reinforcement Learning(2018)
- Delegation Action
source :
Multi-Agent Common Knowledge Reinforcement
Learning(2018)
Multi-Agent Reinforcement Learning
Cooperation
- Multi-Agent Common Knowledge Reinforcement Learning(2018)
source :
Multi-Agent Common Knowledge Reinforcement
Learning(2018)
Multi-Agent Reinforcement Learning
Cooperation
- Multi-Agent Common Knowledge Reinforcement Learning(2018)
source :
Multi-Agent Common Knowledge Reinforcement
Learning(2018)
Multi-Agent Reinforcement Learning
Cooperation
- Multi-Agent Common Knowledge Reinforcement Learning(2018)
- Central-V
source :
Multi-Agent Common Knowledge Reinforcement
Learning(2018)
Multi-Agent Reinforcement Learning
Cooperation
- Stabilising Experience Replay for Deep Multi-Agent Reinforcement Learning(2018)
- To improve data efficiency, a replay buffer is introduced. Replay implicitly assumes the
environment looks the same when a sample is reused as when it was collected.
- If we could use the true state information, the Bellman equation could be formulated as :
- Record data together with its collection time
- Calculate an importance-weighted loss (written out below) :
source :
Stabilising Experience Replay for Deep
Multi-Agent Reinforcement Learning(2018)
Multi-Agent Reinforcement Learning
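The importance-weighted loss referenced above has roughly the following shape (my paraphrase of the paper's formula; t_c denotes the time a sample was collected and t_r the time it is replayed): each sample is reweighted by how likely the other agents' joint action is under their current policies versus their policies at collection time.

```latex
\mathcal{L}(\theta) \approx \sum_{i}
  \frac{\pi^{-a}_{t_r}\!\left(\mathbf{u}^{-a} \mid s\right)}
       {\pi^{-a}_{t_c}\!\left(\mathbf{u}^{-a} \mid s\right)}
  \left( y_i - Q\!\left(s, u^{a}; \theta\right) \right)^{2}
```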
Cooperation
- Stabilising Experience Replay for Deep Multi-Agent Reinforcement
Learning(2018)
- But we can't! (All agents are in a partially observable environment.)
- So we define a new game that is specified by
- an augmented state (with the action-observation history added) and a reward function.
source :
Stabilising Experience Replay for Deep
Multi-Agent Reinforcement Learning(2018)
Multi-Agent Reinforcement Learning
Cooperation
- Stabilising Experience Replay for Deep Multi-Agent Reinforcement
Learning(2018)
- The Q function can only be updated approximately in the partially observable
setting (the exact update is intractable!)
source :
Stabilising Experience Replay for Deep
Multi-Agent Reinforcement Learning(2018)
Multi-Agent Reinforcement Learning
Cooperation
- Stabilising Experience Replay for Deep Multi-Agent Reinforcement
Learning(2018)
- Importance sampling is only an approximation, and its variance is hard to control.
- Instead, use the idea of Hyper Q-learning!
- Feed the other agents' policies into the observation.
- Hard to scale -> use a fingerprint instead (e.g. training iteration number,
exploration rate); see the sketch below.
Multi-Agent Reinforcement Learning
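A sketch of the fingerprint idea (the interface is an assumption, not the paper's code): append a few low-dimensional training-progress features, such as the training iteration and the exploration rate, to each agent's observation so that replayed samples can be disambiguated by when they were collected.

```python
import numpy as np

def add_fingerprint(obs, train_iteration, epsilon):
    # The fingerprint summarizes "where in training" this sample comes from.
    fingerprint = np.array([train_iteration, epsilon], dtype=np.float32)
    return np.concatenate([obs, fingerprint])

obs = np.zeros(8, dtype=np.float32)
aug_obs = add_fingerprint(obs, train_iteration=1.5e4, epsilon=0.05)  # shape (10,)
```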
Cooperation
- Stabilising Experience Replay for Deep Multi-Agent Reinforcement
Learning(2018)
source :
Stabilising Experience Replay for Deep
Multi-Agent Reinforcement Learning(2018)
Multi-Agent Reinforcement Learning
Cooperation
- Learning to Communicate with Deep
Multi-Agent Reinforcement
Learning(2016)
- RIAL
- Action : U + M
- environment action U
- message M
- Action selection : ε-greedy
- No experience replay
- Parameter sharing
source :
Learning to Communicate with Deep Multi-Agent
Reinforcement Learning(2016)
Multi-Agent Reinforcement Learning
Cooperation
- Learning to Communicate with Deep
Multi-Agent Reinforcement
Learning(2016)
- DIAL
- Action : U + M
- environment action U
- message M
- C-Net
- Q network
- message network
- DRU
- After noise is added, it passes through a sigmoid function.
- Action selection : ε-greedy
- No experience replay
- Parameter sharing
source :
Learning to Communicate with Deep Multi-Agent
Reinforcement Learning(2016)
Multi-Agent Reinforcement Learning
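A rough sketch of the DRU (discretise/regularise unit) described above, based on my reading of the DIAL paper (the noise scale is an assumption): during centralized training the real-valued message is regularized with noise and a sigmoid so gradients can flow through the channel; at decentralized execution it is discretized to a hard bit.

```python
import torch

def dru(message, training: bool, noise_std: float = 2.0):
    if training:
        # Soft, differentiable message used during centralized learning.
        return torch.sigmoid(message + noise_std * torch.randn_like(message))
    # Hard 0/1 message used at decentralized execution.
    return (message > 0).float()

m = torch.tensor([0.7, -1.2])
print(dru(m, training=True))   # noisy sigmoid output
print(dru(m, training=False))  # tensor([1., 0.])
```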
Cooperation
- Learning to Communicate with Deep Multi-Agent
Reinforcement Learning(2016)
- DIAL
source :
Learning to Communicate with Deep Multi-Agent
Reinforcement Learning(2016)
Multi-Agent Reinforcement Learning
Zero-Sum
- Mastering the game of Go with deep neural networks and tree search(2016)
vs
- Grandmaster level in StarCraft II using multi-agent reinforcement
learning(2019)
- League
- Main Agents
- Main exploiter agents
- League exploiter agents
- Prioritized fictitious self-play
Multi-Agent Reinforcement Learning
General-Sum
- Learning with Opponent-Learning Awareness(2018)
- Suppose there are 2 players, with policy parameters θ1 and θ2
- If we could access all parameter values, we could iteratively calculate the updates
- Instead, with step size δ, naive learner 1's parameter update rule is (written out below) :
Multi-Agent Reinforcement Learning
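Writing out the missing update rule (a reconstruction in the LOLA paper's notation, with value functions V¹, V² and step size δ), the naive learner for player 1 simply ascends its own value:

```latex
\theta^{1}_{t+1} = \theta^{1}_{t} + \delta\, \nabla_{\theta^{1}} V^{1}\!\left(\theta^{1}_{t}, \theta^{2}_{t}\right)
```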
General-Sum
- Learning with Opponent-Learning Awareness(2018)
- Unlike the naive learner (NL), the LOLA learner learns to optimize (with respect to player 1) :
- Assuming a small opponent update step, a first-order Taylor expansion results in :
- By substituting the opponent's naive learning step :
Multi-Agent Reinforcement Learning
General-Sum
- Learning with Opponent-Learning Awareness(2018)
- LOLA learning rule :
Multi-Agent Reinforcement Learning
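Putting the last two slides together (again a reconstruction in the paper's notation): LOLA player 1 optimizes V¹ evaluated after the opponent's anticipated naive step Δθ², a first-order Taylor expansion makes this tractable, and substituting Δθ² = η ∇_{θ²} V² gives the LOLA update.

```latex
V^{1}\!\left(\theta^{1}, \theta^{2} + \Delta\theta^{2}\right)
  \approx V^{1}\!\left(\theta^{1}, \theta^{2}\right)
        + \left(\Delta\theta^{2}\right)^{\!\top} \nabla_{\theta^{2}} V^{1}\!\left(\theta^{1}, \theta^{2}\right),
\qquad \Delta\theta^{2} = \eta\, \nabla_{\theta^{2}} V^{2}\!\left(\theta^{1}, \theta^{2}\right)

\theta^{1}_{t+1} = \theta^{1}_{t}
  + \delta\, \nabla_{\theta^{1}} V^{1}
  + \delta \eta \left( \nabla_{\theta^{2}} V^{1} \right)^{\!\top} \nabla_{\theta^{1}} \nabla_{\theta^{2}} V^{2}
```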
General-Sum
- Learning with Opponent-Learning Awareness(2018)
- LOLA learning via policy gradient :
- Naive learner :
- Second order :
Multi-Agent Reinforcement Learning
General-Sum
- Learning with Opponent-Learning Awareness(2018)
- LOLA learning via policy gradient :
- complete LOLA update policy gradient :
- When the opponent's parameters can't be accessed :
Multi-Agent Reinforcement Learning
General-Sum
- Learning with Opponent-Learning Awareness(2018)
- LOLA learning via policy gradient :
- Tit-for-tat strategy
source :
Learning with Opponent-Learning Awareness(2018)
Multi-Agent Reinforcement Learning
General-Sum
- Learning with Opponent-Learning Awareness(2018)
- LOLA learning via policy gradient :
Naive Learner VS LOLA
source :
Learning with Opponent-Learning Awareness(2018)
Multi-Agent Reinforcement Learning
General-Sum
- Learning with Opponent-Learning Awareness(2018)
- LOLA learning via policy gradient :
source :
Learning with Opponent-Learning Awareness(2018)
Multi-Agent Reinforcement Learning
Reference
1. Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction. MIT press.
2. Wikipedia contributors. (2021, July 17). Markov decision process. In Wikipedia, The Free
Encyclopedia. Retrieved 05:59, August 9, 2021, from
https://en.wikipedia.org/w/index.php?title=Markov_decision_process&oldid=1034067020
3. Zhu, H., Nel, A., & Ferreira, H. (2015). Competitive Spectrum Pricing under Centralized Dynamic
Spectrum Allocation. Advances in Wireless Technologies and Telecommunication, 884–908.
https://doi.org/10.4018/978-1-4666-6571-2.ch034
4. Bonanno, G. (2018). Game Theory: Volume 1: Basic Concepts (2nd ed.). CreateSpace
Independent Publishing Platform.
5. Wikipedia contributors. (2021, March 2). Extensive-form game. In Wikipedia, The Free
Encyclopedia. Retrieved 06:09, August 9, 2021, from
https://en.wikipedia.org/w/index.php?title=Extensive-form_game&oldid=1009744715
6. Wikipedia contributors. (2021, March 2). Extensive-form game. In Wikipedia, The Free
Encyclopedia. Retrieved 06:10, August 9, 2021, from
https://en.wikipedia.org/w/index.php?title=Extensive-form_game&oldid=1009744715
7. Wikipedia contributors. (2021, July 8). Common knowledge (logic). In Wikipedia, The Free
Encyclopedia. Retrieved 06:11, August 9, 2021, from
https://en.wikipedia.org/w/index.php?title=Common_knowledge_(logic)&oldid=1032661454
8. Wikipedia contributors. (2021, March 2). Repeated game. In Wikipedia, The Free Encyclopedia.
Retrieved 06:11, August 9, 2021, from
https://en.wikipedia.org/w/index.php?title=Repeated_game&oldid=1009754520
9. Foerster, J. N. (2018). Deep multi-agent reinforcement learning [PhD thesis]. University of Oxford
Reference
10. Tampuu, A., Matiisen, T., Kodelja, D., Kuzovkin, I., Korjus, K., Aru, J., Aru, J., & Vicente, R. (2017).
Multiagent cooperation and competition with deep reinforcement learning. PLOS ONE, 12(4),
e0172395. https://doi.org/10.1371/journal.pone.0172395
11. Jakob Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, Shimon Whiteson,
(2018). Counterfactual Multi-Agent Policy Gradients, AAAI Conference on Artificial Intelligence
12. Rashid, T., Samvelyan, M., de Witt, C. S., Farquhar, G., Foerster, J. N., & Whiteson, S. (2018).
QMIX - monotonic value function factorisation for deep multi-agent reinforcement learning. In
International conference on machine learning.
13. Christian A. Schroeder de Witt, Jakob N. Foerster, Gregory Farquhar, Philip H. S. Torr, Wendelin
Boehmer, and Shimon Whiteson(2018). Multi-Agent Common Knowledge Reinforcement Learning.
arXiv:1810.11702 [cs] URL http://arxiv.org/abs/1810.
Reference
14. Foerster, J., Nardelli, N., Farquhar, G., Afouras, T., Torr, P.H.S., Kohli, P. & Whiteson, S.. (2017).
Stabilising Experience Replay for Deep Multi-Agent Reinforcement Learning. Proceedings of the 34th
International Conference on Machine Learning, in Proceedings of Machine Learning Research
70:1146-1155 Available from http://proceedings.mlr.press/v70/foerster17b.html
15. J. N. Foerster, Y. M. Assael, N. de Freitas, and S. Whiteson(2016). Learning to communicate with
deep multi-agent reinforcement learning. CoRR, abs/1605.06676,
16.Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J.,
Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N.,
Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., & Hassabis, D. (2016).
Mastering the game of go with deep neural networks and tree search. Nature, 529(7587), 484–489.
https://doi.org/10.1038/nature16961
17. J. N. Foerster et al.(2017), Learning with opponent-learning awareness. arXiv:1709.04326 [cs.AI]
Reference
18. Chung-san, R. (2016, December 3). ‘알파고 시대’ 우리 교육, 어떻게 나아가야 하나? [In the ‘AlphaGo era’, how should our education move forward?]
서울특별시교육청 (Seoul Metropolitan Office of Education). https://now.sen.go.kr/2016/12/03.php
19. DeepMind. (2020, May 31). Agent57: Outperforming the human Atari benchmark.
https://deepmind.com/blog/article/Agent57-Outperforming-the-human-Atari-benchmark
20. AlphaStar: Mastering the Real-Time Strategy Game StarCraft II. (2019, January 24). DeepMind.
https://deepmind.com/blog/article/alphastar-mastering-real-time-strategy-game-starcraft-ii
21. Multi-Agent Hide and Seek. (2019, September 17). [Video]. YouTube.
https://www.youtube.com/watch?v=kopoLzvh5jY
22. Tayagkrischelle, T. (2014, September 13). game theorA6 [Slides]. Slideshare.
https://www.slideshare.net/tayagkrischelle/game-theora6
23. Lanctot, M. [Laber Labs]. (2020, May 16). Multi-agent Reinforcement Learning - Laber Labs
Workshop [Video]. YouTube. https://www.youtube.com/watch?v=rbZBBTLH32o
Reference
