The presentation of the article “Mastering the game of Go with deep neural networks and tree search”, given at the Optimization Seminar 2015/2016.
Notes:
- All URLs are clickable.
- All citations are clickable (when hovered over the "year" part of "[author year]").
- To download without a SlideShare account, use https://www.dropbox.com/s/p4rnlhoewbedkjg/AlphaGo.pdf?dl=0
- The corresponding leaflet is available at http://www.slideshare.net/KarelHa1/leaflet-for-the-talk-on-alphago
- The source code is available at https://github.com/mathemage/AlphaGo-presentation
AlphaGo: Mastering the Game of Go with Deep Neural Networks and Tree Search
1. AlphaGo: Mastering the game of Go with deep neural networks and tree search
Karel Ha
article by Google DeepMind
Optimization Seminar, 20th April 2016
7. Applications of AI
spam filters
recommender systems (Netflix, YouTube)
predictive text (SwiftKey)
audio recognition (Shazam, SoundHound)
self-driving cars
15. Game of Thrones Generated Character by Character
JON
He leaned close and onions, barefoot from his shoulder. “I am not a purple girl,” he said as he stood over him. “The sight of you sell your father with you a little choice.”
“I say to swear up his sea or a boy of stone and heart, down,” Lord Tywin said. “I love your word or her to me.”
Darknet (on Linux)
JON
Each in days and the woods followed his king. “I understand.”
“I am not your sister Lord Robert?”
“The door was always some cellar to do his being girls and the Magnar of Baratheon, and there were thousands of every bite of half the same as though he was not a great knight should be seen, and not to look at the Redwyne two thousand men.”
Darknet (on OS X)
http://pjreddie.com/darknet/rnns-in-darknet/
21. DeepDrumpf: a Twitter bot / neural network which learned the language of Donald Trump from his speeches
We’ve got nuclear weapons that are obsolete. I’m going to create jobs just by making the worst thing ever.
The biggest risk to the world, is me, believe it or not.
I am what ISIS doesn’t need.
I’d like to beat that @HillaryClinton. She is a horror. I told my supporter Putin to say that all the time. He has been amazing.
I buy Hillary, it’s beautiful and I’m happy about it.
Hayes 2016
22. Atari Player by Google DeepMind
https://youtu.be/0X-NdPtFKq0?t=21m13s
Mnih et al. 2015
34. Supervised Learning (SL)
1. data collection: Google Search, Facebook “Likes”, Siri, Netflix, YouTube views, LHC collisions, KGS Go Server...
2. training on the training set
3. testing on the testing set
4. deployment
http://www.nickgillian.com/
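A minimal sketch of these four steps, using scikit-learn purely as an illustration (the library choice and the toy digits dataset are my assumptions, not part of the talk):

```python
# Sketch of the SL pipeline: collect data, train, test, deploy.
from sklearn.datasets import load_digits              # 1. data collection (toy stand-in)
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)                           # 2. training on the training set
print("test accuracy:", model.score(X_test, y_test))  # 3. testing on the testing set
prediction = model.predict(X_test[:1])                # 4. deployment: predict on unseen data
```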
41. Underfitting and Overfitting
Beware of overfitting!
It is like preparing for a mathematics exam by memorizing proofs.
https://www.researchgate.net/post/How_to_Avoid_Overfitting
51. Tree Search
Optimal value v∗(s) determines the outcome of the game:
from every board position or state s
under perfect play by all players.
It is computed by recursively traversing a search tree containing approximately b^d possible sequences of moves, where
b is the game’s breadth (number of legal moves per position)
d is its depth (game length)
Silver et al. 2016
58. Game tree of Go
Sizes of trees for various games:
chess: b ≈ 35, d ≈ 80
Go: b ≈ 250, d ≈ 150 ⇒ more positions than atoms in the universe!
That makes Go a googol [10^100] times more complex than chess.
https://deepmind.com/alpha-go.html
How to handle the size of the game tree?
for the breadth: a neural network to select moves
for the depth: a neural network to evaluate the current position
for the tree traversal: Monte Carlo tree search (MCTS)
Allis et al. 1994
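A quick back-of-the-envelope check of the b^d estimates quoted above (illustrative only: these are rough game-tree sizes, whereas the “googol” figure compares counts of legal positions):

```python
# Order-of-magnitude sizes of the game trees from the slide's b and d values.
from math import log10

for name, (b, d) in {"chess": (35, 80), "Go": (250, 150)}.items():
    print(f"{name}: b^d ≈ 10^{d * log10(b):.0f}")   # chess ≈ 10^124, Go ≈ 10^360
```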
65. Neural Networks (NN): Inspiration
inspired by the neuronal structure of the mammalian cerebral cortex
but on much smaller scales
suitable to model systems with a high tolerance to error
e.g. audio or image recognition
http://pages.cs.wisc.edu/~bolo/shipyard/neural/local.html
69. Neural Networks: Modes
Two modes
feedforward for making predictions
backpropagation for learning
Dieterle 2003
70. Neural Networks: an Example of Feedforward
http://stevenmiller888.github.io/mind-how-to-build-a-neural-network/
73. Gradient Descent in Neural Networks
Motto: “Learn by mistakes!”
However, error functions are not necessarily convex, or even that “smooth”.
http://pages.cs.wisc.edu/~bolo/shipyard/neural/local.html
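A minimal sketch of the idea, on a deliberately simple convex function (this toy, its learning rate, and the target are my own illustration, not the networks’ actual training code):

```python
# Gradient descent on f(w) = (w - 3)^2: repeatedly step against the gradient.
def grad(w):                # df/dw = 2 * (w - 3)
    return 2.0 * (w - 3.0)

w, learning_rate = 0.0, 0.1
for step in range(50):
    w -= learning_rate * grad(w)   # "learn by mistakes": move opposite the error slope
print(w)                    # converges toward the minimum at w = 3
```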
75. Convolutional Neural Networks (CNN or ConvNet)
http://code.flickr.net/2014/10/20/introducing-flickr-park-or-bird/
77. (Deep) Convolutional Neural Networks
The hierarchy of concepts is captured in the number of layers: the “deep” in “Deep Learning”.
http://pages.cs.wisc.edu/~bolo/shipyard/neural/local.html
87. Rules of Go
Black versus White. Black starts the game.
the rule of liberty
the “ko” rule
Handicap for difference in ranks: Black can place 1 or more stones in advance (compensation for White’s greater strength).
91. Scoring Rules: Area Scoring
A player’s score is:
the number of stones that the player has on the board
plus the number of empty intersections surrounded by that player’s stones
plus komi(dashi) points for the White player
which is a compensation for the first-move advantage of the Black player
https://en.wikipedia.org/wiki/Go_(game)
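A toy illustration of the rule as arithmetic (all counts are hypothetical, and the 6.5 komi is a commonly used value, not one fixed by the slide):

```python
# Area scoring: stones + surrounded territory (+ komi for White).
black_stones, black_territory = 73, 112
white_stones, white_territory = 68, 98
komi = 6.5                      # compensation for White's second-move disadvantage

black_score = black_stones + black_territory
white_score = white_stones + white_territory + komi
print("Black", black_score, "vs White", white_score)   # Black 185 vs White 172.5
```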
107. SL Policy Network (1/2)
13-layer deep convolutional neural network
goal: to predict expert human moves
task of classification
trained on 30 million positions from the KGS Go Server
stochastic gradient ascent:
Δσ ∝ ∂ log pσ(a|s) / ∂σ
(to maximize the likelihood of the human move a selected in state s)
Results:
44.4% accuracy (the previous state of the art, from other groups)
55.7% accuracy (raw board position + move history as input)
57.0% accuracy (all input features)
Silver et al. 2016
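A minimal sketch of one ascent step on the log-likelihood, using a linear softmax policy as a stand-in for the 13-layer CNN (the linear model, the random data, and the learning rate are my assumptions; only the move count 361 and the 48 input feature planes come from the paper):

```python
# One stochastic-gradient-ascent step: Δσ ∝ ∂ log pσ(a|s) / ∂σ.
import numpy as np

rng = np.random.default_rng(0)
n_moves, n_features = 361, 48                    # 19x19 board; 48 feature planes
sigma = rng.normal(scale=0.01, size=(n_moves, n_features))   # policy weights σ

def policy(s):                                   # pσ(·|s): softmax over all moves
    logits = sigma @ s
    e = np.exp(logits - logits.max())
    return e / e.sum()

s = rng.normal(size=n_features)                  # one (encoded) board position
a = 42                                           # the expert's move in that position
alpha = 0.1

p = policy(s)
grad_log = (np.eye(n_moves)[a] - p)[:, None] * s[None, :]   # ∂ log pσ(a|s)/∂σ for softmax
sigma += alpha * grad_log                        # ASCENT: make the expert move more likely
```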
108. SL Policy Network (2/2)
Small improvements in accuracy led to large improvements in playing strength.
Silver et al. 2016
109. Training the (Deep Convolutional) Neural Networks
Silver et al. 2016
112. Rollout Policy
Rollout policy pπ(a|s) is faster but less accurate than the SL policy network.
accuracy of 24.2%
It takes 2 µs to select an action, compared to 3 ms in the case of the SL policy network.
Silver et al. 2016
113. Training the (Deep Convolutional) Neural Networks
Silver et al. 2016
123. RL Policy Network (1/2)
identical in structure to the SL policy network
goal: to win in the games of self-play
task of reinforcement learning (policy gradient)
weights ρ initialized to the same values, ρ := σ
games of self-play
between the current RL policy network and a randomly selected previous iteration
to prevent overfitting to the current policy
stochastic gradient ascent:
Δρ ∝ (∂ log pρ(at|st) / ∂ρ) · zt
at time step t, where the reward function zt is +1 for winning and −1 for losing.
Silver et al. 2016
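A minimal REINFORCE-style sketch of this update, again with a linear softmax policy standing in for the real network pρ (the fake trajectory, model, and step size are my assumptions):

```python
# Policy-gradient update over one self-play game: Δρ ∝ ∂ log pρ(at|st)/∂ρ · zt.
import numpy as np

rng = np.random.default_rng(1)
n_moves, n_features = 361, 48
rho = rng.normal(scale=0.01, size=(n_moves, n_features))     # initialized from σ in AlphaGo

def policy(s):
    logits = rho @ s
    e = np.exp(logits - logits.max())
    return e / e.sum()

# one (fake) self-play trajectory: states, sampled actions, and the final outcome z
states = [rng.normal(size=n_features) for _ in range(5)]
actions = [int(rng.integers(n_moves)) for _ in range(5)]
z = +1.0                    # zt: +1 for a win, -1 for a loss
alpha = 0.01

for s, a in zip(states, actions):
    p = policy(s)
    grad_log = (np.eye(n_moves)[a] - p)[:, None] * s[None, :]
    rho += alpha * z * grad_log    # ascend, weighted by the game outcome
```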
127. RL Policy Network (2/2)
Results (by sampling each move at ∼ pρ(·|st)):
80% win rate against the SL policy network
85% win rate against the strongest open-source Go program, Pachi (Baudiš and Gailly 2011)
The previous state of the art, based only on SL of CNNs: 11% “win” rate against Pachi
Silver et al. 2016
128. Training the (Deep Convolutional) Neural Networks
Silver et al. 2016
133. Value Network (1/2)
similar architecture to the policy network, but outputs a single prediction instead of a probability distribution
goal: to estimate a value function
vp(s) = E[zt | st = s, at...T ∼ p]
that predicts the outcome from position s (of games played by using policy p)
Double approximation: vθ(s) ≈ vpρ(s) ≈ v∗(s).
task of regression
stochastic gradient descent:
Δθ ∝ (∂vθ(s) / ∂θ) · (z − vθ(s))
(to minimize the mean squared error (MSE) between the predicted vθ(s) and the true z)
Silver et al. 2016
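A minimal sketch of this regression step, with a linear-plus-tanh model as a stand-in for the value network (model and data are my assumptions; the update rule is the one from the slide):

```python
# One SGD step on the squared error (z - vθ(s))^2: Δθ ∝ ∂vθ(s)/∂θ · (z − vθ(s)).
import numpy as np

rng = np.random.default_rng(2)
n_features = 48
theta = np.zeros(n_features)

def v(s):                           # vθ(s): predicted outcome, squashed into (-1, 1)
    return float(np.tanh(theta @ s))

s = rng.normal(size=n_features)     # an encoded position
z = 1.0                             # true game outcome
alpha = 0.01

pred = v(s)
dv_dtheta = (1.0 - pred**2) * s     # chain rule: tanh'(x) = 1 - tanh(x)^2
theta += alpha * dv_dtheta * (z - pred)   # descend on the MSE toward z
```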
138. Value Network (2/2)
Beware of overfitting!
Consecutive positions are strongly correlated.
The value network memorized the game outcomes, rather than generalizing to new positions.
Solution: generate 30 million (new) positions, each sampled from a separate game
almost the accuracy of Monte Carlo rollouts (using pρ), but with 15,000 times less computation!
Silver et al. 2016
141. Evaluation Accuracy in Various Stages of a Game
Move number is the number of moves that had been played in the given position.
Each position evaluated by:
forward pass of the value network vθ
100 rollouts, played out using the corresponding policy
Silver et al. 2016
142. Elo Ratings for Various Combinations of Networks
Silver et al. 2016
154. MCTS Algorithm
The next action is selected by lookahead search, using simulation:
1. selection phase
2. expansion phase
3. evaluation phase
4. backup phase (at the end of all simulations)
Each edge (s, a) keeps:
action value Q(s, a)
visit count N(s, a)
prior probability P(s, a) (from the SL policy network pσ)
The tree is traversed by simulation (descending the tree) from the root state.
Silver et al. 2016
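A minimal sketch of the per-edge statistics as a data structure (the class names are hypothetical, and the real implementation is far more elaborate and parallelized):

```python
# Per-edge statistics kept by the search tree: P(s,a), N(s,a), and Q(s,a).
from dataclasses import dataclass, field

@dataclass
class Edge:
    prior: float                 # P(s, a), from the SL policy network pσ
    visit_count: int = 0         # N(s, a)
    total_value: float = 0.0     # running sum of leaf evaluations

    @property
    def q(self) -> float:        # Q(s, a): mean evaluation of this edge
        return self.total_value / self.visit_count if self.visit_count else 0.0

@dataclass
class Node:
    edges: dict = field(default_factory=dict)   # move -> Edge
```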
157. MCTS Algorithm: Selection
At each time step t, an action at is selected from state st:
at = argmax_a (Q(st, a) + u(st, a))
where the bonus
u(st, a) ∝ P(s, a) / (1 + N(s, a))
Silver et al. 2016
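A minimal sketch of this rule, continuing the toy Edge/Node classes above (c is an exploration constant of my choosing; the paper’s full variant additionally scales the bonus with the parent’s total visit count):

```python
# Selection: pick the move maximizing Q(st, a) + u(st, a).
def select(node, c=5.0):
    def score(move):
        e = node.edges[move]
        return e.q + c * e.prior / (1 + e.visit_count)   # u ∝ P(s,a) / (1 + N(s,a))
    return max(node.edges, key=score)
```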
160. MCTS Algorithm: Expansion
A leaf position may be expanded (just once) by the SL policy network pσ.
The output probabilities are stored as priors P(s, a) := pσ(a|s).
Silver et al. 2016
165. MCTS: Evaluation
evaluation from the value network vθ(s)
evaluation by the outcome z, using the fast rollout policy pπ until the end of the game
Using a mixing parameter λ, the final leaf evaluation V(s) is
V(s) = (1 − λ)vθ(s) + λz
Silver et al. 2016
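The mixing formula as a one-liner (λ = 0.5 is the value the paper found to work best):

```python
# Mixed leaf evaluation: V(s) = (1 - λ) * vθ(s) + λ * z.
def evaluate_leaf(value_net_output: float, rollout_outcome: float, lam: float = 0.5) -> float:
    return (1.0 - lam) * value_net_output + lam * rollout_outcome
```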
167. MCTS: Backup
At the end of a simulation, each traversed edge is updated by accumulating:
the action values Q
the visit counts N
Silver et al. 2016
168. Once the search is complete, the algorithm chooses the most visited move from the root position.
Silver et al. 2016
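A minimal sketch of backup and the final move choice, continuing the toy Edge/Node classes above (again an illustration, not the actual implementation):

```python
# Backup: push the leaf evaluation up every edge traversed in this simulation.
def backup(path, leaf_value):
    # path: list of (node, move) pairs from the root down to the evaluated leaf
    for node, move in path:
        edge = node.edges[move]
        edge.visit_count += 1            # accumulate N(s, a)
        edge.total_value += leaf_value   # Q(s, a) is recovered as the running mean

# After all simulations: play the most visited move from the root.
def choose_move(root):
    return max(root.edges, key=lambda m: root.edges[m].visit_count)
```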
173. Principal Variation (Path with Maximum Visit Count)
The moves are presented in a numbered sequence.
AlphaGo selected the move indicated by the red circle;
Fan Hui responded with the move indicated by the white square;
in his post-game commentary, he preferred the move (labelled 1) predicted by AlphaGo.
Silver et al. 2016
195. Fan Hui
professional 2 dan
European Go Champion in 2013, 2014 and 2015
European Professional Go Champion in 2016
biological neural network:
100 billion neurons
100 to 1,000 trillion neuronal connections
https://en.wikipedia.org/wiki/Fan_Hui
198. AlphaGo versus Fan Hui
AlphaGo won 5:0 in a formal match in October 2015.
“[AlphaGo] is very strong and stable, it seems like a wall. ... I know AlphaGo is a computer, but if no one told me, maybe I would think the player was a little strange, but a very strong player, a real person.”
Fan Hui
204. Lee Sedol “The Strong Stone”
professional 9 dan
the 2nd in international titles
the 5th youngest (12 years 4 months) to become a professional Go player in South Korean history
Lee Sedol would win 97 out of 100 games against Fan Hui.
biological neural network comparable to Fan Hui’s (in number of neurons and connections)
https://en.wikipedia.org/wiki/Lee_Sedol
207. “I heard Google DeepMind’s AI is surprisingly strong and getting stronger, but I am confident that I can win, at least this time.”
Lee Sedol
“...even beating AlphaGo by 4:1 may allow the Google DeepMind team to claim its de facto victory and the defeat of him [Lee Sedol], or even humankind.”
interview in JTBC Newsroom
213. AlphaGo versus Lee Sedol
In March 2016, AlphaGo won 4:1 against the legendary Lee Sedol.
AlphaGo won all but the 4th game; all games were won by resignation.
The winner of the match was slated to win $1 million.
Since AlphaGo won, Google DeepMind stated that the prize would be donated to charities, including UNICEF, and to Go organisations.
Lee received $170,000 ($150,000 for participating in all five games, and an additional $20,000 for each game won).
https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol
221. AlphaGo versus Ke Jie?
professional 9 dan
the 1st in the (unofficial) world ranking list
the youngest player to win 3 major international tournaments
head-to-head record against Lee Sedol: 8:2
biological neural network comparable to Fan Hui’s, and thus, by transitivity, also comparable to Lee Sedol’s
https://en.wikipedia.org/wiki/Ke_Jie
224. “I believe I can beat it. Machines can be very strong in many aspects but still have loopholes in certain calculations.”
Ke Jie
“Now facing AlphaGo, I do not feel the same strong instinct of victory when I play a human player, but I still believe I have the advantage against it. It’s 60 percent in favor of me.”
Ke Jie
“Even though AlphaGo may have defeated Lee Sedol, it won’t beat me.”
Ke Jie
228. Difficulties of Go
challenging decision-making
intractable search space
complex optimal solution
It appears infeasible to directly approximate it using a policy or value function!
Silver et al. 2016
239. AlphaGo: summary
Monte Carlo tree search
effective move selection and position evaluation through deep convolutional neural networks
trained by a novel combination of supervised and reinforcement learning
new search algorithm combining
neural network evaluation
Monte Carlo rollouts
scalable implementation
multi-threaded simulations on CPUs
parallel GPU computations
distributed version over multiple machines
Silver et al. 2016
247. Novel approach
During the match against Fan Hui, AlphaGo evaluated thousands of times fewer positions than Deep Blue did against Kasparov.
It compensated for this by:
selecting those positions more intelligently (policy network)
evaluating them more precisely (value network)
Deep Blue relied on a handcrafted evaluation function.
AlphaGo was trained directly and automatically from gameplay. It used general-purpose learning.
This approach is not specific to the game of Go. The algorithm can be used for a much wider class of (so far seemingly) intractable problems in AI!
Silver et al. 2016
251. Selection of Moves by the SL Policy Network
move probabilities taken directly from the SL policy network pσ (reported as a percentage if above 0.1%).
Silver et al. 2016
252. Selection of Moves by the Value Network
evaluation of all successors s′ of the root position s, using vθ(s′)
Silver et al. 2016
253. Tree Evaluation from Value Network
action values Q(s, a) for each tree-edge (s, a) from root position s (averaged over value network evaluations only)
Silver et al. 2016
254. Tree Evaluation from Rollouts
action values Q(s, a), averaged over rollout evaluations only
Silver et al. 2016
255. Results of a tournament between different Go programs
Silver et al. 2016
256. Results of a tournament between AlphaGo and distributed AlphaGo, testing scalability with hardware
Silver et al. 2016
262. AlphaGo versus Lee Sedol: Game 1
https://youtu.be/vFr3K2DORc8
https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol
263. AlphaGo versus Lee Sedol: Game 2 (1/2)
https://youtu.be/l-GsfyVCBu0
https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol
264. AlphaGo versus Lee Sedol: Game 2 (2/2)
https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol
265. AlphaGo versus Lee Sedol: Game 3
https://youtu.be/qUAmTYHEyM8
https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol
266. AlphaGo versus Lee Sedol: Game 4
https://youtu.be/yCALyQRN3hw
https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol
267. AlphaGo versus Lee Sedol: Game 5 (1/2)
https://youtu.be/mzpW10DPHeQ
https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol
268. AlphaGo versus Lee Sedol: Game 5 (2/2)
https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol
269. Further Reading I
AlphaGo:
Google Research Blog
http://googleresearch.blogspot.cz/2016/01/alphago-mastering-ancient-game-of-go.html
an article in Nature
http://www.nature.com/news/google-ai-algorithm-masters-ancient-game-of-go-1.19234
a reddit article claiming that AlphaGo is even stronger than it appears to be:
“AlphaGo would rather win by less points, but with higher probability.”
https://www.reddit.com/r/baduk/comments/49y17z/the_true_strength_of_alphago/
a video of how AlphaGo works (put in layman’s terms) https://youtu.be/qWcfiPi9gUU
Articles by Google DeepMind:
Atari player: a DeepRL system which combines Deep Neural Networks with Reinforcement Learning (Mnih
et al. 2015)
Neural Turing Machines (Graves, Wayne, and Danihelka 2014)
Artificial Intelligence:
Artificial Intelligence course at MIT
http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-034-artificial-intelligence-fall-2010/index.htm
270. Further Reading II
Introduction to Artificial Intelligence at Udacity
https://www.udacity.com/course/intro-to-artificial-intelligence--cs271
General Game Playing course https://www.coursera.org/course/ggp
Singularity http://waitbutwhy.com/2015/01/artificial-intelligence-revolution-1.html + Part 2
The Singularity Is Near (Kurzweil 2005)
Combinatorial Game Theory (founded by John H. Conway to study endgames in Go):
Combinatorial Game Theory course https://www.coursera.org/learn/combinatorial-game-theory
On Numbers and Games (Conway 1976)
Computer Go as a sum of local games: an application of combinatorial game theory (Müller 1995)
Chess:
Deep Blue beats G. Kasparov in 1997 https://youtu.be/NJarxpYyoFI
Machine Learning:
Machine Learning course https://www.coursera.org/learn/machine-learning/
Reinforcement Learning http://reinforcementlearning.ai-depot.com/
Deep Learning (LeCun, Bengio, and Hinton 2015)
271. Further Reading III
Deep Learning course https://www.udacity.com/course/deep-learning--ud730
Two Minute Papers https://www.youtube.com/user/keeroyz
Applications of Deep Learning https://youtu.be/hPKJBXkyTKM
Neuroscience:
http://www.brainfacts.org/
272. References I
Allis, Louis Victor et al. (1994). Searching for solutions in games and artificial intelligence. Ponsen & Looijen.
Baudiš, Petr and Jean-loup Gailly (2011). “Pachi: State of the art open source Go program”. In: Advances in Computer Games. Springer, pp. 24–38.
Bowling, Michael et al. (2015). “Heads-up limit hold’em poker is solved”. In: Science 347.6218, pp. 145–149. URL: http://poker.cs.ualberta.ca/15science.html.
Champandard, Alex J (2016). “Semantic Style Transfer and Turning Two-Bit Doodles into Fine Artworks”. In: arXiv preprint arXiv:1603.01768.
Conway, John Horton (1976). “On Numbers and Games”. In: London Mathematical Society Monographs 6.
Dieterle, Frank Jochen (2003). “Multianalyte quantifications by means of integration of artificial neural networks, genetic algorithms and chemometrics for time-resolved analytical data”. PhD thesis. Universität Tübingen.
Gatys, Leon A., Alexander S. Ecker, and Matthias Bethge (2015). “A Neural Algorithm of Artistic Style”. In: CoRR abs/1508.06576. URL: http://arxiv.org/abs/1508.06576.
Graves, Alex, Greg Wayne, and Ivo Danihelka (2014). “Neural Turing Machines”. In: arXiv preprint arXiv:1410.5401.
Hayes, Bradley (2016). URL: https://twitter.com/deepdrumpf.
Karpathy, Andrej (2015). The Unreasonable Effectiveness of Recurrent Neural Networks. URL: http://karpathy.github.io/2015/05/21/rnn-effectiveness/ (visited on 04/01/2016).
273. References II
Kurzweil, Ray (2005). The Singularity Is Near: When Humans Transcend Biology. Penguin.
LeCun, Yann, Yoshua Bengio, and Geoffrey Hinton (2015). “Deep learning”. In: Nature 521.7553, pp. 436–444.
Li, Chuan and Michael Wand (2016). “Combining Markov Random Fields and Convolutional Neural Networks for Image Synthesis”. In: CoRR abs/1601.04589. URL: http://arxiv.org/abs/1601.04589.
Mnih, Volodymyr et al. (2015). “Human-level control through deep reinforcement learning”. In: Nature 518.7540, pp. 529–533. URL: https://storage.googleapis.com/deepmind-data/assets/papers/DeepMindNature14236Paper.pdf.
Müller, Martin (1995). “Computer Go as a sum of local games: an application of combinatorial game theory”. PhD thesis. TU Graz.
Silver, David et al. (2016). “Mastering the game of Go with deep neural networks and tree search”. In: Nature 529.7587, pp. 484–489.