Grenoble

Monte-Carlo Tree Search

Games with partial
observation
Olivier.Teytaud@inria.fr + David Auger
+Hervé Fournier + Fabien Teytaud + Sébastien Flory
+ JY Audibert+ S. Bubeck + R. Munos + ...
Includes Inria, Cnrs, Univ. Paris-Sud, LRI, CMAP,
Taiwan universities, Lille, Paris, Boostr...

TAO, Inria-Saclay IDF, Cnrs 8623,
Lri, Univ. Paris-Sud,
Digiteo Labs, Pascal
Network of Excellence.

Grenoble
June 2011
Games with simultaneous actions 1 Grenoble, June 19th, 2011.


1. Games (a bit of formalism)

2. Hidden information <==> SA

3. Decidability / complexity

4. Real implementation
==> appli to UrbanRivals


A game is a directed graph

A game is a directed graph with actions
(edges labelled 1, 2, 3, ...)

and players
(each node belongs to White or Black)

and players and observations
(nodes labelled Bob, Bear, Bee, ...)

and players and observations and rewards
(+1, 0, ...). Rewards on leaves only!

A game is a directed graph +actions
+players +observations +rewards +loops

[Figure: example game graph with actions 1, 2, 3, nodes owned by White or Black, observations Bob / Bear / Bee, and rewards +1 / 0 on the leaves.]
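A minimal sketch, assuming Python, of this definition as a data structure. Only the ingredients (actions, players, observations, rewards on leaves, loops) come from the slides; the node names, the observation given to each node and the position of the loop are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    player: Optional[str]           # "White" or "Black"; None on a leaf
    observation: str                # label seen by the players in this node
    reward: Optional[float] = None  # set on leaves only
    edges: Dict[int, str] = field(default_factory=dict)  # action -> successor


# Hypothetical toy game: every name below is made up for illustration.
GAME: Dict[str, Node] = {
    "root": Node("White", "Bob", edges={1: "a", 2: "b", 3: "win"}),
    "a": Node("Black", "Bee", edges={1: "win", 2: "loss"}),
    "b": Node("Black", "Bee", edges={1: "loss", 2: "root"}),  # action 2 loops
    "win": Node(None, "Bear", reward=1.0),
    "loss": Node(None, "Bear", reward=0.0),
}


def play(game: Dict[str, Node], start: str, actions: List[int]) -> str:
    """Follow a sequence of actions and return the name of the node reached."""
    name = start
    for action in actions:
        name = game[name].edges[action]
    return name


end = play(GAME, "root", [2, 2, 1, 1])  # goes through the loop once
print(end, GAME[end].reward)            # prints: win 1.0
```

Nodes "a" and "b" share the observation "Bee", so a player who only sees observations cannot tell them apart: this is where hidden information enters.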

Consider games as follows:
Turn 1
Turn 2
…
Turn K: all information is revealed.
Turn K+1
Turn K+2
…
Turn 2K: all information is revealed.
…
…
Turn NK: all information is revealed.

Rewrite it as follows:
Turn 1: player 1 chooses (privately) his strategy until turn K
Turn 2: player 2 chooses (privately) his strategy until turn K+1
Intermediate turns removed!
Turn K+1
Turn K+2
…
Turn NK: all information is revealed.

==> Equivalent to simultaneous actions

Now it's a game with simultaneous actions
and no hidden information.

Simultaneous actions
= (almost)
short term hidden information.


2. Hidden information <== SA
(and sometimes <==>)




Compact representation ?

Succinct representation (in short, without tedious details):
- graph of size N represented in size O(log N) ;
- usually not better in terms of complexity;
- keep this in mind when considering complexity.
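To make the first bullet concrete, here is a small sketch in Python of a graph given implicitly rather than by listing its N nodes. The hypercube is a hypothetical example chosen for illustration, not one from the talk: the graph has N = 2^n nodes, but its description is only the integer n plus a neighbour rule.

```python
from typing import List, Set


def neighbours(node: int, n_bits: int) -> List[int]:
    """Neighbours in the n-dimensional hypercube: flip one bit of the label."""
    return [node ^ (1 << i) for i in range(n_bits)]


def reachable_within(start: int, n_bits: int, steps: int) -> Set[int]:
    """Nodes reachable from start in at most `steps` moves (explicit search)."""
    seen = {start}
    frontier = {start}
    for _ in range(steps):
        frontier = {m for v in frontier for m in neighbours(v, n_bits)} - seen
        seen |= frontier
    return seen


n_bits = 10
# The description has size O(n_bits); the explicit graph has 2**n_bits nodes.
print(len(reachable_within(0, n_bits, n_bits)))  # prints: 1024
```

An algorithm polynomial in the explicit size N is exponential in the size n of the succinct input, which is why the representation matters when reading complexity results.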


Complexity question ? (UD)

Instance = position.

Question = Is there a strategy
which wins whatever the decisions
of the opponent are ?
= natural question if full observability.
Answering this question then allows perfect
play.

Complexity question for matrix
game ?

100000
010000
001000
000100
000010
000001

Good for column-player !
==> but no sure win.
==> the “UD” question is not relevant here!
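The point of this slide can be checked with a few lines of Python. This is a sketch under one assumption that the slide does not state: an entry 1 means that the row player wins and 0 that the column player wins. The helper names are made up for the example.

```python
from fractions import Fraction

N = 6
# Assumption: M[i][j] == 1 means the row player wins, 0 the column player.
M = [[1 if i == j else 0 for j in range(N)] for i in range(N)]


def column_sure_win_exists(matrix):
    """True if some column wins against every row (a 'UD'-style sure win)."""
    rows = range(len(matrix))
    cols = range(len(matrix[0]))
    return any(all(matrix[i][j] == 0 for i in rows) for j in cols)


def worst_case_column_win_proba(matrix, q):
    """Winning probability of the mixed column strategy q against the worst row."""
    return min(
        sum(q[j] * (1 - matrix[i][j]) for j in range(len(q)))
        for i in range(len(matrix))
    )


uniform = [Fraction(1, N)] * N
print(column_sure_win_exists(M))                # prints: False
print(worst_case_column_win_proba(M, uniform))  # prints: 5/6
```

No column guarantees a win, yet playing uniformly at random wins with probability 5/6 whatever the row player does, so a yes/no "sure win" question misses most of what matters here.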

Complexity question for
phantom-games ?
Joint work with F. Teytaud

This is phantom-go.

Good for black: wins
with proba 1-1/(8!)

Here,
there's no move
which ensures a win.

But some moves are
much better than
others!

It becomes complicated

Isn't it possible to
consider
a better question ?


Complexity (2P, no random)
X = proba(winning) that we look for

                        Unbounded             Exponential    Polynomial
                        horizon               horizon        horizon
Full observability      EXP                   EXP            PSPACE
No obs (X=100%)         EXPSPACE              NEXP
                        (Haslum et al, 2000)
Partially observable    2EXP                  EXPSPACE
(X=100%)                (Rintanen)            (Mundhenk)
Simult. actions         ? EXPSPACE ?          <= EXP         <= EXP
No obs                  undecidable
Partially observable    undecidable

Undecidability results: Teytaud, Auger, IJFCS (accepted).

State of the art

EXPTIME-complete in the general
fully-observable case

EXPTIME-complete fully
observable games

- Chess (for some nxn generalization)

- Go (with no superko)

- Draughts (international or english)

- Chinese checkers

- Shogi

PSPACE-complete fully
observable games

- Amazons
- Hex
- Go-moku
- Connect-6
- Qubic
- Reversi
- Tic-Tac-Toe

Many games with filling of each cell once and only once

EXPSPACE-complete
unobservable games (Haslum & Jonsson)

The two-player unobservable case is
EXPSPACE-complete
(games in succinct form).


The two-player unobservable case is
EXPSPACE-complete

PROOF:
(I) First note that strategies are just sequences of actions
(no observability!) + UD ==> the opponent can see the state!
(II) It is in EXPSPACE = NEXPSPACE, because of the
following algorithm:
(a) Non-deterministically choose the sequence of
actions
(b) Check the result against all possible strategies
(III) It remains to check the hardness.

EXPSPACE-complete
PROOF of the hardness:
Reduction from: is my TM with exponential tape
going to halt ?

Consider a TM with tape of size N=2^n.

We must find a game
- with size n ( n= log2(N) )
- such that the first player has a winning
strategy iff the TM halts.

EXPSPACE-complete
unobservable games (Haslum & Jonsson)

Encoding a Turing machine with a tape of size N
as a game with state O(log(N))

Player 1 chooses the sequence of
configurations of the tape (N=4):

x(0,1),x(0,2),x(0,3),x(0,4) ==> initial state
x(1,1),x(1,2),x(1,3),x(1,4)
x(2,1),x(2,2),x(2,3),x(2,4)
x(3,1),x(3,2),x(3,3),x(3,4)
.....................................


EXPSPACE-complete


x(1,1),x(1,2),x(1,3),x(1,4)
x(2,1),x(2,2),x(2,3),x(2,4)
x(3,1),x(3,2),x(3,3),x(3,4)
.....................................
x(N,1), x(N,2), x(N,3), x(N,4)

Wins by final state!

EXPSPACE-complete


x(1,1),x(1,2),x(1,3),x(1,4)
x(2,1),x(2,2),x(2,3),x(2,4)
x(3,1),x(3,2),x(3,3),x(3,4)
.....................................
x(N,1), x(N,2), x(N,3), x(N,4)

Wins by final state!

Except if P2 finds an illegal transition!
==> P2 can check the consistency of one triple per line
==> requires space log(N) ( = position of the triple)

EXPSPACE-complete PO games

The one-player PO case is
EXPSPACE-complete


2EXPTIME-complete PO games

The two-player PO case is
2EXP-complete


Undecidable games (B. Hearn)

The three-player PO case is
undecidable. (two players against one,
not allowed to communicate)


Complexity (2P, no random)

                        Unbounded             Exponential    Polynomial
                        horizon               horizon        horizon
Full observability      EXP                   EXP            PSPACE
No obs (X=100%)         EXPSPACE              NEXP
                        (Haslum et al, 2000)
Partially observable    2EXP                  EXPSPACE
(X=100%)                (Rintanen 97)
Simult. actions         ? EXPSPACE ?          <= EXP         <= EXP
No obs                  undecidable
Partially observable    undecidable

Undecidability: reduction to 1P + random (Madani et al).

Another formalization

==> much more satisfactory

Madani et al.

1 player + random = undecidable.

We extend to two players with no random.
Problem: rewrite random nodes, thanks to additional
player.


A random node to be rewritten



Rewritten as follows:
Player 1 chooses a in [[0,N-1]]
Player 2 chooses b in [[0,N-1]]
c = (a+b) modulo N
Go to t_c
Each player can force the game to be equivalent to
the initial one (by playing uniformly)
==> the proba of winning for player 1 (in case of perfect play)
is the same as for the initial game
==> undecidability!
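The key step, that one player alone can force a uniform outcome, can be checked exactly. The sketch below is in Python and assumes that a and b are chosen independently, neither player seeing the other's choice; the function name and the biased distribution are made up for the example.

```python
from fractions import Fraction


def distribution_of_c(p_a, p_b):
    """Exact law of c = (a + b) mod N when a ~ p_a and b ~ p_b are independent."""
    n = len(p_a)
    dist = [Fraction(0)] * n
    for a in range(n):
        for b in range(n):
            dist[(a + b) % n] += p_a[a] * p_b[b]
    return dist


N = 4
uniform = [Fraction(1, N)] * N
# An arbitrary (hypothetical) attempt by the other player to bias the outcome.
biased = [Fraction(7, 10), Fraction(1, 10), Fraction(1, 10), Fraction(1, 10)]

print(distribution_of_c(uniform, biased) == uniform)  # prints: True
print(distribution_of_c(biased, uniform) == uniform)  # prints: True
```

Whichever player plays uniformly, c is uniform on [[0,N-1]], so neither player can do better than in the original game with the random node.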

Important remark

Existence of a strategy for winning with proba > 0.5
==> also undecidable for the restriction to games in which
the proba is > 0.6 or < 0.4
==> not just a subtle precision trouble.


Real implementation for
simultaneous action ?

MCTS principle

But with EXP3 in nodes.


UCT (Upper Confidence Trees)

Coulom (06)
Chaslot, Saito & Bouzy (06)
Kocsis Szepesvari (06)

UCT
Kocsis & Szepesvari (06)

Exploitation ...
SCORE =
5/7
+ k.sqrt( log(10)/7 )

... or exploration ?
SCORE =
0/2
+ k.sqrt( log(10)/2 )
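The two scores above can be compared numerically. A small sketch in Python, assuming that log is the natural logarithm and that k is a free exploration constant (the slides give no value for it); `ucb_score` is a name chosen for this example.

```python
import math


def ucb_score(wins, visits, parent_visits, k):
    """Mean reward plus exploration bonus; visits must be > 0."""
    return wins / visits + k * math.sqrt(math.log(parent_visits) / visits)


for k in (1.0, 2.0):  # hypothetical values of the exploration constant
    exploit = ucb_score(5, 7, 10, k)  # move with 5 wins out of 7 simulations
    explore = ucb_score(0, 2, 10, k)  # move with 0 wins out of 2 simulations
    print(k, round(exploit, 3), round(explore, 3))
```

With k = 1 the well-explored move (5/7) has the higher score, while with k = 2 the rarely tried move (0/2) overtakes it: k sets the trade-off between exploitation and exploration.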

Replace it
with
EXP3 / INF
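For reference, here is a sketch of the textbook Exp3 bandit in Python, with two copies playing a matrix game against each other as they would in a simultaneous-action node. This is the standard algorithm, not necessarily the variant used by the authors; the class name, the value of gamma and the matching-pennies payoff are assumptions made for the example.

```python
import math
import random


class Exp3:
    """Textbook Exp3 bandit for rewards in [0, 1]."""

    def __init__(self, n_arms, gamma, rng):
        self.n_arms = n_arms
        self.gamma = gamma          # exploration rate, in (0, 1]
        self.rng = rng
        self.weights = [1.0] * n_arms

    def probabilities(self):
        total = sum(self.weights)
        return [(1 - self.gamma) * w / total + self.gamma / self.n_arms
                for w in self.weights]

    def draw(self):
        probs = self.probabilities()
        arm = self.rng.choices(range(self.n_arms), weights=probs)[0]
        return arm, probs[arm]

    def update(self, arm, prob, reward):
        # Importance-weighted reward estimate, then exponential update.
        self.weights[arm] *= math.exp(self.gamma * (reward / prob) / self.n_arms)
        top = max(self.weights)     # rescale so that weights never overflow
        self.weights = [w / top for w in self.weights]


def self_play(payoff, n_rounds, gamma=0.1, seed=0):
    """Row and column bandits play payoff[i][j] (row's reward); returns row frequencies."""
    rng = random.Random(seed)
    row = Exp3(len(payoff), gamma, rng)
    col = Exp3(len(payoff[0]), gamma, rng)
    counts = [0] * len(payoff)
    for _ in range(n_rounds):
        i, p_i = row.draw()
        j, p_j = col.draw()
        row.update(i, p_i, payoff[i][j])
        col.update(j, p_j, 1 - payoff[i][j])
        counts[i] += 1
    return [c / n_rounds for c in counts]


MATCHING_PENNIES = [[1, 0], [0, 1]]  # hypothetical 2x2 zero-sum example
print(self_play(MATCHING_PENNIES, 5000))
```

Unlike the UCB score, Exp3 keeps a randomized policy in each node, which is what a simultaneous-action game needs: a deterministic choice could be exploited by the opponent.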

The game of Go is a part of AI.
Computers look ridiculous next to children.

Easy situation.
Termed “semeai”.
Requires a little bit
of abstraction.


800 cores, 4.7 GHz,
top level program.

Plays a stupid move.


8 years old;
little training;
finds the good move

MoGo(TW): games vs pros
in the game of Go
First win in 9x9

First draw (a few days ago!) over 6 games

First win over 4 games in 9x9 blind Go

First win with H2.5 in 13x13 Go

First win with H6 in 19x19 Go in 2009 (also done by Zen)

First win with H7 in 19x19 Go vs top pro in 2009 (also
done by Pachi in 2011)





==> Dark Chess endgames
==> appli to UrbanRivals


Let's have fun with Urban Rivals (4 cards)
Each player has
- four cards (each one can be used once)
- 12 pilz (each one can be used once)
- 12 life points

Each card has:
- one attack level
- one damage
- special effects (forget it for the moment)

Four turns:
- P1 attacks P2
- P2 attacks P1
- P1 attacks P2
- P2 attacks P1


Let's have fun with Urban Rivals
First, attacker plays:
- chooses a card
- chooses ( PRIVATELY ) a number of pilz
Attack level = attack(card) x (1+nb of pilz)

Then, defender plays:
- chooses a card
- chooses a number of pilz
Defense level = attack(card) x (1+nb of pilz)

Result:
If attack > defense
Defender loses Power(attacker's card)
Else
Attacker loses Power(defender's card)
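The round described above can be sketched as a transition function in Python. The class and function names are hypothetical, special effects are ignored as on the slide, and two points are assumptions: pilz committed to a round are spent whatever the outcome, and the "used once" rule for cards is not modelled.

```python
from dataclasses import dataclass


@dataclass
class Card:
    attack: int
    power: int  # damage inflicted when this card wins the round


@dataclass
class Player:
    life: int = 12
    pilz: int = 12


def level(card: Card, nb_pilz: int) -> int:
    """Attack or defense level = attack(card) x (1 + nb of pilz)."""
    return card.attack * (1 + nb_pilz)


def resolve_round(attacker, att_card, att_pilz, defender, def_card, def_pilz):
    """Apply one round in place; a tie goes to the defender ('Else' branch)."""
    if att_pilz > attacker.pilz or def_pilz > defender.pilz:
        raise ValueError("not enough pilz")
    attacker.pilz -= att_pilz  # assumption: spent pilz are lost either way
    defender.pilz -= def_pilz
    if level(att_card, att_pilz) > level(def_card, def_pilz):
        defender.life -= att_card.power
    else:
        attacker.life -= def_card.power


p1, p2 = Player(), Player()
# Attack 6 x (1+2) = 18 beats defense 7 x (1+1) = 14: the defender loses 4.
resolve_round(p1, Card(attack=6, power=4), 2, p2, Card(attack=7, power=3), 1)
print(p1, p2)  # prints: Player(life=12, pilz=10) Player(life=8, pilz=11)
```

Because the attacker's number of pilz is chosen privately, the defender faces short-term hidden information, which is why the game fits the simultaneous-action setting.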


Let's have fun with Urban Rivals
==> The MCTS-based AI is now at the best human level.

Experimental (only) remarks on EXP3:

- discard strategies with small number of sims = better approx
of the Nash

- also an improvement by taking into
account the other bandit

- not yet compared to INF

- virtual simulations (inspired by Kummer)

Conclusions
New stuff:
Undecidability of optimal play for 2-player games with hidden information
Transformation “PO periodically revealed ==> simultaneous action game
with full observation”

Open problems
Complexity: simultaneous action and infinite horizon (in progress)
Complexity with PO: same information for both players ?
Nash of matrix games with strong dominance
Mathematical validation of variants of Exp3 / Inf
Consistent “realistic” approaches for PO games (H finite)

When is MCTS relevant ?

Robust in front of:
High dimension;
Non-convexity of Bellman values;
Complex models
Delayed reward
Simultaneous actions

More difficult for
High values of H;
Highly unobservable cases (Monte-Carlo, but not
Monte-Carlo Tree Search, see Cazenave et al.)
Lack of reasonable baseline for the MC


Further work!

We should test INF and justify mathematically our improvements.

Some undecidability results.

How to apply it:
Implement the transition
(a function action x state → state )
Convenient. Easy to check.

Design a Monte-Carlo part (a random simulation)
(a heuristic in one-player games;
difficult if two opponents)

==> at this point you can simulate...

Implement UCT (just a bias in the simulator – no real optimizer)

Possibly parallelize (Gelly et al)
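The steps above (transition, Monte-Carlo part, UCT) can be sketched in Python on a toy game. Everything specific here is an assumption made for illustration: the game (one-heap Nim, remove 1 to 3 stones, taking the last stone wins), the uniform random playouts, the constant k and all function names. It is a sequential, fully observable example, not the authors' implementation.

```python
import math
import random


# Step 1: the transition. State = (stones left, player to move).
def legal_actions(state):
    return [a for a in (1, 2, 3) if a <= state[0]]


def transition(state, action):
    return (state[0] - action, 1 - state[1])


def winner(state):
    # The player who just moved took the last stone and wins.
    return 1 - state[1] if state[0] == 0 else None


# Step 2: the Monte-Carlo part, here a uniformly random playout.
def rollout(state, rng):
    while winner(state) is None:
        state = transition(state, rng.choice(legal_actions(state)))
    return winner(state)


# Step 3: UCT, a bias in the simulator towards promising moves.
class Node:
    def __init__(self, state):
        self.state = state
        self.children = {}  # action -> Node
        self.visits = 0
        self.wins = 0.0     # wins of the player who moved into this node


def uct_select(node, k):
    def score(child):
        bonus = k * math.sqrt(math.log(node.visits) / child.visits)
        return child.wins / child.visits + bonus
    return max(node.children.items(), key=lambda item: score(item[1]))


def simulate(node, rng, k):
    """Run one simulation through node, update statistics, return the winner."""
    mover = 1 - node.state[1]
    if winner(node.state) is not None:
        result = winner(node.state)
    else:
        untried = [a for a in legal_actions(node.state) if a not in node.children]
        if untried:  # expand one new child and finish with a random playout
            action = rng.choice(untried)
            child = Node(transition(node.state, action))
            node.children[action] = child
            result = rollout(child.state, rng)
            child.visits += 1
            child.wins += 1.0 if result == node.state[1] else 0.0
        else:
            _, child = uct_select(node, k)
            result = simulate(child, rng, k)
    node.visits += 1
    node.wins += 1.0 if result == mover else 0.0
    return result


def best_action(state, n_sims=2000, k=1.0, seed=0):
    rng = random.Random(seed)
    root = Node(state)
    for _ in range(n_sims):
        simulate(root, rng, k)
    return max(root.children, key=lambda a: root.children[a].visits)


print(best_action((5, 0)))  # with 5 stones, removing 1 leaves a lost position
```

Only the first three functions depend on the game, so moving to another game means rewriting the transition and, if a better heuristic is available, the playout.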

PO problems, approx.
Nash ==> mailing list

Challenge: outperform humans
in “Urban Rivals”
- free game
- fast games (~ 1 minute)
- 11M registered players
