Machine Learning for Adversarial Agent Microworlds
    J. Scholz1, B. Hengst2, G. Calbert1, A. Antoniades2, P. Smet1, L. Marsh1, H-W. Kwok1, D. Gossink1
1 DSTO Command and Control Division, 2 NICTA, E-Mail: Jason.scholz@defence.gov.au
                              Keywords: Wargames; Reinforcement Learning

EXTENDED ABSTRACT

We describe models and algorithms to support interactive decision making for national and military strategic courses of action, and our progress towards the development of innovative reinforcement learning algorithms, scaling progressively to more complex and realistic simulations.

Abstract representations or 'microworlds' have been used throughout military history to aid in the conceptualization of terrain, force disposition and movements. With the introduction of digitized systems into military headquarters, the capacity of these representations to degrade decision-making has become a concern. Maps with overlays are a centerpiece of most military headquarters and may attain an authority which exceeds their competency, as humans forget their limitations both as a model of the physical environment and as a means to convey command intent (Miller, 2005). The design of microworlds may help to alleviate human-system integration problems by providing detail where it is available and appropriate, and recognizable abstractions where it is not. Ultimately, Lambert and Scholz (2005, p. 29) take the view that "the division of labor between people and machines should be developed to leave human decision making unfettered by machine interference when it is likely to prove heroic, and enhance human decision making with automated decision aids, or possibly override it with automated decision making, when it is likely to be hazardous." Cognisant of these issues, we state a formalism for microworlds, initially to manage representation complexity for machine learning. This formalism is based on multi-agent stochastic games, abstractions and homomorphism.

According to Kaelbling et al. (1996), "Reinforcement learning (RL) is the problem faced by an agent that must learn behaviour through trial and error interactions with a dynamic environment." As RL is frustrated by the curse of dimensionality, we have pursued methods to exploit temporal abstraction and hierarchical organisation. As noted by Barto and Mahadevan (2003, p. 23), "Although the proposed ideas for hierarchical RL appear promising, to date there has been insufficient experience in experimentally testing the effectiveness of these ideas on large applications." We describe the results of our foray into reinforcement learning approaches on progressively larger-scale applications: 'grid world' models including modified Chess and Checkers, a military campaign-level game with concurrent moves, and 'fine-grained' models comprising a land game and an air game.

We detail and compare results from microworld models for which both human-engineered and machine-learned agents were developed. We conclude with a summary of insights and describe future research directions.



INTRODUCTION

During Carnegie Mellon University's Robotics Institute 25th anniversary, Raj Reddy said, "the biggest barrier is (developing) computers that learn with experience and exhibit goal-directed behaviour. If you can't build a system that can learn with experience, you might as well forget everything else". This involves developing agents characterized by the ability to:

• learn from their own experience and the experience of human commanders,

• accumulate learning over long time periods on various microworld models and use what is learned to cope with new situations,

• decompose problems and formulate their own representations, recognising relevant events among the huge amounts of data in their "experience",

• exhibit adaptive, goal-directed behaviour and prioritise multiple, conflicting and time-varying goals,

• interact with humans and other agents using language and context to decipher and respond to complex actions, events and language.

Extant agent-based modelling and simulation environments for military appreciation appear to occupy extreme ends of the modelling spectrum. At one end, Agent-Based Distillations (ABD) provide weakly expressive though highly computable (executing many games per second) models using attraction-repulsion rules to represent intent in simple automata (Horne and Johnson 2003). At the other end of the spectrum, large-scale systems such as the Joint Synthetic Armed Forces (JSAF) and Joint Tactical Level Simulation (JTLS) require days to months and many staff to set up a single simulation run (Matthews and Davies 2001), and are of limited value to highly interactive decision-making. We propose a microworld-based representation which falls somewhere between these extremes, yet allows for decision interaction.

INTERACTIVE MICROWORLDS

Microworlds have been used throughout military history to represent terrain, force disposition and movements. Typically such models were built as miniatures on the ground (see figure 1) or on a board grid, but these days they are more usually computer-based.

Figure 1. Example of a Microworld.

Creating a microworld for strategic decisions is primarily a cognitive engineering task. A microworld (e.g. a map with overlaid information) is a form of symbolic language that should represent the necessary objects and dynamics, but not fall into the trap of becoming as complex as the environment it seeks to express (Friman and Brehmer 1999). It should also avoid the potential pitfalls of interaction, including the human desire for "more", when more is not necessarily better. Omodei et al. (2004) have shown that humans under certain conditions display weakness in self-regulation of their own cognitive load. Believing, for example, that "more detailed information is better", individuals can become information overloaded, fail to realise that they are overloaded, and mission effectiveness suffers. Similarly, under the belief that "more reliable information is better", individuals who are informed or observe that some sources are only partially reliable can fail to pay sufficient attention to the reliable information from those sources, and mission effectiveness again suffers.

Thus we seek a formalism which will allow the embodiment of cognitively-austere, adaptable and agreed representations.

Microworld Formalism

We use the formalisation of multi-agent stochastic games (Fudenberg and Tirole, 1995) as our generalised scientific model for microworlds. A multi-agent stochastic game may be formalised as a tuple ⟨N, S, A, T, R⟩, where N is the number of agents, S is the set of states of the microworld, A = A1 × A2 × ... × AN is the set of concurrent actions, where Ai is the set of actions available to agent i, T is the stochastic state transition function for each action, T : S × A × S → [0, 1], and R = (R1, R2, ..., RN), where Ri : S × A → ℝ is the immediate reward for agent i. The objective is to find an action policy πi(S) for each agent i that maximises the sum of future discounted rewards Ri.
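To make the tuple concrete, the sketch below writes out a toy two-agent stochastic game in Python. It is a minimal illustration of the formalism only; the state names, actions, transition probabilities and rewards are invented for the example and do not correspond to any microworld described in this paper.

import random

# A minimal two-agent stochastic game <N, S, A, T, R>, written out explicitly.
# States, actions, probabilities and rewards are invented purely to illustrate
# the formalism; they do not model any real microworld.
N = 2
S = ["contested", "blue_holds", "red_holds"]
A = [["advance", "hold"], ["advance", "hold"]]   # A = A1 x A2

# T[s][(a1, a2)] -> dict of next-state probabilities (each sums to 1).
T = {
    "contested": {
        ("advance", "advance"): {"contested": 0.6, "blue_holds": 0.2, "red_holds": 0.2},
        ("advance", "hold"):    {"blue_holds": 0.7, "contested": 0.3},
        ("hold", "advance"):    {"red_holds": 0.7, "contested": 0.3},
        ("hold", "hold"):       {"contested": 1.0},
    },
    "blue_holds": {(a1, a2): {"blue_holds": 1.0} for a1 in A[0] for a2 in A[1]},
    "red_holds":  {(a1, a2): {"red_holds": 1.0} for a1 in A[0] for a2 in A[1]},
}

# R(i, s, joint): immediate reward for agent i (a zero-sum pair in this toy).
def R(i, s, joint):
    value = {"blue_holds": 1.0, "red_holds": -1.0, "contested": 0.0}[s]
    return value if i == 0 else -value

def step(s, joint):
    """Sample the next state from T and return it with both agents' rewards."""
    dist = T[s][joint]
    next_s = random.choices(list(dist), weights=dist.values())[0]
    return next_s, [R(i, s, joint) for i in range(N)]

if __name__ == "__main__":
    random.seed(0)
    state = "contested"
    for _ in range(5):
        joint = (random.choice(A[0]), random.choice(A[1]))
        state, rewards = step(state, joint)
        print(joint, "->", state, rewards)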

In multi-agent stochastic games each agent must choose actions in accordance with the actions of other agents. This makes the environment essentially non-stationary. In attempting to apply learning methods to complex games, researchers have developed a number of algorithms that combine game-theoretic concepts with reinforcement learning, such as the Nash-Q algorithm (Hu and Wellman 2003).

Abstractions, Homomorphism and Models

To make learning feasible we need to reduce the complexity of problems to manageable proportions. A challenge is to abstract the huge state space generated by multi-agent MDPs. To do this we plan to use any structure and constraints in the problem to decompose the MDP.
Models can be thought of as abstractions that generally reduce the complexity and scope of an environment, allowing us to focus on specific problems. A good, or homomorphic, model may be thought of as a many-to-one mapping that preserves operations of interest, as shown in figure 2. If the environment transitions from environmental state w1 to w2 under the environmental dynamics, then we have an accurate model if the model state m1 transitions to m2 under the model dynamics, where w1 maps onto m1, w2 maps onto m2, and the environmental dynamics map onto the model dynamics.

Figure 2. Homomorphic Models.
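The accuracy condition above can be expressed as a commutation test: mapping an environment state and then applying the model dynamics must agree with applying the environment dynamics and then mapping. The sketch below checks this for a pair of invented deterministic dynamics (a 100-state line abstracted into 10 bins); it is illustrative only and is not the construction used in our microworlds.

# Hypothetical check of the homomorphism condition for deterministic dynamics:
# for every state w and action a, f(env_step(w, a)) == model_step(f(w), a).
# The dynamics below are invented toy examples, not the paper's microworlds.

def env_step(w, a):
    # Environment states are integers 0..99; actions move by one bin width (10).
    return max(0, min(99, w + (10 if a == "right" else -10)))

def model_step(m, a):
    # Model states are coarse bins 0..9 over the same line.
    return max(0, min(9, m + (1 if a == "right" else -1)))

def f(w):
    # Many-to-one abstraction: ten environment states map onto each model state.
    return w // 10

def is_homomorphic(states, actions):
    """Return True if f commutes with the dynamics for every (state, action) pair."""
    return all(f(env_step(w, a)) == model_step(f(w), a)
               for w in states for a in actions)

if __name__ == "__main__":
    print(is_homomorphic(range(100), ["left", "right"]))   # True for this toy pair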
The purpose of a microworld is to represent a model of the actual military situation. The initial models described here are abstractions provided by commanders to reflect the real situation as accurately as possible.

Grid World Examples

Chess and Checkers Variant Microworlds

Our project initially focused on variants of abstract games in discrete time and space, such as Chess and Checkers. We deliberately introduced asymmetries into the games and conducted sensitivity analysis through Monte-Carlo simulation at ply depths of up to seven. The asymmetries were: materiel, where one side commences with a materiel deficit; planning, with differing ply search depths; tempo, in which one side was allowed to make a double move at some frequency; stochastic dynamics, where pieces are taken probabilistically; and finally hidden pieces, where one side possessed pieces that could be moved, though the opponent could only indirectly infer the piece location (Smet et al 2005).

The TDLeaf algorithm is suitable for learning value functions in these games (Baxter et al, 1998). TDLeaf takes the principal value of a minimax tree (at some depth) as the sample used to update a parameterised value function through on-line temporal differences. Independently discovered by Beal and Smith in 1997, TDLeaf has been applied to classical games such as Backgammon, Chess, Checkers and Othello with impressive results (Schaeffer, 2000).
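A compressed sketch of a TDLeaf(λ)-style update is shown below. It assumes a linear evaluation squashed through tanh and a caller that supplies the feature vector of the principal-variation leaf at each move; the minimax search and feature extraction themselves are not shown, and the function names are ours rather than those of the cited implementations.

import math

def evaluate(weights, features):
    return sum(w * x for w, x in zip(weights, features))

def tdleaf_update(weights, leaf_features, outcome, alpha=0.01, lam=0.7):
    """Return updated weights after one game of TDLeaf(lambda)-style learning.
    leaf_features[t] is the feature vector of the principal-variation leaf at
    move t; outcome is the final result from the learner's perspective."""
    # Squash evaluations into (-1, 1) so they are comparable with the outcome.
    values = [math.tanh(evaluate(weights, f)) for f in leaf_features]
    values.append(outcome)                      # terminal "evaluation"
    n = len(leaf_features)
    new_w = list(weights)
    for t in range(n):
        # Temporal differences from move t onward, weighted by lambda^(k-t).
        td_sum = sum((lam ** (k - t)) * (values[k + 1] - values[k])
                     for k in range(t, n))
        grad = 1.0 - values[t] ** 2             # d tanh(x)/dx = 1 - tanh(x)^2
        for j, x in enumerate(leaf_features[t]):
            new_w[j] += alpha * td_sum * grad * x
    return new_w

if __name__ == "__main__":
    # Two features (e.g. materiel and mobility balance), three positions, a win.
    w = [0.5, 0.1]
    leaves = [[1.0, 0.2], [2.0, -0.1], [3.0, 0.3]]
    print(tdleaf_update(w, leaves, outcome=1.0))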
For each asymmetric variant, we learned a simple evaluation function based on materiel and mobility balance. The combined cycle of evaluation-function learning and Monte Carlo simulation allowed us to draw general conclusions about the relative importance of materiel, planning, tempo, stochastic dynamics and partial observability of pieces (Smet et al 2005); these are described further in the results section below.

Our approach to playing alternate-move partially observable games combined aspects of information theory with value-function based reinforcement learning. During the game we stored not only a belief over the distribution of our opponent's pieces but also, conditioned on each such state, a distribution over the possible positions of our own pieces (a belief about our opponent's belief). This enabled the use of 'entropy balance', the balance between the relative uncertainties of the opposing states, as one of the basis functions in the evaluation function (Calbert and Kwok 2004). Simulation results are likewise given in the results section.
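The 'entropy balance' basis function can be illustrated in a few lines: given our belief over the opponent's hidden pieces and our estimate of the opponent's belief over ours, the feature is the difference of the two Shannon entropies. The distributions and variable names below are invented for the example.

import math

def entropy(dist):
    """Shannon entropy (in bits) of a discrete distribution given as a dict."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def entropy_balance(belief_over_opponent, opponent_belief_over_us):
    """Positive when we are less uncertain about them than they are about us."""
    return entropy(opponent_belief_over_us) - entropy(belief_over_opponent)

if __name__ == "__main__":
    # Our belief over the opponent's hidden piece is fairly sharp...
    ours = {"sq_a": 0.7, "sq_b": 0.2, "sq_c": 0.1}
    # ...while (we estimate) the opponent's belief about our piece is diffuse.
    theirs = {"sq_x": 0.34, "sq_y": 0.33, "sq_z": 0.33}
    print(round(entropy_balance(ours, theirs), 3))   # > 0: information advantage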

TD-Island Microworld

TD-Island stands for 'Temporal Difference Island', indicating the use of learning to control elements in an island-based game. Space and time are discrete in our model. Each of 12 discrete spatial states is either of type land or sea (figure 3).

Figure 3. TD-Island Microworld.

Elements in the game include jetfighters, army brigades and logistics supplies. Unlike Chess, two or more pieces may occupy the same spatial state. A further difference from Chess is that TD-Island has concurrency of action: all elements may move at each time step. This induces an action-space explosion, making centralized control difficult for games with a large number of elements. The TD-Island agent uses domain symmetries and heuristics to prune the action space, inducing a homomorphic model.

Some symmetries in the TD-Island game are simple to recognise. For example, the set of all possible concurrent actions in which jetfighters strike targets is the set of all possible permutations of jetfighters assigned to those targets. One can eliminate this complexity by considering only a single permutation. Considering this single permutation "breaks" the game symmetry, reducing the dimension of the concurrent action space and inducing a homomorphic model (Calbert 2004).
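A minimal sketch of this symmetry breaking is shown below: rather than enumerating every permutation of jetfighters over a set of targets, the agent keeps a single canonical assignment. The pairing rule (sorted order) is an arbitrary illustrative choice, and the reduction is only valid under the assumption that the jetfighters are interchangeable for the strike.

from itertools import permutations

def all_strike_assignments(jets, targets):
    """Every permutation of jets over targets: factorial growth in len(jets)."""
    return [tuple(zip(jets, p)) for p in permutations(targets)]

def canonical_strike_assignment(jets, targets):
    """Break the symmetry by keeping one representative assignment only.
    Here jets and targets are paired in sorted order (arbitrary but fixed);
    this is sound only if the jets are interchangeable."""
    return tuple(zip(sorted(jets), sorted(targets)))

if __name__ == "__main__":
    jets = ["jet1", "jet2", "jet3"]
    targets = ["depot", "airfield", "radar"]
    print(len(all_strike_assignments(jets, targets)))   # 6 joint actions
    print(canonical_strike_assignment(jets, targets))   # 1 representative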
The approach to state-space reduction in the TD-Island game is similar to that found in Chess-based reinforcement learning studies, such as the KnightCap game (Baxter et al 1998). A number of features or basis functions are chosen, and the importance of these basis functions forms the set of parameters to be learned. The history of Chess has developed insights into the appropriate choice of basis functions (Furnkranz and Kubat 2001).

Fortunately there are considerable insights into the abstractions important in warfare, generating what is termed 'operational art'. Such abstractions may include the level of simultaneity, protection of forces, force balance and maintenance of supply lines, amongst others (US Joint Chiefs of Staff 2001). Each of these factors may be encoded into a number of different basis functions, which form the heart of the value function approximation.
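The kind of linear value-function approximation implied here can be sketched as follows. The feature names mirror the operational-art factors listed above, but their numerical encodings and the example weights are invented for illustration; they are not the features used in TD-Island.

# Sketch of a linear value function over 'operational art' basis functions.
# phi maps a game state to a feature vector; the learned parameters are the
# weights theta. The feature definitions below are illustrative placeholders.

def phi(state):
    return [
        state["simultaneity"],        # fraction of force elements acting together
        state["force_balance"],       # own minus enemy combat power, normalised
        state["supply_line_secure"],  # 1.0 if supply lines are protected, else 0.0
        state["force_protection"],    # fraction of own forces under cover/escort
    ]

def value(theta, state):
    return sum(t * x for t, x in zip(theta, phi(state)))

if __name__ == "__main__":
    theta = [0.2, 0.5, 0.2, 0.1]                       # e.g. learned by TDLeaf
    s = {"simultaneity": 0.6, "force_balance": 0.1,
         "supply_line_secure": 1.0, "force_protection": 0.8}
    print(value(theta, s))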
Continuous Time-Space Example

Tempest Seer

This is a microworld for evaluating strategic decisions for air operations. Capabilities modelled include various fighters, air-to-air refuelling, surface-to-air missiles and ground-based radars, as well as logistical aspects; airports and storage depots are also included. The model covers a 1000 km square of continuous space, with time advancing in discrete units. The Tempest Seer microworld will enable users to look at the potential outcomes of strategic decisions. Such decisions involve the assignment of aircraft and weapons to missions, selection of targets, defence of assets and patrolling of the air space. While these are high-level decisions, modelling their repercussions accurately requires attention to detail. This microworld is therefore at a much lower level of abstraction than TD-Island.

Figure 4. Tempest Seer Microworld.

Given the complexity of this microworld, it is natural to attempt to systematically decompose its structure into a hierarchically organized set of sub-tasks. The hierarchy would be composed of higher-level strategic tasks such as attack and defend at the top, lower-level tactical tasks near the bottom, and primitive or atomic moves at the base. Such a hierarchy has been composed for the Tempest Seer game, as shown in figure 5: its nodes range from the root effects task, through attacking-force and defending-force tasks (targeting, strike missions, diversions, defence, combat air patrol and intercepts), down to primitive actions such as take-off, straight navigation, loiter, weapon release, tarmac wait, shoot and landing.

Figure 5. Hierarchical decomposition of the Tempest Seer game.

Given this decomposition, one can apply the MAXQ-Q algorithm to learn game strategies (Dietterich, 2000). One can think of the game being played through the execution of differing subroutines, these being nodes of the hierarchy, as differing states are encountered. Each node can be considered as a stochastic game focused on a particular aspect of play, such as targeting or, at an elemental level, the taking-off or landing of aircraft.
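To make the decomposition concrete, the sketch below records a fragment of such a task hierarchy as a plain data structure that a MAXQ-style learner could traverse. The node names are taken from the hierarchy figure, but the exact grouping of children under parents is our reading of that figure, and the traversal code is our own scaffolding rather than the Tempest Seer implementation.

# A fragment of the Tempest Seer task hierarchy as plain data. Leaves are
# primitive actions; internal nodes are sub-tasks a MAXQ-style learner would
# treat as separate learning problems. The grouping is an illustrative reading
# of the hierarchy figure, not the exact structure used in Tempest Seer.

HIERARCHY = {
    "Effects (root)": {
        "Attacking force": {
            "Target(t)": {
                "Strike mission(t)": ["Release weapon", "Out mission",
                                      "In mission", "Avoid threats"],
                "Diversionary": ["Loiter", "Straight nav(p)"],
            },
        },
        "Defending force": {
            "Defend(d)": {
                "CAP(d)": ["Loiter", "CAP intercept", "Tarmac wait"],
                "Intercept(j)": ["Take off", "Max intercept", "Shoot", "Land"],
            },
        },
    },
}

def subtasks(node, tree=HIERARCHY):
    """Return the children of a named sub-task, searching the hierarchy."""
    for name, children in tree.items():
        if name == node:
            return list(children) if isinstance(children, dict) else children
        if isinstance(children, dict):
            found = subtasks(node, children)
            if found is not None:
                return found
    return None

if __name__ == "__main__":
    print(subtasks("Defend(d)"))      # ['CAP(d)', 'Intercept(j)']
    print(subtasks("Intercept(j)"))   # primitive actions at the leaves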
Though the idea of learning with hierarchical decomposition is appealing, several conditions must hold to make the application practical. First, our human-designed hierarchy must be valid, in the sense that the nodes of the hierarchy capture the richness of strategic development used by human players. Second, and crucially, one must be able to abstract out irrelevant parts of the state representation at differing parts of the hierarchy. As an example, consider the "take-off" node. Provided all aircraft, including the enemy's, are sufficiently far from the departing aircraft, we can ignore their states. The states of other aircraft can then safely be abstracted out without altering the quality of strategic learning. Without such abstraction, hierarchical learning is in fact far more expensive than general "flat" learning (Dietterich, 2000). Finally, the current theory of hierarchical learning must be modified to include explicit adversarial reasoning, for example the randomisation of strategies to keep the opponent "second-guessing", so to speak.
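The state abstraction described for the "take-off" node can be sketched as a simple projection: aircraft beyond some range of the departing aircraft are dropped from the state passed to that sub-task. The range threshold and the state layout below are illustrative assumptions.

import math

def distance(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def takeoff_abstraction(state, safe_range_km=150.0):
    """Project the full state onto the variables relevant to the take-off
    sub-task: keep only aircraft within safe_range_km of the departing one."""
    own = state["departing_aircraft"]
    nearby = [ac for ac in state["other_aircraft"]
              if distance(ac["pos"], own["pos"]) <= safe_range_km]
    return {"departing_aircraft": own, "other_aircraft": nearby}

if __name__ == "__main__":
    full_state = {
        "departing_aircraft": {"id": "blue1", "pos": (0.0, 0.0)},
        "other_aircraft": [
            {"id": "red1", "pos": (600.0, 400.0)},   # far away: abstracted out
            {"id": "blue2", "pos": (30.0, 10.0)},    # nearby: kept
        ],
    }
    print(takeoff_abstraction(full_state))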
We have completed research into incorporating adversarial reasoning into hierarchical learning through a vignette in which taxis compete for passengers and are able to "steal" passengers. The hierarchical decomposition for this game is shown in figure 6: a root task decomposes into Get and Deliver sub-tasks, which in turn decompose into PickUp, Steal, Navigate(t), Evade and PutDown, with the primitive actions North, South, East, West and Wait at the base.

Figure 6. Hierarchical decomposition for the competitive taxi problem.

In an approach to combining hierarchy with game theory, one of the authors has combined MAXQ learning with the "win-or-learn-fast" (WoLF) algorithm (Bowling and Veloso, 2002), which adjusts the probability of executing a particular strategy according to whether you are doing well or losing in the strategic play.
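The WoLF idea can be sketched in a few lines: the mixed strategy is nudged towards the greedy action with a small step when the agent is "winning" (its current strategy does better against the learned Q-values than its long-run average strategy) and with a larger step when it is "losing". This is a compressed illustration of the principle in Bowling and Veloso (2002), not their full WoLF-PHC algorithm.

def wolf_step(policy, avg_policy, q_values, delta_win=0.01, delta_lose=0.04):
    """One WoLF-style policy update for a single state.
    policy / avg_policy: dicts action -> probability; q_values: action -> Q."""
    expected = sum(policy[a] * q_values[a] for a in policy)
    expected_avg = sum(avg_policy[a] * q_values[a] for a in policy)
    # Small step when winning, larger step ("learn fast") when losing.
    delta = delta_win if expected > expected_avg else delta_lose
    best = max(q_values, key=q_values.get)
    new_policy = {}
    for a in policy:
        step = delta if a == best else -delta / (len(policy) - 1)
        new_policy[a] = min(1.0, max(0.0, policy[a] + step))
    total = sum(new_policy.values())                               # renormalise
    return {a: p / total for a, p in new_policy.items()}

if __name__ == "__main__":
    pi = {"attack": 0.5, "defend": 0.5}
    avg_pi = {"attack": 0.5, "defend": 0.5}
    q = {"attack": 1.0, "defend": 0.2}
    print(wolf_step(pi, avg_pi, q))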
SUMMARY ISSUES

We have found the use of action and state symmetry, extended actions, time-sharing, deictic representations, and hierarchical task decompositions useful for improving the efficiency of learning in our agents. Action symmetries arise, for example, when actions have identical effects in the value or Q-function (Ravindran 2004, Calbert 2004). Another form of action abstraction is to define actions extended in time; these have been variously referred to as options (Sutton et al 1999), sub-tasks, macro-operators, or extended actions. In multi-agent games the action space A = A1 × A2 × ... × AN describes concurrent actions, where Ai is the set of actions available to agent i, and the number of concurrent actions is |A| = ∏i |Ai|. We can reduce the problem by constraining the agents so that only one agent can act at each time-step. State symmetry arises when behaviour over state sub-spaces is repetitive throughout the state space (Hengst 2002), or when there is some geometric symmetry such that solving the problem in one domain translates to others. Deictic representations consider only the state regions important to the agent, for example the agent's direct field of view. Multi-level task decompositions use a task hierarchy to decompose the problem; whether the task hierarchy is user-specified, as in MAXQ (Dietterich 2000), or machine-discovered, as in HEXQ (Hengst 2002), such decompositions have the potential to simultaneously abstract both actions and states.
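The size of the reduction from time-sharing the agents is easy to see with a short count: concurrent play multiplies the agents' action sets, whereas allowing only one agent to act per time-step merely adds them. The numbers below are arbitrary.

from math import prod   # Python 3.8+

def joint_action_count(action_set_sizes):
    """|A| = product of |Ai| when all agents act concurrently."""
    return prod(action_set_sizes)

def time_shared_action_count(action_set_sizes):
    """Only one agent acts per time-step: choose the agent, then its action."""
    return sum(action_set_sizes)

if __name__ == "__main__":
    sizes = [6, 6, 6, 6, 6]                  # five agents, six actions each
    print(joint_action_count(sizes))         # 7776 concurrent joint actions
    print(time_shared_action_count(sizes))   # 30 single-agent actions per step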
Part of our approach to achieving human agreement with, and trust in, the representation has been a move from simple variants of Chess and Checkers to more general microworlds like TD-Island and Tempest Seer. We advocate an open-interface, component-based approach to software development, allowing the representations to be dynamically adaptable as conditions change.

RESULTS OF A MICROWORLD GAME

We illustrate learning performance for a Checkers-variant microworld. In this variant, each side is given one hidden piece. If this piece reaches the opposition baseline, it becomes a hidden king. After some experience in playing this game, we estimated the relative importance of standard pieces, hidden unkinged pieces and hidden kinged pieces. We then machine-learned our parameter values using the belief-based TDLeaf algorithm described earlier.

Figure 7. Results for Checkers with hidden pieces.

Figure 7 details the probability of wins, losses and draws for three competitions, each of 5000 games. In the first competition, an agent with machine-learned evaluation function weights played against an agent with an evaluation function in which all weights were set as uniformly important. In the second competition, an agent with machine-learned evaluation function weights played against an agent with a human-estimated evaluation function. Finally, an agent using human-estimated weights played against an agent using uniform values. The agent with machine-estimated evaluation function weights wins most games. This result points towards our aim of constructing adequate machine-learned representations for new, unique strategic situations.
FUTURE WORK AND CONCLUSIONS

We will continue to explore the theme of microworld and agent design, fusing our experience in games, partially observable planning and extended actions. Though we have identified good models as homomorphic, there is little work describing realistic approaches to constructing these homomorphisms for microworlds or adversarial domains.

We are also interested in capturing human intent and rewards as perceived by military players in microworld games. This may be viewed as an inverse MDP, in which one determines the rewards and value function given a policy, as opposed to determining a policy from predetermined rewards (Ng and Russell 2000). We believe that this augmentation of human cognitive ability will improve the ability of military commanders to achieve their intent.

ACKNOWLEDGEMENTS

National ICT Australia is funded through the Australian Government's Backing Australia's Ability initiative, in part through the Australian Research Council. We wish to acknowledge the inspiration of Dr Jan Kuylenstierna and his team from the Swedish National Defence College.
REFERENCES

Alberts, D., Gartska, J. and Hayes, R. (2001) Understanding Information Age Warfare. CCRP Publ., Washington, DC.

Barto, A.G. and Mahadevan, S. (2003) Recent Advances in Hierarchical Reinforcement Learning. Discrete Event Dynamic Systems, 13: 41-77. Special Issue on Reinforcement Learning.

Baxter, J., Tridgell, A. and Weaver, L. (1998) TDLeaf(λ): Combining Temporal Difference Learning with Game Tree Search. Proceedings of the Ninth Australian Conference on Neural Networks: 168-172.

Beal, D.F. and Smith, M.C. (1997) Learning Piece Values Using Temporal Differences. ICCA Journal, 20(3): 147-151.

Bowling, M. and Veloso, M. (2002) Multiagent learning using a variable learning rate. Artificial Intelligence, Vol. 136, pp. 215-250.

Calbert, G. (2004) "Exploiting Action-Space Symmetry for Reinforcement Learning in a Concurrent Dynamic Game", Proceedings of the International Conference on Optimisation Techniques and Applications, 2004.

Calbert, G. and Kwok, H-W. (2004) Combining Entropy-Based Heuristics, Minimax Search and Temporal Differences to Play Hidden-State Games. Proceedings of the Florida Artificial Intelligence Symposium, 2004. AAAI Press.

Dietterich, T.G. (2000) "Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition", Journal of Artificial Intelligence Research, Vol. 13, pp. 227-303.

Friman, H. and Brehmer, B. (1999) "Using Microworlds To Study Intuitive Battle Dynamics: A Concept For The Future", ICCRTS, Rhode Island, June 29 - July 1.

Fudenberg, D. and Tirole, J. (1995) Game Theory. MIT Press.

Furnkranz, J. and Kubat, M. (eds) (2001) Machines that Learn to Play Games. Advances in Computation: Theory and Practice, Vol. 8. Nova Science Publishers, Inc.

Hengst, B. (2002) "Discovering Hierarchy in Reinforcement Learning with HEXQ", Proceedings of the Nineteenth International Conference on Machine Learning, pp. 243-250. Morgan Kaufmann.

Horne, G. and Johnson, S. (Eds.) (2003) Maneuver Warfare Science. USMC Project Albert Publ.

Hu, J. and Wellman, M.P. (2003) Nash Q-Learning for General-Sum Stochastic Games. Journal of Machine Learning Research, Vol. 4, pp. 1039-1069.

Kaelbling, L.P., Littman, M.L. and Moore, A.W. (1996) Reinforcement Learning: A Survey. Journal of Artificial Intelligence Research, Vol. 4, pp. 237-285.

Kuylenstierna, J., Rydmark, J. and Fahrus, T. (2000) "The Value of Information in War: Some Experimental Findings", Proc. of the 5th International Command and Control Research and Technology Symposium (ICCRTS), Canberra, Australia.

Lambert, D. and Scholz, J.B. (2005) "A Dialectic for Network Centric Warfare", Proceedings of the 10th International Command and Control Research and Technology Symposium (ICCRTS), McLean, VA, June 13-16. http://www.dodccrp.org/events/2005/10th/CD/papers/016.pdf

Matthews, K. and Davies, M. (2001) Simulations for Joint Military Operations Planning. Proceedings of the SIMTEC Modelling and Simulation Conference.

Miller, C.S. (2005) "A New Perspective for the Military: Looking at Maps Within Centralized Command and Control Systems". Air and Space Power Chronicles, 30 March. http://www.airpower.maxwell.af.mil/airchronicles/cc/miller.html

Ng, A.Y. and Russell, S. (2000) "Algorithms for inverse reinforcement learning", Proceedings of the Seventeenth International Conference on Machine Learning.

Omodei, M.M., Wearing, A.J., McLennan, J., Elliott, G.C. and Clancy, J.M. (2004) More is Better? Problems of self-regulation in naturalistic decision making settings. In Montgomery, H., Lipshitz, R. and Brehmer, B. (Eds.), Proceedings of the 5th Naturalistic Decision Making Conference.

Ravindran, B. (2004) "An Algebraic Approach to Abstraction in Reinforcement Learning". Doctoral Dissertation, Department of Computer Science, University of Massachusetts, Amherst, MA.

Schaeffer, J. (2000) "The games computers (and people) play". In Zelkowitz, M. (Ed.), Advances in Computers, Vol. 50, Academic Press, pp. 189-266.

Smet, P. et al. (2005) "The effects of materiel, tempo and search depth on win-loss ratios in Chess", Proceedings of the Australian Artificial Intelligence Conference.

Sutton, R., Precup, D. and Singh, S. (1999) Between MDPs and semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning. Artificial Intelligence, Vol. 112, pp. 181-211.

US Joint Chiefs of Staff (2001) Joint Doctrine Keystone and Capstone Primer. Appendix A, 10 Sept. http://www.dtic.mil/doctrine/jel/new_pubs/primer.pdf

Contenu connexe

Similaire à J. Scholz 1 , B. Hengst 2 , G. Calbert 1 , A. Antoniades 2 ...

Machine Learning for Adversarial Agent Microworlds
Machine Learning for Adversarial Agent MicroworldsMachine Learning for Adversarial Agent Microworlds
Machine Learning for Adversarial Agent Microworldsbutest
 
Machine Learning for Adversarial Agent Microworlds
Machine Learning for Adversarial Agent MicroworldsMachine Learning for Adversarial Agent Microworlds
Machine Learning for Adversarial Agent Microworldsbutest
 
Running head Multi-actor modelling system 1Multi-actor mod.docx
Running head Multi-actor modelling system  1Multi-actor mod.docxRunning head Multi-actor modelling system  1Multi-actor mod.docx
Running head Multi-actor modelling system 1Multi-actor mod.docxtodd581
 
Running head Multi-actor modelling system 1Multi-actor mod.docx
Running head Multi-actor modelling system  1Multi-actor mod.docxRunning head Multi-actor modelling system  1Multi-actor mod.docx
Running head Multi-actor modelling system 1Multi-actor mod.docxglendar3
 
Simulation of Imperative Animatronics for Mobile Video Games
Simulation of Imperative Animatronics for Mobile Video GamesSimulation of Imperative Animatronics for Mobile Video Games
Simulation of Imperative Animatronics for Mobile Video GamesDR.P.S.JAGADEESH KUMAR
 
Temporal Reasoning Graph for Activity Recognition
Temporal Reasoning Graph for Activity RecognitionTemporal Reasoning Graph for Activity Recognition
Temporal Reasoning Graph for Activity RecognitionIRJET Journal
 
Beyond believable agents - employing AI for improving game like simulations f...
Beyond believable agents - employing AI for improving game like simulations f...Beyond believable agents - employing AI for improving game like simulations f...
Beyond believable agents - employing AI for improving game like simulations f...Mirjam Eladhari
 
Dynamic assignment of geospatial-temporal macro tasks to agents under human s...
Dynamic assignment of geospatial-temporal macro tasks to agents under human s...Dynamic assignment of geospatial-temporal macro tasks to agents under human s...
Dynamic assignment of geospatial-temporal macro tasks to agents under human s...Reza Nourjou, Ph.D.
 
Army Study: Ontology-based Adaptive Systems of Cyber Defense
Army Study: Ontology-based Adaptive Systems of Cyber DefenseArmy Study: Ontology-based Adaptive Systems of Cyber Defense
Army Study: Ontology-based Adaptive Systems of Cyber DefenseRDECOM
 
A Paper Presentation On Artificial Intelligence And Global Risk Paper Present...
A Paper Presentation On Artificial Intelligence And Global Risk Paper Present...A Paper Presentation On Artificial Intelligence And Global Risk Paper Present...
A Paper Presentation On Artificial Intelligence And Global Risk Paper Present...guestac67362
 
Artificial Intelligence Algorithms
Artificial Intelligence AlgorithmsArtificial Intelligence Algorithms
Artificial Intelligence AlgorithmsIOSR Journals
 
An exhaustive survey of reinforcement learning with hierarchical structure
An exhaustive survey of reinforcement learning with hierarchical structureAn exhaustive survey of reinforcement learning with hierarchical structure
An exhaustive survey of reinforcement learning with hierarchical structureeSAT Journals
 
Self Organization Simulation Over Gis Based On Multi Agent Platform
Self Organization Simulation Over Gis Based On Multi Agent PlatformSelf Organization Simulation Over Gis Based On Multi Agent Platform
Self Organization Simulation Over Gis Based On Multi Agent Platformanas_elf
 
On the Dynamics of Machine Learning Algorithms and Behavioral Game Theory
On the Dynamics of Machine Learning Algorithms and Behavioral Game TheoryOn the Dynamics of Machine Learning Algorithms and Behavioral Game Theory
On the Dynamics of Machine Learning Algorithms and Behavioral Game TheoryRikiya Takahashi
 
Learning Structure, Reusability And Real Time Modeling In Teams Of Autonomous...
Learning Structure, Reusability And Real Time Modeling In Teams Of Autonomous...Learning Structure, Reusability And Real Time Modeling In Teams Of Autonomous...
Learning Structure, Reusability And Real Time Modeling In Teams Of Autonomous...ahmad bassiouny
 
A Research Platform for Coevolving Agents.doc
A Research Platform for Coevolving Agents.docA Research Platform for Coevolving Agents.doc
A Research Platform for Coevolving Agents.docbutest
 

Similaire à J. Scholz 1 , B. Hengst 2 , G. Calbert 1 , A. Antoniades 2 ... (20)

Machine Learning for Adversarial Agent Microworlds
Machine Learning for Adversarial Agent MicroworldsMachine Learning for Adversarial Agent Microworlds
Machine Learning for Adversarial Agent Microworlds
 
Machine Learning for Adversarial Agent Microworlds
Machine Learning for Adversarial Agent MicroworldsMachine Learning for Adversarial Agent Microworlds
Machine Learning for Adversarial Agent Microworlds
 
Running head Multi-actor modelling system 1Multi-actor mod.docx
Running head Multi-actor modelling system  1Multi-actor mod.docxRunning head Multi-actor modelling system  1Multi-actor mod.docx
Running head Multi-actor modelling system 1Multi-actor mod.docx
 
Running head Multi-actor modelling system 1Multi-actor mod.docx
Running head Multi-actor modelling system  1Multi-actor mod.docxRunning head Multi-actor modelling system  1Multi-actor mod.docx
Running head Multi-actor modelling system 1Multi-actor mod.docx
 
Social LSTMの紹介
Social LSTMの紹介Social LSTMの紹介
Social LSTMの紹介
 
Simulation of Imperative Animatronics for Mobile Video Games
Simulation of Imperative Animatronics for Mobile Video GamesSimulation of Imperative Animatronics for Mobile Video Games
Simulation of Imperative Animatronics for Mobile Video Games
 
Temporal Reasoning Graph for Activity Recognition
Temporal Reasoning Graph for Activity RecognitionTemporal Reasoning Graph for Activity Recognition
Temporal Reasoning Graph for Activity Recognition
 
Beyond believable agents - employing AI for improving game like simulations f...
Beyond believable agents - employing AI for improving game like simulations f...Beyond believable agents - employing AI for improving game like simulations f...
Beyond believable agents - employing AI for improving game like simulations f...
 
Dynamic assignment of geospatial-temporal macro tasks to agents under human s...
Dynamic assignment of geospatial-temporal macro tasks to agents under human s...Dynamic assignment of geospatial-temporal macro tasks to agents under human s...
Dynamic assignment of geospatial-temporal macro tasks to agents under human s...
 
Lec0
Lec0Lec0
Lec0
 
Army Study: Ontology-based Adaptive Systems of Cyber Defense
Army Study: Ontology-based Adaptive Systems of Cyber DefenseArmy Study: Ontology-based Adaptive Systems of Cyber Defense
Army Study: Ontology-based Adaptive Systems of Cyber Defense
 
A Paper Presentation On Artificial Intelligence And Global Risk Paper Present...
A Paper Presentation On Artificial Intelligence And Global Risk Paper Present...A Paper Presentation On Artificial Intelligence And Global Risk Paper Present...
A Paper Presentation On Artificial Intelligence And Global Risk Paper Present...
 
Artificial Intelligence Algorithms
Artificial Intelligence AlgorithmsArtificial Intelligence Algorithms
Artificial Intelligence Algorithms
 
An exhaustive survey of reinforcement learning with hierarchical structure
An exhaustive survey of reinforcement learning with hierarchical structureAn exhaustive survey of reinforcement learning with hierarchical structure
An exhaustive survey of reinforcement learning with hierarchical structure
 
Self Organization Simulation Over Gis Based On Multi Agent Platform
Self Organization Simulation Over Gis Based On Multi Agent PlatformSelf Organization Simulation Over Gis Based On Multi Agent Platform
Self Organization Simulation Over Gis Based On Multi Agent Platform
 
Dc32644652
Dc32644652Dc32644652
Dc32644652
 
On the Dynamics of Machine Learning Algorithms and Behavioral Game Theory
On the Dynamics of Machine Learning Algorithms and Behavioral Game TheoryOn the Dynamics of Machine Learning Algorithms and Behavioral Game Theory
On the Dynamics of Machine Learning Algorithms and Behavioral Game Theory
 
Crowd Control MAGS v1.0
Crowd Control MAGS v1.0Crowd Control MAGS v1.0
Crowd Control MAGS v1.0
 
Learning Structure, Reusability And Real Time Modeling In Teams Of Autonomous...
Learning Structure, Reusability And Real Time Modeling In Teams Of Autonomous...Learning Structure, Reusability And Real Time Modeling In Teams Of Autonomous...
Learning Structure, Reusability And Real Time Modeling In Teams Of Autonomous...
 
A Research Platform for Coevolving Agents.doc
A Research Platform for Coevolving Agents.docA Research Platform for Coevolving Agents.doc
A Research Platform for Coevolving Agents.doc
 

Plus de butest

EL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEEL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEbutest
 
1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同butest
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALbutest
 
Timeline: The Life of Michael Jackson
Timeline: The Life of Michael JacksonTimeline: The Life of Michael Jackson
Timeline: The Life of Michael Jacksonbutest
 
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...butest
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALbutest
 
Com 380, Summer II
Com 380, Summer IICom 380, Summer II
Com 380, Summer IIbutest
 
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet JazzThe MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazzbutest
 
MICHAEL JACKSON.doc
MICHAEL JACKSON.docMICHAEL JACKSON.doc
MICHAEL JACKSON.docbutest
 
Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1butest
 
Facebook
Facebook Facebook
Facebook butest
 
Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...butest
 
Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...butest
 
NEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTNEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTbutest
 
C-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docC-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docbutest
 
MAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docMAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docbutest
 
Mac OS X Guide.doc
Mac OS X Guide.docMac OS X Guide.doc
Mac OS X Guide.docbutest
 
WEB DESIGN!
WEB DESIGN!WEB DESIGN!
WEB DESIGN!butest
 

Plus de butest (20)

EL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEEL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBE
 
1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIAL
 
Timeline: The Life of Michael Jackson
Timeline: The Life of Michael JacksonTimeline: The Life of Michael Jackson
Timeline: The Life of Michael Jackson
 
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIAL
 
Com 380, Summer II
Com 380, Summer IICom 380, Summer II
Com 380, Summer II
 
PPT
PPTPPT
PPT
 
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet JazzThe MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
 
MICHAEL JACKSON.doc
MICHAEL JACKSON.docMICHAEL JACKSON.doc
MICHAEL JACKSON.doc
 
Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1
 
Facebook
Facebook Facebook
Facebook
 
Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...
 
Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...
 
NEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTNEWS ANNOUNCEMENT
NEWS ANNOUNCEMENT
 
C-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docC-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.doc
 
MAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docMAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.doc
 
Mac OS X Guide.doc
Mac OS X Guide.docMac OS X Guide.doc
Mac OS X Guide.doc
 
hier
hierhier
hier
 
WEB DESIGN!
WEB DESIGN!WEB DESIGN!
WEB DESIGN!
 

J. Scholz 1 , B. Hengst 2 , G. Calbert 1 , A. Antoniades 2 ...

  • 1. Machine Learning for Adversarial Agent Microworlds J. Scholz1, B. Hengst2, G. Calbert1, A. Antoniades2, P. Smet1, L. Marsh1, H-W. Kwok1, D. Gossink1 1 DSTO Command and Control Division, 2NICTA, E-Mail: Jason.scholz@defence.gov.au Keywords: Wargames; Reinforcement Learning EXTENDED ABSTRACT automated decision making, when it is likely to be hazardous.” Cognisant of these issues, we state a We describe models and algorithms to support formalism for microworlds initially to manage interactive decision making for national and representation complexity for machine learning. military strategic courses of action. We describe This formalism is based on multi-agent stochastic our progress towards the development of games, abstractions and homomorphism. innovative reinforcement learning algorithms, scaling progressively to more complex and According to Kaelbling et al (1996), realistic simulations. “Reinforcement learning (RL) is the problem faced by an agent that must learn behaviour through trial Abstract representations or ‘Microworlds’ have and error interactions with a dynamic been used throughout military history to aid in the environment.” As RL is frustrated by the curse of conceptualization of terrain, force disposition and dimensionality, we have pursued methods to movements. With the introduction of digitized exploit temporal abstraction and hierarchical systems into military headquarters the capacity to organisation. As noted by Barto and Mahadevan degrade decision-making has become a concern 2003 (p. 23), “Although the proposed ideas for with these representations. Maps with overlays are hierarchical RL appear promising, to date there has a centerpiece of most military headquarters and been insufficient experience in experimentally may attain an authority which exceeds their testing the effectiveness of these ideas on large competency, as humans forget their limitations applications.” We describe the results of our foray both as a model of the physical environment, and into reinforcement learning approaches of as a means to convey command intent (Miller, progressively larger-scale applications, with ‘Grid 2005). The design of microworlds may help to world’ models including modified Chess and alleviate human-system integration problems, by Checkers, a military campaign-level game with providing detail where it is available and concurrent moves, and ‘fine-grained’ models with appropriate and recognizable abstractions where it a land and an air game. is not. Ultimately, (Lambert and Scholz 2005 p.29) view “the division of labor between people and We detail and compare results from microworld machines should be developed to leave human models for which both human-engineered and decision making unfettered by machine machine-learned agents were developed and interference when it is likely to prove heroic, and compared. We conclude with a summary of enhance human decision making with automated insights and describe future research directions. decision aids, or possibly override it with INTRODUCTION • accumulate learning over long time periods on various microworld models and use what is During Carnegie Mellon University's Robotics learned to cope with new situations, Institute 25th anniversary, Raj Reddy said, "the biggest barrier is (developing) computers that learn • decompose problems and formulate their own with experience and exhibit goal-directed representations, recognising relevant events behaviour. 
If you can't build a system that can among the huge amounts of data in their learn with experience, you might as well forget "experience", everything else". This involves developing agents characterized by the ability to: • exhibit adaptive, goal directed behaviour and prioritise multiple, conflicting and time • learn from their own experience and the varying goals, experience of human commanders,
  • 2. interact with humans and other agents using shown that humans under certain conditions language and context to decipher and respond display weakness in self-regulation of their own to complex actions, events and language. cognitive load. Believing for example, that “more detailed information is better”, individuals can Extant agent-based modelling and simulation become information overloaded, fail to realise that environments for military appreciation appear to they are overloaded and mission effectiveness occupy extreme ends of the modelling spectrum. suffers. Similarly, a belief that “more reliable At one end, Agent-Based Distillations (ABD) information is better”, individuals can be informed provide weakly expressive though highly or observe that some sources are only partially computable (many games per second execution) reliable, fail to pay sufficient attention to the models using attraction-repulsion rules to represent reliable information from those sources and intent in simple automata (Horne and Johnson mission effectiveness suffers. 2003). At the other end of the spectrum, large scale systems such as the Joint Synthetic Armed Forces Thus we seek a formalism which will allow the (JSAF) and Joint Tactical Level Simulation (JTLS) embodiment of cognitively-austere, adaptable and require days to months and many staff to set up for agreed representations. a single simulation run (Matthews and Davies 2001) and are of limited value to highly interactive Microworld Formalism decision-making. We propose a microworld-based representation which falls somewhere in between We use the formalisation of multi-agent stochastic these extremes, yet allows for decision interaction. games (Fundenberg and Tirole, 1995) as our generalised scientific model for microworlds. A INTERACTIVE MICROWORLDS multi-agent stochastic game may be formalised as a tuple N , S , A, T , R where N is the number of Microworlds have been used throughout military agents, S the states of the microworld, history to represent terrain, force disposition and movements. Typically such models were built as A = A1 × A2 × ... × AN concurrent actions, where miniatures on the ground (see figure 1) or on a Ai is the set of actions available to agent i , T is board grid, but these days are more usually the stochastic state transition function for each computer-based. action, [ ] T : S × A × S → 0,1 , and R = ( R1 , R2 ,..., RN ) , where Ri is the immediate reward for agent i, Ri : S × A → ℜ .The objective is to find an action policy π i ( S ) for each agent i to maximise the sum of future discounted rewards Ri . In Multi-agent stochastic games each agent must choose actions in accordance with the actions of other agents. This makes the environment essentially non-stationary. In attempting to apply learning methods to complex games, researchers have developed a number of algorithms that combine game-theoretic concepts with reinforcement learning algorithms, such as the Figure 1. Example of a Microworld. Nash-Q algorithm (Hu and Wellman 2003). Creating a microworld for strategic decisions is Abstractions, Homomorphism and Models primarily a cognitive engineering task. A microworld (e.g. like a map with overlaid To make learning feasible we need to reduce the information) is a form of symbolic language that complexity of problems to manageable should represent the necessary objects and proportions. A challenge is to abstract the huge dynamics, but not fall into the trap of becoming as state-space generated by multi-agent MDPs. 
In multi-agent stochastic games each agent must choose actions in accordance with the actions of the other agents, which makes the environment essentially non-stationary. In attempting to apply learning methods to complex games, researchers have developed a number of algorithms that combine game-theoretic concepts with reinforcement learning, such as the Nash-Q algorithm (Hu and Wellman 2003).

Abstractions, Homomorphism and Models

To make learning feasible we need to reduce the complexity of problems to manageable proportions. A challenge is to abstract the huge state-space generated by multi-agent MDPs. To do this we plan to use any structure and constraints in the problem to decompose the MDP.

Models can be thought of as abstractions that generally reduce the complexity and scope of an environment to allow us to focus on specific problems. A good, or homomorphic, model may be thought of as a many-to-one mapping that preserves the operations of interest, as shown in figure 2. If the environment transitions from environmental state w1 to w2 under the environmental dynamics, then we have an accurate model if the model state m1 transitions to m2 under the model dynamics, where w1 maps onto m1, w2 maps onto m2, and the environmental dynamics map onto the model dynamics.

Figure 2. Homomorphic Models.
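A minimal sketch of this homomorphism condition, assuming deterministic dynamics for brevity (the stochastic case compares transition distributions instead), might be:

def is_homomorphic(env_states, actions, env_step, model_step, abstract):
    """abstract() maps environment states w onto model states m (many-to-one)."""
    for w in env_states:
        for a in actions:
            w2 = env_step(w, a)              # environment: w1 -> w2
            m2 = model_step(abstract(w), a)  # model:       m1 -> m2
            if abstract(w2) != m2:           # w2 must map onto m2
                return False
    return True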
The purpose of a microworld is to represent a model of the actual military situation. The initial models described here are abstractions provided by commanders to reflect the real situation as accurately as possible.

Grid World Examples

Chess and Checkers Variant Microworlds

Our project initially focused on variants of abstract games in discrete time and space, such as Chess and Checkers. We deliberately introduced asymmetries into the games and conducted sensitivity analysis through Monte-Carlo simulation at ply depths of up to seven. The asymmetries were: materiel, where one side commences with a materiel deficit; planning, with differing ply search depths; tempo, in which one side was allowed to make a double move at some frequency; stochastic dynamics, where pieces are taken probabilistically; and finally hidden pieces, where one side possessed pieces that could be moved, though the opponent could only indirectly infer the piece location (Smet et al 2005).

The TDLeaf algorithm is suitable for learning value functions in these games (Baxter et al, 1998). TDLeaf takes the principal value of a minimax tree (at some depth) as the sample used to update a parameterised value function through on-line temporal differences. Independently discovered by Beal and Smith in 1997, TDLeaf has been applied to classical games such as Backgammon, Chess, Checkers and Othello with impressive results (Schaeffer, 2000).
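As a hedged sketch of that update (assuming a linear evaluation function and pre-computed principal-leaf values; this is not the implementation used in our experiments), a TDLeaf(λ)-style weight change over a sequence of positions could be written as:

def tdleaf_update(weights, leaf_features, leaf_values, lam=0.7, alpha=0.01):
    """leaf_features[t]: feature vector of the principal leaf after move t;
    leaf_values[t]: minimax (principal) value of that leaf under current weights."""
    n = len(leaf_values)
    for t in range(n - 1):
        # weighted sum of future temporal differences between principal-leaf values
        deltas = [leaf_values[j + 1] - leaf_values[j] for j in range(t, n - 1)]
        error = sum((lam ** k) * d for k, d in enumerate(deltas))
        # for a linear evaluation the gradient is simply the feature vector
        for i, phi in enumerate(leaf_features[t]):
            weights[i] += alpha * phi * error
    return weights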
For each asymmetric variant, we learned a simple evaluation function based on materiel and mobility balance. The combined cycle of evaluation-function learning and Monte Carlo simulation allowed us to draw general conclusions about the relative importance of materiel, planning, tempo, stochastic dynamics and partial observability of pieces (Smet et al 2005), and will be described further in section 4.

Our approach to playing alternate-move, partially observable games combined aspects of information theory and value-function based reinforcement learning. During the game, we stored not only a belief over the distribution of our opponent's pieces but also, conditioned on each of these states, a distribution over the possible positions of our own pieces (a belief about our opponent's belief). This enabled the use of 'entropy balance', the balance between the relative uncertainties of the opposing states, as one of the basis functions in the evaluation function (Calbert and Kwok 2004). Simulation results from this are described in section 4.
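For illustration only (the belief representation is assumed, and this is not the project's code), such an entropy-balance feature might be computed along these lines:

import math

def entropy(belief):
    """belief: dict mapping possible hidden-piece configurations to probabilities."""
    return -sum(p * math.log(p) for p in belief.values() if p > 0)

def entropy_balance(our_belief_of_theirs, their_belief_of_ours):
    # positive when we are less uncertain about them than they (we estimate) are about us
    return entropy(their_belief_of_ours) - entropy(our_belief_of_theirs)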
TD-Island Microworld

TD-Island stands for 'Temporal Difference Island', indicating the use of learning to control elements in an island-based game. Space and time are discrete in our model. Each of 12 discrete spatial states is either of type land or sea (figure 3).

Figure 3. TD-Island Microworld.

Elements in the game include jetfighters, army brigades and logistics supplies. Unlike Chess, two or more pieces may occupy the same spatial state. Further from Chess, in TD-Island there is concurrency of action, in that all elements may move at each time step. This induces an action-space explosion, making centralized control difficult for games with a large number of elements. The TD-Island agent uses domain symmetries and heuristics to prune the action space, inducing the homomorphic model.

Some symmetries in the TD-Island game are simple to recognise. For example, consider the set of all possible concurrent actions of jetfighters striking targets: it is the set of all possible permutations of jetfighters assigned to those targets. One can eliminate this complexity by considering only a single permutation. In doing so, one "breaks" the game symmetry, reducing the dimension of the concurrent action space, which induces a homomorphic model (Calbert 2004).
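A toy sketch of this symmetry breaking, assuming the jetfighters are interchangeable (the names and representation are invented for the example):

from itertools import permutations

def canonical_strike_assignments(jetfighters, targets):
    """Collapse every permutation of fighters over targets to one representative."""
    representatives = set()
    for perm in permutations(targets, len(jetfighters)):
        representatives.add(tuple(sorted(perm)))  # same multiset of struck targets
    return representatives

# e.g. three interchangeable fighters over five targets: 60 permutations collapse to 10
# print(len(canonical_strike_assignments(['f1', 'f2', 'f3'], list(range(5)))))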
The approach to state-space reduction in the TD-Island game is similar to that found in Chess-based reinforcement learning studies, such as the KnightCap game (Baxter et al 1998). A number of features or basis functions are chosen, and the importance of these basis functions forms the set of parameters to be learned. The history of Chess has developed insights into the appropriate choice of basis functions (Furnkranz and Kubat 2001). Fortunately there are considerable insights into the abstractions important in warfare, generating what is termed 'operational art'. Such abstractions may include the level of simultaneity, protection of forces, force balance and maintenance of supply lines, amongst others (US Joint Chiefs of Staff 2001). Each of these factors may be encoded into a number of different basis functions, which form the heart of the value function approximation.
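Illustratively, such a value-function approximation is a weighted sum of basis functions evaluated on the state; the feature names below are invented for the example, and the weights are the parameters to be learned:

def evaluate(state, weights, basis_functions):
    """V(s) is approximated as the sum over k of w_k * phi_k(s)."""
    return sum(w * phi(state) for w, phi in zip(weights, basis_functions))

# basis_functions might include, for instance:
#   lambda s: s.force_balance(), lambda s: s.supply_line_integrity(),
#   lambda s: s.simultaneity(),  lambda s: s.force_protection()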
Continuous Time-Space Example

Tempest Seer

This is a microworld for evaluating strategic decisions for air operations. Capabilities modelled include various fighters, air-to-air refuelling, surface-to-air missiles and ground-based radars, as well as logistical aspects. Airports and storage depots are also included. The model covers a 1000 km square of continuous space, with time advancing in discrete units. The Tempest Seer microworld will enable users to look at the potential outcomes of strategic decisions. Such decisions involve the assignment of aircraft and weapons to missions, selection of targets, defence of assets and patrolling of the air space. While these are high-level decisions, modelling the repercussions of such decisions accurately requires attention to detail. This microworld is therefore at a much lower level of abstraction than TD-Island.

Figure 4. Tempest Seer Microworld.

Given the complexity of this microworld, it is natural to attempt to systematically decompose its structure into a hierarchically organized set of sub-tasks. The hierarchy would be composed of higher-level strategic tasks, such as attack and defend, at the top, lower-level tactical tasks further down, and primitive or atomic moves at the base. Such a hierarchy has been composed for the Tempest Seer game, as shown in figure 5: an 'Effects' root sits over attacking-force and defending-force branches; below these are mission-level tasks such as Target(t), Defend(d), Strike Mission(t), Diversionary CAP(d) and Intercept(j); and at the base are primitive actions such as take off, straight navigation, loiter, avoid threats, release weapon, shoot, tarmac wait and land.

Figure 5. Hierarchical decomposition of the Tempest Seer game.

Given this decomposition, one can apply the MAXQ-Q algorithm to learn game strategies (Dietterich, 2000). One can think of the game as being played through the execution of differing subroutines, these being nodes of the hierarchy, as differing states are encountered. Each node can be considered as a stochastic game focused on a particular aspect of play, such as targeting, or, at an elemental level, the taking-off or landing of aircraft.
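A minimal MAXQ-flavoured sketch (illustrative only, and much simpler than the Tempest Seer hierarchy) conveys the idea that the value of a sub-task is the value of the invoked child subroutine plus a learned completion term for finishing the parent afterwards:

class Subtask:
    def __init__(self, name, children=None, primitive_reward=None):
        self.name = name
        self.children = children or []   # e.g. Strike Mission -> Release weapon
        self.primitive_reward = primitive_reward
        self.completion = {}             # C(parent, state, child), learned from experience

    def value(self, state, policy):
        if not self.children:            # primitive action
            return self.primitive_reward(state)
        child = policy(self, state)      # which child subroutine to invoke here
        c = self.completion.get((state, child.name), 0.0)
        return child.value(state, policy) + c   # Q(i,s,a) = V(a,s) + C(i,s,a)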
Though the idea of learning with hierarchical decomposition is appealing, several conditions must hold to make the application practical. First, our human-designed hierarchy must be valid, in the sense that the nodes of the hierarchy capture the richness of strategic development used by human players. Second, and crucially, one must be able to abstract out irrelevant parts of the state representation at differing parts of the hierarchy. As an example, consider the "take-off" node. Provided all aircraft, including the enemy's, are sufficiently far from the departing aircraft, we can ignore their states. The states of other aircraft can effectively be abstracted out safely, without altering the quality of strategic learning. Without such abstraction, hierarchical learning is in fact far more expensive than general "flat" learning (Dietterich, 2000). Finally, the current theory of hierarchical learning must be modified to include explicit adversarial reasoning, for example the randomisation of strategies to keep the opponent "second-guessing", so to speak.

We have completed research into incorporating adversarial reasoning into hierarchical learning through a vignette in which taxis compete for passengers and are able to "steal" passengers. The hierarchical decomposition for this game is shown in figure 6: a Root task sits over Get and Deliver sub-tasks, with children PickUp, Steal, Navigate(t), Evade and PutDown, and primitive moves North, South, East, West and Wait.

Figure 6. Hierarchical decomposition for the competitive taxi problem.

In an approach to combining hierarchy with game theory, one of the authors has combined MAXQ learning with the "win or learn fast" (WoLF) algorithm (Bowling and Veloso, 2002), which adjusts the probability of executing a particular strategy according to whether one is doing well or losing in the strategic play.
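The WoLF principle can be sketched as a policy-hill-climbing style update with two learning rates (an illustrative simplification with invented names, assuming at least two available actions; not the exact algorithm used in the study):

def wolf_update(policy, avg_policy, q_values, delta_win=0.01, delta_lose=0.04):
    """policy, avg_policy: dicts action -> probability; q_values: dict action -> Q."""
    expected = sum(policy[a] * q_values[a] for a in policy)
    expected_avg = sum(avg_policy[a] * q_values[a] for a in avg_policy)
    delta = delta_win if expected > expected_avg else delta_lose  # learn fast when losing
    best = max(q_values, key=q_values.get)
    for a in policy:
        step = delta if a == best else -delta / (len(policy) - 1)
        policy[a] = min(1.0, max(0.0, policy[a] + step))
    total = sum(policy.values())
    return {a: p / total for a, p in policy.items()}              # renormalise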
SUMMARY ISSUES

We have found the use of action and state symmetry, extended actions, time-sharing, deictic representations and hierarchical task decompositions useful for improving the efficiency of learning in our agents. Action symmetries arise, for example, when actions have identical effects in the value or Q-function (Ravindran 2004, Calbert 2004). Another form of action abstraction is to define actions extended in time; these have been variously referred to as options (Sutton et al 1999), sub-tasks, macro-operators or extended actions. In multi-agent games the action space A = A1 × A2 × ... × AN describes concurrent actions, where Ai is the set of actions available to agent i, so the number of concurrent actions is the product |A| = |A1| × |A2| × ... × |AN|. We can reduce the problem by constraining the agents so that only one agent can act at each time step. State symmetry arises when behaviour over state sub-spaces is repetitive throughout the state space (Hengst 2002), or when there is some geometric symmetry that solves the problem in one domain and translates to others. Deictic representations consider only the state regions important to the agent, for example the agent's direct field of view. Multi-level task decompositions use a task hierarchy to decompose the problem. Whether the task hierarchy is user-specified, as in MAXQ (Dietterich 2000), or machine-discovered, as in HEXQ (Hengst 2002), such decompositions have the potential to simultaneously abstract both actions and states.
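As a small illustration of this reduction (hypothetical code, not drawn from the project), constraining play so that one agent moves per time step shrinks the joint action count from a product to a sum:

from math import prod

def concurrent_action_count(action_sets):
    return prod(len(a) for a in action_sets)   # |A| = |A1| x |A2| x ... x |AN|

def interleaved_action_count(action_sets):
    return sum(len(a) for a in action_sets)    # one agent acts, the rest wait

# e.g. ten agents with eight actions each: 8**10 joint actions versus 80 interleaved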
Part of our approach to achieving human agreement and trust in the representation has been a move from simple variants of Chess and Checkers to more general microworlds like TD-Island and Tempest Seer. We advocate an open-interface, component-based approach to software development, allowing the representations to be dynamically adaptable as conditions change.

RESULTS OF A MICROWORLD GAME

We illustrate learning performance for a Checkers-variant microworld. In this variant, each side is given one hidden piece. If this piece reaches the opposition baseline, it becomes a hidden king. After some experience in playing this game, we estimated the relative importance of standard pieces, hidden unkinged pieces and hidden kinged pieces. We then machine-learned our parameter values using the belief-based TDLeaf algorithm described earlier.

Figure 7. Results for Checkers with hidden pieces.

Figure 7 details the probability of wins, losses and draws for three competitions, each of 5000 games. In the first competition, an agent with machine-learned evaluation function weights played against an agent with an evaluation function in which all weights were set as uniformly important. In the second competition, an agent with machine-learned evaluation function weights played against an agent with a human-estimated evaluation function. Finally, an agent using human-estimated weights played against an agent using uniform values. The agent with machine-estimated evaluation function weights wins most games. This result points towards our aim of constructing adequate machine-learned representations for new, unique strategic situations.

FUTURE WORK AND CONCLUSIONS

We will continue to explore the theme of microworld and agent design, and fuse our experience in games, partially observable planning and extended actions. Though we have identified good models as homomorphic, there is little work describing realistic approaches to constructing these homomorphisms for microworlds or adversarial domains.

We are also interested in capturing human intent and rewards as perceived by military players in microworld games. This may be viewed as an inverse MDP, where one determines the rewards and value function given a policy, as opposed to determining a policy through predetermined rewards (Ng and Russell 2000). We believe that augmentation of human cognitive ability will improve the ability of military commanders to achieve their intent.

ACKNOWLEDGEMENTS

National ICT Australia is funded through the Australian Government's Backing Australia's Ability initiative, in part through the Australian Research Council. We wish to acknowledge the inspiration of Dr Jan Kuylenstierna and his team from the Swedish National Defence College.

REFERENCES

Alberts, D., Garstka, J. and Hayes, R. (2001) Understanding Information Age Warfare. CCRP Publications, Washington, DC.

Barto, A.G. and Mahadevan, S. (2003) Recent Advances in Hierarchical Reinforcement Learning. Discrete Event Dynamic Systems, 13: 41-77. Special issue on reinforcement learning.

Baxter, J., Tridgell, A. and Weaver, L. (1998) TDLeaf(λ): Combining Temporal Difference Learning with Game-Tree Search. Proceedings of the Ninth Australian Conference on Neural Networks: 168-172.

Beal, D.F. and Smith, M.C. (1997) Learning Piece Values Using Temporal Differences. ICCA Journal, 20(3): 147-151.

Bowling, M. and Veloso, M. (2002) Multiagent Learning Using a Variable Learning Rate. Artificial Intelligence, 136: 215-250.

Calbert, G. (2004) Exploiting Action-Space Symmetry for Reinforcement Learning in a Concurrent Dynamic Game. Proceedings of the International Conference on Optimisation Techniques and Applications.

Calbert, G. and Kwok, H-W. (2004) Combining Entropy-Based Heuristics, Minimax Search and Temporal Differences to Play Hidden-State Games. Proceedings of the Florida Artificial Intelligence Research Society Conference, AAAI Press.

Dietterich, T.G. (2000) Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition. Journal of Artificial Intelligence Research, 13: 227-303.

Friman, H. and Brehmer, B. (1999) Using Microworlds to Study Intuitive Battle Dynamics: A Concept for the Future. Proceedings of the International Command and Control Research and Technology Symposium (ICCRTS), Rhode Island, June 29 - July 1.

Fudenberg, D. and Tirole, J. (1995) Game Theory. MIT Press.

Furnkranz, J. and Kubat, M. (eds) (2001) Machines that Learn to Play Games. Advances in Computation: Theory and Practice, Vol. 8, Nova Science Publishers.

Hengst, B. (2002) Discovering Hierarchy in Reinforcement Learning with HEXQ. Proceedings of the Nineteenth International Conference on Machine Learning, pp. 243-250, Morgan Kaufmann.

Horne, G. and Johnson, S. (eds) (2003) Maneuver Warfare Science. USMC Project Albert Publications.

Hu, J. and Wellman, M.P. (2003) Nash Q-Learning for General-Sum Stochastic Games. Journal of Machine Learning Research, 4: 1039-1069.

Kaelbling, L.P., Littman, M.L. and Moore, A.W. (1996) Reinforcement Learning: A Survey. Journal of Artificial Intelligence Research, 4: 237-285.

Kuylenstierna, J., Rydmark, J. and Fahrus, T. (2000) The Value of Information in War: Some Experimental Findings. Proceedings of the 5th International Command and Control Research and Technology Symposium (ICCRTS), Canberra, Australia.

Lambert, D. and Scholz, J.B. (2005) A Dialectic for Network Centric Warfare. Proceedings of the 10th International Command and Control Research and Technology Symposium (ICCRTS), McLean, VA, June 13-16. http://www.dodccrp.org/events/2005/10th/CD/papers/016.pdf

Matthews, K. and Davies, M. (2001) Simulations for Joint Military Operations Planning. Proceedings of the SIMTEC Modelling and Simulation Conference.

Miller, C.S. (2005) A New Perspective for the Military: Looking at Maps Within Centralized Command and Control Systems. Air and Space Power Chronicles, 30 March. http://www.airpower.maxwell.af.mil/airchronicles/cc/miller.html

Ng, A.Y. and Russell, S. (2000) Algorithms for Inverse Reinforcement Learning. Proceedings of the Seventeenth International Conference on Machine Learning.

Omodei, M.M., Wearing, A.J., McLennan, J., Elliott, G.C. and Clancy, J.M. (2004) More is Better? Problems of Self-Regulation in Naturalistic Decision Making Settings. In Montgomery, H., Lipshitz, R. and Brehmer, B. (eds), Proceedings of the 5th Naturalistic Decision Making Conference.

Ravindran, B. (2004) An Algebraic Approach to Abstraction in Reinforcement Learning. Doctoral dissertation, Department of Computer Science, University of Massachusetts, Amherst, MA.

Schaeffer, J. (2000) The Games Computers (and People) Play. In Zelkowitz, M. (ed.), Advances in Computers, 50, Academic Press, pp. 189-266.

Smet, P. et al. (2005) The Effects of Materiel, Tempo and Search Depth on Win-Loss Ratios in Chess. Proceedings of the Australian Artificial Intelligence Conference.

Sutton, R., Precup, D. and Singh, S. (1999) Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning. Artificial Intelligence, 112: 181-211.

US Joint Chiefs of Staff (2001) Joint Doctrine Keystone and Capstone Primer, Appendix A, 10 September. http://www.dtic.mil/doctrine/jel/new_pubs/primer.pdf