Hybridization of Reinforcement Learning Agents
                                Héctor J. Fraire Huacuja
                                Juan J. González Barbosa
                                 Jesús V. Flores Morfin

                       Department of Systems and Computation
                        Technological Institute of Madero City
            Avenue 1º de Mayo y Sor Juana Inés de la Cruz S/N Colony Los Mangos
                                 (12) 10-04-15 Switchboard
                              (12) 10-29-02 Computer Center
                         Technological Institute of the Lagoon

Summary

The purpose of this work was to show that classic reinforcement
learning agents [1] can achieve significant improvements in
performance by means of hybridization techniques. The methodology
consists of defining a mechanism to compare the learning capacity of
the agents, testing the performance of the classic agents under
similar conditions of interaction with their environment and, finally,
modifying the worst-performing agent in order to increase its
performance. After analyzing the classic reinforcement learning agents
in a simulated maze environment, the comparative tests identified the
SARSA agent as having the worst performance and the SARSA(λ) agent as
having the best. The most prominent result is that applying
hybridization techniques to the agents considered allows the
construction of hybrid agents that present notable improvements in
performance.

The origin of reinforcement learning dates back to the beginnings of
cybernetics and involves elements of statistics, psychology,
neuroscience and computer science [2]. In the last ten years, interest
in these techniques within the fields of artificial intelligence and
machine learning has increased greatly [8]. Reinforcement learning is
a form of agent programming based on rewards, whose purpose is to make
the agent perform a task without a precise specification of how the
task is to be achieved. This approach modifies the prevailing
programming paradigms and opens up a wide field of opportunities.

Reinforcement learning is a technique that allows the construction of
intelligent agents with learning and adaptation capabilities. An agent
of this class learns from its interaction with the environment in
which it operates and adapts to changes that appear in its
surroundings. The agent's perception of the environment, obtained
through its sensors, becomes a state vector. Each possible state is
associated with an elementary action, which the agent is able to carry
out through its actuators. The state-action relation is the core of
this approach, since updating this function allows the agent to keep a
record of the experience acquired. Using this information, the agent
determines at each step the action that has contributed the most to
achieving the global objectives. The action is selected this way most
of the time. In a small proportion of steps, the action is selected at
random from among all the possible actions. This mechanism, called
exploration, allows the agent to try actions that were found to be of
little effectiveness in the past, and so to adapt its performance to
changes in the environment. Once the selected action is carried out, a
reward or a punishment for the action is generated. The accumulated
reward is stored in the state-action relation.

This type of agent has been selected to evaluate its potential
application in the construction of controllers for autonomous mobile
robots, for two reasons. First, the learning takes place online. This
condition is fundamental for robot applications in which animal
behavior is emulated. The other reason is that reinforcement learning
is extremely economical in terms of the computing and storage
resources required for its implementation.

Reinforcement Learning

Learning has been the key factor of intelligent systems, since it
makes it possible for robots to carry out tasks without the need for
explicit programming [7]. The main elements involved in the problem
are the following:

Agent: The subject that carries out the learning tasks and makes
decisions to reach a goal.
Environment: Everything that is not controlled by the agent; it
generates states and receives actions. In other words, the agent and
the environment interact in a sequence of steps.
State (s): The perception that the agent has of the environment.
Action (a): The behavior that the agent uses to modify its
surroundings or environment.
Policy: It defines how the agent should behave from a given state or
from a (state, action) combination. A policy can be represented in a
table or in another type of structure. Normally a policy is defined
implicitly in the table associated with a function. The determination
of a policy is the core of the approach, since it defines the behavior
of the agent.
Reward Function: It defines the goal of the agent. It determines how
desirable a state or a (state, action) pair is; in a sense, it
determines which events are good or bad for the agent in achieving its
goal. The agent's goal is to maximize the total amount of reward (r)
received during the experiment. The reward function defines what is
good for the agent in the immediate sense.
Value Function (state) V(s): It is the total amount of reward (r) that
the agent expects to accumulate, starting from state (s) at a given
time (i). This specifies
what is good for the agent in the long run.
• V(s_i) = E[ Σ r_i | s_i ]
Value Function (state-action) Q(s, a): It is the expected value of the
rewards (r), starting from state (s) and taking action (a) at a given
time (i). Estimating the values of these functions is the core
activity of this method.
• Q(s_i, a_i) = E[ Σ r_i | s_i, a_i ]

In the reinforcement learning problem, the agent makes decisions as a
function of a signal provided by the environment, called the state of
the environment. If each state of the environment summarizes all past
information, in such a way that information about previous states is
not required, the state signal is said to have the Markov property [4].

If the Markov property is not assumed, the task of maintaining the
information of all past states requires a large amount of available
memory. A reinforcement learning task that satisfies the Markov
property is called a Markov decision process.

The methods for solving the reinforcement learning problem are dynamic
programming, Monte Carlo methods and temporal-difference learning [3].

Dynamic programming is a collection of algorithms that can be used to
compute optimal policies given a model of the environment as a Markov
decision process. Its practical utility is limited, since it requires
a complete model of the environment. However, it is important to
master this technique, since it provides the operating basis of the
other techniques. All dynamic programming methods update the estimates
of the values of a state based on estimates of the values of successor
states. This property is known as bootstrapping.

Monte Carlo methods do not require complete knowledge of the
environment. They use online or simulated experience of the
interaction with the environment (sequences of states, actions and
rewards). The simulation requires only a partial description of the
environment. These methods estimate the value function and optimal
policies from experience in the form of episodes. They are of
practical use for tasks that can be described in terms of subtasks or
episodes. Bootstrapping is not used.

Temporal-difference learning is a combination of dynamic programming
ideas and Monte Carlo ideas. Like Monte Carlo methods, it can learn
directly from online experience without using a model of the
environment. Like dynamic programming, it updates the value of a state
on the basis of the estimated values of other states, without waiting
for the final outcome of the episode. The reinforcement learning
problem is divided into two subproblems: a prediction problem and a
control problem. The first tries to determine the value function V for
a given policy π. The second determines the optimal policy π* by
convergence to the maximum of the value function Q* [6]. The classic
temporal-difference algorithms that help to solve this problem are
SARSA, SARSA(λ) and Q-learning.

Classic Algorithms

Let:
Q(s, a) - The (state, action) value function
s - The present state at time t
a - The action for state s
s' - The following state at time t+1
a' - The action for state s'
α - Step-size parameter (learning rate)
r - Reward
γ - Discount rate, between 0 and 1

Algorithm SARSA

Initialize Q(s, a) arbitrarily
Repeat for each episode
   Initialize s
   Select a from s using the policy derived from Q
   Repeat for each step of episode
      Take action a, observe r and s'
      Select a' from s' using the policy derived from Q
      Q(s,a) ← Q(s,a) + α [ r + γ Q(s',a') - Q(s,a) ]
      s ← s'; a ← a'
   until s is terminal

Algorithm SARSA(λ)

Initialize Q(s, a) arbitrarily and e(s,a) = 0 for all s and a
Repeat for each episode
   Initialize s, a
   Repeat for each step of episode
      Take action a, observe r and s'
      Select a' from s' using the policy derived from Q
      δ ← r + γ Q(s',a') - Q(s,a)
      e(s,a) ← e(s,a) + 1
      For all s, a:
         Q(s,a) ← Q(s,a) + α δ e(s,a)
         e(s,a) ← γ λ e(s,a)
      s ← s'; a ← a'
   until s is terminal

Q-Learning Algorithm

Initialize Q(s, a) arbitrarily
Repeat for each episode
   Initialize s
   Repeat for each step of episode
      Choose a from s using the policy derived from Q
      Take action a, observe r and s'
      Q(s,a) ← Q(s,a) + α [ r + γ max_a' Q(s',a') - Q(s,a) ]
      s ← s'
   until s is terminal

Hybrid Agent

The hybridization technique used consists of modifying the way in
which rewards are calculated after each action. Starting from the
mechanism applied by SARSA, a quadratic combination is generated whose
parameters are essentially two constants (τ=200 and µ=75) that control
the contribution or weight of each of the factors. In the experimental
tests described, the resulting agent, called Q_SARSA (Quick_SARSA),
shows the potential of this approach to obtain improvements in the
performance of the classic algorithms:

Algorithm Q-SARSA

Initialize Q(s, a) arbitrarily
Repeat for each episode
   Initialize s
   Select a from s using the policy derived from Q
   Repeat for each step of episode
      Take action a, observe r and s'
      Select a' from s' using the policy derived from Q
      Q(s,a) ← Q(s,a) + α [ r + τ (γ Q(s',a') - Q(s,a))
                              + µ (γ Q(s',a') - Q(s,a))^2 ]
      s ← s'; a ← a'
   until s is terminal

Comparative Tests

In the tests presented, the hybrid algorithm Q_SARSA is compared
against the SARSA and SARSA(λ) algorithms. The environment in which
the agents operate is a maze simulator, in which the agent is
subjected to the following test: it is placed in the initial square of
a maze and from there it must reach the square designated as the goal.
The global objective of the agent is to learn how to go from the
initial square to the target square. After each attempt, the value
function Q(s, a) establishes a relation between each one of the
possible states of the environment and the action with the greatest
accumulated reward. The state-action relation of highest rewards that
is established at the end of each attempt is called the learned
policy. The criterion for deciding when the agent has learned to
perform the task is the following: the agent has learned how to go
from the initial square to the target square without any variation in
the learned policy during the last 200 attempts [8]. It is to be
expected that the greater the complexity of the task, the more this
number must be increased to ensure that the learned policy is indeed a
solution to the problem that the agent is trying to solve.

With this criterion the agents were tested on mazes of different
complexity (easy, average and difficult).
Mazes of easy complexity have a single solution path.
Mazes of average complexity have one or more solution paths.
Mazes of difficult complexity have dead ends and one or more solution
paths.

A first indicator of the effectiveness of the agents is the total
number of actions performed by the agent in the test until the learned
policy stabilized. The results of the test on mazes of easy complexity
with respect to this indicator are the following:

                 Total number of actions
SARSA                 29,673,089
Q_SARSA                   72,964
SARSA(λ)                   4,045

This table clearly shows that the performance of Q_SARSA is closer to
that of SARSA(λ) than to that of SARSA. The relevance of this result
is evident, since Q_SARSA is structurally of the SARSA type. The
performance of the algorithms in terms of how quickly they stabilize
the learned policy is observed by counting the number of attempts made
by each algorithm until stabilization.

                 Number of attempts
SARSA                    211,003
Q_SARSA                      204
SARSA(λ)                     246

Results for mazes of average complexity:

                 Total number of actions
SARSA                  4,430,244
Q_SARSA                   61,542
SARSA(λ)                   2,034

                 Number of attempts
SARSA                    283,699
Q_SARSA                      255
SARSA(λ)                     205

Results for mazes of difficult complexity:

                 Total number of actions
SARSA                 16,777,216
Q_SARSA                3,659,667
SARSA(λ)                  17,710

                 Number of attempts
SARSA                     11,796
Q_SARSA                   33,437
SARSA(λ)                     356

Conclusions

This study showed that the Q_SARSA algorithm surpasses the SARSA
algorithm and that the hybridization mechanism used is capable of
improving the performance of the agents. The type of hybridization
experimented with maintains low operating costs and the basic
structure of the agents.

Future Work
It would be interesting to try other modifications of the hybrid
agent, such as evaluating the results of the actions that the agent
took in the past. The objective of this function would be to filter
the actions in order to eliminate those that evidently do not
contribute to achieving the global goal of the agent.
It would also be interesting to implement the Q_SARSA learning
algorithm in a robot.

Acknowledgments

Special thanks for the accomplishment of this project are given to the
Council of the National System of Technological Education, the General
Directorate of Technological Institutes and the administration of the
Technological Institute of Madero City.

Project supported by COSNET key

[1]. Pendrith, Mark D. & Ryan, Malcolm R. K. 1997. C-Trace: A new
     algorithm for reinforcement learning of robotic control. The
     University of New South Wales; Sydney,
[2]. Sutton, Richard S. & Barto, Andrew G. Reinforcement Learning: An
     Introduction. MIT Press; Cambridge, Massachusetts, USA.
[3]. Kaelbling, Leslie P., Littman, Michael L. & Moore, Andrew W.
     1996. Reinforcement Learning: A Survey. Brown University, USA.
[4]. Mohan Rao, K. Vijay. 1997. Learning Algorithms for Markov
     Decision Processes. Department of Computer Science and
     Automation, Indian Institute of Science, Bangalore – 560 012.
[5]. Mahadevan, Sridhar, Khaleeli, Nikfar & Marchalleck, Nicholas.
     1998. Designing Agent Controllers using Discrete-Event Markov
     Models. Michigan State University, MI, USA.
[6]. Mahadevan, Sridhar. 1997. Machine Learning for Robots: A
     Comparison of Different Paradigms. University of South Florida,
     USA.
[7]. Brooks, Rodney A. 1991. Intelligence Without Reason. MIT Press
     A.I. Memo No. 1293. USA.
[8]. Martin, Mario. 1998. Reinforcement learning for embedded agents
     facing complex task. Doctoral thesis. Universidad Politécnica de
     Cataluña.
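
As a supplementary illustration of the update rules listed under
Classic Algorithms and Hybrid Agent, the following is a minimal
sketch of the SARSA, Q-learning and Q_SARSA updates for a tabular
Q(s, a). It is not the authors' implementation: the function names,
the dictionary-based table and the values of α and γ in the usage
lines are assumptions made for this example. Only τ=200 and µ=75 are
taken from the text.

```python
from collections import defaultdict


def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """SARSA: Q(s,a) <- Q(s,a) + alpha [r + gamma Q(s',a') - Q(s,a)]."""
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])


def q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma):
    """Q-learning: same form, but uses the best action value in s'."""
    best_next = max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])


def q_sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma,
                   tau=200.0, mu=75.0):
    """Q_SARSA: quadratic combination weighted by tau and mu.

    Q(s,a) <- Q(s,a) + alpha [r + tau*d + mu*d^2],
    with d = gamma Q(s',a') - Q(s,a).
    """
    d = gamma * Q[(s_next, a_next)] - Q[(s, a)]
    Q[(s, a)] += alpha * (r + tau * d + mu * d ** 2)


# Usage with hypothetical values (alpha and gamma are not given in the text).
Q = defaultdict(float)          # unseen (state, action) pairs start at 0.0
Q[(0, "right")] = 0.5
Q[(1, "right")] = 1.0
q_sarsa_update(Q, 0, "right", 1.0, 1, "right", alpha=0.1, gamma=0.9)
print(Q[(0, "right")])
```

With τ and µ this large, a single Q_SARSA step moves Q(s, a) much
further than a SARSA step under the same α, which is one way to read
the name Quick_SARSA.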

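The stabilization criterion described under Comparative Tests (no
variation in the learned policy during the last 200 attempts) could be
checked along the following lines. This is a sketch under stated
assumptions: the names greedy_policy and StabilityMonitor are
hypothetical, the learned policy is taken to be the greedy action for
each state, and an attempt only counts toward the window if the agent
reached the goal.

```python
def greedy_policy(Q, states, actions):
    """Map each state to its highest-valued action (first action wins ties)."""
    return {s: max(actions, key=lambda a: Q.get((s, a), 0.0)) for s in states}


class StabilityMonitor:
    """Counts consecutive attempts in which the learned policy did not change."""

    def __init__(self, window=200):
        self.window = window      # 200 attempts in the text
        self.previous = None      # policy seen after the previous attempt
        self.unchanged = 0        # consecutive attempts without variation

    def record(self, policy, reached_goal):
        """Register one attempt; return True once the policy is stable."""
        if reached_goal and policy == self.previous:
            self.unchanged += 1
        else:
            self.unchanged = 0
        self.previous = dict(policy)
        return self.unchanged >= self.window


# Usage with a hypothetical two-state table and a short window.
Q = {(0, "left"): 0.1, (0, "right"): 0.7, (1, "left"): 0.4, (1, "right"): 0.2}
monitor = StabilityMonitor(window=3)
policy = greedy_policy(Q, states=[0, 1], actions=["left", "right"])
for attempt in range(4):
    stable = monitor.record(policy, reached_goal=True)
print(policy, stable)
```

The number of attempts reported in the tables would then correspond to
the attempt count at which record first returns True.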
Reinforcement learning
Reinforcement learning Reinforcement learning
Reinforcement learning Chandra Meena
Reinforcement learning
Reinforcement  learningReinforcement  learning
Reinforcement learningSKS
A deep reinforcement learning strategy for autonomous robot flocking
A deep reinforcement learning strategy for autonomous robot flockingA deep reinforcement learning strategy for autonomous robot flocking
A deep reinforcement learning strategy for autonomous robot flockingIJECEIAES
Reinforcement learning-ebook-part1
Reinforcement learning-ebook-part1Reinforcement learning-ebook-part1
Reinforcement learning-ebook-part1Rajmeet Singh
Reinforcement Learning / E-Book / Part 1
Reinforcement Learning / E-Book / Part 1Reinforcement Learning / E-Book / Part 1
Reinforcement Learning / E-Book / Part 1Hitesh Mohapatra
Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017
Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017
Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017MLconf
Literature Survey
Literature SurveyLiterature Survey
Literature Surveybutest
final report (ppt)
final report (ppt)final report (ppt)
final report (ppt)butest

Similaire à Hibridization of Reinforcement Learning Agents (20)

An efficient use of temporal difference technique in Computer Game Learning
An efficient use of temporal difference technique in Computer Game LearningAn efficient use of temporal difference technique in Computer Game Learning
An efficient use of temporal difference technique in Computer Game Learning
What is Reinforcement Learning.pdf
What is Reinforcement Learning.pdfWhat is Reinforcement Learning.pdf
What is Reinforcement Learning.pdf
Reinforcement Learning.pdf
Reinforcement Learning.pdfReinforcement Learning.pdf
Reinforcement Learning.pdf
IRJET- A Review on Deep Reinforcement Learning Induced Autonomous Driving Fra...
IRJET- A Review on Deep Reinforcement Learning Induced Autonomous Driving Fra...IRJET- A Review on Deep Reinforcement Learning Induced Autonomous Driving Fra...
IRJET- A Review on Deep Reinforcement Learning Induced Autonomous Driving Fra...
CSE333 project initial spec: Learning agents
CSE333 project initial spec: Learning agentsCSE333 project initial spec: Learning agents
CSE333 project initial spec: Learning agents
CSE333 project initial spec: Learning agents
CSE333 project initial spec: Learning agentsCSE333 project initial spec: Learning agents
CSE333 project initial spec: Learning agents
Week 3.pdf
Week 3.pdfWeek 3.pdf
Week 3.pdf
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learning
Reinforcement learning
Reinforcement learning Reinforcement learning
Reinforcement learning
Reinforcement learning
Reinforcement  learningReinforcement  learning
Reinforcement learning
A deep reinforcement learning strategy for autonomous robot flocking
A deep reinforcement learning strategy for autonomous robot flockingA deep reinforcement learning strategy for autonomous robot flocking
A deep reinforcement learning strategy for autonomous robot flocking
Reinforcement learning-ebook-part1
Reinforcement learning-ebook-part1Reinforcement learning-ebook-part1
Reinforcement learning-ebook-part1
Reinforcement Learning / E-Book / Part 1
Reinforcement Learning / E-Book / Part 1Reinforcement Learning / E-Book / Part 1
Reinforcement Learning / E-Book / Part 1
Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017
Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017
Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017
Literature Survey
Literature SurveyLiterature Survey
Literature Survey
final report (ppt)
final report (ppt)final report (ppt)
final report (ppt)

Plus de butest

1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同butest
Timeline: The Life of Michael Jackson
Timeline: The Life of Michael JacksonTimeline: The Life of Michael Jackson
Timeline: The Life of Michael Jacksonbutest
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...butest
Com 380, Summer II
Com 380, Summer IICom 380, Summer II
Com 380, Summer IIbutest
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet JazzThe MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazzbutest
Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1butest
Facebook Facebook
Facebook butest
Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...butest
Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...butest
C-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docC-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docbutest
MAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docMAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docbutest
Mac OS X Guide.doc
Mac OS X Guide.docMac OS X Guide.doc
Mac OS X Guide.docbutest

Plus de butest (20)

1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同
Timeline: The Life of Michael Jackson
Timeline: The Life of Michael JacksonTimeline: The Life of Michael Jackson
Timeline: The Life of Michael Jackson
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Com 380, Summer II
Com 380, Summer IICom 380, Summer II
Com 380, Summer II
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet JazzThe MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1
Facebook Facebook
Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...
Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...
C-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docC-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.doc
MAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docMAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.doc
Mac OS X Guide.doc
Mac OS X Guide.docMac OS X Guide.doc
Mac OS X Guide.doc

Hybridization of Reinforcement Learning Agents

Héctor J. Fraire Huacuja
Juan J. González Barbosa
Jesús V. Flores Morfin

Department of Systems and Computation
Technological Institute of Madero City
Avenue 1º de Mayo y Sor Juana Inés de la Cruz S/N Colony Los Mangos
(12) 10-04-15 Commutator
(12) 10-29-02 Computer Center
Technological Institute of the Lagoon

Summary

This work shows that classic reinforcement learning agents [1] can achieve significant performance improvements through hybridization techniques. The methodology consists of defining a mechanism to compare the learning capacity of the agents, testing the classic agents under similar conditions of interaction with their environment and, finally, modifying the worst-performing agent in order to improve its performance. After analyzing the classic reinforcement learning agents in a simulated maze environment, the comparative tests placed the SARSA agent as the worst performer and the SARSA(λ) agent as the best. The most prominent result is that applying hybridization techniques to the agents considered allows the construction of hybrid agents with notable improvements in performance.

Introduction

Reinforcement learning originated with the beginnings of cybernetics and involves elements of statistics, psychology, neuroscience and computer science [2]. In the last ten years, interest in these techniques in the fields of artificial intelligence and machine learning has increased greatly [8]. Reinforcement learning is a form of agent programming, based on rewards, whose purpose is to make the agent perform a task without a precise specification of what to do. This approach modifies the prevailing programming paradigms and opens a wide field of opportunities.
Reinforcement learning is a technique that allows the construction of intelligent agents with the capacity to learn and adapt. An agent of this class learns from its interaction with the environment in which it operates and adapts to changes in its surroundings. The agent's perception of the environment through its sensors becomes a state vector. Each possible state is associated with an elementary action, which the agent can carry out through its actuators. The state-action relation is the core of this approach, since updating this function lets the agent keep a record of its acquired experience. Using this information, the agent determines at each step the action that has contributed most to achieving the global objectives. The action is selected this way most of the time. In a small proportion of the steps, the action is chosen at random among all possible actions. This mechanism, called exploration, lets the agent try actions that proved ineffective in the past, and so adapt its behavior to changes in the environment. Once the selected action is carried out, a reward or a punishment for the action is generated. The accumulated reward is stored in the state-action relation.

This type of agent was selected to evaluate its potential application in building controllers for autonomous mobile robots, for two reasons. First, learning takes place online. This condition is fundamental for robot applications in which animal behavior is emulated. Second, reinforcement learning is extremely economical in the computation and storage resources required to implement it.

Reinforcement Learning

Learning has been a key factor in intelligent systems, since it makes it possible for robots to carry out tasks without explicit programming [7]. The main elements of the problem are the following:

Agent: The subject that performs the learning and makes decisions to reach a goal.
Environment: Everything not controlled by the agent; it generates states and receives actions. In other words, the agent and the environment interact in a sequence of steps.
State (s): The agent's perception of the environment.
Action (a): The behavior the agent uses to modify its environment.
Policy: Defines how the agent should behave from a given state or (state, action) combination. A policy can be represented as a table or as another type of structure. Normally a policy is defined implicitly in the table associated with a function. Determining a policy is the core of the approach, since it defines the agent's behavior.
Reward function: Defines the agent's goal. It determines how desirable a state or a (state, action) pair is, that is, which events are good or bad for the agent in achieving its goal. The agent's goal is to maximize the total amount of reward (r) received during the experiment. The reward function defines what is good for the agent in the immediate sense.
Value function (state) V(s): The total amount of reward (r) that the agent expects to accumulate starting from state (s) at a given time (i). It specifies what is good for the agent in the long run.

    $V(s_i) = E\left[\sum r_i \mid s_i\right]$

Value function (state-action) Q(s, a): The expected value of the rewards (r), starting from state (s) and action (a) at a given time (i). Estimating the values of these functions is the core activity of this method.

    $Q(s_i, a_i) = E\left[\sum r_i \mid s_i, a_i\right]$

In the reinforcement learning problem, the agent makes decisions as a function of a signal provided by the environment, called the state of the environment. If each state of the environment summarizes all past information, so that information about previous states is not required, the state signal is said to have the Markov property [4].

If the Markov property is not assumed, maintaining the information of all past states requires a large amount of available memory. A reinforcement learning task that satisfies the Markov property is called a Markov decision process.

The methods for solving the reinforcement learning problem are dynamic programming, Monte Carlo methods and temporal-difference learning [3].

Dynamic programming is a collection of algorithms that can be used to compute optimal policies given a model of the environment as a Markov decision process. Its practical utility is limited, since it requires a complete model of the environment. However, it is important to master this technique, since it provides the basis for the other techniques. All dynamic programming methods update the estimates of the values of a state based on estimates of the values of successor states. This property is known as bootstrapping.

Monte Carlo methods do not require complete knowledge of the environment. They use online or simulated experience of the interaction with the environment (sequences of states, actions and rewards). Simulation requires only a partial description of the environment. These methods estimate the value function and optimal policies from experience in the form of episodes. They are of practical use for tasks that can be described in terms of subtasks or episodes. Bootstrapping is not used.

Temporal-difference learning is a combination of dynamic programming ideas and Monte Carlo ideas. Like Monte Carlo methods, it can learn directly from online experience without using a model of the environment. Like dynamic programming, it updates the value of a state on the basis of the estimated values of other states, without waiting for the episode to finish.

The reinforcement learning problem is divided into two subproblems: a prediction problem and a control problem. The first determines the value function $V^{\pi}$ for a given policy $\pi$. The second determines the optimal policy $\pi^*$, by convergence to the maximum of the value function $Q^*$ [6]. The classic temporal-difference algorithms that help solve this problem are SARSA, SARSA(λ) and Q-learning.

Classic Algorithms

Let:
Q(s, a): the value function (state, action)
s: the present state at time t
a: the action for state s
s': the following state at time t+1
a': the action for state s'
α: the step-size (learning rate) parameter applied at each step
r: the reward
γ: the discount rate, between 0 and 1

Algorithm SARSA

    Initialize Q(s, a) arbitrarily
    Repeat for each episode:
        Initialize s
        Select a from s using the policy derived from Q
        Repeat for each step of the episode:
            Take action a, observe r and s'
            Select a' from s' using the policy derived from Q
            Q(s, a) ← Q(s, a) + α [ r + γ Q(s', a') − Q(s, a) ]
            s ← s'; a ← a'
        until s is terminal

Algorithm SARSA(λ)

    Initialize Q(s, a) arbitrarily and e(s, a) = 0 for all s and a
    Repeat for each episode:
        Initialize s, a
        Repeat for each step of the episode:
            Take action a, observe r and s'
            Select a' from s' using the policy derived from Q
            δ ← r + γ Q(s', a') − Q(s, a)
            e(s, a) ← e(s, a) + 1
            For all s, a:
                Q(s, a) ← Q(s, a) + α δ e(s, a)
                e(s, a) ← γ λ e(s, a)
            s ← s'; a ← a'
        until s is terminal

Q-Learning Algorithm

    Initialize Q(s, a) arbitrarily
    Repeat for each episode:
        Initialize s
        Repeat for each step of the episode:
            Choose a from s using the policy derived from Q
            Take action a, observe r and s'
            Q(s, a) ← Q(s, a) + α [ r + γ max_a' Q(s', a') − Q(s, a) ]
            s ← s'
        until s is terminal

Hybrid Agent

The hybridization technique consists of modifying the way rewards are calculated after each action. Starting from the mechanism applied by SARSA, a quadratic combination is generated whose parameters are two constants (τ = 200 and µ = 75) that control the contribution, or weight, of each factor. In the experimental tests described in the next section, the resulting agent, called Q_SARSA (Quick_SARSA), shows the potential of this approach to improve the performance of the classic algorithms:

Algorithm Q_SARSA

    Initialize Q(s, a) arbitrarily
    Repeat for each episode:
        Initialize s
        Select a from s using the policy derived from Q
        Repeat for each step of the episode:
            Take action a, observe r and s'
            Select a' from s' using the policy derived from Q
            Q(s, a) ← Q(s, a) + α [ r + τ (γ Q(s', a') − Q(s, a)) + µ (γ Q(s', a') − Q(s, a))² ]
            s ← s'; a ← a'
        until s is terminal

Comparative Tests

The tests compare the hybrid algorithm Q_SARSA against the SARSA and SARSA(λ) algorithms. The environment in which the agents operate is a maze simulator, in which the agent is subjected to the following test: it is placed in the initial square of a maze and from there must reach the square designated as the goal. The agent's global objective is to learn how to go from the initial square to the target square. After each attempt, the value function Q(s, a) establishes a relation between each possible state of the environment and the action with the greatest accumulated reward. The state-action relation of highest reward established at the end of each attempt is called the learned policy. The criterion for deciding when the agent has learned the task is the following: the agent reaches the target square from the initial square with no variation in the learned policy during the last 200 attempts [8]. It can be expected that, for more complex tasks, this number must be increased to ensure that the learned policy is indeed a solution to the problem the agent is trying to solve.

Using this criterion, the agents were tested on mazes of different complexity (easy, medium and difficult).
Easy mazes have a single solution path.
Medium mazes have one or more solution paths.
Difficult mazes have dead ends and one or more solution paths.

A first indicator of the agents' effectiveness is the total number of actions performed by the agent in the test until the learned policy stabilized. The results for easy mazes with respect to this indicator are the following:

    Algorithm    Total actions
    SARSA        29,673,089
    Q_SARSA      72,964
    SARSA(λ)     4,045

This table clearly shows that the performance of Q_SARSA is closer to that of SARSA(λ) than to that of SARSA. This result is significant because Q_SARSA is structurally of the SARSA type. The speed with which the algorithms stabilize the learned policy is measured by counting the number of attempts each algorithm made until stabilization:

    Algorithm    Attempts
    SARSA        211,003
    Q_SARSA      204
    SARSA(λ)     246

Results for mazes of medium complexity:

    Algorithm    Total actions    Attempts
    SARSA        4,430,244        283,699
    Q_SARSA      61,542           255
    SARSA(λ)     2,034            205

Results for mazes of difficult complexity:

    Algorithm    Total actions    Attempts
    SARSA        16,777,216       11,796
    Q_SARSA      3,659,667        33,437
    SARSA(λ)     17,710           356

Conclusions

This study showed that the Q_SARSA algorithm outperforms the SARSA algorithm and that the hybridization mechanism used is capable of improving the performance of the agents. The type of hybridization tested keeps operating costs low and preserves the basic structure of the agents.

Future Work
It would be interesting to try other modifications of the hybrid agent, such as evaluating the results of the actions the agent took in the past. The purpose of this function would be to filter out actions that evidently do not contribute to achieving the agent's global goal.
It would also be interesting to implement the Q_SARSA learning algorithm in a robot.

Acknowledgments

Special thanks for making this project possible go to the Council of the National System of Technological Education, the General Directorate of Technological Institutes, and the Directorate of the Technological Institute of Madero City. Project supported by COSNET, key 700.99-P.

References

[1] Pendrith, Mark D., and Ryan, Malcolm R. K. 1997. C-Trace: A new algorithm for reinforcement learning of robotic control. The University of New South Wales, Sydney, Australia.
[2] Sutton, Richard S., and Barto, Andrew G. 1998. Reinforcement Learning: An Introduction. MIT Press, Cambridge, Massachusetts, USA.
[3] Kaelbling, Leslie P., Littman, Michael L., and Moore, Andrew W. 1996. Reinforcement Learning: A Survey. Brown University, USA.
[4] Mohan Rao, K. Vijay. 1997. Learning Algorithms for Markov Decision Processes. Department of Computer Science and Automation, Indian Institute of Science, Bangalore 560 012.
[5] Mahadevan, Sridhar, Khaleeli, Nikfar, and Marchalleck, Nicholas. 1998. Designing Agent Controllers using Discrete-Event Markov Models. Michigan State University, MI, USA.
[6] Mahadevan, Sridhar. 1997. Machine Learning for Robots: A Comparison of Different Paradigms. University of South Florida, USA.
[7] Brooks, Rodney A. 1991. Intelligence Without Reason. MIT Press A.I. Memo No. 1293. USA.
[8] Martin, Mario. 1998. Reinforcement learning for embedded agents facing complex task. Doctoral thesis. Universidad Politecnica de Cataluña.
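
Appendix: sketch of the update rules

The single-step updates in the SARSA, Q-Learning and Q_SARSA listings can be sketched in Python as follows. This is a minimal illustration and not the authors' implementation: the dictionary-based Q table, the function names and the helper greedy_policy are assumptions made here. Only the update formulas and the constants τ = 200 and µ = 75 come from the text. The paper does not state the values of α and γ or the rewards used in the maze simulator, so any values passed to these functions are placeholders.

```python
from collections import defaultdict


def sarsa_update(Q, s, a, r, s2, a2, alpha, gamma):
    """SARSA: move Q(s, a) toward r + gamma * Q(s', a')."""
    Q[(s, a)] += alpha * (r + gamma * Q[(s2, a2)] - Q[(s, a)])


def q_learning_update(Q, s, a, r, s2, actions, alpha, gamma):
    """Q-learning: use the best action value in s' instead of Q(s', a')."""
    best_next = max(Q[(s2, b)] for b in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])


def q_sarsa_update(Q, s, a, r, s2, a2, alpha, gamma, tau=200.0, mu=75.0):
    """Q_SARSA as written in the listing: a linear term weighted by tau
    plus a quadratic term weighted by mu, both built from
    gamma * Q(s', a') - Q(s, a)."""
    diff = gamma * Q[(s2, a2)] - Q[(s, a)]
    Q[(s, a)] += alpha * (r + tau * diff + mu * diff ** 2)


def greedy_policy(Q, states, actions):
    """Learned policy (hypothetical helper): the highest-valued action
    for each state."""
    return {s: max(actions, key=lambda b: Q[(s, b)]) for s in states}


def new_q_table():
    """Q table with every (state, action) value defaulting to 0.0."""
    return defaultdict(float)
```

A stabilization check in the spirit of the criterion described under Comparative Tests could call greedy_policy after every attempt and stop once the returned dictionary has not changed for 200 consecutive attempts.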